Storage Developer Conference - #141: Unlocking the New Performance and QoS Capabilities of the Software-Enabled Flash API

Episode Date: March 2, 2021

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 141. Hello, I'm Rory Bolt, one of the principal architects of Software-Enabled Flash, a newly announced technology from Kioxia. Today I will be covering what Software-Enabled Flash is, why we created it, and the concepts and technologies that it contains. I'll also show some demonstrations of Software-Enabled Flash running on a prototype FPGA, a few coding examples,
Starting point is 00:01:03 and I'll finish up by directing you to where you can find more information on software-enabled Flash. What is software-enabled Flash? Software-enabled Flash is a media-based, host-managed hardware approach. We'll talk more about this in the coming slides. We redefine host interactions with Flash to allow applications to maximize performance and provide functionality that would be difficult, if not impossible, to achieve with existing interfaces. Giving the host control over the media enables the host to define both the
Starting point is 00:01:37 behavior and the performance characteristics for Flash-based solutions. This can enable new storage functionality that maximizes performance and efficiencies. Or stated simply, software-enabled Flash gives you the ability to extract maximum value from your Flash resources. Coupled with this media-based hardware approach is a software-enabling API. This API is designed to make managing Flash media easy
Starting point is 00:02:06 while exposing all the capabilities of Flash. The highlights of this API are, first and foremost, it is an open-source Flash-native API. Next, it's designed to make it easy to create storage solutions with powerful capabilities. It abstracts the low-level details that are vendor and flash generation specific so that your code can take advantage of the latest flash technologies without modification. Being an open project, any flash vendor can build software-enabled flash devices
Starting point is 00:02:39 that are optimized for their own flash media. And finally, although just the API and documentation are published now by Kyoksha, we will be releasing an open source software development kit with reference source code for block drivers, Flash translation layers, utilities, and other coding examples. As previously introduced, software enabled Flash
Starting point is 00:03:04 consists of hardware and software components working together. I've had the unique opportunity to meet with the storage developers of most of the world's hyperscalers and talk about their storage needs. Taken with the input of other engineers at Kyoksha, this has allowed us to distill a list of basic requirements and features for hyperscale customers. I should mention that although hyperscalers face similar problems, their individual priorities and approaches vary significantly. On the requirement side, flash abstraction is not just about code reuse. There can be major economic and performance advantages to transitioning quickly to newer flash generations at scale. Scheduling is increasingly important to the hyperscalers, and it's going to be covered in depth in following slides. Access to all the media, or other words, avoiding the RAID tax.
Starting point is 00:03:59 Most of the hyperscalers already ensure system integrity at the system level and view RAID within the device as actually being a capacity tax that they pay for and don't really need. Host CPU offload. For hyperscalers, the host CPU can actually be a sellable resource, and so minimizing the impact to the host CPU is very important to them. Flexible DRAM configurations. Flexibility in the DRAM architecture relieves the device from handling worst-case scenarios, and we're going to be talking more about this too. On the functionality side, data placement to minimize write amplification is extremely important to almost all the hyperscalers. Also of interest is isolation. This can be isolation for security reasons or relief from noisy neighbors in
Starting point is 00:04:53 multi-tenant environments. Latency is extremely important to the hyperscalers, and we're going to cover this in depth in the following slides too. Buffer management is really tied to the flexible DRAM configurations I mentioned earlier and will be covered in the following slides. Finally, adaptability to new workloads. The hyperscale environment is dynamic, and it changes very quickly behind the scenes. The preference is for standard configurations that can be provisioned, configured, and deployed in real time. In response to those requirements and required features, the software-enabled Flash API with hardware structure, buffer management, programming of the media itself, and error management on the media.
Starting point is 00:05:51 Not listed separately, but touching many of these areas, is latency control. In order to maximize the performance of Flash storage, it is necessary to be aware of the geometry of the flash resources. The software-enabled flash API allows storage applications to query this geometry in a standardized manner. Some of the characteristics exposed are listed here. The API also allows control over how many flash blocks may be opened simultaneously and control management of the associated write buffers. When discussing the programming of Flash, it is important to note that the optimal programming algorithms vary from vendor to vendor,
Starting point is 00:06:31 and often between Flash generations of a given vendor. Software-enabled Flash handles all these details and lets the Flash vendor optimize for their own media. The API was created with consideration for other Flash vendors, as well as all the foreseeable needs of Kyokushio's own future Flash generations. Finally, with respect to error management, software-enabled Flash allows vendors
Starting point is 00:06:56 to optimize error reduction techniques specifically for their own media characteristics. And it also controls the time budget allowed for error recovery attempts, once again tying back to latency. This is a high-level block diagram of our software-enabled flash controller. This is one possible configuration. Other vendors are free to implement different architectures as long as they comply at the API level.
Starting point is 00:07:29 For example, although we use the toggle interface to connect to our flash chips, other vendors might want to implement the OnFee interconnect. Note that the design also uses standardized interfaces wherever possible and is focused mainly on the management of the flash media itself, programming, lifetime management, and defect management. An example of utilization of standardized interfaces is the use of the PCI interface to communicate with the host itself. The controller has advanced scheduling capabilities on a per die basis and hardware acceleration for garbage collection if needed or wear leveling tasks within the device. Another important call out is the optional use of on-device DRAM. Software-enabled flash can be configured without any on-device DRAM and can be used host memory resources instead. As shown in the block diagram, the use of DRAM on our software-enabled flash controller is optional. Why is that important? The answer lies in the fact that hyperscale customers often require thousands of simultaneously open flash blocks on each flash drive. The actual numbers vary from hyperscaler to hyperscaler,
Starting point is 00:09:00 but requirements we have heard have ranged between 4,000 open blocks per device to up to 16,000 open blocks per device. Since each open block requires a write buffer, this can create a demand for a lot of memory, potentially tens of gigabytes per device. It's often the case that the actual number of open flash blocks is unknown ahead of time and can vary significantly over time as a function of system load. So sizing the device DRAM to be able to handle the worst case loads creates stranded DRAM resources on the device under normal circumstances. Software-enabled Flash supports device-side DRAM configuration, host-only DRAM configurations, or hybrid configurations that allow drive DRAM to be sized for normal usage
Starting point is 00:09:59 and host DRAM to be used during periods exceeding the limits of the drive's DRAM resources. Note that in systems that use host-side DRAM resources, there are system requirements to protect against data loss in the event of unexpected power loss to the system. Many hyperscale environments already have this in place with either non-volatile memory resources in the hosts, system-level mirroring, or system-level erasure encoding. Now I'd like to introduce the software components of software-enabled Flash. Kyokusha has created and will release a software development kit. This will provide open-source reference block drivers,
Starting point is 00:10:41 open-source reference Flash translation layers, and open source device management utilities. Bundled with the SDK will be an open source API library as well as an open source device driver. This block diagram shows how the pieces of the SDK interface with each other. Two notes on the software layering. First, as you can see, it is possible for a user application to interface directly to the software-enabled Flash API library, bypassing file systems and traditional device drivers within the system. We've built a couple proof-of-concept applications, and we have them running today on our FPGA prototypes. These include a software-enabled Flash engine for FIO, as well as versions of RocksDB and Firecracker that are software-enabled Flash native. Although these applications are currently just proof-of-concepts, in the future, we plan to include open- source native applications as part of the SDK itself.
Starting point is 00:11:52 The second note is the SDS stack in the center of the diagram. Most hyperscalers today are already running their own software-defined storage stacks in their environments. These software-defined storage stacks can be modified to interface to the software-enabled Flash API library and are not dependent upon the SDK reference code for any purposes other than as an example of best practices for how to implement solutions using software-enabled Flash. And now for a system-level view of one possible deployment of software-enabled Flash. Note that the items outlined in red dashes are items that would likely be customized for a particular user or customer environment. So customers would likely modify the reference flash translation layer, possibly the reference block drivers, and certainly they would have their own software-enabled flash native applications.
Starting point is 00:12:59 Now let's start at the top and work down. Here we see unmodified applications using the POSIX API to talk to a file system. The file system would talk using a block device I.O. call, transitioning from the virtualized guest into the host system and would interface with the software-enabled flash QEMU block driver. The block driver makes use of the reference flash translation layer, and then uses the Ceph API to call into the software-enabled flash library. From there, we are now working on an IOU ring extension to allow us to bypass system calls and interface directly with the software-enabled flash driver, and then the transition through the
Starting point is 00:13:54 software-enabled flash command set from the kernel to the software-enabled flash unit itself, the actual hardware device. The software-enabled flash QEMU block driver is part of the SDK itself and it's useful because it allows unmodified applications to take advantage of many but not all of the features of software-enabled flash. Some of the features that can be used in this type of configuration are isolation, latency control, and die time allocation. Now I will introduce the concepts and features of software-enabled flash that will be necessary to understand the later examples. This diagram depicts one possible software-enabled flash unit.
Starting point is 00:14:46 There is the controller and then the die that make up the unit. In this particular example, we have 32 die that are arranged as four banks across eight channels attached to the software-enabled flash controller. The first concept I'd like to introduce is that of a virtual device. And a virtual device is a set of one or more flash die that provide hardware-level isolation. So an individual die can only ever be assigned to one virtual device at a time. DAI are never split or shared across virtual devices. The next concept is that of a quality of service or QoS domain. And this is a mechanism that we use to impose both capacity quotas as well as scheduling policy
Starting point is 00:15:44 and provide provisions for software-based isolation. Note that it is possible for multiple QoS domains to share a single virtual device. So in this example, two QoS domains are sharing one virtual device. This QoS domain is consuming the entire virtual device. And likewise, the amount of capacity that's allocated to a QoS domain can be variable and allocated over time. The final concept is that of a placement ID, and this is a mechanism that allows applications to group data at the superblock level within a QoLity of service domain.
Starting point is 00:16:42 This slide describes how the concepts introduced provide control over data isolation and placement. Superblocks start out in a free pool within the virtual device. As a QoS domain allocates a storage block, it is drawn from this free pool and assigned to the domain. The device is free to choose any free block in the pool so that it can track block wear and block health to assign the optimal block to a domain to maximize device endurance. Superblocks are never shared between domains. There is no mixing of data at the block level. When a superblock is released, it is returned to the free pool so that over the lifetime of the device, ownership of a superblock can transition between QoS domains if there are multiple QoS domains defined in a virtual device. So to summarize data isolation, the two main
Starting point is 00:17:41 mechanisms are listed here with their benefits and restrictions. Die-level isolation is hardware-based isolation. It is the most effective and the least scalable. The reason it's the least scalable is that it's limited by the number of physical die present. On most devices, there will be somewhere between 32 and possibly 128 DAI per device. So you would have a maximum of either 32 or 128 virtual devices possible. Isolation at the block level within QoS domains is the most scalable solution. This scales to thousands of tenants, but it only provides software-based isolation that's enforced by the scheduling capabilities of the software-enabled flash unit. Closely related to data isolation is data placement.
Starting point is 00:18:45 Here we're going to introduce the concept of a nameless write mechanism to control data placement. Why did we feel that a new write mechanism was necessary? There are system benefits that can be realized with control over data placement. However, if physical addressing is allowed for writes, the host becomes responsible for device wear. Flash memory is a consumable. Poor choices for physical data placement can wear out flash devices quickly. So how can the host have control over data placement without needing to take responsibility over ensuring device health? The answer is Nameless Write. When a new superblock is required for a placement ID,
Starting point is 00:19:30 or if a new superblock is manually allocated, the device chooses the optimal superblock to use. This is the framework for Nameless Write. Now let's see how a Nameless Write works. Nameless Write allows the device to choose where to physically write the data, but it allows the host to bound the possible choices for the device. As mentioned earlier, QoS domains are mapped to device nodes in the system, so that Nameless Write operations must supply a QoS domain handle as well as either a placement ID for auto-allocation mode
Starting point is 00:20:06 or a superblock flash address returned by a previous manual superblock allocate command. The QoS domain maps to a virtual device, which in turn specifies which dies can be used for the write. The placement ID or flash address specifies which superblock, owned by the domain, should be used for the write. If a placement ID is specified, a nameless write can span superblocks and additional superblocks will be allocated to the domain as needed. In manual allocation mode, nameless writes cannot span superblocks. The device is free to write data to any empty space within the bounds specified by the host. And when the write is complete, the actual physical flash address is returned, enabling direct physical reads with no address translation. Nameless write operation automatically handles
Starting point is 00:21:00 all media defects and mapping. Direct physical read optimizes performance and minimizes latency. Similar to the nameless write operation, software-enabled Flash has a nameless copy operation that can be used to move data within device without host processing. This is useful for implementing garbage collection, wear leveling, and other administrative tasks. The nameless copy function takes as input a source superblock, a destination
Starting point is 00:21:34 superblock, and a copy instruction. These copy instructions are powerful primitives supporting valid data bitmaps, lists of logical block addresses, or even filters to select copied data based upon logical block address. A nameless copy can operate on entire superblocks with a single command. This animation illustrates the difference in impact for the host for implementing a garbage collect using standard read and write commands versus the nameless copy command. And now for a more concrete example. This movie is a demonstration of nameless copy running on our FPGA prototype device. Both sides are running identical write workloads. In a short while, Garbage Collect will start and you can see the difference in impact to
Starting point is 00:22:28 the system. As Garbage Collect starts, you can see CPU utilization rising rapidly on the manual copy side. This demonstration has been sped up for the purposes of this recording, but at the end of the demonstration, the nameless copy has issued 24 commands versus over 800,000 commands for a manual copy. The nameless copy has issued or transferred 120 kilobytes of copy instructions versus over 3 gigabytes of data that had to be copied using the read and write permittives.
Starting point is 00:23:32 Another key feature of software enabled with Flash is the advanced queuing and scheduling features, which we will spend a lot of time on over the next several slides. Scheduling and queuing controls how the die time is spent. We can control read time versus write time versus copy time, as well as the prioritization of multiple workloads. Consider a multi-tenant environment. There may be business reasons to enforce fairness or to give certain tenants priorities, and these business needs may change over time. These tenants can share a device, and a weighted fair queuing can support the performance goals of the business. The host is allowed to prioritize and manage die time through the software-enabled Flash API, and the device will enforce the scheduling policy.
Starting point is 00:24:19 This is the basic architecture of the software-enabled Flash scheduler. First, I will go down the feature list. Each virtual device has eight input queues, and since a virtual device can map to as small an area as one die, this means that each individual die can potentially have eight FIFO input queues. The device scheduler automatically handles program suspend resume for both program and erase commands. The host can specify a specific queue for each flash access command on a per QoS domain basis. So one QoS domain might submit its reads to Q0
Starting point is 00:25:07 and its writes to Q1 and its copy commands down to Q7. And another QoS domain, for priority reasons, might want to have its reads go into Q2, its write operations or programmer operations go into queue one, and its copy operations go into queue seven. Every queue can specify die time weightings for each individual operation, read, erase, and program.
Starting point is 00:25:40 So each queue has its own weights for erase program and read operations. And finally, the host can provide overrides for both the default queue assignments and the default weightings for individual commands to dynamically adjust to changes in the environment. So now that I've gone over the features, let's talk a little bit about the functionality. When all of the weights are set to zero for all of the queues, it works as an eight-level priority-based scheduler, with Q0 being the highest priority queue and Q7 being the lowest priority queue. When all the programming erase and read weights are set to the same non-zero value, it works as a round-robin scheduler. And finally, when unique erase program and read weights are assigned on a per-queue basis, it works as a die-time weighted scheduler. You should also note that even though there are eight queues defined in the architecture, it is not necessary to use
Starting point is 00:26:46 all eight queues if your application does not need to. And here is a demonstration of software-enabled Flash scheduler running on our FPGA prototype. We're going to start the test and note that we have two domains. We have set the weight to be slightly higher for one domain than the other domain so that the two lines do not overlap each other. In a moment, we will alter the weight, and you can see that we have now reduced the weight of QoS Domain 2 to 150 for read operations, and we have increased the weight of QoS Domain 1 to 250. This graph is a graph of latency, And so we have lowered the latency for domain two and now we have just reversed it and made QoS domain one have a weight of 150 and QoS
Starting point is 00:27:57 domain two the weight of 250. And you can see that the priorities invert and now QoS domain 1 has the lower latency. We can take this to some extremes and watch in real time as the scheduler reacts to the waiting overrides and adjusts the latency response curves accordingly. This is the final demonstration of our FPGA prototype. In this demonstration, we defined three separate virtual devices and are running three different workloads on three different storage protocols. For this demonstration, we defined a virtual device that was running the ZNS protocol with a mix of read and write workload. For this graph, blue represents a heat map of read operations. Red represents a heat map of write operations. You can see the channels labeled across the front, channel 0 through channel 8, and four banks.
Starting point is 00:29:07 And so here we have a virtual device spanning channels 0, 1, and 2, and banks 2 and 3 that was doing a read-dominated workload running a custom hyperscale FTL from one of the hyperscalers. And finally, in the foreground, we had a third virtual device that was running a standard block mode driver with a write-dominated workload, hence the red bars. None of these workloads were impacting any of the others. They had full hardware isolation with no shared dies between them. It's important to note that we don't think this is a very realistic use case. We don't see people trying to run multiple storage protocols simultaneously on the same device. The purpose of showing this flexibility, though, is to illustrate a capability that is important to hyperscale customers. Hyperscale customers can deploy a single device at scale, and then they can provision and configure the device in real time to match the dynamic needs of their storage environment.
Starting point is 00:30:45 And as new storage protocols and new storage applications are created, they can quickly be implemented using the software-enabled flash primitives. And now for some actual examples from the upcoming SDK. SEF CLI is used to configure software-enabled flash devices. It's a command line tool. It's open source, and it's included in the software development kit. Any of the functions of the SEF CLI command could actually be incorporated into an application if needed. There's extensive built-in help, and this is the top-level help output.
Starting point is 00:31:39 In addition to allowing the configuration of the device, it also supports all of the API primitives, so you can actually read and write data using the sefcli command if you want. But probably the neatest feature is this, the shell command. Sefcli contains a built-in Python shell for interactive programming of the device. This is extremely useful for examining the device for diagnostic purposes during software development. And you can even write Python scripts and send them to the SEFCLI program to execute. Once again, a really helpful capability
Starting point is 00:32:17 for debugging your software. Now that I've introduced SEFCLI, let's go over a few examples of its use. Note that in all of these examples, many of the possible settings are not illustrated, and we're just using the default values. This is an example of creating a virtual device. When you create a virtual device, you can specify the default weights for the erase program and read operations, as well as the copy program, copy erase, and copy read operations. We haven't included that in this example for the sake of
Starting point is 00:33:00 brevity. So here we are invoking SCF CLI telling it to create a virtual device. Minus S0 is specifying that we want to operate on software enabled flash unit zero. We're now going to define the layout of the virtual device saying that it starts on channel zero and spans four channels. It starts on bank zero and spans four banks. We're going to assign this virtual device a unique identifier, and we're going to specify how many QoS domains we're going to allow to be created within this virtual device. Once you execute this command, you have created the virtual device and you can use the following command, sefcli list virtual,
Starting point is 00:33:51 to list out all the virtual devices, including the one you just created. The next example is that of creating a QoS domain. And when you're creating a QoS domain, you get to specify the queue assignments for each of the flash operations. Note, same invocation line, but now we're saying instead of create virtual device, we're creating a QoS domain. We're specifying it to operate on the SEF unit zero, the first unit present in the system. We're going to give that virtual device that we used in the last example as input here. We're saying create this QoS domain on virtual device zero.
Starting point is 00:34:39 And now we're going to assign an ID to the QoS domain. In this case, we're making QoS domain ID 2. We're going to put a capacity limit. And in this case, we are saying that this QoS domain is going to have a maximum of 3 million ADUs or atomic data units. We're next specifying the size of the atomic data unit as being 4k. And we're going to put in a couple interesting parameters here at the end. The number of root pointers for this QoS domain and the number of placement IDs for this QoS domain. The number of placement IDs defines how many parallel auto-allocating streams you can have within each QoS domain that will group data in the same superblock as specified by the application.
Starting point is 00:35:43 The number of root pointers specifies the number of metadata locations that you can store in the configuration of the QoS domain. And so if you're implementing a lookup table or a key value store or some other storage construct on top of software-enabled flash, you may want to store the metadata associated with a QoS domain. And this construct, a root pointer, allows you to store the metadata for a QoS domain within the actual QoS domain itself and then store the address at which you stored it in the root pointer, and it gives you a bootstrapping mechanism for reinitializing a QoS domain at startup time. And finally, just like with the virtual device example, after you've created this QoS domain,
Starting point is 00:36:40 you can use the list QoS domain command to see all the QoS domains that have been defined for that unit. And now an example of nameless write. This is sort of the smallest possible program to perform a write, and it contains three main functions. The first function is called get set Ceph handle. This fetches a handle to a particular Ceph unit in the system. So if you had multiple Ceph devices in the system, they would be enumerated zero through N. You specify in the index of the unit that you want to operate on, and it returns a handle to that SEF unit. Next, we're going to open a previously
Starting point is 00:37:40 created QoS domain. And so we're going to call SCF open QoS domain. We're going to pass in the handle so it knows which unit we're operating on. We're going to pass in the ID of the QoS domain we want to open. We're going to pass in a notification function pointer to receive asynchronous event notifications, as well as a context, which is a piece of user-defined data that will be passed back and forth with all calls, helpful for implementing multiple contexts in your environment. A key, which is the encryption key for the QoS domain. And finally, we're going to return a handle to the QoS domain that we just opened.
Starting point is 00:38:30 This is all preparation work for the actual nameless write command down here, which is called SCF write without physical address. It should be noted that you have to get a handle to the device and open the QoS domain once at the start of your application, and then you can issue as many write commands as you want with the open handle and QoS domain. So when you want to write to a QoS domain, you call write without physical address. You pass in the handle to the QoS domain. In this case, we're setting the mode to be auto allocate. We then have to supply a placement ID of which super block of the super blocks that can be kept separate by placement ID, we want to write this data into.
Starting point is 00:39:30 We will pass in the user address, which is just user-defined metadata. In the case of a block mode driver, this would be the LBA. The number of atomic data units we want to write, remember when we defined a QoS domain, we specified the size of the atomic data unit. In the previous example, it was 4k. This is saying how many 4k chunks we want to write into this QoS domain. We pass in the address of an IOV, which defines the memory buffers associated with the data that we want to write. This is the number of entries in the IO vector. Permanent address. Remember, the nameless write functionality returns the address at which the
Starting point is 00:40:14 data was written. So we specify we want the data to go into this QoS domain and be grouped with this placement ID, but the actual unit determines the physical address at which the data is going to be placed and returns that here in the permanent address. It also returns the distance to the end of the superblock that we're currently operating in after the write has completed. This is useful when you're not using auto-allocate mode so that you can know when it's time to allocate the next block. And finally, I'd like to call attention to the override structure. The overrides parameter is a pointer to a structure. It can be nil if you don't want to override anything, but if you wanted to override the default queue assignment or operation wait, you would do it by supplying an override to the write function. And once again, this is how we showed that previous example
Starting point is 00:41:12 of dynamically adjusting the latency response between two QoS domains. That was by altering the override of the defaults for those two QoS domains, that was by altering the overrides of the defaults for those two QoS domains. As I mentioned in the earlier example, when you open a QoS domain, you can supply a pointer to a notification function. This is an example of an asynchronous event handler. Once again, opening the QoS domain, we passed in a handle to the event handler. Once again, opening the QS domain we passed in a handle to the Notification handler. Here's the definition of the notification handler. And this is typically implemented as just a giant switch statement handling the different types of notifications that can come from the event. There are several different types of asynchronous events that the device can issue, address update notifications in the case of data
Starting point is 00:42:10 that's been moved on the device, block state change events for super blocks that have been closed or filled, as well as capacity-related events. This slide is illustrating a snippet of code from our Flash translation layer and an important concept of the software-enabled Flash API itself. This is the routine that's used to update the lookup table for our reference flash translation layer. And you will notice that when the FTL update command is called, it is called with the old flash address and the new flash address. And this is an important design decision. When we update or move data,
Starting point is 00:43:08 we supply both the old address that the data was located at, as well as the new address where the data now resides. This was done so that we can make updates to the lookup tables in the flash translation layer lockless. And so we can handle race conditions between incoming data overwrites from users and offloaded copy operations within the device by the use of atomic compare and exchange operations on the lookup table that allow us to handle the race conditions without having to introduce locks into the FTL. So a very important concept. This next example is a direct access read. Once again, this is sort of the minimal program where we're going through the operations of specifying which unit we want to operate on and getting the handle to the SCF unit.
Starting point is 00:44:12 We're opening the QoS domain. Once again, these two steps don't need to be handled more than once in an application. But once you have specified a unit and opened up a QS domain, you can then issue read with physical address and using a flash address parameter, which would be something that was returned in the previous example in the permanent address field. You can specify the starting address you want to read from, the number of atomic data units. Once again, in these examples, these are 4K blocks, if you will. And again, an IOV specifying the layout of the memory to put the return data in.
Starting point is 00:45:07 The IOV count says how many entries are in the IO vector itself. IO vector offset is a field that allows you to do multiple operations on a single IO vector at different offsets so that you can handle very complex memory layout schemes spread across multiple operations. Finally, you have the opportunity to pass in the user address. And once again, this is metadata associated with the user data itself. In the case of the block mode or the reference FTL, the user address is in fact the LBA.
Starting point is 00:45:56 And when we perform the read of the physical flash address, we will read the associated metadata and compare it to the expected LBA as a data integrity check that the data being read from the flash itself is the data that was expected. And finally, once again, we have a pointer to an override structure that would allow you to override either the default queue assignments or the operation weights. There's a little note here in this example. The error recovery mode for a QoS domain is set at the time of creation, and this determines
Starting point is 00:46:34 whether the software-enabled flash unit will perform automatic error correction and automatic error recovery on the QoS domain. In manual mode, there's a function called SEF set read deadline that determines essentially what the recovery time budget is so that you can specify how much time the SEF unit will spend trying to recover data when an error has occurred before aborting the operation with an error response. And this is important because, as mentioned earlier, many times in hyperscale environments, they've triple mirrored or they have other copies of the data. And it's faster often to go fetch the data from an alternate source than to try and do a heroic ECC recovery operation on the flash itself. Now, all of the previous examples, mainly just to make them more easily understood, have been synchronous examples.
Starting point is 00:47:42 But it should go without saying that all the data path operations for software-enabled Flash have asynchronous versions. This is an example of what read with physical address would look like in its asynchronous form. We essentially have an IO control block that has the bundle of parameters associated with the call, and then you just issue read with physical address async, passing in once again the handle, which QoS domain on which unit we're operating on, and a pointer to the IO control block itself. Well, we're coming to the end of the presentation. At this point, I want to do a little summary and tell you where you can go to learn more about software-enabled Flash. So, as a wrap-up or a summary, software-enabled Flash fundamentally changes the relationship between the host and solid state storage. It consists of purpose-built hardware and a very powerful open source Flash native API to interface to software.
Starting point is 00:48:52 It leverages industry standard protocols wherever possible, and it can be used as demonstrated as a building block to create different types of storage solutions. Once again, in our demonstrations, we have created standard block mode devices, zone namespace devices, as well as custom hyperscale FTL devices, all on top of the software-enabled flash primitives. And the most important note to me anyway, is we're combining full host control with ease of use, taking away the burden of the media management and the low level details of the Flash itself. So for more information on software enabled Flash, my recommendation is to go to our microsite,
Starting point is 00:49:45 www.softwareenabledflash.com. On the microsite, you can find our white paper and also either through the link below for github.com, Keoxia America, or through the microsite, you'll find a link to it. You can actually go to the GitHub repository, download the latest version of the API, as well as associated documentation. I encourage you to check back on the microsite as we have more of the software development kit available. It will be announced on the microsite, as well as we have some interesting demonstrations on latency control and garbage collect that will be hosted shortly on the microsite, too.
Starting point is 00:50:35 Thank you very much. Thanks for listening. about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at sneha.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
