Storage Developer Conference - #160: SPDK Schedulers

Episode Date: January 11, 2022

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast episode 160.
Starting point is 00:00:41 Hello, everyone. My name is Tomasz Zawadzki. I'm a cloud software engineer at Intel, and I'm in the role of SPDK core maintainer on the project. My name is Jim Harris. I'm a principal software engineer, also in Intel's data platforms group, and I'm also an SPDK core maintainer. So Tomek and I would like to introduce our topic today, SPDK schedulers: saving power in polled mode applications. First we'll start off with the usual notices and disclaimers; we'll let you read those offline later if you're so interested, but let's go ahead and move forward to the good stuff. So first, an SPDK overview. I'm sure a lot of you are already familiar with SPDK. If not, there are some other talks that you can go back and look at, SDC talks and other conferences. But just to give a quick overview before we dive into our topic, SPDK is a framework for building high-performance storage applications.
Starting point is 00:01:53 And you see a lot of blocks there in that diagram. They really represent sets of drivers and libraries. But also, SPDK includes fully functional storage target applications, and that's actually the basis for what we're going to talk about today: how to get power savings from those applications. SPDK uses a user space polled mode programming model. One of the key parts of SPDK is that it's not just an open source project, but an open source community. If you go out to our website, spdk.io, you can find a bunch of different channels to reach out. If you've got more questions on the talk we have today or just SPDK in general, please feel free to reach out there. Let's go ahead and talk a little bit about the SPDK threading model.
Starting point is 00:02:47 I talked just briefly there about how we use a polled mode programming model in SPDK. So I want to talk a little bit about what that threading model looks like, because it really lays the basis for the scheduler work that Tomek is going to talk about later in the talk. When you look at an SPDK application, you set it up to run on a certain number of cores. In the example we're showing here, we're running the SPDK application on four cores, and we run what we call an SPDK reactor on each of these CPU cores. This reactor is effectively a pinned POSIX thread; it's the only POSIX thread that we run on each of these cores. These reactors are created by the SPDK application framework.
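As a rough illustration of the reactor setup just described, below is a minimal sketch of bringing up the SPDK application framework on four cores using the public spdk/event.h application API. The application name and core mask are arbitrary, and the two-argument spdk_app_opts_init() form assumes a reasonably recent SPDK release.

    #include "spdk/event.h"
    #include "spdk/log.h"

    /* Called on the main reactor once the reactors are up and running. */
    static void
    app_started(void *ctx)
    {
        SPDK_NOTICELOG("SPDK reactors are running\n");
        /* A real target would initialize its subsystems here;
         * this sketch just shuts the application back down. */
        spdk_app_stop(0);
    }

    int
    main(int argc, char **argv)
    {
        struct spdk_app_opts opts = {};
        int rc;

        spdk_app_opts_init(&opts, sizeof(opts));
        opts.name = "reactor_example";   /* illustrative name */
        opts.reactor_mask = "0xF";       /* one pinned reactor on cores 0-3 */

        /* spdk_app_start() spawns one reactor (a pinned POSIX thread) per
         * core in the mask and blocks until spdk_app_stop() is called. */
        rc = spdk_app_start(&opts, app_started, NULL);

        spdk_app_fini();
        return rc;
    }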
Starting point is 00:03:46 So a little bit about that application framework: the libraries in SPDK are all designed to be integrated into existing applications and existing frameworks, but SPDK does have its own application framework to bring up these threads and get them running on cores. And that's what we're going to primarily focus on today, though everything we talk about with SPDK threads is independent and can be integrated into other application frameworks as well. These SPDK threads are a lightweight threading abstraction; there is no tie between an SPDK thread and a POSIX thread. By default in SPDK, we typically have one SPDK thread per CPU core. These are created by the top-level block storage protocol library. So, for example, when the NVMF target library is initialized, it will by default create one SPDK thread that runs on each CPU core. The other block storage protocols then may create their own SPDK threads
Starting point is 00:04:51 to run on those CPU cores. And then at the top layer, the application framework, whether that's SPDK's framework or some other application framework, is responsible for calling spdk_thread_poll on each of those SPDK threads. That's how those threads get an opportunity to actually do their work. So next, let's look at what spdk_thread_poll does. Each of these SPDK threads will execute code from a number of different libraries. It could be the NVMF target layer. The NVMF target layer may call down into the block device layer. The block device layer may call into SPDK block device modules like the NVMe driver, or maybe Ceph RBD or libiscsi.
Starting point is 00:05:43 And these libraries will all register SPDK pollers, and these SPDK pollers become associated with the thread that they were registered on. These pollers are all responsible for polling on something. It could be an NVMe queue pair. It could be an epoll file descriptor; for example, in SPDK we make heavy use of epoll to poll groups of TCP sockets or Ceph RBD event FDs. It could be polling on an RDMA completion queue. The key part here is that every time the framework calls spdk_thread_poll, it's going to poll each one of those SPDK pollers. So each SPDK poller is going to get a chance to run, to go check for and do some work, every time the thread is polled.
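To make the poller mechanism concrete, here is a minimal sketch using the public spdk/thread.h API. The context structure, the poller function, and the idea of a placeholder queue are illustrative assumptions; spdk_poller_register() and the SPDK_POLLER_BUSY/SPDK_POLLER_IDLE return values are the actual interface.

    #include "spdk/thread.h"

    struct my_poll_ctx {
        /* e.g., an NVMe queue pair, an epoll fd, or an RDMA completion queue */
        int placeholder_queue;
        struct spdk_poller *poller;
    };

    /* Called every time the owning SPDK thread is polled. */
    static int
    my_queue_poll(void *arg)
    {
        struct my_poll_ctx *ctx = arg;
        int completions = 0;

        /* ... check ctx->placeholder_queue for work and process it ... */

        /* Report whether any work was found; this is what feeds the
         * busy vs. idle accounting discussed later in the talk. */
        return completions > 0 ? SPDK_POLLER_BUSY : SPDK_POLLER_IDLE;
    }

    static void
    my_register(struct my_poll_ctx *ctx)
    {
        /* A period of 0 means "poll on every spdk_thread_poll() iteration". */
        ctx->poller = spdk_poller_register(my_queue_poll, ctx, 0);
    }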
Starting point is 00:06:37 There is a small exception here. We do have the concept of timed pollers in SPDK, where you can register a poller to run at a specific time interval. So, for example, you want this to run every one millisecond. Of course, those pollers are only going to run once that millisecond expires; they're not going to execute every time the thread is polled. The other thing these threads have is a message queue. SPDK is a shared-nothing architecture, so instead of using locks, if threads need to communicate, they can send messages to one another. And that's another thing spdk_thread_poll does: when a thread is polled, it will execute any messages that have been sent to that SPDK thread.
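The timed pollers and the message queue described above come from the same spdk/thread.h API. A small sketch, with the callback names and contexts being placeholders:

    #include "spdk/thread.h"

    /* A timed poller: runs roughly once per millisecond rather than on
     * every spdk_thread_poll() iteration. */
    static int
    my_timer_poll(void *arg)
    {
        /* ... periodic housekeeping ... */
        return SPDK_POLLER_BUSY;
    }

    static struct spdk_poller *
    register_timer(void *ctx)
    {
        return spdk_poller_register(my_timer_poll, ctx, 1000 /* microseconds */);
    }

    /* Message passing instead of locks: run a function on another SPDK thread. */
    static void
    update_remote_state(void *ctx)
    {
        /* Executed on the target thread the next time it is polled,
         * so it can safely touch that thread's resources. */
    }

    static void
    notify_other_thread(struct spdk_thread *other, void *ctx)
    {
        spdk_thread_send_msg(other, update_remote_state, ctx);
    }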
Starting point is 00:07:30 Okay, so how do we save power when we're idle? Now we have these SPDK reactors running, with SPDK threads and their SPDK pollers running on each core. But when everything's idle, all these pollers are doing is using up CPU cycles. Polled mode programming gets us great performance and efficiency when the CPU cores are busy, but what we've been looking at is how to save CPU cycles when those cores aren't as busy, effectively when they're idle or running a very low IO workload. And there are a few potential ways you can do this.
Starting point is 00:08:15 There are some approaches that other frameworks like DPDK are looking at. We're going to talk about a couple of those and why they don't exactly fit what we want to do with SPDK, and then we're going to move into the solution, the SPDK scheduler solution that Tomek has really been focusing on. So the first one is interrupt mode, basically having SPDK threads which can effectively block on events. We actually do have some limited support for this in SPDK. It's restricted to a very small subset of the libraries, though. Today, that doesn't include the NVMe driver, and it does not include the NVMF target library.
Starting point is 00:09:03 It really only supports libraries that register file descriptors with the SPDK thread, because then we can actually block on something like an epoll file descriptor. The challenge is that extending this to the rest of SPDK becomes really complex. You end up having multiple layers of nested FD groups. The other challenge is that every single library or module in SPDK then has to support this, because you can't have one library that requires polling something without having anything to block on. So that becomes a challenge. Maybe at some point down the road SPDK will support that, but that's not the direction the project is taking right now. So next, on x86, there are some newer instructions in some of the upcoming CPUs for doing MONITOR and MWAIT unprivileged.
Starting point is 00:10:07 So being able to do these from user space. For those familiar with how these work in the kernel, you can basically set up a monitor address and then execute an MWAIT instruction, and it will put that CPU thread, that CPU core, into a deeper power state. But once there's a write to that monitored address, it will wake it up. And once these instructions are available,
Starting point is 00:10:35 user space processes can take this same approach. This works really well when you have one thread that's polling on one hardware queue, and we see this a lot in packet processing applications, for example in DPDK, where you've got a relatively small number of hardware receive queues that you're receiving packets on, and you can actually do a monitor on the next descriptor in the buffer ring. But the challenge for SPDK is that we are typically polling many hardware queues from one thread, for example in the case of NVMe, where we may have a whole lot of NVMe SSDs in the system. And certainly when you're polling kernel TCP sockets, not only is it not a hardware queue, but you've got potentially tens or hundreds or even thousands of kernel TCP sockets that you want to observe events on, and
Starting point is 00:11:33 UMONITOR, with that single address range, doesn't really fit that use case. So the third option is: what happens if we could move these SPDK threads? In the example here, we show the SPDK threads from cores one and three being moved onto other cores, so cores one and three no longer have any SPDK threads running on them. We also know that those threads are going to continue to be polled on the cores that they've been moved to. Fortunately, this is very well supported by the SPDK threading model. When those threads allocate resources, for example when those pollers allocate an NVMe queue pair or create a file descriptor for a TCP socket, we only ever poll or touch those resources on that SPDK thread. So as long as we continue with that threading model, where we only touch those resources on that thread, it doesn't really matter what core we're doing it on. We're still going to be thread safe.
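The rule that resources are only ever touched on their owning SPDK thread is what makes moving threads between cores safe. A sketch of that pattern, using the NVMe driver calls spdk_nvme_ctrlr_alloc_io_qpair() and spdk_nvme_qpair_process_completions() with a made-up per-thread context:

    #include "spdk/nvme.h"
    #include "spdk/thread.h"

    struct per_thread_ctx {
        struct spdk_nvme_qpair *qpair;   /* allocated on this SPDK thread */
        struct spdk_poller *poller;
    };

    static int
    nvme_qpair_poll(void *arg)
    {
        struct per_thread_ctx *ctx = arg;
        int32_t num;

        /* The qpair is only ever touched from this poller, i.e. from the
         * SPDK thread it was created on. Which physical core that thread
         * happens to be running on does not matter for thread safety. */
        num = spdk_nvme_qpair_process_completions(ctx->qpair, 0);
        return num > 0 ? SPDK_POLLER_BUSY : SPDK_POLLER_IDLE;
    }

    static int
    per_thread_init(struct spdk_nvme_ctrlr *ctrlr, struct per_thread_ctx *ctx)
    {
        ctx->qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
        if (ctx->qpair == NULL) {
            return -1;
        }
        ctx->poller = spdk_poller_register(nvme_qpair_poll, ctx, 0);
        return 0;
    }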
Starting point is 00:12:47 And so with that being said, I'm going to turn it over to Tomek, and he's going to go into SPDK schedulers and how they can build on top of this. Thank you, Jim. So everyone, Jim described how the SPDK threading model, particularly the threading library, works together with the event framework to allow moving of the threads. The actual interaction happens with the scheduling framework in a very similar manner. I would like to describe how the scheduling phases occur during a single scheduling pass. One important part to remember is that whenever scheduling is in progress,
Starting point is 00:13:42 the reactors, the cores, still hold the SPDK threads; there is no stop in operation during scheduling. Scheduling is performed on an interval called the scheduling period, which is set by the user when they enable a particular scheduler. When the timer expires, the first phase, gather metrics, starts. First, it has to be mentioned that in a similar fashion to SPDK threads sending messages to each other, reactors send events, which are an abstraction at the event framework level. Exactly this mechanism is used to send events between different cores to execute the gather metrics function. Anytime an SPDK thread is polled,
Starting point is 00:14:53 the result of the poll as well as the time of its execution is gathered. This makes it possible for gather metrics to collect that information on a per-thread as well as per-core basis. The most important data here is the number of cycles spent executing the SPDK thread, and it is put into two categories: cycles where the thread did work, the busy TSC, and cycles where it ultimately didn't, the idle TSC. This information is available both for the last scheduling period and as a total over the lifetime of a thread.
Starting point is 00:15:40 Similar cumulative data for each particular core is available as well. And there are a couple of other parts to the structure here: the SPDK thread's lcore assignment, as well as data on what the current core mode is, whether it was put to sleep or not. Once gather metrics has gone over all the reactors, the data becomes the input to the balance function, and it will also be the output of that function.
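The busy and idle TSC counters described here are exposed through the threading library. A caller running on an SPDK thread can read its own accumulated statistics roughly as follows; spdk_thread_get_stats() and the busy_tsc/idle_tsc fields are the public interface, and the logging is just for illustration.

    #include <inttypes.h>
    #include "spdk/thread.h"
    #include "spdk/log.h"

    /* Must be called from an SPDK thread. Reports the cycles the thread has
     * spent doing useful work (pollers returned busy) vs. finding nothing. */
    static void
    report_thread_load(void)
    {
        struct spdk_thread_stats stats;

        if (spdk_thread_get_stats(&stats) == 0) {
            SPDK_NOTICELOG("thread %s: busy_tsc=%" PRIu64 " idle_tsc=%" PRIu64 "\n",
                           spdk_thread_get_name(spdk_get_thread()),
                           stats.busy_tsc, stats.idle_tsc);
        }
    }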
Starting point is 00:16:30 The second phase of the scheduling is the decision point, where an action can be taken based on the previously gathered information. Schedulers can decide to change an SPDK thread's core assignment, can opt to put a core to sleep if no SPDK threads are present on that core, and can modify the core frequency using an SPDK governor; more on this on the next slide. The logic behind the balance function is implemented by pluggable SPDK schedulers. On the right is the interface that has to be implemented by said schedulers. Besides that, the SPDK scheduler register macro has to be called to make the event framework aware of the new scheduler. After that point, a user can switch between different implementations at runtime using the framework_set_scheduler RPC.
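As a sketch of what plugging in a scheduler looks like, here is a do-nothing scheduler that keeps every thread where it is. The struct members and the SPDK_SCHEDULER_REGISTER() macro follow the pluggable interface described in the talk, but the exact member names and header location vary by SPDK release, so treat this as approximate and check the scheduler header shipped with your version.

    #include "spdk/scheduler.h"

    static int
    noop_init(void)
    {
        return 0;
    }

    static void
    noop_deinit(void)
    {
    }

    static void
    noop_balance(struct spdk_scheduler_core_info *core_info, uint32_t core_count)
    {
        /* The decision point: a real scheduler would inspect the gathered
         * busy/idle metrics in core_info and reassign threads, put cores
         * to sleep, or call into a governor here. */
    }

    static struct spdk_scheduler noop_scheduler = {
        .name    = "noop",
        .init    = noop_init,
        .deinit  = noop_deinit,
        .balance = noop_balance,
    };

    SPDK_SCHEDULER_REGISTER(noop_scheduler);

Once compiled into the application, such a scheduler could then be selected at runtime with the framework_set_scheduler RPC mentioned above.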
Starting point is 00:17:28 A custom scheduling period can optionally be provided with that RPC as well. At this time, one such scheduler, called dynamic, has been implemented. Later in the presentation, I'll go over the goals of the dynamic scheduler as well as the results of a CPU utilization comparison against the static SPDK thread assignment. As I mentioned, during the balance call schedulers can optionally use what are called SPDK governors. This is another abstraction
Starting point is 00:18:09 that allows changing the frequency of a particular CPU core used by the application, thus reducing the power used by that core. Right now, SPDK provides a DPDK governor built on top of DPDK's rte_power library. Similar to the schedulers, it's possible to implement new kinds of SPDK governors, and once a particular governor is registered within the event framework, any scheduler can enable it during its initialization. That concludes the balance function implemented in the scheduler. Now the operation moves back to the event framework, where the output from the balance function is acted on. One action is changing the core mode, meaning putting a core to sleep or waking it up.
Starting point is 00:19:30 This action is performed on each core, depending on the output from the balance. Another is marking an SPDK thread for a move to a different reactor. This action is only performed on the main core, because once an SPDK thread is marked for the move, the reactor which currently has the thread assigned is responsible for first stopping the polling of that thread on that reactor, and then sending an event to the destination reactor to start polling it there.
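The event mechanism referenced here, reactors sending events to one another, is SPDK's public event API. A rough sketch of scheduling a function on another reactor's core, with the callback and arguments as placeholders:

    #include "spdk/event.h"

    static void
    run_on_destination(void *arg1, void *arg2)
    {
        /* Executes on the destination reactor's core. */
    }

    static void
    send_to_core(uint32_t dst_lcore, void *arg1, void *arg2)
    {
        struct spdk_event *event;

        /* Allocate an event bound to the destination core and queue it;
         * the destination reactor executes it on its next iteration. */
        event = spdk_event_allocate(dst_lcore, run_on_destination, arg1, arg2);
        spdk_event_call(event);
    }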
Starting point is 00:20:10 After that, the scheduling phases are complete, and the next one will start once the scheduling period expires. Before the addition of the scheduling framework, threads were assigned a particular core and remained there for the application's lifetime. That static thread assignment is now part of the static scheduler, which remains the default. But in addition to that, as of the latest SPDK releases, users can enable the dynamic scheduler via RPC. Besides adding the scheduler framework, the dynamic scheduler has been a focus for us to prove out the concept.
Starting point is 00:20:58 This particular scheduler's goal is to save power when possible, but not at the cost of performance. The dynamic scheduler will err on the side of caution when consolidating SPDK threads on a particular core. Once an SPDK thread becomes active, it will be put on a core that allows for its best utilization, which means either an unused core or the one that is least busy across the application. In a very active use case, it is possible that no difference would be observed between the dynamic scheduler and the static scheduler. Yet for the cases when a particular thread's utilization goes down, it is consolidated with other SPDK threads, and if it's totally idle, it is put directly on the main core. Any core that is left with no SPDK threads after such a move is put to sleep to save power.
Starting point is 00:22:08 Finally, the CPU frequency of the main core is reduced when its utilization is low. The main core is the last remaining core whenever all the others are put to sleep. With that said, let's now go over some of the performance and, importantly, CPU utilization data for select cases comparing the dynamic scheduler and static thread assignment. The SPDK project publishes quarterly performance reports for key components. The following results were gathered in a similar fashion
Starting point is 00:22:51 to the SPDK NVMF TCP report, specifically test case four, which measures NVMF performance with an increasing number of connections. In this case, on the NVMF TCP target, one SPDK thread is present on each core, and 30 CPU cores were assigned to the whole application. This is the number always used by the static scheduler and the maximum the dynamic scheduler can use. On the next slide, the CPU utilization shown is measured for the whole system,
Starting point is 00:23:34 which means it can exceed the value of 30 CPUs, in part due to, for example, kernel TCP stack processing. To that target, two SPDK NVMF TCP initiators are connected, each consisting of eight NVMF subsystems. The FIO SPDK bdev plugin is used to perform a 4K random read workload on each of those NVMe subsystems, where a couple of parameters are scaled. First, the queue depth is scaled; increasing it results in higher utilization for each SPDK thread. Meanwhile, increasing the numjobs parameter in FIO results in a higher number of TCP connections being created to handle the generated IO.
Starting point is 00:24:32 This means that more TCP connections per SPDK thread are present on the target side whenever that parameter is increased enough. This particular case shows the results when there are 16 NVMF subsystems connected with numjobs set to one, resulting in a total of 16 TCP connections handling the IO. This, of course, means there are 16 SPDK threads actively being used, due to the number of connections. Please note that the lines representing thousands of IOPS in both cases are on the same level. That means that no performance was lost switching to the dynamic scheduler in this particular case. Meanwhile, comparing the blue and orange bars shows the difference in CPU utilization between the cases. The orange is the static scheduler, always consuming at least the 30 CPU cores as configured. Meanwhile, the blue bars represent the dynamic scheduler, collapsing SPDK threads onto a minimal set of CPU cores and putting the rest to sleep, resulting in around 10 CPUs being utilized during the test case.
Starting point is 00:26:11 For a very high IO load, in the case of queue depth 128, each connection needs to be processed on a separate core, resulting in all of the 16 threads occupying a separate CPU core. The rest of the CPU cores are not used due to the low connection count, so they are put to sleep. That said, before going to the next case with an increased number of connections, let's discuss some of the underlying mechanism behind the NVMF SPDK pollers, qpairs, and TCP connections. As previously mentioned by Jim, SPDK pollers in NVMF TCP poll
Starting point is 00:26:55 an epoll file descriptor. That file descriptor corresponds to a group of multiple TCP sockets, and each TCP socket represents an NVMF qpair. Assigning the NVMF qpairs to NVMF poll groups is done in a round-robin fashion. There is no guarantee that qpairs will be distributed evenly across SPDK threads, and a single SPDK thread can end up with multiple
Starting point is 00:27:26 qpairs with varying amounts of activity. Another factor is qpairs connecting and then disconnecting, resulting in an uneven number of qpairs for each SPDK thread. Finally, the initiator side can spread the load across multiple qpairs, thus sometimes playing against the scheduler detecting the load on the target side. The poll group mechanism has worked great in the past for the static scheduler, yet with dynamic scheduling in use, it can diminish the benefits of such scheduling. The next case shows the results when there are 16 NVMF subsystems connected with numjobs set to 4, resulting in a total of 64 TCP connections handling the IO, meaning all of the 30 SPDK threads are actively being used. Again, the lines representing thousands of IOPS in both cases are on the same level. That means no performance was lost here either.
Starting point is 00:28:39 Meanwhile, comparing the blue and orange bars shows the difference in CPU utilization. Looking at the low IO load with queue depth 8, the SPDK threads were collapsed to around 22 cores, but going to queue depth 64 results in capping out the CPU utilization similar to the static scheduler. Across all those cases, some of the cycles are spent on calling epoll in each of the SPDK threads. A better, possibly dynamic, organization of NVMF poll groups across the SPDK threads would allow further decreasing the CPU utilization in many of those cases. So having said that, let's go over some of the key highlights
Starting point is 00:29:39 from this presentation, as well as next steps for the SPDK event and scheduler frameworks. Polled mode applications require special handling to save power and CPU cycles when idle. The SPDK event framework now allows moving idle SPDK threads in order to put cores to sleep, thus saving power. A pluggable SPDK scheduler framework is provided to define when and where SPDK threads should be moved. An SPDK dynamic scheduler was added; it consolidates SPDK threads onto a minimal set of cores
Starting point is 00:30:21 and puts the remaining cores to sleep. Even with the progress we've made so far, the work does not end here. The dynamic scheduler logic for balancing could use further improvements. For example, one of the requests we got was to provide tweakable values to the user to better suit each of the use cases being run. The cost of multiple poll groups on a single core needs to be addressed to further reduce the CPU utilization. And right now, the dynamic scheduler, using the DPDK governor,
Starting point is 00:31:08 only scales the frequency of the main core. This could be improved by changing the frequencies of other cores depending on their load too. Finally, the dynamic scheduler prioritizes core selection based on the lcore number, but the selection could take into account other factors such as NUMA node, hyperthreading, or high-frequency cores. Thank you everyone for attending this session and hearing about our learnings while implementing SPDK scheduling. Thank you. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
Starting point is 00:32:07 For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
