Storage Developer Conference - #147: Platform Performance Analysis for I/O-intensive Applications
Episode Date: June 7, 2021
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 147.
Good morning or good afternoon. My name is Perry Taylor. I'm a platform monitoring engineer at Intel, and I focus largely on VTune's I/O analysis.
The topic that we want to present today is called Platform Performance Analysis for I/O-Intensive Applications.
High-performance storage applications running on Intel servers actively utilize the I/O capabilities and I/O-accelerating features of the platform by interfacing with NVMe devices. In the big picture, performance issues appearing in such I/O-intensive workloads can be categorized into three broad domains, depending on where the bottleneck is. The first domain is called I/O device bound. In this case, the performance is limited by device capabilities. This case is fairly trivial and can be addressed by checking the observed performance against the device datasheet. The second domain covers core-bound cases. This is when performance is limited because a core underutilizes the device due to algorithmic or microarchitectural inefficiencies. This case is well covered by various core-centric analyses, from hotspots to microarchitecture exploration and Intel Processor Trace-based analysis. The third domain is called transfer bound. In this case, the performance is limited by non-optimal interactions between the CPU and the device. This presentation focuses on the transfer-bound domain, which is typically the most challenging for performance debugging and tuning.
Our talk consists of two blocks.
First, we provide an overview of the platform-level activities running behind I/O traffic processing on Skylake and Cascade Lake servers.
In this part, we discuss the details of Intel Data Direct I/O hardware technology and consider uncore utilization in the simplest scenario typical of high-performance storage applications. After that, we overview relevant performance analysis methodologies, going from platform-level telemetry to performance metrics built upon it and implemented in Intel performance monitoring tools.
We will discuss best-known methods for high performance and outline possible directions for platform I/O performance tuning.
Let's get started. Okay, so now let's dive into the architectural background and go over some of the I/O transaction flows.
Okay, this is an overview of the Intel Xeon Scalable Processor.
On the right-hand side is this picture, and let me get you oriented around what we're looking at. So if you think about a two-socket system, this picture represents one of those sockets. This is the SoC that would be seated in one socket. If we were to represent both sockets here, we would have two of these grids interconnected to each other through the Intel UltraPath Interconnect, or UPI.
OK, so the mesh is the lines, the grid that these are sitting on.
And it's really just a fabric interconnecting all of these blocks to allow them to talk to each other and, you know, move data from one block to another block.
Of course, there's the cores. I think everyone's familiar with what a core is.
It contains the execution units, the L1 cache, and the L2 cache. Of course, there are multiple cores, and the number of cores will vary depending on the particular SKU of server and SoC that you have. So beyond the L2 cache, we move into the uncore.
The first uncore unit to talk about is the CHA, and that is the unit that's just below each one of these cores.
The CHA is the cache controller and really contains the engine for coherency.
So it makes sure that when someone requests data, that it's getting the latest copy of that data.
It also contains your L3, a slice of your L3 cache, right?
So each one of these CHAs contains part of the L3 cache and all together make up the full capacity of L3 cache.
There's the integrated memory controller.
This is, you know, really just bridging you to your DDR. There's the Intel UltraPath Interconnect. I mentioned it at the beginning: if we have a two-socket system, UPI is used to communicate between those two sockets. Finally, here we have the integrated I/O controller, and this is what interfaces to PCI Express devices. The key here is that any interaction between a PCI Express device and the SoC is handled by the IIO unit and can then flow through other uncore units on that data path.
We will go into that path later, but remember this picture, if you can, that
on the top there, we have a PCI Express device connected to the system. As it requests or sends data,
it's going to go through that IIO unit. It's going to be touching on a particular CHA
or memory controller and accessing L3 cache or DDR. We'll go into that more later. Now let's talk a little bit more about the integrated I/O controller and its role in the SoC. So as you saw in the previous picture, we have five of these IIO units. Three of them connect to PCI Express Gen 3.
Between the three, they cover 48 PCI Express Gen 3 lanes, each one having a x16, which doesn't have to be a single x16, right? It can be bifurcated into four x4s or two x8s, for example.
The other unit is for DMI, which connects to your legacy I/O, and then there is also the unit that connects to the Crystal Beach DMA engine. The IIO units interface with the strictly ordered PCI Express devices and translate their requests to the out-of-order mesh.
So think about it: we have a PCI Express device, and let's say it wants to write 256 bytes of data into the SoC. The IIO will take that transaction layer packet with 256 bytes of data in that single packet, break it up into 64-byte transactions, and deliver those to the mesh. And the same thing happens in the other direction: it can get multiple transactions from the mesh and coalesce them into a transaction layer packet that's, let's say, 128 bytes or 256 bytes, and then deliver that TLP to the PCI Express device.
There are two categories of transactions going through the IIO unit that enable your core-to-device or device-to-core communication.
We have inbound transactions and we have outbound transactions.
So inbound transactions are those coming into your system.
These are initiated by your I-O device targeting system memory.
We have inbound reads and inbound writes.
A quick example: if we're a disk controller, we would do an inbound read to pull data from system memory so we can write it to a disk. And then we just flip that for the inbound write scenario: the disk controller reads from a disk and then writes that data into system memory.
So your outbound transactions are typically initiated by cores, and they target your I/O device memory. Again, an outbound read means our core is reading from an I/O device, and an outbound write means the core is writing data to an I/O device. These are typically done through memory-mapped I/O addresses. And circling back to the inbound transactions now, the thing to keep in mind about inbound transactions is that this is where your bulk data transfer happens, right? If you need to move a lot of data in one direction or another, really it's the I/O device that is mastering those requests to push data back and forth.
These transactions are driven by Intel Data Direct I/O hardware technology. That's a lot to say, so we'll just call it Intel DDIO.
The thing up front to know about Intel DDIO is that it is transparent to software.
So your application or your driver doesn't have to do anything special to take advantage of Intel DDIO.
However, there are some pitfalls that we can be aware of that may lead to sub-optimal performance
and there's some things we can do to avoid those pitfalls.
Now we can get into the details of what Intel DDIO is and why it helps improve I/O performance.
So the key thing to remember about Intel DDIO is L3 cache access, because it allows I/O devices to talk directly with the local L3 cache rather than main memory. So the inbound transactions coming from your I/O device are routed directly to the local L3 cache. This is a big advantage because your local L3 cache is close and fast compared to main memory. Inbound reads and inbound writes are handled quite differently, though.
So let me walk through those two flows for you.
So inbound reads, this is where your IO device is requesting data from the system.
In this scenario, it will do an L3 cache lookup. If it's in the L3 cache, it will then deliver that data back to your device, again through the IIO unit. If it's not in the L3 cache, then it will get it from wherever it is, most likely main memory, and deliver it back to your device.
So two things.
There's an L3 cache lookup, but there is not an L3 cache allocation.
All right, so this means that there's no cache line being allocated and there's no data being
left behind in the L3 cache that wasn't already there. Now let's talk about inbound writes, because they are a little bit different. For inbound writes, we do have a cache line allocation happening in the L3 cache, and this happens in two phases. In the first phase, the IIO unit has to get ownership of the cache line so it can write to it and modify it. So first, if it's not already in the L3 cache, the L3 cache will allocate a line for it and give ownership of that line to the IIO unit. Then the IIO unit will modify the data and write it back into the L3 cache, or I should say, it'll write it back to the L3 cache under the default configuration. So the distinction here is that we have allocating and non-allocating writes.
Allocating is the default Intel DDIO behavior, so it allocates into the L3 cache.
If you have a system configured for non-allocating writes, then the data does not go directly into the L3 cache.
It will go to DRAM. So there are two scenarios
for L3 lookup that happen with Intel DDIO. We can either get an L3 hit or we
can get an L3 miss. As you can see here, the hits are good, the misses are bad.
Let me walk through this table really quick, and it'll help summarize what I was talking
through in the previous slide.
So for inbound reads, if we get an inbound read hit in the L3 cache, that means it was
in the L3 and we can deliver it to the device from the L3. If we miss, we then have to
go to local DRAM or even to the remote socket for that data. For inbound writes, if we get an L3 hit,
that means there was already a cache line allocated in the L3 and we can overwrite that data with the new data.
Now if we get an inbound write that misses in the L3, then we've got to do some extra work to make
this happen. First, we have to evict a cache line that's already there in order to make room for
this new cache line allocation, right?
So we evict that cache line back to memory, and then we can allocate a new cache line.
Once we allocate that cache line and our IIO has ownership of that,
it can write that data back into the L3 cache, leaving that data in the L3 cache. We want to avoid as many DDIO misses as possible.
Not only is it beneficial to hit because it's close and fast, that also means you're not
consuming extra bandwidth in local DRAM or going to the remote socket.
Let's shift gears a little bit and talk briefly about the other direction.
So the last two slides, we've been talking about inbound transactions, those where a PCI Express
device is requesting something or writing something to the system.
Now we flip that and we're talking about where the core is doing a read or a write to the PCI Express device.
These are our memory-mapped I/O accesses, which are the primary mechanism for performing these outbound PCI Express transactions.
The thing to remember is that these MMIO accesses are quite expensive and we want to limit them.
Out of the two, the reads are the more expensive because we have a full round-trip latency to that device. Writes are also expensive, just not quite as expensive, and so we want to minimize as many of those as we can.
This presentation doesn't go into those tricks and techniques. There is an entirely separate presentation that covers this, written by Ben Walker, and we've provided the link right here. You can click on that later and go into the particular tricks that are known for minimizing those MMIO writes.
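To give a flavor of the kind of technique covered there, here is a hedged sketch of one commonly cited approach, doorbell batching: queue several commands with ordinary stores, then pay the MMIO cost once. The structures are hypothetical placeholders, not SPDK's or the kernel driver's actual types.

```c
/* A minimal sketch of doorbell batching, with hypothetical structures
 * rather than a real driver API. Commands are simplified to one word
 * each; real NVMe commands are 64 bytes. */
#include <stdint.h>

struct nvme_sq {
    uint64_t *entries;           /* submission queue in host memory */
    volatile uint32_t *doorbell; /* MMIO doorbell register on the device */
    uint16_t tail, depth;
};

static void sq_enqueue(struct nvme_sq *sq, uint64_t cmd)
{
    /* Ordinary store to host memory: cheap, and lands in L3 under DDIO. */
    sq->entries[sq->tail] = cmd;
    sq->tail = (uint16_t)((sq->tail + 1) % sq->depth);
}

static void sq_submit_batch(struct nvme_sq *sq, const uint64_t *cmds, int n)
{
    for (int i = 0; i < n; i++)
        sq_enqueue(sq, cmds[i]);
    /* One expensive outbound MMIO write covers all n commands,
     * instead of one doorbell write per command. */
    *sq->doorbell = sq->tail;
}
```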
The last thing I want to say here about MMIO is that it is not driven by Intel DDIO. So if you remember, inbound transactions are driven by Intel DDIO; outbound transactions through MMIO are not. So now we can walk through an example of a typical I/O flow that utilizes Intel DDIO.
And in this example, we're going to have an application
that wants to read data from an SSD.
And it really helps put everything together
that we've been talking about so far.
First, the left-hand picture.
This maps back to the original architectural overview that we went
over. If you remember that kind of orange picture, we had that grid with all the blocks on it.
This is an abstracted view of that. So let me highlight some of these things. If you think back to that picture, right, we had lots of cores throughout that grid. Here we're just pointing to one of those cores.
This here represents our distributed L3 cache or our last level cache.
And then we have the memory controllers we talked about, connecting us to DRAM.
And then here,
we had the IIO units that bridge us to different PCI Express devices.
Okay, let me clear these and we can continue.
On to the example of our application reading data from the SSD.
The first thing is that our core is going to write
the I/O command descriptor into the submission queue that we're showing in the last level cache.
It then begins polling the completion queue to see when that work is finished.
The next thing it does is it will notify the SSD that a new descriptor is available.
So that means it has to do an outbound PCI Express write. This is one of those expensive MMIO writes that we were talking about. And it has to write all the way out to the SSD's submission queue doorbell to say, hey, I have work that I want you to do. Next, the device, the SSD, reads that descriptor to get the buffer address. So this is now an inbound PCI Express read, and it is going to get an L3 hit on that read. For step four, our SSD now knows what address it needs. It reads that data from the disk
and is going to write that data into the system. So here, step four,
we are going to come in and write into the L3 cache. Let's talk a little bit about DDIO right here.
So we talked about allocating flows. In this scenario, we are writing into the L3 cache,
and it's going to stay in the L3 cache. That's the allocating write flow. If for some reason we have the non-allocating write flow, then it's not
going to stay here in the LLC. It's going to instead get pushed out to DRAM.
Okay, so that was step four. Step five, now the device writes to the completion queue here and says, hey, I have moved that
data that you asked for.
And step six, here, the core has been watching for this and it detects that the completion
is updated. And then it does a little bit of bookkeeping.
It does another MMIO write to move the completion queue tail pointer.
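Pulling those six steps together, here is a heavily simplified sketch of this flow in C. All structure layouts, opcodes, and doorbell details are illustrative placeholders rather than a faithful NVMe implementation; a real driver such as SPDK or the kernel NVMe driver deals with 64-byte commands, PRP lists, interrupt or polled completion modes, and much more.

```c
/* A heavily simplified sketch of the six-step flow above. */
#include <stdint.h>

struct nvme_cmd { uint8_t opc; uint64_t prp1; };     /* simplified */
struct nvme_cpl { uint16_t status; uint8_t phase; }; /* simplified */

struct queue_pair {
    struct nvme_cmd *sq;            /* submission queue in host memory */
    volatile struct nvme_cpl *cq;   /* completion queue in host memory */
    volatile uint32_t *sq_doorbell; /* MMIO registers on the SSD */
    volatile uint32_t *cq_doorbell;
    uint16_t sq_tail, cq_head, depth;
    uint8_t phase;
};

static void read_one_block(struct queue_pair *qp, uint64_t buf_addr)
{
    /* Step 1: core writes the I/O command descriptor into the
     * submission queue; an ordinary store that lands in L3 under DDIO. */
    qp->sq[qp->sq_tail].opc  = 0x02;     /* NVMe read opcode */
    qp->sq[qp->sq_tail].prp1 = buf_addr; /* destination buffer */
    qp->sq_tail = (uint16_t)((qp->sq_tail + 1) % qp->depth);

    /* Core notifies the SSD: the expensive outbound MMIO doorbell write. */
    *qp->sq_doorbell = qp->sq_tail;

    /* Steps 3-5 happen on the device: it reads the descriptor (inbound
     * PCIe read, ideally an L3 hit), DMAs the data into the buffer
     * (inbound write, allocating into L3 under default DDIO), and
     * writes a completion entry. */

    /* Steps 2/6: core polls the completion queue until the entry's
     * phase bit flips... */
    while (qp->cq[qp->cq_head].phase != qp->phase)
        ; /* spin */
    qp->cq_head = (uint16_t)((qp->cq_head + 1) % qp->depth);
    if (qp->cq_head == 0)
        qp->phase ^= 1;

    /* ...then does its bookkeeping: another MMIO write to move the
     * completion queue pointer. */
    *qp->cq_doorbell = qp->cq_head;
}
```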
The last thing I want to mention here is the good-bad scenarios.
If you think back a few slides ago, we were talking about L3 misses and L3 hits, right?
That L3 hit is good, L3 miss is bad for performance.
And what we just walked through was an ideal scenario where all of those inbound requests were getting L3 hits, right?
That data or the information
was already in the LLC. I just want to point out that it's possible, if we look at step three for example, that by the time the core writes that information to the L3 cache and the device goes to read it, it could have been evicted from the cache and pushed out to memory, or to the other socket. And in that scenario, we would get an L3 miss and have to get it from DRAM.
With an understanding of the architecture, the typical I/O transaction flow, and the importance of Intel DDIO, we can now move on to the performance analysis of these I/O flows.
Let's start the performance analysis overview by discussing platform telemetry points. Skylake and Cascade Lake servers incorporate thousands of uncore performance monitoring events provided through performance monitoring units, or PMUs. PMUs are assigned to uncore IP blocks and track lots of activities happening in the uncore. A full description of the available PMU types, counters, events, and programming is provided in the dedicated uncore performance monitoring guides. The activities induced by DDIO and MMIO transactions can be monitored at all stages of execution in the uncore. On the inbound path, incoming traffic is first handled by the integrated I/O (IIO) block. The IIO block consists of several sub-blocks, and going from the I/O device side, the inbound traffic is first managed by the traffic controller. The traffic controller is covered by the IIO PMU. With IIO events, one can measure every 4 bytes transferred in both directions, with a breakdown by transaction type and at x4 PCIe-lane granularity.
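As a quick illustration of that 4-byte granularity, converting a raw event count into bandwidth is simple arithmetic; a hedged sketch follows (actual event names vary by platform and are omitted):

```c
#include <stdint.h>

/* Each counted IIO data-transfer event represents 4 bytes, so bandwidth
 * is count * 4 bytes over the sampling interval. Illustrative only. */
static double iio_bandwidth_mb_per_s(uint64_t event_count, double interval_sec)
{
    return (double)event_count * 4.0 / (interval_sec * 1e6);
}
```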
Then the I/O requests are passed from the traffic controller to the IRP.
The IRP is responsible for communications between the IIO and the L3 cache.
IRP performance monitoring events reflect coherent requests sent and snoops received by the IIO.
On the mesh and L3 cache side, requests may be monitored with the CHA PMU.
The mesh events cover credit tracking, inserts of transactions into various queues, and so on.
From the L3 observability perspective, CHA events count requests, insert rates and occupancy, L3 lookups and evictions, memory directory accesses, and so on.
Also, the CHA PMU provides special monitoring capabilities that allow for opcode filtering, counting events separately for cores and I/O controllers, counting L3 hits and misses separately, and so on. When the I/O request processing induces memory or cross-socket traffic, it may be monitored through events provided by the integrated memory controller and UPI PMUs. However, using raw events for performance analysis is quite a challenging task, because it requires a deep understanding of what exactly is counted by these events and how it relates to I/O traffic processing.
To make performance debugging and tuning easier, Intel provides various performance analysis tools, which build ready-to-use performance metrics upon uncore events.
As an alternative to analyzing raw events, we want to present the evolving Input and Output analysis of Intel VTune Profiler.
VTune is Intel's flagship performance analysis tool.
It includes lots of analysis types, which help to effectively locate various bottlenecks within the core and beyond.
I/O analysis gives an uncore-centric view of application performance through platform-level metrics like DRAM bandwidth, UPI bandwidth, and metrics revealing DDIO and MMIO details.
I/O analysis is available on Linux targets when running on server architectures.
VTune automatically collects the needed uncore events and presents human-readable performance metrics.
For PCIe traffic, the top-level metrics are inbound and outbound PCIe read and write bandwidth.
As it was previously discussed, inbound traffic is initiated by PCIe devices and targets system
memory, and outbound traffic is initiated by cores and goes to the device memory or
device registers.
As the second-level details for inbound PCIe traffic, or roughly speaking for DDIO traffic, VTune shows the ratios of requests that hit or miss the L3 cache.
These metrics are sometimes simply called the DDIO hit and DDIO miss ratios.
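In other words, a trivial sketch, with the hit and miss counts standing in for whatever the underlying uncore events report:

```c
/* DDIO hit ratio: the fraction of inbound requests that hit the L3. */
static double ddio_hit_ratio(unsigned long long hits, unsigned long long misses)
{
    unsigned long long total = hits + misses;
    return total ? (double)hits / (double)total : 0.0;
}
```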
For outbound traffic, VTune detects the sources of MMIO accesses and binds the functions that issue MMIO reads and MMIO writes to the specific PCIe devices targeted by these transactions.
The data on this slide was captured for an SPDK-based application.
As you can see, this is the summary page of the collected result, and here the PCIe metrics are
presented for the whole platform.
If you need finer granularity, VTune presents detailed data in grid views, where inbound
and outbound traffic and DDIO hits and misses are presented with the breakdown by CPU sockets
and groups of PCIe devices.
In simple terms, these groups are defined by the IIO units: for devices attached to the same IIO unit, the metrics are aggregated across those devices. For MMIO accesses, the grid view allows choosing any core-related data grouping, so MMIO reads and writes can be presented with attribution to any core-specific entity, like function, thread, process, and so on. Also, you may see here that VTune shows the call stacks leading to MMIO reads and MMIO writes, with attribution to specific PCIe devices. DDIO traffic, when missing the local L3 cache, may induce DRAM and UPI
traffic, so let's also see how one can correlate PCIe-related metrics with
memory and cross-socket traffic. On the timeline, VTune shows inbound and outbound PCIe bandwidth at per-device granularity.
On the same timeline, you may also see memory and UPI bandwidth and correlate PCIe traffic with memory bandwidth and cross-socket interconnect bandwidth.
In this case, the correlation between inbound PCIe, UPI and DRAM
bandwidth is clearly visible. Now we see that all platform performance metrics can be easily collected and viewed at different levels of detail. The next question is how to use all this information to tune platform I/O performance.
Let us summarize how this information may be used in platform I/O performance analysis for storage applications.
Let's assume that IOPS is the optimization criterion and it is lower than expected.
As the first step, we run I/O analysis and get the full set of platform-level metrics.
First, it makes sense to do a quick and simple step before digging into SoC-level details.
Having measured PCIe traffic and IOPS, check against the device datasheet whether the throughput limits are being hit. Usually, SSD specifications present maximum IOPS or read and write rates for specific memory access patterns. If the application hits the disk limits, the obvious solution is to consider a device upgrade or adding more devices.
If the throughput is below the device limits, there are at least three possible directions for investigation.
The first direction is to try to figure out whether the PCIe physical layer is a bottleneck. The inbound and outbound traffic captured by VTune reflects PCIe bandwidth in terms of data payloads. Over the PCIe lanes, the payloads are transferred with overhead added by the headers of transaction layer packets, data link layer packets, and physical layer packets, and eventually by physical encoding.
As a result, transferring data effectively takes just a portion of the physical link bandwidth.
In some cases, the overhead added by the PCIe protocol may become a bottleneck.
A quick example is the following.
Let's say an SSD needs to write 64 bytes. There are several options for how to do that. For example, the SSD may write 64 bytes at once, or it may write 4 times, 16 bytes each. Obviously, the latter case introduces 4 times higher header overhead than the first option.
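To put rough numbers on that example, here is a hedged model that assumes about 24 bytes of framing, sequence number, header, and CRC overhead per TLP; the exact figure depends on the header format and options such as ECRC:

```c
#include <stdio.h>

#define TLP_OVERHEAD 24 /* assumed per-TLP overhead in bytes */

/* Fraction of link bytes carrying payload when a transfer is split
 * into num_tlps packets of payload_bytes each. */
static double link_efficiency(int payload_bytes, int num_tlps)
{
    double total = (double)num_tlps * (payload_bytes + TLP_OVERHEAD);
    return (double)(num_tlps * payload_bytes) / total;
}

int main(void)
{
    printf("1 x 64B TLP : %.0f%% payload\n", 100 * link_efficiency(64, 1)); /* ~73% */
    printf("4 x 16B TLPs: %.0f%% payload\n", 100 * link_efficiency(16, 4)); /* 40% */
    return 0;
}
```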
The platform limits the payload size, and this may lead to a similar situation, so the advice is to check the maximum payload size and maximum read request size to minimize TLP header overhead.
A more advanced step is to count full and partial cache line write requests on the platform side and check whether the device coalesces small write requests.
The second direction is exploring the DDIO hit and miss ratios.
DDIO misses are evidence that the application does not fully benefit from the hardware capabilities.
If DDIO misses are observed, there are two investigation areas.
The trivial one applies to multi-socket configurations only.
To avoid DDIO misses and all their consequences, such as induced memory and UPI traffic, make sure that the cores, devices, and memory used for their interaction reside within the same socket. In short, always try to use application topologies where all resources are located locally.
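A minimal sketch of keeping everything local using libnuma follows; the PCI device address is a hypothetical placeholder, and error handling is mostly omitted:

```c
#include <numa.h>   /* libnuma; link with -lnuma */
#include <stdio.h>

int main(void)
{
    int node = 0;
    /* The kernel exports each PCI device's NUMA node in sysfs;
     * "0000:3b:00.0" is a placeholder for your NVMe device's address. */
    FILE *f = fopen("/sys/bus/pci/devices/0000:3b:00.0/numa_node", "r");
    if (f) {
        if (fscanf(f, "%d", &node) != 1 || node < 0)
            node = 0; /* -1 means unknown */
        fclose(f);
    }

    /* Run on the device-local node and allocate I/O buffers there, so
     * cores, memory, and device all sit on the same socket. */
    numa_run_on_node(node);
    void *buf = numa_alloc_onnode(1 << 20, node);
    /* ... submit I/O using buf ... */
    numa_free(buf, 1 << 20);
    return 0;
}
```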
An advanced performance tuning area is the development of an effective L3 cache management technique.
The goal is to try to keep all the data consumed by the cores and the I/O in the L3 cache. You can try various prefetching approaches, for example adding instructions that provide hints to the hardware core prefetchers, or you can use software caching techniques: for example, even with data buffers going beyond L3 cache capacity, the application may reuse recently accessed buffer elements, avoid L3 misses from the core and I/O side, and thus get served through the L3 cache only.
In the ideal case, when DDIO technology is properly utilized, you shouldn't see any DRAM traffic.
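As a hedged sketch of the buffer-reuse idea, here is a small LIFO pool: recycling a bounded set of buffers keeps the hot working set within L3 capacity, and handing out the most recently returned buffer first favors lines that are still cache-resident. The sizes are illustrative.

```c
#include <stdlib.h>

#define BUF_SIZE  4096
#define POOL_BUFS 64 /* 256 KiB working set, comfortably below typical L3 sizes */

static void *free_list[POOL_BUFS];
static int free_top;

static void pool_init(void)
{
    for (free_top = 0; free_top < POOL_BUFS; free_top++)
        free_list[free_top] = aligned_alloc(64, BUF_SIZE); /* cache-line aligned */
}

/* LIFO reuse: the most recently completed buffer, most likely still in
 * the L3, is handed out first. */
static void *buf_get(void)    { return free_top ? free_list[--free_top] : NULL; }
static void  buf_put(void *b) { free_list[free_top++] = b; }
```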
As another advanced topic that may help to better manage data in the L3, consider Cache Allocation Technology, which allows for LLC partitioning and isolating the data used by specific cores.
It is worth mentioning that sometimes, even with DDIO misses, performance may be good in terms of throughput.
In this case, keep in mind that the DDIO miss scenario introduces much higher latencies and wastes DRAM bandwidth, UPI bandwidth, and platform power.
Finally, check how the application accesses disks through the MMIO address space.
MMIO reads are super expensive and are not needed on the data path, so just avoid them.
MMIO writes are used for doorbells. The advice is to estimate how many bytes are written through MMIO per I/O operation and see whether it is possible to minimize this value.
Follow the link on this slide to learn more about MMIO traffic minimization tricks. An interesting question is how to estimate outbound write consumption per I/O operation.
To do that, you need to know your communication pattern.
For the trivial example of an application making pure reads, we know that for each I/O operation the application makes two doorbell writes.
Knowing the doorbell register size, you can relate the application read rate to the outbound write rate and check whether the consumed bandwidth is as expected.
If not, use MMIO access analysis to locate the sources of unexpected MMIO writes.
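A back-of-envelope version of that check, with a made-up IOPS figure and the 4-byte doorbell register width typical of NVMe:

```c
#include <stdio.h>

int main(void)
{
    double iops             = 1e6; /* assumed application read rate */
    double doorbells_per_io = 2.0; /* per the pure-read example above */
    double doorbell_bytes   = 4.0; /* NVMe doorbell register width */

    /* Expected outbound write bandwidth: 1M IOPS * 2 * 4 B = 8 MB/s. */
    double mb_per_s = iops * doorbells_per_io * doorbell_bytes / 1e6;
    printf("expected outbound write bandwidth: %.1f MB/s\n", mb_per_s);
    return 0;
}
```

A large excess over such an estimate in the outbound write bandwidth VTune reports is exactly what the MMIO access analysis helps track down.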
On the other hand, when PCIe metrics are collected along with API-level statistics, you may divide the outbound write bandwidth by IOPS and get the bytes written to the device registers per single I/O operation.
This value may be used as an optimization criterion.
The question is where to get these API-level statistics from.
If your application is based on SPDK, you get API-level metrics out of the box.
VTune has an integration with SPDK, and when SPDK is compiled with the dedicated option,
VTune collects various statistics such as read and write operations, bytes read and
written, and so on.
If you're not using SPDK, you may always extend any predefined VTune analysis with
your own custom collector.
This collector should provide the data in a specific CSV format, and this data will be shown along with the platform-level metrics.
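A hypothetical skeleton of such a collector is below. The exact CSV column layout and file naming that VTune expects are defined in its custom-collector documentation; the format shown here is illustrative only, as is the stubbed statistics hook.

```c
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Stub: wire this to your application's own I/O statistics. */
static unsigned long sample_completed_ios(void) { return 0; }

int main(void)
{
    FILE *out = fopen("my_io_stats.csv", "w");
    if (!out)
        return 1;
    fprintf(out, "timestamp,metric,value\n"); /* hypothetical header */
    for (int i = 0; i < 10; i++) {
        fprintf(out, "%ld,completed_ios,%lu\n",
                (long)time(NULL), sample_completed_ios());
        sleep(1);
    }
    fclose(out);
    return 0;
}
```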
Now, before we finish, let's get back to where we started the performance analysis overview.
In some cases, you may need to go beyond the existing metrics of VTune.
So, the last direction we want to touch on today is dealing with raw counters.
If you need to go the hard way and explore second-level effects, you may customize the I/O analysis by adding events counting snoops sent to I/O, coherent operations, and so on. To navigate through the raw counters, use the uncore performance monitoring guide and our previous dedicated talks. And here is our call to action for you.
Try the VTune I/O analysis, try dealing with raw events, and do not hesitate to contact us to provide feedback or ask questions.
This will help us keep platform I/O analysis useful for analyzing high-performance storage workloads. And please find the list of references at the very end of our presentation.
Thank you for your attention.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.