Storage Developer Conference - #147: Platform Performance Analysis for I/O-intensive Applications
Episode Date: June 7, 2021
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 147.
Good morning or good afternoon. My name is Perry Taylor. I'm a platform monitoring engineer at Intel, and I focus largely on VTune's I/O analysis.
The topic that we want to present today is called Platform Performance Analysis for I/O-Intensive Applications.
High-performance storage applications running on Intel servers actively utilize the I/O capabilities and I/O-accelerating features of the platform by interfacing with NVMe devices. In the big picture, performance issues appearing in such I/O-intensive workloads can be categorized into three broad domains, depending on where the bottleneck is. The first domain is called I/O device bound. In this case, the performance is limited by device capabilities. This case is fairly trivial and can be addressed by checking the observed performance against the device datasheet. The second domain covers core-bound cases. This is when performance is limited because a core underutilizes the device due to algorithmic or microarchitectural inefficiencies. This case is well covered by various core-centric analyses, from hotspots to microarchitecture exploration and Intel Processor Trace-based analysis. The third domain is called transfer bound. In this case, the performance is limited by non-optimal interactions between the CPU and the device. This presentation focuses on the transfer-bound domain, which is typically the most challenging for performance debugging and tuning.
Our talk consists of two blocks.
First, we provide an overview of the platform-level activities running behind I/O traffic processing on Skylake and Cascade Lake servers.
In this part, we discuss the details of Intel Data Direct I/O hardware technology and consider uncore utilization in the simplest scenario typical of high-performance storage applications. After that, we overview relevant performance analysis methodologies, going from platform-level telemetry to performance metrics built upon it and implemented in Intel performance monitoring tools.
We will discuss best-known methods for high performance and outline possible directions for platform I/O performance tuning.
Let's get started. Okay, so now let's dive into the architectural background and go over some of the I/O transaction flows.
Okay, this is an overview of the Intel Xeon Scalable Processor.
On the right-hand side is this picture, and let me get you oriented around what we're looking at. So if you think about a two-socket system, this picture represents one of those sockets. This is the SoC that would be seated in one socket. If we were to represent both sockets here, we would have two of these grids interconnected to each other through the Intel UltraPath Interconnect, or UPI.
OK, so the mesh is the lines, the grid that these are sitting on.
And it's really just a fabric interconnecting all of these blocks to allow them to talk to each other and, you know, move data from one block to another block.
Of course, there's the cores. I think everyone's familiar with what a core is.
It contains the execution units, the L1 cache, and the L2 cache. Of course, there are multiple cores, and the number of cores will vary depending on the particular SKU of server and SoC that you have. So beyond the L2 cache, we move into the uncore.
The first uncore unit to talk about is the CHA, and that is the unit that's just below each one of these cores.
The CHA is the cache controller and really contains the engine for coherency.
So it makes sure that when someone requests data, that it's getting the latest copy of that data.
It also contains your L3, a slice of your L3 cache, right?
So each one of these CHAs contains part of the L3 cache and all together make up the full capacity of L3 cache.
There's the integrated memory controller.
This is, you know, really just bridging you to your DDR. There's the Intel UltraPath Interconnect. I mentioned it at the beginning: if we have a two-socket system, UPI is used to communicate between those two sockets. Finally, here we have the integrated I/O controller, and this is what interfaces to PCI Express devices. The key here is that any interaction between a PCI Express device and the SoC is handled by the IIO unit and can then flow through other uncore units on that data path.
We will go into that path later, but remember this picture, if you can, that
on the top there, we have a PCI Express device connected to the system. As it requests or sends data,
it's going to go through that IIO unit. It's going to be touching on a particular CHA
or memory controller and accessing L3 cache or DDR. We'll go into that more later. Now let's talk a little bit more about the integrated I/O controller and its role in the SoC. So as you saw in the previous picture, we have five of these IIO units. Three of them connect to PCI Express Gen 3.
Between the three, they cover 48 PCI Express Gen 3 lanes, each one having a x16, which doesn't have to be a single x16, right? It can be bifurcated into four x4s or two x8s, for example.
The other unit is for DMI, which connects to your legacy I/O, and then there is also the unit that connects to the Crystal Beach DMA engine. The IIO units interface with the strictly ordered PCI Express devices and translate their requests to the out-of-order mesh.
So think about it: we have a PCI Express device, and let's say it wants to write 256 bytes of data into the SoC. The IIO will take that transaction layer packet with 256 bytes of data in that single packet, break it up into 64-byte transactions, and deliver those to the mesh. And the same thing happens in the other direction: it can get multiple transactions from the mesh and coalesce them into a transaction layer packet that's, let's say, 128 bytes or 256 bytes, and then deliver that TLP to the PCI Express device.
There are two categories of transactions going through the IIO unit that enable your core-to-device or device-to-core communication.
We have inbound transactions and we have outbound transactions.
So inbound transactions are those coming into your system.
These are initiated by your I-O device targeting system memory.
We have inbound reads and inbound writes.
A quick example: if we're a disk controller, we would do an inbound read to pull data from system memory so we can write it to a disk. And then we just flip that for the inbound write scenario: the disk controller reads from a disk and then writes that data into system memory.
So your outbound transactions are typically initiated by cores, and they target your I/O device memory. Again, an outbound read means our core is reading from an I/O device, and an outbound write means the core is writing data to an I/O device. These are typically done through memory-mapped I/O addresses. And circling back to the inbound transactions now, the thing to keep in mind about inbound transactions is that this is where your bulk data transfer happens, right? If you need to move a lot of data in one direction or another, really it's the I/O device that is mastering those requests to push data back and forth.
These transactions are driven by Intel Data Direct I/O hardware technology. That's a lot to say, so we'll just call it Intel DDIO.
The thing up front to know about Intel DDIO is that it is transparent to software.
So your application or your driver doesn't have to do anything special to take advantage of Intel DDIO.
However, there are some pitfalls that we can be aware of that may lead to sub-optimal performance
and there's some things we can do to avoid those pitfalls.
Now we can get into the details of what Intel DDIO is and why it helps improve I/O performance.
So the key thing to remember about Intel DDIO is L3 cache access, because it allows I/O devices to talk directly with the local L3 cache rather than main memory. So the inbound transactions coming from your I/O device are routed directly to the local L3 cache. This is a big advantage because your local L3 cache is close and fast compared to main memory. Inbound reads and inbound writes are handled quite differently, though.
So let me walk through those two flows for you.
So inbound reads, this is where your IO device is requesting data from the system.
In this scenario, it will do an L3 cache lookup. If it's in the L3 cache, it will then deliver that data back to your device, again through the IIO unit. If it's not in the L3 cache, then it will get it from wherever it is, most likely main memory, and deliver it back to your device.
So two things.
There's an L3 cache lookup, but there is not an L3 cache allocation.
All right, so this means that there's no cache line being allocated and there's no data being
left behind in the L3 cache that wasn't already there. Now let's talk about inbound writes, because they are a little bit different. For inbound writes, we do have a cache line allocation happening in the L3 cache, and this happens in two phases. In the first phase, the IIO unit has to get ownership of the cache line so it can write to it and modify it. So first, if it's not already in the L3 cache, the L3 cache will allocate a line for it and give ownership of that line to the IIO unit. Then the IIO unit will modify the data and write it back into the L3 cache, or I should say, it'll write it back to the L3 cache under the default configuration. So the distinction here is that we have allocating and non-allocating writes.
Allocating is the default Intel DDIO behavior, so it allocates into the L3 cache.
If you have a system configured for non-allocating writes, then the data does not go directly into the L3 cache.
It will go to DRAM. So there are two scenarios
for L3 lookup that happen with Intel DDIO. We can either get an L3 hit or we
can get an L3 miss. As you can see here, the hits are good, the misses are bad.
Let me walk through this table really quick, and it'll help summarize what I was talking
through in the previous slide.
So for inbound reads, if we get an inbound read hit in the L3 cache, that means it was
in the L3 and we can deliver it to the device from the L3. If we miss, we then have to
go to local DRAM or even to the remote socket for that data. For inbound writes, if we get an L3 hit,
that means there was already a cache line allocated in the L3 and we can overwrite that data with the new data.
Now if we get an inbound write that misses in the L3, then we've got to do some extra work to make
this happen. First, we have to evict a cache line that's already there in order to make room for
this new cache line allocation, right?
So we evict that cache line back to memory, and then we can allocate a new cache line.
Once we allocate that cache line and our IIO has ownership of that,
it can write that data back into the L3 cache, leaving that data in the L3 cache. We want to avoid as many DDIO misses as possible.
Not only is it beneficial to hit because it's close and fast, that also means you're not
consuming extra bandwidth in local DRAM or going to the remote socket.
Let's shift gears a little bit and talk briefly about the other direction.
So the last two slides, we've been talking about inbound transactions, those where a PCI Express
device is requesting something or writing something to the system.
Now we flip that and we're talking about where the core is doing a read or a write to the PCI Express device.
These are our memory-mapped I/O accesses, which are the primary mechanism for performing these outbound PCI Express transactions.
The thing to remember is that these MMIO accesses are quite expensive and we want to limit them.
Out of the two, the reads are the more expensive because we have a full round-trip latency to that device. Writes are also expensive, just not quite as expensive, and so we want to minimize as many of those as we can.
This presentation doesn't go into those tricks and techniques. There is an entirely separate presentation that covers this, written by Ben Walker, and we've provided the link right here. You can click on that later and go into the particular tricks that are known for minimizing those MMIO writes.
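To give a flavor of the kind of technique covered there, here is a hedged sketch of one commonly cited approach, doorbell batching: queue several commands with ordinary stores, then pay the MMIO cost once. The structures are hypothetical placeholders, not SPDK's or the kernel driver's actual types.

```c
/* A minimal sketch of doorbell batching, with hypothetical structures
 * rather than a real driver API. Commands are simplified to one word
 * each; real NVMe commands are 64 bytes. */
#include <stdint.h>

struct nvme_sq {
    uint64_t *entries;           /* submission queue in host memory */
    volatile uint32_t *doorbell; /* MMIO doorbell register on the device */
    uint16_t tail, depth;
};

static void sq_enqueue(struct nvme_sq *sq, uint64_t cmd)
{
    /* Ordinary store to host memory: cheap, and lands in L3 under DDIO. */
    sq->entries[sq->tail] = cmd;
    sq->tail = (uint16_t)((sq->tail + 1) % sq->depth);
}

static void sq_submit_batch(struct nvme_sq *sq, const uint64_t *cmds, int n)
{
    for (int i = 0; i < n; i++)
        sq_enqueue(sq, cmds[i]);
    /* One expensive outbound MMIO write covers all n commands,
     * instead of one doorbell write per command. */
    *sq->doorbell = sq->tail;
}
```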
The last thing I want to say here about MMIO is that it is not driven by Intel DDIO. So if you remember, inbound transactions are driven by Intel DDIO; outbound transactions through MMIO are not. So now we can walk through an example of a typical I/O flow that utilizes Intel DDIO.
And in this example, we're going to have an application
that wants to read data from an SSD.
And it really helps put everything together
that we've been talking about so far.
First, the left-hand picture.
This maps back to the original architectural overview that we went
over. If you remember that kind of orange picture, we had that grid with all the blocks on it.
This is an abstracted view of that. So let me highlight some of these things. If you think back to that picture, right, we had lots of cores throughout that grid. Here we're just pointing to one of those cores.
This here represents our distributed L3 cache or our last level cache.
And then we have the memory controllers we talked about, connecting us to DRAM.
And then here,
we had the IIO units that bridge us to different PCI Express devices.
Okay, let me clear these and we can continue.
On to the example of our application reading data from the SSD.
The first thing is that our core is going to write
the I/O command descriptor into the submission queue that we're showing in the last level cache.
It then begins polling the completion queue to see when that work is finished.
The next thing it does is it will notify the SSD that a new descriptor is available.
So that means it has to do an outbound PCI Express write. This is one of those expensive MMIO writes that we were talking about. And it has to write all the way out to the SSD's submission queue doorbell to say, hey, I have work that I want you to do. Next, the device, the SSD, reads that descriptor to get the buffer address. So this is now an inbound PCI Express read, and it is going to get an L3 hit on that read. For step four, our SSD now knows what address it needs. It reads that data from the disk
and is going to write that data into the system. So here, step four,
we are going to come in and write into the L3 cache. Let's talk a little bit about DDIO right here.
So we talked about allocating flows. In this scenario, we are writing into the L3 cache,
and it's going to stay in the L3 cache. That's the allocating write flow. If for some reason we have the non-allocating write flow, then it's not
going to stay here in the LLC. It's going to instead get pushed out to DRAM.
Okay, so that was step four. Step five, now the device writes to the completion queue here and says, hey, I have moved that
data that you asked for.
And step six, here, the core has been watching for this and it detects that the completion
is updated. And then it does a little bit of bookkeeping.
It does another MMIO write to move the completion queue tail pointer.
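Pulling those six steps together, here is a heavily simplified sketch of this flow in C. All structure layouts, opcodes, and doorbell details are illustrative placeholders rather than a faithful NVMe implementation; a real driver such as SPDK or the kernel NVMe driver deals with 64-byte commands, PRP lists, interrupt or polled completion modes, and much more.

```c
/* A heavily simplified sketch of the six-step flow above. */
#include <stdint.h>

struct nvme_cmd { uint8_t opc; uint64_t prp1; };     /* simplified */
struct nvme_cpl { uint16_t status; uint8_t phase; }; /* simplified */

struct queue_pair {
    struct nvme_cmd *sq;            /* submission queue in host memory */
    volatile struct nvme_cpl *cq;   /* completion queue in host memory */
    volatile uint32_t *sq_doorbell; /* MMIO registers on the SSD */
    volatile uint32_t *cq_doorbell;
    uint16_t sq_tail, cq_head, depth;
    uint8_t phase;
};

static void read_one_block(struct queue_pair *qp, uint64_t buf_addr)
{
    /* Step 1: core writes the I/O command descriptor into the
     * submission queue; an ordinary store that lands in L3 under DDIO. */
    qp->sq[qp->sq_tail].opc  = 0x02;     /* NVMe read opcode */
    qp->sq[qp->sq_tail].prp1 = buf_addr; /* destination buffer */
    qp->sq_tail = (uint16_t)((qp->sq_tail + 1) % qp->depth);

    /* Core notifies the SSD: the expensive outbound MMIO doorbell write. */
    *qp->sq_doorbell = qp->sq_tail;

    /* Steps 3-5 happen on the device: it reads the descriptor (inbound
     * PCIe read, ideally an L3 hit), DMAs the data into the buffer
     * (inbound write, allocating into L3 under default DDIO), and
     * writes a completion entry. */

    /* Steps 2/6: core polls the completion queue until the entry's
     * phase bit flips... */
    while (qp->cq[qp->cq_head].phase != qp->phase)
        ; /* spin */
    qp->cq_head = (uint16_t)((qp->cq_head + 1) % qp->depth);
    if (qp->cq_head == 0)
        qp->phase ^= 1;

    /* ...then does its bookkeeping: another MMIO write to move the
     * completion queue pointer. */
    *qp->cq_doorbell = qp->cq_head;
}
```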
The last thing I want to mention here is the good-bad scenarios.
If you think back a few slides ago, we were talking about L3 misses and L3 hits, right?
That L3 hit is good, L3 miss is bad for performance.
And what we just walked through was an ideal scenario where all of those inbound requests were getting L3 hits, right?
That data or the information
was already in the LLC. I just want to point out that it's possible, if we look at step three for example, that by the time the core writes that information to the L3 cache and the device goes to read it, it could have been evicted from the cache and pushed out to memory, or to the other socket. And in that scenario, we would get an L3 miss and have to get it from DRAM.
With an understanding of the architecture, the typical I/O transaction flow, and the importance of Intel DDIO, we can now move on to the performance analysis of these I/O flows.
Let's start the performance analysis overview by discussing platform telemetry points. Skylake and Cascade Lake servers incorporate thousands of uncore performance monitoring events provided through performance monitoring units, or PMUs. PMUs are assigned to uncore IP blocks and track lots of activities happening in the uncore. A full description of the available PMU types, counters, events, and programming is provided in the dedicated uncore performance monitoring guides. The activities induced by DDIO and MMIO transactions can be monitored at all stages of execution in the uncore. On the inbound path, incoming traffic is first handled by the integrated I/O (IIO) block. The IIO block consists of several sub-blocks, and going from the I/O device side, the inbound traffic is first managed by the traffic controller. The traffic controller is covered by the IIO PMU. With IIO events, one can measure every 4 bytes transferred in both directions, with a breakdown by transaction type and at x4 PCIe-lane granularity.
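As a quick illustration of that 4-byte granularity, converting a raw event count into bandwidth is simple arithmetic; a hedged sketch follows (actual event names vary by platform and are omitted):

```c
#include <stdint.h>

/* Each counted IIO data-transfer event represents 4 bytes, so bandwidth
 * is count * 4 bytes over the sampling interval. Illustrative only. */
static double iio_bandwidth_mb_per_s(uint64_t event_count, double interval_sec)
{
    return (double)event_count * 4.0 / (interval_sec * 1e6);
}
```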
Then the I/O requests are passed from the traffic controller to the IRP.
The IRP is responsible for communications between the IIO and the L3 cache.
IRP performance monitoring events reflect coherent requests sent and snoops received by the IIO.
On the mesh and L3 cache side, requests may be monitored with the CHA PMU.
The mesh events cover credit tracking, inserts of transactions into various queues, and so on.
From the L3 observability perspective, CHA events count requests, insert rates and occupancy, L3 lookups and evictions, memory directory accesses, and so on.
Also, the CHA PMU provides special monitoring capabilities that allow for opcode filtering, counting events separately for cores and I/O controllers, counting L3 hits and misses separately, and so on. When the I/O request processing induces memory or cross-socket traffic, it may be monitored through events provided by the integrated memory controller and UPI PMUs. However, using raw events for performance analysis is quite a challenging task, because it requires a deep understanding of what exactly is counted by these events and how it relates to I/O traffic processing.
To make performance debugging and tuning easier, Intel provides various performance analysis tools, which build ready-to-use performance metrics upon uncore events.
As an alternative to analyzing raw events, we want to present the evolving Input and Output analysis of Intel VTune Profiler.
VTune is Intel's flagship performance analysis tool.
It includes lots of analysis types, which help to effectively locate various bottlenecks within the core and beyond.
I/O analysis gives an uncore-centric view of application performance through platform-level metrics like DRAM bandwidth, UPI bandwidth, and metrics revealing DDIO and MMIO details.
I/O analysis is available on Linux targets when running on server architectures.
VTune automatically collects the needed uncore events and presents human-readable performance metrics.
For PCIe traffic, the top-level metrics are inbound and outbound PCIe read and write bandwidth.
As it was previously discussed, inbound traffic is initiated by PCIe devices and targets system
memory, and outbound traffic is initiated by cores and goes to the device memory or
device registers.
As the second-level details for inbound PCIe traffic, or roughly speaking for DDIO traffic, VTune shows the ratios of requests that hit or miss the L3 cache.
These metrics are sometimes simply called the DDIO hit and DDIO miss ratios.
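In other words, a trivial sketch, with the hit and miss counts standing in for whatever the underlying uncore events report:

```c
/* DDIO hit ratio: the fraction of inbound requests that hit the L3. */
static double ddio_hit_ratio(unsigned long long hits, unsigned long long misses)
{
    unsigned long long total = hits + misses;
    return total ? (double)hits / (double)total : 0.0;
}
```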
For outbound traffic, VTune detects the sources of MMIO accesses and binds the functions that issue MMIO reads and MMIO writes to the specific PCIe devices targeted by these transactions.
The data on this slide was captured for an SPDK-based application.
As you can see, this is the summary page of the collected result, and here the PCIe metrics are
presented for the whole platform.
If you need finer granularity, VTune presents detailed data in grid views, where inbound
and outbound traffic and DDIO hits and misses are presented with the breakdown by CPU sockets
and groups of PCIe devices.
In simple terms, these groups are defined by the IIO units: for devices attached to the same IIO unit, the metrics are aggregated across those devices. For MMIO accesses, the grid view allows choosing any core-related data grouping, so MMIO reads and writes can be presented with attribution to any core-specific entity, like function, thread, process, and so on. Also, you may see here that VTune shows the call stacks leading to MMIO reads and MMIO writes, with attribution to specific PCIe devices. DDIO traffic, when missing the local L3 cache, may induce DRAM and UPI
traffic, so let's also see how one can correlate PCIe-related metrics with
memory and cross-socket traffic. On the timeline, VTune shows inbound and outbound PCIe bandwidth at per-device granularity.
On the same timeline, you may also see memory and UPI bandwidth and correlate PCIe traffic with memory bandwidth and cross-socket interconnect bandwidth.
In this case, the correlation between inbound PCIe, UPI and DRAM
bandwidth is clearly visible. Now we see that all platform performance metrics can be easily collected and viewed at different levels of detail. The next question is how to use all this information to tune platform I/O performance.
Let us summarize how this information may be used in platform I/O performance analysis for storage applications.
Let's assume that IOPS is the optimization criterion and it is lower than expected.
As the first step, we run I/O analysis and get the full set of platform-level metrics.
First, it makes sense to do a quick and simple step before digging into SoC-level details.
Having measured PCIe traffic and IOPS, check against the device datasheet whether the throughput limits are being hit. Usually, SSD specifications present maximum IOPS or read and write rates for specific memory access patterns. If the application hits the disk limits, the obvious solution is to consider a device upgrade or adding more devices.
If the throughput is below the device limits, there are at least three possible directions for investigation.
The first direction is to try to figure out whether the PCIe physical layer is a bottleneck. The inbound and outbound traffic captured by VTune reflects PCIe bandwidth in terms of data payloads. Over the PCIe lanes, the payloads are transferred with overhead added by the headers of transaction layer packets, data link layer packets, and physical layer packets, and eventually by physical encoding.
As a result, transferring data effectively takes just a portion of the physical link bandwidth.
In some cases, the overhead added by the PCIe protocol may become a bottleneck.
A quick example is the following.
Let's say an SSD needs to write 64 bytes. There are several options for how to do that. For example, the SSD may write 64 bytes at once, or it may write 4 times, 16 bytes each. Obviously, the latter case introduces 4 times higher header overhead than the first option.
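To put rough numbers on that example, here is a hedged model that assumes about 24 bytes of framing, sequence number, header, and CRC overhead per TLP; the exact figure depends on the header format and options such as ECRC:

```c
#include <stdio.h>

#define TLP_OVERHEAD 24 /* assumed per-TLP overhead in bytes */

/* Fraction of link bytes carrying payload when a transfer is split
 * into num_tlps packets of payload_bytes each. */
static double link_efficiency(int payload_bytes, int num_tlps)
{
    double total = (double)num_tlps * (payload_bytes + TLP_OVERHEAD);
    return (double)(num_tlps * payload_bytes) / total;
}

int main(void)
{
    printf("1 x 64B TLP : %.0f%% payload\n", 100 * link_efficiency(64, 1)); /* ~73% */
    printf("4 x 16B TLPs: %.0f%% payload\n", 100 * link_efficiency(16, 4)); /* 40% */
    return 0;
}
```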
The platform limits the payload size, and this may lead to a similar situation, so the advice is to check the maximum payload size and maximum read request size to minimize TLP header overhead.
A more advanced step is to count full and partial cache line write requests on the platform side and check whether the device coalesces small write requests.
The second direction is exploring the DDIO hit and miss ratios.
DDIO misses are evidence that the application does not fully benefit from the hardware capabilities.
If DDIO misses are observed, there are two investigation areas.
The trivial one applies to multi-socket configurations only.
To avoid DDIO misses and all their consequences, such as induced memory and UPI traffic, make sure that the cores, devices, and memory used for their interaction reside within the same socket. In short, always try to use application topologies where all resources are located locally.
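A minimal sketch of keeping everything local using libnuma follows; the PCI device address is a hypothetical placeholder, and error handling is mostly omitted:

```c
#include <numa.h>   /* libnuma; link with -lnuma */
#include <stdio.h>

int main(void)
{
    int node = 0;
    /* The kernel exports each PCI device's NUMA node in sysfs;
     * "0000:3b:00.0" is a placeholder for your NVMe device's address. */
    FILE *f = fopen("/sys/bus/pci/devices/0000:3b:00.0/numa_node", "r");
    if (f) {
        if (fscanf(f, "%d", &node) != 1 || node < 0)
            node = 0; /* -1 means unknown */
        fclose(f);
    }

    /* Run on the device-local node and allocate I/O buffers there, so
     * cores, memory, and device all sit on the same socket. */
    numa_run_on_node(node);
    void *buf = numa_alloc_onnode(1 << 20, node);
    /* ... submit I/O using buf ... */
    numa_free(buf, 1 << 20);
    return 0;
}
```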
An advanced performance tuning area is the development of an effective L3 cache management technique.
The goal is to try to keep all the data consumed by the cores and the I/O in the L3 cache. You can try various prefetching approaches, for example adding instructions that provide hints to the hardware core prefetchers, or you can use software caching techniques: for example, even with data buffers going beyond L3 cache capacity, the application may reuse recently accessed buffer elements, avoid L3 misses from the core and I/O side, and thus get served through the L3 cache only.
In the ideal case, when DDIO technology is properly utilized, you shouldn't see any DRAM traffic.
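As a hedged sketch of the buffer-reuse idea, here is a small LIFO pool: recycling a bounded set of buffers keeps the hot working set within L3 capacity, and handing out the most recently returned buffer first favors lines that are still cache-resident. The sizes are illustrative.

```c
#include <stdlib.h>

#define BUF_SIZE  4096
#define POOL_BUFS 64 /* 256 KiB working set, comfortably below typical L3 sizes */

static void *free_list[POOL_BUFS];
static int free_top;

static void pool_init(void)
{
    for (free_top = 0; free_top < POOL_BUFS; free_top++)
        free_list[free_top] = aligned_alloc(64, BUF_SIZE); /* cache-line aligned */
}

/* LIFO reuse: the most recently completed buffer, most likely still in
 * the L3, is handed out first. */
static void *buf_get(void)    { return free_top ? free_list[--free_top] : NULL; }
static void  buf_put(void *b) { free_list[free_top++] = b; }
```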
As another advanced topic that may help to better manage data in the L3, consider Cache Allocation Technology, which allows for LLC partitioning and isolating the data used by specific cores.
It is worth mentioning that sometimes, even with DDIO misses, performance may be good in terms of throughput.
In this case, keep in mind that the DDIO miss scenario introduces much higher latencies and wastes DRAM bandwidth, UPI bandwidth, and platform power.
Finally, check how the application accesses disks through the MMIO address space.
MMIO reads are super expensive and are not needed on the data path, so just avoid them.
MMIO writes are used for doorbells. The advice is to estimate how many bytes are written through MMIO per I/O operation and see whether it is possible to minimize this value.
Follow the link on this slide to learn more about MMIO traffic minimization tricks. An interesting question is how to estimate outbound write consumption per I/O operation.
To do that, you need to know your communication pattern.
For the trivial example of an application making pure reads, we know that for each I/O operation the application makes two doorbell writes.
Knowing the doorbell register size, you can relate the application read rate to the outbound write rate and check whether the consumed bandwidth is as expected.
If not, use MMIO access analysis to locate the sources of unexpected MMIO writes.
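A back-of-envelope version of that check, with a made-up IOPS figure and the 4-byte doorbell register width typical of NVMe:

```c
#include <stdio.h>

int main(void)
{
    double iops             = 1e6; /* assumed application read rate */
    double doorbells_per_io = 2.0; /* per the pure-read example above */
    double doorbell_bytes   = 4.0; /* NVMe doorbell register width */

    /* Expected outbound write bandwidth: 1M IOPS * 2 * 4 B = 8 MB/s. */
    double mb_per_s = iops * doorbells_per_io * doorbell_bytes / 1e6;
    printf("expected outbound write bandwidth: %.1f MB/s\n", mb_per_s);
    return 0;
}
```

A large excess over such an estimate in the outbound write bandwidth VTune reports is exactly what the MMIO access analysis helps track down.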
On the other hand, when PCIe metrics are collected along with API-level statistics, you may divide the outbound write bandwidth by IOPS and get the bytes written to the device registers per single I/O operation.
This value may be used as an optimization criterion.
The question is where to get these API-level statistics from.
If your application is based on SPDK, you get API-level metrics out of the box.
VTune has an integration with SPDK, and when SPDK is compiled with the dedicated option,
VTune collects various statistics such as read and write operations, bytes read and
written, and so on.
If you're not using SPDK, you may always extend any predefined VTune analysis with
your own custom collector.
This collector should provide the data in a specific CSV format, and this data will be shown along with the platform-level metrics.
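A hypothetical skeleton of such a collector is below. The exact CSV column layout and file naming that VTune expects are defined in its custom-collector documentation; the format shown here is illustrative only, as is the stubbed statistics hook.

```c
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Stub: wire this to your application's own I/O statistics. */
static unsigned long sample_completed_ios(void) { return 0; }

int main(void)
{
    FILE *out = fopen("my_io_stats.csv", "w");
    if (!out)
        return 1;
    fprintf(out, "timestamp,metric,value\n"); /* hypothetical header */
    for (int i = 0; i < 10; i++) {
        fprintf(out, "%ld,completed_ios,%lu\n",
                (long)time(NULL), sample_completed_ios());
        sleep(1);
    }
    fclose(out);
    return 0;
}
```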
Now, before we finish, let's get back to where we started the performance analysis overview.
In some cases, you may need to go beyond the existing metrics of VTune.
So, the last direction we want to touch on today is dealing with raw counters.
If you need to go the hard way and explore second-level effects, you may customize the I/O analysis by adding events counting snoops sent to I/O, coherent operations, and so on. To navigate through the raw counters, use the uncore performance monitoring guide and our previous dedicated talks. And here is our call to action for you.
Try the VTune I/O analysis, try dealing with raw events, and do not hesitate to contact us to provide feedback or ask questions.
This will help us keep platform I/O analysis useful for analyzing high-performance storage workloads. And please find the list of references at the very end of our presentation.
Thank you for your attention.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.