Storage Developer Conference - #168: PCIe® 6.0: A High-Performance Interconnect for Storage Networking Challenges

Episode Date: May 9, 2022

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast, Episode 169. Hello, everybody. Welcome to this session. I'm Mohiuddin Mazumder from Intel. I co-chair the PCI-SIG electrical workgroup, and I will be giving this presentation,
Starting point is 00:00:55 PCIe 6.0 specification, a high-performance interconnect for storage networking challenges. So I will give you a little bit of background on the PCIe technology and how it is evolving. Then I will walk through the key metrics and requirements for the PCIe 6.0 specification, and how we use a holistic approach, optimizing the electrical, logical, protocol, and all other aspects of this technology, to minimize latency for 64 gigatransfer per second PAM4 signaling. I will provide a summary of the key electrical improvements in PCIe 6.0 and talk about the PCIe compliance process, which is a key aspect of the PCIe technology. And we'll conclude the presentation.
Starting point is 00:01:53 So if we just look at the PCIe interconnect for storage, the graph that you are seeing here gives an example snapshot. If we focus on the blue bar portion, that is NVMe PCIe based. As we can see, from 2017 and moving into the future into 2024, it is growing very rapidly. Today, 70% to 80% of data center SSDs use PCIe as the interface because of its high performance, which is exemplified by the data rate and the scalable widths available in x1, x2, x4, x8, x16. The combination of those two gives flexibility in terms of bandwidth. Low latency, because PCIe directly connects the storage device to the host. It has extended reliability, availability, and serviceability (RAS) features. It is available in many standard form factors: M.2, U.2, add-in card, EDSFF.
Starting point is 00:03:15 These are all standardized in the industry. It also provides low power. And overall, because this technology is used in so many segments of the computing industry, it reduces the total cost of ownership through its widespread application. Examples of usages that benefit most from PCIe and NVMe-based SSDs are databases, AI and machine learning, and HPC; these mainly need low latency, high bandwidth, and faster speed, and all of those features make PCIe an ideal candidate for this type of solution. So overall, as PCIe scales while maintaining backwards compatibility, this architecture and technology delivers a high-performance, low-latency interconnect between the storage SSDs and the host CPU or switch. Now I'm going to give you a little bit of a flavor of other usages of PCIe.
Starting point is 00:04:46 One of the major drivers of high bandwidth is also the Ethernet network adapter. As demonstrated here, one side connects through Ethernet, and the other side, which is mostly CEM connector based, communicates directly with the CPU. To meet a 400 gig Ethernet card in a x16 slot, you need a minimum of 32 gigatransfers per second, which is what PCIe 5.0 is able to meet. So it goes hand in hand with meeting the bandwidth demands on the CPU side and the networking side. Similarly, many applications like AI, machine learning, and HPC that we talked about use accelerators, and those also often come in CEM card form factors. This is a form factor that has been in use since PCIe 1.0, spanning almost two decades. Similarly, other applications like storage are coming into those form factors,
Starting point is 00:06:08 and essentially you have x1, x2, x4, x8, and x16, all these form factors to meet the various bandwidth demands. Similarly, PCIe comes in form factors like BGA to enable a much smaller and more compact form factor for mobile and other such usages, and U.2 serves the drives. So all of these form factors will be demanding various amounts of overall bandwidth. We expect that as PCIe technology keeps scaling, the many usages will demand various bandwidths, and the latest and greatest bandwidth, and even the lower ones, will find many usages across the whole compute continuum. So let's take a look at how PCIe has been progressing as its bandwidth scales. PCIe 1.0 started with 2.5 gigatransfers per second with 8b/10b encoding in 2003. Soon the data rate doubled to 5.0 gigatransfers per second with PCIe 2.0, keeping the backwards compatibility. As we moved from PCIe 2.0 to 3.0,
Starting point is 00:07:38 it was considered extremely risky at that time for a technology which needs to be low cost and high volume, and we did not go to twice the data rate. We went to eight gigatransfers per second, but we changed the encoding from 8b/10b to 128b/130b, and that gave us about a 20% improvement on the logic side. So combining the logic and electrical, we again got a total of 2x bandwidth, continuing the doubling of bandwidth every generation. Moving from there to PCIe 4.0 and up to 6.0, each generation we doubled the data rate while still maintaining the overall mechanical backwards compatibility and software backwards compatibility. That lets the industry leverage and reuse its investment over these two decades while continuing the bandwidth doubling for six generations. Another aspect is the PCI Express architecture's PHY standard: what does it provide? It provides a single PHY standard. What that means is that the PHY is the same whether it's a very short channel or a very long channel. So that allows the same circuits to be used across many
Starting point is 00:09:15 types of applications from data center to mobile to embedded. And in each of those cases, it provides low power, it provides high performance. Starting with PCIe 5.0, it supports an alternate protocol support. What it means that the PHY doesn't have to use just the PCIe protocol. It can negotiate and it can be a bridge for other protocols, which then allows PCI-Fi to be used in non-PCI protocols and makes the use of this is much more versatile. So the doubling the bandwidth is happening as we just saw for sixth generation now with full backwards compatibility and is available in many different standardized form factors. In addition, PCIC provides a mature and a very robust compliance and interoperability program through which industry members basically
Starting point is 00:10:20 come together through various compliance workshops to ensure the proper tools are validated. And through that, the designs that are compliant to the spec there can interoperate easily. And this type of program enables an ecosystem maturity in a short period of time. So overall, we see that backwards compatibility with enable broad ecosystem, and it makes PCI architecture a low cost IO for diverse applications. Let's now focus on PCI 6.0. As we are working through this specification development, PCI expects this specification to be complete by the end of this year.
Starting point is 00:11:12 As we started this work, probably more than two and a half years ago, we set some expectations: as PCIe 6.0 doubles the bandwidth, it needs to leverage a lot of the industry ecosystem and installed base while bringing this doubling of the bandwidth. So the number one requirement is that the data rate needs to double, going to 64 gigatransfers per second. We move to PAM4 modulation. This is not new technology, but it is needed. And what is new, and what are the specific features in PCIe?
Starting point is 00:11:57 We will discuss those in the next few slides. The next biggest and hardest requirement was the latency. PCIe is a load-store I/O architecture; it cannot tolerate significant latency. We cannot allow the 100-nanosecond-plus latency that is often acceptable in networking and other industry-standard usages, and we will go through how we address that challenge. Bandwidth inefficiency due to the additional error correction must add less than 2% overhead relative to PCIe 5.0 across all payload sizes. Reliability in PCIe is not defined by raw bit error rate. Rather, it is defined by a parameter called failure in time, or FIT, which is expressed in errors per billion hours. So this number needs to be much, much less than one.
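To make the FIT metric concrete, here is a minimal sketch of the bookkeeping in Python; the per-bit silent-error probability below is an assumed placeholder to show the units, not a value from the specification.

```python
# FIT counts expected failures per one billion (1e9) device-hours. Here a
# "failure" means an error that escapes both correction and detection.
SECONDS_PER_HOUR = 3600.0

def fit(p_silent_error_per_bit: float, bit_rate_bps: float) -> float:
    """Expected silent failures per 1e9 hours for a single lane."""
    failures_per_second = p_silent_error_per_bit * bit_rate_bps
    return failures_per_second * SECONDS_PER_HOUR * 1e9

# Assumed example: a 64 GT/s lane (~64e9 bits/s) with a hypothetical
# 1e-27 per-bit silent-error probability.
print(fit(1e-27, 64e9))   # ~2.3e-4, i.e. "much, much less than one"
```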
Starting point is 00:13:15 Channel reach. If we want to keep the data center architecture and platform architecture similar and continue to leverage what we have, then we need a channel reach similar to PCIe 5.0. PCIe supports up to two retimers, so with the addition of retimers the reach essentially more than doubles, since there are basically two retimers.
Starting point is 00:13:47 Each retimer supports the full bandwidth and allows another full-length channel segment. So it provides a huge channel reach, and that is built into the architecture. Power efficiency needs to be better than PCIe 5.0. In addition, the low-power features remain similar to previous generations, such as the entry and exit latency for the L1 low-power state. But we have also added a new power state, L0p, which supports scaling power consumption with bandwidth usage without interrupting the traffic. So without introducing much latency,
Starting point is 00:14:31 it provides another low-power state. Plug and play: it's essentially fully backwards compatible, which means a PCIe 1.0 card can be plugged into a 6.0 system and it will work, and similarly an older system will work with a newer card at whatever maximum speed both the card and the system support. In addition to all of these key features, it is needless to say that it needs to be HVM (high-volume manufacturing) ready, needs to be cost effective, and has to be scalable to hundreds of lanes, particularly for server usage in a platform. So we need to make the right trade-offs to meet each of these key metrics, and that's what initiated all of the technical development work, the pathfinding, and the eventual standards that we are codifying in PCIe 6.0.
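To put the data rate and width scaling in perspective, here is a rough per-generation bandwidth tally; this is a sketch, using the commonly cited line-encoding efficiencies and ignoring FLIT framing overhead for 6.0.

```python
# Approximate usable bandwidth per lane and per x16 link, one direction.
GENERATIONS = {
    # generation: (GT/s per lane, line-encoding efficiency)
    "1.0": (2.5, 8 / 10),      # 8b/10b
    "2.0": (5.0, 8 / 10),      # 8b/10b
    "3.0": (8.0, 128 / 130),   # 128b/130b
    "4.0": (16.0, 128 / 130),
    "5.0": (32.0, 128 / 130),
    "6.0": (64.0, 1.0),        # PAM4 + FLIT mode; no 128b/130b line code
}

for gen, (gt_s, eff) in GENERATIONS.items():
    lane_gbits = gt_s * eff               # usable Gb/s per lane
    x16_gbytes = lane_gbits * 16 / 8      # GB/s for a x16 link
    print(f"PCIe {gen}: {lane_gbits:6.2f} Gb/s/lane, x16 ≈ {x16_gbytes:6.1f} GB/s")
```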
Starting point is 00:15:38 Let's now take a look at the basics of PAM4 signaling. This could be just a refresher for many of you. PAM4 signaling is pulse amplitude modulation with four voltage levels, as seen in the graph on the right: zero, one, two, three. With four voltage levels, we can basically encode two bits of information per symbol.
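As a small illustration of two bits per symbol, here is a sketch of a PAM4 bit-to-level mapping; the particular bit-pair order is a typical Gray mapping (adjacent levels differ in one bit), assumed here for illustration rather than copied from the base specification.

```python
# Two bits per PAM4 symbol; Gray-ordered so neighboring levels differ by one bit.
GRAY_TO_LEVEL = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}
LEVEL_TO_GRAY = {v: k for k, v in GRAY_TO_LEVEL.items()}

def pam4_encode(bits):
    """Pack a flat bit sequence into PAM4 levels, two bits per UI."""
    assert len(bits) % 2 == 0
    return [GRAY_TO_LEVEL[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

def pam4_decode(levels):
    """Unpack PAM4 levels back into the original bit sequence."""
    return [b for lvl in levels for b in LEVEL_TO_GRAY[lvl]]

data = [1, 0, 0, 1, 1, 1, 0, 0]
assert pam4_decode(pam4_encode(data)) == data
print(pam4_encode(data))   # [3, 1, 2, 0]
```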
Starting point is 00:16:06 What that does is, instead of the binary signaling we have used so far, where each UI represents either state zero or one, here we are putting two bits in each UI, so we can keep the same UI and the same Nyquist frequency as 32 gigatransfers per second. Why is that necessary? At 32 gigatransfers per second in PCIe 5.0, we support a pad-to-pad loss of 36 dB at 16 gigahertz. If we try to go to 64 gigatransfers per second with NRZ binary signaling, then that loss becomes more than 70 dB for the same channels. At such a high loss, the receiver noise would be more than the amplitude of the signal arriving at the receiver. So that's a fundamental limit: without extreme measures, such loss cannot be supported. That's why we moved to PAM4 signaling, as other industry standards for 50-plus gigabit per second have also done. But what we lose is that now you can
Starting point is 00:17:27 see we have three eyes. Each of these level from zero to one or one to two or two to three now have one third the voltage than a binary signaling would have. So this reduced voltage swing and also the transition between various of those voltage levels degrades both the overall signal fidelity as well as the timing. So even though it solves the fundamental loss barrier, it introduces significant challenges because of this sensitivity to noise. And any noise that could be crosstalk, reflection, or device noise, overcoming those and still provide a reliable signal detection is one of the fundamental challenges. To address some of them, we do mandatory gray coding to minimize errors in UI. We also make precoding mandatory to minimize DFE burst errors,
Starting point is 00:18:37 which is this converting the two bit error. So the voltage level, as we have already discussed at TX and RX defines the encoding. So now in the next few slides, I will focus on how we are going to achieve a solution that is low latency, but is still with PAM4 signaling. As we know in other standards, PAM4 signaling needs strong FEC,
Starting point is 00:19:10 which then comes with latency increase as well as complexity and area and power. So our goal is to address that. So some of the key parameters of interest are FBER, which means first bit error rate, which is the probability of the first bit error occurring at the receiver before any burst error happens. And then error correlation within lane and across lanes is also very important.
Starting point is 00:19:42 That's within lane, it could be a DFE burst error, across lanes could be happening due to a common power supply noise and other issues that can give those errors. So BER eventually is going to be depending on the first pit error rate before any burst error is considered and the error correlation in a lane or across lanes.
Starting point is 00:20:11 So how do we handle errors and then what metrics do we use? So the two mechanism to correct errors, one is correction through FEC, forward error correction, but challenge is latency and complexity increases exponentially with the number of symbols corrected. So the other mechanism is detection of errors by CRC. Link level retry is a strength of PCIe. It's unique. It has a latency of 100 nanosecond. So we try to combine two things. One is detection is linear. Latency, complexity, and bandwidth overheads are linear. So we make the CRC much more robust. We have a 32-bit CRC in PCIe 5.0, and for 6.0, we make it a 64-bit CRC. And then comes the probability of retry
Starting point is 00:21:08 and the failure in time that is our metric. So instead of, again, focusing on minimizing par lane effective BER with a very robust and heavy-duty FEC, we are looking at the metric of failure in time, which provides us similar reliability as previous gens of PCIE, but relieves us from the requirement of having a very low effective beta rate and the high FEC latency. So what we try to do is take advantage of the PCIe link level retry mechanism and strong detection through CRC, and then combine those two to implement a low-latency light FEC. So that summarizes in this slide that go for a lightweight FEC, use a low-latency retry mechanism, apply or achieve a lower probability of retrying so that the overall latency remains low.
Starting point is 00:22:35 The graphs on the right side, there's a lot of data here, but the main thing I want to point out that this graph here, which on the left side of y-axis is giving probability and BER, on the right side, it's giving fit rate. So our goal is, again, to go to fit rate close to zero or much, much less than one. And the others others we are flexible. So the double symbol correct can give us very low effective beta rate, but it is expensive as we discussed and high latency.
Starting point is 00:23:16 On the other hand, single symbol correct here, even with rate try probability of 10 to the minus 5 and FBER of 10 to the minus 6, we can achieve fit of much, much less than 1. And we take advantage of rate try latency of 100 nanosecond. But instead of going for a 100 nanosecond latency FEC, we make use of link level retry with 100 nanosecond and keeping the probability of that retry very low and overall giving the failure in time in a range that is compatible with the same reliability that we have always enjoyed with PCIe. Getting into a little bit more details. So FEC needs fixed set of bytes. So because of that, a fleet mode is introduced in PCIe 6.0. Fleet stands for Flow Control Unit.
Starting point is 00:24:26 So the correction, detection, and retry, all of them are done at the FLIT level. So to enable this mode, now lower data rates should also use the same FLIT once this mode is enabled. And FLIT size is designed to be 256 bytes out of that um 236 bytes of tlp transaction layer packet six bytes of dlp eight by sub crc which is 64-bit and then six bytes of
Starting point is 00:24:59 fec since now it is a fleet based there's no header, no framing token, so TLP is now reformatted and no transaction or data link layer level CRC, so everything is at a fleet level. of overhead actually improves the efficiency. So net loss in overall bandwidth is much more reduced. And with this and a single symbol correct FEC and a light FEC and three-way interleaved FEC, we achieved a fleet latency that is in the order of two nanoseconds for by 16 port and 4 nanoseconds for by 8, 8 nanoseconds for by 4, and 16 nanoseconds for by 2, and 32 nanoseconds for by 1. So overall, the key idea is optimized by retry error fleet only with the existing go back and retry. So this is a mechanism PCIE already had, minimizing the retry probability
Starting point is 00:26:16 and making use of that link level retry we are enabling. But it provides some challenges on the electrical side. So let me address those. As we have seen, to achieve the FIT much, much less than one, the first beta rate FBER needs to be less than 10 to the minus six. This is almost two orders of magnitude better BER it needs to be compared to if we had to, if we could do a heavyweight and more robust FEC. So this increases the electrical challenges. So we have made a lot of improvements and compromises. The pad-to-pad channel loss in PCIe 5.0 is 36 dB. We estimated that that would be very difficult for achieving 10 to the minus 6 BER. So we go for a pad-to-pad channel loss less than 32 dB at 16 GHz.
Starting point is 00:27:20 We significantly reduce the channel crosstalk and crosstalk within package and the reflection at various interfaces. We improved significantly the reference clock and CDR. Jitter reduction compared to PCI 5.0 is about 2x. Similarly, to mimic the receiver improvement, we developed a reference equalization scheme that includes transmitter, second precursor, improved CTLE peaking and bandwidth, and a 16-tap DFE instead of a 3-tap DFE. In addition, we need to minimize the burst error probability. So we mandate PAM4 precoding and gray coding, and we also limit the DFE tap coefficients. All of the taps, if, for example,
Starting point is 00:28:18 if the first tap is now minimized so that the tap coefficient on the DFE tap 1 relative to the cursor needs to be less than 0.55. So that theoretically limits or lowers the probability of DFE burst error. So this type of electrical constraints will ensure lower probability of correlated errors and hence will allow us to use a very lightweight single-symbol correct three-way interleave FSC. Going a little bit more details on the electrical side. So a very common topology that we study for PCIe spec development is a so-called one connector CAM topology.
Starting point is 00:29:12 What we are showing here is a system host board with a non-root complex, which could be a storage device or could be a graphics or GPU and so on, it has to be able to support at least about 13 inches of PCB routing because sometimes that is the distance physically one has to go to without the use of a retimer. So that's a tough constraint to meet. And what we are showing in this table is the PCI 5.0, that total loss budget is 36 dB. Root complex, which could be the CPU or a switch, takes about 9 dB in 5.0. And the add-in card, which doesn't include the connector here, is about 9.5 dB. And then connector plus routing, and then the board is 17.5 dB. Since we are losing 4 dB budget from die-pad to pad, we tighten the root complex package loss by a dB and adding card loss by another dB.
Starting point is 00:30:48 So that leaves about a 2 dB gap for the system routing. So we can see that by improving the PCB board materials or the stack up, we probably need about 10% improvement in the PCB loss to be able to support similar channel reaches 5.0. But this allowed us to really support very close to PCA 5.0 like channel. And in that way, we would be able to leverage the fundamental infrastructure of the platforms and data centers and still move to doubling the bandwidth in PCI 6.0 time. So let me summarize the key electrical challenges in PCI 6.0. As we mentioned, it's a PAM4 signaling. It requires FEC.
Starting point is 00:31:43 We must improve the raw bit error rate or first bit error rate before any correction and before any DFE burst error. And that needs to be at least at a minus six, which is very difficult to achieve. Pad-to-pad loss is also reduced by 4 dB to minus 32 dB at 16 gigahertz. And reference package models improved to mimic the actual implementation, three to six dB, so it's 1.5 to 2x improvement in reflection and crosstalk. Clock quality has significantly improved.
Starting point is 00:32:21 Random jitter is now 100 femtosecond. The TX equalization has been revamped with a second TX precursor and a new preset table have been defined at 64 gigatransfer per second. We have made TX precoding and gray coding mandatory. Improvement in silicon jitter is about 2x. The receiver has been improved with an improved CTLE and 16-tap DFE. Also, we expect even more improvement in the receiver quality by defining a receiver eye mask of the outer eye, top eye, 10% UI, and only 6 millivolt, which means a very weak signal has to be detected by the receiver. So it puts stringent constraint on the quality of the receiver that we must have.
Starting point is 00:33:22 Overall, by making all these improvements in electrical layer, logical layer, and protocol layer, we are able to achieve a low latency solution, still better power efficiency than PCI 5.0 and supporting PCI 5.0 like channel reach. Of course, channel quality has to improve, material has to improve, and all of those improvements must happen. But we should be able to still bring
Starting point is 00:33:53 this PCI 6.0 technology in a fully backwards compatible way in the infrastructure that we have in many applications and many form factors that are going to be in the industry for a long time to come. In the next few slides, I will not probably go through too much details, but we show a glimpse of the type of analysis work that we do to assess that can we reach the channel reach that are similar to PCA 5.0 with all the decisions and requirements you have made in the specification.
Starting point is 00:34:35 So we take again a topology, which is so-called one connector cam topology, where a card, PCA card is communicating with the root complex. So we make an assumption that the transmitter is the spec-defined transmitter, receiver is the spec-defined receiver, and then pin-to-pin model is where we model the PCB routing on both system and the add-in card. We also represent the package models by the reference package models that are defined in the specification. And then we run the simulation to see whether we are achieving our goal. So this slide shows the eye height versus the system routing length.
Starting point is 00:35:26 And the PCB length is from 4 to 14 inches. CTLE index here shows by exhaustively sweeping the indices that are corresponding to various CTLE cards defined in the specification. And wherever the maximum area is obtained is considered the optimal and achievable. So what this red line shows is the spec limit for a passing condition. Eye height needs to be higher than that. So we can see about 13 inches is where we can achieve some positive solution, which is higher than six millivolts. Similarly, while we go higher and higher loss, DFE tap is kept under control to reduce the burst error. So the H1, H0 is the DFE tap coefficient. First one divided by the cursor is limited to 0.55.
Starting point is 00:36:28 So that minimizes the DFE per sterler probability. And similarly here, we show the eye width. The top eye width spec limit is 10% UI. And we can again see up to 13 inches, we have more than 10% DUI margin. So this kind of demonstrates that with the spec-defined transmitter-receiver channel and still keeping the DFE limit to extend to control the DFE burst error, we can achieve 13-inch system routing, 4-inch add-in card routing without a retimer.
Starting point is 00:37:05 And hence, we can support a wide volume of usages of PCIe intact and bring doubling of the data rate and doubling of the bandwidth. So this would be basically the last topic or slide. I want to give you a little bit of an idea that how PCI-SIG offers a compliance process. Once the specs are defined, like the basis specification and CAM specification, based on those, we derive a compliance
Starting point is 00:37:43 and interoperability test specification that provides all the details of how the test should be done, what are the pass-fail criteria, what equipment to be used, and then what tools to be used. All those collateral then moves on. And then PCI-E SIG holds about four workshops, compliance workshops, where industry members bring their systems, cards, tools, test and measurement vendors bring their capabilities. And this process continues to develop some mature and robust program for the compliance. And at a certain point, after the specifications are developed, there would be something called integrators list. If certain number of tests are passed at a protocol level, electrical, logical level, then those are considered essentially
Starting point is 00:38:43 PCA interoperable card or systems. And what this process guarantees that if those systems and cards that pass, if they are put together, they will work flawlessly. And that's the program provides a path to design compliance. And these are basically what gives a mature ecosystem across many usages of the PCI technology. So to summarize, in PCI 6.0, this architecture can meet the needs of storage interconnect solution in the foreseeable future. We see that this bandwidth provides not only the data rate scaling, this scalable width, this is another knob it has, it provides everything in a low power, low latency, high volume, and cost-effective way. It doubles the bandwidth with full backwards compatibility. We have seen and reviewed 64-bit CRC, a light FVC, and first bit error rate less than 10 to the minus six, and using link level retry with low retry probability
Starting point is 00:40:06 enables a low latency solution and highly reliable with fit much, much less than one solution. And we are also able to support PCA 5.0 like channel reach with many improvements in circuits and channels. So I am truly excited. I think the PCIe 6.0 is going to meet not only all the demands that storage will be putting, but also the needs of many other usages and architectures.
Starting point is 00:40:41 And because of its widespread use across many segments of the industry, it will be providing a mature ecosystem, very cost-effective solution for all storage demands in the future. Thank you very much. Really appreciate your presence in this presentation and in this session. If you have any questions, please pass those on. I will try to answer them. And again, really appreciate your presence.
Starting point is 00:41:16 Hopefully, you got a flavor of what is coming in the PCI 6.0 technology and in the overall PCI technology. And you'll consider their usage and effective usage in many applications that you drive and bring to the industry. Thank you so much again. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers
Starting point is 00:41:41 mailing list by sending an email to developers-subscribe at sneha.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the storage developer conference, visit www.storagedeveloper.org.
