Storage Developer Conference - #195: PCIe® 6.0 Specification and Beyond

Episode Date: September 5, 2023

...

Transcript
Starting point is 00:00:00 Hello, this is Bill Martin, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 195. Greetings, everyone. We'll talk about PCIe 6.0 and also the journey beyond that. PCI Express is a multi-generational journey. We just completed the 6.0 specification, and what is in it for
Starting point is 00:00:53 different applications. I will touch on that. My name is Debendra Das Sharma. We'll start with an introduction where I will talk about how PCI Express as a technology has evolved, then do a little deep dive into PCI Express 6.0. I won't cover 7.0 separately; it will be very similar to 6.0, so pretty much you're going to get a preview of what it looks like. Then PCI Express technology and how it relates to storage, form factors, compliance, and then conclusions. So anytime you have a question, just raise your hand,
Starting point is 00:01:32 shout the question, and I'll take it. Want to make it more informal. So this is the evolution of PCI Express, right? We started off back in 2003 with 2.5 gig. At that time, it was 8b/10b encoding. Pretty much, on average, every three years we have been doubling the bandwidth or the data rate with full backward compatibility. You know, that's the average cadence.
Starting point is 00:02:00 Sometimes it's a little longer. Sometimes it's a little shorter, but, you know, on average, give or take. And you can see that throughout this journey: 2.5 gig in 2003, 5 gig in 2007, 8 gig in 2010. And note that from five to eight, even though it's not doubling the data rate, because of the encoding change from 8b/10b to 128b/130b,
Starting point is 00:02:19 you get double the bandwidth, which is basically what we wanted to do at that time, because if we went with 10 gig, we would have left the server channels behind; it would have taken a lot longer for the materials and everything to catch up. After that, you will see that the progression keeps going: 16 gig with the same 128b/130b encoding, then 32 gig. Had we not made the encoding change, this would have been 10, 20, 40, and then 80, which would be getting harder and harder, right? You probably would have had to do PAM4 back at 40,
Starting point is 00:02:52 rather than waiting, as we do, till it's absolutely necessary for us to make a huge transition. So this is a big transition, and we'll see why. PCI Express, as we all know, has become the ubiquitous IO across all platforms, right? Anywhere you can imagine, it's the interconnect that connects everything, and of late you will see that alternate protocols like Compute Express Link are using PCI Express as the fundamental infrastructure.
Starting point is 00:03:27 And it's one stack, one silicon that covers all the segments, but multiple form factors, as we will see. The promise of backwards compatibility is that you can take a x16 Gen 5 device and it will interoperate with a x1 Gen 1 device, right? Or vice versa, right? You can do any permutation; they will just interoperate.
Starting point is 00:03:50 And this is not the end of the journey, clearly. You know, anytime we start such an undertaking, the general rule of thumb is that three generations spanning a decade makes a very successful technology. By that measure, this is more than successful, right, in a good way. So more than two decades, now on to the seventh generation,
Starting point is 00:04:14 pretty impressive, right? And, you know, the entire industry, with 900-plus member companies that make up the PCI-SIG, is behind the technology, and lots of innovations happen. We announced the PCI Express 7.0 specification, and, no prizes for guessing, after 64, it will be 128 gig. Okay. Double, right? I mean, it's a very predictable cadence with which
Starting point is 00:04:40 we move with full backward compatibility. And, you know, it's going to be 128 gig, the same 1b/1b flit mode as 64 gig, right? And, you know, we make the transition once. Once we make the transition, we'll take advantage of it for as long as we can. And 128 giga transfers per second means that you get 512 gigabytes per second bidirectionally for a x16 link. We use the same PAM4 signaling. And of course, right now it's very early
Starting point is 00:05:15 days. We think we can get there. Of course, like anything else, as you go through more and more details, things could change, but the engineering judgment leading up to this is that it is feasible. There is a lot of hard work; by no means is it a slam dunk, but it is feasible, right? We are focusing on the channel parameters and the reach. The key will be to deliver the low latency and high reliability targets, but most importantly with the power efficiency and the cost effectiveness, right? And hundreds of lanes of PCI Express
Starting point is 00:05:48 that exist in a platform. So this is the mix of the different data rate and the different width. And these are the bandwidth numbers that you get. And as I said, the promise of backward compatibility is that anything can interoperate with anything else. Of course, based on the least common denominator, right?
Starting point is 00:06:07 You cannot have a x2 connected to a x16 and expect to get x16 performance. It'll just give you the x2, but it will work. So what are the bandwidth drivers, right? What are the ones that are causing us to lead these speed transitions? Speeds and feeds. On the device side, you've got a lot of devices that are demanding a high amount of bandwidth, right? Networking started off with 800 gig. That will be with PCIe Gen 5 x16; that becomes a networking solution.
Starting point is 00:06:47 Gen 6, that is accelerators, FPGAs, ASICs, memory, storage, alternate protocols like CXL. PCI Express so far had been a consumer of bandwidth. A good consumer, but nevertheless a consumer of bandwidth. And what I mean by that is anytime you plug in a device, whether it is a networking device or a storage device, it consumes the memory bandwidth.
Starting point is 00:07:17 And if you look into the platform today, the memory bandwidth is the bottleneck, right? A lot of the platforms, they have a lot of PCI Express lanes. You just cannot feed it because there is not enough memory bandwidth available. With the move of memory onto CXL, which is based on PCIe, it's now a producer of that bandwidth,
Starting point is 00:07:39 which means that, with all the demand that you get, not only are you a consumer, but you're also a producer, which basically puts all the pressure in terms of having to deliver more bandwidth. Because, you know, as the socket's compute capability is growing and people are doing not just a single type of compute, they're doing heterogeneous compute, right? All kinds of computes are coming up. There is a need to deliver more memory bandwidth, more IO bandwidth. So it's the virtuous cycle of innovation that's going on.
Starting point is 00:08:09 And the key to delivering all of this is power-efficient bandwidth, right? That's the big requirement coming to PCI Express. Now, while that's good and all of that, at the same time you need to be cognizant of what has made this technology so successful up until
Starting point is 00:08:31 now. And what has made the technology so successful up until now is that it's a cost-effective solution, right? If you think about something on the motherboard with hundreds of lanes, it's standard materials, not exotic materials. It's volume platforms, shipping in billions of devices worldwide, right? So it's the volume platform, and at the same time performance has never been compromised. We are delivering the performance with the best power efficiency that the industry can expect. Of course, hundreds of lanes. Scalability is the other key thing.
Starting point is 00:09:16 With all of these, it's a tightrope to walk for the technology. As I said, these are the new usage models, right? Storage, cloud, AI, analytics, edge, you name it. PCI Express is there everywhere, whether it is as load-store IO or whether it is to deliver memory bandwidth, right? I mean, it's everywhere.
Starting point is 00:09:36 With all of that, the demand for delivering the bandwidth within that same power, cost, and all of those envelopes continues to be there. So as I said earlier, I'll talk mostly about PCI Express 6.0, but these are also going to be the requirements, the same requirements, for 7.0,
Starting point is 00:10:01 except the data rate will be double, right? Other than that, it's pretty much the same. In other words, and this has been true for a while now: before we start working on any technology, think of these as the guardrails that we put up, saying these are the things that must be met.
Starting point is 00:10:22 If we cannot meet them, then of course you have to go back and have a discussion, but these are the guardrails, right? And if you look deeply into it, none of these requirements are really negotiable. They all must be met. So, data rate: 64 gig. You know, if we don't double the bandwidth, it's not worth the time for a lot of people to invest; you don't go for a 10% frequency increase, right? It's a lot of heavy lift, so you have to take that step function.
Starting point is 00:10:51 Latency-wise: because these are load-store interconnects, right? These are not networking kinds of interconnects. Every nanosecond of latency matters for a lot of applications, especially if you're doing memory access and things like that. Latency is important, super important.
Starting point is 00:11:12 So when we do PAM4, of course we knew that, as we will see in a minute, the bit error rate will be extremely high. Given all of that, we are going to do forward error correction, and that will increase latency. What we said is that the latency adder
Starting point is 00:11:30 should ideally be zero. If not, it has to be less than single-digit nanoseconds, transmit plus receive, right? Those are the guardrails. And, you know, we clearly said that we cannot do the 100-plus nanosecond FEC latency that networking standards have done. It doesn't work for us, because a lot of the load-to-use latencies,
Starting point is 00:11:52 if you look at platforms today, are less than that by themselves. So if you add 100 nanoseconds on top of that, you would not make the transition; you would stay at the 32 gig data rate. Bandwidth inefficiency. Another way of saying,
Starting point is 00:12:10 what's the bandwidth efficiency? You know, both are the same thing. It has to be less than 2% adder. And by that, what I mean is that if I was getting bandwidth X at 32 gig, when I go to 64 gig, I must get at least 1.98x. Reliability.
Starting point is 00:12:30 Clearly, it's not a negotiable thing. We measure reliability in FIT, which is failure in time: how many failures you are going to get in a billion hours. And we want it to be very close to zero. In reality, nothing is zero, right? There is always a probability of a failure happening. But we want that probability of failure in a x16 link, as measured in FIT, to be as close to zero as you can imagine. Channel reach: again, volume platforms. People don't change their platforms just because. You can't tell people, hey, earlier on the motherboard the PCI slots were 10, 12 inches away, now put them two inches away. It doesn't work, right? So the channel reach has to be pretty much similar
Starting point is 00:13:17 to what it was there before. Power efficiency: again, better than 5.0. And you will notice that this is a soft requirement, in the sense that ideally the power efficiency should be half of the prior generation's. In other words, if you spend 10 picojoules per bit with Gen 5, you should be spending 5 picojoules per bit when you're delivering 64 gig. That way the power number will be flat. In reality, what happens is that initially, when people come up with their implementations, the power numbers are higher, and then it takes a generation or two before they can tweak it and it starts coming down, right? And, you know, in order to make up for that, we did a bunch of low-power enhancements and, you know,
Starting point is 00:13:59 we introduced a new low-power state so that you can modulate the link width and all, but that's the goal, right? And of course, if your power efficiency is worse than the previous generation's, then it's clearly moving in the wrong direction. The other thing is plug and play, right? Full backward compatibility. Again, it's not an easy thing to move away from, because otherwise it causes a lot more issues and challenges. And of course, it has to be high-volume-manufacturing ready.
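The picojoule-per-bit arithmetic above is easy to check: power is energy per bit times bits per second, so halving pJ/bit while doubling the data rate keeps the wattage flat. A small sketch, using the talk's illustrative 10 pJ/bit and 5 pJ/bit figures rather than measured silicon:

```python
def link_power_watts(pj_per_bit: float, gbytes_per_s: float) -> float:
    """Power for one direction of a link: energy/bit * bits/second."""
    bits_per_second = gbytes_per_s * 8e9   # GB/s -> bits/s
    return pj_per_bit * 1e-12 * bits_per_second

# Gen 5 x16 at 10 pJ/bit vs Gen 6 x16 at 5 pJ/bit: same total power.
gen5_watts = link_power_watts(10, 64)    # 64 GB/s per direction
gen6_watts = link_power_watts(5, 128)    # 128 GB/s per direction
```

If efficiency only improves partway, total link power grows with each generation, which is why the talk treats this one as a soft target backed up by low-power states.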
Starting point is 00:14:32 It cannot be a niche kind of a technology, right? Very cost effective, scalable to hundreds of lanes in a platform. So these are the right trade-offs, and we have to meet each and every one of these metrics. So now, what about PAM4 signaling at 64 gig? Well, unlike NRZ, which is non-return-to-zero, there are three eyes, not one eye. And by that, what we mean is that we are keeping the unit interval the same as 32 gig. It's the same frequency, if you want to think about it; the Nyquist
Starting point is 00:15:02 frequency is 32 gig, but instead of sending a zero or a one, we are sending two bits, right? That's why it is PAM4: four-level pulse amplitude modulation. And basically what happens with that is, because you are squeezing in three eyes rather than one, you're getting reduced eye height and also reduced eye width. And that increases the susceptibility to errors. You're much more susceptible to errors happening, right? In other
Starting point is 00:15:32 words, a small amount of voltage perturbation can move you from one level to the other. So as a result, you will end up getting more errors. And that's fine; that's something you cannot avoid. You just need to figure out how to work in spite of it. We did other things like Gray coding to minimize the number of errors; these are well-known techniques. Pre-coding to minimize the number of errors in a burst, and all of those; we'll talk some more about that. So given that, we're getting a lot of errors. So far, for the last five generations, PCI Express had been at a bit error rate of what is known as
Starting point is 00:16:09 10 to the power minus 12, which means that every 10 to the power 12 bits, you can get one bit in error. With PAM4, those numbers are many, many orders of magnitude worse, right? Just to give you an idea, when we started off,
Starting point is 00:16:29 a lot of people said, hey, should we go with 10 to the power minus four, like networking has done? And I'll show you the numbers, some of the analysis that we did. So, bottom line: errors are several orders of magnitude worse than what we are used to. And not only that, there are two other things that are at play here.
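The Gray coding mentioned a moment ago is what keeps a single level misjudgment from becoming a multi-bit error: adjacent PAM4 voltage levels encode bit pairs that differ in exactly one bit. A sketch using the standard Gray order (the spec's exact level assignment may differ):

```python
# Gray-coded PAM4: 2 bits per unit interval, mapped so that
# neighboring voltage levels differ in exactly one bit.
BITS_TO_LEVEL = {0b00: 0, 0b01: 1, 0b11: 2, 0b10: 3}
LEVEL_TO_BITS = {lvl: bits for bits, lvl in BITS_TO_LEVEL.items()}

def bit_errors_on_slip(bits: int, slip: int) -> int:
    """Bit errors caused when the receiver misreads the level by `slip`."""
    level = BITS_TO_LEVEL[bits]
    misread = min(3, max(0, level + slip))   # clamp to the 4 valid levels
    return bin(bits ^ LEVEL_TO_BITS[misread]).count("1")

# Every +/-1 level slip, the dominant PAM4 error, costs exactly one bit.
```

With a naive binary mapping (00, 01, 10, 11 in level order), a slip between levels 1 and 2 would flip both bits; Gray coding halves the bit errors for the same number of level errors.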
Starting point is 00:16:53 In order to do lower latency, most people would most likely end up doing things like what we call a DFE, a decision feedback equalizer, or a continuous-time linear equalizer, those kinds of designs. Which means that an error that happens in a bit is most likely going to propagate to the next few bits, because the way you determine what this bit is depends on what the values of the previous few bits were. So errors tend to propagate, right, with those kinds of implementations. That's what is known as a burst error. And you will see here that we define something called a first error: what is the probability of a first error happening? And then you're going to get a burst. And not only that, we are also cognizant of the fact that there can be some common events that will lead to a correlation across lanes, right? And those are the types of errors that we need to
Starting point is 00:17:43 be careful about. So when we give a number, it's based on the FBER, the first bit error rate. The actual number of errors will be more than that; all of these errors count as a single incidence of an FBER. So I'll up-level this discussion and say, okay, what do we know? We know that we are going to get a lot of errors, right? What do we do if we get a lot of errors? Today, in PCI Express, up until before Gen 6, you get an error once in 10 to the power 12 bits, which is almost like once in a blue moon,
Starting point is 00:18:15 if you get an error, there is a CRC: you detect the error and you ask the link partner to replay, saying, hey, that packet number 10 you sent me doesn't look good, can you send from 10 onwards, right? And it will replay and off you go. With these many errors, if you start asking for a replay, you're going to waste a lot of bandwidth, because you're going to constantly be asking the link partner to send you replays. So we need some form of forward error correction. And for that, we did a little study here. You know, we assumed a 256 byte transfer,
Starting point is 00:18:50 and we said that if we could always correct any single FBER instance, what would things look like? What would be the probability of asking for a replay? If we could correct two instances, what would the probability of a replay be? If we did three, what would it be? And, of course, there are bigger numbers
Starting point is 00:19:16 we tried, but it doesn't matter; I'll make the point even with three of these, right? And what you see here is the raw FBER rate: 10 to the power minus 4, minus 5, minus 6. What you notice is that even a single instance of correction gets you to a very reasonable replay rate, somewhere around 10 to the power minus 6 or so, if you are going to get a 10 to the power minus 6 FBER. But notice that the same one with 10 to the power minus 4, your replay probability is somewhere around 10 to the power minus 2 or minus 3, which is a pretty unacceptable number. So what we said is: we are going to go aggressive, go for this point, and just pick a single-symbol-correct ECC, an FEC mechanism. Why is that? Because with a single-symbol correct, we are going to back it with a very strong CRC, because we want to make sure that nothing gets past that CRC. So correction consists of two things, right? One of them is, of course, your FEC, which will correct some, not all, but it corrects reasonably enough. Then you've got CRC that will detect the rest, right?
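The shape of that study can be reproduced with a toy model: treat first-error instances as independent per-bit events at rate p over a 2048-bit (256-byte) flit, so a FEC that corrects t instances forces a replay only when more than t land in one flit. The independence assumption and the bit-level granularity are simplifications, so the numbers are illustrative:

```python
import math

def replay_probability(fber: float, n_bits: int = 2048, correctable: int = 1) -> float:
    """P(more than `correctable` error instances in one flit),
    i.e. the binomial tail beyond what the FEC can fix."""
    ok = sum(
        math.comb(n_bits, k) * fber**k * (1 - fber) ** (n_bits - k)
        for k in range(correctable + 1)
    )
    return 1.0 - ok

# At FBER 1e-6, single-instance correction leaves a ~1e-6 replay rate;
# at FBER 1e-4, the same correction leaves a ~1e-2 replay rate.
```

That gap is the argument in the talk: a 1e-6 FBER plus lightweight single-correct FEC keeps replays rare, while the 1e-4 that networking FECs are designed around would replay far too often.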
Starting point is 00:20:34 It has to really be good at detecting. Once you detect, then you can ask for a replay, right? So you got two ways of correcting things. The replay mechanism already existed. Now we are adding to it the FEC mechanism. Now, why do I want a single symbol correct FEC? The reason is that error correction, it's an exponential problem. So if you think about what does it take to correct an error? If you have n number of symbols and you can correct, let's say, up to x symbols, how many permutations do you have?
Starting point is 00:21:08 You've got to choose which x symbols, n choose x, which in itself is exponential. Within each of those x symbols, you've got 2 to the power 8 possibilities, because bit errors can happen one bit here, one bit there, if you assume your symbol is eight bits. Well, technically, we'll argue it's 2 to the power 8 minus 1; I'm just trying to make it easy. So every symbol can have 2 to the power 8 possible ways in which errors can happen, and there are x symbols, so raise that to the power x. So not only do you have n choose x, you've got 2 to the power 8 to the power x. There are that
Starting point is 00:21:49 x errors. If it is x minus 1, you can add that and you can keep doing that math, right? And that's a very difficult thing. And that's the reason why all of those other standards, they pay such a huge penalty in terms of latency. It's because that problem is an exponential problem. And what do we know with one? It becomes linear. So that's the reason why we want to stick to single symbol correct, as long as it's putting me in the right ballpark that I don't have to replay that often. CRC, I wanted a very strong CRC, but what do we know about CRC? CRC is a linear problem. And by that, what I mean is that I generate the syndrome. It's a division thing. Fine. I've been doing CRC, which is a division.
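That n-choose-x blow-up is easy to make concrete by counting the distinct ways x of n eight-bit symbols can be corrupted (the 128-symbol codeword size here is just an example, not a spec value):

```python
import math

def error_patterns(n_symbols: int, x: int, bits_per_symbol: int = 8) -> int:
    """Ways exactly x of n symbols can be wrong: choose which symbols,
    times (2^8 - 1) possible wrong values for each chosen symbol."""
    return math.comb(n_symbols, x) * (2**bits_per_symbol - 1) ** x

one = error_patterns(128, 1)   # 128 * 255 = 32,640 candidates
two = error_patterns(128, 2)   # already ~5.3e8 candidates
```

Going from correcting one symbol to two multiplies the candidate space by four orders of magnitude, which is the latency penalty the single-symbol-correct choice avoids.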
Starting point is 00:22:37 No problem. We all know how to do that. But at the end, how do I know something is correct or wrong? All that I look for is, is it zero or is it non-zero? I'm not trying to look at that syndrome and trying to figure out, hey, which of these n choose x 2 to the power 8 to the power x is wrong. I don't care. All that I know is something is not right. If something is not right, I just ask for a replay, right? So lightweight FEC, but very strong CRC, and a maniacal focus on keeping latency low. That's the key. Those are the key things, right, that PCI Express has. And again, that's critical because of the nature of the load store memory-based kind of interconnect
Starting point is 00:23:19 that PCI Express, as well as similar load-store interconnect protocols, are. So now, what do we do when we have forward error correction? Forward error correction works on a fixed number of bytes. I suppose I could have variable bytes, but that makes my life even more difficult, right? So if I have a fixed number of bytes,
Starting point is 00:23:54 then that is the unit on which I need to correct, which means that I need to fix that as the unit on which the FEC works. So we define something called a flit. And while a flit is new for PCI Express, it's nothing new in general, right?
Starting point is 00:24:09 All the coherent links have used flits for ages. So the FEC works on that flit, and we chose 256 bytes as the flit size. So the error correction is happening at the flit level, which means that naturally I want to do my CRC at the same granularity at which I do the correction. Otherwise, it just becomes more complicated, right? You do FEC somewhere, and as far as CRC is concerned, you belong to five or six different groups, right? So if you do the FEC at the flit level, it makes sense that you do the CRC check there. If you do
Starting point is 00:24:42 the CRC check there, it makes sense that you replay at the flit level, as opposed to replaying at the packet level, the transaction layer packet level, that PCI Express did before. So that's basically where we went, right? And the lower data rates will also use the same flit, because you can move between data rates dynamically. So you cannot have one type of replay
Starting point is 00:25:02 if you are in the 64 gig data rate and then another type when you dynamically change to, let's say, the 2.5 gig data rate. So once you have negotiated, it's always the same way, again for simplicity. So we picked the flit size as 256 bytes, as shown in the picture here. 236 of these bytes, 0 through 235, are what are known as TLPs, or transaction layer packets; that's what is reserved for TLPs. 6 bytes are for data link layer payload. Then you will see there are 8 bytes of CRC that covers the rest.
Starting point is 00:25:35 And then you will see that there are 6 bytes of ECC. Actually, those 6 bytes of ECC are 3 groups of 2 bytes of ECC each. And what happens is that you will notice that those three groups are these three colors. So what happens with those three colors is that, you know, each of them is a separate ECC group. So we'll see that even if there is an error happening that goes across them, as long as it is less than 24 bits
Starting point is 00:26:03 or three symbols, three bytes, long in a burst, you are going to be able to correct that error. So that's the reason why we have done the arrangement in that particular way. And in the process, we removed the sync header
Starting point is 00:26:25 and we removed all the framing tokens, because everything has a particular slot mechanism where the TLPs go in a particular place and the DLLPs go in a particular place. So there is no need to say, hey, this is a TLP type, this is a DLLP type; the locations are fixed. So we are going to get those benefits, as you will see, and no per-TLP and per-DLLP CRC.
Starting point is 00:26:47 So yes, we are spending a lot more, eight bytes of CRC, which is a lot, but that's for a good reason: very strong detection. And we'll see that these actually help in terms of the bandwidth. Even though we are paying more, because we are amortizing it across an entire 256-byte flit, we really come out ahead.
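The three-group ECC arrangement described above can be sketched with a round-robin interleave. The real byte-to-group mapping in the spec may differ; the point is that consecutive flit bytes land in different groups, so a short burst never hits one group twice:

```python
def ecc_group(byte_index: int, n_groups: int = 3) -> int:
    """Round-robin interleave: flit byte i belongs to group i mod 3."""
    return byte_index % n_groups

def burst_correctable(start: int, length: int, n_groups: int = 3) -> bool:
    """A burst is correctable iff each single-symbol-correct ECC group
    sees at most one errored byte."""
    hits = [0] * n_groups
    for i in range(start, start + length):
        hits[ecc_group(i, n_groups)] += 1
    return max(hits) <= 1

# Any burst up to 3 bytes (24 bits) lands once per group and is fixed;
# a 4-byte burst hits some group twice and must fall back to replay.
```

This is how a trio of single-symbol-correct codes handles the DFE-style burst errors discussed earlier without paying for a multi-symbol-correct decoder.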
Starting point is 00:27:10 So what do the retry probability and the FIT and everything else look like? Remember, my FBER rate is 10 to the power minus six. I'm assuming the retry time to be 200 nanoseconds; actually, we expect the retry to be less than that, but even if it increases to, say, 300, these numbers don't change much. So the retry probability of a given flit is around 10 to the power minus six, which means that every million or so 256-byte flits,
Starting point is 00:27:47 one of them will get retried, which is a reasonable amount. That's the retry probability over the retry time, if you did a go-back-N kind of mechanism. And this is the failure in time: you'll see that it's about 10 to the power minus 10, which is as close to 0 as you can get, which is pretty good, right? Anything less than 10 to the
Starting point is 00:28:09 power minus 3 is pretty good, and this is at 10 to the power minus 10. And we also did bandwidth loss mitigation by only retrying the flit that has an error, as opposed to the go-back-N mechanism. Go-back-N already exists; it means that if you got something wrong in, let's say, flit number 10, then you're going to ask for 10, 11, 12, 13, and so on, and everything will restart from 10. The other mechanism just says: just give me 10. I have got 12, 13, 14, 15.
Starting point is 00:28:37 I will take those from my local buffer, and then after that, you can go back to 16. So now let's look into what this does from a bandwidth perspective, right? Remember, we are paying all of this overhead in terms of the FEC, the much bigger CRC, and all of that. So if we look at this picture, we change the data payload size here in dwords; every dword is four bytes. You'll see that smaller payloads get more than a 3x improvement. So,
Starting point is 00:29:24 remember, because we doubled the data rate, we are expecting a 2x improvement. So why are we getting a 3x improvement? It's because of that efficiency gain. We are not paying the per-TLP framing token or the per-TLP CRC. And you can expect that to be much more pronounced when the payload is smaller: when the packets are smaller, the overheads are relatively bigger, and because you got rid of those per-TLP overheads, they don't matter anymore that much, right?
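A rough model shows why small packets gain the most. The overhead numbers below are illustrative assumptions (roughly 4 bytes of framing/sequence plus 4 bytes of LCRC per TLP in non-flit mode, a 12-byte header in both modes, and 20 fixed bytes per 256-byte flit), not the spec's exact accounting, so the ratios track the trend in the talk rather than the chart's exact values:

```python
def gen5_goodput(payload_bytes: int) -> float:
    """Relative payload throughput at 32 GT/s, non-flit mode:
    assumed 4B framing/seq + 12B header + payload + 4B LCRC per TLP,
    under 128b/130b line coding."""
    wire_bytes = 4 + 12 + payload_bytes + 4
    return 32 * (128 / 130) * payload_bytes / wire_bytes

def gen6_goodput(payload_bytes: int) -> float:
    """Relative payload throughput at 64 GT/s, flit mode:
    236 of every 256 flit bytes carry TLPs; only the 12B header
    remains per TLP (no per-TLP framing token or CRC)."""
    wire_bytes = 12 + payload_bytes
    return 64 * (236 / 256) * payload_bytes / wire_bytes

small_gain = gen6_goodput(4) / gen5_goodput(4)      # well above 2x
large_gain = gen6_goodput(512) / gen5_goodput(512)  # drops below 2x
```

For large payloads the per-TLP savings are already amortized away, so the fixed 20 bytes of flit overhead pulls the gain slightly under 2x, which is the trade-off discussed next.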
Starting point is 00:31:17 So that's the reason why you will see that in most of the cases we have a higher-than-2x gain. Beyond a point, you start seeing a bandwidth decline, meaning instead of doubling the bandwidth, you start getting something like a 1.95x kind of bandwidth number, which is not quite the 1.98x; it's worse than that. But we looked at it and said, okay, this is a reasonable trade-off to make, right? Especially given that for some payloads you're getting a lot better and for some you're getting slightly worse, which is fine, right? We can live with that.
Starting point is 00:32:10 So up to 512 bytes, efficiency is better, and beyond that it's worse. So then the question is: what about the latency? The table here shows the latency impact that we have for the different payload sizes and all. And the thing to note here is: if it is plus, that means with 6.0 you're going to get that much higher latency.
Starting point is 00:32:42 And if it is minus, that means you're going to get lower latency. And why is that? Because you're moving data across a faster link, so you're gaining some amount in terms of latency, right? Because you're running it faster. But on the other hand, you're doing all of this flit accumulation, which you didn't have to do before.
Starting point is 00:32:58 And that's the reason why you will see that for smaller payloads and narrower links, the latency impact is higher. Because, you know, you can't really wait for the first dword, the four bytes of data, and say, hey, I got my four bytes of data, let me go and consume it, which you could do in the previous case. Here, you have to wait till the 256-byte boundary. So we take all of that into account. And you'll see that, overall, these are x1 numbers, but if you look into the x16 kind of numbers, it's a wash or you start coming out ahead. And I didn't even credit us for the latency savings due to removing the sync header; basically, you get rid of a bunch of muxes
Starting point is 00:33:36 and all of those things. So if you take those into account, mostly you are going to come out ahead. So how well did we do compared to what we said, right? Data rate: of course, we met the data rate. Latency: we exceeded the expectations, right?
Starting point is 00:34:10 Bandwidth, again, we generally exceeded the expectation. Reliability is pretty good. Channel reach is the same. Power efficiency, of course, your mileage varies. And then for low power, we introduced a new L0P state, which will get us real power savings and is fully backward compatible. And then, you know, HVM ready, right? So pretty good from all of that perspective, and expect the same with PCIe 7.0.
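On the flit-accumulation latency point above, the scaling with link width is easy to see with a quick sketch (my assumptions: a 256-byte flit at the 64 GT/s raw lane rate, ignoring encoding and FEC processing time):

```python
# How long it takes to accumulate one full 256-byte flit before it can be
# consumed, at the PCIe 6.0 raw rate of 64 GT/s per lane. Encoding details
# are ignored; this only shows the scaling with link width.
RATE_BITS_PER_S = 64e9  # 64 GT/s per lane

def flit_accumulation_ns(lanes: int, flit_bytes: int = 256) -> float:
    bits = flit_bytes * 8
    return bits / (RATE_BITS_PER_S * lanes) * 1e9

for width in (1, 4, 16):
    print(f"x{width}: {flit_accumulation_ns(width):.1f} ns")
```

On a by-one link that is 32 ns per flit, while on a by-16 it shrinks to 2 ns, which is why narrow links feel the accumulation much more than wide ones.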
Starting point is 00:34:34 So then the question is, okay, what about storage, right? So, you know, one of the key drivers for PCI Express has been all the SSDs. And in general, the thinking is that, hey, SSDs are not at the bleeding edge of technology transition. And I used to think that myself. Generally, you expect the GPGPUs or the networking guys to be at the bleeding edge of the technology transition; storage, generally, will follow a year or so later. We've been proven wrong, and this is one of those things where you're
Starting point is 00:35:10 happy that you're proven wrong. So if you look into the PCIe 5.0 integrators list, right, we did the compliance program. The very first set of devices that got qualified were SSDs. So storage is an aggregate bandwidth consumer, but at the same time, pretty much moving at the leading edge of the technology transition. And especially with NVMe, latency is lower. With hundreds of lanes coming out of the CPU,
Starting point is 00:35:46 you are no longer going through a different, you know, bus converter kind of thing. You're directly sitting on the CPU bus, very low latency access, and to the point where right now the storage stack is not really the bottleneck, right? And that's putting a lot of pressure on the networking side to catch up, which is good. That's the way we
Starting point is 00:36:05 want that entire virtual cycle of innovation to to happen right we don't want storage to be slowing things down and these are the again uh some of the recent data in terms of how many devices are connected to pcie and the green one is pcie so you can see that you you know, which is a good thing. Same thing by capacity, the dark, oh, different color here. This color is PCI Express, so pretty decent there. And you know, PCI Express has got a lot of, sorry, I apologize for this. No. Some combination of...
Starting point is 00:36:50 It's trying to also increase the font size, which is... Okay. So PCI Express has got pretty good RAS capability, right? I mean, we have all of our... As you saw, right? We take our reliability fairly seriously. We have error injection mechanisms defined. We've got hierarchical timeout support for transactions, right?
Starting point is 00:37:13 And also containment, that we will see, and advanced error reporting, support for hot plug, planned hot plug, surprise hot plug. And especially for the storage segment, we did the notion of downstream port containment, which basically says that, hey, as you move from this HBA kind of topology to a direct PCI Express connect, you know, people always plug in and plug out the wrong SSD, and you need to make sure that you don't bring down the whole system. So we defined the containment such that the containment is limited to the device that is being impacted. All the other devices will
Starting point is 00:37:53 continue as usual, even though they're under a switch hierarchy, right? So what you want to do is, if you're under a switch, you don't want the CPU's root port to be timing out on the other SSDs that are not really impacted, right? So those are the types of things that PCI Express did to address those issues. IO virtualization, again, you know, has been there for a long time. We support IO virtualization natively, in terms of giving a lot of the virtual machines access to the virtual functions of the devices. And again, storage was one of the early movers, along with networking, into the virtualization space. And, you know, we are also working with DMTF to define a lot of security mechanisms, so
Starting point is 00:38:38 we basically want to have a very coherent strategy, right? In terms across different types of IOs, in terms of the security solutions that we offer. So we take advantage of, you know, a lot of the work that DMTF has done. You know, we define a set of protocols on PCI-SIG. As you can see, this is basically the split of what PCI-SIG did versus what we are leveraging from the MCTP, DMTF stuff. So all of those are fundamentally what you're trying to do is that you're trying to make sure that you're authenticating the device on the other
Starting point is 00:39:18 side. And also any traffic that is going across PCI Express is protected with encryption, both end-to-end encryption and our per-hop encryption, right? So that's basically where we are at. PCI Express, again, one specification, same silicon, but multiple form factors. Again, if you go for, you know, the form factor that you expect in here is going to be very different than that of a server, right? So it makes sense that one specification, but multiple form factors, and these are a whole host of them that you can see. Some of them are done by PCI-SIG directly. Some of them are done by other organizations such as yours.
Starting point is 00:40:05 So all good stuff, right? We're also working on the... Yeah, they're trying to send me a message. So we're also trying to work on the cable topology. And this is an interesting development, actually. So, you know, there are internal cables and external cables. And of course, they help us with respect to the reach. But most importantly, most importantly, there is a move uh so far right pc express and any kind of direct load store
Starting point is 00:40:46 interconnect was based on a given node in other words if you plug in if you have a system if you have a server right and you put a gpgpu or you put a piece of memory there it's captive resources for that server any kind of dma or anything is within that server, right? In other words, if I have a rack of servers like you can see here, right? And let's say if you need the storage somewhere else or if you need the memory from some other node, there is no way for you to really do that other than go through some kind of networking semantics, right? And we are moving towards a world where load store will access those directly. So what we mean by that is that you want to be able to have your pool of resources that you are going to be able to access using the load store semantics, such as whether it is PCI Express, whether it is CXL. So that way you can imagine that in the future you can have some nodes that are going to have some amount of, let's say, SSDs, some amount of GPGPUs and all of that. And then, you know, you can have some of the others in a pool, so to speak, right?
Starting point is 00:41:51 And we have hot plug flows. So we can remove, let's say, a GPGPU from the pool, assign it to some server, and then once it is done, we can basically remove it, put it back in the pool, and then reassign it to somebody else. And in order to do that, you will need some kind of cable topology. So, you know, we are working on the cable topologies to realize that vision going forward, right? And there are other things that we are working on, like unordered IO and things like that. Fundamentally, PCI Express, like CXL and these other load-store topologies, is moving towards more of a fabric kind of thing. It's a gradual shift, but it's a very strategic shift that's happening, right? So something to keep an eye out for. So clearly the normal mode of load-store and everything, the frequency doubling,
Starting point is 00:42:45 everything continues. But in addition to that, we are having those. Any successful, as we all know, any successful standard needs to have a well-defined, robust compliance program, because if you have, you know, parts that are out there for two decades and you got a 900 member company consortium developing technologies that are based on a standard, right? How do you make sure that they all interoperate with each other, right? And one of the means to do that
Starting point is 00:43:18 is to come to a compliance workshop and get your part tested for compliance, right? So we have got an extensive compliance test suite that covers everything from electrical signaling to, you know, link layer, are you doing the replace properly or not, to the transaction layer, to the configuration register. So it's a fairly well-defined compliance mechanism that we go through, right, in the compliance test suite. And the other thing that it does is it basically brings in that mindset that specifications, right, are well-defined that somebody can design to without
Starting point is 00:43:54 knowing who the other, which other component it's going to interoperate with, right? You're all going by the set of standard and then, you know, you don't, you can't really possibly pick up the phone as a designer and talk to every other designer that could have worked across any other company and expect to make it work. This is basically our means of enforcing that compliance mechanism through the compliance workshops.
Starting point is 00:44:19 In conclusion, PCI Express, we got a single standard that covers the entire compute continuum. It's a predominant direct IO interconnect, right, from the CPU with high bandwidth and used for alternate protocols with coherence and memory semantics. Low power, high performance. And I used to, you know, generally I say to people that PCI stands for peripheral component interconnect. It definitely is a peripheral component interconnect, but also now it is the main component interconnect, right? It's moved from the periphery to the main
Starting point is 00:44:52 part, and then also it's moving beyond the periphery into the rack level, right? So which is the goodness of it currently on the seventh generation? And again, you know, expect to see the innovation engine to continue, right? We've got lots of very interesting and real problems, right, that we want to solve and, you know, the journey doesn't stop, right? It's going to continue.
Starting point is 00:45:20 So with that, open to questions if you have any. Yes? So, as you know, right, I mean, we try to, a lot of the, it's a shared burden system, right? It's like a lot of it goes to the silicon side, and then there is, of course, you do expect things like, we are expecting the materials, right, the low-loss materials to become, which has been the case, by the way, right? And we've worked on this for a long time. But, you know, things have, you know, the loss per inch in terms of dB, right, for a given frequency has gone down, right?
Starting point is 00:46:35 And that's because, you know, these are all volume platforms, right? So, Jonathan, as you know, the volume economics is at play, right? The demand is there. Somebody will do it. And just like we know how to do things
Starting point is 00:46:48 as we in this context as PCI Express to do the technology, people that are making the materials, they know how to do the lower loss. It's a question of when do they become more mainstream volume, right? So there is that material cost that you brought up, right?
Starting point is 00:47:04 Which is basically on the board side, that adds to the cost. And then, also, you know, even on the connector side, right, there are a lot of innovations that have gone on, right, in terms of making sure of the loss there; there are discontinuities that we made sure of. I feel like every generation we are pushing out that huge upswing of that curve, and we dodge the bullet, and, you know, we will keep pushing it out, right? We'll come up with something and we'll push those out, right? There will always be those discontinuities. And then the other thing is, of course, back-drill and all of those things that we have been doing. And most importantly, it's on the package side also,
Starting point is 00:47:44 right? Lots of innovations happening to mitigate the package loss there, right? So it's a combination of all of those that we really have to work with together, right? And again, as I said, the thing that really helps us is the huge volume in which we get deployed, right?
Starting point is 00:48:01 I mean, that makes it possible, right? Otherwise it just becomes a boutique technology, right? So I hope I answered that question. Yeah. So it's some of the exotic materials that are worse that become less exotic. Exactly, yes. Yeah, yeah.
Starting point is 00:48:18 Yeah. You mentioned something about the power that you use that LG was using. Yeah. But in a data center, what is the device that you use Yeah. Yeah. Yeah. Okay. So I think there were two questions. Let me answer both one by one. So the first one I had to do with the devices should never sleep and yes, L0P is when the devices are not sleeping, right? So what the basic premise there is if you have a by 16 link, for example
Starting point is 00:49:05 let's say you are monitoring how much bandwidth you are consuming and you say hey you know I've been consuming less than 25% of the bandwidth so you bring it down to four lanes 12 lanes go to sleep they go to the low power state the other four lanes are active under l0P, at least one lane is always active. So in other words, any transition, right? Under, at no point, your link is out of commission, so to speak. In other words, you are able to do transactions all the time.
Starting point is 00:49:38 Even when you're trying to bring up that buy four link, let's say 25%, let's say you saw that, hey, you know, my utilization is now at 25, maybe my queues are backing up, let me get to from by four to a by eight, right? So you upsize your link to a by eight, right? While you're upsizing, four lanes are training, but these four lanes that were running,
Starting point is 00:49:55 they will continue to run your traffic. The other four lanes will train in the background. Once they come, once they're up, we will seamlessly bring them in. And there is a point called skip order set is then intro we send periodically during that time we are going to upsize the link to a buy it you will not know the difference okay so that's first part of the question right no more stalls and you know this is one of those learnings right that you
Starting point is 00:50:22 go through you make mistakes and you learn. That had to do with the dynamic link for those of you that, you know. And when you saw that, when we saw that, why is that not getting deployed? Like, oh, yes, of course. Stalls are bad, right? The second part is power efficiency numbers, right? And, you know, your mileage generally varies. So in general, right, what we have seen is any time there is a transition, people typically will start off with about 1.5x or 1.3x of where they were the previous generation, which is not a 2x, but it's not going to be half. Meaning it's not going to be one, right?
Starting point is 00:51:03 So they will go there. Meaning if you do picojoule per bit, they will be at around, you know, somewhere around 0.75, whatever, that kind to be one, right? So they will go there, meaning if you do picajoule per bit, they will be at around, you know, somewhere around 0.75, whatever that kind of a number, right? I mean, they're not going to be at half. Over a few generations after they do that design, in general, they tend to go down. Does it mean that, you know, you're going to keep,
Starting point is 00:51:19 as you're doubling the data rate every single generation, your power efficiency number will asymptotically go to zero? No, but you know, it know, we have been pretty good in terms of it's amazing when people get a budget what they will do or what they can do. Right. So. Any other questions? Yes. Talk a little bit about L0P and since it's a storage conference, how L0P might work with NDME? Yeah, so L0P, the way it works is you can have any application running on top.
Starting point is 00:51:55 It doesn't matter, right? It can be NVMe, it can be your, you know, whatever, GPGPU, it doesn't matter. You know, hardware will be monitoring how much bandwidth consumption is taking place realistically on this link. And we can know that because we know when we stuff idles. When we don't have transactions to send, we send idle. So it's very easy for hardware to monitor at the most lowest level. You can do a very quick calculation and you can have different policies. You can say, hey, you know, if I see that I'm going to take an average over the last 10 microseconds, let's say,
Starting point is 00:52:32 and if I see that I'm consistently using the link less than 25% of its utilization of the bandwidth that is available, maybe I go to a by four. Or if I'm a little bit more, especially for NVMe, you can, you know, you don't have to move, move back and forth that fast, right? We also have mechanisms where the other side will tell you that, hey, you know, how quickly can I go from a bifur to let's say a biaid, realistically, right? I mean, remember, there is no stall.
Starting point is 00:53:01 The four lanes will run, but the other four, how long will it take for it to really come back up? So every design is different, so that gets exchanged, right? So you can make a policy based on that, hardware policy saying, you know, how quickly does it need to turn, right? In reality, it's like, you know, microsecond kind of number, right?
Starting point is 00:53:21 So think of it as if I monitor the traffic for a microsecond or two microseconds, decide to switch it back or forth, then anything to undo that, right? Especially on the upsizing part, it will take me one to two microseconds. That's a reasonable thing for me to do. And I'm going to save the power.
Starting point is 00:53:43 Your block rate will not. Yeah. Actually, you will see, you know, if I see that reads are easy because reads I know by if I want to, I can look at the reads and know that there is a huge bandwidth demand coming. So I could So I could easily, I can have some intelligence to predict that. Writes might be a little bit more tricky, but nevertheless, so even if you have a write, you know, I can always look into, you know, how far my hardware queues
Starting point is 00:54:14 are getting backed up, right? And based on that, I can decide saying, hey, you know, these things are coming in fast, right? The buildup is really fast. So instead of taking small steps, I might just decide to go all the way to by 16 and then dial it down later on. So you could have different policies depending on how you can look at the slope at which things are building up. And based on that,
Starting point is 00:54:35 you can make a decision. Right now, all of that work is planned to be on the PCIe side. It's all on the hardware side. We're not planning on the hardware side. No, you don't need to make any changes in the upper layer. Yeah. You have a four lane device. You have a four lane device. Oh yeah. If you have a four lane device, you're going to have a four lane device.
Starting point is 00:55:14 Now within that four lane device, you've got choices. You can just say, Hey, I always want to be four lanes to hell with your power story. Right. Which is fine. You will always be four lanes. I could make a request saying, Hey, you know, we are using less than 25%. Can we go to buy one? You can always
Starting point is 00:55:28 say, forget it. I want it to be buy four. These are... No, no, no, no, no, no, no. Those are all independent things. Yes. Yeah. These are all, you know, at the end of the day, you want to, you know, just like any low power state transition, right?
Starting point is 00:55:46 It doesn't take away from your max, from the capacity that you had planned. What you're doing is that you're looking at the current consumption and say that, hey, do I need to keep this consumption rate going? Or can I dial it down temporarily and then when I need, I can flip it back up. But your buy four will always be by four. It will always deliver you the by four worth of bandwidth. That doesn't go away with L0P. All right, if you have more questions, I'll be here. I was getting the time up indication.
Starting point is 00:56:19 So thank you all very much. Thanks for listening. For additional information on the material presented in this podcast, be sure to check out our educational library at sneha.org slash library. To learn more about the Storage Developer Conference, visit storagedeveloper.org.
