Storage Developer Conference - #195: PCIe® 6.0 Specification and Beyond
Episode Date: September 5, 2023...
Transcript
Hello, this is Bill Martin, SNIA Technical Council Co-Chair.
Welcome to the SDC Podcast.
Every week, the SDC Podcast presents important technical topics to the storage developer community.
Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developers Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast Episode 195. Greetings, everyone. We'll talk about
PCIe 6.0 and also the journey beyond that. PCI Express is a
multi-generational journey. We just completed the 6.0 specification and what is in it for
different applications. I will touch on that. My name is Debendra Das Sharma. We'll start with
introduction where I will talk about how PCI Express as a technology has evolved.
Do a little deep dive into PCI Express 6.0.
I will also touch on 7.0.
It will be very similar to 6.0, so pretty much you're going to get a preview of what it looks like.
PCI Express technology and how it relates to storage, form factors, compliance,
and then conclusions.
So anytime you have a question, just raise your hand,
shout the question, and I'll take it.
Want to make it more informal.
So this is the evolution of PCI Express, right?
We started off back in 2003 with 2.5 gig.
At that time, it was 8b/10b encoding.
Pretty much, on average, every three years we have been doubling the bandwidth or the data rate with full backward compatibility.
You know, that's the average cadence.
Sometimes it's a little longer.
Sometimes it's a little shorter,
but, you know, on an average, give or take.
And you can see that throughout this journey,
it's like 2.5 gig, 2003, 5 gig, 2007, 8 gig, 2010.
And note that from five to eight, even though it's not doubling the data rate, because of the encoding change you get double the bandwidth, which is basically what we wanted to do at that time, because if we went with 10 gig, we would have left the server channels behind. It would have taken a lot longer for the materials and everything to catch up.
So after that, you will see that, you know, the progression keeps going.
16 gig was the same 128b/130b encoding, then 32 gig.
If not, this would have been 10, 20, 40, and then 80,
which would be getting harder and harder, right?
You probably would have had to do PAM4 back at 40.
Instead, we basically wait
till it's absolutely necessary for us
to make a huge transition.
So this is a big transition, and we'll see why. PCI Express, as we all know, has become the ubiquitous IO across all platforms that you can imagine. Anywhere, it's the interconnect that connects everything, and oftentimes you will see that alternate protocols like Compute Express Link are using PCI Express as the fundamental infrastructure.
And it's one stack, one silicon
that covers all the segments,
but multiple form factors as we will see.
The promise of backwards compatibility
is that you can take a x16 Gen 5 device
and it will interoperate with a x1 Gen 1 device, right?
Or vice versa, right?
You can do any permutation, they will just interoperate.
And this is not the end of the journey, clearly.
We have, you know, anytime we start such an undertaking,
right, the general rule of thumb is three generations
spanning a decade, very successful technology.
By that measure, this is more than successful,
right, in a good way.
So more than two decades,
now on to the seventh generation,
pretty impressive, right?
And, you know, the entire industry
with 900 plus member companies
that make up the PCI-SIG,
they are behind the technology
and lots of innovations happen.
We announced the PCI Express 7.0 specification because, you know, no prizes for guessing.
After 64, it will be 128 gig. Okay. Double, right? I mean, it's a very predictable cadence with which
we move with full backward compatibility. And, you know, it's going to be 128 gig,
the same 1b/1b flit mode as 64 gig, right?
And, you know, we make the transition once.
Once we make the transition,
we'll take advantage of it for as long as we can.
And 128 gigatransfers per second means that you get 512 gigabytes per second bidirectionally, 256 per direction, for a x16 link. We use the same PAM4 signaling. Of course, right now it's very early days. We think we can get there. Like anything else, as you go through more and more details, things could change, but the engineering judgment leading up to this is that this is feasible.
Of course, there is a lot of hard work ahead. By no means is it a slam dunk, but this is feasible, right?
We are focusing on the channel parameters and the reach.
The key would be to deliver the low latency and high reliability target,
but most importantly with the power efficiency and the cost effectiveness, right?
And there are hundreds of lanes of PCI Express
that exist in a platform.
So this is the mix of the different data rate
and the different width.
And these are the bandwidth numbers that you get.
And as I said, the promise of backward compatibility
is that anything can interoperate with anything else.
Of course,
based on the least common denominator, right?
You cannot have a x2 connected to a x16
and expect to get x16 performance.
It'll just give you the x2, but it will work.
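As a rough illustration of that doubling cadence and the least-common-denominator rule, here is a minimal sketch (not from the talk; it ignores packet and flit overheads and just multiplies data rate, encoding efficiency, and negotiated width):

```python
# Rough per-direction bandwidth calculator for PCIe generations.
# Illustrative only: ignores packet/flit overheads, just raw encoded rate.
GENS = {
    # gen: (GT/s per lane, encoding efficiency)
    1: (2.5, 8 / 10),     # 8b/10b
    2: (5.0, 8 / 10),     # 8b/10b
    3: (8.0, 128 / 130),  # 128b/130b
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
    6: (64.0, 1.0),       # 1b/1b flit mode (PAM4)
    7: (128.0, 1.0),
}

def gb_per_s(gen: int, lanes: int) -> float:
    """Raw bandwidth per direction in GB/s for a given generation and width."""
    rate, eff = GENS[gen]
    return rate * eff * lanes / 8  # bits -> bytes

# A link runs at the least common denominator of the two partners.
def negotiated(gen_a: int, lanes_a: int, gen_b: int, lanes_b: int) -> float:
    return gb_per_s(min(gen_a, gen_b), min(lanes_a, lanes_b))

for gen in GENS:
    print(f"Gen {gen}: x1 = {gb_per_s(gen, 1):6.2f} GB/s, x16 = {gb_per_s(gen, 16):7.2f} GB/s")
print("x16 Gen 5 talking to x2 Gen 3:", negotiated(5, 16, 3, 2), "GB/s")
```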
So what are the bandwidth drivers, right?
What are the ones that are causing us
to basically lead these speeds transitions, right? Speeds and feeds. On the
device side, you got a lot of devices that are demanding high amount of bandwidth, right? So
networking started off with 800 gig; that is basically taken care of with a PCIe Gen 5 x16, so that becomes the networking solution.
Gen 6, that is accelerators, FPGAs, ASICs, memory, storage,
alternate protocols like CXL.
You know, with CXL, you're providing the memory bandwidth itself.
PCI Express so far had been a consumer of bandwidth.
A good consumer, but nevertheless a consumer of bandwidth. And what I mean by that is anytime you plug in a device,
whether it is a networking device or a storage device,
it consumes the memory bandwidth.
And if you look into the platform today,
the memory bandwidth is the bottleneck, right?
A lot of the platforms, they have a lot of PCI Express lanes.
You just cannot feed it
because there is not enough memory bandwidth available.
With the move of memory onto CXL,
which is based on PCIe,
it's now a producer of that bandwidth,
which means that all the demand that you get,
not only are you a consumer,
but also you're a producer,
which basically puts all the pressure in terms of having to deliver more bandwidth because,
you know, as the sockets compute capability is growing and people are doing not just a single
type of compute, they're doing heterogeneous compute, right? All kinds of computes are coming
up. There is a need to deliver more memory bandwidth, more IO bandwidth.
So it's the virtuous cycle of innovation that's going on.
And the key to delivering all of this
is power-efficient bandwidth, right?
And that's basically where a lot of the requirement
coming to PCI Express is:
to deliver that.
Now, while that's good and all of that,
at the same time, you need to
be cognizant of what makes this technology or what has made this technology so successful up until
now. And what has made the technology so successful up until now is that it's a cost-effective
solution, right? If you think about something on the motherboard with hundreds of lanes, it's standard materials. It's not like exotic materials that you use. It's volume platforms, shipping in billions of devices worldwide. So it's the volume platform. At the same time, performance has never been compromised. We are delivering performance.
We are delivering the performance with
the best power efficiency that the industry can expect.
Of course, hundreds of lanes.
Scalability is the other key thing.
With all of these, it's a tight rope to walk for the technology.
As I said, these are the new usage models, right?
Storage, cloud, AI, analytics, edge, you name it, right?
Everywhere it is there.
PCI Express is there everywhere,
whether it is as load store IO
or whether it is to deliver memory bandwidth
or whatever, right?
I mean, it's there everywhere.
With all of that, the demand for delivering the bandwidth
within that same power and cost envelope
continues to be there.
So as I said earlier,
I'll talk mostly about PCI Express 6.0,
but also you can, these are going to be the requirements,
same requirements with 7.0,
except the data rate will be double, right?
Other than that, it's pretty much the same.
In other words, and this has been true for a while now, before we start working on any technology,
think of these as the guardrails that we put up,
saying these are the things that must be met.
If we cannot meet them, then of course
you have to go back and have a discussion,
but these are the guardrails, right? And if you look
deeply into it, none of them are really negotiable requirements. They all must be met. So, data rate:
64 gig, you know, if we don't double the bandwidth, it's not worth the time for a lot of people to
invest and, you know, go for a 10% frequency increase, right?
It's a lot of heavy lift.
So you have to take that step function.
Latency-wise, because these are load-store interconnects, right?
These are not networking kinds of interconnects.
Every nanosecond of latency matters
for a lot of applications, right?
Especially if you're doing memory access
and things like that.
Latency is important, super important.
So what we said is that when we do PAM4,
of course we knew that PAM4 means that the bit,
as we will see in a minute,
the bit error rate will be extremely high
with all of that.
We are going to do forward error correction,
so that will increase latency.
And what we said is that the latency adder
should ideally be zero.
If not, it has to be less than single-digit nanoseconds,
transmit plus receive, right?
Those are the guardrails, right?
And you know, we clearly said that we cannot do the
100-plus nanosecond FEC latency that networking standards have done.
It doesn't work for us.
Because a lot of the load to use latencies,
if you look at platforms today,
those are less than that by themselves.
So if you add 100 nanosecond on top of that,
it's going to just become,
you would not make the transition,
you would stay in the 32 gig data rate.
Bandwidth inefficiency.
Another way of saying,
what's the bandwidth efficiency?
You know, both are the same thing.
It has to be less than 2% adder.
And by that, what I mean is that
if I was getting bandwidth X at 32 gig,
when I go to 64 gig,
I must get at least 1.98x.
Reliability.
Clearly, it's not a negotiable thing. We measure reliability in FIT, which is failures in time:
how many failures you are going to get in a billion hours of operation.
And we want it to be very close to zero. In reality, nothing is zero, right? There is always a probability of a failure happening, right?
But we want that probability of failure in a by 16 link as measured in fit to be as close to zero as you can imagine.
Channel reach, again, volume platforms.
People don't change their platforms just because. You can't tell people that, hey, earlier on
the motherboard the PCI slots were 10 or 12 inches away, now put them two inches away.
It doesn't work, right? So the channel reach has to be pretty much similar
to what it was before. Power efficiency, again, better than 5.0. And you will notice that this is a soft requirement in the
sense that ideally the power efficiency should be half of the prior generation power efficiency.
In other words, if you spend 10 picojoule per bit with Gen 5, you should be spending 5 picojoule
per bit when you're delivering 64 gig. So that way the power number will be flat. In reality,
what happens is that initially,
when people come up with their implementations, the power numbers are higher, and then it takes
a generation or two before they can tweak, and then it starts coming down, right? And, you know,
in order to make up for that, we did a bunch of these low power enhancements and, you know,
we introduced a new low power state so that way you can modulate the link width and all,
but that's the goal, right? And of course, if your power per bit is more than the previous
generation, then it's clearly moving in the wrong direction. The other thing is plug and play,
right? Full backward compatibility. Again, something
that is not an easy thing to move away from, right?
Because otherwise it causes a lot more issues
and challenges, and of course it has to be
high volume manufacturing ready.
It cannot be a niche kind of a technology, right?
Very cost effective, scalable to hundreds of lanes
in a platform.
So these are the right trade offs,
and we have to meet each and every one of these metrics.
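To put the power-efficiency guardrail in rough numbers, here is a back-of-the-envelope sketch (the 10 picojoule-per-bit figure is the speaker's hypothetical, the rest is illustrative arithmetic):

```python
# Back-of-the-envelope: if energy per bit halves while the data rate
# doubles, the power per lane stays flat. Numbers are illustrative only.
def lane_power_w(pj_per_bit: float, gbit_per_s: float) -> float:
    return pj_per_bit * 1e-12 * gbit_per_s * 1e9

gen5 = lane_power_w(10.0, 32)        # 10 pJ/bit at 32 Gb/s -> 0.32 W per lane
gen6_ideal = lane_power_w(5.0, 64)   # halved pJ/bit at 64 Gb/s -> still 0.32 W
gen6_first = lane_power_w(7.5, 64)   # a first implementation at ~1.5x -> 0.48 W

print(f"Gen 5 lane: {gen5:.2f} W, x16 link: {gen5 * 16:.1f} W")
print(f"Gen 6 lane (ideal): {gen6_ideal:.2f} W; first implementations: {gen6_first:.2f} W")
```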
So now, what about PAM4 signaling at 64 gig? Well, unlike NRZ, which is non-return-to-zero, there are three eyes, not one eye. And by that, what we mean is that we are keeping the
unit interval the same as 32 gig. It's the same Nyquist frequency as 32 gig, if you want to think about it, but instead of sending a zero or a one,
we are sending two bits, right?
That's why it is PAM4, four-level pulse amplitude modulation.
And basically what happens with that is
because you are squeezing in three eyes rather than one,
you're basically getting reduced eye height
and also reduced eye width. And that increases
the susceptibility to errors. You're much more susceptible for errors to happen, right? In other
words, a small amount of voltage perturbation can move you from one to the other. So as a result,
you will end up getting more and more errors. And that's fine. That's something that you cannot
avoid. You just need to figure
out how to work in spite of that. We did other things like, you know, gray coding to minimize
the number of errors. These are well-known techniques. Pre-coding to minimize the number
of errors in a burst and all of those. And we'll talk some more about that.
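As a hedged illustration of why Gray coding helps, here is a small sketch using one common Gray mapping (not necessarily the exact table in the specification): adjacent PAM4 levels differ in only one bit, so the most likely slicing error costs only one bit.

```python
# PAM4 Gray mapping sketch: 2 bits -> 4 voltage levels such that adjacent
# levels differ in exactly one bit. A small noise event that pushes the
# receiver to a neighboring level then causes only a single bit error.
GRAY_MAP = {0b00: 0, 0b01: 1, 0b11: 2, 0b10: 3}   # one common Gray ordering
LEVEL_TO_BITS = {lvl: bits for bits, lvl in GRAY_MAP.items()}

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

# Verify the property: neighbors on the voltage axis differ by one bit.
for lvl in range(3):
    d = hamming(LEVEL_TO_BITS[lvl], LEVEL_TO_BITS[lvl + 1])
    print(f"levels {lvl} and {lvl + 1}: {d} bit(s) differ")
```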
So given that, you know, we're getting a lot of errors. So far, for the last five generations, PCI Express had been at a bit error rate of 10 to the power minus 12, which means that for every 10 to the power 12 bits, you can get one bit in error. With PAM4, those numbers are many, many orders of magnitude worse, right? Just to give you an idea, when we started off, a lot of people said, hey, should we go with a 10 to the power minus 4, like networking has done? And I'll show you the numbers, some of the analysis that we did.
So bottom line, error magnitudes,
errors are several orders of magnitude worse than what we are used to.
And not only that, there are two other things that are at play here.
In order to get lower latency, most people would most likely end up doing things like what we call a DFE, decision feedback equalizer, or a CTLE, continuous time linear equalizer, those kinds of designs.
Which means that an error that happens in a bit is most likely going to propagate to next bits, next few bits.
Okay, because the way you determine what this bit is, is dependent on what the values of the
previous few bits were. So they tend to propagate, right, with those kind of implementations. So
that's what is known as a burst error. And you will see here that we define something called a first error. What is the
probability of a first error happening? And then, you know, you're going to get a burst.
And not only that, we are also cognizant of the fact that there can be some common events that
will lead to a correlation across lanes, right? And those are the types of errors that we need to
be careful about. So when we give a number,
it's based on the FBER, the first bit error rate. The actual number of errors will be more than that. All of these
errors count as a single incidence of an FBER event. So I'll up-level this discussion and say that,
okay, so what do we know? We know that we are going to get a lot of errors, right?
What do we do if we get a lot of errors?
Today, in PCI Express, up until before Gen 6,
you get an error, which was once in 10 to the power 12 bits,
which is almost like once in a blue moon,
if you get an error, there is a CRC, detect the error,
you ask for the link partner to replay,
saying, hey, you know, you sent me that packet number 10,
that doesn't look good, can you send from 10 onwards, right? And it will replay and off you go. With these many errors,
if you start asking for a replay, you're going to waste a lot of bandwidth, because you're going to
constantly be asking the link partner to send you replays. So we need
to have some form of forward error correction. And so for that, we did a little study here.
You know, we assumed a 256 byte transfer,
and we said that if we could correct always
any single FBER instance, what would things look like?
What will be the probability of asking us for a replay?
If we correct two instances of that,
what would the probability of a replay be?
If we did three, what would it be?
And, you know, of course, there are bigger numbers we tried, but it doesn't matter. I'll make the point even with three of these, right? And what you see here is the raw FBER
rate, 10 to the power minus 4, minus 5, minus 6. What you notice is that even a single instance of correction gets you to a very
reasonable replay rate, somewhere around 10 to the power minus 6 or so, if you are going to get
a 10 to the power minus 6 FBER. But notice that the same one with 10 to the power minus 4,
your replay probability is somewhere around 10 to the power minus 2 or minus 3, which is a pretty unacceptable number.
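A toy version of that kind of study looks like the following (an illustrative model only: it treats the FBER as an independent per-bit probability and ignores bursts and lane correlation, so the numbers are ballpark rather than the actual analysis):

```python
from math import comb

FLIT_BITS = 256 * 8  # 2048 bit positions in a 256-byte transfer

def replay_probability(fber: float, correctable: int) -> float:
    """P(more than `correctable` first-error events land in one flit)."""
    p_le = sum(
        comb(FLIT_BITS, k) * fber**k * (1 - fber) ** (FLIT_BITS - k)
        for k in range(correctable + 1)
    )
    return 1 - p_le

for fber in (1e-4, 1e-5, 1e-6):
    probs = [replay_probability(fber, t) for t in (1, 2, 3)]
    print(f"FBER {fber:.0e}: replay prob ~", ["%.1e" % p for p in probs])

# With FBER 1e-6 and one correctable event, this lands around 1e-6;
# with FBER 1e-4 it is around 1e-2, matching the shape of the argument above.
```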
So what we said is, we are going to go aggressive, go for this point, and then just pick a single symbol correct ECC, FEC mechanism. Why is that? Because a single symbol correct,
and then we are going to back it with a very strong CRC, because we want to make sure that nothing gets past that CRC. So correction consists of two things, right? One of them is, of course, your FEC will correct some, not all, but it corrects reasonably enough.
Then you got CRC that will detect the rest, right?
It has to really be good at detecting.
Once you detect, then you can ask for a replay, right?
So you got two ways of correcting things.
The replay mechanism already existed.
Now we are adding to it the FEC mechanism.
Now, why do I want a single symbol correct FEC? The reason is that error correction is an exponential problem. Think about what it takes to correct an error. If you have n symbols and you can correct, let's say, up to x symbols, how many combinations do you have? You've got n choose x, which in itself grows very fast. And within each of those x symbols, bit errors can happen in different ways: one bit here, one bit there, and so on. If you look into any symbol, you've got 2 to the power 8 possibilities if you assume your symbol is eight bits. Well, technically, 2 to the power 8 minus 1; I'm just trying to make it easy. So every symbol can have 2 to the power 8 possible ways in which errors can happen, and there are x symbols, so that's to the power x. So not only do you have n choose x, you've got 2 to the power 8 to the power x. There are that many error possibilities that you're dealing with if you know that there are x errors. If it is x minus 1, you can add that term too, and you can keep doing that math, right?
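To see how quickly that count blows up, here is the same arithmetic as a sketch (using the speaker's 2-to-the-power-8-per-symbol approximation and an assumed 256-symbol codeword):

```python
from math import comb

N_SYMBOLS = 256              # assumed codeword size in symbols, for illustration
PATTERNS_PER_SYMBOL = 2**8   # the speaker's approximation (technically 2**8 - 1)

def error_patterns(x: int) -> int:
    """Roughly how many distinct error patterns have exactly x bad symbols."""
    return comb(N_SYMBOLS, x) * PATTERNS_PER_SYMBOL**x

for x in (1, 2, 3):
    print(f"x = {x}: ~{float(error_patterns(x)):.3e} patterns")
# Single-symbol correction keeps the search linear in the codeword length;
# each extra correctable symbol multiplies the pattern space enormously.
```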
And that's a very difficult thing. And that's the reason why all of those
other standards, they pay such a huge penalty in terms of latency. It's because that
problem is an exponential problem. And what do we know with one? It becomes linear. So that's the
reason why we want to stick to single symbol correct, as long as it's putting me in the
right ballpark that I don't have to replay that often. CRC, I wanted a very strong CRC,
but what do we know about CRC? CRC is a linear problem. And by that, what I mean is that I
generate the syndrome. It's a division thing. Fine. I've been doing CRC, which is a division.
No problem. We all know how to do that. But at the end, how do I know something is correct or
wrong? All that I look for is, is it zero or is it non-zero? I'm not trying to look at that syndrome and trying to figure out, hey, which of these n
choose x 2 to the power 8 to the power x is wrong. I don't care. All that I know is something is not
right. If something is not right, I just ask for a replay, right? So lightweight FEC, but very strong CRC, and a maniacal focus on keeping latency low.
That's the key.
Those are the key things, right, that PCI Express has.
And again, that's critical because of the nature
of the load store memory-based kind of interconnect
that PCI Express as well as similar load store
interconnect protocols are. So now what do we do when we have a forward error correction?
Forward error correction works on a fixed number of bytes.
I can't have, well, I suppose I can have variable bytes,
but that makes my life even more difficult, right?
So if I have a fixed number of bytes on which,
this one seems to be on a time-based thing somehow.
Okay, so if I have a fixed number of bytes,
then I need to be able to correct,
that I need to correct,
which means that I need to fix that as the unit
on which the FEC works.
So we define something called flit.
And flit is nothing new for,
while it's new for PCI Express,
it's nothing new, right?
All the coherent links have used flit for ages.
So the FEC works on that flit
and we chose 256 byte as the flit size.
So the error correction is happening at the flit level,
which means that naturally I want to do my CRC at the same granularity where I do the correction. Otherwise, you know, it just becomes
more complicated, right? You do FEC somewhere, and as far as CRC is concerned, a packet could belong to five or
six different groups, right? So if you do FEC at the flit level, it makes sense that you do the CRC check there. If you do
the CRC check there, it makes sense that you replay at the flit level,
as opposed to replaying at the packet level,
the transaction layer packet level, that PCI Express did before.
So that's basically where we went, right?
And lower data rates will also use the same flit,
because you can move between data rates dynamically.
So you cannot have one type of replay
if you are at the 64 gig data rate,
and then another type when you dynamically
change to, let's say, the 2.5 gig data rate. So once you have negotiated flit mode, it's always the same way,
again for simplicity. So we picked the flit size as 256 bytes, as shown in the picture here.
236 of these bytes, so we'll see 0 through 235. These are what are known as TLPs or transaction layer packets.
That's what is reserved for TLPs.
6 bytes are for data link layer payload.
Then you will see there are 8 bytes of CRC that covers the rest.
And then you will see that there are 6 bytes of ECC.
Actually, those 6 bytes of ECC are 3 groups of 2 bytes of ECC each.
And what happens is that you will notice
that those three groups are these three colors.
So what happens with those three colors is that,
you know, each of them is a separate ECC group.
So we'll see that even if there is an error happening
that goes across them, as long as the burst is less than 24 bits,
or three symbols, three bytes long,
you are going to be able to correct that error.
So that's the reason why we have done the arrangement
in that particular way.
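A minimal rendering of that byte budget as constants (a summary of the layout described above, not spec text):

```python
# PCIe 6.0 flit byte budget as described above (256 bytes total).
FLIT_BYTES = 256
TLP_BYTES = 236    # bytes 0..235 carry transaction layer packets
DLLP_BYTES = 6     # data link layer payload
CRC_BYTES = 8      # one strong CRC covering the TLP + DLLP bytes
FEC_BYTES = 6      # 3 interleaved ECC groups x 2 bytes each
FEC_GROUPS = 3

assert TLP_BYTES + DLLP_BYTES + CRC_BYTES + FEC_BYTES == FLIT_BYTES

# With 3-way interleaving and single-symbol correction per group,
# a burst of up to FEC_GROUPS consecutive symbols lands in distinct
# groups and stays correctable.
print(f"flit efficiency for TLPs: {TLP_BYTES / FLIT_BYTES:.1%}")  # ~92.2%
```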
And in the process, what we are able to do
is that we removed the sync header, we removed all the framing tokens,
because everybody has a particular slot mechanism
where the TLPs go in a particular place,
DLLPs go in a particular place.
So there is no need to say, hey, this is a TLP type,
this is a DLLP type, the locations are fixed.
So we are going to get those benefits as you will see,
and no per TLP and DLLP CRC.
So yes, we are spending a lot more,
eight bytes of CRC, which is a lot,
but that's for a good reason, very strong detection.
And we'll see that these help actually
in terms of the bandwidth.
So even though we are paying more,
because we are amortizing it across
an entire 256-byte flit, we really come out ahead.
So what do the retry probability, the FIT, and everything else look like?
Remember, my FBER rate is 10 to the power minus 6.
I'm assuming the retry time to be 200 nanoseconds;
actually, we expect the retry to be less than that.
But even if it is 300, let's say if it increases, these numbers don't change much.
So the retry probability of a given flit is around 10 to the power minus six.
Well, what it means is that every 10 to the power six or so, right, every million, I think that's the unit, right?
Every million or so 256 byte flit,
one of them will get retried,
which is a reasonable amount.
You have the retry probability over the,
if you did like go back and kind of mechanism,
that's the retry probability over the retry time.
And this is the failure in time.
You'll see that it's about 10 to the power minus 10
which is as close to 0 as you can get which is pretty good right anything less than 10 to the
power minus 3 is pretty good and this is at 10 to the power minus 10 and we also did the bandwidth
loss mitigation by only retrying the flit that has an error as opposed to the go back n mechanism go
back n mechanism already exists which means that if you got something that is wrong in,
let's say, flit number 10,
then you're going to ask for 10, 11, 12, 13, so on and so forth.
Everything will start from 10.
The other one just says, just give me 10.
I have got 11, 12, 13, 14, 15.
I will take those from my local buffer.
And then after that, you can continue from 16.
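Here is a tiny sketch of the difference between the two retry styles (purely illustrative logic, not the actual PCIe replay protocol or its DLLP formats):

```python
# Illustrative only: contrast go-back-N with selective flit retry.
received = {10: None, 11: "flit11", 12: "flit12", 13: "flit13"}  # flit 10 arrived corrupted

def go_back_n(bad_seq: int, last_seq: int) -> list[int]:
    # Everything from the bad flit onward is retransmitted.
    return list(range(bad_seq, last_seq + 1))

def selective_retry(buffer: dict) -> list[int]:
    # Only the flits that actually failed are retransmitted; good ones
    # are kept in the local receive buffer.
    return [seq for seq, flit in buffer.items() if flit is None]

print("go-back-N replays:", go_back_n(10, 13))          # [10, 11, 12, 13]
print("selective replays:", selective_retry(received))  # [10]
```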
So now let's look into what does this do from a bandwidth perspective, right?
Remember, we are paying all of this overhead in terms of the FEC, much bigger CRC, and all of that.
So what we did was that we put up this picture. Okay, so,
if we look at this picture, we change the data payload size here in D words. Every D word is
four bytes. So, you'll see that smaller payloads, they get about more than 3x improvement. So,
remember, because we doubled it, we are expecting a 2x improvement.
So why are we getting a 3x improvement?
It's because of that efficiency gain.
We are not paying the per TLP framing token, per TLP CRC.
And you can expect to see that that will be much more pronounced when the payload is smaller.
When the packets are smaller, the overheads are bigger. Because you got rid of those per-TLP overheads,
those don't matter that much anymore, right?
I wonder what's going on. Okay.
I know it's telling me to keep moving. That's a good one.
Okay, so what happened? All right, hot plug, anybody?
So while he's fixing that,
so that's the reason why you will see that in most of the cases we have a higher than 2x gain. For larger payloads you start seeing a bandwidth decline, meaning instead of doubling the bandwidth, you start getting to like a 1.95x kind of bandwidth
number, which is not quite at 1.98x; it's worse than that. But we looked at it and said, okay,
this is a reasonable trade-off to make, right? Especially given that for some payloads you're getting
a lot better, for some you're getting slightly worse,
which is fine, right?
We can live with that.
So up to 512 bytes, efficiency is better, and beyond that it's slightly worse.
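For a feel of where that crossover comes from, a deliberately crude efficiency model is sketched below (assumed numbers: roughly 8 bytes of framing-plus-LCRC per TLP and 128b/130b encoding in non-flit mode, versus a fixed 20-byte overhead per 256-byte flit; it leaves out DLLP and idle overheads, so it understates the small-payload gain shown on the chart, but the trend is the same):

```python
# Rough link-efficiency model: non-flit (Gen 5 style) vs flit mode (Gen 6).
HEADER = 16          # assume a 16-byte TLP header in both modes
PER_TLP_NONFLIT = 8  # assumed framing token (4B) + LCRC (4B) per TLP
ENC_NONFLIT = 128 / 130
FLIT_OVERHEAD = 20 / 256   # 6 DLLP + 8 CRC + 6 FEC bytes per 256-byte flit

def eff_nonflit(payload: int) -> float:
    return payload / (payload + HEADER + PER_TLP_NONFLIT) * ENC_NONFLIT

def eff_flit(payload: int) -> float:
    return payload / (payload + HEADER) * (1 - FLIT_OVERHEAD)

for payload in (16, 64, 256, 512, 1024):
    # Doubling the data rate times the efficiency ratio gives the effective gain.
    gain = 2 * eff_flit(payload) / eff_nonflit(payload)
    print(f"{payload:5d}B payload: effective gain ~{gain:.2f}x over Gen 5")
# Small payloads gain well above 2x; very large payloads dip slightly below 2x.
```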
So then the question is, what about the latency?
The table here shows the latency impact that we have for the different payload sizes and
all.
And the thing to note here is if it is plus, that means with 6.0 you're going to get that
much higher latency.
And if it is minus, that means you're going to get lower latency.
And why is that?
Because you're moving data across a faster link.
So you're gaining some amount in terms of the latency, right?
Because you're running it faster.
But on the other hand,
you're doing all of these flit accumulation,
which you didn't have to do.
And that's the reason why you will see that
smaller payloads, the latency impact,
and narrower links, the latency impact is higher.
Because, you know, you can't really wait for the first dword, which is the four bytes of data,
and say, hey, I got my four bytes of data, let me go and consume it, which you could do in the
previous case. Here, you have to wait till the 256-byte boundary. So we take all of that into
account. And you'll see that, overall, these numbers are for a x1, but if you look into the x16 kind of number, it's a wash or you start gaining. And I didn't even credit us
for the latency savings due to removing the sync header; basically you get rid of a bunch of muxes
and all of those things. So if you take those into account, mostly you are going to come out ahead.
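As a rough illustration of that accumulation effect (simple arithmetic only; it ignores FEC and CRC processing time), the time to land one full 256-byte flit depends on the link width:

```python
# Time to receive one full 256-byte flit at 64 GT/s, per link width.
# Rough arithmetic only; real latency also includes FEC/CRC processing.
FLIT_BITS = 256 * 8
RATE_GBPS = 64  # per lane

for lanes in (1, 2, 4, 8, 16):
    ns = FLIT_BITS / (RATE_GBPS * lanes)  # bits divided by bits-per-ns gives ns
    print(f"x{lanes:<2}: ~{ns:5.1f} ns to accumulate one flit")
# A x1 link waits ~32 ns for a flit boundary; a x16 link only ~2 ns,
# which is why narrow links see the larger latency adder.
```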
So how well did we do compared to what we said, right?
Data rate, of course, we met the data rate.
Latency, we exceeded the expectations, right?
There are a couple of cases where we didn't quite meet,
but, you know, that's fine.
That's reasonable.
Bandwidth, again, we generally exceeded the expectation.
Reliability is pretty good.
Channel reach is the same.
Power efficiency, of course, your mileage varies.
And then for low power, we introduced a new L0P state,
which will get us real power savings
and fully backward compatible. And then, you know, HVM ready, right?
So pretty good from all of that perspective and expect the same with PCIe 7.0.
So then the question is, okay, what about storage, right?
So, you know, one of the key drivers for PCI Express has been all the SSDs.
So if you look into, and in general, the thinking is that,
hey, SSDs are not at the bleeding edge of technology transition.
And I used to think that myself.
Generally, you expect the GPGPUs or the networking guys
to be at the bleeding edge of the technology transition; storage
generally would follow a year or so later. I have been proven wrong, and this is one of those things where you're
happy that you're proven wrong. So if you look into the PCIe 5.0 integrators list, right,
we did the compliance program. The very first set of devices that got qualified were SSDs.
So aggregate bandwidth consumer,
but at the same time,
pretty much moving at the leading edge
of the technology transition.
And it's all especially with NVMe,
latency is lower with hundreds of lanes coming out of the CPU,
you are no longer going through a different, you know,
bus converter kind of thing.
You're directly sitting on the CPU bus,
very low latency access, and to the point where
right now the storage stack is not really the bottleneck, right?
And that's putting a lot of pressure on the networking side
to catch up, which is good.
That's the way we
want that entire virtuous cycle of innovation to happen, right? We don't want storage to be slowing
things down. And these are, again, some of the recent data in terms of how many devices are
connected to PCIe, and the green one is PCIe, so you can see that it is growing, which is a good thing. Same thing by capacity; the color is different here.
This color is PCI Express, so pretty decent there.
And you know, PCI Express has got a lot of...
sorry, I apologize for this, the slide viewer is acting up.
Okay.
So PCI Express has got pretty good RAS capability, right?
I mean, we have all of our...
As you saw, right?
We take our reliability fairly seriously.
We have error injection mechanism defined.
We got hierarchical timeout support for transactions, right?
So, and also containment that we will see
and advanced error reporting,
support for hot plug, planned hot plug,
surprise hot plug.
And especially for the storage segment, we did the notion of
downstream port containment, which basically says that, hey, as you move from this HBA kind of topology
to a direct PCI Express connect, people always plug in or plug out the wrong SSD,
and you need to make sure that you're not taking down the whole system. So we defined the containment such that the containment is limited to the device that is being impacted. All the other devices will
continue as usual, even though they're under a switch hierarchy, right? So what you want to do
is that if you're under a switch, you don't want the CPU's root port to be timing out on the other
SSDs that are not really impacted, right? So those are the types of things that PCI Express did to address those issues.
IO virtualization, again, you know, it has been there for a long time.
You know, we support IO virtualization natively in terms of having a lot of the virtual machines
access to the virtual functions of the devices.
And again, storage was one of the early movers along with networking to the virtualization space.
And, you know, we are also working with DMTF to define a lot of security mechanisms.
We basically want to have a very coherent strategy across different types of IOs,
in terms of the security solutions that we offer.
So we take advantage of, you know,
a lot of the work that DMTF has done.
You know, we define a set of protocols in PCI-SIG.
As you can see, this is basically the split of what PCI-SIG did versus
what we are leveraging from the MCTP, DMTF stuff. Fundamentally, what you're
trying to do is make sure that you're authenticating the device on the other
side, and also that any traffic that is going across PCI Express is protected with encryption,
both end-to-end encryption and per-hop encryption, right?
So that's basically where we are at.
PCI Express, again, one specification, same silicon, but multiple form factors.
Again, if you go for, you know, the form factor that you expect in here is going to
be very different than that of a server, right? So it makes sense that one specification, but
multiple form factors, and these are a whole host of them that you can see. Some of them are done
by PCI-SIG directly. Some of them are done by other organizations such as yours.
So all good stuff, right?
So we're also working on the cable topology.
And this is an interesting development, actually.
So, you know, there are internal cables and external cables.
And of course, they help us with respect to the reach.
But most importantly, there is a move. So far, PCI Express and any kind of direct load-store
interconnect was based within a given node. In other words, if you have a system, if you
have a server, and you put a GPGPU or a piece of memory there, it's a captive resource for
that server. Any kind of DMA or anything is within that server, right? In other words,
if I have a rack of servers like you can see here, right? And let's say if you need the storage somewhere else or if you need the memory from some other node, there is no way for you to really do
that other than go through some kind of networking semantics, right? And we are moving towards a world
where load store will access those directly. So what we mean by that is that you want to be able to have your pool of resources that you are going to be able to access using the load store semantics, such as whether it is PCI Express, whether it is CXL.
So that way you can imagine that in the future you can have some nodes that are going to have some amount of, let's say, SSDs, some amount of GPGPUs
and all of that. And then, you know, you can have some of the others in a pool, so to speak, right?
And we have hot plug flows. So we can remove, let's say, a GPGPU from the pool, assign it to
some server. And then once it is done, we can remove, we can basically remove it, put it back
in the pool, and then reassign it to somebody else. In order to do that, you will need some kind of cable topology. So
we are working on the cable topologies to realize that vision going
forward, right? And there are other things that we are working on, like unordered IO and things
like that. Fundamentally, PCI Express, like CXL and these other load-store topologies, is moving towards more of a fabric kind of thing.
It's a gradual shift, but it's a very strategic shift that's happening.
Right. So something to keep an eye out for. Right. So clearly the normal mode of load store and everything, the frequency doubling,
everything continues. But in addition to that, we are having those. Any successful, as we all know,
any successful standard needs to have a well-defined, robust compliance program, because
if you have, you know, parts that are out there for two decades
and you got a 900 member company consortium
developing technologies that are based on a standard, right?
How do you make sure that they all interoperate
with each other, right?
And one of the means to do that
is to come to a compliance workshop
and get your part tested for compliance, right?
So we have got an extensive
compliance test suite that covers everything from electrical signaling, to the link layer,
are you doing the replays properly or not, to the transaction layer, to the configuration registers.
So it's a fairly well-defined compliance mechanism that we go through, right, in the
compliance test suite. And the other thing that it does is it basically brings in that mindset that
specifications are well-defined enough that somebody can design to them without
knowing which other component they are going to interoperate with,
right?
You're all going by the set of standards, and
you can't really pick up the phone as a designer and talk to
every other designer across
every other company and expect to make it work.
This is basically our means of enforcing
that compliance mechanism through the compliance workshops.
In conclusion, PCI Express,
we got a single standard that covers
the entire compute continuum.
It's a predominant direct IO interconnect, right, from the CPU with high bandwidth and used for alternate protocols with coherence and memory semantics.
Low power, high performance.
And I used to, you know, generally I say to people that PCI stands for peripheral component interconnect.
It definitely is a peripheral component interconnect, but also now it is the
main component interconnect, right? It's moved from the periphery to the main
part, and then also it's moving beyond the periphery into the rack level,
right? So that's the goodness of it, and we are currently on the seventh generation.
And again, you know, expect
to see the innovation engine to continue, right?
We've got lots of very interesting
and real problems, right, that we want to solve
and, you know, the journey doesn't stop, right?
It's going to continue.
So with that, open to questions if you have any.
Yes? So, as you know, it's a shared burden system, right? It's like a lot of it goes to the silicon side,
and then, of course, we are expecting the materials,
the low-loss materials, to become more mainstream,
which has been the case, by the way, right?
And we've worked on this for a long time. But, you know,
the loss per inch in terms of dB, right,
for a given frequency has gone down, right?
And that's because, you know,
these are all volume platforms, right?
So, Jonathan, as you know, this is like, you know,
the volume economics is at play, right?
The demand is there.
Somebody will do it.
And just like we, in this context PCI Express,
know how to do the technology,
the people that are making the materials,
they know how to do the lower loss.
It's a question of when they become
more mainstream volume, right?
So there is that material cost
that you brought up, right?
Which is basically on the board
side, that adds to the cost. And then also, you know, even on the connector side, there are a lot
of innovations that have gone on in terms of making sure of the loss there;
there are discontinuities that we made sure you're pretty much managing.
I feel like every generation we are pushing out that huge upswing of the loss curve, and we dodge the bullet, and
we will keep pushing it out, right? We'll come up with something and we'll push those out, right?
There will always be those discontinuities. And then the other thing is, of course, back-drilling and
all of those things that we have been doing. And most importantly, it's on the package side also, right?
Lots of innovations happening to mitigate
the package loss there, right?
So it's a combination of all of those
that we really have to together work with, right?
And again, as I said, the ones that really,
that thing that really helps us is the huge volume
in which we get deployed, right?
I mean, that makes it possible, right?
Otherwise it just becomes a boutique technology, right?
So I hope I answered that question.
Yeah.
So it's some of the exotic materials that are worse
that become less exotic.
Exactly, yes.
Yeah, yeah.
Yeah.
You mentioned something about the power state, L0p, that you were describing.
Yeah.
But in a data center, the devices...
Yeah. Okay. So I think there were two questions. Let me answer both one by one.
So the first one had to do with the devices should never sleep, and yes, L0p is when the devices are not sleeping, right? The basic premise there is, if you have a x16 link,
for example,
let's say you are monitoring how much bandwidth you are consuming, and you say,
hey, you know, I've been consuming less than 25% of the bandwidth, so you bring
it down to four lanes. Twelve lanes go to sleep, they go to the low-power state; the
other four lanes are active. Under L0p, at least one lane is always active.
So in other words, any transition, right?
Under, at no point, your link is out of commission,
so to speak.
In other words, you are able to do transactions all the time.
Even when you're trying to upsize that x4 link,
let's say you saw that,
hey, you know, my utilization is now at 25%,
maybe my queues are backing up,
let me go from a x4 to a x8, right?
So you upsize your link to a x8, right?
While you're upsizing, four lanes are training,
but these four lanes that were running,
they will continue to run your traffic.
The other four lanes will train in the background.
Once they come, once they're up,
we will seamlessly bring them in.
And there is a point, at a SKP
ordered set that we send periodically, where we are going to upsize the
link to a x8. You will not know the difference. Okay, so that's the first part of the question, right?
No more stalls. And, you know, this is one of those learnings, right, that you
go through; you make mistakes and you learn.
That had to do with the dynamic link width change, for those of you that know it.
And when we saw that, when we asked why is that not getting deployed,
it was like, oh, yes, of course.
Stalls are bad, right?
The second part is power efficiency numbers, right? And, you know, your mileage generally varies.
So in general, right, what we have seen is any time there is a transition, people typically will start off with about 1.5x or 1.3x of where they were the previous generation, which is not a 2x, but it's not going to be half.
Meaning it's not going to be one, right?
So they will go there.
Meaning if you do picojoules per bit, they will be at around, you know, somewhere around 0.75,
whatever, that kind of a number, right?
I mean, they're not going to be at half.
Over a few generations after they do that design,
in general, they tend to go down.
Does it mean that, you know, you're going to keep,
as you're doubling the data rate every single generation,
your power efficiency number will asymptotically go to zero?
No, but, you know, we have been pretty good in terms of it.
It's amazing, when people get a budget, what they will do or what they can do.
Right. So. Any other questions? Yes.
Talk a little bit about L0p and, since it's a storage conference, how L0p might work with NVMe? Yeah, so
L0P, the way it works is
you can have any application running on top.
It doesn't matter, right?
It can be NVMe, it can be your, you know,
whatever, GPGPU, it doesn't matter.
You know, hardware will be monitoring how much bandwidth consumption is taking place realistically on this link.
And we can know that because we know when we stuff idles.
When we don't have transactions to send, we send idle.
So it's very easy for hardware to monitor at the lowest level.
You can do a very quick calculation and you can have different policies. You can say, hey, you know, if I see that I'm going to take an average over the last 10 microseconds, let's say,
and if I see that I'm consistently using the link less than 25% of its utilization of the bandwidth
that is available, maybe I go to a x4. Or, if I'm using a little bit more, especially for NVMe,
you don't have to move
back and forth that fast, right?
We also have mechanisms where the other side will tell you,
hey, you know, how quickly can it go from a x4
to, let's say, a x8, realistically, right?
I mean, remember, there is no stall.
The four lanes will run, but the other four,
how long will it take for it to really come back up?
So every design is different, so that gets exchanged, right?
So you can make a policy based on that,
hardware policy saying, you know,
how quickly does it need to turn, right?
In reality, it's like, you know,
microsecond kind of number, right?
So think of it as if I monitor the traffic
for a microsecond or two microseconds,
decide to switch it back or forth,
then anything to undo that, right?
Especially on the upsizing part,
it will take me one to two microseconds.
That's a reasonable thing for me to do.
And I'm going to save the power.
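A minimal sketch of the kind of policy being described (the thresholds and the policy itself are hypothetical; the specification defines the L0p mechanics, not the policy):

```python
# Hypothetical L0p width-management policy: sample utilization over a short
# window, downsize when the link is mostly idle, upsize when queues build up.
WIDTHS = [1, 2, 4, 8, 16]

def next_width(current: int, utilization: float, queue_growing: bool) -> int:
    """utilization is the fraction of available bandwidth used over the window."""
    i = WIDTHS.index(current)
    if queue_growing:
        return WIDTHS[-1]                 # demand ramping fast: jump to full width
    if utilization < 0.25 and i > 0:
        return WIDTHS[i - 1]              # mostly idle: drop a width step
    if utilization > 0.75 and i < len(WIDTHS) - 1:
        return WIDTHS[i + 1]              # getting busy: add a width step
    return current

print(next_width(16, 0.10, False))  # -> 8, dial down while at least one lane stays active
print(next_width(4, 0.80, False))   # -> 8, upsize; traffic keeps flowing on the active lanes
print(next_width(2, 0.30, True))    # -> 16, queues backing up fast
```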
Your block rate will not... Yeah. Actually, you will see, you know,
reads are easy, because I can look at the reads and know that there
is a huge bandwidth demand coming. So I can have some intelligence to predict that.
Writes might be a little bit more tricky,
but nevertheless,
so even if you have a write,
you know, I can always look into,
you know, how far my hardware queues
are getting backed up, right?
And based on that, I can decide saying,
hey, you know, these things are coming in fast, right?
The buildup is really fast.
So instead of taking small steps,
I might just decide to
go all the way to x16 and then dial it down later on. So you could have different policies
depending on how you can look at the slope at which things are building up. And based on that,
you can make a decision. Right now, all of that work is planned to be on the PCIe side.
It's all on the hardware side; we're not planning anything on the software side. No, you don't need to make any changes in the upper layers.
Yeah.
You have a four-lane device.
Oh yeah.
If you have a four-lane device, you're going to have a four-lane device.
Now within that four-lane device, you've got choices.
You can just say, hey,
I always want to be four lanes, to hell with your power story.
Right.
Which is fine.
You will always be four lanes.
Or I could make a request saying, hey, you know, we are using less than
25%, can we go to a x1? You can always
say, forget it, I want it to be
a x4. These are...
No, no, no, no, no, no, no.
Those are all independent things. Yes.
Yeah.
These are all, you know, at the end of the day,
you want to, you know,
just like any low power state transition, right?
It doesn't take away from your max, from the capacity that you had planned.
What you're doing is that you're looking at the current consumption and say that, hey, do I need to keep this consumption rate going?
Or can I dial it down temporarily and then when I need, I can flip it back up.
But your buy four will always be by four.
It will always deliver you the by four worth of bandwidth.
That doesn't go away with L0P.
All right, if you have more questions, I'll be here.
I was getting the time up indication.
So thank you all very much.
Thanks for listening.
For additional information on the material presented in this podcast, be sure to check out our educational library at snia.org slash library.
To learn more about the Storage Developer Conference, visit storagedeveloper.org.