Storage Developer Conference - #86: Emerging Interconnects: Open Coherent Accelerator Processor Interface (OpenCAPI) and Gen-Z
Episode Date: February 14, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast Episode 86. All right, so I'm going to talk a bit
about what we're doing with OpenCAPI and some background, a lot of which is in common with the Gen Z presentation and with previous presentations.
So you'll kind of see how we're trying to at least address that.
So industry-wise, we've got really two significant things that are quite dramatic compared to, like, say, my last 20 years in processor architecture.
The last 20 years were spent making our CPUs
go from 400 megahertz, I think was the first one I worked on,
to somewhere in the 4 gigahertz range.
And then, you know, some number of years ago,
probably 5 to 6, you know, we capped out on that frequency,
and we're having essentially Moore's Law pains.
No greater frequency parts.
We are getting more transistors
but those transistors are at the same power
that they were in previous designs.
And so we end up having this huge problem
trying to do everything with general purpose processors.
So we've adopted accelerators.
IBM just did this joint supercomputer project with NVIDIA.
We now have the top supercomputer on the Top 500 list.
And where do all the flops come from?
They don't come from the general purpose processors.
They come from the NVIDIA accelerators.
Those NVIDIA accelerators, though, need to talk to the network.
They need to talk to host memory.
And so that's kind of what the power processors become in those systems,
is a way to tie accelerators together, tie emerging memory together,
and tie networks together.
But certainly we have to deal with accelerators and integrated accelerators.
The other side of things is memory technologies.
We now have a long list, as you guys have seen,
of different emerging memory technologies.
And we need to be able to
figure out how to deploy those technologies in an efficient way, in a very dynamic way,
and that's a big part of what we're trying to accomplish. So what does an accelerator want to
be? And what we don't want it to be is a device that you talk to through a device driver that has
high software overheads,
that transfers a block of data to the accelerator,
the accelerator churns on it for a while,
and then sends the data back.
That very much limits the ways
in which you can deploy accelerators.
We need the accelerators to be tightly coupled
to the processor, to the host processor,
and other accelerators in the system,
and able to communicate through shared memory.
So we want to be able to talk accelerator to accelerator,
accelerator to host through our shared memory model
just like multiprocessor systems talk to each other.
So if I've got eight CPUs in a box,
you don't need device drivers to talk from one CPU to another.
You want to be able to talk through shared memory.
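As a loose software analogy of that shared-memory model, here is a minimal sketch using Python's multiprocessing.shared_memory as a stand-in: two workers touch the same buffer in place instead of copying blocks through a driver-style API. This only illustrates the programming model being described; real OpenCAPI coherence is handled in hardware, not by software like this.

```python
# A loose software analogy for the shared-memory model described above: two
# workers touch one shared buffer in place instead of copying blocks through a
# driver-style API. Illustration of the programming model only.
from multiprocessing import Process, shared_memory

def producer(name):
    shm = shared_memory.SharedMemory(name=name)
    shm.buf[:5] = b"hello"        # the "CPU" writes directly into shared memory
    shm.close()

def consumer(name):
    shm = shared_memory.SharedMemory(name=name)
    print(bytes(shm.buf[:5]))     # the "accelerator" reads the same bytes in place
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=16)
    p = Process(target=producer, args=(shm.name,))
    c = Process(target=consumer, args=(shm.name,))
    p.start(); p.join()           # the join stands in for coherence/synchronization
    c.start(); c.join()
    shm.close()
    shm.unlink()
```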
So that has implications
on how we do addressing. We don't really like IOMMUs. We like MMUs. We also want to be able
to manage cache coherence. Once again, when I share data between two CPUs, I don't have to do
a bunch of explicit cache flush instructions or things like that. That's sort of kind of, you know,
two decades ago or whatever. We've got lots of form factors.
We have lots of different accelerators. Different applications require different accelerators.
And so we want to have a very fluid, heterogeneous system. The other side of things is data centers
want flexibility and openness. People don't want closed standards. They want to be able
to go and buy processors from Intel, processors from AMD, processors from other vendors, IBM obviously, right?
And they want to be able to mix accelerators across those
ecosystems as well. So proprietary is bad; open is obviously a requirement going
forward. So let's talk a little bit about that. So acceleration is becoming commonplace. The acceleration
communication wants to be fast. With PCIe, people have latencies in the 700 nanosecond range.
That's a lot of time for a processor to be sitting there doing nothing, right? We're
trying to execute in a CPU, trying to execute, I don't know, four instructions a cycle, three instructions a
cycle. Those instructions are 250 picoseconds, a quarter of a nanosecond, right? So if
I'm sitting there for a thousand cycles waiting for an accelerator, that could have been 4,000
instructions that I wasted, right? Processors are expensive. We want to get full utilization of them.
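To make that arithmetic concrete, here is a rough sketch using the ballpark figures from the talk, a 4 GHz core, about four instructions per cycle, and a 700 nanosecond round trip; all of these are assumed round numbers, not measurements.

```python
# Rough cost of stalling a core on an accelerator round trip, using the
# ballpark figures above (assumed round numbers, not measurements).
clock_ghz = 4.0          # ~4 GHz core, so one cycle is 0.25 ns
ipc = 4                  # ~4 instructions per cycle
round_trip_ns = 700      # PCIe-class accelerator round trip

cycles_lost = round_trip_ns * clock_ghz        # ns * cycles-per-ns
instructions_lost = cycles_lost * ipc
print(f"~{cycles_lost:.0f} cycles, ~{instructions_lost:.0f} instruction slots lost")
# With the talk's rounder figure of ~1,000 stalled cycles, the same math gives
# the roughly 4,000 wasted instructions mentioned above.
```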
And if your processor thread's sitting there for 4,000 cycles waiting on memory or waiting on an accelerator, that really limits
how well you can ship a little bit of work out to the accelerator and have it back. So we need
to make things tightly coupled and low latency. So to get this to happen, though, once again,
we need this full industry participation.
It's got to be open. In the past, IBM has done memory in this way, but with proprietary standards.
We've done accelerators also in proprietary standards.
But we want to be able to have a broader set of memory devices and a broader set of acceleration devices to get tied to.
So how do we do this? We were looking at what we had before.
So historically, we built accelerators using a PCIe ecosystem. And so we started back in Power 8 with doing coherent accelerators on PCIe. But the problem we ran into is the latency of a PCIe stack was much higher than we needed.
I just said 700 nanoseconds to talk to PCIe, which is just tremendously too long.
So we also needed to have a way to do coherence.
PCIe typically doesn't have a good coherence mechanism in it.
And to do things involving coherence, you need virtual channels.
Those are expensive in PCI ecosystems. You also need different types of formatting and templating.
So we started with a clean sheet design that we based around our previous work in attaching low
latency memory devices, and also the previous work we did in accelerators to build this device
where you could have a very thin layer in the accelerator that would talk across an open
interface with low latency. We had to go clean sheet though, because if you looked at the way
the PCIe packets were made, the way they were encoded, it was sort of incompatible with what
we wanted to build. The one interesting point, though, just a side note,
is the electrical standard with PCIe is actually fairly reasonable.
It's a differential signaling interface.
It runs at fairly high frequencies.
And we've found that the electrical signaling with PCIe
is actually quite compatible with OpenCAPI.
So a lot of the vendors we've talked to that currently have a PCIe-based NIC or whatnot
are able to take that same I/O interface that talks PCIe, have it run at higher frequency, and
then talk OpenCAPI. So electrically, PCIe is pretty good. The problems we had with it really
were how the packets are formatted, how those packets are aligned, and that it's a little low frequency.
But if we bump the frequency up and reorganize the way the packets work,
quite a bit of improvement can be made.
That said, that's how we created the OpenCAPI bus as an open standard.
So addressing, this is an interesting discussion,
especially when you talk about how you do virtualization
and integrate accelerators into your work.
So what we want the accelerator to see
is just what a processor sees,
which is an effective address or a virtual address.
So when you dereference a pointer in your program,
the actual physical location
has been hidden from the application.
Those pointers run in a true virtual memory system,
so if I want to dereference a pointer, it might be on disk. It might be paged in memory. It may not be paged in
memory, but it's just a pointer. So we want the accelerators to be able to operate using those
same virtual addresses that the processors need. Now today, with an I.O. type of infrastructure,
you often need to make calls to the IOMMU to set up translations for
the I-O device, and those pages need to be pinned. And so the software overhead to go give a block of
data to a DMA engine is quite prohibitive. So we want to be able to have the processor
use the same page tables as the accelerator. To do that, though, we need to be able to handle
translations and invalidations and all these things.
So that's a big part of what we're building in the OpenCAPI protocol
is a way to have all those things be straight virtual addressing.
And what we do is we actually do translation from the virtual address
to the physical address in the host bridge.
So from an accelerator's perspective, all it sees is program addresses.
We hide what the actual physical address is
from the accelerator, and that enables security.
It enables really simplified translation shoot-downs
and a variety of things.
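Here is a toy model of that split, purely to illustrate the idea; the class names, the 4 KB page size, and the fault handling are invented for the sketch and have nothing to do with the real host bridge logic.

```python
# Toy model of the addressing split described above: the accelerator side only
# ever presents virtual (effective) addresses, and a host-bridge stand-in owns
# the page table and resolves or faults. Invented names; not the real hardware.
PAGE_SIZE = 4096

class HostBridge:
    def __init__(self, page_table):
        self.page_table = page_table              # virtual page -> physical page

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        try:
            return self.page_table[vpage] * PAGE_SIZE + offset
        except KeyError:
            # real hardware would signal a translation fault, let the OS page
            # the data in, and retry the access
            raise RuntimeError(f"translation fault for page {vpage:#x}")

class Accelerator:
    def __init__(self, bridge):
        self.bridge = bridge

    def load(self, vaddr, physical_memory):
        paddr = self.bridge.translate(vaddr)      # physical address never visible here
        return physical_memory[paddr]

physical_memory = {0x2000: "data"}
bridge = HostBridge({0x1: 0x2})                   # virtual page 1 -> physical page 2
print(Accelerator(bridge).load(0x1000, physical_memory))   # -> data
```

The only point of the sketch is that nothing on the accelerator side ever holds a physical address, which is what makes the translation shoot-downs and the security containment simple.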
But that's a big part of the programming model, right,
which is we want a CPU that's executing code.
We want an accelerator that's operating on those same shared data structures. We want those all to be tied to the same sort of flat
memory space. So processor going to system memory, accelerator going to system memory
looks the same. We also need to integrate device memory into that same flat address space.
So if you're writing code and you want to target an operation into device memory,
you want to target it into system memory, those should look the same to the programmer.
So we need to provide flat virtual address space
and able to integrate device memory into the same system map as the memory.
So this is kind of how we see this heterogeneous world working.
You can also think about multiple kinds of memory.
There could be DRAM-based memory in the system.
There could be phase change memory.
Any variety of different memory devices all need to map into the same shared address space.
So this one, I guess, I have a little bit of passion about: memory in systems.
So the way we attach memory historically in like a typical industry system is with DDR memory.
And direct attached DDR memory is really a horrible interface.
It runs at low frequency, right?
If I'm talking about DDR,
it's running maybe at 2.4, 2.6 gigatransfers per second.
This is an extremely low frequency compared to what we get with differential signaling.
In the differential space, we're at 25 gig today.
So we're running like 10 times faster
device-to-processor interfaces compared to DDR.
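Roughly how that comparison works out, with assumed ballpark rates for a DDR4-2666 data pin versus a 25 gig differential lane:

```python
# Per-signal data rate, DDR4 versus a 25 gig differential SerDes lane
# (ballpark figures assumed for illustration).
ddr4_gtps_per_pin = 2.666       # DDR4-2666: ~2.67 gigatransfers/s per data pin
serdes_gbps_per_lane = 25.0     # one differential lane at 25 Gbit/s per direction

print(f"per-signal speedup: ~{serdes_gbps_per_lane / ddr4_gtps_per_pin:.1f}x")
# -> roughly 9-10x, which is the "10 times faster" comparison being made here.
# A differential lane does use two wires per direction, so the per-wire gap is
# smaller; the frequency difference is the point of the comparison.
```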
The other problem with DDR is it's not very scalable.
As we try and put more memory devices on this bidirectional bus,
this bidirectional bus has more loads.
It runs at lower frequency.
The DDR interface is also very...
It doesn't hide the memory device from the processor.
So if I put DDR4 memory on a processor
that has a very specific electrical standard timing interface,
if I want to go plug DDR5 memory into a processor,
I have to have a totally different electrical standard.
So there's incompatibilities.
And what if I want to plug phase change memory
into that processor slot?
Now I have to come up with some,
I'm going to call it slightly hacky approach to
try and integrate different memory latencies into this very, you know, bare DDR protocol.
And so the DDR protocol has big problems. It's too slow, and it's too tied to the device media.
And so what we need to do is we want to run it on differential signaling, just like every other interface in the processor is moved to other than memory.
So what does differential signaling give you?
It gives you a massive increase in bandwidth,
and it also enables you to develop a protocol that's memory agnostic.
So I do a loadout to memory, and when the data comes back,
it might be an out-of-order, it might be at variable latency.
It comes back, and I didn't care whether or not it was DDR4, phase change memory,
graphics memory, the load went out to the network, right? So attaching memory with differential signaling is tremendously the right way to go. And what we've done here is we basically
have made a variety of memory form factors. You'll see this connector a little bit later,
as well as the form factors.
But we built a buffer chip.
You can think of this as equivalent to an LRDIMM buffer chip,
but instead of talking DDR out one side,
it talks differential signaling.
Then it talks to DRAM devices,
and we have different varieties of these.
We have short ones to fit little 1U servers.
We have tall ones, which have higher RAS or redundancy features. It could be phase change
memory, storage class memory. It could be any variety of things. But one thing we've done,
and this is something that you'll see that's a little different from the next presentation,
is we've optimized this for extreme low latency. And by extreme low latency,
we have about a five nanosecond delta between using direct attached memory
or this SerDes based memory.
And that's accomplished with a lot of tricks in the protocol
that we've developed on previous buffered solutions.
And that's a significant difference from other protocols.
Now that said, that means the protocols themselves are very much associated with a direct host attachment,
right? We don't have, you know, switching capability in OpenCAPI. This is: I can plug
DDR memory in with a minimal latency adder, right, in the idle case. So five nanoseconds
in the idle case turns into a latency win in the loaded case, right,
because I'm able to put that much more memory
behind my processor.
So this is pretty important with what we're doing here
and a big step in what's differentiating
future IBM systems,
and we'd really like to get the rest of the ecosystem
to pick up on this device
because it really is a way to build better memory subsystems.
So what are the different types of paradigms you see?
We kind of already talked about this, but we want to be able to do direct DDR replacement.
So instead of having DDR DIMMs, we'd like to plug in those other DIMMs, memory DIMMs
you saw there.
Those enable this to be a generic agnostic interface.
So that same processor, I can now plug in storage class memory, storage
class memory with different memory latencies, different ordering rules. One other thing I
should mention about storage class memory and this interface is that writes in this OpenCAPI
interface are explicitly acknowledged, right? So when you send the write to the device,
the device can then acknowledge when that write completed.
This can then be tied into your I/O libraries on your processor,
because that's typically how this persistence works: you do a write, a store rather,
and you execute some flush operations.
When those flush operations complete, then you know the data is secure.
But with traditional DDR interfaces, and this is especially true in
some of these flash-backed devices, you do a write to DDR, and it just says, okay, yeah, there's no
acknowledge, right? There's nothing in the DDR protocol that enables you to say, this is when
my write's secure. There's no explicit acknowledge. It's just you send the data, and if there wasn't
an error, life is good. The problem is you don't know if it really got there or not.
And so since we have an explicit write acknowledge,
that enables much more sophisticated persistence mechanisms in the memory.
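Here is a minimal sketch of that handshake, with invented names; this is not an OpenCAPI API, just the store, flush, wait-for-acknowledge idea in toy form.

```python
# Minimal sketch of the handshake described above: push the write toward the
# device, then wait for the device's explicit acknowledgement before treating
# the data as durable. Class and method names are invented; not an OpenCAPI API.
class PersistentMemoryDevice:
    def __init__(self):
        self.media = {}

    def write(self, addr, data):
        self.media[addr] = data     # commit to persistent media...
        return "write_ack"          # ...then explicitly acknowledge completion

def persist(device, addr, data):
    ack = device.write(addr, data)  # the store / flush toward the device
    if ack != "write_ack":          # the step plain DDR cannot give you
        raise IOError("write never acknowledged; data may not be persistent")
    return True                     # only now is the data known to be secure

dev = PersistentMemoryDevice()
print(persist(dev, 0x100, b"journal record"))   # -> True once the ack is seen
```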
And there's nothing to say that the OpenCAPI devices couldn't have tiering.
And so we've looked at some devices where one processor interface talks to DDR
as sort of a cache of memory that
lives behind it, right? So you could tier things. You could also bridge to other memory technologies
here, right? You could come out the back of this. You could have a Gen Z connector just to try and
talk about things, right? So you've got your low latency connectivity to your DDR, and then you can have the back of that go off,
talk to other memory standards, right?
So that's kind of how we see that.
This is a little bit of a rehash of what we already saw,
but we have host processors, we have accelerated processors,
we have accelerator memory.
This could be HBM in the case of like a GPU.
We have host memory, which is, and this is all shared.
We can issue atomic operations.
Those are supported on these protocols.
Different threads can all communicate sort of in one happy family.
Accelerators, and this is kind of interesting,
is that memory devices, as they become more sophisticated, want to be more than just slaves.
If I have a device that maybe is doing compression or is mixing, it wants to be a memory slave some of the time, but it also wants to master operations.
I think we saw some of the earlier presentations where you put a device memory out there, maybe it has the ability to compress and decompress data as it goes.
But to do so, maybe it needs to talk to the processor to learn encryption keys. Maybe it
needs to figure out what decompression dictionary needs to be used. And so you get this hybrid
approach where you have a memory device where it looks like loads and stores, but it's sort of a
smart memory device.
It needs to be able to initiate accelerated operations, right? It needs to master commands.
And so that's another thing you don't get with a DDR interface, is you don't get the ability to have the accelerator generate an interrupt. It can't issue a load. It can't be its own master.
And so the nice part here that we merged sort of the OpenCAPI memory interface with the OpenCAPI accelerator interface is that it can be an accelerator and a memory device at the same time.
And a lot of times the accelerator devices have memory behind them.
And so this enables a wide variety of paradigms on, you know, what is my accelerator doing?
Maybe I'm doing just simple offload where I go tell it to do work.
It does the work. But potentially it's sitting on the network, right, which enables it to do compression,
decompression, encryption. It may have its own memory on the device. If it has its own memory
on the device, it may be accessing that concurrently with the other processor. So these are the kind
of devices we really want to build in these new emerging spaces.
Signaling technology here, I'm just a little comment here. Right now we're at 25 gig, which is what our processors today support. We have a roadmap going out to about 2021 currently,
which shows us going to 32 gig, 56 gig; those are the type of frequencies we're looking at doing in that 2021 kind of time frame.
The other thing that's important is the standard here.
This is fairly easy to digest memory standard.
It matches Xilinx FPGA SerDes.
We haven't come across a single vendor yet who can't support this differential signaling technology.
It also has a lot in common with the next presenter's technology.
The other thing to mention here is we do have work groups to help defining these things.
One of those things we use in the work group is a common 25 gig signaling. We use this signaling
for SMP interconnect. We also use the same signaling standard for our NVIDIA NVLink attach. So same
25 gig, we actually have currently three personalities on our processors. We can talk
OpenCAPI across those links. We can talk NVIDIA NVLink. And we also use the same protocol for
our processor interconnect. And this is kind of an interesting thing in that the silicon area is mostly dedicated to the SerDes signaling itself.
The actual logic that decodes what a transaction looks like is pretty tiny.
And so, so far, we've been successful with having, we have three technologies on this protocol.
And it's really quite tenable.
It's not a significant tax on a processor chip.
So let's kind of look at what an OpenCAPI system looks like
if we think about our future.
We have a derivative power 9 system that will look just like this.
And what is that?
Nearly all of my signal IOs,
or ideally all of the signal IOs coming off the processor,
are this differential signaling.
And that lets us build a very composable system
where maybe somebody wants to build a system
with just lots of DDR bandwidth, right?
Some HPC lab, they got a bunch of codes,
they care only about memory bandwidth,
they dedicate all these 25 gig signaling lanes for memory,
and they have a terabyte per second of memory bandwidth,
right, per socket, right?
10X factor for what you can do today, approximately.
That makes their codes run 8x faster.
Big TCO win for those organizations.
Let's say you want to run a database
that wants lots of persistent memory, right?
So you dedicate half of these IOs to persistent memory,
half of them for DDR, right?
Depending on what exact ratio you need,
it lets you build exactly what you want.
You can use some of these lanes
for accelerators, right? So it lets us take one processor chip, one host silicon, and then compose
picking up different accelerators, different memory, and tie them all together, right? So
that's what we like a lot as CPU designers about these sort of universal interfaces is it lets us
build a variety of optimized systems with different components to them. This is all tied together with the things we just talked about. A little bit of roadmap here.
Open CAPI is kind of an interesting name. CAPI was the name we used for our proprietary
accelerator attach going back to Power 8, which was tunneling these coherent operations across PCIe. That, we saw the latency
was kind of painful. We really needed to get better latency. That's why we transitioned to
the direct 25 gig signaling. We made it open as well because the particular thing we had with CAPI
was a little complex and we needed to move IP from the accelerator to the host. So we made our host do more work, the accelerator do less work,
and then we made it open.
And that's what we have today, Power9.
We've been shipping these for not quite a year, but close to a year,
and that has 25 gig accelerator interface.
For a Power9 follow-on, we will extend our OpenCAPI capability
and we'll add this low latency memory interface.
The Power9 you see here, for its memory, we have two versions. One of them is just traditional DDR.
The other one is our proprietary buffered memory solution. So that same technology in our
proprietary buffered one is what we've put into the OpenCAPI standard, which is what this 3.0
version is, and that's what we'll have our next chip here.
But memory bandwidth-wise, we're talking a standard DDR4 memory subsystem
based around OpenCAPI.
We have like 450, 400 gigabytes per second of sustained bandwidth.
So a massive bump in bandwidth to support accelerators,
high-performance computing.
It's a big bump in bandwidth.
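As a back-of-the-envelope check on that figure, assuming for illustration x8 links at 25 gig per lane and 16 memory channels on the socket; the link width and channel count are assumptions for the sketch, not numbers quoted here.

```python
# Back-of-the-envelope check on the bandwidth figure above. The 25 gig lane
# rate comes from the talk; the x8 link width and 16-channel count are
# assumptions made only for this sketch.
lane_gbps = 25            # signaling rate per lane
lanes_per_channel = 8     # assumed x8 OpenCAPI memory link
channels = 16             # assumed number of memory channels on the socket

raw_gbytes_per_s = lane_gbps * lanes_per_channel * channels / 8   # bits -> bytes
print(f"~{raw_gbytes_per_s:.0f} GB/s raw per direction")          # -> ~400 GB/s
# The sustained number depends on the read/write mix and protocol framing;
# this is only meant to show the order of magnitude.
```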
And this is something where we'd like to see other vendors pick this up, right? Or even accelerators, right? If you want to put a lot of memory behind an
accelerator, use this open CAPI memory interface. So that's what we have on the short list.
Talk a little bit about what these things look like. We found that running this high-frequency interface, we typically want to do through cabling solutions.
The motherboard, if you're going to run 25, 30 gig, the material becomes more expensive.
And so we're actually kind of on the border between, do you want to do all cabled solutions, right?
That way your motherboard can just be power and ground.
We could take these cables off the top of a module, perhaps, and have them go directly to accelerators.
So today we're pretty centric around cables. We have a short reach in the motherboard from the
processor socket, then we have these jumper cables that can go plug into the card. To make this sort
of simple, the first prototype cards we did, we put them on a PCIe bus.
We get power and ground from the PCIe bus.
We plug it in the box.
We run one of these jumper cables to run the 25 gig over.
So what do some of these parts look like?
This is an FPGA card from Alpha Data.
This particular one had the connector on the side.
Production versions of this have the connector at the back.
But essentially, it looks just like a PCIe card, except it has these eight lanes of 25 gig going into the side of it. And this is kind of an interesting card that Mellanox has built, which is a hybrid device. And so it takes their
standard NIC, which continues to use the PCIe bus, but then it has a parallel FPGA on the card that
talks OpenCAPI.
So you kind of think about this as you're able to take your NIC
where the data flows to the NIC
and the NIC can either route commands to the processor
but it also can route commands directly to an FPGA.
And that FPGA you can think of as an extension of the processor.
It's tightly coupled coherently with the CPU
and this enables it to run packet inspection things
as if you had a little algorithm moved onto the card.
So this does show kind of how you can sort of hybridize different devices.
This has InfiniBand Ethernet merged with OpenCAPI.
So different networking solutions can merge with OpenCAPI
to sort of build these devices together.
Just to comment, the 25 gig stuff runs clean.
There's no reason it shouldn't.
It's a fairly standard interface.
But yeah, we do have these in the lab.
We have, there's probably actually
about four different FPGA card variants
that are OpenCAPI cards that people can buy.
So we have a consortium for OpenCAPI.
The president, are you still the president, Myron?
Yes.
Raise your hand.
There's the president.
Elected, yeah.
Anyway,
I'm trying not to talk about that.
All right, so we have,
we're trying to proliferate this,
working to get members
to help build accelerators
and adopt the standard.
And you can join today.
There's different membership levels,
like many of these consortiums.
And feel free to talk to Myron about it.
So that's what I have.
And now we're going to switch to our next presenter.
Good to see everybody here today.
Unfortunately, I don't have any good jokes to start off with,
like Jeff did. So I find myself in the same boat that Austin was in yesterday
where I know I'm between you and lunch.
So I will try and get through this.
And then don't forget we've got the question session at the end.
So if you've got questions, we'll get Jeff back up here.
You're going to ask both of us some questions.
The disclaimer that you've seen otherwise, you know, it says,
hey, I'm going to lie to you while I'm up here because I can't predict the future as well as I'm going to act like I can.
But let me start off with just a little bit about the consortium.
So Gen Z is fairly new.
We were founded two years ago this month, and we started with a 12-member team.
Now we're up to over 60 companies that are with us. And we've been able to show some really great demos.
If you've been to Supercomputer Flash Memory Summit, you've had a chance maybe to see our demos.
But essentially, it allows us to show servers connected to pools of memory and then running right out of that pool of memory.
So we can actually execute code out of the memory pools.
And then we've got a lot of stuff going on.
People like PLDA and Intelliprop have both got IP that's available for us. The good folks at
MicroSemi are starting to do some work on some silicon that will be available in the 2020
timeframe. And we also have some devices that I'll talk about a little later in the presentation from folks like Smart Modular. And then finally, the big thing that we pride ourselves on is we're
completely open. So if you go out to our website, you're going to find our latest specs are all
posted out there. There's no challenge to get to them. And we also put a lot of our draft specs
out there for you to look at, and then we encourage feedback.
We would love to get your feedback.
If you see something that seems like it's missing or something that's not there,
we've got a feedback form that you can give us the feedback right away,
and that helps to make sure that even if you can't become a member,
that you're available to give us the input that we're looking for.
To give you an idea of who our members are,
and I'm not going to read off the whole list, but you can kind of see the 60-ish members that are
there. The big thing we want to show here is we have everything needed for an ecosystem. So you
see the CPU vendors, you see the silicon vendors, you see the software vendors, the connector
vendors, and a number of OEMs that can put the products out. So we're really ready to put something new out there.
And what this has done is allowed us to move fairly quickly.
So you're going to see, if you go out to our website, a number of specs that we've posted around the spec for the architecture itself,
as well as a number of mechanical and PHY specs, because we've done those with a great group of people that have helped us out.
If you see your name up there, thank you very much.
We appreciate your help.
If you don't see your name up there, hint Amber, we'd love to have you join.
Because we think that we're really moving along in an area that the industry is going to need this kind of help.
Why do we need Gen Z?
That's always the question we get.
So it starts off with cores are growing at an exponential rate.
So anybody who's designed with a CPU recently realizes, hey, I've got a ton of cores.
The problem is pin count limits what else can be done on that CPU.
And so when you start to look at the memory channels, they really haven't grown significantly,
right? We had four, we've moved to six or eight, depending on the vendor. And then we're kind of
stuck there because 4,000 pins is already a huge aircraft carrier of a device. We also have limited ourselves in PCIe lanes.
We're anywhere from 48 to 84 PCIe lanes,
so our I-O bandwidth isn't really growing either.
And so when you look at what that does per core,
you see that both of them are negative trends,
which means it's harder and harder to get things in and out of the CPU,
and we want to be able to change that,
and we think Gen Z is a way to go do that. The other thing that makes it important, and we've
had lots of conversations, you know, Dave had a great look at what storage class memories are out
there and available, so I won't jump into that too much, but it's that space between DRAM and SSDs
that really is looking for an answer, right? And it's three orders of magnitude different.
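For a sense of that gap, here are assumed ballpark read latencies; these are illustrative figures, not numbers from the talk.

```python
# Rough latency tiers behind the "three orders of magnitude" gap mentioned
# above (ballpark read latencies assumed for illustration, not quoted figures).
dram_read_ns = 100            # ~0.1 microseconds
scm_read_ns = 1_000           # storage class memory lands in between, ~1 microsecond
nand_ssd_read_ns = 100_000    # ~100 microseconds

print(f"SSD vs DRAM: ~{nand_ssd_read_ns / dram_read_ns:.0f}x apart")        # ~1000x
print(f"SCM: ~{scm_read_ns / dram_read_ns:.0f}x DRAM latency, "
      f"~{nand_ssd_read_ns / scm_read_ns:.0f}x faster than the SSD")
```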
So Intel has done a great job with 3D XPoint of starting to come into this space.
But one thing you look at is you see here storage class memory is byte addressable.
And so it's very much like memory.
But the other piece that's not very much like memory is it's persistent.
And so anybody who's done storage systems for any time knows persistence brings a whole new paradigm to what you need to think about when you're talking to that memory.
It can't just go away because a VR failed somewhere.
If it does, my availability goes away, and I'm no longer useful to the customer.
I need to start thinking of memory very much like I think of storage. It's about availability, not reliability. It's about serviceability and how I
get to it. Anybody who's tried to replace a DIMM realizes, right, I got to get my strap out. I got
to get in and pull something out delicately and put a new piece in. That doesn't work, right? And
the storage guys have known it forever.
That's why you've got U.2, to just plug something in
and make it easy to replace something that's failed.
The other thing, as you look at the bottom of the slide,
is this is kind of what we see going forward.
Right now, today, we have DRAM,
and we have a good set of spinning or solid-state disks.
Storage class memory starts to come in with 3D XPoint and some of the other things
that you saw today.
And over time, it actually is the winner in the space.
So it's going to push the flash down to colder and colder areas.
And it's going to be pushing memory up into just exactly what I need. I don't want to spend a
lot more on memory. So as we looked at Gen Z, we said, what do we need? Well, obviously, we need
high performance. So high performance comes in simple, right, high bandwidth, low latency solutions.
We also need something that avoids protocol translations.
We don't want to jump between protocols
because it just wastes time and energy.
It also has to have low software complexity.
I don't want a large stack that I have to run through.
You kind of saw on some of Amber's slides
how much that takes away from your performance
because you're running through the software stack.
It needs to be highly reliable.
No single point of failure.
It needs to have multi-path.
Again, you can't let persistent memory disappear on you for no reason.
Security.
Everybody today, and that's a big thing that's in almost every news article,
is somehow security has come in.
So with Gen Z being defined just recently,
it thinks about security at a different level than some of the buses that have been defined 10 or 20 years ago.
And so we have the ability to look at the device, and if it doesn't have the right keys, the packet itself may never get through the switch to get down to your device.
So it may never see errant code or malicious code trying to modify your memory.
Flexibility is important to us, of course, so you need to be able to support multiple topologies.
In fact, as you look here, you kind of see this is the standard look where you have all your devices connected in through a CPU.
What we want to do is we want to get accelerators pushed out into pools.
We want to get memory pushed out into pools and be able to have storage and I.O. also in those pools.
So that I now can compose systems rather than have to think about buying everything at one time.
And we'll get into that a little later in the presentation.
Compatibility is important to us because we need to be able to use what's there today.
In fact, PCI to PCIe
transition is what inspired this. If you think about when we went to PCIe, really all you had
to do was have PCI ready software, and it worked immediately on PCIe. Over time, the OSs and the
applications started to take advantage of what PCIe brought. So we're doing the
same thing. You don't need to have any new OSs. You don't have to have modified applications to
be able to use Gen Z in this solution that we're looking at here. And then of course it needs to
have an economic advantage or people won't pick up on it. All right, there we go. So then all you do is you put Gen Z in as a fabric.
It has multi-paths so that if you have a failure in one path, you can go through another path.
And it allows you to put things like your DRAM or your storage class memory as well as your other devices out on this fabric.
And they can be inside the box or they can be outside the box.
Let me give you a quick example of just memory.
So if you think about the processors that we have today, we've got an MMU.
And Jeff actually talked nicely about that.
We want to be behind the MMU, not the IOMMU.
So you have your standard connection to DRAM.
What we do is we just put the Gen Z protocol engine in here,
and now we can talk to our memory directly, including storage class memory.
And you'll notice now I have dual paths to them,
so I can support that multiple path need that I have.
You also find that this memory connection, if you think about what we're doing,
we support both PCIe and 802.3
PHYs or Ethernet PHYs. So when you go to that, you start to see, you know, as simple as 16 gigatransfers
per second, like we have in today's PCIe, all the way up to 112 gigatransfers per
second, which is the next generation Ethernet. All those PHYs work through there. And what that does is give you a per-channel bandwidth
of up to 400 gigabytes per second.
Put that in perspective: that's eight DDR5 channels
that you can put out on the end of one Gen Z link.
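Roughly how that eight-DDR5-channels comparison falls out, where the DDR5-6400 channel rate is an assumption used only for the sketch:

```python
# How the "eight DDR5 channels" comparison works out. The Gen-Z per-link figure
# is the one quoted above; the DDR5-6400 channel rate is an assumption used
# only for this sketch.
genz_link_gbytes = 400                    # quoted per-channel Gen-Z bandwidth, GB/s
ddr5_channel_gbytes = 6400e6 * 8 / 1e9    # 6400 MT/s x 64-bit channel = 51.2 GB/s

print(f"~{genz_link_gbytes / ddr5_channel_gbytes:.1f} DDR5 channels per Gen-Z link")
# -> ~7.8, i.e. roughly the eight channels mentioned above.
```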
And then kind of going through the rest of it,
we support the point-to-point as well as switch-based topologies,
so if you're trying to go inside the box, you can go point-to-point.
If you want to go out onto the network or a fabric, you can exit the box.
This is what Jeff already said you were going to see.
Here's the connector.
So this is the connector that we worked with Tyco with called the sliver connector,
and you can see it has multiple of what we call chiclets: one chiclet, two chiclets, or four chiclets. That's just a simple
8, 16, or 32 lanes. If you liken that to PCIe, that's the by 4, the by 8, and the by 16.
The difference is the bandwidth. You can see a by 16 starts at 100 gigabytes per second, going up to the 448 gigabytes per second I was talking about.
Even on just eight lanes, it can start at 25 gigabytes per second and continue up from there to 112.
There's also a host of different types.
So we can go right to the cable.
That's what we were talking about.
It'll help you limit your expenses right if you if you go to some really low loss material on your pcb you're
going to have a large expense especially on a big server size pcb so instead get off to a cable get
off to some optics look at your right angle or board to board style connectors in that same space.
We've also done quite a bit of work on form factor.
So we have the single-wide and the double-wide form factors that allow you to plug in.
And working with our partners that are in the memory areas,
what we found is on the SSDs,
we can get up to 48 terabytes in a unit.
For storage class memory, the prediction here working with those is about 5 terabytes in a single unit.
And in DRAM, we can get up to 280 terabytes.
That form factor is what Smart Modular showed off at Flash Memory Summit as a 96-gig device just a month ago.
So we're starting to see these come into the
marketplace. You can also see that we looked forward considering what other types of devices
we wanted to connect into Gen Z. And so we've got GPUs that can be connected in, FPGAs. Obviously,
those would be able to fit things like SmartNICs and other storage devices.
One thing I should have mentioned there is, I'll go back to it just real quick.
If you look at the connector, we have something that's unique in the connector.
I can take, obviously, a little 1C device and I can stick it into a 4C socket.
But this connector actually allows you to take the 4C device
and stick it into a 1C socket.
There's nothing that prevents you from using any of the devices
in any of the connectors
because they all have the same keying that goes on.
So if I don't have the perfect slot for my device,
it's okay because I can add that in anyway.
Then what does that give us? That looks like something where we have a nice enclosure. We
actually go out and you start to think about, if I can go back, something that has a drawer
full of devices. So I can take a drawer and I can add a couple of memory modules.
I can add a CPU, I can add a GPU and maybe an FPGA and plug it all in. And now I've got a nice
little system, right? And as I add more drawers, I can add more components and either add them to
the current system or start to make a new system. And so as we get into what does it
mean to be composable, Gen Z allows that composability at a rack scale.
Now let's talk a little bit about the OS stuff. I said, hey, you don't need a modified OS.
And that's because, as you look at what we're doing, we've got the standard NVMe driver that we can use. With just very minor
modifications, we've been able to make it so you can talk to the Gen Z hardware. So if you
want to put just standard SSDs out there or NVMe drives, it's available to you. If you wanted to
talk to memory, you can do it one of two ways. You can talk to it as a memory device with just load store or you can
talk to it as a block device using that same driver. And then finally we do
messaging, and you can do messaging in one of two ways. The easiest way that
we've put together is essentially like you see in a hypervisor. You get a virtual NIC or an eNIC,
and you're able to use that to message to your other devices.
Simple messaging can be used to tell another device out on the end of the bus,
hey, I've got data ready for you.
Here's where it's at.
Let's go use that data.
And let that device pull the data to itself.
So you're not using the CPU cycles to do a lot of data movement.
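A minimal sketch of that first, descriptor-based style of messaging, with all names invented for illustration: the host sends only a small "data is ready here" message and the remote device pulls the payload itself.

```python
# Sketch of that first messaging style: the host sends only a small descriptor
# ("data is ready at this offset, this long") and the remote device pulls the
# payload itself. All names here are invented for illustration.
shared_fabric_memory = bytearray(4096)     # stands in for fabric-attached memory

def host_announce(offset, length):
    # the "doorbell": a tiny message, not the data itself
    return {"offset": offset, "length": length}

def device_consume(descriptor):
    start, length = descriptor["offset"], descriptor["length"]
    return bytes(shared_fabric_memory[start:start + length])   # the device pulls the data

shared_fabric_memory[0:5] = b"block"
msg = host_announce(0, 5)                  # the CPU's only work: one small descriptor
print(device_consume(msg))                 # -> b'block', moved without CPU copies
```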
The other would be to do an OFI-type device where you actually, for something like HPC
now, you have a very fast, about 250 nanosecond messaging opportunity.
Composable memory is one of the things that we all talk about,
but today we don't know how to do that.
The current buses, whether it's PCIe or Ethernet, InfiniBand,
all of those don't really allow you to run at the speed of memory.
And so Gen Z, that was the main thing that we wanted to do,
was be able to show how memory can get outside the box
and allow you to compose that into a system later.
If you think about it right now, all memory sits behind a processor of some sort,
whether that's a CPU or a GPU.
And anybody who wants to get to that memory is really a second-class citizen.
I can't get to it unless I go through the processor. Also, if something happens to that
processor, maybe it goes bad, maybe a VR around it goes bad, all the memory that sits behind it
is gone, right? So it doesn't really meet that high availability model that we need in memory. On the Gen Z side of things,
we allow multiple different ways to connect up. So you can do a mesh,
you can do a simple direct connect, and you can also do a fabric. And so you can connect to these
different devices and everybody is a first class citizen. If a GPU wants to hit a section of memory, it can do that.
And this memory can be assigned either directly, so allocated, or it can be shared and allow for multiple devices to talk into that same memory space.
We also support PCIe devices in a simple thing called logical PCIe device.
What it does is, when it's a Gen Z-looking piece,
what you see here is nice native Gen Z going through a switch.
That switch has a 30 nanosecond cut through time compared to a PCIe switch.
And then you look down here at your different SSD, GPU, whatever you have.
And so you're saving about 200 to 240 nanoseconds round-trip time by just using Gen Z.
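Working backward from those numbers, and assuming the request and the completion each cross the switch once, the implied per-hop difference looks roughly like this:

```python
# Working backward from the numbers above: a 30 ns Gen-Z cut-through versus the
# implied PCIe switch latency, assuming the request and the completion each
# cross the switch once (an assumption made for this sketch).
genz_switch_ns = 30
round_trip_savings_ns = (200, 240)

for saving in round_trip_savings_ns:
    implied_pcie_switch_ns = genz_switch_ns + saving / 2    # per traversal
    print(f"implied PCIe switch latency: ~{implied_pcie_switch_ns:.0f} ns per hop")
# -> roughly 130-150 ns per hop, versus ~30 ns for the Gen-Z switch.
```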
You can also do things like multipathing, so that if you lose one of your links, you still have access to it.
From a system point of view that doesn't have the Gen Z understanding, these legacy devices are still there.
They look just like a standard PCIe device. And you access them by doing memory writes
to essentially the space that's down here
that looks like your standard PCIe register set.
And so you're communicating with them
at very low latency, very low CPU overhead.
What that means is you start to think about a composable system. Now you're able to do
things like pick my CPUs, pick my storage class memory, my SSDs, maybe even a gateway to a bigger
storage system, get my data services like RAID and other things put together composed as an
individual storage box.
What that does for us, when you start to think about it at a server level, is now I have these pools of resources, right?
And when I wanted to take and make something out of it, I start to add a rack scale fabric
and a server builder, for lack of a better term, right? A management piece that actually goes
out and discovers the devices and allows you to put them together. Then I start to build a system
and I can bring my, you know, just the right amount of CPUs in, just the right amount of DRAM,
storage class memory, et cetera, all go into the system. Now I'm ready to load my application on top of this, and it looks like
just a standard server. There's nothing about it you can't do just like you do with a server you
might put into a rack that's a 1U or 2U server. So nice bare metal server. I load my application
on top of it, and now I'm ready to start building my next server that has different characteristics. So the next server needs a lot more storage than the previous one.
I build those in.
If I'm wrong, I can always add more later.
Or if I'm wrong, I can take something back and reallocate it to the pool.
So one of the things you heard about yesterday was, hey, let's quit buying all this over-provisioned stuff.
It allows you to put your over-provisioning just into the pool.
So instead of the 60%,
I think that was being talked about yesterday,
per server,
you can over-provision your pool slightly
by maybe 20%.
And then if you need to,
you buy just what you need.
You don't end up buying a little bit of everything
to get that piece that you need.
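A worked example of that over-provisioning point, using the roughly 60 percent per-server versus 20 percent pooled figures mentioned above; the server count and unit sizes are arbitrary.

```python
# Worked example of the over-provisioning point, using the roughly 60% per-server
# versus 20% pooled figures above. Server count and unit sizes are arbitrary.
servers = 10
needed_units_per_server = 100     # resources each server actually needs

per_server_model = servers * needed_units_per_server * 1.60   # 60% headroom in every box
pooled_model = servers * needed_units_per_server * 1.20       # 20% headroom held once, in the pool

print(f"per-server provisioning: {per_server_model:.0f} units")                 # 1600
print(f"pooled provisioning:     {pooled_model:.0f} units")                     # 1200
print(f"hardware avoided:        {per_server_model - pooled_model:.0f} units")  # 400, or 25% less
```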
If you look at an ideal state, and it looks like my fonts didn't come through well,
but I can have compute.
I can have a nice Gen Z type of a fabric here.
I can go to something that is memory, maybe some SCM, some SSDs,
and then some standard hard drives down here as a rack. When I go to put it together, I actually am looking at it as,
what's my application that I want to run?
It's not what hardware did I have to buy.
It's what application did I want to run and what does it require?
So you can look at, hey, I can now, my in-memory database had a need.
I can do some software-defined storage. And so if you look at the
advantages, the customer gets the capitalization to pay as you go. For the end users, I get to
adjust it for what I need. And for IT folks, it really allows you to orchestrate the business
and become very efficient. So the benefits. Compose things. Then go off, get those resources untrapped. So
make it so that if I lose a piece, it's not a big deal. I can just reallocate something new.
Avoid the over-provisioning. Purchase the resources when needed. Increase your reliability,
availability, serviceability. Repurpose the old stuff, right?
When I don't need it anymore, that's great. Put it back in the pool and use it for my next server
needs. And then allow things to move at the speed of that particular technology. Didn't hit on this
a lot, but if I go to Gen Z, I put the memory controller out with the memory,
or the media controller, rather, out with the memory device.
So I don't have to wait for a CPU spin just because I want to go to the next best memory.
And you saw in Dave's presentation a number of memories that are coming down.
And some of them were in the dream sequence.
There's a lot of people working on different types of storage class memory.
They're going to be coming down, and they're not going to be aligned with a CPU cycle.
So you need something to abstract you from that CPU cycle to this,
and we think Gen Z is a great way to abstract yourself.
Finally, with a little bit of work that we've been doing and some modeling that we've been doing, we see some great opportunities for performance boosts.
Spark, you know, is 15, 20% faster because it's just modifying existing frameworks, and we actually have that Spark work out today. If you start to think about the new algorithms you can do, because now I can really put things together in the way I want to do them,
I can start to see 100x increase.
And if you actually go brand new rethink of what I'm going to do,
we believe that you can see things like in the financial world
where you get about an 8,000 times speed up of what we're doing.
So with that, I think we now move into the question and answer
session, but I would encourage you to ask some great questions and join Gen Z.
Thank you, Curtis. Yeah, come back up. Thank you, Curtis. While we assemble the speakers,
we're going to enter the discussion part of this. We're going to have a microphone that moves around for questions, okay?
We're recording, and that'll make better fidelity for the recording and everything else.
And so I just want to kick it off while we get that going and ask our speakers, you know, thanks for not food fighting.
It's great.
I love it.
You actually showed some hints at how these things might
dovetail, but maybe you could open the discussion by actually telling us some ideas about how these
things will work together in a complementary way as we go forward. You want to go first?
Yeah, I think I covered on this, right? I mean, the key thing, so if I look at the differences,
let me try and highlight this. OpenCAPI is very centric on
having very low latency and tightly coupled coherence between a host silicon and the
accelerators. I feel like Gen Z, while you say you can be host memory, um, I think from a latency
perspective we're much lower latency, but we have nothing in OpenCAPI associated with
switches, partially because the optimizations we have to achieve low latency and some of these
coherence things aren't tenable in switches. So that's how they go together. I agree. You know,
you saw on the one chart with the pyramids, we don't see standard memory going away from
attaching to the CPU.
And that's something OpenCAPI brings: a great attach to the CPU.
What we see is a tiering of memory that needs to occur, much like we see in tiering of storage.
And so as you go from something that might be integrated into the host to something connected directly to the host,
we think the next step is something that goes across Gen Z.
We don't have the same latencies.
You talked about your latencies or the latencies of a DDR type of a channel.
Gen Z is larger than those, so we're not trying to replace those.
And I think that's where we dovetail quite nicely. And an opportunity to have directly connected memory and then have larger, less expensive pools of memory that you can use, as I showed in some of my later slides.
Thank you.
Where did the mic go?
It looks like it's been distributed already.
Hi, everyone.
I'm Steve Bates from Eideticom.
So can you guys comment a little on the operating system support that's going to be needed to make all of this work?
I just had a look in kernel.org, and I see a driver for OpenCAPI, which is great.
But, Curtis, to your point, right, you're imagining the world where I have a processor running Linux,
and Gen Z is going to arrive and basically create
maybe four or eight or 16 terabytes of new memory
inside the operating system.
The operating system also has to be cognizant that some
of this is memory that I can put into my slab allocator.
Some of it's memory, but it's kind of slow,
and it's also persistent.
So I can't put that on my slab allocator.
I've got to do something maybe DAX-ish with it.
And some of that memory is memory mapped accelerators
that I better not do loads or stores against,
because it will actually make an FPGA do something.
And then you had a slide that said there's no OS changes.
I can't reconcile that.
I can't.
So can you comment on that?
Yeah, absolutely.
So Gen Z maps into the standard memory map of the processor.
Right?
So I...
And it doesn't work on RISC-V,
and it doesn't work on pretty much anything except AMD64.
So our earlier focus has been on some of the AMD stuff,
some of the Cavium ARM stuff, since they're members.
So there are limitations to memory space there.
So you'd have to be cognizant of those limitations early.
But again, if you're looking for a reasonable amount of memory,
then you can use Gen Z to attach to that.
As you're going to bigger and bigger memory areas,
then yes, we're going to have to continue to work
with the ARM folks
and some of the other processor guys
to allow for those large memory spaces to occur. So I guess what I'm asking is, can you go talk to the Linux community
and start to help point out
where the hotspots are going to be
inside the Linux kernel?
Because I think it would be good for you
as an organization to do that ASAP.
Great, great, great input.
Yes, and as you might have noticed,
Red Hat is actually a member.
And so those discussions are going on,
but I'm not sure it's out into the wild yet.
Yeah, I mean, I think any of the software sides
of different types of memory seem,
I mean, it's an emerging, right?
And so Linux's capability to understand
different types of memory, it sort of has to happen.
And you see it in, you know,
whether or not you're doing the Intel,
plug it into a DIMM slot, or whether or not
it's an OpenCAPI slot.
The software looks kind of similar.
It's gotta be cognizant of all the different types
of memory and understand how to do that.
That's a big software challenge,
but it's not really tied necessarily to the protocol. Is that NFIT?
Is that JEDEC?
And then the OSs have to know how to talk to that interface to go, okay, so you're giving me 7 terabytes, but it's not DRAM-like.
It's more like something with this kind of access, and I've got to work out what to do.
Yeah, I mean, think about the tiering.
If you get high bandwidth memory or HBM 1.2 into processors,
that's going to be one tier that's very fast, low latency, but not very big.
You're going to have standard attached DRAM that is another class.
You're going to have storage class memory that's yet another class.
I'm not sure that Gen Z in and of itself is the only one who's going to have to deal with that problem,
but we're having those conversations.
That's going to be a problem that the world has.
It needs to identify itself in speed and whether or not it's persistent.
And so, yes, those are being worked, and JEDEC is one place to do that,
and they're working actually with the PMEM initiative on coming up with some of those pieces as well.
We're going to give you your exercise.
Thanks, Lucy.
I noticed both OpenCAPI and Gen Z have FPGAs.
A lot of people play around with this stuff.
NVIDIA seems to be a member of OpenCAPI but not Gen Z or CCIX.
Do you guys have any thoughts about why GPU accelerators,
which are very common in data centers, AWS and Azure data centers, are showing up,
but it doesn't seem to be a point of focus for either OpenCAPI or Gen Z.
You've got them as a member. I'll let you take that one.
Yeah, I don't think I can really speak for NVIDIA as I'm an IBM employee and not an NVIDIA employee.
So, yeah, I don't think I can say much about why they're in one or the other.
I'm not quite sure I got the rest of the question.
Accelerators.
How come there aren't any NVIDIA or GPU accelerators you can play with?
AMD or?
Oh, for OpenCAPI?
Yeah.
Yeah, I think it's partially a point in time,
right? We had NVLink
attach before we
had OpenCAPI attach.
And so that enabled... These things are just two
years old, remember.
Well, and
as you look at
the partners we have, AMD
is part of Gen Z,
and that's not just the processor side.
So they are looking at this.
But, you know, Tom brings up a very good point.
We're young.
We have to prove ourselves.
And right now, GPU folks have a lot of other things they're focused on.
Yeah, I mean, the first thing is FPGAs because, heck, those are easy.
And then memory is probably the next easiest one,
and then accelerators probably come after that.
In Gen Z, you are introducing all these devices under the MMU.
So the question is,
doesn't that also introduce a new category of problems,
especially with error handling,
with stalling CPU cores?
So how is that to be solved?
Yes.
Yes, it does.
You know, when you start to put more and more things behind the MMU, if you get a problem,
you start to stall cores.
And so we are looking into the management of that and how do you work through it and
are there changes that would need to be made
in the MMU to allow for this?
How do you maybe split transactions and other things
so you don't stall the core?
Thank you, and second small question.
You also said that you are able to emulate
legacy PCI devices with Gen Z.
So what about, for example, interrupt handling
when you are under MMU?
When you're under what?
With interrupts.
So do you have an idea to support interrupts?
Yes.
So essentially you're just tunneling PCIe across Gen Z,
and so it looks very much like a standard PCIe interface
to the host.
Okay. Thank you.
But we're not really used to MMU-attached devices passing interrupts, I think, was part of that question.
Yes. So it is an interesting confluence, I.O. and memory.
It is, and yet we think that it really is not going to be used for everything, right, but for
things where you're looking for low latency accesses to PCIe devices, that's where it's
helpful. That's where we were showing kind of the GPUs and the NVMe devices is the real candidates
for that. So, yeah, oh, sorry, I was, we, I mean, we have interrupts in OpenCAPI. They're address mapped.
It kind of looks like a special magic address.
So do you think there's some IO action
to just when you might want to...
Well, there's...
Yeah, this is maybe a hard conversation to have.
But, yeah, I mean, there's interrupt support.
You do have some interesting things
where if you have
a page fault, like an I.O. device typically
would have never had a page fault in the past.
Yeah.
But that is a new thing maybe devices have to deal with.
So in Gen Z you said a processor is always going to have
its own memory, and so we have two classes of memory.
One is slower memory over the Gen Z, right? So I'm assuming
the instructions will always be in the memory directly attached to the processor, and the data will be
stored in this slower memory.
I don't think that's going to be the case, um,
you know, the local memory size is going to continue to be small, and if you're running an important task, absolutely you might want it in that local memory.
But if you think about a machine that's virtualized across a wide space, you're going to run out of that local memory perhaps, and then you're going to put things out into the Gen Z attached memory.
And you can still run from there, and so it kind of gets back to the question before of,
hey, I need to start to be aware of what's my priority
so that I go into the fastest memory possible
with all the characteristics I'm looking for.
Future applications are going to maybe talk to the OS
and say things like, persistence is most important to me.
Put me in persistent memory
and then give me the fastest stuff in that category.
That persistent may be all sitting out on Gen Z,
but since persistence was more important,
your application is going to be put out there.
If instead, no, give me the very fastest memory
and I just need a little chunk of it,
maybe you go into HBM.
Thank you. I can tell you that execute in place is a very interesting question with memory.
We begin to provision the speed of memory like we traditionally provision
storage. And it's quite an interesting question. Is it better
to put it on moderate rate memory and execute it that way or is it better
to migrate it to high-speed local memory?
And these things blur that, thereby bringing in that question of management.
I mean, from a text perspective, from a code, I mean, some of it might be infrequently executed.
Why waste your expensive DRAM on it, right?
And the fact that it is addressable lets you have – you can go and infrequently run code, not have to page it in. It can live out in
cheaper, higher latency memory, and depending on the access characteristics, that's
where you put it. Some code could be local. Some code could be remote.
It all just works. It depends on the access rates.
I do think that applications are going to get more and more aware.
Big thing we've got to do is get this out there first, and then let the applications
come along. I think Amber talked about it. It takes a while.
We've got to get the hardware in place. We've got to let the software developers feel like they've got something
that's stable. Then they'll start to do their work.
Question. Could both of you speak about multi-mastering,
either in interest or demonstrated capability?
Like, could there be, in the Gen Z case, could you have two CPU instances talk to the same device across the fabric?
Yeah, so I'll start, but I definitely want to.
We actually currently have demos that do that.
So we take, right now our demo uses a server that's just
nothing but a memory device, and then takes four other servers, and we allocate memory to each of
the servers as separate pieces. And then we've got another space that's, uh, shared amongst all
four servers, and so we can have all those servers talk into that one shared spot.
Right now, it's the last guy to talk gets to put his data in there.
We don't have the coherency going yet,
but that's one of the pieces we're working on is the ability to have coherency in that shared space.
Yeah, I'm not sure.
I mean, from a, you know, there's a question here.
Is this under one OS or under multiple OSs? This is maybe a differentiator.
Under one OS, it's just sort of business as usual.
Under multiple OSs, that's probably more of what you guys are targeting rather than OpenCAPI.
Question?
Question there, I see.
Here we are.
As you see, what are the major challenges for Gen Z?
Major challenges for Gen Z.
Well, let me talk about some of the near-term stuff.
Right now, we're doing a lot of work on our management spec.
So if you have interest in that, come join us.
You see a lot of the work that I kind of say,
hey, this is going to occur.
It takes a lot of software effort
to get that just working.
So there's that piece.
The other is, how do we get more ubiquitous?
So you heard people say,
hey, where's NVIDIA in this?
We need them to come along for the ride
when they're ready.
Where's Intel? They're coming along for the ride when they're ready. Where's Intel?
They're coming along for the ride in some ways and not in others.
We'd love to have them join in and see the value here because let's face it,
they're 90-some-odd percent of the data center CPUs.
And that was a Gen Z targeted question, so I don't have to answer.
Did I see a question over here before or no?
Okay, I guess not.
All right, well, thank you, everybody.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Thank you.