StorageReview.com Podcast #131: SSD Industry Vet Jonmichael Hands – The Latest on Enterprise SSDs
Episode Date: July 15, 2024
This is another live podcast in which Brian checks in with long-time friend and…
Transcript
We are live here with my good friend John Michael Hands who, gosh, I almost said Intel,
which is not the case.
That's how you and I have known each other for a very long time, but you're now with
Fadu.
You've been in the industry forever.
You know more about Flash than most people. So this will be an interesting conversation.
How are things going over there?
I was going to say last time we caught up, it was with some Topo Chico.
I had to go find one from the fridge to make sure we were aligned.
No, yeah, last time we caught up was about Chia and blockchain crypto stuff. The crypto craze is still kind of going, but I'm back in the SSD world.
Well, yeah, you brought it up, the crypto thing. What's going on with that? I know Chia was fun for a while. Is it still?
Yeah, a lot of these projects just take a long time.
You know, it wasn't a great IPO market or a great crypto market the last two years. And it was actually weird timing, because the storage market was getting just pummeled too, right? It was like six quarters of absolute pain.
And so, yeah, I went back to tech late in the fall, earlier in the year. And then AI just jump-started the SSD market back to life, overnight, a 50% increase back to normal. It went from being the worst year ever to potentially the best year ever for SSD revenue.
Well, it helped though that during COVID,
the NAND prices got absolutely destroyed
for a long period of time.
Clearly, they've been on the way back up
the last six or eight months.
Anyone who's had to buy SSDs will tell you that.
I've heard more complaining
in the last month about SSD prices than I'd heard in the three years prior. So this is
a renewed fresh pain, I guess, for people that aren't used to the cycles of the ups
and downs of NAND.
Yeah, I mean, everybody, you know, like WD has their big AI data cycle, Solidigm's obviously been talking about these different stages of the AI life cycle. Every company is basically seeing this impact in a different way. But it's obvious: the more money that is spent on capex for data centers, the more flash they're going to need. What percentage of total capex is it? We don't know, but it's a lot more than it was. Even though storage is kind of the unsung hero of AI, nobody really cares about it until something breaks or something gets slowed down or the GPUs aren't being fully utilized.
Well, I mean, you hit on like six different points there,
and we should dive into this a little bit more. Storage, the feedback
we're getting and the projects we're getting asked to engage on are to, you know, highlight how
storage is important to this AI journey because you're right, the storage isn't a problem until
it is, but legacy SAN infrastructure is not necessarily ready to fuel super high-end AI GPU work.
And I guess we've got to make some delineation here, right?
Between AI that's in a 2U server with two L40Ss in it
versus racks and racks and racks of eight-way systems per rack
and liquid loops and all this other stuff.
That's a different thing.
The pictures we've been showing of X's data center are different.
Yeah, one of the cool things that I saw,
the meta engineering has been really good about putting out blog posts
about kind of what they're doing.
And they actually shared a lot of information on their website
about what they're doing with the storage on their two 25,000 H100 clusters
that they trained Llama 3 on.
They mentioned that they're building an exabyte scale already just for these clusters.
So exabytes of flash extra just for these clusters.
They mentioned that they're doing it all over 400 gig Ethernet or InfiniBand.
So they have this disaggregated storage box.
They mentioned it's kind of like Yosemite v3.5 that they have in OCP.
It's like more drives, disaggregated storage.
But, you know, they're shifting from like 4 terabytes to 8 terabytes, now to 16 terabytes, in this disaggregated storage tier, because they need a lot more flash capacity for training these large multimodal models. And then you have the fast SSDs
like inside the training servers,
which is just obvious.
Like you want Gen 5, you want the fastest interface
because you're dumping these checkpoints super fast.
You just want fast sequential write,
good random read for this training.
But yeah, it's still like,
if you look at one of these H100 HGX systems,
right, eight H100s, the system costs somewhere between $270,000 and $300,000 if you buy it from
a distributor. So even the eight SSDs in there are like 3 or 4% of the total cost of
the server. It's very small, right? So going high end on the SSDs Gen 5 is not going to like
significantly increase the cost of these training servers.
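As a quick back-of-the-envelope check on those figures (illustrative arithmetic only, taking roughly the midpoint of the quoted server price):

```latex
0.03 \times \$280{,}000 \approx \$8{,}400, \qquad 0.04 \times \$280{,}000 \approx \$11{,}200
```

That works out to eight drives at very roughly $1,000 to $1,400 apiece, which is why going high end on them barely moves the total.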
Well, it's almost more of a capacity play at that point, right?
Getting as much storage in that cache in the GPU server as you can is part of it.
But then there's all sorts of other technologies,
like WD has been demonstrating their NVMe over fabrics, OpenFlex chassis that's designed to try to get these storage drives
via GPU direct adjacent to the GPUs.
Then there's everyone else with a parallel file system out there
talking about the Weka guys, DDN, Vast Data,
all talking about these things.
But I think all this stuff confuses the issue at times,
because again, AI is not a monolithic singular thing.
Have you guys spent some time thinking about
what those different parts of AI look like
or the workflows there?
Yeah, it's funny.
I just joined the MLCommons storage benchmark stuff, and I'm starting to mess around with some of these AI storage benchmarks. Look, I spent a bunch of time doing benchmarking on inference, just to see, hey, is there anything storage intensive there? And when you're doing inferencing, there's zero storage. It's all running from GPU VRAM. Even if you go to system RAM, it's like 10 times slower. So the only storage you need in inferencing right now is just for loading the model.
Now, I think people are figuring out that these RAG systems are really interesting, right?
Instead of just giving everything into a prompt in one giant context window,
like it's just better just to have the data right next to it in a database.
And then the systems are going to get smarter about just deciding,
okay, let's look up where that data is in this RAG and then use that for the prompts.
And so those workloads are just going to look exactly like what we already have for databases, like the NoSQL databases that we already have. You want good sequential write, good random read performance, good quality of service during random reads. Everything that we've already done to optimize for hyperscale databases and NoSQL databases, for RAG, that's just going to be already there and built in. So that's awesome, we don't have to do anything else.
I've talked to a bunch of folks about this checkpointing, you know, whether it's just done locally to the server or whether you do it over NFS. It just depends, right? The answers we got were very different. Some people will checkpoint more frequently because they're training a model where they don't want to lose a bunch of GPU cycles, and it's actually not very expensive to just take a bunch of checkpoints. Then if you need to roll back a day or two, it's no big deal. It's funny, they've been doing this in the HPC world for ages. I remember one of the very first SSDs we worked on was this PCIe x8 add-in card, and Cray was actually the first one to use it for HPC. They used it as a thing called the burst buffer, where they're just dumping the DRAM into flash periodically to basically checkpoint these HPC systems.
AI training workloads are exactly the same thing.
So at some point, some people checkpoint, is it every 10 minutes?
Is it every one minute?
As the models get bigger, you need a lot more bandwidth to checkpoint.
Do you do it on an array?
Do you do it locally before?
It just depends.
I'm sure even within one company they might have different checkpointing setups for a bunch of different model types. So all I know so far from all my research is: fast drive good.
Faster or bigger, I guess, depending on what you're doing too, because that's the other big bit.
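To make the checkpointing pattern concrete, here is a minimal sketch of the periodic-dump-to-local-flash loop being described: write the training state out as one big sequential file every N steps so a failure only costs the last interval. Everything here is a placeholder (the toy train_step, the in-memory state dict, the assumed /mnt/nvme mount point); real frameworks have their own save/load helpers.

```python
import pickle
import time
from pathlib import Path

CKPT_DIR = Path("/mnt/nvme/checkpoints")  # assumed mount point on a fast local SSD
CKPT_EVERY = 500                          # steps between checkpoints; tune to how much rework you can tolerate

def train_step(state):
    # Placeholder for one real training iteration.
    state["step"] += 1
    return state

def save_checkpoint(state):
    # One big sequential write -- the access pattern that favors fast sequential-write drives.
    CKPT_DIR.mkdir(parents=True, exist_ok=True)
    path = CKPT_DIR / f"step_{state['step']:08d}.ckpt"
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path

def train(total_steps=2000):
    state = {"step": 0, "weights": [0.0] * 1_000_000}  # stand-in for model/optimizer state
    for _ in range(total_steps):
        state = train_step(state)
        if state["step"] % CKPT_EVERY == 0:
            t0 = time.time()
            path = save_checkpoint(state)
            print(f"checkpointed to {path} in {time.time() - t0:.2f}s")

if __name__ == "__main__":
    train()
```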
Yeah, the big one is interesting, right?
Because it's funny, we were talking about the SSD market crashing last year. It got to the point where QLC, even TLC, was like three times the price of a hard drive. When we wrote the SNIA TCO model, that was like the golden crossover: if you could get to two and a half to three X the ASP per gigabyte, almost all the TCO models just crossed over for flash versus hard drive, for all storage use cases. But then it bounced back up, and now it's back over 10X.
But another phenomenon is happening where the data centers are running out of power. Everybody's trying to secure five megawatt, 10 megawatt, a hundred megawatt data centers. They're talking about gigawatt data centers now to train these models. And eventually they're going to figure out, oh yeah, if we can save three or four X on the power in storage, who cares about the storage cost? Just go to the more expensive drives if we can reduce the power, because you need all the power you can get.
So I do think that's happening and it's going to come.
There are other use cases for big drives, just that the fact that they're bigger and
you don't need a bunch of network bandwidth.
You can have a bunch of stuff local to the server and cache a bunch of the training models
there so you don't have to keep grabbing it from the network. So
a lot of data centers don't have like fancy 400 gig or 800 gig
networking, they have like, oh, shitty networking.
No, I mean, Broadcom sent us four 400 gig NICs this week or last week, and those are the first real high-speed NICs we've gotten in. You talk to people and they're like, well, we run a hundred gig Ethernet end-to-end, it should be able to fuel anything. And it's like, well, no, because when you start passing that over the wire, it's really not all that fast, even 40 gigs. So we've got a server, one of these Supermicro systems over my shoulder here, this E3.S server. It's got two OCP slots. And now Broadcom's got single-port 400 gig cards that we can drop in there and get 40 gig a second out of each port. So now we've got 80 gig out of the box, which starts to make it much more useful for whatever you want to do with it, whether you want to feed a GPU server or be a high-speed cache for data to feed the GPUs.
But lots of opportunities there.
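Reading those figures as bytes per second, which is our interpretation of the quoted numbers rather than a vendor spec, the arithmetic is roughly:

```latex
\frac{400\ \text{Gb/s}}{8\ \text{bits/byte}} = 50\ \text{GB/s line rate per port}
\;\Rightarrow\; \sim 40\ \text{GB/s delivered per port},\qquad
2 \times 40\ \text{GB/s} \approx 80\ \text{GB/s out of the box}
```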
Yeah, there's a reason why these,
when you look at like,
when you get a quote for one of these 8x HGX systems with H100s,
they come with eight 400 gig NICs.
Like, you just need like,
if you're building big GPU clusters,
you need crazy network bandwidth.
Well, so you started talking about how AI has driven storage re-awareness, right?
Everyone's got to rethink their capabilities, their capacity, their backup plan.
What do you want to keep? What do you want to protect?
But the switching business is fired up too, because I don't know how many people you talk to. I talk to them all the time
where they're at 10/25 gig and they think they're cruising along because they're used to just
serving up business apps. Now, these guys aren't the guys that are going to put a rack of eight
of these $300,000 systems to work for training. But even if you pick up a pre-trained model and then apply your data
to it, there's still a pretty heavy lift at moving stuff around and even keeping a relatively benign
2U server with a couple of GPUs in it going. Switching now is a real big part of that.
Yeah, it's actually amazing. I talked to a lot of companies, like even some of these SSD companies,
and you're like, okay, in your DevOps,
how many GPU servers do you have?
And a lot of these guys are like, none.
It's like, that is not the right answer for this AI stuff. You can't just rely on external APIs. I mean, my basic benchmarking shows that you can get like 1/20th the price of OpenAI per token on an open-source model, fairly easily, without tweaking or optimizing very much. But the real value of running these GPUs locally is having your private data sets, because a lot of companies aren't just going to dump all their stuff in AWS and be like, okay, we can do RAG on that.
But, you know, we're moving in that direction. Obviously, it's going to add a bunch of value. People are going to want their data sets available to these AI models, right?
It would be stupid not to.
Other companies are going to do it. And the cloud, though, is the easiest option, because we were hearing early on that Oracle Bare Metal had the A100 and later H100 8-way boxes.
There were a couple others, but it's not like you can just show up and get a month on those things.
They're big contracts.
They're not simple consume-for-a-little-while-and-turn-it-off kind of deals, because obviously those cloud guys are having to make big investments. And it depends on where you are too.
So, uh, you know, just a reminder for the audience, if you're catching this later,
we shot this thing live. So we're live right now and we're getting a lot of great comments
on YouTube, Discord, and all of our social media platforms. But one, Gabriel Faraz
is checking in from Brazil. Gabriel, thanks for
following along. JM, do you see any, as you look at the global markets and you work for a global
company now, do you guys see any trends that are different in what we're used to in North America
versus something that they would see in Brazil or South America or Europe or Asia? Are there
different things happening in different global regions?
Sorry, I had to take a second to go on the Discord
and like your post that you put up.
Yeah, discord.gg/storagereview.
If you guys aren't in there, get in there.
I am there.
You guys can tag me and yell at me if you need to ask a question.
I wouldn't invite that, but now you have, so that's your fault.
The lab pillagers and such.
Yes.
No, I mean, like I said, it's probably similar.
I think with all these companies, you just see the McKinseys and Accentures of the world making hundreds of thousands of dollars just to help basic companies figure out AI stuff. It's honestly not that complicated. You just need to get some GPUs and play around with some stuff. Running these RAG setups, even three or four months ago it was very complicated. I ran one last weekend in the NVIDIA, whatever they have, it's like their AI studio. It just spins up a Docker container and launches a little Chrome app that has RAG right there. You just upload a file or a folder and it will convert it to a vector database, and then you can use their endpoint. This is not super, super complicated stuff, but the whole AI stuff is kind of clunky right now.
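For anyone who wants to see what that retrieve-then-prompt flow reduces to, here is a minimal, self-contained sketch: chunk local text, build a tiny vector index, pull the nearest chunks into the prompt. The hash-based embed() is a toy stand-in for a real embedding model and ask_llm() is a placeholder for whatever endpoint you actually run, so treat this as the shape of RAG rather than anyone's product.

```python
import hashlib
import numpy as np

def embed(text, dim=256):
    # Toy stand-in for a real embedding model: hash each token into a fixed-size vector.
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def build_index(chunks):
    # "Convert it to a vector database": one embedding per chunk.
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, question, k=2):
    # Rank chunks by similarity to the question and keep the top k.
    q = embed(question)
    ranked = sorted(index, key=lambda item: -float(item[1] @ q))
    return [chunk for chunk, _ in ranked[:k]]

def ask_llm(prompt):
    # Placeholder for a call to your local model endpoint.
    return f"[model response to a {len(prompt)}-character prompt]"

docs = [
    "FDP lets the host tag writes with reclaim unit handles.",
    "EDSFF E1.S targets dense 1U hyperscale servers.",
    "Checkpoints are large sequential writes to local flash.",
]
index = build_index(docs)
question = "Which form factor is aimed at 1U servers?"
context = "\n".join(retrieve(index, question))
print(ask_llm(f"Use this context:\n{context}\n\nQuestion: {question}"))
```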
But I think a lot of the companies are in this same spot
where they're just like,
we know we need an AI strategy.
We don't really know what that is.
We should be using these tools.
OpenAI and these guys are going to be pitching
these enterprise plans that are private
and don't leak your data.
You have to hook up AWS to it and stuff.
Sure, you can do that.
It's going to cost you 20 times more
than figuring it out yourself.
Well, it's funny you mentioned RAG because we actually did a piece, and I'll put it in the
notes on YouTube, but we set up a QNAP NAS with an A4000 and ran the RAG on there from
NVIDIA. I've lost the name. Oh, it was a chat GPT kind of thing. Anyway,
you can run it against your private data. So if you've got a little NAS, like a QNAP,
to be able to drop a GPU in there and run RAG against your own private company data,
it's so simple. And again, talking about what AI is, that can be running on hard drives with a relatively affordable GPU on a NAS that you may already have. Now you're taking advantage of this in a private way that is simple and protects your data. That's another big concern, right?
Yeah, I mean, I use this stuff all day, every day. It's hard to go back, you know, especially for a serial procrastinator like myself. When you've been looking at a blank page for an hour and you're like, I don't want to start this project, with AI you just say, write me an outline, get it going. It's so helpful.
Yeah, in that context, certainly it is. It gives you
an advantage, I guess a head start, yeah, if you're having some creative impairment.
I know we've talked a lot about AI and these things, these conversations are all unstructured,
but that's the thing that's on the front of everyone's minds. Talk to me a little bit about these devices. For anyone listening just to the audio, I'm holding a very tall E1.S SSD and a... okay, JM, you take it from here.
Yeah, I've got the two flavors here of the FADU E1.S. So this is obviously PCIe Gen 5, you know, up to 20 watts, but most of the hyperscalers are running this at about 16 watts for the 7.68 terabyte and 14 watts for the 3.84 terabyte. Obviously the 15 millimeter is kind of the more common one. Meta was the hyperscaler that adopted the 25 millimeter. It's funny, they claim that there's a
huge difference in the thermal simulations,
even with that very, very small difference in heat sink, they see pretty massive differences in the thermal simulations, mostly because the drives are spaced out more.
But yeah, so I think you guys saw,
WD made a big announcement for this, the AI data cycle,
where they announced their high-performance Gen 5 NVMe for training servers and inference and all that. That's based off the FADU controller and firmware. You can see this is kind of an early engineering sample here, but this thing is insanely fast. One of the coolest things is it has this new feature called flexible data placement, which we've talked about a bunch.
I had a blog post on that, and we did this short OCP tech talk thing.
So I was on the panel there.
Let's see if I can get my camera here.
There we go.
And, yeah, I have a really sweet talk on FDP
for FMS this year that I'm working on.
I can talk a little bit about these little crazy experiments I'm running.
But, yeah, exciting.
So this, I think, you know, there's going to be obviously this E1.S, U.2, E3, basically, you know, anywhere from 2 to 16 terabyte TLC.
It's really for this high performance TLC compute Gen5.
Talk about that for a little bit
from the form factor world,
because we get a lot of questions and confusion still
with the whole EDSFF family.
We've seen, I don't know,
probably four or five different iterations in the lab here
of different Z heights, different shapes, E3,
E3 2T. You've got the E1.S, E1.L. I mean, there's a lot going on there. Has that, in your view,
I know you're close with the OCP standards. I know you're involved in SNIA and the other work groups,
but do you feel like we're kind of honing in on a couple that are going to survive in the mainstream?
Yeah, for sure. You know, it's funny, I did this update for SNIA. It was like the all-things-form-factors update for SNIA. I made a joke. My friend texted me, he's like, oh man, SSD form factors, I couldn't think of a more boring topic. But since it's you, I'm going to watch the podcast. Please, let's go. I happen to think SSD form factors are very interesting. But, you know, I looked at the updates to the E1.S spec, like SFF-TA-1006, and there haven't really been very many updates since like 2019. We had it pretty much dialed in back then. There have been a couple minor changes; the biggest one was adding this 15 millimeter, because Microsoft thought it was kind of the sweet spot for performance and power.
And you have the Supermicro E1.S system
that hopefully we'll have in my lab at some point soon.
I don't get sent all the fancy toys
like Brian does all the time.
We have to procure them the hard way.
I just use blackmail to get ours.
It's much easier.
But yeah,
the super micro server has,
you know,
24 of these in one year,
right.
Which is obviously very,
very dense,
like much denser than you can get with you.
That's a U dot two.
Traditionally you get 10 or 12,
you know,
U dot twos in,
in one year chassis.
And so obviously you're having double that,
you know,
24 drives,
you can get much more performance,
more IO, more, and these things can go, you know, technically drives, you can get much more performance, more IO, more.
And these things can go, you know, technically the ESF drives can go up to 25 watts.
You can have basically the equivalent of U.2 performance in a smaller form factor.
So, you know, what happened with the hyperscalers is they did exactly what they said they were going to do, right?
Meta has, I think, like 15 designs in OCP that use E1.S. Microsoft has about the same. Almost every single new server and every single new drive that's going into these hyperscale platforms has E1.S. And Brian, thank you for that webcast I did on EDSFF; I stole some pictures from StorageReview for some of these new EDSFF chassis, because I don't have them all in my lab. But you have NVIDIA and ASUS and all these other guys doing the NVIDIA DGX systems that they are now doing with E1.S. So you have all these training servers that have EDSFF support. As we go on, E3 was a little bit slower on adoption. So, okay.
So E1.S, just to summarize, hyperscalers were the main ones using these 1U form factors, right?
Hyperscalers were using these 1U servers.
They literally invented E1.S for hyperscale 1U use case.
And they're doing exactly what they said they were going to do.
Everything was designed.
Okay.
Before you go to E3.S though, why is it
in the latest revisions of hardware for
enterprise gear, do you think
that Dell, Lenovo, HP,
etc. didn't
do more with E1.S
since that was already existing
in the market?
It's super easy.
I was with
the folks at Dell and HP when we wrote the E3 spec, right?
I was at Intel, and we literally helped them write the spec.
You know, they were the lead authors on the spec.
And they said very clearly, look, we want one form factor that will work for 1U and 2U.
Because at the time, 80% of the volume for the servers was on 2U platforms.
So they said, we don't want to do an E1 for 1U and then something else for 2U, that would be stupid. We just want one form factor that we can literally put in horizontally. Now, I actually thought for EDSFF that the E3 2T, the thicker one, would actually win out because it's more U.2-like and higher power. But they went another direction. Dell and HP wanted to go with higher drive count per system, because they wanted a performance advantage in 2U, which in retrospect actually was a very good choice, because now you have these EDSFF servers that are sitting behind you that have a huge drive count per system. Up to, I think, the Supermicro Petascale has like 36 E3 in a 2U chassis. That's insane. It's a ton of drives.
So in retrospect, like for these kind of storage performance configs, like that was,
that was the right choice,
because you can have the flexibility to go either like,
you know, two 1Ts or one 2T.
But E3 is really strong because, you know,
one, you have the EDSFF connector.
I think a lot of people, the adoption wasn't as fast just because you didn't get all the segments coming over from U.2 right away.
U.2 had TLC.
It had QLC.
You have small drives, big drives.
You had all these different, the SLC drives.
You had all these different segments.
And not all those could have just converted to E3 right away. And then you had, when E3 was started, it was at that point, they were like, yeah, let's not move U.2 to Gen 5.
Let's just leave U.2 at Gen 4.
That way it'll be a faster transition to EDSFF.
And then customers said, oh, well, there's these couple tweaks we can make to the U.2 spec to make the insertion loss a little bit better.
You know, they literally just released the U.2 Gen 5 spec a month ago.
It just got out.
My preference would have been to just be like,
look, we have a form factor that's optimized for Gen 5 and Gen 6.
Why are you wasting your time?
At that time at Intel, we voiced our opinion pretty strongly.
We should be spending our time getting PCIe compliance up on EDSFF because that's
going to be the dominant form factor going forward. Well, I mean, part of the problem
is the customers, right? Because you change things and make them weird. I mean, isn't that kind of
why the U.2 drive is the way it is to match up with the 10K and 15K hard drives from back then?
Yeah. And then it's funny, right as we're transitioning to EDSFF, Broadcom and HPE are like, let's have this great idea called U.3. And I'm just like, no, God, stop, just stop with the insanity. We have EDSFF, please, everybody.
Well, U.3 clearly didn't work.
Yeah, don't make me get on my soapbox on that one. But anyways, I was very vocally opposed to that.
Obviously, people don't want SATA and SAS drives and servers anymore.
NVMe is not new, right?
The first NVMe drive was 2014.
That was 10 years ago.
It's not like a new spec anymore.
Right.
No, well, the investment in SATA is long gone. I mean, no one's doing anything new there
spec wise, right? And even SAS is being kept alive mostly by hard drives and a couple players
that have big investments. But SAS was always hard. And so if you have the investment in SAS,
you want to protect that. And I think you were around in the early
days when Intel got into the flash business and said, well, we'll take SATA and NVMe. And our
development partner was HGST back then. You guys can have SAS. And I mean, that seems to have worked out pretty well. Those are great drives.
Yeah, those are great drives. I mean, look, you know, I learned
SAS when I was designing JBODs at Sun, one of my very first jobs out of college. I mean, it's funny, once these SAS guys retire, there'll be literally nobody
that knows the spec.
They've been saying that about
mainframe guys and tape guys
for a long time, but they
still find a way to just... I will say, you know, NVMe is much more organized. It's easy to look at the specs. They're all publicly available, there's no shenanigans, you can just download them from the website. The thing that they got really right about NVMe is not just the performance and scalable fabrics and queuing
and all the other good technical stuff,
but the way that they made the spec very open to just be like,
you can ask the drive with an NVMe Identify controller
what the drive supports.
You don't ever have to guess like, hey, what is this drive? What is it doing? What features does it support? You can just ask the drive: what do you support? It's one of these little things
that nobody talks about in NVMe,
but it's extremely useful for software
to be able to like query the drive
and just be like, what can you do?
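On Linux you can try this yourself with nvme-cli; a hedged sketch (the device path is an assumption for your system, and the exact JSON field set depends on your nvme-cli version):

```python
import json
import subprocess

DEV = "/dev/nvme0"  # adjust to your controller

# "Ask the drive what it supports": Identify Controller, returned as JSON by nvme-cli.
raw = subprocess.run(["nvme", "id-ctrl", DEV, "--output-format=json"],
                     capture_output=True, text=True, check=True).stdout
ctrl = json.loads(raw)

# A few of the self-describing fields (names follow the NVMe spec abbreviations).
print("model:", str(ctrl.get("mn", "")).strip())
print("firmware:", str(ctrl.get("fr", "")).strip())
print("optional admin command support (oacs):", hex(ctrl.get("oacs", 0)))
print("optional NVM command support (oncs):", hex(ctrl.get("oncs", 0)))
```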
So we went down a little tangent there.
So go back to E3.S. And I would argue, though, that even for server guys that didn't want two different form factors, we saw some early 2U servers where E1.S drives were stacked high and low. I mean, it wouldn't have been impossible. I get that they didn't want to do it. And now that we're seeing some of the designs come to market, there's a couple, like some of the dense platforms. We just looked at one from Dell, the C6600 chassis, where they just have a couple E3.S in the middle. That would be hard to do with the 2T ones. It'd be too thick.
And anyway, but yeah.
So go talk a little bit more about that.
Yeah, so the summary is, you know, you have E1.S for hyperscale.
That's going good.
And now it seems like it's made its way into all these training servers.
So we're going to see another uptake in enterprise for these enterprise training servers that are using E1.S.
And then, and again, it makes sense, right?
It's a small form factor and it's super fast.
So, and then on the E3 side, it's kind of made its way into these enterprise platforms.
But I mean, even six months ago, there was only one Gen 5 E3 drive available on the market you could buy.
I think the Samsung drive was the only one you could buy.
So it just, you know, there wasn't a huge market and there was a huge premium and they didn't have all the capacity points that all the other drives had.
So, you know, it sucks as a server manufacturer that they still have to sell U.2. The transition didn't happen as fast as I would have liked. Of course, it would be nice if all the SSD vendors could be smart now for Gen 6 and just say, we're not going to make a Gen 6 U.2, it's only EDSFF.
Yeah, I mean, I think really there will be a couple holdovers on Gen 5 for U.2, but most of them are stopping at Gen 4, and Gen 6 has new requirements that are going to make it probably too hard. As to drive availability, you're right. I mean, there was a Samsung part initially. Kioxia followed a little bit later. We've got some other stuff that's not public yet,
but there's more coming.
Yeah.
So that's exciting.
I wasn't there this year at the PCI SIG developers conference in June,
but there was a slide on the form factor stuff that basically just said,
look, there is no plans to develop a U.2 spec for PCIe Gen 6.
Please don't ask.
Like, if you even think about asking,
Brian and I will come find you and do bad things to you.
Just use EDSFF.
It is a better form factor.
Obviously, different capacities, different server types, better thermals.
I mean, quite literally everything is better about it. So obviously that will, you know, in the Gen 6 timeframe, which it seems crazy,
it seems super far away. I mean, I live in the world of strategic planning where I'm thinking
three years ahead always. So Gen 6 doesn't seem that far away in my world, you know, but, you know,
I know that in reality, customers are just now deploying Gen 5, and they're overcoming signal integrity stuff like you do at the beginning of every generation. Then you figure it out, stuff gets cheaper, you get retimers and switches and all that ecosystem stuff gets pretty inexpensive, and then everything just works.
So, I don't want to spend too much time on client, but you're a consumer of storage too, and I know you like to have fun with your personal projects.
What's your take on what's going on on the client side? Because that, I'm looking at these Gen 5 drives, you know, like even a two terabyte can consume maybe, you know, 13 or 14 watts.
It's really not the right, you know, M.2 is not the right form factor for like a high performance Gen 5 drive.
It's just not.
And unfortunately, like you're trying to shove, you know, a square into a round hole. Like, you know, there was just this world where like,
oh, laptops and desktops just had to have
the same form factor.
You must all be M.2, right?
And that's where the consumer market went, which is so stupid.
The priorities on a laptop are battery life
and quick performance and bursty performance
where the performance on a desktop or a workstation,
you have lots of power. You generally want more faster, sustained write bandwidth. People are now
not just doing consumer type workloads. They're doing AI stuff. They're moving big data files
around. They're not just doing light laptop stuff. In my opinion, those are completely
separate segments. An enthusiast desktop drive and a laptop drive should not share any of the same DNA. I think it's so stupid. And trust me, you take any data center drive and just put it in a desktop and boot it, it'll run circles around any consumer drive in every benchmark. They're just not even on the same planet.
So, yeah, we're talking about form factors. We got a question
on YouTube, or maybe more of a statement. Jay is saying, I was hoping E3 2T would make it to consumer desktops, because it's a shared form factor for SSDs and other things like PCIe cards. He brings up a really good point, and one that comes up at every show we go to. We've talked to Dell about their workstations, and HP Z, and others.
It's like, why are we still trying to jam these M.2s into weird caddies
so that we can hot swap them in the front of these things?
Why not just go to an enterprise backplane?
You could put eight E1.S in there, easy peasy.
Yeah, and you can do it pretty inexpensively, right?
You can do, so I know it's like the worst name ever, but there is an official PCI-SIG Gen 5 and Gen 6 cabling standard called CopprLink, which a lot of people might know as MCIO. This spec is very good now for cabling Gen 5, and there are ways to do it with SRIS so you don't need a clock, and you can have, you know,
that was kind of the point of SATA cables.
You could have these like super cheap SATA cables.
Like there's no reason why you couldn't do that
in a desktop or a workstation and cable PCIe.
Like we know how to cable PCIe.
We know how retimers work.
We know how this like, it's not that expensive.
I'm sure if somebody is buying a high-end workstation,
they'd appreciate a switch card out to a back plane
where they can put a bunch of drives in.
The fact that you can buy a $15,000 workstation and you can't even select a U.2 or an E1.S is mind-blowingly stupid to me.
I can't even tell you how stupid and frustrating that is.
Well, 15 grand is not even that much anymore for some of these high-end.
You throw a couple RTX 6000 Adas in there
at seven grand a pop or whatever.
And a lot of these systems are doing that
where they're putting in two or three or four.
And then you look at your storage and you're like,
well, I've got no PCIe slots left.
Or if I do, I want to put a high speed NIC in there
so I can get on and off pretty easily.
And then you're trying to squeeze
and work around a
couple M.2 drives, which as we've discussed already, are problematic. I mean, M.2 is going
to have another problem, even in the enterprise when we go to Gen 6 and beyond, that most of the
heat dissipation comes through the connector on that thing, right, and into the motherboard. To get the heat out of that system, out of that drive, there's going to be a ton of problems with heat dissipation with these M.2 drives.
Yeah, I'm hoping some people wake up. I mean, back when I was at Intel and I could actually talk to the platform team about designs and stuff, for what is now the Xeon W product line, which was code-named Fishhawk Falls or something like that, we were talking about EDSFF back then, about at least putting one connector on the motherboard so people could test it. This was like six years ago, right? Like I said, unfortunately there's just zero vision in that market. But I digress. I'm hopeful that we start to make an impact and people start putting EDSFF drives in workstations. It's very clear that consumers and high-end users, like the people on your Discord, are doing enterprise-type workloads, or they're building, like I said, I still work with a lot of these people in the blockchain world, they were doing these projects, and you need to figure out how to get an enterprise drive on a consumer system. Right now you just have these little adapter cards, and it's not perfect.
I know. We're doing it, we're living it, and it's frustrating. All right, so here's another. Here's
another question for you. Dan wants to know, why do Gen 5 SSDs have poor idle power efficiency? Do the Gen 5 controllers have trouble sleeping individual cores or something? I don't know that I've seen that a lot. Have you seen idle power consumption issues with Gen 5 drives?
I mean, it doesn't have anything.
I mean, there is something weird, right?
Like where anybody who has an NVIDIA card knows this,
where if you just leave the card idle, it goes to Gen 1.
And then when you like run a workload, it pops up to Gen 4,
like these NVIDIA GPUs.
And that's not how you're supposed to do idle power management in PCI Express.
Like PCI Express has like L1 substates, L1, L1.2.
Like it has all this stuff to find.
And, you know, so they're obviously doing that.
If I recall, you save like maybe 100 to 200 milliwatts per lane.
And so on 16 lanes, you save maybe a couple of watts by sleeping the bus down to Gen 1. So obviously there is some impact there, right? If NVIDIA does this, they're not doing it because they're stupid, they're doing it because they want to save a few watts. But really, again, there are other PCIe power states that you should be using, you know, L1. Maybe it has a millisecond resume time from L1 or something, which for a database workload is catastrophic, but for a consumer workload it's a nothing burger. You wouldn't even notice, right? So yeah, on power, I mean, obviously, I'm in the process of writing an extremely detailed blog post on power efficiency. Hopefully we'll get the website fixed by today or tomorrow, and it'll be up there.
So for everyone else, you know, JM writes, uh, extensive blogs.
He's a really technical guy.
And so if you want to keep up with all the, I'm going to politely call it
minutiae, but there's a lot of detail
involved in
SSDs, the management, the
firmware, how the power
states work, all that sort of thing. He's
writing that now. He's with FADU.
He's got a blog there. We'll link to that also.
The blog's having a bit of a
moment, but it'll be back.
We'll have some great content.
For anything that's not business topic, I have a personal blog called ssdcentral.net,
which is supposed to just be about stupid SSD topics that people want to hear about.
So yeah, please, if you guys want me to write about some stuff, it's been fun.
Mostly I started it because I had a bunch of these
random posts on LinkedIn and I was like,
yeah, they should probably like actually live
on websites somewhere.
So I just started one and put some stuff out there.
But yeah, I'm going to be writing for the FADU blog now. We've got some really good stuff. We've got one on FDP and one on this energy efficiency, and some other good stuff in the pipeline. As you heard earlier, I'm deep, deep in the AI world, trying to figure out how I get all these thoughts into one post. Probably not going to be able to do that.
No, and you probably shouldn't either, because, and we suffer from this, I think if you try to do too much in one piece you end up with an 8,000-word article. It gets chunky to consume.
But go back to FDP, because that's something I definitely want to talk about.
I know that you and the FADU team have some leadership position here.
This is not a technology that everyone has.
So I know, to full transparency, Samsung's there.
They're working on it.
I know you guys are working on it.
After that, it gets pretty unclear. But for anyone that's curious, the hyperscalers are trying to manage write amplification, and the goal is to write to your SSD efficiently so you're not over-wearing it in terms of endurance. This concept has come up before; most recently it was called Zoned Namespaces, which was a dramatic non-success, if that's a polite way to say it. But FDP, flexible data placement, seems real. It seems like it's something that the hyperscalers are already embracing, which is typically a good sign that it'll happen and come down to the enterprise eventually. So talk a little bit about that, because I know you're really passionate about FDP.
Yeah, FDP is really cool. Obviously, flexible data placement technology. It's almost impossible to understand SSDs if you don't understand write amplification factor.
So I mentioned the first blog post I put on my personal blog was just talking about WAF
and over-provisioning.
Like, okay, so when you test an enterprise drive,
you're typically testing worst case, 4K, random write,
and it typically has like a WAF of five.
That's like this JEDEC workload.
And this is how they rate the drive in TBW.
And that's terabytes written. I see a lot of people writing total bytes written; that is wrong. It's terabytes written, my friends.
So, typically, the equation is actually extremely simple: your NAND is rated for a certain number of program/erase cycles, paired with an ECC engine on the controller, and you have a certain amount of raw flash on the drive. So if you want to know your endurance, you basically multiply the program/erase cycles, how many cycles each NAND cell can do, times the capacity. And now you have the TBW at WAF equals one.
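Written out, with illustrative numbers rather than any particular drive's rating (3,000 P/E cycles is just a round TLC-class figure, and 7.68 TB is the capacity mentioned earlier):

```latex
\text{TBW} \approx \frac{\text{P/E cycles} \times \text{raw capacity}}{\text{WAF}}
\qquad\Rightarrow\qquad
\frac{3000 \times 7.68\ \text{TB}}{1} \approx 23\ \text{PB at WAF}=1,
\qquad
\frac{3000 \times 7.68\ \text{TB}}{5} \approx 4.6\ \text{PB at WAF}=5
```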
Now, the problem is when you fill up the drive,
now the drive has to do garbage collection and move that data around,
and it gets hard when the drive is full.
And so the more spare area you have,
the more over-provisioning,
the more efficient those garbage collection algorithms get.
So if you look at like a three drive write per day drive
versus one drive write per day drive,
those are the same drive.
Like those are exactly the same drive.
The SSD vendors think you're stupid.
So they just make it 3.2 terabytes
instead of making it 3.84
because they think people are not smart enough to run a single command to basically
delete the namespace and then create a namespace that's smaller. I mean, that's literally it.
You don't even have to do that. You could make a partition that's smaller, or just write to a certain LBA range. It's all the same thing. It's all just WAF.
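For the curious, the "single command" version of that looks roughly like this with nvme-cli. Treat it as a hedged sketch: the device path and controller ID are assumptions for your system, the block math assumes 4KiB logical blocks, and deleting a namespace destroys its data.

```python
import subprocess

CTRL = "/dev/nvme0"                # controller device (assumption; check your system)
LBA_BYTES = 4096                   # assumes the namespace is formatted with 4KiB logical blocks
TARGET_BYTES = 3_200_000_000_000   # expose 3.2 TB of a 3.84 TB drive; the rest becomes spare area
BLOCKS = TARGET_BYTES // LBA_BYTES

def nvme(*args):
    cmd = ["nvme", *map(str, args)]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# WARNING: this wipes namespace 1.
nvme("delete-ns", CTRL, "--namespace-id=1")

# Create a smaller namespace; the unallocated flash is extra over-provisioning
# that the drive's garbage collection can use, which is what lowers WAF.
nvme("create-ns", CTRL, f"--nsze={BLOCKS}", f"--ncap={BLOCKS}", "--flbas=0")

# Attach it back to the controller. Assumes the new namespace comes back as NSID 1
# (typical after deleting the only one) and that the controller ID is 0 -- check id-ctrl.
nvme("attach-ns", CTRL, "--namespace-id=1", "--controllers=0")
```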
And so there's been a bunch of technologies over the years to basically see like, how
can we make WAF better
or close to one on every single workload?
That way you can get the benefits
of all the over-provisioning
without actually spending the money.
So basically that's what FDP is. You can tag the data in flexible data placement.
It's a little bit different than what Streams was
where you can tag the host data
with what's called an RUH or Reclaim Unit Handle.
And then the drive can figure out, okay,
well, where should I put that based off that tag?
The really cool thing about this is it's backwards compatible, so you don't have to have
software that's necessarily FDP-aware. You can just break a drive up into a bunch of namespaces, give each namespace a Reclaim Unit Handle, and then all of a sudden each workload or each VM using each namespace actually has its data contained within that.
It's called flexible data placement because the data is logically placed on the flash. Basically, our controller is a 16-channel controller. If you look at a 64 gigabyte die, that's a 512 gigabit die, which is pretty common, you might have like 64 or 128 die per drive. The way we do flexible data placement is we have eight of these reclaim unit handles. Each one is across all 16 channels, and then maybe we pick four die or eight die. So you basically get full performance on every single workload, but they're completely isolated. And so you get better quality of service, better write amplification, better endurance, better performance.
And so there are some applications that are starting to add FDP support for tagging. You can think about a file system, this is very simple: it has metadata and it has data. It'd be very easy to just split those up into two streams and tag them differently so the drive knows where to put them. There's a bunch of really simple stuff like that that can dramatically decrease WAF. But yeah, again, the whole reason why people buy these three-drive-write-per-day drives is to get higher endurance and better random write performance. You don't need a three-drive-write-per-day drive to do that; you just need to get the WAF closer to one, and there's a bunch of ways to do that for free.
For free, meaning there's some sophistication needed? Well, I mean, we
talked about this a decade ago when we were looking at the transition from hard drives to flash, or at least hybrid environments. The dream back then, if you talked to all the flash startups, Fusion-io for example, they were always saying, if we could just make SQL Server aware that it's
writing to Flash and not hard drives, there's so
much efficiency that could be gained by that. But it never
really worked out that way because the applications
are applications. They're not storage experts.
It's much simpler than that, Brian.
Like, I talked to a bunch of these developers.
Like, I talked to the Ceph guys.
And, you know, these are, like, all really smart software guys that are, like, doing hardcore storage stuff.
And a lot of these guys just don't have time to learn how these SSDs work, like in garbage collection and trim.
I mean, there's a lot of stuff to worry about.
It's really that simple. Unless you're a hyperscaler
optimizing these workloads, like typical enterprise workloads, even the databases
that are storage centric, I still see a ton of recommendations online to not mount drives in
Linux with discard, with trim. It's like, oh my God, this is the stupidest thing on the planet.
Trim is the only way an operating system has to tell the drive that the data is not needed anymore. It's quite literally the most important command for an SSD, period. And people are like, yeah, you don't need that, because what they've been doing is just over-provisioning the drives and letting the drive's garbage collection figure it all out itself. That's just the lazy way of doing it. Hyperscalers are not going to waste 30 or 40% of capex on over-provisioning; they're going to figure out how to use that storage.
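On the Linux point specifically, a hedged sketch of the two usual ways to make sure trim actually reaches the drive (device and mount point are placeholders; most distros already ship the periodic fstrim route as a systemd timer):

```python
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Option 1: inline discard -- the filesystem issues TRIM as blocks are freed.
# (Placeholder device and mount point; add the "discard" option in /etc/fstab to make it persistent.)
run(["mount", "-o", "discard", "/dev/nvme0n1p1", "/mnt/data"])

# Option 2: periodic trim -- run fstrim on a schedule instead of on every delete.
run(["fstrim", "--verbose", "/mnt/data"])
```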
well, no, I mean, they're going the other way, right? I mean,
they're trying to maximize every piece of NAND and then also handle
resiliency there, where if there are NAND failures or a bad die, they can map around it and continue to use that drive forever.
Yeah, the way most SSDs do this is just XOR internally between some die. If you have a bunch of die, you might have one or two die in there that it's all XORed together, like a mini RAID 4 on the drive. ZNS had some ideas where maybe you don't do that, maybe you do this at the host level.
But FDP would allow you to do that; the spec is really flexible. Meta was one of the key authors on the spec, and Google was another. Google's use case looks a lot more like that custom array, where you're individually allocating stuff to each die, more like a Pure Storage array or something like that. So Google's idea of FDP is more like that, where you're actually managing the FTL, managing the garbage collection, and telling the drive when to reclaim these handles. The Meta mode is more of a, hey, let's just take an eight terabyte drive and split it up into eight one terabyte drives, and each of them is a little more isolated and more resilient to WAF and has better endurance.
They're both really cool use cases.
Well, Meta can get away with that, though, because they've got the time and they've got the engineering talent.
I guess when I'm looking at FDP, I think it's really exciting, and I think it's more promising than anything else we've seen to address this write amplification issue.
But it's got to get easy.
And that's where I think it breaks.
It's like you need Windows Server 2025 or something to be FDP-aware and to do some of this, or Ubuntu 22.04 or whatever.
Whatever it is that people are running to
have an opportunity for it to be aware.
And I was going to ask you, I'm glad you
said that. So that's what my
Flash Memory Summit presentation is about
this year. I have like a really
simple Proxmox example, and this is just done on a stupid desktop in my garage. It's not a fancy server or anything. But I have this drive here, and hopefully I don't nuke my screen here. So basically, this drive supports eight RUHs. There's one command, basically a Set Features command in NVMe CLI, to turn on FDP. So basically you're saying, drive, turn on FDP. And then there's an NVMe CLI command where you can query the drive to say, okay, how many reclaim units do you have? And the drive says, okay, I have eight, and they're each, you know, 14 gigabytes or whatever. And again, the reclaim unit handles map to superblocks that are striped across all the die and all the channels, so they get super high performance.
So once you do that, I can go create a namespace and just tag it with an RUH.
I can say, okay, this namespace is RUH1, this namespace is RUH2.
So I just take the drive and make eight namespaces. Now, each of those I just map in Proxmox, giving each VM a virtio disk mapped directly over LVM. Each VM gets its own namespace. Now, each VM is writing to the drive, and the drive actually tags it with the FDP tag for that namespace, with that RUH. So all the data is actually physically separated, but the applications are not aware. Proxmox has no idea that FDP is enabled. It has no clue. It just writes the data to the VM, and the VM, basically LVM, writes it to virtio, and virtio writes it to the block layer, and it writes to the drive, to the namespace, but it's physically separate. So you get all these benefits of FDP, but there's no software development needed. And that's the big difference of FDP versus the others like ZNS.
ZNS, you have to have a ZNS drive.
You have to have a ZNS aware file system.
You have to have a ZNS everything or else it doesn't work.
And by the way, you can only sequentially write.
If you try to do any random write, it just explodes.
We had a zoned hard drive. We played with that once. It was not enjoyable, I have to say.
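For anyone who wants to try the Proxmox-style recipe described above, here is a rough, heavily hedged outline using nvme-cli. The FDP feature ID (0x1D) comes from the NVMe flexible data placement technical proposal, the "nvme fdp" plugin and the --nphndls/--phndls namespace options only exist in recent nvme-cli versions and are assumptions here, and the device path, sizes, and controller ID are all placeholders, so check your drive's documentation before running anything like this.

```python
import subprocess

CTRL = "/dev/nvme0"           # controller device (placeholder)
LBA_BYTES = 4096              # assumes 4KiB logical blocks
NS_BYTES = 1_000_000_000_000  # carve the drive into ~1 TB namespaces (illustrative)
BLOCKS = NS_BYTES // LBA_BYTES
NUM_RUH = 8                   # the drive discussed above exposes eight reclaim unit handles

def nvme(*args):
    cmd = ["nvme", *map(str, args)]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Step 1: "drive, turn on FDP" -- a Set Features command with the FDP feature ID (0x1D),
# scoped to an endurance group. The exact dword encoding is drive- and spec-revision-specific,
# so it is left as a comment rather than a command to copy verbatim.

# Step 2: ask the drive which FDP configuration and how many reclaim unit handles it exposes.
# ('nvme fdp configs' is a plugin command in newer nvme-cli; treat the name as an assumption.)
nvme("fdp", "configs", CTRL)

# Step 3: one namespace per reclaim unit handle, each later handed to one VM.
for ruh in range(NUM_RUH):
    nvme("create-ns", CTRL, f"--nsze={BLOCKS}", f"--ncap={BLOCKS}", "--flbas=0",
         "--nphndls=1", f"--phndls={ruh}")           # assumed FDP placement-handle options
    nvme("attach-ns", CTRL, f"--namespace-id={ruh + 1}", "--controllers=0")

# Each namespace then gets mapped to a VM (for example as an LVM-backed virtio disk in Proxmox),
# so each VM's writes land in their own reclaim unit with no FDP awareness in the guest.
```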
Yeah. So, well, you're a controller guy now with FADU, so you guys are engineering the chips that go on these enterprise SSDs anyway and can do things like FDP, and for you it's a major differentiator. But you brought up Pure. Why not do less on the drive and do more, in the Pure case, on the array side, where they're doing more of that intelligence and flash management in their software versus the drive? Why don't we rely more on host systems and software to do that?
Yeah, I mean, remember, Kioxia had this software-enabled flash. There was a point where we thought, oh, maybe the whole storage market goes to these DIY solutions where there's a bunch of software that does the FTL, the garbage collection, and all that stuff. But like I just told you, these guys are worried about other problems at the application layer, about data. The Ceph guys, they're making sure they don't lose your data, and they have to worry about nine million different drives, HDDs and SSDs and consumer drives, like a thousand different drive models. These guys are not thinking at that level. So, no. At one point in time there was a lot of discussion about a lot of people going that way. Now, maybe with the Google FDP mode, in the future maybe there's some open-source software that makes this easier, so you can have a bunch of drives in an array and maybe you get some benefit there. But this requires a lot of tuning and application knowledge, and again, when you do that model, you have to understand how SSDs work, because now you're managing the FTL at the host.
You have to know how garbage collection works.
You have to know how reclaiming works.
You have to know how ECC works.
You have to know how data protection and error recovery and all this stuff works.
Quite frankly, people didn't want to learn that.
So most of the market is just regular SSDs, because it works.
Well, that's why, really, when you look at that market, Pure and I guess IBM are the only two I can think of where they're really doing their own flash modules and engineering around that. It's not easy, and they had to be right or else the company wouldn't exist today.
Yeah, so it was a high-risk effort for them.
Well, I see a straggler question from Discord that is actually a good one.
My thoughts on E1.L.
And I was the product manager for the Intel ruler, so we kind of invented the E1.L.
I love E1.L.
People think that it's like, oh, you know... Intel did a tremendous amount of research before releasing that, showing that that is the densest you can get in 1U, right? We demonstrated like one petabyte in 1U in 2017, with 32 32-terabyte drives I think we had powered on in 2017. So E1.L, for super dense storage arrays, it is absolutely unbeatable. Now, E3.L, the longer version of E3, is actually pretty good too, but you have to do one of these foldy PCBs to get super high capacities. But big drives are coming. Hyperscalers are going to be asking for 128, 256 terabyte drives.
Like they're not stopping at 64.
They're going to go insanely big.
We're sitting here in early July.
I don't have any embargoes that I'm breaking, but I think it's pretty clear 122 terabyte will be a topic of conversation in terms of shipping before the year's out. I mean, the density play is coming quickly. This is popping fast. So anyone that looks at the pile of 61 terabyte drives we have, they're about to get even larger.
You can just send them my way when you're done with them.
We're never done.
I totally won't use them for any project.
Yeah, you want to store plots on them.
You cheat.
Go back to the controller, though, because this is interesting to me. Historically, SSDs have battled it out over IOPS and throughput and a couple other things, wear leveling and whatever. But lately it feels like there hasn't been as much discrepancy in performance across the bulk of enterprise drives that are quality, in any category. I mean, they all do pretty well.
That couldn't be farther from the truth, right? So one of the things FADU has that is really slick is that their 16-channel controller is actually pretty small. It fits on an E1.S and an E3 and a U.2, same controller. All the other competitors have to use an eight-channel controller to get into the power levels of the E1.S. So obviously, right there, you get better power efficiency, because the controller has a bunch of, well, that's the topic of the blog post, some of the controller tweaks they've done to make it extremely power efficient as far as performance per watt. But we also just have a 16-channel controller in a small form factor, so you're basically 2x the performance on almost every workload.
That's not a subtle difference. That's a major difference. No two ways about it, it's clearly material.
Yes. And when you talk about comparing a 16-channel drive versus another 16-channel drive in a higher-power form factor,
the power efficiency starts to matter.
The hyperscalers are getting smart about this. In the EDSFF example with the Supermicro chassis,
with 36 drives, E3, you're not going to run 36 drives at 25 watts.
That's just stupid.
It just doesn't make any sense.
You're going to want to run those at 14 or 16 watts. And if your controller power is lower, then when you start squeezing that power down, hey, I don't want to run it at 25 watts, I want to run the drive at 16 watts or 14 watts, your performance loss is much less with a power-efficient drive. On a power-hungry drive, like the other 16-channel guys out there, when you start squeezing that power, you're going to lose like 70 percent of the performance, and that's not acceptable for most workloads. So you have this chassis that has all these drives, 36 slots, but you can't even use them, because the fans would be at 100 percent and you'd just consume all the power in the server. So these little tiny power efficiency gains, two watts here, three watts there, it doesn't sound like a lot, but when you have 24 or 36 drives in a server, it's a lot of power.
It's a huge difference.
No, yeah, it's a fair point. And then I guess when you fold in
other things like FDP, that can be a differentiator that not every controller supports. What else is there that buyers should be thinking about? I mean, we talked power, performance, FDP.
Anything else special sauce related that could be a differentiator for the FADU drives?
Yeah, I'd say the way we're positioning it right now, obviously, it was designed for hyperscalers.
You get all the benefits of the open ecosystem with OCP, the log pages, the telemetry.
You'd basically be able to run machine learning models on the output, for SMART, all that good stuff you get for monitoring health.
I talked about the power efficiency being best in class power efficiency, extremely
important metric, like people should know performance per watt for, you know, again,
it's a huge metric in CPUs, right?
CPU performance per watt is like a massively talked about issue, right?
I don't know why it hasn't been talked about as much in SSD land, but it is equally as important in SSDs. Obviously, having best-in-class Gen 5 performance is amazing, so this thing is just insanely fast. And then I mentioned these new features like FDP. It's really cool having the flexibility in these drives to be able to run some of these new NVMe features. And again, I don't expect most people to be using this stuff, but really, the way I'm going to position it in this blog post, being able to enable FDP and create some namespaces is not something that you have to be a hardcore software developer to play with. Anybody can do this.
Well, it's good stuff. I mean, we're looking at the WD SN861 that's using the FADU controller in an enterprise drive, and they've got the hyperscale drive too. So that's how this is coming to market. We'll link to that, to JM's blog, to the RAG work we talked about, and a couple other things in the YouTube description. For all you guys that joined live,
a lot of great comments and conversation. We certainly appreciate that. So if you're catching this on the rebroadcast or the audio only, make sure that you get in on our socials so that you're
seeing these things come up live as they go.
And, yeah, I'm hanging out on the Discord too.
Yeah, and Brian and I, we're going to come back and do this with, I think, some of the WD folks, talk about AI, talk about FDP. We'll keep the conversation going. It's been fun.
Yeah, it's been fun. It's the first time I've talked to JM in the last six months where he wasn't driving his Tesla. That guy loves that Tesla so much, he sleeps in it, I think. But it's good to see you sitting down, sir. A lot of trips to the Bay Area.
Yeah, it's extremely hot here right now. I guess I can't complain. There's other places that are
pretty toasty right now. So I guess I can't complain that much.
Good. I'll see you soon in person. I appreciate your time. Thanks for doing the pod. Cool. Thanks, Brian.