Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 07x05: Efficiently Scaling AI Data Infrastructure with Ocient
Episode Date: July 1, 2024

As the volume of data supporting AI applications grows ever larger, it's critical to deliver scalable performance without overlooking power efficiency. This episode of Utilizing Tech, sponsored by Solidigm, brings Chris Gladwin, CEO and co-founder of Ocient, to talk about scalable and efficient data platforms for AI with Jeniece Wnorowski and Stephen Foskett. Ocient has developed a new data analytics stack focused on scalability with energy efficiency for ultra-large data analytics applications. At scale, applications need to incorporate trillions of data points, and it is not just desirable but necessary to enable this without losing sight of energy consumption. Ocient leverages flash storage to reduce power consumption and increase performance but also moves data processing closer to the storage to reduce power consumption further. This type of integrated storage and compute would not be possible without flash, and reflects the architecture of modern processors, which locate memory on-package with compute. Ocient is already popular in telco, e-commerce, and automotive, and the scale of data required by AI applications is similar, especially as concepts like retrieval-augmented generation are implemented. The conversation around datacenter, cloud, and AI energy usage is coming to the fore, and companies must address the environmental impact of everything we do.

Hosts:
Stephen Foskett, Organizer of Tech Field Day: https://www.linkedin.com/in/sfoskett/
Jeniece Wnorowski, Datacenter Product Marketing Manager at Solidigm: https://www.linkedin.com/in/jeniecewnorowski/

Guest:
Chris Gladwin, CEO and Cofounder, Ocient: https://www.linkedin.com/in/chris-gladwin-7ba42b/

Follow Utilizing Tech
Website: https://www.UtilizingTech.com/
X/Twitter: https://www.twitter.com/UtilizingTech

Tech Field Day
Website: https://www.TechFieldDay.com
LinkedIn: https://www.LinkedIn.com/company/Tech-Field-Day
X/Twitter: https://www.Twitter.com/TechFieldDay

Tags: #UtilizingTech, #Sponsored, #AIDataInfrastructure, #AI, @SFoskett, @TechFieldDay, @UtilizingTech, @Solidigm
Transcript
As the volume of data supporting AI applications grows ever larger,
it's critical to deliver scalable performance without overlooking power efficiency.
This episode of Utilizing Tech, sponsored by Solidigm, brings Chris Gladwin,
CEO and co-founder of Ocient, to talk about scalable and efficient data platforms for AI.
Welcome to Utilizing Tech, the podcast about emerging technology from Tech Field Day,
part of the Futurum Group.
This season is presented by Solidigm and focuses on the question of AI data infrastructure.
I'm your host, Stephen Foskett, organizer of the Tech Field Day event series.
And joining me today from Solidigm as my co-host is Jeniece. Welcome to the show.
Hi, Stephen. Thanks for having us back.
You know, Jeniece, you and I have spent a lot of time talking about a lot of storage
topics, but one of the things that we come back to a lot is energy efficiency and the question
of power consumption when it comes to AI data infrastructure. Yeah, absolutely. At the beginning
of the year, it was really just about AI. How about AI? What's the GPU? How do I work with the GPU? It was all about GPU instances. But now we're kind of turning that corner and we're starting to see many organizations
looking at not just performance, but also really looking at power and efficiency and scalability.
So we are super excited about the opportunity today to talk about that a little bit more. And
I know you want
to introduce our guests, so I'll let you do that. But yeah, we're excited to dive into that specific
topic. Yeah, absolutely. And I think this is one of those things that comes up again and again,
when people talk about AI, you know, the naysayers are constantly talking about the fact that,
you know, it uses all this power and, you know, people aren't considering that,
they're not thinking about that.
Well, one company that is absolutely thinking about that is Ocient.
And so it's very, very nice to have here on the podcast today somebody I've known for a long time, Chris Gladwin, CEO and co-founder of Ocient.
Welcome to the show.
Great to be here.
Really look forward to this conversation.
Great to reconnect with you, Steve. And I know we've worked together over the past years, and here we are again.
So tell us a little bit more about what is Ocient and which part of the AI data infrastructure stack do you play in?
And then we can talk a little bit more about the question of energy efficiency and scaling performance. So Ocient is a company that's developed a new software architecture for analysis
of large data sets for complex analytics, always-on workloads. And in doing that, we've really
focused on efficiency in general, both price performance as well as energy efficiency, because they go hand in hand.
So what we're doing is providing solutions using this new architecture for this emerging group of
ultra-large data analytics requirements. And as you and I have seen, we worked together at my
prior company, Cleversafe, in the kind of large-scale storage realm.
Today's, you know, ultra large, you know, analytics workloads just become tomorrow's normal.
And eventually that's what your phone does.
So it's really important to get this right.
We really have to, as an industry, kind of be on the right trajectory of efficiency.
Otherwise, you know, we're going to have some real problems powering this future without being
efficient. So Chris, let's dive into that a little bit. How is Ocient specifically looking at
addressing some of these challenges? Well, it really comes down to focus. As with a lot of
things you see in information technology, you know, capabilities just keep growing and growing and growing. And it really comes down to what companies focus on: improving cost, adding new capabilities, they do it. And so, you know, the other thing that
is really important is the focus on efficiency. So what we're seeing now in the industry is
all these new capabilities are coming out in the realm of artificial intelligence and large
scale data analytics and all these really amazing capabilities. But so far, there hasn't been a
focus on doing that, not only in a way that's cost-efficient and performant, but in a way that
really is energy-efficient. I was fortunate enough, one of the jobs I had early in my career
is I worked at Zenith Data Systems in Chicago.
That's why I originally moved to Chicago when it was the largest portable PC maker in the world.
And that was the very, very beginning of focusing information technology on energy efficiency,
because back then you had a battery and, you know, batteries back then weren't that great.
But yet you still had to find a way to get, you know, two, three, four hours of battery life.
And so you had a kind of an engineering team that said, look, we've got to figure out how to make this more energy efficient.
So, you know, making the CPU speed up and slow down when it has some time where it's not really busy, you know, that has a huge effect.
And like you just start going down this list. And so what we've begun to do at Ocient is really look at that energy efficiency
and not just optimize for scale, not just optimize for cost, not just optimize for performance,
but also optimize for energy efficiency. And you can get transformational benefits by doing that.
You know, we've already announced that we can do a 50 to 90%,
like truly 90%, reduction in energy use. And a lot of it just has to do with focus.
It takes person years, person centuries of dedicated, diligent engineering work to deliver
that kind of result. But you're not going to get there unless you focus on it, unless you say,
this is the target, this is what we're doing, let's go make it happen. And of course, it's not just about efficiency,
it's about scalability and performance as well. I mean, you have to do the job,
not just do it efficiently, but you're doing both. And it's interesting what you mentioned that
the scale and the scope of data. I mean, when we started our careers in the storage industry, megabytes was a big number. I remember my first gigabyte storage array, and now that's a laughably small
number. So you're right, it is coming everywhere. And the technology that you describe, changing
clock speed on processors and stuff, that is not just table stakes for modern processors. It's the
whole ballgame. You cannot build a processor today
that doesn't operate in that way.
And you see the same thing happening with data.
Well, the prior company I started
was a company called Cleversafe.
When IBM bought the company,
in our category of on-prem object storage software
at hyperscale, meaning systems with 100 petabytes and above
of total storage, we had 100% market share.
We made all those systems because at that
time no one else could do it. IBM bought that company in 2015, and at that time
petabytes were normal, and exabytes, tens of exabytes, hundreds of exabytes were kind of the
state of the art in terms of scale for storage systems.
I remember when I started Cleversafe; I started it in 2004.
I think in 2005, I sat down and calculated how many systems in the world at that time, 2005, were at least a petabyte.
And my estimate was 13.
That was it.
You know, and now a petabyte, like that's a corner of a server. So this always happens. And where we focus right now is, you know, things where
you need at least 500 cores in order to deliver the solution on which our software
would run. Typically, in terms of data volumes, the average query or the average machine learning
function or the average geospatial function, the analytics themselves are going to look at hundreds of billions, if not trillions, of data elements,
or like rows in a spreadsheet would be one way to think of that.
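To put that in perspective, here is a rough back-of-envelope sketch of what a trillion-element query on a 500-core cluster implies; the core count and data volumes echo the figures above, while the per-core scan rate is purely an illustrative assumption.

```python
# Rough sketch of what "trillion-row queries on 500+ cores" implies.
# Core count and row count follow the conversation; the per-core scan
# rate is an assumed, illustrative figure.

rows = 1e12          # ~a trillion data elements touched by one query
cores = 500          # the minimum cluster size mentioned above
scan_rate = 100e6    # assume 100 million rows scanned per second per core

rows_per_core = rows / cores
seconds = rows_per_core / scan_rate

print(f"Rows per core: {rows_per_core:.1e}")
print(f"Query time:    ~{seconds:.0f} s at {scan_rate/1e6:.0f}M rows/sec/core")
# Even with perfectly parallel scans, every core has to chew through
# billions of rows, which is why storage bandwidth and data placement
# dominate the design at this scale.
```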
And, you know, at that scale, that kind of trillion scale, you know,
there's around 500 to a thousand different systems right now.
Still, you know, just a small part of the giant data analytics market. But,
you know, this concept of trillions and the next number that people are going to start to learn is
a quadrillion, which is a thousand trillion. You know, we're working actually right now on the
first quadrillion-scale system. These things don't deploy overnight. But, you know, those are
words we're going to start to use: you know, not just
trillion scale, exascale, but quadrillion scale. That's what's coming. So tell us a little bit,
this is fascinating, Chris. So you talked a lot about focus and being able to scale. Can you
speak a little bit? I know you've been working with us on deploying the 61.44 terabyte drives,
but what's fascinating to me
is the way you're able to kind of architect
the overall system.
Can you give us some like quantifiable numbers
and show us like how you're, you know,
making those systems a lot more dense
and, you know, power efficient?
Well, the breakthrough that enables
what Ocient is able to deliver is solid state.
And I think the amount of investment, you know, from semiconductor companies,
you know, like Solidigm and others to bring that technology to market over the past decades,
you know, is definitely tens of billions of dollars, probably around $60 billion investment.
And that, you know, that kind of goes back to focus. And, you know, this
problem that Ocient solves is how do you scale the amount of data you're
analyzing in a system without limits, without worse than linear increase in cost? You know,
that kind of had been the state of the art where, oh yeah, you could scale up and scale up and scale
up. But if you want to analyze a million times the data,
it's going to cost you more than a million times the dollars.
That had always been a known problem in computing for decades.
And how can you solve this problem?
And the reason why this was such a problem was you really had two building blocks.
As a software designer or software builder company,
you cannot go faster than the hardware. Your price performance
cannot be better than the hardware. And if you do your job perfectly, you're going to max out what
the hardware can deliver. And previously, the only two building blocks you had were DRAM
and spinning disk. And the problem with DRAM is it's crazy expensive. Yeah, you can solve these
giant hyperscale problems with DRAM. That's what a supercomputer is. And they cost about a billion dollars. So there's some people that can spend a billion dollars, but most can't. And the problem with spinning disk is that its speed is mechanical: how fast can the read-write head settle into a track?
How fast can the platter spin?
That time hasn't changed for decades.
And so on a Moore's law adjusted basis, spinning disk keeps getting slower and slower and slower.
And it's a million times slower than it used to be.
And it's thousands, you know, it's just way too slow to solve this problem.
Along comes solid state.
Solid state today is offering thousands of times, 2,000 to 3,000 times, the performance per dollar of spinning disk.
That's limited not by a mechanical phenomenon, but an electrical phenomenon.
And that's on a Moore's law curve.
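As a hedged sketch of that performance-per-dollar gap, the prices and IOPS figures below are illustrative assumptions, not vendor-published numbers; the point is how the ratio is computed, not the exact multiple.

```python
# Back-of-envelope comparison of random-read performance per dollar.
# Prices and IOPS figures are assumed for illustration only.

hdd = {"price_usd": 300,  "random_read_iops": 200}        # nearline HDD (assumed)
ssd = {"price_usd": 2500, "random_read_iops": 1_000_000}  # NVMe SSD (assumed)

hdd_iops_per_dollar = hdd["random_read_iops"] / hdd["price_usd"]
ssd_iops_per_dollar = ssd["random_read_iops"] / ssd["price_usd"]

print(f"HDD: {hdd_iops_per_dollar:.2f} IOPS per dollar")
print(f"SSD: {ssd_iops_per_dollar:.0f} IOPS per dollar")
print(f"SSD advantage: {ssd_iops_per_dollar / hdd_iops_per_dollar:.0f}x")
# With these assumed inputs the SSD already delivers hundreds of times
# the random-read IOPS per dollar; adjust the numbers to see how
# sensitive the ratio is.
```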
So 2,000 to 3,000 times better price performance right now means 5,000 times
better, 10,000 times better, a hundred thousand times better. And it's the thing that unlocks
this whole thing: the solution to this problem of hyperscale, always-on, compute-intensive
data analytics is unlocked with solid state. But it's not just solid state that's making
this possible. And I think that that's the thing, you know, people could listen to this and be like, oh, great, you know, yeah, you're using
SSDs, congratulations. But you're doing a lot more than that. I think one of the interesting aspects
of the Ocient solution is sort of this proximity idea that you're doing processing closer and
closer to the data. And that actually reflects the architecture of modern machine learning and HPC.
So if you look at how, for example, the NVIDIA Grace achieves such high performance, it's because the memory and the compute are located right together on the same package.
It's the same way that Apple achieves performance with their Apple Silicon.
And you guys actually have a data approach that works similar to that,
right? So you're not moving as much data around. Yeah. I mean, you and I have been around long
enough to have heard like, you got to move the compute to the data many, many times. But we're
in this realm where when you're looking at these hyperscale workloads, it is not, I mean, typically
in the workloads that Ocient focuses on, you're talking petabytes,
you know, if not exabytes of data, not just being stored, but being analyzed.
And if you want to run a query or run a machine learning function or something like that,
and it's a petabyte of data and you need to move that, you know, it might take a day.
It might take an hour.
I mean, that's just not, you cannot use that architecture.
So what we've seen, you know, in terms of focus is the rest of the industry is focused on kind of the large opportunity, which is real. And they, you know, they've really done
a great job building solutions for kind of smaller active data sets, smaller amounts of data that's
being analyzed. And they separate compute and storage into two tiers in the architecture fundamentally. So they have to pull the data out of the storage
tier, across the network, into the compute tier. And that's fine if it's a gigabyte of data or even
a terabyte of data, maybe 10 terabytes of data that you're analyzing, but you're getting into
the hundreds of terabytes, petabytes, tens of petabytes, like that's just simply not going to
work. So what we've done is collapse compute and storage into a single tier.
So we're not pulling data from storage across a network connection up into a compute server.
We're pulling data across multiple parallel PCIe lanes, like within a server.
And, you know, we'll have thousands of times the data bandwidth just at an architectural
level.
And it shows up in queries.
I mean, you know, the kind of analysis we can do are things that are either impossible.
You know, you've tried, customers have tried other things.
It just doesn't work.
And one of the problems we'll often have is that the rate at which they're adding data
to the system is greater than the rate at which other systems can add data.
So you never catch up and that's a problem.
Or the other thing we see is we'll replace two or three or four systems or even five with a single Ocient system.
And I think, Jeniece, you were asking earlier about kind of what that looks like.
You know, we'll often see like five or 10 racks of equipment,
you know, 100, 200 kilowatts of power draw, and you can replace that with like half a rack,
you know, one-tenth the number of servers, one-tenth the amount of electricity.
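A quick, hedged back-of-envelope helps illustrate both points; the link speeds, drive counts, and throughput figures below are assumptions for illustration, not Ocient benchmarks.

```python
# Why pulling a petabyte across a network is painful compared with
# reading it from local NVMe in parallel. All figures are assumed.

PB = 1e15  # bytes in a petabyte (decimal)

# Assumption: one 100 GbE link delivering ~12 GB/s of usable throughput.
network_bps = 12e9
network_hours = PB / network_bps / 3600

# Assumption: 10 servers, each with 24 NVMe drives at ~7 GB/s reads.
servers, drives_per_server, drive_bps = 10, 24, 7e9
local_bps = servers * drives_per_server * drive_bps
local_minutes = PB / local_bps / 60

print(f"1 PB over one 100 GbE link:       ~{network_hours:.0f} hours")
print(f"1 PB from local NVMe in parallel: ~{local_minutes:.0f} minutes")

# The consolidation math is similar: replacing ~10 racks drawing
# ~200 kW with roughly a tenth of the servers implies on the order of
# a 90% reduction in power draw, before any software-level savings.
```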
Wow. Okay. So I just want to dive in and on that note, talk about some of your customers,
right? I think it's fascinating, as Stephen said, what you're doing with the software and the hardware.
Can you give us a little bit of color around the type of customers? Like, you've got, you know, large-scale,
complex, always-on, you know, requirements for either your business or your mission,
because, you know, it includes some government customers as well. And there's only so many ways
you can have this much data, and there's only some certain use cases that need this. And so,
we have a pretty good sense of who they are. The way we actually model the market is we do a lot of research where we'll go off and identify a use case and write down in a spreadsheet: who are all the customers?
How much data do they have to analyze?
Really understand it on a bottom-up basis.
So the biggest market right now is telcos.
Telcos are big networks. And they're going through a process
right now of making what I think is the largest investment in human history, which is 5G. I think
they're spending somewhere between $5 and $10 trillion on 5G, which is a very big number.
And 5G is amazing. It's not only amazing price performance, super high location
resolution that's going to enable all these new apps, but it's also the first real redo of the
backend infrastructure of mobile telephony in a long time. So there's a lot going on there.
It's amazing. It's going to happen. And there's no denying the world will benefit from its use. A challenge for that is
in 4G, the amount of data, metadata that a large telco makes is already at a scale that they can't
analyze. So if you're a major telco, your network connects things a lot. Like your phone wants to
buy something,
it's going to make 10 connections to do that. And this is constantly happening.
So a major telco will make a trillion connections every two days, maybe every three days.
And if you want to go back and analyze, like, why was my network slow in Boston yesterday? Or
where should I put my next cell tower, or things like that that you
need to do. Or there's also compliance reasons why you have to have this data and analyze it.
Already, they can't analyze at that scale, that trillion scale. Along comes 5G. 5G increases the
amount of metadata that a telco network creates by 30 to 50 times. So they're already like, I don't know
how to deal with this volume that I have today. And it's going to 30 to 50 X. And, you know, the
marketing people at those telcos, what they're doing is they're going to go sell 5G. And the
IT people that have to run the network, they don't get to say, whoa, whoa, whoa, slow down. You know,
this is really hard. How am I going to deal with this metadata? No, they don't get a vote. They just get a problem and they have to solve that problem.
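The arithmetic behind that metadata problem is worth sketching; the connection counts and the 30-50x multiplier come from the conversation, while the record size and the midpoint multiplier are assumptions.

```python
# Back-of-envelope on the telco metadata volumes described above.

connections = 1e12            # ~a trillion network connections...
period_s = 2 * 86400          # ...every two days
rate_4g = connections / period_s          # records per second today

multiplier_5g = 40            # assumed midpoint of the 30-50x range
rate_5g = rate_4g * multiplier_5g

record_bytes = 500            # assumed size of one metadata record
ingest_5g_gbps = rate_5g * record_bytes / 1e9

print(f"4G-era rate: ~{rate_4g/1e6:.1f} million records/sec")
print(f"5G-era rate: ~{rate_5g/1e6:.0f} million records/sec")
print(f"5G ingest:   ~{ingest_5g_gbps:.0f} GB/sec of metadata, around the clock")
```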
So that's one example. We also see it in vehicles. Vehicles are the same thing. A car,
a typical car today makes petabytes of data. And right now they have to just throw most of it away
because they can't analyze it
at that scale that it's made. We also see this in ad tech. We also see this in other markets
like financial services. So there's these very specific markets that have this kind of requirement
where they're just dealing with the scale of data. There's a close relationship, I think, between HPC and massive data scale and, of course, emerging AI applications.
So talk a little bit about how AI applications are starting to demand this kind of scale, especially with approaches like retrieval-augmented generation, which is emerging as one of the key technologies that are powering practical applications of AI,
as opposed to just check out my cool chatbot. Well, I think if you step back, there's been a
lot of AI revolutions over the decades. This is not the first time the world has been captivated
by AI. What I would say is different about this round is not
only is the technology, you know, better and amazing, but it's like the first time AI has
dealt with scale. It used to be, you know, AI, you know, I did some AI programming back in the 80s
and, you know, it was like, you know, a megabyte of data, you know, something like that. And it wasn't scale.
And, you know, when you look at like what large language models are doing, kind of for the first time, they're able to understand the whole, you know, the whole language
and every time it's been expressed on the Internet, which is giant. Like, that's a scale of AI that's just new.
The prior revolutions couldn't do that technically.
So I think that's kind of one of the big differences,
maybe the big difference in this AI revolution that we're going through now.
But that creates challenges.
And one of those challenges is if you want to analyze
the world's language or analyze the world's network
metadata or analyze, you know,
all the vehicle telematics data and have AI have the intelligence of what that data
is saying. Well, you have to have that data in, you know, an analyzable form,
and it never starts in that form. So that loading and transformation, and it's not a one-time thing.
You know, if you want to analyze vehicle telematics data with AI, that is a data set that is a fire hose of massive proportion that never shuts off.
I mean, the cars are driving.
People are using them.
Exabytes of data are going to pour into your system.
And so the
real challenge isn't, oh, I've got this static data set. I'm going to put this AI system on,
and I'm going to do, I don't know, correlation or regression or ask it questions about
reliability or something like that. That would be hard enough if it was an exascale
static data set. But that's not the requirement. The requirement is there's a
billion cars driving at all times, and they're just pumping out data. And you've got to put a
system on top of that that derives intelligence. And that's a challenge. And a big challenge for
that is how do I take this never-ending giant pipe of data and get it into a usable form
in a reasonable period of time.
That's a real difficult challenge. It's something we've worked a lot on at Ocient as well.
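To make the fire-hose framing concrete, here is a small sketch; the fleet size follows the conversation, while the retained data per vehicle is an assumed figure.

```python
# Sketch of the never-ending vehicle telematics ingest problem.

vehicles = 1e9                  # ~a billion cars in operation
gb_per_vehicle_per_day = 0.1    # assume only 100 MB/day is retained per car
seconds_per_day = 86400

daily_bytes = vehicles * gb_per_vehicle_per_day * 1e9
sustained_gbps = daily_bytes / seconds_per_day / 1e9

print(f"Daily volume:     ~{daily_bytes/1e15:.0f} PB per day")
print(f"Sustained ingest: ~{sustained_gbps:,.0f} GB/sec, continuously")
# Even at a modest 100 MB per vehicle per day, the loading and
# transformation pipeline has to absorb ~100 PB/day, reaching an
# exabyte in roughly ten days.
```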
Maybe we switch gears just for a moment. So I know, Chris, we've talked about this before,
right? You know, you have 50 to 90% improvements in power efficiency, and it's just amazing. But, you know,
working with a lot of different partners from the Solidigm side, we're not really seeing that from other partners.
Right. We're not seeing the same aggressive stance.
So is there anything that you guys are trying to do, you know, to uphold the industry, to push others to kind of do the same thing?
Do you want to talk a little bit about that? Yeah, absolutely. I mean, an essential ingredient for making these systems more power efficient is to focus on making them more power efficient.
And, you know, without that, we're just not going to get there.
And the way it works is just like, you know, every company that makes any kind of computing product
focuses on cost efficiency, focuses on performance. And the way that happens is, you know, it's not
just like one number, like make it seven. You know, it's very complex when you say, what do
you mean by cost? Well, you got all the lifecycle costs. What do you mean by performance? Like,
what does that mean? You know, so the first thing is to define what it means to
be energy efficient. And we're working right now to, you know, with a lot of other industry players,
including Solidigm and others to say, like, we've got to define what energy efficiency means. What's
the benchmark? What's the metric? And we're in the very early stages as an industry of doing that,
but that's going to happen. We're going to create measurements and metrics and, you know, goals. So that's kind
of step one. In parallel with that, okay, now you've defined what it means to be
efficient, what it means to be energy efficient. And so then the way it works at any
information technology development company is you then start to prioritize.
And some low-hanging fruit is bigger than others. You focus on the bigger low-hanging fruit first.
So you always start with like, oh, this first thing we could do won't take very much time.
It'll make it twice as efficient. All right, let's do that. Then let's focus on the next thing.
Well, that's going to take a lot longer, but it'll, you know, it'll be, you know, another two X improvement. Let's do that next. And you, you, you, you prioritize
them based on their efficiency. And then, you know, over time it gets harder, like 7% more
efficient with this giant investment. Well, that's down the road. So where the industry is right now
is we really haven't focused on it. And as a result, like there's a lot of low-hanging fruit.
And so you're going to see, like, you saw this with Ocient: like immediately we come out with a 50 to 90% improvement.
In some cases, we've been demonstrating a 98% reduction. Like that's giant because it was just
a lot of low hanging fruit to start with. And so, you know, we now already know here's the things
we're going to do next. This will double it. This will, you know, improve it by 30%. So it really is just, you know, that simple and that complex.
I mean, at a simple level, it's like, you just prioritize the biggest bang for the buck and work
your way down that list. And the only way you do that is by saying, I'm going to make that a
priority, and here's how I'm going to measure it. Now,
it gets really complex in how you do it.
But it is really just simply a matter of focus.
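The prioritization Chris describes is essentially a greedy ranking of candidate projects by savings per unit of effort; the projects and numbers below are entirely hypothetical, just to show the shape of the exercise.

```python
# Hypothetical "biggest bang for the buck" ranking of efficiency work.

candidates = [
    {"name": "CPU frequency scaling on idle paths", "savings_pct": 50, "effort_months": 2},
    {"name": "Move processing adjacent to storage", "savings_pct": 40, "effort_months": 12},
    {"name": "Query planner micro-optimizations",   "savings_pct": 7,  "effort_months": 9},
]

# Greedy ordering: highest savings per month of engineering effort first.
ranked = sorted(candidates,
                key=lambda c: c["savings_pct"] / c["effort_months"],
                reverse=True)

for c in ranked:
    ratio = c["savings_pct"] / c["effort_months"]
    print(f'{c["name"]:40s} {ratio:5.1f} points/month')
```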
It is really refreshing to hear somebody in your position focusing on energy efficiency, because this is, you know, it is a consequential topic to literally every company.
And yet it is not the focus of most companies.
And yet, what do we hear? We hear people constantly talking about criticizing AI,
criticizing the cloud, modern compute, et cetera, for the energy consumption that it has.
Criticizing its scale, I mean. People will talk about, you know, oh, well, this company has got their own nuclear power plant, or this company is, you know, pre-buying gigawatts, you know, terawatts, of electricity in order to support their build-out.
Yeah. It is refreshing.
It is refreshing to hear somebody say, no, you know, we've got to think about the efficiency of this.
We've got to think about the impact of all this.
And yet at the same time, being able to say, but we also have to be able to support data scale that just nobody could achieve previously.
Yeah, and the reason why is that it's really changed.
If you go back a year or two ago, this wasn't even a topic.
But what's been happening is energy use by data centers in general, driven by Bitcoin, definitely driven lately by AI.
But still, you know, all the other types of data analytics are the biggest users still and will continue.
So if you go back a year or two, it just hadn't, like, reached the tipping point.
And then a couple of years ago, the amount of energy consumed by data centers passed California.
It's starting to get real. And now it's about to pass Brazil. You know, a big economy like Germany will soon be using less energy than what data centers are using.
And the problem is, here's this category that has gone from,
you know, not a big deal
to like passing by large countries and accelerating.
If you look at the latest IEA models,
the International Energy Agency models
of like what is going on
with data center power consumption,
they have different models,
kind of like climate models.
They're projecting different things.
And there's a range in the projected amount of energy consumed by data centers. You know, in the past, storage would double every two years, and that's Moore's law, or compute or whatever. And next thing you know, it's a
million times more storage or a million times more compute. Well, that's not going to work with energy.
We're not going to be able to have a million times more energy for compute. So the only answer is
efficiency. And so that, I guess, you know, my call to action would be
right now, if you look at RFPs, and this is what drives the industry is when big customers buy
stuff or customers buy stuff in general, what's in the RFP, you know, the request for proposal,
what's in, you know, what are they making their buying decision based on? And it'll have like
cost and performance detailed out forever, and here's all the capabilities I need.
What they don't currently have is:
here's how much energy you can use, and here's how much energy efficiency you've got to hit.
It's not in the RFP.
So my call to action would be, you know, to customers, and it's in customers'
best interest, like these are accelerating costs, a million dollars a year for energy.
You can't have that accelerate. So what we need to see in RFPs and in buying decision making is energy efficiency, energy use.
And that's going to cause the whole industry to focus on it. And you'll see breakthrough results.
I've got to agree with you there. And the other thing, too, to keep in mind is many of these companies have made commitments that they're going to reduce their energy use or their greenhouse gas impact. They can't
not do that just because they're chasing the AI trend. So they have to find ways of reducing power consumption. Great, great conversation. I so appreciate the fact that we were able to get
you on here and talk about this because I feel like this is something that's been missing from our conversations a lot of the time here on Utilizing Tech.
And yet it's something that's important to all of us.
So thank you so much, Chris.
It's great to catch up with you.
It's great to learn about Ocient and how you're leveraging advances in technology like flash storage and co-location of compute processing and storage
and so on to improve the overall impact of everything that we're doing.
Thank you for listening to this episode of Utilizing Tech Podcast.
You can find this podcast in your favorite podcast application as well as on our YouTube
channel.
Just search for it in your favorite search engine.
If you enjoyed this discussion, please do leave us a rating, maybe a nice review.
This podcast was brought to you by Solidigm,
as well as Tech Field Day, part of the Futurum Group.
For show notes and more episodes,
head over to our dedicated website,
which is utilizingtech.com,
or find us on X/Twitter and Mastodon at Utilizing Tech.
Thanks for listening, and we will catch you next week.