In The Arena by TechArena - Delivering a Sustainable Data Platform for the AI Era with Weka, Data Insights Series Sponsored by Solidigm
Episode Date: May 2, 2024. TechArena host Allyson Klein and Solidigm's Jeniece Wnorowski chat with Weka's Joel Kaufman, as he tours the Weka data platform and how the company's innovation provides sustainable data management that scales for the AI era.
Transcript
Welcome to the Tech Arena Data Insight Series. My name is Allyson Klein, and joining me back in studio is my co-host, Jeniece Wnorowski from Solidigm. Welcome to the program, Jeniece.
Thank you, Allyson. It's great to be back.
We continue to just have fantastic discussions with leaders in the industry, you know, pursuing what's happening with the data pipeline and how that's transforming for organizations
as they try to tackle new capabilities
with artificial intelligence.
Tell me about who you've got lined up for us today
to talk to.
Yeah, thank you, Allyson.
I am super excited about our guests today.
We have joining us Joel Kaufman,
who is the Senior Technical Marketing Manager of
Weka. And anyone who's been following anything AI, Weka is definitely one of those organizations
that are a major bright spot in innovation. And we're just super excited to hear from
Joel about the technologies that they're working on and the innovation that they're working toward.
Welcome to the program, Joel.
Hi, glad to be here.
So, Joel, why don't you just start with an introduction of yourself and, you know, your background and data and how it led to the position that you've got at Weka?
Yeah, I started out quite a while back without dating myself too much, but for a while I worked for Silicon Graphics, which should take you way back in the day, doing a variety of things for them around high performance computing, managing systems there, things like that.
And then after a while, a few people who had moved on to a different company, which became NetApp, said, hey, you should come over and start working there.
And so I wound up at NetApp for a very long time, approximately a little over 20 years,
doing everything from introductory setup to managing entire teams of technical marketing
engineers, and then handling a lot of our data protection, data management, and replication, data replication programs.
And then after a while, again, people you know always pull you around. And so I got this call
saying, you should come over and check out this very cool new technology at a company called Weka.
At which point I moved out of the engineering side of technical marketing and more into the marketing
side of technical marketing, where I help pull together and explain our technology in a way that
is meaningful and makes sense to a lot of our customers, our partners, and sometimes even
internally for training our own people at Weka. Now, we recently had Weka on the program,
and I wanted to follow up to go a little bit more under the hood with your solutions. The
first interview was great, and people should check that out. But there's so many more questions that
I had at the end of that interview. So I'm so glad that Jeniece and I are getting the chance
to talk to you. Before we go there, though, can you just
do an introduction of Weka for those who aren't familiar with your solutions and, you know,
put a little context around how Weka works within the industry?
Yeah. So the way to think of Weka these days is that we're really this data platform that,
you know, partially through a lot of foresight and
partially through a little bit of luck, we came to this point where there was this convergence of
really high performance compute, really high performance networking, but storage seemed to
be left behind in the dust a little bit. And so our founders took a look at it and said, you know, there's got to be a way of solving this storage problem as part of this, I guess you'd call it, kind of a pyramid, right? A triangle of these three core things that go into infrastructure. What came out of that is an incredibly parallelized file system. We're able to deliver really strong software-based data management,
distributed storage functionality across the entire data platform that we provide.
And we can do it on-prem or in the cloud for pretty much any workload that's out there.
We tend to focus a little bit today on the really performance-intensive workloads,
things like AI/ML, high-performance compute, but all the different variations
about what those implementations look like across a large number of industries.
That is amazing, Joel. Can you comment a little bit further,
though, on any particular customer challenges that Weka is kind of uniquely solving? And why
is this a focus for you today? Yeah. So if we take a look kind of at the history of what's
been going on in the industry, we started out with things like HPC. And HPC was
sort of this isolated, you know, it was built for large labs, maybe some, you know, universities that
were doing this in the public sector space. And there was this trend of ultra high performance.
And that moved on for about 10 years. And then you fast forward towards today.
And what we're finding is that a lot of the types of applications, the problems that customers really were trying to solve around, you know, even internal things, manufacturing, business and finance types of things, started to require more and more compute power.
They started to require a lot more intelligence
to what they're doing. And when you couple that with the rise of things like AI/ML and generative AI, it became a convergence. And so now we're seeing HPC converging into AI in the
enterprise space. And so a lot of these challenges that we're seeing is companies that are saying,
I'm used to doing traditional enterprise-level IT.
Here's a whole new classification of applications
that they might be in the cloud,
they might be on-prem,
they're incredibly high performance,
they have scale that they've never seen before, not just from a performance or capacity standpoint,
but even little things like the numbers of files that are being used to pull this data
in for processing are at volumes that just literally have never been seen in the history
of computing.
And so being able to say, take a step back and go, we've simplified this environment. We're able to give you all the scale and performance you need. And by doing that, it really makes it a much more simplified and easy to consume experience for those enterprise customers. Some of the customers that we're dealing with at this point aren't just the
traditional top enterprise tier. It's a new generation of data center providers that are
doing things like GPU as a service, right? They need to figure out how do they handle
tens or hundreds of customers all trying to consume massive farms of GPUs in a somewhat isolated manner, and yet maintain consistent performance across all of it to give the best bang for the buck. And they're now looking at the fact that traditional types of storage, you know, that
are being represented as data platforms simply don't have the performance density to make them
actually sustainable moving forward. So being able to say we can offer all these capabilities,
but reduce our entire carbon footprint, make sure that it is sustainable from a community standpoint, is really becoming very, very crucial to a lot of these customers.

That's a beautiful story. And, you know, you're talking about how you've simplified. What I think folks who have managed data for a really long time know is that simplification is not an easy thing
to do. So can you go a little bit deeper into how Weka has approached data management in a unique
way to deliver that simplification to customers across that diverse landscape?
Yeah, absolutely.
So one of the things that we've noticed, and I heard you at the very beginning talking
about data pipeline.
We've been talking about data pipelines now for probably close to three years or so.
What we discovered in a lot of customers is that when you start looking at
legacy architectures, and I don't mean that in a really disparaging way, what I really mean is
architectures that have done fantastic for traditional enterprise IT for years and years
and probably will continue to do that for the future. They're not architected to look at different types of IO in a system and manage
it in a really appreciable way. So a good example for this is, let's talk about AI in general,
or generative AI. If you look at a pipeline for a workflow and a tool chain that's used for all
these, you start out by ingesting data, and it could be market data. It could be data sets for things like genomics, protein folding libraries,
integration with cryo-EM systems. It could be images for doing manufacturing quality assurance,
right? QA on varieties of components, things like that. So you ingest,
there's a certain IO profile for that. Then you turn around and say, the next thing you need to
do is normalize this data out, go ahead and transform it, maybe an ETL or an ELT type of
function, something that takes that data. And suddenly you go from this big, you know, maybe
slow, but lots of streams of writes coming in, to now you have to do this blended
IO of reads and writes and back and forth. And as the scale of these files gets larger and larger,
now you have to do tons of metadata lookups. And eventually you get around to processing the data.
And then the final step, well, not really the final step, but the next step in the pipeline is maybe you send it off to training
in the AI model because you've normalized the data.
Now you do the ingest, and you do the retuning,
and the training, and the fine-tuning, and so on,
and that's a massive type of read function,
and then you take the data and you validate it,
send it back if someone says the precision's not enough,
and you start your automated loop,
and this type of blended IO across the board
has been a nightmare for most companies to handle.
And it was so bad for a long time
that even with going from hard drives to flash drives,
you simply could not have storage systems
that were architected to handle every stage.
And so you wound up with, here's my dedicated ingest system.
Then I'll copy the data over to a system for doing the ETL.
Then I'll copy the data again to pump it into these GPUs for training.
And then I'll take the result back out and then they'll copy it back.
And this has caused just massive complexity. So what Weka has done, because we have the ability to handle so many different IO profiles at the same time without any real performance deterioration, is we've flattened that entire copying architecture out.
We've made it to the point where you can just have a single pool
or file system, if you want to, of this data
and have so much performance across reads, writes, big files, small files,
and numbers of files that we've removed that entire copy process.
And ultimately what it does is it helps you feed your compute platform,
CPUs and GPUs, faster to keep them massively utilized
so they're not sitting there burning power,
just idling along, waiting for data to come in.
And that simplification has really transformed
what a lot of our customers are doing.
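The contrast Joel draws, a staged copy-heavy pipeline versus a single flattened pool, can be sketched in a few lines of Python. This is purely illustrative: the directories and stage names are hypothetical stand-ins for whole storage systems, not anything Weka-specific.

```python
import shutil
import tempfile
from pathlib import Path

def staged_pipeline(root: Path) -> int:
    """Legacy pattern: each pipeline stage gets its own storage silo,
    and the data set is copied between silos at every hand-off."""
    ingest, etl, train = root / "ingest", root / "etl", root / "train"
    for d in (ingest, etl, train):
        d.mkdir(parents=True)
    (ingest / "raw.dat").write_text("sensor readings")      # ingest stage writes
    copies = 0
    shutil.copy(ingest / "raw.dat", etl / "raw.dat")        # hand-off copy #1
    copies += 1
    (etl / "clean.dat").write_text("normalized readings")   # ETL stage output
    shutil.copy(etl / "clean.dat", train / "clean.dat")     # hand-off copy #2
    copies += 1
    _ = (train / "clean.dat").read_text()                   # training reads
    return copies

def flat_pipeline(root: Path) -> int:
    """Single-namespace pattern: every stage reads and writes the same
    pool, so the hand-off copies disappear entirely."""
    pool = root / "pool"
    pool.mkdir(parents=True)
    (pool / "raw.dat").write_text("sensor readings")        # ingest in place
    (pool / "clean.dat").write_text("normalized readings")  # ETL in place
    _ = (pool / "clean.dat").read_text()                    # training in place
    return 0

with tempfile.TemporaryDirectory() as tmp:
    print(staged_pipeline(Path(tmp) / "a"), flat_pipeline(Path(tmp) / "b"))
    # prints: 2 0
```

With N stages, the staged pattern incurs N-1 full copies of the working set; the flattened layout removes them, which is the simplification being described.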
That is amazing. And just hearing you explain that simplification, as many others are kind
of looking at the overall data pipeline to kind of understand how to navigate through it.
I think you guys have a really amazing handle on it. But can you take a step back? And you've
mentioned a lot of different workloads as well, but I'd love to understand even further, you know, you guys seem to have a really big affinity with working with
data in the cloud. And can you speak a little more specifically around how you're solving data
movement challenges across distributed systems? Yeah, this is a real interesting one. And I kind of want to clarify a little bit. When we talk about data
movement challenges, there's a unique aspect to it that I think is underestimated. And that is that
data has an extreme amount of gravity. And so when we talk with customers, one of the biggest things
that we do and consult with our customers about is
not just, you know, can you move the data, but should you move the data? And so we're finding
this really interesting combination of customers who are saying, nope, everything I'm going to do
is on-prem. Some who are 100% cloud native. We actually have one customer
in the media and entertainment space,
Preymaker.
Everything they do,
it's 100% cloud focused.
They don't want to deal with infrastructure.
But more and more,
we're beginning to see this trend
of customers who are making these decisions
about what data needs to be moved.
And it doesn't necessarily have to be all of it.
So we get into this very
hybrid type of, of cloud play. So what Weka does under the covers a little bit is we have a
technology that, that really enables us called Snap to Object. And one of the things when we
began this, this, uh, this process of bringing Weka as a product to market, is we took a look at what costs look
like and what a better way of maybe doing replication would look like, or data movement,
really, in this case. Actually, let's call it data mobility or data liquidity, if you want to.
And so being able to take a complete image of what the data is on a Weka system, move it to an object store where we don't care where it lives. It could be and we produce this every time we take the snapshot and move the data, if you can pass that key along to another Weka system, it can then go access that data as long as they have access to the object store. And so you get this
sort of combination of a third-party witness of the data, because it's now on an entirely separate object store system. So you have that separation of domains, and yet any other Weka system can grab it. And so we're seeing use cases. A great example: there's a pharmaceutical company in the Boston area doing really big work around protein folding and virology and things like that to create solutions for the health of their customers. And what they do is a lot of their preprocessing on-prem. And then when they do their final model training and final analysis, they snapshot the data
up into the cloud, right?
Attach a cloud-based Weka system, and then they can scale up massive amounts of rental compute power to address that data
at really high performance levels,
again, even in the cloud.
And then once they're done,
they get a couple of much smaller outputs
and they just send it right back down to on-prem
for final archiving, storage, et cetera.
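The flow Joel outlines, snapshotting an image into an object store and handing the resulting key to a second cluster, can be mimicked in miniature. This is a sketch of the concept only: a plain dict stands in for the object store, and nothing here reflects Weka's actual on-disk or object format.

```python
import hashlib
import json

# A dict stands in for the object store (S3, etc.). The real Snap-to-Object
# mechanism is Weka-internal; this only illustrates the key hand-off idea.
object_store: dict = {}

def snap_to_object(files: dict) -> str:
    """Snapshot a set of files into the object store and return the key
    that identifies this point-in-time image."""
    manifest = {}
    for path, data in files.items():
        blob = data.encode()
        blob_key = hashlib.sha256(blob).hexdigest()
        object_store[blob_key] = blob          # data blocks land in the store
        manifest[path] = blob_key
    snap = json.dumps(manifest, sort_keys=True).encode()
    snap_key = hashlib.sha256(snap).hexdigest()
    object_store[snap_key] = snap              # the manifest itself is an object
    return snap_key

def attach_snapshot(snap_key: str) -> dict:
    """A second system holding the key (and store access) rehydrates the image."""
    manifest = json.loads(object_store[snap_key])
    return {path: object_store[k].decode() for path, k in manifest.items()}

on_prem = {"/models/protein.ckpt": "weights-v1", "/data/run1.csv": "a,b,c"}
key = snap_to_object(on_prem)        # e.g. produced by the on-prem cluster
cloud_view = attach_snapshot(key)    # e.g. a cloud-side cluster with the key
print(cloud_view == on_prem)         # prints: True
```

The point of the design is that the key plus shared object-store access is all the second system needs; the two clusters never talk to each other directly.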
And so we're seeing this type of data movement and distribution happening across
a lot of our customers now.

So, Joel, thank you so much for that. Could you also speak a little bit more about the work you guys recently introduced with the WekaPod as a complement to your Weka reference architecture, and tell us a bit about, you know, why that infrastructure?
Yeah.
So this has been kind of an interesting journey.
If you look at Weka from the start point,
we are completely agnostic.
We are a software solution, right?
In fact, our original coding
and our original builds were all in the cloud.
We're one of the, I guess you could say, sort of oddball infrastructure companies,
where instead of starting on-prem and then saying, we'll figure out how to port it to the cloud,
we started in the cloud, and then we had customer demand come in and say, hey, you should be on-prem because we have a real need locally.
And one of the benefits of all this, just a little side note,
is that because we're just software, we run the exact same binary,
whether it's on-prem or in the cloud.
We don't change how we operate our data platform.
And because of this, it gives us a certain amount of agnosticism that lets us, you know, lets our customers again, make those decisions to deploy anywhere.
So cloud was evolution one. Then we moved on to evolution two, which was on-prem and then hybrid. And evolution three now is essentially the appliance, or the ultimate in consumption simplification for on-prem customers. For customers deploying SuperPOD and BasePOD systems, they wanted to have an appliance effectively built out for use in those particular use cases. And so at Weka, we partnered with one of the hardware vendors that we work with, and we've produced effectively a completely wrapped appliance. If a customer wants to purchase it, they can buy it as a complete, effectively turnkey bundle as part of a SuperPOD or BasePOD deployment. And we go out the door now and give them that complete reference architecture, with us as the data platform and the compute from NVIDIA, and make it super simple for them to use. Effectively, at that point, it's a plug-and-play type of solution.
Where we're starting to see customers have real strong interest for this is customers who either
don't know what their AI solutions or high-performance compute solutions will look like,
or they're unsure what the requirements will be moving forward in the future.
And so one of the things that we've done with this WekaPod that's kind of interesting is that
it's all Gen 5 hardware under the covers. And what I mean by that is PCIe Gen 5, so you get the latest processors, you get the latest SSDs, flash drives, and the latest networking in there. And the ultimate result is that we give you this performance density where you can start with a very small environment. In fact, the smallest one of these is eight servers, or eight nodes, I guess you could say. And yet those eight nodes can go anywhere from half a petabyte to one petabyte, but can deliver performance that is absolutely unheard of: 18 million IOPS, latency that rivals a raw Fibre Channel SAN. In fact, in some cases,
even faster than that. And yet this entire stack is seven and a half kilowatts of power consumption
completely. And so that type of performance density, yeah, it gives our customers so much
flexibility in terms of saying, look, if I put in this really small system, I don't know what my future scaling will look like, but the performance is there.
The ability to expand is there.
I don't have to burn a ton of power and cooling for that footprint.
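The density figures quoted above (18 million IOPS, roughly 7.5 kW, half a petabyte to a petabyte across eight nodes) reduce to some quick arithmetic:

```python
# Back-of-the-envelope density from the figures quoted above.
iops = 18_000_000     # peak IOPS quoted for the 8-node WekaPod configuration
power_kw = 7.5        # quoted power draw for the whole stack
capacity_pb = 1.0     # top end of the 0.5-1 PB capacity range
nodes = 8

print(f"{iops / (power_kw * 1000):,.0f} IOPS per watt")   # prints: 2,400 IOPS per watt
print(f"{iops // nodes:,} IOPS per node")                 # prints: 2,250,000 IOPS per node
print(f"{capacity_pb * 1000 / power_kw:,.0f} TB per kW")  # prints: 133 TB per kW
```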
You know, to be quite frank, this entire space is moving so fast. I mean, we've seen, let alone the last two years, the last six months or even two months alone, how this industry has moved, and the changes in what data structures look like and the hardware itself. And yet this is probably as close as you're going to be able to get to something that
at least for that data platform component really could help you out when there's an unknown future
coming with the rate of change. You know, Joel, I loved how you talked about that because I've
been thinking about this quite a bit in terms of trying to forecast how the industry is going to
keep delivering balanced platforms
when you're looking at the innovation cycles that we're looking at, right?
And so as you break down that platform, and you did such a beautiful job of it,
are there areas where you look at balancing performance, efficiency, and scale
where you want the industry to really pay attention across logic, storage media,
network and IO, where you're thinking, hey, this is going to become a bottleneck pretty soon?
Or is there anything in particular that you would call out from an efficiency standpoint
that the industry really needs to focus on? Yeah, I think there's going to be a couple of
turning points that are going to have to be addressed at some point. From a hardware standpoint, it's only going to go up from here, right? We've gone from 10 years ago, 15 years ago, 10 gig Ethernet to, you know, all the way through 100, 200, 400. And by this time next year, 800 gigabit is on the table.
So I don't think we're going to have these huge network bottlenecks
for the vast majority of workloads that are out there.
And the same thing, processors will get faster.
GPUs will get faster.
Storage devices will still be kind of interesting in terms of how they go.
I think there's two things that are
going to have to be addressed. One is that there's in the storage industry and storage device
category, there's really this kind of bifurcation that's happening. And this is kind of a, just to
be fair, this is a Joel opinion to a large extent. It's not necessarily a Weka opinion. But what we're seeing is this weird bifurcation. You have traditional flash SSDs, you know, the TLC-based ones, that are relatively high performance, with good endurance, and so on and so forth. But so far they've been somewhat limited in size. And so that's created this secondary type of device out there
where you have bigger SSDs, flash devices,
but they don't necessarily have as much endurance.
But capacities are significantly higher.
And so the question becomes,
can you go ahead and figure out a way of making either the fast versions, the TLC devices, bigger?
Or is there a way of making the bigger QLC devices a little bit faster?
And so there's a bit of convergence there that needs to happen. If I had to place a bet, I'd put it on the TLC side,
because that seems to be where the innovation is happening a little bit faster. But that being
said, you know, as Jeniece was saying, Solidigm, they've gone and turned the corner the other
direction and said, we're going to make QLC that's getting faster and faster and faster.
So I think there's, when those start to collide head to head, that's going to be a real interesting point
to see where that goes. Beyond that though, the number one thing that's going to have to happen
in the industry really is this addressing of sustainability. I came away from GTC, where we had a huge presence,
and they were talking about Blackwell, the new GB200 systems from NVIDIA and the Blackwell
processor combined with the Grace processors and a stack where it's going to be water-cooled pretty much by default
when you buy the full, huge, disaggregated GPU there.
And when Jensen was on stage talking about 120 kilowatts per rack,
the question becomes, it may be significantly faster,
and you can subdivide that up.
I think we're going to reach this interesting
point where it's going to be incredibly hard for various companies to actually be able to deploy
something like a water-cooled GB200 system. And it will come down to whether customers have the power that they can actually bring in.
And for more and more data centers, power is becoming an absolute problem, because the utilities that are supplying them are going, we're out of power capacity. We literally cannot provide more amperage and current into your data center, because we're out. Our major infrastructure doesn't have enough power plants to produce the power to feed these things. And so I think
initially this is going to be a very sparsely deployed type of system, simply not because the technology doesn't exist, but because the power doesn't exist. And that's going to become an ongoing reckoning through the industry: what is the correct amount of compute power that you're able to even put against a problem without having to build custom data centers in places where they have surplus power?
And that's going to be a very interesting challenge moving forward.
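To put the 120-kilowatt-per-rack figure in perspective, a quick calculation shows how few such racks a fixed power feed supports. The facility sizes below are hypothetical examples, and cooling and distribution overhead are ignored, which only makes the picture worse in practice:

```python
RACK_KW = 120  # per-rack draw cited above for a fully water-cooled GB200-class rack

def racks_supported(facility_mw: float, rack_kw: float = RACK_KW) -> int:
    """Whole racks a facility's total power budget can feed,
    ignoring cooling and power-distribution overhead."""
    return int(facility_mw * 1000 // rack_kw)

for mw in (1, 5, 20):
    print(f"{mw} MW feed -> {racks_supported(mw)} racks")
# prints: 1 MW feed -> 8 racks, 5 MW -> 41 racks, 20 MW -> 166 racks
```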
Wow. A lot of awesome insight there, Joel. You started off by talking about, you know, the bifurcation of hardware and who's up to the challenge. And I think we're definitely up to
the challenge. And as you talk through, you know, sustainability and some of those elements,
we couldn't agree more, which is why we really believe in, you know, not just creating drives
that are higher capacity or faster, but how does it kind of help attack this entire solution set?
And your insight and work with Weka has just been admirable from my standpoint personally,
and also from our company Solidigm's standpoint. But we do have to ask: we've been to a lot of shows and seen Weka there, but where else can folks go just to learn more and engage with your team, and maybe trial some of this work you're working on?
Yeah, you know, as always, weka.io is where you should go to find any and all information.
We have links there for all of our solutions, both on industries, both on technology types.
And from there, you can absolutely, you know, click a few buttons, chat with people live online, get answers, and find out more about how we can help you out.
Well, Joel, next time you guys bring out the purple Lamborghinis, I'd like an invitation for a ride.
That was really cool at GTC.
And I'm sure there were a lot of folks who are a bit envious of the folks that got a chance to take a look at those.
Thank you so much for being on the program today.
I've been following Weka since last year when you guys were at Cloud Field Day, and I've just been so impressed with the solutions that you're delivering to the market.
We want to keep having you on the program, and today just underscored why.
Thanks for being here.
More than happy to do it.
And absolutely, if the Lamborghinis come out, come find me.
I'll make sure you get in for a ride.
Jeniece, you're going to have to come with me.
We're going to have to find a three-seater.
All right.
And Jeniece, thank you so much for co-hosting.
We will be back with our next episode soon as we explore the data pipeline.