In The Arena by TechArena - WEKA on AI Data Centers: A New Infrastructure Playbook
Episode Date: November 12, 2025
In this episode, Allyson Klein, Scott Shadley, and Jeneice Wnorowski (Solidigm) talk with Val Bercovici (WEKA) about aligning hardware and software, scaling AI productivity, and building next-gen data centers.
Transcript
Welcome to Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein.
Now, let's step into the arena.
Welcome to In the Arena. My name's Allyson Klein. And today is another Data Insights podcast, which means Jeneice Wnorowski with Solidigm is back with me.
Jeneice, how's it going?
Hey, Allyson, it's great. And it's awesome to be back.
So we have a fantastic episode today. This is amazing. Why don't you tell us what the topic is and who you've brought along with you this time?
Yes, as always, I am even more excited about this topic. But today we have an opportunity to talk token economics and overall just AI productivity. And to talk through this, we actually have WEKA joining us today, as well as a guest from Solidigm. So today we're going to hear from Val Bercovici, who is the Chief AI Officer for WEKA, and then we have Scott Shadley, Director of Thought Leadership for Solidigm.
So, guys, welcome to the program.
Great to be here.
Looking forward to a fun conversation.
Val, this is your first time on the podcast, which is awesome.
We've had WEKA on before, but why don't you go ahead and just introduce yourself and your background at WEKA?
Sure.
And Scott and I used to work together before.
You know, I think we intersected, obviously, at NetApp after the SolidFire acquisition, where I was CTO, and I
spent some time with Google helping bring their Borg project to the world as open source Kubernetes.
And since then, I joined WEKA in charge of AI strategy.
And ordinarily, that would be a part-time role.
And there would be all sorts of other tactical things to do.
But in the biggest industry of our lifetimes, growing faster than anything we've ever seen,
strategy is actually a full-time job, because in the past year alone that I've been here,
the industry has shifted so dramatically.
And you do have to think ahead and make some very clever educated guesses,
and obviously be right a bit more often than you're wrong.
And, Scott, you've been on the show quite a bit.
I know that our audience is pretty familiar with you, but just do a reintroduction.
Yeah, no problem.
I've been in the industry now a little over 25 years, so it adds some validation to my age.
I also went for one of the largest titles possible with my thought leadership.
So the more letters you have in your name, the more important you are, right?
But I've been working for Solidigm for just over three years, loving the technology growth
and direction we're going. And to Val's point,
the momentum behind the market we're driving now has been a lot of fun to follow and push and be a part of.
So Scott and Val, let's start with the overall big picture.
Time to First Token is becoming a new measure of AI responsiveness.
So why is this such a critical benchmark, and how does it change the way you both think about building infrastructure for inference?
I'll jump on this if you don't mind, Scott.
So what I love about our industry in general here and metrics like time to first token
is that they're so transparent, and it takes me back to my early database benchmarking days
when transactions per second was literally business value.
And obviously we've added so many layers of abstraction since then,
but with AI, we're back to the future, basically, where important metrics like time to first
token literally translate to revenue, OPEX, and gross margin for the inference providers
and the model builders.
And ultimately, if you take a look at big companies in the headlines, like Cursor and Anthropic
with Claude Code, it translates to their
value and profitability in the marketplace, or their valuation.
So specifically, it's around how long it takes to translate the prompts that we send
into the first byte of response that we actually see on our screens.
And it gets much, much deeper than that for the AI apps now that are more API-driven
and chat-driven.
But I've spoken enough.
Let's see, obviously, what Scott thinks about this and how we can translate it into the
rest of the conversation.
Absolutely.
And to your point, it kind of goes back to the TPS reports, right?
I love the reference to Office Space.
That was wonderful.
But yeah, it truly is.
We're all in a society now where it's about how fast we can do something.
And especially with AI now, where you're doing things with RAG and all these other ways,
and we're getting into agentic and beyond, the time to first token is not even necessarily the
time to the first token of our response, but the internal cycle of responses behind it.
So it's all about the data, but it's a matter of when you see that first response to the
data you put in, whether it be a token or any other type of economic unit at this point.
I just want to add an example because we're talking a little bit in abstract here.
The example is real-time voice translation.
Apple just announced that with their AirPods the other day.
Google did it a few months ago at their I/O conference.
Who wants to wait an awkward pregnant pause of 30, 40 seconds for a translation?
We want that to be real-time and instantaneous.
And time to first token is a key metric for that kind of use case.
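To make time to first token concrete, here is a minimal sketch of how you might measure it against a streaming endpoint. The stream here is a stand-in generator rather than any particular vendor's API, and the delay values are illustrative assumptions.

import time

def measure_ttft(stream):
    # Returns (time to first token, total time, token count) for a token stream.
    start = time.perf_counter()
    ttft = None
    count = 0
    for _token in stream:                        # any iterable yielding tokens or chunks
        if ttft is None:
            ttft = time.perf_counter() - start   # moment the first token reaches the user
        count += 1
    total = time.perf_counter() - start
    return ttft, total, count

def fake_stream():
    # Stand-in for a real model response: a prefill delay, then a steady decode rate.
    time.sleep(0.4)                              # assumed prefill latency
    for _ in range(100):
        time.sleep(0.01)                         # assumed per-token decode time
        yield "tok"

ttft, total, n = measure_ttft(fake_stream())
print(f"TTFT={ttft:.3f}s, total={total:.3f}s, ~{n/total:.0f} tokens/s")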
Now, as we're turning to a period where we're seeing broad proliferation across so many different industries,
the economics become really important.
And we know, Scott, that AI economics hinge on cost, performance, density.
There's so many variables here that IT administrators are taking a look at.
When you look at your conversations with customers,
how are they looking at how the SSDs in particular are reshaping this equation
to be able to scale AI factories to meet the need efficiently?
Yeah, it's an interesting conundrum we've run into, right?
So with all good things at launch, you go out full bore,
you put as much effort and energy as you can into it.
And then as you start to evolve it,
you realize there's a need to optimize that layout, that look, and that feel.
And so that's what's really happening with the SSD marketplace
is it's allowing us to bridge that gap even more than we've ever done before
between "I have to have memory now" versus "I have data in storage somewhere."
And the ability to do that handshake has just never been something that more legacy
hardware could do, and SSDs really have an opportunity to support it in the most amazing fashion,
but they still themselves have to be the right size. We have stuff that's sitting right next
to a GPU, next to the memory, next to whatever, and we have stuff sitting off on a network
attach of some sort, and they're not the same product. And if anybody tries to tell you
they are, then they don't know what they're talking about. And that's one thing we love about our
conversations with the customers is it's not how many SSDs you want. It's we know you need
storage, we know you need to improve your response times. How do we do that with the most effective
products on the market? So Val, I have a question for you specifically, around WEKA's NeuralMesh
Axon, which embeds storage intelligence into the GPU, or close to the GPU. Can you unpack
for us how this works, and why is it such a game changer for accelerating token readiness?
Yeah, I love this topic in general, because it shows you how different AI and AI factories, which
obviously are fundamentally AI data centers,
are from the traditional clouds and legacy data centers
we've worked with in the past.
We have this whole spectrum of capabilities
from this protocol we all use for solid-state storage;
the protocol is NVMe.
And I like to joke, in many contexts,
particularly in GPU computing, AI computing,
there's no S, there's no storage in NVMe, right?
It's Non-Volatile Memory Express.
And what we're able to do with Axon
is actually take NAND flash media, so SSDs,
and simply through software create memory performance and
a memory interface to those SSDs, certainly in aggregate, as a group, because it's all about
memory bandwidth.
So the more SSDs, or as we just referred to them, NVMe devices, we can pool together, the more
we can match both the raw bandwidth, and MBU, memory bandwidth utilization, is a key metric in
AI, but also the latency that the software of AI, which is primarily inference
servers in the monetization case, expects.
So Axon is quite simply software-defined memory.
It's software you install on a stock GPU server, either using the Nvidia reference architecture,
the DGX motherboards, or Dell, Supermicro, Lenovo, HP, you name it.
And merely by installing the software, Axon sees and pools together all of the embedded SSDs,
the embedded NVMe devices, on each and every GPU server, often also called the GPU node,
and together delivers the memory bandwidth and latency that inference servers expect. The most popular
one on the market today is an open source one called vLLM.
This software, this key software, actually sees the NVMe devices as memory now,
which is remarkable in terms of addressing the really pressing topics of token economics
and helping invert the upside down economics of AI businesses that is being widely reported on today.
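As a rough illustration of the MBU metric Val mentions, here is a back-of-the-envelope sketch. Every number below is an assumption chosen for illustration (a 70B-parameter model in FP16, an assumed KV-cache read per token, an assumed decode rate and HBM peak), not a measurement of Axon or of any real system.

# All figures are illustrative assumptions, not measurements.
model_bytes      = 70e9 * 2     # ~70B parameters at 2 bytes each (FP16)
kv_bytes_per_tok = 0.3e6        # assumed KV-cache bytes read per generated token
tokens_per_sec   = 20           # assumed decode rate for a single request
peak_bw_bytes    = 3.35e12      # assumed ~3.35 TB/s of HBM bandwidth on one GPU

# During decode, each new token touches roughly the full weights plus the KV cache.
achieved_bw = (model_bytes + kv_bytes_per_tok) * tokens_per_sec
mbu = achieved_bw / peak_bw_bytes
print(f"achieved ~{achieved_bw / 1e12:.2f} TB/s -> MBU ~{mbu:.0%}")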
So, Scott, Val just gave you a fantastic testimonial for why SSD technology is really needed in this space.
Can you unpack a little bit how Solidigm is taking advantage of this moment in terms of delivering solutions that actually balance that low latency with the high capacity that's required for AI pipelines?
Yeah, and I appreciate Val very much for giving that definition from, I would say, our partner point of view, because when we start talking about some
of this stuff, it's like, oh, yeah, they're just touting their line. But NVMe came into existence
through the legacy of the Solidigm organization and where we came from as an organization. And
when you talk about things like latency and the ability to scale, NVMe offered us that
unprecedented opportunity to do that. And you look at all the other infrastructure that's out
there for storage, and you have NVMe at the forefront of that. And we produce those products.
And when you want to talk about the capacity tradeoff, you have the ability to look at the
capacity drives we provide, which give that read performance that everybody still needs while offering
the right level of write performance to back it up when you need to actually pull the data
back into the storage products. And that's how WEKA can take advantage of the capabilities,
because regardless of the product, the drive is no longer the bottleneck. The latency provided by the
products in aggregate is such, to Val's point, that you can pull the data from a pool away from
that. And you can't do that with other technologies. This concept of a
global namespace is allowing you access to all of that capacity at once with extremely
low read latencies, but that nice write-back capability is something you just can't do with
things like a standard hard drive, for example.
And let me jump in on that, because there's a particular term here that is central to all
these discussions, called KV cache, key-value cache.
It's the working memory of LLMs, fundamentally.
And it's a cache.
It's very read-centric.
And there are occasional writes there.
And again, the raw capacity of memory you can provide for KV cache, the raw bandwidth, the latency in a read-centric workload, it's tailor-made, I think, for Solidigm's product line. It's really the sweet spot. And again, timing is everything in these markets. These are solutions that are absolutely ready for prime time today.
Yeah, I mean, Val, you're right. Timing is everything. And what seems to be an ongoing theme of this is utilization, right? Utilization is a huge driver of just overall ROI.
And WEKA, I think you've talked about this specifically, that you guys have about 90% or so GPU utilization.
Can you explain why storage architecture is really pivotal to achieving that level of efficiency?
Sure. And this is a fun topic now because a year ago, we would talk pretty opaquely about GPU utilization.
You know, more is better. I think what people are realizing now is you can basically be busy on a stationary bike or you can be busy on a Tour de France bike, you know, winning a race.
So it's how you're actually using the GPUs now.
versus just keeping them hot and busy.
And there's two primary use cases, training and inference.
In the training workload, it actually still is like the stationary bike discussion
where the busier a GPU is during training,
the more it's actually crunching all the data,
the more it's doing these things called gradient descents
and finding an optimal loss function and this magical checkpoint,
you know, the one that actually works best is what you ship as a model version.
That's the value there.
But GPU utilization is a vastly different beast when it comes to monetizing these models.
Inference is quite complicated.
It's quite a bit of a rabbit hole once you get into it.
But fundamentally, it has two pieces.
A piece called prefill, where you take your prompts and you actually convert them, more or less one for one, into tokens.
But then you create the working memory, the KV cache, by pre-filling it from those tokens.
And that just balloons maybe 100K, you know, with a PDF attachment or
something, of tokens into just many tens of gigabytes, very often, of key-value cache, as you add
up to 10 or 20,000 dimensions to each and every one of those tokens.
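To put rough numbers on that ballooning, here is a back-of-the-envelope KV-cache sizing sketch. The model shape is an illustrative 70B-class transformer with grouped-query attention, not any specific product.

# Illustrative model shape; swap in real values for the model you care about.
layers         = 80
kv_heads       = 8          # grouped-query attention: far fewer KV heads than query heads
head_dim       = 128
bytes_per_val  = 2          # FP16/BF16
context_tokens = 100_000    # e.g. a long prompt with a PDF attached

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val   # keys and values
total_gb = bytes_per_token * context_tokens / 1e9
print(f"~{bytes_per_token/1e3:.0f} KB of KV cache per token -> ~{total_gb:.0f} GB for {context_tokens:,} tokens")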
So pre-fill is one part of it, very GPU intensive.
So much so that Nvidia just yesterday, just to bring the headlines into the discussion,
announced this brand new GPU category for the first time in their 30-year history called
the Rubin CPX.
Don't ask me what the C, the P, and the X actually stand for;
they just like those letters nowadays.
But the Rubin CPX is basically a GPU
that's dedicated to just one phase of inference,
just this compute-intensive prefill phase of inference,
precisely because there isn't much memory required.
However, that's just the background stuff.
The foreground stuff that you and I see,
which would be the reasoning, the thought process,
in reasoning models, the reasoning tokens,
and of course what we actually pay for,
the output tokens, that's something called decode.
And that's extremely memory-intensive.
So much so that, again, we were asking GPUs in the past, and we will be for another
year or so until this new product actually ships from Nvidia. We're asking them to do two things
at once, like walk and chew gum, and they're not good at it because pre-filling and decoding are
literally asking a GPU to do very expensive kind of context switching. And when you can separate those
two things and have just pre-fill or just decode, now you have an optimally efficient GPU,
which is the most expensive thing we've ever bought in our data centers on a unit basis.
And now you're really getting a true ROI and bang for the buck to the point where we've
demonstrated you can get about five data centers worth of output from one data center
when you get GPU utilization right.
Now, I think that we've stated that SSDs are really critical for this.
And Scott, I want to go back to you because we're hitting this moment when enterprises
are starting to deploy AI in broad proliferation. I've been
at some conferences lately talking to everyone from large retailers to banks, and everybody's got
AI projects moving out of POCs into broad proliferation.
But we know something about their environments.
They're still heavily dependent on HDD systems.
How is Solidigm working with these enterprises to get them transitioned so that they can
actually deploy AI pipelines without paying a latency penalty?
Yeah, it's an interesting thing.
We've all talked over the years about the death of the hard drive, and we know that it's really not going to happen.
The hard drive is here to stay and will be forever and ever, ever, just like tape still exists today.
And they have their place in the systems, and generally those systems evolve.
And the challenge that we're seeing and the value that we're providing is the fact that we're no longer talking hot and cold, but levels of warm, right?
And we're trying to make sure that those levels of warm are appropriately driven to handle, for example, the KV cache that Val was talking about.
So you need to have a footprint of data that can be consumed to create those large KV caches and other aspects of it.
And it's just something that you can't do with a device that's designed to do only read or only write effectively.
You have to have something that can do both at the right workloads effectively.
And that's where our high capacity drives are coming into play.
We're providing that massive amount of storage, 122 terabytes, moving to 245 plus along with the rest of the marketplace.
But we're doing it with a technology and a footprint and focus that says, this is what you really need for it.
We're not trying to say, here's the fastest or here's the densest, but the most optimized for these workloads.
And that's one thing about our team, and you can see it by looking at our site and talking to the people that work here,
that we're driving this toward making the AI optimizations that are necessary so that customers don't have to try to do A or B.
It's really A plus B equals the right answer.
and that's what we're doing with these drives,
is to make sure that that's the solution we're solving for
and not trying to be overly aggressive
of being this, that, or the other,
but just a properly aligned solution for the customer.
So with that, Val, you know, with everything Scott's saying,
I wanted to get your opinion:
as companies move from pilot projects
to what we're now calling kind of AI factories,
what new challenges are you seeing around inference pipelines
and then how does WEKA really address those?
Yeah, I love that question because the challenges are so transparent right now.
If you take a look on Reddit or X, social media, you go to conferences and you ask people, you know, what they're seeing in terms of actual real world adoption, they're seeing two things.
They're addicted to where there's value.
And there's a lot of value in agents, particularly coding agents and research agents.
People are happily paying $200 a month or more now for this because it's generating thousands of dollars a month, if not tens of thousands of value.
On the other hand, the providers of these solutions literally can't take money from their best customers.
What I mean by that is it's so expensive to provide this volume of tokens that agents need,
which is anywhere from 100 to 10,000 times, you know, not percent, but X, more tokens than simple chat sessions.
And it's so expensive to provide these tokens that there's this very prominent problem of throttling and rate limits.
So you get these weird five-hour windows
where you have to optimize
just how many questions your agent
asks of a back-end model
because then the door shuts
and you can't do any more work
with that account at least
for the next four hours
if you get it wrong.
So we're getting into this weird situation
where the inefficiencies of inference
as I talked about earlier on
are becoming really transparent
and we need to introduce efficiencies
like assembly lines to AI factories
which is crazy, right?
When you and I think of a factory
today, there's no way we don't just imagine an assembly line implicitly in that.
We can't think of a factory without that.
The harsh reality of AI inference today is there are no assembly lines.
It's a really primitive process of moving data back and forth.
And again, this re-prefilling over and over, because you run out of this KV cache very quickly,
often in minutes for busy agents.
And then you're back to this expensive process.
And by the way, it takes tens and tens of kilowatts every time you reprefill this.
So we don't have assembly lines yet, and that's one of the key things we're looking forward to: with technologies like Augmented Memory Grid and yours, we're able to actually add streamlined processing and assembly lines to this prefill and decode phase of inference, drop the actual price of high-volume tokens by another 100x to 10,000x, and make these things affordable now.
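As a purely conceptual toy of that assembly-line idea, the sketch below separates prefill and decode into dedicated workers connected by a shared key-value store, so neither role has to context-switch between phases. It illustrates the pattern only; it is not WEKA's Augmented Memory Grid or any vendor's implementation, and the KV store here is just an in-process dict standing in for a pooled, flash-backed cache.

import queue
import threading

kv_store = {}                 # stand-in for a pooled, flash-backed KV cache
prefill_q = queue.Queue()
decode_q = queue.Queue()

def prefill_worker():
    # Compute-heavy phase: turn prompts into KV cache, then hand off.
    while True:
        item = prefill_q.get()
        if item is None:
            decode_q.put(None)            # propagate shutdown once all work is handed off
            break
        req_id, prompt = item
        kv_store[req_id] = f"kv({prompt})"
        decode_q.put(req_id)

def decode_worker():
    # Bandwidth-heavy phase: stream tokens out of the prepared KV cache.
    while True:
        req_id = decode_q.get()
        if req_id is None:
            break
        print(f"request {req_id}: decoding with {kv_store[req_id]}")

threads = [threading.Thread(target=prefill_worker), threading.Thread(target=decode_worker)]
for t in threads:
    t.start()
for i, prompt in enumerate(["translate this sentence", "summarize that report"]):
    prefill_q.put((i, prompt))
prefill_q.put(None)                       # signal that no more requests are coming
for t in threads:
    t.join()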
Now, you know, one thing that comes to mind is scale. And maybe I'm influenced because I was at the San Jose AI conference yesterday,
and I saw WEKA ads everywhere telling me about your performance at scale now,
but I want to turn this to Scott. Throughput at scale is really difficult,
and we're talking about thousands of GPUs.
How do you design the SSD technology so that it can work en masse
to deliver this throughput at scale to feed these GPUs?
Is there some secret sauce there, or how do you approach that?
It's a great opportunity to talk about that, right?
because there's the aspect of what we can do,
but also what we need the industry to do,
because one of the other challenges is
there's no single source
for every solution.
And so when you look at it from that perspective,
a perfect example is when you're talking about
very fast, close-to-the-GPU throughput:
we designed, or co-designed from a hardware perspective
with Nvidia, an ability to do liquid cooling of our drive,
allowing it to be more effective at its throughput
in an environment where we're now removing infrastructure
which puts power back into the system to allow the GPU to operate at a more optimal level.
So we all talk about reducing power and that kind of stuff.
We're never going to reduce the power, but we can certainly optimize the use of the power.
So locally to that GPU, we've liquid cooled the drive,
and we've introduced that product, and it's going to be shipping soon in some of those new platforms.
But if it's not there, then I've got this network attached box.
And to the bottleneck comment that Val made, our solution can be optimized to allow the right level
of write and read performance. You can actually tailor the controller architecture that we provide
through firmware hooks we offer to our customers to let them pick how fast and at what power
envelope they want to run their products. And you can do that across the family of solutions
that we provide. So it's a unique new opportunity to allow you to tailor the throughput at the
scale you need while keeping the infrastructure at the right power levels. And that's something
that we haven't really had to look at in the past because it was run them as fast as you possibly
can, but we know that that's not really what everyone means. They need the system to run at the
most optimal level possible. And I love the concept of the assembly line. As soon as you said that,
it went in my head and I started literally watching data bits flowing through the assembly line
of the AI factory. Yeah, it's not a physical infrastructure, but how that bit gets from a stored
bit on a NAND device to a GPU and back as necessary, and the whole path it has to take.
One of the most critical challenges is the downtime and eliminating the downtime.
I have a niece who works at a factory, and she shut the factory floor down for 30 seconds
to institute something new and forgot to tell her boss, and the boss got so mad at her
because the system was down for 30 seconds out of the whole day.
And we're even more critical of that in this kind of an infrastructure.
So being able to make sure our products are highly reliable as well and the quality is there
is paramount.
And that's one of the things that Solidigm is best known for in that regard.
Now, you've given us a lot of good insight into what the current challenges are, right,
and how WEKA is tackling them today.
But looking ahead, where do you see the next bottlenecks really emerging in scaled inference?
And how is Weka prepared to solve them?
This ties back to the axon topic earlier on.
So many things about GPU computing are just so fundamentally different than CPU computing.
A main one is when you take a look at a standard motherboard for a GPU server,
It has actually two prominent networks, not one.
It has that traditional network that's now referred to as the North-South network,
which is high-performance.
It's up to 400 gigabits very often, which in the world of regular CPU computing
is insane, excessive performance and bandwidth.
But it also has a secondary network, which Jensen increasingly refers to as the heart
of the rack-scale systems that Nvidia sells now.
They focus less on the chips and more on the racks, like an NVL72 rack and so forth.
And that is a compute network, also referred to as an east-west network.
And there's 16 times, one-six times more bandwidth on that east-west compute network than there is on the north-south network.
So the ability to actually address that compute network as well as the north-south storage network is critical towards unlocking some of the bottlenecks here.
And it's actually on the critical path to being able to take Solidigm drives and make them deliver memory value, which GPUs are so
hungry for, for this KV cache working memory we keep referring to.
That's one of the essential bottlenecks here.
And there's a lot of industry movement in this regard, not just at the sort of firmware
or low-level protocol stage, but very much now in the open source community, which is
encouraging to see, because that's where all these cool models we read about, like DeepSeek
and others, that's where the rubber meets the road, and the open source inference servers
take the DeepSeeks of the world and actually enable way better performance at a way lower
token cost by being able to address more and more storage as memory, for a larger working memory,
effectively.
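As a back-of-the-envelope illustration of why that east-west fabric matters when serving flash as memory: the 400 Gb/s and 16x figures come from the discussion above, while the per-drive read rate is an assumed round number for a current PCIe Gen5 SSD, not a Solidigm spec.

GBIT_BYTES = 1e9 / 8                        # bytes per gigabit

north_south = 400 * GBIT_BYTES              # ~400 Gb/s storage (north-south) network
east_west   = 16 * north_south              # ~16x more on the compute (east-west) fabric
ssd_read    = 12e9                          # assumed ~12 GB/s sequential read per SSD

for name, link in [("north-south", north_south), ("east-west", east_west)]:
    drives = link / ssd_read
    print(f"{name}: ~{link/1e9:.0f} GB/s, roughly {drives:.0f} SSDs of aggregate read bandwidth to fill it")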
Val and Scott, this has been an awesome conversation.
I want to ask you to sum up. We've talked about all of the technology and what it's capable of.
We've talked about how enterprises are really starting to throttle up their utilization of
AI.
If you were going to talk to an IT practitioner today about the one thing that they need to be
thinking about when they are navigating the tradeoffs between performance, cost, long-term
productivity, what would you tell them to pay attention to now? Scott, do you want to take that
one first? Yeah, I'll start. I'll say that the one thing that tends to still be an existing
problem in our infrastructure today is the hardware guy is not talking to the software guy.
And I think it's a prominent thing that, as we navigate forward, we need to be
more aligned within our organizations, so that whatever software needs to run has
a hardware-equivalent solution that's not overbearing, not too much, not too little.
We've gone through hyperconverge.
We've gone through disaggregated.
We've done all this stuff.
And now AI's throwing a whole new wrench into it.
And it's just making sure to optimize across both hardware and software.
Wow.
Yeah, I'll dive a little bit even deeper into that.
So again, just taking a look at the fundamental differences between a cloud data center
and an AI factory, an AI data center: a good CPU today
maybe has 100 cores; an average GPU has over 17,000 cores.
So this is so fundamentally different apples and oranges
that my advice to a lot of people,
having learned this the hard way myself over and over again
during major disruptions in tech,
is it's easier to unlearn what you think you know
about these very scalable data centers, these AI factories,
and just relearn from first principles
and relearn from a clean sheet of paper.
Because processing these distributed apps, if you will,
across 17,000 cores per single GPU,
and, God knows, you know, millions and trillions of cores in
large AI data centers, is just a fundamentally different world.
So it's better to sort of just start from scratch,
and apply your experience to how IT operations works and how budgets and cash flow work.
But when it comes to the technology designs and the optimizations and best practices,
get out there, learn, absorb.
It's never been easier to learn anything.
You can ask ChatGPT the most detailed questions,
and it'll explain to you from first principles how this works and how it's different from
what you're used to. So don't bring old biases into this new world. It's probably my best
advice. I love that. Scott and Val, it's been a real hoot. I always want to have you guys back.
While we were talking, you know, one thing that I was thinking about is that our audience is going to
want to keep the conversation going between now and the next episode that you might appear on.
Where can they find information about the solutions we talked about today and engage with you directly?
Sure. Well, for WEKA, it's weka.com. I always Google either, you know, my name or the names of some key folks at WEKA,
like one of our product managers for this Augmented Memory Grid product. His name is Calan, C-A-L-A-N, Fox.
So Google "WEKA Val blog" and "WEKA Calan blog," and all of a sudden some really cool blogs and videos will tend to come up.
Yep. And on our side, solidigm.com/ai
is a great place to start for AI-related activities. You'll find collaborations we've done
with WEKA, as well as how you can deploy different products in different ways. So it's a great
opportunity to look for it. If you want to find me, I'm SMShadley on almost every social platform.
Awesome. Thank you so much, guys. This has been a real pleasure. And with that, we're wrapping
another episode of Data Insights. Jeneice, always so much fun to host with you. Thanks so much to all of you.
Thank you, Allyson. Thanks.
Thanks for joining Tech Arena.
Subscribe and engage at our website,
techarena.ai.
All content is copyright by Tech Arena.
