SemiWiki.com - Podcast EP308: How Clockwork Optimizes AI Clusters with Dan Zheng
Episode Date: September 26, 2025. Daniel is joined by Dan Zheng, VP of Partnerships and Operations at Clockwork. Dan was the General Manager for Product and Partnerships at Urban Engines, which was acquired by Google in 2016. He has also held roles at Stanford University and Google. Dan explores the challenges of operating massive AI hardware infrastructure at …
Transcript
Hello, my name is Daniel Nenni, founder of SemiWiki, the open forum for semiconductor professionals.
Welcome to the Semiconductor Insiders podcast series.
My guest today is Dan Zheng, VP of Partnerships and Operations at Clockwork.
Dan was the general manager for product and partnerships at Urban Engines, which was acquired by Google in 2016.
Dan has also held roles at Stanford University and Google.
Welcome to the podcast, Dan.
Thank you, Daniel.
Great to be here today.
First, may I ask, what brought you to Clockwork?
I'm one of the founding members of Clockwork.
I've been there since day one.
At the time, we were doing research work at Stanford University.
The broader research topic is self-programming networks.
How do we build a network that can sense, infer, and control itself?
Oh, interesting.
So despite all the money going into GPUs, most AI clusters still only run at a fraction of their potential, you know, from what I'm told.
Why is that?
And what does it mean for companies trying to train and deploy AI at a massive scale?
Now, operating modern large-scale GPU clusters efficiently is very, very challenging, right?
So if you look at the GPU cluster itself, it is very complex. Each GPU server has GPUs, CPUs, a front-end network that handles
north-south traffic, your application traffic. There's also a back-end network that handles
inter-GPU communication. There could also be a separate storage network. Now, operators have
a very siloed view of the infrastructure. For example, you use DCGM, the Data Center GPU Manager,
for GPUs.
You use Unified Fabric Manager for networking metrics.
But when a job is running slow or when it stalls,
it's very difficult to pinpoint what the problem is.
You can spend hours debugging or finger pointing.
Is it a compute problem?
Is it a networking problem?
Or is it a storage problem?
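(To make the finger-pointing problem concrete, here is a minimal, purely illustrative Python sketch of correlating signals from the different silos on one timeline. The metric names, thresholds, and data shapes are assumptions for illustration only, not DCGM or Unified Fabric Manager APIs.)

```python
from dataclasses import dataclass

@dataclass
class Sample:
    t: float       # timestamp in seconds
    source: str    # which silo it came from: "gpu", "network", or "storage"
    metric: str    # e.g. "sm_util", "link_errors", "read_latency_ms" (illustrative names)
    value: float

def suspect_silos(samples, window_start, window_end):
    """Flag silos whose metrics look unhealthy during a slow or stalled window."""
    checks = {
        "sm_util": lambda v: v < 20.0,          # GPUs mostly idle
        "link_errors": lambda v: v > 0,         # any link errors at all
        "read_latency_ms": lambda v: v > 50.0,  # slow storage reads
    }
    flagged = set()
    for s in samples:
        if window_start <= s.t <= window_end and s.metric in checks and checks[s.metric](s.value):
            flagged.add(s.source)
    return flagged or {"unknown"}

# Example: link errors while GPUs stay busy and storage looks fine
# points the finger at the network silo.
data = [
    Sample(10.0, "gpu", "sm_util", 85.0),
    Sample(10.5, "network", "link_errors", 3.0),
    Sample(11.0, "storage", "read_latency_ms", 8.0),
]
print(suspect_silos(data, 9.0, 12.0))  # {'network'}
```

(The point is simply that once GPU, network, and storage telemetry share a timeline, a link flap during a stall stops looking like a compute problem.)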
Even companies like Meta run into issues.
In the Llama 3 paper, over a period of 54 days,
there were 400-plus interruptions.
This can be GPU errors known as GPU falling off the bus,
a network link flap, or a memory error.
Now for centralized AI infrastructure teams
across large enterprises and AI builders,
there's so much demand for the GPU infrastructure.
They're constantly juggling hundreds of job requests.
When something goes down, jobs get rescheduled and pushed to the back of a long queue.
This creates delays and slows down innovation.
So all in all, for companies racing to train and deploy AI at scale,
this type of inefficiency directly translates into slower time to market and higher operating costs.
So how does solving for AI infrastructure challenges help AI and cloud leaders maximize
the chips and hardware that they already have and achieve more with the same infrastructure?
I mean, that's going to be important.
You know, is there a world where you need to buy less hardware or maybe your hardware
lasts longer due to, you know, other capabilities?
You know, what impact could this have on the AI market for compute?
Yeah, that's a very good question, Daniel.
So essentially, if you make the infrastructure more efficient, you can run more jobs on the same GPUs,
so you can do more with less, right?
So here's an example, looking at the efficiency gap today.
Say you operate a modest-sized GPU cluster of 1,000 GPUs.
So on that cluster, you'll have about 3,000 fiber optical links.
Each link has two optical transceivers.
So there's a total of about 9,000 network components.
Okay.
Now, even though each of these network components
is well engineered and rated to have a mean time to failure of, say, three years,
a single network link flap of 20 seconds
can bring down the entire job, right?
So as a result, you have to restart
from the previous checkpoint.
Now, folks usually do checkpointing,
perhaps two to three times a day.
So the previous checkpoint in this example
can be five hours ago.
So you have to redo five hours of work,
which works out to something like 10 hours per GPU by the time the job is back up and running.
That easily translates into 10,000 GPU hours wasted,
just for one interruption.
Having worked with many customers,
for a cluster of this size,
experiencing three network failures per week is not uncommon.
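(A quick back-of-envelope in Python, using only the illustrative figures quoted above rather than measured data:)

```python
# Back-of-envelope on the example above; all values are the illustrative
# figures quoted in the conversation, not measurements.
gpus = 1_000
links = 3_000                      # fiber-optic links in a ~1,000-GPU cluster
transceivers = 2 * links           # two optical transceivers per link
components = links + transceivers
print(components)                  # 9000 network components that can fail

lost_hours_per_gpu = 10            # roughly 5 h of redone work plus the rest of the disruption
wasted_gpu_hours = gpus * lost_hours_per_gpu
print(wasted_gpu_hours)            # 10000 GPU hours for a single interruption

failures_per_week = 3              # not uncommon at this scale, per the discussion
print(failures_per_week * wasted_gpu_hours)  # 30000 GPU hours of exposure per week
```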
FleetIQ delivers stateful fault tolerance,
so it allows jobs to continue without disruption,
despite hardware hiccups and failures.
This means companies can effectively do more with less.
Now, if the total addressable market for AI compute is finite,
then something like FleetIQ can extend the lifespan
of current infrastructure and delay the need
for additional purchases.
But we know AI is a good example of Jevons paradox.
As it gets cheaper, faster, and easier,
we create more usage.
We make more use of it.
So the total consumption actually goes up, right?
So in the foreseeable future,
I believe GPU demand will continue to rise,
but by making the infrastructure more efficient,
it improves ROI.
So instead of rationing GPU resources
in a top-down fashion,
it really helps to democratize access to GPUs,
allowing anyone with good ideas to try those new ideas out on the infrastructure.
Oh, got it. So people often say Moore's law is slowing down. In the context of AI, why are so
many people now saying communication, not compute, is the real bottleneck? And I hear that a lot,
by the way. And so how should that change the way we think about performance?
Yeah, very good question. So in AI, the challenge isn't just
the raw power of individual GPUs,
but how effectively they can communicate and work together.
So in fact, we like to say communication is the new Moore's law.
Now, to understand this, we need to dive a bit deeper
into the AI workloads themselves, right?
AI workloads are fundamentally different
from your traditional applications or cloud applications.
They're highly synchronized and they run in lockstep.
It's sort of like conducting the world's largest orchestra.
Everything has to be in sync and perfect.
To give you an example, for training, for each iteration,
there's lots of compute done on the GPUs.
Then partial results are shared with all the other GPUs.
Now, when you have a straggler, everyone is waiting for you.
When a link fails, it can bring down the entire job.
When there's contention or congestion in the fabric,
throughput will drop.
AI clusters in general are getting bigger and bigger,
so communication efficiency is becoming more important.
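(A tiny Python sketch, with made-up timings rather than a real training loop, of why lockstep execution makes a single straggler so expensive:)

```python
import random

def iteration_time(num_gpus, compute_s=1.0, allreduce_s=0.2, straggler_s=0.0):
    """An iteration finishes only when the slowest GPU is done and the
    all-reduce of partial results completes: everyone waits for the laggard."""
    per_gpu = [compute_s] * num_gpus
    if straggler_s:
        per_gpu[random.randrange(num_gpus)] += straggler_s
    return max(per_gpu) + allreduce_s

print(iteration_time(1024))                   # 1.2 s on a healthy cluster
print(iteration_time(1024, straggler_s=3.0))  # 4.2 s: one slow GPU stalls all 1,024
```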
If you look at an older model like Llama 3,
it was trained on 16,000 GPUs.
Frontier models today are trained on 100,000 or more GPUs.
Same for inferencing too, so it's not just for training.
With the proliferation of reasoning models
and mixture-of-experts models, you're running your model on multiple GPU nodes now,
and you have to use the KV cache extensively.
So communication becomes more and more important.
Now, according to an AMD paper, communication can easily make up 30 to 50% of the total job completion time.
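(Simple arithmetic on that 30-50% figure, sketched in Python as an Amdahl's-law-style bound with assumed fractions, shows why faster GPUs alone stop helping:)

```python
def max_speedup_from_faster_compute(comm_fraction):
    """If a comm_fraction share of job time is communication, then even
    infinitely fast GPUs leave that time untouched (Amdahl's-law-style bound)."""
    return 1.0 / comm_fraction

for frac in (0.30, 0.50):
    cap = max_speedup_from_faster_compute(frac)
    print(f"communication at {frac:.0%} of job time -> compute-only speedup capped at {cap:.1f}x")
# 30% -> 3.3x, 50% -> 2.0x
```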
Now, rethinking performance through this lens, measuring success is not about the number of GPUs you can deploy, but rather about how well the infrastructure can harness them collectively. As a quick analogy, if you look at cars, it's not about the raw horsepower of a car. It's the useful work that it delivers, the distance you can travel, the real-world MPG.
Okay. So if the next big speed bumps won't come from just adding more GPUs but from improving how they talk to each other, what will this unlock over the next decade? You know, what innovations in interconnects, fabrics, or software do you see shaping the future of AI infrastructure?
So we really see software that will be able to stitch together different generations of GPUs from different vendors. So you can imagine, you know,
I made investments into NVIDIA A100s and H100s, I'm getting the Blackwell chips,
I'm also getting AMD MI325s.
The software should be able to stitch them together
in a way that you can get the most use out of all the GPUs
that you have invested in.
The other sort of challenge has been, you know,
in terms of capacity engineering and capacity allocation,
oftentimes you have stranded GPUs or orphaned GPUs.
So this software should be able to allow organizations
to tap into previously wasted capacities.
Now, more importantly, I think
it should democratize access to AI compute,
enabling more jobs to be run by more teams,
and really accelerate innovation across industries.
There's a big difference between what we call
GPU poor and GPU rich.
So this will help to close the gap
between the GPU poor and the GPU rich.
We have also seen infrastructure evolve from a single GPU server to large clusters of
GPU servers and whole data centers. Folks are building gigawatt campuses and edge compute
to be closer to users. So scale-out networking will become more and more important.
Not only do you have to connect GPU servers together, you also have to connect to existing
traditional CPU servers in your existing data center
so that you can embed AI capabilities everywhere
in your enterprise applications.
The other thing that will happen is the proliferation of AI agents
and physical AI and robots;
they will become a normal part of our daily life, right?
So things like orchestration, intelligent scheduling,
security, trust, and resilience engineering
will also become very important in the coming years.
Oh, okay. So AI systems are becoming more diverse, you know,
different GPUs, NICs, networks and software stacks all in play. How big of a challenge is
this heterogeneity and what does it actually take to make infrastructure run anywhere?
Yeah, so heterogeneity is both a challenge and an opportunity, right? It does complicate
infrastructure management, but it also enables organizations to avoid lock-in and optimize
ROI. In one sense, that's not new for most enterprises. If you look at cloud computing,
most run on hybrid cloud or multi-cloud. They invest resources to make their workloads
portable. Now, if you look at AI, your training workloads may still favor a vertically
integrated stack; you can use all NVIDIA.
But increasingly, inferencing is all about performance per unit cost, right?
So that makes a multi-vendor strategy a lot more appealing.
Now, looking at making infrastructure run anywhere, you really need to have an abstraction layer
allowing for a hybrid and vendor-agnostic environment.
And, you know, if you look at things like VMware, they pioneered virtual machines on CPU compute; we run our apps on virtual machines or on Kubernetes.
So this type of flexibility is also needed on the GPU side.
And it's critical as enterprises look to embed AI more deeply into different applications while keeping costs low.
Okay, a different topic.
There's been an increased focus on AI sovereignty as a strategic priority.
So how do infrastructure choices like resiliency and vendor-neutral fabrics, you know,
support that goal?
And what will it take to achieve true AI independence?
Right.
So if you look at AI sovereignty, it is really about the three Cs, right?
Choice, control, and compliance.
Choice of AI models, choice of infrastructure without lock-in; control in terms of national security,
fostering innovation, and economic competitiveness. Now, vendor-neutral fabrics give them
choice, right? You can mix and match components, you can maintain your independence, you're not
tied to a single supplier. Now, a resilient fabric gives you direct control
of infrastructure availability and efficiency.
An AI factory or GPU data center is a very expensive investment,
so you really want to run it nonstop, 24/7.
As an example, DCAI, the Danish Center for Artificial Intelligence,
one of our customers, their mission is really to serve not only industry
but also researchers and startups effectively, right?
And they need to deliver performance,
reliability, and efficiency at scale.
So last question, Dan.
From a semiconductor and interconnect perspective,
how important is open vendor neutral infrastructure
in enabling independence?
And what role do chip makers play in making that viable?
Very good question.
So many people describe AI as the new electricity, right?
So if you think of running AI infrastructure
as running a large public transport system
in a large city like Tokyo, London, or Singapore,
the goal there is to move people as efficiently as possible.
So you have to operate different modes of transportation,
buses, trains, light rail, right?
You deal with failures and spikes in traffic every day, and having an open standard allows everyone to serve different needs and work really well together.
Now, for AI infrastructure and the semiconductor industry, open standards allow chip makers to innovate and serve different needs and niches effectively.
Altogether, you can push AI infrastructure and efficiency forward,
so that we can really take advantage of the capabilities of AI in our day-to-day lives.
Great conversation. Thank you, Dan.
Thanks, Daniel.
That concludes our podcast. Thank you all for listening and have a great day.
Thank you.