SemiWiki.com - Podcast EP308: How Clockwork Optimizes AI Clusters with Dan Zheng
Episode Date: September 26, 2025. Daniel is joined by Dan Zheng, VP of Partnerships and Operations at Clockwork. Dan was the General Manager for Product and Partnerships at Urban Engines, which was acquired by Google in 2016. He has also held roles at Stanford University and Google. Dan explores the challenges of operating massive AI hardware infrastructure at …
Transcript
Hello, my name is Daniel Nenni, founder of SemiWiki, the open forum for semiconductor professionals.
Welcome to the Semiconductor Insiders podcast series.
My guest today is Dan Zheng, VP of Partnerships and Operations at Clockwork.
Dan was the general manager for product and partnerships at Urban Engines, which was acquired by Google in 2016.
Dan has also held roles at Stanford University and Google.
Welcome to the podcast, Dan.
Thank you, Daniel.
Great to be here today.
First, may I ask, what brought you to Clockwork?
I'm one of the founding members of Clockwork.
I've been there since day one.
At the time, we were doing research work at Stanford University.
The broader research topic is self-programming networks.
How do we build a network that can sense, infer, and control itself?
Oh, interesting.
So despite all the money going into GPUs, most AI clusters still only run at a fraction of their potential, you know, from what I'm told.
Why is that?
And what does it mean for companies trying to train and deploy AI at a massive scale?
Now, operating modern large-scale GPU clusters efficiently is very, very challenging, right?
So if you look at the GPU cluster itself, it is very complex. Each GPU server has GPUs, CPUs, a front-end network that handles
north-south traffic, your application traffic. There's also a back-end network that handles
inter-GPU communication. There could also be a separate storage network. Now, operators have
a very siloed view of the infrastructure. For example, you use DCGM, the Data Center GPU Manager,
for GPUs.
You use Unified Fabric Manager for networking metrics.
But when a job is running slow or when it stalls,
it's very difficult to pinpoint what the problem is.
You can spend hours debugging or finger pointing.
Is it a compute problem?
Is it a networking problem?
Or is it a storage problem?
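(To make the finger-pointing problem concrete, here is a minimal, purely illustrative Python sketch of correlating signals from the different silos on one timeline. The metric names, thresholds, and data shapes are assumptions for illustration only, not DCGM or Unified Fabric Manager APIs.)

```python
from dataclasses import dataclass

@dataclass
class Sample:
    t: float       # timestamp in seconds
    source: str    # which silo it came from: "gpu", "network", or "storage"
    metric: str    # e.g. "sm_util", "link_errors", "read_latency_ms" (illustrative names)
    value: float

def suspect_silos(samples, window_start, window_end):
    """Flag silos whose metrics look unhealthy during a slow or stalled window."""
    checks = {
        "sm_util": lambda v: v < 20.0,          # GPUs mostly idle
        "link_errors": lambda v: v > 0,         # any link errors at all
        "read_latency_ms": lambda v: v > 50.0,  # slow storage reads
    }
    flagged = set()
    for s in samples:
        if window_start <= s.t <= window_end and s.metric in checks and checks[s.metric](s.value):
            flagged.add(s.source)
    return flagged or {"unknown"}

# Example: link errors while GPUs stay busy and storage looks fine
# points the finger at the network silo.
data = [
    Sample(10.0, "gpu", "sm_util", 85.0),
    Sample(10.5, "network", "link_errors", 3.0),
    Sample(11.0, "storage", "read_latency_ms", 8.0),
]
print(suspect_silos(data, 9.0, 12.0))  # {'network'}
```

(The point is simply that once GPU, network, and storage telemetry share a timeline, a link flap during a stall stops looking like a compute problem.)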
Even companies like Meta run into issues.
In the Llama 3 paper, over a period of 54 days,
there were 400-plus interruptions.
This can be GPU errors known as GPU falling off the bus,
a network link flap, or a memory error.
Now for centralized AI infrastructure teams
across large enterprises and AI builders,
there's so much demand for the GPU infrastructure.
They're constantly juggling hundreds of job requests.
When something goes down, jobs get rescheduled and pushed to the back of a long queue.
This creates delays and slows down innovation.
So all in all, for companies racing to train and deploy AI at scale,
this type of inefficiency directly translates into slower time to market and higher operating costs.
So how does solving for AI infrastructure challenges help AI and cloud leaders maximize
the chips and hardware that they already have and achieve more with the same infrastructure?
I mean, that's going to be important.
You know, is there a world where you need to buy less hardware or maybe your hardware
lasts longer due to, you know, other capabilities?
You know, what impact could this have on the AI market for compute?
Yeah, that's a very good question, Daniel.
So essentially, if you make the infrastructure more efficient, you can run more jobs on the same GPUs,
so you can do more with less, right?
So here's an example, looking at the efficiency gap today.
Say you operate a modest-sized GPU cluster of 1,000 GPUs.
So on that cluster, you'll have about 3,000 fiber optical links.
Each link has two optical transceivers.
So there's a total of about 9,000 network components.
Okay.
Now, even though each of these network components
is well engineered and rated to have a mean time to failure of, say, three years,
a single network link flap of 20 seconds
can bring down the entire job, right?
So as a result, you have to restart
from the previous checkpoint.
Now, folks usually do checkpointing,
perhaps two to three times a day.
So the previous checkpoint in this example
can be five hours ago.
So you have to redo five hours of work,
which works out to something like 10 hours per GPU by the time the job is back up and running.
That easily translates into 10,000 GPU hours wasted,
just for one interruption.
Having worked with many customers,
for a cluster of this size,
experiencing three network failures per week is not uncommon.
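(A quick back-of-envelope in Python, using only the illustrative figures quoted above rather than measured data:)

```python
# Back-of-envelope on the example above; all values are the illustrative
# figures quoted in the conversation, not measurements.
gpus = 1_000
links = 3_000                      # fiber-optic links in a ~1,000-GPU cluster
transceivers = 2 * links           # two optical transceivers per link
components = links + transceivers
print(components)                  # 9000 network components that can fail

lost_hours_per_gpu = 10            # roughly 5 h of redone work plus the rest of the disruption
wasted_gpu_hours = gpus * lost_hours_per_gpu
print(wasted_gpu_hours)            # 10000 GPU hours for a single interruption

failures_per_week = 3              # not uncommon at this scale, per the discussion
print(failures_per_week * wasted_gpu_hours)  # 30000 GPU hours of exposure per week
```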
FleetIQ delivers stateful fault tolerance,
so it allows jobs to continue without disruption,
despite hardware hiccups and failures.
This means companies can effectively do more with less.
Now, if the total addressable market for AI compute is finite,
then something like FleetIQ can extend the lifespan
of current infrastructure and delay the need
for additional purchases.
But we know AI is a good example of Jevons paradox.
As it gets cheaper, faster, and easier,
we create more usage.
We make more use of it.
So the total consumption actually goes up, right?
So in the foreseeable future,
I believe GPU demand will continue to rise,
but by making the infrastructure more efficient,
it improves ROI.
So instead of rationing GPU resources
in a top-down fashion,
it really helps to democratize access to GPUs,
allowing anyone with good ideas to try those new ideas out on the infrastructure.
Oh, got it. So people often say Moore's law is slowing down. In the context of AI, why are so
many people now saying communication, not compute, is the real bottleneck? And I hear that a lot,
by the way. And so how should that change the way we think about performance?
Yeah, very good question. So in AI, the challenge isn't just
the raw power of individual GPUs,
but how effectively they can communicate and work together.
So in fact, we like to say communication is the new Moore's law.
Now, to understand this, we need to dive a bit deeper
into the AI workloads themselves, right?
AI workloads are fundamentally different
from your traditional applications or cloud applications.
They're highly synchronized and they run in lockstep.
It's sort of like conducting the world's largest orchestra.
Everything has to be in sync and perfect.
To give you an example, for training, for each iteration,
there's lots of compute done on the GPUs.
Then partial results are shared with all the other GPUs.
Now, when you have a straggler, everyone is waiting for you.
When a link fails, it can bring down the entire job.
When there's contention or congestion in the fabric,
throughput will drop.
AI clusters in general are getting bigger and bigger,
so communication efficiency is becoming more important.
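(A tiny Python sketch, with made-up timings rather than a real training loop, of why lockstep execution makes a single straggler so expensive:)

```python
import random

def iteration_time(num_gpus, compute_s=1.0, allreduce_s=0.2, straggler_s=0.0):
    """An iteration finishes only when the slowest GPU is done and the
    all-reduce of partial results completes: everyone waits for the laggard."""
    per_gpu = [compute_s] * num_gpus
    if straggler_s:
        per_gpu[random.randrange(num_gpus)] += straggler_s
    return max(per_gpu) + allreduce_s

print(iteration_time(1024))                   # 1.2 s on a healthy cluster
print(iteration_time(1024, straggler_s=3.0))  # 4.2 s: one slow GPU stalls all 1,024
```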
If you look at an older model like Llama 3,
it was trained on 16,000 GPUs.
Frontier models today are trained on 100,000 or more GPUs.
Same for inferencing too, so it's not just for training.
With the proliferation of reasoning models
and mixture-of-experts models, you're running your model on multiple GPU nodes now,
and you have to use the KV cache extensively.
So communication becomes more and more important.
Now, according to an AMD paper, communication can easily make up 30 to 50% of the total job completion time.
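(Simple arithmetic on that 30-50% figure, sketched in Python as an Amdahl's-law-style bound with assumed fractions, shows why faster GPUs alone stop helping:)

```python
def max_speedup_from_faster_compute(comm_fraction):
    """If a comm_fraction share of job time is communication, then even
    infinitely fast GPUs leave that time untouched (Amdahl's-law-style bound)."""
    return 1.0 / comm_fraction

for frac in (0.30, 0.50):
    cap = max_speedup_from_faster_compute(frac)
    print(f"communication at {frac:.0%} of job time -> compute-only speedup capped at {cap:.1f}x")
# 30% -> 3.3x, 50% -> 2.0x
```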
Now, rethinking performance through this lens, measuring success is not about the number of GPUs you can deploy, but rather about how well the infrastructure can harness them collectively. As a quick analogy, if you look at cars, it's not about the raw horsepower of a car. It's the useful work that it delivers, the distance you can travel, the real-world MPG.
Okay. So if the next big speed bumps won't come from just adding more GPUs but from improving how they talk to each other, what will this unlock over the next decade? You know, what innovations in interconnects, fabrics, or software do you see shaping the future of AI infrastructure?
So we really see software that will be able to stitch together different generations of GPUs from different vendors. So you can imagine, you know,
I made investments into NVIDIA A100s and H100s, I'm getting the Blackwell chips,
I'm also getting AMD MI325s.
The software should be able to stitch them together
in a way that you can get the most use out of all the GPUs
that you have invested in.
The other sort of challenge has been, you know,
in terms of capacity engineering and capacity allocation,
oftentimes you have stranded GPUs or orphaned GPUs.
So this software should be able to allow organizations
to tap into previously wasted capacities.
Now, more importantly, I think
it should democratize access to AI compute,
enabling more jobs to be run by more teams,
and really accelerate innovation across industries.
There's a big difference between what we call
GPU poor and GPU rich.
So this will help to close the gap
between the GPU poor and the GPU rich.
We have also seen infrastructure evolve from a single GPU server to large clusters of
GPU servers and whole data centers. Folks are building gigawatt campuses and edge compute
to be closer to users. So scale-out networking will become more and more important.
Not only do you have to connect GPU servers together, you also have to connect to existing
traditional CPU servers in your existing data center
so that you can embed AI capabilities everywhere
in your enterprise applications.
The other thing that will happen is the proliferation of AI agents
and physical AI and robots;
they will become a normal part of our daily life, right?
So things like orchestration, intelligent scheduling,
security, trust, and resilience engineering
will also become very important in the coming years.
Oh, okay. So AI systems are becoming more diverse, you know,
different GPUs, NICs, networks and software stacks all in play. How big of a challenge is
this heterogeneity and what does it actually take to make infrastructure run anywhere?
Yeah, so heterogeneity is both a challenge and an opportunity, right? It does complicate
infrastructure management, but it also enables organizations to avoid lock-in and optimize
ROI. In one sense, that's not new for most enterprises. If you look at cloud computing,
most run on hybrid cloud or multi-cloud. They invest resources to make their workloads
portable. Now, if you look at AI, your training workloads may still favor a vertically
integrated stack; you can use all NVIDIA.
But increasingly, inferencing is all about performance per unit cost, right?
So that makes a multi-vendor strategy a lot more appealing.
Now, looking at making infrastructure run anywhere, you really need to have an abstraction layer
allowing for a hybrid and vendor-agnostic environment.
And, you know, if you look at things like VMware, they pioneered virtual machines on CPU compute; we run our apps on virtual machines or on Kubernetes.
So this type of flexibility is also needed on the GPU side.
And it's critical as enterprises look to embed AI more deeply into different applications while keeping costs low.
Okay, a different topic.
There's been an increased focus on AI sovereignty as a strategic priority.
So how do infrastructure choices like resiliency and vendor-neutral fabrics, you know,
support that goal?
And what will it take to achieve true AI independence?
Right.
So if you look at AI sovereignty, it is really about the three Cs, right?
Choice, control, and compliance.
Choice of AI models, choice of infrastructure without lock-in; control in terms of national security,
fostering innovation, and economic competitiveness. Now, vendor-neutral fabrics give them
choice, right? You can mix and match components, you can maintain your independence, you're not
tied to a single supplier. Now, a resilient fabric gives you direct control
of infrastructure availability and efficiency.
An AI factory or GPU data center is a very expensive investment,
so you really want to run it nonstop, 24/7.
As an example, DCAI, the Danish Center for Artificial Intelligence,
one of our customers, their mission is really to serve not only industry
but also researchers and startups effectively, right?
And they need to deliver performance,
reliability, and efficiency at scale.
So last question, Dan.
From a semiconductor and interconnect perspective,
how important is open vendor neutral infrastructure
in enabling independence?
And what role do chip makers play in making that viable?
Very good question.
So many people describe AI as the new electricity, right?
So if you think of running AI infrastructure
as running a large public transport system
in a large city like Tokyo, London, or Singapore,
the goal there is to move people as efficiently as possible.
So you have to operate different modes of transportation,
buses, trains, light rail, right?
You deal with failures and spikes in traffic every day, and having an open standard allows everyone to serve different needs and work really well together.
Now, for AI infrastructure and the semiconductor industry, open standards allow chip makers to innovate and serve different needs and niches effectively.
Altogether, you can push AI infrastructure and efficiency forward,
so that we can really take advantage of the capabilities of AI in our day-to-day lives.
Great conversation. Thank you, Dan.
Thanks, Daniel.
That concludes our podcast. Thank you all for listening and have a great day.
Thank you.