In The Arena by TechArena - Revolutionizing AI Workloads with Advanced Data Center Solutions

Episode Date: November 18, 2024

Discover how AI is transforming data centers, with innovations in high-speed networking, emulation, and hyperscale infrastructure driving efficiency and performance in the era of AI workloads.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome in the arena. My name is Allyson Klein, and we're coming to you from OCP Summit in San Jose, and I'm delighted to be with Ankur Sheikh and Alex Bordek of Keysight.
Starting point is 00:00:33 Welcome to the program, guys. Thank you, Allyson. Good morning, and happy to be here. So why don't we just get started with introductions and your roles at Keysight. Sure. My name is Ankur Sheikh. I'm Director of Engineering in the Network Test Group of Keysight Technologies. I am Alex Bordek. I'm a Lead Product Manager for Hyperscale Infrastructure.
Starting point is 00:00:57 So this is a fantastic week at OCP Summit. We've seen over 7,000 people at the show, and everyone is talking about AI. When you look at how artificial intelligence is reshaping industries, how do you see AI actually impacting the data center, and what does that mean in terms of infrastructure? Okay. So AI, and particularly generative AI and large language models, has placed tremendous new demands on the data center infrastructure that we have. The infrastructure that was built to serve your Instagram and Twitter and so on is just not enough, in terms of the compute and in terms of the network, to do the large language model training and inference jobs that we are
Starting point is 00:01:56 seeing being used. So we're seeing servers with massive amounts of compute and massive amounts of networking being deployed very rapidly. In all of this, power is, of course, a big concern. Data centers have a limited amount of power to work with. So it's a constant battle across the entire industry ecosystem to get more compute and more networking with lower power. Now, networking and benchmarking are becoming so important in the way that this is delivered. Why don't we talk a little bit about
Starting point is 00:02:36 what Keysight is doing in terms of solutions in this space, and a little bit of background on Keysight in the test and measurement space. All right. So Keysight traces its roots to Hewlett-Packard. HP spun off Agilent in the late '90s, and in 2014 Keysight was spun off from Agilent as an independent company. We joined Keysight through the Ixia acquisition in 2017. And so our contribution is to allow our customers, who are network equipment manufacturers, hyperscale operators, and silicon vendors,
Starting point is 00:03:25 to create solutions that deliver better performance from their systems while keeping the scale reasonable. Now, I know that it's a journey to deliver a product like this to market. Can you talk a little bit about what that process is like and why Keysight is at the advanced edge of this technology?
Starting point is 00:03:49 So Keysight, and specifically the Ixia business, which, as Alex said, is part of us, has been in the network test business for a quarter of a century, right? But very early on, we realized that the kind of demands AI/ML places on networking is very different from traditional network testing. So the product I'm showcasing here at OCP is essentially the tip of the iceberg, I would say. This is something we've been working on for over three years. And it starts with really understanding the applications, what kind of demands they place on the network, how the work is distributed, and how we can replicate that in our solution at scale in a cost-effective manner for our customers. So it's been a long journey, but we are just getting started. Now, one aspect of the solution that caught my eye is that you're
Starting point is 00:04:55 able to emulate AI traffic without actual physical GPUs. Why did you make that call, and why is it such a game changer? Right. Well, there are several reasons here. One is access. It's really hard for a lot of engineers, networking people in particular, to get their hands on GPUs, and a real cluster requires learning the full software stack that makes the real system work. Network engineers have to focus on their part of the puzzle, and we are building a solution that specifically targets their workflow and makes it much easier to operate. And we are looking at organizations that want to scale AI infrastructure, and I think a lot of them are here at OCP. How does the platform help mitigate some of the risks in that scaling for them?
Starting point is 00:06:08 So the platform is a way for them to experiment and make more informed decisions before they actually deploy at scale and see what they get. So this is a way for them to really get a good understanding of what they're going to see when they deploy it.
Starting point is 00:06:44 That's so cool. I know that there are a lot of network topologies out there and a lot of different approaches in terms of how different operators are deploying their networks. How does your solution ensure flexibility across these different environments? Well, by design, our solution isn't tied to any specific topology or set of components. The way we develop the solution is we get in touch with our customers who have particular design objectives in mind, and quite often we co-design the solution with them. In the end, what we get is something that is adaptable to a variety of situations. So it's really configurable, and it's not tied to any specific design choice. Now, let's get into the actual AI data centers for a second. You know, LLMs require extremely vast and efficient data movement. Where are you seeing traffic bottlenecks in real-world
Starting point is 00:07:32 environments, and how does your platform help address them? So traffic bottlenecks show up in a couple of different places. One is there's just not enough bandwidth, right? The GPUs are getting faster day by day. And second, the AI/ML data movements on these networks are very different. They end up being sometimes very synchronized with each other. And so you end up with these patterns where everybody is trying to send to one guy at the same time, and no matter how big a pipe you have, it's not going to be enough, right? Because you have N people trying to send to one. And so what is needed is newer, smarter algorithms that schedule packets and mitigate these
Starting point is 00:08:27 congestion scenarios, and so on. And that's where our solution helps customers do a lot of these what-if experiments, to figure out what works better for different models, because every model is different. Whether you're training a Llama 3 versus a GPT-4, the network patterns are going to be different. And we have a way and a solution to replay those same training jobs.
Starting point is 00:09:09 With that variation, one follow-up question that I have is that AI and LLMs are moving very fast in terms of their definition and evolution. How does your tool keep pace with this rate of change? Very often, the companies who are introducing some new, innovative ways to approach the problem get in touch with us early in the cycle. So we are aware of what's going on, and we are able to co-develop our solution alongside that innovation, so that we can arrive at the same point and allow them to test their components the way they intend to use them. So we work very closely with a couple of hyperscalers, right? And as Alex was just talking about, just yesterday we were in a meeting where a hyperscaler is looking to redesign their entire data center. How they're going about it is pretty novel.
Starting point is 00:10:11 I found it very unique. But they're thinking of doing something eight, nine months down the line, and they're talking to us right now, because they want us and their ecosystem to help them get there in eight, nine months. So this is where, as Alex was just saying, co-design comes in. This is what we really mean by co-design, right? They come up with, hey, let's go and design this part. They send it out to all their partners, like us, get that installed,
Starting point is 00:10:40 move on to the next piece. And eventually, six months, eight months down the line, they'll have a brand new design. Amazing. Now, when you look across performance, cost savings, and time to insight, with this customer as a perfect example, what are the vectors where you deliver benefits back to the customer? The biggest benefit that our customers really appreciate is how repeatable our testing process is. They find it accelerates the work they do
Starting point is 00:11:18 because they have less fluctuation where they don't need it, and they can really focus on improving the thing they're working on right now. So that's one. The second one is, as we mentioned already, it's much easier to run the system, so more people can just jump on it and start using it. And finally, even in the lab,
Starting point is 00:11:45 people run into the same problems with space and power. When you have to deploy really large-scale systems, you can imagine that there isn't much lab space. It really helps them to save that space for the equipment they need to test rather than the test gear. The test tool is very high-density and doesn't take up much space or power. Now, this is obviously cutting-edge technology, otherwise the hyperscalers wouldn't be tapping you for it. How does this compare to some of the standard solutions? So the traditional test tools we talked about earlier are really geared towards testing
Starting point is 00:12:27 scenarios which were really defined, I would say, two decades back, in the early days of the internet. You had a certain amount of web traffic, a certain amount of video traffic, and so on and so forth. And it was predictable, but the traffic also had a lot of variation, because it was coming from a thousand different users. But here, the traffic really comes from one GPU to another GPU. There's not a lot of variation.
Starting point is 00:13:00 There's a lot of synchronization, as I said earlier. And so the network behaves very differently when it is faced with traffic like that, versus something that is very well balanced out. And so the KPIs change. One of the things that people really look for now is tail latency. In the olden days, a vendor could say, hey, here is the average latency that my device offers; when you build your network with my device,
Starting point is 00:13:33 here is the average latency you'll get. Average latency is not that useful anymore. What matters is the tail latency: when did the last guy complete in this scenario? Because only after the last guy completes can the entire pipeline go on to the next batch, the next job, right? So the slowest guy determines how fast you run. Yeah. And the slowest guy determines how much time your GPUs are waiting idle.
Starting point is 00:14:00 Now, I know you've been in this space for a whole decade. We've seen this massive transformation, and I know that you are ahead of the curve. So how are you going to stay ahead of the curve for the next wave of AI? Well, if you look at what's coming right now from the network perspective, people are ready to jump into 800-gigabit networks. They're waiting to get new network cards that are capable of that speed, and the internal server PCIe bandwidth for it. And beyond that, people are looking at 1.6 terabit, et cetera.
Starting point is 00:14:42 So that's one for sure: we always release test products for those speeds earlier than the real systems and servers capable of them arrive. But the other thing is, I think one of the critical things that limits the ability to really scale out AI systems is the rate of failure of components. The moment you get to the scale of 100,000 GPUs, the amount of failures on the network on a continuous basis is astonishing. And solving that problem is a tough one. And this is something that we are really focusing on right now.
Starting point is 00:15:33 Guys, it was a pleasure talking to you. One final question for you: where can folks find out about the solution that we talked about today? Right, so if you Google our product, AI Data Center Test Platform, you're going to end up on our product page, and we have some material out there for you. And I'm Alex Bordek; you can find me on LinkedIn. Same with me; I'm Ankur Sheikh on LinkedIn.
Starting point is 00:16:01 Well, thank you so much for the time today, guys. It was a real pleasure, and I'm really excited about the technology that you're delivering. I think it's critical for being able to get AI up and running at scale. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by The Tech Arena.
