In The Arena by TechArena - AMI CEO Sanjoy Maity on AI’s Five-Headed Hydra

Starting point is 00:00:00 Welcome to TechArena, featuring authentic discussions between tech's leading innovators and our host, Alison Klein. Now, let's step into the arena. Welcome in the arena. My name's Alison Klein, and today we've got a fantastic episode for you. I've got Sanjoy Maity, CEO of AMI, back in the studio with us. Sanjoy, welcome back to the program. Thank you, Alison. Good to be back. Now, let's just get started. You've been on the show before, last year, but for those of the audience that have missed our last conversation,

Starting point is 00:00:39 can you just please briefly reintroduce yourself and AMI? Sure. So my name is Sanjoy. I'm the CEO of AMI. And at AMI, what we do is we make all kinds of platform and silicon firmware, which enables the whole compute ecosystem or compute platform, from the complex AI data center servers that we see all the way up to edge devices or industrial PCs and everything.

Starting point is 00:01:09 Side by side, we also make middleware components for the AI data center, physical and IT infrastructure management, which includes cooling, rack, or power cells, etc. And at the top level, we also have a pod-level or data-center-level infrastructure management, which is a centralized, single-pane-of-glass view of the entire data center, including all the nodes, all the AI infrastructure components like cooling infrastructure or power infrastructure, et cetera. So we are a full-stack software and firmware solution provider. That's amazing. I invited you back on the program because I recently read a blog that you wrote around management of AI infrastructure being somewhat of a five-headed hydra right now. And I loved

Starting point is 00:01:56 your analogy and I loved what you brought to that conversation. So I thought it would make a great podcast. What pushed you to frame the problem in the way that you did? Yeah, what I see is in the industry, everybody is trying to solve the same problem. And the main problem is for the mainstream compute delivery: because of the massive demand that we have in the AI data center, how fast we can innovate and how fast we can scale the operation and the deployment of the compute devices. By doing that, the many industry leaders or influencers or technology providers are innovating too fast, and there is no time for us to make sure that it is a standard and that there is one version of truth in terms of the

Starting point is 00:02:44 implementation, which is actually a problem for the end customers or end users, where they have heterogeneous components or implementations coming from different companies with different levels of security measures, with different levels of manageability features, etc. We call it fragmentation in the industry, and there are multiple strategic inflection points in the industry because of the technology shifts that are happening. And at the same time, if it is getting fragmented, it becomes very difficult for the end customer or end user to manage the heterogeneous environment. And we believe that that is something to be resolved. And as we are waiting for

Starting point is 00:03:31 coming up with the standards and a common way of implementation, the gap is increasing. So that's why it is like a monster, like a hydra. You cannot solve it. By the time you try to solve it, again, the gap is moving forward, right? So that's why the analogy is, and it is a problem. It is not a problem for the manufacturing side or a problem for, let's say, AMI, providing different types of solutions to our customers. But we as a thought leader, we should have a purpose for the industry and the ecosystem, and we believe that this is a problem that we all need to solve as a good ecosystem partner. And fragmentation is what we need to start with. And there are other issues; we will definitely talk about them. So this was my motivation to write this, to create an awareness

Starting point is 00:04:19 within the ecosystem. Now, I know that we're going to go through each head of the hydra in detail, but before we get there, can you give us a sense? And I know that you work with partners from across the value chain to deliver foundational technology, which gives you a unique perspective to understand: why is all of this converging now? I know that we're building out AI data center infrastructure, but what is it in that that is changing that is creating this charge? Yes, so if you see that, there is a massive demand. The demand was not there before in terms of the compute, usage of compute or training or whatever we say. Ultimately, it's compute at the lower level.

Starting point is 00:05:01 The amount of processing that is demanded by the customer and demanded by the world is unprecedented. We have never seen that before. It used to be used only for scientific applications, for scientific purposes, research purposes. But now we are seeing any device, even a cell phone or mobile phone in your pocket, is so powerful and doing the AI calculation, right? So those kinds of demands are pushing the data centers to scale the operation as fast as possible. Now, one single company cannot solve this problem. That is the main problem that is happening.

Starting point is 00:05:40 When we were depending in the majority on one company with the GPU, to start with, for training our AI models, AI is the most fascinating technology and the most used processing power in the industry today. We see that the GPU has some other side effects, like it is a very power-hungry device. It also creates a complex environment. And it is only good if you have a massively scaled operation. So now people are looking, and the industry is looking, for smaller-scale or target-specific

Starting point is 00:06:19 AI inference or training devices, which you can think of as more target-specific custom processing units. Let's call it XPU. If it is targeted for data processing, we call it a DPU. If it is targeted for the network side of calculation or analysis, we call it an NPU. So it can be anything. So let's call it XPU. So in this light, there are many, many companies creating this target-specific processing

Starting point is 00:06:47 unit. That's why it is getting fragmented. And each one has their own way of implementing the processors, their own way of managing the security. So all put together, it's highly fragmented. And that is the reason, because the demand versus the deployment, it's a race currently. Now, one of the things that you talk about is that if we look at these problems, and you describe this a little bit, they're interconnected, they can't be solved in isolation.

Starting point is 00:07:18 Firmware is what holds these challenges together. And I think that firmware, everybody in tech knows what firmware is, but it's not necessarily something that has had its spotlight moment until now. Can you just give a sense to our audience, how we should view firmware in 2026? And why maybe is it finally reaching a point where it's becoming just so much more foundationally important to data centers? Very good question.

Starting point is 00:07:46 So I will give one example with two companies' implementations, let's say, two types. There are many, but these will give a clear explanation. For example, let's say we are using GPUs and CPUs in a data center. So GPUs will be used for training a model. And then CPUs will be mostly used for inferencing, when you are trying to use it. And both are coming from two different companies, and both are managed in terms of power and thermal versus efficiency in two different ways. By which I mean that the CPU coming from one company can be managed with, there is a concept of power capping.

Starting point is 00:08:29 If the data center believes that this is the best way of using the CPU for the best efficiency and maximum business outcome of compute power at the particular time in the data center, then they have some levers, some tuning that they can do with the management functions for that particular CPU or a cluster of CPUs. The GPU, coming from a different company, has a completely different way of doing it. Now, if each company starts giving their own APIs and levers, then it is very difficult for the industry or the end customer to, you know, fine-tune and make sure that in that environment they can get the best out of the

Starting point is 00:09:11 return on the investment that they have in terms of power, thermal and everything. So this is where I think firmware plays a major role. Because the intelligence, how to change and how to control a GPU and CPU, is actually from the GPU and the CPU. So the firmware resides inside this particular processor unit to start with. Today it is coming from proprietary-built CPUs from different companies. I would call it a vertically aligned CPU industry or GPU industry, processing industry. As opposed to that, right now we are also seeing there is a horizontally aligned processing industry coming up, which is the custom silicon, where there is a common framework across the GPU, DPU, NPU,

Starting point is 00:10:02 whatever you can think of, right, those kinds of XPUs. And each target-specific processor has its own addition. So the common portion, I mean, Arm has defined this as a new specification, CSS, which is custom compute subsystem. And the way that it is being done is in the form of chiplets. So now, adding more chiplets, we can address each target-specific need. At the same time, managing security, power, and thermal could be common. So that also resolves many issues in the data centers in the future. But the vertically aligned processor economy or industry will still see the heterogeneous management APIs and everything.

Starting point is 00:10:48 So we are always trying to solve it outside of the processor. In the case of the horizontally aligned custom silicon design, we are working very closely with companies like Arm to make sure that we have a common way of managing it, whoever makes it, to create an ecosystem out of it. So we are putting our firmware inside that. But for the vertically aligned thing, we are trying to solve this outside of it, in our platform firmware,

Starting point is 00:11:19 which we call BIOS, which is our Aptio firmware, which is for both purposes, and MegaRAC, which is our management solution firmware. And we are trying to resolve it at that level to give a common way of managing from the upper layer. Now, you described fragmentation at length, and you talked about the various forms of accelerator technologies that are going into AI data centers. When does this fragmentation become a real operational risk in your mind? It is already,

Starting point is 00:11:51 and we see it, because we work closely with the tier-one hyperscale data center companies, and we see the massive challenges that they have. There's a significant amount of investment that they make to mitigate that and put it together.

Starting point is 00:12:08 But I see the major challenges. The last market report that I read says the growth of the neoclouds is actually 54%. So it is a much faster-growing segment than the tier one. Where they are at risk because of this heterogeneous platform, how they are mitigating it is probably that right now they are not deploying too much of a heterogeneous solution, which is also not efficient for them, because they probably would be using the same GPU for,

Starting point is 00:12:41 you know, sometimes for a set of workloads it will be overkill. To optimize that, they need a heterogeneous solution, and it is coming. I think that is the area where the neoclouds will face the maximum challenge from the heterogeneous environment in the future. Now, security is another area that you call out, and firmware anchors everything from secure boot to attestation or update integrity, but it's often invisible to the people responsible for defending it. So why is this gap so dangerous as AI infrastructure scales? Very good question. So I would put it this way: in the future, when the neoclouds are growing so fast, each data center will focus on a particular set of the cloud services that they will come out with.

Starting point is 00:13:31 So eventually we will see that for a bigger task, we will depend on a hybrid cloud. And here, security and attestation come into the picture. At a high level, everybody understands that the cloud providers have to guarantee or attest the security measures or whatever that we see. And at that level, if it is a hybrid cloud, you know that you cannot be a self-certified entity. So that means one cloud vendor or one provider cannot say that I am certifying myself.

Starting point is 00:14:05 It has to be certified by a third party in that case. To do that, you would need an industry-standard way to attest. At least, if the GPU and CPU and the heterogeneous components are using different mechanisms of security implementation, it is a problem. So the industry is heavily motivated right now, and the big implementation is towards a standard like Caliptra, which is a silicon root of trust,

Starting point is 00:14:34 which solves the problem of a common way of attesting it, so you don't have to self-attest. For example, a CPU manufacturer cannot say, okay, I am self-certifying my CPUs for the memory access, for the security of my processor, by the same company. It has to open common APIs and a common way for a third party to attest that. So that, we believe, is the main big challenge that we will see in the future. And of course, the firmware plays the major role here,

Starting point is 00:15:08 because firmware has to expose the security measurements during the secure boot and all these phases, even at runtime. That is where the platform firmware plays a major role. Now, what I thought was really interesting in your article is that you then went into power and thermal, which a lot of people don't think about in terms of IT administration, but there is something really interesting in play in AI data centers in this space. Can you talk about why there is an integration of power and thermal into traditional IT stack management and what the future holds here?

Starting point is 00:15:47 Yeah, absolutely. This is the most complex subject in front of us today. And the reason is, so far we dealt with computing, which is typically binary algebra and that kind of subject that we know and have controlled for a long time. And the CPUs and GPUs, the processors or the compute environment, were producing heat, which was under air-cooled management for a long time. It's very easy, right? But when we are looking at the future designs, where a massive amount of heat dissipation and management is required, those kinds of things are not easy to manage. At the same time, what happens is the processors are getting bigger and bigger in size. When it is getting bigger and bigger in size,

Starting point is 00:16:38 there is only so much cooling you can do per square millimeter of area. When a GPU is large in size, because of the chiplet design or multi-die packaging design, putting in more cores, that means one side of the chip will be at a different temperature than the other side of the chip. And you are actually flowing the cold liquid in on one side and bringing the hot liquid back from the other side,

Starting point is 00:17:05 but you don't know what level of temperature the other portions of the chip are at. And that happens because inside the chip there are typically 96 to 128 or 256 cores, and not all cores are running at the same speed. They are actually isolated in different clusters. It's called GSP, let's say, in the case of a GPU. It is a graphical GPU service processor.

Starting point is 00:17:34 And each cluster within that chip is running workloads. If one portion is running a high workload, that will be hot. And if we don't cool it down, then the entire chip will be shutting down. This is the typical phenomenon of hotspots that the chip will have. So I'm talking about multiple complex problems here, which are based on the thermal flux and thermal wall. At the same time, the liquid which is cooling it down has different dielectric properties, with the coolant coming from different companies. So every time, the management firmware or the controller, which is sitting next to that GPU to control it,

Starting point is 00:18:15 needs to know the dielectric properties of the coolant, which is coming from a particular company, the flow, the pump actions and the flow of the liquid, needs to know the hot spots, needs to know actually the workload running in the isolated GSPs. And then it has to go back and manage it, right? So now I talk about something called

Starting point is 00:18:39 closed-loop management of this thermal. Closed-loop management means this. Now, the rack: previously, we used to see one rack full of compute servers. That's all. And the servers would be in the middle portion of the rack. At the bottom probably would be power supplies, and at the top would be our top-of-rack switch to connect the rack to the network. Now this whole compute rack is disaggregated in three. One is the compute rack, which contains the GPUs and CPUs.

Starting point is 00:19:09 Second is dedicated for cooling purposes, the cooling rack. And the third one is dedicated for power, which is the power rack or power cell. Now, when we have disaggregated things, each rack is coming from a different vendor. Previously, we used to buy the full rack of servers, including the cooling arrangement of air-cooled devices, together from one company. Now it is coming from three different companies, right, and it's sourced. Now, these companies, they don't work with each other, and whenever there is a workload given by the data center to a particular server.

Starting point is 00:19:47 That server is getting hot, but it has to be cooled down by the cooling rack, which is coming from a different company. Their management controller is different. So imagine how fragmented and heterogeneous the manageability is. So what we are trying to do is make sure that we follow the same standard. At the middle level, we have a rack manager in each rack, working with the rack providers, so that we can manage them properly. And at the top level, data center level, we have an infrastructure manager which can tie all

Starting point is 00:20:21 of them together as a single pane of glass for the system administrator. But the subject is very complex. The liquid and pump and everything used to be part of the data center facilities. Now mechanical, electrical and plumbing, which all put together is the cooling and all these things, are shifting: the manageability is shifting from the data center facilities to IT. Because IT has to control it, because that controls the efficiency of the server as well as the token processing, as well as the business of the whole data center.
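
The closed-loop idea described above can be sketched as a simple control step. This is only a minimal illustration under stated assumptions, not AMI's implementation or any vendor's API: the names `RackTelemetry` and `next_flow`, the proportional gain, the target temperature, and the flow limits are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RackTelemetry:
    """Hypothetical readings a compute-rack controller might report."""
    hotspot_temps_c: list[float]  # per-cluster hotspot temperatures, in Celsius
    flow_lpm: float               # current coolant flow, in litres per minute


def next_flow(t: RackTelemetry, target_c: float = 75.0, gain: float = 0.5,
              min_lpm: float = 5.0, max_lpm: float = 60.0) -> float:
    """One proportional control step (all limits are assumed values).

    Raise the flow when the hottest cluster is above target, lower it when
    below, and clamp the result to the assumed pump operating range.
    """
    error = max(t.hotspot_temps_c) - target_c
    return min(max_lpm, max(min_lpm, t.flow_lpm + gain * error))


# One cluster is running hot, so the controller asks the cooling rack for more flow.
reading = RackTelemetry(hotspot_temps_c=[68.0, 91.0, 72.5], flow_lpm=20.0)
print(next_flow(reading))
```

The step reacts to the hottest cluster rather than the average, which mirrors the point that one hotspot is enough to shut down the whole chip. In a disaggregated rack the telemetry and the pump would sit behind different vendors' controllers, which is where a common management interface would matter.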

Starting point is 00:20:56 You know, what's fascinating about this is you wrote about fleet scale, and I think that really puts it into perspective. I mean, AI data centers are not one rack. They're football fields full of racks. And I think that one question that I had for you is, you know, if you look at the scale of these implementations and all of the challenges that you've talked about thus far, what breaks first if the control plane isn't consistent across that infrastructure? That would be very difficult for anybody to manage because how do you get the cooling?

Starting point is 00:21:27 So let's say from the top level, I'm a system administrator. I want to maximize the token processing per watt. That's how my scorecard is, and that is how I am measuring my business. So that means I need to know what the efficiencies are. It's like a sports team, right? You need to know the strengths of each player and distribute the game strategy accordingly. And it's no different in a data center. So now if you want to do it and if you don't know exactly the characteristics

Starting point is 00:21:54 and the manageability strengths of each rack and each thing, then you will not be able to distribute this workload to the servers properly and run this business. So that means you need one single way of managing the compute rack, cooling rack, and power rack. There has to be one common industry standard which comes up to the top level, infrastructure or pod-level management,

Starting point is 00:22:23 where the system administrator can give either a programmatic interface through that or a graphical interface or whatever way it is, but it has to be a common way of managing these three racks. It cannot be disjointed. And it has to be closed loop. Meaning, if a workload has been given, it has to be cooled down properly. Then the cooling rack has to work in conjunction with the compute BMC

Starting point is 00:22:51 or the management controller that we have. All have to work in lockstep. Otherwise it will not happen, right? Yeah. Now, the final thing that you talked about is the mission criticality of this. And I think that we've seen hyperscale data center outages, and we've heard about some of the challenges of the manageability of these massive clusters. Can you talk about downtime in these types of environments and can we even bound what

Starting point is 00:23:21 that would cost to the organization? Yeah, we know from the market reports that the AFR, which is the annual failure rate, of AI-type servers or GPUs is 11%, whereas for the spinning drive, which is the most vulnerable or, I would say, most error-prone device in a data center, it is one percent, less than one percent. Now, it is not because of the GPU failure. It is because of more complex situations. And so RAS, reliability, availability, and serviceability, is a big subject and a big focus for AMI also, to solve this problem. One, I mentioned the hotspot in a large CPU type of die or a processor where it cannot be cooled down as fast as possible and there are

Starting point is 00:24:14 multiple places having a higher temperature than the other places in the chip, and that brings the entire chip down, right? So this kind of failure will happen unless we put in a feedback loop and unless we have a mechanism to control it. There are many other RAS areas: memory RAS, and also we see DRAM controller failure, for which we sometimes do predictive analytics and provide that telemetry data. But we from the firmware side provide the telemetry data upwards. Somebody at the data center administrator level should take the decision. Even when we're talking about cooling, we also cannot cool down the servers too much.

Starting point is 00:24:57 If we do cool down the server too much, then there will be condensation happening in the back. And that will also shut down these systems in a hyperscale environment. So you see that the subjects are not easy. It's a very complex subject. So I would say that reliability, availability, and serviceability are in multiple areas. One is the GPU, CPU itself, where the way that we manage the thermal and power and the way that we manage the high-speed interconnects and interfaces, could be UCIe, could be CXL, those kinds of interconnects.

Starting point is 00:25:30 How do we do predictive failure analysis of the DRAMs or the high-speed memory, HBM, those kinds of things, as well as the storage? All put together, with the telemetry and analytics that we continuously do, we provide tons of data from our out-of-band controller upward that is available to do analytics. And that should be the basis of controlling the RAS and increasing the reliability of the entire data center eventually. Now, one thing that I think is really interesting is that there's a lot of proprietary solutions out there that say that they're going to help with this. But one thing I liked in your article is that you say that they're quietly adding to the chaos. Why hasn't that worked and why do we need a different approach? So our goal is to defragment whatever activities are going on. So everybody is trying to solve problems, and as the solutions are coming,

Starting point is 00:26:28 there is a different way of solving this problem. But at the same time, it also creates fragmentation, and why is it a problem, right? The problem is, for example, if we have a solution doing out-of-band management for power and thermal, and if you don't have the transparency

Starting point is 00:26:47 to the entire ecosystem and the world, then nobody knows how to do a security test of that particular solution or implementation. Nobody knows what patches we need to apply. It's always dependent on the individual implementation. So what we do at AMI is we come up with all these innovative ideas. The standards are very important to us. So we are engaged with OCP very much.

Starting point is 00:27:16 And OCP is putting together a lot of work groups, five or six different work groups, on the AI infrastructure side of it, including ORv3, which is Open Rack V3, LUTS system hardware V3, and we are continuously contributing the proof of concept,

Starting point is 00:27:36 as well as contributing to the standards. Not only the POCs and the standards, we also open-sourced our major platform firmware to OCP. We have the best and maximum transparency about the implementation. So this is where we believe that we are differentiating

Starting point is 00:27:57 ourselves from the other solution providers, because we want to do it with transparency. Now, you've been in the firmware business for over three decades, and how does that history position AMI differently for this moment? It's actually four decades. We celebrated 40 years. Yes, definitely. We started as a proprietary firmware provider, as an independent firmware provider, of course, in the ecosystem. But we see ourselves differently now, because think of the way the industry is moving, the pace, and the way the innovations are to be done: it cannot be done with proprietary firmware anymore because you will be hurting the ecosystem.

Starting point is 00:28:39 And first of all, I mentioned so many different parallel technologies. Thirty years ago, we saw only one company providing CPUs, and pretty much 100% of the industry was using one processor, Intel, right? Today, it's not like that. Today, there are so many other companies coming with different architectures. How will the industry move forward fast unless we have transparency and we have an open source and open architecture concept? So that is where we have also evolved.

Starting point is 00:29:14 And I would say after 40 years, we have evolved a lot. For every firmware or everything we do, our idea is open source first in our mind. We do not think of proprietary things at all. And that is how we are evolving today. And in the future, also, we'll be continuously growing in that way. That's awesome. I have one final question for you. And it's kind of a high, low. If the industry doesn't really rethink how to approach firmware as a foundational layer, what are we risking? And conversely, what does getting this right unlock? If the industry does not think the firmware is important, the main problem would be: you have a house, you locked everything, but you are leaving your basement door open all the time. Nobody cares, nobody looked at it, but it starts from the firmware, everything that compute hardware wants to provide. So, firmware is the main enabler or manager of the hardware that we see today. So that will be definitely a major issue.

Starting point is 00:30:22 I don't think the industry is looking at it that way. The industry has a lot of attention today on firmware. We see that all the time. And we are continuously working with all the industry influencers as well as the leaders, the key silicon providers. As a matter of fact, for the custom silicon designs that I talked about, we have partnered with Arm, as I mentioned.

Starting point is 00:30:45 And together with not just Arm, there is a set of companies that are part of Arm Total Design. And we're working together to create, build a new ecosystem, Seneca ecosystem where these firmware problems will be solved. So not even that. I have very high confidence that the industry is definitely emphasizing that a lot. Yeah, Arm is doing such a fantastic job building that chiplet economy with their ecosystem. And I'm so glad that you guys are a part of it.

Starting point is 00:31:15 Sanjoy, it was such a pleasure talking to you today. I learned so much every time I get the pleasure to have you on the show. Thank you so much for your time today. Where can the audience learn more about what AMI is doing in this space? I would encourage all our customers and partners and viewers of this podcast to look at ami.com, as well as we are deeply integrated within OCP, and please take a look at our OCP contributions and the work that we are doing together. There's a lot of good things happening, solving all these five problems that we discussed today.

Starting point is 00:31:51 I can't wait to have you back on the show. Thank you so much for your time today. Thank you, Alison, and thank you to be here. And it's a great pleasure talking to you every time. Thanks for joining TechArena. Subscribe and engage at our website, TechArena.ai. All content is copyright by TechArena.


AMI CEO Sanjoy Maity breaks down what he calls the “five-headed Hydra” of AI data center management. Sanjoy explains why the explosion of custom accelerators from dozens of vendors has created a fragmentation crisis across firmware, security, power, thermal management, and fleet operations.
