In The Arena by TechArena - AMI on Open Firmware for the AI-Native Data Center

Episode Date: December 9, 2025

AMI CEO Sanjoy Maity joins In the Arena to unpack the company's shift to open source firmware, OCP contributions, OpenBMC hardening, and the rack-scale future—cooling, power, telemetry, and RAS built for AI.

Transcript
Starting point is 00:00:00 Welcome to Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome in the arena. My name's Allyson Klein, and I am so delighted. It's a real treat. I have AMI CEO Sanjoy Maity with me. Sanjoy, welcome to the program. It's your first time on with me. Thank you. Thank you very much for inviting me. So, Sanjoy, before we get started, can you just introduce a little bit about AMI? The company has played a pivotal role in firmware innovation over the last 40 years, which is incredible.
Starting point is 00:00:47 Since you took the helm, how has the company's strategy evolved, particularly around embracing open source solutions? Yes, we have been in this business for 40 years. That's not a small time period. We started our journey around 1985, and that is the time a lot of things changed in the firmware area; hardware innovations were very different in the 80s and 90s. As you know, the PCs and servers and the cloud all evolved over the last three decades. We always innovated. We were a very innovative company, and innovation is in our DNA. So we always created firmware which can help a couple of things in the industry
Starting point is 00:01:30 which is scalability from day one. The industry started with different flavors of hardware, which we used to call chipsets. Now it has come down to only two or three chipset vendors, but there were some 20 different companies in the past. What we did all the time was provide a common, uniform code base for the ecosystem, where any computer manufacturer can scale their operations
Starting point is 00:01:55 first, and we provided the highest-quality product all the time, a secured product all the time, and a uniform code base to maintain and manage. When I took over in 2019, AMI was making class-leading firmware, but it was more of a proprietary firmware based on some of the open source and open architecture in the industry; our firmware IPs and everything were closed source and proprietary. We recognized that the industry is moving fast, and that the
Starting point is 00:02:31 AI innovations, the server industry, and cloud technologies all require, first, a high pace of innovation from everybody. Then productization comes into the picture. But innovation, transparency, and scalability become the big problem. So we embraced open source. We started contributing to open source. And in the last five to six years, our journey has completely changed, 180 degrees. All our products are open source based, and we work with all the industry influencers, large companies, and CSPs, and we provide them open source based solutions today. So that's how our journey started, and today we have a completely open source based solution. That's incredible. Now, we're heading into a really important time for open source with the Open Compute Summit coming up next week.
Starting point is 00:03:28 How do you see AMI's role in OCP, and in open source technology innovation generally, from the hardware perspective? And what role does AMI play in open ecosystems today? Very good question. And listen, our involvement with OCP has a purpose. So I will start with OCP's mission statement. OCP was created to accelerate scalable designs that the ecosystem can embrace, and it is open source, mostly on the hardware side. And we see some challenges there. The first challenge, of course, is scalability.
Starting point is 00:04:08 Second is standardization. Third is basically commoditization. Fourth, I would say, is security. And finally, sustaining this whole thing. Now, AMI recognized these challenges, and AMI wanted to prove that OCP's mission can be accomplished through our contribution to OCP. Now, there is a broader spectrum of open source
Starting point is 00:04:33 which is open BIOS, or EDK2, or UEFI, as we know. And there is also OpenBMC, which is the main topic I will touch on more, because that is the reality today. That is a Linux Foundation project, a broad public open source effort where everybody contributes. There are probably 100 companies or individuals contributing to it, and it is growing, which is very good. But the problem comes in the scalability part of it, because there are so many varieties of component manufacturers, and OCP also has multiple specifications. One notable one is the modular system specification.
Starting point is 00:05:17 So whenever there are multiple CPU architectures, which could be x86, could be Arm, could be RISC-V, there could be multiple component manufacturers, especially for the manageability controllers, coming from companies like ASPEED, Nuvoton, or Axiado, and maybe other companies are also making them. Then there are a lot of combinations of hardware
Starting point is 00:05:51 that people are producing. And when somebody takes the code base from OpenBMC, they try to manage it, create the product, and then contribute back. So what is happening at the end is that the public, broad pool of open source is getting fragmented. When it gets fragmented, it's a big problem, because you have OCP-accepted hardware, and which source base should you use, one which you can maintain for 10 to 15 years for sustainability purposes; you have to patch continuously, you have to secure the system
Starting point is 00:06:17 continuously. So we recognized that problem. So what we do is invest a significant amount of time and effort. We merge and test, and not only that: OCP now has a guideline for security audit practices called OCP SAFE. We invest the time to make sure our code is OCP SAFE compliant, audited by one of the four auditing companies approved by OCP. And then we give it back to the community, the OCP community, which is more interested in using a common, uniform source base for developing products, meaning the ODM ecosystem. So we contributed our BIOS, BMC, as well as our security code base to OCP, in OCP's GitHub.
Starting point is 00:07:10 Besides that, we are also involved with other steering committees where the standardization or specifications are being developed. One is the modular system. Another is rack management. Another is open silicon firmware. And there are other manageability solutions we are also participating in. We are bringing all of them together to make sure that these specifications are implemented in a modular way on the code base that we are
Starting point is 00:07:40 contributing to OCP. So that means we are also contributing to the broad, public open source, and we take that community code back, merging, testing, and validating it, creating a production-worthy, supported ROM code base, and providing it to OCP. And that is basically for accountability purposes, with an SLA to our customers. This is really foundational work, and it's impressive, the scope and scale that you're engaging in. Can you give some examples of successful collaborations, either ones you drove within the OCP community or within the other standards bodies that you referenced, that illustrate the potential of open source firmware? Absolutely. We worked with a company like Jabil, and
Starting point is 00:08:28 our goal was to show how modular and how uniform a code base can be from our point of view. So we took the OCP version of the firmware, and we had a toolchain that we built around it, and those are our IP. And we could change an Intel-based processor module to an AMD-based processor module on the fly. We could change the module, and the rest of the system stayed the same; it came up and running and booted the system. So that proved how modular it can be
Starting point is 00:09:00 and how easy and scalable that solution could be for the ODM ecosystem. Yeah, I mean, some folks will listen to this and say, okay, that seems simple, but if you've managed hardware, we all know these are things that will leave you dead in the water if they're not working correctly. When you look at some of the challenges, and we talked about code forking and getting fragmented, can you talk a little bit about the technical and organizational challenges in making firmware more open and interoperable across these diverse platforms that are coming, you know, in an AI-driven world?
Starting point is 00:09:40 That's absolutely a massive challenge that we see every day. AI infrastructure especially is a very disaggregated architecture, because AI demands and tasks, or the workloads, are different from normal server compute. So it is a disaggregated solution. For example, the compute, the GPU, the networking, the cooling, and the power are all disaggregated. Every component has firmware to manage it. And they are all using the common OpenBMC code base, the base code base, to manage each component. And guess what? Each piece may come from different ODMs and different companies altogether, and then they are put together to create
Starting point is 00:10:26 the actual server or actual AI infrastructure. So it is coming from a cross-vendor solution. If you don't know the code base, you are losing visibility into the software bill of materials. You have no auditing capability. Within the data center, you cannot definitively say that all the firmware modules across your data center have the same security patches or the same security measures. These are the challenges that we continuously see. Plus, the obvious bug fixes will not be there. So this is our effort.
Starting point is 00:11:02 We are always merging, and every quarter we upstream to the OCP GitHub, telling the OCP community, whoever is interested in and relevant for making hardware: this is the common, uniform code base; download it and build on top of it. And we have real success stories with that. The common base is used for our own product as well, which is based on open source. And we deliver it to very large Tier 1 CSPs in this country, as well as all the ODMs in Taiwan. So it is definitely production-worthy, and we know that it can happen. So this is our effort today.
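The SBOM-visibility problem described here can be made concrete with a toy audit: given a uniform code base, checking whether every node in a fleet carries the same firmware component versions is a simple comparison. This is a hypothetical sketch; the inventory format, component names, and version strings are invented for illustration and are not from any real fleet.

```python
# Hypothetical sketch: with a uniform code base, auditing firmware drift across
# a fleet reduces to comparing per-node software bills of materials (SBOMs).

def find_version_drift(fleet):
    """fleet: {node: {component: version}}.
    Returns {component: set_of_versions} for every component that is
    not at one uniform version across all nodes."""
    seen = {}
    for node, sbom in fleet.items():
        for component, version in sbom.items():
            seen.setdefault(component, set()).add(version)
    return {c: v for c, v in seen.items() if len(v) > 1}

fleet = {
    "node-a": {"openbmc": "2.14.0", "bios": "5.27"},
    "node-b": {"openbmc": "2.14.0", "bios": "5.26"},  # one node missed a patch
}
drift = find_version_drift(fleet)  # flags only "bios"
```

With fragmented, forked code bases there is no common component naming or versioning to feed a check like this, which is the auditing gap being described.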
Starting point is 00:11:47 Now, you talked earlier about how innovation needs to move faster. What innovations in the firmware space excite you most, especially those that could be accelerated through the open source collaboration that you're talking about? So one of the things, definitely, I would say, is sustainability, which is always an important topic, whether it is environmental or even in delivering the product. The challenge is not just to build the product, but to build, secure, scale, and deliver it. And all put together, the innovations should be in the architecture by design. So it has to be modular. So it should work that way.
Starting point is 00:12:28 So not only the modularity, but how do we build the firmware, how do we test the firmware from a self-test point of view, how do we sign the firmware, how do we create the software bill of materials, and how do we deploy it? All are important, and all are implemented within the firmware. That is basically how we build the firmware. Then the second part is how the firmware innovates feature-wise, and here the AI world, the AI infrastructure world, demands a lot of innovation within the firmware. In terms of the monitoring, and the reliability, availability, and serviceability side of it, monitoring the RAS functionality and elevating RAS to the data-center-level standard is very important. These are all innovative
Starting point is 00:13:25 features that we build, and continuously build, within the firmware. Now, I did want to ask you about a couple of different things around AI's influence on firmware delivery. I've spent a lot of time in the silicon arena. What role do you see for firmware in AI silicon and in platform development? Great. So I will divide it into two parts. The first aspect is that AI silicon is now heading towards custom silicon designs. We see a lot of news, and finally, very exciting news from NVIDIA and Intel a couple of weeks ago that both companies will work together, where the GPU technology comes from NVIDIA and it will be used together with x86 technology. This is one example, but Arm has shown this over the last three to four years: you can take IPs from different companies and build a custom silicon which has dedicated or target-specific AI hardware.
Starting point is 00:14:23 Now, in this, there are two types of firmware that we see, and we are playing a major role there. One is the silicon firmware, which runs inside the silicon. There are different IPs coming from different companies, and they are packaged to create a custom silicon for AI purposes. And there are probably 10 or 15 different types of firmware running inside the silicon. Those are important, because these IPs are coming from different vendors, whereas 10 years ago, you know, Intel, AMD, or the ecosystem processor
Starting point is 00:15:00 companies controlled the entire processor technology. They delivered the final product. But in this case, it is coming from multiple companies. So when somebody is designing the processor or the silicon, it is very important that the silicon firmware enables those IPs, working with those individual companies, and makes sure that the silicon works. These basically follow chiplet technology. Each IP comes as a chiplet. So there is chiplet firmware, and in the future, we believe there will be more specifications and standards coming for how to interact with the chiplets and how to manage a chiplet on silicon. It will evolve tremendously in the future. We are part of the Arm chiplet
Starting point is 00:15:48 specification effort, and we are also working with them on that trial. And the second side is the platform firmware. So I talked about the silicon firmware. The second part is the platform firmware, which is the boot firmware, meaning BIOS; the manageability firmware, which we call the BMC, the baseboard management controller; and the security firmware,
Starting point is 00:16:08 which is the platform root of trust. All three are very important in the case of AI. One, booting the system properly is important, with attestation, root of trust, and validation; all these things are very important, and secure boot is important. The manageability part is probably the most important in an AI data center.
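The attestation and root-of-trust idea mentioned here can be sketched as a measurement chain: each boot stage's image is hashed into a running measurement before control is handed off, so a verifier holding the expected final digest can detect a tampered stage. This is a minimal, hypothetical sketch; the stage payloads and the extend scheme are illustrative, not any vendor's actual boot flow.

```python
import hashlib

# Minimal measured-boot sketch: a PCR-style "extend" folds each stage's hash
# into a running measurement, so any change to any stage changes the final
# digest that would be presented at attestation time.

def extend(measurement: bytes, stage_image: bytes) -> bytes:
    """PCR-style extend: new = SHA-256(old || SHA-256(stage))."""
    stage_digest = hashlib.sha256(stage_image).digest()
    return hashlib.sha256(measurement + stage_digest).digest()

def measure_chain(stage_images):
    measurement = b"\x00" * 32  # initial measurement register
    for image in stage_images:
        measurement = extend(measurement, image)
    return measurement

good = measure_chain([b"rot-stage", b"bios-stage", b"bmc-stage"])
tampered = measure_chain([b"rot-stage", b"bios-evil", b"bmc-stage"])
```

Because the chain is order- and content-sensitive, `good` and `tampered` differ, which is what lets a verifier reject a platform whose boot firmware was modified.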
Starting point is 00:16:31 Just to compare, there was a study by Meta, probably a year and a half ago, that shows that the typical failure of a GPU in a data center is not because of the GPU silicon, but because of the associated high-speed connectivity and associated glue logic. In general, the average failure rate of a GPU in a data center, what they call AFR, annual failure rate, is 11%.
Starting point is 00:17:00 Previously, the component with the highest AFR, the annual failure rate, was the spinning drive, at less than 1%. So compare that context: less than 1% versus 11%. So we are innovating many things in our firmware. In terms of telemetry, we are even using AI tools, looking at how we can predict GPU failure and minimize it. That is one of the areas where I think firmware has to innovate more in the future. That's fascinating. And as you were talking, one thing that I thought about is that AI's design point is moving from a single system to rack scale and even data center scale.
Starting point is 00:17:46 How do you see the role of firmware expanding within that process? This is a great question. This has a tremendous amount of dependency on the firmware. When we go from one server to the rack-scale server, clustering comes into the picture. You cannot just build the server. It is on-demand, dynamic composition and decomposition of the rack. Then multiple racks create the pod, and multiple pods create the entire data center. So that's the whole vision.
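The compose/decompose vision can be illustrated with a toy rack manager that grants GPUs from per-node free pools to form a logical cluster and returns them on decompose. Everything here (node names, GPU counts, the API shape) is invented for illustration; real rack managers speak standard management protocols such as Redfish to the node firmware rather than anything like this.

```python
# Toy sketch of on-demand composition at rack scale: a rack manager hands out
# resources from per-node pools to form a logical cluster, and returns them
# when the cluster is decomposed.

class RackComposer:
    def __init__(self, gpus_free):
        self.gpus_free = dict(gpus_free)  # node -> free GPU count
        self.clusters = {}                # cluster id -> {node: gpus granted}

    def compose(self, cluster_id, gpus_needed):
        """Reserve GPUs across nodes; returns the grant, or None if the rack
        lacks capacity (in which case nothing is reserved)."""
        grant = {}
        for node, free in self.gpus_free.items():
            if gpus_needed == 0:
                break
            take = min(free, gpus_needed)
            if take:
                grant[node] = take
                gpus_needed -= take
        if gpus_needed:          # not enough capacity: leave pools untouched
            return None
        for node, take in grant.items():
            self.gpus_free[node] -= take
        self.clusters[cluster_id] = grant
        return grant

    def decompose(self, cluster_id):
        """Return a cluster's GPUs to the free pools."""
        for node, take in self.clusters.pop(cluster_id).items():
            self.gpus_free[node] += take

rack = RackComposer({"node-1": 8, "node-2": 8})
grant = rack.compose("job-a", 12)   # spans both nodes
```

The firmware-facing part of the real problem is the "taking instructions from the upper-level manager" step described next, which this sketch abstracts away entirely.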
Starting point is 00:18:17 Now, within a rack, if you think of the firmware, its responsibility is also to have the capability of taking instructions from the upper-level manager, the rack manager or data center manager, whatever we call it, to compose and decompose the clusters, the smaller clusters within the rack. That is a big area where firmware has to play a major role, and it is dynamic and on demand. So that's where the rack-scale impact is. Now, when we talk about the rack-scale AI server, one of the two major challenges we have is cooling, because it is now generating massive heat; it is now a one-megawatt rack. At the OCP Summit in Dublin, they just announced that
Starting point is 00:19:09 there is a specification from Google for a one-megawatt rack. As a matter of fact, just 10 years ago, I thought a megawatt was for an entire data center. Now it is only one rack, correct? For the cooling technologies, there are three different areas where we believe firmware will play a major role. One is the server side of it:
Starting point is 00:19:31 how to cool the chips. Based on the cooling technology and cooling mechanism, the GPUs, CPUs, and all these components will be throttled properly, so it can achieve the best business performance for the data center. The second part is the cooling, or coolant, distribution unit, which is not part of the actual compute rack. It is a side rack that provides the cooling and coolant distribution.
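The server-side throttling role described a moment ago, firmware clamping component power as coolant conditions worsen, can be sketched as a simple policy. The setpoints here (full power at or below 35 °C inlet, ramping linearly to a 50% cap at 45 °C) are invented for illustration and are not from any real platform's thermal tables.

```python
# Hypothetical server-side thermal policy: derive a power cap (percent of TDP)
# from coolant inlet temperature, full power when cool, linear ramp down to a
# floor as the coolant approaches its limit.

def power_cap_pct(coolant_c, t_ok=35.0, t_max=45.0, floor=50.0):
    """Full power at/below t_ok; linear ramp down to `floor` percent at t_max."""
    if coolant_c <= t_ok:
        return 100.0
    if coolant_c >= t_max:
        return floor
    frac = (coolant_c - t_ok) / (t_max - t_ok)
    return 100.0 - frac * (100.0 - floor)
```

A real BMC would feed a cap like this into the processor's power-limiting interface and re-evaluate it continuously as telemetry arrives; the point of the sketch is only that the policy lives in firmware, close to the sensors.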
Starting point is 00:19:53 There also, there is a major role for firmware: it will communicate with the main rack, it will control the flow of the liquid, and it will detect leaks of the liquid, those kinds of things. And third, coming more into the facility side of it: all this coolant needs to be cooled down quickly and then come back. So there is major firmware there as well, which will also play a major role. That's really cool.
Starting point is 00:20:28 And thinking about the intelligence that firmware will deliver is really interesting when you consider the brute-force liquid cooling that's going on today. You know, one thing that I wanted to ask you about, Sanjoy: you've charted a new course for AMI in your tenure. And I guess, as you see this market transformation happening in real time, how do you see AMI in this space moving forward? So the good news is that we always know that we have to be innovative and we have to be a thought leader in this area.
Starting point is 00:21:00 We invest a lot in partnerships with companies, and we invest a lot in the community and in platforms like OCP, especially contributing to the standardization or the specifications that are evolving around it. As I mentioned, we will continue to innovate and contribute to the cooling technologies and open chiplet firmware technologies. We believe these are the areas that will evolve in the future. Yeah, that's really awesome. And I guess I wouldn't be a good interviewer if I didn't ask you: do you see AMI having scope beyond firmware in the future?
Starting point is 00:21:40 We already are. We started our journey with firmware, we do have that, and we are the leader today in the firmware world. But we see that it is not enough, because we are talking about rack scale, we are talking about the cooling technologies, we are talking about the power, and I have not even talked about power and carbon footprint forecasting and all those areas. We do have software today which does that, and we are bringing it out; we have not launched it fully. We also have AI-based analysis of the telemetry which comes from the firmware, which analyzes it and gives meaningful,
Starting point is 00:22:21 actionable information to the data center operator. Otherwise, the firmware data is very raw data to them. So these are the areas where AMI already has products that we are bringing out. We do have a solution for composing and decomposing the rack based on demand and based on the workload job. So we are beyond firmware. We basically have software as well, running in-band software and all those things. That's so fascinating.
Starting point is 00:22:52 And I love to hear the trajectory of the company, and really how it paces with what is happening in data centers in terms of the transformation. Sanjoy, I'm sure that we've piqued a lot of interest from our audience today. Where would you send them to find out more about all the things that you talked about and to engage your team? So I would first say that for all the workgroups we are working with at OCP, anyone who is an OCP contributor or member will definitely find us there, see our work, and can connect. Our website is a good place to start; there are a lot of materials and information there, as well as contacts. I personally am available on LinkedIn. I am always available on LinkedIn if any question or anything comes up. But these are the
Starting point is 00:23:41 two places. I would start with OCP, where everybody can see our work and then get engaged. And I would encourage everybody to contribute and build a good community. Well, Sanjoy, I know that you're a busy guy. I appreciate your time today to share this incredible vision that you're bringing with the AMI team to the marketplace. Thank you so much for spending time with Tech Arena. Thank you, Allyson. Very nice talking to you. And thank you for having me. Thanks for joining Tech Arena. Subscribe and engage at our website, techarena.ai. All content is copyright Tech Arena.
