In The Arena by TechArena - Data Center in the AI Era with Jen Huffstetler

Episode Date: August 22, 2024

Join Allyson Klein as she welcomes former colleague and industry innovator Jen Huffstetler. Jen shares her extensive experience driving advancements from client devices to the data center, including groundbreaking technologies like Centrino and 3D packaging.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome to the Tech Arena. My name is Allyson Klein, and I am so excited because I have former colleague and friend Jen Huffstetler back on the show. Welcome to the show, Jen. Thanks, Allyson. It's a pleasure to have you. You have an incredible history of driving innovation
Starting point is 00:00:39 across the industry from client to mobile and into the data center. Can we just start with a brief history of your career and the areas of the industry that you've touched? Yeah, I've had this incredible fortune to be at the center of several inflection points throughout the computing industry. Harkening way back to my early client days, what folks might not know is that in the late 90s, despite the internet boom taking off, laptops were still large and luggable, the battery life didn't really last very long, and you had to plug them into a network cable. So you could move your PC, but it couldn't really be connected everywhere. And that was where my journey began with what we called Centrino
Starting point is 00:01:25 technology, which was the first mainstream laptop platform focused on being thin and light, providing all-day battery life, and integrating wireless connectivity so that you could have pervasive connectivity. And this was way back in 2003. We all probably forget what life was like before you could go to an airport or a coffee shop or anywhere and have that pervasive connectivity. So that was the first tectonic shift I was part of. Since then, I've mostly been focused on the data center side, building server systems for network storage and mission-critical applications, helping our LXM business turn profitable there. It's not an easy thing to do if you're familiar
Starting point is 00:02:05 with this very competitive space. I also focused on componentry in the data center space, leading product across CPUs, GPUs, and DIMMs, building an industry-first silicon-as-a-service business, and looking out to the future of the data center in partnership with technical leaders across the industry: what do we need to be successful with the future of compute in the data center? Some of the things that came out of that were the integrated optical I/O POCs that you've seen publicly announced recently. And most recently, I've been focused on product sustainability across all of these devices and what that means for fab processing, chip design, software, and the system level. It could be at the data center level and
Starting point is 00:02:54 inclusive of software, AI frameworks, and beyond. That last bit is what I want to talk to you about today. I want to talk about the data center and what's going on in the data center. And there is such a wide gamut of topics to address when we think about the data center. We stand at an amazing moment for data center innovation. I talk about this every day on the Tech Arena, but we don't get a chance to talk about it that often on air. It's really filled with opportunity and challenges. And I guess one question that I have for you, as somebody who has been in this space for a long time, just like me: how do you see this moment, and how do you compare it to other moments that we've lived through? Yeah, right now we are in the midst of this unfolding data center of the future. These last few years, we've been
Starting point is 00:03:42 experiencing a shift from homogeneous compute, where there were millions of the same computer, to a much broader expansion into heterogeneous compute. And what that means is having specialized accelerators that more closely match the workload characteristics. These examples have been unfolding and accelerating in terms of volume and deployment over the last few years, starting first with offloaded network, storage, and security applications. There are many names for this in the industry. That's continuing. And in this AI moment, we're now seeing not only the GPUs, but also specialized accelerators for inference and for training. This is very new
Starting point is 00:04:26 in this moment. And in the past, it was really: how do we take advantage of like compute and run it most efficiently? Almost like a Southwest Airlines model, where every airplane is the same and we can run it really efficiently. The workloads have become so complex that the infrastructure providers are now seeing the need to innovate at the silicon level to match them. We're also seeing a lot of innovation at the systems architecture level. You see some of these accelerator companies now deploying at a pod level. That's very different from the past, where we had singular nodes and everything was fitting into what you and I would call a pizza box, in a 1U or 4U form factor. But the design point now is at a multi-rack level. That's huge innovation in the infrastructure. And it really allows them to integrate these new cooling
Starting point is 00:05:17 technologies like liquid cooling. And we're seeing even more innovation, which we'll talk about, I'm sure, in the software layers and the data types. I see these two inflection points happening. There's that system architecture piece, but we're also at another computing inflection point, where AI is becoming the killer app that will define everything in the next 10, 15, 20 years. And that's really what we're seeing unfold today with this extensive deployment of high-power GPUs and ASICs to train these ever-increasing large language models. We see hyperscalers investing heavily in this space, seeking to retain their market position, but they're also encountering challenges because they weren't planning for this much power. And when you build a data center,
Starting point is 00:06:03 that's a long-range project. You need to procure that power years in advance to ensure that it's going to be there. So right now this energy load is actually constraining compute growth. That's very unique. Whereas in the past, a lot of what we saw was consolidation and energy efficiency, so it was actually more work being done with less. We're now seeing a very different model, where they're just deploying more and more product to meet the ever-growing size of these large language models and running into this constraint with regards to energy availability. Now, you've described a really beautiful picture of all of this innovation, and a lot of it
Starting point is 00:06:45 aimed at hyperscale today because they're chasing the monetization of AI. When I think about the next five years, one question that I have is, how will all of this infrastructure innovation actually influence the deployment of technology across the large hyperscalers and enterprise? And when do you think enterprise will have fully digested the fact that AI is that killer app and they have it running across their use cases and their workloads? It's a great question. It's like the $64,000 question from that old show. I think for the hyperscalers, they're already feeling that
Starting point is 00:07:26 competitive threat now. So they've fully embraced and are fully adopting it. On the enterprise side, this is so new that right now, I think a lot of the corporate strategy teams at these large enterprises around the globe, they're really thinking about how are we going to integrate this new technology and how are we going to prepare for the impacts of this on our business? Where are the competitive threats? How do we need to reallocate our resources so that we're investing for the future and the implications that broad deployment of AI will have on us? So I actually think it's top of mind for every enterprise around the globe, but it's going to unfold very
Starting point is 00:08:05 differently. There's a lot of opportunity in this space for companies to take advantage of these new resources, to lower costs, use digital twins, upskill their workforce, integrate autonomous support in their factories, and manage their energy footprint, as we were just talking about. But when I think about how it compares to what we've already experienced, I think it's helpful to go back in time. And that's probably going to inform us about what the next two decades are going to look like. And you and I both watched enterprises of all sizes over the last two decades embrace virtualization and cloud capabilities at different rates. There was no one journey. There wasn't one trajectory. It depended on the
Starting point is 00:08:46 company, what their competitive environment looked like, what their financial profile looked like. Enterprises by nature are pretty risk-averse in terms of new technology deployment. They don't want to disrupt services for their current customers or their existing revenue streams. But what you are going to see is just this massive shift. I think Gartner is predicting it's going from something like 6% in 2023 to 80% of enterprises expected to have deployed some kind of POC by 2026. So we're in this moment where everyone's trying to figure it out. We know that the pace of cloud adoption took off for a few reasons, but those reasons were really centered at the heart of user pain points or ease of use.
Starting point is 00:09:29 And it was the developers who found it easier to build new services in a cloud instance than it was to call their IT department, wait nine weeks for a server to show up, and another nine weeks for that server to get deployed. It just provided that instant access. Some of those software-as-a-service providers, Workday, Salesforce, provided new capabilities over what existed in the monolithic solutions. And it saved IT the time to configure all of this. And now I think what we're seeing with microservices taking off is that a programming shift has happened. And that's going
Starting point is 00:10:06 to help accelerate the adoption of these techniques. And you're seeing some of the largest GPU providers put forth tools to help onboard these developers to take advantage of the compute capability in the easiest way. I think when you streamline that for the developers, that's really what's going to help enterprises move most quickly through this adoption: being able to take advantage of the new tools and techniques that exist. Now, I want to go back to that multi-rack configuration that you were talking about earlier. And I want to talk to you about the rising power that some of these configurations are taking and the change to power and cooling technologies
Starting point is 00:10:45 that are going to be required. There have been some statistics saying that data center power demand is going to grow well beyond the roughly 2% of the world's consumption that it sits at today, to alarming rates. And I don't even want to cite any of them because there's so much divergence in people's thinking on this.
Starting point is 00:11:03 But how do you see the fundamental metrics of power in the data center growing in the AI era? And how do you see that influencing both how people think about greenfield and power delivery, and how they think about cooling technologies and dissipation of heat from all of this infrastructure? Yeah, it's a great question. If we look to the past to help inform the future in this space, what we know is that there has been
Starting point is 00:11:33 ongoing data center energy efficiency happening. From 2000 to 2010, we saw increased processor core counts. We saw virtualization and cloud computing consolidating servers, which provided a significant boost in the compute work done per watt or kilowatt. Intel has reported that from 2010 to 2020, a thousand terawatt hours were saved just from processor innovation. We're going to continue to see that happen. We know that hyperscalers like Google are
Starting point is 00:12:05 publishing their own results about how they take their existing infrastructure, focus on energy management and on utilization of their fleet, and save 40% of their power. You're going to keep seeing that. We've seen PUEs (power usage effectiveness, the ratio of total facility energy to IT energy) reduce globally from over 2 down to a global average of around 1.6, with world-class at less than 1.05. That type of innovation is going to continue to happen. So I'm very optimistic that this won't be a runaway energy explosion, as some folks are potentially forecasting. In the short run, you're seeing brownfield data centers bump up against that wall that we talked about, and so they're just acquiring fuel cells, things that they can do to meet their energy needs today. Meanwhile, for their greenfield sites, they're planning out megasites with long-range power procurement planning with the local utilities and the developers to ensure that they have the headroom for what might be a linear trajectory, which is the right thing to do, right? How do we prepare for the future? But in the long run, I think just like those last two decades, we're going to be surprised by how much more
Starting point is 00:13:16 efficient all of this gets, not only with new hardware that comes out, which can provide tremendous advantages in performance per watt, but with the software innovations that have yet to unfold. We've yet to see the application of AI to the software stacks; in some companies I've seen that just analyzing your code and rewriting it can save 30% of the energy. There's also a lot of innovation happening in the industry in this space. Collectively, there's something called the MX Alliance, and they are innovating on next-generation, more efficient data formats. So what used to be at FP16 moved to FP8 floating point, and now there's a new data type
Starting point is 00:14:00 called MXFP4. This is narrower. It's less precise, but it allows those AI models to take up much less space inside the computer. It requires fewer fetches from memory and runs more efficiently overall. And we're seeing companies supporting this in next-gen hardware. And even before they do that, they're putting out software solutions. And in some of the software-alone examples, you're getting a six-and-a-half-X improvement in performance per watt. So to me, that's just a taste.
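
As a rough illustration of why a narrower data type shrinks a model's footprint, here is a back-of-the-envelope sketch in Python. The 7-billion-parameter model size is a hypothetical example, and the storage math assumes the commonly described MX convention of one shared one-byte scale per 32-element block:

# Rough footprint comparison: the same weights stored as FP16 versus an
# MXFP4-style 4-bit format. Illustrative only, not a measurement of any real model.
n_params = 7_000_000_000             # hypothetical 7B-parameter model

fp16_bytes = n_params * 2            # FP16 stores 2 bytes per weight
fp4_bytes = n_params // 2            # FP4 stores half a byte per weight
scale_bytes = n_params // 32         # plus one shared 1-byte scale per 32-weight block

mxfp4_total = fp4_bytes + scale_bytes
print(f"FP16 weights:  {fp16_bytes / 1e9:.1f} GB")        # 14.0 GB
print(f"MXFP4 weights: {mxfp4_total / 1e9:.2f} GB")       # ~3.72 GB
print(f"Reduction:     {fp16_bytes / mxfp4_total:.1f}x")  # ~3.8x

Fewer bytes per weight means fewer memory fetches per inference step, which is where much of the per-watt gain described above comes from.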
Starting point is 00:14:32 We're at the very beginning of this journey of the innovations that are yet to unfold in the software space, and I just think it's super exciting what's yet to come. There are, of course, open source frameworks; Python has libraries that automate popular model optimizations: quantization, pruning, and knowledge distillation across multiple deep learning frameworks. You're really going to see 100X, and then we all need to see 1000X, improvements from the combination of not only the chip hardware and the data center hardware, but also the software solutions and how tightly coupled that software is into this
Starting point is 00:15:13 infrastructure moving forward. It's so interesting what you said about AI driving software innovation in terms of efficiency. It makes perfect sense. And we're talking about the fact that coding is going away, but we haven't really broadly discussed the implications of that. I can't wait to hear more about this particular topic, and we'll definitely be bringing more about that onto the Tech Arena. You didn't talk about cooling, so I've got to go back to it. Cooling is always an interesting technology in terms of the historical debate about when liquid cooling is going to become the predominant technology in the data center. And I feel like we've finally made it to that moment. What do you think? Do you think that
Starting point is 00:15:57 this is just a hyperscale trend? Do you see liquid cooling playing out in other environments, in the enterprise or at the edge? Yeah, it's a great question. I peppered it in; I didn't go deep. I think liquid cooling has a long history in all of the supercomputers around the globe and in HPC at enterprises. I know Ford has had it deployed for 10 years in their HPC clusters. We are at a point now where some of these chips have exceeded a thermal density that no longer makes air cooling a possibility. And we have to remember, we're talking about a heterogeneous compute environment. Even when you talk about hyperscale, I think where you're going to see liquid cooling take off first, and it already is, is in those pods I was talking about. When you're designing at a rack level, you can integrate cold plate cooling, for example,
Starting point is 00:16:56 as your standard unit of delivery, so that everything will be as efficient as possible. It will be liquid cooled. That, of course, now offers you the opportunity for energy reuse. I think we're going to see more regulations expecting that of data centers, because unfortunately data centers are not the economic boost to a community that they could be. They don't require a lot of labor, but there are other benefits they could be providing that community, like that energy reuse. So I think you're definitely going to see it take off in these large accelerator deployments: when they're reaching a kilowatt or two kilowatts per chip, you're going to need much more efficient solutions. I know everyone's continuing to look at immersion cooling as well, and when that will take off, because it provides so many
Starting point is 00:17:37 other benefits inside a data center: you can make your data center smaller, and it doesn't have to have these really tall ceilings for airflow movement. Everything can change in the design of it all, but there are some barriers to adoption there. One is the facilities, and that's true for both cold plate and immersion cooling. That's one of the things I think the industry is starting to come together on: how do you standardize what that looks like? There are also barriers in terms of maintenance. You now have a million nodes, and you're changing the ergonomics of how somebody has to do the predictive maintenance, the failure maintenance, on a server. That's very different.
Starting point is 00:18:20 It changes everything. If you've seen a tank, you know you now need a crane to pick it up. I think this intersectionality of AI, and you're seeing more and more folks talk about robotics and autonomous robots in factories, could be one of those inflection points that helps some of these technologies that challenge the dominant logic, the dominant way you train a data center technician to manage the servers. I think that's actually going to be an accelerant for that most sustainable solution of immersion. When we talk about enterprises, again, they adopt technology at a faster rate if it's addressing a pain point that they have. Like at the edge you mentioned, if you can deal
Starting point is 00:19:03 with the harsh conditions in a desert, in extreme heat, or with high pollutants in the air, I think you're going to see more adoption more quickly in locations like that, where the technology is really solving the customer pain point. Inside a traditional enterprise data center, things are likely just going to move a little bit slower. They're juggling a lot of things. A data center in most cases isn't their business like it is for a hyperscaler, and with their investments, they'll have a lower risk tolerance for what that takes and what that means. But I know that the expansion of POCs is growing. Given the stakes of this moment, we can expect to see winners and losers across industries. There's so much technology inflection going on.
Starting point is 00:19:45 What advice would you offer IT leaders to ensure that they have the right strategies in place? I think IT leaders really need to be mindful about how many different projects they're taking on, choosing partners to navigate this journey together, because nobody has the solution today. And you're going to need an ecosystem to support the deployment of these new use cases. Whether it's in your factory, managing your internal infrastructure, or helping to improve the productivity of your knowledge workers, there are going to be different experts and a lot of noise in the system. So really finding a partner that you can go on that journey with is, I think,
Starting point is 00:20:29 the number one piece of advice I'd have. In terms of managing your infrastructure, really understand your current compute footprint. Are you utilizing all of it today? Do you have waste? Is there underutilization? You're wasting money if you have zombie instances that somebody spun up and nobody's touched in three years. So there's a lot of opportunity, but you really need to take that first step. And you can start with the cost profile. I know a lot of large enterprises
Starting point is 00:20:54 have done that, and that can help you to focus on how to manage your infrastructure. And specifically in this AI realm and moment, we talked about the model sizes, right? It will be bimodal. Enterprises are going to deploy smaller domain-specific models and emphasize data quality over quantity. A smaller data set is going to use less energy when you're training that model, and you'll have lighter ongoing compute and storage implications from that as well. Think about the level of accuracy you need. When you throw more compute resources at it, you're going from something like 99.7% to 99.8% or 99.9% accuracy. Ask what's really needed for each
Starting point is 00:21:37 use case. These domain-specific models are really going to be helpful, and enterprises can work with partners who have even done training with the specific language of their industry: there will be healthcare-specific models, legal-specific models. How can you accelerate your time to deployment by choosing partners that really deeply understand your business and that can help you on that journey? Because at the end of the day, every enterprise is going to be orchestrating a lot of different models to seek outcomes and transformation in their business. And I think the most important thing is just don't wait. Every company is facing this moment together and the future is on the line, right? How do you support your teams so that they can get upskilled quickly to help you solve for your future AI strategy?
Starting point is 00:22:27 So I've been thinking about your story about Centrino throughout this interview, and I have a question for you. If you were able to go back in time and talk to that young engineer who was working on Centrino about what you've seen and witnessed in the industry, what would you tell her? And what would you tell the young engineers who are working on AI today about their opportunity to be part of this larger story? I think when I go back and sit in the shoes of that moment, we were building the future and putting forth this bold vision. And it seemed like it was going to be impossible at the time. You literally didn't have wireless
Starting point is 00:23:13 hotspots in an airport. You had to make sure, okay, who on the team is going to go work on the ecosystem partnerships so that we hit the major airports like San Francisco? Who's going to work with a Starbucks? So really think about a clear vision of what you believe is possible, and get advice and support, whether it's within your own company or from advisors (there are many advisor networks around the globe), to help you think through that holistic strategy, because it wasn't all engineering. It was understanding who those go-to-market partners would be to help fulfill that vision. And so if I move over to the AI space today, the future is in the hands of every AI engineer. I think you're getting some taste of
Starting point is 00:23:58 what that could look like from some of the leaders in the large keynotes they give, painting a vision for you. What vision do you want to be a part of? What conversation do you want to be a part of, or what story do you want to be telling 20 years from now that you made happen? And I don't think there's any limit to what you can dream. It's just having a clear goal and a vision, knowing what role you play and what help you need to build out that vision, whether it's internally, if you're at a corporation, or through an advisor network, if you're just
Starting point is 00:24:29 getting started. That's fantastic. Jen, one final question for you. I'm sure that people who are listening are going to want to engage with you. Where do you want them to reach out and talk further? Thanks, Allyson. You can always find me on LinkedIn, where you will continue to see me advocating for sustainable computing so that together we have a sustainable future. Thanks so much for being on the program today.
Starting point is 00:24:53 Thank you. My pleasure. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by The Tech Arena.
