In The Arena by TechArena - Prioritizing Sustainable Software Paradigms for the AI Era with Fermyon’s Matt Butcher

Episode Date: October 17, 2023

TechArena host Allyson Klein chats with Fermyon CEO Matt Butcher about cloud redundancy's impact on sustainability, and how his organization is delivering new serverless AI capabilities to help usher in a more sustainable computing future.

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein. Now, let's step into the arena. Welcome to the Tech Arena. My name is Allyson Klein, and today I'm so delighted to invite back to the studio Matt Butcher, CEO of Fermyon. Welcome to the program. Thanks for having me. So Matt, you were on, gosh, it seems like almost a year ago,
Starting point is 00:00:38 and you were talking about Fermyon and what you're delivering in terms of WebAssembly capabilities. But why don't you just go ahead and reintroduce yourself and what you've been up to? Yeah, sure. I've been a longtime open source developer. I was probably among that first generation where open source was the thing when I started writing code in 1995, and it has just been a part of my career. And I've had the opportunity to work in some really phenomenal places, ranging from big companies like HP Cloud and Google, and then more recently Microsoft, to small startups like Revolv and Deis that have been in that same ecosystem. About two years ago, we started Fermyon, and a big part of what we were doing when we started
Starting point is 00:01:22 Fermyon was tackling a problem that was top of mind for us, and that was compute efficiency. If you go to Fermyon today and take a look at Fermyon's website, we built a WebAssembly serverless cloud platform: developer tools that are designed to help you build serverless functions that have blazing fast startup speed, that have all the data services built in, with a developer experience that's just designed to be really easy for you. But the thing is, we envisioned building all of that on a platform that was going to be hyper efficient. So I can give you a little bit of the backstory about why efficiency was a big deal to us. While working at Microsoft, so I came from Deis, we got acquired. My team there was the open source team inside of Azure.
Starting point is 00:02:10 It was called DeisLabs. And we were tackling difficult problems. That was the job mandate: build interesting things, release them as open source, be good participants in the open source world, especially in the cloud native world with containers. One of the good things about this is that the challenges you hear about are the difficult ones. And many of them would have big rewards if you could figure out how to do them well, but they are tough. And I remember very distinctly having this conversation with a colleague who was in Azure Compute. Another great thing about working at a place like Google or Amazon or Microsoft is you run into the foremost experts in all these fields.
Starting point is 00:02:52 And so here I'm talking to somebody in cloud computing, and I said to him, you know, tell me, what's the biggest challenge of running the kind of core cloud compute service? And I thought he was going to dive into one of those highly technical topics, and it was going to stretch my brain in new and interesting ways. And it did, but in a completely different direction than I thought. He said, you know what the issue is? We buy and rack servers
Starting point is 00:03:17 as fast as we possibly can. And this was during a period where Azure was just massively growing. And he said, it's discouraging to know that many of these systems are allocated to workloads, but they're really running about 80% idle. And he said, we're paying an electricity bill for a thing we're not using. And that really struck me. Having been in the Kubernetes ecosystem for a long time, and having at that point been looking at serverless, Azure Functions, AWS Lambda, it really struck me in a different way, right? We had been assuming that when we built the cloud, we were going to build a system that was more efficient, that was more power efficient, that was a more efficient use
Starting point is 00:04:01 of hardware. But then we started building systems on top of it that were not actually designed to be that efficient. So take Kubernetes, for example. Say I've got a website, right? During peak times, I may have tens of thousands of users. During low tide, I might have several hundred to a thousand or two thousand users. When I provision resources in Kubernetes, I have to provision for the max load I'm going to get, and, for the most part, keep the entire thing running the whole time. So I might have three or five or seven or nine instances of my website running somewhere in a Kubernetes cluster. And you do that so that if there's an outage, you still have enough compute power. If peak comes in, you still have compute power. If you go viral, you still have compute power. But on a day-to-day basis, right, moment by moment, a lot of that is just sitting there idle.
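To put rough numbers on the waste Matt is describing, here is a small back-of-the-envelope sketch in Rust. It is not Fermyon's code or data; the traffic figures and per-replica capacity are invented purely to illustrate how provisioning a fixed number of replicas for peak load translates into idle capacity for most of the day.

```rust
// Hypothetical illustration: utilization of a service provisioned for peak load.
// All numbers are invented for the example, not measurements.
fn main() {
    // Requests per second each replica can comfortably serve (assumed).
    let capacity_per_replica = 500.0;
    // Replicas provisioned for the worst case, kept running all day.
    let replicas = 9.0;

    // A toy 24-hour traffic curve: quiet overnight, one big daytime peak (req/s).
    let hourly_load = [
        120.0, 90.0, 80.0, 70.0, 70.0, 90.0, 250.0, 700.0, 1800.0, 3200.0, 4100.0, 4400.0,
        4200.0, 3800.0, 3100.0, 2600.0, 2300.0, 2100.0, 1700.0, 1200.0, 800.0, 500.0, 300.0, 180.0,
    ];

    let provisioned = capacity_per_replica * replicas; // capacity kept hot all day
    let avg_load: f64 = hourly_load.iter().sum::<f64>() / hourly_load.len() as f64;

    println!("provisioned capacity: {provisioned:.0} req/s");
    println!("average load:         {avg_load:.0} req/s");
    println!("average utilization:  {:.0}%", 100.0 * avg_load / provisioned);
    // With these made-up numbers the fleet sits roughly two-thirds idle on
    // average, which is the shape of the problem the Azure colleague described.
}
```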
Starting point is 00:04:45 And that became the challenge. Coming out of that meeting, it took months for this to percolate: okay, so what can we do? What can we try? And the net result was, maybe the issue here isn't how people are writing applications, but how we design cloud compute. We essentially design cloud compute in such a way that it forces people to take this sort of wasteful approach by default. And that got me really thinking about the basics. In cloud compute, the workhorse of the cloud is the virtual machine. It encapsulates the entire operating system, from the kernel and the drivers all the way up to your application layer code.
Starting point is 00:05:33 They take minutes to start. They're huge images, they're very inefficient to move around, and they're slow to start. Consequently, you cannot force a user to wait while you start up extra instances to handle more load, right? They're just too slow for that. Containers came along and were a far more efficient use of a lot of pieces of the compute infrastructure, right? You could load multiple containers onto the same operating system. Kubernetes can run 110-ish containers per
Starting point is 00:06:01 compute node, and going from just a few to 110 is actually a really big leap. But again, the startup time for a container comes in at around 12 to 25 seconds. And if you're in the middle of making a purchase, shopping along on your favorite shopping site, and enough load comes in that you have to wait 10 to 12 seconds for a page to load because additional instances are scaling up, you're not going to wait, right? And we all know that. And so the solution to dealing with two fairly inefficient core compute primitives was to just keep a bunch of them running and deal with the waste. So what we looked at was, is there a new kind of compute, right? A third kind of compute that can sit alongside those two that would have the
Starting point is 00:06:49 characteristics that let us not have to be wasteful. So we called this the scale-to-zero problem at Microsoft. And I know that term is gaining more and more traction now, but the core question was: when nobody is visiting your site, can you scale all the way down to having zero instances running? And if a millisecond later, 10,000 people hit your site, can you scale up to 10,000 instances? And that was the core problem that led us past containers and into saying there's got to be another kind of compute that has the performance characteristics
Starting point is 00:07:20 that would do this. As we enumerated what those performance characteristics would be, we said, okay: instant startup time, the same security model as the cloud, and we really want cross-architecture, cross-platform. As we were creating this checklist and looking around, one technology bubbled up to the top of our queue, and that was WebAssembly. WebAssembly was originally built to run inside the browser, right? It was a way to run languages other than JavaScript inside the browser. But because it was running in the browser, it had to start up instantly, because we're impatient when we're in our browsers
Starting point is 00:07:50 and we want things to load right away. And it had to have a very strong security model, because we don't want people to be able to attack us from code loaded in our browsers. And it had to be cross-architecture and cross-platform, because browsers run on everything from my iPad to my laptop to some fridges out there, right? And so all these kind of funny characteristics that were designed to work really well in a particular environment, those were the characteristics we wanted to pluck out of that environment and drop into the cloud and say, can we build a serverless infrastructure based on this? So that was really what got us going on WebAssembly: this idea that as a form of cloud compute, it could be really efficient. And we spent the first year and a half or so of Fermyon's life really honing in on that idea that we could create a CPU-efficient, memory-efficient style of cloud compute that had the ability to scale to zero and then scale up to 10,000 very quickly.
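For readers who want to see what this looks like in practice, here is a minimal sketch of the kind of WebAssembly serverless function Matt is describing, written against Fermyon's open source Spin framework. It assumes the Spin Rust SDK roughly as documented around the time of this episode, along with the anyhow and http crates that the Spin project templates pulled in; exact crate and function names may differ between SDK versions, so treat it as an illustration rather than copy-paste-ready code.

```rust
// A minimal Spin-style HTTP component: no server boilerplate and no
// long-running process. The platform compiles this to WebAssembly and can
// cold-start it on the order of a millisecond, which is what makes
// scale-to-zero practical.
use anyhow::Result;
use spin_sdk::{
    http::{Request, Response},
    http_component,
};

#[http_component]
fn handle_request(req: Request) -> Result<Response> {
    // The handler only runs while a request is in flight; when there is no
    // traffic, there is nothing left to keep warm.
    let body = format!("Hello from a Wasm function! You asked for {}", req.uri());
    Ok(http::Response::builder()
        .status(200)
        .header("content-type", "text/plain")
        .body(Some(body.into()))?)
}
```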
Starting point is 00:08:13 Your conversation really took me to a couple of places. One was that the cloud service providers have been so good about building energy-efficient data centers. They focus on PUE, they focus on green energy, they focus on all these things. But you're absolutely right, the fundamental premise of cloud is inefficient, and we overlook that. I've been talking to a number of vendors over the last few months about compute sustainability in general, as well as just, what does energy efficiency mean in the era of AI?
Starting point is 00:09:12 Because I was just talking to Microsoft about Azure, and they're talking about running hundreds of thousands of GPUs to train ChatGPT. Yeah. That's crazy when you think about the energy. And there are all sorts of forecasts about the amount of energy that data center computing is going to use over the next few years, and how we may be getting some great benefit out of it. But let's open our eyes and talk about, you know, are we doing it as efficiently as possible? And so that's why I was so interested when I came across your recent blog on carbon-neutral AI. I was like, wow, Matt was doing WebAssembly, which I totally get as
Starting point is 00:09:53 an efficient form of compute. But how did we get here? And I wanted to have you back on the show so you could describe it. So why don't you start out and tell me about the blog, and how your focus has grown into this broader area? Yeah, sure. And it really is a continuation of the beginning of that story, too. In early September, on September 5th, we announced that among the other services that Fermyon has built, our new one is Serverless AI. And this was combining the idea of serverless that we were talking about just a moment ago with AI inferencing, right, inferencing against an LLM. So just as a reminder, for those of you who, like me, are still getting comfortable
Starting point is 00:10:34 with all the parlance around AI: when you build up a model, you have this massive unit of information, and it's stored in a numeric format. And the process of querying it, which is inferencing, is just incredibly calculation intensive. Vector math and fairly intensive forms of mathematics are involved in asking the AI a question and then having it compose an answer and return it back. So there's a deep complexity involved there. And yet the human element of this is just awesome, right? For the first time, when I ask a computer a question and it gives me back an answer,
Starting point is 00:11:09 I don't have to learn a new programming language. I don't need to get deeply into a debugging process. It's just a much more conversational, much more naturally human way of interacting with a computer. And so what we saw here is this big potential, right? You can ask an LLM to do anything from summarizing an article to, as in a blog post we wrote today, generating a haiku, which is actually very difficult because syllables are a very peculiar human construct. But these are all things that we can get the machine to do. And this is amazing, right? This is a huge advance in technology.
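As a concrete illustration of the serverless inferencing Matt describes, here is a rough sketch of what a Fermyon Serverless AI call looked like from inside a Spin component around the time of this episode. It assumes the Spin SDK's llm module and its Llama 2 chat model option; the function and field names are taken from memory of the early documentation and may have changed since, so read this as a sketch of the shape of the API rather than the definitive one.

```rust
// Hypothetical sketch of an LLM inference inside a Spin serverless function.
// Assumes the Spin SDK's `llm` module (Serverless AI) roughly as documented in 2023.
use anyhow::Result;
use spin_sdk::{
    http::{Request, Response},
    http_component, llm,
};

#[http_component]
fn generate_haiku(_req: Request) -> Result<Response> {
    // The prompt mirrors the haiku example from the Fermyon blog post Matt mentions.
    let prompt = "Write a haiku about a data center that heats a swimming pool.";

    // One blocking call against a hosted Llama 2 chat model. The GPU is only
    // needed (and, on Fermyon's platform, only locked) while this call runs.
    let inference = llm::infer(llm::InferencingModel::Llama2Chat, prompt)?;

    Ok(http::Response::builder()
        .status(200)
        .header("content-type", "text/plain")
        .body(Some(inference.text.into()))?)
}
```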
Starting point is 00:11:58 But again, the cost of it is that it requires a huge amount of compute power, which in turn consumes a huge amount of electricity and then generates a huge amount of heat as a waste product. And we had come into this arena from the perspective, again, of that early Azure experience, where we're saying, okay, how do we be more effective and efficient in using compute resources? And then here's this brand new field that's so promising, and it's that same problem multiplied by a hundred, right? How do we deal with a compute unit like an H100 or an A100, these heavy-duty, high-powered NVIDIA GPUs that need to be used for AI inferencing? What can we do to, first of all, optimize the performance that we can get out of them, and second of all, minimize the impact that these have on the environment around us? So we started on the first one and said, okay, step one would be to figure out how to be more efficient with the way
Starting point is 00:12:45 we use GPUs. So we looked at the profile of how existing AI and ML services were using GPUs. And a lot of times, you lock the GPU for at least several minutes, but one process might only be inferencing for three or five or seven seconds. The GPU is locked for a long period of time, and so you're preventing other things from being able to use that same GPU. So, okay, effectively time slicing the GPU is step one. We began building this system where we only need to lock the GPU for the amount of time that the inference actually happens. If the inference takes two seconds, the GPU is only locked for two seconds. If it takes two minutes, we lock it for two minutes. We're only allocating it when we're ready to start inferencing, and we're freeing it up the millisecond that we're done. And so time slicing is one effective approach.
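The scheduling idea is simple enough to sketch. The toy program below is not Fermyon's implementation; it just illustrates the general pattern Matt describes: requests queue for the GPU, each one holds it only for the duration of its own inference, and the lock is released the moment the result comes back.

```rust
// Toy illustration of time-slicing a GPU across short inference requests.
// The "GPU" here is just a mutex-guarded counter; a real platform would hand
// the slot to an actual device queue.
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

fn run_inference(gpu: &Arc<Mutex<u64>>, request_id: u64, work: Duration) {
    let start = Instant::now();
    // Acquire the GPU only when the prompt is ready to run...
    {
        let mut completed = gpu.lock().unwrap();
        thread::sleep(work); // stand-in for the actual inference
        *completed += 1;
    } // ...and release it the moment the inference finishes.
    println!(
        "request {request_id}: held GPU ~{} ms (queued + ran in {} ms total)",
        work.as_millis(),
        start.elapsed().as_millis()
    );
}

fn main() {
    let gpu = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..4)
        .map(|id| {
            let gpu = Arc::clone(&gpu);
            thread::spawn(move || run_inference(&gpu, id, Duration::from_millis(200 + 100 * id)))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("inferences completed: {}", *gpu.lock().unwrap());
}
```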
Starting point is 00:13:15 Then we said, okay, we need to actually find some GPUs to run all this stuff on. And GPUs are a little hard to come by right now. So we reached out to our friends at Civo. We've known Saiyam and Mark and Dinesh from Civo for a long time, and we work with them here and there on a number of things. They're very active in the cloud native and Kubernetes sphere. And we reached
Starting point is 00:13:54 out to them and said, hey, I know you're going to launch a GPU offering, tell us more about it. And they started the way all of these conversations start: the GPUs are going to be powerful, they're going to be AI-grade, and so on. And then they told us about this partnership that they had formed with a company called Deep Green. So Deep Green is based in the UK, and their technology is one of those where, the first time I heard it, I thought, this is fiction, surely this is not true. But this is a group of engineers who went, all right, the byproduct of AI inferencing is heat. So what are the systems where we could use that byproduct to replace something else that is also consuming energy in order to generate heat? And so they developed this technology where they can take GPUs and they can encase
Starting point is 00:14:38 them in, I believe it's copper, I forget which metal it is, and then they submerge them in mineral oil to capture the heat. So when you think about the way we typically do this computing thing, right, you've got the fan on your laptop flowing cool air across your heat exchanger or your heat sink to try and cool your system.
Starting point is 00:14:57 Or you're using some kind of liquid cooling system to try and cool your system. They did the opposite. They said, okay, how do we capture that heat, disperse it outward, pass it on to a heat exchanger, and then use it to heat something? And so they did this proof of concept where they heated a swimming pool at a UK-based recreation club using GPUs, and demonstrated that, in fact, not only does this end up being approximately carbon neutral, it actually emits less carbon than if we were running the inferencing and separately running the mechanisms necessary to keep the pool heated by some other means. And from there, they've begun branching out. Deep Green is looking to open these, you can think of them as micro data centers, that are GPU-based. This is true edge computing, right? You're talking about computing near swimming pools, or near a large apartment building that has a lot of hot-water heating needs, and things like that. And these GPUs can then effectively create a stable source of heat for the kinds of things that would normally need some other kind of fuel.
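To make the heat-reuse argument concrete, here is a back-of-the-envelope estimate. Every figure below is an assumption chosen for illustration, not a number from Deep Green or Fermyon; the point is simply that nearly all of the electricity a GPU draws leaves it as heat, and every kilowatt-hour captured by the oil loop is a kilowatt-hour the pool's boiler does not have to supply.

```rust
// Back-of-the-envelope heat reuse estimate. Every figure here is an assumption
// chosen for illustration, not a measurement from Deep Green or Fermyon.
fn main() {
    let gpus = 12.0_f64;             // GPUs in the immersion tank (assumed)
    let watts_per_gpu = 700.0;       // draw per GPU under load, W (assumed)
    let busy_hours_per_day = 18.0;   // hours/day the tank runs inference (assumed)
    let capture_efficiency = 0.9;    // fraction of heat the oil loop recovers (assumed)
    let boiler_efficiency = 0.9;     // efficiency of the gas boiler being displaced (assumed)
    let kg_co2_per_kwh_gas = 0.18;   // rough emissions factor for burning gas (assumed)

    // Electrical energy in, per day (kWh). Essentially all of it becomes heat.
    let kwh_in = gpus * watts_per_gpu * busy_hours_per_day / 1000.0;
    // Useful heat delivered to the pool loop per day.
    let kwh_to_pool = kwh_in * capture_efficiency;
    // Gas the pool's boiler no longer has to burn, and the CO2 that avoids.
    let kwh_gas_avoided = kwh_to_pool / boiler_efficiency;
    let kg_co2_avoided = kwh_gas_avoided * kg_co2_per_kwh_gas;

    println!("electricity used by GPUs:     {kwh_in:.0} kWh/day");
    println!("heat delivered to pool:       {kwh_to_pool:.0} kWh/day");
    println!("gas heating displaced:        {kwh_gas_avoided:.0} kWh/day");
    println!("CO2 avoided on the pool side: ~{kg_co2_avoided:.0} kg/day");
}
```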
Starting point is 00:15:29 And so when we heard that, that was it, right? That was why we picked Civo and Deep Green to partner with, because
Starting point is 00:16:02 it's so true to the way we want to be looking at compute going forward. And particularly, I'm going to admit here, I was a little bit skeptical of the cryptocurrency thing, which also used a lot of GPU, skeptical that it was adding enough value to warrant its position in the marketplace. But ML and AI, generative AI in particular, are undeniably changing the way that we're doing computing. It has started now, and it's just going to radically change over time. Every day now I use it, from working with Notion to trying to find things and summarize things. It's just changed the way that we can interact with the devices around us.
Starting point is 00:16:39 And so consequently, I think there's a very real need, a very immediate need, to start solving the problem this way, right? Saying, if we can't yet reduce the amount of electricity that is consumed in doing these inferences, is there a way that we can at least take the output of that process and do something useful? And Deep Green does exactly that. Yeah. So you said that you're doing serverless for inferencing. Inferencing is going to be in like every application. It's going to be on every device.
Starting point is 00:17:07 So how do you break down what subset of inferencing would apply, in your mind, to this approach? Is it all inferencing? Are there certain characteristics that make a lot of sense? I think for us right now, there's at least one thing that we can take off the table, and that's training, right? You didn't even mention it, because we both know that training just takes so much time, and it will probably remain in the hands of the specialists rather than in the regular developer's toolkit. So looking at this from a developer's point of view, inferencing is easier to code. And I've done several apps now, so I can say this with confidence: it's easier to code with an inferencing engine than it is to code with SQL, right? It's easier than writing a SQL query and dealing with the output. And that's really exciting, because it means it should be easy for developers to work
Starting point is 00:18:02 with this. Right now, and I'm just fresh out of the AI Conference, which happened here in the Bay Area, in San Francisco, we had a lot of conversations with a lot of specialists doing a lot of different things. And over and over again, the most popular inferencing that people are doing right now is using large language models to do text-based inferencing. And that's a good one, because ChatGPT and OpenAI really opened our eyes to the possibility there. But now what I'm seeing is a lot of really valid use cases.
Starting point is 00:18:34 So we've chosen that as our first foray. But I'm noticing now that image processing, which was sort of the ideal AI use case several years ago, is once again making a resurgence. And so I wouldn't be surprised at all if, within a year or two, we see the image-based AI tasks catching up and hitting parity with the text-based LLMs. I'm unfortunately woefully inadequate when it comes to knowledge about a lot of the speech and voice recognition work being done. But I did get to see a couple of demos, and I think that's cool and will probably catch on to some degree. But when you think about how we interact with the
Starting point is 00:19:13 world around us as humans, text and image are just so central to the way that we communicate effectively in the online world. And speech processing, I think, will always take kind of a second seat to those two. So then, based on that, how are you engaging the developer community? How are you engaging customers to take this thing for a spin and see how it responds? And what are the kinds of responses that you're getting from this community in terms of meeting some broader organizational goals around sustainability? Yeah. I think the interesting thing is that, and I was guilty of this myself, I assumed, and many of us assumed, that AI would be the realm of the specialists, that there would be a special class of developer called an AI developer, and AI developers would do inferencing.
Starting point is 00:20:01 And so for Radu, who's the co-founder of Fermyon, he's been doing AI work for years and years. And I wouldn't say I've dismissed it, but I've always viewed it as a neat thing to do on the side, because you're really appealing to a specialist group. What I think we're seeing now is that move toward AI just being another tool in the developer's toolbox, right? A thing that we will all need to be comfortable with, and that is opening a bunch of new opportunities. And some of the areas where we're really seeing this kind of technology take hold: of course, the kind of chat-style thing that we've seen gain popularity with ChatGPT. But that doesn't seem to be the most exciting one. I'm noticing a couple of other trends, and one of them is synthesizing information. And that one,
Starting point is 00:20:53 it seems boring until you think about how much of our day-to-day work life, day-to-day life, period, we spend just trying to synthesize information around us, right? We read a bunch of articles and then we try and say, okay, what was the summary I got out of that? We try and digest absolutely immense amounts of information via email, via news sites, via our Notion pages at work or our Word documents. And the ability of machine learning algorithms to optimize this process for us can't be overstated. I was talking to a guy yesterday whose background is academia, and he admitted something that you never hear an academic admit. He said, sometimes it's a little embarrassing that you're reading through a scientific paper and you hit upon a couple of concepts that you don't know.
Starting point is 00:21:29 And he says, I'm a specialist, so I should know these things, and I should go out there and read all of the top papers in the field and understand this. But it's just a paragraph in a paper I'm trying to get through. I just need something that says, hey, here's that concept in a nutshell. And so he gave this very interesting case: AI's ability to go out there, already trained on this massive corpus of information, and succinctly give him an answer was a very powerful and profound game changer for him that saved him many hours of
Starting point is 00:21:59 research time. Now, most of us probably don't do that level of research when we're reading about our favorite band online or doing a little bit of comparison shopping. But I noticed that Amazon is using this now to summarize reviews. And I thought, that's brilliant, because I'm a review reader. I'm the one who goes, oh, four and a half stars. I wonder how many of those were one-star and two-star reviews, and what do they complain about? And so I noticed that they started generating that summary. And the first couple of times I saw it, I'm like, okay, I'm going to read the summary
Starting point is 00:22:26 and then I'm going to read the reviews to see if it's fair. And then I realized that apparently, at some point, I had judged that they're just accurate enough that I go with it. Now I read the summary and I'm like, I just saved myself 20 minutes of reading three-star and four-star reviews. So I really enjoy that. And I think that the impact of AI's ability to summarize, or turn things into bullet points, or that kind of thing, is actually just absolutely game changing, even
Starting point is 00:22:49 though it seems really boring when you say it like that. When you recognize how sophisticated that technology is, how little work it takes for a developer to say, okay, I need to train it so that it uses this corpus of text in order to generate those answers, how easy it can be for them to do that, and the profound impact that comes out of it, it's very exciting. So that, I think, is just one of those cases that we've seen. We've been playing around with sentiment analysis and things like that, too, where ML can detect, by the way you phrase things, whether or not you're annoyed or whether you're okay.
Starting point is 00:23:21 And those kinds of things can be good, particularly to alert me when my emotional level might be coming across in my text when I don't want it to. And I just think these are really clever things that normally would require an overabundance of human involvement, and now we can just let a machine do them, right? The things where we waste a lot of time doing something very simple, now we can have the machine waste some time and do that for us. And I really like that. I just think that's the real game-changing potential for LLMs and this kind of tech space.
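Both of the use cases Matt mentions here, summarizing a pile of reviews and flagging the tone of a message, reduce to the same developer motion: assemble a prompt from your own text and make one inference call. The sketch below shows that shape. It reuses the same assumed llm::infer call from the earlier example, with the same caveat that the Spin SDK function and field names are assumptions based on the documentation of the time.

```rust
// Hypothetical sketch: prompt-building for summarization and sentiment checks.
// Same assumed Spin SDK `llm` API as the earlier example; names may differ.
use anyhow::Result;
use spin_sdk::llm;

fn summarize_reviews(reviews: &[&str]) -> Result<String> {
    // Stuff the raw reviews into one instruction-style prompt.
    let prompt = format!(
        "Summarize the following product reviews in three bullet points, \
         noting the most common complaint:\n\n{}",
        reviews.join("\n---\n")
    );
    Ok(llm::infer(llm::InferencingModel::Llama2Chat, &prompt)?.text)
}

fn sounds_annoyed(message: &str) -> Result<bool> {
    // A crude sentiment check: ask the model for a one-word answer and parse it.
    let prompt = format!(
        "Answer with exactly one word, YES or NO. \
         Does the following message sound annoyed or angry?\n\n{message}"
    );
    let answer = llm::infer(llm::InferencingModel::Llama2Chat, &prompt)?.text;
    Ok(answer.trim().to_uppercase().starts_with("YES"))
}
```

Either function could be called from an HTTP component like the earlier ones; the summarization and sentiment logic is all in the prompt, which is what makes this easier to pick up than, say, wiring up a SQL query.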
Starting point is 00:23:40 I can't wait to see more in this space. All of the examples are really exciting. We are living through such an exciting time in terms of technology and the development of new use cases, new ways to change our lives, really. I'm really excited about the work that you're doing to help enable all of that.
Starting point is 00:24:10 What is the latest with Fermyon? What would you tell folks in terms of engaging on this carbon-neutral AI project, and engaging with you in general? Yeah, we're continuing on in our process of building this kind of serverless environment that makes it very easy for developers to write code. Again, the big story for us is that we have been working to make compute more efficient and to make compute faster. But when it comes to how we want to enable developers to be more productive, this kind of serverless methodology just is very helpful. So when you think about the way a developer
Starting point is 00:24:44 normally writes code, you spend a lot of time writing the boilerplate code to stand up a server that's going to run for hours, days, months, even years in some cases. The serverless environment basically says, look, you just write the piece of code that says, when a request for some information comes in, this is what you do, and you send back a response. All the rest of that stuff is handled by the system. And so we've worked very hard to build that kind of experience so that developers can
Starting point is 00:25:12 be very productive very quickly. Now, the upshot of that is that in the background, when we're just executing small chunks and we're not starting something up that's going to run for a long time, control over how to execute that trickles down to the platform, and the platform can very efficiently schedule these and very efficiently use compute resources. This is how we managed to build a technology that is so effective with the GPU: because we know the millisecond somebody asks to do a GPU inference, okay, this is the time we need to lock the GPU. So we're going to continue on that trend. And again, we're very excited, because AI, and generative AI in particular, is just moving so quickly that now, as we get going on
Starting point is 00:25:52 this system, we say, okay, we understand what the problem looks like. Now, as we're seeing users begin to use this, we're understanding the shape of the workloads, which means we can see how they're using the inferencing engine, see how long these things are taking, and we can further optimize them. So I think there's a lot of fun stuff coming up in the next six to 12 months, across this burgeoning industry as well as at Fermyon specifically. And it's very easy if you want to give this stuff a try: you can head over to developer.fermyon.com, or you can go to fermyon.com's front page and click on the developer resources section. You can go in there right now. The AI stuff is in private beta, so we're still asking people to sign up, but it's free, it's really easy to get started with, and it's a
Starting point is 00:26:34 lot of fun to experiment with this stuff. Sometimes it just blows my mind what the system can do. Thank you so much, once again, for being on the Tech Arena. It's always a pleasure, and I am so excited to see the progress of Fermyon. Yeah, thanks again for having me. This was a lot of fun. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by The Tech Arena.
