Screaming in the Cloud - Episode 73 - Building a Cloud Supercomputer on AWS with Mike Warren
Episode Date: August 14, 2019

About Mike Warren

Mike Warren is cofounder and CTO of Descartes Labs. Mike's past work spans a wide range of disciplines, with the recurring theme of developing and applying advanced software and computing technology to understand the physical and virtual world. He was a scientist at Los Alamos National Laboratory for 25 years, and also worked as a Senior Software Engineer at Sandpiper Networks/Digital Island. His work has been recognized on multiple occasions, including the Gordon Bell Prize for outstanding achievement in high-performance computing. He has degrees in Physics and Engineering & Applied Science from Caltech, and he received a PhD in Physics from the University of California, Santa Barbara.

Links Referenced

@m_warren
https://www.linkedin.com/in/mike-warren-3a0439b1/
https://lightstep.com/
Transcript
Hello and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
No one is using Nagios at scale anymore, because monitoring looks like something very different in a modern architecture where you have ephemeral containers spinning up and down, for example. How do you know how up your application is in an environment like that? At scale, it's never a question of whether your site is up, but rather a question of how down is it? LightStep lets you answer that question effectively. Discover what other companies, including Lyft, Twilio, Box, and GitHub, have already learned. Visit lightstep.com to learn more. My thanks to them for sponsoring this episode of Screaming in the Cloud.
Welcome to Screaming in the Cloud. I'm Corey Quinn.
I'm joined this week by Mike Warren, co-founder and CTO of Descartes Labs.
Welcome to the show, Mike.
Thanks, Corey. Happy to be here.
So you have a fascinating story in that you had a 25-year career at the Los Alamos National Lab
and then joined, not joined, started a company, Descartes Labs, about four years ago now.
That's right.
I kind of considered, you know, it was a 30-year-long education that provided me kind of the right skills to go out and start a company
that used computing and science to help our customers understand the world.
And it certainly seems that you've done
some interesting things with it lately. At the time of this recording, about a month or so ago,
you wound up making, I guess, a bit of a splash in the cloud computing world by building an
effective supercomputer on top of AWS. And it qualified, I think it came in at place 136 on the top 500 list of the most
powerful supercomputers in the world.
And that's impressive in its own right.
But what's fascinating about this is you didn't set out to do this with an enormous project
behind you.
You didn't decide to do this with a grant from someone.
You did it with a corporate credit card.
That's right. And I guess it wasn't such a surprise to us. We knew the capability was coming
and it just happened that the time was right. And I had some time to, you know, compile High-Performance LINPACK and run it. But that's the way the computing industry is going. And anything you can do in a data
center or in a supercomputing center is eventually going to be possible in the cloud.
So one of the parts of the story that resonated the most with me was the fact that you did this using spot instances,
but you didn't tell anyone at AWS
that you were doing this until after it was already done.
And I confirmed that myself by reaching out to a friend
who's reasonably well-placed in the compute org.
I said, wow, congratulations.
You just hosted something that would wind up counting
as one of the top 500 supercomputers in the world.
And his response was, wait, wait, what now? Which was fascinating to see, where it used to be that this was the sort of thing that would require such a tremendous amount of coordinated effort between different people, different stakeholders across a wide organization. And it turns out that "because it's neat" wouldn't have been a sufficient justification to do approximately any of it.
And now, apparently, you can do this for $5,000,
which is less than a typical server tends to cost,
and about what a typical engineer embezzles
in office supplies in a given year.
And the economics of that are staggering.
How long did you plan on doing this
before deciding to just spark it up?
Not long.
And it's really why the cloud is so attractive for businesses like ours.
You don't have to wait for resources.
You don't have to coordinate.
You don't have to spend time on the phone.
You don't have to worry about support contracts and all of this.
Even a top 500 run on a supercomputer requires an enormous amount of coordination.
You got to kick all the other users off.
It's a big deal and it may only be done once on one of these, you know, nation-scale supercomputers.
But on AWS, any given day when there's a thousand nodes available, you can pull those up and run at a petaflop.
Which is astonishing.
I mean, you did some back of the envelope calculations
on what it would cost to do this in hardware.
And the interesting part to me wasn't the dollar figure,
which, what was that again?
We figured, you know, $20 to $30 million in hardware
to get to this near-petaflop level.
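(My own back-of-the-envelope, with an assumed spot price rather than a figure from the episode: roughly 1,200 C5 nodes for about three hours at a dollar or two per node-hour of spot capacity lands in the ballpark of the $5,000 mentioned earlier, against $20 to $30 million of purchased hardware.)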
Yeah, and the amazing part for me was that looking at this, you also mentioned in
that article, and I'll throw a link to that in the show notes, but the fact that it would also take,
by your estimation, six to 12 months just to procure the hardware and get it set up and
everything and get everything ready just to do this. Yeah. I think people don't realize the
sort of unmentioned overheads in all of these HPC sort of applications. You know, you've got to
procure the hardware and make sure you have the budget authority and get the right signatures.
And that's after assuming you have a data center or a building and enough cooling and power and all those sorts of infrastructure.
So one thing that rang a little bit, I guess we'll call it false in the narrative, and not to poke
holes in your legend, has been that, oh, we didn't tell AWS that we were going to be doing any of
this. We just decided to go ahead and run it. But anyone who spent more than 20 minutes swearing at
the AWS console knows that, okay, you start up an account, and the first thing that you need to wind up doing is
request a whole bunch of service limit increases. Well, you're already running one of those
instances in that region. Why do you need a second one? Justify it. And they're pretty good about
granting those service limit increases. But I'm curious to know, did you already have sufficient limits to run this
from other workloads that have been done in that account?
Or did you wind up opening
the most bizarre limit increase service ticket
that they've probably seen in a month?
As I recall, we had about half the resources
in US East where we ran this.
So we had kind of this scale of resources spread across different zones.
But there's also different quotas between spot and not spot.
So there was a specific request to kind of up the quota for
spot in US East to this level.
And there's another interesting API limitation,
which is a single call to allocate nodes for Spot
can only allocate 1,000 at a time.
So it's a little extra workflow.
You've got to break it into two parts and there's
throttling on the API calls. So you can't immediately allocate, say, 1,200 nodes. You've
got to allocate 1,000, wait a minute, and then allocate the remainder.
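To make that workflow concrete, here is a minimal sketch of splitting a large spot allocation into batches of at most 1,000 instances per API call, with a pause between calls to stay under throttling limits. This is not the team's actual tooling; the AMI ID, instance type, and 60-second pause are placeholder assumptions.

```python
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")


def launch_spot_batch(count, ami_id, instance_type="c5.18xlarge"):
    """Launch a single batch of spot instances via run_instances."""
    return ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=count,
        MaxCount=count,
        InstanceMarketOptions={"MarketType": "spot"},
    )


def launch_in_batches(total, ami_id, batch_size=1000, pause_seconds=60):
    """Split a request for `total` spot nodes into <=1,000-instance calls."""
    instance_ids = []
    remaining = total
    while remaining > 0:
        n = min(remaining, batch_size)
        resp = launch_spot_batch(n, ami_id)
        instance_ids.extend(i["InstanceId"] for i in resp["Instances"])
        remaining -= n
        if remaining > 0:
            # "Allocate 1,000, wait a minute, and then allocate the remainder."
            time.sleep(pause_seconds)
    return instance_ids


# Example: ask for 1,200 nodes in two calls (placeholder AMI).
# ids = launch_in_batches(1200, "ami-0123456789abcdef0")
```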
Take us through this from the beginning,
where effectively, what does the architecture of this look like?
I'm not entirely sure what the supercomputer in question was chomping on.
So in my mental model, I'm going to figure that you were just mining Bitcoin,
which it turns out is super financially viable in the cloud.
You just need to do it in someone else's account.
But on a more serious note, you mentioned LINPACK. What does that do?
An important distinction to make is how much communication needs to happen among the processors. Mining Bitcoin, the processor doesn't need to know about anything else going on in the world.
It's completely independent.
So you can start up a thousand different nodes mining Bitcoin and all of the clouds do that
very well.
The top 500 benchmark is based on the inversion of a very large matrix, and it uses a piece of software called LINPACK.
And in the solution of that problem, all of these processors have to talk to each other and exchange data, and they have to do that with very low latency. So, you know,
you start up a thousand nodes, and if any one of those fails in terms of
computation or network communication during that process, the whole thing
falls over. So these tightly coupled types of HPC applications that are
typically written in something called MPI,
which is the message passing interface, are a lot more challenging to do in the cloud
than the task parallel sorts of applications like a typical web server.
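For a sense of what "tightly coupled" means in practice, here is a toy sketch of a collective MPI operation in which every rank must participate, so the loss of any single node stalls or kills the whole job. This is not HPL itself, and the use of mpi4py and NumPy is an assumption for illustration.

```python
# Toy illustration of a tightly coupled MPI computation.
# Run with, e.g.: mpirun -n 4 python allreduce_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank owns one slice of the problem state (a stand-in for a matrix block).
local_block = np.random.rand(1_000_000)

# A global reduction: every rank exchanges data with all the others, so all of
# them must stay alive and reachable for the duration of the computation.
global_sum = comm.allreduce(local_block.sum(), op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks cooperated to compute a global sum: {global_sum:.2f}")
```

By contrast, a Bitcoin-mining or web-serving workload could simply drop the failed rank and carry on, which is why those task-parallel jobs are so much easier to run on interruptible capacity.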
One of the things that fascinates me is that whenever you're using any sufficiently large number of computers,
some of them are intrinsically going to break, fail out of the cluster, etc.
That's the nature of things.
That sort of goes double when you're using something like Google's preemptible instances or AWS Spot Fleets,
where it turns out that subject to the available supply, things are generally not nearly as
available as you would expect if, for example, someone else in that region decides to spin up
a supercomputer to take a spot on the top 500 list. So how does each one of those things checkpoint
what it's working on in some sort of central location so that it can die and be replaced
by something else without losing all the work that node has done? Or doesn't it?
No, it can't. There's no checkpointing these sorts of problems in any easy manner. So
everything's got to work perfectly for the three or six hours that this benchmark is running. So
the reliability is very important. And I've heard anecdotal reports of some of these
very fast top 10 supercomputers needing to try to run the benchmark several times before they
can get it to run reliably. So when you wound up doing this, did you just over provision by a bit
and assume you would lose some number of nodes along
the way? That also doesn't work. Once you've sort of
labeled each of these processors with a number, you can't have one
disappear. It's got part of the state of the
problem in it. So if you start with 1200 nodes, you need them all to work
the whole time.
So that's why these tightly coupled applications are a lot more challenging to scale up.
So when you wound up doing this, you effectively wound up with how many instances that were a part of this?
I mean, I saw the statistics on petaflops, but that's hard for me to put into something that I can think of.
Yeah, it was a bit under 1,200 nodes.
And these were 36 core processors,
or rather 18 core processors with two dies per node. So that gives you 41,000-odd processors.
And these are the hardware cores,
not the abomination of virtual counting.
Gotcha.
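(As a rough sanity check on that figure, with an assumed node count for illustration: roughly 1,150 nodes times 36 physical cores per node is about 41,400 cores, which lines up with the 41,000-odd total.)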
Yeah, it looks like you did this entirely on top of C5s, and the performance of those things is impressive.
And the economics of it are fantastic.
For me, I think the hardest part to wrap my head around
is the fact that you had 1,200 of these things running
without a hiccup for three straight hours on spot,
which historically was always extremely interrupt prone.
And it's surprising to me that none of those
wound up getting reclaimed during that window.
Just at that scale,
I would expect even traditional computers,
one of the 1,200 is going to fall over and crash into the sea because I'm lucky like that.
Well, there's, I think, two things you're talking about there.
And one of those is solved by Amazon's spot blocks.
So, you know, you say you want a certain number of processors for some
number of hours between one and six and then Amazon guarantees it's not going to
optionally take one of those away from you. So those are a bit more expensive
than the normal spot but a lot less expensive than the non-spot instances.
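For reference, here is a minimal sketch of what a Spot Block request looked like through boto3 around that time; the BlockDurationMinutes parameter is what bought the one-to-six-hour guarantee Mike describes. The AMI ID and instance count are placeholders, and AWS has since stopped offering Spot Blocks to new customers, so treat this as historical.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request defined-duration ("block") spot capacity: priced above regular spot,
# below on-demand, and not reclaimed during the reserved window.
response = ec2.request_spot_instances(
    InstanceCount=100,
    BlockDurationMinutes=360,  # reserve the capacity for six hours
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # placeholder AMI
        "InstanceType": "c5.18xlarge",
    },
)

request_ids = [r["SpotInstanceRequestId"] for r in response["SpotInstanceRequests"]]
print(f"Opened {len(request_ids)} spot instance requests")
```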
So what becomes important is just
the inherent failure rate of the hardware in the network.
And in our experience, the cloud resources we've used
have been incredibly reliable to the point where, you know, certainly we've seen in Google
their predictive task migration where they can sort of understand that a node is about to fail
and then migrate that whole kernel and processes to another piece of hardware so that you never know
about it and they can pull that defective hardware out and fix it. The idea of being able to do
something like this almost on a lark on some given afternoon for what generally falls well within
any arbitrary employee's corporate spending approval limit is just
mind-boggling to me. I know I keep belaboring this, but it's one of those things that just tends to
resonate an awful lot. Can you talk to me at all about how this winds up manifesting compared to,
I guess, the stuff you do day to day? I'm going to guess that Descartes Labs does an awful lot
of interesting stuff. I mean, you folks work on satellite and aerial imagery, to my understanding.
But how does that tie back to effectively, step one, take a giant supercomputer.
We'll figure out step two later.
Well, it's really democratizing supercomputing.
And it's an extension of the power that software gives an individual.
We're able to do things now with good software and a smart person that used to take an entire group of people
with lots of infrastructure to do.
So it's always been something I've been very interested in
and goes back to our building Beowulf clusters back in the 90s.
That was really democratizing parallel computing for people who couldn't afford the state-of-the-art supercomputers.
And there were untold times that people collared me and told me stories about their group in college who had a cluster that they built in the closet that allowed them to do the research that they were doing.
So on a day-to-day basis, what does, I guess, your computing environment look like?
I mean, you spun out of a national lab, which for starters, we know is going to be something that is relatively, shall we say, computationally impressive.
Mike Julian, my business partner, used to help run the Titan supercomputer at Oak Ridge National Lab about 10 years ago.
So I've gotten absorbed into it by osmosis almost, and I always view that as, here go smart people.
I just sit here and make sarcastic comments on the side.
But what does that look like on a day-to-day basis of what
Descartes Labs actually does? Well, the last big computation I did before we founded Descartes Labs
was on the Titan machine. We had 80 million CPU hours to calculate the evolution of the universe,
and we did that with a trillion particles.
So a lot of that kind of thinking
has carried over to our environment at Descartes Labs.
And I spend a lot of my time in the command line
writing Python scripts.
But our interaction with customers
is much more focused around APIs
and making it very easy to interact with these petascale data sets.
You tell us where on the earth and what time over the last 20 years
you want to see an image, and we can deliver that to you in less than a second.
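To make the shape of that interaction concrete, here is a purely hypothetical sketch of such a query. This is not the actual Descartes Labs client API; every name in it is invented for illustration.

```python
from datetime import date


def fetch_imagery(aoi_geojson: dict, start: date, end: date, product: str):
    """Hypothetical interface: return imagery over an area of interest and a
    date range from a petascale archive -- 'tell us where on the earth and
    what time, and get an image back'."""
    raise NotImplementedError("illustrative sketch only")


# What a call might look like (all values are made up):
# scene = fetch_imagery(
#     aoi_geojson={"type": "Point", "coordinates": [-106.3, 35.9]},
#     start=date(2015, 6, 1),
#     end=date(2015, 9, 1),
#     product="landsat-8",
# )
```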
And are you building this entirely on top of AWS? Are you effectively open to all comers as far as the large clouds go? Is there a significant on-prem component?
There's no on-prem at all. Descartes Labs IT infrastructure consists of a laptop for everyone, essentially.
The bulk of our platform is implemented in Google Cloud.
We do have data input processes that are running in AWS,
but we've tried to keep most of the platform cloud agnostic so that we would move to another
cloud if that made sense in terms of the economics.
Often a great approach, especially with what you're talking about.
It sounds like the way that you're designing things requires some software that seems highly
portable to
almost wherever the data itself happens to live. You're not necessarily needing to leverage a lot
of the higher level services in order to start effectively chewing on mass quantities of data
with effectively undifferentiated compute services. It feels like the more you look at something like
this, the economic story starts to be a lot more around the
cost of moving data around. Turns out that moving compute to the data is often way cheaper than
moving data to the compute. Right. Our philosophy has been to work at a fairly low level. One of our
big successes has been essentially implementing a virtual file system on top of the cloud
object store so we can take any number of open source packages and they see a POSIX file system
to interact with, and we don't have to spend a lot of time rewriting the I/O routines there.
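Their virtual file system itself isn't shown in the episode, but as a minimal sketch of the general technique, you can present cloud object storage to existing code through a file-like interface. This assumes the fsspec/gcsfs libraries and a placeholder bucket, and is an analogous approach rather than their implementation.

```python
import fsspec

# Open an object in a cloud bucket as if it were a local file; code that
# expects a file handle keeps working without rewriting its I/O layer.
with fsspec.open("gs://example-bucket/scenes/tile_0001.tif", "rb") as f:  # placeholder path
    header = f.read(1024)

# Filesystem-style operations over the same object store.
fs = fsspec.filesystem("gs")
tiles = fs.ls("example-bucket/scenes/")
```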
A lot of those open source packages, are they relatively recent? Are they effectively coming from, I guess, I want to say the heyday of a lot of the HPC work, or at least what felt like the heyday of it back in, I want to say 2005 to 2010.
That may not actually be the heyday, just when I was looking into it, but I'm assuming that my experience mirrors everyone's experience.
No, it spans the whole range. We go all the way from 20-year-old million-line Fortran packages to the latest convolutional neural network, which was written a month ago.
So as you take a look right now at getting this project up and running, what could AWS have done to make it
easier for you, if anything? Well, I think it's, you know, this environment that they offer is
what I worked in 20 years ago. It's Linux plus Intel plus a fast network. And the real key in my mind
is the real expense is the software engineers that you need to write this code and deploy
it. And now Amazon has eliminated all the friction around the hardware capacity to actually execute that and bring it to reality.
So that's kind of the magic. The investments we make in terms of making our programmers more efficient are not overshadowed by the fact that they can't get access to a petaflop machine to test it or help develop it.
It's remarkable. Now you can probably get more capacity in AWS with five minutes' notice than you can on most of these supercomputers, which have to keep their queues very full to keep their utilization high to justify the cost of that dedicated hardware.
When you take a look at doing this again, I mean, effectively, first, is this something
you would ever consider doing again just as a neat proof of concept?
I mean, arguably, it got more attention when I linked to it in Last Week in AWS than most articles I link to. So it clearly
resonated with people. I mean, sure. Maybe we'll help our local university do the next one or,
you know, do it in a couple other clouds at the same time. It's a big community, and the top 500 run doesn't provide any utility in itself; it's kind of just demonstrating what could be possible with a set of hardware. So I'm kind of
more interested in going to the next level of let's run more codes than we already are in this HPC environment.
So I guess the big question here is what inspired you to do this on AWS?
You mentioned that the bulk of what you're building today is on GCP.
What triggered that decision from your perspective?
I think AWS has been the first to have the network infrastructure that makes this possible.
In the same way that you need all of the CPUs to be fully available during this computation, you can't have any network bottlenecks.
So with the new architecture and scalability of AWS's network, not having other users interfering with the available network bandwidth, and having low enough latency on these messages, AWS was just the first to get there.
But Azure has also demonstrated this sort of performance in their hardware that has dedicated low latency networks.
And I would imagine Google is not far behind.
It's interesting, and I think that a lot of people, myself included,
don't tend to equate incredibly reliable networking as a prerequisite for something like this.
But in hindsight, that is effectively what defines supercomputers.
Things like InfiniBand, how quickly can you get data
not just across the bus inside of a given node,
but between nodes in many cases?
It's one of those things that sounds super easy until you look into it and then realize,
huh, the way this is currently architected from a network point of view,
we are never going to be able to move the kind of data that we think we're going to be working on to all the nodes in question.
It's one of those, I think, very poorly understood aspects of systems design,
especially in this modern world where everyone is going to more or less wind up in a place of
it's just an API call away. The number of people who have to think about that is getting smaller,
not bigger. Definitely. And there's, I think, some research that needs to be done in the range of latencies.
You know, InfiniBand's down at the microsecond level.
AWS can now do things at the 15, 20 microsecond level. And that's a lot shorter than the generation of 10 gigabit ethernet, which a lot of applications
showed didn't have low enough latency for their needs.
So a lot of these important sorts
of molecular dynamics, seismic data processing,
all these big HPC applications, we really don't know if they're
limited by the current implementations or if you really need to go down to the microsecond
latency network.
What's next?
I mean, you've been doing this an awfully long time.
You've been focusing on HPC,
working on compute at a scale that, frankly, most of us have a difficult time imagining,
let alone working with. What do you see as the next evolution down this path? I mean,
HPC historically has been more or less the purview of researchers and academics.
Now we're starting to see these types of things move into the commercial space in a way that I don't think we did before.
Well, I think I've seen a factor of a million improvement in performance from when
I started writing our gravitational N-body evolution code in graduate school. And you think about that
factor of a million in anything else in your experience, it's just never
happened. So parallel computing has been a unique experience over the last 20
years. And where it's at now is just being available to tens or hundreds or thousands
times more programmers. So I think finally we'll get to an era where the investment in
hardware has not disadvantaged all of these programmers who could really make some breakthroughs,
but it's hard to do that
when your code goes 10 times faster every five years
without you having to do anything.
There's something to be said
for the idea of a cloud computing provider
where you just throw an application into their environment
and you don't touch it again. And over time, the network gets better, the disks get more reliable,
the instances it runs on get faster. If you try that in your data center, raccoons generally carry
it off by the end of year three. And that says terrible things about my pest control, but also
about, I guess, the way that people tend to, on some level, reduce these hyperscale cloud providers down to,
oh, it's just someone else's computer. It's a different place for me to run my VMs.
And I think that you're demonstrating that it has the potential and the opportunity to be far more than that.
Absolutely. There are now clear economies of scale for HPC,
whereas before these were all very specialized systems
in a not very big market.
So the real democratization puts this power in the hands of anyone who can write the right type of software to take advantage of it.
And it becomes a true commodity that is really just distinguished by its cost.
Wonderful.
If people want to learn more about, I guess, how you've done this and more about what, I guess, you folks are up to over at Descartes Labs, where can they find you?
We're at DescartesLabs.com.
We've got a good series of blog posts around kind of what we're up to and what we're doing. And we've got big plans to grow our data platform
beyond geospatial imagery into a lot of other very big data sets
that are relevant to what's happening in the world.
So I'm going to go out on a limb and assume that you're hiring.
We definitely are.
And it's a great place to work for a scientist. I kind of,
you know, took my experience at a national lab and having worked at universities. And I think
we've put together the best of, you know, the history of research and engineering
and made it into a really great place to work
and think about software
and solve the biggest problems that the world is facing.
Thank you so much for taking the time to speak with me today, Mike.
I appreciate it.
Thanks, Corey. It's been fun.
Mike Warren, co-founder and CTO of Descartes Labs.
I'm Corey Quinn.
This is Screaming in the Cloud.
This has been this week's episode
of Screaming in the Cloud.
You can also find more
Corey at
screaminginthecloud.com
or wherever
Fine Snark is sold.
This has been a HumblePod production.
Stay humble.