Screaming in the Cloud - Episode 73 - Building a Cloud Supercomputer on AWS with Mike Warren
Episode Date: August 14, 2019

About Mike Warren

Mike Warren is cofounder and CTO of Descartes Labs. Mike's past work spans a wide range of disciplines, with the recurring theme of developing and applying advanced software and computing technology to understand the physical and virtual world. He was a scientist at Los Alamos National Laboratory for 25 years, and also worked as a Senior Software Engineer at Sandpiper Networks/Digital Island. His work has been recognized on multiple occasions, including the Gordon Bell Prize for outstanding achievement in high-performance computing. He has degrees in Physics and Engineering & Applied Science from Caltech, and he received a PhD in Physics from the University of California, Santa Barbara.

Links Referenced

@m_warren
https://www.linkedin.com/in/mike-warren-3a0439b1/
https://lightstep.com/
Transcript
Hello and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
No one is using Nagios at scale anymore, because monitoring looks like something very different in a modern architecture where you have ephemeral containers spinning up and down, for example. How do you know how up your application is in an environment like that? At scale, it's never a question of whether your site is up, but rather a question of how down is it? LightStep lets you answer that question effectively. Discover what other companies, including Lyft, Twilio, Box, and GitHub, have already learned. Visit lightstep.com to learn more. My thanks to them for sponsoring this episode of Screaming in the Cloud.
Welcome to Screaming in the Cloud. I'm Corey Quinn.
I'm joined this week by Mike Warren, co-founder and CTO of Descartes Labs.
Welcome to the show, Mike.
Thanks, Corey. Happy to be here.
So you have a fascinating story in that you had a 25-year career at the Los Alamos National Lab
and then joined, not joined, started a company, Descartes Labs, about four years ago now.
That's right.
I kind of considered, you know, it was a 30-year-long education that provided me kind of the right skills to go out and start a company
that used computing and science to help our customers understand the world.
And it certainly seems that you've done
some interesting things with it lately. At the time of this recording, about a month or so ago,
you wound up making, I guess, a bit of a splash in the cloud computing world by building an
effective supercomputer on top of AWS. And it qualified, I think it came in at place 136 on the top 500 list of the most
powerful supercomputers in the world.
And that's impressive in its own right.
But what's fascinating about this is you didn't set out to do this with an enormous project
behind you.
You didn't decide to do this with a grant from someone.
You did it with a corporate credit card.
That's right. And I guess it wasn't such a surprise to us. We knew the capability was coming
and it just happened that the time was right. And I had some time to, you know, compile High-Performance LINPACK and run it. But that's the way the computing industry is going. And anything you can do in a data
center or in a supercomputing center is eventually going to be possible in the cloud.
So one of the parts of the story that resonated the most with me was the fact that you did this using spot instances,
but you didn't tell anyone at AWS
that you were doing this until after it was already done.
And I confirmed that myself by reaching out to a friend
who's reasonably well-placed in the compute org.
I said, wow, congratulations.
You just hosted something that would wind up counting
as one of the top 500 supercomputers in the world.
And his response was, wait, wait, what now? Which was fascinating to see, where it used to be that this was the sort of thing that would require such a tremendous amount of coordinated effort between different people, different stakeholders across a wide organization. And it turns out that "because it's neat" wouldn't have been a sufficient justification to do approximately any of it.
And now, apparently, you can do this for $5,000,
which is less than a typical server tends to cost,
and about what a typical engineer embezzles
in office supplies in a given year.
And the economics of that are staggering.
How long did you plan on doing this
before deciding to just spark it up?
Not long.
And it's really why the cloud is so attractive for businesses like ours.
You don't have to wait for resources.
You don't have to coordinate.
You don't have to spend time on the phone.
You don't have to worry about support contracts and all of this.
Even a top 500 run on a supercomputer requires an enormous amount of coordination.
You got to kick all the other users off.
It's a big deal and it may only be done once on one of these, you know, nation-scale supercomputers.
But on AWS, any given day when there's a thousand nodes available, you can pull those up and run at a petaflop.
Which is astonishing.
I mean, you did some back of the envelope calculations
on what it would cost to do this in hardware.
And the interesting part to me wasn't the dollar figure,
which, what was that again?
We figured, you know, $20 to $30 million in hardware
to get to this near-petaflop level.
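(My own back-of-the-envelope, with an assumed spot price rather than a figure from the episode: roughly 1,200 C5 nodes for about three hours at a dollar or two per node-hour of spot capacity lands in the ballpark of the $5,000 mentioned earlier, against $20 to $30 million of purchased hardware.)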
Yeah, and the amazing part for me was that looking at this, you also mentioned in
that article, and I'll throw a link to that in the show notes, but the fact that it would also take,
by your estimation, six to 12 months just to procure the hardware and get it set up and
everything and get everything ready just to do this. Yeah. I think people don't realize the
sort of unmentioned overheads in all of these HPC sort of applications. You know, you've got to
procure the hardware and make sure you have the budget authority and get the right signatures.
And that's after assuming you have a data center or a building and enough cooling and power and all those sorts of infrastructure.
So one thing that rang a little bit, I guess we'll call it false in the narrative, and not to poke
holes in your legend, has been that, oh, we didn't tell AWS that we were going to be doing any of
this. We just decided to go ahead and run it. But anyone who spent more than 20 minutes swearing at
the AWS console knows that, okay, you start up an account, and the first thing that you need to wind up doing is
request a whole bunch of service limit increases. Well, you're already running one of those
instances in that region. Why do you need a second one? Justify it. And they're pretty good about
granting those service limit increases. But I'm curious to know, did you already have sufficient limits to run this
from other workloads that have been done in that account?
Or did you wind up opening
the most bizarre limit increase service ticket
that they've probably seen in a month?
As I recall, we had about half the resources
in US East where we ran this.
So we had kind of this scale of resources spread across different zones.
But there's also different quotas between spot and not spot.
So there was a specific request to kind of up the quota for
spot in US East to this level.
And there's another interesting API limitation,
which is a single call to allocate nodes for Spot
can only allocate 1,000 at a time.
So it's a little extra workflow.
You've got to break it into two parts and there's
throttling on the API calls. So you can't immediately allocate, say, 1,200 nodes. You've
got to allocate 1,000, wait a minute, and then allocate the remainder.
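To make that workflow concrete, here is a minimal sketch of splitting a large spot allocation into batches of at most 1,000 instances per API call, with a pause between calls to stay under throttling limits. This is not the team's actual tooling; the AMI ID, instance type, and 60-second pause are placeholder assumptions.

```python
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")


def launch_spot_batch(count, ami_id, instance_type="c5.18xlarge"):
    """Launch a single batch of spot instances via run_instances."""
    return ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=count,
        MaxCount=count,
        InstanceMarketOptions={"MarketType": "spot"},
    )


def launch_in_batches(total, ami_id, batch_size=1000, pause_seconds=60):
    """Split a request for `total` spot nodes into <=1,000-instance calls."""
    instance_ids = []
    remaining = total
    while remaining > 0:
        n = min(remaining, batch_size)
        resp = launch_spot_batch(n, ami_id)
        instance_ids.extend(i["InstanceId"] for i in resp["Instances"])
        remaining -= n
        if remaining > 0:
            # "Allocate 1,000, wait a minute, and then allocate the remainder."
            time.sleep(pause_seconds)
    return instance_ids


# Example: ask for 1,200 nodes in two calls (placeholder AMI).
# ids = launch_in_batches(1200, "ami-0123456789abcdef0")
```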
Take us through this from the beginning,
where effectively, what does the architecture of this look like?
I'm not entirely sure what the supercomputer in question was chomping on.
So in my mental model, I'm going to figure that you were just mining Bitcoin,
which it turns out is super financially viable in the cloud.
You just need to do it in someone else's account.
But on a more serious note, you mentioned LINPACK. What does that do?
An important distinction to make is how much communication needs to happen among the processors. Mining Bitcoin, the processor doesn't need to know about anything else going on in the world.
It's completely independent.
So you can start up a thousand different nodes mining Bitcoin and all of the clouds do that
very well.
The top 500 benchmark is based on the inversion of a very large matrix, and it uses a piece of software called LINPACK.
And in the solution of that problem, all of these processors have to talk to each other and exchange data, and they have to do that with very low latency. So, you know,
you start up a thousand nodes, and if any one of those fails in terms of
computation or network communication during that process, the whole thing
falls over. So these tightly coupled types of HPC applications that are
typically written in something called MPI,
which is the message passing interface, are a lot more challenging to do in the cloud
than the task parallel sorts of applications like a typical web server.
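For a sense of what "tightly coupled" means in practice, here is a toy sketch of a collective MPI operation in which every rank must participate, so the loss of any single node stalls or kills the whole job. This is not HPL itself, and the use of mpi4py and NumPy is an assumption for illustration.

```python
# Toy illustration of a tightly coupled MPI computation.
# Run with, e.g.: mpirun -n 4 python allreduce_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank owns one slice of the problem state (a stand-in for a matrix block).
local_block = np.random.rand(1_000_000)

# A global reduction: every rank exchanges data with all the others, so all of
# them must stay alive and reachable for the duration of the computation.
global_sum = comm.allreduce(local_block.sum(), op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks cooperated to compute a global sum: {global_sum:.2f}")
```

By contrast, a Bitcoin-mining or web-serving workload could simply drop the failed rank and carry on, which is why those task-parallel jobs are so much easier to run on interruptible capacity.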
One of the things that fascinates me is that whenever you're using any sufficiently large number of computers,
some of them are intrinsically going to break, fail out of the cluster, etc.
That's the nature of things.
That sort of goes double when you're using something like Google's preemptible instances or AWS Spot Fleets,
where it turns out that subject to the available supply, things are generally not nearly as
available as you would expect if, for example, someone else in that region decides to spin up
a supercomputer to take a spot on the top 500 list. So how does each one of those things checkpoint
what it's working on in some sort of central location so that it can die and be replaced
by something else without losing all the work that node has done? Or doesn't it?
No, it can't. There's no checkpointing these sorts of problems in any easy manner. So
everything's got to work perfectly for the three or six hours that this benchmark is running. So
the reliability is very important. And I've heard anecdotal reports of some of these
very fast top 10 supercomputers needing to try to run the benchmark several times before they
can get it to run reliably. So when you wound up doing this, did you just over provision by a bit
and assume you would lose some number of nodes along
the way? That also doesn't work. Once you've sort of
labeled each of these processors with a number, you can't have one
disappear. It's got part of the state of the
problem in it. So if you start with 1200 nodes, you need them all to work
the whole time.
So that's why these tightly coupled applications are a lot more challenging to scale up.
So when you wound up doing this, you effectively wound up with how many instances that were a part of this?
I mean, I saw the statistics on petaflops, but that's hard for me to put into something that I can think of.
Yeah, it was a bit under 1,200 nodes.
And these were 36 core processors,
or rather 18 core processors with two dies per node. So that gives you 41,000-odd processors.
And these are the hardware cores,
not the abomination of virtual counting.
Gotcha.
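(As a rough sanity check on that figure, with an assumed node count for illustration: roughly 1,150 nodes times 36 physical cores per node is about 41,400 cores, which lines up with the 41,000-odd total.)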
Yeah, it looks like you did this entirely on top of C5s, and the performance of those things is impressive.
And the economics of it are fantastic.
For me, I think the hardest part to wrap my head around
is the fact that you had 1,200 of these things running
without a hiccup for three straight hours on spot,
which historically was always extremely interrupt prone.
And it's surprising to me that none of those
wound up getting reclaimed during that window.
Just at that scale,
I would expect even traditional computers,
one of the 1,200 is going to fall over and crash into the sea because I'm lucky like that.
Well, there's, I think, two things you're talking about there.
And one of those is solved by Amazon's spot blocks.
So, you know, you say you want a certain number of processors for some
number of hours between one and six and then Amazon guarantees it's not going to
optionally take one of those away from you. So those are a bit more expensive
than the normal spot but a lot less expensive than the non-spot instances.
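For reference, here is a minimal sketch of what a Spot Block request looked like through boto3 around that time; the BlockDurationMinutes parameter is what bought the one-to-six-hour guarantee Mike describes. The AMI ID and instance count are placeholders, and AWS has since stopped offering Spot Blocks to new customers, so treat this as historical.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request defined-duration ("block") spot capacity: priced above regular spot,
# below on-demand, and not reclaimed during the reserved window.
response = ec2.request_spot_instances(
    InstanceCount=100,
    BlockDurationMinutes=360,  # reserve the capacity for six hours
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # placeholder AMI
        "InstanceType": "c5.18xlarge",
    },
)

request_ids = [r["SpotInstanceRequestId"] for r in response["SpotInstanceRequests"]]
print(f"Opened {len(request_ids)} spot instance requests")
```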
So what becomes important is just
the inherent failure rate of the hardware in the network.
And in our experience, the cloud resources we've used
have been incredibly reliable to the point where, you know, certainly we've seen in Google
their predictive task migration where they can sort of understand that a node is about to fail
and then migrate that whole kernel and processes to another piece of hardware so that you never know
about it and they can pull that defective hardware out and fix it. The idea of being able to do
something like this almost on a lark on some given afternoon for what generally falls well within
any arbitrary employee's corporate spending approval limit is just
mind-boggling to me. I know I keep belaboring this, but it's one of those things that just tends to
resonate an awful lot. Can you talk to me at all about how this winds up manifesting compared to,
I guess, the stuff you do day to day? I'm going to guess that Descartes Labs does an awful lot
of interesting stuff. I mean, you folks work on satellite and aerial imagery, to my understanding.
But how does that tie back to effectively, step one, take a giant supercomputer.
We'll figure out step two later.
Well, it's really democratizing supercomputing.
And it's an extension of the power that software gives an individual.
We're able to do things now with good software and a smart person that used to take an entire group of people
with lots of infrastructure to do.
So it's always been something I've been very interested in
and goes back to our building Beowulf clusters back in the 90s.
That was really democratizing parallel computing for people who couldn't afford the state-of-the-art supercomputers.
And there were untold times that people collared me and told me stories about their group in college who had a cluster that they built in the closet that allowed them to do the research that they were doing.
So on a day-to-day basis, what does, I guess, your computing environment look like?
I mean, you spun out of a national lab, which for starters, we know is going to be something that is relatively, shall we say, computationally impressive.
Mike Julian, my business partner, used to help run the Titan supercomputer at Oak Ridge National Lab about 10 years ago.
So I've gotten absorbed into it by osmosis almost, and I always view that as, here go smart people.
I just sit here and make sarcastic comments on the side.
But what does that look like on a day-to-day basis of what
Descartes Labs actually does? Well, the last big computation I did before we founded Descartes Labs
was on the Titan machine. We had 80 million CPU hours to calculate the evolution of the universe,
and we did that with a trillion particles.
So a lot of that kind of thinking
has carried over to our environment at Descartes Labs.
And I spend a lot of my time in the command line
writing Python scripts.
But our interaction with customers
is much more focused around APIs
and making it very easy to interact with these petascale data sets.
You tell us where on the earth and what time over the last 20 years
you want to see an image, and we can deliver that to you in less than a second.
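To make the shape of that interaction concrete, here is a purely hypothetical sketch of such a query. This is not the actual Descartes Labs client API; every name in it is invented for illustration.

```python
from datetime import date


def fetch_imagery(aoi_geojson: dict, start: date, end: date, product: str):
    """Hypothetical interface: return imagery over an area of interest and a
    date range from a petascale archive -- 'tell us where on the earth and
    what time, and get an image back'."""
    raise NotImplementedError("illustrative sketch only")


# What a call might look like (all values are made up):
# scene = fetch_imagery(
#     aoi_geojson={"type": "Point", "coordinates": [-106.3, 35.9]},
#     start=date(2015, 6, 1),
#     end=date(2015, 9, 1),
#     product="landsat-8",
# )
```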
And are you building this entirely on top of AWS? Are you effectively open to all comers as far as the large clouds go? Is there a significant on-prem component?
There's no on-prem at all. Descartes Labs IT infrastructure consists of a laptop for everyone, essentially.
The bulk of our platform is implemented in Google Cloud.
We do have data input processes that are running in AWS,
but we've tried to keep most of the platform cloud agnostic so that we would move to another
cloud if that made sense in terms of the economics.
Often a great approach, especially with what you're talking about.
It sounds like the way that you're designing things requires some software that seems highly
portable to
almost wherever the data itself happens to live. You're not necessarily needing to leverage a lot
of the higher level services in order to start effectively chewing on mass quantities of data
with effectively undifferentiated compute services. It feels like the more you look at something like
this, the economic story starts to be a lot more around the
cost of moving data around. Turns out that moving compute to the data is often way cheaper than
moving data to the compute. Right. Our philosophy has been to work at a fairly low level. One of our
big successes has been essentially implementing a virtual file system on top of the cloud
object store so we can take any number of open source packages and they see a POSIX file system
to interact with, and we don't have to spend a lot of time rewriting the I/O routines there.
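Their virtual file system itself isn't shown in the episode, but as a minimal sketch of the general technique, you can present cloud object storage to existing code through a file-like interface. This assumes the fsspec/gcsfs libraries and a placeholder bucket, and is an analogous approach rather than their implementation.

```python
import fsspec

# Open an object in a cloud bucket as if it were a local file; code that
# expects a file handle keeps working without rewriting its I/O layer.
with fsspec.open("gs://example-bucket/scenes/tile_0001.tif", "rb") as f:  # placeholder path
    header = f.read(1024)

# Filesystem-style operations over the same object store.
fs = fsspec.filesystem("gs")
tiles = fs.ls("example-bucket/scenes/")
```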
A lot of those open source packages, are they relatively recent? Are they effectively coming from, I guess, I want to say the heyday of a lot of the HPC work, or at least what felt like the heyday of it back in, I want to say 2005 to 2010.
That may not actually be the heyday, just when I was looking into it, but I'm assuming that my experience mirrors everyone's experience.
No, it spans the whole range. We go all the way from 20-year-old million-line Fortran packages to the latest convolutional neural network, which was written a month ago.
So as you take a look right now at getting this project up and running, what could AWS have done to make it
easier for you, if anything? Well, I think it's, you know, this environment that they offer is
what I worked in 20 years ago. It's Linux plus Intel plus a fast network. And the real key in my mind
is the real expense is the software engineers that you need to write this code and deploy
it. And now Amazon has eliminated all the friction around the hardware capacity to actually execute that and bring it to reality.
So that's kind of the magic. The investments we make in terms of making our programmers more efficient are not overshadowed by the fact that they can't get access to a petaflop machine to test it or help develop it.
It's remarkable. Now you can probably get more capacity in AWS with five minutes' notice than you can on most of these supercomputers, which have to keep their queues very full to keep their utilization high to justify the cost of that dedicated hardware.
When you take a look at doing this again, I mean, effectively, first, is this something
you would ever consider doing again just as a neat proof of concept?
I mean, arguably, it got more attention when I linked to it in Last Week in AWS than most articles I link to. So it clearly
resonated with people. I mean, sure. Maybe we'll help our local university do the next one or,
you know, do it in a couple other clouds at the same time. It's a big community, and the top 500 run doesn't provide any utility in itself; it's kind of just demonstrating what could be possible with a set of hardware. So I'm kind of
more interested in going to the next level of let's run more codes than we already are in this HPC environment.
So I guess the big question here is what inspired you to do this on AWS?
You mentioned that the bulk of what you're building today is on GCP.
What triggered that decision from your perspective?
I think AWS has been the first to have the network infrastructure that makes this possible.
In the same way that you need all of the CPUs to be fully available during this computation, you can't have any network bottlenecks.
So with the new architecture and scalability of AWS's network, not having other users interfering with the available network bandwidth, and having low enough latency on these messages, AWS was just the first to get there.
But Azure has also demonstrated this sort of performance in their hardware that has dedicated low latency networks.
And I would imagine Google is not far behind.
It's interesting, and I think that a lot of people, myself included,
don't tend to equate incredibly reliable networking as a prerequisite for something like this.
But in hindsight, that is effectively what defines supercomputers.
Things like InfiniBand, how quickly can you get data
not just across the bus inside of a given node,
but between nodes in many cases?
It's one of those things that sounds super easy until you look into it and then realize,
huh, the way this is currently architected from a network point of view,
we are never going to be able to move the kind of data that we think we're going to be working on to all the nodes in question.
It's one of those, I think, very poorly understood aspects of systems design,
especially in this modern world where everyone is going to more or less wind up in a place of
it's just an API call away. The number of people who have to think about that is getting smaller,
not bigger. Definitely. And there's, I think, some research that needs to be done in the range of latencies.
You know, InfiniBand's down at the microsecond level.
AWS can now do things at the 15, 20 microsecond level. And that's a lot shorter than the generation of 10 gigabit ethernet, which a lot of applications
showed didn't have low enough latency for their needs.
So a lot of these important sorts
of molecular dynamics, seismic data processing,
all these big HPC applications, we really don't know if they're
limited by the current implementations or if you really need to go down to the microsecond
latency network.
What's next?
I mean, you've been doing this an awfully long time.
You've been focusing on HPC,
working on compute at a scale that, frankly, most of us have a difficult time imagining,
let alone working with. What do you see as the next evolution down this path? I mean,
HPC historically has been more or less the purview of researchers and academics.
Now we're starting to see these types of things move into the commercial space in a way that I don't think we did before.
Well, I think I've seen a factor of a million improvement in performance from when
I started writing our gravitational N-body evolution code in graduate school. And you think about that
factor of a million in anything else in your experience, it's just never
happened. So parallel computing has been a unique experience over the last 20
years. And where it's at now is just being available to tens or hundreds or thousands
times more programmers. So I think finally we'll get to an era where the investment in
hardware has not disadvantaged all of these programmers who could really make some breakthroughs,
but it's hard to do that
when your code goes 10 times faster every five years
without you having to do anything.
There's something to be said
for the idea of a cloud computing provider
where you just throw an application into their environment
and you don't touch it again. And over time, the network gets better, the disks get more reliable,
the instances it runs on get faster. If you try that in your data center, raccoons generally carry
it off by the end of year three. And that says terrible things about my pest control, but also
about, I guess, the way that people tend to, on some level, reduce these hyperscale cloud providers down to,
oh, it's just someone else's computer. It's a different place for me to run my VMs.
And I think that you're demonstrating that it has the potential and the opportunity to be far more than that.
Absolutely. There are now clear economies of scale for HPC,
whereas before these were all very specialized systems
in a not very big market.
So the real democratization puts this power in the hands of anyone who can write the right type of software to take advantage of it.
And it becomes a true commodity that is really just distinguished by its cost.
Wonderful.
If people want to learn more about, I guess, how you've done this and more about what, I guess, you folks are up to over at Descartes Labs, where can they find you?
We're at DescartesLabs.com.
We've got a good series of blog posts around kind of what we're up to and what we're doing. And we've got big plans to grow our data platform
beyond geospatial imagery into a lot of other very big data sets
that are relevant to what's happening in the world.
So I'm going to go out on a limb and assume that you're hiring.
We definitely are.
And it's a great place to work for a scientist. I kind of,
you know, took my experience at a national lab and having worked at universities. And I think
we've put together the best of, you know, the history of research and engineering
and made it into a really great place to work
and think about software
and solve the biggest problems that the world is facing.
Thank you so much for taking the time to speak with me today, Mike.
I appreciate it.
Thanks, Corey. It's been fun.
Mike Warren, co-founder and CTO of Descartes Labs.
I'm Corey Quinn.
This is Screaming in the Cloud.
This has been this week's episode
of Screaming in the Cloud.
You can also find more
Corey at
screaminginthecloud.com
or wherever
Fine Snark is sold.
This has been a HumblePod production.
Stay humble.