a16z Podcast - a16z Podcast: Why the Datacenter Needs an Operating System
Episode Date: December 18, 2014
What does an operating system for today's datacenter look like? Why do we even need one, and how does it function? Mesosphere's Benjamin Hindman, the co-creator of Apache Mesos, joins Steven Sinofsky ...for an all-OS discussion. The views expressed here are those of the individual AH Capital Management, L.L.C. (“a16z”) personnel quoted and are not the views of a16z or its affiliates. Certain information contained in here has been obtained from third-party sources, including from portfolio companies of funds managed by a16z. While taken from sources believed to be reliable, a16z has not independently verified such information and makes no representations about the enduring accuracy of the information or its appropriateness for a given situation. This content is provided for informational purposes only, and should not be relied upon as legal, business, investment, or tax advice. You should consult your own advisers as to those matters. References to any securities or digital assets are for illustrative purposes only, and do not constitute an investment recommendation or offer to provide investment advisory services. Furthermore, this content is not directed at nor intended for use by any investors or prospective investors, and may not under any circumstances be relied upon when making a decision to invest in any fund managed by a16z. (An offering to invest in an a16z fund will be made only by the private placement memorandum, subscription agreement, and other relevant documentation of any such fund and should be read in their entirety.) Any investments or portfolio companies mentioned, referred to, or described are not representative of all investments in vehicles managed by a16z, and there can be no assurance that the investments will be profitable or that other investments made in the future will have similar characteristics or results. A list of investments made by funds managed by Andreessen Horowitz (excluding investments and certain publicly traded cryptocurrencies/ digital assets for which the issuer has not provided permission for a16z to disclose publicly) is available at https://a16z.com/investments/. Charts and graphs provided within are for informational purposes solely and should not be relied upon when making any investment decision. Past performance is not indicative of future results. The content speaks only as of the date indicated. Any projections, estimates, forecasts, targets, prospects, and/or opinions expressed in these materials are subject to change without notice and may differ or be contrary to opinions expressed by others. Please see https://a16z.com/disclosures for additional important information.
Transcript
The content here is for informational purposes only, should not be taken as legal, business, tax, or
investment advice, or be used to evaluate any investment or security, and is not directed at any
investors or potential investors in any A16Z fund. For more details, please see A16Z.com/disclosures.
Good afternoon, everybody. This is Steven Sinofsky here with the A16Z podcast.
Very excited today to have Benjamin Hindman of Mesosphere here today.
And we're going to talk about a new concept that the company is coming out with called the Data Center Operating System, or DCOS.
You know, today, apps span servers.
There are things like Kafka and Spark and MapReduce and Cassandra, and it's super, super complex to roll out these huge systems.
In fact, the real challenge of just allocating resources and figuring things out
reminds me, personally, of the very early days of computing when programmers were responsible
for allocating the resources of a machine. If you wanted a file, you sort of wrote your own file
system. If you wanted a process, you had to figure out which part of the CPU to save and store
and load. And, you know, great programmers back in those days, which really weren't as long
ago as people seemed to think, knew how to squeeze the most out of a computer by being able to
manually allocate resources.
You know, my old boss, Bill Gates, was famous for how many things he could squeeze into
an 8-bit byte of BASIC, you know, over the weekend.
And it was very, very important back then to do that.
And the problem was, if you were really good at it, your code became completely unmanageable
and hard to deal with.
And that turns out to be a little bit of what's going on today in the data center,
except I think it's a little bit of the opposite.
Today, you know, an enterprise with a big data center has taken
the opposite approach, which is: let's just keep buying more and more resources and use them
for special purposes, so I don't have to think hard about packing more bits into a byte,
so to speak. And so there's more servers and more complexity and more VMs. You know,
you're in this world of like it's basically one app per server, one app per VM. And, you know,
the problem is, that's simpler, but not simple, to manage, and it leads to this unbelievable
waste. And waste in a data center is a big mess.
85% of the resources go unused. And I think, to me, that's where the data center operating
system really comes in. And so I think that what we want to do is just sort of talk about
this, what we call it the DCOS, which is weird because data center is one word. So it really
should be DOS, but that would take this podcast to a whole different level. And I don't think we
really want to go there. But, you know, if we think about it, a traditional operating
system is allocating the CPU and the memory and the disk and the network, all for a
single computer. So, Ben, what's the DCOS?
Yeah, yeah, great. So the data center
operating system consists of a bunch of components. And when you really think about an
operating system, it itself consists of a bunch of components. In fact, operating systems
have evolved over the years. We've had, you know, monolithic operating systems, microkernel-based
operating systems.
And what we've really done with Mesosphere's data center operating system is we've created
something that's more like a microkernel-like operating system where at the core of it is
this open-source project that we have called Apache Mesos.
And it's what's being used at a bunch of companies like Twitter and Airbnb and other companies
to actually run their infrastructure.
And then there are a lot of others, what we're calling data center services, which are these software
frameworks which run on top of Mesos and can take advantage of Mesos to actually execute
the computations that you want to do, things like Kafka and HDFS and Hadoop and Cassandra.
And those components really make up the core parts of what makes a data center operating system.
So you can really think about it.
The base level is the kernel, which is Mesos, just like a kernel in an operating system.
And then things like storage, something like HDFS, which leverages the kernel, Mesos, to
actually provide storage.
And then these data center services that I mentioned. And then a really, really key one for us is what we call our distributed init.d, and that's Linux, not Windows.
I'm old enough; I actually would have called it Unix, but you call it Linux.
That's true, yeah.
And our distributed init.d, what we ship, is something called Marathon, but there are alternatives to that. Just like, in fact, in the Linux world today there are alternatives to init.d, your systemd; you've got a bunch of different init.d alternatives.
Kind of like on your operating system today, you have many alternative browsers, Internet Explorer,
Chrome, Firefox.
But we use Marathon, and that's kind of the core of how you end up running a lot of
your tasks, because that's your init system where you describe all your tasks.
And then, of course, to interact with your operating system, you need some kind of
interface.
Well, what about, going back just before we jump into the interface, tell me, like,
you know, when I think about what an operating system needs to do, one of the things
it needs to do is, like, schedule things. So is there a scheduler as well?
Yeah. So the kernel, Mesos, the core primitives that it really provides is task management,
not process management, but task management, resource allocation, resource isolation, the things
you'd expect to get from something that needs to run, multi-tenant, lots of applications
at the same time.
What's the difference, then, between a task and a process?
Great, yeah. So we chose task because we didn't want to overload
the process nomenclature. And a task is just, it's the entity that we use to describe something
that we have launched on some host in the data center. And so it could be a process. It could
be a collection of processes. But it's the thing, it's the unit that we use to actually schedule.
It's the thing that consumes resources, really, at the end of the day.
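To make that concrete, here's a minimal sketch of a framework talking to Mesos, written against the Mesos Python bindings of roughly that era (mesos.interface and mesos.native); the framework name, command, and master address are made up for illustration, and the exact fields should be read as a sketch rather than a drop-in program. On each resource offer it builds one task, attaches the CPU and memory it needs, and asks Mesos to launch it on that host.

```python
from mesos.interface import Scheduler, mesos_pb2
from mesos.native import MesosSchedulerDriver


class HelloScheduler(Scheduler):
    """Launches a single shell command as a Mesos task, then declines further offers."""

    def __init__(self):
        self.launched = False

    def resourceOffers(self, driver, offers):
        for offer in offers:
            if self.launched:
                driver.declineOffer(offer.id)   # only one task needed for this sketch
                continue

            task = mesos_pb2.TaskInfo()
            task.task_id.value = "hello-1"
            task.slave_id.value = offer.slave_id.value   # run on the host that made the offer
            task.name = "hello"
            task.command.value = "echo hello from the datacenter && sleep 30"

            cpus = task.resources.add()                  # the resources this task consumes
            cpus.name = "cpus"
            cpus.type = mesos_pb2.Value.SCALAR
            cpus.scalar.value = 0.1

            mem = task.resources.add()
            mem.name = "mem"
            mem.type = mesos_pb2.Value.SCALAR
            mem.scalar.value = 32

            driver.launchTasks(offer.id, [task])         # the "system call" into the kernel
            self.launched = True

    def statusUpdate(self, driver, update):
        print("task %s is now %s" % (update.task_id.value,
                                     mesos_pb2.TaskState.Name(update.state)))


if __name__ == "__main__":
    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""                 # let Mesos fill in the current user
    framework.name = "hello-framework"  # illustrative name
    driver = MesosSchedulerDriver(HelloScheduler(), framework, "127.0.0.1:5050")
    driver.run()
```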
So I have a conceptual understanding of, like, a level of services, but how does it actually work?
Like, how do I get all of this onto a machine, onto a data center?
What's the mechanism by which everything's connected up?
Yeah, yeah.
So that's the bus, in a sense.
Yeah, yeah, yeah. So the way it works is that, using one of these data center services we talked about that make up the entire operating system, something like Marathon for running your tasks, you would interface through Marathon. You would ask Marathon, you'd say, hey, Marathon, launch this task, just like you would tell init.d on Linux, hey, run this task when you boot up. And then what it does is it uses what we really think of as kind of a system call interface to Mesos to get resources allocated to it and then launch a task. So it says to Mesos, hey, I'd like to run a task, I need these resources. It gets the resources allocated to it and then it launches the task. And then Mesos at that point takes care of making sure that it gets the task to the right machine, the right host; launches the task, monitors it, isolates it; and when it fails, it tells the system that it's failed, so it can either be relaunched, whatever needs to happen.
And so the communication really is between one of these data center services like Marathon
that's running on top of Mesos and Mesos, which is really providing kind of the system
call, the system call API.
And when you think about it, this is one of the interesting things about Mesos itself.
It really is much more like a kernel.
You know, if you download Mesos by itself today, there's not really much you can do with it.
Just like if you're downloading the Linux kernel by itself today,
Right, great, now I've got the kernel, what do I do? You're not going to program code which is going to do an interrupt 0x80 to, you know, do a system call. You're going to use something at a higher level. You're going to use Bash at a higher level to launch tasks, or you're going to use some kind of window manager at a higher level. And that's exactly what something like Marathon is providing on top of Mesos today.
So first, how do all of the... you know, I think of a data center, I think of a rack, and I think of all these boxes. How does Mesos know that those boxes are part of its resource pool?
Like, what connects them all?
Yeah.
So on each individual machine, we run an agent process.
And so that process could be launched either via a system image, using one of our
system images, or, if you were using some more traditional configuration management software,
you could use that to set up all your individual machines, physical or virtual.
And then they all communicate back through the Mesos master, as we call it, the sort of
the brain of Mesos, which is responsible for managing all these machines that have connected through their agents. And then the bus is basically between those machines and the masters themselves.
Cool. So now I'm sitting in front of the machine
or of the cluster or whatever.
And how do I know I'm running it?
You mentioned a command line.
So, in my head, I now have this: the data center is now like one computer.
One big computer.
And so, well, I want to tell it
to do something.
What do I do?
Yeah.
So the interface, really, the first interface that we've provided is a command line interface.
And so we did this for a bunch of reasons.
So not a card reader.
Not a punch card.
Okay.
We made it pluggable.
We can make that interface as well.
But it made a lot of sense for us to actually make this be really the first interface to the DCOS.
And so what you can do is you can actually type from a terminal, you can type DCOS space, and then one of these data center services that I was
mentioning something like Marathon, you could say Marathon, and then you can give it some information
to run a task. You can say like DCOS, Marathon, run, and then the command you want to run
and maybe some extra flag information to describe how it gets its artifacts, you know,
its resources to run. And then you do that and it starts running. And so of course,
what does that mean? It starts running. Well, it could mean that you could go to some web browser
if the tasks that you launch happen to be a web server. But of course, you can also do something
with the CLI, which is DCOS ps, so you can actually see all the processes that are running,
all the tasks you have running.
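Under the hood, those CLI commands map roughly onto the HTTP API of the service scheduler. As a hedged sketch (the Marathon host, port, and app definition below are made up for illustration, and the CLI's exact plumbing is an assumption), launching and listing via Marathon's REST API looks something like this:

```python
import requests

MARATHON = "http://marathon.example.com:8080"   # illustrative endpoint

# "dcos marathon run ..." boils down to handing Marathon an app description.
app = {
    "id": "/hello-web",
    "cmd": "python -m SimpleHTTPServer 8000",   # the command Marathon should keep running
    "cpus": 0.1,
    "mem": 32,
    "instances": 2,
}
requests.post(MARATHON + "/v2/apps", json=app).raise_for_status()

# A "dcos ps"-style listing: ask Marathon which tasks it has running, and where.
tasks = requests.get(MARATHON + "/v2/tasks",
                     headers={"Accept": "application/json"}).json()["tasks"]
for t in tasks:
    print(t["appId"], t["host"], t["id"])
```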
So all the processes or all the tasks?
All the tasks.
Yeah, right.
So you've immediately, like, I'm kind of done with processes.
Yeah.
So now I'm looking at a task that might be spanning, is going to span, all sorts of resources.
Yeah.
Yeah.
So in the CLI today, what we have is just all the tasks. But as we evolve the CLI, we'll be able to drill down, so you can see, for this task, what processes represent that task, and for those processes, what threads represent those processes. So you'll be able to see all the resources that are actually being consumed at a fine-grained level.
Even in the best cases of single machine computing,
at some point for diagnostics or performance or something, you're going to actually have to
know how things are done. Most definitely. So the fact that you're using these abstractions doesn't
prohibit a DevOps person from really knowing what's going on. That's exactly right. Yep. And
that's just the same today where if you just type PS on, say, a Linux box, you do just see the
processes. But if you want, you can really dive in and you can say, show me all the threads for
those processes as well.
So, okay, so you sort of described how I get something going. Like, is it that I install software on it? What do I think of it as? Like, where does the task come from?
Yeah, yeah. So once the Mesosphere DCOS software is really
installed everywhere and you want to run other tasks, we have built a repository, a registry-like system
that allows you to describe a task
and just kind of like homebrew
or like the package managers out there,
you can say, hey, I want to install
one of these frameworks, one of these services.
You can do that.
It'll pull down from a repository
the necessary bits of information.
You can have it either get installed
on the distributed file system.
You might have running something like HDFS or Ceph,
which again is something that's running on top
of the DCOS.
And so you can point to where it is. And then you can say, hey, my init.d, you know, my service scheduler, go ahead and now
run this service, pull it from this location so you have the bits and go from there.
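As a purely illustrative sketch of that flow (the descriptor fields, artifact locations, and hostnames below are hypothetical, not the actual DCOS package format): a package says where its bits live and what the service's scheduler should run, and installing it amounts to handing that description to the distributed init.

```python
import requests

# Hypothetical package descriptor: where the bits live and how to start the service's
# own scheduler. A real registry entry would carry more metadata than this.
package = {
    "name": "cassandra",
    "uris": ["hdfs://namenode.example.com/packages/cassandra-scheduler.tar.gz"],
    "cmd": "./cassandra-scheduler --zk zk://zk.example.com:2181/cassandra",
    "cpus": 1.0,
    "mem": 2048,
}

def install(pkg, marathon="http://marathon.example.com:8080"):
    """Expand the descriptor into a Marathon app and submit it to the distributed init."""
    app = {
        "id": "/" + pkg["name"],
        "cmd": pkg["cmd"],
        "uris": pkg["uris"],        # fetched onto the chosen host before the task starts
        "cpus": pkg["cpus"],
        "mem": pkg["mem"],
        "instances": 1,
    }
    requests.post(marathon + "/v2/apps", json=app).raise_for_status()

install(package)
```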
Just so folks can have a clear view, like, give me some specific examples of services or tasks that come to mind, that you would really think of.
Yeah. So, making it really concrete.
Yeah, yeah, yeah. So at a company like
Twitter, which is a big user of Mesos, they basically decomposed their architecture from this monolithic architecture into a bunch of small services. And each of those individual apps, each of those individual services, which is, say, when a tweet comes in, it's sending out a post to an SMS, or it's, you know, hydrating the tweet for other people's timelines so that other people can see that this tweet has come in, because it should show up. Each of these individual services would be a kind of task and app that you might want to run. And so you would just say to the DCOS, hey, I want to run this application. I don't care, you know, where it runs. Just, here's the information, here's the binary it needs to run. Go, you big computer, run this somewhere on the big computer.
So one of the things that jumps to mind is that, you know, when I think of an OS, I think not just of, like, the resource management, but it also provides conceptual models for really important things. Like, one that jumps to mind is security. And so anytime you start telling me, like, oh, by the way, code's running anywhere, I start to worry: well, if code is anywhere and I don't know where it is, doesn't that make me vulnerable in places that I'm not predicting, right?
So tell me a little bit about how, like, something like isolation or how I think of security in the DCOS model.
Yeah, so, you know, I think this is a really interesting topic, because what tends to happen in a lot of these organizations, when there isn't some centralized way and people are thinking about how they want to do resource management and run their applications, is you get a bunch of disaggregated approaches, you know, everyone's doing it slightly differently.
Yeah.
So oftentimes you have worse security, because, you know, rather than a security team being able to audit just the one way in which everything gets run, they have to audit a whole bunch of different processes, and some people do it a little bit differently. And then the worst part about that is they can't compose, right? And this to me is
one of the fundamental issues I have with a lot of distributed systems is because people are
building distributed systems in such a personalized way, you know, personalized for their
organization or their company, you can't easily build a distributed system in one organization and
move it to another organization. And security is a perfect example of that: you know, one organization uses LDAP. So the first way that they build it in is it hooks into LDAP, and it's so ingrained that they're going to do LDAP. And another organization doesn't use LDAP; they use some other mechanisms of authentication or identity or whatever it is.
You always see this, like, when you have a big giant web presence: you have the company that operates the web server part, and then they went and did analytics in a completely different sort of stack. And they're figuring out how to get access to the logs to do the analysis. And then no one can either do both, deal with both, audit both.
Exactly. Exactly. So, I mean, this is one of the biggest drivers
of why we're trying to build, why we're building a data center operating system is because
at the end of the day, somebody should be able to build an application against the primitives,
like security primitives that could be provided by a data center operating system, and go
and run it in another organization, because it's just an app that you built. And, you know,
it was very interesting at the beginning of the podcast when you were talking about the people
that wrote the, you know, the hardcore applications; that's the case with distributed systems today, too.
You know, we joke, they have to have a PhD to write a distributed system.
Many PhDs came about showing you how to write distributed systems, and then they went and built
them. Yeah, that's right. But we're at the point now where everyone is basically building a
distributed system. They don't all have PhDs, and we want to be able to build those distributed
systems in one organization, run them in another organization, and be able to do that in a really,
really efficient manner. And that, like, security is a perfect example of something that if we
can provide the interfaces for doing security in our distributed systems and people can build
against those interfaces, then we can easily move our applications across organizations.
So building on the applications part, like one of the things that obviously has a huge
amount of attention and excitement right now, whether it's from Docker or CoreOS, is just
the notion of containers. So in listening to you, I'm sort of trying to parse in my head,
like, do I no longer need containers?
Are you going to provide a container that I have to use?
Am I going to be able to use containers that I've already created?
Where do containers fit in on your stack?
Yeah, that's a great question.
So Mesos, which is what we've used to underpin the Mesosphere DCOS, has used containerization technologies for a long time, since 2009.
In fact, in 2009, we even had Solaris Zone support.
So we had containerization technologies from even outside of Linux.
And we've provided that containerization technology
and we'll continue to do so.
So when people have used the existing containerization technology
to build new things like Docker on top,
that's been something that we've been able to integrate
with very, very easily.
So if you're creating Docker images,
this is a fantastic thing.
You can give it directly to us.
We can launch those Docker images
directly using our containerization technology.
And as this stuff evolves, as other companies introduce new image-like formats to describe the bits you need to run your containers, again, this is just going to be something that we can plug into our data center operating system.
You just give us bits, and we'll run those bits.
And if those bits happen to be a Docker image
or a Rocket app container specification,
we'll take those things and we can actually run them.
But the benefit is, of course,
so first you can go create your container
however you want to go create it.
And then the neat thing is you're able to deploy it in a distributed way
where you're scaling in a highly efficient way without really realizing it.
Yeah, and when there are failures, they get rescheduled.
When we want to do even smarter things like over-subscription
because we want to move that 85% unused resources to say 10% unused resources,
we can start to do all that, just like an operating system does for you
under the covers today on, say, your laptop.
And you just give us the binary that we need to run, whether it's a container image or whether it's, you know, some real binary.
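For instance, a hedged sketch of handing an existing Docker image to the DCOS: Marathon app definitions can carry a Docker container spec, and the containerizer runs the image wherever resources are available. The image, host, and ports here are illustrative.

```python
import requests

app = {
    "id": "/web",
    "instances": 3,
    "cpus": 0.25,
    "mem": 128,
    "container": {
        "type": "DOCKER",
        "docker": {
            "image": "nginx:latest",          # any image you've already built and pushed
            "network": "BRIDGE",
            "portMappings": [{"containerPort": 80, "hostPort": 0}],  # 0 = any free host port
        },
    },
}
requests.post("http://marathon.example.com:8080/v2/apps", json=app).raise_for_status()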
Cool.
So I want to take a step back
because to me this is what's so fascinating
is that what you're really doing
is just changing what I view
as the abstractions of an operating system.
And you're basically directly,
or by implication, saying,
wow, you know, the abstractions that people deal with,
like the notion of having a virtual machine
is just completely wrong
and that we really need a new set of
abstractions. And to me, what this feels like is when virtual memory came out: the abstraction
just blew your mind because you went from, like, I literally personally went from like figuring
out where to put stuff in 640K of memory to having two gigabytes of memory. And not only that,
but the address space was linear. So I actually got to just, you know, just not worry about
where it went, whereas I spent the first two years of my career, like swap tuning code so I knew
exactly where in memory it was going to be.
And so it seems crazy to think like that.
Because aren't a bunch of hardcore people just going to say, no, the problem is, if I
have a whole data center, I'm going to be better at organizing what goes where than some
piece of software that doesn't know the loads, the resource needs, and why would the
DCOS know better than me?
I'm a smart PhD distributed computing person.
I'm not, but someone might be.
Yeah, no, no, I think that's exactly right.
I think that what we're doing is we're doing exactly what virtual memory did for existing operating systems,
which is providing the abstractions so that we can really, really effectively do the resource management,
the scheduling, dealing with the failures.
And I think just like what you saw in virtual memory,
there will probably be a lot of people who believe that they can do it better,
but time's going to show that actually we can start to do far more sophisticated things.
And we will be able to do far better scheduling for utilization,
for meeting SLAs, for serving the customers.
Yeah, and I think to me that's just a super important point
for folks to understand, because in these kind of transitions
when you're changing abstraction layers,
there's this sort of management retrenching of like,
wow, security is really important.
We know how to secure this, so we're going to stick with it.
Even though, you know, a few percentage points of utilization won't change it,
even though the system isn't secure,
it's just comfortably insecure.
That's right, yeah.
And, like, it was great to be comfortable even though you were failing.
And so I think that, like, for me, that's the big transition,
that people are going to have to just sort of get over their own perceived expertise
and let computers do stuff that they're good at.
That's right, yeah.
And that's why I think pulling in analogies of the past is so valuable.
Because it helps people start to realize, you know what, maybe this is a good idea.
And not a big giant waste of time.
So who's using it today?
Yeah. So the open source components that make up a large part of the DCOS are used by a large number of companies today.
Some of the biggest users out there are companies like Twitter, Airbnb, HubSpot, eBay, and PayPal, who are using it for running things.
Netflix is using it for running things.
So some of the smaller companies without a lot of machines?
That's right. Yeah. I mean, one of the great things about the way that the software has evolved over the years is we've made it so that it works well at the
small scale, but it also scales. And it works very, very well for the large-scale guys.
And as hardware itself is starting to evolve in our data centers, and maybe the rack is
going to start looking less like the rack or a machine is going to start looking less
like a machine, you really need these levels of abstraction for both the small guys and for the
big guys. Yeah, it certainly seems to me that one of the things that an operating system brings
is it allows hardware to proceed at a different pace of innovation. And so when I look
at DCOS, I think, wow, this is really going to free a set of people to go, well, let's just go replace our servers with ARM servers. Let's go
replace our networking infrastructure in a certain way because they'll be able to map those
abstractions up. Rather than today, I mean, once you say it's a VM running this instruction
set that assumes this level of, you're stuck.
That's right. Yeah.
So a lot of people are looking at the stack of cloud today. You know, we haven't even used that word a lot here, because we're really focused on the distributed operating system. But, you know, they think platform as a service or infrastructure as a service. And so to me, like, let's assume that this isn't platform as a service. Let's take that, let's assume we understand the possibility. But, you know, IaaS is definitely at this VM, server level. So why is this not an IaaS kind of thing?
Yeah, yeah. So one of the biggest differentiators between what we've done versus what they've done in the infrastructure as a service space is we've really tried to provide these abstractions and these primitives that enable you to build new distributed systems on top.
And again, that's really what an operating system should be providing.
What infrastructure as a service provides to you is another machine.
It turns a physical machine into a virtual machine
or maybe a virtual machine into a virtual machine
when you're running, say, OpenStack on EC2.
And that does not help the developer build another system.
It's the same primitive.
It's just kind of wrapped up.
And so really what you get from something like a data center operating system
are the abstractions and primitives that make it easier to build new distributed systems,
and that's what makes it easier to then move those distributed systems from one organization to another organization,
because that's the abstraction that everybody has and they can use those.
Yeah, you know, I think that this is super interesting because I think from an IT leadership
and the enterprise perspective, you know, right now we're on the verge where everybody wants to move
to cloud, they don't know what that means, and so they're very quickly virtualizing the servers
that they have laying around.
And I'm a big believer that that's just not a useful, a good use of time.
I think it might be cost effective in some marginal way, but the cost of moving and the bugs you introduce and stuff.
And so I think what would you say to sort of your typical enterprise CIO?
Well, there's not really a typical one, but for an enterprise CIO overseeing a move, what is it that they should understand about moving to a Mesos kind of environment rather than taking this intermediate step of doing a bunch more VM stuff, or better managing your VMs?
Right, right, right.
Yeah, I mean, I think one thing that's really, really clear is that one of the nice things about a data center operating system is that it doesn't really compete with infrastructure as a service at the end of the day, because it's still about just taking all your resources, whether those resources come from virtual or physical machines, and using those resources effectively.
So for folks that do already have infrastructure-as-a-service-like deployments, there's still a ton of value in using Mesos and the data center operating system, because you still want to best take advantage of the resources that you already have, again, even if it's just a bunch of virtual machines.
And the same thing applies to why something like Mesos and the DCOS is still so valuable in EC2-like environments on AWS: because, again, you still want to best take advantage of all the resources that you have.
But for people that are starting from scratch, I think you can really now start to take a very close look at whether or not you need to go through that first level of virtualization or not.
And we've had a lot of reports of people that can go directly to using something like Mesos and the data center operating system.
And then you don't have to start paying that 30% virtualization overhead for running your applications,
which can start to save a lot of money.
Because that's how I sort of think of it: there are both the cost savings.
And then like if you're going to go a greenfield, like if you're going to build a new expense app,
rather than just virtualize the old expense app, you probably want to build it because you know it's never going to use a whole rack.
Yeah.
Like so why would, but you're going to probably, if you were to go build it, you would dedicate the rack.
Yeah.
And then you get all the overhead of a bunch of VMs.
And so it seems like you should just go straight to building it as a distributed app.
And then your thousand apps over the next 10 years that get rewritten are all just going to squeeze in and use the right amount of resources.
Yep, that's exactly right.
But then I want to go back one quick sec to the platform as a service.
Because to me, like, platform as a service and infrastructure as a service are sort of almost inherently connected in an inefficient way.
Yeah, yeah.
So what would you say to: well, oh, no, we're okay, because we're just going to use, you know, a cloud vendor's platform. But that doesn't solve the distributed thing, does it?
Yeah, no, no.
I mean, what ends up happening at the end of the day with platform as a service is, again,
so it's a high-level abstraction on top of infrastructure as a service. What platform as a service
really solves is the fact that, oh, great, from infrastructure as a service I've got a bunch of machines.
Now what do I do?
So platform as a service said, okay, well, we'll abstract away the machines and we'll let you
just run your tasks, your processes, your apps, whatever it is, but then you just run the
processes.
And what you really want is you want to be able to launch those processes, those applications,
and then you want those applications to be able to continue to execute by using the underlying
infrastructure, by calling back into something like the data center operating system and
say, hey, now I need more resources; or for the data center operating system to be able to call into the apps and say, hey, this machine is going down for a reboot because it's doing maintenance, you should know about this. Just like in a normal operating system where, you know, you do memory paging.
And that's the big distinguisher again between something like platform as a service and the
data center operating system: platform as a service is about, okay, here's an app, I run it, I go,
and the data center operating system is about, okay, here's an app, I run it,
and then while that app is running, it uses the data center operating system to continue to run.
It calls back in, it uses the system call API, and as that API gets bigger and bigger and bigger,
it makes a really, really rich environment for programmers to be able to build really
sophisticated distributed applications.
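A hedged sketch of that "calling back in" idea: a running service notices it's falling behind and asks the platform for more instances of itself, rather than a human resizing machines. The endpoint, app id, and backlog metric below are made up for illustration, and a real system call API would be much richer than this single scale-up request.

```python
import requests

MARATHON = "http://marathon.example.com:8080"   # illustrative endpoint

def queue_depth():
    # Stand-in for whatever backlog metric the application tracks about itself.
    return 12000

def scale_up(app_id, extra):
    """Ask the platform for `extra` more instances of this app."""
    current = requests.get(MARATHON + "/v2/apps" + app_id).json()["app"]["instances"]
    requests.put(MARATHON + "/v2/apps" + app_id,
                 json={"instances": current + extra}).raise_for_status()

# Called from inside the service itself while it is running.
if queue_depth() > 10000:
    scale_up("/worker", extra=2)
```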
One last question is, I mean, you described a lot of stuff, so I'll make it two parts.
A, where can I get the stuff today, and what can I do with it?
And then B, like, what comes next?
Yeah, you can go to mesos.apache.org.
And that's where you can learn a lot about the kernel itself, the Mesos kernel.
And the new stuff.
That was the second question.
Yeah, the side part was like, well, tell everybody,
now that they've absorbed all this, what's coming next?
Yeah, yeah.
Yeah, so the new stuff's the most fun stuff to me.
It's really where we start to take the beginning of what it means to be, you know, a data center operating system and take it to the next
level. And it means we start to take the things that historically have been really, really
tough to run, regardless of whether or not you've used higher levels of abstraction like PaaS or infrastructure as a service: things like stateful services. And we get to start running
those things in a really, really, really effective way in the data center that historically
have required a lot of humans to actually deal with that kind of stuff. So there are two examples I want
to give here, two primitives that are being built that I think are really, really cool.
One primitive we're building in is this notion of maintenance.
So because we have this software layer, the kernel, actually running in our data center
operating system, when the applications are running on top, we can have it start to actually
deal with maintenance of things that are happening in your data center.
So, for example, when a machine or rack needs to go offline, we can have the software talk to the other software and say, hey, you know what? This machine is going down for repair. We need to reschedule you, or you should get rescheduled, you need to move data.
Just treat it like it was a failure, but a planned failure.
That's right, that's right. It's a failure, but a planned-for one. That's exactly right.
And this is huge, because usually the way this works in most organizations is a human walks up to another human and says, hey, I'm going to be taking this rack down, what can we actually do about this? But we can turn this into software. And the analogy that I like to give from just traditional operating systems is: operating systems today would do things like page out memory. But what they do is they just say, hey, you know, we're going to use the LRU
algorithm. We're going to page out the least recently used. And that doesn't always work great.
And wouldn't it be better if actually the operating system could work with the applications running on top to do smarter things when it comes to failures or needing more resources or whatever
it is? And that I think is like that realm of things is to me one of the most exciting things
that we're going to be working on because we get to reimagine a lot of the basic primitives that
existed for single machines and rebuild them in a way that makes sense in a distributed
environment, and makes sense for people that want to do things in a smarter way than sort of what we've been working with for a long time.
At a scale that people can only imagine.
At a scale where it's already hard enough to do it manually, and so we have to do it in software-based ways. And so we can do that.
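Since these primitives were still being built at the time, here's a purely hypothetical sketch of the shape of that "planned failure" protocol, not a real Mesos API: the platform tells a scheduler that a host is headed into maintenance, and the scheduler drains its tasks ahead of the deadline instead of discovering a surprise failure.

```python
def on_maintenance_notice(scheduler, host, deadline):
    """Hypothetical callback: `host` will go down for maintenance before `deadline`."""
    for task in scheduler.tasks_on(host):
        scheduler.relaunch_elsewhere(task)   # move state and restart ahead of time
        scheduler.kill(task)                 # then vacate the machine gracefully
```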
Awesome.
Well, thanks so much.
This has been Benjamin Hindman from Mesosphere, and I'm Steven Sinofsky signing off this
episode of the A16Z podcast. Thanks, everybody. Great. Thank you.