a16z Podcast - a16z Podcast: Why the Datacenter Needs an Operating System
Episode Date: December 18, 2014
What does an operating system for today's datacenter look like? Why do we even need one, and how does it function? Mesosphere's Benjamin Hindman, the co-creator of Apache Mesos, joins Steven Sinofsky ...for an all-OS discussion. The views expressed here are those of the individual AH Capital Management, L.L.C. (“a16z”) personnel quoted and are not the views of a16z or its affiliates. Certain information contained in here has been obtained from third-party sources, including from portfolio companies of funds managed by a16z. While taken from sources believed to be reliable, a16z has not independently verified such information and makes no representations about the enduring accuracy of the information or its appropriateness for a given situation. This content is provided for informational purposes only, and should not be relied upon as legal, business, investment, or tax advice. You should consult your own advisers as to those matters. References to any securities or digital assets are for illustrative purposes only, and do not constitute an investment recommendation or offer to provide investment advisory services. Furthermore, this content is not directed at nor intended for use by any investors or prospective investors, and may not under any circumstances be relied upon when making a decision to invest in any fund managed by a16z. (An offering to invest in an a16z fund will be made only by the private placement memorandum, subscription agreement, and other relevant documentation of any such fund and should be read in their entirety.) Any investments or portfolio companies mentioned, referred to, or described are not representative of all investments in vehicles managed by a16z, and there can be no assurance that the investments will be profitable or that other investments made in the future will have similar characteristics or results. A list of investments made by funds managed by Andreessen Horowitz (excluding investments and certain publicly traded cryptocurrencies/ digital assets for which the issuer has not provided permission for a16z to disclose publicly) is available at https://a16z.com/investments/. Charts and graphs provided within are for informational purposes solely and should not be relied upon when making any investment decision. Past performance is not indicative of future results. The content speaks only as of the date indicated. Any projections, estimates, forecasts, targets, prospects, and/or opinions expressed in these materials are subject to change without notice and may differ or be contrary to opinions expressed by others. Please see https://a16z.com/disclosures for additional important information.
Transcript
The content here is for informational purposes only, should not be taken as legal, business, tax, or
investment advice, or be used to evaluate any investment or security, and is not directed at any
investors or potential investors in any A16Z fund. For more details, please see A16Z.com/disclosures.
Good afternoon, everybody. This is Steven Sinofsky here with the A16Z podcast.
Very excited today to have Benjamin Hindman of Mesosphere here today.
And we're going to talk about a new concept that the company is coming out with called the Data Center Operating System, or DCOS.
You know, today, apps span servers.
There are things like Kafka and Spark and MapReduce and Cassandra, and it's super, super complex to roll out these huge systems.
In fact, the real challenge of just allocating resources and figuring things out
reminds me, personally, of the very early days of computing when programmers were responsible
for allocating the resources of a machine. If you wanted a file, you sort of wrote your own file
system. If you wanted a process, you had to figure out which part of the CPU to save and store
and load. And, you know, great programmers back in those days, which really weren't as long
ago as people seemed to think, knew how to squeeze the most out of a computer by being able to
manually allocate resources.
You know, my old boss, Bill Gates, was famous for how many things he could squeeze into
an 8-bit byte of BASIC, you know, over the weekend.
And it was very, very important back then to do that.
And the problem was, if you were really good at it, your code became completely unmanageable
and hard to deal with.
And that turns out to be a little bit of what's going on today in the data center,
except I think it's a little bit of the opposite.
Today, you know, an enterprise with a big data center has taken
the opposite approach, which is: let's just keep buying more and more resources and use them
for special purposes, so I don't have to think hard about packing more bits into a byte,
so to speak. And so there's more servers and more complexity and more VMs. You know,
you're in this world of like it's basically one app per server, one app per VM. And, you know,
the problem is, that's simpler, but not simple, to manage, and it leads to this unbelievable
waste. And waste in a data center is a big mess.
85% of the resources go unused. And I think, to me, that's where the data center operating
system really comes in. And so I think that what we want to do is just sort of talk about
this, what we call it the DCOS, which is weird because data center is one word. So it really
should be DOS, but that would take this podcast to a whole different level. And I don't think we
really want to go there. But, you know, if we think about it, a traditional operating
system is allocating the CPU and the memory and the disk and the network, all for a
single computer. So, Ben, what's the DCOS?
Yeah, yeah, great. So the data center
operating system consists of a bunch of components. And when you really think about an
operating system, it itself consists of a bunch of components. In fact, operating systems
have evolved over the years. We've had, you know, monolithic operating systems, microkernel-based
operating systems.
And what we've really done with Mesosphere's data center operating system is we've created
something that's more like a microkernel-like operating system where at the core of it is
this open-source project that we have called Apache Mesos.
And it's what's being used at a bunch of companies like Twitter and Airbnb and other companies
to actually run their infrastructure.
And then there are a lot of others, what we're calling data center services, which are these software
frameworks which run on top of Mesos and can take advantage of Mesos to actually execute
the computations that you want to do, things like Kafka and HDFS and Hadoop and Cassandra.
And those components really make up the core parts of what makes a data center operating system.
So you can really think about it.
The base level is the kernel, which is Mesos, just like a kernel in an operating system.
And then things like storage, something like HDFS, which leverages the kernel, Mesos, to
actually provide storage.
And then these data center services that I mentioned. And then a really, really key one for us is what we call our distributed init.d, and that's Linux, not Windows.
I'm old enough; I actually would have called it Unix, but you call it Linux.
That's true, yeah.
And our distributed init.d, what we ship, is something called Marathon, but there are alternatives to that. Just like, in fact, in the Linux world today there are alternatives to init.d, your systemd; you've got a bunch of different init.d alternatives.
Kind of like on your operating system today, you have many alternative browsers, Internet Explorer,
Chrome, Firefox.
But we use Marathon, and that's kind of the core of how you end up running a lot of
your tasks, because that's your init system where you describe all your tasks.
And then, of course, to interact with your operating system, you need some kind of
interface.
Well, what about, going back just before we jump into the interface, tell me, like,
you know, when I think about what an operating system needs to do, one of the things
it needs to do is, like, schedule things. So is there a scheduler as well?
Yeah. So the kernel, Mesos, the core primitives that it really provides is task management,
not process management, but task management, resource allocation, resource isolation, the things
you'd expect to get from something that needs to run, multi-tenant, lots of applications
at the same time.
What's the difference, then, between a task and a process?
Great, yeah. So we chose task because we didn't want to overload
the process nomenclature. And a task is just, it's the entity that we use to describe something
that we have launched on some host in the data center. And so it could be a process. It could
be a collection of processes. But it's the thing, it's the unit that we use to actually schedule.
It's the thing that consumes resources, really, at the end of the day.
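To make that concrete, here's a minimal sketch of a framework talking to Mesos, written against the Mesos Python bindings of roughly that era (mesos.interface and mesos.native); the framework name, command, and master address are made up for illustration, and the exact fields should be read as a sketch rather than a drop-in program. On each resource offer it builds one task, attaches the CPU and memory it needs, and asks Mesos to launch it on that host.

```python
from mesos.interface import Scheduler, mesos_pb2
from mesos.native import MesosSchedulerDriver


class HelloScheduler(Scheduler):
    """Launches a single shell command as a Mesos task, then declines further offers."""

    def __init__(self):
        self.launched = False

    def resourceOffers(self, driver, offers):
        for offer in offers:
            if self.launched:
                driver.declineOffer(offer.id)   # only one task needed for this sketch
                continue

            task = mesos_pb2.TaskInfo()
            task.task_id.value = "hello-1"
            task.slave_id.value = offer.slave_id.value   # run on the host that made the offer
            task.name = "hello"
            task.command.value = "echo hello from the datacenter && sleep 30"

            cpus = task.resources.add()                  # the resources this task consumes
            cpus.name = "cpus"
            cpus.type = mesos_pb2.Value.SCALAR
            cpus.scalar.value = 0.1

            mem = task.resources.add()
            mem.name = "mem"
            mem.type = mesos_pb2.Value.SCALAR
            mem.scalar.value = 32

            driver.launchTasks(offer.id, [task])         # the "system call" into the kernel
            self.launched = True

    def statusUpdate(self, driver, update):
        print("task %s is now %s" % (update.task_id.value,
                                     mesos_pb2.TaskState.Name(update.state)))


if __name__ == "__main__":
    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""                 # let Mesos fill in the current user
    framework.name = "hello-framework"  # illustrative name
    driver = MesosSchedulerDriver(HelloScheduler(), framework, "127.0.0.1:5050")
    driver.run()
```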
So I have a conceptual understanding of, like, a level of services, but how does it actually work?
Like, how do I get all of this onto a machine, onto a data center?
What's the mechanism by which everything's connected up?
Yeah, yeah.
So that's the bus, in a sense.
Yeah, yeah, yeah. So the way it works is that, using one of these data center services we talked about that make up the entire operating system, something like Marathon for running your tasks, you would interface through Marathon. You would ask Marathon, you'd say, hey, Marathon, launch this task, just like you would tell init.d on Linux, hey, run this task when you boot up. And then what it does is it uses what we really think of as kind of a system call interface to Mesos to get resources allocated to it and then launch a task. So it says to Mesos, hey, I'd like to run a task, I need these resources. It gets the resources allocated to it and then it launches the task. And then Mesos at that point takes care of making sure that it gets the task to the right machine, the right host; launches the task, monitors it, isolates it; and when it fails, it tells the system that it's failed, so it can either be relaunched, whatever needs to happen.
And so the communication really is between one of these data center services like Marathon
that's running on top of Mesos and Mesos, which is really providing kind of the system
call, the system call API.
And when you think about it, this is one of the interesting things about Mesos itself.
It really is much more like a kernel.
You know, if you download Mesos by itself today, there's not really much you can do with it.
Just like if you're downloading the Linux kernel by itself today,
Right, great, now I've got the kernel, what do I do? You're not going to program code which is going to do an interrupt 0x80 to, you know, do a system call. You're going to use something at a higher level. You're going to use Bash at a higher level to launch tasks, or you're going to use some kind of window manager at a higher level. And that's exactly what something like Marathon is providing on top of Mesos today.
So first, how do all of the... you know, I think of a data center, I think of a rack, and I think of all these boxes. How does Mesos know that those boxes are part of its resource pool?
Like, what connects them all?
Yeah.
So on each individual machine, we run an agent process.
And so that process could be launched either via a system image, using one of our
system images, or, if you were using some more traditional configuration management software,
you could use that to set up all your individual machines, physical or virtual.
And then they all communicate back through the Mesos master, as we call it, the sort of
the brain of Mesos, which is responsible for managing all these machines that have connected through their agents. And then the bus is basically between those machines and the masters themselves.
Cool. So now I'm sitting in front of the machine
or of the cluster or whatever.
And how do I know I'm running it?
You mentioned a command line.
So, in my head, I now have this: the data center is now like one computer.
One big computer.
And so, well, I want to tell it
to do something.
What do I do?
Yeah.
So the interface, really, the first interface that we've provided is a command line interface.
And so we did this for a bunch of reasons.
So not a card reader.
Not a punch card.
Okay.
We made it pluggable.
We can make that interface as well.
But it made a lot of sense for us to actually make this be really the first interface to the DCOS.
And so what you can do is you can actually type from a terminal, you can type DCOS space, and then one of these data center services that I was
mentioning something like Marathon, you could say Marathon, and then you can give it some information
to run a task. You can say like DCOS, Marathon, run, and then the command you want to run
and maybe some extra flag information to describe how it gets its artifacts, you know,
its resources to run. And then you do that and it starts running. And so of course,
what does that mean? It starts running. Well, it could mean that you could go to some web browser
if the tasks that you launch happen to be a web server. But of course, you can also do something
with the CLI, which is DCOS ps, so you can actually see all the processes that are running,
all the tasks you have running.
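Under the hood, those CLI commands map roughly onto the HTTP API of the service scheduler. As a hedged sketch (the Marathon host, port, and app definition below are made up for illustration, and the CLI's exact plumbing is an assumption), launching and listing via Marathon's REST API looks something like this:

```python
import requests

MARATHON = "http://marathon.example.com:8080"   # illustrative endpoint

# "dcos marathon run ..." boils down to handing Marathon an app description.
app = {
    "id": "/hello-web",
    "cmd": "python -m SimpleHTTPServer 8000",   # the command Marathon should keep running
    "cpus": 0.1,
    "mem": 32,
    "instances": 2,
}
requests.post(MARATHON + "/v2/apps", json=app).raise_for_status()

# A "dcos ps"-style listing: ask Marathon which tasks it has running, and where.
tasks = requests.get(MARATHON + "/v2/tasks",
                     headers={"Accept": "application/json"}).json()["tasks"]
for t in tasks:
    print(t["appId"], t["host"], t["id"])
```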
So all the processes or all the tasks?
All the tasks.
Yeah, right.
So you've immediately, like, I'm kind of done with processes.
Yeah.
So now I'm looking at a task that might be spanning, is going to span, all sorts of resources.
Yeah.
Yeah.
So in the CLI today, what we have is just all the tasks. But as we evolve the CLI, we'll be able to drill down, so you can see, for this task, what processes represent that task, and for those processes, what threads represent those processes. So you'll be able to see all the resources that are actually being consumed at a fine-grained level.
Even in the best cases of single machine computing,
at some point for diagnostics or performance or something, you're going to actually have to
know how things are done. Most definitely. So the fact that you're using these abstractions doesn't
prohibit a DevOps person from really knowing what's going on. That's exactly right. Yep. And
that's just the same today where if you just type PS on, say, a Linux box, you do just see the
processes. But if you want, you can really dive in and you can say, show me all the threads for
those processes as well.
So, okay, so you sort of described how I get something going. Like, is it that I install software on it? What do I think of it as? Like, where does the task come from?
Yeah, yeah. So once the Mesosphere DCOS software is really
installed everywhere and you want to run other tasks, we have built a repository, a registry-like system
that allows you to describe a task
and just kind of like homebrew
or like the package managers out there,
you can say, hey, I want to install
one of these frameworks, one of these services.
You can do that.
It'll pull down from a repository
the necessary bits of information.
You can have it either get installed
on the distributed file system.
You might have running something like HDFS or Ceph,
which again is something that's running on top
of the DCOS.
And so you can point to where it is. And then you can say, hey, my init.d, you know, my service scheduler, go ahead and now
run this service, pull it from this location so you have the bits and go from there.
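As a purely illustrative sketch of that flow (the descriptor fields, artifact locations, and hostnames below are hypothetical, not the actual DCOS package format): a package says where its bits live and what the service's scheduler should run, and installing it amounts to handing that description to the distributed init.

```python
import requests

# Hypothetical package descriptor: where the bits live and how to start the service's
# own scheduler. A real registry entry would carry more metadata than this.
package = {
    "name": "cassandra",
    "uris": ["hdfs://namenode.example.com/packages/cassandra-scheduler.tar.gz"],
    "cmd": "./cassandra-scheduler --zk zk://zk.example.com:2181/cassandra",
    "cpus": 1.0,
    "mem": 2048,
}

def install(pkg, marathon="http://marathon.example.com:8080"):
    """Expand the descriptor into a Marathon app and submit it to the distributed init."""
    app = {
        "id": "/" + pkg["name"],
        "cmd": pkg["cmd"],
        "uris": pkg["uris"],        # fetched onto the chosen host before the task starts
        "cpus": pkg["cpus"],
        "mem": pkg["mem"],
        "instances": 1,
    }
    requests.post(marathon + "/v2/apps", json=app).raise_for_status()

install(package)
```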
Just so folks can have a clear view, like, give me some specific examples of services or tasks that come to mind, that you would really think of.
Yeah. So, making it really concrete.
Yeah, yeah, yeah. So at a company like
Twitter, which is a big user of Mesos, they basically decomposed their architecture from this monolithic architecture into a bunch of small services. And each of those individual apps, each of those individual services, which is, say, when a tweet comes in, it's sending out a post to an SMS, or it's, you know, hydrating the tweet for other people's timelines so that other people can see that this tweet has come in, because it should show up. Each of these individual services would be a kind of task and app that you might want to run. And so you would just say to the DCOS, hey, I want to run this application. I don't care, you know, where it runs. Just, here's the information, here's the binary it needs to run. Go, you big computer, run this somewhere on the big computer.
So one of the things that jumps to mind is that, you know, when I think of an OS, I think not just of, like, the resource management, but it also provides conceptual models for really important things. Like, one that jumps to mind is security. And so anytime you start telling me, like, oh, by the way, code's running anywhere, I start to worry: well, if code is anywhere and I don't know where it is, doesn't that make me vulnerable in places that I'm not predicting, right?
So tell me a little bit about how, like, something like isolation or how I think of security in the DCOS model.
Yeah, so, you know, I think this is a really interesting topic, because what tends to happen in a lot of these organizations, when there isn't some centralized way and people are thinking about how they want to do resource management and run their applications, is you get a bunch of disaggregated approaches, you know, everyone's doing it slightly differently.
Yeah.
So oftentimes you have worse security, because, you know, rather than a security team being able to audit just the one way in which everything gets run, they have to audit a whole bunch of different processes, and some people do it a little bit differently. And then the worst part about that is they can't compose, right? And this to me is
one of the fundamental issues I have with a lot of distributed systems is because people are
building distributed systems in such a personalized way, you know, personalized for their
organization or their company, you can't easily build a distributed system in one organization and
move it to another organization. And security is a perfect example of that: you know, one organization uses LDAP. So the first way that they build it in is it hooks into LDAP, and it's so ingrained that they're going to do LDAP. And another organization doesn't use LDAP; they use some other mechanisms of authentication or identity or whatever it is.
You always see this, like, when you have a big giant web presence: you have the company that operates the web server part, and then they went and did analytics in a completely different sort of stack. And they're figuring out how to get access to the logs to do the analysis. And then no one can either do both, deal with both, audit both.
Exactly. Exactly. So, I mean, this is one of the biggest drivers
of why we're trying to build, why we're building a data center operating system is because
at the end of the day, somebody should be able to build an application against the primitives,
like security primitives that could be provided by a data center operating system, and go
and run it in another organization, because it's just an app that you built. And, you know,
it was very interesting at the beginning of the podcast when you were talking about the people
that wrote the, you know, the hardcore applications; that's the case with distributed systems today, too.
You know, we joke, they have to have a PhD to write a distributed system.
Many PhDs came about showing you how to write distributed systems, and then they went and built
them. Yeah, that's right. But we're at the point now where everyone is basically building a
distributed system. They don't all have PhDs, and we want to be able to build those distributed
systems in one organization, run them in another organization, and be able to do that in a really,
really efficient manner. And that, like, security is a perfect example of something that if we
can provide the interfaces for doing security in our distributed systems and people can build
against those interfaces, then we can easily move our applications across organizations.
So building on the applications part, like one of the things that obviously has a huge
amount of attention and excitement right now, whether it's from Docker or CoreOS, is just
the notion of containers. So in listening to you, I'm sort of trying to parse in my head,
like, do I no longer need containers?
Are you going to provide a container that I have to use?
Am I going to be able to use containers that I've already created?
Where do containers fit in on your stack?
Yeah, that's a great question.
So Mesos, which is what we've used to underpin the Mesosphere DCOS, has used containerization technologies for a long time, since 2009.
In fact, in 2009, we even had Solaris Zone support.
So we had containerization technologies from even outside of Linux.
And we've provided that containerization technology
and we'll continue to do so.
So when people have used the existing containerization technology
to build new things like Docker on top,
that's been something that we've been able to integrate
with very, very easily.
So if you're creating Docker images,
this is a fantastic thing.
You can give it directly to us.
We can launch those Docker images
directly using our containerization technology.
And as this stuff evolves, as other companies introduce new image-like formats to describe the bits you need to run your containers, again, this is just going to be something that we can plug into our data center operating system.
You just give us bits, and we'll run those bits.
And if those bits happen to be a Docker image
or a Rocket app container specification,
we'll take those things and we can actually run them.
But the benefit is, of course,
so first you can go create your container
however you want to go create it.
And then the neat thing is you're able to deploy it in a distributed way
where you're scaling in a highly efficient way without really realizing it.
Yeah, and when there are failures, they get rescheduled.
When we want to do even smarter things like over-subscription
because we want to move that 85% unused resources to say 10% unused resources,
we can start to do all that, just like an operating system does for you
under the covers today on, say, your laptop.
And you just give us the binary that we need to run, whether it's a container image or whether it's, you know, some real binary.
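For instance, a hedged sketch of handing an existing Docker image to the DCOS: Marathon app definitions can carry a Docker container spec, and the containerizer runs the image wherever resources are available. The image, host, and ports here are illustrative.

```python
import requests

app = {
    "id": "/web",
    "instances": 3,
    "cpus": 0.25,
    "mem": 128,
    "container": {
        "type": "DOCKER",
        "docker": {
            "image": "nginx:latest",          # any image you've already built and pushed
            "network": "BRIDGE",
            "portMappings": [{"containerPort": 80, "hostPort": 0}],  # 0 = any free host port
        },
    },
}
requests.post("http://marathon.example.com:8080/v2/apps", json=app).raise_for_status()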
Cool.
So I want to take a step back
because to me this is what's so fascinating
is that what you're really doing
is just changing what I view
as the abstractions of an operating system.
And you're basically directly,
or by implication, saying,
wow, you know, the abstractions that people deal with,
like the notion of having a virtual machine
is just completely wrong
and that we really need a new set of
abstractions. And to me, what this feels like is when virtual memory came out: the abstraction
just blew your mind because you went from, like, I literally personally went from like figuring
out where to put stuff in 640K of memory to having two gigabytes of memory. And not only that,
but the address space was linear. So I actually got to just, you know, just not worry about
where it went, whereas I spent the first two years of my career, like swap tuning code so I knew
exactly where in memory it was going to be.
And so it seems crazy to think like that.
Because aren't a bunch of hardcore people just going to say, no, the problem is, if I
have a whole data center, I'm going to be better at organizing what goes where than some
piece of software that doesn't know the loads, the resource needs, and why would the
DCOS know better than me?
I'm a smart PhD distributed computing person.
I'm not, but someone might be.
Yeah, no, no, I think that's exactly right.
I think that what we're doing is we're doing exactly what virtual memory did for existing operating systems,
which is providing the abstractions so that we can really, really effectively do the resource management,
the scheduling, dealing with the failures.
And I think just like what you saw in virtual memory,
there will probably be a lot of people who believe that they can do it better,
but time's going to show that actually we can start to do far more sophisticated things.
And we will be able to do far better scheduling for utilization,
for meeting SLAs, for serving the customers.
Yeah, and I think to me that's just a super important point
for folks to understand, because in these kind of transitions
when you're changing abstraction layers,
there's this sort of management retrenching of like,
wow, security is really important.
We know how to secure this, so we're going to stick with it.
Even though, you know, a few percentage points of utilization won't change it,
even though the system isn't secure,
it's just comfortably insecure.
That's right, yeah.
And, like, it was great to be comfortable even though you were failing.
And so I think that, like, for me, that's the big transition,
that people are going to have to just sort of get over their own perceived expertise
and let computers do stuff that they're good at.
That's right, yeah.
And that's why I think pulling in analogies of the past is so valuable.
Because it helps people start to realize, you know what, maybe this is a good idea.
And not a big giant waste of time.
So who's using it today?
Yeah. So the open source components that make up a large part of the DCOS are used by a large number of companies today.
Some of the biggest users out there are companies like Twitter, Airbnb, HubSpot, eBay, and PayPal, who are using it for running things.
Netflix is using it for running things.
So some of the smaller companies without a lot of machines?
That's right. Yeah. I mean, one of the great things about the way that the software has evolved over the years is we've made it so that it works well at the
small scale, but it also scales. And it works very, very well for the large-scale guys.
And as hardware itself is starting to evolve in our data centers, and maybe the rack is
going to start looking less like the rack or a machine is going to start looking less
like a machine, you really need these levels of abstraction for both the small guys and for the
big guys. Yeah, it certainly seems to me that one of the things that an operating system brings
is it allows hardware to proceed at a different pace of innovation. And so when I look
at DCOS, I think, wow, this is really going to free a set of people to go, well, let's just go replace our servers with ARM servers. Let's go
replace our networking infrastructure in a certain way because they'll be able to map those
abstractions up. Rather than today, I mean, once you say it's a VM running this instruction
set that assumes this level of, you're stuck.
That's right. Yeah.
So a lot of people are looking at the stack of cloud today. You know, we haven't even used that word a lot here, because we're really focused on the distributed operating system. But, you know, they think platform as a service or infrastructure as a service. And so to me, like, let's assume that this isn't platform as a service. Let's take that, let's assume we understand the possibility. But, you know, IaaS is definitely at this VM, server level. So why is this not an IaaS kind of thing?
Yeah, yeah. So one of the biggest differentiators between what we've done versus what they've done in the infrastructure as a service space is we've really tried to provide these abstractions and these primitives that enable you to build new distributed systems on top.
And again, that's really what an operating system should be providing.
What infrastructure as a service provides to you is another machine.
It turns a physical machine into a virtual machine
or maybe a virtual machine into a virtual machine
when you're running, say, OpenStack on EC2.
And that does not help the developer build another system.
It's the same primitive.
It's just kind of wrapped up.
And so really what you get from something like a data center operating system
are the abstractions and primitives that make it easier to build new distributed systems,
and that's what makes it easier to then move those distributed systems from one organization to another organization,
because that's the abstraction that everybody has and they can use those.
Yeah, you know, I think that this is super interesting because I think from an IT leadership
and the enterprise perspective, you know, right now we're on the verge where everybody wants to move
to cloud, they don't know what that means, and so they're very quickly virtualizing the servers
that they have laying around.
And I'm a big believer that that's just not a useful, a good use of time.
I think it might be cost effective in some marginal way, but the cost of moving and the bugs you introduce and stuff.
And so I think what would you say to sort of your typical enterprise CIO?
Well, there's not really a typical one, but for an enterprise CIO overseeing a move, what is it that they should understand about moving to a Mesos kind of environment rather than taking this intermediate step of doing a bunch more VM stuff, or better managing your VMs?
Right, right, right.
Yeah, I mean, I think one thing that's really, really clear is that one of the nice things about a data center operating system is that it doesn't really compete with infrastructure as a service at the end of the day, because it's still about just taking all your resources, whether those resources come from virtual or physical machines, and using those resources effectively.
So for folks that do already have infrastructure-as-a-service-like deployments, there's still a ton of value in using Mesos and the data center operating system, because you still want to best take advantage of the resources that you already have, again, even if it's just a bunch of virtual machines.
And the same thing applies to why something like Mesos and the DCOS is still so valuable in EC2-like environments on AWS: because, again, you still want to best take advantage of all the resources that you have.
But for people that are starting from scratch, I think you can really now start to take a very close look at whether or not you need to go through that first level of virtualization or not.
And we've had a lot of reports of people that can go directly to using something like Mesos and the data center operating system.
And then you don't have to start paying that 30% virtualization overhead for running your applications,
which can start to save a lot of money.
Because that's how I sort of think of it: there are both the cost savings.
And then like if you're going to go a greenfield, like if you're going to build a new expense app,
rather than just virtualize the old expense app, you probably want to build it because you know it's never going to use a whole rack.
Yeah.
Like so why would, but you're going to probably, if you were to go build it, you would dedicate the rack.
Yeah.
And then you get all the overhead of a bunch of VMs.
And so it seems like you should just go straight to building it as a distributed app.
And then your thousand apps over the next 10 years that get rewritten are all just going to squeeze in and use the right amount of resources.
Yep, that's exactly right.
But then I want to go back one quick sec to the platform as a service.
Because to me, like, platform as a service and infrastructure as a service are sort of almost inherently connected in an inefficient way.
Yeah, yeah.
So what would you say to: well, oh, no, we're okay, because we're just going to use, you know, a cloud vendor's platform. But that doesn't solve the distributed thing, does it?
Yeah, no, no.
I mean, what ends up happening at the end of the day with platform as a service is, again,
so it's a high-level abstraction on top of infrastructure as a service. What platform as a service
really solves is the fact that, oh, great, from infrastructure as a service I've got a bunch of machines.
Now what do I do?
So platform as a service said, okay, well, we'll abstract away the machines and we'll let you
just run your tasks, your processes, your apps, whatever it is, but then you just run the
processes.
And what you really want is you want to be able to launch those processes, those applications,
and then you want those applications to be able to continue to execute by using the underlying
infrastructure, by calling back into something like the data center operating system and
say, hey, now I need more resources; or for the data center operating system to be able to call into the apps and say, hey, this machine is going down for a reboot because it's doing maintenance, you should know about this. Just like in a normal operating system where, you know, you do memory paging.
And that's the big distinguisher again between something like platform as a service and the
data center operating system: platform as a service is about, okay, here's an app, I run it, I go,
and the data center operating system is about, okay, here's an app, I run it,
and then while that app is running, it uses the data center operating system to continue to run.
It calls back in, it uses the system call API, and as that API gets bigger and bigger and bigger,
it makes a really, really rich environment for programmers to be able to build really
sophisticated distributed applications.
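A hedged sketch of that "calling back in" idea: a running service notices it's falling behind and asks the platform for more instances of itself, rather than a human resizing machines. The endpoint, app id, and backlog metric below are made up for illustration, and a real system call API would be much richer than this single scale-up request.

```python
import requests

MARATHON = "http://marathon.example.com:8080"   # illustrative endpoint

def queue_depth():
    # Stand-in for whatever backlog metric the application tracks about itself.
    return 12000

def scale_up(app_id, extra):
    """Ask the platform for `extra` more instances of this app."""
    current = requests.get(MARATHON + "/v2/apps" + app_id).json()["app"]["instances"]
    requests.put(MARATHON + "/v2/apps" + app_id,
                 json={"instances": current + extra}).raise_for_status()

# Called from inside the service itself while it is running.
if queue_depth() > 10000:
    scale_up("/worker", extra=2)
```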
One last question is, I mean, you described a lot of stuff, so I'll make it two parts.
A, where can I get the stuff today, and what can I do with it?
And then B, like, what comes next?
Yeah, you can go to mesos.apache.org.
And that's where you can learn a lot about the kernel itself, the Mesos kernel.
And the new stuff.
That was the second question.
Yeah, the side part was like, well, tell everybody,
now that they've absorbed all this, what's coming next?
Yeah, yeah.
Yeah, so the new stuff's the most fun stuff to me.
It's really where we start to take the beginning of what it means to be, you know, a data center operating system and take it to the next
level. And it means we start to take the things that historically have been really, really
tough to run, regardless of whether or not you've used higher levels of abstraction like PaaS or infrastructure as a service: things like stateful services. And we get to start running
those things in a really, really, really effective way in the data center that historically
have required a lot of humans to actually deal with that kind of stuff. So there are two examples I want
to give here, two primitives that are being built that I think are really, really cool.
One primitive we're building in is this notion of maintenance.
So because we have this software layer, the kernel, actually running in our data center
operating system, when the applications are running on top, we can have it start to actually
deal with maintenance of things that are happening in your data center.
So, for example, when a machine or rack needs to go offline, we can have the software talk to the other software and say, hey, you know what? This machine is going down for repair. We need to reschedule you, or you should get rescheduled, you need to move data.
Just treat it like it was a failure, but a planned failure.
That's right, that's right. It's a failure, but a planned-for one. That's exactly right.
And this is huge, because usually the way this works in most organizations is a human walks up to another human and says, hey, I'm going to be taking this rack down, what can we actually do about this? But we can turn this into software. And the analogy that I like to give from just traditional operating systems is: operating systems today would do things like page out memory. But what they do is they just say, hey, you know, we're going to use the LRU
algorithm. We're going to page out the least recently used. And that doesn't always work great.
And wouldn't it be better if actually the operating system could work with the applications running on top to do smarter things when it comes to failures or needing more resources or whatever
it is? And that I think is like that realm of things is to me one of the most exciting things
that we're going to be working on because we get to reimagine a lot of the basic primitives that
existed for single machines and rebuild them in a way that makes sense in a distributed
environment, and makes sense for people that want to do things in a smarter way than sort of what we've been working with for a long time.
At a scale that people can only imagine.
At a scale where it's already hard enough to do it manually, and so we have to do it in software-based ways. And so we can do that.
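Since these primitives were still being built at the time, here's a purely hypothetical sketch of the shape of that "planned failure" protocol, not a real Mesos API: the platform tells a scheduler that a host is headed into maintenance, and the scheduler drains its tasks ahead of the deadline instead of discovering a surprise failure.

```python
def on_maintenance_notice(scheduler, host, deadline):
    """Hypothetical callback: `host` will go down for maintenance before `deadline`."""
    for task in scheduler.tasks_on(host):
        scheduler.relaunch_elsewhere(task)   # move state and restart ahead of time
        scheduler.kill(task)                 # then vacate the machine gracefully
```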
Awesome.
Well, thanks so much.
This has been Benjamin Hindman from Mesosphere, and I'm Steven Sinofsky signing off this
episode of the A16Z podcast. Thanks, everybody. Great. Thank you.