Screaming in the Cloud - Looking at the Current State of Resilience with Spencer Kimball
Episode Date: December 10, 2024

Spencer Kimball, CEO of Cockroach Labs, joins Corey Quinn to discuss the evolving challenges of database resilience in 2025. They discuss the State of Resilience 2025 report, revealing widespread operational concerns, costly outages, and gaps in failover preparedness. Modern resilience strategies, like active-active configurations and consensus replication, reduce risks but require expertise and investment. Spencer highlights growing regulatory pressures, such as the EU's Digital Operational Resilience Act, and the rising complexity of distributed systems. Despite challenges, Cockroach Labs aims to simplify resilience, enabling organizations to modernize while balancing risk, cost, and customer trust.

Show Highlights
(0:00) Intro
(0:36) Cockroach Labs sponsor read
(3:14) The foundational nature of databases
(3:55) Cockroach Labs' State of Resilience 2025 report
(8:55) CrowdStrike as an example of why database resilience is so important
(11:04) What Spencer found most surprising in the report's results
(15:13) Understanding the multi-cloud strategy as safety in numbers
(18:29) Cockroach Labs sponsor read
(19:23) Why cost isn't the Achilles' heel of the multi-cloud strategy that some people think
(23:52) Why executives aren't blaming IT people for outages as much
(28:21) The importance of active-active configurations
(32:01) Why anxiety about operational resiliency will never fully go away
(37:52) How to access the State of Resilience 2025 report

About Spencer Kimball
Spencer Kimball is the CEO and co-founder of Cockroach Labs, a company dedicated to building resilient, cloud-native databases. Before founding Cockroach Labs, Spencer had a distinguished career in technology, including contributions to Google's Colossus file system. Alongside co-founders Peter Mattis and Ben Darnell, he launched CockroachDB, a globally distributed SQL database designed to handle modern data challenges like resilience, multi-cloud deployment, and compliance with evolving data sovereignty laws. CockroachDB is renowned for its innovative architecture, enabling consistent and scalable database performance across regions and clouds. Under Spencer's leadership, the company continues to redefine operational resilience for enterprises worldwide.

Links
Cockroach Labs: https://www.cockroachlabs.com/
The State of Resilience 2025 report: https://www.cockroachlabs.com/guides/the-state-of-resilience-2025/

Sponsor
Cockroach Labs: cockroachlabs.com/lastweek
Transcript
redefining your threshold for what's a disaster, where you're going to have a recovery step and
postmortems for all the affected applications. You kind of move that threshold forward. You say,
we're going to be able to survive an availability zone going away.
Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined today by Spencer Kimball,
who's the CEO and co-founder of Cockroach Labs.
It's been an interesting year in the world of databases, data stores, and well, just about
anything involving data. Spencer, thanks for joining me. Corey, it's a pleasure to be here.
Outages happen, and it's never good when they do. They severely disrupt your business,
cost time and money, and risk sending your customers to the waiting arms of your competition.
But what if you could prevent downtime before it starts?
Enter CockroachDB, the world's most resilient database.
Thanks to its revolutionary distributed SQL architecture, CockroachDB is designed to defy downtime and keep apps online no matter what.
And now, CockroachDB is available, pay as you go, on the AWS marketplace, making it easier than ever
to get started. Get the resilience you require without the upfront costs. Visit cockroachlabs.com slash last week to learn more or get started in
the AWS marketplace. Cockroach Labs has been one of those companies that's been around forever on
some level. Like I was hearing about CockroachDB must have been, oh dear Lord, at least 10 years
ago, if not longer. Time has become a flat circle at this point, but it's good to see you folks are still doing well. Well, you know, that's what most startups were encouraged
to do in the various boom and bust cycles that we've been a part of. Be cockroaches, right?
Lose this pretty idea of being a unicorn and get down to basics and survive. And yeah, we have been.
Your memory's accurate. We've been around for just about 10 years now. It'll be 10 years in February.
Okay, good. Just to know at least my timing is not that far gone. And you're also
one of the vanishingly few companies in the tech ecosystem that does not have AI splattered all over every aspect of your web page. But the last time I said that to someone, the answer was, "Well, we actually have a release going out in two days." Oh no. So if I'm jumping the gun on that and you're about to be AI splattered, let me retract that. Well, AI is very exciting and it'll lead to a lot more use cases. But the interesting thing
about databases is they're required for every use case and nothing's changing about that.
We do have some AI capabilities. That's not at all unusual in the database space,
but it certainly doesn't define us. We're not a database for AI. Sure, you can use it for AI use cases. In fact, you ought to. We've got some pretty big AI headliners that use Cockroach for their use cases. Yeah, we're
not chasing the AI puck. The reality is we solve a very, very difficult problem, which is how do you
become the system of record for the most critical data, the metadata that really runs the mission
critical applications that people rely on every day. And that has always been a problem, right? Since these systems were first introduced in the 60s,
and it will be a problem in 100 years that needs to continuously be solved.
For all the joking I do about anything is a database if you hold it wrong,
the two things I'm extraordinarily conservative about in a technical sense are file systems and
databases. Because when you get those wrong, the mistakes show. If you're a big enough company,
they will show on the headline of the New York Times.
It's one of those problems where, let's be very sure that we know what we're doing on things that can't be trivially fixed with a wave of our hand. Like, "Oh, I just dropped that table. Were we using it for something?" is not a great question to be asked.
That's exactly right.
I mean, this is a foundational piece of infrastructure.
And if you build your house on a bad foundation, the problems start to show up and they don't stop. I wasn't expecting the latest report coming from you folks, though in hindsight, I absolutely should have: The State of Resilience 2025, which is honestly like catnip for me at this point.
What led to the creation of this thing? Did someone just say, hey, we should wind up doing
this and people might click a link somewhere because if so, it worked out super well for you.
Well, listen, you have to step back and look about five years into the past. We actually
just saw an opportunity about five years ago to release the first of these kind of annual reports.
It wasn't about resilience at that point in time.
It was actually a consequence of us really struggling with the idiosyncrasies of the different hyperscaler cloud providers.
We were actually making Cockroach available as a fully managed cloud service in addition to many of our customers running it themselves as sort of a self-hosted product.
And in that process, we experienced some pretty dramatic differences
between the hardware and the networking
and the sort of costs of the different cloud vendors.
And so we actually went in there
and we got much more scientific about it
and started really doing benchmarking
with a very database-centric perspective.
And the results of that benchmarking were very interesting.
And we figured the rest of the industry would be very curious to know what we came up with
in terms of what's the best bang for the buck?
What are the most efficient options in the different clouds?
And at that first report, there was quite a bit of discrepancy and ultimately an arbitrage opportunity for people that were going to select one cloud over another to run database-backed workloads.
And the success of that report encouraged us to do that same one two more times.
So we did it for three years, and it was probably one of our highest performing pieces of content because it was interesting.
But what was interesting is the cloud vendors paid a ton of
attention to it as well. And it created quite a brouhaha internally in some of the CSPs.
And as a consequence, the actual prices and the differences in performance between the cloud
vendors began to be diminished in those three years to the point where in the third report,
the differences were fairly de minimis. So we actually, in the fourth year, decided to do
a different report. And that fourth year was actually the state of multi-cloud.
What we're looking at is kind of similar to this most recent one, which of course,
we'll talk quite a bit about. We actually surveyed a huge number of enterprise businesses out there.
We've talked to the CIOs and the architects and so forth. And we asked them, what's your stance
on multi-cloud? Are you using a single cloud? Are you using on-premise still? In other words,
are you hybrid? Do you have two of the hyperscalers? Do you have all three of the
hyperscalers? What are the reasons for that? And it was actually pretty eye-opening. We actually found that in the enterprise segment,
most companies were definitively multi-cloud.
They had at least two,
and often three of the three big hyperscalers.
And a lot of that was due to just different teams,
different times, more permissive attitudes,
people kind of running in their own direction.
There was M&A, so they acquired companies
that used a different cloud than wherever their center of gravity was.
And it's kind of hard to move these applications once they get started.
It doesn't stop people from trying. Not that I ever see it go well, but yeah,
"we decided to do this because it's in line with our corporate strategy," cue four years of screaming and wailing and gnashing of teeth.
Yes, that makes a ton of sense. I mean, I'm happy to be participating now. But this most recent year,
we decided to focus on what our customers
were asking Cockroach to help them solve.
And we saw that that was really our biggest differentiator.
And we're a database, we're a distributed database
that's really cloud native
and that has some nice advantages.
One of them is scalability.
It can get very, very, very large.
And that helped some of our big tech companies in particular that had these big use cases and millions or
tens or hundreds of millions of customers. But we also found that resilience is important to
all of our customers. And this is another thing that a really distributed cloud-native architecture
can get right in a way that more legacy monolithic databases don't have as
easy a time at. And so we actually just focused this report on, again, a survey. And I think we
hit a thousand senior cloud architects, engineering and tech executives. The minimum seniority was vice president. And we looked at North America, EMEA, so Europe and the Middle East, and APAC. The survey ran just this year, ending around September 10th. And boy, the results were surprising and a little eye-watering,
I'd say, just in terms of how pervasive the resilience concerns are and the damages resulting
from a lack of resilience and the sort of unpreparedness and just the general high, like DEF CON 1 level of
anxiety about where these companies were, how much this stuff was costing, and ultimately what that
was going to mean going forward. It makes sense. People are not going to reach for a distributed
database, in my experience, unless resilience is top of mind for wanting to avoid those single
points of failure.
Yet there's also an availability and latency concern for far-flung applications, sure.
But you don't get very far down that path
unless resilience is top of mind.
For anyone running something
that they care even halfway about,
making sure it doesn't fall over for no apparent reason
or bad apparent reasons
is sort of the thing that they need to care about the most, at least in my world.
I'm an old, grumpy, washed up Unix sysadmin who turned into something very weird afterwards.
But I was always very scared about making sure the site stayed up.
I didn't sleep very well most nights waiting for the pager to go off.
This year has had its share of notable outages, and not to dunk on them unnecessarily, but one of the most notable was the CrowdStrike issue. Fortunately, I timed that perfectly because it hit the day I started my
six-week sabbatical. So I wasn't around for any of the nonsense, the running around. I hear about
it now, but I was completely off the internet for that entire span of time. And I could not have
timed that better, to the point where I'm starting to wonder if people suspect that I had a hand in it somewhere. But as best I can tell, it was one of those things
that had a confluence of complicated things hitting all at once, like most large outages do
these days. No one acted particularly irresponsibly, and a lot of lessons were learned coming out of it.
But no one wants to have to go through something like that if they can possibly avoid it.
It's a good thing it didn't happen on the day you were leaving if you had Delta tickets,
because that was a major problem.
It seems so.
It was, it's one of those areas where it's, whenever you have a plan for disasters and
you sit around doing your disaster planning, your tabletop exercises, the one constant
I've seen in every outage that I've ever worked through has been that you don't quite
envision how a disaster is going to unfold.
Most people didn't have "every Windows computer instantly starts in a crash loop on boot" on their list. That just wasn't something people envisioned as being part of what they were defending against. Every
issue I've ever seen of any significant scale has sort of taken that form of, oh, in hindsight,
we should have wondered what if, but we didn't in the right areas. I'm curious what you found
in the report, though, that surprised you the most.
Well, I think it was the pervasive nature of the operational resilience concerns.
You know, that was by far the most surprising.
You know, I will just make a comment on the CrowdStrike outage.
You know, I think that what it represents is a certain, well, first, maybe it helps
to understand CrowdStrike's business model,
which is really quite a huge value proposition for the companies that use it. What they do is
they say, okay, we're sort of a one-stop shop for handling all of the compliance and the security
applied to the very vast and growing surface area that is threatened by cyber attacks. And if anyone listening to this has ever had to
stand up a service in the cloud, the number of hoops you have to jump through is quite intimidating.
And it's only increasing in terms of the scope and the number of boxes you have to check.
And so that growing complexity of the task is made much more tractable by a product like CrowdStrike that not only has a huge
sort of set of capabilities that address all of those threats, but it also is constantly being
updated in order to address the evolving threat landscape. And that's part of what went wrong,
right? Like many companies were allowing CrowdStrike to automatically update, you know,
and just immediately upon
releases coming out, instead of letting them bake a little bit and letting somebody else
find out the hard way that the update might have a problem in it. And it was kind of a simple
programming error. But like, this is just an example of one of these things where you kind
of have to trust this technical monoculture, which was CrowdStrike's ability to protect these
Windows machines from cyber threats. Because if you don't trust somebody else, every single company out there
has these same problems. And most people are going to address them very poorly without trusting
CrowdStrike's technical competence and their economies of scale and so forth. Of course,
that same thing applies writ large to the hyperscalers, right? These are massive technical
monocultures. And by the way, any one of those three companies, AWS, GCP, and Azure, is better than probably any other company in the
world at running secure data centers and services and the whole substrate, which we call the public
cloud these days. Each one represents a very exceptionally fine-tuned and expert level technical monoculture. But nevertheless,
it's a technical monoculture, right? So if something's wrong with one of these, it can be
quite systemic. And just like with CrowdStrike, it was a very simple programming error, which honestly should have been caught, but, you know, shit happens, right? Everyone knows that. And when you look at the increasingly complex
way that any modern application is deployed using just a bunch of different cloud services put
together and so forth, all of those services and pieces of infrastructure, they're relying on the
trust on whichever vendor that is putting things together properly, protecting against cyber
threats, dealing with their own kind of lower level minutia of managing resilience and scale and not going down. And you have to put
honestly tens or hundreds of these things together in a modern service that's being stood up.
And so the only way to really prepare for the unknown unknowns, like which one of these things
is going to fail on you, is diversification. You know, the companies, for example, that had more than Windows running,
this is, you know, the CrowdStrike thing
is just one small example.
You know, they had Windows, Mac,
and Linux machines, for example.
They certainly didn't have as much of an outage
as folks whose organizations relied only on Windows.
Again, a little bit of a facile example,
but it's one of the reasons that companies
are eager to embrace a multi-cloud strategy, for example.
One of the challenges, unless you do that very well
with embracing a multi-cloud strategy
to eliminate single points of failure
is you inadvertently introduce
additional single points of failure.
We want to avoid AWS's issues,
so we're going to put everything on Azure for our e-commerce site, except Stripe, which is all in on AWS. So now we're exposed to both Azure's issues and we can't accept anyone's money when AWS goes down as well. When you conducted the surveys for this report, did you find a sense of safety in numbers? As in, when the
CrowdStrike issue happened, to continue using an easy recent example, the headlines didn't say
individual company A or individual company B was having problems. It distilled down to,
computers aren't working super great today and everything's broken. Whereas if people are running
their own environments and they experience an outage there, suddenly they're the only folks
who are down versus everyone. Is there a safety in numbers perception? Oh, 100%. I mean, that is
one of the big reasons to use the public cloud. You're not going to get fired if one of the big
hyperscalers has a regional cloud outage because you're not the only company that went down when,
say, US East disappeared from the DNS, right? It was a huge, huge list of companies. Now,
the problem, of course, with that
is that that safety in numbers really applies to the larger pack of smaller companies. Once you
get over a certain size and you have really mission-critical applications and services that
consumers rely on and will bitterly complain on x.com when the thing goes away,
then the safety in numbers argument wears a little bit thin.
So those bigger companies, those enterprises with the mission-critical estates, they actually have
to think beyond where we can just make safe technology choices, rely on big vendors that
are safe, quote unquote, safe choices. Ultimately, the best ones, I mean, not everyone does it right, as you'll read in this report that we have. I think a lot of companies feel
unprepared here. But the ones that are leaning forward the most, sort of the innovators,
to use the sort of crossing the chasm idea, the innovators and the early adopters,
those kinds of companies are, you know, the ones that really do, you know, for example,
embrace multi-cloud as an example, and seek to have that sort of diversification and much more in-depth planning and adopt the latest infrastructure that is looking
to exploit the cloud to have a higher degree of resilience and, for example, more scalability
that's sort of elastic so that you don't have a success disaster of too many people essentially
using your service and creating a denial of service kind of condition. So yes, you're totally right. Boy, I mean, running an application across
multiple clouds actively is not for the faint of heart. But it is one of those things that
the best companies are actually already starting to do. And as they sort of pioneer that,
it's like companies like Cockroach Labs and the hyperscalers and, you know, I don't
know, hundreds, if not thousands of other vendors, they all kind of start to make that easier,
right? Just like CrowdStrike, for example, helps companies manage the complexity of all of these
different security issues across the big surface area, expanding surface area. Companies like
Cockroach can help with the database, making that easy, for example, to run
the database across and replicate actively across multiple cloud vendors. Now, that's not something
that databases were expected to do 10 years ago. Now that there are some early adopters that are
pushing in that direction, that kind of paves the way for the larger crowd to come along when that
becomes more economical and a lot simpler, where the complexity is
sort of transparently handled by the vendor.
Unplanned disruptions to your database can grind your business to a halt, leave users
in the lurch, and bruise your reputation.
In short, downtime is a killer.
So why not prevent it before it happens with CockroachDB, the world's most resilient database
with its revolutionary distributed SQL architecture that's designed to defy downtime and keep your
apps online no matter what. And now CockroachDB is available pay-as-you-go on the AWS marketplace,
making it easier than ever to get started. Achieve the
resilience your enterprise requires without the upfront costs. Visit cockroachlabs.com
slash last week to learn more or get started today on the AWS Marketplace.
For those who may not be aware, I spend my days, when I'm not talking to a microphone indulging my love affair with the sound of my own voice, as a consultant fixing horrifying AWS bills for very large companies. So I have a bias where I tend to come at everything from a cost-first perspective. In theory, I love the idea of replicating databases between providers.
If you're looking at doing something that is genuinely durable and can exist independently upon multiple providers simultaneously, then the way that the providers charge for data egress seems like it's sort of the Achilles heel to the entire endeavor, just because you will pay very dearly for that egress fee across all of the big players. No, you're absolutely correct. I'll give you a couple takes on that perspective,
which is, it is sort of a ground truth.
There are mitigations and ultimately strategies
that transcend the problem of economics here.
Sort of just in terms of the base reality today,
when your mission critical use case is valuable enough,
then you'll pay those egress costs, right?
The economics actually makes sense
because the cost of downtime is so extraordinary and also the cost of reputation and brand and so
forth. So for example, let's say you're one of the biggest banks in the world and you have a huge
fraction of US retail banking customers. You might very well consider the cost of replicating across
cloud vendors and paying those egress fees to be a fair cost-benefit
analysis. Oh yeah, very much so. To the extent that that actually starts to happen, you know,
you can negotiate with the vendors to give you relief from those egress costs. That is half of
our consulting, doing the negotiation of these large contractual vehicles with AWS on behalf
of customers. And yeah, at scale, everything is up for negotiation,
as it turns out.
Absolutely.
And then, of course,
there are technical solutions
that use other vendors.
So you can do these direct connects.
You can use things like Equinix
and Megaport and so forth.
And you can actually connect.
And this is also very important
if you're going to do something
that's hybrid in terms of replication
across private clouds
and public clouds and so forth.
You really need to think about hooking up essentially your own direct connections.
And you can obviate some of those egress costs.
And of course, vendors like Cockroach in our managed service can do that.
And in those kinds of direct connect scenarios,
you actually just get a certain amount of bandwidth that can be used.
And that becomes quite economical if you fill those pipes.
If you over-provision that and you're barely using it, then you might pay more than the
egress costs, right? So there are opportunities there to really mitigate the networking costs. And then of course, you know, one thing we like to say is that resilience is
the new cost efficiency. So, you know, that kind of goes back to that earlier point of like how
valuable is the use case and what are the consequences of it going down.
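To make the "economical if you fill those pipes" point from a moment ago concrete, here is a back-of-envelope sketch in Python; every rate and traffic figure in it is a placeholder assumption, not an actual AWS, Equinix, or Megaport price.

# Back-of-envelope only: metered per-GB egress vs. a flat-rate dedicated connection.
EGRESS_PER_GB = 0.09          # hypothetical metered egress rate, dollars per GB
DEDICATED_MONTHLY = 2_000.0   # hypothetical flat monthly cost of a direct connection

def monthly_costs(gb_replicated: float) -> tuple[float, float]:
    # Metered cost scales with traffic; the dedicated pipe costs the same
    # whether it sits mostly idle or runs full.
    return gb_replicated * EGRESS_PER_GB, DEDICATED_MONTHLY

for gb in (5_000, 50_000, 250_000):
    metered, dedicated = monthly_costs(gb)
    winner = "dedicated pipe" if dedicated < metered else "metered egress"
    print(f"{gb:>8,} GB/month: metered ${metered:>9,.2f} vs flat ${dedicated:,.2f} -> {winner}")

Under these made-up numbers the dedicated pipe only wins once you push enough replication traffic through it, which is the over-provisioning caveat in reverse.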
But in this report we just put out on the state of resilience, the numbers are a little eye-watering.
I mean, 100% of the 1,000 companies we surveyed reported financial losses due to downtime.
So 100%, nobody escapes this.
Large enterprises lost an average of almost $500,000, so half a million dollars, per incident. And these things on average were 86 incidents per year. And so, when you think about your whole foundation, certainly as you migrate more legacy use cases or build greenfield kinds of things, it does make sense to think about spending to embrace the innovation that's available and obviate some of these mounting costs. I think a much worse strategy would be
to accept all this new complexity to build the latest and greatest. And by the way,
throwing AI into everything is certainly on most people's roadmaps. You got to get it into this
complex ecosystem. You're calling out to these LLMs, and everything's expensive.
All kinds of things can break because you're just increasing the complexity.
If you don't try to manage that, and you just lift and shift the old stuff while bringing in more and more new stuff, in other words, if your foundation isn't improving as you add additions and new stories to your house, you're only going to exacerbate the problem, right? So you really do have to embrace that. And ultimately, the cost
savings for this sort of mounting toll of resilience
disasters, that is a good argument to invest a little bit in the short term for a long-term
reward. It feels like whenever you're talking about operational resiliency, it becomes a full
stack conversation at almost every level. Outages where we had a full DR site ready to go, but we could not make a DNS record change due to the outage in order to point to that DR site, loom large in my memory. Having a database that's able to abstract that away sounds great. An approach that I've seen work from the opposite direction, for some values of work, has been the idea that you handle it at the application layer and move everything up into code. That solves for a bunch of historical problems with databases that don't like to replicate very well at the
cost of introducing a whole bunch more. The takeaway that I took from all of this has been
that everything is complicated and no one's happy. We still have outages. We still see a bunch of
weird incidents that are difficult to predict in advance, if not impossible. In hindsight, they look blindingly obvious with that benefit of hindsight. It's nice to know that at least the executives at these large companies feel that as well, and that the answer to "so what is the reason that you had those outages?" isn't "crappy IT people." I did not see that
as a contributing factor in virtually any part of the report that I scanned, but I may have missed that part.
Oh yeah.
I don't think people blame their staffs.
I mean,
it is an overwhelming challenge and no matter what you do,
whether you're migrating and trying to modernize or you're building just from
scratch with the best of breed selection of technologies,
you're going to have new problems.
It's kind of like the devil you know versus the devil you don't.
But I think there is an opportunity to sort of make incremental progress that really can
address some of the things that are becoming unsupportable, you know, just because you have
too many pieces cobbled together. And so when you're doing everything in your own data center,
everything was under your control, things changed very slowly. There was one set of concerns.
You didn't need this new infrastructure and distributed capabilities and so forth.
But as things are moved into the public cloud and everything is shifting and there's all these different things connected and all are introducing their own points of failure, you have to kind of move with the times, right?
So you can't accept that the old
way of doing things is kind of the same as the new way, even though the new way is not going to
remove all the problems. And in fact, we'll introduce some things you haven't seen before.
There is an empirical experience for our customers, at least, that you can move beyond
some of the things that are causing an unacceptable number of outages.
For example, availability zones going away
or nodes dying or networks having partial partitions.
Those are the kinds of things
that with a distributed architecture,
you can work around.
And also things like disk, like Elastic Block Store in AWS having high-latency events, say once every million writes. That might not have been a problem that anyone had on their radar,
but it sure afflicts you when you are moving a huge application into the public cloud.
And so how do you deal with that? Well, on a legacy database, you really are kind of stuck
on that EBS volume. And if it misbehaves underneath you, your end application is going to experience that pain. But with a distributed architecture,
there's all kinds of interesting things you can do with sort of automated failover between multiple
EBS volumes and across multiple nodes and across multiple facilities and across regions and even
cloud vendors. The right way to think about it is in this new world, you actually have the
opportunity to define what kind of an outage you're looking to survive automatically.
So it's kind of like redefining your threshold for what's a disaster, where you're going to have a recovery step and postmortems for all the affected applications.
You kind of move that threshold forward.
You say, we're going to be able to survive an availability zone going away. And it's going to have this additional cost.
Or we're going to survive an entire region going away.
Like this frequently happens for whatever reason.
Sometimes it's DNS.
But once a year, you see these things and you see all the companies that are affected by it.
You can actually have your entire application survive that
if the application is diversified across multiple regions.
That means your application code is running in those multiple regions. And your database has replicas of the data in those different regions,
and the whole thing needs to be tested. That's the trick.
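As a rough sketch of that "decide what you want to survive" idea, here is a hypothetical check in Python; the regions, zones, and replica placement are invented for illustration and do not reflect a real CockroachDB configuration.

from collections import Counter

# Each replica lives in a (region, availability zone) pair; this placement is made up.
placement = [
    ("us-east", "a"), ("us-east", "b"),
    ("us-west", "a"), ("us-west", "b"),
    ("eu-west", "a"),
]

def survives_loss_of_any(placement, domain_of):
    # A write needs a majority of all replicas, so check the worst case:
    # losing whichever single failure domain holds the most replicas.
    total = len(placement)
    quorum = total // 2 + 1
    replicas_per_domain = Counter(domain_of(replica) for replica in placement)
    worst_loss = max(replicas_per_domain.values())
    return total - worst_loss >= quorum

# An availability zone is the full (region, zone) pair; a region is its first element.
print("survives losing any one AZ:    ", survives_loss_of_any(placement, lambda r: r))
print("survives losing any one region:", survives_loss_of_any(placement, lambda r: r[0]))

With five replicas spread across three regions, both checks come back True, which is the "move the threshold forward" outcome described above; put everything in a single region and the second check fails.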
That is the absolute trick. And when you read the report, you'll see that most of those surveyed are not very prepared to handle outages. Only about 20% of companies report being fully prepared,
33% had structured response plans.
And less than a third regularly conduct failover tests.
From my perspective, it's always been valuable to run in an active-active configuration
if you need both ends to work correctly.
Otherwise, we tested it a month ago.
Great.
There have been a bunch of changes to your code base since then.
Have those been tested?
You have further problems dealing with the challenge of knowing when to go ahead and flip a switch of, OK, we're seeing some weird stuff. Do we activate the DR plan or do we just hope that it subsides? So much of the fog of the incident is always around what's happening. Is it our code? Is it an infrastructure provider? What is causing this? And do we wind up instigating our DR plan? Because once it's started, it's sometimes very hard to stop or to fail back.
That's a huge point.
And it's one of the reasons that less than a third regularly conduct failover tests.
Because often conducting a failover test means that you initiate an outage.
Because that's how most disaster recovery and active-active failovers work.
Even though, to your point, active-active, and sort of like a traditional Oracle GoldenGate setup,
it does allow you to be testing both ends
of your primary and your secondary, so to speak,
because they're both actively taking reads and writes
and so forth.
So they're participating.
So you know they both work.
You know they're both there.
You know they're both reachable and so forth.
But if you really wanted to test what happens
if, for example, you make it so
one of your locations that has one of these replicas is no longer visible, all kinds of
other things can go wrong. Plus, in order to do that, you may actually end up not having the full
commit log of transactions in the database replicated. So you might actually create the
conditions where there's some data regression
or even data loss.
And so people are loath to embrace
that kind of testing on a regular basis
because it can be so disruptive.
But you do need to.
If you don't turn off one of those data centers,
you don't really know how your application
might interact or other components
that are dependent on that data center
that you just totally forgot about. Someone put in some new message queue thing that was only running in
that one place. And now that message queue is down and the whole system backs up. These are
inevitable problems, right? If you don't test them. The beauty of, and I'll give Cockroach
another plug here, of a sort of modern replication configuration like Cockroach, which is called
consensus replication, is that you don't just have sort of a primary and a secondary in an active-active configuration. You would have
three or more replication sites, and you only need the majority of them. So if you have three,
you need two of them to be available. If you have five, you need three of them to be available.
Odd numbers are very important for this, to avoid split brain.
Exactly. Or if you have four, you can do four, but that means that you don't really get much benefit from it
because you need three always to be up.
You just need the majority.
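To make the majority math concrete, here is a minimal sketch of that quorum arithmetic in Python; it is purely illustrative and not code from CockroachDB itself.

def quorum(replicas: int) -> int:
    # Minimum number of replication sites that must be reachable to make progress.
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    # How many replication sites can be lost while a majority still remains.
    return replicas - quorum(replicas)

for n in (3, 4, 5):
    print(f"{n} replicas: need {quorum(n)} up, can lose {tolerated_failures(n)}")

# 3 replicas: need 2 up, can lose 1
# 4 replicas: need 3 up, can lose 1  (a fourth replica adds cost but no extra tolerance)
# 5 replicas: need 3 up, can lose 2

Which is exactly why the even-numbered configuration buys you little: four replicas still tolerate only a single failure.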
So Cockroach can handle
all of those different configurations,
but the beauty is you can actually turn off
any of these replication sites,
whether it's a node
or it's an availability zone
or it's a region
or it's a cloud vendor.
And you have a total expectation
that there's not going to be
any kind of data loss or data regression or anything. That's just how the system works.
It's not the sort of asynchronous, conflict-resolution-prone, old-fashioned way of doing
things. It's a new kind of gold standard that does let you do this testing in situ with very
real world scenarios. And that can, I think, change these statistics
for companies, right? That less than a third regularly conduct failover tests when you need
to regularly conduct these failover tests. And by the way, that's still not going to get you to 100%.
It just won't. There's things that you can't imagine that you wouldn't have tested for,
but you can get a lot closer to the 100%. Is there hope? This, I guess, is my last
question for you on this, because a recurring theme throughout this report is that folks are worried, folks are concerned about outages,
about regulatory pressures, about data requirements. It feels that fear is sort of an
undercurrent that is running through the industry right now, particularly with regards to operational
resiliency. Are we still going to be having this talk in a year or two and nothing is going to
have changed? Or do you see there being a light at the end of the tunnel?
It's a great question.
I think that the anxiety is never going to go away.
I mean, recalling my reference to the crossing the chasm idea really is how people adopt
new technology, right?
You have these innovators and early adopters and then the early majority, and then you're
kind of at the halfway point in the distribution.
And on the other side of that, you have the late majority and the late adopters
and companies that just never changed.
They're on a zero-maintenance diet.
I think you're going to have that inevitably.
And the complexity is going to keep increasing.
So you're always going to have
probably a healthy majority of companies
that are behind the curve, which sounds so pejorative,
but it is nevertheless the
case that we're in a rapidly evolving landscape, threat landscape, complexity landscape, but also
capabilities and potential for new markets and expansion and growth in any one of these companies.
It's sort of a mix of exciting and anxiety-inducing. I think in several years, we're going to be having
the same conversation. It won't be the same kinds of technology and the same kinds of threats.
Those will have all evolved quite a bit. Meanwhile, the things that we're talking about now
will have penetrated deep into that early majority and probably into some of the late majority.
But it'll again be these forward-leaning companies that are tackling the newest threats with innovation,
whereas most of the rest of the companies out there will kind of be like, huh? Well,
that sounds like something that maybe we'll get to, but boy, we're struggling still with the
last year's problems or last decade's problems in some cases.
I do get a strong sense of urgency around this. It feels like there's a growing awareness here
where companies will not have the luxury of taking a wait and see approach. There is urgency. We're
seeing it across our business. And I think that urgency right now is, you know, part carrot and part stick. I don't know if that's quite the right metaphor, but people have an
urgency to modernize partly because they want to have all the benefits that they think AI is going to
bring to their use cases and to their customers. And there's quite a bit of an excitement there.
I think there is an increasing degree of urgency, and anxiety,
because regulators are starting to look at this with a fairly acute perspective. And I'll mention there's a new regulation over in the EU and in the UK
called the Digital Operational Resilience Act. And they're actually looking hard at companies
in terms of critical services and infrastructure. So for example, if you're in banking or in
utilities, where people rely on your service, now the regulators are starting to assess what
your plans are to survive
these different kinds of outages that I was describing.
Like what happens if your cloud vendor goes away
or is deemed unfit for some systemic security
or cyber threat?
How long does it take you to move your service
or to reconstitute in the event of a widespread failure?
And those answers right now are not very good
across those critical industries,
but they're going to get better.
And then that moves the sort of state of the art and moves the regulators' expectations.
But of course, it creates a lot of anxiety. And the teeth on these regulations, kind of like the
GDPR, are pretty extreme. So there's a big stick. I mean, it's rarely used, but ultimately,
there's a growing realization of the costs of this complexity and the implications of what that means for society. That's the perspective that the regulations are being fashioned from. And when you're in one of these industries and you've got
your budgets and you got all your interesting new projects to try to grow your market share and so
forth, now you've got a host of new requirements from the regulators. So it's one thing if you're
the kind of company that is fairly on top of the innovation, has made big progress towards
migrating your whole estate to more modern technologies and infrastructures. But that's
a small fraction of all the companies out there. So yeah, you're right. There's a lot of anxiety.
And I think it's because in the interest of doing things less expensively and doing things more quickly, building new
services more quickly, there's been a lot of additional complexity that leads to failure
in unexpected ways. And by the way, AI is just going to make all these things worse because
cyber threats are definitely going to, I think, grow exponentially with the ability to automate.
And so, you know, I think that we're going to see this anxiety continue at pace. I don't know if
it's necessarily going to get worse, because no matter what the regulatory frameworks require of companies, they can only require so much so quickly, right? They can't sort of break the system's back.
Companies have always been highly incentivized to avoid outages. If they could be said to have a corporate religion, it's money. They like money, and as you cited, every outage costs them money, so they don't want to have those. The question becomes, at what point are the efforts that individual companies are making no longer sufficient? And I don't necessarily know that there is a good answer to that.
No, and like I said, it's rapidly evolving. I think that, to your point again about being in a crowd, right?
It's a question of whether you're in the middle of that pack.
If you are, you're pretty safe, I'd say.
So you have to look at your peers and just decide whether you're going to get undue scrutiny
and that's going to impact your brand or your bottom line.
Because it's not just regulators, of course, that look at that.
It's your customers.
Are they going to go to a competitor
if they feel that you're not giving them
the kind of service and the trust
that they placed in you is being eroded?
I really want to thank you for taking the time
to speak with me about all of this.
If people want to learn more
or get their own copy of the report,
where's the best place for them to find it?
Let's see, where is that report? I mean, listen, I would go to our website. It's all over that. So it's cockroachlabs.com. You will easily be able to find it. It's called
the State of Resilience 2025. Of course, you could just search on Google for that.
Or we'll just be even easier and put a link to it in the show notes for you.
That works too. Thank you for doing that.
Thank you so much for being so generous with your time. I really appreciate it.
My pleasure, Corey. Thank you for having me on.
Spencer Kimball, CEO and co-founder of Cockroach Labs. I'm cloud economist Corey Quinn, and this
is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your
podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star
review on your podcast platform of choice, along with an angry comment. And tell us, by the way,
in that comment, which podcast platform you're using because that at least
is a segment
that understands the value
of avoiding a monoculture.