Screaming in the Cloud - Looking at the Current State of Resilience with Spencer Kimball
Episode Date: December 10, 2024

Spencer Kimball, CEO of Cockroach Labs, joins Corey Quinn to discuss the evolving challenges of database resilience in 2025. They discuss the State of Resilience 2025 report, revealing widespread operational concerns, costly outages, and gaps in failover preparedness. Modern resilience strategies, like active-active configurations and consensus replication, reduce risks but require expertise and investment. Spencer highlights growing regulatory pressures, such as the EU's Digital Operational Resilience Act, and the rising complexity of distributed systems. Despite challenges, Cockroach Labs aims to simplify resilience, enabling organizations to modernize while balancing risk, cost, and customer trust.

Show Highlights
(0:00) Intro
(0:36) Cockroach Labs sponsor read
(3:14) The foundational nature of databases
(3:55) Cockroach Labs' State of Resilience 2025 report
(8:55) CrowdStrike as an example of why database resilience is so important
(11:04) What Spencer found most surprising in the report's results
(15:13) Understanding the multi-cloud strategy as safety in numbers
(18:29) Cockroach Labs sponsor read
(19:23) Why cost isn't the Achilles' heel of the multi-cloud strategy that some people think
(23:52) Why executives aren't blaming IT people for outages as much
(28:21) The importance of active-active configurations
(32:01) Why anxiety about operational resiliency will never fully go away
(37:52) How to access the State of Resilience 2025 report

About Spencer Kimball
Spencer Kimball is the CEO and co-founder of Cockroach Labs, a company dedicated to building resilient, cloud-native databases. Before founding Cockroach Labs, Spencer had a distinguished career in technology, including contributions to Google's Colossus file system. Alongside co-founders Peter Mattis and Ben Darnell, he launched CockroachDB, a globally distributed SQL database designed to handle modern data challenges like resilience, multi-cloud deployment, and compliance with evolving data sovereignty laws. CockroachDB is renowned for its innovative architecture, enabling consistent and scalable database performance across regions and clouds. Under Spencer's leadership, the company continues to redefine operational resilience for enterprises worldwide.

Links
Cockroach Labs: https://www.cockroachlabs.com/
The State of Resilience 2025 report: https://www.cockroachlabs.com/guides/the-state-of-resilience-2025/

Sponsor
Cockroach Labs: cockroachlabs.com/lastweek
Transcript
redefining your threshold for what's a disaster, where you're going to have a recovery step and
postmortems for all the affected applications. You kind of move that threshold forward. You say,
we're going to be able to survive an availability zone going away.
Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined today by Spencer Kimball,
who's the CEO and co-founder of Cockroach Labs.
It's been an interesting year in the world of databases, data stores, and well, just about
anything involving data. Spencer, thanks for joining me. Corey, it's a pleasure to be here.
Outages happen, and it's never good when they do. They severely disrupt your business,
cost time and money, and risk sending your customers to the waiting arms of your competition.
But what if you could prevent downtime before it starts?
Enter CockroachDB, the world's most resilient database.
Thanks to its revolutionary distributed SQL architecture, CockroachDB is designed to defy downtime and keep apps online no matter what.
And now, CockroachDB is available, pay as you go, on the AWS marketplace, making it easier than ever
to get started. Get the resilience you require without the upfront costs. Visit cockroachlabs.com slash last week to learn more or get started in
the AWS marketplace. Cockroach Labs has been one of those companies that's been around forever on
some level. Like I was hearing about CockroachDB must have been, oh dear Lord, at least 10 years
ago, if not longer. Time has become a flat circle at this point, but it's good to see you folks are still doing well. Well, you know, that's what most startups were encouraged
to do in the various boom and bust cycles that we've been a part of. Be cockroaches, right?
Lose this pretty idea of being a unicorn and get down to basics and survive. And yeah, we have been.
Your memory's accurate. We've been around for just about 10 years now. It'll be 10 years in February.
Okay, good. Just to know at least my timing is not that far gone. And you're also
one of the vanishingly few companies in the tech ecosystem that does not have AI splattered all over every aspect of your web page. But the last time I said that to someone, the answer was, "Well, we actually have a release going out in two days." Oh no. So if I'm jumping the gun on that and you're about to be AI splattered, let me retract that. Well, AI is very exciting and it'll lead to a lot more use cases. But the interesting thing
about databases is they're required for every use case and nothing's changing about that.
We do have some AI capabilities. That's not at all unusual in the database space,
but it certainly doesn't define us. We're not a database for AI. Sure, you can use it for AI use cases. In fact, you ought to. We've got some pretty big AI headliners that use Cockroach for their use cases. Yeah, we're
not chasing the AI puck. The reality is we solve a very, very difficult problem, which is how do you
become the system of record for the most critical data, the metadata that really runs the mission
critical applications that people rely on every day. And that has always been a problem, right? Since these systems were first introduced in the 60s,
and it will be a problem in 100 years that needs to continuously be solved.
For all the joking I do about anything is a database if you hold it wrong,
the two things I'm extraordinarily conservative about in a technical sense are file systems and
databases. Because when you get those wrong, the mistakes show. If you're a big enough company,
they will show on the headline of the New York Times.
It's one of those problems where, let's be very sure that we know what we're doing on things that can't be trivially fixed with a wave of our hand. Like, "Oh, I just dropped that table. Were we using it for something?" is not a great question to be asked.
That's exactly right.
I mean, this is a foundational piece of infrastructure.
And if you build your house on a bad foundation, the problems start to show up and they don't stop. I wasn't expecting the latest report coming from you folks, though in hindsight, I absolutely should have: The State of Resilience 2025, which is honestly like catnip for me at this point.
What led to the creation of this thing? Did someone just say, hey, we should wind up doing
this and people might click a link somewhere because if so, it worked out super well for you.
Well, listen, you have to step back and look about five years into the past. We actually
just saw an opportunity about five years ago to release the first of these kind of annual reports.
It wasn't about resilience at that point in time.
It was actually a consequence of us really struggling with the idiosyncrasies of the different hyperscaler cloud providers.
We were actually making Cockroach available as a fully managed cloud service in addition to many of our customers running it themselves as sort of a self-hosted product.
And in that process, we experienced some pretty dramatic differences
between the hardware and the networking
and the sort of costs of the different cloud vendors.
And so we actually went in there
and we got much more scientific about it
and started really doing benchmarking
with a very database-centric perspective.
And the results of that benchmarking were very interesting.
And we figured the rest of the industry would be very curious to know what we came up with
in terms of what's the best bang for the buck?
What are the most efficient options in the different clouds?
And at that first report, there was quite a bit of discrepancy and ultimately an arbitrage opportunity for people that were going to select one cloud over another to run database-backed workloads.
And the success of that report encouraged us to do that same one two more times.
So we did it for three years, and it was probably one of our highest performing pieces of content because it was interesting.
But what was interesting is the cloud vendors paid a ton of
attention to it as well. And it created quite a brouhaha internally in some of the CSPs.
And as a consequence, the actual prices and the differences in performance between the cloud
vendors began to be diminished in those three years to the point where in the third report,
the differences were fairly de minimis. So we actually, in the fourth year, decided to do
a different report. And that fourth year was actually the state of multi-cloud.
What we're looking at is kind of similar to this most recent one, which of course,
we'll talk quite a bit about. We actually surveyed a huge number of enterprise businesses out there.
We've talked to the CIOs and the architects and so forth. And we asked them, what's your stance
on multi-cloud? Are you using a single cloud? Are you using on-premise still? In other words,
are you hybrid? Do you have two of the hyperscalers? Do you have all three of the
hyperscalers? What are the reasons for that? And it was actually pretty eye-opening. We actually found that in the enterprise segment,
most companies were definitively multi-cloud.
They had at least two,
and often three of the three big hyperscalers.
And a lot of that was due to just different teams,
different times, more permissive attitudes,
people kind of running in their own direction.
There was M&A, so they acquired companies
that used a different cloud than wherever their center of gravity was.
And it's kind of hard to move these applications once they get started.
It doesn't stop people from trying. Not that I ever see it go well, but yeah,
"we decided to do this because it's in line with our corporate strategy," cue four years of screaming and wailing and gnashing of teeth.
Yes, that makes a ton of sense. I mean, I'm happy to be participating now. But this most recent year,
we decided to focus on what our customers
were asking Cockroach to help them solve.
And we saw that that was really our biggest differentiator.
And we're a database, we're a distributed database
that's really cloud native
and that has some nice advantages.
One of them is scalability.
It can get very, very, very large.
And that helped some of our big tech companies in particular that had these big use cases and millions or
tens or hundreds of millions of customers. But we also found that resilience is important to
all of our customers. And this is another thing that a really distributed cloud-native architecture
can get right in a way that more legacy monolithic databases don't have as
easy a time at. And so we actually just focused this report on, again, a survey. And I think we
hit a thousand senior cloud architects, engineering and tech executives. The minimum seniority was vice president. And we looked at North America, EMEA, so Europe and the Middle East, and APAC. The survey ran just this year, ending around September 10th. And boy, the results were surprising and a little eye-watering,
I'd say, just in terms of how pervasive the resilience concerns are and the damages resulting
from a lack of resilience and the sort of unpreparedness and just the general high, like DEF CON 1 level of
anxiety about where these companies were, how much this stuff was costing, and ultimately what that
was going to mean going forward. It makes sense. People are not going to reach for a distributed
database, in my experience, unless resilience is top of mind for wanting to avoid those single
points of failure.
Yet there's also an availability and latency concern for far-flung applications, sure.
But you don't get very far down that path
unless resilience is top of mind.
For anyone running something
that they care even halfway about,
making sure it doesn't fall over for no apparent reason
or bad apparent reasons
is sort of the thing that they need to care about the most, at least in my world.
I'm an old, grumpy, washed up Unix sysadmin who turned into something very weird afterwards.
But I was always very scared about making sure the site stayed up.
I didn't sleep very well most nights waiting for the pager to go off.
This year has had its share of notable outages, and not to dunk on them unnecessarily, but one of the most notable was the CrowdStrike issue. Fortunately, I timed that perfectly because it hit the day I started my
six-week sabbatical. So I wasn't around for any of the nonsense, the running around. I hear about
it now, but I was completely off the internet for that entire span of time. And I could not have
timed that better, to the point where I'm starting to wonder if people suspect that I had a hand in it somewhere. But as best I can tell, it was one of those things
that had a confluence of complicated things hitting all at once, like most large outages do
these days. No one acted particularly irresponsibly, and a lot of lessons were learned coming out of it.
But no one wants to have to go through something like that if they can possibly avoid it.
It's a good thing it didn't happen on the day you were leaving if you had Delta tickets,
because that was a major problem.
It seems so.
It was, it's one of those areas where it's, whenever you have a plan for disasters and
you sit around doing your disaster planning, your tabletop exercises, the one constant
I've seen in every outage that I've ever worked through has been that you don't quite
envision how a disaster is going to unfold.
Most people didn't have "every Windows computer instantly starts in a crash loop on boot" on their list. That just wasn't something people envisioned as being part of what they were defending against. Every
issue I've ever seen of any significant scale has sort of taken that form of, oh, in hindsight,
we should have wondered what if, but we didn't in the right areas. I'm curious what you found
in the report, though, that surprised you the most.
Well, I think it was the pervasive nature of the operational resilience concerns.
You know, that was by far the most surprising.
You know, I will just make a comment on the CrowdStrike outage.
You know, I think that what it represents is a certain, well, first, maybe it helps
to understand CrowdStrike's business model,
which is really quite a huge value proposition for the companies that use it. What they do is
they say, okay, we're sort of a one-stop shop for handling all of the compliance and the security
applied to the very vast and growing surface area that is threatened by cyber attacks. And if anyone listening to this has ever had to
stand up a service in the cloud, the number of hoops you have to jump through is quite intimidating.
And it's only increasing in terms of the scope and the number of boxes you have to check.
And so that growing complexity of the task is made much more tractable by a product like CrowdStrike that not only has a huge
sort of set of capabilities that address all of those threats, but it also is constantly being
updated in order to address the evolving threat landscape. And that's part of what went wrong,
right? Like many companies were allowing CrowdStrike to automatically update, you know,
and just immediately upon
releases coming out, instead of letting them bake a little bit and letting somebody else
find out the hard way that the update might have a problem in it. And it was kind of a simple
programming error. But like, this is just an example of one of these things where you kind
of have to trust this technical monoculture, which was CrowdStrike's ability to protect these
Windows machines from cyber threats. Because if you don't trust somebody else, every single company out there
has these same problems. And most people are going to address them very poorly without trusting
CrowdStrike's technical competence and their economies of scale and so forth. Of course,
that same thing applies writ large to the hyperscalers, right? These are massive technical
monocultures. And by the way, any one of those three companies, AWS, GCP, and Azure, is better than probably any other company in the
world at running secure data centers and services and the whole substrate, which we call the public
cloud these days. Each one represents a very exceptionally fine-tuned and expert level technical monoculture. But nevertheless,
it's a technical monoculture, right? So if something's wrong with one of these, it can be
quite systemic. And just like with CrowdStrike, it was a very simple programming error, which honestly should have been caught, but, you know, shit happens, right? Everyone knows that. And when you look at the increasingly complex
way that any modern application is deployed using just a bunch of different cloud services put
together and so forth, all of those services and pieces of infrastructure, they're relying on the
trust on whichever vendor that is putting things together properly, protecting against cyber
threats, dealing with their own kind of lower level minutia of managing resilience and scale and not going down. And you have to put
honestly tens or hundreds of these things together in a modern service that's being stood up.
And so the only way to really prepare for the unknown unknowns, like which one of these things
is going to fail on you, is diversification. You know, the companies, for example, that had more than Windows running,
this is, you know, the CrowdStrike thing
is just one small example.
You know, they had Windows, Mac,
and Linux machines, for example.
They certainly didn't have as much of an outage
as folks whose organizations relied only on Windows.
Again, a little bit of a facile example,
but it's one of the reasons that companies
are eager to embrace a multi-cloud strategy, for example.
One of the challenges, unless you do that very well
with embracing a multi-cloud strategy
to eliminate single points of failure
is you inadvertently introduce
additional single points of failure.
We want to avoid AWS's issues,
so we're going to put everything on Azure for our e-commerce site, except Stripe, which is all in on AWS. So now we're exposed to both Azure's issues and we can't accept anyone's money when AWS goes down as well. When you conducted the surveys for this report, did you find a sense of safety in numbers? As in, when the
CrowdStrike issue happened, to continue using an easy recent example, the headlines didn't say
individual company A or individual company B was having problems. It distilled down to,
computers aren't working super great today and everything's broken. Whereas if people are running
their own environments and they experience an outage there, suddenly they're the only folks
who are down versus everyone. Is there a safety in numbers perception? Oh, 100%. I mean, that is
one of the big reasons to use the public cloud. You're not going to get fired if one of the big
hyperscalers has a regional cloud outage because you're not the only company that went down when,
say, US East disappeared from the DNS, right? It was a huge, huge list of companies. Now,
the problem, of course, with that
is that that safety in numbers really applies to the larger pack of smaller companies. Once you
get over a certain size and you have really mission-critical applications and services that
consumers rely on and will bitterly complain on x.com when the thing goes away,
then the safety in numbers argument wears a little bit thin.
So those bigger companies, those enterprises with the mission-critical estates, they actually have
to think beyond where we can just make safe technology choices, rely on big vendors that
are safe, quote unquote, safe choices. Ultimately, the best ones, I mean, not everyone does it right, as you'll read in this report that we have. I think a lot of companies feel
unprepared here. But the ones that are leaning forward the most, sort of the innovators,
to use the sort of crossing the chasm idea, the innovators and the early adopters,
those kinds of companies are, you know, the ones that really do, you know, for example,
embrace multi-cloud as an example, and seek to have that sort of diversification and much more in-depth planning and adopt the latest infrastructure that is looking
to exploit the cloud to have a higher degree of resilience and, for example, more scalability
that's sort of elastic so that you don't have a success disaster of too many people essentially
using your service and creating a denial of service kind of condition. So yes, you're totally right. Boy, I mean, running an application across
multiple clouds actively is not for the faint of heart. But it is one of those things that
the best companies are actually already starting to do. And as they sort of pioneer that,
it's like companies like Cockroach Labs and the hyperscalers and, you know, I don't
know, hundreds, if not thousands of other vendors, they all kind of start to make that easier,
right? Just like CrowdStrike, for example, helps companies manage the complexity of all of these
different security issues across the big surface area, expanding surface area. Companies like
Cockroach can help with the database, making that easy, for example, to run
the database across and replicate actively across multiple cloud vendors. Now, that's not something
that databases were expected to do 10 years ago. Now that there are some early adopters that are
pushing in that direction, that kind of paves the way for the larger crowd to come along when that
becomes more economical and a lot simpler, where the complexity is
sort of transparently handled by the vendor.
Unplanned disruptions to your database can grind your business to a halt, leave users
in the lurch, and bruise your reputation.
In short, downtime is a killer.
So why not prevent it before it happens with CockroachDB, the world's most resilient database
with its revolutionary distributed SQL architecture that's designed to defy downtime and keep your
apps online no matter what. And now CockroachDB is available pay-as-you-go on the AWS marketplace,
making it easier than ever to get started. Achieve the
resilience your enterprise requires without the upfront costs. Visit cockroachlabs.com
slash last week to learn more or get started today on the AWS Marketplace.
For those who may not be aware, I spend my days, when I'm not talking to a microphone indulging my love affair with the sound of my own voice, as a consultant fixing horrifying AWS bills for very large companies. So I have a bias where I tend to come at everything from a cost-first perspective. In theory, I love the idea of replicating databases between providers.
If you're looking at doing something that is genuinely durable and can exist independently upon multiple providers simultaneously, then the way that the providers charge for data egress seems like it's sort of the Achilles heel to the entire endeavor, just because you will pay very dearly for that egress fee across all of the big players. No, you're absolutely correct. I'll give you a couple takes on that perspective,
which is, it is sort of a ground truth.
There are mitigations and ultimately strategies
that transcend the problem of economics here.
Sort of just in terms of the base reality today,
when your mission critical use case is valuable enough,
then you'll pay those egress costs, right?
The economics actually makes sense
because the cost of downtime is so extraordinary and also the cost of reputation and brand and so
forth. So for example, let's say you're one of the biggest banks in the world and you have a huge
fraction of US retail banking customers. You might very well consider the cost of replicating across
cloud vendors and paying those egress fees to be a fair cost-benefit
analysis. Oh yeah, very much so. To the extent that that actually starts to happen, you know,
you can negotiate with the vendors to give you relief from those egress costs. That is half of
our consulting, doing the negotiation of these large contractual vehicles with AWS on behalf
of customers. And yeah, at scale, everything is up for negotiation,
as it turns out.
Absolutely.
And then, of course,
there are technical solutions
that use other vendors.
So you can do these direct connects.
You can use things like Equinix
and Megaport and so forth.
And you can actually connect.
And this is also very important
if you're going to do something
that's hybrid in terms of replication
across private clouds
and public clouds and so forth.
You really need to think about hooking up essentially your own direct connections.
And you can obviate some of those egress costs.
And of course, vendors like Cockroach in our managed service can do that.
And in those kinds of direct connect scenarios,
you actually just get a certain amount of bandwidth that can be used.
And that becomes quite economical if you fill those pipes.
If you over-provision that and you're barely using it, then you might pay more than the
egress costs, right? So there are opportunities there to really mitigate the networking costs. And then of course, you know, one thing we like to say is that resilience is
the new cost efficiency. So, you know, that kind of goes back to that earlier point of like how
valuable is the use case and what are the consequences of it going down.
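To make the "economical if you fill those pipes" point from a moment ago concrete, here is a back-of-envelope sketch in Python; every rate and traffic figure in it is a placeholder assumption, not an actual AWS, Equinix, or Megaport price.

# Back-of-envelope only: metered per-GB egress vs. a flat-rate dedicated connection.
EGRESS_PER_GB = 0.09          # hypothetical metered egress rate, dollars per GB
DEDICATED_MONTHLY = 2_000.0   # hypothetical flat monthly cost of a direct connection

def monthly_costs(gb_replicated: float) -> tuple[float, float]:
    # Metered cost scales with traffic; the dedicated pipe costs the same
    # whether it sits mostly idle or runs full.
    return gb_replicated * EGRESS_PER_GB, DEDICATED_MONTHLY

for gb in (5_000, 50_000, 250_000):
    metered, dedicated = monthly_costs(gb)
    winner = "dedicated pipe" if dedicated < metered else "metered egress"
    print(f"{gb:>8,} GB/month: metered ${metered:>9,.2f} vs flat ${dedicated:,.2f} -> {winner}")

Under these made-up numbers the dedicated pipe only wins once you push enough replication traffic through it, which is the over-provisioning caveat in reverse.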
But in this report we just put out on the state of resilience, the numbers are a little eye-watering.
I mean, 100% of the 1,000 companies we surveyed reported financial losses due to downtime.
So 100%, nobody escapes this.
Large enterprises lost an average of almost $500,000, so half a million dollars, per incident. And these things on average were 86 incidents per year. And so, when you think about your whole foundation, certainly as you migrate more legacy use cases or build greenfield kinds of things, it does make sense to think about spending to embrace the innovation that's available and obviate some of these mounting costs. I think a much worse strategy would be
to accept all this new complexity to build the latest and greatest. And by the way,
throwing AI into everything is certainly on most people's roadmaps. You got to get it into this
complex ecosystem. You're calling out to these LLMs, and everything's expensive.
All kinds of things can break because you're just increasing the complexity.
If you don't try to manage that, and you just lift and shift the old stuff while bringing in more and more new stuff, in other words, if your foundation isn't improving as you add additions and new stories to your house, you're only going to exacerbate the problem, right? So you really do have to embrace that. And ultimately, the cost
savings for this sort of mounting toll of resilience
disasters, that is a good argument to invest a little bit in the short term for a long-term
reward. It feels like whenever you're talking about operational resiliency, it becomes a full
stack conversation at almost every level. Outages where we had a full DR site ready to go, but we could not make a DNS record change due to the outage in order to point to that DR site, loom large in my memory. Having a database that's able to abstract that away sounds great. An approach that I've seen work from the opposite direction, for some values of work, has been the idea that you handle it at the application layer and move everything up into code. That solves for a bunch of historical problems with databases that don't like to replicate very well at the
cost of introducing a whole bunch more. The takeaway that I took from all of this has been
that everything is complicated and no one's happy. We still have outages. We still see a bunch of
weird incidents that are difficult to predict in advance, if not impossible. In hindsight, they look blindingly obvious with that benefit of hindsight. It's nice to know that at least the executives at these large companies feel that as well, and that the answer to "so what is the reason that you had those outages?" isn't "crappy IT people." I did not see that
as a contributing factor in virtually any part of the report that I scanned, but I may have missed that part.
Oh yeah.
I don't think people blame their staffs.
I mean,
it is an overwhelming challenge and no matter what you do,
whether you're migrating and trying to modernize or you're building just from
scratch with the best of breed selection of technologies,
you're going to have new problems.
It's kind of like the devil you know versus the devil you don't.
But I think there is an opportunity to sort of make incremental progress that really can
address some of the things that are becoming unsupportable, you know, just because you have
too many pieces cobbled together. And so when you're doing everything in your own data center,
everything was under your control, things changed very slowly. There was one set of concerns.
You didn't need this new infrastructure and distributed capabilities and so forth.
But as things are moved into the public cloud and everything is shifting and there's all these different things connected and all are introducing their own points of failure, you have to kind of move with the times, right?
So you can't accept that the old
way of doing things is kind of the same as the new way, even though the new way is not going to
remove all the problems. And in fact, we'll introduce some things you haven't seen before.
There is an empirical experience for our customers, at least, that you can move beyond
some of the things that are causing an unacceptable number of outages.
For example, availability zones going away
or nodes dying or networks having partial partitions.
Those are the kinds of things
that with a distributed architecture,
you can work around.
And also things like disk, like Elastic Block Store in AWS having high-latency events, say once every million writes. That might not have been a problem that anyone had on their radar,
but it sure afflicts you when you are moving a huge application into the public cloud.
And so how do you deal with that? Well, on a legacy database, you really are kind of stuck
on that EBS volume. And if it misbehaves underneath you, your end application is going to experience that pain. But with a distributed architecture,
there's all kinds of interesting things you can do with sort of automated failover between multiple
EBS volumes and across multiple nodes and across multiple facilities and across regions and even
cloud vendors. The right way to think about it is in this new world, you actually have the
opportunity to define what kind of an outage you're looking to survive automatically.
So it's kind of like redefining your threshold for what's a disaster, where you're going to have a recovery step and postmortems for all the affected applications.
You kind of move that threshold forward.
You say, we're going to be able to survive an availability zone going away. And it's going to have this additional cost.
Or we're going to survive an entire region going away.
Like this frequently happens for whatever reason.
Sometimes it's DNS.
But once a year, you see these things and you see all the companies that are affected by it.
You can actually have your entire application survive that
if the application is diversified across multiple regions.
That means your application code is running in those multiple regions. And your database has replicas of the data in those different regions,
and the whole thing needs to be tested. That's the trick.
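As a rough sketch of that "decide what you want to survive" idea, here is a hypothetical check in Python; the regions, zones, and replica placement are invented for illustration and do not reflect a real CockroachDB configuration.

from collections import Counter

# Each replica lives in a (region, availability zone) pair; this placement is made up.
placement = [
    ("us-east", "a"), ("us-east", "b"),
    ("us-west", "a"), ("us-west", "b"),
    ("eu-west", "a"),
]

def survives_loss_of_any(placement, domain_of):
    # A write needs a majority of all replicas, so check the worst case:
    # losing whichever single failure domain holds the most replicas.
    total = len(placement)
    quorum = total // 2 + 1
    replicas_per_domain = Counter(domain_of(replica) for replica in placement)
    worst_loss = max(replicas_per_domain.values())
    return total - worst_loss >= quorum

# An availability zone is the full (region, zone) pair; a region is its first element.
print("survives losing any one AZ:    ", survives_loss_of_any(placement, lambda r: r))
print("survives losing any one region:", survives_loss_of_any(placement, lambda r: r[0]))

With five replicas spread across three regions, both checks come back True, which is the "move the threshold forward" outcome described above; put everything in a single region and the second check fails.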
That is the absolute trick. And when you read the report, you'll see that most of those surveyed are not very prepared to handle outages. Only about 20% of companies report being fully prepared,
33% had structured response plans.
And less than a third regularly conduct failover tests.
From my perspective, it's always been valuable to run in an active-active configuration
if you need both ends to work correctly.
Otherwise, we tested it a month ago.
Great.
There have been a bunch of changes to your code base since then.
Have those been tested?
You have further problems dealing with the challenge of knowing when to go ahead and flip a switch of, OK, we're seeing some weird stuff. Do we activate the DR plan or do we just hope that it subsides? So much of the fog of the incident is always around what's happening. Is it our code? Is it an infrastructure provider? What is causing this? And do we wind up instigating our DR plan? Because once it's started, it's sometimes very hard to stop or to fail back.
That's a huge point.
And it's one of the reasons that less than a third regularly conduct failover tests.
Because often conducting a failover test means that you initiate an outage.
Because that's how most disaster recovery and active-active failovers work.
Even though, to your point, active-active, and sort of like a traditional Oracle GoldenGate setup,
it does allow you to be testing both ends
of your primary and your secondary, so to speak,
because they're both actively taking reads and writes
and so forth.
So they're participating.
So you know they both work.
You know they're both there.
You know they're both reachable and so forth.
But if you really wanted to test what happens
if, for example, you make it so
one of your locations that has one of these replicas is no longer visible, all kinds of
other things can go wrong. Plus, in order to do that, you may actually end up not having the full
commit log of transactions in the database replicated. So you might actually create the
conditions where there's some data regression
or even data loss.
And so people are loath to embrace
that kind of testing on a regular basis
because it can be so disruptive.
But you do need to.
If you don't turn off one of those data centers,
you don't really know how your application
might interact or other components
that are dependent on that data center
that you just totally forgot about. Someone put in some new message queue thing that was only running in
that one place. And now that message queue is down and the whole system backs up. These are
inevitable problems, right? If you don't test them. The beauty of, and I'll give Cockroach
another plug here, of a sort of modern replication configuration like Cockroach, which is called
consensus replication, is that you don't just have sort of a primary and a secondary in an active-active configuration. You would have
three or more replication sites, and you only need the majority of them. So if you have three,
you need two of them to be available. If you have five, you need three of them to be available.
Odd numbers are very important for this, to avoid split brain.
Exactly. Or if you have four, you can do four, but that means that you don't really get much benefit from it
because you need three always to be up.
You just need the majority.
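To make the majority math concrete, here is a minimal sketch of that quorum arithmetic in Python; it is purely illustrative and not code from CockroachDB itself.

def quorum(replicas: int) -> int:
    # Minimum number of replication sites that must be reachable to make progress.
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    # How many replication sites can be lost while a majority still remains.
    return replicas - quorum(replicas)

for n in (3, 4, 5):
    print(f"{n} replicas: need {quorum(n)} up, can lose {tolerated_failures(n)}")

# 3 replicas: need 2 up, can lose 1
# 4 replicas: need 3 up, can lose 1  (a fourth replica adds cost but no extra tolerance)
# 5 replicas: need 3 up, can lose 2

Which is exactly why the even-numbered configuration buys you little: four replicas still tolerate only a single failure.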
So Cockroach can handle
all of those different configurations,
but the beauty is you can actually turn off
any of these replication sites,
whether it's a node
or it's an availability zone
or it's a region
or it's a cloud vendor.
And you have a total expectation
that there's not going to be
any kind of data loss or data regression or anything. That's just how the system works.
It's not the sort of asynchronous, conflict-resolution-prone, old-fashioned way of doing
things. It's a new kind of gold standard that does let you do this testing in situ with very
real world scenarios. And that can, I think, change these statistics
for companies, right? That less than a third regularly conduct failover tests when you need
to regularly conduct these failover tests. And by the way, that's still not going to get you to 100%.
It just won't. There's things that you can't imagine that you wouldn't have tested for,
but you can get a lot closer to the 100%. Is there hope? This, I guess, is my last
question for you on this, because a recurring theme throughout this report is that folks are worried, folks are concerned about outages,
about regulatory pressures, about data requirements. It feels that fear is sort of an
undercurrent that is running through the industry right now, particularly with regards to operational
resiliency. Are we still going to be having this talk in a year or two and nothing is going to
have changed? Or do you see there being a light at the end of the tunnel?
It's a great question.
I think that the anxiety is never going to go away.
I mean, recalling my reference to the crossing the chasm idea really is how people adopt
new technology, right?
You have these innovators and early adopters and then the early majority, and then you're
kind of at the halfway point in the distribution.
And on the other side of that, you have the late majority and the late adopters
and companies that just never changed.
They're on a zero-maintenance diet.
I think you're going to have that inevitably.
And the complexity is going to keep increasing.
So you're always going to have
probably a healthy majority of companies
that are behind the curve, which sounds so pejorative,
but it is nevertheless the
case that we're in a rapidly evolving landscape, threat landscape, complexity landscape, but also
capabilities and potential for new markets and expansion and growth in any one of these companies.
It's sort of a mix of exciting and anxiety-inducing. I think in several years, we're going to be having
the same conversation. It won't be the same kinds of technology and the same kinds of threats.
Those will have all evolved quite a bit. Meanwhile, the things that we're talking about now
will have penetrated deep into that early majority and probably into some of the late majority.
But it'll again be these forward-leaning companies that are tackling the newest threats with innovation,
whereas most of the rest of the companies out there will kind of be like, huh? Well,
that sounds like something that maybe we'll get to, but boy, we're struggling still with the
last year's problems or last decade's problems in some cases.
I do get a strong sense of urgency around this. It feels like there's a growing awareness here
where companies will not have the luxury of taking a wait and see approach. There is urgency. We're
seeing it across our business. And I think that urgency right now is, you know, part carrot and part stick. I don't know if that's quite the right metaphor, but people have an
urgency to modernize partly because they want to have all the benefits that they think AI is going to
bring to their use cases and to their customers. And there's quite a bit of an excitement there.
I think there is an increasing degree of urgency, and anxiety,
because regulators are starting to look at this with a fairly acute perspective. And I'll mention there's a new regulation over in the EU and in the UK
called the Digital Operational Resilience Act. And they're actually looking hard at companies
in terms of critical services and infrastructure. So for example, if you're in banking or in
utilities, where people rely on your service, now the regulators are starting to assess what
your plans are to survive
these different kinds of outages that I was describing.
Like what happens if your cloud vendor goes away
or is deemed unfit for some systemic security
or cyber threat?
How long does it take you to move your service
or to reconstitute in the event of a widespread failure?
And those answers right now are not very good
across those critical industries,
but they're going to get better.
And then that moves the sort of state of the art and moves the regulators' expectations.
But of course, it creates a lot of anxiety. And the teeth on these regulations, kind of like the
GDPR, are pretty extreme. So there's a big stick. I mean, it's rarely used, but ultimately,
there's a growing realization of the costs of this complexity and the implications of what that means for society. That's the perspective that the regulations are being fashioned from. And when you're in one of these industries and you've got
your budgets and you got all your interesting new projects to try to grow your market share and so
forth, now you've got a host of new requirements from the regulators. So it's one thing if you're
the kind of company that is fairly on top of the innovation, has made big progress towards
migrating your whole estate to more modern technologies and infrastructures. But that's
a small fraction of all the companies out there. So yeah, you're right. There's a lot of anxiety.
And I think it's because in the interest of doing things less expensively and doing things more quickly, building new
services more quickly, there's been a lot of additional complexity that leads to failure
in unexpected ways. And by the way, AI is just going to make all these things worse because
cyber threats are definitely going to, I think, grow exponentially with the ability to automate.
And so, you know, I think that we're going to see this anxiety continue at pace. I don't know if
it's necessarily going to get worse, because no matter what the regulatory frameworks require of companies, they can only require so much so quickly, right? They can't sort of break the system's back.
Companies have always been highly incentivized to avoid outages. If they could be said to have a corporate religion, it's money. They like money, and as you cited, every outage costs them money, so they don't want to have those. The question becomes, at what point are the efforts that individual companies are making no longer sufficient? And I don't necessarily know that there is a good answer to that.
No, and like I said, it's rapidly evolving. I think that, to your point again about being in a crowd, right?
It's a question of whether you're in the middle of that pack.
If you are, you're pretty safe, I'd say.
So you have to look at your peers and just decide whether you're going to get undue scrutiny
and that's going to impact your brand or your bottom line.
Because it's not just regulators, of course, that look at that.
It's your customers.
Are they going to go to a competitor
if they feel that you're not giving them
the kind of service and the trust
that they placed in you is being eroded?
I really want to thank you for taking the time
to speak with me about all of this.
If people want to learn more
or get their own copy of the report,
where's the best place for them to find it?
Let's see, where is that report? I mean, listen, I would go to our website. It's all over that. So it's cockroachlabs.com. You will easily be able to find it. It's called
the State of Resilience 2025. Of course, you could just search on Google for that.
Or we'll just be even easier and put a link to it in the show notes for you.
That works too. Thank you for doing that.
Thank you so much for being so generous with your time. I really appreciate it.
My pleasure, Corey. Thank you for having me on.
Spencer Kimball, CEO and co-founder of Cockroach Labs. I'm cloud economist Corey Quinn, and this
is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your
podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star
review on your podcast platform of choice, along with an angry comment. And tell us, by the way,
in that comment, which podcast platform you're using because that at least
is a segment
that understands the value
of avoiding a monoculture.