Screaming in the Cloud - Is It Broken Everywhere or Just for Me with Omri Sass
Episode Date: January 22, 2026

When your website stops working at 3 AM, you need to answer one question fast: is it my code, or is a big cloud provider having problems? Omri Sass from Datadog explains updog.ai, a tool that monitors whether major services like AWS, Cloudflare, and others are actually working. Instead of asking people to report problems like Down Detector does, updog uses real data from thousands of systems to detect when services go down. Omri shares why this took six years to build, how Datadog processes massive amounts of data with machine learning, and why cloud providers have been strangely upset about these tools existing.

About Omri: Omri Sass is a Director of Product Management at Datadog, where he leads and supports a team of 25+ product managers driving initiatives across Bits AI SRE, Data Observability, Service Management, and most recently, the launch of updog.ai. Outside of work, Omri is an avid sci-fi reader, a dedicated yoga practitioner, and happily outmatched by his cat.

Show Highlights:
(02:12) What Is Updog and How Does It Work
(03:38) Why Knowing If It's a Global Problem Matters
(04:01) The Problem With Testing Every Endpoint Yourself
(05:52) How Datadog Discovered EC2 Outages From Their Own Systems
(10:38) When AWS Regions Go Down and Cascade Failures
(13:13) What Happens When Services Rebuild Completely
(16:29) The Most Important Learning During a 3 AM Incident
(20:11) Why This Took So Long to Build
(23:40) When Datadog Going Down Isn't Critical Path
(25:22) How They Picked Which AWS Services to Monitor
(27:07) What Comes Next for Updog
(30:11) Where to Find Omri and Updog

Links:
Datadog: datadoghq.com
Omri's LinkedIn: https://www.linkedin.com/in/omri-sass-65632a14/
Sponsored by: duckbillhq.com
Transcript
I would say it's always best effort.
So we learn based on the knowledge that we have, and when we need to adapt our knowledge,
we do our best effort to adapt it.
Our investment in this area is pretty good, and like we have people who do ongoing maintenance
and continuously look at model improvements.
Welcome to Screaming in the Cloud.
I'm Corey Quinn, and I am at long last thrilled to notice that something in this world exists
that I've wanted to exist for a very long time.
We'll get into that. Omri Sass has been at Datadog for something like six years now.
Omri, thank you for joining me.
Thanks for having me, Corey.
This episode is sponsored in part by my day job, Duck Bill.
Do you have a horrifying AWS bill?
That can mean a lot of things.
Predicting what it's going to be, determining what it should be,
negotiating your next long-term contract with AWS,
or just figuring out why it increasingly resembles a phone number,
but nobody seems to quite know why that is.
To learn more, visit duckbillhq.com.
Remember, you can't duck the duck bill bill,
which my CEO reliably informs me
is absolutely not our slogan.
So you are apparently the mastermind, as it were,
behind the recently launched updog.ai.
So forgive me, what's updog?
I, you know, I was expecting the conversation
to start like that, and I have to say,
it's definitely not me being the mastermind there.
I joined Datadog, like you said, about six years ago.
This thing has been in the making.
Some folks would say even before that.
I'm happy to share a bit of the history later.
But I joined the Applied AI group here at Datadog a couple of months ago
while this project was already ongoing.
So I do have to give credit where credit is due.
We have an amazing Applied AI group here, data scientists, engineers, and a product manager for whom this has been a passion project for quite a while. And so I'm not the mastermind. I'm just the pretty face. And it is an impressive beard, I will say. I'm envious. I can't grow one myself. For those who may not be aware of
the beautiful thing that is Updog, how do you describe it? So Updog is effectively Down Detector. If you're familiar with that, it's a way of making sure that very common SaaS providers are actually up, but Updog is powered by telemetry from people who actually use these providers.
It's not a test against all their APIs or anything like that.
It's also not user reported like Down Detector tends to be. And I have to say, it was awfully considerate of you that on the day we're recording this, Cloudflare has been taking significant outages for most of the morning. In fact, right now as we speak, there's a banner at the top of updog.ai: Cloudflare is reporting an outage right now. Updog is not detecting it, as it's on API endpoints that are not watched today. We are working on adding those API endpoints to our watch list. Now, this sounds like a no-brainer. I have been asking various monitoring and observability companies for this since I was a young sysadmin. Because when suddenly your website no longer serves web, your big question is: is it my crappy code, or is it a global issue that is affecting everyone?
And it's not because whose throat do I choke here.
It's what do I do to get this thing back up?
Because if it's a major provider that has just gone down, there are very few code changes
you are going to make on your side that will bring the site back up.
And in fact, you could conceivably take it from a working state to a non-working state.
And now you have two problems.
Conversely, if everything else is reporting fine, maybe look at what's going on in your specific
environment.
For folks who have not lived in the trenches of 3 a.m. pages, that's a huge deal. It's effectively bifurcating the world in two. That is precisely right. And I'll say
one of the biggest reasons that it took so long, at least for us to get there, is the
understanding that we can't just test all of these endpoints. There's a reason, like you mentioned,
Down Detector uses user reports. If we were to run synthetic tests, for example, against all of these endpoints ourselves, and we report something as down, we now need to verify that the thing being down is the actual website and not our own testing that is no longer correct.
And it's worse than that, because you take a look at any provider, but take AWS, as I often am forced to do. And you go, oh, okay, well, is EC2 working in US East 1? That's over a hundred data center facilities. At that scale, it's not a question of, is it down? It's a question of, how down is it? You can have 100 customers there, and five are saying things are terrific and five are saying they're terrible, and the rest are at varying degrees between those two points, just because it's blind people trying to describe an elephant by touch.
That is exactly right.
And what you just described is the realization that we had about the asymmetry of data.
And rest assured, that's probably the word with the most syllables I'm going to use today.
That's above my IQ grade.
But what you just described is exactly the realization that we had about the asymmetry
of data. We have more data than any individual one of our customers, i.e., we have all of the blind
people touching the elephant at the same time and not needing to describe it, right? We have the
sense of touch for all of these folks. And what we do is actually looking at this data in aggregate
and using it to try to understand whether all of these endpoints are up or down. Now, let me try to
make that slightly more real, when we started going along this journey, our realization was that
when EC2 is down, that's actually the specific example. When EC2 was down, the load on our watchdog
backend, watchdog is our machine learning anomaly detector, increases significantly because everyone
has a higher error rate and higher latencies and a drop in throughput. And so our back end had to
compensate for that, and we saw a surge in processing power. So we were like, hey, we're not looking at customer data here, this is purely within our systems, but something is definitely going on in the real world. It's not a byproduct of anything that we're doing. It's not tied to any change that we've made. Our systems are well-functioning. And through investigating that, we realized that it's actually tied to EC2. And then we started figuring out, wait, what are the
most common things that people rely on that are observed with Datadog.
And if you look at the updog.ai website today, that's also a really easy way to see what
people use Datadog to monitor of their third-party dependencies, because it's just the top
API endpoints that we observe.
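To make that aggregation idea concrete, here is a minimal sketch of pooling many customers' error rates against a shared endpoint into one fleet-level signal. It is only an illustration under assumed record shapes and thresholds, not Datadog's actual Watchdog implementation.

```python
from collections import defaultdict

# Hypothetical telemetry records: (customer_id, endpoint, error_rate) for one
# time window. Thresholds and names are illustrative, not Watchdog internals.
def fleet_health(records, baseline, degraded_ratio=3.0, min_customers=20):
    """Flag an endpoint as degraded when many independent customers see
    error rates well above their historical baseline at the same time."""
    by_endpoint = defaultdict(list)
    for customer_id, endpoint, error_rate in records:
        by_endpoint[endpoint].append((customer_id, error_rate))

    status = {}
    for endpoint, samples in by_endpoint.items():
        base = baseline.get(endpoint, 0.01)  # assumed historical error rate
        elevated = [c for c, rate in samples if rate > base * degraded_ratio]
        enough_signal = len(samples) >= min_customers
        # Require a majority of observed customers to be elevated so one
        # customer's bad deploy doesn't look like a provider outage.
        if enough_signal and len(elevated) > len(samples) / 2:
            status[endpoint] = "degraded"
        else:
            status[endpoint] = "ok"
    return status

records = [
    ("cust-1", "ec2.us-east-1.amazonaws.com", 0.42),
    ("cust-2", "ec2.us-east-1.amazonaws.com", 0.38),
    ("cust-3", "ec2.us-east-1.amazonaws.com", 0.01),
]
print(fleet_health(records, baseline={"ec2.us-east-1.amazonaws.com": 0.02},
                   min_customers=3))
# -> {'ec2.us-east-1.amazonaws.com': 'degraded'}
```

The design point is the asymmetry described above: any single customer's error rate is ambiguous, but agreement across many unrelated customers is a strong signal.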
I'm curious on some level, like effectively what I care about on this today, for example,
there are a bunch of folks that wound up looking at this now.
You're right.
You don't have the Cloudflare endpoints themselves.
but Adyen is first alphabetically.
They took a dip earlier.
AWS took a little bit of one that I'm sure we'll get back to in a minute or two.
PayPal was down the drain.
Open AI had a bunch of issues.
X is probably the worst of all of these graphs as I look at that,
formerly known as Twitter.
And great, this is a high-level approach to what I care about. I almost want to take it one level beyond this in either direction, where you just give me a traffic light. Is something globally being messed up right now? Like earlier today, when multiple services are all down in the dumps, that's what I want to see at the top. Yep, things globally are broken. Maybe it's a routing convergence issue where a lot of traffic between Oregon and Virginia no longer routes properly. Maybe it's that an auth provider is breaking and everything is out to lunch at that point. You almost just want to see at a high level, with no scroll, above the fold: what's broken right now?
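As a rough sketch of that traffic light idea, assuming per-provider statuses like the ones on the updog.ai cards were already computed, something like the following would collapse them into one signal. The thresholds are invented for illustration only.

```python
def global_traffic_light(provider_statuses):
    """Collapse per-provider statuses into one red/yellow/green signal.
    provider_statuses maps provider name -> "ok" | "degraded" | "down".
    The cutoffs are arbitrary illustrations, not updog's real logic."""
    down = sum(1 for s in provider_statuses.values() if s == "down")
    degraded = sum(1 for s in provider_statuses.values() if s == "degraded")
    if down >= 2 or (down >= 1 and degraded >= 2):
        return "red"      # multiple systemically important providers broken
    if down == 1 or degraded >= 1:
        return "yellow"   # something is off, but likely localized
    return "green"        # the internet is probably not your problem

print(global_traffic_light({"AWS": "degraded",
                            "Cloudflare": "down",
                            "OpenAI": "down"}))
# -> "red"
```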
It's amazing that you should say that.
I know some people who work on this.
I can file a feature request for you.
Oh, honestly, yeah.
Oh, yeah.
I was either going to ask you on social media or ask you directly here, in a scenario in which you cannot possibly say no.
Oh, I'm a mean bastard.
But that one's easy for me because we're already working on it.
Because then you go the other direction where AWS right now, like each of these little cards are clickable.
So I click on one and it talks about different services, DynamoDB,
Elastic Load Balancing, Elasticsearch, and so on and so forth. But I'm not seeing, at least at this
level here or anywhere else, for that matter, anything that breaks it down by region. And AWS is
very good at regional isolation. So I'm of two minds on this. On the one hand, it's, well, Stockholm is broken. What's going on? Yeah, there are five customers using that, give or take. I exaggerate. Fine, it's great as a big picture: what is going on, regardless of the intricacies of any given provider's endpoints. On the other hand, it'd be nice to have more granularity. So I can see both sides.
We're working towards that as well.
The first release of Updog, which I think was fortunate timing for us and for our users,
is basically where we are today.
We're investing heavily in improving granularity and coverage.
So to your point about Cloudflare, we went with the most common APIs that people actually look at.
If you look at some of the error messages that people have been posting on the airwaves formerly known as Twitter, you'll see that they mention a particular endpoint.
I think it's challenges.cloudflare.com.
That one's as far as I know not a documented API, and I'm sure I'm going to be corrected about that later.
But it's not one that is commonly observed by our users.
And so when we began this journey, we had to go off of available telemetry.
Now that it's public, we can take feature requests, we can learn our lessons, we can improve everything that we're doing, which is exactly what we're going to do. Cloudflare in particular,
we're working overtime to make sure that we account for that.
People accuse me of being an asshole when I say this, but I'm being completely sincere
that I have trouble using a cloud provider where I do not understand how it breaks. And
people think, oh, you think we're going to go down. Everything's going to go down. What I want to
know is what's that going to look like? Until I saw a number of AWS outages, I did not believe
a lot of the assertions they made about hard region separation.
Because we did see cascading issues in the early days.
It turns out what those were is, oh, maybe if you're running everything in Virginia and it goes
down, you're not the only person trying to spin up capacity in Oregon at the same time,
and now you have a herd of elephants problem.
Turns out your DR tests don't account for that.
Oh, yeah.
And I think that there's a couple of famous stories about people who realized that they should do DR ahead of other people and basically beat the stampede, so that their websites were up during one of
these moments. And then everyone else came up and were like, hey, we're out of capacity. Like,
what's going on? So there's a couple of like famous stories like that through, through history.
And I'll just say the example that you just gave, I think, is one that AWS, you know, they've told this story quite a bit. And regional isolation today, I think, makes sense. But even if you look at the US East 1 outage from a couple of weeks ago, a lot of the managed services were down,
not necessarily because the service itself was down. You know, I'm not involved, I wasn't there, but my understanding was that DynamoDB was one of the services that were directly impacted, and a lot of their other managed services use Dynamo under the hood. So the cascading failure would happen even if it's not cascading regionally. It cascades logically between the different services, and they have so many of them. So who's to know, but we now have
this information. We can use that and learn and adapt our models based on it. What happens, again,
I'm sure that you're going to be nibbled to death by ducks, and we will get there momentarily, but what happens when that learning becomes invalid? By which I mean, we remember the great S3 apocalypse of 2019. They completely rebuilt that service under the hood, which, my understanding,
is the third time since it launched in 2005, that they had done that. And everything that you've
learned about how it fails back in 2019 is no longer valid, or at least an argument could be made
in good faith that that is the case. So for that, I would say it's always best effort. So we learn
based on the knowledge that we have, and when we need to adapt our knowledge, we do our best effort
to adapt it. Our investment in this area is pretty good. And like we have people who do ongoing
maintenance and continuously look at model improvements. And so if we do something like that,
hopefully we'll catch it. And I will say AWS in particular, but a lot of the providers that you
would see on Up Dog, we have good relationships with them. We would hope that they come to us and
tell us, hey, this thing that you're saying is incorrect or you need to update this. Let's work on it
together. Yes, that is what I really want to get into here. Because historically, when there have been issues, for a while I redirected gaslighting.me to the AWS official status page, because that's what it felt like. Back in the days of the green circles everywhere, I had an in-line transformer that would downgrade things and be a lot more honest about how these things worked, because suddenly it would break in a region and it would take a while for them to update it.
I understand that. It's a question of how do you determine the impact of an outage, how bad is it? When do you
update the global status page versus not. And there are political considerations with this, too.
I'm not suggesting otherwise. But it gets back to that question of, is something on your side
broken or is it on my side that's broken? And every time this has happened, they get so salty
about down detector and anything like it to the point where they've had VPs making public
disparaging commentary about the approach. I'm not looking for the universal source of truth for
things. That's not what I go to down detector for. I'm not pulling up downdetector.com to look for a
root cause analysis to inform my learnings about things. Especially since, with that, it's just user reports. So I'm sure today they're trotting that line out already, because there was an AWS blip on Down Detector just because that is what people equate "the internet is broken" with, which is sort of a symptom of their own success. They have been unreasonably annoyed by the existence of these
things, but it's the first thing we look for. It's effectively putting a bit of data around
anyone I know on Twitter, or the tire fire formerly known as Twitter, noticing an issue here
because suddenly things started breaking and I need to figure out again, is it me or is it you?
So there's two things that you said here. First of all, my go to is dumpster fire,
but I'll take tire fire. You know, I learned something new today. Just like the bike shed is
full of yaks in need of a shave. Oh, wow. I'm taking that as a personal offense to my beard. The other thing that you said here, I think, is a realization that is starting to permeate
across the industry. And that is that some people will use this to measure their SLAs and they'll
use this to go to their account teams and complain, demand credits, and things like that. And that's a valid reason for folks like AWS and other providers to maybe, you know, add a bit of salt to their behavior. Maybe just a smidge. But the flip side is that more and more
people come to the same realization that you come to. In the first moment of responding to an
incident, when I'm still orienting myself, in the 3 a.m. case, and the worst one is always at 3 a.m., I'm groggy, I'm still orienting myself, I'm trying to figure out what action to take right now. "Hey, this thing is not even my problem, it comes from somewhere else" is one of the most important learnings that I can grab in a moment, and not waste time on it,
especially given that most people don't think to ask that unless they're very experienced
and have gone through these types of issues.
Because of that, we see more and more people who are actually interested in joining this.
And when we launched, like on the day of launch, our legal team was basically ready for all the inbound salty, angry emails. We didn't get any.
But on the same week that we launched,
we had a provider who's not represented on that page,
reach out to us and say,
hey, we're Datadog customers.
Like, why aren't we even up there?
We're like, wow, come talk to us.
We didn't know you broke.
Yeah, exactly.
This episode is sponsored by my own company, Duck Bill.
Having trouble with your AWS bill,
perhaps it's time to renegotiate a contract with them.
Maybe you're just wondering how to predict what's going on in the wide world of AWS.
Well, that's where Duck Bill comes in to help.
Remember, you can't duck the Duck Bill Bill, which I am reliably informed by my business partner,
is absolutely not our motto.
To learn more, visit DuckbillHQ.com.
On some level, this is a little bit of, you must be at least this big to wind up appearing here.
I know people are listening to this, and they're going to take the first wrong lesson away from it. This is not a marketing opportunity. I'm sorry, but it's not.
This is, this is systemically important providers. In fact, I could make an argument about some
of the folks that are included. Is this really something that needs to be up here? Azure DevOps is
an example. Yeah, if you're on Azure, you're used to it breaking periodically. I'm sorry,
but it's true. We talk about not knowing that it's a global problem. There have been multiple
occasions where I'm trying to get GitHub Actions to work properly, only to find out that it's GitHub Actions that's broken at the time. Totally fair, but we still want to be able
to reflect that to you. But you also don't want to turn this into a doomscroll forever. It would be interesting, almost, to have a frequency algorithm where, when something is breaking right now, like, you sort of have to hope that it's going to be alphabetically supreme, whereas it would be nice to surface that and not have to scroll forever. Again, minor stuff. Part of the problem is you don't
get a lot of opportunities to test this with wide-ranging global outages that impact multiple
providers. So make hay while the sun shines. Exactly. Like, we got two of these since we released this, in a very brief window. And I, you know, as I say this, I might sound like I have a
smile on my face. Obviously, like my heart goes out to all the people who have to actually
respond to those incidents and to every one of their users who was having a rough day. But to your
point, this is a golden opportunity to learn and to make sure that that knowledge is there and available to everyone as equally as possible.
Never let a good crisis go to waste.
That is one of the first adages that I heard our CTO speak when I joined the company,
and it's etched into the back of my brain.
And it's an important thing.
You've overshot the thing that I was asking for,
because as soon as you start getting too granular,
you start to get into "works for me, not for you," and it descends into meaningless noise. All I really wanted was a traffic light web page that you folks put up there, or even a graph. Don't even put the numbers on it, make it logarithmic, just control for individual big customers that you have, and just tell me what your alerting volume is right now, and that's enough signal to answer the question that I have.
This is superior to that because, oh, great, now I know whether it's my Wi-Fi or whether
it's the site that isn't working. Some of these services are reliable enough that if they're not
working for me, my first thought is that my local internet isn't working as well as it used to.
I mean, Google crossed that boundary a while back.
If Google.com doesn't load, it's probably your fault.
Completely agreed.
My question, not to sound rude about this, is: why did this take so long to exist as a product?
When I did kind of the rounds with the team and the director of data science who kind of runs the group, he's an old friend of mine.
We've been working very closely for a long time.
I asked the same question.
And I heard a really funny bit of history. If you Google "Datadog post Pokemon Go," you may find something kind of funny, where in 2016, two engineers here at Datadog, I think they were playing Pokemon Go, it was being played all over, and they realized that there were a bunch of connectivity issues, and everyone, like, people were literally,
like, trying to swap their phones and, like, tap on it and, like, figure out if it was the phone
that was wrong, the Wi-Fi that was wrong, or if Pokemon Go itself, like Niantic servers
were down.
and they basically built a public Datadog dashboard that kept track of a whole bunch of health measures for the Niantic APIs.
And they kind of published that.
It made a splash on then-Twitter. You tell me if it was before or after it turned into a dumpster full of tires on fire.
And that kind of idea of like, hey, we can do this type of public service stuck with the same group.
And then a couple of years later, we released Watchdog, the ML engine that does anomaly detection for us.
And then came that realization that I mentioned earlier where every time that there's a major outage with one of the cloud providers, with one of the main SaaS providers out there, we would see load increase.
And ever since then, we worked on refining the model.
Like, it started with data that we had.
And then it moved to what type of telemetry is the best predictor.
We found that if we take fairly naive approaches, it gets noisy.
It gets actually really noisy.
and we have so many cases where the service isn't in fact down, it's a one-off, or it's something that changed either on our end or in a customer's environment that makes it look like the service is down.
So we had to do a lot of refinement and we ended up building our proprietary model to do this.
So it's an actual ML model that we built, homegrown, doesn't use any of the big AI providers,
anything like that.
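One common way to cut that kind of noise, shown here purely as an illustrative sketch and not as the proprietary model being described, is to require an anomaly to persist across several consecutive windows and across multiple unrelated customers before flagging it.

```python
from collections import deque

class DebouncedDetector:
    """Only report an endpoint as down after the anomaly persists for
    `windows_required` consecutive windows and is seen by at least
    `customers_required` distinct customers in each window.
    A sketch of noise filtering, not Datadog's actual model."""

    def __init__(self, windows_required=3, customers_required=10):
        self.windows_required = windows_required
        self.customers_required = customers_required
        self.history = deque(maxlen=windows_required)

    def observe_window(self, anomalous_customers):
        """anomalous_customers: set of customer IDs anomalous this window."""
        self.history.append(len(anomalous_customers) >= self.customers_required)
        # Down only when every one of the last N windows was anomalous.
        return len(self.history) == self.windows_required and all(self.history)

detector = DebouncedDetector(windows_required=2, customers_required=2)
print(detector.observe_window({"a", "b", "c"}))  # False: only one window so far
print(detector.observe_window({"a", "d"}))       # True: anomaly persisted
```

The trade-off is exactly the latency concern mentioned next: the longer you debounce, the cleaner the signal, but the later you can say something is down.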
It processes a massive amount of data.
I think he threw out the amount of data it processes in a given time; I don't remember. And we had to build a low-latency pipeline for it, because the last thing that we want is to only show that something is down five minutes after it went down. So there's a bunch of these things where we started building
and we're like, oh, this thing seems to work. And when it says something's down, it's mostly down,
but not always. And then it's late. So it has to be high reliability. It has to be decoupled from the rest of Datadog, for if we're down, not that we're ever down. It's my favorite joke to tell in front of users because, you know, in a room full of SREs, you tell someone, our code is perfect, we never have any issues, and everyone starts laughing. That's why Datadog needs to be a provider on this.
And it could; that doesn't even need to be a graph. That could just be a static image.
There we go. We're also working on that. You should know, we try to be fairly transparent. Not the topic of today's conversation, but we have a FinOps tool, Datadog Cloud Cost Management. We put Datadog costs in there by default, at no additional charge.
So we do try to make sure that we accept our place in the ecosystem in that way.
We try to be humble about our place in the ecosystem.
Part of the challenge, too, is that I would argue, and you can tell me if I'm wrong on this,
I don't see Datadog taking an outage as being a critical path issue, by which I mean, if you folks go dark globally, no one's website should stop working because, oh, our telemetry provider is not working, therefore we're going to just block on I/O and it's going to take us down. Sure, they're going to have no idea what their site is doing or if their
site is up at all, but you're a second order effect. You're not critical path. That is very correct.
And to be fair, it's something that allows me to sleep much better at night. But I will say that
there are a couple of good examples of customers who use the observability data, for example, to gate deployments. So if you practice continuous deployment or continuous integration and you don't have observability, suddenly you need to shut down your ability to deploy code. And that may not be, hey, the website is down, but it is considered, in our language, a Sev 1 or Sev 2 incident, like the worst kinds of incidents. And there are also other companies that are not e-businesses.
They have real-world brick and mortar or, you know, some airlines or things like that,
where if they lose observability into some parts of their system, there is a real-world repercussion.
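As a rough illustration of that deployment-gating pattern, here is a minimal sketch; the health check below is a hypothetical stand-in for querying your monitoring provider's API, not a real endpoint or Datadog feature.

```python
import sys

def observability_healthy():
    """Stand-in for a real check against your monitoring provider's API,
    e.g. verifying recent telemetry has been ingested and that the provider
    reports no active incident. Both flags below are hypothetical."""
    telemetry_is_flowing = True   # e.g. last metric seen < 5 minutes ago
    provider_incident = False     # e.g. an updog-style signal says "ok"
    return telemetry_is_flowing and not provider_incident

def gate_deploy():
    """Refuse to deploy while flying blind: if observability is out,
    the rollout can't be verified, so the pipeline stops here."""
    if not observability_healthy():
        print("Observability degraded; holding the deploy.")
        sys.exit(1)
    print("Observability healthy; proceeding with deploy.")

if __name__ == "__main__":
    gate_deploy()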
So we take our own reliability very seriously.
And again, one of those adages that I've heard our CTO say that has been etched in my brain: our reliability target needs to be as high, as strict, as that of our most strict customer.
And that's how we treat it.
I will say our ops game is pretty good.
It has to be at some point.
I do have, again, things that you can bike shed to death if you need to.
AWS, you're monitoring 12 services.
How did you pick those?
Highest popularity among our users.
These are by far the most used AWS services among the Datadog customer base.
Fascinating.
Looking at the list, I'm not entirely surprised by any of these things, with the exception of KMS.
I don't love the fact that that's as popular as it is,
but there are ways to use it for free.
And yeah, it does feel like it's more critical path.
And you're going to see more operational log noise out of it,
for lack of a better term.
I'm sure that right now, the biggest thing that someone at Amazon is upset about internally
is that their service isn't included.
I don't see bedrock on this.
Don't worry, they already have Anthropic's and OpenAI's APIs on here, so they're covered there.
Again, if anyone from the AWS Bedrock team wants to come and talk to us, they know where
to find us.
We have good friends on the Bedrock team.
Oh, yeah.
I still find it fun as well. Like, a couple of these folks, the first one on the list, A-D-Y-E-N, Adyen. I don't, off the top of my head, know who that is or what they do. So this is a marketing story here. Oh, interesting. Oh, good for the Adyen folks. "Payments, data, and financial management in one solution." Okay, now we know. It's basically a pipeline for payments. Yes. And I'm actually willing to bet that you have used them, you know, just like Block, formerly Square. They have devices where you tap your credit card to pay. I think
they're, if memory serves, they're more popular in Europe than the U.S., but I've seen their devices here, too.
They are a Dutch company, which would explain it.
That's the useful stuff. The world is full of things that we don't use. It's weird, because I'm thinking of this in the context of infrastructure providers. Like, what the hell kind of cloud provider is this? A highly specific one, thank you for asking. Exactly. What do you see as coming next for this? I mean, you mentioned that there's the idea of the overall, here's what's broken, and we learn as we go on this. But if you had a magic wand, what would you make it do? Well, the easy answer to that is what Corey said. But jokes aside, regional visibility
into all the services. The counterpoint to that is that global rolling outages are not a thing with
AWS. They are frequently a thing with GCP and with Azure. It's Tuesday. They're probably down
somewhere already. Every provider implements these things somewhat differently. And then you also
have the cascade effects. Like as we saw this morning, when Cloudflare goes down, a lot of API endpoints
behind Cloudflare will also go down. If AWS has a bad day, so does half of the internet. There's a strong
sense that this is a symptom of centralization. It's not, again, reliability is far higher today than it ever has been. The difference is that we've centralized across a few providers, so that when they have a very bad day, it takes everyone down simultaneously, as opposed to your crappy data center is down on Mondays and Tuesdays, and mine is down on Wednesdays and
Thursdays, and that's just how the internet worked for a long time. So I think that there's a moment
of reckoning here for a lot of companies, and the cloud providers included and many of their
customers, you know, folks who maybe never invested in reliability or resilience or disaster recovery
or any flavor of being, I think resilient is probably the best word here to any particular
outage. Because to your point, while some things are more centralized, right? The main hyperscalers are where we're mostly centralized. A lot of things are significantly more distributed. And that distribution, on the one hand, means we're more resilient in the overall, in the aggregate, but it's harder to figure out what's actually broken. And so on the one hand, I would hope that a bunch of companies that are critical for their users, that have critical infrastructure up in the cloud, would remember that the internet is not actually just US East 1, as a lot of folks who thought about the cloud didn't realize that the cloud was mostly just US East 1, and we all learned that the hard way a couple of weeks ago,
and would start to move things to other places or build redundancies.
And then, you know, maybe the negotiation here is, I'll be down, but I'll be down for not
as long as the cloud provider. I can fail over safely or degrade gracefully or any of these
things. A lot of nice sounding terms that we can throw at it, but that a lot of folks haven't
heard or decided to not prioritize. Now is a good point. Or they don't understand what the reality of that looks like, because you can't simulate an S3 outage. Yeah, you can block it from your app. Sure, terrific. You can't figure out how your third-party critical dependencies are going to react then.
And when they all rely on each other, it becomes a very strange mess. And that's why outages at this scale are unique.
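For what "block it from your app" can look like in a test, here is a minimal sketch using Python's standard mocking tools against a hypothetical boto3-style upload function (assumes botocore is installed). It exercises only the in-app fallback, which is the part you can simulate; it says nothing about how your third-party dependencies would react.

```python
from unittest import mock
import botocore.exceptions

def upload_report(s3_client, bucket, key, body):
    """Hypothetical app code: push a report to S3, and fall back to a
    local spool if S3 is unreachable instead of failing the request."""
    try:
        s3_client.put_object(Bucket=bucket, Key=key, Body=body)
        return "uploaded"
    except botocore.exceptions.EndpointConnectionError:
        # Degrade gracefully rather than taking the whole request down.
        return "spooled locally"

def test_s3_outage():
    # Fake client whose put_object raises the same error a real outage would.
    fake_s3 = mock.Mock()
    fake_s3.put_object.side_effect = botocore.exceptions.EndpointConnectionError(
        endpoint_url="https://s3.amazonaws.com"
    )
    assert upload_report(fake_s3, "reports", "daily.csv", b"...") == "spooled locally"

test_s3_outage()
print("simulated S3 outage handled")
```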
Yep, completely agreed. I want to thank you for taking the time to walk me through the thought
process behind this and how it works. If people want to learn more, where's the best place for them to find you?
updog.ai. And then after that, at datadoghq.com. And if you're already a Datadog customer, I'm sure your account team knows where to find me.
You're going to regret saying that
because everyone, Modulo, everyone, is a Datadog customer.
Omri, thank you so much for your time.
I appreciate it.
Thanks for having me.
Omri Sass, Director of Product Management at Datadog.
I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review
on your podcast platform of choice.
Whereas if you've hated this podcast,
Please leave a five-star review on your podcast platform of choice, along with an angry, bewildered comment along the lines of: "Date-a-dog? Like Tinder for pets? That's disgusting," showing that you did not get the point.
