Screaming in the Cloud - All Along the Shoreline.io of Automation with Anurag Gupta
Episode Date: July 20, 2021
This week Corey is joined by Anurag Gupta, founder and CEO of Shoreline.io. Anurag guides us through the wide variety of services he helped launch, including RDS, Aurora, EMR, Redshift, and others. The result? Running things almost like a start-up—but with some distinct differences. Eventually Anurag ended up back in the testy waters of start-ups. He and Corey discuss the nature of that transition to get back to solving holistic problems, tapping into conveying those stories, and what Anurag was able to bring to his team at Shoreline.io, where automation is king. Anurag goes into the details of what Shoreline is and what they do. Stay tuned for more.
Links:
Shoreline.io: https://shoreline.io
LinkedIn: https://www.linkedin.com/in/awgupta/
Email: anurag@Shoreline.io
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Your company might be stuck in the middle of a DevOps evolution
without even realizing it.
Lucky you.
Does your company culture discourage risk?
Are you willing to admit it?
Does your team have clear responsibilities?
Depends on who you ask.
Are you struggling to get
buy-in on DevOps practices? Well, download the 2021 State of DevOps Report brought to you annually by
Puppet since 2011 to explore the trends and blockers keeping mid-evolution firms stuck in
the middle of their DevOps evolution because they fail to evolve or die like dinosaurs.
The significance of organizational buy-in, and oh, it is significant indeed, and why
team identities and interaction models matter.
Not to mention whether the use of automation and cloud translate to DevOps success.
All that and more awaits you.
Visit www.puppet.com to download your copy of the report now.
If you're familiar with Cloud Custodian, you'll love Stacklet, which is made by the same people
who created Cloud Custodian, but put something useful on top of it so you don't need to be a
YAML expert to work with it. They're hosting a webinar called Governance as Code, the guardrails for cloud at
scale, because it's a new paradigm that enables organizations to use code to manage and automate
various aspects of governance. If you're interested in exploring this, you should absolutely make it
a point to sign up, because they're going to have people who know what they're talking about.
Just kidding, they're going to have me talking about this.
It's going to be on Thursday, July 22nd at 1 p.m. Eastern. To sign up, visit snark.cloud slash stacklet webinar.
All one word.
That's snark.cloud slash stacklet webinar.
And I'll talk to you on Thursday, July 22nd.
Welcome to Screaming in the Cloud.
I'm Corey Quinn. This promoted episode is brought to you by Shoreline, and I'm certain that we're going to get there.
But first, I'm notorious for telling the story about how Route 53 is, in fact, a database,
and anyone who disagrees with me is wrong. Now, AWS today is extraordinarily tight-lipped about
whether that's accurate or not. So the next best thing, of course, is to talk to the person who
used to run all of AWS's database offerings and start off there and get it from the source.
Today, of course, he is not at Amazon, which means he's allowed to speak with me.
My guest is Anurag Gupta, the founder and CEO of Shoreline.io. Anurag, thank you for joining me.
Thanks for having me on the show, Corey.
It's great to be on, and I followed you for a long time.
I think of you as AWS marketing, frankly.
The running gag has been that I am the de facto head of AWS marketing as a part-time gig,
because I wandered past and saw an empty seat and sat down and then got stuck with the role. I mostly kid, but there does seem to be at times a bit of a
challenge as far as expressing stories and telling those stories in useful ways. And some mistakes
just sort of persist stubbornly forever. One of them is in the list of services, Route 53 shows
up as networking and content delivery, which I think regardless of the answer, it doesn't really fit there. I maintain it's a database, but did you have oversight into
that along with Glue, Athena, all the RDS options, managed blockchain for some reason as well?
Was it considered a database internally, or was that not really how they viewed it?
It's not really how they view it. I mean, certainly there is a long IP table, right,
and routing tables, but I think we characterized it in a whole different org. So I had a responsibility
for analytics, Redshift, Glue, EMR, etc., and transactional databases, Aurora, RDS, stuff like that. Very often when you have someone who was working at a very large company, and yes,
Amazon is a bunch of small teams internally, but let's face it, they're creeping up on $2 trillion
in valuation at the time of this recording. It's fairly common to see that startups are,
oh, this person was at Amazon for ages, as if it's some sort of amazing
selling point. Because, you know, a company with, what is it, 1.2 million people, give or take,
is absolutely like a relatively small, just-founded startup culturally, in terms of resources,
all the rest. Conversely, when you're working at scales like that, where the edge case becomes the
common case and the corner case becomes something that happens 18 times an hour,
it informs the way you think about things
radically differently,
and your reputation does precede you.
So I'm going to opt for assuming
that this is, rather than being the story about,
oh, we're just going to try and turn this company
into the second coming of Amazon,
that there's something that you saw
while you were at AWS
that you thought was an unmet need in the ecosystem, and that's what Shoreline is setting out to build.
Is that slightly accurate?
Or, no, you're just basically, there's a figurehead because the Amazon name is great for getting investors.
No, that's very astute.
So when I joined AWS, they gave me eight people and they asked me to go disrupt data warehousing and transaction processing.
So those turned into Redshift and Aurora, respectively.
And gradually I added on more services.
But in that sense, Amazon does operate like a startup.
You know, they really believe in restricting the number of resources you get so that you have time and you're forced to think and
be creative. That said, you know, you don't really wake up at night sweating about whether you're
going to hit payroll. This is sort of my fourth startup at this point. And, you know, there are
sleepless nights at a startup. And, you know, it's different. I'd go launch a service at AWS,
and there'd be a thousand people
who are signed up to the beta the next day.
And that's not the way startups work,
but there are advantages as well.
I can definitely empathize with that.
My last job before I started this place
was at a small scrappy startup,
which was great for three months
and then BlackRock bought us.
And then, oh, large regulated finance company combined with my personality ended about the
way you'd think it would. And okay. So instead of having the fears and the challenges that I
dealt with, and I'm going to go start my own company and have different challenges. And
yeah, they are definitely different. I never lay awake at night worrying about how I was going to
make payroll, for example. There's also the freedom in some ways at large companies where whatever function needs to get
done, whatever problem you have, there is some department somewhere that handles that almost
exclusively. Whereas in scrappy startup land, it's, well, whatever problem needs to get done today,
that is your job right now. And your job description can easily fill six pages by the end of month two.
It's a question of trade-offs and the rest.
What did you see that gave you the idea to go for startup number four?
So, you know, when I joined AWS thinking I was going to build a bunch of database engines,
and I've done that before,
what I learned is that building services is different than building products.
And in particular, nobody cares about your performance or features if your service isn't up.
You know, inside AWS, we used to talk about utility computing, you know, metering and providing compute, storage, and databases the way, you know, my local utility provider PG&E provides power and gas.
And, you know, if I call up PG&E and say that the power is out at my house, you know, I don't really want to hear, oh, you know, did you know that we have six nines of power availability in the state of California? I mean, the power's still out. Come in here and fix it. And yeah, I don't really care
about fancy new features they're doing back at the plant. Really, all I care about is cost
and availability. The idea of utility computing goes in that direction too, in a lot of ways,
and in some strange nuances too. The idea that when I flip the
light switch, I don't stop and wonder, is the light going to turn on? You know, until I installed
IoT switches and then everything's a gamble in the wild times again. And if the light doesn't come on,
I assume that the fuse is out or the light bulb is blown. Did PG&E wind up dropping service to
my neighborhood is sort of the last question that I have down that list. It took a while for cloud to get there. But at this point, if I can't access something in AWS,
my default assumption is that it's my local internet, not the cloud provider. That was
hard-won. That's right. And so I think a lot of other SaaS companies or anybody operating in the cloud
are now working and struggling to get that same
degree of availability and confidence to supply to their customers. And so that's really the reason
for Shoreline. There's been a lot of discussion around the idea of availability and what that
means for a business outcome, where I still tell this story from time to time that back in 2012 or so, I was going to buy a pair
of underpants on amazon.com where I buy everything. And instead of completing the purchase, it threw
one of the great pictures of staff dogs up. Now, if you listen to a lot of reports on availability,
then for one day out of the week, I would just not wear underwear. In practice, I waited an hour,
tried it again, the purchase went through, and it was fine.
However, if that happened every third time I tried to make a purchase, I would spend a lot more money at Target. There has to be a baseline level of availability. That doesn't mean that
your site is never down, period, because that is in many cases an unrealistic aspiration, and it
turns every outage that winds up coming down the road into an all-hands-on-deck five-alarm fire,
which may not be warranted. But you do need to have a certain level of availability that
meets or exceeds your customers' expectations of same. That's the way that I've always viewed it.
I think that's exactly right. I also think it's important to look at it from a customer perspective,
not a fleet perspective. So a lot of people do inward facing SRE measurements of
fleet-wide availability. Now your customer really cares about the region they're in, or perhaps even
the particular host they're on. And that's even more true if they've got data. So for example,
an individual database failing, it'll take a long time for it to come back up elsewhere.
That's different than something more ephemeral like an instance which you can move more easily.
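As a toy illustration of that fleet-versus-customer distinction, here is a short Python sketch with invented numbers: the fleet-wide figure can look healthy while the one customer pinned to an unhealthy host sees something much worse.

HOURS_IN_MONTH = 30 * 24

# Downtime hours this month per host; these numbers are made up for illustration.
# Customer A's database lives only on host-3, so that host is their whole world.
downtime_by_host = {"host-1": 0.0, "host-2": 0.1, "host-3": 12.0, "host-4": 0.0}

fleet_uptime = 1 - sum(downtime_by_host.values()) / (len(downtime_by_host) * HOURS_IN_MONTH)
customer_a_uptime = 1 - downtime_by_host["host-3"] / HOURS_IN_MONTH

print(f"fleet-wide availability:   {fleet_uptime:.3%}")       # roughly 99.6%
print(f"customer A's availability: {customer_a_uptime:.3%}")  # roughly 98.3%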
Part of the challenge that I've noticed as well with dealing with large cloud providers, a recurring joke has been the AWS status page.
It is the purest possible expression of a static site because it never
changes. And people get upset when things go down and the status page isn't updated.
But the challenge is, when you're talking about something that is effectively global scale,
it stops being a question of is it up or is it down and transitions long before then into how
up or how down is it? And things that impact one customer may very well completely miss another.
If you're being an absolutist,
it will always be a sea of red,
which doesn't tell people anything useful.
Whereas if a customer is down and their site is off,
they don't really care
that most other customers aren't affected.
I mean, on some level,
you kind of want everyone to be down
because that defers headline risk,
as well as if my site is having a problem,
it could be days before someone gets around to fixing a small bug. Whereas if everything is down,
oh, this will be getting attention very rapidly. That's exactly right. Sounds like you've done
ops before. Oh, yes. You can tell that because I'm cynical and bitter about everything. It doesn't
take long working in operationally focused roles to get there. I appreciate you're saying that,
though. Usually people say, let me guess, you used to be an ops person. How can you tell?
Because your code is garbage, is the other way that people go down that path. And yeah,
credit where due. They're not wrong. You mentioned that back when you were at Amazon,
you were given a team of eight people and told to disrupt the data warehouse. Yeah,
I've disrupted the data warehouse as a single person before, so it doesn't seem that hard, but I'm guessing you mean something beyond causing an outage. It's more about disrupting the
space, presumably, and I think looking back from 2021, it's hard to argue that Amazon hasn't
disrupted the data warehouse space and 15 other spaces besides. Yeah, so that's what we were all
about, sort of trying to find areas of non-consumption.
So clearly data was growing, data warehousing was not growing at the same rate.
We figured that had to do with either a cost problem or it had to do with a simplicity problem or something else, right?
You know, why aren't people analyzing the data that they're collecting?
So that led to Redshift, a similar
problem in transaction processing, led to Aurora, and, you know, various other things.
You also said a couple of minutes ago that Amazon tends to talk more about features than they do
about products, and building a product at a startup is a foundationally different experience.
I think you're absolutely onto something there. Historically, Amazon has folks get on stage at
reInvent and talk about this new thing that got released. And it feels an awful lot like a company
saying, yeah, here's some great bricks you can use to build a house. Well, okay, what kind of
house can I build with those bricks? Here to talk about the house that they built is our guest
customer speaker from Netflix. And it seems like they sort of abdicated in many respects the storytelling
portion to a number of their customers. It is a very rare startup that has the luxury of being
able to just punt on building a product and its product story that goes along with it.
Have you found that your time at Amazon made storytelling something that
you wound up missing a bit more or retelling stories internally that we just don't get to
see from the outside or is, oh, wow, I never learned to tell a story before because at Amazon,
no one does that. And I have to learn how to do that now that I'm at a startup again.
No, I think it really is a storytelling experience. I mean, it's a narrative-based culture there,
which is in many ways a storytelling experience. So we were trying to provide a set of capabilities
so that people could build their own things. Much as Kindle allows people to self-publish books,
we're not really writing books of our own. And so I think that was the experience
there. Outside, you know, you are trying to solve more holistic problems, but you're still only
a puzzle piece in the experience that any given customer has, right? You don't
satisfy all of their needs, you know, soup to nuts. And part of the challenge too, is that if I'm a small
scrappy startup trying to get something out the door for the first time, the problems that I'm
experiencing and the challenges that I have are radically different than something that has
attained hyperscale and now has whole optimization stories or series of stories going on. It's,
will this thing even work at all is my initial focus.
And in some ways, it feels like conference-ware cuts against a lot of that
because it's hard not to look
at the aspirational version of events
that people tell on stage at every event I've ever seen
and not come away with a takeaway of,
oh, what I've built is actually terrible
and depressing and sad.
One of the things that I find that resonates about what you're
building over at Shoreline is it's not just about the build things from scratch and get them
provisioned for the first time. It's about the ongoing operationalization, I think, if that's a
word, about that experience and how to wind up handling the care and feeding of something that
exists and is running,
but is also subject to change because all things are continually being iterated on.
That's right. I feel like operations is sort of an increasingly important but underappreciated part of the service delivery experience,
much as maybe QA was a couple of decades ago.
And over time, we've gone and we've built pipelines to automate our test infrastructure.
We have deployment tools to deploy it, to configure it.
But what's weird is that there are two parts of the puzzle that are still
highly manual, developing software and operating that software
in production. And the other thing that's interesting about that is that you can decide
when you are working on developing a piece of code or testing it or deploying it or configuring it.
You don't get to decide when the disk goes down or something breaks. That's why you have 24-7 on-call.
And so the whole point of Shoreline is to break that into two problems.
The things that are automatable and make it easy, trivial to automate those things away
so you don't wake up to do something for the 10th time.
And then for the remaining things that are novel
to make diagnosing and repairing your fleet
as simple and straightforward
as diagnosing and repairing a single box.
And we do a lot of distributed systems techs
underneath the covers to make that the case.
But those are the two things that we do.
And so hopefully that reduces people's downtime.
And it also brings back a lot of time for the operators so they can focus on higher value things like, you know, working with you to reduce their AWS bill.
Yeah, for better or worse, working on the AWS bill is always sort of a backseat function or a back burner function.
It's never the burning priority unless things have gone seriously awry. It's a good governance thing. It's the idea of, okay, let's
optimize this, fix unit economics. It is rarely the number one most pressing area of business for
a company, nor should it be. I think people are sometimes surprised to hear me say that.
You want to be reasonable stewards of the money entrusted to you, and you obviously want to
continue to remain in business by not losing money on everything you sell, but trying to make it up in volume.
But at some point, it's time to stop cutting and focus instead on revenue growth. That is usually
the path to success for almost every company I've ever spoken to, unless they are either
very out of kilter or in a very strange spot in the industry.
That's true. But it does belong, I think, in the ops function to do
optimization of your experience, whether, and you know, improving your resources,
improving your security posture, all of those sorts of things fall into production ops landscape
from my perspective. But people just don't have time for it because their fleets are growing
far, far faster than their headcount is.
So the only solution to that is automation.
And I want to talk to you about that.
Historically, the idea has been that you have monitoring or observability these days,
which I consider to be hipster monitoring, figuring out what's going on in your environment. Then you wind up with incidents being declared when certain things wind up triggering, which presumably are things that actually matter and not you're waking someone
up for vague reasons like load average is high on these nodes, which tells you nothing in isolation
whatsoever. So you have the incident management portion of that next, and that handles a lot of
the waking folks up and getting everyone onto the call. You're focusing on, I guess, a third tranche here,
which is the idea of incident automation.
Tell me about that.
That's exactly right.
So having sort of been in the trenches,
I never got excited about one more dashboard to look at
or someone sort of routing a ticket to the right person per se,
because it'll get there, right?
Oh, yeah.
Like one of the most depressing things you'll ever see in a company
is the utilization numbers from the analytics
on the dashboards you build for people.
They look at them the day you build them and hand it off,
and then the next person visiting it is you
while running this report to make sure the dashboard is still there.
Yeah.
I mean, they are important things, right?
I mean, you get this huge sinking feeling if something is wrong and your observability tool is also down,
like CloudWatch was in some large scale events, or if your ticketing system is down and you don't
even notify somebody and you don't even know to wake up. But what did excite me, so you need
those things, they're necessary, but they're not
sufficient. What I think is also needed is something that actually reduces the number of
tickets, not just lets you observe them or find the right person to act upon it. So automation is
the path to reducing tickets, which is when I got excited because that was one less thing to wake up on
that gave me more time back to do things. And most importantly, it improved my customer
availability because any individual issue handled manually is going to take an hour or two or three
to deal with. The same issue being handled by a computer is going to take a few seconds or a few minutes. It's a
whole different thing. It's the difference between a glitch and having to go out on an apology tour
to your customers. I really love installing, upgrading, and fixing security agents in my
cloud estate. Why do I say that? Because I sell things for a company that deploys an agent. There's no other reason. Because let's face it,
agents can be a real headache. Well, Orca Security now gives you a single tool to detect
basically every risk in your cloud environment that's as easy to install and maintain as a
smartphone app. It is agentless, or my intro would have gotten me in trouble here, but it can still see
deep into your AWS workloads while guaranteeing 100% coverage. With Orca Security, there are no
overlooked assets, no DevOps headaches, and believe me, you will hear from those people
if you cause them headaches, and no performance hits on live environments. Connect your first cloud account in minutes and see for
yourself at orca.security. That's orca as in whale dot security as in that thing your company claims
to care about but doesn't until right after it really should have. Oh yes, I feel like those of
us who have been in the ops world for long enough, we always have a horror story or two of automation around incidents run amok.
A classic thing that we learned by doing this, for example, is if you have a primary and a secondary, failover should be automated.
Failing back should not be, or you wind up in these wonderful states of things thrashing back and forth. In many cases in data center land, if you have a phantom router ready to step in, if the primary router goes offline, more outages are caused by a heartbeat
failure between those two devices, and they both start vying for power. And that becomes a problem.
Same story with a lot of automation approaches. For example, if, oh, every time a disk winds up
getting full, all right, we're going to fire off something to automatically expand the volume. Well, without something to stop that feedback loop, you're going to potentially wind up with an unbounded growth problem, and then you have no more disk left to expand the volume onto, which is the way that winds up smacking into things. This is clearly something you've thought about, given that you have built a company out of this, and this is not your first rodeo by a long stretch.
How do you think about those things?
So I think you're exactly right there again.
So the key here is to have the operator or the SRE define what needs to happen on an individual box, but then provide guardrails around them so that you can decide like, oh, a lot of these things
have happened at the same time. I'm going to put a rate limiter or a circuit breaker on it
and then send it off to somebody else to look at manually. As you said, failover, but don't
flap back and forth, or limit the number of times that something is allowed to fail before you
send it to someone.
Finally, everything grounds at a human being looking at something, but that's not a reason
not to do the simple stuff automatically because wasting human intelligence and time on doing just
manual stuff again and again and again is pointless. And also it increases the likelihood
that they're going to cause errors
because they're doing something mundane
rather than something that requires their intelligence.
And so that also is worse than handing it off to be automated.
But there are a lot of guardrails
that can be put around this, that we put around it,
that is the distributed systems part of it that we provide. You know, in some sense,
we're an orchestration system for automation, production ops, the same way that other people
provide an orchestration system for deployments and automated rollback and so forth.
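As a concrete sketch of those guardrails, here is a minimal Python version of the rate limiter and circuit breaker idea: run the remediation automatically, but once it has fired too often in a window, stop and page a person instead. The names expand_volume and page_oncall are hypothetical placeholders, not Shoreline's actual API.

import time
from collections import deque

WINDOW_SECONDS = 3600   # look at the last hour
MAX_AUTO_RUNS = 5       # allow five automatic runs per window before escalating

recent_runs = deque()   # timestamps of recent automatic remediations

def expand_volume(host):
    print(f"[auto] expanding volume on {host}")   # hypothetical action

def page_oncall(host, reason):
    print(f"[escalate] {host}: {reason}")         # hypothetical escalation

def remediate_disk_full(host):
    now = time.time()
    # Rate limiter: forget runs that fell outside the window.
    while recent_runs and now - recent_runs[0] > WINDOW_SECONDS:
        recent_runs.popleft()
    if len(recent_runs) >= MAX_AUTO_RUNS:
        # Circuit breaker: repeated firing suggests a deeper problem,
        # so hand it to a human rather than expanding disks forever.
        page_oncall(host, "disk filled repeatedly; automation paused")
        return
    recent_runs.append(now)
    expand_volume(host)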
What technical stacks do you wind up supporting for stuff like this? Is it anything you can
effectively SSH into? Does it integrate better with certain cloud providers than others?
...this year, and likely go to VMware on-prem next year. But, you know, finally, customers tell us what to do.
Oh, yeah. Building for things that have no customer usage is, that's great and all,
but talking to folks, we're like, yeah, it'd be nice if it had this. Will you buy it if it does?
No. Yeah, let's maybe put that one on the backlog.
You've done startups too, I see that.
Oh, once or twice.
Talk to customers.
I find that's one of those things
that absolutely is the most effective use of your time you can do.
Looking at your site, shoreline.io,
for those who want to follow along at home,
it lists a few different remediations that you give as examples.
And one of them is expanding disk volumes
as they tend to run out of space. I'm
assuming from that perspective alone that you are almost certainly running some form of agent?
We are running an agent. So part of that is because that way we don't need credentials so
that you can just run inside the customer environment directly and without your having
to pass credentials to some
third party. Part of it is also so you can do things quickly. So every second, we'll scrape
thousands of metrics from the Prometheus exporter ecosystem, calculate thousands more, compare them
against hundreds of alarms, and then take action when necessary. And so if you run on box, that can be done far faster than if you go off box.
And also a lot of the problems that happen
in the production environment are related to networking.
And it's not like the box isn't accessible,
but it may be that the monitoring path is not accessible.
So you really want to make sure
that the box can protect itself,
even if there's some issue somewhere in the fleet. And that really becomes an important thing, because that's the only time that you need incident automation when something's gone wrong.
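A rough, hedged sketch of that on-box loop in Python: scrape a local Prometheus-style exporter, compare one metric against a threshold, and take a local action when the alarm condition holds. The exporter URL, metric name, and free_up_space action are assumptions for illustration, not Shoreline's implementation.

import time
import urllib.request
from typing import Optional

EXPORTER_URL = "http://localhost:9100/metrics"   # node_exporter's usual port, assumed here
METRIC = "node_filesystem_avail_bytes"
THRESHOLD_BYTES = 1 * 1024 ** 3                  # alarm when under 1 GiB free

def scrape_metric(url: str, name: str) -> Optional[float]:
    # Prometheus exposition format: one sample per line, value as the last field.
    body = urllib.request.urlopen(url, timeout=2).read().decode()
    for line in body.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[1])
    return None

def free_up_space():
    print("[action] rotating logs / expanding the volume")  # hypothetical local action

while True:
    value = scrape_metric(EXPORTER_URL, METRIC)
    if value is not None and value < THRESHOLD_BYTES:
        free_up_space()
    time.sleep(1)   # checking every second stays cheap because it never leaves the box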
I assume that that agent then has specific commands or tasks it's able to do, or does it accept arbitrary command execution? Arbitrary command execution, whatever you can type in at the Linux command prompt, whether
it's a call to the AWS CLI, kubectl, Linux commands like top, or, you know, even shell
scripts, you can automate using Shoreline. Yeah, that was one of the ways that Nagios got it wrong
once upon a time with their NRPE,
their Nagios Remote Plugin Executor, where you would only be allowed to run explicit things
that have been pre-approved and pushed out to things in advance. And it's one of the reasons,
I suspect, why remediation in those days never took off. Now, we've learned a lot about
observability and monitoring and keeping an eye on things that have grown well beyond
host-based stuff. So it's nice to see that there is growth in that. I'm much more optimistic about
it this time around based upon what you're saying. I hope you're right because I think
the key thing also is that I think a lot of these tools vendors think of themselves as the center
of the universe, whereas I think Shoreline works
best if it's entirely invisible. That's what you want from a feedback control system, from an
automation system, that it just gives you time back and issues are just getting fixed behind the
scenes. That's actually what a lot of AWS is doing behind the scenes, right? You're not seeing something whenever some rack
goes down. The thing that has always taken me aback, and I don't know how many times I'm going
to have to learn this lesson before it sticks, I fall into the common trap of take any one of the
big internationally renowned tech companies, and it's easy to believe that, oh, everything inside is far future wizardry of
everything works super well. The automation is flawless. Everything is pristine. And your
environment compared to that is relative garbage. It turns out that at every company I've ever spoken
with, once you take SREs from those companies out for way too many drinks and they hit honesty levels,
they always talk about it being a sad dumpster fire in a bunch of different ways. And we're
talking some of the companies that people laud as the aspirational, your infrastructure should be
like these companies. And I find it really important to continue to socialize that point,
just because the failure mode otherwise is people think that their company
just employs terrible engineers.
And if people were any good, it would be seamless.
Just like they say on conference stages,
it's like comparing your dating life to a romantic comedy.
It's not an accurate depiction of how the world works.
Yeah, that's true.
That said, I'd say that like the DBA working on-prem may be managing 100 databases.
The average DBA in RDS or somebody on-call might be managing 100,000.
At that point, automation is no longer optional.
Yeah, and the way you get there is every week you squash and extinguish one thing forever. And then you start
seeing less and less frequent things because, you know, one in a million is actually occurring to
you. But, you know, like if it was one in a hundred, that would just crush you. And so you
just need to, you know, very diligently every week, every day, remove something.
Shoreline is in many ways the product I wish I had had at AWS because it makes automating that stuff easy,
a matter of minutes rather than months.
And so that gives you the capability to do automation.
Everyone wants automation, but the question is,
why don't they do it?
And it's just because it takes so much time
and we're so busy as operators.
Absolutely. I don't mean to say that these large companies working at Hyperscale have not
solved for these problems and done truly impressive things, but there's always sharp
edges. There's always things that are challenging and tricky. On this show, we had Dr. Christina
Maslach recently as an expert on burnout, given that she spent her entire career studying occupational burnout as an academic. And it turns out that it's not, to equate this to the operations world, it's not waking up at two in the morning to have to fix a problem, generally, that burns people out. It's being woken up to fix a problem at 2 a.m. consistently, and it's always the same problem and nothing ever seems to change. It's
the worst ops jobs I've ever seen are the ones where you have to wake up to fix a thing,
but you're not empowered to actually fix the cause, just the symptom.
I couldn't agree more. And that's the other aspect of Shoreline is to allow the operators or SREs to build the remediations rather than just
put a ticket into some queue for some developer to get prioritized alongside everything else.
Because you're on the sharp edge when you're doing ops, right? You deal with all the consequences
of the issues that are raised. And so, you know, it's fine that you say like, okay, there's this memory
leak. I'll create a ticket back to dev to go and fix it. But, you know, I need something that helps
me actually fix it here and now. Or if there's a log that's filling up my disk, it's fine to tell
somebody about it, but you have to, you know, grow your disk or move that log off the disk. And
you don't want to have to wake up for those things.
No.
And the idea that everything like this gets fixed is a bit of a misnomer.
One of my hobbies is whenever a site goes down and it is uncovered, sometimes very publicly, sometimes in RCAs, that the actual reason everything broke was due to an expired certificate.
Yep.
I like to go and schedule out a couple of calendar reminders on that one for myself
of check it in 90 days in case they're using a refresh from Let's Encrypt.
And let's check it as well in one year and see if there's another outage just like that.
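For anyone who would rather script that check than rely on a calendar reminder, here is a small Python sketch that asks a server how many days its TLS certificate has left; the hostname is only an example.

import socket
import ssl
from datetime import datetime, timezone

def days_until_cert_expiry(host, port=443):
    # Open a TLS connection and read the validated peer certificate.
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - datetime.now(timezone.utc).timestamp()) / 86400

if __name__ == "__main__":
    remaining = days_until_cert_expiry("example.com")
    print(f"certificate expires in {remaining:.0f} days")
    # Let's Encrypt certificates last 90 days, so anything near zero here
    # is exactly the kind of thing worth alerting on automatically.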
It has a non-zero success rate because as much as we want to convince ourselves
that, oh, that bit me once and I'll never get bitten like that again, that doesn't always hold
true. Certificates are a very common source of very widespread outages. It's actually one of
the remediations we provide out of the box. So, you know, alongside like making it possible for
people to create these things quickly,
we also provide what we call op packs,
which are basically getting started things
which have the metrics, alarms, actions, bots,
so they can just, you know, like fix it forever
without actually having to do very much other than,
you know, like reviews what we have done.
And that's on some level, I think, part of the magic is the
abstracting away the toil so that people are left to solve interesting problems and think about
these things and guiding them down a path where, okay, what should I do on an automatic basis if
the disk fills up? Well, I should extend the volume. Yeah. But maybe you should alert after
the fifth time in an hour that you have to extend
the same volume because, just spitballing here, maybe there's a different problem here that
putting a bandaid on isn't going to necessarily solve. It forces people to think about what are
those triggers that should absolutely result in human intervention? Because you don't necessarily
want to solve things like memory leaks, for example, with, oh, our application leaks memory,
so we have to restart it once a day. Now, in practice, the right way to solve that is to fix
the application. In practice, there are so many cron jobs out there that are set to restart things
specifically for that reason, because cron jobs are quick and easy, and application developer time
is absolutely not easy to come by in many of these shops. It just comes down to something
that helps enforce more of a process, more of a rigor. I like the idea quite a bit. It aligns both with where people are
and how a better tomorrow starts to look. I really do think you're onto something here.
I mean, I think it's one of these things where you just have to understand it's not either or,
that it's not a question of operator pain or developer pain. It's let's go and address it in the here and now
and also provide the information,
also through an automated ticket generation,
to where someone can look to fix it forever at source, right?
Oh, yeah.
It's always great.
The user experience, too, of having those tickets created automatically
is also sometimes handy
because the worst way to tell someone
you don't care about their problem
when they come to you in a panic
is have you opened a ticket?
Now, yes, of course,
you need a ticket to track these things,
but maybe when someone is ghost pale
and scared to death
about what they think just broke the data,
maybe have a little more empathy there.
And yeah, the process is important,
but there should be automatic ways to do that.
These things all have APIs.
I really like your vision of operational maturity
and managing remediation in many cases
on an automatic basis.
I think it's going to be so much more important
in a world where deployments are more frequent.
You have microservices, you have multiple clouds,
you have containers that give a 10x increase
in the number of things you have to manage.
There's a lot for operators to have to keep in their heads.
And things are just changing constantly with containers.
Every minute someone comes and one goes.
So you just really need to,
even if you're just doing it for a diagnosis,
be collecting it and putting it aside; that is really critical.
If people want to learn more about what you're building and how you think about these things,
where can they find you?
They can reach out to me on LinkedIn at awgupta. Or, you know, of course, they can go to
shoreline.io and reach out there. I'm also anurag at shoreline.io if they want to reach out directly.
And, you know, we'd love to give people demos.
We know there's a lot of pain out there where our mission is to reduce it.
Thank you so much for taking the time to speak with me today.
I really appreciate it.
This is a great privilege to talk to you.
Anurag Gupta, CEO and founder of Shoreline.io.
I'm cloud economist, Corey Quinn,
and this is Screaming in the Cloud.
If you've enjoyed this podcast,
please leave a five-star review
on your podcast platform of choice.
Whereas if you've hated this podcast,
please leave a five-star review
on your podcast platform of choice,
along with a comment telling me that I'm wrong
and that Amazonians are the best at being on call because they carry six pagers.
If your AWS bill keeps rising and your blood pressure is doing the same,
then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS. We tailor recommendations to your business,
and we get to the point. Visit duckbillgroup.com to get started.
This has been a HumblePod production.
Stay humble.