Screaming in the Cloud - All Along the Shoreline.io of Automation with Anurag Gupta

Episode Date: July 20, 2021

This week Corey is joined by Anurag Gupta, founder and CEO of Shoreline.io. Anurag guides us through the large variety of services he helped launch, including RDS, Aurora, EMR, Redshift, and others. The result? Running things almost like a start-up—but with some distinct differences. Eventually Anurag ended up back in the testy waters of start-ups. He and Corey discuss the nature of that transition to get back to solving holistic problems, tapping into conveying those stories, and what Anurag was able to bring to his team at Shoreline.io, where automation is king. Anurag goes into the details of what Shoreline is and what they do. Stay tuned for more.

Links:
Shoreline.io: https://shoreline.io
LinkedIn: https://www.linkedin.com/in/awgupta/
Email: anurag@Shoreline.io

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Your company might be stuck in the middle of a DevOps evolution without even realizing it.
Starting point is 00:00:35 Lucky you. Does your company culture discourage risk? Are you willing to admit it? Does your team have clear responsibilities? Depends on who you ask. Are you struggling to get buy-in on DevOps practices? Well, download the 2021 State of DevOps Report brought to you annually by Puppet since 2011 to explore the trends and blockers keeping mid-evolution firms stuck in
Starting point is 00:01:00 the middle of their DevOps evolution because they fail to evolve or die like dinosaurs. The significance of organizational buy-in, and oh, it is significant indeed, and why team identities and interaction models matter. Not to mention whether the use of automation and cloud translate to DevOps success. All that and more awaits you. Visit www.puppet.com to download your copy of the report now. If you're familiar with Cloud Custodian, you'll love Stacklet, which is made by the same people who created Cloud Custodian, but put something useful on top of it so you don't need to be a
Starting point is 00:01:39 YAML expert to work with it. They're hosting a webinar called Governance is Code, the guardrails for cloud at scale, because it's a new paradigm that enables organizations to use code to manage and automate various aspects of governance. If you're interested in exploring this, you should absolutely make it a point to sign up, because they're going to have people who know what they're talking about. Just kidding, they're going to have me talking about this. It's going to be on Thursday, July 22nd at 1 p.m. Eastern. To sign up, visit snark.cloud slash stacklet webinar. All one word. That's snark.cloud slash stacklet webinar.
Starting point is 00:02:19 And I'll talk to you on Thursday, July 22nd. Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted episode is brought to you by Shoreline, and I'm certain that we're going to get there. But first, I'm notorious for telling the story about how Route 53 is, in fact, a database, and anyone who disagrees with me is wrong. Now, AWS today is extraordinarily tight-lipped about whether that's accurate or not. So the next best thing, of course, is to talk to the person who used to run all of AWS's database offerings and start off there and get it from the source. Today, of course, he is not at Amazon, which means he's allowed to speak with me.
Starting point is 00:02:59 My guest is Anurag Gupta, the founder and CEO of Shoreline.io. Anurag, thank you for joining me. Thanks for having me on the show, Corey. It's great to be on, and I followed you for a long time. I think of you as AWS marketing, frankly. The running gag has been that I am the de facto head of AWS marketing as a part-time gag, because I wandered past and saw an empty seat and sat down and then got stuck with the role. I mostly kid, but there does seem to be at times a bit of a challenge as far as expressing stories and telling those stories in useful ways. And some mistakes just sort of persist stubbornly forever. One of them is in the list of services, Route 53 shows
Starting point is 00:03:41 up as networking and content delivery, which I think regardless of the answer, it doesn't really fit there. I maintain it's a database, but did you have oversight into that along with Glue, Athena, all the RDS options, managed blockchain for some reason as well? Was it considered a database internally, or was that not really how they viewed it? It's not really how they view it. I mean, certainly there is a long IP table, right, and routing tables, but I think we characterized it in a whole different org. So I had a responsibility for analytics, Redshift, Glue, EMR, etc., and transactional databases, Aurora, RDS, stuff like that. Very often when you have someone who was working at a very large company, and yes, Amazon is a bunch of small teams internally, but let's face it, they're creeping up on $2 trillion in valuation at the time of this recording. It's fairly common to see that startups are,
Starting point is 00:04:41 oh, this person was at Amazon for ages, as if it's some sort of amazing selling point. Because, you know, a company with, what is it, 1.2 million people, give or take, is absolutely like a relatively small, just-founded startup culturally, in terms of resources, all the rest. Conversely, when you're working at scales like that, where the edge case becomes the common case and the corner case becomes something that happens 18 times an hour. It informs the way you think about things radically differently, and your reputation does precede you.
Starting point is 00:05:11 So I'm going to opt for assuming that this is, rather than being the story about, oh, we're just going to try and turn this company into the second coming of Amazon, that there's something that you saw while you were at AWS that you thought was an unmet need in the ecosystem, and that's what Shoreline is setting out to build. Is that slightly accurate?
Starting point is 00:05:31 Or, no, you're just basically, there's a figurehead because the Amazon name is great for getting investors. No, that's very astute. So when I joined AWS, they gave me eight people and they asked me to go disrupt data warehousing and transaction processing. So those turned into Redshift and Aurora, respectively. And gradually I added on more services. But in that sense, Amazon does operate like a startup. You know, they really believe in restricting the number of resources you get so that you have time and you're forced to think and be creative. That said, you know, you don't really wake up at night sweating about whether you're
Starting point is 00:06:11 going to hit payroll. This is sort of my fourth startup at this point. And, you know, there are sleepless nights at a startup. And, you know, it's different. I'd go launch a service at AWS, and there'd be a thousand people who are signed up to the beta the next day. And that's not the way startups work, but there are advantages as well. I can definitely empathize with that. My last job before I started this place
Starting point is 00:06:39 was at a small scrappy startup, which was great for three months and then BlackRock bought us. And then, oh, large regulated finance company combined with my personality ended about the way you'd think it would. And okay. So instead of having the fears and the challenges that I dealt with, and I'm going to go start my own company and have different challenges. And yeah, they are definitely different. I never lied awake at night worrying about how I was going to make payroll, for example. There's also the freedom in some ways at large companies where whatever function needs to get
Starting point is 00:07:11 done, whatever problem you have, there is some department somewhere that handles that almost exclusively. Whereas in scrappy startup land, it's, well, whatever problem needs to get done today, that is your job right now. And your job description can easily fill six pages by the end of month two. It's a question of trade-offs and the rest. What did you see that gave you the idea to go for startup number four? So, you know, when I joined AWS thinking I was going to build a bunch of database engines, and I've done that before, what I learned is that building services is different than building products.
Starting point is 00:07:50 And in particular, nobody cares about your performance or features if your service isn't up. You know, inside AWS, we used to talk about utility computing, you know, metering and providing compute storage database the way, you know, my local utility provider PG&E provides power and gas. And, you know, if I call up PG&E and say that the power is out at my house, you know, I don't really want to hear, oh, you know, did you know that we have six nines power availability in the state of California? I mean, the power's still out. Go come in here and fix it. And yeah, I don't really care about fancy new features they're doing back at the plant. Really, all I care about is cost and availability. The idea of utility computing got into that direction too, in a lot of ways, in some strange nuances too. The idea that when I flip the light switch, I don't stop and wonder, is the light going to turn on? You know, until I installed IoT switches and then everything's a gamble in the wild times again. And if the light doesn't come on,
Starting point is 00:08:55 I assume that the fuse is out or the light bulb is blown. Did VG&E wind up dropping service to my neighborhood is sort of the last question that I have down that list. It took a while for cloud to get there. But at this point, if I can't access something in AWS, my default assumption is that it's my local internet, not the cloud provider. That was hard one. That's right. And so I think a lot of other SaaS companies or anybody operating in the cloud are now working and struggling to get that same degree of availability and confidence to supply to their customers. And so that's really the reason for Shoreline. There's been a lot of discussion around the idea of availability and what that means for a business outcome, where I still tell this story from time to time that back in 2012 or so, I was going to buy a pair
Starting point is 00:09:46 of underpants on amazon.com where I buy everything. And instead of completing the purchase, it threw one of the great pictures of staff dogs up. Now, if you listen to a lot of reports on availability, then for one day out of the week, I would just not wear underwear. In practice, I waited an hour, tried it again, the purchase went through, and it was fine. However, if that happened every third time I tried to make a purchase, I would spend a lot more money at Target. There has to be a baseline level of availability. That doesn't mean that your site is never down, period, because that is in many cases an unrealistic aspiration, and it turns every outage that winds up coming down the road into an all-hands-on-deck five-alarm fire, which may not be warranted. But you do need to have a certain level of availability that
Starting point is 00:10:29 meets or exceeds your customers' expectations of same. That's the way that I've always viewed it. I think that's exactly right. I also think it's important to look at it from a customer perspective, not a fleet perspective. So a lot of people do inward facing SRE measurements of fleet-wide availability. Now your customer really cares about the region they're in, or perhaps even the particular host they're on. And that's even more true if they've got data. So for example, an individual database failing, it'll take a long time for it to come back up elsewhere. That's different than something more ephemeral like an instance which you can move more easily. Part of the challenge that I've noticed as well with dealing with large cloud providers, a recurring joke has been the AWS status page.
Starting point is 00:11:21 It is the purest possible expression of a static site because it never changes. And people get upset when things go down and the status page isn't updated. But the challenge is, is when you're talking about something that is effectively global scale, it stops being a question of is it up or is it down and transitions long before then into how up or how down is it? And things that impact one customer may very well completely miss another. If you're being an absolutist, it will always be a sea of red, which doesn't tell people anything useful.
Starting point is 00:11:51 Whereas if a customer is down and their site is off, they don't really care that most other customers aren't affected. I mean, on some level, you kind of want everyone to be down because that defers headline risk, as well as if my site is having a problem, it could be days where someone gets around to fixing a small bug. Whereas if everything is down,
Starting point is 00:12:10 oh, this will be getting attention very rapidly. That's exactly right. Sounds like you've done ops before. Oh, yes. You can tell that because I'm cynical and bitter about everything. It doesn't take long working in operationally focused roles to get there. I appreciate you're saying that, though. Usually people say, let me guess, you used to be an ops person. How can you tell? Because your code is garbage, is the other way that people go down that path. And yeah, credit where due. They're not wrong. You mentioned that back when you were at Amazon, you were given a team of eight people and told to disrupt the data warehouse. Yeah, I've disrupted the data warehouse as a single person before, so it doesn't seem that hard, but I'm guessing you mean something beyond causing an outage. It's more about disrupting the
Starting point is 00:12:50 space, presumably, and I think looking back from 2021, it's hard to argue that Amazon hasn't disrupted the data warehouse space and 15 other spaces besides. Yeah, so that's what we were all about, sort of trying to find areas of non-consumption. So clearly data was growing, data warehousing was not growing at the same rate. We figured that had to do with either a cost problem or it had to do with a simplicity problem or something else, right? You know, why aren't people analyzing the data that they're collecting? So that led to Redshift, a similar problem in transaction processing, led to Aurora, and, you know, various other things.
Starting point is 00:13:32 You also said a couple of minutes ago that Amazon tends to talk more about features than they do about products, and building a product at a startup is a foundationally different experience. I think you're absolutely onto something there. Historically, Amazon has folks get on stage at reInvent and talk about this new thing that got released. And it feels an awful lot like a company saying, yeah, here's some great bricks you can use to build a house. Well, okay, what kind of house can I build with those bricks? Here to talk about the house that they built is our guest customer speaker from Netflix. And it seems like they sort of abdicated in many respects the storytelling portion to a number of their customers. It is a very rare startup that has the luxury of being
Starting point is 00:14:15 able to just punt on building a product and its product story that goes along with it. Have you found that your time at Amazon made storytelling something that you wound up missing a bit more or retelling stories internally that we just don't get to see from the outside or is, oh, wow, I never learned to tell a story before because at Amazon, no one does that. And I have to learn how to do that now that I'm at a startup again. No, I think it really is a storytelling experience. I mean, it's a narrative-based culture there, which is in many ways a storytelling experience. So we were trying to provide a set of capabilities so that people could build their own things. Much as Kindle allows people to self-publish books,
Starting point is 00:15:00 we're not really writing books of our own. And so I think that was the experience there. Outside, you know, you are trying to solve more holistic problems, but you're still only a puzzle piece in the experience that any given customer has, right? You don't satisfy all of their needs, you know, soup to nuts. And part of the challenge too, is that if I'm a small scrappy startup trying to get something out the door for the first time, the problems that I'm experiencing and the challenges that I have are radically different than something that has attained hyperscale and now has whole optimization stories or series of stories going on. It's, will this thing even work at all is my initial focus.
Starting point is 00:15:47 And in some ways, it feels like conference wear cuts against a lot of that because it's hard not to look at the aspirational version of events that people tell on stage at every event I've ever seen and not come away with a takeaway of, oh, what I've built is actually terrible and depressing and sad. One of the things that I find that resonates about what you're
Starting point is 00:16:06 building over at Shoreline is it's not just about the build things from scratch and get them provisioned for the first time. It's about the ongoing operationalization, I think, if that's a word, about that experience and how to wind up handling the care and feeding of something that exists and is running, but is also subject to change because all things are continually being iterated on. That's right. I feel like operation is sort of an increasingly important but underappreciated part of the service delivery experience, much as maybe QA was a couple of decades ago. And over time, we've gone and we've built pipelines to automate our test infrastructure.
Starting point is 00:16:53 We have deployment tools to deploy it, to configure it. But what's weird is that there are two parts of the puzzle that are still highly manual, developing software and operating that software in production. And the other thing that's interesting about that is that you can decide when you are working on developing a piece of code or testing it or deploying it or configuring it. You don't get to decide when the disk goes down or something breaks. That's why you have 24-7 on-call. And so the whole point of Shoreline is to break that into two problems. The things that are automatable and make it easy, trivial to automate those things away
Starting point is 00:17:38 so you don't wake up to do something for the 10th time. And then for the remaining things that are novel to make diagnosing and repairing your fleet as simple and straightforward as diagnosing and repairing a single box. And we do a lot of distributed systems techs underneath the covers to make that the case. But those are the two things that we do.
Starting point is 00:18:02 And so hopefully that reduces people's downtime. And it also brings back a lot of time for the operators so they can focus on higher value things like, you know, working with you to reduce their AWS bill. Yeah, for better or worse, working on the AWS bill is always sort of a backseat function or a back burner function. It's never the burning priority unless things have gone seriously awry. It's a good governance thing. It's the idea of, okay, let's optimize this, fix unit economics. It is rarely the number one most pressing area of business for a company, nor should it be. I think people are sometimes surprised to hear me say that. You want to be reasonable stewards of the money entrusted to you, and you obviously want to continue to remain in business by not losing money on everything you sell, but trying to make it up in volume.
Starting point is 00:18:48 But at some point, it's time to stop cutting and focus instead on revenue growth. That is usually the path to success for almost every company I've ever spoken to, unless they are either very out of kilter or in a very strange spot in the industry. That's true. But it does belong, I think, in the ops function to do optimization of your experience, whether, and you know, improving your resources, improving your security posture, all of those sorts of things fall into production ops landscape from my perspective. But people just don't have time for it because their fleets are growing far, far faster than their headcount is.
Starting point is 00:19:28 So the only solution to that is automation. And I want to talk to you about that. Historically, the idea has been that you have monitoring or observability these days, which I consider to be hipster monitoring, figuring out what's going on in your environment. Then you wind up with incidents being declared when certain things wind up triggering, which presumably are things that actually matter and not you're waking someone up for vague reasons like load average is high on these notes, which tells you nothing in isolation whatsoever. So you have the incident management portion of that next, and that handles a lot of the waking folks up and getting everyone onto the call. You're focusing on, I guess, a third tranche here, which is the idea of incident automation.
Starting point is 00:20:10 Tell me about that. That's exactly right. So having sort of been in the trenches, I never got excited about one more dashboard to look at or someone sort of routing a ticket to the right person per se, because it'll get there, right? Oh, yeah. Like one of the most depressing things you'll ever see in a company
Starting point is 00:20:29 is the utilization numbers from the analytics on the dashboards you build for people. They look at them the day you build them and hand it off, and then the next person visiting it is you while running this report to make sure the dashboard is still there. Yeah. I mean, they are important things, right? I mean, you get this huge sinking feeling if something is wrong and your observability tool is also down,
Starting point is 00:20:50 like CloudWatch was in some large scale events, or if your ticketing system is down and you don't even notify somebody and you don't even know to wake up. But what did excite me, so you need those things, they're necessary, but they're not sufficient. What I think is also needed is something that actually reduces the number of tickets, not just lets you observe them or find the right person to act upon it. So automation is the path to reducing tickets, which is when I got excited because that was one less thing to wake up on that gave me more time back to do things. And most importantly, it improved my customer availability because any individual issue handled manually is going to take an hour or two or three
Starting point is 00:21:39 to deal with. The issue of being done by a computer is going to take a few seconds or a few minutes. It's a whole different thing. It's the difference between the glitch and having to go out on an apology tour to your customers. I really love installing, upgrading, and fixing security agents in my cloud estate. Why do I say that? Because I sell things for a company that deploys an agent. There's no other reason. Because let's face it, agents can be a real headache. Well, Orca Security now gives you a single tool to detect basically every risk in your cloud environment that's as easy to install and maintain as a smartphone app. It is agentless, or my intro would have gotten me in trouble here, but it can still see deep into your AWS workloads while guaranteeing 100% coverage. With Orca Security, there are no
Starting point is 00:22:32 overlooked assets, no DevOps headaches, and believe me, you will hear from those people if you cause them headaches, and no performance hits on live environment. Connect your first cloud account in minutes and see for yourself at orca.security. That's orca as in whale dot security as in that thing your company claims to care about but doesn't until right after it really should have. Oh yes, I feel like those of us who have been in the ops world for long enough, we always have a horror story or two of automation around incidents run amok. A classic thing that we learned by doing this, for example, is if you have a primary and a secondary, failover should be automated. Failing back should not be, or you wind up in these wonderful states of things thrashing back and forth. In many cases in data center land, if you have a phantom router ready to step in, if the primary router goes offline, more outages are caused by a heartbeat failure between those two devices, and they both start vying for power. And that becomes a problem.
Starting point is 00:23:36 Same story with a lot of automation approaches. For example, if, oh, every time a disk winds up getting full, all right, we're going to fire off something to automatically expand the volume. Well, without something to stop that feedback loop, you're going to potentially wind up with an unbounded growth problem, and then you wind up with having no more disks to expand the volume to, being the way that that winds up smacking into things. This is clearly something you've thought about, given that you have built a company out of this, and this is not your first rodeo by a long stretch. How do you think about those things? So I think you're exactly right there again. So the key here is to have the operator or the SRE define what needs to happen on an individual box, but then provide guardrails around them so that you can decide like, oh, a lot of these things have happened at the same time. I'm going to put a rate limiter or a circuit breaker on it and then send it off to somebody else to look at manually. As you said, failover, but don't flap back and forth or limit the number of times that something is allowed to fail before you
Starting point is 00:24:44 send it for some. Finally, everything grounds at a human being looking at something, but that's not a reason not to do the simple stuff automatically because wasting human intelligence and time on doing just manual stuff again and again and again is pointless. And also it increases the likelihood that they're going to cause errors because they're doing something mundane rather than something that requires their intelligence. And so that also is worse than handing it off to be automated.
Starting point is 00:25:16 But there are a lot of guardrails that can be put around this, that we put around it, that is the distributed systems part of it that we provide. You know, in some sense, we're an orchestration system for automation, production ops, the same way that other people provide an orchestration system for deployments and automated rollback and so forth. What technical stacks do you wind up supporting for stuff like this? Is it anything you can effectively SSH into? Does it integrate better with certain cloud providers than this year and likely go to VMware on-prem next year. But, you know, finally, customers tell us what to do.
Starting point is 00:26:12 Oh, yeah. Building for things that have no customer usage is, that's great and all, but talking to folks, we're like, yeah, it'd be nice if it had this. Will you buy it if it does? No. Yeah, let's maybe put that one on the backlog. You've done startups too, I see that. Oh, once or twice. Talk to customers. I find that's one of those things that absolutely is the most effective use of your time you can do.
Starting point is 00:26:32 Looking at your site, shoreline.io, for those who want to follow along at home, it lists a few different remediations that you give as examples. And one of them is expanding disk volumes as they tend to run out of space. I'm assuming from that perspective alone that you are almost certainly running some form of agent? We are running an agent. So part of that is because that way we don't need credentials so that you can just run inside the customer environment directly and without your having
Starting point is 00:27:03 to pass credentials to some third party. Part of it is also so you can do things quickly. So every second, we'll scrape thousands of metrics from the Prometheus exporter ecosystem, calculate thousands more, compare them against hundreds of alarms, and then take action when necessary. And so if you run on box, that can be done far faster than if you go off box. And also a lot of the problems that happen in the production environment are related to networking. And it's not like the box isn't accessible, but it may be that the monitoring path is not accessible.
Starting point is 00:27:40 So you really want to make sure that the box can protect itself, even if there's some issue somewhere in the fleet. And that really becomes an important thing, because that's the only time that you need incident automation when something's gone wrong. I assume that that agent then has specific commands or tasks it's able to do, or does it accept arbitrary command execution? Arbitrary command execution, whatever you can type in at the Linux command prompt, whether it's a call to the AWS CLI, kube control, Linux commands like top, or, you know, even shell scripts, you can automate using Shoreline. Yeah, that was one of the ways that Nagios got it wrong once upon a time with their NRPE, their Nagios Remote Plug-in Engine, where you would only be allowed to run explicit things
Starting point is 00:28:30 that have been pre-approved and pushed out to things in advance. And it's one of the reasons, I suspect, why remediation in those days never took off. Now, we've learned a lot about observability and monitoring and keeping an eye on things that have grown well beyond host-based stuff. So it's nice to see that there is growth in that. I'm much more optimistic about it this time around based upon what you're saying. I hope you're right because I think the key thing also is that I think a lot of these tools vendors think of themselves as the center of the universe, whereas I think Shoreline works best if it's entirely invisible. That's what you want from a feedback control system, from a
Starting point is 00:29:12 automation system, that it just gave you time back and issues are just getting fixed behind the scenes. That's actually what a lot of AWS is doing behind the scenes, right? You're not seeing something whenever some rack goes down. The thing that has always taken me aback, and I don't know how many times I'm going to have to learn this lesson before it sticks, I fall into the common trap of take any one of the big internationally renowned tech companies, and it's easy to believe that, oh, everything inside is far future wizardry of everything works super well. The automation is flawless. Everything is pristine. And your environment compared to that is relative garbage. It turns out that every company I've ever spoken with and taken SREs from those companies out to have way too many drinks until they hit honesty levels.
Starting point is 00:30:05 They always talk about it being a sad dumpster fire in a bunch of different ways. And we're talking some of the companies that people loud as the aspirational, your infrastructure should be like these companies. And I find it really important to continue to socialize that point, just because the failure mode otherwise is people think that their company just employs terrible engineers. And if people were any good, it would be seamless. Just like they say on conference stages, it's like comparing your dating life to a romantic comedy.
Starting point is 00:30:34 It's not an accurate depiction of how the world works. Yeah, that's true. That said, I'd say that like the DBA working on-prem may be managing 100 databases. The average DBA in RDS or somebody on-call might be managing 100,000. At that point, automation is no longer optional. Yeah, and the way you get there is every week you squash and extinguish one thing forever. And then you start seeing less and less frequent things because, you know, one in a million is actually occurring to you. But, you know, like if it was one in a hundred, that would just crush you. And so you
Starting point is 00:31:19 just need to, you know, very diligently every week, every day, remove something. Shoreline is in many ways the product I wish I had had at AWS because it makes automating that stuff easy, a matter of minutes rather than months. And so that gives you the capability to do automation. Everyone wants automation, but the question is, why don't they do it? And it's just because it takes so much time and we're so busy as operators.
Starting point is 00:31:46 Absolutely. I don't mean to say that these large companies working at Hyperscale have not solved for these problems and done truly impressive things, but there's always sharp edges. There's always things that are challenging and tricky. On this show, we had Dr. Christina Maslach recently as an expert on burnout, given that she spent her entire career studying occupational burnout as an academic. And it turns out that it's not, to equate this to the operations world, it's not waking up at two in the morning to have to fix a problem, generally, that burns people out. It's being woken up to fix a problem at 2 a.m. consistently, and it's always the same problem and nothing ever seems to change. It's the worst ops jobs I've ever seen are the ones where you have to wake up to fix a thing, but you're not empowered to actually fix the cause, just the symptom. I couldn't agree more. And that's the other aspect of Shoreline is to allow the operators or SREs to build the remediations rather than just put a ticket into some queue for some developer to get prioritized alongside everything else.
Starting point is 00:32:53 Because you're on the sharp edge when you're doing ops, right? You deal with all the consequences of the issues that are raised. And so, you know, it's fine that you say like, okay, there's this memory leak. I'll create a ticket back to dev to go and fix it. But, you know, I need something that helps me actually fix it here and now. Or if there's a log that's filling up my disk, it's fine to tell somebody about it, but you have to, you know, grow your disk or move that log off the disk. And you don't want to have to wake up for those things. No. And the idea that everything like this gets fixed is a bit of a misnomer.
Starting point is 00:33:31 One of my hobbies is whenever a site goes down and it is uncovered, sometimes very publicly, sometimes in RCEs, that the actual reason everything broke was due to an expired certificate. Yep. that the actual reason everything broke was due to an expired certificate. I like to go and schedule out a couple of calendar reminders on that one for myself of check it in 90 days in case they're using a refresh from Let's Encrypt. And let's check it as well in one year and see if there's another outage just like that. It has a non-zero success rate because as much as we want to convince ourselves that, oh, that bit me once and I'll never get bitten like that again, that doesn't always hold true. Certificates are a very common source of very widespread outages. It's actually one of the remediations we provide out of the box. So, you know, alongside like making it possible for
Starting point is 00:34:24 people to create these things quickly, we also provide what we call op packs, which are basically getting started things which have the metrics, alarms, actions, bots, so they can just, you know, like fix it forever without actually having to do very much other than, you know, like reviews what we have done. And that's on some level, I think, part of the magic is the
Starting point is 00:34:46 abstracting away the toil so that people are left to solve interesting problems and think about these things and guiding them down a path where, okay, what should I do on an automatic basis if the disk fills up? Well, I should extend the volume. Yeah. But maybe you should alert after the fifth time in an hour that you have to extend the same volume because just spitballing here, maybe there's a different problem here that putting a bandaid on isn't going to necessarily solve. It forces people to think about what are those triggers that should absolutely result in human intervention? Because you don't necessarily want to solve things like memory leaks, for example, with, oh, our application leaks memory,
Starting point is 00:35:24 so we have to restart it once a day. Now, in practice, the right way to solve that is to fix the application. In practice, there are so many cron jobs out there that are set to restart things specifically for that reason, because cron jobs are quick and easy, and application developer time is absolutely not easy to come by in many of these shops. It just comes down to something that helps enforce more of a process, more of a rigor. I like the idea quite a bit. It aligns both with where people are and how a better tomorrow starts to look. I really do think you're onto something here. I mean, I think it's one of these things where you just have to understand it's not either or, that it's not a question of operator pain or developer pain. It's let's go and address it in the here and now
Starting point is 00:36:08 and also provide the information, also through an automated ticket generation, to where someone can look to fix it forever at source, right? Oh, yeah. It's always great. The user experience, too, of having those tickets created automatically is also sometimes handy because the worst way to tell someone
Starting point is 00:36:28 you don't care about their problem when they come to you in a panic is have you opened a ticket? Now, yes, of course, you need a ticket to track these things, but maybe when someone is ghost pale and scared to death about what they think just broke the data,
Starting point is 00:36:40 maybe have a little more empathy there. And yeah, the process is important, but there should be automatic ways to do that. These things all have APIs. I really like your vision of operational maturity and managing remediation in many cases on an automatic basis. I think it's going to be so much more important
Starting point is 00:36:58 in a world where deployments are more frequent. You have microservices, you have multiple clouds, you have containers that give a 10x increase in the number of things you have to manage. There's a lot for operators to have to keep in their heads. And things are just changing constantly with containers. Every minute someone comes and one goes. So you just really need to,
Starting point is 00:37:23 even if you're just doing it for a diagnosis, it needs to be collecting it and putting it aside is really critical. If people want to learn more about what you're building and how you think about these things, where can they find you? They can reach out to me on LinkedIn at awgupta. Or, you know, of course, they can go to shoreline.io and reach out there. I'm also onurag at shoreline.io if they want to reach out directly. And, you know, we'd love to give people demos. We know there's a lot of pain out there where our mission is to reduce it.
Starting point is 00:37:57 Thank you so much for taking the time to speak with me today. I really appreciate it. This is a great privilege to talk to you. Anurag Gupta, CEO and founder of Shoreline.io. I'm cloud economist, Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review
Starting point is 00:38:15 on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with a comment telling me that I'm wrong and that Amazonians are the best at being on call because they carry six pagers. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duck Bill Group. We help companies fix their AWS bill by making it smaller and less horrifying.
Starting point is 00:38:47 The Duck Bill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started. this has been a humble pod production stay humble
