Screaming in the Cloud - The Evolution of Cloud Services with Richard Hartmann

Episode Date: October 18, 2022

About Richard
Richard "RichiH" Hartmann is the Director of Community at Grafana Labs, Prometheus team member, OpenMetrics founder, OpenTelemetry member, CNCF Technical Advisory Group Observability chair, CNCF Technical Oversight Committee member, CNCF Governing Board member, and more. He also leads, organizes, or helps run various conferences from hundreds to 18,000 attendees, including KubeCon, PromCon, FOSDEM, DENOG, DebConf, and Chaos Communication Congress. In the past, he made mainframe databases work, ISP backbones run, kept the largest IRC network on Earth running, and designed and built a datacenter from scratch. Go through his talks, podcasts, interviews, and articles at https://github.com/RichiH/talks or follow him on Twitter at https://twitter.com/TwitchiH for musings on the intersection of technology and society.

Links Referenced:
Grafana Labs: https://grafana.com/
Twitter: https://twitter.com/TwitchiH
Richard Hartmann list of talks: https://github.com/richih/talks

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This episode is sponsored in part by our friends at AWS AppConfig. Engineers love to solve and occasionally create problems,
Starting point is 00:00:39 but not when it's an on-call fire drill at four in the morning. Software problems should drive innovation and collaboration, not stress and sleeplessness and threats of violence. That's why so many developers are realizing the value of AWS AppConfig feature flags. Feature flags let developers push code to production, but hide that feature from customers so that the developers can release their feature when it's ready. This practice allows for safe, fast, and convenient software development. You can seamlessly incorporate AppConfig feature flags into your AWS or cloud environment and ship your features with excitement, not trepidation and fear. To get started, go to snark.cloud slash appconfig. That's snark.cloud slash appconfig.
Starting point is 00:01:32 This episode is brought to us in part by our friends at Datadog. Datadog's a SaaS monitoring and security platform that enables full-stack observability for developers, IT operations, security, and business teams in the cloud age. Datadog's platform, along with 500-plus vendor integrations, allows you to correlate metrics, traces, logs, and security signals across your applications, infrastructure, and third-party services in a single pane of glass. Combine these with drag and drop dashboards and machine learning based alerts to help teams troubleshoot and collaborate more effectively, prevent downtime
Starting point is 00:02:10 and enhance performance and reliability. Try Datadog in your environment today with a free 14-day trial and get a complimentary t-shirt when you install the agent. To learn more, visit datadoghq.com slash screaming in the cloud to get started. That's www.datadoghq.com slash screaming in the cloud. Welcome to Screaming in the Cloud. I'm Corey Quinn. There are an awful lot of people who are incredibly good at understanding the ins and outs and the intricacies of the observability world. But they didn't have time to come on the show today. Instead, I am talking to my dear friend of two decades now, Richard Hartmann, better known on the internet as RichiH, who's the Director of Community at Grafana Labs, here to suffer, in a somewhat atypical departure for the theme of this
Starting point is 00:03:06 show, personal attacks for once. RichiH, thank you for joining me. And thank you for agreeing on personal attacks. Exactly. It was one of your riders: there have to be personal attacks back and forth, or you refuse to appear on the show. You've been on before. In fact, the last time we did a recording, I believe you were here in person, which was a long time ago. What have you been up to? You're still at Grafana Labs. And in many cases, I would point out that, wow, you've been there for many years. That seems to be an atypical thing, which is an American tech industry perspective. Because every time you and I talk about this, you look at folks who, wow, you were only at that company for five years. What's wrong with you? You tend to take the longer view, and I tend to have the fast-twitch, time to go ahead and leave jobs because it's been more than 20 minutes
Starting point is 00:03:55 approach. I see that you're continuing to live what you preach, though. How's it been? Yeah, so there's a little bit of COVID brains, I think. When we talked in 2018, I was still working at SpaceNet, building a data center. But the last two and a half years didn't really happen for many people, myself included. So I guess that includes you. No, no, you're right. You've only been at Grafana Labs a couple of years. One would think I would check the notes before shooting my mouth off, but then one wouldn't know me. What notes?
Starting point is 00:04:24 Anyway, I've been around Prometheus and Grafana since 2015, but like real full-time everything is 2020. There was something in between since 2018, I contracted to do vulnerability handling and everything for Grafana Labs. Of course, they had something and they didn't know how to deal with it. But no, the full time is 2020. But as to the space in and of itself, it's maybe a little bit German of me, but trying to understand the real world and trying to get an overview of systems and how they actually work and if they are working correctly and as intended, and if not, how they're not working as intended and how to fix this is something
Starting point is 00:05:05 which has always been super important to me, in part because I just want to understand the world. And this is a really, really good way to automate understanding of the world. So it's basically a work-saving mechanism. That's why I've been sticking to it for so long, I guess. Back in the early days of monitoring systems, because we called it monitoring back then, because using simple words that lacked
Starting point is 00:05:29 nuance was sort of de rigueur back then. We wound up effectively having tools. Nagios is the one that springs to mind and it was terrible in all the ways you would expect a tool written in janky Perl in the early 2000s to be. But it told you what was going
Starting point is 00:05:46 on. It tried to do a thing, generally reach a server or query it about things. And when things fell out of certain specs, it screamed its head off, which meant that when you had things like the core switch melting down, thinking of one very particular incident, you didn't get a Nagios alert. You got 4,000 Nagios alerts. But start to finish, you could wrap your head rather fully around what Nagios did and why it did the sometimes strange things that it did. These days, when you take a look at Prometheus, which we hear a lot about, particularly in the Kubernetes space, and Grafana, which is often mentioned in the same breath, it's never been quite clear to me exactly where those start and stop. It always
Starting point is 00:06:25 feels like it's a component in a larger system to tell you what's going on, rather than a one-stop shop that's going to, you know, shriek its head off when something breaks in the middle of the night. Is that the right way to think about it? The wrong way to think about it? It's a way to think about it. So personally, I use the terms monitoring and observability pretty much interchangeably. Observability is a relatively well-defined term, even though most people won't agree. But if you look back into the 70s, into control theory, where the term is coming from, it is the measure of how much you're able to determine the internal state of a system by looking at its inputs and its outputs. Depending on the definition, some people don't include the inputs, but that is the OG definition as far as I'm aware.
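For reference, the control-theory definition Richard is pointing at can be stated precisely. For a linear time-invariant system, the standard Kalman rank condition says the internal state can be determined from the inputs and outputs exactly when the observability matrix has full rank; a sketch in standard notation:

```latex
% LTI system: \dot{x} = A x + B u, \qquad y = C x + D u, \qquad x \in \mathbb{R}^n
% The system is observable iff the observability matrix has full rank:
\mathcal{O} =
\begin{bmatrix}
  C \\ C A \\ C A^{2} \\ \vdots \\ C A^{n-1}
\end{bmatrix},
\qquad
\operatorname{rank}(\mathcal{O}) = n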
Starting point is 00:07:07 And from this, there flow a lot of things. This question of, or this interpretation of the difference between telling that yes, something is broken versus why something is broken.
Starting point is 00:07:18 Or if you can't ask new questions on the fly, it's not observability. Like, all of those things are fundamentally mapped to this definition of: I need enough data to determine the internal state of whatever system I have, just by looking at what is coming in, what is going out. And that is, at the core, the thing. Now, obviously, it's become a buzzword, which is oftentimes the fate of successful things. So it's become a buzzword, and you end up with cargo culting. I would argue periodically that observability is hipster monitoring. If you call it monitoring, you get yelled at by Charity Majors, which is tongue-in-cheek,
Starting point is 00:07:57 but she has opinions, made nonetheless, shall I say, frustrating by the fact that she is invariably correct in those opinions, which just somehow makes it so much worse. It would be easy to dismiss things she says if she weren't always right. And the world is changing, especially as we get into the world of distributed systems: "is the server that runs the app working or not working" loses meaning when we're talking about distributed
Starting point is 00:08:25 systems, when we're talking about containers running on top of Kubernetes, which turns every outage into a murder mystery. We start having distributed applications composed of microservices, so you have no idea necessarily where an issue is. Okay, is this one microservice having an issue related to the request coming into a completely separate microservice? And it seems that for those types of applications, the answer has been tracing for a long time now, where originally it was something that felt like it was sprung fully formed from the forehead of some god known as one of the hyperscalers, but now is available to basically everyone in theory.
Starting point is 00:09:00 In practice, it seems that instrumenting applications is still one of the hardest parts of all of this. I tried hooking up one of my own applications to be observed via OTel, the OpenTelemetry project, and it turns out that right now, OTel and AWS Lambda have an intersection point that makes everything extremely difficult to work with. It's not there yet. It's not baked yet. And someday I hope that changes, because I would love to interchangeably just throw metrics and traces and logs to all the different observability tools and see which ones work, which ones don't. But that still feels very far away from current state of the art. Before we go there, maybe one thing which I don't fully agree with. You said
Starting point is 00:09:40 that previously you were told if a service is up or down, that's the thing which you cared about. And I don't think that's what people actually cared about. At that time, also, what they fundamentally cared about is: is the user-facing service up or down, or impacted? Is it slow? Does it return errors for X percent of requests? Something like this. Is the site up? You're right.
Starting point is 00:10:01 I was hand-waving over a whole bunch of things. It was, okay, first, the web server was returning a page. Yes or no? Great. Can I ping the server? Okay, well, there are ways a server can crash and still leave enough of the TCP/IP stack up where it can respond to pings and do little else. And then you start adding things to it.
Starting point is 00:10:18 The Nagios thing that I always wanted to add and had to was: is the disk full? And that was annoying. And on some level, like, why should I care in the modern era how much stuff is on a disk? Storage is cheap and plentiful. The problem is, after the third outage in a month because the disk filled up, you start to not have a good answer for, well, why aren't you monitoring whether the disk is full? And that was one of the contributors to taking down the server. When the website broke, there were what felt like a relatively small number of reasonably well-understood contributors to that at small to mid-sized applications, which is what I'm talking about, the only things that people would let me touch. I wasn't running hyperscale stuff where you have a fleet of 10,000 web servers and is the server up?
Starting point is 00:11:01 Yeah, in that scenario, no one cares. But when we're talking about the database server and the two application servers and the four web servers talking to them, you think about it more in terms of pets than you do cattle. Yes, absolutely. Yet, I think there was a mistake back then, and I tried to do it differently. As a specific example with the disk, and I'm absolutely agreeing that previous-generation tools limit you in how you can actually work with your data, in particular once you're on metrics, where you can do actual math on the data: it does not matter if the disk is almost full. It matters if that disk is going to be full within X amount of time. If that disk is 98% full and it sits there at 98% for 10 years and provides the service, no one cares. The thing is,
Starting point is 00:11:47 will it actually run out in the next two hours, in the next five hours, what have you? Depending on this, is this currently or imminently customer-impacting or user-impacting? Then yes, alert on it, raise hell, wake people, make them fix it. As opposed to: this thing can be dealt with during business hours on the next workday, and you don't have to wait. Yeah, the big filer with massive amounts of storage has crossed the 70% line. Okay, now it's time to start thinking about that. What do you want to do? Maybe it's time to order another shelf of disks for it, which is going to take some
Starting point is 00:12:17 time. That's a radically different scenario than the 20-gigabyte root volume on your server just started filling up dramatically, and the rate of change is such it'll be full in 20 minutes. Yeah, one of those is something you want to wake people up for. Generally speaking, you don't want to wake people up for what is fundamentally a longer-term strategic business problem. That can be sorted out in the light of day, versus: we're not going to be making money in two hours, so if I don't wake up and fix this now... That's the kind of thing you generally want to be woken up for. Well, let's be honest, you
Starting point is 00:12:51 don't want that to happen at all, but if it does happen, you kind of want to know in advance rather than after the fact. You're literally describing predict_linear from Prometheus, which is precisely for this, where I can look back over X amount of time and make a linear prediction, because everything else breaks down at scale, blah, blah, blah, too detailed. But the thing is, I can draw a line with my pencil by hand on my data, and I can predict when this thing is going to hit it, which is obviously precisely correct if I have a TLS certificate; it's a little bit more hand-wavy when it's a disk. But still, you can look into the future and you say: what will be happening if current trends for the last
Starting point is 00:13:31 X amount of time continue in Y amount of time? And that's precisely the thing there where you get this more powerful ability of doing math with your data.
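As a concrete illustration of that "do math with your data" point, here is a minimal sketch of the disk prediction against the Prometheus HTTP API. It assumes a Prometheus server on localhost:9090 scraping node_exporter; the mountpoint, time windows, and threshold are illustrative, not a recommendation.

```python
# Minimal sketch: ask Prometheus to extrapolate free disk space 4 hours ahead.
# predict_linear() fits a line over the samples in the 6h range; a predicted
# value below zero means "this filesystem runs out of space within 4 hours."
import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # assumed local Prometheus
QUERY = 'predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600)'

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    predicted = float(series["value"][1])  # value is [timestamp, "number"]
    if predicted < 0:
        instance = series["metric"].get("instance", "unknown")
        print(f"{instance}: predicted to run out of disk within 4 hours")
```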
Starting point is 00:14:10 See, when you say it like that, it sounds like it actually is a whole term of art, where you're focusing on an in-depth field where salaries are astronomical, whereas the tools that I had to talk about this stuff back in the day made me sound like, effectively, the sysadmin that I was: grunting and pointing that this is going to fill up. And that is how I thought about it. And this is the challenge, where it's easy to think about these things in narrow, defined contexts like that, but at scale, things break. Like the idea of anomaly detection. Well, okay, great. If normally the CPU in these things is super bored and suddenly it gets really busy, that's atypical. Maybe we should look into it, assuming that it has a challenge. The problem is that that is a lot harder than it sounds, because there are so many factors that factor into it. And as soon as you have something, quote-unquote, intelligent making decisions on this, it doesn't take too many false positives before you start ignoring everything it has to say and missing legitimate things.
Starting point is 00:14:38 It's this weird and obnoxious conflation of both hard technical problems and human psychology. And the breaking up of old service boundaries. Of course, when you say microservices and such: fundamentally, functionally, a microservice or nanoservice or picoservice, but the pendulum is already swinging back to larger units of complexity. But it fundamentally does not make any difference if I have a monolith on some mainframe or if I have a bunch of microservices. Yes, I can scale differently.
Starting point is 00:15:12 I can scale horizontally a lot more easily. Vertically, it's a little bit harder, blah, blah, blah. But fundamentally, the logic and the complexity which is being packaged is fundamentally the same. More users, everything, but it is fundamentally the same. What's happening again and again and again is I'm breaking up those old boundaries, which means the old tools which have assumptions built in about certain aspects of how I can actually get an overview of a system just start breaking down.
Starting point is 00:15:40 When my complexity unit, or my service, or what have you, is usually congruent with a physical piece of hardware, or several services are congruent with that piece of hardware, it absolutely makes sense to think about things in terms of this one physical server. The fact that you have different considerations in cloud and microservices and blah, blah, blah is not inherently that it is more complex. On the contrary, it is fundamentally the same thing. It scales with users and everything, but it is fundamentally the same thing. But I have different boundaries of where I put interfaces onto my complexity,
Starting point is 00:16:17 which basically allow me to hide all of this complexity from the downstream users. That's part of the challenge that I think we're grappling with across this entire industry from start to finish, where we originally looked at these things and could reason about it because it's the computer and I know how those things work. Well, kind of, but okay, sure. But then we start layering levels of complexity on top of layers of complexity on top of layers of complexity. And suddenly when things stop working the way that we expect, it can be very challenging to unpack and understand why. One of the ways I got into this whole space was understanding, to some degree, of how system calls work, of how the kernel wound up interacting with user space,
Starting point is 00:17:00 about how Linux systems worked from start to finish. And these days, that isn't particularly necessary most of the time for the care and feeding of applications. The challenge is when things start breaking, suddenly having that in my back pocket to pull out could be extremely handy. But I don't think it's nearly as central as it once was, and I don't know that I would necessarily advise someone new to the space to spend a few years as a systems person digging into a lot of those aspects. And this is why you
Starting point is 00:17:30 need to know what inodes are and how they work. Not really, not anymore. It's not front and center the way that it once was in most environments, at least in the world that I live in. Agree? Disagree? Agreed, but it's very much unsurprising. You probably can't tell me how to precisely grow sugar cane or corn. You can't tell me how to refine the sugar out of it, but you can absolutely bake a cake. But you will not be able to tell me even a third of, and I'm, for the record, I'm also not able to tell you even a third about the supply chain, which just goes from,
Starting point is 00:18:02 I have a field and some seeds, and I need to have a package of refined sugar. You're absolutely unable to do any of this. The thing is, you've been part of the previous generation of infrastructure, or you know how this underlying infrastructure works. So you have more ability to reason about this, but it's not needed for cloud services nearly as much. You need different types of skill sets, but that doesn't mean the old skill set is completely useless, at least not as of right now. It's much more a case of you need fewer of those people and you need them in different
Starting point is 00:18:34 places because those things have become infrastructure, which is basically the cloud play where a lot of this is just becoming infrastructure more and more. Oh yeah, back then I distinctly remember my elders looking down their noses at me because I didn't know assembly. And how could I possibly consider myself a competent systems admin if I didn't at least have a working knowledge of assembly, or at least C, which I over time learned enough about to know that I didn't want to be a C programmer. And you're right, this is the value of cloud. I mean, going back to those days, getting a web server up and running just to compile Apache's HTTPD took a week and an in-depth knowledge of GCC flags. And then in time, oh great, we're going to have RPM or DEBs. Great. Okay. Then in time you have apt if you're in the
Starting point is 00:19:20 DEB land, because I know you are a Debian developer. But over in Red Hat land, we had Yum and other tools. And then in time, it became, oh, we can just use something like Puppet or Chef to wind up ensuring that that thing is installed. And then, oh, just Docker run. And now it's a checkbox in a web console for S3. These things get easier with time. And step by step by step, we're standing on the shoulders of giants. Even in the last 10 years of my career, I used to have a great challenge question that I would interview people with. Do you know what tiny URL is? It takes a short URL and then expands it to a longer one. Great.
Starting point is 00:19:58 On the whiteboard, tell me how you'd implement that. You could go up one side and down the other, and then you can add constraints: multiple data centers, now one goes offline, how do you not lose data, et cetera, et cetera. But these days, there are so many ways to do that using cloud services that it almost becomes trivial. Okay, multiple data centers, API Gateway, a Lambda, and a global DynamoDB table. Now what?
Starting point is 00:20:21 Well, now it gets slow. Why is it getting slow? Well, in that scenario, probably because of something underlying the cloud provider. So now you lose an entire AWS region. How do you handle that? Seems to me when that happens, the entire internet's kind of broken. Do people really need longer URLs? And that is a valid answer in many cases. The question doesn't really work without a whole bunch of additional constraints that make it sound fake. And that's not a weakness. That is the fact that computers and cloud services have never been as accessible as they are now. And that's a win for everyone.
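For flavor, a hedged sketch of that "API Gateway, a Lambda, and a global DynamoDB table" answer; the table name, slug attribute, and environment variable are illustrative, and a real deployment still needs the table, IAM, and API Gateway wiring around it.

```python
# Hypothetical redirect half of the URL shortener: API Gateway proxies
# GET /{slug} to this Lambda, which looks the slug up in a DynamoDB table.
import os

import boto3

TABLE = boto3.resource("dynamodb").Table(os.environ.get("TABLE_NAME", "short-urls"))

def handler(event, context):
    slug = event["pathParameters"]["slug"]
    item = TABLE.get_item(Key={"slug": slug}).get("Item")
    if item is None:
        return {"statusCode": 404, "body": "unknown short URL"}
    # 301 so clients and CDNs can cache the mapping and skip us next time.
    return {"statusCode": 301, "headers": {"Location": item["url"]}, "body": ""}
```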
Starting point is 00:20:54 There's one aspect of accessibility which is actually decreasing, or two. A, you need to pay for them on an ongoing basis, and B, you need an internet connection which is suitably fast, low-latency, what have you. And those are things which actually do make things harder for a variety of reasons. If I look at our backend systems, as in Grafana, all of them have single-binary modes where you literally compile everything into a single binary and you can run it on your laptop.
Starting point is 00:21:20 Of course, if you're stuck on a plane, you can't do any work on it. That kind of is not the best of situations. And if you have a huge CI/CD pipeline and everything, and it's cloud and fine and dandy, but your internet breaks. Yeah, so I do agree that it is becoming generally more accessible. I disagree that it is becoming more accessible along all possible axes. I would agree. There is a silver lining to that as well, where yes, they are fraught and dangerous, and I would preface this with a whole bunch of warnings.
Starting point is 00:21:51 But from a cost perspective, all of the cloud providers do have a free tier offering where you can kick the tires on a lot of these things in return for no money. Surprisingly, the best one of those is Oracle Cloud, where they have an unlimited free tier, use whatever you want in this subset of services, and you will never be charged a dime. As opposed to the AWS model of free tier, where, well, okay, it suddenly got very popular, or you misconfigured something, and surprise, you now owe us enough money to buy Belize.
Starting point is 00:22:18 That doesn't usually lead to a great customer experience. But you're right. You can't get away from needing an internet connection of at least some level of stability and throughput in order for a lot of these things to work. The stuff you would do locally on a Raspberry Pi, for example, if you're budget-constrained and want to get something out there, or your laptop. Great. That's not going to work in the same way
Starting point is 00:22:40 as a full-on cloud service will. It's not free unless you have hard guarantees that you're not going to ever pay anything. It's fine to send warnings. It's fine to switch the thing off. It's fine to have you hit random hard and soft quotas. It is not a free service if you can't guarantee that it is free. I agree with you. I think that there needs to be a free offering where, well, okay, you want us to suddenly stop serving traffic to the world? Yes. When the alternative is that you have to start charging me through the nose? Yes, I want you to stop serving traffic. That is definitionally what it says on the tin. And as an independent learner, that is what I want. Conversely, if I'm an
Starting point is 00:23:23 enterprise, yeah, I don't care about money. We're running our Super Bowl ad right now. So whatever you do, don't stop serving traffic, charge us all the money. And there's been a lot of hand-wringing about, well, how do we figure out which direction to go in? And it's, have you considered asking the customer? So on a scale of one to bank, how serious is this account going to be? What are your big concerns? Never charge me or never go down, because we can build for either of those. Just let's make sure that those expectations are aligned.
Starting point is 00:23:51 Because if you guess, you're going to get it wrong, and then no one's going to like you. I would argue that all those services from all cloud providers are actually built to address both of those. It's a deliberate choice not to offer certain aspects. Absolutely. When I talk to AWS, like, yeah, but there is an eventual consistency challenge in
Starting point is 00:24:11 the billing system where it takes, as anyone who's looked at the billing system can see, multiple days sometimes for usage data to show up. So how would we be able to stop things if usage starts climbing? To which my relatively direct response is, that sounds like a you problem. I don't know how you'd fix that, but I do know that if suddenly you decide as a matter of policy to, okay, if you're in the free tier, we will not charge you, or even we will not charge you more than $20 a month, so you build yourself some headroom, great. And anything that people are able to spin up, well, you're just going to have to eat the cost as a provider. I somehow suspect that that would get fixed super quickly if that were the constraint.
Starting point is 00:24:52 The fact that it isn't is a conscious choice. Absolutely. And the reason I'm so passionate about this, about the free space, is not because I want to get a bunch of things for free. I assure you, I do not. I mean, I spend my life fixing AWS bills and looking at AWS pricing, and my argument is very rarely it's too expensive. It's that the billing dimension is hard to predict or doesn't align with a customer's
Starting point is 00:25:16 experience or prices a service out of a bunch of use cases where it'll be great. But very rarely do I just sit here shaking my fist and saying it costs too much. The problem is, is when you scare the living crap out of a student with a surprise bill that's more than their entire college tuition, even if you waive it a week or so later, do you think they're ever going to be as excited as they once were to go and use cloud services and build things for themselves and see what's possible? I mean, you and I met on IRC 20 years ago because back in those days, the failure mode and the risk financially was extremely low. It's, yeah, the biggest concern that I had back then when I was doing some of my Linux experimentation is if I type the wrong thing, I'm going to break my laptop. And yeah, that happened once or twice. And I learned not to
Starting point is 00:26:00 make those same kinds of mistakes, or put guardrails in so the blast radius was smaller. Use a remote system instead. Yeah, someone else's computer that I can destroy. Wonderful. But that was how we live and we learn as we were coming up. There was never an opportunity for us, to my understanding, to wind up accidentally running up an eight-million-dollar charge. Absolutely. And psychological safety is one of the most important things in what most people do. We are social animals. Without this psychological safety, you're not going to have long-term self-sustaining groups.
Starting point is 00:26:36 You will not make someone really excited about it. There are two basic ways to sell: trust or force. Those are the two; there's none else. Managing shards. Maintenance windows. Overprovisioning.
Starting point is 00:26:52 Elastic hash bills. I know, I know. It's spooky season and you're already shaking. It's time for caching to be simpler. Momento Serverless Cache lets you forget the back end to focus on good code and great user experiences. With true auto-scaling and a pay-per-use pricing model, it makes caching easy no matter your cloud provider. Get going for free at gomomento.co slash screaming. That's g-o-m-o-m-e-n-t-o dot c-o slash screaming. Yeah, but it just also looks ridiculous. I was talking to someone somewhat recently who was used to spending four bucks a
Starting point is 00:27:34 month on their AWS bill for some S3 stuff. Great. Good for them. That's awesome. Their credentials got compromised. Yes, that is on them to some extent. Okay, great. But now, after six days, they were told that they owed $360,000 to AWS. And I don't know how, as a cloud company, you can sit there and ask a student to do that. That is not a realistic thing. They're what is known in the United States, at least, in the world of civil litigation, as quote-unquote judgment-proof, which means, great, you could wind up finding that someone owes you $20 billion. Most of the time, they don't have that, so you're not able to recoup it. Yeah, the judgment feels good, but you're never going to see it. That's the problem with something like that. It's, yeah,
Starting point is 00:28:20 I would declare bankruptcy long before, as a student, I wound up paying that kind of money. And I don't hear any stories about them releasing the collection agency hounds against people in that scenario. But I wouldn't guarantee that. I would never urge someone to ignore that bill and see what happens. And it's such an off-putting thing that, from my perspective, is beneath the company. And let's be clear, I see this behavior at times on Google Cloud, and I see it on Azure as well. This is not something that is unique to AWS, but they are the 800-pound gorilla in the space, and that's important. Whereas, just to mention right now, because I was about to give you crap for this too, but if I go to Grafana.com, it says,
Starting point is 00:29:03 and I quote, play around with the Grafana stack. Experience Grafana for yourself. No registration or installation needed. Good. I was about to yell at you if it's, oh, just give us your credit card and go ahead and start spinning things up and we won't charge you. Honest. Even your free account does not require a credit card. You're doing it right. That tells me that I'm not going to get a giant surprise bill. You have no idea how much thought and work went into our free offering. There was a lot of math involved. None of this is easy.
Starting point is 00:29:34 I want to be very clear on that. Pricing is one of the hardest things to get right, especially in cloud. And it also, when you get it right, it doesn't look like it was that hard for you to do. But I fix people's AWS bills for a living. And still, five or six years in, one of the hardest things I still wrestle with is pricing engagements. It's incredibly nuanced, incredibly challenging. And at least for services in the cloud space where you're doing usage-based billing, that becomes a problem. But glancing at your pricing page, you do hit the two things that are incredibly important to me.
Starting point is 00:30:10 The first one is use something for free as an added bonus. You can use it forever, and I can get started with it right now. Great. When I go and look at your pricing page or I want to use your product and it tells me to click here to contact us, that tells me it's an enterprise sales cycle, it's going to be really expensive and I'm not solving my problem tonight. Whereas the other side of it, the enterprise offering needs to be contact us and you do that. That speaks to the enterprise procurement people who don't know how to sign a check that doesn't have two commas in it and they want to have custom terms and all the rest and they're prepared to
Starting point is 00:30:43 pay for that. If you don't have that, you look too small time. It doesn't matter what price you put on it, you wind up offering your enterprise tier at some large number. Yeah, for some companies, that's a small number. You don't necessarily want to back yourself in depending upon what the specific needs are. You've gotten it right.
Starting point is 00:31:02 Every common criticism that I have about pricing, you folks have gotten right. And I definitely can pick up on your fingerprints on a lot of this, because it sounds like a weird thing to say of, well, he's the Director of Community. Why would he weigh in on pricing? I don't think you understand what community is when you ask that question. Yes, I fully agree it's super important to get pricing right, or to get many things right. And usually, the things which just feel naturally correct are the ones which took the most effort and the most time and everything. And yes, at least I was in those conversations, or part of them, and the one thing which was always
Starting point is 00:31:43 clear is: when we say it's free, it must be free. When we say it is forever free, it must be forever free. No games, no lies. Do what you say and say what you do, basically. We have things where initially you get certain pro features, and you can keep paying and you can keep using them, or after X amount of time, they go away. Things like these are built in because that's what people want.
Starting point is 00:32:07 They want to play around with the whole thing and see, hey, is this actually providing me value? Do I want to pay for this feature, which is nice? Or this and that plugin or what have you. And yeah, you're also absolutely right that once you leave these constraints of basically self-serve cloud. You are talking about bespoke deals, but you're also talking about, okay, let's sit down. Let's actually understand what your business is. What are your business problems?
Starting point is 00:32:32 What are you going to solve today? What are you trying to solve tomorrow? Let us find a way of actually supporting you and invest into a mutual partnership and not just grab the money and run. We have extremely low churn for, I would say, pretty good reasons because this thing about our users, our customers being successful, we do take extremely seriously.
Starting point is 00:32:57 It's one of those areas that I just can't shake the feeling is underappreciated industry-wide. And the reason I say that this is your fingerprints on it is because if this had been wrong, you have a lot of, we'll call them idiosyncrasies, where there are certain things you absolutely will not stand for
Starting point is 00:33:15 and misleading people and tricking them into paying money is high on that list. One of the reasons we're friends. So yeah, as I say, I see your fingerprints on this. It's, yeah, if this hadn't been worked out the way that it is, you would not still be there. One other thing that I wanted to call out about, well, I guess it's a confluence of pricing
Starting point is 00:33:33 and logging and the rest. I look at your free tier and it offers up to 50 gigabytes of ingest a month. And it's easy for me to sit here and compare that to other services, other tools and other logging stories. And then I have to stop and think for a minute that, yeah, disks have gotten way bigger, and internet connections have gotten way faster, and even the logs have gotten way wordier.
Starting point is 00:33:56 I still am not sure that most people can really contextualize just how much logging fits into 50 gigs of data. Do you have any, I guess, ballpark examples of what that looks like? Because it's been long enough since I've been playing in these waters that I can't really contextualize it anymore. Lord of the Rings is roughly five megabytes. It's actually less. So we are talking literally 10,000 Lord of the Rings which you can just shove at us, and we're just storing this for you. Which also tells you that you're not going to be reading any of this. Or some of it, yes, but not all of it. You need better tooling, and you need proper tooling. And some of this is more modern, and some of this is where we actually pushed the state of the art, but I'm also biased.
Starting point is 00:34:48 But I, for myself, do claim that we did push the state of the art here. But at the same time, you come back to those absolute fundamentals of how humans deal with data. If you look back as far, basically, as we have writing, literally 6,000 years ago is the oldest writing, humans have always dealt with information, with the state of the world, in very specific ways. A: is it important enough to even write it down, to even persist it in whatever persistence mechanisms I have at my disposal? If yes, write a detailed account, or record a detailed account, of whatever the thing is. But it turns out this is expensive, and it's not what you need. So over time, you optimize towards only taking down key events, and only noting key events, maybe with their interconnections, but fundamentally the key events.
Starting point is 00:35:40 As your data grows, as you have more stuff, as this still is important to your business and keeps being more important to it, or it doesn't even need to be a business, it can be social, it can be whatever thing it is, it becomes expensive again to retain all of those key events. So you turn them into numbers, and you can do actual math on them. And that's this path which you've seen again and again and again throughout humanity's history. Literally, as long as we have written records, this has played out again and again for every single field which humans actually cared about.
Starting point is 00:36:16 At different times; like, power networks are way ahead of this, but fundamentally, power networks work on metrics. But for transient load spikes and everything, they have logs built into their power measurement devices, but those are only few and far between, because the main thing is just metrics, time series. And you see this again and again. You also were a sysadmin and internet-related: all switches are, or have been, metrics-based, or metrics-first, for basically forever, for 20, 30 years. But that stands to reason, because the internet is running, by roughly 20 years, scale-wise, in front of the cloud, because obviously you need the internet, because else you
Starting point is 00:36:57 wouldn't be having a cloud. So all of those growing pains, why metrics are all of a sudden the thing, or have been for a few years now, is basically because people who were writing software, providing their own software services, hit the scaling limitations which you hit for internet service providers two decades, three decades ago. But fundamentally, you have this complete system: basically profiles or distributed tracing, depending on how you view distributed tracing. You can also argue that distributed tracing is key events which are linked to each other. Logs sit firmly in the key event thing, and then you turn this into numbers, and that is metrics. And that's basically it. You have extremes at the end where you can have valid, depending on your circumstances, engineering trade-offs of where you invest the most. But fundamentally, that is why those always appear again
Starting point is 00:37:55 in humanity's dealing with data, and observability is no different. I take a look at last month's AWS bill. Mine is pretty well optimized. It's a bit over 500 bucks. And right around 150 of that is various forms of logging and detecting change in the environment. And on the one hand, I sit here and I think, oh, I should optimize that
Starting point is 00:38:17 because the value of those logs to me is zero. Except that whenever I have to go in and diagnose something or respond to an incident or have some forensic exploration, they then are worth an awful lot. And I am prepared to pay 150 bucks a month for that because the potential value of having that when the time comes is going to be extraordinarily useful. And it basically just feels like a tax on top of what it is that I'm doing. The same thing happens with application observability, where, yeah, when you just want the big substantial stuff, yeah, until you're trying to diagnose something. But in some cases, yeah, okay, then crank up the verbosity and then look for it. But if you're trying to figure it
Starting point is 00:38:57 out after an event that isn't likely, or hopefully won't, recur, you're going to wish that you'd spent a little bit more on collecting data out of it. You're always going to be wrong. You're always going to be unhappy on some level. Ish. You could absolutely be optimizing this. I mean, for $500, it's probably not worth your time unless you take it as an exercise. But outside of due diligence, where you need specific logs or specific events tied to specific times, I would argue that a lot of the problem with logs is just dealing with them wrong. You have this one extreme of full-text indexing everything,
Starting point is 00:39:35 and you have this other extreme of a data lake, which is just a euphemism for never looking at the data again, to keep storage vendors happy. There is an in-between. Again, I'm biased, but, for example, with Loki you have those same label sets as you have on your metrics with Prometheus, and you have literally the same, which means you only index that part. And you only extract, at ingestion time, if you don't have structured logs yet, the metadata about whatever you care about, put it into your label set, and store this. And that's the only thing you index.
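To make the label-set point concrete, here is a minimal sketch of querying Loki the way you would query Prometheus by the same labels, assuming a Loki instance on localhost:3100; the app and env labels are illustrative. Only the matchers inside the braces touch the index; the |= filter scans the stored, unindexed log lines.

```python
# Minimal sketch: query Loki's HTTP API with a LogQL label selector.
import time

import requests

LOKI_URL = "http://localhost:3100/loki/api/v1/query_range"  # assumed local Loki
now_ns = int(time.time() * 1e9)  # Loki expects nanosecond epoch timestamps

params = {
    "query": '{app="checkout", env="prod"} |= "HTTP 500"',  # labels hit the index
    "start": now_ns - 3600 * 10**9,  # last hour
    "end": now_ns,
}

resp = requests.get(LOKI_URL, params=params, timeout=10)
resp.raise_for_status()

for stream in resp.json()["data"]["result"]:
    # "stream" holds the indexed label set; "values" holds raw [ts, line] pairs.
    print(stream["stream"], len(stream["values"]), "matching lines")
```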
Starting point is 00:40:05 But this goes further than just this. You can also turn those logs into metrics. And to me, this is a path of optimization. Previously, I logged this and that error. Okay, fine, but it's just a log line telling me it's HTTP 500. No one cares that this is at this precise time. Log levels are also basically an anti-pattern
Starting point is 00:40:30 because they're just trying to deal with the amount of data which I have, and try and get a handle on that level. Whereas it would be much easier if I just counted: every time I have an HTTP 500, I just up my counter by one, and again and again and again. And all of a sudden, I have literally, and I did the math on this, over 99.8%
Starting point is 00:40:54 of the data which I have to store just goes away. It's just magicked away, and we're only talking about the first time I'm hitting this log line. The second time I'm hitting this log line is functionally free if I turn this into metrics. It becomes cheap enough that, and this is one of the mantras which I have if you need to onboard your developers on modern observability, blah, blah, blah, the whole bells and whistles: usually, people have
Starting point is 00:41:19 logs. Like, that's what they have; unless they were from ISPs or power companies or so, they usually start with metrics. But most users, which I see both with my Grafana and with my Prometheus head on, tend to start with logs. They have issues with those logs because they're basically unstructured and useless, and you need to first make them useful to some extent. But then you can leverage on this. And instead of having a debug statement, every single time you think, hey, maybe I should put a debug statement, just put a counter instead. In two months' time, see if it was worth it, or if you delete that line and just remove that counter. It's so much cheaper. You can just throw this on and just have it run for a week or a month or whatever time frame, and done.
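A minimal sketch of that "counter instead of a debug statement" pattern, using the official Python client library, prometheus_client; the metric name and the simulated failure path are stand-ins, not anything from the episode.

```python
# Minimal sketch: increment a labeled counter where a log line used to go.
import random
import time

from prometheus_client import Counter, start_http_server

HTTP_ERRORS = Counter(
    "app_http_errors_total",  # illustrative metric name
    "HTTP error responses, by status code",
    ["code"],
)

def handle_request():
    if random.random() < 0.01:  # stand-in for a real failure path
        # Instead of logging "HTTP 500 at <timestamp>", count it. The first
        # hit creates the series; every later hit is functionally free.
        HTTP_ERRORS.labels(code="500").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
        time.sleep(0.01)
```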
Starting point is 00:42:02 But it goes beyond this, because all of a sudden, if I can turn my logs into metrics properly, I can start rewriting my alerts on those metrics. I can actually persist those metrics and can more aggressively throw my logs away. But also, I have this transition made a lot easier, where I don't have this huge lift where this day in three months is the big cutover and we're going to release the new version of this and that software, and it's not going to have that, it's going to have 80% less logs, and everything will be great. And then you miss the first maintenance window because someone is ill or what have you, and then
Starting point is 00:42:41 the next big Friday is coming, so you can't actually deploy there. I mean, Black Friday, but we can also talk about deploying on Fridays. But the thing is, you have this huge thing. Whereas if you have this as a continuous improvement process, I can just look at: this is the log which is coming out, I turn this into a number, I start emitting metrics directly, and I see that those numbers match. And so I can just start. I build new stuff, I put it into the new data format, I actually emit the new data format directly from my code instrumentation, and only then do I start removing the instrumentation for the logs. And that allows me to, with full confidence, with psychological safety, just move a lot more quickly, deliver
Starting point is 00:43:24 much more quickly and also cut down on my costs more quickly, deliver much more quickly, and also cut down on my costs more quickly because I'm just using more efficient data types. I really want to thank you for spending as much time as you have. If people want to learn more about how you view the world and figure out what other personal attacks
Starting point is 00:43:40 they can throw your way, where's the best place for them to find you? Personal attacks, probably Twitter. It's like the go-to place for this kind of thing. For actually tracking, I stopped maintaining my own website. Maybe I'll do again. But if you go on github.com slash richieh slash talks,
Starting point is 00:43:54 you'll find a reasonably up-to-date list of all the talks, interviews, presentations, panels, what have you, which I did over the last whatever amount of time. And we will, of course, put links to that in the show notes. Thanks again for your time. It's always appreciated. Thank you.
Starting point is 00:44:14 Richard Hartmann, Director of Community at Grafana Labs. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment. And then when someone else comes along with an insulting comment they want to add, we'll just increment the counter by one. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it
Starting point is 00:44:56 smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started. This has been a HumblePod production. Stay humble.
