Screaming in the Cloud - The Evolution of Cloud Services with Richard Hartmann
Episode Date: October 18, 2022
About Richard
Richard "RichiH" Hartmann is the Director of Community at Grafana Labs, Prometheus team member, OpenMetrics founder, OpenTelemetry member, CNCF Technical Advisory Group Observability chair, CNCF Technical Oversight Committee member, CNCF Governing Board member, and more. He also leads, organizes, or helps run various conferences from hundreds to 18,000 attendees, including KubeCon, PromCon, FOSDEM, DENOG, DebConf, and Chaos Communication Congress. In the past, he made mainframe databases work, ISP backbones run, kept the largest IRC network on Earth running, and designed and built a datacenter from scratch. Go through his talks, podcasts, interviews, and articles at https://github.com/RichiH/talks or follow him on Twitter at https://twitter.com/TwitchiH for musings on the intersection of technology and society.
Links Referenced:
Grafana Labs: https://grafana.com/
Twitter: https://twitter.com/TwitchiH
Richard Hartmann list of talks: https://github.com/richih/talks
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This episode is sponsored in part by our friends at AWS AppConfig.
Engineers love to solve and occasionally create problems,
but not when it's an on-call fire drill at four in the morning.
Software problems should drive innovation and collaboration, not stress and sleeplessness and threats of violence.
That's why so many developers are realizing the value of AWS AppConfig feature flags.
Feature flags let developers push code to production, but hide that feature from customers so that the developers can release
their feature when it's ready. This practice allows for safe, fast, and convenient software
development. You can seamlessly incorporate AppConfig feature flags into your AWS or cloud
environment and ship your features with excitement, not trepidation and fear. To get started, go to snark.cloud slash appconfig.
That's snark.cloud slash appconfig.
This episode is brought to us in part by our friends at Datadog.
Datadog's a SaaS monitoring and security platform
that enables full-stack observability for developers,
IT operations, security, and business
teams in the cloud age. Datadog's platform, along with 500-plus vendor integrations, allows you to
correlate metrics, traces, logs, and security signals across your applications, infrastructure,
and third-party services in a single pane of glass. Combine these with drag and drop dashboards and machine
learning based alerts to help teams troubleshoot and collaborate more effectively, prevent downtime
and enhance performance and reliability. Try Datadog in your environment today with a free
14 day trial and get a complimentary t-shirt when you install the agent. To learn more, visit datadoghq.com slash screaming in the cloud to get started.
That's www.datadoghq.com slash screaming in the cloud.
Welcome to Screaming in the Cloud. I'm Corey Quinn. There are an awful lot of people who are
incredibly good at understanding the ins and outs and the intricacies of the observability
world. But they didn't have time to come on the show today. Instead, I am talking to my dear friend
of two decades now, Richard Hartmann, better known on the internet as RichiH, who's the director
of community at Grafana Labs, here to suffer in a somewhat atypical departure for the theme of this
show, personal attacks for once. Richie, thank you for joining me. And thank you for agreeing
on personal attacks. Exactly. It was one of your riders: there have to be the personal attacks
back and forth or you refuse to appear on the show. You've been on before. In fact, the last
time we did a recording, I believe you were here in person, which was a long time ago. What have you been up to? You're still at Grafana Labs. And in many
cases, I would point out that, wow, you've been there for many years. That seems to be an atypical
thing, which is an American tech industry perspective. Because every time you and I talk
about this, you look at folks who, wow, you were only at that company for five years. What's wrong with you? You tend to take the longer view, and I
tend to have the fast twitch, time to go ahead and leave jobs, because it's been more than 20 minutes
approach. I see that you're continuing to live what you preach, though. How's it been?
Yeah, so there's a little bit of COVID brains, I think. When we talked in 2018, I was still working at SpaceNet, building a data center.
But the last two and a half years didn't really happen for many people, myself included.
So I guess that includes you.
No, no, you're right.
You've only been at Grafana Labs a couple of years.
One would think I would check the notes before shooting my mouth off, but then one wouldn't know me.
What notes?
Anyway, I've been around Prometheus
and Grafana since 2015, but like real full-time everything is 2020. There was something in between
since 2018, I contracted to do vulnerability handling and everything for Grafana Labs. Of
course, they had something and they didn't know how to deal with it. But no, the full time is 2020.
But as to the space in and of itself, it's maybe a little bit German of me, but trying
to understand the real world and trying to get an overview of systems and how they actually
work and if they are working correctly and as intended, and if not, how they're not working
as intended and how to fix this is something
which has always been super important to me, in part because I just want to understand
the world.
And this is a really, really good way to automate understanding of the world.
So it's basically a work-saving mechanism.
That's why I've been sticking to it for so long, I guess.
Back in the early days of monitoring systems, because we called it
monitoring back then, because
using simple words that lacked
nuance was sort of de rigueur
back then. We wound up
effectively having tools.
Nagios is the one that springs to mind
and it was terrible in all the ways
you would expect a tool written in
janky Perl in the early 2000s
to be. But it told you what was going
on. It tried to do a thing, generally reach a server or query it about things. And when things
fell out of certain specs, it screamed its head off, which meant that when you had things like
the core switch melting down, thinking of one very particular incident, you didn't get a Nagios
alert. You got 4,000 Nagios alerts. But start to finish, you could wrap your head rather
fully around what Nagios did and why it did the sometimes strange things that it did.
These days, when you take a look at Prometheus, which we hear a lot about, particularly in the
Kubernetes space, and Grafana, which is often mentioned in the same breath, it's never been
quite clear to me exactly where those start and stop. It always
feels like it's a component in a larger system to tell you what's going on, rather than a one-stop
shop that's going to, you know, shriek its head off when something breaks in the middle of the night.
Is that the right way to think about it? The wrong way to think about it?
It's a way to think about it. So personally, I use the terms monitoring and observability
pretty much interchangeably. Observability is a relatively well-defined term, even though most people won't agree.
But if you look back into the 70s, into control theory, where the term is coming from, it is the measure of how much you're able to determine the internal state of a system by looking at its inputs and its outputs.
Depending on the definition, some people don't include the inputs, but that is the OG definition
as far as I'm aware.
And from this,
there flow a lot of things.
This question of,
or this interpretation of
the difference between
telling that yes,
something is broken
versus why something is broken.
Or if you can't ask new questions on the fly,
it's not observability.
Like all of those things
are fundamentally mapped to this definition of, I need enough data to determine the internal state of whatever system
I have just by looking at what is coming in, what is going out. And that is at the core the thing.
Now, obviously, it's become a buzzword, which is oftentimes the fate of successful things.
So it's become a buzzword,
and you end up with cargo culting. I would argue periodically that observability is hipster monitoring. If you call it monitoring, you get yelled at by Charity Majors, which is tongue-in-cheek,
but she has opinions made nonetheless, shall I say, frustrating by the fact that she is invariably correct in those opinions,
which just somehow makes it so much worse.
It would be easy to dismiss things she says
if she weren't always right.
And the world is changing,
especially as we get into the world of distributed systems:
"is the server that runs the app working or not working"
loses meaning when we're talking about distributed
systems, when we're talking about containers running on top of Kubernetes, which turns every
outage into a murder mystery. We start having distributed applications composed of microservices,
so you have no idea necessarily where an issue is. Okay, is this one microservice having an issue
related to the request coming into a completely separate microservice? And it seems that for those types of applications,
the answer has been tracing for a long time now,
where originally it was something that felt like it was sprung fully formed
from the forehead of some god known as one of the hyperscalers,
but now is available to basically everyone in theory.
In practice, it seems that instrumenting applications
is still one of the hardest parts of
all of this. I tried hooking up one of my own applications to be observed via OTel, the OpenTelemetry
project, and it turns out that right now, OTel and AWS Lambda have an intersection point
that makes everything extremely difficult to work with. It's not there yet. It's not baked yet.
And someday I hope that changes, because I would love to interchangeably just throw metrics and traces and logs to all the different observability
tools and see which ones work, which ones don't. But that still feels very far away from current
state of the art. Before we go there, maybe one thing which I don't fully agree with. You said
that previously you were told if a service was up or down, and that's the thing which you cared about.
And I don't think that's what people actually cared about.
At that time also, what they fundamentally cared about is: is the user-facing service up, down, or impacted?
Is it slow?
Does it return errors every X percent for requests?
Something like this.
Is the site up?
You're right.
I was hand-waving over a whole bunch of things.
It was, okay, first, the web server was turning a page.
Yes or no?
Great.
Can I ping the server?
Okay, well, there are ways a server can crash and still leave enough of the TCP/IP stack up
where it can respond to pings and do little else.
And then you start adding things to it.
The Nagios thing that I always wanted to add, and had to, was: is the disk full?
And that was annoying.
And on some level, like,
why should I care in the modern era how much stuff is on a disk? The storage is cheap and free and
plentiful. The problem is, after the third outage in a month because the disk filled up, you start
to not have a good answer for, well, why aren't you monitoring whether the disk is full? And that
was one of the contributors to taking down the server. When the website broke, there were what felt like a relatively small number of reasonably well-understood contributors to that at small to mid-sized applications, which is what I'm talking about, the only things that people would let me touch.
I wasn't running hyperscale stuff where you have a fleet of 10,000 web servers and is the server up?
Yeah, in that scenario, no one cares. But when we're talking about the database server and the two application servers and the four web servers talking to them,
you think about it more in terms of pets than you do cattle.
Yes, absolutely. Yet, I think there was a mistake back then, and I tried to do it differently.
As a specific example with the disk, and I'm absolutely agreeing that previous generation tools limit you
in how you can actually work with your data, in particular once you're working with metrics, where you can
do actual math on the data. It does not matter if the disk is almost full; it matters if that disk is
going to be full within X amount of time. If that disk is 98% full and it sits there at 98%
for 10 years and provides the service, no one cares. The thing is,
will it actually run out in the next two hours, in the next five hours, what have you? Depending on
this: is this currently or imminently customer impacting or user impacting? Then yes, alert on it,
raise hell, wake people, make them fix it. As opposed to: this thing can be dealt with during business
hours on the next workday, and you don't have to wake anyone.
Yeah, the big filer with massive amounts of storage has crossed the 70% line.
Okay, now it's time to start thinking about that.
What do you want to do?
Maybe it's time to order another shelf of disks for it, which is going to take some
time.
That's a radically different scenario than the 20 gigabyte root volume on your server
just started filling up dramatically. The rate
of change is such it'll be full in 20 minutes. Yeah, one of those is something you want to wake
people up for. Generally speaking, you don't want to wake people up for what is fundamentally a
longer-term strategic business problem. That can be sorted out in the light of day versus,
we're not going to be making money in two hours, so if I don't wake up and
fix this now... That's the kind of thing you generally want to be woken up for. Well, let's be honest, you
don't want that to happen at all, but if it does happen, you kind of want to know in advance rather
than after the fact. You're literally describing predict_linear from Prometheus, which is precisely
for this, where I can look back over X amount of time and make a linear prediction,
because everything else breaks down at scale, blah, blah, blah, too detailed.
But the thing is, I can draw a line with my pencil by hand on my data
and I can predict when this thing is going to hit. Which is obviously precisely correct
if I have a TLS certificate; it's a little bit more hand-wavy when it's a disk, but still,
you can look into the future and you say, what will be happening if current trends for the last
X amount of time continue in Y amount of time? And that's precisely the thing there where you
get this more powerful ability of doing math with your data.
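[Editor's note: for readers following along at home, here is a rough sketch of the kind of prediction Richard is describing, expressed with PromQL's predict_linear function and queried through the official Prometheus Go client. The Prometheus address, the node_exporter metric name, and the time windows are assumptions for illustration, not anything specified in the episode.]

```go
// Sketch: ask Prometheus "will this filesystem be full within four hours?"
// using the linear prediction Richard mentions (predict_linear).
// Endpoint, metric name, and time windows are illustrative assumptions.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatalf("creating Prometheus client: %v", err)
	}
	promAPI := v1.NewAPI(client)

	// Look back over the last six hours of free-space samples and extrapolate
	// four hours into the future; anything predicted to drop below zero is a
	// disk that is about to fill up on current trends.
	query := `predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatalf("querying Prometheus: %v", err)
	}
	if len(warnings) > 0 {
		log.Printf("warnings: %v", warnings)
	}

	// Any series returned is a filesystem predicted to run out of space
	// within the prediction window.
	fmt.Println(result)
}
```

In practice you would more likely put that expression straight into an alerting rule rather than poll it from application code; the sketch just makes the "do math on your data" point concrete.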
See, when you say it like that, it sounds like it actually is a whole term of art,
where you're focusing on an in-depth field, where salaries are astronomical, whereas the tools that
I had to talk about this stuff back in the day made me sound like, effectively, the sysadmin
that I was grunting and pointing, that this is going to fill up. And that is how I thought about
it. And this is the challenge, where it's easy to think about these things
in narrow, defined contexts like that, but at scale, things break. Like the idea of anomaly
detection. Well, okay, great. If normally the CPU in these things is super bored and suddenly it
gets really busy, that's atypical. Maybe we should look into it, assuming that it has a challenge.
The problem is that that is a lot harder than it sounds, because there are so many factors that factor into it.
And as soon as you have something, quote unquote, intelligent,
making decisions on this,
it doesn't take too many false positives
before you start ignoring everything it has to say
and missing legitimate things.
It's this weird and obnoxious conflation
of both hard technical problems and human psychology.
And the breaking up of old service boundaries. Of course, when you say microservices and such:
fundamentally, functionally, a microservice or nanoservice or picoservice (and the pendulum is
already swinging back to larger units of complexity). But it fundamentally does not make any difference
if I have a monolith on some mainframe
or if I have a bunch of microservices.
Yes, I can scale differently.
I can scale horizontally a lot more easily.
Vertically, it's a little bit harder, blah, blah, blah.
But fundamentally, the logic and the complexity
which is being packaged is fundamentally the same.
More users, everything, but it is fundamentally the same.
What's happening again and again and again is I'm breaking up those old boundaries,
which means the old tools which have assumptions built in about certain aspects
of how I can actually get an overview of a system just start breaking down.
When my complexity unit, or my service, or what have I, is usually congruent with a
physical piece of hardware, or several services are congruent with that piece of hardware, it
absolutely makes sense to think about things in terms of this one physical server. The fact that
you have different considerations in cloud and microservices and blah, blah, blah is not inherently
that it is more complex.
On the contrary, it is fundamentally the same thing.
It scales with users and everything, but it is fundamentally the same thing.
But I have different boundaries of where I put interfaces onto my complexity,
which basically allow me to hide all of this complexity from the downstream users.
That's part of the challenge that I think we're grappling with
across this entire industry from start to finish, where we originally looked at these things and
could reason about it because it's the computer and I know how those things work. Well, kind of,
but okay, sure. But then we start layering levels of complexity on top of layers of complexity on
top of layers of complexity. And suddenly when things stop working the way that we expect, it can be very challenging to
unpack and understand why. One of the ways I got into this whole space was understanding,
to some degree, of how system calls work, of how the kernel wound up interacting with user space,
about how Linux systems worked from start to finish. And these days, that isn't particularly necessary
most of the time for the care and feeding of applications.
The challenge is when things start breaking,
suddenly having that in my back pocket to pull out
could be extremely handy.
But I don't think it's nearly as central as it once was,
and I don't know that I would necessarily advise
someone new to the space to spend a few years as a systems person digging into a lot of those aspects. And this is why you
need to know what inodes are and how they work. Not really, not anymore. It's not front and center
the way that it once was in most environments, at least in the world that I live in. Agree? Disagree?
Agreed, but it's very much unsurprising.
You probably can't tell me how to precisely grow sugar cane or corn.
You can't tell me how to refine the sugar out of it,
but you can absolutely bake a cake.
But you will not be able to tell me even a third of (and, for the record, I'm also not able to tell you even a third about) the supply chain, which just goes from,
I have a field and some seeds,
and I need to have
a package of refined sugar. You're absolutely unable to do any of this. The thing is, you've
been part of the previous generation of infrastructure, or you know how this underlying
infrastructure works. So you have more ability to reason about this, but it's not needed for
cloud services nearly as much. You need different types of
skill sets, but that doesn't mean the old skill set is completely useless, at least not as of
right now. It's much more a case of you need fewer of those people and you need them in different
places because those things have become infrastructure, which is basically the cloud
play where a lot of this is just becoming infrastructure more and more. Oh yeah, back then I distinctly remember
my elders looking down their noses at me because I didn't know assembly. And how could I possibly
consider myself a competent systems admin if I didn't at least have a working knowledge of
assembly, or at least C, which I over time learned enough about to know that I didn't want to be a C
programmer. And you're right, this is the value of cloud. I mean, going back to those days, getting a web server up and running just to compile
Apache's HTTPD took a week and an in-depth knowledge of GCC flags. And then in time,
oh great, we're going to have RPM or DEBs. Great. Okay. Then in time you have apt if you're in the
DEB land, because I know you are a Debian developer. But over in Red Hat land, we had Yum and other tools. And then in time, it became, oh, we can just use something
like Puppet or Chef to wind up ensuring that that thing is installed. And then, oh, just Docker run.
And now it's a checkbox in a web console for S3. These things get easier with time. And step by
step by step, we're standing on the shoulders of giants.
Even in the last 10 years of my career, I used to have a great challenge question that I would interview people with.
Do you know what tiny URL is?
It takes a short URL and then expands it to a longer one.
Great.
On the whiteboard, tell me how you'd implement that.
You could go up one side and down the other, and then you can add constraints: multiple
data centers.
Now one goes offline.
How do you not lose data, et cetera, et cetera.
But these days, there are so many ways to do that using cloud services that it almost becomes trivial.
Okay, multiple data centers, API gateway, a Lambda and a global DynamoDB table.
Now what?
Well, now it gets slow.
Why is it getting slow?
Well, in that scenario, probably
because of something underlying the cloud provider. So now you lose an entire AWS region. How do you
handle that? Seems to me when that happens, the entire internet's kind of broken. Do people really
need longer URLs? And that is a valid answer in many cases. The question doesn't really work
without a whole bunch of additional constraints that make it sound fake. And that's not a weakness. That is the fact that computers and cloud services
have never been as accessible as they are now. And that's a win for everyone.
There's one aspect of accessibility which is actually decreasing, or two. A, you need to pay
for them on an ongoing basis, and B, you need an internet connection which is suitably fast, low latency, what have you.
And those are things which actually do make things harder
for a variety of reasons.
If I look at our backend systems, as in Grafana,
all of them have single binary modes
where you literally compile everything into a single binary
and you can run it on your laptop.
Of course, if you're stuck on a plane,
you can't do any work on it.
That kind of is not the best of situations. And if you have a huge CI/CD pipeline and everything, and it's cloud and
fine and dandy, but your internet breaks. Yeah, so I do agree that it is becoming generally more
accessible. I disagree that it is becoming more accessible along all possible axes.
I would agree. There is a silver lining to that as well,
where yes, they are fraught and dangerous,
and I would preface this with a whole bunch of warnings.
But from a cost perspective,
all of the cloud providers do have a free tier offering
where you can kick the tires on a lot of these things
in return for no money.
Surprisingly, the best one of those is Oracle Cloud,
where they have an unlimited free tier,
use whatever you want in this subset of services, and you will never be charged a dime.
As opposed to the AWS model of free tier, where, well, okay, it suddenly got very popular, or you misconfigured something, and surprise, you now owe us enough money to buy Belize.
That doesn't usually lead to a great customer experience. But you're right. You can't get away from needing an internet connection
of at least some level of stability and throughput
in order for a lot of these things to work.
The stuff you would do locally on a Raspberry Pi,
for example, if you're budget constrained
and want to get something out here, or your laptop.
Great.
That's not going to work in the same way
as a full-on cloud service will.
It's not free unless you have hard guarantees that you're not going to ever pay anything. It's fine to send warnings. It's
fine to switch the thing off. It's fine to have you hit random hard and soft quotas.
It is not a free service if you can't guarantee that it is free.
I agree with you. I think that there needs to be a free offering where, well, okay, you want us to
suddenly stop serving traffic to the world? Yes. When the alternative is you have to start
charging me through the nose? Yes. I want you to stop serving traffic. That is definitionally what
it says on the tin. And as an independent learner, that is what I want. Conversely, if I'm an
enterprise, yeah, I don't care about money.
We're running our Super Bowl ad right now. So whatever you do, don't stop serving traffic,
charge us all the money. And there's been a lot of hand-wringing about, well, how do we figure out
which direction to go in? And it's, have you considered asking the customer? So on a scale
of one to bank, how serious is this account going to be? What are your big concerns? Never charge me or never go down, because
we can build for either of those.
Just let's make sure that
those expectations are aligned.
Because if you guess, you're going to get it wrong,
and then no one's going to like you.
I would argue that all
those services from all cloud providers
actually build to address both of those.
It's a deliberate choice not to offer
certain aspects.
Absolutely. When I talk to AWS, like, yeah, but there is an eventual consistency challenge in
the billing system where it takes, as anyone who's looked at the billing system can see,
multiple days sometimes for usage data to show up. So how would we be able to stop things if
usage starts climbing? To which my relatively direct response is,
that sounds like a you problem. I don't know how you'd fix that, but I do know that if suddenly you decide as a matter of policy to, okay, if you're in the free tier, we will not charge you,
or even we will not charge you more than $20 a month, so you build yourself some headroom,
great. And anything that people are able to spin up, well, you're just going to have to eat the cost as a provider.
I somehow suspect that that would get fixed super quickly
if that were the constraint.
The fact that it isn't is a conscious choice.
Absolutely.
And the reason I'm so passionate about this,
about the free space,
is not because I want to get a bunch of things for free.
I assure you, I do not.
I mean,
I spend my life fixing AWS bills and looking at AWS pricing, and my argument is very rarely it's too expensive. It's that the billing dimension is hard to predict or doesn't align with a customer's
experience or prices a service out of a bunch of use cases where it'll be great. But very rarely
do I just sit here shaking my fist and saying it costs too much.
The problem is, when you scare the living crap out of a student with a surprise bill that's more than their entire college tuition, even if you waive it a week or so later, do you think they're
ever going to be as excited as they once were to go and use cloud services and build things for
themselves and see what's possible? I mean, you and I met on IRC 20 years ago because back in those days,
the failure mode and the risk financially was extremely low. It's, yeah, the biggest concern
that I had back then when I was doing some of my Linux experimentation is if I type the wrong thing,
I'm going to break my laptop. And yeah, that happened once or twice. And I learned not to
make those same kinds of mistakes or put guardrails in. So the blast radius was smaller.
Use a remote system instead. Yeah, someone else's computer that I can destroy. Wonderful.
But that was how we live and we learn. As we were coming up, there was never an opportunity for us,
to my understanding, to wind up accidentally running up an eight-million-dollar charge.
absolutely and psychological safety is one of the most important things in what most people do.
We are social animals.
Without this psychological safety, you're not going to have long-term self-sustaining
groups.
You will not make someone really excited about it.
There's two basic ways to sell.
Trust or force.
Those are the two ones.
There's none else.
Managing shards.
Maintenance windows.
Overprovisioning.
ElastiCache bills.
I know, I know.
It's a spooky season and you're already shaking.
It's time for caching to be simpler.
Momento Serverless Cache lets you forget the back end to focus on good code and great user
experiences. With true auto-scaling and a pay-per-use pricing model, it makes caching easy,
no matter your cloud provider. Get going for free at gomomento.co slash screaming. That's go M-O-M-E-N-T-O dot C-O slash screaming. Yeah, but it just also looks
ridiculous. I was talking to someone somewhat recently who was used to spending four bucks a
month on their AWS bill for some S3 stuff. Great. Good for them. That's awesome. Their credentials
got compromised. Yes, that is on them to some extent. Okay, great. But now, after six days, they were told that they owed $360,000 to AWS.
And I don't know how, as a cloud company, you can sit there and ask a student to do that.
That is not a realistic thing.
They're what is known in the United States, at least, in the world of civil litigation,
as quote-unquote judgment-proof, which means, great, you could wind up finding that someone owes you $20 billion.
Most of the time, they don't have that, so you're not able to recoup it. Yeah, the judgment feels
good, but you're never going to see it. That's the problem with something like that. It's, yeah,
I would declare bankruptcy long before, as a student, I wound up paying that kind of money.
And I don't hear any stories about them releasing the collection agency hounds against people in
that scenario. But I wouldn't guarantee that. I would never urge someone to ignore that bill and
see what happens. And it's such an off-putting thing that, from my perspective, is beneath
the company. And let's be clear, I see this behavior
at times on Google Cloud, and I see it on Azure as well. This is not something that is unique to AWS,
but they are the 800-pound gorilla in the space, and that's important. Whereas, just to mention
right now, because I was about to give you crap for this too, but if I go to Grafana.com, it says,
and I quote, play around with the Grafana
stack. Experience Grafana for yourself. No registration or installation needed. Good.
I was about to yell at you if it's, oh, just give us your credit card and go ahead and start
spinning things up and we won't charge you. Honest. Even your free account does not require
a credit card. You're doing it right. That tells me that I'm not going to get a giant surprise bill.
You have no idea how much thought and work went into our free offering.
There was a lot of math involved.
None of this is easy.
I want to be very clear on that.
Pricing is one of the hardest things to get right, especially in cloud.
And it also, when you get it right, it doesn't look like it was that hard for you to do.
But I fix people's AWS bills for a living.
And still, five or six years in, one of the hardest things I still wrestle with is pricing engagements.
It's incredibly nuanced, incredibly challenging.
And at least for services in the cloud space where you're doing usage-based billing, that becomes a problem.
But glancing at your pricing page, you do hit the two things that are incredibly important to me.
The first one is: use something for free; as an added bonus,
you can use it forever. And I can get started with it right now.
Great.
When I go and look at your pricing page or I want to use your product and it tells me to click here to contact us,
that tells me it's an enterprise sales cycle, it's going to be really expensive and I'm not solving my problem tonight. Whereas
the other side of it, the enterprise offering needs to be contact us and you do that. That
speaks to the enterprise procurement people who don't know how to sign a check that doesn't have
two commas in it and they want to have custom terms and all the rest and they're prepared to
pay for that. If you don't have that, you look too small time.
It doesn't matter what price you put on it,
you wind up offering your enterprise tier
at some large number.
Yeah, for some companies, that's a small number.
You don't necessarily want to back yourself in
depending upon what the specific needs are.
You've gotten it right.
Every common criticism that I have about pricing, you folks
have gotten right. And I definitely can pick up on your fingerprints on a lot of this, because it
sounds like a weird thing to say of, well, he's the director of community. Why would he weigh in
on pricing? I don't think you understand what community is when you ask that question.
Yes, I fully agree it's super important
to get pricing right, or to get many things right. And usually the things which just feel naturally
correct are the ones which took the most effort and the most time and everything. And yes, at least
from the... like, I was in those conversations, or part of them, and the one thing which was always
clear is: when we say it's free, it must be free.
When we say it is forever free, it must be forever free.
No games, no lies.
Do what you say and say what you do, basically.
We have things where initially you get certain pro features
and you can keep paying and you can keep using them
or after X amount of time, they go away.
Things like these are built in because that's what people want.
They want to play around with the whole thing and see, hey, is this actually providing me value?
Do I want to pay for this feature, which is nice?
Or this and that plugin or what have you.
And yeah, you're also absolutely right that once you leave these constraints of basically self-serve cloud, you are talking about bespoke deals,
but you're also talking about,
okay, let's sit down.
Let's actually understand what your business is.
What are your business problems?
What are you going to solve today?
What are you trying to solve tomorrow?
Let us find a way of actually supporting you
and invest into a mutual partnership
and not just grab the money and run.
We have extremely low churn for, I would say, pretty good reasons
because this thing about our users, our customers being successful,
we do take extremely seriously.
It's one of those areas that I just can't shake the feeling
is underappreciated industry-wide.
And the reason I say that this is your fingerprints on it
is because if this had been wrong,
you have a lot of,
we'll call them idiosyncrasies,
where there are certain things
you absolutely will not stand for
and misleading people and tricking them into paying money
is high on that list.
One of the reasons we're friends.
So yeah, that's why I say I see your fingerprints on this.
It's, yeah, if this hadn't been worked out the way that it is,
you would not still be there.
One other thing that I wanted to call out about,
well, I guess it's a confluence of pricing
and logging and the rest.
I look at your free tier
and it offers up to 50 gigabytes of ingest a month.
And it's easy for me to sit here
and compare that to other services,
other tools and other logging stories.
And then I have to stop and think for a minute that, yeah, disks have gotten way bigger,
and internet connections have gotten way faster, and even the logs have gotten way wordier.
I still am not sure that most people can really contextualize just how much logging fits into 50 gigs of data. Do you have any, I guess, ballpark
examples of what that looks like? Because it's been long enough since I've been playing in
these waters that I can't really contextualize it anymore. The Lord of the Rings is roughly five
megabytes (it's actually less), so we are talking literally 10,000 Lord of the Rings which you can just shove
at us, and we're just storing this for you. Which also tells you that you're not going to
be reading any of this. Or some of it, yes, but not all of it. You need better tooling, and you need
proper tooling. And some of this is more modern; some of this is where we actually pushed
the state of the art, but I'm also biased.
But I, for myself, do claim that we did push the state of the art here.
But at the same time, you come back to those absolute fundamentals of how humans deal with data.
If you look back as far, basically, as far as we have writing (literally 6,000 years ago is the oldest writing), humans have always dealt with information, with the state of the world, in very specific ways. A: is it important enough to even write it down,
to even persist it in whatever persistence mechanisms I have at my disposal? If yes,
write a detailed account, or record a detailed account, of whatever the thing is.
But it turns out this is expensive and it's not what you need.
So over time, you optimize towards only taking down key events and only noting key events,
maybe with their interconnections, but fundamentally the key events.
As your data grows, as you have more stuff, as this still is important to your business and keeps
being more important to it (it doesn't even need to be a business; it can be social, it can be whatever
thing it is), it becomes expensive again to retain all of those key events.
So you turn them into numbers, and you can do actual math on them. And that's this path
which you've seen again and again and again and again throughout
humanity's history.
Literally, as long as we have written records, this has played out again and again and again
and again for every single field which humans actually cared about.
At different times, like power networks are way ahead of this, but fundamentally power
networks work on metrics.
But for transient load spikes and
everything, they have logs built into their power measurement devices, but those are only few and
far between, because the main thing is just metrics, time series. And you see this again and again. You
also were a sysadmin and internet-related: all switches have been metrics-based, or metrics-first, for basically
forever, for 20, 30 years. But that stands to reason, because the internet is running roughly 20
years, scale-wise, in front of the cloud, because obviously you need the internet, because else you
wouldn't be having a cloud. So all of those growing pains, why metrics are all of a sudden the thing, or have been for a few years now, is basically because people who were writing software, providing their own software services, hit the scaling limitations which you hit for internet service providers two decades, three decades ago.
But fundamentally, you have this complete system: basically profiles or distributed tracing, depending on how you view distributed tracing.
You can also argue that distributed tracing is key events which are linked to each other.
Logs sit firmly in the key event thing, and then you turn this into numbers, and that is metrics.
And that's basically it. You have extremes at the end where you can have valid,
depending on your circumstances,
engineering trade-offs of where you invest the most.
But fundamentally, that is why those always appear again
in humanity's dealing with data, and observability is no different.
I take a look at last month's AWS bill.
Mine is pretty well optimized.
It's a bit over 500 bucks.
And right around 150 of that is various forms of logging
and detecting change in the environment.
And on the one hand, I sit here and I think,
oh, I should optimize that
because the value of those logs to me is zero.
Except that whenever I have to go in and diagnose something or respond to an incident or
have some forensic exploration, they then are worth an awful lot. And I am prepared to pay
150 bucks a month for that because the potential value of having that when the time comes is going
to be extraordinarily useful. And it basically just feels like a tax on top of what it is that
I'm doing. The same thing happens with application observability, where, yeah, when you just want
the big substantial stuff, yeah, until you're trying to diagnose something. But in some cases,
yeah, okay, then crank up the verbosity and then look for it. But if you're trying to figure it
out after an event that isn't likely or hopefully won't recur, you're going to wish that you'd spent
a little bit more on collecting data out of it.
You're always going to be wrong. You're always going to be unhappy on some level.
Ish. You could absolutely be optimizing this. I mean, for $500, it's probably not worth your time unless you take it as an exercise. But outside of due diligence, where you need specific
logs tied to or specific events tied to specific times,
I would argue that a lot of the problems with logs
is just dealing with it wrong.
You have this one extreme of full-text indexing everything,
and you have this other extreme of a data lake,
which is just a euphemism of never looking at the data again,
to keep storage vendors happy.
There is an in-between. Again, I'm biased,
but, like, for example with Loki, you have those same label sets as you have on your metrics
with Prometheus, and you have literally the same, which means you only index that part, and you only
extract on ingestion time. If you don't have structured logs yet, only put the metadata about
whatever you care about, extracted, and put it into your label set and store this.
And that's the only thing you index.
But this goes further than just this.
You can also turn those logs into metrics.
And to me, this is a path of optimization.
Previously, I logged this and that error.
Okay, fine, but it's just a log line telling me it's HTTP 500.
No one cares that this is at this precise time.
Log levels are also basically an anti-pattern
because they're just trying to deal with the amount of data which I have
and try and get a handle on that level.
Whereas it would be much easier if I just counted every time I have an HTTP 500:
I just up my counter by one,
and again and again and again,
and all of a sudden I have,
literally, and I did the math on this,
over 99.8%
of the data which I
have to store just goes
away. It's
just magicked away. And we're only talking
about the first time I'm hitting this log line;
the second time I'm hitting this log line is functionally free if I turn this into metrics.
It becomes cheap enough that one of the mantras which I have, if you need to onboard your developers
on modern observability, blah, blah, blah, the whole bells and whistles: usually people have
logs. Like, that's what they have, unless they were from ISPs or power companies or so.
They usually start with metrics.
But most users, which I see both with my Grafana and with my Prometheus hat on, tend to start with logs.
They have issues with those logs because they're basically unstructured and useless.
And you need to first make them useful to some extent.
But then you can leverage on this.
And instead of having a debug statement, just put a counter: every single time you think, hey, maybe I should put a debug statement, just put a counter instead. In two months' time, see if it was worth it, or if you delete that line and just remove that
counter. It's so much cheaper. You can just throw this on and just have it run for a week or a month
or whatever time frame, and done.
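[Editor's note: a minimal sketch of the pattern Richard describes here, using the Prometheus Go client. The metric name and the handler are made up for illustration; the point is simply incrementing a counter where a log line would otherwise go.]

```go
// Sketch: instead of logging "HTTP 500 happened" as a line of text every time,
// increment a counter and let Prometheus scrape it from /metrics.
// The metric name and the doWork helper are illustrative assumptions.
package main

import (
	"errors"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// One number instead of one log line per failure.
var internalErrors = promauto.NewCounter(prometheus.CounterOpts{
	Name: "myapp_http_internal_errors_total",
	Help: "Total number of requests that ended in an HTTP 500.",
})

func doWork(r *http.Request) error {
	// Stand-in for real application logic.
	if r.URL.Query().Get("fail") != "" {
		return errors.New("simulated failure")
	}
	return nil
}

func handler(w http.ResponseWriter, r *http.Request) {
	if err := doWork(r); err != nil {
		internalErrors.Inc() // where a debug or error log line would have gone
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	http.Handle("/metrics", promhttp.Handler()) // Prometheus scrapes this endpoint
	http.ListenAndServe(":8080", nil)
}
```

Two months later, as he says, either the counter has earned its keep or you delete both lines.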
But it goes beyond this, because all of a sudden, if I can turn my logs into metrics properly, I can start rewriting my
alerts on those metrics. I can actually persist those metrics and can more aggressively throw
my logs away. But also, I have this transition made a lot easier, where I don't have this huge lift
where this day in three months is the big
cutover and we're going to release the new version of this and that software, and it's not
going to have that, it's going to have 80% less logs, and everything will be great. And then you
miss the first maintenance window because someone is ill or what have you, and then
the next big Friday is coming, so you can't actually deploy there.
I mean, Black Friday, but we can also talk about deploying on Fridays. But the thing is,
you have this huge thing. Whereas if you have this as a continuous improvement process,
I can just look at: this is the log which is coming out, I turn this into a number, I start
emitting metrics directly, and I see that those numbers match. And so I can just start: I build new stuff,
I put it into the new data format, I actually emit the new data format directly from my code
instrumentation, and only then do I start removing the instrumentation for the logs. And that allows
me to, with full confidence, with psychological safety, just move a lot more quickly, deliver
much more quickly,
and also cut down on my costs more quickly,
because I'm just using more efficient data types.
I really want to thank you
for spending as much time as you have.
If people want to learn more
about how you view the world
and figure out what other personal attacks
they can throw your way,
where's the best place for them to find you?
Personal attacks, probably Twitter.
It's like the go-to place for this kind of thing.
For actually tracking,
I stopped maintaining my own website.
Maybe I'll do that again.
But if you go on github.com slash richieh slash talks,
you'll find a reasonably up-to-date list
of all the talks, interviews, presentations, panels,
what have you,
which I did over the last whatever amount of time.
And we will, of course, put links to that in the show notes.
Thanks again for your time.
It's always appreciated.
Thank you.
Richard Hartmann, Director of Community at Grafana Labs.
I'm cloud economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star
review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a
five-star review on your podcast platform of choice, along with an insulting comment. And then
when someone else comes along with an insulting comment they want to add, we'll just increment
the counter by one. If your AWS bill keeps rising and your blood pressure is doing the same,
then you need the Duckbill Group. We help companies fix their AWS bill by making it
smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business
and we get to the point.
Visit duckbillgroup.com to get started.
This has been a HumblePod production.
Stay humble.