The Changelog: Software Development, Open Source - Gerhard goes to KubeCon (part 2) (Interview)
Episode Date: December 27, 2019. Gerhard is back for part two of our interviews at KubeCon 2019. Join him as he goes deep on Prometheus with Björn Rabenstein, Ben Kochie, and Frederic Branczyk... Grafana with Tom Wilkie and Ed Welch... and Crossplane with Jared Watts, Marques Johansson, and Dan Mangum. Don't miss part one with Bryan Liles, Priyanka Sharma, Natasha Woods, & Alexis Richardson.
Transcript
Bandwidth for ChangeLog is provided by Fastly.
Learn more at Fastly.com.
We move fast and fix things here at ChangeLog because of Rollbar.
Check them out at Rollbar.com.
And we're hosted on Linode cloud servers.
Head to Linode.com slash ChangeLog.
This episode is brought to you by DigitalOcean.
DigitalOcean is the simplest cloud platform for developers and teams
with products like Droplets, Spaces, Kubernetes, load balancers,
block storage, and pre-built one-click apps. You can deploy, manage, and scale cloud applications
faster and more efficiently on DigitalOcean. Whether you're running one virtual machine or
10,000, DigitalOcean makes managing your infrastructure way too easy. Head to do.co slash changelog.
Again, do.co slash changelog.
Welcome back, everyone.
This is The Changelog,
a podcast featuring the hackers,
the leaders,
and the innovators in software.
I'm Jerod Santo,
managing editor of Changelog Media.
I hope you're enjoying
these last few days of 2019. This is our final episode of the year. Gerhard is back for part two of our
interview series from KubeCon. Join him for some deep, lengthy conversations on Prometheus,
Grafana, and Crossplane. Oh, and one last note before I pass the mic. If there's an interesting
topic or a great guest that you would like to hear on the show, let us know at changelog.com slash request. We would love to hear from you. That's it.
Enjoy. Today we have around this square table, rectangular table. We have Bjorn from Grafana.
We have Fred from Red Hat and we have Ben from GitLab. All of them are Prometheus contributors.
So this is going to be a technical discussion. We're going to mention a lot about cool things
about Prometheus and who would like to get us started? Sure, I'm Ben. I'm a Site Reliability
Engineer at GitLab. I've been contributing to the project for quite a number of years now. My focus is on getting developers and other systems to integrate with Prometheus.
So I don't work on the core code so much,
but I try and help people get their data into Prometheus
and then learn how to actually turn that into monitoring.
Sure, I'll go.
All right, my name is Bjorn.
I work at Grafana, but that's quite recent. I'm now,
fortunately enough, kind of a full-time Promethean. So my company pays me to contribute
to the project, and I also do internal Prometheus-related things. Previously,
until like half a year ago, I was at SoundCloud, where Prometheus had its cradle, as I like to say it. And there we kind of had other jobs.
We were like production engineers or site reliability engineers or something.
Ben was also there.
And we had to create Prometheus for doing our job as a tool.
But it was always like a side business, in a way.
It sounds kind of weird now that it's so popular.
Yeah, I'm Frederik.
I am an architect at Red Hat. I'm basically the architect
for everything observability.
And I happen to have started
with Prometheus in that space
roughly three and a half years ago.
Even
though it's been three and a half years, I think I'm
the most recent at this table to
have joined the Prometheus project.
Yeah.
And one thing which I'd like to add is that this year,
for the top contributor in the cloud native landscape,
the award went to Fred, right?
So I think Bjorn, you were mentioning earlier that Prometheus,
the contributors got awards several years in a row.
Every single year, one of the Prometheus contributors got some sort of an award.
There's like a streak going on here.
Is that right?
You might think it's like a political thing that we have to get an award, but I think we really have a bunch of awesome people.
I think Prometheus, looking at how it grew, right?
Everybody's looking at Kubernetes and everybody knows Kubernetes.
But Prometheus is also a graduated project in the CNCF.
And a lot of activities happening around Prometheus, around observability, around metrics.
I find that super interesting because it's not just about the platform. It's also all the other tooling that goes in the platform.
And Prometheus is one of the shining stars of the CNCF. We were the second
graduated project. There we go. We almost graduated first, but I guess Kubernetes had to take that. They are also a much bigger project, so there was way more effort. For us, it's kind of easy to graduate. But interestingly, I did this for a talk recently where I thought,
graduated, does that mean we're done?
It's kind of stabilized.
We just get maintenance PRs.
And CNCF has this DevStats tool.
It's a Grafana dashboard, shameless plug, where they can plot.
They just evaluate activities among companies, among contributors.
And you can just draw graphs of how actively this project is contributed to. And if you look at the Prometheus graph, it looks like from the moment of graduation you actually got more activity. It's probably smaller things that are not so visible, but a lot is going on in the Prometheus ecosystem.
Right. And you only just had PromCon not long ago. How was that? Like two weeks ago, one week ago?
It was very recent. Yeah, that was the second week of November. It was great. It's a very small community gathering. We were actually sad this year; we wanted to expand the size of it, but we just couldn't get a venue big enough that was available when we needed it. So yeah, it's a small, 220-person conference, and it's all talks about Prometheus and the development of what's going on, and people's stories and how they use Prometheus.
Tickets were highly sought after. It felt like a rock concert.
Yes. And I think even our live stream was well visited, right?
Yeah, I think we peaked at something about 80 people. The live stream was a little unreliable this year, but we'll hopefully do better next time. All the talks will get proper recordings on the website.
Yes.
Everybody can watch that.
I think what's super exciting about PromCon,
I believe all of us have been at every official PromCon.
I think there was one unofficial.
Oh, no, you.
I was at the first unofficial PromCon Zero.
Okay.
You were too, right?
It was at SoundCloud, mostly.
I mean, that was... We called it PromCon
when developers came together
to prepare the 1.0 release.
But then the real PromCon happened.
I was the first.
The most recent one.
I think what's really interesting about
how PromCon has evolved
over the last couple of years
is that in the
first two to three years,
I think it was very, very Prometheus development focused.
And this year, last year also already, we've seen this a lot,
that I think the entire community is kind of evolving
that Prometheus is a very stable project,
and we're now more demonstrating how it can be used in
extremely powerful ways. And I think that kind of reflects, in some way, the graduated status, because people can rely on it; that's why we're seeing all this adoption that is just incredible. I think also this ecosystem doesn't have a strict boundary. You have lots of projects that are not Prometheus projects, but they are closely related, and there are loads of integration points. It's open source, it's open community, and I think that really works well.
One thing which I really liked about Prometheus is this emerging standard of OpenMetrics. So it's less about a specific product,
it's more about a standard which people and vendors are starting to agree on. And I think
that is such an important moment when you have all these companies saying, you know what, Prometheus
is onto something. So how about we stop calling it the Prometheus exposition format and start calling it OpenMetrics.
Did you have any involvement with that?
Yeah, so I'm one of the people that started the Open Metrics project.
And, you know, as a site reliability engineer, I'm working with my developers to instrument their code and make it so that I can monitor it.
And I also have to work with a lot of vendor code.
And for a long, long time,
the only real proper standard has been SNMP.
But SNMP for a modern developer
is extremely clunky and really hard to use.
And it's not cloud native,
if we want to use the buzzword. And as an SRE, I don't actually care if vendors use Prometheus, but we need OpenMetrics as a modern standard to replace SNMP as the transport protocol of metric data.
And I really like how the metrics,
so open metrics, open telemetry,
which is a combination of open census
and open tracing.
Thank you very much, Fred.
So the combination of these two,
how does open metrics fit into open telemetry?
So OpenTelemetry... it comes from OpenTracing and OpenCensus. OpenCensus was this idea of creating a standard instrumentation library that handles both the tracing and the metrics and some of the logging pieces. And this is a really great idea, especially, like, you know,
from when I'm wearing my SRE hat,
is that you have a standard library for instrumenting your code.
And the OpenMetrics is just the way you can get,
or is what I think should be,
is the way you get the metric data out of OpenTelemetry.
And so it's just kind of the standardized interface
because the tracing interface is kind of still young and fast-moving
and it hasn't settled down.
But the Prometheus and OpenMetrics standard
is something that we want to see last for as long as SNMP has lasted.
SNMP has been around since the early 90s,
and it hasn't changed much,
and the data model is actually quite good
with being clunky and a little bit designed
around 16-bit CPUs and things like that.
But we want to see the Open Metrics transport format
be this long-term, stable thing that vendors can rely on
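To make that concrete, here is a minimal sketch of what a Prometheus-style scrape payload looks like; the metric names and values are invented for illustration, and OpenMetrics formalizes this text format (adding, among other things, a terminating # EOF marker):

    # HELP http_requests_total Total number of HTTP requests handled.
    # TYPE http_requests_total counter
    http_requests_total{method="get",code="200"} 1027
    http_requests_total{method="get",code="500"} 3
    # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
    # TYPE process_cpu_seconds_total counter
    process_cpu_seconds_total 12.47
    # EOF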
So we have metrics, and the story is really good. We have traces, and the distributed tracing story is really good as well. Where are logs, or events, as some like to call them? Where do they fit in, in this model?
And I'm looking at Bjorn because I know that Loki is this like up-and-coming
project. We'll be talking later with Tom about Loki and there's, I forget
his name, but he's the maintainer of Loki or the head behind Loki as Tom got him.
Actually, we have a bunch of people I can find working on Loki.
It's like a big deal, obviously.
But I don't even feel like I would do them justice if I were the one to tell it.
You should probably ask later.
I mean, you should take it from the other way around. Prometheus is often, like...
people see Prometheus, they realize it's like this hot thing that they should use.
They see all the success they have and then they try to shoehorn all their like observability use cases into Prometheus.
And then they start to use Prometheus for event logging.
And Prometheus is a really bad event logging system.
And that's a lot where we have to fight; we have to convince people that they shouldn't do this, even if they're angry at us. But then there's also the backlash the other way, where the log processing people try to solve everything. Yeah, I mean, we kind of have more of this inclusive picture, that you need all those tools and you need to combine them nicely.
And Loki has this idea where you take some parts of Prometheus, which is the service discovery and the labeling, and use the exact same thing for log collection. And then it's easy to connect the dots and jump from an alert with certain labels into the appropriate logs that you have collected.
It goes into that direction.
But I guess you will talk a lot about that with Tom.
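As a small illustration of that shared-labels idea (the label names here are hypothetical, not taken from the conversation), the same selector works on the metrics side in PromQL and on the logs side in Loki's LogQL:

    # Metrics side (PromQL): request rate for one workload
    rate(http_requests_total{namespace="prod", app="api"}[5m])

    # Logs side (LogQL in Loki): the same label selector, filtered for the word "error"
    {namespace="prod", app="api"} |= "error"

Because the labels come from the same service discovery, jumping from a firing alert to the matching log stream is just a matter of reusing its label set.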
Yes.
Actually, I'm a strong believer in connecting different signals via metadata.
Actually, Tom and I did a keynote at KubeCon Barcelona
about exactly this topic.
So I highly recommend people checking that out.
Okay.
Are the videos out yet from Barcelona?
Yeah.
Are they?
Cool.
It's not only him recommending himself.
I recommend that as well.
Right.
Okay.
Yeah.
And from the Prometheus project perspective,
I see it as with Prometheus,
we have a very specific focus
and we kind of follow a bit of the Unix philosophy of,
as an engineer, I want a tool that does one thing and does it well. And, you know, I look at some of these large monitoring platform things, and I see a lot of vendors that also combine monitoring and management into the same platform. And with Prometheus, we explicitly don't have any kind of management. We don't even have any templating in our configuration file,
because different organizations have completely different ideas on what they want for their
configuration management to look like. You know, you have things like Kubernetes and config maps and operators and that,
and then you might have another organization
that they're doing everything with a templating configuration management
like Chef or Ansible or one of those.
And so the layering approach to observability is really, really important to me because I want a really good logging system and I really want a really good metric system.
And I really want a good tracing analysis system and crash dump controls and profiles.
And to me, those are all different pieces of software
and I need to combine them.
And there's no one magic solution
that's going to solve all my problems all at once.
So I can see this idea of the building blocks
and having the right building blocks,
right being a very relative term in this context,
because right to me is different than right to you.
So this choice of selecting whichever building blocks are right for you
and combining them, again, whichever way is right for you,
and then you get this like almost everybody gets what they want
and yet the pieces exist that they can be combined in almost infinite ways.
So Prometheus has grown a lot.
Prometheus is like on a crazy trajectory right now from where I'm standing.
And I would like to zoom in a little bit in a shorter time span.
So, for example, the last six months, just to get a better appreciation of all the change that is happening in Prometheus. Let's focus on the last six months, the big
items that have been delivered and the impact they had on the project.
We should also say there are so many... we call each repository in the Prometheus GitHub org a project, and there are many projects. Alertmanager is probably something very famous, the node exporter is pretty active and big, and all those things.
But every project has new stuff going on.
And I think we should restrain ourselves to just the Prometheus server itself
because otherwise we could chat forever about all the new things.
Yeah, and actually a few of us have been discussing that
the Prometheus core code is really reasonably feature complete.
And it's not actually moving that fast.
We have lots of small changes that are still important.
But the speed of the project is actually in how many additional things that are connected to Prometheus are expanding. There's a large momentum around things that are being built around Prometheus, while Prometheus itself is largely stabilizing and optimizing.
Yeah.
And then, yeah, I mean, should we talk about something new?
Of course, now that you say stuff around Prometheus,
it was always a very hot topic
that Prometheus doesn't have this idea of having a
distributed clustered storage engine built in. And we always said that's somebody else's problem.
And then we provided an interface... I think it's still experimental, right? Officially.
Officially, yes, but it works.
Yeah. So we created this kind of experimental write interface,
and now we have dozens of vendors or open source projects
that integrate against this interface
where Prometheus can send out the metrics it has collected
to something out there.
And this has seen a lot of improvements recently.
I don't know.
Does one of you want to talk about details?
Actually, even commercial vendors, monitoring platform vendors,
are starting to accept Prometheus Remote Write as a way to get data into their observability stack.
I don't think any of us actually worked on these improvements, but
I think the most notable thing that happened in remote write was: previously,
whenever Prometheus scraped any samples, it immediately queued them up and tried to send
them to the remote storage. And this had various problems, one of which is we really just keep all these samples in memory until we send them off.
And so one of the dangers was, if the remote storage was down, we would continue to queue up all of this data in memory and potentially cause out-of-memory situations, for example. And so kind of the solution to this was
Prometheus has a write-ahead log
where the most recent data is written to
before it gets flushed into an immutable block of data.
And so instead of doing all of this in memory,
basically we use the write-ahead log
as a kind of persistent on-disk buffer
and that write-ahead log is tailed, and then we send the data off based on that. So I think this is one of those things where the feature actually hasn't changed at all in its functionality; just the implementation itself changed to be a lot more robust than it used to be. And I think that's really exciting,
and it kind of shows the details
that we're starting to focus on in Prometheus.
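For context, hooking a Prometheus server up to remote storage is a small piece of configuration; the snippet below is a minimal sketch with a placeholder endpoint URL, and the queue tuning values are only examples. The WAL-based buffering described above happens internally and needs no extra configuration:

    # prometheus.yml (excerpt)
    remote_write:
      - url: "https://metrics-store.example.com/api/v1/write"   # placeholder endpoint
        queue_config:
          capacity: 2500              # samples buffered per shard before reading more from the WAL
          max_shards: 200             # upper bound on parallel senders
          max_samples_per_send: 500   # batch size per request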
So for all those projects
that are being built around Prometheus,
it's very important,
it's becoming even more important for the core
to be more robust, to be more performant,
to be dependable, right?
So that it can support all those extension points and all that
growth.
Yeah, I guess if it's still experimental, you should do something about that. Yeah, let's see.
Should we talk about the flip side of that, the remote read?
Yeah, because that is the flip side of it. If you have a Prometheus server that has stored stuff away into remote storage, often those
remote storage providers
have their own query engine.
Sometimes they even support literally PromQL
and you can work on that.
But sometimes you just want your Prometheus server
to know about that data
that has been stowed away somewhere.
And there's the flip side of the remote write, which is remote read.
And that, yeah, I mean,
that's also kind of still experimental,
but there was a similar problem.
Who wants to take this from memory?
Should I go ahead?
It's actually, we're all not the domain experts in that, right?
So the problem there was that Prometheus runs a query
and then the query engine has to retrieve the data
and the API looked like it would essentially get all the samples that this query had to act on in one go. So the remote backend for that had to construct all those samples in memory on their side and then send it all over, and Prometheus had to receive it all on its own side, so it's all there. And that could have a huge impact on memory usage in that moment. I mean, that concretely happened: both parts would build up this huge amount of samples in memory, the backend would, and then Prometheus has to read it. And Prometheus has a really efficient way of storing time series data in blocks in its own storage, so the idea was to just stream the data. Streaming is anyway the hotness, right? It's all one stream; you don't have to build it all up first and then send it out.
I think it also reuses the exact block format of Prometheus.
Yeah, the big problem with the remote read was that we have all this compressed data on disk and in memory,
and the remote read would decompress it, serialize it, and then send it out over the wire completely uncompressed,
and it was using huge amounts of bandwidth.
Actually, was it taking it and then snappy compressing it,
if I remember correctly?
I believe so, yeah.
Yeah, so it would take a well-compressed time series block,
serialize it, and then recompress it with a generic compression,
and this was just kind of silly.
In hindsight, yeah.
In hindsight, yes.
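For completeness, the read side is wired up in a similar way; this is a minimal sketch with a placeholder URL, since the compression and streaming behaviour discussed here lives inside the protocol rather than in the configuration:

    # prometheus.yml (excerpt)
    remote_read:
      - url: "https://metrics-store.example.com/api/v1/read"   # placeholder endpoint
        read_recent: false   # let the local TSDB answer queries for data it still has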
Yeah, and this doesn't just benefit
the Prometheus server itself,
but basically this is, again,
there are a bunch of integrations
around Prometheus that benefit from this.
Yeah, but I think Thanos was,
that was a big deal for Thanos, this improvement.
Yes, because Thanos essentially sits next to a Prometheus server
and uses this API to read raw data from the time series database.
And so it was a big deal for this component to have this more efficient way of doing it
because Thanos itself had already this streaming approach.
So it loaded everything into memory and then sent it off in a streaming approach, and now it can actually make use of all of these things.
So why do you think that this remote write and remote read are becoming more and more important these days? I mean, is something happening with Prometheus? Is it
getting to a point where this
is becoming more important? Why is it an important thing now? As users of Prometheus grow,
they grow beyond the capacity of one Prometheus server. And Prometheus was designed from a background of distributed systems, and where Prometheus got its inspiration, we had hundreds or thousands of mini monitoring nodes, and each of these mini nodes would watch one specific task and keep track of one small piece of the puzzle. And as people grow their monitoring needs, they're running into the same exact problems, where a single monitoring server is not powerful enough to monitor a whole entire Kubernetes cluster with tens of thousands of pods, and multiple clusters that are geo-distributed. So they're running into the same problems. And being able to take Prometheus
and turn it into just the core of a bigger system
means that you need these in-and-out data streams
in order to make it the spokes of a full platform.
So that's another hint as to the popularity of Prometheus and the
use cases for Prometheus: the machines are not big enough to be able to run everything on one machine, so again, it got to the point where you need more than one. And what does that look like? So this is a story and a use case,
which is becoming more and more relevant.
So there was the remote write, the remote read,
important improvements in the last six months.
What other things are noteworthy?
I mean, it's actually a bit longer ago than six months
where we decided we'd go on a strict six-week cadence of releases.
Similar to Kubernetes, but they have a longer cadence.
Three months.
Go has this similar thing.
I mean, personally, my ideal is always you should just release when you have something to release.
And in the ideal world, that just works.
But in the real world, people just procrastinate, and we had seen this: nobody was bothering to release a new Prometheus server, and then we had way too many things piled up. So we just said, okay, every six weeks. And should we ever reach the point where we have a new release and nothing interesting has happened, we can reconsider that. But so far we have done this now for almost a year, I think.
Yeah. So we always get a release shepherd nominated ahead of time, and then you cut a release candidate, you tell the world that they should try it out, and then usually you get a fairly stable dot-zero release. Like, what is the current one, 2.14?
2.14.0. I think we didn't have a bug fix release for that one, right?
Yep, that worked. That was during PromCon, actually, where we released that, but that was just coincidence, because it's a strict six-week cadence, right?
Yeah. So every time there's something interesting happening... And since... yeah, releases go out. But we also have this all built in, like, the benchmarking tooling. Our internal benchmarks
are way better now, and it's all part of the procedure
to run benchmarks to see regressions.
We had a few of them in the past.
Nice, interesting new features, but also, sadly,
the new feature was that everything is a bit slower.
So that can't really happen anymore. Or it happens in an informed way, where we say,
okay, now we have, whatever, staleness handling,
and we accept that this has a tiny performance penalty.
So, yeah.
At least we can, because we have all of these tools,
we can do these things in a controlled way, right?
As opposed to realizing these things
after we've already released it
and users opening issues.
And one thing that personally for my organization
is really cool about the regular release schedule is
we know exactly when the next release candidate is going to be cut.
So the SRE team can plan canarying these kinds
of releases and contribute back
with issues and so on and I think that's
that's also for us as maintainers really powerful to get more consistent feedback
Do you see the adoption of new releases? Is there a way of seeing what the adoption is? And what I mean by that: maybe number of downloads, maybe something that will tell you, okay, the users are upgrading and they're running these new releases. Is there such a place that you have, maybe publicly available?
Yeah, there are counters for looking at how many downloads we get from the official releases. There's also how many people pull the Docker images, but we're not really paying attention to this. We're more focused on development than marketing numbers.
Do we have, like, GitHub download counters?
Yes.
Okay. But we mostly don't even pay attention to that.
But then also, of course,
some organizations wouldn't even download directly from GitHub.
They just download it into their own repository,
so you can never know.
We would need to put some phone-home mechanism into Prometheus,
and we're not doing that.
But Grafana has some usage tracking of their installed instances, and they also report back which data sources are being used by each Grafana instance. And every PromCon has a little lightning talk by some Grafana person telling us how many Grafana instances there are in the world that phone home, and how many of them have Prometheus as a data source. The Grafana growth is crazy, but the percentage of Grafana instances using Prometheus is also growing like crazy; it's like a second order of growth. And I think this year we hit the point where more than 50 percent of Grafana instances have a Prometheus data source.
That's mind-blowing.
Okay.
So releasing new versions, having the six-week cycle
when users can expect a new version to be cut,
a new version to be available.
Do you do anything about deprecating old versions
or stopping any support for older versions?
It's largely on an ad hoc basis.
If there is someone who is willing to backport a fix, I think we generally are open to cutting another patch release.
Sometimes we, as Red Hat, support older versions in our product, for example, and that's when we do those kinds of things.
I don't think we have a set schedule of when we don't support anything anymore,
but it generally doesn't happen too often.
It happens.
I mean, also, we are on major version 2,
and we have a few features listed as experimental
that can actually have breaking changes.
Breaking changes? It's getting hard. Third day of the conference.
Where you could not just seamlessly upgrade, but most features are not experimental. So
there's very few reasons for somebody to not go to the next minor release.
Sometimes we have like little storage optimizations
where we try, after some problems in the past,
where you couldn't go back from.
Once you have gone to the higher version
and the storage has used the new encoding version internally,
the older versions couldn't act on it.
And we are now doing things like where you have to switch it on
with a flag in the next minor release,
and then it becomes default, but you could still switch it off,
and then it becomes the only way of doing it or something.
It's very smooth, and I think rarely...
I mean, some companies have these very strict procedures
to whitelist a new version,
but in general, it's happening rarely that someone says,
I really still have to run Prometheus 2.12.
Could you please have this bug fix release for 2.12?
As a matter of fact, I don't remember the last time
we've done anything like this.
Yeah, the releases are always upgradable
within the major version.
So the incremental upgrade is completely seamless.
It's just drop in the
new version, restart, and away you go. There's been no real problem with upgrades.
Yeah. Interestingly, I also work on one of the projects that integrate around Prometheus, called the Prometheus Operator, and we actually test, to this day, upgrades from Prometheus 1.4, I believe, up until the latest version.
Amazing. Okay.
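For readers who haven't seen the Prometheus Operator, upgrades there are driven declaratively; a minimal, hypothetical custom resource looks roughly like this, and bumping the version field is what rolls the managed instances to a new release:

    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: example
    spec:
      replicas: 2
      version: v2.14.0            # bumping this rolls out a new Prometheus release
      serviceMonitorSelector: {}  # select which ServiceMonitors to scrape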
Yep. Should we find something else to talk about? We could talk about unit testing rules and alerts.
Alert testing is a big deal. I have discussed this actually quite often recently:
how you actually make sure that an alert will fire if you actually have an outage.
This is a big, arguably not quite solved problem,
but at least in Prometheus you can now unit test your rules,
recording rules as well as alerting rules.
It's all built into promtool,
this little command line tool
that's distributed alongside with the server.
And there's a little,
kind of a domain-specific language, if you want,
to formulate rules that you can write.
This is how the time series looks, and then I want this alert to fire in that way,
all those things.
I think they have a blog post on the project website.
Yeah, do we think we have a...
That's pretty cool.
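As a rough sketch of what such a unit test looks like (the rule, job, and label names are invented for illustration), promtool takes a YAML file describing synthetic series and the alerts you expect to fire:

    # alerts.yml -- the rule under test
    groups:
      - name: example
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 5m
            labels:
              severity: page
            annotations:
              summary: "Instance {{ $labels.instance }} is down"

    # tests.yml -- run with: promtool test rules tests.yml
    rule_files:
      - alerts.yml
    evaluation_interval: 1m
    tests:
      - interval: 1m
        input_series:
          - series: 'up{job="api", instance="host-1"}'
            values: '1 1 1 0 0 0 0 0 0 0'   # the target goes down after three minutes
        alert_rule_test:
          - eval_time: 10m
            alertname: InstanceDown
            exp_alerts:
              - exp_labels:
                  severity: page
                  job: api
                  instance: host-1
                exp_annotations:
                  summary: "Instance host-1 is down"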
Yeah, I think, again, this is one of those things
where it shows the maturity of the project and the ecosystem
that people don't only care about monitoring and alerting,
but they also care about actually testing their alerting rules.
So we talked about the big, noteworthy initiatives that have been delivered in the last six months, the most exciting stuff. What about the next six months? What do you have on your roadmap? Things which are worth mentioning?
I mean, we have a roadmap on the website, but it's almost obsolete, because I think most of the issues or items there have at least almost been implemented. So I think it's time for getting more into visionary things. But there are also some things very concretely happening. One thing, probably,
that will be really
visible. It's like a new UI for
the Prometheus server.
Some people just use Grafana
as their interface for
Prometheus, but originally when Prometheus
was created, there was no Grafana.
We actually had our own little dashboard builder, but
Prometheus was really meant to...
Why are you laughing?
Hey, I'm still a Promdash fan.
Okay, so it still has fans.
Stuart will like you now.
So whatever.
So we want to talk about the future.
So the UI on the Prometheus server was always very simplistic,
but I totally loved it.
It was my daily tool to work with.
But yeah, it kind of hasn't aged that well.
So we're replacing our handwritten JavaScript from 2013 or so
with a nice new React user interface.
And it's now in 2.14, and you can go give it a spin.
There's a button to click to try the new UI.
Okay.
This will give, like, a lot of...
I mean, this is essentially at the moment
just reconstructing all the features we have,
but this will allow, like, modern stuff,
like proper autocompletion and tooltips
and all those things.
That will be very easy to include.
You get a glimpse of it if you use the Grafana Explore view.
It's a lot of stuff, but that's all very much wired into Grafana.
And in the Prometheus UI, we try to get this in a more generic form,
and we also want to be able to do this LSP, the Language Server Protocol, which is this generic way where IDEs can inquire from a server what to do with autocompletion and stuff. So this could work for the Prometheus UI itself. But there's actually an intern at Red Hat, working with Fred (Tobias); he's working on this, just implementing this LSP for PromQL, and then
you can point your VS Code
to that, and suddenly you get auto-completion
in your editor of writing
rules, and that's so cool.
Yes, I'm really excited about that.
I'm also really excited to
finally get those
beautiful help strings that are on all the metrics output into the basic user interface,
because this would help all of the users of Prometheus
to be able to see what does this metric name actually mean
and get the extended help information
and the explicit types that we have.
We have this data in Prometheus
and it's been many years and not exposed to the user.
As a matter of fact, I saw a demo
last week showing exactly
that.
This was... I mean, I always tell the story of Prometheus as it has started with the instrumentation. It's instrumentation first, and we always put in there that you had to describe your metrics with a help string, and you have to tell that it's a counter or a gauge. And then Prometheus was just not doing anything with that information, and that was lasting for way too long. And now something is happening.
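For anyone who hasn't seen that instrumentation-first style, here is a minimal, hypothetical Go service using the official client_golang library, where the help text and the counter type travel with the metric and end up as the # HELP and # TYPE lines on /metrics:

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // The Help string and the counter type are part of the metric definition;
    // Prometheus stores them as metadata alongside the scraped samples.
    var requestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests handled, by method and status code.",
        },
        []string{"method", "code"},
    )

    func main() {
        // Expose the metrics endpoint and count every request handled.
        http.Handle("/metrics", promhttp.Handler())
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            requestsTotal.WithLabelValues(r.Method, "200").Inc()
            w.Write([]byte("ok"))
        })
        http.ListenAndServe(":8080", nil)
    }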
That actually resonates really well, because you're right: a lot of effort goes into describing what the metrics are, and then when you consume them, you just consume them as metrics, as values, right? And then a lot of that information, actually all of that information, gets lost. So I can see a really good opportunity
here for maybe Grafana or another UI to make use of that information, to maybe start explaining
what the different metrics are, right? As the original authors intended them. And there's a question which I have. I'm wondering how, like what are the limits for describing metrics?
When I say limits, I mean, is it like a single string? And is there like a limit of how big that string can be?
Can you add any formatting to that string? Because I'm almost thinking Markdown, which is a bit crazy, but hey, why not?
It's like the next step to this. I mean, that might evolve when we actually use it, but at the moment it's a plain text string with no length restrictions.
Right, right.
Yeah, you can write... I mean, it wasn't the help string, but we had this incident, it's out there, where somebody accidentally put whole HTML source code into a label, and Prometheus could ingest that just fine.
It looked really weird when you looked at the metric.
But we are usually not imposing any fixed limits on anything.
Or any formatting, just like plain text.
The formatting, yeah,
might evolve, we will see.
It's actually interesting, we've had the metadata API
through which you can query
the help and type
information for, I think,
about a year and a half now,
but just haven't
actually made use of it just yet.
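For reference, that metadata is already queryable; assuming a Prometheus server listening on localhost:9090, a request along these lines returns the type and help text recorded for a metric across the targets exposing it:

    curl -G 'http://localhost:9090/api/v1/targets/metadata' \
      --data-urlencode 'metric=http_requests_total'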
I think, as Bjorn
started out with the React UI,
it's a really cool
thing that we can now, with
a modern approach,
do all of these things. Julius did the initial work for this React-based UI, and just within a couple of weeks of having this in, we've had a tremendous amount of contributions to it, because suddenly we've opened up a pool of engineers that can help us out with these things. Which was kind of the initial point anyway, because nobody was really contributing to the old UI. And suddenly, we're just a couple weeks into it, and it has already validated the point that making this more accessible opens up a large pool of contributions.
Which is a very interesting point in an open-source project.
Should you go for something with a known,
big base of people who work with that? Like, let's say, React. And kind of the competing way was the Alertmanager UI, which got refurbished a while ago in Elm,
which has a way smaller community,
but a very committed community.
And we had a bunch of committed contributors.
And I think they are now obviously not happy
that this is happening in React.
But I think it's a really tough decision.
You could say it's the same when we started Prometheus
and decided to use Go and not like Java, for example.
I mean, Go is a way technically better language for that.
But back then it was, we were early adopters.
Like, we also found a lot of bugs in Go,
or feature requests that we really needed,
but it was a big bet to go into this new language
that doesn't have an established community yet,
and I think it's not a clear cut what way to go,
but this is, yeah, it speaks volumes
that we get new contributors that are super enthusiastic about coding React. I mean, I've never been that enthusiastic, but luckily there are others who like it.
So do you know how that decision was made? What to choose? Was it the size of the community, or did someone just say, oh, this looks cool, and start using React? Do you know?
I think it was largely driven by Julius. Julius wanted to learn React, actually, and kind of tried it out here. He obviously asked everyone in one of our dev summits if people think this is a good idea to actually pursue fully. And we agreed on it. I mean, I think we never had an explicit decision.
Often things just happen, which can be good.
Sometimes I think decisions should be explicit.
But again, this is not easy to make a call.
If this should be like super top-down,
we all sit together in a committee and vote about it,
or this should just happen.
Yeah, I mean, I think it's best to just let it happen, because whoever is willing to do the work is the one that should drive the change. Because we can make committee decision after committee decision, and then nobody will do anything with it. And so doing the decision making by being willing to do the work and support it is much healthier for a project.
That sounds like such an adult approach and such a sensible approach.
It's almost like, of course it makes sense.
Yeah, you're right.
Like whoever gets to do the work should, you know, decide whoever is most passionate about it.
Well, they're going to be doing the work anyway.
So why don't you just go ahead, because we trust you to make the right decision. And as it turns out, it was the right decision, right? The React community joined, and there's all this new interest that you wouldn't have had.
I mean, I don't think it's always that clear. I think a project is sometimes very complex, and some people need some guidance on whether they should even become active in a certain area. And I think we also had incidents in the past where somebody just did something and it kind of steamrolled the others, and then they felt frustrated or something. I think this is an actual hard problem. I'm actually reading a paper right now that one of my colleagues, who was in bigger open source projects,
recommended to me,
how are open source communities making decisions?
There's active research going on on that.
Like, should you have a governance structure?
I mean, we have a governance structure now.
Like it's, I think it's an interesting,
but also very hard or it's a hard problem.
That's why it's an interesting problem and important.
That's a paper which I would like to read for sure, and I know that many others will as well, so I'll look forward to that link from Björn.
Okay, so one of the things which I'm aware of as a Prometheus user is memory use. Is there anything that is being done about that in the next six months? Any improvements around Prometheus' use of memory?
As a matter of fact, we had one of our developer summits just recently. The inserts are happening, the live inserts of the data that's being scraped.
And that builds a block of the most recent two hours of data.
And then that's flushed to disk to an immutable block.
And then we use memory mapping so the kernel takes care of that memory management there.
But that most recent two hours worth of data is kept in memory until we do this procedure.
And so that can potentially make up a large amount of memory that you're using.
And so we're looking into ways of offloading this from RAM to other mechanisms. We haven't fully decided on what that is, but we are actively looking into improvements that we can make. There are various other mechanisms that we want to look into. Even within the immutable blocks of data, we want to explore, as Björn likes to say, new old chunk encodings. Because when we wrote the new time series engine, we kind of made the decision that we'll, for now, only look at one type of chunk encoding. And we've realized, looking back in hindsight, that there's probably some potential for making better decisions, potentially at runtime or at compaction time, for example, to optimize some of this data in a better way.
Yeah. Like, we had the Prometheus 1 storage engine, which was essentially hacked together.
And when it was working well enough, we would do all the other stuff.
And then the Prometheus 2 storage engine was really very carefully designed,
but also kind of reverted to just using essentially the classical Gorilla encoding from the Facebook paper.
And the Prometheus 1 storage had a few crazy hacks that we never really evaluated.
But now we can compare.
Cortex has this interesting...
Cortex is one of those remote storage solutions.
But they also use the exact same storage format.
And they support everything, all the versions back into the past. And they can directly compare how things look.
And apparently, if you just look at the encoding,
the Prometheus 1 encoding is like 30% better or something.
So we see we can actually kind of, what's the word,
like recover some of the archaeological evidence from that
and perhaps improve.
We can forward port some of the optimizations.
Yeah, the Prometheus 2 format was very much designed
to reduce the CPU needs for ingestion.
And that completely succeeded
to the point where
we actually have spare CPU.
When you look at the CPU to memory ratios
of a common server,
the Prometheus server will use
all of the memory,
but only a quarter of the available CPU in the typical ratios you get on servers.
So we could spend some more CPU to improve the compression and get us back some of that memory.
Because every time we improve our compression, it not only improves the disk storage space, it improves the memory
storage because we keep the same data in memory as we do on disk.
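As a practical aside, this trade-off is visible on any running server through Prometheus' own metrics; a few example PromQL queries, assuming the server scrapes itself under a job label of "prometheus":

    # Resident memory of the Prometheus process itself
    process_resident_memory_bytes{job="prometheus"}

    # Active series currently held in the in-memory head block
    prometheus_tsdb_head_series

    # Ingestion rate: samples appended to the head per second
    rate(prometheus_tsdb_head_samples_appended_total[5m])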
I'm sure that many users will be excited about this.
I'm very excited to hear that.
I'm looking forward to what will come out of this.
As we are approaching the end of our interview, any other things worth mentioning? Like one thing which is really worth mentioning?
I mean, no story about the future would be complete without my favorite kind of topic in Prometheus, and that's histograms. I'm probably known as Mr. Histogram or something. So, histograms in Prometheus are an extremely powerful approach, but it's kind of half-baked.
We introduced them in 2015.
And a histogram is like a bucketed counter, really broadly spoken.
And yeah, but there's...
From an SRE perspective,
histograms are extremely important
in getting more detail
out of the latency in our applications.
Several other monitoring platforms talk very loudly about histograms being important
because we need detailed data on requests coming into the system,
and an average is not good enough.
And summaries, pre-computed quantiles, are also not good enough
because they usually don't give us the granularity,
and also they can't be compared across instances.
So if I've got a dozen pods, I need to have super detailed histogram data in order to do a proper analysis
of my request, because it's okay to have 10 milliseconds of latency on a request,
but it's not okay when 5% of those are so slow they're useless to the user.
The typical is 10 milliseconds, but 5% of them are 10 seconds.
I can't have that from my service SLA perspective.
So I need more and more and more and more histograms,
but right now they're just super expensive.
And that's because Prometheus, in the same, like when we talked about the metadata,
where we said Prometheus throws everything away and everything is just like floating point numbers with timestamps, essentially.
That's the same for histograms, where the other part of the information is that these are all buckets belonging to the same histogram. Now, every bucket, that counter, becomes its own time series in the Prometheus server. So every bucket you add comes with the full cost of a new time series, with no potential of putting this together in some way or compressing it in some way. And there's decades of research on how to represent distributions in an efficient way.
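To make that cost concrete, here is a hypothetical histogram as it appears on a /metrics page, where every bucket is its own counter series, together with the usual PromQL query that turns those buckets into a quantile estimate:

    # HELP http_request_duration_seconds HTTP request latency in seconds.
    # TYPE http_request_duration_seconds histogram
    http_request_duration_seconds_bucket{le="0.01"} 2262
    http_request_duration_seconds_bucket{le="0.1"}  2399
    http_request_duration_seconds_bucket{le="1"}    2410
    http_request_duration_seconds_bucket{le="+Inf"} 2412
    http_request_duration_seconds_sum 129.4
    http_request_duration_seconds_count 2412

    # Estimated 99th percentile latency, aggregated across instances
    histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

Every extra bucket (every distinct le value) adds another full time series, which is exactly the expense being described here.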
And now that I have more time to work on Prometheus,
my boss also likes this topic a lot.
So perfect opportunity to really go into this.
I had a little talk at PromCon where I was giving my current state of research.
And now at this conference, so many people and so many companies and organizations are interested in that. It was really exciting. And the idea is to get something where we could have way more buckets, or even some kind of digest approach, that plays well with the Prometheus data model. So it's a true challenge, and it will be fairly invasive, because it also changes how Prometheus works: the storage engine, the evaluation model. Because suddenly you have something that's not just a float, it's a representation of a distribution. But the idea is that we will have very detailed and not very expensive histograms
And yeah, I'm very hyped about this.
That is so cool.
That is so cool.
So you mentioned something there which reminded me of a discussion which we had earlier. And that was around being more open and getting the community more involved in what is happening in Prometheus. So you or maybe Fred mentioned about the monthly community calls,
the virtual calls.
Who would like to cover that?
Sure.
Yeah, we're trying to be more open with the wider developer community
and our wider user base.
And a lot of people have found that the Prometheus developer team
is a little closed off and a little opaque.
So we're now doing monthly public meetings
and sharing what the developer team is up to
and taking more input from the community
in order to be a better open source project.
So how can users join those monthly meetings?
Yes, on our website we have an announcement area for those community meetings.
Yes, they're alternating so that they are compatible with Asian time zones and American time zones every other month.
So that hopefully allows worldwide participation.
Do we announce them on mailing lists or Twitter or something?
We do announce them regularly on Twitter and the schedule is open.
People can come and just ask their questions.
We're super happy to answer them
to the best of our abilities.
Thank you. That's a great way of ending this, in that there's no ending. There are other ways that people can join this, not just... because this is one-sided, people are listening to us, but that's a way for them to participate in Prometheus, to get to know more about Prometheus.
So when is the next monthly meeting?
Do you know?
I think we just had one, so it'll be next month.
Okay, so December.
Yeah.
Right.
31st of December, I'm sure.
I believe it's every first Wednesday of the month.
And then the opposite time zone is the third Wednesday of every month.
Whatever.
I think we should look it up for the record. We should provide a link in the show notes.
We will do.
Thank you very much, Ben.
Thank you very much, Fred.
And thank you very much, Bjorn.
It was a great pleasure having you.
And I'm so excited about what you will do next.
Thank you. Thank you.
How often do you think about internal tooling? I'm talking about the back office apps, the tool
the customer service team uses to access your databases, the S3 uploader you built last year
for the marketing team, that quick Firebase admin
panel that lets you monitor key KPIs, and maybe even the tool that your data science team hacked
together so they could provide custom ad spend insights. Literally every line of business relies
upon internal tooling, but if I'm being honest, I don't know many engineers out there who enjoy
building internal tools, let alone getting them excited about maintaining or even supporting them. And this is where Retool comes in. Companies like DoorDash, Brex, Plaid, and even Amazon use Retool to build internal tooling super fast. The idea is that almost all internal tools look the same: they're made of tables, dropdowns, buttons, and text inputs, and Retool gives you a point, click, drag-and-drop interface that makes it super simple to build these types of interfaces in hours, not days. Retool connects to any database or API. For example, to pull data from Postgres, just write a SQL query and drag and drop a table onto the canvas. And if you want to search across those fields, add a search input bar and update your query. Save it, share it, it's too easy. Retool is built by engineers, explicitly for engineers. And for those concerned about data security, Retool can even be set up on-premise in about 15 minutes using Docker, Kubernetes, or Heroku. Learn more and try it free at retool.com slash changelog. Again, retool.com slash changelog.
And by our friends at Square.
We're helping them to announce their new developer YouTube channel.
Head to youtube.com slash square dev to learn more and subscribe.
Here's a preview of their first episode of the Sandbox Show,
where Shannon Skipper and Richard Moot deep dive into the concept of idempotency.
Welcome to the pilot episode of The Sandbox Show, a show where we'll... A YouTube show.
...where we'll deep dive into subjects that developers find interesting.
Don't worry, there will be plenty of live coding.
I'm Shannon and this is Richard, and we're going to cover a broad range of topics as the show evolves, but for today,
what are we going to be covering?
On this first episode, we're going to be covering idempotency.
We had talked to people in our community, and the thing that people seem to be really confused by is this concept of idempotency and how it relates to interacting with an API.
And so I didn't do some Googling on this beforehand, but I know that you did.
I did.
So the definition of idempotency comes from idem and potent. So idem being same, and potent, power or potency.
So it's the same potency.
All right.
Check out this full-length show and more on their YouTube channel at youtube.com slash
square dev or search for Square Developer.
Again, youtube.com slash square dev or search for Square Developer.
It is the 21st of November 2019. It's the last day of KubeCon North America. It's been a sunny day. It's been a great
day so far. We had a great number of hosts and guests
on this show. No, there was only one, it was just me. We had a great number of guests on the show.
Just earlier I was talking to Bjorn from Grafana, Fred from Red Hat and also Ben from GitLab and
they were all on the Prometheus team, very passionate, a lot of interesting things that they've shared with us.
And now we have Tom from Grafana, and we have Ed also from Grafana.
And I'm also one of the Prometheus maintainers.
Oh, thank you.
I mean, I have seen your PRs here and there,
but yes, another Prometheus maintainer.
So the reason why I was very excited to speak with you was,
I know that you have a very passionate view on observability,
on what it means for a system to be observable.
And one of the key components in this new landscape is Kubernetes; all these stacks, the layers are getting deeper and deeper. So to understand what is happening in this very complex landscape, you need observability tooling which is mature, which is complete.
So tell me a bit about that. Yeah, I mean, thank you for having us. Observability is one of these
buzzwords that's been going around a lot in the past few years. I think, you know, I've just been
asked a lot the past few days: what is observability? How does Grafana fit into the observability landscape? I think, you know, observability was previously kind of defined around these three pillars: metrics, logs, and traces. And then this past year it was trendy to kind of bash that as an analogy, and some of it was rightly so, some of it maybe less so. I still sometimes think about it like that, but I try to avoid thinking about the particular data type, the particular way you're storing it, the way you're collecting that data, and I try and think more about how people are using that data.
So for me, observability is about any kind of tooling, infrastructure, UIs, anything you build that helps you understand the behavior of your applications and its infrastructure.
I think that's something really important to emphasize, because at the end of the day,
it's about the stories that we tell, right? And then we use data, some form of data, to tell a certain story. And whatever data is relevant for that story, use it. It doesn't matter what you call it, as long as the focus is: what are you trying to convey, what are you trying for someone to understand, what point are you trying to make? Right? It doesn't matter what you call it,
as long as you don't forget what this is all about.
So I'll give you an example that I think is really relevant,
at least to Ed and I.
We were in Munich two weeks ago for the Prometheus conference.
Great event, 200 or so people coming to just focus on Prometheus.
And towards the end of the first day, Ed, your pager went off, right? Our hosted service was having an issue, and it took us two hours to diagnose it. We were using all of our tooling to understand what went wrong. And I think, at the end of it all, we still don't actually know the root cause yet. I mean, once we figure it out, we'll put it on the blog. But the point of the story is more that a few days later,
after we'd got back from PromCon, after we all sat together,
after we had a video call with eight or nine of the team members on
and we were fishing through all of our metrics, all of our logs
and all of our traces to try and figure out what really happened
to try and get to that root cause.
That was, for me, such a valuable experience: dogfooding our own products, dogfooding our own projects that we work on, and using them to try and understand what went wrong, and try and build that picture. And, you know, we've got graphs, we've got log segments, we've got everything we could possibly gather together to try and understand why. You know, a node failure and an etcd master election and then a network partition,
and everything seemed to go wrong at once,
but really what was the root cause?
And that was exciting.
We also had David and members of the Grafana team
join in to see a live example
of how people were using the tools they're building
and how they can improve the UX of those tools.
And I think he ended up recording it
and showing it to more people in the team, to go, like, look, he wanted to click this, but it wasn't quite in the right place, so it wasn't quite the right thing.
That's a great story. One thing which I really like about this story is how relevant different elements of observability, for lack of a better word, are, how important certain elements are. So when you're trying to dig for a root cause analysis, logs are very, very
important, right? So metrics are getting a lot of attention,
traces are getting a lot of attention, but I'm not seeing
the same thing for logs. So other than Loki, which is an
open source project, is there anything else out there that I'm
not aware of or for logs
specifically that integrates with prometheus that integrates with zipkin or jaeger or what
or whatever else you may have that will give you this root cause analysis tooling yeah i think the
Yeah, I think an interesting one here is: when I joined Grafana Labs 18 months ago, they were already big users of Zipkin, but not in the traditional use case. They weren't using it to visualize requests spanning multiple microservices; they were actually using Zipkin mostly for logs, for request-centric logging, because Zipkin has these kind of basic logging features. I said Zipkin then, didn't I? I meant Jaeger, didn't I? Yeah, I meant Jaeger, sorry, big users of Jaeger. Yeah, it's fine, we can edit that out. But yeah, so they were big users, but not for distributed tracing. We came along and we wanted to use it for the visualization of the request flows through all the microservices. But I'd never really seen Jaeger used primarily for something other than visualizing request flow. So I guess you could think about the tracing tools as a more request-oriented way of logging. I mean, obviously there are a lot of logging vendors out there; a lot of them are represented at KubeCon. I think the most popular one for Kubernetes has always been Elastic, the Elastic stack, ELK. That's what most people use, and it's a great tool. One of the things that always impresses me about Elastic is you can pretty much do anything with it. I've seen people build their whole BI and analytics stack on Elastic, I've seen people use it for developer-centric logging, people use it for audit logging, people use it for security analysis, and people are using it for actually searching web pages as well, which is kind of fun, because that's what it was originally used for.
Loki, I know you said apart from Loki, but Loki is not like Elastic in that sense. We are just focused on the developer-centric logging flow. We just want to give you basically what you would see in kubectl logs; we want to give it a bit better user interface, so you can kind of point and click and see it in Grafana. And honestly, we've touched on dogfooding already; I think it's one of our superpowers at Grafana Labs. We build the product we want to use as developers. And really, the reason I started the Loki project was because you can't kubectl logs a pod that's gone away. One of the common failure modes was pods would die, disappear, get rescheduled, etc., and I wanted to know what was going on in that pod before that happened. That's why we built Loki: we wanted basically kubectl logs, but with a bit more retention. And so here's an interesting one: kube-cuddle, kube-cuddle, kube-C-T-L, what do we say?
Kube Control?
Kube Control, really? There are so many ways, yeah. There are so many ways, no.
Kube-C-T-L, from my perspective.
Kube-C-T-L, not kube-cuddle.
No.
So wasn't there an unofficial logo which was a cuttlefish?
Yes, there was. There was an unofficial logo in a couple of places, yet the cuttlefish gets mentioned.
I like the cuttlefish one.
I mean, yeah, C-T-L... sysctl? Maybe that's where I have...
I would say sys-cuddle.
Sys-cuddle.
But did you used to say sys-cuddle before?
No, I mean, maybe not. And this one, I really like it: it's definitely io-cuddle and not io-control.
Okay.
Earlier, Ben was mentioning all the different building blocks that exist in the observability landscape in the CNCF, and I can see Loki as one of those building blocks. The one thing which I really like about Grafana is that it doesn't limit which data sources you can use. So if you want to use ELK, you can do that. If you want to use Stackdriver, you can do that (which is logging from a vendor), perfectly fine, no problems.
And if you want to use Prometheus, a very popular project,
a graduated project, second graduated project in CNCF,
you can use that as well.
And it's a combination of all these tools and many others.
InfluxDB.
We've got over 60 different data sources.
There you go.
I mean, I don't even know them all.
Yeah, I mean, I couldn't name them all. You can combine them in innovative ways, and you can almost do the right thing, the right thing being relative and relevant for you. So what is the right thing for you? If you want to use Loki, so be it; if you want to use Splunk, so be it.
Yeah, well, so the thing I think is even more cool is it's not just about having these data sources and being able to pull this data into dashboards and the Explore mode. What we're working on is, with Loki, we built this experience where, because we have this consistent metadata between the metrics and the logs, we allow you to switch between them automatically.
So given any Prometheus graph, any Prometheus query, we can automatically show you relevant logs for it.
Now, that was a very Loki-specific experience. We've been working really hard to try and bring that to other data sources, so hopefully, as long as you curate your labels correctly, you'll now be able to achieve that kind of experience between Graphite and Elastic.
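To make that concrete, here is a rough sketch of the idea in Go; it is an illustration, not Grafana's actual implementation, and the helper and label names are assumptions. Given the labels behind a Prometheus query, you can build a Loki-style stream selector from the labels the two systems share:

package main

import (
	"fmt"
	"strings"
)

// buildLogSelector turns a set of shared labels (for example, taken from a
// Prometheus query) into a Loki-style stream selector. Hypothetical helper.
func buildLogSelector(labels map[string]string) string {
	var matchers []string
	for k, v := range labels {
		matchers = append(matchers, fmt.Sprintf(`%s=%q`, k, v))
	}
	return "{" + strings.Join(matchers, ", ") + "}"
}

func main() {
	// Labels assumed to be consistent between the metrics and the logs.
	labels := map[string]string{"namespace": "loki", "job": "loki/ingester"}
	fmt.Println(buildLogSelector(labels)) // e.g. {job="loki/ingester", namespace="loki"}
}

The whole trick is that the same label set identifies both the metric series and the log stream, so switching between them is just a query rewrite.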
This is something I didn't really understand until I joined Grafana Labs: the team is so committed to this big tent philosophy, to enabling these kinds of workflows and enabling other systems. And I really think the Grafana project is the only thing out there that really allows you to combine and mix and match, and is so much more additive to the ecosystem than other projects that are like,
no, you can only use this data source. You can only talk to this database.
A bridge. A bridge to all sorts of things.
We're Switzerland, right?
Yeah, right. I like that analogy very much. So we have Ed here. I hear that he's quite
involved with Loki. And when you said we, Tom, I'm sure you meant the royal we, because it's mostly Ed, right? Let's be honest, Loki, it's mostly Ed. So tell us, Ed, about Loki. Why do you like it? What do you like about it? Where is it going?
Yeah, I can still remember, probably about 10 months ago, when I was interviewing with Tom and we were talking about Loki, and it was new to me at the time. The first question I asked was, isn't this already a solved problem? Don't we have solutions for logging already? And then, as he explained what I would almost call a simplification of how Loki stores data compared to other systems, all of that immediately scratched an itch that I've had. I've been a developer my whole life, and the two things that I do most with logs is I deploy software and I tail them, and I look for errors, right? And then I'm running the software and it's broken, and I've got to go find where it's broken.
So what Loki does really well is we only index the metadata, the label data that is part of your logs, and not the full text of the log. So from an operating and overhead perspective, it's much leaner. And as long as you're looking for data and you know the time span, and you know the relevant metadata, the server it was on, the application, you're there; you're looking at your logs. And the tailing aspect is included as well with Grafana. So I'm like, wow, that's what I wanted, right? And the big advantage from an operating perspective with Loki is that the index scales according to the size of your metadata and not your log content. So we're almost a couple of orders of magnitude smaller on our index than we are on our stored log data, and then we can take advantage of object stores and compression to store data cheaply. So it's a really nice optimization on log content when you're a developer or an operator and you just want to get to your logs right now: I want to look at this application's logs.
And last week, we were regularly going, let's go look at what the journal logs say for this node, what is going on here, can we add a regex filter on there for out-of-memory messages? Like, oh, that's a lot of those, right? And recently we've been adding support for metric-style queries against your logs. To me, this was like the grep piped into word count workflow: I want to know how often this is happening. But it gets better, because now I can see over time how often it happens, and if that out-of-memory message keeps showing up, that's probably a problem. That's been really exciting, and I feel like that's resonating with a lot of people we talk to here as well: this is what I want for my logs. There's way more you can do with your logs than that, right? Absolutely. And some of these other projects are much better suited for the different kinds of queries you might do where you need a full index. But in a lot of cases, the Loki model is really, really perfect for that.
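As a minimal sketch of the model Ed is describing (an illustration, not Loki's implementation): only the labels identify a stream, the line content is just scanned by a filter at query time, and a metric-style query is essentially that filter plus a count per time bucket. All names below are hypothetical.

package main

import (
	"fmt"
	"regexp"
	"time"
)

// Entry is a log line with its timestamp. The stream it belongs to is
// identified purely by a small set of labels, which is all that gets indexed.
type Entry struct {
	Time time.Time
	Line string
}

// countOverTime mimics a metric-style query over logs: filter lines with a
// regex (the "grep"), then count matches per time bucket (the "wc -l", over time).
func countOverTime(entries []Entry, re *regexp.Regexp, bucket time.Duration) map[time.Time]int {
	counts := map[time.Time]int{}
	for _, e := range entries {
		if re.MatchString(e.Line) {
			counts[e.Time.Truncate(bucket)]++
		}
	}
	return counts
}

func main() {
	now := time.Now()
	entries := []Entry{
		{now, "kernel: Out of memory: killed process 1234"},
		{now.Add(30 * time.Second), "service started"},
		{now.Add(2 * time.Minute), "kernel: Out of memory: killed process 5678"},
	}
	oom := regexp.MustCompile(`(?i)out of memory`)
	for ts, n := range countOverTime(entries, oom, time.Minute) {
		fmt.Println(ts.Format(time.RFC3339), n)
	}
}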
I really like how you take a really simple idea. You start as simply as you possibly can. Where do you stop?
I look at Elastic, built on Lucene, probably as a great building block, and I look at a lot of the projects that came out of that as being generally useful in a lot of places. But I don't think big data ever quite hit its promise. So one of the things I've always tried to do, I think with everything I've done, is be very, very focused on a particular story, a particular end user, a particular use case.
You know, with Loki, that use case was the incident. I mean, I'm still on call at Grafana Labs. I don't know how Ed feels about that, but I think I still occasionally get paged at 3 a.m. And I really wanted tooling that would help me, very quickly, in a sleep-deprived state, get to the problem as quickly as possible. And that's what the focus has always been on with Loki. And so you asked, where do we stop? Well, I think we don't try and make Loki do tracing, we don't try and make Loki do BI, we don't try and make Loki do use cases that are beyond that sleep-deprived 3 a.m. incident response drill. I think we stay with these tightly focused stories, and that's how we build great projects.
I mean, I learned that from Prometheus. Prometheus was, and still is, incredibly focused and incredibly resistant to feature and scope creep. And so I learned a lot through the Prometheus project, and I'm really keen to apply that to this project and maybe future projects. I'll caveat it with one thing: the way we built Loki so quickly is we actually took all of the distributed systems algorithms and data structures from another one of my projects, from Cortex. And so Loki is really just a thin, well, maybe not so thin anymore, but it started off as a thin veneer wrapped around the same distributed hash tables, the same inverted indexes and chunk stores that we used in Cortex. And that's how we got the first version of the project out so quickly. So I'm all for code reuse, I'm all for reusing data structures and sharing and this kind of stuff. But I just think the end solution that you build it into should be really, really focused.
So Cortex is really cool.
And I would like us to go into that soon.
But before that, I would like to add an extra insight for those that maybe don't know you very well.
You're the VP of product for Grafana Labs.
So why are you being paged?
Because you like it?
Because you want to be close to the tooling?
Because you want to see what people will be getting?
I think that's possibly the most committed VP of product
that I've known.
And that's the right way of approaching it
so that you have a firsthand experience yourself.
Yeah.
Of all those products.
We talk at Grafana Labs about authenticity. We try not to spin the stories we're telling; we try to just tell real stories, authentic stories. I remember having a conversation with the CEO, with Raj, about what it means to build these empowered, distributed teams of really awesome software engineers. One of the ways we encapsulated it: you see a lot on people's Twitter bios something like "thoughts and opinions here are my own." I never want any of my employees to have to caveat their opinions. I trust them all. I want them to feel empowered to speak on behalf of the projects and the company that they represent, and I want them to speak authentically. So part of that is, if you hear me standing up talking and telling a story about why I built Cortex, why we started Loki, why I use Prometheus, why I use Grafana, these are real stories from my actual experience. And I do miss being able to write as much code as I used to. On the flight over to San Diego from London, I actually did a PR for Prometheus, because I'm a software engineer at heart. I do miss it sometimes, but I also see the work that Ed and the rest of the team are able to do, and I just think, as long as I can help, as long as I can build an environment for people to be that successful, then I'm happy.
I think that's a great philosophy to have. And it's really powerful. We can see how important it is to approach things like that, to really believe in it and to operate under that mindset.
Yeah, I try to.
So Cortex, very interesting.
Another interesting Grafana Labs product, project, how would you call it?
Well, interestingly, Cortex isn't a Grafana Labs project.
I started the Cortex project over three years ago, before I worked for Grafana Labs. About a year ago we put it into the CNCF, and so it's actually a CNCF sandbox project used by a lot of companies. Every time I come to KubeCon I meet new companies who are like, oh hey, we use Cortex, and I'm like, wow, I had no idea. We really just started it for our own needs to begin with. Grafana Labs does use Cortex to power our hosted Prometheus product in Grafana Cloud, and so that's where our vested interest is, right? We are doing this because it's the basis of one of our big products.
But also, one of the things is, I like Cortex. In a previous life I worked on Cassandra, and so you'll see heavy influence in Cortex, in the algorithms and in the data structures, from Cassandra. We do a very similar virtual node scheme; we have very similar distribution and consistency and replication, these kinds of things, to Cassandra. I liked Cortex mainly because I was learning this new language, it was called Go, and I thought this would be a great language to do lots of these kinds of concurrent, highly distributed systems in. And so I kind of thought, well, what are the algorithms that I hope will be really easy to implement in Go that would be challenging to implement in other languages? So that was kind of one of my motivations for Cortex.
I also at the time was building a different product, still in the observability space: I was working on something called Scope, and I spent a long time building this. One of the tools I used whilst building Scope was Prometheus, and I very quickly realized that Prometheus was where it was at and was incredibly useful. So that's kind of how I got into the Prometheus space, and then I thought, well, what the world really needs is a horizontally scalable, clustered version of Prometheus, mostly because I just thought it'd be cool to build. And so we started it, we built it, and we kind of learned what the actual use cases it applied to were; we learned as we went. Now, I originally thought long-term storage would be the biggest value of something like Cortex, but now I think it's really something else: we talked about how the Prometheus community and the Prometheus team like to keep Prometheus well-defined and tight and small and easy to operate.
And this excludes a lot of use cases.
This particularly excludes a lot of use cases
that involve monitoring over a global fleet of servers.
And so really, I think the Cortex project,
its main value proposition is about monitoring
lots of servers deployed in a global fleet. Maybe you've
got tens of clusters on multiple different continents and you want to bring all of that,
all of those metrics into a single place so you can do these queries.
And then when we joined Grafana Labs and they had much larger customers than I'd ever worked
with before, we started to experience query performance issues with Cortex. We hadn't really at the time had any very, very large users on it. And as we started
to onboard very large users, they started to complain about the query performance. And so I
guess the past 18 months of Cortex projects has been almost 100% focused on making it the fastest
possible Prometheus query evaluator out there um and that was the
talk i gave at kubecon a couple of days ago uh it was about how we parallelize and cache and
and partially and and emit like parallel partial sums for us to kind of re-aggregate you know and
and we do all of these different techniques to really really accelerate our promql expressions
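A conceptual sketch of that parallel-partial-sums idea, not Cortex's actual code: split the query range into sub-ranges, evaluate a partial sum for each concurrently, then re-aggregate. The evalPartialSum function below is a stand-in for the real query evaluation.

package main

import (
	"fmt"
	"sync"
	"time"
)

// evalPartialSum stands in for evaluating a PromQL-style sum over one
// sub-range of the full query; in reality this would hit the store/queriers.
func evalPartialSum(start, end time.Time) float64 {
	return float64(end.Sub(start) / time.Minute) // dummy value for illustration
}

// parallelSum splits [start, end) into n sub-ranges, evaluates each
// concurrently, and re-aggregates the partial sums into the final answer.
// This works because a sum over disjoint shards can be recombined by addition.
func parallelSum(start, end time.Time, n int) float64 {
	step := end.Sub(start) / time.Duration(n)
	partials := make([]float64, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s := start.Add(time.Duration(i) * step)
			partials[i] = evalPartialSum(s, s.Add(step))
		}(i)
	}
	wg.Wait()
	total := 0.0
	for _, p := range partials {
		total += p
	}
	return total
}

func main() {
	end := time.Now()
	fmt.Println(parallelSum(end.Add(-24*time.Hour), end, 24))
}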
And then the really interesting thing happened a few months ago, because Thanos... we can't not mention Thanos. Thanos started off a year after Cortex, started by Bartek, who also lives in London, a good friend of mine, and it started to solve exactly the same problems that Cortex was solving, but effectively did it in the completely opposite way; almost every step along the way, they chose the opposite. Thanos has become a lot more popular than Cortex, for sure, and they did a really good job of making it a really easy to adopt system, great documentation, and they really invested in the community. So I learned a lot. Thanos is more popular than Cortex, but I think one of the things we've been able to do recently is take a lot of the stuff we've built and deployed in Cortex to accelerate query performance and apply it to Thanos. And that's kind of exciting, because now we can bring these really cool techniques to a much larger community.
I know this was asked before, but the one thing which I kept thinking
during your talk is, when will you announce that Thanos and Cortex will merge and become one? And I think you made a great joke about it, like, they have, right? They will merge. I know that is not happening, or at least not right now, not that we know of, but the inspiration was Flux and Argo: two very popular projects in the CI/CD space have merged. I think that's a great combination of effort, getting the best of both worlds. I'm sure many are wondering, will that ever happen? It would be cool, but I'm sure it also has its own challenges for that to be the case, for Thanos and Cortex to merge. So we'll watch this space for sure.
I don't want to see merging as an end goal. I think the end goal should be collaboration.
In the same way, one of the things I like about the Prometheus community is they've been so open to adding maintainers because of their contributions, effectively, to other projects. So the main reason I'm a Prometheus maintainer is because I started Cortex. And similarly, Bartek has been added to the Prometheus maintainer team recently as well. So there's a huge overlap between the Thanos maintainers, the Prometheus maintainers, and the Cortex maintainers, and really, I don't think the end goal should be convergence of these two projects; I think there should be increased collaboration between them, and that's what we're working towards. I really like working with the Thanos guys, I really like working with the Prometheus guys, and finding ways in which we can share and collaborate more, share cool examples, try different things in different projects.
That sounds awesome to me.
The deployment models for Thanos and Cortex are completely different, opposite ends of the spectrum, and so maybe they'll never merge, right? Maybe they'll never, because the deployments are so different; maybe they'll stay separate.
But I think the technologies and the libraries they share,
I mean, both Thanos and Cortex use the same PromQL query engine
that Prometheus uses.
I mean, it is the Prometheus query engine.
Both Cortex and Thanos use the same compression format
for their time series data.
You know, we share way more stuff in common
than our differences, really.
And I look at some of the mergers of communities over the past year, and I think they've been announced before the communities have really had a chance to gel and demonstrate the benefits of that merger. So I definitely want to demonstrate the benefits of working together first. We are already working together and we are having some great success, and if that continues, and if we find even more ways to work together, then maybe a merger makes sense. But I'm more interested in the shared code, the collaboration, the shared solutions.
That's a great take. I really like that. It makes a lot of sense, as if you have thought about this long and hard, I would say. So you strike me as the person that always has a couple of projects, side projects, in his back pocket.
Anything that you'd like to share with us?
Anything interesting that you're working on, hacking on, or maybe Ed?
What do you reckon, like Tanka?
Tanka's pretty cool. We should mention Tanka here.
So this is not really my project. There's a very young chap called Tom Brack in Germany who approached us, actually at KubeCon, and, well, he was 17 at the time. He came up to our booth, spoke to Goutham and I, and said, I really like what you're doing with Jsonnet, I really like the whole mixins thing, I really like Cortex, I really like Loki, do you have a summer internship position? And I'm like, a 17-year-old kid is talking to me about Jsonnet. Jsonnet is one of the nichest aspects of this community that I'm aware of, right? And so we got chatting to him, and he did end up doing a summer internship. And about the same time, Heptio was sold to VMware, and the ksonnet project was discontinued. We were big users. I really liked what they were doing with ksonnet. I really liked how it enabled this kind of reusable and composable configuration as code. And when I joined Grafana Labs, we rolled out ksonnet everywhere. So to hear it was discontinued was a bit of a problem for us. We continued to use it, we continued to invest in it. And when Tom Brack came along, we actually re-implemented it in this project called Tanka, with a whole bunch of other really cool improvements that he's done. It's now much faster. It just forks out to kubectl, so we don't have a lot of compatibility challenges. It's got a much more sophisticated diffing mechanism. And this 17-year-old kid has just massively improved the productivity of the engineers at Grafana Labs by really improving the toolchain for our Kubernetes config management. So if anyone here is using Jsonnet, using ksonnet, and wondering what the future holds, I'd encourage you to check out Tanka. It's a really, really cool project.
This is something which keeps coming up over and over again: the community, the openness, the barrier to entry which is so low, and how everybody's there to help you, right? Whatever age you are, whatever inclination you have, whatever you want to do, you can do, and everybody's there to guide you, help you, and accept whichever contribution you want to bring. This is something so valuable which, over the last three days, I keep seeing over and over again. Let's say it's one of the core values of this new community and this new ecosystem, which has grown so much, to 12,000 people. Did you manage to speak to all of them?
Probably about a twelfth of them, right? Yeah, it definitely feels that way. I would definitely agree the superpower for Kubernetes and for the cloud native community as a whole is this openness, is this acceptance.
I really like what
the CNCF has done by having
multiple competing projects
in their
incubation, like Thanos and Cortex are both
in there. And I really look forward to other projects
coming in and doing the same thing.
I think I really like how the CNCF
are not kingmakers in this respect.
I think that openness is great.
And then, no matter what you think about Kubernetes
and its complexity and its adoption,
I think the real benefit of Kubernetes is the openness.
And if you really want to and have the time and the effort
to make a contribution and make a change,
definitely it will be accepted and you'll be embraced with open arms.
And eventually you'll be put in charge of some huge component
and you're like, what?
Yeah, I'm a big fan.
And especially if you're a VP of product, right?
PR to Prometheus.
Yeah, I don't, I mean, I think I've had some PRs into Kubernetes. I'm not sure.
But I don't get to do as much code as I used to. I mean, I do miss it. I think, you know,
you still get to play. I still do a fair amount of config management work because I still help
with the deployments and still building dashboards and occasionally doing PRs to Prometheus and still doing a fair amount of code review.
Not as much as I used to, but I've spent a lot of my time doing all sorts of things now.
Doing marketing work, that's an interesting one.
So, as we're approaching the end of this interview, and also the end of KubeCon, which is an amazing, amazing event: anything specific that you were impressed by, or you wouldn't expect to see and were very happy to see? Any key takeaways?
My story is, as we were talking a little bit earlier, this is my first KubeCon, and I'm new to the open source community. I've worked a lot of enterprise jobs prior to this, and it is really exciting, I have to say, the people that come up to the booth and talk about, like, hey, we use it, hey, we love it. Being part of that, being part of a project... I met someone who is a contributor to Loki who came up, and they were really, really excited. It's a really cool feeling to have people see these tools and actually use them, come talk to you about it. I really enjoy the amount of people interested, the talks that we're giving that are deep dives into these projects that people are interested in seeing. It's such a different experience than the software I've done in the past. I think it's really neat as a developer, even if you're just using these tools, because of the tools and their proliferation and their openness, it's a skill set you can take anywhere with you, right? These are real skills, and I think companies are starting to see the real value in having toolchains that people know by name. You hear Prometheus more and more and more; that's really valuable. And to have that be open source technology is really amazing.
Thank you, Ed. Thank you, Tom. It's been a pleasure having you. I look forward to the next one.
Cheers.
This episode is brought to you by Git Prime.
Git Prime helps software teams accelerate their velocity and release products faster by turning historical Git data into easy-to-understand insights and reports.
Because past performance predicts future performance, Git Prime can examine your Git data to identify bottlenecks, compare sprints and releases over time,
and enable data-driven discussions about engineering and product development.
Shift faster because you know more, not because you're rushing.
Get started at gitprime.com slash changelog.
That's G-I-T-P-R-I-M-E dot com slash changelog.
Again, gitprime.com slash changelog. I would like to say that we've kept the best for last,
but that's something for you to appreciate.
We are definitely ending the KubeCon on a high. Most people are already breaking off and some
have already flown back home. We're still here, so in this way we are officially ending KubeCon with this last interview. I have around me three gentlemen, left to right: we have Jared, we have Marcus, and we have Dan, all from Upbound. You may recognize them by Crossplane, that's a very strong name, and also Rook. So they are some of the ones behind these great projects.
I'll let them maybe speak a little bit about their involvement
and also tell us what they're passionate about,
what their takeaways are from the conference.
So who would like to start?
I'd be happy to start.
So this is Jared.
And I have been a founder and a maintainer
on both the Rook project and the Crossplane project. So I've been
sort of living in the open source cloud native ecosystem for multiple years now. And one of the
biggest things for me that I see consistently is that each KubeCon gets that much more crazy,
that much more lively. And the amount of new people that are coming into the ecosystem
is always a fairly surprising amount. I think anytime that you go to a talk and people ask,
is this your first KubeCon? You see a large majority of the room raising their hands.
And to me, that says that this ecosystem is onto something exciting and it's attracting more people
and it's gaining more adoption. And that's something that consistently excites me a lot.
I see it all the time at every KubeCon.
Yeah, Dan was calling those the second graders, right?
There were a lot of second graders at this KubeCon,
and some fourth graders.
I really enjoyed that; it was a great analogy. The analogy where he was showing how his son was playing Minecraft and hiding the screen because that was the way to survive the night. And yes, everyone at the convention was represented: if it was their first year, they were considered second graders, and everyone else was only a fourth grader, because the project itself is only five years old, and so we're all new and learning this together.
Yeah, it's a great analogy.
Yeah, definitely.
I think personally, that was a really cool analogy for me because I actually graduated from college recently
and I'm fairly young in the community.
But a lot of people have been extremely welcoming and kind to me,
welcoming me into not just the Crossplane and Rook ecosystems, but also the greater Kubernetes ecosystem, welcoming me onto the actual release team for 1.17. Being part of that was super cool, and there are just a lot of people who have been around from the inception of Kubernetes who are saying, you know, you're a young person, come in here, you're welcome, and we value your thoughts and opinions and your efforts.
So it's definitely a cool place to be at KubeCon and being surrounded by really talented people like that.
And actually, I think that's something that speaks a lot to not only the community and the ecosystem here amongst people that are part of this cloud native movement,
but I think that's just open source in general. I've seen a massive change over the past five years, 10 years, and even earlier than that, where you've got these communities that are able to form on these more social sites like GitHub and GitLab, where you're able to get these communities built and be very collaborative in a very open environment. That not only is getting these projects more out there and in the hands of other people,
but it's attracting people that bring a lot of enthusiasm, that feel welcomed because of the way the community is treating people, and it's getting more people involved in open source than have ever been involved before. It's not something just for graybeards anymore. Open source is for everybody now, and it's pretty awesome.
So this is something that was mentioned a couple of times, even I mentioned it a couple of times in these interviews: I'm still surprised by how open and welcoming everybody is. Even though it's been three packed days, even today, everybody was still happy, still smiling, and really happy to answer any questions.
And even though they were really tired, you could see some people had three very hard days and who knows how many months before that.
So Bryan was just saying a lot of the preparation started six months ago.
So some have been at this for a really long time.
And yet, open, welcoming, warm.
It was great.
My first KubeCon, I loved it.
What was your first KubeCon?
This was my first KubeCon.
So you were experiencing that welcoming attitude firsthand.
Yes.
I love that.
That was amazing.
Natasha and Priyanka were talking about the process, and especially Natasha, since she has been in the CNCF for a couple of years, before GitLab. She was saying that the processes which they have in place, all the documentation, are such an important factor in this welcoming community.
I think that's really been recognized as a key thing in the success of Kubernetes and
the open source ecosystem in general. I think that's one of the drivers for it. It's not only
the right thing to do to welcome people in and make everyone feel a part of the community. It's
also in the best interest of the project. And I'm sure Jared will probably talk about this shortly,
but I think that's been reflected in some of the work we're doing as well, where, you know, we're reliant on a strong community to be successful in what we're trying to go after.
So, yeah, it's cool to see that it's not only the right thing to do to treat people well, but it's also beneficial for, you know, achieving whatever goal you're searching for. And speaking about the goals, I think that's another thing that makes the open source projects work
and has people coming to the booth,
being happy to talk about the project.
Maybe they don't understand it at first,
but as you start talking to them,
they realize and you realize
that they have the same concerns
and they need the same sort of outcomes that you do.
And when there's a fit between your tool and what their needs are... The ecosystem of open source is many solutions to the same problem, and each one tackles it in a different way. But it's great when you start explaining what your product does and they latch onto it, and they kind of lead the conversation, because they know how to make what you've offered so far more useful to fit their circumstances.
And yeah, it's good to have those conversations.
I think it keeps that positive attitude.
If everybody walked up and said, what is your product? I don't get it, it'd be a little souring.
And along with that welcoming nature there,
this is a story I really like to share with people because it highlights how things can go
in the completely opposite direction and cause a very toxic environment. And so I will certainly
not mention the project that this happened on. But and it's not in the cloud native ecosystem
at all. It's certainly not a CNCF project because all those communities are super welcoming and
kind. But there was an open source project I got really excited about because it was very aligned
with some of my personal interests. And being a maintainer on other open source projects,
I know how important it is to have a contributor's guide to be able to welcome new people into the
community, but also have pragmatic or practical steps of this is how you build the project. This
is how you add unit tests. This is the criteria for opening a pull request and getting it accepted.
And so I opened an issue on a particular GitHub open source project. And within five minutes or
so, one of the maintainers on that project replied to my request to create a contributor
guide so that I could start helping them out. He told me that it was the dumbest issue he's ever seen. He used some explicit language and
said that he's tired of idiots opening issues in his repo. And I cannot imagine that they ever got
another contributor to join that project ever again because of that completely toxic behavior.
And so there's a spectrum of being welcoming, kind, supportive, and then there's that type of behavior; I don't think anyone else has ever had an experience quite like that.
It's definitely an anomaly, an outlier, but it is the worst way to run a community ever.
Wow. Wow. Okay.
Well, I'm really glad that that's a really bad example, because it's really easy to forget, right? But these things do happen, even today. We don't realize, because we're so privileged to be in such a great community and to have so many genuinely nice people around us, and we do forget that things like these do happen. So what I would say is, everybody that has such an experience is more than welcome to join the CNCF community, right?
Because we will show them that that is not normal.
We'll show them what normal is.
We'll be more than happy to get as many people as want on board
because this is normal and this is good.
Yes.
And I think that speaks to the success of this approach.
I'm not sure how many people were at the last KubeCon,
but this one was 12,000 people.
And I know the first ones, like only four or five years ago,
were like 500, 1,000.
So how much this community has grown,
and maybe this has something to do with it, I think.
And the success of one project can lead to the success of the other projects.
Once you've modeled how to develop a great community and nurture the community with this sort of support to continue contributing,
all the other projects are going to be able to benefit from that.
I'm really glad you mentioned that, Marcus, because I would like us to maybe start looking a little bit at Crossplane. The one thing which, at least to me, Crossplane is, and you can give me your perspectives, is the embodiment of leveling the playing field,
being open, bridges everywhere, right?
Everybody's welcome to the party.
No vendor lock-in.
It's just the opposite of that, right?
We're open.
We embrace everybody.
We are open to anybody working with us.
And this is what we think the future looks like.
So it's this, all the bridges between all the vendors,
all the ISVs, all the services?
That's how I see it.
But how do you see it, Dan?
Yeah, so that's exactly right. And, you know, we pitch the project as the open multi-cloud control plane.
And that's really what it is.
We're really trying to open up all of the different cloud provider managed services
to anyone and everyone and really reduce that barrier of switching between them.
And, you know, it's built in such a way that allows people to add their own extension points
to that. So there's really no one who's not welcome there, right? You could start a cloud
provider in your home lab, in your apartment, and you could add a stack for that with Crossplane (which I'm sure we'll get to later) and extend Crossplane to include that.
And what that does is it really allows people to pick the best solution for their problem. So,
you know, there's, there's a variety of scales of cloud providers, and maybe you just provide
a managed database service, and it has a very specific use case.
And in an enterprise setting, that can be really hard to adopt because it takes a lot of effort and time to bring on new providers and integrate with them.
But if you integrate in a consistent way, then the companies and the groups of people who are providing open source projects that fit certain, maybe niche, needs, those are now a lot easier to use, and you can pick the best thing to fit whatever use case you have.
Yeah, I think that when you're trying to level the playing field, or provide easily attainable access to open source software or to proprietary software, whatever it may be, getting access in a consistent way, across a lot of different options, to a lot of different people and needs and scenarios, that's really part of opening the door there for everybody. And I think that our efforts here are based on this foundation that Kubernetes itself has started. Because if you take a step back and look at Kubernetes, whatever the underlying cloud provider or hardware may be, it abstracts away the infrastructure in the data center and allows your applications to run in a very agnostic way. So Kubernetes kind of started pioneering this trail, where your application doesn't have to worry about the environment it's running in.
You know, it can basically just express itself
in a simple way and then run anywhere. That's a start. But then there's many ways to take that
further. We've heard Dan mention something about stacks. I'm looking at Marcus, because I know that he's been closely involved with various stacks. Can you tell us, Marcus, what stacks are and what stacks are currently available in Crossplane?
Sure. Stacks are a package of resources that Crossplane uses to extend the Kubernetes API
with knowledge of cloud provider resources or any sort of infrastructure resource.
Additionally, applications, but first focusing on the infrastructure resources.
There are stacks currently for Google, Azure, and AWS,
and additional ones, Packet and Rook, all interesting topics.
So taking the example of Google, there's a Cloud SQL MySQL instance. And one can imagine in Kubernetes, creating an instance of that resource, specifying in the spec of that resource, all of the API parameters that
you need to configure that resource in the cloud. And then within Kubernetes, using Kubernetes
lifecycle management, you've created this resource that will be reconciled,
creating a cloud provider resource.
And the byproduct of that is a secret
that you can bind to your application
so that whatever application it is that needs MySQL has access to your MySQL. The way that we've done this in Crossplane is we've abstracted that to, currently, five different abstractions. Maybe there's six, I'm losing count.
So we've got one for MySQL, Redis, Postgres, object storage, Kubernetes engines themselves. And if you're familiar with the concept of the CSI drivers
where there are persistent volume claims and their storage classes: in that setting, you have a deployment with pods that have the intent to be bound to storage, block storage, whatever, and they make a request for, say, 20 gigs of storage to be attached. They don't know, they don't care, how that storage is attached to them, the pods. Somewhere else, a storage class has been configured, and this storage class dictates that storage will be provided through EBS or through any other form of storage that the cloud provider is capable of providing. All the other settings, whether it's faster service or cheaper service, are defined in that storage class.
And what Crossplane's done is take that concept and extend it to all of the other resources
that you could want to use in your cluster or for your applications.
So MySQL and Postgres and so forth.
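To make that claim-and-class analogy concrete, here is a deliberately simplified sketch; the real Crossplane types are Kubernetes custom resources defined in the project's API packages, and the type and field names below are made up for illustration.

package main

import "fmt"

// MySQLInstanceClaim is what an application team asks for: "I need a MySQL
// of roughly this shape", with no mention of a specific cloud provider.
type MySQLInstanceClaim struct {
	Name          string
	EngineVersion string
	ClassRef      string // which resource class satisfies this claim
}

// ResourceClass is configured by the platform/ops team and dictates how a
// claim is actually provisioned (CloudSQL, RDS, Azure DB, on-prem, ...).
type ResourceClass struct {
	Name        string
	Provisioner string // e.g. "cloudsql" or "rds"; hypothetical values
	Parameters  map[string]string
}

// provision mimics the reconciliation step: resolve the claim against its
// class and pretend to create the managed service plus a connection secret.
func provision(c MySQLInstanceClaim, classes map[string]ResourceClass) string {
	class := classes[c.ClassRef]
	return fmt.Sprintf("provisioning %s v%s via %s (secret: %s-conn)",
		c.Name, c.EngineVersion, class.Provisioner, c.Name)
}

func main() {
	classes := map[string]ResourceClass{
		"standard-mysql": {Name: "standard-mysql", Provisioner: "cloudsql",
			Parameters: map[string]string{"tier": "db-n1-standard-1"}},
	}
	claim := MySQLInstanceClaim{Name: "wordpress-db", EngineVersion: "5.7", ClassRef: "standard-mysql"}
	fmt.Println(provision(claim, classes))
}

The application only states the claim; which provider actually backs it is decided by whichever class the operators have wired in, exactly like a PVC resolving against a storage class.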
So MySQL, Postgres, and you mentioned Rook as well.
These are still relatively low-level building blocks.
Do you have higher-level building blocks
for someone that, for example,
wants a type of an application
so that there's a bit more
that's done for you out of the box?
So you don't have to assemble all these blocks yourself?
Yeah, so one of the things
that we're really focused on as a project
is addressing it in layers, right?
So starting with
the lowest level, and then building on top of that, and also allowing other people in the
community to build on top of it. And one of the great values of being standardized on the Kubernetes
API is that we can integrate with a lot of different things. So as Marcus was talking about,
we have a lot of infrastructure resources, as we talked about. And you know, in some ways,
those are abstracted, because they're managed services, which are a little simpler than running your own, you know, MySQL instance on
bare metal or something like that. But you can continue to build on top of that and package
those together. And Marcus alluded a little bit to a different kind of stack that we support as
well, which are application stacks. So a common example that we talk about, just because everyone's usually familiar with it,
is a WordPress instance.
So a WordPress blog, everyone's pretty much familiar with that.
And usually what it takes to do that is somewhere to run it.
So maybe a Kubernetes cluster, and then some sort of deployments into that cluster, which
have the container running in a pod or something like that.
And then some sort of database, MySQL for WordPress, that you need to provision as well
for that to talk to and store posts and comments and that sort of thing. And so what you can do
with crossplane is bundle that up into another sort of custom resource, which is a Kubernetes
concept, which basically allows you to extend the control plane. So all of these infrastructure resources we've talked about are deployed through custom resource
definitions, and then instances of those are the custom resources. So you could extend that to
have a WordPress custom resource definition that says, you know, I need these maybe lower level
concepts, as you were alluding to, to be able to run this application.
And, you know, someone can just deploy this WordPress instance resource, and it will take
care of deploying all those resources in an agnostic manner as well, meaning that it can be
deployed on GCP or AWS or Azure or any other cloud provider, even your on-prem solution, if so be it.
And so that helps someone who's at a higher level. We like to think about a separation of concerns in Crossplane, between someone who would be on a platform or operations team, who defines the available infrastructure, and someone on an applications team (or, for something like a WordPress instance, maybe a marketing team or something higher than that), who is able to deploy things in a consistent manner that their organization has deemed appropriate for their use case.
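A tiny sketch of that idea of an application-level resource composing lower-level claims; the names are hypothetical, and in practice this is expressed as Kubernetes custom resource definitions rather than Go structs.

package main

import "fmt"

// WordPressInstance is the higher-level, team-facing resource. Reconciling it
// would fan out into the lower-level pieces it needs, provisioned wherever
// the organization's resource classes point.
type WordPressInstance struct {
	Name            string
	ClusterClassRef string // where to run the containers
	DatabaseClass   string // which MySQL class backs it
}

// expand lists the building blocks the composite resource would create.
func expand(w WordPressInstance) []string {
	return []string{
		fmt.Sprintf("KubernetesCluster (class %s)", w.ClusterClassRef),
		fmt.Sprintf("MySQLInstance claim (class %s)", w.DatabaseClass),
		fmt.Sprintf("Deployment %s-wordpress bound to the DB connection secret", w.Name),
	}
}

func main() {
	for _, r := range expand(WordPressInstance{Name: "blog", ClusterClassRef: "gke-standard", DatabaseClass: "standard-mysql"}) {
		fmt.Println(r)
	}
}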
So I really like this concept. And one thing, again, on the top of my head, which I would
really like to know if it exists, is: you have Crossplane running in a Kubernetes cluster. Can that Crossplane instance stamp out other Kubernetes clusters,
which maybe have a couple of building blocks already pre-installed?
They're all the same. Does this functionality exist?
Yeah, so if you look at...
When you take a philosophy of treating everything as a resource in Kubernetes, then that allows you to do some interesting things where Kubernetes itself can be treated as just another type of resource.
So, you know, maybe you need a Postgres, maybe you need a Redis cache, but maybe you also need a Kubernetes cluster.
And so being able to dynamically provision, you know, on the fly, bring up a Kubernetes cluster with a certain configuration or certain applications or, you know, certain networking plugins, whatever it may need or policies,
whatever it may be, to be able to, you know, on demand, bring those up and get them as part of
your environment is a consistent experience like with any other type of resource. So I've heard
people many times kind of express how Kubernetes is a platform for platforms. And I think that we're
really starting to see that, that a lot of the base problems have been solved in Kubernetes of,
you know, a declarative API for configuration, active reconciliation controllers that are,
you know, level triggered, not edge triggered. There's all these different philosophies that
went into Kubernetes that have made this platform where we can start building higher level concepts
on top of it. And then the higher you go up the stack, the more opinionated you can become.
So you become more specific to certain use cases. But when you have these building blocks,
and you've got community effort around, you know, bringing them into something that's more useful
and higher up the stack with more functionality or easier to use, you know,
then you can end up with cases where I can just bring up Kubernetes itself and
start using that and treat clusters as cattle; a lot of things are trending towards cattle.
That's right, another trend there. And somebody used one this week, too, something described as cattle that I had never heard before, and I want to remember that and bring that back, because I think it was taking it a little too far. It was like, okay, not everything has to be cattle, but maybe I'm just not on board with it yet. So, new things from KubeCon this week that I still need to process.
Well, I did hear that kubectl, or however your preferred way of saying that word is, was pronounced kube-cattle this week, which is taking that to a whole other level, like cubed cattle.
Yeah, that's a good one. One thing which I don't know enough about and I'd like to know more about is Rook. Where does Rook fit in all of this?
Yes, and I'd be happy to take that one, since I've been working on Rook for just over three years now.
So I believe that where Rook really shines is its focus being on an orchestrator for storage.
If you think about the roots of the Rook project, when we started it more than three years ago, something that we saw, as Kubernetes was still in very early days, is that you would ask people that are using Kubernetes, oh, okay, so what are you doing for persistent storage?
And almost nobody had a good answer to that.
That was a very, very commonly unanswered question because they're just running stateless workloads in Kubernetes. And so we started seeing value of, okay, if we can use these primitives and these patterns that are in Kubernetes and these best practices that are starting to form
around how do you manage an application's lifecycle? How do you maintain reliability
of a distributed system? All these things, these problems were being solved and then being able to
build on top of that with, okay, let's do the same thing for storage. Let's reuse the Kubernetes best practices and patterns to stop relying on external storage or storage that's outside of
the cluster. Maybe it's in a NAS device or a SAN, or maybe a cloud provider's block storage service, or whatever it may be. But being able to bring those into the cluster
and orchestrate them to be able to take advantage
of the resources that are already in the cluster,
available hard drives or different classes of service,
a regular spinning platter disk or SSD or NVMe,
whatever it may be,
but being able to provide storage to applications
in a cloud-native type of way,
going to the full stack there.
And so that's something that we found that got a lot of traction pretty quickly.
And then, you know, it wasn't too long.
It was only a few early minor releases before we started getting production usage of it,
which was always very surprising because it was an alpha-level project
and we were very clear about this isn't intended to be used in production yet.
But we got production adoption pretty early on, right away,
which helped drive the maturity of the project as well.
Wow. Okay. Three years. That's a long time, right?
In the Kubernetes world, where Kubernetes itself has been around for roughly five years, three years is a really long time. Enough to mature, to get to a point where it solves a lot of real-world problems. That's great to hear. I'm wondering, and this is more of a personal interest, does it support LVM? Does Rook support LVM?
Yes, and that's an interesting question, because if you look at the design of the Rook project, it's basically separated into two distinct layers. One of the layers, which is the core functionality of Rook, is this orchestration layer, this management layer that will do the steps necessary to bring up the data layer that's underneath it, to get it running and do day-two operations to make sure it's healthy. And for the storage providers that Rook performs storage orchestration for within your Kubernetes cluster, it's up to that data path to know how to handle LVM
or any other type of storage fabrics and storage presentations
that you can find in a cluster.
So there are a number of storage providers inside of Rook that do work with LVM.
Okay, that's great. I really have to check that out.
Very, very interesting.
Okay, so just to go back to Marcus again, because it's something which is at the back of my mind,
is you mentioned support in Crossplane for AWS, GCP, and Azure, however you pronounce it.
What about the other providers? There's like so many more other providers, and Dan mentioned this, right?
Like any provider can be part of Crossplane.
What does the path for other providers look like that would like to be part of Crossplane?
Sure.
Well, we've stamped out the pattern by creating those stacks.
And in the process of creating those stacks, they were created initially, all of them within
the Crossplane project itself.
And it was interesting, even though it's all inside of one repository, the different providers
were implemented by different developers at different times, adopting different best practices,
what they thought was the best practice at the time, and eventually coalesced into one set of design patterns, which had been sort of the best of breed.
And around the same time, we decided to extract these, what we call stacks,
extract those providers, those stacks out of the Crossplane project
into their own stack repositories.
So, github.com slash crossplaneio slash stack-gcp, stack-azure (and I don't know if I'm pronouncing it correctly), and stack-aws.
And we have additional ones, Rook and Packet. And there's really an easy way to get that started for any other cloud provider interested in being able to provide their managed services through Crossplane and having that abstracted away. If you have a managed MySQL or a managed Postgres, then users can create a claim for a MySQL instance, and one day they're getting RDS, the next day they're getting GCP, the next day they're getting your service. Maybe in one namespace it's resolving to GCP for some production workload, and in another namespace it's reconciling to whoever's cloud-provider-managed MySQL. And again, not just
for MySQL, Postgres, all the
different types. And Packet
is a great example because
before Packet we didn't have the abstraction
for machines. But Packet provides their devices, where they... what, device is the name?
Yeah, it's essentially a bare metal offering that they provide via their cloud provider offering. And they came and wanted to have a stack, and we didn't have support for what we call a claim
for machine instances. So we wouldn't be able to dynamically provision those. So as part of the
core Crossplane project, we now had a stack that wanted to be able to dynamically provision and
integrate with Crossplane. So we were happy to work with them to add the machine instance claim
type that now allows an abstraction that can be used by
other providers as well, because obviously AWS and GCP, etc, have, you know, VMs like EC2,
and that sort of thing, they can also utilize that. So it's just another opportunity for portability.
Another thing to kind of build on what Marcus was saying is, besides just having those best
practices reflected in those stacks in our organization, we also have abstracted out to a library cross-plane runtime, which is kind of
based on the controller runtime project, which I'm sure a lot of listeners who have built controllers
are familiar with. So that's part of the Kubernetes organization. Essentially, what that does is it
gives you a interface for building controllers and running those in a Kubernetes cluster and some best practices for doing that.
While most of our stacks are using that, they're also doing other things, namely interacting with external APIs.
So there's certain patterns that are very common across stacks that do that. So we've been able to abstract those out into a library
and just say, you know, you just need to tell us
for this resource how you want to observe the resource,
create the resource, update the resource, and delete it,
and then provide us methods to do that.
And then the logic that's around that
and actually executing those things
can happen in the runtime library.
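As a rough illustration of the pattern Dan describes (the stack author supplies observe, create, update, and delete methods, and the surrounding logic lives in a shared runtime), here is a minimal Go sketch. It is not the real crossplane-runtime API; the interface, the reconcile helper, and the fake client below are simplified stand-ins for the shape of what a stack author would implement.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Observation is a simplified report of an external resource's current state.
type Observation struct {
	Exists   bool
	UpToDate bool
}

// ExternalClient is a hypothetical, stripped-down version of the kind of
// interface a stack author implements: how to observe, create, update, and
// delete one kind of external (cloud provider) resource.
type ExternalClient interface {
	Observe(ctx context.Context, name string) (Observation, error)
	Create(ctx context.Context, name string) error
	Update(ctx context.Context, name string) error
	Delete(ctx context.Context, name string) error
}

// reconcile is the shared logic a runtime library could provide: given the
// observation, decide which of the author's methods to call.
func reconcile(ctx context.Context, c ExternalClient, name string, deleted bool) error {
	obs, err := c.Observe(ctx, name)
	if err != nil {
		return fmt.Errorf("observe %s: %w", name, err)
	}
	switch {
	case deleted && obs.Exists:
		return c.Delete(ctx, name)
	case deleted:
		return nil // nothing to clean up
	case !obs.Exists:
		return c.Create(ctx, name)
	case !obs.UpToDate:
		return c.Update(ctx, name)
	default:
		return nil // already in the desired state
	}
}

// fakeBucketClient is a toy implementation standing in for a provider-specific client.
type fakeBucketClient struct{ created map[string]bool }

func (f *fakeBucketClient) Observe(_ context.Context, name string) (Observation, error) {
	return Observation{Exists: f.created[name], UpToDate: true}, nil
}
func (f *fakeBucketClient) Create(_ context.Context, name string) error {
	f.created[name] = true
	return nil
}
func (f *fakeBucketClient) Update(_ context.Context, name string) error { return nil }
func (f *fakeBucketClient) Delete(_ context.Context, name string) error {
	if !f.created[name] {
		return errors.New("not found")
	}
	delete(f.created, name)
	return nil
}

func main() {
	client := &fakeBucketClient{created: map[string]bool{}}
	// First pass creates the resource; second pass is a no-op.
	for i := 0; i < 2; i++ {
		if err := reconcile(context.Background(), client, "example-bucket", false); err != nil {
			fmt.Println("reconcile error:", err)
		}
	}
	fmt.Println("exists:", client.created["example-bucket"])
}
```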
So it really lowers the barrier to entry for people implementing new stacks, which I think is really valuable as
we see more and more community adoption. I think just today, we actually saw a cloud provider in
Europe announced that they were using Crossplane and had built a stack for that. And we had very
little input on that. We did a little bit of code review, but they were able to take that library
and some of the documentation we've written
and build their own stack,
largely isolated from any of the work
that the cross-plane community was doing.
And that was some really strong validation for us.
And I think that we'll start to see that happening
a lot more in the next weeks and months.
And it also gets back to the idea
of Kubernetes being a platform for
platforms. Kubernetes and its architecture has enabled Crossplane to now become a platform for
all these other different cloud providers or independent software vendors or whoever to
build their application and get more reach and scope of accessing more customer markets or more
segments or whatever for people to come and start using their software
in this open cloud sort of way
with portability and all these different features
that enable more people to access more software.
Yeah.
So we've heard a lot about AWS and GCP and Azure,
which would make people think that it's mostly about infrastructure, or infrastructure as a service.
Or services, which, again, are still tied to the infrastructure.
But I know that recently you have started,
maybe even finished, integration with GitLab.
So you can get the GitLab resource,
which is a completely different type of resource
that Crossplane enables. Can
someone tell me more about that? I'd be happy to talk
about that. That's something definitely
that I've spent a lot of time on recently.
And so, you know, as we started alluding to earlier, Dan
was talking about how you can create
a Crossplane stack that helps
you deploy your application such as WordPress.
And, you know, WordPress was a good place
to start because it's a fairly simple application.
It's just a container and MySQL
and then maybe a cluster to run that container on.
But then in the CubeCon Barcelona timeframe,
we put a significant effort
into being able to deploy GitLab itself.
And so if you look at the architectural components in GitLab,
they have a Helm chart.
And currently that's their main supported way, the one they started with, to deploy GitLab and everything that comes with it into Kubernetes.
And once you render that out, you know, it's on the order of like 50 different containers, 20 config maps, let's say, all these different resources, which speaks to a fairly complicated application set, right? But if you boil it down, what it really needs is a set of containers to run their
microservices, and then Postgres, Redis, object storage, and that's basically it. So, you know,
we being able to model that and then express in a very portable way that my application needs these
containers and these databases, et cetera,
and being able to deploy that to any cloud is a huge step forward
in being able to easily manage applications,
not just infrastructure, but higher-level applications,
such as GitLab, into new environments
that maybe they haven't been able to run in so far.
Yeah.
Hearing you talk about that made me think of something else, which may sound crazy.
I like that, right?
So I can imagine there being a need for a Crossplane that manages Crossplane, right?
Updates, right?
Because you'd have a Crossplane instance that keeps all these other Crossplane instances up to date, maybe, or the applications up to date. But maybe, I think, there will be something else which will keep the application up to date, because you have the bigger loops, which reconcile maybe less frequently.
And then you keep going in and in and in until you have some very quick loops,
which reconcile every five seconds, 10 seconds, or whatever.
Is this something that you've thought about
or did it come up before?
Yeah, that is not as crazy of an idea as you would think,
or maybe we're also crazy too,
but either way, it's a positive idea.
That's definitely true regardless.
We can go with that. That's fine.
But if you think about the architecture in general
in Kubernetes around controllers that are performing active reconciliation
I mean, it's a great pattern. It's an old pattern too, you know; it's commonly used in robotics, let's say, to run in a control loop and sit there, watch the actual state in the environment, compare that to the desired state, see what the delta is there, and take an operational step towards minimizing that delta between actual and desired. And so the exact example that you brought up, of a Crossplane to manage Crossplanes, that's entirely within the realm of reason. It's a set of controllers that can watch the environments and make changes to continue to drive them. So if there's a new update to Crossplane, you know, the single control plane could watch that, see that there's an update, and take the imperative steps within this controller's reconciliation loop to upgrade the application and get it to the newest version. But it's all just the operator pattern and controller patterns inside of Kubernetes. And you can use that to manage basically any resource. And so I think it's a good idea to be able to manage Crossplanes, because if you think about it, not everyone's going to want to run and manage their own Crossplane. And so I think that there's definitely value in being able to automate that and take some of that effort away from people, and let the controllers and the machines do that for you, so that you can have a Crossplane instance that's hosted for you as a service
and be able to get all the benefits out of your Crossplane
without having to manage it yourself.
Let the software do that for you.
And I think there's definitely value in that
that we see for sure.
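The control-loop idea described above (watch the actual state, compare it to the desired state, take a step that shrinks the delta, and go around again) is easy to show in isolation. The following is a deliberately toy Go sketch with nothing Crossplane- or Kubernetes-specific in it; a replica count stands in for whatever a real controller would manage.

```go
package main

import (
	"fmt"
	"time"
)

// step takes one corrective action toward the desired state and reports
// the new actual state. A real controller would call a cloud or cluster API here.
func step(actual, desired int) int {
	switch {
	case actual < desired:
		return actual + 1 // e.g. create one more replica
	case actual > desired:
		return actual - 1 // e.g. tear one replica down
	default:
		return actual
	}
}

func main() {
	desired := 5
	actual := 1

	// The reconciliation loop: observe, compare, act, and go around again.
	// Real controllers run forever and also react to watch events; this one
	// stops once the delta reaches zero, just to keep the example short.
	for actual != desired {
		fmt.Printf("actual=%d desired=%d -> taking a step\n", actual, desired)
		actual = step(actual, desired)
		time.Sleep(100 * time.Millisecond) // a stand-in for the loop interval
	}
	fmt.Printf("converged: actual=%d desired=%d\n", actual, desired)
}
```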
Okay.
So this, in my mind, set us on a path
that requires me to ask the next question,
which is what big things do you have on the horizon that you can share?
Yeah, I think scheduling is one area that we're looking forward to designing and approaching.
So when you have these Kubernetes application workloads,
the concept that was raised earlier of
bundling your application and its managed resources as a sort of single component,
you're going to need some sort of way to describe where to run that application. What cluster should
it be run on? Which managed service should it be using? So currently, the way that these abstract types, these MySQL instances, these Kubernetes clusters, resolve is through label selectors. So you've described a class, named that class, and set some options on that class. But right now you're referencing it by name.
And so an area that we'd like to figure out is how we can do that dynamically.
So scheduling it based on perhaps cost, perhaps based on the region, the locality,
the affinity to another workload.
There's all sorts of areas that we can really go into there.
Maybe the performance of a cluster or an application is sort of failing,
and so that could lead to an application being bound to another application in some sense.
So lots of layers of abstraction here and lots of fuzzy decision making that can
really provide a better application deployment experience.
And building on what Marques is saying there: if you take a look at what the scheduler does inside of a Kubernetes cluster, the in-cluster scheduler, its job is to know about the topology of the cluster, know about the resources that are available in the cluster, and then make the best decisions about where a pod should be scheduled to, where it should run, based on: is that node overloaded? Or do I need to evict some pods somewhere? Or does it match the particular hardware resources that are
available on a particular node? So then if you take that idea of Kubernetes as a control plane,
figuring out where pods should run across nodes in a cluster, and then go a higher level where
you have something like Crossplane, which is a control plane that's spanning across multiple
clouds, multiple clusters, on-premises environments.
But it's a higher level that is aware of the topology of all the resources that are available and then can make these smart scheduling decisions about where should an application run based
on whatever constraints it thinks is most important.
So this whole idea of scheduling that was done in cluster for Kubernetes can definitely
be raised up, like Marques was talking about, to make decisions more at a global scale. That's really cool. I'm really looking forward to what's going to come
out of this because it's super exciting. And I know that, you know, different providers and
different teams are tackling this in their own specific way. So whoever gets there first, or
even if it's like multiples, it'll be a great moment because it will open up other possibilities, right?
And it's all building blocks, next steps, next steps.
This is really, really exciting.
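Since the scheduling discussion above is about future design work, there is no real API to show yet. The following Go sketch is purely speculative and just illustrates the kind of constraint-based placement decision being described; the target list, cost figures, and constraint fields are all invented for the example.

```go
package main

import "fmt"

// Target is a hypothetical place a workload could be scheduled to:
// a cluster or managed service in some provider and region.
type Target struct {
	Name        string
	Provider    string
	Region      string
	CostPerHour float64
	Healthy     bool
}

// Constraints is an invented set of placement preferences for a workload.
type Constraints struct {
	Region  string  // required region; empty means "anywhere"
	MaxCost float64 // upper bound on hourly cost
}

// pick returns the cheapest healthy target that satisfies the constraints.
// A real scheduler would weigh many more signals (affinity, load, failures).
func pick(targets []Target, c Constraints) (Target, bool) {
	var best Target
	found := false
	for _, t := range targets {
		if !t.Healthy || t.CostPerHour > c.MaxCost {
			continue
		}
		if c.Region != "" && t.Region != c.Region {
			continue
		}
		if !found || t.CostPerHour < best.CostPerHour {
			best, found = t, true
		}
	}
	return best, found
}

func main() {
	targets := []Target{
		{Name: "gke-eu", Provider: "gcp", Region: "eu-west", CostPerHour: 0.32, Healthy: true},
		{Name: "eks-eu", Provider: "aws", Region: "eu-west", CostPerHour: 0.29, Healthy: true},
		{Name: "aks-us", Provider: "azure", Region: "us-east", CostPerHour: 0.25, Healthy: false},
	}
	if t, ok := pick(targets, Constraints{Region: "eu-west", MaxCost: 0.40}); ok {
		fmt.Printf("schedule workload to %s (%s, %s) at $%.2f/h\n", t.Name, t.Provider, t.Region, t.CostPerHour)
	} else {
		fmt.Println("no target satisfies the constraints")
	}
}
```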
So as we are approaching the end of this great discussion, which I'm sure we can continue, one thing which I'd like to mention is that the way I got to learn about Crossplane is via your YouTube live streams, the TBS, I believe. And Dan was the last one that I've seen, I think on the last stream, and it was great to see that in action. So can you tell us more about how that works, where the idea came from,
how it feels to be on the other side? Absolutely. So if anyone out there wants to go watch some
very low quality videos, I disagree. We do a live stream every two weeks. And that's something that
we got ramped up shortly after I joined Upbound.
And it's really just a time.
It's very informal.
And it's a time for us to talk about new things in the Crossplane community, new things in Kubernetes that are related.
And then also to do a lot of really live demoing.
And actually someone asked me today, you know, why don't you just record your demos and post them on there, and then you can make sure that everything goes smoothly and that sort of thing. And the
reason we don't do that is because we think there's a ton of value in messing up, right?
There's a lot of different configuration that can happen when you're provisioning things across
cloud providers on prem, lots of different services, lots of different plugins. There's a
lot of different ways you can mess up, which is not really a reflection of the system or even of your own ability; it's just complicated. And when you provision things and you run into issues and work through them, it shows people how to troubleshoot when they run into those same issues. It also adds a layer of humanity to it, I think, that allows people who are tuning in, especially live when they're dropping comments and that sort of thing, to be able to talk
about what their individual experiences are.
I like to say we've had some other people host as well on some episodes.
We actually recently had multiple people hosting a single episode; you might want to skip that one.
There were some technical difficulties.
I apologize.
I'm not a visual engineer.
But what I like to encourage people to do is, you know, talk about something they're
interested in outside of Crossplane.
So a lot of times I'll start a show by talking about the Utah Jazz, which is a basketball
team I really love.
And I'll encourage other people to do the same because, you know, when it comes down to it, the end users of Crossplane and the people that build Crossplane are going to have to be really closely integrated, right?
Because it is a platform that is going to inherently have to make some architectural decisions.
And we want to be best informed about how users want to use the platform so that we can build it to meet those specifications and then encourage them to come in and build parts of it as well. So I think just building that community and having fun and talking about, you know, you
can do all these things and we're excited about them and we'd like for you to come join
us on this journey.
I think that's really the purpose of TBS, which is The Binding Status, which is kind
of a play on, you know, claims binding to classes.
I think that's the purpose of the show.
And we had a couple of people come up and mention that they'd watched episodes, which I was astounded by, and I apologize for the time that they had wasted. But it was, personally and as an organization, really validating to say, you know what, people care about what's going on here, and they feel welcomed into the community by this style of communication.
So there's one big downside to this, from my perspective, which is that I enjoy watching the shows more than trying Crossplane out.
So the risk there is that I will continue watching all the Crossplane shows forever and never try Crossplane, because it's so exciting to watch
that I spend all the time watching rather than trying it out.
So that's one of the real risks of this.
Well, I think the solution to that is we just have to have you come on and host and then you'll be forced to try it out.
Oh man!
With hundreds of people watching!
Just put you in the hot seat.
Right, yeah.
A forcing function.
Yeah, that's actually a great idea, I have to say.
I don't know how I'll get out of that one.
Any last parting thoughts?
Well, it's really easy to try it out,
so you don't have an excuse.
You just helm install it.
And as long as you've got some cluster somewhere, install it in Kind or install it in k3s on your laptop. Docker on Mac includes a Kubernetes engine now.
So from there, you can helm install your Crossplane, and from there start
provisioning more clusters,
more managed resources, the
Kubernetes applications.
And another piece
I'd like to piggyback off
the idea of the videos is that we have a lot of documentation.
We've worked hard to update this documentation, both on how to build stacks and how to use Crossplane.
We've been updating it every version.
And we're trying to get more strict about making sure that our docs are updated with every release.
And we've been releasing the product faster and faster.
The last release was 0.5, and before that was our first minor patch, 0.4.1.
We've worked on our build pipeline so that we can get the updates out there quicker.
So with all of this, you have documentation to test it out with. And I'd like to say that, yes,
the video is probably one easy way to consume it. So for different people, different things are going to work. Whether it's reading the docs, whether it's installing the product and just
trying it out by hand, or whether it's watching us fumble at the kubectl command line. YAML is not the easiest thing to just grok at a distance.
Sometimes you need to watch somebody stumble over how to best describe it
or just read thoroughly what we've done or jump in the code.
Visit the GitHub project, star it.
That stuff is really useful to us.
Leave issues for any kind of ideas that you would like
to see Crossplane expand or delve into. And a closing thought on that, that I strongly
believe in is that I consistently see that some of the best feedback and ideas for a project come
from brand new users that have never seen it before. Because, you know, you could be, you know,
a project maintainer, let's say, and you're consistently living in that code base
and you know all the ins and outs and the idiosyncrasies of it.
And you kind of get, you know, a very specific, you know, myopic view on it almost.
But then you have a brand new person try it out for the first time with fresh eyes.
And they see something immediately that you've been completely blind to
for the past six months.
So some of the best feedback comes from brand new users.
So we are super open to new people trying it out and giving us their ideas because they're probably going to be good ideas
as well. Okay. So on that note, I really like that idea. How about we stop the interview now
and I can start trying some Crossplane stuff out for the first time. You can watch me
and tell me all the things that I'm doing wrong. I'd really like that. Or maybe you can tell us
what we've been doing wrong. Or that, yes.
This will get crazy.
I'm really looking forward to that.
Dan, thank you very much.
Marques, thank you very much.
Jared, thank you very much.
It was a pleasure having you.
I'm so excited that you were on the show and I'm looking forward
to what will happen next.
Thank you so much for having me.
Thank you.
It was a pleasure.
Yeah, we really love ChangeLog.
Love all the shows. Go Time.
Subscribe to the Master feed.
You get everything.
It's the best.
Thank you, Marques.
Thank you.
Thank you.
All right.
Thank you for tuning in to the ChangeLog.
You heard Marques.
Subscribe to the Master Feed.
It's our majestic monolith.
Get this show, brain science, founders talk, and everything we produce all in one place.
You've got nothing to lose.
Special thanks to our friends at the CNCF for making this series possible,
and to Gerhard Lazu for conducting these awesome interviews.
Our music is produced by The Beat Freak, Breakmaster Cylinder,
and we are sponsored by some amazing companies.
Support them, they support us.
You know Fastly, Rollbar, and Linode have our back.
Thanks to them.
Thanks for listening.
We'll talk to you in the next decade. Thank you.