Screaming in the Cloud - OpsLevel and The Need for a Developer Portal with Kenneth Rose
Episode Date: June 15, 2023
Kenneth Rose, CTO at OpsLevel, joins Corey on Screaming in the Cloud to discuss how OpsLevel is helping developer teams to scale effectively. Kenneth reveals what a developer portal is, how h...e thinks about the functionality of a developer portal, and the problems a developer portal solves for large developer teams. Corey and Kenneth discuss how to drive adoption of a developer portal, and Kenneth explains why it’s so necessary to have executive buy-in throughout that process. Kenneth also discusses how using their own portal internally along with seeking out customer feedback has allowed OpsLevel to make impactful innovations.
About Ken
Kenneth (Ken) Rose is the CTO and Co-Founder of OpsLevel. Ken has spent over 15 years scaling engineering teams as an early engineer at PagerDuty and Shopify. Having in-the-trenches experience has allowed Ken a unique perspective on how some of the best teams are built and scaled and lends this viewpoint to building products for OpsLevel, a service ownership platform built to turn chaos into consistency for engineering leaders.
Links Referenced:
OpsLevel: https://www.opslevel.com/
LinkedIn: https://www.linkedin.com/company/opslevel/
Twitter: https://twitter.com/OpsLevelHQ
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
About, oh, I don't know, two years ago and change,
I wound up writing a blog post titled
Developer Portals Are an Anti-Pattern,
and I haven't really spent a lot of time
thinking about them since.
This promoted guest episode is brought to us
by our friends at OpsLevel,
and they have sent their CTO and co-founder, Ken Rose,
presumably in an attempt to change my perspective
on these things.
Let's find out.
Ken, thank you for agreeing to, well, run the gauntlet,
for lack of a better term.
Hey, Corey, thanks again for having me. I've listened to your
show a bunch, and I'm really excited to be here today.
Let's begin with defining our terms. I'm curious
to know what a developer portal is. What would you say a developer portal means to you, like it's a
college entrance essay?
Right, definitely. So really,
a developer portal is this consolidated place for developers to come to, especially in large
organizations, to be able to get their jobs done more easily, right? A large challenge that
developers have in large organizations is that there's just a lot to do and a lot to take care of. So
a developer portal is a place for developers to be able to better own, manage, and run
the services that they're responsible for
that run in production.
And they can do that through easy access
to self-service tooling.
I guess on some level,
this turns into one of those alignment charts
of what is a database
and how prescriptive you want to be.
It's like, well, a senior engineer is
a database, because you can query them
and they have information.
Would you consider, for example, Kubernetes to be a developer platform, and/or the AWS console?
Yeah, that's actually an interesting question, right? So I think there's
actually two, and we're going to get really niggly here. There's developer platform and developer
portal, right? And the word portal for me is something that sits above a developer platform.
I don't know if you remember, like, the late nineties, early 2000s, when portals were all the rage. Yahoo and AltaVista were like search portals that
were trying to, at the time, consolidate all this information on a much smaller internet to make it
easy to access. A developer portal is sort of the same thing, but custom built for developers and
trying to consolidate a lot of the tooling that exists. Now, in terms of the AWS console, yeah,
maybe. It has a suite of tools and a suite of offerings. It doesn't do a lot on the,
well, how do I quickly find out what's running in production and who is responsible for it?
I don't know what AWS shipped, like the, you know, 300th new offering in the last week that I
haven't, you know, kept on top of. But, you know, there's definitely some spectrum in terms of what
goes into a developer portal. For me, there's kind of three main things you need. You do need some
kind of a catalog, like what's out there, who owns it. You need some kind of a way to measure like how good are those services,
how well built are they? And then you need some access to self-service tooling. And that last
part is where like the Kubernetes or AWS could be, you know, sort of a dev portal as well.
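The three ingredients described above (a catalog of what exists and who owns it, a way to measure how well built each service is, and self-service tooling) can be sketched as a toy data model. This is a minimal illustration under stated assumptions, not OpsLevel's actual data model or API: the `Service` and `Portal` classes, their field names, and the three health checks are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Service:
    """One catalog entry: what is running and who owns it (hypothetical fields)."""
    name: str
    owner: str
    has_ci: bool = False
    has_oncall: bool = False


@dataclass
class Portal:
    """Toy portal combining a catalog, a health measure, and one self-service action."""
    catalog: Dict[str, Service] = field(default_factory=dict)

    def register(self, service: Service) -> None:
        # Catalog: record the service under its name.
        self.catalog[service.name] = service

    def owner_of(self, name: str) -> str:
        # Catalog lookup: who is responsible for this service?
        return self.catalog[name].owner

    def health(self, name: str) -> float:
        # Measurement: fraction of (assumed) checks the service passes.
        svc = self.catalog[name]
        checks: List[bool] = [bool(svc.owner), svc.has_ci, svc.has_oncall]
        return sum(checks) / len(checks)

    def create_service(self, name: str, owner: str) -> Service:
        # Self-service: a new service starts with the standard setup already in place.
        svc = Service(name=name, owner=owner, has_ci=True, has_oncall=True)
        self.register(svc)
        return svc


portal = Portal()
portal.register(Service(name="legacy-billing", owner="payments"))
portal.create_service("checkout", owner="storefront")
```

In this sketch an existing service registered by hand passes only the ownership check, while one created through the self-service action passes all three, which is the gap a portal is meant to make visible.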
My experience with developer portals: there was a time when I loved them. RightScale was what I used at some depth back in, I want to say, 2010, 2011,
because the EC2 console was clearly not built or designed for anyone who had not built EC2
themselves with their bare hands and the sweat of their brow. And in time, the EC2 console got
better, where it wasn't written in hieroglyphics, as best we could tell, and it became click-button-to-launch-instance.
And RightScale really didn't have a second act,
and they wound up getting acquired by our friends over at Flexera years later,
and I haven't seen their developer portal in at least eight years
as a direct result of this.
So the problem, at least when I was viewing it purely in the context of AWS services, it feels like you are competing against AWS iterating forward on developer experience,
which they iterate slowly sometimes and unevenly across their breadth of services.
But it does feel like, at some level, by building an internal portal,
you are, first, trying to out-innovate AWS in some ways, and two, you are inherently making the trade-off of not using recent features and enhancements that have not themselves been incorporated into the portal.
That's where, I guess, the start, the genesis of my opposition to the developer portal approach comes from.
Is that philosophy valid these days? Not as much, because I can see an argument for it shifting.
Yeah, I think it's slightly different.
I think of a developer portal as, again, it's something that sort of sits on top of AWS
or Google Cloud or whatever cloud provider you use, right?
You gave the example of RightScale and EC2.
So provisioning instances is one part of the activity you have to do as a developer.
Now, in most modern organizations, you have like your product developers that ship features.
They don't actually care about provisioning instances themselves.
There's another group called the platform engineers or platform group that are responsible for building automation and tooling to help spin up instances and create CI/CD pipelines and get
everything you need set up. And they might use AWS under the covers to do that. But the automation
built on top and making that accessible to developers, that's really what a developer portal can provide.
In addition, it also provides links to operational tooling that you need, technical documentation.
It's everything you need as a developer to do your job in one place.
And though AWS bills itself as that, I think of them as more, they have a lot of platform offerings, right?
They have a lot of infra offerings, but they still haven't been able to, I think, customize that. Unless you're an organization that has kind of
gone all in on AWS and doesn't build any of your own tooling, that's where a developer portal helps.
It really helps by consolidating all that information in one place, by making that
information discoverable for every developer. So they have less cognitive load, right?
We've asked developers to kind of do too much. We've asked them to shift left, and, well, how do we make that information more accessible?
Regarding the point of, you know,
AWS adds new features or new capabilities all the time. And like, well, you have this dev portal,
that's sort of your interface for how to get things done. Like, how do you use those? Dev
portal doesn't stop you from doing that, right? So my mental model is if I'm a developer and I
want to spin up a new service, I can just press a button inside of my dev portal in my company and do that. And I have a service that is built according to the
latest standards. It has a CI/CD pipeline. It's already registered in PagerDuty,
it's registered in Datadog, it has all the various bits. And if then there's something
else that I want to do that isn't really on the golden path, because maybe this is some new
service or some experiment, nothing stops us from doing that. Like you still can use all those tools
from AWS, you know, kind of raw.
And if those prove to be valuable
for the rest of the organization, great.
They can make their way into the dev portal.
They can actually become a source of leverage.
But if they're not,
then they can also just sit there on the vine.
Like not everything that AWS ever produces
will be used by every company.
Many years ago, I got a pair of Cisco certifications
because recession was hitting
and I needed to do better at networking.
And taking those certifications
in those days before Cisco became
the sad corporate dragon with no friends
that we all know today,
they were highly germane and relevant.
But I distinctly remember even now,
15 years later,
that there was this entire philosophy
of pretend that the entire world is Cisco only, which in networking is
absolutely never true. It feels like a lot of the AWS designs and patterns tend to assume,
oh, you're going to use AWS services for everything. I have never yet found that to
be true other than when I'm just trying to be obstinate. And hell is interoperability
between a bunch of different things. Yes, I may want to spin up an EC2 instance and an AWS load balancer and some S3 storage or whatnot, but I'm also going to want to monitor it with PagerDuty.
I'm going to want to have a CDN that isn't CloudFront because most CDNs these days don't hate you in quite the same economic ways and are simpler to work with, et cetera, et cetera. So there's definitely a story wherein I've found that
the interoperability of tying these things together is helpful. How do you avoid falling
down the trap of, oh, everyone should be multi-cloud, single pane of glass, et cetera,
et cetera, when in practice that always seems to turn to custard?
Yeah, I think multi-cloud and single
pane of glass are actually two different things. So multi-cloud, like I agree with you in some sense, like pick a cloud and go
with it, unless you have really good business reasons to go multi-cloud. And sometimes you do.
Like years ago, I worked at PagerDuty. They were multi-cloud for a reliability reason: hey,
if one cloud provider goes down, you don't want to...
They were an example I used all the time for
that story. Specifically, the thing that woke you up was homed in a bunch of different places. Whereas the marketing site, the onboarding flow, the periphery stuff around it
was not because it didn't need to be. The core business need of wake you up was very much
multi-cloud because once upon a time it wasn't, and it went down with the rest of us-east-1,
and people weren't woken up to be told their site was on fire.
100%. And on the application side, even then, pick a cloud and go with it, unless there's a really compelling business reason for your business to go multi-cloud. Maybe there's something,
credits or compliance or availability, right? There might be reasons, but you have to be
articulate about whether they're right for you. Now, single pane of glass, I think that's different,
right? I do think that's something that ultimately is a net boon for developers.
There, in any large organization,
there is a myriad of internal tools
that have been built.
And it's like, well,
how do I provision a new topic
in the Kafka cluster?
How do I actually get access
to the AWS console?
How do I spin up a new service?
How do I kind of do these things?
And if I'm a developer,
I just want to ship features.
Like that's what I'm incented to do.
That's what I'm optimizing for.
And all this other stuff, I have to do as part of my job,
but I don't want to have to become like a Kubernetes guru
to be able to do it, right?
So what a developer portal is trying to do
is be that single pane of glass,
bringing this common set of tools
and responsibilities that you have as a developer
in one place.
They're easy to search for, they're easy to find,
they're easy to query, they're easy to use.
I should probably have asked this earlier on, but let's disambiguate for a little bit here.
Because when I'm setting up to use a new service or product and kick the tires on it,
no two explorations really look the same.
Whereas at most responsible, mature companies that are building products
and services that are going into production use,
they've standardized around
a number of different approaches.
What does your target customer look like?
Is there a certain point of scale,
a certain level of complexity,
a certain maturity of process?
Absolutely.
So a tool like OpsLevel or a developer portal
really only makes sense
when you hit some critical mass
in terms of the number of services you have running in production or the number of developers that you have.
So when you hit 20, 30, 50 developers or 20, 30, 50 services, an important part of a developer portal is this catalog of what's out there.
Once you kind of hit the Dunbar number of services, like when you have more than you can keep in your head, that's when you start to need tooling like this.
If you look at our customer base, they're all, you know, kind of medium
to large size companies.
If you're a startup with like 10 people,
OpsLevel is probably not right for you.
We use OpsLevel internally at OpsLevel.
And you know, like we're still a small company.
It's like, we make it work for us
because we know how to get the most out of it.
But like, it's not the perfect fit
because it's not really meant for, you know,
smaller companies.
Oh, I hear you.
I think I probably
have a better AWS bill analytics
system running internally here at the Duckbill Group than some banks do. So I hear you on that
front. But it also implies to me that there's no OpsLevel prospect or customer deployment that
has ever been greenfield. It's always, you're building on existing things. There's already
infrastructure in place. Vendors have been selected across the board.
No one is starting a company and saying on day one, all right, time to spin up our AWS account, and we're also going to wind up signing up for OpsLevel, from the sound of it.
Accurate? Inaccurate?
I think that's actually accurate. Like a lot of the problems we solve
are the problems that come as you start to scale both your product and your engineering team.
And it's the problem...
What do those painful problems look like? In other words, what is someone sitting at home
right now listening to this or driving to work,
debating whether they want to ram a bridge abutment
or go into the office,
depending on their mental state today.
What painful problem do they have
that OpsLevel is designed to fix?
Yeah, for sure.
So let's help people self-select.
So here's my mental model for any eng org.
There are product developers, platform developers, and engineering leaders.
Product developers, if you're asking questions like, I just got paged for this service.
I don't know what this does.
Or what's upstream from here?
Where do I find the technical documentation?
Or I think I have to do something with the payment service.
Where do I find the API for that?
When you get to that scale, a developer portal can help you.
If you're a platform engineer and you have questions like, okay, we've got to migrate. We're migrating, I don't know, from Datadog to Honeycomb, right? We've got
to get these 50 or a hundred or thousands of services and all these different owners to, like,
switch to some new tool. Or, hey, we've done all this work to ship the golden path. Like how do
we actually measure the adoption of all this work that we're doing? And if it's actually valuable,
right? Like we want everybody to be on a certain set of CI tooling or a certain minimum version of some library or framework. How do we do that? How do
we measure that? OpsLevel is for you, right? We have a whole bunch of stuff around maturity.
And if you're an engineering leader, ultimately the questions you care about are like,
how fast are my developers working? I have this massive team. We've made this massive investment
in hiring all these humans to write software and bring value for our customers. How can we be more
efficient as a business in terms of that value delivery? And that's where OpsLevel can help as well.
Guardrails, whether they be economic, regulatory, or otherwise, have to make doing things correctly easier than doing
things incorrectly. Because one of the miracle aspects of cloud also turns into a bit of a
problem, which is shadow IT is only ever a corporate credit card away. Make it too difficult to comply with corporate policies,
and people won't.
And they're good actors.
They're trying to get work done.
They're not trying to make people's lives harder,
but they don't want to spend six weeks
provisioning an EC2 cluster.
So there's always that weird trade-off.
Now, it feels, and please correct me if I'm wrong,
once someone has rolled out OpsLevel at
their organization, where it really shines is spinning up a new service where, okay, great,
you're going to spin up the automatic observability portion of it. You're going to spin up the
underlying infrastructure in certain ways that comply with our policies. It's going to build
CI/CD pipelines around it. You're going to wind up having the various cost instrumentation rolled out to it.
But for services that are already extant
within the environment,
is there an OpsLevel story for them?
Oh, absolutely.
So I look at it as like,
the first problem OpsLevel helps solve
is the cataloging problem.
What's out there and who owns it?
So not even getting developers to spin up new services
that are kind of on the golden path,
but just understanding the taxonomy of what are the services we have? How do those services
compose into higher level things like systems or domains? What's the whole set of infrastructure
we have? Like I have 58 AWS accounts, maybe I have a handful of GCP ones also, some Azure.
I have all this infrastructure that like, how do I start to get a handle on like what's out there
in prod and who's responsible for it? And that helps you get in front of compliance risks, security risks. That's really the starting point for OpsLevel: building
that catalog. And we have a bunch of integrations that kind of slurp all this data to automatically
assemble that catalog, or YAML as well, if that's your thing. But that's the starting point:
building that catalog and figuring out this assignment of like, okay, this service and this
human or this certain team, they're paired together.
A number of offerings in this space, which honestly, my exposure to it is bounded simultaneously to
things that are 10 years old and no one uses anymore, or a bunch of things I found on GitHub.
And the challenge that both of those products tend to have is that they assume certain things
to be true about a given environment,
that they're using Terraform to manage everything,
or they're always going to be using CloudFormation,
or everyone there knows Python,
or something else like that.
What are the prerequisites to get started with OpsLevel?
Yeah, so we work pretty hard
to build just a ton of integrations.
I would say integrations are just this continuing thing we have going on in the background.
Like when we started, like we only supported GitHub.
Now we support all the Gits, you know, like GitHub, GitLab, Bitbucket, Azure DevOps.
I think we're building Gitea.
There's just a whole like long tail of integrations.
The same with APM tooling, the same with vulnerability management tooling, right?
And the reason we do that is because there's just this huge vendor footprint and people want OpsLevel to work for them.
Now, the other thing we try to do is we also build APIs.
So anything we have as a core integration, we also have an underlying API for.
So that no matter what, you have an escape hatch.
If you're using some tool that we don't support or you have some homegrown thing, there's always a way to try to be able to integrate that into OpsLevel.
When people think about developer portals, the most common one that pops to mind is Backstage, which Spotify wound up building internally, championing, open sourcing,
and, I believe, on some level turning into a product, because if there's one thing people want, it's to have their podcast music company become a SaaS vendor, which is weird to me.
But the criticisms that I've seen about it across the board have all rung relatively true, including from people internal at Spotify who have used the thing, which is the first is underestimating the amount of effort that is necessary to maintain Backstage itself.
That the build versus buy discussion is always harder.
Engineers love to build,
but they shouldn't be building things
outside of their core competency half the time.
And the other is driving adoption within the org,
where you can have the most amazing developer portal
in the known universe,
but if people don't use it,
it may as well not exist.
And doing the carrot and stick approach
often doesn't work.
I think you have a pretty good answer
that I don't even need to ask you to elaborate on for,
well, how do we avoid having to maintain this ourselves,
since you have a company that does this.
But how do you find companies are driving adoption
successfully once they have deployed OpsLevel?
Yeah, that's a great question.
So absolutely, like, I think the biggest thing you need first
is kind of cultural buy-in,
that this is a tool that we want to invest in, right?
I think one of the reasons Spotify was successful with Backstage, and I think it was System Z before that, was that they had this kind of flywheel of like, they saw that their developers were getting, you know, better, faster, working, happier by using this type of tooling, by reducing the cognitive load.
The way that we approach it is sort of similar, right?
We want to make sure that there is executive buy-in that like everybody agrees,
this is like a problem that's worth solving. The first step we do is trying to build out that
catalog again and helping assign ownership. And that helps people understand like, Hey,
these are the services I'm responsible for. Oh, look, and now here's this other context that I
didn't have before. And then helping organizations, you know, what, it depends on the problem you're
trying to solve, but whether that's rolling out self-serve automation to help developers reduce what was before a ton of cognitive load, or if it's helping platform teams define what good looks like so they can start to level up the overall health of what's running in production.
We kind of work on different problems, but it's picking one problem and then kind of working with the customers and driving it forward.
On some level, I think that this is going to be looked down upon inherently just by automatic
reflex of folks with infrastructure engineering backgrounds. It's taken me some time to learn to
overcome my own negative reaction to it because it's, I'm here to build things and I want to
build things out in such a way that it's
portable and reusable without having to be tied to a particular vendor and move on. And it took me a
long time to realize that what that instinct was whispering in my ear was in fact, no, you should
be your own cloud provider. If that's really what I want to do, I probably should just brush up on,
you know, computer science trivia from 20 years
ago and then go see if I can pass Google's SRE interview. I'm not here to build the things that
just provision infrastructure from scratch at every company I wind up landing at. It feels like
there's more important, impactful work that I can do. And let's be clear, people are never going to
follow guardrails themselves when they have to do a bunch of manual steps.
It has to be something that is done for them.
And I don't know how you necessarily get there
without having some form of blueprint
or something like that provided for them
with something that is self-service
because otherwise it's not going to work.
I 100% agree, by the way, Corey,
like the take that like automation
is the only way to drive a lot of this forward is true, right?
If for every single thing you're trying, like we have a concept called
a rubric and it's basically how you measure the service health and you can, it's very customizable.
You have different dimensions, but if for any check that's on your rubric, it requires manual
effort from all your developers. That is going to be harder than something you can just automate
away. So vulnerability management is a great example. If you tell developers, hey, you have
to go upgrade this library, okay, some percent of them will.
If you give developers, here's a pull request that's already been done and has a test passing
and now you just need to merge it,
you're going to have a much better adoption rate with that.
Similarly with like applying templates
and being able to up-level,
you know, kind of apply the latest version of a template
to an existing service.
Those types of capabilities,
anything where you can automate what the fixes are,
absolutely you're going to get better adoption.
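The rubric idea mentioned above, where checks roll up into a measure of service health, can be sketched roughly as follows. This is a minimal illustration under stated assumptions and not OpsLevel's implementation: the level names, the specific checks, the dictionary keys, and the rule that a service's level is the highest one whose checks all pass (with every lower level also passing) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A service is described by a plain dictionary of facts (hypothetical keys).
ServiceFacts = Dict[str, object]
Check = Callable[[ServiceFacts], bool]

# Assumed rubric: ordered from lowest level to highest, each with its own checks.
RUBRIC: List[Tuple[str, List[Check]]] = [
    ("bronze", [lambda s: bool(s.get("owner"))]),
    ("silver", [lambda s: s.get("has_ci") is True]),
    # A missing vulnerability count is treated as failing, not passing.
    ("gold", [lambda s: s.get("open_vulnerabilities", 1) == 0]),
]


def level_for(service: ServiceFacts,
              rubric: List[Tuple[str, List[Check]]] = RUBRIC) -> str:
    """Return the highest level reached without failing any lower level."""
    achieved = "none"
    for level, checks in rubric:
        if all(check(service) for check in checks):
            achieved = level
        else:
            # Stop at the first failing level; higher levels do not count.
            break
    return achieved
```

For example, a service with an owner and CI but two open vulnerabilities lands at the assumed "silver" level, and merging an automated pull request that clears the vulnerabilities is what would move it up.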
As you take a look at your existing reference customers,
which is something I always look for on vendor websites,
because like, oh, we have many customers
who will absolutely not admit to being customers.
It's like, that sounds like something that's easy to say.
You have actual names tied to these things,
not just companies, but also individuals.
If you were to sit down and ask your existing
customer base, so why did you wind up implementing OpsLevel? And what has the value that's delivered
to you been since that implementation? What do they say?
Definitely. I actually had to check
our website because we land new customers and put new logos on it. I was like, oh, I wonder what the
current set is.
I have the exact same challenge. Oh, we have some mutual customers.
And it's okay.
I don't know if I can mention them by name because I haven't checked our own list
of testimonial rights lately
because say the wrong thing
and that's how you wind up being sued
and not having a company anymore.
Yeah, so I definitely want to stay on side on that part.
But in terms of like kind of sample reference customer,
a lot of the folks that we initially work with
are the platform teams, right? They're the teams that care about what's out there and they need to know who's responsible
for it because they're trying to drive some kind of cross-cutting change across the entire, you
know, production footprint. And so the first thing that generally people will say is, and I love this
quote, this came, I won't name them, but like it's in one of our case studies. It was like,
I had like 50 different attempts at making a spreadsheet and they're all like in the graveyard,
like to be able to capture what's out there and who's responsible for it.
And just OpsLevel helping automate that has been one of the biggest values that they've gotten.
The second point then is now being able to drive maturity and be able to measure how well those services are being built.
And again, it's sort of this interesting thing where we start with the platform teams and then sometime later security teams find out about OpsLevel and they're like, oh, this is a tool I can use to get developers to do stuff.
I've been trying to get developers to do stuff for the longest time.
And I filed Jira tickets and they just sit there and nothing gets done.
But when it becomes part of this overall health score that you're trying to increase across the board, yeah, it's just a way to kind of drive action.
I think that there's a dichotomy of companies that emerge.
And I tend to see the world through a lens of AWS bills.
So let's go down that path.
I feel like there are some companies,
presumably like OpsLevel,
where if I, assuming you're running on top of AWS,
if I were to pull your AWS bill,
I would see upwards of 80% of your spend
is going to be on this application called OpsLevel,
the service that you provide to people,
as opposed to the other side of the world,
which is large enterprises
where they're spending hundreds of millions of dollars a year,
but the largest application they have
is a million and a half a year in spend.
It's just that they have thousands of these things
scattered everywhere.
That latter case is where I tend to see more platform teams
where I start to see a lot of managing a whole bunch of relatively small workloads. And developer platforms really seem to be where a lot of the solutions lead. Whereas when 80% of our workload is one application, we don't feel the need for that as much. Is that accurate? Am I misunderstanding some aspect of it?
No, 100%. You hit the nail on the head. Like, okay, think about the typical, like, microservices adoption journey. Like, you started with, you know,
some small company like us. You started with a monolith. Then you read on Hacker News and
realize, oh, if we want to hire people, we've got to be doing what all the cool kids are up to.
Right. We've got to microservice all the things. But that's actually, you know, microservices
should come later, right? As a response to, you need to scale your org and scale your...
As someone who started
building some applications with microservices, I could not agree more.
100%. So it's as you're
starting to take those steps to having just more moving parts in your production infrastructure,
right? If you have one moving part, unless it's like a really large moving part that you can
internally break down, like kind of this majestic monolith where you do have kind of like individual
domains that are owned by different teams. But really the problem we're trying to solve, it's
more about like who owns what. Now, if that's a single atomic unit,
great, but can you decompose that? But if you just have like one small application and kind of like
the whole team is owning everything, again, a developer portal is probably not the right tool
for you. It really is a tool that you need as you start to scale your engineering org, and as you start
to scale the number of moving parts in your production infrastructure.
I used to tend to think of that in terms of boring companies versus innovative ones,
and I don't think that's accurate.
I think it is the question of maturity and where companies lead to. On some level, if OpsLevel
starts growing and becomes larger and larger in different ways and starts doing
acquisitions and launching into other areas.
At some point, you don't have just one product offering.
You have a multitude of them,
at which point having something like that
is going to be critical.
But I have to ask,
given that you are sort of not exactly
your target customer profile,
what have the sharp edges been
on using it for your use case?
Yeah, so we actually have an internal Slack channel
we call OpsLevel on OpsLevel.
And finding those sharp edges actually has been really useful for us. You know,
all the good stuff, dogfooding, and it makes your own product better. Okay. So we have our main app.
We also do have a bunch of smaller things that it's like, oh yeah, we need, you know, we have like, I don't know, various hack day things that go on. It's important. We kind of wind those down
for, you know, compliance. We have our marketing site. We have, like, Terraform. So there's like
stuff. It's not like hundreds or thousands of things, but there's more than just the main app. The second though,
it's really on the maturity piece that we really try to get a lot of value out of our own product,
right? Helping, we have our own platform team. They're also trying to drive certain initiatives
with our product developers. There is that usual tension of our product, like our own product
developers are like, I want to ship features. What's the security thing I have to go take care
of right now?
But OpsLevel itself helps reflect that.
We had an operational review today and it was like, oh, this one service is actually now, we have platinum as a level.
It's in gold instead of platinum.
It's like, why?
Oh, there's this thing that came up.
We got to go fix that.
Great.
Let's actually go fix that.
So we're back into platinum.
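The gold-versus-platinum exchange above can be illustrated with a small sketch. This is not OpsLevel's implementation: the `Check` type, the `LEVELS` list (only "gold" and "platinum" are named in the conversation), and the rule that a service holds the highest level whose checks, and all lower levels' checks, pass are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical level names, ordered lowest to highest. Only "gold" and
# "platinum" come from the conversation; the rest are placeholders.
LEVELS = ["bronze", "silver", "gold", "platinum"]


@dataclass(frozen=True)
class Check:
    name: str
    level: str      # the level this check is required for
    passing: bool


def service_level(checks: list[Check]) -> str:
    """Return the highest level whose checks, and all lower levels' checks, pass."""
    achieved = "none"
    for level in LEVELS:
        required = [c for c in checks if c.level == level]
        if not all(c.passing for c in required):
            break  # a failing check caps the service below this level
        achieved = level
    return achieved


def failing_checks(checks: list[Check]) -> list[str]:
    """Names of the checks to fix, as you might list in an operational review."""
    return [c.name for c in checks if not c.passing]
```

Under these assumptions, a single failing platinum-level check drops a service to gold, and fixing it brings the service back to platinum, which mirrors the operational review described above.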
Do you find that there's often a choice you have to make internally where you could make the product more effective for your specific use case, but that also diverges from where your typical customer needs or wants the product to go?
No, I think a lot of the things we find for our use case are like, they're more small paper cuts, right?
That just as we're using, it's like, hey, like, as I'm using this, I want to see the report for this particular check.
Why do I have to click six times to get, you know,
like, wouldn't it be great if we had a button, right?
And so it's those type of like small innovations that kind of come up and those ultimately lead to,
you know, a better product for our customers.
We also work really closely with our customers
and, you know, developers are not shy about telling you
what they don't like about your product.
And I say this with love, like a lot of our customers
give us phenomenal feedback
just on how our product can be made better.
And we try to internalize that and roll that feedback into the product.
You have a number of integrations of different SaaS providers, infrastructure providers, etc.
that you wind up working with.
I imagine that given your scale and scope and whatnot, those offerings are dictated by what customers say,
Hey, we're using this thing. Are you
going to support that or are you not going to maintain our business? Which is a great way to
wind up financing a lot of product development and figuring out what matters to people.
My question for you is, if you look across the totality of your user base,
what are the most popularly used integrations, if you can say?
Yeah, for sure. I think right now I can actually dive in
to pull the numbers.
GitHub and GitLab are,
I think GitHub has slightly more adoption
across our customer base.
At least with our customers,
almost nobody uses Bitbucket.
I mean, we have a small number,
but it's, I think, single digit percentage.
A lot of people use PagerDuty,
which, hey, I'm an ex-PagerDuty person,
ex-Dutonian, I'm glad to see that.
I have a free-tier PagerDuty account
that will automatically page me
from my home automation stuff,
specifically if, you know, the fire alarm goes off.
Like, yeah, okay, there are certain things
I want to be woken up for, but it's a very short list.
Yeah, it's funny, the running default message
when we used to test PagerDuty was "the server's on fire,"
but in your case, it'd be like, the house is on fire.
Like, you know, go get that taken care of. There's one other tool also that's used a lot.
Datadog actually is used a ton across our entire customer base. We're also a
Datadog partner, we're a Datadog customer. You know, it's not cheap, but it's a good product
for, you know, monitoring logs in their opinion. No, other than cloud infrastructure providers,
the number one most common source of inquiries I get is Datadog optimization. It has now risen to a board-level concern in many cases because observability is
expensive. That's a sign of success on some level. Meanwhile, I'm sitting here like, date a dog? Oh
my god, that's disgusting. It's like Tinder for pets, which it turns out is not at all what they
do. Nice. Yeah. As far as infrastructure providers, is that something that people
wrap around on day one, or does that tend to be a later-in-time approach? Are they first in production. You know, if you have multiple AWS accounts,
multiple Kubernetes clusters,
dozens or even hundreds of teams,
God help you if you're going to try
to build a list manually
to consolidate all that information.
That's really the first part
is integrate Kubernetes,
integrate your CI/CD pipelines,
integrate Git,
integrate your cloud account.
Like we'll integrate with everything
and we'll try to build that map of like,
here's everything that's out there
and start to try to assign it.
And here's people that we think might be responsible in terms of owning
the software. That's generally the starting point. Which makes an awesome amount of sense. I think
that going at it from the infrastructure first perspective is where I've seen most developer
platforms founder. And to be fair, the job is easier now than it was years ago, because it used
to be that you're being out-innovated by AWS constantly.
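The map-building step Ken describes just before this (integrate Kubernetes, CI/CD, Git, and cloud accounts, then guess at owners) might be sketched roughly as follows. The record shape (`source`, `service`, `team`) and the majority-vote ownership guess are hypothetical; nothing in the conversation specifies how OpsLevel actually infers ownership.

```python
from collections import Counter, defaultdict


def build_catalog(records):
    """Merge per-source records into one catalog entry per service.

    Each record is a dict with hypothetical keys: "source" (which integration
    reported it), "service" (a name), and optionally "team" (a team that the
    source associates with the service).
    """
    merged = defaultdict(lambda: {"sources": set(), "owner_votes": Counter()})
    for rec in records:
        entry = merged[rec["service"]]
        entry["sources"].add(rec["source"])
        if rec.get("team"):
            entry["owner_votes"][rec["team"]] += 1

    catalog = {}
    for name, entry in merged.items():
        votes = entry["owner_votes"]
        # Assumption: the team seen most often is only a *suggested* owner;
        # a human would still confirm it.
        likely_owner = votes.most_common(1)[0][0] if votes else None
        catalog[name] = {
            "sources": sorted(entry["sources"]),
            "likely_owner": likely_owner,
        }
    return catalog
```

Services that no source attributes to a team come out with `likely_owner` set to `None`, which is the "who owns what" gap the catalog is meant to surface.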
Innovation has slowed down there, and you know that because of how much they say the pace of
innovation has only sped up. And whenever AWS says something in a marketing context,
they're insecure about it. I've learned this through the fullness of time observing that
company. And these days, most customers do not use the majority of features available for any
given service. They have solidified to a point where you can responsibly build on top of these things. Now it seems that the problem is all the
"yes, and" stuff that gets built on top. Yeah. Do you have an example, actually,
like one of the "yes, and" tools that you're thinking about?
Oh, absolutely. We have a bunch of AWS environment stuff, so we should configure CloudWatch to look
at all these things from an observability perspective. No, you should not.
You should set up Datadog.
And the first time someone does that by hand, they enable all of the observability and the rest and suddenly get charged approximately the GDP of Guam.
And okay, maybe we shouldn't do that because then you have the downstream impact of that on your CloudWatch bill.
So, okay, how do we optimize this for the observability piece directly tied to that? How do we make sure that we get woken up when the site is down or preferably before that,
but not every time basically an EBS volume starts to get a little bit toasty?
You have to start dialing this stuff in.
And once you've found a lot of those aspects,
being able to templatize that and roll that out
on an ongoing basis
and having the integrations all work together
feels like it's the right problem to be solving.
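The "dial it in once, then templatize it and roll it out" idea above could look something like this sketch. The signal names, thresholds, and the `page` versus `ticket` actions are invented for illustration and are not real CloudWatch or Datadog metric names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    service: str
    signal: str
    threshold: float
    action: str  # "page" wakes someone up; "ticket" waits for business hours


# Hypothetical template of (signal, threshold, action). Values are illustrative.
TEMPLATE = [
    ("http_5xx_rate", 0.05, "page"),        # the site is effectively down
    ("p99_latency_seconds", 2.0, "page"),   # trouble is brewing before an outage
    ("ebs_volume_busy", 0.90, "ticket"),    # a warm volume is not worth a 3 a.m. page
]


def render_rules(services, template=TEMPLATE, overrides=None):
    """Apply one alerting template to every service, with per-service overrides.

    `overrides` maps (service, signal) to a replacement threshold.
    """
    overrides = overrides or {}
    rules = []
    for service in services:
        for signal, default_threshold, action in template:
            threshold = overrides.get((service, signal), default_threshold)
            rules.append(AlertRule(service, signal, threshold, action))
    return rules
```

Keeping the template in one place means a change to what counts as page-worthy is made once and re-rendered for every service, rather than edited by hand per team.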
Yeah, absolutely.
And the group that I think is responsible
for that kind of,
because it's a set of problems you described,
is really like platform teams.
Sometimes service owners are like,
how should we get paged?
But really what you're describing
are these kind of cross-cutting engineering concerns
that platform teams are uniquely poised
to help solve in an engineering organization, right?
I was thinking about what you said earlier.
Nobody just wants to rebuild the same infra over and over,
but it's sort of like,
it's not just building the infra.
It's kind of like solving this.
How do we ship?
How do we actually run stuff in prod?
And not just run it,
but get observability
and ensure that we're woken up for it.
And like, what's that total end-to-end look like
from like developers writing code
to running software in production
that's serving traffic
and solving all the problems that come with it.
That's what I think of as platform engineering.
So my last question
before we wind up wrapping this episode
comes down to, I am very adept
at two different programming languages,
and those are brute force and enthusiasm.
What implementation language
is most of what you find yourself working with,
and why is it invariably going to be YAML?
Yeah, that's a great question.
So I think
there's, in terms of implementing OpsLevel and implementing a service catalog, we support YAML.
Like, you know, there's this very common workflow. You just drop a YAML spec basically in your repo
if you're a service owner and that we can support that. I don't think that's a great take though.
We have other integrations. Again, if the problem you're trying to solve is I want to build a
catalog of everything that's out there, asking each of your developers, hey, can you please all
write YAML files that describe the services you own and drop them into this repo? You've inverted
this database that essentially you're trying to build of what's out there and stored it in Git,
potentially across several hundred or thousands of repos. You put a lot of toil now on individual
product developers to go write and maintain these files. And if you ever have to like make a blanket update to these files, there's no atomic way to kind of do that.
Right. So I look at YAML as like, I get it.
You know, like we use YAML for all the things in DevOps.
So why not our service catalog as well?
But I think it's toil.
Like there are easier ways to build a catalog by kind of just integrate.
Like hook up AWS, hook up GitHub, hook up Kubernetes, hook up your CI/CD pipeline,
hook up all these different sources that have information about what's running in prod,
and let the software, let the tool automatically infer what's actually running, as opposed to
requiring humans to manually enter data. I find that there are remarkably few technical holy wars
that I cannot unify both sides on by nominating something far worse,
like the vi versus Emacs stuff, the tabs versus spaces, and of course the JSON versus YAML folks.
My JSON versus YAML answer is XML, God's language. I find that as soon as you suggest that,
people care a hell of a lot less about the differences between JSON and YAML
because their job is to now kill the apostate,
which is me. Right. Yeah. I remember XML like, oh man, 2002 SOAP. I remember SOAP as a protocol.
That was the thing. Some of the earliest S3 API calls were done in SOAP. And I think they finally
just used it to wash their mouths out when all was said and done. Nice. Yeah. I really want to
thank you for taking the time to do your level best to attempt to convert me.
And I would argue in many respects, you have succeeded.
I'm thinking about this differently than I did half an hour ago.
If people want to learn more, where's the best place for them to find you?
Absolutely.
So you can always check out our website, OpsLevel.com.
We're also fairly active on LinkedIn.
If Twitter hasn't imploded by the time this episode becomes launched,
then you can also check us out at twitter.com slash opslevelhq.
We're always posting just different content on how to be successful with service maturity, DevOps, developer productivity, so that ultimately you can ship value to your customers faster.
And we will, of course, put links to that in the show notes.
Thank you so much for taking the time not just to speak with me, but also for sponsoring this episode.
It is appreciated. Cheers. Ken Rose, CTO and co-founder at OpsLevel. I'm cloud economist
Corey Quinn, and this has been a promoted guest episode of Screaming in the Cloud. If you've
enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas
if you've hated this podcast, please leave a five-star review on your podcast platform of choice,
along with an angry comment, which upon further reflection, you could have posted to all of the podcast platforms if only you had the right developer platform to pull it off.
If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their
AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS.
We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started.