Software at Scale - Software at Scale 36 - Decomposing Monoliths with Ganesh Datta
Episode Date: November 2, 2021

Ganesh Datta is the CTO and co-founder of Cortex, a microservice management platform. We continue the age-old monolith/microservice debate and dig into why companies seem to like services so much (I'm generally cautious about such migrations). Ganesh has a ton of insights into developer productivity and tooling to make engineering teams successful that we dive into.

Highlights
00:00 - Why solve the service management problem?
06:00 - When to drive a monolith → service migration? What inflection points should one think about to make that decision?
08:30 - What would Ganesh do differently in his next service migration?
10:30 - What tools are useful when migrating to services?
12:00 - Standardizing infrastructure to facilitate migrations. How much to standardize (à la Google) versus letting teams make their own decisions (à la Amazon)?
17:30 - How does a tool like Cortex help with these problems?
21:30 - How opinionated should such tools be? How much user education is part of building such tools?
27:00 - What are the key cultural components of successful engineering teams?
31:00 - Tactically, what does good service management look like today?
37:00 - What's the cost/benefit ratio of shipping an on-prem product vs. a SaaS tool?
41:30 - What would your advice be for the next software engineer embarking on their monolith → microservice migration?
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Hey, welcome to another episode of the Software at Scale podcast.
Joining me here today is Ganesh Datta, who is the co-founder and CTO of Cortex. Thank you for
joining me. I've always been thinking about that with Cortex. It's like, I wish I started that
company a few years ago. So I'd love to know what your background is and how you got interested in
this problem. I can kind of tell you my background and the story of how we ended up here. So before
I started Cortex, I was at this company called LendUp. It was a fintech company; we had, you know, somewhere in the 50 to 100 engineers range. And I started there just as
they were starting their microservice, like monolith to microservice migration journey.
And I somehow ended up on the team that was, you know, working on the very first service they
pulled out. And I mean, one interesting thing about that was the infrastructure for that
quote, unquote, microservice was pretty much identical to the monolith.
So they weren't really actually going far down the microservices path, but you know,
they were starting to break things out.
And so I started working on that particular project.
And as I was there for a couple of years, we went down that, you know, pulling things
out of the monolith route.
And we had around 50 to 60 services by the time I left.
And as part of that process,
I feel like I got to experience a lot of different things, you know, on both sides of
being a developer, working on microservices and dealing with the chaos that comes with it.
And also on the other side, you know, kind of later in my tenure there,
of trying to define the standards and help people to actually build microservices the right way.
And so there were a couple of different experiences that I had while I was there. I think one was, as we started building services, it became harder for me as a developer
to actually understand what was out there. And, you know, in some cases, people would, you know,
start working on the same microservice in two different teams, even though it's already been
built, or you would get paged in the middle of the night, you know, at 2am, and you have no idea what the service is; you just see some alerts going off. And you're
digging through Confluence pages and wikis and READMEs, trying to piece things together.
And, you know, and that's not what any engineer wants to do. And as we kind of kept going down
this route, we started realizing like, okay, maybe it's time now to actually bring some
standardization to what we're doing, like enough of this kind of like free for all,
like if everyone does things in
similar ways, it's going to be easier for us to actually operate these microservices.
And so we started on the process of putting together production readiness checklists and
guides and things like that. And that was a problem in and of itself, because how do you
kind of circulate that across the organization? How do you get people to care? How do you actually
track progress of how many services have we migrated over? How many services are meeting these standards?
And so as I was kind of dealing with all these challenges, I was trying to put together some
sort of tooling where, you know, every time we create a service, maybe it would like create a
static microsite somewhere, we could have like a catalog of services, and you kind of see where
this is going. There wasn't really any tooling like that at the time.
And so my co-founders were, at the time, at Uber and Twilio. And Uber being, you know, the classic case of microservices gone wrong, with thousands and thousands of services. So I asked them over a beer, like, hey, you've got to
have some solution for this internally.
Like, what do you guys do?
How do you solve this?
And they didn't have anything.
It was like, we have some kind of internal tools we've built, but it's the same set of issues. Like, we have no idea what's out there, you know; services are named after Game of Thrones characters. Which hit close to home, because we'd ended up with the exact same thing.
And so I think that was kind of a moment for me.
I was like, you know what?
Like if all of us are having the same problem,
then maybe there's something here.
And so we started, you know, kind of working on it on the side.
And that's where it took off.
That is so interesting.
And maybe you can just start with why migrate away from a monolith in the first place, right? So you said it's like 50 to 100 developers. And I'm just curious,
like, what are the reasons behind that? I think there's a couple reasons why
folks end up doing it. One of the reasons I would say, is the ability to move faster.
So as a team, if we're kind of restricted
by the infrastructure of the monolith,
deploys are slower, build times are slower.
And so it's just the actual release, deploy, build cycle
is extremely slow.
And so that's one of the main reasons
I think people end up moving away from monoliths.
I think there's more tactical reasons
in terms of ownership, everything from the data
to the actual frameworks,
the language to the tooling.
And so to kind of touch on that a little bit,
I would say if you have a monolith,
it's very easy for data models to start kind of overlapping with each other.
Like I'm working on a feature that's separate from your team's feature.
I need something that you're producing, some data that you're producing.
So I'm just going to reach into your table.
I know that the data is there and I'm going to yank it out. Now, what that means is if your team tries to
like release a feature upgrade, not only do they have to think about their data model, they have to
think about how am I reaching into their data and like mucking around with it in a way that I
shouldn't be. And so, you know, you can do a monolith right, and you can draw strict boundaries. But generally, over time, especially if you're a startup, things like this start to happen, and it becomes much harder to operate your monolith. It becomes impossible to reason about what's actually happening in there. And so to account for
that, you end up pulling a piece out and saying, this is a self-contained module; it does,
you know, one, two or three things. And that's all it does. It has its own data,
we're exposing this via an API so that we know what the contract is, and we guarantee that contract is going to hold true. And so you, as a consumer of my service, know exactly what it does. And you can rely on that. And that gives my team
the flexibility to implement that however we wish. If we think that, you know, updating our data
model is going to improve our ability to move fast and release new features, we can do that and not
break any other customers and not have to think about that. And so part of it, again, is like an
organizational thing. And so you have this concept of, you know, Conway's law, where your software kind of reflects the organization. And so as a team grows, you end up
being broken down into individual teams. So right now, you might have like a backend team.
But then one day, the backend team becomes the platform team, the payments team, you know,
the front end API team. And so each of those teams now have their own charters. And so in order for them to be able to move with full autonomy, microservices kind of show up as like, hey,
this is our service, it does the things that we as a team need to do. And so not only is it like
a technical thing, but it's also an organizational thing
where your software is now representing
how your organization is structured as well.
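(To make the contract idea above concrete, here is a minimal sketch in Python using Flask. Everything here is hypothetical for illustration: the service, route, field names, and the in-memory "table." The point is that consumers depend on the HTTP contract, not on the underlying data model, which the owning team can change freely.)

```python
# A hypothetical "ledger" service: it owns its data and exposes a contract,
# so consumers never query its tables directly.
from flask import Flask, abort, jsonify

app = Flask(__name__)

# Internal storage, owned exclusively by this service. Other teams must not
# reach into this "table"; its schema can change without breaking anyone.
_BALANCES = {"loan-123": {"principal": 5000.00, "interest_accrued": 37.50}}

@app.route("/loans/<loan_id>/balance")
def get_balance(loan_id):
    # The public contract: this response shape is what consumers rely on,
    # regardless of how the data model changes underneath.
    record = _BALANCES.get(loan_id)
    if record is None:
        abort(404)
    return jsonify(loan_id=loan_id, **record)

if __name__ == "__main__":
    app.run(port=8080)
```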
Interesting.
Yeah.
Did you notice any like inflection points?
I heard a lot of mentions of, you know,
when you have multiple teams
that need to interface with each other's data,
that's when things start getting confusing.
Would you say like,
as soon as you have more than like five or six teams, you have to start thinking about not stepping on each other's toes?
How would you go about thinking about, okay, this is the time when I should...
These are the warning signs which make me think everyone sharing the same data model is not the right approach.
I think it's when you have teams working on just different things. I feel like that itself
is an indicator of one team could step on the other team's toes and cause problems. And that
may not necessarily mean that you need to move to microservices immediately. But it's time to start
thinking about how do we draw the boundaries. And those boundaries might be within the monolith.
But they could also be like, hey, we're going to actually take this piece and pull it out.
And so it's interesting because that kind of lines up with what we've seen at Cortex in terms of when this becomes like a pressing pain point and when people start investing in it.
I would say like 30 to 50 engineers is about the time where teams are trying to be proactive and say like, hey, we know we're about to add a bunch more microservices.
We know teams are starting to do their own thing,
and we want to get a grapple on it before it goes crazy.
I think 100 engineers is the tipping point.
100 engineers, there's enough tribal knowledge,
there's enough context lost between teams,
there's enough turnover within teams
that it becomes important to say,
okay, we really need to figure this out now.
We need to know what's out there.
We need to know how are we doing things.
I think 100 engineers approximately is the tipping point at which a lot of
companies end up building tooling internally for this kind of stuff.
Yeah, it's kind of like success causes these problems, right? If your company's doing well enough.
Okay, that's suffering from success, yeah. So if you had to do the migration today, which you did at LendUp a few years ago, would you do that again in a heartbeat? Or is it something that you would consider doing later, or doing with
more tooling? What would you do different? Honestly, I think I would have done it faster.
And I think that was partially because we did see a lot of benefits. And part of it is, and this is
no fault of the organization, but there's always inertia. So like, if you're doing things a certain
way, and the company's growing,
there's a level of risk to changing the way you're building software.
And so there is some level of inertia.
And so at LendUp,
what I thought we did really well was the first microservice that they pulled
out was probably the highest risk thing they could have pulled out.
So LendUp was like a lending company and lending business means that you're
tracking, you know, what people have paid,
how much interest has accrued and how much people need to pay, like actually tracking the financials. Without that, your business is nothing. And that was the first system they pulled out of the monolith. And that kind of sounds unintuitive, because you're like, why wouldn't we test this new paradigm with something low risk first, and then start pulling out bigger chunks? But what actually turned out to be the case was, because we pulled out something so important, it was like an indicator to the rest of the organization that, hey, this is real, the organization is putting all its weight behind this microservices strategy. Now we want to continue
pulling things out, and we're willing to invest the time and energy into that. And so I think even
though we pulled out a high-risk service at the start, you know, there were still features that we had to develop on other teams and stuff.
And so things kind of slowed down in terms of pulling out microservices.
And so I think if I could have gone back and done things differently,
I would have built some standardization in terms of tooling
to help developers create microservices much easier.
Because part of the problem was, since we hadn't done it so many times, spinning up a new service was a lot of overhead. So I would have automated that away,
but then made it easier and actually push people to say, hey, this should be in a service. Don't
do this in a monolith. This is your opportunity. You're building a new feature, pull it out,
think about the domain model and the boundaries and do it right from day one.
So I think we would have actually gone faster. Which is funny, because at Cortex, we're a monolith right now. Even though we're helping companies with microservices,
we are building as a monolith
and I think it's a mix of my experiences
helping other companies deal with microservices
but also like seeing the benefits of a monolith at this stage
which I think is just an interesting dichotomy for, you know, a company doing what we do.
You don't want to over-engineer things, is my guess.
Yeah, yeah. But at some point, you have to dogfood your product, I guess.
Exactly.
So what is some tooling that would have been like most beneficial, right?
Like when you're talking about, I think you spoke about having like standardized
generation for services, like code generation or something, is my guess.
What is the tooling that would be most useful?
I think looking back, that's something we invested in maybe
two years into the microservices journey, but
I think the CodeGen
automation piece probably would have been the most valuable
because it gives people
basically a golden path that says,
hey, if you use this template to generate the boilerplate for your service, you're going to get everything out of
the box. You don't have to worry that you're doing
something wrong and things are going to go haywire.
We've tested this. We've guaranteed this. You have the support of our infra team
who has built this template. And so that kind of gives you the confidence of doing things
the right way. Plus, it creates a standardization where you're not in a situation where your service is tracking latency as "latency" and mine is tracking it as "response time," and now all of a sudden our dashboards are trying to graph two different metric names for the exact same thing. Instead, it helps us move faster
and operate our services better.
So I would say like the automation around
like templatization would have been extremely powerful
because there was a lot of copy pasting involved.
And, you know, a lot of software engineering
is copy pasting sometimes,
but the less that you can do,
I think the faster you can move.
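(A minimal sketch of the metric-name standardization described above, assuming Python and the prometheus_client library; the metric and label names are hypothetical. If the template ships a helper like this, every service reports latency under one canonical name instead of one team's "latency" versus another's "response time.")

```python
# One canonical latency metric for the whole organization, shipped in the
# service template so every team reports it under the same name.
from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",  # the agreed-upon name, everywhere
    "Latency of HTTP requests",
    ["service", "endpoint"],          # services differ only in label values
)

def handle_request(service: str, endpoint: str) -> None:
    # time() yields a context manager that observes the elapsed duration.
    with REQUEST_LATENCY.labels(service=service, endpoint=endpoint).time():
        pass  # actual request handling would go here

if __name__ == "__main__":
    start_http_server(9090)  # expose /metrics for scraping
    handle_request("payments-ledger", "/loans")
```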
So like how much should the infra team
be involved
in like standardizing things? So you mentioned, you know, you could generate starting points,
like boilerplate and all that. But like, what about things like metrics? What about things like,
you know, the right dashboard? Like, do you auto generate a dashboard for every service? Like,
how far should you go in standardization versus letting teams do their own thing
when it comes to like infrastructure? I think a lot of organizations have different opinions on this.
I am of the opinion that standardization is good,
even though developers sometimes feel like they don't have as much autonomy.
Over time, it makes your life easier because you have this pool of knowledge
that grows and grows and grows in the organization,
and you can just move faster without much overhead.
And I think we're kind of getting to the questions around the organizational challenges.
Where do we draw the boundaries between these different teams?
And that's part of the problem in modern engineering teams is there's so much complexity
that you have different teams that own different charters.
You have a security team that cares about security.
You have infra that's trying to build a platform.
You have feature development that's just trying to ship things.
You have engineering leadership that cares about reliability.
You have SRE who's thinking about best practices.
So how do you bring all those people together and say,
let's work together on the golden path?
And so I think, interestingly,
templatizing is one of the places that they can do that.
And so infra can say,
hey, if you want to be on our latest and greatest Kubernetes platform, then here's our deploy script. Okay, let's put that in
the template. And then your SRE team says, we want to automate, you know, tracking certain metrics and dashboarding; we want a reliability dashboard that we can provide to engineering leadership. So hey, you know what, we're going to provide you some baseline metrics that will come out of the box in this template if you use this agent. Okay, so now that's in the template. The development team says, we're a Golang shop or a Kotlin shop, whatever, so these are the frameworks we're going to support; this is what we like, we have lots of tooling around it, our developers like it, so here's the framework we're going to use. And then security says, okay, you know what, on every CI build, we want to run Snyk security vulnerability scans.
And so now all these teams have now come together and say, if you use this template, you're going to basically make all of us happy.
This is the golden path from across the board.
And so as a developer, I don't have to think about what are the requirements from different teams.
I just use this template and I get everything.
And I think that is extremely high value, because now each team has gotten what they need, and they have the standardization that helps them automate a lot of the stuff they want to do across the organization. And as a developer, I can do things much more easily. So I think that's why it's so valuable.
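(As a sketch of what that golden path can look like when it's encoded in one place, here is a hypothetical service-template manifest. The keys and tool names are illustrative only, not any specific product's schema: each stakeholder contributes one piece, and a service generated from the template satisfies all of them at once.)

```yaml
# Hypothetical golden-path template: infra, SRE, security, and the dev
# teams each own one section, and a generated service gets all of it.
template: backend-service
framework: golang-grpc        # what the dev teams agreed to support and tool around
deploy:
  platform: kubernetes        # infra's blessed deploy path
  chart: internal/base-service
observability:
  agent: datadog              # SRE's baseline metrics come out of the box
  dashboards: [latency, error-rate]
ci:
  steps:
    - build
    - unit-tests
    - snyk-scan               # security's vulnerability scan on every CI build
```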
Okay, I think that makes sense. And the whole idea of putting everything in the template also helps automate things, versus having a really large document that people have to follow, like a checklist. Because I've seen processes, and I've seen this on your blog as well, like production readiness reviews and production readiness templates: just automate all of that rather than having people go through checkboxes and manually review. Okay, so that's helpful. What else, though, once you actually
deploy the service to production, right, and things break, that's still going to happen in
like a microservice world, right? So what do you do at that point? Or like, how should you think
about like failures at that point? I think there's a couple of pieces to it. And so I'm not really
going to touch much on the operational, like, observability piece, because I think, you know, there's a lot of material; a lot of people have talked about SLOs and monitoring and that kind of stuff.
But there's another piece of this, which is, how do you actually make sure that when something does go haywire, your organization is prepared to deal with it? And that goes back to
the point you just made around like all these checklists. And the reason you have those
checklists is because you don't want to be scrambling when something goes wrong. Like you
want everything to be ready, like: we know where the dashboards are, we know where
metrics are. And like, we have the telemetry we need. Now let's go in and actually figure out
what's going wrong. And so part of like the whole value of production readiness checklist is to get
the organization to a state where, when things are in production and things go wrong, you're good to go. And so I think part of the challenge that organizations face is,
like, again, it goes back to the standardization. And this is a problem that I faced before: you have some teams that have put their runbooks in Google Docs, you have some teams
that have it in, you know, markdown files in the repo, you have some teams that, you know,
have some sort of like automated playbooks. And when you're paged at 2am, how the hell do you
figure out, like, where do I look? Where do I even start? And so
having some sort of standardization around those practices itself is extremely important. And so
being opinionated as an organization, I think, is valuable: to say, hey, we are using Grafana, and every Grafana dashboard should have a latency metric that we can look at to debug things; you need to have a system restart runbook for every single service, so that if something goes wrong, that's the first place somebody can start;
you need to know who the accountable owner is, like, if something's wrong, and like,
I'm not accountable for it, who do I page? Like, I don't know, like, I don't want to go and ping
people and like page the wrong person and wake them up at 2am, who is accountable for it. And
so an organization, I think, needs to treat this almost like an operational machine, because that's what it is. The more you streamline, the more you standardize, the more cookie-cutter it is, the more you can just knock things out and figure things out much more easily. And so the operations piece of that, I think, is extremely important.
And you know, that being said, obviously, the observability and all that is important,
because without that, you can't do any of this. And, you know, hopefully the templating has helped solve that for you. But I think the production readiness piece of that, from an organizational standpoint, is extremely important in order to deal with these kinds of issues.
Okay, so two questions on that. Does the Cortex product help with the standardization? Plus, the second challenge I see is that if you have a really large organization that's already doing things its own way, how do you actually decide, or how do you make sure, that everyone migrates to a certain best practice? So maybe let me ask the first question first, which is, how does your product, how does Cortex
help with all of this?
So Cortex is what we're calling a software engineering platform, which means we want
to help make
developers' lives easier, make it easier for them to create microservices, operate those
microservices, and then give leadership and SREs and all the other organizations visibility
into how those are performing.
And so that means a couple of different pieces.
The first piece is what we call the service catalog.
So the catalog is exactly how it sounds.
You have information on
every single service. It's like a single pane of glass for every service, library component,
anything you can think of, including who's the owner, who is the business owner, where's the
Slack channel, where are the runbooks: every single thing you need to operate it, and where it is. And this kind of touches a little bit on your second question, which is, if you can organize information in a standard way, it doesn't matter where in the organization that information lives. So for example, it doesn't matter if I have my runbooks in Google Docs and you have yours in Confluence; as long as I have a place where it says "system restart runbook," and I click on it, it's going to take me there. It doesn't matter where it is, as long as I know I can access that information. And so that's the first piece that Cortex does.
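(For illustration, a catalog entry along these lines might look like the following. This is a hypothetical sketch, not Cortex's actual schema; the names and URLs are placeholders.)

```yaml
# Hypothetical catalog entry: one place that points at everything you
# need at 2am, wherever each piece actually lives.
service: payments-ledger
owner:
  team: payments
  slack: "#payments-oncall"
links:
  runbook: https://docs.google.com/document/d/...   # this team keeps theirs in Google Docs
  dashboard: https://grafana.internal/d/payments    # another team might link Confluence instead
  oncall: https://example.pagerduty.com/escalation_policies/...
```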
The second piece touches on the checklist aspect of it: how do you make sure that every service actually has this runbook or this playbook somewhere? And so we have this product called Scorecards, which basically lets you define a set of rules to score your services. So you can think about all these spreadsheets that organizations have for production readiness checklists, security audits, things like that; you can automate that away. Using a custom language that we've built, you can build a scorecard with rules that say: every service, in order to be marked as production ready, needs to have at least two owners, so that if one person leaves the team, there's still somebody accountable for it; it needs to have a runbook; it needs to have a corresponding PagerDuty escalation policy with three levels, so that, you know, somebody can escalate it. And so Cortex automates this entire thing away.
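(As a sketch of the idea, such rules might look like the following. Cortex has its own query language for this; the YAML below is only an illustration of the concept, not its real syntax.)

```yaml
# Hypothetical production-readiness scorecard: each rule is evaluated
# automatically against every service, replacing the spreadsheet audit.
scorecard: production-readiness
rules:
  - name: has-two-owners          # someone stays accountable if a person leaves
    check: owners.count >= 2
  - name: has-restart-runbook
    check: links.runbook != null
  - name: pagerduty-escalation    # three levels, so pages can be escalated
    check: pagerduty.escalation_policy.levels >= 3
```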
The third piece is, how do you get people to care? How do you actually push people to do that? And so we have this feature
called initiatives, where basically, as an engineering leader, I can say, hey, this quarter,
we really want everybody to just define their on call rotations.
And so I know we have 10 rules in our production readiness checklist. But right now, let's like,
do one thing at a time, so let's focus and knock things out. And so Cortex will actually go in with the gamification aspect of messaging people to say, hey, your service dropped 10% this week; go and fix these three things about your service and you'll be in the top 10%. And so Cortex kind of aggregates all this information and lets developers understand, okay, this is what I need to fix. And if I fix
this, I'm good. And it gives engineering leaders and SREs and security teams the ability to come together and say, this is what we've aligned on as our guidelines, and we can objectively score services on it. It's no more, okay, I think this service is not great, how come, I'm going to ping somebody on Slack, with this whole subjective element. Instead, it's very objective. And so that's kind of what Cortex does. One of the recent
things we released is like integration for creating microservices as well. And so obviously,
the Scorecards product lets you track: am I following best practices? Am I following our standards? Am I doing well in our migrations and whatnot? But can we actually help you follow them from day one? And so we have a templating feature that lets organizations define a template and say, create a microservice. I want to create, say, a Golang microservice, and I fill out some form; it'll automatically generate the repo, it'll push the boilerplate code, it'll add it to the service catalog. And so that means it's meeting all of its production readiness standards from day one. And so that's kind of the gist of what Cortex does:
you know, it helps developers create services, operate services, and make sure those services are staying, you know, high quality. So is Cortex like opinionated about what good looks like?
I can totally imagine the next step is, I have this language that lets me define what good looks
like. But I don't know: as a person who's migrating from monoliths to microservices for the first time, should every service have an escalation, like a PagerDuty escalation? It makes sense once you describe it, but I don't think I know all of what's good.
There's a couple of pieces to that.
I think one is, you know, we provide guidance to customers on like, hey, these are what other
customers are
doing. Here's some examples of what production readiness is. But I think your question kind of
touches on a point you made earlier, which is, you have a company that's already doing things
100 different ways. How do you actually get them all to migrate into something? And so,
you know, unfortunately, a lot of this comes down to the organization having to want this change: to say, things are crazy, we want to wrangle this, take this chaos and turn it into something calmer. And so the organization comes in and says, you know, we know we have 10 things, and we're going to approve these five things. So we're not going to standardize everything. We're going to still
create these golden paths. And so I think a lot of it does come out of the organization to say, like, we know what we want, what we're struggling with
is like, how do we get people to listen? How do we get people to care about this stuff? How do we
even start automating this, understanding where we are? And I think that's been the core challenge: a lot of organizations internally know, if everything were great and everybody were doing all this stuff, this is where we'd want to be. This is the goal, the world we can dream of.
And this is our utopia as an organization.
They just don't know how to get there.
And so I think we're more focused on like helping them get there than
trying to tell them where they should get to though.
I think in a future world, you know, and this is something that we talk about internally, can we gather insights across organizations and share them with customers? Maybe a marketplace of production readiness standards: hey, this is what Airbnb is doing, click on that, and you get your Airbnb production readiness standard, kind of like what we do for linting configs now. Why should production readiness be any different? So
that is kind of what we want to get to one day. But I think for today, it's like the customer
knows what they want, and they just need some way to get there. I can imagine even things like case
studies, like people saying this is what they had
and this is how they moved to it would be helpful.
As you get more customers,
you get more of an idea for like what works across people,
what doesn't work.
So things can only get better.
And like the last piece is as an engineering leader,
how do I know how much time to devote to this, right?
Like I have head of product telling me
that we need to ship so many features
by the end of the month.
I have engineers complaining about tech debt, and I have this platform that I want to buy that will help me get to this goal of improving my visibility into microservices and my overall engineering services. How do I know how to get there? Let's say I buy this platform, right? Should I spin up a team that will drive this change? Should I just ask every engineering team to do a little bit of work? How much time should I spend on this? What percentage of my engineering bandwidth should be spent on this? I don't know how to think about that.
That's a super interesting question, because I think the actual answer is one step before that: in order to ask, what should I be spending time on, how much time do we invest in this, you need to know where you are today.
And the problem is organizations don't know where they're at today.
So they don't even have a way to ask the question of,
what should I be working on?
Or how much time should I invest in this when they don't know what this is?
And so for a lot of organizations, the first step is just saying,
what the hell is out there?
How are we performing?
And that gives them the visibility to say,
oh my God, our code coverage is really, really bad. And maybe this is
what's causing our incidents because we just don't have unit testing. And let's focus on that.
And as an engineering leader, now I have the visibility to say, for every new
service, we want to start investing in code coverage. We want to invest in
testing. We want to report on those metrics. And so I think step one, and this kind of touches on how we think about the adoption phase of Cortex for our customers, is baselining: we don't know what's out there, so let's understand
the current situation. And then as an engineering leader, I can figure out what to prioritize.
And so I think for a lot of engineering leaders, the bottom line is reliability and quality, because those things directly impact the financials and the actual reliability of the software.
And so a lot of this production readiness and all this stuff, why do we actually even
care about that stuff in the first place?
It's because that impacts us as a business.
And so as an organization, I'm going to focus on the things that will help me get there.
And so, for example, I can take a look at this and say like,
hey, it looks like our MTTR is really bad.
And we're just doing really poorly at responding to incidents.
Why is that?
Oh, it looks like, you know, we don't have owners for a lot of things.
So the escalation time, you know, we're taking 30 minutes to find who the owner is.
Okay, I'm going to create an initiative for that.
Let's fix that next.
And so the eventual goal is reliability.
And it's up to the engineering leader
to say, what are the key things, the easy wins, we can do to get there. And generally, what we've seen is, it has to come from the team that owns that service; there has to be that stewardship or accountability at the team level. And that is not just for the success
of like Cortex or these initiatives, but from just like a broader engineering philosophy,
like the code that I ship
is something that I need to be accountable for,
you know, from start to finish.
And that includes like operations
and, you know, keeping that high quality.
And so part of the dream that Cortex sells is, we're going to help you create that culture
of accountability and ownership.
Because if you don't have that culture,
then obviously things will suck.
And so part of the thing is
the team has to own
that process. They need to be accountable for maintaining the quality of the service. They
need to be accountable for operating the service. And so Cortex kind of gives engineers the
visibility on what to prioritize, but the teams need to do it themselves.
So you help basically with all of the technical hurdles that there may be
with regards to transparency and not really knowing what's going on. But then the social problems of actually fixing those things are kind of on the culture of the company, which has to drive those changes. What are maybe some key cultural components that you've seen in successful engineering teams or engineering organizations? What is required?
So Cortex tries our best to help with that cultural aspect as well. And that's kind of where we see the value that we can provide. So
through gamification, through leaderboards, and like creating this culture of like, hey,
I care about the quality of my services. And we're starting to see that in a lot of organizations
where like in a scorecard, like we stack rank all the services based on their scores. And very
commonly, we'll see like the top 10% of services are owned by the same team because they've gone in,
they've fixed all their services.
And that's exactly the kind of culture that we want to see.
And so I think those are generally the cultural things
that we want to create.
In terms of engineering organizations
and high-performing engineering organizations,
I think it comes down to a few things.
I think, one, engineering leadership needs to define clear goals that center around reliability. And it needs to be clear that the developers are accountable for that.
And so what's interesting is we've seen organizations that define like OKRs around
Cortex, like, hey, we want every service to be 70% production ready, and they see some value in that.
And so if you treat production readiness and service quality as secondary to product development, then you've made your incentives clear to the organization: product development is first, reliability second. And so if your OKRs don't include reliability metrics, whether it's Cortex or anything else, if there's nothing in there at an organizational level that says, hey, we care about the quality of the output that we're producing, then there's no incentive for managers to prioritize that, or for product managers to accept pushback like, hey, we have to fix this first.
And so unless the organization has defined these goals,
then it's hard for anybody to advocate for that.
And so I think it has to come from the top
in terms of like, this is something we actually care about
and are willing to invest time and money into.
So as an engineer who doesn't have any visible sway on organizational goals, I can't really make change, or I can't make too large a change, unless the engineering leadership is aligned on some kind of quality goal. Is that kind of what you're saying?
I think for broader things like this, there is a component of the leadership having to push it, and that's kind of what we've seen. Like, maybe, you know,
I as a developer can evangelize this to a few teams,
but at a large organization, it's just not going to work.
Like there's too much surface area
for a single team or a single developer to cover.
I think there are things that developers can do.
And that is investing in like the templating, for example.
So I can create a template.
And so maybe it doesn't have buy-in
from all the other SRE and the infra teams. But if I can share that with other developers, and it makes their lives
easier, then they're going to start using that as well, you know, just because it literally just
saves them time. Like, why would I not do something that's going to save me time? And so developers, I think, can invest in those kinds of things, or even evangelize certain practices, or advocate for things they've learned in the organization to be part of the production readiness standards.
That was part of the challenge that I had in my previous job: having been on the team that pushed forward the microservices journey first, we had a lot of learnings. We knew that certain things had to be
done. Our logs needed to have certain pieces of data.
If we didn't know which instance those
log lines were coming from, or we didn't have request tracing, or if we didn't have an easy way of restarting our services, then you would run into issues. And we had learned those lessons the hard way.
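(A minimal sketch of that kind of log standardization, assuming Python's standard logging module; the service name and field names are hypothetical. Every line carries the instance and a request ID, so you can tell where a log line came from and follow one request across lines.)

```python
# Every log line carries the instance and a request ID, so at 2am you can
# tell where a line came from and trace one request through the logs.
import logging
import socket
import uuid

logging.basicConfig(
    format="%(asctime)s %(levelname)s instance=%(instance)s "
           "request_id=%(request_id)s %(message)s",
    level=logging.INFO,
)
log = logging.getLogger("payments-ledger")  # hypothetical service name

def handle_request() -> None:
    # In practice the instance would come from host metadata and the request
    # ID from an incoming trace header; hardcoded sources keep the sketch small.
    ctx = {"instance": socket.gethostname(), "request_id": str(uuid.uuid4())}
    log.info("payment recorded", extra=ctx)

handle_request()
```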
But the problem was, how do you evangelize that and get other people to care
about those things? And a production readiness checklist is a way of evangelizing that. But unfortunately, I think engineering leadership has to be involved, part of it being, hey, this is important, we have to care about these things. So I think, especially as the organization gets bigger, it just becomes more and more important.
And maybe we can talk a little more tactically, right? Like, what do good services look like?
What have you seen, you know, requests from customers on, you know, please integrate with
this particular metrics pipeline,
or can we get metrics from this service because we think that constitutes making a service better?
You were talking about making sure that we run Snyk vulnerability checks, for example.
What are people doing nowadays to basically say that this service is running well versus not?
It very much aligns with how we've been seeing customers doing this. And I think one
of the common patterns we see, and I had done the same thing in my production readiness checklist, is that it's broken down into categories, into the different phases of the life cycle of the
microservice. So the first phase is like development, when you're actually building your service,
are you building it the right way? And so that includes things like, am I containerized? Or am I still
using some old platform? Am I, you know, running automated unit tests as part of my CI suite? Am I
using the right CI suite? Do I have a readme file? Am I using the right framework? Am I using the
right package versions, just like basic things around like, am I building this thing in the
right way so that once I get into production, it's going to be okay. And so standardization
starts from there. The next piece is: okay, my development maturity, as we call it, is good.
I've built it the right way, but now I'm ready to go into production.
Am I ready?
That's kind of the next step.
And so that is very much around things like, do you have an on-call rotation?
Does your on-call rotation have alerting enabled?
Does your on-call rotation have escalation tiers?
Do you have runbooks?
Do you have dashboards?
Are you tracking the right metrics?
Do you have ownership?
Do you have accountability?
Do you have Slack channels?
Like all the things where something's on fire, what do I do now?
That's production readiness.
And so that can even be further broken down into some subsections.
Production readiness can include things like security.
Has your service been secured?
Are you running vulnerability scans?
Observability stuff. Are you tracking the right
metrics? Do you have the Datadog agent
set up? Are you pushing metrics to the right places?
Are you using the right observability tools?
Do you have a Sentry project for your service?
That can be the observability
piece. And then finally
it's around the post-production
piece, which is your service is in production.
Are you operating your service the right way? And so that can include things like
is your service triggering a bunch of alerts outside of business hours and waking people up?
Do you have tons of compliance-ish tickets open in Jira
that you're never closing out? Because that's an indicator of maybe your team doesn't have time
and you're just stretched super thin, or your process has a hole in it.
And so there's this concept of post-production, like operational maturity. And so I break that up into operational readiness and operational maturity: are you operating it the right way?
And so those are kind of like the three main buckets that we see, like most customers setting
up. And obviously, there are a lot of ad hoc things, like, hey, we're trying to migrate from one version to another, or we're trying to get everyone to move onto Kubernetes. Those are ad hoc things. But these are the main things that we see running all the time: am I building it the right way? Am I building it to be ready for production? Am I
operating it the right way? And those kind of have their subcategories.
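(Condensed into config form, the three buckets described above might look something like this hypothetical sketch; the item names are illustrative.)

```yaml
# Hypothetical summary of the three checklist phases described above.
development-maturity:       # am I building it the right way?
  - containerized
  - unit-tests-in-ci
  - readme-present
  - approved-framework-and-package-versions
production-readiness:       # am I ready to go to production?
  - oncall-rotation-with-alerting-and-escalation-tiers
  - runbooks-and-dashboards
  - security: [vulnerability-scans]
  - observability: [datadog-agent, sentry-project]
operational-maturity:       # am I operating it the right way?
  - no-off-hours-alert-storms
  - no-stale-compliance-tickets
```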
Okay. Are there any interesting technological trends you've seen? Like, for example, I've seen that people are just
using more and more things like Fargate and Kubernetes rather than, you know, even trying to spin up services on EC2, which I think makes sense to me.
It removes like a whole class of problems of like, do I need to containerize or not? Because you can
just start with that. Are you seeing anything interesting just through your customers or just
generally? So the things that we're seeing are Kubernetes, like you said, is becoming huge. I
think most of our customers are on Kubernetes already, which is interesting.
It's not something that we expected.
I think there is a lot of focus on, surprisingly, a lot of people care about Jira metrics.
And that is kind of around post-mortem tickets and SEV1, SEV2 tickets, things like that,
which I think reflects people realizing that there is engineering outside of the dev tools themselves. You know, unfortunately, developers have a love-hate relationship with Jira, but it's part of the engineering process.
I think that's been really interesting to see how people see that now as like an engineering
quality metric, and not just like an engineering productivity metric.
And so that's been interesting.
I think templatization, you know, for creating services, has seen a resurgence. There was a lot of talk about it a few years ago, but I think it's really exploded now; a lot more people are investing in it, which has been super interesting, and part of the reason why we built that feature out is that a lot of people were asking for it. I think those are the big things that we're seeing
as commonalities or trends.
I think another common thing has been like configuration as code and not just for like
your deployments, but even for like vendors and tools like that.
Like a lot of people ask us like, can we configure Cortex through, you know, Git?
Can we do GitOps for all of our stuff?
And that's a common pattern.
And I think that's kind of related to the the SRE and infra platform teams that are driving
some of these initiatives.
And I think they really value GitOps and version control and all that kind of stuff for these
things.
So I think that's been really interesting to see a lot of people using that as well.
Okay.
And I noticed that Cortex can be used on the cloud and also be self-hosted.
Have you seen people self-hosting more over time or less?
The traditional wisdom is that people are just using SaaS tools,
but like what's been your experience?
Yeah, that one was interesting, partially because we're a startup, you know; there are questions on security and stuff. And so we just got our SOC 2 certification, which is its own whole can of worms, but I think that's going to make it easier for people to go to the cloud.
But I think for a lot of organizations,
they have a lot of tools
that they run on-prem. And that
actually is another trend that we've seen.
A lot of people are running GitLab on-prem that
they don't want to expose to the public internet. Their Kubernetes clusters
are obviously internal. They're running
Bugsnag on-prem, things like that.
And so a tool like this, which is
kind of like a hub for all of your integrations,
needs to be able to talk to all those things. And if it can't talk
to it, then it's useless. And so a lot of those customers who are running
highly sensitive environments end up going with the self-hosted model. And I think that has been
almost like a secret weapon for us in terms of being able to support those companies. And I know
like the trend has been, you know, moving towards the cloud, but I don't think we would have been
where we are had we not kind of bitten the bullet and gone into the on-prem world earlier.
Was it hard to just build that out? Or was it not as hard as people make it sound?
It's a mixed bag. Getting started with on-prem was not difficult, because we had containerized everything, because we were using Google App Engine for deployments. And so we already had everything in Docker, and we had to wrap that up in a Helm chart that people could deploy into Kubernetes. That part was pretty
straightforward. All things considered, you know, at the time, I didn't know anything about Helm. So
I was like, let's kind of learn Helm and figure that out, which is its own story. But overall,
I would say it's pretty easy. What has been difficult, though, has been all the other stuff
around once you're already deployed in a customer's environment, which is like, how do you
help them debug things? I mean, how do you get logs? How do you get metrics? You know, we just went out and fundraised; how do we know if there's one person using it or 500 people using it?
And we don't have that visibility in like on-prem environments.
So it's all of these like operational things that have been more difficult than the actual
like productionalizing our product and like making it deployable on-prem.
That part was easy.
It's everything else that is much, much harder.
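(For context, the packaging step described above amounts to something like the following minimal, hypothetical Helm chart layout: the already-containerized app plus a values file customers override for their environment. The names are placeholders, not Cortex's actual chart.)

```yaml
# Hypothetical minimal chart wrapping the already-containerized app so a
# customer can `helm install` it into their own Kubernetes cluster.
# chart/Chart.yaml
apiVersion: v2
name: cortex-onprem
version: 0.1.0
---
# chart/values.yaml: the knobs a customer overrides for their environment
image:
  repository: registry.customer.internal/cortex   # their private registry
  tag: "1.0.0"
ingress:
  enabled: false    # many customers keep the tool off the public internet
```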
Cool. And maybe one philosophical question: you mentioned a lot about what service best practices are and how people are thinking about them. Do you feel like it would be helpful if, you know, developer education was a bigger thing and we learned some of these things in college, versus learning them through random blogs that people are writing online?
Like, do you have any thoughts on that at all?
I think it is valuable.
Like, looking back now: at the time, my software engineering class had a lot of design patterns and stuff.
And at the time I was like,
you know, do I really care about this stuff?
Like when am I ever going to use the factory pattern?
And, you know, we're writing Cortex
and all of a sudden I kind of catch myself
writing an API client factory. I was like, ah, okay, it is valuable after all. And as enterprise-y as it sounds, you know, these things have been designed for a reason; they add value. And so I think there is some value in having exposure to things like this, and I think
there are some things that you can only get through experience. So like why logging a certain way is important,
but I think understanding just basic things around logging, like, okay,
maybe I'm not going to learn like how to structure my logs and where it
should go and all that stuff. But like, Hey,
like you should be logging because you know, when something goes wrong,
you want to be able to see like in real time, like this is what happened.
These are the sequence of steps.
And so I can debug it because I feel like they don't teach you that. And I know when I was debugging things,
I just did, like, print "hello," and said, okay, this is where I am, you know; I didn't use a debugger. I was one of those annoying developers like that. But, you know, that as a concept is something I wish I had learned: observability, monitoring, thinking about the quality of your software is
super important. Because you run into folks a lot who are really good at coding, but they're not great at software engineering, and those are two very distinct concepts, I would say. And so I wish there was more emphasis on software engineering: how do you design things, because things can break; how do you design your software so it's adaptable to change? And, you know, things like designing APIs, very simple, that's never going to change; people are going to be designing APIs forever. I feel like that's something
colleges could teach, because it's a basic skill. Now, maybe you don't have to teach GraphQL. But
like, what is an API? How do you design it? What are some, you know, pros and cons of designing
things a certain way? What is REST? Why do you do logging? What does telemetry mean? And so I think
these are concepts that exist forever.
And so I think those things should definitely be taught, even just to have familiarity. And, to be more philosophical about education, I think for me, some things stayed with me and some things kind of went over my head. But even the things that went over my head in college, I had some vague memory of; like, I kind of saw that, and I generally know what that looks like
or what that means.
And it gives me a good starting point to,
you know, to go off of.
And I think that is valuable.
Like even if developers don't have
like a very strong fundamental on those things,
having familiarity, I think is very valuable.
Okay.
And a wrap up question,
like what's your advice to the software engineer
who's embarking on their monolith to microservice migration today, right? Like, what would be your advice to you five years ago? What should someone think about?
Like I mentioned earlier, you want to show the organization that you are serious about this journey; you've got to do something big. So I would say that's step one. I would say step two is, you know, don't boil the ocean. Don't try to build all the infrastructure first. If you have a monolith that's running on Heroku, or, you know, an EC2 instance, or something like that, don't spend six weeks building a Kubernetes cluster and deploy pipelines and all this stuff. Get it out there, you know, with the bare minimum, and learn those lessons; a lot of this comes from experience. And that is much more valuable than
trying to build all this beautiful infrastructure around it. I think the third piece is telemetry,
telemetry, telemetry: make sure you have logging, you have, you know, APM or monitoring, because things will go wrong. And so make sure you have the ability to actually figure out what has happened. I think that is extremely valuable.
And then finally, document things.
Because if you are an individual developer or your team is the first one embarking on this mission, people have a lot of questions.
And you're almost kind of being like the trailblazer here.
And so help the organization understand what you've done.
So document the tools that you've used, you know, how to operate your service,
things like that. You are kind of being the torchbearer and the trailblazer for microservices, so do your part to be a part of that mission within the organization. I think those are the main things I wish I had known.
So it's similar to product development, like create a lean MVP,
measure and iterate pretty much, right?
Exactly.
Yeah.
Well, Ganesh, thank you so much for being a guest.
I think this was a lot of fun.
Thank you so much for having me.