Software at Scale - Software at Scale 36 - Decomposing Monoliths with Ganesh Datta
Episode Date: November 2, 2021

Ganesh Datta is the CTO and co-founder of Cortex, a microservice management platform. We continue the age-old monolith/microservice debate and dig into why companies seem to like services so much (I'm generally cautious about such migrations). Ganesh has a ton of insights into developer productivity and tooling to make engineering teams successful that we dive into.

Highlights
00:00 - Why solve the service management problem?
06:00 - When to drive a monolith → service migration? What inflection points should one think about to make that decision?
08:30 - What would Ganesh do differently in his next service migration?
10:30 - What tools are useful when migrating to services?
12:00 - Standardizing infrastructure to facilitate migrations. How much to standardize (à la Google) versus letting teams make their own decisions (à la Amazon)?
17:30 - How does a tool like Cortex help with these problems?
21:30 - How opinionated should such tools be? How much user education is part of building such tools?
27:00 - What are the key cultural components of successful engineering teams?
31:00 - Tactically, what does good service management look like today?
37:00 - What's the cost/benefit ratio of shipping an on-prem product vs. a SaaS tool?
41:30 - What would your advice be for the next software engineer embarking on their monolith → microservice migration?
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Hey, welcome to another episode of the Software at Scale podcast.
Joining me here today is Ganesh Datta, who is the co-founder and CTO of Cortex. Thank you for
joining me. I've always been thinking about that with Cortex. It's like, I wish I started that
company a few years ago. So I'd love to know what your background is and how you got interested in
this problem. I can kind of tell you my background and the story of how we ended up here. So before
I started Cortex, I was at this company called LendUp. It was a fintech company; we had, you know, somewhere in the 50 to 100 engineers range. And I started there just as
they were starting their microservice, like monolith to microservice migration journey.
And I somehow ended up on the team that was, you know, working on the very first service they
pulled out. And I mean, one interesting thing about that was the infrastructure for that
quote, unquote, microservice was pretty much identical to the monolith.
So they weren't really actually going far down the microservices path, but you know,
they were starting to break things out.
And so I started working on that particular project.
And as I was there for a couple of years, we went down that, you know, pulling things
out of the monolith route.
And we had around 50 to 60 services by the time I left.
And as part of that process,
I feel like I got to experience a lot of different things, you know, on both sides of
being a developer, working on microservices and dealing with the chaos that comes with it.
And also on the other side, you know, kind of later in my tenure there,
of trying to define the standards and help people to actually build microservices the right way.
And so there were a couple of different experiences that I had while I was there. I think one was, as we started building services, it became harder for me as a developer
to actually understand what was out there. And, you know, in some cases, people would, you know,
start working on the same microservice in two different teams, even though it's already been
built, or you would get paged in the middle of the night, you know, at 2am, and you have no idea what the service is; you just see some alerts going off. And you're
digging through Confluence pages and wikis and READMEs, trying to piece things together.
And, you know, and that's not what any engineer wants to do. And as we kind of kept going down
this route, we started realizing like, okay, maybe it's time now to actually bring some
standardization to what we're doing, like enough of this kind of like free for all,
like if everyone does things in
similar ways, it's going to be easier for us to actually operate these microservices.
And so we started on the process of putting together production readiness checklists and
guides and things like that. And that was a problem in and of itself, because how do you
kind of circulate that across the organization? How do you get people to care? How do you actually
track progress of how many services have we migrated over? How many services are meeting these standards?
And so as I was kind of dealing with all these challenges, I was trying to put together some
sort of tooling where, you know, every time we create a service, maybe it would like create a
static microsite somewhere, we could have like a catalog of services, and you kind of see where
this is going. There wasn't really any tooling like that at the time.
And so my co-founders were, at the time, at Uber and Twilio. And Uber being, you know, the classic case of microservices gone wrong, with thousands and thousands of services. So I asked them over a beer, like, hey, you've got to
have some solution for this internally.
Like, what do you guys do?
How do you solve this?
And they didn't have anything.
It was like, we have some kind of internal tools we've built, but it's the same set of issues. Like, we have no idea what's out there, you know; services are named after Game of Thrones characters. Which hit close to home, because we'd ended up with the exact same thing.
And so I think that was kind of a moment for me.
I was like, you know what?
Like if all of us are having the same problem,
then maybe there's something here.
And so we started, you know, kind of working on it on the side.
And that's where it took off.
That is so interesting.
And maybe you can just start with why migrate away from a monolith in the first place, right? So you said it's like 50 to 100 developers. And I'm just curious,
like, what are the reasons behind that? I think there's a couple reasons why
folks end up doing it. One of the reasons I would say, is the ability to move faster.
So as a team, if we're kind of restricted
by the infrastructure of the monolith,
deploys are slower, build times are slower.
And so it's just the actual release, deploy, build cycle
is extremely slow.
And so that's one of the main reasons
I think people end up moving away from monoliths.
I think there's more tactical reasons
in terms of ownership, everything from the data
to the actual frameworks,
the language to the tooling.
And so to kind of touch on that a little bit,
I would say if you have a monolith,
it's very easy for data models to start kind of overlapping with each other.
Like I'm working on a feature that's separate from your team's feature.
I need something that you're producing, some data that you're producing.
So I'm just going to reach into your table.
I know that the data is there and I'm going to yank it out. Now, what that means is if your team tries to
like release a feature upgrade, not only do they have to think about their data model, they have to
think about how am I reaching into their data and like mucking around with it in a way that I
shouldn't be. And so, you know, you can do a monolith right, and you can draw strict boundaries. But generally, over time, especially if you're a startup, things like this start to happen, and it becomes much harder to operate your monolith. It becomes impossible to reason about what's actually happening in there. And so to account for
that, you end up pulling a piece out and saying, this is a self-contained module; it does,
you know, one, two or three things. And that's all it does. It has its own data,
we're exposing this via an API so that we know what the contract is, and we guarantee that contract is going to hold true. And so you, as a consumer of my service, know exactly what it does. And you can rely on that. And that gives my team
the flexibility to implement that however we wish. If we think that, you know, updating our data
model is going to improve our ability to move fast and release new features, we can do that and not
break any other customers and not have to think about that. And so part of it, again, is like an
organizational thing. And so you have this concept of, you know, Conway's law, where your software kind of reflects the organization. And so as a team grows, you end up
being broken down into individual teams. So right now, you might have like a backend team.
But then one day, the backend team becomes the platform team, the payments team, you know,
the front end API team. And so each of those teams now have their own charters. And so in order for them to be able to move with full autonomy, microservices kind of show up as like, hey,
this is our service, it does the things that we as a team need to do. And so not only is it like
a technical thing, but it's also an organizational thing
where your software is now representing
how your organization is structured as well.
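(To make the contract idea above concrete, here is a minimal sketch in Python using Flask. Everything here is hypothetical for illustration: the service, route, field names, and the in-memory "table." The point is that consumers depend on the HTTP contract, not on the underlying data model, which the owning team can change freely.)

```python
# A hypothetical "ledger" service: it owns its data and exposes a contract,
# so consumers never query its tables directly.
from flask import Flask, abort, jsonify

app = Flask(__name__)

# Internal storage, owned exclusively by this service. Other teams must not
# reach into this "table"; its schema can change without breaking anyone.
_BALANCES = {"loan-123": {"principal": 5000.00, "interest_accrued": 37.50}}

@app.route("/loans/<loan_id>/balance")
def get_balance(loan_id):
    # The public contract: this response shape is what consumers rely on,
    # regardless of how the data model changes underneath.
    record = _BALANCES.get(loan_id)
    if record is None:
        abort(404)
    return jsonify(loan_id=loan_id, **record)

if __name__ == "__main__":
    app.run(port=8080)
```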
Interesting.
Yeah.
Did you notice any like inflection points?
I heard a lot of mentions of, you know,
when you have multiple teams
that need to interface with each other's data,
that's when things start getting confusing.
Would you say like,
as soon as you have more than like five or six teams, you have to start thinking about not stepping on each other's toes?
How would you go about thinking about, okay, this is the time when I should...
These are the warning signs which make me think everyone sharing the same data model is not the right approach.
I think it's when you have teams working on just different things. I feel like that itself
is an indicator of one team could step on the other team's toes and cause problems. And that
may not necessarily mean that you need to move to microservices immediately. But it's time to start
thinking about how do we draw the boundaries. And those boundaries might be within the monolith.
But they could also be like, hey, we're going to actually take this piece and pull it out.
And so it's interesting because that kind of lines up with what we've seen at Cortex in terms of when this becomes like a pressing pain point and when people start investing in it.
I would say like 30 to 50 engineers is about the time where teams are trying to be proactive and say like, hey, we know we're about to add a bunch more microservices.
We know teams are starting to do their own thing,
and we want to get a grapple on it before it goes crazy.
I think 100 engineers is the tipping point.
100 engineers, there's enough tribal knowledge,
there's enough context lost between teams,
there's enough turnover within teams
that it becomes important to say,
okay, we really need to figure this out now.
We need to know what's out there.
We need to know how are we doing things.
I think 100 engineers approximately is the tipping point at which a lot of
companies end up building tooling internally for this kind of stuff.
Yeah, it's kind of like success causes these problems, right? If your company's doing well enough.
Okay, that's suffering from success, yeah. So if you had to do the migration today, which you did at LendUp a few years ago, would you do that again in a heartbeat? Or is it something that you would consider doing later, or doing with
more tooling? What would you do different? Honestly, I think I would have done it faster.
And I think that was partially because we did see a lot of benefits. And part of it is, and this is
no fault of the organization, but there's always inertia. So like, if you're doing things a certain
way, and the company's growing,
there's a level of risk to changing the way you're building software.
And so there is some level of inertia.
And so at LendUp,
what I thought we did really well was the first microservice that they pulled
out was probably the highest risk thing they could have pulled out.
So LendUp was like a lending company and lending business means that you're
tracking, you know, what people have paid,
how much interest has accrued and how much people need to pay, like actually tracking the financials. Without that, your business is nothing. And that was the first system they pulled out of the monolith. And that kind of sounds unintuitive, because you're like, why wouldn't we test this new paradigm with something low risk first, and then start pulling out bigger chunks? But what actually turned out to be the case was, because we pulled out something so important, it was like an indicator to the rest of the organization that, hey, this is real, the organization is putting all its weight behind this microservices strategy. Now we want to continue
pulling things out, and we're willing to invest the time and energy into that. And so I think even
though we pulled out a high-risk service at the start, you know, there were still features that we had to develop on other teams and stuff.
And so things kind of slowed down in terms of pulling out microservices.
And so I think if I could have gone back and done things differently,
I would have built some standardization in terms of tooling
to help developers create microservices much easier.
Because part of the problem was, since we hadn't done it so many times, spinning up a new service was a lot of overhead. So I would have automated that away,
but then made it easier and actually push people to say, hey, this should be in a service. Don't
do this in a monolith. This is your opportunity. You're building a new feature, pull it out,
think about the domain model and the boundaries and do it right from day one.
So I think we would have actually gone faster. Which is funny, because at Cortex, we're a monolith right now. Even though we're helping companies with microservices,
we are building as a monolith
and I think it's a mix of my experiences
helping other companies deal with microservices
but also like seeing the benefits of a monolith at this stage
which I think is just an interesting dichotomy for, you know, a company doing what we do.
You don't want to over-engineer things, is my guess.
Yeah, yeah. But at some point, you have to dogfood your product, I guess.
Exactly.
So what is some tooling that would have been like most beneficial, right?
Like when you're talking about, I think you spoke about having like standardized
generation for services, like code generation or something, is my guess.
What is the tooling that would be most useful?
I think looking back, that's something we invested in maybe
two years into the microservices journey, but
I think the CodeGen
automation piece probably would have been the most valuable
because it gives people
basically a golden path that says,
hey, if you use this template to generate the boilerplate for your service, you're going to get everything out of
the box. You don't have to worry that you're doing
something wrong and things are going to go haywire.
We've tested this. We've guaranteed this. You have the support of our infra team
who has built this template. And so that kind of gives you the confidence of doing things
the right way. Plus, it creates a standardization where you're not in a situation where your service is tracking latency as "latency" and mine is tracking it as "response time," and now all of a sudden our dashboards are trying to graph two different metric names for the exact same thing. Instead, it helps us move faster
and operate our services better.
So I would say like the automation around
like templatization would have been extremely powerful
because there was a lot of copy pasting involved.
And, you know, a lot of software engineering
is copy pasting sometimes,
but the less that you can do,
I think the faster you can move.
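(A minimal sketch of the metric-name standardization described above, assuming Python and the prometheus_client library; the metric and label names are hypothetical. If the template ships a helper like this, every service reports latency under one canonical name instead of one team's "latency" versus another's "response time.")

```python
# One canonical latency metric for the whole organization, shipped in the
# service template so every team reports it under the same name.
from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",  # the agreed-upon name, everywhere
    "Latency of HTTP requests",
    ["service", "endpoint"],          # services differ only in label values
)

def handle_request(service: str, endpoint: str) -> None:
    # time() yields a context manager that observes the elapsed duration.
    with REQUEST_LATENCY.labels(service=service, endpoint=endpoint).time():
        pass  # actual request handling would go here

if __name__ == "__main__":
    start_http_server(9090)  # expose /metrics for scraping
    handle_request("payments-ledger", "/loans")
```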
So like how much should the infra team
be involved
in like standardizing things? So you mentioned, you know, you could generate starting points,
like boilerplate and all that. But like, what about things like metrics? What about things like,
you know, the right dashboard? Like, do you auto generate a dashboard for every service? Like,
how far should you go in standardization versus letting teams do their own thing
when it comes to like infrastructure? I think a lot of organizations have different opinions on this.
I am of the opinion that standardization is good,
even though developers sometimes feel like they don't have as much autonomy.
Over time, it makes your life easier because you have this pool of knowledge
that grows and grows and grows in the organization,
and you can just move faster without much overhead.
And I think we're kind of getting to the questions around the organizational challenges.
Where do we draw the boundaries between these different teams?
And that's part of the problem in modern engineering teams is there's so much complexity
that you have different teams that own different charters.
You have a security team that cares about security.
You have infra that's trying to build a platform.
You have feature development that's just trying to ship things.
You have engineering leadership that cares about reliability.
You have SRE who's thinking about best practices.
So how do you bring all those people together and say,
let's work together on the golden path?
And so I think, interestingly,
templatizing is one of the places that they can do that.
And so infra can say,
hey, if you want to be on our latest and greatest Kubernetes platform, then here's our deploy script. Okay, let's put that in
the template. And then your SRE team says, we want to automate, you know, tracking certain metrics and dashboarding; we want a reliability dashboard that we can provide to engineering leadership. So hey, you know what, we're going to provide you some baseline metrics that will come out of the box in this template if you use this agent. Okay, so now that's in the template. The development team says, we're a Golang shop or a Kotlin shop, whatever, so these are the frameworks we're going to support; this is what we like, we have lots of tooling around it, our developers like it, so here's the framework we're going to use. And then security says, okay, you know what, on every CI build, we want to run Snyk security vulnerability scans.
And so now all these teams have now come together and say, if you use this template, you're going to basically make all of us happy.
This is the golden path from across the board.
And so as a developer, I don't have to think about what are the requirements from different teams.
I just use this template and I get everything.
And I think that is extremely high value, because now each team has gotten what they need, and they have the standardization that helps them automate a lot of the stuff they want to do across the organization. And as a developer, I can do things much more easily. So I think that's why it's so valuable.
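(As a sketch of what that golden path can look like when it's encoded in one place, here is a hypothetical service-template manifest. The keys and tool names are illustrative only, not any specific product's schema: each stakeholder contributes one piece, and a service generated from the template satisfies all of them at once.)

```yaml
# Hypothetical golden-path template: infra, SRE, security, and the dev
# teams each own one section, and a generated service gets all of it.
template: backend-service
framework: golang-grpc        # what the dev teams agreed to support and tool around
deploy:
  platform: kubernetes        # infra's blessed deploy path
  chart: internal/base-service
observability:
  agent: datadog              # SRE's baseline metrics come out of the box
  dashboards: [latency, error-rate]
ci:
  steps:
    - build
    - unit-tests
    - snyk-scan               # security's vulnerability scan on every CI build
```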
Okay, I think that makes sense. And the whole idea of putting everything in the template also helps automate things, versus having a really large document that people have to follow, like a checklist. Because I've seen processes, and I've seen this on your blog as well, like production readiness reviews and production readiness templates: just automate all of that rather than having people go through checkboxes and manually review. Okay, so that's helpful. What else, though, once you actually
deploy the service to production, right, and things break, that's still going to happen in
like a microservice world, right? So what do you do at that point? Or like, how should you think
about like failures at that point? I think there's a couple of pieces to it. And so I'm not really
going to touch much on the operational, like, observability piece, because I think, you know, there's a lot of material; a lot of people have talked about SLOs and monitoring and that kind of stuff.
But there's another piece of this, which is, how do you actually make sure that when something does go haywire, your organization is prepared to deal with it? And that goes back to
the point you just made around like all these checklists. And the reason you have those
checklists is because you don't want to be scrambling when something goes wrong. Like you
want everything to be ready, like: we know where the dashboards are, we know where
metrics are. And like, we have the telemetry we need. Now let's go in and actually figure out
what's going wrong. And so part of like the whole value of production readiness checklist is to get
the organization to a state where, when things are in production and things go wrong, you're good to go. And so I think part of the challenge that organizations face is,
like, again, it goes back to the standardization. And this is a problem that I faced before: you have some teams that have put their runbooks in Google Docs, you have some teams
that have it in, you know, markdown files in the repo, you have some teams that, you know,
have some sort of like automated playbooks. And when you're paged at 2am, how the hell do you
figure out, like, where do I look? Where do I even start? And so
having some sort of standardization around those practices itself is extremely important. And so
being opinionated as an organization, I think, is valuable: to say, hey, we are using Grafana, and every Grafana dashboard should have a latency metric that we can look at to debug things; you need to have a system restart runbook for every single service, so that if something goes wrong, that's the first place somebody can start;
you need to know who the accountable owner is, like, if something's wrong, and like,
I'm not accountable for it, who do I page? Like, I don't know, like, I don't want to go and ping
people and like page the wrong person and wake them up at 2am, who is accountable for it. And
so an organization, I think, needs to treat this almost like an operational machine, because that's what it is. The more you streamline, the more you standardize, the more cookie-cutter it is, the more you can just knock things out and figure things out much more easily. And so the operations piece of that, I think, is extremely important.
And you know, that being said, obviously, the observability and all that is important,
because without that, you can't do any of this. And, you know, hopefully the templating has helped solve that for you. But I think the production readiness piece of that, from an organizational standpoint, is extremely important in order to deal with these kinds of issues.
Okay, so two questions on that. Does the Cortex product help with the standardization? Plus, the second challenge I see is that if you have a really large organization that's already doing things its own way, how do you actually decide, or how do you make sure, that everyone migrates to a certain best practice? So maybe let me ask the first question first, which is, how does your product, how does Cortex
help with all of this?
So Cortex is what we're calling a software engineering platform, which means we want
to help make
developers' lives easier, make it easier for them to create microservices, operate those
microservices, and then give leadership and SREs and all the other organizations visibility
into how those are performing.
And so that means a couple of different pieces.
The first piece is what we call the service catalog.
So the catalog is exactly how it sounds.
You have information on
every single service. It's like a single pane of glass for every service, library component,
anything you can think of, including who's the owner, who is the business owner, where's the
Slack channel, where are the runbooks: every single thing you need to operate it, and where it is. And this kind of touches a little bit on your second question, which is, if you can organize information in a standard way, it doesn't matter where in the organization that information lives. So for example, it doesn't matter if I have my runbooks in Google Docs and you have yours in Confluence; as long as I have a place where it says "system restart runbook," and I click on it, it's going to take me there. It doesn't matter where it is, as long as I know I can access that information. And so that's the first piece that Cortex does.
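(For illustration, a catalog entry along these lines might look like the following. This is a hypothetical sketch, not Cortex's actual schema; the names and URLs are placeholders.)

```yaml
# Hypothetical catalog entry: one place that points at everything you
# need at 2am, wherever each piece actually lives.
service: payments-ledger
owner:
  team: payments
  slack: "#payments-oncall"
links:
  runbook: https://docs.google.com/document/d/...   # this team keeps theirs in Google Docs
  dashboard: https://grafana.internal/d/payments    # another team might link Confluence instead
  oncall: https://example.pagerduty.com/escalation_policies/...
```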
The second piece touches on the checklist aspect of it: how do you make sure that every service actually has this runbook or this playbook somewhere? And so we have this product called Scorecards, which basically lets you define a set of rules to score your services. So you can think about all these spreadsheets that organizations have for production readiness checklists, security audits, things like that; you can automate that away. Using a custom language that we've built, you can build a scorecard with rules that say: every service, in order to be marked as production ready, needs to have at least two owners, so that if one person leaves the team, there's still somebody accountable for it; it needs to have a runbook; it needs to have a corresponding PagerDuty escalation policy with three levels, so that, you know, somebody can escalate it. And so Cortex automates this entire thing away.
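(As a sketch of the idea, such rules might look like the following. Cortex has its own query language for this; the YAML below is only an illustration of the concept, not its real syntax.)

```yaml
# Hypothetical production-readiness scorecard: each rule is evaluated
# automatically against every service, replacing the spreadsheet audit.
scorecard: production-readiness
rules:
  - name: has-two-owners          # someone stays accountable if a person leaves
    check: owners.count >= 2
  - name: has-restart-runbook
    check: links.runbook != null
  - name: pagerduty-escalation    # three levels, so pages can be escalated
    check: pagerduty.escalation_policy.levels >= 3
```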
The third piece is, how do you get people to care? How do you actually push people to do that? And so we have this feature
called initiatives, where basically, as an engineering leader, I can say, hey, this quarter,
we really want everybody to just define their on call rotations.
And so I know we have 10 rules in our production readiness checklist. But right now, let's like,
do one thing at a time, so let's focus and knock things out. And so Cortex will actually go in with the gamification aspect of messaging people to say, hey, your service dropped 10% this week; go and fix these three things about your service and you'll be in the top 10%. And so Cortex kind of aggregates all this information and lets developers understand, okay, this is what I need to fix. And if I fix
this, I'm good. And it gives engineering leaders and SREs and security teams the ability to come together and say, this is what we've aligned on as our guidelines, and we can objectively score services on it. It's no more, okay, I think this service is not great, how come, I'm going to ping somebody on Slack, with this whole subjective element. Instead, it's very objective. And so that's kind of what Cortex does. One of the recent
things we released is like integration for creating microservices as well. And so obviously,
the Scorecards product lets you track: am I following best practices? Am I following our standards? Am I doing well in our migrations and whatnot? But can we actually help you follow them from day one? And so we have a templating feature that lets organizations define a template and say, create a microservice. I want to create, say, a Golang microservice, and I fill out some form; it'll automatically generate the repo, it'll push the boilerplate code, it'll add it to the service catalog. And so that means it's meeting all of its production readiness standards from day one. And so that's kind of the gist of what Cortex does:
you know, it helps developers create services, operate services, and make sure those services are staying, you know, high quality. So is Cortex like opinionated about what good looks like?
I can totally imagine the next step is, I have this language that lets me define what good looks
like. But I don't know: as a person who's migrating from monoliths to microservices for the first time, should every service have an escalation, like a PagerDuty escalation? It makes sense once you describe it, but I don't think I know all of what's good.
There's a couple of pieces to that.
I think one is, you know, we provide guidance to customers on like, hey, these are what other
customers are
doing. Here's some examples of what production readiness is. But I think your question kind of
touches on a point you made earlier, which is, you have a company that's already doing things
100 different ways. How do you actually get them all to migrate into something? And so,
you know, unfortunately, a lot of this comes down to the organization having to want this change: to say, things are crazy, we want to wrangle this, take this chaos and turn it into something calmer. And so the organization comes in and says, you know, we know we have 10 things, and we're going to approve these five things. So we're not going to standardize everything. We're going to still
create these golden paths. And so I think a lot of it does come out of the organization to say, like, we know what we want, what we're struggling with
is like, how do we get people to listen? How do we get people to care about this stuff? How do we
even start automating this, understanding where we are? And I think that's been the core challenge: a lot of organizations internally know, if everything were great and everybody were doing all this stuff, this is where we'd want to be. This is the goal, the world we can dream of.
And this is our utopia as an organization.
They just don't know how to get there.
And so I think we're more focused on like helping them get there than
trying to tell them where they should get to though.
I think in a future world, you know, and this is something that we talk about internally, can we gather insights across organizations and share them with customers? Maybe a marketplace of production readiness standards: hey, this is what Airbnb is doing, click on that, and you get your Airbnb production readiness standard, kind of like what we do for linting configs now. Why should production readiness be any different? So
that is kind of what we want to get to one day. But I think for today, it's like the customer
knows what they want, and they just need some way to get there. I can imagine even things like case
studies, like people saying this is what they had
and this is how they moved to it would be helpful.
As you get more customers,
you get more of an idea for like what works across people,
what doesn't work.
So things can only get better.
And like the last piece is as an engineering leader,
how do I know how much time to devote to this, right?
Like I have head of product telling me
that we need to ship so many features
by the end of the month.
I have engineers complaining about tech debt, and I have this platform that I want to buy that will help me get to this goal of improving my visibility into microservices and my overall engineering services. How do I know how to get there? Let's say I buy this platform, right? Should I spin up a team that will drive this change? Should I just ask every engineering team to do a little bit of work? How much time should I spend on this? What percentage of my engineering bandwidth should be spent on this? I don't know how to think about that.
That's a super interesting question, because I think the actual answer is one step before that: in order to ask, what should I be spending time on, how much time do we invest in this, you need to know where you are today.
And the problem is organizations don't know where they're at today.
So they don't even have a way to ask the question of,
what should I be working on?
Or how much time should I invest in this when they don't know what this is?
And so for a lot of organizations, the first step is just saying,
what the hell is out there?
How are we performing?
And that gives them the visibility to say,
oh my God, our code coverage is really, really bad. And maybe this is
what's causing our incidents because we just don't have unit testing. And let's focus on that.
And as an engineering leader, now I have the visibility to say, for every new
service, we want to start investing in code coverage. We want to invest in
testing. We want to report on those metrics. And so I think step one, and this kind of touches on how we think about the adoption phase of Cortex for our customers, is baselining: we don't know what's out there, so let's understand
the current situation. And then as an engineering leader, I can figure out what to prioritize.
And so I think for a lot of engineering leaders, the bottom line is reliability and quality, because those things directly impact the financials and the actual reliability of the software.
And so a lot of this production readiness and all this stuff, why do we actually even
care about that stuff in the first place?
It's because that impacts us as a business.
And so as an organization, I'm going to focus on the things that will help me get there.
And so, for example, I can take a look at this and say like,
hey, it looks like our MTTR is really bad.
And we're just doing really poorly at responding to incidents.
Why is that?
Oh, it looks like, you know, we don't have owners for a lot of things.
So the escalation time, you know, we're taking 30 minutes to find who the owner is.
Okay, I'm going to create an initiative for that.
Let's fix that next.
And so the eventual goal is reliability.
And it's up to the engineering leader
to say, what are the key things, the easy wins, we can do to get there. And generally, what we've seen is, it has to come from the team that owns that service; there has to be that stewardship or accountability at the team level. And that is not just for the success
of like Cortex or these initiatives, but from just like a broader engineering philosophy,
like the code that I ship
is something that I need to be accountable for,
you know, from start to finish.
And that includes like operations
and, you know, keeping that high quality.
And so part of the dream that Cortex sells is, we're going to help you create that culture
of accountability and ownership.
Because if you don't have that culture,
then obviously things will suck.
And so part of the thing is
the team has to own
that process. They need to be accountable for maintaining the quality of the service. They
need to be accountable for operating the service. And so Cortex kind of gives engineers the
visibility on what to prioritize, but the teams need to do it themselves.
So you help basically with all of the technical hurdles that there may be
with regards to transparency and not really knowing what's going on. But then the social problems of actually fixing those things are kind of on the culture of the company, which has to drive those changes. What are maybe some key cultural components that you've seen in successful engineering teams or engineering organizations? What is required?
So Cortex tries our best to help with that cultural aspect as well. And that's kind of where we see the value that we can provide. So
through gamification, through leaderboards, and like creating this culture of like, hey,
I care about the quality of my services. And we're starting to see that in a lot of organizations
where like in a scorecard, like we stack rank all the services based on their scores. And very
commonly, we'll see like the top 10% of services are owned by the same team because they've gone in,
they've fixed all their services.
And that's exactly the kind of culture that we want to see.
And so I think those are generally the cultural things
that we want to create.
In terms of engineering organizations
and high-performing engineering organizations,
I think it comes down to a few things.
I think, one, engineering leadership needs to define clear goals that center around reliability. And it needs to be clear that the developers are accountable for that.
And so what's interesting is we've seen organizations that define like OKRs around
Cortex, like, hey, we want every service to be 70% production ready, and they see some value in that.
And so if you treat production readiness and service quality as secondary to product development, then you've made your incentives clear to the organization: product development is first, reliability second. And so if your OKRs don't include reliability metrics, whether it's Cortex or anything else, if there's nothing in there at an organizational level that says, hey, we care about the quality of the output that we're producing, then there's no incentive for managers to prioritize that, or for product managers to accept pushback like, hey, we have to fix this first.
And so unless the organization has defined these goals,
then it's hard for anybody to advocate for that.
And so I think it has to come from the top
in terms of like, this is something we actually care about
and are willing to invest time and money into.
So as an engineer who doesn't have any visible sway on organizational goals, I can't really make change, or I can't make too large a change, unless the engineering leadership is aligned on some kind of quality goal. Is that kind of what you're saying?
I think for broader things like this, there is a component of the leadership having to push it, and that's kind of what we've seen. Like, maybe, you know,
I as a developer can evangelize this to a few teams,
but at a large organization, it's just not going to work.
Like there's too much surface area
for a single team or a single developer to cover.
I think there are things that developers can do.
And that is investing in like the templating, for example.
So I can create a template.
And so maybe it doesn't have buy-in
from all the other SRE and the infra teams. But if I can share that with other developers, and it makes their lives
easier, then they're going to start using that as well, you know, just because it literally just
saves them time. Like, why would I not do something that's going to save me time? And so developers, I think, can invest in those kinds of things, or even evangelize certain practices, or advocate for things they've learned in the organization to be part of the production readiness standards.
That was part of the challenge that I had in my previous job: having been on the team that pushed forward the microservices journey first, we had a lot of learnings. We knew that certain things had to be
done. Our logs needed to have certain pieces of data.
If we didn't know which instance those
log lines were coming from, or we didn't have request tracing, or if we didn't have an easy way of restarting our services, then you would run into issues. And we had learned those lessons the hard way.
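(A minimal sketch of that kind of log standardization, assuming Python's standard logging module; the service name and field names are hypothetical. Every line carries the instance and a request ID, so you can tell where a log line came from and follow one request across lines.)

```python
# Every log line carries the instance and a request ID, so at 2am you can
# tell where a line came from and trace one request through the logs.
import logging
import socket
import uuid

logging.basicConfig(
    format="%(asctime)s %(levelname)s instance=%(instance)s "
           "request_id=%(request_id)s %(message)s",
    level=logging.INFO,
)
log = logging.getLogger("payments-ledger")  # hypothetical service name

def handle_request() -> None:
    # In practice the instance would come from host metadata and the request
    # ID from an incoming trace header; hardcoded sources keep the sketch small.
    ctx = {"instance": socket.gethostname(), "request_id": str(uuid.uuid4())}
    log.info("payment recorded", extra=ctx)

handle_request()
```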
But the problem was, how do you evangelize that and get other people to care
about those things? And a production readiness checklist is a way of evangelizing that. But unfortunately, I think engineering leadership has to be involved, part of it being, hey, this is important, we have to care about these things. So I think, especially as the organization gets bigger, it just becomes more and more important.
And maybe we can talk a little more tactically, right? Like, what do good services look like?
What have you seen, you know, requests from customers on, you know, please integrate with
this particular metrics pipeline,
or can we get metrics from this service because we think that constitutes making a service better?
You were talking about making sure that we run Snyk vulnerability checks, for example.
What are people doing nowadays to basically say that this service is running well versus not?
It very much aligns with how we've been seeing customers doing this. And I think one
of the common patterns we see, and I had done the same thing in my production readiness checklist, is that it's broken down into categories, into the different phases of the life cycle of the
microservice. So the first phase is like development, when you're actually building your service,
are you building it the right way? And so that includes things like, am I containerized? Or am I still
using some old platform? Am I, you know, running automated unit tests as part of my CI suite? Am I
using the right CI suite? Do I have a readme file? Am I using the right framework? Am I using the
right package versions, just like basic things around like, am I building this thing in the
right way so that once I get into production, it's going to be okay. And so standardization
starts from there. The next piece is: okay, my development maturity, as we call it, is good.
I've built it the right way, but now I'm ready to go into production.
Am I ready?
That's kind of the next step.
And so that is very much around things like, do you have an on-call rotation?
Does your on-call rotation have alerting enabled?
Does your on-call rotation have escalation tiers?
Do you have runbooks?
Do you have dashboards?
Are you tracking the right metrics?
Do you have ownership?
Do you have accountability?
Do you have Slack channels?
Like all the things where something's on fire, what do I do now?
That's production readiness.
And so that can even be further broken down into some subsections.
Production readiness can include things like security.
Has your service been secured?
Are you running vulnerability scans?
Observability stuff. Are you tracking the right
metrics? Do you have the Datadog agent
set up? Are you pushing metrics to the right places?
Are you using the right observability tools?
Do you have a Sentry project for your service?
That can be the observability
piece. And then finally
it's around the post-production
piece, which is your service is in production.
Are you operating your service the right way? And so that can include things like
is your service triggering a bunch of alerts outside of business hours and waking people up?
Do you have tons of compliance-ish tickets open in Jira
that you're never closing out? Because that's an indicator of maybe your team doesn't have time
and you're just stretched super thin, or your process has a hole in it.
And so there's this concept of post-production, like operational maturity. And so I break that up into operational readiness and operational maturity: are you operating it the right way?
And so those are kind of like the three main buckets that we see, like most customers setting
up. And obviously, there are a lot of ad hoc things, like, hey, we're trying to migrate from one version to another, or we're trying to get everyone to move onto Kubernetes. Those are ad hoc things. But these are the main things that we see running all the time: am I building it the right way? Am I building it to be ready for production? Am I
operating it the right way? And those kind of have their subcategories.
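(Condensed into config form, the three buckets described above might look something like this hypothetical sketch; the item names are illustrative.)

```yaml
# Hypothetical summary of the three checklist phases described above.
development-maturity:       # am I building it the right way?
  - containerized
  - unit-tests-in-ci
  - readme-present
  - approved-framework-and-package-versions
production-readiness:       # am I ready to go to production?
  - oncall-rotation-with-alerting-and-escalation-tiers
  - runbooks-and-dashboards
  - security: [vulnerability-scans]
  - observability: [datadog-agent, sentry-project]
operational-maturity:       # am I operating it the right way?
  - no-off-hours-alert-storms
  - no-stale-compliance-tickets
```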
Okay. Are there any interesting technological trends you've seen? Like, for example, I've seen that people are just
using more and more things like Fargate and Kubernetes rather than, you know, even trying to spin up services on EC2, which I think makes sense to me.
It removes like a whole class of problems of like, do I need to containerize or not? Because you can
just start with that. Are you seeing anything interesting just through your customers or just
generally? So the things that we're seeing are Kubernetes, like you said, is becoming huge. I
think most of our customers are on Kubernetes already, which is interesting.
It's not something that we expected.
I think there is a lot of focus on, surprisingly, a lot of people care about Jira metrics.
And that is kind of around post-mortem tickets and SEV1, SEV2 tickets, things like that,
which I think reflects people realizing that there is engineering outside of the dev tools themselves. You know, unfortunately, developers have a love-hate relationship with Jira, but it's part of the engineering process.
I think that's been really interesting to see how people see that now as like an engineering
quality metric, and not just like an engineering productivity metric.
And so that's been interesting.
I think templatization, you know, for creating services, has seen a resurgence. There was a lot of talk about it a few years ago, but I think it's really exploded now; a lot more people are investing in it, which has been super interesting, and part of the reason why we built that feature out is that a lot of people were asking for it. I think those are the big things that we're seeing
as commonalities or trends.
I think another common thing has been like configuration as code and not just for like
your deployments, but even for like vendors and tools like that.
Like a lot of people ask us like, can we configure Cortex through, you know, Git?
Can we do GitOps for all of our stuff?
And that's a common pattern.
And I think that's kind of related to the the SRE and infra platform teams that are driving
some of these initiatives.
And I think they really value GitOps and version control and all that kind of stuff for these
things.
So I think that's been really interesting to see a lot of people using that as well.
Okay.
And I noticed that Cortex can be used on the cloud and also be self-hosted.
Have you seen people self-hosting more over time or less?
The traditional wisdom is that people are just using SaaS tools,
but like what's been your experience?
Yeah, that one was interesting, partially because we're a startup, you know; there are questions on security and stuff. And so we just got our SOC 2 certification, which is its own whole can of worms, but I think that's going to make it easier for people to go to the cloud.
But I think for a lot of organizations,
they have a lot of tools
that they run on-prem. And that
actually is another trend that we've seen.
A lot of people are running GitLab on-prem that
they don't want to expose to the public internet. Their Kubernetes clusters
are obviously internal. They're running
Bugsnag on-prem, things like that.
And so a tool like this, which is
kind of like a hub for all of your integrations,
needs to be able to talk to all those things. And if it can't talk
to it, then it's useless. And so a lot of those customers who are running
highly sensitive environments end up going with the self-hosted model. And I think that has been
almost like a secret weapon for us in terms of being able to support those companies. And I know
like the trend has been, you know, moving towards the cloud, but I don't think we would have been
where we are had we not kind of bitten the bullet and gone into the on-prem world earlier.
Was it hard to just build that out? Or was it not as hard as people make it sound?
It's a mixed bag. Getting started with on-prem was not difficult, because we had containerized everything, because we were using Google App Engine for deployments. And so we already had everything in Docker, and we had to wrap that up in a Helm chart that people could deploy into Kubernetes. That part was pretty
straightforward. All things considered, you know, at the time, I didn't know anything about Helm. So
I was like, let's kind of learn Helm and figure that out, which is its own story. But overall,
I would say it's pretty easy. What has been difficult, though, has been all the other stuff
around once you're already deployed in a customer's environment, which is like, how do you
help them debug things? I mean, how do you get logs? How do you get metrics? You know, we just went out and fundraised; how do we know if there's one person using it or 500 people using it?
And we don't have that visibility in like on-prem environments.
So it's all of these like operational things that have been more difficult than the actual
like productionalizing our product and like making it deployable on-prem.
That part was easy.
It's everything else that is much, much harder.
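(For context, the packaging step described above amounts to something like the following minimal, hypothetical Helm chart layout: the already-containerized app plus a values file customers override for their environment. The names are placeholders, not Cortex's actual chart.)

```yaml
# Hypothetical minimal chart wrapping the already-containerized app so a
# customer can `helm install` it into their own Kubernetes cluster.
# chart/Chart.yaml
apiVersion: v2
name: cortex-onprem
version: 0.1.0
---
# chart/values.yaml: the knobs a customer overrides for their environment
image:
  repository: registry.customer.internal/cortex   # their private registry
  tag: "1.0.0"
ingress:
  enabled: false    # many customers keep the tool off the public internet
```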
Cool. And maybe one philosophical question: you mentioned a lot about what service best practices are and how people are thinking about them. Do you feel like it would be helpful if, you know, developer education was a bigger thing and we learned some of these things in college, versus learning them through random blogs that people are writing online?
Like, do you have any thoughts on that at all?
I think it is valuable.
Like, looking back now: at the time, my software engineering class had a lot of design patterns and stuff.
And at the time I was like,
you know, do I really care about this stuff?
Like when am I ever going to use the factory pattern?
And, you know, we're writing Cortex
and all of a sudden I kind of catch myself
writing an API client factory. I was like, ah, okay, it is valuable after all. And as enterprise-y as it sounds, you know, these things have been designed for a reason; they add value. And so I think there is some value in having exposure to things like this, and I think
there are some things that you can only get through experience. So like why logging a certain way is important,
but I think understanding just basic things around logging, like, okay,
maybe I'm not going to learn like how to structure my logs and where it
should go and all that stuff. But like, Hey,
like you should be logging because you know, when something goes wrong,
you want to be able to see like in real time, like this is what happened.
These are the sequence of steps.
And so I can debug it because I feel like they don't teach you that. And I know when I was debugging things,
I just did, like, print "hello," and said, okay, this is where I am, you know; I didn't use a debugger. I was one of those annoying developers like that. But, you know, that as a concept is something I wish I had learned: observability, monitoring, thinking about the quality of your software is
super important. Because you run into folks a lot who are really good at coding, but they're not great at software engineering, and those are two very distinct concepts, I would say. And so I wish there was more emphasis on software engineering: how do you design things, because things can break; how do you design your software so it's adaptable to change? And, you know, things like designing APIs, very simple, that's never going to change; people are going to be designing APIs forever. I feel like that's something
colleges could teach, because it's a basic skill. Now, maybe you don't have to teach GraphQL. But
like, what is an API? How do you design it? What are some, you know, pros and cons of designing
things a certain way? What is REST? Why do you do logging? What does telemetry mean? And so I think
these are concepts that exist forever.
And so I think those things should definitely be taught, even just to have familiarity. And, to be more philosophical about education, I think for me, some things stayed with me and some things kind of went over my head. But even the things that went over my head in college, I had some vague memory of; like, I kind of saw that, and I generally know what that looks like
or what that means.
And it gives me a good starting point to,
you know, to go off of.
And I think that is valuable.
Like even if developers don't have
like a very strong fundamental on those things,
having familiarity, I think is very valuable.
Okay.
And a wrap up question,
like what's your advice to the software engineer
who's embarking on their monolith to microservice migration today, right? Like, what would be your advice to you five years ago? What should someone think about?
Like I mentioned earlier, you want to show the organization that you are serious about this journey; you've got to do something big. So I would say that's step one. I would say step two is, you know, don't boil the ocean. Don't try to build all the infrastructure first. If you have a monolith that's running on Heroku, or, you know, an EC2 instance, or something like that, don't spend six weeks building a Kubernetes cluster and deploy pipelines and all this stuff. Get it out there, you know, with the bare minimum, and learn those lessons; a lot of this comes from experience. And that is much more valuable than
trying to build all this beautiful infrastructure around it. I think the third piece is telemetry,
telemetry, telemetry: make sure you have logging, you have, you know, APM or monitoring, because things will go wrong. And so make sure you have the ability to actually figure out what has happened. I think that is extremely valuable.
And then finally, document things.
Because if you are an individual developer or your team is the first one embarking on this mission, people have a lot of questions.
And you're almost kind of being like the trailblazer here.
And so help the organization understand what you've done.
So document the tools that you've used, you know, how to operate your service,
things like that. You are kind of being the torchbearer and the trailblazer for microservices, so do your part to be a part of that mission within the organization. I think those are the main things I wish I had known.
So it's similar to product development, like create a lean MVP,
measure and iterate pretty much, right?
Exactly.
Yeah.
Well, Ganesh, thank you so much for being a guest.
I think this was a lot of fun.
Thank you so much for having me.