The Changelog: Software Development, Open Source - Chasing the 9s (Interview)

Starting point is 00:00:00 This week on The Changelog, I'm talking to Marcin Kurc about chasing the nines. Marcin is the co-founder and CEO of Nobl9, where they build tools for managing service level objectives, also known as SLOs. We also talk about service level agreements, SLAs, service level indicators, SLIs, error budgets, monitoring, and how it all comes together to help teams align on goals, improve customer satisfaction, manage risks, increase transparency, and of course, a favorite around here, continuous improvement, Kaizen. Today's show is an awesome deep dive into the world of chasing those nines. I hope you enjoy it. A massive thank you to our friends and our partners at Fastly and Fly. Our pods are fast to download globally because Fastly, they are fast globally. Check them out at fastly.com. And our friends at Fly help us put our app and our database closer to users all over the world with no ops.

Starting point is 00:01:01 Learn more at fly.io. This episode is brought to you by our friends at Square. Develop on the platform that sellers trust. Here's what you can do with Square. You can bridge more experiences. You can build online, mobile, and in-person commerce experiences that connect more customers and sellers. You can build custom booking solutions. You can create and track orders. You can accept payments. You can manage and curate inventory. You can organize customers. You can manage employees. You can extend Square gift cards to your app. You can use Afterpay. And all this is powered by the world-class Square APIs and SDKs that enable you to build full-featured business apps for yourself or millions of Square sellers. So much is available as a Square Solutions partner.

Starting point is 00:01:53 Learn more and get started at changelog.com. Again, changelog.com. So, Marcin, you're at the head of a very cool acronym that is becoming more and more hot. I think SLOs are important, but I'm not really sure everybody understands what an SLO is. How often do you find yourself just simply starting a conversation describing that acronym and how that pertains to Nobl9? Yeah, that's a really good question. I would say when we started this company in 2019,

Starting point is 00:02:53 there were very few people understanding that acronym. And those were usually the SREs coming out of Google, Facebook, a few other companies, right? I would say probably within the past year and a half or so, it feels like it's becoming more of a mainstream. So I would say 50% of the time, maybe more people do understand what SLOs are. And surprisingly, a lot of those people also understand the application, the benefits,

Starting point is 00:03:20 and all the good things coming out of SLOs. So the market is definitely maturing, expanding, and the conversations we're having are definitely at the level that we can have a conversation. We come in without educating people and trying to push something on them, basically. So what is an SLO?

Starting point is 00:03:40 How do you describe it? What is an SLO? SLO is a service level objective. So for us and for most of our customers and prospects, this is a concept that helps them understand and build infrastructure or

Starting point is 00:03:55 applications to the level that allows them to operate in a way that customers are happy. So you got two different extremes. You got the extreme of, you know, building application or infrastructure that's 100% available. And I don't want to say it's impossible.

Starting point is 00:04:17 I'm sure some people will come out and say, of course, we do that. I don't think I want to go in that direction. And then you have the other extreme, which is things are constantly breaking and customers are not happy and leaving your application or your company and looking for other alternatives, right? So SLO is really about finding this sweet spot

Starting point is 00:04:40 between those two extremes where customers are not impacted, they're happy, they're not looking for different options, and you're not spending tons of money on, you know, trying to achieve the 100% availability. And I think chasing the nines is what we call it around here. Chasing the nines, right? I mean, we all want as many nines as possible, but I think they get infinitely more expensive and also potentially impossible to some degree to chase, like the six or the seven nines. It's just really, you know, five nines tend to be what most can adequately achieve.

Starting point is 00:05:11 Would you say? What nine do you chase? Yeah, that is pretty expensive at that point, right? Five nines is expensive too? Okay. Oh, yeah, it's expensive. I think, you know, three and a half nines, 99.95, right? Or four nines, 99.99. It's getting to that point where it's really, really hard, right? When you start calculating how many minutes it can be down per year, then you finally realize like, oh yeah, there's no way. There's no way. Right. However, right now, most people are thinking about the nines in terms of SLAs, right? And SLAs are a legal construct.
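The "minutes per year" arithmetic mentioned here can be sketched in a few lines. This is only an illustration: the function name `allowed_downtime_minutes` is made up for this example, and it assumes a 365-day year with availability measured purely as uptime over the whole year.

```python
# Minutes in a non-leap year (an assumption of this sketch).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be down and still meet the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


targets = [
    ("three nines", 99.9),
    ("three and a half nines", 99.95),
    ("four nines", 99.99),
    ("five nines", 99.999),
]

for label, target in targets:
    minutes = allowed_downtime_minutes(target)
    print(f"{label:>24} ({target}%): {minutes:8.1f} min/year")
```

Under these assumptions, four nines leaves roughly 53 minutes of downtime per year and five nines leaves a little over 5 minutes, which is why each extra nine gets so much harder to reach.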

Starting point is 00:05:51 Right. Agreement is in the word, right? Or in the acronym; it's the last letter of the acronym, at least. And there's five pages of, you know, what we're excluding from calculation of the nines and so forth, right? SLOs, on the other hand, are not that. It's true, real-time, very visible and transparent information to both you, internal customers, external customers, right?

Starting point is 00:06:15 So it's definitely a different concept. And achieving those without any exclusions or definitions around the legal calculation is definitely a much different concept. You can translate SLOs into SLAs. You can make your SLOs SLAs, but I would question how many people out there are already ready for that type of approach.

Starting point is 00:06:38 So measuring performance of a service, of an entire stack, whatever it might be, becomes infinitely more important as you begin to make the agreement rigid through an SLA. But SLOs allow teams to have that flexibility. I kind of think of it like an analogy: maybe a stick of bubble gum before you chew it is kind of the SLA, where it's sort of rigid, right? It will eventually become sort of mungible, so to speak, or flexible. And maybe the SLOs are, you know, the chewed up bubble gum.

Starting point is 00:07:11 It's kind of like mushy and you can kind of move it around and it allows for imperfections. It's not that original thing, right? It gives you a chance to sort of have bugs because it's going to happen, right? Or have downtime or, you know, times in the day even when you've got more traffic

Starting point is 00:07:26 and maybe those SLOs or maybe, I don't know, you need to measure things essentially to give that flexibility to the system, especially to the level that software has become more and more complex. Very large systems, large monoliths, whatever you might have, entire services, microservices, APIs, all these things are moving parts. Latency alone and the often offender DNS, right? I mean, things just happen in systems that are complex. There you go. Yeah, this is a very important point, right?

Starting point is 00:07:58 It's not necessarily about something going down. In many cases, things are not going down, right? You've got a slowdown in delivery of services. Something else might happen. Latency is a fairly simple concept, but understanding how that latency is impacting your customers and your application, it's becoming complex, right? For example, another part of the SLO is error budget, right? You have this difference of how much of the error budget

Starting point is 00:08:24 you can burn before it becomes an issue and you're violating your SLO, right? The question is, like, how fast are you burning down that budget? If it's, you know, burning slowly, the impact on the customer is probably not very big, right? But when you start seeing things going down quite quickly, then you have a problem, right? That's when you start thinking about,

Starting point is 00:08:45 are you waking people up in the middle of the night? Are you failing over from region to region or infrastructure to infrastructure? Every single one of those operations is very, very costly. So it really helps you also understand how you should be acting and helps you really make those decisions in real time. So, I mean, with all the observability

Starting point is 00:09:04 that has been around the last five years, I want to say, I got to imagine that it's kind of easy or it should be easy to measure these things, but it's not. So at Nobl9, this is kind of what you do, right? That's your mission is to make measuring these things easier. How did you, you know, find this gap in the marketplace, so to speak, to form Nobl9? And what hole did you fill? Yeah, so my co-founder and I, we started a company before. It was around marketplaces and billing from old days when AWS showed up, disrupted software vendors with this crazy consumption billing and things like that.

Starting point is 00:09:45 And they've been struggling: How do I address that need for my customer? And how do I align with AWS and other cloud providers for that matter? You know, then the exit to Google. We find ourselves at Google, and day one, we start rewriting this application to handle Google levels of traffic and consumption. And that's how we really learned how Google operates, how Google sets goals, how Google operates on a daily basis, how they release software.

Starting point is 00:10:16 And of course, all the concepts around SRE were very, very interesting to us. But SLOs in particular, you know, we came to this conclusion that it's really, really hard to go into microservices, Kubernetes, and, you know, interconnected systems, not having SLOs to understand all the dependencies and impact of one service on another on the application. And then, you know,

Starting point is 00:10:39 with this constant push within the IT towards, you know, more of a business-oriented, business-driven decisions on the IT side, to us, the SLOs are really a very simple thing to correlate IT to business and vice versa. And for us, that was one of the biggest things that we figured, if we go into that world of Kubernetes

Starting point is 00:11:01 and microservices, that's going to be it. People will realize that they need SLOs to operate efficiently. It seems like a good negotiating tactic, too. Like, if you've got the rigidity of the SLA, which is like, okay, it's either black or white. It's a one or a zero, right? It's very binary in terms of like, did you or did you not? No? Okay, you're in breach. In terms of just simple contract terms, whether it's an internal team's contract or a customer contract, at some point you agree on an agreement of how things will work. But an SLO kind of gives you that, okay, well, how flexible can the system be? How flexible can we be to still achieve your goals, customer and or internal teams or whatever it might be?

Starting point is 00:11:45 That's a point of negotiation, right? It gives you that flexibility. Yeah, well, it gives you a point of negotiation and flexibility, but also gives you a better communication across the teams, right? You wouldn't believe how many times we come to a customer or prospect, sit down, and they keep telling us how much they love SLOs. They've been using SLOs for a while. And after a year or two years, they find out that their definition of SLOs within different teams are much different.

Starting point is 00:12:11 So the four nines for one team don't necessarily mean four nines for another. So with the complexity of today's systems, distributed systems, it's really, really hard to even define how we're looking at certain things, right? What is the degradation in service for me versus what is the degradation in service for you? And of course, there are levels that are just still amazing to me, although I'm not shocked, where people are finding out that, you know, there's this one service they take a dependency on and, you know, it's really running on the server under somebody's desk. I wouldn't imagine that still happens,

Starting point is 00:12:48 but it does. So getting those people to talk to each other and define those SLOs so everybody in the chain understands how they're getting affected is just amazing, right? And I think that the best conclusion out of most of those conversations

Starting point is 00:13:03 is looking at the legal contract in the SLA, a lot of people realize like, well, there's really no way for us to offer those five nines because we have a piece that's, you know, two nines somewhere in the chain, right? So the collaboration and understanding across organizations, across teams is very, very important. And that's really our focus. Okay. So we kind of know what SLOs are. We kind of know what they are used for. We kind of understand how they help teams effectively build and manage software and communicate and also

Starting point is 00:13:35 communicate and provide assurances to customers. How do they manifest? Like, is it a Google Doc? Since we're talking about Google. I guess pre-Nobl9, there was one way. And maybe now, you know, with the inception of your company and how you help organize these SLOs and, you know, pay attention to the observability of, or the data from different services. And how do you establish an SLO? How does it look in the world that's not Nobl9? And then how does Nobl9 sort of like make that a better feature for teams to sort of like aggregate them together and all that good stuff? How does it play out?
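The earlier point about a "two nines" piece somewhere in the chain can be made concrete with a small sketch. It assumes a request needs every component in the chain to succeed and that components fail independently, which real systems only approximate. The component names and numbers are invented for illustration.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a path that needs every component to be up.

    Assumes independent failures, so the availabilities multiply.
    """
    return prod(component_availabilities)


# Hypothetical dependency chain; the last entry is the weak link.
chain = {
    "api-gateway": 0.99999,
    "auth-provider": 0.9999,
    "server-under-a-desk": 0.99,
}

overall = serial_availability(chain.values())
weakest = min(chain.values())

print(f"end-to-end availability: {overall:.5%}")
print(f"weakest component:       {weakest:.5%}")
print(f"can promise five nines:  {overall >= 0.99999}")
```

Under these assumptions the end-to-end figure can never exceed the weakest component, so a chain containing a two-nines dependency cannot honestly back a five-nines promise.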

Starting point is 00:14:14 Yeah, so you're right. People have been doing SLOs in different ways. You know, spreadsheets. We still see a lot of people doing spreadsheets. And it kind of works, right? You know, at the end of the month, you process your data. And the application of that type of approach is fairly limited. But a lot of people use SLOs for planning.

Starting point is 00:14:34 So if you get this data, process that data on a monthly basis, then that's enough for you, right? You have a really good understanding of what happened, what maybe you should adjust, you know, the teams. What we do is we process information near real time, right? I want to say real time, but that's kind of hard as well.

Starting point is 00:14:54 And give you insight into what's happening, you know, when things are happening. So we give you really the knowledge to use those SLOs for planning, even if you process that, you know, monthly or weekly, but also give you the ability to act in certain situations in almost real time, right? So like I mentioned, if you have to fail over,

Starting point is 00:15:14 if you have to file a ticket or have an understanding if there is a huge impact happening right now to your customers. You know, if you get a signal that something is down, does it really mean that your customers are getting affected, right? It's, hey, the disk is down, right? Or is not responding. What does it really mean?

Starting point is 00:15:35 Are your customers impacted or not? So our focus from that perspective is really giving teams the ability to understand if there's something that they have to do right now, if there's something that's really affecting their customers and they have to wake up teams across the globe or fail over the application or roll back the code that was just pushed into production yesterday.
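One way to picture the difference between "something is down" and "customers are impacted" is to compute a service level indicator from what customers actually experienced and compare it with the objective. The sketch below is an assumption-laden illustration, not how any particular product does it: the `Request` type, the 300 ms latency threshold, and the 99.9% target are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool           # did the customer get a successful response?
    latency_ms: float  # how long the customer waited


def good_ratio(requests, max_latency_ms=300.0):
    """Fraction of requests that were both successful and fast enough."""
    if not requests:
        return 1.0  # no traffic means nobody was hurt
    good = sum(1 for r in requests if r.ok and r.latency_ms <= max_latency_ms)
    return good / len(requests)


def customers_impacted(requests, slo_target=0.999, max_latency_ms=300.0):
    """True when the measured SLI falls below the SLO target."""
    return good_ratio(requests, max_latency_ms) < slo_target


# Case 1: a disk alert is firing, but redundancy absorbed it.
disk_alert_firing = True
healthy_traffic = [Request(ok=True, latency_ms=80.0) for _ in range(10_000)]
print("disk alert:", disk_alert_firing,
      "| customers impacted:", customers_impacted(healthy_traffic))

# Case 2: nothing is "down", but 2% of requests are slow.
slow_traffic = ([Request(ok=True, latency_ms=80.0) for _ in range(9_800)]
                + [Request(ok=True, latency_ms=2_000.0) for _ in range(200)])
print("disk alert: False",
      "| customers impacted:", customers_impacted(slow_traffic))
```

The design choice here is to measure at the request level, so an infrastructure alert alone does not trigger a response and a quiet slowdown does not go unnoticed.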

Starting point is 00:15:58 So for us, that is key. And for most of our customers, they might start with simple things like using SLOs for planning, but they really quickly ramp up to use SLOs on a daily, hourly basis. This is their goal to take a look and understand how their customers are being impacted and how they should be responding to any given situation. How much does this overlap with incident management or just incidents at large? Like SLOs sort of like are an indicator, but they're not necessarily an incident. So there's a lot of players in that market and they have fairly similar but also different ways that they allow you to manage those incidents. And that's all about bringing people together

Starting point is 00:16:51 and start looking at things and maybe deploying templates or things. For us, it's really determining if there is an incident or there should be an incident declared, right? So it all has to do with the error budget, how much you're burning. In many situations, things happen,

Starting point is 00:17:10 but we allow you to, for example, open a ticket in Jira, so somebody can take a look at it at some point, you know, with a different level of severity. It doesn't have to be an incident. And if, based on the SLO configuration and the burn-down of your error budget, we determine that there's an incident, we integrate with incident response systems, right?

Starting point is 00:17:31 We'll open the incident and let you deal with that incident within that particular system. I asked that question because, you know, I'm looking at your integrations. Okay, well, if Nobl9 lets me, you know, pay attention to and define my SLOs, and this is like the agreement, basically, to the team, this is where we define it. There's a flow to define, as you said, your error budget, you know, kind of figure out where you're pulling your data from, what your data sources are. I gotta imagine at some point the next step might be an incident, but none of your integrations is an incident manager by any means. It's data sources.

Starting point is 00:18:09 You've got events and alerts, which may trigger. I suppose maybe you're throwing data into, say, Discord or Slack, and that triggers something else. But I didn't see the integration for the incident management part of it. And then you've got data exports, which is like, hey, how can we take this data with us and take it into a meeting or analyze it differently or munch it somewhere else? So we do integrate with incident management systems.

Starting point is 00:18:32 PagerDuty is one of them. ServiceNow is one of them. We've also done work with webhooks and push data to other systems out there, FireHydrant and a few others. But we try not to be in that space. It is a completely different space. You deal with those incidents in a very specific

Starting point is 00:18:51 way. We don't want to play in that space. For us, it's really focused on determining and understanding, based on the configuration of SLO, of course, when we should declare that incident, right? And that's our input into incident management systems or paging systems out there, for example.
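The idea of using the error budget and how fast it is burning to decide between "open a ticket" and "declare an incident" can be sketched roughly as follows. This is not Nobl9's logic: the function names and the thresholds (`ticket_at`, `incident_at`) are hypothetical, and burn rate is defined here in the common way as the observed error rate divided by the error rate the SLO allows.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1 - slo_target


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How fast the budget is being consumed in the observed window.

    1.0 means exactly on pace to use the whole budget over the SLO
    period; higher values mean the budget will run out early.
    """
    if total == 0:
        return 0.0
    return (bad / total) / error_budget(slo_target)


def route(rate: float, ticket_at: float = 1.0, incident_at: float = 10.0) -> str:
    """Map a burn rate to an action. Thresholds are illustrative only."""
    if rate >= incident_at:
        return "incident"  # page someone, consider failover or rollback
    if rate >= ticket_at:
        return "ticket"    # someone should look at it at some point
    return "none"


slo = 0.999
for bad in (0, 2, 50):
    rate = burn_rate(bad, total=1_000, slo_target=slo)
    print(f"{bad:>3} bad of 1000 -> burn rate {rate:5.1f} -> {route(rate)}")
```

A slow burn lands in the ticket queue, while a fast burn is handed to whatever incident or paging system the team already uses. In practice the thresholds and window lengths would come from the SLO configuration.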

Starting point is 00:19:11 Gotcha. I got to imagine that if you're using FireHydrant or incident.io or somebody else that's out in that space, and I'm familiar with those two because we've worked with them before, that PagerDuty might just trigger something in the incident management flows. Like say something happened, you know, this may trigger it. So I was just kind of curious because, I mean, like it's one thing to define and sort of track, but then something's got to happen, right? And maybe it's not an incident.

Starting point is 00:19:37 Like you said, maybe it's just, you know, outside of our normal range of our error budget. You know, it's just a percent or two beyond where we want it to be. And somebody just needs to put some eyeballs on it and it's not really an incident. But then in some cases, it might literally be downtime or way beyond the threshold and it's a more actionable thing, which if you really mince the incident management

Starting point is 00:19:59 or the incident word, some folks in that world will say, well, most things, if not all things, are incidents and we should track them because you've got to organize around it. And so it really becomes an orchestration of who should be involved in checking this out. Was it resolved? Not a catastrophic incident. Like small incidents are still incidents, basically. Tracking things, yes, of course. But we also integrate with Jira, ServiceNow,

Starting point is 00:20:24 as I mentioned. So opening a ticket for someone to look at it at some point with specific severity is one thing, but declaring an incident is, to us, it's a completely different concept. Somebody needs to declare that incident

Starting point is 00:20:39 because something happened at a certain level with a certain severity, and our customers are impacted beyond the point that we believe is what they should experience, right? So it's like calling, you know, fire department. I have an issue, but they might respond over the phone and tell you, deal with this in that way, whatever it might be, and they get tons of calls like that, right?

Starting point is 00:21:03 You got an extinguisher. Take care of yourself. Exactly. Or, you know, that might be somewhere on this fine line. Or should we go for that, right? But people call fire department with all kinds of crazy things. And a lot of those things are being handled on the phone, right? And it's some kind of advice. But when they fire up the engine,

Starting point is 00:21:23 that's where the incident is declared, right? And they operate within a completely different concept and framework and, you know, show up and work on that. So to us, those are the things where, like, yes, you know, you want to call it incident, but you're not responding immediately, then that's fine. It's still being tracked in JIRA or ServiceNow or whatever it might be, and there's record of that, that people look into it.

Starting point is 00:21:45 And this kind of operates within that whole SRE concept. SREs are there to make systems better, right? So you know something happened, you found out that there's ability or opportunity for optimization. And then you go and figure out when you can prioritize those things, when you're going to do one thing versus the other, because there's always not one thing, right? There are multiple different things that you have to address. So that's kind of how we deal with this. And there are a lot of those opportunities for optimizations, changes, fixes, but they're not necessarily ready to be done right now and getting the entire team just to shift their direction to work on that. Take me a little further into this world before Nobl9. It seems like if you, I mean, how were people doing this beforehand?

Starting point is 00:22:35 You mentioned spreadsheets. Was it just that simple in most cases? Were there any other systems built around this? Like, do you have customers, you know, I know that you had an acquisition by Google and you sort of learned and did these things as part of that. But, you know, what was the world like before you sort of organized it better?

Starting point is 00:22:54 It's a really good question. I don't really, from my experience with prospects or customers, I don't go before spreadsheets. My question is, was there anything in there, really? Right. Something happened, especially if you have a monolithic application. Well, now we know something is not working, and we have to go and figure out how we're going to manage this.

Starting point is 00:23:16 Detecting issues like that within a monolithic application is much easier, right? A lot of large enterprise customers just began their journey to the cloud, right? So they had full control over their systems and, you know, everything is running on this big, you know, Sun Fire system or whatever it

Starting point is 00:23:33 might be. You know, you approach those things in a different way. It's kind of like, you know, when VMware showed up many, many, many years ago, right? They changed how enterprises operated. They changed how enterprises, you know, accounted for systems, managed the systems,

Starting point is 00:23:50 alerted on systems. And I think right now with microservices, Kubernetes, and all the little pieces coming into play, I think that's, you know, exponentially bigger issue than what we saw with VMware, right, coming into play. So we're just at the beginning of the evolution going from what we know, how we manage our systems, into something

Starting point is 00:24:12 completely different. And I think one of the biggest elements of this play is taking dependencies on completely external systems where we have absolutely zero understanding how they operate, right? It's, you know, a lot of organizations out there are using Okta,

Starting point is 00:24:27 for example, right? A lot of organizations are using similar systems like that, maybe databases. They have no way to see how things operate. So we actually have a lot of customers or prospects coming to us telling us that they need to

Starting point is 00:24:43 implement something because their customers don't really trust them, how they define the SLAs. They are asking questions like, okay, great, but how did you architect your application? So that gives me a little bit of assurance that you built it the right way, and I can expect that your system is going to operate. Because maybe your SLA

Starting point is 00:25:05 is, you know, five nines. That's great. Then I'm going to spend a year integrating, doing things. And then it's, you know, it kind of starts

Starting point is 00:25:11 going down every week. That's a big issue. Your customers think that's your problem and you have a dependency on this outside system that you can't really influence and you don't know

Starting point is 00:25:21 how it's operating. Just kind of think about it in very similar terms as what happened to security many years ago, right? 10, 20 years ago. We used to go on websites and buy things because it had this little logo that says, you know, trust me, I'm super secure.

Starting point is 00:25:35 Just do it, right? Well, there were many different things that we had and people used to do that, right? And now you cannot, nobody's going to do business with you unless you adhere to certain frameworks and certain certifications and so forth. And, you know, from a reliability perspective, we're really getting close to a very similar approach.

Starting point is 00:25:56 Tell me how you architected your systems, how I can trust you that you did the right thing. You build this on AWS, that's great, but, like, is it multi-region? Is it, you know, I need to see some data that really gives me a good idea

Starting point is 00:26:08 or comfort that I can make a big investment because, you know, enterprise is not going to go there and say, oh, three months later

Starting point is 00:26:15 we can just switch all of our systems to something different. That doesn't happen, right? So it's a big, big investment. So with the introductions of SLOs or just maybe better orchestration and formation of them and monitoring them, how does that world change then?

Starting point is 00:26:32 So you can take on, say, maybe a loose cannon, so to speak, or just something that's less reliable and you have just better thresholds on that? You have better observability of the actual performance of that for you within certain ranges? It's all about transparency. So we have a few customers that use SLOs and they expose those SLOs to their customers to sell them a higher availability system or higher assurance reliability system, right? If you pay X, you're getting this shared system that everybody's using and, you know, it's been great, but we give you the accesses from that perspective. You're getting your SLAs.

Starting point is 00:27:14 However, they have a higher level service that costs more, but they provide a very transparent SLO so the customer can actually see if they're performing to the SLOs that they define. And some of them even go to the point where they will do SLOs per customer.

Starting point is 00:27:33 As you can imagine, that's a more expensive thing, of course. But they will custom tailor that system to provide the performance that the customer is asking for and they very transparently provide you with the data to back it up. Yeah. This is interesting what you're talking about because this is a sales tactic essentially. It's a value add in this case. Having two different tiers. Here's the one that has better objectives, maybe better assurances, etc. Or just something that we're paying attention to more and therefore it costs more.

Starting point is 00:28:07 But here's the one that's sort of the on-ramp for, you know, the lower level customers, who are still amazing customers. It's just this is the one we, you know, we give fewer nines to, we give fewer assurances to. And it's cheaper because it gets you in the door, it gets you using a product or whatever it might be. And then you determine if it's viable and if you actually need high assurances, high availability, et cetera, well, then you naturally graduate. And of course you pay more because that's great assurances to have. I love that. Yeah.

Starting point is 00:28:34 How many people know about this? I mean, are people doing that a lot with different plans? Like, can you go to X, Y, Z service provider? And you're seeing that more and more, people communicating these SLOs? We've got a few customers. I would probably say somewhere between 10 and 20 percent of our customers are either there, they implemented that type of offering, or they're working on it. So it's starting. Can you share any names or speak behind? Unfortunately, I can't. No customer names?

Starting point is 00:29:06 Sorry. Well, you know, it's a new concept. And yeah, we're working with them to help them build that out. But I would say that those were their concepts, their ideas. They got inquiries from their customers to provide that type of service. Well, I'll tell you one name and you don't have to say anything. I'll say it because it's on your website. I'm so glad they're your customer.

Starting point is 00:29:30 If this is true, it's Ticketmaster because I can't get my T-Swift tickets. I can't get my other tickets. I need to get these tickets, Ticketmaster. Come on, SLO. Anyways, I can imagine that's got to be somewhere in there. Well, I've heard that the ticket sale went much better than some other ticket sales. Yeah. Oh, is that right?

Starting point is 00:29:49 Well, maybe that's a good thing. I didn't hear any news about this, so maybe it went better, but yeah. Right. Gosh, the world would be on fire if you couldn't get your Beyonce tickets. Oh, yes. Oh, yeah. I just bought some Jerry Seinfeld tickets here in Austin via Ticketmaster. Had no problem, thankfully.

Starting point is 00:30:04 Jerry Seinfeld is a little less popular than, say, Taylor Swift or Beyonce, but still cool. Still cool. I agree. I missed his performance in Santa Barbara a month ago or so. You know, when I did some initial research on this, I like to go to a couple of different sources. One that sort of is an easy button, but not everybody goes there for their first search. And it's YouTube. And the reason why I go there is because I'm a premium YouTube user. I cannot stand advertisements on YouTube.

Starting point is 00:30:31 They're just terrible. I don't mind good ads. I hate bad ads. Yeah. But I go to YouTube and I search SLOs and I start to get educated on SLOs and who's using them, who's talking about them and whatnot. And it's mostly Google and then you. Right?

Starting point is 00:30:46 Okay. So like the results were Google, Google, Google, and then Noble 9. And I think it was a 90-second video. It was like SLOs in 90 seconds. So one, I would optimize more for maybe improving that video or doing a follow-up that's better because the audio quality wasn't super amazing. But you did commit to your objective. There you go, which was 90 seconds. So congratulations on that. But I mean, it feels like this is an enterprise problem coming down to everyday

Starting point is 00:31:15 applications. Would you agree with that? Like where's the maturity with SLOs? They're becoming more known. You're about a year or so into this more well-known space. But what's the maturity level of teams truly leveraging SLOs to their advantage? So first of all, interesting, I got to go do the YouTube search because that's definitely not something that we see in real life. I think the situation is that, you know, Google definitely has been pushing the concepts for a long time and they have teams that just focus on that 100%. But within engagements that we have outside, there are a couple other companies that focus on SLOs, but every single monitoring company or observability company out there

Starting point is 00:31:59 has got some kind of solution or something to say about SLOs. And that's really like the real life situation for us. Dynatrace, Datadog, New Relic. I mean, everybody else, right? Just name them. So the real life, I guess, it's a little different than YouTube. And then maturity, where is it?

Starting point is 00:32:18 You know, it's our point of view. Like we haven't really done like a huge market, you know, research. And we've had conversations with a number of analysts and they of course agree that the market is maturing. People understand how SLOs help them run their business on the base of our

Starting point is 00:32:38 customers. You mentioned one or two. Their SLOs are becoming the core of the operation. I would say that way. One of our customers called it tier zero of observability that helps them really bring it all together, allows them to see different teams and different operations

Starting point is 00:33:00 at the same level, right? It's the same reference point, I would say. So you don't have this issue where you have four nines that are completely differently defined versus three nines and so forth. And you really get a good idea of, you know, how things are performing, where you take dependencies, what they can offer. And then finally, a lot of customers, I would say probably every single one of our customers is using SLOs for planning.
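[Editor's sketch] To make the "same reference point" idea concrete, here is a minimal illustration of what a number of nines means as allowed downtime. The function name and the 30-day window are assumptions made for this example, not anything stated in the conversation.

```python
def allowed_downtime_minutes(nines: int, window_days: int = 30) -> float:
    """Minutes of full downtime permitted in the window for a given number of nines.

    Three nines means 99.9% availability, so 1/1000 of the window may be down.
    The 30-day default window is an assumption for illustration.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    window_minutes = window_days * 24 * 60
    return window_minutes / 10 ** nines


for n in (2, 3, 4):
    print(f"{n} nines over 30 days: {allowed_downtime_minutes(n)} minutes")
```

Two teams that both claim "three nines" only share a reference point if they also agree on the window and on what counts as downtime, which is the comparison problem described above.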

Starting point is 00:33:27 And sometimes it's as simple as, you know, if somebody shows up and says, I need another $5 million to spend on AWS. The question is like, why? Well, we're running out of capacity. And that's usually where the conversation ends, right? Now, SLOs really enable you to provide a better insight into what needs to happen. Do we have an issue with capacity on the cloud provider? Do we have an issue with our application hitting limits? Do we have an issue of this monolith that cannot scale anymore?

Starting point is 00:33:58 And we have to figure out how we really transition to something different. It really helps people to understand how the teams are performing too. You're sometimes pushing out features because everybody gets promoted on features, not on maintenance. And you start seeing degradation of your service, degradation of your customer experience. So you need to start thinking about

Starting point is 00:34:19 how we pull back, when do we pull back, how much do we pull back. We want to stay competitive, but we don't want to get our system to break every hour, right? So a lot of those concepts, like the more people are using SLOs, the more mature they get with it very, very quickly. Is this kind of where your service health dashboard comes into play, where you can sort of see at a glance what you have sort of tracked, I suppose, within

Starting point is 00:34:43 Noble 9, but you have them sort of organized and they're color coded. Well, this one's green and this one's red. I'm assuming maybe there's a yellow or potentially, but it's something like where this is like sort of in a degraded state and it's not quite red, but it's getting close to red. Like, is that where something like this comes into play where you can sort of see at a glance where things are playing? Yeah, for the organizations that are looking across, definitely.

Starting point is 00:35:06 It's one of those things that gives them a very quick idea of what's happening and they can drill down. And sometimes for teams, if they operate multiple different services or they monitor multiple different inputs into their SLAs,

Starting point is 00:35:22 that becomes also very interesting and very needed. But like any dashboard of this type, you know, it's a quick view of what's going on and how we can quickly get to the root of the problem, for example. Interesting. Okay, so reactionary, of course, because you've got integrations to PagerDuty so you can fire off incidents.

Starting point is 00:35:41 But then planning, I've got to imagine, is a big one. Like you had said before, if you want to expand your spend with AWS or GCP or what have you, and you don't have any data besides, you know, we just need it. Like this sort of fills that gap of like, okay, why do you need it? More data is always good. What is your plan then with Noble 9? What is the big dream, so to speak?

Starting point is 00:36:03 It seems like you're in the early innings, and this is, I don't want to say what you build is not amazing, but it seems pretty simple, right? Track some objectives, establish some communication with your team, give yourself a dashboard and then integrate with, you know, the necessary players in the field, whether it's Datadog or PagerDuty or, you know, the different data warehouses and whatnot.

Starting point is 00:36:26 What's next? What's the next big thing for you all? First of all, I would say that, yeah, most good software is simple, right? That's the whole point. For sure. It's solving a complex problem. That's what I was trying to caveat with. This is not a negative simple.

Starting point is 00:36:39 It seems pretty straightforward. This is, you know, you kind of got into the easy button for the most part. That's what I'm trying to say. No, of course, of course. So that was a huge focus for us because dealing with those problems, it's not easy. You know, finding a reference point for multiple different data sources, right? Everybody's doing things in a different way. And then customers store a lot of data in databases, right? Just pulling all that

Starting point is 00:37:05 information together and allowing people to have it in a simple view is super complex, right? And a lot of people have already tried. A lot of people failed and a lot of them

Starting point is 00:37:15 are on version 2, 3, and maybe 4. So for us, you know, yes, this is the beginnings. I feel like we built a very strong base platform and now we have at least

Starting point is 00:37:30 two years of roadmap to build features that help you consume information easier, help you share information easier, collaborate on the platform, mostly focus on that. I think the big dream is, you know, pushing it in the direction of business data, right?

Starting point is 00:37:50 The whole concept of IT operates against business goals. How do we start bringing that information together and, you know, helping people on both sides understand the inputs and outputs much better, right? So you have the business people like, all right, why did we just lose our margins, right? Because we're spending $20 million more on infrastructure. That just happened because we needed capacity, right? And on the IT side, of course, you know,

Starting point is 00:38:17 what are our goals in terms of, you know, customer growth, customer satisfaction, migration? You know, that's a big thing for us. People migrating from on-prem to cloud, as I mentioned, they have a full understanding of what they have versus a very small part of what they can understand and change and configure. So migrating with this reference point of where you are today,

Starting point is 00:38:40 it's a big issue. You've probably heard a lot of stories of like, oh, we migrated to cloud, saving no money. As a matter of fact, we're spending more money. Our applications are not performing better. We have more issues, blah, blah, blah. You know, that's the standard list of things, right? So now you have a better understanding of where you are, how you're going to measure those things, because maybe sometimes you just don't see the benefit, right? Or maybe sometimes somebody did things in the wrong way,

Starting point is 00:39:06 configured it incorrectly, and now you feel like all your two years of work of migrating applications went nowhere. You're in the worst situation. So there's a lot of that happening for us as well. So let's paint a picture then. So imagine somebody's listening to this and they're like, you know what, okay.

Starting point is 00:39:21 We've done SLOs in the spreadsheet way. We've tracked them to some degree behind the scenes. We've been, you know, willy-nilly about it. We've done some things, but not to the level that this would do. What does it take to get started? Like, what is the initial conversation? Is it a conversation with the team? Okay, these are the services we have. This is the data we want to track. This is how we want to measure things. And how does that manifest into actually having SLOs in place? What's the time frame from

Starting point is 00:39:47 I want to do it to you've got it in production to actually have an objective? So this is a great situation for us. No question. You are doing SLOs in whatever way. You're already sold on the concept. Your teams are in some way bought into this thing or maybe forced to do this.

Starting point is 00:40:04 You never know, right? So you already are looking at certain inputs. You have those defined. We can very, very easily, probably within a day or two, configure you to be at the same point where you are with your spreadsheets. And then we have a number of tools

Starting point is 00:40:22 that help you build, configure SLOs in a very quick way. So at AWS re:Invent, we introduced Replay that allows you to bring data from all your systems for the past 90 days, 100 days, or a year, and then look at that data so you can start to understand what SLOs would make sense. And now we just released this thing called Analyzer that can use that data and suggest SLOs to you. Interesting.
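[Editor's sketch] The general idea of checking a candidate SLO against historical data can be illustrated as below. This is not how Replay or Analyzer work internally, which the conversation does not describe. The `DayCounts` type, the function name, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DayCounts:
    good: int   # requests that met the indicator's threshold that day
    total: int  # all requests observed that day


def error_budget_report(history, target):
    """Evaluate a candidate SLO target (e.g. 0.999) against historical counts."""
    good = sum(day.good for day in history)
    total = sum(day.total for day in history)
    if total == 0:
        raise ValueError("no traffic in the historical window")
    allowed_bad = (1 - target) * total   # the error budget, in requests
    actual_bad = total - good
    return {
        "attainment": good / total,
        "allowed_bad": allowed_bad,
        "actual_bad": actual_bad,
        # Fraction of the budget used; above 1.0 means the SLO would have been missed.
        "budget_consumed": actual_bad / allowed_bad if allowed_bad else float("inf"),
    }


# Hypothetical 90-day history: 89 quiet days plus one outage day.
history = [DayCounts(good=9995, total=10000)] * 89 + [DayCounts(good=9000, total=10000)]
print(error_budget_report(history, target=0.999))
```

With these made-up numbers, the single outage day alone uses more than the whole budget of a 99.9% target, which is the kind of signal that suggests either a looser target or reliability work before committing to it.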

Starting point is 00:40:55 So you can also, with the combination of Replay and Analyzer, you can set this SLO and with Replay, you can go back to your events, like you had an outage three months ago. You can look how your SLO would be affected and how your error budget would be burned so it gives you a good idea of how you should be acting. And of course, you can keep tuning those SLOs,

Starting point is 00:41:17 but we give you a number of tools, like I said, that are going to allow you to get operational within a week, I would say. But I think the biggest part that we bring to the table that's been very successful for most of our customers is SLOs as code. A lot of people are struggling with bringing in another thing, another concept.

Starting point is 00:41:37 With SLOs as code, you basically can get your teams or your developers to only deploy code with SLOs defined, right? So you don't have an SLO on this specific thing. The code is not getting checked in. We're not pushing it out. And that really helps all the organizations to have some kind of standard of like, okay, at least we have SLOs.
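[Editor's sketch] A gate of the kind described here, where code without an SLO does not ship, could look roughly like the check below. The dictionary layout for SLO definitions and the function names are assumptions for illustration and are not the Noble9 definition format.

```python
def services_missing_slos(services, slo_definitions):
    """Return the services that have no SLO definition, sorted by name."""
    covered = {slo["service"] for slo in slo_definitions}
    return sorted(name for name in services if name not in covered)


def gate_passes(services, slo_definitions):
    """A deploy gate: pass only when every service has at least one SLO."""
    return not services_missing_slos(services, slo_definitions)


# Hypothetical inputs a CI job might load from the repository.
services = ["checkout", "search", "login"]
slo_definitions = [
    {"service": "checkout", "indicator": "latency_under_300ms", "target": 0.999},
    {"service": "login", "indicator": "successful_logins", "target": 0.995},
]

print(services_missing_slos(services, slo_definitions))
print(gate_passes(services, slo_definitions))
```

In a real pipeline the job would exit non-zero when the gate fails, so the missing definition blocks the merge rather than being discovered after release.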

Starting point is 00:42:01 And then, you know, the tools I mentioned, you can play with them, you can tune them up, get to the point where, you know, it really benefits all the organizations. And I think that, you know, with a few teams, 90 days, it's most likely enough time to get it really tuned up and set up for the organizations. So in some cases, it may be, or most cases,

Starting point is 00:42:22 if you don't really have an idea of how to implement SLOs or where you might go, essentially use past data to predict to some degree with your analyzer and whatnot. Yeah. Is there a scenario where, I'm sure you have great content out there, and you've obviously got that 90-second YouTube that I mentioned, which is phenomenal as an on-ramp. You should definitely revisit that. Are you finding that while you also have a service, you also have to educate, have a consultant, so to speak? Do you have sales folks? I know there's some things where it's like,

Starting point is 00:42:56 you know what? I would use it if you demystified how I use it. Do you find that? What's the uphill battle here for SLOs? Sure, they make sense. But, like, getting people to buy into it, like, what is the selling point here? Yeah, so quite often we are in those situations. Not as much now. As I said, the market is more mature, but we run into those things. But, you know, there's this lead that's been hired into the organization,

Starting point is 00:43:21 either to create an SRE organization or implement SLOs or, in general, work on a strategy for observability. And they fully understand the benefits. But, of course, they have a number of teams that always have an excuse and different arguments. What we're doing is great, and there's no need. We've all been there, right? So for those situations, we do bootcamps. And those bootcamps could be anywhere from four hours to three days or even five days.

Starting point is 00:43:48 We can go through full training exercises. Like if we do the three-day, I believe, at the end of the whole bootcamp, you're coming out with your SLAs and, I mean, your SLIs and SLOs defined, implemented in a system, and you can start rolling from that perspective. If you need more with organizational adjustment changes, whatnot, we have a number of consulting partners, anywhere from boutique organizations to Accenture, Cognizant, that we've been working with.

Starting point is 00:44:16 So we can tailor an approach for an organization from anywhere. Hands-on, we send our SREs there. They help you out. They figure it out for you. All the way to, you know, for organization, onboarding and personal adjustments as well. You mentioned a new acronym there, SLI. What does that mean?

Starting point is 00:44:38 How does that play into SLOs? That's the input. You need your SLIs. Service level indicators, right? You pick those first. So those are the things that you want to use as signals for your SLOs. Okay.

Starting point is 00:44:53 It could be, you know, latency a couple of times, that's easy. You could be looking at, you know, number of logins or failed logins or things like that, that, you know, then you input into creating your SLO and you build your SLO based on the inputs. Gotcha.
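[Editor's sketch] The relationship described here, where an SLI is the measured input and an SLO is the target set on it, can be shown with the failed-logins example. The function names and the 99.5% target are assumptions for illustration.

```python
def login_success_sli(successful: int, failed: int):
    """SLI: fraction of login attempts that succeeded, or None with no traffic."""
    total = successful + failed
    if total == 0:
        return None
    return successful / total


def meets_slo(sli, target: float) -> bool:
    """SLO: the objective is met when the indicator is at or above the target."""
    return sli is not None and sli >= target


sli = login_success_sli(successful=990, failed=10)
print(sli, meets_slo(sli, target=0.995))
```

The same pattern applies to a latency indicator: count the requests faster than a threshold as good, divide by all requests, and compare the ratio with the target.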

Starting point is 00:45:11 How would you rate where you're at today in terms of, you know, market and product and things like that? Like, what are some things that you've done well and some things that you may have not done so well? Like, how would you rate yourself? Like, if you were a scale of zero to 10, zero being absolutely terrible, go home, stop, to 10,

Starting point is 00:45:30 you're knocking out of the park, keep going, more funding, go, go, go. Well, I think from a product perspective, with our first company, we made a lot of mistakes. A lot of them. I think my rating would be definitely under five in the first place.

Starting point is 00:45:45 Good honesty. I like it. I like the honesty. Yeah, so we had issues. And of course, that was also part of the reason why we started rewriting the system at Google the day we showed up, right? We knew it. We told them, they knew it. It was a whole concept. But we learned a lot from that, right?

Starting point is 00:46:01 We also learned a lot from working within Google product organizations. So I think, you know, from a building perspective, from an architecture perspective, performance, I think the product is somewhere around seven. When it comes to market, you know, this market's been changing a lot. And quite frankly, I know everybody experienced this. You know, we started in 2019. Then, of course, we had a pandemic a few months later.

Starting point is 00:46:26 Then other things happened, right? So I would say we had a good idea. We had a good idea, and we hope the market's going to develop in a certain way. But, of course, we made some missteps in terms of, you know, who we market to, how we message things. But that's kind of standard when it comes to a small organization. So we're constantly evolving there. We're somewhere around six, I would say, on message.

Starting point is 00:46:50 We just had this conversation yesterday, so we're adjusting the message, getting better in who we market to. But overall, like I said, I feel very confident with the product itself. I really focused on, you know, another thing that we didn't do when we worked with the previous company. We had remote teams,

Starting point is 00:47:09 we had teams in different countries, and, you know, there was, I don't think I put enough focus on culture, which is very, very important to me. And, you know, this time, that was a huge, huge thing to focus on from day one. And I think on culture, we're actually probably the highest.

Starting point is 00:47:23 I would rate us an eight on the culture. So given all those components, I think we're in a really good position to drive to be one of the top players in this space. Well, that's good because I think messaging is probably the one where everyone is always improving for sure. I think if you have culture in place or at least a good intention for culture, you've got a good foundation and therefore not so much easy, but it's easier with good culture and good team and good morale, et cetera, to build the right product. And then, you know, messaging is always sort of trailing, right? Like if the product is moving and especially being,

Starting point is 00:47:59 you know, like a new category, so to speak, in terms of SLOs, you know, I think it makes sense why your messaging is a little off because you're probably still learning who specifically is your customer. Because SLOs affect everybody, but not everybody buys them. Yeah, and our customers invent ways to use SLOs too. So that's interesting. There are a lot of very interesting use cases

Starting point is 00:48:17 that really come from our customers. So that plays into how we message as well. That's right. When you piqued my interest with, you know, leveraging SLOs as a product thing, you know, like how do you, you know, have product tiers? And that's really a chief revenue officer's opportunity potential. I mean, so how do you market to a CRO, for example, like with SLOs?

Starting point is 00:48:41 Well, hey, adopt SLOs and, you know, maybe you have healthier teams or a healthier product if you have tiers, one that's more expensive and more premium, and you can quantify the sell, for lack of better terms, with an SLO, right? I mean, that to me simplifies things. So your customer there is like product owners, chief revenue officers, potentially marketers, you know? So you're not really, you know, you're not really marketing to say, director of engineering in that case.

Starting point is 00:49:09 He probably cares a lot about SLOs. You know, that's where we started, of course, right? And they cared, as you said, but definitely things are expanding beyond that. And that was our hope from the beginning. Like I said, a lot of those things that happened in the world in the past three years reshaped many things in this business.

Starting point is 00:49:26 So we're trying to adjust as quickly as we can. What is it that keeps you up at night? Do you get good sleep? What are some of your healthy practices, you know, in terms of just like life? Do you let things like your, you know, in quotes, your day job, your baby, your company keep you up at night? Are there things that do keep you up at night? And if so, what are they? Do I look like I sleep well?

Starting point is 00:49:47 I don't know. Maybe, maybe you do, maybe you don't. I don't know. Well, there's always something, right? I think the one thing I learned with talking, it's been the fact that there are certain things you can affect and certain things you cannot affect, right? So if I wake up in the middle of the night,

Starting point is 00:50:04 it's usually with some idea to think through. I just had this revelation and I tried to solve this problem. Like, yes, I don't think that fear plays a role at all. There's always this, oh, let me take a step back and think about it, because I don't know if we're going in the right direction. And there's always a little bit of fear from that perspective, but I think it's more of a healthy fear, right? Check yourself if you're doing the right thing. But part of the reason I like being a startup is the fact that, I mean,

Starting point is 00:50:35 there's no shortage of issues that you have to solve on a daily basis. And that's what excites me. I like that. And you have a great team that thinks in a very, very similar way. So yeah, we love doing those things. We love building a company.

Starting point is 00:50:52 That's where the fun is. Even if sometimes we have a bad day and you have to check yourself and take a break, go for a walk, whatever you might do. But like I said, in general, the team is really, really good at supporting each other, liking the same things, driving in the same direction. That's the most important thing. I know I can fall back on certain people in the organization. Good. That's good for you. How much can you share about the horizon?

Starting point is 00:51:18 You know, what's just over the horizon or right at it? Like maybe something that not many people know about around Noble9 or SLOs or the next big thing. What can you share about the future? You know, I mentioned a couple of things for us pushing and focusing more on the business aspects,

Starting point is 00:51:33 relationship between business and IT, making SLOs easier to use. I know we push a number of tools to help customers do that, but that's one of the biggest themes for us. And yeah, you know, a few partnerships out there that I think are going to be very impactful. I'm super excited about those.

Starting point is 00:51:52 Huge investment, of course. But those are the next 12 months for sure. Gotcha. All right, anything else left unsaid? What did I not ask you that you're like, man, how do we miss this? Is there things that I just totally gapped? I don't know.

Starting point is 00:52:06 I really like the questions. You know, you did amazing research. I am really surprised. Great. Nice. Yeah, really like that. Like the questions.

Starting point is 00:52:18 I was just going through it. You know, we talked about where we are. How market gets impacted by SLOs, how people are using them. I can't really think of anything else that we missed.

Starting point is 00:52:29 Well, it's been fun having you here. Thank you so much for your time today. Appreciate the wild adventure into SLOs and all the ways they can be used. It's so cool. Big fan of the impact to teams and organizations, leveraging them the right ways. And good to see you and Noble9 really doing it right. So appreciate the time. Thank you very much.

Starting point is 00:52:47 It's an honor to be here. I appreciate the conversations. They're very, very good. Awesome. Thank you again. Thank you. Okay, SLOs, is your team using them? Are you using them?

Starting point is 00:53:00 If so, how are you using them? What benefits do you get from using them? We want to hear from you. Give us a shout in the comments. The link is in the show notes. Again, a massive thank you to our friends at Fastly, Fly, and also Typesense. And of course, the banging beats master himself, Breakmaster Cylinder. Those beats, they're banging. And of course, to you, thank you for listening. No bonus today, but still, we encourage you to become a Plus Plus subscriber. That's where you get the extended episodes, the bonus content, the deeper dives, the closer to the metal, the skip the ads section of the Changelog podcast universe. Check it out at changelog.com slash plus plus. But hey, that's it. The show is done. Thank you. Game on.
