PurePerformance - Unlocking the Power of Observability: Engineering Practices for Success with Toli Apostolidis

Episode Date: July 17, 2023

Are you frustrated with your team's ability to troubleshoot issues in production despite their proficiency in pushing out new builds? The root of this problem may lie in the absence of Observability-Driven Development. In our latest episode we are joined by Apostolis Apostolidis (also known as Toli) who - as Head of Engineering Practices at cinch - has spent his past years enabling teams to adopt the easiest path to value. He is passionate about DevOps and has a strong opinion on how to educate engineers on "Consciously Instrumenting Code for good Observability". Tune in to learn more about good engineering practices, building internal communities of practice, the benefits of traces over metrics and logs, and why we need to start adding observability to our CVs and LinkedIn profiles.

Here are all the relevant links we discussed in this episode:
Toli's website: https://www.toli.io/
Toli's LinkedIn profile: https://www.linkedin.com/in/apostolosapostolidis/
Toli on Twitter: https://twitter.com/apostolis09/
WTF is SRE talk on DevOps Meets Service Delivery: https://www.youtube.com/watch?v=nLrx0BCMl0Y
GOTO talk on EDA in Practice: https://www.youtube.com/watch?v=wM-dTroS0FA

Transcript
It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have with me my wonderful co-host, co-host, co-host, Andy Grabner. Hi Andy, how are you doing today? Good, but you just told a lie. Because not always am I with you when we record a podcast or the other way around.
You haven't been around last time. When I'm here, I always have my... I think there might have been one that I did without you. But yes, last week I couldn't make it for various reasons. But yeah, it's weird though. I've been trying to catch up on my sleep, Andy, and you know what happened? I've got another weird dream. Another weird dream. Yeah. I went to my barber to get a haircut. And my barber said, Hey,
he went, the usual? Without even looking around or seeing what, you know, everyone in the waiting room had, I just went in and said, yeah. He got the buzzers out and went really, really close to my head. And I was like, oh, this is too much. This is too much. And I look around and who's in the waiting room but you. And you're there with a really short cropped hair as well. And I'm like, Andy, I didn't realize what happened.
I didn't realize this was the usual. And you said to me, well, you didn't look around. You weren't observing what was going on. I'm giving you a weird voice. Now this is your Muppet voice today. You weren't observing and looking around to see what happened. So you got the wrong thing there.
And I guess I learned my lesson about observing when I first got into the place there and making it a best practice. Well, I think I will take your dream as life advice now because something similar happened to me just recently, funnily enough. No way. No way. Yeah, exactly. If people look close at the picture, they see me with very short hair.
But still, it's summer here, so it's all good. Brian, do you think we should keep talking about dreams and haircuts, or shall we actually try to get a little of an additional opinion on observability, on, I don't know, developer experience, on platform
Starting point is 00:02:36 engineering, and also what's around DevOps? What do you think? I think if we have a guest who obviously did look around and noticed and didn't get the usual because he observed properly, I think that would be a good idea. Perfect. Maybe you could tell us a thing or two about that. Maybe, yeah.
Starting point is 00:02:55 Well, with this, I would love to welcome to the show our guest of today, Apostolidis, but I think short, just Toli. Yeah, correct. That's it. Awesome. Toli, thank you so much for being on the show. The two of us, we met just a couple of weeks ago in London for a conference we both presented in. It was called WTF is SRE. You had a talk with another colleague of yours, a former colleague, I think, as I've learned,
Starting point is 00:03:25 you've just switched jobs. But you had a talk about when DevOps meets service delivery, which is a really great talk to watch, which I have a couple of questions later on. But for the audience, for our listeners, can you quickly introduce yourself, who you are, what you do, what gets you excited, what you're passionate about? Yeah, well, thanks for having me it was uh it was fun uh meeting you at the conference because uh
Starting point is 00:03:50 you mentioned observability and my antennas antennas went up and we couldn't stop talking for a while interestingly i did go to for a haircut today as well and um i made sure that the uh the person who gave me a haircut last time told the person who gave me a haircut today to keep the numbers right. So I'm right on. So my experience is I studied maths and mathematical physics at university and did a master's and then didn't know what to do. Wanted to do a PhD and ended up being hired as a mathematician in a company that built an optimization engine. So I was hired to write algorithms in code for my first seven years in software engineering. But as you know, with all these things,
Starting point is 00:04:37 the hard bits and the interesting bits are like five or 10% of the work. The rest of it is APIs and websites and schemas and all the databases and all the rest. So I learned a lot about software engineering there and then went into an energy company. And from then on, I kind of got involved into software and started understanding what it takes to be a software engineer. But then I had my first outage
Starting point is 00:05:06 where direct customers were calling customer support and started to realize, oh God, I need to know what's happening. But I couldn't know what's happening because only the ops people had access to the logs. I said, do you have logs? Didn't know you had logs. So then I started getting interested
Starting point is 00:05:23 in the DevOps movement quite late on, maybe in 2018, 2017, and started reading up about observability. So then moved to a company called Cinch in the UK that's an online secondhand car platform, a new generation of car platforms like Kavanaugh in the States. And I spent about three and a half years there building teams out, enabling a DevOps mindset, enabling an observability culture, and I learned a ton through that experience. And you'll probably hear me talking about DevOps and SRE, interestingly, and even the interaction between DevOps and more traditional practices, and also observability and event-driven architectures. So thanks for the reminder also about the event-driven architecture.
Starting point is 00:06:24 So folks, if you're listening in, if you want to see TOLI live on stage, we will be posting a couple of links you did besides WTF is SRE. You also did the GoTo. There's another YouTube video that you sent over. And obviously you have your own website, TOLI.io, where people can follow up on the stuff that you have created over the years. I remember when we were sitting, so we were in London, we were in the speaker room. We were sitting next to each other.
Starting point is 00:06:51 I think we were both preparing our slides and presentation. And then we started to talk. And you really said, yeah, observability. That's a key topic. And then when we followed up on that conversation, I said, hey, let's do a podcast together. You then came back with a couple of ideas on what you would like to discuss on the podcast based on your experience. And now reading through that list of what you presented back to me, what you would like to discuss, the first thing actually struck out to me, where you said, hey, we should think about observability as we did with testing.
Starting point is 00:07:27 And I assume, and correct me if I'm wrong, because it took us a long, long time from a testing perspective to educate developers to test-first, test-driven development. Is this what you have in mind with basically how can we get an observability-driven development mindset into engineers and how can we get this achieved? Yeah, absolutely. That was it.
Starting point is 00:07:45 So in my first job, I was a mathematician first and a software engineer second. We didn't have any testing. It was service-oriented architecture. So the architecture was really good. The people I was working with were super intelligent and super, super nice. But we only had input XML in, XML out tests, end-to-end, pure end-to-end tests.
Starting point is 00:08:08 We didn't have any unit tests or any kind of other types of tests. And that took time. And at the time in 2011, 2012, that was kind of not standard that you didn't have tests, but it was something that the industry was learning. So it took a long time to get to the point where now you go to interviews and candidates are embarrassed to say that they don't like TDD and whether you like TDD or not testing practices and testing techniques is something that every software engineer should have in their their capabilities their their experience but what people don't have is observability.
Starting point is 00:08:47 And that's okay because observability has really exploded within the software engineer role in the last three, four years at most. So when we started out at Cinch hiring in 2019, I would ask people, okay, so I would go through the whole software lifecycle. How do you go from code on your computer to code in production in front of customers? And a lot of the candidates would be really explicit, explaining how the whole process works, even CICD.
Starting point is 00:09:21 And the bit that missed, and towards the end of the whole example or the whole kind of story I asked well how do you know your code is working in production after that and I often got some very surprising answers for that and that wasn't to catch them out but more it was the first step into persuading them then when they start at cinch that's something they need to think about because i knew that a lot of people wouldn't be thinking about it but and i and i set it out as a as a goal that um when i finish my experience at cinch i'll know some people or i'll know people who have put observability on their CVs, observability tools, observability practices on their CVs,
Starting point is 00:10:06 so that the industry starts maturing in that. And the reason for that is, I think, it's more important that code works in production than it works on your laptop or on a non-prod environment. So observability, in a way, is more important because you want to have ways to understand whether the health of your business transactions is there or not. And that's where you need to focus attention. But I'm not saying that observability replaces testing, but I think
Starting point is 00:10:40 they're very, very complementary, but we put a lot of energy on testing and not enough in observability. I really like, I'm just taking some notes here. I really like what you just said. You said it matters more that your code runs in production than it runs on your laptop, right? Because in the end, that's really what matters is in case somebody makes the decision to actually push this into production.
Starting point is 00:11:02 Also, what I've noticed, and maybe you can give me a little bit of feedback here. I remember your presentation at GoToConference where you talked about Cinch, right? And how you decided to go with a serverless architecture and you basically had your, I think, six or seven different teams. You talked about the search team
Starting point is 00:11:24 and the catalog team that the what what did you call it um the product catalog yeah exactly and so you had individual serverless components and you said some some serverless functions were kind of more like like a service like a virtual service with different features did, when you set out to define those services, those serverless functions, define the definition of healthy, the definition of how do I know it actually runs successfully in production? Like the definition of done, a definition of is it observable and do I know what I expect from the system to be healthy in production? I think early on, what we did was that we had this premise that teams are autonomous and they build, ship and support their systems.
Starting point is 00:12:26 What that did was that made them think about the support part. And, you know, build and ship is something that we're getting better and better at and we're quite mature, but the supporting aspect is quite immature and even hard to do. So I think starting from that premise, we were able to empower the teams, empower the software developers to start thinking, okay, so if I've got a search service, how do I know that the search service is returning results that's benefiting the customer? How do I know that overall we are mostly returning results? Or how do I know if I've got too many scenarios where we're not returning any results? So once they start asking themselves those questions, they weren't set from outside, they start exploring, okay, so what tools do I have?
Starting point is 00:13:14 So we give them the tools. We give them a single tool, single observability tool, which is their platform of understanding. So you have your hosting platform, which is your cloud provider. In our case, it was AWS. But then you have your hosting platform, which is your cloud provider. In our case, it was AWS. But then you have your understanding platform, which is your observability tool.
Starting point is 00:13:31 And that's where you go and start exploring, how do I know whether this search service works? You start learning about instrumentation. You start custom instrumentation. You start learning about the various telemetry data types. And you start learning how to be curious. We can't tell them what to measure apart from them talking to the business or talking to,
Starting point is 00:13:57 business is not a great term, but talking to the product owners, talking to the stakeholders to understand what's important for them. But I think later on when we got more mature, as you saw in the talk at the conference we were at, we started becoming a bit more systematic about when we're launching a new service, one of the checklist points was have you done your observability due diligence? But early on, it was all about how can we persuade engineers to be curious? How can we help them and teach them and learn together how to use, how to instrument their code? And one of the big decisions I think that worked for us
Starting point is 00:14:40 and I think is really, really important is that we planted a software engineer in each team that was their task to learn and to enable the others to learn. So you would have your normal tech lead or team lead, but you would also have something called an automation engineer, which we had at the time. And they would be there saying, hey guys, what about, how will we know this will work in production? And they'd be like, oh, well, I don't know. And then what we suggested, well, maybe try, let's trace it. Let's add a custom tag. This is how you start creating a dashboard.
Starting point is 00:15:16 This is how you explore this data. And they're there day to day. And I think that was a catalyst. Do you think, and I know I remember this was the presentation, I think, that you gave it in London with your automation engineer in the middle, and then you had, I think you had different layers of engineers, but you call it an automation engineer. Would you explain to me, though? For me, it almost sounds like this is kind of the definition right now for an SRE. Isn't that what it is or not?
Starting point is 00:15:47 Yeah, so the title of that role is a bit unfortunate because we couldn't find a better role. Initially, it was DevOps engineer. We didn't want to call it DevOps engineer. So we shot ourselves in the foot a bit by calling it that because it was hard to hire externally because everyone thought it was a test automation role. But once you were in the company, we actually hired within the company quite a bit because
Starting point is 00:16:25 it's a role that learns a lot, but their scope is infrastructure as code, is CISD pipelines, and observability and monitoring. What we found, actually, is that they focused a lot on observability and monitoring because that's where the gaps were. Most software engineers we were hiring kind of knew how to do infrastructure as code, knew how to do pipelines, but they didn't really know how to do observability.
Starting point is 00:16:44 So that's where I think they focused most of their attention. But yeah, it's probably very similar to an SRE plus the rest, so plus the other parts of the stack. Yeah, and the reason why I bring it up right in the end, we know that the titles in the end matter for the outside world to understand or have an idea of what this is. In the end, whether you call it test automation, sorry, automation engineer or SRE or DevOps, if they're all doing the same. But if that's why I'm just asking, because people have a certain assumption, as you said, if you put put test automation engineer on that maybe the majority of people think about test automation and maybe that's something that they don't want but if you put it if you use a term that the industry
Starting point is 00:17:32 has already now coined right and say sre is somebody that is uh focusing on observability because observability allows them to build a system reliable and resilient because you can then use the observability data to then trigger, automate your runbooks and things like this. I was just curious. I think it's a very interesting topic, though, because I think in that scenario, we didn't want to hire X infrastructure engineers or we didn't want to hire X security engineers or X ops people. We actually wanted to hire software engineers
Starting point is 00:18:07 with some experience with coding and some experience with testing and that kind of things, so that they can learn all the practices that are needed for site reliability. And ultimately, they can persuade their peers, not because they're experienced or they're more senior, because they weren't, but because they can understand and write code and they have similar practices.
Starting point is 00:18:37 And so that's the angle we wanted to go with. Now, coming back to the teaching engineers on how to instrument their code, how to make it observable, a big challenge I think Brian that the two of us have seen over the years especially going all the way back when we started working for our company. We had a feature in the product and it did auto instrumentation, but we had a feature. We called it shotgun instrumentation, which meant you could instrument everything. It was awesome, right? For developers.
Starting point is 00:19:13 Turned into a profiler, yeah. Yeah, turned into a profiler. And that was 15 years ago, right? 15 years ago, we were able to distribute the tracing with every method on that you can think of. Obviously, with the drawback that you're collecting a ton of data that only a few people really need, and if this change then makes it into higher environments, you have a lot of overhead.
Starting point is 00:19:34 So I think the challenging thing what I see right now is with the big hype around open telemetry and open observability, it's great that we have these standards. But I think the challenge still is that we need to teach people, I guess, what level of instrumentation really makes sense so that you're not just collecting data because you can collect data, but you really collected the data that then helps you to then make the right call in case there's an outage, in case there's a problem, or in case to detect a bug in the CI system and you want to still have enough data.
Starting point is 00:20:09 So do you have any experience or any suggestions on how we can actually teach these best practices on how and what to instrument and what type of data we need? Because there's different ways how we collect observability data, right? Yeah, absolutely. It's a very interesting topic. Every CTO or every engineer directly you might speak to, they'll ask you, well, if we ingest everything, then how are we going to kind of keep a cap on the bill, basically? Because we know that observability is very expensive.
Starting point is 00:20:49 At Cinch, interestingly, observing the system was more expensive than hosting it because it was entirely serverless. It was a lot cheaper to host rather than the observability tool was way more expensive. And the reason for that is because we took a conscious decision that we're going to ingest everything and index everything liberally, so that we can have a higher chance that the teams and the software engineers will adopt the observability practice. So what I mean by this is, if as a software engineer, you want to understand how your system is behaving, and you want to answer a a question and you go and look at your observability tool,
Starting point is 00:21:47 observability platform, your understanding platform, and you don't see, you don't get any insights first time, second time, third time, you won't use it again. So the angle is as a company that wants to understand the business transactions, you want to actually ingest everything because you want to maximize the potential
Starting point is 00:22:07 for understanding something. But then you can't really ingest everything. So the shotgun option, I was actually thinking about that before this call. That's the extreme. But then, as you say, the volume would be too high. I think the middle ground is thinking about it as constructing a data set. So while you're writing code, you've got three things to think about.
Starting point is 00:22:34 Write the actual code, write your tests, and instrument your code with custom tags. And you have to think of those three, and you can think of them in any, in any kind of order you want. It doesn't matter. Like the TDD, the TDD aficionados can do testing first. I don't care. But what I do care is that you start thinking about, okay,
Starting point is 00:22:57 my code now is creating an order. What do I do in this case? I want to, have I added a custom tag of order ID? If the order is created successfully, have I added created true, for example? So my angle would be
Starting point is 00:23:14 teaching people to instrument their code consciously and thinking about what would be useful on the other side. More often than not, I get the answer, well, this this might not be useful so i'm not going to add it and my answer always is if it's potentially useful if you think if it's as a lot of people say is high cardinality and
Starting point is 00:23:38 you end up with high a lot of a high number of dimensions then you you're increasing your likelihood that you'll you'll be able to look through your data and understand what's happening when you need to. You don't know what that is now, but you will know at some point. So the other thing I'd probably add to that is that you can't just teach. I can't just go and teach everyone.
Starting point is 00:23:59 We had 20 teams, for example, at Cinch. You can't teach everyone. You have to start building a community of interest around these kind of things. People need to learn from each other. So practice only evolves. It's a complex practice. There's a lot going on.
Starting point is 00:24:16 There's no standards as such. I mean, there is open telemetry and there is some guidance online. It's a growing field, but people need to learn from other people and that's where they learn the best. I'm taking a lot of notes here because I want to first of all for my benefit for later on, because this is what I enjoy so much about the podcast that we have interesting guests with different backgrounds and different experiences. Also for the summary of the podcast that we're writing, what you're saying was
Starting point is 00:24:52 interesting that, you know, teach people how to instrument code consciously, making them think about the other side, like, well, how can this data potentially be useful? And if you don't have a clear answer of no, this will never be useful, then it's probably something good to include. That was why I'm wondering, should we also think about the process if, let's say, we put instrumentation in, but then we monitor the monitored data access? We monitor who is actually using a certain piece of data. And if nobody ever uses it within, let's say, a month, a quarter, a year, then we could kind of flag also instrumentation data that nobody ever has existed either. Never put it on a dashboard, never create an alert on it, never fetch that log with
Starting point is 00:25:37 that particular piece. Is this also a practice? We should kind of, you know, kind of like rethink and update your instrumentation based on usage? I'm a big proponent of the more the vendors can do for us, the better. So we pay a lot of money to the vendors, but that's for a good reason because they are the differentiator, that's what they do well. I've not really used an open source stack for observability, but I do put a lot of trust in observability vendors and I'm happy to pay them because that's not the differentiator for for most companies. And in that sense, I would say that when it comes to
Starting point is 00:26:28 deciding things like, is this useful anymore? Or, yeah, absolutely. If you can flag up these things and then I can choose what to do, it'd be really, really useful. So whatever you can do
Starting point is 00:26:43 to help the UX of the developer, then great. I'd say what's interesting is that even a year later, you might need that tag. It's your decision to decide whether that piece of data is useful or not. But flagging it to me will be super useful because you can start clearing up up things but i think more
Starting point is 00:27:06 important than that is i think that the sampling and the indexing and all of that space because um that's where the the billing and uh what's what's the word that's where you can make any difference to billing and to bills so if you take take the stance that you actually, so as a user of an observability platform, I want to have confidence that I'll find what I'm looking for. And I don't want to be thinking about indexing or don't be thinking about anything around sampling. I want to think the minimum I can around that. What I want to think is understand what my system is doing. So I think there's a lot that the vendors can do to help with that. Yeah, maybe Andy should explain what we're doing on that off the record. Yeah, I think we've been in this space for a while
Starting point is 00:28:08 and so have our competitors. I think over the years, I think we got much better in hiding the complexity away because that's, as you said earlier, you're exactly paying for that service because you don't want to necessarily think about all these things yourself when you ingest data, how you store it, how you index it, how you give people access to this stuff, if you can afford it, obviously, right? If this is a value for
Starting point is 00:28:35 you that a commercial observability platform would give you. Yeah, I'd want to say on that side too, before people, I think it's definitely important to, just like clean code, clean observability is important, right? We want to make sure, like any vendor or even OpenTelemetry, they're going to be observing the universal parts of traces. Service hops, database, things, right? Any custom code, no one's going to know what we need to do.
Starting point is 00:29:08 So that's where the developers might be adding in the additional components, right? But when you start talking about then over time removing things, depending on what your viewing platform is. So in our case, obviously the data, you know, the interest backend with Davis and all that
Starting point is 00:29:24 could be whoever's platform. They may have, like we do, some sort of AI or some other assisted analysis that's taking a bunch of inputs. What's going to be important for you when you are cleaning up is to understand how all the data is being used by that system. Because some of the data you're collecting might not be something you as the human are looking into, but it's the machine that's observing it, taking it into account.
Starting point is 00:29:51 And if you just start chopping things out without that knowledge, you could be throwing off your model. So yes, it's important to clean up your code, but it's more important to understand what everything is being based on and have that knowledge of what's feeding everything before you do that.
Starting point is 00:30:06 Yeah, absolutely agree. And the reason why I think this discussion is important because I know, and we talked about this before we hit the record button, there has been a lot of hype, obviously, in the last years on distributed tracing, right? We have open telemetry that enables everybody to create distributed traces and uh but now it seems there's at least some type of not not a resistance but making people aware of that distributed traces can become very expensive right from a capturing perspective from a storage perspective and some people now questioning is distributed traces really something that we need to analyze systems?
Starting point is 00:30:45 Can we just do the same thing just with metrics or just with logs? And Brian, as you just said, we've built a lot of systems over the years, like the observability vendors that really detect changes in your distributed traces and therefore automatically detect change in system behavior, automatically detect bottlenecks. We can see things that otherwise would be hard to see if you don't put the same kind of distributed tracing capability, let's say, into your logs because you added a trace ID on the log. I'm just wondering totally from your perspective, because you mentioned earlier, right, you need to be, observability can become very expensive. The question is how much price
Starting point is 00:31:31 do you want to pay? Do you see like the need, let's say, what's your take on traces? Can we do everything with logs, with metrics? What is your guidance to engineers? When to use what? Maybe that's the better way to phrase the question. Do you have any guidance on how you advise developers on when to use what observability signal? Yeah, I think it's a very important point. And I think that confused a lot of people when they start out in their observability journey. In my mind, there's four telemetry data types that are the most popular. Tracing, metrics, and logs, mostly for the back end. And then you have real user monitoring for the front end. I would say my go-to would be real user monitoring plus tracing
Starting point is 00:32:26 with real user monitoring linking to the backend traces. That would be the ideal, but you don't leave it at that. You have to enrich the telemetry data with custom tags or custom attributes that represent your business transactions. If you don't do that then you've just auto-enabled telemetry and you'll get you'll get everything out of the box but you you won't be understanding the health business transactions and i can guarantee that one day you'll have an incident and you'll go back and add that so that you can understand a bit more
Starting point is 00:33:01 what's happening or you'll go to um'll go to another non-profit environment, try to reproduce it, and I'm bored at that point. If we're still in that space and we're still trying to sort things out in a non-production environment in 2023, then we're missing the point. So that would be my go-to. However, I think one of the lessons I learned is I started with Honeycomb. At Cinch, we used it a bit and kind of learned a lot about their observability principles and practices. And then we moved to Datadog. And I'd say that what you would suggest is observability vendor dependent.
Starting point is 00:33:48 So you have to look at your observability vendor and see what they are promoting, what they are doing well, because they might be storing and indexing one telemetry data type better than others. They might be billing one telemetry data better than others. And pragmatically, you might have to look at that, which is, I think, in my mind, is a bit sad. But hopefully, we get to a point where there is a standard. And I'm not saying when I say tracing and RUM, that's all you use. But you use one as a base, and then the rest as an exception. So for
Starting point is 00:34:19 example, metrics are a signal in time without context. So you would use it, but then you can't go and find out unless you look at the code. So avoiding looking at the code, you can't find out what's happening. With logs, you'll get a lot of noise, and you'll likely get a lot of natural language that's useful. And in our case, what I found interesting was that a lot of non-techies found that useful on dashboards. But that should be the exception rather than the norm.
Starting point is 00:34:54 And you should try and instrument everything through traces. And when I say traces, I'm not a big fan of lane graphs. It's not that I'm against them. I like them when you're looking at an individual trace. But I think the real value, the real power is querying a data that spans. That's where it becomes really, really powerful. And that's where high dimensionality and high cardinality is really important. I hope I've answered your question. You did. and I wanted to add one more thing to this because you brought up a very good point. The reason why you need to still look obviously at what your observability vendor is doing
Starting point is 00:35:35 different than others is because we all come from a different background, right? We've been on the Dynatrace side, we started with APM back then. And from there we evolved. So we started with traces, then we went into metrics, logs, real user data. A competitor may have started with logs, and then they evolved into metrics and then traces. So obviously we have our history and what we've always been really good in, and therefore have a certain lean-in on a certain type of observability data set. But I think what we also see, at least this is what we are doing internally, we try to treat every observability signal equally as good as possible. And hopefully there will be a time when
Starting point is 00:36:26 it should no longer matter really what backend system you have. I think the best compromise you can probably make is, I suppose most observability vendor will enable a service tag that you can add. If you're using all telemetry data types, at least add some basic top level tags like service and version, if you want, things like that are really, really important so that you can correlate between data types, data types potentially. Exactly. Yeah. I mean, it's also, you know, we don't want to make this about our
Starting point is 00:37:02 our organization, but obviously we have the most experience but things like linking tracers with logs because we automatically take the log and and put the tracer the on it that's one thing you talked about business transactions right we automatically extract anything from the the user that is interacting with your front end and we know who they are and where they click and like your search example that you brought to the conference, you will probably be interested in something like what did people search for? Where do they come from? What do they search for? What other filters do they apply?
Starting point is 00:37:35 And then you want to see this to analyze also search behavior and user behavior and then how it impacts your performance and resiliency of your system. Yeah. But yeah, I assume by 2023, most observability vendors can hopefully do this in an easy way. By 2023? We're in 2023. I know. That's what I'm saying. I assume by now everybody, most vendors are doing this. Yeah, yeah, yeah.
Starting point is 00:38:00 Yeah. One thing really briefly too, you mentioned the idea, the real user monitoring and having the other information. So the way I interpreted that, and I just wanted to double check if this is a correct interpretation, would be using what when you have a public facing site or tool, the real user monitoring is basically your SLO, it's what you're looking for the impact to hit. And then you want the traces, logs, and metrics and everything else underneath so that you can do the investigation into what is impacting that. And part of that trace could be stuff in the browser as well. But you're looking primarily at what is the impact to the end user. And we've been saying that. A lot of people have always been saying that. Who cares if your CPU is X, Y, Z, whatever it might be? What's the impact on your end user? Are we've been saying that. A lot of people have always been saying that. Who cares if your CPU is XYZ,
Starting point is 00:38:46 whatever it might be? What's the impact on your end user? Are they feeling it? That should be your guiding principle. And it almost... If I can put the words in your mouth, it feels like you're saying when you have real user monitoring, that should be what you're looking at to see how the system's running.
Starting point is 00:39:02 And then all the other telemetry is the supporting evidence that you need to dive in. Would that be fair? Yeah, I think so. I think I'll change slightly the words in my mouth in the sense that I think real user monitoring is, I see it as a tracing of the front end. And absolutely, it's about knowing what your users are doing
Starting point is 00:39:26 and understanding so that you can improve the UX. What's interesting is that it's a lot easier to use something like synthetics than it is to use and understand and enable teams to use real user monitoring because there's more concepts in there to understand. It's also more complex than tracing because there's a lot more dimensions like a view and an action and things like that this this does not just request response like it is on the back end and um it really does uh give a bit of context in terms of if a synthetics goes off the home page is down that's more like you know itetics goes off, the homepage is down.
Starting point is 00:40:06 That's more likely, you know, it's likely that the homepage is down, but it's not definite. Whereas with a real user monitoring, if 90% of the users can't access the homepage, then you know that 90, you know what 90% of your users are experiencing de facto, like as long as it's behind the ones that are behind cookies and stuff like that, a behind cookies and stuff like that. And what I'd say about real user monitoring, it's an interesting one that we experienced at Cinch from the perspective of the teams running the software. So I had a really hard job getting front-end leaning software engineers to care about observability. And real user monitoring was my in.
Starting point is 00:40:51 It helped me get them into the platform and it helped them then go and explore other things like tracing and SLOs and things like that. So that was really powerful from that perspective. And the other aspect of it is the often, and I've seen that in a white paper that I think New Relic published recently, was that a lot of the companies
Starting point is 00:41:16 have a very fragmented observability platform setting. So they have multiple platforms. I've seen that as being an anti-pattern. I've experienced it as being an anti-pattern. It just fragments the view of your software. So my take is, and I think the white paper was saying the same, is that you want to have one observability platform. You want to be able to throw them away if the billing gets too expensive, for sure, and OpenTelemetry helps with that. But you want to have one observability platform that's not your cloud provider
Starting point is 00:41:48 so that you can have a shared understanding across teams, across disciplines. So you also start seeing that some more UX-centric widgets or views like a funnel or like the core vitals that are good for SEO and things like that. So you're starting to bring in non-engineers as well to look at the same data and you're all looking at the same data and people are sharing the same links with each other and they're not siloed by different observability platforms. So to kind of round off, the Reels andism monitoring really enables that aspect rather than having one for frontend and one for backend. And the last thing on the backend bit is that we did focus a lot on custom instrumentation. So a lot of the things, because we were serverless as well at
Starting point is 00:42:39 Cinch, a lot of things that we cared about wasn't CPU, it wasn't memory or memory leaks or anything like that. We cared about higher order attributes. Okay, some of them were serverless specific and came from AWS metrics, but most of them were orders, search, various product details and things like that. I was going to say one last thing I wanted to get in because I know we got to wrap up soon because this started from before our call when we started talking a little bit
Starting point is 00:43:12 about some aspects, observability with logs and traces and what people prefer. And Andy and I are obviously big fans of traces. Back before when I was a performance tester, before we started working at Dynatrace and even had any tool like this, we were running tests, we'd see a slowdown, I'd be looking at logs or processing server metrics and trying to figure out why it's slowing down and just really couldn't. It wasn't until we had a trace tool in there that we could see what was going on with the code. So big, big fan of traces. And that's leading me to think, for people who are relying or really insisting on logs, because I see it all the time.
Starting point is 00:43:51 We go to prospects and customers, and they're like, oh, we want log, log, logs. I feel that logs are reactionary. Logs are showing you errors that occur in your system, but there's really not much you can do for optimization when it comes to logs, because logs are not telling you how things are running, where you're spending time. Unless people are using logs differently, but I imagine if you're trying to get that
Starting point is 00:44:15 level of information for logs, storage costs, indexing, and all, it's going to shoot through the roof because we know logs are expensive. So is there really a case? Do you see situations where people can use logs for anything but reacting to incidents? Is there proactive log use cases for optimization? Or is that really where the trace is, like the hardcore case for traces outside of other places? Yeah, I think there's definitely a use case for logs. And I know you didn't say that, is there a use case for logs? Oh yeah, there's definitely a use case for logs. And I know you didn't say that, is there a use case for logs?
Starting point is 00:44:45 Oh yeah, there's definitely a use case for logs, I agree. I think the boundaries of tracing is at your own software, so at the end of your own software. So you have to ingest logs for third parties and things like that. So that's really important, and cloud providers in some cases. So you have to accept the reality of using logs, but I do agree that they become reactionary. They're a signal for me that it's a reactional thing.
Starting point is 00:45:15 And it also encourages going into patterns like introducing correlation IDs and introducing kind introducing custom tags that then you can query on and all these things and durations as well. So you start seeing in code things like log and duration here to see how long this takes, which are all built into tracing and spans.
Starting point is 00:45:40 By taking the span duration, you know what that long. If you want a smaller part of the span, you can create a subspan. So that's all structural within traces. So you end up building something similar to a data set of spans, but not good enough. But then you have the advantage of correlating with potentially third parties and software that's not instrumented with tracing. Yeah.
Starting point is 00:46:10 Hey, I know we need to wrap up here because we've kind of hard stop. Thank you so much for doing this podcast with us. And I would have a couple of more questions, especially on serverless, because that's a topic I currently host a working group within our customer base on serverless observability best practices. Maybe I want to have you back for another discussion,
Starting point is 00:46:34 because obviously you have a lot of experience on this. But I learned a ton today. Yeah, I learned a ton today. And thank you so much. We don't keep you much longer because we want to make sure you catch your next appointment. I've loved this conversation.
Starting point is 00:46:51 And yeah, thanks for having me. And I get the sentiment of having you back to talk about serverless because we haven't really touched on that for a while, Andy. So I definitely think that we can dig into that.
Starting point is 00:47:03 All right, we'll wrap it up here. Thanks, everyone, for listening. And thank you, Toli, for spending time with us. Andy. So I definitely think that we can dig into that. Alright, we'll wrap it up here. Thanks everyone for listening and thank you totally for spending time with us. Thanks to all of our listeners and we'll see you all next time. Bye-bye.
