PurePerformance - Why Developers have different Observability Requirements with Liran Haimovitch

Episode Date: January 1, 2024

After analyzing distributed traces for more than 15 years, Brian and I thought that everyone in software engineering and operations must be satisfied with all the observability data we have available... But maybe Brian and I were wrong, because we didn't fully understand all the use cases - especially those of developers who must fix code in production, or who need to quickly understand what somebody else's code is really doing without the luxury of adding another log line and redeploying on the fly. To learn more about the observability requirements of developers we invited Liran Haimovitch, CTO at Rookout and now part of Dynatrace, who has spent the last 7 years solving the challenging problems that developers face day and night. Tune in and learn what non-breaking breakpoints are, how it is possible to "debug in production" without impacting running code, and how we can make developers' lives easier even though we push so many things "to the left".

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance Yes, yes, and we're on episode 198, which means we're getting close to 200. 200, wow. I had a dream about you last night. Not quite the same kind of dreams, because I don't know if there's a lesson in it. But I mean, maybe there's a lesson for you personally. But I was sitting on the beach in Hawaii, and there was this crazy dog named Buddha. And it was jumping into the water, pulling out these big rocks, and just chewing on the rocks with its teeth.
Starting point is 00:01:03 And I was thinking, this dog's gonna break his mouth. So I go up to him, and the dog turns around and faces me, and he's got your face, and he's like, "N plus one! N plus one!" That was his bark. So that's all I got today. Yeah, but it reminds me a little bit about the pain that I had the last couple of days, because it felt like chewing on rocks with my teeth. Thanks for the reminder. The bad tooth will be extracted in the next two days, so then this problem will be solved forever. Talking about problems that people need to solve. Hey, look at that. Such a great segue. Yeah, we have a guest as always with us today.
Starting point is 00:01:47 And today I want to welcome Liran. Liran Haimovic, I hope I pronounced your last name correctly. Welcome to the show. Can you do me the favor and introduce yourself? Who you are, what you've been doing over the last couple of years, and what motivates you every day? Sure, Andy. It's great being here.
Starting point is 00:02:03 So my name is Liran Haimovitch. It's pretty close; it especially depends on how you pronounce it in different nationalities. But today, I'm an architect at Dynatrace working with you guys. Up until a few months ago,
Starting point is 00:02:17 I was the CTO and co-founder of Rookout, which was acquired by Dynatrace just three or almost four months now so time is flying as you say Rookout was a developer observability company and we still run the platform
Starting point is 00:02:36 we service developers they try to better understand their code and debug it in various remote environments, even production. And it's super exciting being a part of Dynatrace and collaborating on making observability more accessible for developers in general. Cool.
Starting point is 00:02:54 Yeah, it's amazing how time flies, four months. I thought it was longer even. Yeah, yeah. It feels forever. But Liren, just to, because the topic of today, I really want to focus in on some of the things you said in preparation of this podcast.
Starting point is 00:03:11 You said observability requirements for developers are different than what, let's say, we are typically assuming what observability means maybe for your IT operations, for your site reliability engineer, for your DevOps engineer. And I'm actually curious, what are the different requirements when it comes to observability for developers? Why do developers have different needs for observability?
Starting point is 00:03:37 So, you know, early in the process in the discussion with Dynatrace, I went to some of your guys, or our guys now, and they're saying that banks love Dynatrace. I went to some of your guys and they're saying that banks love Dynatrace and the reason for that, they kind of brought
Starting point is 00:03:49 this use case where you have one IT guy and he owns 10 applications. Now, for them, they need to know when every one
Starting point is 00:03:56 of those applications go down and they don't really have the time to focus on each and every application.
Starting point is 00:04:01 They want to install something, forget all about it, and whenever something goes wrong, have some sort of alert go off, maybe say hi, whatever you want to call it, pop up and say, hey, something is wrong.
Starting point is 00:04:13 This needs your attention right now. If it can even point directly to what went wrong, so you can even easier fix it, that's what's needed. As most IT operations, I'm saying that in a bad way. They need to get a lot of stuff done.
Starting point is 00:04:27 They need to keep a lot of services up and running. And what they mostly care about is when things break. And that's their job. If you look at developers, they are constantly tweaking,
Starting point is 00:04:39 constantly changing, constantly adapting. And in fact, for them, half the time something is broken, whether it's a new feature they are working on and it has a bug.
Starting point is 00:04:49 Maybe it's something they released last week that didn't fly well during your release. Or maybe it's a piece of code from five years ago a customer just complained about. For developers, the standard, the usual is the code is not working perfectly. Something is wrong. And that's why you're focusing on the first place.
Starting point is 00:05:09 And all of a sudden, you need to go in deeper. It's not just about tell me what's right, tell me what's wrong. It's about how deep can you go if something is wrong. Even more so, it's not just about this deployment failed, this database is down, because those are IT problems, and that's what traditional observability is very good at detecting. This drive ran out of space, the database is broken, we need to fix it to clean up, I don't know, delete the temp folder, whatever.
Starting point is 00:05:41 Again, super important stuff, but that wouldn't make it to the developer. A developer might see that database queries are failing because some index went out of bound. And all of a sudden, you need to go in way deeper. It's not just about the database is broken. It's not just about these SQL queries are failing. It's about why. And those questions require going way deeper into the application.
Starting point is 00:06:09 And even more challenging, those questions are far less anticipated. Because whenever you're dealing with one of those unexpected issues, you are trying to solve a murder mystery, trying to figure out what's going on. And each time, it's a new piece of code, it's a new application, it's a new library, it's a new bug, and you start collecting data. And while for the IT ops guys, it's the same data day in, day out.
Starting point is 00:06:40 Is my app, what's my request per second? What's my latency? What's my error rate? Is it good? Is it bad? Those are very clear, strong signals that you know you can rely on. All of a sudden, you're asking, you're looking for much weaker signals, you're trying to build a much more complicated
Starting point is 00:06:58 map, and those are not as easily answered. That's a very different discussion you're having with yourself, a very different discovery process. And you often change contexts. You go into this problem, you go into that problem, and you're asking different questions.
Starting point is 00:07:18 And it's so much easier when you have tools that are tailored for solving those problems rather than trying to use the same day in day out a high level metrics that are great for it i have a couple of thoughts here um because you probably know brian and i we have been doing kind of you know distributed trace analysis or you know back in the days we call it pure path now it's distributed tracing whether it comes from an agent like a diamond trace agent or whether it comes from open telemetry so we always thought or i always thought with distributed traces we already go pretty deep because we can go down to the method level we can capture a method argument we can capture a return value and one of the things we have done over the years, instead of relying on
Starting point is 00:08:06 manually placing what we call sensors kind of in the first generation of APM, application performance monitoring, where you had to say, I want to have this and this and this, we then kind of became a little smarter and kind of meshed up the feeder points, like the sensors that we placed. And then we also meshed it up with some snapshotting of your threads. And then this was kind of like we filled some of the gaps. So that means we've had and we've worked with distributed tracers with very rich information already for several years. And now when I listen to you, I still, I know what it is, but I want to hear it
Starting point is 00:08:46 from you. I want to hear from you what additional data that you don't get from an APM product, if you just think about APM, what additional data do developers need to really troubleshoot the reps besides the distributed traces? So that's a great question. But before we dive into the deeper data, different data, I want to take a step back and say, there's this amazing quote by Henry David Thoreau. It's not what you look at that matters, it's what you see. And the question is not which data the one agent is bringing in
Starting point is 00:09:21 or which data OpenTelemetry is bringing in or whatever APM of choice you're using. It's not just about what data they're bringing in or which data OpenTelemetry is bringing in or whatever APM of choice you're using. It's not just about what data they're bringing in. It's about how they're indexing it, how they're making it possible for use, how they're visualizing it. Now, tracing is amazing, but one of the biggest challenges in tracing is sampling. Now, again, if you're looking broad picture, health status, then sampling is great. You don't need 10 million transactions per second fully captured to know if the system is up or down. On the other hand, if you are looking for that particular transaction that is failing and you need to know why, then you need that particular transaction.
Starting point is 00:09:59 It doesn't help to know that 99% of the transactions are doing perfectly fine if you didn't capture the one that's failing. So I really want to stress that, as we think about developer observability, obviously snapshots, which were the core of Rookout and which I'll touch on in a second, are something we super value, but it goes beyond that.
Starting point is 00:10:21 It's about how do we make traces more valuable for developers? How do we make logs more valuable for developers? How do we make logs more valuable for developers? How do we make metrics? How do we make release or monitoring? All of those tools are great, and they capture amazing data. But you would often find that developers need, within that context, different pieces of data or use the same data differently.
Starting point is 00:10:44 But even more so, I think Sn snapshots is a great example of that because if you look at what Rookout has been doing with snapshots and some other stuff we've seen, is that we literally allow you to get a snapshot of your application just like it would appear in a debugger. So you set a breakpoint, a specific line, a non-breaking breakpoint, and then you get to see when this line is hit,
Starting point is 00:11:07 those are the local variable values. That's the stack trace. This is exactly how it would appear in the local debugger of your choice. I'm not judging. You can use VS Code. You can use Eclipse. You can use JetBrains.
Starting point is 00:11:19 I know what's my favorite, but I don't want to spoil anything. But you would get a very similar example, except this is running right now in a remote environment, potentially distributed. Don't stop the application. We provide all the security and privacy and other guardrails to make sure you can use it anywhere you want.
Starting point is 00:11:40 And yet you can see very, very deep into the code. And I think that's another key difference between developers and IT ops. IT ops people or SREs or whatever you want to call them, generally don't really care about the application code. They didn't write it. They don't maintain it. And that's somebody else's job. Developers, on the other hand, spend their day, day in, day out,
Starting point is 00:12:04 stirring the code. If it's good, it's their code. If their day, day in, day out, stirring the code. If it's good, it's their code. If they're not so lucky, it's somebody else's code. They've taken ownership for after years. At Rookout and now Giant Trace, we're trying to make the observability tools very code-specific.
Starting point is 00:12:20 A lot of the signals you see in observability require to understand the context, the code context. Without them, they might be meaningless. Think about a log line. In theory, a log line can say anything. When you write a function,
Starting point is 00:12:37 you can literally print out the log line to say anything, regardless of what's actually happening. Now, I'm not saying people are evil and intentionally logging the wrong thing, but, you know, maybe somebody misunderstood the function as they added a log line. Maybe the function changed over time
Starting point is 00:12:55 and the log line wasn't fixed. So many things can happen. And it's super important if you see the log line and you understand which context it was written in, what was happening. If you see the logline and you understand which context it was written in, what was happening. If you see a snapshot of the local variables, it's super important to know which version they are on. Even more so today,
Starting point is 00:13:13 with continuous deployments, there are so many environments, there are so many versions floating around. For instance, you've just deployed a fix, and you now see the bug wasn't fixed at all but was the right version deployed when the bug continued reproducing? Maybe something went wrong.
Starting point is 00:13:33 Maybe another dependency tree or fix wasn't deployed yet or was it deployed? Those are all critical questions for developers and it's super important for them that the signals they get are correlated with the context, with the precise version of the code. And it's also about
Starting point is 00:13:55 how easy it is to get it. For instance, so yeah, you deployed something. Now do I have to go through Argo CD, figure out what's deployed, then I have to go through Argo CD, figure out what's deployed, then I have to move from semantic versioning to Git hashes,
Starting point is 00:14:10 and then to get from Git hashes to do the Git checkout and find the right file, and it can easily be a matter of hours or even days just going through the toil. Or I can have a system
Starting point is 00:14:21 that automatically correlates for me and say, this is the log line, this is the source for me and say, this is the log line, this is the source file it came from, this is the exact version that was running. Use it. Yeah, I think the best way that I can now kind of recap this, because I also used to be a developer, right?
Starting point is 00:14:40 And I understand that when you live in a tool that you use to develop and to debug locally you will be more efficient also when you can stay in that tool like your favorite ide but then being able to get the same level of details from an app from your app that runs in whatever version in whatever remote environment i think that's alone is amazing and as you correctly said it goes beyond far beyond on what we currently have with distributed traces. If I think alone of all the stack frames, of all the environment, of all the variables that you have on the stack. And it also obviously goes much beyond what you also correctly said,
Starting point is 00:15:20 what SRE or DevOps or IT operations are interested in, because they are looking at the system as a whole. They're trying to look at it, I think, independent from the application because they cannot know the ins and outs of the application. They just want to make sure that the system around it is healthy and therefore they're using the classical indicators like your metrics or maybe some indicators that they can also extract from the logs, whether there's more error logs now. But overall, I think they are obviously, they don't know what the application should do in detail, and they're also not interested in these things. And that's why, if I sum it up, getting this fine grain information in the tool that the developer sits, so you don't have to waste time to go from
Starting point is 00:16:01 tool to tool to tool, and then try to find the answer. I think these are, at least from my perspective, how I hear you, the different requirements for developers for observability, especially in production environments. Definitely. And I also think one of the biggest benefits in this new approach of developer observability is around dynamic instrumentation. It's around how do we allow developers to determine the data they need in real time. Again, because the questions are constantly changing, being able to collect data in real
Starting point is 00:16:38 time to decide, I want to collect data from this line. I need an extra log line here. I need a metric here. I need to understand the performance implication of this line. I need to see how long this takes. I need to see how often this is called. Being able to do that in real time to specify the data you want to collect and instantly get it is so much more powerful and cost efficient than trying to capture everything all the time. You know, I've heard a lot of things about various observability tools.
Starting point is 00:17:08 I rarely hear customers saying that observability is so cheap. At the end of the day, everybody's trying to collect so much information and that costs money and that costs resources. And it's always a challenge of optimizing how do I collect as much as possible and we're still getting not paying too much and part of the balance besides obviously being cost efficient is also
Starting point is 00:17:32 being able to adapt the more agile you are in your observability the more you can adapt to changing requirements to ongoing activities the more you can be efficient by not trying to hold everything all the time. Because trust me, no matter how much you collect, you're never going to have everything. You're never going to have everything you want just by trying to hold everything. Yeah, that's a great point. And Andy, I was thinking like an analogy
Starting point is 00:18:03 of what this is, right? Going back to the idea you presented a long time ago with DevOps and continuous feedback with the photo camera. Remember, so that idea of digital, basically digital picture for people who don't know this provides that instant feedback. Whereas when you had to take pictures, you had to wait, get it developed. And then, oh, I had my thumb covering the lens right it's too late now um the analogy i'm thinking of with this one is it's going to be in the apple ecosystem so sorry android users i'm sure there's a similar thing there but there's been some recent changes to the whole the find my um functionality on the iphone so if you can't find your watch, you can't find your phone,
Starting point is 00:18:46 whatever you have, this Find My thing. And where I see the observability part is now they have it so that you can walk around and it'll detect where it is. And it's basically like the hot and cold game. You're getting closer, you're getting closer. And if I'm lucky, it brings me to my bed and my phone is just sitting right there on my watch Or my watch, right? And I could see it. I grab it. That's, you know, if the trace gives you the data you need, fantastic, you're set. But on those days, my bed is an absolute mess.
Starting point is 00:19:14 I get there and I'm trying to move through all the covers and everything and I can't find it. Well, then I have right into the same ecosystem, the ping button. And suddenly my phone makes a ping and my ears directly locate exactly where it is. And without having to go through my bed, tear it apart and everything, I can just go bam, right, find it and get right to the heart of the matter. So it's bridging that gap. If you can get close, if you can get to the heart with it, without having to hit the ping, great. But when you need that extra bit, without even having to think, without setting something up, you hit it, you know, without having to hit the ping, great. But when you need that extra bit, without even having to think, without setting something up, you hit it,
Starting point is 00:19:48 you find it right away, and you move on, and you then go doom scroll on the internet. At least that's the way I'm thinking about it. It's a great analogy, because you don't want your phone to ping all the time. You want it to ping at the time when you need it, because you can't find it. And the all the time. You want it to ping at the time when you need it
Starting point is 00:20:05 because you can't find it. And the same thing is you want to have these non-breaking breakpoints. How do you call them again? Yeah, non-breaking breakpoints. Because they look and feel like breakpoints, but they don't break you up. Yeah, so that means with non-breaking, we might have listeners on the call
Starting point is 00:20:22 that are not familiar with what debugging really means and in the sense of a debugger where you basically stop the runtime from actually executing or you let it step by step and you kind of hold it. And with this, you basically simulate what a breakpoint does in a debugger where you capture the full stack frame with all the variables, but without actually holding the runtime and you're just collecting this information. But that's great, right?
Starting point is 00:20:52 I mean, and I guess, and Liran, if you can fill me in a little bit on, because you mentioned earlier, right, observability is expensive. If you would capture unbreaking breakpoints all the time. I guess we would have a problem. First of all, we would have too much data that nobody cares about. Also, it will probably cost a lot of things to capture and store it. So how do you solve this problem? By the way, how do developers then actually use this technology? When do they turn it on? How long do they turn it on? Or how does the system, what did you build to make sure that you're capturing enough information
Starting point is 00:21:26 without capturing too much information so I would mention that snapshots are truly cheap if you're using a great tool to collect them they might be slightly more expensive than logs because they're so informative.
Starting point is 00:21:45 And I think in one of our previous media discussions at Rookout, I've kind of used the concept, I've tried to outline the concept that snapshots are worth a thousand log lines because they're so much more detailed. Instead of having to write out your variables one by one, stringify them, figure out how to represent them, Snapshot would capture the entire state of the
Starting point is 00:22:07 application very, very accurately, keeping type information, keeping all the minutiae detail going to make the difference between fixing a bug and not fixing a bug. And that's super easy to get. And those breakpoints are
Starting point is 00:22:23 really, really cheap. A magnitude of a millisecond or so depends on the runtime exactly. And using workout or similar tools, you can just, with a click of a button, set it on any line you want
Starting point is 00:22:38 and instantly get it applied. You can apply it to a single server you're interested in or you can apply it to a whole fleet of servers. You are wondering what's happening there. You can use conditional breakpoints to filter out a specific user or a specific case you're interested in. You can even connect those non-breaking breakpoints to various automation workflows,
Starting point is 00:23:03 but whenever something goes wrong, whenever latency goes up, whenever you have an exception thrown, whenever something you're interested in happening, you can instantly set a breakpoint there. So by the time you would actually get to look at it, you would get a whole more context than what a traditional observability tool
Starting point is 00:23:24 that are not going as deep can provide you. So essentially, you get an alert. By the time the alert is there, you actually walk up to your laptop and see the alert. You've got a whole slew of additional context that's going to make it so much easier and take so much of the guesswork away from the triaging process.
Starting point is 00:23:44 And so there are also various, sometimes breakpoints by the way are much more longer lived. Maybe your release cycle is only once every two weeks and you are worried about something and you want to throw in an extra logline. So you can say, I want
Starting point is 00:23:59 a new logline on this line. You can even add a condition to it. The next time this variable is over 50k, is longer than 50k, send add a condition to it. The next time this variable is over 50k, is longer than 50k, send me a message to Slack. And you can instantly do that. And you don't have to do an emergency patch. You don't have to release a new version. You don't have to wait for the next
Starting point is 00:24:16 release. You can instantly do that on the fly without worrying about it. You can add new metrics if you're trying to measure something for performance. You can collect new metrics if you're trying to measure something for performance. You can collect more data and you can really adapt to your needs. Some customers use it to debug remote environments. Maybe you have applications deployed with your end customers and the customer has to be notified or install the patch. Maybe you're using an environment that
Starting point is 00:24:44 has a downtime whenever it's being upgraded and you don't want to incur additional downtime. You can set those breakpoints for weeks or months or however long you need to. And you would often find, you would almost always find that
Starting point is 00:25:00 adding a breakpoint is so much cheaper, easier, and almost so much risk free. You know, I had and almost so much risk-free. You know, I had my own podcast up until... Actually, I had my own podcast until my first was born, which is now almost 18 months. So it's been a while. But in one of the first episodes I recorded, there was this guy from a storage security company. And he mentioned that early in the days, they actually released a version.
Starting point is 00:25:29 It wasn't an emergency patch, by the way. It was a major version of the product. And somebody added a log there. And that log crashed the system repeatedly. And they had to back out that major release and emergency release fix, all because of an improper logline.
Starting point is 00:25:48 If you think about it, at the end of the day, a logline or a metric or any changes you make, as small as possible, is code. And any code, any change, carries risk. And part of the promise of
Starting point is 00:26:04 those observability tools for developers and carries risk risk and part of the promise of this of those observability tools for developers and snapshots and so on and so forth is that we
Starting point is 00:26:11 take away much of the risk we provide very significant guardrails to make sure that no matter
Starting point is 00:26:16 what you do you access uninitialized variables you try to print out something that's too big
Starting point is 00:26:22 you try to inadvertently access the database whatever you want to whatever you're about to that's stupid that's too big, you try to inadvertently access the database, whatever you're about to do that's stupid, that's inappropriate, that's better not be happening, then we have the guardrails in place to ensure that this won't happen, that you won't be risking the integrity of your product, the integrity of your service just for that log line. Because 10 out of 10 times
Starting point is 00:26:44 you would prefer the service to keep running and the observability of the data to be missing, especially if you provide a clear indication that you're not getting the data rather than take down the service in the name of get me that extra logline. I think this was a use case.
Starting point is 00:27:01 Go ahead, Brian. I was thinking about especially with that last use case, I had this thought in my head and I think that last use case about the log line really solidified it. If we go back to when DevOps started, right? And then as containers came in and as Kubernetes was coming in more and there was a shift of putting more and more work on the developers.
Starting point is 00:27:26 Set your ingress points, set your network routing, do everything as code, do observability as code, and developers are going to do it all. When the job of the developer is to write good code to begin with and to fix bad code when it gets discovered. But now there's this whole idea of learning all these other tasks, learning, oh, you know, I have to write all these additional logs in there, and what's going to happen if I do that because I'm now tasked with this? And recently we've seen, with the rise of platform engineering,
Starting point is 00:27:57 there's been a turnaround from that to say, hey, maybe we shouldn't put this all on the developers. We'll have special teams that will take care of the platform so the developer's not defining what container they're running it in or what size JVM. It's going to be somewhat opinionated in some ways. But what you're describing, I think, takes it even further because we're removing more of it from the developer to have to do observability and the debug side of it. But then when a problem does occur,
Starting point is 00:28:25 they don't have to spend as much time trying to figure out what happened. I don't care what observability tool you have. You know, developer has an issue. They're going to have to start digging and diving and they may get close. You know, before we were saying they might get close. They might have to dig deeper.
Starting point is 00:28:39 They may or may not have been trained on the tool, right? But the idea of removing the barrier for the developer to get to that answer as quickly as possible with the least amount of friction, with the least amount of thinking ahead of time, oh, I have to capture this method argument. If you can just turn something on, it's going to capture this stuff. It's pulling back the constraints or the burden we put on the developers as DevOps came in, and we're bringing it back so the developers can really just focus on writing good code and then fixing code when they need to.
Starting point is 00:29:12 And that's what they excel at, and that's what's probably going to make them happiest. And then the happier your developer is, the better everything is, and the whole world becomes a shiny, happy place, right? But I think it's a really important adjustment that we're going through on this side now. And it sounds like this is going to just make it a lot easier for those teams to execute. So I agree. I think the whole shift left, it's super important. On the one hand, as you said, it's a big promise.
Starting point is 00:29:44 It's about having, can developers truly own everything and developers truly be responsible for everything? And the answer is kind of, it depends. And what it depends on is about powerful tooling and useful abstraction. Developers can't know everything, but if you provide them an easy enough approach to observability
Starting point is 00:30:05 that's closely enough related to their day-to-day, they'll be able to grasp it. If you provide them with easy enough access to security for at least with good guidance and very simple action items, they can grasp it. And at the same time, it's also important to note that we are sparing developers a lot of work they used to do a couple of decades ago. Most developers today don't worry about memory management.
Starting point is 00:30:30 They don't worry about allocating and freeing memory because the modern runtimes do it for them. They don't worry so much about compiling and linking and dependency management, again, because good runtimes take care of a lot of the heavy lifting and a lot of the stuff that we can abstract from them. And this frees up their memory, their CPU, their minds to deal with higher obstruction problem. But at the end of the day,
Starting point is 00:30:58 it's a very small capacity that they're getting. And we need to very wisely adopt it. And so if we want developers to own production, to relate to production, to understand what's happening in production, we need to provide them great observability tools that speak their language,
Starting point is 00:31:16 that are easy for them to grasp, and then they will be more than happy to adopt them and take part of that. And we've seen this to be truly transformative for organizations. When developers are no longer disconnected from production,
Starting point is 00:31:29 when they are empowered to understand how their code is behaving in production, it's super motivating for them and it can have a huge impact on quality and velocity and so on.
Starting point is 00:31:41 I want to also just add my two sentences to this because I remember many, many occasions where problem happens, what do you do? You need more log lines, so you add two more log lines, you run it again, you don't get the logs that you expected or you need more than you add. Like five iterations, 10 iterations later, you have code where you have more log line code, more code that create logs than the actual code that
Starting point is 00:32:06 is doing business logic and i think this alone is for me an amazing selling point that i don't need especially from a troubleshooting perspective i don't need to modify my code to get more of the let's say traditional observability signals that i need to diagnose the problem because I can just treat it as if I would sit there locally and just attach my debugger. And Brian, this is the same, I think, shift, generational shift that we've seen when we introduced real user monitoring. Because with real user monitoring and session replay, all of a sudden, we could see everything that happens in the browser for every single user, every single line of JavaScript that was executed. And we could, as a developer, right, there was no need anymore to say, let me walk over
Starting point is 00:32:56 to the end user, and then let's turn on the developer tools in the browser, and it can give me all the data. With real user monitoring and session replay, we all of a sudden got this data. And it feels like what you are telling us here with observability for developers, the unbreaking breakpoints and the snapshotting technology, this is exactly what we give developers that are working on microservices, on any type of code that runs in application servers, wherever it runs, to get all this data that they need at their fingertips without having to go through an additional hoop like rebuilding the code with additional logs and then redeploying it and
Starting point is 00:33:37 then hoping that the error happens again. Because this is another thing, right? Many times when you then modified the code and added more logs then you may have changed the timing behavior and all of a sudden with the race condition you had a different timing behavior of your code and all of a sudden this problem didn't happen anymore, a different problem came up.
Starting point is 00:33:56 So there's like so many things that it's really great to hear with this technology that you guys have built. Liren, I also... It's funny that you mention that because joining Dynatrace, I found under the hood so many amazing observability capabilities,
Starting point is 00:34:15 observability technologies, and some of them are not well-known but very, very tailored for developers. I think today Dynatrace is maybe the best memory performance profiler anywhere I've seen. It provides amazing granularity
Starting point is 00:34:32 for allocations and the allocations of memory and GC stops and so on and so forth. There's even a capability, there are a lot of thread profiling, continuous profiling for CPU,
Starting point is 00:34:45 thread profiling for identifying logs, even heap dumps that can be used to troubleshoot a variety of issues.
Starting point is 00:34:53 And all of those features are, you know, super important, super powerful that are in there. And part of the things that we're
Starting point is 00:35:00 very excited about as we're joining Dynatrace is kind of seeing how we can make developers out there more aware of everything Dynatrace kind of seeing how we can make developers out there more aware of everything
Starting point is 00:35:07 Dynatrace have to offer for them because there's so many useful capabilities out there and I don't I don't always feel
Starting point is 00:35:15 that they are appreciated as much as they should be yeah but we are trying to do our best to to surface them right and
Starting point is 00:35:24 I know I had a session with the Arden a couple of weeks ago, where we did a YouTube video on developer observability meets app observability. And then we talked about kind of, you know, what the quote unquote, SU frame rate, the traditional observability brings in and then what the develop observability brings in. Also, just to give a little heads up or like a forward-looking statement, our conference performance is coming up. So it's going to be the last week of January, first week of February, where we all gather in Vegas. And there's going to be a big focus obviously also on these use cases that empower developers
Starting point is 00:36:07 to do the job better and easier. So folks, if you're listening in and if you're still contemplating on whether you want to join us in Vegas, you should because there's a lot of cool use cases. Yeah, exactly. And you can find Andy and have a drink and celebrate my
Starting point is 00:36:22 50th birthday on the 29th of January. There you go. So now everyone knows my birthday. I'm giving out security. I'm giving out security information. But yeah. Also, if everything else fails, you can always join us online. But still do come to Vegas. It's more fun.
Starting point is 00:36:38 Yeah. Liren, I want to not only talk about the blue skies, I also want to talk about some of the challenging questions that you sometimes get, because we always get these challenging questions. We've been in the observability space and the topic of overhead always comes up, the topic of who should be able,
Starting point is 00:36:58 what type of data are you really capturing and who is then going to see this data? Talking about data privacy, talking about security. I mean, there's like so many questions that always come up. Can you just glance over maybe some of these topics, let's say the challenging questions that you sometimes face? So I would say, as I mentioned, an individual snapshot is roughly one millisecond.
Starting point is 00:37:22 Obviously, it depends on the size of the snapshot and the runtime and so on and so forth, but that's what you can expect. It's very negligible, especially if you... It's not a tool that you're meant to capture a thousand snapshots every request. It's a tool that's meant to capture a handful of snapshots when you need them.
Starting point is 00:37:40 If you think about not only your P95, your P99s, they're not going to be affected in any way if you spend a couple of milliseconds capturing snapshots here and there. Obviously, if you just want to inject a log line or a metric, that's going to be even cheaper, way, way, way cheaper. We also have a variety of safeties at the individual breakpoint level, at the global levels.
Starting point is 00:38:04 We cap the CPU. We incur, even at the individual breakpoint level at the global levels and we cap the CPU we incur even at the worst case but for the most part you would see that we never get
Starting point is 00:38:13 anywhere near them and we have all the default limits come in the average engineer captures a bunch of
Starting point is 00:38:20 snapshots and the breakpoint turns off and nothing will happen and I think most customers won't even see two or three percent of CPU increase captures a bunch of snapshots, and the breakpoint turns off, and nothing really happened. And I think most customers won't even see 2% or 3% of CPU increase, and definitely nothing on the latency. Other than that,
Starting point is 00:38:34 around security and privacy, I think the most important thing to realize is that this access is needed. And if you're not using a tool to give it, a good tool like an observability tool that allows you to set a lot of policies in place, we'll discuss in a second, then something worse is going to happen.
Starting point is 00:38:58 Either the bug won't get fixed or the problem won't get resolved. Or what you will probably do is that engineers will spend their time to outsmart and outmaneuver the system and they're going to end up choosing worst options. They're going to sit down with an ops guy or an IT guy and SSH into the system. They're going to sit down with the database administrator and start querying their raw records. And they're going to figure out what they need because at the end of the day,
Starting point is 00:39:27 the business, the operations need to get the data because they need to fix something. And if you think about it, then the risks going down those routes are way, way, way bigger. You're essentially punching much bigger holes into your security and compliance posture. You are having much less control.
Starting point is 00:39:47 And that's how things have been done so far. Let's not kid ourselves. That's the alternative. Using Lookout or similar tools, you can assign tons and tons of policies, starting with SSO integrations or based access control. You can decide who gets access to where. You can add a bunch of data. Masking rules, you can control data governance,
Starting point is 00:40:09 do exactly what the data is going to be stored. You can control data retention. And obviously you have audit logs on top of everything else, so you know exactly what happened and why. And at the end of the day, if this is about servicing your application and ensuring it's running optimally as it should,
Starting point is 00:40:31 it's part of your day-to-day operations, and you've put all the guardrails and safeties in place which we provide them, that's the best way to ensure not only compliance, but also resilience as a whole. Cool. Thanks for those answers. I mean, we've been facing questions like this over the past 15 years since we've been living in the observability world. And I think you bring up really good points that in the end, what really matters is that you can fix your system challenges as fast as possible.
Starting point is 00:41:11 And you've obviously thought about everything that you need to think about when you design a system like what you've built to make sure that misuse can be avoided, that you have all the guardrails in place, that you also audit, everything is auditable. And I think that's obviously you have adopters of that technology before, we had adopters of the technology before you joined the Dynatrace family. And it's just great to hear. And I'm just really excited to see how much easier the life of our users will be, especially as they are trying to then, like Brian said, try to find that phone in the messy bedroom.
Starting point is 00:41:51 They just want to find that messy bug somewhere under a lot of dirty lines of code, and then it's going to be easy to spot it. Forward-looking, is there anything... I'm not sure how much you can obviously say because these are always things when we talk about looking to the future. But I assume there's still a lot of ideas
Starting point is 00:42:14 on what other things can happen and what other things observability providers like we as Dynatrace can do for developers in the future, even beyond what you have already built. I assume you have a long list of things that you would like to include. And as I said, I know it's going to be challenging to talk maybe about some roadmap items,
Starting point is 00:42:37 but if there's anything you can say, even if it's just, hey, sure, there's a lot of stuff coming, be prepared. So definitely there's a lot of stuff coming. Be prepared. Definitely, there's a lot of stuff coming. I would say that's part of the joy of joining such a huge platform and with a lot of technical excellence, Dynatrace, there are so many opportunities out there. There are so many observability signals lying there.
Starting point is 00:43:04 There are so many capabilities within the storage engine, the query engine, you know, and everything else that's going on that the options are truly limitless to building amazing, valuable applications, features that can cater to developers in a way I think that aren't seen today
Starting point is 00:43:25 in the market almost anywhere. And the combination of those capabilities in a single platform can lead to an amazing user experience and amazing automation and workflow capabilities. One thing
Starting point is 00:43:41 we'll be focusing on for the next year on top of everything around Rookout is also around the IT integrations. Some of that exists from a pre-acquisition that we're working on right now, but it's super important for us to have Rookout available, to have tenant trace capabilities available in all the common IDEs.
Starting point is 00:44:02 And for us, it's super important to bring as many observability capabilities or even more so as many observability insights as possible very close to the engineers in their ideas in a contextual manner making it a seamless part of the development experience a seamless part of the observability of the troubleshooting experience, having observability data and insights at their fingertips, having the observability platform push insights proactively for the developers in a contextual manner
Starting point is 00:44:39 so that you will benefit from observability. Because I think one of the things you will often see is that you see the observability-savvy engineers, the production-savvy engineers, those who are not just about coding, of the most senior, they know their way around the cloud, they know their way around the Linux machine,
Starting point is 00:45:02 they know their way around Kubernetes and containers and observability and they can get tons of stuff done and they are very good at diving into the nitty-gritty details but not everybody had the time to dive into those areas some people have other areas of expertise some people have not spent as long in the company and are not as familiar with all in the company and not as familiar with all the
Starting point is 00:45:26 tech stack and all the tools. And it's a huge burden. And you find that some people are enabled and can find their way in production. Even pre-production is often very complex. While most people just
Starting point is 00:45:41 struggle to deal with this super complicated tech stacks and everything you need to know, now that observability can really be a differentiator here. It can be a turning point. It can make everything so much easier by making things accessible to the average developer without having to go through training and learning each and every tool and understanding every aspect of the text doc just getting easy to digest insights from the observability it can be a game changer
Starting point is 00:46:14 for so many developers yeah i'm looking i'm looking forward to as you said the combination of our technologies where davis is going to an area, a hotspot area in your application landscape and then triggering the snapshots. And then as a developer, I don't need to react necessarily on a message that brings me into a dashboard with metrics and logs, but it brings me into my ide where my debugger runs so it looks like a debugger but it's actually like analyzing all these snapshots i mean that's that's a really that's a really nice way of thinking about it and how these two worlds can really kind of uh benefit from each other yeah that's pretty cool yeah definitely and i'm in a bit of an analogy type of day today
Starting point is 00:47:06 so i'll leave it with one last analogy um how often do you find yourself watching a movie and be like oh my gosh that's that guy right and you're like i want to look up who that actor is because i reckon i need on all this stuff or actress whatever you know whoever it is well if you go back to let's say the 90s you'd have to get off the couch go to your computer connect your modem and then maybe go on a usenet or some group and hope somebody had a catalog and there's there's three parts to this so part two now is you either have your phone or your laptop which is already connected so it's really easy you pull up imdb or wikipedia they have the cast right there at the top and you can easily find it right in scenario one you one, you're probably not going to do it. And I should take a step back.
Starting point is 00:47:47 The reason why this is important is for developer observability to succeed, right? It's got to be easy. It's got to give them the relevant data, right? So the modem, not easy. It's going to be hard to find the data. Your phone, go into IMDB. It's going to be very easy. But then if you take a look at what Amazon does on Amazon Prime, if you pause that video, it's going to show you all the actors in the scene and you can just move up to them and bam, get the bio right there. I think that is the key for developer adoption because if they've got to work hard for it, they're probably
Starting point is 00:48:19 going to be less likely to do it. And if the information is not relevant, again, but we want the developers to adopt this more, right? And I think that's going to be the key towards, you know, whatever tooling, you know, even beyond what Rookout's doing, beyond what we're doing, whatever these tasks are that we need the developers to undertake, it's got to be made easy and very, very relevant. And then we'll start seeing the fruits of that labor, I think. A cool analogy. Definitely.
Starting point is 00:48:46 All right, Liran, thank you so much for spending an hour with us and inspiring us and enlightening us and educating us on what developer observability really is all about. I think I now understand better on why developers need a different type of observability that we
Starting point is 00:49:02 have been always talking about. And yeah, I'm very much looking forward to seeing you in Vegas. And I'm very much looking forward to having your technology part of the Dynatrace technology. And overall, whether it's Dynatrace or we always look beyond Dynatrace,
Starting point is 00:49:20 I think these discussions should just inspire people with new ideas on what's possible because we should not be comfortable with the status quo. Definitely. Looking forward to meeting you all and thanks for having me for the show. Thank you. And thanks to all our listeners.
Starting point is 00:49:36 Hope this was helpful. Till next time.
