Software at Scale 14 - Liran Haimovitch: CTO, Rookout
Episode Date: March 23, 2021
Liran Haimovitch is the Co-Founder and CTO of Rookout, a new-style debugging tool that enables developers to debug web applications by adding debugger-style breakpoints in production (without actually... stopping the application). Rookout belongs to a new class of developer tools that aim to make application debugging more interactive than the “inspect logs” experience that is standard industry practice today. I’d encourage checking out the demo to understand the tool better. We discuss the state of developer tools today, debuggability, observability, and “understandability” of code, the technical implementation of Rookout, the engineering workflows around debugging tools, the sales process for Rookout, and more.
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Thank you for being a guest on the Software at Scale podcast.
Can you give a little bit of an introduction to yourself for guests and for listeners?
Of course. So my name is Liran Haimovitch, and I'm a co-founder and CTO at Rookout.
Before founding Rookout, I spent about a decade doing cyber security, you know, developing stuff
in that area, kernel mode, user mode, researching how viruses work, how antiviruses work,
and doing all sorts of projects along that area. And then about five years ago, I decided to go my own way,
build my own thing.
That's kind of how Rookout got started.
And now we're kind of taking
the cybersecurity mindset and skillset
to develop a new kind of debugger,
new kind of debugging platform,
something that will allow a new generation
of how we operate in production.
And I did a little bit of research.
Like Rookout seems to have started in Jan 2017.
Is that like approximately when?
And can you tell me like exactly,
like, you know, what gave you the idea?
Is there like a spark that you decided
to go with your co-founders
to build something like Rookout?
So on the one hand,
it's always been a pain of mine
as I was working on cyber projects,
whether it was desktop applications
or server applications.
Something goes wrong
in the application you're running
and it always goes wrong in production
and it tends to go wrong at the weekend.
And you have no idea what's going on.
And the thing is,
as you scale your software,
as it gets bigger, as it gets beyond your laptop,
then nobody cares anymore about how it's looking on your laptop.
Nobody cares anymore how it's operating in the demo environment.
All that matters is how it works in the real world,
how it works for the customer.
And that's when things often act very differently than they do on your machine.
And up until now, the only way to operate in those remote environments has been through
logging.
You essentially, you read the log file and it's never enough.
It tends to have more holes than Swiss cheese.
And then you have to add more logs.
And adding logs is going to take you, if you're lucky, a few hours. If you're unlucky, it's going to take you days and
weeks. And then it's probably not going to give you the full picture, it's just going to give you
a few extra hints. And then you're going to find yourself doing it over again, over again. And I
remember one bug, I spent six months chasing that bug.
I think I released 15 versions during those six months.
And I know 20 people kept asking me about that bug daily.
Not the best experience of my life.
So, yeah, it's clear that running software in production
means you can't have any of your nice tools to help debug stuff locally.
You can basically just try to log stuff and figure out what's going wrong.
So then where does Rookout come into that?
How does Rookout help? So, with Rookout, there is a famous quote by Mary Poppendieck, who wrote some of the early, more important stuff about the management theory behind DevOps.
And she said that an organization's maturity is measured by how fast it can deliver a single line of code.
And the thought comes to mind: if all I'm delivering is one log line, why do I have to go through this entire software development lifecycle?
I have to write my code.
I have to test it.
It has to go through unit tests, end-to-end tests.
It has to be approved, code reviewed.
And all I'm doing is changing the logline.
Well, we all know that once you change a single line of code,
you never know what's going to happen.
And you can always mess things up in so many
different ways, whether it's performance or availability or correctness, side effects.
And you really do have to make those tests. And I was wondering, we were wondering kind of,
what can we do about it? Maybe we can find a different way to add that line of code,
to add that logline without endangering the system, without requiring all those tests and deployments.
And that's kind of how we built Rookout. We built Rookout on the one
hand to provide you with a debugger-like experience, set a breakpoint,
and get the data, so-called, except we don't
do it locally. We do it in the remote running environment, and we don't stop the application.
We collect the data you would see in a traditional debugger with a lot of safeties baked in
to make sure you're not going to impact your production environments in any way,
but you're going to get an almost seamless debugger experience while operating in production,
potentially operating on hundreds of servers at the same time.
So to listeners, like I tried out like a demo,
I tried out using the sandbox on like rookout.com
and it's basically very similar
to like IntelliJ's debugging experience
that I experienced.
Like I put like a break point
on one handler method of like a web server.
And when that handler was called, like it collected a bunch of information for me,
like local variables and what the state of the server was,
which seems immensely valuable once you actually get to play with it.
And has the experience of like other customers just been similar to that?
So customers, I love showing the product, because whenever I show the product, people get excited. I've seen more than one jaw drop, literally.
The first time, it was awkward, like, does that really happen in the real world? Is that not a cartoon thing?
So I can tell you, if you surprise people enough, their jaws literally drop. And I've seen that more than once.
But it goes beyond that because often when you speak to customers,
you hear of their horror stories,
the bugs they've been chasing for three or six or 12 months.
Sometimes if you are lucky, they still have those bugs open.
And a few times they even deployed Rookout
and found those bugs in 15 minutes.
So that's very rewarding.
But in general,
empowering
engineers to operate in
staging and production environments
as if they were their own
laptops,
that's super satisfying.
I guess if you think about
it from first principles,
why is it so much easier to debug locally today?
I'm saying in a pre-Rookout world,
that's how most development happens today.
Why is it so easy to debug stuff locally versus in production?
Because you have all of these tools
and you can just basically print every variable
and recreate state.
And that's what Rookout is trying to emulate,
but in production environments.
When you're working locally,
you have full control over what's going on.
You have full visibility, and you have lots of room for error
because even if you mess something up,
let's say you're going to erase your database.
It's your local database. Nobody cares.
You can just spin it up again.
If you disconnect your network by accident, nobody cares.
And as you have so much control, so much visibility, it's very easy.
And if you mess something up, then it's usually a matter of seconds or minutes to get it back together.
And you're good to go.
When you're working in production, the scale is much larger.
The complexity is much
bigger, and the cost of mistake is
obviously so much bigger.
And so either you don't have access at all, or you have some restricted access using ops, whatever, SSH, and things get honestly much scarier.
And so when you are operating in production,
things are much scarier
because you can totally mess things up.
You can mess up your environment.
And when you're working in production, customers are going to notice, everybody's going to notice, and you want your production to be up and running.
Yeah, so there's like a general fear when people are working in production: I don't want to mess something up, I don't want to log too much here, because what if I overwhelm my logging infrastructure? And yeah, it seems like tools can help with that, right? Like if your tool can make sure that it's not going to get overwhelmed when you log something there, you'll feel that much safer in doing things. Is that the right way to think about it?
Yeah, that's exactly the right way.
Rookout has a lot of built-in safeties
regarding how the tool is going to be used,
what it's going to collect.
You can tailor some of those safeties
to meet your requirements,
but you can rest assured that whatever is going to happen,
everything is going to stay safe.
Even if you set a breakpoint in the hottest code path,
something that gets called a million times a second,
we're going to detect
that and disable it or move it to sampling mode based on your preference so everything is going
to keep running as usual. Okay. And that's pretty interesting,
yeah, because if I want to log, I know that I don't want to add a log in a tight loop because that's going to overwhelm things. How does Rookout detect that it's being called too often and protect your codebase from
getting affected by that? So for each breakpoint that gets called, we calculate the frequency of
the calls and the time spent within the breakpoint itself for data collection. And then we cap those
using various limits. And if you break those limits,
once we see that you're approaching those limits,
we're going to move to sampling mode.
And if we decide something looks fishy or it's taking too long,
then we just disable it altogether.
Because after all, you want your production to keep running,
even if it means missing out on some of the data.
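For illustration, here is a minimal sketch of the kind of guard Liran describes: track how often a breakpoint fires and how long collection takes, degrade to sampling when it gets hot, and disable it if collection becomes expensive. All names and thresholds are hypothetical; this is not Rookout's actual implementation.

```python
import random
import time

class BreakpointGuard:
    """Toy safety guard for a non-breaking breakpoint (illustrative only)."""

    def __init__(self, max_calls_per_sec=1000, max_collect_ms=5.0, sample_rate=0.01):
        self.max_calls_per_sec = max_calls_per_sec
        self.max_collect_ms = max_collect_ms
        self.sample_rate = sample_rate
        self.window_start = time.monotonic()
        self.calls_in_window = 0
        self.sampling = False
        self.disabled = False

    def should_collect(self):
        if self.disabled:
            return False
        now = time.monotonic()
        if now - self.window_start >= 1.0:      # reset the one-second window
            self.window_start, self.calls_in_window = now, 0
        self.calls_in_window += 1
        if self.calls_in_window > self.max_calls_per_sec:
            self.sampling = True                # breakpoint is too hot: fall back to sampling
        if self.sampling:
            return random.random() < self.sample_rate
        return True

    def record_collection_time(self, elapsed_ms):
        if elapsed_ms > self.max_collect_ms:    # collection itself is too slow
            self.disabled = True                # fail safe: turn the breakpoint off entirely
```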
Yeah, that makes sense.
And that is exactly the trade-off that I would think of.
Does this happen locally inside the Rookout library, or is that happening in a server somewhere and you're making a network call? The emphasis of Rookout is offloading as much as we can to the heart
of the application. So essentially what happens is that we use an SDK.
We have a JAR for Java, a NuGet package for .NET, and we have PyPI, npm, and gem packages for Python, Node, and Ruby.
And essentially what happens, you install us,
just one more package, you initialize us when you get started.
And then for each environment, we have our own technique
that we connect to the running application in memory.
We find the functions you're interested in monitoring
and we literally modify them in memory,
kind of like we recompile them
with the additional
data collection code per your request.
And then that's the code that does the collection with all the safeties built in.
And the data, once it's collected, gets pushed into a background queue that gets flushed
out without impacting the application at all.
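As a rough sketch of the data flow described here, the snippet below wraps a function, captures a snapshot when it runs, and hands it to a background worker so the hot path only pays for a queue insert. Everything in it is hypothetical and simplified; Rookout rewrites the function in memory rather than wrapping it with a decorator.

```python
import functools
import queue
import threading

snapshot_queue = queue.Queue(maxsize=10_000)

def send_to_backend(snapshot):
    # Hypothetical stub: a real agent would ship the snapshot to a collection service.
    print(snapshot)

def shipper():
    # Background worker: drains snapshots without blocking the application threads.
    while True:
        send_to_backend(snapshot_queue.get())

threading.Thread(target=shipper, daemon=True).start()

def with_snapshot(func):
    """Illustrative decorator: collect a snapshot of a call, never break production."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            # Drop the snapshot rather than block if the queue is full.
            snapshot_queue.put_nowait({
                "function": func.__name__,
                "args": repr(args),
                "kwargs": repr(kwargs),
            })
        except queue.Full:
            pass
        return func(*args, **kwargs)
    return wrapper
```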
Okay.
So yeah, that makes sense. I hadn't thought about how it's not going to be super easy to add a breakpoint anywhere in your code base. You literally need to modify the code that runs. You're not just injecting a library call or something; you have to inject the code itself. That is actually pretty interesting. It
seems like a pretty complex problem, right?
Where you had to do a bunch of engineering work
for each language that you support, right?
Yeah, so it's kind of ironic.
I spent most of my career doing C++ development,
actually for the Windows operating system.
And I've somehow found myself over the past five years
diving deep into the implementation of the Python interpreter
and Node V8, OpenJDK, Ruby MRI,
and kind of figuring out all of those obscure unknown APIs,
sometimes looking at the specs themselves of the implementation
and definitely the source code,
and kind of
tying it all together to build something that's reliable, easy to use and gets the job done.
And that sounds like a lot of complexity and I'm honestly scared of the amount of stuff
you had to do to make this work, but it sounds awesome.
And just so that I can clarify my understanding of how Rookout works then. So there's a web UI where a developer sees
their code base, presumably from GitHub or something, you must be grabbing their source code.
And you have an IDE interface where you can basically put a breakpoint on any line of code.
That gets transmitted to a server. And the server will presumably
somehow transport that to your production servers where the SDK detects that there's a new break
point and dynamically changes the code base to inject that break point and then puts that
information back in a queue and sends that back to the Rookout servers for you to debug.
And then that materializes on the Rookout UI.
Is that roughly accurate?
Yeah, that's fairly accurate.
Yeah.
Presumably that's just what happens.
Okay, cool.
And yeah, it sounds like a fun technological problem
and solving a real customer problem.
So that sounds amazing, honestly.
And it seems like customers are also interested in something like this. I can also see the clear difference between this and something passive where you have to manually log to something like Sentry. But maybe you can talk about, you know, what is the difference between the experience of using Sentry, where you manually log exceptions and you still get breadcrumbs and all that, versus using Rookout?
So Sentry is an awesome tool. They've just raised a round recently, so they're growing crazy fast. Love the company. And Sentry has been growing in a group of tools called error tracking.
Their focus is on identifying when things go wrong in the system.
You can either report exceptions yourself via an API,
or you can use middleware supplied by them
to automatically detect issues in web servers and so on.
And then once an exception happens, once an error happens,
you can see the point where the error was detected,
essentially the point where the exception was thrown, potentially with breadcrumbs that can show you what the activity in the system was before, stuff such as HTTP requests or logs, and so on, depending on the system itself.
There are a few key differences here,
because everybody who uses Sentry knows that it's not just about capturing the data, it's also what you do with it.
A large part of Sentry's functionality, dare I say most of it, relies not on how you collect the raw data, but on how you make it accessible.
How do you aggregate various errors into the same group?
How do you set alerts on those and so on and so forth?
And so Sentry is kind of a tool that allows you to see the errors in your system,
group them, and then kind of give you some advice into where they are
and what to do about them.
Rookout is very, very different. First and foremost, Rookout doesn't collect anything by itself.
We only collect the stuff you want, and so you can use it on anything, anytime. Many bugs in the system
or issues in the system are not related to errors. In fact, quite often, Rookout is used outside the context of errors at all.
Many customers tell us that as they roll out new features,
they like using Rookout to see that
the code is behaving as expected,
just setting a break point,
seeing the new flow is taking place,
or that the variable is changing.
Also, especially when working with dynamically typed languages, you often want to see what value a variable is getting in production.
Whether it's JavaScript or Python or Ruby, you want to see what's the real values.
And there are many use cases when you don't have errors.
Even if you're looking at a bug, it might not be an error.
Maybe you've sent the wrong response, but didn't throw an exception.
And last but not least,
when you do have an exception,
you're just seeing the point where it was thrown.
You might not be seeing what led to it.
And often seeing what happened,
just where it was thrown is enough,
but more often you need more data
and you need more context.
And so Rookout is kind of a tool you can use whenever you need a new piece of data.
And you can even use Rookout and Sentry interchangeably.
You can connect from Sentry to Rookout when you need more information.
And you can send data from Rookout to other services, such as Sentry,
as you collect more data and you want to correlate it with what's already there.
Interesting.
So how does the Rookout Sentry integration work?
So let's say I add a breakpoint
and that piece of code gets called
like 10 times a second, let's say.
And then you can use Sentry
to aggregate those 10 calls
into like one message on Sentry
so it's easier to understand that?
So there are two ways.
Actually, there is a Rookout integration
on the Sentry marketplace
where you can just add a Rookout button
to your stack traces
so that when you see an error,
you can click Rookout and go to debug further,
go to the right server or to the right code
and continue your debugging session
if you need more information.
The other approach is
if you set a breakpoint
within a catch block, you can collect the exception
and kind of send it to Sentry,
which is very similar to what you would be doing in code,
but maybe for some reason you didn't report to Sentry
from that catch block, from that except block,
whether it's because you thought it's a benign exception
or it was too noisy, or you forgot,
and now all of a sudden you want to monitor these catch blocks
so you can do it using Rookout.
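For comparison, this is roughly what reporting from an except block looks like when it is written into the code up front with the Sentry SDK, which is the step a breakpoint in the catch block lets you add after the fact. The DSN is a placeholder, and the exception type and business logic are made up for the example.

```python
import sentry_sdk

# Placeholder DSN; a real project would use its own.
sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")

class PaymentError(Exception):
    """Hypothetical domain exception for the example."""

def charge(order):
    # Hypothetical business logic that can fail.
    raise PaymentError(f"card declined for order {order!r}")

def handle_payment(order):
    try:
        charge(order)
    except PaymentError as exc:
        # Reporting from the catch/except block, written into the code up front.
        sentry_sdk.capture_exception(exc)
        raise
```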
That makes sense.
And one way that I've been thinking now to summarize Rookout
is it makes developing in production interactive,
where with most languages,
I'm assuming, you know,
not counting languages like Erlang,
where it's easier to like hot swap code in production.
Rookout actually helps you interact with your code
as it's running in production.
And that's not really possible
with the class of like monitoring
and observability tools that we have today.
Is that roughly accurate?
That's accurate.
I think today the technology exists
to hot-swap code in most runtimes.
The thing is, it's not so much whether it's possible
as whether it's possible to do it in a safe manner,
whether it's advisable to do so.
And I know very few engineers I would trust to hot swap code in a production environment.
Yes.
And I'm not sure I'm one of them.
Yeah.
So instead of letting engineers hot swap code directly, what you do is you trust this one SDK, basically, that does it in a safe way for you, where it doesn't actually make any changes, but it just sends logging information, and it makes it decently interactive so that you can debug anything that's going wrong.
Yeah, we've kind of taken the functionality of hot swapping
and simplified it to something that's much more concrete, safe, easier to test. And so you get most of the benefits of it
without any of the risks.
Yeah.
How does a developer like integrate their Rookout,
like the Rookout SDK?
Like is that open source
and it's just about like adding a library?
It seems like it would be a little more convoluted, right?
If it's doing much more than a library behind the scenes.
So it's just a library.
Okay.
And you install it as part of your dependencies. For Node it's npm install, for instance, and then you import the library, call start with the token, and that's it.
Okay.
The SDKs are not open source yet. We are working on open sourcing them. And that's pretty much it. We kind of try to take away the magic in easy-to-consume packages.
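Roughly what that integration looks like for a Python service, as a sketch based on the description above; the exact package and function names may differ from the real SDK, and the token is a placeholder.

```python
# Install the agent as a normal dependency, e.g.:
#   pip install rook
import rook

# Called once at process startup; the token identifies your organization.
rook.start(token="YOUR_ROOKOUT_TOKEN")

# ...the rest of the application starts as usual...
```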
So you would use it just like you would import any other third-party dependency. You
import it, you run it, and it does a bunch of magic behind the scenes. And you have to
give it a client ID or something like that. Interesting. That is pretty cool. I actually just want to know more about customer reactions, because I know that if I used this at work, it would make my life so much easier.
Yeah, just seeing the smile on your face tells me the story.
So I think much of the focus we've been seeing was actually on the business impact. Many engineers I've spoken to saw the value in it for themselves. They were like, we're staring at the screen all day and it's very painful and it's wasting our time, but that's kind of our job, and we might be able to make it slightly faster,
but that might be a lot of work
and we have to go through the organization,
we have to buy a new tool,
we have to deploy to production.
That's obviously sometimes scary,
especially if you're a software engineer.
Most software engineers don't often deploy
monitoring tools to production.
And what we found is that software engineers underestimate
the impact this has on the business.
When you fail to solve customer issues in a fast and consistent manner,
that's hurting your business, that's hurting the end customer.
And we're seeing that time and time again.
Deals get delayed, deals get cancelled, and the ability to provide high-quality service by handling those bugs and issues quickly really matters a lot. I think that's one of the things we're trying to teach engineers: it goes beyond their personal suffering to have a real impact on the business, and that impact on the business is more than enough reason to go ahead and change it, because there are better ways to work, better ways to do our jobs.
Okay, yeah, I think education is a big part of it. Because what I'm thinking, I have multiple questions, but the first thing here: it sounds like the sales process is mostly towards engineers, because they're empowered to basically make purchases like this and spread them in the organization. But one thing that I'm also thinking is, first of all, is that accurate? Like, you generally sell it to an engineer, and you don't have to sell it directly to a head of engineering or something like that?
It's a debugging tool. It's used by engineers, and software engineers are the focus points. Usually the director or head of engineering is the one to sign off on it. Okay. But engineers of all ranks are advocates, I would say.
Yeah. And once an engineer has bought the tool for
their organization, do you generally find it hard to educate the rest of that team or the rest of the company that they have a tool like Rookout? I'm just contrasting it with something like Sentry, where everybody knows that there's a Sentry because that's where you see all the errors, but Rookout is much more of an interactive, one-on-one type of experience, right? Especially with COVID, where people are not sitting next to each other at their desks and seeing how their coworkers are debugging things. How does knowledge of Rookout spread in an organization?
Oh, there are a few answers to that. Actually, that's an interesting product question, and something we're always focusing on: how do we make more people aware of what we're doing? We found that the collaboration feature is actually very useful. We were surprised how often people ask for it, because sometimes, just like when you see a snippet of code and you're wondering what the hell it is doing, you want to post it on a Slack channel or send it to a friend and ask: what is this? What is it doing? Why was it written this way? Or, I saw in git blame that it's your fault, so tell me what's going on. The same happens when you debug. You take a snapshot, you see an odd value or an odd class, and you're wondering: hey, what's that doing here? Why is this variable getting this value?
Why is this variable seven and not nine?
And that's actually something that's very useful.
And you can share it.
You can share directly to Slack.
You can add it to a ticket in Jira.
And you can share that information.
And that's actually very useful, because it's much more accurate to take a full snapshot with the timestamp and the line number and the file name and all the variables that have been collected. You can share it with somebody, and they can tell you, this is this, or this is that. It's much better than trying to describe what you think you've seen, especially since, as we all know, sometimes we make mistakes. With the full context, it's much easier to verify what we think we saw, and much easier for somebody to explain it to us, because here's the full context of what we've seen.
That makes a lot of sense.
It's like you can basically make it easy to share
the result of a breakpoint,
which is, first of all, impossible locally.
It's too hard to be able to do that
without taking a screenshot.
Anybody can see what the result of that breakpoint is, and somebody else can point out, oh, it
looks like this variable is not what it should be.
And that way, knowledge of the tool can also spread within an organization.
That makes a lot of sense.
And it's very similar to something like Sourcegraph, where you get to just share snippets of code
in your code search tool, and that's how more people start using your code search tool, even if that's not
a tool they used before. With Rookout, it seems like you have this ability to get a
lot of information from the code base. Is there anything else that you're using that
ability for? How else, what other problems are you solving and how are you making debugging easier
in general?
So you can actually take the data extracted by Rookout and you can send it to your favorite
analytics tool.
Whether you are a fan of logging using Kibana or Yumi or Splunk, whether you prefer using
metrics within Grafana or Prometheus or Datadog, you can just inject new logs and metrics
and send them to your usual tool
so that you can see side-by-side metrics and logs
that you've added via code
and the metrics and logs that you've added via Rookout.
And you can just see them interspersed
and kind of tell the story together.
So what kind of metric are you adding with Rookout?
Is it the fact that you've added a breakpoint
and you want to check the value of a variable
that gets logged into something like Datadog?
Maybe you want to count the number of times
a function is called.
Maybe you want to count the number of times
you've passed to a specific line.
Maybe you can even add the condition,
how many times was this function called with that variable every second.
And you can kind of throw in those metrics on the fly
instead of having to write a StatsD reporter
or a new Prometheus exporter.
You can just throw in the data,
and we're going to tie it in for you
and ETL it all the way to your
target of choice. Interesting. So
today what I would do is I would
add a line in my code base saying
add a distribution or add a log line
here. And that would be like a whole
PR and submitting it and pushing it to
production. With Rookout, I can
add a breakpoint,
a pseudo breakpoint, and I can automatically
get metrics on how many times this line of
code is being called
in production.
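For context, this is roughly what that manual path looks like today: a hand-written counter, a PR, and a deploy before the metric shows up anywhere. The example uses the standard prometheus_client Python library, but the metric name, labels, and functions are made up for illustration.

```python
from prometheus_client import Counter

# The metric you would otherwise have to bake into the code and ship.
LOGIN_ATTEMPTS = Counter("login_attempts_total", "Number of login attempts", ["result"])

def check_credentials(user, password):
    # Hypothetical stand-in for real authentication.
    return bool(user and password)

def login(user, password):
    if check_credentials(user, password):
        LOGIN_ATTEMPTS.labels(result="success").inc()
        return True
    LOGIN_ATTEMPTS.labels(result="failure").inc()
    return False
```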
Yeah.
That seems like a
whole new product
in itself, right?
Like it's like
monitoring on the fly.
Yeah.
But sometimes you're
adding a breakpoint
and you need just
for, I don't know,
you need for 10 minutes
to see this metric.
Yeah.
Or maybe you're debating what's the right metric, and you're trying to find the right place to count the number of logins.
What is the most accurate place
in the code to do so?
So you can experiment,
you try things.
Traditionally,
if you have to open a PR
for the experiment,
that's not very nice.
If you can just throw in a couple,
throw in three, five, seven breakpoints,
you see the data coming out
of each of them.
And then you can say
which is best
for you, which serves your purposes the best and stick with it. And you can do that entire experiment
in a space of 10 minutes. Yeah. The way I'm thinking about this is like experimenting with
like AWS console or GCP console before making changes in Terraform. It's like you basically make sure everything works, everything is accurate, and then you make changes in your code base to make sure that it's consistent and it can be multi-cloud friendly or whatever. So you get the opportunity of experimenting super quickly, and then all the benefits of version control once things are set in stone. That is cool, because I did not realize that we don't have that ability with monitoring today. Everything
has to be explicitly logged or something in the code base. But with something like Rookout,
you actually get to skip that. But it seems like you can take this even further. If I
can count how many times a function is being called in a second, can I time how long that
function takes on the fly if I'm debugging
like a performance regression? So that's actually our latest feature, which we've just released. Traditionally, tracing tools have been able to show you how long a function takes, how long a code segment takes. But again, you have to build it into the code. You have to say, I want to measure
from here to there. And you usually
set a few dozens of
spans per request.
And then you often have a lot of
holes.
This span
took all of a sudden five seconds.
Why did it take five seconds? What's in those
five seconds? I want to drill down a bit
deeper. Now go ahead and do a PR and see you tomorrow.
Instead, Rookout allows you to do it on the fly.
You can select any two lines of code in the same function
or in different functions, even in different microservices.
And we can tell you the latency between those two lines of code
for a single request.
Okay, so I can say like this is point A
and this is point B and tell me how long it takes
for one line of code to reach from point,
like for execution to go from line A to line B
and it can create a graph for me on the fly.
Yeah, it can create a graph for you on the fly.
You can put point A and point B on the same function.
You can put them in different functions.
You can even put them across microservices
so you can see how long does the request
take from point A. You can actually
have multiple points. So you can go from point A
to point B to point C, and you're going to
see all of that, and you're going to see it per request.
So if this request took from
point A to point B to point C this long,
this request took more or less.
You can even use conditional breakpoints.
So only monitor the differences
between point A and point B
if value X is 7
or for a specific customer
or any other
condition you're interested in.
And does this work behind the scenes
like super similar to Jaeger where like you
create like a unique ID and then
that gets propagated and
that's how the backend can figure out that this ID was at this microservice at this time, and it can just
show that in the UI to you? Yeah, so it works exactly like Jaeger. We rely on the OpenTracing and the OpenTelemetry specs, so it's very easy to set up.
The big benefit is that you can create stuff on the fly. You don't
have to worry about it. You don't have to mess around
with it. You just get the
benefits of the process
data in a nice pretty graph
without having to worry about it.
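For contrast, this is the traditional approach being described: to time a segment you wrap it in a span in the code itself, and changing what you measure means another deploy. It uses the standard OpenTelemetry Python API, but the span and function names are illustrative, and a real setup would also configure a tracer provider and exporter.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def validate(order):
    pass  # hypothetical step A

def charge(order):
    pass  # hypothetical step B

def process_order(order):
    with tracer.start_as_current_span("validate-order"):
        validate(order)
    with tracer.start_as_current_span("charge-payment"):
        # The segment you would otherwise need a new span (and a new deploy) to time.
        charge(order)
```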
That is a very unique
way of using the open tracing
spec. I think there's a monitoring conference coming up,
which I don't know if you've applied to give a talk there,
but I think that would be a very unique talk.
Most people are just talking about how they deployed Jaeger
in their production environment,
but this is just such a different take.
It's like, can I trace on the fly by clicking on lines of code?
It's just pretty unique.
And I'm also guessing that then like Rookout like works seamlessly with Jaeger.
So if let's say I already have Jaeger installed on my code base and I'm like tracing stuff,
can I see spans from Jaeger or is that like just going to be in the Jaeger UI?
So first and foremost, we actually collect all this.
We actually collect the span data.
So when you set a breakpoint, we're going to show you everything that's been on the
span at that point in time, including the request ID, all the tags, the logs associated
with it.
Additionally, you can use the request ID we collect for you.
You can look it up in Jaeger and see the entire span on the fly without having to search for it.
Okay.
So, yeah, I guess it makes sense because you're using the same standard under the hood.
Mm-hmm.
Interesting.
Well, the so-called same standard has been redefined maybe five times over the past few years.
But that's open source politics.
I'm not going there.
I mean, that actually,
I want to take you there now.
So what do you think is happening?
Like, I had no idea about this.
And you can describe
in how much of a detail
that you found.
Have you tried contributing to these
or are you part of the discussion?
So I haven't been too deep
into the discussion
because tracing is not the focus
of what we do.
Yeah. And we do actually use tracing quite a bit for our SaaS platform.
So I'm familiar with it both as a consumer of tracing and as a vendor in the space.
Tracing started with the rise of Jaeger and Zipkin and the teams around Pinterest and Uber and all that.
They've pioneered some pretty cool concepts
and some pretty interesting stuff.
And then along the way came OpenTracing,
which was supposed to standardize that.
And I haven't been following too deep into it,
but at some point there was a break
and some group,
a separate group,
I don't remember its name,
started competing with OpenTracing.
And then a year ago,
they announced OpenTelemetry.
And they've kind of deprecated everything else
before announcing OpenTelemetry
was production grade.
So it's been a mess over the past couple of years
with all these competing standards
and nothing is truly ready.
But now open telemetry is maturing.
And hopefully that's going to put
the open source wars behind us,
at least in the spans and tracing realm.
That's so interesting. And most engineers are not even aware of all of this stuff happening behind the scenes. I'm sure this is happening
on a GitHub discussion and a Zoom call or something. I don't even know what people are
discussing about this. Presumably the fight is technical in nature, like, oh, it's too much data
per trace or something
like that. I don't even know what you would discuss. They're always going to find something.
Yeah. Yeah. Like which format to use and all that. Cool. Yeah. I think this has been super
informative. Is there something that I'm missing out on
that you think we should be telling listeners?
I can talk on for hours,
but no point in making it too long.
If there is anything of interest you want to discuss,
I don't know.
I'm open to it.
I guess I had one question
from way back in the beginning
that I'm still thinking about.
You said that you had a bunch of experience
in cybersecurity before you started this company.
How did that help?
How did that shape your experience
into building a tool like this?
So going back to the example I gave about how long it takes to deploy a single line of code.
Essentially, lines of code all end up as bytes in memory.
And quite often, the change you make in an application, all it does is flip one bit.
You take one bit, you turn it on and off,
and that's the only real change you've made in the application.
Now you're recompiling, you're running it through CICD,
you're testing, you're approving,
but at the end of the day, you just flipped a bit.
And kind of thinking of walking through the layers: how does the source code you're seeing in front of you translate into a running application?
How is the application behaving as it's running?
Much of that is the cybersecurity mindset and skill set.
That's a lot of stuff that I used to do around the Windows operating system,
the Linux operating systems, kind of taking things apart,
figuring out how they work and what are the implications of that.
And so at Rookout, we do similar things,
except we do those things to the runtime themselves.
We take apart the Python runtime or the Java runtime.
We kind of figure out how it's working.
We discuss it with the contributors.
We read online.
I gave a few talks at PyCon about how the Python interpreter works, about the debugger inside of it.
And by kind of studying how it works
and somewhat taking it apart,
we can redesign what it's doing
and we can learn how to utilize what's already there
to do something it was never meant to do.
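A small taste of the machinery that is "already there" in CPython: the same hook debuggers use, sys.settrace, can capture local variables at a chosen line without pausing the program. This is only a toy illustration of the idea, not how Rookout is built (settrace adds far too much overhead to leave on in production), and the target file and line are hypothetical.

```python
import sys

TARGET_FILE, TARGET_LINE = "app.py", 42  # hypothetical location of the "breakpoint"

def tracer(frame, event, arg):
    # Called by the interpreter for every line executed while tracing is active.
    if (event == "line"
            and frame.f_code.co_filename.endswith(TARGET_FILE)
            and frame.f_lineno == TARGET_LINE):
        print("snapshot:", dict(frame.f_locals))  # collect the locals, don't stop
    return tracer

sys.settrace(tracer)
```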
That makes sense.
And I can see the cybersecurity mindset in that.
It's ultimately just code and it's just a runtime.
It's just bits.
So if you modify that, you can get what you want.
And it shouldn't take you like six months to
find a bug. Because we control basically everything that's being done. Yeah, that's the irony of it,
we control the code, we control the servers, we control the build process, we control
everything. And still, we have no idea what our code is doing. It sounds very sad when you think
about it. But it means that there's like a lot
of opportunity for things to get better.
There always is.
Yeah, and I think this is great. Thank you so much for being a guest. I know it's late there, so have a good night. Hope you had fun.