Software at Scale 14 - Liran Haimovitch: CTO, Rookout
Episode Date: March 23, 2021
Liran Haimovitch is the Co-Founder and CTO of Rookout, a new-style debugging tool that enables developers to debug web applications by adding debugger-style breakpoints in production (without actually... stopping the application). Rookout belongs to a new class of developer tools that aim to make application debugging more interactive than the “inspect logs” experience that is standard industry practice today. I’d encourage checking out the demo to understand the tool better. We discuss the state of developer tools today, debuggability, observability, and “understandability” of code, the technical implementation of Rookout, the engineering workflows around debugging tools, the sales process for Rookout, and more.
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Thank you for being a guest on the Software at Scale podcast.
Can you give a little bit of an introduction to yourself for guests and for listeners?
Of course. So my name is Liran Haimovitch, and I'm a co-founder and CTO at Rookout.
Before founding Rookout, I spent about a decade doing cyber security, you know, developing stuff
in that area, kernel mode, user mode, researching how viruses work, how antiviruses work,
and doing all sorts of projects along that area. And then about five years ago, I decided to go my own way,
build my own thing.
That's kind of how Rookout got started.
And now we're kind of taking
the cybersecurity mindset and skillset
to develop a new kind of debugger,
new kind of debugging platform,
something that will allow a new generation
of how we operate in production.
And I did a little bit of research.
Like Rookout seems to have started in Jan 2017.
Is that like approximately when?
And can you tell me like exactly,
like, you know, what gave you the idea?
Is there like a spark that you decided
to go with your co-founders
to build something like Rookout?
So on the one hand,
it's always been a pain of mine
as I was working on cyber projects,
whether it was desktop applications
or server applications.
Something goes wrong
in the application you're running
and it always goes wrong in production
and it tends to go wrong at the weekend.
And you have no idea what's going on.
And the thing is,
as you scale your software,
as it gets bigger, as it gets beyond your laptop,
then nobody cares anymore about how it's looking on your laptop.
Nobody cares anymore how it's operating in the demo environment.
All that matters is how it works in the real world,
how it works for the customer.
And that's when things often act very differently than they do on your machine.
And up until now, the only way to operate in those remote environments has been through
logging.
You essentially, you read the log file and it's never enough.
It tends to have more holes than Swiss cheese.
And then you have to add more logs.
And adding logs is going to take you, if you're lucky, a few hours. If you're unlucky, it's going to take you days and
weeks. And then it's probably not going to give you the full picture, it's just going to give you
a few extra hints. And then you're going to find yourself doing it over again, over again. And I
remember one bug, I spent six months chasing that bug.
I think I released 15 versions during those six months.
And I know 20 people kept asking me about that bug daily.
Not the best experience of my life.
So, yeah, it's clear that running software in production
means you can't have any of your nice tools to help debug stuff locally.
You can basically just try to log stuff and figure out what's going wrong.
So then where does Rookout come into that?
How does Rookout help? So, with Rookout, there is a famous quote by Mary Poppendieck, who wrote some of the early, more important stuff about the management theory behind DevOps.
And she said that an organization's maturity is measured by how fast it can deliver a single line of code.
And the thought comes to mind: if all I'm delivering is one log line, why do I have to go through this entire software development lifecycle?
I have to write my code.
I have to test it.
It has to go through unit tests, end-to-end tests.
It has to be approved, code reviewed.
And all I'm doing is changing the logline.
Well, we all know that once you change a single line of code,
you never know what's going to happen.
And you can always mess things up in so many
different ways, whether it's performance or availability or correctness, side effects.
And you really do have to make those tests. And I was wondering, we were wondering kind of,
what can we do about it? Maybe we can find a different way to add that line of code,
to add that logline without endangering the system, without requiring all those tests and deployments.
And that's kind of how we built Rookout. We built Rookout on the one
hand to provide you with a debugger-like experience, set a breakpoint,
and get the data, so-called, except we don't
do it locally. We do it in the remote running environment, and we don't stop the application.
We collect the data you would see in a traditional debugger with a lot of safeties baked in
to make sure you're not going to impact your production environments in any way,
but you're going to get an almost seamless debugger experience while operating in production,
potentially operating on hundreds of servers at the same time.
So to listeners, like I tried out like a demo,
I tried out using the sandbox on like rookout.com
and it's basically very similar
to like IntelliJ's debugging experience
that I experienced.
Like I put like a break point
on one handler method of like a web server.
And when that handler was called, like it collected a bunch of information for me,
like local variables and what the state of the server was,
which seems immensely valuable once you actually get to play with it.
And has the experience of like other customers just been similar to that?
So customers, I love showing the product, because whenever I show the product, people get excited. I've seen more than one jaw drop, literally.
The first time, it was awkward, like, does that really happen in the real world? Is that not a cartoon thing?
So I can tell you, if you surprise people enough, their jaws literally drop. And I've seen that more than once.
But it goes beyond that because often when you speak to customers,
you hear of their horror stories,
the bugs they've been chasing for three or six or 12 months.
Sometimes if you are lucky, they still have those bugs open.
And a few times they even deployed Rookout
and found those bugs in 15 minutes.
So that's very rewarding.
But in general,
empowering
engineers to operate in
staging and production environments
as if they were their own
laptops,
that's super satisfying.
I guess if you think about
it from first principles,
why is it so much easier to debug locally today?
I'm saying in a pre-Rookout world,
that's how most development happens today.
Why is it so easy to debug stuff locally versus in production?
Because you have all of these tools
and you can just basically print every variable
and recreate state.
And that's what Rookout is trying to emulate,
but in production environments.
When you're working locally,
you have full control over what's going on.
You have full visibility, and you have lots of room for error
because even if you mess something up,
let's say you're going to erase your database.
It's your local database. Nobody cares.
You can just spin it up again.
If you disconnect your network by accident, nobody cares.
And as you have so much control, so much visibility, it's very easy.
And if you mess something up, then it's usually a matter of seconds or minutes to get it back together.
And you're good to go.
When you're working in production, the scale is much larger.
The complexity is much
bigger, and the cost of mistake is
obviously so much bigger.
And so either you don't have access at all, or you have some restricted access using ops, whatever, SSH, and things get honestly much scarier.
And so when you are operating in production,
things are much scarier
because you can totally mess things up.
You can mess up your environment.
And when you're working in production, customers are going to notice, everybody's going to notice, and you want your production to be up and running.
Yeah, so there's like a general fear when people are working in production: I don't want to mess something up, I don't want to log too much here, because what if I overwhelm my logging infrastructure? And yeah, it seems like tools can help with that, right? Like if your tool can make sure that it's not going to get overwhelmed when you log something there, you'll feel that much safer in doing things. Is that the right way to think about it?
Yeah, that's exactly the right way.
Rookout has a lot of built-in safeties
regarding how the tool is going to be used,
what it's going to collect.
You can tailor some of those safeties
to meet your requirements,
but you can rest assured that whatever is going to happen,
everything is going to stay safe.
Even if you set a breakpoint in the hottest code path,
something that gets called a million times a second,
we're going to detect
that and disable it or move it to sampling mode based on your preference so everything is going
to keep running as usual. Okay. And that's pretty interesting,
yeah, because if I want to log, I know that I don't want to add a log in a tight loop because that's going to overwhelm things. How does Rookout detect that it's being called too often and protect your codebase from
getting affected by that? So for each breakpoint that gets called, we calculate the frequency of
the calls and the time spent within the breakpoint itself for data collection. And then we cap those
using various limits. And if you break those limits,
once we see that you're approaching those limits,
we're going to move to sampling mode.
And if we decide something looks fishy or it's taking too long,
then we just disable it altogether.
Because after all, you want your production to keep running,
even if it means missing out on some of the data.
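For illustration, here is a minimal sketch of the kind of guard Liran describes: track how often a breakpoint fires and how long collection takes, degrade to sampling when it gets hot, and disable it if collection becomes expensive. All names and thresholds are hypothetical; this is not Rookout's actual implementation.

```python
import random
import time

class BreakpointGuard:
    """Toy safety guard for a non-breaking breakpoint (illustrative only)."""

    def __init__(self, max_calls_per_sec=1000, max_collect_ms=5.0, sample_rate=0.01):
        self.max_calls_per_sec = max_calls_per_sec
        self.max_collect_ms = max_collect_ms
        self.sample_rate = sample_rate
        self.window_start = time.monotonic()
        self.calls_in_window = 0
        self.sampling = False
        self.disabled = False

    def should_collect(self):
        if self.disabled:
            return False
        now = time.monotonic()
        if now - self.window_start >= 1.0:      # reset the one-second window
            self.window_start, self.calls_in_window = now, 0
        self.calls_in_window += 1
        if self.calls_in_window > self.max_calls_per_sec:
            self.sampling = True                # breakpoint is too hot: fall back to sampling
        if self.sampling:
            return random.random() < self.sample_rate
        return True

    def record_collection_time(self, elapsed_ms):
        if elapsed_ms > self.max_collect_ms:    # collection itself is too slow
            self.disabled = True                # fail safe: turn the breakpoint off entirely
```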
Yeah, that makes sense.
And that is exactly the trade-off that I would think of.
Does this happen locally inside the Rookout library, or is that happening in a server somewhere and you're making a network call? The emphasis of Rookout is offloading as much as we can to the heart
of the application. So essentially what happens is that we use an SDK.
We have a JAR for Java, a NuGet package for .NET, and we have PyPI, npm, and gem packages for Python, Node, and Ruby.
And essentially what happens, you install us,
just one more package, you initialize us when you get started.
And then for each environment, we have our own technique
that we connect to the running application in memory.
We find the functions you're interested in monitoring
and we literally modify them in memory,
kind of like we recompile them
with the additional
data collection code per your request.
And then that's the code that does the collection with all the safeties built in.
And the data, once it's collected, gets pushed into a background queue that gets flushed
out without impacting the application at all.
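As a rough sketch of the data flow described here, the snippet below wraps a function, captures a snapshot when it runs, and hands it to a background worker so the hot path only pays for a queue insert. Everything in it is hypothetical and simplified; Rookout rewrites the function in memory rather than wrapping it with a decorator.

```python
import functools
import queue
import threading

snapshot_queue = queue.Queue(maxsize=10_000)

def send_to_backend(snapshot):
    # Hypothetical stub: a real agent would ship the snapshot to a collection service.
    print(snapshot)

def shipper():
    # Background worker: drains snapshots without blocking the application threads.
    while True:
        send_to_backend(snapshot_queue.get())

threading.Thread(target=shipper, daemon=True).start()

def with_snapshot(func):
    """Illustrative decorator: collect a snapshot of a call, never break production."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            # Drop the snapshot rather than block if the queue is full.
            snapshot_queue.put_nowait({
                "function": func.__name__,
                "args": repr(args),
                "kwargs": repr(kwargs),
            })
        except queue.Full:
            pass
        return func(*args, **kwargs)
    return wrapper
```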
Okay.
So yeah, that makes sense. I hadn't thought about how it's not going to be super easy to add a breakpoint anywhere in your code base. You literally need to modify the code that runs. You're not just injecting a library call or something; you have to inject the code itself. That is actually pretty interesting. It
seems like a pretty complex problem, right?
Where you had to do a bunch of engineering work
for each language that you support, right?
Yeah, so it's kind of ironic.
I spent most of my career doing C++ development,
actually for the Windows operating system.
And I've somehow found myself over the past five years
diving deep into the implementation of the Python interpreter
and Node V8, OpenJDK, Ruby MRI,
and kind of figuring out all of those obscure unknown APIs,
sometimes looking at the specs themselves of the implementation
and definitely the source code,
and kind of
tying it all together to build something that's reliable, easy to use and gets the job done.
And that sounds like a lot of complexity and I'm honestly scared of the amount of stuff
you had to do to make this work, but it sounds awesome.
And just so that I can clarify my understanding of how Rookout works then. So there's a web UI where a developer sees
their code base, presumably from GitHub or something, you must be grabbing their source code.
And you have an IDE interface where you can basically put a breakpoint on any line of code.
That gets transmitted to a server. And the server will presumably
somehow transport that to your production servers where the SDK detects that there's a new break
point and dynamically changes the code base to inject that break point and then puts that
information back in a queue and sends that back to the Rookout servers for you to debug.
And then that materializes on the Rookout UI.
Is that roughly accurate?
Yeah, that's fairly accurate.
Yeah.
Presumably that's just what happens.
Okay, cool.
And yeah, it sounds like a fun technological problem
and solving a real customer problem.
So that sounds amazing, honestly.
And it seems like customers are also interested in something like this. I can also see the clear difference between this and something passive where you have to manually log to something like Sentry. But maybe you can talk about, you know, what is the difference between the experience of using Sentry, where you manually log exceptions and you still get breadcrumbs and all that, versus using Rookout?
So Sentry is an awesome tool. They've just raised a round recently, so they're growing crazy fast. Love the company. And Sentry has been growing in a group of tools called error tracking.
Their focus is on identifying when things go wrong in the system.
You can either report exceptions yourself via an API,
or you can use middleware supplied by them
to automatically detect issues in web servers and so on.
And then once an exception happens, once an error happens,
you can see the point where the error was detected,
essentially the point where the exception was thrown, potentially with breadcrumbs that can show you what the activity in the system was before, stuff such as HTTP requests or logs, and so on, depending on the system itself.
There are a few key differences here,
because everybody who uses Sentry knows that it's not just about capturing the data, it's also what you do with it.
A large part of Sentry's functionality, dare I say most of it, relies not on how you collect the raw data, but on how you make it accessible.
How do you aggregate various errors into the same group?
How do you set alerts on those and so on and so forth?
And so Sentry is kind of a tool that allows you to see the errors in your system,
group them, and then kind of give you some advice into where they are
and what to do about them.
Rookout is very, very different. First and foremost, Rookout doesn't collect anything by itself.
We only collect the stuff you want, and so you can use it on anything, anytime. Many bugs in the system
or issues in the system are not related to errors. In fact, quite often, Rookout is used outside the context of errors at all.
Many customers tell us that as they roll out new features,
they like using Rookout to see that
the code is behaving as expected,
just setting a break point,
seeing the new flow is taking place,
or that the variable is changing.
Also, especially when working with dynamically typed languages, you often want to see what value a variable is getting in production.
Whether it's JavaScript or Python or Ruby, you want to see what's the real values.
And there are many use cases when you don't have errors.
Even if you're looking at a bug, it might not be an error.
Maybe you've sent the wrong response, but didn't throw an exception.
And last but not least,
when you do have an exception,
you're just seeing the point where it was thrown.
You might not be seeing what led to it.
And often seeing what happened,
just where it was thrown is enough,
but more often you need more data
and you need more context.
And so Rookout is kind of a tool you can use whenever you need a new piece of data.
And you can even use Rookout and Sentry interchangeably.
You can connect from Sentry to Rookout when you need more information.
And you can send data from Rookout to other services, such as Sentry,
as you collect more data and you want to correlate it with what's already there.
Interesting.
So how does the Rookout Sentry integration work?
So let's say I add a breakpoint
and that piece of code gets called
like 10 times a second, let's say.
And then you can use Sentry
to aggregate those 10 calls
into like one message on Sentry
so it's easier to understand that?
So there are two ways.
Actually, there is a Rookout integration
on the Sentry marketplace
where you can just add a Rookout button
to your stack traces
so that when you see an error,
you can click Rookout and go to debug further,
go to the right server or to the right code
and continue your debugging session
if you need more information.
The other approach is
if you set a breakpoint
within a catch block, you can collect the exception
and kind of send it to Sentry,
which is very similar to what you would be doing in code,
but maybe for some reason you didn't report to Sentry
from that catch block, from that except block,
whether it's because you thought it's a benign exception
or it was too noisy, or you forgot,
and now all of a sudden you want to monitor these catch blocks
so you can do it using Rookout.
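For comparison, this is roughly what reporting from an except block looks like when it is written into the code up front with the Sentry SDK, which is the step a breakpoint in the catch block lets you add after the fact. The DSN is a placeholder, and the exception type and business logic are made up for the example.

```python
import sentry_sdk

# Placeholder DSN; a real project would use its own.
sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")

class PaymentError(Exception):
    """Hypothetical domain exception for the example."""

def charge(order):
    # Hypothetical business logic that can fail.
    raise PaymentError(f"card declined for order {order!r}")

def handle_payment(order):
    try:
        charge(order)
    except PaymentError as exc:
        # Reporting from the catch/except block, written into the code up front.
        sentry_sdk.capture_exception(exc)
        raise
```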
That makes sense.
And one way that I've been thinking now to summarize Rookout
is it makes developing in production interactive,
where with most languages,
I'm assuming, you know,
not counting languages like Erlang,
where it's easier to like hot swap code in production.
Rookout actually helps you interact with your code
as it's running in production.
And that's not really possible
with the class of like monitoring
and observability tools that we have today.
Is that roughly accurate?
That's accurate.
I think today the technology exists
to hot-swap code in most runtimes.
The thing is, it's not so much whether it's possible
as whether it's possible to do it in a safe manner,
whether it's advisable to do so.
And I know very few engineers I would trust to hot swap code in a production environment.
Yes.
And I'm not sure I'm one of them.
Yeah.
So instead of letting engineers hot swap code directly, what you do is you trust this one SDK, basically, that does it in a safe way for you, where it doesn't actually make any changes, but it just sends logging information, and it makes it decently interactive so that you can debug anything that's going wrong.
Yeah, we've kind of taken the functionality of hot swapping
and simplified it to something that's much more concrete, safe, easier to test. And so you get most of the benefits of it
without any of the risks.
Yeah.
How does a developer like integrate their Rookout,
like the Rookout SDK?
Like is that open source
and it's just about like adding a library?
It seems like it would be a little more convoluted, right?
If it's doing much more than a library behind the scenes.
So it's just a library.
Okay.
And you install it as part of your dependencies. For Node it's npm install, for instance, and then you import the library, call start with the token, and that's it.
Okay.
The SDKs are not open source yet. We are working on open sourcing them. And that's pretty much it. We kind of try to take away the magic in easy-to-consume packages.
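Roughly what that integration looks like for a Python service, as a sketch based on the description above; the exact package and function names may differ from the real SDK, and the token is a placeholder.

```python
# Install the agent as a normal dependency, e.g.:
#   pip install rook
import rook

# Called once at process startup; the token identifies your organization.
rook.start(token="YOUR_ROOKOUT_TOKEN")

# ...the rest of the application starts as usual...
```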
So you would use it just like you would import any other third-party dependency. You
import it, you run it, and it does a bunch of magic behind the scenes. And you have to
give it a client ID or something like that. Interesting. That is pretty cool. I actually just want to know more about customer reactions, because I know that if I used this at work, it would make my life so much easier.
Yeah, just seeing the smile on your face tells me the story.
So I think much of the focus we've been seeing was actually on the business impact. Many engineers I've spoken to saw the value in it for themselves. They were like, we're staring at the screen all day and it's very painful and it's wasting our time, but that's kind of our job, and we might be able to make it slightly faster,
but that might be a lot of work
and we have to go through the organization,
we have to buy a new tool,
we have to deploy to production.
That's obviously sometimes scary,
especially if you're a software engineer.
Most software engineers don't often deploy
monitoring tools to production.
And what we found is that software engineers underestimate
the impact this has on the business.
When you fail to solve customer issues in a fast and consistent manner,
that's hurting your business, that's hurting the end customer.
And we're seeing that time and time again.
Deals get delayed, deals get cancelled, and the ability to provide high-quality service by handling those bugs and issues quickly really matters a lot. I think that's one of the things we're trying to teach engineers: it goes beyond their personal suffering to have a real impact on the business, and that impact on the business is more than enough reason to go ahead and change it, because there are better ways to work, better ways to do our jobs.
Okay, yeah, I think education is a big part of it. Because what I'm thinking, I have multiple questions, but the first thing here: it sounds like the sales process is mostly towards engineers, because they're empowered to basically make purchases like this and spread them in the organization. But one thing that I'm also thinking is, first of all, is that accurate? Like, you generally sell it to an engineer, and you don't have to sell it directly to a head of engineering or something like that?
It's a debugging tool. It's used by engineers, and software engineers are the focus points. Usually the director or head of engineering is the one to sign off on it. Okay. But engineers of all ranks are advocates, I would say.
Yeah. And once an engineer has bought the tool for
their organization, do you generally find it hard to educate the rest of that team or the rest of the company that they have a tool like Rookout? I'm just contrasting it with something like Sentry, where everybody knows that there's a Sentry because that's where you see all the errors, but Rookout is much more of an interactive, one-on-one type of experience, right? Especially with COVID, where people are not sitting next to each other at their desks and seeing how their coworkers are debugging things. How does knowledge of Rookout spread in an organization?
Oh, there are a few answers to that. Actually, that's an interesting product question, and something we're always focusing on: how do we make more people aware of what we're doing? We found that the collaboration feature is actually very useful. We were surprised how often people ask for it, because sometimes, just like when you see a snippet of code and you're wondering what the hell it is doing, you want to post it on a Slack channel or send it to a friend and ask: what is this? What is it doing? Why was it written this way? Or, I saw in git blame that it's your fault, so tell me what's going on. The same happens when you debug. You take a snapshot, you see an odd value or an odd class, and you're wondering: hey, what's that doing here? Why is this variable getting this value?
Why is this variable seven and not nine?
And that's actually something that's very useful.
And you can share it.
You can share directly to Slack.
You can add it to a ticket in Jira.
And you can share that information.
And that's actually very useful, because it's much more accurate to take a full snapshot with the timestamp and the line number and the file name and all the variables that have been collected. You can share it with somebody, and they can tell you, this is this, or this is that. It's much better than trying to describe what you think you've seen, especially since, as we all know, sometimes we make mistakes. With the full context, it's much easier to verify what we think we saw, and much easier for somebody to explain it to us, because here's the full context of what we've seen.
That makes a lot of sense.
It's like you can basically make it easy to share
the result of a breakpoint,
which is, first of all, impossible locally.
It's too hard to be able to do that
without taking a screenshot.
Anybody can see what the result of that breakpoint is, and somebody else can point out, oh, it
looks like this variable is not what it should be.
And that way, knowledge of the tool can also spread within an organization.
That makes a lot of sense.
And it's very similar to something like Sourcegraph, where you get to just share snippets of code
in your code search tool, and that's how more people start using your code search tool, even if that's not
a tool they used before. With Rookout, it seems like you have this ability to get a
lot of information from the code base. Is there anything else that you're using that
ability for? How else, what other problems are you solving and how are you making debugging easier
in general?
So you can actually take the data extracted by Rookout and you can send it to your favorite
analytics tool.
Whether you are a fan of logging using Kibana or Yumi or Splunk, whether you prefer using
metrics within Grafana or Prometheus or Datadog, you can just inject new logs and metrics
and send them to your usual tool
so that you can see side-by-side metrics and logs
that you've added via code
and the metrics and logs that you've added via Rookout.
And you can just see them interspersed
and kind of tell the story together.
So what kind of metric are you adding with Rookout?
Is it the fact that you've added a breakpoint
and you want to check the value of a variable
that gets logged into something like Datadog?
Maybe you want to count the number of times
a function is called.
Maybe you want to count the number of times
you've passed to a specific line.
Maybe you can even add the condition,
how many times was this function called with that variable every second.
And you can kind of throw in those metrics on the fly
instead of having to write a StatsD reporter
or a new Prometheus exporter.
You can just throw in the data,
and we're going to tie it in for you
and ETL it all the way to your
target of choice. Interesting. So
today what I would do is I would
add a line in my code base saying
add a distribution or add a log line
here. And that would be like a whole
PR and submitting it and pushing it to
production. With Rookout, I can
add a breakpoint,
a pseudo breakpoint, and I can automatically
get metrics on how many times this line of
code is being called
in production.
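For context, this is roughly what that manual path looks like today: a hand-written counter, a PR, and a deploy before the metric shows up anywhere. The example uses the standard prometheus_client Python library, but the metric name, labels, and functions are made up for illustration.

```python
from prometheus_client import Counter

# The metric you would otherwise have to bake into the code and ship.
LOGIN_ATTEMPTS = Counter("login_attempts_total", "Number of login attempts", ["result"])

def check_credentials(user, password):
    # Hypothetical stand-in for real authentication.
    return bool(user and password)

def login(user, password):
    if check_credentials(user, password):
        LOGIN_ATTEMPTS.labels(result="success").inc()
        return True
    LOGIN_ATTEMPTS.labels(result="failure").inc()
    return False
```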
Yeah.
That seems like a
whole new product
in itself, right?
Like it's like
monitoring on the fly.
Yeah.
But sometimes you're
adding a breakpoint
and you need just
for, I don't know,
you need for 10 minutes
to see this metric.
Yeah.
Or maybe you're debating what's the right metric, and you're trying to find the right place to count the number of logins.
What is the most accurate place
in the code to do so?
So you can experiment,
you try things.
Traditionally,
if you have to open a PR
for the experiment,
that's not very nice.
If you can just throw in a couple,
throw in three, five, seven breakpoints,
you see the data coming out
of each of them.
And then you can say
which is best
for you, which serves your purposes the best and stick with it. And you can do that entire experiment
in a space of 10 minutes. Yeah. The way I'm thinking about this is like experimenting with
like AWS console or GCP console before making changes in Terraform. It's like you basically make sure everything works, everything is accurate, and then you make changes in your code base to make sure that it's consistent and it can be multi-cloud friendly or whatever. So you get the opportunity of experimenting super quickly, and then all the benefits of version control once things are set in stone. That is cool, because I did not realize that we don't have that ability with monitoring today. Everything
has to be explicitly logged or something in the code base. But with something like Rookout,
you actually get to skip that. But it seems like you can take this even further. If I
can count how many times a function is being called in a second, can I time how long that
function takes on the fly if I'm debugging
like a performance regression? So that's actually our latest feature, which we've just released. Traditionally, tracing tools have been able to show you how long a function takes, how long a code segment takes. But again, you have to build it into the code. You have to say, I want to measure
from here to there. And you usually
set a few dozens of
spans per request.
And then you often have a lot of
holes.
This span
took all of a sudden five seconds.
Why did it take five seconds? What's in those
five seconds? I want to drill down a bit
deeper. Now go ahead and do a PR and see you tomorrow.
Instead, Rookout allows you to do it on the fly.
You can select any two lines of code in the same function
or in different functions, even in different microservices.
And we can tell you the latency between those two lines of code
for a single request.
Okay, so I can say like this is point A
and this is point B and tell me how long it takes
for one line of code to reach from point,
like for execution to go from line A to line B
and it can create a graph for me on the fly.
Yeah, it can create a graph for you on the fly.
You can put point A and point B on the same function.
You can put them in different functions.
You can even put them across microservices
so you can see how long does the request
take from point A. You can actually
have multiple points. So you can go from point A
to point B to point C, and you're going to
see all of that, and you're going to see it per request.
So if this request took from
point A to point B to point C this long,
this request took more or less.
You can even use conditional breakpoints.
So only monitor the differences
between point A and point B
if value X is 7
or for a specific customer
or any other
condition you're interested in.
And does this work behind the scenes
like super similar to Jaeger where like you
create like a unique ID and then
that gets propagated and
that's how the backend can figure out that this ID was at this microservice at this time, and it can just
show that in the UI to you? Yeah, so it works exactly like Jaeger. We rely on the OpenTracing and the OpenTelemetry specs, so it's very easy to set up.
The big benefit is that you can create stuff on the fly. You don't
have to worry about it. You don't have to mess around
with it. You just get the
benefits of the process
data in a nice pretty graph
without having to worry about it.
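For contrast, this is the traditional approach being described: to time a segment you wrap it in a span in the code itself, and changing what you measure means another deploy. It uses the standard OpenTelemetry Python API, but the span and function names are illustrative, and a real setup would also configure a tracer provider and exporter.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def validate(order):
    pass  # hypothetical step A

def charge(order):
    pass  # hypothetical step B

def process_order(order):
    with tracer.start_as_current_span("validate-order"):
        validate(order)
    with tracer.start_as_current_span("charge-payment"):
        # The segment you would otherwise need a new span (and a new deploy) to time.
        charge(order)
```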
That is a very unique
way of using the open tracing
spec. I think there's a monitoring conference coming up,
which I don't know if you've applied to give a talk there,
but I think that would be a very unique talk.
Most people are just talking about how they deployed Jaeger
in their production environment,
but this is just such a different take.
It's like, can I trace on the fly by clicking on lines of code?
It's just pretty unique.
And I'm also guessing that then like Rookout like works seamlessly with Jaeger.
So if let's say I already have Jaeger installed on my code base and I'm like tracing stuff,
can I see spans from Jaeger or is that like just going to be in the Jaeger UI?
So first and foremost, we actually collect all this.
We actually collect the span data.
So when you set a breakpoint, we're going to show you everything that's been on the
span at that point in time, including the request ID, all the tags, the logs associated
with it.
Additionally, you can use the request ID we collect for you.
You can look it up in Jaeger and see the entire span on the fly without having to search for it.
Okay.
So, yeah, I guess it makes sense because you're using the same standard under the hood.
Mm-hmm.
Interesting.
Well, the so-called same standard has been redefined maybe five times over the past few years.
But that's open source politics.
I'm not going there.
I mean, that actually,
I want to take you there now.
So what do you think is happening?
Like, I had no idea about this.
And you can describe
in how much of a detail
that you found.
Have you tried contributing to these
or are you part of the discussion?
So I haven't been too deep
into the discussion
because tracing is not the focus
of what we do.
Yeah. And we do actually use tracing quite a bit for our SaaS platform.
So I'm familiar with it both as a consumer of tracing and as a vendor in the space.
Tracing started with the rise of Jaeger and Zipkin and the teams around Pinterest and Uber and all that.
They've pioneered some pretty cool concepts
and some pretty interesting stuff.
And then along the way came OpenTracing,
which was supposed to standardize that.
And I haven't been following too deep into it,
but at some point there was a break
and some group,
a separate group,
I don't remember its name,
started competing with OpenTracing.
And then a year ago,
they announced OpenTelemetry.
And they've kind of deprecated everything else
before announcing OpenTelemetry
was production grade.
So it's been a mess over the past couple of years
with all these competing standards
and nothing is truly ready.
But now open telemetry is maturing.
And hopefully that's going to put
the open source wars behind us,
at least in the spans and tracing realm.
That's so interesting. And most engineers are not even aware of all of this stuff happening behind the scenes. I'm sure this is happening
on a GitHub discussion and a Zoom call or something. I don't even know what people are
discussing about this. Presumably the fight is technical in nature, like, oh, it's too much data
per trace or something
like that. I don't even know what you would discuss. They're always going to find something.
Yeah. Yeah. Like which format to use and all that. Cool. Yeah. I think this has been super
informative. Is there something that I'm missing out on
that you think we should be telling listeners?
I can talk on for hours,
but no point in making it too long.
If there is anything of interest you want to discuss,
I don't know.
I'm open to it.
I guess I had one question
from way back in the beginning
that I'm still thinking about.
You said that you had a bunch of experience
in cybersecurity before you started this company.
How did that help?
How did that shape your experience
into building a tool like this?
So going back to the example I gave about how long it takes to deploy a single line of code.
Essentially, lines of code all end up as bytes in memory.
And quite often, the change you make in an application, all it does is flip one bit.
You take one bit, you turn it on and off,
and that's the only real change you've made in the application.
Now you're recompiling, you're running it through CICD,
you're testing, you're approving,
but at the end of the day, you just flipped a bit.
And kind of thinking of walking through the layers: how does the source code you're seeing in front of you translate into a running application?
How is the application behaving as it's running?
Much of that is the cybersecurity mindset and skill set.
That's a lot of stuff that I used to do around the Windows operating system,
the Linux operating systems, kind of taking things apart,
figuring out how they work and what are the implications of that.
And so at Rookout, we do similar things,
except we do those things to the runtime themselves.
We take apart the Python runtime or the Java runtime.
We kind of figure out how it's working.
We discuss it with the contributors.
We read online.
I gave a few talks at PyCon about how the Python interpreter works, about the debugger inside of it.
And by kind of studying how it works
and somewhat taking it apart,
we can redesign what it's doing
and we can learn how to utilize what's already there
to do something it was never meant to do.
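A small taste of the machinery that is "already there" in CPython: the same hook debuggers use, sys.settrace, can capture local variables at a chosen line without pausing the program. This is only a toy illustration of the idea, not how Rookout is built (settrace adds far too much overhead to leave on in production), and the target file and line are hypothetical.

```python
import sys

TARGET_FILE, TARGET_LINE = "app.py", 42  # hypothetical location of the "breakpoint"

def tracer(frame, event, arg):
    # Called by the interpreter for every line executed while tracing is active.
    if (event == "line"
            and frame.f_code.co_filename.endswith(TARGET_FILE)
            and frame.f_lineno == TARGET_LINE):
        print("snapshot:", dict(frame.f_locals))  # collect the locals, don't stop
    return tracer

sys.settrace(tracer)
```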
That makes sense.
And I can see the cybersecurity mindset in that.
It's ultimately just code and it's just a runtime.
It's just bits.
So if you modify that, you can get what you want.
And it shouldn't take you like six months to
find a bug. Because we control basically everything that's being done. Yeah, that's the irony of it,
we control the code, we control the servers, we control the build process, we control
everything. And still, we have no idea what our code is doing. It sounds very sad when you think
about it. But it means that there's like a lot
of opportunity for things to get better.
There always is.
Yeah, and I think this is great. Thank you so much for being a guest. I know it's late there, so have a good night. Hope you had fun.