PurePerformance - Why Developers have different Observability Requirements with Liran Haimovitch
Episode Date: January 1, 2024

After analyzing distributed traces for more than 15 years, Brian and I thought that everyone in software engineering and operations must be satisfied with all that observability data we have available... But maybe Brian and I were wrong, because we didn't fully understand all the use cases - especially those for developers who must fix code in production, or who need to quickly understand what somebody else's code is really doing, without the luxury of adding another log line and redeploying on the fly. To learn more about the observability requirements of developers we invited Liran Haimovitch, CTO at Rookout and now part of Dynatrace, who has spent the last 7 years solving the challenging problems that developers face day and night. Tune in and learn what non-breaking breakpoints are, how it is possible to "debug in production" without impacting running code, and how we can make developers' lives easier even though we push so many things "to the left".
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody, and welcome to another episode of Pure Performance. Yes, yes, and we're on episode 198, which means we're getting close to 200. 200, wow.
I had a dream about you last night.
Not quite the same kind of dreams, because I don't know if there's a lesson in it.
But I mean, maybe there's a lesson for you personally.
But I was sitting on the beach in Hawaii, and there was this crazy dog named Buddha.
And it was jumping into the water, pulling out these big rocks, and just chewing on the rocks with its teeth.
And I was thinking, this dog's gonna break his mouth. So I go up to him, and the dog turns around and faces me, and he's got your face. And he's like, n plus one, n plus one. That was his bark. So that's all I got today.
Yeah, but it reminds me a little bit about the pain that I had the last couple of days,
because it felt like chewing on rocks with my teeth. Thanks for the reminder. The bad tooth
will be extracted in the next two days. So then this problem will be solved forever.
Talking about problems that people need to solve. Hey, look at that.
Such a great segue. Such a great segue. Yeah, we have a guest as always with us today.
And today I want to welcome Liran.
Liran Haimovitch, I hope I pronounced your last name correctly.
Welcome to the show.
Can you do me the favor and introduce yourself?
Who you are, what you've been doing over the last couple of years,
and what motivates you every day?
Sure, Andy.
It's great being here.
So my name is Liran Haimovitch. It's pretty close; it especially depends on how you pronounce it in different nationalities.
But today, I'm an architect at Dynatrace, working with you guys. Up until a few months ago, I was the CTO and co-founder of Rookout, which was acquired by Dynatrace just three, or almost four, months ago now. So time is flying, as you say. Rookout was a developer observability company, and we still run the platform. We serve developers as they try to better understand their code and debug it in various remote environments, even production.
And it's super exciting being a part of Dynatrace
and collaborating on making observability
more accessible for developers in general.
Cool.
Yeah, it's amazing how time flies, four months.
I thought it was longer even.
Yeah, yeah.
It feels forever.
But Liran, just to,
because the topic of today,
I really want to focus in on some of the things you said
in preparation of this podcast.
You said observability requirements for developers are different than what, let's say, we typically assume observability means, maybe for your IT operations, for your site reliability engineer, for your DevOps engineer.
And I'm actually curious, what are the different requirements
when it comes to observability for developers?
Why do developers have different needs for observability?
So, you know, early in the process, in the discussions with Dynatrace, I went to some of your guys, or our guys now, and they were saying that banks love Dynatrace. And the reason for that: they kind of brought this use case where you have one IT guy and he owns 10 applications. Now, for them, they need to know when every one of those applications goes down, and they don't really have the time to focus on each and every application. They want to install something, forget all about it, and whenever something goes wrong, have some sort of alert go off, maybe say hi, whatever you want to call it, pop up and say: hey, something is wrong. This needs your attention right now. If it can even point directly to what went wrong, so you can fix it even more easily, that's what's needed.
That's most IT operations, and I'm not saying that in a bad way. They need to get a lot of stuff done. They need to keep a lot of services up and running. And what they mostly care about is when things break. And that's their job.
If you look at developers,
they are constantly tweaking,
constantly changing,
constantly adapting.
And in fact,
for them,
half the time
something is broken,
whether it's a new feature they are working on
and it has a bug.
Maybe it's something they released last week
that didn't fly well during your release.
Or maybe it's a piece of code from five years ago
a customer just complained about.
For developers, the standard, the usual, is that the code is not working perfectly. Something is wrong. And that's what you're focusing on in the first place.
And all of a sudden, you need to go in deeper.
It's not just about tell me what's right, tell me what's wrong.
It's about how deep can you go if something is wrong.
Even more so, it's not just about this deployment failed,
this database is down, because those are IT problems,
and that's what traditional observability is very good at detecting.
This drive ran out of space, the database is broken,
we need to fix it to clean up, I don't know, delete the temp folder, whatever.
Again, super important stuff, but that wouldn't make it to the developer.
A developer might see that database queries are failing
because some index went out of bound.
And all of a sudden, you need to go in way deeper.
It's not just about the database is broken.
It's not just about these SQL queries are failing.
It's about why.
And those questions require going way deeper into the application.
And even more challenging, those questions are far less anticipated.
Because whenever you're dealing with one of those unexpected issues,
you are trying to solve a murder mystery, trying to figure out what's going on.
And each time, it's a new piece of code,
it's a new application, it's a new library,
it's a new bug, and you start collecting data.
And while for the IT ops guys,
it's the same data day in, day out.
Is my app, what's my request per second?
What's my latency? What's my error rate?
Is it good? Is it bad? Those are
very clear, strong signals
that you know you can rely on.
All of a sudden, you're asking,
you're looking for much weaker signals,
you're trying to build a much more complicated
map, and those
are not as easily answered.
That's a very different
discussion you're having with yourself,
a very different discovery process.
And you often change contexts.
You go into this problem, you go into that problem,
and you're asking different questions.
And it's so much easier when you have tools that are tailored for solving those problems, rather than trying to use the same day-in, day-out, high-level metrics that are great for IT.

I have a couple of thoughts here, because, you probably know, Brian and I, we have been doing kind of, you know, distributed trace analysis. Back in the days we called it PurePath; now it's distributed tracing, whether it comes from an agent like a Dynatrace agent or whether it comes from OpenTelemetry. So we always thought, or I always thought, with distributed traces we already go pretty deep, because we can go down to the method level, we can capture a method argument, we can capture a return value. And one of the things we have done over the years, instead of relying on manually placing what we call sensors, kind of in the first generation of APM, application performance monitoring, where you had to say, I want to have this and this and this, we then became a little smarter and kind of meshed up the feeder points, like the sensors that we placed. And then we also meshed it up with some snapshotting of your threads. And this was kind of like we filled some of the gaps. So that means we've had, and we've worked with, distributed traces with very rich information already for several years.
And now when I listen to you, I still, I know what it is, but I want to hear it from you. I want to hear from you: what additional data, that you don't get from an APM product, if you just think about APM, do developers need to really troubleshoot their apps, besides the distributed traces?
So that's a great question.
But before we dive into the deeper data, different data,
I want to take a step back and say,
there's this amazing quote by Henry David Thoreau.
It's not what you look at that matters, it's what you see.
And the question is not which data the OneAgent is bringing in, or which data OpenTelemetry is bringing in, or whatever APM of choice you're using. It's not just about what data they're bringing in. It's about how they're indexing it, how they're making it possible to use, how they're visualizing it. Now, tracing is amazing, but one of the biggest challenges in tracing is sampling. Now, again, if you're looking at the broad picture, health status, then sampling is great. You don't need 10 million transactions per second fully captured to know if the system is up or down.
On the other hand, if you are looking for that particular transaction that is failing and you need to know why,
then you need that particular transaction.
It doesn't help to know that 99% of the transactions are doing perfectly fine if you didn't capture the one that's failing.
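Liran's sampling point can be made concrete with a tiny sketch. To be clear, this is not how Dynatrace or OpenTelemetry actually decide what to keep (real samplers are far more sophisticated); it is a hypothetical tail-based rule, in Python, that always keeps failed transactions and only samples the healthy ones:

```python
import random

def keep_trace(trace, ok_sample_rate=0.01):
    """Decide, after a transaction finishes, whether to retain its trace.

    Failures are always kept, because the one failing transaction is
    exactly what a developer needs to debug; healthy traces are sampled,
    because an aggregate view is enough for broad health status.
    """
    if trace.get("error"):       # some span reported an error
        return True              # never drop the murder mystery
    return random.random() < ok_sample_rate

# The failing transaction always survives sampling:
failing = {"id": 42, "error": "index out of bounds"}
print(keep_trace(failing))  # True
```

The design point is that the keep/drop decision happens only after the outcome is known (tail-based), which is what lets the rare failure survive while most healthy traffic is dropped.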
So I really want to stress, as we as a group think about developer observability: obviously snapshots, which were the core of Rookout, and which I'll touch upon in a second, are something we super value, but it goes beyond that.
It's about how do we make traces more valuable for developers? How do we make logs more valuable for developers? How do we make metrics? How do we make release monitoring?
All of those tools are great, and they capture amazing data.
But you would often find that developers need, within that context,
different pieces of data or use the same data differently.
But even more so, I think snapshots are a great example of that,
because if you look at what Rookout has been doing with snapshots
and some other stuff we've seen,
is that we literally allow you to get a snapshot of your application
just like it would appear in a debugger.
So you set a breakpoint, a specific line, a non-breaking breakpoint,
and then you get to see
when this line is hit,
those are the local variable values.
That's the stack trace.
This is exactly how it would appear
in the local debugger of your choice.
I'm not judging.
You can use VS Code.
You can use Eclipse.
You can use JetBrains.
I know what's my favorite,
but I don't want to spoil anything.
But you would get a very similar example,
except this is running right now in a remote environment,
potentially distributed.
Don't stop the application.
We provide all the security and privacy and other guardrails
to make sure you can use it anywhere you want.
And yet you can see very, very deep into the code.
And I think that's another key difference between developers and IT ops.
IT ops people or SREs or whatever you want to call them,
generally don't really care about the application code.
They didn't write it.
They don't maintain it.
And that's somebody else's job.
Developers, on the other hand, spend their day, day in, day out, staring at the code. If it's good, it's their code. If they're not so lucky, it's somebody else's code they've taken ownership of after years.
At Rookout, and now Dynatrace, we're trying to make the observability tools very code-specific.
A lot of the signals you see in observability require you to understand the context, the code context. Without it, they might be meaningless.
Think about a log line.
In theory, a log line can say anything.
When you write a function,
you can literally print out the log line
to say anything,
regardless of what's actually happening.
Now, I'm not saying people are evil
and intentionally logging the wrong thing,
but, you know, maybe somebody misunderstood the function
as they added a log line.
Maybe the function changed over time
and the log line wasn't fixed.
So many things can happen.
And it's super important, if you see the log line, to understand which context it was written in, what was happening. If you see a snapshot of the local variables, it's super important to know which version they are on.
Even more so today,
with continuous deployments,
there are so many environments,
there are so many versions floating around.
For instance, you've just deployed a fix, and you now see the bug wasn't fixed at all. But was the right version deployed when the bug kept reproducing? Maybe something went wrong. Maybe another dependency or fix wasn't deployed yet. Or was it deployed?
Those are all critical questions for developers
and it's super important for them
that the signals they get
are correlated with the context,
with the precise version of the code.
And it's also about
how easy it is to get it.
For instance, so, yeah, you deployed something. Now, do I have to go through Argo CD, figure out what's deployed, then move from semantic versioning to Git hashes, then do the Git checkout from those hashes and find the right file? It can easily be a matter of hours or even days just going through the toil.
Or I can have a system that automatically correlates it for me and says: this is the log line, this is the source file it came from, this is the exact version that was running. Use it.
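A small do-it-yourself version of part of this correlation, for illustration only, is to stamp every log record with the commit the build came from. The commit value here is hypothetical and would normally be injected by CI; a dedicated tool does this stamping, and the reverse lookup to the source file, for you:

```python
import logging

GIT_COMMIT = "3f9c2ab"  # hypothetical: injected at build time by CI

# Stamp every log record with the exact code version it came from, so a
# log line can always be matched to the right source file and revision.
_old_factory = logging.getLogRecordFactory()

def _record_factory(*args, **kwargs):
    record = _old_factory(*args, **kwargs)
    record.commit = GIT_COMMIT
    return record

logging.setLogRecordFactory(_record_factory)
logging.basicConfig(format="%(commit)s %(filename)s:%(lineno)d %(message)s")

logging.warning("index went out of bounds")
# emits something like: 3f9c2ab app.py:17 index went out of bounds
```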
Yeah, I think the best way that I can now kind of recap this, because I also used to be a developer, right? And I understand that when you live in a tool that you use to develop and to debug locally, you will be more efficient when you can stay in that tool, like your favorite IDE, and still get the same level of detail from your app, running in whatever version, in whatever remote environment. I think that alone is amazing. And as you correctly said, it goes far beyond what we currently have with distributed traces, if I think alone of all the stack frames, all the environment, all the variables that you have on the stack. And it also obviously goes much beyond what, as you also correctly said, SREs or DevOps or IT operations are interested in, because they are looking at the system as a whole. They're trying to look at it, I think, independent from the application, because they cannot know the ins and outs of the application. They just want to make sure that the system around it is healthy, and therefore they're using the classical indicators like your metrics, or maybe some indicators that they can also extract from the logs, like whether there are more error logs now. But overall, they don't know what the application should do in detail, and they're also not interested in those things. And that's why, if I sum it up: getting this fine-grained information in the tool the developer sits in, so you don't have to waste time going from tool to tool to tool and then trying to find the answer. I think these are, at least from my perspective, how I hear you, the different requirements developers have for observability, especially in production environments.
Definitely.
And I also think one of the biggest benefits in this new approach of developer observability is around dynamic instrumentation.
It's around how do we allow developers to determine the data they need in real time.
Again, because the questions are constantly changing, being able to collect data in real
time to decide, I want to collect data from this line.
I need an extra log line here.
I need a metric here.
I need to understand the performance implication of this line. I need to see how long this takes. I need
to see how often this is called. Being able to do that in real time to specify the data you want to
collect and instantly get it is so much more powerful and cost efficient than trying to capture
everything all the time. You know, I've heard a lot of things
about various observability tools.
I rarely hear customers saying
that observability is so cheap.
At the end of the day,
everybody's trying to collect so much information
and that costs money and that costs resources.
And it's always a challenge of optimizing: how do I collect as much as possible while still not paying too much? And part of the balance, besides obviously being cost-efficient, is also being able to adapt. The more agile you are in your observability, the more you can adapt to changing requirements, to ongoing activities, and the more efficient you can be by not trying to hold everything all the time.
Because trust me, no matter how much you collect,
you're never going to have everything.
You're never going to have everything you want
just by trying to hold everything.
Yeah, that's a great point.
And Andy, I was thinking like an analogy
of what this is, right?
Going back to the idea you presented a long time ago with DevOps and continuous feedback with the photo camera.
Remember, so that idea of digital, basically digital picture for people who don't know this provides that instant feedback.
Whereas when you had to take pictures, you had to wait, get it developed.
And then, oh, I had my thumb covering the lens, right? It's too late now. The analogy I'm thinking of with this one is going to be in the Apple ecosystem, so sorry, Android users, I'm sure there's a similar thing there. But there have been some recent changes to the whole Find My functionality on the iPhone. So if you can't find your watch, you can't find your phone,
whatever you have, this Find My thing.
And where I see the observability part is now they have it
so that you can walk around and it'll detect where it is.
And it's basically like the hot and cold game.
You're getting closer, you're getting closer.
And if I'm lucky, it brings me to my bed, and my phone, or my watch, is just sitting right there. And I can see it, I grab it. That's, you know, if the trace gives you the data you need, fantastic, you're set. But on those days when my bed is an absolute mess, I get there and I'm trying to move through all the covers and everything and I can't find it. Well, then I have, right in the same ecosystem, the ping button. And suddenly my phone makes a ping, and my ears locate exactly where it is. And without having to go through my bed and tear it apart and everything, I can just go bam, find it, and get right to the heart of the matter. So it's bridging that gap. If you can get close, if you can get to the heart of it without having to hit the ping, great. But when you need that extra bit, without even having to think, without setting something up, you hit it, you find it right away, and you move on, and you then go doom scroll on the internet.
At least that's the way I'm thinking about it.
It's a great analogy, because you don't want your phone to ping all the time. You want it to ping at the time when you need it, because you can't find it. And the same thing here: you want to have these non-breaking breakpoints. How do you call them again?
Yeah, non-breaking breakpoints. Because they look and feel like breakpoints, but they don't break your app.
Yeah, so with non-breaking: we might have listeners on the call who are not familiar with what debugging really means, in the sense of a debugger, where you basically stop the runtime from executing, or you let it run step by step and you kind of hold it. And with this, you basically simulate what a breakpoint does in a debugger, where you capture the full stack frame with all the variables, but without actually halting the runtime; you're just collecting this information.
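As a toy illustration of what Andy just described, here is a Python sketch using the standard sys.settrace hook. This is not how Rookout or Dynatrace implement it (real agents instrument bytecode at a single line and add guardrails, which is what keeps the cost near a millisecond, while per-line tracing is slow); it only shows the core idea: capture the local variables and the stack at a chosen point without ever pausing the program. The checkout function and its shipping-fee logic are made up:

```python
import sys
import traceback

snapshots = []  # each entry is one "non-breaking breakpoint" hit

def set_non_breaking_breakpoint(func_name):
    """Snapshot locals and the stack whenever func_name returns,
    without pausing the program (toy sketch only)."""
    def tracer(frame, event, arg):
        if event == "return" and frame.f_code.co_name == func_name:
            snapshots.append({
                "locals": dict(frame.f_locals),          # variable values
                "stack": traceback.format_stack(frame),  # call stack text
            })
        return tracer
    sys.settrace(tracer)

def checkout(items):
    subtotal = sum(items)
    total = subtotal + 5   # made-up flat shipping fee
    return total

set_non_breaking_breakpoint("checkout")
result = checkout([10, 20, 30])
sys.settrace(None)  # "remove" the breakpoint

# The program never stopped, yet we have a debugger-style snapshot:
print(result)                              # 65
print(snapshots[0]["locals"]["subtotal"])  # 60
```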
But that's great, right? I mean, and I guess, Liran, maybe you can fill me in a little bit here, because you mentioned earlier, right, observability is expensive. If you would capture non-breaking breakpoints all the time, I guess we would have a problem. First of all, we would have too much data that nobody cares about. Also, it would probably cost a lot to capture and store it. So how do you solve this problem? And by the way, how do developers then actually use this technology? When do they turn it on? How long do they turn it on for? Or how does the system, what did you build, make sure that you're capturing enough information without capturing too much information?
So I would mention that snapshots are truly cheap if you're using a great tool to collect them. They might be slightly more expensive than logs because they're so informative.
And I think in one of our previous media discussions at Rookout, I kind of tried to outline the concept that snapshots are worth a thousand log lines, because they're so much more detailed. Instead of having to write out your variables one by one, stringify them, figure out how to represent them, a snapshot would capture the entire state of the application very, very accurately, keeping type information, keeping all the minutiae details that are going to make the difference between fixing a bug and not fixing a bug.
And that's super easy to get. And those breakpoints are really, really cheap: on the order of a millisecond or so, depending on the exact runtime. And using Rookout or similar tools, you can just, with a click of a button, set one on any line you want and instantly get it applied. You can apply it to a single server you're interested in, or you can apply it to a whole fleet of servers you are wondering what's happening in. You can use
conditional breakpoints to filter out a specific user or a specific case you're interested in.
You can even connect those non-breaking breakpoints to various automation workflows, so that whenever something goes wrong, whenever latency goes up, whenever you have an exception thrown, whenever something you're interested in happens, you can instantly set a breakpoint there. So by the time you would actually get to look at it, you would have a whole lot more context than what traditional observability tools that don't go as deep can provide you.
So essentially, you get an alert.
By the time the alert is there,
you actually walk up to your laptop and see the alert.
You've got a whole slew of additional context
that's going to make it so much easier
and take so much of the guesswork away
from the triaging process.
And sometimes, by the way, breakpoints are much longer-lived. Maybe your release cycle is only once every two weeks, and you are worried about something and you want to throw in an extra log line. So you can say: I want a new log line on this line. You can even add a condition to it: the next time this variable is over 50k, is longer than 50k, send me a message to Slack. And you can instantly do that. And you don't have to do an emergency patch. You don't have to release a new version. You don't have to wait for the next release. You can instantly do that on the fly without worrying about it. You can add new metrics if you're trying to measure something for performance.
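The condition-plus-action idea can be sketched as follows. The Probe class and notify_slack function are invented for illustration and are not Rookout's or Dynatrace's API; a real tool would inject the equivalent of probe.hit() into the running process at the chosen line, with no code change or redeploy:

```python
class Probe:
    """A conditional non-breaking breakpoint with an attached action
    (hypothetical sketch; not a real product API)."""

    def __init__(self, condition, action):
        self.condition = condition   # e.g. "this variable is over 50k"
        self.action = action         # e.g. send a Slack message
        self.fired = []              # record of actions taken

    def hit(self, local_vars):
        """Conceptually invoked each time the instrumented line runs."""
        if self.condition(local_vars):
            self.fired.append(self.action(local_vars))

# "The next time this variable is longer than 50k, message me on Slack."
def notify_slack(local_vars):  # stand-in for a real webhook call
    return f"payload was {len(local_vars['payload'])} bytes"

probe = Probe(condition=lambda v: len(v["payload"]) > 50_000,
              action=notify_slack)

probe.hit({"payload": b"x" * 10})       # condition false: no action
probe.hit({"payload": b"x" * 60_000})   # condition true: action fires
print(probe.fired)  # ['payload was 60000 bytes']
```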
You can collect more data and you can really adapt to your needs. Some customers use it to
debug remote environments. Maybe you have applications deployed with your end customers
and the customer has to be notified or install the patch. Maybe you're using an environment that
has a downtime whenever it's
being upgraded and you don't want to
incur additional downtime.
You can set those breakpoints for
weeks or months or however long you
need to.
And you would often find, you would almost always find, that adding a breakpoint is so much cheaper, easier, and almost risk-free.
You know, I had my own podcast up until... Actually, I had my own podcast until my first was born, which is now almost 18 months.
So it's been a while.
But in one of the first episodes I recorded, there was this guy from a storage security company.
And he mentioned that early in the days,
they actually released a version.
It wasn't an emergency patch, by the way.
It was a major version of the product.
And somebody added a log there.
And that log crashed the system repeatedly.
And they had to back out that major release
and emergency release fix,
all because of an improper log line. If you think about it, at the end of the day, a log line or a metric or any change you make, as small as it may be, is code. And any code, any change, carries risk.
And part of the promise of those observability tools for developers, and snapshots and so on and so forth, is that we take away much of the risk. We provide very significant guardrails to make sure that no matter what you do, whether you access uninitialized variables, you try to print out something that's too big, or you try to inadvertently access the database, whatever you're about to do that's stupid, that's inappropriate, that had better not be happening, we have the guardrails in place to ensure that it won't happen, that you won't be risking the integrity of your product, the integrity of your service, just for that log line. Because 10 out of 10 times you would prefer the service to keep running and the observability data to be missing, especially if you provide a clear indication that you're not getting the data, rather than take down the service in the name of getting that extra log line.
I think this was a use case.
Go ahead, Brian.
I was thinking about
especially with that last use case,
I had this thought in my head
and I think that last use case about the log line really solidified it.
If we go back to when DevOps started, right?
And then as containers came in and as Kubernetes was coming in more
and there was a shift of putting more and more work on the developers.
Set your ingress points, set your network routing, do everything as code, do observability
as code, and developers are going to do it all.
When the job of the developer is to write good code to begin with and to fix bad code
when it gets discovered.
But now there's this whole idea of learning all these other tasks,
learning, oh, you know, I have to write all these additional logs in there,
and what's going to happen if I do that because I'm now tasked with this?
And recently we've seen, with the rise of platform engineering,
there's been a turnaround from that to say,
hey, maybe we shouldn't put this all on the developers.
We'll have special teams that will take care of the platform
so the developer's not defining what container they're running it in or what size
JVM. It's going to be somewhat opinionated in some ways.
But what you're describing, I think, takes it even further
because we're removing more of it from the developer to have to do observability
and the debug side of it. But then when a problem does occur,
they don't have to spend as much time
trying to figure out what happened.
I don't care what observability tool you have.
You know, developer has an issue.
They're going to have to start digging and diving
and they may get close.
You know, before we were saying they might get close.
They might have to dig deeper.
They may or may not have been trained on the tool, right?
But the idea of removing the barrier for the developer
to get to that answer as quickly as possible with the least amount of friction, with the
least amount of thinking ahead of time, oh, I have to capture this method argument.
If you can just turn something on, it's going to capture this stuff.
It's pulling back the constraints or the burden we put on the developers as DevOps came in,
and we're bringing it back so the developers can really just focus on writing good code
and then fixing code when they need to.
And that's what they excel at, and that's what's probably going to make them happiest.
And then the happier your developer is, the better everything is, and the whole world
becomes a shiny, happy place, right?
But I think it's a really important adjustment that we're going through on this side now.
And it sounds like this is going to just make it a lot easier for those teams to execute.
So I agree.
I think the whole shift left is super important. On the one hand, as you said, it's a big promise. It's about: can developers truly own everything, and can developers truly be responsible for everything?
And what it depends on is about powerful tooling
and useful abstraction.
Developers can't know everything,
but if you provide them an easy enough approach
to observability
that's closely enough related to their day-to-day,
they'll be able to grasp it.
If you provide them with easy enough access to security, at least with good guidance and very simple action items, they can grasp it.
And at the same time, it's also important to note
that we are sparing developers a lot of work they used to do a couple of decades ago.
Most developers today don't worry about memory management.
They don't worry about allocating and freeing memory because the modern runtimes do it for them.
They don't worry so much about compiling and linking and dependency management, again, because good runtimes take care of a lot of the heavy lifting
and a lot of the stuff
that we can abstract from them.
And this frees up their memory, their CPU, their minds, to deal with higher-abstraction problems. But at the end of the day, it's a very small capacity that they're getting, and we need to use it very wisely.
And so if we want developers to own production,
to relate to production,
to understand what's happening in production,
we need to provide them great observability tools
that speak their language,
that are easy for them to grasp,
and then they will be more than happy to adopt them
and take part of that.
And we've seen this to be truly transformative
for organizations.
When developers
are no longer disconnected
from production,
when they are empowered
to understand
how their code
is behaving in production,
it's super motivating for them
and it can have a huge impact
on quality and velocity
and so on.
I want to also just add
my two sentences to this
because I remember many, many occasions
where a problem happens, and what do you do?
You need more log lines, so you add two more log lines, you run it again, you don't get the logs that you expected, so you add more.
Five iterations, ten iterations later, you have more code that creates logs than actual code doing business logic.
And I think this alone is, for me, an amazing selling point: especially from a troubleshooting perspective, I don't need to modify my code to get more of the, let's say, traditional observability signals that I need to diagnose the problem, because I can just treat it as if I were sitting there locally with my debugger attached.
And Brian, I think this is the same generational shift that we've seen when we introduced real user monitoring. Because with
real user monitoring and session replay, all of a sudden, we could see everything that happens in the
browser for every single user, every single line of JavaScript that was executed.
And we could, as a developer, right, there was no need anymore to say, let me walk over
to the end user, and then let's turn on the developer tools in the browser, and it can
give me all the data.
With real user monitoring and session replay, we all of a sudden got this data. And it feels like what you are telling us here with observability
for developers, the non-breaking breakpoints and the snapshotting technology, this is exactly
what we give developers that are working on microservices, on any type of code that runs
in application servers, wherever it runs, to
get all this data that they need at their fingertips without having to go through an
additional hoop like rebuilding the code with additional logs and then redeploying it and
then hoping that the error happens again.
Because this is another thing, right?
Many times when you then modified the code and added more logs
then you may have changed the timing behavior
and all of a sudden with the race condition
you had a different timing behavior
of your code and all of a sudden this problem didn't happen
anymore, a different problem came up.
So there are so many things here, and it's really great to hear about this technology that you guys have built.
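To make the idea concrete, here is a minimal sketch of what a non-breaking breakpoint conceptually does. This is purely illustrative, not Rookout's actual implementation: it uses Python's `sys.settrace` hook to copy the local variables when a chosen line is about to execute, and then lets the program keep running instead of pausing, so there is no need to add a log line and redeploy. The function and variable names are made up for the example.

```python
import sys
import copy

snapshots = []  # hits collected by the "non-breaking breakpoint"

def make_tracer(func_name, lineno):
    """Return a trace function that snapshots local variables
    whenever `func_name` is about to execute line `lineno`."""
    def tracer(frame, event, arg):
        if (event == "line"
                and frame.f_code.co_name == func_name
                and frame.f_lineno == lineno):
            # Copy the state and move on -- the program never pauses.
            snapshots.append({
                "function": func_name,
                "line": lineno,
                "locals": {k: copy.copy(v) for k, v in frame.f_locals.items()},
            })
        return tracer
    return tracer

def business_logic(order_total):
    discount = 0.1 if order_total > 100 else 0.0
    final = order_total * (1 - discount)  # <- snapshot just before this line runs
    return final

# "Set a breakpoint" on the line computing `final` (def line + 2).
target = business_logic.__code__.co_firstlineno + 2
sys.settrace(make_tracer("business_logic", target))
result = business_logic(250)
sys.settrace(None)

print(result)                  # 225.0 -- the code ran to completion
print(snapshots[0]["locals"])  # {'order_total': 250, 'discount': 0.1}
```

A production-grade agent would of course instrument bytecode rather than use the (slow) tracing hook, but the contract is the same: capture state without stopping execution.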
Liran, I also...
It's funny that you mention that
because joining Dynatrace,
I found under the hood
so many amazing observability capabilities,
observability technologies,
and some of them are not well-known
but very, very tailored for developers.
I think today Dynatrace has maybe the best memory performance profiler I've seen anywhere.
It provides amazing granularity for memory allocations and GC stops and so on and so forth.
There are also a lot of profiling capabilities: continuous profiling for CPU, thread profiling for identifying locks, even heap dumps that can be used to troubleshoot a variety of issues.
And all of those features are, you know, super important, super powerful, and they're in there.
And part of the things that we're very excited about as we're joining Dynatrace is seeing how we can make developers out there more aware of everything Dynatrace has to offer them, because there are so many useful capabilities out there, and I don't always feel that they are appreciated as much as they should be.
Yeah, but we are trying to do our best to surface them, right? And
I know I had a session with Arden a couple of weeks ago, where we did a YouTube video on developer observability meets app observability. And we talked about what, quote-unquote, traditional observability brings in and what developer observability brings in. Also, just to give a little heads-up, or like a forward-looking
statement: our conference, Perform, is coming up. It's going to be the last week of January, first week of February, where we all gather in Vegas. And there's going to be a big focus
obviously also on these use cases
that empower developers
to do the job better and easier.
So folks, if you're listening in and if you're still
contemplating on whether you want to
join us in Vegas,
you should because there's a lot of cool use
cases.
Yeah, exactly.
And you can find Andy and have a drink and celebrate my
50th birthday on the 29th of January.
There you go. So now everyone knows my birthday.
I'm giving out security information.
But yeah.
Also, if everything else fails, you can always join us online.
But still do come to Vegas.
It's more fun.
Yeah.
Liren, I want to not only talk about the blue skies,
I also want to talk about some of the challenging questions
that you sometimes get,
because we always get these challenging questions.
We've been in the observability space
and the topic of overhead always comes up,
the topic of who should be able,
what type of data are you really capturing
and who is then going to see this data?
Talking about data privacy, talking about security.
I mean, there's like so many questions that always come up.
Can you just glance over maybe some of these topics,
let's say the challenging questions that you sometimes face?
So I would say, as I mentioned,
an individual snapshot is roughly one millisecond.
Obviously, it depends on the size of the snapshot
and the runtime and so on and so forth, but
that's what you can expect. It's very negligible,
especially if you...
It's not a tool meant to capture a thousand snapshots on every request; it's a tool meant to capture a handful of snapshots when you need them.
If you think about your P95s, your P99s, they're not going to be affected in any way if you spend a couple of milliseconds capturing snapshots here and there.
Obviously, if you just want to inject a log line or a metric,
that's going to be even cheaper, way, way, way cheaper.
We also have a variety of safeties, at the individual breakpoint level and at the global level.
We cap the CPU we incur, even in the worst case, but for the most part you would see that we never get anywhere near those caps with the default limits.
The average engineer captures a bunch of snapshots, the breakpoint turns off, and nothing really happens.
And I think most customers won't even see a 2% or 3% CPU increase, and definitely nothing on the latency.
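As an illustration of the kind of safety valve being described (the specific limits and names here are hypothetical, not Rookout's actual implementation), each breakpoint can carry a hit quota plus a capture-time budget, and once either is exhausted the breakpoint simply stops collecting rather than slowing the request path:

```python
import time

class SnapshotBudget:
    """Illustrative safety limits for a debugging breakpoint:
    a per-breakpoint hit quota plus a capture-time budget."""

    def __init__(self, max_hits=10, max_capture_seconds=0.05):
        self.max_hits = max_hits
        self.max_capture_seconds = max_capture_seconds
        self.hits = 0
        self.spent = 0.0  # total seconds spent capturing so far

    def try_capture(self, capture_fn):
        # The breakpoint "turns off" once a limit is reached:
        # no error, no added latency, it just stops collecting.
        if self.hits >= self.max_hits or self.spent >= self.max_capture_seconds:
            return None
        start = time.perf_counter()
        snapshot = capture_fn()
        self.spent += time.perf_counter() - start
        self.hits += 1
        return snapshot

budget = SnapshotBudget(max_hits=3)
collected = [budget.try_capture(lambda: {"n": i}) for i in range(10)]
print([s for s in collected if s is not None])  # only the first 3 hits survive
```

The design choice worth noting is that the limit check happens before any work is done, so once the quota is spent the per-request cost is a couple of integer comparisons.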
Other than that,
around security and privacy,
I think the most important thing to realize
is that this access is needed.
And if you're not using a good tool to give that access, an observability tool that allows you to put a lot of policies in place, which we'll discuss in a second, then something worse is going to happen.
Either the bug won't get fixed
or the problem won't get resolved.
Or, what will probably happen is that engineers will spend their time trying to outsmart and outmaneuver the system, and they're going to end up choosing worse options. They're going to sit down
with an ops guy or an IT guy and SSH into the system. They're going to sit down with the database
administrator and start querying their raw records.
And they're going to figure out what they need because at the end of the day,
the business, the operations need to get the data
because they need to fix something.
And if you think about it,
then the risks going down those routes
are way, way, way bigger.
You're essentially punching much bigger holes
into your security and compliance posture.
You are having much less control.
And that's how things have been done so far.
Let's not kid ourselves.
That's the alternative.
Using Rookout or similar tools, you can assign tons and tons of policies, starting with SSO integration and role-based access control.
You can decide who gets access to what.
You can add a bunch of data masking rules, you can control data governance, decide exactly where the data is going to be stored.
You can control data retention.
And obviously you have audit logs
on top of everything else,
so you know exactly what happened and why.
And at the end of the day,
if this is about servicing your application
and ensuring it's running optimally as it should,
it's part of your day-to-day operations,
and you've put in place all the guardrails and safeties which we provide,
that's the best way to ensure not only compliance,
but also resilience as a whole.
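Purely as a sketch of the policy layer being described here (the rule patterns and function names are made up for illustration, not Rookout's configuration format), a data masking pass over a captured snapshot might look like:

```python
import re

# Hypothetical masking rules: variable-name patterns whose values
# must never leave the process unredacted.
MASK_PATTERNS = [re.compile(p, re.IGNORECASE)
                 for p in (r"password", r"ssn", r"card")]

def mask_snapshot(variables):
    """Return a copy of captured variables with sensitive values redacted."""
    masked = {}
    for name, value in variables.items():
        if any(p.search(name) for p in MASK_PATTERNS):
            masked[name] = "***REDACTED***"
        else:
            masked[name] = value
    return masked

snapshot = {"user_id": 42, "card_number": "4111-1111-1111-1111", "total": 99.5}
print(mask_snapshot(snapshot))
# {'user_id': 42, 'card_number': '***REDACTED***', 'total': 99.5}
```

The important property is that redaction happens before the snapshot is stored or transmitted, so the sensitive value never reaches the observability backend, which is what makes the retention and audit-log policies mentioned above enforceable.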
Cool. Thanks for those answers. I mean, we've been facing questions like this over the past 15 years since we've been living in the observability world.
And I think you bring up really good points that in the end, what really matters is that you can fix your system challenges
as fast as possible.
And you've obviously thought about everything
that you need to think about when you design a system
like what you've built to make sure that misuse can be avoided,
that you have all the guardrails in place, and that everything is auditable.
And obviously there were adopters of that technology before you joined the Dynatrace family. And it's just great to hear. And I'm really excited to see how much easier the life of our users will be, especially as they are trying to, like Brian said, find that phone in the messy bedroom.
They just want to find that messy bug somewhere
under a lot of dirty lines of code,
and then it's going to be easy to spot it.
Forward-looking, is there anything...
I'm not sure how much you can obviously say
because these are always things
when we talk about looking to the future.
But I assume there's still a lot of ideas
on what other things can happen
and what other things observability providers
like we as Dynatrace can do for developers in the future,
even beyond what you have already built.
I assume you have a long list of things
that you would like to include.
And as I said, I know it's going to be challenging
to talk maybe about some roadmap items,
but if there's anything you can say,
even if it's just, hey, sure,
there's a lot of stuff coming, be prepared.
So definitely, there's a lot of stuff coming. Be prepared.
I would say that's part of the joy of joining such a huge platform, with so much technical excellence, as Dynatrace:
there are so many opportunities out there.
There are so many observability signals lying there.
There are so many capabilities
within the storage engine,
the query engine, you know,
and everything else that's going on
that the options are truly limitless for building amazing, valuable applications, features that can cater to developers in a way that, I think, isn't seen almost anywhere in the market today.
And the combination
of those capabilities in a single platform
can
lead to an amazing user experience
and amazing automation
and workflow capabilities.
One thing we'll be focusing on for the next year, on top of everything around Rookout, is the IDE integrations.
Some of that exists from pre-acquisition, and we're working on it right now, but it's super important for us to have Rookout available, to have Dynatrace capabilities available, in all the common IDEs.
And for us, it's super important to bring as many observability capabilities, or even more so, as many observability insights as possible, very close to the engineers, in their IDEs, in a contextual manner: making it a seamless part of the development experience, a seamless part of the troubleshooting experience, having observability data and insights at their fingertips, having the observability platform push insights proactively to developers in a contextual manner, so that they will benefit from observability.
Because I think one of the things you will often see is that the observability-savvy engineers, the production-savvy engineers, those who are not just about coding, are often the most senior.
They know their way around the cloud, they know their way around a Linux machine, they know their way around Kubernetes and containers and observability, and they can get tons of stuff done, and they are very good at diving into the nitty-gritty details.
But not everybody has had the time to dive into those areas.
Some people have other areas of expertise, some people have not spent as long in the company and are not as familiar with all the tech stack and all the tools.
And it's a huge burden.
You find that some people are enabled and can find their way in production, and even pre-production is often very complex, while most people just struggle to deal with these super complicated tech stacks and everything you need to know.
And that's where observability can really be a differentiator.
It can be a turning point.
It can make everything so much easier by making things accessible to the average developer, without having to go through training on each and every tool and understanding every aspect of the tech stack, just getting easy-to-digest insights from observability.
It can be a game changer
for so many developers.
Yeah, I'm looking forward to, as you said, the combination of our technologies, where Davis is detecting a hotspot area in your application landscape and then triggering the snapshots.
And then as a developer, I don't necessarily need to react to a message that brings me into a dashboard with metrics and logs; instead, it brings me into my IDE, where my debugger runs.
So it looks like a debugger, but it's actually analyzing all these snapshots.
That's a really nice way of thinking about it, and about how these two worlds can benefit from each other.
Yeah, that's pretty cool.
And I'm in a bit of an analogy type of day today,
so I'll leave it with one last analogy.
How often do you find yourself watching a movie and going, oh my gosh, that's that guy, right? And you want to look up who that actor is, because you recognize them from all this stuff, or actress, whoever it is.
Well, if you go back to, let's say, the 90s, you'd have to get off the couch, go to your computer, connect your modem, and then maybe go on Usenet or some group and hope somebody had a catalog.
And there are three parts to this.
Part two, now, is you either have your phone or your laptop, which is already connected, so it's really easy: you pull up IMDb or Wikipedia, they have the cast right there at the top, and you can easily find it.
In scenario one, you're probably not going to do it. And I should take a step back.
The reason why this is important is for developer observability to succeed, right?
It's got to be easy. It's got to give them the relevant data, right?
So the modem, not easy. It's going to be hard to find the data.
Your phone, go into IMDB. It's going to be very easy. But then if you
take a look at what Amazon does on Amazon Prime, if you
pause that video, it's going to show you all the actors in the scene and you can just move up to
them and bam, get the bio right there. I think that is the key
for developer adoption because if they've got to work hard for it, they're probably
going to be less likely to do it. And if the information is not relevant, again, they won't. But we want the developers to adopt this more, right?
And I think that's going to be the key towards, you know, whatever tooling, you know, even
beyond what Rookout's doing, beyond what we're doing, whatever these tasks are that we need
the developers to undertake, it's got to be made easy and very, very relevant.
And then we'll start seeing the fruits of that labor, I think.
A cool analogy. Definitely.
All right, Liran, thank you so much for
spending an hour with us
and inspiring us and
enlightening us and educating us on
what developer observability really is
all about. I think I now understand better
on why developers need a different type of observability than the one we have always been talking about.
And yeah, I'm very much looking forward
to seeing you in Vegas.
And I'm very much looking forward
to having your technology
part of the Dynatrace technology.
And overall, whether it's Dynatrace
or we always look beyond Dynatrace,
I think these discussions
should just inspire people
with new ideas on what's possible
because we should not be comfortable with the status quo.
Definitely.
Looking forward to meeting you all, and thanks for having me on the show.
Thank you.
And thanks to all our listeners.
Hope this was helpful.
Till next time.