PurePerformance - Unlocking the Power of Observability: Engineering Practices for Success with Toli Apostolidis
Episode Date: July 17, 2023
Are you frustrated with your team's ability to troubleshoot issues in production despite their proficiency in pushing out new builds? The root of this problem may lie in the absence of Observability Driven Development. In our latest episode we are joined by Apostolis Apostolidis (also known as Toli) who - as Head of Engineering Practices at cinch - has spent his past years enabling teams to adopt the easiest path to value. He is passionate about DevOps and has a strong opinion on how to educate engineers on "Consciously Instrumenting Code for good Observability". Tune in to learn more about good engineering practices, building internal communities of practice, the benefits of traces over metrics and logs, and why we need to start adding observability to our CVs and LinkedIn profiles.
Here are all relevant links we discussed in this episode:
Toli's Website: https://www.toli.io/
Toli's LinkedIn Profile: https://www.linkedin.com/in/apostolosapostolidis/
Toli on Twitter: https://twitter.com/apostolis09/
WTF is SRE Talk on DevOps Meets Service Delivery: https://www.youtube.com/watch?v=nLrx0BCMl0Y
GOTO talk on EDA in Practice: https://www.youtube.com/watch?v=wM-dTroS0FA
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always I have with me my wonderful co-host,
Andy Grabner.
Hi Andy, how are you doing today?
Good, but you just told a lie.
Because not always am I with you when we record a podcast or the other way around.
You weren't around last time.
When I'm here, I always have my... I think there might have been one that I did without you.
But yes, last week I couldn't make it for various reasons.
But yeah, it's weird though. I've been trying to catch up on my sleep, Andy,
and you know what happened?
I've got another weird dream.
Another weird dream. Yeah. I went to my barber to get a haircut.
And my barber said, hey,
he went, the usual? Without even looking around or seeing what, you know,
everyone in the waiting room had, I just went in and said, yeah.
He got the buzzers out and went really, really close to my head.
And I was like, oh, this is too much.
This is too much.
And I look around and who's in the waiting room but you.
And you're there with a really short cropped hair as well.
And I'm like, Andy, I didn't realize what happened.
I didn't realize this was the usual.
And you said to me, well, you didn't look around.
You weren't observing what was going on.
I'm giving you a weird voice.
Now this is your Muppet voice today.
You weren't observing and looking around to see what happened.
So you got the wrong thing there.
And I guess I learned my lesson about observing when I first got into the place there and making it a best practice. Well, I think I will take your dream as life advice now
because something similar happened to me just recently,
funnily enough.
No way.
No way.
Yeah, exactly.
If people look close at the picture,
they see me with very short hair.
But still, it's summer here, so it's all good.
Brian, do you think we should keep talking about dreams
and haircuts, or shall we actually
try to get a little
of an additional opinion on observability,
on, I don't know,
developer experience, on platform
engineering, and also what's around DevOps?
What do you think?
I think if we have a guest who
obviously did look around
and noticed and didn't get the usual because he observed properly, I think that would be a good idea.
Perfect.
Maybe you could tell us a thing or two about that.
Maybe, yeah.
Well, with this, I would love to welcome to the show our guest of today, Apostolis Apostolidis, but to keep it short, just Toli.
Yeah, correct. That's it.
Awesome. Toli, thank you so much for being on the show.
The two of us, we met just a couple of weeks ago in London
for a conference we both presented in.
It was called WTF is SRE.
You had a talk with another colleague of yours,
a former colleague, I think, as I've learned,
you've just switched jobs.
But you had a talk about when DevOps meets service delivery,
which is a really great talk to watch,
which I have a couple of questions later on.
But for the audience, for our listeners,
can you quickly introduce yourself, who you are, what you do,
what gets you excited, what you're passionate about?
Yeah, well, thanks for having me. It was fun meeting you at the conference, because
you mentioned observability and my antennas went up and we couldn't stop talking
for a while. Interestingly, I did go for a haircut today as well, and I made sure that
the person who gave me a haircut last time told the person who gave me a haircut today to keep the numbers right.
So I'm right on.
So my experience is I studied maths and mathematical physics at university and did a master's and then didn't know what to do.
Wanted to do a PhD and ended up being hired as a mathematician
in a company that built an optimization engine. So I was hired to write algorithms in code
for my first seven years in software engineering. But as you know, with all these things,
the hard bits and the interesting bits are like five or 10% of the work. The rest of it is APIs
and websites and schemas and all the databases and all the rest.
So I learned a lot about software engineering there
and then went into an energy company.
And from then on, I kind of got involved into software
and started understanding what it takes
to be a software engineer.
But then I had my first outage
where direct customers were calling customer support
and started to realize, oh God,
I need to know what's happening.
But I couldn't know what's happening
because only the ops people had access to the logs.
I said, do you have logs?
Didn't know you had logs.
So then I started getting interested
in the DevOps movement quite late on,
maybe in 2018, 2017, and started reading up about observability. So then moved to a company called
Cinch in the UK. It's an online secondhand car platform, a new generation of car platforms like Carvana in the States.
And I spent about three and a half years there building teams out, enabling a DevOps mindset,
enabling an observability culture, and I learned a ton through that experience. And you'll probably hear me talking about DevOps and SRE, interestingly, and even the interaction between DevOps
and more traditional practices,
and also observability and event-driven architectures.
So thanks for the reminder also about the event-driven architecture.
So folks, if you're listening in, if you want to see Toli live on stage,
we will be posting a couple of links to talks you did besides WTF is SRE.
You also did the GOTO talk.
There's another YouTube video that you sent over.
And obviously you have your own website, toli.io,
where people can follow up on the stuff that you have created over the years.
I remember when we were sitting, so we were in London, we were in the speaker room.
We were sitting next to each other.
I think we were both preparing our slides and presentation.
And then we started to talk.
And you really said, yeah, observability.
That's a key topic.
And then when we followed up on that conversation, I said, hey, let's do a podcast together. You then came back with a couple of ideas on what you would like to discuss on the
podcast based on your experience. And now reading through that list of what you presented back to me,
what you would like to discuss, the first thing that actually stuck out to me was where you said, hey,
we should think about observability as we did with testing.
And I assume, and correct me if I'm wrong,
because it took us a long, long time from a testing perspective to educate developers
to test-first, test-driven development.
Is this what you have in mind with basically
how can we get an observability-driven development mindset
into engineers and how can we get this achieved?
Yeah, absolutely.
That was it.
So in my first job, I was a mathematician first
and a software engineer second.
We didn't have any testing.
It was service-oriented architecture.
So the architecture was really good.
The people I was working with were super intelligent
and super, super nice.
But we only had input XML in, XML out tests, end-to-end, pure end-to-end tests.
We didn't have any unit tests or any kind of other types of tests.
And that took time.
And at the time in 2011, 2012, that was kind of not standard that you didn't have tests,
but it was something that the industry was learning.
So it took a long time to get to the point where now you go to interviews and candidates are embarrassed to say that they
don't like TDD. And whether you like TDD or not, testing practices and testing techniques are
something that every software engineer should have in their capabilities, in their experience.
But what people don't have is observability.
And that's okay because observability has really exploded
within the software engineer role in the last three, four years at most.
So when we started out at Cinch hiring in 2019,
I would ask people, okay, so I would go through the whole software lifecycle.
How do you go from code on your computer
to code in production in front of customers?
And a lot of the candidates would be really explicit,
explaining how the whole process works, even CI/CD.
And the bit that was missing:
towards the end of the whole example, or the whole kind
of story, I asked, well, how do you know your code is working in production after that? And I often
got some very surprising answers for that. And that wasn't to catch them out, but more it was the first
step into persuading them, when they start at Cinch, that that's something
they need to think about, because I knew that a lot of people wouldn't be thinking about it.
And I set it out as a goal that when I finish my experience at Cinch,
I'll know people who have put observability on their CVs, observability tools, observability practices on their CVs,
so that the industry starts maturing in that.
And the reason for that is, I think,
it's more important that code works in production
than it works on your laptop or on a non-prod environment.
So observability, in a way, is more important
because you want to have ways to
understand whether the health of your business transactions is there or not. And that's where
you need to focus attention. But I'm not saying that observability replaces testing, but I think
they're very, very complementary, but we put a lot of energy on testing and not enough in observability.
I really like, I'm just taking some notes here.
I really like what you just said.
You said it matters more that your code runs in production
than it runs on your laptop, right?
Because in the end, that's really what matters
is in case somebody makes the decision
to actually push this into production.
Also, what I've noticed,
and maybe you can give me a little bit of feedback here.
I remember your presentation at the GOTO conference
where you talked about Cinch, right?
And how you decided to go with a serverless architecture
and you basically had your, I think,
six or seven different teams.
You talked about the search team
and the catalog team, the, what did you call it, the product catalog? Yeah, exactly. And so you had individual
serverless components, and you said some serverless functions were kind of more
like a service, like a virtual service with different features. Did you, when you set out to define those services, those serverless
functions, define the definition of healthy, the definition of how do I know it actually
runs successfully in production? Like the definition of done, a definition of is it
observable, and do I know what I expect from the system to be healthy in production?
I think early on, what we did was that we had this premise that teams are autonomous and they build, ship and support their systems.
What that did was that made them think about the support part.
And, you know, build and ship is something that we're getting better and better at and we're quite mature, but the supporting aspect is quite immature and even hard to do.
So I think starting from that premise, we were able to empower the teams, empower the software developers to start thinking,
okay, so if I've got a search service, how do I know that the search service is returning results that's benefiting the customer?
How do I know that overall we are mostly returning results?
Or how do I know if I've got too many scenarios where we're not returning any results? So once they start asking themselves those questions,
they weren't set from outside,
they start exploring, okay, so what tools do I have?
So we give them the tools.
We give them a single tool, single observability tool,
which is their platform of understanding.
So you have your hosting platform,
which is your cloud provider.
In our case, it was AWS.
But then you have your understanding platform,
which is your observability tool.
And that's where you go and start exploring,
how do I know whether this search service works?
You start learning about instrumentation.
You start custom instrumentation.
You start learning about the various telemetry data types.
And you start learning how to be curious.
We can't tell them what to measure
apart from them talking to the business or talking to,
business is not a great term, but talking to the product owners,
talking to the stakeholders to understand what's important for them.
But I think later on when we got more mature, as you saw in the talk at the conference we were at, we started becoming
a bit more systematic about when we're launching a new service, one of the checklist points was
have you done your observability due diligence? But early on, it was all about how can we persuade engineers to be curious?
How can we help them and teach them and learn together how to use,
how to instrument their code?
And one of the big decisions I think that worked for us
and I think is really, really important, is that we planted a software engineer
in each team whose task it was to learn and to enable the others to learn. So you would have your
normal tech lead or team lead, but you would also have something called an automation engineer,
which we had at the time. And they would be there saying, hey guys, what about, how will we know this will work in production?
And they'd be like, oh, well, I don't know.
And then what we suggested, well, maybe try, let's trace it.
Let's add a custom tag.
This is how you start creating a dashboard.
This is how you explore this data.
And they're there day to day.
And I think that was a catalyst.
Do you think, and I remember this was the presentation, I think, that you gave in London
with your automation engineer in the middle, and then you had, I think you had
different layers of engineers, but you called it an automation engineer.
Would you explain that to me, though? For me, it almost sounds like this is kind of the definition right now for an SRE.
Isn't that what it is or not?
Yeah, so the title of that role is a bit unfortunate
because we couldn't find a better role.
Initially, it was DevOps engineer.
We didn't want to call it DevOps engineer.
So we shot ourselves in the foot a bit by calling it that
because it was hard to hire externally
because everyone thought it was a test automation role.
But once you were in the company, we actually hired within the company quite a bit because
it's a role that learns a lot, but their scope is infrastructure as code, is CI/CD pipelines,
and observability and monitoring.
What we found, actually,
is that they focused a lot on observability and monitoring because that's where the gaps were.
Most software engineers we were hiring
kind of knew how to do infrastructure as code,
knew how to do pipelines,
but they didn't really know how to do observability.
So that's where I think they focused most of their attention.
But yeah, it's probably very similar to an SRE plus the rest,
so plus the other parts of the stack.
Yeah, and the reason why I bring it up: in the end,
we know that the titles matter for the outside world to understand or have an idea of what this is.
In the end, whether you call it test automation, sorry, automation engineer or SRE or DevOps, it doesn't matter if they're all doing the same thing.
That's why I'm just asking, because people have a certain assumption. As you said, if you put test automation engineer on that, maybe the majority of people think about test automation, and maybe
that's something that they don't want. But if you use a term that the industry
has already coined, right, and say an SRE is somebody that is focusing on observability,
because observability allows them to build a system that is reliable and resilient, because you can
then use the observability data to trigger and automate your runbooks and things like this.
I was just curious.
I think it's a very interesting topic, though, because I think in that scenario,
we didn't want to hire X infrastructure engineers or we didn't want to hire
X security engineers or X ops people.
We actually wanted to hire software engineers
with some experience with coding and some experience with testing
and that kind of things,
so that they can learn all the practices that are needed
for site reliability.
And ultimately, they can persuade their peers, not because they're experienced or they're more senior,
because they weren't,
but because they can understand and write code
and they have similar practices.
And so that's the angle we wanted to go with.
Now, coming back to teaching engineers how to instrument their
code, how to make it observable: a big challenge, I think, Brian, that the two of us have seen over
the years, especially going all the way back to when we started working for our company.
We had a feature in the product and it did auto instrumentation, but we had a feature.
We called it shotgun instrumentation, which meant you could instrument everything.
It was awesome, right?
For developers.
Turned into a profiler, yeah.
Yeah, turned into a profiler.
And that was 15 years ago, right?
15 years ago, we were able to do distributed tracing with every method on it that you can
think of.
Obviously, with the drawback that you're collecting a ton of data that only a few
people really need, and if this change then makes it into higher environments,
you have a lot of overhead.
So I think the challenging thing what I see right now is with the big hype around
open telemetry and open observability, it's great that we have these standards.
But I think the challenge still is
that we need to teach people, I guess, what level of instrumentation really makes sense so that
you're not just collecting data because you can collect data, but you really collected the data
that then helps you to then make the right call in case there's an outage, in case there's a problem,
or in case to detect a bug in the CI system
and you want to still have enough data.
So do you have any experience or any suggestions
on how we can actually teach these best practices
on how and what to instrument and what type of data we need?
Because there's different ways how we collect observability data, right?
Yeah, absolutely. It's a very interesting topic. Every CTO or every engineering director you might
speak to, they'll ask you, well, if we ingest everything, then how are we going to kind of
keep a cap on the bill, basically?
Because we know that observability is very expensive.
At Cinch, interestingly, observing the system was more expensive than hosting it
because it was entirely serverless.
It was a lot cheaper to host it; the observability tool was way more expensive.
And the reason for that is because we took a conscious decision
that we're going to ingest everything and index everything liberally, so that we can have a higher chance that the teams and the software engineers
will adopt the observability practice.
So what I mean by this is, if as a software engineer, you want to understand how your
system is behaving, and you want to answer a a question and you go and look at your observability tool,
observability platform, your understanding platform,
and you don't see, you don't get any insights
first time, second time, third time,
you won't use it again.
So the angle is as a company
that wants to understand the business transactions,
you want to actually ingest everything
because you want to maximize the potential
for understanding something.
But then you can't really ingest everything.
So the shotgun option,
I was actually thinking about that before this call.
That's the extreme.
But then, as you say, the volume would be too high.
I think the middle ground is thinking about it as constructing a data set.
So while you're writing code, you've got three things to think about.
Write the actual code, write your tests, and instrument your code with custom tags.
And you have to think of those three, and you can think of them in any, in any kind of order you want.
It doesn't matter.
Like the TDD,
the TDD aficionados can do testing first.
I don't care.
But what I do care is that you start thinking about,
okay,
my code now is creating an order.
What do I do in this case?
I want to,
have I added a custom tag of order ID?
If the order
is created successfully, have I
added created true, for example?
So my angle would be
teaching
people to
instrument their code
consciously and thinking about
what would be useful
on the other side.
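To make that concrete, here is a minimal sketch of what consciously instrumenting an order-creation path with custom tags could look like, assuming the OpenTelemetry Python API; the service, function, and attribute names (orders-service, order.id, order.created) are illustrative, not taken from the episode:

```python
from dataclasses import dataclass
from opentelemetry import trace

tracer = trace.get_tracer("orders-service")  # illustrative instrumentation scope

@dataclass
class Order:
    id: str
    ok: bool

def save_order(item_count: int) -> Order:
    # Stand-in for the real persistence call.
    return Order(id="ord-123", ok=True)

def create_order(item_count: int) -> Order:
    # Auto-instrumentation already gives you HTTP/DB spans; this adds the
    # business context you will want to query on later.
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("cart.item_count", item_count)
        order = save_order(item_count)
        span.set_attribute("order.id", order.id)       # high-cardinality, but queryable later
        span.set_attribute("order.created", order.ok)  # did the order actually get created?
        return order

if __name__ == "__main__":
    create_order(3)
```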
More often than not, I get the answer, well, this might not be useful, so I'm not going to add it. And my answer always is:
if it's potentially useful, if it's, as a lot of people say, high cardinality, and
you end up with a high number of dimensions, then you're increasing the
likelihood that you'll be able to look through your data
and understand what's happening when you need to.
You don't know what that is now,
but you will know at some point.
So the other thing I'd probably add to that
is that you can't just teach.
I can't just go and teach everyone.
We had 20 teams, for example, at Cinch.
You can't teach everyone.
You have to start building a community of interest
around these kind of things.
People need to learn from each other.
So practice only evolves.
It's a complex practice.
There's a lot going on.
There's no standards as such.
I mean, there is open telemetry
and there is some guidance online.
It's a growing field,
but people need to learn from other people and that's where
they learn the best. I'm taking a lot of notes here, first of all for my own
benefit for later on, because this is what I enjoy so much about the podcast, that we have interesting
guests with different backgrounds and different experiences. Also for the summary of the podcast that we're writing. What you're saying was
interesting: you know, teach people how to instrument code consciously, making them
think about the other side, like, well, how can this data potentially be useful? And if
you don't have a clear answer of no, this will never be useful, then it's probably something good to include. That was why I'm wondering,
should we also think about the process if, let's say, we put instrumentation in,
but then we monitor access to that monitoring data? We monitor who is actually using a certain piece of
data. And if nobody ever uses it within, let's say, a month, a quarter, a year, then
we could kind of flag instrumentation data that nobody has ever used.
Never put it on a dashboard, never created an alert on it, never fetched that log with
that particular piece.
Is this also a practice
we should kind of, you know, adopt, to rethink and update your instrumentation based on usage?
I'm a big proponent of the more the vendors can do for us, the better.
So we pay a lot of money to the vendors, but that's for a good reason because they are
the differentiator, that's what they do
well. I've not really used an open source stack for observability, but I do put a lot of trust in
observability vendors and I'm happy to pay them because that's not the differentiator for most companies. And in that sense, I would say that when it comes to
deciding
things like,
is this useful anymore?
Or, yeah, absolutely.
If you can flag up these things
and then I can choose what to do,
it'd be really, really useful.
So whatever you can do
to help the UX of the developer,
then great.
I'd say what's interesting is that even a year later,
you might need that tag.
It's your decision to decide
whether that piece of data is useful or not.
But flagging it to me will be super useful
because you can start clearing up things. But I think more
important than that is the sampling and the indexing and all of that space, because
that's where the billing, that's where you can make any
difference to billing and to bills. So if you take the stance that, as a user
of an observability platform, I want to have confidence that I'll find what I'm looking for,
and I don't want to be thinking about indexing or anything around sampling. I want to think the minimum I can
around that. What I want to think about is understanding what my system is doing. So I think there's
a lot that the vendors can do to help with that.
Yeah, maybe Andy should explain what we're doing on that off the record. Yeah, I think we've been in this space for a while
and so have our competitors.
I think over the years,
I think we got much better in hiding the complexity away
because that's, as you said earlier,
you're exactly paying for that service
because you don't want to necessarily think about all these
things yourself when you ingest data, how you store it, how you index it, how you give
people access to this stuff, if you can afford it, obviously, right? If this is a value for
you that a commercial observability platform would give you.
Yeah, I'd want to say on that side too, before people, I think it's definitely important to,
just like clean code, clean observability is important, right?
We want to make sure, like any vendor or even OpenTelemetry,
they're going to be observing the universal parts of traces.
Service hops, database, things, right?
Any custom code,
no one's going to know what we need to do.
So that's where the developers
might be adding in the additional components, right?
But when you start talking about then
over time removing things,
depending on what your viewing platform is.
So in our case, obviously the data,
you know, the ingest backend
with Davis and all that
could be whoever's platform.
They may have, like we do, some sort of AI or some other
assisted analysis that's taking a bunch of inputs. What's going to
be important for you when you are cleaning up is to understand
how all the data is being used by that system. Because some of the data
you're collecting might not be something you as the human are looking into,
but it's the machine that's observing it,
taking it into account.
And if you just start chopping things out
without that knowledge,
you could be throwing off your model.
So yes, it's important to clean up your code,
but it's more important to understand
what everything is being based on
and have that knowledge of what's feeding everything
before you do that.
Yeah, absolutely agree.
And the reason why I think this discussion is important is because I know, and we talked
about this before we hit the record button, there has been a lot of hype, obviously, in
the last years on distributed tracing, right?
We have OpenTelemetry that enables everybody to create distributed traces. But now it seems there's at least some type of, not a resistance, but
making people aware that distributed traces can become very expensive, right, from a capturing
perspective, from a storage perspective. And some people are now questioning: are distributed traces
really something that we need to analyze systems?
Can we just do the same thing just with metrics or just with logs?
And Brian, as you just said, we've built a lot of systems over the years, like the
observability vendors that really detect changes in your distributed traces and therefore
automatically detect change in system
behavior, automatically detect bottlenecks. We can see things that otherwise would be hard to see
if you don't put the same kind of distributed tracing capability, let's say, into your logs
because you added a trace ID on the log. I'm just wondering, Toli, from your perspective, because you mentioned earlier,
right, observability can become very expensive. The question is how much of a price
do you want to pay? What's your take on traces? Can we
do everything with logs, with metrics? What is your guidance to engineers?
When to use what? Maybe that's the better way to phrase the question. Do you have any guidance
on how you advise developers on when to use what observability signal?
Yeah, I think it's a very important point. And I think that confuses a lot of people
when they start out in their observability journey. In my mind, there's four telemetry data types that are the most
popular. Tracing, metrics, and logs, mostly for the back end. And then you have real user
monitoring for the front end. I would say my go-to would be real user monitoring plus tracing
with real user monitoring linking to the backend traces.
That would be the ideal, but you don't leave it at that.
You have to enrich the telemetry data with custom tags
or custom attributes that represent your business transactions.
If you don't do that, then
you've just auto-enabled telemetry, and you'll get everything out of the box, but you
won't be understanding the health of your business transactions. And I can guarantee that one day
you'll have an incident and you'll go back and add that so that you can understand a bit more
what's happening. Or you'll go to another non-prod environment,
try to reproduce it, and I'm bored at that point.
If we're still in that space and we're still trying to sort things out
in a non-production environment in 2023, then we're missing the point.
So that would be my go-to.
However, I think one of the lessons I learned is I started with Honeycomb.
At Cinch, we used it a bit and kind of learned a lot about their observability principles and practices.
And then we moved to Datadog. And I'd say that what you would suggest is observability vendor dependent.
So you have to look at your observability vendor and see what they are promoting, what they are doing well,
because they might be storing and indexing
one telemetry data type better than others.
They might be billing one telemetry data better than others.
And pragmatically, you might have to look at that,
which is, I think, in my mind, is a bit sad.
But hopefully, we get to a point where there is a standard. And I'm not saying when I say tracing
and RUM, that's all you use. But you use one as a base, and then the rest as an exception. So for
example, metrics are a signal in time without context.
So you would use it, but then you can't go and find out unless you look at the code.
So avoiding looking at the code, you can't find out what's happening.
With logs, you'll get a lot of noise,
and you'll likely get a lot of natural language that's useful.
And in our case, what I found interesting
was that a lot of non-techies found that useful on dashboards.
But that should be the exception rather than the norm.
And you should try and instrument everything through traces.
And when I say traces, I'm not a big fan of flame graphs.
It's not that I'm against them.
I like them when you're looking at an individual trace. But I think the real value, the real power, is querying a dataset of spans. That's where it
becomes really, really powerful. And that's where high dimensionality and high cardinality is really
important. I hope I've answered your question. You did. and I wanted to add one more thing to this because you brought up a very good
point.
The reason why you need to still look obviously at what your observability vendor is doing
different than others is because we all come from a different background, right?
We've been on the Dynatrace side, we started with APM back then. And from there we evolved. So we started with
traces, then we went into metrics, logs, real user data. A competitor may have started with logs,
and then they evolved into metrics and then traces. So obviously we have our history and
what we've always been really good in, and therefore have a certain lean-in on a certain type of observability data set.
But I think what we also see, at least this is what we are doing internally,
we try to treat every observability signal equally as good as possible.
And hopefully there will be a time when
it should no longer matter really what backend system you have.
I think the best compromise you can probably make is, I suppose, most observability vendors will
enable a service tag that you can add. If you're using all telemetry data types, at least add
some basic top-level tags like service and
version, if you want. Things like that are really, really important so that you can
correlate between data types potentially.
Exactly. Yeah.
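As a rough illustration of those shared top-level tags, here is a minimal sketch assuming the OpenTelemetry Python SDK; the service.name and service.version values are hypothetical, but declaring them once on the resource means every signal the process emits carries the same tags, which is what makes correlation across data types possible:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Declare service-level tags once, on the resource, so every span (and any
# metrics/logs built on the same resource) carries the same identifiers.
resource = Resource.create({
    "service.name": "search",          # hypothetical service name
    "service.version": "1.4.2",        # hypothetical version
    "deployment.environment": "prod",  # hypothetical environment
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("search.query"):
    pass  # this span, and everything else the service emits, shares those tags
```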
I mean, it's also, you know, we don't want to make this about our
organization, but obviously we have the most experience with things like linking traces with logs, because we automatically take the log
and put the trace ID on it. That's one thing. You talked about business transactions, right? We
automatically extract anything from the user that is interacting with your front end, and we
know who they are and where they click and like your search example that you brought to the conference, you will probably be interested
in something like what did people search for?
Where do they come from?
What do they search for?
What other filters do they apply?
And then you want to see this to analyze also search behavior and user behavior and then
how it impacts your performance and resiliency of your system. Yeah. But yeah, I assume by 2023, most observability vendors can hopefully do this in an easy way.
By 2023?
We're in 2023.
I know.
That's what I'm saying.
I assume by now everybody, most vendors are doing this.
Yeah, yeah, yeah.
Yeah.
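For that point about putting a trace ID on the log, here is a minimal sketch assuming the OpenTelemetry Python API and the standard logging module; the logger name and log format are illustrative, and many vendors (or OpenTelemetry's logging instrumentation) can inject this automatically:

```python
import logging
from opentelemetry import trace

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s",
)
logger = logging.getLogger("checkout")  # illustrative logger name

def log_with_trace_id(message: str) -> None:
    # Attach the current span's trace ID to the log line so the log can be
    # joined back to the distributed trace later.
    ctx = trace.get_current_span().get_span_context()
    trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "none"
    logger.info(message, extra={"trace_id": trace_id})

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("create_order"):
    log_with_trace_id("order created")
```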
One thing really briefly too, you mentioned the idea, the real user monitoring and having the other information.
So the way I interpreted that, and I just wanted to double check if this is a correct interpretation, would be: when you have a public-facing site or tool, the real user monitoring is basically your SLO, it's where you're looking for the impact to hit. And then you want the traces, logs, and metrics and everything else underneath so that you can do the investigation into what is impacting that.
And part of that trace could be stuff in the browser as well.
But you're looking primarily at what is the impact to the end user.
And we've been saying that.
A lot of people have always been saying that.
Who cares if your CPU is X, Y, Z,
whatever it might be? What's the impact
on your end user? Are they feeling it?
That should be your guiding principle.
And it almost...
If I can put the words in your mouth, it feels like
you're saying when you have real user monitoring,
that should be what you're looking at
to see how the system's running.
And then all the other telemetry
is the supporting evidence that you need to dive in.
Would that be fair?
Yeah, I think so.
I think I'll change slightly the words in my mouth
in the sense that I think real user monitoring is,
I see it as a tracing of the front end.
And absolutely, it's about knowing what your users are doing
and understanding so that you can improve the UX.
What's interesting is that it's a lot easier to use something like synthetics
than it is to use and understand and enable teams to use real user monitoring
because there's more concepts in there to understand.
It's also more complex than tracing because there's a lot more dimensions, like a view and
an action and things like that. It's not just request-response like it is
on the back end. And it really does give a bit of context. In terms of, if a synthetic goes
off, the homepage is down,
that's more like, you know, it's likely that the homepage is down, but it's not
definite. Whereas with real user monitoring, if 90% of the users can't
access the homepage, then you know what 90% of your users are
experiencing de facto, as long as, you know, apart from the ones that are behind
cookies and stuff like that.
And what I'd say about real user monitoring, it's an interesting one that we experienced at Cinch
from the perspective of the teams running the software. So I had a really hard job getting front-end leaning software engineers to care about observability.
And real user monitoring was my in.
It helped me get them into the platform
and it helped them then go and explore other things
like tracing and SLOs and things like that.
So that was really powerful from that perspective.
And the other aspect of it is that often,
and I've seen that in a white paper
that I think New Relic published recently,
was that a lot of the companies
have a very fragmented observability platform setting.
So they have multiple platforms.
I've seen that as being an anti-pattern. I've experienced it as being an anti-pattern.
It just fragments the view of your software.
So my take is, and I think the white paper was saying the same, is that
you want to have one observability platform. You want to
be able to throw them away if the billing gets too expensive, for sure, and OpenTelemetry
helps with that. But you want to have one observability platform that's not your cloud provider
so that you can have a shared understanding across teams, across disciplines.
So you also start seeing some more UX-centric widgets or views, like a funnel
or like the Core Web Vitals that are good for SEO and things like that.
So you're starting to bring in non-engineers as well to look at the same data and you're all looking at the same data
and people are sharing the same links with each other and they're not siloed by different observability platforms.
So to kind of round off, the real user monitoring really enables that aspect, rather than having
one for frontend and one for backend. And the last thing on the backend bit is that we did focus a
lot on custom instrumentation. So a lot of the things, because we were serverless as well at
Cinch, a lot of things that we cared about wasn't CPU, it wasn't memory or memory leaks or anything like that.
We cared about higher order attributes.
Okay, some of them were serverless specific
and came from AWS metrics,
but most of them were orders, search,
various product details and things like that.
I was going to say one last thing I wanted to get in because I know we got to wrap up
soon because this started from before our call when we started talking a little bit
about some aspects, observability with logs and traces and what people prefer. And Andy
and I are obviously big fans of traces. Back before when I was a performance tester, before we started working at Dynatrace and even had any tool like this,
we were running tests, we'd see a slowdown, I'd be looking at logs or processing server metrics
and trying to figure out why it's slowing down and just really couldn't.
It wasn't until we had a trace tool in there that we could see what was going on with the code.
So big, big fan of traces. And that's leading me to think,
for people who are relying or really insisting on logs,
because I see it all the time.
We go to prospects and customers,
and they're like, oh, we want log, log, logs.
I feel that logs are reactionary.
Logs are showing you errors that occur in your system,
but there's really not much you can do for
optimization when it comes to logs, because logs are not telling you how things are running,
where you're spending time.
Unless people are using logs differently, but I imagine if you're trying to get that
level of information for logs, storage costs, indexing, and all, it's going to shoot through
the roof because we know logs are expensive.
So is there really a case?
Do you see situations where people can use logs for anything but reacting to incidents?
Is there proactive log use cases for optimization?
Or is that really where the traces come in, like the hardcore case for traces, outside of other use cases?
Yeah, I think there's definitely a use case for logs.
And I know you didn't say that, is there a use case for logs?
Oh yeah, there's definitely a use case for logs, I agree.
I think the boundaries of tracing is at your own software,
so at the end of your own software.
So you have to ingest logs for third parties and things like that.
So that's really important, and cloud providers in some cases.
So you have to accept the reality of using logs,
but I do agree that they become reactionary.
They're a signal for me that it's a reactional thing.
And it also encourages going into patterns
like introducing correlation IDs
and introducing kind of custom tags
that then you can query on and all these things
and durations as well.
So you start seeing in code things like log and duration here
to see how long this takes,
which are all built into tracing and spans.
By taking the span duration, you know how long that takes.
If you want a smaller part of the span, you can create a subspan.
So that's all structural within traces.
So you end up building something similar to a data set of spans,
but not good enough.
But then you have the advantage of correlating with potentially
third parties and software that's not instrumented with tracing.
Yeah.
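As a small sketch of that last point, assuming the OpenTelemetry Python API: wrapping the work in child spans gives you durations and structure without hand-rolled duration logs or correlation IDs. The span and function names here are illustrative, not from the episode:

```python
import time
from opentelemetry import trace

tracer = trace.get_tracer("catalog-import")  # illustrative scope name

def import_catalog() -> None:
    # Instead of logging "this step took X ms", wrap each step in a child span:
    # every span records its own start/end time, hierarchy, and attributes.
    with tracer.start_as_current_span("import_catalog"):
        with tracer.start_as_current_span("fetch_feed") as span:
            time.sleep(0.1)  # stand-in for the real feed fetch
            span.set_attribute("feed.items", 250)
        with tracer.start_as_current_span("write_to_db"):
            time.sleep(0.2)  # stand-in for the real database write

if __name__ == "__main__":
    import_catalog()
```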
Hey, I know we need to wrap up here because we kind of have a hard stop.
Thank you so much for doing this podcast with us.
And I would have a couple of more questions, especially on serverless,
because that's a topic I currently host a working group
within our customer base
on serverless observability best practices.
Maybe I want to have you back
for another discussion,
because obviously you have a lot of experience on this.
But I learned a ton today.
Yeah, I learned a ton today.
And thank you so much.
We don't keep you much longer
because we want to make sure
you catch your next appointment.
I've loved this conversation.
And yeah, thanks for having me.
And I get the sentiment
of having you back
to talk about serverless
because we haven't really touched
on that for a while, Andy.
So I definitely think
that we can dig into that.
All right, we'll wrap it up here.
Thanks, everyone, for
listening, and thank you, Toli, for
spending time with us. Thanks to all
of our listeners and we'll
see you all next time. Bye-bye.