PurePerformance - Don't burst in Flames: 20 years of Performance Engineering with Martin Spier
Episode Date: October 23, 2023

Martin Spier was one of six engineers taking care of all of Netflix Operations about 10 years ago. Back then, performance and observability tools weren't as sophisticated and didn't scale to the needs of Netflix as some do today. FlameScope was one of the open source projects that evolved out of that period, visualizing flame graphs on a time-scaled heatmap to identify specific performance patterns that caused issues in their complex systems back then.

Tune in to this episode and hear more performance and observability stories from Martin: about his early days in Brazil, his time at Expedia and Netflix, and about his current role as VP of Engineering at PicPay, one of the hottest fintechs in Brazil.

More links we discussed:
Performance Summit talk about FlameCommander: https://www.youtube.com/watch?v=L58GrWcrD00
CMG Impact talk on Real User Monitoring at Netflix: https://www.cmg.org/2019/04/impact-2019-real-user-performance-monitoring-at-netflix-scale/
Learn more about Vector: https://netflixtechblog.com/extending-vector-with-ebpf-to-inspect-host-and-container-performance-5da3af4c584b
Martin's GitHub: https://github.com/spiermar
Connect with him on LinkedIn: https://www.linkedin.com/in/martinspier/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance, back on the record. Yes, yes, it has been a while. And thank you to our audience for excusing my absence,
but we had to get some new content out,
and Andy was traveling like he does, as always.
You know, it's an exciting time right now, Andy,
I've got to tell you.
Well, first of all, it's the month of October,
which means I can officially, for this month,
call you Andy Candy Grabner.
Bring that back.
That started a long time ago, which brings me to the second exciting bit that we're on episode 193.
So after this one, we have seven more until we hit 200.
I think we've been doing this for about seven years now.
I think so. Which is pretty crazy.
Yeah.
Yeah.
And just last month I hit my 12-year milestone at Dynatrace, which is pretty, pretty crazy.
Wow. Yeah, it's been a long, long time. It's been awesome.
But it's funny, too, thinking back to when we started this all and where you and I came from, right?
We both started in performance testing, I don't know if we called it engineering at the time, but performance testing, then performance engineering,
and then into this whole observability thing.
But it's always been performance at the heart of it.
I know I used to only think that performance could be load tests.
And as we got into the world of Dynatrace and some of this other stuff,
there are so many other aspects of it.
And I think today it's coming a little bit more full circle, right?
We're going to be doing a little bit more of a performance-oriented podcast.
Yeah, and I also think, right,
because back in the days
we talked a lot about patterns
and how to detect patterns.
And I think there's a lot of stuff
that we are going to hear today
from our guest.
And I think, Brian, do you think it's time for our guest to introduce himself?
Absolutely.
I think it's time.
If he knows who he is.
If he's figured that out.
And if he doesn't know who he is, maybe he can tell us who he wants to be.
Two things.
Yeah.
But I will pronounce the name of today's guest the way I would pronounce it, because it looks like a German name.
Martin Speer.
But I will let Martin introduce himself.
Martin, thank you so much for being on the show. We met a couple of weeks ago in São Paulo at an event, and now we're here recording this podcast.
Welcome to Pure Performance. Thank you for being here. Please do us a favor: if you know who you are, introduce yourself, or if you know who you want to be, introduce who you want to be.
Well, well, thank you, thank you. And hello, everyone.
Thank you for the invite.
I'm really, really happy to be sharing my war stories with you guys and all the audience.
Yeah, it's a very deep, thoughtful question. Hard to think about.
But one thing is for sure, I do love performance.
I mean, I think my whole career was around performance somehow.
Just in the recent years, I've been more of a bureaucrat.
But other than that, my whole career was around that.
I mean, I think I wrote my first line of code, a really bad one,
kind of when I was super young, I was kind of nine-ish, I guess.
I got to experience the whole internet in the 90s
when things were kind of fairly small.
It was a bit of a no-brainer going into computer science.
Ended up studying that.
And I started my career as a sysadmin when sysadmin was a thing.
And by the way, most performance engineers I kind of get in touch with these days either came from a testing background, you know,
because back then it was kind of perf testing,
and then you kind of go into something else,
or sysadmin. You know, at Netflix,
the bulk of the engineers there came from a sysadmin background.
And, you know,
back then, by pure luck, I sort of got involved in a perf improvement project.
The application was slow, back when we kind of had, you know, waterfall model projects.
It went really, really well.
We improved things.
And from that, at the company I was working at at the time, I said, hey, maybe we need a performance engineering team.
And, you know, back then I didn't even know performance engineering was a thing.
I started researching a bit more, and that's how I got into performance engineering.
Back then, it was a bit of mainframe and a few other things,
not to give up my age here.
But I ended up working in that, moved to the U.S.
12, 13, 14, 15 years ago. I don't know, long, long time. I spent a
couple years at Expedia. So if you're from the US, you probably know Expedia, travel agency,
working really in the large lines of business, hotels, air, cars, booking, all those things.
And then by sort of random chance, I received a cold email from a recruiter in California.
I knew the company, not just because I was a user, but they also had some really, really cool open source projects back then.
That was Netflix.
And they were starting a new performance engineering team over there.
And this was over a decade ago. The company was really, really different back then.
I remember celebrating the 20 million user mark, which was huge back then.
Now it's kind of a lot more than that.
There were only a few hundred employees.
Content development wasn't a thing back then.
It was just licensing.
DVD was still huge. For those who are kind of not from the US,
Netflix started in the US here
as a DVD delivery kind of thing.
You went to websites,
selected what you wanted.
You got a red envelope
with the DVD.
And it was a really,
really interesting time back then
because Netflix was migrating
from the data center
to this new thing
called the cloud.
And everyone was really apprehensive about the cloud and suspicious.
Should I host my data over there?
And is it safe?
Who's guaranteeing that?
So I ended up facing a lot of the performance, architecture, and scaling problems of the cloud at first hand, because it was not something I could kind of go to
Stack Overflow and ask how to kind of solve that problem.
Probably no one kind of faced those issues before.
I got to work with really, really great people
back in the day.
Maybe you guys know Adrian Cockcroft
from Sun.
Amazing, amazing time.
Trying to define what architecture for the cloud was, what it is today, which is really, really cool.
I migrated to lots of different areas there.
Yes, I started with architecture, backend systems,
but over time, I worked on the client side,
which is not just your Android, iOS, and web, but also TVs and PlayStations and Chromecasts and all those kind of weird things.
There was even one thing that ran Windows CE back then.
So interesting times.
Big data, which kind of started, now is a huge problem.
Back then, it was something new.
Hey,
processing all this data costs a lot of money. How can I improve
that? And now
we do machine learning. Machine learning is becoming
a problem. There's perf engineering teams
focusing on machine learning, how to
optimize that. And I got to
work on all those things, all with
sort of a performance lens.
And I ended up developing a bunch of tools. We can talk about those things later.
And about two years ago, I left Netflix. And today, I lead
what I call the foundation engineering team at PicPay.
So PicPay, probably most of the listeners don't know PicPay, but PicPay is short for picture payment.
The analogy I try to make is Venmo in Latam. It's one of the large fintechs in Brazil.
Over 70 million accounts opened, fairly big.
And the way I like to explain what foundational engineering is there: it's basically the bulk of core engineering,
so everything that is not directly related to a financial product, but everything that supports all of those programs:
your infrastructure, platforms, internal platforms, architecture, internal tools, mobile platform.
And I had the data platform until recently.
I sort of call it the plumbing.
When it's working well, no one sees it.
But when it starts giving you headaches, all hell breaks loose.
And that includes Observability too.
Observability is one of my teams.
It's one of the teams I have a bit of a passion for.
And they get bothered with me a bit because I try to, you know, give more opinions than I should.
But it's been an interesting,
interesting change.
Well, thank you so much. It's amazing to have somebody like you on the podcast, with such a big history, right? I mean, it's amazing when you explain it.
Now I've got to ask you a question, and without dating yourself,
but what was your first development language?
Because you said when you were nine years old,
you were already writing code.
Do you remember the language?
That was BASIC.
Yes.
Yeah.
That was BASIC.
Yeah.
It's funny.
It's the same for me.
My first computer was a Commodore Amiga 500,
and it was AmigaBASIC.
Did you have the BASIC with the line numbers, or, I think they had a version, maybe QBasic or something like that, after?
No, it was the line numbers.
Line numbers, yeah. Yeah, okay. 20 GOTO 10.
10 for Brian, 20 GOTO 10. Exactly. Yeah. That's awesome.
Yeah. And I think that helps. You appreciate hardware resources.
And I think that's one thing I really love about performance.
You get into how to optimize things
and because you value hardware resources,
which is something that changed over the years.
A lot of brand new developers,
they work at a level of abstraction
and everything is abstracted,
even memory allocation.
I mean, I could probably ask most engineers today:
do you know how your language allocates memory and de-allocates memory,
or what malloc is?
Most will probably never have to deal with it.
Yeah, it's interesting. I remember...
Go on, Andy.
I was just saying,
understanding the basics, I think,
is also a privilege that we have when we grew up
because I remember, besides basic,
my first language in school,
in high school, was assembler.
So we actually had to learn
how to move bits and pieces around in the registers.
It was really interesting.
And this was back in the days, in the early to mid-90s,
not that assembler was still something you would code in,
because obviously we already had languages like C and C++,
but it was really great foundational knowledge that we gained.
Yeah, I was just going to go back into the,
I don't know if you had it.
You were doing Rational, right, Andy? That's what you worked with?
No, I was working with Segue.
Oh, Segue, yeah. So I don't know, was the language C on that one? Because I remember in LoadRunner, we would have to do C, and anytime we did a fancy function, you had to do the malloc. And I remember that confused the hell out of me.
I think LoadRunner was doing C, yeah, but we were doing, I think, something more Pascal-based,
to be honest with you, if I think back, yeah.
Yeah, and LoadRunner was C, but it wasn't a standard compiler.
Right, right.
Which kind of caused a lot of headaches.
Yeah.
Yeah.
So first language, basic, yeah.
And Andy, I don't know if you caught it, right?
So Martin has a direct tie-in to Dynatrace because his first, I think you said it was your first performance job,
you were working for Expedia, which just made me think immediately of Easy Travel.
So you basically worked for Easy Travel.
Easy Travel is our demo, one of our demo apps that we've had for years and years and years for Dynatrace.
So spinning up this travel website just cracks me up. It was Space Travel before that, wasn't it?
Yeah, anyway, that was way back. No, it was called something else, but something with space travel.
Yeah, yeah. But, Martin, a couple of quick questions. So in the preparation of this
call, or this podcast, you sent us a bunch of links. It was, as you called it earlier before we hit the recording button, like your little baby.
You have a lot of presentations on Flame Commander or FlameScope, an open source project that you brought to life, I guess, during your time at Netflix, correct?
Yes, correct.
We will, folks, if you're interested in learning more
about Flame Commander and all the other open source projects
that were released back then by Netflix,
you will see the links in the description of the podcast.
But can you tell us a little bit about
why you built this tool back then?
What problem you actually tried to solve?
Yeah, sure.
It's quite an interesting and funny story.
So we had a fairly small team at Netflix.
I think at peak it was maybe six engineers to take care of all Netflix globally, all devices, everything.
And everyone had a specific focus, kind of backend, client, data, benchmarking, kernel, JVM.
And I remember it was one of the most common requests was,
hey, I had a CPU regression
or something's kind of weird with my CPU, new build.
Pretty common problem.
And this issue was intermittent.
You know, you guys have probably noticed that, had that problem in the past.
And this one was specifically hard because it wasn't even a second.
It was sub-second, maybe kind of 100 millis, or even below that.
And it was really hard to find
what was blipping here.
And then back then, Vadim, my colleague,
he was having that problem
and then took a CPU profile,
sampled the app,
and started slicing it
into very, very tiny bits
and generating flame graphs from those things
to say, hey, what's that spike?
But it was really hard to catch that specific moment.
And then I think it was, it's kind of fuzzy because it was a long, long time ago, but
I think it was Brendan that decided to, hey, let's plot that as a heat map and see kind
of what we can find here.
Interestingly enough, we could clearly see the blips and exactly the timeframe
at which to slice things.
And then I took it to,
hey, let's create a tool on that
where I can navigate back and forth
between these two things,
like the heat map and the flame graph.
And cool, developed a tool.
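For readers who want to picture the heatmap idea Martin and Vadim describe, here is a minimal, hypothetical Python sketch of a FlameScope-style sub-second heatmap. It assumes you already have per-sample timestamps from a profiler such as perf; the function name and the 50-row bucket count are illustrative only, not FlameScope's actual implementation.

```python
from collections import Counter

def subsecond_heatmap(sample_timestamps, rows=50):
    """Bucket profiler sample timestamps (seconds, as floats) into a
    FlameScope-style grid: one column per second of wall-clock time,
    one row per sub-second offset bucket. A hot cell marks a short
    CPU burst that a whole-run flame graph would average away."""
    cells = Counter()
    for ts in sample_timestamps:
        column = int(ts)                    # which second of the capture
        row = int((ts - column) * rows)     # offset inside that second
        cells[(column, row)] += 1
    return cells                            # {(column, row): sample count}

# Hypothetical usage with timestamps parsed from `perf script` output:
#   heat = subsecond_heatmap([12.001, 12.002, 12.501, 13.002])
# Cells with unusually high counts point to the exact sub-second range
# to slice out of the profile and render as its own flame graph.
```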
Interesting.
Can we open all our old profiles that we took of applications over the years and see what we can find?
And sure enough, there was a lot
of really interesting patterns. We even
wrote a blog post about that.
Interesting patterns like
your GC spiking
or jobs that
trigger every second or so.
All sorts of things.
And wow,
there's so much we can optimize here
that I never even saw before.
It just got washed into the profiles.
And interesting, that's just CPU.
What else can I open with that visualization?
Then we started to move, hey, memory allocation profiles
and all the BPF things.
And it kind of grew into a really, really interesting tool.
It was standalone back then.
That was Flamescope.
That's the open source version.
And, hey, we need to make it easier
for all developers to have that capability.
Then came Flame Commander.
It was basically a cloud-hosted version of that
where you could just point to your instance,
single click, and take any profile
and analyze that
either as a flame graph
or, you know,
your flame scope visualization,
go back and forth.
It had a historical archive.
You could compare
kind of older versions
with new versions of things.
And it kind of grew into the overall cloud profiler at Netflix.
So that's the story of how it was created, from a really, really tiny thing that, you know, probably most of you have faced before.
Yeah.
And so for me, a couple of questions on this.
So I think this was, if I look back and also at the Git repository, what was it, like eight years ago?
Probably.
Yeah. So eight years ago, that must have been 2015.
Obviously you said
you were a small team.
It's amazing.
Only six engineers taking care of Netflix.
Now, did you look for other tools?
Or did you just say, no, we built it ourselves?
Or was there nothing available?
Nothing from that level.
Remember that back in the day, especially when Netflix started,
not a lot of observability tools were available that worked at that scale.
Take your time series metrics,
internally developed Atlas, still huge.
I still don't know if any tools available today
can kind of take on that load.
Vector, which was kind of real-time, sub-second monitoring of low-level metrics.
Also, I don't think I've seen anything similar to that: something that is not centrally aggregated, that connects to the host and streams directly to the browser.
So yes, part of it was, hey, we like to develop tools, but also, in general, there was nothing that could give us the level of granularity we wanted
and also support our scale.
So that's where we kind of went straight to developing tools.
Obviously, you were pioneers back then,
especially when it comes to that scale, right?
I think now years have passed
and I guess there might be other alternatives now out there.
But that's really interesting.
Brian, this also reminds me, if you remember,
one of the early podcasts we had with Goranka,
who was a performance engineer at Facebook back in the days.
Oh, yeah. I spoke at conferences with her.
She was taking up capacity engineering there.
Yeah, yeah. Exactly.
So we had her on the podcast as well.
And obviously, Facebook back then, same challenge.
Big scale, no other tools available.
And they had to figure out a new way to get all this data from all of their hundreds and thousands of hosts
that they had back then and then analyze it.
Another question that I had, so open source.
You or Netflix decided to open source these tools.
And I think this was at a time where, you know, I'm not sure that many large organizations actually went down that route of open sourcing something that was built in-house. Obviously, it's intellectual property that you built. Do you remember why Netflix decided back then to actually open source these tools? Because that's giving away a lot of stuff for free, eventually, right?
Yeah, I think first, it's part of the DNA.
Everyone likes to have
those discussions in the open
and also kind of discuss
implementation and how we can
solve this problem.
There were a few things that I remember
back then. First, the question was: is it a competitive differentiator for Netflix if everyone starts using this?
If it is, we're probably not open sourcing it.
It's just a...
But if
you check most of the tools, they're not the
recommendation algorithm, for example,
which is: close it, guard it.
But for cloud management things, it's, hey, if
everyone starts using that,
that's good for us. We're sort of a standard.
We can all contribute
and improve how we use
the tool internally too.
It's really interesting
to give visibility to engineers
of the problems we work in.
It's a tech brand. It's really
important. We're competing for really top talent and it's good for everyone to know the kind of
problems we're working on. And that generally comes with open source projects. So it wasn't
much of a huge discussion. It was just, hey, if our competitors are using this tool, that'd be a problem. Probably not.
And then after came
how much effort it takes to
manage those projects over time
and so on and so forth.
But the idea was generally that.
Let's just, you know,
it's interesting, it's good for the community.
No competitive advantage.
Let's open source it.
No, it's good. And obviously, it gave you a lot of chances to speak at all sorts of conferences.
I have a couple of tabs open here in my browser. You spoke at some, like CMG Impact, I see here, and some other conferences.
Obviously, it's a great way to then, you know, speak about it, and obviously, you know, it's kind of like free advertising for it as well,
right?
Because obviously these conferences, they are happy that somebody speaks about their
own experiences.
And especially if the tools that you're using are something that everybody can then use
and don't have to purchase some.
Exactly.
And we love contributions too.
I think that was really interesting.
If people start using it, they'll find other uses for it.
They'll add features to it.
Everyone benefits.
It goes back to that community that we see so much in the IT world
of people sharing and helping each other out.
It's funny, too, because in a way,
without making it sound terrible,
but I like the phrase,
open source is marketing,
where you're marketing your technology stack,
showing it's a cool place to work.
Obviously, that's not the reason
you're going to do the open source,
but that can be another factor.
Is anybody going to get a competitive advantage,
and will this potentially
help us attract more talent as we're growing and expanding?
It's an absolute yes on the second one, for sure.
Yeah. One thing that I noticed: I looked at the video from the Performance Summit, that was the first video I watched from you, and it reminded me so much about pattern recognition, right?
What we did.
So Brian and I, when we started our work as performance engineers, especially now with
Dynatrace, everything was about distributed tracing.
So I'm not sure how many distributed traces we've analyzed in our lifetime, but it's enough.
But the interesting thing is we always kept looking for patterns.
Like we always come back to the N plus one query pattern,
you know,
too many threads being used,
the calling,
fetching too much data,
making too many calls to a remote system,
high latency and things like this.
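As an aside, the N+1 query pattern Andy mentions is easy to hunt for once you have trace data in hand. Here is a small, hypothetical Python sketch that flags it, assuming spans are simplified dicts with trace_id, kind, and statement fields rather than any particular vendor's trace format.

```python
from collections import Counter

def find_n_plus_one(spans, threshold=10):
    """Flag traces where the same database statement runs many times
    within a single trace -- the classic N+1 query pattern."""
    counts = Counter(
        (span["trace_id"], span["statement"])
        for span in spans
        if span.get("kind") == "db"
    )
    return [
        {"trace_id": trace_id, "statement": statement, "calls": n}
        for (trace_id, statement), n in counts.items()
        if n >= threshold
    ]

# Hypothetical usage:
#   for hit in find_n_plus_one(yesterdays_spans):
#       print(hit["trace_id"], hit["calls"], hit["statement"])
```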
What I really liked about the way FlameScope and Flame Commander worked, I think it's FlameScope with the visualization.
Yeah.
It's the different patterns that you then could visually see.
And that was just, it was really great.
Folks, you can check it out. I'm sure you have presented this in multiple different presentations, but the one on YouTube is the one called Flame Commander Netflix Cloud Profiler, by Martin.
And I think starting with minute number five in that video, there are like five minutes of just pattern after pattern after pattern and how it looks visually.
Yeah.
And it's interesting.
Every time we presented about that,
it was always the same question.
Hey, have you kind of trained a model to learn those patterns and detect those automatically?
That was, at least, on the to-do list when I left.
So that's actually an interesting thought. You can train a model to detect those, because what a human eye can do with picture recognition is so far advanced. So if it's creating that picture, then it seems like a very easy leap.
Exactly. And at scale it makes a huge difference. Hey, manually checking one application is very easy; when I have thousands and thousands of applications, doing that manually becomes harder.
Reminds me of the fingerprint database in all those detective movies. They get the fingerprint and then you get a hit.
Exactly. We got them.
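The fingerprint analogy can be taken quite literally even before training any model. A minimal, hypothetical Python sketch, assuming each known pattern is stored as a flattened heatmap of cell counts, could simply score similarity against that library:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equally sized, flattened heatmaps."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_patterns(heatmap, fingerprints, min_score=0.8):
    """Compare a flattened heatmap against a library of labelled
    fingerprints (e.g. 'gc-spike', 'once-per-second job') and return
    the labels scoring above the threshold, best match first."""
    scores = {label: cosine(heatmap, fp) for label, fp in fingerprints.items()}
    return sorted(
        (label for label, score in scores.items() if score >= min_score),
        key=lambda label: -scores[label],
    )

# Hypothetical usage:
#   hits = match_patterns(new_heatmap, {"gc-spike": gc_fp, "per-second-job": job_fp})
```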
Do you know, Martin,
is the tool still used
at Netflix? I know you left a couple
of years ago, but still, are you aware?
As far as I know, yes.
Jason and Vadim can confirm, yes.
Now, with all the stuff that you learned at Netflix,
now you moved on to PicPay,
and this is also how we got to know each other
because you were presenting in Brazil.
Do you have a different...
I know you have a different role now,
but I guess with everything that you've learned, do you bring a certain, let's say, motivation into your organization around observability, around performance?
Like, do you take some of the lessons learned and make sure that in your organization you're doing things similarly? Or is it a different world now, because of different technologies and maybe different people?
No, I'm definitely taking a lot of lessons.
I think that was
one of the reasons I joined was
PicPay is a different company. It's a lot younger. Well, it's younger-ish.
But engineering-wise, the company grew so fast. And with that, it absorbed a lot of technical debt. And engineering maturity didn't grow as fast.
So that's what I'm trying to bring
to the table, kind of get a lot of the lessons
I learned when Netflix was
scaling and all the pains we
kind of suffered during those
years, to PicPay, which was
in a very similar momentum,
right? Just to
make those things kind of get fixed
a bit faster.
That's what I'm trying to bring to the table.
Let's have the architecture discussions
early on so I don't really suffer
when things are just too big
to fix.
Performance, for sure.
Let's not get this
out of hand here.
Also,
observability.
The
observability space,
at least that is my reading
in Brazil, and I'm assuming
most markets that are not
super developed, that don't have your big tech companies, observability is still quite young.
Most engineers in your team, if they did not come from a big tech or a large company,
their
observability is, hey, can I go check the logs
and scan everything manually?
That's the background of most engineers
that never worked at a company
at PicPay or Netflix
scale.
And trying to get rid of that thinking of,
hey, I can manually do a lot of things
or I can publish text and scan text as much as I want
and bring observability to a point where I can scale, where I can continue scaling.
And that's kind of where we are right now in that space: getting rid of all the technical debt that causes problems.
The problem is, it's too expensive.
That's the first one.
And it's really slow
to find problems
or understand
what's going on in the system.
And that's kind of moving
from scanning logs to,
hey, metrics
and traces and
full end-to-end traces
and this sort of thing.
So that's the
maturity I'm trying to bring to the organization.
And obviously, I've seen that.
Not many engineers on my team have seen that
at scale.
So I do have a lot more opinions than I should in my position.
I guess that's always hard, right?
To sometimes not let your past dictate your actions.
But yeah, that's what it is.
Yeah, there was a joke internally too.
I mean, I'm working a lot on improving app performance,
so Android and iOS performance in the app.
It's a focus right now.
And I had to contain myself
actually installing Android Studio
and start profiling things again
to the point where I took a screenshot
of Android Studio running on my machine
and I kind of sent that to the team.
And then it became a joke.
Hey, Martin is reviewing your PRs now.
That's just to contain myself.
Yeah.
I know we had this discussion also in São Paulo, because when you got on stage you introduced yourself to me and you said you're working in foundation engineering.
And I said this is something that I would call maybe platform engineering, because that's kind of the term that I have been using for a little bit, and I think you agreed on this.
And also, the way you explained it earlier, you really make sure that an engineering organization really has everything so that they can really produce great output
without all the complexity that the tool ecosystem brings and the processes bring.
I get a lot of questions from our community
on what is the right thing to get observability
into foundational engineering, into platform engineering.
Question to you now, do you bring observability as a mandatory thing
into everything you do?
Or is this still, you know, optional?
Or is it mandatory?
That would be an interesting thing to hear.
Yeah, sure. I agree with you.
Foundation, platform, different terms, but our reading is basically the same.
At the end of the day, it's all layers of abstraction, right?
What I'm trying to provide to my clients, and I tell it to my team all the time,
I think of our Platform Engineering Foundation
as a startup internally in the company. I'm providing a service,
I'm providing a level of abstraction so all other teams can build their features
for the users a lot faster without having to worry about the details.
Of course, the more I know, the better, but delivering that faster.
I think that's the idea.
And observability, when observability started within platform engineering,
that was mostly to provide a managed platform to other teams,
being internally hosted tools, tools that we acquired,
tools that came through M&As, managing old tools.
That was the initial idea.
And with time, I'm changing that vision to,
hey, we're actually responsible
for the best practices
and the processes
and what's expected from each team.
I don't buy into the huge idea of,
hey, something is mandatory
for other teams.
What I'd like to do
is to offer a better product
that they would get somewhere else.
They need to see the value of that.
It's just not part of my culture of,
hey, you have to do that.
It's mandatory.
Sometimes it's necessary for sure.
But I like to provide a better service,
a better product.
And that comes,
hey, here is your Java image
that you use to build all your applications.
It's all fully
instrumented already. And it's a lot easier
for you to just use that
and get all those things for free
than having to develop those things yourself.
We come up with,
hey, here's a minimum that you need
to have a system in production. All those things
come for free if you use our platform.
You're free to
use whatever, but
at the same time, most engineers
don't want to have
more work that they need.
That's why you become an engineer,
right? Because you don't want to manually do
things. You want to automate them and
be done with it.
And there's the constant
search for less work.
And it's a very interesting approach.
I brought that from Netflix. Back at Netflix, we
called that the paved path. Here's a really nice
highway you can follow. And here's the off-road path.
There's always cases that you have to go off-road
because you don't have a really nice highway there.
But 99% of other cases, it's just a lot easier to go on a highway.
Yeah.
So you could call them paved paths or something.
I think it'd be like golden paths, paved paths, whatever it is.
But I think the way you explain it is nice, right?
I mean, hopefully everybody understands that driving on the paved path makes more sense.
Now, how do you sell your product internally? Have you reached a status already where you said, now it's clear for everyone to use our platform?
Or do you still have to do advertising? Do you still have to sell it internally? Do you still have to educate people?
So the current products, they're pretty established, I would say, with a few exceptions.
It's all about internal, how other teams see our team as a reference.
I think that helps a lot. If you work on doing your research
and going through every detail,
listening to everyone on their input
before you make a decision,
and then you do the job,
you don't need a lot of internal marketing.
People just look at you and,
hey, I trust that they did their homework and I'll follow.
But obviously, there are always technologies that you don't have a full consensus for sure.
It always happens.
And on those cases, we have to do a bit more marketing,
a bit more education,
and kind of go...
I always try to bring the discussion
to a technical discussion.
But not a lot of marketing.
And once you standardize things,
it's...
I joke with the Apple environment.
Everything works nicely together here.
Once you step out,
everything becomes a lot harder.
It's good or bad.
But if you're
in the environment, everything comes for free.
It's a lot easier. Everyone tested
your path and whatever you need to do
has a really nice
documentation. It's just
easier. I thought of a
marketing poster you can hang
in the office so people have an idea.
So on the paved path, you can
have an engineer with a laptop
in a self-driven car
on a smooth road typing away
with no problem getting their work done.
For the off-road path, they're going to be in a
big Jeep Wrangler on a big rocky road
bouncing around as they're trying
to type and drive at the same time.
Take your pick, man.
I do like the off-road bit, as an engineer, but at the same time, I work for a bank.
It's not the right place to be doing that.
But do you want to try typing while you're driving and bouncing?
That's the bit, right? Yeah, yeah.
It's interesting, too, because, Andy, anyway, we talked a lot about platform engineering as well. There's the idea, you know,
I like the terms paved path and, you know, unpaved road better than "opinionated", right?
But it's a similar concept: here's the one where you have your rules, it's all set up,
but it's easier to go.
I like that, though.
You're keeping it open, right?
You're letting the engineers make the choice.
You're trying to inform them and educate them about the pros and cons.
And obviously, if there's a reason why they should pick the unpaved road,
just like the reason why you pick any technology is because there is a reason,
but it gets them to think about it. And I think it's more powerful if they come
to the decision to do the paved path
because they realize that's the better path
as opposed to just like, no, this is what you do.
It's interesting to see how that'll work.
It's exactly the case because
even if you try to provide a platform
that will cover
all use cases,
that'll never be the case.
You always have exceptions
and things you don't want to support
as a central team
just because it's one corner case there
and it makes no sense for us
to invest a lot of time and effort
and people to support
that specific one use case
for kind of one team.
And there's always cases like that
in a large company.
You'll never be able
to standardize everything.
Sometimes you might have a Windows
server running Lua there because
of that one solution that they bought, and it needs to be that one.
And you always have cases like that.
And when you try to impose things, you always forget about the corner cases.
Back when I was at WebMD, there was a team that had a tool written in FoxPro, which I'd never heard of until that came up.
And they're like, well, we need to make it work because it's no longer being made or maintained.
So I was like, oh, wow.
But that's the case.
Just bringing that up for the old language nod.
The question for you now, coming back to where we all started, in performance engineering:
how do you do performance engineering now at PicPay? Do you still do performance testing as part of your delivery pipeline?
Or has everything moved to production, where you're basically analyzing performance behavior and performance changes
as part of a production, let's say, blue-green rollout or a canary rollout?
What do you do?
So it's, there is performance testing, but it's ad hoc on specific cases.
And it was the same at Netflix.
I can tell you how the whole story kind of went, because I remember the first thing I did when I joined was creating a fully automated performance test framework.
So ad hoc cases, there is a completely new application.
I have no idea how it behaves.
I need to put some load in it to see how it behaves.
It's not to validate.
It's not a regression test.
It's nothing like that.
It's just I want to see how it behaves with load.
That's it. There is a patch
or there is a new library version
or there's something that changed
that is risky.
I want to see how it behaves
and kind of what's the difference.
I don't need to match
production workload
or anything like that.
I just want to see how it behaves
under load
and try to find any issues
before production.
To guarantee production is working, it's part of the development process, I guess.
I think that's the same as with Netflix.
Canary releases, I think that's the first.
And with that, during the canary,
you need perf metrics there.
Have you regressed CPU significantly?
Have you regressed your memory allocations?
Are you generating errors?
Whatever you can think of should be there,
should be in the canary evaluation. Whether you do that automatically or manually, it should be part of the evaluation
to see how things behave in production. Blue-green helps: even if you didn't catch whatever you had to catch during canary, hopefully you'll catch it during the blue-green rollout. And that happened at PicPay before I joined.
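A minimal, hypothetical sketch of the kind of automated canary comparison Martin describes might look like the following, assuming you already have aggregated metrics for both the baseline and the canary group; the metric names and the 10% threshold are made up for illustration.

```python
def canary_check(baseline, canary, max_regression=0.10):
    """Compare per-metric values (CPU utilisation, allocation rate,
    error rate, ...) between baseline and canary and report every
    metric that regressed by more than the allowed fraction."""
    failures = {}
    for metric, base_value in baseline.items():
        canary_value = canary.get(metric)
        if canary_value is None or base_value == 0:
            continue  # nothing to compare against
        regression = (canary_value - base_value) / base_value
        if regression > max_regression:
            failures[metric] = round(regression, 3)
    return failures  # empty dict means the canary passes

# Hypothetical usage: stop the rollout if anything regressed.
#   failures = canary_check(
#       {"cpu_util": 0.42, "alloc_mb_s": 180.0, "error_rate": 0.001},
#       {"cpu_util": 0.55, "alloc_mb_s": 185.0, "error_rate": 0.001},
#   )
#   if failures: rollback()
```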
And I think it was exactly like that.
You get to a scale where it's just really, really, really, really hard
to test things pre-production.
Not just because of the scale and the load you need to generate.
There's always ways of kind of achieving that.
But the system became too complex for you to simulate
all weird things that can happen.
Getting an environment
with the right configuration
really, really hard.
You have feature flags.
You have things that change
the behavior of the system.
You have different versions
of applications.
Having a separate environment that scales, or, you know, that has data that's similar to production, is hard.
So there are just too many variables to control pre-production.
And at the same time,
we're releasing multiple times a day,
every application.
So there's not a lot of time to test anything
pre-production to make sure there's no regression.
Keeping those tests updated.
Netflix was fairly simple in user use cases.
PicPay is huge, a huge amount of features available to the user.
So keeping tests that cover something end-to-end is really, really hard. So it just
became really hard to test things
pre-production and we have to
rely on the
actual development and deploy processes
to catch those problems.
And one more question on that.
Obviously, doing the load tests is different, but one thing I didn't know, again,
until I got more into Dynatrace, the understanding that performance is more than just load.
In those pre-production environments, are you at least looking for patterns when you're not under load?
Andy mentioned the N plus one query problem
or single execution,
CPU utilization is more,
the number of calls to the database is more,
things that might indicate a potential problem
that gets exacerbated under load.
Is that being looked at or is it...
And I don't even know if the flame thing
can help with that
because it's not really going to be quite under load.
But, you know, just imagine there are a lot of common issues.
And I'm just thinking on the AI side, if you have a picture of what that looks like,
you can do a comparison.
Yeah, right now it's not part of, you know, your everyday engineer, everyday developer life.
Obviously, if there is a suspicion that something
might be bad or
I'm not sure, yes, all the tools
we have available allow for that,
at least, to how to investigate that in pre-production
environments, for sure.
Not part of the process as a
day-to-day.
But if you're an engineer,
you smell when things can go bad.
I'm adding this SDK, and for some reason,
it's a lot larger than the previous one.
Interesting. Kind of let me see what's going on here.
Or, hey, I developed this algorithm here,
but, you know, quite complex.
Let me see how it behaves.
Or I'm adding this dependency here,
external call or whatever.
Interesting.
Let me see how that behaves
in a non-production environment.
It's just,
you know things that are risky
when you touch them
and hopefully it's a trigger
for you to investigate a bit more.
Martin, I got one final question
because I think then we're getting
almost to the end.
Now it seems that at Netflix you were obviously heavily looking at metrics, like your infrastructure
metrics and so on.
Now with the whole, let's say, excitement about OpenTelemetry and Traces, even though
traces have been around for a long, long time, they've been really made popular, obviously, with OpenTelemetry
and all the tools that came with it.
Do you now see the benefit of having all these additional signals
besides your metrics and your logs?
Do you also look at Traces?
Do you look at real user data and all this stuff?
Oh, yeah, of course, of course. Even at Netflix, we developed a kind of end-to-end distributed tracing solution.
Back in the day, it was internally developed. Today, I think it's Zipkin-based; when I left, it was Zipkin-based.
What's the Google paper? It was Dapper?
Dapper, yeah.
Dapper, exactly. It was based on that. Internally, I think we called it Salp.
And it's extremely important, especially to understand a large and complex environment and the dependencies, and when something breaks.
I ended up even developing a bunch of tools, a few open source, I think. I'm trying to remember if any of the open source ones are still there. They were based on tracing data, for sure.
Take, for example,
so, we developed a tool to visualize
volume of requests versus time.
So, where am I spending time? I have a request that comes to my edge layer.
And have you seen the Sankey diagrams?
Yeah.
They kind of open up.
So I use the Sankey diagram based on tracing
to see where we're spending time.
Okay.
That decomposes the time, I guess, of the edge layer request itself.
So very interesting,
super important to have the tracing data there.
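To make the Sankey idea concrete, here is a small, hypothetical Python sketch that turns trace spans into caller-to-callee edges weighted by time, which is the shape of data a Sankey renderer needs; the span fields are a simplified stand-in, not Netflix's actual trace schema.

```python
from collections import defaultdict

def sankey_edges(spans):
    """Aggregate trace spans into (caller service -> callee service)
    edges weighted by total duration, suitable as Sankey link data."""
    by_id = {span["span_id"]: span for span in spans}
    edges = defaultdict(float)
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        if parent is not None:
            edges[(parent["service"], span["service"])] += span["duration_ms"]
    return dict(edges)  # {("edge-proxy", "playback-api"): 5321.0, ...}

# Hypothetical usage: feed the keys as Sankey nodes/links and the summed
# durations as link widths in any charting library that draws Sankeys.
```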
In a constantly changing environment too,
which is you have a company
with kind of thousands of microservices,
those things are changing all the time.
I remember in the early days
when I tried to do a design of the architecture, where the requests go, et cetera.
By the time I finished, you know, a couple of days later,
it already changed.
So the only way of understanding how, like,
the flow of data and the flow of requests in a system
is through tracing.
So extremely, extremely important.
Yeah, the data structure I don't like very much is logging
just because it doesn't scale really well.
Tracing, metrics, super, super important.
Cool.
Hey, Martin, thank you so much for taking time out of your day.
I'm sure you're super busy with your role at PicPay.
Thank you for giving us all the insights
into what you learned over the years.
It was great to hear that your first
programming language was BASIC.
It kind of brought back some memories from my own childhood.
I hope our paths
will cross again. I know you're currently in Texas. I will be in Texas, by the way,
first after KubeCon, the week after KubeCon.
If you make it to KubeCon, visit us there as well.
All right, yeah, not scheduled,
but yeah, if you're around, let's
have a beer. Let's have a beer, yeah, sounds good.
Brian, any
final words from you?
Yeah, well, first of all,
thank you as well. Two more
thoughts that I didn't get in the beginning, right?
Number one is that, you know, thank you
also to all of our listeners.
You know, we wouldn't be able to talk to amazing guests like Martin.
And, you know, Andy and I learn so much from this as always.
And hopefully you're all learning.
And the other special thing about today.
So, Martin, you're in Texas.
You're not too far away.
We might cross paths because today is 5G zombie day, if no one knew about this, right?
So in the United States, they're doing a testing of the federal emergency broadcasting.
And the latest conspiracy on that is that's going to trigger, I don't know if it's the
COVID microchips that are supposedly being used, but it's going to turn us all into zombies.
So Martin, if that happens, I'll meet you on the field eating brains together.
And Andy, yeah, I guess maybe travel to the United States might be restricted because it'll be a zombie land.
To all of our future zombie listeners, thank you.
Martin, it's been a real pleasure.
Yeah, you have any last things you want to get in there, Martin, as well?
No, just really, really thank you.
Thank you for the chance of sharing my war stories here.
Always eager to chat performance.
As you guys noticed, it's something I'm really, really passionate about.
And again, I always mention that.
I mean, it's a very small world.
We tend to kind of bump into each other at conferences, and you name it. So, you know, listeners, you guys, anyone: if you want to chat performance or anything, as always, feel free to reach out. I think you guys will share the links to LinkedIn and all those things. I'm always eager to chat with like-minded folks.
All right, well, thank you. Thank you for giving back to everybody with your projects and everything else.
So, all right, everyone, thanks for listening. We'll talk to you next time, and happy October!