PurePerformance - Making the case for SRE in a DevOps organization with Bart Enkelaar
Episode Date: June 21, 2021
How do you convince an organization that just went through a two-year DevOps transformation to continue the journey by applying SRE practices? What is SRE anyway? What are good SLOs? And how do you get development teams to take responsibility for their code in production?
Bart Enkelaar, Lead Site Reliability Engineer at bol.com, not only got their organization to apply SRE practices and define good SLOs, but also got dev teams to rotate on-call duties. He also followed the advice of Margaret, Chief Platform Officer, to bring his personal passion to the job. This led to inspiring and educating the community about SRE and SLOs through music. To see what I mean, check out Bart's The Game of SLOs, a three-part reliability musical from SLOConf, or his funny tech conversations at Friendly Tech Chats.
LinkedIn - https://www.linkedin.com/in/bart-enkelaar-02242710/
Margaret, Chief Platform Officer - https://www.youtube.com/watch?v=hy1gUEhbnBM
Game of SLOs: A 3 part reliability musical - https://www.youtube.com/watch?v=Y53Pho93i-k
Friendly Tech Chats - https://www.youtube.com/channel/UChHHWkO537q6Yp2dXtJpOzQ/featured
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everyone, and welcome to another episode of Pure Performance. You may wonder why it is my voice, Andy's, and not Brian Wilson's, who typically does the introduction before passing it over to me. Well, today he's not here, and he will be very, very sorry, because of the guest who is next to me. I think it's the first time we'll have a musical, an actual musical, on the Pure Performance podcast. And I'm so happy that I found you. I mean, I didn't find you, kind of, but Bart Enkelaar, hopefully I pronounced this correctly?
Yeah, that's absolutely fine.
Perfect. Bart, I saw your performance at SLOConf and I was blown away by the way you conveyed an amazing message, the way you became an advocate for SLOs, for service level objectives, by packaging it up in a musical. Because I think people have different ways
of memorizing things.
And I think music is a great way to transport things.
And I'm just very happy that you then decided
and agreed on going on a podcast with us
to talk about everything that gets you excited
and got you excited to make this presentation,
to write a song, to write a
musical about SLOs.
I think there's more about SLOs.
There's SRE.
I mean, you are, who are you, by the way?
Let's say it that way.
Introduce yourself to the audience, and then we'll dive into the topic.
I think there will be some music later on, and we'll see where the conversation takes
us.
Yeah, let's go for that.
So thanks a lot for that great introduction, Andy.
It's quite an introduction, I must say.
So indeed, my name is Bart Enkelaar,
and I'm a lead site reliability engineer at Bol.com,
which is the largest online retailing platform of the Netherlands and Belgium.
So we're basically the Gallic village against Amazon in the Netherlands and Belgium, essentially.
So yeah, I've been a backend engineer mostly since 2008 and joined Bol.com six years ago, and really got more and more interested
in the operational side of things
as I did different things at those companies.
And we basically started our DevOps journey around 2016.
And we consider that to be basically done in 2018-ish.
And after we wrapped up that project
of the big DevOps transformation,
we noticed that our attention
to operational standards
and to the reliability of different parts of systems
was slowly declining.
And also, like many tech companies,
we've been growing every year.
So the performance problems that we were having were expanding.
And we really went looking for a new way to take another next step in the way we balance that reliability with our high innovation needs.
And that's how we came at SRE.
And also, that was around 2018, I think,
that we started experimenting with it.
And around that same time,
we also started moving to the Google Cloud.
So we had some connections with Google,
who of course championed SRE quite heavily through their CRE program, and that made the connection fit extra well.
So, yeah, that's when we started on that journey.
It's like, man, this is what we want to do.
And I personally was really enamored of this, because I have a history of backend engineering, right? So there was the endless, how do you call it, tightrope pull
between the PO and the team was like,
no, we need to focus more on improving the technical health of our system.
But no, we need more features, you know?
And I was always slightly frustrated by the fact that that was a struggle.
Because it was like, but we have the same goals here, so why don't we agree, you know? And then suddenly in SRE I saw this solution. Like, yes, this is it: that realignment around user happiness. So then more and more people started getting interested at Bol.com.
And we actually did a pilot that failed due to all kinds of reasons.
And at that point, some people were thinking, okay, this SRE thing is maybe not for Bol.com. But we have a failure here. That is awesome, because that means we can learn. So I turned that failure into a big evaluation, added to that a big plan, and sent that plan to the board. And they were like, oh yeah, this is a good idea. And here we go, and now we're doing more and more SRE every day.
But this also means your company is definitely mature enough to actually allow failure. And I think this is part of the cultural transformation as well, that you're not just afraid and assessing risk all the time and saying, no, we can't do this because eventually we'll fail. I think it's about embracing failure. Not that you want to fail all the time, but as you said, you want to learn from failure. I have one question, because you said you had the DevOps transformation, it took about two years, and you considered it done, even though we all know it's always a journey.
It's nothing like this is done.
The way I try to explain and figure out what's DevOps and what's SRE: I always think that DevOps is really people using automation to speed up delivery on the one side, really improving lead time for change, really using automation to get features out into production fast. On the other side, I see SRE coming, obviously, from the operations side and saying: hey, we're constantly having things change, how can we now use automation to keep the system resilient, even though changes are much faster than ever before? And kind of in the middle, right between these two teams, I think there's an SLO.
Because SLOs are really what kind of is the contract
and kind of the common goal for everyone
because you want to deliver services that produce,
that make your end users happy.
And you can measure this either using, I don't know,
page load times or conversion rate for the business,
but it can also be a technical metric like,
you know, how resource hungry are we?
And are we still making money?
Or is the infrastructure more costly than before?
How often do we fail, and stuff like this?
And I'm sure there are different analytics.
But in the end, for me, at least the way I take it, SLOs, service level objectives,
are a great way to align everyone, whether it's DevOps and SRE, who kind of use automation
from two different sides,
or whether it's the business that obviously wants to get more and more features out.
But in the end, we need to agree on a couple of indicators.
And this is stuff we want to deliver, right? Because we assume if we deliver on these promises,
then we'll make better, more money and everybody's happy.
Yeah.
And the key here, I think, is that these indicators and these objectives are part of the process, right?
So this is also not a one-time agreement and then we all know what to do because they will not match everyone's expectation at some point.
And that is a problem that is just as much a bug as a feature bug would be.
And, yeah.
Exactly. So now, Bart, from your
experience, now you said you were a backend engineer,
but you were interested in operations,
then SRE came along.
I think some of our viewers
kind of have the same thing, but SRE
is the big new thing. I see a lot of people on LinkedIn
changing their profiles.
First, they all became DevOps engineers.
Now everybody tries to become a site reliability engineer.
A lot of people have no real, I mean, sorry,
I don't want to offend anybody,
but some people just put it into their title,
even though they may not really know what it really means
or because their organizations say we need to up-level,
we need to show the world that we're doing it.
How can you, and how have you, actually brought SRE in, and how do you explain SRE? How can you motivate people
within an organization to actually become change agents? I know you can probably write a song about
it too. But what are the things that you did? How did
this start, and what can others learn now from that?
Yeah, so I actually...
One of the presentations that I gave at several conferences
last year was the case for SRE.
And when I gave that at Agile Testing Days, I started it with the first parts of the song that later became the first scene of the musical.
My main case for SRE as such was that this is an industry standard that is developing.
But that of course can be a bit of a double-edged sword
because it might be a hype.
I don't think it is, but people can waylay that argument in that way.
But the fact that it is an industry standard
means that there are external specialists
who think this is a good idea and who can be brought in
to help the transition.
And I think that's a good thing.
And then at the same time, we were actively noticing
these problems with the reliability of our platform.
So not only was this an external solution,
it's also a solution that addresses a concrete problem that we had
And what I did in the plan that I sent to the board, which I used to convince them
to give us resources to actually start doing something in the company and start this
change, was that I explicitly took a couple of concrete problems that we had in the company,
and I said, SRE can help solve this.
So taking this transition, this mind shift, basically, that you have to push out to the whole organization,
and bringing the goals down from
"we should all do everything differently" to "these are small concrete problems that we can fix as a
first step", which incidentally are also the first steps on our journey to getting to this SRE
mindset: that was, I think, the most successful thing that we did to enact that change at...
Now, can you share a couple more details? Because I think when I go out now and
ask "what is SRE?", I'm not sure I'd get the same answer, and also maybe not the same starting point.
Can you be a little more specific on some of the,
like you said, you took some concrete problems and said,
SRE will fix it.
So what is this SRE now?
Is it automation in operations?
Is it sitting down and bringing the stakeholders together
and saying, let's define what our success criteria are
and let's figure out how we can achieve this?
What was it?
Yeah, that is indeed an interesting question, because we
were in a situation where we had had this DevOps transition, so we had quite some experience
with automation, and a whole department that handled things like Prometheus clusters and metrics tooling. So those kinds of traditional SRE innovations were not necessary in our company.
But as I mentioned, we did have this cloud transition.
And we also still have sort of an operations team that manages operations
for all the applications that still run in our own data center.
And in our cloud transition, we said, no, we want to do full DevOps.
Our teams should have complete ownership over operations
of their applications. But the problem we had was that
we had teams of like two, three people.
And if a team of two, three people
manages five different microservices,
then asking them to be on-call
puts quite a bit of strain on that team.
But the operations team was so busy
with managing all the DC applications
that they couldn't take that on-call either.
So there was no solution, basically,
in the company for on-call.
And we were going live with our cloud migration
and we were going live with critical services
in the cloud,
which we couldn't support outside of office hours.
So this was one of the problems where I said, well,
this is something we can fix from the SRE team.
We can build the tooling to enable people to take ownership
of that outside of office hours operations
and then facilitate the conversation between the software team
and the people who take that responsibility
who can do a full normal on-call rotation.
And that's what we did.
Basically, we built virtual teams of software engineers
who are available, who are organized per product domain, essentially per
value domain, and who take over the outside-of-office-hours responsibility for those teams.
And we support them from the SRE team by building tooling for them and helping them
innovate the processes that they use to do that on-call support.
So that means, if I
understand this correctly, your vision on your implementation of SRE
is that you as the SRE team,
you are kind of providing reliability as a service
to the teams, right?
Enabling them, obviously showing them how this works,
how to use the metrics.
I mean, how to get monitoring in there,
how to get alerting in there.
But I assume you also helped earlier
with architectural decisions, because SRE should not just be "I tell you faster that something fails",
which means you get triggered so much more often and then you have to work nights and weekends.
Exactly, exactly. Reliability starts with the first architectural decisions and it's a
continuous effort.
And I think that's where you also help. That's cool.
And then you have virtual teams that are kind of from a particular problem domain.
You pull them together and then they are in rotation
because obviously things will happen.
Exactly.
Does your SRE team then,
so your team is never on call.
That means you're really just providing the tooling
and the best practices
and the mentoring.
Are you also on call or not?
Yeah, what we do is we make sure that there's one SRE
in each of these rotation pools.
So that single person specializes in that domain
to enable those software engineers to run the shift together, essentially,
so that we also get the experience from the front line
and eat our own dog food in that sort of way.
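
The rotation pools described here can be illustrated with a minimal sketch. It assumes a plain weekly round-robin; the pool, the engineer names and the `validate_pool` and `weekly_rotation` helpers are all hypothetical and are not taken from the tooling Bart's team actually built.

```python
from dataclasses import dataclass
from itertools import cycle, islice


@dataclass(frozen=True)
class Engineer:
    name: str
    is_sre: bool = False


def validate_pool(pool):
    """Check the rule from the conversation: one SRE in each rotation pool."""
    return sum(1 for engineer in pool if engineer.is_sre) == 1


def weekly_rotation(pool, weeks):
    """Round-robin (an assumption): each week the next engineer takes the shift."""
    return [engineer.name for engineer in islice(cycle(pool), weeks)]


# Hypothetical pool for a made-up "checkout" product domain.
checkout_pool = [
    Engineer("ana"),
    Engineer("ben"),
    Engineer("chris"),
    Engineer("sam", is_sre=True),
]

schedule = weekly_rotation(checkout_pool, weeks=6)
print(validate_pool(checkout_pool), schedule)
```

With four people in the pool, each engineer is on call roughly one week in four, which is the kind of relief a two- or three-person team cannot get on its own.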
And, yeah, so, yeah, that's basically the biggest part of what we've been doing last year.
And at the same time, we're trying to facilitate the shift
of all the products to
collaborating using SLIs and SLOs, because, yeah, we still have a big challenge there.
And that's basically the other part of what we're doing. And this is also where the music comes
in, because that's lots of evangelization.
Do you think this is the trigger point for some live music now?
I think we can go there.
Maybe there's a little story that I want to tell about it and feel free to
cut it out if you want, because, you know,
I sing in an Irish folk band and I've always played in several bands all my
life. But to me, my work and my passion for
music was always separate, you know? And we have a director who's really focused on diversity and her
vision of diversity is, wouldn't it be great if we could enable everyone to bring their whole
self to the table every day?
And that got me thinking about what am I doing with this hard split between music and my
work because there's so much passion and energy I get from the music outside of my work.
Why not bring that into my work?
Now, that was scary as hell to me,
but she provided me with the inspiration to like,
okay, I'm going for this.
I'm bringing my whole self into work.
And then I did that at Agile Testing Days, and then at SLOConf I turned it up a notch.
And there are several internal presentations
where I've done several
kinds of songs.
And I want to tell you, I think this is awesome. And I think especially if
you become, let's say, an advocate, an influencer, a game changer, whatever it is, I think you're only
truly believed by people if you are true and if you are natural and if you are who you are.
And you should not be just,
this is what I am from nine to five.
And this is who I am the rest of the time.
So I think this is great.
And I told you earlier for me,
my passion in life besides my work is salsa dancing.
That's where I met my wife.
When I get on stage at some of the events and they have music, they typically play some salsa music and I just do some little moves.
But it's just because this is who I am, and I'm very proud of it and I love it, and it changed my life.
And I like the passion of Latin music. So I can encourage you, as you are producing more songs,
maybe at some point in the future you'll be producing some Latin beats, and I would be happy
to get on stage with you, maybe at Agile Testing Days,
because I know these guys as well. And then...
Nice.
You play, I bring my wife along, we do some salsa
dancing. My wife is also an SRE, or she's a DevOps engineer.
Yeah, nice.
And you know that Jose would love that.
Yeah, yeah, of course. So I love that idea. You know, the drummer of my Irish folk band is actually from the Caribbean,
so we bring in some Caribbean tunes there and I'm sure he'd be up for some salsa.
So I love this concept and I'm definitely going to come back to you on that.
As you can probably hear,
I've been getting more and more excited
about bringing this music into my work.
So I'm definitely planning to do
more of that.
We call it the SLO, the Salsa Level Objective.
See? That's nice!
Okay, you know what? I'm writing that down.
Or maybe in the end.
It's recorded, actually.
Yeah, that's exactly right.
Whether it's Salsa Level Objective or Salsa Latin Objective, we'll figure something out.
Yeah, yeah, exactly. So,
shall I go for the song?
I think you should go for the song. Do it. Show us your passion.
Yes.
So, yeah, the setting is that we're diving into the history of IT, guided by the Big Bang Theory, essentially.
The web then did explode
and it increased the level of loads,
and monoliths, they couldn't cope.
So surface, there's no microservices.
Cloud-contained dependencies
increased system complexity,
so now we all need SRE.
SRE!
Once upon a time there was a site to search upon,
and as they grew and grew, their operation game,
oh yeah, it grew along.
DevOps gave to friends to give their learnings proper names.
Our nursing, their driven ways to maximize their user happiness.
They put it in a book to give the world a proper look.
And yeah, they call it SRE.
SRE! SRE!
Well, now we call it SRE.
SRE!
It's a way to make our users more happy
and to maximize our innovation speed.
Goodbye to wrong incentives
and conflicting bad directives.
Our best and brightest figure
that everyone should really SRE.
Woo!
Yeah!
Now it's 2021 and SRE's been built upon.
A community of tribes and all
forcations are alive.
Poor performance is a quote.
I don't know what this is good, but now we have all SRE.
Embracing blameless failure modes,
next level infrastructure codes
is what we have at SRE.
So come and do it, SRE.
SRE!
SRE!
Woo!
This is awesome. This is just phenomenal. See, what's your colleague called who said, diversity, bring in your full self? What's her name?
Her name is Margaret [unclear].
We should thank her. I will thank her, because this is phenomenal, right? Because, see,
if we are in this as a whole, I think we can do more things than just
being put in a box between nine and five, right? And if we're just told what we're
supposed to do instead of us bringing in the best...
It's... Margaret, she's one of our directors.
I think she's [unclear], one of the IT directors.
Okay, we'll figure it out.
But she's awesome.
Awesome, yeah.
And there's this video where she tells this story,
and you should link it with the podcast.
Yeah, definitely. You know, you should send me
over all the links that we should add to it. Hey, Bart, now there's a lot of material out there from you. I can encourage
everyone: SLOConf, you call it The Game of SLOs, a three-part reliability musical. Just
phenomenal. More music from you, more stories, really nice. Then you also have Friendly Tech Chats. What is that about?
Yes, yes. So an ex-colleague of mine, he left the company, and we were like, yeah, but we had these amazing disagreements
all the time. How come now we can no longer disagree? It's like, okay, let's just do that online.
So we're both fairly experienced backend engineers, and we really focus on quality code
and what it takes to bring quality code to production quickly.
And yeah, so every week we have a new subject about that.
And we started this year,
and we love to get input and questions there.
And yeah, mostly we're just enjoying what we're doing,
but we could use a couple more opinionated people who tell us what we're saying wrong.
It's good that we are in a world where we don't all agree, because otherwise everybody is
the same and nobody is different, and then the world is really boring.
And in this case...
Yeah, and not only that, the world is too complex to grok.
So if we all agreed, we would all be wrong in the same way.
And we'd never get to a better place.
Hey, I want to ask you one
more question before we stop.
Because I've been advocating for SLOs, just as
you do, right? But the challenging thing is always: what is a good SLO? What are the SLOs I start with?
If I'm new to SLOs and I have to sit down and say, okay, what are the three SLOs I start with? I'm responsible for a particular service, a critical application.
What do you say?
Yeah, I start asking questions.
Yeah, okay, more questions.
So: is this a mostly
synchronous service? Is this HTTP endpoints, or does this mostly do ETL
or message processing?
And then, depending on the answers, for HTTP
you go for availability and latency, probably.
And for messaging,
I find that often messaging also benefits
from a process latency,
which is
an easier way to understand
throughput, I guess.
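
To make the HTTP case concrete, here is a minimal sketch of an availability SLI and a latency SLI computed from a batch of request records. The `Request` type, the sample data, the 300 ms threshold and both targets are assumptions chosen for illustration; the conversation does not name specific thresholds or objectives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to answer, in milliseconds


def availability(requests):
    """Fraction of requests that did not fail with a server error (5xx)."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms):
    """Fraction of all requests answered within threshold_ms."""
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


# Made-up sample: 97 fast successes, 2 slow successes, 1 server error.
requests = (
    [Request(200, 120.0)] * 97
    + [Request(200, 900.0)] * 2
    + [Request(503, 50.0)]
)

# Hypothetical objectives: 95% available, 99% answered within 300 ms.
availability_met = availability(requests) >= 0.95
latency_met = latency_sli(requests, threshold_ms=300.0) >= 0.99
print(availability_met, latency_met)
```

In this sample the availability objective is met while the latency objective is missed, which is the kind of signal that would prompt the reliability-versus-features conversation described earlier.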
I guess a little bit challenging to measure though, right?
Because if you send a message, how can you
measure when the whole process
that starts asynchronously is done?
Yeah, exactly. So you'll need
some timestamps and usually you need
to collaborate with more systems to get that data there.
But that's not a reason to not do it, of course.
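
The timestamp idea can be sketched as follows, assuming the producing system logs when each message was sent and a downstream system logs when processing finished, both keyed by a shared message ID. The function names, the message IDs and the 60-second limit are hypothetical.

```python
from datetime import datetime, timedelta


def process_latencies(sent_at, done_at):
    """Join the two event logs on message ID; one latency per completed message."""
    return {mid: done_at[mid] - sent_at[mid] for mid in sent_at if mid in done_at}


def fraction_within(latencies, sent_count, limit):
    """Share of all sent messages that finished within the limit.

    Messages that never finished count against the objective.
    """
    on_time = sum(1 for delta in latencies.values() if delta <= limit)
    return on_time / sent_count


t0 = datetime(2021, 6, 21, 12, 0, 0)
# Timestamps recorded by the producing system (made-up data).
sent_at = {"m1": t0, "m2": t0, "m3": t0, "m4": t0}
# Timestamps recorded by the last system in the chain; "m4" never completed.
done_at = {
    "m1": t0 + timedelta(seconds=30),
    "m2": t0 + timedelta(seconds=45),
    "m3": t0 + timedelta(minutes=10),
}

latencies = process_latencies(sent_at, done_at)
sli = fraction_within(latencies, len(sent_at), limit=timedelta(seconds=60))
print(sli)
```

Counting unfinished messages as misses is a design choice made here; it keeps a stuck pipeline from looking healthy just because nothing completed late.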
Yeah, very good.
Again, I think there's also a lot of great information in the rest of SLOConf.
So folks, if you have never heard about SLOConf,
even though we mentioned it in previous recordings,
it's a great conference that was initiated by Nobl9.
And I really liked the format.
It was like only 10 to 15 minute short presentations
where you really as a presenter had to think hard
on how to get the content that you want to transfer
in a short and concise way,
but makes it easier for the consumer
because we are, I think, all tired of listening
and watching videos
that are an hour long. Even though I make the same mistake, that I produce too much content that's too long,
I'll try to follow my own advice today and keep it short and precise.
Yeah, no, excellent plan, and I
wholeheartedly agree with that recommendation.
Yeah. Bart, before we part... "Bart, before we part", that almost sounds like something
that would go into the line of a song.
Yeah, maybe a salsa song. Maybe something,
but maybe then in Spanish, because, you know...
Yeah, but does it rhyme in Spanish?
That's a question.
Yeah, we'll figure it out. I'll ask my wife, or we can ask Jose. We'll figure it out.
So I want to just recap on some of the things
that I really liked about your story.
First of all, a transformation is never over.
There's a difference between DevOps and SRE.
I think what you did with SREs is that you said,
we had certain issues.
Let me propose to you how we can fix it.
And the way you addressed it
is your team is providing services,
tooling and services
and mentoring to your development team
so that they can take ownership
of the applications that are deployed in production.
I like the concept of how you do it.
You have virtual teams
where the engineers
from the different product teams that share the same problem domain come in,
together with an SRE, and then they rotate so that they're not on their own. Obviously, you are
an amazing singer and diversity rules. And I think diversity is getting us forward. Yeah.
Yeah.
Sounds good.
Thanks a lot for having me.
I've seriously enjoyed myself and I thought it was a really interesting conversation.
So thanks.
And I hope it's not the last one.
And let's promise to the world,
well,
maybe let's threaten the world, that at some point in the not too distant future
we get on stage together.
You sing, I dance,
and we have fun and we inspire people.
Yes, let's.
Let's do this.
Cool.
All right.
Now wave goodbye to everyone out there,
especially to Brian.
I'm so sorry you aren't with us today,
because you would have enjoyed this even more
than I know you typically do with these sessions.
Bye-bye.