PurePerformance - Making the case for SRE in a DevOps organization with Bart Enkelaar

Episode Date: June 21, 2021

How do you convince an organization that just went through a 2 year DevOps transformation to continue the journey by applying SRE practices? What is SRE anyway? What are good SLOs? And how do you get development teams to take responsibility for their code in production?

Bart Enkelaar, Lead Site Reliability Engineer at bol.com, not only got their organization to apply SRE practices and define good SLOs, but also got dev teams to rotate on-call duties. He also followed the advice of Margaret, Chief Platform Officer, to bring his personal passion to the job. This led to inspiring and educating the community about SRE and SLO through music. To see what I mean, check out Bart's The Game of SLOs, a three part reliability musical from SLOConf, or his funny tech conversations at Friendly Tech Chats.

LinkedIn - https://www.linkedin.com/in/bart-enkelaar-02242710/
Margaret, Chief Platform Officer - https://www.youtube.com/watch?v=hy1gUEhbnBM
Game of SLOs: A 3 part reliability musical - https://www.youtube.com/watch?v=Y53Pho93i-k
Friendly Tech Chats - https://www.youtube.com/channel/UChHHWkO537q6Yp2dXtJpOzQ/featured

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everyone, and welcome to another episode of Pure Performance. You may wonder why it is my voice, Andy's, and not that of Brian Wilson, who typically does the introduction before passing it over to me. Well, today he's not here, and he will be very, very sorry, because of the guest who is next to me. I think it's the first time we'll have a musical on the Pure Performance podcast, and I'm so happy that I found you. I mean, I didn't exactly find you, but Bart Enkelaar, hopefully I pronounced that correctly? Yeah, that's absolutely fine. Perfect. Bart, I saw your performance at SLOConf and I was blown away by the way you transported an amazing message, the way you became an advocate for SLOs, for service level objectives, but taking it
Starting point is 00:01:22 packaging it up in a musical. Because I think people have different ways of memorizing things. And I think music is a great way to transport things. And I'm just very happy that you then decided and agreed to come on a podcast with us to talk about everything that gets you excited and got you excited to make this presentation, to write a song, to write a
Starting point is 00:01:45 musical about SLOs. I think there's more about SLOs. There's SRE. I mean, you are, who are you, by the way? Let's say it that way. Introduce yourself to the audience, and then we'll dive into the topic. I think there will be some music later on, and we'll see where the conversation takes us.
Starting point is 00:01:59 Yeah, let's go for that. So thanks a lot for that great introduction, Andy. It's quite an introduction, I must say. So indeed, my name is Bart Enkelaar, and I'm a lead site reliability engineer at bol.com, which is the largest online retailing platform of the Netherlands and Belgium. So we're basically the Gaulish village against Amazon in the Netherlands and Belgium, essentially. So yeah, I've been a backend engineer mostly since 2008 and joined bol.com six years ago, and really developed, got more and more interested
Starting point is 00:02:47 in the operational side of things as I did different things at those companies. And we basically started our DevOps journey around 2016. And we consider that to be basically done in 2018-ish. And after we wrapped up that project of the big DevOps transformation, we noticed that our attention to operational standards
Starting point is 00:03:14 and to the reliability of different parts of our systems was slowly declining. And also, like many tech companies, we've been growing every year. So the performance problems that we were having were expanding. And we really went looking for a new way to take a next step in the way we balance that reliability with our high innovation needs. And that's how we arrived at SRE. And also, that was around 2018, I think,
Starting point is 00:03:53 that we started experimenting with it. And around that same time, we also started moving to the Google Cloud. So having some connections with Google, who of course championed SRE quite heavily through their CRE program, made that connection fit extra well. So, yeah, that's when we started on that journey. It's like, man, this is what we want to do. And I personally was really enamored with this because I have a history of backend engineering, right? So the endless, how do you call it, tug-of-war
Starting point is 00:04:27 between the PO and the team was like, no, we need to focus more on improving the technical health of our system. But no, we need more features, you know? And I was always slightly frustrated by the fact that that was a struggle. Because it was like, but we have the same goals here, so why don't we agree, you know? And then suddenly in SRE I saw this solution, like, yes, this is it: that realignment on user happiness. So then more and more people started getting interested at bol.com. And we actually did a pilot that failed for all kinds of reasons.
Starting point is 00:05:14 And at that point, some people were thinking like, OK, this SRE thing is maybe not for bol.com. But we have a failure here. That is awesome, because that means we can learn. So I turned that failure into a big evaluation, added that to a big plan, and sent that plan to the board. And they were like, oh yeah, this is a good idea. And here we go, and now we're doing more and more SRE every day. But this also means your company is definitely mature enough to actually allow failure. And I think this is part of the cultural transformation as well, that you're not just afraid and just assessing risk all the time and saying, no, we can't do this because eventually we'll fail. I think it's about, you know, embracing failure. Not that you want to fail all the time, but as you said, you
Starting point is 00:05:56 want to learn from failure. I have one question, because you said you had the DevOps transformation, it took about two years, and you considered it done, even though we all know it's always a journey. There's no such thing as "this is done". The way I try to explain and figure out what's DevOps and SRE, I always think that DevOps is really people using
Starting point is 00:06:18 automation to speed up delivery on the one side, really improving lead time for change, really using automation to get features out into production fast. On the other side, I see SRE coming obviously from the operations side and saying, hey, we're constantly having things change, how can we now use automation to keep the system resilient even though changes are much faster than ever before? And kind of in the middle, right between these two teams, I think there's an SLO.
Starting point is 00:06:46 Because SLOs are really what is kind of the contract and kind of the common goal for everyone, because you want to deliver services that make your end users happy. And you can measure this either using, I don't know, page load times or conversion rate for the business, but it can also be a technical metric like, you know, how resource hungry are we?
Starting point is 00:07:07 And are we still making money? Or does the infrastructure cost more than before? How often do we fail and stuff like this? And I'm sure there are different analytics. But in the end, for me, at least the way I take it, SLOs, service level objectives, are a great way to align everyone, whether it's DevOps and SRE, who kind of use automation from two different sides, or whether it's the business that obviously wants to get more and more features out.
Starting point is 00:07:32 But in the end, we need to agree on a couple of indicators. And this is stuff we want to deliver, right? Because we assume if we deliver on these promises, then we'll make more money and everybody's happy. Yeah. And the key here, I think, is that these indicators and these objectives are part of the process, right? So this is also not a one-time agreement and then we all know what to do, because they will not match everyone's expectation at some point. And that is a problem that is just as much a bug as a feature bug would be. And, yeah.
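To make the idea of agreeing on a couple of indicators a little more concrete, here is a minimal Python sketch. It is not something shown in the episode. It treats an indicator as the share of good events and an objective as a target for that share. The class name, the objective names, the event counts, and the target values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevelObjective:
    """A target for one indicator, e.g. 'share of page loads that were fast'."""
    name: str
    target: float  # hypothetical target: 0.95 means 95% of events must be good

    def indicator(self, good_events: int, total_events: int) -> float:
        """The indicator: fraction of good events (1.0 when there was no traffic)."""
        if total_events == 0:
            return 1.0
        return good_events / total_events

    def is_met(self, good_events: int, total_events: int) -> bool:
        """True when the measured indicator reaches the agreed target."""
        return self.indicator(good_events, total_events) >= self.target


def review(objectives, counts):
    """Map each objective name to whether it was met.

    `counts` maps an objective name to a (good_events, total_events) pair.
    """
    return {slo.name: slo.is_met(*counts[slo.name]) for slo in objectives}


# Illustrative numbers only, not data from the episode.
objectives = [
    ServiceLevelObjective("fast-page-loads", target=0.95),
    ServiceLevelObjective("successful-checkouts", target=0.99),
]
counts = {
    "fast-page-loads": (9_700, 10_000),
    "successful-checkouts": (4_900, 5_000),
}

for name, met in review(objectives, counts).items():
    print(f"{name}: {'met' if met else 'missed'}")
```

Because the targets are plain data, revisiting them when they stop matching expectations is a small change, which fits the point that objectives are part of an ongoing process rather than a one-time agreement.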
Starting point is 00:08:10 Exactly. So now, Bart, from your experience, now you said you were a backend engineer, but you were interested in operations, then SRE came along. I think some of our viewers kind of have the same thing, but SRE is the big new thing. I see a lot of people on LinkedIn changing their profiles.
Starting point is 00:08:25 First, they all became DevOps engineers. Now everybody tries to become a site reliability engineer. A lot of people have no real, I mean, sorry, I don't want to offend anybody, but some people just put it into their title, even though they may not really know what it really means or because their organizations say we need to up-level, we need to show the world that we're doing it.
Starting point is 00:08:42 How have you actually brought in SRE, and how do you explain SRE? How can you motivate people within an organization to actually become a change agent? I know you could probably write a song about it too. Yeah. But what are the things that you did? How did this start, and what can others learn from that? Yeah, so I actually... One of the presentations that I gave at several conferences last year was The Case for SRE. And when I gave that at Agile Testing Days, I started it with this first...
Starting point is 00:09:25 with the first parts of the song that became the first scene of the musical later. My main case for SRE as such was that this is an industry standard that is developing,
Starting point is 00:09:43 but that of course can be a bit of a double-edged sword because it might be a hype. I don't think it is, but people can waylay the argument that way. But the fact that it is an industry standard means that there are external specialists who think this is a good idea and who can be brought in to help the transition. And I think that's a good thing.
Starting point is 00:10:08 And then at the same time, we were actively noticing these problems with the reliability of our platform. So not only was this an external solution, it's also a solution that addresses a concrete problem that we had. And what I did in the plan that I sent to the board, and that I used to convince them to give us resources to actually start doing something in the company and start this change, was that I explicitly took a couple of concrete problems that we had in the company, and I said, SRE can help solve this.
Starting point is 00:10:54 So taking this transition, this basically mind shift that you have to push out to the whole organization, and bringing the goals down from "we should all do everything differently" to "these are small concrete problems that we can fix as a first step", which incidentally are also the first steps on our journey to getting to this SRE mindset, that was, I think, the most successful thing that we did to enact that change. Now, can you share a couple more details? Because I think when I go out now and ask what SRE is, I'm not sure I'd get the same answer, and also maybe not the same starting point. Can you be a little more specific on some of the,
Starting point is 00:11:47 like you said, you took some concrete problems and said, SRE will fix it. So what is this SRE now? Is it automation in operations? Is it sitting down and bringing the stakeholders together and saying, let's define what our success criteria are and let's figure out how we can achieve this? What was it? Yeah, that is indeed an interesting question, because we
Starting point is 00:12:09 were in a situation where we had had this DevOps transition, so we had quite some experience with automation and a whole department that handled things like Prometheus clusters and metrics tooling. So those kinds of traditional SRE innovations were not necessary in our company. But as I mentioned, we did have this cloud transition. And we also still have sort of an operations team that manages operations for all the applications that still run in our own data center. And in our cloud transition, we said, no, we want to do full DevOps. Our teams should have complete ownership over operations of their applications. But the problem we had was that
Starting point is 00:13:06 we had teams of like two, three people. And if a team of two, three people manages five different microservices, then asking them to be on-call puts quite a bit of strain on that team. But the operations team was so busy with managing all the DC applications that they couldn't take that on-call either.
Starting point is 00:13:29 So there was no solution, basically, in the company for on-call. And we were going live with our cloud migration and we were going live with critical services in the cloud, which we couldn't support outside of office hours. So this was one of the problems where I said, well, this is something we can fix from the SRE team.
Starting point is 00:13:50 We can build the tooling to enable people to take ownership of that outside of office hours operations and then facilitate the conversation between the software team and the people who take that responsibility who can do a full normal on-call rotation. And that's what we did. Basically, we built virtual teams of software engineers who are available, who are organized per product domain, essentially per
Starting point is 00:14:25 value domain, who take over the outside-of-office-hours responsibility for those teams. And we support them from the SRE team by building tooling for them and helping them innovate the processes that they use to do that on-call support. So that means, if I understand this correctly, your vision for your implementation of SRE is that you as the SRE team are kind of providing reliability as a service to the teams, right? Enabling them, obviously showing them how this works,
Starting point is 00:14:57 how to use the metrics. I mean, how to get monitoring in there, how to get alerting in there. But I assume you helped earlier also with architectural decisions, because SRE should not just be: I tell you faster that something fails, and that means you get triggered so much more often and have to work nights and weekends. Exactly, exactly. Reliability starts with the first architectural decisions and it's a continuous effort. And I think that's where you also help. That's cool.
Starting point is 00:15:27 And then you have virtual teams that are kind of from a particular problem domain. You pull them together and then they are in rotation because obviously things will happen. Exactly. Does your SRE team then, so your team is never on call. That means you're really just providing the tooling and the best practices
Starting point is 00:15:45 and the mentoring. Are you also on call or not? Yeah, what we do is we make sure that there's one SRE in each of these rotation pools. So that single person specializes in that domain to enable those software engineers to run the shift together, essentially, so that we also get the experience from the front line, essentially,
Starting point is 00:16:14 and eat our own dog food in that sort of way. And, yeah, that's basically the biggest part of what we've been doing last year. And at the same time, we're trying to facilitate the shift of all the products to collaborate using SLIs and SLOs, because, yeah, we still have a big challenge there. And that's basically the other part of what we're doing. And this is also where the music comes in, because that's lots of evangelization. Do you think this is the trigger point for some live music now? I think we can go there.
Starting point is 00:16:50 Maybe there's a little story that I want to tell about it and feel free to cut it out if you want, because, you know, I sing in an Irish folk band and I've always played in several bands all my life. But to me, my work and my passion for music was always separate, you know? And we have a director who's really focused on diversity and her vision of diversity is, wouldn't it be great if we could enable everyone to bring their whole self to the table every day? And that got me thinking about what am I doing with this hard split between music and my
Starting point is 00:17:37 work because there's so much passion and energy I get from the music outside of my work. Why not bring that into my work? Now, that was scary as hell to me, but she provided me with the inspiration to go like, okay, I'm going for this. I'm bringing my whole self into work. And then I did that at Agile Testing Days, and then at SLOConf I turned it up a notch. And there are several internal presentations
Starting point is 00:18:03 where I've done several kinds of songs. And I want to tell you, I think this is awesome. And I think especially if you become, let's say, an advocate, an influencer, a game changer, whatever it is, I think you're only truly believed by people if you are true and if you are natural and if you are who you are. And you should not be just, this is what I am from nine to five, and this is who I am the rest of the time. So I think this is great.
Starting point is 00:18:32 And I told you earlier, for me, my passion in life besides my work is salsa dancing. That's where I met my wife. When I get on stage at some of the events and they have music, they typically play some salsa music and I just do some little moves, but it's just because this is who I am and I'm very proud of it and I love it and it changed my life and I like the passion of Latin music. So I can encourage you, as you are producing more songs, maybe at some point in the future you're producing some Latin beats, and I would be happy to get on stage with you, maybe at Agile Testing Days,
Starting point is 00:19:05 because I know these guys as well. And then... Nice. You play, I bring my wife along, we do some salsa dancing. My wife is also, she's an SRE, or she's a DevOps engineer. Yeah, nice. And you know that Jose would love that. Yeah, yeah, of course. So I love that idea. You know, the drummer of my Irish folk band is actually from the Caribbean, so we bring in some Caribbean tunes there and I'm sure he'd be up for some salsa. So I love this concept and I'm definitely going to come back to you on that. As you can probably hear, I've been getting more and more excited about bringing this music into my work.
Starting point is 00:19:48 So I'm definitely planning to do more of that. We call it the SLO, the Salsa Level Objective. See? That's nice! Okay, you know what? I'm writing that down. Or maybe in the end. It's recorded, actually. Yeah, that's exactly right.
Starting point is 00:20:06 Whether it's Salsa Level Objective or Salsa Latin Objective, we'll figure something out. Yeah, yeah, exactly. So, shall I go for the song? I think you should go for the song. Do it, show us your passion. Yes. So, yeah, the setting is that we're diving into the history of IT, guided by the Big Bang Theory, essentially. The weapon did explode and it increased the level of loads and monoliths they couldn't cope So surface, there's no microservices Cloud-contained dependencies increased system complexity So now we all need SRE SRE! Once upon a time there was a site to search upon And as they grew and grew their operation game Oh yeah, it grew along DevOps gave to friends to give their learnings proper names
Starting point is 00:21:23 Our nursing, their driven ways to maximize their user happiness. They put it in a book to give the world a proper look. And yeah, they call it SRE. SRE. SRE. Well now we call it SRE. SRE. It's a way to make our users more happy and to maximize our innovation speed. Goodbye to wrong incentives and conflicting bad directives. Our best and brightest figure that everyone should really SRE. Woo!
Starting point is 00:22:11 Yeah! Now it's 2021 and S.R.E.'s been built upon a community of tribes and all forcations are alive. Poor performance is a quote. I don't know what this is good, but now we have all SRE. Embracing blameless failure modes. Next level infrastructure codes is what we have at SRE.
Starting point is 00:22:38 So come and do it, SRE. SRE! SRE! Woo! This is awesome. This is just phenomenal. See, what's your colleague called who said, diversity, bring in your full self? What's her name? Her name is Margaret for ha. We should thank her. I will thank her, because this is phenomenal, right? Because, see, if we are in this as a whole, I think we can do more things than just being put in a box between nine and five, right? And if we're just told what we're
Starting point is 00:23:16 supposed to do instead of us bringing in the best. It's not Margaret, she's one of our directors. I think she's for state of her heart is one of the IT directors. Okay, we'll figure it out. But she's awesome. Awesome, yeah. And there's this video where she tells this story, and you should link it with the podcast. Yeah, definitely. You know, you should send me over all the links that we should add to it. Hey, Bart, now there's a lot of material out there from you. I can encourage everyone: SLOConf, you call it the Game of SLOs, a three-part reliability musical. Just phenomenal. More music from you, more stories, really nice. Then you also have Friendly Tech Chats. What is that about? Yes, yes. So an ex-colleague of mine, he left the company, and we were like, yep, but we had these amazing disagreements
Starting point is 00:24:12 all the time. How come now we can no longer disagree? It's like, okay, let's just do that online. So we're both fairly experienced backend engineers, and we really focus on quality code and what it takes to bring quality code to production quickly. And yeah, so every week we have a new subject about that. We started this year, and we love to get input there and questions there. And yeah, mostly we're just enjoying what we're doing, but we can use a couple more opinionated people who tell us about what we're saying wrong.
Starting point is 00:25:04 It's good that we are in a world where we don't all agree, because otherwise everybody is the same and nobody is different and then the world is really boring. And in this case... Yeah, and not only that, the world is too complex to grok. So if we all agreed, we would all be wrong in the same way. And we'd never get to a better place. Hey, I want to ask you one more question before we stop this, because I've been advocating for SLOs just as
Starting point is 00:25:35 you do, right? But the challenging thing is always: what is a good SLO? What are the SLOs I start with, right? If I'm new to SLOs and I have to sit down and say, okay, what are the three SLOs I start with, and I'm responsible for a particular service, a critical application, what do you say? Yeah, I start asking questions. Yeah, okay, more questions. So, is this a mostly synchronous service? Is this HTTP endpoints, or does this mostly do ETL or message processing? And then depending on the answers: for HTTP, you go for availability and latency, probably. And for messaging,
Starting point is 00:26:20 I find that often messaging also benefits from a process latency, which is an easier way to understand throughput, I guess. I guess that's a little bit challenging to measure though, right? Because if you send a message, how can you measure when the whole process
Starting point is 00:26:37 that starts asynchronously is done? Yeah, exactly. So you'll need some timestamps and usually you need to collaborate with more systems to get that data there. But that's not a reason to not do it, of course. Yeah, very good. Again, I think there's also a lot of great information in the rest of SLOConf. So folks, if you have never heard about SLOConf,
Starting point is 00:27:01 even though we mentioned it in the previous recordings, it's a great conference that was initiated by Nobl9. And I really liked the format. It was only 10 to 15 minute short presentations where you as a presenter really had to think hard about how to get the content that you want to transfer across in a short and concise way, but that makes it easier for the consumer
Starting point is 00:27:21 because we are, I think, all tired of listening to and watching videos that are an hour long, even though I make the same mistake, that I produce too much content that's too long. But I try to follow my own advice today and keep it short and precise. Yeah, no, excellent plan, and I wholeheartedly agree with that recommendation. Yeah. Bart, before we part... "Bart, before we part", that almost sounds like something that would go into the line of a song. Yeah, maybe a salsa song. Maybe, but maybe then in Spanish, because, you know... Yeah, but does it rhyme in Spanish? That's the question. Yeah, we'll figure it out. I'll ask my wife or we can ask Jose. We'll figure it out.
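The process latency idea discussed a few minutes earlier, where timestamps are collected from more than one system, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not bol.com's implementation: the function names, the message ids, and the 60 second threshold are hypothetical, and the clocks of the system that receives a message and the system that finishes processing it are assumed to be in sync.

```python
from datetime import datetime, timedelta


def processing_latencies(received, completed):
    """Pair up timestamps by message id.

    `received` and `completed` map a message id to the moment each system
    saw it. Messages that have not completed yet are left out.
    """
    return {
        message_id: completed[message_id] - received_at
        for message_id, received_at in received.items()
        if message_id in completed
    }


def latency_indicator(latencies, threshold):
    """Fraction of completed messages processed within the threshold."""
    if not latencies:
        return 1.0
    within = sum(1 for latency in latencies.values() if latency <= threshold)
    return within / len(latencies)


# Illustrative timestamps only: one system logs receipt, another logs completion.
start = datetime(2021, 6, 21, 12, 0, 0)
received = {"order-1": start, "order-2": start, "order-3": start}
completed = {
    "order-1": start + timedelta(seconds=5),
    "order-2": start + timedelta(seconds=90),
}

latencies = processing_latencies(received, completed)
print(latency_indicator(latencies, threshold=timedelta(seconds=60)))
```

One design choice to be aware of: messages that never complete are simply skipped here, so a stuck message would not lower the indicator. A real setup would probably also count messages that stay unfinished past the threshold as bad events.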
Starting point is 00:28:06 So I want to just recap on some of the things that I really liked about your story. First of all, a transformation is never over. There's a difference between DevOps and SRE. I think what you did with SREs is that you said, we had certain issues. Let me propose to you how we can fix it. And the way you addressed it
Starting point is 00:28:27 is your team is providing services, tooling and services and mentoring to your development team so that they can take ownership of the applications that are deployed in production. I like the concept of where you do it. You have virtual teams where the engineers
Starting point is 00:28:45 from the different product teams that, it seems, share the same problem domain come in, together with an SRE, and then they rotate so that they're not on their own. Obviously, you are an amazing singer, and diversity rules. And I think diversity is getting us forward. Yeah. Yeah. Sounds good. Thanks a lot for having me. I've seriously enjoyed myself and I thought it was a really interesting conversation. So thanks.
Starting point is 00:29:15 And I hope it's not the last one. And let's promise the world, well, maybe let's threaten the world, that at some point in the not too distant future we get on stage together. You sing, I dance, and we have fun and we inspire people. Yes, let's.
Starting point is 00:29:31 Let's do this. Cool. All right. Now wave goodbye to everyone out there, especially to Brian. I'm so sorry you aren't with us today, because you would have enjoyed this even more than I know you typically do with these sessions.
Starting point is 00:29:44 Bye-bye.
