PurePerformance - The many facets of an SRE with Alexandra Franz

Starting point is 00:00:00 It's time for Pure Performance. Get your stopwatches ready. It's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson. And as always I have with me my mocker in chief with his mock turtleneck on. Andy Grabner, how are you doing today, Andy? I'm feeling almost the same as with our previous recording because I'm still,

Starting point is 00:00:39 what did you call it, a fashion statement? I have my fashion statement around my neck here. It still tries to keep my neck warm because I want to make sure my muscles are all good for tomorrow and the weekend ahead, because we're going on the ski slopes. So you are, you're trying to make sure that you are ready for the events coming up. That my body stays resilient for everything that is thrown towards me, kind of, right?

Starting point is 00:01:09 You want to make sure that, whatever condition, rough snow patches, icy or just perfect powder, we can kind of swiftly get through it. And it's kind of like a better segue than with the previous recording that we did, right? Yes, much better. Yes, much better. Hey, without further ado, the topic of today is focusing around resiliency, reliability, site reliability.

Starting point is 00:01:36 We found a great guest, Alexandra Franz. Alex, servus. Hi. Hey, hey, thank you so much for being here. You are, and I, this now, kind of, I'm going to reveal the secret to Brian. Because before we hit the record button, I was saying Alexandra Franz, it sounds like such an Austrian name. And while you do work in Austria, you're not from Austria. But Brian, for you now, she has the same country of origin as our guest from yesterday's recording.

Starting point is 00:02:05 You know, right before we started recording, I was just thinking, maybe she's Romanian. You know, it was, it popped in my head. I was like, well, let me wait and see. So, okay, yeah, perfect. So I'm talking quite a bit about, well, a little bit about Romania. Yeah, yeah. Did you figure out all the places you had a beer in Romania? I figured it out.

Starting point is 00:02:28 I looked at my Beer With Me app yesterday and I looked at all of the different places. And I'm looking forward to Bucharest for Cloud Native Days Romania next year, because I'm sure I will log another beverage there. But now, Alex, thank you so much for being here. You are a site reliability engineer at Dynatrace, but can you tell us a little bit more about your background? I just revealed you are from Romania originally.

Starting point is 00:02:52 Maybe a quick overview of what brought you to Austria and what brought you to become an SRE. Hi, thanks for having me today, actually. Yeah, as you said, I'm from Romania. And I would say probably my journey to SRE, it's actually pretty interesting. And also my journey to Austria. I came to Austria because I studied aerospace engineering,

Starting point is 00:03:12 and I actually came to work in an aerospace company, more or less as a systems engineer for aviation. And after some years, somehow the clouds brought me to the other clouds, and now I'm working with our cloud providers, right, and working more in software, but for cloud providers, keeping more in touch with technology and newer stuff, basically, than what aviation brings in terms of software, let's say. Really cool.

Starting point is 00:03:43 And folks, as always, we will share a lot of links, wonderful links. Alex, if you're okay with it, we will share your LinkedIn profile so people can also see where you came from. Really great, though, right? We have two Romanians back-to-back on our podcast. I mean, Diana, whom we had yesterday, I think she's now in Spain. I think that's what she said. You made it to Austria.

Starting point is 00:04:09 We're obviously happy that you're here. You're working out of our Vienna office. Yes, exactly. Yeah. So today's topic, and for everybody that is listening in, Alex and I, we sat together a couple of weeks ago in the Vienna office, and we've been bouncing the idea around for quite a while because I wanted to learn more about SRE at Dynatrace.

Starting point is 00:04:32 I've been talking about SRE quite a bit over the years, but it's always great to hear how things get applied, what we learn internally. And so I have a couple of topics for you that I want to discuss. But before we go into the topics, I first want to get a quick overview of what the responsibilities are overall. I know you're a big team, but overall, can you quickly highlight what are some of the responsibilities of an SRE in our organization?

Starting point is 00:05:00 I think in big words, and I would now paraphrase one of our colleagues, SREs at Dynatrace take care of the money printer, right? That's a good way of putting it. But in a nutshell, let's say, SREs are taking care of the whole production environments, and not only production, also pre-production, but production mainly, where we are making sure we are delivering the software in a safe manner

Starting point is 00:05:29 to our customers and making sure it's arriving in time, but also taking into consideration any other blockers that are coming along the way. We are taking care of monitoring, making sure the systems are staying healthy, green, and in case of any issues, reacting, investigating, and making sure we are solving everything in time. And also, in addition to that, taking costs into consideration, taking scaling into consideration, making sure we are prepared for any events that are coming towards us and planning ahead for that, but also being reactive if anything is needed and being there basically in support.

Starting point is 00:06:01 So I would say SREs at Dynatrace are pretty busy, because we are always in some kind of activity and task and making sure that everything is just running without influencing or being seen by people at the other end of the software, let's say. I think it was well put and well explained. Just one more clarification question, because we are, like obviously many, many organizations in our observability space, operating a SaaS business 24-7. Does this mean your team is also structured globally around the world? Can you give us a quick overview? Good hint on it. Yes. So this is also one of the cool things about being an SRE. We are not just

Starting point is 00:06:48 in one place and one office. We are across the globe. So we have people across Austria, actually spread across different offices in Austria, but we also have people in Detroit, in the NORAM area, and also in APAC. So we also have people in Sydney, we also have people in Detroit, we also have some people in Texas, for example, even California. So we are quite spread out and we are a big team. Also making sure that like this we can provide support 24-7 in a sense. But in addition to that, we are also on call. So that means also in the out-of-office hours, we are there to support the customers

Starting point is 00:07:22 and also to make sure that the platform is running smoothly. So we are there more or less all the time. And this also makes it interesting because it's not just about the challenges that you have every day. You will have different challenges also in collaboration and also in communication and, yeah, it brings a bigger package, actually, into the view as an SRE. Yeah, you know, Andy, I never thought of SRE from this point of view before, but the challenge, you know, I always think of SREs in a single office. It looks like Andy's having some audio problems, maybe. But, okay. So, Andy, one of the things that this brings up is a new idea that I hadn't thought about before.

Starting point is 00:08:05 and I'll share because I'm not sure if other people thought about it. I always think of SREs in like a single office and a single, you know, NOC or, you know, whatever it might be, right? But talking about all the different office locations, all the, you know, the 24-7 coverage and all that, obviously whenever you're setting up your practices for SRE now, not only is that you and your team figuring that out, but that's something you have to coordinate and make work with all the different offices. And I'm sure people in different countries, you know, are going to have different ideas of how to approach it. So it becomes a lot more of a coordinated effort, a lot more political, not in the election sense, but in terms of: we want to do it this way. Well, we over here want to do it this way. It just broadens the scope of what you have to do as an SRE in that situation. Just fascinating.

Starting point is 00:08:55 Yeah. And I think that, I mean, we are obviously providing a SaaS-based service. But if I look back at my early days, when we started developing software, the software that we built, right, that was all: we built it, we tested it, we shipped it, and operation was done by somebody else. And even these companies, when they operated,

Starting point is 00:09:20 most of them had their nine-to-five operational schedule, right? But nowadays, everything is available all the time. And this is why I think it's so great to have you on this call, because I want to learn from you what it really is that SREs are doing. And so others that listen in can learn. Now, Alex, I want to jump to the first topic. It is a topic that comes up throughout the year, and I think we have seen some changes,

Starting point is 00:09:48 but we've always talked about single-day events or special events, like a Black Friday, a Cyber Monday. And I think things have changed a little bit. I remember 20 years back, or maybe it was 15 years back, when I started talking about this event, it was really Black Friday, then Cyber Monday, and then we saw things shifting. What I would like to know about is how do you see, from an SRE perspective,

Starting point is 00:10:20 these things changed? Are they really still single-day events, or do they happen all of the time? How do you prepare for this? What have we learned about it? Any insights that you can give us from an SRE perspective? I think it's very interesting, actually, how we as consumers are also changing over the years, right? And this is a pattern we are seeing as well for us, in the way we are also seeing for

Starting point is 00:10:43 our customers how things are behaving, because we also have a platform for a lot of retail customers. And every year we are always learning how we should prepare better next year, how we should be better, actually, in the whole organization of big events. And one year maybe it started with just Black Friday and then we had Cyber Monday, as you said, and what we've also seen this year is that it's not about just one day, it's actually a full month of growth

Starting point is 00:11:12 because now it's not just about Black Friday being on Friday. Black Friday actually starts, depending on the retailers, depending on the areas, depending on the regions, way earlier. Some of them even start on the 1st of November, right? And then what do you do? Because you cannot just wait and see, so you prepare somehow in advance, even now, because we already saw also from experience

Starting point is 00:11:31 that all of these shopping and buying and discounts that you can see actually online are not starting now just in one day. It's a continuous growth over the time and it's not something you can say, okay, it's done now and then tomorrow it's gone. It's actually a continuous growth

Starting point is 00:11:49 which goes down a bit and then, depending on where it is, like if it's Friday, Saturday, Sunday, you'll have a connection also with Christmas, right? Because then people are buying closer to Christmas, then there's Boxing Day depending on the regions, and depending on that we have to slowly adapt to those situations

Starting point is 00:12:06 and learn how we can actually make better use of our infrastructure for that. So yeah, it's an interesting pattern how every year changes, so I'm curious what next year will bring for us in that area. Yeah, and I guess, I mean, knowing that this podcast will air around the time of another big, big event,

Starting point is 00:12:27 the Super Bowl is coming up, and I remember in the early days, Brian, we always talked about how the Super Bowl ads were bringing down websites because they caused spike traffic. Alex, is the Super Bowl also an event that we see?

Starting point is 00:12:43 I think last year we saw some spikes or some closer spikes we also saw this year in the periods when there were the games actually during the evenings especially for the streaming stuff we could also see when the bigger games were there that there is more traffic more people actually using their platforms so we could also see some increases

Starting point is 00:13:00 on our side. For sure Super Bowl will also bring some more traffic on our side as well and we will see something in there. So I'm guessing also depends on the outcomes of what happens for the finals right, right? Yeah, true too. And then another big event next year

Starting point is 00:13:16 now that I think of it, the World Cup is coming up. Yeah. That's more like what Black Friday looks like now because the World Cup is not just one day. It's a series of days, whereas the Super Bowl is a single-day event. So it's interesting how things will be spread differently for the different events. It's also depending, right, because you will have one-day events, right?

Starting point is 00:13:38 So then for those one-day events, it's also depending on the peak hours of the regions. Because the Super Bowl will be very specific to NORAM, so you will see more increase there. Same goes for games. The World Cup is a bit different because it will be more spread out across the regions where people are watching. So usually the spikes really depend on the areas and the regions where people are actually more into those specific sports, which is also a very interesting pattern to observe. So, Alex, if you think, go ahead.

Starting point is 00:14:07 I was going to ask, you know, this idea, as you were talking about Black Friday Cyber Monday, you know, before online retail, it was always in store in person, right? So what I'm getting at is I wonder how much of the spread of those becoming, month-long events versus a single-day event because even online it used to be like they were Cyber Monday, right? And obviously if you're expanding that time frame for the sales, it's probably good for business, right? But I also wonder how much of that expansion might be tied to the fact that so many sites would be crashing on those big day events. So they would be like, hey, let's spread this out over a few days so it's not everybody coming at once. Was there part of it?

Starting point is 00:14:52 It's part of the, and I don't know if we have an answer to this, but as part of that spread, of these big day events because of it's not just thousands of people go in a store, it's millions of people go on a website, crashing it, taking it down, causing devastation to the internet infrastructure, and they figured, hey, if we buffer it over the course of a few weeks, we can handle it better. I don't know. Any thoughts on that? I'm sure we don't have any information on that.

Starting point is 00:15:21 To be honest, I don't, but it's also an assumption worth making, because at the end of the day, we all learn from our experiences. So everyone learns. If you look at it from the consumer side, you can also say the longer you have it, the more people you will have actually doing the shopping. So then it's always depending on where you look at it from, right? I think you have a point here, Brian,

Starting point is 00:15:41 because think about it, the retailers are not the only ones that see the spike. Because if you buy something, it needs to be produced. If it's not already produced, it needs to be shipped. That means if you can spread or shed the load, if you can spread the load out

Starting point is 00:15:55 across multiple days, weeks or months, then it's better for everybody. Alex, you brought up a really interesting point and I think this brings me to the next topic. You said some of these events are obviously regional. How does an SRE plan for capacity if you have to factor in all of these things? How do you plan with the cloud vendors?

Starting point is 00:16:19 How do you figure out how to scale, when to scale? Are there any lessons learned, any things where you say, I wish I had known this earlier? What are some of the things that are challenging in that respect? For the big events, it's always planned, right? So we always know, based on last year's data, last year's metrics, all the numbers we accumulated over the last year.

Starting point is 00:16:52 Same as how it was for Black Friday. We knew from the year before approximately how much load we can expect. We know that we have somewhere around 30% growth maybe in those days. Now it was more or less spread over a month, so it was not all at once. It was somewhere there. And then it depends a lot on the regions, because as an SRE you generally also learn

Starting point is 00:17:11 which regions are more popular, which regions are more heavily used than other regions. Like East U.S., it's generally a way bigger region than other regions in the U.S., it's a more popular region to be selected. Then based on that, because we work closely with the cloud providers, as you said, we also learn a bit about their capacities

Starting point is 00:17:31 and where there usually are more constraints, because at the end a cloud provider is not an infinite cloud that just expands in the air, where you just get the stuff whenever you want and you pull it in. It's still something physical that they need to have, and they need to make sure it is there.

Starting point is 00:17:46 So planning for big events usually starts a couple of months, maybe even three months before; that, we know we need to do. We are syncing together with the cloud providers, knowing where the growth is more expected. And based on that, we know their capacity as well. So we say, okay, we will need, I don't know, an amount of 40% more in each region. This is what we need as capacity right now. What is the expected date where we can get that?

Starting point is 00:18:10 This is where we would need it. Let's work together. And this is where a lot of communication with the cloud providers starts, to make sure we are getting what we need as capacity in time. And in addition to that, of course, it's not just about planning, but sometimes you also need to be reactive. And that's when you start to look at metrics, that's where you start to look at performance, where you start to see what is causing it and where our clusters are actually struggling and what they would need. And then based on that, you

Starting point is 00:18:40 start to decide on scaling. And this is where the data serves us a lot, because, right, then we can look at the metrics. Then we can look at the data that we have, at the consumption and the performance of our hardware, to see if we can and we need to actually make a scaling decision. And for me, it's always fascinating. I think you hit it spot on. You said the cloud providers don't provide infinite resources at the click of a button, because they also have this hardware somewhere.
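The planning step described above (take last year's peak, apply the expected growth, and ask the cloud provider for the difference well ahead of the event) can be sketched roughly as follows. This is only an illustration under stated assumptions: the region names, node counts, and the `nodes_to_request` helper are hypothetical, and only the 30% and 40% growth figures come from the conversation.

```python
from dataclasses import dataclass


@dataclass
class RegionPlan:
    region: str          # hypothetical region name
    last_year_peak: int  # peak node count observed during last year's event
    growth_pct: int      # expected growth in percent, e.g. 30 for 30%


def nodes_to_request(plan: RegionPlan, current_nodes: int) -> int:
    """Extra nodes to request from the cloud provider before the event."""
    # Integer ceiling of last_year_peak * (1 + growth), avoiding float rounding.
    target = (plan.last_year_peak * (100 + plan.growth_pct) + 99) // 100
    # Never return a negative number: no request if we already have enough.
    return max(0, target - current_nodes)


# Hypothetical numbers, for illustration only.
plans = [RegionPlan("us-east", 100, 30), RegionPlan("eu-west", 40, 40)]
current = {"us-east": 110, "eu-west": 60}
requests = {p.region: nodes_to_request(p, current[p.region]) for p in plans}
print(requests)
```

A real plan would also track the date by which each provider can deliver the capacity, since, as discussed, the hardware has to physically exist in the region.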

Starting point is 00:19:11 And especially at these big day events or week events, everybody has the same problem. right? Everybody wants the cloud resources and I guess this is where the upfront negotiation comes in. So do we actually negotiate that we really kind of get an assurance that we get these resources? How does this work? Sometimes, yes.

Starting point is 00:19:32 It also depends, right? It's at the end also a business and it depends a lot on the type of business with each cloud provider. We are also more heavily involved into AWS. We have much more data in there, much more resources in there. So that's also a different standing that you have in there.

Starting point is 00:19:49 And also for the other cloud providers, like Azure and GCP, right, we work together with them and say, okay, this is what we need. We need to make sure that this is allocated for us and we have it in there. Of course, in some cases, they can actually say, yeah, that's yours. But, as you said, it's not just us that will need it. It's also other customers that are coming and saying, I also need it. And then the other one also needs it. So then it's also a decision on their end to say, okay, for which customers do we need to make sure that they have it, and for which customers can we say we sacrifice

Starting point is 00:20:19 or maybe they need to sacrifice something from their end to make sure that then we can put it somewhere else so they can actually provide us with hardware. And constraints happen to them as well as we said also for Black Friday, right? It's not just about having the hardware there but it's also you need to order it, you need to have it distributed.

Starting point is 00:20:37 So then sometimes for them also they plan something and it's not depending just on them to have everything ready because it needs to arrive to them, put it up, install it and do all the stuff in there. I'm curious with the cloud providers, right? We're seeing some pressure, I don't know, we're seeing some pressure on the cloud providers from their own AI efforts, right?

Starting point is 00:21:02 And we're definitely seeing, seeing they're in some internal capacity. Is that starting to be seen for customers like us and anyone else using the cloud provider, where during these events, it's harder to get that extra capacity because they're dedicating so much of their own hardware and stuff to their AI efforts. Is that coming into play yet?

Starting point is 00:21:23 Have we started seeing anything with that, or is it being handled well so far? I think it's a mix, of course, and it's also depending a lot on the type of hardware and instance families you need, because it's also a matter of what you want to use. The bigger, newer or shinier instance types are usually more popular, right? Because people want to use them: it's bigger, it gives you more power. But then that also means more people are fighting over it, and probably they themselves also use it because it gives them more resources in there

Starting point is 00:21:53 so it's always depending on what you want to use. I think for most of the problems we see in terms of capacity, I don't think it's just related to that on their end, because they're using it for AI or things like that. Probably there is an influence, but there are also some parts where we do not know directly, because, right, they will not come to us and say, hey, you know, I'm using it for AI now, sorry. You brought up instance type,

Starting point is 00:22:22 and obviously instance types keep changing all the time, and I assume they also take certain instance types and chips away. How do we deal with this? Do we need to, I assume we know how we perform on certain hardware,

Starting point is 00:22:37 and then do we optimize for that? So do we have different deployment options when we know in a certain region we don't get the chips or if you have any insights on this, I think this would be very interesting for me, right? Because maybe region West has more on this. Region East not, so we need to deploy differently. Is this how it works?

Starting point is 00:22:59 So we are even currently running like that. There are certain regions where we just do not have the same instance types. They are just not provided by the cloud providers. Also because of the demand in those areas, right? So for them, at the end, it's also a matter of demand and how many people are demanding it, how much it's used, because it's also very costly for them, right?

Starting point is 00:23:21 And we do have certain environments or certain regions which are running on different instance types. Maybe they are not as performant or as big, but for us as a deployment it's also way smaller than the other ones. So it's not hurting us in terms of performance, because at the end we get the same output. Maybe, instead of having three instances

Starting point is 00:23:40 in there, you would have six, just because it's a bit smaller, so it needs to perform in a similar way. But we do have those alternatives, basically because we don't have the capacity in there all the time. And it's not only about that, right? It's also about different chips, as you said. Maybe sometimes you have Intel, sometimes you have AMD in there. Depending on the providers and what you have in that region, you try to mix it up if you don't have the same everywhere. And try to adapt to those situations at the end.
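The "three instances become six" trade-off can be illustrated with a small calculation. This is a minimal sketch, assuming capacity is measured in vCPUs and that smaller instances may carry some extra overhead; the `instances_needed` helper, the vCPU counts, and the `overhead_pct` parameter are all hypothetical, and real sizing would be validated by performance testing rather than arithmetic alone.

```python
def instances_needed(baseline_count: int, baseline_vcpus: int,
                     alt_vcpus: int, overhead_pct: int = 0) -> int:
    """How many alternative instances match the baseline fleet's capacity.

    overhead_pct is an assumed penalty for spreading the same work over
    more, smaller instances (per-node agents, network hops, and so on).
    """
    total_vcpus = baseline_count * baseline_vcpus
    # Scale by 100 so the overhead stays in integer arithmetic.
    scaled_need = total_vcpus * (100 + overhead_pct)
    # Ceiling division: a partially used instance still has to be provisioned.
    return -(-scaled_need // (alt_vcpus * 100))


# Three 16-vCPU instances replaced by 8-vCPU instances (hypothetical sizes).
print(instances_needed(3, 16, 8))                   # ideal scaling
print(instances_needed(3, 16, 8, overhead_pct=10))  # with assumed 10% overhead
```

With perfect scaling the count simply doubles; any overhead pushes it higher, which is why the smaller-instance alternative is usually reserved for smaller deployments.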

Starting point is 00:24:08 Andy, I remember several years ago, I don't know if it was Sonia or someone else that we had on, but when we were first launching, you know, what Dynatrace is today, with Grail, I think it was, we had it set up for AWS, and then when we were trying to set it up for Azure, I think there was this idea of, okay, well, we'll use the fancier, faster disks that Azure offers. But it turned out those performed worse than the standard ones, right? So when speaking about this, I guess, you know, bringing this up as a consideration that SREs have to face, or even architects too, is that you're,

Starting point is 00:24:49 as you're saying, if the chip isn't available or if they're out of a certain type of storage or something, it's not as easy as just saying, okay, well, just give us the next one, because you have to verify that that stuff's going to perform well. Even if you're saying we're going to use a three-core versus a six-core and we need to scale up, well, do two three-core instances give you the same performance as one six-core instance, right? Probably not because there's a, right? So it's a lot of things to factor in and keep control of.

Starting point is 00:25:18 I guess the curiosity is, or my question really is, is how much of this stuff can you test for ahead of time in terms of, do we know what capacity we get if we switch to these kind of things and how much of it is we're not 100% sure and we'll just have a bunch of backup plans ready for, different approaches for a more adult, like how do you prepare for all those different potential scenarios in a situation where you don't know what you're going to get? We had actually an example from this year, right? Because at the end, sometimes you

Starting point is 00:25:52 also want to just switch to newer and better stuff. And in situations, right, you have a plan on how you want to roll out that. Because it's not like you just go with everything at once and then everything will be in there, it's a rollout plan itself. So you try to do a step by step, there's also certain actions that you need to do in there as an infrastructure.

Starting point is 00:26:08 And sometimes you get the plan, you say, okay, you agree on the plan with the cloud providers. And then at the end, also on their end, sometimes they calculate something, but then the chips are not there. Sometimes not everything is there. So then you need to adapt to the new situation. And depending on the case, right, you either sacrifice some other environments and you say, okay, I want the bigger instances here, because actually here it will give me a bigger benefit right now. And then you keep the smaller instances somewhere else. Of course, you use the same instances you had before, right? So we will never switch to

Starting point is 00:26:42 something we never tested. Usually, if we decide, okay, this is newer or shinier armor that we want to use, let's see how it actually behaves in the fight. So you start first in the pre-production environments, really test it, go through it and see how it behaves, and then we say, okay, we want to go with this. But then we slowly start to migrate towards those. If those are not yet there or we have issues in terms of capacity, then either we say we do it where it's actually most beneficial, or if not, then we say, okay, we stick with what we have right now until we get enough capacity to actually migrate to those respective instance types. Do you, just to be clear on this?

Starting point is 00:27:22 Because I wanted to know what you have to do as an SRE. Do you have to do all of this yourself? Do you have people that also do this and give you some of this data, that help you optimize, that help you test, get insights? How are you set up? The good part of SRE is that you collaborate with so many other teams, right? So that means you also don't need to do everything alone. Usually it's also going through different levels, right?

Starting point is 00:27:47 Because when you choose to actually say, okay, we want to go to newer instance types, it's not just a choice where we do it now and done. It's also a matter of how cost-effective it is for us. Because as an SRE, you also want to care about this. And the moment you want to choose and you say, okay, there are more instances there, we want to go to bigger ones, it's actually improving our cost overall. Like maybe now for like one month

Starting point is 00:28:07 until we switch everything, it's higher, but then on a six-month plan it's actually way lower cost. So you need to actually go through the cloud cost part as well, and also through the teams, to calculate and make sure there is proof that we want it. Then we have teams that are actually doing performance testing. There are teams that are actually testing the

Starting point is 00:28:23 overall combination of software, deployments and everything with bigger instances, comparing the numbers and saying, okay, these are the numbers, this is what you gain, this is what actually gets better. And then based on that, with architects and other teams which are doing the numbers, we say, okay, if we do this action in production,

Starting point is 00:28:42 based on this, we could scale in with this amount of nodes, actually, because these nodes are so much bigger and better. This is the amount of money we actually save. So there are multiple teams, usually multiple people involved into the decisions. It's not just about us, because at the end, the moment we switch is not just switching on infrastructure and then we put everything in there, but each service that it's using that infrastructure needs to actually work on that infrastructure. So we need to make sure that everything works smoothly.
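The cost reasoning above (higher spend during the switch, lower spend over a six-month plan) amounts to a break-even calculation. The sketch below assumes, purely for illustration, that the old and new fleets run in parallel during the migration window; the function names and all cost figures are hypothetical.

```python
from typing import Optional


def cumulative_cost(month: int, old_monthly: int, new_monthly: int,
                    migration_months: int) -> int:
    """Total spend after `month` months when migrating to the new fleet.

    Assumption: during the migration window both fleets run in parallel,
    so each of those months costs old_monthly + new_monthly.
    """
    overlap = min(month, migration_months)
    return overlap * (old_monthly + new_monthly) + (month - overlap) * new_monthly


def break_even_month(old_monthly: int, new_monthly: int,
                     migration_months: int, horizon: int = 24) -> Optional[int]:
    """First month where migrating is no more expensive than staying put."""
    for month in range(1, horizon + 1):
        migrate = cumulative_cost(month, old_monthly, new_monthly, migration_months)
        if migrate <= month * old_monthly:
            return month
    return None  # migration does not pay off within the horizon


# Hypothetical figures: 100 cost units per month today, 70 after migrating,
# with one month of running both fleets side by side.
print(cumulative_cost(1, 100, 70, 1))  # first month is more expensive
print(cumulative_cost(6, 100, 70, 1))  # compare with 6 * 100 for staying put
print(break_even_month(100, 70, 1))
```

Under these made-up numbers the first month costs more than staying put, while the six-month total is lower, which matches the shape of the argument in the conversation.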

Starting point is 00:29:10 And we do it in a very controlled way at the end. There was a really big event, I think at least two actually, in the last couple of months, besides Black Friday and Cyber Monday, where half of the world stood still. You mentioned earlier that East U.S. is a very popular cloud region, and that's why many systems are actually there. I also, I think, Brian, you just mentioned it earlier, that Taylor Swift released a new Netflix show, and that also killed Netflix for a little bit at least. But, well, Netflix is great.

Starting point is 00:29:50 It's not of our concern right now. But what is of our concern is what happens when some of these big events happen, like AWS had an outage, Azure has an issue. I assume this is not the most fun time for an SRE. It's not, but it's also one of the challenging ones, right? It's the part that keeps you a bit in check about what challenges you have in there, what you actually need to do, how to handle it. And the downside of this kind of event is that it's actually something you cannot control as an SRE.

Starting point is 00:30:23 It's not about having a service running on our platform and we try to figure it out. We can control certain stuff. In this case, you cannot really do much because you cannot just go to one of the cloud providers and say, yeah, sure, let me put the hardware there and let me help you in a way that they can actually fix it, right? And it's an interesting time because usually if it happens,

Starting point is 00:30:46 I think we had an Azure outage in Switzerland somewhere in August or September, and it happened sometime on a weekend, and you just react, you see it, and then you say, okay, we put out a communication, there is an outage, and you just wait because you cannot do much more than that. It will make no difference if you try to do something

Starting point is 00:31:04 because it's just something that happens on their side. There are also a lot of people involved in those situations because usually it will escalate because you need to make the communication. Also other services are affected. And a lot of people are coming in together to try to figure out, okay, how do we do this better? How do we communicate this? Because also the customers then say, okay, you are doing this,

Starting point is 00:31:25 but how can we make sure that we are not getting hit again? Because at the end, it's not us providing the service. We are on a cloud provider. And then how do you make sure that this becomes resilient? There are a lot of questions and a lot of communication, actually, in those cases. Yeah. Sounds like a great case for chaos engineering, Andy, right?

Starting point is 00:31:42 Adding what happens if these things happen to your cloud provider, right? Yeah. And I think this is also, and Alexandra, correct me if I'm wrong, but at least this is what I've been hearing, especially since these recent incidents, that people really now think more and more about two things. A, the multi-cloud or multi-region setup is important because you can never know whether a big disaster like this happens.

Starting point is 00:32:06 And then also some start thinking about, okay, how can I reduce the risk? Does it make sense for me to move back to something that I completely control? Not saying that building your own data center is a good strategy because building your own data center and operating it is a lot of effort, but at least what I hear, and I'm not sure if you see this as well, the whole discussion around multi-region, multi-cloud setup is becoming more of a topic again. It's picking up much more popularity for sure and also multi-region, right, at the end.

Starting point is 00:32:41 We also have banks running like that, right? We have airlines running on cloud, right, in certain directions. And then if something goes down, what do you do? because then you're blind. Yeah, I suppose there's a cost factor to that too, right? Because if you're going to have, if you're going to be able to make a quick switch, especially if one data center goes just completely dark,

Starting point is 00:33:03 you have to have all that data in your backup to continue. So that means any company who's looking to do that has to pay the extra cost to have all this data transfer and have all this stuff ready to go just in case. But meanwhile, if it's not being used, it's, I don't want to say wasted money, but it's an investment in a what-if scenario. But the thing is, right,

Starting point is 00:33:27 I mean, this is not a new problem that we solve as an industry. Just before cloud providers, organizations, large organizations, had multiple data centers, a failover, a disaster recovery. But I think now with more and more organizations really becoming software companies

Starting point is 00:33:43 that need to operate 24-7, where every minute of downtime counts, they also need to think about this. And because you don't own the data centers anymore, you need to have the right contracts and the right strategies to then provide your service from a different region or from a different cloud vendor,

Starting point is 00:34:00 and that also means more testing on disaster recovery or on switching over, more testing across. And this is for me the interesting piece, what you just kind of explained earlier, that cloud is not cloud. And so if I think I'm born on Cloud A,

Starting point is 00:34:20 then I just move stuff forward to Cloud B because they provide the same thing anyway. That's not the case. That's really fascinating. You're not Mario in Super Mario Bros., where you can hop from cloud to cloud if you get to the secret board. So there's another component of this, right?

Starting point is 00:34:37 Because we're talking right now about major outages in cloud regions and all this, right? But sometimes it may be a factor of a reduction in their performance, right? Now, if we think about Black Friday, not Black Friday, but like Super Bowl, right? Some of the older

Starting point is 00:34:55 stuff we learned from the Super Bowl ads would be, especially in retail, would be let's redesign our website for the game, get rid of everything we don't need, and just focus on what we're trying to promote, right? So we can do, you know, as let's say if I'm a vendor, and I think it was one of the car companies that

Starting point is 00:35:13 did this, we can do a lot of stuff on our side to reduce what we need to deliver what we want to provide so that we can get more capacity out of that. When it comes to providing a platform like Dynatrace or other SaaS vendors that service other companies, it's not like you can just say, well, let's turn off most of our functionality so that we can deliver our car ad, right? However, I guess the question is, and not specific to us, but using us as an example, are there tweaks we can make to our platform to reduce what we need to consume in the situations where

Starting point is 00:35:53 there might be a restriction on what we have access to. I'm not asking for specifics, but I'm just like, in general, are there? Or is it like we need all or nothing? Because to me, it just sounds like a lot of these providers would need all or nothing, because how do you reduce what you can give? But is that part of what you do as SREs to try to find things that you can tweak in these kind of worst-case scenarios. That's a very good question.

Starting point is 00:36:17 I mean, for us, not directly, at least not towards us, because at the end, what we offer, it's a monitoring tool for them, and we don't. What we can do is just make sure that they have optimization in terms of queries or in metrics or anything that they need in there to make sure that they are monitoring what they need, right? And they are not

Starting point is 00:36:33 just having noise or anything like that, which helps them also long-term. And working with the teams, also with the other developers and ask them, look, the guys are having problems. They are having performance issues, like what can we do for them? But at the end, for directly for the customers to what they can do in their platform to improve, like, I don't know, the UI or anything like that, it's a bit harder for us because we don't

Starting point is 00:36:53 have a direct view of what happens there. We can just see the outcome or the output of what we monitor out of it, right? So then you can just see from our end, okay, they have very heavy queries in there. Or maybe they are actually monitoring stuff, but do they really need all of these? Because it's just a lot of noise or a lot of logs, which are just info logs. Do you really need everything in there? Like, it's also these questions that we can ask. And depending on that, we can say, okay, guys, you could optimize this or do you really need this?

Starting point is 00:37:21 Sometimes it's also happening for the customers, right? They maybe just put the script. They didn't realize it. And then they had a lot of cloud metrics in there. And then it's just booming their platform as well because at the end they are struggling to monitor the stuff. And they don't see it, bringing their sites down. But then we see it on the other. And we're like, guys, do you really need this?

Starting point is 00:37:38 Because now it's breaking your system. So can we do something into that? But it's a bit harder to go to them to say, maybe optimize this part of the UI. Yeah, so there's stuff we can optimize on our side, but yeah, no idea or reduce. But, yeah. I think the, thanks for the reminder, Brian, about the, I think I would call it an MVS, the minimal viable service that I want to provide, right? Kind of like to the bare minimum.

Starting point is 00:38:04 Yeah. And I guess from an e-commerce site perspective, I would just say people need to see the product, they need to be able to buy it and check out. Very easy in a scenario. It's easy. From an observability perspective, the only thing that I could say in minimal viable service is we want to make sure

Starting point is 00:38:20 we always ingest all the data, we're not losing data, we analyze it and we alert. Whether a dashboard takes one second or one and a half seconds to load and whether we turn on all the bells and whistles, this might be individual features that you could potentially kind of turn down a little bit. But I guess it's an interesting thought. How would you do this? Yeah. A lot more complicated.

Starting point is 00:38:45 Alex, you talked about obviously this, and there was a reason why I brought up this whole outage because I assumed you said it's a challenging day. You know, some people like challenges, right? Because finally something is happening. Like, who, there's a party going on? We're going to get by the injury. We're going to get pizza.

Starting point is 00:39:04 Would you say you call it party injury? So the question is those moments that I assume you're not looking forward to. You cannot control them, as you said. But give me ideas of moments that you cherish, that you really say, this is what I love about being an SRE. This is what I really like to do. And what is it? So I don't look forward to getting into

Starting point is 00:39:35 those kind of issues, right, and having incidents or critical issues and being in the calls. But at the same time, it's actually something that really drives me a lot. So it's probably not something that everyone would say, being just in issues and doing like firefighting. But it's actually the type of activity that is challenging and interesting because it's not just because you are like in there and trying to troubleshoot and giving you something, but it's also the amount of people actually collaborating there. Because generally when there's an issue, you have different teams, different people coming in. And then you try to figure out, you try to really see

Starting point is 00:40:05 the issue where it's coming from, trying to identify the root cause. And at the end, when you find it, it gives you actually a very nice feeling at the end because you actually figure out what's causing it. Of course, we should not get into that and should be found out in pre-production and all of the stuff. But sometimes there are so many small cases and different corner cases which you cannot just identify in pre-production. And this is quite an interesting drive, I would say. And I think this is the main beauty of SRE: it's the fact that you are not just doing one topic at once,

Starting point is 00:40:39 it's about jumping through different topics. You have to actually have a bit of multitasking, you have to jump through different ideas, also different communications, and you also need to communicate actually with so many people on so many topics. So you do scaling now, maybe then you have

Starting point is 00:40:56 an issue actually happening on the observability side and then you try to identify what's happening with the metrics, what's happening there with the responses in there, right? So then it's also the amount of people you actually collaborate with as an SRE is so much different than just being on a development side where you have your own piece of parts, you know your piece in there and you are very specialized. But in SRE you are specialized but at the same time you are very

Starting point is 00:41:20 broad, so you have a bit of a multi-flavor, multi-flavored chocolate, I'd say. Yeah, you know, it's interesting because the, you know, I agree when you're in a tough situation everyone comes together and collaborates and you find your way out of it, especially if you're all, you know, working very, very solidly together. You come out with this great victory. You're the hero at the end of the movie who destroyed the big enemy, right? But so much of what your job is, right, is preventing that stuff from happening in the first place. And it always seems like in general, nobody notices when you did things properly and it didn't

Starting point is 00:42:00 become an issue. Do you have ways to track like this big event happened? Didn't impact us because we were prepared and we did everything and let's call ourselves the hero again. It's not the same quite, you know, not the same celebration because you didn't have to face the monster with a sword. But like, well, we didn't have to face them because we set up all our traps and it caught him in and we're awesome. Right. Right. Is there a way that we track that? Can you identify that and say like, look how great we are because we're not getting into these situations?

Starting point is 00:42:30 and celebrate the victories that you didn't have to even fight? I think it's also interesting how, if you think now about it, because usually it's harder for us to celebrate stuff that doesn't happen, right? You don't, like, also in life generally, right? You rarely celebrate the small things that didn't happen or the things that you prevented, but celebrate in the aftermath of the stuff that was tougher. And you tend to get over the good stuff much faster in a sense.

Starting point is 00:42:58 We don't really have a way to measure, but at the same time we also can say there are certain situations where also probably incidents like critical issues happened, but we found them actually so fast because we were paying attention, we were prepared in there, we knew how to react to it and we said, okay, we do this, we do that,

Starting point is 00:43:13 that we do it in there, then we prevented it and then nobody saw it at the end, right? It was something that was concentrated in a small group, then the customers didn't see it, nobody was impacted, but it could have been a bigger issue. So there are those kind of things where we just remember ourselves in a sense and we just remind in our

Starting point is 00:43:30 case because of how we are spread and how we are working with everyone, we usually tend to remind ourselves in different meetings that we have together. This is the great stuff that we were there. We went through the debugging session and we found that and then we prevented it and we did this. I would say another good example, it's also the Black Friday, Cyber Monday this year. We were prepared in such a different way because it was not about just SRE being involved in that, but there were also all of the other teams that are part of the full

Starting point is 00:43:59 platform in there. So they were also prepared. They had also their schedules in there. They also had their teams in there. They also had their proper runbooks and also the thoughtful process of what can we do in case this bad thing happens

Starting point is 00:44:13 or what can we do in case this part is happening. And at the end when we crossed the line of the full Black Friday, Cyber Monday, actually, it was a very uneventful period for us because everyone

Starting point is 00:44:26 was just prepared and it worked so smooth. It was, it worked perfectly in terms of collaboration, in terms of people reacting, being there for us, and also us being there and actually have the support from the people. So I think it's always the small bits, but you never really count or like measure those parts of there, which probably we should change, right? Yeah, I hope someone's noticing.

Starting point is 00:44:49 I hope the people above are noticing, right? Because that's, okay, you're doing what you're tasked to do fantastically, right? Yeah. Yeah. I think I remember some terms, like a near incident and non-incident or something like this, where you basically can measure we were fortunate enough to have handled, as you said, the incident fast enough without an impact. It's like when the automated brakes of a car actually work, right?

Starting point is 00:45:19 We've prevented that many accidents because our system reacted fast enough. It's pretty cool. Hey Alex, thank you so much for your insights. It's always amazing how fast time flies. I know we always say in the beginning we only have until the top of the hour from the recording. And sometimes we think, wow, this is a lot of time. But then, as you can tell, it's interesting because the time flies. We'll definitely have you back because, folks,

Starting point is 00:45:49 We have a long list of things that we wanted to discuss. And we only got to a small portion of it. We wanted to talk about automating the right toil. We wanted to talk about one of your favorite topics about releasing new changes, releasing new versions. I also wanted to talk about the skills of an SRE. What can you give people along the way, even though I think we covered a lot of these things already. Any final thoughts from your end before we close this session? I think anyone who is thinking about this should try it once,

Starting point is 00:46:21 at least try to just see what they are doing or just shadow them, just be there and spotlight them, because it's quite a different world. We tend sometimes in our teams to say we are like ER doctors or the Avengers sometimes, right, because you're always there fighting and making sure everything works. And if it's not, then you are there to try to make it better and try to make it work. So, yeah, if you're thinking about it once, maybe try it because it's a cool world, I would say. That's great. Yeah, I mean, and we'll get into it in the next episode, but there's all kinds of skills that are needed in that, right?

Starting point is 00:46:57 don't be like, well, this is what I do. There's, you know, as you mentioned, performance testing. There's all different kinds of pieces needed. So, anyway. It also reminds me a little bit about it's a different world. And it's when I started my career as a software engineer, I had to spend the first two or three months as a quality engineer testing the software that I was later developing.

Starting point is 00:47:24 And this gave me a completely different perspective about quality. And I think from your perspective, if you have to be in SRE and you see what it takes to operate software, what problems can come up, it gives you a different perspective when you architect and develop the software. Yes. So everybody should have to work with the SRE team for six months before they go back to their job. Yeah, rotations, right?

Starting point is 00:47:46 I mean, rotations are great, yeah. Great idea. There you go. Well, really, really appreciate it, Alex, Alexandra, whichever you prefer. Alex or Alexandra? Alex? I'm open.

Starting point is 00:47:59 I think it's, yeah, it's, I was Alex actually also in the team, but then we got another Alex. So then it's a, we are mixing and switching depending on the people. Okay, well,

Starting point is 00:48:08 either way, it's fantastic having you on. Really look forward to the next episode. Andy, thanks for bringing her on. This has been great. And hope everyone enjoyed it.

Starting point is 00:48:17 And stay tuned for the next time we have you on. And enjoy your skiing. Come on up. And Andy, don't break your back. Doing my best. I try to be resilient.

Starting point is 00:48:31 There we go. Thank you, everybody.
