PurePerformance - The many facets of an SRE with Alexandra Franz
Episode Date: February 2, 2026
From Systems Engineer in Aeronautics via many clouds to becoming an SRE in Observability! That's the path of our guest, Alexandra Franz, who is a Lead Product Engineer in SRE at Dynatrace. Tune in and learn how their team plans ahead for expected high traffic around Black Friday, Cyber Monday or the Super Bowl. We discuss how regional traffic patterns and differences in available hardware get factored into capacity management and cost control. We also learn why global cloud outages are stressful - but - how those incidents can also be the reward for a good SRE. Make sure to connect with Alexandra on LinkedIn: https://www.linkedin.com/in/alexandrafranz/
Transcript
It's time for pure performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson.
And as always I have with me my mocker-in-chief with his mock turtleneck on.
Andy Grabner, how are you doing today, Andy?
I'm feeling almost the same as with our previous recording because I'm still,
what did you call it, a fashion statement?
I have my fashion statement around my neck here.
it still tries to keep my neck warm because I want to make sure my muscles are all good
for tomorrow and the weekend ahead because we're going on the ski slopes.
So you are,
you're trying to make sure that you are ready for the events coming up.
That my body stays resilient for everything that is thrown towards me,
kind of, right?
You want to make sure whatever, whatever condition, rough,
rough snow patches, icy or just perfect powder,
that we can kind of swish through it.
And it's kind of a better segue than with the previous recording that we did, right?
Yes, much better.
Yes, much better.
Hey, without further ado, the topic of today is focusing around resiliency, reliability,
site reliability.
We found a great guest, Alexandra Franz.
Alex, servus.
Hi.
Hey, hey, thank you so much for being here.
You are, and I, this now, kind of, I'm going to reveal the secret to Brian.
Because before we hit the record button, I was saying Alexandra Franz, it sounds like such an Austrian name.
And while you do work in Austria, you're not from Austria.
But Brian, for you now, she has the same country origin as our previous recording from yesterday.
You know, right before we started recording, I was just thinking, maybe she's Romanian.
You know, it was, it popped in my head.
I was like, well, let me wait and see.
So, okay, yeah, perfect.
So I'm talking quite a bit about, well, a little bit about Romania.
Yeah, yeah.
Did you figure out all the places you had a beer in Romania?
I figured it out.
I looked at my beer map yesterday and I looked at all of the different places.
And I'm looking forward to Bucharest for Cloud Native Days, Romania next year,
because I'm sure I will lock in another beverage there.
But now, Alex,
Thank you so much for being here.
You are a site reliability engineer at Dynatrace,
but can you tell us a little bit more about your background?
I just revealed you are from Romania originally.
Maybe a quick overview of what brought you to Austria
and what brought you to become an SRE.
Hi, thanks for having me today, actually.
Yeah, as you said, I'm from Romania.
And I would say probably my journey to SRE
is actually pretty interesting.
And also journey to Austria.
I came to Austria because I studied aerospace engineering,
and I actually started working in an aerospace company more or less
as a systems engineer for aviation.
And after some years, somehow the clouds brought me to the other clouds,
and now I'm working with our cloud providers, right,
and working more into the software, but for cloud providers,
keeping it more in touch with technology and newer stuff, basically,
than what aviation brings in terms of software, let's say.
Really cool.
And folks, as always, we will share a lot of links, wonderful links.
Alex, if you're okay with it, we will share your LinkedIn profile
so people can also see where you came from.
Really great, though, right?
We have two Romanians back-to-back on our podcast also showing.
I mean, Diana that we had yesterday, I think she's now in Spain.
I think that's what you said.
You made it to Austria.
We're obviously happy that you're here.
You're working out of our Vienna office.
Yes, exactly.
Yeah.
So today's topic, and for everybody that is listening in,
Alex and I, we sat together a couple weeks ago in the Vienna office,
and we've been bouncing the idea around quite a while
because I wanted to learn more about SRE at Dynatrace.
I've been talking about SRE quite a bit over the years,
but it's always great to see how things
get applied, what we learn internally.
And so I have a couple of topics for you that I want to discuss.
But before we go into the topics,
I first want to get a quick overview of what are kind of the responsibilities overall.
I know you're a big team, but overall,
can you quickly highlight what are some of the responsibilities of an SRE in our organization?
I think in big words, and I will now quote one of our colleagues,
SREs at Dynatrace take care of the money printer, right?
That's a good way of putting it.
But in a nutshell, let's say,
SREs are taking care of the whole production environments
and not only production, also pre-production,
but production mainly where we are making sure
we are delivering the software in a safe manner
to our customers and making sure it's arriving in time,
but also taking into consideration any other blockers
that are coming along the way.
We are taking care of monitoring, making sure the systems are staying healthy, green,
and in case of any issues, reacting, investigating, and making sure we are solving everything in time.
And also, additionally to that, taking costs into consideration, taking scaling into consideration,
making sure we are prepared for any events that are coming towards us and planning ahead for that,
but also being reactive if anything is needed and being there basically in support.
So I would say, as I already said, SREs are pretty busy, because we are always in
some kind of activity and task, making sure that everything is just running
without impacting or being noticed by people on the other end of the software, let's say.
I think it was well put and well explained.
Just one more clarification question, because we, and obviously many, many organizations
in our observability space, are operating a SaaS business 24-7.
Does this mean your team is also structured globally around the world? Can you give us a quick overview?
Good hint, yes. So this is also one of the cool things about being an SRE. We are not just
in one place and one office. We are across the globe. So we have people across Austria, actually
spread across different offices in Austria, but we also have people in Detroit, in the NORAM area,
and also in APAC. So we also see people in Sydney, and we also have
some people in Texas, for example, even California.
So we are quite spread and we are a big team.
Also making sure that like this we can provide support 24-7 in a sense.
But additionally to that is not only that, we are also on-call.
So that means also in the out-of-office hours, we are there to support the customers
and also to make sure that the platform is running smoothly.
So we are there more or less every time.
And this also makes it interesting because it's not just about the challenges that you have every day.
You will have different challenges also through collaboration and also through communication, and, yeah, it brings a bigger package, actually, into the view as an SRE.
Yeah, you know, Andy, I never thought of SRE from this point of view before, but the challenge, you know, I always think of SREs in a single office.
It looks like Andy's having some audio problems, maybe.
But, okay.
So, Andy, one of the things that this brings up a new idea that I hadn't thought about before.
and I'll share because I'm not sure if other people thought about it, is I always think of SREs in like a single office and a single, you know, NOC or some, you know, whatever it might be, right?
But talking about all the different office locations, all the, you know, the 24-7 coverage and all that, obviously whenever you're setting up your practices for SRE now, not only is that you and your team figuring that out, but that's something you have to coordinate and make work with all the different offices.
And I'm sure people in different countries, I'm sure, you know, are going to have different ideas of how to approach it.
So it becomes a lot more of a coordinated effort, a lot more of not in the election sense,
but political in terms of we want to do it this way.
Well, we over here want to do it this way.
It just broadens the scope of what you have to do as an SRE in that situation.
Just fascinating.
Yeah.
And I think that, I mean, we are obviously providing a SaaS-based service,
but many, if I look back at my early days,
when we started software developing
and the software that we built, right,
that was all, we built it, we tested it,
we shipped it, and operation was done by somebody else.
And even these companies, when they operated,
most of them had their nine to five operational schedule, right?
But nowadays, everything is available all the time.
And I think this is why it's so great to have you on this call,
because I want to learn from you what it really is that SREs are doing.
And so others that listen in can learn.
Now, Alex, I want to jump to the first topic.
It is a topic that throughout the year,
and I think we have seen some changes,
but we've always talked about single-day events
or then special events, like a Black Friday, a Cyber Monday.
And I think things have changed a little bit.
I remember 20 years back,
where it was maybe 15 years back when I started talking about this event,
it was really Black Friday, then Cyber Monday,
and then we saw things shifting.
What I would like to know about is how do you see, from an SRE perspective,
these things changing?
Are they really still single-day events,
or do they happen all of the time?
How do you prepare for this?
What have we learned about it?
Any insights that you can give us from an SRE perspective?
I think it's very interesting, actually, how also we as consumers are changing over the years,
right? And this is a pattern we are seeing as well for us, in the way we are also seeing for
our customers how things are behaving, because we also have on our platform a lot of retail
customers. And every year we are always learning and learning how we should prepare better
next year, how we should actually be better in the whole organization of big events.
One year maybe it started with just Black Friday,
and then we had Cyber Monday, as you said.
And what we've seen also this year, actually,
is that it's not about just one day.
It's actually a full month of growth,
because now it's not just about Black Friday
being on Friday. Black Friday actually starts,
depending on retailers, depending on the areas,
depending on the regions,
way earlier. Some of them even start on the 1st of November, right?
and then what they do because you cannot just wait and see it
so you prepare somehow in advance even now
because we already saw also from experience
that all of these
shopping and buying and discounts
that you can see actually online
are not starting now just in one day.
It's a continuous growth over the time
and it's not something you can say,
okay, it's done now and then tomorrow it's gone.
It's actually a continuous growth
which goes down a bit
and then depending where it is,
like if it's Friday, Saturday, Sunday,
you'll have a connection also with Christmas, right?
Because then people are buying closer to Christmas,
then there's Boxing Day, depending on the regions,
and depending on that,
we have to slowly adapt to those situations
and learn how we can actually make better use
of our infrastructure for that.
So yeah, it's an interesting pattern, how every year changes.
Curious what next year will bring for us
in that area.
Yeah, and I guess, I mean,
knowing that this podcast will air
around the time with another big, big event,
it's the Super Bowl is coming up
and there was also
I remember in the early days, Brian,
when we always talked about how the Super Bowl
ads were bringing down
websites because they caused spiky traffic.
Alex, is the Super Bowl also
an event that we see?
I think last year we saw some spikes, and some
clear spikes we also saw this year in the periods
when the games were actually on, during the
evenings, especially for the streaming
stuff. We could also see,
when the bigger games were there, that there is more
traffic, more people actually using the
platforms, so we could also see some increases
on our side. For sure
the Super Bowl will also bring some more
traffic on our side as well,
and we will see something in there.
So I'm guessing it also
depends on the outcomes of what happens in the finals,
right? Yeah, true too.
And then another big event next year,
now that I think of it, the World Cup is coming
up. Yeah.
That's more like what Black
Friday looks like now because the World Cup
is not just one day.
It's a series of days as opposed to Super Bowl is a single-day event.
So it's interesting how things will be spread differently for the different events.
It's also depending, right, because you will have one-day events, right?
So then for those one-day events, it's also depending on the peak hours of the regions.
Because the Super Bowl will be very specific for NORAM, so you will see more increase in that.
Same goes for games.
The World Cup is a bit different, because it will be more spread across the regions where people are watching.
So usually the spikes are also really influencing the areas
and the regions where people are actually more into those specific sports,
which is also a very interesting pattern to actually observe.
So, Alex, if you think, go ahead.
I was going to ask, you know, this idea, as you were talking about Black Friday Cyber Monday,
you know, before online retail, it was always in store in person, right?
So what I'm getting at is, I wonder how much of the spread of those becoming
month-long events versus a single-day event, because even online it used to be like, there was Cyber Monday, right?
And obviously if you're expanding that time frame for the sales, it's probably good for business, right?
But I also wonder how much of that expansion might be tied to the fact that so many sites would be crashing on those big day events.
So they would be like, hey, let's spread this out over a few days so it's not everybody coming at once.
Was there part of it?
It's part of the, and I don't know if we have an answer to this, but is part of the spread
of these big-day events because it's not just thousands of people going to a store,
it's millions of people going on a website, crashing it, taking it down, causing devastation
to the internet infrastructure, and they figured, hey, if we buffer it over the course of a few weeks,
we can handle it better?
I don't know.
Any thoughts on that?
I'm sure we don't have any information on that.
To be honest, I don't, but it's also an assumption worth making, because at the end, we all learn
on our experiences.
So everyone learns,
if you put it from a consumer side,
you can also say the longer you have it,
the more people you will have actually doing the shopping.
So then it's always depending where you put it, right?
I think you have a point here, Brian,
because think about it,
the retailers are not the only ones that see the spike.
Because if you buy something,
it needs to be produced.
If it's not already produced,
it needs to be shipped.
That means if you can spread or shed the load,
if you can spread the load out
across multiple days, weeks or months,
then it's better for everybody.
Alex, you brought up a really interesting point
and I think this brings me to the next topic.
You said some of these events are obviously regional.
How does an SRE plan for capacity
if you have to factor in all of these things?
How do you plan with the cloud vendors?
I know we are running on all the major cloud vendors.
So what does an SRE do?
What's your role?
How do you figure out how to scale, when to scale? Are there any lessons learned, any things where
you say, man, I would have wished that I knew this earlier? What are some of the things that
are challenging in that respect?
For the big events, it's always planned, right? So we always know, based on last year's
data, last year's metrics, all the numbers we accumulated over the last year,
the same for how it was for Black Friday.
We knew from a year before approximately how much load we can expect.
We know that we have somewhere around 30% growth maybe in those days.
Now it was more or less spread over a month,
so it was not at once.
It was somewhere there.
And then it depends a lot on the regions,
because as an SRE generally you also learn
which regions are more popular,
which regions are more heavily used than other regions.
Like East US is usually a way bigger region
than other regions in the US;
it's a more popular region to be selected.
Then, based on that, you also,
because we work closely with the cloud providers,
as you said, we also learn a bit about their capacities
and where there usually are more constraints.
Because at the end, a cloud
provider is not an infinite cloud
that just expands in the air
where you just get the stuff whenever you want
and pull it in.
It's still something physical
that they need to have and make sure is there.
So planning for big events
usually starts a couple of months,
maybe even three months, before. We know what we need to do.
We are syncing together with the cloud providers, knowing where the growth is most expected.
And based on that, we know their capacity as well.
So we say, okay, we will need, I don't know, 40% more in each region.
This is what we need as capacity right now.
What is the expected date when we can get that?
This is where we would need it.
Let's work together.
And this is where a lot of communication with the cloud providers starts,
to make sure we are getting in time what we need as
capacity. And additionally to that, of course, it's not just about planning;
sometimes you also need to be reactive. And that's when you start to look at metrics,
to look at performance, to see what is causing issues and where
our clusters are actually struggling and what they would need. And then, based on that, you
start to decide on scaling. And this is where the data serves us a lot, because right, then
we can look at the metrics. Then we can look actually at the data that we have, at the
consumption and the performance of our hardware, to see if we can and need to actually make
a scaling decision.
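The planning Alex describes, taking last year's peak per region and applying an expected growth factor, can be sketched roughly as follows. All numbers, region names, and the per-instance capacity here are invented for illustration; this is not Dynatrace's actual sizing model.

```python
import math

def planned_capacity(last_year_peak, growth=0.40, per_instance=1000):
    """Instances needed for an expected peak.

    last_year_peak: observed peak load per region (made-up unit, e.g. requests/s)
    growth: expected year-over-year growth (hypothetical 40%, as in the episode)
    per_instance: load one instance can absorb (also made up)
    """
    expected_peak = last_year_peak * (1 + growth)
    # Round up: you cannot run a fraction of an instance.
    return math.ceil(expected_peak / per_instance)

# East US is typically the busiest region in this toy example.
regions = {"us-east-1": 50000, "eu-central-1": 20000}
plan = {region: planned_capacity(peak) for region, peak in regions.items()}
print(plan)  # {'us-east-1': 70, 'eu-central-1': 28}
```

In practice this forecast is only the starting point for the conversation with the cloud provider about whether that capacity will actually be available on the dates needed.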
And for me, it's always fascinating.
I think you hit it spot on.
You said the cloud providers don't provide infinite resources at the click of a button,
because they also have this hardware somewhere.
And especially at these big-day events or week events, everybody has the same problem,
right? Everybody wants the cloud
resources, and I guess this is
where the upfront negotiation comes in.
So do we actually negotiate that we really
kind of get an assurance
that we get these resources? How does this work?
Sometimes, yes.
It also depends, right?
At the end it's also a business,
and it depends a lot on the type of business
with each cloud provider.
We are also more heavily invested
in AWS. We have much more data
in there, much more resources in there.
So that's also a different standing that you have in there.
And also for the other cloud providers, like Azure and GCP, right, we work together with them and say, okay, this is what we need.
We need to make sure that this is allocated for us and we have it in there.
Of course, on some direction, they can actually say, yeah, that's yours.
But that's also, as you said, it's not just us that you will need it.
It's also other customers that are coming and say, I also need it.
And then the other one also needs it.
So then it's also a decision from their end to actually say, okay, these customers, we need to make sure that they have it,
and these customers, we can say, have to sacrifice something from their end,
to make sure that we can put it somewhere else
so they can actually provide us with hardware.
And constraints happen to them as well, as we said, also
for Black Friday, right?
It's not just about having the hardware there;
you also need to order it,
you need to have it distributed.
So sometimes they also plan something,
and it doesn't depend just on them to have everything ready,
because it needs to arrive to them, and they need to put it up,
install it and do all the stuff in there.
I'm curious about the cloud providers, right?
We're seeing some pressure, I don't know,
we're seeing some pressure on the cloud providers
from their own AI efforts, right?
And we're definitely seeing
that in some internal capacity.
Is that starting to be seen for customers like us,
and anyone else using the cloud providers,
where during these events
it's harder to get that extra capacity
because they're dedicating so much of their own hardware and stuff to their AI efforts?
Is that coming into play yet?
Have we started seeing anything with that, or is it being handled well so far?
I think it's a mix, of course, and it's also depending a lot on the type of hardware and the instance families you need,
because it's also a matter of what you want to use.
There is some popularity depending on that; the bigger, newer or shinier instance types usually are more popular, right?
Because people want to use them, they're bigger, they give you more power.
But then also that means more people are fighting over them,
and probably they themselves also use them, because they give them
more resources.
So it's always depending on what you want to use.
I think also, most of the problems we see in terms of capacity,
I don't think it's just related to them using it for
AI or things like that. Probably there is an influence, but there are also some
parts where we do not know directly, because, right,
they will not come to us and say, hey, you know, I'm using it for AI now,
sorry.
You brought up instance type,
and obviously instance types
keep changing all the time,
and I assume they also take certain instance types
and chips away.
How do we deal with this?
Do we need to,
I assume we know how we perform
on certain hardware,
and then do we optimize for that?
So do we have different deployment options
when we know in a certain region we don't get the chips
or if you have any insights on this,
I think this would be very interesting for me, right?
Because maybe region West has more of this,
region East does not, so we need to deploy differently.
Is this how it works?
So we do even have environments currently running like that.
There are certain regions where we just do not have the same instance types.
They are just not provided by the cloud providers.
Also because of the demand in those areas, right?
So also for them, at the end, it's a matter of demand:
how many people are demanding it, how much it's used,
because it's also more costly for them, right?
And we do have certain environments
in certain regions which are running on different instance types.
Maybe they are not as performant per se,
but for us, as a deployment,
it's way smaller than the other ones.
So it's not hurting us in terms of performance,
because at the end we get the same output.
Maybe, instead of having three instances
in there, you would have six,
just because it's a bit smaller, so it needs to perform in a similar way.
But we do have those alternatives, basically because we don't have all the time, the capacity in there.
And it's not only about that, right?
It's also about different chips, as you said.
Maybe sometimes you have Intel, sometimes you have AMD in there.
Depending on the providers and what you have in that region, you try to mix it up if you don't have the same everywhere.
And try to adapt to those situations at the end.
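The trade-off Alex describes, keeping total compute roughly constant by running more of a smaller instance type, can be sketched as a back-of-the-envelope calculation. The vCPU counts below are illustrative assumptions, not real Dynatrace sizing, and real equivalence also depends on memory, disk and network, as the next exchange makes clear.

```python
import math

def equivalent_nodes(current_nodes, current_vcpus, alt_vcpus):
    """Nodes of an alternative instance type needed to match total vCPUs."""
    total_vcpus = current_nodes * current_vcpus
    # Round up: a partial node still has to be a whole node.
    return math.ceil(total_vcpus / alt_vcpus)

# e.g. the region only offers 8-vCPU instances instead of 16-vCPU ones:
# 3 x 16-vCPU nodes would become 6 x 8-vCPU nodes.
print(equivalent_nodes(3, 16, 8))  # 6
```

Note this only balances one dimension; as Brian points out next, two small instances are not automatically equivalent to one big one, which is why the alternatives still have to be performance-tested.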
Andy, I remember several years ago, I don't know if it was Sonia or someone else that we had on,
but when we were first launching, you know, what Dynatrace is today,
with Grail, I think it was, we had it set up for AWS,
and then when we were trying to set it up for Azure,
I think there was this idea of, okay, well, we'll use the fancier, faster disks that Azure offers.
But it turned out those performed worse than the standard ones.
Right. So when speaking about this, I guess, you know, in bringing this up, a
consideration that SREs have to face, or even architects too, is that,
as you're saying, if the chip isn't available, or if they're out of a certain type of
storage or something, it's not as easy to just say, okay, well, just give us the next
one, because you have to verify that that stuff's going to perform well. Even if
you're saying we're going to use a six-core versus a three-core and we need to scale up, well, do
two three-core instances give you the same performance as one six-core instance, right?
Probably not, because there's a, right?
So there are a lot of things to factor in and keep control of.
I guess the curiosity is, or my question really is, how much of this stuff can you test for ahead of time, in terms of, do we know what capacity we get if we switch to these kinds of things, and how much of it is, we're not 100% sure and we'll just have a bunch of backup plans ready for
different approaches?
Like, how do you prepare for all those different
potential scenarios
in a situation where you don't know what you're
going to get?
We actually had an example from this year, right?
Because at the end, sometimes you
also want to just switch to newer and better
stuff. And in those situations,
right, you have a plan for how you want to roll that out.
Because it's not like you just go with everything at once
and then everything will be in there; it's a rollout plan
itself. So you try to do it step by step;
there are also certain actions that you need to do
in there on the infrastructure.
And sometimes you make the plan, you agree on the plan with the cloud providers.
And then at the end, sometimes they calculate something, but the chips are not there.
Sometimes not everything is in there.
So then you need to adapt to the new situation.
And depending on the case, right, you either sacrifice some other environments, and then you say, okay, I want the bigger instances here,
because actually here it will give me a bigger benefit right now.
And then you keep the smaller instances somewhere else.
Of course, you use the same instances you had before, right? We will never switch to
something we never tested. Usually, if we decide, okay, this is the newer or shinier hardware that we want to use,
let's see how it actually behaves in the fight. So you start first in the pre-production environments,
really test it, go through it, and see how it behaves. And then we say, okay, we want to go with this. But then we
slowly start to migrate towards those. If those are not yet there, or we have issues in terms of capacity,
then either we say we do it where it actually brings the most benefit, or, if not, then we say,
okay, we stick with what we have right now until we get enough capacity to actually migrate
to those respective instance types.
Just to be clear on this,
because I wanted to know what you have to do as an SRE.
Do you have to do all of this yourself?
Do you have people that also give you some of this data, that help you optimize,
that help you test, get insights?
How are you set up?
The good part of SRE is that you collaborate with so many other teams, right?
So that means you also don't need to do everything alone.
Usually it also goes through different levels, right?
Because when you choose, when you say, okay, we want to go to newer instance types,
it's not just a choice that we make now and done.
It's also a matter of how cost-effective it is for us,
because as an SRE, you also want to care about this.
And the moment you choose and you say, okay, there are newer instances there,
we want to go to bigger ones,
is it actually improving our cost overall?
Like maybe now, for one month
until we switch everything, it's higher, but then in
the six-month plan it's actually a way lower
cost. So you need to go to
the cloud cost part and also to the teams to
calculate and make sure this is an improvement we want.
Then we have
teams that are actually doing performance
testing. There are teams that are actually testing the
overall combination of
software, deployments and everything
with bigger instances, comparing the numbers
and saying, okay, these are
the numbers, this is what you gain,
this is what actually gets better.
And then, based on that, with architects and other teams which are doing the numbers,
we say, okay, if we do this action in production,
based on this, we could scale in by this amount of nodes,
actually, because these nodes are so much bigger and better,
and this is the amount of money we actually save.
So there are multiple teams, usually multiple people, involved in the decisions.
It's not just about us, because at the end, the moment we switch, it's not just switching the
infrastructure and then we put everything in there;
each service that is using that infrastructure needs to actually work on that infrastructure.
So we need to make sure that everything works smoothly.
And we do it in a very controlled way at the end.
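The cost reasoning Alex sketches (migration costs more for a month while both fleets overlap, then the new fleet is cheaper every month after) can be written down as a simple break-even calculation. All dollar figures are invented for illustration.

```python
def cumulative_cost(monthly_old, monthly_new, migration_extra, months):
    """Total cost over `months` if we migrate now, with one month of overlap."""
    return monthly_old + migration_extra + monthly_new * (months - 1)

def break_even_month(monthly_old, monthly_new, migration_extra):
    """First month where migrating is cheaper than staying on the old fleet."""
    month = 1
    while cumulative_cost(monthly_old, monthly_new, migration_extra, month) >= monthly_old * month:
        month += 1
    return month

# Hypothetical: old fleet $10k/month, new fleet $8k/month,
# one-off overlap/migration cost of $5k.
print(break_even_month(10000, 8000, 5000))  # 4
```

So in this toy example the switch pays for itself from the fourth month on, which is the kind of number the cost and architecture teams Alex mentions would sanity-check before approving the migration.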
There was a really big event, I think at least two actually in the last couple of months,
besides the Black Friday and Cyber Monday, where half of the world stood still.
You mentioned earlier that US East
is a very popular cloud region, and that's why many systems are actually there.
I also, I think, Brian, you just mentioned it earlier, that Taylor Swift released a new Netflix
show, and that also killed Netflix for a little bit at least.
But, well, Netflix is great.
It's not of our concern right now.
But what is of our concern is what happens when some of these big events happen, like
AWS had an outage, Azure has an issue.
I assume this is not the most fun time for an SRE.
It's not, but it's also one of the challenging ones, right?
It's the part that keeps you a bit in check about what challenges you have in there,
what you actually need to do, how to handle it.
And the downside of such an event is that it's something you cannot control as an SRE.
It's not about having a service running on our platform where we try to figure it out.
We can control certain stuff.
In this case, you cannot really do much,
because you cannot just go to one of the cloud providers
and say, yeah, sure, let me put the hardware there
and let me help you in a way so they can actually fix it, right?
And it's an interesting time,
because usually when it happens,
I think we had an Azure outage in Switzerland
somewhere in August or September,
and it happened somewhere on a weekend,
and you just react, you see it,
and then you say, okay, we put out a communication,
there is an outage,
and you just wait, because you cannot do much more than that.
It will make no difference if you try to do something
because it's just something happens on their side.
It's also a lot of people involved into those situations
because usually it will escalate because you need to make the communication.
Also other services are affected.
And a lot of people are coming in together to try to figure out,
okay, how do we do this better?
How do we communicate this?
Because also the customers then say, okay, you are doing this,
but how can we make sure that we are not getting hit
by this again.
Because at the end, it's not us providing the service.
We are on a cloud provider.
And then how do you make sure that this becomes resilient?
There is a lot of questions and a lot of communication, actually, in those cases.
Yeah.
Sounds like a great case for chaos engineering, Andy, right?
Adding what happens if these things happen to your cloud provider, right?
Yeah.
And I think this is also, and Alexander, correct me if I'm wrong,
but at least this is what I've been hearing,
especially since these recent incidents,
that people really now think more and more about two things.
A, the multi-cloud or multi-region setup is important
because you can never know whether a big disaster like this happens.
And then also some start thinking about,
okay, how can I reduce the risk?
Does it make sense for me to move back to something that I completely control?
Not saying that building your own data center is a good strategy
because building your own data center and operating it is a lot of effort.
But at least what I hear, and I'm not sure if you see this as well,
the whole discussion around multi-region, multi-cloud setup is becoming more of a topic again.
It's gaining much more popularity for sure, and also multi-region, right, at the end.
We also have banks running like that, right?
We have airlines running on cloud, right, in certain directions.
And then if something goes down, what do you do?
Because then you're blind.
Yeah, I suppose there's a cost factor to that too, right?
Because if you're going to have,
if you're going to be able to make a quick switch,
especially if one data center goes just completely dark,
you have to have all that data in your backup to continue.
So that means any company who's looking to do that
has to pay the extra cost to have all this data transfer
and have all this stuff ready to go just in case.
But meanwhile, if it's not being used,
it's, I don't want to say wasted money,
but it's an investment in a what-if scenario.
But the thing is, right,
I mean, this is not a new problem
that we solve as an industry.
Just before cloud providers,
organizations, large organizations,
had multiple data centers,
a failover, a disaster recovery.
But I think now with more and more organizations
really becoming software companies
that need to operate 24-7
where every minute of downtime counts,
they also need to think about this
and because you don't own the data centers anymore
you need to have the right contracts
and the right strategies
to then provide your service
from a different region or from a different cloud vendor
and that also means more testing
on disaster recovery or on switching over
more testing across
And this is for me the interesting piece,
what you just kind of explained earlier,
that cloud is not cloud.
and so if I think
I'm born on Cloud A,
then I just move stuff forward to Cloud B
because they provide the same thing anyway.
That's not the case.
That's really fascinating.
You're not Mario in Super Mario Bros.,
where you can hop from cloud to cloud
if you get to the secret world.
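The multi-region failover idea discussed above could be sketched roughly like this. This is a hypothetical Python sketch, not anyone's real setup: the region names and the health-check function are invented for illustration.

```python
# Hypothetical sketch of client-side multi-region failover:
# try regions in priority order and use the first one whose
# health check passes. Region names are illustrative only.

REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]

def pick_healthy_region(regions, health_check):
    """Return the first region whose health check passes."""
    for region in regions:
        if health_check(region):
            return region
    raise RuntimeError("no healthy region available")

# Simulate the primary region being down: the next region wins.
down = {"us-east-1"}
region = pick_healthy_region(REGIONS, lambda r: r not in down)
```

In a real setup the health check would be an actual probe with timeouts, and the hard part, as the conversation notes, is that the data must already be replicated to the fallback region for the switch to help at all.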
So there's another component of this, right?
Because we're talking right now about major outages
in cloud regions and all this, right?
But sometimes it may be a matter of
a reduction in
their performance, right?
Now, if we think about
Black Friday, not Black Friday, but
like Super Bowl, right? Some of the older
stuff we learned from the Super Bowl ads would be,
especially in retail, would be
let's redesign our website
for the game, get rid of
everything we don't need, and just focus on what we're
trying to promote, right? So we can
do, you know, let's say
if I'm a vendor, and I think it was one of the car companies that
did this, a lot
of stuff on our side to reduce what we need
to deliver what we want to provide, so that we can get more capacity out of that.
When it comes to providing a platform like Dynatrace or other SaaS vendors that service
other companies, it's not like you can just say, well, let's turn off most of our functionality so
that we can deliver our car ad, right?
However, I guess the question is, and not specific to us, but using us as an example, are there
tweaks we can make to our platform to reduce what we need to consume in the situations where
there might be a restriction on what we have access to?
I'm not asking for specifics, but I'm just like, in general, are there?
Or is it like we need all or nothing?
Because to me, it just sounds like a lot of these providers would need all or nothing,
because how do you reduce what you can give?
But is that part of what you do as SREs, to try to find things that you can tweak in these
kinds of worst-case scenarios?
That's a very good question.
I mean, for us, not directly,
at least not towards us, because at the end,
what we offer is a monitoring tool for them,
and we don't control their side. What we can do
is just make sure that they have optimizations
in terms of queries or in metrics or
anything that they need in there to make sure that they are
monitoring what they need, right? And they are not
just having noise or anything like that, which
helps them also long-term.
And working with the teams,
also with the other developers and ask them,
look, the guys are having problems. They are
having performance issues, like what can we do for them?
But at the end, as for what the customers can directly do in their platform to improve,
like, I don't know, the UI or anything like that, it's a bit harder for us because we don't
have a direct view of what happens there.
We can just see the outcome or the output of what we monitor out of it, right?
So then you can just see from our end, okay, they have very heavy queries in there.
Or maybe they're actually monitoring stuff, but do they really need all of this?
Because it's just a lot of noise, or a lot of logs which are just info logs.
Do you really need everything in there?
Like, it's also these questions that we can ask.
And depending on that, we can say, okay, guys, you could optimize this or do you really need this?
Sometimes it's also happening for the customers, right?
They maybe just put in a script.
They didn't realize it.
And then they had a lot of cloud metrics in there.
And then it's just flooding their platform as well, because at the end they are struggling to monitor the stuff.
And they don't see it bringing their site down.
But then we see it on the other side.
And we're like, guys, do you really need this?
Because now it's breaking your system.
So can we do something into that?
But it's a bit harder to go to them to say, maybe optimize this part of the UI.
Yeah, so there's stuff we can optimize on our side, but not really reduce on theirs.
But, yeah.
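The "do you really need all of these logs?" conversation above could be sketched as a simple volume check. A hypothetical Python sketch: the threshold, source names and numbers are invented for illustration.

```python
# Hypothetical sketch: flag telemetry sources that contribute an
# outsized share of total ingest volume, as candidates to discuss
# with the customer. Threshold and data are made up.

def noisy_sources(volumes_by_source, share_threshold=0.5):
    """Return sources whose share of total volume exceeds the threshold."""
    total = sum(volumes_by_source.values())
    if total == 0:
        return []
    return [src for src, vol in volumes_by_source.items()
            if vol / total > share_threshold]

# Example: info logs dominate the ingest volume.
ingest = {"app-logs": 1_000, "info-logs": 9_000, "metrics": 500}
flagged = noisy_sources(ingest)
```

A real check would look at volume over time and cost, but even a crude share threshold like this surfaces the "it's just a lot of info logs" cases worth a conversation.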
I think the, thanks for the reminder, Brian, about the, I think I would call it an MVS,
the minimal viable service that I want to provide, right?
Kind of like to the bare minimum.
Yeah.
And I guess from an e-commerce perspective, I would just say people need to see the product,
they need to be able to buy it and check out.
Very easy in a scenario.
It's easy.
From an observability perspective,
the only thing that I could say
in minimal viable service is we want to make sure
we always ingest all the data, we're not losing data,
we analyze it and we alert.
Whether a dashboard takes one second or one and a half seconds to load,
and whether we turn on all the bells and whistles,
these might be individual features that you could potentially kind of turn down a little bit.
But I guess it's an interesting thought.
How would you do this?
Yeah. A lot more complicated.
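That minimal-viable-service idea, keep ingest, analysis and alerting running while shedding nonessential features under pressure, could be sketched like this. A hypothetical Python sketch: the feature names are invented and this is not how any real platform is configured.

```python
# Hypothetical sketch of graceful degradation: under capacity
# pressure, keep only the critical features on. Names are invented.

CRITICAL = {"ingest", "analyze", "alert"}

def active_features(all_features, under_pressure):
    """Shed noncritical features when the platform is under pressure."""
    if under_pressure:
        return {f for f in all_features if f in CRITICAL}
    return set(all_features)

features = ["ingest", "analyze", "alert", "dashboards", "reports"]
degraded = active_features(features, under_pressure=True)
```

The design choice the conversation hints at is exactly this split: which features define the minimal viable service and must never be shed, versus the bells and whistles that can be turned down.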
Alex, you talked about obviously this,
and there was a reason why I brought up this whole outage
because, as you said, it's a challenging day.
You know, some people like challenges, right?
Because finally something is happening.
Like, whoo, there's a party going on?
We're going to bring the energy.
We're going to get pizza.
Would you say you get that party energy?
So the question is, those moments, I assume you're not looking forward to them.
You cannot control them, as you said.
But give me ideas of moments that you cherish, that you really say,
this is what I love about being an SRE.
This is what I really like to do.
And what is it?
So I don't look forward to getting
those kinds of issues, right, and having incidents or critical issues and being on call.
But in the same time, it's actually something that really drives me a lot.
So it's probably not something that everyone would say, just being in issues and doing firefighting.
But it's actually the type of activity that's challenging and interesting, because it's not just
that you are in there trying to troubleshoot and it gives you something, but it's also the
amount of people actually collaborating there.
Because generally when there's an issue, you have different teams, different people coming in.
And then you try to figure out, you try to really see where
the issue is coming from, trying to identify the root cause. And at the end, when you find
it, it gives you actually a very nice feeling at the end because you actually figure out what's
causing it. Of course, we should not get into that, and it should be found in pre-production and all
of the stuff. But sometimes there are so many small cases and different corner cases which you
cannot just identify in pre-production. And this is quite an interesting drive, I would say.
And I think this is the main beauty of SRE:
it's the fact that you
are not just doing one topic at once,
it's about jumping through different
topics, you have to actually have a bit of
multitasking, you have to jump through different
ideas, also different
communications, and you also need to communicate
actually with so many people
on so many topics.
So you do scaling now, maybe then you have
an issue actually happening on the observability
side, and then you try to identify what's
happening with the metrics, what's happening there
with the responses in there,
right? So then it's also the range
of people you actually collaborate with as an SRE that is so much different than just
being on a development side, where you have your own piece, you know your piece in there
and you are very specialized. But as an SRE you are specialized, but at the same time you are very
broad, so you have a bit of a multi-flavored chocolate, I'd say. Yeah, you know, it's
interesting because, you know, I agree, when you're in a tough situation,
everyone comes together and collaborates and you find your way out of
it, especially if you're all, you know, working very, very solidly together.
You come out with this great victory.
You're the hero at the end of the movie who destroyed the big enemy, right?
But so much of what your job is, right, is preventing that stuff from happening in the first place.
And it always seems like in general, nobody notices when you did things properly and it didn't
become an issue.
Do you have ways to track, like, this big event happened,
it didn't impact us because we were prepared and we did everything, and let's call ourselves the hero again?
It's not quite the same, you know, not the same celebration, because you didn't have to face the monster with a sword.
But like, well, we didn't have to face him because we set up all our traps and they caught him, and we're awesome.
Right. Right.
Is there a way that we track that?
Can you identify that and say, like, look how great we are because we're not getting into these situations,
and celebrate the victories that you didn't have to even fight?
I think it's also interesting, if you think now about it,
because usually it's harder for us to celebrate stuff that didn't happen, right?
Also in life generally, right?
You don't really celebrate the small things that didn't happen
or the things that you prevented,
but you celebrate in the aftermath of the stuff that was tougher.
And you tend to get over the good stuff much faster in a sense.
We don't really have a way of measuring it,
but at the same time we can also say
there are certain situations where also probably
incidents like critical issues happened
but we found them actually so fast,
because we were paying attention, we were prepared,
we knew how to react and we said,
okay, we do this, we do that,
and then we prevented it
and nobody saw it at the end, right?
It was something that was contained
in a small group, the customers didn't see it,
nobody was impacted, but it could have been a bigger issue.
So there are those kinds of things
that we just remember ourselves, in a sense.
In our case, because of how we are spread and how we are working with everyone,
we usually tend to remind ourselves in the different meetings that we have together:
this is the great stuff, we were there,
we went through the debugging session and we found that, and then we prevented it and we did this.
I would say another good example
is the Black Friday and Cyber Monday this year.
We were prepared in a very different way, because it was not just about the SREs being
involved into that, but there were also all of the other teams that are part of the full
platform in there.
So they were also prepared.
They had also their schedules in there.
They also had their teams in there.
They also had their proper runbooks
and also the thoughtful process
of what can we do in case
this bad thing happens,
or what can we do in case
this part is failing.
And at the end,
when we crossed the line of
the full Black Friday and
Cyber Monday, actually,
it was a very uneventful
period for us, because everyone
was just prepared
and it worked so smoothly.
It was,
it worked perfectly in terms of collaboration, in terms of people reacting, being there for us,
and also us being there and actually having the support from the people.
So I think it's always the small bits, but you never really count or, like, measure those parts in there,
which probably we should change, right?
Yeah, I hope someone's noticing.
I hope the people above are noticing, right?
Because that's, okay, you're doing what you're tasked to do fantastically, right?
Yeah.
Yeah.
I think I remember some terms, like a near-incident and a non-incident or something like this,
where you basically can measure that we were fortunate enough to have handled, as you said,
the incident fast enough without an impact.
It's like when the automated brakes of a car actually work, right?
We've prevented that many accidents because our system reacted fast enough.
It's pretty cool.
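Counting those near incidents, issues caught and resolved before any customer impact, is one way to make the prevention work visible. A hypothetical Python sketch: the event fields are invented for illustration, not from any real tracking system mentioned here.

```python
# Hypothetical sketch: count "near incidents" (events detected and
# resolved before any customer impact) so prevention becomes
# measurable. Field names are invented for illustration.

def near_incident_count(events):
    """Count detected events that caused no customer impact."""
    return sum(1 for e in events
               if e["detected"] and not e["customer_impact"])

events = [
    {"detected": True, "customer_impact": False},  # caught in time
    {"detected": True, "customer_impact": True},   # real incident
    {"detected": True, "customer_impact": False},  # caught in time
]
count = near_incident_count(events)
```

Even a crude counter like this gives the team a number to bring to those meetings where, as described above, they remind each other of the debugging sessions that prevented bigger issues.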
Hey Alex, thank you so much for your insights.
It's always amazing how fast time flies.
I know we always say in the beginning we only have until the top of the hour from the recording.
And sometimes we think, wow, this is a lot of time.
But then, as you can tell, it's interesting because the time flies.
We'll definitely have you back because, folks,
we have a long list of things that we wanted to discuss.
And we only got to a small portion of it.
We want to talk about automating the right toil.
We wanted to talk about one of your favorite topics about releasing new changes, releasing new versions.
I also wanted to talk about the skills of an SRE.
What can you give people along the way? Even though I think we covered a lot of these things already.
Any final thoughts from your end before we close this session?
I think anyone who is thinking about this should try it once,
at least try to just see what they are doing, or just shadow them, just be there alongside them,
because it's quite a different world.
We tend sometimes in our teams to say we are like ER doctors, or adventurers sometimes, right,
because you're always there fighting and making sure everything works.
And if it's not, then you are there to try to make it better and try to make it work.
So, yeah, if you're thinking about it, maybe try it once, because it's a cool world, I would say.
That's great. Yeah, I mean, and we'll get into it in the next episode,
but there's all kinds of skills that are needed in that, right?
Don't be like, well, this is what I do.
There's, you know, as you mentioned, performance testing.
There's all different kinds of pieces needed.
So, anyway.
It also reminds me a little bit of how it's a different world.
When I started my career as a software engineer,
I had to spend the first two or three months as a quality engineer
testing the software that I was later developing.
And this gave me a completely different perspective about quality.
And I think from your perspective, if you have to be in SRE
and you see what it takes to operate software,
what problems can come up,
it gives you a different perspective when you architect and develop the software.
Yes.
So everybody should have to work with the SRE team for six months before they go back to their job.
Yeah, rotations, right?
I mean, rotations are great, yeah.
Great idea.
There you go.
Well, really, really appreciate it, Alex, Alexandra,
whichever you prefer.
Alex or Alexandra?
Alex?
I'm open.
I think it's, yeah,
it's, I was Alex actually
also in the team,
but then we got another Alex.
So then it's a,
we are mixing and switching
depending on the people.
Okay, well,
either way,
it's fantastic having you on.
Really look forward to the next episode.
Andy,
thanks for bringing her on.
This has been great.
And hope everyone
enjoyed it.
And stay tuned for the next time
we have you on.
And enjoy your skiing.
Come on up.
And Andy,
don't break your back.
Doing my best.
I try to be resilient.
There we go.
Thank you, everybody.
