PurePerformance - 051 Building a Zero-Dashboard Monitoring Culture with Erik Landsness
Episode Date: December 18, 2017
Erik Landsness, Director of the Network Operations Center & SRE at Beachbody, talks us through the last 1.5 years in his role, where he has been transforming the role and culture of the traditional NOC team from human-based dashboard analytics to an automated, self-healing, zero-dashboard culture. While they haven't yet reached that end state, they have made big strides. Erik shares with us how to gradually transform into a modern operations team that automates things that humans shouldn't do, such as staring at dashboards on walls 😊 Erik is also presenting at Dynatrace PERFORM 2018. Make sure to check out his session to learn firsthand! https://www.dynatrace.com/perform/speakers/
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello, everybody. My name is Brian Wilson. Welcome to Pure Performance.
I'm going to be serious today because our co-host Andreas Grabner, and yes I'm calling him Andreas,
said we usually start the show with bad jokes. And I was offended. So hi Andy.
Hi Brian, well, I don't feel offended. Actually, it is.
I think sometimes we start the show with bad jokes that shouldn't even be considered jokes.
That's why it's so bad.
I don't know.
We'll see.
Fozzie Bear from The Muppet Show is one of my favorite comedians.
And if you're familiar with him, which you probably aren't, but he's just notoriously horrible.
Anyhow.
So there's our bad joke introduction.
All right, awesome.
We got that out of the way.
What do we have?
What are we talking about today, Andy?
And who's our guest?
Yeah, so today I'm very intrigued with this. And the reason why we have this person on board is because his name is Eric, and he's going to present at our upcoming Perform Conference.
And the title is Building a Zero Dashboard NOC and SRE Team.
And I thought this was pretty cool because I believe that's kind of where the industry
is heading.
But I believe a lot of people in our industry are still focused so much on building great
big dashboards that they can be proud of by putting them on a big wall monitor. And I think it's just interesting to hear from Eric kind of his journey at his current employer
and kind of how that transition went.
And without further ado, I want to see, first of all, Eric, are you still there with us?
Yes, I'm here.
Perfect.
So, Eric, would you first of all maybe introduce yourself, who you are, a little background, and then let's jump into the story that I think you're going to tell as well at the end of January.
But let's start with the introduction.
This sounds kind of like a unicorn story, though, but I'm fascinated to hear it, right?
Because the zero – yes, Eric, please. I definitely want to hear this one. Sure, yeah. So my name is Eric. I'm the director of the Network Operations Center at a company called Beachbody, who you may know for all of our health and fitness products, such as the very famous P90X workout routines that everyone loves to hate.
So, yeah, so I've been at Beachbody for about a year and a half now. Previous to that, I've spent countless years working in support organizations for web hosting companies and CDN providers and things like that.
So this is actually my first corporate gig.
So it's been an interesting journey for me. So yeah, so pretty excited to talk to you guys and looking forward to talking a little bit about what we're doing over at Beachbody.
Yeah, and the last one, I have yours.
So since you've been with the company, have you transformed your body as well?
I used to work at Weight Watchers and everyone in the IT department was not participating in Weight Watchers.
Yeah, that's a pretty common thing.
The technology department at Beachbody is probably the least fit department in the company.
So, but Eric, I remember we had some calls to figure out what the story is going to be that you're telling at Perform.
And we went back to the point when you got hired and kind of when you walked into the existing status quo situation.
And maybe you want to bring us back to that point and what your vision was.
And then we just take it from there, because I believe it's a fantastic idea that you have.
And let us, you know, kind of tell us where you are in the transformation story and what
your end goal is.
Yeah, sure. Definitely. So, yeah, I was brought on board at Beachbody.
It's actually kind of interesting. I was hired because my current title at my previous employer had the word NOC in it.
But I've honestly never worked in a NOC before in my life, which is kind of interesting.
So, like I mentioned, my past experience is mostly in support
organizations. So in call centers, for the most part, delivering end user support to customers.
So basically, I'm used to problems coming to me and solving them. So for me, the concept of
putting dashboards on the screen is pretty foreign to me, but it sounded like a really cool opportunity. Beachbody is a great company, so I was looking forward to the opportunity. So on my first day of work,
I came in and checked out the NOC. We had a little NOC there in the office. And there were TVs all
over the wall. This is an office the size of a single desk type office, but there's two guys sitting in there.
There are six massive TVs hanging on the wall, and there's all kinds of stuff on the walls.
And I look at our main monitoring tool, and it's completely red.
And I instantly panicked, like what's going on?
Oh, my God.
And I started like, whoa, did I walk into a mess here this morning?
And the two guys that are working in the NOC were very calm
and collected and relaxed. And they said, oh, no, no, no, no, it's no big deal. We don't really
look at those. And I think there's a lot of that that kind of goes on in the NOC world where it's,
you know, we get ourselves in the habit of building dashboards and putting things on screens just to fill up the real estate.
We don't really understand what it is we're trying to accomplish.
So what we've kind of tried to do is take a step back from that and think about, well, what is it that we really need to know?
And it boils down to one thing, and that one thing is, is there anything broken?
And that's really all we need to know.
We don't really need to see green things on the wall to make us feel better, right? So we don't
really have anything to gain from that. And all we're doing is wasting electricity and hurting
our eyes with all the light. So what we've tried to do is we've tried to think of ways that we can tool our monitoring so that it bubbles up issues to us rather than having to go look and hunt for issues.
There's one thing that computers are really, really good at, and that's taking lots of data and telling you when something in that data is not normal.
Humans are really bad at that.
We're really, really bad at making comparisons across lots of data. So getting away from that mindset of looking at things on a dashboard and hunting for something that's not quite right really makes a lot of sense, but it's a difficult cultural shift. And that's something that we're kind of working on at Beachbody. When I first came on board, the plan for the NOC was to build it out: hire more people, bring some more people in-house, and create 24x7 staff. And that's something that I've pushed back against, and it took a lot of convincing to get people to the point where they thought the same thing.
I remember a question was asked to me, which was, you know, if you don't have people watching screens, how do you know if something is broken?
And I thought about that question for a little bit, and I responded with another question, and that was, if you have people watching screens, but your monitoring is not good, how do you know if something is broken?
And that's really the shift in mindset that I think we're trying to do.
But we've managed to overcome that kind of hurdle and sell it internally. And it's just a matter of doing the implementation.
So you asked kind of where we are in the process.
I'd say we're about halfway through our journey.
We've done a lot of tooling.
We've built the first kind of steps of automation.
Once your monitoring is reliable, you can start to automatically fix things,
which is the super cool thing about having great monitoring.
Instead of having people watching and reacting to things, you're reacting to the data you have in code as well.
So we're kind of making a transition from hiring NOC technicians to hiring site reliability engineers. So instead of having people that look for things
that are broken and call other people to fix them, we're having individuals look for things that
break and write software to automatically react to those problems. Wow. Well, that's pretty cool.
I mean, I think I want to just digest what you just all said because I think there's a lot of cool stuff in here and a lot of things that are actually we need to bring to the attention of people.
So, first of all, what I like a lot and what I see a lot as well, right, we are used to do things in a certain way.
And I think some people actually build their careers around producing something,
whether it's nice dashboards, whether it's nice reports. I remember my wife telling me in her previous role that she always had to send that report on a Friday afternoon. And if the report wasn't sent, people got upset; it was like a 50-page document. And then I said,
what is actually happening with this report? And she says, she doesn't know. She doesn't care.
All she knows she needs to produce that report.
And it's like the same what you said, right?
It's great if you have these dashboards and somebody can be very proud of it.
But if it's not impacting the bottom line, which is actually knowing when something is wrong, then the nicest dashboard, the nicest report doesn't help anybody.
And I think that's a great point that you made.
And then I also liked some of the quotes that you just said.
I mean what do you do without the people?
Because then nobody can tell you if something is wrong.
But yeah, then the side assumption is if you don't have great monitoring,
then you just have people that look at bad data and make bad decisions, basically bad human alerting.
There's one thing I want to dig into a little deeper before you explain how your SRE teams and all of that work.
So you said the only thing that you really care about is if something is broken, right?
And do you have like a magic metric that you're looking at?
I know, for instance, in e-commerce, it is often, I think Amazon is famous for this quote from Werner Vogels.
He said the only metric he really cares about is order rate.
So do you have like a magic metric, the ultimate thing that you try to feed monitoring data into, so that by looking at that metric you know if something is broken? Or do you have a set of metrics that you came up with?
Sure, yeah. So the magic metric for our business is our uptime percentage, which is a pretty common, old-school type of metric. But for me, from a NOC standpoint, the only metric that I truly care about is my number of problems. And we use
Dynatrace. We've started using that product a few months back. And one of the things that I
that I really enjoy about it, I kind of mentioned that computers are really good at looking through data and telling us when something's not right.
You know, when we made the transition to that offering, we were able to start putting more data into it, right?
So rather than traditional monitoring where you're saying, okay, let me look at CPU, let me look at memory, let me look at
disk space utilization, let me look at IO, et cetera, et cetera, et cetera. You put those
pieces of information in and you have to specifically call out, I want to view this,
I want to view this, et cetera. And then not only that, generally in a traditional monitoring tool, you have to tell the system what your thresholds are. So, you know, if it reaches this percentage of memory utilization, then alert me, et cetera.
And the problem with that is that you have to really understand your environment inside and out.
And, you know, that's something that we don't have the luxury of at Beachbody.
So we have some growing pains.
We're a company that grew extremely rapidly. And with it, the technology group grew and the product offerings grew. And when that happens, you don't have nice, uniform, well-designed systems. You end up with kind of unique solutions, and that's what we deal with, right?
So people that designed things are long gone. People that built them might bolt things onto them. And
it's not a well-thought-out, mature product. And the advantage to using something like a
Dynatrace offering is that I can just dump whatever data this application is giving me, whatever data these nodes are giving me, I can just throw it all in there and ask the system to tell me when something is interesting.
So I no longer have to figure out what's important to me and look at specific metrics.
All that I'm looking for is tell me when something is not right. So the more data that pumps through it,
the more it understands, okay, these are normal operating levels. And when something's not normal, then we get a problem that bubbles up. We get an alert. It goes into our chat ops tool. The NOC reacts to it, or we have automatic scripts that trigger based on the problems.
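The automatic baselining Erik describes, learning normal operating levels from a stream of metric values and flagging deviations, can be sketched in a few lines. This is an illustrative toy, not how Dynatrace works internally; the window size, warm-up length, and three-sigma threshold are invented for the example:

```python
from collections import deque
from statistics import mean, stdev

class Baseline:
    """Rolling baseline over the last `window` samples; flags values
    more than `k` standard deviations away from the learned mean."""
    def __init__(self, window=60, k=3.0):
        self.samples = deque(maxlen=window)
        self.k = k

    def observe(self, value):
        """Record `value`; return True if it is anomalous against the baseline."""
        anomalous = False
        if len(self.samples) >= 10:  # need some history before judging
            m, s = mean(self.samples), stdev(self.samples)
            if s > 0 and abs(value - m) > self.k * s:
                anomalous = True
        self.samples.append(value)
        return anomalous

b = Baseline()
for v in [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99]:
    assert not b.observe(v)   # normal traffic establishes the baseline
print(b.observe(500))         # → True: a large spike bubbles up as a problem
```

A real pipeline would feed every metric through something like this and route only the flagged values into the chat ops tool, rather than charting everything for humans to scan.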
You know, Andy, a lot of that sounds very similar to some of the concepts you and I
were discussing and we were chatting with some of our other guests on monitoring and
observability based on that article.
I'm probably going to keep referencing this article for a long time to come now. It's by Cindy Sridharan. Eric, I'm not sure if you've read it; I'm going to send you the link to it, though.
Sure.
Yeah, it's this great article, but it talks very specifically about what you're mentioning: when you're monitoring in that way, you're only monitoring for what you know to monitor, right? As opposed to just capturing everything and having something else pull it in for you.
And one reference I have to just get in there, Andy, and I'll toss it back to you in a second. When you're talking about having the different dashboards and the usefulness of the dashboards: for anybody listening, go check out the Monty Python sketch called the hospital sketch. It's one of the famous ones because that's where they say, bring me the machine that goes ping, you know, in the hospital room. And that is so much like what this is: you have all these different things, and they're useless in a lot of ways, right? So anyhow, I wanted to get those two things in there before the conversation keeps going. Andy, you were going to say something there too.
Yeah, no, I think I wanted to pick up something that you mentioned: that you don't have the luxury of knowing your system well enough to be able to actually define thresholds.
I think that's obviously a true statement.
But I think even more so with the new application architectures we are dealing with, I think it doesn't make sense anymore to define alerts based on resource thresholds.
I had a session, I was at reInvent two weeks ago, and we were there with our booth, and there's a lot of monitoring companies around, and some of them, you know, have very pretty dashboards.
And so what I thought, what I did is I observed people when they came over from some of the other booths and came to our booth,
and then I showed them a Dynatrace dashboard with a lot of charts on it. And then I asked them,
so what does this dashboard tell you? Is it good or bad? And then they said, well, it looks strange.
I don't know. I think it's bad. And I said, honestly, you have no clue because you didn't
build that dashboard. And even if somebody would have built the dashboard and set an alert on something and it goes red, it doesn't mean that it's a bad thing.
Because just because a CPU goes hot doesn't mean it's a bad thing if it doesn't impact the bottom line.
Coming back to what you said is are we up?
Can people view our content?
Can people sign up for our service? That's in the end what really matters, and especially if we're now building very dynamic environments that scale up and down depending on demand,
looking at something like CPU utilization, memory utilization, or the existence or, let's say, the number of instances of a certain component becomes not completely irrelevant, but not as important as it used to be.
Yeah, I was going to challenge you there, Andy, as well. Because if you're thinking about something like caches, right, different kinds of cache usage: while you can scale out, if you're not properly leveraging your cache, then you're going to end up costing yourself more money. So it might not necessarily impact, let's say, that Amazon sales rate.
But you run the risk of introducing scaling risk to your system if you're not looking at certain things that might not trigger some sort of automated alert.
Your lack of use of cache might not trigger an issue until it's way out of control, right? So again, it's not black and white, I think, but there are probably use cases where some of those metrics are important to have. But again, if you're just monitoring that cache and you have that number there, it's meaningless if it's not tied to or analyzed in conjunction with some of these other components, right? Just having that number is what's kind of useless. It has to be actionable.
Exactly. And I think, Eric, to your point, you said, you know, machines are pretty good
in ingesting a lot of data and finding anomalies. And I believe that's why anomaly detection and
automatic baselining is the key point here that, you know, we all try to do. And I mean,
that's not only true for Dynatrace, but I think other monitoring vendors as well.
But I think that's a key thing. For sure.
Cool. So thanks for that
little excursion. So that means dashboards alone
obviously don't cut it anymore, and we want to
go to kind of this zero dashboard
knock. So you
explained where you came from, how
the situation was, and kind of your
path. What
else did you do? Some lessons learned, or
like you mentioned SRE team. What does that
mean exactly? Sure. Yeah. So the SRE team, we currently have a single person doing SRE work.
In addition to myself, I'm actually writing software as well. But compared to the size of
our NOC, I have a NOC of two full-time employees, which probably sounds amazing to people listening that we only have two full-time employees for such a large company monitoring all of our applications.
But we've managed to do a lot of really hard work in the beginning improving our traditional monitoring.
And we still have traditional-type monitoring. We still have up-down monitoring. We still have ping monitoring and
memory monitoring and CPU monitoring, et cetera. But we just don't look at them, right? We use
them as another data point that we feed into our system to understand, okay, I'm seeing this weird
thing happening, but I'm also seeing that CPU started
spiking. We can make some kind of correlation and identify issues that way. Again, like I said, it's more data, more and more data. So you don't abandon your traditional type monitoring.
You just use it as an additional data point on top of seeing what comes out of more automated type monitoring, right? So for us, what the SRE
team's goal is to dig through the findings. They do actually create dashboards. They create dashboards not for staring at on a TV, but to get a holistic picture of a particular application so that they can see how things are trending.
They can dig down into things and they can understand maybe some tuning that can be done.
They make recommendations to the delivery team saying, you know, hey, I think that we have this bottleneck in our code, right? So the machine identifies this particular query has been running
increasingly longer amounts of time. So we need to start looking at ways that we can optimize
our database, et cetera. So they do some proactive work that way. And then in addition to that,
they're building our self-healing tools so that we have things that, for example, if we start to run out of disk space,
that there's something that will automatically go through and tar up files or something like that.
They build that type of automation.
Or if a process fails in a certain way, they build something that, you know,
will kick off a job to restart, do a rolling restart in the service or something like that.
That's their focus.
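The disk-space example Erik gives can be sketched as a small self-healing job: watch utilization and archive stale logs when a threshold is crossed. The paths, threshold, and age cutoff below are invented for illustration:

```python
import os
import shutil
import tarfile
import time

# Hypothetical values; a real deployment would take these from configuration
# or from the monitoring tool's problem payload.
LOG_DIR = "/var/log/myapp"
USAGE_THRESHOLD = 0.90           # act when the disk is 90% full
MAX_AGE_SECONDS = 7 * 86400      # archive .log files older than a week

def disk_usage_fraction(path):
    """Fraction of the filesystem holding `path` that is in use."""
    total, used, _free = shutil.disk_usage(path)
    return used / total

def archive_old_logs(log_dir, max_age):
    """Tar up and remove .log files older than `max_age` seconds;
    return the path of the archive that was written."""
    cutoff = time.time() - max_age
    archive = os.path.join(log_dir, "archived-%d.tar.gz" % int(time.time()))
    with tarfile.open(archive, "w:gz") as tar:
        for name in sorted(os.listdir(log_dir)):
            path = os.path.join(log_dir, name)
            if name.endswith(".log") and os.path.getmtime(path) < cutoff:
                tar.add(path, arcname=name)
                os.remove(path)
    return archive

# The trigger: only act when the monitored condition is actually met.
if os.path.isdir(LOG_DIR) and disk_usage_fraction(LOG_DIR) > USAGE_THRESHOLD:
    print("disk filling up, archived:", archive_old_logs(LOG_DIR, MAX_AGE_SECONDS))
```

In practice the trigger would come from the monitoring tool's problem notification rather than from polling, but the remediation step looks much the same.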
That's pretty cool.
The whole concept of self-healing: I actually have to kind of admit something here. I had a hard time using the term self-healing for a while because I thought it's something that's futuristic.
I always talked about mitigation, but the more I hear people like you also talking about self-healing, the more comfortable I am using that term. And it's also something that we've been promoting; you know, we see some of our users, exactly as you described, using monitoring data to trigger better, more focused self-healing actions.
You know, it's basically, I mean, just some of the examples that you mentioned, full disk,
you know, process that stalls, some other things like, you know, scaling up, scaling
down, turning on an additional cache layer or turning something off.
These are all things, thanks to, I believe, better monitoring, it allows us to actually do much smarter
auto-remediation, which then kind of looks like the system is actually healing itself.
So I believe that's actually a cool thing.
Sure.
And Andy, to bring up the topic you've been bringing up a lot, you know, talking about our own transformation in these realms. And Eric, you were talking about how you're all still in the process of making these changes. You know the story much better than I do, Andy, but I believe one of our initial forays into this realm was to take all of our NOC runbooks and just automate them, because each was basically a script.
So we know, again, based on the traditional monitoring, when certain conditions are met, there's an action that gets taken manually. And part of that initial process could very easily be just taking those and automating them.
Did you have anything like that, where, you know, with the existing NOC, you went through that process of just automating all those?
Yeah, for sure.
I would like to say that we're finished with that process, but it's still
a work in progress. But there is some automation that's been put into play. Some of it still has
human checks in place. So there's still some confidence building that needs to exist that
our data that comes out of our monitoring tools is accurate and is actually identifying a real
problem. And once that confidence is built with the teams,
then we can build that glue that says, when this happens, automatically run this automated script.
But the first step is to automate it
so that you know that you can just hit a button
rather than having to manually log into things and restart them while you're playing, as I like to call it, whack-a-mole with things that are broken.
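The pattern Erik describes, automated runbook steps gated behind a human confirmation until confidence is built, might look something like this sketch (the problem names and actions are invented for illustration):

```python
# Hypothetical runbook registry mapping a monitoring problem type to a
# remediation action, plus a flag for whether it is trusted to run
# without a human confirming first.
RUNBOOK = {
    "disk_full":    {"action": lambda: "archived old logs", "auto": True},
    "service_hung": {"action": lambda: "rolling restart",   "auto": False},
}

def handle_problem(problem_type, confirm=lambda prompt: False):
    """Dispatch a problem to its runbook entry. Trusted steps run
    automatically; the rest run only if the `confirm` callback (say, a
    human pressing a button in chat ops) approves."""
    entry = RUNBOOK.get(problem_type)
    if entry is None:
        return "escalate to on-call"      # no runbook yet: page a human
    if entry["auto"] or confirm("run remediation for %s?" % problem_type):
        return entry["action"]()
    return "awaiting operator approval"

print(handle_problem("disk_full"))       # → archived old logs
print(handle_problem("service_hung"))    # → awaiting operator approval
```

Flipping a step from gated to automatic is then just a matter of setting its `auto` flag once the team trusts the underlying monitoring signal.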
So you said the SRE team.
Is that part of your operations team?
Is this a separate team that you manage?
How is that organized?
That's interesting for me to know.
Sure.
Yeah.
So there's not a whole lot of crazy formal structure when you only have three employees.
But technically, I do run two individual teams.
So there's the NOC and then there's the SRE team.
But they all, I mean, they're all reporting into me.
We all go to the same daily stand-up.
We all have a weekly all-call with all the NOC technicians and the SREs.
And then we actually roll up – our NOC rolls up into the DevOps organization.
It does not roll up into a traditional infrastructure ops organization.
So when you talk about the DevOps organization, that means you are part of, I mean, the whole engineering organization. That means you have your daily standups, I assume, within your team,
but how often do you meet with the actual, the application teams and development teams?
Sure. So that's kind of on a, you know, it's on an as needed basis. Our DevOps organization is
actually, it's in its infancy. So we are still considered to be
ops. So our DevOps came from ops rather than from dev. But we have started adding DevOps resources
to scrum teams on the development staff. So they will join the individual standups for those
particular products. So we look at the products that need the most help, and we inject a resource into those
products.
And then once we've managed to make improvements, then we move on to the next product.
That's pretty cool.
Hey, I know you kind of explained, obviously, what happened in the last year and a half,
and there's still ways to go. But coming back to some of the resistance, I think you mentioned the first big point
is to prove out that the monitoring that you actually have in place is valuable, that it
gives the right data.
What else is there?
I could imagine when you came in a year and a half ago, and they were telling you, well, we want to hire these people, and you said no, wasn't there even more fear of, are we going to lose, I mean, are some people going to lose their jobs?
What were the other kind of pushbacks that you had to overcome?
Sure.
Yeah.
I think one thing that I've been gifted with is the ability to kind of paint a picture of a happier place. I don't know
how I'm able to do that, but somehow I'm able to do that. So I think having that skill has been
extremely useful for me. And I'm a firm believer in, rather than saying, hey, this is what we're
going to do, you need to explain to people, hey, this is where we're trying to head.
And then let them be part of the process
to get us there, right? So I think of this analogy of, you know, you see those old movies where
there's this giant ship and there's a bunch of people in the belly of the ship that are
rowing all together, right? And there's no windows anywhere and they're just all,
you know, getting yelled at, hey, row, row, row. And I prefer a method of instead, you know, putting some
beautiful windows and pointing at a nice, beautiful island with some great surfing breaks out offshore
and saying, hey, guys, let's go over there. It looks pretty awesome. And I think that the boat
would get there a lot faster if you do it that way. So I believe in that approach of, you know, selling a vision rather than just dictating to people how things are going to work. And I think that's really, really important.
And then I think the other thing that's super important when you're making a transition
like this is that you need to make sure that you ease into the process. You know, I hate to use the way-too-overused terms of crawl, walk, run, but that is super, super important. You know, if you're going to do any tooling changes, you have to make sure that the new tool still gives the same data that people are used to seeing, right? So it still has to have dashboards. You still have to put something up on a TV. And then it's just a matter of saying, you know, once you get all of your data in there and people are using the new tool, then you can start to show them some of the additional data that they're getting.
And that lets people see where you're going and then figure out, how can I be part of that, right? So even when you just talk about, like, hey,
there's some great surf breaks, right. Someone down there rowing can be like, well, what am I
going to do when I get there? Right. Hey, I can learn surfing. Right. And there's that whole
leveling up instead of someone being driven out because I only do things this way
and this is all I know.
And now someone's trying to switch it on me
and I don't get it and I'm not being trained.
That approach opens it up more to saying,
I see where you want to go
and I can find these three interesting things
that I'd be very happy to learn and figure out
in order to help be part of the team that gets us there.
It's just a great way to do that, I think.
Yeah.
I agree.
I like the tune.
By the way, Eric, I think you should find a way to kind of draw that picture of
the boat and then like the old and the new way and put it in your presentation.
I think that's a great analogy and I'm sure we can visualize this in a nice way.
Yeah, sure.
That's great advice.
Now I have a couple of more questions.
So I know that you guys obviously have your quote-unquote traditional business, that Beachbody has been around for a while, and now you're moving into this, I think it's called Beachbody on Demand, where you're doing a lot of stuff, moving the architecture into the cloud. Is this correct?
Yes. Yeah, that's definitely one of the large offerings that
we're moving into the cloud as well as some other things that are in the works too.
Could you give us just a little overview of what that kind of environment looks like, that new stack, and also if that changes anything on the monitoring approach
or if you were actually able to say, well,
we have this new kind of platform that we're building out.
And instead of having a lot of legacy,
we can build something completely new in certain areas like monitoring
and dashboarding because we can kind of start from scratch.
Is there something like that that you can kind of fill us in a little bit?
Yeah, for sure. Definitely.
Yeah, so that product used to be in-house hosted on VMware virtual machines,
and it was all monitored via very traditional up-down CPU memory disk space type monitoring. So there was a project
that happened earlier this year to move it to the cloud. And instead of just moving it to the cloud,
it was actually built from the ground up and launched as a new offering that we then cut
over to from the on-premise version into the Amazon Web Services
hosted version of it. So what that system kind of looks like now is very different than what it did
before. It's very microservice oriented. So the front end is a single-page web application that then makes calls for authentication.
It actually makes calls for authentication to on-premise gear, and then it makes calls for content.
It makes calls for entitlements and things like this and builds the page based on data that's returned for microservices.
Databases sit in RDS, in AWS. So it is a very different
kind of architecture from what it was previously. So this does create some kind of interesting
monitoring challenges. We have unfortunately yet to get that solution into Dynatrace. It's in the
works and we're working on that. It does pose kind of
some interesting things for us. For one thing, it is a streaming service. It utilizes a lot of
third-party solutions. And then in addition to that, we also have clients that most people are
not used to seeing in their monitoring tools, such as Roku devices and Apple TVs and Amazon Firesticks and things like that.
So because of that, we definitely have some interesting challenges.
We're looking at building out Dynatrace so that we can get everything set up.
But that's a journey for us that we're working through.
In the meantime, our monitoring has kind of transitioned into more of an API endpoint type monitoring. So we're doing things that are
slightly more intelligent than traditional monitoring. And those things include,
you know, hitting endpoints that make up the back end of the system, analyzing the JSON payloads, and
responding based on the error codes that are returned, and things like that. So
that is definitely a step in the right direction. And for anyone who's looking at transitioning from
a more traditional up-down type monitoring and trying to get into a more modern type monitoring
that gives them more insights, certainly exposing health check data through some type of data
elements such as a JSON packet is a really great way to kind of
get into that so that you can start to monitor the data that's returned from those endpoints.
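A minimal sketch of that kind of health-check monitoring might look like the following. Note the endpoint URL and the JSON field names (`dependencies`, `status`, `error_code`) are hypothetical stand-ins, not Beachbody's actual API; adapt them to whatever health data your services expose.

```python
import json
import urllib.request


def parse_health(payload):
    """Return a list of problem descriptions from a JSON health payload.

    The "dependencies" / "status" / "error_code" field names are illustrative;
    substitute whatever your services actually return.
    """
    problems = []
    for dep, info in payload.get("dependencies", {}).items():
        if info.get("status") != "ok":
            problems.append("%s: %s (code %s)"
                            % (dep, info.get("status"), info.get("error_code")))
    return problems


def check_endpoint(url, timeout=5):
    """Hit a health endpoint and analyze the JSON it returns,
    rather than just recording HTTP up/down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except Exception as exc:
        return ["endpoint unreachable: %s" % exc]
    return parse_health(payload)
```

The point of the split is that `parse_health` inspects per-dependency status in the returned data, which is exactly the step that distinguishes this from traditional up-down monitoring.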
That's pretty cool. Are you going to maybe include that in your presentation at Perform as well?
I think that will be very interesting, some lessons learned, some technical tips and tricks, actually.
Sure. Yeah, absolutely. That's definitely a step on the journey of the crawl, walk, run.
That's certainly a nice walk step for people to transition into.
Cool. All right. I know there's obviously still a ways to go until you're fully where you want to be with your vision.
Is there anything else you want to tell the audience before we kind of wrap up this episode?
No, I think just some parting words of wisdom: try to avoid the pitfall of doing what you're doing just because it's the way you've always done it.
Trust me, there is a better way to monitor your infrastructure.
I like that.
Brian.
That's time.
Avoid the temptation to bring in the machine that goes ping.
And get the machine that goes ping!
And Andy,
would you like to summon
the Summarator?
Let's do it.
I'll try to summarize all this.
This is an excellent story.
I believe what I learned is,
you know,
don't fear that change
may disrupt the way
you used to work.
It is obviously disrupting
the way we used to look at dashboards,
the way we used to create reports, to bring back my wife's story again. And I'm sure that happened
in your organization as well. I think we all need to embrace the change. I think it's great to think
about that the stupid, annoying manual tasks that we as humans can do but shouldn't do, like analyzing a lot of data
and figuring out if something is wrong. This is something that machines are really good at.
So let the machine do what they are really good at. And then let's change the way we help our
bottom line, which is making sure our business is up and running, our users are satisfied.
Use the data to your advantage and don't get caught up with something that somebody else, like machines, can do. I also like what you are doing with the SRE team, making sure to automate as much as possible when it comes
to dealing with faulty situations, whether it's cleaning disks, whether it's restarting
processes, whether it's reconfiguring routes, whatever it is.
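That alert-to-action pattern can be sketched as a simple dispatch table. To be clear, the alert types, field names, and the `systemctl` call below are illustrative assumptions, not the actual runbooks described in this episode.

```python
import shutil
import subprocess


def clean_disk(path="/var/log"):
    """Remediation stub: report free space; a real version would rotate or purge logs."""
    usage = shutil.disk_usage(path)
    return "free bytes on %s: %d" % (path, usage.free)


def restart_process(service):
    """Remediation stub: restart a service (assumes a systemd host)."""
    return subprocess.run(["systemctl", "restart", service], check=False).returncode


# Map alert types to automated remediations; anything unknown escalates to a human.
REMEDIATIONS = {
    "disk_full": lambda alert: clean_disk(alert.get("path", "/var/log")),
    "process_down": lambda alert: restart_process(alert["service"]),
}


def handle_alert(alert):
    """Dispatch an incoming alert to its automated remediation, if one exists."""
    action = REMEDIATIONS.get(alert.get("type"))
    if action is None:
        return "escalate to human"
    return action(alert)
```

The dispatch-table shape matters more than the stubs: every alert type you add to the table is one less page for a human, and everything else still falls through to a person.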
And I believe what we all need to understand is that, I mean, our systems are not getting
less complex.
They're getting more complex.
You know, you guys are moving to the cloud,
and that means we have more moving pieces,
third-party cloud services that we don't control ourselves.
That means even more so,
we need to rely on automation and good monitoring
to focus on the stuff that is really important.
Again, which is, you know,
making sure that our systems are up and running all the time
without having to be up all the time 24-7 as humans
to look at nice but meaningless dashboards.
I think that's what I want to say.
Awesome, Andy.
And you know, your bit about letting machines do what they're good at
reminded me of a slide I once had that I think is very appropriate.
There was a story, I think it was up in Michigan,
where there's a lot of artisanal things going on these days.
And there were these artisanal grounds-clearing services
where people would hire goats to come in
and let them roam their fields for days and days and days
and clear the field.
And there was a kind of an uproar
from the landscaping community saying,
hey, these goats are taking our jobs.
Never mind the fact that it might take the goats 30 days to do what the landscaper can do in five hours.
There was just this fear of our jobs are going to be replaced by goats.
And my reaction to that is if a goat can do your job, let it do it and go do something better.
So same thing with the machines as we're talking about.
If the machines can make sense of the data, if the machines can take the actions, automate it, let the machines do it so that you can spend your time doing something more productive that the machines can't do.
Such as creating, you know, most of the features and everything else going into that, which still, at least at this point, has to be created by humans.
So let's concentrate on that part of it and let the machines take over what they can.
Completely agree.
Cool.
Eric,
any final words from your side?
Yeah, just one last thing I think to add to your summary.
One thing that's super important as you make this journey is to explain to people where you're trying to get to, rather than just saying, hey, we're changing this.
That's really important to get buy-in so that you don't create unnecessary fear and unrest in your organization.
Great point.
Yeah, I really appreciate the time, guys.
It was a pleasure chatting with you guys.
All right, excellent.
If anybody listening would like to see and meet Eric, you'll be at Perform.
Perform, we call it Perform 2018, right?
That's going to be my birthday weekend, Andy.
So make sure you bring me a present.
You too as well, Eric.
I'll be there.
It's going to be in Las Vegas at which hotel?
Is that going to be at the one with the fountains, right?
I think it's the Bellagio.
The Bellagio, yes.
And in January.
So hopefully we'll see a bunch of you there.
Definitely come and meet Eric and check out his talk about all this.
I'm sure he'll do much more in-depth on a lot of things and with awesome slides, especially now that we know the pressure is on, Eric.
You're going to hand-draw a galley ship with people chained up
and getting whipped while they row.
I'm thinking of surfing.
Right, exactly.
I can't wait to see it.
All right. Thank you.
Thank you.
Thanks so much.
Bye.