PurePerformance - 051 Building a Zero-Dashboard Monitoring Culture with Erik Landsness
Episode Date: December 18, 2017
Erik Landsness, Director of the Network Operations Center & SRE at Beachbody, talks us through the last 1.5 years in his role, where he has been transforming the role and culture of the traditional NOC team from human-based dashboard analytics to an automated, self-healing, zero-dashboard culture. While they haven't yet reached that end state, they have made big strides. Erik shares with us how to gradually transform into a modern operations team that automates things that humans shouldn't do, such as staring at dashboards on walls 😊 Erik is also presenting at Dynatrace PERFORM 2018. Make sure to check out his session to learn firsthand! https://www.dynatrace.com/perform/speakers/
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello, everybody. My name is Brian Wilson. Welcome to Pure Performance.
I'm going to be serious today because our co-host Andreas Grabner, and yes I'm calling him Andreas,
said we usually start the show with bad jokes. And I was offended. So hi Andy.
Hi Brian, well, I don't feel offended. Actually, it is.
I think sometimes we start the show with bad jokes that shouldn't even be considered jokes.
That's why it's so bad.
I don't know.
We'll see.
Fozzie Bear from The Muppet Show is one of my favorite comedians.
And if you're familiar with him, which you probably aren't, but he's just notoriously horrible.
Anyhow.
So there's our bad joke introduction.
All right, awesome.
We got that out of the way.
What do we have?
What are we talking about today, Andy?
And who's our guest?
Yeah, so today I'm very intrigued with this. And the reason why we have this person on board is because his name is Eric, and he's going to present at our upcoming Perform Conference.
And the title is Building a Zero Dashboard NOC and SRE Team.
And I thought this was pretty cool because I believe that's kind of where the industry
is heading.
But I believe a lot of people in our industry are still focused so much on building great
big dashboards that they can be proud of by putting them on a big wall monitor. And I think it's just interesting to hear from Eric kind of his journey at his current employer
and kind of how that transition went.
And without further ado, I want to see, first of all, Eric, are you still there with us?
Yes, I'm here.
Perfect.
So, Eric, would you first of all maybe introduce yourself, who you are, a little background, and then let's jump into the story that I think you're going to tell as well at the end of January.
But let's start with the introduction.
This sounds kind of like a unicorn story, though, but I'm fascinated to hear it, right?
Because the zero – yes, Eric, please. I definitely want to hear this one. Sure, yeah. So my name is Eric. I'm the director of the Network Operations Center at a company called Beachbody, who you may know for all of our health and fitness products, such as the very famous P90X workout routines that everyone loves to hate.
So, yeah, so I've been at Beachbody for about a year and a half now. Previous to that, I've spent countless years working in support organizations for web hosting companies and CDN providers and things like that.
So this is actually my first corporate gig.
So it's been an interesting journey for me. So yeah, so pretty excited to talk to you guys and looking forward to talking a little bit about what we're doing over at Beachbody.
Yeah, and the last one, I have yours.
So since you've been with the company, have you transformed your body as well?
I used to work at Weight Watchers and everyone in the IT department was not participating in Weight Watchers.
Yeah, that's a pretty common thing.
The technology department at Beachbody is probably the least fit department in the company.
So, but Eric, I remember we had some calls to figure out what the story is going to be that you're telling at Perform.
And we went back to the point when you got hired and kind of when you walked into the existing status quo situation.
And maybe you want to bring us back to that point and what your vision was.
And then we just take it from there, because I believe it's a fantastic idea that you have.
And let us, you know, kind of tell us where you are in the transformation story and what
your end goal is.
Yeah, sure. Definitely. So, yeah, I was brought on board at Beachbody.
It's actually kind of interesting. I was hired because my current title at my previous employer had the word NOC in it.
But I've honestly never worked in a NOC before in my life, which is kind of interesting.
So, like I mentioned, my past experience is mostly in support
organizations. So in call centers, for the most part, delivering end user support to customers.
So basically, I'm used to problems coming to me and solving them. So for me, the concept of
putting dashboards on the screen is pretty foreign to me, but it sounded like a really cool opportunity. Beachbody is a great company, so I was looking forward to the opportunity. So on my first day of work,
I came in and checked out the NOC. We had a little NOC there in the office. And there were TVs all
over the wall. This is an office the size of a single desk type office, but there's two guys sitting in there.
There are six massive TVs hanging on the wall, and there's all kinds of stuff on the walls.
And I look at our main monitoring tool, and it's completely red.
And I instantly panicked, like what's going on?
Oh, my God.
And I started like, whoa, did I walk into a mess here this morning?
And the two guys that are working in the NOC were very calm
and collected and relaxed. And they said, oh, no, no, no, no, it's no big deal. We don't really
look at those. And I think there's a lot of that that kind of goes on in the NOC world where it's,
you know, we get ourselves in the habit of building dashboards and putting things on screens just to fill up the real estate.
We don't really understand what it is we're trying to accomplish.
So what we've kind of tried to do is take a step back from that and think about, well, what is it that we really need to know?
And it boils down to one thing, and that one thing is, is there anything broken?
And that's really all we need to know.
We don't really need to see green things on the wall to make us feel better, right? So we don't
really have anything to gain from that. And all we're doing is wasting electricity and hurting
our eyes with all the light. So what we've tried to do is we've tried to think of ways that we can tool our monitoring so that it bubbles up issues to us rather than having to go look and hunt for issues.
There's one thing that computers are really, really good at, and that's taking lots of data and telling you when something in that data is not normal.
Humans are really bad at that.
We're really, really bad at making comparisons across lots of data. So getting away from that mindset of looking at things on a dashboard and hunting for something that's not quite right really makes a lot of sense, but it's a difficult cultural shift. And that's something that we're kind of working on at Beachbody. When I first came on board, the plan for the NOC was to build it out: hire more people, bring some more people in-house, and create 24x7 staff. And that's something that I've pushed back against, and it took a lot of convincing to get people to the point where they thought the same thing.
I remember a question was asked to me, which was, you know, if you don't have people watching screens, how do you know if something is broken?
And I thought about that question for a little bit, and I responded with another question, and that was, if you have people watching screens, but your monitoring is not good, how do you know if something is broken?
And that's really the shift in mindset that I think we're trying to do.
But we've managed to overcome that kind of hurdle and sell it internally. And it's just a matter of doing the implementation.
So you asked kind of where we are in the process.
I'd say we're about halfway through our journey.
We've done a lot of tooling.
We've built the first kind of steps of automation.
Once your monitoring is reliable, you can start to automatically fix things,
which is the super cool thing about having great monitoring.
Instead of having people watching and reacting to things, you're reacting to the data you have in code as well.
So we're kind of making a transition from hiring NOC technicians to hiring site reliability engineers. So instead of having people that look for things
that are broken and call other people to fix them, we're having individuals look for things that
break and write software to automatically react to those problems. Wow. Well, that's pretty cool.
I mean, I think I want to just digest what you just all said because I think there's a lot of cool stuff in here and a lot of things that are actually we need to bring to the attention of people.
So, first of all, what I like a lot and what I see a lot as well, right, we are used to do things in a certain way.
And I think some people actually build their careers around producing something,
whether it's nice dashboards, whether it's nice reports. I remember my wife telling me in her previous role that she always had to send that report on a Friday afternoon. And if the report wasn't sent, people got upset; it was like a 50-page document. And then I said,
what is actually happening with this report? And she says, she doesn't know. She doesn't care.
All she knows she needs to produce that report.
And it's like the same what you said, right?
It's great if you have these dashboards and somebody can be very proud of it.
But if it's not impacting the bottom line, which is actually knowing when something is wrong, then the nicest dashboard, the nicest report doesn't help anybody.
And I think that's a great point that you made.
And then I also liked some of the quotes that you just said.
I mean what do you do without the people?
Because then nobody can tell you if something is wrong.
But yeah, then the side assumption is if you don't have great monitoring,
then you just have people that look at bad data and make bad decisions, basically bad human alerting.
There's one thing I want to dig into a little deeper before you explain how your SRE teams and all of that work.
So you said the only thing that you really care about is if something is broken, right?
And do you have like a magic metric that you're looking at?
I know, for instance, in e-commerce, it is often, I think Amazon is famous for this quote from Werner Vogels.
He said the only metric he really cares about is order rate.
So do you have like a magic metric, the ultimate thing that you try to feed monitoring data into, so that by looking at that metric you know if something is broken? Or do you have a set of metrics that you came up with?
Sure, yeah. So the magic metric for our business is our uptime percentage, which is a pretty common, old-school type of metric. But for me, from a NOC standpoint, the only metric that I truly care about is my number of problems. And we use
Dynatrace. We've started using that product a few months back. And one of the things that I
that I really enjoy about it, I kind of mentioned that computers are really good at looking through data and telling us when something's not right.
You know, when we made the transition to that offering, we were able to start putting more data into it, right?
So rather than traditional monitoring where you're saying, okay, let me look at CPU, let me look at memory, let me look at
disk space utilization, let me look at IO, et cetera, et cetera, et cetera. You put those
pieces of information in and you have to specifically call out, I want to view this,
I want to view this, et cetera. And then not only that, generally in a traditional monitoring tool, you have to tell the system what your thresholds are. So, you know, if it reaches this percentage of memory utilization, then alert me, et cetera.
And the problem with that is that you have to really understand your environment inside and out.
And, you know, that's something that we don't have the luxury of at Beachbody.
So we have some growing pains.
We're a company that grew extremely rapidly. And with it, the technology group grew and the product offerings grew. And when that happens, you don't have nice, uniform, well-designed systems. You end up with kind of unique solutions, and that's what we deal with, right?
So people that designed things are long gone. People that built them might bolt things onto them. And
it's not a well-thought-out, mature product. And the advantage to using something like a
Dynatrace offering is that I can just dump whatever data this application is giving me, whatever data these nodes are giving me, I can just throw it all in there and ask the system to tell me when something is interesting.
So I no longer have to figure out what's important to me and look at specific metrics.
All that I'm looking for is tell me when something is not right. So the more data that pumps through it,
the more it understands, okay, these are normal operating levels. And when something's not normal, then we get a problem that bubbles up. We get an alert. It goes into our chat ops tool. The NOC reacts to it, or we have automatic scripts that trigger based on the problems.
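The automatic baselining Erik describes, learning normal operating levels from a stream of metric values and flagging deviations, can be sketched in a few lines. This is an illustrative toy, not how Dynatrace works internally; the window size, warm-up length, and three-sigma threshold are invented for the example:

```python
from collections import deque
from statistics import mean, stdev

class Baseline:
    """Rolling baseline over the last `window` samples; flags values
    more than `k` standard deviations away from the learned mean."""
    def __init__(self, window=60, k=3.0):
        self.samples = deque(maxlen=window)
        self.k = k

    def observe(self, value):
        """Record `value`; return True if it is anomalous against the baseline."""
        anomalous = False
        if len(self.samples) >= 10:  # need some history before judging
            m, s = mean(self.samples), stdev(self.samples)
            if s > 0 and abs(value - m) > self.k * s:
                anomalous = True
        self.samples.append(value)
        return anomalous

b = Baseline()
for v in [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99]:
    assert not b.observe(v)   # normal traffic establishes the baseline
print(b.observe(500))         # → True: a large spike bubbles up as a problem
```

A real pipeline would feed every metric through something like this and route only the flagged values into the chat ops tool, rather than charting everything for humans to scan.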
You know, Andy, a lot of that sounds very similar to some of the concepts you and I
were discussing and we were chatting with some of our other guests on monitoring and
observability based on that article.
I'm probably going to keep referencing this article for a long time to come now. It's by Cindy Sridharan. Eric, I'm not sure if you've read it; I'm going to send you the link to it, though.
Sure.
Yeah, it's this great article, but it talks very specifically about what you're mentioning: when you're monitoring in that way, you're only monitoring for what you know to monitor, right? As opposed to just capturing everything and having something else pull it in for you.
And one reference I have to just get in there, Andy, and I'll toss it back to you in a second. When you're talking about having the different dashboards and the usefulness of the dashboards: for anybody listening, go check out the Monty Python sketch called the hospital sketch. It's one of the famous ones because that's where they say, bring me the machine that goes ping, you know, in the hospital room. And that is so much like what this is: you have all these different things, and they're useless in a lot of ways, right? So anyhow, I wanted to get those two things in there before the conversation keeps going. Andy, you were going to say something there too.
Yeah, no, I think I wanted to pick up something that you mentioned: that you don't have the luxury of knowing your system well enough to be able to actually define thresholds.
I think that's obviously a true statement.
But I think even more so with the new application architectures we are dealing with, I think it doesn't make sense anymore to define alerts based on resource thresholds.
I had a session, I was at reInvent two weeks ago, and we were there with our booth, and there's a lot of monitoring companies around, and some of them, you know, have very pretty dashboards.
And so what I thought, what I did is I observed people when they came over from some of the other booths and came to our booth,
and then I showed them a Dynatrace dashboard with a lot of charts on it. And then I asked them,
so what does this dashboard tell you? Is it good or bad? And then they said, well, it looks strange.
I don't know. I think it's bad. And I said, honestly, you have no clue because you didn't
build that dashboard. And even if somebody would have built the dashboard and set an alert on something and it goes red, it doesn't mean that it's a bad thing.
Because just because a CPU goes hot doesn't mean it's a bad thing if it doesn't impact the bottom line.
Coming back to what you said is are we up?
Can people view our content?
Can people sign up for our service? That's in the end what really matters, and especially if we're now building very dynamic environments that scale up and down depending on demand,
looking at something like CPU utilization, memory utilization, or the existence or, let's say, the number of instances of a certain component becomes not completely irrelevant, but not as important as it used to be.
Yeah, I was going to challenge you there, Andy, as well. Because if you're thinking about something like caches, right, different kinds of cache usage: while you can scale out, if you're not properly leveraging your cache, then you're going to end up costing yourself more money. So it might not necessarily impact, let's say, that Amazon sales rate.
But you run the risk of introducing scaling risk to your system if you're not looking at certain things that might not trigger some sort of automated alert.
Your lack of use of cache might not trigger an issue until it's way out of control, right? So again, it's not black and white, I think, but there are probably use cases where some of those metrics are important to have. But again, if you're just monitoring that cache and you have that number there, it's meaningless if it's not tied to or analyzed in conjunction with some of these other components, right? Just having that number is what's kind of useless. It has to be actionable.
Exactly. And I think, Eric, to your point, you said, you know, machines are pretty good
in ingesting a lot of data and finding anomalies. And I believe that's why anomaly detection and
automatic baselining is the key point here that, you know, we all try to do. And I mean,
that's not only true for Dynatrace, but I think other monitoring vendors as well.
But I think that's a key thing. For sure.
Cool. So thanks for that
little excursion. So that means dashboards alone
obviously don't cut it anymore, and we want to
go to kind of this zero dashboard
knock. So you
explained where you came from, how
the situation was, and kind of your
path. What
else did you do? Some lessons learned, or
like you mentioned SRE team. What does that
mean exactly? Sure. Yeah. So the SRE team, we currently have a single person doing SRE work.
In addition to myself, I'm actually writing software as well. But compared to the size of
our NOC, I have a NOC of two full-time employees, which probably sounds amazing to people listening that we only have two full-time employees for such a large company monitoring all of our applications.
But we've managed to do a lot of really hard work in the beginning improving our traditional monitoring.
And we still have traditional-type monitoring. We still have up-down monitoring. We still have ping monitoring and
memory monitoring and CPU monitoring, et cetera. But we just don't look at them, right? We use
them as another data point that we feed into our system to understand, okay, I'm seeing this weird
thing happening, but I'm also seeing that CPU started
spiking. We can make some kind of correlation and identify issues that way. Again, like I said, it's more data, more and more data. So you don't abandon your traditional type monitoring.
You just use it as an additional data point on top of seeing what comes out of more automated type monitoring, right? So for us, what the SRE
team's goal is to dig through the findings. They do actually create dashboards. They create dashboards not for staring at on a TV, but to get a holistic picture of a particular application so that they can see how things are trending.
They can dig down into things and they can understand maybe some tuning that can be done.
They make recommendations to the delivery team saying, you know, hey, I think that we have this bottleneck in our code, right? So the machine identifies this particular query has been running
increasingly longer amounts of time. So we need to start looking at ways that we can optimize
our database, et cetera. So they do some proactive work that way. And then in addition to that,
they're building our self-healing tools so that we have things that, for example, if we start to run out of disk space,
that there's something that will automatically go through and tar up files or something like that.
They build that type of automation.
Or if a process fails in a certain way, they build something that, you know,
will kick off a job to restart, do a rolling restart in the service or something like that.
That's their focus.
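The disk-space example Erik gives can be sketched as a small self-healing job: watch utilization and archive stale logs when a threshold is crossed. The paths, threshold, and age cutoff below are invented for illustration:

```python
import os
import shutil
import tarfile
import time

# Hypothetical values; a real deployment would take these from configuration
# or from the monitoring tool's problem payload.
LOG_DIR = "/var/log/myapp"
USAGE_THRESHOLD = 0.90           # act when the disk is 90% full
MAX_AGE_SECONDS = 7 * 86400      # archive .log files older than a week

def disk_usage_fraction(path):
    """Fraction of the filesystem holding `path` that is in use."""
    total, used, _free = shutil.disk_usage(path)
    return used / total

def archive_old_logs(log_dir, max_age):
    """Tar up and remove .log files older than `max_age` seconds;
    return the path of the archive that was written."""
    cutoff = time.time() - max_age
    archive = os.path.join(log_dir, "archived-%d.tar.gz" % int(time.time()))
    with tarfile.open(archive, "w:gz") as tar:
        for name in sorted(os.listdir(log_dir)):
            path = os.path.join(log_dir, name)
            if name.endswith(".log") and os.path.getmtime(path) < cutoff:
                tar.add(path, arcname=name)
                os.remove(path)
    return archive

# The trigger: only act when the monitored condition is actually met.
if os.path.isdir(LOG_DIR) and disk_usage_fraction(LOG_DIR) > USAGE_THRESHOLD:
    print("disk filling up, archived:", archive_old_logs(LOG_DIR, MAX_AGE_SECONDS))
```

In practice the trigger would come from the monitoring tool's problem notification rather than from polling, but the remediation step looks much the same.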
That's pretty cool.
The whole concept of self-healing: I actually have to kind of admit something here. I had a hard time using the term self-healing for a while because I thought it's something that's futuristic.
I always talked about mitigation, but the more I hear people like you also talking about self-healing, the more comfortable I am using that term. And it's also something that we've been promoting; you know, we see some of our users, exactly as you described, using monitoring data to trigger better, more focused self-healing actions.
You know, it's basically, I mean, just some of the examples that you mentioned, full disk,
you know, process that stalls, some other things like, you know, scaling up, scaling
down, turning on an additional cache layer or turning something off.
These are all things, thanks to, I believe, better monitoring, it allows us to actually do much smarter
auto-remediation, which then kind of looks like the system is actually healing itself.
So I believe that's actually a cool thing.
Sure.
And Andy, to bring up the topic you've been bringing up a lot, you know, talking about our own transformation in these realms. And Eric, you were talking about how you're all still in the process of making these changes. You know the story much better than I do, Andy, but I believe one of our initial forays into this realm was to take all of our NOC runbooks and just automate them, because each was basically a script.
So we know, again, based on the traditional monitoring, when certain conditions are met, there's an action that gets taken manually. And part of that initial process could very easily be just taking those and automating them.
Did you have anything like that, where, you know, with the existing NOC, you went through that process of just automating all those?
Yeah, for sure.
I would like to say that we're finished with that process, but it's still
a work in progress. But there is some automation that's been put into play. Some of it still has
human checks in place. So there's still some confidence building that needs to exist that
our data that comes out of our monitoring tools is accurate and is actually identifying a real
problem. And once that confidence is built with the teams,
then we can build that glue that says, when this happens, automatically run this automated script.
But the first step is to automate it
so that you know that you can just hit a button
rather than having to manually log into things and restart them while you're playing, as I like to call it, whack-a-mole with things that are broken.
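The pattern Erik describes, automated runbook steps gated behind a human confirmation until confidence is built, might look something like this sketch (the problem names and actions are invented for illustration):

```python
# Hypothetical runbook registry mapping a monitoring problem type to a
# remediation action, plus a flag for whether it is trusted to run
# without a human confirming first.
RUNBOOK = {
    "disk_full":    {"action": lambda: "archived old logs", "auto": True},
    "service_hung": {"action": lambda: "rolling restart",   "auto": False},
}

def handle_problem(problem_type, confirm=lambda prompt: False):
    """Dispatch a problem to its runbook entry. Trusted steps run
    automatically; the rest run only if the `confirm` callback (say, a
    human pressing a button in chat ops) approves."""
    entry = RUNBOOK.get(problem_type)
    if entry is None:
        return "escalate to on-call"      # no runbook yet: page a human
    if entry["auto"] or confirm("run remediation for %s?" % problem_type):
        return entry["action"]()
    return "awaiting operator approval"

print(handle_problem("disk_full"))       # → archived old logs
print(handle_problem("service_hung"))    # → awaiting operator approval
```

Flipping a step from gated to automatic is then just a matter of setting its `auto` flag once the team trusts the underlying monitoring signal.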
So you said the SRE team.
Is that part of your operations team?
Is this a separate team that you manage?
How is that organized?
That's interesting for me to know.
Sure.
Yeah.
So there's not a whole lot of crazy formal structure when you only have three employees.
But technically, I do run two individual teams.
So there's the NOC and then there's the SRE team.
But they all, I mean, they're all reporting into me.
We all go to the same daily stand-up.
We all have a weekly all-call with all the NOC technicians and the SREs.
And then we actually roll up – our NOC rolls up into the DevOps organization.
It does not roll up into a traditional infrastructure ops organization.
So when you talk about the DevOps organization, that means you are part of, I mean, the whole engineering organization. That means you have your daily standups, I assume, within your team,
but how often do you meet with the actual, the application teams and development teams?
Sure. So that's kind of on a, you know, it's on an as needed basis. Our DevOps organization is
actually, it's in its infancy. So we are still considered to be
ops. So our DevOps came from ops rather than from dev. But we have started adding DevOps resources
to scrum teams on the development staff. So they will join the individual standups for those
particular products. So we look at the products that need the most help, and we inject a resource into those
products.
And then once we've managed to make improvements, then we move on to the next product.
That's pretty cool.
Hey, I know you kind of explained, obviously, what happened in the last year and a half,
and there's still ways to go. But coming back to some of the resistance, I think you mentioned the first big point
is to prove out that the monitoring that you actually have in place is valuable, that it
gives the right data.
What else is there?
I could imagine when you came in a year and a half ago, and they were telling you, well, we want to hire these people, and you said no, wasn't there even more fear of, are we going to lose, I mean, are some people going to lose their jobs?
What were the other kind of pushbacks that you had to overcome?
Sure.
Yeah.
I think one thing that I've been gifted with is the ability to kind of paint a picture of a happier place. I don't know
how I'm able to do that, but somehow I'm able to do that. So I think having that skill has been
extremely useful for me. And I'm a firm believer in, rather than saying, hey, this is what we're
going to do, you need to explain to people, hey, this is where we're trying to head.
And then let them be part of the process
to get us there, right? So I think of this analogy of, you know, you see those old movies where
there's this giant ship and there's a bunch of people in the belly of the ship that are
rowing all together, right? And there's no windows anywhere and they're just all,
you know, getting yelled at, hey, row, row, row. And I prefer a method of instead, you know, putting some
beautiful windows and pointing at a nice, beautiful island with some great surfing breaks out offshore
and saying, hey, guys, let's go over there. It looks pretty awesome. And I think that the boat
would get there a lot faster if you do it that way. So I believe in that approach of, you know, selling a vision rather than just dictating to people how things are going to work. And I think that's really, really important.
And then I think the other thing that's super important when you're making a transition
like this is that you need to make sure that you ease into the process. You know, I hate to use the way-too-overused terms of crawl, walk, run, but that is super, super important. You know, if you're going to do any tooling changes, you have to make sure that the new tool still gives the same data that people are used to seeing, right? So it still has to have dashboards. You still have to put something up on a TV. And then it's just a matter of saying, you know, once you get all of your data in there and people are using the new tool, then you can start to show them some of the additional data that they're getting.
And that lets people see where you're going and then figure out, how can I be part of that, right? So even when you just talk about, like, hey,
there's some great surf breaks, right. Someone down there rowing can be like, well, what am I
going to do when I get there? Right. Hey, I can learn surfing. Right. And there's that whole
leveling up instead of someone being driven out because I only do things this way
and this is all I know.
And now someone's trying to switch it on me
and I don't get it and I'm not being trained.
That approach opens it up more to saying,
I see where you want to go
and I can find these three interesting things
that I'd be very happy to learn and figure out
in order to help be part of the team that gets us there.
It's just a great way to do that, I think.
Yeah.
I agree.
I like the tune.
By the way, Eric, I think you should find a way to kind of draw that picture of
the boat and then like the old and the new way and put it in your presentation.
I think that's a great analogy and I'm sure we can visualize this in a nice way.
Yeah, sure.
That's great advice.
Now I have a couple of more questions.
So I know that you guys obviously have your quote-unquote traditional business, that Beachbody has been around for a while, and now you're moving into this, I think it's called Beachbody on Demand, where you're doing a lot of stuff, moving the architecture into the cloud. Is this correct?
Yes. Yeah, that's definitely one of the large offerings that
we're moving into the cloud as well as some other things that are in the works too.
Could you give us just a little overview of what that kind of environment looks like, that new stack, and also if that changes anything on the monitoring approach
or if you were actually able to say, well,
we have this new kind of platform that we're building out.
And instead of having a lot of legacy,
we can build something completely new in certain areas like monitoring
and dashboarding because we can kind of start from scratch.
Is there something like that that you can kind of fill us in a little bit?
Yeah, for sure. Definitely.
Yeah, so that product used to be in-house hosted on VMware virtual machines,
and it was all monitored via very traditional up-down CPU memory disk space type monitoring. So there was a project
that happened earlier this year to move it to the cloud. And instead of just moving it to the cloud,
it was actually built from the ground up and launched as a new offering that we then cut
over to from the on-premise version into the Amazon Web Services
hosted version of it. So what that system kind of looks like now is very different than what it did
before. It's very microservice oriented. So the front end is a single-page web application that then makes calls for authentication.
It actually makes calls for authentication to on-premise gear, and then it makes calls for content.
It makes calls for entitlements and things like this and builds the page based on data that's returned for microservices.
Databases sit in RDS, in AWS. So it is a very different
kind of architecture from what it was previously. So this does create some kind of interesting
monitoring challenges. We have unfortunately yet to get that solution into Dynatrace. It's in the
works and we're working on that. It does pose kind of
some interesting things for us. For one thing, it is a streaming service. It utilizes a lot of
third-party solutions. And then in addition to that, we also have clients that most people are
not used to seeing in their monitoring tools, such as Roku devices and Apple TVs and Amazon Firesticks and things like that.
So because of that, we definitely have some interesting challenges.
We're looking at building out Dynatrace so that we can get everything set up.
But that's a journey for us that we're working through.
In the meantime, our monitoring has kind of transitioned into more of an API endpoint type monitoring. So we're doing things that are
slightly more intelligent than traditional monitoring. And those things include,
you know, hitting endpoints that make up the back end of the system, analyzing the JSON payloads, and
responding based on the error codes that are returned, and things like that. So
that is definitely a step in the right direction. And for anyone who's looking at transitioning from
a more traditional up-down type monitoring and trying to get into a more modern type monitoring
that gives them more insights, certainly exposing health check data through some type of data
elements such as a JSON packet is a really great way to kind of
get into that so that you can start to monitor the data that's returned from those endpoints.
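A minimal sketch of that kind of health-check monitoring might look like the following. Note the endpoint URL and the JSON field names (`dependencies`, `status`, `error_code`) are hypothetical stand-ins, not Beachbody's actual API; adapt them to whatever health data your services expose.

```python
import json
import urllib.request


def parse_health(payload):
    """Return a list of problem descriptions from a JSON health payload.

    The "dependencies" / "status" / "error_code" field names are illustrative;
    substitute whatever your services actually return.
    """
    problems = []
    for dep, info in payload.get("dependencies", {}).items():
        if info.get("status") != "ok":
            problems.append("%s: %s (code %s)"
                            % (dep, info.get("status"), info.get("error_code")))
    return problems


def check_endpoint(url, timeout=5):
    """Hit a health endpoint and analyze the JSON it returns,
    rather than just recording HTTP up/down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except Exception as exc:
        return ["endpoint unreachable: %s" % exc]
    return parse_health(payload)
```

The point of the split is that `parse_health` inspects per-dependency status in the returned data, which is exactly the step that distinguishes this from traditional up-down monitoring.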
That's pretty cool. Are you going to maybe include that in your presentation at Perform as well?
I think that will be very interesting, some lessons learned, some technical tips and tricks, actually.
Sure. Yeah, absolutely. That's definitely a step on the journey of the crawl, walk, run.
That's certainly a nice walk step for people to transition into.
Cool. All right. I know there's obviously still a ways to go until you're fully where you want to be with your vision.
Is there anything else you want to tell the audience before we kind of wrap up this episode?
No, I think just some parting words of wisdom: try to avoid the pitfall of doing what you're doing just because it's the way you've always done it.
Trust me, there is a better way to monitor your infrastructure.
I like that.
Brian.
That's time.
Avoid the temptation to bring in the machine that goes ping.
And get the machine that goes ping!
And Andy,
would you like to summon
the Summarator?
Let's do it.
I'll try to summarize all this.
This is an excellent story.
I believe what I learned is,
you know,
don't fear that change
may disrupt the way
you used to work.
It is obviously disrupting
the way we used to look at dashboards,
the way we used to create reports, to bring back my wife's story again. And I'm sure that happened
in your organization as well. I think we all need to embrace the change. I think it's great to think
about that the stupid, annoying manual tasks that we as humans can do but shouldn't do, like analyzing a lot of data
and figuring out if something is wrong. This is something that machines are really good at.
So let the machine do what they are really good at. And then let's change the way we help our
bottom line, which is making sure our business is up and running, our users are satisfied.
Use the data to your advantage and don't get caught up with something that somebody else, like machines, can do. I also like what you are doing with the SRE team, making sure to automate as much as possible when it comes
to dealing with faulty situations, whether it's cleaning disks, whether it's restarting
processes, whether it's reconfiguring routes, whatever it is.
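That alert-to-action pattern can be sketched as a simple dispatch table. To be clear, the alert types, field names, and the `systemctl` call below are illustrative assumptions, not the actual runbooks described in this episode.

```python
import shutil
import subprocess


def clean_disk(path="/var/log"):
    """Remediation stub: report free space; a real version would rotate or purge logs."""
    usage = shutil.disk_usage(path)
    return "free bytes on %s: %d" % (path, usage.free)


def restart_process(service):
    """Remediation stub: restart a service (assumes a systemd host)."""
    return subprocess.run(["systemctl", "restart", service], check=False).returncode


# Map alert types to automated remediations; anything unknown escalates to a human.
REMEDIATIONS = {
    "disk_full": lambda alert: clean_disk(alert.get("path", "/var/log")),
    "process_down": lambda alert: restart_process(alert["service"]),
}


def handle_alert(alert):
    """Dispatch an incoming alert to its automated remediation, if one exists."""
    action = REMEDIATIONS.get(alert.get("type"))
    if action is None:
        return "escalate to human"
    return action(alert)
```

The dispatch-table shape matters more than the stubs: every alert type you add to the table is one less page for a human, and everything else still falls through to a person.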
And I believe what we all need to understand is that, I mean, our systems are not getting
less complex.
They're getting more complex.
You know, you guys are moving to the cloud,
and that means we have more moving pieces,
third-party cloud services that we don't control ourselves.
That means even more so,
we need to rely on automation and good monitoring
to focus on the stuff that is really important.
Again, which is, you know,
making sure that our systems are up and running all the time
without having to be up all the time 24-7 as humans
to look at nice but meaningless dashboards.
I think that's what I want to say.
Awesome, Andy.
And you know, your bit about letting machines do what they're good at
reminded me of a slide I once had that I think is very appropriate.
There was a story, I think it was up in Michigan,
where there's a lot of artisanal things going on these days.
And there were these artisanal grounds-clearing services
where people would hire goats to come in
and let them roam their fields for days and days and days
and clear the field.
And there was a kind of an uproar
from the landscaping community saying,
hey, these goats are taking our jobs.
Never mind the fact that it might take the goats 30 days to do what the landscaper can do in five hours.
There was just this fear of our jobs are going to be replaced by goats.
And my reaction to that is if a goat can do your job, let it do it and go do something better.
So same thing with the machines as we're talking about.
If the machines can make sense of the data, if the machines can take the actions, automate it, let the machines do it so that you can spend your time doing something more productive that the machines can't do.
Such as creating, you know, most of the features and everything else going into that, which still, at least at this point, has to be created by humans.
So let's concentrate on that part of it and let the machines take over what they can.
Completely agree.
Cool.
Eric,
any final words from your side?
Yeah, just one last thing I think to add to your summary.
One thing that's super important as you make this journey is to explain to people where you're trying to get to, rather than just saying, hey, we're changing this.
That's really important to get buy-in so that you don't create unnecessary fear and unrest in your organization.
Great point.
Yeah, I really appreciate the time, guys.
It was a pleasure chatting with you guys.
All right, excellent.
If anybody listening would like to see and meet Eric, you'll be at Perform.
Perform, we call it Perform 2018, right?
That's going to be my birthday weekend, Andy.
So make sure you bring me a present.
You too as well, Eric.
I'll be there.
It's going to be in Las Vegas at which hotel?
Is that going to be at the one with the fountains, right?
I think it's the Bellagio.
The Bellagio, yes.
And in January.
So hopefully we'll see a bunch of you there.
Definitely come and meet Eric and check out his talk about all this.
I'm sure he'll do much more in-depth on a lot of things and with awesome slides, especially now that we know the pressure is on, Eric.
You're going to hand-draw a galley ship with people chained up
and getting whipped while they row.
I'm thinking of surfing.
Right, exactly.
I can't wait to see it.
All right. Thank you.
Thank you.
Thanks so much.
Bye.