PurePerformance - 029 What is Metrics Driven NetOps

Starting point is 00:00:00 It's time for Pure Performance. Get your stopwatches ready. It's time for Pure Performance with Andy Grabner and Brian Wilson. Hey, this is another episode of Pure Performance. This time a little special. We were supposed to have Brian again on the call. Unfortunately, he couldn't make it. Hopefully, have him back on the next call. But I took the opportunity today to actually follow up on a conversation I had a couple of weeks ago with Tom,

Starting point is 00:00:43 who is actually with me today. Tom McGonigal, A for Andrew, sorry for that. I want to give you the chance in a couple of seconds to introduce yourself. But I think we met each other at a local meetup here in Boston. It was a DevOps meetup. I talked about metrics-driven continuous delivery. I talked about DevOps. And then you actually came in later on and said, hey, you know, this is a very interesting topic, and you also have something to say, especially from a network side. So without further ado, I want to, A, give you the chance to introduce yourselves, and then I want to figure out the topic today. I think we said we want to evolve a little bit the idea of metrics-driven DevOps, which is kind of maybe a different angle to what I've been doing in the past with metrics-driven DevOps.

Starting point is 00:01:26 So, Tom, maybe a quick word on who you are and what your passion is about, network operations and all that stuff. Just let us know who you are. Sure. Thank you very much, Andy. It's a tremendous pleasure to be on your podcast, and it was a tremendous pleasure to see you speak and give the metrics-driven DevOps speech. And based on that, I asked you to come to

Starting point is 00:01:46 my meetup, which is the Boston Jenkins Area Meetup Group, which is the largest Jenkins meetup group in the world, and speak in March. And so this March, you're going to be coming and giving kind of a Jenkins-tailored specific presentation on metrics-driven DevOps. And so just seeing you speak and seeing your presentation sparked in me an interest in trying to figure out exactly what metrics-driven NetOps looks like. I am currently working as a field systems engineer at F5 Networks, the application delivery controller company. And we are undergoing a pretty significant change,

Starting point is 00:02:21 and we are implementing DevOps throughout our organization, across our cloud subject matter experts, and we are getting questions from our customers about DevOps and NetOps and what that looks like. And for me, one of the tenets of DevOps has always been CAMS, which stands for Culture, Automation, Monitoring, and Sharing. And so what does the network look like without metrics? The stuff that you're doing at Dynatrace is so whiz-bang. It's so exciting. It's so interesting. The UFOs that you guys have in the office and the one that you have here in our little conference room are so cool. And why can't network engineers have UFOs and expose the API to their network gear,

Starting point is 00:03:07 and why can't we have better metrics in the network space? That's cool. So let me ask you a couple questions, because I'm not an expert at all in networks. Obviously, I understand that the network makes everything possible because it connects our tiers. It allows us to send data over the wire. I know you, as the company that you work for, F5, you are providing some great tools on different, I think, OSI levels, right?

Starting point is 00:03:29 Routers, but also, I think, application gateways on different levels. I think you did a quick presentation at the last DevOps Meetup in Boston and gave a little intro on what you do there. But, you know, when it comes to DevOps, for me, personally, I believe everything is centered around the application. We will write a lot of applications, more than before, and we're pushing applications out faster to a virtual, a physical environment, in the cloud, whatever it is. And obviously everything is backed by the network, because if there's no network, nobody can get to my apps. The metrics that I'm always looking at are application-specific metrics, response time, how many database statements are executed, how many bytes do we send over the wire, and in case some of these metrics actually go wrong, and if I make a code change, and that changes the metric in a way that I believe is no longer good for the application

Starting point is 00:04:16 because we're sending too many bytes over the wire, we're making too many database calls, we're doing something crazy on the CPU, then I raise the flag and say something is wrong. And that's when the UFO, which is one way of visualizing the state, goes red, for instance. Now, from a network perspective, I would have assumed that you guys have a lot of metrics. We do. You know, we've had metrics for decades. And there's a protocol called Simple Network Management Protocol, or SNMP,

Starting point is 00:04:45 and it exposes a plethora of metrics about the infrastructure. But are they actionable? Are they timely? Are they useful? Are they plottable on a graph? You know, there's these questions. But you mentioned your focus on the application. And I just want to, you know, before working at F5, I worked at CloudBees, the Jenkins company. And something I talked about with all my customers and clients was that software is eating the world. Marc Andreessen's famous quote. And software is eating the network world as well.

Starting point is 00:05:26 Our ADCs, our F5 BIG-IP devices, are incredibly programmable. They have RESTful interfaces, and we are providing Python SDKs and Ansible code to orchestrate and automate the infrastructure. The typical application development that I think

Starting point is 00:05:42 you're picturing is also applicable to the infrastructure as code development using the Python SDK, using Ansible, using Chef, using Puppet, using Salt to configure the BIG-IPs. And so software is eating the world. It's not only just the application guys who are the king, right? But it's also eating the network world as well. So that means what you tell me, as an application guy,

Starting point is 00:06:08 I think by now everybody's understanding that they need to write some kind of infrastructure as code, like how do I deploy a new JBoss if I need it because capacity requires it, right? Sure. Or if I have too much load. But you now tell me an additional aspect that I never thought about is also writing infrastructure as code for the underlying infrastructure. That's right. So if you think of a CI/CD.

Starting point is 00:06:31 Meaning the network. With infrastructure, in this case, I mean the network. Exactly. So if you think of a CI/CD pipeline where you're moving an application from dev to QA to prod, you are going to touch the ADC. You're going to touch the load balancers. The load balancers are part of the CI/CD process, and that means that there needs to be orchestration and automation

Starting point is 00:06:52 of the load balancers in order to facilitate the CI/CD pipeline. Now, do you see, we see a lot of our people that we talk with and work with, they are trying to move towards something that makes it easier for development teams to push applications through the pipeline and then obviously in production. And they are using platform-as-a-service solutions, whether it's a Cloud Foundry, whether it's an OpenShift, whether it's Microsoft. And the promise of these platforms is you don't need to worry about this anymore. Is this something that you see as well? Are you working with these vendors?

Starting point is 00:07:30 Are you working with Cloud Foundry? Absolutely. Yeah, so I used to work at OpenShift. I was a site reliability engineer for the OpenShift project, and I supported over a million applications, you know, specific tailor-made applications. And the pipeline and the support of these applications is very much a piece of our business and a very important piece of our business. We are the edge of the network. We are what controls the application delivery and the application access. And so we have this very, very important place to play in the PaaS space,

Starting point is 00:08:09 in the infrastructure as a service space. So that means if we talk about the PaaS environments now, again, I would assume if I'm a developer, I don't want to care about this. I use an orchestration engine, and my orchestration layer allows me to, say, scale up depending on load, scale up to make sure that a certain response time SLA is met. But should I, as a developer, care about the underlying network, that the infrastructure is there, or should the PaaS provide that for me, automatically configure all the routers?

Starting point is 00:08:38 That's a phenomenal question, and so I'd love to debate this. I actually have literally a debate with you on this. So I see both sides of the story. So part of me says the application developer who is in charge of an application should be aware of the capabilities of a modern ADC in that they can tune the TCP and the HTTP protocols to service their applications and to the best of their abilities. They can implement caching, and they can implement big TCP windows and pipelines and implement SSL on the ADC and on the load balancer. They can do all these things.

Starting point is 00:09:19 And then so half of me says yes. The application developer is going to, the responsibility of the network engineer is going to be shifting left to the application developer. And then the other side of me sees exactly what you were kind of pointing out. Maybe the application developer shouldn't be aware of this. I guess it depends on the maturity of the product, the maturity of the individual, and the capabilities of the organization. But I'd love to hear your thoughts on this. So I think it's interesting. I did a webinar recently.

Starting point is 00:09:52 It was called DevOps for Operations Engineers. What does DevOps for Ops in general mean? And basically, I came to exactly the same two conclusions. I think the future for the traditional ops teams, they have two options. Either shifting left, meaning they become part of the application delivery team, which actually goes towards a no-ops environment. There's no traditional ops anymore, but ops is just part of the application delivery team, and they just give their expertise and make sure that the environment is there. Or the other way would be becoming obsessed with service. That means for a large organization, they provide additional easy-to-consume services to the application team so that they can run their applications on the infrastructure. So either become part of the application teams or become more what Amazon and Microsoft and Google are right now, basically providers of infrastructure, easily manageable and controllable

Starting point is 00:10:48 through REST APIs. I think these are the two things. So I think that was perfectly put. I think you perfectly put it. I don't think, I think we're getting closer to the answer. I think it's a combination of the two. I think just a simple fact that we have infrastructure operations as a service. And you don't have to worry about, if the application developer doesn't have to worry about something like a VLAN or routes or just the network nitty-gritty,

Starting point is 00:11:23 it allows them to then orchestrate at the level, at the layer seven level, at the application layer level. And it allows them to interact with something like a BIG-IP at the Layer 7 level and focus on what they're good at. And that includes orchestration, I would argue. I think if we keep pulling on this thread around continuous delivery and what that looks like and how the network plays with the continuous delivery model. And if we could just kind of circle back to what type of metrics we need out of a continuous delivery pipeline that's focused on the network. You know, what does that look like?

Starting point is 00:11:58 Yeah. So, I mean, what I think it has to look like, I mean, from an application, again, I represent totally the application development team, right? Because that's just where I feel more comfortable. So what I believe the application teams can deliver and should deliver with tools like Dynatrace or any other tool where you can get application-specific data, we can tell you on a transaction-by-transaction basis, feature-by-feature, application-by-application, whatever you want to call it, how many bytes are most likely being sent over the wire. If we have some production monitoring data to know how many people at any given point

Starting point is 00:12:32 in time are using that feature, we can tell you how much bandwidth we actually need and how much data we send to which endpoints. And I believe the magic of taking this data and putting it into the pipeline is the following. If I know how my production environment looks like now, let's say 80% of my users are using these two features, and I know exactly how many bytes we send over the wire at which point during the day and during the week. And if we now make a change, so if you're pushing a change through the pipeline saying, we are changing that feature that is used by 80% of the people, and we are now requiring 20% more round trips

Starting point is 00:13:09 between the application server and the web server, between the application server and the database server. And if I have this information and give it to my network team, then they should be proactively figuring out, okay, what does this really mean? Is this a good idea or a bad idea? How do I need to configure my infrastructure? How can I automate that?
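
To make the idea above concrete, here is a minimal sketch of the kind of estimate an application team could hand to the network team before a change ships. It is not taken from Dynatrace or any F5 tooling: the `FeatureTraffic` class, the feature names, and all numbers are hypothetical, and the 20% change is modeled simply as 20% more bytes per request, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FeatureTraffic:
    """Per-feature numbers a monitoring tool might report (all values hypothetical)."""
    name: str
    bytes_per_request: int      # average bytes sent over the wire per request
    requests_per_second: float  # observed peak request rate in production


def required_mbps(features):
    """Estimate peak bandwidth in megabits per second across all features."""
    total_bytes_per_second = sum(
        f.bytes_per_request * f.requests_per_second for f in features
    )
    return total_bytes_per_second * 8 / 1_000_000


def with_change(feature, extra_fraction):
    """Return a copy of the feature with bytes per request scaled by a planned change."""
    return FeatureTraffic(
        feature.name,
        round(feature.bytes_per_request * (1 + extra_fraction)),
        feature.requests_per_second,
    )


# Hypothetical production picture: two features carry most of the traffic.
search = FeatureTraffic("search", bytes_per_request=50_000, requests_per_second=200)
checkout = FeatureTraffic("checkout", bytes_per_request=20_000, requests_per_second=50)

before = required_mbps([search, checkout])
# Assumed change: the heavily used feature sends 20% more data per request.
after = required_mbps([with_change(search, 0.20), checkout])
print(f"before: {before:.1f} Mbit/s, after: {after:.1f} Mbit/s")
```

With these made-up numbers the estimate moves from 88 to 104 Mbit/s, which is the sort of figure a network team could act on before the change reaches production.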

Starting point is 00:13:30 How can I even maybe automatically provision infrastructure depending on the load patterns that we have, right? And I think this is then really nicely playing into the DevOps story where the cool or the perfect world will be that my infrastructure is automatically understanding the patterns from the application, from the end users, and then providing the infrastructure exactly that it needs. It may be even able to anticipate certain load patterns based on historical data, based on certain events that happen in the world right now,

Starting point is 00:14:06 and then automatically provision the right infrastructure and the right network configuration and bandwidth. I love it. I've never thought of that before. I never made the connection for the feedback loop. I'm sure you're familiar with, or maybe your listeners are or are not, Gene Kim's three ways, and the second way is continuous intelligence. Exactly. And so it's the automatic feed.

Starting point is 00:14:29 So APM, and I apologize, APM stands for Application Performance Monitoring or Metrics? Application Performance Monitoring. Or Management. Management. We actually recently changed. I think we coined the new term DPM, Digital Performance Management, because in the end, yes, it's about applications, but we are helping people to do digital transformation through applications. But whatever the terminology, it is we have an application and end user-centric

Starting point is 00:14:54 view. Why end user? Because we also capture metrics from the end user, how end users interact with the application, talking about the load patterns. Where do people come from? Do we send the bytes to the local user community in Boston here? Right. Or do most people come from somewhere else, and therefore we have to think about total different things, right? We need to think about the CDN. We need to think about network bandwidth and latency becomes an issue.

Starting point is 00:15:16 So, yeah. But I just love the idea that the APM technology is orchestrating and configuring the network, you know, and provisioning new resources, provisioning new servers on an ad hoc basis. It's, you know, there's an expression that floats around F5, and it's source of truth. Have you ever heard of this? You know, like, oftentimes, like, GitHub is the source of truth.

Starting point is 00:15:39 You know, but what we're talking about is Dynatrace being the source of truth. And, you know, there's various application characteristics happening. You know, for example, it's just poor performance. Well, you know what? We need to autoscale. You know, maybe we rely on AWS autoscaling, but we supplement it with application metrics that automatically spin up new instances, provision them, and add them to the BIG-IP in an automated way. I love that story.

Starting point is 00:16:07 I think it's very fascinating. It's smart scaling, right? Smart scaling. That's what it is, yeah. Because if you just scale up because you see performance goes up, that's like you have a bad tooth, and the only thing you do is you eat more Advil. Right. But you don't fix the root cause of the problem.

Starting point is 00:16:20 That's right. Right? This is kind of the idea. And I think, so again, coming back to my, also the story, the metrics driven story that I want to tell also at your meetup, it's about understanding what potential impact I have with the code change that I'm pushing through the pipeline, through the continuous delivery pipeline. If I know I'm changing the feature that is used by 80% of my user base on the peak load

Starting point is 00:16:41 on a Friday afternoon, if that's the peak load time, and I'm changing it in a way that I require 10% more database queries, I require to send 50% more bytes because I changed some images on that page now to high resolution instead of what it was before, then I need to make sure that this information, before I push this change into production, ends up with the people that need to make sure that the environment is provisioned correctly. In a perfect way, as I said earlier, in the future, hopefully, in the soon future or the distant future, maybe the orchestration layers of the world that we build or use will automatically

Starting point is 00:17:19 take care of this. Because they look at historical load patterns. They look at what changes come down the pipeline. They can look at Jenkins and see which features are changing in which way, and then they can automatically make sure that the infrastructure is provisioned in the right way. And I think the right way is essential in both ways. It shouldn't be too little, but also not too much infrastructure, because in the end we have to pay for it too, right? If you over-provision it.
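
The check described here, comparing what a change does to the application's metrics before it goes to production, could be sketched roughly as follows. This is an illustration only: the function name `metric_regressions`, the metric names, the tolerances, and the numbers are all hypothetical, and a real pipeline would pull these values from its monitoring tool rather than hard-code them.

```python
def metric_regressions(baseline, candidate, tolerances):
    """Return the metrics whose relative increase exceeds the allowed tolerance."""
    flagged = {}
    for name, allowed in tolerances.items():
        old, new = baseline[name], candidate[name]
        increase = (new - old) / old
        if increase > allowed:
            flagged[name] = increase
    return flagged


# Hypothetical per-request metrics for the current build and the candidate build.
baseline = {"db_queries": 10, "bytes_sent": 200_000, "response_ms": 120}
candidate = {"db_queries": 11, "bytes_sent": 300_000, "response_ms": 121}

# Assumed tolerances: how much each metric may grow before someone is told.
tolerances = {"db_queries": 0.05, "bytes_sent": 0.10, "response_ms": 0.05}

flagged = metric_regressions(baseline, candidate, tolerances)
for name, increase in flagged.items():
    print(f"{name}: +{increase:.0%} over baseline, notify the provisioning team")
```

In this made-up case the database queries (10% more) and the bytes sent (50% more) are flagged, while the response time change stays within tolerance.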

Starting point is 00:17:45 I love it. I love the idea that the APM is the orchestrator. Is it, not to give away, you know, too much insider information, but is that on the roadmap for Dynatrace? Well, the thing is, we have all the data, right? Whether, I mean, we have, I'm not sure if we become the actual orchestrator,

Starting point is 00:18:03 even though we can, because we have a concept of incidents and we can trigger events. But I wonder if it is more, if it is like a, you know, we take our data, but then we need to pull in other data as well from folks like you guys, right, from the cloud providers, and then based on that make a good decision. We within Dynatrace, when we actually implemented our current SaaS-based offering, we built our own orchestration layer because back then when we started,

Starting point is 00:18:31 there was nothing like a Kubernetes, like a Mesosphere. So we built it on our own, and what we do, we actually look at APM data and infrastructure data, and then basically we automated everything that a normal ops team would do. So we see a shortage in resources, and we scale up. It doesn't help. Well, either we scale up a little more. If this doesn't solve the problem, that's probably not the root cause that we have.

Starting point is 00:18:54 Then we look deeper and say, what is the actual root cause? And instead of endlessly putting more resources on the problem, we actually alert and make sure the problem actually gets addressed and fixed. So this is what we actually did. We built into our orchestration engine the logic of what a normal ops engineer would do, an application engineer would do. They have the runbooks. We automated all the runbooks by automatically looking at all the metrics, understanding

Starting point is 00:19:20 the dependencies of the different systems, and then taking certain actions to get the system back to the healthy state. Oh, it's beautiful. It's continuous intelligence. That's what it is, yeah. It is. It's wonderful. Yeah. What a great design.
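
The runbook logic described above (scale up, scale up a little more, then stop and alert so the root cause gets fixed) can be sketched like this. It is not Dynatrace's actual orchestration engine: `remediate` and its callback parameters are hypothetical names, and the limit of two scale-ups is an assumption.

```python
def remediate(is_healthy, scale_up, alert, max_scale_ups=2):
    """Scale up a bounded number of times, then escalate if health does not return."""
    for _ in range(max_scale_ups):
        if is_healthy():
            return "healthy"
        scale_up()
    if is_healthy():
        return "healthy"
    # More resources did not help, so resources are probably not the root cause.
    alert("scaling did not help; investigate the root cause")
    return "escalated"


# Simulated run: a service that stays unhealthy no matter how much we scale.
actions = []
result = remediate(
    is_healthy=lambda: False,
    scale_up=lambda: actions.append("scale_up"),
    alert=lambda message: actions.append("alert"),
)
print(result, actions)
```

Bounding the number of scale-ups is the point of the design: it keeps the automation from endlessly putting more resources on a problem that needs a fix.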

Starting point is 00:19:36 So we wanted to talk about, I mean, I think when we sat down before we had this meeting, before we started recording, we said metrics-driven NetOps. So I know this is obviously what we just talked about. This is, I think, something where the industry is moving towards, right? Some are leading the way. So we already do some of this internally. I'm sure some of the unicorns are doing it already. Your exposure to current operation teams and network teams,

Starting point is 00:20:06 what do they do right now? And what would be your recommendation of kind of leveling them up a little bit? What can current teams that have been doing their job over the last couple of years, always the same way, what do they need to learn? What are the first steps? Well, so, you know, just in a very broad sense, I would say it's automation and in a very general sense, and I'm going to get very specific. You know, they are very reactive. They are, you know, when I started my career, I worked for the Federal Aviation Administration and I worked in their operation center and I just monitored systems. You know, we had all these screens and you just keep, it was a NOC, it was a Network Operations Center. And we get alerts through this NetMail technology,

Starting point is 00:20:50 and we'd get alerts through these graphical user interfaces, and then we'd react to it. And there needs to be more automation and self-healing. And so there's various things that can go wrong on a BIG-IP device. For example, the network itself can collapse. Someone can trip over a cable and unplug it. Or there could be a software bug in the BIG-IP. Or there could be any host of various challenges.

Starting point is 00:21:18 But what we need to do as a field, and I mean field as in the network operations as an entire industry, what we need to do is we need to have more automation. And there's literally hundreds of thousands of network engineers in the world today who are doing all of their work on the command line. And that's just an ancient and old practice. They need to move towards an automation strategy. They need to become more comfortable. We have an expression that we use at F5 called a Super-NetOps person, but I've talked about them more as site reliability engineers. And if you're familiar with site

Starting point is 00:21:58 reliability engineers, they are what is referred to as T-shaped employees. So in the case of a network-focused site reliability engineer, they're going to have depth in networking and security, and then they're going to have breadth in agile and metrics and monitoring and automation. And this T-shaped employee gives them the ability to, you know, really service various modern-day cloud-based problems. And once again, the key to that T-shaped employee is automation. They need, you know, the typical network engineer that's not doing automation now needs to level up.

Starting point is 00:22:38 And I guess the challenge is, and this was also a challenge, obviously, on the application side, maybe, or on the, let's say, the traditional ops side, right? When we talk about automation, it means people have the fear that they're automating away their own jobs. That's right. That's the biggest challenge. Yeah. And so, but the question is, what can we do to take their fear? Because I don't think they're automating away their jobs. I believe their jobs are changing, right?

Starting point is 00:23:05 Yeah. I was talking about this earlier today. I believe what we need to do is we need to appeal to people's self-interest. You know, there's an expression that I, or there's a phrase, like a turn of phrase that I use that the big network factory in Boston is going away. You know, there's going to be a lack of jobs in the next five years in the network operation space. There's going to be less and less jobs.

Starting point is 00:23:25 These guys need to level up. They need to become site reliability engineers. They need to have cloud skills. They need to leverage their networking skills and supplement it with agile monitoring and automation skills. And my alliteration around this factory, this network factory leaving Boston is meant to appeal to their sense of self-interest. And so we have to have this one-two punch where we appeal to their self-interest, and

Starting point is 00:23:50 then we give them, F5, for example, gives them a path to learn those skills. And we give them study guides, and we give them players guides, and we give them podcasts, and we give them how-to videos, and we provide code, and we provide examples, and we get out and we market and we get out and we talk and we get out and we help and we just help these guys get to where they need to get to. It's an incredibly difficult proposition.

Starting point is 00:24:16 Yeah, it is. Well, especially, you know, I mean, there's always it depends on the personality too. A lot of people are eager to learn, do something new, but some people are just resistant to change, right? And they're comfortable in the way they've been doing things over the last 10, 20 years. That's right. And I mean, it's a general problem that we have as a human society or as humans. So I like the fact that you are, I mean, what you personally do with the meetups, right? You are spreading the word about what's new, what's cool,

Starting point is 00:24:42 what is the direction you need to go. And you were educating people. I was a teacher at a junior college in Boston for three years. My grandfather was a teacher. And so it's in my blood. And this is my life's work. At the beginning of the show, I didn't mention, but I do have 10 years of DevOps experience. My DevOps birthday just passed. It was January 17th when

Starting point is 00:25:05 I first started using Puppet. And I've basically been putting in 16-hour days ever since. And so I have this tremendous amount of experience with DevOps. And I'm incredibly passionate about it. And I want to be a community organizer. And I want to help people. I mentioned to you earlier that I have a disabled sister. And I was raised to help people. And I'm here to help. And if anyone wants to reach out to me, they can. And you can reach me at t.mcgonigal at f5.com and just get in touch.

Starting point is 00:25:36 And I'll give you a call back, and we can figure out what you need to do and what you need help with. And I'm here to help. That's really cool. Wow. So, you know what? I think this is a perfect time to actually conclude our session maybe, right? Okay. We talked, I think we learned or we have, we created this new vision for people out

Starting point is 00:25:54 there. It was an excellent discussion, Andy. Thank you so much. Yeah, it was really cool, right? Yeah. So metrics, we can use these metrics that we get from the application teams to give the direction of what we need from the infrastructure network side, from resource provisioning, but in this case, obviously,

Starting point is 00:26:09 from the network side. I know you said there's a lot of effort that goes into making infrastructure and especially network automatable, right? Part of infrastructure as code. That's great. We figured out that traditional ops or network teams can go two directions, either to become part of the application team, putting their knowledge into these teams so that they can build better apps in the future with the right mindset and with the right knowledge about the network in mind. The other option is moving towards an ops as a service or network as a service, whatever you want to call it.

Starting point is 00:26:44 Both directions need heavy automation. And then don't be afraid, but actually encourage change, right? Go out there, see what's new, go to the local meetup scenes. Because, I mean, I love the meetups. I mean, I know we're fortunate here in Boston because there's so many meetups. There's so many nice ones, yeah. And I know we are doing one again in March. March 8th, I think.

Starting point is 00:27:05 March 8th, yeah. And it's amazing. You are the largest Jenkins meetup group in the world. Isn't that cool? That's awesome, yeah. Thanks for inviting me. Thanks for being here. Thanks for presenting.

Starting point is 00:27:15 It's going to be a great presentation. I hope anyone in the Boston area can make it. And we do have sponsors. We have food, and where we have our meetups is literally next door to a liquor store, so you're welcome to get something to drink and listen to Andy and harass him or whatever it takes to get you out to the meetup. I'd love to see everyone there. And then just as an aside, on the days of the meetups, we have something called a DevOps Dojo, which is a safe place for mentoring and learning.

Starting point is 00:27:45 And Andy's going to be there on March 8th from 12 to 7, and I'm going to be there from 12 to 7. And we're just there to help people learn about DevOps and learn about automation and learn about infrastructure as code and all that good stuff. And Jenkins as well. Cool. And I also want to do one more shout-out. I think I mentioned to you in an email in June, we have a big Agile conference coming to Boston, the Agile Testing Days. I know these guys from Europe.

Starting point is 00:28:10 They've been running this conference in Berlin for a couple of years, and now they make it to Boston. Jenkins will be a big topic. CI/CD, DevOps. So that's a great way. That sounds fantastic. Yeah, that sounds really great. That's a great shout-out.

Starting point is 00:28:21 I'm definitely going to be there. Yeah. Cool. All right. Hey, thank you so much. And I know we will do another session because we have some additional ideas we want to discuss. Thanks so much, Andy. Thank you.

