PurePerformance - 029 What is Metrics Driven NetOps
Episode Date: February 27, 2017
Thomas McGonagle just had his 10-year DevOps anniversary, as it was 10 years ago that he was first exposed to Infrastructure as Code through Puppet. He is currently working with F5, helping BIG-IP Net...work Teams around the world automate the Network as part of their DevOps transformation.
We met Tom at a recent DevOps meetup in Boston, which sparked this conversation on what “Metrics Driven Continuous Delivery” could mean for Network Operations Engineers. What are the metrics to look at? How to engage with the application teams to provision better and automated network resources? How to bake this into the Continuous Delivery Cycle?
Besides NetOps, Thomas is also passionate about CI/CD. He runs the largest Jenkins User Group in the World out of Boston, MA. If you happen to be around, check out their next meetups and Dojos: https://www.meetup.com/Boston-Jenkins-Area-Meetup/
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hey, this is another episode of Pure Performance.
This time a little special.
We were supposed to have Brian again on the call.
Unfortunately, he couldn't make it.
Hopefully, we'll have him back on the next call. But I took the opportunity today to actually follow up on a conversation I had a couple of weeks ago with Tom,
who is actually with me today. Tom McGonagle,
A for Andrew, sorry for that. I want to give you the chance in a couple of seconds to introduce
yourself. But I think we met each other at a local meetup here in Boston. It was a DevOps meetup.
I talked about metrics-driven continuous delivery. I talked about DevOps. And then you actually came
in later on and said, hey, you know, this is a very interesting topic, and you also have something to say, especially from a network side.
So without further ado, I want to, A, give you the chance to introduce yourself, and then I want to figure out the topic today.
I think we said we want to evolve a little bit the idea of metrics-driven DevOps, which is kind of maybe a different angle to what I've been doing in the past
with metrics-driven DevOps.
So, Tom, maybe a quick word on who you are and what your passion is about, network operations
and all that stuff.
Just let us know who you are.
Sure.
Thank you very much, Andy.
It's a tremendous pleasure to be on your podcast, and it was a tremendous pleasure to see you
speak and give the metrics-driven DevOps speech.
And based on that, I asked you to come to
my meetup, which is the Boston Jenkins Area Meetup Group, which is the largest Jenkins
meetup group in the world, and speak in March. And so this March, you're going to be coming and
giving kind of a Jenkins-tailored specific presentation on metrics-driven DevOps. And
so just seeing you speak and seeing your presentation sparked in me an interest
in trying to figure out exactly what metrics-driven NetOps looks like.
I am currently working as a field systems engineer at F5 Networks,
the application delivery controller company.
And we are undergoing a pretty significant change,
and we are implementing DevOps throughout our organization,
across our
cloud subject matter experts, and we are getting questions from our customers about DevOps and
NetOps and what that looks like. And for me, one of the tenets of DevOps has always been
CAMS, which stands for Culture, Automation, Monitoring, and Sharing.
And so what does the network look like without metrics? The stuff that you're doing at Dynatrace is so whiz-bang. It's so exciting. It's so interesting. The UFOs that you guys have in
the office and the one that you have here in our little conference room are so cool. And why can't
network engineers have UFOs and expose the API to their network gear,
and why can't we have better metrics in the network space?
That's cool.
So let me ask you a couple questions,
because I'm not an expert at all in networks.
Obviously, I understand that the network makes everything possible
because it connects our tiers.
It allows us to send data over the wire.
I know you, as the company that you work for, F5, you are providing some great tools on different, I think, OSI layers, right?
Routers, but also, I think, application gateways on different levels.
I think you did a quick presentation at the last DevOps Meetup in Boston and gave a little intro on what you do there.
But, you know, when it comes to DevOps, for me, personally, I believe everything is centered around the application.
We will write a lot of applications, more than before, and we're pushing applications out faster to a virtual, a physical environment, in the cloud, whatever it is.
And obviously everything is backed by the network, because if there's no network, nobody can get to my apps.
The metrics that I'm always looking at are application-specific metrics, response time, how many database statements are executed, how many bytes do we send over the wire,
and in case some of these metrics actually go wrong, and if I make a code change,
and that changes the metric in a way that I believe is no longer good for the application
because we're sending too many bytes over the wire, we're making too many database calls,
we're doing something crazy on the CPU, then I raise the flag and
say something is wrong.
And that's when the UFO, which is one way of visualizing the state, goes red, for instance.
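A minimal sketch of the check Andy describes here: compare a build's application metrics against a baseline and raise the flag when one of them grows too much. The metric names, the tolerance values, and the red/green mapping are illustrative assumptions, not taken from Dynatrace or any other tool.

```python
from dataclasses import dataclass


# Hypothetical per-build metrics; the field names are assumptions for illustration.
@dataclass(frozen=True)
class BuildMetrics:
    response_time_ms: float
    db_statements: int
    bytes_sent: int


# Allowed relative increase over the baseline before a metric is flagged.
# These numbers are made up; a real team would choose its own.
TOLERANCE = {
    "response_time_ms": 0.10,
    "db_statements": 0.00,
    "bytes_sent": 0.20,
}


def find_regressions(baseline: BuildMetrics, current: BuildMetrics) -> list[str]:
    """Return the names of the metrics that grew beyond their tolerance."""
    regressions = []
    for name, allowed in TOLERANCE.items():
        before = getattr(baseline, name)
        after = getattr(current, name)
        if after > before * (1 + allowed):
            regressions.append(name)
    return regressions


def status_light(regressions: list[str]) -> str:
    """Map the result to a simple red/green state, similar in spirit to the UFO."""
    return "red" if regressions else "green"
```

A pipeline step could call find_regressions after each test run and fail the build, or just switch the light, whenever the returned list is not empty.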
Now, from a network perspective, I would have assumed that you guys have a lot of metrics.
We do.
You know, we've had metrics for decades.
And there's a protocol called Simple Network Management Protocol, or SNMP,
and it exposes a plethora of metrics about the infrastructure.
But are they actionable? Are they timely? Are they useful? Are they plottable on a graph?
You know, there's these questions.
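One way to make such SNMP data plottable, sketched under assumptions: interface counters such as ifInOctets are cumulative, so a graph needs the rate between two polls rather than the raw value. The sketch below assumes the two readings have already been collected by some poller (no SNMP library is used here) and that the counter is a 32-bit one that wraps at most once between polls.

```python
# A 32-bit SNMP counter wraps back to zero after reaching 2**32 - 1.
COUNTER32_MODULUS = 2 ** 32


def counter_delta(previous: int, current: int, modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two counter readings, allowing for a single wrap."""
    return (current - previous) % modulus


def bits_per_second(prev_octets: int, curr_octets: int, interval_seconds: float) -> float:
    """Turn two octet-counter readings into an average throughput."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return counter_delta(prev_octets, curr_octets) * 8 / interval_seconds
```

Polling every few seconds and plotting the result of bits_per_second gives a timely throughput graph; a long polling interval on a fast link can hide more than one wrap, which this sketch does not detect.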
But you mentioned your focus on the application.
And I just want to, you know, before working at F5, I worked at CloudBees, the Jenkins company.
And something I talked about with all my customers and clients was that software is eating the world.
Marc Andreessen's famous quote.
And software is eating the network world as well.
Our ADCs, our F5 BIG-IP devices, are incredibly programmable. They have RESTful
interfaces, and we are providing Python SDKs and Ansible code
to orchestrate and automate the infrastructure.
The typical application development that I think you're picturing is also
applicable to the infrastructure-as-code development using the Python SDK, using Ansible, using Chef, using Puppet, using
Salt to configure the BIG-IPs.
And so software is eating the world.
It's not only just the application guys who are the king, right?
But it's also eating the network world as well.
So that means what you tell me, as an application guy,
I think by now everybody's understanding that they need to write some kind of infrastructure as code,
like how do I deploy a new JBoss if I need it because capacity requires it, right?
Sure.
Or if I have too much load.
But you now tell me an additional aspect that I never thought about
is also writing infrastructure as code for the underlying infrastructure.
That's right.
So if you think of a CI/CD.
Meaning the network.
With infrastructure, in this case, I mean the network.
Exactly.
So if you think of a CI/CD pipeline where you're moving an application
from dev to QA to prod, you are going to touch the ADC.
You're going to touch the load balancers.
The load balancers are part of the CI/CD process,
and that means that there needs to be orchestration and automation
of the load balancers in order to facilitate the CI/CD pipeline.
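As a rough illustration of what automating the load balancer inside a pipeline could look like, the sketch below reconciles a pool's membership with the set of instances a deployment stage wants. InMemoryLoadBalancer is a hypothetical stand-in invented for this example; a real ADC would be driven through its REST API or an SDK, and the pool names and addresses are made up.

```python
class InMemoryLoadBalancer:
    """Hypothetical stand-in for a load balancer API, kept in memory for illustration."""

    def __init__(self) -> None:
        self._pools: dict[str, set[str]] = {}

    def members(self, pool: str) -> set[str]:
        # Return a copy so callers cannot change the pool by accident.
        return set(self._pools.get(pool, ()))

    def add_member(self, pool: str, address: str) -> None:
        self._pools.setdefault(pool, set()).add(address)

    def remove_member(self, pool: str, address: str) -> None:
        self._pools.get(pool, set()).discard(address)


def sync_pool(lb: InMemoryLoadBalancer, pool: str, desired: set[str]) -> dict[str, set[str]]:
    """Make the pool contain exactly the desired members and report what changed."""
    current = lb.members(pool)
    to_add = desired - current
    to_remove = current - desired
    for address in sorted(to_add):
        lb.add_member(pool, address)
    for address in sorted(to_remove):
        lb.remove_member(pool, address)
    return {"added": to_add, "removed": to_remove}
```

Because sync_pool only applies the difference, a pipeline stage for dev, QA, or prod can run it on every deployment, and running it twice with the same input changes nothing.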
Now, do you see, we see a lot of our people that we talk with and work with,
they are trying to move towards something that makes it easier for development teams to push applications through the pipeline and then obviously in production.
And they are using platform-as-a-service solutions,
whether it's a Cloud Foundry, whether it's an OpenShift, whether it's Microsoft.
And the promise of these platforms is you don't need to worry about this anymore.
Is this something what you see as well?
Are you working with these vendors?
Are you working with Cloud Foundry?
Absolutely.
Yeah, so I used to work at OpenShift.
I was a site reliability engineer for the OpenShift project,
and I supported over a million applications, you know, specific tailor-made applications. And the pipeline and the support of these applications is very much a piece of our business
and a very important piece of our business.
We are the edge of the network.
We are what controls the application delivery and the application access. And so we have this very, very important place to play in the PaaS space,
in the infrastructure as a service space.
So that means if we talk about the PaaS environments now,
again, I would assume if I'm a developer, I don't want to care about this.
I use an orchestration engine, and my orchestration layer allows me to, say,
scale up depending on load,
scale up to make sure that a certain response time SLA is met.
But should I, as a developer, care about the underlying network, that the infrastructure is there,
or should the PaaS provide that for me, automatically configure all the routers?
That's a phenomenal question, and so I'd love to debate this.
I actually have literally a debate with you on this.
So I see both sides of the story.
So part of me says the application developer who is in charge of an application should be aware of the capabilities of a modern ADC in that they can tune the TCP and the HTTP protocols to service their applications
to the best of their abilities.
They can implement caching, and they can implement big TCP windows and pipelines
and implement SSL on the ADC and on the load balancer.
They can do all these things.
And then so half of me says yes.
The application developer is going to, the responsibility of the network engineer is going to be shifting left to the application developer.
And then the other side of me sees exactly what you were kind of pointing out.
Maybe the application developer shouldn't be aware of this.
I guess it depends on the maturity of the product, the maturity of the individual, and the capabilities of the organization.
But I'd love to hear your thoughts on this.
So I think it's interesting.
I did a webinar recently.
It was called DevOps for Operations Engineers.
What does DevOps for Ops in general mean?
And basically, I came to exactly the same two conclusions.
I think the future for the traditional ops teams, they have two options.
Either shifting left, meaning they become part of the application delivery team, which actually goes towards a no-ops environment.
There's no traditional ops anymore, but ops is just part of the application delivery team, and they just give their expertise and make sure that the environment is there.
Or the other way would be becoming obsessed with service. That means for a large organization, they provide additional easy-to-consume services to the application team so that they can run their applications on the infrastructure.
So either become part of the application teams or become more what Amazon and Microsoft and Google are right now, basically providers of infrastructure, easily manageable and controllable
through REST APIs.
I think these are the two things.
So I think that was perfectly put.
I think you perfectly put it.
I don't think, I think we're getting closer to the answer.
I think it's a combination of the two.
I think just a simple fact that we have infrastructure operations as a service.
And you don't have to worry about, if the application developer doesn't have to worry about something like a VLAN or routes or just the network nitty-gritty,
it allows them to then orchestrate at the Layer 7 level, at the application layer, and it allows them to interact with something like a BIG-IP
at the Layer 7 level and focus on what they're good at.
And that includes orchestration, I would argue.
I think if we keep pulling on this thread around continuous delivery
and what that looks like and how the network plays with the continuous delivery model.
And if we could just kind of circle back to what type of metrics we need
out of a continuous delivery pipeline that's focused on the network.
You know, what does that look like?
Yeah.
So, I mean, what I think it has to look like, I mean, from an application,
again, I represent totally the application development team, right?
Because that's just where I feel more comfortable.
So what I believe the application teams can deliver and should deliver with tools like Dynatrace or any other tool where you can get application-specific data,
we can tell you on a transaction-by-transaction basis, feature-by-feature, application-by-application, whatever you want to call it,
how many bytes are most likely being sent over the wire.
If we have some production monitoring data to know how many people at any given point
in time are using that feature, we can tell you how much bandwidth we actually need and
how much data we send to which endpoints.
And I believe the magic of taking this data and putting it into the pipeline is the following.
If I know how my production environment looks like now, let's say 80% of my users are using these two features,
and I know exactly how many bytes we send over the wire at which point during the day and during the week.
And if we now make a change, so if you're pushing a change through the pipeline saying,
we are changing that feature that is used by 80% of the people,
and we are now requiring 20% more round trips
between the application server and the web server,
between the application server and the database server.
And if I have this information and give it to my network team,
then they should be proactively figuring out,
okay, what does this really mean?
Is this a good idea or a bad idea?
How do I need to configure my infrastructure?
How can I automate that?
How can I even maybe automatically provision infrastructure
depending on the load patterns that we have, right?
And I think this is then really nicely playing into the DevOps story
where the cool or the perfect world will be that my infrastructure
is automatically understanding the patterns from the application, from the end users,
and then providing the infrastructure exactly that it needs.
It may be even able to anticipate certain load patterns based on historical data,
based on certain events that happen in the world right now,
and then automatically provision the right infrastructure
and the right network configuration and bandwidth.
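The arithmetic behind this idea can be sketched in a few lines. All the numbers and parameter names below are hypothetical: the sketch assumes the team knows the average bytes per transaction, how often a user triggers a transaction, and how many users are active at peak, and it treats the change as a relative increase that applies to a share of the traffic.

```python
def required_mbps(bytes_per_transaction: float,
                  transactions_per_user_per_minute: float,
                  concurrent_users: int) -> float:
    """Estimate the bandwidth needed at a given load, in megabits per second."""
    bytes_per_second = (
        bytes_per_transaction * transactions_per_user_per_minute * concurrent_users / 60
    )
    return bytes_per_second * 8 / 1_000_000


def mbps_after_change(baseline_mbps: float,
                      share_of_traffic_affected: float,
                      relative_increase: float) -> float:
    """Scale the baseline when a share of the traffic grows by a relative amount."""
    return baseline_mbps * (1 + share_of_traffic_affected * relative_increase)


# Made-up example: 250 kB per transaction, 6 transactions per user per minute,
# 2,000 users at peak, and a change that adds 50% more bytes to a feature
# responsible for 80% of the traffic.
baseline = required_mbps(250_000, 6, 2_000)
projected = mbps_after_change(baseline, 0.8, 0.5)
```

With these made-up inputs the baseline works out to about 400 Mbps and the projection to about 560 Mbps, which is the kind of number a network team could act on before the change reaches production.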
I love it. I've never thought of that before.
I never made the connection for the feedback loop.
I'm sure you're familiar with, or maybe your listeners are or are not,
Gene Kim's three ways, and the second way is continuous intelligence.
Exactly.
And so it's the automatic feed.
So APM, and I apologize, APM stands for Application Performance Monitoring or Metrics?
Application Performance Monitoring.
Or Management.
Management.
We actually recently changed.
I think we coined the new term DPM, Digital Performance Management,
because in the end, yes, it's about applications, but we are helping people to do digital transformation
through applications. But whatever the terminology, it is we have an application and end user-centric
view. Why end user? Because we also capture metrics from the end user, how end users interact
with the application, talking about the load patterns. Where do people come from? Do we send
the bytes to the local user community in Boston here?
Right.
Or do most people come from somewhere else,
and therefore we have to think about total different things, right?
We need to think about the CDN.
We need to think about network bandwidth and latency becomes an issue.
So, yeah.
But I just love the idea that the APM technology is orchestrating
and configuring the network, you know, and provisioning new resources,
provisioning new servers on an ad hoc basis.
It's, you know, there's an expression that floats around F5,
and it's source of truth.
Have you ever heard of this?
You know, like, oftentimes, like, GitHub is the source of truth.
You know, but what we're talking about is Dynatrace being the source of truth.
And, you know, there's various application characteristics happening.
You know, for example, it's just poor performance.
Well, you know what?
We need to autoscale.
You know, maybe we rely on AWS autoscaling, but we supplement it with application metrics that automatically spin up new instances, provision them, and add them to the BIG-IP in an automated way.
I love that story.
I think it's very fascinating.
It's smart scaling, right?
Smart scaling.
That's what it is, yeah.
Because if you just scale up because you see performance goes up, that's like you have
a bad tooth, and the only thing you do is you eat more Advil.
Right.
But you don't fix the root cause of the problem.
That's right.
Right?
This is kind of the idea.
And I think, so again, coming back to my, also the story, the metrics driven story that
I want to tell also at your meetup, it's about understanding what potential impact I have
with the code change that I'm pushing through the pipeline, through the continuous delivery
pipeline.
If I know I'm changing the feature that is used by 80% of my user base on the peak load
on a Friday afternoon, if that's the peak load time, and I'm changing
it in a way that I require 10% more database queries, I require to send 50% more bytes
because I changed some images on that page now to high resolution instead of what it
was before, then I need to make sure that this information, before I push this change
into production, ends up with the people that need to make sure that
the environment is provisioned correctly.
In a perfect way, as I said earlier, in the future, hopefully, in the soon future or the
distant future, maybe the orchestration layers of the world that we build or use will automatically
take care of this.
Because they look at historical load patterns.
They look at what changes come down the
pipeline. They can look at Jenkins and see which features are changing in which way, and then
they can automatically make sure that the infrastructure is provisioned in the right way.
And I think the right way is essential in both ways: it shouldn't be too little, but also not too
much infrastructure, because in the end we have to pay for it too, right?
If you over-provision it.
I love it.
I love the idea that the APM is the orchestrator.
Is it, not to give away, you know,
the too much insider information,
but is that on the roadmap for Dynatrace?
Well, the thing is, we have all the data, right?
Whether, I mean, we have,
I'm not sure if we become the actual orchestrator,
even though we can,
because we have a concept of incidents
and we can trigger events.
But I wonder if it is more, if it is like a, you know, we take our data, but then we
need to pull in other data as well from folks like you guys, right, from the cloud providers,
and then based on that make a good decision.
We within Dynatrace, when we actually implemented our current SaaS-based offering,
we built our own orchestration layer because back then when we started,
there was nothing like a Kubernetes, like a Mesos.
So we built it on our own, and what we do, we actually look at APM data
and infrastructure data, and then basically we automated everything
that a normal ops team would do.
So we see a shortage in resources, and we scale up.
It doesn't help.
Well, either we scale up a little more.
If this doesn't solve the problem, that's probably not the root cause that we have.
Then we look deeper and say, what is the actual root cause?
And instead of endlessly putting more resources on the problem,
we actually alert and make sure the problem actually gets addressed and fixed.
So this is what we actually did.
We built into our orchestration engine the logic of what a normal ops engineer would
do, an application engineer would do.
They have the runbooks.
We automated all the runbooks by automatically looking at all the metrics, understanding
the dependencies of the different systems, and then taking certain actions to get the system back to the healthy state.
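The runbook logic described here (scale up, check whether it helped, and alert instead of scaling forever) can be sketched roughly as follows. This is an illustration under assumptions, not Dynatrace's actual orchestration code: the two callables, the response-time SLA, and the limit of two scale-ups are all hypothetical.

```python
from typing import Callable


def remediate(read_response_time_ms: Callable[[], float],
              scale_up: Callable[[], None],
              sla_ms: float,
              max_scale_ups: int = 2) -> str:
    """Return 'healthy', 'scaled', or 'alert' after working through the runbook."""
    if read_response_time_ms() <= sla_ms:
        return "healthy"
    for _ in range(max_scale_ups):
        scale_up()
        # Re-check after each scale-up to see whether more resources helped.
        if read_response_time_ms() <= sla_ms:
            return "scaled"
    # More resources did not fix it, so a shortage is probably not the root cause.
    return "alert"
```

The cap on scale-ups is the design choice that matters: it is what turns plain autoscaling into the "smart scaling" discussed earlier, because a problem that survives extra capacity gets escalated to people rather than hidden behind more resources.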
Oh, it's beautiful.
It's continuous intelligence.
That's what it is, yeah.
It is.
It's wonderful.
Yeah.
What a great design.
So we wanted to talk about, I mean, I think when we sat down before we had this meeting,
before we started recording, we said metrics-driven NetOps.
So I know this is obviously what we just talked about.
This is, I think, something where the industry is moving towards, right?
Some are leading the way.
So we already do some of this internally.
I'm sure some of the unicorns are doing it already.
Your exposure to current operation teams and network teams,
what do they do right now? And what would be your recommendation of kind of leveling them up a
little bit? What can current teams that have been doing their job over the last couple of years,
always the same way, what do they need to learn? What are the first steps?
Well, so, you know, just in a very broad sense, I would say it's automation and in a very
general sense, and I'm going to get very specific. You know, they are very reactive. They are, you
know, when I started my career, I worked for the Federal Aviation Administration and I worked in
their operation center and I just monitored systems. You know, we had all these screens and
you just keep, it was a NOC, it was a Network Operations Center. And we get alerts through this NetMail technology,
and we'd get alerts through these graphical user interfaces,
and then we'd react to it.
And there needs to be more automation and self-healing.
And so there's various things that can go wrong on a BIG-IP device.
For example, the network itself can collapse.
Someone can trip over a cable and unplug it.
Or there could be a software bug in the BIG-IP.
Or there could be any host of various challenges.
But what we need to do as a field, and I mean field as in the network operations as an entire industry, what we
need to do is we need to have more automation.
And there's literally hundreds of thousands of network engineers in the world today who
are doing all of their work on the command line.
And that's just an ancient and old practice.
They need to move towards an automation strategy. They need
to become more comfortable. We have an expression that we use at F5 called a Super-NetOps person,
but I've talked about them more as site reliability engineers. And if you're familiar with site
reliability engineers, they are what is referred to as T-shaped employees. So in the case of a
network-focused site reliability engineer,
they're going to have depth in networking and security,
and then they're going to have breadth in agile and metrics and monitoring and automation.
And this T-shaped employee gives them the ability to, you know,
really service various modern-day cloud-based problems.
And once again, the key to that T-shaped employee is automation.
They need, you know, the typical network engineer that's not doing automation now needs to level up.
And I guess the challenge is, and this was also a challenge, obviously, on the application side, maybe, or on the, let's say, the traditional ops side, right?
When we talk about automation, it means people have the fear that they're automating away their own jobs.
That's right.
That's the biggest challenge.
Yeah.
And so, but the question is, what can we do to take their fear?
Because I don't think they're automating away their jobs.
I believe their jobs are changing, right?
Yeah.
I was talking about this earlier today.
I believe what we need to do is we need to appeal to people's self-interest.
You know, there's an expression that I, or there's a phrase, like a turn of phrase that
I use that the big network factory in Boston is going away.
You know, there's going to be a lack of jobs in the next five years in the network operation
space.
There's going to be less and less jobs.
These guys need to level up.
They need to become site reliability engineers.
They need to have cloud skills.
They need to leverage their networking skills and supplement it with agile monitoring and
automation skills.
And my alliteration around this factory, this network factory leaving Boston is meant to
appeal to their sense of self-interest.
And so we have to have this one-two punch where we appeal to their self-interest, and
then we give them F5, for example, gives them a path to learn those skills.
And we give them study guides, and we give them players guides, and we give them podcasts,
and we give them how-to videos, and we provide code, and we provide
examples, and we get out and we market
and we get out and we talk and we get out
and we help and we just help these guys
get to where they need to get to.
It's an incredibly difficult proposition.
Yeah, it is. Well, especially,
you know, I mean, there's always
it depends on the personality too. A lot of people are
eager to learn, do something new, but some people
are just resistant to change, right? And they're comfortable in the way they've
been doing things over the last 10, 20 years. That's right. And I mean, it's a general problem
that we have as a human society or as humans. So I like the fact that you are, I mean, what you
personally do with the meetups, right? You are spreading the word about what's new, what's cool,
what is the direction you need to go. And you were educating people.
I was a teacher at a junior college in Boston for three years.
My grandfather was a teacher.
And so it's in my blood.
And this is my life's work.
At the beginning of the show, I didn't mention, but I do have 10 years of DevOps experience.
My DevOps birthday just passed.
It was January 17th when
I first started using Puppet. And I've basically been putting in 16-hour days ever since. And so
I have this tremendous amount of experience with DevOps. And I'm incredibly passionate about it.
And I want to be a community organizer. And I want to help people. And I was raised to help
people. I mentioned to you earlier that I have a disabled sister.
And I was raised to help people.
And I'm here to help.
And if anyone wants to reach out to me, they can.
And you can reach me at t.mcgonigal at f5.com and just get in touch.
And I'll give you a call back, and we can figure out what you need to do and what you need help with.
And I'm here to help.
That's really cool.
Wow.
So, you know what?
I think this is a perfect time to actually conclude our session maybe, right?
Okay.
We talked, I think we learned or we have, we created this new vision for people out
there.
It was an excellent discussion, Andy.
Thank you so much.
Yeah, it was really cool, right?
Yeah.
So metrics, we can use these metrics that we get from the application teams to give
the direction of what we need from the
infrastructure network side, from resource provisioning, but in this case, obviously,
from the network side. I know you said there's a lot of effort that goes into making infrastructure
and especially network automatable, right? Part of infrastructure as code. That's great.
We figured out that traditional ops or network teams can go two directions, either to become
part of the application team, putting their knowledge into these teams so that they can
build better apps in the future with the right mindset and with the right knowledge about
the network in mind.
The other option is moving towards an ops as a service or network as a service, whatever
you want to call it.
Both directions need heavy automation.
And then don't be afraid, but actually encourage change, right?
Go out there, see what's new, go to the local meetup scenes.
Because, I mean, I love the meetups.
I mean, I know we're fortunate here in Boston because there's so many meetups.
There's so many nice ones, yeah.
And I know we are doing one again in March.
March 8th, I think.
March 8th, yeah.
And it's amazing.
You are the largest Jenkins meetup group in the world.
Isn't that cool?
That's awesome, yeah.
Thanks for inviting me.
Thanks for being here.
Thanks for presenting.
It's going to be a great presentation.
I hope anyone in the Boston area can make it.
And we do have sponsors.
We have food, and where we have our meetups is literally next door to a liquor store,
so you're welcome to get something to drink and listen to Andy and harass him or whatever it takes
to get you out to the meetup. I'd love to see everyone there. And then just as an aside,
on the days of the meetups, we have something called a DevOps Dojo, which is a safe place for
mentoring and learning.
And Andy's going to be there on March 8th from 12 to 7, and I'm going to be there from 12 to 7.
And we're just there to help people learn about DevOps and learn about automation and learn about infrastructure as code and all that good stuff.
And Jenkins as well.
Cool.
And I also want to do one more shout-out.
I think I mentioned to you in an email: in June, we have a big Agile conference coming to Boston,
the Agile Testing Days.
I know these guys from Europe.
They've been running this conference in Berlin
for a couple of years, and now they make it to Boston.
Jenkins will be a big topic.
CICD, DevOps.
So that's a great way.
That sounds fantastic.
Yeah, that sounds really great.
That's a great shout-out.
I'm definitely going to be there.
Yeah.
Cool.
All right.
Hey, thank you so much.
And I know we will do another session because we have some additional ideas we want to discuss.
Thanks so much, Andy.
Thank you.