PurePerformance - 050 How Infrastructure as Code and Immutable Infrastructure enabled us to scale
Episode Date: December 4, 2017

Are you still deploying machines manually? Do you have to log in to machines to apply changes? Do you spend hours or even days to detect infrastructure issues messing with your test execution or even production? We have the answer for your pain: listen to this podcast! Markus Heimbach leads the Infrastructure and Services team at Dynatrace and explains how they got rid of Snowflakes (not in the political sense), tackled the Configuration Drift issue, and how his team became a Service Organization powering the innovation at Dynatrace R&D. Get a glimpse of his talk track from his presentation at #devone.at - https://speakerdeck.com/markusheimbach/infrastructure-as-code As another teaser: you will hear about Test Automation of Infrastructure Code, leveraging Docker and Kubernetes (k8s), and how to use and leverage Immutable Infrastructure!
Transcript
Discussion (0)
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance!
As always with me is my co-host, Andy Grabner. Andy, do you know why?
You know, I'm not even going to ask you if you know why it's a special episode.
I'm going to go old school on you.
If you remember back in the beginning, any of our loyal listeners who've been with us since the beginning might remember we used to have trivia, right? And do you remember, nobody ever answered. We got one answer, really, ever.
After what, after a year and a half?
Yeah. Anyway, we stopped doing the trivia way back. But Andy, here's a trivia question just for you: right now my parents are in Hawaii on their 50th wedding anniversary.
What does that have to do with today's show?
Maybe it is our 50th show.
Correct.
Look at that. So what do I win now?
You win the privilege of continuing to breathe.
You wouldn't have lost it. But that's about all I can guarantee, that's about all I can give you.
All right, cool. So anyway, what else is happening?
Well, you know, we've got Perform coming up, right?
Yeah, exactly. We are busy.
You're doing HOT days, I believe, and podcasting.
Yeah, so if you want to come meet me in person, I will not give autographs.
Not that I would expect to ever do that, because that's the most ridiculous thing. But I would just think it would be hilarious. So maybe, I think, we'll probably get some joker like Brian Chandler, if he shows up, asking me for one. And yes, that's a call out to Brian Chandler.
But yeah, I'm going to be doing a couple of HOT days there, and I will be co-hosting the live podcast with the folks from PerfBytes.
And you're going to be doing some presentations as well, right?
Yeah, we're currently, and this is a shout out to all of our listeners.
So Perform is at the end of January, the last week of January, in Vegas, and it's already an amazing lineup of presenters that we have.
We're still looking for some speakers. So if you have any exciting topic around moving to the cloud, containers, microservices, AI,
IoT, mobile, UEM, then let us know.
Just send an email or go to perform.dynatrace.com and figure it out.
But please, if you have a chance, go to Vegas.
It's not only a fun town, but I think it's going to be a great week to learn.
We have one day of HOT training, so a hands-on training day, and then two days of the conference itself with a lot of breakouts.
So, yeah, it's going to be exciting.
But more exciting for me is actually the guest that we have today.
Oh, please tell us.
Yeah, because I know he's been sitting idle there and waiting, and he says, when are they finally done with their intro?
So I want to introduce Markus Heimbach.
Markus is a colleague of ours from our Linz lab in Austria.
And the reason why I brought Markus in and invited him is DevOne in June.
I think it was June, Markus, DevOne?
Hi, guys.
I think it was the first or second of June this year, yeah?
Yeah.
So DevOne is our Dynatrace conference from developers to developers,
and we open it up also to the public.
And Markus did an amazing talk about how his team has actually transformed over the last couple of years and is now doing a lot of infrastructure automation.
And the slides are up on SlideShare,
so we will post the link there.
But Markus, without further ado,
I would love to hear from you.
First of all, who you are, what you do at Dynatrace,
and kind of what has transformed in your team
within Dynatrace and the service that you provide
to the engineering team.
Thanks, Andy. Yeah, as you said, we are basically running the whole environment
where our great product Dynatrace is being developed and built.
We have basically two labs in Linz and in Gdansk, Poland, where we have a large CI running, continuous integration.
My team is just a couple of guys.
We are five team members, and we run barely 100 machines across the two labs. And meanwhile, we have about 900 machines running on average.
Most of them are virtual machines.
And of course, we're also using Docker very heavily.
So we orchestrate around 100 Docker containers a day
for our continuous integration, testing and building, and various other topics where we are leveraging the Docker environments. We are also responsible for keeping and provisioning new machines across all known operating systems
and architectures.
And just to name a few: it's Solaris SPARC, it's IBM AIX, IBM mainframe, and even Linux on IBM Z. So it's a very large variety of very interesting platforms.
And we try to make that as continuous and as automated as possible and bring the automation even to those ancient operating systems like Solaris and AIX.
So that means I don't have to write you a letter in case I need a machine, and it takes a week, or two, or a month until I get my machine.
It's a little more automated than that.
A little bit, yes.
Of course, it's depending on what platform we are running.
If you're just heading for a Linux system running on some x86 environment, you can create it on your own.
We have our own private cloud running where we have some prepared images, around 250 meanwhile, pre-baked images provided for each developer to reproduce some bugs or to test some build environment without really going into the CI and breaking the build.
And if it's a little bit more on the older systems like the Solaris environments,
we don't have the full automation for the developers as a self-service yet,
but we are working on that.
But for us, it takes roughly five minutes to ramp up a new AIX LPAR or a new Solaris LDOM.
So it's a matter of a couple of minutes instead of weeks.
Do you automate spinning up the mainframe?
Yes, that would be nice.
But I think IBM has some problems in terms of cost and licensing and that.
But it would be a nice topic.
Actually, if you think about Linux, this is even possible, as there is this cool HMC, as it's called in the IBM world,
where you have a control host running in front of the mainframe.
And there we are able to provision a zLinux LPAR also very quickly, as we are leveraging z/VM.
And it's also possible to sort of automate it, or at least to reduce the manual effort to a minimum.
Yeah, I thought I was making a stupid joke, and the joke's on me as always, Andy.
Yeah.
And Markus, so the service that your team is providing, as I understand, is obviously all the automation for developers so they can stand up environments when they need them.
For CI, what about production?
Do you also provide the same services for our production environment?
Not for our product itself.
So Dynatrace hosted in our Amazon environment is not handled within my team.
But we run a different sort of production environment.
And this is our website and all subsequent parts of that, so the blog, the documentation, and all this stuff. And this is also fully automated and is running solely within AWS.
Cool, pretty cool. So how long have you been with Dynatrace?
I don't even know how many years. I made my sixth year this year.
When? I just had six years too. When was yours?
First of July.
Okay, mine was 26th of September. Well, there we go.
Okay.
Yeah, it's been very interesting and it was a great experience to be part of this amazing team.
So how was Dynatrace six years ago in that area?
Because I assume that many of our listeners, I'm not sure how many are where you are, probably not all of them.
They're probably somewhere in the middle between what you experienced six years ago and where you are right now. So maybe you can tell us a little bit about the challenges, or the situation that was there, whether there were challenges, and how you got to the stage where you are right now. That would be very interesting.
Yeah, definitely. Honestly, six years ago I was in a different team. I was in a very small team and we provided a customer-facing service named eServices.
We provided a Java web application and we did all the deployments manually back then, so copying over the WAR file to the web server, doing the rollout of the new version, and manually rolling back
and all these kinds of things, you know.
But then I was able to switch teams, to the test automation team more specifically.
And there I had the chance to move to the infrastructure part of the test automation, and there I got my first experience with Amazon. I think this was something like 2012 where we started our first projects with AWS, and those were the first touching points with infrastructure automation and infrastructure as code.
But in our lab, it was very horrible.
So we had a lot of handcrafted and snowflake servers where barely any documentation was available.
So some guy set up an Ubuntu system in one way, or a Red Hat system in another way, and it was very snowflakey. And of course we had documentation in our wiki, but as always, the documentation is outdated, or misleading in the worst case.
And we had servers with names like "the big one" and, you name it, "the even bigger one," these types of servers. Or we had a PA-RISC machine for HP-UX, and we had a POWER server for a Linux system running on POWER, and so on.
So it was very, very snowflakey, as we say nowadays.
Yeah, I've never heard that term before.
That's kind of saying delicate.
Yeah, I mean, I've heard people being called snowflakes, right?
That's very common over here when you have the left and the right attacking each other.
But I've never heard a machine called a snowflake.
That means it's pretty delicate and can fall over at any time, I guess.
Something along those lines.
Yes, yes.
Yeah, it's basically that it's a very unique system.
Oh, unique, okay.
And it's set up manually, so there is no – even no documentation in terms of code.
So it's one of a kind, yeah.
Yeah, yeah, definitely.
Okay.
And there's a very prominent guy out there, Martin Fowler, who basically, I'm not really sure if he coined that term, but he's very keen on infrastructure automation and serverless and all the new orchestration and deployment mechanisms we have right now.
Yeah, and it was just horrible.
And, of course, we were running VMware days back.
And if you needed a new VM with a similar use case, you just took a running VM,
you went into the web UI or the rich client and said, clone me this VM. So you even had the case that you had running VMs and you made a hot clone of them.
And of course, as it was not documented what is on the system, you ended up with failing builds or failing tests, as there was some hidden configuration on the system which you didn't expect to be there, or you had some missing parts you expected to be there. So if we look back now, it was very horrible. And yeah, you know, this is the bad story we have to tell on that.
There's a good point there, right? A lot of times people might look at, well, setting up all this automation that you're going to be talking about a little bit more, right?
But that's going to take time, and we're not going to be able to get things done because we have to divert our attention from building machines to setting up and testing our automation.
But just from that story you're telling, go back in time. How much? I mean, this is more of a rhetorical question, but just think about how much time is wasted troubleshooting these hot clones that you're setting up, you know, all these machines, all these snowflakes, a new term I'm going to start using all the time now. You're setting up all these snowflakes, right, and things don't work. And you don't know then if it's because of the code or anything else you've pushed to it. Or, as you said, there's some weird configuration file, or, God forbid, something as simple as a hosts file that you, you know, spent three days looking everywhere else and didn't even check to see that there was something in the hosts file. So much wasted time.
Going back, I mean, I remember in my old testing days dealing with all that kind of stuff. When you switch it over, yeah, you're going to have to put some upfront time into automating it. But as you know, we have in the notes to talk about here even, you know, the immutable infrastructure, so that every time you spin something up, you're spinning up a fresh, clean one. And so if something doesn't work, don't even try to fix it, just start over, right?
Yeah, and it's even worse.
It's not only that you, as the person who made the clone, are responsible for finding the glitch. It's more or less, as you grow as a company, some developers are looking at why the build is failing or why the test is failing. And in the worst case, it's not only one developer, it's 10, 20 or 50 developers, because you can't be certain what was the problem.
And then all this working power is burned for something very easy to fix. Of course, you have to find it in the first place.
But in the end, it's not only the productivity
of the infrastructure team, but it's the whole company or the whole development team, actually, which is enabled to be more productive and can focus on providing new features for our products and for value for our customers.
So this is not only within the team itself, it's the whole company basically, which also is enabled to work better and more efficiently.
And also the thing is trust, right?
If you cannot trust the system because you don't know which result it produces, then, well, why use a system that you don't trust?
And then maybe then people go off and do their own thing because they don't trust you.
And then it's even worse because then you have like people that are creating something on the side that you should actually be providing
Yeah. And additionally, maybe it's also interesting: if I look back, we're roughly the same team size. We had roughly five or even six guys running the team, and we had about 40 to 50 VMs and five to 10 hosts, something like that. Or even, the ratio was a little bit more on the bare metal side.
So we wouldn't even be able to scale
in the dimensions we have right now
without really scaling the team equally.
So we would now end up in a team with, I don't know,
20 employees to just keep the environment up.
And this would be even worse in a way that you are not able
to really work on the environment because you're just firefighting.
We had, back then, just a couple of guys across several teams doing firefighting, as it was so complex to find failures or to pinpoint problems that we even had this firefighting team. And each time you had to do that, it was horrible, because you were hardly able to get anything productive done,
because you are just running from failure to failure.
And most of the times it's really hard to find the real root cause.
And then it could be some VM being moved around and then some storage IOPS.
I don't know.
There were several problems, or a VMware host was overutilized, and then there was some network congestion and so on. That was just horrible.
Hey, it's actually two questions that I have. I mean, the first thing is, are you using Dynatrace now to actually monitor the complete infrastructure? Do you have OneAgent installed?
Yes, yes, of course. We are actually running both. Especially on the more ancient systems, and by that I mean zLinux and Solaris SPARC, where we just got the OneAgent being developed. I think we are now in somewhat of a beta phase or alpha phase for the OneAgent on Solaris.
As we don't have that available there yet, we are running AppMon on these environments, Dynatrace AppMon, and using the host agent or the special agent depending on the software we are running there.
And on all new environments, we rely on our OneAgent
and the technology
and the insight the agent is providing.
Cool.
That's great.
I mean, I've been telling our Dynatrace story
for the last couple of months
at different events
and I always tell about the DevOps transformation
and that we use our own products, right?
Either eat our own dog food
or as we'd like to call it, drink our own champagne.
Of course.
Much better, yeah.
So that's great.
Now, the question that is really, I think, interesting for many listeners is,
so you were in a situation where you had these snowflakes and everything was more chaotic
and you wasted too much time in firefighting.
How did you change that and what were kind of the steps to get there?
Some advice and things you've done.
Yeah.
Yeah, basically, as I stated previously, we had some touching points with Amazon days back
where we had seen how easy it could be to provision a very vanilla instance
and create all your stuff on that vanilla machine.
We thought, how can we get that in our environment?
And something like configuration management was also being introduced and established back then.
And we took the first step with Puppet as Puppet is able to run on all our systems.
As I said in the beginning, we have a large variety of different operating systems and architectures.
So we need a tool which is able to run everywhere
where we need it to run, and Puppet is the one and only configuration management tool which is capable of running on all our operating systems. And we started with it to make some small changes on the systems, like providing some JDKs.
Of course, Dynatrace is a Java company mainly, so we also have some building and stuff around with Java.
And we used it to deploy our JDKs onto the build machines and test machines and so on.
And of course, even though it's just a small piece and a seemingly easy one,
just copy over a zip file, unzip it, place it somewhere, and you're happy.
Even that had some dramatic impacts and disruptions in our CI.
As you might think, you change a JDK.
So a build is running using a specific JDK, and now Puppet is running, replacing the JDK, causing a build failure.
Or the JDK unzip, of course, uses some IO as it has to write the files to disk.
It has some network utilization and so on, causing side effects in our tests as well. For example, we have some very specific unit tests where you have some 50-millisecond timeouts and so on, and if you are applying some configuration during that time frame, the builds are basically doomed to fail. And you don't think about those problems in the first place, because you think you just apply something, and what could possibly go wrong? And you think nothing, because it's just providing something new or something related to a very specific case.
This is one part.
So the side effects when you apply a configuration
into a sort of production environment.
The other one is that even though you install Puppet on a system, the underlying system was a snowflake.
So you change something, like putting a JDK in a specific directory, but the configuration of the build environment was assuming that the JDK is located somewhere different.
So there were some glitches and problems we had to work around,
and we learned a lot from them.
So like detecting if a build is running, or we basically force the application of the configuration very regularly. Normally, we have a configuration run every 30 minutes. And if a node is heavily utilized, we force it to run at least once a day, so that we can be sure that our configuration is being applied and we don't have a configuration drift.
This is also something very important, that you apply the configurations and the settings constantly and continuously. There's also a term, coined by Martin Fowler, for this: you create a configuration drift if you create an instance, configure it for a certain purpose, and then never check if the configuration has changed. And for sure there will be some guy connecting to the machine, making some change in the registry or in a hosts file, or creating some fancy shell script doing some fancy stuff, and you don't recognize that if you don't constantly check your environment settings on these machines.
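A minimal sketch of what such a drift guard could look like, assuming a hypothetical wrapper around the Puppet agent; the state-file path and the build-detection lock file are placeholders, not the actual Dynatrace tooling:

```python
#!/usr/bin/env python3
"""Hypothetical drift guard: force a Puppet run if the node hasn't applied
its configuration within the last 24 hours and no build is currently running."""
import subprocess
import time
from pathlib import Path

# Assumed location of Puppet's last-run summary; adjust for your agent install.
LAST_RUN_FILE = Path("/opt/puppetlabs/puppet/cache/state/last_run_summary.yaml")
MAX_AGE_SECONDS = 24 * 60 * 60  # force at least one run per day

def build_is_running() -> bool:
    """Placeholder check - e.g. look for a CI agent lock file or process."""
    return Path("/var/run/ci-build.lock").exists()

def last_run_age() -> float:
    """Seconds since Puppet last applied a catalog (infinite if never)."""
    if not LAST_RUN_FILE.exists():
        return float("inf")
    return time.time() - LAST_RUN_FILE.stat().st_mtime

if __name__ == "__main__":
    if build_is_running():
        print("Build in progress - skipping configuration run to avoid side effects.")
    elif last_run_age() > MAX_AGE_SECONDS:
        print("Configuration older than a day - forcing a Puppet run.")
        # 'puppet agent --test' applies the catalog once in the foreground.
        subprocess.run(["puppet", "agent", "--test"], check=False)
    else:
        print("Configuration is fresh enough - nothing to do.")
```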
That's pretty cool. Hey, and I remember this from your slides, where you said,
you know for us it is actually a security issue. There's an alert going off if somebody logs in to one of these machines, right?
Yes, yes.
Yeah.
So, I mean, just a thought.
Now, thinking about our product, Dynatrace, and we do automatic log analytics, wouldn't it be cool if, just by default, every time we see that somebody logs into a machine, we create an event that feeds into our AI?
Yeah, that's something you can do already.
So with the custom alerting and the log agent, you are able to do exactly that, that you have some patterns where you search for and then you can trigger an event within Dynatrace.
Perfect.
And you can then also apply this, obviously, to particular types of machines.
I'm sure you can do this tag-based,
because maybe you have some machines
in your infrastructure that are
for quote-unquote common use,
but then you really want to kind of seal
certain machines that you automate all the time.
Yeah, yeah.
Of course, we have some machines
where you have to log in
because you're running tests remotely.
For example, some very rare systems like the mainframe where you don't have that many around.
So there's something where you are sharing the system.
So the test or the build environment is remotely logging in. There you are expecting these sorts of things. But as you said, on the other hand, you can be very specific
on machines where you don't want anybody to connect to the machine.
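As a rough illustration of the idea discussed above, a small sketch that follows an SSH auth log and raises a custom event when someone logs in; the log path, events endpoint, and token are hypothetical placeholders rather than the actual Dynatrace log-agent configuration:

```python
#!/usr/bin/env python3
"""Hypothetical login watcher: follow the auth log and report interactive
logins on machines that are supposed to be fully automated."""
import json
import time
import urllib.request

AUTH_LOG = "/var/log/auth.log"                        # Debian/Ubuntu SSH log (assumption)
EVENTS_URL = "https://monitoring.example.com/events"  # placeholder endpoint
API_TOKEN = "REPLACE_ME"                              # placeholder token

def report_login(line: str) -> None:
    """Send a custom event for a detected login (payload shape is illustrative)."""
    payload = json.dumps({"title": "Manual login detected", "detail": line.strip()}).encode()
    req = urllib.request.Request(
        EVENTS_URL,
        data=payload,
        headers={"Content-Type": "application/json", "Authorization": f"Api-Token {API_TOKEN}"},
    )
    urllib.request.urlopen(req, timeout=5)

def follow(path: str):
    """Yield new lines appended to the log file, like 'tail -f'."""
    with open(path) as handle:
        handle.seek(0, 2)  # jump to the end of the file
        while True:
            line = handle.readline()
            if line:
                yield line
            else:
                time.sleep(1)

if __name__ == "__main__":
    for entry in follow(AUTH_LOG):
        if "Accepted password" in entry or "Accepted publickey" in entry:
            report_login(entry)
```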
So in the example then that you talked about when someone goes in and changes some environmental
variables or something, right, I'm sure there are times when you use your automation,
you use your configuration to spin something up. And let's say it's a new version of an OS or anything else where you find out that your old configuration doesn't work right.
So, you know, obviously it makes sense at that point maybe.
And I'm kind of asking this more, is this your process or how do you go about doing this?
Let's say it does make sense to go in there and mess around with the environments to set them properly.
Would the best practice then be figure out what the settings are supposed to be, then destroy that server, update the configuration, and then deploy it with the new settings that you figured out, you know, kind of in your experiment?
Or how do you handle the situation?
Yeah, that's right.
So that's the new approach we have been running for roughly one and a half years, or maybe two years, where we moved away from trying to keep systems up to date.
Because, as you said, we have a large team and you will have some manual configurations and you cannot track every change in the system.
And the other thing is some tests are modifying parts of the operating system or parts of the application we are testing.
So there's a trade-off.
You cannot track all changes in the environment.
And we also learned that and there is a sort of new approach,
and this is the immutable infrastructure term.
And what we are doing here is basically we are using a very awesome product
from HashiCorp named Packer, where we are creating all our VM images, or not just VM images, but virtualized environments. We can use VMware, we can use QEMU, which we're heavily using, we can even use AMIs for Amazon. And there we create very unique and very specific images for each use case we have. And if we change something, like a new JDK, or we need a new GCC, we just create a new image out of our configuration, and then we are replacing all our running instances which originate from that very specific type of image, replacing them with a new version.
And we are also going a step further.
We are not only replacing the images or the running virtual machines after a change,
but we also constantly are replacing machines without any change
just to circumvent these configuration drift problems. So in our lab, the developers know that our machines are replaced constantly.
So if they need a change to be made permanently in our lab and our environments,
they can either make the change on their own,
as we have everything running in Git,
so they can just fork it and create a pull request, we do the review, and then we basically enable our developers to maintain their own environment and their requirements on their own. And we are replacing the machines very often, even if we have no change,
just to make sure to have no configuration drift and no side effects on our build and our test environment.
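A minimal sketch of that rebuild-and-replace cycle, assuming the Packer CLI is on the PATH; the template name and the replacement step are placeholders, since the actual instance replacement depends on the virtualization platform in use:

```python
#!/usr/bin/env python3
"""Hypothetical immutable-infrastructure cycle: bake a fresh image from the
versioned configuration, then replace every instance derived from it."""
import datetime
import subprocess

def bake_image(template: str) -> str:
    """Build a new image with Packer and return a version tag for it."""
    version = datetime.date.today().strftime("%Y%m%d")
    # 'packer build' reads the template and produces a fresh image.
    subprocess.run(
        ["packer", "build", "-var", f"image_version={version}", template],
        check=True,
    )
    return version

def replace_instances(image_family: str, version: str) -> None:
    """Placeholder: terminate old instances and start new ones from the image.
    A real implementation would talk to the hypervisor or cloud API."""
    print(f"Replacing all instances of '{image_family}' with version {version} ...")

if __name__ == "__main__":
    # Rebuild even when nothing changed, so configuration drift never accumulates.
    new_version = bake_image("build-agent.json")
    replace_instances("build-agent", new_version)
```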
Great. That's pretty cool. So what I just learned, so immutable infrastructure, HashiCorp is one of the companies.
I know we use HashiCorp and we love them
Any other products in that space that others would use, that other people may know about?
Yeah, of course. We are using Puppet, as I said in the beginning, and Packer can utilize different provisioners in that area. But as we have a very strong and solid Puppet background, we are also using Puppet to configure these images. And other than that, we are of course using Ansible to roll out complex and integrated software like Red Hat OpenShift, which we are also supporting, to provide environments for our developers and for our CI. There we're using Ansible, and a lot of other products like Gradle, which we are using for our test automation internally. So we have our own CI and the automation behind it for our CI automation.
So that's something we learned as well.
We started with Puppet, having it in SVN, making changes and rolling them out.
And then you end up with a broken CI because if you change something very critical, something very deep in the system, it's very likely that you have a side effect
which you don't see in the first place.
And this is also something we learned,
and we built up our own test and built an environment
for our infrastructure code.
And as the term says, it's code managing the infrastructure.
So you can apply the same software engineering methodologies you use
for your Java development. So we have code quality rules, we have static code analysis to see whether our code has some conceptual problems and is using constructs you're not supposed to use. And we are also creating our own virtual machines where we can run our new configuration on. So we are already utilizing these techniques for our own infrastructure code as well.
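A minimal sketch of what such an infrastructure-code gate could look like in a CI job, assuming the standard `puppet parser validate` and `puppet-lint` commands are installed; the manifest directory is a placeholder:

```python
#!/usr/bin/env python3
"""Hypothetical CI gate for infrastructure code: run syntax and lint checks
over every Puppet manifest before it is allowed near a real machine."""
import subprocess
import sys
from pathlib import Path

MANIFEST_DIR = Path("puppet/manifests")  # placeholder location of the manifests

def check_manifest(manifest: Path) -> bool:
    """Return True if the manifest passes syntax validation and lint rules."""
    syntax = subprocess.run(["puppet", "parser", "validate", str(manifest)])
    lint = subprocess.run(["puppet-lint", "--fail-on-warnings", str(manifest)])
    return syntax.returncode == 0 and lint.returncode == 0

if __name__ == "__main__":
    failures = [m for m in sorted(MANIFEST_DIR.rglob("*.pp")) if not check_manifest(m)]
    if failures:
        print(f"{len(failures)} manifest(s) failed static checks:")
        for manifest in failures:
            print(f"  - {manifest}")
        sys.exit(1)  # fail the build; don't let broken configuration reach the lab
    print("All manifests passed. Next step: apply them to a throwaway test VM.")
```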
Now, you mentioned infrastructure as a service. I mean, infrastructure as code. Andy,
if you recall going back to episode 29,
yes, I did look it up earlier.
I have these all just in my head. We had
Thomas McGonigal on and we were talking about
network as code.
I guess my
question there for you, Markus, then is
have you broached into the area of network as code?
Does it apply in the environments that you're working in?
Yes, of course.
We are using it more or less in... we are also running some very business-critical services for our development cycle, like our centralized code repository and where we place all our build binaries.
These are very crucial to be running all the time, and for that we are using Kubernetes and providing these services within it. And we are basically using CoreOS and Flannel for our network layer, which is basically an overlay network, so we can hide all the inter-cluster communication from the remaining network. So this is more the part where we are using it, it's more about hiding cluster-internal traffic from the remaining network.
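For illustration only, a small sketch using the official Kubernetes Python client to confirm that such business-critical services have all replicas ready; the namespace and deployment names are hypothetical:

```python
#!/usr/bin/env python3
"""Hypothetical health check: confirm the business-critical services running
on the cluster (e.g. code repository, artifact storage) have all replicas ready."""
from kubernetes import client, config

CRITICAL_DEPLOYMENTS = {"git-server", "artifact-repository"}  # placeholder names
NAMESPACE = "infrastructure"                                  # placeholder namespace

if __name__ == "__main__":
    config.load_kube_config()  # or config.load_incluster_config() when run inside a pod
    apps = client.AppsV1Api()
    for deployment in apps.list_namespaced_deployment(NAMESPACE).items:
        name = deployment.metadata.name
        if name not in CRITICAL_DEPLOYMENTS:
            continue
        desired = deployment.spec.replicas or 0
        ready = deployment.status.ready_replicas or 0
        state = "OK" if ready >= desired else "DEGRADED"
        print(f"{state}: {name} has {ready}/{desired} replicas ready")
```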
Awesome, that's pretty cool. Hey, so now you just brought something up, Brian, even though you kind of flipped the words, but infrastructure as a service.
I look at your title right now, actually.
I'm looking at the email that you sent.
And it says team lead infrastructure and services.
So that means, obviously, I understand now what you do,
but your team as a whole, if you could sum it up again,
not only do you provide infrastructure as a service and all the services that are required to run the complete test automation that supports developers.
You also mentioned the website.
You run that and some other services.
What other services are you providing as a team to the complete R&D organization, just to sum it up?
Because it would be interesting to know what is a team like yours actually providing to the whole organization.
It's basically that we try to enable the developers to work very productively and efficiently.
So it's like a service organization within the development team. And as I said, it's about trying to provide these services, like Artifactory, where you host all your binaries which are used for local builds, where the developers are trying their code changes prior to pushing them into the CI, so that you don't generate a failing build. Or the central Git repositories and these sorts of things, some database servers where all these services are backed. And it's a very interesting, very comprehensive environment, and, how should I say it, it's demanding, but on the other hand it's very interesting and a lot of fun to work on, and to see these new tools we have right now. So if you think five years ago,
something like Docker or Kubernetes was not around the corner even.
You had some dodgy terms
like Linux LXC containers.
Even nowadays,
I'm sure not every Linux user
or experienced Linux user
knows about LXC.
But Docker is everywhere
and this enabled us a lot to provide very consistent and easy to use images to run important services.
Because you can test the very same image or the piece of software which went or goes to production prior to going into production.
And we are heavily using development and staging environments to be pretty certain that we don't break critical services.
And we keep them binary-identical to be on the safe side. So we push the same images through all three stages
and we are able to roll back very easily
as we have them prepared.
And if something goes down,
like storage goes down or a host breaks
and the disks are gone,
it's not an issue
because either there's self-healing
due to some automation within
Kubernetes or it's self-healed due to some failover mechanisms we have in place running
the services distributed across several machines.
And this is just the way we make it.
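A minimal sketch of that binary-identical promotion idea, assuming the Docker CLI and a single registry; the image name, registry, and stage tags are placeholders rather than the actual pipeline:

```python
#!/usr/bin/env python3
"""Hypothetical image promotion: push the exact same image through development,
staging and production by re-tagging it, never rebuilding it per stage."""
import subprocess

REGISTRY = "registry.example.com"  # placeholder registry
IMAGE = "website"                  # placeholder service image
STAGES = ["development", "staging", "production"]

def promote(version: str) -> None:
    """Re-tag one versioned image for every stage, so all stages run identical bits."""
    source = f"{REGISTRY}/{IMAGE}:{version}"
    subprocess.run(["docker", "pull", source], check=True)
    for stage in STAGES:
        target = f"{REGISTRY}/{IMAGE}:{stage}"
        # Tagging reuses the identical image layers, so the stages stay binary-identical.
        subprocess.run(["docker", "tag", source, target], check=True)
        subprocess.run(["docker", "push", target], check=True)
        print(f"Promoted {source} to {target}")

if __name__ == "__main__":
    promote("1.4.2")  # example version; rolling back is just re-tagging a previous version
```

In a real pipeline each stage would of course be verified before the next tag is pushed; the point of the sketch is only that the image bits never change between stages.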
Cool.
And last question that I have, and I think then we probably want to start wrapping up.
But when I talk with some of our customers and prospects at our large enterprise organizations and they have like more traditional operations, traditional teams, teams that handle infrastructure in a traditional way, they always fear that if they become a service organization and they automate a lot of things, like what
you said, your developers can just change the configuration themselves, make a pull request, and then they get it.
So basically, you automated a lot of the work away.
So have you ever had the fear that you're automating your jobs away?
No, definitely not.
I'd say it's the other way around. If I had to keep our servers like in a kindergarten, where you have to take care of each one like a child, and one is whining or has fallen down and you have to fix it... It is fun on a small scale, right?
But if you run it in a very, very large scale, you are not even able to provide your own
business and you're stuck and you can't advance.
And then in the end, your boss will say, how could you basically get your job done? And this won't work if the business of your company is growing. As an IT team or as a service team, you are responsible for enabling the company to grow. And if you can't grow, you're basically one point, or the main point, keeping the company from growing.
And so I see it the other way around.
So if you're not able to scale up, the whole company wouldn't scale.
And in the end, your boss will fire you.
And the important part also is not only about the work
it's also, especially in our time right now, that we have so many cool technologies to work on and to learn how to use. If we had to take care of each machine in our environment,
we wouldn't have the time to be innovative,
to try new things.
We would just be stuck in the dark Middle Ages
where we started.
And then your job wouldn't be fun.
And I think fun at work is important
to be innovative
and to get out of your box,
to try new things and, of course, to fail with new technologies.
No one is perfect in new technologies.
You have to play with them, to fail with them and to learn how to really use them.
So it's like with Kubernetes, we knew a year ago that we want to go with Kubernetes, but we didn't have any experience with it.
And it was a hard trial and error in the first place, especially to get Kubernetes running on-premises.
So to ramp up a Kubernetes cluster in Google Cloud or in Amazon is just that easy.
But to really let it run on-premises, you have to understand how Kubernetes is working, and that's interesting.
And if you don't have the time, you are not able to learn new stuff and to enable yourself even more in the new environments.
And what you say there with the whole job security thing, it reminds me, again, going back to an older episode.
Last year at Perform... there's another Perform plug, come on, people.
Last year at Perform, we were talking with Josh McKenty from Pivotal.
And someone had asked him about, you know, what does he think Web 3.0 is, right?
And he said, I don't care about Web 3.0.
I care about Web 4.0.
And then I think I'm paraphrasing what it was.
The general idea there was he doesn't want to have to think about the Internet anymore.
Any device, anything he's using, there should not even be a thought of, well, am I connected to the Internet?
What's my bandwidth?
It's all just part of the fabric, the fabric of life at that point. And if you think about operations in the same way, right, back when you were doing it all the manual way and everything else, people know operations exists because it's a pain point for everybody. It's you all there, all hours of the
night, all hours of the weekend, holding things up because you don't have another bare metal server
to spin up.
Maybe you've even gone to VMware, but your hypervisors are full and you need to order
another server, or you're backlogged with a million changes, all this other kind of stuff, right?
It's a huge pain point.
And hey, people know you exist and you're the only ones who can get through it.
So there's your job security kind of a way, but it's an awful setup.
Whereas the relation I'm making to this whole like Web 4.0 kind of thing is a great team is invisible.
Right. The operations team should be invisible to the organization.
People should even say, oh, we have an operations team? And the answer is like, well, who do you think is running this great thing?
It's transparent. You know, it's all that operations team that's making it happen.
Right. And it doesn't obviously happen by itself.
No matter what you have automated, the automation is more for the end users.
And yeah, you can have bits and pieces automated to make your job easier, keeping it up and running.
But that's that constant tweaking.
That's where that maintenance comes in.
That's where all that piece goes in.
And so, yeah, the goal should be disappear, right?
Absolutely.
Not to lose your job, but so people don't even know you exist, in a way, right?
And then of course, if it's running so well, you can just sit back and watch a movie during the day, or work on your new technologies.
Yeah. So basically
you are a proxy and a gatekeeper, and if the proxy does not perform or the gatekeeper does not perform, then you are basically, yeah, you're not accelerating the business.
You're just a pain point, as you said.
Yeah.
So, Andy.
Very cool.
Markus, is there anything that we missed?
I know we kind of had our talk track, we knew what we wanted to say. Are there any final words that you want to share?
Yeah, I think the important stuff is some lessons learned. Basically, the sum of what I told is: really be keen on automation.
It's just something you should really invest in.
And there's a very prominent XKCD about that.
It shows a sort of matrix: is it worth it to automate something? So if you do a task only once a year and it takes you five minutes, of course there's no need to automate it. But if it's a five-second task a day that you have to do every morning, I don't know, running a certain command to ramp up a certain service in the morning,
this is something you definitely want to automate. And this is the key in our life, to really be automating. Because if you automate it, you have to understand it, and you are implicitly documenting it, because the code itself is sort of a document, like a runbook for how you deploy an application or how you deploy a server. The code itself is the truth. Every document residing in a wiki is just doomed to be outdated, because no one will update the wiki after they have finished their story. The code itself is the truth. And I think that is my story to tell: automate, automate, automate.
Cool. All right, Brian, shall I summarize?
Let's do it. Yes, the Summarator. So for me, what I liked best, what you said: applying general development principles, software development principles, to the way we think about infrastructure.
You talked about how you have a full test automation suite that actually tests your infrastructure as code. More importantly, on top of that, what you really do as a team is provide services to other parts of the organization that are fully automated, so that you are not the bottleneck keeping them from innovation.
And because you automate so much, you actually have the time to look into your own innovation so that you make sure that the services you provide next year are up to date and allow them to keep innovating.
So I love the story, where we came from and where we are, some good references to tools that we are using,
and some kind of new terms that Brian learned.
I'm sure he will cover that in his summary.
We want to make sure that people go to devone.at.
You will find Markus' presentation, both slides and video.
We'll put the links up there as well.
And my last plug, perform.
It's the third time now.
Also join us at perform because there will be topics like this as well covered.
And that's it from my side.
Brian.
Well, yes, Snowflake.
It's so funny because I've only heard it in that horrible other way.
But yes, I love the idea of this unique server.
Another term that didn't come up in the show, but in the notes Markus sent us over, I saw in your lessons learned, it said Docker is awesome, but orchestration sucks.
K8S.
And I'm like, what the heck is K8S?
So, of course, I had to look it up.
And I guess I've been seeing these pop up before. That's an abbreviation for Kubernetes, which actually has a name, it's a numeronym, where you replace the middle letters with a number. I mean, I just think it's ridiculous, but I guess it's what all the hip kids are doing nowadays, huh? At least they don't pronounce it "Kates," that would be even more confusing; it's just for typing. But for anyone who sees words like that, you can say to somebody, hey, do you know what that is called? And they might be like, oh, I don't know. You can say it's a numeronym. And then you could feel like a smart old person for a moment, if you're old and listening.
I think one of the key things you
have, you know, we touched upon here and there is, you know, everything you talk about, whenever I
listen to these discussions, right, it's overwhelming. Because if you think about,
we're hearing you in your final stage, right? But this was, how many years did you say this
project's been going on for? I know you've been with the company for six years, but you all started this how many years ago?
I think roughly four or four and a half years, something like that.
Right.
And that's important because when you hear the finished story, right, if you're going on that journey, you start thinking like, holy cow, how do we get this all done in one bang?
Right.
And it's not.
It's making those small changes, doing a little bit at a time. Add one layer in, then add another layer in.
Because if you go for that full bang, maybe, maybe you'll get lucky.
But chances are you'll say, oh, we have to make a tweak because we got it all kind of running, but now we have to tweak it.
And you have to make a tweak somewhere right at the very beginning, which means you have to make a tweak layer after layer after layer after layer after layer.
Because that one tweak has like a huge impact on the rest of the line.
You know, start small, make small changes. And then as you go, it's going to be a lot easier to make a tiny
change to something else towards the beginning of the line that you can adjust for before you
get finished. And just even from that point of view, it's never going to be an overnight thing. So take your time. I just know I get overwhelmed even listening to these, you
know, thinking about the amount of work involved. But it pays off.
And again, try to make it so, you know, you become invisible.
My thought of it is if I were a CTO and I couldn't recognize the operations team, that meant they were doing an awesome job because they were never in the firefights, you know.
So, yeah, that's all I've got to say.
Anything else from anybody?
Markus, I think we also want to... I know you said there are other topics.
We heavily leverage the cloud, whether it's AWS or other cloud services,
and I think we should invite you back to talk about this as well
because that's very important for our listeners as well.
Yes.
I think so, yes, especially the website stuff.
I think we provided really cool solutions for our problems, and I think it's worth another podcast sometime in the future.
And if anybody has any comments or stories or ideas or anything, you can reach us. You could tweet us at either... at Grab Neander... no, did I just... Almost. @grabnerandi,
or I'm @emperorwilson.
We also have at...
What's our...
At Pure underscore DT.
Yes, at Pure underscore DT.
We'd love to hear from you,
ideas, feedback, anything.
Markus, do you do Twitter or anything? Do you have any other blogs? I know we have the SlideShare we're going to put up. Anything else you'd like to share with people?
I am more the participant, I read more tweets, so I'm not that heavily providing content.
Yeah, all I do is promote the show on ours too. Okay, and again,
we'll make it number four.
But the reason I'm going to bring up Perform one more time, right?
You're going to meet people there that have done all this stuff.
So many people are at these events, and we can almost say this could be at any event you go to.
Right.
But there are so many people who've gone through these things, and learning how to do them well really relies on getting that firsthand experience and storytelling from people who've gone through it. So, I highly encourage it. And that's the last time I will mention it for today.
All right.
So thanks everybody.
Okay, guys.
Thank you.
Bye.