PurePerformance - 050 How Infrastructure as Code and Immutable Infrastructure enabled us to scale

Episode Date: December 4, 2017

Are you still deploying machines manually? Do you have to login to machines to apply changes? Do you spend hours or even days to detect infrastructure issues messing with your test execution or even production? We have the answer for your pain: Listen to this podcast! Markus Heimbach leads the Infrastructure and Service team at Dynatrace and explains how they got rid of Snowflakes (not in the political sense), tackled the Configuration Drift issue, and how his team became a Service Organization powering the innovation at Dynatrace R&D. Get a glimpse of his talk track from his presentation at #devone.at - https://speakerdeck.com/markusheimbach/infrastructure-as-code As another teaser: you will hear about Test Automation of Infrastructure Code, leveraging Docker and Kubernetes (k8s) and how to use and leverage Immutable Infrastructure!

Transcript
Starting point is 00:00:00 It's time for Pure Performance. Get your stopwatches ready. It's time for Pure Performance. As always with me is my co-host, Andy Grabner. Andy, do you know why? You know, I'm not even going to ask you if you know why it's a special episode. I'm going to go old school on you. If you remember back in the beginning, any of our loyal listeners who've been with us since the beginning might remember we used to have trivia, right? And do you remember, nobody ever answered? We got one answer, really, ever. After... After what? After a
Starting point is 00:00:53 year and a half, yeah. Um, anyway, so we stopped doing the trivia way back. But Andy, so here's the trivia, just for you, right? So right now my parents are in Hawaii on their 50th wedding anniversary. What does that have to do with today's show? Maybe it is our 50th show. Correct. Look at that. So what do I win now? You win the privilege of continuing to breathe.
Starting point is 00:01:22 You wouldn't have lost it. You wouldn't have lost it. But that's about all I can guarantee. That's about all I can give you. Um, all right, cool. So anyway, what else is happening? What else is happening? Um, well, you know, we've got Perform coming up, right? Yeah, exactly. Yeah, we are busy. You're doing HOT days, I believe, and podcasting. Yeah, so if you want to come meet me in person, I will not give autographs. Not that I would expect to ever do that, because that's the most ridiculous thing.
Starting point is 00:01:48 But I would just think it would be hilarious. So maybe, I think, we'll probably get some joker like Brian Chandler if he shows up asking me for one. And yes, that's a call out
Starting point is 00:01:57 to Brian Chandler. But yeah, I'm going to be doing a couple of HOT days there, and I will be co-hosting the live podcast with the folks from PerfBytes. And you're going to be doing some presentations as well, right? Yeah. We're currently, and this is a shout out to all of our listeners:
Starting point is 00:02:14 So, Perform is at the end of January, the last week, in Vegas, and it's already an amazing lineup of presenters that we have. We're still looking for some speakers. So if you have any exciting topic around moving to the cloud, containers, microservices, AI, IoT, mobile, UEM, then let us know. Just send an email or go to perform.dynatrace.com and figure it out. But please, if you have a chance, come to Vegas. It's not only a fun town, but I think it's going to be a great week to learn. We have one day of HOT, so hands-on training day, and then two days of the conference itself with a lot of breakouts.
Starting point is 00:02:54 So, yeah, it's going to be exciting. But more exciting for me is actually the guest that we have today. Oh, please tell us. Yeah, because I know he's been sitting idle there and waiting, and he says: when are they finally done with their intro? So I want to introduce Markus Heimbach. Markus is a colleague of ours from our Linz lab in Austria. And the reason why I brought Markus in and invited him is, in June... I think it was June, Markus, Dev1?
Starting point is 00:03:22 Hi, guys. I think it was the first or second of June this year, yeah? Yeah. So Dev1 is our Dynatrace conference from developers to developers, and we open it up also to the public. And Markus did an amazing talk about how his team has actually transformed over the last couple of years and is now doing a lot of infrastructure automation. And the slides are up on SlideShare,
Starting point is 00:03:50 so we will post the link there. But Markus, without further ado, I would love to hear from you. First of all, who you are, what you do at Dynatrace, and kind of what has transformed in your team within Dynatrace and the service that you provide to the engineering team. Thanks, Andy. Yeah, as you said, we are basically running the whole environment
Starting point is 00:04:12 where our great product Dynatrace is being developed and built. We have basically two labs, in Linz and in Gdansk, Poland, where we have a large CI running, continuous integration. My team is just a couple of guys, we are five team members, and we run roughly 100 physical machines across the two labs. And meanwhile, we have about 900 machines running on average; most of them are virtual machines. And of course, we're also using Docker very heavily. So we orchestrate around 100 Docker containers a day for our continuous integration, testing and building, and various other topics where we are leveraging the Docker environments. We are also responsible for maintaining and provisioning new machines across all known operating systems
Starting point is 00:05:27 and architectures. And just to name a few: it's Solaris SPARC, it's IBM AIX, the IBM mainframe, and even Linux on Z. So it's a very large variety of very interesting platforms. And we try to make that as continuous and as automated as possible, and bring the automation even to those ancient operating systems like Solaris and AIX. So that means I don't have to write you a letter in case I need a machine, and it takes a week or two, or a month, until I get my machine. It's a little more automated than that. A little bit, yes. Of course, it's depending on what platform we are running.
Starting point is 00:06:14 If you're just heading for a Linux system running on some x86 environment, you can create it on your own. We have our own private cloud running where we have some prepared images, and there are around 250 of them meanwhile: pre-baked images provided for each developer to reproduce some bugs or to test some build environment without really going into the CI and breaking the build. And if it's a little bit more on the older systems, like the Solaris environments, we don't have the full automation for the developers as a self-service yet, but we are working on that. But for us, it takes roughly five minutes to ramp up a new AIX LPAR or a new Solaris LDOM.
Starting point is 00:07:09 So it's a matter of a couple of minutes instead of weeks. Do you automate spinning up the mainframe? Yes, well, that would be nice. But I think IBM has some problems in terms of cost and licensing there. But it would be a nice topic. Actually, if you think about Linux, this is even possible, as there is this cool HMC, as it's called in the IBM world, where you have a control host running in front of the mainframe. And there we are able to provision a z Linux LPAR also very quickly
Starting point is 00:07:55 as we are leveraging z/VM. And it's also possible to sort of automate it, or at least to reduce the manual effort to a minimum. Yeah, I thought I was making a stupid joke, and the joke's on me as always, Andy. Yeah. And Markus, so the service that your team is providing, as I understand, is obviously all the automation for developers so they can stand up environments when they need them. For CI. What about production?
Starting point is 00:08:22 Do you also provide the same services for our production environment? Not for our product itself. So Dynatrace, hosted in our Amazon environment, is not handled within my team. But we run a different sort of production environment, and this is our website and all subsequent parts of that: so the blog, the documentation, and all this stuff. And this is also fully automated and is running solely within AWS. Cool. Pretty cool. So how long have you been with Dynatrace? I don't even know how many years. I made my sixth year this year. When? I just had six years too. When was yours?
Starting point is 00:09:11 First of July. Okay, mine was 26th of September. Well, there we go. Okay. Yeah, it's been very interesting and it was a great experience to be part of this amazing team. So how was Dynatrace six years ago in that area? Because I assume that many of our listeners, I'm not sure how many, are where you are, probably not all of them. They're probably somewhere in the middle between what you experienced six years ago and where you are right now. So maybe you can tell us a little bit about the challenges, or the situation that was there, whether there were challenges, and how you got to the stage where you are right now. That would be
Starting point is 00:09:52 very interesting. Yeah, definitely. Honestly, six years ago I was in a different team. I was in a very small team and we provided a customer-facing service named eServices: a Java web application, and we did all the deployments manually days back. So copying over the WAR file to the web server, doing the rollout of the new version, and manually rolling back, and all these kinds of things, you know. But then I was able to switch teams, to the test automation team more specifically. And there I had the chance to switch to the infrastructure part of the
Starting point is 00:10:47 test automation, and there I got the first experience with Amazon. I think this was something like 2012 where we started our first projects with AWS, and those were the first touching points with infrastructure automation and infrastructure as code. But in our lab, it was very horrible. So we had a lot of handcrafted and snowflake servers where barely any documentation was available. So some guy did set up an Ubuntu system in one way, or a Red Hat system in another way, and it was very snowflakey. And of course we had documentation in our wiki, but as always, the documentation is outdated, or misleading in the worst case, yeah. And we had servers with names like "the big one" and, yeah, you name it,
Starting point is 00:11:50 like "the even bigger one", and these types of server names. Or we had a PA-RISC for HP-UX, and we had a Power server for a Linux system running on Power, and so on. So it was very, very snowflakey, as we say nowadays. Yeah, I've never heard that term before. That's kind of saying delicate. Yeah, I mean, I've heard people being called snowflakes, right? That's very common over here when you have the left and the right attacking each other. But I've never heard a machine called a snowflake.
Starting point is 00:12:23 That means it's pretty delicate and can fall over at any time, I guess. Something along those lines. Yes, yes. Yeah, it's basically that it's a very unique system. Oh, unique, okay. And it's set up manually, so there is no documentation, not even in terms of code. So it's one of a kind, yeah. Yeah, yeah, definitely.
Starting point is 00:12:44 Okay. And there's a very prominent guy out there, Martin Fowler, who basically... yeah, I'm not really sure if he coined that term, but he's very keen on infrastructure automation and serverless and all the new orchestration and deployment mechanisms we have right now. Yeah, and it was just horrible. And, of course, we were running VMware days back. And if you needed a new VM with a similar use case, you just took a running VM,
Starting point is 00:13:23 you went into the web UI or into the rich client and said, clone me this VM. So you even had the case that you had running VMs and you made a hot clone of them. And of course, as it was not documented what was on the system, you ended up with failing builds or failing tests, as there was some hidden configuration on the system which you didn't expect to be there, or some missing parts which you expected to be there. So, yeah, if we look back now, it was very horrible. And, yeah, so, you know, this is the bad story we have to tell on that. There's a good point there, right? A lot of times people might look at, well, setting up all this automation that you're going to be talking about a little bit more of, right? But that's going to take time, and we're not going to be able to get things done, because we have to divert our attention from building machines to setting up and testing our automation.
Starting point is 00:14:18 But just that story you're telling: you go back in time. How much? I mean, this is more of a rhetorical question, but just think about how much time is wasted troubleshooting these hot clones that you're setting up, you know, all these machines that are all these snowflakes, a new term I'm going to start using all the time now. You're setting up all these snowflakes, right, and things don't work. And you don't know then if it's because of the code or anything else you've pushed to it. Or, as you said, there's some weird configuration file, or, God forbid, something as simple as a
Starting point is 00:14:49 host file, where you, you know, spent three days looking everywhere else and didn't even check to see that there was something in the host file. All right, so much wasted time. Going back, I mean, I remember in my old testing days dealing with all that kind of stuff. When you switch it over, yeah, you're going to have to put some upfront time into automating it. But, as we have in the notes for talking about here, even, like, you know, the immutable infrastructure, so that every time you spin something up, you're spinning up a fresh, clean one. And so if something doesn't work, don't even try to fix it, just start over, right? Um, yeah. And it's even worse.
Starting point is 00:15:26 It's not only that you, as the person who made the clone, are responsible for finding the glitch. It's more or less, as you grow as a company, some developers are looking at why the build is failing or why the test is failing. And in the worst case, it's not only one developer, it's 10, 20 or 50 developers, because you can't be certain what the problem was. And then all this manpower is burned for something very easy to fix. Of course, you have to find it in the first place.
Starting point is 00:16:00 But in the end, it's not only the productivity of the infrastructure team, it's the whole company, or the whole development team actually, which is enabled to be more productive and can focus on providing new features for our products and value for our customers. So this is not only within the team itself; it's the whole company, basically, which is enabled to work better and more efficiently. And also, the thing is trust, right? If you cannot trust the system, because you don't know which result it produces, then, well, why use a system that you don't trust? And then maybe people go off and do their own thing, because they don't trust you. And then it's even worse, because then you have people creating something on the side that you should actually be providing. Yeah. And additionally, maybe it's also interesting: if I look back, we're roughly the
Starting point is 00:16:57 same team size. So we had roughly five or even six guys running the team, and we had about 40 to 50 VMs and five to 10 hosts, something like that. Or even, the ratio was a little bit more on the bare metal side. So we wouldn't even be able to scale to the dimensions we have right now without really scaling the team equally. So we would now end up in a team with, I don't know,
Starting point is 00:17:27 20 employees just to keep the environment up. And this would be even worse, in a way that you are not able to really work on the environment, because you're just firefighting. We had the term days back: we had just a couple of guys across several teams doing firefighting, as it was so complex to find failures or to pinpoint problems, that we even had this firefighting team. And each time you had to do that, it was horrible, because you hardly were able to get something productive done, because you were just running from failure to failure.
Starting point is 00:18:08 And most of the time it's really hard to find the real root cause. And then it could be some VM being moved around, and then some storage IOPS, I don't know. There were several problems, or a VMware host was overutilized, and then there was some network congestion, and so on. That was just horrible. Hey, it's actually two questions that I have. I mean, the first thing is: are you using Dynatrace now to actually monitor the complete infrastructure? Do you have OneAgent installed? Yes, yes, of course. We are actually running both, especially on the more ancient systems, and that means z Linux and Solaris SPARC, where we just got
Starting point is 00:18:53 the OneAgent being developed. I think we are now in somewhat of a beta phase, or alpha phase, for the OneAgent on Solaris. As we don't have them already available, we are running AppMon on these environments, Dynatrace AppMon, and using the host agent or the special agent, depending on the software we are running there. And on all new environments, we rely on our OneAgent and the technology and the insight the agent is providing.
Starting point is 00:19:30 Cool. That's great. I mean, I've been telling our Dynatrace story for the last couple of months at different events, and I always talk about the DevOps transformation and that we use our own products, right? Either eating our own dog food,
Starting point is 00:19:44 or, as we'd like to call it, drinking our own champagne. Of course. Much better, yeah. So that's great. Now, the question that is really, I think, interesting for many listeners is: so you were in a situation where you had these snowflakes and everything was more chaotic and you wasted too much time in firefighting. How did you change that, and what were kind of the steps to get there?
Starting point is 00:20:07 Some advice and things you've done. Yeah. Yeah, basically, as I stated previously, we had some touching points with Amazon days back, where we had seen how easy it could be to provision a very vanilla instance and create all your stuff on that vanilla machine. We thought: how can we get that in our environment? And something like configuration management was also being introduced and founded days back.
Starting point is 00:20:47 And we took the first step with Puppet, as Puppet is able to run on all our systems. As I said in the beginning, we have a large variety of different operating systems and architectures, so we need a tool which is able to run everywhere we need it to run. And Puppet is the one and only configuration management tool which is capable of running on all our operating systems. And we started with it to make some small changes on the systems, like providing some JDKs. Of course, Dynatrace is mainly a Java company, so we also have some building and stuff around Java. And we used it to deploy our JDKs onto the build machines and test machines and so on.
Starting point is 00:21:39 And of course, even though it's just a small piece and a seemingly easy one, just copy over a zip file, unzip it, place it somewhere, and you're happy, even that had some dramatic impacts and disruptions in our CI. As you might think: you change a JDK. So a build is running using a specific JDK, and now Puppet is running, replacing the JDK, causing a build failure. Or the JDK unzip, of course, uses some IO, as it has to write the files to disk. It has some network utilization and so on, causing also side effects to our tests. For example, we have some very specific unit tests where you have some 50-millisecond timeouts, and if you are applying some configurations
Starting point is 00:22:35 during that time frame, the builds are basically doomed to fail. And those are problems you don't think of in the first place, because you think you just apply something, and what could possibly go wrong? And you think nothing, because it's just providing something new on a very specific case. This is one part: the side effects when you apply a configuration to a sort of production environment.
Starting point is 00:23:07 The other one is that even though you install Puppet on a system, the underlying system was a snowflake. So you change something, like putting a JDK in a specific directory, but the configuration of the build environment was assuming that the JDK is somewhere different. So there were some glitches and problems we had to work around, and we learned a lot from them. So, like, detecting if a build is running; or we basically force the applying of a configuration very regularly.
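The guard just described (skip a configuration run while a build is active, but never let a node go unconfigured for too long) can be sketched roughly like this; the function names and thresholds are illustrative, not Dynatrace's actual tooling:

```python
# Hypothetical sketch: only apply configuration when no build is running,
# but force an apply after a full day so that drift cannot accumulate.

APPLY_INTERVAL = 30 * 60      # normal cadence: a configuration run every 30 minutes
FORCE_AFTER = 24 * 60 * 60    # force an apply at least once a day, busy or not

def should_apply(now, last_applied, build_is_running):
    """Decide whether a configuration run may proceed on this node."""
    overdue = now - last_applied >= FORCE_AFTER
    due = now - last_applied >= APPLY_INTERVAL
    # Skip busy nodes unless the apply is overdue; never let a node go
    # more than a day without a configuration run.
    return overdue or (due and not build_is_running)
```

The point of the forced daily run is that even a permanently busy build node eventually gets reconverged, bounding how long any drift can survive.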
Starting point is 00:23:48 Normally, we have a configuration run every 30 minutes. And if a node is heavily utilized, we force it to run at least once a day, so that we can be sure that our configuration is being applied and we don't have a configuration drift. This is also something very important: that you run the configurations and the settings constantly and continuously. There's also a term, coined by Martin Fowler, that you create a configuration
Starting point is 00:24:24 drift: you create an instance, you configure it for a certain purpose, and then you never check if the configuration has changed. And for sure there will be some guy connecting to the machine, making some change in the registry or in a host file, or creating some fancy shell script doing some fancy stuff, and you don't recognize that if you don't constantly check your environment settings on these machines. That's pretty cool. Hey, and I remember this from your slides, where you said, you know, for us it is actually a security issue. There's an alert going off if somebody logs in to one of these machines, right? Yes, yes.
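The login alerting just mentioned can be illustrated with a small sketch that scans authentication-log lines for a login pattern and raises an event per match. The pattern, the log format, and the emit callback are all hypothetical stand-ins, not the Dynatrace log agent's actual API:

```python
import re

# Illustrative only: the regex and the emit callback stand in for the log
# agent's pattern matching and event-triggering configuration.
LOGIN_PATTERN = re.compile(r"Accepted (?:password|publickey) for (\S+)")

def scan_for_logins(log_lines, emit):
    """Call emit(user, line) for every interactive login found; return count."""
    hits = 0
    for line in log_lines:
        match = LOGIN_PATTERN.search(line)
        if match:
            emit(match.group(1), line)  # raise an event for this login
            hits += 1
    return hits
```

On a machine that is supposed to be touched only by automation, any hit from such a pattern is by definition suspicious, which is why it doubles as a security signal.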
Starting point is 00:25:10 Yeah. So, I mean, just a thought. Now, thinking about our product, Dynatrace, and we do automatic log analytics: wouldn't it be cool if, just by default, every time we see that somebody logs into a machine, we create an event that feeds into our AI? Yeah, that's something you can do already. So with the custom alerting and the log agent, you are able to do exactly that: you have some patterns that you search for, and then you can trigger an event within Dynatrace. Perfect. And you can then also apply this, obviously, to particular types of machines. I'm sure you can do this tag-based,
Starting point is 00:25:49 because maybe you have some machines in your infrastructure that are for quote-unquote common use, but then you really want to kind of seal certain machines that you automate all the time. Yeah, yeah. Of course, we have some machines where you have to log in
Starting point is 00:26:03 because you're running tests remotely. For example, some very rare systems, like the mainframe, where you don't have that many around. So there's something where you are sharing the system, so the test or the build environment is remotely logging in; there you are expecting these sorts of things. But as you said, on the other hand, you can be very specific on machines where you don't want anybody to connect. So in the example then that you talked about, when someone goes in and changes some environment variables or something, right, I'm sure there are times when you use your automation,
Starting point is 00:26:41 you use your configuration to spin something up. And let's say it's a new version of an OS, or anything else where you find out that your old configuration doesn't work right. So, you know, obviously it makes sense at that point maybe. And I'm kind of asking this more as: is this your process, or how do you go about doing this? Let's say it does make sense to go in there and mess around with the environments to set them properly. Would the best practice then be: figure out what the settings are supposed to be, then destroy that server, update the configuration, and then deploy it with the new settings that you figured out, you know, kind of in your experiment? Or how do you handle the situation? Yeah, that's right. So that's the new approach we have been running for roughly one and a half, or maybe two, years: we moved away from trying to keep systems up to date.
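That destroy-update-redeploy loop is the core of immutable infrastructure: never patch an instance in place, rebuild it from a new image version instead. A rough sketch of the idea, with made-up instance and provisioning names:

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative model: an instance is fully described by its name and the
# image version it was built from; nothing on it is ever patched in place.

@dataclass(frozen=True)
class Instance:
    name: str
    image_version: int

def roll_fleet(fleet: List[Instance], new_version: int,
               provision: Callable[[str, int], Instance],
               terminate: Callable[[Instance], None]) -> List[Instance]:
    """Replace each instance one at a time so capacity never drops."""
    rolled = []
    for old in fleet:
        fresh = provision(old.name, new_version)  # boot the replacement first
        terminate(old)                            # then retire the old instance
        rolled.append(fresh)
    return rolled
```

Because replacement is cheap and routine, the same roll can also be run with an unchanged image version, which is exactly the drift-prevention trick described later in the conversation.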
Starting point is 00:27:35 Because, as you said, we have a large team, and you will have some manual configurations, and you cannot track every change in the system. And the other thing is, some tests are modifying parts of the operating system or parts of the application we are testing. So there's a trade-off; you cannot track all changes in the environment. And we also learned that, and there is a sort of new approach, and this is the immutable infrastructure term. And what we are doing here is basically: we are using a very awesome product from HashiCorp named Packer, where we are creating all our VMware images,
Starting point is 00:28:23 or not only VMware, but virtualized environments in general. We can use VMware, we can use QEMU, which we're heavily using, we can even use AMIs for Amazon. And there we create very unique and very specific images for each use case we have. And if we change something, like a new JDK, or we need a new GCC, we just create a new image out of our configuration, and then we are replacing all our running instances which originate from that very specific type of image with the new version of it. And we are also going a step further. We are not only replacing the images or the running virtual machines after a change,
Starting point is 00:29:16 but we are also constantly replacing machines without any change, just to circumvent these configuration drift problems. So in our lab, the developers know that our machines are replaced constantly. So if they need a change to be made permanently in our lab and our environments, they can make the change on their own, as we have everything running in Git: they can just fork it and create a pull request, we're doing the review, and then we enable our developers basically to maintain their own environment and their requirements on their own. And we are,
Starting point is 00:29:59 yeah, replacing them very often, even if we have no change, just to make sure to have no configuration drift and no side effects on our build and our test environment. Great. That's pretty cool. So what I just learned: immutable infrastructure, and HashiCorp is one of the companies; I know we use HashiCorp and we love them. Any other products in that space that other people may know about? Yeah, of course. We are
Starting point is 00:30:34 using Puppet, as I said in the beginning, and Packer can utilize different provisioners in that area. But as we have a very strong and solid Puppet background, we are also using Puppet to configure these images. And other than that, we are of course using Ansible to roll out complex and integrated software like Red Hat OpenShift, which we are also supporting, and to provide environments for our developers and
Starting point is 00:31:06 for our CI. There we're using Ansible and, yeah, a lot of other products like Gradle, which we are using for our test automation internally. So we have our own CI, and the automation behind it, for our CI automation. So that's something we learned as well. We started with Puppet, having it in SVN, and making changes, rolling them out. And then you end up with a broken CI, because if you change something very critical, something very deep in the system, it's very likely that you have a side effect which you don't see in the first place. And this is also something we learned,
Starting point is 00:31:50 and we built up our own test and build environment for our infrastructure code. And as the term says, it's code managing the infrastructure, so you can apply the same software engineering methodologies you use for your Java development. So we have code quality rules, we have static code analysis to see if our code has some conceptual problems or is using constructs you're not supposed to use. And we are also creating our own virtual machines which we can run our new configuration on. So we are already applying these practices to our own infrastructure code as well.
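Treating infrastructure code like application code means it gets tests too. One property always worth testing in configuration code is idempotence: applying the same configuration a second time must change nothing. A toy illustration, where the ensure_jdk function is a hypothetical stand-in for a real configuration resource, not actual Puppet:

```python
# Toy convergence function: system state is modeled as a dict of
# installed-path -> version, and the function only reports a change
# when it actually had to do something.

def ensure_jdk(state, version, path="/opt/jdk"):
    """Converge: make sure the given JDK version is present at `path`."""
    if state.get(path) == version:
        return state, False                 # already converged, no change
    return dict(state, **{path: version}), True

def assert_idempotent(config_fn, state):
    """Fail if a second apply of the same configuration reports changes."""
    once, _ = config_fn(state)
    twice, changed = config_fn(once)
    assert not changed and once == twice, "configuration is not idempotent"
    return once
```

A check like this catches the class of resource that "works" on first apply but keeps rewriting files on every run, which is exactly what caused the build disruptions described earlier.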
Starting point is 00:32:45 Now, you mentioned infrastructure as a service. I mean, infrastructure as code. Andy, if you recall going back to episode 29 (yes, I did look it up earlier; I don't have these all just in my head), we had Thomas McGonigal on and we were talking about network as code. I guess my
Starting point is 00:33:01 question there for you, Markus, then is: have you broached the area of network as code? Does it apply in the environments that you're working in? Yes, of course. We are using it more or less in... well, we are also running some very business-critical services for our development cycle, like our centralized code repository and where we place all our build binaries. And these are very crucial to be running all the time, and for that we are using Kubernetes and providing these services within it. And, yeah, we are basically using CoreOS and Flannel for our network layer, which is
Starting point is 00:33:54 basically an overlay network, so we can hide all the cluster-internal communication from the remaining network. So this is more the part where we are using it: it's more about hiding cluster-internal traffic from the remaining network. Awesome, that's pretty cool. Hey, so now you just brought something up, Brian, even though you kind of flipped the words: infrastructure as a service. I look at your title right now, actually. I'm looking at the email that you sent. And it says team lead infrastructure and services.
Starting point is 00:34:35 So that means, obviously, I understand now what you do, but your team as a whole, if you could sum it up again: not only do you provide infrastructure as a service and all the services that are required to run the complete test automation that supports developers, you also mentioned the website. You run that and some other services. What other services are you providing as a team to the complete R&D organization, just to sum it up? Because it would be interesting to know what a team like yours is actually providing to the whole organization. It's basically that we try to enable the developers to work very productively and efficiently.
Starting point is 00:35:28 So it's like a service organization within the development team. And as I said, it's about trying to provide these services, like Artifactory, where you host all your binaries, which are used for local builds, where the developers try their own code changes prior to pushing them into the CI, so that you don't generate a failing build. Or, yeah, the central Git repositories and these sorts of things, some database servers where all these services are backed. And it's a very interesting, very comprehensive environment. And, how should I say it, it's demanding, but on the other hand it's very interesting and a lot of fun to work on, and to see these new tools we have right now. So if you think five years ago, something like Docker or Kubernetes was not even around the corner.
Starting point is 00:36:29 You had some dodgy things like Linux LXC containers. Even nowadays, I'm sure not every Linux user, or even experienced Linux user, knows about LXC. But Docker is everywhere, and this enabled us a lot to provide very consistent and easy-to-use images to run important services.
Starting point is 00:36:56 Because you can test the very same image, the piece of software that goes to production, prior to it going into production. And we are heavily using development and staging environments to be pretty certain that we don't break critical services. And we use them binary-identical to be on the safe side. So we push the same images through all three stages, and we are able to roll back very easily, as we have them prepared. And if something goes down, like storage goes down or a host breaks and the disks are gone,
Starting point is 00:37:38 it's no issue, because either it's self-healing due to some automation within Kubernetes, or it's self-healed due to some failover mechanisms we have in place, running the services distributed across several machines. And this is just the way we make it. Cool. And last question that I have, and I think then we probably want to start wrapping up.
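The podcast doesn't name the exact promotion tooling, so here is a minimal, hypothetical Python sketch of the "binary identical through all three stages" idea: the same image digest is pushed through dev, staging, and prod, and rollback just means returning to a previously deployed digest (the `Registry` class and stage names are illustrative, not Dynatrace's actual setup):

```python
# Sketch: promote the *same* image (identified by its content digest) through
# dev -> staging -> prod, instead of rebuilding per stage. Rebuilding per stage
# would risk a binary difference between what was tested and what runs in prod.

STAGES = ["dev", "staging", "prod"]

class Registry:
    """Toy stand-in for a container registry: stage -> history of digests."""
    def __init__(self):
        self.deployed = {stage: [] for stage in STAGES}

    def release(self, digest: str) -> None:
        """Push one build through every stage; each stage gets the identical digest."""
        for stage in STAGES:
            self.deployed[stage].append(digest)

    def current(self, stage: str) -> str:
        return self.deployed[stage][-1]

    def rollback(self, stage: str) -> str:
        """Previous digests are kept, so rolling back is just popping the head."""
        self.deployed[stage].pop()
        return self.current(stage)

reg = Registry()
reg.release("sha256:aaa")   # first release
reg.release("sha256:bbb")   # second release

# Binary-identical guarantee: every stage currently runs the same digest.
assert len({reg.current(s) for s in STAGES}) == 1

print(reg.rollback("prod"))  # -> sha256:aaa
```

Against a real registry, the equivalent is to promote by `docker pull`/`docker tag`/`docker push` of the same `sha256:` digest rather than rebuilding the image per environment.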
Starting point is 00:38:05 But when I talk with some of our customers and prospects at large enterprise organizations, and they have more traditional operations, traditional teams, teams that handle infrastructure in a traditional way, they always fear that if they become a service organization and they automate a lot of things, like what you said, your developers can just change the configuration, make a pull request, and get what they need. So basically, you automated a lot of the work away. So have you ever had the fear that you're automating your jobs away? No, definitely not. I'd say it's the other way around. If I had to keep our servers like in a kindergarten, where you have to take care of each one like a child, and
Starting point is 00:38:56 one is whining or has fallen down and you have to fix it, that is fun on a small scale, right? But if you run it at a very, very large scale, you are not even able to support your own business; you're stuck and you can't advance. And then in the end, your boss will say: how could you basically get your job done? And this won't work if the business of your company is growing, because as an IT team or as a service team, you are responsible for enabling the company to grow. And if you can't grow, you're basically one point, or the main point, of the company not growing. And so I see it the other way around.
Starting point is 00:39:54 So if you're not able to scale up, the whole company wouldn't scale. And in the end, your boss will fire you. And the important part is also not only about the work; especially right now, we have so many cool technologies to work on and to learn how to use. If we had to take care of each machine in our environment, we wouldn't have the time to be innovative, to try new things. We would just be stuck in the dark Middle Ages
Starting point is 00:40:31 where we started. And then your job wouldn't be fun. And I think fun at work is important, to be innovative and to get out of your box, to try new things and, of course, to fail with new technologies. No one is perfect with new technologies. You have to play with them, to fail with them, and to learn how to really use them.
Starting point is 00:40:57 So it's like with Kubernetes: we knew a year ago that we wanted to go with Kubernetes, but we didn't have any experience with it. And it was hard trial and error in the first place, especially to get Kubernetes running on-premises. To ramp up a Kubernetes cluster in Google Cloud or in Amazon is just that easy, but to really let it run on-premises, you have to understand how Kubernetes is working, and that's interesting. And if you don't have the time, you are not able to learn new stuff that enables you to do even more in the new environments. And what you say there with the whole job security thing, it reminds me, again, going back to an older episode. Last year at Perform — there's another Perform plug, come on, people.
Starting point is 00:41:55 Last year at Perform, we were talking with Josh McKenty from Pivotal. And someone had asked him about, you know, what does he think Web 3.0 is, right? And he said, I don't care about Web 3.0. I care about Web 4.0. And then I think I'm paraphrasing what it was. The general idea there was he doesn't want to have to think about the Internet anymore. Any device, anything he's using, there should not even be a thought of, well, am I connected to the Internet? What's my bandwidth?
Starting point is 00:42:22 It's all just part of the fabric, the fabric of life at that point. And if you think about operations in the same way, back when you were doing it all the manual way and everything else, people know operations exists because it's a pain point for everybody. It's you, there all hours of the night, all hours of the weekend, holding things up because you don't have another bare-metal server to spin up. Maybe you've even gone to VMware, but your hypervisors are full and you need to order another server, or you're backlogged with a million changes, all this other kind of stuff, right? It's a huge pain point.
Starting point is 00:42:54 And hey, people know you exist, and you're the only ones who can get through it. So there's your job security, in a way, but it's an awful setup. Whereas the relation I'm making to this whole Web 4.0 kind of thing is: a great team is invisible. Right. The operations team should be invisible to the organization. People should even say, oh, we have operations? And the answer is like, well, who do you think is running this great thing? That's transparent. You know, it's all that operations team that's making it happen. Right. And it doesn't obviously happen by itself. No matter what you have automated, the automation is more for the end users.
Starting point is 00:43:29 And yeah, you can have bits and pieces automated to make your job easier, keeping it up and running. But that's that constant tweaking; that's where that maintenance comes in; that's where all that piece goes in. And so, yeah, the goal should be to disappear, right? Absolutely. Not to lose your job, but to be so good that people don't even know you exist, in a way. Right. And then of course, if it's running so well, you can just sit back
Starting point is 00:43:50 and watch a movie during the day. Yeah, or work on your technologies. Yeah, so basically you are a proxy and a gatekeeper, and if the proxy does not perform or the gatekeeper does not perform, then you're not accelerating the business. You're just a pain point, as you said. Yeah. So, Andy. Very cool. Markus, is there anything, any kind of thing, that we missed? I know we kind of had our talk track, we knew what we
Starting point is 00:44:25 wanted to say. Is there anything, any final words, that you want to say? Yeah, I think the important stuff is some lessons learned. Basically, some of what I told is: really be keen on automation. It's just something you should really invest in. And there's a very prominent XKCD about that, showing a sort of matrix: is it worth automating something? So if it's a task you do only once a year and it takes you five minutes.
Starting point is 00:45:05 Of course, there's no need to automate it. But if it's a five-second task a day that you have to do in the morning — I don't know, run a certain command to ramp up a certain service — this is something you definitely want to automate. And this is the key in our life: to really be automating, because if you automate it, you have to understand it, and you are implicitly documenting it, because the code itself is sort of a document, like a runbook for how you deploy an application or how you deploy a server. The code itself is the truth. Every document residing in a wiki is just doomed to be outdated, because no one will update the wiki after he has finished his story. The code itself is the truth. Yeah, and I think that
Starting point is 00:46:07 that is my story to tell: automate, automate, automate. Cool. All right, Brian, shall I summarize? Let's do it. Yes, the summarizer. So for me, what I liked best, what you said: applying general development principles, software development principles, to the way we think about infrastructure. You talked about how you have a full test automation suite that actually tests your infrastructure as code. More importantly, on top of that, what you really do as a team: you are providing services to other parts of the organization that are fully automated, so that you are not the bottleneck keeping them from innovation. And because you automate so much, you actually have the time to look into your own innovation, so that you make sure that the services you provide next year are up to date and allow them to keep innovating. So I love the story: great where we came from and where we are, some good references to tools that we are using, and some new terms that Brian learned. I'm sure he will cover that in his summary. We want to make sure that people go to devone.at.
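The chart Markus refers to is XKCD 1205, "Is It Worth the Time?", which tabulates how long you can spend automating a task before it stops paying off over a five-year horizon. The underlying arithmetic can be sketched in a few lines (the helper function and example numbers are illustrative, not from the episode):

```python
def worth_automating(task_seconds: float, times_per_day: float,
                     automation_hours: float, horizon_days: float = 5 * 365) -> bool:
    """True if automating saves more time over the horizon than it costs to build.
    Mirrors the back-of-the-envelope logic of XKCD 1205 ("Is It Worth the Time?")."""
    saved_seconds = task_seconds * times_per_day * horizon_days
    return saved_seconds > automation_hours * 3600

# A 5-second task done daily adds up to ~2.5 hours over 5 years,
# so automating pays off if the automation itself takes less than that.
print(worth_automating(task_seconds=5, times_per_day=1, automation_hours=1))        # -> True
# A 5-minute task done once a year is only ~25 minutes over 5 years.
print(worth_automating(task_seconds=300, times_per_day=1/365, automation_hours=8))  # -> False
```

This matches the two examples in the conversation: the five-second daily task is worth automating, the five-minute yearly task usually is not.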
Starting point is 00:47:25 You will find Markus' presentation there, both slides and video. We'll put the links up as well. And my last plug: Perform. It's the third time now. Also join us at Perform, because topics like this will be covered there as well. And that's it from my side. Brian. Well, yes, Snowflake.
Starting point is 00:47:45 It's so funny, because I've only heard it in that horrible other way. But yes, I love the idea of this unique server. Another term that didn't come up in the show, but in the notes Markus sent us over, I saw in your lessons learned, it said: Docker is awesome, but orchestration sucks — K8s. And I'm like, what the heck is K8s? So, of course, I had to look it up. And I guess I've been seeing these pop up before; it's an abbreviation for Kubernetes, which actually has a name: it's a numeronym, where you replace the middle letters with a number. I mean, I just think it's ridiculous, but I guess it's what all the hip kids are doing nowadays, huh? At least they don't call
Starting point is 00:48:23 it "K-eight-S" out loud; that would be even more confusing. It's just for typing. But for anyone who sees words with numbers in them like that, you can say to somebody, hey, do you know what that is called? And they might be like, oh, I don't know. You can say it's a numeronym. And then you could feel like a smart old person for a moment, if you're old and listening. I think one of the key things we touched upon here and there is: everything you talk about, whenever I listen to these discussions, it's overwhelming. Because we're hearing you in your final stage, right? But this was, how many years did you say this project's been going on for? I know you've been with the company for six years, but you all started this how many years ago?
Starting point is 00:49:07 I think roughly four or four and a half years, something like that. Right. And that's important because when you hear the finished story, right, if you're going on that journey, you start thinking like, holy cow, how do we get this all done in one bang? Right. And it's not. It's making those small changes, doing a little bit at a time. Add one layer in, then add another layer in. Because if you go for that full bang, maybe, maybe you'll get lucky. But chances are you'll say, oh, we have to make a tweak because we got it all kind of running, but now we have to tweak it.
Starting point is 00:49:35 And you have to make a tweak somewhere right at the very beginning, which means you have to make a tweak layer after layer after layer after layer, because that one tweak has a huge impact on the rest of the line. You know, start small, make small changes. And then as you go, it's going to be a lot easier to make a tiny change to something toward the beginning of the line that you can adjust for before you get finished. And even from that point of view, it's never going to be an overnight thing, so take your time. I just know I get overwhelmed even listening to these, you know, thinking about the amount of work involved. But it pays off. And again, try to make it so that, you know, you become invisible.
Starting point is 00:50:08 My thought of it is: if I were a CTO and I couldn't recognize the operations team, that meant they were doing an awesome job, because they were never in the firefights, you know. So, yeah, that's all I've got to say. Anything else from anybody? Markus, I think we also want to — I know you said there are other topics. We heavily leverage the cloud, whether it's AWS or other cloud services, and I think we should invite you back to talk about this as well, because that's very important for our listeners as well. Yes.
Starting point is 00:50:41 I think so, yes, especially the website stuff. I think we provided really cool solutions for our problems, and I think it's worth another podcast sometime in the future. And if anybody has any comments or stories or ideas or anything, you can reach us; you can tweet us at either @grabnerander— no, did I just... I just... Almost. @grabnerandi, or I'm @emperorwilson. We also have at...
Starting point is 00:51:14 What's our... At Pure underscore DT. Yes, at Pure underscore DT. We'd love to hear from you, ideas, feedback, anything. Marcus, do you do Twitter or anything where you... Do you have any other blogs? I know we have have the slideshare we're going to put up anything else you'd like to share with people um it i am more the the participant and and read more tweets so i'm not
Starting point is 00:51:35 that heavily providing content yeah all i do is promote the show on ours too so um okay and again we'll make it number four. But the reason I'm going to bring up reform one more time, right? You're going to meet people there that have done all this stuff. So if like so many people are at these events and this we can almost say this could be at any event you go to. Right. But there's so many people who've gone through these things and learning how to do them well, really relies on getting that firsthand experience and storytelling from, from people who've gone through it.
Starting point is 00:52:08 So, um, highly encourage it. And that's the last time I will mention it for today. All right. So thanks everybody. Okay, guys.
Starting point is 00:52:18 Thank you. Bye.
