PurePerformance - 054 Moving to Continuous AWS-based Enterprise Web Hosting – Lessons Learned
Episode Date: January 29, 2018
Markus Heimbach, Team Lead of the Infrastructure and Service Team at Dynatrace, explains how the continuous delivery process of www.dynatrace.com really works behind the scenes. Two years ago the website team used a traditional CMS (Content Management System) which was slow, error prone, and didn't deliver the expected end user experience for visitors of our website. Two years later Markus and his team have built a fully automated "Content Delivery Pipeline". The team decided to leverage Git, statically generated web content, immutable infrastructure, and Dynatrace OneAgent monitoring. Production deployments happen twice a day, but staging and development deployments – using the same deployment pipelines – happen much more frequently. The result is a very flexible delivery pipeline, fully version-controlled content, a very secure and fast website, and everything monitored with Dynatrace. Thanks Markus for letting us look behind the scenes of www.dynatrace.com.
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello, everybody. Before we get into today's episode, I wanted to preempt it with a public service announcement. Just to let you all know, we had quite some challenges with the audio when we recorded this episode.
So please forgive us. I've done my best to clean it up and make it as listenable as possible.
Hope you enjoy the show.
Hello,
everybody, and welcome to yet another fun-filled episode of Pure Performance. My name is Brian
Wilson, and as always, sitting virtually next to me is my sidekick, Andy Grabner. Andy,
how are you doing today? I'm actually pretty good over in Austria right now.
Wow, that's great. And you've been doing a lot of traveling lately, haven't you? But before we get
to your traveling, you being in Austria, you know, we've been talking about people being on the show two times a lot lately. And today is our quickest two-timer turnaround ever, isn't it? Right? Because we just recorded an episode with this person about a month or two ago, and now they're back for another episode.
So that's basically the shortest MTTT, Mean Time to Two Timer.
Yes, exactly.
And now you've been traveling a lot, right?
You've been at a lot of really cool conferences lately, haven't you, Andy?
That's true.
Last week I actually spent in Scotland, in Edinburgh, at a castle with our friends from
the United States.
Yeah, I saw you posted a picture of that.
That was like a real castle, wasn't it?
I bet the cell phone reception was amazing.
Exactly. It says, no service.
But actually,
let's go ahead. I really want to
reveal the secret about who is the fastest
MTTT guy.
Actually, the guy in this case is Markus.
Welcome back to the show.
Thanks for letting me join you again.
Thank you for joining.
And I'm not sure if people remember what we talked about the first time, but we had you
on the show a while ago where we talked about infrastructure as code.
Yes.
That was pretty cool.
That is true.
And today we want to talk about something different.
What is it?
It is how we as a company are hosting our own website.
And as we are doing it, in my eyes, at a very sophisticated level,
we wanted to share that with the audience and to highlight
what we basically did to make our enterprise website
very stable, very robust, and of course, very fast.
Yes, we are a performance company.
So we are very cautious about providing a fast website
and providing a good experience to our website visitors.
So if I can interject here,
a while back we had on both Bernd and Anita
discussing how we transform from a six-month release to one-hour code deployments.
And obviously now we're running in AWS.
So today's talk is kind of like, if you want to think of it this way, this is us zooming in on the infrastructure and AWS part and what we achieve there.
Is that a fair assessment?
Not really. With Bernd and Anita, I think we were talking about the product we are hosting,
and now we are actually talking about our own website.
So basically, if you hit www.dynatrace.com, that's what you get when you enter our website,
and that's what we're going to talk about.
Both are running on AWS, actually, but on a very different scheme.
And what we did in our website is what we want to share and provide today.
All right.
But it's really good, Brian, that you brought it up, actually playing somebody who would assume that when we talk about Dynatrace,
we obviously talk about our monitoring products. But this time it's actually
a website, which is probably a similar project to what many companies have out there.
They have a website, and they use different ways of hosting it. But I mean, you gave
us a nice write-up on what you want to talk about, and I think it starts about two years ago,
Christmas 2015.
Yeah, that's when the project started, actually. We got in touch
with AWS in the years prior to 2015 with some smaller projects, and we learned about all the
functionality of AWS, and we wanted to leverage this for our website as well. We were running on a pretty large CMS back in those days.
And we started thinking about how we can basically get our very clumsy CMS thingy into the cloud and get it fast.
And we had some key use cases in mind when we started this project.
And I used some spare minutes of my Christmas vacation to start coding on that.
Yeah, we basically wanted to use the core features of AWS.
We wanted the site to be resilient, with a multi-zone and multi-region deployment.
So in the rare case that, for example, Amazon US East goes down, I don't want our website to be down.
So that's the reason why we went for multi-zone and multi-region deployments in the first place.
It should be, of course, automatically scaling in terms of load.
And it should be, of course, fully CI-
driven, so I don't have to touch any manual step to get our website out. These
were the core ideas behind our website deployments. And
definitely we wanted to have different stages, so we wanted to have a
development and staging environment where we can actually see if and how our website is working, and if there are problems around.
And of course, we wanted to have everything automated.
As I laid out in the first podcast, actually, we are big promoters of automating everything you can.
And of course, we followed that paradigm as well in our website deployment.
That's pretty cool.
And so if I hear this right, in the old way, we had a CMS system
and it meant going through multiple hoops to actually get a change
into different environments.
And then I remember these days, right?
Sometimes it's like it felt like forever to get changes out there.
And so now, everything
fully automated, how often do we, I know
I may jump ahead a little bit, but how often do
we deploy now?
We nailed it down to one to
two deployments a day because we have
everything in Git and the
merging process and reviewing process is done
once a day.
But we do a lot more
deployments a day in the staging and in the dev branches. But on production we
usually go for one, sometimes two deployments. But if there are more, it's
just hitting a button and it runs.
Yeah, and that's cool. I mean, sometimes
people ask me, you know, you talk about the Facebooks of the world and how
often they deploy, like every 11 seconds. Obviously this only makes sense for certain companies. For a website project like ours, I
mean, twice a day already seems very flexible, right? Because you can push code out if you need it for
whatever reason. A new campaign comes in, or something changes, or you need to get a new customer
logo out there because we just signed them for a case study, and then it has to be quick, quick.
Or you do some marketing campaigns, right?
And marketing campaigns have to be timed in triggering Google ad,
blah, blah, blah, whatever you can think of.
So it's sometimes neat that you have some content changes
and then you schedule them for, I don't know, 1 p.m.,
because you need to get a certain page out because it's being promoted by Google or something.
So this is also something to think about.
And as everything is running as a pipeline,
you can even schedule something for 2 a.m. in the morning
because it just goes out at that particular time.
That's pretty cool.
So that means if you have something prepared in staging,
the day before everything looks good,
and then you just schedule the production pipeline for 2 o'clock in the morning.
And then you exactly know what happens at 2 o'clock in the morning.
Yeah.
Awesome.
Markus, you mentioned that merge and review phase.
That's a manual process, correct?
Yeah, of course.
Because contributors are humans, right?
So the actual content is being produced by humans.
So we have our marketing guys creating our awesome pages,
we have partner managers and all sorts of contributors, and they want to get their content
into the website, basically. And to get this done, you create a branch within our
environment, then you get a pull request, and we have CI coverage on that as well,
like having a linter so that you don't produce bad links or something,
so that everything from a semantic point of view is consistent.
Of course, the content itself also needs to be reviewed
by a person to make the grammar and all this stuff correct.
And did you also know that you have podcast producers putting content on your pages?
I'm pretty familiar with this process you just described.
Every two weeks, we have to check in a new branch, which gets reviewed by Lucas.
Hi, Lucas, and thanks if you're listening.
So yeah, we actually take part in this ourselves.
I get to see the automation take place during each check-in.
It usually goes through an automated checklist to check for multiple criteria,
I guess, and then it waits
for the manual review before
getting merged and published a little
while later, usually by the next
day, but sometimes as short as a few hours,
depending, I guess, when I do it.
So it's like doing
a static code analysis or something
like that, right, when you
refer it back to regular
programming.
Yeah, that's pretty cool, because you can always be in the situation that you enter a link or you
refer to a sub-page within the article you are writing, and you just have a typo in it, and
no one will ever catch that. And as we have everything in code, we can just check if the
page you're referring to exists.
Cool.
So then, if you just said everything is in code, can you tell me, like, what do I write?
What's the code that I write that then gets moved and transformed into the website? How does that work?
Well, basically, we are using a static website generator because, yeah, as everyone knows, doing it the CMS way,
you always have one single point of performance,
and that is your web server, which is actually generating the contents.
And we learned that having that running on the server
actually is a penalty on rendering time.
And so we actually let our CI build all the
content. That's usually some Markdown files. We've added some special flavors for headers and
several other ease-of-use features for the content contributors, and then we generate the pages out
of these large sets of Markdown files. We scale the images into different image sets so that they can be zoomed in on the website and so on.
Everything is pre-generated, so the server only delivers the content
and does not need to render anything.
And we create a big zip,
I think it's around 700 to 800 megabytes roughly,
for each build of our website.
And, yeah, this is done asynchronously, offline,
and it does not have any impact on the rendering side.
And then we just push out this content into our website stack,
and then it gets distributed.
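As a rough illustration of that offline generation step (not the team's actual Gradle build), the sketch below renders Markdown to HTML and pre-scales images so the servers only ever deliver static files. It assumes the third-party markdown and Pillow packages; directories, sizes, and the header extension are illustrative.

```python
# Offline build step: render Markdown to HTML and pre-scale images, so the web
# servers only ever deliver pre-generated files.
from pathlib import Path

import markdown              # pip install markdown
from PIL import Image        # pip install Pillow

SRC, OUT = Path("content"), Path("build/site")
IMAGE_WIDTHS = (480, 960, 1920)          # pre-generated "image sets" for zooming

def render_pages():
    for md_file in SRC.rglob("*.md"):
        html = markdown.markdown(md_file.read_text(encoding="utf-8"),
                                 extensions=["meta"])   # parses the header block
        target = OUT / md_file.relative_to(SRC).with_suffix(".html")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(html, encoding="utf-8")

def scale_images():
    for img_file in SRC.rglob("*.jpg"):
        with Image.open(img_file) as img:
            for width in IMAGE_WIDTHS:
                height = int(img.height * width / img.width)
                target = OUT / img_file.relative_to(SRC).with_name(
                    f"{img_file.stem}_{width}w.jpg")
                target.parent.mkdir(parents=True, exist_ok=True)
                img.resize((width, height)).save(target)

if __name__ == "__main__":
    render_pages()
    scale_images()
```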
And the 700 to 800 megabytes, that obviously includes all the images,
the videos, and everything we have on the website.
That's why it's rather large.
The images are the biggest contributor, of course.
And then also multi-language,
because I know our website is available in different languages
and all that stuff, yeah.
And I'm happy to announce that since two weeks ago,
we are even running in the Chinese part of Amazon.
So we have dynatrace.cn running entirely in the Chinese region.
And it is lightning fast. Not from our side, but within the Chinese region it's very fast.
We got the feedback.
Yeah, cool.
Because getting through the Chinese wall is tough from either side.
And so we are just providing our content on their servers
and then it's fast again. Cool.
Now, you mentioned
that two years ago, you looked
into the different AWS offerings and
different services. Now, you mentioned there
is a pipeline, but I assume you really use
the AWS CodePipeline,
the CodeDeploy. What do you use?
Did you pick something else? No.
We are leveraging the compute features from Amazon
and the content and storage capabilities from AWS.
But as we are a software development company,
we have our own pipelines and tooling around
and we basically are using the tooling we have here in our lab.
This is a big build system, and we are using Gradle actually to drive our builds, where we
trigger all the static site creation and compression of the pictures and all that
stuff, and also orchestrate basically the pipeline:
which steps are needed to actually create the instances,
and create all the tooling around it
to get everything fitting together
so that Amazon is able to start new instances.
And when we do an update,
is this a rolling update, or,
how does the deployment work on the existing setup?
So the website runs right now, right, on some EC2 machines, and an update comes along.
How does this work? Do you deploy it on a new machine and then just, basically with the load balancer,
with the auto-scaling groups, move over? Or do you deploy the new content on the existing
machines and then extract it there? How does this work? How does the update process work?
Yeah, so we had a very strong security focus when we started with the project,
because everyone knows that very prominent websites have been defaced
and you get all these SQL injection problems
and I don't know what else, if you run a CMS.
We wanted, in the first place, to not get into that situation.
And what we ended up with is to leverage our immutable infrastructure pattern
also to our website.
So our website runs an immutable image of the existing website.
So there is barely anything you can change on the website host.
It's just one small part, and that is the logging.
And even the logging is being streamed directly to a logging server.
So just in case an attack happens,
and in the rare case that somebody breaks into a box,
we still get the logs, and the attacker cannot change
or hide his entrance by modifying the logs
because they are just streamed away.
And to answer your question,
we basically replace every EC2 instance running our website with a brand new version
once we deploy a new release of our website.
And we do that by basically using AMIs.
So we are creating a new AMI
for each version of our website. We distribute this
AMI across multiple zones within the region
and also distribute it across the
regions we want to have the website hosted in. So we have one in
the US, one in Europe, and now two in Asia, because we are running
in the Chinese region as well. So we have four regions where we deploy
our image, and it's binary identical. So the
image we are running in staging is the very same we are running in
production. So if the content provider
is reviewing their content on the staging environment,
they can be sure that this very same content
is running in production,
because we do not change anything on these images.
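As a loose illustration of that per-version AMI step (not the actual pipeline code), creating the image from a prepared builder instance and copying it into the other regions with boto3 might look like this; the builder instance ID, names, and regions are made up.

```python
import boto3

SOURCE_REGION = "us-east-1"                          # hypothetical build region
ec2 = boto3.client("ec2", region_name=SOURCE_REGION)

def bake_ami(builder_instance_id, version):
    """Create the per-version image from a builder instance that already holds
    the generated static content and the monitoring agent."""
    image = ec2.create_image(
        InstanceId=builder_instance_id,
        Name=f"website-{version}",
        Description=f"Immutable website image, version {version}")
    ec2.create_tags(Resources=[image["ImageId"]],
                    Tags=[{"Key": "version", "Value": version}])
    return image["ImageId"]

def copy_to_regions(image_id, target_regions):
    """Copy the binary-identical image into every region the site runs in."""
    for region in target_regions:
        boto3.client("ec2", region_name=region).copy_image(
            Name=f"copy-of-{image_id}",
            SourceImageId=image_id,
            SourceRegion=SOURCE_REGION)
```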
And then we are just leveraging launch configurations
and auto-scaling groups.
So we create a new set for each version we provide.
And then we integrate the new autoscaling group
in the load balancer we are running.
And then the autoscaling group takes care
to actually start the required amount of instances.
And then we wait until these new instances are working
and successfully registered in the load balancer.
And then we do connection draining and start killing the existing ones.
And this is a great advantage.
In the rare case that something really goes bad,
even in production, the old ones won't be dropped,
because if the new ones are not successfully running in the load balancer,
the remaining ones
are not being killed.
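A rough boto3 sketch of that rotation, assuming one classic load balancer per region with ELB health checks and connection draining configured; group names, instance counts, and timeouts are illustrative, not the team's actual tooling.

```python
import time
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

def wait_until_in_service(asg_name, expected, timeout=900):
    """Poll until the new group's instances are healthy and in service behind
    the load balancer; only then is it safe to retire the previous version."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[asg_name])["AutoScalingGroups"][0]
        healthy = [i for i in group["Instances"]
                   if i["LifecycleState"] == "InService"
                   and i["HealthStatus"] == "Healthy"]
        if len(healthy) >= expected:
            return True
        time.sleep(30)
    return False

def roll_out(version, ami_id, elb_name, old_asg=None):
    name = f"website-{version}"
    autoscaling.create_launch_configuration(
        LaunchConfigurationName=name,
        ImageId=ami_id,                           # the immutable, per-version AMI
        InstanceType="t2.medium")
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName=name,
        LaunchConfigurationName=name,
        MinSize=2, MaxSize=4, DesiredCapacity=2,  # two instances per region
        AvailabilityZones=["us-east-1a", "us-east-1b"],
        LoadBalancerNames=[elb_name],             # classic ELB with draining enabled
        HealthCheckType="ELB",
        HealthCheckGracePeriod=300)
    if not wait_until_in_service(name, expected=2):
        raise RuntimeError("new version never became healthy; old version stays up")
    if old_asg:
        # Only reached once the new instances are serving traffic.
        autoscaling.delete_auto_scaling_group(
            AutoScalingGroupName=old_asg, ForceDelete=True)
```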
It's very resilient
and we are running since
March 2016 actually
in production and we
had no outage till now.
Cool.
Well, once we had this,
in the beginning of,
or was it late 2016, where this
big DDoS, the
PlayStation
Network sort of thing, also hit
our DNS provider.
We had a small outage,
but only in the US.
Remaining regions were still
working as we had this distributed
setup.
That's pretty cool.
I think, Brian, we talked about the DDoS attack with others in the past,
and I think we mentioned back then that we got a little hit on that.
That's pretty amazing.
Yeah, that is amazing.
So my question is, you have a lot of availability, a lot of security.
You're obviously trying to make the website bulletproof.
And now we're running the site in multiple regions, right,
to cover the different geographical locations of our website.
But they're all AWS regions.
Do we have any plans on spreading out over multiple cloud providers?
You know, we were talking about the DDoS attack,
which impacted our North American site.
You know, in order to cover those scenarios where a problem on a single vendor occurs, are we planning on distributing that risk over multiple vendors?
Well, right now we are actually pretty heavy with AWS.
But as we are just utilizing some APIs, so right now we are using the Amazon APIs, it wouldn't be much of
a problem to actually go to Azure or to Google Cloud, because it would be just another
layer in between to use the Google Cloud, for example. But we took some countermeasures already to mitigate these problems.
And yeah, we are pretty confident that we are not being hit by another DDoS.
Obviously, we are, at the end, a monitoring company.
So how do we monitor the website?
With Dynatrace, of course.
Oh, my God.
No, but how, I mean, what are the best practices, anything that we learned, or what did you do?
Well, we did quite a bunch of monitoring with AppMon prior to that.
And, of course, 2015, we were in the early stages of Dynatrace. And we started right away
going with Dynatrace.
And we did not have to set up
any collector
or any in-between infrastructure,
because we are just using
the SaaS offering for Dynatrace.
We are an internal customer,
so to say.
And we did not need
to change anything
because we are just downloading
our agent from our tenant,
which is also baked into our image.
And then if we create a new image,
we get a new version of our agent.
And lessons learned, it works out of the box.
If you do cloud-native deployment,
Dynatrace is the way to go.
Cool. So that's actually a good question for me.
That means you are, when you build the AMI,
you're baking the agent in.
And it's not because the way I've done some AWS tutorials,
monitoring tutorials,
and I typically show how to download the agent,
install it, do the startup in the user data section
of your EC2 instance.
So you are going
a different route. Yeah, we
live the immutable image pattern
very strictly, so
it wasn't a deal for us to
let this change
in our environment.
So we really
bake everything together, and we
are downloading a new version of the
agent once we
want to have it. So it's more an orchestrator part.
Cool. And just out of curiosity, how many EC2
instances right now power dynatrace.com on a day-to-day basis?
It's around, um, it's eight, basically. Yeah, so in every availability zone, or in every geographical location, we have two.
We have four regions
running right now.
And of course, not counting all the
staging and dev environments, but for production
we are running with
eight instances.
And
as we have everything statically running
we just need
T2 medium instances.
Okay, cool.
And before we were using a product which was very expensive
where we paid roughly 500,000 bucks a year just to keep the website running.
Okay, wow.
And now we are down a lot from that run-rate cost,
and we got a lot more
resilience and stability, and of course performance.
Cool, that's funny. Just
before you were talking about it, I was about to ask you if we're being smart and are monitoring
and understanding how much it's costing us to run in the cloud, and are we in fact saving money?
Yeah, because as we have everything statically generated, our web servers don't need
to be smart. They don't need to generate pages on demand, they're just delivering content.
So it's about reading from disk, and Linux is very smart about caching all the file system access, and
once the instance is running a couple of minutes, the most-used pages are in the file system cache,
and then you're just delivering the contents. And we are doing some caching layers in between
and also keeping it in memory. So it's very sophisticated, but on the other hand a very cheap
deployment. Think about it: we are running 24/7 without any downtime, and at a cost rate of, I think it's
300 or 400 dollars a month, we have to pay for the whole website, and that's it.
Yeah, that's a big gain. How do you deal with backend systems that you're depending on? Because
I'm sure there are some backend systems, when you fill out forms, when you do certain things on the website.
Marketo, if you think about that, is also integrated within. But if Marketo is down, then we can't do a lot about it. But we're
also running a blog, right? And the blog is also integrated in our website. And the blog is running WordPress,
and we used to run it on Pagely, but we had severe problems there as well, with downtimes and
response times and so on. And what we did there is, as we all know, WordPress has a lot of security holes in there, and barely a week goes by without a CVE being released for WordPress.
So what we did there, basically,
we have an authoring system only being accessible from the inside.
So if you need to generate a new blog post,
you need to be within the corporate network.
And this is being then generated in the authoring system running in the EC2 cloud,
but only accessible from the
corporate network. And there we are basically
caching and pre-generating all the blog content
on a caching layer. And we are prohibiting all
wp-admin
and all other malicious, or potentially malicious,
and, yeah, vulnerable REST calls
to our hidden WordPress installation.
So we basically have a mixture of pseudo-changeable content
in the blog, as every company needs a blog, right?
That is unquestionable.
But there are pretty hard security concerns behind it.
And the solution we came up with is
we are just caching the content.
If it goes down, we still have the caching.
It's like a reverse proxy, basically.
That's what it is.
Yeah, it's a reverse proxy with some caching in between.
And also, it's not going back to the origin if it detects it's down.
It's not giving you the 500 or 400 or something
that you would otherwise get.
It's just delivering the content it has in the cache.
And meanwhile, as everything is in an auto-scaling group,
it recovers and comes up again,
and then it will provide the new content
or the remaining content.
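A tiny sketch of that stale-serving reverse-proxy idea using only the Python standard library; the real setup is presumably a proper proxy/caching layer in front of the hidden WordPress, so the origin URL, blocked paths, and in-memory cache here are purely illustrative.

```python
# Stale-serving caching proxy sketch: fetch from the hidden WordPress origin when
# possible, but keep serving the last good copy if the origin is down.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

ORIGIN = "http://blog-authoring.internal"                    # hypothetical hidden WordPress
BLOCKED_PREFIXES = ("/wp-admin", "/wp-json", "/xmlrpc.php")  # never proxied to the origin
cache = {}                                                   # path -> last good response body

class BlogProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith(BLOCKED_PREFIXES):
            self.send_error(404)                 # hide the vulnerable WordPress surface
            return
        try:
            with urlopen(ORIGIN + self.path, timeout=5) as upstream:
                body = upstream.read()
                cache[self.path] = body          # refresh the cached copy
        except OSError:                          # origin down or unreachable
            body = cache.get(self.path)          # serve the stale copy instead of a 50x
            if body is None:
                self.send_error(503)
                return
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), BlogProxy).serve_forever()
```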
Pretty cool.
So, Andy, shall we summon the Summaryator?
Let's do it.
I think we should summon the Summaryator.
So, if I may summarize,
we went from our old rigid process,
which was using a CMS system that was, first of all, very costly,
to a fully automated, we generate our website twice a day for production
and push it out through immutable infrastructure,
which means we're actually building AMIs twice a day
that can get deployed into production.
The same images are also first deployed into staging
where we can actually validate all the changes.
Deployment in staging actually happens much more frequently.
Much more frequently, yeah.
But we're deploying across four different geographical locations in Amazon,
always in two availability zones, which makes obviously a lot of sense.
And we got the flexibility of the new model, high performance, monitoring built in,
because Dynatrace OneAgents are baked into the AMIs.
Yes. And better
performance in the
end.
I think that's the
key thing.
And a lot of
cool lessons
learned on how
to leverage a
CI/CD pipeline not
only with
traditional software
products but
actually with a
website.
Yes.
You're right.
Yeah.
Just to mention
the performance
part, I wasn't
able to point
that out.
We are basically
down to roughly 500 milliseconds to deliver the website into the browser.
Cool.
And we know that for a fact because we have measured it with RUM.
Exactly.
You're right.
That's pretty cool.
Did I miss anything?
I think no.
Brian, what do you think?
Yeah, I think this is really cool.
You know, a while ago we had those sessions I mentioned earlier with Bernd and Anita where we discussed our own digital transformation for developing our product.
But I don't think I'd ever considered our website at all.
You know, it's not our product, yet it's a major asset of ours.
I even noticed as I've been updating the podcast page that the process
had changed quite a bit, but I never really thought about it. I can say though that the process
to put in changes has become much easier. You know, I think it's kind of cool to think about,
you know, especially if you work at a company where your product isn't the website, you know,
if you're not at Amazon or Snapchat or some web property, but just a property that uses a website to communicate about your product, you know, yeah, the delivery and maintenance of that website is very important and deserves to go through its own transformation.
And in our own case, we can see how it made the work for us all much easier, as well as saving us a lot of money, right?
So, Markus, thanks for coming on again and for joining the Two-Timer Club.
Congratulations.
And really, just thanks for sharing this story.
It was great.
Thank you.
But I just want to follow up on that, Brian.
I think it's also important that you as a content contributor actually got no friction from moving to a pipeline-based workflow, right?
So this was also something our website experience colleagues were very worried about: to provide
a CMS-like user experience for the content creators, actually.
Because, of course, you could create all the pipeline stuff and it could be very clumsy in
the end, that you need to git checkout and git commit and all this stuff. But of course we also
had a view on the experience of the, yeah, of the users, actually. Because if it's hidden and it's
always working, it's also nice for you guys actually providing content, that you don't
need to worry about, oh, is this a CMS or is it a content-driven pipeline?
Because in the end, it's also the experience of the users
who are providing the content.
Is content-driven pipeline a term that is out there?
Or did you just come up with that?
Maybe it's a new term.
As we had the snowflakes last time, right?
So we might have the content CI.
Yeah, I'm Googling it right now.
Yeah, it's really cool.
Content-driven pipelines.
Makes a lot of sense.
And while he's Googling or binging or whatever he's doing,
another thought that came to mind, which we implicitly mentioned,
but as a content developer, I basically have my own branch,
and I can basically deploy my changes into my own dev environment,
which as you said, is completely the same as it will look later on in production because
I'm running through the same pipeline.
At the end, an AMI falls out, which you deploy into EC2 and boom, here we go.
Yeah.
This is pretty cool.
Cool.
Okay.
Are you ready?
I did a search for content driven pipeline in quotes and I only got four hits.
Oh, here we go.
Here we go.
We have a,
I think we have a name for the,
a title for the episode.
Content-driven pipeline.
Awesome.
Thank you.
Thank you, guys.
Well, Markus,
thank you very much for coming,
as always,
and, well, now you have a challenge.
You are now part of the Two Timers Club
and have the opportunity,
if you have another thought or idea, to come back on to be the first in the Three Timer Club.
So the challenge is all on you, my friend.
Thanks for coming on.
And for anybody else out there listening, if you would like to join our podcast, either have an idea for us to discuss or maybe even come on, maybe even be a quick riser to this two or three timers
club yourself. We'd love to have your ideas or love to talk with you. You could send us a tweet
at pure underscore DT or send us an email at pure performance at dynatrace.com. We'd love to hear
from you. All right. Bye. Bye. Thank you guys. Thank you. Bye.