PurePerformance - 054 Moving to Continuous AWS-based Enterprise Web Hosting – Lessons Learned
Episode Date: January 29, 2018
Markus Heimbach, Team Lead of the Infrastructure and Service Team at Dynatrace, explains how the continuous delivery process of www.dynatrace.com really works behind the scenes. Two years ago the website team used a traditional CMS (Content Management System) which was slow, error prone, and didn't deliver the expected end user experience for visitors of our website. Two years later Markus and his team have built a fully automated "Content Delivery Pipeline". The team decided to leverage Git, statically generated web content, immutable infrastructure, and Dynatrace OneAgent monitoring. Production deployments happen twice a day, but staging and development deployments – using the same deployment pipelines – happen much more frequently. The result is a very flexible delivery pipeline, fully version-controlled content, a very secure and fast website, and everything monitored with Dynatrace. Thanks Markus for letting us look behind the scenes of www.dynatrace.com.
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello, everybody. Before we get into today's episode, I wanted to preempt it with a public service announcement. Just to let you all know, we had quite some challenges with the audio when we recorded this episode.
So please forgive us. I've done my best to clean it up and make it as listenable as possible.
Hope you enjoy the show.
Hello,
everybody, and welcome to yet another fun-filled episode of Pure Performance. My name is Brian
Wilson, and as always, sitting virtually next to me is my sidekick, Andy Grabner. Andy,
how are you doing today? I'm actually pretty good over in Austria right now.
Wow, that's great. And you've been doing a lot of traveling lately, haven't you? But before we get
to your traveling, you being in Austria, you know, we've been talking about people being on the show two times a lot lately. And today is our quickest two-timer turnaround ever, isn't it? Right? Because we just recorded an episode with this person about a month or two ago, and now they're back for another episode.
So that's basically the shortest MTTT, Mean Time to Two Timer.
Yes, exactly.
And now you've been traveling a lot, right?
You've been at a lot of really cool conferences lately, haven't you, Andy?
That's true.
Last week I actually spent in Scotland, in Edinburgh, at a castle with our friends from
the United States.
Yeah, I saw you posted a picture of that.
That was like a real castle, wasn't it?
I bet the cell phone reception was amazing.
Exactly. It says, no service.
But actually,
let's go ahead. I really want to
reveal the secret about who is the fastest
MTTT guy.
Actually, the guy in this case is Markus.
Welcome back to the show.
Thanks for letting me join you again.
Thank you for joining.
And I'm not sure if people remember what we talked about the first time, but we had you
on the show a while ago where we talked about infrastructure as code.
Yes.
That was pretty cool.
That is true.
And today we want to talk about something different.
What is it?
It is how we as a company are hosting our own website.
And as we are doing it, in my eyes, at a very sophisticated level,
we wanted to share that with the audience and to highlight
what we basically did to make our enterprise website
very stable, very robust, and of course, very fast.
Yes, we are a performance company.
So we are very cautious about providing a fast website
and providing a good experience to our website visitors.
So if I can interject here,
a while back we had on both Bernd and Anita
discussing how we transform from a six-month release to one-hour code deployments.
And obviously now we're running in AWS.
So today's talk is kind of like, if you want to think of it this way, this is us zooming in on the infrastructure and AWS part and what we achieve there.
Is that a fair assessment?
Not really. With Bernd and Anita, I think we were talking about the product we are hosting,
and now we are actually talking about our own website.
So basically, if you hit www.dynatrace.com, that's what you get when you enter our website,
and that's what we're going to talk about.
Both are running on AWS, actually, but on a very different scheme.
And what we did in our website is what we want to share and provide today.
All right.
But it's really good, Brian, that you brought it up, actually playing somebody who would assume that when we talk about Dynatrace,
we obviously talk about our monitoring products. But this time it's actually
a website, which is probably a similar project to what many companies have out there.
They have a website, and they use different ways of hosting it. But I mean, you gave
us a nice write-up on what you want to talk about, and I think it starts about two years ago,
Christmas 2015.
Yeah, that's when the project started, actually. We got in touch
with AWS in the years prior to 2015 with some smaller projects, and we learned about all the
functionality of AWS, and we wanted to leverage this for our website as well. We were running on a pretty large CMS back in those days.
And we started thinking about how we can basically get our very clumsy CMS thingy into the cloud and get it fast.
And we had some key use cases in mind when we started this project.
And I used some spare minutes of my Christmas vacation to start coding on that.
Yeah, we basically wanted to use the core features of AWS.
We wanted the site to be resilient, with a multi-zone and multi-region deployment.
So in the rare case that, for example, Amazon US East goes down, I don't want our website to be down.
So that's the reason why we went for multi-zone and multi-region deployments in the first place.
It should be, of course, automatically scaling in terms of load.
And it should be, of course, fully CI-
driven, so I don't have to touch any manual step to get our website out. These
were the core ideas behind our website deployments. And
definitely we wanted to have different stages, so we wanted to have a
development and staging environment where we can actually see if and how our website is working, and if there are problems around.
And of course, we wanted to have everything automated.
As I laid out in the first podcast, actually, we are big promoters of automating everything you can.
And of course, we followed that paradigm as well in our website deployment.
That's pretty cool.
And so if I hear this right, in the old way, we had a CMS system
and it meant going through multiple hoops to actually get a change
into different environments.
And then I remember these days, right?
Sometimes it's like it felt like forever to get changes out there.
And so now, everything
fully automated, how often do we, I know
I may jump ahead a little bit, but how often do
we deploy now?
We nailed it down to one to
two deployments a day because we have
everything in Git and the
merging process and reviewing process is done
once a day.
But we do a lot more
deployments a day in the staging and in the dev branches. But on production we
usually go for one, sometimes two deployments. But if there are more, it's
just hitting a button and it runs.
Yeah, and that's cool. I mean, sometimes
people ask me, you know, you talk about the Facebooks of the world and how
often they deploy, like every 11 seconds. Obviously this only makes sense for certain companies. For a website project like ours, I
mean, twice a day already seems very flexible, right? Because you can push code out if you need it for
whatever reason. A new campaign comes in, or something changes, or you need to get a new customer
logo out there because we just signed them for a case study, and then it has to be quick, quick.
Or you do some marketing campaigns, right?
And marketing campaigns have to be timed in triggering Google ad,
blah, blah, blah, whatever you can think of.
So it's sometimes neat that you have some content changes
and then you schedule them for, I don't know, 1 p.m.,
because you need to get a certain page out because it's being promoted by Google or something.
So this is also something to think about.
And as everything is running as a pipeline,
you can even schedule something for 2 a.m. in the morning
because it just goes out at that particular time.
That's pretty cool.
So that means if you have something prepared in staging,
the day before everything looks good,
and then you just schedule the production pipeline for 2 o'clock in the morning.
And then you exactly know what happens at 2 o'clock in the morning.
Yeah.
Awesome.
Markus, you mentioned that merge and review phase.
That's a manual process, correct?
Yeah, of course.
Because contributors are humans, right?
So the actual content is being produced by humans.
So we have our marketing guys creating our awesome pages,
we have partner managers and all sorts of contributors, and they want to get their content
into the website, basically. And to get this done, you create a branch within our
environment, then you get a pull request, and we have CI coverage on that as well,
like having a linter so that you don't produce bad links or something,
so that everything from a semantic point of view is consistent.
Of course, the content itself also needs to be reviewed
by a person to make the grammar and all this stuff correct.
And did you also know that you have podcast producers putting content on your pages?
I'm pretty familiar with this process you just described.
Every two weeks, we have to check in a new branch, which gets reviewed by Lucas.
Hi, Lucas, and thanks if you're listening.
So yeah, we actually take part in this ourselves.
I get to see the automation take place during each check-in.
It usually goes through an automated checklist to check for multiple criteria,
I guess, and then it waits
for the manual review before
getting merged and published a little
while later, usually by the next
day, but sometimes as short as a few hours,
depending, I guess, when I do it.
So it's like doing
a static code analysis or something
like that, right, when you
refer it back to regular
programming.
Yeah, that's pretty cool, because you can always be in the situation that you enter a link or you
refer to a sub-page within the article you are writing, and you just have a typo in it, and
no one will ever catch that. And as we have everything in code, we can just check if the
page you're referring to exists.
Cool.
So then, if you just said everything is in code, can you tell me, like, what do I write?
What's the code that I write that then gets moved and transformed into the website? How does that work?
Well, basically, we are using a static website generator because, yeah, as everyone knows, doing it the CMS way,
you always have one single point of performance,
and that is your web server, which is actually generating the contents.
And we learned that having that running on the server
actually is a penalty on rendering time.
And so we actually let our CI build all the
content. That's usually some Markdown files. We've added some special flavors for headers and
several other ease-of-use features for the content contributors, and then we generate the pages out
of these large sets of Markdown files. We scale the images into different image sets so that they can be zoomed in on the website and so on.
Everything is pre-generated, so the server only delivers the content
and does not need to render anything.
And we create a big zip,
I think it's around 700 to 800 megabytes roughly,
for each build of our website.
And, yeah, this is done asynchronously, offline,
and it does not have any impact on the rendering side.
And then we just push out this content into our website stack,
and then it gets distributed.
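As a rough illustration of that offline generation step (not the team's actual Gradle build), the sketch below renders Markdown to HTML and pre-scales images so the servers only ever deliver static files. It assumes the third-party markdown and Pillow packages; directories, sizes, and the header extension are illustrative.

```python
# Offline build step: render Markdown to HTML and pre-scale images, so the web
# servers only ever deliver pre-generated files.
from pathlib import Path

import markdown              # pip install markdown
from PIL import Image        # pip install Pillow

SRC, OUT = Path("content"), Path("build/site")
IMAGE_WIDTHS = (480, 960, 1920)          # pre-generated "image sets" for zooming

def render_pages():
    for md_file in SRC.rglob("*.md"):
        html = markdown.markdown(md_file.read_text(encoding="utf-8"),
                                 extensions=["meta"])   # parses the header block
        target = OUT / md_file.relative_to(SRC).with_suffix(".html")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(html, encoding="utf-8")

def scale_images():
    for img_file in SRC.rglob("*.jpg"):
        with Image.open(img_file) as img:
            for width in IMAGE_WIDTHS:
                height = int(img.height * width / img.width)
                target = OUT / img_file.relative_to(SRC).with_name(
                    f"{img_file.stem}_{width}w.jpg")
                target.parent.mkdir(parents=True, exist_ok=True)
                img.resize((width, height)).save(target)

if __name__ == "__main__":
    render_pages()
    scale_images()
```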
And the 700 to 800 megabytes, that obviously includes all the images,
the videos, and everything we have on the website.
That's why it's rather large.
The images are the biggest contributor, of course.
And then also multi-language,
because I know our website is available in different languages
and all that stuff, yeah.
And I'm happy to announce that since two weeks ago,
we are even running in the Chinese part of Amazon.
So we have dynatrace.cn running entirely in the Chinese region.
And it is lightning fast. Not from our side, but within the Chinese region it's very fast.
We got the feedback.
Yeah, cool.
Because getting through the Chinese wall is tough from either side.
And so we are just providing our content on their servers
and then it's fast again. Cool.
Now, you mentioned
that two years ago, you looked
into the different AWS offerings and
different services. Now, you mentioned there
is a pipeline, but I assume you really use
the AWS CodePipeline,
the CodeDeploy. What do you use?
Did you pick something else? No.
We are leveraging the compute features from Amazon
and the content and storage capabilities from AWS.
But as we are a software development company,
we have our own pipelines and tooling around
and we basically are using the tooling we have here in our lab.
This is a big build system, and we are using Gradle actually to drive our builds, where we
trigger all the static site creation and compression of the pictures and all that
stuff, and also orchestrate basically the pipeline:
which steps are needed to actually create the instances,
and create all the tooling around it
to get everything fitting together
so that Amazon is able to start new instances.
And when we do an update,
is this a rolling update, or,
how does the deployment work on the existing setup?
So the website runs right now, right, on some EC2 machines, and an update comes along.
How does this work? Do you deploy it on a new machine and then just, basically with the load balancer,
with the auto-scaling groups, move over? Or do you deploy the new content on the existing
machines and then extract it there? How does this work? How does the update process work?
Yeah, so we had a very strong security focus when we started with the project,
because everyone knows that very prominent websites have been defaced
and you get all these SQL injection problems
and I don't know what else, if you run a CMS.
We wanted, in the first place, to not get into that situation.
And what we ended up with is to leverage our immutable infrastructure pattern
also to our website.
So our website runs an immutable image of the existing website.
So there is barely anything you can change on the website host.
It's just one small part, and that is the logging.
And even the logging is being streamed directly to a logging server.
So just in case an attack happens,
and in the rare case that somebody breaks into a box,
we still get the logs, and the attacker cannot change
or hide his entrance by modifying the logs
because they are just streamed away.
And to answer your question,
we basically replace every EC2 instance running our website with a brand new version
once we deploy a new release of our website.
And we do that by basically using AMIs.
So we are creating a new AMI
for each version of our website. We distribute this
AMI across multiple zones within the region
and also distribute it across the
regions we want to have the website hosted in. So we have one in
the US, one in Europe, and now two in Asia, because we are running
in the Chinese region as well. So we have four regions where we deploy
our image, and it's binary identical. So the
image we are running in staging is the very same we are running in
production. So if the content provider
is reviewing their content on the staging environment,
they can be sure that this very same content
is running in production,
because we do not change anything on these images.
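As a loose illustration of that per-version AMI step (not the actual pipeline code), creating the image from a prepared builder instance and copying it into the other regions with boto3 might look like this; the builder instance ID, names, and regions are made up.

```python
import boto3

SOURCE_REGION = "us-east-1"                          # hypothetical build region
ec2 = boto3.client("ec2", region_name=SOURCE_REGION)

def bake_ami(builder_instance_id, version):
    """Create the per-version image from a builder instance that already holds
    the generated static content and the monitoring agent."""
    image = ec2.create_image(
        InstanceId=builder_instance_id,
        Name=f"website-{version}",
        Description=f"Immutable website image, version {version}")
    ec2.create_tags(Resources=[image["ImageId"]],
                    Tags=[{"Key": "version", "Value": version}])
    return image["ImageId"]

def copy_to_regions(image_id, target_regions):
    """Copy the binary-identical image into every region the site runs in."""
    for region in target_regions:
        boto3.client("ec2", region_name=region).copy_image(
            Name=f"copy-of-{image_id}",
            SourceImageId=image_id,
            SourceRegion=SOURCE_REGION)
```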
And then we are just leveraging launch configurations
and auto-scaling groups.
So we create a new set for each version we provide.
And then we integrate the new autoscaling group
in the load balancer we are running.
And then the autoscaling group takes care
to actually start the required amount of instances.
And then we wait until these new instances are working
and successfully registered in the load balancer.
And then we do connection draining and start killing the existing ones.
And this is a great advantage.
In the rare case that something really goes bad,
even in production, the old ones won't be dropped,
because if the new ones are not successfully running in the load balancer,
the remaining ones
are not being killed.
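A rough boto3 sketch of that rotation, assuming one classic load balancer per region with ELB health checks and connection draining configured; group names, instance counts, and timeouts are illustrative, not the team's actual tooling.

```python
import time
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

def wait_until_in_service(asg_name, expected, timeout=900):
    """Poll until the new group's instances are healthy and in service behind
    the load balancer; only then is it safe to retire the previous version."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[asg_name])["AutoScalingGroups"][0]
        healthy = [i for i in group["Instances"]
                   if i["LifecycleState"] == "InService"
                   and i["HealthStatus"] == "Healthy"]
        if len(healthy) >= expected:
            return True
        time.sleep(30)
    return False

def roll_out(version, ami_id, elb_name, old_asg=None):
    name = f"website-{version}"
    autoscaling.create_launch_configuration(
        LaunchConfigurationName=name,
        ImageId=ami_id,                           # the immutable, per-version AMI
        InstanceType="t2.medium")
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName=name,
        LaunchConfigurationName=name,
        MinSize=2, MaxSize=4, DesiredCapacity=2,  # two instances per region
        AvailabilityZones=["us-east-1a", "us-east-1b"],
        LoadBalancerNames=[elb_name],             # classic ELB with draining enabled
        HealthCheckType="ELB",
        HealthCheckGracePeriod=300)
    if not wait_until_in_service(name, expected=2):
        raise RuntimeError("new version never became healthy; old version stays up")
    if old_asg:
        # Only reached once the new instances are serving traffic.
        autoscaling.delete_auto_scaling_group(
            AutoScalingGroupName=old_asg, ForceDelete=True)
```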
It's very resilient
and we are running since
March 2016 actually
in production and we
had no outage till now.
Cool.
Well, once we had this,
in the beginning of,
or was it late 2016, where this
big DDoS, the
PlayStation
Network sort of thing, also hit
our DNS provider.
We had a small outage,
but only in the US.
Remaining regions were still
working as we had this distributed
setup.
That's pretty cool.
I think, Brian, we talked about the DDoS attack with others in the past,
and I think we mentioned back then that we got a little hit on that.
That's pretty amazing.
Yeah, that is amazing.
So my question is, you have a lot of availability, a lot of security.
You're obviously trying to make the website bulletproof.
And now we're running the site in multiple regions, right,
to cover the different geographical locations of our website.
But they're all AWS regions.
Do we have any plans on spreading out over multiple cloud providers?
You know, we were talking about the DDoS attack,
which impacted our North American site.
You know, in order to cover those scenarios where a problem on a single vendor occurs, are we planning on distributing that risk over multiple vendors?
Well, right now we are actually pretty heavy with AWS.
But as we are just utilizing some APIs, so right now we are using the Amazon APIs, it wouldn't be much of
a problem to actually go to Azure or to Google Cloud, because it would be just another
layer in between to use the Google Cloud, for example. But we took some countermeasures already to mitigate these problems.
And yeah, we are pretty confident that we are not being hit by another DDoS.
Obviously, we are, at the end, a monitoring company.
So how do we monitor the website?
With Dynatrace, of course.
Oh, my God.
No, but how, I mean, what are the best practices, anything that we learned, or what did you do?
Well, we did quite a bunch of monitoring with AppMon prior to that.
And, of course, 2015, we were in the early stages of Dynatrace. And we started right away
going with Dynatrace.
And we did not have to set up
any collector
or any in-between infrastructure,
because we are just using
the SaaS offering for Dynatrace.
We are an internal customer,
so to say.
And we did not need
to change anything
because we are just downloading
our agent from our tenant,
which is also baked into our image.
And then if we create a new image,
we get a new version of our agent.
And lessons learned, it works out of the box.
If you do cloud-native deployment,
Dynatrace is the way to go.
Cool. So that's actually a good question for me.
That means you are, when you build the AMI,
you're baking the agent in.
And it's not because the way I've done some AWS tutorials,
monitoring tutorials,
and I typically show how to download the agent,
install it, do the startup in the user data section
of your EC2 instance.
So you are going
a different route. Yeah, we
live the immutable image pattern
very strictly, so
it wasn't a deal for us to
let this change
in our environment.
So we really
bake everything together, and we
are downloading a new version of the
agent once we
want to have it. So it's more an orchestrator part.
Cool. And just out of curiosity, how many EC2
instances right now power dynatrace.com on a day-to-day basis?
It's around, um, it's eight, basically. Yeah, so in every availability zone, or in every geographical location, we have two.
We have four regions
running right now.
And of course, not counting all the
staging and dev environments, but for production
we are running with
eight instances.
And
as we have everything statically running
we just need
T2 medium instances.
Okay, cool.
And before we were using a product which was very expensive
where we paid roughly 500,000 bucks a year just to keep the website running.
Okay, wow.
And now we are down a lot from that run-rate cost,
and we got a lot more
resilience and stability, and of course performance.
Cool, that's funny. Just
before you were talking about it, I was about to ask you if we're being smart and are monitoring
and understanding how much it's costing us to run in the cloud, and are we in fact saving money?
Yeah, because as we have everything statically generated, our web servers don't need
to be smart. They don't need to generate pages on demand, they're just delivering content.
So it's about reading from disk, and Linux is very smart about caching all the file system access, and
once the instance is running a couple of minutes, the most-used pages are in the file system cache,
and then you're just delivering the contents. And we are doing some caching layers in between
and also keeping it in memory. So it's very sophisticated, but on the other hand a very cheap
deployment. Think about it: we are running 24/7 without any downtime, and at a cost rate of, I think it's
300 or 400 dollars a month, we have to pay for the whole website, and that's it.
Yeah, that's a big gain. How do you deal with backend systems that you're depending on? Because
I'm sure there are some backend systems, when you fill out forms, when you do certain things on the website.
Marketo, if you think about that, is also integrated within. But if Marketo is down, then we can't do a lot about it. But we're
also running a blog, right? And the blog is also integrated in our website. And the blog is running WordPress,
and we used to run it on Pagely, but we had severe problems there as well, with downtimes and
response times and so on. And what we did there is, as we all know, WordPress has a lot of security holes in there, and barely a week goes by without a CVE being released for WordPress.
So what we did there, basically,
we have an authoring system only being accessible from the inside.
So if you need to generate a new blog post,
you need to be within the corporate network.
And this is being then generated in the authoring system running in the EC2 cloud,
but only accessible from the
corporate network. And there we are basically
caching and pre-generating all the blog content
on a caching layer. And we are prohibiting all
wp-admin
and all other malicious, or potentially malicious,
and, yeah, vulnerable REST calls
to our hidden WordPress installation.
So we basically have a mixture of pseudo-changeable content
in the blog, as every company needs a blog, right?
That is unquestionable.
But there are pretty hard security concerns behind it.
And the solution we came up with is
we are just caching the content.
If it goes down, we still have the caching.
It's like a reverse proxy, basically.
That's what it is.
Yeah, it's a reverse proxy with some caching in between.
And also, it's not going back to the origin if it detects it's down.
It's not giving you the 500 or 400 or something
that you would otherwise get.
It's just delivering the content it has in the cache.
And meanwhile, as everything is in an auto-scaling group,
it recovers and comes up again,
and then it will provide the new content
or the remaining content.
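A tiny sketch of that stale-serving reverse-proxy idea using only the Python standard library; the real setup is presumably a proper proxy/caching layer in front of the hidden WordPress, so the origin URL, blocked paths, and in-memory cache here are purely illustrative.

```python
# Stale-serving caching proxy sketch: fetch from the hidden WordPress origin when
# possible, but keep serving the last good copy if the origin is down.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

ORIGIN = "http://blog-authoring.internal"                    # hypothetical hidden WordPress
BLOCKED_PREFIXES = ("/wp-admin", "/wp-json", "/xmlrpc.php")  # never proxied to the origin
cache = {}                                                   # path -> last good response body

class BlogProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith(BLOCKED_PREFIXES):
            self.send_error(404)                 # hide the vulnerable WordPress surface
            return
        try:
            with urlopen(ORIGIN + self.path, timeout=5) as upstream:
                body = upstream.read()
                cache[self.path] = body          # refresh the cached copy
        except OSError:                          # origin down or unreachable
            body = cache.get(self.path)          # serve the stale copy instead of a 50x
            if body is None:
                self.send_error(503)
                return
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), BlogProxy).serve_forever()
```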
Pretty cool.
So, Andy, shall we summon the Summaryator?
Let's do it.
I think we should summon the Summaryator.
So, if I may summarize,
we went from our old rigid process,
which was using a CMS system that was, first of all, very costly,
to a fully automated, we generate our website twice a day for production
and push it out through immutable infrastructure,
which means we're actually building AMIs twice a day
that can get deployed into production.
The same images are also first deployed into staging
where we can actually validate all the changes.
Deployment in staging actually happens much more frequently.
Much more frequently, yeah.
But we're deploying across four different geographical locations in Amazon,
always in two availability zones, which makes obviously a lot of sense.
And we got the flexibility of the new model, high performance, monitoring built in,
because Dynatrace OneAgents are baked into the AMIs.
Yes. And better
performance in the
end.
I think that's the
key thing.
And a lot of
cool lessons
learned on how
to leverage a
CI/CD pipeline not
only with
traditional software
products but
actually with a
website.
Yes.
You're right.
Yeah.
Just to mention
the performance
part, I wasn't
able to point
that out.
We are basically
down to roughly 500 milliseconds to deliver the website into the browser.
Cool.
And we know that for a fact because we have measured it with RUM.
Exactly.
You're right.
That's pretty cool.
Did I miss anything?
I think no.
Brian, what do you think?
Yeah, I think this is really cool.
You know, a while ago we had those sessions I mentioned earlier with Bernd and Anita where we discussed our own digital transformation for developing our product.
But I don't think I'd ever considered our website at all.
You know, it's not our product, yet it's a major asset of ours.
I even noticed as I've been updating the podcast page that the process
had changed quite a bit, but I never really thought about it. I can say though that the process
to put in changes has become much easier. You know, I think it's kind of cool to think about,
you know, especially if you work at a company where your product isn't the website, you know,
if you're not at Amazon or Snapchat or some web property, but just a property that uses a website to communicate about your product, you know, yeah, the delivery and maintenance of that website is very important and deserves to go through its own transformation.
And in our own case, we can see how it made the work for us all much easier, as well as saving us a lot of money, right?
So, Markus, thanks for coming on again and for joining the Two-Timer Club.
Congratulations.
And really, just thanks for sharing this story.
It was great.
Thank you.
But I just want to follow up on that, Brian.
I think it's also important that you as a content contributor actually got no friction from moving to a pipeline-based workflow, right?
So this was also something our website experience colleagues were very worried about: to provide
a CMS-like user experience for the content creators, actually.
Because, of course, you could create all the pipeline stuff and it could be very clumsy in
the end, that you need to git checkout and git commit and all this stuff. But of course we also
had a view on the experience of the, yeah, of the users, actually. Because if it's hidden and it's
always working, it's also nice for you guys actually providing content, that you don't
need to worry about, oh, is this a CMS or is it a content-driven pipeline?
Because in the end, it's also the experience of the users
who are providing the content.
Is content-driven pipeline a term that is out there?
Or did you just come up with that?
Maybe it's a new term.
As we had the snowflakes last time, right?
So we might have the content CI.
Yeah, I'm Googling it right now.
Yeah, it's really cool.
Content-driven pipelines.
Makes a lot of sense.
And while he's Googling or binging or whatever he's doing,
another thought that came to mind, which we implicitly mentioned,
but as a content developer, I basically have my own branch,
and I can basically deploy my changes into my own dev environment,
which as you said, is completely the same as it will look later on in production because
I'm running through the same pipeline.
At the end, an AMI falls out, which you deploy into EC2 and boom, here we go.
Yeah.
This is pretty cool.
Cool.
Okay.
Are you ready?
I did a search for content driven pipeline in quotes and I only got four hits.
Oh, here we go.
Here we go.
We have a,
I think we have a name for the,
a title for the episode.
Content-driven pipeline.
Awesome.
Thank you.
Thank you, guys.
Well, Markus,
thank you very much for coming,
as always,
and, well, now you have a challenge.
You are now part of the Two Timers Club
and have the opportunity,
if you have another thought or idea, to come back on to be the first in the Three Timer Club.
So the challenge is all on you, my friend.
Thanks for coming on.
And for anybody else out there listening, if you would like to join our podcast, either have an idea for us to discuss or maybe even come on, maybe even be a quick riser to this two or three timers
club yourself. We'd love to have your ideas or love to talk with you. You could send us a tweet
at pure underscore DT or send us an email at pure performance at dynatrace.com. We'd love to hear
from you. All right. Bye. Bye. Thank you guys. Thank you. Bye.