PurePerformance - Everything we messed up and learned when moving to AWS with Justin Donohoo

Episode Date: May 11, 2020

Have you ever burned 30k because you forgot to turn off your test VMs over the weekend? Have you ever accidentally deleted “the production table” because you thought you were connected to your dev... database? We often only hear the good stories, and not the ones that teach us what not to do in order to avoid disaster! Join this episode, where Justin Donohoo, Founder and CTO of Observian, tells us horror stories from his professional life that taught him great lessons on what not to do when moving to the cloud, re-architecting under exponential growth, or letting the intern do things he or she shouldn’t do. https://www.linkedin.com/in/jdonohoo/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance, COVID-19 edition. This is your host, Brian Wilson, and as always I have with me the lovely and talented Andy Grabner. Andy Grabner, how are you doing today? I'm good. Why are you sweet-talking me? What's wrong with you? What do you want to get in return? Shall I say some nice words about you as well? No, I want to get a Keptn environment. Now, I was thinking, I always think about, a lot of times on the
Starting point is 00:00:55 late-night shows, right? Whenever they introduce a male guest, it's, you know, oh, the esteemed whatever, and almost every single time a female guest is introduced, it's always "the lovely and" something. I'm like, that's always odd. So I wanted to introduce you, as a male, as the lovely and talented Mr. Andy Grabner. Because you are lovely and talented. Well, I try to be. And my wife keeps telling me from time to time.
Starting point is 00:01:21 So I think it's all right. And also, you mentioned earlier, it's the first COVID edition, even though when this airs, COVID hopefully is beyond us. Well, who knows. But it's the first time we actually record while we're all quarantined, right? Yeah, yeah. I mean, it's been around, we've known about it since at least before Perform, but you're right, this is the quarantine edition, right? Quarantined in place. But I'm sure people are sick of hearing everything about that. So let's, yeah, let's give them a break from that, unless we want to make fun of any of our leaders for the poor jobs they're doing. But we can leave that off, because this isn't a political podcast either. We actually have a fun show today.
Starting point is 00:02:05 Right? Remember a while back, I forget which episode it was. Maybe our guest remembers. But there was an episode we had where we said, you know, it'd be great to hear about like failures, right? Because we hear a lot of people talking about, I think we might've been talking to Gene Kim.
Starting point is 00:02:19 And we were talking about hearing about all the success stories of DevOps transformations and all these real great pipelines, but we never quite hear about when things go wrong and maybe what you learn from it. Because it's always, you know, things are allowed to go wrong, but we always want to learn a thing or two from that.
Starting point is 00:02:36 And we had one of our old guests contact me again and say, oh, I've got some great stories, I'd love to be back on. So that's what we're doing today. Hopefully it'll be a fun episode and put everyone's mind in a more fun place. Andy, would you like to go ahead and do the honors of introducing our guest, who is on mute, or might have disconnected? Well, I hope he's still there with us. So, Justin Donohoo, hopefully I correctly pronounced that name.
Starting point is 00:03:06 I didn't check back, but I think it was episode 69 that we had him on. We talked about serverless, and I don't remember if I pronounced his name correctly last time. I'm pretty sure he introduced himself the last time, but I'm not sure everybody listened to that episode back then. Justin, could you do us a favor? First of all, thanks for being on the show. Give us a little background about yourself,
Starting point is 00:03:31 and then it will be really interesting to dig deeper into what kind of stories you have to share about things that actually went wrong and what we can learn from them. Yeah, so it's Donahoe, so you're pretty close. I mean, kind of like Don Ho, tiny bubbles. Exactly. There you go. Yeah. I mean, it's definitely good to be back. I've always loved to talk about this stuff and spend time with Dynatrace, and I actually get to work with Brian, because we're in the same market.
Starting point is 00:04:02 So that's kind of interesting. And then I saw both of you at Perform. And you were sick back then. Dude, that was horrible. I'm wondering if I was patient zero for Utah. Because it was nuts when I got home. You might have been. But yeah. I mean, quick background about me, though.
Starting point is 00:04:20 I'm the founder and CTO of Observian. We're an advanced AWS partner and a premier Google partner. We also spend a lot of time in the software delivery space, and APM is part of our core strategy; that's why we're partnering with Dynatrace. And, you know, a quick backstory on me before I jump in too far: a lot of the examples I'm going to talk about today are from my last job, where I spent five-plus years doing nothing but failing, and failing fast,
Starting point is 00:04:52 and realizing, the more I learned about doing cloud computing (I've been doing it about 10 years, and I spent almost four and a half years at that last job), the more I did it, the more I realized I really didn't know what the heck people were doing, or what we were doing. By the end of it all, that was kind of the event that started Observian. At the end it was like: that was really hard, we went through a lot of things, why is it so painful? Instead of just getting a job at a single place, why not start somewhere where I can help many people learn from all the mistakes I made? So, I mean, this is right up my alley; it's the whole reason Observian exists. Hey, when you said you failed fast and often,
Starting point is 00:05:36 but obviously, hopefully, you learned a lot and recovered. Was this the intention from the beginning, because you just didn't know much better about the cloud technologies and it was okay to fail fast? Or did it just happen because you simply had no clue what you were doing? I would say it was an elegant combo of both: incompetence, and being agile in the cloud. I basically inherited quite a mess. So, I mean, my last company was called Ghostery. It was a privacy plug-in tool, still around today.
Starting point is 00:06:13 And then we also had an enterprise business that was doing a lot of the cutting-edge Digital Advertising Alliance, interest-based advertising stuff. So we were kind of on the forefront of digital privacy. And when I took over... I started there as an hourly contractor; it was only supposed to be like a three-to-six-month gig, and I ended up leaving as their head of tech, so it was definitely a great experience. But I had multiple AWS accounts that were EC2-Classic, so they all had public IPs.
Starting point is 00:06:48 We had... pick a database, we had it in production. Pick a programming language, we probably had it. We even had Haskell running in production. We had a 1,200-plus-node Hadoop cluster. It was pretty nuts. So some of it was inheriting it; a lot of it was, you know, not
Starting point is 00:07:09 knowing what you don't know, and then some of it towards the end was definitely fail fast and often, because it's a lot cheaper than failing in production. But I've got a lot of these stories that are production failures. So, I mean, one of the things I like to talk about is, you know, on the surface it's like, oh yeah, you started a cloud company; in this day and age, that's got to be a great industry to be in. And I was like, yeah, it might look that way, but it wasn't all glitz and glamour. Definitely have a ton of battle scars. So that's why I was excited to talk on this episode. So first of all, I want to say, earlier when I said you just didn't know better: it just, you know, happens, I think, to all of us when we pick something,
Starting point is 00:07:57 when we pick up a new technology, a new framework or something, it feels like, man, why am I wasting so much time? And then looking back on what I did in the morning when it's the end of the day, and it's like, man, if I would have known this in the morning, I would have saved so much time. And then obviously when you look back even further, it's like, man, now I need to redo this all again because I didn't know how to do this properly. So I guess these are a lot of these learnings.
Starting point is 00:08:26 Now, tell us a little bit about some examples of what went wrong because you didn't know better. What would you have liked to know, and what should other people that are moving towards the cloud, and find themselves in a similar situation to where you found yourself back then, know? What can you teach them? Yeah. So, I mean, there's the Dunning-Kruger effect, right? The more you know about something, the more you realize you actually don't know anything, right? When you first start touching it, and I'll just pick on AWS today, because I think that's the easiest to talk about, because people know what services I'm talking about.
Starting point is 00:09:10 I sign up with an email and I have access to thousands of dollars per click, depending on what buttons I click, with no Clippy trying to help us out, saying, hey, did you know? It's just sign up and go, right? And so a lot of people, like my predecessor, took the approach of: oh, it's just VMs, how hard can it be? Well, it's actually pretty hard. So where I start when I'm talking with customers, and this is my own story too, is I like to talk about the cloud adoption slash maturity model. It's really a three-phase approach that I've seen people take to the cloud. Step one is data center in the sky. It's probably the worst way to do cloud.
Starting point is 00:10:01 Everything's a static asset. We still have pets. We give servers names that mean something, like, oh, that's web-01, right? And everything is on-demand, and we're just taking on-prem concepts and running them in the cloud. A lot of the time that's a lift and shift. I mean, I'm not saying lift and shift is always bad; sometimes, you know, there's a data center being shut down or whatever that makes you do a lift and shift. Once you do a lift and shift, you start to realize, wow, this is really expensive, and it starts to fall apart on you pretty quick. Or you start to move more into using managed services and starting to actually use
Starting point is 00:10:40 things like auto scaling, and use the cloud for kind of what it was designed for. And then the third and final phase, where I actually try to get people to start on day zero, you know, based off my experience, is doing infrastructure as code, and doing more of the autonomous cloud management that Dynatrace talks about. Because I can't tell you how many customers, and even, you know, ourselves, started out building stuff in the console, and, yeah, it sucks is the nicest way of putting it. Infrastructure as code, you want to start with that on day zero if possible, because you're not ever going to regret it; it's, you know, something that's safe and repeatable. But if you don't do it, when the time comes
Starting point is 00:11:25 for DR to move into another region, you're going to regret it, because you're playing like Indiana Jones with two consoles side by side, right, and hoping that you can recreate what was created months ago with no documentation. So that's kind of where we start. But I mean, some of these stories that I've got... I've got a lot around big data, account strategy. There's just so many things that we did. I mean, keep in mind, I feel like part of the reason we were successful is I had a lot of, you know, leadership that was able to take a chance on me. I was 20-something years old at the time. They gave me a shot. I ended up being
Starting point is 00:12:09 the head of tech out of it. I had a good team that stuck with me and followed me. Five of the guys on my team at Observian are from Ghostery as well. We've all been together two or three companies each. We've got a good team dynamic. You know how it is. You go through the trenches with people,
Starting point is 00:12:26 you get pretty good bonds, right? Yeah. And so, let's just start with account strategy, right? One of the first things that we try to do at Observian is help customers with a multi-account strategy, where they've got, you know,
Starting point is 00:12:40 like a logging account, a security account, a master payer account, maybe production and non-production. Maybe we break it down by business line. But there's a reason. There's a story behind this. So a lot of people in the early days, myself included... I mean, some of the accounts I inherited were: we have, you know, one massive account that has all of our cloud assets for the entire company. And if you're lucky, sometimes people would separate by VPC.
Starting point is 00:13:11 And then the idea was I've got a dev VPC and a prod VPC, so my resources are isolated. But that starts to break down really fast. I mean, there's multiple reasons why this breaks down. When we start talking cost, right? Like, how much does product X cost to host? I can't really tell you. I'm like, well, here's the dev costs, staging, production.
Starting point is 00:13:34 Here's four other products. You know, data transfer is blended. Data storage is blended. And so if we can really break down our accounts into smaller, more manageable units, we get better numbers from that aspect, especially if you're a SaaS startup: cost of goods sold, all that stuff. But the bigger problem is, say a developer's like, hey, I need admin access because I'm trying to troubleshoot something, I'm trying to figure something out. Well, if I'm all in one account and I elevate that person, they're now an admin across the entire environment.
Starting point is 00:14:08 So I don't know how you're going to pass any type of audit; you just gave somebody, you know, prod access. And so when we were at Ghostery, we had these two messy accounts, and we're like, oh, we're going to create two new shiny accounts. And by the time we created the new shiny accounts and we migrated the resources, we realized the mistake. We went from a single account that had everything, all public because it was before VPC, so it was all EC2-Classic, and we ended up just creating two bigger buckets: one was for
Starting point is 00:14:42 our consumer business line, one was for our commercial enterprise business line, but we still kept dev and production in the same account. And when it came time to get audited, that's where we were like, oh, we did this wrong. So we had to redo the redo. There's a big story behind all this that I didn't really jump into, so I'll kind of jump around a little bit. Part of the reason this is a topic I'm so passionate about is we got put in an interesting situation where, you know, we were the tech startup that had the hyper growth, up and to the right. We're hiring people like crazy. People are using our product. We won a DEMO award, DEMO 2014.
Starting point is 00:15:25 It was just great, right? All the good stuff. Well, then one of our big customers, that happened to rhyme with oogle, was no longer a customer. And so we lost a fair chunk of revenue. But through all of this growth, our cloud spend was rapidly growing up and to the right as well. I mean, we went from, I want to say in the early days, $50,000 to $75,000 a month, all the way up to half a million a month. And we just took a significant hit on revenue.
Starting point is 00:16:00 And so we were also at this awkward inflection point where our VC was like, you're in year eight, we do 10-year funds, you need to start looking to sell and be self-sufficient, because, you know, it's time to exit the fund. So we went into fundraising mode, and the writing was on the wall: this wasn't going to be a fun situation. It was going to be either we found funding, or we were going to lay off like 30, 40% of our staff. And this isn't me trying to say that I can predict things, but I looked at the numbers and I saw the cloud spend, and I was like, if we don't get ahead of this cloud spend problem, it's going to be more like 50 or 60% of our people. And long story short, we ended up being romanced into a due diligence. We were at the point where we were about to get acquired. We had joint press releases; it was that much of a done deal. We had a site that had both companies announcing what was going to happen. And then they pulled the deal, like, the day before. And so that was six months of fundraising, and we were running
Starting point is 00:17:11 our costs really high. And again, SaaS startup: we were focused on feature development, not operational reduction. And so it ended up being like, my team went from like 70 people to like 25 people in a day. Wow. And then the company as a whole, we probably hit about 60%. So that's why I talk about this and try to help customers avoid these mistakes. Because I had to lay off a ton of friends that day. And it got to the point where I was telling people, you know, it's got to be some of us or all of us, because there was no other way out. And instead of packing up and quitting and being frustrated, myself and a couple of guys, who are lifelong friends now,
Starting point is 00:17:54 we were able to take that half-a-million-a-month spend and get it all the way down to about $150,000 a month, with a reduced team. And so, that's why I say I've got a lot of topics here. So go ahead and jump in. Yeah, I want to ask you a question on this. So first of all, thanks for sharing that story and being so open about it. But let me understand. So, the cloud costs grew.
Starting point is 00:18:20 Why? Is it because people were just creating cloud instances and resources and services, and not taking care of shutting them down when they didn't need them anymore? Or was it more that you were scaling up your operations because you were obviously very successful for a long time, and then when this big customer dropped, you had no way to scale down, because the architecture didn't support it, or because you forgot about it? Why couldn't you just, if you lose a big customer and you don't need these resources anymore in production,
Starting point is 00:18:58 why couldn't you just scale down, and basically then serve the rest of the customers with the resources that they need, and therefore bring the cost into balance? So, these are all good questions. The easiest way to describe this is: some of our biggest costs were our big data problems. Every day we were producing 25 terabytes of new data, every single day. And the way it was coming in: at first it would come in through these collector instances, then it would get written to S3, and then it would get loaded into a Hadoop cluster,
Starting point is 00:19:39 and it would do some basic MapReduce. Then it would go into another cluster for more fine-grained reduction and aggregation, and then it would go off into, like, a reporting warehouse. And the problem with the nature of our data set: so, you're running a browser plugin, and if you opted in to send data, basically it would send data back every single time that person went to a different website. And basically, different customers would be paying for access to, we called it the panel data, so it's the real user data.
Starting point is 00:20:27 So they could see what third parties were on their website, how it was being impacted, was there data leakage, all of that. And so this was a massive data set, because they'd want historical reports. And so with the business, yes, we lost a customer,
Starting point is 00:20:44 but we didn't lose any consumers, right? And the customers that were paying for access to the consumer data, again, this was all voluntary opt-in. A lot of the data work that we were doing was: when you see an interest-based advertisement online, there's a little blue shark up in the top left corner for AdChoices. We were serving those icons. So we were getting billions of impressions a day. So one customer leaving, yeah, it had an impact, but to the overall data set, not really. We probably went from 27 terabytes a day to maybe 26, because we were basically 25 to 30 depending on the day; I just always average it to about 25. Okay, I guess I didn't fully understand what a customer means, and what they're consuming versus producing in terms of data.
Starting point is 00:21:46 Now I understand better how your service worked back then. And I understand now that if you lose a customer, you're not losing that data, because basically they're just consuming the service. Exactly. Yeah, yeah, yeah. And then, we had an interesting connection with Dynatrace.
Starting point is 00:22:07 So some of the people at Ghostery came from Gomez, and as you know, Gomez was one of your guys' acquisitions. And so we also had our own third-party scanner. We weren't trying to be an APM by any means, right? What we were doing is, we were focusing on the problem of third-party JavaScript on your website that you're not in control of, and what side effect that has on your website.
Starting point is 00:22:40 Does it make your page slow? Are they leaking data? Do you even know about them? It's data governance, right? And then we were on kind of the cutting edge of GDPR and some of that stuff. Our in-house counsel was part of, you know, the conversations with the regulators that drafted GDPR. And so we had a web crawler, too, and it was indexing third-party JavaScript all across the Internet. So we were probably scanning 40 to 50 million web pages a week. The scale that we were doing some of this stuff at was crazy.
Starting point is 00:23:15 One of the things I talk about a lot is: big data causes big problems. We were dealing with so much data, and the business and the demand... like, everybody wants the same experience, right, that they get on Uber or Google. I click on an email and it pops up instantly. Well, yeah, that's kilobytes of data. And so I would have requests from, like, our customer reports team, or customers buying data feeds: they want to be able to run a historical data query for, you know, a two-year period across 10,000 domains, and they'd be upset that it would take like 20 minutes. It's like, well, do you know how much data that is? And so then we had this drive of, hey, it needs to be: I can query whatever I want, I'm not going to
Starting point is 00:24:07 tell you when, and it just needs to be snappy and on demand. So that pushed us into starting to do things like standing up our own Presto clusters, and we really started pushing really hard there. And at the end of the day, our Amazon account rep was like, hey, stop what you're doing. Come help us develop Athena. So we were one of the first customers on Athena. We were actually helping close bug tickets and stuff. So we were in production on Athena before it was generally available. We also did the same thing with Aurora. So, getting more into direct failures. We had a data pipeline project where the original data pipeline
Starting point is 00:24:52 that was going from landing, going to one Hadoop cluster, going to another Hadoop cluster: those nodes were on 24/7, whether they were being used or not. And in some cases, we had so many jobs stacked up that we were having to add nodes.
Starting point is 00:25:08 And sometimes, if one job was delayed, we were taking longer than 24 hours to process 24 hours of data. So you can see how that's a problem really quick. We would basically bump up to 18 hours every single day, so it was getting pretty scary. The way we approached that was: hey, let's start spinning up dynamic EMR clusters. We were trying to basically spin up an EMR cluster, use it for just that one moment, and then get rid of it. That was kind of the model we started with, and eventually we ended up in Athena.
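A minimal sketch of that transient-cluster pattern with boto3. The cluster name, instance sizes, and S3 job path are illustrative, and the default EMR roles are assumed to already exist; the point is the KeepJobFlowAliveWhenNoSteps flag, which makes the cluster exist only for its steps and then terminate itself.

    # Sketch: a transient EMR cluster that auto-terminates after its steps finish.
    # Assumes the default EMR roles exist; names, sizes, and paths are illustrative.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="nightly-aggregation",                  # illustrative name
        ReleaseLabel="emr-6.15.0",
        Instances={
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.2xlarge", "InstanceCount": 10},
            ],
            # The key flag: tear the cluster down once there are no more steps.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "daily-mapreduce",
            "ActionOnFailure": "TERMINATE_CLUSTER",  # fail fast, don't linger
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/aggregate.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Launched transient cluster:", response["JobFlowId"])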
Starting point is 00:25:39 But as we were going through this project of redoing our data pipeline, I mean, no sacred cow existed. The old cluster that we were on was like Cloudera 0.2, and this was, at this time, I want to say 2014, 2015. So it was way out of date. And so, to your original question of how much of it was not knowing what we didn't know, and how much of it was intentional failure: a lot of it was tech debt.
Starting point is 00:26:12 I mean, there were multiple tech regimes that had gone through there, and it was kind of a mess. And as we were doing this, that whole infrastructure-as-code thing I keep bringing up: we had somebody, I'm not going to name their name, they know who they are though, decide they wanted to spin up a Presto cluster to do some tests on Presto, because they'd read a lot of good things about it. And I was like, yeah, go for it. Well, I didn't realize that I had just basically told them to go spin up a 200-node Presto cluster on a Friday. And I was like, okay,
Starting point is 00:26:46 sweet, do your test and kill it. Well, they didn't kill it. I came in Monday morning and I'd spent thirty thousand dollars on this cluster. And I was like, wait, what? So literally, they spun up like a 200-node Presto cluster, they ran tests for a few hours, and then they just kind of forgot about it. Like, oh, it's just 200 nodes, how much is that going to cost? Well, apparently ten thousand dollars a day. Oh, man. So again, when I say failures: that's just raw negligence, right? Right? That's my bad. The team's bad. We should have had billing alarms.
Starting point is 00:27:28 There's so many things that went bad. Plus, we were working crazy hours, tons of stress. But, I mean, these are the types of failures I'm talking about. People don't talk about this. When we got to the end, though, if I just told you I took our AWS bill from 500K to 100K, you'd be like, oh, wow, you're amazing. It's like, ah, not exactly.
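For reference, the billing alarm he wishes they'd had is only a few lines of boto3 today. A sketch, assuming an existing SNS topic (the ARN and threshold are placeholders); billing metrics live only in us-east-1 and require "Receive Billing Alerts" to be enabled on the account.

    # Sketch: a CloudWatch billing alarm that pages an SNS topic when estimated
    # monthly charges cross a threshold. The topic ARN and threshold are
    # placeholders; billing metrics exist only in us-east-1.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="estimated-charges-over-10k",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,                    # billing metric updates every ~6 hours
        EvaluationPeriods=1,
        Threshold=10000.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],
    )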
Starting point is 00:27:53 Let me ask you, though, with that situation with Presto: your 200 nodes, people not shutting things down. Have you figured out a way to mitigate that? Have Amazon or other cloud providers put in any guardrails? I know you mentioned Clippy early on, which I appreciate the Clippy reference, but is there anything since then that you've figured out, either from your own best practices or anything that the cloud providers make available to you, to avoid that kind of mistake now? Right, because I can see that happening early on in the cloud stuff, but there's got to be something at this point. Well, there's a lot of things, right? I feel like anytime I turn around there's a cloud-
Starting point is 00:28:37 something tool, you know: CloudHealth, Cloudability, CloudCheckr. There's a bunch of them, right? So there's definitely cost tools. The problem with cost tools, though, is they're analyzing cost after you've already spent it. Service limits exist for a reason. But going back to that multi-account strategy: if I have smaller accounts that are single-purpose or project-based, I'm going to have service
Starting point is 00:29:08 limits that are probably more reasonable. Because we had that legacy account, and then when we moved over to the, you know, new shiny monolith (that's basically what we did, if we're being honest), we had such high spend, we had to push our service limits up so high. And we were doing so much burst computing for things like web crawlers that spinning up 200 nodes wasn't out of the ordinary, right? Yeah.
Starting point is 00:29:34 But to answer your question of, you know, what have cloud providers done: I mean, Amazon's changed a lot in the last six, seven years. You can do OUs, where you've got, you know, your master payer account
Starting point is 00:29:47 and then you've got an AWS org, and you can say, like, this was a dev project. So first off, we should probably never have let somebody spin up a 200-node, air quotes, "dev" cluster. We probably should have limited that to more like 30.
Starting point is 00:30:00 But again, it's really hard to run tests against an almost-petabyte data set in dev, right? At some point, you've got to run a real test so you know it's going to work. But you can do sandbox accounts that have budgets. You can limit the services that are turned on. So, say, for example, I've got a dev OU. This is what we do at Observian: every single employee has their own sandbox account, and we run reports on them and can see the spend. We kind of leave it pretty open, but other organizations we work with lock it down. One of them has it so that everybody with a sandbox account has a one-thousand-dollar-a-month budget, so the second they hit $1,000, it stops. They can't do anything more. Right.
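A sketch of that sandbox cap with the AWS Budgets API in boto3; the account ID and topic ARN are placeholders. This version alerts at 100% of actual spend; the hard "they can't do anything more" cutoff would be wired up separately, for example with a budget action that attaches a deny-all policy on breach.

    # Sketch: a $1,000/month cost budget for a sandbox account, alerting an SNS
    # topic at 100% of actual spend. Account ID and topic ARN are placeholders;
    # hard enforcement on breach would be a separate budget action.
    import boto3

    budgets = boto3.client("budgets", region_name="us-east-1")

    budgets.create_budget(
        AccountId="123456789012",
        Budget={
            "BudgetName": "sandbox-monthly-cap",
            "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=[{
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 100.0,        # percent of the budgeted amount
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{
                "SubscriptionType": "SNS",
                "Address": "arn:aws:sns:us-east-1:123456789012:budget-alerts",
            }],
        }],
    )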
Starting point is 00:30:22 Also, by doing infrastructure as code, lots of Terraform, we've worked on stuff where, through Lambda automations, you can have a tagging strategy that you use for programmatic things. Like: hey, run this Lambda; anything that doesn't have a persist tag, destroy it. I've seen people do that.
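A minimal sketch of that persist-tag reaper as a scheduled Lambda in Python with boto3. The "persist" tag convention is taken straight from the conversation; the running-state filter and terminate-on-sight behavior are assumptions for illustration, not anyone's actual tooling.

    # Sketch: scheduled Lambda that terminates running EC2 instances lacking a
    # "persist" tag. The tag convention is an assumption from the conversation.
    import boto3

    ec2 = boto3.client("ec2")

    def handler(event, context):
        doomed = []
        paginator = ec2.get_paginator("describe_instances")
        pages = paginator.paginate(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        for page in pages:
            for reservation in page["Reservations"]:
                for instance in reservation["Instances"]:
                    tags = {t["Key"] for t in instance.get("Tags", [])}
                    if "persist" not in tags:
                        doomed.append(instance["InstanceId"])
        if doomed:
            ec2.terminate_instances(InstanceIds=doomed)
        return {"terminated": doomed}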
Starting point is 00:31:11 There's also: hey, if it's not in the Terraform master branch, destroy it, because we're just going to run terraform apply with whatever's in the master branch, and anything that's not there, we're getting rid of. You know, like, if you just run a tf destroy in your dev environments, if you enforce your team to work with infrastructure as code by always blowing environments away, they get in that practice. Like, we've got infrastructure as code: I could have a database get deleted out from under me and I don't care, because I'll just run my Terraform again, right?
Starting point is 00:31:46 So does that make sense? Yeah, yeah, yeah. I know it's kind of all over the place. And I think you could also add, I mean, I don't know if it's all in the same realm, but I'm just trying to think with Andy's hat on, right? Because Andy does a lot of these pipelines and workshops and things. If you're going to, and you mentioned infrastructure as code,
Starting point is 00:32:06 if you're going to run this test and you create that, you know, 200-node system with code, you can also have, at the end of your test, another script run that tears it down. Again, that puts it on the individual developer to make sure that's best practice, but that could be an organizational practice that gets spread and put in at an organizational level. But what you're mentioning is more of a brute-force kind of, let's go after everyone who's not following the rules: we're just going to go ahead and delete it. It's almost the strategy I like
Starting point is 00:32:47 but never quite have the heart to do: the one I always want to take with my daughter and her messy room. It's like, hey, if I go in your room after you say you've cleaned it, anything on the floor I'm just going to throw out. Similar kind of thing, as opposed to her actually cleaning up and doing it. Yeah, exactly. I mean, at some point we had a system in place where, when a resource came online, if it wasn't tagged, we would notify you, because we could at least see who it was. That's the nice mode, right? Yeah. But when I'm starting to work on cloud cost reduction,
Starting point is 00:33:17 because every 10 grand I save is one more employee I get to keep, I don't really care. I'm not being nice at that point. I'm doing what it takes to keep my team together. So if you violate it, I delete your resource and I don't care. If you wanted it, you should have done what I told you. That's kind of the mentality we got into. And I mean, it takes a lot of discipline to reduce your spend by that much in under a year; we did all of that in like six to seven months. So not only is it a huge number, but that's all we did. Some of us were literally sleeping at the office. Back to the big-data-causes-big-problems stuff: we had so much data, we were notorious for killing database engines, right? So another one, on that same project we were working on: okay, we failed with the Presto thing, we ended up in Athena, but then it was like,
Starting point is 00:34:06 well, we've got to tackle these web crawlers. They're written in Haskell, nobody here is a Haskell dev, and customers are asking for new features. So one of the guys that works for me now (I don't know if it was the best thing I ever did for him or the worst thing, but I feel like he gained a lot out of it), I assigned him the task of re-platforming our web crawlers. And, I mean, just to give you an idea of how cheeky this code was: there was a method called Batman, and all of the variables were n-a, n-a-n-a, n-a-n-a-n-a. Oh my gosh. That's the guy who wrote it, right? So the guy who wrote it, he kind of didn't care about what the next person behind him was going to see, is the way we looked at it. So yeah, I remember I came around out of my office, and he's like, hey, check this out. And he showed me the Batman method.
Starting point is 00:35:06 And I was like, wait, what? It's actually in production? He's like, yeah, what am I supposed to do with this? I'm like, well, you're supposed to reverse-engineer it and figure out how it worked. But that's useless; I guess you're just going to rewrite it. So he had to figure out how to control Chromium and all that. But where I'm going with this is: one of the choke points that we had was the database.
Starting point is 00:35:25 So this web crawler would be going out, scanning the site, getting a bunch of data. If we held onto it too long, now it's a massive payload. And we had all these web crawlers coming up dynamically; eventually this led to something I built with Spot Instances, and this was before Spotinst was a thing, so I'll get to that in a minute. But we were writing it all to a relational database. Well, that relational database became a choke point. We got the great idea of,
Starting point is 00:35:55 hey, let's take out the relational database and let's start writing these files to S3. Well, yeah, we fixed the database problem, but we destroyed our big data pipeline. So, the Hadoop processes that we were working with... at the time, that one might have been post-Athena, I don't really remember, but I can tell you what the problem was.
Starting point is 00:36:21 Instead of having a relational database that then got batched into chunked files that were easy to read and process, we had millions of tiny little files. And MapReduce is not good at reading tiny little files. So we took the problem of, here's a database that's causing us to slow down and throttling our throughput, and now, hey, we can scan really quick, but we can't process the data that we just collected. This is one where you solve one problem and you make the other problem bigger. That's why I said big data causes big problems. Yeah, so you shifted the real problem to another piece of the architecture in the end. Yeah. And so what happened? We had to batch those files back together just so you could process them.
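The fix is compaction: stitching the tiny objects back into chunks big enough for MapReduce to read efficiently. A toy sketch of the idea with boto3, with bucket names, prefix, and the 128 MB target all illustrative; at their scale this would live in the pipeline itself, for example as an S3DistCp step on EMR.

    # Toy sketch: compact many small S3 objects under a prefix into ~128 MB
    # batches so MapReduce reads a few big files instead of millions of tiny
    # ones. Names and sizes are illustrative; real pipelines would use
    # S3DistCp or a compaction stage instead of this single-threaded loop.
    import boto3

    s3 = boto3.client("s3")
    SRC_BUCKET, SRC_PREFIX = "crawler-raw", "2016/03/01/"
    DST_BUCKET = "crawler-compacted"
    TARGET_BYTES = 128 * 1024 * 1024

    def flush(parts, n):
        key = f"{SRC_PREFIX}part-{n:05d}"
        s3.put_object(Bucket=DST_BUCKET, Key=key, Body=b"".join(parts))

    def compact():
        batch, batch_size, batch_num = [], 0, 0
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=SRC_PREFIX):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=SRC_BUCKET, Key=obj["Key"])["Body"].read()
                batch.append(body)
                batch_size += len(body)
                if batch_size >= TARGET_BYTES:
                    flush(batch, batch_num)
                    batch, batch_size, batch_num = [], 0, batch_num + 1
        if batch:
            flush(batch, batch_num)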
Starting point is 00:37:05 Hey, I wanted to make one comment on Batman, because I find it interesting. But, you know, case in point, I mean, maybe this was really an interesting developer. It could also be that he just obfuscated his code in case it was running in production,
Starting point is 00:37:25 because the n-a-n-a variables could also be code obfuscation. Maybe he threw away the original code. I'm just saying, maybe he was not that evil, maybe he just used code obfuscation.
Starting point is 00:37:40 No, it was literally called Batman. I was going to say, come on, Andy's trying to cut him some slack. The thing I get out of that, though: he rage-quit, and kind of sent a company-wide email announcing it, like the double middle finger walking out the door when he left. So, okay. I would love to see somebody make a Batman framework where, if you write something that's poorly performing, it goes vigilante on you, you know, it gets you back, it comes back for vengeance, with some kind of crafty thing. But yeah, that's pretty awesome. But I would like to, Justin, you know, besides
Starting point is 00:38:18 this podcast, I would like to connect you with some of our folks at Dynatrace that keep an eye on costs, because we've also built a couple of tools internally. As you said, there are always new tools out there. We also built some tools that optimize the usage and give recommendations, depending on the workload, on what type of EC2 instance types we should use in order to save costs. So they also reduced costs massively with that approach. And also we are, just as you figured out over the years,
Starting point is 00:38:54 now using proper tagging, so that instances get terminated in case they are no longer needed, or in case they are reaching their expiry date. So if somebody launches a service or an EC2 instance, they have to tell the system, through tags, what the expiry date is. Do I need it for a day? Do I need it for a week, or longer? And there might even be an approval process if it's longer than that. But by default, it's short-lived instances that then automatically get cleaned up.
Starting point is 00:39:24 Or maybe you, as a developer, at the end of the day, you get an email or a Slack notification saying: hey, it's almost the end of the day, and I see you still have two instances running that you launched in the morning. Do you still need them? Yes or no? And if you don't respond, that means you're probably no longer here and you don't care, so I'll just kill them.
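A sketch of that expiry-tag pattern in Python with boto3, assuming an "expiry" tag holding an ISO date and an SNS topic for the notifications; both conventions are assumptions for illustration, and a Slack webhook could hang off the topic.

    # Sketch: scan for instances whose "expiry" tag date has passed; notify via
    # SNS, then terminate. Tag name and topic ARN are assumed conventions.
    import boto3
    from datetime import date

    ec2 = boto3.client("ec2")
    sns = boto3.client("sns")
    TOPIC = "arn:aws:sns:us-east-1:123456789012:instance-reaper"  # placeholder

    def reap_expired():
        paginator = ec2.get_paginator("describe_instances")
        pages = paginator.paginate(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        for page in pages:
            for res in page["Reservations"]:
                for inst in res["Instances"]:
                    tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
                    expiry = tags.get("expiry")
                    if expiry and date.fromisoformat(expiry) < date.today():
                        sns.publish(
                            TopicArn=TOPIC,
                            Subject="Terminating expired instance",
                            Message=f"{inst['InstanceId']} expired {expiry}",
                        )
                        ec2.terminate_instances(InstanceIds=[inst["InstanceId"]])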
Starting point is 00:39:43 Or if you respond and say you need it longer, then the question is, do you really need it longer? I mean, there's a lot of interesting things we can do, and it's great that we share these experiences and what can be done. It's also great that there are a lot of tool vendors out there building new tools. And, just as you probably did, we've built our own tooling internally.
Starting point is 00:40:05 It's just very good that we share these stories, so that people that start from scratch, or start new with this technology, are not making the same mistakes we did. Yeah. That's literally why Observian exists. Yeah. I mean, I try to help customers now to actually use tools, don't write your own. I mean, I've been in the cloud space for about 10 years, so obviously different eras of the cloud have had different problems.
Starting point is 00:40:35 As far as what you were just talking about, with, hey, I spin something up, I put a tag on it, and then I get a notification: yeah, you literally described the cloud maturity model, right? That's a very mature cloud practice. But if you have a bunch of people who've never done the cloud, who are treating the cloud as a data center in the sky, they don't even know that they can do that. Let's be honest, they just think AWS is VMs and a little bit of S3 and maybe a database. Yeah. It's also, you know, we're currently, as you know, working on Keptn, which obviously runs on Kubernetes.
Starting point is 00:41:16 And when we started, you know, it was like, just spin up another GKE cluster, who cares, right? It's just a GKE cluster. What can it cost? And then we realized after a while, well... because we do have, just as you said in the beginning, our individual development team accounts
Starting point is 00:41:35 where we can keep track of costs. And we're fresh and new to the Kubernetes world, especially these managed services. You don't think about the cost, because you need a server and it's just three clicks away, or an API call away, and the server is there. And then you sometimes get the wake-up call a month later: oh my God, this was really like $2,000.
Starting point is 00:42:01 And I only used it for one demo at the beginning of the month; maybe I should be a little more careful in the future. Yeah, delete those node pools, man. Exactly. Yeah, just out of curiosity too, with what both of you are saying, right? I imagine a lot of people start moving to the cloud and they start thinking, we want to do things right.
Starting point is 00:42:22 Besides hiring someone like Observian, right, to help explain and set people up on this, are there, like (I've been asking this question on a lot of podcasts), are there books, or how do people get started on their own if they needed to? Or is it just that you have to know somebody, or happen to catch the right webinar or something, to start understanding these things? Or are there, you know, best-practices-for-managing-your-cloud-costs kinds of guidelines out there?
Starting point is 00:42:53 So, a couple of answers. I think more people being willing to talk about failures, like we're doing today, is going to help a lot of people, assuming people are listening to this stuff. The other is, don't get me wrong, there are best practices and white papers all over Amazon's website, but that assumes people take the time to read them. Okay, but at least they exist. Okay, a little bit. They're more to tell you how things work, or, you know, best-practice design. I don't feel like there's a lot of emphasis on how to do cost optimization; it's more teaching people how to architect. But one of the things that I've learned, that, you know,
Starting point is 00:43:39 if I could go back in time, I wish I would have done, is with every person on my team: I mean, I don't even care if you pass the AWS certs, but we should have had everybody at least do the Developer Associate, and maybe the Solutions Architect Associate. I'm not a huge fan of certs; it's just a piece of paper that says you can pass a test. Yeah.
Starting point is 00:43:59 But the test prep that you go into when you're doing an AWS SA Pro or SA Associate exposes you to a lot of different things about the platform, and how to design things. And it gets you thinking differently, earlier. And I think just that alone is worth it. I don't even care if they pass the test; just give them that exposure. Because, I mean, I can tell you crazy stories about when we first got started, because we had that mindset of, oh, it's just a data center in the sky, right? We were hosting our own RabbitMQ services. We had a Microsoft high-performance computing cluster,
Starting point is 00:44:38 like all this stuff that we just didn't need. SQS is sitting right there; it's one of the base services, and we had our own queuing system that was self-hosted. Why? I also run into a lot of customers, and I'm going to try to say this the nicest way possible, but have you seen architectures and systems that look like resume-building projects? They're not, like, actually a thing. I love that description. I've never used that description. I like the description. Yeah. Because I was listening to your guys' episode with Kelsey Hightower the other day, and one of the things that he said, where I just want to hug the guy because he's so right, is:
Starting point is 00:45:25 don't use Kubernetes for everything. It's gotten so much hype; it's like everybody's got this K8s-shaped hole and they're just trying to stamp out everything with it. It's like, guys, there's times to use serverless, there's times to use an EC2 instance, there's times to use Kubernetes. Don't get me wrong, I love me some containers, but there's more to life than containers. Container orchestration is hard. That is another thing that we botched in the early days. So, right as I was on my way out and just starting Observian,
Starting point is 00:45:56 one of the projects I was working on... because we were, you know, a monolith at the time, but not really a monolith. We had multiple monoliths that ran on the same web heads and the same API heads, because it was a .NET shop. So we were running, like, you know, Web API. So I was starting to go down the path of, how do I stop having a web farm and an API farm,
Starting point is 00:46:21 and start having app farms? And so we started moving our JWT tokens; we were trying to get to a single auth platform, so we could have microsites. This is how you get to microarchitectures. But keep in mind, we have production data going through there. So I was getting to the point where, you know, the model was going to be: we have a load balancer in front of every application, just for that application. So this is literally, I don't need to describe microarchitectures to you guys, but this is why they exist. If the login service is getting hammered,
Starting point is 00:46:52 but the reporting service is not being used, or vice versa, only scale up what's being used. Let's get more login boxes; let's not spin up the entire system because it all runs on the same web farm. And so that was setting us up for containers. And I was on my way out, so I didn't get big into containers there, but they started going down the path of, let's do some containers. And this is one that I run into with customers all the time. I mean, I tell customers: if somebody tells you container orchestration is easy, they're lying to you, they've never done it. Because it's not easy.
Starting point is 00:47:26 Like, it's not rocket science, but it's difficult, right? It's a lot more than going to docker.com, reading a tutorial, downloading it, and, hey, look, I've got an image on my laptop. I've seen many devs containerize their projects, and then they're like, hey, look, we have containers. It's like, well, yeah, that's not how you're going to run it in production, right? So we botched that one. I don't have a good gory story from that one, because I was on my way out; we'd already sold the company. That was part of the Cinderella story, right? Doing really well, get some heartbreak, reduce the spend, and we were able to exit the company.
Starting point is 00:48:06 It's still alive and kicking today. And so, I don't know where I was going with that, other than, when Kelsey Hightower was saying don't use Kubernetes everywhere: I'm seeing kind of the same thing happen in the serverless space, because I talked about serverless last time I was here. They see a shiny object and they want to use it for everything, because
Starting point is 00:48:32 it's like their new tool. Or the opposite: the use case doesn't really fit, but they want to say that they've done it, for their resume. So now I've got some tiny little project with a massive Hadoop cluster, or a massive Kubernetes cluster, or I've got something that's pretty complicated and hard to maintain, because we took it full serverless when containers were probably the better option, right? Yeah. So that's kind of one of those things. This isn't really a failure story, but these are, in my mind, failures that I see out in the wild when I'm interacting with customers: just take a second to understand the use case and really design the system. That's one of the things that I like about cloud: I don't have to make everything the same. On-prem, yeah, I might be married to, I only get VMware and SQL Server. It's like, okay, that
Starting point is 00:49:23 kind of sucks if I want to do a big data project. But in the cloud, I don't have that limitation, right? So I'm kind of going with that, too. It's one of the other things, and this is kind of an obvious one now, but back in the day it really wasn't: AWS IAM roles. People need to promote and preach that. If you learn nothing else today, learn that.
Starting point is 00:49:47 Do not have API keys, you know, where you're producing console users' keys that you put in a config file in your code, and then you deploy your application. Just give the host an IAM role. It makes your life so much easier. Well, we didn't do that the first time, obviously. We were like, oh yeah, API keys, put them in our app config file, what could go wrong? Well, somebody checked one of those keys into a public repo on GitHub. Now, we were fortunate enough that nobody
Starting point is 00:50:20 hijacked it and turned us into a Bitcoin farm; we caught it pretty quick. But now, instead of just, you know, killing that key and making a new key (well, I mean, if we'd had a role, it wouldn't have been exposed anyways), we had to go figure out which applications were using this key, and update and redeploy them all. We were literally redeploying code because of a configuration issue. Yeah.
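The point of the role-based approach is that the code carries no keys at all. A sketch of the contrast, assuming the host (EC2 instance, Lambda, or ECS task) has an IAM role attached; the SDK's default credential chain then supplies short-lived credentials on its own.

    # Sketch: with an IAM role attached to the host, the SDK's default
    # credential chain picks up short-lived credentials automatically.
    # Nothing to check into GitHub, nothing to rotate by hand.
    import boto3

    # Bad: long-lived keys pasted from a config file (the failure mode above).
    # s3 = boto3.client("s3",
    #                   aws_access_key_id="AKIA...",
    #                   aws_secret_access_key="...")

    # Good: no credentials in code; the instance profile supplies them.
    s3 = boto3.client("s3")
    for bucket in s3.list_buckets()["Buckets"]:
        print(bucket["Name"])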
Starting point is 00:50:56 And if you remember, last time I was on, we were talking about secrets management, and that's a topic that I said, hey, we should dive into. So I'm not going to get into a full secrets management conversation, but there are tools like SSM Parameter Store, AWS Secrets Manager, HashiCorp's Vault. I'm just saying, use them, because when you start doing things at scale, the risk is very high. One of the things about going into these microarchitectures: in a monolith, I've got one config file that might have 50 secrets, maybe 100 if it's a bigger app.
Starting point is 00:51:22 If I break that down into microservices, I now probably have a hundred config files, and I have two to three secrets in each. It's, like, secrets overload, right? And you've got to protect those secrets, because otherwise you get ransomware, you get Bitcoin miners.
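A sketch of what "use them" can look like with SSM Parameter Store: each microservice fetches its two or three secrets by path at startup, decrypted on read, instead of shipping a config file. The parameter paths are illustrative, and the host's role needs ssm:GetParameter plus decrypt access on the KMS key.

    # Sketch: pull a microservice's secrets from SSM Parameter Store at
    # startup instead of shipping them in a config file. Paths illustrative.
    import boto3

    ssm = boto3.client("ssm")

    def get_secret(name: str) -> str:
        resp = ssm.get_parameter(Name=name, WithDecryption=True)
        return resp["Parameter"]["Value"]

    db_password = get_secret("/reporting-service/prod/db_password")
    api_token = get_secret("/reporting-service/prod/partner_api_token")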
Starting point is 00:51:58 And we were literally in a situation where we were growing through our cloud maturity. And yeah, we had an intern. It was an intern that actually checked that in. It's always an intern. It's always an intern, I tell you. Well, then at the same time, though... Was it an intern named Justin? No, definitely not. I do bigger failures, like $30,000 test clusters. Come on. So, around that same time,
Starting point is 00:52:19 I think Visual Studio had a bug where, by default, if you used the Visual Studio extension, it was making repos public. There was a news article where somebody's account got compromised, and Microsoft and Amazon worked together to make sure that guy didn't have to pay for all that. I don't know, man.
Starting point is 00:52:35 I'm not trying to throw an intern under the bus. It's just the facts. It sucks. But what scared us, here's what was so crazy: there's some company out there that was indexing GitHub. They found the credentials and they notified us within like five minutes. And so the thing that keeps me up at night is: if the good guys found it in five minutes, how many bad guys found it? Yeah.
Starting point is 00:53:02 And so that could have been a huge disaster. We got lucky on that one. And just to support that intern who had the problem: you know, one of the best things that can happen to you early on is a problem like that, because you'll probably never make a mistake like that again. Unless you're just really, you know, someone who does not learn from mistakes, and there are people like that out there. But whenever you have one of those situations, you're going to be looking a lot more closely at everything you do from that point on. And it's a great lesson to learn early. So hopefully you didn't go too hard on them.
Starting point is 00:53:42 Definitely not. It's one of those things where, if you fail and you admit you failed, like, I'm good, let's just fix it. But if you try to hide it, I'll fire you. So, to that point: one of my early, early jobs, when I was a junior DBA. I wasn't the one that hit F5, but I could have been. Remember the practice of, like,
Starting point is 00:54:06 you've got your database Management Studio, and, like, you connect to the production instance, oh, and then this other tab might be connected to dev? The guy sitting next to me, who was mentoring me and teaching me, goes to run a query, and all of a sudden a bunch of expletives come from his direction. And I look over at him.
Starting point is 00:54:25 I was like, what happened? He's like, I just dropped a table in production. I was like, oh, which one? He's like, THE table. I mean, this is a failure that's not even related to cloud, but it's a failure that was a life lesson. Like you were saying, I'm now trying to be super careful
Starting point is 00:54:43 anytime I'm doing this type of stuff. But we had a bunch of inbound claims coming in, and this was the main table that tracked what status they were at, because we were a claims clearinghouse at the job I was at. Yeah. And they dropped that. So it was like, hey, get pizza and sleeping bags; we're not going home until this table gets recovered. So I've seen those happen firsthand. But yeah, they stick with you; I can recall a bunch of those. Hey, Justin, I mean, I think we could probably talk on for hours and hours, because you're a treasure trove of... I don't want to say a treasure trove of failures, but a treasure trove of wisdom. You've just failed so much in life.
Starting point is 00:55:28 No, it's a compliment. You're so good at failing, Justin. No, but this is phenomenal. I would really love, as I said earlier, to get you in touch with our internal teams here at Dynatrace, who are also constantly, you know, increasing our maturity model. I think we are already pretty mature, based at least on the feedback that you gave me earlier when I explained some of the things we are doing, but there's always more we can do, and I'm pretty sure they will be excited to talk to you. And I'm pretty sure there's more people out
Starting point is 00:56:03 there that are now listening to this podcast who also want to get in touch with you. So could you quickly repeat how people can find you? What's the best way to get a hold of you? Yes, so observian.com, that's o-b-s-e-r-v-i-a-n.com, is our website. I mean, if you want to email me directly (this is probably a bad idea), it's justin@observian.com. You can also hit me up on LinkedIn. It's Justin Donohoo. Pretty unique last name. D-O-N-O-H-O-O.
Starting point is 00:56:42 Yep. It's pretty easy. But yeah, this is one of those things where, you know, we just like to help people. I kind of found my calling doing this stuff. When I was at Ghostery, I was kind of going through some, you know, rough patches. I got to the point where I was so dedicated to that job. I had so much fun, and it was probably one of the best places I've worked. Got a lot of lifelong friends out of it. It was just one of those experiences where, by the end of it, I was like, yeah, this is what I want to do.
Starting point is 00:57:11 And so, obviously, I don't do it for free, but if I was independently wealthy, I totally would. I just like helping people in the tech space. When people ask what I do for a living, I'm like, I'm kind of like a tech therapist. I show up, kind of understand people's use cases, what they're doing. Sometimes it's culture-related; like, I've got a customer that's going through a culture issue right now, and there's some fun separations with some Jira
Starting point is 00:57:37 projects. It's going to be fun. But I don't know, it's just what I like to do, so I'm always willing to help. So if somebody wants to jump on and have a chat, I'm down to do that anytime. I'm pretty sure we'll call you up again for one of the future episodes, because it was extremely great last time when we had you on serverless, and now this was just phenomenal
Starting point is 00:57:58 as well. We need to negotiate some sneakers, though. Sneakers, or some Keptn swag, or whatever you want. One of your guys was out in my office. We did a Keptn workshop and it was pretty fun. And he showed up without swag again. And, like, I mean,
Starting point is 00:58:22 well, he came with two t-shirts to, like, a 12-person workshop. I did con him into giving me some Keptn stickers, and I made him put an Observian sticker on his laptop, because I put a Keptn sticker on my laptop. So, I mean, it wasn't a waste. It was definitely great. Yeah. It's one of my favorite things to do right now. That's awesome to hear.
Starting point is 00:58:37 So that was Juergen, probably, right? And he asked me, right before he went on site, you know, is there more swag? And it was just at the time when we were really low on swag, especially after Perform. But there's more coming. So don't worry, you'll get your swag. You get your shirts, you get your sweaters. And thanks for helping us on that project as well. I thought you said shorts, Keptn shorts. That'd be great. Yeah, maybe shorts. That'd be great. Maybe. Yeah.
Starting point is 00:59:06 You know what you need to do, Andy? You need to create, for whoever's going to be the biggest Keptn advocate and who has the greatest stories and all that, you need to have a single pair of Keptn Lederhosen. Oh, look at that. That somebody can get. Yeah. We are actually thinking about the Keptn community champion, someone who is constantly contributing to the Keptn community on a regular basis. And then champions get a little more.
Starting point is 00:59:37 Lederhosen sounds like an awesome idea. Well, when Keptn uniforms get released, make a Keptn uniform that's something ridiculous, that's embarrassing, and make them wear it. Andy will do it. All right. Justin, thank you so very, very much. And listeners, thanks for... you know, we've been having some audio challenges, I think, with everybody using up bandwidth at home. Our traditional means of recording wasn't working today, so we had to make a last-minute swap.
Starting point is 01:00:10 So thanks for bearing with us and hanging in there. Hopefully things will settle down with the audio. Justin, thank you so much for joining us, and I'm glad you're not sick anymore. I remember when I saw you at Perform, you were sick, and then when I talked to you after, to set this up, you were like, I got sick again. So hopefully you're out of the woods; you've done your time. Yeah. I've just got one more last thing to say
Starting point is 01:00:37 before we call this one quits. Oh, yeah. This is one that I think everybody should just do. It seems obvious, but design for failure. Like just assume it's going to fail. It's going to happen. So just a quick side story. I promise this will be quick.
Starting point is 01:00:53 One of our products that I was telling you about, the one doing the compliance stuff, supported about $20 million a year in revenue. And when we rewrote it to work with a new data pipeline, we made sure that we designed the system to have attached storage that could last up to seven days with no S3. Because it was taking the data in, and it would take out people's personally identifiable information before it wrote it. Well, if S3 was down, we wanted to write locally and then push it in.
Starting point is 01:01:27 And, you know, people were like, oh, but S3's got 11 nines. You remember that day S3 went down in Virginia? Virginia was our production region. So we got lucky. That was not us being geniuses; that was us just taking the design-for-failure approach, because it's going to happen. We could have lost thousands of dollars for every minute S3 was down. Instead, when S3 came back up, it pushed it all right in, like it never even was down. It was awesome. So just design for failure. It's going to happen. Yeah.
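A sketch of that buffer-and-drain shape in Python with boto3; the bucket and spool directory are illustrative. Writes fall back to local attached storage when S3 errors, and a periodic drain replays the spool, so when S3 comes back it all gets pushed right in.

    # Sketch: write-through to S3 with a local spool on attached storage as
    # the failure path, plus a drain that replays the spool once S3 recovers.
    # Bucket name and spool directory are illustrative.
    import os
    import uuid
    import boto3
    from botocore.exceptions import BotoCoreError, ClientError
    from urllib.parse import quote, unquote

    s3 = boto3.client("s3")
    BUCKET = "scrubbed-panel-data"   # illustrative
    SPOOL_DIR = "/mnt/spool"         # attached storage sized for ~7 days

    def write_record(key: str, payload: bytes) -> None:
        try:
            s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
        except (BotoCoreError, ClientError):
            # S3 is down or erroring: land the record locally and move on.
            # "|" never appears in the quoted key, so it is a safe separator.
            name = f"{quote(key, safe='')}|{uuid.uuid4().hex}"
            with open(os.path.join(SPOOL_DIR, name), "wb") as f:
                f.write(payload)

    def drain_spool() -> None:
        # Run periodically; pushes spooled records back into S3 unchanged.
        for name in os.listdir(SPOOL_DIR):
            key = unquote(name.split("|", 1)[0])
            path = os.path.join(SPOOL_DIR, name)
            with open(path, "rb") as f:
                s3.put_object(Bucket=BUCKET, Key=key, Body=f.read())
            os.remove(path)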
Starting point is 01:01:58 And on that note, too: several of our guests recently have been talking on the topic of chaos engineering and chaos testing, and it sounds, you know, like that falls into that realm. I'm just curious, are you implementing chaos testing with any of your customers, or is that something you've not really gotten into yet? So, I would love to be doing chaos engineering everywhere I go. When I'm in charge of a product development group, it's always fun trying to get people to buy in; it's very much a culture thing. And then for customers: some of our customers are there.
Starting point is 01:02:41 They can do it. But a lot of it's, again, back to the maturity model: I've got to get you into CI/CD and self-healing architecture before I'm going to start doing chaos engineering, right? Yeah. But just a quick plug for somebody: Nora Jones, not the singer. She used to be at Netflix, part of the chaos engineering team there; I believe she's at Slack now. She spoke at re:Invent before. She has a book that she's a part of.
Starting point is 01:03:08 You should definitely read that. If you're into chaos engineering, check her out. She's awesome. Right. Excellent. All right. We got to wrap up now.
Starting point is 01:03:15 Right. Excellent. All right. We've got to wrap up now. Andy's got to run. We'll hopefully talk to you soon. Thank you so much. And thanks to everyone for listening. And if you have any topics, you can reach us at @pure_dt on Twitter, or you can send us an email at pureperformance@dynatrace.com. Thanks for listening, everybody. Thank you.
