PurePerformance - How to fail at Serverless (without even trying) with Kam Lasater

Episode Date: October 24, 2022

Serverless and other emerging technologies hide the complexity of the underlying runtimes from developers. This is great for productivity but can make it really hard when troubleshooting behavior that needs deeper insight into those runtimes, platforms or frameworks. In this episode we hear from Kam Lasater, Founder of Cyclic Software. Kam has run into several walls while he was implementing solutions from scratch using Serverless technologies as well as other popular cloud services. He recently presented a handful of those scenarios at DevOpsDays Boston 2022. Tune in and learn from Kam as he walks us through two of those challenges he covered during his DevOpsDays talk. If you want to learn more, make sure to watch the full talk on YouTube: https://www.youtube.com/watch?v=xB9vsSl93mE

If you want to learn more from or about Kam, check out the following links:
YouTube video from DevOpsDays Boston: https://www.youtube.com/watch?v=xB9vsSl93mE
Cyclic Website: https://www.cyclic.sh/
Cyclic Blog: https://www.cyclic.sh/blog/
Twitter: https://twitter.com/seekayel
Personal Website: https://kamlasater.com/
LinkedIn: https://www.linkedin.com/in/kamlasater/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and guess what? It's that time of year again. My co-host is the one and only Andy Candy Grabner. Hello Andy Candy Grabner. Because it's Halloween time.
Starting point is 00:00:44 Well, Halloween is still four weeks out, isn't it? What's wrong with you? No, when this airs, it's going to be right around Halloween. Oh, yeah.
Starting point is 00:00:48 Maybe, yeah. Oh, that actually makes sense now, because I happen to be at StarWest
Starting point is 00:00:54 in Anaheim, so Disneyland, and the whole month of October here is Halloween-themed, it seems. Yeah, it's getting crazier and crazier.
Starting point is 00:01:02 I even see in the stores they take traditional Christmas decorations and paint them black and orange and try to pass them off as Halloween. I'm like, okay, it's getting too much. It's getting too much, yeah. Hey, Brian, can I say hi to two people quickly? Yeah, you can. First of all, to Leandro, Señor Performo. He was gracious enough to give me his hotel room to record this session, because we're both here at the conference, but my hotel is a little further away, so I walked over and he gave me a quiet room. And the second person I want to do a shout-out to is Roman Granita. He's one of our colleagues, and I bumped into him last weekend, and on his way he
Starting point is 00:01:41 showed me he was listening to our podcast. Oh wow, awesome. It was a Friday or Saturday night, I'm not sure why, but he was listening to our podcast, and I wanted to shout out to him. It's great to sit down
Starting point is 00:01:54 with like a six-pack of beer and listen to the podcast and just get, you know, get sloppy and listen. I heard it's a thing
Starting point is 00:02:00 with the kids. They're eating Tide Pods and drinking beer and listening to Pure Performance. It's the only way to get through those episodes. No, it's not that bad.
Starting point is 00:02:09 And we don't want to shy away our guest today, which reminds me, we do have a guest today. We don't want to let him wait any longer. I would like Kam to introduce himself. Kam, thank you so much for being on the show today. We met at DevOps Days in Boston. You did a great talk about lessons learned from your experience with serverless. I think the title was called How to Fail at Serverless,
Starting point is 00:02:35 and in quotes, without even trying. But without further ado, Cam, please welcome to the show and then please introduce yourself to our audience. Yeah, hello. Thank you for having me. My name is Cam Lasseter. I've cut my teeth in serverless and I have the scars to prove it. And yeah, at conferences, I think we have plenty of talks on best practices and how to be great at something.
Starting point is 00:02:59 And I thought it was a fun kind of twist to talk about all the ways I failed and hopefully open up some space for other people to tell stories of failure and things that they've learned. And Brian, we've done a lot of these shows in the past that we talked about the top problem patterns, the things that went wrong. And then so many, you're right, It's always great to hear how great people are and what you want to achieve, but then you also want to hear the way and the hard steps that they took to get there. And I think that's the great thing about these presentations, yeah.
Starting point is 00:03:36 Yeah, as we share our mistakes, others learn not to even start doing them and they can pick up elsewhere. I always even go back to serverless, Andy. Every once in a while, I bring up that example that you had when you first tried doing serverless, when you tried running everything in serial and then to serverless, Andy, every once in a while I bring up that example that you had when you first tried doing serverless, when you tried running everything in serial,
Starting point is 00:03:48 and then you're like, wait, I need to make this all parallel, because that was just really stupid. And it's the only way we get past this. So awesome job, Cam. Thanks for taking that approach. Yeah, absolutely. I had an interaction at Happy Hour after giving that talk and we were with some people and they didn't say that I could disclose the story. So I don't want to tell them, you know, call them out too much. But, you know, they talked about this massive failure of a switch in London and that brought down a whole, you know, hospital's phone system. And I find that oftentimes those are kind of hidden
Starting point is 00:04:25 in the corner of conferences, people talking about the real deep cuts they've had in their career. And I think as a community of software practitioners, we really need to talk about it more if we're going to level up and kind of improve the state of our industry. And Cam, to this story, just a general question to you. We try as an industry,
Starting point is 00:04:50 as vendors of frameworks or platforms, we try to make it as easy as possible for developers to build new stuff by abstracting all the complexity away. Yet there is obviously complex stuff going on in the backend, right? And I wonder, are we failing as an industry to really make sure that when people build stuff in the easy way and high level languages so i think now they're called low code no code that we really still make sure that they understand
Starting point is 00:05:21 the technology underneath the hood because right even if everything is easy sometimes sometimes basic things in the back end can bring down prob can can bring down whole systems like as they as the switch you were mentioning and i think you also have a couple of examples that we talk about later on with serverless you need to know certain things about the runtimes underneath so that you are not running into certain mistakes so how can we educate people? Well, I think, so my point first was the why of this talk is very much about creating a space so we can talk about it. So that's sort of the predicate here of that.
Starting point is 00:05:55 Yes, we're talking about failure. And then the other piece is I think we need to have a little bit more humility about how little we know as a community, as a society about how to build software. You know, depending on how you date it, we've maybe been building software for 70 years, maybe. When you think about other types of construction that we do, we have, you know, thousands of years of experience building bridges and still sometimes they fall down. Right. And so just thinking through the number of types of software that people have built and the different layers of abstraction that have been built up, we are still very early days as humanity building software and learning about it.
Starting point is 00:06:41 So I think we could we could all just, you know, take a deep breath and say, you know what, maybe we're just really new at this. And we're still experimenting. And we're learning. And like you say, you know, like the classic joke when, when serverless as a term was really coming to the fore is like, well, there's still servers. Yeah, of course, like at some point, like a packet is going somewhere, and there is actual hardware that is going to process it and the like. So, yeah, there's always some sharp edges underneath. But I think we should have a little bit of humility that we don't really know what we're doing yet. Cam, before jumping into some of the examples, I would also like to give you the chance to talk a little bit maybe about the company you founded, Cyclic, if I'm not
Starting point is 00:07:25 mistaken. I know you did a pretty good job also at DevOps Days to not pitch kind of your product and your idea, but I want to give you the chance to quickly at least talk about it because I assume a lot of the stuff that you've learned in your work with serverless has also inspired some of the stuff that you're doing at Cyclic. Yeah, absolutely. So you're using, I would say, the European pronunciation. I would call it Cyclic. Cyclic is fine as well if you're on the more American side,
Starting point is 00:07:55 this side of the Atlantic. Yeah, we're a Node.js platform as a service. It's all serverless underneath, so we're really focusing on creating a way for people who just want to start getting APIs up and running, get them out quick, build in just a few seconds. So that's sort of like our target. And we're using serverless technologies, one, because that's something we know, but we think that it opens the door to some really powerful things over the next 10 or 20 years and different ways of computing. I think there's some really good articles out of some computer architecture professors out of Berkeley and Stanford in the ACM magazine talking about their vision for serverless going forward. A lot of other thoughts there on like how this changes productivity of developers
Starting point is 00:08:46 and sort of the first iteration of the cloud really changed people's expectations about do they need to be spending time managing AC units and power supplies and really change productivity of operators. And now we're really sort of seeing a trend into productivity of developers. And so that's sort of where our head is at. Yeah, and so we have a blog where we write up this stuff. And if people want to jump on that, check it out, subscribe to that, or try out the platform. Any of that is great.
Starting point is 00:09:19 Well, first of all, we will make sure to link to all of your website. Also, the presentation you gave at DevOps Days, the organizers also recorded everything. It's on YouTube and I think the whole session, like the whole days, the two days are there in one big block, but now I think they're also chopping it up. So we'll definitely make sure to link to it. And also thanks for the pronunciation reminder, cyclic and cyclic.
Starting point is 00:09:44 You're right. I obviously learned pronunciation differently than the native speak in the US. But now coming to your presentation, you had a couple of great examples that you went through. And you always had a nice intro to a new thing. Like you always said, hey, serverless is stateless, right? Who agrees with me? And things like that. So I really encourage everyone to watch the whole recording of your presentation. But I would like to give you a chance now to maybe walk through two or three
Starting point is 00:10:14 of those examples with us now. And I'll let you pick what those are that you think are the most interesting ones, especially considering that our audience right we're performance engineers site reliability engineers capacity planning people that need to operate uh platforms uh even though serverless you know is either operated by you guys or by some of the vendors but still you know pick pick two or three and let's start with the first one and just uh discuss yeah i think that that also laying a little bit of context, I started as well with a trigger warning that all of these mistakes were a mistake that my quote-unquote friend had made, this hypothetical friend,
Starting point is 00:10:54 and that names had been changed to protect the innocent and the not-so-innocent. So yeah, I think that state in serverless, I think, is an interesting place where people get tripped up a lot. Sort of there's a basic failure, and this is primarily in AWS. I should also give that as context. That's where I've spent the majority of my serverless time. I've dabbled on some other platforms, but kind of the deep learnings, the pain that I've had to learn from has come from the AWS side of things. So, yeah, state of the life cycle of a Lambda and how instances of it get created and torn down seems to be kind of the top of the list of pain points. And we saw this in
Starting point is 00:11:48 several different ways. And as you mentioned, you know, I kind of had people raise their hand if they agreed with these statements of, you know, serverless is stateless. And then an example of going through, you know, the kind of the classic serverless first, like almost hello world is going through and doing some image resizing, breaking some things up. And we got caught with writing these files, these files that we were resizing to slash temp, which is the only writeable place inside of Lambda. And that was great.
Starting point is 00:12:22 And as Lambda gets recycled, these containers get torn down and that slash temp gets thrown away. However, depending on how your Lambda instances get cycled, and this is something that's really obfuscated to you, you can build up files in that slash temp space, and then you can run out of disk and when you run out of disk that fault causes that lambda instance to recycle so um you know it's a very common error i think people in the community kind of have seen it now and understand this pattern but as a you know early practitioner you start to hit this and you get this inconsistent error behavior, which feels very odd. Following along in that same line, we fixed that. We fixed the problem of writing temp files to temp and then either not cleaning
Starting point is 00:13:16 up at the start of an execution or at the end of an execution. But we then hit a problem where based on our load pattern, we were processing these large PDFs kind of in batch, in bulk, through a step function. And if we hadn't been deploying, which again, when you deploy new versions, you recycle your Lambda. So it's like doing a, you know, restart, a hard restart on your instances, you have these instances of your Lambda that could be up for a longer period of time or not. And we found that we started to get inconsistent errors, even after cleaning out the temp, and we were getting some, you know, a different error coming out of the Lambda, which, again, these intermittent errors, right, that's the hardest thing to solve, because intermittent errors, right? That's the hardest
Starting point is 00:14:05 thing to solve because if you can't push a button and make it happen, or, you know, you don't know the mechanism by which it's happening, right? Because otherwise you would just be able to push the button and recreate the failure and then test if your fix worked. Also, we described it as sort of like this Heisenberg uncertainty of looking inside the Lambda. We had a system where we would change environment variables on the Lambda to change the debug levels or the logging levels instead of trying to make HTTP calls outbound to get config each time and then deal with caching and like all this sort of config inside of the Lambda so but by changing an environment variable on a lambda you trigger a refresh of those instances so any running instances get flushed and it's
Starting point is 00:14:53 essentially you've just cleared out your temp or you've cleared out any other os memory space that uh you might be you know if you have a a memory leak or the like. So by looking at it and turning on debug, we created a problem where we couldn't see it. And we were also on a pretty high deployment cadence. So if we were deploying every night to that service, we would flush things out and we wouldn't get any errors the next day. So we had to wait till our deployment frequency on that service came down. And we weren't looking at the debug logs or changing the debug settings enough that the instances would stay running long enough such that we would start to see this error.
Starting point is 00:15:36 And what it turned out to be is a file descriptor, essentially a memory leak of a file descriptor coming out of ImageMagick. And again, once it would fail, it would cause the whole Lambda to get recycled. So it wasn't a problem because it would just fail once and then it would disappear. And then we'd start looking at it and we couldn't find it. And the way we had to finally isolate it is we had to turn the concurrency, max concurrency to one and min concurrency to one on the Lambda.
Starting point is 00:16:04 So basically isolate that Lambda instance into a single one. And then we'd have to do a load test for a certain period of time. And we could recreate it then. And so then at least we had a system where we could reproduce this error. In investigating it at the time, we didn't think it was in our skill set or understanding how to go fix that error or pay somebody to fix it or anything like that. So sort of the solution we came up with was to cause the Lambda recycle itself. So kind of reset a environment variable on the Lambda, which would flush any of those instances. Definitely not a pretty solution, felt a little
Starting point is 00:16:42 bit of a hack, but kind of understanding that life cycle of the Lambda and really getting in our mind how that worked was sort of what masked the problem. And then in the end actually ended up being the way that we solved it. So, which is kind of a little bit of an irony in the end too. It reminds me a little bit of the days when I was working on IIS with the W3P containers, like the runtimes basically that were hosting, let's say, your.NET, ASP.NET applications, that they had their recycling either on a schedule basis or on a memory basis, but then we were just prematurely
Starting point is 00:17:21 then recycling these worker processes because we knew at some point they are crashing for whatever reason. And we just kind of took that pain away by just, you know, forcing them to restart at a time when there was low load. Yeah, early in my career, I remember reading a blog post by Joel Spolsky when he was still at Fogbugs, and he talked about doing just this, that he would do a nightly restart of all of his web servers, which, like, on one hand, like, kind of, like, was nails on a chalkboard for, like, this just-out-of-college, you know, computer engineer, like, oh, we can figure this out, and we can solve it, like, but then also just that practicality of, okay, well, does the solution work, and does it prevent the problem, and problem and you know are we going to spend weeks
Starting point is 00:18:05 of time trying to solve this problem or do we have something that's not going to impact the system that will achieve the same result and so finding those balances I think is really important and you know the business pays for things to work not always for them to work in the way that we would like to think that they should work quote unquote um i think uh some of you know going back to the beginning the the quote shoulds is a little bit of what i'm saying we need to examine you know that's the like most dangerous word in an engineering context i I think, is should, right? It either leads us down a design process that isn't connected with business realities or isn't connected with technical realities, or it blinds us from actually seeing potentially what is truly the cause of the bugs,
Starting point is 00:18:58 because we immediately are blocking ourselves from some possible outcome or some possible state in the world. So yeah, should is always a trigger word for me. I like the dose of reality you just threw at me there. We and probably others in the observability space always come at these problems like, well, let's gather the information, find what's causing it, and fix it. And in sitting behind the desk trying to keep all these things running, sometimes it really is like, well, how do we just keep this part running?
Starting point is 00:19:34 Because we have other things we've got to focus on and concentrate, and if we can fix this with a restart, we know it's not the best thing. But until it becomes catastrophic, we'll do that and work on the other you know put out all the other fires that are even blazing hotter and then get back to this but we always have this academic view of well find it you know let's get to it um and it's i just never really heard it put that way i guess i mean not that it was put in a special way but i've never heard someone say like yeah no sometimes we're just fine with the band-aid, and if it's got to be that, we know the risks, we accept it, and we move on. they are somewhat homogenous, even inside of a single organization. But if you think about the scale that each of them deals with, as far as the investment dollars that will go into the software
Starting point is 00:20:31 system, or the value returned to the organization, if you think about each service, you know, if you're in microservice or each application, right, and you tried to plot it on some distribution of the value and the load that that application experiences, you'll probably see like maybe three orders of magnitude in load, maybe more four or five, right? Some very large organizations. And probably value has a very similar distribution where you have some services or some applications that have a huge value
Starting point is 00:21:05 that they're delivering to the organization. And probably this type of fix might not work in those high load, high value systems, but there's a tremendous amount of software that doesn't run those three or four orders of magnitude higher on the scale or on the value curve. So, and I don't think we have, I certainly don't have good vocabulary and language for describing these types of, yeah, the value delivered and the scale of value delivered and the scale of load. Yeah, maybe service level agreements is sort of or service level objectives is the closest i've heard on the like performance
Starting point is 00:21:50 side um but all three of those really change how i as a practitioner i'm going to approach what is an acceptable solution or what is an acceptable engineering um answer to to a particular problem or an anomaly. Yeah. Hey, I got three quick things I wanted to bring up. One, so basically, we're often building workarounds to something we cannot control otherwise. Like in your case, you had this file descriptor leak or whatever you want to call it. And then you built a band-aid around it now the question is if they fix the problem at some point you don't necessarily depend it anymore but the
Starting point is 00:22:32 question is how would you learn about that the fundamental problem from a third-party software was actually fixed and i think it goes the other way around as well right maybe an update of the underlying runtime introduces a new thing that you have that you now need a band-aid for, but you haven't needed one earlier. And this is I think where obviously testing comes into place, that every change, whether it's your own code
Starting point is 00:22:55 or any of the dependencies needs through testing, because you never know, right? Things are not behaving differently in a good or in a bad way. Yeah, absolutely. I have this kind of thought I've been chewing on with testing and sort of the design choices we make in our testing and testing broadly, not just unit tests and code written to test the functionality, but our whole software pipeline is a long sort of evolutionary selection mechanism for the types of bugs that we will see in production. So every design choice we make in
Starting point is 00:23:35 how we test our software introduces the possibility of a bug, or it doesn't fully test every possible bug, right? So if you think of bugs as, you know, bacteria that we're growing in a Petri dish when we're writing code or when we're, you know, assembling systems, all of the verification steps we do are almost like some sort of antibiotic that we're applying. We screen away 99.9% of the bugs. And so there's some amount to get through and they're constantly being repropagated so that um the the bugs that we see or the the problems that we see in production are literally a result of the design decisions of our software delivery process which um i don't know i've been trying to think through how to put together a talk and talk through some of the design
Starting point is 00:24:20 decisions or or failures in the in the sort of value stream architecture that I've put together in the past and sort of how that manifested in some cases in these bugs or these failures that I've talked about in this conference talk. Yeah, let us know when you have the story done because then we'll invite you back. It's an interesting topic. Yeah, that's really interesting. I'm not even thinking I mean, I don't even, I'm not even thinking of it probably in the same way you're presenting it, but that's even just what you said there
Starting point is 00:24:51 has got a bunch of things floating in my head. So yeah, please, please, please come back when you have that. Yeah, and the simplest first example just as to like, you know, get you chewing on something is that if you have an inconsistent bug and you in a in a test suite but you let developers keep committing till they get
Starting point is 00:25:12 one green bar then you've just introduced a way that inconsistent bugs can flow through from your dev system into your test system right if if it's a one in ten failure then oh well if they bump the readme or they make a fix but it's not actually a fix but it green bars and then it goes through the next step you've just selected for a bug that's inconsistent so you know if you decide okay well i'm going to run the test suite twice in a row and it has to green bar both times, you now that one in 10 bug should be one in 100. So it's fewer, but you still are, you're just selecting down the number of chances that it's going to slip through. So I don't know, hopefully that'll be an earworm and it'll chew on you a little bit. Yeah. Very philosophical. Quickly, two other things,
Starting point is 00:26:08 because you talked about the debug logs, log levels. What's the right way of doing observability now in serverless from your experience? Is it still just logs? Do you already look into open telemetry? Do you do other things? Because Brian and I, observability is a big topic for us
Starting point is 00:26:27 day in and day out. So I'm just wondering. Yeah, I mean, I think that I can't claim expertise in this area. Open telemetry looks, or open telemetry, as you say on the European pronunciation, you know, is really intriguing to me.
Starting point is 00:26:47 I've, again, just begun to get my feet wet in it and looked at some of the different vendors in the space. It seemed that some of the early stuff that I was looking at seemed more well-suited for running inside of containers or running on, you know, whether it's a Kube platform or some flavor of EC2 or something like that. And so, I don't know. I haven't found something perfect yet that I really have sunk my teeth into
Starting point is 00:27:19 on the serverless side. I wish I knew more. No, it's okay. Maybe listen to some of our other podcasts and then you learn more. That's one option I'm doing. And Andy, before you go on, speaking of the open telemetry versus open whatever.
Starting point is 00:27:36 Telemetry. Telemetry. Can you say metaphor for us? Metaphor? Isn't it Metuffer? Metuffer, yeah, yeah, yeah. That was way back,
Starting point is 00:27:46 I remember way back, we had, our guest kept saying Metougher, and I'm like,
Starting point is 00:27:51 guys, what's a Metougher? And they were like, oh, oh, oh,
Starting point is 00:27:55 oh, Metaphor, okay, okay. I was so confused. Sorry, I didn't mean
Starting point is 00:27:59 to sidetrack, but it's just reminding, I had a flashback to that instance. You had a third thing
Starting point is 00:28:04 you wanted to bring about, right? Yeah, just one quick thing. You mentioned in the beginning the way you were kind of trying to find out the problem is you set the concurrency back to one. Setting the concurrency back to one means you're forcing Lambda to really only stand up one container or whatever instance, one runtime instance at a time. Is this the way I understand it correctly?
Starting point is 00:28:26 Yes. How I understand how Lambda works internally, primarily based off of talk, I believe it was the 2018 reInvent, and I believe it was the VP of Engineering or the engineering lead on the Lambda team, and kind of walks through the core architecture so people can sort of fact check me there, is that setting that max concurrency to one is when a request comes in to Lambda,
Starting point is 00:28:56 it'll take it on this kind of like receiver process and then it'll look at its backend to see if there's any available Lambda instances to serve responses for that. And it has some amount of timeout that it'll look at its backend to see if there's any available Lambda instances to serve responses for that. And it has some amount of timeout that it'll, and I think it has some prediction about how long the average time to service a request is so that it can, it'll hold it a little bit and not spin up new instances. And it'll try to get ahead of if it's seeing a load spike. So there's all sorts of black magic that's happening inside of there. However, setting that max concurrency to one is saying,
Starting point is 00:29:29 no matter what happens, the backend should only spin up one instance of a Lambda. So essentially, you're only ever hitting the same memory space. If anything happens with that memory space and it decides that it needs to shut it down because there's an error in the runtime or it needs to shut it down because there's an error in the runtime, or it needs to shut it down because of whatever AWS decides is the host that it's running on needs to get shut down.
Starting point is 00:29:52 My understanding is it will still recycle and get assigned somewhere else, but you won't have multiple copies. Otherwise, if you got two API requests that come in almost simultaneously, they could lead to two backend instances being spun up, even going from zero, so even from a cold start. So you'd get a cold start on both, each one would serve, you'd essentially have two memory spaces sitting there, and then they'd both get torn down. Yeah, and obviously the reason why you did it is because you wanted to, I guess, just focus on the one instance anyway. That means not getting flooded with too much information,
Starting point is 00:30:26 too much logs from multiple instances. And I guess also for a cost reason, if you're executing a load test and then you really only just focused on identifying the problem of a single instance, then just narrow it down to a single instance and that's it. Makes sense.
Starting point is 00:30:40 Yeah. And to be clear, we didn't add that as part of our integration tests going forward, right? That was just a one time, hey, we have this theory, can we prove it so that we know the mechanism by which we're, you know, that's sort of the scientific method of if you truly understand something, you can replicate it. Yeah. So now you just designed your test so it doesn't account for that and allows things that. That's true. And we accepted that.
Starting point is 00:31:05 That's the beauty of engineering. You say, and it's good enough. I had one question, if I can get in, about, you mentioned that these were lessons learned on AWS Lambda. Andy, I can't remember who we had on a little while back. We were talking about our development
Starting point is 00:31:22 of the Kubernetes version of Dynatrace and how it was different for AWS, different for Azure, there's different disks and all that. I assume, Cam, but I don't know if you've had experience that you can talk to, that what you find in AWS Lambda serverless may or may not translate to any of the other serverless components either. So this complication you're dealing with of, yeah, we're going to use this supposedly easy thing where we can just write our code and push it out,
Starting point is 00:31:50 but now we really have to understand what's going on in the hood. That seems like there's a pattern where when people are doing these things, they're going to have to repeat that step process over and over again for each cloud vendor they go to because no two are completely alike in that sense. Is that, I don't know if, I know you said
Starting point is 00:32:07 you didn't really have too much experience with those, but do you have any inkling if that might be the case? Yeah, I mean, that's generally my understanding. Sort of the data point I base that off of is talking to sort of a CTO, short-term CTO for hire with I think a Fortune 1000 company in the US that did a big migration of a lot of their internal systems and was doing a bake-off between AWS and Azure.
Starting point is 00:32:36 And he was telling me some stories about how the Azure serverless functions, again, and this might be three or four years old, and I haven't kept up to date, but the Azure serverless functions were having a much harder time handling any sort of bursts in load. It seemed that whatever their load prediction algorithms were, they were not as good at spinning up additional instances, and their cold start times were worse. I've noticed on Lambda consistently cold start times are going down. So I don't know how much of that is the predictive spinning up of instances ahead of time, which it seems to do a great job of. Or they're just getting much better at spinning up run times
Starting point is 00:33:22 and that the efficiency of the startup is much faster. But yes, back to your point is that my understanding is that these subtleties will be different across different cloud providers. And I've also noticed in other, I would consider CodeBuild a serverless service. I've noticed some significant differences in startup time on code builds across AWS regions. So for example, we run our main application in three different regions just to provide better service to our end users. And we noticed that Sao Paulo has a significantly slower
Starting point is 00:34:01 cold build startup time than we'll see in either Mumbai or Ohio. So, you know, I kind of extrapolate that across other services. And, you know, the scuttlebutt I hear from AWS insiders is that their region parity, you know, they're always... What's the polite way of putting it? They're looking to improve their region parity. And so even one service might not have the same level of
Starting point is 00:34:30 robustness or capacity in some of their different regions. And they may also have different hardware just running there, right? And that's what it is. I think it throws a complication
Starting point is 00:34:45 into the idea of evaluating different cloud platforms for what a company might want to choose as well because this might not be the case in that AWS example. I mean, in the Azure example you mentioned,
Starting point is 00:34:57 but it could be that whether or not it's serverless or anything else, it's optimized to run with a slightly different architecture. And if you're not aware of that optimization and you're trying to just port over your Lambda function and the way you have that to run to another platform
Starting point is 00:35:16 without knowing what their optimal approach might be, you might be like, oh, well, their performance is terrible. But if you were to have that secret sauce of what they're designed to operate optimally on, you might be able to compare. So not only is it hard to, I guess, maybe compare different features and functions potentially on the different platforms,
Starting point is 00:35:36 but even if you want to be multi-cloud, let's say you want to spread yourself amongst different vendors, you might have to have some architectural changes and differences based on what's going on under the hood and what's the best way to write for those processes, whether it's serverless or anything, right? But that's more and more complications
Starting point is 00:35:55 as we make things simpler, you know? Yeah, yeah. So that, I don't have experience in that. My career has primarily been at sort of the beginning end of the curve as far as time. So, most of my software has been at the, you know, from inception. So, from day zero through, you know, maybe year 10 of the life of a system. So, some of these much longer in the tooth, you know, been running for 20, 30 year kind of systems. I don't have nearly as much experience on that. You know, I think the
Starting point is 00:36:31 oldest running system I've ever had is, I, you know, ran some PBX software when they had it running on some desktop machines and stuff like that. So I don't really have experience with these longer, more enterprise situations where you're like, oh, we're going from on-prem and now we're migrating into, should we go to Azure or AWS, and let's do a bake-off and stuff like that. But that sounds right.
Starting point is 00:36:58 Your theory sounds right to me. Hey, considering the time, we have about 10-15 minutes left. I would like to give you a second shot at a second topic, whatever you choose. I know you have plenty in your presentation. Yeah, what else should we talk about? I think S3. That's probably the most well-known, most used service. And again, sort of this subtlety, going back to your point, Brian, on that is that, you know, with S3 in an application, early on, you have to decide what's your pathing structure, right? Like that's something that has to happen before you can write your application. And so you can read
Starting point is 00:37:45 best practices or, you know, whatnot. And within the system that I had been working on, you know, I came in and the pathing structure had already been decided, and we were using the global S3 domain name, and then we were using a path with a date in it. Our understanding at the time was that the date would provide some randomization and allow the S3 system to shard our files across however they did it, across clusters or individual computers. And so things worked fine until we started storing more and more data in S3. And then we turned Athena on and pointed it at S3.
Starting point is 00:38:39 And so Athena, my best way of describing it is it seems to use these like 429s with S3, and I think 429 is the too-many-requests status, right? Like it seemed to be like a network flow control discussion that Athena would use to size how much data it could pull from S3, which was fine for Athena. And it was fine for S3. Like S3 didn't really complain. We couldn't even tell it was having problems until we like turned on extended metrics and looked at it. But our core application was trying to write to that same bucket. And so we were getting write failures because S3 was getting saturated. Essentially, the IO was getting saturated on each of these clusters or inside of each of these shards. And later what we learned is that internal to S3 at the time, they used sort of a progressive
Starting point is 00:39:41 path-based hashing. So they would just try to find a prefix that was consistent. And the first character that varied, they would try to shard based on that first character that varied. And so for, in our case, it was a date. And so that was the first character that had variability in it. So everything that was temporally the same or close to each
Starting point is 00:40:07 other would also be on the same shard in the S3 backend. So then when Athena was trying to do a date report, it would just bombard this one shard and hotspot it. So I think Amazon has done some more hash-based sharding. I can't speak to whether they fixed that or not. In general, the way we kind of got around this was that we replicated the data from one bucket in East 1 to a bucket in East 2. And then we started running the reports off of that East 2 bucket. So S3 is fine, you know, replicating the data for you. It's a nice little easy thing to turn on. And S3 was fine getting all those 429s, right? It's not a problem for it. So Athena was fine running off of that other bucket. Kind of the side effect, though, was another way to fail there: we never moved the Athena process from East 1 to East 2.
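The "first character that varied" behavior is easy to see in a toy model. This is an illustrative reconstruction of the progressive path-based hashing Kam describes, not S3's actual implementation, and the key layout is hypothetical:

```python
from collections import Counter

def first_varying_index(keys):
    """Position of the first character at which the bucket's keys differ."""
    return next(i for i in range(min(len(k) for k in keys))
                if len({k[i] for k in keys}) > 1)

def shard(key, pivot, num_shards=4):
    # Toy model: route each key by the first character that varies.
    return ord(key[pivot]) % num_shards

# Date-first keys (hypothetical layout): 100 objects on each of three days.
keys = [f"2019-03-{day:02d}/{n:04d}.json"
        for day in (5, 11, 23) for n in range(100)]

pivot = first_varying_index(keys)             # the day's tens digit
load = Counter(shard(k, pivot) for k in keys)
# All 100 objects from a given day land on the same shard, so a
# per-date scan (like Athena's date report) hammers one shard.
```

Under this model a random or hash prefix at the front of the key would have spread the same objects across shards, which is why the pathing decision matters before any data is written.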
Starting point is 00:41:11 So then we got to continuously pay AWS for their network egress from East 2 back to East 1 to do the analysis and put together those dashboards. So, you know, again, one of those things where, yeah, it works fine and you might not notice it until you start digging into the cost reporting and really dig through the tags. The other place, and I don't know if you noticed it in my description just now. But with S3 there's a subtlety: the experience of S3 is that it's this global service, but S3 is actually a regional service. And you notice this when you create a bucket, right? You have to create it in a region. And so, yes, there's naming problems that if you're not thinking about it at the time, like you might create, you know, some bucket, bucket name in East 1,
Starting point is 00:42:06 and then you go to create a second bucket in East 2, and now you can't create bucket name in East 2. So there's a little bit of a headache there; kind of the common practice is to append the region name. But then you have this other issue that, at least for us, we were using the global reference to that bucket, and so not the regional DNS hostname. So that meant we had another level of indirection. So we were hitting a DNS server in East 1 to get the global name for that bucket. And then that was then giving us access to that S3 bucket. Now, in, I believe it was early 2019, AWS had a DDoS attack against those global S3 DNS servers, right? Coming back to your previous
Starting point is 00:42:58 podcast on it's always DNS under the hood, right? Like every failure is DNS. And then, you know, so those servers went down, or they got swamped. And even though S3 was still up and the service was still existing, and the regional, so if we had referenced it by bucket name dot us-east-2 or us-east-1 dot, and then, you know, the S3 pathing, we could have gotten access to the data. And all of those servers were still connected and available.
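Concretely, the two addressing styles at play look like this (virtual-hosted-style URLs; the bucket and key names here are hypothetical). The regional hostname is the one that would have kept resolving during the DNS outage:

```python
def s3_urls(bucket, key, region):
    """Global (legacy) vs. regional virtual-hosted-style S3 URLs.
    The global name resolves through S3's shared DNS servers, while
    the regional name pins the request to one region's endpoint."""
    return {
        "global": f"https://{bucket}.s3.amazonaws.com/{key}",
        "regional": f"https://{bucket}.s3.{region}.amazonaws.com/{key}",
    }

urls = s3_urls("bucket-name", "2019-03-11/0001.json", "us-east-1")
print(urls["regional"])
# https://bucket-name.s3.us-east-1.amazonaws.com/2019-03-11/0001.json
```

Appending the region to the bucket name, as Kam mentions, also sidesteps the global-uniqueness collision when the same logical bucket exists in two regions.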
Starting point is 00:43:37 because our whole build system also relied on those same S3 global paths. So you get into these interesting kind of recursive dependency problems there. So, yeah, that was a couple ones. I know you, Andres, you said... No, that's perfect. I was actually, it's funny,
Starting point is 00:43:55 looking through the presentation that you did, it was actually the one that I picked. I would have picked my stuff that I would have hoped you picked with S3. So it's perfect. Good. Good. Yeah. Cam, I know we will link to the presentation, we will link to C-Click, your website.
Starting point is 00:44:16 Is there any other material that you've put out there, any other speaking slots, anything else where people can maybe reach out to you or see you and meet you in person? Oh, yeah. I don't think I have any speaking slots on the calendar going forward. I'm not on the conference circuit as much as you are. You know, maybe I should up my conference speaking game. Yeah, but reach out to me on Twitter. I'm sure we can put that up.
Starting point is 00:44:41 And, yeah, definitely kam.lasater at cyclic.sh. Happy to respond to any emails that humans are willing to send me. So this was a shout out also to all the bots out there that are sending fake emails. Feel free to use that as well. Isaac Asimov is rolling in his grave.
Starting point is 00:45:03 No, but it was really great. And I think these are, as you said, it's great to talk about these stories where things didn't go as expected, where you learn something new about the complexity of the technology we're really working with, even though the producers of the technology are making it sound like it's very easy, because they hide the complexity. But, yeah, it takes people like you to educate everybody out there that wants to become successful with technologies like serverless. So thank you so much for that. Yeah, absolutely. If anybody has a story, I'd love to hear it. You know, record a little YouTube video or write up a blog post. I really want to
Starting point is 00:45:42 trigger other people to kind of, you know, come out and share with the world the ways they've hit the wall or caught something on fire accidentally. We've all done it. Yeah, it reminded me, when you were talking about the idea of sharing the failures, it reminds me of the Titanic, right? It's like, we all pay attention to the fact that the Titanic hit an iceberg, right? And if that wasn't shared, there might be others. But when's the last time you heard about a cruise ship sinking on an iceberg? No, right? Because they shared that, they thought they were indestructible, and now people don't crash into icebergs.
Starting point is 00:46:17 Avoid them. Hey, great lesson. But it's so important that you're sharing this stuff, and I think that's just fantastic i was looking over the powerpoint uh you know yesterday as well and it's these are these are the things that i love i love hearing these these stories of you know i don't want to say i glory in them by calling them stories of failure but i think those are the more interesting things to learn from right because what works for people it's not going to work in other situations,
Starting point is 00:46:46 but what fails for people is something we can all take and make sure we don't do. So I really appreciate what you're doing, and thanks for being on today. Brian, is James still doing the News of the Damned? I don't know. I haven't really been. I got myself, as part of my mental health, I got off of Twitter as well.
Starting point is 00:47:04 So I'm not following if he's doing that, but that was always a fun show. If it's still out there, people go listen to it in the great archives too. So Cam, our friend James Pulley, I think it was about a weekly podcast, would do News of the Damned, and he'd go through all Twitter and news things to find out the different crashes and downtimes different organizations would have. And often he would try to get someone on, but he would also just,
Starting point is 00:47:30 I think, try to surmise what may have happened or good stuff. Yeah. Awesome. Awesome. Everyone. Thanks for listening. Cam,
Starting point is 00:47:41 thanks for being on today. And if anybody has any questions or comments for Cam or whatever, we'll have his information posted in the show notes. If you have any questions or comments for us, you can reach us at pure underscore DT at dynatrace.com or tweet us at Pure Performance. I haven't been on that in a while. Pure Performance on Twitter as well, right?
Starting point is 00:48:01 Is that the, I don't remember what the handle is. It is, it is, yeah. I haven't paid attention to that. Either way, just, you know, send us a note with some money and we'll talk to you. Thanks, everyone. Cam, thanks again for being on. Andy.
Starting point is 00:48:17 Absolutely. Go grab some candy. Yeah, will do. Bye-bye. All right, see you.
