PurePerformance - How to fail at Serverless (without even trying) with Kam Lasater
Episode Date: October 24, 2022

Serverless and other emerging technologies hide the complexity of the underlying runtimes from developers. This is great for productivity but can make it really hard when troubleshooting behavior that needs deeper insight into those runtimes, platforms or frameworks.

In this episode we hear from Kam Lasater, Founder of Cyclic Software. Kam has run into several walls while implementing solutions from scratch using serverless technologies as well as other popular cloud services. He recently presented a handful of those scenarios at DevOpsDays Boston 2022.

Tune in and learn from Kam as he walks us through two of the challenges he covered during his DevOpsDays talk. If you want to learn more, make sure to watch the full talk on YouTube: https://www.youtube.com/watch?v=xB9vsSl93mE

If you want to learn more from or about Kam, check out the following links:
YouTube video from DevOpsDays Boston: https://www.youtube.com/watch?v=xB9vsSl93mE
Cyclic Website: https://www.cyclic.sh/
Cyclic Blog: https://www.cyclic.sh/blog/
Twitter: https://twitter.com/seekayel
Personal Website: https://kamlasater.com/
LinkedIn: https://www.linkedin.com/in/kamlasater/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and guess what?
It's that time of year again.
My co-host is the one and only Andy Candy Grabner.
Hello Andy Candy Grabner.
Because it's Halloween time.
Well, Halloween is still four weeks out, isn't it?
What's wrong with you?
No, when this airs, it's going to be right around Halloween.
Oh yeah, maybe. Yeah, oh, that actually makes sense now, because I happen to be at StarWest in Anaheim, so Disneyland, and the whole month of October here is Halloween-themed, it seems.
Yeah, it's getting crazier and crazier. I even see they take, in the stores, like, traditional Christmas decorations and paint them black and orange and try to pass it off as Halloween. I'm like, okay, it's getting too much.
It's getting too much, yeah. Hey, Brian, can I say hi to two people quickly?
Yeah, you can.
First of all, to Leandro, Señor Performo. He was gracious enough to give me his hotel room to record this session, because we're both here at the conference, but my hotel is a little further away. And so I walked over and he gave me a quiet room.
And the second person I want to do a shout-out to is Roman Granita. He's one of our colleagues, and I bumped into him last weekend, and on his way he showed me he was listening to our podcast.
Oh wow, awesome.
It was a Friday or Saturday night. I'm not sure why, but he's listening to our podcast, and I wanted to shout out to him.
It's great to sit down with, like, a six-pack of beer and listen to the podcast and just get, you know, get sloppy and listen.
I heard it's a thing with the kids. They're eating Tide Pods and drinking beer and listening to Pure Performance.
It's the only way to get through those episodes.
No, it's not that bad.
And we don't want to shy away our guest today,
which reminds me, we do have a guest today.
We don't want to let him wait for longer.
I would like Cam to introduce himself.
Cam, thank you so much for being on the show today.
We met at DevOps Days in Boston.
You did a great talk about lessons learned from your experience with serverless.
I think the title was called How to Fail at Serverless,
and in quotes, without even trying.
But without further ado, Cam, please welcome to the show
and then please introduce yourself to our audience.
Yeah, hello. Thank you for having me.
My name is Kam Lasater.
I've cut my teeth in serverless and I have the scars to prove it.
And yeah, at conferences, I think we have plenty of talks on best practices
and how to be great at something.
And I thought it was a fun kind of twist to talk about all the ways I failed and hopefully open up some space for other people to tell stories of failure and things that they've learned.
And Brian, we've done a lot of these shows in the past that we talked about the top problem patterns, the things that went wrong.
And you're right, it's always great to hear how great people are
and what you want to achieve,
but then you also want to hear the way
and the hard steps that they took to get there.
And I think that's the great thing
about these presentations, yeah.
Yeah, as we share our mistakes,
others learn not to even start doing them
and they can pick up elsewhere.
I always even go back to serverless, Andy.
Every once in a while, I bring up that example that you had when you first tried doing serverless, when you tried running everything in serial, and then you're like, wait, I need to make this all parallel, because that was just really stupid.
And it's the only way we get past this.
So awesome job, Cam.
Thanks for taking that approach.
Yeah, absolutely.
I had an interaction at happy hour after giving that talk. We were with some people, and they didn't say that I could disclose the story, so I don't want to call them out too much.
But, you know, they talked about this massive failure of a switch in London that brought down a whole hospital's phone system. And I find that oftentimes those are kind of hidden
in the corner of conferences,
people talking about the real deep cuts they've had in their career.
And I think as a community of software practitioners,
we really need to talk about it more if we're going to level up
and kind of improve the state of our industry.
And Cam, to this story,
just a general question to you.
We try as an industry,
as vendors of frameworks or platforms,
we try to make it as easy as possible for developers to build new stuff
by abstracting all the complexity away.
Yet there is obviously complex stuff
going on in the backend, right?
And I wonder, are we failing as an industry
to really make sure that when people build stuff in the easy way and in high-level languages, so I think now they're called low-code/no-code, that we really still make sure that they understand the technology underneath the hood? Because even if everything is easy, sometimes basic things in the backend can bring down whole systems, like the switch you were mentioning. And I think you also have a couple of examples that we talk about later on with serverless: you need to know certain things about the runtimes underneath so that you are not running into certain mistakes. So how can we educate people?
Well, I think, so my point first was the why of this talk
is very much about creating a space so we can talk about it.
So that's sort of the predicate here of that.
Yes, we're talking about failure.
And then the other piece is I think we need to have a little bit more humility
about how little we know as a community, as a society about how to
build software. You know, depending on how you date it, we've maybe been building software for
70 years, maybe. When you think about other types of construction that we do, we have, you know,
thousands of years of experience building bridges and still sometimes they fall down.
Right.
And so just thinking through the number of types of software that people have built and the different layers of abstraction that have been built up, we are still very early days as humanity building software and learning about it.
So I think we could all just, you know, take a deep breath and say, you know what, maybe we're just really new at this. And we're still
experimenting. And we're learning. And like you say, you know, like the classic joke when,
when serverless as a term was really coming to the fore is like, well, there's still servers.
Yeah, of course, like at some point, like a packet is going somewhere, and there is actual
hardware that is going to process it and the like.
So, yeah, there's always some sharp edges underneath.
But I think we should have a little bit of humility that we don't really know what we're doing yet.
Cam, before jumping into some of the examples, I would also like to give you the chance to talk a little bit maybe about the company you founded, Cyclic, if I'm not
mistaken. I know you did a pretty good job also at DevOps Days to not pitch kind of your product
and your idea, but I want to give you the chance to quickly at least talk about it because I assume
a lot of the stuff that you've learned in your work with serverless has also inspired
some of the stuff that you're doing at Cyclic.
Yeah, absolutely.
So you're using, I would say, the European pronunciation.
I would call it Cyclic.
Cyclic is fine as well if you're on the more American side,
this side of the Atlantic.
Yeah, we're a Node.js platform as a service.
It's all serverless underneath, so we're really focusing on creating a way for people who just want to start getting APIs up and running, get them out quick, build in just a few seconds.
So that's sort of like our target.
And we're using serverless technologies, one, because that's something we know, but we think that it opens the door to some really powerful things over the next 10 or 20 years and different ways of
computing. I think there's some really good articles out of some computer architecture
professors out of Berkeley and Stanford in the ACM magazine talking about their vision for
serverless going forward. A lot of other thoughts there on like how this changes productivity of developers
and sort of the first iteration of the cloud really changed people's expectations about do
they need to be spending time managing AC units and power supplies and really change productivity
of operators. And now we're really sort of seeing a trend into productivity of developers.
And so that's sort of where our head is at.
Yeah, and so we have a blog where we write up this stuff.
And if people want to jump on that, check it out,
subscribe to that, or try out the platform.
Any of that is great.
Well, first of all, we will make sure to link
to all of your websites.
Also, the presentation
you gave at DevOps Days, the organizers also recorded everything.
It's on YouTube and I think the whole session, like the whole days, the two days are there
in one big block, but now I think they're also chopping it up.
So we'll definitely make sure to link to it.
And also thanks for the pronunciation reminder, cyclic and cyclic.
You're right, I obviously learned pronunciation differently than native speakers in the US.
But now coming to your presentation,
you had a couple of great examples that you went through.
And you always had a nice intro to a new thing.
Like you always said, hey, serverless is stateless, right?
Who agrees with me?
And things like that. So I really encourage everyone to watch the whole recording of your
presentation. But I would like to give you a chance now to maybe walk through two or three
of those examples with us now. And I'll let you pick what those are that you think are the most
interesting ones, especially considering our audience, right? We're performance engineers, site reliability engineers, capacity planning people that need to operate platforms, even though serverless, you know, is either operated by you guys or by some of the vendors. But still, you know, pick two or three, and let's start with the first one and just discuss.
Yeah. Also, laying a little bit of context,
I started as well with a trigger warning that all of these mistakes were mistakes
that my quote-unquote friend had made,
this hypothetical friend,
and that names had been changed
to protect the innocent and the not-so-innocent.
So yeah, I think that state in serverless, I think, is an interesting place where people get
tripped up a lot. Sort of there's a basic failure, and this is primarily in AWS. I should also give
that as context. That's where I've spent the majority of my serverless time. I've dabbled
on some other platforms, but kind of the deep learnings, the pain that I've had to learn from has come from the AWS side of
things. So, yeah, state of the life cycle of a Lambda and how instances of it get created and
torn down seems to be kind of the top of the list of pain points. And we saw this in
several different ways. And as you mentioned, you know, I kind of had people raise their hand if
they agreed with these statements of, you know, serverless is stateless. And then an example of
going through, you know, the kind of the classic serverless first, like almost hello world is going through
and doing some image resizing, breaking some things up.
And we got caught with writing these files,
these files that we were resizing to slash temp,
which is the only writeable place inside of Lambda.
And that was great.
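The pattern Kam is describing, doing the resize work under /tmp, can be sketched roughly like this. This is a hypothetical handler, not the actual Cyclic code; the cleanup helper reflects the fix the team eventually adopted, clearing /tmp at the start of every invocation, since warm Lambda instances keep their disk between requests:

```python
import os
import shutil

TMP_DIR = "/tmp"  # historically the only writable path (512 MB by default) in a Lambda container

def clear_tmp(tmp_dir=TMP_DIR):
    """Delete leftover working files from earlier invocations.

    Warm Lambda instances keep /tmp between requests, so files written
    by previous invocations accumulate until the disk fills, the
    function faults, and the instance gets recycled.
    Returns the number of bytes freed from top-level regular files.
    """
    freed = 0
    for name in os.listdir(tmp_dir):
        path = os.path.join(tmp_dir, name)
        if os.path.isfile(path):
            freed += os.path.getsize(path)
            os.remove(path)
        elif os.path.isdir(path):
            shutil.rmtree(path, ignore_errors=True)
    return freed

def handler(event, context):
    # Cleaning at the START of each invocation means even a crashed
    # previous run cannot leave the disk full for this one.
    clear_tmp()
    work_file = os.path.join(TMP_DIR, "resized-" + context.aws_request_id + ".png")
    # ... download the source image, resize it, write the result to work_file ...
    return {"statusCode": 200}
```

The same cleanup could run at the end of the handler instead; doing it at the start is the defensive choice, since a timeout or crash skips any end-of-handler code.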
And as Lambda gets recycled, these containers get torn down and that
slash temp gets thrown away. However, depending on how your Lambda instances get cycled, and this
is something that's really obfuscated to you, you can build up files in that slash temp space,
and then you can run out of disk. And when you run out of disk, that fault causes that Lambda instance to recycle. So, you know, it's a very common error. I think people in the community kind of have seen it now and understand this pattern, but as an early practitioner, you start to hit this and you get this inconsistent error behavior, which feels very odd. Following along in that same line,
we fixed that. We fixed the problem of writing temp files to temp and then either not cleaning
up at the start of an execution or at the end of an execution. But we then hit a problem where based on our load pattern, we were processing these large
PDFs kind of in batch, in bulk, through a step function. And if we hadn't been deploying, which
again, when you deploy new versions, you recycle your Lambda. So it's like doing a, you know,
restart, a hard restart on your instances, you have these instances of
your Lambda that could be up for a longer period of time or not. And we found that we started to
get inconsistent errors, even after cleaning out the temp, and we were getting, you know, a different error coming out of the Lambda. Which, again, these intermittent errors, right? That's the hardest thing to solve, because if you can't push a button and make it happen, you know, you don't know the mechanism by which it's happening, right? Because otherwise you would just be able to push the button and recreate the failure and then test if your fix worked. Also, we described it as sort
of like this Heisenberg uncertainty of looking inside the Lambda. We had a system where
we would change environment variables on the Lambda to change the debug levels or the logging
levels instead of trying to make HTTP calls outbound to get config each time and then deal
with caching and like all this sort of config inside of the Lambda. But by changing an environment variable on a Lambda, you trigger a refresh of those instances, so any running instances get flushed, and essentially you've just cleared out your temp, or you've cleared out any other OS memory space, you know, if you have a memory leak or the like. So by looking at it and turning on debug, we created a problem where we couldn't see it.
And we were also on a pretty high deployment cadence.
So if we were deploying every night to that service, we would flush things out and we wouldn't get any errors the next day.
So we had to wait till our deployment frequency on that service came down. And we weren't looking at the debug logs
or changing the debug settings enough
that the instances would stay running long enough
such that we would start to see this error.
And what it turned out to be is a file descriptor,
essentially a memory leak of a file descriptor
coming out of ImageMagick.
And again, once it would fail, it would cause the whole Lambda to get recycled.
So it wasn't a problem because it would just fail once and then it would disappear.
And then we'd start looking at it and we couldn't find it.
And the way we had to finally isolate it is we had to turn the concurrency,
max concurrency to one and min concurrency to one on the Lambda.
So basically
isolate that Lambda instance into a single one. And then we'd have to do a load test for a certain
period of time. And we could recreate it then. And so then at least we had a system where we
could reproduce this error. In investigating it at the time, we didn't think it was in our
skill set or understanding how to go fix that error or pay
somebody to fix it or anything like that. So the solution we came up with was to cause the Lambda to recycle itself, so kind of reset an environment variable on the Lambda,
which would flush any of those instances. Definitely not a pretty solution, felt a little
bit of a hack, but kind of understanding that life cycle of the Lambda and
really getting in our mind how that worked was sort of what masked the problem. And then in the
end actually ended up being the way that we solved it, which is kind of a little bit of an irony in the end, too.
It reminds me a little bit of the days when I was working on IIS with the W3WP containers,
like the runtimes basically that were hosting,
let's say, your.NET, ASP.NET applications,
that they had their recycling either on a schedule basis
or on a memory basis, but then we were just prematurely
then recycling these worker processes
because we knew at some point they are crashing for whatever reason.
And we just kind of took that pain away by just, you know, forcing them to restart at a time when there was low load.
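The restart-as-band-aid both speakers describe has a direct Lambda equivalent: updating the function configuration (for example, bumping a throwaway environment variable) flushes every warm instance. A rough sketch with boto3; the function name and variable name are made up, and the AWS calls need real credentials to actually run:

```python
import time

def bump_recycle_var(env_vars, key="FORCE_RECYCLE_AT"):
    """Return a copy of a Lambda environment dict with a dummy
    variable changed; any configuration change is enough to make
    the service tear down and replace warm instances."""
    updated = dict(env_vars)
    updated[key] = str(int(time.time()))
    return updated

def recycle_function(function_name):
    """Force-recycle all warm instances of `function_name`."""
    import boto3  # AWS SDK; imported lazily so the helper above stays testable

    client = boto3.client("lambda")
    config = client.get_function_configuration(FunctionName=function_name)
    current = config.get("Environment", {}).get("Variables", {})
    # update_function_configuration REPLACES the whole environment,
    # so merge with the existing variables instead of overwriting them.
    client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": bump_recycle_var(current)},
    )
```

Scheduled during low load, in the spirit of the IIS-style nightly recycle, this is essentially the workaround described for the file descriptor leak.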
Yeah, early in my career, I remember reading a blog post by Joel Spolsky when he was still at FogBugz, and he talked about doing just this, that he would do a nightly restart of all of his
web servers, which, like, on one hand, like, kind of, like, was nails on a chalkboard for, like,
this just-out-of-college, you know, computer engineer, like, oh, we can figure this out,
and we can solve it, like, but then also just that practicality of, okay, well, does the solution
work, and does it prevent the problem? And, you know, are we going to spend weeks of time trying to solve this problem, or do we have something that's not going to impact the system that will achieve the same result? And so finding those balances, I think, is really important. And, you know, the business pays for things to work, not always for them to work in the way that we would like to think that they quote-unquote should work. I think, going back to the beginning, the quote shoulds is a little bit of what I'm saying we need to examine. You know, the most dangerous word in an engineering context, I think, is should, right? It either leads us down a design
process that isn't connected with business realities or isn't connected with technical
realities, or it blinds us from actually seeing potentially what is truly the cause of the bugs,
because we immediately are blocking ourselves from some possible outcome or some possible state
in the world.
So yeah, should is always a trigger word for me.
I like the dose of reality you just threw at me there.
We and probably others in the observability space always come at these problems like,
well, let's gather the information, find what's causing it, and fix it.
And in sitting behind the desk trying to keep all these things running,
sometimes it really is like, well, how do we just keep this part running?
Because we have other things we've got to focus on and concentrate,
and if we can fix this with a restart, we know it's not the best thing.
But until it becomes catastrophic, we'll do that and work on the other, you know, put out all the other fires that are even blazing hotter, and then get back to this. But we always have this academic view of, well, find it, you know, let's get to it. And I just never really heard it put that way, I guess. I mean, not that it was put in a special way, but I've never heard someone say, like, yeah, no, sometimes we're just fine with the band-aid, and if it's got to be that, we know the risks, we accept it, and we move on.
We think about software systems as if they are somewhat homogenous, even inside of a single organization. But if you think about the
scale that each of them deals with, as far as the investment dollars that will go into the software
system, or the value returned to the organization, if you think about each service, you know, if
you're in microservice or each application, right, and you tried to plot it on some distribution of
the value and the load that that application
experiences, you'll probably see like maybe three orders of magnitude in load, maybe more
four or five, right?
Some very large organizations.
And probably value has a very similar distribution where you have some services or some applications
that have a huge value
that they're delivering to the organization.
And probably this type of fix might not work in those high load, high value systems, but
there's a tremendous amount of software that doesn't run those three or four orders of
magnitude higher on the scale or on the value curve. And I don't think we have, I certainly don't have, good vocabulary and language for describing these types of, yeah, the value delivered and the scale of value delivered and the scale of load.
Yeah, maybe service level agreements, or service level objectives, is the closest I've heard on the performance side. But all three of those really change how I as a practitioner am going to approach what is an acceptable solution, or what is an acceptable engineering answer, to a particular problem or an anomaly.
Yeah.
Hey, I got three quick things I wanted to bring up.
One, so basically, we're often building workarounds to something we cannot control otherwise.
Like in your case, you had this file descriptor leak or whatever you want to call it.
And then you built a band-aid around it. Now the question is, if they fix the problem at some point, you don't necessarily depend on it anymore. But how would you learn that the fundamental problem in the third-party software was actually fixed? And I think it goes the other way around as well, right? Maybe an update of the underlying runtime introduces a new thing that you now need a band-aid for, but you haven't needed one earlier. And this is, I think, where obviously testing comes into place: every change, whether it's your own code or any of the dependencies, needs to go through testing, because you never know, right, whether things are behaving differently in a good or in a bad way.
Yeah, absolutely. I have this kind of thought I've been chewing on with testing and sort of
the design choices we make in our testing and testing broadly, not just unit tests and code
written to test the functionality, but our whole software pipeline is a long sort of evolutionary selection
mechanism for the types of bugs that we will see in production. So every design choice we make in
how we test our software introduces the possibility of a bug, or it doesn't fully test every possible
bug, right? So if you think of bugs as, you know,
bacteria that we're growing in a Petri dish when we're writing code or when we're, you know,
assembling systems, all of the verification steps we do are almost like some sort of antibiotic
that we're applying. We screen away 99.9% of the bugs. And so there's some amount to get through
and they're constantly being repropagated, so that the bugs, or the problems, that we see in production are literally a result of the design decisions of our software delivery process. I don't know, I've been trying to think through how to put together a talk and talk through some of the design decisions, or failures, in the sort of value stream architecture that I've put together in the past, and sort of how that manifested in some cases in these bugs or these failures that I've talked about in this conference talk.
Yeah, let us know when you have the story done, because then we'll invite you back. It's an interesting topic.
Yeah, that's really interesting. I mean, I'm not even thinking of it probably
in the same way you're presenting it,
but that's even just what you said there
has got a bunch of things floating in my head.
So yeah, please, please, please come back
when you have that.
Yeah, and the simplest first example
just as to like, you know,
get you chewing on something
is that if you have an inconsistent bug in a test suite, but you let developers keep committing till they get one green bar, then you've just introduced a way that inconsistent bugs can flow through from your dev system into your test system, right? If it's a one-in-ten failure, then, oh well, if they bump the README, or they make a fix that's not actually a fix but it green-bars, and then it goes through to the next step, you've just selected for a bug that's inconsistent. So, you know, if you decide, okay, well, I'm going to run the test suite twice in a row and it has to green-bar both times, now that one-in-ten bug should be one-in-a-hundred. So it's fewer, but you're just selecting down the number of chances that it's going to slip through. So I don't know, hopefully that'll be an earworm and it'll chew on you a little bit.
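Kam's back-of-the-envelope math checks out if you read "a one-in-ten failure" as the bug green-barring one run in ten: requiring consecutive green runs shrinks the slip-through chance multiplicatively. A quick sketch (the function names are mine, not from the talk):

```python
import random

def slip_probability(p_green, runs_required):
    """Chance a flaky bug survives a gate demanding `runs_required`
    consecutive green runs, when the bug lets any single run pass
    (green-bar) with probability `p_green`."""
    return p_green ** runs_required

def simulate_slips(p_green, runs_required, trials=100_000, seed=1):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    slipped = sum(
        1
        for _ in range(trials)
        if all(rng.random() < p_green for _ in range(runs_required))
    )
    return slipped / trials
```

With p_green = 0.1, one required run slips the bug through 10% of the time; two required runs bring that to about 1%, matching the "one in ten becomes one in a hundred" figure, though each extra rerun also pays its cost on every clean build.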
Yeah.
Very philosophical.
Quickly, two other things,
because you talked about the debug logs, log levels.
What's the right way of doing observability now
in serverless from your experience?
Is it still just logs?
Do you already look into open telemetry?
Do you do other things?
Because Brian and I,
observability is a big topic for us
day in and day out.
So I'm just wondering.
Yeah, I mean, I think that
I can't claim expertise in this area.
OpenTelemetry, or open telemetry,
as you say in the European pronunciation,
you know, is really intriguing to me.
I've, again, just begun to get my feet wet in it
and looked at some of the different vendors in the space.
It seemed that some of the early stuff that I was looking at
seemed more well-suited for running inside of containers or running on,
you know, whether it's a Kube platform or some flavor of EC2
or something like that.
And so, I don't know.
I haven't found something perfect yet that I really have sunk my teeth into
on the serverless side.
I wish I knew more.
No, it's okay.
Maybe listen to some of our other podcasts
and then you learn more.
That's one option I'm doing.
And Andy, before you go on,
speaking of the open telemetry versus open whatever.
Telemetry. Telemetry.
Can you say metaphor for us?
Metaphor? Isn't it Metuffer?
Metuffer, yeah, yeah, yeah.
That was way back. I remember way back, we had a guest who kept saying Metougher, and I'm like, guys, what's a Metougher? And they were like, oh, oh, oh, oh, Metaphor, okay, okay. I was so confused.
Sorry, I didn't mean to sidetrack, but it's just reminding me, I had a flashback to that instance.
You had a third thing you wanted to bring about, right?
Yeah, just one quick thing.
You mentioned in the beginning the way you were kind of trying to find out the problem
is you set the concurrency back to one.
Setting the concurrency back to one means you're forcing Lambda
to really only stand up one container or whatever instance,
one runtime instance at a time.
Is this the way I understand it correctly?
Yes. How I understand how Lambda works internally,
primarily based off of a talk, I believe it was at the 2018 re:Invent,
and I believe it was the VP of Engineering
or the engineering lead on the Lambda team,
and kind of walks through the core architecture
so people can sort of fact check me there,
is that setting that max concurrency to one
is when a request comes in to Lambda,
it'll take it on this kind of like receiver process
and then it'll look at its backend
to see if there's any available Lambda instances
to serve responses for that.
And it has some amount of timeout, and I think it has some prediction about how long the average time to service a request is, so that it'll hold it a little bit and not spin up new instances.
And it'll try to get ahead of if it's seeing a load spike.
So there's all sorts of black magic that's happening inside of there.
However, setting that max concurrency to one is saying,
no matter what happens,
the backend should only spin up one instance of a Lambda.
So essentially, you're only ever hitting the same memory space.
If anything happens with that memory space
and it decides that it needs to shut it down
because there's an error in the runtime,
or because AWS decides the host that it's running on needs to get shut down.
My understanding is it will still recycle and get assigned somewhere else,
but you won't have multiple copies.
Otherwise, if you got two API requests that come in almost simultaneously,
they could lead to two backend instances being
spun up, even going from zero, so even from a cold start. So you'd get a cold start on both,
each one would serve, you'd essentially have two memory spaces sitting there,
and then they'd both get torn down. Yeah, and obviously the reason why you did it is because
you wanted to, I guess, just focus on the one instance anyway. That means not getting flooded with too much information,
too much logs from multiple instances.
And I guess also for a cost reason,
if you're executing a load test
and then you really only just focused on
identifying the problem of a single instance,
then just narrow it down to a single instance
and that's it.
Makes sense.
Yeah.
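The knob described above is what AWS calls reserved concurrency. Pinning it to 1 for a debugging session and clearing it afterwards might look roughly like this with boto3; the function name is hypothetical, and this is a sketch of the speakers' one-off debugging trick, not a production recommendation:

```python
def pin_to_single_instance(function_name, pinned=True):
    """Set, or clear, a reserved concurrency of 1 on a Lambda function.

    With the cap in place, every request is served by the same single
    instance (same /tmp, same memory space), which is what made the
    intermittent file descriptor leak reproducible under a load test.
    """
    import boto3  # AWS SDK; requires real credentials to run

    client = boto3.client("lambda")
    if pinned:
        client.put_function_concurrency(
            FunctionName=function_name,
            ReservedConcurrentExecutions=1,
        )
    else:
        # Remove the cap once the investigation is done.
        client.delete_function_concurrency(FunctionName=function_name)
```

Leaving the cap in place would throttle real traffic to one concurrent execution, which is why, as Kam notes right after, this never became part of the regular test setup.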
And to be clear,
we didn't add that as part of our integration tests going forward, right?
That was just a one time, hey, we have this theory, can we prove it so that we know the mechanism by which we're, you know, that's sort of the scientific method of if you truly understand something, you can replicate it.
Yeah.
So now you just designed your test so it doesn't account for that and allows things like that.
That's true. And we accepted that. That's the beauty of engineering. You say, and it's good enough.
I had one question if I can get in about,
you mentioned that these were lessons learned
on AWS Lambda.
Andy, I can't remember who we had on
a little while back.
We were talking about our development
of the Kubernetes version of Dynatrace and how it was different for AWS,
different for Azure, there's different disks and all that. I assume, Cam,
but I don't know if you've had experience that you can talk to, that what you find
in AWS Lambda serverless may or may not translate
to any of the other serverless components either. So this complication you're
dealing with of,
yeah, we're going to use this supposedly easy thing
where we can just write our code and push it out,
but now we really have to understand
what's going on under the hood.
That seems like there's a pattern
where when people are doing these things,
they're going to have to repeat that step process
over and over again for each cloud vendor they go to
because no two are completely alike in that sense.
Is that, I don't know if, I know you said
you didn't really have too much experience with those,
but do you have any inkling if that might be the case?
Yeah, I mean, that's generally my understanding.
Sort of the data point I base that off of
is talking to a short-term CTO for hire
at, I think, a Fortune 1000 company in the US that did a big
migration of a lot of their internal systems
and was doing a bake-off between AWS and Azure.
And he was telling me some stories about how the Azure serverless
functions, and this might be three or four years old and I haven't kept up to date,
were having a much harder time handling any sort of
bursts in load. It seemed that their load prediction algorithms were not as good at
spinning up additional instances, and their cold start times were worse. I've noticed on Lambda that cold start times are consistently going down.
So I don't know how much of that is the predictive
spinning up of instances ahead of time, which Lambda seems to do a great job of.
Or they're just getting much better at spinning up run times
and that the efficiency of the startup is much faster. But yes,
back to your point is that my understanding is that these subtleties will be different across
different cloud providers. And I've also noticed it in other services. I would consider CodeBuild a serverless
service, and I've noticed some significant differences in startup time on CodeBuild across AWS regions.
So for example, we run our main application
in three different regions just to provide better service
to our end users.
And we noticed that Sao Paulo has a significantly slower
cold build startup time than we'll see in either Mumbai or Ohio.
So, you know, I kind of extrapolate that across other services.
And, you know, the scuttlebutt I hear from AWS insiders
is that their region parity, you know, they're always...
What's the polite way of putting it?
They're looking to improve their region parity.
And so even one service might
not have the same level of
robustness
or
capacity in some of their different
regions.
And they may also have different hardware
just running there, right?
And that's what it is.
I think it throws a complication
into the idea of evaluating
different cloud platforms
for what a company
might want to choose as well
because this might not be the case
in that AWS example.
I mean, in the Azure example
you mentioned,
but it could be that
whether or not it's serverless
or anything else,
it's optimized to run
with a slightly different architecture.
And if you're not aware of that optimization
and you're trying to just port over your Lambda function
and the way you have that to run to another platform
without knowing what their optimal approach might be,
you might be like, oh, well, their performance is terrible.
But if you were to have that secret sauce
of what they're designed to operate optimally on,
you might be able to compare.
So not only is it hard to, I guess,
maybe compare different features and functions
potentially on the different platforms,
but even if you want to be multi-cloud,
let's say you want to spread yourself
amongst different vendors,
you might have to have some architectural changes
and differences based on what's going on under the hood
and what's the best way to write for those processes,
whether it's serverless or anything, right?
But that's more and more complications
as we make things simpler, you know?
Yeah, yeah.
So that, I don't have experience in that.
My career has primarily been at the beginning end of the curve as far
as time. Most of my software has been, you know, from inception, from day zero through
maybe year 10 of the life of a system. Some of these much longer-in-the-tooth systems,
you know, been running for 20,
30 years, I don't have nearly as much experience with. I think the
oldest running system I've ever had was some PBX software
I had running on some desktop machines and stuff like that. So I don't
really have experience with these longer,
more enterprise where you're like,
oh, we're going from on-prem and now we're migrating into,
should we go to Azure or AWS
and let's do a bake-off and stuff like that.
But that sounds right.
Your theory sounds right to me.
Hey, considering the time,
we have about 10 to 15 minutes left. I would like to
give you a shot at a second topic, whatever you choose. I know you have plenty in
your presentation.
Yeah, what else should we talk about? I think S3. That's probably the most
well-known, most used service. And again, sort of this subtlety, going back to your point,
Brian, on that is that, you know, with S3 in an application, early on you have to decide
what's your pathing structure, right? Like, that's something that has to happen before you can write your application. And so you can read
best practices or, you know, whatnot. Within the system that I had been working on,
I came in and the pathing structure had already been decided, and we were using the
global S3 domain name, and then we were using a path with a date in it. Our understanding at the time was that
the date would provide some randomization and allow the S3 system to shard our files across
however they did it, across clusters or individual computers.
And so things worked fine until we started storing more and more
data in S3.
And then we turned Athena on and pointed it at S3.
And so Athena, my best way of describing it is it seems to use these 429s with S3, and 429, I think, is the "too many requests" status, right?
Like it seemed to be like a network flow control discussion that Athena would use to size how much data it could pull from S3, which was fine for Athena.
And it was fine for S3. Like S3 didn't really
complain. We couldn't even tell it was having problems until we like turned on extended metrics
and looked at it. But our core application was trying to write to those same, to that same
bucket. And so we were getting write failures because S3 was getting saturated. Essentially,
the IO was getting saturated on each of these clusters or inside of each of these shards.
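A standard way to soften those throttling-induced write failures is retrying with exponential backoff and jitter. Here's a minimal sketch of that idea; `put_fn` and `ThrottledError` are hypothetical stand-ins for the real client call and its throttle error (with boto3 that would be `s3.put_object` and a `ClientError` check), not anything from the episode.

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for a throttled (HTTP 429 / SlowDown-style) response."""

def put_with_backoff(put_fn, key, body, max_attempts=5):
    """Retry a throttled write with exponential backoff and full jitter.

    `put_fn` is a hypothetical stand-in for the real client call; it
    should raise ThrottledError when throttled and return normally on
    success.
    """
    for attempt in range(max_attempts):
        try:
            return put_fn(key, body)
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of retries, surface the failure
            # Full jitter: sleep a random amount up to base * 2^attempt,
            # so many saturated writers don't retry in lockstep.
            time.sleep(random.uniform(0, 0.05 * (2 ** attempt)))
```

Backoff only spreads the load out in time, of course; it wouldn't have fixed the underlying hotspot Kam describes next.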
And later what we learned is that internal to S3 at the time, they used sort of a progressive
path-based hashing. So they would just try to find a prefix that was consistent.
And the first character that varied,
they would try to shard based on that first character that varied.
And so for,
in our case,
it was a date.
And so that was the first character that had variability in it.
So everything that was temporally the same or close to each
other would also be on the same shard in the S3 backend. So then when Athena was trying to do a
date report, it would just bombard this one shard and hotspot it. So I think Amazon has done some more hash-based sharding.
I can't speak to whether they fixed that or not.
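The hotspotting mechanism Kam describes can be illustrated with a toy model. This is not AWS's actual partitioning algorithm, just a sketch of the "split on the first character that varies" idea: date-first keys concentrate on a couple of partitions, while a short hash prefix spreads them out.

```python
import hashlib
import os.path
from collections import Counter

NUM_SHARDS = 4

def shard_for(key, shared_prefix):
    """Toy model of progressive path-based partitioning: split on the
    first character after the prefix that all keys share. Purely
    illustrative, not S3's real internals."""
    return ord(key[len(shared_prefix)]) % NUM_SHARDS

def shard_counts(keys):
    prefix = os.path.commonprefix(keys)
    return Counter(shard_for(k, prefix) for k in keys)

# Date-first keys: the first varying character is a date digit, so months
# of objects collapse onto a couple of partitions, and a date-range scan
# (like Athena's) hammers just one of them.
date_keys = [f"2022-{m:02d}-{d:02d}/{i:04d}.json"
             for m in (8, 9, 10) for d in range(1, 29) for i in range(5)]

# Hash-first keys: a short digest prefix makes the first character vary
# across the whole keyspace, spreading load over all partitions.
hashed_keys = [hashlib.md5(k.encode()).hexdigest()[:2] + "/" + k
               for k in date_keys]

print(sorted(shard_counts(date_keys)))    # only a couple of shards carry everything
print(sorted(shard_counts(hashed_keys)))  # the load spreads across shards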
In general, the way we kind of got around this was that we replicated the data from one bucket in East 1 to a bucket in East 2.
And then we started running the reports off of that East 2 bucket.
So S3 is fine, you know, replicating the data for you; it's a nice, easy thing to turn on. And S3 was fine getting
all those 429s, right? It's not a problem for it. So Athena was fine running off of that other
bucket. The side effect, though, another way to fail there, was that we never moved the Athena process from East 1 to East 2.
So then we got to continuously pay AWS for their network egress from East 2 back to East 1 to do the analysis and put together those dashboards.
So, you know, again, one of those things where, yeah, it works fine and you might not notice it until you start digging into the cost reporting and really dig through the tags.
The other place, and I don't know if you noticed it in my description just now,
but there's a subtlety here: the experience of S3 is that
it's this global service, but S3 is actually a regional service. And you notice this when you
create a bucket, right? You have to create it in a region. And so, yes, there's naming problems that
if you're not thinking about it at the time, like you might create, you know, some bucket,
bucket name in East 1,
and then you go to create a second bucket in East 2, and now you can't create bucket name in East 2.
So there's a little bit of a headache there; kind of the common practice is to append the region name.
But then you have this other issue that, at least for us, we were using the global reference to that bucket,
and so not the regional DNS host name.
So that meant we had another level of indirection.
So we were hitting a DNS server in East 1 to get the global name for that bucket.
And then that was giving us access to that S3 bucket. Now, in, I believe it was early
2019, AWS had a DDoS attack against those global S3 DNS servers, right? Coming back to your previous
podcast on it's always DNS under the hood, right? Like, every failure is DNS. And then, you know, those servers went down, or they
got swamped. And even though S3 was still up, and the service still existed, and the regional endpoints,
so if we had referenced it by bucketname.us-east-2 or us-east-1 dot and then, you know, the S3
pathing, we could have gotten access to the data,
and all of those servers were still connected and available,
we were unable to access it, and our core system went offline
because we didn't have DNS,
which got made more complicated
because our whole build system
also relied on those same S3 global paths.
So you get into these interesting
kind of recursive dependency problems there.
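The regional fallback Kam is describing can be sketched in a few lines. The hostname patterns follow S3's documented virtual-hosted style, but `fetch` is a hypothetical stand-in for any HTTP GET (e.g. `requests.get`), and a real client would distinguish DNS failures more carefully.

```python
def global_url(bucket, key):
    """Global (legacy) virtual-hosted endpoint: resolution depends on the
    shared s3.amazonaws.com DNS name, the single point of failure in the
    outage described above."""
    return f"https://{bucket}.s3.amazonaws.com/{key}"

def regional_url(bucket, region, key):
    """Regional virtual-hosted endpoint: resolves through the region's own
    DNS name, so it can keep working when the global name is unreachable."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

def fetch_with_regional_fallback(fetch, bucket, region, key):
    """Try the global name first, fall back to the regional one. `fetch`
    stands in for an HTTP GET; DNS resolution failures typically surface
    as an OSError subclass in Python clients."""
    try:
        return fetch(global_url(bucket, key))
    except OSError:
        return fetch(regional_url(bucket, region, key))
```

Of course, as Kam notes, a fallback like this only helps if the build system that ships it doesn't itself depend on the same global paths.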
So, yeah, that was a couple ones.
I know you, Andres, you said...
No, that's perfect.
I was actually, it's funny,
looking through the presentation that you did,
and S3 was actually the one I would have picked,
the stuff I would have hoped you'd pick.
So it's perfect.
Good. Good.
Yeah.
Cam, I know we will link to the presentation, we will link to Cyclic, your website.
Is there any other material that you've put out there, any other speaking slots, anything
else where people can maybe reach out to you or see you and meet you in person?
Oh, yeah.
I don't think I have any speaking slots on the calendar going forward.
I'm not on the conference circuit as much as you are.
You know, maybe I should up my conference speaking game.
Yeah, but reach out to me on Twitter.
I'm sure we can put that up.
And, yeah, definitely kam.lasater at cyclic.sh.
Happy to respond to any emails
that humans are willing to send me.
So this was a shout out also
to all the bots out there
that are sending fake emails.
Feel free to use that as well.
Isaac Asimov is rolling in his grave.
No, but it was really great. And I think, as you said, it's great to talk
about these stories where things didn't go as expected, where you learn something new about
the complexity of the technology we're really working with, even though the producers of the
technology make it sound like it's very easy, because they hide the complexity.
But, yeah, it takes people like you
to educate everybody out there that wants to become successful with technologies like
serverless. So thank you so much for that.
Yeah, absolutely. If anybody has a story, I'd love
to hear it. You know, record a little YouTube video or write up a blog post. I really want to
trigger other people to, you know, come out and share with the world the ways they've hit the wall or caught something on fire accidentally.
We've all done it. Yeah, it reminded me when you were talking about the idea of sharing the
failures, it reminds me of the Titanic, right? It's like, we all pay attention to the fact that
the Titanic hit an iceberg, right? And if that wasn't shared, there might be others.
But when's the last time you heard about a cruise ship sinking on an iceberg?
No, right?
Because they shared it. They thought they were indestructible,
and now people don't crash into icebergs.
They avoid them.
Hey, great lesson.
But it's so important that you're sharing this stuff,
and I think that's just fantastic. I was
looking over the PowerPoint, you know, yesterday as well, and these are the things
that I love. I love hearing these stories. You know, I don't want to say I glory in them
by calling them stories of failure, but I think those are the more interesting things to learn
from, right? Because what works for people is not going to work in other situations,
but what fails for people is something we can all take and make sure we don't do.
So I really appreciate what you're doing,
and thanks for being on today.
Brian, is James still doing the News of the Damned?
I don't know.
I haven't really been.
I got myself, as part of my mental health,
I got off of Twitter as well.
So I'm not following if he's doing that, but that was always a fun show.
If it's still out there, people go listen to it in the great archives too.
So Cam, our friend James Pulley, I think it was about a weekly podcast,
would do News of the Damned, and he'd go through all Twitter and news things
to find out the different crashes and downtimes different organizations would have.
And sometimes he would try to get someone on,
but he would also just,
I think,
try to surmise what may have happened. Good stuff.
Yeah.
Awesome.
Awesome.
Everyone.
Thanks for listening.
Cam,
thanks for being on today.
And if anybody has any questions or comments for Cam or whatever,
we'll have his information posted in the show notes.
If you have any questions or comments for us,
you can reach us at pure underscore DT at dynatrace.com
or tweet us at Pure Performance.
I haven't been on that in a while.
Pure Performance on Twitter as well, right?
Is that the, I don't remember what the handle is.
It is, it is, yeah.
I haven't paid attention to that.
Either way, just, you know, send us a note with some money
and we'll talk to you.
Thanks, everyone.
Cam, thanks again for being on.
Andy.
Absolutely.
Go grab some candy.
Yeah, will do.
Bye-bye.
All right, see you.