Screaming in the Cloud - Episode 31: Hey Sam, wake up. It’s 3am, and time to solve a murder mystery!
Episode Date: October 10, 2018

Have you ever been on call as an IT person or otherwise? Woken up at 3 a.m. to solve a problem? Did you have to go through log files or look at a dashboard to figure out what was going on? Did you think there has got to be a better way to troubleshoot and solve problems? Today, we're talking to Sam Bashton, who previously ran a premier consulting partner with Amazon Web Services (AWS). Recently, he started runbook.cloud, which is a tool built on top of serverless technology that helps people find and troubleshoot problems within their AWS environment.

Some of the highlights of the show include:

- Runbook.cloud looks at metrics and applies machine learning (ML) to pinpoint issues and present users with a pre-written set of solutions
- Runbook.cloud looks at all the potential problems that can be detected, in context with how the infrastructure is being used, without being annoying and useless
- ML is used to do trend analysis and understand how a specific customer is using a service for a specific auto scaling group or set of Lambda functions
- Runbook.cloud takes all its data in aggregate to influence alerts; if there's a problem in a specific region with a specific service, the tool is careful to caveat it
- Various monitoring solutions are on the market; runbook.cloud is designed for a mass-market environment; it takes the metrics that AWS provides for free and makes it so you don't need to worry about them
- Will runbook.cloud compete with or sell out to AWS? Amazon wants to build the underlying infrastructure and have other people use its APIs to build interfaces for users
- Runbook.cloud is sold through AWS Marketplace; it's a subscription service where you pay by the hour and the charges are added to your AWS bill
- Amazon vs. other cloud providers: so much work is involved in accurately detecting problems that addressing multiple clouds doesn't make sense yet
- Runbook.cloud was built on top of serverless technology for business and financial reasons; it's a way to align outlay and costs, because you pay for exactly what you use
- Analysis paralysis is real; it comes down to getting the emotional toil of making decisions down to as few decision points as possible
- Save money on Lambda: instead of running several hundred Lambda functions concurrently, put everything into a single function using Go's concurrency
- AWS responds to customers to discover how they use its services; it comes down to what customers need

Links:

- Sam Bashton on Twitter
- runbook.cloud
- How We Massively Reduced Our AWS Lambda Bill with Go
- AWS
- AWS Lambda
- Microsoft Clippy
- Honeycomb
- AWS X-Ray
- Kubernetes
- Simon Wardley
- Go
- Secrets Manager
- DynamoDB
- EFS
- DigitalOcean
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This week's episode of Screaming in the Cloud is generously sponsored
by DigitalOcean. I would argue that every cloud platform out there biases for different things.
Some bias for having every feature you could possibly want offered as a managed service at
varying degrees of maturity. Others bias for, hey, we heard there's some money to be made in the cloud space. Can you give us some of it?
DigitalOcean biases for neither. To me, they optimize for simplicity. I polled some friends of mine who are avid DigitalOcean supporters about why they're using it for various things,
and they all said more or less the same thing. Other offerings have a bunch of shenanigans
around root access and IP addresses.
DigitalOcean makes it all simple.
In 60 seconds, you have root access to a Linux box with an IP.
That's a direct quote, albeit with profanity about other providers taken out.
DigitalOcean also offers fixed price offerings. You always know what you're going to wind up paying this month,
so you don't wind up having a minor heart issue when the bill comes in.
Their services are also understandable without spending three months going to cloud school.
You don't have to worry about going very deep to understand what you're doing.
It's click button or make an API call and you receive a cloud resource.
They also include very understandable monitoring and alerting.
And lastly, they're not
exactly what I would call small time. Over 150,000 businesses are using them today. So go ahead and
give them a try. Visit do.co slash screaming, and they'll give you a free $100 credit to try it out.
That's do.co slash screaming. Thanks again to DigitalOcean for their support of Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
I'm joined this week by Sam Bashton,
who once upon a time ran a premier consulting partner with AWS.
Recently, however, he started something new called Runbook.Cloud.
Welcome to the show.
Thank you. Thanks for having me on.
Always a pleasure.
It's interesting to me to talk to people
where there are multiple different aspects of what they do
that apply directly to how I view the world.
What's interesting to me about Runbook
is that, on the one hand, it's a tool that helps people find and troubleshoot problems within their AWS environment, which is fascinating and highly relevant.
But what's also equally interesting to me is that you built the entire tool on top of serverless technology.
So it feels like we should definitely tackle both angles of those.
Which do you want to go into first?
So maybe if we talk a bit first about my motivations for building runbook.cloud.
Ah, the why. Absolutely.
Cool. So basically, for my entire career since leaving university, I was on call at some point or other,
often one week in four. And I would get a call in the middle of the night, around about once a week, and something had gone wrong.
So I would have to troubleshoot what that problem was
and work out what to do to fix it.
And at first that was very nerve-wracking
and it quickly became less exciting
and more an incredibly large chore.
I don't think anyone enjoys doing on-call.
There's a certain adrenaline rush of fixing a problem quickly.
Hey, Sam. Hey, Sam. Hey, Sam. Wake up. It's three in the morning. You know what you want to do now?
That's right. Solve a murder mystery. Yeah, exactly. Exactly. And all we've got
to help you solve the murder mystery is pages and pages and pages of graphs.
And that's if you're lucky, because in the early days, you probably didn't have any metrics at all.
And you just had to kind of look at some log files and do your best to try and work out what was going on.
Then as things got better, you built dashboards, and a dashboard became, you know, like scar tissue for an organization.
Here are all the things that have failed previously.
They probably won't be the things that go wrong in future,
but at least if it breaks again, we've got a way to check on that.
So I thought, well, there's got to be a better way to do this,
and runbook.cloud is my attempt to try and build that better way.
So what we do with runbook.cloud is we look at all the metrics,
which people are drawing pretty graphs from,
but we apply some intelligence.
And when I say intelligence, I don't mean my intelligence.
I mean machine learning, of course.
And we pinpoint where are the issues within the infrastructure,
and then we have a pre-written set of, here are solutions to known problems that can occur.
And we present that to the user.
So when you get paged at three in the morning, you still see a problem.
But as well as seeing a problem, you see, well, here's what it looks like the problem
is, and here's a suggested solution.
And quite often, there won't be, well, it's definitely this.
It will be, well, it looks like it's probably this,
but there's a chance it could be this other thing.
You know, you might get a list of two or three suggestions.
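To make that concrete, here is a rough sketch, purely illustrative and not runbook.cloud's actual data model, of what a symptom paired with a short, ranked list of pre-written fixes might look like. All names and confidence values are invented.

```go
package main

import "fmt"

// Suggestion pairs a pre-written runbook entry with how likely it
// is to explain the detected symptom. Fields are hypothetical.
type Suggestion struct {
	Title      string  // short description of the probable cause
	Runbook    string  // key to the pre-written fix
	Confidence float64 // 0.0 to 1.0, from the detection layer
}

// Finding is what the pager shows at three in the morning: the
// symptom plus two or three ranked suggestions, not a wall of graphs.
type Finding struct {
	Resource    string
	Symptom     string
	Suggestions []Suggestion
}

func main() {
	f := Finding{
		Resource: "asg/web-tier",
		Symptom:  "sustained low CPU on a batch workload",
		Suggestions: []Suggestion{
			{"Upstream queue is empty or unreachable", "runbooks/check-queue", 0.7},
			{"Deploy shipped a no-op worker loop", "runbooks/check-release", 0.2},
		},
	}
	for _, s := range f.Suggestions {
		fmt.Printf("%.0f%%: %s -> %s\n", s.Confidence*100, s.Title, s.Runbook)
	}
}
```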
That's infinitely better than hunting through a ton of graphs trying to work out,
okay, but what does any of this mean to me?
Because really with graphs, the best you can hope for
is that you have the right graph to test the hypothesis that you might have.
And you can look at the graph and say,
actually, you know, that graph disproves my hypothesis,
so I now need to try and invent another potential reason for this problem,
or it proves it, and then you can start trying to do something to fix it.
So how do you keep a system like that from turning into the infrastructure equivalent
of Microsoft Clippy?
It looks like you're fighting an outage.
Have you tried looking at DNS?
It seems like it's the sort of thing that it would be very easy to have become annoying
and unhelpful.
How do you avoid that problem?
Obviously, the observation that, wait, this might be annoying people, is not going to be revelatory to you. How have you thought about
this as far as getting away from that particular failure mode? So that's where the machine learning
comes in. So what we do is we look at all the potential problems that we can detect,
and we look at that in context with how the infrastructure is actually
being used. So a good example is CPU usage. In some scenarios, high CPU usage is an indicator of
a problem. In other scenarios, for example if you're running a batch computing load, high CPU usage is
normal. It's what you should be seeing. And actually, if there's low CPU usage, that's an indicator of a problem. So we can use machine learning to do trend analysis and to
understand actually in the context of how this specific customer is using the service for this
specific auto-scaling group or for this set of Lambda functions, this looks wrong or this looks right. And we can look at it in context.
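As a deliberately simplified sketch of that idea (the episode doesn't describe the actual models), a per-workload baseline makes deviation in either direction the thing that gets flagged. The window, numbers, and three-sigma threshold below are all illustrative assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// baseline computes mean and standard deviation over a training
// window of metric samples, e.g. a week of CPU utilisation.
func baseline(samples []float64) (mean, std float64) {
	for _, v := range samples {
		mean += v
	}
	mean /= float64(len(samples))
	for _, v := range samples {
		std += (v - mean) * (v - mean)
	}
	std = math.Sqrt(std / float64(len(samples)))
	return mean, std
}

// anomalous flags an observation more than k standard deviations
// from the learned norm, in either direction: low CPU is as
// suspicious as high CPU once the baseline says "busy is normal".
func anomalous(v, mean, std, k float64) bool {
	if std == 0 {
		return v != mean
	}
	return math.Abs(v-mean) > k*std
}

func main() {
	week := []float64{92, 95, 91, 94, 96, 93, 90} // batch fleet: high CPU is normal
	mean, std := baseline(week)
	fmt.Println(anomalous(94, mean, std, 3)) // false: business as usual
	fmt.Println(anomalous(12, mean, std, 3)) // true: the idle-looking fleet is the problem
}
```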
So I would say machine learning becoming accessible
to a wider base of developers,
specifically us that are writing this,
that's allowed us to build something that isn't, you know,
the Microsoft Clippy of the DevOps world.
Do you tend to take only a particular user's environment into consideration,
or do you take the global environment as well?
By which I mean, when I would wake up in years past running infrastructures,
I knew that a few things were going to be true.
First, I know that something is broken.
Secondly, I know that Amazon's status page is going to be a sea of green telling me everything is perfect.
And third, I know that I'm not going to really
be able to disambiguate between
is this a problem with my environment
or is this a global problem
until I go on the internet and check Twitter.
Because that's the only sort of global
real-time alert system that most of us have.
If you suddenly see a flurry of activity across the board,
across all of your clients' environments,
are you then able to advise them on that? Or is it strictly bounded by their specific environment?
No. So we very much take all of the data that we're seeing in aggregate and use that to influence
alerts. It's actually something that was informed by my prior experience
running a consulting partner.
So as a consulting partner, we were doing managed services.
We would look after many dozens of customers.
So actually, we had the same scenario.
You know, we would see problems across multiple customers,
and you knew actually it's very unlikely that this is a problem
that's specific to any of those customers.
This is a wider outage.
And obviously we could relay that to AWS in terms of support tickets.
So, yeah, in runbook.cloud, we also look at the data that we're receiving in aggregate.
And if there's a problem in a specific region with a specific service, we're quite careful to caveat it
because the reason I think you often see a sea of green
on the AWS status page
is that, at the scale AWS runs at,
for most of their customers in that region,
everything is working fine.
But when you've got millions of customers,
1% of your customers having a problem
is a significant number of people.
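A toy version of that aggregate check, with invented thresholds rather than anything runbook.cloud has described, might look like this: if the share of monitored accounts showing the same symptom in one region jumps well past its background rate, the individual alerts get caveated with a probable regional issue.

```go
package main

import "fmt"

// regionalIssue is a toy aggregate check: if far more accounts than
// usual show the same symptom for one service in one region, the
// fault is probably AWS-side, not any single customer's. The 5x
// multiplier and the 3-account floor are invented for illustration.
func regionalIssue(affected, monitored int, baselineRate float64) bool {
	if monitored == 0 {
		return false
	}
	rate := float64(affected) / float64(monitored)
	return affected >= 3 && rate > 5*baselineRate
}

func main() {
	// 40 of 500 monitored accounts see elevated Lambda errors in
	// one region, against a normal background rate of 1%.
	if regionalIssue(40, 500, 0.01) {
		fmt.Println("caveat alerts: probable regional service issue")
	}
}
```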
Absolutely.
And that's part of the challenge too that I think does vary from, I guess, a point of scale.
If you have 30 customers and one of them winds up breaking, that's a significant percentage of what you're seeing.
But if you have, I don't know, 500 queries per second hitting your website and you start seeing a 1% variance,
first, that winds up scaling as well to a tremendous number of people.
So it's one of those areas where, at scale, one in a million things, one in a million occurrences, happen five times a minute.
So it really does turn into one of those situationally dependent issues.
Yeah, exactly. And that's where machines are excellent at aggregating that sort of data and working out what's going on. And I think in the past, as humans,
we've not been as smart at using the machines to do a lot of the work for us
as we should be.
Or at least outside of large organizations,
the Googles and the Facebooks of this world,
I don't think we've been as smart as possible.
You look at most people's monitoring setups, and they are pretty dumb right now. And that's not a reflection on the people setting
them up. That's a reflection on the tooling that's available. Most things are, you set a threshold,
and you say, if it crosses this threshold, then there's a problem. And actually, that's not how
things work in the real world. It really does seem like this is an evolution on a very long axis.
I mean, back when I started working with technology,
we started playing with the original Call of Duty video game,
which is, of course, called Nagios.
That's the thing that woke us up in the middle of the night
and everything was broken.
The paradigm of setting something up, often manually,
to look at individual systems and alert when they
went down didn't age very well in a world of ephemeral infrastructure, in a world of auto
scaling, and especially in a world where you have 10 web servers that are load balanced. If one of
them blows up, I probably don't care. If three or five of them blow up, I really care. So it turns
into a story where the traditional thoughts around monitoring
no longer really seem to work. So the next sort of evolution of this has gone towards the idea
of aggregating things, looking at metrics, looking at graphs. And that's terrific. They're beautiful
dashboards you can hang up in an office, you can put on a website and send to execs,
and no one ever looks at them. And that's interesting.
And now we're starting to see the next generation of this stuff emerge, where you see things like
outlier detection, where we start to see systemic issues that underlie things. And it feels like
you're very much in, I guess, in line with the zeitgeist around monitoring thought and theory
right now. Is that something you'd agree with? Am I way off base in my assessment? I'm not going to disagree with you telling me that I've found exactly the right solution to
the problem. I think there are a number of solutions that people are finding. And actually,
I think they address different parts of the market. So you look at something like Honeycomb, which is very much going for
tracing. That's a key part of what needs to be done. But actually, you need a very,
very technical organization to be able to implement that functionality. And if you have
the right sort of organization to be able to implement the functionality to write all that
tracing data, then you absolutely need a tool
like that. With runbook.cloud, I'm trying to go for a more, I guess, mass market environment,
an organization where actually you probably don't have a huge amount of metrics beyond what Amazon
give you out of the box, which is pretty enormous. I think at last count there are 30-odd metrics
purely for EC2 alone. So, you know, Amazon are giving you all of these metrics for free.
And what we're doing is we're looking at them and then trying to make it so you actually don't need
to worry about what any of the individual metrics are. We tell you, look, here's the problem and
here's what you need to do to fix it. And you don't need to worry about what the values are.
That's essentially all abstracted away from you by Runbook.cloud.
It seems like a very interesting direction to go in.
It also further seems like exactly the sort of thing that AWS should be offering, but of course isn't.
Do you have the haunting fear that most people do,
that Amazon is going to one day effectively try and build the version, basically a native platform offering of what you do.
I mean, it's Amazon, so we know the first version is going to be pretty crappy, and it's almost guaranteed to have a stupid name.
But other than that, as it iterates forward and starts to turn into something real, there is the chance that Amazon decides to fix all problems.
And from my perspective, from a monitoring point of view,
I don't know that I necessarily trust them to tell me
when things are broken in a way that is actionable
in a reasonable period of time.
So there's going to be that opportunity.
But do you see them coming for you in the night someday?
Well, if Andy Jassy is listening and he'd like to buy my company,
the phone's always going to be answered to his call. So I think, yeah, it's possible that Amazon
would come up with something like this. Having worked with Amazon for a large number of years,
I know that their strength is that they are almost not one company.
They are thousands of really small units which work on their own thing,
and then they bring those together.
So, Corey, you must see from the numerous billing CSVs,
they can't even agree on what they call a region.
You know, in some parts of the billing CSV it will be called us-west-2, in other parts USW2,
and in other parts still, it might have an airport code for the name of the region. So I think it's
quite hard as someone inside Amazon to build a tool like that, or at least it's no easier than
it is for someone outside of Amazon, namely me.
I think also if you look at some of the solutions that are out there,
I get the impression that Amazon perhaps don't want to be in some of these spaces.
Specifically, if you look at Amazon X-Ray, X-Ray is in theory a tracing tool. Well, in practice, it is a tracing tool, and it lets you log all the tracing data
and is really good at logging that data.
And then they give you an awful interface
for searching through it.
And I, having used X-Ray quite a bit,
I kind of believe that actually that's not by accident.
It's not that they didn't know how to make a good interface.
It's that that isn't the game they want to be in.
They want to build the underlying infrastructure
and they want other people to come along,
use their APIs and build the right interface for the users.
Absolutely.
It's like this sort of theory or philosophy
that they're operating under that,
you know, if we just provide bare primitives,
maybe customers will build the things we don't, ideally in Lambda.
Yeah, that's absolutely true. The other thing that I would say is, actually, I'm selling
runbook.cloud through AWS Marketplace. So AWS Marketplace is a solution much like Amazon Marketplace.
It's a subscription service where you pay by the hour, just like a normal AWS service.
Oh, please, there's no such thing as a normal AWS service.
Well, absolutely true.
But in terms of it gets added to your normal monthly AWS bill. And of course, Amazon
take a cut for the privilege of doing that. So actually, AWS are still making money from this.
It almost is an AWS service. It's just a marketplace service. One thing the AWS Marketplace team are keen to point out quite frequently is on the
Amazon retail side, 50% of transactions are done through Amazon Marketplace. And actually, that's
where they see AWS Marketplace getting to as well. So I think maybe it's naivety, but I think it's
less likely that Amazon are going to try and clone something like Runbook.cloud because actually, why would you bother if someone else is putting all the money into R&D, doing that hard work, and then you're getting a cut from it anyway, and it's fulfilling the needs of your customer?
One bit of feedback that I've gotten on my business for the last couple of years as I focus on Amazon bills is, well, what about other cloud providers? And for my business, it doesn't make a lot of
sense for me to focus on providers that aren't AWS. What about you? Do you wind up getting that
feedback as far as, oh, what about GCP? What about Oracle Cloud? What about Azure, etc, etc, etc?
So that's quite an interesting question. In my previous role, when I built a consulting
company, we started out... well, actually, we started out pre-AWS. But in the cloud world we started
out obviously with AWS, because there were no other players in the game. But then we did expand and
build a Google Cloud practice as well, so I have quite a lot of familiarity with Google Cloud. I think for me, when I'm building a product like this,
there is so much work to do to be able to accurately detect the problems
that addressing multiple clouds would be extremely difficult.
And Amazon has such a massive order of magnitude more customers
than any of the other cloud platforms
that actually it doesn't really make sense at this point in time to branch out to other clouds.
Of course, Andy Jassy excepted.
I expect that at some point in time, we probably will want to make a version that is for Azure
and a version that's for Google Cloud.
But we're probably talking a good few years down the road here.
Absolutely. And to my way of thinking, in this type of space, any of us who specialize in one particular provider are going to be able to retool to embrace a different provider far faster than
some other provider is going to gain workload
and market share and customers to the point where the one we're focusing on is no longer dominant.
In other words, you're not going to see these giant enterprises migrating between cloud platforms
faster than the ecosystem is going to be able to understand, embrace, and work with the new
provider. It's one of those things that's obviously worth keeping an eye on,
but it's not one of those things where we're going to wake up
and read on the front page of The New York Times,
in giant six-inch-high letters, AWS Suddenly Irrelevant.
I mean, that isn't how the world or the market works.
Yeah, exactly. And actually, if you look,
the majority of computing is not on any cloud platform right now.
So there's still a lot of expansion to be done.
AWS isn't going anywhere.
And when you're the market leader,
that means you kind of become the default choice.
So I think this competition is AWS's to lose. I think the other cloud platforms have interesting offerings,
but I'm not seeing anything that's significantly different enough
that you would want to move from AWS if that's where you were previously.
So the other aspect that I wanted to chat about with you
is the fact that you built this entire service on top of serverless technology.
Why did you make that decision? So I made that decision primarily for business,
financial reasons, rather than because of technology.
Actually, building on serverless,
I saw was the best way to align our outlay,
our costs in terms of providing a service to a customer
with the actual amount we could charge a customer.
So I have a lot of experience with Kubernetes,
which I know you're a massive fan of.
I was using Kubernetes from about 2014 onwards
when it was very early stage.
And actually early on, I thought,
well, we probably will be deploying
runbook.cloud onto Kubernetes, because it's the platform I know best. I hadn't really done anything
with Lambda at any significant scale. I'd done a lot of glue code, but nothing beyond that.
But when I looked more into the spreadsheets, into the cost model, actually the upfront outlay for Kubernetes
is still quite a lot higher.
You need a certain critical mass of customers
before it makes sense.
And with Lambda, well, I can scale exactly in line
with my customer base.
And actually then when we need to decide,
okay, where do we optimize cost?
It's really easy.
You just look at
which Lambda function is costing the most. Well, that's where we should expend our engineering
effort, optimizing things. Historically, code bases don't get optimized further. They just get
new features added and added on top of them, and then everyone talks about technical debt.
People work with the bits of code, and optimize the bits of code, that they decide kind of
on a whim should be the thing they tackle. Now, with serverless, you've got a clear way.
You look at the bill and you say, okay, so this is costing us the most; we can do some work
optimizing this and we can save ourselves some money. And actually you are paying down that technical debt,
but you have a clear metric that you're working towards, which is reducing the overall cost.
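The "look at the bill" step is simple arithmetic, which a short sketch can show. The per-GB-second and per-request rates below are the public on-demand Lambda prices around the time of this episode, and the function names and traffic numbers are invented for illustration.

```go
package main

import (
	"fmt"
	"math"
)

const (
	perGBSecond = 0.00001667 // illustrative on-demand rate
	perRequest  = 0.0000002  // $0.20 per million invocations
)

// monthlyCost estimates one function's bill. Duration is rounded up
// to the 100 ms billing increment that applied at the time.
func monthlyCost(invocations int64, avgMillis float64, memoryMB int) float64 {
	billedSeconds := math.Ceil(avgMillis/100) * 100 / 1000
	gbSeconds := float64(invocations) * billedSeconds * float64(memoryMB) / 1024
	return gbSeconds*perGBSecond + float64(invocations)*perRequest
}

func main() {
	// Rank two hypothetical functions; the expensive one is where
	// the engineering effort (the technical-debt paydown) goes.
	fmt.Printf("poller:  $%.2f\n", monthlyCost(50_000_000, 320, 512))
	fmt.Printf("webhook: $%.2f\n", monthlyCost(2_000_000, 45, 128))
}
```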
There's a very strong economic story for serverless. In other words, I think Simon
Wardley was talking about this extensively, where he
was focusing on the idea
of tracing capital flow throughout an
organization or through an application
where if you have
15 Lambda functions tied together
and you know which one is costing you more,
you don't just know what it costs to
serve as a customer. You know what every
function costs relative to the
others per customer request.
And it gives you a very in-depth viewpoint into where your revenue is coming from and what the
economics of your business are. Yeah, precisely. And those economic stats also give you good data
on, okay, but what parts of our application are actually used the most? Most companies, if you've got a monolithic application,
you don't really know what the most popular parts are.
Probably you've added some sort of after-the-fact metrics
for a SaaS app that's been built as a monolithic application.
You've probably got some JavaScript that's loaded,
client-side that's working out what's getting used the most.
But with Lambda, actually, it's
giving you that data really clearly, and you can see where your effort is best spent.
I think that when you say, I'm using serverless technology, and people ask why,
and the answer is, oh, for the economic story behind it. People often hear that as, oh,
I'm using serverless because I'm a cheap ass. And it has nothing whatsoever to do with that. It's very much in the realm of you pay for exactly what you use,
you don't have to worry about provisioning, you aren't falling into the wonderful world of,
oh, here's some on-demand resources I need to plan the usage of for the next three years.
And it really gets back to a you pay for exactly what you use and nothing else. So it comes down to a very predictable model where you know exactly what a customer brings in,
and then you really do scale with them to bring in revenue, as opposed to having these plateaus
where you buy a giant pile of things to service customers. Okay, now you've expanded. Now it's
time to buy another big instance or something and going down that rat hole. It's one of those stories that also not just saves money, but also lets you spend it
effectively and know where it's going. So yeah, I think that is true. But I think it's partly true
because of the model that Amazon have chosen to use in terms of their reserved instance model. So actually, they charge for instances now per second.
So if you could spin instances up quickly enough,
and actually with things like unikernels,
potentially you can start instances almost as fast
as you can start a Lambda function.
You can use instances in a way that actually doesn't mean
that you need to worry about reserving capacity up front,
except that that is baked into the AWS economic model.
And they used to talk a lot about, well, you're reserving capacity
because you're literally reserving
availability in a specific availability zone. And then, obviously, you know, they got rid of that
hard link when they brought convertible instances in. And actually, all the new reserved
instance types are not linked to a specific allocation of capacity. You don't have any extra
allocation. There are no guarantees that you'll be able to spin up instances like there used to be
previously, when you purchased specific capacity
as part of your reserved instance.
So I think, yes, some of the benefit of serverless is an accident
or perhaps not an accident of how AWS have chosen to charge for their service?
Absolutely. I think that they get beaten up a lot for the way that the reserved instance model
exists historically. They're making steps with things like convertible instances and instance
family flexibility, or size flexibility rather. And that starts to make some of this better,
but it's still an analysis paralysis style of decision.
Last time I ran the numbers, there were exactly 140 different instance types you could spin up
in US East 1. And okay, go ahead and make sure you're on the right one and then buy a reservation
for three years. That's daunting. And they keep adding new instance families and it becomes
trickier and trickier to be assured that you're making the right instance size and selection and choice.
What I love about Lambda is that there's a single variable you get to play with, and that is RAM allocation to the function.
That's it.
You don't have to pick around, well, what about the IO profile?
What about the CPU?
What about the network capacity?
Those are all tied to how much RAM you give the function. The more RAM, the better the rest of the resourcing. And that model, I think, is tremendously helpful, not purely even from an economic point of view, but from a not-putting-decisions-on-people-unnecessarily point of view. Analysis paralysis is very real. If I'm trying to sell someone a pen,
generally the right way to do that is, do you want blue ink or black ink? Not, here's a catalog with 10,000
different kinds of pens. It just comes down to getting the, I guess, emotional toil of making
decisions down to as few decision points as possible.
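As a concrete illustration of that single knob, here is a minimal sketch using the AWS SDK for Go (v1). The function name is made up, and this assumes credentials are already configured in the environment.

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/lambda"
)

func main() {
	svc := lambda.New(session.Must(session.NewSession()))

	// Memory is the only resource dial a Lambda function has;
	// CPU share and network scale along with it. "metric-poller"
	// is a hypothetical function name for illustration.
	out, err := svc.UpdateFunctionConfiguration(&lambda.UpdateFunctionConfigurationInput{
		FunctionName: aws.String("metric-poller"),
		MemorySize:   aws.Int64(1024),
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("memory is now", *out.MemorySize, "MB")
}
```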
Yeah, I think that's true. Earlier on when I was using AWS, there were
definitely situations where I was lobbying various people at AWS for different types of instances.
And I guess there must have been lots of people doing things like me, but we were all lobbying
for slightly different types of instances, which is why there are so many. I think, actually, I don't see that as a negative so much. You know, you look at
the instances, generally, it's relatively obvious. At least, you know, it was to me when I was doing
everything on EC2, it was pretty obvious what you wanted. If you were going to be compute bound,
you wanted a C instance. If you were doing stuff that needed GPUs, you needed a GPU instance.
If you had memory-bound tasks, you had an R instance.
If you weren't really sure, if you had sort of a mix of needs,
you had an M instance.
You pick the latest that's available in the region you're running.
It's not – I agree, there are massive numbers of different types of instances.
Well, sure, you're absolutely right. But recently, they've also extended that to,
okay, now with different types of disks, some are NVMe, some are not. This one has extra fast CPU
in it, but it's designed for fewer threads at the same time. And you just wind up with these
little variances between them as you get into the M suffixes and the D suffixes. And I agree
wholeheartedly with, it used to make a lot of sense. Now with just the flurry of new instance
families, I have to go back to my traditional guidance and constantly reevaluate it.
Yeah, I think that's possibly true. I think most of the time you can just pick,
look, I'll just use a C instance or use an M instance.
And it'll be close enough that honestly, the few cents you might save here and there,
it's not going to be worthwhile. Right. Until you hit scale again,
and then suddenly you're having a very different, very vast conversation. And that's one of the nice
things I appreciate as well about the whole Lambda model. Even at scale, the economics are still pretty decent. They absolutely are. I had a blog post that I put out a couple of weeks ago
that was very successful. I think you linked to it in your wonderful newsletter about how we
actually saved significant amounts of money using Lambda in a way that actually most of the
experts told us that's not how you should use Lambda. I guess the received
wisdom with Lambda is that the great thing about Lambda is that everything can be single threaded,
because if you need concurrency, you just run more Lambda functions. And that's true up to a point. What we found is that you can do that, but obviously
the cost implications of running hundreds of Lambda functions in parallel, when you are not
fully utilizing those resources, are pretty significant.
So in our specific instance, obviously, for Runbook,
we need to look at metrics from a large number of AWS services,
from a very large number of accounts,
because each customer has at least one account,
and an average customer will be using at least half a dozen services,
and some are using an order of magnitude more than that.
So we will make calls to all these different AWS APIs.
And obviously, they take some time to respond.
And while we're waiting for the response, you know, it might only take a few hundred
milliseconds, but we're doing nothing with the compute power that Lambda has provisioned for us.
But obviously, we're still having to pay for it, because you pay by the 100 milliseconds
in a Lambda world. So what we did is we looked at this problem, and actually we decided, instead
of firing up several hundred Lambda invocations concurrently, we would put everything into a single Lambda function,
and we'd written everything in Go.
So Go has a really nice inbuilt programming model
for doing concurrent operations,
and we did the concurrency using the programming language.
We now run a single Lambda instead of several hundred Lambdas,
and, of course, the cost implications for that are pretty huge.
You know, we saved ourselves a significant amount of money.
Well, we made the business viable.
The business wouldn't have been viable at the price point that we had already selected
if we had used the received wisdom of, well, you should only do concurrency by just spinning up additional Lambda functions.
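The pattern Sam describes can be sketched in a few lines of Go. This is an illustration, not runbook.cloud's code: the account IDs are made up and a stub with a sleep stands in for the real AWS API calls. One invocation fans the slow, network-bound calls out across goroutines so the provisioned compute is never sitting idle.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pollAccount stands in for the slow part: a CloudWatch or service
// API call that spends a few hundred milliseconds waiting on the
// network while the Lambda's provisioned CPU would otherwise idle.
func pollAccount(account string) string {
	time.Sleep(300 * time.Millisecond) // simulated API latency
	return "metrics for " + account
}

func main() {
	accounts := []string{"111111111111", "222222222222", "333333333333"}

	// Instead of one Lambda invocation per account, a single
	// invocation fans the calls out across goroutines, so the
	// waiting overlaps and one function does the work of hundreds.
	var wg sync.WaitGroup
	results := make([]string, len(accounts))
	for i, acct := range accounts {
		wg.Add(1)
		go func(i int, acct string) {
			defer wg.Done()
			results[i] = pollAccount(acct)
		}(i, acct)
	}
	wg.Wait()

	for _, r := range results {
		fmt.Println(r)
	}
}
```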
And although that's the received wisdom in the community,
once we looked deeper into it, I'm not sure AWS really believe that,
because if you look, you mentioned you get more compute capacity
as you allocate more RAM to your Lambda functions.
The higher RAM Lambda functions actually give you more cores,
which implies to me that AWS expects you to be doing things concurrently within Lambda.
That's just not how people had, in the community, expected things to be used.
I think you may be onto something there.
Counterpoint, it's easy to sit here and say,
ah, they hid this thing in and with the expectation
people would find it and use it this way.
I'm not so sure.
I think that Amazon is very good at building things
and then being surprised by how people wind up using them.
It's one of those areas where customers all have different use cases,
different problems,
different ways of thinking about things. And it winds up being a fun conversation in some cases.
I wound up talking to an engineer at AWS recently about how I use Secrets Manager
instead of Dynamo as a database. And they were disappointed in me, as they should be,
because it's a terrible idea, never do it. But it's getting into the idea of how people use or misuse services.
And the fact that these are these broad primitive building blocks that you can put together a whole bunch of different ways is an awful lot of fun in some ways.
Yeah, and I think AWS has shown a willingness to go and meet the customer where the customer is quite a lot, and there are services that show that.
I used to do a talk at a few different sort of meetups and AWS Summits and what have you,
called Five AWS Services That Shouldn't Exist. And, you know, the obvious service, whenever I told
anyone the title before they'd heard anything I'd said, was, were you obviously going to say EFS?
And EFS,
I think, is one of those services that on the face of it shouldn't exist. Like, everything in AWS has been designed around, well, actually, you don't store your state in the file system,
because that's absolutely the wrong place to put it. But at the same time, AWS obviously
spoke to their customers and realized, actually, the way that people
are used to using things, the way that people want to use them, we need to have a service
like EFS so that customers can work in that way. And we can talk to them about how they
can better do things and do things differently and use S3, which obviously was one of the
original services, to replace EFS. but we need to provide something for them.
So I think Amazon show, I guess there are two ways of looking at it.
Either they show a lot of humility in saying,
oh, well, actually, we didn't think you were going to use it like that,
but now we know that we'll build it differently.
Or they are just very mercenary and they just look and say,
well, who cares about what I believe you should be doing to do it best?
If you're going to pay us money to do it, we'll build it for you.
I think that you're right.
It comes down to what customers need.
I mean, I've been making fun of EFS for a long time.
And it's gotten better to the point now where my single ding against it is,
in a cloud-native world, you probably shouldn't be
greenfielding anything that uses NFS primitives, regardless of how good the implementation thereof
is. That said, that's not realistic for companies that are migrating from on-prem environments.
You're not going to shove a net app into the cloud. You've got to have something out there
that speaks to those languages. And if that is your scenario, and that's how everything's
architected, it's fun to sit here and condescendingly shake your finger
at people and tell them they should write their software differently. But that's
not how AWS speaks to customers. That's how Google Cloud speaks to their customers.
So I've got a lot of time for the Google Cloud people.
And I also know their roadmap, so I'm not going to comment on that.
But I'm pretty sure Cloud Filestore, their managed NFS offering,
has actually been announced for Google Cloud now.
So that definitely doesn't work anymore.
But no, I think the reality of the situation is, yeah,
if you want to be the largest player in town, and AWS already are, and they don't want to lose that position,
you have to, as you say, you have to meet the customer
where the customer is.
And that means building solutions that you think,
well, we wouldn't build it like that internally at AWS
or for Amazon retail or for Amazon video streaming.
But actually, if that's what the customer needs to do, that's fine. And we
shouldn't be telling them, no, this is the way you have to build it. We should be building what
the customer needs, what's right for them right now.
Absolutely. Sam, thank you so much for spending time chatting with me today. I appreciate it.
No problem. Thanks for having me on.
Thanks again. My name's Corey Quinn.
This has been Sam Bashton, and this is Screaming in the Cloud.