Screaming in the Cloud - Episode 31: Hey Sam, wake up. It’s 3am, and time to solve a murder mystery!
Episode Date: October 10, 2018

Have you ever been on call as an IT person or otherwise? Woken up at 3 a.m. to solve a problem? Did you have to go through log files or look at a dashboard to figure out what was going on? Did you think there has got to be a better way to troubleshoot and solve problems? Today, we're talking to Sam Bashton, who previously ran a premier consulting partner with Amazon Web Services (AWS). Recently, he started runbook.cloud, which is a tool built on top of serverless technology that helps people find and troubleshoot problems within their AWS environment.

Some of the highlights of the show include:

- Runbook.cloud looks at metrics and applies machine learning (ML) to pinpoint issues and present users with a pre-written set of solutions
- Runbook.cloud looks at all the potential problems that can be detected, in context with how the infrastructure is being used, without being annoying and useless
- ML is used to do trend analysis and understand how a specific customer is using a service for a specific auto scaling group or set of Lambda functions
- Runbook.cloud takes all its data in aggregate to influence alerts; if there's a problem in a specific region with a specific service, the tool is careful to caveat it
- Various monitoring solutions are on the market; runbook.cloud is designed for a mass-market environment; it takes the metrics that AWS provides for free and makes it so you don't need to worry about them
- Will runbook.cloud compete with or sell out to AWS? Amazon wants to build the underlying infrastructure and have other people use its APIs to build interfaces for users
- Runbook.cloud is sold through AWS Marketplace; it's a subscription service where you pay by the hour and the charges are added to your AWS bill
- Amazon vs. other cloud providers: so much work is involved in accurately detecting problems that addressing multiple clouds doesn't make sense yet
- Runbook.cloud was built on top of serverless technology for business and financial reasons; it's a way to align outlay and costs, because you pay for exactly what you use
- Analysis paralysis is real; it comes down to getting the emotional toil of making decisions down to as few decision points as possible
- Save money on Lambda: instead of running several hundred Lambda functions concurrently, put everything into a single function using Go's concurrency
- AWS responds to customers to discover how they use its services; it comes down to what customers need

Links:

- Sam Bashton on Twitter
- runbook.cloud
- How We Massively Reduced Our AWS Lambda Bill with Go
- AWS
- AWS Lambda
- Microsoft Clippy
- Honeycomb
- AWS X-Ray
- Kubernetes
- Simon Wardley
- Go
- Secrets Manager
- DynamoDB
- EFS
- DigitalOcean
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This week's episode of Screaming in the Cloud is generously sponsored
by DigitalOcean. I would argue that every cloud platform out there biases for different things.
Some bias for having every feature you could possibly want offered as a managed service at
varying degrees of maturity. Others bias for, hey, we heard there's some money to be made in the cloud space. Can you give us some of it?
DigitalOcean biases for neither. To me, they optimize for simplicity. I polled some friends of mine who are avid DigitalOcean supporters about why they're using it for various things,
and they all said more or less the same thing. Other offerings have a bunch of shenanigans
around root access and IP addresses.
DigitalOcean makes it all simple.
In 60 seconds, you have root access to a Linux box with an IP.
That's a direct quote, albeit with profanity about other providers taken out.
DigitalOcean also offers fixed price offerings. You always know what you're going to wind up paying this month,
so you don't wind up having a minor heart issue when the bill comes in.
Their services are also understandable without spending three months going to cloud school.
You don't have to worry about going very deep to understand what you're doing.
It's click button or make an API call and you receive a cloud resource.
They also include very understandable monitoring and alerting.
And lastly, they're not
exactly what I would call small time. Over 150,000 businesses are using them today. So go ahead and
give them a try. Visit do.co slash screaming, and they'll give you a free $100 credit to try it out.
That's do.co slash screaming. Thanks again to DigitalOcean for their support of Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
I'm joined this week by Sam Bashton,
who once upon a time ran a premier consulting partner with AWS.
Recently, however, he started something new called Runbook.Cloud.
Welcome to the show.
Thank you. Thanks for having me on.
Always a pleasure.
It's interesting to me to talk to people
where there are multiple different aspects of what they do
that apply directly to how I view the world.
What's interesting to me about Runbook
is that, on the one hand, it's a tool that helps people find and troubleshoot problems within their AWS environment, which is fascinating and highly relevant.
But what's also equally interesting to me is that you built the entire tool on top of serverless technology.
So it feels like we should definitely tackle both angles of those.
Which do you want to go into first?
So maybe if we talk a bit first about my motivations for building runbook.cloud.
Ah, the why. Absolutely.
Cool. So basically, for my entire career since leaving university, I was on call at some point or other,
often one week in four. And I would get a call in the middle of the night, around about once a week, and something had gone wrong.
So I would have to troubleshoot what that problem was
and work out what to do to fix it.
And at first that was very nerve-wracking
and it quickly became less exciting
and more an incredibly large chore.
I don't think anyone enjoys doing on-call.
There's a certain adrenaline rush of fixing a problem quickly.
Hey, Sam. Hey, Sam. Hey, Sam. Wake up. It's three in the morning. You know what you want to do now?
That's right. Solve a murder mystery. Yeah, exactly. Exactly. And all we've got
to help you solve the murder mystery is pages and pages and pages of graphs.
And that's if you're lucky, because in the early days, you probably didn't have any metrics at all.
And you just had to kind of look at some log files and do your best to try and work out what was going on.
Then as things got better, you built dashboards, and a dashboard became, you know, like scar tissue for an organization.
Here are all the things that have failed previously.
They probably won't be the things that go wrong in future,
but at least if it breaks again, we've got a way to check on that.
So I thought, well, there's got to be a better way to do this,
and runbook.cloud is my attempt to try and build that better way.
So what we do with runbook.cloud is we look at all the metrics,
which people are drawing pretty graphs from,
but we apply some intelligence.
And when I say intelligence, I don't mean my intelligence.
I mean machine learning, of course.
And we pinpoint where are the issues within the infrastructure,
and then we have a pre-written set of, here are solutions to known problems that can occur.
And we present that to the user.
So when you get paged at three in the morning, you still see a problem.
But as well as seeing a problem, you see, well, here's what it looks like the problem
is, and here's a suggested solution.
And quite often, there won't be, well, it's definitely this.
It will be, well, it looks like it's probably this,
but there's a chance it could be this other thing.
You know, you might get a list of two or three suggestions.
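To make that concrete, here is a rough sketch, purely illustrative and not runbook.cloud's actual data model, of what a symptom paired with a short, ranked list of pre-written fixes might look like. All names and confidence values are invented.

```go
package main

import "fmt"

// Suggestion pairs a pre-written runbook entry with how likely it
// is to explain the detected symptom. Fields are hypothetical.
type Suggestion struct {
	Title      string  // short description of the probable cause
	Runbook    string  // key to the pre-written fix
	Confidence float64 // 0.0 to 1.0, from the detection layer
}

// Finding is what the pager shows at three in the morning: the
// symptom plus two or three ranked suggestions, not a wall of graphs.
type Finding struct {
	Resource    string
	Symptom     string
	Suggestions []Suggestion
}

func main() {
	f := Finding{
		Resource: "asg/web-tier",
		Symptom:  "sustained low CPU on a batch workload",
		Suggestions: []Suggestion{
			{"Upstream queue is empty or unreachable", "runbooks/check-queue", 0.7},
			{"Deploy shipped a no-op worker loop", "runbooks/check-release", 0.2},
		},
	}
	for _, s := range f.Suggestions {
		fmt.Printf("%.0f%%: %s -> %s\n", s.Confidence*100, s.Title, s.Runbook)
	}
}
```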
That's infinitely better than hunting through a ton of graphs trying to work out,
okay, but what does any of this mean to me?
Because really with graphs, the best you can hope for
is that you have the right graph to test the hypothesis that you might have.
And you can look at the graph and say,
actually, you know, that graph disproves my hypothesis,
so I now need to try and invent another potential reason for this problem,
or it proves it, and then you can start trying to do something to fix it.
So how do you keep a system like that from turning into the infrastructure equivalent
of Microsoft Clippy?
It looks like you're fighting an outage.
Have you tried looking at DNS?
It seems like it's the sort of thing that it would be very easy to have become annoying
and unhelpful.
How do you avoid that problem?
Obviously, the observation that, wait, this might be annoying people, is not going to be revelatory to you. How have you thought about
this as far as getting away from that particular failure mode? So that's where the machine learning
comes in. So what we do is we look at all the potential problems that we can detect,
and we look at that in context with how the infrastructure is actually
being used. So a good example is CPU usage. In some scenarios, high CPU usage is an indicator of
a problem. In other scenarios, for example if you're running a batch computing load, high CPU usage is
normal. It's what you should be seeing. And actually, if there's low CPU usage, that's an indicator of a problem. So we can use machine learning to do trend analysis and to
understand actually in the context of how this specific customer is using the service for this
specific auto-scaling group or for this set of Lambda functions, this looks wrong or this looks right. And we can look at it in context.
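As a deliberately simplified sketch of that idea (the episode doesn't describe the actual models), a per-workload baseline makes deviation in either direction the thing that gets flagged. The window, numbers, and three-sigma threshold below are all illustrative assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// baseline computes mean and standard deviation over a training
// window of metric samples, e.g. a week of CPU utilisation.
func baseline(samples []float64) (mean, std float64) {
	for _, v := range samples {
		mean += v
	}
	mean /= float64(len(samples))
	for _, v := range samples {
		std += (v - mean) * (v - mean)
	}
	std = math.Sqrt(std / float64(len(samples)))
	return mean, std
}

// anomalous flags an observation more than k standard deviations
// from the learned norm, in either direction: low CPU is as
// suspicious as high CPU once the baseline says "busy is normal".
func anomalous(v, mean, std, k float64) bool {
	if std == 0 {
		return v != mean
	}
	return math.Abs(v-mean) > k*std
}

func main() {
	week := []float64{92, 95, 91, 94, 96, 93, 90} // batch fleet: high CPU is normal
	mean, std := baseline(week)
	fmt.Println(anomalous(94, mean, std, 3)) // false: business as usual
	fmt.Println(anomalous(12, mean, std, 3)) // true: the idle-looking fleet is the problem
}
```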
So I would say machine learning becoming accessible
to a wider base of developers,
specifically us that are writing this,
that's allowed us to build something that isn't, you know,
the Microsoft Clippy of the DevOps world.
Do you tend to take only a particular user's environment into consideration,
or do you take the global environment as well?
By which I mean, when I would wake up in years past running infrastructures,
I knew that a few things were going to be true.
First, I know that something is broken.
Secondly, I know that Amazon's status page is going to be a sea of green telling me everything is perfect.
And third, I know that I'm not going to really
be able to disambiguate between
is this a problem with my environment
or is this a global problem
until I go on the internet and check Twitter.
Because that's the only sort of global
real-time alert system that most of us have.
If you suddenly see a flurry of activity across the board,
across all of your clients' environments,
are you then able to advise them on that? Or is it strictly bounded by their specific environment?
No. So we very much take all of the data that we're seeing in aggregate and use that to influence
alerts. It's actually something that was informed by my prior experience
running a consulting partner.
So as a consulting partner, we were doing managed services.
We would look after many dozens of customers.
So actually, we had the same scenario.
You know, we would see problems across multiple customers,
and you knew actually it's very unlikely that this is a problem
that's specific to any of those customers.
This is a wider outage.
And obviously we could relay that to AWS in terms of support tickets.
So, yeah, in runbook.cloud, we also look at the data that we're receiving in aggregate.
And if there's a problem in a specific region with a specific service, we're quite careful to caveat it
because the reason I think you often see a sea of green
on the AWS status page
is that, at the scale AWS runs at,
for most of their customers in that region,
everything is working fine.
But when you've got millions of customers,
1% of your customers having a problem
is a significant number of people.
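A toy version of that aggregate check, with invented thresholds rather than anything runbook.cloud has described, might look like this: if the share of monitored accounts showing the same symptom in one region jumps well past its background rate, the individual alerts get caveated with a probable regional issue.

```go
package main

import "fmt"

// regionalIssue is a toy aggregate check: if far more accounts than
// usual show the same symptom for one service in one region, the
// fault is probably AWS-side, not any single customer's. The 5x
// multiplier and the 3-account floor are invented for illustration.
func regionalIssue(affected, monitored int, baselineRate float64) bool {
	if monitored == 0 {
		return false
	}
	rate := float64(affected) / float64(monitored)
	return affected >= 3 && rate > 5*baselineRate
}

func main() {
	// 40 of 500 monitored accounts see elevated Lambda errors in
	// one region, against a normal background rate of 1%.
	if regionalIssue(40, 500, 0.01) {
		fmt.Println("caveat alerts: probable regional service issue")
	}
}
```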
Absolutely.
And that's part of the challenge too that I think does vary from, I guess, a point of scale.
If you have 30 customers and one of them winds up breaking, that's a significant percentage of what you're seeing.
But if you have, I don't know, 500 queries per second hitting your website and you start seeing a 1% variance,
first, that winds up scaling as well to a tremendous number of people.
So it's one of those areas where, at scale, one in a million things, one in a million occurrences, happen five times a minute.
So it really does turn into one of those situationally dependent issues.
Yeah, exactly. And that's where machines are excellent at aggregating that sort of data and working out what's going on. And I think in the past, as humans,
we've not been as smart at using the machines to do a lot of the work for us
as we should be.
Or at least outside of large organizations,
the Googles and the Facebooks of this world,
I don't think we've been as smart as possible.
You look at most people's monitoring setups, and they are pretty dumb right now. And that's not a reflection on the people setting
them up. That's a reflection on the tooling that's available. Most things are, you set a threshold,
and you say, if it crosses this threshold, then there's a problem. And actually, that's not how
things work in the real world. It really does seem like this is an evolution on a very long axis.
I mean, back when I started working with technology,
we started playing with the original Call of Duty video game,
which is, of course, called Nagios.
That's the thing that woke us up in the middle of the night
and everything was broken.
The paradigm of setting something up, often manually,
to look at individual systems and alert when they
went down didn't age very well in a world of ephemeral infrastructure, in a world of auto
scaling, and especially in a world where you have 10 web servers that are load balanced. If one of
them blows up, I probably don't care. If three or five of them blow up, I really care. So it turns
into a story where the traditional thoughts around monitoring
no longer really seem to work. So the next sort of evolution of this has gone towards the idea
of aggregating things, looking at metrics, looking at graphs. And that's terrific. They're beautiful
dashboards you can hang up in an office, you can put on a website and send to execs,
and no one ever looks at them. And that's interesting.
And now we're starting to see the next generation of this stuff emerge, where you see things like
outlier detection, where we start to see systemic issues that underlie things. And it feels like
you're very much in, I guess, in line with the zeitgeist around monitoring thought and theory
right now. Is that something you'd agree with? Am I way off base in my assessment? I'm not going to disagree with you telling me that I've found exactly the right solution to
the problem. I think there are a number of solutions that people are finding. And actually,
I think they address different parts of the market. So you look at something like Honeycomb, which is very much going for
tracing. That's a key part of what needs to be done. But actually, you need a very,
very technical organization to be able to implement that functionality. And if you have
the right sort of organization to be able to implement the functionality to write all that
tracing data, then you absolutely need a tool
like that. With runbook.cloud, I'm trying to go for a more, I guess, mass market environment,
an organization where actually you probably don't have a huge amount of metrics beyond what Amazon
give you out of the box, which is pretty enormous. I think at last count there are 30-odd metrics
purely for EC2 alone. So, you know, Amazon are giving you all of these metrics for free.
And what we're doing is we're looking at them and then trying to make it so you actually don't need
to worry about what any of the individual metrics are. We tell you, look, here's the problem and
here's what you need to do to fix it. And you don't need to worry about what the values are.
That's essentially all abstracted away from you by Runbook.cloud.
It seems like a very interesting direction to go in.
It also further seems like exactly the sort of thing that AWS should be offering, but of course isn't.
Do you have the haunting fear that most people do,
that Amazon is going to one day effectively try and build the version, basically a native platform offering of what you do.
I mean, it's Amazon, so we know the first version is going to be pretty crappy, and it's almost guaranteed to have a stupid name.
But other than that, as it iterates forward and starts to turn into something real, there is the chance that Amazon decides to fix all problems.
And from my perspective, from a monitoring point of view,
I don't know that I necessarily trust them to tell me
when things are broken in a way that is actionable
in a reasonable period of time.
So there's going to be that opportunity.
But do you see them coming for you in the night someday?
Well, if Andy Jassy is listening and he'd like to buy my company,
the phone's always going to be answered to his call. So I think, yeah, it's possible that Amazon
would come up with something like this. Having worked with Amazon for a large number of years,
I know that their strength is that they are almost not one company.
They are thousands of really small units which work on their own thing,
and then they bring those together.
So, Corey, you must see from the numerous billing CSVs,
they can't even agree on what they call a region.
You know, in some parts of the billing CSV it will be called us-west-2, in other parts USW2,
and in other parts still, it might have an airport code for the name of the region. So I think it's
quite hard as someone inside Amazon to build a tool like that, or at least it's no easier than
it is for someone outside of Amazon, namely me.
I think also if you look at some of the solutions that are out there,
I get the impression that Amazon perhaps don't want to be in some of these spaces.
Specifically, if you look at Amazon X-Ray, X-Ray is in theory a tracing tool. Well, in practice, it is a tracing tool, and it lets you log all the tracing data
and is really good at logging that data.
And then they give you an awful interface
for searching through it.
And I, having used X-Ray quite a bit,
I kind of believe that actually that's not by accident.
It's not that they didn't know how to make a good interface.
It's that that isn't the game they want to be in.
They want to build the underlying infrastructure
and they want other people to come along,
use their APIs and build the right interface for the users.
Absolutely.
It's like this sort of theory or philosophy
that they're operating under that,
you know, if we just provide bare primitives,
maybe customers will build the things we don't, ideally in Lambda.
Yeah, that's absolutely true. The other thing that I would say is, actually, I'm selling
runbook.cloud through AWS Marketplace. So AWS Marketplace is a solution much like Amazon Marketplace.
It's a subscription service where you pay by the hour, just like a normal AWS service.
Oh, please, there's no such thing as a normal AWS service.
Well, absolutely true.
But in terms of it gets added to your normal monthly AWS bill. And of course, Amazon
take a cut for the privilege of doing that. So actually, AWS are still making money from this.
It almost is an AWS service. It's just a marketplace service. One thing the AWS Marketplace team are keen to point out quite frequently is on the
Amazon retail side, 50% of transactions are done through Amazon Marketplace. And actually, that's
where they see AWS Marketplace getting to as well. So I think maybe it's naivety, but I think it's
less likely that Amazon are going to try and clone something like Runbook.cloud because actually, why would you bother if someone else is putting all the money into R&D, doing that hard work, and then you're getting a cut from it anyway, and it's fulfilling the needs of your customer?
One bit of feedback that I've gotten on my business for the last couple of years as I focus on Amazon bills is, well, what about other cloud providers? And for my business, it doesn't make a lot of
sense for me to focus on providers that aren't AWS. What about you? Do you wind up getting that
feedback as far as, oh, what about GCP? What about Oracle Cloud? What about Azure, etc, etc, etc?
So that's quite an interesting question. In my previous role, when I built a consulting
company, we started out... well, actually, we started out pre-AWS. But in the cloud world we started
out obviously with AWS, because there were no other players in the game. But then we did expand and
build a Google Cloud practice as well, so I have quite a lot of familiarity with Google Cloud. I think for me, when I'm building a product like this,
there is so much work to do to be able to accurately detect the problems
that addressing multiple clouds would be extremely difficult.
And Amazon has such a massive order of magnitude more customers
than any of the other cloud platforms
that actually it doesn't really make sense at this point in time to branch out to other clouds.
Of course, Andy Jassy excepted.
I expect that at some point in time, we probably will want to make a version that is for Azure
and a version that's for Google Cloud.
But we're probably talking a good few years down the road here.
Absolutely. And to my way of thinking, in this type of space, any of us who specialize in one particular provider are going to be able to retool to embrace a different provider far faster than
some other provider is going to gain workload
and market share and customers to the point where the one we're focusing on is no longer dominant.
In other words, you're not going to see these giant enterprises migrating between cloud platforms
faster than the ecosystem is going to be able to understand, embrace, and work with the new
provider. It's one of those things that's obviously worth keeping an eye on,
but it's not one of those things where we're going to wake up
and read on the front page of The New York Times,
in giant six-inch-high letters, AWS Suddenly Irrelevant.
I mean, that isn't how the world or the market works.
Yeah, exactly. And actually, if you look,
the majority of computing is not on any cloud platform right now.
So there's still a lot of expansion to be done.
AWS isn't going anywhere.
And when you're the market leader,
that means you kind of become the default choice.
So I think this competition is AWS's to lose. I think the other cloud platforms have interesting offerings,
but I'm not seeing anything that's significantly different enough
that you would want to move from AWS if that's where you were previously.
So the other aspect that I wanted to chat about with you
is the fact that you built this entire service on top of serverless technology.
Why did you make that decision? So I made that decision primarily for business,
financial reasons, rather than because of technology.
Actually, building on serverless,
I saw was the best way to align our outlay,
our costs in terms of providing a service to a customer
with the actual amount we could charge a customer.
So I have a lot of experience with Kubernetes,
which I know you're a massive fan of.
I was using Kubernetes from about 2014 onwards
when it was very early stage.
And actually early on, I thought,
well, we probably will be deploying
runbook.cloud onto Kubernetes, because it's the platform I know best. I hadn't really done anything
with Lambda at any significant scale. I'd done a lot of glue code, but nothing beyond that.
But when I looked more into the spreadsheets, into the cost model, actually the upfront outlay for Kubernetes
is still quite a lot higher.
You need a certain critical mass of customers
before it makes sense.
And with Lambda, well, I can scale exactly in line
with my customer base.
And actually then when we need to decide,
okay, where do we optimize cost?
It's really easy.
You just look at
which Lambda function is costing the most. Well, that's where we should expend our engineering
effort, optimizing things. Historically, code bases don't get optimized further. They just get
new features added and added on top of them, and then everyone talks about technical debt.
People work with the bits of code, and optimize the bits of code, that they decide kind of
on a whim should be the thing they tackle. Now, with serverless, you've got a clear way.
You look at the bill and you say, okay, so this is costing us the most; we can do some work
optimizing this and we can save ourselves some money. And actually you are paying down that technical debt,
but you have a clear metric that you're working towards, which is reducing the overall cost.
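The "look at the bill" step is simple arithmetic, which a short sketch can show. The per-GB-second and per-request rates below are the public on-demand Lambda prices around the time of this episode, and the function names and traffic numbers are invented for illustration.

```go
package main

import (
	"fmt"
	"math"
)

const (
	perGBSecond = 0.00001667 // illustrative on-demand rate
	perRequest  = 0.0000002  // $0.20 per million invocations
)

// monthlyCost estimates one function's bill. Duration is rounded up
// to the 100 ms billing increment that applied at the time.
func monthlyCost(invocations int64, avgMillis float64, memoryMB int) float64 {
	billedSeconds := math.Ceil(avgMillis/100) * 100 / 1000
	gbSeconds := float64(invocations) * billedSeconds * float64(memoryMB) / 1024
	return gbSeconds*perGBSecond + float64(invocations)*perRequest
}

func main() {
	// Rank two hypothetical functions; the expensive one is where
	// the engineering effort (the technical-debt paydown) goes.
	fmt.Printf("poller:  $%.2f\n", monthlyCost(50_000_000, 320, 512))
	fmt.Printf("webhook: $%.2f\n", monthlyCost(2_000_000, 45, 128))
}
```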
There's a very strong economic story for serverless. In other words, I think Simon
Wardley was talking about this extensively, where he
was focusing on the idea
of tracing capital flow throughout an
organization or through an application
where if you have
15 Lambda functions tied together
and you know which one is costing you more,
you don't just know what it costs to
serve as a customer. You know what every
function costs relative to the
others per customer request.
And it gives you a very in-depth viewpoint into where your revenue is coming from and what the
economics of your business are. Yeah, precisely. And those economic stats also give you good data
on, okay, but what parts of our application are actually used the most? Most companies, if you've got a monolithic application,
you don't really know what the most popular parts are.
Probably you've added some sort of after-the-fact metrics
for a SaaS app that's been built as a monolithic application.
You've probably got some JavaScript that's loaded,
client-side that's working out what's getting used the most.
But with Lambda, actually, it's
giving you that data really clearly, and you can see where your effort is best spent.
I think that when you say, I'm using serverless technology, and people ask why,
and the answer is, oh, for the economic story behind it. People often hear that as, oh,
I'm using serverless because I'm a cheap ass. And it has nothing whatsoever to do with that. It's very much in the realm of you pay for exactly what you use,
you don't have to worry about provisioning, you aren't falling into the wonderful world of,
oh, here's some on-demand resources I need to plan the usage of for the next three years.
And it really gets back to a you pay for exactly what you use and nothing else. So it comes down to a very predictable model where you know exactly what a customer brings in,
and then you really do scale with them to bring in revenue, as opposed to having these plateaus
where you buy a giant pile of things to service customers. Okay, now you've expanded. Now it's
time to buy another big instance or something and going down that rat hole. It's one of those stories that also not just saves money, but also lets you spend it
effectively and know where it's going. So yeah, I think that is true. But I think it's partly true
because of the model that Amazon have chosen to use in terms of their reserved instance model. So actually, they charge for instances now per second.
So if you could spin instances up quickly enough,
and actually with things like unikernels,
potentially you can start instances almost as fast
as you can start a Lambda function.
You can use instances in a way that actually doesn't mean
that you need to worry about reserving capacity up front,
except that that is baked into the AWS economic model.
And they used to talk a lot about, well, you're reserving capacity
because you're literally reserving
availability in a specific availability zone. And then, obviously, you know, they got rid of that
hard link when they brought convertible instances in. And actually, all the new reserved
instance types are not linked to a specific allocation of capacity. You don't have any extra
allocation. There are no guarantees that you'll be able to spin up instances like there used to be
previously, when you purchased specific capacity
as part of your reserved instance.
So I think, yes, some of the benefit of serverless is an accident
or perhaps not an accident of how AWS have chosen to charge for their service?
Absolutely. I think that they get beaten up a lot for the way that the reserved instance model
exists historically. They're making steps with things like convertible instances and instance
family flexibility, or size flexibility rather. And that starts to make some of this better,
but it's still an analysis paralysis style of decision.
Last time I ran the numbers, there were exactly 140 different instance types you could spin up
in US East 1. And okay, go ahead and make sure you're on the right one and then buy a reservation
for three years. That's daunting. And they keep adding new instance families and it becomes
trickier and trickier to be assured that you're making the right instance size and selection and choice.
What I love about Lambda is that there's a single variable you get to play with, and that is RAM allocation to the function.
That's it.
You don't have to pick around, well, what about the IO profile?
What about the CPU?
What about the network capacity?
Those are all tied to how much RAM you give the function. The more RAM, the better the rest of the resourcing. And that model, I think, is tremendously helpful, not purely even from an economic point of view, but from a not-putting-decisions-on-people-unnecessarily point of view. Analysis paralysis is very real. If I'm trying to sell someone a pen,
generally the right way to do that is, do you want blue ink or black ink? Not, here's a catalog with 10,000
different kinds of pens. It just comes down to getting the, I guess, emotional toil of making
decisions down to as few decision points as possible.
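As a concrete illustration of that single knob, here is a minimal sketch using the AWS SDK for Go (v1). The function name is made up, and this assumes credentials are already configured in the environment.

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/lambda"
)

func main() {
	svc := lambda.New(session.Must(session.NewSession()))

	// Memory is the only resource dial a Lambda function has;
	// CPU share and network scale along with it. "metric-poller"
	// is a hypothetical function name for illustration.
	out, err := svc.UpdateFunctionConfiguration(&lambda.UpdateFunctionConfigurationInput{
		FunctionName: aws.String("metric-poller"),
		MemorySize:   aws.Int64(1024),
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("memory is now", *out.MemorySize, "MB")
}
```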
Yeah, I think that's true. Earlier on when I was using AWS, there were
definitely situations where I was lobbying various people at AWS for different types of instances.
And I guess there must have been lots of people doing things like me, but we were all lobbying
for slightly different types of instances, which is why there are so many. I think, actually, I don't see that as a negative so much. You know, you look at
the instances, generally, it's relatively obvious. At least, you know, it was to me when I was doing
everything on EC2, it was pretty obvious what you wanted. If you were going to be compute bound,
you wanted a C instance. If you were doing stuff that needed GPUs, you needed a GPU instance.
If you had memory-bound tasks, you had an R instance.
If you weren't really sure, if you had sort of a mix of needs,
you had an M instance.
You pick the latest that's available in the region you're running.
It's not – I agree, there are massive numbers of different types of instances.
Well, sure, you're absolutely right. But recently, they've also extended that to,
okay, now with different types of disks, some are NVMe, some are not. This one has extra fast CPU
in it, but it's designed for fewer threads at the same time. And you just wind up with these
little variances between them as you get into the M suffixes and the D suffixes. And I agree
wholeheartedly with, it used to make a lot of sense. Now with just the flurry of new instance
families, I have to go back to my traditional guidance and constantly reevaluate it.
Yeah, I think that's possibly true. I think most of the time you can just pick,
look, I'll just use a C instance or use an M instance.
And it'll be close enough that honestly, the few cents you might save here and there,
it's not going to be worthwhile. Right. Until you hit scale again,
and then suddenly you're having a very different, very vast conversation. And that's one of the nice
things I appreciate as well about the whole Lambda model. Even at scale, the economics are still pretty decent. They absolutely are. I had a blog post that I put out a couple of weeks ago
that was very successful. I think you linked to it in your wonderful newsletter about how we
actually saved significant amounts of money using Lambda in a way that actually most of the
experts told us that's not how you should use Lambda. I guess the received
wisdom with Lambda is that the great thing about Lambda is that everything can be single threaded,
because if you need concurrency, you just run more Lambda functions. And that's true up to a point. What we found is that you can do that, but obviously
the cost implications of running hundreds of Lambda functions in parallel, when you are not
fully utilizing those resources, are pretty significant.
So in our specific instance, obviously, for Runbook,
we need to look at metrics from a large number of AWS services,
from a very large number of accounts,
because each customer has at least one account,
and an average customer will be using at least half a dozen services,
and some are using an order of magnitude more than that.
So we will make calls to all these different AWS APIs.
And obviously, they take some time to respond.
And while we're waiting for the response, you know, it might only take a few hundred
milliseconds, but we're doing nothing with the compute power that Lambda has provisioned for us.
But obviously, we're still having to pay for it, because you pay by the 100 milliseconds
in a Lambda world. So what we did is we looked at this problem, and actually we decided, instead
of firing up several hundred Lambda invocations concurrently, we would put everything into a single Lambda function,
and we'd written everything in Go.
So Go has a really nice inbuilt programming model
for doing concurrent operations,
and we did the concurrency using the programming language.
We now run a single Lambda instead of several hundred Lambdas,
and, of course, the cost implications for that are pretty huge.
You know, we saved ourselves a significant amount of money.
Well, we made the business viable.
The business wouldn't have been viable at the price point that we had already selected
if we had used the received wisdom of, well, you should only do concurrency by just spinning up additional Lambda functions.
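The pattern Sam describes can be sketched in a few lines of Go. This is an illustration, not runbook.cloud's code: the account IDs are made up and a stub with a sleep stands in for the real AWS API calls. One invocation fans the slow, network-bound calls out across goroutines so the provisioned compute is never sitting idle.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pollAccount stands in for the slow part: a CloudWatch or service
// API call that spends a few hundred milliseconds waiting on the
// network while the Lambda's provisioned CPU would otherwise idle.
func pollAccount(account string) string {
	time.Sleep(300 * time.Millisecond) // simulated API latency
	return "metrics for " + account
}

func main() {
	accounts := []string{"111111111111", "222222222222", "333333333333"}

	// Instead of one Lambda invocation per account, a single
	// invocation fans the calls out across goroutines, so the
	// waiting overlaps and one function does the work of hundreds.
	var wg sync.WaitGroup
	results := make([]string, len(accounts))
	for i, acct := range accounts {
		wg.Add(1)
		go func(i int, acct string) {
			defer wg.Done()
			results[i] = pollAccount(acct)
		}(i, acct)
	}
	wg.Wait()

	for _, r := range results {
		fmt.Println(r)
	}
}
```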
And although that's the received wisdom in the community,
once we looked deeper into it, I'm not sure AWS really believe that,
because if you look, you mentioned you get more compute capacity
as you allocate more RAM to your Lambda functions.
The higher RAM Lambda functions actually give you more cores,
which implies to me that AWS expects you to be doing things concurrently within Lambda.
That's just not how people had, in the community, expected things to be used.
I think you may be onto something there.
Counterpoint, it's easy to sit here and say,
ah, they hid this thing in and with the expectation
people would find it and use it this way.
I'm not so sure.
I think that Amazon is very good at building things
and then being surprised by how people wind up using them.
It's one of those areas where customers all have different use cases,
different problems,
different ways of thinking about things. And it winds up being a fun conversation in some cases.
I wound up talking to an engineer at AWS recently about how I use Secrets Manager
instead of Dynamo as a database. And they were disappointed in me, as they should be,
because it's a terrible idea, never do it. But it's getting into the idea of how people use or misuse services.
And the fact that these are these broad primitive building blocks that you can put together a whole bunch of different ways is an awful lot of fun in some ways.
Yeah, and I think AWS has shown a willingness to go and meet the customer where the customer is quite a lot, and there are services that show that.
I used to do a talk at a few different sort of meetups and AWS Summits and what have you,
called Five AWS Services That Shouldn't Exist. And, you know, the obvious service, whenever I told
anyone the title before they'd heard anything I'd said, was, were you obviously going to say EFS?
And EFS,
I think, is one of those services that on the face of it shouldn't exist. Like, everything in AWS has been designed around, well, actually, you don't store your state in the file system,
because that's absolutely the wrong place to put it. But at the same time, AWS obviously
spoke to their customers and realized, actually, the way that people
are used to using things, the way that people want to use them, we need to have a service
like EFS so that customers can work in that way. And we can talk to them about how they
can better do things and do things differently and use S3, which obviously was one of the
original services, to replace EFS. but we need to provide something for them.
So I think Amazon show, I guess there are two ways of looking at it.
Either they show a lot of humility in saying,
oh, well, actually, we didn't think you were going to use it like that,
but now we know that we'll build it differently.
Or they are just very mercenary and they just look and say,
well, who cares about what I believe you should be doing to do it best?
If you're going to pay us money to do it, we'll build it for you.
I think that you're right.
It comes down to what customers need.
I mean, I've been making fun of EFS for a long time.
And it's gotten better to the point now where my single ding against it is,
in a cloud-native world, you probably shouldn't be
greenfielding anything that uses NFS primitives, regardless of how good the implementation thereof
is. That said, that's not realistic for companies that are migrating from on-prem environments.
You're not going to shove a net app into the cloud. You've got to have something out there
that speaks to those languages. And if that is your scenario, and that's how everything's
architected, it's fun to sit here and condescendingly shake your finger
at people and tell them they should write their software differently. But that's
not how AWS speaks to customers. That's how Google Cloud speaks to their customers.
So I've got a lot of time for the Google Cloud people.
And I also know their roadmap, so I'm not going to comment on that.
But I'm pretty sure Cloud Filestore, their managed NFS offering,
has actually been announced for Google Cloud now.
So that definitely doesn't work anymore.
But no, I think the reality of the situation is, yeah,
if you want to be the largest player in town, and AWS already are, and they don't want to lose that position,
you have to, as you say, you have to meet the customer
where the customer is.
And that means building solutions that you think,
well, we wouldn't build it like that internally at AWS
or for Amazon retail or for Amazon video streaming.
But actually, if that's what the customer needs to do, that's fine. And we
shouldn't be telling them, no, this is the way you have to build it. We should be building what
the customer needs, what's right for them right now.
Absolutely. Sam, thank you so much for spending time chatting with me today. I appreciate it.
No problem. Thanks for having me on.
Thanks again. My name's Corey Quinn.
This has been Sam Bashton, and this is Screaming in the Cloud.