Screaming in the Cloud - Episode 16: There are Still Servers, but We Don't Care About Them
Episode Date: June 27, 2018

Are you interested in going beyond basic monitoring and visibility? Need tools to build and operate serverless applications and extract business intelligence? IOpipe provides extended visibility and metrics around AWS Lambda, including profiling, core dumps, and incoming input events. Today, we're talking to Erica Windisch, who is the founder and CTO of IOpipe. She brings her experience in building developer and operational tooling to serverless applications. Erica also has more than 17 years of experience designing and building cloud infrastructure management solutions. She was an early and longtime contributor to OpenStack and a maintainer of the Docker project.

Some of the highlights of the show include:
Nomenclature battle: serverless vs. stateless
Building a window of visibility into Lambda: talking to users and assessing needs and pain points
Observability of the infrastructure: a necessary evil on the way to automated healing
Using Lambda at significant levels of scale; some companies grow usage gradually, others go all in right away
Current state of the Lambda ecosystem
Is Lambda stable? Indications of reliability, but no formal SLA
How issues manifest and are exposed
Trends include cold starts, hours-long failures, and multiple function invocations
Infrastructure powering IOpipe: Lambda issues may impact performance of the monitoring system, but IOpipe is not entirely dependent on Lambda
Future of Lambda: it pushes you to build applications a specific way, but there are limitations
What would Erica change about Lambda? Running a function and defining handlers for its output
Lambda functions can be difficult to understand; some developers lack familiarity with distributed systems and create bottlenecks
Capacity limits around Lambda can be difficult to establish

Links:
Erica Windisch on Twitter
Erica Windisch on Twitch
IOpipe
12-Factor App
Cloud Custodian in Lambda
Velocity London
ServerlessConf London
re:Invent
AWS Glue
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
I'm joined this week by Erica Windisch,
who's the founder and CTO of IOpipe.
Welcome to the show.
Hi, thanks for having me.
No, thanks for taking the time to speak with me.
So let's start at the beginning.
What is IOpipe?
Oh, wow.
Okay, so what we do is we provide tools for developers
to build and operate their serverless applications from development through production.
And increasingly also doing things like helping you extract business intelligence from your applications and correlate that with operational information and operational observability.
Which does sound like a lot of buzzwords, doesn't it?
I feel like half of this space sort of stands out that way.
In fact, I first found out that you folks existed at re:Invent last year.
There was a big Midnight Madness launch, and they were going to be announcing some things.
And frankly, none of us cared about that.
We were there to see Shaquille O'Neal as DJ Diesel, apparently, quote unquote, dropping sick beats, as the kids say.
But while I was there, watching your presentation, a couple of other things that came out were,
in some ways, more entertaining even than watching a seven-foot-tall gentleman spin discs for fun.
So it was neat to see.
To my understanding from back then and as continues to evolve now as I continue to work in this space, effectively what you do is provide visibility and metrics around AWS Lambda.
Is that more or less how you're positioning yourselves these days?
Is there a – I mean, you can obviously pour more buzzwords onto it, but is that effectively encapsulating what you do?
I would say it's the baseline for what we do.
We have some competitors, and I would say our competitors definitely fit more firmly within those parameters.
I think we're growing out of basic monitoring and basic visibility
because we have things like profiling.
We have core dumps.
Now we look at things like incoming input events.
So if you're doing an Alexa skill,
you can filter by a specific conversation with a specific user if you want to.
And that just works out of the box, right?
And those are things that none of our competitors, for instance, are able to do. So I don't know what to call this, but I think we're doing something new and unique.
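To make the incoming-event point concrete, here is a minimal sketch in Python (not IOpipe's actual agent API, which isn't shown in this conversation) of what filtering by user or conversation relies on: the identifiers ride along on the incoming Alexa request event.

```python
# Hypothetical sketch (not IOpipe's agent code): pulling the user and session
# identifiers out of an incoming Alexa skill request so an observability tool
# could filter invocations by conversation or user.

def extract_alexa_identity(event):
    """Return (user_id, session_id) from an Alexa skill request event, if present."""
    session = event.get("session", {})
    user_id = session.get("user", {}).get("userId")
    session_id = session.get("sessionId")
    return user_id, session_id

def handler(event, context):
    user_id, session_id = extract_alexa_identity(event)
    # In practice these would be attached to the invocation's telemetry
    # (labels, custom metrics); printing just sends them to CloudWatch Logs.
    print({"alexa_user": user_id, "alexa_session": session_id})
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": "Hello from Lambda."},
            "shouldEndSession": True,
        },
    }
```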
I would agree with the first part of your last sentence, which is that it's difficult to
know what to call this. I mean, someone would argue that in any sufficiently exciting technology,
a battle always breaks out either about pronunciation or about what it is you want
to call the thing that you've built. We've seen it with monitoring versus observability.
So to that end, where do you stand on use of the word serverless?
I think the word serverless is fine. Initially, you know, I kind of see the point people are
making. People make a big deal of the name, but nobody complains about the term stateless.
We've agreed that we could build stateless applications, but there's still state, right?
Your TCP session has state. The physical link layer has state of a wire physically being connected.
Your application, your user, provides a session
cookie and your state is stored in your database. So there is
state, except that this part of code doesn't
necessarily worry about the state. You put the state in different layers of
your application; you manage your state in certain ways.
And you ignore the places where you still have state,
like the fact that you connect to a database.
The fact that you're storing data in a database is taking that state and moving it somewhere.
So it's like, I have this temporary state
by the nature of running an application,
and then I store it elsewhere.
I don't maintain the state.
And I think serverless is very much the same way, right?
Yes, there's still servers, but we don't care about them.
We move them somewhere else.
We've moved the concern for them in the same way we've moved state.
But I guess because servers are a more concrete thing
that you can physically see,
there's more pushback around that term
than with state, because state is such an abstract concept. You can't see state, right, generally,
but you can see servers. But I think that these are similar, but we complain about one and don't
complain about the other. Very aptly put. So how long has IOpipe been in business?
So we've been in business for two years, a little more than two years. We launched about a year ago.
I started on this project maybe two and a half years ago, in terms of me leaving Docker and saying,
I'm going to go do something around serverless and next-generation applications and figuring out what that meant.
And then through customer conversations,
through searching for a co-founder and finding Adam,
and founding the company, we found a focus and a vision
and turned that into incorporating the company and so forth about two years ago.
If you take a look, I think Lambda wasn't really announced until 2015.
So that's less than a year between the announcement of a thing that no one really knew what to make of
and you effectively jumping on this in a very, very early state. How did the idea of building a, I guess, window of visibility into this new
thing that no one quite understood what to do with come about? Kind of through two threads.
One was talking to users and developers on Lambda and assessing what their needs were.
We just had lots of conversations to find out where the pain points were. Where do you need help? What can we fix?
Is there a product here? Is there something that you need
that we can serve and fix for you
and build a product? So we were seeing a trend in
users and developers of serverless looking for
monitoring and observability,
as well as the ability to really understand things like sessions,
for HTTP sessions, for users of those applications,
for users of Alexa applications, tracking Alexa skills.
These are all things that we saw.
And so we saw a market need for that.
But more so, the original vision of IOpipe, my vision when I left Docker, was more ambitious.
And I saw that observability of the infrastructure was a necessary evil to get to a place where I wanted to get to, which was more of automated healing, automated application construction.
I wanted machines to do all this work for us,
including the idea of, say, AWS Glue, for instance,
this idea of gluing together serverless applications
or doing things like AWS step functions.
When we build these units really small
and they have very open
and standardized channels of communication
and just process events,
if we standardize event processors,
we have the standardized input,
we have the standardized output,
and they're all very, very small,
we could just use machine learning to construct them.
And that was kind of my original vision
and was like, okay,
well, it turns out we need a feedback loop for this,
which is observability.
And that just didn't exist.
So we started building the observability tools
and we started talking to users
and seeing they need observability tools.
So we just went straight down that path.
And I think maybe in some ways,
we're getting back to those original vision ideas,
but very strongly staying within where there's a market need. Which is a fascinating way of, I guess,
almost stumbling into an offering that's definitely resonating within the market.
To that end, do you see that customers are using Lambda at significant scale at this time? Or
are people still in early days, doing it for proof of concept and not really rolling it out widely?
I mean, it depends, right?
There are some very large organizations that are using Lambda for a number of projects that may be big or small. There's something that came up in a conversation I've had with people, where there was some focus in the market,
among other developer evangelists and enthusiasts giving talks, on the idea of
just going straight into production, going straight into building these
applications, like these are applications that are ideal for Lambda, and kind of just starting there. And I was like, hold on a second, right?
It's actually okay to say you can build simple applications, ad hoc applications on Lambda
to learn it. And then land and expand, right? Get in there, get familiar with Lambda on low-risk
applications, and then get into big applications. And I definitely
see both of these. I've seen corporations, companies
go straight into, I'm going to put
a billion dollars of billing into
Lambda.
And just a whole Fortune 100 is like, we're going to put all of our billing in Lambda,
just straight off the bat.
And I've also seen big companies that say, you know what, we're going to do this small
project.
We're going to do some cron jobs.
We're going to become familiar with it and understand where the edge cases are and then grow.
So it's a mix. I would say it's probably a lot of maybe the latter rather than the former,
because I think it's easier to start with small things and expand out than to have big top-down
initiatives like rewrite giant stacks in Lambda. Oh, I agree wholeheartedly.
I mean, you're probably the single company
that is best positioned as a global observer
of what trends people are implementing with Lambda
other than Amazon themselves.
One of the, I guess, early use cases
and a lot of the examples that Amazon themselves give
about implementing Lambda
tend to revolve around performing certain tasks
in an AWS service environment,
taking a tag and propagating it
to a secondary or tertiary resource,
taking a bit of data from one service
and then passing it to another and so on and so forth.
Is that, I guess, the primary use case
that you start to see?
Is it people using this for something else entirely
to run full-featured applications?
Are you just seeing it done as glue code?
I mean, what is the current state of the Lambda ecosystem?
I mean, there's definitely a mix.
And I would say that I kind of don't agree with this notion that
Lambda is just
filling service gaps in AWS.
Right?
Lambda as, say, stored procedures
isn't necessarily addressing
a lack of capabilities
of the database. It's like you have custom
business logic you need to implement.
We've used Kinesis.
So there are some things that we do with Kinesis where, yeah, we could technically just use Firehose
or we could just use some of the other AWS services
that do this for us.
We chose to write our own code for a number of reasons.
But yeah, it's a mix.
So I was just thinking, I wrote this Lambda@Edge
function that does JWT
verification for S3.
So instead of doing pre-signed
URLs with S3, if you have
a valid JWT JSON web
token, you can
just access the data, right?
You don't need to send your JWT
to your Lambda-based
API Gateway endpoint
to sign
this request on S3 and return back
a pre-signed URL.
You can just use that JWT directly
with S3 through
Lambda@Edge.
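The implementation discussed here isn't shown, but a rough sketch of the idea looks something like the following: a CloudFront viewer-request handler (Lambda@Edge) that verifies the JWT on the Authorization header before the request reaches the S3 origin. The PyJWT library and the SECRET signing key are assumptions for illustration, and Python is used for readability.

```python
# Illustrative sketch only -- not the implementation discussed in the episode.
# A CloudFront viewer-request handler (Lambda@Edge) that checks a JWT before
# the request reaches the S3 origin. Assumes the PyJWT package is bundled and
# that SECRET is the signing key; both are assumptions for this example.
import jwt  # PyJWT

SECRET = "replace-with-your-signing-key"  # placeholder, not a real secret

DENIED = {
    "status": "403",
    "statusDescription": "Forbidden",
    "body": "invalid or missing token",
}

def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    headers = request.get("headers", {})
    auth = headers.get("authorization", [])
    if not auth:
        return DENIED
    token = auth[0]["value"].replace("Bearer ", "", 1)
    try:
        jwt.decode(token, SECRET, algorithms=["HS256"])
    except jwt.InvalidTokenError:
        return DENIED
    # Returning the request object lets CloudFront continue on to the S3 origin.
    return request
```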
But this is the case where
wouldn't it just be cool if Amazon
just supported
JSON Web Tokens for S3 in the first place
So I could see that perspective,
but it also
provides so much more over that, right?
Because Amazon can't predict
what's going to be popular, right? Like,
JSON web tokens are a thing that kind of came from
somewhere, and, you know, the
industry came around and said, we're going to build this JSON web token
thing. But there's also basic authentication.
There was digest authentication.
There's LDAP authentication to web services.
And Amazon could have gone and supported all of those.
Or they can just say, we can give you a mechanism where you implement it however you want to,
and give you the power of open source to share that code and to build an ecosystem around us as a platform instead.
And then on the other side, Amazon is,
or our users are building web APIs
and web applications and microservices
and what I now call nanoservices, on Lambda.
And I think those are real applications that are, as long as you can
build a quote-unquote 12-factor application, you can build it on Lambda. A question that I have,
though, comes also down to the basic reliability of the platform. If I take a look right now at my
Lambda functions over the past day, I've had 30 invocations, which means that there are large swaths of time
during which Lambda could have been completely down,
and I would have had no idea.
There is no formal SLA around it.
So from my perspective, I'm looking at this,
and given that no one has complained about the thing
that my Lambda functions power,
and no one has blown up my email about this,
I assume that the reliability has been
perfect. How does that map to what you're seeing in, I guess, the real world as people start to
scale this significantly? Is Lambda fairly stable? Is it something that tends to drop out in weird
ways that are difficult to diagnose? I would say it's been pretty stable recently.
There are some outliers that are not recent. When they first launched and they first went
GA, there were a couple of issues that were resolved fairly quickly, mostly in US East 1.
But it's been pretty stable since then. The last major outage, like significant outage I can accurately place was the great S3 failure.
And that was because Lambda uses S3 for storage internally.
And when S3 went down, Lambda went down too.
Got you.
When you do see Lambda issues, how do those tend to manifest?
I feel like there's not enough exposure to how these things break.
Is it delay in invocation?
Do they fail to invoke at all?
Does it hang and add latency spikes or something else entirely?
No, it's actually really interesting.
So because, as you said, we have maybe some of the best visibility into this outside of Amazon themselves. We definitely have
internal visibility into anonymized statistics
of what's happening on Lambda
that we could look at. And we noticed a few things.
So there's a built-in container lifecycle. There's this idea of cold starts, because containers are spun up, while containers are also killed, right?
There's a lifecycle that's anywhere between four and a half minutes and four and a half hours for a container servicing a Lambda function, of which a Lambda function might be served by multiple containers, right?
But each container, and every process that's in that container, is supposed to live for between
four and a half minutes and four and a half hours. We've seen cases where they've been alive for eight
hours or 16 hours instead. And then sometime around that 10-hour mark or whatever, you know, Amazon starts announcing that there are, you know, like, service problems.
And so we've actually kind of noticed some of these failures before Amazon has, or at least before they've acknowledged them, because we can see that those containers aren't being reaped at the right time.
And like, this may have been a case where that was literally the bug.
Maybe they weren't reaping, which meant that they were spawning too many containers and
they had resource exhaustion in the Lambda service because they weren't properly garbage
collecting containers.
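The container lifecycle described above is observable from inside a function itself. A minimal sketch (not IOpipe's agent) uses module-level state, which survives across invocations on a warm container, to distinguish cold starts from reuse and to estimate how long a container has lived.

```python
# Minimal sketch (not IOpipe's agent): module-level state persists across
# invocations that land on the same warm container, so it can be used to
# detect cold starts and estimate container age.
import time

CONTAINER_BORN = time.time()   # set once, when the container first loads the module
INVOCATIONS = 0                # counts invocations served by this container

def handler(event, context):
    global INVOCATIONS
    cold_start = INVOCATIONS == 0
    INVOCATIONS += 1
    container_age_s = time.time() - CONTAINER_BORN
    # A monitoring agent would ship these as metrics; printing goes to CloudWatch Logs.
    print({
        "cold_start": cold_start,
        "invocations_on_this_container": INVOCATIONS,
        "container_age_seconds": round(container_age_s, 1),
        "request_id": context.aws_request_id,
    })
    return {"ok": True}
```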
We've seen things where functions would be invoked multiple times
consistently,
where every
Lambda function was invoking three or four
times
instead of once.
But these things have mostly
settled down
to a very significant degree
as the product has matured.
I mean, these were mostly issues
around launch, like initial
launch. And that makes a fair bit of sense. Are you able to talk at all about the infrastructure
that powers IOpipe? In other words, when there starts to be a Lambda issue, is that something
that impacts the performance of the monitoring system that watches Lambda? Yeah. So we are based on Lambda. So we actually consume reports from...
So a user's Lambda runs,
it sends data directly to a collector service that we run.
That puts data into Kinesis.
None of that touches Lambda up to that point.
So we're not dependent on Lambda
or any of Amazon's serverless products
for ingesting the data
and getting it into our account,
which is good because it does de-risk us
from some, if there were a failure in Lambda,
we wouldn't be affected by it at that point.
And at that point, it's in Kinesis.
So once it's in Kinesis,
even if there was a failure with any
of the services that we built internally on Lambda,
we could just
process that at a delay.
But the Kinesis
feeds into several Lambdas
that write things to
our databases and run our alerts
and run
various intelligence tasks
against them. So we use Lambda
very extensively internally.
Basically, I think that the collector service
is perhaps the
only service that's not on Lambda
for specific reasons:
we've chosen to
de-risk against certain
things, particularly against
the case where there would be a Lambda failure,
or for latency.
But when we deployed that service,
API Gateway did not have regional endpoints,
which it does do now, but at the time it didn't.
And I know it's something that we needed.
So one thing we have actually reconsidered
is whether we would eliminate that service,
because we could actually implement that service
without EC2, with API Gateway instead, without any Lambda, actually.
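The pipeline described above (collector, then Kinesis, then Lambda consumers) is a common event-streaming pattern on AWS. IOpipe's own code isn't shown here, but a generic sketch of the consuming side, a Lambda triggered by a Kinesis event source, looks roughly like this; store_report is a stand-in for the real database writes and alerting.

```python
# Generic sketch of a Kinesis-triggered Lambda consumer, in the spirit of the
# pipeline described above (collector -> Kinesis -> Lambda workers). This is
# not IOpipe's code; store_report() is a stand-in for whatever the real
# pipeline does with each record (write to a database, run alerting, etc.).
import base64
import json

def store_report(report):
    # Placeholder: persist the report, evaluate alerts, run analysis, ...
    print("processing report for function:", report.get("function_name"))

def handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        report = json.loads(payload)
        store_report(report)
    return {"records_processed": len(event["Records"])}
```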
Gotcha. I was wondering on some level if there was going to be a dark secret of, surprise, we
actually run this entire thing in a data center somewhere that's in the middle of nowhere because
we think this cloud thing's a fad. It's always interesting when you start scratching to see how things like this are built under the hood.
I actually had a conversation with somebody who suggested we do that, actually.
That was a legitimate proposal.
Was this person trying to sell you colo space by any chance?
I don't think they were, actually.
So as far as where you see today,
at least from my perspective,
Lambda started off as a curiosity and a bit of a toy.
Three years in, it's more than that.
I'm seeing it used for production-level workloads
in a number of different environments,
and we're seeing the platform itself
become a lot broader as well
in the context of being able to support new runtimes that weren't there at
launch, new versions, and, for example, assign more resources. I believe at the last re:Invent, the RAM
limit doubled. Where do you see the platform evolving into in the future? I mean, when it
becomes less of a toy, even than it is now, five years from now, what does that look like?
I mean, I wouldn't say it's a toy now.
I think you can build really amazing advanced applications on it.
And the limitations of Lambda, to me, are very freeing,
where it's enforcing some of the 12-factor design decisions.
12-factor was a guideline, and Lambda enforces that opinionated stack design, right? It forces you to build applications this way.
Things like the five-minute window kind of make you build applications a certain way, which is a good thing.
It does maybe restrict you from doing some sort of MapReduce kind of jobs, but
for most applications, I do think
these are very much not toy applications.
You can build
any kind of
microservice or HTTP service you're looking
to build; you can do it with API Gateway and Lambda.
I think there are some
limitations that are
kind of an issue that are actually not even restricted
just to Lambda.
Amazon's going to get there, but
they need to work on it.
So for instance, and this is something we're dealing with
right now: if you want to
expose
an API gateway service,
well, so
we had a service that was based on Elastic Beanstalk,
our collector, and we were exposing that collector over a VPN.
You cannot use CloudFront, nor can you use ELBs or ALBs, for that when you're doing it over a VPN.
So Amazon just announced API Gateway over a VPN, or VPC, I'm sorry, VPC.
And again, now it's like, okay, great.
Now, actually, this works.
Now we can point to, we need to have API Gateway to ALB, but how do we do TLS termination,
right?
And these are problems that I really wish Amazon would solve.
So I guess what I'm saying is some of the services around,
I wish they did a little better around those.
Kinesis Video Streams, for instance, doesn't integrate with Lambda.
So there are places where I just wish Lambda was,
or I wish that they did a thing that they just don't do yet.
And they're getting there, they're working on these things,
but sometimes living
on a cutting edge, you definitely run into
some of these services
that aren't Lambda that have
limitations that I wish they didn't.
If you had a magic wand, what
would you change about Lambda?
I think this is maybe a selfish
answer because I work on this observability platform.
But there's this thing that was actually in Azure Functions
that was pretty neat, this idea that you run your function
and then you can define, basically, handlers
for the output of that function
as well as different pipes out of it.
So you could basically have your function run,
return some value,
and not just return data back to the caller,
but have that output basically teed off, piped off,
forked to other receivers directly.
So a thing that you basically have to use
Step Functions for in Lambda, having
a Lambda execution itself
be an event trigger for another Lambda directly, for instance,
would be really, really neat.
Whenever this Lambda is invoked,
take the output of it and run another Lambda function
or put the output of it in a Kinesis stream.
Like that's a really, I think a neat thing
that would actually enable me to do some things
that I can't do today.
And that Azure actually kind of did do out of the box.
And there are some things they did out of the box that I don't like,
and things they didn't do out of the box that I wish they did do, over at Azure.
But that was like the one thing I was like, wow, that's really cool.
And I still kind of wish that Amazon had something like that.
Some sort of like queue or Kinesis stream or something
for the output of those functions.
That is, I mean, not ingesting CloudWatch data,
because you could do that, like you do with the CloudWatch stream,
but something that's a little bit more of an alternative pipeline
for data out of it.
It's kind of hard to explain.
It's kind of ambiguous.
It's maybe something to just explore.
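Nothing native did this at the time, so the closest do-it-yourself approximation is to wrap the handler and forward its return value to a stream explicitly. A rough sketch, assuming a Kinesis stream named function-output exists:

```python
# Rough approximation of the feature being described: manually "teeing" a
# function's return value into a Kinesis stream so other consumers can react
# to it. Assumes a stream named "function-output" exists; this is a workaround
# sketch, not a built-in Lambda capability.
import functools
import json
import boto3

kinesis = boto3.client("kinesis")
OUTPUT_STREAM = "function-output"  # assumed to exist

def tee_output(func):
    @functools.wraps(func)
    def wrapper(event, context):
        result = func(event, context)
        kinesis.put_record(
            StreamName=OUTPUT_STREAM,
            Data=json.dumps(result).encode("utf-8"),
            PartitionKey=context.aws_request_id,
        )
        return result  # the caller still gets the normal return value
    return wrapper

@tee_output
def handler(event, context):
    return {"status": "done", "input_keys": sorted(event.keys())}
```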
Very fair. Taking a bit of the opposite approach for a second,
as you take a look at how people are implementing Lambda in various environments,
what aspects of working with Lambda functions do you find that people either struggle to wrap
their heads around, misunderstand, or, I guess, fundamentally are having trouble with today?
Because none of this stuff is easy or intuitive the first time you see it, I can assure you.
I spent most of my time learning how this stuff works by getting it hilariously wrong.
I mean, so for me, I personally didn't have as much of a challenge here.
And I do see others having that challenge. And I think it's a way of thinking.
I think that a lot of people implementing microservices, implementing these next-generation
applications, these microservice applications, came to it with this monolithic mindset
and adapted to it. They weren't familiar with actor-based programming models.
They weren't familiar with things like Erlang or Haskell.
When I'm saying Erlang, I'm thinking OTP in particular.
A lot of developers aren't aware of message queues, right?
I mean, of course, many are, but that kind of distributed computing, distributed computing
problems, building applications at scale is a thing that a lot of developers don't have direct familiarity with.
They're just like, I'm going to build a Node app and build it stateless, and I'll throw it in EC2, and I'll throw more EC2 instances at it. The thing with Lambda that I think catches people by surprise is that Lambda scales
so easily and so readily
that its
massive scale can become an issue
if you don't plan for it, where
you can easily find yourself
with a thousand concurrent
invocations and a thousand active
containers and
overload your database.
You can just throw
so much more
at a database. You can throw so much more
at a service. You can get so much concurrency
and parallelization
accidentally with Lambda
that you run into bottlenecks
that you didn't run into before
because you just said, oh, well, an EC2
instance is fine. I'm just going to make
a vertical stack here, right?
I'm just going to make these giant vertical silos.
I'm just going to build them taller, right?
And instead, you now have a distributed systems problem.
And a lot of developers just aren't familiar with them,
you just find yourself creating bottlenecks in things like databases that you just didn't expect.
If you're not, well, if you're new to it, if you don't know to expect that.
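One concrete guardrail for that accidental-concurrency problem is Lambda's reserved concurrency setting, which caps how many copies of a function run at once and therefore how hard they can hit a database behind them. A small sketch with boto3, using a hypothetical function name and an illustrative limit:

```python
# Sketch: cap a function's concurrency so it can't accidentally open more
# database connections than the database can handle. "orders-api" is a
# hypothetical function name; 50 is an illustrative limit, not a recommendation.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.put_function_concurrency(
    FunctionName="orders-api",
    ReservedConcurrentExecutions=50,
)

# Verify what's currently configured.
config = lambda_client.get_function_concurrency(FunctionName="orders-api")
print(config.get("ReservedConcurrentExecutions"))
```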
Scale brings up an interesting question. The entire premise of any sort of cloud computing
environment is that's the beautiful part.
You can scale infinitely, which is absolutely awesome until you actually try to do it.
Come to find out there are theoretical upper limits.
You cannot provision two million containers at the same time and expect something not to fall over. Do you see indications that there are capacity limits around Lambda that are at a point where it starts to affect individual consumers?
Or does the shared nature of the platform make that very hard even to determine from the outside?
I would say it's probably hard to determine from the outside.
I wouldn't even say Lambda is shared. I would say that there's an implementation detail
of Lambda that is not a...
Amazon does not
guarantee this, but it is
an implementation detail that
you basically get your own
virtual machines to run your containers
on. Amazon's
managing a fleet of EC2 instances
just for you
for your Lambdas, as an implementation detail.
That, again, is not a guarantee from them,
but that's just how they've chosen to implement it.
And so I think that the limitations of Lambda
are probably closer to that of EC2.
In reality, things where there are limitations
are like the 75 gigabyte limit for all
function code per
account, which some users
have run into. Oh, I've run into that
on a single function for myself because I
write really inefficient nonsense.
So, you
can't actually do that. I think per function
you have a limit of like 500
megs compressed, I think.
So you basically need to
like divide 75 gigabytes by 500 megabytes. Yeah, I think it was something like 75 megabytes
compressed, which in all seriousness, snark and witticism aside, I did brush into with some of my
early functions as I started trying to install everything into a monolithic function of pip
dependencies over in Python land. It turns out
that's a terrible anti-pattern, and I
should never do that. Hey,
putting a monolith into a
serverless function does
not mean you're
suddenly living in the future.
You do have to break these things out
architecturally, as it turns out.
I mean, I don't know that you really
need to do that.
I think there are actually some valid use cases for, say, running WordPress inside of Lambda.
And I think that can be fine.
Cloud Custodian is an example of an app that is kind of a very large, open-source monolithic application.
I think it's like 40,000 lines of code.
It's very big, but it's kind of fine.
There's some advantages to it.
Every Alexa skill is a monolith,
for better or worse.
And it's just by design, you have
to build it that way.
People are going to do it.
I think tools like IOpipe do actually help
with that, but I think we got off
your actual question.
Which is
absolutely fine. Is there
anything else you'd like
to mention that you have coming up or talk
about that would be relevant
or interesting? Or where can people find you?
Well,
I'm going to be speaking, I'm going to be
keynoting for Serverless Days London
I guess next month.
That's next month already.
I will be
speaking at Velocity London.
So that handles all of our London people.
I have a bunch of other conferences
that I'll be at, so many that I can't
remember where they are and what they are.
But I think I can say that
I'll be speaking at re:Invent. I think that's
happening. So you can find me there.
You can find me on Twitter, twitter.com
slash ewindish. And
IOpipe, so we have a community Slack
and you can find our website and you can
reach out to us as well. So yeah,
I've also been doing Twitch streaming,
twitch.tv slash ewindish.
I've not been active in
the last few weeks, but I'll probably get back to streaming soon.
Perfect. Thank you so much for taking
the time to speak with me today.
My name is Corey Quinn, and this is
Screaming in the Cloud.
This has been this week's episode
of Screaming in the Cloud.
You can also find more Corey at
screaminginthecloud.com or wherever
fine snark is sold.