Screaming in the Cloud - Episode 16: There are Still Servers, but We Don't Care About Them
Episode Date: June 27, 2018

Are you interested in going beyond basic monitoring and visibility? Need tools to build and operate serverless applications and extract business intelligence? IOpipe provides extended visibility and metrics around AWS Lambda, including profiling, core dumps, and incoming input events. Today, we're talking to Erica Windisch, who is the founder and CTO of IOpipe. She brings her experience in building developer and operational tooling to serverless applications. Erica also has more than 17 years of experience designing and building cloud infrastructure management solutions. She was an early and longtime contributor to OpenStack and a maintainer of the Docker project.

Some of the highlights of the show include:
Nomenclature battle: serverless vs. stateless
Building a window of visibility into Lambda: talking to users and assessing needs and pain points
Observability of the infrastructure: a necessary evil on the way to automated healing
Using Lambda at significant levels of scale; some companies grow usage gradually, others go all in right away
Current state of the Lambda ecosystem
Is Lambda stable? Indications of reliability, but no formal SLA
How issues manifest and are exposed
Trends include cold starts, hours-long failures, and multiple function invocations
Infrastructure powering IOpipe: Lambda issues may impact performance of the monitoring system, but IOpipe is not entirely dependent on Lambda
Future of Lambda: it pushes you to build applications a specific way, but there are limitations
What would Erica change about Lambda? Running a function and defining handlers for its output
Lambda functions can be difficult to understand; some developers lack familiarity with distributed systems and create bottlenecks
Capacity limits around Lambda can be difficult to establish

Links:
Erica Windisch on Twitter
Erica Windisch on Twitch
IOpipe
12-Factor App
Cloud Custodian in Lambda
Velocity London
ServerlessConf London
re:Invent
AWS Glue
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
I'm joined this week by Erica Windisch,
who's the founder and CTO of IOpipe.
Welcome to the show.
Hi, thanks for having me.
No, thanks for taking the time to speak with me.
So let's start at the beginning.
What is IOpipe?
Oh, wow.
Okay, so what we do is we provide tools for developers
to build and operate their serverless applications from development through production.
And increasingly also doing things like helping you extract business intelligence from your applications and correlate that with operational information and operational observability.
Which does sound like a lot of buzzwords, doesn't it?
I feel like half of this space sort of stands out that way.
In fact, I first found out that you folks existed at re:Invent last year.
There was a big Midnight Madness launch, and they were going to be announcing some things.
And frankly, none of us cared about that.
We were there to see Shaquille O'Neal as DJ Diesel, apparently, quote unquote, dropping sick beats, as the kids say.
But while I was there, watching your presentation, a couple of other things that came out were,
in some ways, more entertaining even than watching a seven-foot-tall gentleman spin discs for fun.
So it was neat to see.
To my understanding from back then and as continues to evolve now as I continue to work in this space, effectively what you do is provide visibility and metrics around AWS Lambda.
Is that more or less how you're positioning yourselves these days?
Is there a – I mean, you can obviously pour more buzzwords onto it, but is that effectively encapsulating what you do?
I would say it's the baseline for what we do.
We have some competitors, and I would say our competitors definitely fit more firmly within those parameters.
I think we're growing out of basic monitoring and basic visibility
because we have things like profiling.
We have core dumps.
Now we look at things like incoming input events.
So if you're doing an Alexa skill,
you can filter by a specific conversation with a specific user if you want to.
And that just works out of the box, right?
And those are things that none of our competitors, for instance, are able to do. So I don't know what to call this, but I think we're doing something new and unique.
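To make the incoming-event point concrete, here is a minimal sketch in Python (not IOpipe's actual agent API, which isn't shown in this conversation) of what filtering by user or conversation relies on: the identifiers ride along on the incoming Alexa request event.

```python
# Hypothetical sketch (not IOpipe's agent code): pulling the user and session
# identifiers out of an incoming Alexa skill request so an observability tool
# could filter invocations by conversation or user.

def extract_alexa_identity(event):
    """Return (user_id, session_id) from an Alexa skill request event, if present."""
    session = event.get("session", {})
    user_id = session.get("user", {}).get("userId")
    session_id = session.get("sessionId")
    return user_id, session_id

def handler(event, context):
    user_id, session_id = extract_alexa_identity(event)
    # In practice these would be attached to the invocation's telemetry
    # (labels, custom metrics); printing just sends them to CloudWatch Logs.
    print({"alexa_user": user_id, "alexa_session": session_id})
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": "Hello from Lambda."},
            "shouldEndSession": True,
        },
    }
```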
I would agree with the first part of your last sentence, which is that it's difficult to
know what to call this. I mean, someone would argue that in any sufficiently exciting technology,
a battle always breaks out either about pronunciation or about what it is you want
to call the thing that you've built. We've seen it with monitoring versus observability.
So to that end, where do you stand on use of the word serverless?
I think the word serverless is fine. Initially, you know, I kind of see the point people are
making. People make a big deal of the name, but nobody complains about the term stateless.
We've agreed that we could build stateless applications, but there's still state, right?
Your TCP session has state. The physical link layer has state of a wire physically being connected.
Your application, your user, provides a session
cookie and your state is stored in your database. So there is
state, except that this part of code doesn't
necessarily worry about the state. You put the state in different layers of
your application; you manage your state in certain ways.
And you ignore the places where you still have state,
like the fact that you connect to a database.
The fact that you're storing data in a database is taking that state and moving it somewhere.
So it's like, I have this temporary state
by the nature of running an application,
and then I store it elsewhere.
I don't maintain the state.
And I think serverless is very much the same way, right?
Yes, there's still servers, but we don't care about them.
We move them somewhere else.
We've moved the concern for them in the same way we've moved state.
But I guess because servers are a more concrete thing
that you can physically see,
there's more pushback around that term
than with state, because state is such an abstract concept. You can't see state, right, generally,
but you can see servers. But I think that these are similar, but we complain about one and don't
complain about the other. Very aptly put. So how long has IOpipe been in business?
So we've been in business for two years, a little more than two years. We launched about a year ago.
I started on this project maybe two and a half years ago, in terms of me leaving Docker and saying,
I'm going to go do something around serverless and next-generation applications and figuring out what that meant.
And then through customer conversations,
through searching for a co-founder and finding Adam,
and founding the company, we found a focus and a vision
and turned that into incorporating the company and so forth about two years ago.
If you take a look, I think Lambda wasn't really announced until 2015.
So that's less than a year between the announcement of a thing that no one really knew what to make of
and you effectively jumping on this in a very, very early state. How did the idea of building a, I guess, window of visibility into this new
thing that no one quite understood what to do with come about? Kind of through two threads.
One was talking to users and developers on Lambda and assessing what their needs were.
We just had lots of conversations to find out where the pain points were. Where do you need help? What can we fix?
Is there a product here? Is there something that you need
that we can serve and fix for you
and build a product? So we were seeing a trend in
users and developers of serverless looking for
monitoring and observability,
as well as the ability to really understand things like sessions,
for HTTP sessions, for users of those applications,
for users of Alexa applications, tracking Alexa skills.
These are all things that we saw.
And so we saw a market need for that.
But more so, the original vision of IOpipe, my vision when I left Docker, was more ambitious.
And I saw that observability of the infrastructure was a necessary evil to get to a place where I wanted to get to, which was more of automated healing, automated application construction.
I wanted machines to do all this work for us,
including the idea of, say, AWS Glue, for instance,
this idea of gluing together serverless applications
or doing things like AWS step functions.
When we build these units really small
and they have very open
and standardized channels of communication
and just process events,
if we standardize event processors,
we have the standardized input,
we have the standardized output,
and they're all very, very small,
we could just use machine learning to construct them.
And that was kind of my original vision
and was like, okay,
well, it turns out we need a feedback loop for this,
which is observability.
And that just didn't exist.
So we started building the observability tools
and we started talking to users
and seeing they need observability tools.
So we just went straight down that path.
And I think maybe in some ways,
we're getting back to those original vision ideas,
but very strongly staying within where there's a market need. Which is a fascinating way of, I guess,
almost stumbling into an offering that's definitely resonating within the market.
To that end, do you see that customers are using Lambda at significant scale at this time? Or
are people still in early days, doing it for proof of concept and not really rolling it out widely?
I mean, it depends, right?
There are some very large organizations that are using Lambda for a number of projects that may be big or small. There's something that came up in a conversation I've had with people, where there was some focus in the market,
among other developer evangelists and enthusiasts giving talks, on the idea of
just going straight into production, going straight into building these
applications, like these are applications that are ideal for Lambda, and kind of just starting there. And I was like, hold on a second, right?
It's actually okay to say you can build simple applications, ad hoc applications on Lambda
to learn it. And then land and expand, right? Get in there, get familiar with Lambda on low-risk
applications, and then get into big applications. And I definitely
see both of these. I've seen corporations, companies
go straight into, I'm going to put
a billion dollars of billing into
Lambda.
And just a whole Fortune 100 is like, we're going to put all of our billing in Lambda,
just straight off the bat.
And I've also seen big companies that say, you know what, we're going to do this small
project.
We're going to do some cron jobs.
We're going to become familiar with it and understand where the edge cases are and then grow.
So it's a mix. I would say it's probably a lot of maybe the latter rather than the former,
because I think it's easier to start with small things and expand out than to have big top-down
initiatives like rewrite giant stacks in Lambda. Oh, I agree wholeheartedly.
I mean, you're probably the single company
that is best positioned as a global observer
of what trends people are implementing with Lambda
other than Amazon themselves.
One of the, I guess, early use cases
and a lot of the examples that Amazon themselves give
about implementing Lambda
tend to revolve around performing certain tasks
in an AWS service environment,
taking a tag and propagating it
to a secondary or tertiary resource,
taking a bit of data from one service
and then passing it to another and so on and so forth.
Is that, I guess, the primary use case
that you start to see?
Is it people using this for something else entirely
to run full-featured applications?
Are you just seeing it done as glue code?
I mean, what is the current state of the Lambda ecosystem?
I mean, there's definitely a mix.
And I would say that I kind of don't agree with this notion that
Lambda is just
filling service gaps in AWS.
Right?
Lambda as, say, stored procedures
isn't necessarily addressing
a lack of capabilities
of the database. It's like you have custom
business logic you need to implement.
We've used Kinesis.
So there are some things that we do with Kinesis where, yeah, we could technically just use Firehose
or we could just use some of the other AWS services
that do this for us.
We chose to write our own code for a number of reasons.
But yeah, it's a mix.
So I was just thinking, I wrote this Lambda@Edge
function that does JWT
verification for S3.
So instead of doing pre-signed
URLs with S3, if you have
a valid JWT JSON web
token, you can
just access the data, right?
You don't need to send your JWT
to your Lambda-based
API Gateway endpoint
to sign
this request on S3 and return back
a pre-signed URL.
You can just use that JWT directly
with S3 through
Lambda@Edge.
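The implementation discussed here isn't shown, but a rough sketch of the idea looks something like the following: a CloudFront viewer-request handler (Lambda@Edge) that verifies the JWT on the Authorization header before the request reaches the S3 origin. The PyJWT library and the SECRET signing key are assumptions for illustration, and Python is used for readability.

```python
# Illustrative sketch only -- not the implementation discussed in the episode.
# A CloudFront viewer-request handler (Lambda@Edge) that checks a JWT before
# the request reaches the S3 origin. Assumes the PyJWT package is bundled and
# that SECRET is the signing key; both are assumptions for this example.
import jwt  # PyJWT

SECRET = "replace-with-your-signing-key"  # placeholder, not a real secret

DENIED = {
    "status": "403",
    "statusDescription": "Forbidden",
    "body": "invalid or missing token",
}

def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    headers = request.get("headers", {})
    auth = headers.get("authorization", [])
    if not auth:
        return DENIED
    token = auth[0]["value"].replace("Bearer ", "", 1)
    try:
        jwt.decode(token, SECRET, algorithms=["HS256"])
    except jwt.InvalidTokenError:
        return DENIED
    # Returning the request object lets CloudFront continue on to the S3 origin.
    return request
```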
But this is the case where
wouldn't it just be cool if Amazon
just supported
JSON Web Tokens for S3 in the first place
So I could see that perspective,
but it also
provides so much more over that, right?
Because Amazon can't predict
what's going to be popular, right? Like,
JSON web tokens are a thing that kind of came from
somewhere, and, you know, the
industry came around and said, we're going to build this JSON web token
thing. But there's also basic authentication.
There was digest authentication.
There's LDAP authentication to web services.
And Amazon could have gone and supported all of those.
Or they can just say, we can give you a mechanism where you implement it however you want to,
and give you the power of open source to share that code and to build an ecosystem around us as a platform instead.
And then on the other side, Amazon is,
or our users are building web APIs
and web applications and microservices
and what I now call nanoservices, on Lambda.
And I think those are real applications that are, as long as you can
build a quote-unquote 12-factor application, you can build it on Lambda. A question that I have,
though, comes also down to the basic reliability of the platform. If I take a look right now at my
Lambda functions over the past day, I've had 30 invocations, which means that there are large swaths of time
during which Lambda could have been completely down,
and I would have had no idea.
There is no formal SLA around it.
So from my perspective, I'm looking at this,
and given that no one has complained about the thing
that my Lambda functions power,
and no one has blown up my email about this,
I assume that the reliability has been
perfect. How does that map to what you're seeing in, I guess, the real world as people start to
scale this significantly? Is Lambda fairly stable? Is it something that tends to drop out in weird
ways that are difficult to diagnose? I would say it's been pretty stable recently.
There are some outliers that are not recent. When they first launched and they first went
GA, there were a couple of issues that were resolved fairly quickly, mostly in US East 1.
But it's been pretty stable since then. The last major outage, like significant outage I can accurately place was the great S3 failure.
And that was because Lambda uses S3 for storage internally.
And when S3 went down, Lambda went down too.
Got you.
When you do see Lambda issues, how do those tend to manifest?
I feel like there's not enough exposure to how these things break.
Is it delay in invocation?
Do they fail to invoke at all?
Does it hang and add latency spikes or something else entirely?
No, it's actually really interesting.
So because, as you said, we have maybe some of the best visibility into this outside of Amazon themselves. We definitely have
internal visibility into anonymized statistics
of what's happening on Lambda
that we could look at. And we noticed a few things.
So there's a built-in container lifecycle. There's this idea of cold starts, because containers are spun up, while containers are also killed, right?
There's a lifecycle that's anywhere between four and a half minutes and four and a half hours for a container servicing a Lambda function, of which a Lambda function might be served by multiple containers, right?
But each container, and every process that's in that container, is supposed to live for between
four and a half minutes and four and a half hours. We've seen cases where they've been alive for eight
hours or 16 hours instead. And then sometime around that 10-hour mark or whatever, you know, Amazon starts announcing that there are, you know, like, service problems.
And so we've actually kind of noticed some of these failures before Amazon has, or at least before they've acknowledged them, because we can see that those containers aren't being reaped at the right time.
And like, this may have been a case where that was literally the bug.
Maybe they weren't reaping, which meant that they were spawning too many containers and
they had resource exhaustion in the Lambda service because they weren't properly garbage
collecting containers.
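The container lifecycle described above is observable from inside a function itself. A minimal sketch (not IOpipe's agent) uses module-level state, which survives across invocations on a warm container, to distinguish cold starts from reuse and to estimate how long a container has lived.

```python
# Minimal sketch (not IOpipe's agent): module-level state persists across
# invocations that land on the same warm container, so it can be used to
# detect cold starts and estimate container age.
import time

CONTAINER_BORN = time.time()   # set once, when the container first loads the module
INVOCATIONS = 0                # counts invocations served by this container

def handler(event, context):
    global INVOCATIONS
    cold_start = INVOCATIONS == 0
    INVOCATIONS += 1
    container_age_s = time.time() - CONTAINER_BORN
    # A monitoring agent would ship these as metrics; printing goes to CloudWatch Logs.
    print({
        "cold_start": cold_start,
        "invocations_on_this_container": INVOCATIONS,
        "container_age_seconds": round(container_age_s, 1),
        "request_id": context.aws_request_id,
    })
    return {"ok": True}
```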
We've seen things where functions would be invoked multiple times
consistently,
where every
Lambda function was invoking three or four
times
instead of once.
But these things have mostly
settled down
to a very significant degree
as the product has matured.
I mean, these were mostly issues
around launch, like initial
launch. And that makes a fair bit of sense. Are you able to talk at all about the infrastructure
that powers IOpipe? In other words, when there starts to be a Lambda issue, is that something
that impacts the performance of the monitoring system that watches Lambda? Yeah. So we are based on Lambda. So we actually consume reports from...
So a user's Lambda runs,
it sends data directly to a collector service that we run.
That puts data into Kinesis.
None of that touches Lambda up to that point.
So we're not dependent on Lambda
or any of Amazon's serverless products
for ingesting the data
and getting it into our account,
which is good because it does de-risk us
from some, if there were a failure in Lambda,
we wouldn't be affected by it at that point.
And at that point, it's in Kinesis.
So once it's in Kinesis,
even if there was a failure with any
of the services that we built internally on Lambda,
we could just
process that at a delay.
But the Kinesis
feeds into several Lambdas
that write things to
our databases and run our alerts
and run
various intelligence tasks
against them. So we use Lambda
very extensively internally.
Basically, I think that the collector service
is perhaps the
only service that's not on Lambda
for specific reasons:
we've chosen to
de-risk against certain
things, particularly against
the case where there would be a Lambda failure,
or for latency.
But when we deployed that service,
API Gateway did not have regional endpoints,
which it does do now, but at the time it didn't.
And I know it's something that we needed.
So one thing we have actually reconsidered
is whether we would eliminate that service,
because we could actually implement that service
without EC2, with API Gateway instead, without any Lambda, actually.
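The pipeline described above (collector, then Kinesis, then Lambda consumers) is a common event-streaming pattern on AWS. IOpipe's own code isn't shown here, but a generic sketch of the consuming side, a Lambda triggered by a Kinesis event source, looks roughly like this; store_report is a stand-in for the real database writes and alerting.

```python
# Generic sketch of a Kinesis-triggered Lambda consumer, in the spirit of the
# pipeline described above (collector -> Kinesis -> Lambda workers). This is
# not IOpipe's code; store_report() is a stand-in for whatever the real
# pipeline does with each record (write to a database, run alerting, etc.).
import base64
import json

def store_report(report):
    # Placeholder: persist the report, evaluate alerts, run analysis, ...
    print("processing report for function:", report.get("function_name"))

def handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        report = json.loads(payload)
        store_report(report)
    return {"records_processed": len(event["Records"])}
```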
Gotcha. I was wondering on some level if there was going to be a dark secret of, surprise, we
actually run this entire thing in a data center somewhere that's in the middle of nowhere because
we think this cloud thing's a fad. It's always interesting when you start scratching to see how things like this are built under the hood.
I actually had a conversation with somebody who suggested we do that, actually.
That was a legitimate proposal.
Was this person trying to sell you colo space by any chance?
I don't think they were, actually.
So as far as where you see today,
at least from my perspective,
Lambda started off as a curiosity and a bit of a toy.
Three years in, it's more than that.
I'm seeing it used for production-level workloads
in a number of different environments,
and we're seeing the platform itself
become a lot broader as well
in the context of being able to support new runtimes that weren't there at
launch, new versions, and, for example, assign more resources. I believe at the last re:Invent, the RAM
limit doubled. Where do you see the platform evolving into in the future? I mean, when it
becomes less of a toy, even than it is now, five years from now, what does that look like?
I mean, I wouldn't say it's a toy now.
I think you can build really amazing advanced applications on it.
And the limitations of Lambda, to me, are very freeing,
where it's enforcing some of the 12-factor design decisions.
12-factor was a guideline, and Lambda enforces that opinionated stack design, right? It forces you to build applications this way.
Things like the five-minute window kind of make you build applications a certain way, which is a good thing.
It does maybe restrict you from doing some sort of MapReduce kind of jobs, but
for most applications, I do think
these are very much not toy applications.
You can build
any kind of
microservice or HTTP service you're looking
to build; you can do it with API Gateway and Lambda.
I think there are some
limitations that are
kind of an issue that are actually not even restricted
just to Lambda.
Amazon's going to get there, but
they need to work on it.
So for instance, and this is something we're dealing with
right now: if you want to
expose
an API gateway service,
well, so
we had a service that was based on Elastic Beanstalk,
our collector, and we were exposing that collector over a VPN.
You cannot use CloudFront, nor can you use ELBs or ALBs, for that when you're doing it over a VPN.
So Amazon just announced API Gateway over a VPN, or VPC, I'm sorry, VPC.
And again, now it's like, okay, great.
Now, actually, this works.
Now we can point to, we need to have API Gateway to ALB, but how do we do TLS termination,
right?
And these are problems that I really wish Amazon would solve.
So I guess what I'm saying is some of the services around,
I wish they did a little better around those.
Kinesis Video Streams, for instance, doesn't integrate with Lambda.
So there are places where I just wish Lambda was,
or I wish that they did a thing that they just don't do yet.
And they're getting there, they're working on these things,
but sometimes living
on a cutting edge, you definitely run into
some of these services
that aren't Lambda that have
limitations that I wish they didn't.
If you had a magic wand, what
would you change about Lambda?
I think this is maybe a selfish
answer because I work on this observability platform.
But there's this thing that was actually in Azure Functions
that was pretty neat, this idea that you run your function
and then you can define, basically, handlers
for the output of that function
as well as different pipes out of it.
So you could basically have your function run,
return some value,
and not just return data back to the caller,
but have that output basically teed off, piped off,
forked to other receivers directly.
So a thing that you basically have to use
Step Functions for in Lambda, having
a Lambda execution itself
be an event trigger for another Lambda directly, for instance,
would be really, really neat.
Whenever this Lambda is invoked,
take the output of it and run another Lambda function
or put the output of it in a Kinesis stream.
Like that's a really, I think a neat thing
that would actually enable me to do some things
that I can't do today.
And that Azure actually kind of did do out of the box.
And there are some things they did out of the box that I don't like,
and things they didn't do out of the box that I wish they did do, over at Azure.
But that was like the one thing I was like, wow, that's really cool.
And I still kind of wish that Amazon had something like that.
Some sort of like queue or Kinesis stream or something
for the output of those functions.
That is, I mean, not ingesting CloudWatch data,
because you could do that, like you do with the CloudWatch stream,
but something that's a little bit more of an alternative pipeline
for data out of it.
It's kind of hard to explain.
It's kind of ambiguous.
It's maybe something to just explore.
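Nothing native did this at the time, so the closest do-it-yourself approximation is to wrap the handler and forward its return value to a stream explicitly. A rough sketch, assuming a Kinesis stream named function-output exists:

```python
# Rough approximation of the feature being described: manually "teeing" a
# function's return value into a Kinesis stream so other consumers can react
# to it. Assumes a stream named "function-output" exists; this is a workaround
# sketch, not a built-in Lambda capability.
import functools
import json
import boto3

kinesis = boto3.client("kinesis")
OUTPUT_STREAM = "function-output"  # assumed to exist

def tee_output(func):
    @functools.wraps(func)
    def wrapper(event, context):
        result = func(event, context)
        kinesis.put_record(
            StreamName=OUTPUT_STREAM,
            Data=json.dumps(result).encode("utf-8"),
            PartitionKey=context.aws_request_id,
        )
        return result  # the caller still gets the normal return value
    return wrapper

@tee_output
def handler(event, context):
    return {"status": "done", "input_keys": sorted(event.keys())}
```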
Very fair. Taking a bit of the opposite approach for a second,
as you take a look at how people are implementing Lambda in various environments,
what aspects of working with Lambda functions do you find that people either struggle to wrap
their heads around, misunderstand, or, I guess, fundamentally are having trouble with today?
Because none of this stuff is easy or intuitive the first time you see it, I can assure you.
I spent most of my time learning how this stuff works by getting it hilariously wrong.
I mean, so for me, I personally didn't have as much of a challenge here.
And I do see others having that challenge. And I think it's a way of thinking.
I think that a lot of people implementing microservices, implementing these next-generation
applications, these microservice applications, came to it with this monolithic mindset
and adapted to it. They weren't familiar with actor-based programming models.
They weren't familiar with things like Erlang or Haskell.
When I'm saying Erlang, I'm thinking OTP in particular.
A lot of developers aren't aware of message queues, right?
I mean, of course, many are, but that kind of distributed computing, distributed computing
problems, building applications at scale is a thing that a lot of developers don't have direct familiarity with.
They're just like, I'm going to build a Node app and build it stateless, and I'll throw it in EC2, and I'll throw more EC2 instances at it. The thing with Lambda that I think catches people by surprise is that Lambda scales
so easily and so readily
that its
massive scale can become an issue
if you don't plan for it, where
you can easily find yourself
with a thousand concurrent
invocations and a thousand active
containers and
overload your database.
You can just throw
so much more
at a database. You can throw so much more
at a service. You can get so much concurrency
and parallelization
accidentally with Lambda
that you run into bottlenecks
that you didn't run into before
because you just said, oh, well, an EC2
instance is fine. I'm just going to make
a vertical stack here, right?
I'm just going to make these giant vertical silos.
I'm just going to build them taller, right?
And instead, you now have a distributed systems problem.
And a lot of developers just aren't familiar with them,
you just find yourself creating bottlenecks in things like databases that you just didn't expect.
If you're not, well, if you're new to it, if you don't know to expect that.
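One concrete guardrail for that accidental-concurrency problem is Lambda's reserved concurrency setting, which caps how many copies of a function run at once and therefore how hard they can hit a database behind them. A small sketch with boto3, using a hypothetical function name and an illustrative limit:

```python
# Sketch: cap a function's concurrency so it can't accidentally open more
# database connections than the database can handle. "orders-api" is a
# hypothetical function name; 50 is an illustrative limit, not a recommendation.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.put_function_concurrency(
    FunctionName="orders-api",
    ReservedConcurrentExecutions=50,
)

# Verify what's currently configured.
config = lambda_client.get_function_concurrency(FunctionName="orders-api")
print(config.get("ReservedConcurrentExecutions"))
```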
Scale brings up an interesting question. The entire premise of any sort of cloud computing
environment is that's the beautiful part.
You can scale infinitely, which is absolutely awesome until you actually try to do it.
Come to find out there are theoretical upper limits.
You cannot provision two million containers at the same time and expect something not to fall over. Do you see indications that there are capacity limits around Lambda that are at a point where it starts to affect individual consumers?
Or does the shared nature of the platform make that very hard even to determine from the outside?
I would say it's probably hard to determine from the outside.
I wouldn't even say Lambda is shared. I would say that there's an implementation detail
of Lambda that is not a...
Amazon does not
guarantee this, but it is
an implementation detail that
you basically get your own
virtual machines to run your containers
on. Amazon's
managing a fleet of EC2 instances
just for you
for your Lambdas, as an implementation detail.
That, again, is not a guarantee from them,
but that's just how they've chosen to implement it.
And so I think that the limitations of Lambda
are probably closer to that of EC2.
In reality, things where there are limitations
are like the 75 gigabyte limit for all
function code per
account, which some users
have run into. Oh, I've run into that
on a single function for myself because I
write really inefficient nonsense.
So, you
can't actually do that. I think per function
you have a limit of like 500
megs compressed, I think.
So you basically need to
like divide 75 gigabytes by 500 megabytes. Yeah, I think it was something like 75 megabytes
compressed, which in all seriousness, snark and witticism aside, I did brush into with some of my
early functions as I started trying to install everything into a monolithic function of pip
dependencies over in Python land. It turns out
that's a terrible anti-pattern, and I
should never do that. Hey,
putting a monolith into a
serverless function does
not mean you're
suddenly living in the future.
You do have to break these things out
architecturally, as it turns out.
I mean, I don't know that you really
need to do that.
I think there are actually some valid use cases for, say, running WordPress inside of Lambda.
And I think that can be fine.
Cloud Custodian is an example of an app that is kind of a very large, open-source monolithic application.
I think it's like 40,000 lines of code.
It's very big, but it's kind of fine.
There's some advantages to it.
Every Alexa skill is a monolith,
for better or worse.
And it's just by design, you have
to build it that way.
People are going to do it.
I think tools like IOpipe do actually help
with that, but I think we got off
your actual question.
Which is
absolutely fine. Is there
anything else you'd like
to mention that you have coming up or talk
about that would be relevant
or interesting? Or where can people find you?
Well,
I'm going to be speaking, I'm going to be
keynoting for Serverless Days London
I guess next month.
That's next month already.
I will be
speaking at Velocity London.
So that handles all of our London people.
I have a bunch of other conferences
that I'll be at, so many that I can't
remember where they are and what they are.
But I think I can say that
I'll be speaking at re:Invent. I think that's
happening. So you can find me there.
You can find me on Twitter, twitter.com
slash ewindish. And
IOpipe, so we have a community Slack
and you can find our website and you can
reach out to us as well. So yeah,
I've also been doing Twitch streaming,
twitch.tv slash ewindish.
I've not been active in
the last few weeks, but I'll probably get back to streaming soon.
Perfect. Thank you so much for taking
the time to speak with me today.
My name is Corey Quinn, and this is
Screaming in the Cloud.
This has been this week's episode
of Screaming in the Cloud.
You can also find more Corey at
screaminginthecloud.com or wherever
fine snark is sold.