Screaming in the Cloud - Stepping Onto the AWS Commerce Platform with James Greenfield
Episode Date: May 17, 2022
About James
James has been part of AWS for over 15 years. During that time he's led software engineering for Amazon EC2 and more recently leads the AWS Commerce Platform group that runs some of the largest systems in the world, handling volumes of data and request rates that would make your eyes water. And AWS customers trust us to be right all the time so there's no room for error.
Links Referenced:
Email: jamesg@amazon.com
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This episode is sponsored in part by our friends at Vultr, spelled V-U-L-T-R,
because they're all about helping save money, including on things like, you know, vowels.
So what they do is they are a cloud provider that provides surprisingly high
performance cloud compute at a price that, well, sure, they claim it is better than AWS's pricing.
And when they say that, they mean that it's less money. Sure, I don't dispute that. But what I find
interesting is that it's predictable. They tell you in advance on a monthly basis what it's going
to cost. They have a bunch of advanced networking features. They have 19 global locations and scale things elastically,
not to be confused with openly, which is apparently elastic and open. They can mean the same thing
sometimes. They have had over a million users. Deployments take less than 60 seconds across
12 pre-selected operating systems,
or if you're one of those nutters like me,
you can bring your own ISO
and install basically any operating system you want.
Starting with pricing as low as $2.50 a month
for Vultr Cloud Compute,
they have plans for developers and businesses of all sizes,
except maybe Amazon,
who stubbornly insists on having something of the scale on their
own. Try Vultr today for free by visiting vultr.com slash screaming, and you'll receive
$100 in credit. That's v-u-l-t-r dot com slash screaming.
Finding skilled DevOps engineers is a pain in the neck, and if you need to deploy a secure and compliant application to AWS without such things, forget about it.
But that's where DuploCloud can help.
Their comprehensive no-code-slash-low-code software platform guarantees a secure and compliant infrastructure in as little as two weeks while automating the full DevSecOps lifecycle.
Get started with DevOps as a service from DuploCloud, and your cloud configurations will be done
right the first time. Tell them I sent you, and your first two months are free. To learn more,
visit snark.cloud slash duplocloud. That's snark.cloud slash d-u-p-l-o-c-l-o-u-d.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
I've been angling to get someone from a particular department at AWS on this show for nearly
its entire run.
If you were to find yourself in an Amazon building and
wander through the various dungeons and boiler rooms and subterranean basements, I presume,
I haven't seen nearly as many of the inside of those buildings as people might think,
you pass interesting departments labeled things like spline reticulation or whatnot,
and then you come to a very particular group called
Commerce Platform. Now, I'm not generally one to tell other people's stories for them.
My guest today is James Greenfield, the VP of Commerce Platform at AWS. James,
thank you for joining me and suffering the slings and arrows I will no doubt be hurling at you.
Thanks for having me. I'm looking forward to it.
So let's start at the very beginning, because I guarantee you, you're going to do a better job of
giving the chapter and verse answer than I would from a background mired deeply in snark.
What is Commerce Platform? It sounds almost like it's the retail
website that sells socks, books, and underpants.
So Commerce Platform actually spans a bunch of different things. And so I'm going to try not to
bore you with a laundry list of all of the things that we do. It's a much longer list than most
people assume, even internal to AWS. At its core, Commerce Platform owns all of the infrastructure
and processes and software that takes the fact that you've been running an
EC2 instance or you're storing an object in S3 for some period of time and turns it into a number
at the end of the month that is what you owe for that service and then proceeds to try to give you
as many ways to pay us as easily as possible. There are a few other bits in there that are
maybe less obvious.
One is we're also responsible for protecting the platform and our customers from fraudulent
activity. And then we're also responsible for helping collect all of the data that we need
for internal reporting to support some of the backend services that a business needs to do
things like revenue recognition and general financial reporting.
One of the interesting aspects about the billing system is just how deeply it permeates everything that happens within AWS. I frequently say that when it comes to cloud, cost and architecture
are foundationally and fundamentally the same exact thing. If your entire service goes down,
a few interesting things happen.
One, I don't believe a single customer is going to complain other than maybe a few accountants
here and there because the books aren't reconciling. But also, you remove a whole
bunch of constraints around why things are the way that they are. What is the most efficient
way to run this workload? Well, if all the computers suddenly become free, I don't really
care about efficiency so much as, oh, hey, there's
a fly. What do I have as a fly swatter? That's right. I'm going to drop a building on it. And
those constraints breed almost everything. I've said, for example, that S3 has infinite storage
because it does. They can add drives faster than we're able to fill them, at least historically,
and they added some more replication services, but they're going to be able to buy hard drives faster than the rest of us are going to be able to stretch our budgets.
If that constraint of the budget falls away, all bets are really off. And more or less,
we're talking about the destruction of the cloud as a viable business entity. No pressure or
anything. You're also a recent transplant into AWS billing as a whole, Commerce Platform in general. You spent
15 years at the company, the vast majority of that over in EC2. So either you've been
exiled to basically a digital Siberia, or it was one of those, okay, keeping all the EC2 servers up,
this is easy, I don't see what people stress about, and they say, oh ho ho, try this instead.
How did you find yourself
migrating over to Commerce Platform?
That's actually one I've had a lot from folks that
I've worked with. You're right. I spent the first 15 or so years of my career at AWS in EC2,
responsible for various things over there. And when the leadership role in Commerce Platform
opened up, the timing was fortuitous, in part because I was in the process of relocating my family. We moved to Vancouver in the middle of last year, and we had
an opening in the role and started talking about potentially me stepping into that role.
The reason that I took it, there's a few reasons, but the primary reason is if I look back over my career, I've kind of naturally gravitated towards owning things
where people only really remember that they exist
when they're not working.
And for some reason, I enjoy the opportunity
to try to keep those kinds of services
ticking over to the point where people don't notice them.
And so Commerce Platform lands squarely in that space.
I've always been attracted to opportunities that have an impact.
And it's hard to imagine having much more of an impact than in the Commerce Platform
space.
It underpins everything.
As you said earlier, every single one of our customers depends on the service, whether
they think about it or realize it.
Every single service that we offer to customers depends on us.
And so it really is the sort of nexus within AWS.
And I'm a platform guy.
I've always been a platform guy.
I like the force multiplier nature of platforms.
And so Commerce Platform, as I kind of thought through all of those elements, really was
a great opportunity to step in.
And I think there's something to be said for, I've been a customer of Commerce Platform
internally for a long time.
And so a chance to cross over and be on the other side of that was something that I didn't
want to pass up.
And so, you know, I'm digging in, I'm learning quickly, ramping up, by no means an expert,
very dependent on a very smart, talented, committed group of
people within the team. That's kind of the long and short of how and why.
Let's say that I am taking on the role of an AWS product team for the sake of argument. I know,
keep the cringe down for a second as far as, oh God, the wince is just inevitable when the idea
of me working there ever comes up to anyone. But I have an idea for a service. Obviously,
it runs containers and maybe it does some
other things as well. Going from
idea to six-pager to
MVP to
barely better than MVP day one launch,
and at some point, various
things happen to that service. It gets staffed
with a team, objectives, and a roadmap
get built, a P&L, a budget,
and a pricing model, and the rest. One of the last
things that happens, apparently, is someone picks the worst name off of a list of candidates, slaps it on
the product, and ships it off there. At what point does the billing system and figuring out the
pricing dimensions for a given service tend to factor in? Is that a last-minute story? Is that
almost from the beginning? Where along that journey does, oh, by the way, we're building
this thing. Maybe
we should figure out, I don't know, how to make money from it, factor into the conversation.
There are two parts to that answer. Pretty early on as we're trying to define what that service is
going to look like, we're already typically thinking about what are the dimensions that
we might charge along. The actual pricing discussions typically happen fairly late, but
identifying those dimensions and sort of the right way to present it to customers happens pretty early on.
The thing that doesn't happen early enough is actually pulling the commerce platform team in.
But it is something that we're going to work this year to try to get a little bit more in front of.
Have you found historically that you have a pretty good idea of how a service is going to be priced, everything is mostly thought through, a service goes to either private preview or you're discussing about a launch, and then more or less someone like me crops up with a, hey, yeah, let's disregard 90% of what this service does because I see a way to misuse the remaining 10% of it as a database. And you run some mental math and realize, huh, we're suddenly giving like eight petabytes of
storage per customer away for free. Maybe we should guard against that because otherwise it's rife with
misuse. It used to be that I could find interesting ways to sneak through the cracks of various
services, usually in pursuit of a laugh. Those are getting relatively hard to come by and invariably
a lot more trouble than they're worth. Is that just better comprehensive diligence internally? Is that learning from customers?
Or am I just bad at this?
No, I mean, what you're describing is almost a variant of the
defender's dilemma. There are way more ways to abuse something than you can imagine. And so
defending against that is pretty challenging. And it's important because, you know, if you turn the
economics of something upside down, then it just becomes harder for us to offer it to customers who want to use it legitimately.
I would say 90% of that improvement is us learning. We make plenty of mistakes, but I think, you know,
one of the things that I've always been impressed by over my time here is how intentional we are
at trying to learn from those mistakes. And so I think that's what you're seeing there. And then
we try very hard to listen to customers, to talk to folks like you, because one of the best ways to tackle anything
that smells of the defender's dilemma is to harness the collective creativity of a large
number of smart people, because you really are trying to cover as much ground as possible.
There was a fun joke going around a while back of what is the most expensive environment you can get running on a free tier account before someone from AWS steps in.
And I think I got it to something like half a billion dollars in the first month.
Now, I haven't actually tested this for reasons that mostly have to do with being relatively poor compared to, you know, being able to buy Guam. And understanding as well that the
fraud protections built into something like AWS are largely built around defending against
getting service usage for free that in some way, shape, or form benefits the attacker.
The easy example of that would be mining cryptocurrency, which is just super economic as long as you use someone
else's AWS account to do it. Whereas a lot of my vectors are, yeah, ignore all of that. How do I
just make the bill artificially high? What can I do to misuse data transfer and passing a single
gigabyte through? How much can I make that per gigabyte cost be? And oh, circular replication,
and the lambda invokes itself pattern, and basically every
bad architectural decision you can possibly make, only this time it's intentional.
And that shines some really interesting light on it.
And I have to give credit where due.
A lot of that didn't come from just me sitting here being sick and twisted nearly so much
as it did, having seen examples of that type of misconfiguration, by mistake, in a variety
of customer accounts,
most commonly my own, because it turns out that the way I learn things is by screwing them up first.
Yeah, you've touched on a couple of different things in there. So maybe the first one is,
I typically try to draw a line between fraud and abuse. And fraud is essentially trying to spend
somebody else's money, get something for free. And we spend a lot of time trying to shut that down, and we're getting really good at catching it.
And then abuse is either intentional or unintentional.
There's intentional abuse.
You find a chink in our armor, and you try to take advantage of it.
But much more commonly is unintentional abuse.
It's not really abuse.
Abuse has very negative connotations.
But it's unintentionally setting something up so
that you run up a much larger bill than you intended. And we have a number of different
internal efforts and we're working on a bunch more this year to try to catch those early on,
because one of my personal goals is to minimize the frequency with which we surprise customers.
And the least favorite kind of surprise for customers is a large bill. And so what you're
talking about there is, in a sufficiently complex system, there's always going to be weaknesses and
ways to get yourself tied up in knots. We're trying, both at the service team level but also within my
teams, to find ways to make it as hard as possible to accidentally do that to yourself and
then catch when you do so that we can stop it. And even on the more
intentional abuse side of things, if somebody's found a way to do something that's problematic
for our services, then, you know, that's pretty much on us. But we will often reach out and engage
with whoever's doing that and try to understand what they're trying to do and why. Because often
somebody's trying to do something legitimate. They've got a problem to solve. They found a
creative way to solve it. And it may put strain on the service, because it's just not something
we designed for. And so we'll try to work with them to use that to feed into either new services
or find a better place for that workload, or just bolster what they're using. And maybe that's
something that eventually becomes a fully fledged feature that we offer to customers.
We're always open to learning from our customers.
They have found far more creative ways to get really cool things done with our services
than we've ever imagined, and that's true today.
Most of my service criticisms come down to the fact that you have more or less built
a very late-model, high-performing iPad, and I'm out there complaining about
what a shitty hammer this thing is. It barely works at all, and then it breaks in my hand. What gives? I would also
challenge something you said a minute ago, that the worst day for some customers is to get a giant
surprise bill. But the very short of that is, yeah, but on some level, that's kind of only money.
You do have levers on your side to fix those issues. A worse
scenario is you have a customer that exhibits fraud-like behavior. They're suddenly using far
more resources than they ever did before. So let's go ahead and turn them off or throttle them
significantly. And you call them up to tell them you saved them some money. And they say, our Super Bowl ad
ran, what exactly do you think you're doing? Because they don't get a second bite at that kind of apple. So there's peril on both sides of this. And those are just two examples. The world is full of
nuances. And at the scale that you folks operate at, the one in a million events happen multiple
times a second. The corner cases become common cases. And I'm surprised, to be direct, at how little I see you folks dropping the ball.
Credit to all of the teams. I think our secret sauce, if anything, really does come down to
our people. Like a huge amount of what you see as hopefully relatively consistent, good execution
comes down to people behind the scenes making sure, you know, like some
of it is software that we've built and made sure is robust and tested to scale, but there's always
an element of people behind the scenes when you hit those edge cases or something doesn't quite
go the way that you planned, making sure that things run smoothly. And that, if anything, is
something that I'm immensely proud of and is kind of amazing to watch from the inside.
On some level, it's the small errors that
are the bigger concern than the big ones. Back a couple of years ago when they announced gp3
volumes at re:Invent, well, great, we'll spin up a test volume and kick the tires on it for an hour.
I think it was 80 or 100 gigs or whatnot. And the next day in the bill, it showed up as about
$5,000. And it was, okay, that's not great, not great at all. It turned out it was a mispricing
error by, I think, a factor of a million. And okay, at least it stood out. But there are scenarios
where we were prepared to pay it because, oops, you got one over on us. Good job. That's never
been the mindset I've gotten about AWS's philosophy for pricing. The better example that I love,
because no one took it seriously, was a few years before that,
when there was a Lightsail bug in the billing system. And it made the papers because people
suddenly found that for their Lightsail instance, they were getting predicted bills of $4 billion.
And the way I see it, you really only had to make that work once, and then you've made your numbers
for the year. So why not? Someone's going to pay it probably. But that was such out-of-the-world numbers that no one saw that and ever thought it was anything other than a bug.
It's the small pernicious things that creep in because the billing system is vast. I had no
idea when I started working with AWS bills just how complicated it really was.
Yeah, I remember both of those. And there's something in there that you touched on
that I think is really important. That's something that I realized pretty early on at Amazon. And
it's why customer obsession is our flagship leadership principle. It's not because it's
love and butterflies and unicorns. Customer obsession is key to us because that's how you
build a long-term sustainable business
is your customers depend on you and it drives how we think about everything that we do.
And then in the billing space, small errors, even if there are small errors in the customer's
favor, slowly erode that trust.
So we take any kind of error really seriously and we try to figure out how we can make sure
that it doesn't happen again.
We don't always get that right. As you said, we've built an enormous, super complex
business. It's grown really quickly and really quick growth like that always acts as kind of a
multiplier on top of complexity. And on the pricing points, we're managing millions of pricing points
at the moment. And our tools that we use internally, there's always room
for improvement. It's a huge area of focus for us. We're in the beginning of looking at applying
things like formal methods to make sure that we can make very hard guarantees about the correctness
of some of those. But at the end of the day, people are plugging numbers in and you need
as many belts and braces as possible to make sure that you don't make mistakes there.
One of the things that struck me by surprise when I first started getting deep into this space was
the fact that the finalized bill was, what does it mean to have this be finalized? It can hit the
cost and usage report in an S3 bucket and it can change retroactively after the month closed
periodically. And that's when I started to have an inkling of a few things.
Not just the sheer scale and complexity inherent to something like the billing system
that touches everything,
but the sheer data retention stories
where you clearly have to be able to go back
and reconstruct a bill from the raw data years ago.
And I know what the output of all of those things are
in the form of cost and usage
reports and the billing data from our client accounts, which is the single largest expense
in all of our AWS accounts. We spend thousands and thousands and thousands of dollars a year
just on storing all of that data, let alone the processing piece of it. The sheer scale is
staggering. I used to wonder, why does it take you a day to record me
using something to it showing up in the bill? And the more I learned, the more it became a,
how can you do that in only a day?
Yes, scale is actually mind-boggling. I'm pretty sure
that the core of our billing system is, I'm reasonably confident it's the largest or one
of the largest data processing systems on the planet.
I remember pretty early on when I joined Commerce Platform and was sort of starting to wrap my head around some of these things, Googling the definition of quadrillion, because we measure the number of metering events, which is how we record usage in services on a daily basis in the quadrillions, which is a billion billions.
So it's just an absolutely staggering number.
And so the scale here is just out of this world.
That's saying something because it's not like
other services across AWS are small in their own right.
But I'm still reasonably sure that being one of a handful of services
that is kind of at the nexus of AWS
and kind of deals with the aggregate of AWS's scale, this is probably one of the biggest systems on the planet.
And that shows up in all sorts of places. You start with that input, just the sheer volume
of metering events, but that has to produce as an output, pretty fine grained line item detailed
information, which ultimately rolls up into the total that a customer will see in their bill. But we have a number of different systems further down the pipeline that try to do
things like analyze your usage, make sensible recommendations, look for opportunities to
improve your efficiency, give you the ability to slice and dice your data and allocate it out to
different parts of your business in whatever way makes sense for your business. And so those systems
have to deal with anywhere from millions to billions. Recently, we were talking about
trillions of data points themselves. And so I was tangentially aware of some of the scale of this,
but being in the thick of it, having joined the team really just does underscore just how
vast these systems are. I think it's on some level more than a little unfortunate
that that story isn't being more widely told more frequently. Because when Commerce Platform has
job postings that are available on the website, you read it and it's very vague. It doesn't tend
to give hard numbers about a lot of these things. And people who don't play in these waters could easily be forgiven for thinking the way
that you folks do your job is you fire up
one of those 24 terabyte of RAM instances,
those monstrous things that you folks offer,
and what do you do next?
Well, Microsoft Excel.
We have a special high memory version
that we've done some horse trading with our friends
over at Microsoft for.
Yeah, you're several steps
beyond that at this point. It's a challenging problem that every one of your customers has to
deal with on some level as well. But we're only dealing with the output of a lot of the processing
that you folks are doing first.
You're exactly right. And a big focus for some of my teams is
figuring out how to help customers deal with that output. Because even if you're talking about a couple of orders of magnitude reduction,
you're still talking about very large numbers there.
So to help customers make sense of that,
we have a range of tools that exist that we're investing in.
There's another dimension of complexity in the space
that I think is one that's also very easy to miss.
And I think of it as arbitrary complexity.
And it's arbitrary because some of the rules that we have to box within here are driven
by legislative changes.
As you operate in more and more countries around the world, you want to make sure that
we're tax compliant, that we help our customers be tax compliant.
Those rules evolve pretty rapidly.
And country A may sit next to country B, but that doesn't mean that they're talking
to one another.
They've all got their own ideas.
They're trying to accomplish work.
Our company is picking up and relocating
from India to Germany.
How do we change that on the AWS side and the rest?
And it's, whoo, boy,
have you considered burning it all down
and filing an insurance claim to start over?
And there's a lot of complexity buried underneath that
that just doesn't rise to the notice
of 99% of your customers.
And the fact that it doesn't rise to the notice is something that we strive for.
Like these shouldn't be things that customers have to worry about
because it really is about clearing away the things that,
as far as possible, you don't want to have to spend time thinking about
so that you can focus on the thing that your business does that differentiates you.
It's getting rid of that undifferentiated heavy lifting.
And there's a ton of that in this space. And if you're blissfully unaware of it,
then hopefully that means that we're doing our job.
What I'm, I think, the most surprised about, and I have been for a long time, and please don't take
this as an insult to various other folks, engineers, the rest, not just in other parts of AWS,
but throughout the industry. But talking to the people who work within Commerce Platform
has always been just a fantastic experience. The caliber of people that you have managed to
attract and largely retain, we don't own people, they do matriculate out eventually,
but the caliber of people that you've retained on your teams has just been out
of this world. And at first I wondered, why are these awesome people working on something as
boring and prosaic as billing? And then I started learning a little bit more as I went. Oh, wow.
How did they learn all the stuff that they have to hold in their head in tension at once to be
able to build things like this? It's incredibly inspiring just watching the caliber of the people that you've been able
to bring in.
I've been really, really excited joining this team as I've got to know the folks on the
team because there's some super smart people here.
But what's really jumped out to me is how committed the team is.
This is for the most part a team that has been in this space for many years.
Many of them have,
we talk about boomerangs, folks who leave AWS, go spend some time somewhere else and come back.
And there's a surprisingly high proportion of folks in commerce platform who have spent time somewhere else and then come back because they enjoy the space. They find it challenging.
Folks are attracted to the ability to have an impact because it is so foundational. But yeah, there's a super committed core to this team. And I really enjoy working with teams where
you've got that because then you really can take the long view and build something great. And I
think we have tons of opportunities to do that here.
This episode is sponsored in part by our
friends at EnterpriseDB. EnterpriseDB has been powering enterprise applications
with PostgreSQL for 15 years,
and now EnterpriseDB has you covered
wherever you deploy PostgreSQL,
on-premises, private cloud,
and they just announced a fully managed service
on AWS and Azure called BigAnimal.
All one word.
Don't leave managing your database to your cloud vendor
because they're too busy
launching another half dozen
managed databases
to focus on any one of them
that they didn't build themselves.
Instead, work with the experts
over at EnterpriseDB.
They can save you time and money.
They can even help you
migrate legacy applications,
including Oracle,
to the cloud.
To learn more,
try BigAnimal for free.
Go to biganimal.com slash snark and tell
them Corey sent you.
It sounds ridiculous, but I've reached out to team members before to explain
two cent variances in my bill. And never once have I been confronted with a, it's two cents,
what do you care? They understand the requirement that these things be accurate and not just, ah, take our word for it.
And also, frankly, they understand that two cents on a $20 bill looks a little different on a $20
million bill. So, you know, let's figure out if this is systemic or something I have managed to
break. It turns out the cost and usage report processing systems don't love it when there's
a cost allocation tag whose name contains an emoji.
Who knew? It's the little things in life that just have this fun way of breaking when you least
expect it.
There are also surprisingly interesting problems. So like it turns out something as simple
as rounding numbers consistently across a distributed system at this scale is a non-trivial
problem. And if you don't, then you do get small seventh or
eighth decimal place differences that add up to something that then shows up as a two cents
difference somewhere. And so there's some really, really interesting problems in this space.
And I think the team often takes these kinds of things as a personal challenge.
It should be correct and it's not. So we should go and make sure it is correct.
The interesting problems abound here, but at the end of the day, it's the kind of thing that
any engineering team wants to go and make sure is correct because they know that it can be.
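To make the rounding point concrete, here is a minimal sketch of how two components that round at different stages can disagree by two cents. It is not how AWS's billing system works; the line-item amounts and the function names `round_then_sum` and `sum_then_round` are invented for the illustration, and the only assumption is that charges are held as exact decimals and rounded to whole cents.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

CENT = Decimal("0.01")


def round_then_sum(charges, mode):
    """Round every line item to the cent first, then add them up."""
    return sum((c.quantize(CENT, rounding=mode) for c in charges), Decimal("0"))


def sum_then_round(charges, mode):
    """Add the exact line items first, then round the total once."""
    return sum(charges, Decimal("0")).quantize(CENT, rounding=mode)


# Hypothetical line items: four charges of half a cent each.
charges = [Decimal("0.005")] * 4

per_item_half_up = round_then_sum(charges, ROUND_HALF_UP)      # each item rounds up to 0.01
per_item_half_even = round_then_sum(charges, ROUND_HALF_EVEN)  # each item rounds down to 0.00
total_first = sum_then_round(charges, ROUND_HALF_UP)           # exact total 0.020 stays 0.02

print(per_item_half_up, per_item_half_even, total_first)
```

The same four half-cent charges come out as 0.04, 0.00, or 0.02 depending on when and how the rounding happens, which is the kind of gap that only closes if every system in the pipeline agrees on one rounding rule and one place to apply it.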
On the one hand, I love people who round and estimate. We all do that. Let's be clear. I sit
there and I back of the envelope everything first, but then I look at some of your pricing pages and
I count the digits after the zero.
It's like, you're talking about trillionths of a dollar
on some of your pricing points,
and you add it up in the course of a given hour.
It's like, oh, it's 250 a month most months.
And it's, you work backwards
to way more decimal places of precision
than is required sometimes.
I'm also a personal fan of the bill that counts,
for example, number of Route 53 zones.
Right, and it counts them to four decimal places of precision.
I don't even know what half of a Route 53 zone is at this point, let alone something like, ah, the thousandth of the zone is going to cause this.
It's all an artifact of what the underlying systems are.
Can you, by any chance, shed a little light on what the
evolution of those systems has been over a period of time? I have to imagine that anything you built
in the early days, 16 years ago or so from the time of this recording, when S3 launched general
availability, you probably didn't have to worry about this scope and scale of what you do now.
In fact, I suspect if you try to funnel this volume
through S3 back then,
the whole thing would have collapsed under its own weight.
What's evolved over the time
that you had the billing system there?
Because changes come slowly to your environment.
And frankly, I appreciate that as a customer.
I don't like surprising people in finance.
Yeah, you're totally right.
So I joined the EC2 team as an engineer myself
some 16 years ago. And the very first thing that I did was our billing integration. And so my
relationship with the Commerce Platform Organization, what was the billing team way back when,
it goes back over my entire career at AWS. And at the time, the billing team was similar,
you know, probably eight people.
That was everything.
There was none of the scale and complexity. It was all one system.
And much like many of our biggest, oldest services, EC2 is very similar, S3 is as well.
There's been significant growth over the last decade and a half.
A lot of that growth has been rapid and rapid growth presents its own challenges.
And you live with decisions that you make early on
that you didn't realize were significant decisions
that have pretty deep implications 15 years later.
We're still working through some of those.
They present their own challenges.
Evolving an existing system to keep up with the growth of business
and a customer base that's as varied and complex as ours
is always challenging and also harder,
but I also think more fun than a clean sheet redo at this point.
Like that's a great thought exercise for,
well, if we got to do this again today,
what would we do now that we've learned so much over the last 15 years?
But there is this, I find it personally,
fascinating challenge with evolving a live system
where it's like, no, no, like things exist.
So how do we go from there to where we want to be next?
Turn the billing system off for 18 months,
rebuild the whole thing from first principles, light it up. I'm sure you'd have a much better
billing system and also not a company left anymore. Exactly, exactly. I've always enjoyed
that challenge. You know, even prior to AWS, my previous careers have involved similar kinds of
constraints where you've got a live system or you've got an existing, in the one case, it was an existing SDK that was deployed to tens of thousands of customers around the world.
And so backwards compatibility was something that I spent the first five years of my career thinking about in way more detail than I think most people do.
And it's a very similar mindset.
And I enjoy that challenge.
I enjoy that.
How do I evolve from
here to there without breaking customers along the way? And that's something we take pretty
seriously across AWS. I think SimpleDB is the poster child for we never turn things off,
but that applies equally to the services that are maybe less visible to customers. And billing is
definitely one of them. We don't get to switch stuff off. We don't get to throw things away and start again.
It's this constant state of evolution.
So let's say that I were to find a way
to route data through a series of two managed NAT gateways
and then egress to internet.
And the sheer density of the expense of that traffic
tears a hole in the fabric of space-time.
You go back 15 years,
and you can make a single change to how the billing system was built. What would it be?
What pisses you off the most about the current constraints that you have to work within or around?
I think one of the biggest challenges we've got actually is the concept of an account,
because an account means half a dozen
different things. And way back when it seemed like a great idea, you just needed an account,
an account was your customer. And it was the same thing as the boundary that you put all your
resources inside. And of course, it's the same thing that you're going to roll all of your usage
up and issue a bill against. And that has been one of the areas that's seen the most
evolution and probably still has a pretty long way to go. And what's interesting about that is
that's probably something we could have seen coming because we watched the retail business
go through kind of the same evolution because they started with, well, a customer is a customer
is a customer and had to evolve to support the concept of sellers and partners. And then users are different to customers and you want to log in and that's a different
thing.
So we saw that kind of bifurcation of a single entity into a wide range of different related,
but separate entities.
And I think if we'd looked at that, you know, thought out 15 years, then yeah, we could
probably have learned something from that.
But at the same time, when AWS first kicked off,
we had wild ambitions for it,
but there was no guarantee that it was going to be the monster that it is today.
So I'm always a little bit reluctant to,
like it's a great thought exercise,
but it's easy to end up second guessing
a pretty successful 15 years.
So I'm always a little bit careful to walk that line.
But I think account is one of the things
that we would probably go back
and think about a little bit more. I want to be very clear with this next question,
that it is intentionally setting up a question I suspect you get a lot. It does not mirror my own
thinking on the matter even slightly. But I get a version of it myself all the time.
AWS bills, that sounds boring as hell.
Why would you choose to work on such a thing?
Now, I have a laundry list of answers to that
that aren't nearly as interesting
as I suspect yours are going to be.
What makes working on this problem space interesting to you?
There's a bunch of different things.
So first and foremost,
the scale that we're talking about here
is absolutely mind-blowing.
And for any engineer who wants to get stuck into problems that deal with mind-blowingly large volumes of data, incredibly rich dimensions, problems where honestly applying techniques like statistical reasoning or machine learning is really the only way to chip away at it, that exists in spades in this space.
It's not always immediately obvious. And I think from the outside, it's easy to assume this is
actually pretty simple. So the scale is a huge part of that. Oh, petabytes, how quaint.
Exactly, exactly. I mean, it's mind blowing every time I see some of the numbers in various parts of
the commerce platform space. I talked about quadrillions earlier. Trillions is a pretty common unit of measure.
The complexity that I talked about earlier,
that's a result of external environments is another one.
So imposed by external entities,
whether it's a government or a tax authority somewhere
or a business requirement from customers or ourselves.
I enjoy those as well.
Those are a different kind of challenge.
They really keep you on your toes.
I enjoy thinking of them as an engineering problem.
Like how do I get in front of them?
And that's something we spend a lot of time doing
in commerce platform.
And when we get it right,
customers are just unaware of it.
And then the third one is,
I personally am always attracted
to the opportunity to have an impact.
And this is a space where we get to
hopefully positively impact every single customer every day. And that to me is pretty fulfilling.
Those are kind of the three standout reasons why I think this is actually a super exciting space.
And I think it's often an underestimated space. I think once folks join the team and sort of start
to dig in, I've never heard anybody after they've joined tell me that what they're doing is boring.
Challenging, yes. Frustrating sometimes. Hard, absolutely. But boring never comes up.
There's almost no service other than IAM that I can think of that impacts every customer simultaneously.
And it's easy for me to sit in the cheap seats and say,
oh, you should change this or you should change that.
But every change you have is so massive in scale
that it's going to break a whole bunch of companies' automations
around the bill processing in different ways.
You have an entire category of user persona
who is used to clicking a certain button in a certain place in the console
to generate the report every month.
And if that button moves or changes color or has a different font, suddenly that renders their documentation invalid and they're scrambling because it's not their core competency, nor should it be.
And every change you make is so constricted just based upon all the different concerns that you've got to be juggling with.
How do you get anything done at all? I find that to be one of the most impressive aspects
about your organization, bar none. Yeah, I'm not going to lie and say that it isn't a challenge,
but a lot of it comes down to the talent that we have on the team. We have a super motivated,
super smart, super engaged team. And we spend a lot of time figuring out how to make sure that
we can keep moving, keep up with the business, keep up with a world that's getting more complicated
with every passing day. So you've kind of hit on one of the core challenges there, which is
how do we keep up with all of those different dimensions that are demanding an increasing
amount of engineering and new support and new investment from us
while we keep those customers happy.
And I think you touched on something else
a little bit indirectly there,
which is a lot of our customers
are actually pretty technical across AWS.
The customers that Commerce Platform supports
are often the least technical of our customers.
And so often need the most help
understanding why things are the way they are, where the constraints are.
A big bill from Amazon. How many books did you people buy last month?
It's still very much a level of understanding in some cases, and it's not because they're dumb.
Far from it. It's just, imagine that. Some people view there as being more to life than
understanding the nuances and intricacies of cloud computing. How dare they? Exactly. Who would have thought?
So as you look now over all of your domain, such as it is, what sucks the most? What are you
looking to fix as far as impactful changes that the rest of the world might experience? Because
I'm not going to accept one of those questions like, oh yeah, on the backend, we have this
storage subsystem for a tertiary thing
that just annoys me because it wakes us up once in a while.
No, no, I want something customer facing.
What's the painful thing you're looking at fixing next?
I don't like surprising customers.
And free tier is sort of one of those buckets of surprises,
but there are others.
Another one that's pretty squarely in my sights
is, whether we like it or not, customer accounts get compromised.
Usually it's a password got
reused somewhere or it was accidentally committed into a GitHub repository somewhere. And we have
pretty established, pretty effective mechanisms for finding all of those. We'll scan for passwords
and credentials and alert customers to those and help them correct
that pretty quickly. We're also actually pretty good at detecting when an account does start to
do something that suggests that it's been compromised. Usually the first thing that a
compromised account starts to do is cryptocurrency mining. We're pretty quick to catch those. We
catch those within a matter of hours, much faster most days. What we haven't really
cracked and where I'm focused at the moment is getting back to the customer in a way that's
effective. And by that, I mean, specifically, we detect account compromise super quickly.
We reach out automatically. And so, you know, customers got some kind of contact from us,
usually within a couple of hours. It's not having the effect that we need it to. Customers are still being surprised a month later by a large bill. And so we're digging
into how much of that is because they never saw the contact. They didn't know what to do with the
contact. It got buried with all the other, hey, we saw you spun up an S3 bucket. Have you heard
of what S3 is? Again, that's all valuable, but you have 300 some odd services. If you start doing
that for every service, you're going to hit mail sending limits for Gmail. Exactly. It's not just enough that we
detect those and notify customers. We have to reduce the size of the surprise. It's one thing
to spend a hundred bucks a month on average, and then suddenly find that your spend has jumped to
$150 because you've reused the password somewhere
and somebody got hold of it
and is mining cryptocurrency in your account.
It's a whole different ballgame to spend a hundred bucks a month
and then at the end of the month,
discover that your bill is suddenly $2,000 or $20,000.
And so that's something that I really wanted
to make some progress on this year.
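The difference in "size of the surprise" described above can be put in rough numbers. The following is only an illustrative sketch, not AWS's detection or notification logic; `surprise_ratio`, `needs_urgent_contact`, and the threshold of 5x are hypothetical.

```python
from statistics import mean

def surprise_ratio(previous_months, current_month):
    """Return the current month's spend as a multiple of the trailing average."""
    baseline = mean(previous_months)
    if baseline == 0:
        # No spending history: any spend at all is an unbounded surprise.
        return float("inf") if current_month > 0 else 1.0
    return current_month / baseline

def needs_urgent_contact(previous_months, current_month, threshold=5.0):
    """Flag spend that is at least `threshold` times the usual amount (assumed cutoff)."""
    return surprise_ratio(previous_months, current_month) >= threshold

# Hypothetical customer who normally spends about $100 a month.
history = [100, 100, 100]
for bill in (150, 2000, 20000):
    print(bill, surprise_ratio(history, bill), needs_urgent_contact(history, bill))
```

On these made-up figures the $150 month is 1.5 times the baseline and stays under the cutoff, while the $2,000 and $20,000 months are 20 and 200 times the baseline and get flagged.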
I've really enjoyed our conversation.
If people want to learn more about how you view these things,
how you're approaching some of these problems,
or potentially are just the right kind of warped to consider joining up,
where's the best place for them to go?
They should drop me an email at jamesg@amazon.com.
That is the most direct way to get hold of me.
And I promise I will get back to
you. I try to stay on top of my email as much as possible, but that will come straight to me. And
I'm always happy to talk to folks about the space, talk to folks about opportunities in this team,
opportunities across AWS, or just hear what's not working and make sure that it's something
that we're aware of and looking at. Throughout Amazon, but particularly within commerce platform, I've always appreciated
the response of whenever I report something, no matter how ridiculous it is, and I assure
you there's an awful lot of ridiculous in my bug reports, the response has always been
the same.
Tell me more.
Help me understand what it is you're trying to achieve, even if it is ridiculous, so we
can look at this and see what is actually going on.
Every Amazonian team has been great about that, or you're not at Amazon very long,
but you folks have taken that to an otherworldly level. I just want to thank you for doing that.
I appreciate you for calling that out. We try. We really do. We take listening to our customers
very seriously because at the end of the day, that's what makes us better. And that's how we
make sure we're in it for the long haul.
Thanks once again for being so generous with your time. I really appreciate it.
Yeah, thanks for having me on. I've enjoyed it.
James Greenfield, VP of Commerce Platform at AWS. I'm cloud economist Corey Quinn,
and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry
comment, possibly on YouTube as well, about how you aren't actually giving this five stars at all.
You have taken three trillionths of a star off of the rating.
If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business and we get to the point.
Visit duckbillgroup.com to get started.
This has been a HumblePod production.
Stay humble.