Screaming in the Cloud - Feature Flags & Dynamic Configuration Through AWS AppConfig with Steve Rice
Episode Date: October 11, 2022
About Steve: Steve Rice is Principal Product Manager for AWS AppConfig. He is surprisingly passionate about feature flags and continuous configuration. He lives in the Washington DC area with his wife, 3 kids, and 2 incontinent dogs.
Links Referenced: AWS AppConfig: https://go.aws/awsappconfig
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This episode is sponsored in part by our friends at AWS AppConfig.
Engineers love to solve and occasionally create problems,
but not when it's an on-call fire drill at four in the morning.
Software problems should drive innovation and collaboration, not stress and sleeplessness
and threats of violence.
That's why so many developers are realizing the value of AWS AppConfig feature flags.
Feature flags let developers push code to production, but hide that feature from customers
so that the developers can release
their feature when it's ready. This practice allows for safe, fast, and convenient software
development. You can seamlessly incorporate AppConfig feature flags into your AWS or cloud
environment and ship your features with excitement, not trepidation and fear. To get started, go to snark.cloud slash appconfig.
That's snark.cloud slash appconfig. Forget everything you know about SSH and try Tailscale.
Imagine if you didn't need to manage PKI or rotate SSH keys every time someone leaves.
That'd be pretty sweet, wouldn't it?
With Tailscale SSH, you can do exactly that. Tailscale gives each server and user device a
node key to connect to its VPN, and it uses the same node key to authorize and authenticate SSH.
Basically, you're SSHing the same way you manage access to your app. What's the benefit here? Built-in key rotation,
permissions as code, connectivity between any two devices, reduced latency, and there's a lot more,
but there's a time limit here. You can also ask users to re-authenticate for that extra bit of
security. Sounds expensive? Nope, I wish it were. Tailscale is completely free for personal use on up to 20 devices.
To learn more, visit snark.cloud slash tailscale.
Again, that's snark.cloud slash tailscale.
Welcome to Screaming in the Cloud. I'm Corey Quinn.
This is a promoted guest episode. What does that mean?
Well, it means that some people don't just want me to sit here and throw slings and arrows their way.
They would prefer to send me a guest specifically.
And they do pay for that privilege, which I appreciate.
Paying me is absolutely a behavior I wish to endorse.
Today's victim who has decided to contribute to slash sponsor my ongoing ridiculous nonsense is, of all companies, AWS.
And today I'm talking to Steve Rice, who's the Principal Product Manager on AWS AppConfig.
Steve, thank you for joining me.
Hey, Corey, great to see you. Thanks for having me. Looking forward to our conversation.
As am I. Now, AppConfig does something super interesting, which I'm not aware of any other service or subservice doing.
You are under the umbrella of AWS Systems Manager, but you're not going to market with Systems Manager AppConfig.
You are just AWS AppConfig. Why?
So AppConfig is part of AWS Systems Manager.
Systems Manager has, I think, 17 different features associated with it.
Some of them have an individual name that is associated with Systems Manager.
Some of them don't.
We just happen to be one that doesn't.
AppConfig is a service that's been around for a while internally before it was launched externally a couple years ago.
So I'd say that's probably the origin of the name and the service.
I can tell you more about the origin of the service if you're curious.
Oh, I absolutely am. But I just want to take a bit of a detour here and point out that
I make fun of the subservice names in Systems Manager an awful lot, like Systems Manager,
Session Manager, and Systems Manager, Change Manager. And part of the reason I do that is
not just because it's funny, but because almost everything I've found so far within the Systems Manager umbrella is pretty awesome.
It aligns with how I tend to think about the world in a bunch of different ways.
I have yet to see anything lurking within the Systems Manager umbrella that has led to a "hehehe" bill-surprise level that rivals, you know, the GDP of Guam.
So I'm a big fan of the entire suite of services,
but yes, how did AppConfig get its name?
So AppConfig started about six years ago now, internally.
So we actually were part of the region services department
inside of Amazon,
which is in charge of launching new services
around the world.
We found that a
centralized tool for configuration associated with each service launching was really helpful.
So a service might be launching in a new region and have to enable and disable things as it moved
along. And so the tool is sort of built for that, turning on and off things as the region developed
and was ready to launch publicly. Then the regions launched publicly. It turned out that our internal customers, which are a lot of AWS services and then
some Amazon services as well, started to use us beyond launching new regions and started to use
us for feature flagging. Again, turning on and off capabilities, launching things safely. And so it
became massively popular. We're actually a top 30 service internally in terms of
usage. And two years ago, we thought we really should launch this externally and let our customers
benefit from some of the goodness that we put in there. And some of those all come from the
mistakes we've made internally. And so it became AppConfig. In terms of the name itself, we
specialize in application configuration. So that's kind of a mouthful. So we just changed it to AppConfig.
Earlier this year, there was a vulnerability reported around,
I believe it was AWS Glue, but please don't quote me on that. And as part of its excellent response
that AWS put out, they said that from the time that it was disclosed to them, they had patched the service and rolled it out to every AWS region
in which Glue existed in a little under 29 hours, which at scale is absolutely magic fast. That is
superhero speed and then some, because you generally don't just throw something over the
wall, regardless of how small it is, when we're talking about something at the scale of AWS.
I mean, look at who your customers are. Mistakes will show. This also got me thinking that when
you have Adam or previously Andy on stage giving a keynote announcement, and then they mention
something on stage like, congratulations, it's now a very complicated service with 14 adjectives in
its name because someone's paid by the syllable. Great. Suddenly, the marketing pages are up, the APIs are working, it's showing up in the console.
And it occurs to me only somewhat recently to think about all of the moving parts that go on
behind this. That is far faster than even the improved speed of CloudFront distribution updates.
There is very clearly something going on there. So I've got to ask, is that you? Yes, a lot of that is us. I can't take credit for 100% of what you're talking about,
but that's how we are used. We're essentially used as a feature flagging service. And I can
talk generically about feature flagging. Feature flagging allows you to push code out to production,
but it's hidden behind a configuration switch, a feature toggle or a
feature flag. And the code can be sitting out there. Nobody can access it until somebody flips
that toggle. Now, the smart way to do it is to flip that toggle on for a small set of users.
Maybe it's just internal users. Maybe it's 1% of your users. And so the feature is available.
It's your best slash worst customers in that 1% in some cases.
Yeah. You want to stress test the system with them.
And you want to be able to look and see what's going to break before it breaks for everybody.
So you release this to a small cohort.
You measure your operations.
You measure your application health.
You measure your reputational concerns.
And then if everything goes well, then you maybe bump it up to 2%, and then 10%,
then 20%. So feature flags allow you to slowly release features, and you know what you're
releasing by the time it's at 100%. It's tempting for teams to want to have everybody access it at
the same time. You've been working hard on this feature for a long time. But again, that's kind
of an anti-pattern. You want to make sure that on production, it behaves the way you expect it to behave. I have to ask, what is the fundamental difference between feature flags and or dynamic
configuration? Because to my mind, one of them is a means of achieving the other, but I could also
see very easily using the terms interchangeably, given that in some of our conversations, you have
corrected me, which first, how dare you?
Secondly, okay, there's probably a reason here. What is that point of distinction?
Yeah. Typically, for those that are not eating, sleeping, and breathing dynamic configuration,
which I do, and most people are not obsessed with this kind of thing, feature flags is kind
of a shorthand for dynamic configuration. It allows you to turn on and off things without
pushing out any new code.
So your application code's running, it's pulling its configuration data, say every five seconds,
every 10 seconds, something like that. And when that configuration data changes, then the app
changes its behavior, again, without a code push or without restarting the app. So dynamic
configuration is maybe a superset of feature flags. Typically, when people think feature flags, they're thinking of, oh, I'm going to release
a new feature.
So it's almost like an on-off switch.
But we see customers using feature flags, and we use this internally, for things like
throttling limits.
Let's say you want to be able to throttle TPS, transactions per second.
Or let's say you want to throttle the number of simultaneous background tasks and say,
you know, I just really don't want this creeping above 50. Bad things can start to happen. But in a period of stress, you might want to
actually bring that number down. Well, you can push out these changes with dynamic configuration,
which is, again, any type of configuration, not just an on-off switch. You can push this out
and adjust the behavior and see what happens. Again, I'd recommend pushing it out to 1% of
your users, then 10%, but allows you to have these dials and switches to do that. And again,
generically, that's dynamic configuration. It's not as fun a term as feature flags. Feature flags
is sort of a good mental picture. So I do use them interchangeably, but if you're really into
the whole world of this dynamic configuration, then you probably will care about the difference.
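The 1%-then-10%-then-100% rollout described here is often implemented with deterministic bucketing, so each user stays consistently in or out of the cohort as the dial goes up. A minimal sketch in Python, with illustrative function and flag names (this is not AppConfig's API):

```python
import hashlib

# Hedged sketch of a gradual percentage rollout: hash each user id into a
# stable bucket from 0 to 99 and compare it to the flag's rollout percentage.

def in_rollout(user_id: str, flag_name: str, percent: int) -> bool:
    """True if this user falls inside the flag's current rollout percentage."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100   # deterministic bucket in [0, 100)
    return bucket < percent

# Turning the dial from 1 to 10 to 100 only ever adds users to the cohort;
# nobody flips back and forth between code paths as the rollout widens.
```

Salting the hash with the flag name means different flags carve out different cohorts, so the same 1% of users isn't the guinea pig for every launch.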
Which makes a fair bit of sense. It's the question of what are you talking about at a high level versus
what are you talking about at an implementation-detail level. And on some level, I used to get, well,
we'll call it angsty because I can't think of a better adjective right now about how AWS was
reluctant to disclose implementation details behind what it did. And in the fullness of time, it's made a lot more sense to me,
specifically through a lens of you want to be able to have the freedom
to change how something works under the hood.
And if you've made no particular guarantee about the implementation detail,
you can do that without potentially worrying about
breaking a whole bunch of customer expectations that you've inadvertently set.
And that makes an awful lot of sense. The idea of rolling out changes to your infrastructure
has evolved over the last decade. Once upon a time, you'd have EC2 instances. And great,
you want to go ahead and make a change there, or this actually predates EC2 instances,
virtual machines in a data center, or heaven forbid, bare metal servers, you're not going to deploy a whole new server
because there's a new version of the code out.
So you separate out your infrastructure
from the code that it runs.
And that worked out well.
And increasingly, we started to see ways of,
okay, if we want to change the behavior of the application,
we'll just push out new environment variables
to that thing and restart the service
so it winds up consuming those.
And that's great. You've rolled it out
throughout your fleet. With containers,
which is sort of the next logical step,
well, okay, this stuff gets baked in.
We'll just restart containers with a new version of
code because that takes less than a second each
and you're fine. And then
with Lambda functions, it's okay, we'll just change
the deployment option and the next invocation
will wind up taking the brand new environment
variables passed out to it. How do feature flags feature into those, I guess, three
evolving methods of running applications in anger, by which I mean, of course, production?
Good question. And I think you really articulated that well.
Well, thank you. I should hope so. I'm a storyteller. At least I fancy myself one.
Yes, you are. Really, what you talked about
is the evolution of, you know, at the beginning, people were, well, first of all, people probably
were embedding their variables deep in their code. And then they realized, oh, I want to change this
and I have to find where in my code that is. And so it became a pattern. Why don't we separate
everything that's configuration data into its own file, but it'll get compiled at build time and
sent out all at once. There was kind of this breakthrough that was, why don't we actually separate out the deployment
of this?
We can separate the deployment from code from the deployment of configuration data and have
the code be reading that configuration data on a regular interval, as I said.
So now as the environments have changed, like you said, containers and Lambda,
that ability to make tweaks at microsecond intervals is more important and more powerful.
So there certainly is still value in having things like environment variables that get read at
startup. We call that static configuration as opposed to dynamic configuration. And that's
a very important element in the world of containers that you talked about.
Containers are a bit ephemeral, and so they kind of come and go, and you can restart things,
or you might spin up new containers with a slightly different config and have them operate in a certain way. And again, Lambda takes that to the next level. I'm really excited where people
are going to take feature flags to the next level, because already today we have people
just fine-tuning to very targeted small subsets, different configuration data, different feature flag data.
And it allows them to do this at a scale we've never seen before: turning this on, seeing how it reacts, seeing how the application behaves, and then being able to roll that out to all of your audience.
Now, you've got to be careful.
You really don't want to have completely different configurations out there and have 10 different or 100 different configurations out there.
That makes it really tough to debug. So you want to think of this as, I want to roll this out
gradually over time, but eventually you want to have this sort of state where everything is
somewhat consistent. That on some level speaks to a level of operational maturity that my current deployment adventures generally don't have. A common reference I make is to my last tweet in AWS.com Twitter threading app. And anyone can visit it, use it however they want. It uses a Route 53 latency record to figure out, ah, which is the closest region to you because I've deployed it to 20 different regions. Now, if this were a paid service or I had people using this in large volume and I had to worry
about that sort of thing, I would probably approach something that is very close to what
you describe. In practice, I pick a devoted region that I deploy something to and cool,
that's sort of my canary where I get things working the way I would expect. And when
that works the way I want it to, I then just push it to everything else automatically.
Given that I've put significant effort into getting deployments down to approximately
two minutes to deploy to everything, that it feels like that's a reasonable amount of
time to push something out.
Whereas if I were, I don't know, running a bank, for example, I would probably have an
incredibly heavy process
around things that make changes to things like payment or whatnot, because despite the lies we
all like to tell both to ourselves and in public, anything that touches payments does go through
waterfall, not agile iterative development, because that mistake tends to show up on your
customer's credit card bills. And then they're so angry. I think that there's a certain point of maturity you need to be at as either an organization or
possibly as a software technology stack before something like feature flags even becomes
available to you. Would you agree with that? Or is this something everyone should use?
I would agree with that. Definitely a small team that has communication flowing between the two
probably won't get as much value out of a gradual release process because everybody kind of knows
what's going on inside of the team. Once your team scales or maybe your audience scales,
that's when it matters more. You really don't want to have something blow up with your users.
You really don't want to have people getting paged in the middle of the night because of a
change that was made. And so feature flags do help with that. So typically the journey
we see is people start off in a maybe very small startup. They're releasing features at a very fast
pace. They grow and they start to build their own feature flagging solution. Again, companies I've
been at previously have done that. And you start using feature flags and you see the power of it.
Oh my gosh, this is great.
I can release something when I want without doing a big code push.
I can just do a small little change.
And if something goes wrong, I can roll it back instantly.
That's really handy.
And so the basics of feature flagging might be a homegrown solution that you all have
built.
If you really lean into that and start to use it more, then you probably want to look
at a third-party solution because there's so many features out there that you might want. A lot of them are
around safeguards that make sure that releasing a new feature is safe. Again, pushing out a new
feature to everybody could be similar to pushing out untested code to production. You don't want
to do that. So you need to have some checks and balances in your release process of your feature
flags. And that's what
a lot of third parties do. It really depends, to get back to your question about who needs feature
flags, it depends on your audience size. If you have enough audience out there to want to do a
small rollout to a small set first and then have everybody hit it, that's great. Also, if you just
have one or two developers, then feature flags are probably
something that you're just kind of, you're doing yourself, you're pushing out this thing
anyway on your own, but you don't need to coordinate it across your team.
I think that there's also a bit of, how to frame this, a misunderstanding on someone's part
about where AppConfig starts and where it stops. When it was first announced, feature flags were
one of the things that it did.
And that was talked about on stage,
I believe in reInvent,
but please don't quote me on that,
when it wound up getting announced.
And then in the fullness of time,
there was another announcement of
AppConfig now supports feature flags,
which I'm sitting there,
and I had to go back to my old notes,
like, did I hallucinate this?
Which, again, would not be the first time
I'd imagined such a thing.
But no, it was originally how the service was described, but now it's extra feature flags.
Almost like someone would, I don't know, flip on a feature flag toggle for the service, and now it does a different thing.
What changed?
What was it that was misunderstood about the service initially versus what it became?
Yeah, I wouldn't say it was a misunderstanding.
I think what happened was we launched it guessing what our customers were going to use it as.
We had done plenty of research on that.
And as I mentioned before, we have-
Please tell me someone uses a database
or am I the only nutter that does stuff like that?
We have seen that before.
We have seen something like that before.
Excellent, excellent, excellent.
I approve.
And so we had done our due diligence ahead of time
about how we thought people were going to use it.
We were right about a lot of it.
I mentioned before that we have a lot of usage internally.
So that was kind of maybe cheating even for us to be able to sort of see how this is going
to evolve.
What we did announce, I guess it was last November, was an opinionated version of Feature
Flags.
So we had people using us for Feature Flags, but they were building their own structure,
their own JSON, and there was not a dedicated console experience for Feature Flags.
What we announced last November was an opinionated version that structured the JSON in a way that we
think is the right way. And that afforded us the ability to have a smooth console experience. If we
know what the structure of the JSON is, we can have things like toggles and validations in there that really specifically look at some of the data points. So that's really what
happened. We're just making it easier for our customers to use us for feature flags. We still
have some customers that are kind of building their own solution, but we're seeing a lot of
them move over to our opinionated version. This episode is brought to us in part by our
friends at Datadog. Datadog is a SaaS monitoring and security platform that enables full-stack observability for developers,
IT operations, security, and business teams in the cloud age.
Datadog's platform, along with 500-plus vendor integrations,
allows you to correlate metrics, traces, logs, and security signals across your applications,
infrastructure, and third-party services in a single pane of glass. Combine these with drag
and drop dashboards and machine learning-based alerts to help teams troubleshoot and collaborate
more effectively, prevent downtime, and enhance performance and reliability. Try Datadog in your
environment today with a free 14-day trial and get a complimentary t-shirt when you install the agent. To learn more, visit datadoghq.com
slash screaming in the cloud to get started. That's www.datadoghq.com slash screaming in the cloud.
Part of the problem I have when I look at what it is you folks do and your use cases and
how you structure it is it's similar in some respects to how folks perceive things like FIS,
the fault injection service, or chaos engineering as it's commonly known, which is we can't even get
the service to stay up on its own for any period of time. What do you mean? Now let's intentionally
degrade it and make it work. There needs to be a certain level of operational stability or operational maturity. When you're
still building a service before it's up and running, feature flags seem awfully premature
because there's no one depending on it. You can change configuration however your little heart
desires in most cases. I'm sure at certain points of scale of development teams, you have a
communications problem internally, but it's not aimed at me trying to get something working at 2 a.m. in
the middle of the night. Whereas by the time folks are ready for what you're doing, they clearly have
that level of operational maturity established. So I have to guess on some level that your typical adopter of AppConfig feature flags isn't in fact someone who
is, well, we're ready for feature flags, let's go, but rather someone who's come up with something
else as a stopgap as they've been iterating forward, usually something home built. And it
might very well be you have the exact same biggest competitor that I do in my consulting work,
which is, of course, Microsoft Excel, as people try to build their own thing that works their own way.
Yeah, so definitely a very common customer of ours is somebody that is using a homegrown solution
for turning on and off things. And they really feel like I'm using the heck out of these feature
flags. I'm using them on a daily or weekly basis.
I would like to have some enhancements to how my feature flags work, but I have limited resources
and I'm not sure that my resources should be building enhancements to a feature flagging
service. But instead, I'd rather have them focusing on something directly for our customers,
some of the core features of whatever your company does. And so that's when people sort
of look around externally and say, oh, let me see if there's some other third-party service or something
built into AWS like AWS AppConfig that can meet those needs. And so absolutely, the workflows get
more sophisticated, the ability to move forward faster becomes more important and do so in a safe way. I used to work at a
cybersecurity company and we would kind of joke that the security budget at a company is relatively
low until something bad happens and then it's, you know, whatever you need to spend on it.
It's not quite the same with feature flags, but you do see when somebody has a problem on production
and they want to be able to turn something off right away or make an adjustment right away, then the ability to do that in a measured way becomes incredibly important.
And so that's when, again, you'll see customers starting to feel like they're outgrowing their
homegrown solution and moving to something that's a third-party solution. Honestly, I feel like
so many tools exist in the space where, oh yeah, you should definitely
use this tool.
And most people will use that tool the second time.
Because the first time, it's one of those, how hard could that be?
I can build something like that in a weekend, which is sort of the rallying cry of doomed
engineers who are bad at scoping.
And by the time that they figure out why, they've backtracked significantly.
There's a whole bunch of stuff that I have built
that people look at and say,
wow, that's a really great design.
What inspired you to do that?
And the absolute honest answer to all of it is simply,
yeah, I did it wrong the first time.
I did it the way you would think I would do it.
And it didn't go well.
Experience is what you get
when you didn't get what you wanted.
And this is one of those areas
where it tends to manifest in reasonable ways.
Absolutely, absolutely.
So give me an example here, if you don't mind,
about how feature flags can improve
the day-to-day experience of an engineering team
or an engineer themselves.
Because we've been down this path enough
in some cases to know the failure
modes. But for folks who haven't been there, let's try and shave a little bit off of their journey of
"I'm going to learn from my own mistakes," and let them learn from someone else's instead. What are the benefits that
accrue and are felt immediately? Yeah. So we kind of have a policy that the very first commit of any
new feature ought to be the feature flag. That's that sort of on-off switch that you want to put there
so that you can start to deploy your code
and not have a long-lived branch in your source code,
but you can have your code there.
It reads whether that configuration's on or off.
You start with it off.
And so it really helps just while developing these things
about keeping your branches short.
And you can push to mainline
as long as the feature flag is off
and the feature's hidden in production, which is great.
So that helps with the mess of doing big code merges.
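The "first commit is the feature flag" pattern above can be sketched as a simple guard: the new code path merges to mainline and deploys dark, and the default preserves current behavior. Function and flag names here are hypothetical:

```python
# Hedged sketch: new code ships behind a flag that starts off, so mainline
# stays releasable and there's no long-lived feature branch to merge later.

def old_checkout(cart):
    return {"total": sum(cart), "engine": "old"}

def new_checkout(cart):
    return {"total": sum(cart), "engine": "new"}

def checkout(cart, flags):
    """`flags` stands in for whatever dynamic-config store the app polls."""
    if flags.get("new_checkout_enabled", False):
        return new_checkout(cart)      # merged, deployed, but dark
    return old_checkout(cart)          # current behavior, still the default

# With the flag off (or absent), production behavior is unchanged:
# checkout([1, 2], {}) -> {"total": 3, "engine": "old"}
```

Because the flag defaults to off, deploying this code is a no-op for users; the launch itself becomes a one-line config change rather than a code push.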
The other part is around the launch of a feature.
So you talked about Andy Jassy being on stage
to launch a new feature.
Sort of the old way of doing this, Corey,
was that you would need to look at your pipelines and see how long it might take for you to push out your code with any sort of code
change in it. And let's say that was an hour and a half process. And let's say your CEO is on stage
at eight o'clock on a Friday. And as much as you like to say it, oh, I'm never pushing out code on
a Friday. Sometimes you have to. Yeah, that week, yes, you are, whether you want to or not.
Exactly. Exactly. The old way was this idea that I'm going to time my release and it takes an hour
and a half. I'm going to push it out and I'll do my best. But hopefully when the CEO raises her arm
or his arm up and points to a screen that everything's lit up. Well, let's say you're
doing that and something goes wrong and you have to start over again. Well, oh my goodness, we're
15 minutes behind. Can you accelerate things? And then you start to pull away some of these blockers to
accelerate your pipeline, or you start editing right in the console of your application, which
is generally not a good idea right before a really big launch. So the new way is I'm going to have
that code already out there on a Wednesday before this big thing on a Friday, but it's hidden behind
this feature flag. I've already turned it on and off for internals
and it's just waiting there.
And so then when the CEO points to the big screen,
you can just flip that one small little configuration change
and that can be almost instantaneous
and people can access it.
So that just reduces the amount of stress,
reduces the amount of risk in pushing out your code.
Another thing is, we've heard this from customers.
Customers are increasing the number of deploys that they can do per week by a very large percentage
because they're deploying with confidence. They know that I can push out this code and it's off
by default, then I can turn it on whenever I feel like it, and then I can turn it off if something
goes wrong. So if you're into CICD, you can actually just move a lot faster with a number
of pushes to production each week,
which again, I think really helps engineers on their day-to-day lives.
The final thing I'm going to talk about is that let's say you did push out something and for
whatever reason that following weekend, something's going wrong. The old way was,
oh, you're going to get a page. I'm going to have to get on my computer and go and debug things and
fix things and then push out a new code
change.
And this could be late on a Saturday evening when you're out with friends.
If there's a feature flag there that can turn it off, and if this feature is not critical
to the operation of your product, you can actually just go in and flip that feature
flag off until the next morning or maybe even Monday morning.
So in theory, you kind of get your free time back when you are implementing feature flags.
So I think those are the big benefits for engineers in using feature flags.
The best way to figure out whether someone is speaking from a position of experience
or is simply a raving zealot when they're in a position where they are incentivized to advocate
for a particular way of doing things or a particular product as, let's be clear, you are in that position, is to ask a form of the following question.
Let's turn it around for a second.
In what scenarios would you absolutely not want to use feature flags?
What problems arise?
When do you take a look at a situation and say, oh yeah, feature flags will make things worse instead of better; don't do it?
I'm not sure I would necessarily say don't do it. Maybe I am that zealot, but you've got to do it
carefully. You really have to do things carefully because, as I said before, flipping on a feature
flag for everybody is similar to pushing out untested code to production. So you want to do
that in a measured way, and you need to make sure that you do a couple of things. One, there should be
some way to measure what the system behavior is for a small set of
users with that feature flag flipped on first. There could be some canaries that you're using for
that, and there are other mechanisms you can use to set up cohorts and beta testers
and those kinds of things. But I would say the gradual rollout and the targeted rollout of a
feature flag is critical. You know, again, it sounds easy, "I'll just turn it on later,"
but you ideally don't want to do that. The second thing you want to do, if you can,
is have some sort of validation that the feature flag value is what you expect. I was talking about on-off feature flags, but when I was talking about dynamic configuration, there are also
things like throttling limits where you want to put in some other safeguards. Say it's TPS: I never want my TPS to go above 1,200, and I never want to set it below 800,
for whatever reason. Well, you want to have some sort of validation of that data before
the feature flag gets pushed out. Inside Amazon, we actually have the policy that every single
flag needs to have some sort of validation around it, so that we don't accidentally fat-finger something before it goes out there. And we have fat-fingered things.
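A validator for that kind of throttling limit can be as simple as a bounds check that runs before the new value is ever deployed. Here's a minimal sketch of the idea, not AppConfig's validator mechanism itself; the `tps_limit` key name is hypothetical, and the 800 to 1,200 bounds come from the example above.

```python
# Syntactic validation: reject a config value before deployment if it
# falls outside known-safe bounds, catching fat-fingered edits early.

TPS_MIN, TPS_MAX = 800, 1200  # safe operating range for this service

def validate_tps_limit(config: dict) -> None:
    """Raise ValueError if the proposed config is not safe to deploy."""
    tps = config.get("tps_limit")
    if not isinstance(tps, int):
        raise ValueError(f"tps_limit must be an integer, got {tps!r}")
    if not TPS_MIN <= tps <= TPS_MAX:
        raise ValueError(
            f"tps_limit {tps} outside safe range [{TPS_MIN}, {TPS_MAX}]"
        )

validate_tps_limit({"tps_limit": 1000})       # in range: deploy proceeds
try:
    validate_tps_limit({"tps_limit": 10000})  # a fat-fingered extra zero
except ValueError as err:
    print(err)                                # deployment is refused
```

The point is that the check runs at deploy time, before any host ever sees the bad value, which is exactly when stress levels make typos most likely.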
Typing the wrong thing into a command structure, into a tool,
who would ever do something like that, he says, remembering times he's taken production down
himself exactly that way. Exactly. Exactly. And we've done it at Amazon and AWS for sure.
And so, yes, you want some sort of structure
or process to validate that,
because oftentimes what you're doing
is trying to remediate something in production.
Stress levels are high.
It is especially easy to fat-finger there.
So that check and balance of a validation is important.
And then, ideally, you have something
to automatically roll back whatever change
you made, very quickly.
So AppConfig, for example, hooks up to CloudWatch alarms. If an alarm goes off,
we're actually going to roll back instantly whatever that feature flag was to its previous
state, so you don't even need to watch your CloudWatch alarms yourself. It'll
just automatically react to whatever alarms you have. One of the interesting parts about
working at Amazon and seeing things at Amazonian scale is that one in a million events happen
thousands of times every second for you folks. What lessons have you learned by deploying feature
flags at that kind of scale? Because one of my problems and challenges with deploying feature
flags myself is that in some cases we're talking about three to five users a day
for some of these things.
That's not really enough usage to get insight
into various cohort analyses or A-B tests.
Yeah.
As I mentioned before,
we build these things as features into our products.
So I just talked about the CloudWatch alarms.
That wasn't there originally.
Originally, if something went wrong,
you would observe a CloudWatch alarm and then decide what to do. And one of those
things might be rolling back your configuration. So a lot of the mistakes that we
made that caused alarms to go off necessitated us building some automatic mechanisms. A human being
can only react so fast, but an automated system there is going to be able to roll things back
very, very quickly. So that came from some specific mistakes that we had made inside of AWS.
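That "alarm fires, config reverts" loop can be modeled without any AWS dependencies. The sketch below is a toy illustration of the behavior Steve describes, not AppConfig's implementation: the deployer remembers the previous configuration, and an alarm callback during the bake window reverts to it automatically, faster than any paged human could.

```python
# Toy model of alarm-triggered rollback: a deployment keeps the previous
# config around so an alarm during the bake window can revert it
# automatically, with no human in the loop.

class ConfigDeployer:
    def __init__(self, initial: dict):
        self.current = dict(initial)
        self.previous = None  # set only while a deployment is baking

    def deploy(self, new_config: dict) -> None:
        # Snapshot the known-good state before applying the new one.
        self.previous = dict(self.current)
        self.current = dict(new_config)

    def on_alarm(self, alarm_name: str) -> None:
        # A human reacts in minutes; this callback reacts immediately.
        if self.previous is not None:
            print(f"{alarm_name} fired: rolling back")
            self.current = self.previous
            self.previous = None

deployer = ConfigDeployer({"new-search": False})
deployer.deploy({"new-search": True})   # flip the flag on
deployer.on_alarm("HighErrorRate")      # alarm during the bake period
print(deployer.current)                 # back to the previous state
```

In the real service, the "alarm" side of this is a CloudWatch alarm associated with the AppConfig environment; the lesson from the mistakes Steve mentions is that the revert has to be automatic, because human reaction time is the bottleneck.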
The validation that I was talking about came out of that as well.
We have a couple of ways of validating things.
You might want to do a syntactic validation, where really you're validating, as I was saying,
the range, between 100 and 1,000.
But you also might want to have a sort of functional validation, or what we call a semantic
validation, so that you can make sure, for example, if you're flipping over to a new
database, you can have a validation
there that says: this database is ready.
I can write to this table.
It's truly ready for me to switch.
Instead of just updating some config data, you're actually validating that
the new target is ready for you.
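A semantic check like that goes beyond inspecting the config value: it exercises the new target before the switch is allowed. The sketch below uses SQLite as a stand-in for "the new database"; the table name, flag name, and probe logic are all hypothetical illustrations of the idea, not AppConfig's semantic validators.

```python
# Semantic validation: before flipping traffic to a new database, prove
# it is actually writable, not just that the config value parses.

import sqlite3

def database_is_ready(conn: sqlite3.Connection) -> bool:
    """Probe the new target with a real write before allowing the switch."""
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS health_check (id INTEGER)"
        )
        conn.execute("INSERT INTO health_check (id) VALUES (1)")  # test write
        conn.rollback()  # leave no trace of the probe
        return True
    except sqlite3.Error:
        return False

new_db = sqlite3.connect(":memory:")  # stand-in for the new target
flag = {"use-new-database": False}

if database_is_ready(new_db):
    flag["use-new-database"] = True   # only now is the switch permitted
print(flag)
```

The contrast with the syntactic check is the important part: a range validator can confirm a connection string looks sane, but only a probe like this confirms the database behind it will accept writes.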
So those are a
couple of things that we've learned from some of the mistakes we've made. And again, I'm not saying we
aren't still making mistakes, but we always look at these things inside of AWS and figure out how
we can benefit from them and, more importantly, how our customers can benefit from these mistakes.
I would say that I agree. I think that you have threaded the needle of not talking smack about your own product
while also presenting it as not the global panacea
that everyone should roll out willy-nilly.
That's a good balance to strike.
And frankly, I'd also say it's probably a good point
to park the episode.
If people want to learn more about AppConfig,
how you view these challenges,
or even potentially want to get started
using it themselves,
what should they do?
We have an informational page
at go.aws
slash awsappconfig.
That will tell you the high-level overview.
You can search for our documentation,
and we have a lot of blog posts
to help you get started there.
And links to that will, of course,
go into the
show notes. Thank you so much for suffering my slings, arrows, and other assorted nonsense on
this. I really appreciate your taking the time. Corey, thank you for the time. It's always a
pleasure to talk to you. Really appreciate your insights. You're too kind. Steve Rice,
Principal Product Manager for AWS AppConfig. I'm Cloud Economist Corey Quinn, and this is Screaming
in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast
platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your
podcast platform of choice, along with an angry comment. But before you do, just try clearing your
cookies and downloading the episode again. You might be in the 3% cohort for an A-B test,
and you might want to listen to the good one instead.
If your AWS bill keeps rising and your blood pressure is doing the same,
then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you,
not AWS. We tailor recommendations to your business, and we get to the point.
Visit duckbillgroup.com to get started. This has been a HumblePod production.
Stay humble.