Screaming in the Cloud - Feature Flags & Dynamic Configuration Through AWS AppConfig with Steve Rice
Episode Date: October 11, 2022
About Steve: Steve Rice is Principal Product Manager for AWS AppConfig. He is surprisingly passionate about feature flags and continuous configuration. He lives in the Washington DC area with his wife, 3 kids, and 2 incontinent dogs.
Links Referenced: AWS AppConfig: https://go.aws/awsappconfig
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This episode is sponsored in part by our friends at AWS AppConfig.
Engineers love to solve and occasionally create problems,
but not when it's an on-call fire drill at four in the morning.
Software problems should drive innovation and collaboration, not stress and sleeplessness
and threats of violence.
That's why so many developers are realizing the value of AWS AppConfig feature flags.
Feature flags let developers push code to production, but hide that feature from customers
so that the developers can release
their feature when it's ready. This practice allows for safe, fast, and convenient software
development. You can seamlessly incorporate AppConfig feature flags into your AWS or cloud
environment and ship your features with excitement, not trepidation and fear. To get started, go to snark.cloud slash appconfig.
That's snark.cloud slash appconfig. Forget everything you know about SSH and try Tailscale.
Imagine if you didn't need to manage PKI or rotate SSH keys every time someone leaves.
That'd be pretty sweet, wouldn't it?
With Tailscale SSH, you can do exactly that. Tailscale gives each server and user device a
node key to connect to its VPN, and it uses the same node key to authorize and authenticate SSH.
Basically, you're SSHing the same way you manage access to your app. What's the benefit here? Built-in key rotation,
permissions as code, connectivity between any two devices, reduced latency, and there's a lot more,
but there's a time limit here. You can also ask users to re-authenticate for that extra bit of
security. Sounds expensive? Nope, I wish it were. Tailscale is completely free for personal use on up to 20 devices.
To learn more, visit snark.cloud slash tailscale.
Again, that's snark.cloud slash tailscale.
Welcome to Screaming in the Cloud. I'm Corey Quinn.
This is a promoted guest episode. What does that mean?
Well, it means that some people don't just want me to sit here and throw slings and arrows their way.
They would prefer to send me a guest specifically.
And they do pay for that privilege, which I appreciate.
Paying me is absolutely a behavior I wish to endorse.
Today's victim who has decided to contribute to slash sponsor my ongoing ridiculous nonsense is, of all companies, AWS.
And today I'm talking to Steve Rice, who's the Principal Product Manager on AWS AppConfig.
Steve, thank you for joining me.
Hey, Corey, great to see you. Thanks for having me. Looking forward to our conversation.
As am I. Now, AppConfig does something super interesting, which I'm not aware of any other service or subservice doing.
You are under the umbrella of AWS Systems Manager, but you're not going to market with Systems Manager AppConfig.
You are just AWS AppConfig. Why?
So AppConfig is part of AWS Systems Manager.
Systems Manager has, I think, 17 different features associated with it.
Some of them have an individual name that is associated with Systems Manager.
Some of them don't.
We just happen to be one that doesn't.
AppConfig is a service that's been around for a while internally before it was launched externally a couple years ago.
So I'd say that's probably the origin of the name and the service.
I can tell you more about the origin of the service if you're curious.
Oh, I absolutely am. But I just want to take a bit of a detour here and point out that
I make fun of the subservice names in Systems Manager an awful lot, like Systems Manager,
Session Manager, and Systems Manager, Change Manager. And part of the reason I do that is
not just because it's funny, but because almost everything I've found so far within the Systems Manager umbrella is pretty awesome.
It aligns with how I tend to think about the world in a bunch of different ways.
I have yet to see anything lurking within the Systems Manager umbrella that has led to a "hehehe" bill-surprise level that rivals, you know, the GDP of Guam.
So I'm a big fan of the entire suite of services,
but yes, how did AppConfig get its name?
So AppConfig started about six years ago now, internally.
So we actually were part of the region services department
inside of Amazon,
which is in charge of launching new services
around the world.
We found that a
centralized tool for configuration associated with each service launching was really helpful.
So a service might be launching in a new region and have to enable and disable things as it moved
along. And so the tool is sort of built for that, turning on and off things as the region developed
and was ready to launch publicly. Then the regions launched publicly. It turned out that our internal customers, which are a lot of AWS services and then
some Amazon services as well, started to use us beyond launching new regions and started to use
us for feature flagging. Again, turning on and off capabilities, launching things safely. And so it
became massively popular. We're actually a top 30 service internally in terms of
usage. And two years ago, we thought we really should launch this externally and let our customers
benefit from some of the goodness that we put in there. And some of those all come from the
mistakes we've made internally. And so it became AppConfig. In terms of the name itself, we
specialize in application configuration. So that's kind of a mouthful. So we just changed it to AppConfig.
Earlier this year, there was a vulnerability reported around,
I believe it was AWS Glue, but please don't quote me on that. And as part of its excellent response
that AWS put out, they said that from the time that it was disclosed to them, they had patched the service and rolled it out to every AWS region
in which Glue existed in a little under 29 hours, which at scale is absolutely magic fast. That is
superhero speed and then some, because you generally don't just throw something over the
wall, regardless of how small it is, when we're talking about something at the scale of AWS.
I mean, look at who your customers are. Mistakes will show. This also got me thinking that when
you have Adam or previously Andy on stage giving a keynote announcement, and then they mention
something on stage like, congratulations, it's now a very complicated service with 14 adjectives in
its name because someone's paid by the syllable. Great. Suddenly, the marketing pages are up, the APIs are working, it's showing up in the console.
And it occurs to me only somewhat recently to think about all of the moving parts that go on
behind this. That is far faster than even the improved speed of CloudFront distribution updates.
There is very clearly something going on there. So I've got to ask, is that you? Yes, a lot of that is us. I can't take credit for 100% of what you're talking about,
but that's how we are used. We're essentially used as a feature flagging service. And I can
talk generically about feature flagging. Feature flagging allows you to push code out to production,
but it's hidden behind a configuration switch, a feature toggle or a
feature flag. And the code can be sitting out there. Nobody can access it until somebody flips
that toggle. Now, the smart way to do it is to flip that toggle on for a small set of users.
Maybe it's just internal users. Maybe it's 1% of your users. And so the feature is available.
It's your best slash worst customers in that 1% in some cases.
Yeah. You want to stress test the system with them.
And you want to be able to look and see what's going to break before it breaks for everybody.
So you release this to a small cohort.
You measure your operations.
You measure your application health.
You measure your reputational concerns.
And then if everything goes well, then you maybe bump it up to 2%, and then 10%,
then 20%. So feature flags allow you to slowly release features, and you know what you're
releasing by the time it's at 100%. It's tempting for teams to want to have everybody access it at
the same time. You've been working hard on this feature for a long time. But again, that's kind
of an anti-pattern. You want to make sure that on production, it behaves the way you expect it to behave. I have to ask, what is the fundamental difference between feature flags and or dynamic
configuration? Because to my mind, one of them is a means of achieving the other, but I could also
see very easily using the terms interchangeably, given that in some of our conversations, you have
corrected me, which first, how dare you?
Secondly, okay, there's probably a reason here. What is that point of distinction?
Yeah. Typically, for those that are not eating, sleeping, and breathing dynamic configuration,
which I do, and most people are not obsessed with this kind of thing, feature flags is kind
of a shorthand for dynamic configuration. It allows you to turn on and off things without
pushing out any new code.
So your application code's running, it's pulling its configuration data, say every five seconds,
every 10 seconds, something like that. And when that configuration data changes, then the app
changes its behavior, again, without a code push or without restarting the app. So dynamic
configuration is maybe a superset of feature flags. Typically, when people think feature flags, they're thinking of, oh, I'm going to release
a new feature.
So it's almost like an on-off switch.
But we see customers using feature flags, and we use this internally, for things like
throttling limits.
Let's say you want to be able to throttle TPS, transactions per second.
Or let's say you want to throttle the number of simultaneous background tasks and say,
you know, I just really don't want this creeping above 50. Bad things can start to happen. But in a period of stress, you might want to
actually bring that number down. Well, you can push out these changes with dynamic configuration,
which is, again, any type of configuration, not just an on-off switch. You can push this out
and adjust the behavior and see what happens. Again, I'd recommend pushing it out to 1% of
your users, then 10%, but allows you to have these dials and switches to do that. And again,
generically, that's dynamic configuration. It's not as fun a term as feature flags. Feature flags
is sort of a good mental picture. So I do use them interchangeably, but if you're really into
the whole world of this dynamic configuration, then you probably will care about the difference.
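The 1%-then-10%-then-100% rollout described here is often implemented with deterministic bucketing, so each user stays consistently in or out of the cohort as the dial goes up. A minimal sketch in Python, with illustrative function and flag names (this is not AppConfig's API):

```python
import hashlib

# Hedged sketch of a gradual percentage rollout: hash each user id into a
# stable bucket from 0 to 99 and compare it to the flag's rollout percentage.

def in_rollout(user_id: str, flag_name: str, percent: int) -> bool:
    """True if this user falls inside the flag's current rollout percentage."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100   # deterministic bucket in [0, 100)
    return bucket < percent

# Turning the dial from 1 to 10 to 100 only ever adds users to the cohort;
# nobody flips back and forth between code paths as the rollout widens.
```

Salting the hash with the flag name means different flags carve out different cohorts, so the same 1% of users isn't the guinea pig for every launch.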
Which makes a fair bit of sense. It's the question of what are you talking about at a high level versus
what are you talking about at an implementation-detail level. And on some level, I used to get, well,
we'll call it angsty because I can't think of a better adjective right now about how AWS was
reluctant to disclose implementation details behind what it did. And in the fullness of time, it's made a lot more sense to me,
specifically through a lens of you want to be able to have the freedom
to change how something works under the hood.
And if you've made no particular guarantee about the implementation detail,
you can do that without potentially worrying about
breaking a whole bunch of customer expectations that you've inadvertently set.
And that makes an awful lot of sense. The idea of rolling out changes to your infrastructure
has evolved over the last decade. Once upon a time, you'd have EC2 instances. And great,
you want to go ahead and make a change there, or this actually predates EC2 instances,
virtual machines in a data center, or heaven forbid, bare metal servers, you're not going to deploy a whole new server
because there's a new version of the code out.
So you separate out your infrastructure
from the code that it runs.
And that worked out well.
And increasingly, we started to see ways of,
okay, if we want to change the behavior of the application,
we'll just push out new environment variables
to that thing and restart the service
so it winds up consuming those.
And that's great. You've rolled it out
throughout your fleet. With containers,
which is sort of the next logical step,
well, okay, this stuff gets baked in.
We'll just restart containers with a new version of
code because that takes less than a second each
and you're fine. And then
with Lambda functions, it's okay, we'll just change
the deployment option and the next invocation
will wind up taking the brand new environment
variables passed out to it. How do feature flags feature into those, I guess, three
evolving methods of running applications in anger, by which I mean, of course, production?
Good question. And I think you really articulated that well.
Well, thank you. I should hope so. I'm a storyteller. At least I fancy myself one.
Yes, you are. Really, what you talked about
is the evolution of, you know, at the beginning, people were, well, first of all, people probably
were embedding their variables deep in their code. And then they realized, oh, I want to change this
and I have to find where in my code that is. And so it became a pattern. Why don't we separate
everything that's configuration data into its own file, but it'll get compiled at build time and
sent out all at once. There was kind of this breakthrough that was, why don't we actually separate out the deployment
of this?
We can separate the deployment from code from the deployment of configuration data and have
the code be reading that configuration data on a regular interval, as I said.
So now as the environments have changed, like you said, containers and Lambda,
that ability to make tweaks at microsecond intervals is more important and more powerful.
So there certainly is still value in having things like environment variables that get read at
startup. We call that static configuration as opposed to dynamic configuration. And that's
a very important element in the world of containers that you talked about.
Containers are a bit ephemeral, and so they kind of come and go, and you can restart things,
or you might spin up new containers with a slightly different config and have them operate in a certain way. And again, Lambda takes that to the next level. I'm really excited where people
are going to take feature flags to the next level, because already today we have people
just fine-tuning to very targeted small subsets, different configuration data, different feature flag data.
And it allows them to do this at a scale we've never seen before: turning this on, seeing how it reacts, seeing how the application behaves, and then being able to roll that out to all of your audience.
Now, you've got to be careful.
You really don't want to have completely different configurations out there and have 10 different or 100 different configurations out there.
That makes it really tough to debug. So you want to think of this as, I want to roll this out
gradually over time, but eventually you want to have this sort of state where everything is
somewhat consistent. That on some level speaks to a level of operational maturity that my current deployment adventures generally don't have. A common reference I make is to my last tweet in AWS.com Twitter threading app. And anyone can visit it, use it however they want. It uses a Route 53 latency record to figure out, ah, which is the closest region to you because I've deployed it to 20 different regions. Now, if this were a paid service or I had people using this in large volume and I had to worry
about that sort of thing, I would probably approach something that is very close to what
you describe. In practice, I pick a devoted region that I deploy something to and cool,
that's sort of my canary where I get things working the way I would expect. And when
that works the way I want it to, I then just push it to everything else automatically.
Given that I've put significant effort into getting deployments down to approximately
two minutes to deploy to everything, that it feels like that's a reasonable amount of
time to push something out.
Whereas if I were, I don't know, running a bank, for example, I would probably have an
incredibly heavy process
around things that make changes to things like payment or whatnot, because despite the lies we
all like to tell both to ourselves and in public, anything that touches payments does go through
waterfall, not agile iterative development, because that mistake tends to show up on your
customer's credit card bills. And then they're so angry. I think that there's a certain point of maturity you need to be at as either an organization or
possibly as a software technology stack before something like feature flags even becomes
available to you. Would you agree with that? Or is this something everyone should use?
I would agree with that. Definitely a small team that has communication flowing between the two
probably won't get as much value out of a gradual release process because everybody kind of knows
what's going on inside of the team. Once your team scales or maybe your audience scales,
that's when it matters more. You really don't want to have something blow up with your users.
You really don't want to have people getting paged in the middle of the night because of a
change that was made. And so feature flags do help with that. So typically the journey
we see is people start off in a maybe very small startup. They're releasing features at a very fast
pace. They grow and they start to build their own feature flagging solution. Again, companies I've
been at previously have done that. And you start using feature flags and you see the power of it.
Oh my gosh, this is great.
I can release something when I want without doing a big code push.
I can just do a small little change.
And if something goes wrong, I can roll it back instantly.
That's really handy.
And so the basics of feature flagging might be a homegrown solution that you all have
built.
If you really lean into that and start to use it more, then you probably want to look
at a third-party solution because there's so many features out there that you might want. A lot of them are
around safeguards that make sure that releasing a new feature is safe. Again, pushing out a new
feature to everybody could be similar to pushing out untested code to production. You don't want
to do that. So you need to have some checks and balances in your release process of your feature
flags. And that's what
a lot of third parties do. It really depends, to get back to your question about who needs feature
flags, it depends on your audience size. If you have enough audience out there to want to do a
small rollout to a small set first and then have everybody hit it, that's great. Also, if you just
have one or two developers, then feature flags are probably
something that you're just kind of, you're doing yourself, you're pushing out this thing
anyway on your own, but you don't need to coordinate it across your team.
I think that there's also a bit of, how to frame this, a misunderstanding on someone's part
about where AppConfig starts and where it stops. When it was first announced, feature flags were
one of the things that it did.
And that was talked about on stage,
I believe in reInvent,
but please don't quote me on that,
when it wound up getting announced.
And then in the fullness of time,
there was another announcement of
AppConfig now supports feature flags,
which I'm sitting there,
and I had to go back to my old notes,
like, did I hallucinate this?
Which, again, would not be the first time
I'd imagined such a thing.
But no, it was originally how the service was described, but now it's extra feature flags.
Almost like someone would, I don't know, flip on a feature flag toggle for the service, and now it does a different thing.
What changed?
What was it that was misunderstood about the service initially versus what it became?
Yeah, I wouldn't say it was a misunderstanding.
I think what happened was we launched it guessing what our customers were going to use it as.
We had done plenty of research on that.
And as I mentioned before, we have-
Please tell me someone uses a database
or am I the only nutter that does stuff like that?
We have seen that before.
We have seen something like that before.
Excellent, excellent, excellent.
I approve.
And so we had done our due diligence ahead of time
about how we thought people were going to use it.
We were right about a lot of it.
I mentioned before that we have a lot of usage internally.
So that was kind of maybe cheating even for us to be able to sort of see how this is going
to evolve.
What we did announce, I guess it was last November, was an opinionated version of Feature
Flags.
So we had people using us for Feature Flags, but they were building their own structure,
their own JSON, and there was not a dedicated console experience for Feature Flags.
What we announced last November was an opinionated version that structured the JSON in a way that we
think is the right way. And that afforded us the ability to have a smooth console experience. If we
know what the structure of the JSON is, we can have things like toggles and validations in there that really specifically look at some of the data points. So that's really what
happened. We're just making it easier for our customers to use us for feature flags. We still
have some customers that are kind of building their own solution, but we're seeing a lot of
them move over to our opinionated version. This episode is brought to us in part by our
friends at Datadog. Datadog is a SaaS monitoring and security platform that enables full-stack observability for developers,
IT operations, security, and business teams in the cloud age.
Datadog's platform, along with 500-plus vendor integrations,
allows you to correlate metrics, traces, logs, and security signals across your applications,
infrastructure, and third-party services in a single pane of glass. Combine these with drag
and drop dashboards and machine learning-based alerts to help teams troubleshoot and collaborate
more effectively, prevent downtime, and enhance performance and reliability. Try Datadog in your
environment today with a free 14-day trial and get a complimentary t-shirt when you install the agent. To learn more, visit datadoghq.com
slash screaming in the cloud to get started. That's www.datadoghq.com slash screaming in the cloud.
Part of the problem I have when I look at what it is you folks do and your use cases and
how you structure it is it's similar in some respects to how folks perceive things like FIS,
the fault injection service, or chaos engineering as it's commonly known, which is we can't even get
the service to stay up on its own for any period of time. What do you mean? Now let's intentionally
degrade it and make it work. There needs to be a certain level of operational stability or operational maturity. When you're
still building a service before it's up and running, feature flags seem awfully premature
because there's no one depending on it. You can change configuration however your little heart
desires in most cases. I'm sure at certain points of scale of development teams, you have a
communications problem internally, but it's not aimed at me trying to get something working at 2 a.m. in
the middle of the night. Whereas by the time folks are ready for what you're doing, they clearly have
that level of operational maturity established. So I have to guess on some level that your typical adopter of AppConfig feature flags isn't in fact someone who
is, well, we're ready for feature flags, let's go, but rather someone who's come up with something
else as a stopgap as they've been iterating forward, usually something home built. And it
might very well be you have the exact same biggest competitor that I do in my consulting work,
which is, of course, Microsoft Excel, as people try to build their own thing that works their own way.
Yeah, so definitely a very common customer of ours is somebody that is using a homegrown solution
for turning on and off things. And they really feel like I'm using the heck out of these feature
flags. I'm using them on a daily or weekly basis.
I would like to have some enhancements to how my feature flags work, but I have limited resources
and I'm not sure that my resources should be building enhancements to a feature flagging
service. But instead, I'd rather have them focusing on something directly for our customers,
some of the core features of whatever your company does. And so that's when people sort
of look around externally and say, oh, let me see if there's some other third-party service or something
built into AWS like AWS AppConfig that can meet those needs. And so absolutely, the workflows get
more sophisticated, the ability to move forward faster becomes more important and do so in a safe way. I used to work at a
cybersecurity company and we would kind of joke that the security budget at a company is relatively
low until something bad happens and then it's, you know, whatever you need to spend on it.
It's not quite the same with feature flags, but you do see when somebody has a problem on production
and they want to be able to turn something off right away or make an adjustment right away, then the ability to do that in a measured way becomes incredibly important.
And so that's when, again, you'll see customers starting to feel like they're outgrowing their
homegrown solution and moving to something that's a third-party solution. Honestly, I feel like
so many tools exist in the space where, oh yeah, you should definitely
use this tool.
And most people will use that tool the second time.
Because the first time, it's one of those, how hard could that be?
I can build something like that in a weekend, which is sort of the rallying cry of doomed
engineers who are bad at scoping.
And by the time that they figure out why, they've backtracked significantly.
There's a whole bunch of stuff that I have built
that people look at and say,
wow, that's a really great design.
What inspired you to do that?
And the absolute honest answer to all of it is simply,
yeah, I did it wrong the first time.
I did it the way you would think I would do it.
And it didn't go well.
Experience is what you get
when you didn't get what you wanted.
And this is one of those areas
where it tends to manifest in reasonable ways.
Absolutely, absolutely.
So give me an example here, if you don't mind,
about how feature flags can improve
the day-to-day experience of an engineering team
or an engineer themselves.
Because we've been down this path enough
in some cases to know the failure
modes. But for folks who haven't been there, let's try and shave a little bit off of their journey of
"I'm going to learn from my own mistakes," and let them learn from someone else's instead. What are the benefits that
accrue and are felt immediately? Yeah. So we kind of have a policy that the very first commit of any
new feature ought to be the feature flag. That's that sort of on-off switch that you want to put there
so that you can start to deploy your code
and not have a long-lived branch in your source code,
but you can have your code there.
It reads whether that configuration's on or off.
You start with it off.
And so it really helps just while developing these things
about keeping your branches short.
And you can push to mainline
as long as the feature flag is off
and the feature's hidden in production, which is great.
So that helps with the mess of doing big code merges.
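The "first commit is the feature flag" pattern above can be sketched as a simple guard: the new code path merges to mainline and deploys dark, and the default preserves current behavior. Function and flag names here are hypothetical:

```python
# Hedged sketch: new code ships behind a flag that starts off, so mainline
# stays releasable and there's no long-lived feature branch to merge later.

def old_checkout(cart):
    return {"total": sum(cart), "engine": "old"}

def new_checkout(cart):
    return {"total": sum(cart), "engine": "new"}

def checkout(cart, flags):
    """`flags` stands in for whatever dynamic-config store the app polls."""
    if flags.get("new_checkout_enabled", False):
        return new_checkout(cart)      # merged, deployed, but dark
    return old_checkout(cart)          # current behavior, still the default

# With the flag off (or absent), production behavior is unchanged:
# checkout([1, 2], {}) -> {"total": 3, "engine": "old"}
```

Because the flag defaults to off, deploying this code is a no-op for users; the launch itself becomes a one-line config change rather than a code push.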
The other part is around the launch of a feature.
So you talked about Andy Jassy being on stage
to launch a new feature.
Sort of the old way of doing this, Corey,
was that you would need to look at your pipelines and see how long it might take for you to push out your code with any sort of code
change in it. And let's say that was an hour and a half process. And let's say your CEO is on stage
at eight o'clock on a Friday. And as much as you like to say it, oh, I'm never pushing out code on
a Friday. Sometimes you have to. Yeah, that week, yes, you are, whether you want to or not.
Exactly. Exactly. The old way was this idea that I'm going to time my release and it takes an hour
and a half. I'm going to push it out and I'll do my best. But hopefully when the CEO raises her arm
or his arm up and points to a screen that everything's lit up. Well, let's say you're
doing that and something goes wrong and you have to start over again. Well, oh my goodness, we're
15 minutes behind. Can you accelerate things? And then you start to pull away some of these blockers to
accelerate your pipeline, or you start editing right in the console of your application, which
is generally not a good idea right before a really big launch. So the new way is I'm going to have
that code already out there on a Wednesday before this big thing on a Friday, but it's hidden behind
this feature flag. I've already turned it on and off for internals
and it's just waiting there.
And so then when the CEO points to the big screen,
you can just flip that one small little configuration change
and that can be almost instantaneous
and people can access it.
So that just reduces the amount of stress,
reduces the amount of risk in pushing out your code.
Another thing is, we've heard this from customers.
Customers are increasing the number of deploys that they can do per week by a very large percentage
because they're deploying with confidence. They know that I can push out this code and it's off
by default, then I can turn it on whenever I feel like it, and then I can turn it off if something
goes wrong. So if you're into CICD, you can actually just move a lot faster with a number
of pushes to production each week,
which again, I think really helps engineers on their day-to-day lives.
The final thing I'm going to talk about is that let's say you did push out something and for
whatever reason that following weekend, something's going wrong. The old way was,
oh, you're going to get a page. I'm going to have to get on my computer and go and debug things and
fix things and then push out a new code
change.
And this could be late on a Saturday evening when you're out with friends.
If there's a feature flag there that can turn it off, and if this feature is not critical
to the operation of your product, you can actually just go in and flip that feature
flag off until the next morning or maybe even Monday morning.
So in theory, you kind of get your free time back when you are implementing feature flags.
So I think those are the big benefits for engineers in using feature flags.
The best way to figure out whether someone is speaking from a position of experience
or is simply a raving zealot when they're in a position where they are incentivized to advocate
for a particular way of doing things or a particular product as, let's be clear, you are in that position, is to ask a form of the following question.
Let's turn it around for a second.
In what scenarios would you absolutely not want to use feature flags?
What problems arise?
When do you take a look at a situation and say, oh yeah, feature flags will make things worse instead of better; don't do it?
I'm not sure I would necessarily say don't do it. Maybe I am that zealot, but you've got to do it
carefully. You really have to do things carefully because, as I said before, flipping on a feature
flag for everybody is similar to pushing out untested code to production. So you want to do
that in a measured way, and you need to make sure that you do a couple of things. One, there should be
some way to measure what the system behavior is for a small set of
users with that feature flag flipped on first. There could be some canaries that you're using for
that, and there are other mechanisms you can use to set up cohorts and beta testers
and those kinds of things. But I would say the gradual rollout and the targeted rollout of a
feature flag is critical. You know, again, it sounds easy, "I'll just turn it on later,"
but you ideally don't want to do that. The second thing you want to do, if you can,
is have some sort of validation that the feature flag value is what you expect. I was talking about on-off feature flags, but when I was talking about dynamic configuration, there are also
things like throttling limits where you want to put in some other safeguards. Say it's TPS: I never want my TPS to go above 1,200, and I never want to set it below 800,
for whatever reason. Well, you want to have some sort of validation of that data before
the feature flag gets pushed out. Inside Amazon, we actually have the policy that every single
flag needs to have some sort of validation around it, so that we don't accidentally fat-finger something before it goes out there. And we have fat-fingered things.
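A validator for that kind of throttling limit can be as simple as a bounds check that runs before the new value is ever deployed. Here's a minimal sketch of the idea, not AppConfig's validator mechanism itself; the `tps_limit` key name is hypothetical, and the 800 to 1,200 bounds come from the example above.

```python
# Syntactic validation: reject a config value before deployment if it
# falls outside known-safe bounds, catching fat-fingered edits early.

TPS_MIN, TPS_MAX = 800, 1200  # safe operating range for this service

def validate_tps_limit(config: dict) -> None:
    """Raise ValueError if the proposed config is not safe to deploy."""
    tps = config.get("tps_limit")
    if not isinstance(tps, int):
        raise ValueError(f"tps_limit must be an integer, got {tps!r}")
    if not TPS_MIN <= tps <= TPS_MAX:
        raise ValueError(
            f"tps_limit {tps} outside safe range [{TPS_MIN}, {TPS_MAX}]"
        )

validate_tps_limit({"tps_limit": 1000})       # in range: deploy proceeds
try:
    validate_tps_limit({"tps_limit": 10000})  # a fat-fingered extra zero
except ValueError as err:
    print(err)                                # deployment is refused
```

The point is that the check runs at deploy time, before any host ever sees the bad value, which is exactly when stress levels make typos most likely.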
Typing the wrong thing into a command structure, into a tool,
who would ever do something like that, he says, remembering times he's taken production down
himself exactly that way. Exactly. Exactly. And we've done it at Amazon and AWS for sure.
And so, yes, you want some sort of structure
or process to validate that,
because oftentimes what you're doing
is trying to remediate something in production.
Stress levels are high.
It is especially easy to fat-finger there.
So that check and balance of a validation is important.
And then, ideally, you have something
to automatically roll back whatever change
you made, very quickly.
So AppConfig, for example, hooks up to CloudWatch alarms. If an alarm goes off,
we're actually going to roll back instantly whatever that feature flag was to its previous
state, so you don't even need to watch your CloudWatch alarms yourself. It'll
just automatically react to whatever alarms you have. One of the interesting parts about
working at Amazon and seeing things at Amazonian scale is that one in a million events happen
thousands of times every second for you folks. What lessons have you learned by deploying feature
flags at that kind of scale? Because one of my problems and challenges with deploying feature
flags myself is that in some cases we're talking about three to five users a day
for some of these things.
That's not really enough usage to get insight
into various cohort analyses or A-B tests.
Yeah.
As I mentioned before,
we build these things as features into our products.
So I just talked about the CloudWatch alarms.
That wasn't there originally.
Originally, if something went wrong,
you would observe a CloudWatch alarm and then decide what to do. And one of those
things might be rolling back your configuration. So a lot of the mistakes that we
made that caused alarms to go off necessitated us building some automatic mechanisms. A human being
can only react so fast, but an automated system there is going to be able to roll things back
very, very quickly. So that came from some specific mistakes that we had made inside of AWS.
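That "alarm fires, config reverts" loop can be modeled without any AWS dependencies. The sketch below is a toy illustration of the behavior Steve describes, not AppConfig's implementation: the deployer remembers the previous configuration, and an alarm callback during the bake window reverts to it automatically, faster than any paged human could.

```python
# Toy model of alarm-triggered rollback: a deployment keeps the previous
# config around so an alarm during the bake window can revert it
# automatically, with no human in the loop.

class ConfigDeployer:
    def __init__(self, initial: dict):
        self.current = dict(initial)
        self.previous = None  # set only while a deployment is baking

    def deploy(self, new_config: dict) -> None:
        # Snapshot the known-good state before applying the new one.
        self.previous = dict(self.current)
        self.current = dict(new_config)

    def on_alarm(self, alarm_name: str) -> None:
        # A human reacts in minutes; this callback reacts immediately.
        if self.previous is not None:
            print(f"{alarm_name} fired: rolling back")
            self.current = self.previous
            self.previous = None

deployer = ConfigDeployer({"new-search": False})
deployer.deploy({"new-search": True})   # flip the flag on
deployer.on_alarm("HighErrorRate")      # alarm during the bake period
print(deployer.current)                 # back to the previous state
```

In the real service, the "alarm" side of this is a CloudWatch alarm associated with the AppConfig environment; the lesson from the mistakes Steve mentions is that the revert has to be automatic, because human reaction time is the bottleneck.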
The validation that I was talking about came out of that as well.
We have a couple of ways of validating things.
You might want to do a syntactic validation, where really you're validating, as I was saying,
the range, between 100 and 1,000.
But you also might want to have a sort of functional validation, or what we call a semantic
validation, so that you can make sure, for example, if you're flipping over to a new
database, you can have a validation
there that says: this database is ready.
I can write to this table.
It's truly ready for me to switch.
Instead of just updating some config data, you're actually validating that
the new target is ready for you.
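A semantic check like that goes beyond inspecting the config value: it exercises the new target before the switch is allowed. The sketch below uses SQLite as a stand-in for "the new database"; the table name, flag name, and probe logic are all hypothetical illustrations of the idea, not AppConfig's semantic validators.

```python
# Semantic validation: before flipping traffic to a new database, prove
# it is actually writable, not just that the config value parses.

import sqlite3

def database_is_ready(conn: sqlite3.Connection) -> bool:
    """Probe the new target with a real write before allowing the switch."""
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS health_check (id INTEGER)"
        )
        conn.execute("INSERT INTO health_check (id) VALUES (1)")  # test write
        conn.rollback()  # leave no trace of the probe
        return True
    except sqlite3.Error:
        return False

new_db = sqlite3.connect(":memory:")  # stand-in for the new target
flag = {"use-new-database": False}

if database_is_ready(new_db):
    flag["use-new-database"] = True   # only now is the switch permitted
print(flag)
```

The contrast with the syntactic check is the important part: a range validator can confirm a connection string looks sane, but only a probe like this confirms the database behind it will accept writes.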
So those are a
couple of things that we've learned from some of the mistakes we've made. And again, I'm not saying we
aren't still making mistakes, but we always look at these things inside of AWS and figure out how
we can benefit from them and, more importantly, how our customers can benefit from these mistakes.
I would say that I agree. I think that you have threaded the needle of not talking smack about your own product
while also presenting it as not the global panacea
that everyone should roll out willy-nilly.
That's a good balance to strike.
And frankly, I'd also say it's probably a good point
to park the episode.
If people want to learn more about AppConfig,
how you view these challenges,
or even potentially want to get started
using it themselves,
what should they do?
We have an informational page
at go.aws
slash awsappconfig.
That will tell you the high-level overview.
You can search for our documentation,
and we have a lot of blog posts
to help you get started there.
And links to that will, of course,
go into the
show notes. Thank you so much for suffering my slings, arrows, and other assorted nonsense on
this. I really appreciate your taking the time. Corey, thank you for the time. It's always a
pleasure to talk to you. Really appreciate your insights. You're too kind. Steve Rice,
Principal Product Manager for AWS AppConfig. I'm Cloud Economist Corey Quinn, and this is Screaming
in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast
platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your
podcast platform of choice, along with an angry comment. But before you do, just try clearing your
cookies and downloading the episode again. You might be in the 3% cohort for an A-B test,
and you might want to listen to the good one instead.
If your AWS bill keeps rising and your blood pressure is doing the same,
then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you,
not AWS. We tailor recommendations to your business, and we get to the point.
Visit duckbillgroup.com to get started. This has been a HumblePod production.
Stay humble.