Screaming in the Cloud - Episode 1: Feature Flags with Heidi Waterhouse of LaunchDarkly
Episode Date: March 19, 2018

This podcast features people doing interesting work in the world of cloud. What is the state of the technical world? Let's first focus on the up-or-down, on-or-off function of feature flags.

Today, we're talking to Heidi Waterhouse, a technical writer turned developer advocate at LaunchDarkly, which is a feature flag service: a way to wrap a snippet of code around your feature and turn that feature into an instrument you can switch on or off. Feature flags let you turn things on and off in your codebase quickly without having to do several commits. However, flags become difficult to track once there are more than about a dozen of them, so LaunchDarkly provides a way to manage your features at scale with a usable interface and an API.

Some of the highlights of the show include:

- A feature flag allows you to hide items before you want them to go live on your website. You hide the work behind a feature flag and do it all ahead of time. Then, at some point, you turn it all on instantly without the risk of pushing untested code into production.
- You can test at scale to gain authentic data. Test something with your team, your company's employees, your customers, and so on. However, no matter how good your integration tests are, there are always wobbles to watch for in the system.
- For implementation, a few paths can work, such as the massive reorganization path. Or you can start incrementally, using feature flags only for new features.
- LaunchDarkly thinks of the cloud as its home turf because it mostly works with people doing web-based delivery of features.
- Major companies, like Google and Facebook, run services similar to feature flags for their own development. They operate at such a giant scale that they have internal teams doing it.
- Companies use feature flags on the front end and for other purposes. Flags work through the whole stack, from front-end page delivery, pricing tiers, white labeling, and style sheets to safer deployments.
- Documentation: you should not have to read documentation for anything you don't own. Every feature should have its documentation tied to its code, which creates a customized experience.
- Feature flags effectively manage and minimize risk. There is always risk in the world, but what causes disaster is not just one failure; it is a multiplication of failures. This goes wrong and that goes wrong. Feature flagging breaks monolithic releases into tiny chunks that can go forward or backward.
- LaunchDarkly holds a monthly meetup called Test in Production, where people share their use cases for continuous integration, continuous deployment, DevOps, and more.

Links:

- LaunchDarkly
- iPad
- Autodesk
- Slack
- IBM

Quotes by Heidi:

- "What feature flags do is make it possible for you to push out a deployment with things hidden; we call it launching darkly."
- "We're all about avoiding risk. I think this is our motto this year, eliminate risk… you can't eliminate risk, but you can make it much less risky."
- "Go ahead and write your feature. You know that it's hidden behind the magical feature flag curtain until you're ready to turn it on."
- "If 20 years of technical writing taught me anything, it's that nobody wants to be reading documentation."
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This episode of Screaming in the Cloud is sponsored by my friends at GorillaStack.
GorillaStack's a unique automation solution for cloud cost optimization,
which of course is something near and dear to my heart.
By day, I'm a consultant who fixes exactly one problem,
which is the horrifying AWS bill.
Every organization eventually hits a point where they start to really, really care
about their cloud spend, either in terms of caring about the actual dollars and cents that they're
spending, or in understanding what teams or projects are costing money and starting to build
predictive analytics around that. And it turns out that early on in my consulting work, I spent an
awful lot of time talking with some of my clients
about a capability that GorillaStack has already built. There's a laundry list of analytics
offerings in this space that tell you what you're spending and where it goes, and then they stop.
Or worse, they slap a beta label on the remediation side of it and then say that they're
not responsible for anything and everything that their system winds up doing. So some folks try to go in the direction of
writing their own code, such as spinning down developer environments out of hours,
bolting together a bunch of different services to handle snapshot aging,
having a custom Slack bot that you build that alerts you when your budget's hitting a red line. And this is all generic stuff; it's the undifferentiated heavy lifting that's not
terribly specific to your own environment. So why build it when you can buy it? GorillaStack
does all of this. Think of it more or less like If This Then That, IFTTT, for AWS. It can manage
resources, it can alert folks when things are
about to turn off. It keeps people apprised of what's going on. More or less the works.
Go check them out. They're at gorillastack.com, spelled exactly like it sounds. Gorilla like the
animal, stack as in a pile of things. Use the discount code screaming for 15% off the first
year. Thanks again for your support, GorillaStack.
Appreciate it.
Hello and welcome to Screaming in the Cloud.
I'm Corey Quinn.
Today I'm joined by Heidi Waterhouse of LaunchDarkly, where she's currently a developer advocate.
Welcome to the show, Heidi.
Thanks.
I'm glad to be here.
So your backstory is fascinating to me.
You were a technical writer for a couple of decades. At one point, I think you mentioned to me that you used to write Patch Tuesday release notes, which I think is the definition of a thankless job to some extent.
After a year or so of that, people begin to think that you might know what you're doing and invite you to come give talks. And so instead of being my marketing side hustle as a technical writer, writing blog posts and giving technical talks is my full-time job, which is like a dream come true.
Was that a role that you found or did you have to have it built when you started
talking to them? So I applied to be a technical writer, and they asked me to come in and interview
to be the developer advocate. And so I'd never done that role or really, like I knew a bunch of
dev advocates and dev rel people. But I had never thought of myself that way. But LaunchDarkly asked
me to come in and give them a 15-minute presentation on feature flags
to see what I could do with that.
And I ended up giving them a 20-minute presentation on how you could use feature flags to do documentation
better.
That sounds like a fun story.
But let's rewind a little bit first.
When I first met you, I knew you as the person who was doing a bunch of live presentations and all, from an iPad.
And that is something that turned into a surprisingly interesting area.
Isn't that great?
I find it so fascinating that the iPad really is a powerful enough computer that that's all you need to travel with. I'm also so glad I have that method because it turns out that the USB-C connection
on the new MacBooks is not 100% reliable for HDMI
and can fail not just frustratingly,
but like catastrophically.
So this happened to me.
I had just gotten a brand new laptop for this new job.
I'm doing my first
conference as a dev advocate. I plug in my laptop and it got this weird jaggy sideways graphic and
failed to project. And I'm sitting there thinking, good, good. This is good. This is a brand new talk
at a conference where I don't know anyone for my brand new job, and I have just bricked my computer.
I feel like the only appropriate response there is,
and there's the metaphor, and just wait for the applause.
Exactly. I actually ended up giving a 35-minute talk purely from memory and adrenaline and fear
sweat. But after that, I was like, MacBook, you are fired for all time, and I'm going back to my
iPad.
Okay, back on topic a little bit. What is a LaunchDarkly?
LaunchDarkly is feature flags as a service. So it turns out that a lot of people want to be able to turn things on and off in their code base really quickly without having to do a lot of commits.
But they have a lot of trouble tracking it when you get over, say, a dozen flags.
So what LaunchDarkly is providing
is a way to manage your features at scale
with a usable interface and also an API.
Okay, for those who don't have a background
in software development, namely me, what is a feature flag?
So imagine if you are creating a piece of music, and you know those big sample boards that the DJs use and they use in theaters?
Oh yeah.
It's like that: you wrap a snippet of code around your feature and make that into an instrument to turn it on and off.
Wonderful.
And the advantage of doing this as opposed to a fresh code deploy would be?
Speed and risk.
So I am old enough that I remember when code and products came on CDs,
actually floppy disks, but we won't talk about that.
Lotus 1-2-3.
But it used to be a big deal to push out a deployment,
and then that was all you got.
So what feature flags do is make it possible for you to push out a deployment with things hidden.
We call it launching darkly.
Ah.
Ah, see?
And then when you're ready, you can turn it on. So imagine you want
to do a big website refresh in September. And you want it to have all the things that you're going
to need for Black Friday on it. Well, you don't want to show the Black Friday stuff yet. So you
hide it behind a feature flag, do all the work ahead of time. And then on Thursday at midnight, you can go ahead and turn it all on instantly without the
risk of pushing untested code into production.
That makes a stunning amount of sense. Wonderful.
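To make that concrete, here is a minimal sketch, in Python, of what hiding the Black Friday work behind a flag might look like in application code. The flag store, flag key, and rendering function are hypothetical stand-ins for illustration, not LaunchDarkly's actual SDK; the point is that the feature ships dark and the launch becomes a single toggle rather than a deploy.

```python
# Hypothetical in-memory flag store; a real system would fetch flag state
# from a service such as LaunchDarkly rather than from a module-level dict.
FLAGS = {"black-friday-banner": False}  # deployed to production, but off

def is_enabled(flag_key: str, default: bool = False) -> bool:
    """Look up a flag, falling back to a safe default if it's unknown."""
    return FLAGS.get(flag_key, default)

def render_homepage() -> str:
    page = "Regular storefront"
    # The Black Friday code is already deployed; it simply isn't shown
    # until someone flips the flag at, say, midnight on Thursday.
    if is_enabled("black-friday-banner"):
        page += " + Black Friday deals"
    return page

print(render_homepage())             # "Regular storefront"
FLAGS["black-friday-banner"] = True  # the launch is a data change, not a redeploy
print(render_homepage())             # "Regular storefront + Black Friday deals"
```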
Yes. We're all about avoiding risk. In fact, I think this is our motto this year: eliminate risk. Which I argued with, because you can't eliminate risk, but you can make it much less risky.
So as far as doing this in production and minimizing risk, pushing it further down the deployment chain, how does this start to impact larger-scale environments?
So I think one of the exciting things about cloud and scale is
that you're doing things across servers and time zones and areas of control. And you don't know
exactly what's going to happen in production. There is no way to test a massive distributed
system except in production. But if you're doing that, you would
like not to be showing everybody you're testing. So imagine you have a massive enterprise-grade
system, and you want to know if this new feature, let's say a toolbar, is going to work right.
Well, the first thing you do is you deploy it
with nobody able to see it.
And then you turn it on just so you and your team can see it.
And then you turn it on so that only people in your company
can see it based on IP.
And then you turn it on for 10% of your customers.
And then you scale up the percentage of customers
who can see it.
This whole time you're doing sampling and metrics
and analysis
to make sure that it's not causing edge case problems or somehow causing your system to
fail or conflate or fall over. So testing isn't a binary, it's a degree.
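A rough sketch of how that kind of percentage rollout can be evaluated: hash the user key so each user lands in the same bucket every time, then compare the bucket against the current rollout percentage. The bucketing scheme and flag key here are illustrative assumptions, not how any particular vendor implements it.

```python
import hashlib

def bucket(user_key: str, flag_key: str) -> float:
    """Deterministically map a user onto [0, 100) so rollouts are sticky."""
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode()).hexdigest()
    return int(digest[:8], 16) % 10000 / 100.0

def sees_feature(user_key: str, flag_key: str, rollout_percent: float) -> bool:
    return bucket(user_key, flag_key) < rollout_percent

# Ramp a hypothetical toolbar: nobody, then a few users, then more and more.
for percent in (0.0, 1.0, 10.0, 50.0, 100.0):
    enabled = [u for u in ("alice", "bob", "carol", "dave")
               if sees_feature(u, "new-toolbar", percent)]
    print(f"{percent:5.1f}% rollout -> {enabled}")
```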
How does this apply to, I guess, other methods of testing large-scale distributed software
in production or otherwise?
So one of the ways that we've tested large-scale software before is to run a bunch of fake data
through. And the problem with fake data is that it's fake and frequently sanitary. It's sort of
like trying to test whether an antiseptic works, but only in a sterile environment. You're just not going to find out
because you're feeding it data that isn't contaminated and grody. So being able to test
in production means that you're going to get authentic data. Another thing that's important
to remember when you're testing at scale is that no matter how good your integration tests are, there's always going to be some sort of wobble in the system.
I think about, say, Autodesk.
I was working for a company doing cloud integration stuff a couple of employers ago. They would spin up thousands and thousands of servers, and then spin them down again really rapidly because they were
using them for basically scaling user 3D printing stuff. And if you couldn't handle the fact that
you were spinning up 2000 servers all at once, it was a real problem. But it's hard to get the
testing capacity to do that.
Gotcha. So does this apply, I would say, let me take a step back. There are some technologies that you tend to see that make
an awful lot of sense for certain use cases, generally at software company startups based
in San Francisco. And if you try and take that model to something like, I don't know, we control all
of the ATMs in North America. A lot of the paradigms that work when you're Twitter for pets
start to fall down at bank of the world. And for example, when a dog can't tweet for two minutes,
that tends to be a different failure domain than the ATM is now spitting out wrong balance
information to a subset of users.
Or 20s.
Oh, yeah, exactly.
That depends entirely on what level of happy or sad you want your users to be.
But the question I'm getting at here is, are feature flags something that maps reasonably
to most workloads?
Or is this something that is better suited for stuff that errors won't
really leave nasty marks and bruises? We certainly think that it is enterprise grade,
and I cannot talk about a lot of our customers. I have to go through and see how we can talk about,
but I will say that Atlassian and Jira are using us, which I think is a pretty significant use case.
I bet a lot of your listeners have a Confluence somewhere.
And we think that it is exactly because you can do feature flagging that it's safer to deploy at enterprise grade.
Because if you have a button, a knob that can turn on a feature, you also have a knob that can turn it off instantly.
We think it's about 200 milliseconds from the time you hit what we call the kill switch
to the time servers stop delivering that broken feature. Imagine the power to say,
wow, we're spitting out 20s from our ATM. Let's roll that back right now.
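A minimal sketch of the kill-switch idea, with hypothetical flag and function names: the risky path is gated on a flag with a safe default, so turning the flag off reroutes traffic without a rollback deploy.

```python
# Hypothetical flag lookup; imagine this value being pushed out to the edge
# by a flag service, so flipping it propagates in a few hundred milliseconds.
FLAGS = {"new-withdrawal-flow": True}

def new_withdrawal_flow(amount: int) -> str:
    return f"new flow dispensing {amount}"

def legacy_withdrawal_flow(amount: int) -> str:
    return f"legacy flow dispensing {amount}"

def dispense_cash(amount: int) -> str:
    # Default to the battle-tested path if the flag is missing or the
    # flag service is unreachable.
    if FLAGS.get("new-withdrawal-flow", False):
        return new_withdrawal_flow(amount)
    return legacy_withdrawal_flow(amount)

# Ops notices the ATMs are spitting out the wrong bills:
FLAGS["new-withdrawal-flow"] = False  # the kill switch -- no rollback deploy needed
print(dispense_cash(100))             # traffic is immediately back on the legacy path
```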
That's compelling, which I guess leads into the next big question here.
How do you get there from where many shops are today? It's easy, or at least relatively easy, to implement something like this in something
completely greenfield where you're not necessarily going
to be having to retrofit it to things. But in practice, we rarely get to work with environments
like that. Here's a 20-year-old PHP app. Time to go ahead and re-architect it to take advantage
of something like feature flags. What does that journey look like?
So people can take a couple of different paths. There is a massive reorganization
path, which is not ideal. Like, nobody enjoys it. It burns a ton of developer time and your
value add is very small right away. But if you're doing something like using feature flags to do
price tiering, where you're showing people the same page, but with different features,
depending on whether they're paying you or not, it's what you have to do. Most of the time,
what we recommend is that people start using it for new features. So you just, as your new coding practice, as your best practice, whenever you create a feature, instead of necessarily making
it a branch, you just wrap it in a feature flag and add some, if this is on and this is off,
defaults, and then go ahead and write your feature.
And you know that it's hidden
behind the magical feature flag curtain
until you're ready to turn it on.
So we say, like, this incremental approach, just new features or just new code bases, is still going to help you a ton. You're still going to see a lot of benefit from it without disrupting and randomizing a working ColdFusion environment.
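One way to picture that "wrap it in a flag with on/off defaults" practice, sketched below with invented names: the new feature merges to the main line immediately, but its committed default keeps it dark until someone flips it.

```python
# Flag defaults committed in the same change as the feature code, so a fresh
# checkout behaves exactly like production until someone turns the flag on.
FLAG_DEFAULTS = {
    "new-checkout-flow": False,  # the half-finished feature this commit adds
    "invoice-pdf-export": True,  # an older feature that already launched
}

def flag_is_on(name: str, overrides: dict | None = None) -> bool:
    """Runtime override if one exists, otherwise the committed default."""
    overrides = overrides or {}
    return overrides.get(name, FLAG_DEFAULTS[name])

def checkout(cart: list[str], overrides: dict | None = None) -> str:
    if flag_is_on("new-checkout-flow", overrides):
        return f"new checkout for {cart}"  # written today, hidden today
    return f"old checkout for {cart}"      # what every user sees for now

print(checkout(["book"]))                               # old path, default off
print(checkout(["book"], {"new-checkout-flow": True}))  # flipped on for testing
```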
Which makes sense. So the next logical evolution of that question, given that this is screaming
in the cloud, how does rolling something like feature flags out change
when you're doing it in a cloud-based context
instead of a traditional on-prem style deployment?
Or does it?
So interestingly, when I say we're feature flags as a service,
we are a cloud-first organization.
We are tightly linked with the CDN.
We're all about the distributed network.
It's not like we're having your request come all the way back to our servers to be evaluated
or to your servers to be evaluated.
We're evaluating on the edge, which gives us a lot of power.
But it also means that it was a little hard for us to move into on-prem.
We do have some on-prem installations, but it's less powerful because we don't have that edge service.
The edge of the cloud is giant.
And sharp.
And sharp.
And the on-prem servers are small and localized.
And so we can do it, but it's sort of not how we think.
We're very much thinking in the cloud as a service because mostly we're working with people who are doing
web-based delivery of features. We're working with companies that are giant web pages or
retailers or people who really need to have always on control of their features.
Which definitely does tend to make sense as you look at the current crop of companies and the historical migration pattern that we're seeing as companies move out of on-prem and into cloud.
A question I have for you, though, is whenever I hear someone say something as a service, in this case, feature flags as a service, my immediate, instinctive, knee-jerk reaction is, oh, okay, so how long is it until AWS comes out with a confusingly named offering around this that tries to eat your lunch, more or less, but somehow manages to have 15 confusing pricing dimensions? Does that change what they can build out for their customers, or how tightly integrated to a particular vendor's offering it becomes?
Interestingly, there is a level of organization that's not interested in buying us because they're doing it themselves. And I think it's possible AWS is not offering this
to customers, but I do think they are using something very much like feature flags for
their own internal development. I know Google is, I know Facebook is. They're operating at
such a giant scale. They have entire teams that are already doing this so that they can serve you
the 15 degrees of confusing, badly worded pricing.
Because they're serving you your 15 confusing things, but they're serving someone else 15
other different confusing things.
And the only way that can be happening is if they're doing feature flags.
Gotcha.
So this is A-B testing taken to an extreme level.
Yes, this is A-B testing, but I like to call it on-beyond A-B testing, on-beyond Z, because A-B testing is just one of the ways that you could be manipulating what people are experiencing.
Wonderful.
I feel like when we break into the level of alphabets to name that across that many dimensions, we run a reasonable risk of inadvertently summoning a demon. As a general rule, are feature flags considered to be a front-end
technology, or is this something that starts to work its way throughout the rest of the stack?
It works through the whole stack. People are using it for, like, our customers are using it for front end page delivery, but also, like I said, pricing
tiers, and also just safer deployments. So if you're doing a significant back end revision,
like I was reading about how Slack upgraded their database back end, they put a new database on top
of their old database, and then switched over slowly,
almost like a blue-green, but they weren't identical. And that was done using feature
flags so that they could slowly shift traffic from one to the other without having anything
irrevocable. Fascinating. I guess this ties into my next question rather neatly, which is,
you mentioned at the beginning
that you've done some documentation work with feature flags or mentioned that in your interview.
What sort of wacky things can you do with feature flags that aren't continuous integration
or delivery based?
You can use it to do white labeling.
So imagine if instead of having 15 different custom websites that are slightly different and you have to maintain, you have one website and you're just pulling the customized look and feel, the CSS, out using a feature flag.
I think you could use it for some really interesting localization and market segmentation. So if you wanted to target all the people in Germany who have previously
expressed an interest in Hamburg United, you would be able to say, deliver that to them.
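A sketch of what white labeling and that kind of attribute-based targeting could look like if you rolled it yourself; the rules, user attributes, and stylesheet names are made up for illustration. The flag here returns a value per user rather than a simple boolean.

```python
# Hypothetical targeting rules for a multivariate flag: instead of a plain
# on/off, the flag serves a value (here, a stylesheet) chosen per user.
WHITE_LABEL_RULES = [
    # (predicate over user attributes, variation to serve)
    (lambda u: u.get("customer") == "acme-corp", "acme.css"),
    (lambda u: u.get("country") == "DE" and "football" in u.get("interests", []),
     "football-promo.css"),
]
DEFAULT_STYLESHEET = "default.css"

def stylesheet_for(user: dict) -> str:
    for matches, variation in WHITE_LABEL_RULES:
        if matches(user):
            return variation
    return DEFAULT_STYLESHEET

print(stylesheet_for({"customer": "acme-corp"}))                     # acme.css
print(stylesheet_for({"country": "DE", "interests": ["football"]}))  # football-promo.css
print(stylesheet_for({}))                                            # default.css
```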
And I'm working on a blog post right now. My CTO is a little dubious about this idea,
but I think you could use feature flags to do some really interesting localization stuff to pull out
different files and do your localizing on the fly using feature flags instead of having to do
browser-based dependencies. Fascinating. Getting back to what you said originally about using
feature flags for documentation, how does that work? So I don't think you should have to read documentation for anything you don't own.
So every feature should have its documentation tied to it, committed as code.
And then if you don't have the extreme module, you will never see the documentation for the
extreme module.
That just won't appear because we'll have turned that flag off.
So being able to synchronize exactly the code that you're using
with exactly the documentation you get
will really cut down on the amount of documentation people want to read.
Because if 20 years of technical writing taught me anything,
it's that nobody wants to be reading documentation.
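A small sketch of that idea, with hypothetical section and flag names: documentation sections carry the flag that guards the feature they describe, so rendering the docs against a reader's enabled flags hides everything they don't own.

```python
# Hypothetical: each documentation section is tagged with the flag that guards
# the feature it describes, so readers only see docs for features they have.
DOC_SECTIONS = [
    ("Getting started", None),                     # always shown
    ("Using the extreme module", "extreme-module"),
    ("Exporting reports", "report-export"),
]

def visible_docs(enabled_flags: set[str]) -> list[str]:
    return [title for title, flag in DOC_SECTIONS
            if flag is None or flag in enabled_flags]

print(visible_docs({"report-export"}))
# ['Getting started', 'Exporting reports'] -- the extreme-module docs never
# appear, because that flag is off for this reader.
```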
It seems sometimes that no one wants to read it at all, to the point where RTFM has become almost a trope in our space. Oh, how do I do this? Read the manual. Yes, I tried. It's an encyclopedia. Could you be slightly more specific, please?
Yeah, I used to call IBM to get the help desk to give me the page that I needed to be reading. It was like, your indexing is really bad, people. But I think it helps if
we remember that everyone who's reading documentation is already a little bit angry
because they couldn't figure it out. And all documentation is essentially a failure of user
interface at some level. I like the idea that everyone who's reading documentation is already slightly upset. They've had a negative experience.
Other than removing, I guess, parts of the documentation that don't apply to them,
with your background in technical writing, how else can you see reasonable ways to address that?
I know that I spend more time swearing at various cloud provider documentation bundles than I'd care to admit publicly.
Well, I think that you should be able to have a customized experience.
And that would mean that not only is stuff
that you're not using hidden from you,
but also you would get synopses of things
that you already know.
So for instance, if you log in as Corey Quinn,
expert AWS person,
nobody's going to explain to you how the certificates work because you've done that enough times.
You know that.
So, it'll just be like a consolidated summary.
And then we'll go into how certificates don't work in these particular circumstances.
And that part would be expanded. So I think you'd be able to do level setting to answer these three questions, and then
we will give you a more accurate representation of what you're actually trying to get an answer
for.
You flatter me.
It tells me my marketing has worked.
I have no idea how most of the certificate stuff works.
I smile, shrug, hand wave over it, and hope no one presses me too closely on it.
My website was down for 10 days because I couldn't figure out how to renew certificates.
Welcome to the eternal joy of anything involving infrastructure. So something you mentioned
earlier was using feature flags to effectively manage and minimize risk. How does that
wind up progressing as companies start to embrace the idea of feature flags?
The thing that we're trying to do is accept that there is always risk in the world.
And what causes disaster is not one failure. It is a multiplication of failures. This goes wrong
and this goes wrong. It's not just that the O-ring got too cold. It's that PowerPoint made it difficult for people to explain to their bosses that the O-ring was too cold and the space shuttle might blow up. All of the failure analysis
you've ever read involves a lot of different factors. And so what we're trying to do with
feature flagging and continuous integration and continuous deployment is break these monolithic
releases up into tiny bite-sized chunks that we can make go forward or go back.
So if you think about it, it's less like putting all of your money on one color of the roulette table
and more like putting it all over the roulette table.
Your odds of something catastrophic happening are much lower.
Gotcha. So effectively what you're doing is reducing failure domain by having fewer
deltas at any given time?
Exactly. Because if we say, I'm only changing this one particular thing, then you can track
what's happening with that feature. And if something goes wrong, you have a way to back
it out, or, as I like to call it, roll it forward. So if you have a deploy that has 20 features in it, which is a really huge deploy in the CI/CD world,
and one of them goes wrong, the old style was to panic and push out the old version,
effectively rolling back all 20 features. Feature flag style is you go, oh, feature X is not working. We're going to
turn that off. All the rest of the features are fine. They're going to roll forward. We're going
to go find out what happened with feature X. Gotcha. It seems almost counterintuitive in
some ways where you have a deploy, things are broken. It's a terrifying moment. The instinct is, since that hurts so badly,
companies generally want to do fewer releases as opposed to more releases that have smaller
change sets. Right. I think this is why people have trouble with weightlifting. It turns out
very few of us can actually lift 300 pounds, but a lot of us could lift 30 pounds 10 times or 3 pounds 100 times.
I want more companies to be lifting 3 pounds 100 times when they release instead of trying to lift this one massive 300 pounds.
I like the metaphor quite a bit.
You're going to hurt yourself if you try and lift that.
Yes, I don't even try to lift the 30 pounds 10 times.
Good Lord.
Yeah, that's not really my skill set
these days. Yeah. How old is your baby? Eight months at this point. And yeah, not quite to 30
pounds yet. Not quite. Hopefully, won't get there for a while, but we'll see. So is there anything
else that you'd like to talk about or mention that you'd like people to take a look at, participate in, throw fire and brimstone upon, etc.?
So I've got a couple things.
If you're in the Bay Area, my company does a monthly meetup called Test in Production.
And we talk more about these things.
And we have people come in and talk to us about how they're testing in production and what their use case is for continuous integration,
continuous deployment, DevOps, that sort of stuff. And it's super fun. And I would like people,
if they have time, to write me in with stories about trunk-based development versus branch-based
development. Because I think it's a philosophical split that we haven't explored a lot in the DevOps industry yet.
The way that I've always found that worked very well
to get people's feedback is to stake out a strong opinion
on one side of an issue or the other,
and then just wait.
You don't even need to give them addresses.
They will come back and give it to you themselves
without prompting.
It's true.
Yeah.
So I want to say, let's just say trunk-based development is probably a better way to go for your enterprise organization than branch-based development.
Well, thank you very much for joining me today, Heidi.
I'm going to disagree with you vehemently as soon as we stop recording.
My name is Corey Quinn.
This has been Screaming in the Cloud.
Thank you for joining me. Thank you. Have fun.
This has been this week's episode of Screaming in the Cloud. You can also find more Corey
at screaminginthecloud.com or wherever fine snark is sold.