Screaming in the Cloud - Putting the “Fun” in Functional with Frank Chen
Episode Date: December 21, 2021

About Frank

Frank Chen is a maker. He develops products and leads software engineering teams with a background in behavior design, engineering leadership, systems reliability engineering, and resiliency research. At Slack, Frank focuses on making engineers' lives simpler, more pleasant, and more productive in the Developer Productivity group. At Palantir, Frank worked with customers in healthcare, finance, government, energy, and consumer packaged goods to solve their hardest problems by transforming how they use data. At Amazon, Frank led a front-end team and an infrastructure team to launch AWS WorkDocs, the first secure multi-platform service of its kind for enterprise customers. At Sandia National Labs, Frank researched resiliency and complexity analysis tooling with the Grid Resiliency group. He received an M.S. in Computer Science focused on Human-Computer Interaction from Stanford; his thesis studied how the design and psychology of exergaming interventions might produce efficacious health outcomes. With the Stanford Prevention Research Center, Frank developed health interventions rooted in behavioral theory to create new behaviors through mobile phones. He prototyped early builds of Tiny Habits with BJ Fogg and worked in the Persuasive Technology Lab. He received a B.S. in Computer Science from UCLA, where he researched networked systems and image processing with the Center for Embedded Networked Systems. With the RAND Corporation, he built research systems to support group decision-making.

Links:

Slack: https://slack.com
"Infrastructure Observability for Changing the Spend Curve": https://slack.engineering/infrastructure-observability-for-changing-the-spend-curve/
"Right Sizing Your Instances Is Nonsense": https://www.lastweekinaws.com/blog/right-sizing-your-instances-is-nonsense/
Personal webpage: https://frankc.net
Twitter: @frankc
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
It seems like there's a new security breach every day.
Are you confident that an old SSH key or a shared admin account isn't going to come back and bite you?
If not, check out Teleport. Teleport is the easiest,
most secure way to access all of your infrastructure. The open source Teleport
access plane consolidates everything you need for secure access to your Linux and Windows servers. And I assure you, there is no third option there.
Kubernetes clusters, databases, and internal applications
like the AWS Management Console, Jenkins, GitLab, Grafana,
Jupyter Notebooks, and more.
Teleport's unique approach is not only more secure,
it also improves developer productivity.
To learn more, visit GoTeleport.com. And no, that's not me telling you to go away. It is GoTeleport.com.
This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of Hello World demos?
Allow me to introduce you to Oracle's Always Free tier.
It provides over 20 free services in infrastructure, networking, databases, observability, management, and security.
And let me be clear here, it's actually free.
There's no surprise billing until you intentionally
and proactively upgrade your account. This means you can provision a virtual machine instance or
spin up an autonomous database that manages itself, all while gaining the networking,
load balancing, and storage resources that somehow never quite make it into most free tiers
needed to support the application that you want to build. With Always Free, you can do things like run small-scale applications or do proof-of-concept
testing without spending a dime. You know that I always like to put an asterisk next to the word free.
This is actually free. No asterisk. Start now. Visit snark.cloud slash oci-free. That's snark.cloud slash oci-free.
Welcome to Screaming in the Cloud. I'm Corey Quinn.
Several people are undoubtedly angrily typing, and part of the reason they can do that,
and the fact that I know that, is because we're all using Slack.
My guest today is Frank Chen, Senior Staff Software Engineer at Slack. So I guess sort of
Salesforce. Frank, thanks for joining me. Hey, Corey. I've been a longtime listener and follower
and just really delighted to be here. It's one of the weird things about doing a podcast
is that for better or worse, people don't respond to it in the same way that they do
writing a newsletter, for example, because you receive an email and, oh, well, I know how to
write an email. I can hit reply and send an email back and give that jack wagon a piece of my mind,
and people often do. But with podcasts, I feel like it's much more closely attuned to the idea
of an AM radio talk show. And who calls into a radio
talk show? Lunatics. And most people don't self-describe as lunatics, so they don't want
to do that. But then when I catch up with people one-on-one or at events in person, I find out that
a lot more people listen to this show than I thought they did because I don't trust podcast
statistics because lies, damn lies, and analytics are sort of how I view this world.
So you've worked at a bunch of different companies. You're at Slack now, which of course upsets some people because Slack is ruining the way that people come and talk to me in the office,
or it's making it easier for employees to collaborate internally in ways their employers
wish they wouldn't. But that's neither here nor there. Before this, you were at Palantir. And before this,
you were at Amazon, working on Amazon WorkDocs, of all things, which is supposedly rumored to have at least one customer somewhere, but I've never seen them. Before that, you were at Sandia
National Labs, and you've gotten a master's in computer science from Stanford. You've done a
lot of things, and everything you've done on some level seems like the recurring theme is someone
on Twitter will be unhappy at you for a career choice you've made. But what is the common thread
in seriousness between the different places that you've been? One thing that's been a driver for
where I work is finding amazing people to work with and building something that I believe is valuable and
fun to keep doing. The thing that brought me to Slack is I became my own Slack admin when I
met a girl and we moved in together into a small apartment in Brooklyn. And she had a cat that,
you know, is a sweetheart, but also just doesn't know how to be social.
Yes, you covered that with cat.
Part of moving in together, I became my own Slack admin and discovered, well,
we can build a series of home automations to better train and inform our little command center
for when the cat lies about being fed or not fed, clipping his nails and discovering
and tracking bad behaviors. In a lot of ways, this was like the human side of a lot of the
data work that I've been doing at my previous role. And it was like a fun way to use the same
frameworks that I use at work to better train and be a cat caretaker. Now, at some point,
you know that some product manager at Amazon
is listening to this
and immediately sketching notes
because their product strategy is yes,
and this is going to be productized
and shipping in two years
as Amazon Prime Meow.
But until then,
we'll enjoy the originality
of having a Slack bot
more or less control the home automation
slash making your house seem haunted
for anyone who didn't write the code themselves.
There's an idea of solving real-world problems that I definitely understand. I mean, and again,
it might not even be a fair question entirely. Just because I am, for better or worse, staggering
through my world and trying and failing most days to tell a narrative that, oh, why did I start my
tech career at a university and then
spend time in ad tech and then spend time in consulting and then fintech and the rest? And
the answer is, oh, I get fired an awful lot and that sucked. So instead of going down that
particular rabbit hole of a mess, I went in other directions. I started finding things that would
pay me and pay me more money
because I wasn't dead at the time, but that was the narrative thread. It was the, I have rent to
pay and they have computers that aren't behaving properly. And that's what dictated the shape of
my career for a long time. It's only in retrospect that I started to identify some of the things that
aligned with it, but it's easy to look at it with the shine of hindsight and not realize that, no, no, that's sort of retconning what happened in the past.
Yeah, I have a mentor, and my former advisor had this way of describing building out janky prototypes, really, really janky ideas for what helping people through technology might look like.
And I feel like in a lot of ways, whether it's a career move or some half-baked tech prototype I put together, it might succeed, and great,
we could keep building upon
that. But when it fails, you actually discover, oh, this is one way that I didn't succeed.
And even in doing so, you discover things about yourself, your way of building,
and maybe a little bit about your infrastructure or whatever it is that you build on a day-to-day
basis. And wrapping that back to your original question, I was like, well, we think we're human
beings, right? We're static. But in a lot of ways, we're human becomings. We think we know what the
future might look like with our careers, what we're building on a day-to-day basis, and what
we're building a year from now. But oftentimes, things change as we discover things about
ourselves, the people we work with, and ultimately the things that we put out into the world. Obviously, I've been aware of who Slack is for a long time. I've
been a paying customer for years because it basically is IRC with reaction GIFs and not
having to teach someone how to sign into IRC when they work in accounting. So the user experience
alone solved the problem. And you've actually worked with us in the past before.
And Slack, it's the Searchable Log
of All Conversation and Knowledge.
I think that's the acronym; that's how it works.
And I was delighted when I had mentioned your jokes
and your trolling of folk on Twitter and on your podcast
to my former engineering manager, Chris Merrill.
He was like, oh, you
should search the Slack. Corey actually worked with us and he put together a lot of cool tooling
and ideas for us to think about. Careful, if we talk too much about what I did when I was at Slack
years ago, someone's going to start looking into some of the old commits and whatnot and start
demanding an apology. And we don't want that. It's, wow, you're right. You are a terrible engineer. He told you.
There's a reason I don't do that anymore.
I think that's all of us.
An early career mentor of mine is like, hey, Frank, listen, you think you're building perfect software at any point in time?
No, you're building future tech debt.
And yeah, we should put much more emphasis on interfaces and ideas we're putting out
because the implementation is going to change over time.
And likely your current implementation is shit.
And that is okay.
That's the beautiful part about this is that things grow and things evolve.
And it's interesting working with companies.
And as a consultant, I tend to build my projects in such a way that I start on day one and
people know that I'm leaving with usually a very short window because I don't want to build a forever job for myself.
I don't want to show up and start charging by the hour or by the day if I can possibly avoid it because then it turns into eternal projects that never end because I'm billing and nothing's ever done.
No, no, I like charging fixed fee and then getting out at a predetermined outcome.
But then you get to hear about what happens with companies as they move on.
This combines with the fact that I have a persistent alert for my name,
usually because I'm looking for various ineffective character assassination
from enterprise marketing types.
Because, you know, I dish it out.
I should certainly be able to take it.
But I found a blog post on the Slack engineering blog that mentioned my name.
And it's, oh crap,
are they coming after me for a refund?
No, it was not.
It was you writing a fairly sizable post.
Tell me more about that.
Yeah, I'm part of an organization called Developer Productivity.
And our goal is to help folk at Slack deliver services to their customers, where we build, test, and release high-quality software.
And a lot of our time is spent thinking about internal tooling
and making infrastructure bets.
As engineers, right, it's like we have this idea for what the world looks like.
We have this idea for what our infrastructure looks like.
But what we discover using a set of techniques around observability
of just asking questions,
advanced questions, basic questions,
and even dumb questions,
we discover, hey, the things that we think
our computers are doing aren't actually doing
what they say they're doing.
And the question is like, great, now what?
How can we ask better questions?
How can we better tune, change, and equip
engineers with tooling so that they can do better work to make Slack customers have simple, pleasant,
and productive experiences? And I have to say that there's a lot that Slack does that is
incredibly helpful. I don't know that I'm necessarily completely bought in to the idea that, oh, all work should
happen in Slack.
It's, well, on some level, people like to debate the, should people work from home?
Should people all work in an office discussion?
And on some level, it seems, if you look at people who are constantly fighting that debate
online, it's, do you ever do work at all on some level?
But I'm not here to besmirch others.
I'm here to talk about, at some level,
what you alluded to in your blog post.
But I want to start with a disclaimer
that Slack, as far as companies go, is not small.
And if you take a look around,
most companies are using Slack,
whether they know it or not. The
list of side-channel Slack groups people have tends to extend massively. I look and I pare it
down every once in a while whenever I cross 40 signed in Slacks on my desktop. It is where people
talk for a wide variety of different reasons and they all do different things. But if you're
sitting here listening to this and you have a $2,000 a month AWS bill,
this is not for you. You will spend orders of magnitude more money trying to optimize a small
cost. Once you're at significant points of scale and you have scaled out to the point where you
begin to have some ability to predict over months or years, that's when a lot of this stuff starts to weigh in. So talk to me a
bit about how you wound up, and let me quote directly from the article, which is titled
Infrastructure Observability for Changing the Spend Curve. And I will, of course, throw a link
to this in the show notes. But you talk in this about knocking, I believe it was orders of magnitude off of various cost areas within your
bill. Yeah. The article itself describes three biggish projects where we are able to change the
curve of the number of tests that we run and a change in how much it costs to run any single
test. When you say test, are you talking CICD infrastructure test or code test to make sure it goes out?
Or are you talking something higher up the stack as far as, huh, let's see how some users
respond when we send four notifications on every message instead of the usual one, to
give a ridiculous example?
Yeah, this is in the CI/CD pipelines. And one of these projects was around
borrowing some concepts from data engineering around oversubscription, planning your capacity so that
at peak, your engineers might have a 5% degradation in performance while still maintaining high resiliency and reliability
of your tests in order to oversubscribe either CPU or memory and keep throughput on the overall
system stable and consistent and fast enough. I think what's spent in developer productivity,
I think both the metrics you're trying to move and what you're optimizing for at any given time are like this calculus,
or it's more art than science. And there's no one right answer, right? It's like,
oh yeah, very naively, let's throw the biggest, most expensive machines we can at any given
problem, but that doesn't solve the crux of your problem. It's like, hey, what are the things in
your system doing? And what is the right guess? The calculus around how much to spend on your
CI/CD infra is oftentimes not precise, nor is this blog article meant to be prescriptive.
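The oversubscription calculus Frank describes can be sketched in a few lines. This is a rough illustration only, with entirely hypothetical numbers: provision for a small, tolerable degradation at peak rather than for the absolute worst case.

```python
import math

def workers_needed(peak_concurrent_tests: int,
                   tests_per_worker: int,
                   oversubscription_factor: float) -> int:
    """Workers required when each worker is packed with
    tests_per_worker * oversubscription_factor tests at peak."""
    effective_capacity = tests_per_worker * oversubscription_factor
    return math.ceil(peak_concurrent_tests / effective_capacity)

# Hypothetical numbers for illustration only.
peak = 4000        # concurrent test executors needed at peak
per_worker = 8     # executors one worker runs comfortably
baseline = workers_needed(peak, per_worker, 1.0)   # no oversubscription: 500 workers
oversub = workers_needed(peak, per_worker, 1.25)   # pack 25% more per worker: 400 workers
# The price of the ~20% smaller fleet is a small, bounded slowdown at peak.
```

The real tradeoff, as Frank notes, is keeping that degradation small enough that throughput and reliability hold.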
It depends entirely on what you're doing and how, because it's on some level, well,
we can save a whole bunch of money if we slow all of our CI/CD runs down by 20 minutes. Yeah,
but then you have a bunch of engineers sitting idle, and I promise you that costs a hell of a
lot more than your cloud bill is going to be. The payroll is almost always a larger expense
than your infrastructure costs.
And if it's not, you should seriously consider firing at least part of your data science team,
but you didn't hear it from me. Yeah. And part of the exploration on profiling and performance
and resiliency was around interrogating what the boundaries and what the constraints were
for our CI/CD pipelines. Because Slack has grown in engineering and in the number of tests we were running on a month-to-month basis,
for a while, from 2017 to mid-2020, we were growing about 10% month-over-month in test suite execution numbers, which means in a given year,
we doubled almost twice,
which is quite a bit of strain on internal resources and a lot of dependent services
where in internal systems,
we oftentimes have more complexity
and less understood changes
in what dependencies your infrastructure might be using,
what business logic your internal services are using
to communicate with one another than you do your production.
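As a quick sanity check on that compounding, 10% month-over-month works out to roughly tripling in a year, between one and two doublings:

```python
# 10% month-over-month growth, compounded over twelve months.
monthly_growth = 1.10
yearly_factor = monthly_growth ** 12
print(round(yearly_factor, 2))  # 3.14: test volume roughly triples year over year
```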
And so by performing a series of curiosity-driven development,
we're able to both answer at that point in time
what our customers internally were doing
and start to put together ideas for eliminating some bottlenecks
and even adding bottlenecks with circuit
breakers where you keep the overall throughput of your system stable while deferring or canceling
work that otherwise might have overloaded dependencies. There's a lot to be said for
understanding what the optimization opportunities are in an environment, understanding what it is
you're attempting to achieve. Having those tests for something like Slack makes an awful lot of sense
because let's be very clear here. When you're building an application that acts as something
people use to do expense reports, it's like one of my previous job examples, it turns out you can
be down for a week and a majority of your customers will never know or care. With Slack, it doesn't
work that way.
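A circuit breaker of the kind Frank mentioned, one that defers or sheds work when a downstream dependency is overloaded so that overall throughput stays stable, might look roughly like this. It's a minimal sketch; all names and thresholds are hypothetical:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after max_failures consecutive failures,
    shed (reject) calls for cooldown seconds instead of continuing to
    hammer an overloaded dependency."""

    def __init__(self, max_failures: int = 5, cooldown: float = 30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                # Defer the work rather than overload the dependency.
                raise RuntimeError("circuit open: deferring work")
            # Cooldown elapsed: close the circuit and try again.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0
        return result
```

Deferred calls can be retried later or dropped outright; either way, the queue stops amplifying the overload.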
Everyone more or less has a continuous monitor
that they're typing into for a good portion of the day,
angrily or otherwise,
and as soon as it misses anything, people know.
And if there's one thing that I love on some level,
seeing a change when I know that Slack's having a blip,
even if I'm not using Slack that day
for anything in particular,
because Twitter explodes about it. Slack is down. I'm now going to tweet some stuff
to my colleagues. All right, you do you, I suppose. And credit where due, Slack doesn't go down nearly
as often as it used to, because as you tend to figure out how these things work, operational
maturity increases through a bunch of tests. Fixing things like durability, reliability, uptime, etc. should always, to some
extent, take precedence, priority-wise, over, let's save some money. Because, yeah, you could turn
everything off and save all the money, but then you don't have a business anymore. It's focus on
where to cut, where to optimize in the right way, and ideally, as you go, find some of the areas in
which, oh, I'm paying AWS a tax for just going
about my business and I could have flipped a switch at any point and saved how much money?
Oh my God, that's more than I'll make in my lifetime. Yeah. And one thing I talk about a
little bit is distributed tracing as one of the drivers for helping us understand what's happening
inside of our systems, where it helps you figure out, and it's like this buzzword to describe, how do you ask questions of deployed code? And in a lot of ways, it's helped us understand
existing bottlenecks and identify opportunities for performance or resiliency gains, because your
past janky band-aids become more and more obvious when you can interrogate and ask questions around
what isn't performing like it used to, or what has changed recently.
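The "asking questions of deployed code" idea comes down to attaching structured, timed, queryable context to every unit of work. A hand-rolled sketch for illustration (a real system would use a tracing library; every name below is hypothetical):

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # in a real system, these would ship to a tracing backend

@contextmanager
def span(name, parent_id=None, **attrs):
    """Record a named, timed span with arbitrary attributes."""
    record = {"id": uuid.uuid4().hex, "name": name,
              "parent_id": parent_id, "attrs": attrs}
    start = time.monotonic()
    try:
        yield record
    finally:
        record["duration_s"] = time.monotonic() - start
        SPANS.append(record)

# An instrumented CI step: now you can ask "which suites got slower,
# and on which runner type?" instead of guessing.
with span("run_test_suite", suite="backend-unit", runner="c5.9xlarge") as s:
    with span("checkout", parent_id=s["id"]):
        pass  # ... fetch code ...
    with span("execute_tests", parent_id=s["id"]):
        pass  # ... run the tests ...

slow = [x for x in SPANS if x["duration_s"] > 1.0]  # one basic question
```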
This episode is sponsored in part by my friends at Cloud Academy. Something special just for you
folks. If you missed their offer on Black Friday or Cyber Monday or whatever day of the week doing
sales it is, good news, they've opened up their Black
Friday promotion for a very limited time. Same deal. $100 off a yearly plan, $249 a year for the
highest quality cloud and tech skills content. Nobody else is going to get this, and you have
to act now because they have assured me this is not going to last for much longer. Go to cloudacademy.com, hit the start free
trial button on the homepage and use the promo code cloud when checking out. That's C-L-O-U-D,
like loud, what I am with a C in front of it. They've got a free trial too, so you'll get seven
days to try it out to make sure it really is a good fit. You've got nothing to lose except your
ignorance about cloud. My thanks to Cloud Academy
once again for sponsoring my ridiculous nonsense. It's also worth pointing out that as systems grow
organically, it is almost impossible for any one person to have it all in their head anymore.
I saw one of the most overly complicated architecture flow trees that I think I've
seen in recent memory, and it was on the Slack engineering
blog about how something was architected, but it wasn't the Slack app itself. It was simply
the decision tree for should we send a notification? And it is more complicated than almost
anything I've written, except maybe my newsletter content publication pipeline. It is massive.
And I'll throw a link to that in the show notes as well,
just because it is well worth people taking a look at.
But there is so much complexity at scale for doing the right thing.
And it's necessary.
Because if I'm talking to you on Slack right now
and getting notifications every time you reply on my phone,
it's not going to take too long
before I turn off notifications everywhere.
And then I don't notice that Slack is there and it becomes useless. And I use something else: ideally
something better, which is hard to come by; moderately worse, like email; or completely worse,
like Microsoft Teams. I tell all my close collaborators about this. I typically set
myself away on Slack because I like to make time for deep focused work. And that's very hard with
a constant stream of notifications. How people use Slack and how people notify others on Slack
is not incumbent on the software itself, but it's a reflection of the work culture
that you're in. That the expectation for an email driven culture is like, oh yeah,
you should be reading your email
all the time and be able to respond within 30 minutes. Peace. I have friends that are lawyers,
and that is the expectation at all times of day. I married one of those. Oh yeah, people get very
salty. And she works with a global team spread everywhere to the point where she wakes up and
there's just a whole flurry of angry people that have
tried to reach her in the middle of the night. Like, why were you sleeping at 2 a.m.? It's
daytime here. And yeah, time zones. Not everyone understands how they work from my estimation.
That's funny. My sweetheart is a former attorney. On our first international date,
we spent an entire day and a half hopping between Wi-Fi spots in Prague
so that she could answer a five-minute question from a partner about standard deviations.
So one thing that you linked to that really is what drew my notice to this, because again,
if you talk about AWS cost optimization, I'm probably going to stumble over it. But if you
mentioned my name, that's sort of a nice accelerator. And you linked to my article called Why Right-Sizing Your
Instances is Nonsense. And that is a little overblown to some extent, but so many folks
talk about it in the cost optimization space, because you can get a bunch of metrics and do
these things programmatically and somewhat without observability into what's going on.
Because, well, I can see how
busy the computers are. And if it's not busy, we could use smaller computers, problem
solved versus the things that require a fair bit of insight into what is that thing doing exactly?
Because it leads you into places of, oh, turn off that idle fleet. That's not doing anything.
It is all labeled backup where you're going to have three seconds of notice before it gets all
the traffic. There's an idea of sometimes things are the way they are for a
reason. And it's also not easy for a lot of things, think databases, to seamlessly just restart the
thing and have it scale back up and run on a different instance class. That takes weeks of
planning and it's hard. So I find that people tend to reach for it where it doesn't often make sense.
At your level of scale
and operational maturity, of course you should optimize what instance classes things are using
and what sizes they are, especially since that stuff changes over time as far as what AWS has
made available. But it's not the sort of thing that I suggest as being the first easy thing to
go for. It's just what people think is easy because it requires no judgment and computers can do it.
At least that's their opinion. I feel like you probably have a lot more experience than me
and talk about war stories, but I recall working with customers where they want to lift and shift
on-prem hardware to VMs on-prem. I'm like, it's not going to be as simple as you're making it
out to be. Whereas the trend today is probably, oh yeah, we're going to shift on-prem VMs to AWS or hell,
let's go two levels deeper and just run everything on Kubernetes. Similar workloads,
right? It's not going to be a huge challenge or everything serverless.
Spare me from that entire school of thought, my God.
Yeah, and it's fun too, because this came out a month ago, and you're talking about using,
an example you gave was a c5.9xlarge instance. Great. Well, the C6i is out now as well. So people are going to look at that someday and think, oh, wow, that's incredibly quaint.
You wrote this a month ago, and it's already out of date as far as what a lot
of the modern story instances are. From my perspective, one of the best things that AWS
has done in this space has been to get away from the reserved instance story and over into savings
plans, where it's, I know I'm going to run some compute. Maybe it's Fargate, maybe it's EC2.
Let's be serious. It's definitely going to be EC2, but I don't want to tie myself to specific instance types for the next three years.
Right. Well, I'm just going to commit to spending some money on AWS for the next three years,
because if I decide today to move off of it, it's going to take me at least that long to get
everything out. So, okay. Then that becomes something that's a lot more palatable for an
awful lot of folks. One thing you brought up in the article I linked
to is instance types. You think upgrading to the newest instance type will solve all your
challenges, but oftentimes it won't, and it's not always obvious why. And in fact, you might even
see degraded resiliency and degraded performance
because different packages that your software relies upon might not be optimized for the given
kernel or CPU type that you're running against. And ultimately, you go back to just asking really
basic questions and performing some end-to-end benchmarking so that you can at least get a sense
for what your customers are doing today and maybe make a guess for what they're going to do tomorrow.
I have to ask, because I'm always interested in what it is that gives rise to blog posts like
this, which, that's easy: someone had to do a project on these things, and along the way learned
things that would probably apply to other folks. You're solving what is effectively a global
problem locally when you go down this path. Part of the reason I have a consulting business is
things I learn at one company apply almost identically to another company, even though
that they're in completely separate industries and parts of the world, because AWS billing is,
for better or worse, a bounded problem space, despite their best efforts to use quantum
computers to fix that.
What was it that gave rise to looking at the CI/CD system from an optimization point of view?
So internally, I initially started writing a white paper about, hey, here's a simple question
that we can answer without too much effort. Let's transition all of our C3 instances to C5 instances.
And that could have been the one and done.
But by thinking about it a little more and kind of drawing out, well, we can actually
borrow a model for over subscription from another field.
We could potentially decrease our spend by quite a bit.
That eventually evolved into a 70-page white paper, no joke, that my former engineering manager said, Frank, no one's going to read this.
Always, always, always. Here's a whole bunch of academic research and the rest. It's like, great. Which of these two buttons do I press is really the question people are getting at. And while it's great to have the research and the academic stuff, it's also, great, we're trying to achieve an outcome, so which is the choice? But it's nice to know that people
are doing actual research on the backend instead of, ah, my gut tells me to take the path on the
left. Cause why not? Left is better. Right's tricky friend. Yeah. And it was like, oh yeah,
I accidentally wrote a really long thing because there were a lot of variables to test. I think we had spun up 16-plus auto-scaling groups and ran
something like the cross section of a couple of representative test suites against them,
as well as configurations for number of executors per instance. And about a year ago, I translated
that into a 10-page blog article that, when I read through, I really didn't enjoy. And that 10-page blog article is ultimately
about a page in the article you're reading today. And the actual kick in the butt to
get this out the door was about four months ago, when I spoke at o11ycon, which you were a part of,
and it was a vendor conference by Honeycomb. And it was just so fun to share some of the things we've
been doing with distributed tracing and how we were able to solve internal problems
using a relatively simple idea of asking questions about what was running.
And the entire team there was wonderful in coaching and just helping me think through
what questions people might have about this work.
And, again, as a former academic,
the last time I spoke at a conference was about a decade earlier.
And it was just so fun to be part of this community of people trying to all solve the same set of problems just in their own unique ways.
One of the things I loved about working with Honeycomb was the fact that whenever I asked
them a question, they had instrumented their own stuff.
So they could tell me extremely quickly what something was doing, how it was doing it,
and what the overall impact on this was. It's very rare to find a client that is anywhere near that
level of awareness into what's going on in their infrastructure.
Yeah. And that blog article, right? It's like, here's our current perspective, and here's the current set of changes we were
able to make to get to this result. And we think we know what we want to do. But if you were to ask that same question,
what are we doing for our spend a year from now? The answer might be very different,
probably similar in some ways, but probably different.
Well, there are some principles that we'll never get away from.
Is no one using the thing? Turn that shit off. That's one of those tried-and-true things. Oh, it's the third copy of that multi-petabyte dataset? Maybe
delete it or stuff it in a deep archive. Maybe move data around less between various places.
Maybe log things fewer times, given that you're paying 50 cents per gigabyte ingest in some cases,
et cetera, et cetera, et cetera. There's a lot to consider as far as the general principles go,
but the specifics, well, that's where it gets into the weeds. And at your scale, yeah,
having people focus on this internally with the context and nuance to it is absolutely worth doing.
Having a small team devoted to this at large companies will pay for itself, I promise. Now,
I go in and advise in these scenarios, but past a certain point,
this can't just be one person's part-time gig anymore.
I'm kind of curious about that.
How do you think about working with a company
and then deprecating yourself
and allowing your tools
and the frameworks you put into place
to continue to thrive?
We're advisory only.
We make no changes to production.
Or I don't know if that's the right word, deprecate.
That's my own word.
No, no, it's fair.
What we do is we go in and we are advisory.
It's less of a cost engagement, more of an architecture engagement, because in cloud,
cost and architecture are the same thing.
We look at what's going on.
We look at the constraints of why we've been brought in, and we identify things that companies
can do and the
cost savings associated with each, and let them make their own decision. Because
if I come in and say, hey, you could save a bunch of money by migrating this whole subsystem to
serverless, great, I sound like a lunatic evangelist because, yeah, but that's 18 months of work,
during which time the team doing that is not advancing the state of the business any further,
so it's never going to happen. So why even suggest it? Just look at the things that are within the bounds of possibility.
Counterpoint, when a client says a full rearchitecture is on the table, well, okay,
that changes the nature of what we're suggesting. But we're trying to get away from what a lot of
tooling does, which is, great, here's 700 things you can adjust, and you'll do none of them.
We come back with, yeah, here's three or four things you can do that'll blow 20% off the bill. Then let's see where you stand. The other
half of it, of course, is large-scale enterprise contract negotiation, but that's a bit of a horse
of a different color. I want to thank you so much for taking the time to speak with me today. I
really do appreciate it. If folks want to hear more about what you're up to and how you think
about these things, where can they find you?
You can find me at frankc.net or @frankc on Twitter.
Oh, inviting people to yell at you on Twitter.
That's never a great plan.
Yeesh, good luck.
Thanks again.
We've absolutely got to talk more about this in depth, because I think this is one of those areas where folks above a certain point of scale talk
about these things semi-constantly and live in the space, whereas folks who are in relatively
small-scale environments are listening to this and thinking that they've got to do this too. And no,
no, you do not want to spend millions of dollars of engineering effort to optimize a bill that's
80 grand a year. I promise. Focus on the thing that's right for your business. And at a
certain point of scale, this becomes that. But thank you so much for being so generous with your time. I appreciate
it. Thank you so much, Corey. Frank Chen, Senior Staff Software Engineer at Slack. I'm cloud
economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please
leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your
podcast platform of choice, along with an angry comment that seems to completely miss the fact
that Microsoft Teams is free because it sucks. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business and we get to the point.
Visit duckbillgroup.com to get started.
This has been a HumblePod production.
Stay humble.