Screaming in the Cloud - Putting the “Fun” in Functional with Frank Chen
Episode Date: December 21, 2021

About Frank

Frank Chen is a maker. He develops products and leads software engineering teams with a background in behavior design, engineering leadership, systems reliability engineering, and resiliency research. At Slack, Frank focuses on making engineers' lives simpler, more pleasant, and more productive in the Developer Productivity group. At Palantir, Frank worked with customers in healthcare, finance, government, energy, and consumer packaged goods to solve their hardest problems by transforming how they use data. At Amazon, Frank led a front-end team and an infrastructure team to launch AWS WorkDocs, the first secure multi-platform service of its kind for enterprise customers. At Sandia National Labs, Frank researched resiliency and complexity analysis tooling with the Grid Resiliency group. He received an M.S. in Computer Science focused on Human-Computer Interaction from Stanford; his thesis studied how the design and psychology of exergaming interventions might produce efficacious health outcomes. With the Stanford Prevention Research Center, Frank developed health interventions rooted in behavioral theory to create new behaviors through mobile phones. He prototyped early builds of Tiny Habits with BJ Fogg and worked in the Persuasive Technology Lab. He received a B.S. in Computer Science from UCLA, where he researched networked systems and image processing with the Center for Embedded Networked Systems. With the RAND Corporation, he built research systems to support group decision-making.

Links:

Slack: https://slack.com
"Infrastructure Observability for Changing the Spend Curve": https://slack.engineering/infrastructure-observability-for-changing-the-spend-curve/
"Right Sizing Your Instances Is Nonsense": https://www.lastweekinaws.com/blog/right-sizing-your-instances-is-nonsense/
Personal webpage: https://frankc.net
Twitter: @frankc
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
It seems like there's a new security breach every day.
Are you confident that an old SSH key or a shared admin account isn't going to come back and bite you?
If not, check out Teleport. Teleport is the easiest,
most secure way to access all of your infrastructure. The open source Teleport
access plane consolidates everything you need for secure access to your Linux and Windows servers. And I assure you, there is no third option there.
Kubernetes clusters, databases, and internal applications
like the AWS Management Console, Jenkins, GitLab, Grafana,
Jupyter Notebooks, and more.
Teleport's unique approach is not only more secure,
it also improves developer productivity.
To learn more, visit GoTeleport.com. And no, that's not me telling you to go away. It is GoTeleport.com.
This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of Hello World demos?
Allow me to introduce you to Oracle's Always Free tier.
It provides over 20 free services in infrastructure, networking, databases, observability, management, and security.
And let me be clear here, it's actually free.
There's no surprise billing until you intentionally
and proactively upgrade your account. This means you can provision a virtual machine instance or
spin up an autonomous database that manages itself, all while gaining the networking,
load balancing, and storage resources that somehow never quite make it into most free tiers
needed to support the application that you want to build. With Always Free, you can do things like run small-scale applications or do proof-of-concept
testing without spending a dime. You know that I always like to put an asterisk next to the word free.
This is actually free. No asterisk. Start now. Visit snark.cloud slash oci-free. That's snark.cloud slash oci-free.
Welcome to Screaming in the Cloud. I'm Corey Quinn.
Several people are undoubtedly angrily typing, and part of the reason they can do that,
and the fact that I know that, is because we're all using Slack.
My guest today is Frank Chen, Senior Staff Software Engineer at Slack. So I guess sort of
Salesforce. Frank, thanks for joining me. Hey, Corey. I've been a longtime listener and follower
and just really delighted to be here. It's one of the weird things about doing a podcast
is that for better or worse, people don't respond to it in the same way that they do
writing a newsletter, for example, because you receive an email and, oh, well, I know how to
write an email. I can hit reply and send an email back and give that jack wagon a piece of my mind,
and people often do. But with podcasts, I feel like it's much more closely attuned to the idea
of an AM radio talk show. And who calls into a radio
talk show? Lunatics. And most people don't self-describe as lunatics, so they don't want
to do that. But then when I catch up with people one-on-one or at events in person, I find out that
a lot more people listen to this show than I thought they did because I don't trust podcast
statistics because lies, damn lies, and analytics are sort of how I view this world.
So you've worked at a bunch of different companies. You're at Slack now, which of course upsets some people because Slack is ruining the way that people come and talk to me in the office,
or it's making it easier for employees to collaborate internally in ways their employers
wish they wouldn't. But that's neither here nor there. Before this, you were at Palantir. And before this,
you were at Amazon, working on Amazon WorkDocs, of all things, which is supposedly rumored to have at least one customer somewhere, but I've never seen them. Before that, you were at Sandia
National Labs, and you've gotten a master's in computer science from Stanford. You've done a
lot of things, and everything you've done on some level seems like the recurring theme is someone
on Twitter will be unhappy at you for a career choice you've made. But what is the common thread
in seriousness between the different places that you've been? One thing that's been a driver for
where I work is finding amazing people to work with and building something that I believe is valuable and
fun to keep doing. The thing that brought me to Slack is I became my own Slack admin when I
met a girl and we moved in together into a small apartment in Brooklyn. And she had a cat that,
you know, is a sweetheart, but also just doesn't know how to be social.
Yes, you covered that with cat.
Part of moving in together, I became my own Slack admin and discovered, well,
we can build a series of home automations to better train and inform our little command center
for when the cat lies about being fed or not fed, clipping his nails and discovering
and tracking bad behaviors. In a lot of ways, this was like the human side of a lot of the
data work that I've been doing at my previous role. And it was like a fun way to use the same
frameworks that I use at work to better train and be a cat caretaker. Now, at some point,
you know that some product manager at Amazon
is listening to this
and immediately sketching notes
because their product strategy is yes,
and this is going to be productized
and shipping in two years
as Amazon Prime Meow.
But until then,
we'll enjoy the originality
of having a Slack bot
more or less control the home automation
slash making your house seem haunted
for anyone who didn't write the code themselves.
There's an idea of solving real-world problems that I definitely understand. I mean, and again,
it might not even be a fair question entirely. Just because I am, for better or worse, staggering
through my world and trying and failing most days to tell a narrative that, oh, why did I start my
tech career at a university and then
spend time in ad tech and then spend time in consulting and then fintech and the rest? And
the answer is, oh, I get fired an awful lot and that sucked. So instead of going down that
particular rabbit hole of a mess, I went in other directions. I started finding things that would
pay me and pay me more money
because I wasn't dead at the time, but that was the narrative thread. It was the, I have rent to
pay and they have computers that aren't behaving properly. And that's what dictated the shape of
my career for a long time. It's only in retrospect that I started to identify some of the things that
aligned with it, but it's easy to look at it with the shine of hindsight and not realize that, no, no, that's sort of retconning what happened in the past.
Yeah, I have a mentor, and my former advisor had this way of describing building out janky prototypes, really, really janky ideas for what helping people through technology might look like.
And I feel like in a lot of ways, whether it's a career move or some half-baked tech prototype I put together, it might succeed, and great,
we could keep building upon
that. But when it fails, you actually discover, oh, this is one way that I didn't succeed.
And even in doing so, you discover things about yourself, your way of building,
and maybe a little bit about your infrastructure or whatever it is that you build on a day-to-day
basis. And wrapping that back to your original question, I was like, well, we think we're human
beings, right? We're static. But in a lot of ways, we're human becomings. We think we know what the
future might look like with our careers, what we're building on a day-to-day basis, and what
we're building a year from now. But oftentimes, things change as we discover things about
ourselves, the people we work with, and ultimately the things that we put out into the world. Obviously, I've been aware of who Slack is for a long time. I've
been a paying customer for years because it basically is IRC with reaction GIFs and not
having to teach someone how to sign into IRC when they work in accounting. So the user experience
alone solved the problem. And you've actually worked with us in the past before.
And Slack, it's the Searchable Log
of All Conversation and Knowledge.
I think that's the acronym; that's how it works.
And I was delighted when I had mentioned your jokes
and your trolling of folk on Twitter and on your podcast
to my former engineering manager, Chris Merrill.
He was like, oh, you
should search the Slack. Corey actually worked with us and he put together a lot of cool tooling
and ideas for us to think about. Careful, if we talk too much about what I did when I was at Slack
years ago, someone's going to start looking into some of the old commits and whatnot and start
demanding an apology. And we don't want that. It's, wow, you're right. You are a terrible engineer. He told you.
There's a reason I don't do that anymore.
I think that's all of us.
An early career mentor of mine is like, hey, Frank, listen, you think you're building perfect software at any point in time?
No, you're building future tech debt.
And yeah, we should put much more emphasis on interfaces and ideas we're putting out
because the implementation is going to change over time.
And likely your current implementation is shit.
And that is okay.
That's the beautiful part about this is that things grow and things evolve.
And it's interesting working with companies.
And as a consultant, I tend to build my projects in such a way that I start on day one and
people know that I'm leaving with usually a very short window because I don't want to build a forever job for myself.
I don't want to show up and start charging by the hour or by the day if I can possibly avoid it because then it turns into eternal projects that never end because I'm billing and nothing's ever done.
No, no, I like charging fixed fee and then getting out at a predetermined outcome.
But then you get to hear about what happens with companies as they move on.
This combines with the fact that I have a persistent alert for my name,
usually because I'm looking for various ineffective character assassination
from enterprise marketing types.
Because, you know, I dish it out.
I should certainly be able to take it.
But I found a blog post on the Slack engineering blog that mentioned my name.
And it's, oh crap,
are they coming after me for a refund?
No, it was not.
It was you writing a fairly sizable post.
Tell me more about that.
Yeah, I'm part of an organization called Developer Productivity.
And our goal is to help folk at Slack deliver services to their customers, where we build, test, and release high-quality software.
And a lot of our time is spent thinking about internal tooling
and making infrastructure bets.
As engineers, right, it's like we have this idea for what the world looks like.
We have this idea for what our infrastructure looks like.
But what we discover using a set of techniques around observability
of just asking questions,
advanced questions, basic questions,
and even dumb questions,
we discover, hey, the things that we think
our computers are doing aren't actually doing
what they say they're doing.
And the question is like, great, now what?
How can we ask better questions?
How can we better tune, change, and equip
engineers with tooling so that they can do better work to make Slack customers have simple, pleasant,
and productive experiences? And I have to say that there's a lot that Slack does that is
incredibly helpful. I don't know that I'm necessarily completely bought in to the idea that, oh, all work should
happen in Slack.
It's, well, on some level, people like to debate the, should people work from home?
Should people all work in an office discussion?
And on some level, it seems, if you look at people who are constantly fighting that debate
online, it's, do you ever do work at all on some level?
But I'm not here to besmirch others.
I'm here to talk about, at some level,
what you alluded to in your blog post.
But I want to start with a disclaimer
that Slack, as far as companies go, is not small.
And if you take a look around,
most companies are using Slack,
whether they know it or not. The
list of side-channel Slack groups people have tends to extend massively. I look and I pare it
down every once in a while whenever I cross 40 signed in Slacks on my desktop. It is where people
talk for a wide variety of different reasons and they all do different things. But if you're
sitting here listening to this and you have a $2,000 a month AWS bill,
this is not for you. You will spend orders of magnitude more money trying to optimize a small
cost. Once you're at significant points of scale and you have scaled out to the point where you
begin to have some ability to predict over months or years, that's when a lot of this stuff starts to weigh in. So talk to me a
bit about how you wound up, and let me quote directly from the article, which is titled
Infrastructure Observability for Changing the Spend Curve. And I will, of course, throw a link
to this in the show notes. But you talk in this about knocking, I believe it was orders of magnitude off of various cost areas within your
bill. Yeah. The article itself describes three biggish projects where we are able to change the
curve of the number of tests that we run and a change in how much it costs to run any single
test. When you say test, are you talking CICD infrastructure test or code test to make sure it goes out?
Or are you talking something higher up the stack as far as, huh, let's see how some users
respond when we send four notifications on every message instead of the usual one, to
give a ridiculous example?
Yeah, this is in the CI/CD pipelines. And one of these projects was around
borrowing some concepts from data engineering around oversubscription, planning your capacity so that
at peak, your engineers might have a 5% degradation in performance while still maintaining high resiliency and reliability
of your tests in order to oversubscribe either CPU or memory and keep throughput on the overall
system stable and consistent and fast enough. I think what's spent in developer productivity,
I think both the metrics you're trying to move and what you're optimizing for at any given time are like this calculus,
or it's more art than science. And there's no one right answer, right? It's like,
oh yeah, very naively, let's throw the biggest, most expensive machines we can at any given
problem, but that doesn't solve the crux of your problem. It's like, hey, what are the things in
your system doing? And what is the right guess? The calculus around how much to spend on your
CI/CD infra is oftentimes not precise, nor is this blog article meant to be prescriptive.
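The oversubscription calculus Frank describes can be sketched in a few lines. This is a rough illustration only, with entirely hypothetical numbers: provision for a small, tolerable degradation at peak rather than for the absolute worst case.

```python
import math

def workers_needed(peak_concurrent_tests: int,
                   tests_per_worker: int,
                   oversubscription_factor: float) -> int:
    """Workers required when each worker is packed with
    tests_per_worker * oversubscription_factor tests at peak."""
    effective_capacity = tests_per_worker * oversubscription_factor
    return math.ceil(peak_concurrent_tests / effective_capacity)

# Hypothetical numbers for illustration only.
peak = 4000        # concurrent test executors needed at peak
per_worker = 8     # executors one worker runs comfortably
baseline = workers_needed(peak, per_worker, 1.0)   # no oversubscription: 500 workers
oversub = workers_needed(peak, per_worker, 1.25)   # pack 25% more per worker: 400 workers
# The price of the ~20% smaller fleet is a small, bounded slowdown at peak.
```

The real tradeoff, as Frank notes, is keeping that degradation small enough that throughput and reliability hold.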
It depends entirely on what you're doing and how, because it's on some level, well,
we can save a whole bunch of money if we slow all of our CI/CD runs down by 20 minutes. Yeah,
but then you have a bunch of engineers sitting idle, and I promise you that costs a hell of a
lot more than your cloud bill is going to be. The payroll is almost always a larger expense
than your infrastructure costs.
And if it's not, you should seriously consider firing at least part of your data science team,
but you didn't hear it from me. Yeah. And part of the exploration on profiling and performance
and resiliency was around interrogating what the boundaries and what the constraints were
for our CI/CD pipelines. Because Slack has grown in engineering and in the number of tests we were running on a month-to-month basis,
for a while, from 2017 to mid-2020, we were growing about 10% month-over-month in test suite execution numbers, which means in a given year,
we doubled almost twice,
which is quite a bit of strain on internal resources and a lot of dependent services
where in internal systems,
we oftentimes have more complexity
and less understood changes
in what dependencies your infrastructure might be using,
what business logic your internal services are using
to communicate with one another than you do your production.
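As a quick sanity check on that compounding, 10% month-over-month works out to roughly tripling in a year, between one and two doublings:

```python
# 10% month-over-month growth, compounded over twelve months.
monthly_growth = 1.10
yearly_factor = monthly_growth ** 12
print(round(yearly_factor, 2))  # 3.14: test volume roughly triples year over year
```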
And so by performing a series of curiosity-driven development,
we're able to both answer at that point in time
what our customers internally were doing
and start to put together ideas for eliminating some bottlenecks
and even adding bottlenecks with circuit
breakers where you keep the overall throughput of your system stable while deferring or canceling
work that otherwise might have overloaded dependencies. There's a lot to be said for
understanding what the optimization opportunities are in an environment, understanding what it is
you're attempting to achieve. Having those tests for something like Slack makes an awful lot of sense
because let's be very clear here. When you're building an application that acts as something
people use to do expense reports, it's like one of my previous job examples, it turns out you can
be down for a week and a majority of your customers will never know or care. With Slack, it doesn't
work that way.
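A circuit breaker of the kind Frank mentioned, one that defers or sheds work when a downstream dependency is overloaded so that overall throughput stays stable, might look roughly like this. It's a minimal sketch; all names and thresholds are hypothetical:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after max_failures consecutive failures,
    shed (reject) calls for cooldown seconds instead of continuing to
    hammer an overloaded dependency."""

    def __init__(self, max_failures: int = 5, cooldown: float = 30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                # Defer the work rather than overload the dependency.
                raise RuntimeError("circuit open: deferring work")
            # Cooldown elapsed: close the circuit and try again.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0
        return result
```

Deferred calls can be retried later or dropped outright; either way, the queue stops amplifying the overload.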
Everyone more or less has a continuous monitor
that they're typing into for a good portion of the day,
angrily or otherwise,
and as soon as it misses anything, people know.
And if there's one thing that I love on some level,
seeing a change when I know that Slack's having a blip,
even if I'm not using Slack that day
for anything in particular,
because Twitter explodes about it. Slack is down. I'm now going to tweet some stuff
to my colleagues. All right, you do you, I suppose. And credit where due, Slack doesn't go down nearly
as often as it used to, because as you tend to figure out how these things work, operational
maturity increases through a bunch of tests. Fixing things like durability, reliability, uptime, etc. should always, to some
extent, take precedence, priority-wise, over, let's save some money. Because, yeah, you could turn
everything off and save all the money, but then you don't have a business anymore. It's focus on
where to cut, where to optimize in the right way, and ideally, as you go, find some of the areas in
which, oh, I'm paying AWS a tax for just going
about my business and I could have flipped a switch at any point and saved how much money?
Oh my God, that's more than I'll make in my lifetime. Yeah. And one thing I talk about a
little bit is distributed tracing as one of the drivers for helping us understand what's happening
inside of our systems, where it helps you figure out, and it's like this buzzword to describe, how do you ask questions of deployed code? And in a lot of ways, it's helped us understand
existing bottlenecks and identify opportunities for performance or resiliency gains, because your
past janky band-aids become more and more obvious when you can interrogate and ask questions around
what isn't performing like it used to, or what has changed recently.
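The "asking questions of deployed code" idea comes down to attaching structured, timed, queryable context to every unit of work. A hand-rolled sketch for illustration (a real system would use a tracing library; every name below is hypothetical):

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # in a real system, these would ship to a tracing backend

@contextmanager
def span(name, parent_id=None, **attrs):
    """Record a named, timed span with arbitrary attributes."""
    record = {"id": uuid.uuid4().hex, "name": name,
              "parent_id": parent_id, "attrs": attrs}
    start = time.monotonic()
    try:
        yield record
    finally:
        record["duration_s"] = time.monotonic() - start
        SPANS.append(record)

# An instrumented CI step: now you can ask "which suites got slower,
# and on which runner type?" instead of guessing.
with span("run_test_suite", suite="backend-unit", runner="c5.9xlarge") as s:
    with span("checkout", parent_id=s["id"]):
        pass  # ... fetch code ...
    with span("execute_tests", parent_id=s["id"]):
        pass  # ... run the tests ...

slow = [x for x in SPANS if x["duration_s"] > 1.0]  # one basic question
```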
This episode is sponsored in part by my friends at Cloud Academy. Something special just for you
folks. If you missed their offer on Black Friday or Cyber Monday or whatever day of the week doing
sales it is, good news, they've opened up their Black
Friday promotion for a very limited time. Same deal. $100 off a yearly plan, $249 a year for the
highest quality cloud and tech skills content. Nobody else is going to get this, and you have
to act now because they have assured me this is not going to last for much longer. Go to cloudacademy.com, hit the start free
trial button on the homepage and use the promo code cloud when checking out. That's C-L-O-U-D,
like loud, what I am with a C in front of it. They've got a free trial too, so you'll get seven
days to try it out to make sure it really is a good fit. You've got nothing to lose except your
ignorance about cloud. My thanks to Cloud Academy
once again for sponsoring my ridiculous nonsense. It's also worth pointing out that as systems grow
organically, it is almost impossible for any one person to have it all in their head anymore.
I saw one of the most overly complicated architecture flow trees that I think I've
seen in recent memory, and it was on the Slack engineering
blog about how something was architected, but it wasn't the Slack app itself. It was simply
the decision tree for should we send a notification? And it is more complicated than almost
anything I've written, except maybe my newsletter content publication pipeline. It is massive.
And I'll throw a link to that in the show notes as well,
just because it is well worth people taking a look at.
But there is so much complexity at scale for doing the right thing.
And it's necessary.
Because if I'm talking to you on Slack right now
and getting notifications every time you reply on my phone,
it's not going to take too long
before I turn off notifications everywhere.
And then I don't notice that Slack is there and it becomes useless. And I use something else: ideally
something better, which is hard to come by; moderately worse, like email; or completely worse,
like Microsoft Teams. I tell all my close collaborators about this. I typically set
myself away on Slack because I like to make time for deep focused work. And that's very hard with
a constant stream of notifications. How people use Slack and how people notify others on Slack
is not incumbent on the software itself, but it's a reflection of the work culture
that you're in. That the expectation for an email driven culture is like, oh yeah,
you should be reading your email
all the time and be able to respond within 30 minutes. Peace. I have friends that are lawyers,
and that is the expectation at all times of day. I married one of those. Oh yeah, people get very
salty. And she works with a global team spread everywhere to the point where she wakes up and
there's just a whole flurry of angry people that have
tried to reach her in the middle of the night. Like, why were you sleeping at 2 a.m.? It's
daytime here. And yeah, time zones. Not everyone understands how they work from my estimation.
That's funny. My sweetheart is a former attorney. On our first international date,
we spent an entire day and a half hopping between Wi-Fi spots in Prague
so that she could answer a five-minute question from a partner about standard deviations.
So one thing that you linked to that really is what drew my notice to this, because again,
if you talk about AWS cost optimization, I'm probably going to stumble over it. But if you
mentioned my name, that's sort of a nice accelerator. And you linked to my article called Why Right-Sizing Your
Instances is Nonsense. And that is a little overblown to some extent, but so many folks
talk about it in the cost optimization space, because you can get a bunch of metrics and do
these things programmatically and somewhat without observability into what's going on.
Because, well, I can see how
busy the computers are. And if it's not busy, we could use smaller computers, problem
solved versus the things that require a fair bit of insight into what is that thing doing exactly?
Because it leads you into places of, oh, turn off that idle fleet. That's not doing anything.
It is all labeled backup where you're going to have three seconds of notice before it gets all
the traffic. There's an idea of sometimes things are the way they are for a
reason. And it's also not easy for a lot of things, think databases, to seamlessly just restart the
thing and have it scale back up and run on a different instance class. That takes weeks of
planning and it's hard. So I find that people tend to reach for it where it doesn't often make sense.
At your level of scale
and operational maturity, of course you should optimize what instance classes things are using
and what sizes they are, especially since that stuff changes over time as far as what AWS has
made available. But it's not the sort of thing that I suggest as being the first easy thing to
go for. It's just what people think is easy because it requires no judgment and computers can do it.
At least that's their opinion. I feel like you probably have a lot more experience than me
and talk about war stories, but I recall working with customers where they want to lift and shift
on-prem hardware to VMs on-prem. I'm like, it's not going to be as simple as you're making it
out to be. Whereas the trend today is probably, oh yeah, we're going to shift on-prem VMs to AWS or hell,
let's go two levels deeper and just run everything on Kubernetes. Similar workloads,
right? It's not going to be a huge challenge or everything serverless.
Spare me from that entire school of thought, my God.
Yeah, and it's fun too, because this came out a month ago, and you're talking about using,
an example you gave was a c5.9xlarge instance. Great. Well, the C6i is out now as well. So people are going to look at that someday and think, oh, wow, that's incredibly quaint.
You wrote this a month ago, and it's already out of date as far as what a lot
of the modern story instances are. From my perspective, one of the best things that AWS
has done in this space has been to get away from the reserved instance story and over into savings
plans, where it's, I know I'm going to run some compute. Maybe it's Fargate, maybe it's EC2.
Let's be serious. It's definitely going to be EC2, but I don't want to tie myself to specific instance types for the next three years.
Right. Well, I'm just going to commit to spending some money on AWS for the next three years,
because if I decide today to move off of it, it's going to take me at least that long to get
everything out. So, okay. Then that becomes something that's a lot more palatable for an
awful lot of folks. One thing you brought up in the article I linked
to is instance types. You think upgrading to the newest instance type will solve all your
challenges, but oftentimes it won't, and it's not always obvious why. And in fact, you might even
see degraded resiliency and degraded performance
because different packages that your software relies upon might not be optimized for the given
kernel or CPU type that you're running against. And ultimately, you go back to just asking really
basic questions and performing some end-to-end benchmarking so that you can at least get a sense
for what your customers are doing today and maybe make a guess for what they're going to do tomorrow.
I have to ask, because I'm always interested in what it is that gives rise to blog posts like
this, which, that's easy: someone had to do a project on these things, and along the way learned
things that would probably apply to other folks. You're solving what is effectively a global
problem locally when you go down this path. Part of the reason I have a consulting business is
things I learn at one company apply almost identically to another company, even though
that they're in completely separate industries and parts of the world, because AWS billing is,
for better or worse, a bounded problem space, despite their best efforts to use quantum
computers to fix that.
What was it that gave rise to looking at the CI/CD system from an optimization point of view?
So internally, I initially started writing a white paper about, hey, here's a simple question
that we can answer without too much effort. Let's transition all of our C3 instances to C5 instances.
And that could have been the one and done.
But by thinking about it a little more and kind of drawing out, well, we can actually
borrow a model for over subscription from another field.
We could potentially decrease our spend by quite a bit.
That eventually evolved into a 70-page white paper, no joke, that my former engineering manager said, Frank, no one's going to read this.
Always, always, always. Here's a whole bunch of academic research and the rest. It's like, great. Which of these two buttons do I press is really the question people are getting at. And while it's great to have the research and the academic stuff, it's also, great, we're trying to achieve an outcome, so which is the choice? But it's nice to know that people
are doing actual research on the backend instead of, ah, my gut tells me to take the path on the
left. Cause why not? Left is better. Right's tricky friend. Yeah. And it was like, oh yeah,
I accidentally wrote a really long thing because there were a lot of variables to test. I think we had spun up 16-plus auto-scaling groups and ran
something like the cross section of a couple of representative test suites against them,
as well as configurations for number of executors per instance. And about a year ago, I translated
that into a 10-page blog article that, when I read through, I really didn't enjoy. And that 10-page blog article is ultimately
about a page in the article you're reading today. And the actual kick in the butt to
get this out the door was about four months ago, when I spoke at o11ycon, which you were a part of,
and it was a vendor conference by Honeycomb. And it was just so fun to share some of the things we've
been doing with distributed tracing and how we were able to solve internal problems
using a relatively simple idea of asking questions about what was running.
And the entire team there was wonderful in coaching and just helping me think through
what questions people might have about this work.
And, again, as a former academic,
the last time I spoke at a conference was about a decade earlier.
And it was just so fun to be part of this community of people trying to all solve the same set of problems just in their own unique ways.
One of the things I loved about working with Honeycomb was the fact that whenever I asked
them a question, they had instrumented their own stuff.
So they could tell me extremely quickly what something was doing, how it was doing it,
and what the overall impact on this was. It's very rare to find a client that is anywhere near that
level of awareness into what's going on in their infrastructure.
Yeah. And that blog article, right? It's like, here's our current perspective, and here's the current set of changes we were
able to make to get to this result. And we think we know what we want to do. But if you were to ask that same question,
what are we doing for our spend a year from now? The answer might be very different,
probably similar in some ways, but probably different.
Well, there are some principles that we'll never get away from.
Is no one using the thing? Turn that shit off. That's one of those tried-and-true things. Oh, it's the third copy of that multi-petabyte dataset? Maybe
delete it or stuff it in a deep archive. Maybe move data around less between various places.
Maybe log things fewer times, given that you're paying 50 cents per gigabyte ingest in some cases,
et cetera, et cetera, et cetera. There's a lot to consider as far as the general principles go,
but the specifics, well, that's where it gets into the weeds. And at your scale, yeah,
having people focus on this internally with the context and nuance to it is absolutely worth doing.
Having a small team devoted to this at large companies will pay for itself, I promise. Now,
I go in and advise in these scenarios, but past a certain point,
this can't just be one person's part-time gig anymore.
I'm kind of curious about that.
How do you think about working with a company
and then deprecating yourself
and allowing your tools
and the frameworks you put into place
to continue to thrive?
We're advisory only.
We make no changes to production.
Or I don't know if that's the right word, deprecate.
That's my own word.
No, no, it's fair.
What we do is we go in and we are advisory.
It's less of a cost engagement, more of an architecture engagement, because in cloud,
cost and architecture are the same thing.
We look at what's going on.
We look at the constraints of why we've been brought in, and we identify things that companies
can do and the
cost savings associated with each, and let them make their own decision. Because
if I come in and say, hey, you could save a bunch of money by migrating this whole subsystem to
serverless, great, I sound like a lunatic evangelist because, yeah, but that's 18 months of work,
during which time the team doing that is not advancing the state of the business any further,
so it's never going to happen. So why even suggest it? Just look at the things that are within the bounds of possibility.
Counterpoint, when a client says a full rearchitecture is on the table, well, okay,
that changes the nature of what we're suggesting. But we're trying to get away from what a lot of
tooling does, which is, great, here's 700 things you can adjust, and you'll do none of them.
We come back with, yeah, here's three or four things you can do that'll blow 20% off the bill. Then let's see where you stand. The other
half of it, of course, is large-scale enterprise contract negotiation, but that's a bit of a horse
of a different color. I want to thank you so much for taking the time to speak with me today. I
really do appreciate it. If folks want to hear more about what you're up to and how you think
about these things, where can they find you?
You can find me at frankc.net or @frankc on Twitter.
Oh, inviting people to yell at you on Twitter.
That's never a great plan.
Yeesh, good luck.
Thanks again.
We've absolutely got to talk more about this in depth, because I think this is one of those areas where folks above a certain point of scale talk
about these things semi-constantly and live in the space, whereas folks who are in relatively
small-scale environments are listening to this and thinking that they've got to do this too. And no,
no, you do not want to spend millions of dollars of engineering effort to optimize a bill that's
80 grand a year. I promise. Focus on the thing that's right for your business. And at a
certain point of scale, this becomes that. But thank you so much for being so generous with your time. I appreciate
it. Thank you so much, Corey. Frank Chen, Senior Staff Software Engineer at Slack. I'm cloud
economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please
leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your
podcast platform of choice, along with an angry comment that seems to completely miss the fact
that Microsoft Teams is free because it sucks. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business and we get to the point.
Visit duckbillgroup.com to get started.
This has been a HumblePod production.
Stay humble.