Screaming in the Cloud - Evolving, Adapting, and Staying Prepared with Brian Weber
Episode Date: January 28, 2025

Ever wondered how Corey got to where he is today? You have Brian Weber to partially thank for that. On this episode of Screaming in the Cloud, Corey catches up with his old friend and mentor to talk about the ever-evolving world of tech. Brian's been around the block a time or two, having done significant stints at Pinterest, Facebook, and Twitter (during the Elon acquisition, no less)! As Corey and Brian catch up, you'll hear them chat about the importance of empathy, coaching the next generation of tech workers, and their conspiracies surrounding Google and Kubernetes. So grab your tinfoil hats, it's time to go Screaming!

Show Highlights
(0:00) Intro
(0:53) The Duckbill Group sponsor read
(1:27) When Brian took Corey under his wing
(3:21) Brian's experience coming to the cloud as an engineer
(7:24) Why it's important to reinvent yourself in tech
(8:54) How Brian reacted to the industry adopting Kubernetes over Mesos Marathon
(10:31) Kubernetes conspiracy theories
(12:30) The importance of empathy in tech
(15:46) Trying to advise younger generations entering tech
(19:19) The Duckbill Group sponsor read
(20:02) Working at Twitter when jobs started getting cut and the site frequently went down
(22:41) The best way to navigate certificate expiration
(26:08) Talking about "The Golden Path"
(28:52) Why you should always plan ahead in tech (and life)
(34:21) Where you can find more from Brian

About Brian Weber
Brian is a former FedRAMP DevOps Engineer for Coralogix. He's also been a Site Reliability Engineer at Twitter, Pinterest, and Facebook, where he maintained large installations on-premises, building reliability, security, and developer efficiency. In his spare time, Brian skis, knits, cycles, bakes, and tries to spend as much time outdoors as possible.

Links
Brian's LinkedIn: https://www.linkedin.com/in/brian-weber-2423b55/

Sponsor
The Duckbill Group: duckbillgroup.com
Transcript
And that's exactly how SRE generally works in my mind as well.
You're not building something for the normal day-to-day.
Actually, no, that's not true.
You're building stuff for the normal day-to-day,
but you are also building stuff for the day when everything catches fire.
Welcome to Screaming in the Cloud. I'm Corey Quinn, and I've been trying to get a particular
person on this show since its very inception. Brian Weber, currently between jobs, was a
formative influence on my early career that started to look a little bit vaguely like
software engineering. Brian, thank you for your ongoing patience and willingness to subject yourself to my tomfoolery yet again.
Oh, your tomfoolery is always amazing.
Did you just call me a mentor?
This episode is sponsored in part by my day job, the Duckbill Group.
Do you have a horrifying AWS bill?
That can mean a lot of things.
Predicting what it's going to be.
Determining what it should
be, negotiating your next long-term contract with AWS, or just figuring out why it increasingly
resembles a phone number, but nobody seems to quite know why that is. To learn more,
visit duckbillgroup.com. Remember, you can't duck the duck bill bill. And my CEO informs me that is absolutely not our slogan.
I had no idea what I was doing many years ago when I was working for a large consulting
firm and you were working at Pinterest at the time.
And they parachuted me into this environment because I was personable for lack of a better
term.
And they had, at the time Pinterest had a very weird technical vetting
process for consultants, so they needed someone who could do the work, ostensibly, but also be
gregarious and talk their way through the process. This was many years ago; the consulting company no
longer exists after being bought by IBM, so I don't think I'm spilling any tea here. But at the end
of it, I was brought in to write a bunch of tests for Puppet code as part of a long-stalled Puppet 3 migration, if memory serves.
I had no idea what I was doing.
Ruby was a precious stone to me, not so much a programming language.
And you took me under your wing for about a month and a half, and it resonated.
Thank you for doing that.
Thank you so kindly. I do remember you had, I believe it was an Ansible
sticker on your laptop, and you told me that you made a very clear point of not adhering a laptop
sticker until you'd actually contributed to the source repo. It would have been SaltStack then,
not Ansible, because I still haven't dared to touch Ansible with my hands. Oh, the other Python.
Exactly, the one that basically was frozen in amber forever and then achieved its final form
of all software projects that have run their course getting acquired by VMware.
You know, it's funny. I have a very good friend who basically soft retired when VMware got bought
out by Broadcom. A lot of folks have that story. Oh yeah. It's kind of funny how everybody takes
layoffs just a little bit differently, you know? Just like me and all my various layoffs, like, you end up, like,
staying in touch with some friends, and maybe not, I don't know, and some people get angry and bitter,
and others are just like, woohoo, I can do what I want now. I have severance.
So, I do want to talk about your technical evolution because you are
something of a rarity in that you were for years over at Facebook, which they've since renamed to
something dumb, but they'll always be Facebook to me. You were briefly at Pinterest that coincided
with my time there. And then you decided to spend the next seven and a half years over at Twitter.
Yes, we're still calling it Twitter. Now, what makes that interesting is that Pinterest was sort of the departure from the other two, because neither Twitter nor Facebook,
at least at the time, were large cloud shops. They weren't running Kubernetes. You, in fact,
called yourself Mr. Mesos at one point, or Mr. Marathon. I forget what it was, but you were
effectively responsible for the care and feeding of that particular orchestration system while you were at Twitter.
So you have found yourself in this interesting scenario where, despite the fact that this is where the zeitgeist has gone,
you hadn't done a whole lot of cloud work until your most recent gig over at Coralogix, where you're focusing on FedRAMP.
So one could argue that, well, is GovCloud really cloud at all?
The jury is still out.
But what's it like coming to cloud
as someone who's very competent as an engineer,
but who just has found themselves in a situation
where until recently, you never had to touch it?
Well, you know, you could consider that, Mike.
I don't know.
What happens when you put a cloud in a bottle or a room?
Is it, it's kind of like when you go to the bar
and they have the smoked Manhattan,
you know, and it's there and it's pretty. And then you open the bottle and all the tendrils.
Anyway, I'm bad at metaphors today. It's early yet. By the way, happy new year. We're recording
this shortly into the new year. It is the second of the year. Yes.
Anyway, in some ways it was just like being dropped into literally any other environment
where, you know, you don't know anything.
You don't know what's going on.
You don't know how all the pieces glue together.
But it was a lot more challenging because a lot of the facets that I do know, like I
know, you know, how a kernel works, how all of the modules work, how systemd works, how
to strap things together.
You know, when do you need to disable SELinux permissions to make things talk to one another?
Oh, you're funny.
setenforce 0.
It's a way to live.
There you go.
But anyway, so it was a very different environment.
You know, when you spin up, say, a web server on a siloed out host, you know, you spin it up, you access it,
you see, oh, this is cool. And then you start putting up walls to protect it. When you spin
up an instance for the very first time in a kube cluster, in an AWS cluster, you can see that it's
running, but it is very much behind the phalanx. You know, all of those protections are saying, yes, your service
is running, come and get it, which is often a challenge when you don't know how to do things,
simple things like properly open up a port and make sure it stays open and reopens the next time
the service runs, how to hack and slash your way through all of the VPC rules and whatever other rules randomly appear in the way when you
don't know. Now, I spent, you know, a good 10 months, you know, trying to figure all that out.
And luckily, I was there in an environment where there were parallel running environments. And,
you know, once you learn that the various differences basically come down to the ARN names being different.
You know, when you look at an ARN, it's aws right in the middle of everything.
You have to then change it to aws-us-gov.
Yeah.
Yeah.
It's a different partition is to use their nomenclature.
Right.
Because it is a common assumption in all of your templating.
And so I had to go and hack and slash through
so many YAML configs and Terraform configs, you know, and we can sit here and talk about,
you know, how fascinating and interesting it is that all the stuff glues together.
But at the end of the day, we are all just monkeys scratching our heads, looking at code saying,
where the hell is this config and why doesn't it do what I want it to do?
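The partition swap Brian describes can be sketched as a tiny helper. This is a minimal sketch, assuming the standard ARN layout (arn:partition:service:region:account-id:resource); the function name is illustrative, not anything from the conversation:

```python
# Sketch of the GovCloud partition swap described above.
# An ARN looks like: arn:partition:service:region:account-id:resource
# Commercial AWS uses the "aws" partition; GovCloud (US) uses "aws-us-gov".

def to_gov_partition(arn: str) -> str:
    """Rewrite a commercial ARN into the aws-us-gov partition (illustrative helper)."""
    parts = arn.split(":", 5)  # split on the first five colons; the resource part may contain colons
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    if parts[1] == "aws":
        parts[1] = "aws-us-gov"
    return ":".join(parts)

print(to_gov_partition("arn:aws:iam::123456789012:role/deploy"))
# arn:aws-us-gov:iam::123456789012:role/deploy
```

In practice the region names differ as well (e.g. us-gov-west-1), which is why templating usually parameterizes the whole partition and region rather than string-swapping hardcoded ARNs.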
It's the same thing that
brought me to consulting, in that I was always parachuted into environments where I didn't know
what the hell was going on. To succeed in those environments, to my mind at least, you've got
to have a strong grasp of fundamentals. Okay, I don't know how this particular system works; however,
I know the Linux system internals well enough to know what it should be doing. Okay, if it's
doing this, that means it's making this other call. It's not doing what I would expect. What do I not
understand fully? And diving deeper and dismantling it into bite-sized problems, which is why when
people ask, oh, what technology should I learn? It almost doesn't matter. If you're entering the
field now as a new graduate in your early twenties, the technology you're going to be running by the time that you're my age, in my mid-to-late 40s, is no longer going to be the same thing. You have to reinvent
yourself. You have to understand how this stuff all ties together. So I like the foundational
things that are likely to remain constant for, well, at least the rest of my life.
Well, I remember when there were cries and moans when the environment I was in at the time,
which I'll leave nameless,
was migrating from CentOS 7 to CentOS 8
because of the whole stream model.
What are you doing to my RPM delivery system?
How does this work?
And you look under the hood
and it's really just the same.
It's just packaged slightly differently
and branded differently
and it works the same.
It's just they figured out ways to smooth off some of the rough edges.
So if you're sitting there saying, oh, my goodness, I can't handle change, then what
the hell are you doing here?
Well, that's one of the areas I wanted to dive into with you, because I wasn't kidding
when I said you used to be the Mesos Marathon guy for a period of time.
The industry collectively took a vote and
Mesos Marathon did not win. Kubernetes did. How did you react to that?
Well, at the time when I was still at the company formerly known as Twitter,
we talked a lot about whether we should spin up Kubernetes. When the decision came through that
we should, we did it in a very slow and piecemeal manner. And in my opinion, I felt it was a little bit too slow. We spun up sample environments in GCP. We even had acquisitions that were in AWS that we just kept operating in AWS, because the migration just didn't make sense. It was well entrenched. It worked properly. Why the heck not? Leave it there. And so we actually had a reasonable brain trust around this stuff for a while.
Where we ran into a lot of trouble was spinning up Kubernetes internally on our own bare metal
infrastructure. You know, not the least of that, as I'm learning now, as I set up my own home lab,
setting up Kubernetes on your own bare metal infrastructure is a pain in the ass.
Oh, yeah.
I did it a year ago, almost exactly, where I spun up a Kubernetes cluster in my spare room running
on top of K3s and some Raspberry Pis.
And sure enough, it was, oh, okay, this makes sense.
Kubernetes lets you cosplay as your own cloud provider.
I sort of get it now.
But yeah, I'd forgotten all the obnoxious hardware bits
that the cloud has gently abstracted away
in the intervening years.
Oh, I don't even think it's the hardware bits.
Kubernetes makes a point of not making it easy.
I wonder if they're just in collusion
with the cloud providers to say,
here, we're going to escort you on the way
so that way you can earn all this money
and then pay the CNCF a bunch of money
so that way we all get
rich. My tinfoil hat conspiracy theory remains that Kubernetes is how Google decided to get the
rest of the world to write software more like Google does, because without that, Google Cloud
was never going to work as a cloud provider for a lot of these workloads. So it works super well.
They sort of lost control of it and they don't get to drive it anymore the way that they
once did.
But I'm not entirely convinced I'm wrong.
Well, you know, that same model worked for Google in search.
You know, they got everybody in the world to change how they wrote web pages, how they
structured web pages, buying into the AMP project.
All that stuff is all because Google said we want it this way and everybody wanted some of that sweet, sweet search results and figured out how to do it.
And now, as a result, when you go to a web page to look for a recipe for peanut butter brownies, you have to read a 10-page diatribe written just to come up in the search rankings and potentially get affiliate links, which makes the experience of a human reading a web page suck.
And there's always some of the better sites now have the jump to recipe button at the top because they know what's up.
But at the same time, it's why do we go through this ridiculous theater piece?
Because it's what our Google overlords built for us, you know? And now we
experience that in cloud factories because we get to play with Kubernetes. How lovely of them for
doing these games. It's always appreciated. Well, what can I say? It's, you know, it makes our lives
relatively easier as opposed to when we had to thumb through recipe cards and when I could just,
you know, bootstrap, install, you know, whatever OS I felt like at the time and get something running at home.
It's a reasonable approach to take. But I guess what I'm curious about is how you
perceive that shift, though, because I've met an awful lot of technologists over the course
of my career who start to identify themselves by the technology upon which they're working.
And I'm not immune from this.
I think of myself these days as an AWS guy to some extent. And before that, I was an email systems
guy. And reinventing the way that you perceive yourself is never easy. You know, I still perceive
myself as somebody who just, like you say, and like you do, parachuted into a site, tried to
figure out what was wrong, and mostly just try to make things better for the other people running it. Because I've said this before a thousand times, and I'll
say it again, software is made of people. We are all here together and we do what we do as a
collective. You know, open source projects, yes, there's occasionally the one lone guy in Nebraska, a la XKCD, who's maintaining a very important core project.
But a lot of projects out there and a lot of companies, well, all companies out there, are building it as a group, as people, as many people. And if we can make that experience for our peers, for our colleagues, for whoever you're working with
better, then we all get better at writing the software, at building the systems,
at making things better. So that's what I pride myself in. And that one thing has never changed
for me. I've picked up multiple languages. I've dived into multiple different environments.
I'm comfortable in multiple operating systems.
But the reality is that we're all people.
We all do what people do.
And if I can at least just be empathetic and be as human as I can and try and understand
that you're human too, you just want to read a simple doc that tells you how to start and
stop the service. You just want to read a simple dashboard that can tell you what's wrong. And
you don't want to get paged in the middle of the night at something stupid and pointless that had
no reason to page you. Every human wants that. Every human engineer wants that. I mean, granted,
there may be exceptions to that case. I have known masochists who just want to alert on everything because they don't know what's
going on and they'd rather be woken up and find out.
And 90% of the time, they wake up, they look at the alert, they say, oh, this is nothing,
they crush the alert, they go back to sleep.
And then the next person comes on call and goes, what the holy hell?
And I care about both of those people similarly. I think empathy is one of those
core attributes to being a competent technologist. And I have no idea how you teach it. I feel like
it's something you either have or you don't. I feel like the significant bulk of us have it.
We just don't often know what to do with it. You know, sometimes we learn how not to be empathetic.
Sometimes we're psychopaths and we just innately don't have it.
But I believe those are the exceptions.
You know, in reality, we're all empathetic people.
And if we can tap into that empathy and help make other people's lives better as a result,
then that's what we should be doing.
This is in part why up here in my small town, I tried to help start a tech meetup out here
because there's so many people around here.
There's a local university, a local community college, and a whole lot of other people who
are just career changers, who are just interested in trying to learn about the technology, not only because they find it fascinating, but they see it as a career path forward,
hopefully as long as AI doesn't destroy everything.
I used to be fairly active in the, I guess, helping the next generation figure out how
to navigate the world of tech. And I've gotten away from it just because it's been so long since
I was new to the space that I worry I would give boomer-tier advice of,
oh, just have a strong handshake
and walk in with a resume printed on nice paper,
ask to speak to the owner,
and you'll have a job by dark, which does not work.
I don't know how to get started in technology
in the current system.
I know a lot about how to get started in technology
in the early 2000s,
but that apparently is not a highly useful skill.
No, absolutely not. Although traces of it still are like, you know, yes, you can't just, you know,
walk in and be bold, but having a level of confidence shows through to the other people
you're talking to. When you're talking to a recruiter, when you're talking to a hiring
manager, if you can say, hey, I may not know everything, but I know how to do these things well, and I know how to figure
out what I don't know. And it's funny because one other person in our little group here of my local
meetup has finally achieved something that I had been hoping for. And of course, I'm leaving
location out. I'm leaving people nameless and all that to protect the innocent. You know, this young man had been
doing hack jobs on Fiverr to try and boost his skills on top of working a simple retail job
and got enough chops together after a while that he cleared an interview for a local company. Now, it's
not that huge. It's writing some JavaScript tests, but it's a start. And if that's what
gives him the foot in the door that he needs to build a career, then I feel 100% vindicated in
everything that I've ever done to try and build a community out here.
What worries me is the future of that story. When I first played with ChatGPT and it spat out
a quick hacked together script to query NAT gateway prices across different AWS regions,
the response that I got instantly from a couple of senior devs was, oh, well, this is fantastic,
but it's only for junior dev work. It'll never take the place of a senior engineer. And it's
great. Where is it that you believe senior engineers come from? You didn't just show up one day knowing all the
stuff that you know now, it was incremental. What does this mean for the next generation?
And people don't really have a good answer for that yet.
No, nobody has the crystal ball right now, unfortunately. And I wish we did because I'd love to be able to say, here's what's coming. Now, I have high hopes that we're still going to need humans in order to actually build
large systems because large systems are not easily intuited.
You know, as much as other talking heads out there would like you to believe, oh, Twitter
is just small globs of characters ordered in a timeline, right?
Twitter sounds like the easiest problem in the world.
Oh, I could build that in a weekend until you actually think about it for 30 seconds.
Well, you could build it in a weekend to serve like 10 users.
Here at the Duckbill Group, one of the things we do with, you know, my day job is we help negotiate AWS contracts.
We just recently crossed $5 billion of contract value negotiated. It solves for fun problems,
such as how do you know that your contract that you have with AWS is the best deal you can get?
How do you know you're not leaving money on the table? How do you know that you're not doing what I do on this
podcast and on Twitter constantly and sticking your foot in your mouth? To learn more, come chat
at duckbillgroup.com. Optionally, I will also do podcast voice when we talk about it. Again,
that's duckbillgroup.com. I have a question about Twitter. Since you were there during the acquisition for a bit before the fall, everyone that I know in this space, and we didn't talk to you folks because we didn't want to compromise any of the folks who were working there and trying to hold on to a job.
But a lot of us predicted that Twitter itself would basically fall over one day and have a lot of trouble getting back up.
And that never happened. Do you have any insight into why that might've been like,
well, how did we all get it wrong? I almost want to do a post-mortem on how the SRE community
got it wrong. I'm in a little chat group with a bunch of other former SREs from Twitter. And we
have talked about this a time or two, and we attributed that a lot to the work that
we had done. Because those of us who are SREs, we don't just think about, you know, what's going on
right now. We often think about what's going on in the future. How do I make sure my service doesn't
completely tip over? You know, the first hammer fell and cut off half the company and then another half of the remaining
company all right in November before the holidays. And I believe it was the week between Christmas
and New Year's that Elon said, oh, we don't need Sacramento data center.
And you would expect that to end as hilariously as it sounds, but somehow they pulled it off.
Somehow they pulled it off.
Now, granted, Twitter had for that whole year of 2023 stumbled a lot.
The down detector had been going bonkers on Twitter.
Things had been falling over.
The site just didn't always want to work. So I attribute partly
the work that we had done to shore up the service for the long term. Now, the other thing that I can
think of as maybe just the reduced user count, because I know people had been leaving the site in droves. But I don't know.
I honestly haven't looked at, you know, any whatever stat count to see what the daily
active users are, what the, of course, none of that stuff is public anymore because they
don't have to report to the SEC anymore because they're a privately held company.
Thanks, dudes.
A lot of it does make sense in that when I was building systems, I always wanted to make
sure they were well documented and the interfaces were easily understood. The idea that Twitter learned pretty early on in the course of its life was one
of graceful degradation. Instead of showing the fail whale when things started breaking,
okay, maybe you just don't reload the timeline as rapidly, or you put the eventual in eventual
consistency. That tends to be a failure mode that is less noticeable, and it stops treating the
service as a binary, is it up or is it down, and instead views it, how down is
it? Once you unlock those graceful degradation modes, that's kind of awesome. I'm still surprised
there weren't a whole bunch of issues that coincided with certificate expiries and whatnot,
but apparently there's still enough talent left there to keep the lights on.
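The graceful-degradation idea Corey describes, serving something stale instead of the fail whale, can be sketched as a fallback wrapper. This is a minimal sketch; the class, the freshness window, and the response shape are all illustrative, not Twitter's actual design:

```python
# Sketch of graceful degradation: when the backend fails, fall back to the
# last known-good response instead of going binary up/down. All names here
# are illustrative.

import time

class TimelineService:
    def __init__(self, fetch_fn, max_stale_seconds=300):
        self.fetch_fn = fetch_fn          # talks to the real backend
        self.max_stale = max_stale_seconds
        self._cache = None                # (timestamp, payload)

    def get_timeline(self, user):
        try:
            payload = self.fetch_fn(user)
            self._cache = (time.time(), payload)
            return {"fresh": True, "items": payload}
        except Exception:
            # Degrade instead of failing hard: serve the stale copy if it's
            # recent enough, otherwise admit defeat.
            if self._cache and time.time() - self._cache[0] < self.max_stale:
                return {"fresh": False, "items": self._cache[1]}
            raise
```

The point is the shape of the failure: "how down is it?" becomes a spectrum (fresh, stale, unavailable) rather than a binary.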
I'm glad you mentioned certificate expiries because that's what I worked on. That's what my, you know, I was on that team for, I want to say like four years, I think,
where we managed the distribution of internal certificates and public PKIs and all that stuff.
And we automated the shit out of that.
It's the dumbest outage in the world, because it's highly visible that there's a certificate
that just expired when someone can get to it with their browser. It's one of those things of,
you should have known this was coming. We have this fancy technology called calendar reminders.
So the idea of automated certificate renewal is huge. I think it was a poor
decision in the 90s to have a certificate that expired 15 minutes ago have the exact same failure mode
as a man-in-the-middle attack,
but that's a battle long since lost.
Well, it was also relatively simple to just say,
you know, a Java application loads the file on disk
at start time.
So at that point,
you can do whatever you want to the file.
So we had automated systems that just went in and said,
that cert is due to expire in X amount of time.
Let's just swap it out.
All right.
So you'd have X number of days before it expired.
And service owners should theoretically know,
restart your service within X number of days and life's good.
Now, what you can do is have a failure state that says,
oh, I've never restarted my service, but this cert's expired.
Maybe I should die.
And then it dies.
And then whatever container system you're using restarts the service for you because a service died.
You do that and then voila, automation happens.
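The die-before-expiry failure state Brian describes can be sketched as a check the service runs against its own loaded certificate. This is a minimal sketch of the pattern, not Twitter's actual tooling; the margin and function names are assumptions:

```python
# Sketch of the pattern described above: a service that loaded its cert at
# start time checks how close that cert is to expiry, and exits if it's too
# close -- letting the container orchestrator restart it, which re-reads the
# freshly rotated cert from disk. Names and thresholds are illustrative.

import sys
from datetime import datetime, timedelta, timezone

RESTART_MARGIN = timedelta(days=7)  # die this long before the cert expires

def should_self_terminate(cert_not_after: datetime, now=None) -> bool:
    """True when the loaded cert is within the restart margin of expiring."""
    now = now or datetime.now(timezone.utc)
    return cert_not_after - now < RESTART_MARGIN

def liveness_check(cert_not_after: datetime):
    if should_self_terminate(cert_not_after):
        # Exiting non-zero makes the orchestrator (Mesos, Kubernetes, etc.)
        # restart the task, which picks up the already-rotated cert file.
        sys.exit("certificate near expiry; restarting to pick up new cert")
```

The design choice is that the renewal automation only has to rotate the file on disk; the restart is driven by the service noticing its own stale in-memory copy.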
These are the kinds of things that we thought of collectively at Twitter for years
in order to keep things up and running smoothly. So that way, as much as possible, all of the pain
in the butt things that everybody had to deal with could just be on autopilot. Oh, I just restart my
service. Cool. It's the right approach. It's why I love what Let's Encrypt has done, where the maximum
cert validity is 90 days. Because people go through an outage like that, and it's like, oh crap, let's build
a cert that has a 10-year expiry. Great. Which I understand from a human perspective: this was
painful, let's make sure we're not going to deal with this again anytime soon.
But when you have, like, a wildcard cert, God help you, that is good for the next 10 years, you'll never be able to trace all the places that it winds up in the next decade.
So when that does hit expiry, everything is going to break and it becomes a massive issue.
Whereas if you do the painful things and scary things more frequently and it makes them routine, yeah, I have a bunch of systems now that auto-roll certificates programmatically and I never have to think about it until and unless I'm doing something clever.
Yeah. Well, I know a lot of people have talked about the golden path. The golden path being
where you want everybody to go in order to get to that destination. That destination being
a running service that makes us all money so that we can all pay our rent and eat food.
So if you make that golden path as easy to walk as possible, then people will naturally go there.
You know, and I say that knowing full well, that's one of those Pareto principle things you run into. Because multiple times in my career, I have run through mass migrations where I chase down large numbers of people at a large company in order to get them to do a thing.
You know, here, this is going to take you two hours to do.
This is going to take you 10 minutes to do.
We just need you to do it.
I will show you how to do it.
I will do it for you if you're willing to let me.
So on and so on.
You know, the bulk of people are just like, oh, cool. We love it. I will do it for you if you're willing to let me. So on and so on. You know, the bulk of
people are just like, oh, cool. We love it. Sure. And then you get to that last 20%. And even worse,
you get to that last 3%. And those last people are like, you want me to restart? I'm not sure
we know how. That's one of the things I learned from my Kubernetes cluster, because it's, okay, great, I have everything on a bunch of Raspberry Pis plugged into the same power
supply, and when that thing gets jostled and loses power, okay, how do you safely bring up an entire
cluster? We didn't think about that, because why would you ever turn off cloud instances all at
once? Oh no. Oh dear.
Because, again, this comes back to the ancient sysadmin wisdom. Once I had my cluster
built out, one of the first things I did was yank the power cord out of the back of one of the nodes
like I was rip-starting a lawnmower, just so I could see what the recovery looked like. And it
turns out, even with a lot of extra work, it just never comes back, which, okay, that's a little
disturbing. It all comes down to Longhorn, the disk system I'm using, because EBS is a marvel that
people do not give enough credence to, because managing disk volumes in a distributed fashion is super hard.
And this is why people pay AWS, GCP, and Azure tons of money.
Tons of money.
Because managing Kubernetes sucks on its own.
Managing an EBS equivalent yourself, I 100% agree, sucks even worse.
At least for home labbing stuff, you could do a TrueNAS, which has all the right APIs
for doing that, which makes that a lot easier.
Oh, yeah.
There are a lot of options you have, but it's also stuff that I run that is only production
adjacent.
Like my RSS reader lives on top of this thing.
My change detection bot that winds up validating at different websites have these things changed
and showing me what happens.
I have a bunch of container stuff that I've thrown together in here,
but if the entire thing blows up and falls into the sea, I still have a bunch of options that do not preclude me from getting my work done. Yes, yes, yes. I get that. You know, it's funny. I
think about this in the real world too. I have a pantry full of home canned soup. No, I'm not a super prepper.
I just like doing it. But it's great because, you know, where I live, it can get inclement weather.
So if the roads shut down, I have four days worth of food in the house just in case.
And this was just because of how I learned how to live growing up. You know, I grew up in another
mountain town
and the roads would routinely close. So we would have routinely a couple of weeks of food in the
house. And if the power went out, we could pull out a camp stove and warm up a can of soup.
I just like homemade soup better than Campbell's. It's the right answer. I wish more people thought
about these things and did a little bit of planning ahead. Like, oh, they start forecasting inclement weather. You don't need to do a run to the store with everyone else
necessarily. And that's exactly how SRE generally works in my mind as well. You're not building
something for the normal day to day. Actually, no, that's not true. You're building stuff for
the normal day to day, but you are also building stuff for the day when everything catches fire.
A lot of work that I did on a lot of my different teams and products that I had worked on was not just to say, OK, everything is burning to the ground.
How are we surviving? A lot of what I have done is saying, let's make deploys easier so that we don't have to think about it.
So one thing that's kind of on my brag sheet is I worked with a couple of different teams, both my own team and the core services team at Ye Olde Twitter to help build out a process for continuously deploying an RPM. Now, this is often
not something you want to do in production environments.
Not without some gating or some really great automated testing.
Oh, yeah. And that's what we did. We made sure that we had a good process for gating and versioning, for easy push-button rollback, for hard versioning,
because originally my first version of this was just saying, yada-da, latest, whatever,
which is never a good scenario. So why the hell are you doing it with your RPMs?
So we came up with this process. We pinned the version into a Hiera file for Puppet. We read
that out of a config file from elsewhere so that way another automation surface could stamp it in
and tied it all together in a Jenkins script that would then pull all the right stuff together,
auto-stamp a version, and then ratchet up a FQDN hash percentage number. So that way you could say,
let's roll this new version to 1% of the fleet and see how it does. Let's roll it to 10% of the
fleet and see how it does. And once we got that machine well-oiled and well-lubricated, and mind
you, this was a process that took like maybe three to six months to build on top of doing other things.
And then another three to six months to gain enough confidence in it that we could just pull the brakes off and say, let's let it go.
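The FQDN-hash percentage trick is worth pausing on: hash each hostname into a stable bucket, then compare that bucket against the current rollout percentage. Here is a rough sketch in Python; the hash choice and hostnames are purely illustrative, not anything Twitter actually ran:

```python
import hashlib

def in_rollout(fqdn: str, percent: int) -> bool:
    """Deterministically decide whether a host is in the rollout cohort.

    Hashing the FQDN gives every host a stable bucket from 0-99,
    so ratcheting `percent` from 1 to 10 to 100 only ever adds hosts:
    a machine that got the new version at 1% keeps it at 10%.
    """
    bucket = int(hashlib.md5(fqdn.encode()).hexdigest(), 16) % 100
    return bucket < percent

# Ratcheting the percentage up widens the cohort monotonically.
fleet = [f"web{i:03d}.example.com" for i in range(1000)]
cohort_1 = {h for h in fleet if in_rollout(h, 1)}
cohort_10 = {h for h in fleet if in_rollout(h, 10)}
assert cohort_1 <= cohort_10  # 1% cohort is a subset of the 10% cohort
```

The payoff is that the rollout needs no shared state: every Puppet run can compute its own membership independently and they all agree.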
And the biggest noise that we ran into was that we would ratchet the version forward faster than all of the RPM
masters could sync. So occasionally, a puppet run would go through, would talk to a yum repo
that didn't have the newest version because we literally just shoved it out there. And we actually
got some feedback from the team that managed that saying,
oh, yeah, we are having some problems with a couple of these.
And I said, what can I do to help?
And he said, well, maybe don't roll out so fast.
So I added extra steps to then say, let's not roll through
and just look through and see, did all the yum repos sync?
Because you could just probe it all in a loop
and then just come back and wait a minute
and probe it all, blah, blah, blah, blah, minutia, minutia, minutia. We got it working.
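That probe-and-wait step can be sketched as a small gate in the deploy pipeline. Everything below is a hypothetical reconstruction: the `has_version` callback stands in for whatever actually checked the mirrors' repodata (a `repoquery` call, an HTTP fetch of the metadata, and so on):

```python
import time

def wait_for_mirror_sync(mirrors, has_version,
                         poll_seconds=60, timeout_seconds=1800):
    """Block a rollout step until every yum mirror serves the new version.

    `has_version(mirror)` is a caller-supplied probe returning True once
    that mirror has synced the freshly pushed RPM.
    """
    deadline = time.monotonic() + timeout_seconds
    pending = set(mirrors)
    while True:
        # Re-probe only the mirrors that have not caught up yet.
        pending = {m for m in pending if not has_version(m)}
        if not pending:
            return  # every mirror has the new RPM; safe to ratchet forward
        if time.monotonic() > deadline:
            raise TimeoutError(f"mirrors never synced: {sorted(pending)}")
        time.sleep(poll_seconds)
```

Gating the version ratchet on this loop is exactly the fix described above: the deploy simply refuses to advance until every repo a Puppet run might talk to has the new package.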
And again, software is made of people. I was able to do that because I had good relationships with
the people on my team and the people on those other teams so that we could talk about these
things like humans. Which is a reasonable and grown-up way to approach it.
Yeah, because it's one thing to walk up and say,
I don't give two shits about what your job is.
I have to get this done, which is not the way.
That's not how you win friends and influence people.
No, it's not.
Instead, you walk up and say, well, I'd like to get this done.
How do you think we can do this?
You know, I'm here playing in your pool.
I don't want to pee in your pool.
I want to do this right.
Exactly.
With the unspoken thing being, look, at some point this has to get done. And so you, at some point, either have to lead, follow, or get out of the way. I would love to collaborate with you on this for a better outcome for everyone.
Right.
And at the end of the day,
this can be copy pasted out to make everybody else's life easier.
You know,
lots of carrots, lots of hugs and lots of golden
stars and all that. The stick may be back there somewhere else, but don't even think about it.
Be people, be human. We're all here to just take care of each other. So let's do that.
I want to thank you for taking the time to chat with me about all this. If people want to learn
more about what you're up to, where's the best place for them to find you these days?
I feel like I should re-step up my social media game because I was a lot more active on ye olde Twitter before it became something not Twitter.
I have migrated entirely to Bluesky, and it's like Twitter of old in a lot of ways.
It's great.
That's what it looks like.
All right, well, in the meanwhile,
you can find me on the LinkedIn.
And we will, of course, put a link to that
in the show notes.
Thank you so much for taking the time to speak with me.
I appreciate it.
More than happy to, Corey.
Thank you.
Brian Weber, longtime friend and mentor.
I'm cloud economist, Corey Quinn,
and this is Screaming in the Cloud.
If you've enjoyed this podcast,
please leave a five-star review
on your podcast platform of choice.
Whereas if you've hated this podcast,
please leave a five-star review
on your podcast platform of choice,
along with an angry, insulting comment
telling us that we must be idiots
because clearly setting up storage for Kubernetes
in a home environment couldn't possibly be that hard.