Software at Scale 2 - Christine Dodrill: ex-SRE, Lightspeed
Episode Date: December 7, 2020

This episode contains an interview with Christine Dodrill, ex-Senior Site Reliability Expert at Lightspeed. We discuss Kubernetes, Spectre/Meltdown, configuration languages, a controversial testing philosophy, autoscaling (auto-failing), technical problems vs. social problems, monoliths, Conway’s Law, and Canada.

Listen on Apple Podcasts or Spotify.

Highlights
Notes are italicized.

5:34 - Stack Overflow might become actively harmful if you’re working on WebAssembly or something sufficiently niche.
7:40 - Spectre/Meltdown caused a 20-40% slowdown at one workplace, which led to some interesting projects, like the aforementioned WebAssembly work.
11:30 - What’s up with the title Senior Site Reliability Expert? Apparently, you can’t call yourself an “engineer” in Canada; you need to go through a certification process that software developers don’t bother with.
14:40 - It’s questionable how much software developers are “engineers” in the first place.
16:52 - YAML allows 8 values for boolean true and false, such as “no” and “on”, which conflict with the ISO code for Norway and the province code for Ontario. Maybe Starlark is an answer. Christine uses Dhall with promising results. Dhall looks like Haskell. It has a strong type system, with variables, functions, and imports, but is otherwise a config language. They have a Kubernetes package.
20:30 - Nix the language. An example to configure a website.
24:00 - Experiences building internal tools that interact with other internal tools. Developers tend to have strange environments. One developer would only keep source code on a thumb drive, and that would cause a few issues.
27:20 - Compliance requirements can be a useful tool to stop developers from security snafus.
29:30 - Experience with Kubernetes.
30:40 - Kubernetes autoscaling out of the box is a great way to cause downtime. Experiences on the Metrics team at Heroku, which worked on autoscaling. Most applications tend to be I/O bound on the database, so autoscaling tends to become “auto-failing” and cause more problems than it solves.
37:00 - PostgreSQL, PgBouncer, and transaction ID wraparound. External postmortem.
40:48 - “A lot of document databases are solutions looking for problems”
45:30 - “Continuous Deployment can be a double-edged sword”
47:20 - “A lot of unit testing methodology I’ve seen is kind of fundamentally wrong. A fake version of the world will only let you see how fake your world is.”
53:30 - Experience with tiered deployments - stage, QA, and production.
58:40 - Exploring the model where product engineers only build features, and SREs focus on reliability. A conclusion is that some governance is probably required to prevent a complexity explosion.
60:30 - Monoliths are pretty great. Eventually, Conway’s Law takes place. Incongruities in products or APIs often reflect team boundaries.
68:00 - Buzzwords at big companies.
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Hey, Christine, and welcome to another edition of the Software at Scale podcast.
Yeah, and welcome.
Yeah, thanks for having me.
Sure. So I want to get started with like some simple questions like what got you into like software engineering?
I've seen that you've been like a software engineer for more than five years now.
Somewhere like that. Yeah.
Yeah. So what got you into it?
Well, for the longest time I can remember, I've been one of those people that have been cursed by being good at computers, and actually getting down into the nitty-gritty of what's going on, and understanding things and tweaking them. And getting into software engineering just sort of naturally happened. My dad was in software stuff for years; he recently moved to llama farming, or alpaca farming. And I've basically been coding off and on since I was 12.
Okay.
So like you were working on like your own projects or?
Yeah.
Lots of IRC stuff in the beginning.
Okay.
Yeah.
I think I was never at that age where I got to use IRC.
I think somebody in like my freshman year showed me Slack and I didn't realize what the big deal was.
And then a couple of years later,
it's like something that everyone uses.
So, I mean, do you have like an opinion
on like IRC versus Slack?
They both have their place, but I kind of miss IRC.
It's simple.
It's easy to write bots for. And it just got killed by chat apps with
scroll back. I still use it, but it's definitely a shadow of its former glory.
Yeah. Yeah. And then like, so you got into software engineering, but you've been working for a lot of time now as an SRE, like, as a site reliability engineer.
So what made you go from writing, like, bots for IRC into something like SRE?
I kind of got into SRE stuff on accident.
A lot of it started when I was running those bots and other services: how do I keep them up?
Like if Google goes down, what do I do?
Or that sort of thing. And a lot of what I know about SRE stuff, I really learned running an IRC network and at a job at a place called IMVU.
The SRE team there was really interesting and I learned a lot.
But really, a lot of the stuff involved with writing software and putting it up somewhere goes hand in hand with the SRE stuff because a lot of that boils down to not just putting it up there, but keeping it up there. You know, it's easy to push something to the cloud once and then, you know, you don't care about it in three weeks.
But pushing it to the cloud multiple times per week
and still having it work even after you touch it,
that's the hard part.
It's gotten easier, but it's been historically pretty hard.
Yeah, that's for sure.
Like there's like a random script that fails
or some random package
on that machine stops working or Ubuntu needs a restart. It's easy to write scripts,
but how do you make sure they keep running? So did you learn any of that in a formal way?
How do you keep a system up, or was it all just trial and error?
Hmm, is there a formal way to learn that? I mean, it kind of feels like a lot of this stuff has to be learned through experience or, well, experience on the job. Yeah. There's not really many books that I know of on how to actually run Linux systems
that don't get outdated the day after they were written.
Yeah, exactly.
That's how most people seem to learn how to keep systems up.
I mean, going back, is there anything that you think would have helped, like if you had read a particular book or some reference, or is it just Stack Overflow and figuring it out?
The kind of problems I get into are the stuff where Stack Overflow is actively harmful. So a lot of it is really reading man pages, trial and error, and prayer. Lots of prayer.
I'm curious to know what kind of problems you're talking about when you say Stack Overflow is actively harmful.
Like I've certainly seen examples,
but I'd like to know what you mean.
I have a bit of a Skunkworks project where
I try to do WebAssembly stuff on the server. I'm trying to implement a functions as a service
platform, mainly just to feel out the problem space and see how I would do it. And when you get deep enough into it, there's basically like no guide for doing a lot of the stuff that you need to do for that.
Because at some level, you're basically writing a kernel; interfacing between the programs and the outside world is basically, literally, writing a kernel.
And there's basically nothing out there to help you.
Yeah, so you've gotta really wild-west it.
Yeah. It seems like, you know, initially when you're just starting off and learning something new, it's helpful, and the more you go into it, the worse it gets. And there's this point where you just don't trust any of the outside stuff unless it's like a reference manual or something, because you probably know that it's going to be buggy or not what you need.
Yeah. The hard part is finding out where that divide is, because sometimes it's really hard. But other times, it's pretty obvious when you hit that cliff, like trying to implement system calls in WebAssembly stuff.
Like, there's no manual.
Yeah.
Especially back when I got into it a couple years ago.
Like, there was no guidance out there.
Yeah.
Yeah, syscalls in WebAssembly.
Yeah. If I could ask, what project were you working on? What was the objective, or was it just to learn something?
Well, it kind of first started out as me getting really angry at Intel for Spectre. The place where I was working at, we saw these huge 20% to 40% drops in performance, or something somewhere in there.
And I was looking to see if there was another way to do it.
So I got the idea of doing a WebAssembly Functions
as a Service platform in my head, and it never
quite left.
I don't think it's a bad idea.
Yeah, and I can totally relate with getting angry at Spectre and Meltdown.
At our workplace, we had two kinds of issues there.
First, our CPU load or just the efficiency of our entire system went down by
like 10 to 20%. And the second thing is like AWS tried patching some of the issues and that led to
other problems that manifested in weird ways and they were mostly just JVM crashes. So yeah, that's pretty bad. So I mean, I'm sure then you've thought about
like Apple's recent announcement and like moving to like ARM and all of that. Do you
have like an opinion or you're just more of like a Linux person?
I've kind of flopped back between macOS and Linux for a while, but Catalina is probably
pushing me back towards Linux in general. Is it because of the intrusiveness or
just the amount of pop-ups you need to click? A lot of it is I do a lot of hardcore development stuff.
And Mac OS is really great for a lot of things.
But if you want to try to customize things or completely
get rid of animations so that things appear to happen faster
or unbind certain modifier keys so that you
can use them freely elsewhere, you know, you're completely out of luck. A lot of the Apple
products work really great for the Apple prescriptive mindset. And sometimes it works really great. Heck, I think a good 40% of my blog was written on an iPad.
It's a pretty good device, yeah.
I'm kind of reaching the limits of macOS usability for me,
plus the fact that Catalina basically broke my entire Steam library on macOS.
Yeah, it really makes you think, right? Because I just saw a tweet somewhat recently that if somebody wanted to build a browser for an iOS device today, and there wasn't already a concept of a browser, Apple wouldn't allow it, because it would be like remote code execution. And we wouldn't have the concept of a browser because of Apple rules, probably.
So I wonder what kind of things we're missing out on
because we're designing for the average case
but we're not letting hackers like do their thing.
And hackers in the sense of people who like tinkering with stuff, not the shady black-hat hackers.
Yeah. It's kind of hard to know. Stuff like the PinePhone is being taken in some interesting directions, but it's an unknown unknown.
So, yeah, I want to ask you about your current job.
So your title is a senior site reliability expert.
And I'd like to know what it means.
That's actually my past job.
I haven't updated that page yet to avoid the onslaught of recruiters. Yeah. But so in Canada, you can't call yourself the E-word unless you actually have an engineering certificate, like you have one of the rings and everything. So we can't be site reliability engineers. So somebody had the brilliant idea of calling us site reliability experts. And apparently that's how they got around it. There are software developers and site reliability experts.
I think that's a great idea. What does it take to get a certification? Do software engineers generally go through that flow?
In order to be called an engineer of any kind,
you have to have like a mechanical engineering degree and you have to go through the whole ritual of getting the ring from the engineers' guild.
The engineers' guild? It sounds like...
Whatever the heck it's called. I don't remember the exact term, but I remember that the engineers' guild is kind of litigious. And if they see you using "software engineer," they will not be happy.
Yeah. For a second, I was just imagining a Discord guild of people handing out licenses, I suppose.
of people like handing out licenses, I suppose.
No, like in my head,
what I'm imagining is like this group of people
standing in a,
five of them standing in a circle
and you stand in the middle
and they give you a ring and
formally induct you into the guild of engineers or whatever.
You're like a first level mage or something of the engineers guild.
I mean, it's cool. Like, I read a statistic recently
that there's going to be 25 million software developers
in like 2021 or something.
That's a lot of people.
It's probably not enough.
Yeah.
And I wonder how much this like bureaucracy helps.
I guess if all it does is just hurt the title
and you can just do the same job, which I'm guessing is true of Canadians working on this.
Like, it seems like it's not that bad.
We don't call ourselves engineers because engineering refers to, like, building bridges and things that, if they break, have a material impact on human lives, instead of marginally making a line go up faster.
Yeah. I mean, it's questionable whether it's software engineering or software artisanship.
I'm sure like management wants things to be engineering or like more like software manufacturing, right?
Like, give me a schedule and tell me how quickly you can ship stuff.
And for like engineers, it's often how do we design the best system so that it stays up?
At least that's what I'm interested in, and I know you are.
Yeah. So, I mean, I guess it's a good thing. Maybe we should all just be software artisans or software plumbers or software developers. Who knows.
Yeah. So, about Lightspeed: what did you do there?
What I did there is, I was working on Kubernetes stuff, internal tooling, and lots of Kubernetes, really.
Yeah, that's the hot new technology of the past few years.
I think Lightspeed is the hot meme.
Yeah, yeah. It's a
pretty big company. So like, I don't know how much you can share any numbers, but would you say it's
like a large fleet of pods and hosts? Or would you say like, you don't need that much to run
your system? Depending on how large you define large, it was pretty large. Okay. I think that gives a reasonable estimate,
at least in my head.
Like you get a sense of whether you can deal with it manually
and you probably can't, right?
You have to automate a lot of operations.
So-
Well, I mean, can and should are different questions.
It's a good point.
So how did you go about it?
Like what was your experience like with Kubernetes?
Anything interesting you learned through that?
YAML is...
adjective.
There's lots of interesting problems with YAML,
especially when you get into the Norway and Ontario bugs that it has with parsing strings.
I don't know if you know this, but YAML allows like eight values for Boolean true and false.
And two of these are "no" and "on", which just so happen to collide with the ISO code for Norway and the province code for Ontario in Canada.
There's also the fun where YAML is arguably Turing complete because it has generalized
recursion in the form of references.
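Note: as a quick illustration of the Norway/Ontario problem described above, here is a minimal Python sketch using PyYAML, a YAML 1.1 parser; the keys are made up for illustration.

```python
# Minimal sketch of the YAML "Norway problem" using PyYAML (pip install pyyaml).
# The keys below are hypothetical; only the parsing behavior matters.
import yaml

doc = """
country: NO          # intended as the ISO country code for Norway
province: ON         # intended as the province code for Ontario
debug: yes
"""

print(yaml.safe_load(doc))
# A YAML 1.1 parser like PyYAML resolves these plain scalars as booleans:
# {'country': False, 'province': True, 'debug': True}
```

Quoting the values avoids this, as does a parser that follows YAML 1.2, which only treats true/false as booleans.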
Yeah, so it's not good at what it's
trying to do, which is like limit the syntax that you can apply, and it's wrong in so many cases,
and it's way too verbose for things like Kubernetes. Yeah, I think that's like,
it's a good question. How should one configure Kubernetes? Have you seen Starlark by any chance?
I have looked at Starlark. A friend of mine has been trying to get me to get into it for a while
and I don't know how I feel about it, but I've been using Dhall, D-H-A-L-L,
to configure Kubernetes stuff for my personal things and it's pretty decent.
Yeah, so I should give a little bit of background
for listeners.
So Starlark is the language.
It's like an open source language released by Google.
It's supposed to be, it's the language
used to configure build files of the Bazel build system.
And you can think of it as a subset of Python.
But it doesn't allow things like
recursion. So you can't do completely crazy stuff. I've always found it like, maybe it's good, but
maybe you can do way too much with it. So what does Dhall look like? Is it similar to another language? Is it, you know, a superset of JSON or something like that?
Dhall kind of looks like Haskell in a way, but the big things that it has are functions, variables, and imports. And it also has a very strict type system, so that it's impossible to write a lot of common bad things in Kubernetes or misspell variable names, because the Dhall compiler will just stop you.
Okay.
And before it fails in production.
Okay.
So is this like specifically for Kubernetes or is it like more general purpose?
Oh, it's very general purpose, but they
have a Kubernetes package.
I use the Kubernetes package for deploying my website.
OK.
That sounds promising.
It sounds super similar to Starlark,
because it's the same thing with functions, variables,
and imports.
And you can't do too much else with it. I guess it has a slightly weaker type system, but yeah.
Yeah, the Dhall type system is a huge advantage of it, but at the same time it can be kind of painful too. But yeah, that's just how this stuff is.
Yeah, it seems like we need a middle ground starting from YAML, which is super declarative and doesn't let you do much. Because it doesn't let you do much, it doesn't let you configure config pipelines that take like 30 minutes for a system to understand. But at the same time, we need a little more power so we're not copy-pasting stuff everywhere. I don't know if anyone's found a good middle ground yet.
Dhall and probably Nix are good places to start.
I mean Nix as in the language, not the package manager.
Yeah, so tell me a little bit about it since I have never used Nix. And I've only heard about like NixOS and like the Nix Package Manager.
Yeah.
So NixOS is built on top of the Nix Package Manager.
And the Nix Package Manager uses its own bespoke language, which I believe is also called Nix. And it uses that to define packages, key value pairs, and even entire system declarations
and Docker container build files.
Okay.
So everything is defined through this one language called Nix.
And what does it look like?
Does it have a syntax similar to other languages,
or is it completely independent?
It looks like the chat is disabled.
However, let me paste you something in the Twitter DM.
But you can probably include this in the show notes
or something.
But here's an example of a Nix file that I used to build christine.website.
And it has imports.
It has some statements terminated with semicolons.
It looks kind of Haskell-y, I guess. Yeah, it seems like somebody's put TypeScript into Haskell. That's what it looks like to me.
Yeah. There's also some other interesting features like the inherit statement,
other things. What does the inherit statement do?
Okay, so basically in that site, in that file on line 15, I have inherit src.
That's equivalent to saying src equals src. Okay.
And yeah, maybe I'm missing something, but why would you want to do src equals src?
So that you can define a common source tree in the parent let block and use it in a couple other packages for building things.
In my case, I use it to define the site source separately.
I build the website binary in Rust, and then I build a combined
derivation with the binary for the site server, as well as the config.
Okay, so it's just a way of basically like importing in a sense, but also being able to use that.
Yeah, I think there's something similar in Ruby
but I forgot how to do it.
Yeah, that's another language that I just missed out on.
But I've seen its syntax.
It looks nice to use.
So some of the other things you spoke about
or like you've done in your job,
you said that you've created some internally consistent
and extensible command line interface
for internal tooling. I'm not sure how much is NDA'd there, but I made a tool to help
organize a bunch of other tooling. It ended up kind of growing out of scope as one does, but it was an interesting project.
It also led me to learning how to make my own Homebrew package definitions, which is what drove me to Nix for personal stuff, because trying to do Homebrew builds in CI is a Lovecraftian nightmare of horrors.
Yeah, it sounds like,
I've never seen anybody actually try to integrate
like Homebrew and CI.
We ship a bunch of stuff through Homebrew,
but we figured out a hack
where we have like a puppet script
that runs on like developers' laptops
and it just periodically
syncs from this private GitHub repo. And we just make binaries available there. It doesn't need any
best practices or anything, but it makes deployment of internal tools really simple.
Yeah. It's one of those solutions where all the options are bad, and you really just got to
choose which one you want to deal with.
Yeah.
Were your customers just internal engineers?
And what's it like to have engineers as customers?
They were all basically internal people. People do weird things to their environment. There was somebody
that had a bunch of problems with the tooling but it turns out
that they had all their source code on a thumb drive and they
mounted the thumb drive every time they wanted to do something.
And it was like a FAT thumb drive.
And guess what Unix permissions really don't like?
If you guessed FAT drives, then you win exactly zero dollars.
Wow. And what was the reason for them to keep all their source code on a thumb drive? Were
they scared that their laptop would just not work or something?
I've learned over the years that when you encounter divergent workflows like that, it's best to not ask why, because the answer will sometimes scare you.
So I just sort of tried to fix the bugs and just let that person be with their unique life
decisions. Yeah, I mean, so we'll never really know why this person had all their source code on a
thumb drive, but.
I mean, undoubtedly it's probably because they had data loss or something and they don't
want to do that again, but who knows.
Yeah, there's this tool called Dropbox. Maybe they can use that, or, like, Git.
Yeah. And there's this team called compliance, which blocked it.
Oh, that's the other issue to think about, I guess. Yeah, compliance. How often have you had to think about compliance in your job, or in any job?
Not very often. However, I found that compliance requirements can be a useful tool to stop people from committing mistakes in production, especially if the security team is overloaded and saying, yeah, we're going to need approval from security there.
We'll delay something by two weeks and product can't make it go faster no matter how much they complain
Yeah, I've seen that engineers get scared of compliance much more than they get scared of other things, because I think engineers don't want to talk to lawyers, like, ever. So it's always a good way to say, oh, you know what? That might break some of our requirements.
We have to fix this ASAP, or we cannot fix that.
Yeah.
I think people really just want to go fast.
And going too fast too often means that somebody
has to clean up after it.
And you really got to choose what you're optimizing for.
So would you say your experience with using Kubernetes
and all of that, do you think it's, there's a lot of hype
and there's also a lot of people complaining about how
it's overkill for most people.
Clearly, I think you seem to like it
since you deploy your own site with it,
but what's your experience been otherwise?
Actually, I deploy my own site with Kubernetes
because I wanted to learn Kubernetes
and I'm one of those people that is cursed
to be a hands-on learner.
So in the future,
I don't know if I'd really use Kubernetes.
It's convenient. A nice side effect of it is that it's very, very prescriptive. And it has an exact model of what you do and how you introduce mutable state and how you do replicas and other things, which a lot of other systems just don't really have a match for, can't really hold a candle to. But at the same time, with all that power comes a whole bunch of complexity
and documentation that is really good at selling you Kubernetes,
but not explaining what fields are allowed on the freaking pod spec.
And, you know, like overall it's a good tool sometimes, but sometimes is not all the time.
Yeah. Yeah.
So do you really feel like Kubernetes achieves its goal or goals?
Like, like if you think about its goals,
like it can give you like a reliable system out of the box in a sense,
and it'll ensure like with those disruption budgets and all that,
that your system doesn't go down.
You get a bunch of features and systems for free,
like auto-scaling and all of that.
But then along with that comes the complexity of managing all those systems.
Like overall, would you say it's like a net positive
for like productivity at a company
or it ends up causing more outages and issues
because people don't know how to operate it?
You said that it has auto-scaling out of the box
and I had to laugh for a minute
because the auto-scaling stuff in Kubernetes out of the box is basically a great
way to make downtime problems worse. Oh, interesting. Tell me more. It sounds like
there's like a story there.
Yeah. I used to work at Heroku on the Metrics team, which did autoscaling, and basically, due to some constraints that were brought about by existing history, we weren't able to use very many metrics in order to do the autoscaling calculation.
So a compromise was made, and we had something that sort of worked for our target, which was trying to make load times go down.
Except when you try to do a quick fix somewhere, you end up pushing problems deeper into the stack. And on certain types of applications, if you enable the autoscaling and it's not actually a CPU-heavy load but a database-heavy load, spinning up more instances would kill the database server, and it would lead to a situation that we kind of jokingly called auto-failing.
I mean, it's a distributed system,
so the errors should be distributed too, right?
Yeah.
That makes a lot of sense, right?
Like you have a system that's like,
you can totally imagine a database being overloaded
and its latency increasing and the latency like-
Bubbles up.
Yeah, bubbles up and then it's like,
oh, you need to auto scale this other cluster
since its latency is too high and throughput's too low.
Yep.
And a lot of the Kubernetes stuff mainly focuses on CPU and memory.
Okay.
So, yeah.
Which kind of invites auto failing.
Okay.
So, because the input metric isn't good enough to take all of these different things into account.
Yeah.
A lot more services are IO bound than you think.
And they're usually IO bound on the database.
Yeah.
Or something that is a JSON front end to a database.
Yeah.
That makes sense. So I don't know how
much you can reveal, so it's totally fine if you don't say any of this stuff, but was your
objective to roll out like auto-scaling to every single deployment of Heroku? I guess not, right?
Because there's so many... I don't know. I wasn't around for all of the planning stuff there, but it was an interesting problem either way.
It teaches you a lot about how services are put together and how assumptions fall apart when the assumption is sufficiently general.
Yeah.
So I've read a little bit about Heroku's
auto-scaling system. You use like the P95 request latency percentile to decide, and then you combine
that with Little's law to decide whether you should scale up the application or not. That
seems like one of the things or one of the metrics that is at least advertised
on one of the blogs.
Does that, okay.
That sounds about right.
I was going to avoid going into there, but since it's on a blog post, yeah, they use
P95 metrics and Little's Law in order to do it.
It's actually kind of a clever thing, but it's unfortunate that it doesn't scale out as generically as I'd hoped.
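Note: a rough sketch of the Little's-law-style estimate described above, with made-up numbers; this illustrates the general idea, not Heroku's actual formula.

```python
# Little's law: requests in flight (L) = arrival rate * time in the system.
# Using P95 latency as the time and dividing by per-instance concurrency
# gives a naive instance count. All numbers here are hypothetical.
import math

def desired_instances(req_per_sec: float, p95_latency_sec: float,
                      concurrency_per_instance: int) -> int:
    in_flight = req_per_sec * p95_latency_sec
    return math.ceil(in_flight / concurrency_per_instance)

# 200 req/s at a 0.5 s P95 on single-threaded dynos -> 100 instances.
print(desired_instances(200, 0.5, 1))
```

This is also where the "auto-failing" mode shows up: if the latency comes from a saturated database, every added instance raises database load, the P95 climbs, and the formula asks for even more instances.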
Okay.
So it fails in those cases where you're IO bound on a database. But it's really great if you're a Rails app and you're single-threaded and you only have one instance per container, and you have a really CPU- and memory-intensive thing and it gets slow. For that case, it's absolutely beautiful.
It's just that case is not very often at all.
Okay. So that makes sense: you're basically running a stateless, single-threaded thing whose scale is directly proportional to latency and request depth, in a sense.
So it's like, if you're trying to run
your stateless monolith web app,
you can put that on the system.
What, if you can reveal,
or if you can just give a flavor for,
what kind of other applications does it break for?
That's all I really know.
All I really can remember is what was in support tickets.
And IO bound on the database is probably the biggest fail
pattern.
But I mean, this is something that everybody struggles with.
I mean, creative design decisions in MySQL and Postgres like 20 years ago have created situations where, like, fun fact: every time you make a new connection to Postgres, Postgres actually forks a new child just for that one client socket.
And as a side effect of this, Postgres has really amazing multi-thread performance.
I mean, like, combined with Linux's copy-on-write support and Postgres' habit of doing things as append-only logs,
this means that you can get some really amazing
multi-threaded performance.
However, as a side effect, this also means that all of those children, especially if a lot of them are persistent, just hang around in memory and use up a lot of memory, and then there's kernel buffers and all that stuff that just adds up over time.
Yeah, I didn't know that.
So is that the reason why people recommend
putting like a PG bouncer or something
in front of a Postgres instance?
I love PG bouncer.
It's such a hack.
Yeah, and I'm sure you know about transaction ID wraparound.
I am not actually that familiar with it.
I'm more used to using Postgres in anger in production, and not really the nitty-gritty of the ORM stuff.
That makes sense. Did you work with
Postgres as someone who worked at Heroku and had to deal with customer Postgres instances going down
all the time? Nope, I was not on that team. I was more dealing with the actual metric server itself.
Okay. Yeah. So transaction ID wraparound is: if you have close to 2 billion transactions, since Postgres uses an int32 for transaction IDs, it just stops the database completely, saying, I need to garbage collect some old transactions since I'm running out of IDs for you to use. And it can go down for like a day, or for like a week. And if you Google transaction ID wraparound, you'll see a bunch of postmortems, because, you know, generally people don't write up a postmortem if they're down for a few hours, but since this issue causes them to be down for a day, they have this huge mea culpa: I'm sorry, this is what happened, it's a result of the architecture of our system, and we use Postgres. I've certainly almost seen one, or maybe I even have and I don't remember. No, I haven't seen one, but I've certainly seen an internal postmortem of, yeah, our database hit transaction ID wraparound and was just down for like a day and a half.
And there's nothing we could have done about it. So yeah, it's as you said,
like a creative design decision, like who would need 2 billion transaction IDs?
I mean, if you have like a few million requests per second,
across a bunch of machines,
2 billion happens pretty often.
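Note: some back-of-the-envelope math for the claim above; the write rate is a made-up example, not a number from the episode.

```python
# Postgres transaction IDs are 32-bit, giving roughly 2 billion of headroom
# before wraparound protection kicks in. At a sustained write rate, that
# headroom disappears quickly if vacuuming/freezing can't keep up.
XID_HEADROOM = 2**31          # ~2.1 billion
writes_per_second = 50_000    # hypothetical sustained write-transaction rate

seconds = XID_HEADROOM / writes_per_second
print(f"{seconds / 86_400:.1f} days of headroom at {writes_per_second} writes/sec")
# -> about 0.5 days
```

In practice you would watch age(datfrozenxid) in pg_database and make sure autovacuum's freezing keeps pace, rather than waiting for the hard stop.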
Yeah, but Postgres can do like remarkably well for...
Oh yeah, it's fantastic until it's not.
Yeah.
Yeah, I think the challenge comes in once it reaches its limit: how the heck do you shard it? And then people think, maybe we shard by user, or we shard by merchant or whatever, and then the product use case comes in where you need to do a cross-shard transaction, and then you're screwed. So that's why I think CockroachDB and all of these things have a lot of potential, right? Do you have any experience with any of these new NoSQL databases, or SQL-but-in-the-cloud-type systems at all?
Not much.
I kind of don't like them.
But I've had some bad experiences with MongoDB
and losing data in the past.
So I tend to stick to what works because if I can understand it, even though it's bad, I can understand it.
And that means that I know better how to fix it when it breaks.
A lot of the document databases are really solutions looking for problems in my book.
Yeah, I was listening to somebody's opinion.
And they said it helps you start a company faster,
since you don't have to think about schemas.
Do you buy that argument at all?
I think in terms of types. And I like schemas because it's the information about what's there and what can be there and what can't be there and what is there at all is very explicit.
You can't just have random other fields show up one day without there being explicit changes to
the database to allow it. I'm a big fan of types and schemas are a great way to enforce types.
And if something is that important
that you need to add a field to it,
then you can add the column in a schema change.
Yeah, that makes a lot of sense.
I mean, that has its own problems.
Like, not saying it's pain-free, but overall, the ability
to know what is in a database table without having to look at all the contents of it is a lot more valuable than you really think it is.
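Note: a tiny sketch of the "add the column in a schema change" point, using sqlite3 purely as a stand-in database; the table and fields are hypothetical.

```python
# Explicit schema: the table declares what can be there, and a new field
# arrives via an explicit schema change rather than showing up ad hoc.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# The field became important, so it gets a real column.
db.execute("ALTER TABLE users ADD COLUMN birthday TEXT")

db.execute("INSERT INTO users (name, birthday) VALUES (?, ?)",
           ("example-user", "1990-01-01"))
print(db.execute("SELECT name, birthday FROM users").fetchall())
# -> [('example-user', '1990-01-01')]
```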
Yeah. One thing we do, which mostly works when we know that we won't have to be querying, you know, filtering on a particular field, is we stuff something like a protobuf into one of the columns in a table, and we can just put more fields inside that protobuf if it's something we don't think we'll need to filter on. And that gets us to this place where we still know the schema, but it's not as expensive as a column migration on a huge table, when that's problematic. I don't know if you've seen similar hacks, maybe like JSON columns.
There is a time and place for JSON columns, though.
I know one thing that uses it pretty well is an ActivityPub server called Pleroma.
I'm not really sure how to pronounce it.
But ActivityPub is a very JSON document heavy protocol.
And Pleroma tries to be agnostic
as to what ActivityPub objects it can accept.
So a JSON column makes sense there, since ActivityPub objects can, for all practical purposes, be arbitrary JSON.
Yeah, yeah.
It's like when you're acting as a proxy, in a sense.
Yeah.
But a lot of the times you really do know what's there and just use the damn schema.
But what about those cases where a migration is really hard, or you have a partitioned table and it might even cause downtime, or it's just impossible to do some kind of migration in place because there are so many concurrent transactions holding on to these row-level locks or table-level locks, and it's not worth the effort for something as simple as, oh, I want to add this random number that I know I'm never going to be filtering by.
It depends.
However, I personally prefer radical simplicity,
even if it ends up causing some issues down the line
because the simple thing
is easier to understand when it breaks. And the simple thing has less going on. So it is more
obvious when you look at it when it breaks. Yeah, I think that was spoken as like a true
asserry, right? You keep systems as simple as possible. And that's the only way you'll be able to debug them
when you go wrong, when they go wrong.
I mean, I think that's like a great segue.
So you had to manage some of these larger systems
at a bunch of your previous workplaces.
What's your philosophy on CICD and alerting and all that?
So let's start with CICD.
Do you believe in continuous deployment?
What have you seen work best for you
or for the companies you've been at?
So continuous deployment is kind of a double-edged sword.
On one hand, it's really great.
And when you do spend the time to get it all working,
it can make updates seem kind of magic because, you know, you just blink and it's done.
And you can put new versions of stuff into production without much fear. But then the other edge of the sword comes in and for some things getting to a place where things are continuously deployed can be slightly difficult, especially if it's an existing implementation or there's some fun problems with your setup that make continuous deployment difficult. I'm just thinking about that one thing that I heard from a Googler friend of mine where
they had services in Java and they were deploying it continuously, but the JVM's JIT needed
to warm up. So they just replayed traffic at it
at like 30,000 times normal speed
for half an hour to warm up Java's JIT.
I mean, you know,
but so part of the process of it
is figuring out how to actually do it and implementing the scripts and automation to do it.
But overall, I think it's generally worth it.
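Note: a hedged sketch of the JIT warm-up trick mentioned above, replaying recorded traffic at a new instance before it takes real load. The host and paths are placeholders; this is not the actual setup described in the anecdote.

```python
# Replay a set of recorded request paths against a freshly deployed instance
# so the runtime (e.g. a JVM JIT) warms up before real traffic arrives.
import requests

NEW_INSTANCE = "http://new-instance.internal:8080"                # placeholder
RECORDED_PATHS = ["/", "/search?q=warmup", "/api/v1/orders/123"]  # placeholders

def warm_up(rounds: int = 1_000) -> None:
    session = requests.Session()
    for _ in range(rounds):
        for path in RECORDED_PATHS:
            try:
                session.get(NEW_INSTANCE + path, timeout=5)
            except requests.RequestException:
                pass  # warm-up traffic; failures here aren't fatal

if __name__ == "__main__":
    warm_up()
```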
Yeah. This is a hot take, but a lot of the unit testing methodology that I see used is kind
of fundamentally wrong, especially the stuff around mocks. I'm probably one of the few people
that thinks that, but a fake version of the world will only let you see how fake your world is. And you should really be testing against real databases,
real APIs, real file systems as much as possible.
You don't need to mock the file system.
That's why we have temporary files.
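Note: a minimal sketch of the "use a real filesystem instead of mocking it" point; save_report is a made-up function for illustration.

```python
# Test against a real temporary directory rather than mocking file I/O.
import json
import tempfile
from pathlib import Path

def save_report(directory: Path, name: str, data: dict) -> Path:
    path = directory / f"{name}.json"
    path.write_text(json.dumps(data))
    return path

def test_save_report():
    with tempfile.TemporaryDirectory() as tmp:
        out = save_report(Path(tmp), "daily", {"requests": 42})
        assert json.loads(out.read_text()) == {"requests": 42}

test_save_report()
print("ok")
```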
Yeah, that certainly makes sense in the context of,
like, yeah, you assume that your APIs are working,
but it's often like these points of integration that break.
But isn't there like a world where, you know,
business logic should be tested and like unit tests
because you have a set of integration tests
that test these like various integration points, right?
Yes.
However, at some level, you're really testing the mock.
I guess you are.
I've had some pretty bad experiences with some pretty horrible mocks.
And a lot of the stuff that I've been dealing with
is mostly like do math and put it into a database.
And a lot of the math you can end up just testing by itself
using actual data structures.
And then integration against an actual database with Docker or something.
I think people, like, there's so much functionality out there,
and people don't, or developers, I should say, just generally,
they either go, like, one step too far,
which I think is, like, integration tests or, like,
Selenium tests for everything, since, like,
the data model isn't clean enough that you can test it at like a unit
or even in an integration test.
And I've certainly seen that.
And, or they go the complete other route
where you just mock out so much
that you create extremely brittle tests
that you refactor anything and you'd have to like
fix thousands and thousands of tests.
And I've seen that as well.
And your test isn't doing
anything useful since you just patched out the wrong thing. And it just kind of works
because you haven't written both a positive and negative test. Yeah.
In terms of my testing philosophy though, I believe in two tiers of tests. I believe in
an integration test where you test against as real of the world as possible.
And I also believe in functional tests where you have something that goes against prod
and staging and the like every hour or so and tries to do a whole bunch of common user
operations. And if that fails, then you either alert someone or log it in Slack.
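Note: a hedged sketch of the hourly functional-test idea described above, not the Direwolf tool itself; the endpoints and Slack webhook URL are placeholders.

```python
# Periodically exercise a few common user operations against staging and
# production, and post to Slack when something fails.
import requests

TARGETS = {
    "staging": "https://staging.example.com",      # placeholder
    "production": "https://www.example.com",       # placeholder
}
SLACK_WEBHOOK = "https://hooks.slack.com/services/placeholder"
PATHS = ["/healthz", "/login", "/api/v1/items"]    # placeholder operations

def check(base_url: str) -> list:
    failures = []
    for path in PATHS:
        try:
            resp = requests.get(base_url + path, timeout=10)
            if resp.status_code >= 400:
                failures.append(f"{path} returned {resp.status_code}")
        except requests.RequestException as exc:
            failures.append(f"{path} failed: {exc}")
    return failures

def main() -> None:
    for env, url in TARGETS.items():
        failures = check(url)
        if failures:
            requests.post(SLACK_WEBHOOK, json={"text": f"{env}: {failures}"})

if __name__ == "__main__":
    main()  # run from cron or a scheduler every hour or so
```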
Yeah. So how do you configure those functional tests? I totally agree with you on the functional
test aspect. Yeah. So in Heroku, they had a tool called Direwolf, which had a huge bank of Ruby code that would let you test against, do functional
testing against stuff.
And that was pretty great.
The exact dialect of Ruby was a bit odd because, you know, Ruby is DSL heaven.
But once you got used to it, it was really great and you could implement tests really
quickly. Yeah, so this library or like this framework would set up auth and all of that
for you and you just had to specify. It would configure like API URLs.
I think there was also an April Fools' joke for free puppies at one point.
And it would probably log some metrics for whoever owned that test or the team. It's hard for developers at a company to build frameworks like Direwolf in many cases, and then everybody ends up building their own local-maxima solution, which isn't great.
Yeah, I've been considering building something like that as open source software on top of like Rust and Lua, but I never really got anywhere with that idea yet.
Yeah.
Probably make a great startup.
Yeah.
How do you build something like that that's generally applicable to a bunch of different companies, right? Like you'll have to figure out, you'll have to come up with some design
where you have like an auth provider
and then it would just end up looking like
enterprise software that doesn't really work
for my use case since I have to think
about this other thing as well.
That's generally where I've been caught, yeah.
Yeah, I think startups in the CI space are always confusing to me.
But maybe someday something like CircleCI,
but something that has the popularity of GitHub, will be out there.
Maybe GitHub Actions is that thing,
but I just heard about that vulnerability a few weeks ago,
and I don't know if. I mean, it's executing arbitrary code from people.
It's going to have vulnerabilities
at some point.
Yeah.
But does it really have to read log files and run commands
based on that?
That seems a bit excessive.
Who knows?
OK. that seems a bit excessive. Who knows? Okay, so yeah, but your philosophy on testing is
go for the straight integration test
and then just the periodic functional test.
Yep.
Verify something.
And what about like stage and canary and production
and all of that?
So like, how do you think about breaking up,
you know, your deployments into tiers? Depending on the company and how much budget you have, the ideal is sort of like a
three-tier system where you have some sort of development one, which is either the most recent commit to any branch or the most
recent commit to your default branch. A staging one, which is a bit more quiet and
it's somewhere if you have like QA testers or something, you point them to staging and tell them to do stuff there.
And once staging has settled and QA says, yes, this will probably not explode horribly,
then you can promote it to the money generator or production.
Okay. Yeah, that makes sense. So you mean a three-stage setup where you're continuously deploying, I guess, to the first one, somebody is manually verifying on the second one, and then the third one is considered safe since it's been tested against by humans. Quote-unquote safe.
Yeah, yeah, quote-unquote safe.
Yeah. But so what about automating those humans in the second stage?
Like, is it possible to do that or you think,
or let's say you don't have the budget to have humans there?
Then you'd probably have some sort of machine thing,
but if you don't really have the budget there,
then you just kind of chuck out the staging phase
altogether and just have like
sandbox and prod.
Okay. The way we do it is, that staging part is where we dogfood, in a sense. Or I should say the first tier, which is continuously pushed, goes out to employees, and that ends up catching a bunch of issues.
So that can be one way of...
If it's a kind of product which employees can actually dog food, that is.
It really depends because, for example,
if your product is about golf courses and nobody in your company golfs,
then it might be a bit difficult to dogfood it.
And what's your philosophy on stuff earlier in the stack? I know that you said that you're a types person. Do Python or Ruby ever have a place? When do you think it's fine, is it ever fine, to deploy a service in Python? I know some companies don't let you do that anymore.
Like, if you have a really good reason, I mean a really good reason: we're doing some fancy machine learning stuff,
and we have a limited number of goat sacrifices,
so we want to use Python to save them for later.
Yeah, Python would make sense there.
Rails is kind of a pain to manage in production,
but it's a pretty fantastic and solid platform.
There are times when you'd want to use that. Yeah, off the top of my head, I know Shopify has this huge Rails monolith that they do stuff with, and I think Heroku started as a Rails app too. It really depends. But if I was given free rein, I'd probably gravitate towards something like Go or Rust, and more recently Rust, mostly because the compiler stops you when you're trying to do things that are bad. And especially in Rust, the compiler will stop you if you're trying to use memory invalidly.
And it won't break in production for that reason
that the compiler stopped you for.
Code that can never be deployed is code that can never fail.
Those problems that are rejected at compile time
are problems that cannot be discovered at runtime.
Yeah.
And the value of the compiler stopping you when you make a typo is so great that
it's basically unparalleled.
Yeah. So, I mean, that brings me to my next question, which is,
so how do you feel about letting a bunch of engineers work on features and then there's
like a group of SREs who work on making sure the system stays up? So, let's say you have like a hundred developers just working on product and like 15 SREs keeping all of that up. Or do you like the model where, you know, every team should maintain its own service? Or is it some kind of middle ground? What's your general opinion on these things?
That's just hell.
Um, I've experienced this a few times where, you know,
product controls the purse and product wants new shiny and they want it now. And there's this
brilliant video by, what's it called? It's about microservices. And they have this person going
through and explaining the entire flow of why they can't add someone's birthday field to a user info page because, you know, they have to go through the name provider service and then, you know, the entropy chaos system and a whole bunch of other stuff. So like at some point if you just give people free rein
to do stuff and just keep going, you end up with just something unmanageable. And
in terms of actually like designing things to work around that. I don't know. This is not a technical
problem. It is an organizational and social problem. And I'm really good at technical
solutions, but social problems like that are, they're out of my league.
So given one or the other, do you think it's better to just let people work on maybe a monolith and a few services where it makes sense, like a database service or a database proxy, and everybody else uses the monolith?
Monoliths are pretty great, yeah.
People rag on monoliths a lot because they've been burned by that one PHP monolith that takes three days to deploy as long as there's a full
moon. But in terms of reliability and not dynamically linking function calls over HTTP
and JSON, it's pretty great.
Yeah, you don't have to think about like version skew and all of those things.
Oh God, version skew.
It's also so hard to test for those things.
I mean, it's just Conway's law at work at that point.
Are you familiar with Conway's law?
I am familiar with it, but like I think you should repeat it. Okay. So Conway's Law is basically this law that says that all technical systems will
imitate the communication structure that created them. And you can see this very obviously
if you look at things like Apple and Microsoft, where it seems that two features, which on your end seem like they're very related... Oh, what's a simple example of something that comes to mind? Like Safari for iPhone and Safari for the Mac desktop. You know, they're very similar, they are based on the same code, but they have subtly different things that just don't quite add up. And I'd be willing to bet some good money that the Safari for iOS team is a different team than the Safari for Mac team, with a different set of PMs, different OKRs, different services, different budgets, different goals.
Yeah, yeah. You basically ship your org chart, right?
Yeah.
There's a way you can see this in Microsoft.
I can't think of one off the top of my head.
But if you start looking at incongruities
in various APIs or even just products in places, you can begin to see where
the different team boundaries are.
Yeah, that makes sense. And especially, one recent example I can think of is somebody complaining that there are so many upsells inside the product. Like there's one prompt, and there's another prompt right after that, and a bunch of banners within the app.
It's because each individual product team wants you to use the thing they ship.
And, but there's no centralized.
They want number.
Yes.
They want, like, oh, look, we expect AR to go up by like 3% with this new model that says buy more stuff. So we ship this new model that says buy more stuff, and now we expect our revenue to go up by like 10,000 bucks a month or something. I don't know. Best experienced in Netscape Navigator. Yeah.
Yeah.
And I'm not, I'm not, and you're right, right? It's an organizational and social problem.
I still don't know if there's any organization that's big that does it well.
Maybe all the big organizations that are like successful, they've just, you're successful
by shipping different products that are successful. And if you put too many people on one thing,
there's going to be like no cohesive vision,
even if you have like one VP on top or whatever.
Maybe it's just too hard to do that and to organize people that way.
I wonder if this is how Amazon organizes stuff inside AWS, you know, that giant product screen where you have like 50 different products, everything from private servers to quantum computing to satellite operations. It's very possible that every one of those is a separate team.
Yeah. And I guess eventually like there's enough people going to complain about the list
of product offerings that they'll spin up another team that'll be the AWS catalog team.
Their job is to show you the relevant set of products for you.
Yeah, and then there's going to be the team management team
for managing all the team management teams.
Yeah. yeah this is this is why like i think if you put like if you make like two software engineers talk for
too long i i wonder if there's like a rule that says they have to complain about management and
organization within like 45 minutes oh it happens yeah uh when i started this like podcast i was
thinking about you know should i ever ever invite like somebody who's on who's in management
or should i keep this like pure like software engineering sre i mean let's see
i guess i can always change my mind later on, but.
It can be interesting to have it happen just once to, you know,
let the chaos unfold.
Yeah. But the moment they say the word synergy, then I just disconnect.
Well, the thing is you have to synergize all of your deliverables into
Gantt charts, right?
Yeah. And have a timeline view of them. So.
You just double click on it and you look at the big picture. Okay. So at Salesforce,
one of the buzzwords was let's double click on that.
And apparently that was like corporate speak for, you know, looking into it closer.
But inside Heroku, there was this huge, you know, perma-thread on Slack where people were showing various different behaviors that double-click can have in various environments and various programs. So obviously we want to open the documents in a PDF viewer.
And now, double-tapping stuff on your phone sometimes means you like a picture, like on Instagram or LinkedIn. I don't know if you've seen those viral posts on LinkedIn where they're like, oh, look at this new feature, you just need to double-tap to see what it is. And they make thousands of people double-tap on their posts, and that's how it goes viral. It's so bad.
Yeah.
Yeah.
I've never been a part of a smaller company that was acquired by a larger company.
I've always been in those larger companies.
So that's the experience that I guess I should try at some point.
If you really hate buzzwords, it's probably a bad thing.
At least with Salesforce, they have this unique mix of English and Hawaiian buzzwords that I don't think I'll see anywhere else.
Yeah, I think I know what you're talking about. I have a bunch of friends at Salesforce, and there's a lot of focus on Hawaii. Is that because the CEO likes Hawaii, like Marc?
I have no idea, but it was fun to poke fun at.
Yeah. I wonder, like, if I ever start a company, I can just start a weird culture just for the sake of it. And since I'm the CEO, nobody can tell me anything. We're really into water bottles, we love different kinds of water bottles, and our buzzword for focus is: let's have a drink from our water bottle for that.
And nobody can say anything since that's the culture of the company.
That's the kind of chaotic stuff that I like.
At some point I should start something, and hopefully it succeeds just for this water bottle joke. And hopefully nobody listens to this podcast at that point.
Well, until then, I bet you can put a cap on that.
I think, yeah, this is a good stopping point, a good point to, you know, cap it up and wrap this podcast up. Yeah, thank you so much for participating. I think this was a lot of fun. This podcast is supposed to be about software at scale, but I just had a lot of fun talking about all the random things as well. Yeah, thanks so much.
No problem, it was fun to do this.
Yeah. Until next time, I'll tag you when I do the editing and publish.
Thanks so much.
You're welcome.
See you around.