Software at Scale - Software at Scale 2 - Christine Dodrill: ex-SRE, Lightspeed

Episode Date: December 7, 2020

This episode contains an interview with Christine Dodrill, ex Senior Site Reliability Expert at Lightspeed. We discuss Kubernetes, Spectre/Meltdown, configuration languages, a controversial testing philosophy, autoscaling (auto-failing), technical problems vs social problems, monoliths, Conway’s Law and Canada.

Listen on Apple Podcasts or Spotify.

Highlights

Notes are italicized.

5:34 - Stack Overflow might become actively harmful if you’re working on WebAssembly or something sufficiently niche.
7:40 - Spectre/Meltdown caused a 20-40% slowdown at one workplace, which led to some interesting projects, like the aforementioned WebAssembly work.
11:30 - What’s up with the title Senior Site Reliability Expert? Apparently, you can’t call yourself an “engineer” in Canada; you need to go through some kind of process which software developers don’t need to bother with.
14:40 - It’s questionable how much software developers are “engineers” in the first place.
16:52 - YAML allows 8 values for boolean true and false, such as “no” and “on”, which conflict with the ISO code for Norway and the province code for Ontario. Maybe Starlark is an answer. Christine uses Dhall with promising results. Dhall looks like Haskell. It has a strong type system, with variables, functions, and imports, but is otherwise a config language. They have a Kubernetes package.
20:30 - Nix the language. An example to configure a website.
24:00 - Experiences building internal tools that interact with other internal tools. Developers tend to have strange environments. One developer would only keep source code on a thumb drive, and that would cause a few issues.
27:20 - Compliance requirements can be a useful tool to stop developers from security snafus.
29:30 - Experience with Kubernetes.
30:40 - Kubernetes autoscaling out of the box is a great way to cause downtime. Experiences on the Metrics team at Heroku, which worked on autoscaling. Most applications tend to be I/O bound to the database, so autoscaling tends to become “auto-failing” and cause more problems than it solves.
37:00 - PostgreSQL, PgBouncer, and Transaction ID wraparound. External postmortem.
40:48 - “A lot of document databases are solutions looking for problems”
45:30 - “Continuous Deployment can be a double edged sword”
47:20 - “A lot of unit testing methodology I’ve seen is kind of fundamentally wrong. A fake version of the world will only let you see how fake your world is”.
53:30 - Experience with tiered deployments - stage, QA, and production.
58:40 - Exploring the model where product engineers only build features, and SREs focus on reliability. A conclusion is that some governance is probably required to prevent a complexity explosion.
60:30 - Monoliths are pretty great. Eventually, Conway’s Law takes hold. Incongruities in products or APIs often reflect team boundaries.
68:00 - Buzzwords at big companies.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

Transcript
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Hey, Christine, and welcome to another edition of the Software at Scale podcast. Yeah, and welcome. Yeah, thanks for having me. Sure. So I want to get started with like some simple questions like what got you into like software engineering? I've seen that you've been like a software engineer for more than five years now. Somewhere like that. Yeah.
Starting point is 00:00:41 Yeah. So what got you into it? Um, well, for the longest time I can remember, I've been one of those people that have been cursed by being good at computers and actually getting down into the nitty-gritty of what's going on, understanding things and tweaking them. And getting into software engineering just sort of naturally happened. My dad was in software stuff for years. He recently moved to llama farming, or alpaca farming. Okay. Yeah. And I've basically been coding off and on since I was 12. Okay. So like you were working on like your own projects or? Yeah.
Starting point is 00:01:34 Lots of IRC stuff in the beginning. Okay. Yeah. I think I was never at that age where I got to use IRC. I think somebody in like my freshman year showed me Slack and I didn't realize what the big deal was. And then a couple of years later, it's like something that everyone uses. So, I mean, do you have like an opinion
Starting point is 00:01:52 on like IRC versus Slack? They both have their place, but I kind of miss IRC. It's simple. It's easy to write bots for. And it just got killed by chat apps with scroll back. I still use it, but it's definitely a shadow of its former glory. Yeah. Yeah. And then like, so you got into software engineering, but you've been working for a lot of time now as an SRE, like, as a site reliability engineer. So what made you go from writing, like, bots for IRC into something like SRE? I kind of got into SRE stuff on accident.
Starting point is 00:02:53 A lot of it started when I was running those bots and other services was how do I keep them up? Like if Google goes down, what do I do? Or that sort of thing. And a lot of what I know about SRE stuff, I really learned running an IRC network and at the job at a place called IMVU. The SRE team there was really interesting and I learned a lot. But really, a lot of the stuff involved with writing software and putting it up somewhere goes hand in hand with the SRE stuff because a lot of that boils down to not just putting it up there, but keeping it up there. You know, it's easy to push something to the cloud once and then, you know, you don't care about it in three weeks. But pushing it to the cloud multiple times per week and still having it work even after you touch it, that's the hard part.
Starting point is 00:03:57 It's gotten easier, but it's been historically pretty hard. Yeah, that's for sure. Like there's like a random script that fails or some random package on that machine stops working or Ubuntu needs a restart. It's easy to write scripts, but how do you make sure they keep running? So did you learn any of that in a formal way? How do you keep a system up, or was it like all just trial and error? Um, there's a formal way to learn that? I mean, it kind of feels like a lot of this stuff has
Starting point is 00:04:34 to be learned through either experience or, well, experience at the job. Yeah. Yeah, there's not really many books that I know of on how to actually run Linux systems that don't get outdated the day after they were written. Yeah, exactly. That's how most people seem to learn how to keep systems up. I mean, going back, is there anything that you think would have helped, like if you had read like a particular book or some reference, or is it just like Stack Overflow and figuring it out? Um, the kind of problems I get into are the stuff where Stack Overflow is actively harmful. So a lot of it is really like reading man pages,
Starting point is 00:05:25 trial and error and prayer, lots of prayer. I'm curious to know like what kind of problems you're talking about when you mean like Stack Overflow's like actively harmful. Like I've certainly seen examples, but I'd like to know what you mean. I have a bit of a Skunkworks project where I try to do WebAssembly stuff on the server. I'm trying to implement a functions as a service
Starting point is 00:05:52 platform, mainly just to feel out the problem space and see how I would do it. And when you get deep enough into it, there's basically like no guide for doing a lot of the stuff that you need to do for that. Because at some level, you're basically writing like a kernel, and interfacing between the programs and the outside world is basically literally writing a kernel. And there's basically nothing out there to help you. Yeah. So you gotta really Wild West it. Yeah. Yeah, it seems like, you know, initially when you're just starting off and learning something new, it's helpful, and the more you go into it, the worse it gets. And there's this point where you just don't trust any of the outside stuff unless it's like a reference manual or something, because you probably know that it's going to be like buggy
Starting point is 00:06:50 or not what you need. Yeah. The hard part is finding out where that divide is, because sometimes it's really hard. But other times it's, you know, apparently obvious when you hit that cliff, like trying to implement system calls and WebAssembly stuff. Like, there's no manual. Yeah. Especially back when I got into it a couple years ago. Like, there was no guidance out there. Yeah. Yeah, syscalls and WebAssembly. Like, yeah, like if I could ask, like, what
Starting point is 00:07:27 what project were you working on? Like, what was the objective, or was it just to learn something? Well, it kind of first started out as me getting really angry at Intel for Spectre. The place where I was working at, we saw these huge 20% to 40% drops in performance or something somewhere in there. And I was looking to see if there was another way to do it. So I got the idea of doing a WebAssembly Functions as a Service platform in my head, and it never quite left.
Starting point is 00:08:07 I don't think it's a bad idea. Yeah, and I can totally relate with getting angry at Spectre and Meltdown. At our workplace, we had two kinds of issues there. First, our CPU load or just the efficiency of our entire system went down by like 10 to 20%. And the second thing is like AWS tried patching some of the issues and that led to other problems that manifested in weird ways and they were mostly just JVM crashes. So yeah, that's pretty bad. So I mean, I'm sure then you've thought about like Apple's recent announcement and like moving to like ARM and all of that. Do you have like an opinion or you're just more of like a Linux person?
Starting point is 00:08:59 I've kind of flopped back between macOS and Linux for a while, but Catalina is probably pushing me back towards Linux in general. Is it because of the intrusiveness or just the amount of pop-ups you need to click? A lot of it is I do a lot of hardcore development stuff. And Mac OS is really great for a lot of things. But if you want to try to customize things or completely get rid of animations so that things appear to happen faster or unbind certain modifier keys so that you can use them freely elsewhere, you know, you're completely out of luck. A lot of the Apple
Starting point is 00:09:56 products work really great for the Apple prescriptive mindset. And sometimes it works really great. Heck, I think a good 40% of my blog was written on an iPad. It's a pretty good device, yeah. I'm kind of reaching the limits of macOS usability for me, plus the fact that Catalina basically broke my entire Steam library on macOS. Yeah, it really makes you think, right? Because I just saw a tweet somewhat recently that if somebody wanted to build a browser for an iOS device today, and there wasn't a concept of a browser before. Apple wouldn't allow it
Starting point is 00:10:47 because it would be like remote code execution. And we wouldn't have the concept of like a browser because of Apple rules probably. So I wonder what kind of things we're missing out on because we're designing for the average case but we're not letting hackers like do their thing. And hackers in the sense of, like, people who like tinkering with stuff, not the shady black hat hackers. Yeah. Um, it's kind of hard to know. Stuff like the PinePhone is being taken in some interesting ways, but it's an unknown unknown.
Starting point is 00:11:29 So, yeah, I want to ask you about your current job. So your title is a senior site reliability expert. And I'd like to know what it means. That's actually my past job. I haven't updated that page yet to avoid the onslaught of recruiters. Yeah. But so in Canada, you can't call yourself an E-word unless you actually have an engineering certificate, like you have one of the rings and everything. So we can't be site reliability engineers. So somebody had the brilliant idea of calling us site reliability experts. And apparently that's how they got around it. There are software developers and
Starting point is 00:12:16 site reliability experts. I think that's a great idea. What does it take to get like a certification where like software engineers generally go through that flow or? In order to be called an engineer of any kind, you have to have like a mechanical engineering degree and you have to go through the whole ritual with getting the ring from the engineers guild. The engineers guild. It sounds like. Whatever the heck it's called. I don't remember the exact term, but I remember that the Engineer's Guild is kind of litigious. And if they see you
Starting point is 00:12:56 using Software Engineer, they will not be happy. Yeah. For a second, I was just imagining like a discord guild of people like handing out licenses, I suppose. No, like in my head, what I'm imagining is like this group of people standing in a, five of them standing in a circle and you stand in the middle and they give you a ring and
Starting point is 00:13:26 formally induct you into the guild of engineers or whatever. You're like a first level mage or something of the engineers guild. I mean it's cool Like I read a statistic recently that there's going to be 25 million software developers in like 2021 or something. That's a lot of people. It's probably not enough. Yeah.
Starting point is 00:13:56 And I wonder how much this like bureaucracy helps. I guess if all it does is just hurt the title and you can just do the same job, which I'm guessing is true of Canadians working on this. Like, it seems like it's not that bad. We don't call ourselves engineers because engineering refers to like building bridges and things where, if they break, they have like material impact on human lives, instead of marginally making a line go up faster. Yeah. Yeah, I mean, it's questionable whether it's like software engineering or software artisanship. I'm sure like management wants things to be engineering or like more like software manufacturing, right? Like give me a schedule and tell me how quickly you can ship stuff.
Starting point is 00:14:58 And for like engineers, it's often how do we design the best system so that it stays up? At least that's what I'm interested in, and I know you are. Yeah. So yeah, I mean, I guess it's a good thing. It's like maybe we should all just be like software artisans or software plumbers or software developers. Um, yeah, who knows. Yeah. So yeah, about Lightspeed. So like, what did you do there? Um, what I did there is I was working on Kubernetes stuff, internal tooling, and lots of Kubernetes, really. Yeah, that's the hot new technology of the past few years. I think Lightspeed is the hot meme, yeah. Yeah. It's a pretty big company. So like, I don't know how much you can share any numbers, but would you say it's
Starting point is 00:15:50 like a large fleet of pods and hosts? Or would you say like, you don't need that much to run your system? Depending on how large you define large, it was pretty large. Okay. I think that gives a reasonable estimate, at least in my head. Like you get a sense of whether you can deal with it manually and you probably can't, right? You have to automate a lot of operations. So- Well, I mean, can and should are different questions.
Starting point is 00:16:19 It's a good point. So how did you go about it? Like what was your experience like with Kubernetes? Anything interesting you learned through that? YAML is... adjective. There's lots of interesting problems with YAML, especially when you get into the Norway
Starting point is 00:16:45 and Ontario bugs that it has with parsing strings. I don't know if you know this, but YAML allows like eight values for Boolean true and false. And two of these are no and on, which just so happen to collide with the ISO code for Norway and the province code for Ontario in Canada. There's also the fun where YAML is arguably Turing complete because it has generalized recursion in the form of references. Yeah, so it's not good at what it's trying to do, which is like limit the syntax that you can apply, and it's wrong in so many cases,
Starting point is 00:17:33 and it's way too verbose for things like Kubernetes. Yeah, I think that's like, it's a good question. How should one configure Kubernetes? Have you seen Starlark by any chance? I have looked at Starlark. A friend of mine has been trying to get me to get into it for a while and I don't know how I feel about it, but I've been using Dhall, D-H-A-L-L, to configure Kubernetes stuff for my personal things and it's pretty decent. Yeah, so I should give a little bit of background for listeners. So Starlark is the language.
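The Norway and Ontario collision described above can be sketched in a few lines of Python. This is only an illustration of the resolution rule, not a YAML parser: the list of boolean spellings is taken from the YAML 1.1 bool type definition as an assumption, real parsers may implement a subset of it, and the helper name `resolve_plain_scalar` is hypothetical.

```python
import re

# Boolean spellings from the YAML 1.1 "bool" type definition (assumed here;
# the exact set a given parser accepts may be smaller).
YAML11_BOOL = re.compile(
    r"y|Y|yes|Yes|YES|n|N|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF"
)
TRUTHY = {"y", "yes", "true", "on"}


def resolve_plain_scalar(text):
    """Return a bool if an unquoted scalar looks like a YAML 1.1 boolean, else the string."""
    if YAML11_BOOL.fullmatch(text):
        return text.lower() in TRUTHY
    return text


# Country and province codes written without quotes in a config file.
for code in ["NO", "ON", "SE", "QC"]:
    print(code, "->", repr(resolve_plain_scalar(code)))
```

Under this rule, quoting the value in the config file is what keeps a code such as NO a string.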
Starting point is 00:18:11 It's like an open source language released by Google. It's supposed to be, it's the language used to configure build files of the Bazel build system. And you can think of it as a subset of Python. But it doesn't allow things like recursion. So you can't do completely crazy stuff. I've always found it like, maybe it's good, but maybe you can do way too much with it. So what does Dhall look like? Is it similar to another language? Is it like, you know, superset of JSON or something like that?
Starting point is 00:18:54 Dhall is, it kind of looks like Haskell in a way, but the big things that it has are functions, variables, and imports. And it also has a very strict type system so that it's impossible to write a lot of common bad things in Kubernetes or misspell variable things because the Dhall compiler will just stop you. Okay. And before it fails in production. Okay. So is this like specifically for Kubernetes or is it like more general purpose? Oh, it's very general purpose, but they have a Kubernetes package.
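As a loose analogy only (Dhall is its own language, and its checking is static and much stricter than this), the idea of config built from functions and variables, with misspelled fields rejected before anything is deployed, can be sketched in Python. The `Deployment` class, the `web_service` helper, and the registry name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """A tiny stand-in for a typed config record."""
    name: str
    image: str
    replicas: int = 1


def web_service(name, tag):
    """A reusable function that produces a config value."""
    return Deployment(name=name, image=f"registry.example/{name}:{tag}", replicas=2)


site = web_service("site", "v1")
print(site)

try:
    # A misspelled field is rejected when the config is built,
    # rather than being silently ignored at deploy time.
    Deployment(name="site", image="registry.example/site:v1", replcas=3)
except TypeError as err:
    print("rejected:", err)
```

Python dataclasses do not check field types at runtime, so this only catches unknown field names; a typed config language is meant to catch type mismatches as well.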
Starting point is 00:19:29 I use the Kubernetes package for deploying my website. OK. That sounds promising. It sounds super similar to Starlark, because it's the same thing with functions, variables, and imports. And you can't do too much else with it. I guess it has a slightly weaker type system, but yeah. Yeah, the Dhall type system is a huge advantage of it, but at the same time
Starting point is 00:19:56 it can be kind of painful too, but yeah, that's just how this stuff is. Yeah, it seems like we need a common ground from YAML, which is super declarative and doesn't let you do much, but also doesn't let you configure config pipelines that take like 30 minutes for a system to understand. But at the same time, like, we need a little more power so we're not copy-pasting stuff everywhere. I don't know if we've got a good middle ground yet. Dhall and probably Nix are good places to start. I mean Nix as in the language, not the package manager. Yeah, so tell me a little bit about it since I have never used Nix. And I've only heard about like NixOS and like the Nix Package Manager.
Starting point is 00:20:47 Yeah. So NixOS is built on top of the Nix Package Manager. And the Nix Package Manager uses its own bespoke language, which I believe is also called Nix. And it uses that to define packages, key value pairs, and even entire system declarations and Docker container build files. Okay. So everything is defined through this one language called Nix. And what does it look like? Does it have a syntax similar to other languages,
Starting point is 00:21:26 or is it completely independent? It looks like the chat is disabled. However, let me paste you something in the Twitter DM. But you can probably include this in the show notes or something. But here's an example of a Nix file that I used to build christine.website. And it has imports.
Starting point is 00:21:58 It has some statements terminated with semicolons. It looks kind of Haskell-y, I guess. Yeah, it seems like somebody's put TypeScript into Haskell. That's what it looks like to me. Yeah. There's also some other interesting features like the inherit statement, other things. What does the inherit statement do? Okay, so basically in that site, in that file on line 15, I have inherit src. That's equivalent to saying src equals src. Okay. And yeah, maybe I'm missing something, but why would you want to do src equals src? So that you can define a common source tree in the parent let block and use it in a couple other packages for building things.
Starting point is 00:23:10 In my case, I use it to define the site source separately. I build the website binary in Rust, and then I build a combined derivation with the binary for the site server, as well as the config. Okay, so it's just a way of basically like importing in a sense, but also being able to use that. Yeah, I think there's something similar in Ruby but I forgot how to do it. Yeah, that's another language that I just missed out on. But I've seen its syntax.
Starting point is 00:23:49 It looks nice to use. So some of the other things you spoke about or like you've done in your job, you said that you've created some internally consistent and extensible command line interface for internal tooling. I'm not sure how much is NDA'd there, but I made a tool to help organize a bunch of other tooling. It ended up kind of growing out of scope as one does, but it was an interesting project. It also led me to learning how to make my own homebrew package definitions, which is what drove me to Nix.
Starting point is 00:24:36 For personal stuff, because trying to do homebrew builds in CI is a Lovecraftian nightmare of horrors. Yeah, it sounds like, I've never seen anybody actually try to integrate like Homebrew and CI. We ship a bunch of stuff through Homebrew, but we figured out a hack where we have like a puppet script that runs on like developers' laptops
Starting point is 00:25:04 and it just periodically syncs from this private GitHub repo. And we just make binaries available there. It doesn't need any best practices or anything, but it makes deployment of internal tools really simple. Yeah. It's one of those solutions where all the options are bad, and you really just got to choose which one you want to deal with. Yeah. Were your customers just internal engineers? And what's it like to have engineers as customers?
Starting point is 00:25:37 They were all basically internal people. People do weird things to their environment. There was somebody that had a bunch of problems with the tooling, but it turns out that they had all their source code on a thumb drive and they mounted the thumb drive every time they wanted to do something. And it was like a FAT thumb drive. And guess what Unix permissions really don't like? If you guessed FAT drives, then you win exactly zero dollars. Wow. And what was the reason for them to keep all their source code on a thumb drive? Were
Starting point is 00:26:28 they scared that their laptop would just not work or something? I've learned over the years that when you encounter divergent workflows like that, it's best to not ask why, because the answer will sometimes scare you. So I just sort of tried to fix the bugs and just let that person be with their unique life decisions. Yeah, I mean, so we'll never really know why this person had all their source code on a thumb drive, but. I mean, undoubtedly it's probably because they had data loss or something and they don't want to do that again, but who knows. Yeah, there's this like tool called Dropbox.
Starting point is 00:27:17 Maybe they can use that or like Git. Yeah. And there's this team called compliance which blocked it. Oh. Oh, that's the other issue to think about, I guess. Yeah, compliance. How often have you had to think about like compliance in your job, or like in any job? Not very often. However, I found that compliance requirements can be a useful tool to stop people from committing mistakes in production, especially if the security team is overloaded, and saying, yeah, we're going to need approval from security there, will delay something by two weeks, and product can't make it go faster no matter how much they complain. Yeah, I've seen that engineers get scared of compliance much more than they get scared of
Starting point is 00:28:13 other things, because like I think engineers don't want to talk to lawyers, like ever. So it's always a good way to like say, oh, you know what? That might break some of our requirements. We have to fix this ASAP, or we cannot fix that. Yeah. I think people really just want to go fast. And going too fast too often means that somebody has to clean up after it. And you really got to choose what you're optimizing for.
Starting point is 00:28:49 So would you say your experience with using Kubernetes and all of that, do you think it's, there's a lot of hype and there's also a lot of people complaining about how it's overkill for most people. Clearly, I think you seem to like it since you deploy your own site with it, but what's your experience been otherwise? Actually, I deploy my own site with Kubernetes
Starting point is 00:29:12 because I wanted to learn Kubernetes and I'm one of those people that is cursed to be a hands-on learner. So in the future, I don't know if I'd really use Kubernetes. It's convenient. A nice side effect of it is that it's very, very prescriptive. And it has an exact model of what you do and how you introduce like mutable state and how you do replicas and other things that a lot of other systems just don't really have a match on, or can't really hold a candle to. But at the same time, with all that power comes a whole bunch of complexity
Starting point is 00:29:55 and documentation that is really good at selling you Kubernetes, but not explaining what fields are allowed on the freaking pod spec. And, you know, like overall it's a good tool sometimes, but sometimes is not all the time. Yeah. Yeah. So do you really feel like Kubernetes achieves its goal or goals? Like, like if you think about its goals, like it can give you like a reliable system out of the box in a sense, and it'll ensure like with those disruption budgets and all that,
Starting point is 00:30:33 that your system doesn't go down. You get a bunch of features and systems for free, like auto-scaling and all of that. But then along with that comes the complexity of managing all those systems. Like overall, would you say it's like a net positive for like productivity at a company or it ends up causing more outages and issues because people don't know how to operate it?
Starting point is 00:30:55 You said that it has auto-scaling out of the box and I had to laugh for a minute because the auto-scaling stuff in Kubernetes out of the box is basically a great way to make downtime problems worse. Oh, interesting. Tell me more. It sounds like there's like a story there. Yeah. Yeah. I used to work at Heroku on the metrics team, which did auto-scaling, and basically, due to some constraints that were brought about by existing history, we weren't able to use very many metrics in order to do the auto-scaling calculation. So a compromise was made and we had something that sort of worked for our target, which was trying to make load times go down. Except when you try to do a quick fix somewhere, you end up pushing problems deeper into the stack, and on certain types of applications, if you enable the auto-scaling
Starting point is 00:32:06 and it's not actually a CPU-heavy load but it's a database-heavy load, spinning up more instances would kill the database server, and it would lead to a situation that we kind of jokingly called auto-failing. I mean, it's a distributed system, so the errors should be distributed too, right? Yeah. That makes a lot of sense, right? Like you have a system that's like, you can totally imagine a database being overloaded
Starting point is 00:32:36 and its latency increasing and the latency like- Bubbles up. Yeah, bubbles up and then it's like, oh, you need to auto scale this other cluster since its latency is too high and throughput's too low. Yep. And a lot of the Kubernetes stuff mainly focuses on CPU and memory. Okay.
Starting point is 00:32:57 So, yeah. Which kind of invites auto failing. Okay. So, because the input metric isn't good enough to take all of these different things into account. Yeah. A lot more services are IO bound than you think. And they're usually IO bound on the database. Yeah.
Starting point is 00:33:19 Or something that is a JSON front end to a database. Yeah. That makes sense. So I don't know how much you can reveal, so it's totally fine if you don't say any of this stuff, but was your objective to roll out like auto-scaling to every single deployment of Heroku? I guess not, right? Because there's so many... I don't know. I wasn't around for all of the planning stuff there, but it was an interesting problem either way. It teaches you a lot about how services are put together and how assumptions fall apart when the assumption is sufficiently general. Yeah.
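The "auto-failing" feedback loop described above can be illustrated with a toy model. Everything here is an assumption for illustration: the latency formula, the connection limit, the pool size, and the scaling policy are made up, and none of it reflects how Heroku or Kubernetes actually compute scaling decisions.

```python
def db_latency_ms(connections, max_connections=100, base_ms=20.0):
    """Toy model: database latency grows as the connection limit is approached."""
    if connections >= max_connections:
        return float("inf")  # the database is refusing connections or timing out
    return base_ms / (1.0 - connections / max_connections)


def autoscale(instances, pool_size=10, target_ms=50.0, steps=10):
    """Naive policy: add one instance whenever observed latency is above target."""
    history = []
    for _ in range(steps):
        latency = db_latency_ms(instances * pool_size)
        history.append((instances, latency))
        if latency == float("inf"):
            break  # scaling out has taken the database down
        if latency > target_ms:
            instances += 1  # more instances means more database connections
    return history


for instances, latency in autoscale(instances=7):
    print(f"{instances:2d} instances -> {latency:.0f} ms")
```

In this model the scaler reads slow responses as a need for more web capacity, each new instance opens more database connections, and the latency it was trying to fix gets worse until the database stops answering.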
Starting point is 00:34:03 So I've read a little bit about Heroku's auto-scaling system. You use like the P95 request latency percentile to decide, and then you combine that with Little's law to decide whether you should scale up the application or not. That seems like one of the things or one of the metrics that is at least advertised on one of the blogs. Does that, okay. That sounds about right. I was going to avoid going into there, but since it's on a blog post, yeah, they use
Starting point is 00:34:35 P95 metrics and Little's Law in order to do it. It's actually kind of a clever thing, but it's unfortunate that it doesn't scale out as generically as I'd hoped. Okay. So it fails in like those cases where you're IO bound on a database, but it's really great if you're a Rails app and you're single threaded and you only have one instance per container and you have a really CPU and memory intensive thing and it gets slow. For that case, it's absolutely beautiful.
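Little's Law says the average number of requests in flight equals the arrival rate times the time each request spends in the system (L = λW). A minimal sketch of how that could feed a scaling decision follows. This is not Heroku's actual algorithm; the function names and the per-instance concurrency parameter are assumptions, and using a P95 latency in place of the mean is a conservative stand-in rather than what the law strictly states.

```python
import math


def concurrent_requests(arrival_rate_per_s, latency_s):
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * latency_s


def instances_needed(arrival_rate_per_s, latency_s, concurrency_per_instance=1):
    """Instances required to hold the in-flight requests (at least one)."""
    in_flight = concurrent_requests(arrival_rate_per_s, latency_s)
    return max(1, math.ceil(in_flight / concurrency_per_instance))


# 40 requests per second at 0.25 s each, single-threaded instances.
print(instances_needed(40, 0.25))
# The same traffic after latency doubles asks for twice the instances.
print(instances_needed(40, 0.5))
```

The formula cannot tell why latency rose. For a single-threaded, CPU-bound app more instances help; if the database is the slow part, the same calculation requests more instances that add database load.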
Starting point is 00:35:19 It's just that case is not very often at all. Okay. So that makes sense, that you're basically running a stateless single threaded thing that scales, that's directly proportional to like latency and request depth in a sense. So it's like, if you're trying to run your stateless monolith web app, you can put that on the system. What, if you can reveal,
Starting point is 00:35:43 or if you can just give a flavor for, what kind of other applications does it break for? That's all I really know. All I really can remember is what was in support tickets. And IO bound on the database is probably the biggest fail pattern. But I mean, this is something that everybody struggles with. I mean, creative design decisions in MySQL and Postgres like 20 years ago have created situations where like,
Starting point is 00:36:21 fun fact, every time you make a new connection to Postgres, Postgres actually forks a new child just for that one socket server, socket client. And as a side effect of this, Postgres has really amazing multi-thread performance. I mean, like, combined with Linux's copy-on-write support and Postgres' habit of doing things as append-only logs, this means that you can get some really amazing multi-threaded performance. However, as a side effect, this also means that all of those children,
Starting point is 00:36:59 especially if a lot of them are persistent, they just hang out in memory. And every time, they just use up a lot of memory, and then there's kernel buffers and all that stuff that just adds up over time. Yeah, I didn't know that. So is that the reason why people recommend putting like a PgBouncer or something
Starting point is 00:37:20 in front of a Postgres instance? I love PG bouncer. It's such a hack. Yeah, and I'm sure you know about transaction ID wraparound. I am not actually that familiar with it. I'm more used to Postgres in anger and production and not really like in the nitty-gritty of ORM stuff. That makes sense. Did you work with Postgres as someone who worked at Heroku and had to deal with customer Postgres instances going down
Starting point is 00:37:59 all the time? Nope, I was not on that team. I was more dealing with the actual metric server itself. Okay. Yeah. So transaction ID wraparound is just, if you have close to 2 billion transactions since like Postgres uses like in32 for transaction IDs, it just stops the database completely saying I need a garbage collect some old transactions since i'm running out of ids for you to use and can go down for like a day or for like a week and if you if you google like transaction id wraparound you'll see like a bunch of postmortems like you know generally people don't write up like a postmortem like if they're down for a few hours but since this issue just causes them to be down for a day they have like a huge mea culpa like
Starting point is 00:38:45 I'm sorry, this is what happened, and it's a result of the architecture of our system and we use Postgres. And I've certainly almost seen, or maybe I even have and I don't even remember. No, I haven't seen one, but I've certainly seen like an internal postmortem of like, yeah, our database hit transaction ID wraparound and was just down for like a day and a half. And there's nothing we could have done about it. So yeah, it's as you said, like a creative design decision, like who would need 2 billion transaction IDs? I mean, if you have like a few million requests per second, across a bunch of machines,
Starting point is 00:39:25 2 billion happens pretty often. Yeah, but Postgres can do like remarkably well for... Oh yeah, it's fantastic until it's not. Yeah. I think the challenge comes in, like, once it reaches its limit, how the heck do you shard it? And then people think of, like, first, maybe we shard by user, or we shard by merchant or whatever, and then the product use case comes in where you need to do like a cross shard transaction
Starting point is 00:39:58 and then you're screwed. So that's why I think like CockroachDB and all of these things have like a lot of potential, right? Do you have any experience with any of these new NoSQL databases or SQL but in the cloud-type systems at all? Not much. I kind of don't like them. But I've had some bad experiences with MongoDB and losing data in the past. So I tend to stick to what works because if I can understand it, even though it's bad, I can understand it.
Starting point is 00:40:38 And that means that I know better how to fix it when it breaks. A lot of the document databases are really solutions looking for problems in my book. Yeah, I was listening to somebody's opinion. And they said it helps you start a company faster, since you don't have to think about schemas. Do you buy that argument at all? I think in terms of types. And I like schemas because it's the information about what's there and what can be there and what can't be there and what is there at all is very explicit. You can't just have random other fields show up one day without there being explicit changes to
Starting point is 00:41:36 the database to allow it. I'm a big fan of types and schemas are a great way to enforce types. And if something is that important that you need to add a field to it, then you can add the column in a schema change. Yeah, that makes a lot of sense. I mean, that has its own problems. Like, not saying it's pain-free, but overall, the ability to know what is in a database table without having to look at all the contents of it is a lot more valuable than you really think it is.
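The point about schemas making "what can be there" explicit can be sketched with a tiny example. This uses Python's built-in `sqlite3` module purely so it runs anywhere; the conversation is about Postgres, and the `users` table, the `birthday` column, and the helper names are hypothetical. The behavior shown (an unknown field is rejected until an explicit schema change adds it) is the general idea being described.

```python
import sqlite3


def make_users_table() -> sqlite3.Connection:
    """Create an in-memory database with an explicit users schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    return conn


def column_names(conn: sqlite3.Connection) -> list:
    """Ask the database what is in the table without reading any rows."""
    return [row[1] for row in conn.execute("PRAGMA table_info(users)")]


def insert_with_birthday(conn: sqlite3.Connection, name: str, birthday: str) -> bool:
    """Try to store a field; return False if the schema does not allow it."""
    try:
        conn.execute(
            "INSERT INTO users (name, birthday) VALUES (?, ?)", (name, birthday)
        )
        return True
    except sqlite3.OperationalError:
        # The table has no such column, so the write is refused outright.
        return False


def add_birthday_column(conn: sqlite3.Connection) -> None:
    """The explicit schema change that makes the new field legal."""
    conn.execute("ALTER TABLE users ADD COLUMN birthday TEXT")
```

A random extra field cannot simply appear: the insert fails until `add_birthday_column` runs, and `column_names` always answers the question "what is in this table" without scanning the data.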
Starting point is 00:42:15 Yeah. One thing we do, which mostly works when we know that we won't have to be querying, you know, filtering on a particular field, is we stuff like a protobuf as one of the columns in a table, and we can just put more fields inside that protobuf if we know it's something that we don't think we'll need to filter on. And that gets us to kind of this place where we still know the schema, but it's not as expensive as a column migration on a huge table when it's problematic. I don't know if you've seen similar hacks, maybe
Starting point is 00:42:58 like JSON columns. There is a time and place for JSON columns, though. I know one thing that uses it pretty well is an ActivityPub server called Pleroma. I'm not really sure how to pronounce it. But ActivityPub is a very JSON document heavy protocol. And Pleroma tries to be agnostic as to what ActivityPub objects it can accept. So a JSON column makes sense there
Starting point is 00:43:33 since ActivityPub objects can, for all practical purposes, be arbitrary JSON. Yeah, yeah. It's like when you're acting as like a proxy in a sense. Yeah. But a lot of the times you really do know what's there and just use the damn schema. But what about those cases where like a migration is really hard or you have like a partitioned table and it's gonna take, it might even cause like downtime, or it's just impossible to do like some kind of migration
Starting point is 00:44:14 in place because there's so many like concurrent transactions that they just hold on to like these row level locks or like table level locks and it's not worth the effort for something as simple as, oh, I want to add this just random number that I know I'm never going to be filtering by. It depends. However, I personally prefer radical simplicity, even if it ends up causing some issues down the line
Starting point is 00:44:44 because the simple thing is easier to understand when it breaks. And the simple thing has less going on. So it is more obvious when you look at it when it breaks. Yeah, I think that was spoken as like a true SRE, right? You keep systems as simple as possible. And that's the only way you'll be able to debug them when they go wrong. I mean, I think that's like a great segue. So you had to manage some of these larger systems at a bunch of your previous workplaces.
Starting point is 00:45:18 What's your philosophy on CI/CD and alerting and all that? So let's start with CI/CD. Do you believe in continuous deployment? What have you seen work best for you or for the companies you've been at? So continuous deployment is kind of a double-edged sword. On one hand, it's really great. And when you do spend the time to get it all working,
Starting point is 00:45:44 it can make updates seem kind of magic because, you know, you just blink and it's done. And you can put new versions of stuff into production without much fear. But then the other edge of the sword comes in and for some things getting to a place where things are continuously deployed can be slightly difficult, especially if it's an existing implementation or there's some fun problems with your setup that make continuous deployment difficult. I'm just thinking about that one thing that I heard from a Googler friend of mine where they had services in Java and they were deploying it continuously, but the JVM's JIT needed to warm up. So they just replayed traffic at it at like 30,000 times normal speed for half an hour to warm up Java's JIT. I mean, you know, but so part of the process of it
Starting point is 00:47:04 is figuring out how to actually do it and implementing the scripts and automation to do it. But overall, I think it's generally worth it. Yeah. This is a hot take, but a lot of the unit testing methodology that I see used is kind of fundamentally wrong, especially the stuff around mocks. I'm probably one of the few people that thinks that, but a fake version of the world will only let you see how fake your world is. And you should really be testing against real databases, real APIs, real file systems as much as possible. You don't need to mock the file system. That's why we have temporary files.
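As a small illustration of the temporary-files point, here is a sketch of a test that exercises a real file system instead of a mock. The `count_lines` function and the file name are made up for the example; only the standard-library `tempfile` and `pathlib` modules are used.

```python
import tempfile
from pathlib import Path


def count_lines(path: Path) -> int:
    """The code under test: counts lines in a file on disk."""
    with path.open() as handle:
        return sum(1 for _ in handle)


def check_count_lines_on_real_file() -> int:
    """Write a real file in a throwaway directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.log"
        path.write_text("first\nsecond\nthird\n")
        # The directory and file are deleted when the block exits.
        return count_lines(path)
```

Nothing is patched out, so the test covers the same open and read calls that production code would make.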
Starting point is 00:47:58 Yeah, that certainly makes sense in the context of, like, yeah, you assume that your APIs are working, but it's often like these points of integration that break. But isn't there like a world where, you know, business logic should be tested and like unit tests because you have a set of integration tests that test these like various integration points, right? Yes.
Starting point is 00:48:27 However, at some level, you're really testing the mock. I guess you are. I've had some pretty bad experiences with some pretty horrible mocks. And a lot of the stuff that I've been dealing with is mostly like do math and put it into a database. And a lot of the math you can end up just testing by itself using actual data structures. And then integration against an actual database with Docker or something.
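The "do math and put it into a database" split described here might look something like the sketch below: the math is a pure function tested directly on ordinary data structures, and the storage step is tested against a real database engine rather than a mock. An in-memory SQLite database stands in for the Docker-hosted database mentioned in the conversation, and the `metrics` table, the function names, and the nearest-rank percentile method are assumptions for illustration.

```python
import sqlite3


def p95(values):
    """Nearest-rank 95th percentile; pure math, no I/O, easy to test alone."""
    if not values:
        raise ValueError("p95 of an empty sequence is undefined")
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def record_p95(conn: sqlite3.Connection, name: str, samples) -> None:
    """Compute the percentile and store it in a real table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics (name TEXT PRIMARY KEY, p95 REAL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO metrics (name, p95) VALUES (?, ?)",
        (name, p95(samples)),
    )


def read_p95(conn: sqlite3.Connection, name: str):
    """Read the stored value back, or None if it was never recorded."""
    row = conn.execute(
        "SELECT p95 FROM metrics WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else row[0]
```

The first kind of test calls `p95` with plain lists; the second opens a connection, calls `record_p95`, and checks what `read_p95` returns, so the SQL itself is exercised.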
Starting point is 00:49:07 I think people, like, there's so much functionality out there, and people don't, or developers, I should say, just generally, they either go, like, one step too far, which I think is, like, integration tests or, like, Selenium tests for everything, since, like, the data model isn't clean enough that you can test it at like a unit or even in an integration test. And I've certainly seen that.
Starting point is 00:49:30 And, or they go the complete other route where you just mock out so much that you create extremely brittle tests that you refactor anything and you'd have to like fix thousands and thousands of tests. And I've seen that as well. And your test isn't doing anything useful since you just patched out the wrong thing. And it just kind of works
Starting point is 00:49:50 because you haven't written both a positive and negative test. Yeah. In terms of my testing philosophy though, I believe in two tiers of tests. I believe in an integration test where you test against as real of the world as possible. And I also believe in functional tests where you have something that goes against prod and staging and the like every hour or so and tries to do a whole bunch of common user operations. And if that fails, then you either alert someone or log it in Slack. Yeah. So how do you configure those functional tests? I totally agree with you on the functional test aspect. Yeah. So in Heroku, they had a tool called Direwolf, which had a huge bank of Ruby code that would let you test against, do functional
Starting point is 00:50:47 testing against stuff. And that was pretty great. The exact dialect of Ruby was a bit odd because, you know, Ruby is DSL heaven. But once you got used to it, it was really great and you could implement tests really quickly. Yeah, so this library or like this framework would set up auth and all of that for you and you just had to specify. It would configure like API URLs. I think there was also an April Fool's joke for free puppies at one point. And it would probably log some metrics so whoever owned that test or the team developers at that company to build frameworks like Direwolf in many cases.
Starting point is 00:51:54 And then everybody ends up building their own local maxima solution, which isn't great. Yeah, I've been considering building something like that as open source software on top of like Rust and Lua, but I never really got anywhere with that idea yet. Yeah. Probably make a great startup. Yeah. How do you build something like that that's generally applicable to a bunch of different companies, right? Like you'll have to figure out, you'll have to come up with some design where you have like an auth provider and then it would just end up looking like
Starting point is 00:52:30 enterprise software that doesn't really work for my use case since I have to think about this other thing as well. That's generally where I've been caught, yeah. Yeah, I think, yeah I think startups in the CI space are always confusing to me. But maybe someday something like CircleCI, but something that has the popularity of GitHub, will be out there. Maybe GitHub Actions is that thing,
Starting point is 00:53:04 but I just heard about that vulnerability a few weeks ago, and I don't know if. I mean, it's executing arbitrary code from people. It's going to have vulnerabilities at some point. Yeah. But does it really have to read log files and run commands based on that? That seems a bit excessive.
Starting point is 00:53:23 Who knows? Okay, so yeah, but your philosophy on testing is go for the straight integration test and then just the periodic functional test. Yep. Verify something. And what about like stage and canary and production and all of that?
Starting point is 00:53:41 So like, how do you think about breaking up, you know, your deployments into tiers? Depending on the company and how much budget you have, the ideal is sort of like a three-tier system where you have some sort of development one, which is either the most recent commit to any branch or the most recent commit to your default branch. A staging one, which is a bit more quiet and it's somewhere if you have like QA testers or something, you point them to staging and tell them to do stuff there. And once QA says, yes, this will probably not explode horribly, then you can promote it to the money generator or production. Okay. Yeah, that makes sense. So you mean like this is three stages where you're continuously deploying, I guess, the first one, somebody manually verifying on the second one, and then the third one is considered like safe
Starting point is 00:54:50 since it's been tested against by humans. Quote-unquote safe. Yeah, quote-unquote safe. But what about like automating those humans in the second stage? Like, is it possible to do that or you think, or let's say you don't have the budget to have humans there? Then you'd probably have some sort of machine thing, but if you don't really have the budget there, then you just kind of chuck out the staging phase altogether and just have like
Starting point is 00:55:29 sandbox and prod. Okay. The way we do it is like that staging part, that's certainly where we dogfood in a sense, or I should say the first tier, which is like continuously pushed, that goes out to employees, and that ends up catching a bunch of issues. So that can be one way of... If it's a kind of product which employees can actually dogfood, that is. It really depends because, for example, if your product is about golf courses and nobody in your company golfs, then it might be a bit difficult to dogfood it.
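The three-tier flow described above (development, then staging, then production) can be sketched as a tiny promotion rule. This is only an illustration of the idea, not any particular company's pipeline: the tier names come from the conversation, while the function name and the two boolean gates (`automated_checks_ok`, `qa_signed_off`) are assumptions.

```python
TIERS = ("development", "staging", "production")


def next_tier(current: str, automated_checks_ok: bool, qa_signed_off: bool) -> str:
    """Return the tier a release should be in after this evaluation.

    development -> staging needs only passing automated checks.
    staging -> production additionally needs a human QA sign-off.
    """
    if current not in TIERS:
        raise ValueError(f"unknown tier: {current}")
    if current == "development" and automated_checks_ok:
        return "staging"
    if current == "staging" and automated_checks_ok and qa_signed_off:
        return "production"
    # Either a gate failed or the release is already in production.
    return current
```

Dropping the staging phase for budget reasons, as suggested in the conversation, would amount to collapsing the two gates into one automated step between a sandbox and production.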
Starting point is 00:56:12 And what's your philosophy on stuff like earlier in the stack? I know that you said that you're like a types person. Does Python or Ruby ever have it? Is it ever fine to deploy a service in Python? I know some companies don't let you do that anymore. Like if you have a really good reason, like, I mean a really good reason. We're doing some fancy machine learning stuff, and we have a limited number of goat sacrifices, so we want to use Python to save them for later.
Starting point is 00:56:53 Yeah, Python would make sense there. Rails is kind of a pain to manage in production, but it's a pretty fantastic and solid platform. There are times when you'd want to use that. Yeah. Off the top of my head, I know Shopify has like this huge Rails monolith that they do stuff with. I think Heroku started as a Rails app too. It really depends. But if I was given free rein, I'd probably gravitate towards something like Go or Rust, and more recently Rust, mostly because the compiler stops you when you're trying to do things that are bad. And especially in Rust, the compiler
Starting point is 00:57:46 will stop you if you're trying to use memory invalidly. And it won't break in production for that reason that the compiler stopped you for. Code that can never be deployed is code that can never fail. Those problems that are rejected at compile time are problems that cannot be discovered at runtime. Yeah. And the value of the compiler stopping you when you make a typo is so great that
Starting point is 00:58:18 it's basically unparalleled. Yeah. So, I mean, that brings me to my next question, which is, so how do you feel about letting a bunch of engineers work on features and then there's like a group of SREs who work on making sure the system stays up? So like, let's say you have like a hundred developers just working on product and like 15 SREs keeping the system up. Do you like that model, or do you like the model of like, you know, every team should maintain its own service, or is it like some kind of middle ground? What's your general opinion on these things?
Starting point is 00:59:01 That's just hell. Um, I've experienced this a few times where, you know, product controls the purse and product wants new shiny and they want it now. And there's this brilliant video by, what's it called? It's about microservices. And they have this person going through and explaining the entire flow of why they can't add someone's birthday field to a user info page because, you know, they have to go through the name provider service and then, you know, the entropy chaos system and a whole bunch of other stuff. So like at some point if you just give people free rein to do stuff and just keep going, you end up with just something unmanageable. And in terms of actually like designing things to work around that, I don't know. This is not a technical problem. It is an organizational and social problem. And I'm really good at technical
Starting point is 01:00:13 solutions, but social problems like that are, they're out of my league. So given one or the other, do you think it's better to just let people work on maybe a monolith and a few services where it makes sense, like a database service or a database proxy, and everybody else uses the monolith? Monoliths are pretty great, yeah. People rag on monoliths a lot because they've been burned by that one PHP monolith that takes three days to deploy as long as there's a full moon. But in terms of reliability and not dynamically linking function calls over HTTP and JSON, it's pretty great. Yeah, you don't have to think about like version skew and all of those things. Oh God, version skew.
Starting point is 01:01:14 It's also so hard to test for those things. I mean, it's just Conway's law at work at that point. Are you familiar with Conway's law? I am familiar with it, but like I think you should repeat it. Okay. So Conway's Law is basically this law that says that all technical systems will imitate the communication structure that created them. And you can see this very obviously if you look at things like Apple and Microsoft, where it seems that two features that on your end seem like they're very related.
Starting point is 01:01:58 Oh, what's a simple example of something that comes to mind? Like Safari for iPhone and Safari for the Mac desktop. You know, they're very similar, they are based on the same code, but they have like subtly different things that just don't quite add up. And, uh, I'd be willing to bet some good money that the Safari for iOS team is a different team than the Safari for Mac team. Also like a different set of PMs and different OKRs. Different PMs, different services, different budgets, different goals. Yeah. You basically ship your org chart, right? Yeah.
Starting point is 01:02:50 There's a way you can see this in Microsoft. I can't think of one off the top of my head. But if you start looking at incongruities in various APIs or even just products in places, you can begin to see where the different team boundaries are. Yeah. Yeah. Yeah. That makes sense. Like, and especially like one recent example I can think of is somebody complaining in the product that, you know, there's so many upsells inside the product. Like there's like one prompt or like, and there's like another prompt right after that bunch of banners within the app.
Starting point is 01:03:25 It's because each individual product team wants you to use the thing they ship. And, but there's no centralized. They want number. Yes. They want, like, oh, look, we expect ARR to go up by like 3% with this new modal that says buy more stuff. So we ship this new modal that says buy more stuff. And now we expect our revenue to go up by like 10,000 bucks a month or something. I don't know. Best experience in Netscape Navigator. Yeah. Yeah.
Starting point is 01:04:08 And I'm not, I'm not, and you're right, right? It's an organizational and social problem. I still don't know if there's any organization that's big that does it well. Maybe all the big organizations that are like successful, they've just, you're successful by shipping different products that are successful. And if you put too many people on one thing, there's going to be like no cohesive vision, even if you have like one VP on top or whatever. Maybe it's just too hard to do that and to organize people that way. I wonder if this is how Amazon organizes stuff with an,
Starting point is 01:04:41 inside AWS, you know, that giant product screen where you have like 50 different products, everything from private servers to quantum computing to satellite operations. It's very possible that every one of those is a separate team. Yeah. And I guess eventually like there's enough people going to complain about the list of product offerings that they'll spin up another team that'll be the AWS catalog team. Their job is to show you the relevant set of products for you. Yeah, and then there's going to be the team management team
Starting point is 01:05:17 for managing all the team management teams. Yeah. This is why, like, I think if you make like two software engineers talk for too long, I wonder if there's like a rule that says they have to complain about management and organization within like 45 minutes. Oh, it happens. Yeah. When I started this podcast, I was thinking about, you know, should I ever invite like somebody who's in management, or should I keep this like pure software engineering, SRE? I mean, let's see. I guess I can always change my mind later on, but. It can be interesting to have it happen just once to, you know,
Starting point is 01:06:13 let the chaos unfold. Yeah. But the moment they say the word synergy, then I just disconnect. Well, the thing is you have to synergize all of your deliverables into Gantt charts, right? Yeah. And have a timeline view of them. So. You just double click on it and you look at the big picture. Okay. So at Salesforce, one of the buzzwords was let's double click on that. And apparently that was like corporate speak for, you know, looking into it closer.
Starting point is 01:06:53 But inside Heroku, there was this huge, you know, like perma thread on Slack where people were like showing various different behaviors that double click can have in various environments, various programs. So obviously we want to open the documents in a PDF viewer. And now like double tapping stuff on your phone sometimes means you like a picture, like on Instagram or LinkedIn. I don't know if you've seen like those viral posts on LinkedIn where they're like, Oh, look at this new feature. You just need to double tap to see what it is. And they make like thousands of people double tap on their posts and that's
Starting point is 01:07:34 how it goes viral. It's so bad. Yeah. Yeah. I've never been in a part of like a smaller company that was acquired by a larger company. I've always been in a part of like a smaller company that was acquired by a larger company. I've always been in those larger companies. So that's the experience that I guess I should try at some point. If you really hate buzzwords, it's probably a bad thing.
Starting point is 01:07:57 At least with Salesforce, they have this unique mix of like English and Hawaiian buzzwords that, uh, I don't think I'll see anywhere else. I think, yeah, I think I know what you're talking about. I have a bunch of friends at Salesforce and there's a lot of focus on Hawaii. Is that because like the CEO likes Hawaii, like Marc? I have no idea. But it was fun to, uh, poke fun at. Yeah, I wonder, like, if I ever start a company, I can just start like a culture just for the sake of it. And since I'm the CEO, nobody can tell me anything. We're really into water bottles, we love different kinds of water bottles, and our buzzword for focus is, let's have a drink from our water bottle for that. And nobody can say anything since that's the culture of the company. That's the kind of chaotic stuff that I like. Some point I should start something and hopefully it succeeds just for this
Starting point is 01:09:01 water bottle joke. And hopefully nobody listens to this podcast at that point. Well, until then, I bet you can, uh, put a cap on that. I think, yeah, this is like a good stopping point, a good point to like, you know, cap it up and wrap this podcast up. Yeah, thank you so much for participating. I think this was a lot of fun. Like this podcast is supposed to be like Software at Scale, but I just had a lot of fun talking about all like random things as well. Yeah, thanks so much. Yeah, no problem. It was fun to do this. Yeah. Until next time. I'll tag you when I do the editing and publish.
Starting point is 01:09:48 Thanks so much. You're welcome. See you around.
