Software Misadventures - Breaking distributed systems for fun and profit | Kyle Kingsbury (Jepsen)

Episode Date: July 2, 2024

Well-known for his insightful and meticulous write-ups on testing distributed systems, Kyle (aka Aphyr) joins the show to chat about the origins of Jepsen, how he built a business around testing distributed systems, his writing process, favorite databases, and more.

Segments:
(00:03:29) From Physics to Software Engineering
(00:07:47) The origins of Jepsen
(00:09:41) Turning Jepsen into a full-time venture
(00:13:14) Jepsen's testing philosophy
(00:16:30) The consulting journey
(00:19:16) Structuring a consultancy
(00:22:32) Setting boundaries
(00:24:32) Pricing misadventures
(00:29:17) Pros and cons of being an independent consultant
(00:32:08) Managing your time when working for yourself
(00:38:23) Best part of the job
(00:41:13) Early writing influences
(00:45:25) LLMs and AI-generated content
(00:48:17) “The period where you can trust what you read is actually very recent”
(00:51:33) How to become a better writer
(00:54:25) Developing a formal understanding of distributed systems
(00:59:30) Common faults in distributed systems
(01:01:17) The complexity of testing distributed systems
(01:07:32) Communicating criticism effectively
(01:10:26) Advice for distributed systems engineers
(01:13:46) “Anybody trying to sell you a distributed lock is selling you sawdust and lies”
(01:16:31) Failure mode documentation
(01:18:52) The pitfalls of containerization
(01:20:17) Lightning round - favorite databases

Show Notes:
“Anybody who is trying to sell you a distributed lock is trying to sell you sawdust and lies”: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
Kyle’s excellent write-ups on testing distributed systems: https://jepsen.io/analyses
Kyle’s blog: https://aphyr.com/posts
Training courses that Kyle runs: https://jepsen.io/services/training

Stay in touch:
👋 Make Ronak’s day by leaving us a review and let us know who we should talk to next!
hello@softwaremisadventures.com
Music: Vlad Gluschenko — Forest
License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en

Transcript
Starting point is 00:00:00 Looking at how systems fail, what's your favorite database that you have? So let's start with the relational ones. Postgres. Fantastic. Wish it had a good replication story. Key value store. I'm still partial to Riak. I have not seen it being used.
Starting point is 00:00:14 This is the fixie bike of databases. Okay. Object store. FaunaDB. FaunaDB. It's a weird one. Yeah. Yeah.
Starting point is 00:00:24 I think it's technically an object or document store. I haven't seen that one. But it has a really cool Lisp-inspired query language back when I worked on it. What about a columnar store slash the class of NoSQL databases? Datomic. Datomic, huh? Yeah. Entity attribute value time triples. Actually, quints. With
Starting point is 00:00:39 a totally different transaction model. Up to strict serializability, it's a wild beast. Dat log for querying. And last one, uh, queues slash distributed logs. Oh, you gotta go for the classics, right? Like it's, it's Kafka, Red Panda or, or nothing. As far as I know, all the other ones lose data. In terms of Jeppson's just testing philosophy, it's more of a experimental
Starting point is 00:01:02 testing instead of theoretical. And you actually spin up a system, you inject faults, and you're not looking at the source code, you're running real binaries like fault people would see in production. How did you come up with this philosophy of testing in general? I think some of this might come from my background in undergraduate physics, where you have to think rigorously about experiment design. Both in physics and psychology, you're always trying to figure out, like do you construct a functional definition of some concept like consistency and how do you translate that into some sort of observable parameters and control interactions in a way that means you're measuring what you hope to measure. So there are a couple aspects we want to dive into there. One aspect of just starting a consultancy like going from being a software
Starting point is 00:01:44 engineer at companies to doing consulting work. There are many engineers who want to dive into there. One aspect of just starting a consultancy, like going from being a software engineer at companies to doing consulting work. There are many engineers who want to go on that path at some point in their lives, but they struggle with the mechanics of a business or rather just don't know what it really means to be a consultant or have a consulting firm. It's like, hey, I have some expertise. I can share it with some clients, but they don't know how to structure it exactly. So can you share some recipes of how you thought about structuring it? Welcome to the Software Misadventures podcast. We are your hosts, Ronak and Gwan.
Starting point is 00:02:19 As engineers, we are interested in not just the technologies, but the people and the stories behind them. So on this show, we try to scratch our own edge by sitting down with engineers, founders, and investors to chat about their path, lessons they've learned, and of course, the misadventures along the way. To start with, we wanted to rewind the clock a little bit and talk about when you graduated from Carlton. So I came across your post on your blog. And by the way, you've been blogging for quite some time, like even when you were in school, which is something I want to get into. But there you mentioned that after graduation, you wanted to study physics in grad school, but somehow ended up being a software engineer at Wattpad. Can you tell
Starting point is 00:03:02 us how that happened? Oh, yeah. So it turns out I'm bad at everything. So I barely got a job, right? The only reason I got a job was because I'd already been working as an intern at a software company to sort of help put myself through school, and I also got an offer through the alumni network. But every single place I applied to, like sending out resumes, you know, emails, nobody responded to me. Oh, wow. I graduated in 09, the sort of high of the financial crisis.
Starting point is 00:03:32 Timing was also pretty bad at the time. Timing was bad. I really, in retrospect, I should have been born differently. But the other challenge with that is that everybody and their mom wanted to go back to school because there weren't jobs. And so there was a lot of competition for grad school at the time. So I thought, oh, I'll take, you know, a few years, kind of recover my finances, pay off my debt, study, and maybe take the GRE and go back.
Starting point is 00:04:00 A second fun story, I went to a liberal arts college, which is a fantastic choice for many things. I'm really thankful for my education. But it did not prepare me with the, like, relentless focus I would need to have on standardized testing. And it turns out that high-energy particle physics is a very competitive field. So there were something like, I think like 350 applicants at my safety school and only 35 open positions or something like that. It was insanely tough. So yeah, I studied, took a junior year, did not get into grad school. It's like, oh, well, I guess I'm doing software for now.
Starting point is 00:04:36 You were at Little Art School. Were you studying computer science there or something else? No, I actually have no computer science background. So how did that happen? How did you start working in software? Because I recall in the post you mentioned that around graduation time, you already had some experience working as a social admin, as a web developer, and you gained some of that experience.
Starting point is 00:04:54 So how did that come to be? I went to a weird high school, which was a public school but a magnet program, so you would apply if you were interested in natural resources, science, and technology. And one of the cool things that school did was they offered high school internships, often with alumni of the program in local companies. And so I would go and work on Friday for half the day at this local startup called Cryptic that did healthcare medical record software. And that meant that I got to operate on Cisco switches and do network monitoring and cacti and Nagios
Starting point is 00:05:32 and run phone lines and do desktop support, all this fun stuff. Oh, I see. That's interesting. Did you learn yourself all of these things considering you were just... I had a really good mentor. And already I was a huge nerd
Starting point is 00:05:44 who couldn't stay away from computers growing up, so I had been running Linux machines and doing silly networking things for quite some time beforehand. Um, but, uh, Eric Rosenberry, who is the alumni who, uh, took me on as his, um, charge at Cryptic, uh, was a fantastic educator and gave me the chance to work on all kinds of different systems. That's pretty cool. Uh, by the way, I will say while doing research for this conversation, I came across some of your artwork. Character in the Dark is a book you published.
Starting point is 00:06:15 We'll link it in the show notes for people. I think you took some. Is that still work? It's been like what 15 years yeah i mean i came across it on blurb.com it's still there uh and that's what oh my gosh yeah pretty cool and it it's actually people can buy it uh i hope you still get royalties for it um i don't think i've ever seen someone so i i made two books um one of both photography books and one was uh like self-printed and i did all the binding and and stitching and painting and whatnot for it and the other one was uh more for like a an art project and that one i did all the typesetting and photography but i did it digitally and just like all right i'm not to sew all of the pages by hand again.
Starting point is 00:07:06 It's exhausting. So I just sent it off to Blurb instead. I see. Nice. So going from undergrad to working at Wattpod, at least the way people know you these days and have known you for, I would say, almost the last decade is through your work on distributed systems and not in the traditional
Starting point is 00:07:25 sense, but rather looking for or testing distributed systems for safety and seeing how they fail and the promises they make. And you have this amazing project called Jepson and you also have a bunch of analysis reports for different systems that people use on how well they fare against the promises they make. We should want to get into around how that works and how you do it. But how did this start? Again, I was very foolish as a young man and made a lot of wonderful mistakes. One of them was not controlling my temper very well at a conference talk where someone
Starting point is 00:07:59 described using a database in a way that I thought was incorrect, and I asked this common, not really a question in the talk afterwards, which was extremely disrespectful and basically intimated that he had no idea how things worked. Went back and followed up by writing a demo basically to show that that approach from the talk would not work and then came back to the conference the next time it ran and demonstrated that losing data on stage. And that kind of became the Jepson thing.
Starting point is 00:08:34 And I kept doing that as sort of a nights and weekends project because people were building a lot of distributed systems and making what seemed like absurd claims about their safety characteristics. And I thought that maybe we should actually try those things and see if they worked or not. Did you get a response from this database person who was giving this talk when you actually gave the talk the next year? Like, did they reach out to you with something?
Starting point is 00:08:58 I don't believe so, no. I think he's still rather mad. Well, you would ideally wish that databases making these claims actually run some of these tests themselves before publishing some of the guarantees that they do. So I would say they should rather be thankful. I understand his position is that I misunderstood the system he was describing.
Starting point is 00:09:24 And that may be valid. I see. And since then, like, since you started Gypsum, you were doing this as a nights and weekends thing, and you still have a day job. Like, you were working at some of the startups at the time, if I recall correctly. At what point did you decide to do this full-time? I was getting requests from people, you know, hey, would you test our system for us? And I sort of said, well, you know, this is a thing that I do when I can. And I'm sort of at the mercy of my free time.
Starting point is 00:09:53 And then I got an offer from Stripe that said, hey, we'd like to hire you to do Jepson, you know, maybe on some internal systems, maybe on some public ones. And so I went and worked there for a few months. And during that time, I got an additional offer saying like, would you, you know, be able to work on our system for pay? And I said like, no, it'd be a conflict of interest. I have to work on things that, you know, Stripe decides. And I can't take on external income for that. But I had this little backlog of like, oh, there's a couple of clients. Maybe I could start a consultancy. And then I went through a sort of desperate part of my life towards the end of the year. And my rent was raised incredibly high, as happens in San Francisco. And I started to think like, oh, gosh, I need to change everything, like move and reorganize my life.
Starting point is 00:10:41 And so I made this very silly idea to start a consultancy company. It actually kind of worked. In this case, when people reach out to you for testing their systems, I can see two sides of it. In one perspective, so you publish the report publicly
Starting point is 00:11:02 and you give clients three months time before you publish the report so that they have time to fix bugs, improve documentation. But on one perspective, they would want you to find problems with the system so that they can improve it. On the other hand, they also don't want the system to be so bad that it results in negative publicity of sorts. So what does that interaction look like with clients typically? You know, usually my clients are really good. I've had almost universally positive experiences. There is occasional vigorous debate, but I we're we're always respectful of one another's
Starting point is 00:11:45 position and uh and all of the like published reports from jepson with one exception uh have come about as a result of like you know long back and forth and eventually reaching some sort of common ground and what has to be said um i think the the sticking points are often matters of interpretation like severity. If I describe something as corruption, what does corruption mean exactly? To the vendor, it might mean this has to be an unreadable database. From my perspective, it might be a database which contains logically inconsistent state, like a record which represents a state machine and it somehow took both paths of a transition.
Starting point is 00:12:25 So that sort of thing, ultimately, Jepson retains final say. And I try to do that as carefully as I can to represent the interests of general database reading public. But I am shocked at how accommodating and friendly my clients have been. I guess there's some kind of selection bias of the fact that they really do care about it and then they want to actually be better therefore they have their systems in good order and then they kind of seek you out versus
Starting point is 00:12:53 if it's like a dumpster fire then it's probably the last thing on their mind. And Jefferson's a known quantity. People have been reading these reports for a decade and they know what they're getting. I think I have a pretty good reputation, both working with vendors and writing for the public, that means people trust that I'll do an okay job.
Starting point is 00:13:14 Right, right, right. In terms of Jeppson's testing philosophy, it's more of experimental testing instead of theoretical. You actually spin up a system, you inject faults, and you're not looking at the source code, you're running real binaries like false people would see in production. How did you come up with this philosophy of testing in general? I think some of this might come from my background in undergraduate physics where you have to think rigorously about experiment design. Both in physics and psychology, you're always trying to figure out how do you construct a functional definition
Starting point is 00:13:46 of some concept like consistency? And how do you translate that into some sort of observable parameters and control interactions in a way that means you're measuring what you hope to measure? I think another part is pragmatics. If I want to test a lot of different systems, they're all written in
Starting point is 00:14:05 different languages, different runtimes, different coding styles, any sort of like tracing or instrumentation of those programs would be very closely bound to the source code. It means I have to learn their build system. I have to figure out how to build the binaries. I have to construct instrumentation that's appropriate to that system. And that's, that's an important and valid technique, but it would be a big lift to do that for lots of different systems. So in some way, this is the only thing that could work. So at a high level, let's say if you were testing a new system, can you describe what does the approach to testing that new system look like? What are the approaches that you take?
Starting point is 00:14:39 Because you're not just bringing that system up, injecting failure. You're also comparing it against the claims that it makes. So you have to go read some of the documentation, for example. So what does that look like? Yeah, there's a lot of reading. Generally, every engagement starts with some sort of meeting the client. If I'm doing a pro bono thing just on my own free time, I don't necessarily go talk to them up front.
Starting point is 00:15:02 I just go read their docs. But when they pay me, we have a meeting. We talk about what they want to measure, what they think their system provides, and they give me some guidance. And then I disappear and I read all the docs and I ask them lots of obnoxious questions in Slack, like, what does this mean? What is this error for? And then I'll take notes on that. I put in all of the quotes that I can, and I get links to the relevant parts of the quotes that I can and I get, you know, links to the relevant parts of documentation, anything I think could be useful for later. And then I start thinking, how am I going to test this thing? What kinds of requests can I make? What kinds of nouns
Starting point is 00:15:35 are in the system? What are the invariants? And I'll propose a test to the client and we kind of go back and forth. Usually it's straightforward. Sometimes I need to understand their transaction model or their semantics more thoroughly. And then I go and write the automation, set up the cluster, get that running, start running basic tests, and hopefully produce results. And as that collaboration goes on, I'll hand them like, oh, hey, I found it doing this thing. What do you think about this? And they'll say, oh my God, that's a huge problem. Or maybe they're like, oh, that's fine. Why are you worried? And so we'll figure out how to interpret it.
Starting point is 00:16:09 Maybe the bug is in my test code. Maybe it's in the system itself. Interesting. And how was the system like developed? Was it like when you first started, you've already kind of, you know, with each of these report you do, like you start kind of building the system.
Starting point is 00:16:21 And then by the time you have the consultancy, it's kind of like you already know, like what's like the general script or did that develop as part of the uh the consulting uh are you talking about like the system being um like what i do as far as consulting businesses or oh uh i mean i kind of had to come up with this from scratch there was a i had been doing these pro bono analyses where i was just like, somebody would say, Hey, I think Elasticsearch loses data. And I'd be like, yeah, I've run it.
Starting point is 00:16:49 We've watched half of our cluster just like disappear when we rebooted the nodes. God, that was a fun day. And I'm sorry, it was only, it was only 30% of the data that disappeared. Because all of the metrics for our customer dashboards dropped by 30%. This was when you were working at Stripe? This was back at Factual. Wow. Yeah.
Starting point is 00:17:15 I mean, Elasticsearch was notorious at this time frame for losing data. And everyone I talked to in any company had some really fun story about a different failure mode. Which is fine. It's a search engine. It's not supposed to keep your data. It's supposed to give you results some of the time. Some of the time being the important word. Right. Like, you know, your system of records should preserve data, but your search index,
Starting point is 00:17:37 if you don't find something, it's not the end of the world. Usually. It's not supposed to be the source of truth of your data. Yeah. Yeah. Long story there, anyway. So I would basically go and do all of this in isolation and talk to some friends, and I'd workshop the report and just drop it.
Starting point is 00:17:52 Be like, hey, look at this thing. And maybe I'd file some bugs in there. But when clients started coming to me, I had to come up with an approach that was much more collaborative and that would produce a professional work product. And so I wrote this ethics policy before I started doing any of this work and said, here's how I want to balance a sort of public interest obligation of providing people with accurate,
Starting point is 00:18:17 timely information about databases and hopefully advancing the sort of cultural state of the art in testing. And balance that with the obvious interest of like, I need to get paid so i can eat and have a house and the vendor needs to get useful actionable reports that respect their integrity and privacy and intellectual property and all that stuff and from that ethics policy i was able to look at some lawyers and come up with contracts that kind of encoded that structure and basically that protocol has been unchanged for the last eight years now. Nice, nice, nice.
Starting point is 00:18:51 I've gotten better at it for sure, but it's the same basic strategy. In this case, like, so there are a couple aspects we want to dive into there. One aspect of just starting a consultancy, like going from being a software engineer at companies to doing consulting work. There are many engineers who want to go on that path at some point in their lives, but they struggle with the mechanics of a business or rather just don't know what it really means to be a consultant or have a consulting firm. It's like, hey, I have some expertise. I can share it with some clients, but they don't know how to structure it exactly. So can you share some recipes of how you thought about structuring it? And we don't need to go into like the pricing details, but just how you thought about structuring the contracts or structuring the cost, et cetera.
Starting point is 00:19:34 Yeah. So my contracts look really weird and I'm not sure how generally applicable this would be because I do this public reporting thing. But because of that, there's all this stuff around retaining final editorial control and timelines for review and work product. And some of the IP that I develop is like Jepson General, and that gets built into Jepson and used time and time again. Some of it is specific to the client. So there's all that specific stuff. But on the general side for any consultancy, you need to have some sort of legal vehicle for what you're doing. I looked at S-Corps, LLCs, consulted with some lawyers, wound up spinning up a small LLC and having contracts drafted for everything. You need a bank account. You need all the basic business stuff that applies to anybody.
Starting point is 00:20:26 Stretch separation of finances. You want to have an actual tax accountant who can help you with the planning and filing your estimated taxes and all of that stuff. Bookkeeping. You end up becoming this sort of jack-of-all-trades in capital B business because there's just enough work that, you know, like someone has to do it. Like it would be, I can't run a business that doesn't have, you know, well-defined, legally compliant books and that files as taxes. It has to happen. But hiring someone to like do
Starting point is 00:20:56 my books for me, um, at least for the initial ingest instead of review would be absurd. Cause it's like 15 minutes of work a month. Same thing for contract negotiation. I have to do a lot of contract reading myself. And then I pass it over to the lawyers because if I did every single one of them, it would be absurdly expensive. And there's just not enough work to have someone permanently on staff for it. So things have to be contract relationships for a small consultancy. I said, did you know anyone that was doing this sort of things? Or did you talk to people about how to go about it?
Starting point is 00:21:33 Yeah, I talked to several friends who have done consulting-ish things before. And they gave me the good basic advice of, you know, talk to a lawyer and do the legally required things. You know, file with the state. And I think anybody can kind of look up what they need although maybe you can't look up anything now that chat gpt is out you can generate it sidebar like oh my god the number of just bonkers lies i've been receiving in uh like web pages recently and you read it you're like this looks locally coherent and then you read realize it makes no sense at this looks locally coherent. And then you realize it makes no sense at all. It was probably generated. It's really
Starting point is 00:22:08 eerie. Yeah. Anyway, also, I'm not a lawyer, please don't take my advice as legal advice. But do talk to a lawyer because they're great. They will protect your interests and help you avoid problems. Oh, sorry to follow up on that. But was there any mistakes in the beginning that you made that like for someone that's like going through this process now, like you're like, hey, don't do that. Yeah, charge more. Everyone says this. And like, there's a flyer that I made when I was a kid
Starting point is 00:22:43 that shows me like holding up a cat. It's, you know, Kyle's pet sitting service. And I think it proudly declares like we charge $5 per day, or if you think that's too much, whatever you think is right, because I'm a terrible capitalist. And so I definitely undervalued my labor and I didn't anticipate that I would spend so much time doing lead acquisition and bookkeeping. And also that when you charge less and you do sort of like piece work where it's 30 minutes here, two hours there, clients will want you to work for an hour here and then three hours next week and two hours next week, and it should all be on their schedule. And what you end up with is a work week
Starting point is 00:23:25 that contains like 10 hours of actual work. And so now, rather than offer hour here, hour there, I say, you engage me for a number of weeks and we can decide how many that is. You can say, okay, next week is our last week. That's great. Or you can say, I want to extend. That's great.
Starting point is 00:23:44 But you're buying a whole week of time and then i can really sink my teeth into the problem have all the context i need and uh ideally give you a good result i see interesting i feel like that right and engineering estimation like how about a task would take are sort of the eternal problems um i feel like we've heard people saying, oh, yeah, you look at how much you're making, you would be making otherwise, and then using that as a gauge to then kind of determine, OK, if you're only, so normally you'll be able to fill out, say,
Starting point is 00:24:16 like 60% of your time throughout the year, and then you divide that to use that system to come up with an initial number that's not super low or just not super crazy. What's your approach in terms of like grounding it to um like a starting number to yeah yeah um initially i actually followed exactly that advice and i came with an hourly rate that i thought was you know reasonable and fair and i went immediately So the work that I was doing was for the entire deliverable. So we'd say like, okay, there's 30% upfront, 70% on delivery, or here's some milestones, and you're going to get this whole analysis. Here's the whole report.
Starting point is 00:24:56 And the challenge there, at least in my work, is that we don't know what we're going to find. And so two weeks in, I give them this giant platter of bugs and they go, oh my God, we need more time. We have to fix all of this. We need to investigate these other bugs and figure out what's cleared. And then the scope of the work balloons. And then they want reviews and the reviews go through several passes. And so you end up with this uncontrolled scope where what you thought was a month of work has become four. And I always want to be accommodating of clients and do my best for them. And that means I was ending up vastly undercharging for the work.
Starting point is 00:25:35 But this weekly thing is working out pretty well? It's controllable. You know, weekly means I have a reliable prediction on how much I'm going to make. It also means that for my weird job where I do a lot of time where I'm just doing research, like writing papers or developing the sort of core Jepson software, or doing free analysis for the public, I can alternate my time between doing that and then coming back and doing paid work. And I don't have this sort of constant low boil of one hour here, one hour there.
Starting point is 00:26:10 The other nice thing about it is that it provides incentives to use time well. Like I work really hard for clients. I am sprinting when I'm on with them. And in exchange, they have to use my time well, which means that they can't like disappear for six months or ask for endless revisions. There's some financial pressure to like complete the analysis and move on. I think that works well. Like I can provide a pretty reliable estimate. I say four weeks, we'll get you a basic analysis. Most clients end up asking for anywhere from four to 12. Okay. In this case, like going from that initial math to come up with the hourly number,
Starting point is 00:26:51 which ends up being too low if you look at it over a duration of time, like six months, then going to weekly, how did you adjust what you would charge for that week on a weekly basis in this case? I listened to the people who told me to charge more, and I raised my rates gradually. This took a long time, longer than it should have, but it did eventually get their advice. It's quite scary, no? It is, yeah. And especially because leads are infrequent for me.
Starting point is 00:27:20 I do this very specialized thing in this tiny niche that very few people need. So it's not like there's a lot of deal flow that I can like get good signal on. I get leads maybe once a month. And of those, you know, a third of them actually sort of pan out into negotiations and become something real. And that's been a challenge. So I've incrementally raised rates. And as I started doing this weekly work, I developed a backlog. And so I said, I'll work on a first come, first serve basis.
Starting point is 00:27:53 As soon as you sign, you go in queue. I will work with you as soon as I can for as long as I can. And then we'll move on. And when new clients come to me, I say, hey, I've got nine months right now of people in queue, so you will likely come up somewhere around this time. And then I started adjusting my pricing
Starting point is 00:28:12 to control that queue length because I didn't want to make people wait too long for their work. Ah, that's quite smart. Ah, interesting. Because it is a pretty hard problem since it is so niche that you don't have a lot of data to be able to run experiments. Interesting.
Starting point is 00:28:28 Yeah, I'm not really sure how useful this is for other people if they're not doing this kind of hyper-specialized job. I'm like those Galapagos finches with the very specialized beak. I was listening to another podcast, it was on the Tim Ferriss Show, but I forget the guest's name. And they went into consulting after a full-time job, and they said the same things, like charge more was their first advice. And they kept doubling their rates until someone was like, this is too crazy of a number. And it's like, okay,
Starting point is 00:29:02 now that's a number I'm going to settle on. So that was different. Yeah, yeah. And it's particularly strange because, like, in the software world, numbers are insane. Nobody knows what anything costs, and so any figure you quote is going to be both 10 times less and 100 times more than what two different clients are willing to pay. I've had, I think it was Ubisoft, come to me recently, like, yeah, we want to hire you for, you know, a week of training. And their budget worked out to less than minimum wage per participant in the class, right? It was like, you have a budget of one tenth or one hundredth of what this rate normally is. You couldn't send your employees to a conference for this money. And other companies are rolling in VC money and think nothing of
Starting point is 00:29:58 throwing cash out the window. So it's really surreal trying to price. In this process, have you ever considered going back and having a traditional job? Or are you like, no, this life is pretty amazing, you have the freedom and the flexibility to do what you want, and you would not go back? Yeah, I mean, it's different. You have a lot of flexibility. And I like that because I think that my work is important. I like doing this sort of public interest thing
Starting point is 00:30:32 where people get free, high-quality writing. And I'm amazed that I've managed to make that work. I hope it continues to happen. And I like that I get to do research. I actually am still publishing papers and coming up with new techniques with collaborators,
Starting point is 00:30:47 and that work would not really be possible at a traditional company unless I had a special arrangement. On the flip side, you burn so much time doing contract evaluation and lead generation and bookkeeping,
Starting point is 00:31:04 and that would be cool if it were handled by someone else. It would be great if I had a reliable salary. Also, it would free me up to do things without thinking about the cost. Like, if I get an invitation to speak at a conference, that's my job now. And I have to say, well, I do charge for that. You know, it takes me a week to write a talk and rehearse it and draw the slides and prepare and come there. And I have to fly there on my own dime and get a hotel. So you need to pay me for that. Whereas if I worked for a big tech company, there's budget for these things and I have a salary, so I'm not thinking about lost income during those weeks. It's a whole different calculus. Yeah, interesting. So one of the things we wanted to
Starting point is 00:31:41 ask was around just time management and working for yourself, in a way. Because when you're working at a traditional company, you have a group of team members. There are some restrictions on the times you have to be available, like, be online, have four hours overlap with wherever your team is, for example. But when you're working alone, in a way, you're still partnering with the clients, but a lot of the work is done yourself. How do you go about structuring your day and being disciplined about how you use your time? Yeah, great question. So when I'm teaching, I have to be with the client during all that time. And so I will rotate my sleep schedule to match whatever their hours are as best I can. Sometimes it's waking up at three for a class in India or something. Oh, gosh. And then when I'm working with a Jepsen client,
Starting point is 00:32:27 I want to be talking to them live for a good several hours each day. And so I make sure that my schedule overlaps that. Typically, I can work on my own time, roughly 8 a.m. to 6 p.m. And then what I'll do is I try to run my errands in the middle of the day. So I'll work for four hours, I'll go to the gym, I'll get groceries, come back, work more. And I just push the work out longer, so I'm not out during peak commuting hours.
Starting point is 00:32:53 And generally that works out okay because I'm still there talking to the client for all the important interactions. And a good chunk of my work is also independent, so I can work late into the night some days. I will sit there until 2 in the morning if I get the excitement going, and I'll just be hacking away on some test or another. I want to make sure I give roughly 40 hours each week to clients. I never want to skimp on the obligation to a full-time client. But in practice, because of ADHD, sometimes that works out to be like 60 hours if I get hyper-focused on something.
Starting point is 00:33:31 Are there any specific tools you use to track the work you need to do? Because like, again, in a traditional office environment, you have, let's say, a queue of Jira tickets you wanna go through. There's a project timeline or status update or whatnot. How do you do this in this case? Yeah, because my work sort of flows naturally, I don't have to do a lot of structure on top of it.
Starting point is 00:33:54 I keep notes for myself on what I do each day, and meeting notes and things. I keep a structured queue for the clients, so I know in three months I'm working for this client, and after that, this client. But each day I often kind of wake up and I'm like, oh, I'm going to figure out what's up with that weird error today,
Starting point is 00:34:13 and let's investigate the transaction anomaly. And as I talk about those results with the client, they immediately ask questions, and that kind of drives my next thing. So I'm always saying, what would be most helpful next? What would be, you know, the thing that we can do in this timeframe? And I'll give estimates and continuously refine.
Starting point is 00:34:30 Nice. And curious, when you were making that transition from working full-time to doing your own business, there's a lot more logistical stuff that you had to figure out, right? Versus, like, sort of real work. How do you go about balancing that? And I guess, how do you go about learning to improve the business? I wish I knew. My tactic has been, like, I try to spend almost all my time on work-work. Initially, I thought I was going to give 40 hours each week to my client, and then I would do the logistics work of bookkeeping and new client acquisition on top of that. That, as it turns out, is a recipe for burnout. And so what I say now is, okay, I'm putting in roughly eight hours a day. Some days might be 12, some days might be four. I try to make sure each week works out to 40. And then I try to dispatch my bookkeeping obligations each morning. So I
Starting point is 00:35:31 answer the emails, I merge tickets, I do whatever support stuff I need to, knock that out. And then I shift gears over to the client, and I make sure that I'm on work-work each day. But when I'm billing clients, you know, the week's work implicitly includes a little bit of that auxiliary work. And I think that's fair, because even though I'm taking some hours from my week paid for by client A to negotiate with client B, when I'm working with client B, I'm borrowing from client C. Yeah, so you don't want to force people to do these things synchronously.
Starting point is 00:36:09 And so I try to make sure that everybody's borrowed time kind of lines up with the next client. Likewise, there's a review phase at the end of every engagement. So after three months or whatever, we go through and make some last minute changes before release. I borrow that from the next client. I take two or three hours out of the day to do it. And then they get the next one. Nice.
Starting point is 00:36:27 And for logistical things, I can totally see it's easier to kind of force yourself. It's like, okay, I gotta do the bookkeeping, you know. For something that's more amorphous, like business development, like trying to do lead gen, things like that, it's pretty painful, I think, for most engineers.
Starting point is 00:36:50 How did you go about forcing yourself to sit down and say, hey, I got to write these emails? God, it's so hard. Anybody who's struggling with this, I feel your pain. I especially struggle with contract negotiation. If I get a contract revision, I have to just yell at myself: send it to the lawyer, read the doc, put in your comments. Because it's like pulling teeth getting my brain to focus on that stuff. I wish I had a better answer. I do operate in an inbox-driven workflow, so if it's marked read, it's done. And I leave everything unread until I've handled it. And this is probably, like, a totally chaotic gremlin way to do business, but it's what keeps me on track.
Starting point is 00:37:32 Very interesting. Thanks, thanks. And I assume that process also improved over time. Like, at the beginning, it was very chaotic, and then over time it got better? It's still chaotic. I mean, I try really hard, but it's just
Starting point is 00:37:46 one of those things I struggle with. Like bookkeeping: I look at QuickBooks once every two months, if that. It's embarrassing. But I only do, you know, 10 to 20 transactions a month,
Starting point is 00:38:03 so there's not that much work there. Right, right, right, right. In general, when you're working with a team, typically when you're trying to solve a problem, if you make some progress, you share it with the team, and that results in more positive reinforcement. Is there something that happens with you when you're talking to clients and sharing some results with them? Like initial results, and then what to do next, and so on and so forth?
Starting point is 00:38:23 Yeah. That's my favorite part of the job, actually, is that, like, on day two, I'll be like, hey, I got your system installed and here's some notes on installation. And they'll be like, oh, cool. We'll send that to the dev rel team or the release team. And then also, oh, I found this weird error.
Starting point is 00:38:37 Do you think this is definite or indefinite? And they'll start a debate internally, and we'll have some Zoom calls, and then I'll write more tests. And I love that process of working with the client, helping them to understand what the system does, and getting their marketing and docs and engineers on the same page. Because sometimes there are these internal disparities, and having me as this outside consultant allows us to cross those
Starting point is 00:39:00 bridges briefly and unify them. Yeah, so that's the fun part of the loop for me, is getting client feedback. Like, oh, wow, that's exciting, what happens with this build? Makes sense, makes sense. So I was reading your blog, and you have entries there since, I think, 2007 was the oldest one I could find, but I'm sure there are ones older than that. Oh, man, do you want to talk about LiveJournal? But, so, you've been writing online for a while. And, I'm not
Starting point is 00:39:32 going to go into the posts you have. There are some really good memes, I will say. There is a cat... you want to... yeah, let's bring in some high school poetry, maybe. Yeah, there is a tag, I think you have a tag called funny. If someone wanted to just go look at nice funny memes, they can. OK. What's on there? Let me go look. Yeah, continue, please.
Starting point is 00:39:52 So what I wanted to ask was, today you write really high-quality analyses for everything you do. And you've been writing this... by the way, if those memes are,
Starting point is 00:40:09 yes, I've seen the pictures now. Yes. Yes.
Starting point is 00:40:13 We don't necessarily need to put them in the show notes, but if folks are interested, they know where to find you. They can,
Starting point is 00:40:19 they can search for themselves. They are funny. People who like a good laugh, I do recommend they go check it out. Moving past that. So you've been writing really good research analyses on all
Starting point is 00:40:35 these systems. And there's two aspects to it. As a reader, when I'm reading these, it's a deeply technical subject, but it's also very easy to read. These are long posts which go into specific details. And usually if something is hard to read, I give up; maybe others don't. But when I'm reading your posts, I found myself just wanting to go through the entire thing, not wanting to stop. It's very easy to read. And this is before ChatGPT was a thing. So how did you develop writing capabilities, and just writing well in general? I had really good teachers and I read a lot. I was reading prolifically at a young age and kept that up until I got to college and was forced to read a lot, and then I stopped reading for fun.
Starting point is 00:41:27 When I say stopped, I mean, like, I reduced my book consumption to a normal level. And I had really good teachers in high school and college. I spent a lot of time doing English. And I think it's, more than anything, what you see in a liberal arts education: you learn to write well. I did a lot of creative writing as well as nonfiction.
Starting point is 00:41:48 And I think that helps. Another part of it is just time. You have to practice writing and get feedback from peers and read it aloud and go through the editing process. Most of Jepsen's reports are all my work with peer feedback, and then I would do my own editing. Now I do four or five passes of editing. I take revisions from the client.
Starting point is 00:42:14 I show it to peers, too, in some cases. And then I also have a professional copy editor, Irene Canio, who is fantastic. And she is responsible for a lot of the polish and consistency you've seen in more recent Jepsen posts, the last three years, I think. How did you go about getting the peers? Is that a network that you built over time? Yeah, yeah.
Starting point is 00:42:37 I don't even know how I've met them all. Probably a lot of it was from Twitter. And now there's some Slack communities that I remain a part of, and I chat with those people regularly about distributed systems and weird things that happen. So when they're writing stuff, will they also have you, like, proofread and things? Yeah, we have a reviewers channel, and we'll chat about things that we write or read. And, going back to the business aspect in general, you mentioned lawyers,
Starting point is 00:43:07 bookkeepers, and at this point you mentioned a copywriter, or copy editor at least. How do you go about finding these people you work with? Not the peers part, but people who you actually pay money, and they do certain parts of your business? Yeah, as far as I know, referrals are kind of the best way to go about it. I think the lawyer that I currently use, I might have actually found online at the time. I was looking for somebody who had this weird specialization of, like, open source and performance law, and her firm specializes in this very small niche, and that's been fantastic. And then for things like accounting, sort of any small business accountant or bookkeeper can help you out with those things. So I ask friends, like, hey, you've run a small business,
Starting point is 00:43:57 what do you use for your accountant, and how has it worked with them? Any specific resources that someone who might be getting into running a small business can go and refer to, at least in terms of finding help like this? I'm afraid I don't have a book or other reference I can give off the top of my head. All the stuff I know is kind of built up incrementally by staring at the IRS web pages and asking lots of really inane questions of the experts I hire. I cannot say enough good things about hiring lawyers if you have the time, because I've never actually needed to use one as far as hostile litigation goes. Nobody's sued Jepsen. But it's really helpful for evaluating client contracts and also making sure you structure your business appropriately.
Starting point is 00:44:42 These are professionals paid to work with you. This is probably going to be a dumb question, I apologize in advance. So for me, I'm trying to get better at writing, and I'm really, really bad. But ChatGPT has been quite helpful in just getting me started, in terms of having something I can iterate on, or as a way to get critique. I'm curious, from your perspective, are there any functionalities of it that you find useful for you, or not a whole lot? I find it horrifying.
Starting point is 00:45:21 Not a popular opinion, but yes, please say more. I mean, I think there's probably really good uses of LLMs in search, and maybe in a sort of helping-you-develop-your-thoughts way. As far as code generation goes, the hardest part by far of code is understanding what's already written. It's much easier to write something than to figure out what's already done. And it is so, so easy to miss critical, subtle mistakes. And one of the things I see come up again and again in LLM-generated writing and code is things that look correct, but if you stare really, really hard, you find out they're totally fucking wrong. And wrong in dangerous ways: ways that will cause file system corruption, ways that will cause you to mix dangerous chemicals. So I am adamantly against this use. As far as generating content goes, as a moderator of an online community, I'm facing this sort of event horizon where it'll be impossible to distinguish automated bad actors from real people. And all of moderation
Starting point is 00:46:26 is this sort of economic or environmental balance of the time you invest identifying authentic behavior, trying to understand people's intentions. And I think it's entirely possible we're going to drown, that moderation will become an impossible task. Same for authentication of media,
Starting point is 00:46:44 right? Like, if media is fabricated, we lose the ability to have any sort of trust that the thing on the other end of the line is at all sentient. And so, you know, I've been talking with a friend of mine who's rather historically literate about this. And he's saying, well, you know, the period where you could trust what you read is actually very recent. And,
Starting point is 00:47:07 you know, historically speaking, people had to send paid emissaries to other cities to learn what the bank rates were, and you could only trust material from a small number of publishers because of the issues with broadsheets. And so, you know, we're just going to return to that norm, and the internet as we knew it will be this very brief blip of semi-trustworthy content. That's, I think, sort of disheartening. Not to mention other issues, I could go on about this. Like, I have to deal with AI-generated CSAM as a part of my moderation work, and it's horrible. Yeah, I don't touch it. Is it obvious to identify some of these cases, where you see it's a bot and not a real human behind the system?
Starting point is 00:47:54 That's the challenge, right? It used to be you could look at a piece of text or an image and you could identify, generally, whether it was computer-generated. And now it's getting so hard to distinguish that the cost of identifying and discarding misinformation is much higher. And ultimately it's a cost problem, right? If it becomes difficult to identify, people will stop doing it. Or they'll stop effectively doing it. But, I'm
Starting point is 00:48:18 sorry, this is a rant. To give you a concrete, actionable insight, one of the things that really helps with writing, I find, is to spew out a bunch of words knowing that they're going to be bad. And this is how I write talks. I just start throwing out stupid drawings on slide after slide after slide. And then I kind of look back and I'm like, oh, okay, what of this can I salvage?
Starting point is 00:48:41 I think setting yourself this separate phase of writing, admitting and not caring that it's bad, and then editing, where you try and massage your thinking, that's really helpful. Yeah, I was listening to a podcast where they were kind of talking about this, of having two stages. In the first one, you really want to treat your ego like, oh, you're doing great, you know, just keep going. And it doesn't matter, you know, what comes out, right?
Starting point is 00:49:07 Just keep going. Versus the second one, where you actually get critical, and criticize, and actually try to... that's sort of what you're describing, right? Like having separation. Yeah.
Starting point is 00:49:18 And, you know, this is obviously not great advice, but some people say get drunk, or maybe, you know, have a gummy. And the idea is to reduce your filter.
Starting point is 00:49:30 You want to get yourself in a state, whether sober and just doing it emotionally, or, you know, with your favorite whiskey in your hand, of being able to just spew thoughts onto the page. And then you refine. But if you're struggling to get over the hump, I find often the challenge is you're trying to pre-edit, to get everything perfect upfront, right? And I like to think of writing as a tool for thinking. And nobody else is going to see your bad draft, right? You can always throw it away. This reminds me also of a post that I think Paul Graham has on writing. He has an essay about this, and I think he posted a video at one point
Starting point is 00:50:05 of how he edits his posts. It starts with pretty much what you were describing: spew out anything that he wants to write, and then slowly it goes through iterations, iterations, iterations, and then you have this final piece that he publishes. So, good advice. I'm going to try that.
Starting point is 00:50:21 I've not been able to try that effectively, but I'm going to give that a shot. One thing that might help is to shift formats. Like, if you're really good at speaking extemporaneously, get out a tape recorder or the equivalent and imagine you're telling a friend, or find a friend to open up with you, and just explain to them what you're going to talk about.
Starting point is 00:50:44 Like, yeah, so, you know, we got these computers and they're trying to communicate, but they don't have clocks, and so sometimes messages get reordered. And you take that verbal explanation, and then you sort of distill and massage that back down into written form. I often work with outlines. I'll do just a little dash in a text file, and then as small a thought as I can, and I'll build up this deeply nested outline. And that way I can move the lines around using just line-wise text commands, kind of do my restructuring fast. And then from there, I'll expand the outline to written stuff. So if your problem is more about the ideas than about the format, consider changing the format itself. Interesting, interesting. And then, so
Starting point is 00:51:22 you also have really cool conference talks where, I think, you feature some of these sketches. Do you have a different process for doing those versus your normal writing? Or is it, at the end of the day, just storytelling? Yeah. The talks that are for Jepsen have a sort of defined formula: introduce the tool, talk about the problem, and then case study, case study, case study, recap. And for the case studies, it's really straightforward. I've already put a ton of work into explaining the system, and I've got the report, so I just adapt it. If you look at the slide structure, you can basically just read them off from the headings
Starting point is 00:51:58 of the paper, because all that work went into it. For novel talks, I tend to do the same process. I sit down with pen and paper, or use an iPad and a little Apple Pencil, and I'll just start sketching. I imagine I'm introducing this to somebody at the whiteboard, and so I have this really informal, loose drawing style. I'll go through there, almost bullet-point style, you know, just text scribbled down on the screen, and maybe an illustration or two, but just really rough. And then I like there to be some high
Starting point is 00:52:36 quality illustrations, drawn very loosely, but, you know, some things with color, maybe a character saying hello. And so I often go back and put in the jokes and that sort of thing once I know the structure. I see, oh, that's super helpful. So one thing that we've been touching on is that you work on distributed systems, and you mentioned that you also do training classes. I saw you have a nice outline of the distributed systems training class for people to refer to. We'll link that in the show notes as well. One thing that I was realizing is many of these concepts, people who take computer science and go to a distributed systems class may understand. In reality, when you talk to most software engineers
Starting point is 00:53:25 about concepts like linearizability or serializability, consistency, the CAP theorem, and other stuff, things get morphed into this one ball of fur, and people get confused. And you mentioned you don't have formal schooling in computer science, but your content, I've read your posts on these concepts, and they are really clear. It's super easy to understand. And you have a clickable diagram on the Jepsen website, which is super nice, just to see where these all fall. Like, you start with strict serializability, then you have two paths there,
Starting point is 00:54:03 one to linearizability, one to serializability, et cetera, et cetera. Highly recommend people check it out. We'll link that in the show notes as well. But how did you develop this formal understanding of these concepts, going from what you saw in practice, like, hey, from the database docs, I don't think it works that way, experimenting, reproducing it in reality, but then taking it also back to some of these concepts in a more formal sense? Yeah. So I think, like most engineers, with an immense sense of hubris and naivete, I started trying to do all this myself.
Starting point is 00:54:35 So when I was at Showyou, in early... actually, when I was at Vodpod, one of the first jobs I had out of college, we had a social network, and I somehow convinced them to let us migrate to this newfangled database system called Riak. We had a lot of issues with MySQL replication at the time, like, this will make it better. And in the process, it also introduced me to, oh, this has a whole different consistency model, and how do we do data updates, and maybe we can't anymore. Anyway, the side effect of this is that I spent a lot of time drawing whiteboard diagrams of data structures and trying to think about merging and joining and ordering semantics. And I got active on the mailing list, and I started talking with people, and getting into arguments on Twitter, and making friends with the people over at Basho. And I would go work at their offices, and we would talk about consistency.
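The merging and ordering semantics mentioned here are exactly what CRDTs (which come up below) formalize. As a rough illustrative sketch, not anything from Riak's actual codebase, the simplest state-based CRDT, a grow-only counter, shows the idea: merges are commutative, associative, and idempotent, so replicas converge no matter how updates are reordered or repeated.

```python
# A grow-only counter (G-Counter), the simplest state-based CRDT.
# Each node only increments its own slot; merging two replicas takes
# the element-wise max over slots, so merges commute and are idempotent.

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> count

    def increment(self, n=1):
        # A node may only bump its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max over both replicas' slots.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

a = GCounter("a")
b = GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)  # merging in either order, any number of times,
b.merge(a)  # yields the same converged state
assert a.value() == b.value() == 5
```

Because per-slot counts only grow, the max-merge never loses an increment, which is why eventually consistent stores in the Dynamo/Riak tradition lean on structures like this instead of last-write-wins.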
Starting point is 00:55:23 And so it was a combination of that, and then reading the papers that people would link. Folks would talk about the Dynamo paper, and I would go and pore over that, or CRDTs, I would go and read the research paper. And I think having some background in physics and psychology means that you kind of know how to read a paper, and it's not quite so daunting. So even though I didn't know any of the terms... I mean, I knew a lot. I've been a software engineer for a long time.
Starting point is 00:55:48 But I didn't know linearizability. I could go and read the paper and sort of convince myself of the formalism and the proof. So that formed a foundation for this kind of work. And that diagram you mentioned, that's not my work. It's lifted straight from Bailis, Fekete, Davidson, et al. in 2014, Highly Available Transactions, I think. And it's cited right below.
Starting point is 00:56:07 But people keep crediting me for it, so I want to give full credit to the original people who did this work. And it's really novel. They unified these two different families of consistency models that were from two different branches of literature that didn't necessarily cross-pollinate. And I got the privilege, when I lived in San Francisco,
Starting point is 00:56:26 of making friends with, and talking about this concept with, Peter Bailis and Peter Alvaro, both of whom actually do have academic backgrounds and could explain things to me in a way I could understand. And so a lot of this writing that I'm doing is just interpreting and trying to make accessible the work of the giants who have done all the important thinking ahead of me. Back from, like, the early 2010s to now, distributed systems have obviously evolved. They were kind of new,
Starting point is 00:56:59 I would say, 15 years back, but not anymore. How do you see this change when you're testing systems these days? Like, I'm assuming finding faults was probably easier 10 years ago than it is now. Yes and no. There's been a big phase change in industry. People take partitions and clock skew seriously now. I see a lot of folks address those upfront in their designs. There's an obsession with Raft, which is fantastic work from Diego Ongaro and John Ousterhout. I hope I'm saying that right. The Raft protocol provided a consensus algorithm that was, for the first time, aimed at actual implementers. And it came with a little one-page introduction to it. And that was so, so well done, both as pedagogy for students, but also for industry. And it's something that, I think, it's frustrating
Starting point is 00:57:47 that it was so difficult for the Raft paper to gain acceptance in an actual journal, because the work was so impactful on the industry side. I didn't know, actually, that it wasn't accepted in a conference or something. Yeah, they really had to struggle for it. Don't quote me on this. It's been a long time since I talked to them about it.
Starting point is 00:58:08 But if I recall, there was sort of this atmosphere of, well, we already have Paxos, this doesn't really contribute anything super new. That's interesting. And yet the way in which the paper is written, and the relentless simplification of concepts down to just those two API calls, I think, really helps. Also the operational side, that it handles membership changes, and that's something a lot of protocols sort of put off until a separate phase. But yeah, sorry, to back up, the Raft paper was a sort of watershed in the field.
Starting point is 00:58:48 And about two to three years later, I started to see a lot of systems which previously would have made up their own consensus algorithm choosing Raft or trying to adapt Raft instead. And systems like Mongo that originally kind of made up their own thing grafted or reorganized their protocols around Raft-style things and gained much stronger safety properties. In terms of where you see systems adopting these techniques today or taking some of these faults more seriously upfront in their design, what kind of faults are common today still in distributed
Starting point is 00:59:20 systems? I know it's hard to generalize some of these things, but what are some of the other areas you wish distributed system developers focused more on that we are not yet? I think that the big ones have remained essentially unchanged. Our architecture of computer systems is basically the same as it was 20 years ago. It's IP, right? You've got all the issues with TCP and packet reordering. We're using packet-switched networks.
Starting point is 00:59:47 Things are much faster now, better bandwidth, better flexibility. Cloud is now a thing. But the semantics that arise on top of that still look like Linux processes communicating over ordered, transient pipes. And so from that perspective, clock skew, clock drift, the impact of partitions, process crashes: injecting those is still, I think, very successful at finding bugs. There are esoteric faults, cooler faults I've seen, and some of these are novel.
Starting point is 01:00:19 Like now that we have virtualization, you can get really cool behaviors like on occasion, your virtual machine has a part of its memory replaced with another virtual machine's memory running on the same physical substrate. And so some other customer's strings show up in the middle of your database. That was a Google Cloud issue for a while. And so, you know, that's neat. But I also think that hopefully those are uncommon enough that it's not necessarily worth testing against. Not as early as you would, say, partitions, which happen all the time. So in terms of degree of testing, I was reading your methodology in terms of how you go about it. So it's like starting with documentation to see what claims they make, set up the system, and then you start with some simple things, which is, like you said, crash the process, do a kill -9, turn off the machine, partition the network in different forms. Where do you go from there after you've done some of these simpler things?
Starting point is 01:01:18 Yeah, there's kind of two dimensions. One is scale, and the other is complexity. The scaling one is easy. We just jack up the concurrency, increase the size of transactions, add additional types of operations in the transaction. We can run the test for longer, put in bigger data volumes. The other dimension is the tricky one, things like new faults or new workloads. And some of the interesting faults are things like what happens if the process fails to fsync some data and assumes that it's durable or assumes that fsync is ordered. If fsync is not ordered because the underlying stack doesn't necessarily flush the writes
Starting point is 01:01:59 in the same sequence, you can end up with weird gaps in your log. So I've been doing some research on simulating those sorts of fsync failures. Likewise, membership changes are especially fraught. So a lot of customers have the ability to add or remove nodes from the system, and doing so activates this whole other set of code paths that don't normally get engaged. And so the systems which are built
Starting point is 01:02:24 on the assumption of membership being stable, which is basically everything, and then have this membership changing logic sort of added onto the side of it, you know, often encounter surprising effects. But writing those tests is really tricky because when you want to schedule membership changes, especially in the presence of other faults, like remove a node, and by the way, that node is partitioned from you, you wind up driving the cluster into these weird regimes where like the node still thinks it's part of the original cluster, but the rest of the cluster has moved on, thinks it's no longer part. And then you're trying to guess like
Starting point is 01:02:57 what are legal transitions from this point? And ideally, everything is legal, but many of the transitions will drive you off a cliff permanently by reducing consensus below the critical threshold. And so you have to be really careful about trying to observe the system and not push it into the corner where it gives up entirely on life. That's surprisingly tough. And so there's this whole thing in Jepsen that just does cluster membership state machines and tries to more or less figure out what the safe transitions are. In a way, to observe the correctness of the system and to see where it fails,
Starting point is 01:03:29 you also need really good tests, like you said, that don't push the system to catastrophic failures, but keep them functional while you can observe how they fail. Yeah, in a sense, like a catastrophic failure is the best possible thing from a safety test standpoint, because, from the system's standpoint, because it doesn't do anything, and therefore nothing it does is wrong. It's predictable. It's saying it's false.
Starting point is 01:03:53 Dead systems don't commit crimes. So the bad scenario is you keep running, and then you do something illegal. You allow someone to see the middle of your transaction. And for that, yes, you need partial availability. Some nodes have to run. Some nodes have to be executable. So you've tested a lot of systems. You have published detailed analysis
Starting point is 01:04:15 of what you found with those tests. Are there specific failures that come to mind which you found to be more surprising slash amusing? And it could be from one of the recent tests or the past ones, your choice. When you say failure, do you mean like a fault that Jepsen injects, or do you mean a bug? A bug in the system. Like, so you're trying to inject a fault, but something happened and the system behaved in a way that you found surprising, shocking, amusing, pick your choice of the word there. Oh my gosh, yes, all of them.
Starting point is 01:04:47 They're so much fun. I love my job because the things that databases do are just endlessly creative and surprising. In MySQL, for example, they have this whole hullabaloo about taking snapshots for the repeatable read isolation level, and they describe it repeatedly as a snapshot. But if you actually start to use it, you discover it is nothing like a snapshot. In fact, it will incrementally refresh certain rows in the snapshot to include the effects of other transactions' writes. Maybe part of the transaction's effects, but not all of its effects. Like, this is categorically different than a snapshot.
Starting point is 01:05:23 And yet somehow, like, convincing the MySQL mailing list of this was surprisingly difficult. Convincing the MariaDB mailing list of this was surprisingly difficult. People would look at that and be like, I don't see the problem. And you say, well, you're supposed to show all of the transaction or none of it. Or here's a transaction that observes something and then unobserves it. You know, that's definitely not a repeatable read. Why is it hard to convince them of this? They just wouldn't agree with this is what a snapshot should be? You know, once I phrased it the right way, everybody was on board, I think, but it just took like seven or eight variations on the theme to kind of get there. And another part of it is too, that like
Starting point is 01:06:10 people internalize their sort of missing stairs and just say like, oh, well, you shouldn't expect safe behavior or snapshot behavior from repeatable read. You should be using select for update, or like, you can fix this with select for share. And this is a common thing, I think, in any sort of systems testing: you say, here's this behavior that contradicts the documentation or the specification, and the answer is not, the system is incorrect. The answer is, here's how you stop the bug from showing up. You know, like, well, if you just don't make requests that fast, or if you had used the, you know, correct way of holding the system. And then you get into arguments about like, well, what is the safety
Starting point is 01:06:48 measure actually supposed to do? What is repeatable read really intended for? Because everybody defines it a little bit differently. I'm curious, kind of like a meta point, right? Is that I think you're really good. Like you're a very good communicator because part of your, like your core job is telling people that they're wrong. And I feel like engineers are especially bad at this. And it sounds like that has been a journey
Starting point is 01:07:13 for you as well, right? From the very initial point of calling people out just very bluntly to now being able to do it much more gracefully. Like, I'm curious what that journey looks like. Like, what advice would you have for engineers in terms of how do you give constructive feedback and things like that? Yeah. Gosh, like, how do you become a kind and moral person? If I find out, I'll let you know. I'm working on myself. I don't know.
Starting point is 01:07:41 I think, I think for the story of all humans is kind of of starting off blundering into corners and making a fool of yourself. And gradually you start to see the effects of your actions on others and develop empathy for them and communicate in a way which is more respectful. Hopefully I'm doing that. I try really hard. But, you know, I mess up. All of us do. There's a, there's a distinct transition point in Jepson's history, because when I was doing this in my nights and weekends, um, I could be this argumentative gadfly on the wall going like, what is wrong with you?
Starting point is 01:08:15 Look at your database, look at your choices. And so the first ones are very sassy. Uh, it's, it's, you know, database, real talk, you and the girls sitting down at a brunch dishing and all of the weird faults you've seen. But if I do this for a paying client, then it might be insulting. They're hiring me to be a professional and to deliver something which is nuanced and respectable. And I actually have clients, maybe one in four, who say like, could we have a fun analysis? Like, could you really, you know, make up some memes and dish on us and like be jokey again?
Starting point is 01:08:53 And I actually have to tell them, well, I don't have to, but I've decided to tell them no, because I don't want to establish a sort of unfair or disparate standard for different reports. I want everybody to have this kind of uniform tone, which means that no client is particularly favored. And that's kind of for better or worse. It makes the writing less fun. People, you know, they don't read as much because it's not as punchy. The reports have gotten much longer and much more nuanced. But I think that's overall for the best. It's not bad to write a sassy thing, but if you're doing this kind of work professionally
Starting point is 01:09:34 and you have to balance one client, I think it's important to be ethical about how you do it. Is there like a fun version of each report that you don't publish? Well, only to one's friends in private. Once we stop recording, we'll get active. I mean, he will have this. No, I try to have a lot of empathy for people who are writing distributed systems. Having done it myself, it's really difficult. And it's not that people aren't competent.
Starting point is 01:10:01 It's that they're trying to solve a really hard problem with limited context and tooling you know people should be congratulated on on finding and fixing bugs it's not something to be ashamed of on that note like do you have any advice for people who are working on distributed systems good luck godspeed let's war into the breach, go we. I keep reiterating the same advice, and I kind of worry maybe this makes me stuck in my ways. Maybe I haven't evolved with the times. But do test your system. Do write down some formal-ish descriptions of what you think your system should do.
Starting point is 01:10:41 And describe your system in terms of invariants as opposed to mechanisms. So say something like, no transaction will observe part of, but not all of, another transaction. Say transactions will be observed in total. Don't tell people, you know, when a transaction is committed, the latch mechanism comes into effect and guarantees that any, you know, frobness which is initiated prior, as well as reads on Mars, will observe, you know, timestamp T or before. You're like, no, nobody can figure out what that does. So try to give people, you know, external, concise definitions of what safety they should achieve, and use existing algorithms and proofs where you can.
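One nice property of invariants stated this way is that they are executable: "no transaction observes part, but not all, of another" can be checked against a recorded history. A minimal sketch, where the history format is made up for illustration:

```python
def fractured_reads(write_sets, reads):
    """Flag reads that saw some, but not all, of one transaction's writes.

    write_sets: txn id -> set of keys that txn wrote atomically.
    reads: list of dicts mapping each key to the txn whose write was observed.
    Returns a list of (read, offending_txn) pairs.
    """
    bad = []
    for r in reads:
        for txn, keys in write_sets.items():
            # Which of txn's keys did this read attribute to txn?
            seen = {k for k in keys if r.get(k) == txn}
            if seen and seen != keys:  # saw part of txn, but not all of it
                bad.append((r, txn))
    return bad

# T1 atomically wrote both x and y. A read attributing x to T1 but y to
# the initial state T0 has observed part of T1 only.
writes = {"T1": {"x", "y"}}
ok_read   = {"x": "T0", "y": "T0"}
full_read = {"x": "T1", "y": "T1"}
bad_read  = {"x": "T1", "y": "T0"}
assert fractured_reads(writes, [ok_read, full_read]) == []
assert fractured_reads(writes, [bad_read]) == [(bad_read, "T1")]
```

Real checkers (Jepsen's included) are far more sophisticated, but the shape is the same: record a history, then assert the invariant over it.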
Starting point is 01:11:26 There's a lot of great research out there. Many systems do learn from it, but I still see people on occasion rolling their own consistency or trying to bolt it on to a system that has a different replication strategy. So I think you gave a talk in 2018 where you had this one statement as part of the talk.
Starting point is 01:11:44 Anybody who is trying to sell you a distributed log is trying to sell you sawdust and lies. Distributed log or queue? Well, what I got from the talk was distributed log. But hey, could I think not log or? Lock, lock. Lock. Oh, yes, lock.
Starting point is 01:12:01 That makes more sense. Yes. Yeah. I stand by that. I need to improve my enunciation. Oh, lock. That makes more sense. Yes. I stand by that. I need to improve my enunciation. Oh, please. You and me both, buddy. You and me both. But say more about that.
Starting point is 01:12:30 Okay, so this is actually really well described by Dr. Martin Kleppman, who has a great post on, I think it's Redlock. If you look for that, it should show up. The core problem is that locks are meant to establish mutual exclusion over some resource. And in a distributed system, even if you acquire a lock perfectly, your side effects that interact with that shared resource could be delayed in time and might step on each other's toes. So even if I hold the lock, I submit a request, I release the lock, somebody else acquires it, they submit a request, they release the lock, our effects could take place in the opposite order of working currently. So even if the lock says it's perfect, it doesn't usually do what people want. The second problem is that it is impossible to provide a lock service in the first place because any lock service must either, when nodes are isolated, must either force the release of a lock, thus possibly giving the lock to two people at the same time, or stall indefinitely, in which case it fails to be live. And most people choose the former because they don't want their system to fall over forever yeah uh so in that case like what are some of the mitigations
Starting point is 01:13:30 against this so like people still rely on logs i only want one of these three instances to be up so later election is a common problem uh but so what are some of the mitigations around this that people could build or think about yeah i think there are kind of three classes to examine. One of these cases is when the lock is critical for safety. So like you acquire this lock in order to update some resource and you assume that it's gonna be atomic. And if it were to be acquired by somebody else, you could end up with state lost or corrupted.
Starting point is 01:14:01 Do not do this. Stop doing the lock, rethink the system from scratch uh you know use a sequentially consistent or serializable consistent data store instead um class two is the lock is important-ish for safety like the number of times this kind of failure is going to occur are infrequent and the damage it will do is not so bad that it can't be mitigated or ignored so yes you you built this bad lock service and it does in fact do the wrong thing but it only does the wrong thing one in 500 requests and for that case you know you have to email a customer and say hey sorry i know that you bought this product we actually don't have any in stock after all we lied lied to you. Not the end of the world.
Starting point is 01:14:46 Happens more so than that. Yeah. Yeah. There's a lot of systems where like approximately correct is fine. And then sometimes the lock is just the performance reasons. Like you want to reduce the number of times two services contend on some resource. And in that case, it's great to use a cheap lock or heartbeat or something like that. Uh, safety is ensured by the underlying resource itself,
Starting point is 01:15:06 not by the lock service. And this underlying resource could be like some sort of a database. Yeah, yeah. So if I'm interacting with like Postgres, I structure my transactions that even if they're executed concurrently, they're still safe. But Postgres locks are expensive or, you know,
Starting point is 01:15:22 there's some table that I'm contending on and I wanna reduce that contention. So I put a lock service in front of my clients that prevents them from doing that or makes it more efficient for them to issue lots of requests in sequence. Makes sense. So I forget if this was a talk or a podcast or a blog, but you had some advice for people who are not distributed systems developers themselves, but rather people who want to use databases, queues and whatnot. And you said something that they kind of like look out for words like strong consistency, asset strong, and they don't have super well defined meaning.
Starting point is 01:15:55 And sometimes people use them in very different ways in advertisement or rather how they put documentation out. So in general, for people who want to use these systems, which all of them fail in different ways and forms, what advice do you have when they're going to make these choices? Um, outside of looking out for the weasel words, cause that still happens. Actually working on Raven DB, like they claimed acid properties, but when you dug into it, their quote unquote business transactions weren't even remotely isolated. So yeah, people are still out here making up like completely disparate definitions of
Starting point is 01:16:31 these words. I like to look for failure mode documentation. I want to see something in your docs that says, you know, this system can survive the failure of up to one third of the nodes, or up to one half of the nodes can be partitioned away and the remainder can still make progress. That's really helpful, and it's surprisingly absent from a lot of systems. People love to say high availability, but they don't characterize whether that's total or majority or what. I'd also like to see more latency characteristics. Like, it would be really cool if people could say, you know, we commit within one network round trip in a local data center, plus one across data centers if we're doing multi-site.
Starting point is 01:17:17 Or, you know, we take 15 multi-data center round trips. Those two things have very different performance characteristics. And they also have safety impacts, right? You can sometimes guess from the latency characteristics, what safety is going to be like, if somebody says, you know, we always commit immediately and you're like, oh, hang on, you know, you, you can't be providing certain consistency levels because you're making this decision independently of the other nodes. Yeah.
Starting point is 01:17:41 Uh, so one aspect is like running distributed systems or specifically databases is super hard because, well, they hold state. Moving state is hard. What are the ways you've seen of running databases or Qs or stateful systems in general that are better than others? I don't know if I'm going to give a great answer here. Monitoring is great. Do it. Have an ops team. This is table stakes, right? What's new here? I've heard from people, although I cannot corroborate firsthand, that doing stateful stuff in Kubernetes is rough. A lot of folks who have pushed long and hard on that process have just abandoned it and gone back to long-lasting containers on VMs they own or actually to physical hardware all the way. So maybe that's helpful.
Starting point is 01:18:35 I can't speak to that firsthand. I have successfully, in my crotchety old man phase of my career, now that I'm a ripe old 36, 37, I've managed to ignore Kubernetes entirely since Zepson and hope to ignore it until it becomes irrelevant. Save for Docker. I've almost managed to ignore Docker, but somehow people keep asking me to do Docker-related things, and every time it ruins my life. Why is that? Oh, my gosh.
Starting point is 01:19:03 It's Docker, like, Docker loves to break everything else on the system. There's a lot of networking stuff that will just completely explode on Linux if you have Docker installed. Firewalls and anything with LXC or LXD. Another fun Docker thing is that it behaves wildly differently on different platforms. And so I have to merge patches from people who are like oh it broke an osx here's a patch for osx and you merge and it breaks it on debbie and the debbie person merges back and you're just like you're constantly fighting these mergers across
Starting point is 01:19:33 different platforms because they have different bugs or different uh different behaviors containers are supposed to solve the problem right right runs on my machine. Right, right. But as far as I can tell, it's actually harder to run in Docker than it is just to, like, spin up your own VMs. Sometimes the old ways are better. I agree with that. So we're getting towards the end of our conversation. And I have not rapid fire of sorts, but rather we would ask you to choose favorites. You can say not to. But looking at how systems fail, what's a favorite database that you have?
Starting point is 01:20:12 And I would like to have an answer for different categories. So let's start with the relational ones. Relational. Postgres. Fantastic. Love it. Wish I had a good replication story. Key value store.
Starting point is 01:20:25 I'm still partial to React. I have not seen it being used. This is the fixie bike of databases. Okay. Object store. FaunaDB. It's a weird one. Yeah. I think it's technically an object or document store. FaunaDB. FaunaDB? It's a weird one. Yeah, yeah.
Starting point is 01:20:45 I think it's technically an object or document store. But it had a really cool Lisp-inspired query language back when I worked on it. I think they've changed that now for obvious reasons. But I'm a big fan of that. Also, it was temporal, which is a rare quality. What about columnar store slash the class of NoSQL databases? Datomic.
Starting point is 01:21:06 Yeah, entity attribute value time triples. Actually, quints. With a totally different transaction model, up to strict serializability, it's a wild beast. Datalog for querying. And last one, queues slash distributed logs. You got to go for the classics, right? Like it's Kafka or Red Panda or nothing.
Starting point is 01:21:29 As far as I know, all the other ones lose data. I'm being facetious. I'm sure that someone has come up with a better queue in the last few years. I just haven't heard about it yet. I don't know. Kyle, the work you do is extremely niche, requires great expertise, and you've helped improve just the ecosystem of distributed systems in general. A lot of work that you put out for the public and your analysis is extremely helpful to all the practitioners out there who not only build systems, but also using many of these distributed systems themselves. So thanks for doing everything you do.
Starting point is 01:22:03 All of us really appreciate your work and we continue to hope to see a lot more of it in the future. And thank you so much for taking the time and sharing your story with us. This was an amazing conversation. You were far too kind. Thank you very much for all of these engaging questions. What deep cuts. I'm really impressed by the research you guys did. Please don't read my blog if you're listening to this. We'll do some back and forth reviews and you can choose.
Starting point is 01:22:35 Oh, we have a little pre-worthy past. It's fine. Kyle, thank you so much for coming on the show. This was super awesome. Thank you both. Really appreciate it. Thanks. Hey, thank you so much for listening to the show. You can subscribe wherever you get your podcasts and learn more about us at softwaremisadventures.com.
Starting point is 01:23:00 You can also write to us at hello at software misadventures dot com. We would love to hear from you. Until next time, take care.
