Software Misadventures - Julia Evans - On kubernetes scheduler bugs, TCP performance regressions and debugging tips - #2

Starting point is 00:00:00 It just becomes impossible if you don't understand the fundamentals, right? And when people talk about debugging tools, sometimes they talk about observability tools or debuggers, and that stuff is really great and so important, and I love it. But if you don't understand the fundamental concepts about how the system works, it often is really hard to make progress. And the flip side is if you do, it becomes just so much easier, right? Because you can think through in your head, like, okay, how should this work? What are the steps here that are happening on the computer?

Starting point is 00:00:30 And it gets a lot more straightforward. Welcome to the Software Misadventures podcast, where we sit down with software and DevOps experts to hear their stories from the trenches about how software breaks in production. We are your hosts, Ronak, Austin, and Guang. We've seen firsthand how stressful it is when something breaks in production, but it's the best opportunity to learn about a system more deeply. When most of us started in this field, we didn't really know what to expect

Starting point is 00:00:57 and wish there were more resources on how veteran engineers overcame the daunting task of debugging complex systems. In these conversations, we discuss the principles and practical tips to build resilient software, as well as advice to grow as technical leaders. Hey everyone, this is Ronak here. In this episode, Guang and I speak with Julia Evans. Julia runs a programming zines business called Wizard Zines, where she creates comics about various programming concepts.

Starting point is 00:01:26 She has been creating zines since she was still a software engineer at Stripe. Her zines are extremely approachable and highly educational. I've bought many of them myself and encourage you to check them out at wizardzines.com. In addition to creating zines, Julia is a prolific blogger and has around 500 posts on her blog at jvns.ca. Her blog is another great source to learn about fundamental programming concepts. We had a lot of fun speaking with Julia for this episode. We discussed two bugs she came across at Stripe. We talk about how she identified and fixed a bug in the Kubernetes scheduler and how her understanding of TCP helped her fix a performance regression.

Starting point is 00:02:06 We also cover other topics like logging, zines, debugging, and learning new things. Please enjoy this fun conversation with the amazing Julia Evans. Julia, we are super excited to talk to you today. Welcome to the show. Thanks so much for having me. So I thought we would start with asking you about a word that I've only seen in the urban dictionary, which is also your Twitter handle, B0RK. I think Bork is the right pronunciation. And we're wondering if you could tell us the story behind how it became your Twitter handle? I think it's the normal story,

Starting point is 00:02:45 which is I thought it was funny when I was 14. And here we are. It is unique. And we all recognize you by that these days. And a lot of people also know you through your blog and your zines, which personally, I'm a huge fan of. And it's something we definitely want to get into. Before we go there, can you tell us about your engineering career in tech and how you got into the infrastructure engineering space? Yeah, definitely. So I think when I started out, I worked on Drupal. I made websites in Drupal and then I worked at a machine learning consultancy for a little bit where we didn't have any clients. And then after that, I spent some time at the

Starting point is 00:03:32 Recurse Center, which is this program in New York, where you spend 12 weeks trying to become a better programmer. And when I was there, I was like, okay, I'm not going to do web design. And I'm not going to do machine learning, because I kind of know something about that. And instead I wanted to do something that I didn't know about. So I spent a lot of time trying to learn about operating systems and tried to write an operating system. And I wrote a little TCP stack that did work in an extremely limited way. It was in Python, which is the perfect language to try to write TCP in. Yeah. As we all know. But it sounds hard, though. It wasn't that hard. It was okay.

Starting point is 00:04:11 The hardest part was getting it to actually send a TCP packet. Once I could send one packet, it was okay. Nice. And so I started learning about the system stuff there, which previously kind of seemed really unapproachable to me. And I was like, oh, this is so fun. And I loved it. And then I got another job doing machine learning, which is not system stuff at all.

Starting point is 00:04:31 Because I was like, well, my dream at the time was to do machine learning at a real company that did use machine learning for important things to the company. So I worked on machine learning for fraud, which was great. But then at some point I was like, but I was really excited about this infrastructure stuff, this networking stuff. And so I ended up switching to a networking team where I did that and I had a lot of fun there. Nice.

Starting point is 00:04:59 So you were on the infrastructure team at Stripe. And I don't remember the exact timeline, but I think sometime last year or the year before that, you started a business, Wizard Zines. I don't know if that's the right name, but at least the website is wizardzines.com, which I love. So you have been making zines way before then, though. So how did you come up with the idea of making zines in the first place? I watched this movie called Girls to the Front, which is about Riot Grrrl and zine culture in the nineties.

Starting point is 00:05:38 And I thought it was really fun and it made me want to make a zine. And I think I was giving a conference talk and I thought it'd be so fun if I wrote a zine and printed out 200 copies and gave it out to everyone at my talk. I was like, that would be so epic. So I did it and it was really fun. It was exactly as fun as I hoped it would be. And so I ended up making a lot more zines because of that. That's awesome. So since you started a business, did you always know you wanted to start one or

Starting point is 00:06:04 is this something that just happened organically? I think it kind of happened. Yeah, I definitely wasn't. Like, I definitely didn't always want to start a business. I think I started one and then it worked. And I was like, oh, that's cool. Maybe I'll do this for a while. But I was pretty surprised when it did work.

Starting point is 00:06:28 Well, it's awesome. We all love the zines that you make. So how's the experience been so far with running a business? I'm sure it would be very different from working at a company on an infrastructure team. Yeah, it's very different. It's nice. I don't know.

Starting point is 00:06:43 It's much more low-key. And I definitely miss my coworkers sometimes. But it's also fun to be able to do literally whatever I want in some sense, right? So there's ups and downs. Oh, yeah. That certainly sounds nice. So let's talk about your blog a little bit. I remember reading a post where you mentioned why you started blogging

Starting point is 00:07:05 when you were at the Recurse Center. And we all are very thankful that you started it because we have learned so much from it. Like, I don't come from a traditional CS background. So your blog has been a go-to resource for me to learn about a lot of the system stuff. Whenever I see a tweet from you that you wrote about something, I get excited in the morning saying, oh, I'm going to learn something new today. So what I would like to know is how did you decide that you wanted to write about such fundamental concepts and systems? I haven't seen that to be very common, but I'm glad that you decided to do it. I think I wrote about those things because I didn't understand them myself and I was kind of mad

Starting point is 00:07:45 that, like, I think I write a lot from a place of frustration, almost. Because I'll be like, this was so hard for me to learn, and no one told me that it worked this way, and, like, why not? Yeah. And I think a lot of those things inspire that feeling of, like, why did no one tell me? And sometimes the things are not really that complicated, too. You know, it's just that the information isn't out there. Or it's hard to find. So usually, I just had a hard time. And I want someone else to not have such a hard time. Well, you have made the lives of many engineers much simpler by writing all the posts that you do. So thanks for doing that. On top of that, I would say I really admire your writing style, especially the way you share your thinking.

Starting point is 00:08:29 It very much feels as if I'm talking to you live, even if I'm reading a blog post. And the way you explain things, it also shows up in your zines, making them so approachable. And it's very educational in the first place. So how did you develop this way of writing in general? I don't know. That's a great question. I think, I don't know, I didn't always write that way. I don't know what happened, actually, because

Starting point is 00:09:01 I have this master's thesis that I wrote on my GitHub profile somewhere, and it is not like that. "And then one sees that blah blah blah blah." It goes on like that for 80 pages, you know. So I'm not sure how I got there. I'm happy I did, but I wish I had a pithy answer. But I guess at least it's possible to learn it, it seems like. It is. I mean, I look up to your writing style in many ways, thinking, oh, how can I make this simpler so that anyone reading it can just understand it? So another question, also on your blog, is that your posts, or your blog in general, have a very positive and energetic personality. Was that intentional when you started blogging, or did that also just happen organically? I think when I started

Starting point is 00:09:53 writing, there were a lot of things, like, I was in a good place. I feel like I was at the Recurse Center, I got to learn a lot of stuff, and that's a very positive environment anyway. And so, I don't know, that was just how I felt. I've definitely been blogging for a lot of years, and, I mean, things happen that are not positive, you know, in one's life or in one's career. But one thing I like about writing in a positive way is it just makes it a lot easier on the internet, because I feel like if you're negative, people are sort of negative back at you sometimes, if that makes sense. And so just being like, here's something that I love. Because there's lots of things I don't like.

Starting point is 00:10:37 Right. But I don't really write posts being like, here's why I hate blah, blah, blah. Because I'm like, well, you know, whatever. There's enough posts like that, even though I have lots of opinions. And I think I just find the conversations that those kinds of posts generate to be more fun, or easier to deal with, right? And then no one is yelling at me being like, why do you hate blah blah blah? Because they don't know. I only write about the things that I like. Well, that sounds like a good strategy. Since you mentioned the internet, I mean, the internet is a funny place. And in your posts you actually mention that, hey, you don't know something.

Starting point is 00:11:17 I think it's actually very brave. It's something that, well, I'm not always comfortable saying, not on the internet all the time. How did you get over that barrier of saying "I don't know" publicly? How did you get comfortable with that? I think, I feel like, I don't remember exactly. I feel like when I started doing it, I didn't think about it a lot. Like, I was writing every day. I was writing this blog post every day. And I was like, well, here's what I'm learning, you know. And I think that the response was pretty good. Like,

Starting point is 00:11:49 people liked it, you know. And so I think I saw that it was okay to do that by doing it. And of course some people will be like, oh, why don't you know this? Right? But then also a lot of people will be like, well, people don't know things. Like, yeah, we are not all born understanding TCP. I think it was also easier because I started out writing about a lot of things that a lot of people actually do not know about, you know. Like, I was like, oh, I don't know how TCP works. And it's like, well, a lot of people don't know how TCP works, you know. Yeah, with you writing it in your posts, it also makes me as a reader very comfortable just reading the post, because I can relate to it so much. Okay, so this will be the last question on the blog, because we have a bunch more interesting things to talk about. You mention on your about page that you have one opinion about programming. Can you share what that is and how you came to develop it? Yeah, so I said that the opinion is that

Starting point is 00:12:48 if you want to do hard things, it's really important to learn the fundamentals. How did I develop that? I think I find it so fun to learn the fundamentals, first of all. Like, for example, with TCP, right? When you're going somewhere on the internet, it's so exciting to understand what's actually happening when you're doing that. But I think the place where learning fundamentals really becomes important, and what I saw in my career, is if you're trying to fix a bug, which is what we're going to talk about, we're going to talk about bugs. And when you're trying to fix a hard bug, it just becomes impossible if you don't understand the fundamentals, right? And when people talk about debugging tools, sometimes they talk about, you know, observability tools or

Starting point is 00:13:43 debuggers, and that stuff is really great and so important, and I love it. But if you don't understand the fundamental concepts about how the system works, it often is really hard to make progress. And the flip side is if you do, it becomes just so much easier, right? Because you can think through in your head, like, okay, how should this work? What are the steps here that are happening on the computer? And it gets a lot more straightforward. No, that makes sense.

Starting point is 00:14:10 And all of us relate to it. Yeah, I would like to give Ronak some credit for smoothly transitioning that into a plug for our podcast. Well done, Ronak. So changing gears a little bit, one of the sort of quote unquote misadventures we're talking about today is this post you made a little while back about why people should understand a little bit about TCP, which we'll include a link in our show notes.

Starting point is 00:14:33 So you started the post being like, hey, these are practical tips. This isn't about reading through TCP/IP Illustrated. When I read that, I gave myself a pat on the back for knowing what you're talking about. I actually have Volume 2 back there propping up my monitor. But I only know about it because when I was running into a lot of networking issues at my first job trying to set up data infrastructure, my mentor at the time was like, hey, if you want to learn about how these things work, you should just go and read these books called TCP/IP Illustrated. Little did I know there's three volumes and each one is like 600 pages. Yeah, I never read books. I haven't read any technical books.

Starting point is 00:15:17 I think it's great to read books, but I also think there's other ways to learn. And I have one book, maybe, that I read sometimes. Thanks for making me feel better about never finishing TCP/IP Illustrated, Volume 2. So for people like me who don't come from a traditional CS background and have never taken a networking class, but who also want to learn more about things like networking, these technical books can be quite intimidating. So I was really pleasantly surprised when I came across these networking zines you made, just at how approachable they are. Can you tell us more about that?

Starting point is 00:16:00 Yeah, so I think that for me, I find it a lot easier to learn by looking at, well, by obviously learning some basic facts, but the Bite Size Networking zine is actually about networking tools, about tools that you can use to inspect your network. And I find it a lot easier to learn by being like, okay, I want to learn about maybe TCP. I want to understand, like, for example, the other day I was trying to understand WebRTC because I was debugging a WebRTC problem with someone. And I don't know. I didn't know anything about WebRTC.

Starting point is 00:16:32 But I just don't like reading books. I don't have the patience. So instead I was like, okay, let's look at some debugging output, right? Let's set up a connection. And then it was like, oh, you have this TURN server. And then you can find the port for the TURN server, and then you can start capturing packets, UDP packets. And then you can open up Wireshark, and you can see the UDP packets that are being used to start establishing this WebRTC connection. And I think that trying to look at things interactively like that,

Starting point is 00:17:04 things that are really happening on my computer, makes it feel so much more real to me. And I'm sure that it's not a lie. Sometimes things that you read on the internet are lies, you know, or they're true for someone else's computer but not true for my computer. And it feels so much more immediate and easy for me to understand when I can actually see things that are happening on my computer. That's awesome. So now jumping into the post itself, what was the team trying to do? And what was the problem that you guys saw? Yeah, so we had this message queue that we were using called NSQ. And the way you publish a message to NSQ is there's a local daemon running on every host that basically, like,

Starting point is 00:17:47 you send it a message, and then it publishes the message to some central place. And so what we were seeing was that publishing a message, so the way you publish a message is you make an HTTP request, right, where you're like, POST, and then you send the contents of the message. So that's super simple. We were doing that

Starting point is 00:18:04 in Ruby, and someone noticed that it was taking 40 milliseconds to send each message. And I think I'd been learning about how much time things should take on computers around that time, and I was like, 40 milliseconds is too long for, you know, an HTTP request on localhost to a Go program, right? Like, there's no reason it needs to take this long, right? It doesn't make sense. And I think we spent some time looking at garbage collection, and that was not it. You know, the Go program seemed to be fine. I see. And then you mentioned this blog post that you saw elsewhere that gave you a clue.

Starting point is 00:18:45 Can you tell us more about that? Yeah, so I'd read this blog post, I think the week before, that was like, hey, sometimes if you're seeing a slow HTTP request, it might be because of a weird problem with TCP. And I think my manager had posted it the week before, coincidentally, and I was like, oh, that's interesting. That probably doesn't have anything to do with me, but it would be really cool if that was the problem. Right. Because it would be. And so I think what I did, so I'll explain the blog

Starting point is 00:19:19 post and what it said. So what it said was that in TCP, you send packets. And there are a lot of different ways TCP clients and servers can be configured. So one thing that the TCP client sometimes does is something called Nagle's algorithm. And what that means is it'll send the first packet. And if that packet doesn't get ACKed right away, it'll be like, oh, there's a problem with the connection. Maybe I shouldn't send the second packet just in case there's some kind of congestion. So it'll wait before sending the second packet. And then on the server, there's an algorithm called delayed ACKs, which is like, when I get the first packet, I won't ACK it right away. Because we don't want to waste space on the network with ACKs or something. So I'll just wait until the second packet and then

Starting point is 00:20:09 ACK that one. So the client sends the packet, and it's waiting for the ACK. And the server is like, I'm waiting to ACK. I don't want to send too many ACKs or anything. And so they're both in this kind of stalemate until they time out after 40 milliseconds. And then the client just goes ahead and sends a second packet. And this is something that's happening inside the Linux kernel at the OS level. And I guess the other part of this is that the Ruby program was sending... The request it was sending was quite small. But because of the way the code was written, it was sending the headers and the request body, the HTTP request body in

Starting point is 00:20:49 two separate packets. If it had just been sending it in one packet, this also wouldn't have happened. So there was kind of an interaction between the Ruby code and the... Interesting. In the post, you mentioned using strace helped quite a lot. I was curious what you saw there that helped you make the connection to the blog post. Yeah, so what I saw there was you can see network packets being sent in strace, I think with maybe sendmsg, I don't remember the exact system call. But what I saw, what would I have seen? So you see the first packet being sent. And then you see the second packet being sent 40 milliseconds later. Yeah.

Starting point is 00:21:38 And so I think, yeah, what I saw, because you can do strace -t to see timestamps. I think today I would probably use tcpdump for this, but at the time I was scared of tcpdump, so I used strace. Anyway, yeah, so you see one packet being sent, and then another one being sent 40 milliseconds later. And then I was like, why? Right? Like, why wait to send the second packet? Like, what's going on? Is that true? Yeah, I think that's right. I honestly don't remember exactly what I saw. But it was something like that.
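
The "two sends, 40 milliseconds apart" pattern described above can be spotted mechanically once timestamps are in the output. What follows is a minimal Python sketch, not what was run at Stripe. It assumes the trace was captured with microsecond wall-clock timestamps (strace's -tt option), one syscall per line, with no pid prefix. The sample lines, the `slow_gaps` helper, and the 30 ms threshold are all invented for illustration.

```python
import re
from datetime import datetime

# Hypothetical strace output, invented for illustration (not a real capture).
STRACE_TIMED = r'''
12:00:00.000100 connect(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
12:00:00.000250 sendto(3, "POST /publish HTTP/1.1\r\n...", 120, 0, NULL, 0) = 120
12:00:00.040450 sendto(3, "{\"hello\": 1}", 12, 0, NULL, 0) = 12
12:00:00.040900 recvfrom(3, "HTTP/1.1 200 OK\r\n...", 4096, 0, NULL, NULL) = 38
'''.strip().splitlines()

# Matches the start of a line such as "12:00:00.000250 sendto(3, ...".
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) (\w+)\(")


def slow_gaps(lines, syscalls=("sendto", "sendmsg", "write"), threshold_ms=30.0):
    """Return (gap_ms, previous_line, line) for consecutive send-like calls
    that are at least threshold_ms apart."""
    gaps = []
    prev_time = prev_line = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match.group(2) not in syscalls:
            continue  # ignore every syscall we are not interested in
        stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        if prev_time is not None:
            gap_ms = (stamp - prev_time).total_seconds() * 1000
            if gap_ms >= threshold_ms:
                gaps.append((gap_ms, prev_line, line))
        prev_time, prev_line = stamp, line
    return gaps


if __name__ == "__main__":
    for gap_ms, first, second in slow_gaps(STRACE_TIMED):
        print(f"{gap_ms:.1f} ms between:\n  {first}\n  {second}")
```

On the made-up sample this flags the pause between the headers and the body, which is the shape of the problem discussed in this episode.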

Starting point is 00:22:06 Got it. And why are you scared of tcpdump? I think then I was scared of tcpdump because I thought... I mean, I don't ever use it. But just curious from your perspective. Yeah, so tcpdump I no longer think is scary. I love it. I think it's very straightforward to use. Well, not very. I wrote a zine about how to use it, which is free. It's on my website if you want to not be scared of tcpdump as well. Because I was mad that it took me so long to not be scared of it. Anyway, I think just at first, it's this command line tool, there's all these network packets, it prints out things that sometimes you don't understand, you know, like all of these things about the TCP statistics. But the way I usually like to use tcpdump, which I find much easier,

Starting point is 00:22:54 is you can capture packets with tcpdump and write them out to what's called a pcap file, and then you can open them in Wireshark. And Wireshark I find much easier to use because it's a graphical interface, you can see everything. You can filter things much more easily. And you can also kind of do it after the fact, you know, you have your data saved and then you can analyze it more slowly or maybe with a coworker or friend, right? And be like, okay, wait, what does this mean? And so I think I also learned that workflow, which I find a lot easier than trying to use tcpdump interactively from the command line. You mentioned tcpdump being scary at the time.

Starting point is 00:23:30 I feel strace is also scary. I mean, I've used it on occasion and it's extremely helpful. But I think it takes a little getting used to. And it's non-trivial initially when you're seeing a bunch of system calls and it's like, what does that system call do? So, one, how did you get better at using strace? And under what circumstances have you found it to be very useful as a tool? I think the real secret to using strace is to ignore system calls that you don't understand

Starting point is 00:23:59 and to develop a sense for what system calls you do understand. For example, there's the open system call, which opens files, right? And that's like, okay, that's not scary. And so when I'm stracing programs, often what I do is I'll just grep for open and be like, I'm going to ignore everything else. I just want to know what files this program is opening. Or if I'm looking at network stuff,

Starting point is 00:24:23 I just want to know what it's sending and receiving. And then if there's three billion other, like, futex system calls, that's the confusing one, it's okay. Often that's not the focus, right? So one of my favorite ways to use strace is if a program is writing to a log file and I don't know what log file. Then you can just strace the program, you find the file, it's over. It doesn't matter, all the other weird stuff it's doing, you know, you just need the file. That makes sense.
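
The "grep for open and ignore everything else" habit can be sketched in a few lines of Python. This is only an illustration under assumptions: the trace lines below are invented, the `opened_files` helper is hypothetical, and real strace output can look different (for example with pid prefixes or unfinished calls).

```python
import re

# Hypothetical strace output, invented for illustration (not a real capture).
STRACE_SAMPLE = r'''
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
futex(0x7f2a4c0008c0, FUTEX_WAKE_PRIVATE, 1) = 0
openat(AT_FDCWD, "/var/log/app/worker.log", O_WRONLY|O_CREAT|O_APPEND, 0644) = 4
open("/etc/hosts", O_RDONLY) = 5
write(4, "started\n", 8) = 8
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 5
'''.strip().splitlines()

# Captures the quoted path from open("...") or openat(DIRFD, "...").
OPEN_RE = re.compile(r'\b(open|openat)\((?:[^,"]+,\s*)?"([^"]*)"')


def opened_files(lines, writable_only=False):
    """Return paths passed to open/openat, in first-seen order, no repeats.

    With writable_only=True, keep only calls whose flags include O_WRONLY
    or O_RDWR, which narrows things down when hunting for a log file.
    """
    paths = []
    for line in lines:
        match = OPEN_RE.search(line)
        if not match:
            continue  # futex, write, and everything else is ignored on purpose
        if writable_only and "O_WRONLY" not in line and "O_RDWR" not in line:
            continue
        if match.group(2) not in paths:
            paths.append(match.group(2))
    return paths


if __name__ == "__main__":
    print(opened_files(STRACE_SAMPLE, writable_only=True))
```

Filtering on the write flags is one possible shortcut for the "which log file is this program writing to" case mentioned above. The plain call lists every file the program opened.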

Starting point is 00:24:56 Nice, nice. Going back to the TCP issue that you guys were trying to solve. So once you made the connection to the blog post, did everything pretty much just work out from there on out? Yeah, so we needed to implement the solution still. So what I did was I set this setting called TCP_NODELAY on the client, which basically turns off the waiting for ACKs on the client side. Because the client was the thing I had the most control over. And it was also what the HAProxy blog post said to do. And then I made this pull request and I was like, oh my

Starting point is 00:25:36 god, I've never tried to change a TCP setting before, is this gonna work? I mean, it's kind of scary, because if it doesn't work, I'll be embarrassed a little bit, you know. But I did it and it worked. And I was like, oh, my God, I'm amazing. This is so cool. And I think the funny thing about this bug in particular is that over the years since I wrote that blog post, I keep on seeing people writing about it being like, this thing happened to me with Nagle's algorithm and delayed ACKs. And I'm like, oh, yeah, again. And I think maybe these are very common settings that interact poorly, and it just keeps on happening.
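
The fix in the episode was made in a Ruby client. As a rough Python sketch of the same two ideas, the code below sets TCP_NODELAY on the client socket (which disables Nagle's algorithm, so small writes are not held back while earlier data is unacknowledged) and also sends headers and body in a single write. The function names, the `/publish` path, and the port in the usage note are hypothetical and are not NSQ's or Stripe's actual code.

```python
import socket


def build_request(host, path, body):
    """Build an HTTP/1.1 POST as one bytes object, headers plus body."""
    headers = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return headers + body


def open_nodelay_connection(host, port, timeout=2.0):
    """Connect over TCP and disable Nagle's algorithm on the socket."""
    sock = socket.create_connection((host, port), timeout=timeout)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def post(host, port, path, body):
    """Send the whole request with one sendall() and return the raw response."""
    with open_nodelay_connection(host, port) as sock:
        # One write for headers and body, instead of two small writes.
        sock.sendall(build_request(host, path, body))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

A call would look something like `post("127.0.0.1", 8080, "/publish", b'{"hello": 1}')` against whatever local daemon is listening. Either change on its own should avoid the stall described here: with TCP_NODELAY the second small write is not delayed, and with a single write there is no second small packet to delay.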

Starting point is 00:26:23 So I feel like it wasn't maybe a big coincidence that I'd seen that blog post. It's just that it's a relatively common thing, even though it seems really obscure at first. Was your team like, holy crap, you know, how did she know what to do? I mean, maybe. I think they were like, cool, good job. And I really felt validated because I was so obsessed with networking and system calls. And that wasn't really my job at the time at all. And I was like, finally, my obsession with these weird things has paid off. That's awesome. Changing gears a little bit.

Starting point is 00:26:57 Something you emphasize quite a lot in your blog is the importance of learning the underlying systems that we use and rely on, which is definitely one of the core motivations that we have for starting this podcast. One challenge, though, sometimes is to figure out, you know, okay, when do I need to dig deep into this concept versus, no, you know, I just got more tickets I need to go do. Do you have a rule of thumb in terms of prioritizing what to learn? Yeah, especially at work. So I think one thing that's important is if you're putting something into production and you're on an SRE team, I don't think it really works to put things into production that you don't understand, right? Like, it's not okay. And so once, a long time ago, I was trying to,

Starting point is 00:27:51 we were thinking about using Kubernetes, and no one on my team knew anything about Kubernetes, right? And so I spent a huge amount of time trying to understand how Kubernetes networking works and just being like, what is this thing? How does it work? What's going on? I spent so much time learning about container networking, like namespaces, and I think I spent weeks on it. And in those weeks I really accomplished nothing at work. Like, I

Starting point is 00:28:16 learned a lot. But I think at the time my manager understood. He was like, hey, we want to investigate using this technology, none of us know anything about it, so someone needs to learn about it, you know. And I think it was understood that that was a good use of time, because we did end up using it, and we did end up needing to use all of that knowledge to debug the problems we had. But I think that you do need to be in a place where that's valued, right? Which I think on a lot of infrastructure teams it is, because people see what happens if you have systems in production that you don't understand, and they see where it leads. It's a lot of fun when you get woken up in the middle of the night and something's not working and you're like, I have no idea what to do now. Yeah, yeah, exactly.

Starting point is 00:29:10 Well, I can't stress enough how important it is to understand the thing you put in production. Also, I empathize with spending weeks learning about container networking. By the way, you have some really good blog posts on Kubernetes and containers. And for me personally, the ones on container networking and Kubernetes certificate authorities clarified so many things. And we'll link to all of these in the show notes as well. One of your posts on Kubernetes, which is what we want to discuss today, is on the Kubernetes scheduler, where you mentioned that you found a bug in production. Can you tell us what the Kubernetes scheduler does and what was the bug that you encountered? Yeah, so I think I'll talk about the issue first because it was kind of, yeah.

Starting point is 00:29:51 So we were running cron jobs on Kubernetes. And so I guess part of the model of Kubernetes is you have these things called pods. And it's this declarative system where you declare a pod and you're like, hello, etcd, here's a pod. And it's like, great, there's a pod, but it's not running anywhere. And so what the scheduler does is it periodically looks at the pods, and it sees if one of them is not attached to a node and not running. And if it's not running, it'll find somewhere to run it, and then it'll update it to say that it's running. Cool. And so what we were running is cron jobs, and what running cron jobs means is that you're scheduling

Starting point is 00:30:31 new pods a lot, right? Like if you're running a web service where you just have some stuff running, maybe it restarts every day. But we had many, I don't know, a lot of jobs that we were running all the time. And so new things were constantly starting. So we'd set up this cluster and we were starting to slowly roll it out on some less critical jobs. And someone on one of the teams that was operating these jobs, that wrote the jobs, was like, hey, one of these didn't run that was supposed to run. What's going on? And I saw that it was stuck, and I restarted the scheduler, and then the pod ran. And I was like, okay, cool, that's fine that it's running now, but why, you know, what happened? And I think there I really had a choice, like I

Starting point is 00:31:25 could have let it go, or been like, maybe we should just restart the scheduler sometimes, I guess, right? But this was at kind of an early stage of rolling out the system, and it needed to run jobs that were processing a lot of money. And also, when there were problems with it, it caused a lot of problems for the engineers who wrote the jobs. And I decided, no, this isn't okay, I'm not going to accept this happening, I'm going to figure it out. And I didn't know how the scheduler worked. And I think the way that I assumed it worked was that it looked at every pod periodically, like it would be like: for x in pods, if pod is not running, then run it. That's what... I looked at the code. What? I was saying that's what one would expect. Yeah, that is what one would expect. And so I

Starting point is 00:32:21 looked at the code and that was not what the code did. Instead, what happened was that there was a queue, and every time a new pod was added, it would get added to the queue of things to schedule. And if there was an error scheduling a pod, it would add it back to the queue of things to schedule. And this makes sense, because you can imagine, if you're trying to run like 10 bajillion pods,

Starting point is 00:32:46 you don't want to be iterating over all of them every single time, right? Like it could cause performance issues. So this is like a really reasonable performance optimization to have this like queue of things to schedule and you only actually look at the stuff that you need to, that like needs to be scheduled. And then maybe you go look at everything again when you're done. And so it turns out that what had happened was one of the pods,

Starting point is 00:33:07 there'd been some error when trying to schedule it. And so it went off the queue, and then the error handling code was supposed to put it back in the queue and it hadn't. And so I added maybe one line of code or something to put it back in the queue on this error. And to debug this, I think I added a bunch of print statements to the scheduler and compiled it and ran it in production or something to find it. I did something dumb like that because I was just so stuck. And I was like, look, what's another print? I really didn't have a plan. Anyway, so I put in a fix and then

Starting point is 00:33:53 I like compiled it and I put it in production I saw that it fixed our problem which was great and then I made an upstream pull request and it got merged and and then like the problem never came back you know and it was done and it was so much nicer than trying to... Because you can paper over a lot of stuff in operations by restarting things. And it was so nice to not add to the stack of things that we'd papered over in that way. And there were other things that we did handle. We had a memory leak, and I could not figure out what was going on. And I just put, like, just restart this sometimes.
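
A minimal sketch of the bug described above, assuming a toy model rather than the real Kubernetes scheduler code. Every name here (`ToyScheduler`, `schedule_one`, `requeue_on_error`, the `cron-job-pod` and `node-1` strings) is hypothetical and only illustrates the idea: pods are popped from a queue, and if the error path forgets to put a failed pod back, that pod is never tried again.

```python
from collections import deque


class ToyScheduler:
    """Toy queue-driven scheduler; an illustration, not Kubernetes code."""

    def __init__(self, requeue_on_error):
        self.queue = deque()   # pods waiting to be scheduled
        self.assigned = {}     # pod name -> node name
        # False models the buggy error path, True models the fix.
        self.requeue_on_error = requeue_on_error

    def add_pod(self, pod):
        self.queue.append(pod)

    def schedule_one(self, try_assign):
        """Pop one pod and try to place it. Returns False if the queue is empty."""
        if not self.queue:
            return False
        pod = self.queue.popleft()
        try:
            self.assigned[pod] = try_assign(pod)
        except RuntimeError:
            # The "one line" fix: put the pod back so it gets retried.
            if self.requeue_on_error:
                self.queue.append(pod)
        return True


def flaky_assign_factory():
    """Build an assign function that fails once per pod, then succeeds."""
    seen = set()

    def try_assign(pod):
        if pod not in seen:
            seen.add(pod)
            raise RuntimeError("transient scheduling error")
        return "node-1"

    return try_assign


def run(requeue_on_error):
    scheduler = ToyScheduler(requeue_on_error)
    scheduler.add_pod("cron-job-pod")
    assign = flaky_assign_factory()
    while scheduler.schedule_one(assign):
        pass
    return scheduler.assigned
```

In this sketch, `run(False)` leaves the pod unassigned forever after a single transient error, which matches the symptom of a job that never ran until the scheduler was restarted, while `run(True)` retries it and places it on the second attempt.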

Starting point is 00:34:27 And that fixed the memory leak, and I think eventually we upgraded and it went away. But that one seemed kind of less important to me because it wasn't a correctness issue, you know? But this one, I was like, no, I don't know. No, that makes sense. I mean, the importance of not letting weird things in production just go by, and actually digging in and figuring out what exactly is happening to fix it for the long term. One thing which I will say is plus one to print statements. I'm sure a lot of people would argue it's not the best way to go about it. I can actually relate to it because a few months back, I did that exact thing to Kubelet: compiled it, ran it on a node to debug. I couldn't figure out anything else through the stack dumps I was getting. I was like, this is the best way. And well, we found the problem. It

Starting point is 00:35:15 was our mistake, not a Kubernetes issue, but it worked. So plus one to print statements. Yeah, I mean, you gotta do what you gotta do. And like, adding a print statement is not that invasive. Yeah. So the specific issue that you mentioned, was it happening too frequently, by the way? Like, were you seeing too many pods ending up in pending state? It was not happening that frequently, no. Which was good in some sense.

Starting point is 00:35:43 But it was a nice example of why it's good to jump on stuff that isn't really a big deal, you know? Because this was not an incident, but it would have been an incident in the future. It's the kind of thing where you're like, if I don't deal with this now, I'm going to get paged in the middle of the night and people are going to be really mad in like eight months, you know. And so I think I wrote a whole post-mortem about it too. Like to be like, this is what happened. Just to show the kinds of things that,

Starting point is 00:36:18 because I think this was like one of the first Kubernetes issues we had. And I just wanted to show people like, you know, these are the kinds of problems we're running into. And like, this is what we need to understand about the system. So that when there are real problems in the future, we're better equipped to start dealing with them. Yeah, that makes total sense. So you mentioned reading the Kubernetes scheduler code to actually figure out how it works,

Starting point is 00:36:36 which is the right way to figure something out, but navigating a huge code base which is unfamiliar, something like Kubernetes, can be overwhelming. So what practices have you found useful when navigating such a code base? I feel like a lot of grep, you know? Like a lot of looking for specific error messages, because I think I was seeing some weird error message in the log that was correlated with the error. So I was like, okay, that error message is related. So I think I tracked down the code path that led to that error

Starting point is 00:37:15 message and I tried to think about it. And what else is there? I also think that understanding the overall architecture of the system is important, because with Kubernetes, at least at that time, and I'm sure now too, the components are kind of independent of each other. They only interacted with the scheduler through etcd. So I knew that I only really needed to worry about the code in the scheduler subtree of Kubernetes. And I understood the overall system well enough to be 100% sure that it was a problem with the scheduler. So I didn't... Because it sucks if you get distracted

Starting point is 00:37:59 and you're like, oh, maybe it's in this component, or maybe it's in this component. If you can tell from your model of the system, okay, the problem is here, then even if here is kind of big, at least you can forget about all the other stuff, you know? Yeah, no, that makes sense. Narrowing down where the problem is.
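
The approach described above (search one subtree of the code base for the exact error message seen in the logs) can be sketched roughly as follows. This is an illustrative helper, not a tool mentioned in the conversation: the function name `find_message`, the `.go` suffix default, and any path or message string you pass in are assumptions.

```python
from pathlib import Path


def find_message(root, message, suffix=".go"):
    """Return (path, line number, line) for each source line containing message.

    Only files under root with the given suffix are searched, which mirrors
    the idea of restricting the search to the one component you suspect.
    """
    hits = []
    for path in sorted(Path(root).rglob("*" + suffix)):
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # skip unreadable files rather than aborting the search
        for lineno, line in enumerate(text.splitlines(), start=1):
            if message in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

A call might look like `find_message("path/to/scheduler/subtree", "some error text from the log")`, where both arguments are placeholders. Each hit is a starting point for reading the code path that produces the message.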

Starting point is 00:38:15 And anyone running software in production can attest that debugging is a very important skill because things break almost all the time. Sometimes we are aware, sometimes we are not. And I love your debugging zine. So what are some of the debugging practices you've found to be useful? And did you pick up most of them on the job or was it something else? So what are the debugging practices that are useful? I think an important debugging practice is the practice of not giving up, which I feel like is one of the hardest ones for me sometimes. Especially if I'm in the middle of a problem and I'm like, why? Like, maybe if I just run the same

Starting point is 00:39:02 code again it'll work, maybe it'll just go away. You know, it's still easy to succumb to this kind of magical thinking of like, oh, maybe this problem will just disappear if I pretend it's not there. So just acknowledge that there's a problem and believe in yourself that you can fix it, or that you can find out. And then I think what believing in yourself that you can fix it means is just trying to find more information, like being like,

Starting point is 00:39:32 okay, what information am I missing, and how could I get it, right? And I think that's not easy to do, to try to, A, notice what information you're missing and, B, figure out how to get it. Because sometimes the answer is like, compile this thing. Sometimes getting it takes a long time and you're like, is it worth it? But it's often worth it, usually worth it, to get more information. Also, talking to someone about, like, hey, I'm in this situation, how could I collect more information about this? Like, what questions could I ask about what the system is doing?

Starting point is 00:40:11 And yeah, yeah, because that's all it is. You know, it's like collect information. And then eventually, once you have enough information, you understand what's happening. Yeah, that makes sense. I think for me, debugging gets like emotional sometimes. I try to catch myself when I get stuck in this loop

Starting point is 00:40:32 of asking myself, wait, but you know, this should work, but why doesn't it actually work? And then just really stop that thinking and be more matter of fact, like, hey, you know, I've checked A, B, C,

Starting point is 00:40:42 I haven't done D, even though that seems obvious, but let's just do that. But yeah, I think just not giving up is hard but really important. Yeah, it's so emotional. It's so stressful sometimes, and I don't know, I find it a really intense process, especially when it's a really complicated system that you're trying to reason through, and there's so many things in your brain, and sometimes there's pressure from other people. And I

Starting point is 00:41:06 think it's important to acknowledge that it's kind of an emotional process and to be like, okay, let's calm down. Yeah, and it's a range of emotions. I mean, it starts with some frustration, maybe gets stressful at times, but debugging can also be a lot of fun. I mean, I know I have learned the most about a system when it's not working the way it is supposed to. And when you actually don't give up and eventually solve the problem, it is so gratifying that you were able to get through all of that and fix the problem. Yeah, it feels so good. And then I think that really helps feed into future problems, because then you can be like, okay, this is really hard and I have no idea what's going on, but remember when I solved that really hard bug that other time, and it

Starting point is 00:41:49 took five days and I got it and it was amazing? And this is going to be like that. Yeah, and I think you can hold on to those things, and they can bring you forward in future bugs. To wrap up, I saw on your blog that you've been looking into autoencoders and using them to cluster faces, which I thought was pretty cool. Oh yeah. I'm curious how you're picking new topics to explore now that you have a lot more freedom and control over your time. So I think the autoencoder neural network thing is like, I was just mad that I don't understand what's going on with neural networks, like at all. And so I was like, let's do a little fun project to try to understand. Because I read this paper like four or five years ago that was about how to trick neural networks, like how to trick an image recognition algorithm into

Starting point is 00:42:50 thinking a panda is a vulture or something. Like you can take an image of a panda and modify it a little bit to convince it that it's a vulture. And yeah, you just change the pixels a little bit, like by 0.01 or something, and then you can trick it into thinking it's any other animal you want. Anyway, I found that so cool at the time, but I still came away from that really feeling like I didn't understand anything about these neural networks. And I don't know, I think a lot of us want to know what's going on. So just trying to understand, like, you know, percent more for myself. Cool. So the fun

Starting point is 00:43:28 question that we like to ask is, what was the last tool that you discovered and really liked? I think the thing I'm learning right now that I'm really into is Rails, which I'm finding really fun. And it's really different, like usually I like things that are very explicit, where you know where everything is coming from, and Rails is really the opposite of that. But I'm having a lot of fun with it anyway. And it's nice that it's such an old technology, you know? Like I was watching DHH's video about it from, whatever, 12 years ago, like 2008, and I'm like, oh wow, yeah, this is really cool. So what websites are you building? Right now I'm building a website

Starting point is 00:44:14 that gives you a virtual machine, like an AWS instance or something, and it's like, something is wrong on this computer, figure it out. Or it tells you something specific that's wrong, and then you have to figure out what it is. And so it's a pretty simple website, because it's probably more about designing problems to show to people that are fun. But you also need to let people log in, blah blah blah, and manage the instances. So I'm running a website, and I'm using Ruby on Rails, and I'm really having a fun time. Oh, I would say that's actually a really cool idea.

Starting point is 00:44:56 It reminds me of something we used to do at one point, like Wheel of Misfortune, I think Austin came up with that term, where we would break something intentionally on a system and give each other a problem to figure out. Yeah, yeah, that's so fun. It's good when you can do it in a controlled environment. Yeah, exactly, because with so many of these things, it's so stressful to find this stuff out during a production incident, or to learn some of the tools. Like, you don't want to be learning about strace for the first time when the site is down. It sucks. But it can be so fun to actually deal with these problems if there's not that stress. And so I wanted to come up with a more fun environment for learning about some of the stuff that you learn during real incidents. Nice. I'm looking forward to when you release that website. And thank you so much for taking the time, Julia. It has been awesome talking to you. We really appreciate you joining us today. Thanks for having me. This is really fun. Hey, thank you so much for listening to the show. You can subscribe wherever you get your podcasts and learn more about us at softwaremisadventures.com.

Starting point is 00:46:11 You can also write to us at hello at softwaremisadventures.com. We would love to hear from you. Until next time, take care.
