Software Misadventures - Ryan Underwood - On debugging the Linux kernel - #4

Episode Date: February 6, 2021

Ryan Underwood is a Staff SRE and tech lead on the Helix and Zookeeper SRE team at LinkedIn. Prior to LinkedIn, he was an SRE at Machine Zone and Google. Apart from his regular responsibilities, Ryan's interest and expertise include debugging production kernel, I/O and containerization issues. His opinion about not treating software as a black box and his persistent approach to debugging complex problems are truly inspiring.   On several occasions, Ryan's colleagues have leaned on him to solve an esoteric problem that everyone thought was insurmountable. Our main focus today is one such problem that Ryan and team ran into while upgrading machines to 4.x kernel that resulted in elevated 99th percentile latencies. We dive into what the problem was, how it was identified and how it was fixed. We discuss some of the tools and practices that are helpful in debugging system performance issues. And we also talk about Ryan's background and how his curiosity landed him a career in Site Reliability Engineering. Please enjoy this deeply technical and highly educational conversation with Ryan Underwood. Website link: https://softwaremisadventures.com/ryan   Music Credits: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en

Transcript
Starting point is 00:00:00 It's an interesting problem combined with just having a learner's mindset and innate curiosity and not being afraid of peering inside the black box of something. This is just knowledge. It's like cracking open a book off the library shelf. I mean, it's a very different kind of knowledge. It's knowledge that changes on an ongoing basis, because the Linux kernel is a living organism, a moving system, constantly changing on a regular basis. So open mind, curiosity, and check your ego at the door, because you're not going to understand the whole thing, and you don't have to in order to do something impactful. Welcome to the Software Misadventures podcast, where we sit down with software and DevOps
Starting point is 00:00:49 experts to hear their stories from the trenches about how software breaks in production. We are your hosts, Ronak, Austin, and Guang. We've seen firsthand how stressful it is when something breaks in production, but it's the best opportunity to learn about a system more deeply. When most of us started in this field, we didn't really know what to expect, and wish there were more resources on how veteran engineers overcame the daunting task of debugging complex systems. In these conversations, we discuss the principles and practical tips to build resilient software,
Starting point is 00:01:19 as well as advice to grow as technical leaders. Hey everyone, this is Ronak here. In this episode, we speak with Ryan Underwood about debugging distributed systems. I don't know of anyone else who understands the Linux stack better than him. His opinion about not treating software as a black box and his persistent approach to debugging complex problems are truly inspiring. As the team at LinkedIn was upgrading hosts to the 4.x kernel, they observed elevated 99th percentile latencies in some critical applications. We speak with Ryan about what exactly this problem was and how it was fixed. We also discuss some of the tools and practices that are helpful in debugging system performance issues.
Starting point is 00:02:12 Please enjoy this deeply technical and highly educational conversation with Ryan Underwood. Cool. Ryan, welcome to the show. To get us started, tell us about your background. I'm especially curious about what got you excited about computers. What got me excited in computers was finding a computer that I did not know was a computer, which my dad had brought home when I was something like four, five years old. And I actually thought it was a VCR, because I was familiar with VCRs at the time. I'm like, oh, here's this box that appears similar in size and shape and similar materials composition
Starting point is 00:03:07 as this thing that ingests video cassettes and, you know, puts colorful sounds and pictures on the screen for me. So anyway, over a period of weeks or months, my dad set it up and started showing me how to use it. We had some books and magazines that I could type BASIC programs in from, you know, using DOS, and just all the things we did back in the day. Of course, I was in it for the games. It was all about the games, and the machinery itself was just, you know, a means to the end of playing cool games. I had a neighbor kid who lived across the street, who's now a special effects director in Hollywood, whose dad worked for IBM. And so his dad was
Starting point is 00:03:57 always bringing home the goods, and, you know, the neighbor kid always had the latest and greatest: an IBM PCjr, and then a 286, and then a 486, leapfrogging whatever I had at the time. So I was always jealous, but he also had all the cool games, and we just swapped disks back and forth. So from there to writing: I mean, I was doing DOS batch programming on floppy disks, just automating the startup of games and things like that. Just very simple things. And then I got into the BBS
Starting point is 00:04:31 world in 1994, I was 13, and it was all over from there. But then I started getting into, you know, running my own BBS. I wrote some Pascal programs to automate some just grungy things, converting configuration files from the BBS format to the format that these, of course, again, games would use externally, so that people could dial in and play things like Trade Wars and Legend of the Red Dragon and things like that, if anyone remembers. So, yeah, all about the games and the files and the wide world of very odd things being communicated in text files that were swapped around on BBSs, things that were very much outside my insular, suburban, Midwest kid upbringing. And something, you know, my parents would have been very upset about if they knew what I was getting up to. But, you know, in high school, we had a programming class,
Starting point is 00:05:32 a formal programming class, with actual compact Macs, black-and-white-screen computers, around the room. And so I did some Pascal in that class. And then, you know, at the same time, my hobby was, like, copying disks and CDs. So I just figured out how to disassemble programs, look for the syscalls in DOS and Windows that identified whether you were running on an original disk or an original CD, and just hack those out, because I wanted to copy them to my hard disk and play them without having these stupid disks and CDs around. I mean, come on, like everything should be in the cloud, you
Starting point is 00:06:08 know, in front of me. So, you know, I learned... basically, I went up and down the stack that way, just being a hacker, you know, bored kid, misfit, outcast, everything that a lot of us are familiar with. But then I went off to engineering school in Missouri, started as an electrical engineering major, and ditched that after two years. Flunked out, family problems, divorce, nasty stuff. Then I was admitted back into the CS program, finished my master's in CS,
Starting point is 00:06:47 did a lot of open source hacking. That's the period when I actually got into kernel hacking, because I had been using Linux since 1997. I was actually one of the users of ZipSlack, which, if anybody remembers ZipSlack, was a Slackware Linux distribution that you could unzip onto an MS-DOS partition and, you know, basically boot from just by running a batch file. And so that was me dipping my toes into the Linux world. And over time, I just migrated the things I was doing to Linux, because I just found Linux more expressive and open for the kinds of things I wanted to do with computers, and I very much liked the collaborative and, you know, sometimes opinionated, hopefully not too opinionated, nature
Starting point is 00:07:28 of open source development. And so then I wanted my hardware that worked so well on Windows to work on Linux. And it frustrated me when it didn't. It frustrated me to the point where I spent late nights with a cup of tea next to me, realizing the sun was coming up because I had been up all night digging into the driver for this thing and figuring out what registers Windows was flipping that the Linux driver was not, why this piece of hardware wasn't working or wasn't working optimally in some way. And so I just did that kind of stuff until I graduated. And then I graduated, did the normal software engineering stuff, embedded Linux, real-time simulation systems. Actually, in my job at the flight simulator company, I had to do a lot of reverse engineering, which was suited to my
Starting point is 00:08:16 background, both on the network and of physical devices. So that was pretty cool. And then Google came and hired me for the stuff I was doing as a hobby, not for what my actual career trajectory was, which was straight up software engineering. And so I became an SRE. And since then, I've been doing SRE things and helping out on that side. And the history of my tenure at LinkedIn is that of SRE. I'm sure we'll talk about things that will illuminate why my twisty background was able to create a contribution in this particular area. Nice. I remember back in 1997 when you were getting into Linux,
Starting point is 00:09:00 I was being tied to a piano chair, so I'm not at all jealous. So yeah, having been at LinkedIn as an SRE for the past few years, what have you been working on? What kind of work do you do today? Well, when I joined LinkedIn, I joined Ben Ferguson's team, which was at the time the Tools SRE team. It was a new team, a team that was explicitly created with the intent of providing an engineering approach to operational problems, as in the core definition of SRE, but also specifically evangelizing the idea of operational awareness in the Foundation organization, which was developing all of our internal tooling at the time. And, you know, they were very focused on iterating and, you know, moving fast and breaking things. And we realized that there was a business
Starting point is 00:09:58 need around not breaking things as much while still being able to move fast. And so Tools SRE, we explicitly helped with that. We were able to make many, many concrete improvements up and down the stack, from the source of truth for topology, to the deployment machinery, to the private cloud, fixing some nasty containerization bugs and operational problems with the services that comprise the private cloud. So yeah, that was what I did for the first four years. And from Tools SRE, I moved into LPS SRE, which is the team that owns and operates LinkedIn's
Starting point is 00:10:43 private cloud. And then after that, I moved to the Helix and Zookeeper SRE team, which was a shift, a very interesting shift, because moving into that team, I knew virtually nothing about Zookeeper aside from a purely conceptual level and where Zookeeper fits in the ecosystem of distributed service components implementing distributed system fundamentals. And in a matter of, actually, less than two months, using my background from four years of working with
Starting point is 00:11:19 Foundation tooling, I was able to create a framework that allowed us to measure the availability of all Zookeepers at LinkedIn. And so I was able to stand on the shoulders of giants and deliver something that was very important to the company. And since then, we've been using that availability measurement tooling to solve one problem after another in the Zookeeper ecosystem, problems that were impacting either LinkedIn's customers, in terms of site users, or at a minimum impacting engineers at LinkedIn who are just trying to deliver new products and features. A lot of things that you say, Ryan, which are normal software engineering or normal infrastructure engineering, at least in
Starting point is 00:12:05 my perspective, they don't sound simple at all, especially your background and the things that you did. You mentioned games quite a bit as you were growing up. Do you still play games? I actually have not sat down and played a game in a long time. I watch more videos on YouTube of other people playing games, wishing that I could go back in time to that life where I felt like I had enough time to be playing games so much. But yeah, there was a time in my life where games and files were the goal, and every means was a means to that end. But that's how I learned. That's how I learned the things that I do today. I have fond memories of games like Privateer; that was one that was very near and dear to my heart.
Starting point is 00:12:56 I spent a lot of time on MUDs, if anybody remembers those. These are sort of the predecessor to today's MMOs. And if anyone remembers EverQuest, that was more or less the first graphical MUD that was massively multiplayer. Before that, there was Ultima Online, which was smaller in scale and more themed around the Ultima universe. But MUDs were just a text-based, basically, multiverse that people could join over the internet in the early days of the internet, and interact with other people in real time in a textual shared space. Sort of akin to the same kind of metaphor that IRC used, or that Slack uses today, kind of this shared space where people interact via text and emoting and things like that. But for me, I mean, MUDs were where that really started.
Starting point is 00:13:51 And so I was into those. And that's a whole other story, all the shenanigans and network hopping that we did to get around, you know, net blocks and all those kinds of things. Fun times. Nice. Well, you definitely have a fascinating background. And knowing you personally... like, this is something our listeners wouldn't know, but when I joined LinkedIn, I was on the same team as Ryan. And before I met you, Ryan, I heard about you from all the other team members. And one thing that I heard consistently from everyone was, if I had a problem or a tricky situation that I was dealing with, either with Linux or distributed systems, and I couldn't find the answer via Google or Stack Overflow, I should just go and ask Ryan. And that has been consistently true. And we'll dive deep into one of those examples
Starting point is 00:14:42 today. But before we jump there: everyone who works with you knows about this skill set, but it's rarely known to people who don't work with you. How did you get so good with memes? That was, um, that was a required skill at Google. Basically, you know, becoming a meme master, or at least walking the path towards meme masterhood, is just... if anybody has worked at Google, they would understand. And if they haven't worked at Google, you'll just have to take my word for it. Well, we'll take your word for it then. And like you mentioned, your knowledge spreads
Starting point is 00:15:28 in multiple dimensions, both in breadth and depth. And one of the things that we wanted to touch on today is the blog post that you recently published that talks about a very interesting and tricky situation
Starting point is 00:15:40 that you encountered in Linux system performance. Before we dive into exactly what the issue was: in your blog post, you mentioned re-imaging machines and an OS upgrade initiative at LinkedIn. Can you tell us a little bit about that? Right. Well, I think you could tell us a little more about that. But the short version of the story is that basically the decision was made to get off of Red Hat Enterprise Linux 7 for various business reasons, cost savings reasons, synergy with Microsoft, and those sorts of things. Nothing against Red Hat specifically. It's just that we had, as an organization, evolved past essentially what their support model was able to provide for us with the kinds of things we were doing.
Starting point is 00:16:29 So the decision was made to move to a CentOS platform image using Microsoft's kernel from Azure, which was perfectly reasonable, because Microsoft hires a lot of people to work specifically on the kernel. And so why should LinkedIn duplicate that effort if we don't need to? Or be beholden to an external vendor who has their own roadmap, their own priorities that aren't necessarily aligned with ours, and over which we don't have a lot of influence? So that decision was made by SRE exec. And the part that, to my understanding, LPS SRE specifically undertook was to re-image the general Rain pools. This is akin to what most people would think of when they just request an instance on AWS: they have no idea, other than the region it's in, which specific hardware they're going to be getting, or what the characteristics of that hardware are, other than that it meets a specific
Starting point is 00:17:40 performance class that Amazon advertises. And so, as opposed to the other pools at LinkedIn, which are application-specific, this is the general pool, which is sort of the default pool that anybody who wants to deploy a random application lands on, right? Yeah, it makes sense. So this is like the multi-tenant cloud, where you ask for a compute resource and you get it through an API call, without worrying about exactly which host you get. So the OS upgrade initiative involved upgrading the set of machines which are on the multi-tenant cloud. One thing that you also refer to in the blog post is various noisy neighbor problems in multi-tenant environments in general. Tell us a little bit about some of these noisy neighbor problems that people usually hit in these kinds of situations. Oh, yeah. So, well, one noisy neighbor problem that we had initially in
Starting point is 00:18:37 Rain was simply that of swap utilization competing for disk IO. So disk IO availability... you can always push IO requests through to a disk; the only question is the latency of those IO requests being serviced. And so if you start filling up the queue on the disk, you're just waiting for stuff to be streamed out to disk. So if you have, for example, multiple applications that are logging at a very high rate, or multiple applications which are making memory allocations which are being satisfied either partially or fully through swap allocations, or through pushing inactive pages out to swap in order to avoid a low-memory situation, those will create noisy neighbor situations, because a neighbor who is not creating the resource burden is unfairly impacted in terms
Starting point is 00:19:38 of their own latency, because some other process on the host is creating a resource burden. Other examples of this could be contention for NIC bandwidth on, for example, a one-gig NIC host. So sometimes you have to move to a 10-gig NIC host to get more bandwidth. Other things would be, like, memory bandwidth itself. It's a global resource that can be used unfairly by one actor. We saw that with... actually, we didn't see this on Rain necessarily, but we saw this on Espresso nodes, where one process would be doing Java garbage collection and actually starve the rest of the system of memory bandwidth
Starting point is 00:20:23 because there were ECC errors that were reducing the amount of memory bandwidth available to the cores on that system. And so the global resource was exhausted. Another classic one is cache thrashing. So when processes are not pinned to cores, when they're free to schedule on any core, the working set for one process has different cache semantics than the working set for another process. I mean, not just different cache semantics, but the working set itself is different.
Starting point is 00:20:58 So the data and code caches in the CPU are going to be nuked every time that process is swapped out and having to be basically reloaded by going to main memory. And so that creates a slow start issue for whenever that task is scheduled. So a lot of these can be kind of bucketed into, in a general term, what we call thrashing issues, where thrashing sort of indicates that somebody who wants to use resources is experiencing high latency because somebody else has caused that resource to be used in such a way that is causing the one who wants to use it to be cued to have to wait on some physical characteristic, some physical
Starting point is 00:21:46 limitation of some kind. Okay. And these are interesting noisy neighbor problems. And as you mentioned in the blog post that as you were moving the fleet to CentOS in this multi-tenant environment, there was a problem that you identified, but it wasn't a noisy neighbor issue. Can you tell us more about that? Right. And the problem when it was brought to me was brought to me as an IO weight issue, that it was simply that on machines running the newer kernel, as compared to the Red Hat Enterprise Linux 7 kernel, which was a 3.x kernel. We had this newer 4.x kernel. And we were seeing that when multiple processes were doing sequential I.O. simultaneously, such as artifacts being downloaded for deployment or logging and, you know, with multiple concurrent
Starting point is 00:22:47 Okay. And these are interesting noisy neighbor problems. And as you mentioned in the blog post, as you were moving the fleet to CentOS in this multi-tenant environment, there was a problem that you identified, but it wasn't a noisy neighbor issue. Can you tell us more about that? Right. The problem, when it was brought to me, was brought to me as an IO wait issue. It was simply that on machines running the newer 4.x kernel, as compared to the Red Hat Enterprise Linux 7 kernel, which was a 3.x kernel, we were seeing that when multiple processes were doing sequential IO simultaneously, such as artifacts being downloaded for deployment, or logging with multiple concurrent streams, IO wait was going up on the host very much not in proportion to the IO wait that would have been generated on the same host with the 3.x kernel from Red Hat. So the noisy neighbor issue in that sense was symptomatic rather than fundamental. It was a symptomatic noisy neighbor issue in the sense that one process was able to create IO wait that was causing another process to be stuck in the run queue simply because it was doing IO to the disk, and not because the disk itself had a fundamental hardware limitation of any kind that justified making a process wait. It seemed like there was something in the software that was doing this, and, you know, as it turned out, we were right about that.
Starting point is 00:23:45 And we'll talk more about that, I'm sure. Yeah. So in this case, the effect was that an application would experience higher 99th percentile latencies if another application was deployed on the same physical server. And it wasn't because of any limitations on the hardware side, like you mentioned, but it was something in the software that was limiting the performance. That's right. And my first suspicion was that it was going to be something related to mutual exclusion, or locking. And I looked at the usual things I would look at for debugging that sort of problem, which would be perf top, to look for spinlock utilization. Given that there was virtually no system CPU, no kernel time disproportionate to what we would have seen on the 3.x kernel,
Starting point is 00:24:48 it didn't seem like there was any kind of spinlock meltdown or anything like that going on. So then the other question was: is it possible that we're just waiting around for some lock that's being inefficient, or something like that? And so to look into that, I used the magic SysRq; I think it's SysRq-L. We don't actually have console access to our servers, but I triggered it using procfs. And what that does is it dumps the stack of all the cores on the host to the kernel message buffer, so that you can see a snapshot at any point in time. And by the way, this is only helpful on SMP hosts, because on a single-processor host, the stack for the single core in that host is always going to be the stack of the kernel function that actually prints that thing out to the kernel
Starting point is 00:25:44 message buffer. So it's not that helpful. But when you have 24 cores, you have 23 cores that are presumably doing other things. And so often you can see that these cores are just stacking up, kind of waiting on the same mutual exclusion primitive, and you can kind of figure out what went wrong that way. And in this case, I did not see anything like that. It was just slow. It was very strange. And the way this problem got surfaced is that these applications started seeing higher latencies when they were deployed on the newer kernel. And that's also how you started looking into the problem and dissecting it, and seeing that it performs perfectly fine on the older version, but not necessarily on the newer one. And did that
Starting point is 00:26:26 lead you down the path of thinking it has something to do with the kernel, and to the investigation you did with SysRq and other tools? Well, the thing that led me to believe it had something to do with the kernel was simply that the kernel was the main thing that had changed. I mean, user space changed in CentOS as well, but user space changes don't generally produce this massive increase in IO wait, because IO wait is the kernel telling a process it cannot be scheduled at the moment because it's waiting on a shared underlying resource, whether it's a block device or some other device that is consuming IO of some kind. The kernel is telling that process to sleep because it can't do its work in the meantime, so that some other process, one which can do work other than IO to a device that's currently busy, can run. So those things together... I mean, my intuition basically bisected this away from a user space problem and into a kernel problem that way. That's fascinating.
Starting point is 00:27:45 One tool that you also mentioned in your arsenal in the blog is ATOP. Tell us more about what ATOP does. Yeah, ATOP is a great tool, something that I learned to use effectively at Google because we used it to debug problems on Borg hosts, especially noisy neighbor type of problems and so ATOP is a tool that allows you to at a glance see essentially everything that is important about your system from a resource utilization perspective many people stop at top because they're you
Starting point is 00:28:23 know they see okay my my cores are doing this. I have so much memory that's RSS, so much memory that's virtual allocation. I can see which process is at the top that's busy and so on. But ATOP gives you much more of a picture of what's going on because not only can it use the process accounting mechanism that's built into the kernel to accurately attribute disk utilization to specific processes which used it, even ephemeral processes which had disappeared in between sampling intervals, which is incredibly important when you have, for example, forking model type processes you're trying to debug like we do in our Python ecosystem. But also it gives you everything at one glance. Virtual memory, it tells you when I'm having to, when I'm undergoing free page scans because
Starting point is 00:29:21 I'm having such memory pressure on the system. Swap activity is all there, page fault activity. Basically, it has several different views, depending on what you want to look at. Do I want to look at what looks like a primarily disk IO problem? I can hit D and have a view that gives me those things that would be interesting in that context. I can hit P and get things that are interesting in the context of, well, I seem to be having a compute problem, in the sense that this system
Starting point is 00:29:52 is not stacking up performance-wise, what's going on? And then just the general view, the G view, is a great overview of both of those things. So it's just a great scope into a system. I mean, it's my go-to tool. I don't bother with anything else, really; it's the first thing I go to. And it has a nice daemonized mode also, that allows it to record all this stuff historically, so that you can go back in time and see everything, even the process accounting aspect for attributing disk IO. Like, if I saw that a host was just slammed with IO at some point in time, and I was running ATOP in daemon mode, I could very easily just go back in time and see, oh, what was happening on that host at that time? And, oh,
Starting point is 00:30:40 it looks like this automation process kicked off a bunch of children that were all trying to read a gigantic file and write it out at the same time. And then I know exactly what happened. That sounds like a very useful tool. And I know I hadn't heard of ATOP until I met you, and I heard about ATOP from you. We'll definitely add a link to this tool in our show notes when we publish the episode. And in general, this tool sounds like a very useful thing for containerized environments specifically, where you have a bunch of processes using shared resources.
Starting point is 00:31:19 Now that you knew that there was this IO performance issue as you were upgrading these machines, you had symptoms from the applications running on the newer kernel, and through your intuition you identified that this problem had more to do with the kernel side. How did you proceed next? Yeah, the next thing I looked at was, because 4.x has BlockMQ, and BlockMQ has some different IO schedulers, I tried playing around with the IO schedulers to see if one IO scheduler or the other made a difference. I also, at least for a minute, not too much longer, thought that maybe disk IO quotas had possibly been introduced on the CFQ side. And so I investigated that, because that could account for IO blockage happening. But none of that turned anything up.
Starting point is 00:32:16 And actually, it was kind of by a stroke of luck, while I was looking into those things, that one of the engineers in Pi noticed that there was an issue on Red Hat's Bugzilla about a... actually, yes, it was on Red Hat's internal support ticket system: somebody else was complaining that they had an IO regression in RHEL 8, which has a 4.x kernel, and they thought it was related to a public upstream kernel Bugzilla bug, where people had SATA SSDs, well, SSDs or HDDs in this case,
Starting point is 00:32:59 and they were having basically a massive penalty in the same type of situation that we had, which was sequential IO using BlockMQ. So that kind of led me in the direction that something was going on with BlockMQ. And Jens Axboe had a very simple patch related to the scalable bitmap code in BlockMQ,
Starting point is 00:33:31 that is, a scalable bitmap. You can read a lot more about this online, but basically a scalable bitmap is a mechanism that allows IOs to be submitted efficiently across multiple cores without having a logjam on any one particular core. And many of the advanced IO schedulers, you know, built on BlockMQ, are leveraging this scalable bitmap primitive.
Starting point is 00:34:01 But, you know, it was a very simple patch that reduced the number of queues in a scalable bitmap. And it actually produced an improvement. It didn't fix the problem, but it produced an improvement. And so seeing improvement with that was what gave me, intuitively, a suspicion that this was the right direction to go. So based on that, I thought, well, you know, we have this platform image that is different, but I think the kernel is the problem. So what would happen if I installed a 3.x kernel on one of these CentOS hosts and redid the benchmark? And it was the realization from there that, using the same kernel config,
Starting point is 00:34:46 by using a 3.x upstream kernel, the problem did not appear. Then it was like, okay, well, this is interesting, because I have a good kernel, which is a 3.x using the same config from the 4.x kernel and using BlockMQ, and we don't have this problem; and I have a 4.x kernel, which is bad. And from there, I just went straight into... okay, you know, you just roll up your sleeves and bisect at that point, because you know the problem is somewhere between T and T plus one. So bisecting is something that I definitely want to get to. Before we do that, can you briefly describe for our listeners what BlockMQ is? So BlockMQ is the block IO layer for the Linux kernel.
Starting point is 00:35:36 This is the layer that every block device, every hard disk, every flash memory device, anything that is seekable, durable storage, registers with, in order for IO requests to be forwarded, either from file system drivers or from raw block device IO being done with dd or things like that. Those IO requests go through the block layer. They are combined and aggregated in a very simple way, the name of which escapes me at the moment. And then the block layer forwards the aggregated request to the IO scheduler,
Starting point is 00:36:26 which then schedules it based on whatever the semantics of the IO scheduler are, which are often semantics attuned to a specific underlying type of block device, taking into account the physical characteristics of that block device. The elevator IO scheduler algorithm, for example, is one that is specifically designed for disk devices that have a seek penalty. But BlockMQ is the replacement for the legacy block IO layer. And BlockMQ is called BlockMQ because MQ stands for multi-queue. And the idea is that, instead of having one queue that is not scalable, because there's one lock around this queue, where if an IO is taken off the queue or an IO is being put back on the queue, whichever core is doing that operation has to have
Starting point is 00:37:21 the lock for that queue... And of course, if you have 20 cores on a host... like, we got into a world where, back in the eighties and nineties, it was the PC versus vertical scalability and the giant mainframes and things like that. And then, with Google and massively distributed systems, we got into the world where we were doing more horizontal scalability, and vertical scalability kind of fell by the wayside as all the big iron Unix vendors went out of business, one by one, in this new world. And then around 2000... between 2006 and 2010,
Starting point is 00:38:00 with the Core architecture, and AMD's Bulldozer and the subsequent architectures, we realized that we were hitting a limit with core scalability, in terms of the ability to scale clock speeds infinitely upwards and the ability to scale instructions per cycle, IPC, upwards. And so we needed to increase the parallelism of cores in a host. And so we've moved back into this world of vertical scalability as a result. And because we've moved back into this world of vertical scalability, we have, of course, parallelism becoming front and center again in what we're doing on a specific host. And so that's why the legacy Linux block IO layer
Starting point is 00:38:51 became a bottleneck. Because with this increasing parallelism... at first it was two cores, four cores, six cores, and then it became 8, 12, 16, 24 cores per CPU package. And especially with hyper-threading enabled, which increases parallelism without necessarily increasing the functional unit capacity of a CPU... that again increases the ability to submit IOs at a very rapid rate. And if I'm doing that, then my IO queue itself becomes the bottleneck, if my IO queue is one entity
Starting point is 00:39:30 which has to be managed in sequential fashion by a single core. And so BlockMQ solved that problem by creating basically per-core IO queues, such that an IO that needs to be submitted to a block device can be added to any IO queue on any core which is idle or isn't currently holding a lock for its IO queue. So it massively increases the potential parallelism of IO submission. And this became very important also, not just for... I mean, it wouldn't necessarily be a problem if you had
Starting point is 00:40:07 24 cores all submitting IO to, you know, a hard disk, because the hard disk has a single command queue. I mean, it can do some reordering, but ultimately it's kind of a FIFO piece of hardware. You just stuff commands into it, it executes those commands, returns the results, and then the queue drains and you can stuff more stuff into the queue. But multi-queue hardware is the main impetus for BlockMQ: so that I can have, for example, an NVMe device that can have, I think, up to 64 or 128 independent hardware queues. Some of this might be virtualized in the hardware.
Starting point is 00:40:46 I mean, who knows what's really going on? But at least in terms of the logical queues that it presents to the host, many, many more queues are available. And so if I have only one IO queue for the host to submit IOs to, I have not just that contention problem in the IO queue, but also I'm being inefficient
Starting point is 00:41:02 in terms of maximizing the hardware utilization. So, to increase utilization on devices with parallel hardware queues, BlockMQ is the solution for that too, because if I have a queue per core, and I have all cores submitting IO at the same time, I'm doing the best I can in terms of getting IO out to the hardware, because I couldn't do any better without more cores. I hope that makes sense. Yeah, thanks for explaining that. Coming back to the bisecting: you mentioned you saw different outcomes in the T and T plus one versions of the kernel. I was just looking at the Linux kernel repo on GitHub; it has a million commits. So between T and T plus one, I can just imagine there being thousands of commits. So how do you go about bisecting
Starting point is 00:41:52 something like that? Yeah, well, I mean, the great thing is you can actually have multiple problems in those millions of commits, too; that makes your day even more fun. But yeah, long story short: one of the great tools that Git has built in is the bisect tool. And the bisect tool is just a workflow. Once you've entered this workflow, your HEAD pointer will be pointing to a specific commit on the bisect branch that's created for this workflow, and you give Git a verdict for each commit. So I start bisecting: I tell Git what my bad revision is, and I tell it what my good revision is. And it expects that my good revision is older and my bad revision is newer. But you can also invert those things. I mean, it doesn't matter.
Starting point is 00:42:49 All Git wants to know is: what's the starting commit where we're starting the bisect, and what's the ending commit where we're ending the bisect? It just picks the midpoint between those commits, sets HEAD to point to that commit, and then expects you to do testing at that commit and give it feedback on whether that commit is good or bad. And then, so, you have your two ends, you're at the midpoint,
Starting point is 00:43:17 you've bisected, and you've said, okay, well, this one is bad too. So Git knows that, oh, okay, we need to go further back in time to figure out where this was good. And so it bisects again, takes the midpoint between those two commits, and so on, until you get down to it eventually. I mean, it's a logarithmic function for the number of steps that you need to bisect. So if you had a million commits, it's going to be log base two of
Starting point is 00:43:41 a million for the number of steps that you're going to have to do, unless you're really lucky. But yeah, once you've gotten down to the actual commit that's bad, then you can figure out what to do; it doesn't prescribe what you do with that. It just tells you which commit flipped the state of the system from good to bad, and then expects you to fix it from there. So do you know how many steps you had to go through? I think I calculated at one point that it was something like 17 steps. Okay, not too bad. But when you're bisecting and identifying these commits, are you also testing the kernel at those commits, and with different patches, to see if it solves the problem? Well, yes. So the bad kernel that I had was the 4.19 kernel.
Starting point is 00:44:38 And the good kernel that I had... I realized that, like, 4.3 or something was good. So I knew that the problem was somewhere between those. And in between, each time I would bisect, I would tell Git what the result was, and then I would nuke the tree with make mrproper and copy in the config, so I'm starting from a clean slate, and then, you know, do a parallel build of the kernel, which took three to four minutes. I mean, these are fast 24-core machines, so it's not that slow,
Starting point is 00:45:19 and then built, you know, just using the kernel's own scaffolding for building a kernel RPM, installed the RPM on the host, and then told GRUB... grub2-reboot, you know, will tell it to boot a specific kernel next. And then just rebooted the host. And of course, these hosts are in data centers. They're headless hosts. So every time I reboot, it's kind of a leap of faith that we're going to get through the firmware, that the kernel is not going to be bad. And there were a few bad intermediate kernels that
Starting point is 00:45:52 failed in interesting ways. And so, you know, we had to work with the sysops team and the data center team to recover the host sometimes. And sometimes my day would just be over on a specific host, and I'd have to go find another host to continue on. And so I'd have to copy all my stuff to that new host and get started with the workflow again. But yeah, I mean, it was just a lot of steps of that. I mean, if there were 17 steps, two to the 17th is what, 128K or something? I mean, you said a million, but it's not that far off; probably 100K-plus commits in between where we were. So... You say that
Starting point is 00:46:32 extremely lightly. That's a lot of commits to bisect. To be honest, from my perspective, that sounds extremely daunting to even get started with. But, you know, in computer science we have a general approach to problems, divide and conquer, that exploits a logarithmic reduction in the time taken to do something when you take a divide and conquer strategy. And so bisecting is just another form of doing that. And frankly, it could have been automated. It's just that, when I did the calculation, I realized that it was only going to take 17 steps at worst to figure this out. I just made a spot call that I would burn some time on it rather than building automation around it that I didn't really see another direct use case for. So, yeah.
Starting point is 00:47:17 And I would also say that the divide and conquer approach that you mentioned is super helpful not just when you're bisecting commits, but also when you're just trying to evaluate various possibilities while debugging something. Right. Well, divide and conquer in algorithmic terms is, of course, a separate concept from fault isolation in systems engineering terms. But they do have overlap conceptually, in that you want to narrow down the amount of work that's done by constraining the working set in some way. And so fault isolation is sort of the divide and conquer approach being applied to systems engineering. So you're right about that. That's a good observation. And as you're checking these different commits
Starting point is 00:48:05 and different kernel versions, you mentioned that you used something like an FIO test to check if a specific version of the kernel was good. Can you tell us more about that? Yeah, FIO is a tool that, I believe, was developed, or at least is maintained, by Jens also. I mean, it's distributed on kernel.org. So it's a core regression testing tool for file system and block device driver maintainers. But FIO basically just gives you a suite of tests
Starting point is 00:48:43 that you can sort of slice and dice, mix and match. You can determine the parallelism; you can determine the test semantics, whether you're doing sequential IO, random IO, direct IO, buffered IO, all those sorts of things; and how long the test is going to take. It's a very nice tool. And the SysEng team, while I was working on these other things, had put together an FIO test that gave a clear signal between a host that was fine, at least for the application team's purposes, and a host that was problematic; this is what made bisecting the kernel very straightforward in the end. And so using that test, it was very easy to very quickly figure out whether a particular kernel build was good or bad. I mean, the signal was crystal clear, night and day. And this FIO test also significantly reduced the time to test out these different kernels,
Starting point is 00:49:51 because deploying an application and just testing for its 99th percentile latency, again, sounds very expensive in general. Right, exactly. And we avoided that whole closed loop. Additionally, the JVM that we're using requires a kernel module to be built and available at runtime, and getting that kernel module built was not necessarily a straightforward task. That was something that would have had to be done on every intermediate kernel bisected. Being able to use the FIO test as a proxy for the application IO experience very much shortened the cycle time. So, you know, it was crucial.
Starting point is 00:50:54 And after you were doing these tests and getting these signals, how did you end up identifying the root cause? And what was the root cause? Well, the root cause was two different patches. I don't recall exactly what they were. One was related to security: if a file was erased, ensuring that no data from that file could be leaked out to somebody who subsequently did an mmap allocation, got the same disk block, and was able to see the data that was left over from the previous file on it. And the other was related to... anyway, I can't remember exactly, but both of them were ext4 patches, and neither of them was at all obviously related to the issue at hand. They both looked completely orthogonal. But when it came down to it, I checked and double-checked, and in both cases, applying either of these patches caused the IO latency to
Starting point is 00:52:00 reappear, and with both of them removed, it went away, even in the 4.19 kernel. Which was interesting, because a lot of the internal kernel APIs had changed by that point, so I had to actually hand-backport, or hand-forward-port, the revert of one of those, because it did not revert cleanly on 4.19. So fortunately that worked, and I'm sure everybody's comforted to know that my hand-rolled code is a component of our ext4 driver on tens of thousands of hosts in our data centers running our production stack. So yeah, that sounds fascinating. And the fact that you were able to identify the problem and actually fix it is just amazing. You identified a lot of lessons learned in this process. If you had to pick the top two? One that I think we've internalized as an organization is
Starting point is 00:53:05 that we need to have formalized regression testing around platform images and kernels. And fortunately, that is something that is so obvious that it was immediately funded, and that's an ongoing project internally, to put that automated regression testing in place so we never run into something like this again. The other thing that was probably an interesting learning is just how well we can work together as an organization when we have a shared, very clear objective to rally around, and where the individual IC engineers are able to have their plates cleared and focus down on the problem. An organization that would have been less effective
Starting point is 00:54:00 would have kind of acknowledged that this was a problem and allowed people to freelance on it in their spare time, but not actually committed resources to tackling it, such that the ICs that ultimately contributed the key building blocks of the solution had the laser-focused time that they needed to really get there. And, you know, it's not that you have to focus on it a hundred percent of the time from beginning to end. Sometimes you just have some downtime. I mean, downtime for me was bisecting the kernel. It's just repetitive tasks that you have to grin and bear, because you know that there's a pot
Starting point is 00:54:45 of gold at the end of it, once you get to the end. But then, once you get to that point, once you have that additional nugget of information that gives you what you need to move forward to the next step, then again, managers have to agree that resources are going to be dedicated to this problem, such that you're not being pulled off of it for on-call, for other project work, for other interrupts, for tickets, for ad hoc asks. Whatever these things are, managers need to protect their ICs so they can really get impactful, tactical work like this done. So I wasn't sure beforehand... I mean, if you had asked me how well we would have functioned getting this done, I wouldn't have been sure what to tell you.
Starting point is 00:55:37 But I was impressed that, especially given the unusual situation that we were in this year, people were really able to clear their plates and roll up their sleeves. Nice, nice, nice. So taking a step back now: it sounds like you became an expert at the Linux kernel almost unintentionally, by sort of hacking
Starting point is 00:55:59 to scratch your own itch. Do you have any advice for our listeners who want to maybe more intentionally get a better understanding of how the Linux kernel works? Yeah. So the first thing I would do is take issue
Starting point is 00:56:18 with the phrase "expert," because there are very few experts on the Linux kernel. The Linux kernel is huge. It's one of the biggest software projects that has ever existed, with the most contributors to a single code base in history. So, I mean, nobody's an expert. You have to adopt a learner's mindset and always have that learner's hat on, because there's always something that you're... it's kind of like dissecting a frog in biology class or something like that. You kind of know what the outside of the frog looks like.
Starting point is 00:57:03 You kind of know what the inside is going to look like, but you don't know what it's really going to look like until you've actually cut it open and looked inside. And then, for each one of the components that are inside it, well, you kind of know what they look like, because you're looking at them, but you don't really know what it looks like on the inside, or what it does, until you've cut open that heart, or that lung, or that liver, or whatever it is. And so the Linux kernel is a lot like that. I mean, you can describe it in broad swaths.
Starting point is 00:57:31 You can kind of know, in general terms, what different components of it do: the VM subsystem, the block IO subsystem, what different drivers do, the driver frameworks, the kernel itself, the scheduler, the platform-specific code. You can know what those things do in general terms, but you don't really know until you've actually dug into it and had a good problem in front of you to incentivize digging into it. I didn't know how page reclaim worked in the kernel until I had an actual problem in front of me on LinkedIn's private cloud, having to debug and ultimately mitigate this containerization problem that we were having. And so, by having this problem in front of me and being willing to just crack open
Starting point is 00:58:20 the source code and read it without being intimidated by it... So I would say, to answer that question: it's an interesting problem, combined with just having a learner's mindset and innate curiosity, and not being afraid of peering inside the black box of something. It's not going to hurt anything. This is just knowledge. It's like cracking open a book off the library shelf. I mean, it's a very different kind of knowledge. It's knowledge that changes on an ongoing basis, because the Linux kernel is a living organism. It's a moving system. It's constantly changing, undergoing very, very large change on a regular basis. But some things are conserved relatively well; some structures are conserved even while other
Starting point is 00:59:14 structures are evolving fast. So, you know, open mind, curiosity, and, you know, check your ego at the door, because you're not going to understand the whole thing, and you don't have to in order to do something impactful. I was very motivated, you know, by your talk, but then you kind of went with the dissecting-the-frog bit, which reminded me of some painful memories. But coming back: you've already mentioned how useful ATOP is. But that aside, what was the last tool that you discovered and really liked? The last tool that I discovered?
Starting point is 01:00:00 Very good question. Or is it hashtag ATOP-for-life? I don't have a specific tool in mind that I've just recently discovered. JXRay is one that I've been using for analyzing Java heap dumps. And that one is developed by an engineer here at LinkedIn. So, I mean, that's a cool tool. It tells you exactly what's going on with your Java heap, and if you're in garbage collection, what's triggering it. So, a little bit off topic from where we were,
Starting point is 01:00:38 but that's a recent tool that I've run into. Awesome. Well, that's it for us. Where can people find you on the internet? They can find me on GitHub or on LinkedIn. I do have a LinkedIn profile. Nice. And anything else you'd like to share with our listeners? I would just say that the more people there are that are curious enough about the kernel to be willing to learn some operating system fundamentals, to work using tools like ATOP or iotop, or even vmstat and things like that, more classic tools...
Starting point is 01:01:32 The more people that are willing to dig into these things, the better off we are as an industry, because we break down the kind of wall of isolation between the pure SWE mindset, which is kind of a world of abstractions and algorithms, and the pure systems mindset, which is one of bare-metal machinery and making things run smoothly in the operational sense, and bring those two perspectives together into more of a holistic mindset.
Starting point is 01:02:06 And I think we're all better engineers when we merge those two things. And we do better things for the industry when we do. Absolutely. It is always a pleasure to talk to you, Ryan. And whenever we do, I learn a lot. And today was no different. Thank you so much for joining us today. It's been a pleasure.
Starting point is 01:02:25 Thanks, Ronak. Thanks, Guang, Austin. Appreciate your time. Hey, thank you so much for listening to the show. You can subscribe wherever you get your podcasts and learn more about us at softwaremisadventures.com. You can also write to us at hello at softwaremisadventures.com. We would love to hear from you. Until next time, take care.
