Software Misadventures - Ryan Underwood - On debugging the Linux kernel - #4
Episode Date: February 6, 2021. Ryan Underwood is a Staff SRE and tech lead on the Helix and Zookeeper SRE team at LinkedIn. Prior to LinkedIn, he was an SRE at Machine Zone and Google. Apart from his regular responsibilities, Ryan's interests and expertise include debugging production kernel, I/O and containerization issues. His opinion about not treating software as a black box and his persistent approach to debugging complex problems are truly inspiring. On several occasions, Ryan's colleagues have leaned on him to solve an esoteric problem that everyone thought was insurmountable. Our main focus today is one such problem that Ryan and team ran into while upgrading machines to a 4.x kernel that resulted in elevated 99th percentile latencies. We dive into what the problem was, how it was identified and how it was fixed. We discuss some of the tools and practices that are helpful in debugging system performance issues. And we also talk about Ryan's background and how his curiosity landed him a career in Site Reliability Engineering. Please enjoy this deeply technical and highly educational conversation with Ryan Underwood. Website link: https://softwaremisadventures.com/ryan  Music Credits: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en
Transcript
It's an interesting problem combined with just having a learner's mindset and innate curiosity
and not being afraid of peering inside the black box of something. This is just knowledge. It's like
cracking open a book off the library shelf. I mean, it's a very different kind of knowledge.
It's knowledge that changes on an ongoing basis because the Linux kernel is a living organism,
it's a moving system, it's constantly changing on a regular basis. So open mind, curiosity, and
check your ego at the door, because you're not going to understand the whole thing
and you don't have to in order to do something impactful.
Welcome to the Software Misadventures podcast, where we sit down with software and DevOps
experts to hear their stories from the trenches about how software breaks in production.
We are your hosts, Ronak, Austin, and Guang.
We've seen firsthand how stressful it is when something breaks in production, but it's the
best opportunity to learn about a system more deeply.
When most of us started in this field, we didn't really know what to expect, and wish
there were more resources on how veteran engineers overcame the daunting task of debugging complex
systems.
In these conversations, we discuss the principles and practical tips to build resilient software,
as well as advice to grow as technical leaders.
Hey everyone, this is Ronak here. In this episode, we speak with Ryan Underwood about debugging distributed systems. I don't know of anyone else who understands the Linux stack
better than him. His opinion about not treating software as a black box and his persistent
approach to debugging complex problems are truly inspiring. As the team at LinkedIn was upgrading
hosts to 4.x kernel, they observed elevated 99th percentile latencies in some critical applications.
We speak with Ryan about what exactly this problem was and how it was fixed.
We also discuss some of the tools and practices that are helpful in debugging system performance
issues.
Please enjoy this deeply technical and highly educational conversation with Ryan Underwood.
Cool.
Ryan, welcome to the show.
To get us started, tell us about your background. I'm especially curious about what got you excited about computers.
What got me excited in computers was finding a computer that I did not know was a computer, which my dad had brought home when I was something like four, five
years old.
And I actually thought it was a VCR because I was familiar with VCRs at the time.
I'm like, oh, here's this box that appears similar to size and shape and similar materials composition
as this thing that ingests video cassettes and you know puts colorful sounds and pictures on
the screen for me. So anyway, over a period of weeks or months, my dad set it up and started showing me how to use it. We had
some books and magazines that I could type BASIC programs into it, you
know, using DOS, and, you know, just all the things we did back in the day. Of course,
I was in it for the games. It was all about the games, and, you know, the machinery itself was just, uh, you know, a means
to the end of playing cool games. Uh, I had a neighbor kid who lived across the street, who's
now a special effects director in, in Hollywood, uh, whose dad worked for IBM. And so his dad was
always bringing home the goods and, you know, the neighbor kid always had the, uh, you know,
the latest and greatest, uh, uh, IBM PC jr. And then, you know, he had a 286 and then a 486 leapfrogging
whatever I had at the time. So I was always jealous but he also had all the cool games
that we just swapped discs back and forth.
So from there to writing, I mean I was doing
DOS batch programming on
floppy disks just like automating the
startup of, of games and things like that. Just very simple things. And then I got into the BBS
world in 1994, I was 13, uh, and it was all over from there. But then I started getting into,
um, you know, running my own BBS. I wrote some Pascal programs to automate some like just grungy things
of, you know, converting configuration files from the BBS format to the format that these,
these, of course, again, games would use externally to, you know, so that people could dial in and
play things like Trade Wars and Legend of the Red Dragon and things like that, if anyone remembers. So, yeah, all about the games and the files and the wide world of very odd things being communicated
and text files that are swapped around in BBSs that were very much outside my insular, suburban, Midwest kid upbringing.
And something, you know, my parents would have been very upset if they
knew what I was getting up to. But, you know, in high school, we had a programming class,
formal programming class with actual Mac, compact Mac, black and white screen computers around the
room. And so I did some Pascal in that class. And then, you know, at the same time, my hobby was like, uh, you know, copying discs and CDs. So I just,
you know, uh, figured out how to disassemble programs, look for the,
the syscalls in DOS and Windows that, you know,
identified whether you were running on an original disc or an original CD and
just, you know,
hack those out because I wanted to copy them to my hard disc and play them
without having these stupid discs and CDs around. I mean, come on, like everything should be in the cloud, you
know, in, in my, in front of me. So, uh, you know, so I, I learned, you know, basically I went up
and down the stack that way, just, you know, being a hacker, you know, bored kid, misfit, outcast,
uh, you know, um, everything that, you know, a lot of us are familiar with. But then I went off to engineering school in Missouri,
started as an electrical engineer major,
and ditched that after two years,
flunked out, family problems, divorce, nasty stuff.
Then I was admitted back into the CS program,
finished my master's at CS,
did a lot of open source hacking. That's the period when I actually got into kernel hacking because I had been
using Linux since 1997. I was actually one of the users of Zip Slack, which if anybody
remembers Zip Slack, it was a Slackware Linux distribution that you could unzip onto
an MS-DOS partition and, you know, basically boot from,
you know, just by running a batch file. And so that was my dipping my toes into the Linux world.
And over time, I just migrated the things I was doing to Linux because I just found Linux more
expressive and open for the kinds of things I wanted to do with computers, you know, very much
like the collaborative and, you know, sometimes opinionated, hopefully not too opinionated nature
of open source development. And so then I wanted my hardware that worked so well on Windows to work
on Linux. And it frustrated me when it didn't. And it frustrated me to the point where I spent
late nights with a cup of tea next to me and realizing the sun was coming up because I had
been up all night just digging into the driver for this thing and figuring out what registers
was Windows flipping that Linux driver was not, why this piece of hardware wasn't working or
wasn't working optimally in some way. And so I just did that kind of stuff until I graduated.
And then I graduated, did the normal software engineering stuff, embedded Linux, real-time simulation systems. Actually, in my job at the
flight simulator company, I had to do a lot of reverse engineering, which was suited to my
background, both on the network and of physical devices. So that was pretty cool. And then Google
came and hired me for the stuff I was doing as a hobby, not for what my actual career trajectory was, which was straight up software engineering.
And so I became an SRE.
And since then, I've been doing SRE things and helping out on that side.
And the history of my tenure at LinkedIn is that of SRE.
I'm sure we'll talk about things that will illuminate why my twisty background was able to
create a contribution in this particular area.
Nice. I remember back in 1997 when you were getting into Linux,
I was being tied to a piano chair, so I'm not at all
jealous. So yeah, so having been
at LinkedIn as an SRE for the past several years, what have you been working on? What kind of work
do you do today? Well, when I joined LinkedIn, I joined Ben Ferguson's team,
which was at the time the tools SRE team. It was a new team. It was a team that was explicitly created with the intent of providing an engineering approach to operational problems,
as in the core definition of SRE, but also specifically evangelizing the idea of operational awareness in the foundation organization, which was developing
all of our internal tooling at the time. And, you know, they were very focused on iterating and,
you know, moving fast and breaking things. And we just, we realized that there was a business
need around not breaking things as much and still being able to move fast. And so Tools SRE, we explicitly helped with that.
We're able to make many, many concrete improvements
up and down the stack from the source of truth for topology
to the deployment machinery to the private cloud,
fixing some nasty containerization bugs and operational problems
with the services that comprise the private cloud. So yeah, that was what I did for the first four
years. And from Tools SRE, I moved into LPS SRE, which is the team that owns and operates LinkedIn's
private cloud. And then after that, I moved to the Helix Zookeeper SRE team,
which was a shift, a very interesting shift,
because moving into that team, I knew virtually nothing about Zookeeper
aside from at a purely conceptual level
and where Zookeeper fits in in the ecosystem of distributed service components,
implementing distributed system fundamentals. And in a matter of,
it was actually less than two months, using
my background from, you know, four years of working with
foundation tooling, I was able to create a framework that
allowed us to measure the availability of all Zookeepers at LinkedIn, basically within two months' time.
And so I was able to stand on the shoulders of giants and deliver something that was very important to the company.
And since then, we've been using that availability measurement tooling to solve one problem after another in the zookeeper ecosystem
that was impacting either LinkedIn's customers in terms of site users, or at a minimum impacting engineers
at LinkedIn who are just trying to deliver new products and features.
A lot of things that you say, Ryan, which are normal software engineering or normal
infrastructure engineering, at least in
my perspective, they don't sound simple at all, especially your background and things that you did.
You mentioned games quite a bit as you were growing up. Do you still play games?
I actually have not sat down and played a game in a long time. I watched more videos on YouTube of other people playing games, wishing that I could go back in time to that life where I felt like I had enough time to be playing games so much.
But yeah, there was a time in my life where games and files were the goal.
And every means was a means to that end.
But that's how I learned.
That's how I learned the things that I do today.
But I have fond memories of games like Privateer was one that was very near and dear to my heart.
I spent a lot of time on MUDs, if anybody remembers those.
These are sort of the predecessor to today's MMOs. And if anyone remembers EverQuest, that was more or less the first graphical MUD that was massively multiplayer.
Before that, there was Ultima Online, which was smaller in scale and more themed around the Ultima universe.
But MUDs were just a text-based, basically multiverse that people could join over the internet in the early
days of the internet and interact with other people in real time in a textual shared space.
Sort of akin to the same kind of metaphor that IRC used or that Slack uses today, kind
of this shared space where people interact via text and
emoting and things like that. But for me, I mean, MUDs was where that really started.
And so I was into those. And that's a whole other story, all the shenanigans and network
hopping that we did to get around, you know, net blocks and all those kinds of things. Fun times.
Nice. Well, you definitely have a fascinating background. And knowing you personally,
like this is something our listeners wouldn't know. But when I joined LinkedIn, I was on the
same team as Ryan. And before I met you, Ryan, I heard about you from all the other team members.
And one thing that I heard consistently from everyone was, if I had a problem or a tricky situation that I was dealing with, either with Linux or distributed
systems, and I couldn't find the answer via Google or Stack Overflow, I should just go and ask Ryan.
And that has been consistently true. And we'll dive deep into one of those examples
today. But before we jump there, uh,
everyone who works with you knows about this skill set, which is rarely known to people who
probably don't work with you: how did you get so good with memes?
That was, um, that was a required
skill at Google. Um, basically, uh, you know, becoming a meme master, or at least, you know,
moving along the, walking the path towards meme masterhood is just, if anybody has worked at
Google, they would understand. And if they haven't worked at Google, you'll just have to
take my word for it. Well, we'll take your word for it then. And like you mentioned,
your knowledge spreads
in multiple dimensions,
both in breadth and depth.
And one of the things
that we wanted to touch on today
is the blog post
that you recently published
that talks about a very interesting
and tricky situation
that you encountered
in Linux system performance.
Before we dive into exactly what the
issue was, in your blog post, you mentioned re-imaging machines and an OS upgrade initiative
at LinkedIn. Can you tell us a little bit about that? Right. Well, I think you could tell us a
little more about that. But the short version of the story is that basically the decision was made to get off of Red Hat Enterprise Linux 7 for various business reasons, cost savings reasons, synergy with Microsoft, and those sorts of things.
Nothing against Red Hat specifically.
It's just that we had, as an organization, evolved past essentially what their support model was able to provide for us with the kinds of things we were doing.
So the decision was made to move to a CentOS platform image by using Microsoft's kernel from Azure, which was perfectly reasonable because Microsoft hires a lot of people to work specifically on the kernel.
And so why should LinkedIn duplicate that effort if we don't need to? Or be beholden to an external
vendor who has their own roadmap, their own priorities that aren't necessarily aligned
with ours, and we don't have a lot of influence over theirs. So that decision was made by SRE exec. And the part that, my understanding is,
LPS SRE specifically undertook was to re-image the general RAIN pools, which is basically,
this is akin to what most people would think of when they just request an instance on AWS.
They have no idea other than the region it's in, which specific hardware they're going to be
getting or what the characteristics of that hardware is other than it meets a specific
performance class that Amazon advertises. And so as opposed to the other pools at LinkedIn,
which are application specific, this is the general pool, which is sort of the default pool
that anybody who wants to deploy a random application lands on, right? Yeah, it makes
sense. So this is like the multi-tenant cloud where you get to ask for a compute resource and you get it through an API call where you're not worrying about exactly which host that you get.
So the OS upgrade initiative involved upgrading the set of machines which are on the multi-tenant cloud.
One thing that you also refer to in the blog post is various noisy neighbor problems in multi-tenant environments in general.
Tell us a little bit about some of these noisy neighbor problems that people usually hit in
these kind of situations. Oh, yeah. So, well, one noisy neighbor problem that we had initially in
RAIN was simply that of swap utilization competing for disk IO. So disk IO availability is, you know, like you can always
push IO requests through to a disk. The only question is what's the latency of those IO
requests being serviced. And so if you start filling up the queue on the disk, you're just
waiting for stuff to be streamed out to disk. So if you have, for example, multiple applications that are logging at a very high rate
or multiple applications which are making memory allocations,
which are being satisfied either partially or fully through swap allocations
or through pushing inactive pages out to swap in order to avoid a low memory situation. Those will create noisy neighbor
situations because a neighbor who is not creating the resource burden is unfairly impacted in terms
of their own latency because some other process on the host is creating a resource burden. And other examples of this could be contention for NIC bandwidth on, for example, one gig
NIC host.
So sometimes you have to move to a 10 gig NIC host to get more bandwidth.
Other things would be like memory bandwidth itself.
It's a global resource that can be used unfairly by one actor.
We saw that with, you know, actually, we didn't see this on RAIN necessarily,
but we saw this on Espresso nodes where one process would be doing Java garbage collection
and actually starve the rest of the system of memory bandwidth
because there were ECC errors that were reducing the system of memory bandwidth because there were ECC errors that were reducing the
amount of memory bandwidth available to the cores on that system. And so the global resource was
exhausted. Another classic one is cache thrashing. So when processes are not pinned to cores,
when they're free to schedule on any core.
The working set for one process will, it has different cache semantics
than the working set for another process.
I mean, not just different cache semantics,
but the working set itself is different.
So the data and code caches in the CPU
are going to be nuked every time that process is swapped out
and having to be basically reloaded by going to main memory.
And so that creates a slow start issue for whenever that task is scheduled.
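(For illustration only, not a fix discussed in this episode: pinning a process to a fixed set of cores is one common way to limit the cache-thrashing effect described here. The PID below is a placeholder.)

```bash
taskset -cp 0-3 1234   # restrict PID 1234 to CPUs 0 through 3
taskset -cp 1234       # show the current CPU affinity list of PID 1234
```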
So a lot of these can be kind of bucketed into, in a general term, what we call thrashing issues, where thrashing sort
of indicates that somebody who wants to use resources is experiencing high latency because
somebody else has caused that resource to be used in such a way that is causing the one who wants to
use it to be cued to have to wait on some physical characteristic, some physical
limitation of some kind. Okay. And these are interesting noisy neighbor problems. And as
you mentioned in the blog post that as you were moving the fleet to CentOS in this multi-tenant
environment, there was a problem that you identified, but it wasn't a noisy neighbor issue. Can you tell us
more about that? Right. And the problem when it was brought to me was brought to me as an
IO wait issue, that it was simply that on machines running the newer kernel, as compared
to the Red Hat Enterprise Linux 7 kernel, which was a 3.x kernel. We had
this newer 4.x kernel. And we were seeing that when multiple processes were doing sequential
I.O. simultaneously, such as artifacts being downloaded for deployment or logging and, you know, with multiple concurrent
streams, that IO wait was going up on the host very much not in proportion to the IO wait that would
have been generated on the same host with the 3.x kernel from Red Hat. So the noisy neighbor issue in that sense was that, you know, symptomatic rather than
fundamental. So it was a symptomatic noisy neighbor issue in the sense that one process was
able to create IO wait that was causing another process to be stuck in the run queue
simply because it was doing IO to the disk, and not because
the disk itself had a fundamental hardware limitation of any kind that was, uh,
justifying, uh, you know, causing a process to wait. It seemed like there was something in
the software that was doing this and uh you know, as it turned out, you know, we were right about that.
And we'll talk more about that, I'm sure. Yeah. So in this case, the effect was an application
would experience higher 99th percentile latencies if another application was deployed on the same
physical server. And it wasn't because of any limitations on the hardware side, like you
mentioned, but it was something in the software that was limiting the performance. That's right. And the first,
my first suspicion was that it was, it was going to be, you know, something related to
mutual exclusion, or locking. And I looked at the usual things I would look at for debugging that sort
of problem, which would be perf top to look for spin lock utilization, which given that there
was virtually no system CPU, no kernel CPU time disproportionate to what we would have seen on the 3.x kernel.
It didn't seem like there was any kind of spin lock meltdown or anything like that going on.
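(As a rough sketch of the check described above; the symbol names in the comments are typical examples of kernel spinlock paths, not captured output from these hosts.)

```bash
sudo perf top -g   # live kernel/user profile with call graphs
# heavy time in symbols like native_queued_spin_lock_slowpath or _raw_spin_lock
# would point at spinlock contention; in this case there was no such signal
```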
So then the other question was, is it possible that we're waiting, just waiting around for some
lock that's being inefficient or something like that? And so to look into that, I used the magic SysRq. I think it's SysRq-L.
We don't actually have console access to our servers, but I triggered it using procfs. And
what that does is it dumps the stack of all the cores on the host to the kernel message buffer so that you can see for
a snapshot at any point in time. And by the way, this only works on SMP hosts because on a single
processor host, the stack is always going to be, you know, for the single core in that host,
it's going to be the stack of the kernel function that actually prints that thing out to the kernel
message buffer. So it's not that helpful.
But when you have 24 cores, you have 23 cores that are presumably doing other things.
And so often you can see that these cores are just stacking up, kind of waiting in the same
mutual exclusion primitive, and can kind of figure out what went wrong that way. And so in this case,
I did not see anything like that. It was just slow. It was very strange.
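(A minimal sketch of triggering the magic SysRq backtrace dump through procfs, as described above; run as root on a host where SysRq is allowed.)

```bash
echo 1 > /proc/sys/kernel/sysrq   # enable SysRq functions if they aren't already
echo l > /proc/sysrq-trigger      # dump a backtrace of all active CPUs
dmesg | tail -n 200               # the stacks land in the kernel message buffer
```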
And the way this problem got surfaced is these applications started seeing higher latencies when they were deployed on the newer kernel.
And that's also how you started looking into the problem and dissecting it and seeing it
performs perfectly fine on the older version, but not necessarily on the newer one. And did that
lead you down the path of thinking it has something to do with the kernel and the investigation you
did with SysRQ and other tools? Well, the thing that led me to believe it had something to do
with the kernel was simply the combination of that the kernel was the main thing. I mean, like user space changed in CentOS as well,
but user space changes don't generally produce this massive increase in IO
wait, because IO wait is the kernel.
I mean, it's telling a process it cannot be scheduled at the moment because it's waiting on a shared underlying resource, whether it's a block device or some other device that is consuming I.O. of some kind.
And the kernel is telling that process to sleep because it can't do its work in the meantime, so it needs to sleep.
So that some other process, which can do work other than IO to a device that's currently busy, you know, can be run. So those things together just, I mean, my intuition, you know, basically bisected this, you know, away from a user space problem and into a kernel problem that way. That's fascinating.
One tool that you also mentioned in your arsenal in the blog is ATOP.
Tell us more about what ATOP does.
Yeah, ATOP is a great tool,
something that I learned to use effectively at Google
because we used it to debug problems on Borg hosts,
especially noisy neighbor type of problems and so ATOP is a tool that allows you to at a
glance see essentially everything that is important about your system from a
resource utilization perspective many people stop at top because they're you
know they see okay my my cores are doing this.
I have so much memory that's RSS, so much memory that's virtual allocation.
I can see which process is at the top that's busy and so on.
But ATOP gives you much more of a picture of what's going on because not only can it use the process accounting mechanism that's built into the kernel to accurately attribute disk utilization to specific
processes which used it, even ephemeral processes which had disappeared in between sampling
intervals, which is incredibly important when you have, for example, forking model type processes you're trying
to debug like we do in our Python ecosystem. But also it gives you everything at one glance.
Virtual memory, it tells you when I'm having to, when I'm undergoing free page scans because
I'm having such memory pressure on the system. Swap activity is all there, page fault activity.
Basically, it has several different views
depending on what you want to look at.
Do I want to look at what looks like a primarily disk IO problem?
I can hit D and have a view that gives me
those things that would be interesting in that context.
I can hit P and give me things that are interesting
in the context of, well, I seem to be having a compute problem in the sense of that this system
is not stacking up performance-wise what's going on. And then just the general view, the G view
is a great overview of both of those things. So it's just a great scope
into a system. I mean, it's my go-to tool. I don't bother with anything else, um, really. It's the first
thing I go to. Um, so, and it has a nice, uh, daemonized mode also that allows it to record all this stuff,
uh, historically, so that you can go back in time and see, you know, even like all the
process accounting aspect for, you know, accounting to disk IO. Like if I saw that a host was just
slammed with IO at some point in time, and I was running ATOP in daemon mode, I could very easily
just go back in time and see, oh, what was happening on that host at that time? And, oh,
it looks like this, you know, this automation process, you know, kicked off a bunch of children
that all were trying to, you know, read a gigantic file read a gigantic file and write it out at the same time.
And then I know exactly what happened.
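(A sketch of using atop the way Ryan describes, interactively and in its logging mode; the flags are standard atop options, but the log path is just a common default and may differ per distribution.)

```bash
atop 2                                            # interactive, 2-second samples; d/m/g switch views
atop -w /var/log/atop/atop_$(date +%Y%m%d) 60 &   # daemon-style recording, one sample per minute
atop -r /var/log/atop/atop_20210206 -b 14:00      # replay that day's log, starting at 14:00
```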
That sounds like a very useful tool.
And I know I hadn't heard of ATOP until I met you.
And I heard about ATOP from you.
We'll definitely add a link to this tool in our show notes when we publish the episode. And in general, this tool sounds like a very useful thing for containerized environments
specifically, where you have a bunch of processes using shared resources. Now that you knew that
there was this IO performance issue as you were upgrading these machines, you had kind of symptoms from the
applications running on this newer kernel. And through your intuition, you identified that this
problem has to do more on the kernel side. How did you proceed next? Yeah, the next thing I looked at
was because 4.x has BlockMQ and BlockMQ has some different IO schedulers.
I tried playing around with the IO schedulers to see if one IO scheduler or the other made a difference.
I also, at least for a minute, not too much longer, thought that maybe disk IO quotas had possibly been introduced on the CFQ side.
And so I investigated that because that could account for IO blockage happening.
But none of that turned anything up.
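(A minimal sketch of the kind of scheduler experiment described; "sda" is a placeholder device, and the schedulers available depend on the kernel build.)

```bash
cat /sys/block/sda/queue/scheduler            # e.g. [mq-deadline] kyber bfq none
echo kyber > /sys/block/sda/queue/scheduler   # try a different blk-mq scheduler
echo none > /sys/block/sda/queue/scheduler    # or bypass I/O scheduling entirely
```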
And actually, it was kind of by a stroke of luck while I was looking into those things.
One of the engineers in Pi noticed that there was an issue on Red Hat's Bugzilla about a...
Actually, yes, it was on Red Hat's internal support ticket system
that somebody else was complaining that they had an IO regression in RHEL 8,
which has a 4.x kernel, and that they thought it was related to,
there's a public Bugzilla kernel upstream bug
related to this where people had SATA SSDs
and were having a, well, SSDs or HDDs in this case,
and they were having basically a massive penalty
in the same type of situation that we had,
which was sequential I.O. and using BlockMQ.
So that kind of led me in the direction
that something was going on with BlockMQ.
And taking...
Jens Axboe had a very simple patch
related to the scalable bitmap code in BlockMQ,
that is, a scalable bitmap.
You can read a lot more about this online,
but basically a scalable bitmap is a mechanism
that allows IOs to be submitted efficiently across multiple cores
without having a logjam on any one particular core.
And many of the advanced IO schedulers,
you know, built on BlockMQ,
are leveraging this scalable bitmap primitive.
But, you know, it was a very simple patch
to reduce the number of queues in a scalable bitmap. And it actually, it produced an improvement. Like, it didn't fix the problem,
but it produced an improvement. And so that was what kind of gave me, you know, a more, you know,
intuitively, um, a suspicion that this was the right direction to go seeing improvement with that. So based on that,
I just, I thought, well, you know, we have this platform image that is different, but I think the
kernel is the problem. So what would happen if I installed a 3x kernel on one of these CentOS hosts
and redid the benchmark? And it was the realization from there that using the same kernel config,
by using a 3x upstream kernel, the problem did not appear. Then it was like, okay, well,
this is interesting because I have a good kernel, which is a 3.x kernel using the same config from the 4.x
kernel and using BlockMQ, and we don't have this problem. And I have a 4.x
kernel which is bad. And from there I just went straight into, okay, you know, I mean, you just roll
up your sleeves and bisect at that point because you know the problem is somewhere between you know
T and you know T plus one. So bisecting is something that I want to definitely get to
before we do that can you briefly describe for our listeners what is BlockMQ?
So BlockMQ is the block IO layer for the Linux kernel.
This is the layer that every block device, every hard disk, every flash memory device, anything that is seekable and seekable durable storage
registers with the block layer
in order for IO requests to be forwarded
either from file system drivers
or just when raw block device IO is being done, like with dd
or things like that. Those I.O. requests go through the block layer. They are combined
and aggregated in a very simple way, the name of which escapes me at the moment. And then
the block layer forwards the aggregated request to the IO scheduler,
which then, based on whatever the semantics of IO scheduler are, which are often semantics that
are attuned towards a specific underlying type of block device, taking into account the physical
characteristics of that block device, such as the elevator IO scheduler algorithm is one that is specifically designed
for disk devices that have a seek penalty, for example.
But block MQ is the replacement for the legacy block IO layer.
And block MQ is called block MQ because MQ stands for multi-queue. And the idea is that you can have, instead of having one queue
that is not scalable because there's one lock around this queue where if an IO is taken off
the queue or an IO is being put back on the queue, whichever core is doing that operation has to have
the lock for that queue. And of course, if you have 20 cores on a host... Like, we got into a world where, you know, back in the eighties
and nineties, it was the PC versus vertical scalability and the giant mainframes and things
like that. And then we got back into, you know, with Google and massively distributed
systems, we got into the world where we were doing more horizontal scalability
and vertical scalability kind of fell by the wayside
as all the big iron Unix vendors went out of business
one by one in this new world.
And then around 2000, well, between 2006 and 2010,
with the Core architecture, and AMD's also, the Bulldozer and the subsequent architectures,
we realized that we were hitting a limit with core scalability in terms of the ability to scale clock speeds infinitely upwards
and the ability to scale instructions per cycle, IPC upwards.
And so we needed to increase the parallelism of cores in a host.
And so we've moved back into this world of vertical scalability as a result.
And because we've moved back into this world of vertical scalability, we have, of course,
parallelism becoming front and center, again,
to what we're doing on a specific host. And so that's why the Linux, the legacy block IO layer
became a bottleneck, because with this increasing parallelism, you know, at first it was two cores,
four cores, six cores, and then it became eight, 12, 16, 24 cores, 24 cores per CPU package. I mean, and especially with hyper-threading enabled,
which increases parallelism, you know,
without necessarily increasing the functional unit capacity of a CPU,
but increases parallelism, which again,
increases the ability to submit IOs at a very rapid rate.
And if I'm doing that, then my IOQ itself
becomes the bottleneck if my IOQ is one entity
which has to be managed by, you know,
in sequential fashion by a single core.
And so BlockMQ solved that problem by, you know,
creating basically per core IOQs such that an IO
that needs to be submitted to a blocked device
can be added to any I.O. queue on any core which is idle or isn't currently holding a lock for its I.O. queues,
so it massively increases the potential parallelism of I.O. submission.
And this became very important also, not just for, you know, like it wouldn't necessarily be a problem if you had
24 cores all submitting IO to, you know, a hard disk because the hard disk is a single command
queue. I mean, it can do some reordering, but ultimately it's kind of a FIFO piece of hardware.
You just, you stuff commands into it, it executes those commands, you know, returns the results and
then the queue drains and you can stuff more stuff into the queue.
But multi-queue hardware is the main impetus for BlockMQ,
which is so that I can have, for example, an NVMe device
that can have, I think, up to 64 or 128 independent hardware queues.
Some of this might be virtualized in the hardware.
I mean, who knows what's really going on?
But at least in terms of the logical queues
that it presents to the host,
many, many more queues are available.
And so if I have only one I-O queue
for the host to submit I-Os to,
I have not just that contention problem in the I-O queue,
but also I'm being inefficient
in terms of maximizing the
hardware utilization. So to increase utilization on devices with parallel hardware queues,
BlockMQ is the solution for that too, because if I have a queue per core and I have all cores
submitting IO at the same time, I'm doing the best I can in terms of getting IO out to the
hardware because I couldn't do any better without more cores. I hope that makes sense.
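(A small sketch for seeing blk-mq from sysfs, along the lines of the explanation above; "nvme0n1" is a placeholder device name.)

```bash
ls /sys/block/nvme0n1/mq/                  # one directory per hardware dispatch queue
cat /sys/block/nvme0n1/queue/nr_requests   # request depth of each software queue
cat /sys/block/nvme0n1/queue/scheduler     # which blk-mq scheduler is in use
```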
Yeah, thanks for explaining that. Coming back to the bisecting. So you mentioned you see different outcomes in the T and T plus one
version of the kernel. I was just looking at the Linux kernel repo on GitHub; it has a million commits. So between T and T plus one,
I can just imagine there being thousands of commits. So how do you go about bisecting
something like that? Yeah, well, I mean, the great thing is you can actually have
multiple problems in those millions of commits to that makes your day even more fun. But yeah, so long story short, I mean, Git, one of the great
tools that Git has built in is the bisect tool. And the bisect tool is, it's just, it's a workflow
where, once you've entered this workflow, your HEAD pointer will be pointing to
a specific commit on the bisect branch that's created for this workflow.
And you tell Git for each commit. So I start bisecting. I tell Git what my bad revision is, and I tell it what my good revision is. And it expects that my good revision is older and my
bad revision is newer. But you can also invert those things.
I mean, it doesn't matter.
It's just all Git wants to know is what's the starting commit
where we're starting the bisect
and what's the ending commit where we're ending the bisect.
And it just picks the midpoint between those commits
and gives you, you know, sets head to point to that commit
and then expects you to do testing at that commit
and then give it feedback whether at that commit it is good or bad.
And then so you have your two ends, you're at the midpoint,
you've bisected and you said, okay, well, this one is bad too.
So Git knows that, oh, okay, well, we need to go further back in time
to figure out where this was good.
And so it bisects between, you know, takes the midpoint between those two commits until
you get down to eventually, you know, I mean, it's a logarithmic function of, you know,
for the number of steps that you need to bisect.
So, you know, if you had a million commits, it's going to be, you know, log base two of
a million, you know, to, you know, the number of steps that you're going to have to do, unless you're really lucky. But yeah, so
once you've gotten down to the actual commit, that's bad, then you can figure out like it
doesn't prescribe what you do with that. It just tells you which commit flipped the state of the
system from good to bad, and then expects you to fix it from there.
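(A minimal sketch of the git bisect workflow just described, using the endpoints mentioned later in the conversation: 4.19 known bad, roughly 4.3 known good.)

```bash
git bisect start
git bisect bad v4.19    # the revision where the regression shows up
git bisect good v4.3    # a revision known to behave well
# git checks out the midpoint; build, boot, and test that kernel, then report:
git bisect good         # or: git bisect bad
# repeat until git prints "<sha1> is the first bad commit"
git bisect reset        # return to where you started when finished
```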
So do you know how many steps you had to go through? I think I calculated at one point,
it was something like 17 steps. Okay, not too bad. But when you're bisecting and identifying
these commits, are you also testing the kernel at those commits and with different patches to
see if it solves the problem? Well, yes. So the bad kernel that I had was the 4.19 kernel.
And the good kernel that I had was, I realized that like 4.3 or something was good. So I knew that the problem was somewhere between those.
And so in between, each time I would bisect,
I would tell Git what was the result,
and then I would take that, I would copy in,
I would nuke the, make Mr. Proper,
copy in the config, so I'm starting from a
clean slate, and then, you know, do a parallel build of the kernel, which, you know, it took
three to four minutes on a, I mean, these are fast 24-core machines, so, I mean, it's not that slow,
and then built a, you know, just use the kernel's own scaffolding for building a kernel RPM,
installed the RPM on the host, and then, you know, told GRUB, you know, grub2-reboot,
you know, which will tell it to boot a specific kernel next.
And then just, you know, rebooted the host.
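(A rough sketch of that per-step build-install-reboot loop, assuming an RPM-based host; the config file name and GRUB menu entry are placeholders, not the exact ones used.)

```bash
make mrproper                                    # start from a clean tree
cp /boot/config-baseline .config                 # reuse the same kernel config each step
make olddefconfig
make -j"$(nproc)" binrpm-pkg                     # the kernel's own RPM packaging target
sudo rpm -ivh ~/rpmbuild/RPMS/x86_64/kernel-*.rpm
sudo grub2-reboot 'CentOS Linux (4.x-bisect)'    # boot this entry on the next reboot only
sudo reboot
```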
And of course, these hosts are in data centers.
They're headless hosts.
I mean, so every time I reboot, it's kind of a leap of faith that we're going to get through the firmware,
that the kernel is not going to be bad. And there were a few bad intermediate kernels that were,
you know, failed in interesting ways. And so, you know, we had to work with the, you know,
the sysops team and data center team to recover the host sometimes. And, you know,
sometimes my, you know, my day would just be over on a specific host and I'd have to go find another host to, you know, to continue on. And so I'd have to copy all my
stuff to that new host and, you know, get, get started with the workflow again. But yeah, I mean,
it was just a lot of steps of that. I mean, if there were 17 steps, I mean, two to the 17th is
what, you know, a hundred, 128 Ks or something like a, you know, I mean, you said a million,
but I mean, it's not that far off i mean
probably 100K-plus commits in between where, uh, you know, where we were. So, um...
You say that extremely lightly. That's a lot of commits to bisect. To be honest, from my perspective, that
sounds extremely daunting to even get started with.
Well, you know, in computer science we have a general class of problems, divide and conquer, that exploits logarithmic reduction in the time that's taken to do something when you take a divide and conquer strategy.
And so bisecting is just another form of doing that.
And frankly, it could have been automated.
It's just that when I did the calculation, I realized that it was only going to take 17 steps to, you know, at worst to figure this out.
I just made a I made a spot call that, you know, I would just burn some time on it rather than, you know, building automation around this.
That isn't really, you know, I didn't see another direct use case for.
So, yeah.
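(For reference, the automation that was skipped is roughly what `git bisect run` provides: a script that exits 0 for a good revision and non-zero for a bad one drives the whole bisection. The script name below is hypothetical, and each step here also needed a reboot, which makes end-to-end automation harder.)

```bash
git bisect run ./build-boot-and-fio-test.sh
```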
And I would also say that the divide and conquer approach that you mentioned is also super helpful, not just when you're bisecting commits, but also when you're just trying to evaluate various possibilities, trying to debug
something. Right. Well, and divide and conquer in algorithmic terms is, of course, it's a separate
concept from fault isolation in systems engineering terms. But they do have overlap conceptually in
that you want to narrow down the amount of work that's done by constraining the working set in some way.
And so the fault isolation is sort of divide and conquer approach being applied to systems engineering.
So you're right about that. That's a good observation.
And as you're checking these different commits
and different kernel versions, you mentioned that you use something like an FIO test to
check if a specific version of the kernel is good. Can you tell us more about
that? Yeah, FIO is a tool that, I believe, was developed, or at least is maintained,
by Jens Axboe. I mean, it's distributed on kernel.org.
So it's a core regression testing tool for file system
and block device driver maintainers.
But FIO basically just gives you a suite of tests
that you can sort of slice and dice, mix and match.
And you can determine the parallelism, you can determine the test semantics, whether you're doing sequential IO, random IO, direct IO, buffered IO, all those
sorts of things, and how long the test is going to take. And it's a very nice tool. And the SysEng
team had, while I was working on these other things, put together an FIO test. And this is what made bisecting the kernel very straightforward in the end: having this FIO tool that gave a clear signal between a host that was, you
know, that was fine for, at least for the application team's purposes, and a host that was
problematic. And so using that test, I was, you know, it was very helpful to very quickly figure
out whether a particular kernel build was good or bad. I mean, the signal was crystal clear night and day.
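(A hedged sketch of an FIO job exercising several concurrent sequential writers, in the spirit of the deployment and logging pattern described; the actual job the SysEng team used isn't specified, so these parameters are illustrative.)

```bash
fio --name=seq-writers --directory=/var/tmp/fio \
    --rw=write --bs=1M --size=2G \
    --numjobs=8 --ioengine=libaio --iodepth=16 \
    --group_reporting
```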
And this FIO test also significantly reduced the time to test out these different kernels
because deploying an application and just testing for its 99th percentile latency, again,
sounds very expensive in general. Right, exactly. And we avoided that whole
closed loop. And additionally, because the JVM that we're
using requires a kernel module to be built and available at runtime, getting that kernel module
built was not necessarily a straightforward task. That was something that would have had to be done on every kernel, every intermediate kernel bisected.
Being able to use the FIO test as a proxy for the application I.O. experience very much shortened the cycle time.
So, you know, it was crucial. And so, after you're doing these tests and getting these
signals, how did you end up identifying the root cause? And what was the root cause?
Well, the root cause was two, two different patches. I don't recall exactly what they were. One was related to security of, like, if a file was erased,
ensuring that no data from that file could be leaked out to somebody
who subsequently did an mmap allocation of the, you know, you know, and got the same disk block and was
able to see the, you know, the, the data that was left over from the previous file, um, on it.
And the other was related to, um, anyway, I can't remember exactly, but both of them were
ext4 patches and neither of them were at all, or obviously related to the issue at hand. They were both,
you know, looked completely orthogonal. But when it came down to it, I checked and double-checked
and in both cases, applying either of these patches, you know, caused the IO latency to
reappear, and with both of them removed, it went away, even in the 4.19 kernel. Which was interesting, because a lot of the internal kernel APIs
had changed by that point.
So I had to actually hand backport or hand forward port the revert of one of those because
it did not revert cleanly on 4.19.
So fortunately that worked and I'm sure everybody's comforted
to know that, you know, my hand rolled code is, you know, a component of our ext4 driver on,
you know, 10s of 1000s of hosts in our data centers running our production stack. So
yeah.
That sounds fascinating. And the fact that you were able to identify the problem and actually fix it, it's just amazing. You identified a lot of lessons learned in this process. If you had to pick the top two, what would they be?
One that I think we've internalized as an organization is
that we need to have formalized regression testing around platform images and kernels.
And fortunately, that is something that is so obvious that it was immediately funded.
And that's an ongoing project internally to put that automated regression testing in place
so we never run into something
like this again. The other thing that was probably an interesting learning is just how
well we can work together as an organization when we have a shared, very clear objective to rally around and where, um, you know, the individual, uh, you know, ICA engineers,
uh, are able to, um, have their plates cleared and, uh, and focus down on this problem. Um,
instead of, uh, kind of having a, I mean, uh, an organization that would have been less effective
would have kind of, you know, acknowledged that this was a problem and, you know,
allowed people to freelance on it in their spare time, but not actually committed resources to
tackling it such that the ICs that ultimately contributed the key building blocks of the
solution had the laser focus time that they needed to, you know, to really, um, to get there. And,
you know, it's not that you have to focus on it a hundred percent all the time from beginning to
end. Sometimes you just have some downtime such as, I mean, downtime for me was bisecting the
kernel. It's just like, it's repetitive tasks that you just have to, um, just grin and bear it
because you know that there's a pot
of gold at the end of it, you know, once you get to the end. But then, you know, once you get to
that point, once you have that additional nugget of information that gives you the, you know,
what you need to move forward to the next step, then again, you have to have, you know, like
managers have to agree that resources are going to be dedicated to this problem such that you're not being pulled off of this for on-call, for, you know, for project, other project work, for, you know, for other interrupts, for tickets, for, you know, for ad hoc asks.
I mean, whatever these things are, you know, that, you know, managers need to protect their ICs so they can really get, you know, impactful, tactical work like this done.
So I wasn't sure beforehand how, you know, I mean,
if you had asked me how well we would have functioned getting this done,
I wouldn't have been sure what to tell you.
But I was impressed that we, you know, especially given the, you know,
the unusual situation that we were in this year,
that people were really able to clear their plates
and roll up their sleeves.
Nice, nice, nice.
So taking a step back now,
it sounds like you became an expert at Linux kernel
almost unintentionally by sort of hacking
to scratch your own itch.
Do you have any advice for our listeners
who want to maybe more intentionally
get a better understanding
of how the Linux kernel
works?
Yeah, so, and the
first thing I would do is take issue
with the phrase
expert, because there are very
few experts at the
Linux kernel. The Linux kernel is huge. It's one of
the biggest software projects that has ever existed with the most contributors to a single
code base in history. So, I mean, nobody's an expert. You have to adopt a learner's mindset
and always have that learner's hat on because there's always something that you're, you know, it's kind of like, you know, dissecting, you know, a frog in biology class or something like that.
Like, you kind of know what the, you know, what the outside of the frog looks like.
You kind of know what the inside is going to look like, but you don't know what it's really going to look like until you've
actually cut it open and, and looked inside. And then for each one of the components that are
inside it, well, you, you kind of know what they look like because you're, you're looking at them,
but you don't really know how, you know, what it looks like on the inside or what it does until
you, you've cut open that, you know, that, that heart or that, that lung or that, you know, that
liver or whatever it is.
And so the Linux kernel is kind of a lot like that.
I mean, you can describe it in broad swaths.
You can kind of know what general terms, what different components of it do, like the VM
subsystem, the block IO subsystem, what different drivers do, the driver frameworks, the kernel itself, the scheduler,
the platform-specific code. You can know what those things do in general terms, but you don't
really know until you've actually dug into it and had a good problem in front of you to incentivize
digging into it. I didn't know how page reclaim worked in the kernel
until I had an actual problem in front of me on LinkedIn's private cloud having to, you know,
to debug and ultimately mitigate this containerization problem that we were having.
And so, you know, but by having this problem in front of me and being willing to just crack open
the source code and read it without being intimidated by it. So I would say to answer that question, it's an interesting problem combined with just
having a learner's mindset and innate curiosity and not being afraid of peering inside the
black box of something.
It's not going to hurt anything. This is just knowledge. It's like
cracking open a book off the library shelf. I mean, it's a very different kind of knowledge.
It's knowledge that changes on an ongoing basis because the Linux kernel, it's a moving,
it's a living organism. It's a moving system. You know, it's constantly changing,
undergoing very, very large change, you know, on a regular basis. But, you know, some things are conserved relatively well, you know, some structures are conserved even while other
structures are evolving fast. So, you know, just open mind, curiosity, and, you know,
check your ego at the door, because you're not going to understand
the whole thing, and you don't have to in order to do something impactful.
I was, uh, very motivated, you know, by your talk, but then you kind of went with the dissecting-the-frog bit,
which reminded me of some painful memories. But, um, coming back, um, you've already
mentioned how useful ATOP is.
But that aside, what was the last tool that you discovered and really liked?
The last tool that I discovered?
Very good question.
Or is it hashtag ATOP for life?
I don't have a specific tool in mind that I've just recently discovered.
JXRay is one that I've been using for analyzing Java heap dumps. And that one is developed by an engineer here at LinkedIn.
So, I mean, that's a cool tool.
It tells you exactly what's going on with your Java heap.
And if you're in garbage collection, what's triggering it.
So a little bit off topic from where we were,
but that's a recent tool that I've run into.
Awesome.
Well, that's it for us.
Where can people find you on the internet?
They can find me on GitHub or on LinkedIn. I do have a LinkedIn profile.
Nice. And anything else you'd like to share with our listeners?
I would just say that the more people that are curious enough about the kernel to be willing to learn some operating system fundamentals, work using tools like ATOP or IOTOP or even VMStat and things like that, more classic
tools.
The more people that are willing to dig into these things, the better off we are as an
industry because we break down the kind of wall of isolation between pure SWE mindset,
which is kind of a world of abstractions and algorithms
and the pure systems mindset,
which is one of bare metal machinery
and making things run smoothly in the operational sense
and bringing those two perspectives together
into more of a holistic mindset.
And I think we're all better engineers when we merge those two things.
And we do better things for the industry when we do.
Absolutely.
It is always a pleasure to talk to you, Ryan.
And whenever we do, I learn a lot.
And today was no different.
Thank you so much for joining us today.
It's been a pleasure.
Thanks, Ronak.
Thanks, Guang, Austin.
Appreciate your time.
Hey, thank you so much for listening to the show.
You can subscribe wherever you get your podcasts
and learn more about us at softwaremisadventures.com.
You can also write to us at hello at software misadventures dot com. We would love to
hear from you. Until next time, take care.