Software at Scale - Software at Scale 22 - Sujay Jayakar
Episode Date: June 2, 2021
Sujay Jayakar was a Software Engineer at Microsoft Research where he worked on kernel bypass networking. He was previously a Principal Engineer at Dropbox where he worked on the migration of user data... from S3 to the internal storage system (Magic Pocket), and the sync engine deployed to clients.
Apple Podcasts | Spotify | Google Podcasts
Highlights
05:00 - What framework do you use to decide to stop using S3 and store data in your own data centers? (the “Epic Exodus” story)
11:00 - Perfect hash tables and how they are used in production systems to reduce memory use
14:00 - What is an OSD (Object Storage Device)? How does it work?
20:30 - SMR drives
30:00 - The actual data migration: how did it happen, and how does one validate that the data being transferred is correct?
33:00 - “S3 being overwhelmed.” That’s a string of words most software developers don’t expect to hear. What kind of overhead do kernels impose on networking, and why?
43:00 - What is kernel bypass networking?
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Hey, welcome to another episode of the Software at Scale podcast.
Joining me here is Sujay Jayakar, who, I think I said that correctly?
Yeah, you got it.
Yeah, yeah. And you're an ex-research engineer at Microsoft, and you're working on a stealth startup now, I think for two or three weeks, if not more. And you were an engineer at Dropbox for about eight years, where you worked on Magic Pocket, sync, and a bunch of other things.
Yep, most notably our email system.
Cool. And yeah, thanks for being a guest.
Thanks for having me.
Yeah, so why don't we get started with Magic Pocket, because I think that's where you started your career, maybe after the email system.
I had done... I mean, when I joined Dropbox it was a bit smaller. So there was a phase of everyone doing everything. But yeah, that was, I think, one of the first really big, chunky projects I had gotten involved with.
Okay.
And what exactly were you doing for that?
Yeah.
So I joined as an engineer when the project was getting a little bit more serious and we started staffing it up, and I worked as an IC throughout my entire time on that project. I worked first on a big push to get the system code complete. For those listening, Magic Pocket is an in-house storage system that we built at Dropbox that is very similar to Amazon S3. Dropbox has exabytes of user data, and we need to store it somewhere. So at the beginning, we started in Amazon S3, which was totally the right decision. It's a very hard problem to store data at scale, and S3 is a great product. And at a certain point it made more sense for us to do it ourselves. We had this project called Magic Pocket that had been running around in a prototype-ish stage for a while, and at a certain point leadership in infrastructure decided that we wanted to go full steam, so we started staffing up the project. That's when I was brought in, and the first few months were just us building the storage system from scratch. It was kind of amazing how much we got done very quickly.
I worked on one of the metadata servers that stores and helps clients figure out where a particular object is stored on disk. It's a multi-level indexing scheme, and I worked on one of the levels there.
And then I also worked on our backfill system, which did the data migration from Amazon S3, over the internet at first, and then eventually over peering to our own clusters.
And then after that, we had kind of gotten the rollout going.
We had qualified the system, we felt very confident in it, and we were starting to do the dance of getting supply chain to bring up new servers and move the data over at a rate that matched our hardware ramp-up, but was still as fast as we could manage. And then there were also some contractual things with S3 around how much data we needed to keep in there.
So it was really a resource management problem. But in parallel with that, we started working on the second generation of the system, approaching it from the perspective of cost. One of the opportunities that one of the tech leads on the team identified was that we could increase density a lot on our storage node, which comprises the majority of the system. This is a server that just has a ton of disks, and all it needs to do is store a key and a value, where the value is a few megabytes, serve it, and then eventually also be able to delete it. So if we could increase density a lot there (I think we started off having one server manage roughly, like, 44 terabyte disks), and take advantage of these new SMR disks that were coming to market, where each disk could get up to 15 terabytes, and then get up to 60 of those or even more, that would get our price per gigabyte down by a lot.
So I was on the team, a team of three of us on the software side, to build this out. And it was a really interesting design process, where we started with what we wanted from the hardware, then found out what the implications were on the software, built that out, and then helped qualify it and start the release. And then I moved to working on Nucleus, the sync engine, after that; you had one of my colleagues, Nipun, on a previous podcast.
Yeah, I think the history of Magic Pocket just sounds so interesting. The first question I had around that, and you don't have to talk too much in specifics, is: what was the framework used to decide that, you know what, we have to stop using S3 and start storing data ourselves? Because I'm sure that was a hard decision, and I'm just curious how that went about and how you got buy-in and all of that.
Yeah, that's a really good question.
And I think it was actually a bit above my pay grade at the time.
I was like a new grad effectively at that point.
And, you know, I don't have that much insight into how the decision was made at the time. But looking backwards, having worked on the project and talked about it: price was one factor. Amazon S3 is incredibly competitive on price, but we were able to do a little bit better. I mean, I don't know what Amazon's pricing is internally, right? But for us, we were able to do a little bit better, especially having control over the system and being able to tailor it very specifically to our workload. We did some pretty specific optimizations, especially after I had left the team, with our multi-region replication, that I think would be harder to do more generally. And then some of it was also control from a strategic perspective, right? Like owning our destiny a bit more.
Both of those reasons make a lot of sense. So if you had to guess, why or how is S3 designed the way it is, and what was it about Dropbox's workload that could be optimized a little more specifically?
Yep. So Dropbox implements its file system, the file system that we use for storing user data, with MySQL for the metadata storage, and I think you talked a little bit about that with Nipun on the previous podcast. But then all of the contents get broken up into four-megabyte chunks, or blocks as we call them, and they're addressed by their SHA-256 hash. So they're content-addressed, and they're fully immutable. So we built Magic Pocket to be a content-addressable store where data is completely immutable, and the object size is very well known. For us, the distribution had a maximum of 4 megabytes, and we had our mean around 1.5 megabytes. We actually went and sampled the distribution and then used it to tune all of our allocators and things like that, so that we would be able to squeeze a little bit more efficiency out of them.
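A minimal sketch of the block-addressing scheme just described: file contents are split into chunks of at most 4 MB, and each chunk is keyed by the SHA-256 of its bytes, which is what makes the store content-addressable and the blocks immutable. This assumes the `sha2` crate; it is an illustration of the idea, not Dropbox's actual chunker.

```rust
use sha2::{Digest, Sha256};

const BLOCK_SIZE: usize = 4 * 1024 * 1024; // 4 MB maximum block size

/// Split a file's contents into blocks and key each block by its hash.
fn block_keys(contents: &[u8]) -> Vec<([u8; 32], &[u8])> {
    contents
        .chunks(BLOCK_SIZE)
        .map(|block| {
            // The key is derived only from the bytes, so identical content
            // always gets the same key, and the block never changes.
            let key: [u8; 32] = Sha256::digest(block).into();
            (key, block)
        })
        .collect()
}
```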
And I think immutability is a huge one. You know, I don't know any of the details of how Amazon S3 is implemented, but I feel like a lot of distributed systems end up embracing immutability at some layer. You can imagine that if you're mutating an object in S3, maybe it's just putting a new version, and then the tuple of (path, version) is actually an immutable object within the system. And that just makes coordination and so many of those distributed-systems problems a lot easier. So we had that from the very top, which was great.
That makes a lot of sense.
And it seems like S3 also would have been designed to be super general purpose, with no idea how big or small object sizes are. And if I had to guess, most things in S3 are just not that big, compared to Magic Pocket, where you know for sure that it's going to be of a certain distribution.
Right. And small objects are actually very difficult to handle well in a system like ours. One, we always think about ratios between different types of storage in our system. There's a lot of per-object overhead, right? Because, as I mentioned, you need some of those metadata services to be able to find the location of objects on disk, and there's probably a fixed cost per object that needs to be stored in flash, because it's on the critical path for reads. So there are all these constraints that make small objects a huge pain to deal with. Specifically within the storage node, with the second generation of what we call the OSD, the object storage device (I can't even remember the acronym exactly), one of the quantities we paid a lot of attention to was how much RAM, how much memory, was necessary to manage each object so that we could find it in a single disk seek.
And one of the tricks we did
with the second generation of the system
is use perfect hash tables
because our data is immutable and we group it into what we call buckets: groups of one gigabyte of data. Once a bucket fills up it gets sealed, and even that grouping is immutable; it starts append-only and then becomes immutable. Once it's immutable, we can look at all of the keys that are inside that bucket and compute a perfect hash table. Perfect hash tables, for those who aren't familiar: if you have a fixed set of keys, you can compute a hash function that will map each one of those keys to a slot in a table, and then when you're querying this hash table, you don't actually have to have the keys in memory at all.
You just have this little hash function,
and it's guaranteed to not have collisions
because it has some probabilistic algorithm
to keep on trying different seeds for that hash function
until it finds one where all those keys have no collisions.
So you throw away the keys.
You don't have to keep them in memory anymore.
And you just keep this small,
succinct little hash function.
And you can then use that to directly find stuff on disk,
which is pretty cool.
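A toy sketch of the scheme described here: once a bucket is sealed, try hash seeds until every key lands in a distinct slot, keep only the dense table of per-object (offset, length) entries, and throw the keys away. This naive retry-seeds construction only works for small key sets (production systems use per-bucket algorithms such as CHD or BDZ); the names and field sizes are illustrative, not Magic Pocket's actual code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Clone, Copy)]
struct Entry {
    offset: u32, // byte offset within the ~1 GB sealed bucket
    length: u32, // object length in bytes
}

struct SealedIndex {
    seed: u64,
    slots: Vec<Entry>, // dense table; the 32-byte keys are discarded
}

fn slot(key: &[u8; 32], seed: u64, len: usize) -> usize {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    key.hash(&mut h);
    (h.finish() as usize) % len
}

impl SealedIndex {
    /// Try seeds until every key maps to a distinct slot (assumes a small,
    /// non-empty key set; real constructions scale much better than this).
    fn build(keys: &[([u8; 32], Entry)]) -> SealedIndex {
        let len = keys.len();
        'seed: for seed in 0..u64::MAX {
            let mut slots = vec![None; len];
            for (key, entry) in keys {
                let s = slot(key, seed, len);
                if slots[s].is_some() {
                    continue 'seed; // collision: try the next seed
                }
                slots[s] = Some(*entry);
            }
            return SealedIndex {
                seed,
                slots: slots.into_iter().map(|e| e.unwrap()).collect(),
            };
        }
        unreachable!("no collision-free seed found");
    }

    /// Returns the candidate (offset, length). The caller must read the value
    /// from disk and verify its key, because an unknown key still maps to
    /// *some* slot and would return garbage.
    fn lookup(&self, key: &[u8; 32]) -> Entry {
        self.slots[slot(key, self.seed, self.slots.len())]
    }
}
```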
That is so interesting.
When I heard perfect hash table,
I just,
I assumed it meant something like you don't have to deal with collisions.
I did not realize it meant you can throw away the keys as well. And I'm guessing... yeah, go ahead.
Yeah. And not having collisions is a big part of it, right? So what we would do, just to put some numbers to things: the keys for each object in Dropbox are 256 bits, because they come from SHA-256, so 32 bytes. And that's 32 bytes per megabyte-and-a-half roughly, because the blocks are 1.5 megabytes on average, so it adds up. So what we did is take all those 32-byte keys and compute this perfect hash table into a dense array of offset and length pairs on disk. We use u32s for those, which are four bytes each. So we went from previously needing 40 bytes, and actually a little bit more, because for a general-purpose hash table you'll have some load factor, whatever that is, down to just eight, because we just have this dense table, with one entry per object in the on-disk order: just offset, length, offset, length, offset, length. And then when you query for a particular key, that just feeds into this hash function, which is small.
And so there's a certain number of bits per entry.
There's a lot of actually academic research on this,
which is pretty cool.
There's a few bits per entry.
And then that points to the table.
We look at that offset and length,
and then we read that off from disk.
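[To put rough numbers on that: at the ~1.5 MB mean block size mentioned earlier, a hypothetical 15 TB disk holds on the order of 10 million blocks. At 40-plus bytes of index per block that is roughly 400 MB of RAM per disk just for the index, versus roughly 80 MB at 8 bytes per entry. This is illustrative arithmetic from the figures in the conversation, not Dropbox's actual numbers.]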
And then one actually important piece is that if you feed some key that isn't in the hash table, that wasn't in your original set, to this hash function, it's going to just spit out some garbage value, right? So you actually do need to check, from the value you read off disk, that it actually matches the key you passed in.
Interesting. So you need to make sure. But that doesn't seem too expensive; it seems like a one-time check, right?
Right. And we don't care about making misses cheap. We're okay with misses being expensive.
Yeah, because it seems like that would only happen when there's a problem or an error higher up in the stack.
Yep, some kind of race, or the system is in a weird state and we're trying to re-replicate data, or something like that.
A bit flip or something, at most, it seems like. Yes. That's super interesting. And can you talk a little bit more about the OSD? The perspective that I'm taking is, you know, you're a new grad,
bit more about the osd right so the perspective that i'm taking is you know you're a new grad
and i'm just thinking about new grads in general you've taken like not you in specific but in general like a new grad would
have taken like a bunch of distributed systems classes and they dive into this project and they
have no idea like how to build something like an osd or like what even is an osd and like can you
just introduce what that is and why would you need to build something like that yeah so an OSD is it it's aiming to be a very simple abstraction and like a very simple
interface to solve a hard problem so and the OSD is the software that manages the server
where the server's primary job is to just store a ton of data. And the interface it presents is,
it's almost kind of like a very restricted file system.
You can create a bucket or like this one gigabyte file
in some sense that's append only,
and you can put key value pairs into it.
And then you can, once the bucket is done, you can close it. And then for both open
buckets and close buckets, you can query them by the key and try to find some value. So it's like
a kind of file that's an append-only hash table.
Okay. So it's basically a process that's running one per server, and its job is to manage blocks of data?
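A rough sketch of the "restricted file system" interface just described: append-only one-gigabyte buckets of key/value pairs that can be sealed and then queried. The trait, method names, and types here are guesses for illustration, not the real Magic Pocket OSD API.

```rust
type BlockKey = [u8; 32]; // SHA-256 of the block contents
type BucketId = u64;

trait ObjectStorageDevice {
    /// Create a new, open (append-only) bucket on this disk.
    fn create_bucket(&mut self) -> BucketId;

    /// Append a block to an open bucket. Fails once the bucket is sealed
    /// or would exceed its ~1 GB budget.
    fn put(&mut self, bucket: BucketId, key: BlockKey, value: &[u8]) -> std::io::Result<()>;

    /// Seal the bucket: from now on it is immutable, so the OSD can build
    /// a perfect-hash index over its keys and drop them from memory.
    fn seal(&mut self, bucket: BucketId) -> std::io::Result<()>;

    /// Look up a block by key in either an open or a sealed bucket.
    fn get(&self, bucket: BucketId, key: BlockKey) -> std::io::Result<Option<Vec<u8>>>;

    /// Reclaim space once upper layers decide the data can go away.
    fn delete_bucket(&mut self, bucket: BucketId) -> std::io::Result<()>;
}
```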
Yep. And we ran one per disk, because we had many disks per server.
Okay, interesting. That makes sense. So one OSD's job was to handle one disk, and then you would have multiple disks per server.
Yep, exactly. And some of that was around lifecycle, right? Because you can have one disk go bad in a server, and then it's very easy to just have that one crash or go down and consistently stay down while it's being remediated, and have the other processes that are attached to other disks still be fine.
Okay. So is the OSD something like a disk driver, or is there another piece of the Magic Pocket puzzle that's acting as the disk driver?
like a disk driver?
Yeah, it's a good question.
So by driver here, you mean like the operating system?
Yes, exactly.
Yeah.
So yeah, we use the operating system pretty, we use it for quite a lot of functionality
in magic pocket.
So for OSP one, we use XFS, it was a file system, we had a lot of functionality in Magic Pocket. So for OSD1, we used XFS.
It was a file system we had a lot of experience with
on the database side.
And we had a pretty good sense
what all of its sharp corners were
and how we would be able to achieve
highly durable storage with it.
So each OSD had, I think we had a directory
per bucket, maybe.
We used some encoding of like our primitives onto the file system, had like a file format, and we just used regular files.
And we had to, you know, be careful about how we use the page cache. For example, like if you want to validate that data on disk is correct, you need to make sure you're
actually going to disk and you're not just reading
some value in the page cache that
isn't actually persisted. So
there were things like that, but they were pretty
minimal. And then for
OSD2, we
still use a lot of the Linux storage
stack, but we used ODirect mode.
Is that right? I think so.
We used, I think, O_DIRECT mode on the block device, /dev/whatever, where pread and pwrite let you write to different logical block addresses, the LBAs, so you get to write to sectors on disk. And so there we were still using the driver, right? But we were also using the I/O scheduler.
Okay, so in OSD2, you're bypassing XFS?
Exactly. So we effectively wrote our own file system.
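A minimal sketch of the OSD2-style access pattern described here: open the raw block device with O_DIRECT and issue positioned reads and writes to logical block addresses, bypassing the file system and the page cache. The device path, sizes, and error handling are illustrative; it assumes the `libc` crate for the O_DIRECT flag, and O_DIRECT requires sector-aligned buffers, offsets, and lengths, which is why the buffer is allocated with an explicit alignment.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::fs::OpenOptions;
use std::os::unix::fs::{FileExt, OpenOptionsExt};
use std::slice;

const SECTOR: usize = 512; // logical sector size assumed here

fn main() -> std::io::Result<()> {
    let dev = OpenOptions::new()
        .read(true)
        .write(true)
        .custom_flags(libc::O_DIRECT) // go to the media, not the page cache
        .open("/dev/sdb")?; // hypothetical data disk

    // Sector-aligned buffer, as O_DIRECT requires.
    let len = 4096;
    let layout = Layout::from_size_align(len, SECTOR).unwrap();
    let ptr = unsafe { alloc_zeroed(layout) };
    assert!(!ptr.is_null());
    let buf = unsafe { slice::from_raw_parts_mut(ptr, len) };

    let lba: u64 = 2048; // logical block address, in sectors
    let offset = lba * SECTOR as u64;

    dev.read_exact_at(buf, offset)?; // pread(2) under the hood
    dev.write_all_at(buf, offset)?;  // pwrite(2) under the hood

    unsafe { dealloc(ptr, layout) };
    Ok(())
}
```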
Yeah. And can you describe a little bit about why?
Is it just that you could get better performance or is there something else?
Yeah, that's actually a really good question. The fundamental reason why we did it has to do with this technology of SMR disks. I mentioned that a bit earlier. Regular disks, which are called PMR disks, for perpendicular magnetic recording, were topping out, I think, around four terabytes.
And without actually that many changes to the fundamental hardware,
like the magnetic medium and the heads and everything,
SMR disks were able to go up to about 15 terabytes.
And my knowledge is actually a few years old at this point.
It's also not only is it outdated, but it's also the fog of time.
So the numbers are probably
approximately hopefully close and so the question then is how do they manage to get so much higher
density without really having that many innovations on the fundamental hardware and
My understanding is... I'm not a hard drive expert, and I'm sure there was a ton of engineering work for SMR, so I'm not trying to diminish it.
But my understanding of it is that the main innovation
was writing to the magnetic medium in a different way,
with different geometry,
and by packing sectors in very closely.
So one of the fundamental problems for hard drive storage
is that when you write to one sector, you're magnetizing it, right?
You're, I don't know, my knowledge of it is I don't know exactly what's going on, but it's being magnetized.
And one of the phenomena that limits density is adjacent track interference.
So if you write to one sector, there's some probability that it will affect the sectors around it,
right? And so that puts a limit on how densely you can pack sectors together. So there's all
types of tricks that are done with like error correcting codes and that to try to pack things
in more. But one of the insights for SMR is that if you constrain the write pattern of how sectors are getting written to the medium,
you can pack things in a lot more closely. So SMR stands for shingled magnetic recording.
And the visual analogy here is that if you're laying shingles, and each one of them say it's
like two by two or something, I don't know, just make up some unit.
A shingle is just that thing that's on people's roofs, right?
Exactly, yeah.
Right. So if you imagine how it's laid on your roof, they're overlapping and they're kind of stacked together. And you can actually only... well, I've never laid shingles, but in my mind you can only put them on in a line, because you're stacking one on top of the other. But after you're done laying them in a line, you can look at the little bit peeking out and see what's there. So if you were to imagine storing data on shingles, say it's the color or something: if the shingles are non-overlapping, then you can read and you can write to any one of the shingles at any point in time. But if you put them in a line, you can only append to the end, though you can read at any point, because you can see the color that's poking out. So that's the core insight with SMR and how it's able to pack a lot more data in without fundamentally changing the physics of it.
So where does that get us back to the Linux stack? The interface that the device presents no longer has random writes. Roughly, there's a concept of zones, which I think for ours were 256 megabytes. And for each zone, there's a write pointer, which you can reset all the way back to the beginning, which would be like just throwing out all the shingles. Or you can append to it, and then you can read from any point before the write pointer.
So with that restricted API, file systems just don't work on it.
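A sketch of the zoned-device contract just described: each zone has a write pointer, you can only append at the write pointer, read anywhere below it, or reset the whole zone. Real SMR and zoned block devices expose this through kernel zoned-device APIs or raw commands; this is just the in-memory shape of the rules, with the 256 MB zone size mentioned above.

```rust
const ZONE_SIZE: u64 = 256 * 1024 * 1024;

struct Zone {
    write_pointer: u64, // bytes written so far, 0..=ZONE_SIZE
}

impl Zone {
    /// Sequential write at the write pointer; random writes are not allowed.
    fn append(&mut self, data: &[u8]) -> Result<u64, &'static str> {
        if self.write_pointer + data.len() as u64 > ZONE_SIZE {
            return Err("zone full");
        }
        let at = self.write_pointer;
        // ... issue the sequential write to the device at `at` ...
        self.write_pointer += data.len() as u64;
        Ok(at)
    }

    /// Reads are fine anywhere before the write pointer.
    fn can_read(&self, offset: u64, len: u64) -> bool {
        offset + len <= self.write_pointer
    }

    /// Reset throws away everything in the zone ("throw out all the shingles").
    fn reset(&mut self) {
        self.write_pointer = 0;
    }
}
```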
That makes sense.
And yeah, basically, it's like if you had shingles all spread out, they just wouldn't be that dense, so you would need more space. But the fact that you can put them all together, and append-only, is perfect for the Magic Pocket use case, it seems like, because that's how Magic Pocket was designed initially anyway. And you already know roughly how big each block is, so you can roughly know how many shingles you need. Maybe this is going too much into the analogy, but...
It totally helps. Yeah, well, for example, with these... I just blanked on the name, the 256-megabyte grouping.
The zones.
The zones, yeah.
With the zones, like that actually worked out really well for us because we have one gigabyte buckets, right?
We group data into one gigabyte like files that are append only.
And then we just made those four zones and you would just continue to write through it.
And that was super easy.
Interesting. So one thing that I would be really scared about with rolling out this project, and I think most people would be, is just making sure that you're not losing customer data, right? Even just with the regular OSD1, where you're using XFS. How did you, or how did your team, verify that your data is accurate, so that you feel confident about deleting it from S3?
I mean, there's so many layers to doing this.
I think there is not a single answer.
I mean, the answer is defense in depth, right?
Like try everything.
So at the very highest layer,
we had a lot of consistency checkers.
So there are some non-trivial invariants
between the metadata indexing and stuff that's on disk.
So being able to join those two
and periodically check and make sure that they're in sync.
And that process of doing consistency checking
and production was also tied to our release process
where we had a staging cluster
and we would be guaranteed
that by the time our release gets promoted
from the staging cluster to the first zone,
that cycle of those consistency checkers has progressed.
There's also, we were a multi-zone storage system.
So having the release
process push new code out to one zone before the others. And then having a set of qualifications
that that zone would have to pass to enable promoting its deployment to the others. So that was in production. The consistency checkers are a check-everything, check-that-the-invariants-match approach. There was also a statistical sampler, where we would just check that we can retrieve blocks that have been recently written, with some probability, just sampling that, and then doing slightly more expensive things.
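A sketch of that "statistical sampler" idea: every so often, re-read a block shortly after it was written and check that its bytes still hash back to its key. The function and its parameters are illustrative, not the real Magic Pocket checker; it assumes the `sha2` crate for SHA-256.

```rust
use sha2::{Digest, Sha256};

/// `fetch` is a hypothetical "read block by key" entry point into the store.
fn verify_recent_write(
    key: [u8; 32],
    fetch: impl Fn(&[u8; 32]) -> Option<Vec<u8>>,
) -> Result<(), String> {
    let block = fetch(&key).ok_or("recently written block is missing")?;
    // The key is the SHA-256 of the contents, so re-hashing the bytes we got
    // back is a cheap end-to-end durability/consistency probe.
    let rehashed: [u8; 32] = Sha256::digest(&block).into();
    if rehashed == key {
        Ok(())
    } else {
        Err("block contents do not match their key".into())
    }
}
```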
We had a very extensive testing suite.
So we had both obviously like unit tests,
but then also some integration tests,
test the system as a whole.
We spent a lot of time
qualifying hardware; that was more on the hardware team, so I didn't have as much access to it.
For the process of first releasing Magic Pocket, we had this (I still remember it) huge spreadsheet, kind of like a premortem: all of the potential things that we could think of going wrong, the probability of them happening, the downside if they did happen, and then mitigations and ways to validate them. We actually went into the data center and pulled out some plugs and tried to validate that the hardware behaved in ways that we thought.
I think we even overheated one of our servers intentionally.
I can't remember.
But yeah.
How did we validate the system?
Yes.
There was a lot of stuff.
It was not one single thing, but I think this process of just being extremely paranoid and having many layers of checks.
This might involve a hot take, and you don't have to answer if you don't feel like it.
Yeah, I love hot takes.
Do you think, you know, five years later, or six, I think it's been longer than that now, the project was worth it? Or do you think that... because the way I look at it is that cloud prices have gone down significantly, and I don't know how much that affects the dynamics of building your own thing.
Yeah, I mean, I think it's a good question, and I think it's a little bit above my pay grade for the decision at that time. It's easy to wax poetic after the fact, but at that point in time, was it the right decision? I think yes. I think it helped our company financials a lot. It was also, you know, part of, an ingredient of, us being able to go and IPO. And yeah, I mean, I think the parts that are difficult
that are unavoidable with running a system like this
is that it's a really hard thing to do well, right?
And not ever mess up.
And being able, being on the hook for that
as opposed to being able to outsource it
to some other company is,
that's a big commitment, right? It's a commitment on the software engineering side but also on the SRE side. There are a lot of databases involved, there's a whole supply chain, there's networking, there's hardware. For it to be worth Dropbox doing it, we can't just tread water.
We need to be able to continue to remain competitive and beat Amazon S3, right?
Where that is one of their core products.
So yeah, I mean, it's definitely, it's, you know, it was a very, very successful project,
but it didn't come for free.
Yeah, that makes sense.
And I like the second reason in your framework for going with Magic Pocket: company strategy and owning your own destiny. I think that might be reason enough, as long as it's not costing, like, ten times as much as S3 or something. The fact that you're not dependent, as much as possible, on another company completely for your core, multi-billion-dollar business, right?
Like most companies can easily switch, not easily, but relatively easily switch clouds.
It'll take like a few months, but it's possible.
And it's even possible to go like multi-cloud. But for something like Dropbox, where you have to transfer exabytes of data, it's just not physically possible.
Right.
Yeah.
So that is a very, very interesting project.
And maybe you can tell us a little bit about the actual data transfer.
Because from my understanding, you can't actually physically transfer data that quickly. At least, network cards are just not that fast when you have so much.
Yeah, let's see. We were, at our peak, at around half a terabit a second, so that's about 62 and a half gigabytes a second, nothing to sneeze at.
Right, so 500 gigabits.
Yeah, I think I did that right, it's 500 gigabits. One of the really difficult things at the beginning was being able to push this type of bandwidth over the public internet. So we would buy transit, just like normal metered internet service from your ISP, almost like buying bandwidth. We would buy transit from these different providers, and we would sometimes break the internet for a certain number of people, because we bought all this bandwidth from different providers and it turned out that they shared a common link, and our path from our data center to Amazon didn't actually have as much bandwidth as we thought we did. And eventually we ended up getting peering,
which is, I wish I knew better the actual,
like what happened on the ground.
I mean, my mental image is like us having data centers
near Amazon's and then having a fiber line
directly between them.
I'm not sure if there were like people laying fiber
and like plugging into the Ethernet jacks
or exactly what that was.
But the idea there that it's just was mutually beneficial for both us and Amazon to just have a direct link as opposed to both of us having to pay to
go over the internet.
Okay.
Interesting.
Yeah.
So that was one constraint, from the bandwidth perspective. And then, to your point, each one of these machines that we used for this backfill system, I think they started off with one-gigabit Ethernet cards, and then we
eventually got 10 gigabit and we would use close to the whole thing. You know, it's not, it's actually pretty amazing, right?
It's not that hard to saturate even 10 gigabits with just a bunch of TCP connections in parallel, right?
And we might actually talk a little bit more about this, because it's very related to the research I was doing after I left Dropbox. And then, from there, it was just making sure that we thought through the queuing theory of how backfill works pretty well.
So we don't overwhelm any of the resources.
So overwhelming S3 on the reads,
overwhelming Magic Pocket on the writes.
But then there's tons of things in the middle.
Like, we needed to consult the file system metadata to know which blocks to transfer, so that was doing a read to MySQL, talking to secondaries, and all that. And what else... it's been a long time since I thought about the system. There were a lot of other resources. It was a heterogeneous resource system. I had
implemented this distributed token bucket algorithm that was very flexible. It was very simple,
just like a distributed rate limiter. But it ended up being really helpful because then you could
just put a rate limiter on everything and make sure that the throughput for any of these subsystems doesn't exceed within some bound.
And then, as long as the queue setup of it is right, thinking about it from a queuing theory perspective, when one thing backs up, back pressure propagates through the system, and everything just kind of worked out.
And I say this like it's something really simple and obvious,
but it took us a long time to get there.
We had plenty of times of just like trying to like hack in something and
then causing it to overflow.
And then that causing some other system to go down and then paging some
other teams.
So yeah,
I think the lesson there is like be disciplined about the queuing theory
and thinking about back pressure and all that.
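A minimal, single-process token bucket of the kind described above: each subsystem (S3 reads, Magic Pocket writes, MySQL metadata reads) gets one, and work only proceeds when it can take a token. The distributed version Sujay mentions would additionally share the budget across machines; that coordination is not shown, and the names here are illustrative.

```rust
use std::time::Instant;

struct TokenBucket {
    capacity: f64,       // maximum burst, in tokens
    tokens: f64,         // currently available tokens
    refill_per_sec: f64, // steady-state rate limit
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        TokenBucket {
            capacity,
            tokens: capacity,
            refill_per_sec,
            last_refill: Instant::now(),
        }
    }

    /// Try to spend `cost` tokens; returns false if the caller should back
    /// off, which is what propagates back pressure up the pipeline.
    fn try_acquire(&mut self, cost: f64) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= cost {
            self.tokens -= cost;
            true
        } else {
            false
        }
    }
}
```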
Yeah.
That is so interesting.
And like one thing that I haven't ever heard in my life,
I think is overwhelming S3.
Like I don't think I've heard these two words put together.
Yeah, and you can correct me if I'm wrong,
but you were talking about like saturating your network cards,
like the limit was like 10 gigabits or something.
And I was reading your research, and I read somewhere that 30 percent of the time, when you're actually doing these network calls, goes into syscall overhead. And this is Linux, and it's even worse on Mac and Windows. Is that accurate, and why is that?
Yeah. So the numbers, I think,
they depend kind of on your hardware like what type of hardware offloads you have.
So how much of the TCP stack can you just hand over to your NIC?
So at the very basic level, that's like doing checksumming.
Like TCP has a checksum that actually checksums the user data,
the body of the segment.
And that's implemented in a lot of NICs.
So that's a huge win for high throughput flows.
And then there's more sophisticated stuff like actually being able to offload
the segmentation.
So you can just present the NIC with a big buffer and then it will break that
up into smaller TCP segments. Yeah.
And I think the overall perspective that's pretty interesting is that the hardware has gotten really, really good, both on the throughput side, where we're getting 100-gigabit NICs and it's going even further, and also on the latency side. I think the latency side is the part that was pretty underserved and was the basis for the research I was doing. If
you have two servers that are in the data center,
they have one, say they're on the same rack,
so they have just the top of rack switch.
The switch itself will add
just like a microsecond of latency.
And then the propagation time through the cables
is going to be, I think,
in like the hundreds of nanoseconds.
So your round-trip times in the data center should be
in the low microseconds.
But
if you, even with an exceptionally
well-tuned Linux kernel,
you're going to be in the 50
microsecond territory.
And then with anything
that's just not hyper
well-tuned, it's going to be in the hundreds
of microseconds or even milliseconds, right?
That is, yeah, that's surprising.
Like, my heuristic for intra-data-center communication has always been less than a millisecond, but I always assumed it was speed of light or something. I mean, I guess it's overhead from software rather than from hardware.
And the switches are really good, right?
The switches are not adding that much.
It's really, it's kind of, it's tragic, right?
And that you have this like highly engineered pipeline where you have like the path of going from user space to kernel space through the network
stack through the driver and then off from the driver to the device over pcie say and then from
there to over the physical medium through some switches and then up the same stack on the other
side and it's like all that stuff in the middle is really, really good.
And like 90 plus percent
of the latency
is actually on the endpoints.
Yeah.
Do you know why the kernel
adds that much overhead
even if it's like well-tuned?
Like I've always read
that syscalls are high overhead,
but what does that mean?
Yeah.
Yeah.
So yeah,
one of the first kind of perspectives
is that context switching is expensive. And so there's context switching between kernel space and user space. And it's been announced, I mean, even with a month, my fog of memory, there's even some switching between different subsystems within the kernel. There's a really good paper on this. I can't remember what it is.
I think it might actually be unpublished.
It might not be published yet.
But anyway, there's a really,
and there's another really good paper
from a group at Stanford.
The paper is called Dune,
and it's their like operating system architecture
that uses virtualization
to do some pretty interesting stuff.
But we don't have to get into that.
But one of the core insights from the paper is that syscalls are very expensive, you know, hundreds of cycles, hundreds of nanoseconds plus, even for just a no-op syscall, because of their indirect effects. Actually executing the mode switch from ring three to ring zero, going from user mode into protected mode, is not that expensive. The things that are expensive are that, when we have these big monolithic kernels that run everything together, like the kitchen sink, you just totally blow your L1 and even L2 cache. I think the L1 cache, the instruction cache
and the data cache are totally blown out.
And then I think the L2 cache, a lot of that gets blown out too.
So when you enter the kernel, you have a bunch of cache misses.
And then when you come back into user space,
you have a bunch more cache misses as well
because a kernel brings in all of these data structures
and all of this code because it's this big monolithic thing um so that's like one of the core ingredients for why system calls and like context switches in
general are expensive that is so interesting so like i know i've said that so much now but this
is this is interesting it's like the kernel itself is so big that it's data structures
like it's just like things like logging structures, I'm guessing,
all sorts of things like that are so big
that your caches get overwhelmed when you're switching
from like ring three to ring zero.
That is not what I thought.
I thought there would be something to do with like, you know,
an interrupt in like your hardware or something causing the problem.
Yeah.
And this is actually the Dune paper is really cool.
And we will talk about a little bit because it's just cool stuff.
So what their system does is, say you're implementing a programming language runtime and you want memory management through garbage collection
or something, you want to be able to use hardware features for
having some protected data, maybe like read-only pages that the program you're executing can't
touch. And then maybe you want to do things like a read barrier where you, while you're garbage
collecting some page in memory, if the application tries to read from it, it's going to fault, and there you're also using those faults. Excuse me.
You're also making it so that the user program is going to trap into the kernel.
And the premise of the paper is that if the kernel is small, these features can be implemented at a
very low cost. So then what they do is use virtualization support. The setup is that, for virtualization, you have host mode and guest mode. So imagine a two-by-two matrix: you have host mode and guest mode, and then within each mode you have ring zero and ring three, for kernel mode and user mode. Let me just go get some water real quick.
Yeah, I have more questions already lined up.
Sorry, I think I just haven't been talking this much in a while.
So, right.
So you have this two by two matrix of host mode, guest mode, ring zero, ring three.
And the Linux kernel runs in host mode in ring 0.
That's what you'd expect.
And then regular applications run in host mode ring 3.
But then a Dune process where it has a split between, like, the language runtime and managed code,
it runs the runtime in ring zero of guest mode.
And then it runs the user code there,
like whatever code is being executed as ring three in guest mode.
So then switches between ring zero and ring three in guest mode
are pretty cheap.
But then if you need to perform like a real syscall
that is actually talking to the file system,
then you need to switch into host mode and then go to execute in ring zero there.
Yeah, so your traditional expensive syscalls that are doing I.O. and looking at the file
system might still get expensive, but there's so many random syscalls that you can just
skip out on.
And does the paper show that Dune is actually faster
than running a program directly in the user mode?
Yeah, I mean, they have some pretty good evaluation
on what the switching costs are.
I'm trying to, I can't remember offhand
what the programs they use for their evaluation.
But yeah, I remember garbage collection and read barriers being one example. I mean, the alternative would be writing a regular program against Linux that uses mprotect to protect pages and then goes and traps into the kernel, and that just sounds like it'd be very slow.
Yeah. It just reminds me of Node.js, because I've been looking at syscalls from Node.js, and I think Node does that,
or I guess V8 does that quite a lot.
And that makes sense.
So that makes me think about like kernel bypass, right?
Is that the reason why kernel bypass was developed?
And maybe you can talk about it to like listeners,
like what is kernel bypass?
Because I also want to like figure out
if that's something that they have, like is kernel bypass? Because I also want to figure out if that's something that the kernel developers have developed
or is it something external?
Yeah, I think it's pretty related.
I think a lot of people who work on the kernel
have worked on these user space approaches.
And the perspective here is that
one of the jobs of the kernel is to multiplex resources right to like
share resources like an ethernet device across multiple user space programs and that's why it
needs hardware protection, right? Because it alone is the arbiter on how to share that resource, and user space can't circumvent that. So if you relax that assumption: say I'm running a database machine, and I have maybe a one-gigabit, really cheap card for just SSHing in and pushing code and stuff, but then I have this 40-gigabit Ethernet card that all the data actually flows through, and all I'm running on this server is just a database process. Then why does the 40-gigabit card need to get managed by the kernel? It's not getting shared.
So the idea is that you would run the driver, roughly the same driver that would be in kernel space, in user space. You set up all the memory mappings to map the device memory, and the IOMMU and everything, into user memory, and then run the driver that is manipulating all the ring buffers and ringing the doorbell to tell the device that there's new data on them.
And then when you do that,
you just cut out a lot of the overheads
of having to go through the kernel.
You do it directly.
Yeah, and at the same time,
you lose that kind of trust, because it's a user-space program, but you don't care anymore. It's in max power mode, right? You're going with "trust me, I know what I'm doing, but then give me this much performance." And if syscall overhead is, like, 30 percent, you get basically 30 percent more performance for free, it sounds like. And what have you seen in practice?
Right. So this is a really interesting point, and
this is one of the motivations
for the research
that I'm working on
with my advisor,
Irene Zhang,
at Microsoft Research.
And the observation here
is that the hardware is great.
And now actually
with kernel bypass,
say for networking,
DPDK is like the common library
for doing this.
It's actually possible to achieve
these really low latencies and really high throughputs. But for some reason people haven't been doing it; it just doesn't really exist out there much. And part of the reason, at least the reason that we've identified for a research thesis, is that the really low-level interface of talking to DPDK, where, you know, you're essentially talking to a driver directly, is, one, very developer-unfriendly.
But two, it's also a moving target, right?
Like, for different devices, you have to configure them in different
ways and the operating system just takes care
of this for you when you use sockets, right?
So our work has been
on making a very high-level interface
that looks like sockets
and
lets you use DPDK
or we even have like RDMA support.
We support other types of hardware too.
But it's engineered in a way so that you can still have microsecond-scale latency.
So it doesn't add overhead
like going through the kernel would.
And you can have your abstraction
while maintaining performance.
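A hedged sketch of the kind of "looks like sockets, runs on kernel bypass" interface being described. The names and types are invented for illustration and are not the actual API of the research system; the key idea is that I/O is asynchronous and completions are polled entirely in user space, with no syscalls on the data path.

```rust
type QueueDescriptor = u32;
type Token = u64;

trait BypassSockets {
    /// Socket-like handles, but backed by a user-space stack over DPDK/RDMA.
    fn socket(&mut self) -> QueueDescriptor;
    fn connect(&mut self, qd: QueueDescriptor, addr: std::net::SocketAddrV4) -> Token;

    /// Hand a buffer to the user-space TCP stack; returns a token to wait on.
    fn push(&mut self, qd: QueueDescriptor, buf: &[u8]) -> Token;
    /// Ask for the next incoming payload on this queue.
    fn pop(&mut self, qd: QueueDescriptor) -> Token;

    /// Poll the user-space stack (and the NIC rings) until one of the
    /// outstanding operations completes; no trap into the kernel.
    fn wait_any(&mut self, tokens: &[Token]) -> (Token, Vec<u8>);
}
```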
It sounds like you're building a microkernel
for a specific use, like a small kernel for networking, in a sense.
Yeah, totally. And it's kind of like a lot of the library OS research, right? Having stuff that people can link into their programs and run in user space, where you pick and choose what you want, and it's all a lot smaller than this big monolithic thing that does everything.
Do you know if there's any work towards like the Linux kernel to like cut
its size down or it just seems really hard at this point?
I think it's also, I mean,
Linux is inherently a monolithic kernel by design, right? And that's a virtue in some ways, right? So that's probably not changing.
Who would be the users
of the work that you're doing?
And the people I can imagine
are just like Azure, AWS, GCP.
They care about, you know, getting things as performant as possible.
Yeah.
And I think one other motivation
that's related directly
to this idea of accounting
for CPU cycles is that
a lot of people we talk to, because one of the cool
things about working in Microsoft Research is you get to
talk to internal teams at Microsoft,
right? And a
lot of people actually really care about cost efficiency,
which is good to hear.
But the
latency benefits of running some really simple server,
just like a UDP relay server, are great.
But what people really care about is being able to run it a lot cheaper
because all the cycles that aren't being wasted on going to the kernel
are now cycles that they can sell for Azure to other uses.
Yeah, so my assumption is basically wrong.
And even if you're running a company like Discord or Slack
or like WhatsApp, where you need to have a lot of connections
and there's basically generally going to be a lot of syscall overhead
because you're going to be receiving messages
and there's going to be a lot of IO on their chat servers,
they could use something like this potentially.
Yeah, so I think, for them, I don't know if the cost motivation is as strong, right?
Because I think for a hyperscaler,
like your Googles, your Amazons, your Microsofts,
they have like a lot of elasticity
to be able to say that,
like if I cut this percentage out,
I can then go sell it.
Whereas if you're running your own infrastructure, but on a much smaller scale, I think that might not be as much of a motivation; the marginal return on efficiency is actually probably pretty low. But yeah, I mean, I think the vision for the work that Irene and I have been working on is that it should just be so easy to do this, and get 100x better latency, that why wouldn't you do it?
And do you think any of this work could have been applied to like Magic Pocket like retroactively?
Or do you think it wouldn't make sense?
That's a really good question because my first exposure to kernel bypass
was actually working on Magic Pocket.
For the second OSD, we talked before about ratios, right?
And the one really important ratio is the amount of CPU
that was needed per unit gigabyte on disk.
And there were a lot of cliffs there for price.
If we could manage to use a cheaper CPU for this like hardware revision,
that would save us a ton of money and being able to use that then implied a
particular ratio. And when we were launching,
we thought we would actually like,
we thought that using TCP through the Linux kernel wouldn't actually be good
enough for us to hit our price
goals. So we spent a bunch of time profiling it, and I think we started to prototype some kernel bypass stuff, going over UDP, and there were a lot of other things that had to change. Funny enough, one of my main contributions for this research project is a TCP stack for kernel bypass. If only that had existed back in the day; there was definitely nothing production-ready then.
So we were trying a bunch of things.
We had some prototypes for kernel bypass.
It definitely would have done
what we needed um but it turned out that actually just upgrading the linux kernel was good enough
interesting yeah um and in your experience like you you mentioned that there's like the libraries
the existing libraries you mentioned like dpdk or something is too low level for people to use like
you have to think too much about,
it's basically like using a driver directly, it sounds like.
Do you think that is the biggest barrier
or like once you have a better interface,
like that's it, it's relatively easy for people
to like deploy their own like kernel bypass?
Yeah, it's a good question.
I mean, I think it's a big factor, right?
Because I mean, you're like reading and writing ethernet frames directly, right? That's like, pretty hardcore. There's no TCP, there's no UDP. I mean, IP support is not that hard. But like ARP, for example, right? Like being able to resolve IP addresses and MAC addresses, like that is not included in the box. So that was all stuff that we had to implement
as part of our project.
So assuming you have all that,
I mean, one barrier is that
it is constrained within cloud environments.
Like there is kernel bypass support
within both AWS and Azure,
but it's not as prolific
as their general offerings.
There's also like,
it's pretty interesting,
the overlap between kernel bypass and virtualization.
So you can imagine that, like, how would you provide kernel bypass
in a virtualized environment?
And the idea is that there's actually a lot of interesting stuff
happening on the NIC hardware side to present multiple virtual devices. So kind of previously how, like, the kernel was sharing a NIC hardware side to present multiple virtual devices.
So kind of previously how like the kernel was sharing a NIC
across multiple processes,
that sharing has actually been pushed into the hardware
where your physical device can actually present it.
You can configure it to have multiple virtual devices.
And there was some pretty cool work, I think at MSR, where I think they embedded an FPGA into the NIC. I wish I remembered this better, but all the VPC rules, the network policies and stuff, get compiled down to something that actually runs on the FPGA on the NIC, which then does that isolation, but in hardware.
Super interesting.
Yeah.
I was imagining that kernel bypass would only be allowed on like bare metal instances or
something, but it sounds like there might be support for even like non-bare metal instances.
That's super interesting.
And maybe like a couple of questions to tie this up is around Rust.
You have a lot of experience with Rust, both on Magic Pocket and in your work at Microsoft Research, so how has that experience been?
Yeah, it's been great. I mean, I would say, well, not a full-time Rust developer, but I've been an engineer using Rust full-time since, what, maybe 2014? Something like 2014, 2015. And it's
on Magic Pocket, we introduced
Rust for the OSD
2, and there
Rust was a big part of us being able to have
a lot of control over memory.
And not having a garbage collector was actually a pretty big part of that
because we needed to control,
like the OSD is a very throughput oriented system.
So having a lot of control and knowing when memory will get freed,
like if a storm of requests comes in,
and if we hold on to that memory for any longer
than it needs to get held, then that will stack up, right?
And then I think if it's like a queue backing up,
then we might just run out of memory, right, on that machine.
So having deterministic resource finalization
was a big part of what made Rust work for that project.
We also had a similar system called the volume manager, similar in that it's throughput-oriented, which did all the erasure coding: it just streams in multiple streams of data, does the erasure coding, and then streams them out. I did the rewrite of it from Go to Rust, and
one of the things that was really great
with the new system was just that
it's like resource utilization is very flat
and very predictable.
So yeah.
And then after that, working on Nucleus, we did that system in Rust too. There were some performance elements of being able to pack things in memory very tightly, but I think the main value add was around encoding invariants in the type system, taking some of that Haskell-like approach and applying it to a systems language. And then the async/await stuff is also really sweet. And then my research is actually also in Rust, so I implemented that TCP stack and all the network things in Rust.
And some of the things that are, I think, quite novel from the language side is like
the async await, having like user space scheduling for these coroutines and not requiring
heap allocation is like pretty novel.
I don't know of any other language that provides it, and it was actually key to the success of the research project, because we implement all the TCP state machine algorithms, and there's just a lot of stuff in there, right? Doing retransmissions, handling pure acknowledgements, there's the persist state, there's all this stuff. And being able to implement it as coroutines in a very compositional way makes it really pleasant to work on that stack; if you go look at other TCP implementations, it's really, really hard to reason about what's going on. But then we can run these coroutines that manage TCP state with really, really low overhead. For the scheduler I wrote, it's, like, tens of nanoseconds for switching between them, and there are no allocations in the critical path.
One of the interesting parts of programming at microsecond scale is that even just using a heap allocator, just regular libc malloc and free, can get to the order of hundreds of nanoseconds to a few microseconds on the slow path. A lot of allocators have really optimized fast paths, but if your round-trip time target is, like, six microseconds, then allocating at all is not an option.
How do you prevent heap allocation? I can't understand how you can prevent that when you're providing something like coroutines.
Yeah.
And the async await stuff is really, really cool.
So the kind of the idea is that you write a function that just looks like regular code.
It has like regular control flow.
And then the compiler does what you would imagine doing yourself if you were writing this in C: it takes the control-flow graph and turns it into a state machine, where there's a variant of an enumeration (what's the word I'm looking for, in Rust they're called enums, and it's like a union of a bunch of types), with a variant per suspension point. At every suspension point, when you're suspending, thinking about this even from the OS perspective, you need to save the stack, right? So the data stored in each variant is all of the stack variables that are live at that point. And then, when you think of that enum, which covers all of the suspension points and, for each suspension point, all the variables that are live, the size of that enum is a tag to indicate which variant you're in, plus the maximum size of any of the suspension points: essentially the maximum stack size in that function.
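A hand-written approximation of what the compiler does to an async fn, as described above: each suspension point becomes an enum variant holding the locals that are live there, so the whole future has a fixed, statically known size. This is an illustration of the idea, with made-up variants, not the actual generated code.

```rust
#[allow(dead_code)]
enum ReadThenWrite {
    // Before the first await: we still hold the request we want to send.
    Start { request: [u8; 64] },
    // Suspended at the send(...).await point: only the connection id is live.
    WaitingForSend { conn: u32 },
    // Suspended at the recv().await point: we keep the buffer we will return.
    WaitingForRecv { conn: u32, buf: [u8; 1500] },
    Done,
}

fn main() {
    // The tag plus the largest variant bounds the "stack" of this coroutine,
    // so a runtime can store it inline with no heap allocation.
    println!("{} bytes", std::mem::size_of::<ReadThenWrite>());
}
```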
Got it.
So I think the key insight there was the compiler converting your entire program into a state machine up front, so it knows exactly, you have all of that information up front on the stack size.
Yeah, exactly. And I think it's one of those things that's really cool from a PL-innovation perspective: when you have an async/await program like that, there are all these nested state machines, right? When you have a function call, say you have three enum variants for the top thing, so three suspension points, and then when you await some other function call, variant two might have five variants nested inside of it. So there are these nested state machines, and put together they make this big one, you know? And the maximum size of that is actually known
at compile time. So the compiler knows exactly how much stack space it needs to allocate for
this lightweight thread. And that is something that like, for example, with go or in rust,
when they had like a green threads runtime, that there's been a lot of struggle with
because on the Go side,
I can't remember what their default stack size is
for Go routines.
I think it's maybe a kilobyte.
But they don't know up front
how big of a stack they need to provide.
So then there's these weird performance cliffs
where if you just add a new stack variable
to your function,
that could add a stack split that gets detected at runtime
and then there's
an expensive path to go and allocate
a new stack. I think they have a linked list of chunks
for their stack representation.
Where on the Rust side,
you just know up front, this is
what it needs. And then
the question of
recursion comes up, then it's like
you can't have cycles in your async call graph
because then you wouldn't know, right, at compile time.
But then there's a way to get around that by using heap allocation to break those cycles in specific cases.
That's, I think that's fascinating.
Is that like, I think with Rust, like compile times
and like your binary size might blow up
a little bit because of all of these things,
but then you get amazing performance at runtime.
I remember you sent me a link
a long time ago on
Rust's compilation time improvements
and they were redesigning the compiler.
I'm sure they've had improvements, but I'm curious at what magnitude and what scale those have been.
Yeah. I don't know that technology that well, and I don't know the actual numbers, but in terms of developer experience, I've actually been quite happy with it.
They've done a lot of work for incremental compilation.
So when you make a small change,
making it so that the next compile doesn't take very long,
there's a whole algorithm they have
for expressing data flow within the compiler.
It's like a DAG.
And then knowing when nodes in the DAG have changed
and then downstream results need to get reevaluated.
So that stuff's super cool.
And yeah, working at a larger company, working on Nucleus, say, for example, compile times were a little bit more of a pain there, because it was a really large code base and a large binary; the binary at the end was huge. But after having left, working on the Microsoft Research stuff and also some of the stealth startup stuff that I've been doing, compiles have been totally fine.
Yeah, you only have to worry about
it once it's really large, I'm guessing. And it also seems like a lot of languages, you know, TypeScript, Python, with the async/await stuff, are all converging on this feature set, and it's kind of nice that you can basically pick the right tool based on what your constraints are and switch around when you realize, yeah, you know, I need these other things that are just not possible here.
There's a really snarky paper, well, maybe not that snarky, it was a pretty good paper, I remember reading it, I have it somewhere in my MSR journal, arguing that all this user-space scheduling is a fad and that people just end up going back to kernel threads, you know, threads multiplexed by the kernel, at some point. It's an interesting perspective, right? The ecosystem is moving towards having language-level support, as opposed to OS support, for concurrency and swapping context in
and out. And some of that is motivated by the fact that threads are not meeting people's needs,
right? And, you know, maybe we could just make threads better.
Yeah, I think in that world, like the Linux kernel will have to get smaller.
Yeah. Yeah.
Yeah.
And yeah, I don't have a... There's definitely a context switch overhead.
And yeah, like creation overhead.
I think some of the stuff we've talked about
with statically knowing the stack size
is impossible without language support, right?
So that's definitely unavailable to the operating system.
Yeah, it's kind of an interesting debate, right?
Yeah.
Well, I wish you the best of luck on your stealth startup. I hope everything goes super well there, and I'm sure you'll figure something awesome out. And thanks for being a guest here. I think this was a lot of fun, and I feel like I learned a lot. Even though I've spoken to you a lot in the past, I still feel like I learned a lot again.