Software at Scale - Software at Scale 22 - Sujay Jayakar

Episode Date: June 2, 2021

Sujay Jayakar was a Software Engineer at Microsoft Research where he worked on kernel bypass networking. He was previously a Principal Engineer at Dropbox where he worked on the migration of user data... from S3 to the internal storage system (Magic Pocket), and the sync engine deployed to clients.

Highlights

05:00 - What framework do you use to decide to stop using S3 and store data in your own data centers? (the “Epic Exodus” story)
11:00 - Perfect Hashtables and how they are used in production systems to reduce memory use
14:00 - What is an OSD (Object Storage Device)? How does it work?
20:30 - SMR drives
30:00 - The actual data migration - how did it happen, and how does one validate that the data being transferred is correct.
33:00 - “S3 being overwhelmed”. That’s a string of words most software developers don’t expect to hear. What kind of overhead do kernels impose on networking, and why?
43:00 - What is Kernel Bypass Networking?

Transcript
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Hey, welcome to another episode of the Software at Scale podcast. Joining me here is Sujay Jayakar, and I hope I said that correctly. Yeah, you got it. Yeah, and you're an ex-research engineer at Microsoft, you're working on a stealth startup now, I think since like two or three weeks ago or more, yeah, and you were an engineer at Dropbox for like
Starting point is 00:00:37 eight years where you worked on like magic pocket uh sync and a bunch of other things yep most notably our email system cool and yeah thanks for being a guest thanks for having me yeah so why don't we get started with maybe Magic Pocket because I think that's where you started your career like maybe
Starting point is 00:01:00 after the email system I had done I mean when I joined Dropbox it was a bit smaller. So, I mean, it was kind of, there was a phase of everyone doing everything. But yeah, I mean, that was, I think, one of the first like really big, chunky projects I had gotten involved with. Okay. And what exactly were you doing for that?
Starting point is 00:01:20 Yeah. So I joined as an engineer when the project was getting a little bit more serious and we started staffing it up um and then so i worked as an ic throughout my entire time on that project so i worked um first on like a big push to get the system code complete so for those listening magic pocket is an in-house storage system that we built at Dropbox that is very similar to Amazon S3. So Dropbox has exabytes of user data, and we need to store it somewhere. So at the beginning, we started in Amazon S3, which was totally the right decision, right? It's a very hard problem to store data at scale. And, you know, S3 is a great product and at a certain point it
Starting point is 00:02:07 became it made more sense for us to do it ourselves so we had this project called magic pocket that was kind of running around in like prototype-ish stage for a while and then at a certain point we uh leadership and infrastructure decided that we wanted to go full steam so we started staffing up the project so that's when i was brought in and the first few months were just us like building the storage system from scratch and just it was kind of amazing like how much we got done very quickly so um i had worked on the one of the metadata servers that stores and helps clients figure out where a particular object is stored on disk. And one of the levels, it's like a multi-level indexing scheme. So I worked on one of the levels there.
Starting point is 00:02:59 And then I also worked on our backfill system, which did the data migration from Amazon S3 over the internet and then eventually had peering to our own cluster. And then after that, we had kind of gotten the rollout going. We had qualified the system. We felt very confident in it, and we were starting to do the dance of getting supply chain to bring up new servers and move the data over at a rate that matched our ramp up for the hardware, but was really as fast as we wanted to. And then there were also some contractual things with S3 with how much data we needed to have in there. So it was a really like resource management problem. But then in parallel with that, we started working on the second generation of the system and approaching it from the perspective of cost. One of the opportunities that one of the tech leads on the team identified was that if we could increase the density a lot on our storage
Starting point is 00:04:06 node, which comprises the majority of the system. This is like a server that just has a ton of disks, and all it needs to do is put a key to a value, and the value is like a few megabytes, and then serve it, and then eventually also be able to delete it. So if we could increase density a lot there, so I think we started off having one server manage, like, I think, roughly like 44 terabyte disks. And if we could, one, take advantage of these new SMR disks that were coming to market, which each disk could get up to like 15 terabytes, and then get up to like 60 of those or even more, then that would get our price per gigabyte down. By a lot.
Starting point is 00:04:56 So I was on the team. It was a team of three of us on the software side to build this out. And there was a lot of design, it was a really interesting design process, where we started with what we wanted from the hardware and then found out what the implications were on software, built that out, and then helped qualify it and start release. And then I moved to working on Nucleus after that, which, you had one of my colleagues, Nipun, on a previous podcast.
Starting point is 00:05:28 Yeah, I think the history of like Magic Pocket just sounds so interesting. The first question I had around that was, and you don't have to talk too much in specifics, it's just the framework used to decide that, you know what, we have to stop using S3 and start storing data ourselves? Because I'm sure that was like a hard decision. And I'm just curious how that went about and how you got buy-in and all of that. Yeah, that's a really good question. And I think it was actually a bit above my pay grade at the time.
Starting point is 00:05:58 I was like a new grad effectively at that point. And, you know, looking from, so I don't have that much insight into how the decision was made at the time, but looking backwards and then having worked on the project, right, and talked about it, price was one factor.
Starting point is 00:06:17 Amazon S3 is incredibly competitive on price, but we were able to do a little bit better from what we, I mean, I don't know what Amazon's pricing is internally, right? But for us, we were able to do a little bit better from what we, I mean, I don't know what Amazon's pricing is internally, right. But for us, we were able to do a little bit better. And especially having control over the system and being able to tailor it very specifically to our workload. We did some pretty specific optimizations,
Starting point is 00:06:43 especially after I had left the team with our multi-region replication that I think would be harder to do more generally. And then some of it was also control from like a strategic perspective, right? Like owning our destiny a bit more. That both the reasons make a lot of sense. So like if you had to guess why guess why or how is S3 designed,
Starting point is 00:07:07 and why do you think, what was Dropbox's workload that could be optimized a little more specifically? Yep, so Dropbox implements its file system, so the file system that we use for storing user data, with MySQL for the metadata storage, and I think you talked a little bit about it with Nipun on the previous podcast. But then all of the contents get broken up into four megabyte chunks, or blocks as we call them, and they're addressed by the SHA-256 hash of their content, and they're fully immutable.
Starting point is 00:07:45 So we built Magic Pocket to be a content-addressable store where data is completely immutable, and the object size is very well-known. For us, the distribution was a maximum of 4 megabytes, and then we had our mean around 1.5 megabytes. So we actually went and sampled the distribution and then used it to tune all of our allocators and things like that, so that we would be able to squeeze a little bit more efficiency out of them based on it
Starting point is 00:08:15 and I think, like, immutability is a huge one. And, you know, I don't know any of the details of how Amazon S3 is implemented, but I feel a lot of distributed systems end up embracing immutability at some layer. So you can imagine, like, if you're mutating an object in S3, maybe it's just putting a new version, and then the tuple of, like, path and version is actually an immutable object within the system. And this just makes coordination and so many of those distributed systems problems a lot easier. So we had that from the very top, which was great. That makes a lot of sense.
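As a rough illustration of the chunking scheme described above, here is a sketch assuming the Rust sha2 crate; the function name and constants are illustrative, not Dropbox's actual code:

```rust
use sha2::{Digest, Sha256};

// Blocks are at most 4 MB; each one is addressed by the SHA-256 hash of its
// contents, which is what makes every stored object immutable and
// content-addressed.
const BLOCK_SIZE: usize = 4 * 1024 * 1024;

fn block_keys(file_contents: &[u8]) -> Vec<[u8; 32]> {
    file_contents
        .chunks(BLOCK_SIZE)
        .map(|block| {
            let mut key = [0u8; 32];
            key.copy_from_slice(&Sha256::digest(block));
            key
        })
        .collect()
}
```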
Starting point is 00:08:56 And it seems like S3 also would have been designed for a super general purpose, with no idea how big or small object sizes are, and if I had to guess, like, most things in S3 are just not that big, compared to Magic Pocket where you know for sure that it's going to be of a certain distribution, right? And small objects are actually very difficult to handle well in a system like ours. One, we always think about ratios from different types of storage in our system. So like there's a lot of per object overhead, right? Because, as I mentioned, you need some of those metadata services for being able to
Starting point is 00:09:33 find the location of objects on disk. And there's probably a fixed cost and like that fixed cost per object probably needs to be stored in flash because you need to it's on the critical path for reads so there's all these constraints there that make small objects a huge pain to deal with specifically within the storage node we one of the tricks we did for the second generation of what we call the osd the like i think object storage device i can't even remember the acronym for it one of the tricks uh well one of the acronym for it. One of the tricks, well, one of the quantities we paid a lot of attention to was how much RAM, how much memory
Starting point is 00:10:09 was necessary to manage each object so that we could find it in a single disk seek. And one of the tricks we did with the second generation of the system is use perfect hash tables because our data is immutable and we group it into these what we call buckets like groups of one gigabyte of data and once the bucket fills up it gets sealed and even
Starting point is 00:10:31 then that grouping is immutable so starts append only becomes immutable and once it's immutable we can look at all of the keys that are inside that bucket and compute a perfect hash table which then perfect hash tables for those who aren't familiar with it if you have a fixed set of keys you can compute this hash function that will map each one of those keys to a slot in a table and then when you're querying this hash table you don't actually have to have the keys in memory at all. You just have this little hash function, and it's guaranteed to not have collisions because it has some probabilistic algorithm
Starting point is 00:11:14 to keep on trying different seeds for that hash function until it finds one where all those keys have no collisions. So you throw away the keys. You don't have to keep them in memory anymore. And you just keep this small, succinct little hash function. And you can then use that to directly find stuff on disk, which is pretty cool.
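A toy sketch of the idea being described, under stated assumptions: once the key set of a sealed bucket is fixed, keep trying hash seeds until no two keys collide, then keep only the seed and a dense table of values and throw the keys away. The brute-force single-seed search below is only practical for very small key sets; production systems use smarter multi-level constructions. Names and layouts are illustrative, not the actual OSD code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Index for a sealed, immutable bucket: only a seed and a dense table of
/// (offset, length) pairs are kept; the keys themselves are discarded.
struct SealedBucketIndex {
    seed: u64,
    slots: Vec<(u32, u32)>, // one entry per key, in slot order
}

fn slot_for(key: &[u8; 32], seed: u64, n: usize) -> usize {
    let mut h = DefaultHasher::new();
    h.write_u64(seed);
    key.hash(&mut h);
    (h.finish() % n as u64) as usize
}

impl SealedBucketIndex {
    /// Keep trying seeds until every key lands in a distinct slot.
    /// (Only workable for small key sets; shown for the idea, not the scale.)
    fn build(entries: &[([u8; 32], u32, u32)]) -> Self {
        let n = entries.len();
        'seeds: for seed in 0u64.. {
            let mut used = HashSet::with_capacity(n);
            for (key, _, _) in entries {
                if !used.insert(slot_for(key, seed, n)) {
                    continue 'seeds; // collision: try the next seed
                }
            }
            // Collision-free: lay the values out densely and drop the keys.
            let mut slots = vec![(0u32, 0u32); n];
            for (key, offset, len) in entries {
                slots[slot_for(key, seed, n)] = (*offset, *len);
            }
            return SealedBucketIndex { seed, slots };
        }
        unreachable!("some seed works for a small enough key set")
    }

    /// A key that was never in the set still hashes to *some* slot, so the
    /// caller must verify that the record read from disk matches the key.
    fn locate(&self, key: &[u8; 32]) -> (u32, u32) {
        self.slots[slot_for(key, self.seed, self.slots.len())]
    }
}
```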
Starting point is 00:11:35 That is so interesting. When I heard perfect hash table, I just, I assumed it meant something like you don't have to deal with collisions. I did not realize it meant you can throw away keys as well, and I'm guessing, yeah, go ahead. Yeah, and not having collisions is a big part of it, right? Because, so what we would do is, just to put some numbers to things, the keys for each object in Dropbox are 256 bits, right, because they come from SHA-256, so 32 bytes. And that's 32 bytes per megabyte roughly, because the blocks are 1.5 megabytes on average. It adds up. So what we did is
Starting point is 00:12:18 we take all those 32 byte keys we compute this perfect hash table into a dense array of offset and length pairs on disk so those are we use u32s for them which are four bytes each so we went from previously needing to have 40 bytes and actually a little bit more because for a general purpose hash table you'll have some load factor right whatever that is um so we went to like 40 plus bytes down to just eight because we just have this dense table uh with one entry then it's in order for the order on disk of just offset length offset length offset length and then when you uh query for a particular key that just feeds into this hash function, which is small. And so there's a certain number of bits per entry.
Starting point is 00:13:09 There's a lot of actually academic research on this, which is pretty cool. There's a few bits per entry. And then that points to the table. We look at that offset and length, and then we read that off from disk. And then one actually important piece is that if you feed some key that
Starting point is 00:13:26 isn't in the hash table, or that wasn't in your original set, to this hash function, it's going to just spit out some garbage value, right? So you actually then do need to check, from the value you read off disk, that it actually matches the key you passed in. Interesting, so you need to make sure, but that doesn't seem too expensive, it seems like a one-time check, right? Right, and we don't care about making misses cheap, right? Like we're okay with misses being expensive. Yeah, because that seems like it would only happen when there's like a problem or like an error higher up in the stack. Yep, and some kind of race, or like the
Starting point is 00:14:08 system is in a weird state and like we're trying to re-replicate data or something like that like a bit flip or something at max it seems like yes that's super interesting and can you talk a little bit more about the osd right so the perspective that i'm taking is you know you're a new grad and i'm just thinking about new grads in general you've taken like not you in specific but in general like a new grad would have taken like a bunch of distributed systems classes and they dive into this project and they have no idea like how to build something like an osd or like what even is an osd and like can you just introduce what that is and why would you need to build something like that yeah so an OSD is it it's aiming to be a very simple abstraction and like a very simple interface to solve a hard problem so and the OSD is the software that manages the server
Starting point is 00:14:59 where the server's primary job is to just store a ton of data. And the interface it presents is, it's almost kind of like a very restricted file system. You can create a bucket or like this one gigabyte file in some sense that's append only, and you can put key value pairs into it. And then you can, once the bucket is done, you can close it. And then for both open buckets and close buckets, you can query them by the key and try to find some value. So it's like a kind of like a file that's a append only hash table. Okay. So it's basically like a process
Starting point is 00:15:39 that's just running one per server and its job is to basically manage blocks of data. Yep. And we ran like one per disk because we had many disks per server and its job is to basically manage blocks of data yep and we ran like one per disk because we had many disks per server okay one interesting okay that makes sense so one osd's job was to handle one disk and then you would have like multiple disks per server yep exactly and some of there was around like life cycle right because like you can have one disk go bad in a server and then it's very easy to just have that server crash or go down and consistently stay down while it's being remediated and then have other processes that are attached to other disks still be fine. Okay. So is this, is the OSD
Starting point is 00:16:19 something like a disk driver or is there like another piece of the magic pocket puzzle that's acting as like a disk driver? Yeah, it's a good question. So by driver here, you mean like the operating system? Yes, exactly. Yeah. So yeah, we use the operating system pretty, we use it for quite a lot of functionality in magic pocket.
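A hypothetical sketch of the narrow interface just described: append-only buckets of immutable key/value pairs that eventually get sealed and then only serve reads. This is just the shape of it, not the real system's API.

```rust
/// Rough shape of an OSD's interface, as described above (illustrative only).
trait ObjectStorageDevice {
    type BucketId;
    type Error;

    /// Open a new append-only bucket (roughly a 1 GB unit of data).
    fn create_bucket(&mut self) -> Result<Self::BucketId, Self::Error>;

    /// Append an immutable block under its content hash.
    fn put(&mut self, bucket: &Self::BucketId, key: [u8; 32], value: &[u8])
        -> Result<(), Self::Error>;

    /// Seal the bucket: no more appends, and the index can now be rebuilt
    /// (for example as a perfect hash) because the key set is fixed.
    fn seal(&mut self, bucket: &Self::BucketId) -> Result<(), Self::Error>;

    /// Point lookup by key, for both open and sealed buckets.
    fn get(&self, bucket: &Self::BucketId, key: &[u8; 32])
        -> Result<Option<Vec<u8>>, Self::Error>;
}
```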
Starting point is 00:16:42 So for OSD1, we used XFS. It was a file system we had a lot of experience with on the database side. And we had a pretty good sense of what all of its sharp corners were and how we would be able to achieve highly durable storage with it. So each OSD had, I think we had a directory
Starting point is 00:17:04 per bucket, maybe. We used some encoding of like our primitives onto the file system, had like a file format, and we just used regular files. And we had to, you know, be careful about how we use the page cache. For example, like if you want to validate that data on disk is correct, you need to make sure you're actually going to disk and you're not just reading some value in the page cache that isn't actually persisted. So there were things like that, but they were pretty minimal. And then for
Starting point is 00:17:35 OSD2, we still use a lot of the Linux storage stack, but we used O_DIRECT mode. Is that right? I think so. We wrote, I think it was O_DIRECT mode on the block device, so /dev/whatever, where I think pread and pwrite let you write to different logical block addresses, the LBAs, so you get to write to sectors on disk. And so there we were still using the driver, right? But we were also using the IO scheduler.
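A minimal sketch of that access pattern, assuming Linux, the libc crate, and a placeholder device path: open the raw block device with O_DIRECT and do positioned reads and writes at byte offsets corresponding to LBAs.

```rust
use std::fs::OpenOptions;
use std::os::unix::fs::{FileExt, OpenOptionsExt};

// O_DIRECT requires the buffer, offset, and length to be aligned to the
// device's logical block size; a repr(align) wrapper is a simple way to get
// an aligned buffer. This sketch assumes a 4 KiB block size.
#[repr(align(4096))]
struct AlignedBuf([u8; 4096]);

fn main() -> std::io::Result<()> {
    let dev = OpenOptions::new()
        .read(true)
        .write(true)
        .custom_flags(libc::O_DIRECT)
        .open("/dev/sdX")?; // placeholder device path

    let mut buf = AlignedBuf([0u8; 4096]);
    let offset = 4096 * 1024; // byte offset of some logical block address

    dev.read_exact_at(&mut buf.0, offset)?; // pread(2) at that LBA
    dev.write_all_at(&buf.0, offset)?;      // pwrite(2) back to the same LBA
    Ok(())
}
```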
Starting point is 00:18:08 Okay, so in OSD2, you're bypassing XFS? Exactly. So we effectively wrote our own file system. Yeah. And can you describe a little bit about why? Is it just that you could get better performance, or is there something else? Yeah, that's actually a really good question. The fundamental reason why we did it has to do with this technology of SMR disks. Okay, so I mentioned that a bit earlier, and SMR disks, like, regular disks, which are called PMR disks for perpendicular magnetic recording, they were topping out, I think, around four terabytes.
Starting point is 00:18:46 And without actually that many changes to the fundamental hardware, like the magnetic medium and the heads and everything, SMR disks were able to go up to about 15 terabytes. And my knowledge is actually a few years old at this point. It's also not only is it outdated, but it's also the fog of time. So the numbers are probably approximately hopefully close and so the question then is how do they manage to get so much higher density without really having that many innovations on the fundamental hardware and
Starting point is 00:19:20 my understanding is i'm not a hard drive expert and i don't also want to i'm sure there was a ton of engineering work for smr that's i'm not trying to diminish it but my understanding is I'm not a hard drive expert. And I don't also want to, I'm sure there was a ton of engineering work for SMR, so I'm not trying to diminish it. But my understanding of it is that the main innovation was writing to the magnetic medium in a different way, with different geometry, and by packing sectors in very closely. So one of the fundamental problems for hard drive storage is that when you write to one sector, you're magnetizing it, right?
Starting point is 00:19:48 You're, I don't know, my knowledge of it is, I don't know exactly what's going on, but it's being magnetized. And one of the phenomena that limits density is adjacent track interference. So if you write to one sector, there's some probability that it will affect the sectors around it, right? And so that puts a limit on how densely you can pack sectors together. So there's all types of tricks that are done with like error correcting codes and that to try to pack things in more. But one of the insights for SMR is that if you constrain the write pattern of how sectors are getting written to the medium, you can pack things in a lot more closely. So SMR stands for shingled magnetic recording. And the visual analogy here is that if you're laying shingles, and each one of them say it's
Starting point is 00:20:40 like two by two or something, I don't know, just make up some unit. And the shingle is just some like that that thing that's on people's roofs right it's exactly yeah right so if you imagine like how it's laid on laid on your roof right it's like they're overlapping and they're kind of stacked together and you can actually only well i don't know if it's actually i've never laid shingles but in my mind you know it's like you can only put them on in a line right because you're stacking one on top of the other but then after you're done laying them in a line you can look at just a little bit peeking out and see what's there so if you were to imagine storing data on
Starting point is 00:21:16 shingles, where say it's the color or something. If the shingles are non-overlapping, then you can read and you can write to any one of the shingles at any point in time. But if you put them in a line, you can only append to the end, but you can read at any point, because you can see the color that's poking out. So that's like the core insight with SMR, and how it's able to pack a lot more data in without fundamentally changing the physics of it. So where does that get us back to the Linux stack? It's that the interface that the device presents no longer has random writes. So roughly, it's like there's a concept of zones, which I think for ours were like 256 megabytes. And for each zone, there's a write pointer, which you can reset all the way back to the beginning, which would be like just throwing out all the shingles.
Starting point is 00:22:14 Or you can append to it, and then you can read from any point before the write pointer. So with that restricted API, file systems just don't work on it. That makes sense. And yeah, super basically, it's like, if you had shingles all spread out, they just wouldn't be that dense, so you would need more space. But the fact that you can put them all together, and append only is perfect for the Magic Pocket use case, it seems like, because that's how Magic Pocket was designed initially anyway, and you already know roughly how much, like, each block is
Starting point is 00:22:51 so you can roughly know how many shingles you need maybe this is going too much into the analogy but totally helps yeah well yeah i mean with um like for example these um um i just said it like blanking on the name, the 256 megabyte grouping. The zones. The zones, yeah. With the zones, like that actually worked out really well for us because we have one gigabyte buckets, right? We group data into one gigabyte like files that are append only. And then we just made those four zones and you would just continue to write through it.
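A small sketch of the zone model as described: a write pointer per zone that only moves forward on appends, reads allowed anywhere below it, and a reset that throws the zone's data away. The 256 MB size and the struct are illustrative; host-managed SMR drives typically expose this through the kernel's zoned block device interface.

```rust
/// A minimal model of the zone behavior described above (sizes illustrative;
/// a real zone also has a fixed position on the device, omitted here).
const ZONE_SIZE: u64 = 256 * 1024 * 1024;

struct Zone {
    write_pointer: u64, // next writable byte offset within the zone
}

impl Zone {
    /// Writes are only allowed at the write pointer, i.e. append-only.
    fn append(&mut self, data: &[u8]) -> Result<u64, &'static str> {
        if self.write_pointer + data.len() as u64 > ZONE_SIZE {
            return Err("zone full");
        }
        let offset = self.write_pointer;
        // (the actual device write for `data` would be issued here)
        self.write_pointer += data.len() as u64;
        Ok(offset)
    }

    /// Reads are fine anywhere before the write pointer.
    fn readable(&self, offset: u64, len: u64) -> bool {
        offset + len <= self.write_pointer
    }

    /// Resetting the write pointer throws away everything in the zone,
    /// like tearing off all the shingles at once.
    fn reset(&mut self) {
        self.write_pointer = 0;
    }
}
```

A one gigabyte bucket then maps cleanly onto four such zones, which is the arrangement described above.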
Starting point is 00:23:24 And that was super easy. Interesting. So one thing that I would be really scared about, about rolling out this project, I think most people would be, is just making sure that you're not losing customer data, right? Like even just the regular OSD1, where you're using XFS. How did you, or like, how did your team verify that your data is mostly accurate, and you feel confident about deleting it from S3? I mean, there's so many layers to doing this. I think there is not a single answer. I mean, the answer is defense in depth, right?
Starting point is 00:24:00 Like try everything. So at the very highest layer, we had a lot of consistency checkers. So there are some non-trivial invariants between the metadata indexing and stuff that's on disk. So being able to join those two and periodically check and make sure that they're in sync. And that process of doing consistency checking
Starting point is 00:24:25 in production was also tied to our release process, where we had a staging cluster and we would be guaranteed that by the time our release gets promoted from the staging cluster to the first zone, that cycle of those consistency checkers has progressed. There's also, we were a multi-zone storage system. So having the release
Starting point is 00:24:46 process push new code out to one zone before the others. And then having a set of qualifications that that zone would have to pass to enable promoting its deployment to the others. There is, um, yeah, what else. So that was in production. So the consistency checkers are like a check-everything-and-check-that-the-invariants-match approach. There was also a statistical sampler, where we would just check that we can retrieve blocks that have been recently written, with some probability, like just sampling that, and then doing slightly more expensive things.
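A sketch of what such a statistical sampler could look like, leaning on the content addressing described earlier: re-fetch a fraction of recently written blocks and check that the bytes still hash to the key they were stored under. The BlockStore trait and sampling rule are stand-ins, and the sha2 crate is assumed.

```rust
use sha2::{Digest, Sha256};

/// Stand-in for whatever interface serves block reads.
trait BlockStore {
    fn get(&self, key: &[u8; 32]) -> Option<Vec<u8>>;
}

/// Re-fetch every `one_in`-th recently written block and confirm its bytes
/// still hash to the key it was stored under.
fn sample_check<S: BlockStore>(
    store: &S,
    recent_keys: &[[u8; 32]],
    one_in: usize,
) -> Result<(), String> {
    for (i, key) in recent_keys.iter().enumerate() {
        if i % one_in != 0 {
            continue; // only sample a fraction of recent writes
        }
        let block = store
            .get(key)
            .ok_or_else(|| format!("recently written block {:x?} is missing", &key[..4]))?;
        let digest = Sha256::digest(&block);
        if digest[..] != key[..] {
            return Err(format!("block {:x?} failed its content check", &key[..4]));
        }
    }
    Ok(())
}
```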
Starting point is 00:25:31 We had a very extensive testing suite. So we had both obviously like unit tests, but then also some integration tests, test the system as a whole. We spent a lot of time qualifying hardware, and that was more on the hardware team, but I don't have as much time as much access to it.
Starting point is 00:25:53 We for the process of first releasing Magic Pocket, we had this like I still remember it. It's this huge spreadsheet with like all of the potential kind of like a premortem, like all of the potential things that we could think of going wrong the probability of them happening the um the downside if they did happen and then mitigations and ways to validate it um we like actually went into the
Starting point is 00:26:18 data center and like pulled out some plugs and tried to validate that the hardware behaved in ways that we thought. I think we even overheated one of our servers intentionally. I can't remember. But yeah. How did we validate the system? Yes. There was a lot of stuff.
Starting point is 00:26:46 It was not one single thing, but I think this process of just being extremely paranoid and having many layers of checks. This might involve a hot take and you don't have to answer if you don't feel like it. Yeah, I love hot takes. Do you think, like, you know, five years later or six, I think it's been longer than that now, the project was, like, worth it? Or do you think that, because the way I look at it is that cloud prices have gone down significantly, and I don't know how much that affects the dynamics of building your own thing. Yeah, I mean, I think it's a good question, and I think it's a little bit above my pay grade for the decision at that time, right? Because, like, I think it's, you know, easy
Starting point is 00:27:27 to wax poetic after the fact, but I think at that point in time, was it the right decision? I think yes. I mean, I think it helped our company financials a lot. It also, you know, was part of, is an ingredient of, us being able to go and IPO. And yeah, I mean, I think the parts that are difficult, that are unavoidable with running a system like this, is that it's a really hard thing to do well, right? And not ever mess up. And being able, being on the hook for that as opposed to being able to outsource it
Starting point is 00:28:00 to some other company is, that's a big commitment right it's a commitment on the software engineering side but then also the SRE side that there's a lot of databases involved there's a whole supply chain there's networking there's hardware right we need to continuously we need to for it to be worth Dropbox doing it we can't just tread water we need to, for it to be worth Dropbox doing it, we can't just tread water. We need to be able to continue to remain competitive and beat Amazon S3, right? Where that is one of their core products.
Starting point is 00:28:35 So yeah, I mean, it's definitely, it's, you know, it was a very, very successful project, but it didn't come for free. Yeah, that makes sense. And I like the idea, the second reason why in your framework of why go with Magic Pocket, just like company strategy and owning your own destiny. I think that might be like reason enough, as long as it's not like costing like 10 times as much as S3 or something. The fact that you're not dependent on as much as possible, dependent
Starting point is 00:29:06 on another company completely for your core, like multi-billion dollar business, right? Like most companies can easily switch, not easily, but relatively easily switch clouds. It'll take like a few months, but it's possible. And it's even possible to go like multi-cloud. But for something like Dropbox, where you have to transfer exabytes of data, it's just not physically possible. Right. Yeah. So that is a very, very interesting project. And maybe you can tell us a little bit about the actual data transfer.
Starting point is 00:29:41 Because from my understanding, again again you can't actually physically transfer you can't transfer data that quickly like at least network cards are just not that fast when you have so much yeah um let's see so we were at our peak at around half a terabit a second right so that's like i think about 62 and a half gigabytes a second uh so nothing to sneeze at right so 500 gigabits yeah i think i did that right it's a 500 gigabits and we so one of the really difficult things at the beginning was being able to push this type of bandwidth over the public internet so we would buy transit from just like a normal metered metered um like almost like buying bandwidth buying internet service from your isp right we would buy transit from uh these different providers
Starting point is 00:30:39 and we would sometimes break the internet, because it turned out, for, you know, a certain number of affected people, because we bought all this bandwidth from different providers, and it turned out that they shared a common link, and that, like, our path from our data center to Amazon didn't actually have as much bandwidth as we thought we did. And eventually we ended up getting peering, which is, I wish I knew better the actual, like, what happened on the ground. I mean, my mental image is like us having data centers near Amazon's and then having a fiber line
Starting point is 00:31:17 directly between them. I'm not sure if there were like people laying fiber and like plugging into the Ethernet jacks or exactly what that was. But the idea there that it's just was mutually beneficial for both us and Amazon to just have a direct link as opposed to both of us having to pay to go over the internet. Okay. Interesting.
Starting point is 00:31:40 Yeah. And so from that was like the first kind of, so that's one constraint from the bandwidth perspective um and then like your to your point each one of these machines um that we use for this backfill system i think they started off with one gigabit ethernet cards and then we eventually got 10 gigabit and we would use close to the whole thing. You know, it's not, it's actually pretty amazing, right? It's not that hard to saturate even 10 gigabits with just a bunch of TCP connections in parallel, right? And then we might actually talk a little bit more about this because that's very related to my research that I was doing after I left Dropbox. And yeah, and then from there,
Starting point is 00:32:30 they were just making sure that we think of the queuing theory of all of how like backfill works pretty well. So we don't overwhelm any of the resources. So overwhelming S3 on the reads, overwhelming Magic Pocket on the writes. But then there's tons of things in the middle. Like we needed to consult the file system metadata to know which blocks to transfer so that was doing a read to my sql talking to secondaries and all that um and what else it's been a long time since i thought about
Starting point is 00:33:00 the system there were a lot of other resources resources. It was a heterogeneous resource system. I had implemented this distributed token bucket algorithm that was very flexible. It was very simple, just like a distributed rate limiter. But it ended up being really helpful because then you could just put a rate limiter on everything and make sure that the throughput for any of these subsystems doesn't exceed within some bound. And then as long as the kind of queue setup of it, and I think you had it from a queuing theory perspective, when one thing backs up, it's back pressure propagating through the system,
Starting point is 00:33:36 then everything just kind of worked out. And I say this like it's something really simple and obvious, but it took us a long time to get there. We had plenty of times of just like trying to like hack in something and then causing it to overflow. And then that causing some other system to go down and then paging some other teams. So yeah,
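For reference, a minimal single-process version of the token bucket rate limiter being described; the distributed part (sharing the bucket state across machines) is the additional layer mentioned above and is not shown here.

```rust
use std::time::Instant;

/// Plain token bucket: refills at a steady rate, up to a fixed capacity.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        TokenBucket { capacity, tokens: capacity, refill_per_sec, last_refill: Instant::now() }
    }

    /// Returns true if `cost` tokens were available; callers that get `false`
    /// should back off, which is what propagates back pressure upstream.
    fn try_acquire(&mut self, cost: f64) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= cost {
            self.tokens -= cost;
            true
        } else {
            false
        }
    }
}
```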
Starting point is 00:33:55 I think the lesson there is like be disciplined about the queuing theory and thinking about back pressure and all that. Yeah. That is so interesting. And like one thing that I haven't ever heard in my life, I think is overwhelming S3. Like I don't think I've heard these two words put together. Yeah, and you can correct me if I'm wrong,
Starting point is 00:34:14 but you were talking about like saturating your network cards, like the limit was like 10 gigabits or something. And I was reading your research, and I read somewhere that 30% of that time, like when you're actually doing these network calls, goes in like syscall overhead, and this is like Linux, and it's even worse on like Mac and Windows. Is that like accurate, and like, why is that? Yeah, so the numbers, I think they depend kind of on your hardware, like what type of hardware offloads you have.
Starting point is 00:34:46 So how much of the TCP stack can you just hand over to your NIC? So at the very basic level, that's like doing checksumming. Like TCP has a checksum that actually checksums the user data, the body of the segment. And that's implemented in a lot of NICs. So that's a huge win for high throughput flows. And then there's more sophisticated stuff like actually being able to offload the segmentation.
Starting point is 00:35:12 So you can just present the NIC with a big buffer and then it will break that up into smaller TCP segments. Yeah. And I think like the overall perspective that's pretty interesting is that the hardware has gotten really really good both on the throughput side right we're getting 100 gigabit nicks and it's going it's going even further but also on the latency side i think the latency side is the part that was pretty underserved and was the basis for the research i was doing is that if you have two servers that are in the data center, they have one, say they're on the same rack,
Starting point is 00:35:47 so they have just the top of rack switch. The switch itself will add just like a microsecond of latency. And then the propagation time through the cables is going to be, I think, in like the hundreds of nanoseconds. So like your round trip times in the data center should be in the low micoseconds. So your round-trip times in the data center should be in the low microseconds.
Starting point is 00:36:09 But if you, even with an exceptionally well-tuned Linux kernel, you're going to be in the 50 microsecond territory. And then with anything that's just not hyper well-tuned, it's going to be in the hundreds
Starting point is 00:36:24 of microseconds or even milliseconds, right? That is, yeah, that's surprising. Like I've always, my heuristic for inter data center communication is like less than a millisecond, but I always assumed it's like speed of light or something. I mean, I guess it's not that it can't be that it's, it's overhead from software rather than hardware things.
Starting point is 00:36:51 And the switches are really good, right? The switches are not adding that much. It's really, it's kind of, it's tragic, right? And that you have this like highly engineered pipeline where you have like the path of going from user space to kernel space through the network stack through the driver and then off from the driver to the device over pcie say and then from there to over the physical medium through some switches and then up the same stack on the other side and it's like all that stuff in the middle is really, really good. And like 90 plus percent
Starting point is 00:37:27 of the latency is actually on the endpoints. Yeah. Do you know why the kernel adds that much overhead even if it's like well-tuned? Like I've always read that syscalls are high overhead,
Starting point is 00:37:38 but what does that mean? Yeah. Yeah. So yeah, one of the first kind of perspectives is that context switching is expensive. And so there's context switching between kernel space and user space. And it's been announced, I mean, even with a month, my fog of memory, there's even some switching between different subsystems within the kernel. There's a really good paper on this. I can't remember what it is. I think it might actually be unpublished. It might not be published yet.
Starting point is 00:38:08 But anyway, there's a really, and there's another really good paper from a group at Stanford. The paper is called Dune, and it's their like operating system architecture that uses virtualization to do some pretty interesting stuff. But we don't have to get into that.
Starting point is 00:38:25 But the one of the core insights from the paper is that syscalls are very expensive, you know, the hundreds of cycle, like hundreds of nanoseconds plus right. For even just a no op syscall because of the indirect effects of them, like actually executing the mode switch of switching from ring three to ring zero. So going from user mode into protected mode, that's actually not that expensive. effects of them like actually executing the mode switch of switching from ring three to ring zero so going from user mode into protected mode that's actually not that expensive the things that are expensive are that when we have these big monolithic kernels that run like everything
Starting point is 00:38:56 together, like the kitchen sink, you just totally blow your L1 and even L2 cache. I think in the L1 cache, like, the instruction cache and the data cache are totally blown out. And then I think the L2 cache, a lot of that gets blown out too. So when you enter the kernel, you have a bunch of cache misses. And then when you come back into user space, you have a bunch more cache misses as well, because the kernel brings in all of these data structures
Starting point is 00:39:21 and all of this code because it's this big monolithic thing um so that's like one of the core ingredients for why system calls and like context switches in general are expensive that is so interesting so like i know i've said that so much now but this is this is interesting it's like the kernel itself is so big that it's data structures like it's just like things like logging structures, I'm guessing, all sorts of things like that are so big that your caches get overwhelmed when you're switching from like ring three to ring zero. That is not what I thought.
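A tiny, hedged way to see the direct cost being discussed (assuming Linux and the libc crate): time a raw no-op-ish syscall against a plain loop iteration. Note this only captures the direct mode-switch cost; the cache-pollution effects described above are indirect and show up in the surrounding code, not in this microbenchmark.

```rust
use std::time::Instant;

fn main() {
    const N: u32 = 1_000_000;

    // Raw getpid via libc::syscall, so the libc wrapper can't cache it.
    let start = Instant::now();
    for _ in 0..N {
        unsafe { libc::syscall(libc::SYS_getpid) };
    }
    let per_syscall = start.elapsed().as_nanos() / N as u128;

    // Baseline: a plain loop iteration with no kernel transition.
    let start = Instant::now();
    let mut acc = 0u64;
    for i in 0..N as u64 {
        acc = acc.wrapping_add(std::hint::black_box(i));
    }
    let per_iter = start.elapsed().as_nanos() / N as u128;

    println!("~{} ns per raw syscall vs ~{} ns per plain iteration ({})", per_syscall, per_iter, acc);
}
```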
Starting point is 00:39:58 I thought there would be something to do with like, you know, an interrupt in like your hardware or something causing the problem. Yeah. And this is actually the Dune paper is really cool. And we will talk about a little bit because it's just cool stuff. So one of the, what their system does is that it's like, say you're implementing a, like a programming language runtime and you want like memory management through garbage collection or something, you want to be able to use hardware features for
Starting point is 00:40:26 having some protected data, maybe like read-only pages that the program you're executing can't touch. And then maybe you want to do things like a read barrier where you, while you're garbage collecting some page in memory, if the application tries to read from it, it's going to fault. And there you're also using fault. Excuse me. You're also making it so that the user program is going to trap into the kernel. And the premise of the paper is that if the kernel is small, these features can be implemented at a very low cost. So then what they do is they use virtualization support because that so then the setup is that you have for virtualization you have host mode and guest mode. So imagine this is like a two by two matrix, you have host mode and guest mode. And then within each mode, you have ring zero and ring three for kernel mode and user mode. And she gets to get some water real quick. Yeah, I have more questions already lined up.
Starting point is 00:41:34 Sorry, I think I just haven't been talking to this much in a while. So, right. So you have this two by two matrix of host mode, guest mode, ring zero, ring three. And the Linux kernel runs in host mode in ring 0. That's what you'd expect. And then regular applications run in host mode ring 3. But then a Dune process where it has a split between, like, the language runtime and managed code, it runs the runtime in ring zero of guest mode.
Starting point is 00:42:07 And then it runs the user code there, like whatever code is being executed as ring three in guest mode. So then switches between ring zero and ring three in guest mode are pretty cheap. But then if you need to perform like a real syscall that is actually talking to the file system, then you need to switch into host mode and then go to execute in ring zero there. Yeah, so your traditional expensive syscalls that are doing I.O. and looking at the file
Starting point is 00:42:36 system might still get expensive, but there's so many random syscalls that you can just skip out on. And does the paper show that Dune is actually faster than running a program directly in the user mode? Yeah, I mean, they have some pretty good evaluation on what the switching costs are. I'm trying to, I can't remember offhand what the programs they use for their evaluation.
Starting point is 00:43:00 But yeah, I remember garbage collection and read barriers being one example where, I mean, the alternative there is just writing a regular program against Linux that uses like mprotect to protect pages and then goes and traps in the kernel, and like, that just, I mean, it sounds like it'd be very slow. Yeah. Yeah, it just reminds me of like Node.js, because I've been looking at like syscalls from Node.js, and like, I think Node does that, or I guess V8 does that quite a lot.
Starting point is 00:43:28 And that makes sense. So that makes me think about like kernel bypass, right? Is that the reason why kernel bypass was developed? And maybe you can talk about it to like listeners, like what is kernel bypass? Because I also want to like figure out if that's something that they have, like is kernel bypass? Because I also want to figure out if that's something that the kernel developers have developed or is it something external?
Starting point is 00:43:51 Yeah, I think it's pretty related. I think a lot of people who work on the kernel have worked on these user space approaches. And the perspective here is that one of the jobs of the kernel is to multiplex resources right to like share resources like an ethernet device across multiple user space programs and that's why it needs like hardware protection right because it it it alone is the arbiter on how to share that resource and user spaces and it can circumvent that so if you relax that assumption and you say that i like say i'm running
Starting point is 00:44:28 a database machine and i have maybe like a one gigabyte or gigabit really cheap card for just sshing into and pushing code and stuff but then i have this like 40 gigabit ethernet card that all the data actually flows through and all i'm running on this server is just a database process then why does the 40 gigabit card need to get managed by the kernel it's not getting shared so the idea is then that you would run the driver like roughly the same driver that would be in kernel space in user space and that would you set up all the memory mappings to map the device memory and all of the like um for like the iommu and everything to map those into user memory and then run the driver that is manipulating all the ring buffers and like ringing the bell to tell the device that there's new data on them. And you can just do that in user space because why not, right?
Starting point is 00:45:28 And then when you do that, you just cut out a lot of the overheads of having to go through the kernel. You do it directly. Yeah, and at the same time, you lose that kind of trust because it's a user-based program, but you don't care anymore.
Starting point is 00:45:42 It's like in max power mode, right? You're going with, trust me me i know what i'm doing but then give me this much performance and like if syscall overhead is just like 30 you get like basically 30 performance for free it sounds like and what have you seen in like practice right so this is like a really interesting um kind of like, this is one of the motivations for the research that I'm working on with my advisor,
Starting point is 00:46:09 Irene Zhang, at Microsoft Research. And the observation here is that the hardware is great. And now actually with kernel bypass, say for networking, DPDK is like the common library
Starting point is 00:46:23 for doing this. It's actually possible to achieve these really low latencies and really high throughputs but for some reason people haven't been doing it and like a lot of like it just doesn't really exist out there and part of the reason at least the reason that we've identified for a research thesis, is that the really low-level interface of talking to DPDK, where it's, you know, it's like you're talking to a driver directly, right? Is, one, very developer-unfriendly. But two, it's also a moving target, right? Like, for different devices, you have to configure them in different
Starting point is 00:47:05 ways and the operating system just takes care of this for you when you use sockets, right? So our work has been on making a very high-level interface that looks like sockets and lets you use DPDK or we even have like RDMA support.
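Purely as a hypothetical illustration of what a "looks like sockets, runs on DPDK or RDMA" interface could be shaped like (this is not the actual research API): the stack is linked into the application, operations are enqueued, and completions are polled in user space rather than delivered through syscalls.

```rust
use std::net::SocketAddr;

// Illustrative names only: descriptors and tokens are assumptions, not a
// real library's types.
type QueueDescriptor = u32;
type Token = u64;

trait BypassStack {
    type Error;

    /// Socket-like setup, but everything stays in user space.
    fn socket(&mut self) -> Result<QueueDescriptor, Self::Error>;
    fn connect(&mut self, qd: QueueDescriptor, remote: SocketAddr) -> Result<Token, Self::Error>;

    /// Push/pop enqueue asynchronous operations and return a token...
    fn push(&mut self, qd: QueueDescriptor, buf: &[u8]) -> Result<Token, Self::Error>;
    fn pop(&mut self, qd: QueueDescriptor) -> Result<Token, Self::Error>;

    /// ...which the application polls for completion; no kernel transition
    /// happens on the data path.
    fn wait(&mut self, token: Token) -> Result<Vec<u8>, Self::Error>;
}
```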
Starting point is 00:47:24 We support other types of hardware too. And, but engineered in a way so that you can still have microsecond scale. So it doesn't add overhead like going through the kernel would. And you can have your abstraction while maintaining performance. It sounds like you're building a microkernel
Starting point is 00:47:42 for specific use, like a small kernel for like networking in a sense yeah totally yeah and it's like a user it's kind of like a lot of like the library os research right or having like stuff that people can link into their programs and run some user space and you pick and choose what you want and it's like all a lot smaller than this big monolithic thing that does everything. Do you know if there's any work towards like the Linux kernel to like cut its size down or it just seems really hard at this point? I think it's also, I mean,
Starting point is 00:48:17 Linux is inherently a monolithic kernel by design, right? And that's like, and that's a virtue in some ways, right? That's, that's probably not design. Who would be the users of the work that you're doing? And the people I can imagine are just like Azure, AWS, GCP. Like they care about like,
Starting point is 00:48:38 you know, getting as performance stuff as possible. Yeah. And I think one other motivation that's related directly to this idea of accounting for CPU cycles is that a lot of people we talk to, because one of the cool
Starting point is 00:48:52 things about working in Microsoft Research is you get to talk to internal teams at Microsoft, right? And a lot of people actually really care about cost efficiency, which is good to hear. But the latency benefits of running some really simple server, just like a UDP relay server, are great.
Starting point is 00:49:10 But what people really care about is being able to run it a lot cheaper because all the cycles that aren't being wasted on going to the kernel are now cycles that they can sell for Azure to other uses. Yeah, so my assumption is basically wrong. And even if you're running a company like Discord or Slack or like WhatsApp, where you need to have a lot of connections and there's basically generally going to be a lot of syscall overhead because you're going to be receiving messages
Starting point is 00:49:38 and there's going to be a lot of IO on their chat servers, they could use something like this potentially. Yeah, so I think for them, I don't know if the cost motivation is as strong. on like their chat servers, they could use something like this potentially. Yeah. So I think, you know, for them, I don't know if the cost motivation is as strong, right? Because I think for a hyperscaler, like your Googles, your Amazons, your Microsofts, they have like a lot of elasticity to be able to say that,
Starting point is 00:50:00 like if I cut this percentage out, I can then go sell it. Where if you're running your own infrastructure and you're but on a much smaller scale i think that might not be as much of a motivation so like the marginal return on efficiency is actually probably pretty low but yeah i mean i think the vision for the work that irene and i have been working on is that it should just be so easy to do this and get like 100x better latency that why wouldn't you do it? And do you think any of this work could have been applied to like Magic Pocket like retroactively?
Starting point is 00:50:38 Or do you think it wouldn't make sense? That's a really good question because my first exposure to kernel bypass was actually working on Magic Pocket. For the second OSD, we talked before about ratios, right? And the one really important ratio is the amount of CPU that was needed per unit gigabyte on disk. And there were a lot of cliffs there for price. If we could manage to use a cheaper CPU for this like hardware revision,
Starting point is 00:51:08 that would save us a ton of money and being able to use that then implied a particular ratio. And when we were launching, we thought we would actually like, we thought that using TCP through the Linux kernel wouldn't actually be good enough for us to hit our price goals. So we spent a bunch of time profiling it. And we like, I think started to prototype some kernel bypass stuff. So like going over UDP. And like using the mute, like there are a lot of other things that had to change. Funny enough, one of my main contributions
Starting point is 00:51:45 for this research project is a TCP stack for kernel bypass. That had existed back in the day. But it was definitely not production-ready. It just sold nothing. So we were trying a bunch of things. We had some prototypes for kernel bypass. It definitely would have done what we needed um but it turned out that actually just upgrading the linux kernel was good enough
Starting point is 00:52:09 interesting yeah um and in your experience like you you mentioned that there's like the libraries the existing libraries you mentioned like dpdk or something is too low level for people to use like you have to think too much about, it's basically like using a driver directly, it sounds like. Do you think that is the biggest barrier or like once you have a better interface, like that's it, it's relatively easy for people to like deploy their own like kernel bypass?
Starting point is 00:52:40 Yeah, it's a good question. I mean, I think it's a big factor, right? Because I mean, you're like reading and writing ethernet frames directly, right? That's like, pretty hardcore. There's no TCP, there's no UDP. I mean, IP support is not that hard. But like ARP, for example, right? Like being able to resolve IP addresses and MAC addresses, like that is not included in the box. So that was all stuff that we had to implement as part of our project. So assuming you have all that, I mean, one barrier is that it is restrained within cloud environments.
Starting point is 00:53:13 Like there is kernel bypass support within both AWS and Azure, but it's not as prolific as their general offerings. There's also like, it's pretty interesting, the overlap between kernel bypass and virtualization. So you can imagine that, like, how would you provide kernel bypass
Starting point is 00:53:32 in a virtualized environment? And the idea is that there's actually a lot of interesting stuff happening on the NIC hardware side to present multiple virtual devices. So kind of previously how, like, the kernel was sharing a NIC hardware side to present multiple virtual devices. So kind of previously how like the kernel was sharing a NIC across multiple processes, that sharing has actually been pushed into the hardware where your physical device can actually present it. You can configure it to have multiple virtual devices.
Starting point is 00:54:00 And there were some pretty cool work, I think at MSR, on having like, I think they embedded FPGA into the NIC to implement some of the like, I wish I remembered this better. For all like the VPC rules on like network policies and stuff, I think they get compiled down to code to like an FPGA that actually runs on the NIC and then does that isolation, but like in hardware. Super interesting. Yeah. I was imagining that kernel bypass would only be allowed on like bare metal instances or something, but it sounds like there might be support for even like non-bare metal instances.
Starting point is 00:54:39 That's super interesting. And maybe like a couple of questions to tie this up is around Rust. So you've been, so you have a lot of experience with tie this up is around rust so you've been yeah so you you have a lot of experience with rust both on like magic pocket and like you work at microsoft research so how's that experience been yeah it's been great i mean i would say like i've been a full-time rust developer since like not Rust. I've been an engineer using Rust full-time for what, maybe like 2014? Okay, that's like
Starting point is 00:55:09 2014, 2015. And it's been great. I mean, I think like on Magic Pocket, we introduced Rust for the OSD 2, and there Rust was a big part of us being able to have a lot of control over memory. And not having a garbage collector was actually a pretty big part of that
Starting point is 00:55:28 because we needed to control, like the OSD is a very throughput oriented system. So having a lot of control and knowing when memory will get freed, like if a storm of requests comes in, and if we hold on to that memory for any longer than it needs to get held, then that will stack up, right? And then I think if it's like a queue backing up, then we might just run out of memory, right, on that machine.
Starting point is 00:55:56 So having deterministic resource finalization was a big part of what made Rust work for that project. We also had a similar system called the volume manager, which is similar in that it's throughput oriented, and what it did was all the erasure coding. So it just streams in data, like multiple streams of data, it does erasure coding, and then streams them out. And I did the rewrite of that, rewriting it from Go to Rust, and one of the things that was really great with the new system was just that
Starting point is 00:56:27 it's like resource utilization is very flat and very predictable. So yeah. And then after Nucleus, working on there, we did that system in Rust. And there were some performance elements of being able to pack things in memory very tightly. But I think the main
Starting point is 00:56:45 value add was around encoding invariants in the type system, taking some of that Haskell-like approach and applying it to a systems language. And then the async/await stuff is also really sweet. And then for my research, it's actually also in Rust. So I implemented that TCP stack and all the network things in Rust. And one of the things that is, I think, quite novel from the language side is the async/await: having user-space scheduling for these coroutines without requiring heap allocation is pretty novel. I don't know of any other language that provides it, and it was key, actually, to the success of the research project, because
Starting point is 00:57:32 we implement all the TCP state machine algorithms, and there's just a lot of stuff in there, right? Doing retransmissions, handling pure acknowledgements, there's a persist state, there's all this stuff. And being able to implement it as coroutines in a very compositional way makes it really pleasant to work on that stack, whereas if you go look at other TCP implementations, it's really, really hard to reason about what's going on. But then we can run these coroutines that manage TCP state with really, really low overhead.
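Illustrative only, not the MSR stack: the helper functions are hypothetical stand-ins, and the executor in main comes from the futures crate. The point is just that a retransmission loop written as an async fn reads top-to-bottom like the protocol description, while the compiler lowers it to a state machine with no heap allocation of its own, and coroutines compose like ordinary function calls.

```rust
use std::time::Duration;

// Hypothetical hooks standing in for the real stack's NIC queues and timers.
async fn send_segment(seq: u32) {
    println!("tx seq={seq}");
}
async fn wait_for_ack_or_timeout(_seq: u32, _rto: Duration) -> bool {
    true // pretend the first transmission is always acked
}

// The retransmission logic reads like the RFC prose, but compiles to a
// heapless state machine.
async fn retransmit_until_acked(seq: u32, mut rto: Duration) {
    loop {
        send_segment(seq).await;
        if wait_for_ack_or_timeout(seq, rto).await {
            return; // delivered
        }
        rto *= 2; // exponential backoff, then try again
    }
}

// Coroutines compose like ordinary function calls.
async fn send_stream(segments: &[u32]) {
    for &seq in segments {
        retransmit_until_acked(seq, Duration::from_millis(1)).await;
    }
}

fn main() {
    // Driven here by the `futures` crate's simple executor; the research
    // stack uses its own user-space scheduler instead.
    futures::executor::block_on(send_stream(&[1, 2, 3]));
}
```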
Starting point is 00:58:07 Like I think we measured, for the scheduler I wrote, it's tens of nanoseconds for switching between them. There are no allocations in the critical path. One of the interesting parts of programming at microsecond scale is that even just using a heap allocator,
Starting point is 00:58:23 like using just regular libc malloc and free can get to the order of hundreds of nanoseconds to a few microseconds on the slow path. A lot of allocators have really optimized fast paths, but if your round-trip time target is something like six microseconds, then allocating at all is not an option. How do you prevent heap allocation? I can't understand how you can prevent that when you're providing something like coroutines. Yeah.
Starting point is 00:58:54 And the async/await stuff is really, really cool. So the idea is that you write a function that just looks like regular code, with regular control flow. And then the compiler does what you would imagine you'd have to do yourself if you were writing this in C: it takes the control flow graph and turns it into a state machine, where there's a variant of an enumeration, what's the word I'm looking for, in Rust they're called enums,
Starting point is 00:59:26 which is like a tagged union of a bunch of types, right? And there's a variant per suspension point. At every suspension point, thinking about this even from the OS perspective, when you suspend, you need to save the stack, right? So the data stored in each variant is all of the stack variables that are live at that point. And then when you think of that enum, which covers all of the suspension points and, for each suspension point, all the variables that are live, the size of that enum will be a tag to indicate which variant you're in, plus the maximum size of any of the suspension points, like the maximum stack size in that function.
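A hand-written analogy of what's being described, with hypothetical placeholder types; the compiler's real generated layout is an internal detail and differs, but the shape is the same: one enum variant per suspension point, each holding only the locals that are live there, so the enum's size is a tag plus the largest variant.

```rust
#![allow(dead_code)]

// Placeholder types standing in for a socket and its read future, just so
// this sketch compiles on its own.
struct Tcp;
struct ReadFut;

// Conceptually desugaring an async fn like:
//
//     async fn read_and_sum(sock: Tcp) -> u64 {
//         let mut buf = [0u8; 64];                 // live across the await
//         let n = sock.read(&mut buf).await;
//         buf[..n].iter().map(|b| *b as u64).sum()
//     }
//
// into one variant per suspension point, each storing the live locals:
enum ReadAndSum {
    Start { sock: Tcp },                               // before the first poll
    AwaitingRead { buf: [u8; 64], read_fut: ReadFut }, // suspended at `.await`
    Done,                                              // finished
}

fn main() {
    // Roughly the discriminant tag plus the largest variant, i.e. the deepest
    // "stack frame" this coroutine will ever need.
    println!("{} bytes", std::mem::size_of::<ReadAndSum>());
}
```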
Starting point is 01:00:10 Got it. So I think the key insight there is the compiler converting your entire program into a state machine up front, so it has all of that information about the stack size. Yeah, exactly. And that's one of those things that's really cool from a PL innovation perspective: when you have an async/await program like that, there are all these nested state machines. Like when you have a function call, say you have three
Starting point is 01:00:42 enum variants for the top thing, so there are three suspension points, and then when you await some other function call, variant two might have five variants nested inside of it, right? So there are these nested state machines, and put together they make this one big state machine. And the maximum size of that is actually known at compile time, so the compiler knows exactly how much stack space it needs to allocate for this lightweight thread. And that is something that, for example, Go, or Rust back when it had a green threads runtime, has struggled with.
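A small, runnable demonstration of that claim, using only the standard library: locals held across an await live inside the future, an awaited call nests its state machine inline, and the whole thing is a plain value whose size the compiler knows statically. The exact byte counts vary by compiler version.

```rust
use std::mem::size_of_val;

// `scratch` is used after the await, so it must be stored inside the future.
async fn inner() -> u8 {
    let scratch = [7u8; 256];
    std::future::ready(()).await;
    scratch[0]
}

// Awaiting `inner` nests its state machine inline inside `outer`'s.
async fn outer() -> u8 {
    inner().await
}

fn main() {
    // No heap allocation anywhere: these futures are ordinary values whose
    // maximum size is fixed at compile time (at least 256 bytes here because
    // of `scratch`; the precise figure depends on the compiler).
    println!("inner future: {} bytes", size_of_val(&inner()));
    println!("outer future: {} bytes", size_of_val(&outer()));
}
```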
Starting point is 01:01:27 I can't remember what their default stack size is for Go routines. I think it's maybe a kilobyte. But they don't know up front how big of a stack they need to provide. So then there's these weird performance cliffs where if you just add a new stack variable to your function,
Starting point is 01:01:42 that could add a stack split that gets detected at runtime, and then there's an expensive path to go and allocate a new stack. I think they have a linked list of chunks for their stack representation. Whereas on the Rust side, you just know up front exactly what it needs. And then
Starting point is 01:02:00 the question of recursion comes up: you can't have cycles in your async call graph, because then you wouldn't know the size at compile time. But there's a way to get around that by using heap allocation to break those cycles in specific cases. I think that's fascinating.
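A rough sketch of that workaround, assuming the common Box::pin pattern and the futures crate executor just to drive it: a recursive async call graph has no statically known maximum depth, so each recursive step is boxed, paying one heap allocation per level to break the cycle.

```rust
use std::future::Future;
use std::pin::Pin;

// Without the Box, the future's type would contain itself and its size could
// not be computed at compile time; boxing each level breaks that cycle.
fn countdown(n: u32) -> Pin<Box<dyn Future<Output = ()>>> {
    Box::pin(async move {
        if n > 0 {
            println!("at {n}");
            countdown(n - 1).await;
        }
    })
}

fn main() {
    futures::executor::block_on(countdown(3));
}
```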
Starting point is 01:02:23 I think with Rust, compile times and binary size might blow up a little bit because of all of these things, but then you get amazing performance at runtime. I remember you sent me a link a long time ago on Rust's compilation time improvements, where they were redesigning the compiler. I'm sure they've had
Starting point is 01:02:42 improvements, but I'm not sure at what magnitude and what scale those have been. Yeah, I don't follow that technology that closely, and I don't know the actual numbers, but in terms of developer experience, I've actually been quite happy with it. They've done a lot of work on incremental compilation.
Starting point is 01:03:02 So when you make a small change, making it so that the next compile doesn't take very long, there's a whole algorithm they have for expressing data flow within the compiler. It's like a DAG, and they know when nodes in the DAG have changed and which downstream results need to get reevaluated. So that stuff's super cool.
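Not rustc's actual query system, just a toy with hypothetical names to show the shape of the idea: a derived result remembers which input revision it was computed from, and is only reevaluated when that input has really changed.

```rust
struct Input {
    value: String,
    revision: u64, // bumped on every edit
}

struct Derived {
    cached: Option<(u64, usize)>, // (input revision it was computed at, result)
}

impl Derived {
    fn get(&mut self, input: &Input) -> usize {
        // Reuse the cached result if nothing upstream changed.
        if let Some((rev, result)) = self.cached {
            if rev == input.revision {
                println!("reusing cached result");
                return result;
            }
        }
        println!("recomputing");
        let result = input.value.len(); // stand-in for the real compilation work
        self.cached = Some((input.revision, result));
        result
    }
}

fn main() {
    let mut input = Input { value: "fn main() {}".to_string(), revision: 1 };
    let mut derived = Derived { cached: None };

    derived.get(&input);   // recomputes
    derived.get(&input);   // reuses the cache
    input.value.push(' '); // an edit bumps the input's revision...
    input.revision += 1;
    derived.get(&input);   // ...so only now does it recompute
}
```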
Starting point is 01:03:22 And yeah, working at a larger company, working on Nucleus for example, compile times were a little bit more of a pain there, because it was a really large code base and the binary at the end was huge. But after having left, working on the Microsoft Research stuff and also some of the stealth startup stuff I've been doing, compile times have been totally fine. Yeah, you only have to worry about it once it's really large, I'm guessing. And it also seems like a lot of languages, you know, TypeScript, Python, with the async/await stuff, they're all converging to this feature set,
Starting point is 01:04:01 and it's kind of nice that you can basically pick the right tool based on what your constraints are, and easily switch around when you realize, yeah, I need these other things that just aren't possible here. There's a really snarky paper, well, maybe not that snarky, it was a pretty good paper I remember reading, I have it somewhere in my MSR journal, arguing that all this user-space scheduling is a fad and that people just end up going back to kernel threads, you know, threads multiplexed by the kernel, at some point. It's an interesting perspective, right? The ecosystem is moving towards having language-level support as opposed to OS support for concurrency and swapping contexts in
Starting point is 01:04:46 and out. And some of that is motivated by the fact that threads are not meeting people's needs, right? And, you know, maybe we could just make threads better. Yeah, I think in that world the Linux kernel will have to get smaller. Yeah. There's definitely a context switch overhead, and creation overhead. And I think some of the stuff we've talked about,
Starting point is 01:05:12 with statically knowing the stack size, is impossible without language support, right? So that's definitely unavailable to the operating system. Yeah, it's kind of an interesting debate, right? Yeah. Well, I wish you the best of luck on your stealth startup, I hope everything goes super well there, I'm sure you'll figure something awesome out. And thanks for being a guest here, I think this was a lot of fun,
Starting point is 01:05:38 and i feel like i learned a lot even though i've spoken to you a lot in the past i still feel like i learned a lot again
