Disseminate: The Computer Science Research Podcast - Paras Jain & Sarah Wooders | Skyplane: Fast Data Transfers Between Any Cloud | #26
Episode Date: March 13, 2023
Summary: This week Paras Jain and Sarah Wooders tell us how you can quickly transfer data between any cloud with Skyplane. Tune in to learn more!
Links: Skyplane homepage | Sarah's homepage | Paras's homepage | Support the podcast here
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the computer science research podcast.
I'm your host, Jack Wardby.
Today is a special day because for the first time I'm joined by two guests.
It gives me great pleasure to welcome Paras Jain and Sarah Wooders to the show,
who will be talking about the Skyplane project.
Sarah and Paras are both PhD students in the Sky Computing Lab at UC Berkeley.
Welcome to the show, folks.
Thanks for having us.
So let's jump straight in then. So maybe you can tell us a little bit more about yourselves and
how you both became interested in doing research in the data management area.
So yeah, I'll start. So I started my PhD about five years ago. Here, I've gone through several
phases. I actually started my PhD in
machine learning. So I was working on machine learning systems, infrastructure for scaling up
large scale models. And one of the most pressing problems when scaling up these models is that the
data sets become very large and unwieldy. And so just from my research itself, I started to try to
look at how could we try to solve these data bottlenecks when training these large models.
And so these types of models, you can think of things like diffusion models or GPT,
they can ingest like terabytes of data during the training process.
And then the models themselves are very large. They can be many gigabytes themselves.
And so this was a really firsthand problem I had encountered just moving these parameters and data sets around.
And so I made this transition to networking and systems research with the Skyplane project about two years ago. I also had a very similar motivation. Prior to coming to Berkeley as a PhD student,
I was actually working on a startup, training these e-commerce models to basically classify
different types of products based off the product images, basically. And with that project, moving around these huge image data sets that I had was a really
big pain. I was often switching between GCP and AWS, depending on what tools I wanted to use.
And so then when this whole vision of sky started coming up, when I was in the lab,
I was really immediately drawn towards solving specifically the data gravity problem
around moving between clouds, since that, to me, seemed like the biggest challenge.
Awesome. That's a nice segue, actually, into the Skyplane project. So maybe you can tell us a
little bit more about the project. And I think you hit on it there, data gravity. Maybe
you can tell us a little bit more about what that is and kind of what the problem is you're trying
to solve. Sounds good. Yeah. And before I do that, I just want to give a one sentence intro
on what the Sky Computing Lab's goals are.
So here at the Sky Computing Lab at UC Berkeley,
we are trying to build infrastructure
and platforms to enable applications
to run seamlessly across multiple cloud providers
and cloud regions.
And so, you know, again,
there's a variety of different reasons for that.
I can go into that if we're interested,
but the most important problem
you're going to encounter
in this cross-cloud or cross-region environment is something called data
gravity. And so data gravity, I think, is actually very simple. It's that when you work with large
data sets in the cloud, it is, number one, slow to move that data between different regions, right?
It can take many hours to move large data sets between different regions or clouds. Second, it's very expensive to move that data. And why is it expensive? Well, in the
cloud, you have to pay for egress fees. So every single byte of data you move over the internet
or a cloud network, you have to pay. And so because you have to pay for that volume of data,
it can cost a lot of money. And to ballpark, it can cost anywhere up to 10 cents per gigabyte to
move data in the cloud. So to move, let's say, a 200 gigabyte training data set, that's equivalent to
spinning up a 34-node cluster of m5.xlarge VMs for a whole hour, right? So it's really
expensive, actually, these data transfer costs in cloud. And the third factor of data gravity
is the complexity of moving this data. So you end up having to use a patchwork of different tools that are mutually incompatible and work with specific cloud sources or cloud
destinations. And so this data gravity problem, again, that it's slow, expensive, and complex to
move data in the cloud environment means that's kind of the key problem we want to solve in the
Skyplane project. And so Skyplane's goal is to make data transfer in the cloud,
essentially high speed, cheap, and very easy.
And I can go into details about how that, how we kind of accomplish that,
but at a high level, that is the goal of the project.
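To make the cost point above concrete, here is a rough back-of-the-envelope sketch of the egress arithmetic. The prices are illustrative assumptions in the ranges quoted in the episode, not current list prices, so the exact VM-hour equivalence will shift with whatever rates you plug in.

```python
# Back-of-the-envelope egress cost, with assumed (not current) prices.
DATASET_GB = 200
EGRESS_PER_GB = 0.09   # assumed inter-cloud egress rate, $/GB (episode quotes ~$0.01-0.19/GB)
VM_HOURLY = 0.192      # assumed on-demand price for an m5.xlarge-class VM, $/hr

egress_cost = DATASET_GB * EGRESS_PER_GB
equivalent_vm_hours = egress_cost / VM_HOURLY
print(f"Moving {DATASET_GB} GB costs ~${egress_cost:.2f} in egress alone, "
      f"roughly {equivalent_vm_hours:.0f} VM-hours of compute at ${VM_HOURLY}/hr.")
```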
Awesome. Cool. Yeah. I actually read the sort of, I guess,
the manifesto of the Sky Computing Lab a while back.
And there's a paper on arXiv, right?
I remember reading it and being really interested in it.
I think the thing that kind of piqued my interest in it as well was the
sort of economics of this, the idea of moving data between clouds and
what that sort of marketplace might look like in 10, 20 years in the future. But anyway, I digress, it
was a very interesting read. So yeah, you guys work on some really, really cool problems. So yeah, I
guess you mentioned a little bit there about the existing
tools sort of not being efficient. So I mean, I've stolen this line from your slide deck. So
what does life in the sky look like now today? I think the reality is there's actually very little
life in the sky right now. Like multi-cloud is actually pretty uncommon. You might have
some companies doing migrations every once in a while,
but the actual vision of sky computing,
which is being able to choose the best of breed software from different
providers and combine them into a single application is I think still pretty
far off. And a big reason for that,
which is what we believe is that this data gravity issue and these really
high egress fees that cloud providers charge
are basically preventing this from happening. So yeah, part of our motivation with Skyplane is
to be able to sort of achieve that vision by eliminating data gravity. Yeah, I guess,
if I was AWS, for example, I guess I'd want to lock people into my cloud as much as possible, right?
So I guess they have an incentive to make it as difficult as possible for people to move, right? So you keep spending money
with them. But I guess Skyplane is going to help alleviate some of these problems and make it
better for the end user at the end of the day. So maybe you can tell us a little bit more about the
architecture and the design of Skyplane, and how you went about making these design decisions
to kind of address this gap. Yeah, it might be helpful first to outline why the performance of current tools is so bad. And, you know,
when we actually profiled cloud networks, what we found was really surprising. So first of all,
when you begin to transit between different cloud providers or even two regions within a cloud,
it's very common to encounter congestion. And so you'll find this on places such as transatlantic
network routes. That's a single
kind of fiber cable that they had pulled that's shared across many, many thousands
of customers. And so that's going to be a highly congested resource. You have to compete for
capacity there. The other challenge is that we actually find that cloud providers actually
throttle network data transfers. And so this is applied, for example, in AWS.
If you get a VM, even if the network could sustain 10 gigabits per second of egress,
they're only going to let you transfer data at five gigabits per second, right? And so they're
actually explicitly throttling your network transfers there. And then on top of that, again,
beyond just speed, the costs are still real. And so that can vary anywhere from one or two cents
per gigabyte if you're transferring between two regions in AWS in the same cloud to anywhere up
to 10 or even 19 cents per gigabyte for some regions. So the cost structure is also very
complex and hard to navigate as an end user. So what Skyplane does is that we have a system that
periodically profiles the internet, essentially,
between all these clouds, and it measures the throughput and the cost for moving data.
And so we end up getting essentially this map of the public cloud internet, right? And so we have
all these different routes between all the providers, and we can see the throughput
from different region pairs. And so we go through this kind of cartography exercise. But after doing this,
we now have a map where we can kind of begin to navigate and find more efficient or more optimal
network routes that can route around congestion in the public clouds network. And so with these
two parts, we first kind of profile the network, and then we kind of plan around it. That is really
one of the key techniques we leverage to achieve better performance. That's fantastic. So you've got an upcoming NSDI paper that's going to talk all about this in some
depth. So I think that this primarily focuses on how you optimize the transfer, right? So the
cost and the throughput. So can you maybe tell us a little bit more about that paper and can you
dig into some of the insights and techniques? I think you've touched on them a bit previously, but going into them in a little bit more depth, and how you leverage
them within Skyplane. So I'll talk about this. The NSDI paper that we wrote is on
unicast. So what that means is: from a single source to a single destination, how do you move data?
So I'll briefly touch on that. I'd actually also like to leave some time to discuss
some of the upcoming work that we have under review now that's, I think, really exciting too, after this unicast. So for the single source, single
destination network transfers, again, as I mentioned, we profile all these cloud networks.
We got this map. If you have N regions, we have this N squared grid essentially telling us the
performance between any source and any destination. And we have an optimizer that actually carefully plans routes in cloud networks. And so really the hard thing here that we want to
solve is the optimization problem is that we want to minimize the cost of a transfer subject to a
user's desired replication time constraint. So the user might say, I want to move 100 gigabytes of
data and I want to finish in 10 minutes, right?
And so subject to that,
we need to find a way to move that data at the lowest possible cost.
And so we'll use a variety of different optimizations,
such as this overlay routing I mentioned,
where we route around kind of slow parts
of the cloud network.
We also have other tricks.
So we do things like we automatically manage
elastic VM pools and parallelism.
And so in cases like where the cloud provider like AWS throttles the user,
we can essentially provision more VMs to burst past that throttling constraint, right?
And so our optimizer also kind of in the same integrated problem decides when to deploy that parallelism trick to essentially conquer throttling. And also we perform automatic bandwidth tiering and compression that help reduce egress fees,
right?
So it turns out it's really surprising, but it's well worth paying the extra money to
get some extra vCPUs to run some compression and deduplication so you can reduce your egress
fee.
And it's almost like a 10 to 1 ROI there, right?
You save a lot of money there in aggregate
by doing this compression and deduplication,
which is really surprising.
That's another insight that we leverage.
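As a rough illustration of the planning problem described above, here is a toy sketch, not Skyplane's actual planner, that simply picks the cheapest profiled route which still meets a transfer deadline. Skyplane's real optimizer solves a richer problem that also chooses VM counts, parallelism, and compression jointly; the region names, prices, and throughputs below are made-up examples.

```python
# Toy "minimize cost subject to a replication-time constraint" planner.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Route:
    hops: list[str]          # e.g. ["aws:us-east-1", "gcp:us-central1"]
    cost_per_gb: float       # summed egress cost along the path, $/GB (assumed)
    throughput_gbps: float   # bottleneck throughput of the path (assumed profile data)

def plan(routes: list[Route], transfer_gb: float, deadline_s: float) -> Route | None:
    # Keep only routes fast enough to meet the deadline, then take the cheapest.
    feasible = [r for r in routes
                if (transfer_gb * 8) / r.throughput_gbps <= deadline_s]
    return min(feasible, key=lambda r: r.cost_per_gb, default=None)

routes = [
    # Direct route: cheap but congested.
    Route(["aws:us-east-1", "gcp:us-central1"], cost_per_gb=0.09, throughput_gbps=2.0),
    # Overlay route detouring through an intermediate region: pricier but faster.
    Route(["aws:us-east-1", "aws:us-west-2", "gcp:us-central1"],
          cost_per_gb=0.11, throughput_gbps=12.0),
]
print(plan(routes, transfer_gb=100, deadline_s=300))  # tight deadline forces the overlay route
```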
Fascinating.
Is there like a tipping point there
where, kind of up to a certain level,
it depends how much data you're transferring,
whether you actually get a benefit, right?
Or is it kind of always a win?
Really interesting.
So we don't necessarily have strong support for very small files.
So if you're doing like, you know,
five to 12 kilobyte key value kind of store,
you want to do partial replication of that.
I mean, it's not necessarily the right tool for the job.
So here I would think, you know,
if you're moving five gigabytes or more,
you'll probably see a strong benefit
from using Skyplane compared to existing tools.
But again, we keep putting more and more optimization in the system. It's open source,
and we have a very vibrant community of contributors who are helping us optimize
performance further and further. So I think that tipping point is starting to shift lower and lower,
which I think is really exciting. Yeah, the main reason why there's sort of like that five
gigabyte threshold is because Skyplane is actually creating VMs in your AWS account to run that transfer.
And so that startup time for the VMs is going to take about 40 seconds, which is sort of a constant overhead that we can't really do much about unless we have like some continually online Skyplane cluster, which we don't have right now.
So because of that, it's only really beneficial for larger transfers. Just to summarize briefly, so the kind of the four ingredients
in the secret sauce is the overlay routing,
being able to dynamically adjust the number of VMs,
the network tiering and compression,
and then being able to have parallel TCP connections,
right, they're the four sort of ingredients
that go into making Skyplane.
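A quick sketch of why the roughly 40-second VM provisioning overhead mentioned a moment ago pushes the sweet spot toward larger transfers; the throughput numbers here are assumptions for illustration, not measurements from the paper.

```python
# Fixed provisioning overhead vs. transfer size: small transfers are dominated by startup.
def transfer_time_s(size_gb: float, throughput_gbps: float, startup_s: float = 0.0) -> float:
    return startup_s + (size_gb * 8) / throughput_gbps

for size_gb in [1, 5, 50, 500]:
    baseline = transfer_time_s(size_gb, throughput_gbps=1.0)                      # e.g. a single-stream copy tool
    provisioned = transfer_time_s(size_gb, throughput_gbps=10.0, startup_s=40.0)  # spin up VMs, then go fast
    print(f"{size_gb:>4} GB: baseline ~{baseline:5.0f} s, provision-then-transfer ~{provisioned:5.0f} s")
```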
Cool, so you mentioned there about other,
like maybe further optimizations that you might consider as well. What's on the roadmap there? Like,
what other things do you think could be interesting? Yeah, so one recent follow-up paper that we
just submitted actually was on how we can extend the optimizations in Skyplane unicast to multicast,
or broadcast. So this is basically where you want to transfer data from one
source to multiple destinations. And in this scenario, you can actually get much more significant
wins in both the throughput and cost improvements, because you might be over and over again using
expensive links. So oftentimes in AWS, you'll have regions that are much more expensive, so maybe 19
cents per gigabyte, as opposed to the standard one or two cents per gigabyte. So one optimization that we do in the multicast scenario is we can move the data
once from an expensive source region to a cheaper overlay region, and then broadcast the data from
there. Another really important optimization that we do is we basically have multiple stripes of the
data. So as opposed to sending all the data along one distribution tree,
we split up the data along multiple distribution trees so that no single node or single path is burdened with all the data. So from those two optimizations, we actually end up getting
very significant cost savings on both the egress and then also throughput improvements.
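To illustrate the relay idea with some toy numbers (assumed prices, not actual cloud rates): paying an expensive region's egress once to a cheap intermediate region, and fanning out from there, can beat paying the expensive rate once per destination.

```python
# Direct fan-out from an expensive source vs. relaying through a cheaper region.
N_DESTS = 6
SIZE_GB = 100
EXPENSIVE_EGRESS = 0.19   # $/GB from the expensive source region (assumed)
CHEAP_EGRESS = 0.02       # $/GB from the relay region (assumed)

direct = N_DESTS * SIZE_GB * EXPENSIVE_EGRESS
via_relay = SIZE_GB * EXPENSIVE_EGRESS + N_DESTS * SIZE_GB * CHEAP_EGRESS
print(f"direct fan-out: ${direct:.0f}, via cheap relay: ${via_relay:.0f}")
```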
And this isn't in the NSDI paper, this is additional work that's under review at the minute at another conference, right? Yeah, okay, cool. Well, good luck with that submission,
hopefully it gets in. While you guys were talking there,
I was just thinking, and we're kind of going back to what we said at the very start of the show, about
how it's in kind of cloud vendors' interests to lock people in, right, and how they essentially,
in various scenarios, throttle things, right, so you don't get the full bandwidth or
whatever. Is there any sort of, not concern exactly, but, if they detect that someone's doing this, could they
throttle or do other things to kind of stop them moving to different cloud vendors? For example,
I'm thinking of if, all of a sudden, the cloud provider
becomes sort of adversarial in this.
I think that would be really hard for the cloud providers to do
because we use all public cloud APIs.
So they would have to figure out that it's Skyplane making these API calls.
And then the other thing is that we also run inside of our users' accounts.
It's the users' VMs that are actually executing these transfers.
So it's not like there's some Skyplane service that they can block.
It's all something that you run yourself and it's open source.
And like just because we're sort of saving costs for the user,
I don't think necessarily means that it's bad for the cloud provider.
Because if you think about like why are the clouds actually throttling
or why are they making some paths more expensive than others,
it's sort of in a way like trying to shape demand, right?
So what Skyplane basically allows is for users to actually in an automated way,
react to the pricing that the cloud provider set. So I don't think it's necessarily a bad
thing for the cloud providers. Cool. So I guess I'd like to touch on a little bit about the
implementation of Skyplane. Can you maybe describe a little bit about how it's implemented,
how long it took you to implement and what implementation effort it was like?
Yeah, so it's open source.
And so that's a really, I think, important part of the project, that we want to ship this to the real world.
We wanted to see this actually be implemented in real applications, because, you know, it's going to improve our research long term to learn about the use cases for which people are using sky computing.
Right. So
we had an initial implementation actually done for our first paper submission within about
three months. Actually, it was a very, very quick project turnaround that way. We had
built the system originally in a few different languages. So I had done some initial tests with
Go or C++, and then we figured out it's all IO-bound.
So actually, however remarkable that is,
the majority of the prototype is written in Python.
And so we're able to get, you know,
hundreds of gigabits per second of aggregate throughput with this, right?
Through, again, just very careful implementation
using careful system call interfaces,
like splice and kind of zero copy kind of mechanisms.
But yeah, I mean, that was sufficient to get very high performance that has really kind
of enabled high velocity and means our community of contributors are able to kind of upstream
new cloud connections, new object store interfaces, or even kind of new kind of technologies like
different encryption techniques very quickly.
Right.
And so the on-ramp to kind of contributing back to the project is very low.
And so, you know, I think that's been a really important kind of enabler for the project today.
Sure, I guess having it in a language like Python makes it a lot more open to a lot more people, right?
So as a user then, how do I interact with Skyplane?
And so what's my experience like?
Because if I wake up tomorrow and want to move some data around using Skyplane,
how does that look like?
You can just pip install it.
It's very easy.
So you just pip install Skyplane
and now you'll have the Skyplane CLI.
And so if you wanted to do a transfer
from, let's say, between two AWS regions,
all we'll require you to have is the AWS CLI installed
and have your cloud credentials
logged in. So, you know, I guess you'll do like AWS configure auth or something. And, you know,
as long as that's kind of set up, which if you're using AWS, you probably already have,
Skyplane will just work. And so behind the scenes, you know, Skyplane supports scale to zero. So when
you kind of just installed it, there's no VMs, there's no state,
there's nothing running.
Everything lives on your local laptop.
But as soon as you type Skyplane CP
and you might say S3 in the source
and S3 in the destination,
then Skyplane will actually provision virtual machines
to actually serve as ephemeral cloud routers.
And that only exists for the duration of your transfer.
That's actually a big advantage of using Skyplane
from a user's perspective compared to other systems, which leave VMs running all day and kind of cost a lot of money. We actually started out trying to design this for other researchers like us. So, you know, small-scale users who have a lot of data, right? And so the system kind of is there when you need it, and it scales to zero and deprovisions everything and cleans it up when you're not using it, right?
And so behind the scenes, there's a lot of complexity in terms of provisioning, doing things securely.
So managing encryption and cloud keys and IAM and all this stuff.
But that's all kind of abstracted away behind the CLI.
From the user's perspective, you pip install and you're ready to go.
One thing I'll also mention is we also just recently released the Python API.
So now you can actually directly from code provision a Skyplane basically cluster and then execute transfers on
top of that and deprovision everything from Python. So we built this actually because we
were hoping that people could sort of build applications on top of Skyplane. So we actually
have an example of like ML data loader and an Airflow operator that are built with Skyplane's API on our documentation page.
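For reference, a minimal sketch of what using the Python API looks like. The names here (skyplane.SkyplaneClient, client.copy) follow the Skyplane documentation around the time of this episode and may have changed since, so treat this as approximate rather than authoritative; the bucket paths are placeholders.

```python
# Minimal Skyplane Python API sketch (API names may differ between versions).
import skyplane

client = skyplane.SkyplaneClient()
# Provisions ephemeral gateway VMs in your own cloud accounts, runs the transfer,
# then deprovisions everything (the scale-to-zero behaviour described above).
client.copy(
    src="s3://my-source-bucket/dataset/",
    dst="gs://my-destination-bucket/dataset/",
    recursive=True,
)
```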
The deprovisioning thing would have saved me a lot of trouble over the years, because I've been one of numerous people I know of leaving VMs running when I shouldn't have done, and stuff.
Right. Because that cost, well, cost the university a lot of money, and I got shouted at for it.
Right. So anyway. Cool.
Yes. You said some apps were built on top of Skyplane, so can you maybe tell us a little bit more about those? I'm
not familiar with Airflow. So Airflow is the Swiss Army knife of, essentially, data pipelines in
industry. So for a lot of ETL workflows, Airflow is an orchestrator. So, you know, it's a service
where you can schedule workflows, meaning, like, here's some job inputs and here's a particular computation you might run, like running a Spark SQL query.
And then you want to write the outputs to a particular destination.
And you might say, I want to run that job every night.
So it's essentially the system, the orchestrator, that runs these workflows over data pipelines.
What's really common is people have data in a thousand and one sources.
You might have data in AWS S3,
then Google Cloud Storage, then Snowflake.
Then you might have even data in, you know,
software as a service providers like Salesforce
or, you know, Facebook marketing.
And then you might even have data on-prem
and then it gets very complex.
And so in Airflow today,
the state of the art is these kind of S3 to GCS, S3 to Azure
blob, S3 to Azure ADLS kind of connectors.
And each one has a slightly different interface, right?
And so it's a lot of user burden to learn how to use this tooling in order to get your
Airflow jobs to kind of operate and consume data from all your data sources.
But this is, again, very important for organizations. We have an integration with Skyplane and Airflow that
enables users to effectively bridge data from any source and any destination support in Skyplane.
And we will take care of moving that data at really high performance. So, you know, if you
actually use this, even if you use like, let's say the baseline, you know, S3 to S3 connector
for Airflow, you might only get data transfer speeds of 20 megabytes per second. But with
Skyplane, we're able to complete those transfers at, you know, tens of gigabits per second.
So you get better performance, lower cost, and it's easier to use, quite honestly,
because there's a single tool that covers all of these cloud destinations.
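As a hedged sketch of what the Airflow integration described here could look like: Skyplane ships its own example operator in its docs, but a plain PythonOperator wrapping the (assumed) Python API shows the shape of a nightly cross-cloud copy. Bucket paths, DAG names, and the Skyplane API calls are illustrative assumptions, not the project's actual operator.

```python
# Illustrative Airflow DAG: nightly S3 -> GCS copy via Skyplane's Python API (names assumed).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def copy_with_skyplane():
    import skyplane
    client = skyplane.SkyplaneClient()
    client.copy(src="s3://etl-staging/outputs/",
                dst="gs://analytics-warehouse/outputs/",
                recursive=True)

with DAG(
    dag_id="nightly_cross_cloud_copy",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="skyplane_copy", python_callable=copy_with_skyplane)
```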
Great. So the Airflow is not part of the same project, right?
This is just sort of,
it's not within the Sky Computing Lab.
It's totally separate.
It's about 100 lines of code, actually, in our examples.
It's very simple to use over our API.
So we expose all these primitives up to users,
and you can programmatically kind of configure,
we call it a data plane.
That's like this group of VMs that runs across the clouds
and kind of describe what
inputs and outputs you're kind of plumbing through the system. So, you know, it's actually very easy
to use and our intent is like to have this be upstreamed into kind of other open source libraries.
I think Sarah also mentioned, we have another example of a cross cloud ML data loader. So if
you have data in one region and you want to train on a GPU in another region,
we have an example where you actually will use Skyplane's
overlay to plumb the data from the source to your GPU, right?
And just stream it directly from object storage to the GPU.
And so that's another kind of simple example
that's actually implemented very cleanly
and very simply over our API.
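A generic sketch of the cross-region data-loading pattern described here: prefetch the next shard in a background thread so the GPU is never waiting on the network. Plain boto3 stands in for the transfer step (their example streams via Skyplane's overlay instead), and the bucket and shard names are placeholders.

```python
# Background prefetch of training shards from remote object storage.
import queue
import threading

import boto3

BUCKET = "my-training-data"
SHARDS = [f"shard-{i:04d}.parquet" for i in range(8)]
prefetched: "queue.Queue[str | None]" = queue.Queue(maxsize=2)

def prefetch() -> None:
    s3 = boto3.client("s3")
    for key in SHARDS:
        local_path = f"/tmp/{key}"
        s3.download_file(BUCKET, key, local_path)  # the cross-region hop to accelerate
        prefetched.put(local_path)
    prefetched.put(None)                           # sentinel: no more shards

threading.Thread(target=prefetch, daemon=True).start()

while (path := prefetched.get()) is not None:
    # train_on(path)  # consume one shard while the next downloads in the background
    print("training on", path)
```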
Cool.
We've thrown kind of some numbers around
across the course of the chat so far.
So I'm guessing you've done some sort of evaluation
of Skyplane and under various different experiments.
So I was wondering if you could maybe talk
a little bit more about some of the experiments you've run.
I guess maybe revisit the questions
you were trying to answer
and kind of talk about how those,
I guess what you compared it against maybe as well, right? Because that's kind of not obvious, because it's a new thing. So yeah, maybe tell
us a little bit more about your evaluation of Skyplane is where I'm getting at, in a long-winded
question, I apologize. So I can talk about the unicast project. So that is, again, on arXiv,
you can just search and find it, or on skyplane.org we have the paper linked. So we compared against
several vendor baselines and some open source slash academic baselines,
but the vendor baselines are very strong.
So what we compared against are AWS DataSync,
GCP Cloud Data Transfer,
and then we compared against Azure's AzCopy service
that allows for high-speed data transfer
between Azure services.
And so these are cloud provider-specific services, so they should be highly optimized
and, you know, overfit in some sense to each cloud provider.
And they have visibility into their own networks and capacity and stuff.
So, you know, these tools are quite strong baselines in some sense that the cloud providers
themselves, this is their sanctioned way of moving data.
So relative to some of these tools, for example, like against AWS DataSync, we're up to five
times faster for data transfer within a single cloud, right? And so already, even just if you're
in AWS and you don't use another cloud provider, you can get up to five times higher throughput
than AWS DataSync. And it's substantially cheaper because we don't charge
a service fee, right? AWS DataSync charges you the egress fee plus an additional service fee per
gigabyte you move. You get better results, obviously, moving between clouds. You know,
the results vary there, but again, you know, the vendor tools often don't have very good support
for cross-cloud data transfer, right? They often only allow you to move data into their cloud,
but they don't support generally moving data out, for example.
So it's a little hard to evaluate.
We also considered some academic baselines.
So, for example, we considered GridFTP,
and we have better price and performance compared to GridFTP.
I think the most interesting, exciting results
are actually in some of the follow-up work.
Yeah, so in the multicast work, we had a couple of baselines. So one of them was AWS S3 multi-bucket
replication. So S3 basically allows you to write these replication rules. So like when you write
to that bucket, it automatically gets replicated to a bunch of other buckets that are connected
to it. So that was one baseline we had. And another one was BitTorrent. So that's a really
common tool people use to basically disseminate data that might be located in one place to a bunch of other destinations.
So we did a bunch of experiments basically comparing about 100 gigabyte transfers from one source to about six destinations.
And our numbers are usually around like 60% cost savings and almost up to three times replication time speed up.
And I also want to note that the results actually get better and better the more destinations you add
So for larger sets of destinations, the cost-saving opportunity and also the replication
time reduction opportunity actually get bigger. We spoke about this a little bit
earlier on, but sort of, what are the current limitations of Skyplane?
So Skyplane today supports cloud-to-cloud destinations, and we have very limited support
for on-prem to cloud, but that is a very hard problem.
But we have a student in the group who is working on high-performance on-prem to cloud
data transfer.
But that is probably one of the biggest areas that our users have been asking for new solutions
in, and that's something
we're trying to evolve the project to better support. Sure, cool. You obviously get a lot of
feedback, or have got a lot of feedback, from people actually using Skyplane, which
is, I guess, quite different to most sort of academic projects, right? Most of us kind of
do something and it disappears into the ether, and we're lucky if anyone looks at it again.
And so my next question, which I normally ask, is kind of: as a software developer, how can I leverage the findings in your research, in Skyplane in this case?
I guess it's just kind of go and use it and see how it works for you,
right, and feed back in that way.
But I guess, what impact do you think your work with Skyplane can kind of have for the average software developer
in their day-to-day life?
Yeah, so I think for the average software developer,
if you have large amounts of data,
so anything more than, like, 10 gigabytes probably,
and if you're having to use that data from a location
that's not the same place where that data is,
so whether it's a different region
and the same cloud provider
or a different cloud provider,
then Skyplane can help you access that data faster.
And also if there's data that you have to synchronize
that's maybe living in different locations,
you can also synchronize that data faster
using a tool like Skyplane.
There's other applications,
sort of higher level applications
that we've been sort of starting to think about designing.
So, for example, things like disseminating model weights or containers to a large number of destinations.
So I think especially something like a high performance container registry built on top of Skyplane multicast could be something that's really useful for sort of higher level applications used by software developers as well.
I mean, I guess kind of working on this project,
you seem to have covered some really fascinating areas and problems.
But are there any ones that sort of stand out as being the most interesting or maybe unexpected lessons that you've learned while working on this project?
I think for the multicast work, I was always really surprised by how crazy some of the distribution trees ended up looking.
We had some examples that were like a multi-cloud transfer from AWS to maybe some other AWS, GCP, and Azure regions.
And there were some really unintuitive paths, like going from AWS to GCP back to AWS and then to Azure that were part of the
tree that were really surprising to me, but I couldn't, and they almost seemed wrong at first,
but I couldn't figure out a better solution. So I think they're probably right. So that was
really surprising to me. I think it's sort of the erraticness of cloud pricing causes the optimal
distribution tree structure to look really, really strange in some cases.
Do you notice it changed quite significantly between sort of like day to day as well?
I mean, I'm not really too familiar with how dynamic cloud pricing is.
But did you notice doing the same job on a Tuesday one week to then do it on the Thursday could result in a completely different distribution tree?
So cloud pricing itself is actually very, very static. I think it might like people might change
their pricing around like every couple of years. I think something that's more likely to change the
pricing models is the impact of newer, non-incumbent clouds. So there's some new cloud providers who are
basically trying to compete with the big players by offering free egress. So Cloudflare is one of these. And we
actually found another sort of surprising result that we had was that you can actually use these
these clouds like Cloudflare as kind of intermediary points. So if you're transferring
from AWS to GCP and Azure, instead of paying egress twice, so once to Azure and once to GCP,
you can actually transfer the data through Cloudflare,
pay egress once, and then broadcast the data from Cloudflare to the other clouds.
So I think that's potentially something that's more likely to impact pricing in the long run
than like sort of dramatic changes to cloud provider pricing.
How easy is it to add support for, like, a new cloud? Let's say some startup launches a new cloud provider tomorrow.
How easy is it to add support for that for Skyplane?
Is it a lot of work or is it pretty simple?
So if they're S3 compatible, it's not that hard
because worst case we can just use VMs in other clouds
to basically just connect to their S3 compatible API.
I guess, do you want to add to that?
Yeah, so we've been collaborating pretty closely with some contributors
from IBM who have been working to add IBM cloud support.
I think their PR is like under 800 lines of code.
So, you know, it is some work today, but it's not like tremendous.
I think it's something that's pretty doable,
and we're working very hard to kind of reduce the amount of changes necessary.
In fact, one of the critical projects that is kind of starting around this effort is
to create a SkyKit, in some sense a common provisioning infrastructure that, you know,
we can share with Skyplane, but also with other projects.
With this idea of modularity, kind of leveraging existing cloud provider integrations
with tools like Terraform, for example, so that, you know, you don't have to keep reinventing the wheel
and rewriting integrations to clouds. It should be really easy and really simple to add these new
cloud providers. Great stuff. Yeah, I guess the next question is something I often ask my guests
as well, which is that, obviously, going from the initial idea for some piece of research,
some piece of work, to where we are now, were there things along the way that you tried that failed,
and the listener might benefit from knowing about so they don't make the same mistake again,
or that were just an interesting dead end that you found? Yeah, so since starting the project, I mean,
I actually didn't want to work on this problem necessarily. I wanted to kind of work at higher level applications. So, you know, how can we kind of build these applications in Sky Computing context, such as machine learning, but, you know, as I started digging into that problem, I was like, the most basic infrastructure here is abysmally slow and expensive, right? And so that has focused our research around it. And it's kind of like in the beginning, I was a little hesitant to work on this problem. But it actually turns out just in
a simple data transfer setting, there is so much work that can be done. And so, you know, the
insight here, I think I took away from this experience is kind of start simple, you know,
I really wanted to build some big overarching system to let you do kind of cross cloud machine
learning or data analytics or something. But really, it's even just the really simple problems like data transfer that really are
most critical, right?
And I never really appreciated the complexity of how to kind of accomplish this in a cloud
environment easily.
I think one really important thing also that kind of enabled the research was kind of having
this open source mindset.
So, I mean, academically, I mean, I think there was a little skepticism initially,
what would the insight be? What would the key kind of optimization space look like?
Taking an open source kind of mindset, kind of looking at the best of breed tools today for data
transfer, and just profiling them carefully. You know, that's where we kind of discovered this
overlay technique, and then kind of the research agenda kind of fell into place from there.
Right. And so really it's just taking the best of the tools and kind of hacking on them and trying to figure out incrementally,
how can I just make this better or better and better?
Yeah, that's really nice. I mean, when did this idea initially start?
How far back are we talking? Is it like a multi-year sort of project or more recent?
It's been about two years since I started thinking about this area.
But as I said, the first implementation, very incremental, very quick.
We got it done in three months or so for that kind of first paper submission.
But, you know, since then, I mean, it's a large team here.
We have, you know, about five PhD students working on the project
and several undergraduate students on the project too.
So it's been very encouraging to kind of see how the project has grown over time.
Yeah, I guess what's next for Skyplane then?
Kind of maybe the short-term vision and then the long-term vision
and some of the stuff that you've got in the pipeline.
So I feel like at this point,
we basically have these really powerful communication primitives, like whether it's unicast or multicast, the ability to move data really quickly and cheaply across clouds.
So I think the main things that we've been thinking about right now have basically been: what are the sorts of applications that we can now build on top of this?
So one thing that we've been thinking about is potentially building a sky storage layer. So basically sort of like a shared object store that spans clouds, except the underlying replication is done by Skyplane.
So if you write to one location, that gets replicated by Skyplane really quickly and cheaply
to all the other places where you might want your data accessed. And so we can kind of co-optimize
that with object placement to basically have like a very fast, cheap replicated object
store. Another thing that we've been thinking about that I mentioned previously is also building
either a container or model registry. So this was kind of a use case we stumbled into when we were
looking for baselines for the broadcast, sorry, the multicast work, where a lot of, we were finding
a lot of other similar systems were actually focused on building container registries. So if you have a new container, how do you push that to a bunch of destination locations where you might want to run that container?
So one other thing that we've been thinking about is like, could we basically build like a container publishing or also similarly with models like model weight publishing platform on top of Skyplane Multicast?
Some really interesting directions and work that's going on.
This would be a nice time to talk a little bit about
some of the other research.
I know there was a paper at CIDR recently
about lake houses.
So maybe you could give us
the quick elevator pitch for that
and tell us about some of that work
and how that links to the Skyplane project.
Yeah, so I was working on the Skyplane project
and you look at what do people store in their buckets?
Well, it's really surprising, but a ton of it is just Parquet.
And so people are storing tons and tons of this relational data in their data lakes, in effect.
And so one of the things I wanted to understand was: what are the current storage formats for this tabular data in these data lakes?
And a very recent trend in the industry
has been the development of the lake house architecture.
So here you have disaggregated compute from storage.
And usually this means you're going to store this data
in object storage.
So like AWS S3, for example.
And then you utilize a technology such as Delta Lake
or Apache Hudi or Apache Iceberg
to provide
ACID on top of that, right? So now you have transactions. This is an emerging paradigm in
this kind of analytics area, but it's very appealing because now you're storing your data
in Parquet format, which is highly compressible and it's in a data lake, right? So you can scale
to zero when you don't need to do any queries. And it's very cheap to store data in Amazon S3.
And so the Lakehouse benchmark,
which is what the paper we submitted to CIDR was,
or LHbench, was effectively an attempt
to kind of define a single benchmark suite
that kind of understand the pros and cons
between some of these different systems
that exist today, right?
So we tested, again, Apache Hudi, Apache Iceberg, and Databricks Delta Lake as the three key formats in that benchmark.
Which one should I use? Which one's the best?
So the answer here is pretty interesting. It depends. It really depends if you're kind of
trying to optimize for writes or reads. So for example, Iceberg and Hudi have a very special
mode of operation where they support copy on write and merge on read tables. And so that allows them to convert
their database from operating in a read optimized setting to a write optimized setting. And so
really depending on your workload, you might choose a different tool. And so in general,
though, we did find that Delta Lake performed extremely well. That may not be surprising
because it's actually well optimized with Spark specifically to ensure it gets high IO performance.
But I don't believe that's something fundamental to the design of Delta Lake.
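For readers who haven't seen the lakehouse pattern in code, here is a minimal sketch (not LHBench itself): Parquet files managed by a transactional table format, Delta Lake in this case, with the kind of merge workload where the copy-on-write versus merge-on-read trade-off shows up. It assumes the pyspark and delta-spark packages are installed, and uses a local path where a real deployment would point at object storage such as S3.

```python
# Lakehouse sketch: Delta Lake (Parquet + transaction log) with an incremental merge.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

table_path = "/tmp/lakehouse/events"   # stand-in for an object-store URI like s3a://...

# Initial load: Parquet data files plus a transaction log give ACID on cheap storage.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])
events.write.format("delta").mode("overwrite").save(table_path)

# Incremental refresh: small periodic merges with reads in between, the workload where
# read-optimized vs write-optimized table modes (copy-on-write vs merge-on-read) matter.
updates = spark.createDataFrame([(2, "purchase"), (3, "view")], ["id", "event"])
updates.createOrReplaceTempView("updates")
spark.sql(f"""
    MERGE INTO delta.`{table_path}` AS t
    USING updates AS u ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET t.event = u.event
    WHEN NOT MATCHED THEN INSERT *
""")
spark.read.format("delta").load(table_path).show()
```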
Cool. You say they all have sort of trade-offs to make. Do you think there would be some sort of idealistic system that would pick the correct trade-offs and do
best across all the different sorts of workloads you evaluated on, or are you always going to
have to make some sacrifices somewhere, right? But I guess, yeah, is there some sort of potential
new system you could develop off the back of that that would tick all the boxes and be faster than
all the existing solutions? So I think these different frameworks are adding, you know, specialized fast paths for specific
workloads into their frameworks. We evaluated TPC-DS as our benchmark, so it's a pretty general-purpose
analytics benchmark, but what's unique is we also tested an incremental refresh benchmark,
and that's a new representative workload where you continually kind of do small merges periodically with queries interspersed in between. I mean, so in that
specific workload, actually a lot of these different systems are adding specialized support
for merge on read, which is a new capability that lets you defer the merge or kind of compaction of
data until that data is kind of read. And so again, as I mentioned, Iceberg and
Hudi have implemented this, but now Delta Lake, for example, has added deletion vectors, which I would see
as a variant of this kind of approach. So the answer to your question in some sense is I think the
frameworks are kind of learning from each other and taking kind of the best of each of these
different approaches and kind of hybridizing in some way, right? But again, it really depends on
the structure of the workload. Is it going to be write-dominated or read-dominated? I mean, what level of transaction support do you need? For example, some systems
utilize locking, versus others use, you know, optimistic concurrency control, like Delta Lake.
I guess, kind of, I think this is my third-to-last question now, so it's not the penultimate
one, it's the one before the penultimate one. And this is always a really interesting question, because I ask this to everybody,
and it's really nice to see the divergence in answers. And the question is: how do you
go about generating ideas and then selecting projects to work on? Because you're working on
some fascinating stuff. Like, how do you decide what to work on? What's your process there for that?
We're both advised by Ion Stoica,
and I think he really indoctrinates us with the idea that you should solve problems
that are real, right?
Which is harder than it sounds
because you have to find some sort of
very fundamental problem
that actually affects people in industry.
And if you can solve it, then it would actually provide a lot of benefit to a lot of people.
So I think that's definitely something that we look for in terms of selecting research ideas.
I think with Skyplane, what was really appealing to me about this project was that it's such a sort of simple, kind of basic component of system building,
like just transfer and communication, that if you can make that more performant,
then there's so many other things that you can improve the performance of as well.
So I think that was really exciting to me to work on something that was pretty low level and fundamental to a lot of systems.
Yeah, I guess. Do you have any thoughts there, Paras?
Yeah, my thoughts here are pretty interesting. So when I started working on this project,
I took some of the best systems I could find for data transfer and other tools like data storage,
object storage, you know, all these different technologies. And I just try to see what is
the state of the art here? And then how can I kind of hack on these systems? And because
all of these systems are
open source, right, they're all kind of white box, you can kind of open them up and start to poke
around. I began to kind of take a look and see what can I kind of tweak here and there to try
to improve the performance. And so, you know, as I started to go through that process, I developed,
you know, I think, a lot of understanding for kind of the trade offs in these systems, right.
And so that's what I
really like about working on real problems in some sense here is that like there are existing
solutions that we kind of can pick up and start to test and kind of take apart, right? And as you
kind of go through that process, it's very often just some very simple insight that's like, you
know, just a few feet under the surface of the water that, you know, just needs a little bit
of digging to find, right? But that research journey is always very satisfying
and sometimes very challenging,
but it really takes that kind of, I think, confidence
to be able to take kind of an existing solution,
kind of rip it apart and say, like, what can we do better here?
Yeah.
Two really great answers there.
I can add them to my collection now of answers to this question.
That's fascinating.
I mean, I guess both you guys originally come from, like, you've got a background in industry. I mean, do you find that kind of having
that industry experience and then going to do a PhD is kind of
a better path to take, in a way that you're grounded in, kind of, I'm going to go and work on real-world
problems, I want to solve real-world problems? Do you think that kind of helped in a way?
The first two years of my PhD were pretty challenging
because I, after spending some time in the industry,
I was kind of programmed a different way.
And I kind of lost that research context,
you know, that mindset of which you go through
and study these problems.
But I think through that journey, that kind of,
yes, I think that has grounded my research.
And it kind of drives me to try to find problems that I think are actually real in some sense.
Like, meaning, if I'm going to work on a problem, I want to always find somebody who will benefit from the solution, right?
It may not be today, but at least, you know, some point down the line, right?
I think what has been challenging about that is, again, like, I've had to kind of go back and relearn the scientific method in some sense, like the whole scientific discovery process of how do you formulate a hypothesis and actually tractably design an experiment to answer that hypothesis.
But I think like, again, you know, there are different mindsets, kind of industrial design versus research. But I think, you know, we now are working on real technology
with real impact
while still having that scientific contribution.
So I think it's a really nice balance to strike here.
Yeah, you're working in a sweet spot.
Yeah, for sure.
Great stuff.
I guess now it's the penultimate question.
So what do you think is the biggest challenge now
in database data management research?
That's a really interesting question.
What I think after kind of studying this area is that we're kind of in this new world with cloud computing.
So we have elasticity.
You have scale to zero.
So fundamentally, just the pricing model for everything to do with computation has changed.
And so along with that, we kind of have to revisit a lot of fundamental assumptions
and how we build data platforms.
I think the ideas themselves
that we are leveraging
aren't necessarily intrinsically novel, right?
So for example, overlay networking
is an idea that dates back over 20 years, right?
I mean, Ion worked on this back in 2000
in his PhD.
And so in some sense, it's a very classic idea.
But because the underlying constraints of the system with cloud computing have changed so significantly,
I actually think we have to go back and revisit a lot of very fundamental design decisions and how we build and construct data platforms.
Amazing. And I guess now it's the last question.
So what's the one thing you want the listener to take away from this episode and from your work on Skyplane?
Yeah, so Skyplane today is open source.
We have a blog post discussing how we're able to make transfers across clouds up to 110 times faster and 3.8 times cheaper.
You can use Skyplane today both with the CLI tool and also the Python API.
So if you're moving data across clouds
or just even across regions,
we'd love to have people try it out.
Fantastic. And let's end it there.
Thank you so much, Sarah and Paras, for coming on.
It's been a fascinating conversation.
I hope the listeners really enjoyed learning about Skyplane.
I know I have.
And if you're interested in knowing more about all of their awesome work,
we'll put links, I think, in the show notes.
And you can support the podcast by buying me a coffee,
donate through there, and we'll keep making this podcast.
And we'll see you all next time for some more awesome computer science research.