Disseminate: The Computer Science Research Podcast - Paras Jain & Sarah Wooders | Skyplane: Fast Data Transfers Between Any Cloud | #26
Episode Date: March 13, 2023
Summary: This week Paras Jain and Sarah Wooders tell us how you can quickly transfer data between any cloud with Skyplane. Tune in to learn more!
Links: Skyplane homepage | Sarah's homepage | Paras's homepage | Support the podcast here
Hosted on Acast. See acast.com/privacy for more information.
Transcript
Hello and welcome to Disseminate, the computer science research podcast.
I'm your host, Jack Wardby.
Today is a special day because for the first time I'm joined by two guests.
It gives me great pleasure to welcome Paras Jain and Sarah Wooders to the show,
who will be talking about the Skyplane project.
Sarah and Paras are both PhD students in the Sky Computing Lab at UC Berkeley.
Welcome to the show, folks.
Thanks for having us.
So let's jump straight in then. So maybe you can tell us a little bit more about yourselves and
how you both became interested in doing research in the data management area.
So yeah, I'll start. So I started my PhD about five years ago. Here, I've gone through several
phases. I actually started my PhD in
machine learning. So I was working on machine learning systems, infrastructure for scaling up
large scale models. And one of the most pressing problems when scaling up these models is that the
data sets become very large and unwieldy. And so just from my research itself, I started to try to
look at how could we try to solve these data bottlenecks when training these large models.
And so these types of models, you can think of things like diffusion models or GPT,
they can ingest like terabytes of data during the training process.
And then the models themselves are very large. They can be many gigabytes themselves.
And so this was a really firsthand problem I had encountered just moving these parameters and data sets around.
And so I made this transition to networking and systems research with the Skyplane project about two years ago. I also had a very similar motivation. Prior to coming to Berkeley as a PhD student,
I was actually working on a startup, training these e-commerce models to basically classify
different types of products based off the product images, basically. And with that project, moving around these huge image data sets that I had was a really
big pain. I was often switching between GCP and AWS, depending on what tools I wanted to use.
And so then when this whole vision of sky started coming up, when I was in the lab,
I was really immediately drawn towards solving specifically the data gravity problem
around moving between clouds, since that, to me, seemed like the biggest challenge.
Awesome. That's a nice segue, actually, into the Skyplane project. So maybe you can tell us a
little bit more about the project. And I think you hit on it there, data gravity. Maybe
you can tell us a little bit more about what that is and kind of what the problem is you're trying
to solve. Sounds good. Yeah. And before I do that, I just want to give a one sentence intro
on what the Sky Computing Lab's goals are.
So here at the Sky Computing Lab at UC Berkeley,
we are trying to build infrastructure
and platforms to enable applications
to run seamlessly across multiple cloud providers
and cloud regions.
And so, you know, again,
there's a variety of different reasons for that.
I can go into that if we're interested,
but the most important problem
you're going to encounter
in this cross-cloud or cross-region environment is something called data
gravity. And so data gravity, I think, is actually very simple. It's that when you work with large
data sets in the cloud, it is, number one, slow to move that data between different regions, right?
It can take many hours to move large data sets between different regions or clouds. Second, it's very expensive to move that data. And why is it expensive? Well, in the
cloud, you have to pay for egress fees. So every single byte of data you move over the internet
or a cloud network, you have to pay. And so because you have to pay for that volume of data,
it can cost a lot of money. And to ballpark, it can cost anywhere up to 10 cents per gigabyte to
move data in the cloud. So to move, let's say, a 200 gigabyte training data set, that's equivalent to
spinning up a 34-node cluster of m5.xlarge VMs for a whole hour, right? So it's really
expensive, actually, these data transfer costs in cloud. And the third factor of data gravity
is the complexity of moving this data. So you end up having to use a patchwork of different tools that are mutually incompatible and work with specific cloud sources or cloud
destinations. And so this data gravity problem, again, that it's slow, expensive, and complex to
move data in the cloud environment means that's kind of the key problem we want to solve in the
Skyplane project. And so Skyplane's goal is to make data transfer in the cloud,
essentially high speed, cheap, and very easy.
And I can go into details about how that, how we kind of accomplish that,
but at a high level, that is the goal of the project.
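To make the cost point above concrete, here is a rough back-of-the-envelope sketch of the egress arithmetic. The prices are illustrative assumptions in the ranges quoted in the episode, not current list prices, so the exact VM-hour equivalence will shift with whatever rates you plug in.

```python
# Back-of-the-envelope egress cost, with assumed (not current) prices.
DATASET_GB = 200
EGRESS_PER_GB = 0.09   # assumed inter-cloud egress rate, $/GB (episode quotes ~$0.01-0.19/GB)
VM_HOURLY = 0.192      # assumed on-demand price for an m5.xlarge-class VM, $/hr

egress_cost = DATASET_GB * EGRESS_PER_GB
equivalent_vm_hours = egress_cost / VM_HOURLY
print(f"Moving {DATASET_GB} GB costs ~${egress_cost:.2f} in egress alone, "
      f"roughly {equivalent_vm_hours:.0f} VM-hours of compute at ${VM_HOURLY}/hr.")
```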
Awesome. Cool. Yeah. I actually read the sort of, I guess,
the manifesto of the Sky Computing Lab a while back.
And there's a paper on arXiv, right?
I remember reading it and being really interested in it.
I think the thing that kind of piqued my interest in it as well was the
sort of economics of this, the idea of moving data between clouds and
what that sort of marketplace might look like in 10, 20 years in the future. But anyway, I digress, it
was a very interesting read. So yeah, you guys work on some really, really cool problems. So yeah, I
guess you mentioned a little bit there about the existing
tools sort of not being efficient. So I mean, I've stolen this line from your slide deck. So
what does life in the sky look like now today? I think the reality is there's actually very little
life in the sky right now. Like multi-cloud is actually pretty uncommon. You might have
some companies doing migrations every once in a while,
but the actual vision of sky computing,
which is being able to choose the best of breed software from different
providers and combine them into a single application is I think still pretty
far off. And a big reason for that,
which is what we believe is that this data gravity issue and these really
high egress fees that cloud providers charge
are basically preventing this from happening. So yeah, part of our motivation with Skyplane is
to be able to sort of achieve that vision by eliminating data gravity. Yeah, I guess,
if I was AWS, for example, I guess I'd want to lock people into my cloud as much as possible, right?
So I guess they have an incentive to make it as difficult as possible for people to move, right? So you keep spending money
with them. But I guess Skyplane is going to help alleviate some of these problems and make it
better for the end user at the end of the day. So maybe you can tell us a little bit more about the
architecture and the design of Skyplane, and how you went about making these design decisions
to kind of address this gap. Yeah, it might be helpful first to outline why the performance of current tools is so bad. And, you know,
when we actually profiled cloud networks, what we found was really surprising. So first of all,
when you begin to transit between different cloud providers or even two regions within a cloud,
it's very common to encounter congestion. And so you'll find this on places such as transatlantic
network routes. That's a single
kind of fiber cable that they had pulled that's shared across many, many thousands
of customers. And so that's going to be a highly congested resource. You have to compete for
capacity there. The other challenge is that we actually find that cloud providers actually
throttle network data transfers. And so this is applied, for example, in AWS.
If you get a VM, even if the network could sustain 10 gigabits per second of egress,
they're only going to let you transfer data at five gigabits per second, right? And so they're
actually explicitly throttling your network transfers there. And then on top of that, again,
beyond just speed, the costs are still real. And so that can vary anywhere from one or two cents
per gigabyte if you're transferring between two regions in AWS in the same cloud to anywhere up
to 10 or even 19 cents per gigabyte for some regions. So the cost structure is also very
complex and hard to navigate as an end user. So what Skyplane does is that we have a system that
periodically profiles the internet, essentially,
between all these clouds, and it measures the throughput and the cost for moving data.
And so we end up getting essentially this map of the public cloud internet, right? And so we have
all these different routes between all the providers, and we can see the throughput
from different region pairs. And so we go through this kind of cartography exercise. But after doing this,
we now have a map where we can kind of begin to navigate and find more efficient or more optimal
network routes that can route around congestion in the public clouds network. And so with these
two parts, we first kind of profile the network, and then we kind of plan around it. That is really
one of the key techniques we leverage to achieve better performance. That's fantastic. So you've got an upcoming NSDI paper that's going to talk all about this in some
depth. So I think that this primarily focuses on how you optimize the transfer, right? So the
cost and the throughput. So can you maybe tell us a little bit more about that paper and can you
dig into some of the insights and techniques? I think you've touched on them a bit previously, but going into them in a little bit more depth, and how you leverage
them within Skyplane. So I'll talk about this. The NSDI paper that we wrote is on
unicast. So what that means is: from a single source to a single destination, how do you move data?
So I'll briefly touch on that. I'd actually also like to leave some time to discuss
some of the upcoming work that we have under review now that's, I think, really exciting too, after this unicast. So for the single source, single
destination network transfers, again, as I mentioned, we profile all these cloud networks.
We got this map. If you have N regions, we have this N squared grid essentially telling us the
performance between any source and any destination. And we have an optimizer that actually carefully plans routes in cloud networks. And so really the hard thing here that we want to
solve is the optimization problem is that we want to minimize the cost of a transfer subject to a
user's desired replication time constraint. So the user might say, I want to move 100 gigabytes of
data and I want to finish in 10 minutes, right?
And so subject to that,
we need to find a way to move that data at the lowest possible cost.
And so we'll use a variety of different optimizations,
such as this overlay routing I mentioned,
where we route around kind of slow parts
of the cloud network.
We also have other tricks.
So we do things like we automatically manage
elastic VM pools and parallelism.
And so in cases like where the cloud provider like AWS throttles the user,
we can essentially provision more VMs to burst past that throttling constraint, right?
And so our optimizer also kind of in the same integrated problem decides when to deploy that parallelism trick to essentially conquer throttling. And also we perform automatic bandwidth tiering and compression that help reduce egress fees,
right?
So it turns out it's really surprising, but it's well worth paying the extra money to
get some extra vCPUs to run some compression and deduplication so you can reduce your egress
fee.
And it's almost like a 10 to 1 ROI there, right?
You save a lot of money there in aggregate
by doing this compression and deduplication,
which is really surprising.
That's another insight that we leverage.
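As a rough illustration of the planning problem described above, here is a toy sketch, not Skyplane's actual planner, that simply picks the cheapest profiled route which still meets a transfer deadline. Skyplane's real optimizer solves a richer problem that also chooses VM counts, parallelism, and compression jointly; the region names, prices, and throughputs below are made-up examples.

```python
# Toy "minimize cost subject to a replication-time constraint" planner.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Route:
    hops: list[str]          # e.g. ["aws:us-east-1", "gcp:us-central1"]
    cost_per_gb: float       # summed egress cost along the path, $/GB (assumed)
    throughput_gbps: float   # bottleneck throughput of the path (assumed profile data)

def plan(routes: list[Route], transfer_gb: float, deadline_s: float) -> Route | None:
    # Keep only routes fast enough to meet the deadline, then take the cheapest.
    feasible = [r for r in routes
                if (transfer_gb * 8) / r.throughput_gbps <= deadline_s]
    return min(feasible, key=lambda r: r.cost_per_gb, default=None)

routes = [
    # Direct route: cheap but congested.
    Route(["aws:us-east-1", "gcp:us-central1"], cost_per_gb=0.09, throughput_gbps=2.0),
    # Overlay route detouring through an intermediate region: pricier but faster.
    Route(["aws:us-east-1", "aws:us-west-2", "gcp:us-central1"],
          cost_per_gb=0.11, throughput_gbps=12.0),
]
print(plan(routes, transfer_gb=100, deadline_s=300))  # tight deadline forces the overlay route
```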
Fascinating.
Is there like a tipping point there
where, kind of up to a certain level,
it depends how much data you're transferring,
whether you actually get a benefit, right?
Or is it kind of always a win?
Really interesting.
So we don't necessarily have strong support for very small files.
So if you're doing like, you know,
five to 12 kilobyte key value kind of store,
you want to do partial replication of that.
I mean, it's not necessarily the right tool for the job.
So here I would think, you know,
if you're moving five gigabytes or more,
you'll probably see a strong benefit
from using Skyplane compared to existing tools.
But again, we keep putting more and more optimization in the system. It's open source,
and we have a very vibrant community of contributors who are helping us optimize
performance further and further. So I think that tipping point is starting to shift lower and lower,
which I think is really exciting. Yeah, the main reason why there's sort of like that five
gigabyte threshold is because Skyplane is actually creating VMs in your AWS account to run that transfer.
And so that startup time for the VMs is going to take about 40 seconds, which is sort of a constant overhead that we can't really do much about unless we have like some continually online Skyplane cluster, which we don't have right now.
So because of that, it's only really beneficial for larger transfers. Just to summarize briefly, so the kind of the four ingredients
in the secret sauce is the overlay routing,
being able to dynamically adjust the number of VMs,
the network tiering and compression,
and then being able to have parallel TCP connections,
right, they're the four sort of ingredients
that go into making Skyplane.
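A quick sketch of why the roughly 40-second VM provisioning overhead mentioned a moment ago pushes the sweet spot toward larger transfers; the throughput numbers here are assumptions for illustration, not measurements from the paper.

```python
# Fixed provisioning overhead vs. transfer size: small transfers are dominated by startup.
def transfer_time_s(size_gb: float, throughput_gbps: float, startup_s: float = 0.0) -> float:
    return startup_s + (size_gb * 8) / throughput_gbps

for size_gb in [1, 5, 50, 500]:
    baseline = transfer_time_s(size_gb, throughput_gbps=1.0)                      # e.g. a single-stream copy tool
    provisioned = transfer_time_s(size_gb, throughput_gbps=10.0, startup_s=40.0)  # spin up VMs, then go fast
    print(f"{size_gb:>4} GB: baseline ~{baseline:5.0f} s, provision-then-transfer ~{provisioned:5.0f} s")
```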
Cool, so you mentioned there about other,
like maybe further optimizations that you might consider as well. What's on the roadmap there? Like,
what other things do you think could be interesting? Yeah, so one recent follow-up paper that we
just submitted actually was on how we can extend the optimizations in Skyplane unicast to multicast,
or broadcast. So this is basically where you want to transfer data from one
source to multiple destinations. And in this scenario, you can actually get much more significant
wins in both the throughput and cost improvements, because you might be over and over again using
expensive links. So oftentimes in AWS, you'll have regions that are much more expensive, so maybe 19
cents per gigabyte, as opposed to the standard one or two cents per gigabyte. So one optimization that we do in the multicast scenario is we can move the data
once from an expensive source region to a cheaper overlay region, and then broadcast the data from
there. Another really important optimization that we do is we basically have multiple stripes of the
data. So as opposed to sending all the data along one distribution tree,
we split up the data along multiple distribution trees so that no single node or single path is burdened with all the data. So from those two optimizations, we actually end up getting
very significant cost savings on both the egress and then also throughput improvements.
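To illustrate the relay idea with some toy numbers (assumed prices, not actual cloud rates): paying an expensive region's egress once to a cheap intermediate region, and fanning out from there, can beat paying the expensive rate once per destination.

```python
# Direct fan-out from an expensive source vs. relaying through a cheaper region.
N_DESTS = 6
SIZE_GB = 100
EXPENSIVE_EGRESS = 0.19   # $/GB from the expensive source region (assumed)
CHEAP_EGRESS = 0.02       # $/GB from the relay region (assumed)

direct = N_DESTS * SIZE_GB * EXPENSIVE_EGRESS
via_relay = SIZE_GB * EXPENSIVE_EGRESS + N_DESTS * SIZE_GB * CHEAP_EGRESS
print(f"direct fan-out: ${direct:.0f}, via cheap relay: ${via_relay:.0f}")
```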
And this isn't in the NSDI paper, this is additional work that's under review at the minute at another conference, right? Yeah, okay, cool. Well, good luck with that submission,
hopefully it gets in. While you guys were talking there,
I was just thinking, and we're kind of going back to what we said at the very start of the show, about
how it's in kind of cloud vendors' interests to lock people in, right, and how they essentially,
in various scenarios, throttle things, right, so you don't get the full bandwidth or
whatever. Is there any sort of, not concern exactly, but, if they detect that someone's doing this, could they
throttle or do other things to kind of stop them moving to different cloud vendors? For example,
I'm thinking of if, all of a sudden, the cloud provider
becomes sort of adversarial in this.
I think that would be really hard for the cloud providers to do
because we use all public cloud APIs.
So they would have to figure out that it's Skyplane making these API calls.
And then the other thing is that we also run inside of our users' accounts.
It's the users' VMs that are actually executing these transfers.
So it's not like there's some Skyplane service that they can block.
It's all something that you run yourself and it's open source.
And like just because we're sort of saving costs for the user,
I don't think necessarily means that it's bad for the cloud provider.
Because if you think about like why are the clouds actually throttling
or why are they making some paths more expensive than others,
it's sort of in a way like trying to shape demand, right?
So what Skyplane basically allows is for users to actually in an automated way,
react to the pricing that the cloud provider set. So I don't think it's necessarily a bad
thing for the cloud providers. Cool. So I guess I'd like to touch on a little bit about the
implementation of Skyplane. Can you maybe describe a little bit about how it's implemented,
how long it took you to implement and what implementation effort it was like?
Yeah, so it's open source.
And so that's a really, I think, important part of the project, that we want to ship this to the real world.
We wanted to see this actually be implemented in real applications, because, you know, it's going to improve our research long term to learn about the use cases for which people are using sky computing.
Right. So
we had an initial implementation actually done for our first paper submission within about
three months. Actually, it was a very, very quick project turnaround that way. We had
built the system originally in a few different languages. So I had done some initial tests with
Go or C++, and then we figured out it's all IO-bound.
So actually, however remarkable that is,
the majority of the prototype is written in Python.
And so we're able to get, you know,
hundreds of gigabits per second of aggregate throughput with this, right?
Through, again, just very careful implementation
using careful system call interfaces,
like splice and kind of zero copy kind of mechanisms.
But yeah, I mean, that was sufficient to get very high performance that has really kind
of enabled high velocity and means our community of contributors are able to kind of upstream
new cloud connections, new object store interfaces, or even kind of new kind of technologies like
different encryption techniques very quickly.
Right.
And so the on-ramp to kind of contributing back to the project is very low.
And so, you know, I think that's been a really important kind of enabler for the project today.
Sure, I guess having it in a language like Python makes it a lot more open to a lot more people, right?
So as a user then, how do I interact with Skyplane?
And so what's my experience like?
Because if I wake up tomorrow and want to move some data around using Skyplane,
how does that look like?
You can just pip install it.
It's very easy.
So you just pip install Skyplane
and now you'll have the Skyplane CLI.
And so if you wanted to do a transfer
from, let's say, between two AWS regions,
all we'll require you to have is the AWS CLI installed
and have your cloud credentials
logged in. So, you know, I guess you'll do like AWS configure auth or something. And, you know,
as long as that's kind of set up, which if you're using AWS, you probably already have,
Skyplane will just work. And so behind the scenes, you know, Skyplane supports scale to zero. So when
you kind of just installed it, there's no VMs, there's no state,
there's nothing running.
Everything lives on your local laptop.
But as soon as you type Skyplane CP
and you might say S3 in the source
and S3 in the destination,
then Skyplane will actually provision virtual machines
to actually serve as ephemeral cloud routers.
And that only exists for the duration of your transfer.
That's actually a big advantage of using Skyplane
from a user's perspective compared to other systems, which leave VMs running all day and kind of cost a lot of money. We actually started out trying to design this for other researchers like us. So, you know, small-scale users who have a lot of data, right? And so the system kind of is there when you need it, and it scales to zero and deprovisions everything and cleans it up when you're not using it, right?
And so behind the scenes, there's a lot of complexity in terms of provisioning, doing things securely.
So managing encryption and cloud keys and IAM and all this stuff.
But that's all kind of abstracted away behind the CLI.
From the user's perspective, you pip install and you're ready to go.
One thing I'll also mention is we also just recently released the Python API.
So now you can actually directly from code provision a Skyplane basically cluster and then execute transfers on
top of that and deprovision everything from Python. So we built this actually because we
were hoping that people could sort of build applications on top of Skyplane. So we actually
have an example of like ML data loader and an Airflow operator that are built with Skyplane's API on our documentation page.
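For reference, a minimal sketch of what using the Python API looks like. The names here (skyplane.SkyplaneClient, client.copy) follow the Skyplane documentation around the time of this episode and may have changed since, so treat this as approximate rather than authoritative; the bucket paths are placeholders.

```python
# Minimal Skyplane Python API sketch (API names may differ between versions).
import skyplane

client = skyplane.SkyplaneClient()
# Provisions ephemeral gateway VMs in your own cloud accounts, runs the transfer,
# then deprovisions everything (the scale-to-zero behaviour described above).
client.copy(
    src="s3://my-source-bucket/dataset/",
    dst="gs://my-destination-bucket/dataset/",
    recursive=True,
)
```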
The deprovisioning thing would have saved me a lot of trouble over the years, because I've been one of numerous people I know of leaving VMs running when I shouldn't have done, and stuff.
Right. Because that cost, well, cost the university a lot of money, and I got shouted at for it.
Right. So anyway. Cool.
Yes. You said some apps were built on top of Skyplane, so can you maybe tell us a little bit more about those? I'm
not familiar with Airflow. So Airflow is the Swiss Army knife of, essentially, data pipelines in
industry. So for a lot of ETL workflows, Airflow is an orchestrator. So, you know, it's a service
where you can schedule workflows, meaning, like, here's some job inputs and here's a particular computation you might run, like running a Spark SQL query.
And then you want to write the outputs to a particular destination.
And you might say, I want to run that job every night.
So it's essentially the system, the orchestrator, that runs these workflows over data pipelines.
What's really common is people have data in a thousand and one sources.
You might have data in AWS S3,
then Google Cloud Storage, then Snowflake.
Then you might have even data in, you know,
software as a service providers like Salesforce
or, you know, Facebook marketing.
And then you might even have data on-prem
and then it gets very complex.
And so in Airflow today,
the state of the art is these kind of S3 to GCS, S3 to Azure
blob, S3 to Azure ADLS kind of connectors.
And each one has a slightly different interface, right?
And so it's a lot of user burden to learn how to use this tooling in order to get your
Airflow jobs to kind of operate and consume data from all your data sources.
But this is, again, very important for organizations. We have an integration with Skyplane and Airflow that
enables users to effectively bridge data from any source and any destination support in Skyplane.
And we will take care of moving that data at really high performance. So, you know, if you
actually use this, even if you use like, let's say the baseline, you know, S3 to S3 connector
for Airflow, you might only get data transfer speeds of 20 megabytes per second. But with
Skyplane, we're able to complete those transfers at, you know, tens of gigabits per second.
So you get better performance, lower cost, and it's easier to use, quite honestly,
because there's a single tool that covers all of these cloud destinations.
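As a hedged sketch of what the Airflow integration described here could look like: Skyplane ships its own example operator in its docs, but a plain PythonOperator wrapping the (assumed) Python API shows the shape of a nightly cross-cloud copy. Bucket paths, DAG names, and the Skyplane API calls are illustrative assumptions, not the project's actual operator.

```python
# Illustrative Airflow DAG: nightly S3 -> GCS copy via Skyplane's Python API (names assumed).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def copy_with_skyplane():
    import skyplane
    client = skyplane.SkyplaneClient()
    client.copy(src="s3://etl-staging/outputs/",
                dst="gs://analytics-warehouse/outputs/",
                recursive=True)

with DAG(
    dag_id="nightly_cross_cloud_copy",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="skyplane_copy", python_callable=copy_with_skyplane)
```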
Great. So the Airflow is not part of the same project, right?
This is just sort of,
it's not within the Sky Computing Lab.
It's totally separate.
It's about 100 lines of code, actually, in our examples.
It's very simple to use over our API.
So we expose all these primitives up to users,
and you can programmatically kind of configure,
we call it a data plane.
That's like this group of VMs that runs across the clouds
and kind of describe what
inputs and outputs you're kind of plumbing through the system. So, you know, it's actually very easy
to use and our intent is like to have this be upstreamed into kind of other open source libraries.
I think Sarah also mentioned, we have another example of a cross cloud ML data loader. So if
you have data in one region and you want to train on a GPU in another region,
we have an example where you actually will use Skyplane's
overlay to plumb the data from the source to your GPU, right?
And just stream it directly from object storage to the GPU.
And so that's another kind of simple example
that's actually implemented very cleanly
and very simply over our API.
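A generic sketch of the cross-region data-loading pattern described here: prefetch the next shard in a background thread so the GPU is never waiting on the network. Plain boto3 stands in for the transfer step (their example streams via Skyplane's overlay instead), and the bucket and shard names are placeholders.

```python
# Background prefetch of training shards from remote object storage.
import queue
import threading

import boto3

BUCKET = "my-training-data"
SHARDS = [f"shard-{i:04d}.parquet" for i in range(8)]
prefetched: "queue.Queue[str | None]" = queue.Queue(maxsize=2)

def prefetch() -> None:
    s3 = boto3.client("s3")
    for key in SHARDS:
        local_path = f"/tmp/{key}"
        s3.download_file(BUCKET, key, local_path)  # the cross-region hop to accelerate
        prefetched.put(local_path)
    prefetched.put(None)                           # sentinel: no more shards

threading.Thread(target=prefetch, daemon=True).start()

while (path := prefetched.get()) is not None:
    # train_on(path)  # consume one shard while the next downloads in the background
    print("training on", path)
```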
Cool.
We've thrown kind of some numbers around
across the course of the chat so far.
So I'm guessing you've done some sort of evaluation
of Skyplane and under various different experiments.
So I was wondering if you could maybe talk
a little bit more about some of the experiments you've run.
I guess maybe revisit the questions
you were trying to answer
and kind of talk about how those,
I guess what you compared it against maybe as well, right? Because that's kind of not obvious, because it's a new thing. So yeah, maybe tell
us a little bit more about your evaluation of Skyplane is where I'm getting at, in a long-winded
question, I apologize. So I can talk about the unicast project. So that is, again, on arXiv,
you can just search and find it, or on skyplane.org we have the paper linked. So we compared against
several vendor baselines and some open source slash academic baselines,
but the vendor baselines are very strong.
So what we compared against are AWS DataSync,
GCP Cloud Data Transfer,
and then we compared against Azure's AzCopy service
that allows for high-speed data transfer
between Azure services.
And so these are cloud provider-specific services, so they should be highly optimized
and, you know, overfit in some sense to each cloud provider.
And they have visibility into their own networks and capacity and stuff.
So, you know, these tools are quite strong baselines in some sense that the cloud providers
themselves, this is their sanctioned way of moving data.
So relative to some of these tools, for example, like against AWS DataSync, we're up to five
times faster for data transfer within a single cloud, right? And so already, even just if you're
in AWS and you don't use another cloud provider, you can get up to five times higher throughput
than AWS DataSync. And it's substantially cheaper because we don't charge
a service fee, right? AWS DataSync charges you the egress fee plus an additional service fee per
gigabyte you move. You get better results, obviously, moving between clouds. You know,
the results vary there, but again, you know, the vendor tools often don't have very good support
for cross-cloud data transfer, right? They often only allow you to move data into their cloud,
but they don't support generally moving data out, for example.
So it's a little hard to evaluate.
We also considered some academic baselines.
So, for example, we considered GridFTP,
and we have better price and performance compared to GridFTP.
I think the most interesting, exciting results
are actually in some of the follow-up work.
Yeah, so in the multicast work, we had a couple of baselines. So one of them was AWS S3 multi-bucket
replication. So S3 basically allows you to write these replication rules. So like when you write
to that bucket, it automatically gets replicated to a bunch of other buckets that are connected
to it. So that was one baseline we had. And another one was BitTorrent. So that's a really
common tool people use to basically disseminate data that might be located in one place to a bunch of other destinations.
So we did a bunch of experiments basically comparing about 100 gigabyte transfers from one source to about six destinations.
And our numbers are usually around like 60% cost savings and almost up to three times replication time speed up.
And I also want to note that the results actually get better and better the more destinations you add
So for larger sets of destinations, the cost-saving opportunity and also the replication
time reduction opportunity actually get bigger. We spoke about this a little bit
earlier on, but sort of, what are the current limitations of Skyplane?
So Skyplane today supports cloud-to-cloud destinations, and we have very limited support
for on-prem to cloud, but that is a very hard problem.
But we have a student in the group who is working on high-performance on-prem to cloud
data transfer.
But that is probably one of the biggest areas that our users have been asking for new solutions
in, and that's something
we're trying to evolve the project to better support. Sure, cool. You obviously get a lot of
feedback, or have got a lot of feedback, from people actually using Skyplane, which
is, I guess, quite different to most sort of academic projects, right? Most of us kind of
do something and it disappears into the ether, and we're lucky if anyone looks at it again.
And so my next question, which I normally ask, is kind of: as a software developer, how can I leverage the findings in your research, in Skyplane in this case?
I guess it's just kind of go and use it and see how it works for you,
right, and feed back in that way.
But I guess, what impact do you think your work with Skyplane can kind of have for the average software developer
in their day-to-day life?
Yeah, so I think for the average software developer,
if you have large amounts of data,
so anything more than, like, 10 gigabytes probably,
and if you're having to use that data from a location
that's not the same place where that data is,
so whether it's a different region
and the same cloud provider
or a different cloud provider,
then Skyplane can help you access that data faster.
And also if there's data that you have to synchronize
that's maybe living in different locations,
you can also synchronize that data faster
using a tool like Skyplane.
There's other applications,
sort of higher level applications
that we've been sort of starting to think about designing.
So, for example, things like disseminating model weights or containers to a large number of destinations.
So I think especially something like a high performance container registry built on top of Skyplane multicast could be something that's really useful for sort of higher level applications used by software developers as well.
I mean, I guess kind of working on this project,
you seem to have covered some really fascinating areas and problems.
But are there any ones that sort of stand out as being the most interesting or maybe unexpected lessons that you've learned while working on this project?
I think for the multicast work, I was always really surprised by how crazy some of the distribution trees ended up looking.
We had some examples that were like a multi-cloud transfer from AWS to maybe some other AWS, GCP, and Azure regions.
And there were some really unintuitive paths, like going from AWS to GCP back to AWS and then to Azure that were part of the
tree that were really surprising to me, but I couldn't, and they almost seemed wrong at first,
but I couldn't figure out a better solution. So I think they're probably right. So that was
really surprising to me. I think it's sort of the erraticness of cloud pricing causes the optimal
distribution tree structure to look really, really strange in some cases.
Do you notice it changed quite significantly between sort of like day to day as well?
I mean, I'm not really too familiar with how dynamic cloud pricing is.
But did you notice doing the same job on a Tuesday one week to then do it on the Thursday could result in a completely different distribution tree?
So cloud pricing itself is actually very, very static. I think it might like people might change
their pricing around like every couple of years. I think something that's more likely to change the
pricing models is the impact of newer, non-incumbent clouds. So there's some new cloud providers who are
basically trying to compete with the big players by offering free egress. So Cloudflare is one of these. And we
actually found another sort of surprising result that we had was that you can actually use these
these clouds like Cloudflare as kind of intermediary points. So if you're transferring
from AWS to GCP and Azure, instead of paying egress twice, so once to Azure and once to GCP,
you can actually transfer the data through Cloudflare,
pay egress once, and then broadcast the data from Cloudflare to the other clouds.
So I think that's potentially something that's more likely to impact pricing in the long run
than like sort of dramatic changes to cloud provider pricing.
How easy is it to add support for, like, a new cloud? Let's say some startup launches a new cloud provider tomorrow.
How easy is it to add support for that for Skyplane?
Is it a lot of work or is it pretty simple?
So if they're S3 compatible, it's not that hard
because worst case we can just use VMs in other clouds
to basically just connect to their S3 compatible API.
I guess, do you want to add to that?
Yeah, so we've been collaborating pretty closely with some contributors
from IBM who have been working to add IBM cloud support.
I think their PR is like under 800 lines of code.
So, you know, it is some work today, but it's not like tremendous.
I think it's something that's pretty doable,
and we're working very hard to kind of reduce the amount of changes necessary.
In fact, one of the critical projects that is kind of starting around this effort is
to create a SkyKit, in some sense a common provisioning infrastructure that, you know,
we can share with Skyplane, but also with other projects.
With this idea of modularity, kind of leveraging existing cloud provider integrations
with tools like Terraform, for example, so that, you know, you don't have to keep reinventing the wheel
and rewriting integrations to clouds. It should be really easy and really simple to add these new
cloud providers. Great stuff. Yeah, I guess the next question is something I often ask my guests
as well, which is that, obviously, going from the initial idea for some piece of research,
some piece of work, to where we are now, were there things along the way that you tried that failed,
and the listener might benefit from knowing about so they don't make the same mistake again,
or that were just an interesting dead end that you found? Yeah, so since starting the project, I mean,
I actually didn't want to work on this problem necessarily. I wanted to kind of work at higher level applications. So, you know, how can we kind of build these applications in Sky Computing context, such as machine learning, but, you know, as I started digging into that problem, I was like, the most basic infrastructure here is abysmally slow and expensive, right? And so that has focused our research around it. And it's kind of like in the beginning, I was a little hesitant to work on this problem. But it actually turns out just in
a simple data transfer setting, there is so much work that can be done. And so, you know, the
insight here, I think I took away from this experience is kind of start simple, you know,
I really wanted to build some big overarching system to let you do kind of cross cloud machine
learning or data analytics or something. But really, it's even just the really simple problems like data transfer that really are
most critical, right?
And I never really appreciated the complexity of how to kind of accomplish this in a cloud
environment easily.
I think one really important thing also that kind of enabled the research was kind of having
this open source mindset.
So, I mean, academically, I mean, I think there was a little skepticism initially,
what would the insight be? What would the key kind of optimization space look like?
Taking an open source kind of mindset, kind of looking at the best of breed tools today for data
transfer, and just profiling them carefully. You know, that's where we kind of discovered this
overlay technique, and then kind of the research agenda kind of fell into place from there.
Right. And so really it's just taking the best of the tools and kind of hacking on them and trying to figure out incrementally,
how can I just make this better or better and better?
Yeah, that's really nice. I mean, when did this idea initially start?
How far back are we talking? Is it like a multi-year sort of project or more recent?
It's been about two years since I started thinking about this area.
But as I said, the first implementation, very incremental, very quick.
We got it done in three months or so for that kind of first paper submission.
But, you know, since then, I mean, it's a large team here.
We have, you know, about five PhD students working on the project
and several undergraduate students on the project too.
So it's been very encouraging to kind of see how the project has grown over time.
Yeah, I guess what's next for Skyplane then?
Kind of maybe the short-term vision and then the long-term vision
and some of the stuff that you've got in the pipeline.
So I feel like at this point,
we basically have these really powerful communication primitives, like whether it's unicast or multicast, the ability to move data really quickly and cheaply across clouds.
So I think the main things that we've been thinking about right now have basically been: what are the sorts of applications that we can now build on top of this?
So one thing that we've been thinking about is potentially building a sky storage layer. So basically sort of like a shared object store that spans clouds, except the underlying replication is done by Skyplane.
So if you write to one location, that gets replicated by Skyplane really quickly and cheaply
to all the other places where you might want your data accessed. And so we can kind of co-optimize
that with object placement to basically have like a very fast, cheap replicated object
store. Another thing that we've been thinking about that I mentioned previously is also building
either a container or model registry. So this was kind of a use case we stumbled into when we were
looking for baselines for the broadcast, sorry, the multicast work, where a lot of, we were finding
a lot of other similar systems were actually focused on building container registries. So if you have a new container, how do you push that to a bunch of destination locations where you might want to run that container?
So one other thing that we've been thinking about is like, could we basically build like a container publishing or also similarly with models like model weight publishing platform on top of Skyplane Multicast?
Some really interesting directions and work that's going on.
This would be a nice time to talk a little bit about
some of the other research.
I know there was a paper at CIDR recently
about lake houses.
So maybe you could give us
the quick elevator pitch for that
and tell us about some of that work
and how that links to the Skyplane project.
Yeah, so I was working on the Skyplane project
and you look at what do people store in their buckets?
Well, it's really surprising, but a ton of it is just Parquet.
And so people are storing tons and tons of this relational data in their data lakes, in effect.
And so one of the things I wanted to understand was: what are the current storage formats for this tabular data in these data lakes?
And a very recent trend in the industry
has been the development of the lake house architecture.
So here you have disaggregated compute from storage.
And usually this means you're going to store this data
in object storage.
So like AWS S3, for example.
And then you utilize a technology such as Delta Lake
or Apache Hudi or Apache Iceberg
to provide
ACID on top of that, right? So now you have transactions. This is an emerging paradigm in
this kind of analytics area, but it's very appealing because now you're storing your data
in Parquet format, which is highly compressible and it's in a data lake, right? So you can scale
to zero when you don't need to do any queries. And it's very cheap to store data in Amazon S3.
And so the Lakehouse benchmark,
which is what the paper we submitted to CIDR was,
or LHbench, was effectively an attempt
to kind of define a single benchmark suite
that kind of understand the pros and cons
between some of these different systems
that exist today, right?
So we tested, again, Apache Hudi, Apache Iceberg, and Databricks Delta Lake as the three key formats in that benchmark.
Which one should I use? Which one's the best?
So the answer here is pretty interesting. It depends. It really depends if you're kind of
trying to optimize for writes or reads. So for example, Iceberg and Hudi have a very special
mode of operation where they support copy on write and merge on read tables. And so that allows them to convert
their database from operating in a read optimized setting to a write optimized setting. And so
really depending on your workload, you might choose a different tool. And so in general,
though, we did find that Delta Lake performed extremely well. That may not be surprising
because it's actually well optimized with Spark specifically to ensure it gets high IO performance.
But I don't believe that's something fundamental to the design of Delta Lake.
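For readers who haven't seen the lakehouse pattern in code, here is a minimal sketch (not LHBench itself): Parquet files managed by a transactional table format, Delta Lake in this case, with the kind of merge workload where the copy-on-write versus merge-on-read trade-off shows up. It assumes the pyspark and delta-spark packages are installed, and uses a local path where a real deployment would point at object storage such as S3.

```python
# Lakehouse sketch: Delta Lake (Parquet + transaction log) with an incremental merge.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

table_path = "/tmp/lakehouse/events"   # stand-in for an object-store URI like s3a://...

# Initial load: Parquet data files plus a transaction log give ACID on cheap storage.
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])
events.write.format("delta").mode("overwrite").save(table_path)

# Incremental refresh: small periodic merges with reads in between, the workload where
# read-optimized vs write-optimized table modes (copy-on-write vs merge-on-read) matter.
updates = spark.createDataFrame([(2, "purchase"), (3, "view")], ["id", "event"])
updates.createOrReplaceTempView("updates")
spark.sql(f"""
    MERGE INTO delta.`{table_path}` AS t
    USING updates AS u ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET t.event = u.event
    WHEN NOT MATCHED THEN INSERT *
""")
spark.read.format("delta").load(table_path).show()
```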
Cool. You say they all have sort of trade-offs to make. Do you think there would be some sort of idealistic system that would pick the correct trade-offs and do
best across all the different sorts of workloads you evaluated on, or are you always going to
have to make some sacrifices somewhere, right? But I guess, yeah, is there some sort of potential
new system you could develop off the back of that that would tick all the boxes and be faster than
all the existing solutions? So I think these different frameworks are adding, you know, specialized fast paths for specific
workloads into their frameworks. We evaluated TPC-DS as our benchmark, so it's a pretty general-purpose
analytics benchmark, but what's unique is we also tested an incremental refresh benchmark,
and that's a new representative workload where you continually kind of do small merges periodically with queries interspersed in between. I mean, so in that
specific workload, actually a lot of these different systems are adding specialized support
for merge on read, which is a new capability that lets you defer the merge or kind of compaction of
data until that data is kind of read. And so again, as I mentioned, Iceberg and
Hudi have implemented this, but now Delta Lake, for example, has added deletion vectors, which I would see
as a variant of this kind of approach. So the answer to your question in some sense is I think the
frameworks are kind of learning from each other and taking kind of the best of each of these
different approaches and kind of hybridizing in some way, right? But again, it really depends on
the structure of the workload. Is it going to be write-dominated or read-dominated? I mean, what level of transaction support do you need? For example, some systems
utilize locking, versus others use, you know, optimistic concurrency control, like Delta Lake.
I guess, kind of, I think this is my third-to-last question now, so it's not the penultimate
one, it's the one before the penultimate one. And this is always a really interesting question, because I ask this to everybody,
and it's really nice to see the divergence in answers. And the question is: how do you
go about generating ideas and then selecting projects to work on? Because you're working on
some fascinating stuff. Like, how do you decide what to work on? What's your process there for that?
We're both advised by Ion Stoica,
and I think he really indoctrinates us with the idea that you should solve problems
that are real, right?
Which is harder than it sounds
because you have to find some sort of
very fundamental problem
that actually affects people in industry.
And if you can solve it, then it would actually provide a lot of benefit to a lot of people.
So I think that's definitely something that we look for in terms of selecting research ideas.
I think with Skyplane, what was really appealing to me about this project was that it's such a sort of simple, kind of basic component of system building,
like just transfer and communication, that if you can make that more performant,
then there's so many other things that you can improve the performance of as well.
So I think that was really exciting to me to work on something that was pretty low level and fundamental to a lot of systems.
Yeah, I guess. Do you have any thoughts there, Paras?
Yeah, my thoughts here are pretty interesting. So when I started working on this project,
I took some of the best systems I could find for data transfer and other tools like data storage,
object storage, you know, all these different technologies. And I just try to see what is
the state of the art here? And then how can I kind of hack on these systems? And because
all of these systems are
open source, right, they're all kind of white box, you can kind of open them up and start to poke
around. I began to kind of take a look and see what can I kind of tweak here and there to try
to improve the performance. And so, you know, as I started to go through that process, I developed,
you know, I think, a lot of understanding for kind of the trade offs in these systems, right.
And so that's what I
really like about working on real problems in some sense here is that like there are existing
solutions that we kind of can pick up and start to test and kind of take apart, right? And as you
kind of go through that process, it's very often just some very simple insight that's like, you
know, just a few feet under the surface of the water that, you know, just needs a little bit
of digging to find, right? But that research journey is always very satisfying
and sometimes very challenging,
but it really takes that kind of, I think, confidence
to be able to take kind of an existing solution,
kind of rip it apart and say, like, what can we do better here?
Yeah.
Two really great answers there.
I can add them to my collection now of answers to this question.
That's fascinating.
I mean, I guess both you guys originally come from, like, you've got a background in industry. I mean, do you find that kind of having
that industry experience and then going to do a PhD is kind of
a better path to take, in a way that you're grounded in, kind of, I'm going to go and work on real-world
problems, I want to solve real-world problems? Do you think that kind of helped in a way?
The first two years of my PhD were pretty challenging
because I, after spending some time in the industry,
I was kind of programmed a different way.
And I kind of lost that research context,
you know, that mindset of which you go through
and study these problems.
But I think through that journey, that kind of,
yes, I think that has grounded my research.
And it kind of drives me to try to find problems that I think are actually real in some sense.
Like, meaning, if I'm going to work on a problem, I want to always find somebody who will benefit from the solution, right?
It may not be today, but at least, you know, some point down the line, right?
I think what has been challenging about that is, again, like, I've had to kind of go back and relearn the scientific method in some sense, like the whole scientific discovery process of how do you formulate a hypothesis and actually tractably design an experiment to answer that hypothesis.
But I think like, again, you know, there are different mindsets, kind of industrial design versus research. But I think, you know, we now are working on real technology
with real impact
while still having that scientific contribution.
So I think it's a really nice balance to strike here.
Yeah, you're working in a sweet spot.
Yeah, for sure.
Great stuff.
I guess now it's the penultimate question.
So what do you think is the biggest challenge now
in database data management research?
That's a really interesting question.
What I think after kind of studying this area is that we're kind of in this new world with cloud computing.
So we have elasticity.
You have scale to zero.
So fundamentally, just the pricing model for everything to do with computation has changed.
And so along with that, we kind of have to revisit a lot of fundamental assumptions
and how we build data platforms.
I think the ideas themselves
that we are leveraging
aren't necessarily intrinsically novel, right?
So for example, overlay networking
is an idea that dates back over 20 years, right?
I mean, Ion worked on this back in 2000
in his PhD.
And so in some sense, it's a very classic idea.
But because the underlying constraints of the system with cloud computing have changed so significantly,
I actually think we have to go back and revisit a lot of very fundamental design decisions and how we build and construct data platforms.
Amazing. And I guess now it's the last question.
So what's the one thing you want the listener to take away from this episode and from your work on Skyplane?
Yeah, so Skyplane today is open source.
We have a blog post discussing how we're able to make transfers across clouds up to 110 times faster and 3.8 times cheaper.
You can use Skyplane today both with the CLI tool and also the Python API.
So if you're moving data across clouds
or just even across regions,
we'd love to have people try it out.
Fantastic. And let's end it there.
Thank you so much, Sarah and Paras, for coming on.
It's been a fascinating conversation.
I hope the listeners really enjoyed learning about Skyplane.
I know I have.
And if you're interested in knowing more about all of their awesome work,
we'll put links, I think, in the show notes.
And you can support the podcast by buying me a coffee,
donate through there, and we'll keep making this podcast.
And we'll see you all next time for some more awesome computer science research.