Storage Developer Conference - #77: Pocket - Elastic Ephemeral Storage for Serverless Analytics
Episode Date: October 29, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 77.

Hi everyone, my name is Ana Klimovic and I'm a final-year PhD student at Stanford. Today I want to talk about a system called Pocket that provides elastic and ephemeral storage for serverless analytics. This is joint work with Yawen Wang and Christos Kozyrakis at Stanford, along with Patrick Stuedi, Animesh Trivedi, and Jonas Pfefferle at IBM Research in Zurich.
Imagine if you could launch thousands of tiny tasks in the cloud simultaneously.
You could achieve massive parallelism and hence near real-time performance for the jobs
you would otherwise run on your local machine or cluster.
And what if you could just focus on writing your application code while the provider in
the cloud is managing resources and charging you only for the resources your job actually consumes
at a fine time granularity.
This is the appeal of serverless computing.
Serverless computing is a new cloud execution model
that enables users to abstract away resource management
and launch tasks with high elasticity, meaning your jobs
can go from one to thousands of threads in a matter of seconds, and fine-grained resource
billing, meaning users pay only for the resources their jobs actually consume, and it's billed
at a fine time granularity, such as hundreds of milliseconds.
Now, the first applications to take advantage of serverless computing have
been IoT and web microservice type of applications. The typical use case is we have a function
that executes in response to an event, like sensor data becoming available or a user
logging into a web service. But it turns out that this high elasticity and fine-grained
resource billing makes serverless computing appealing
for a much broader range of applications,
including what we're going to refer
to as interactive analytics.
By this, I mean that a user submits a query and some kind
of input data to the cloud, and we use multiple tasks
and threads to come up with a result for this user
in real time or near real time. Now the
challenge with analytics in general is that they often involve multiple stages of execution.
And we have to pass intermediate results between these different stages. Think of the shuffle
stage between map and reduce, for example. So what this means is that if we want to run analytics on lambdas or on serverless
platforms, we need these serverless tasks to have an efficient way of communicating
intermediate results. And we're going to refer more broadly to this intermediate data as
ephemeral data, which simply means short-lived. We're talking about data that typically only
needs to live for the duration of the job itself.
So in traditional analytics platforms, if you think of a Spark deployment, for example,
the ephemeral data is exchanged directly between tasks. Mappers write their intermediate data
to local storage, and then reducers can fetch this data directly over the network. But in
serverless analytics, if we look at the serverless
platform context, this idea of directly fetching the data becomes very difficult because direct
communication between lambdas is challenging. And there are a couple of reasons for this.
Lambdas are short-lived and they're stateless. So, for example, on AWS Lambda, execution is capped at five minutes. And when a lambda exits, say when a mapper is done, the data it stored on its local storage is no longer available.
In addition, part of the draw of serverless computing is that users don't have to schedule
and manage their resources.
But this also means users have no control over the scheduling, and we don't know which IP addresses different lambdas are sitting behind. So it's actually very difficult for a reducer to know which lambda among the mappers to communicate with. What this means is that the natural approach
for sharing intermediate data in a serverless context is by using a common remote data store.
So we could have mappers write their intermediate data
to a distributed storage system,
and then even as mappers exit,
reducers are still able to fetch this data.
So we need a distributed data store.
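The pattern just described, mappers writing shuffle partitions to a shared remote store so reducers can fetch them even after the mappers exit, can be sketched as follows. This is a minimal illustration: the dict stands in for the remote store, and all names are illustrative, not Pocket's actual API.

```python
# Mappers and reducers are decoupled through a shared remote store: a
# mapper's output survives in the store even after the mapper exits.
NUM_REDUCERS = 2
ephemeral_store = {}  # stand-in for the shared remote data store

def partition_of(key):
    # Deterministic partitioning of keys to reducers.
    return sum(key.encode()) % NUM_REDUCERS

def run_mapper(mapper_id, records):
    # Group this mapper's records by destination reducer, then write
    # each partition to the shared store under a well-known key.
    parts = {}
    for key, value in records:
        parts.setdefault(partition_of(key), []).append((key, value))
    for reducer_id, part in parts.items():
        ephemeral_store[f"shuffle/{mapper_id}/{reducer_id}"] = part
    # The mapper may now exit; its output lives on in the store.

def run_reducer(reducer_id, num_mappers):
    # Fetch this reducer's partition from every mapper, including ones
    # that have long since exited.
    merged = []
    for mapper_id in range(num_mappers):
        merged.extend(
            ephemeral_store.get(f"shuffle/{mapper_id}/{reducer_id}", []))
    return merged

run_mapper(0, [("a", 1), ("b", 2)])
run_mapper(1, [("a", 3)])
reduced = run_reducer(partition_of("a"), num_mappers=2)
```

The key point is that no mapper-to-reducer network connection is ever needed, which is exactly what the lambda environment cannot provide.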
But when we think of existing storage systems
like object stores, key value stores, and databases,
they're typically not designed to meet the elasticity,
performance, and cost requirements for these kind of serverless analytics applications.
And to understand why, we look at what these requirements are. So we analyze three different
types of serverless analytics jobs, and we look at their requirements in terms of ephemeral
I/O throughput, capacity, data size, and object lifetime.
So the first type of application we look at is the distributed compilation job.
So imagine you're trying to compile some source code which might take minutes or even hours
on your local machine. You would be doing maybe make -j8 if you had an 8-core machine. But why not do make -j1000 and leverage the parallelism that you have with
these Lambda workers in the cloud? So that's exactly what we do using a framework called
GG that's being developed at Stanford. This allows us to run compilation leveraging these
serverless tasks. And in this plot here, we're looking at the total throughput with the read
throughput in the solid line and the write throughput in the dotted line for ephemeral data shared
between these tasks.
And what you'll notice is that at the beginning of the job, there's high parallelism because
every lambda is compiling an independent source file.
And so they can all run concurrently, and this high parallelism results in high I/O rates
and throughput.
In the later stages of the job, lambdas are handling archiving and linking, so they're
serialized.
They depend on the outputs of previous stage lambdas.
There might be some stragglers here.
The particular source code we're compiling in this case is the CMake project, and the ephemeral data capacity for this is less than one gigabyte, so pretty
small. Another application we look at is the MapReduce style of applications. And in particular,
we run a canonical sort of 100 gigabytes, which also means we have 100 gigabytes to
shuffle. And the interesting thing about this application is that
it requires particularly high throughput.
It's a very I/O- and throughput-intensive application.
So when we're running with 500 concurrent lambdas, we need up
to 7.5 gigabytes per second of storage throughput.
And the final application type we look at is the video
analytics job that runs in two different stages of lambdas.
So the first stage is taking in video frames and decoding them
and then writing these decoded frames to the ephemeral storage,
which is this first bump that you see.
And these lambdas are all launched together.
And as each lambda finishes decoding its batch,
it launches a second stage lambda, which will
run an MXNet classification algorithm to detect objects in this video.
And so here, the second stage lambdas are reading the ephemeral decoded video frames.
And for the particular video we're looking at here, we need around six gigabytes of capacity.
Now, another important characteristic of these applications is the granularity of their I/O accesses. So this CDF here shows a plot of the different sizes of objects we
see for these applications. And the takeaway is that they range from hundreds of bytes
to hundreds of megabytes. And the implication of this is we need both low latency,
this is important for the small objects,
but also high throughput,
which is going to be important for the large object sizes.
We need both this high throughput and low latency
for an ephemeral storage system
to address the requirements of these applications.
And in addition to having high performance storage, meaning low latency and high throughput,
we also care about optimizing cost.
But optimizing cost while maintaining high performance requires us to intelligently select
and scale the resources that we're using for this ephemeral storage.
And this can be very challenging.
So here we look at an example for a serverless
video analytics job using an ephemeral storage system with different resource configurations.
So we're plotting the performance on the Y axis and the cost per hour on the X axis.
And we have different storage technologies that we use for the ephemeral data store,
such as DRAM, NVMe Flash, and disk.
And also we vary the number of nodes in the cluster,
which impacts the overall throughput or the network bandwidth provisioned.
And we vary the number of CPU cores as well.
And the takeaway here is that finding this Pareto-efficient frontier of performance and cost is non-trivial.
In particular, we would like the system to automatically find the point beyond which adding more resources only increases cost without improving performance; in other words, the optimal performance point for the lowest cost possible. Users can try out different configurations and try to figure out
what the best configuration is, but this is a huge burden for users. In particular, for
users of serverless platforms, which are already abstracting away resource management for the
compute and memory required to run their jobs. So we don't want to require users to manage
their own storage clusters for the purposes of ephemeral data sharing.
And hence, automatically managing and right-sizing these resources is another requirement.
So we've talked about what we require from an ephemeral storage system for these applications,
but there's also something that we don't really need, and that sets us apart from traditional storage systems.
One of the unique things about ephemeral data is that it is short-lived.
So here we're showing a CDF of the object lifetime for these ephemeral objects in the
three applications we look at.
And most of these objects live for less than 20 seconds.
Furthermore, ephemeral data is relatively easy to regenerate by rerunning the tasks that generated that data.
And fault tolerance is typically baked into application frameworks. If you think about a traditional framework like Spark, it tracks lineage for its resilient distributed datasets, and so it knows which tasks need to be rerun if a particular object is lost. So fault tolerance in the
storage system is not a high requirement in this case, and that's going to influence our
design. So to summarize, the main requirements for our storage system then are high performance
for a wide range of object sizes, automatic resource scaling with an
awareness of storage technology, and fault intolerance, meaning we don't have to have
particularly high fault tolerance in this case.
Now, because existing storage systems are not meant to address these particular resource
requirements, we decided to implement our own system, which we call Pocket,
because it's designed for storage-on-the-go intermediate data sharing.
So Pocket is an elastic and distributed storage system
that targets ephemeral data sharing in serverless analytics.
And its key properties are high throughput and low latency,
automatic resource scaling and right-sizing,
and intelligent data placement
across multiple storage technology tiers.
We'll show that Pocket achieves similar performance
to a DRAM-based key value store called Redis,
while saving close to 60% in cost
for the different serverless analytics jobs that we look at.
But before getting into those results, I want to describe the design of Pocket and how we implement
it. So we follow three main principles in the design of Pocket, and the first is separating
responsibility across three different planes. There's the control, metadata, and data plane.
So the control plane is scaling resources and deciding which resources
to give to each job. The metadata plane is tracking where objects are stored across nodes
in the data plane. So an object gets chunked into fixed size blocks, 64 kilobytes by default,
and spread across storage servers in the data plane. And then the data plane is managing the actual data blocks.
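The chunking just described, an object split into fixed-size 64 KB blocks spread across data-plane servers, can be sketched from the metadata plane's point of view. This is a simplified illustration with round-robin placement; function and server names are hypothetical.

```python
# Sketch of the metadata plane's view of an object: split it into
# fixed-size blocks (64 KB by default, per the talk) and spread the
# blocks across data-plane storage servers. Round-robin placement here
# is a simplification of whatever policy the real system uses.
BLOCK_SIZE = 64 * 1024

def place_object(size_bytes, servers):
    """Return a block map: (block index, byte range, server) per block."""
    num_blocks = -(-size_bytes // BLOCK_SIZE)  # ceiling division
    block_map = []
    for i in range(num_blocks):
        start = i * BLOCK_SIZE
        end = min(start + BLOCK_SIZE, size_bytes)
        block_map.append((i, (start, end), servers[i % len(servers)]))
    return block_map

# A 200 KB object over three storage servers -> 4 blocks, the last one
# partial, wrapping back to the first server.
blocks = place_object(200 * 1024, ["srv-A", "srv-B", "srv-C"])
```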
We also designed Pocket for sub-second response time.
So we'll talk about how the implementation of the storage stack in the data plane
is optimized for fast and simple I/O operations.
And then the controller is also constantly monitoring the state of the cluster
and reacting
quickly to adjust the resources accordingly.
And then we leverage multiple storage tiers including DRAM, flash, and disk in order to
minimize cost while getting the required performance for the application.
So let's take a look at Pocket's system architecture. The data plane consists of multiple storage servers. Here we're showing a disk server, a flash server, and two DRAM storage servers. And the bars here represent their instantaneous resource utilization at a snapshot in time. We have
the metadata plane consisting of one or more metadata servers. These are going to be routing
requests from clients
to the appropriate storage server nodes.
And we have a logically centralized controller
that is doing application-driven resource management
and scaling.
So let's look at how a job interacts with Pocket.
The first step is to register with the controller.
And at this point, the job can provide optional hints
about attributes of the job, such as its latency sensitivity and
throughput requirements.
We'll talk more about these hints, and we'll dive into the
job registration process in a bit more detail in a few
slides.
What job registration involves is the controller needs to
decide an allocation and assignment of
resources for this job.
So it decides this at the beginning of the job, and it communicates this information to the metadata servers because they are the ones
that will enforce data placement. Because metadata servers are going to route client
write requests to the appropriate storage nodes. So at this point, when the controller
has communicated this data to the metadata servers, it returns to the user with the job ID and the job can now launch lambdas. And these lambdas are going to issue put and get
requests. So here's an example of a put request. These requests start by doing a metadata lookup
and then the actual storage operation goes to the storage server. Now the API that Pocket
exposes is a simple get/put API, similar to that of traditional object
stores, but it does have additional functionality tailored to this ephemeral storage use case.
So for example, Pocket supports hints about data lifetime.
Because what we realized with ephemeral data is that it's often written and read only once
for a particular object.
So for example, a mapper can write a file or an object that is destined to a particular reducer and after that reducer reads
this object it is no longer needed in the system. And so we might as well
delete it immediately to optimize garbage collection and capacity usage.
And so if this is something known about the application, the user can directly hint this, or the application framework that's launching these lambdas can provide such hints.

By default, when a job deregisters, the controller will ensure that all data the job generated gets deleted.
But you can override this by writing data with a persist flag,
and then it will last for longer than the duration of the job
until a programmable timeout occurs
or the user explicitly deletes this object.
And so this is useful for cases where you want to use Pocket
to pipe data between different jobs.
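The lifetime behavior described above, eager deletion of write-once/read-once objects, job-scoped cleanup on deregistration, and a persist flag that overrides it, can be sketched as a toy client. The class and method names here are illustrative stand-ins, not Pocket's real API.

```python
# Toy sketch of a Pocket-style get/put client with the lifetime hints
# from the talk: read_once deletes an object as soon as it is read
# (eager GC), and persist keeps it past job deregistration.
class EphemeralClient:
    def __init__(self):
        self._store = {}

    def put(self, key, value, read_once=False, persist=False):
        self._store[key] = {"value": value, "read_once": read_once,
                            "persist": persist}

    def get(self, key):
        entry = self._store[key]
        if entry["read_once"]:
            del self._store[key]  # gone after the first (and only) read
        return entry["value"]

    def deregister_job(self):
        # On deregistration, drop everything not written with persist=True.
        self._store = {k: v for k, v in self._store.items()
                       if v["persist"]}

c = EphemeralClient()
c.put("shuffle/m0/r1", b"partition", read_once=True)
c.put("final/output", b"result", persist=True)
data = c.get("shuffle/m0/r1")   # read once; the key is now gone
c.deregister_job()              # only the persisted object survives
```

The persisted object models the case described next in the talk: piping data between different jobs until a timeout or an explicit delete.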
The other types of hints Pocket supports are provided during job registration. So let's take a closer look at how resource
allocation and assignment happens. What do I mean by resource allocation? Well
the controller needs to decide how much throughput a job needs, how much capacity,
and which storage technology tier should we use.
If we have no information about the job, it's very
difficult to size this allocation.
And so Pocket will use a very conservative default
allocation that over-provisions throughput and
capacity, assumes the job is latency sensitive, and uses
DRAM first, and then spills to other tiers if necessary for capacity.
But as we learn more about a job through optional hints the user can provide,
we can optimize the cost of the resource allocation.
And here are the hints that Pocket currently supports.
The first is a latency sensitivity hint.
So this is just a Boolean true or false for now. We realized
that if a job is latency sensitive, it benefits from using the DRAM tier. So this is going
to be a remote access from a Lambda client to a storage server using DRAM. But we realized
that if the job is more throughput intensive and not so sensitive to latency,
we get equivalent performance using NVMe flash.
The bottleneck ends up being network bandwidth on the VM,
or the nodes that we're using for the storage servers, as opposed to the bandwidth of the actual storage technology in the case of NVMe.
Another hint that we support is the maximum number of concurrent workers.
So this allows us to compute a less conservative estimate for how much throughput we need.
On a particular serverless platform, each lambda typically has a network bandwidth limit.
On Amazon, we've seen roughly around 600 megabits per second.
So these are relatively wimpy workers.
But the point is you would be running a job with
many of them in parallel.
But if we know how many concurrent lambdas to expect and we know the throughput limit
per lambda, we can compute what is the peak throughput we need to provision for the job.
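This provisioning arithmetic is simple enough to write down. A minimal sketch, using the ~600 Mb/s per-lambda limit the talk reports as a conservative rule of thumb:

```python
# Back-of-envelope throughput allocation from the concurrency hint:
# peak throughput to provision = max concurrent lambdas x per-lambda
# network limit (~600 Mb/s per lambda, per the talk's AWS measurements).
PER_LAMBDA_MBPS = 600  # megabits per second, observed per-lambda limit

def provision_throughput_gbps(max_concurrent_lambdas):
    """Peak aggregate throughput to provision, in gigabytes per second."""
    total_megabits = max_concurrent_lambdas * PER_LAMBDA_MBPS
    return total_megabits / 8 / 1000  # Mb/s -> MB/s -> GB/s

# 500 concurrent lambdas -> 500 * 600 Mb/s = 300 Gb/s = 37.5 GB/s.
needed = provision_throughput_gbps(500)
```

Note this is an upper bound: as the Q&A later in the talk notes, concurrent lambdas often see less than 600 Mb/s each, so this estimate is conservative by design.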
And then we also allow the user to directly tell us if they know what is the peak throughput
required and what is the total ephemeral capacity
required. They might have this information if they run this job before and profiled it.
So once we have an allocation in terms of throughput capacity and the choice of the
storage media, the next step is to actually do a resource assignment, which means select
which servers, which storage servers in the cluster are going to be providing the resources.
And for this, Pocket uses an online bin packing algorithm.
So it's trying to fill up existing storage servers in the cluster
with the throughput and capacity requirements on the particular storage tiers
and only launching new nodes if the requirement cannot be satisfied by sharing existing nodes between jobs.
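The online bin-packing idea can be sketched as a first-fit pass over existing servers' spare throughput, launching new nodes only for whatever demand remains. This is a simplification to a single throughput dimension; the real policy also packs capacity, per storage tier, and the names are illustrative.

```python
# First-fit online bin packing for a job's throughput demand (Mb/s).
# Existing servers' spare capacity is consumed first; new nodes (each
# contributing node_mbps) are launched only for the remainder.
def assign_job(demand_mbps, servers, node_mbps=1000):
    """servers: spare Mb/s per existing node. Mutates and returns both."""
    assignment = []          # (server index, Mb/s taken from it)
    remaining = demand_mbps
    for i, spare in enumerate(servers):
        if remaining == 0:
            break
        take = min(spare, remaining)
        if take > 0:
            assignment.append((i, take))
            servers[i] -= take
            remaining -= take
    while remaining > 0:     # launch new nodes for whatever is left
        take = min(node_mbps, remaining)
        servers.append(node_mbps - take)
        assignment.append((len(servers) - 1, take))
        remaining -= take
    return assignment, servers

spare = [300, 0, 500]                  # Mb/s free on existing nodes
plan, spare = assign_job(1200, spare)  # a 1.2 Gb/s job demand
```

Here the 1.2 Gb/s demand soaks up the 300 and 500 Mb/s of spare throughput on existing nodes, and one new node is launched for the remaining 400 Mb/s.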
So this information gets recorded in what we call a job weight map.
So for example, the controller can decide that job A's data
is going to be spread across servers C and D.
So this, for example, could be a latency-sensitive application
that wants to exclusively use the DRAM tier.
And 40% of its data can be going to server C, and 60%
will be going to server D. So these choices will depend on
what is the current load on these systems, because the
controller is monitoring resource utilization.
And for each job, we will have a weight map.
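The weight-map steering in this example (40% of a job's writes to server C, 60% to server D) can be sketched with a deterministic round-robin schedule derived from the weights. The schedule-based router here is an illustrative stand-in for whatever the metadata servers do internally.

```python
import itertools

# Sketch of write-request steering through a per-job weight map.
# A repeating schedule expanded from the weights stands in for the real
# metadata servers' routing logic.
def make_router(weight_map):
    """weight_map: {server: fractional weight}. Returns a zero-argument
    function that yields the next server to write to."""
    schedule = []
    scale = 10  # resolution of the weights
    for server, weight in weight_map.items():
        schedule.extend([server] * round(weight * scale))
    counter = itertools.count()
    return lambda: schedule[next(counter) % len(schedule)]

# Job A's weight map from the talk's example: 40% to C, 60% to D.
route = make_router({"server-C": 0.4, "server-D": 0.6})
first_ten = [route() for _ in range(10)]
```

Over every ten writes, four land on server C and six on server D, matching the weights the controller chose based on current load.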
Now, in addition to selecting a resource allocation and
assignment for a job upfront when it registers,
the controller is always monitoring the state of the cluster.
And so it's looking at CPU, network bandwidth, and storage capacity utilization.
The policy for scaling the resources in the cluster up and down tries to keep utilization within a target range across these three dimensions of CPU, network bandwidth, and capacity; the range can be empirically tuned.
And in our deployment, we're using 60% as a lower end and
80% as the higher end.
So more concretely, the controller will scale up the
cluster by adding a node if any of
CPU utilization, network bandwidth, or storage capacity for a tier are being utilized above
80%.
The controller will scale down the cluster, meaning take down a node, if all of CPU, network, and storage capacity utilization for a particular tier
are being utilized under 60%.
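The policy as stated is a simple any/all rule, which can be written directly (a minimal sketch; the function name and single-tier signature are illustrative):

```python
# The scaling policy from the talk: add a node if ANY of CPU, network,
# or a tier's capacity utilization exceeds the upper target (80%);
# remove a node only if ALL of them sit below the lower target (60%).
LOW, HIGH = 0.60, 0.80

def scaling_decision(cpu, net, capacity):
    """cpu/net/capacity: utilization fractions in [0, 1] for one tier."""
    utilizations = (cpu, net, capacity)
    if any(u > HIGH for u in utilizations):
        return "scale-up"
    if all(u < LOW for u in utilizations):
        return "scale-down"
    return "steady"

d1 = scaling_decision(cpu=0.50, net=0.85, capacity=0.40)  # net over 80%
d2 = scaling_decision(cpu=0.30, net=0.55, capacity=0.20)  # all under 60%
d3 = scaling_decision(cpu=0.70, net=0.55, capacity=0.20)  # in between
```

The asymmetry (any-dimension to scale up, all-dimensions to scale down) biases the controller toward having enough resources, with the dead band between 60% and 80% preventing oscillation.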
And the actual mechanism that's used to balance load as the size of the cluster changes
is all done through this job weight map.
So here we're leveraging and relying on the fact that this ephemeral data is short-lived.
And so whatever is currently in the cluster is likely going to expire soon and be garbage collected.
So we want to avoid migrating data, because for data with such short lifetimes, migration presents high overhead.
And so for this reason, we focus on steering data for incoming jobs in order to balance load in the cluster.
So for example, if we take the four storage servers that we had and we decide that we don't need as many DRAM resources
or we can take down storage server D, what we're going to do is we're going to steer data for new incoming jobs
away from server D and onto the other storage servers, such as B
and C, so their resource utilization will increase. And if we're not steering any new
data to server D, eventually its data and its resource usage will decrease. And at this
point, we can remove that node from the cluster.
Okay. So I'll talk next about how we actually implement Pocket.
We leverage several open source code bases.
So the storage and metadata server implementation of Pocket
is based on the Apache Crail distributed storage system,
which I'll talk about next.
And to implement a fast NVMe flash storage server tier,
we use ReFlex. This is a software system that has been a key part of my thesis at Stanford; it achieves fast access to NVMe Flash over commodity Ethernet networks, and it's a purely software-based solution. Then we run the actual Pocket metadata and storage servers inside containers, which the controller orchestrates using Kubernetes.

The Apache Crail distributed storage system targets ephemeral data storage in distributed data processing frameworks. Think of Spark, for example: Crail has a plug-in that lets applications get the raw performance of the hardware they're using.
And Crail was actually originally
designed to target RDMA networks, which
have particularly low latency and enforce high performance
requirements for the software.
Now, in a public cloud environment,
RDMA networks are not readily available.
But Crail has a modular and pluggable architecture, so you can add and swap in your own network processing and storage tiers. And so we plug in ReFlex for the NVMe flash storage tier. ReFlex provides low latency and
high throughput while consuming low compute resources. And for example, with a single
core running ReFlex, you can serve 850,000 I/O operations per second. So this is around 11 times higher than what we measured with a traditional Linux storage stack using libaio, for example.
The way ReFlex achieves this kind of performance
is by doing direct access to NVMe and network queues from user space.
So we're leveraging libraries like Intel's DPDK and SPDK.
We use a polling-based execution model that runs to completion.
So when we pick up a packet from the receive queue,
we continue processing it until we submit the request to flash storage.
And so this allows us to optimize for data cache locality.
And then we're avoiding data copies by leveraging the DMA engines
on network and NVMe flash devices to forward
data directly between these. So we're integrating the network and storage processing. And then
we use adaptive batching to optimize instruction cache locality. So if we're operating at high
load and when we're doing our polling loop, we see that there's many requests in the receive
queue, we can run our execution loop with multiple requests
at each stage.
We call it adaptive batching because if we're
operating at low load, we don't have
to wait to accumulate a batch.
And so we don't hurt latency, but in the cases of high load,
we can achieve high throughput.
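The adaptive-batching idea, drain whatever is currently queued rather than waiting to fill a fixed-size batch, can be sketched in a few lines. This is a conceptual illustration only; the real ReFlex loop is a user-space C dataplane over DPDK/SPDK queues, and the names here are hypothetical.

```python
from collections import deque

# Sketch of an adaptive-batching polling loop: each iteration takes
# everything currently in the receive queue (up to a cap) and runs each
# request to completion. At low load batches are size 1, adding no
# latency; at high load they grow, improving instruction-cache locality.
def poll_loop(receive_queue, submit_to_flash, max_batch=64):
    """Drain the queue in adaptive batches; returns the batch sizes seen."""
    batch_sizes = []
    while receive_queue:
        # The batch is whatever load produced, never something we wait
        # to accumulate.
        batch = [receive_queue.popleft()
                 for _ in range(min(len(receive_queue), max_batch))]
        batch_sizes.append(len(batch))
        for request in batch:           # run each request to completion
            submit_to_flash(request)
    return batch_sizes

completed = []
q = deque(["r1", "r2", "r3", "r4", "r5"])
sizes = poll_loop(q, completed.append, max_batch=4)
```

With five queued requests and a cap of four, the loop processes one batch of four and one of one, and every request reaches the flash-submission step.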
And another key aspect of ReFlex is
achieving predictable performance and quality of service
when multiple tenants are sharing a Flash device.
So this is important particularly for NVMe Flash because of the asymmetries of read and write performance.
So ReFlex is able to enforce throughput and tail latency SLOs,
as long as they're physically feasible with the performance of the device.
ReFlex can isolate the traffic between tenants to provide their tail latency and throughput
service level objectives. And this code is open source, so you are welcome to check it
out on GitHub. So, Pocket is a system that we envision a cloud provider managing and providing.
But in order to evaluate and implement Pocket, we deploy it on EC2 nodes in Amazon.
And here are a few EC2 instance types that we use for the different storage tiers and Pocket node types.
And then we're using AWS Lambda as our serverless platform.
And the applications we use are the distributed compilation, MapReduce sort, and video analytics, which you saw earlier in this talk.
So the first thing we look at is Pocket's latency. And we compare Pocket to two existing
systems. First, S3, which is Amazon's object store. It provides serverless storage. So
users pay for the capacity and bandwidth that they need or that they use,
and they don't manage their own resources. A disadvantage of S3 is going to be performance,
particularly latency, and there's particular overhead for small objects. So here we're
doing a one kilobyte access from an AWS Lambda client. And we see that S3 takes 12 to 25 milliseconds,
depending on whether it's a put or get request.
The other system we consider is Redis.
So this is a DRAM-based key value store.
And Redis achieves significantly higher performance.
Notice it's around 230 microseconds.
Considering this is DRAM, this might sound high,
but that is the network overhead from the Lambda client.
And the disadvantages of Redis are high cost,
because we're exclusively using DRAM here, and also the fact
that users have to spin up their own storage cluster,
and scale, and manage this cluster for the purposes
of their ephemeral data sharing.
So this is a burden on users. And what we see is that Pocket with the different storage tiers, particularly the DRAM and NVMe flash tiers, offers similar performance, with the main advantage of automatically managed resources. Now, the reason why Pocket's DRAM tier is
higher latency than Redis is because Pocket requires a metadata lookup.
So in a Redis cluster, keys are hashed to particular storage nodes and there's no explicit
metadata lookup.
So this is a tradeoff, but the metadata plane and its lookup are essential for Pocket's flexible data placement and resource allocation policies.
As the cluster size changes, a key aspect of the system is its elasticity. If you were relying on hashing, you would have redistribution of keys across servers, whereas in Pocket, a job's data placement is encoded in weight maps that don't change when storage servers enter and leave the cluster.
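The contrast can be made concrete with a toy comparison: under modulo/hash placement a key's home server depends on cluster size, so adding a node relocates existing keys, while a weight map recorded at job registration keeps naming the same servers. The hash function and server names below are illustrative simplifications (real Redis uses hash slots, not plain modulo).

```python
# Modulo placement (hash-partitioned store, simplified): the home server
# is a function of cluster size, so growing the cluster remaps keys.
def hash_home(key, num_servers):
    return sum(key.encode()) % num_servers

keys = ["obj-%d" % i for i in range(100)]
before = {k: hash_home(k, 4) for k in keys}
after = {k: hash_home(k, 5) for k in keys}          # one node added
moved = sum(before[k] != after[k] for k in keys)    # keys that relocate

# Weight-map placement: the map fixed at registration still names the
# same servers no matter what else joins or leaves the cluster.
weight_map = {"server-C": 0.4, "server-D": 0.6}     # set at registration
still_valid = set(weight_map) <= {"server-A", "server-B", "server-C",
                                  "server-D", "server-E"}
```

Under modulo placement a large fraction of keys relocate when the node count changes, which is exactly the data movement Pocket's weight maps avoid for short-lived data.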
The next thing we look at is throughput scaling. So of course, if we increase the number of nodes
in our cluster for Pocket and Redis, we're going to see higher and higher throughput. So we compare
what is the throughput per node and how do they scale for the different Pocket tiers. And
we compare this against S3 and Redis. So the S3 line is flat because we don't actually
control the number of nodes in S3. The throughput we get is what it is. And what we see here
is that with two of these i3.2xlarge nodes that we're using for Pocket nodes,
we have higher throughput than S3.
So we're roughly getting 1 gigabyte per second per node.
And that is the network bandwidth of the VM.
Is there a question?
You're saying you get 1.5 gigabytes per second from S3?
That's like off by three orders of magnitude.
These are for one... Okay, I should say what the test is also.
So we're issuing 100 concurrent lambdas,
and each of them is issuing one megabyte request in a loop.
So earlier you said that there were 600 megabits per second,
and so I did the proof math there,
and it was like 80 megabytes a second. Is that the typical AWS lambda throughput So what we do notice is that when we run multiple lambdas concurrently, the per lambda network
bandwidth kind of diminishes because I think they're being put onto the same physical nodes.
And so that 600 megabits is kind of measured with maybe less
concurrent lambdas.
Okay. And to answer my question: the S3 throughput is across multiple, let's say, 100 instances?
Mm-hmm.
Got it.
Right. The reason why we see a tapering off here is that with these 100 concurrent lambdas, we're reaching the per-lambda network bandwidth limit. We also distinguish between NVMe flash and regular SATA/SAS-based SSDs, which are offered on older-generation Amazon instances. Both the disk tier and the SATA/SAS SSD storage provide lower throughput.
The solid line here is particularly
low due to low network bandwidth on these old generation
instances.
And that's why we have in the dotted line
using a larger instance that doesn't bottleneck
on the network bandwidth, but instead bottlenecks
on the actual SSD bandwidth.
So for the rest of the results, we're going to focus on Pocket's DRAM and NVMe tiers,
because these really stress the performance we need from the software.
And they're much more cost-effective on the Amazon infrastructure.
So next we look at how Pocket's right-sizing with hints helps us save on cost. Our baseline here is the default allocation, where we use 50 i3 nodes and provision 50 gigabytes per second of storage throughput, which is likely overkill for a typical job, but it's because we're being conservative.
And we look at how, as we learn more about an application, we can save on cost for the video analytics and the distributed compilation jobs.
So if we learn the number of concurrent lambdas,
and we provision using the conservative 600 megabits per second per lambda as the rule of thumb,
we can reduce cost as shown by the red bars here.
If we also learn that these jobs are not sensitive to latency,
they're actually all throughput intensive. The distributed compilation is also compute intensive, so
it is not sensitive to the latency of the storage system as much. We get further reductions
in cost. And then the green bars show what if the applications told us directly the throughput
and capacity requirements that they need.
Next, we look at how Pocket right-sizes resources as multiple jobs enter and leave the system.
So here on the x-axis, we're plotting time, and we have up arrows indicating when a job has arrived or registered with the system, and a down arrow indicating when the job deregisters.
In the orange line, we're plotting the aggregate throughput that we observe from pocket storage
servers.
And the dotted blue line shows the throughput that the controller has provisioned with the
rule of thumb that each of these nodes gives us one gigabyte per second.
That is roughly the throughput that we see these nodes providing.
Occasionally we see bursts, and that is what's happening here, with a job able to achieve higher throughput than what the controller believes has been provisioned. But the takeaway here
is that the controller is elastically scaling resources and adjusting as jobs
are entering and leaving the system. And this is all done automatically without
users explicitly saying how many nodes they need.
Now, how does Pocket affect overall execution time compared to using a system like S3 or Redis?
So we run a 100-gigabyte sort job,
and we're always storing the original input data and the final output data in S3 for this test
because that's longer-term data. The time spent on shuffled data, the ephemeral data I/O,
is what's shown per lambda in the blue bars. And for the ephemeral data storage
system, we use either S3, Redis, or Pocket. And the three sets of bars here are running
this job with 250, 500, and 1,000 concurrent workers.
So the first observation is that if we're using 250 lambdas, for example,
S3 does not provide sufficient throughput for this job
because we can see that by provisioning higher throughput for Redis and Pocket NVMe,
the execution time per lambda in this stage can decrease.
You'll notice that we're missing the S3 data points for the 500 and 1,000 tests, and this is because
we actually got request rate limit errors telling us to exponentially back off
on the rate of requests we were issuing.
The other observation here is that for Redis, we're using DRAM,
but for Pocket, we're using NVMe flash.
And as I mentioned before, because this is a throughput-intensive application,
we see similar performance using Pocket NVMe and Redis.
The last thing I want to go into is a cost analysis, which
is a bit tricky to do because it's
tough to do an apples-to-apples comparison.
But we envision that Pocket would be a system provided with a pay-what-you-use cost model to match the serverless abstraction.
So users would pay for the capacity and bandwidth that is used for their job.
And so what we do is we calculate fine-grained resource costs based on EC2 prices.
So we compute what is the approximate cost of NVMe flash per gigabyte, DRAM per gigabyte,
and the cost of a core. And we compare the cost of running each of these jobs, the sort,
video analytics, and distributed compilation,
using S3, Redis, and Pocket.
And with Redis, we assume that since users are individually spinning up these clusters,
they have to pay for the entire VM,
not just the fraction of resources that they used on it.
And so Pocket, by leveraging hints from users,
so for example using NVMe flash instead of DRAM when appropriate,
and by providing this kind of pay-what-you-use cost model,
reduces cost by roughly 60%, we find, for all three of these jobs.
You'll notice that S3 is still much cheaper.
The cost comparison here is not quite a fair one, though,
because our prices for Redis and Pocket are based on user-facing EC2 resource costs, whereas S3's pricing is
based on whatever Amazon is actually paying for those resources, and this is amortized
across the many, many jobs that they're running. So that explains the discrepancy there.
But it is true that one of the main draws of S3
is how cost-effective it is.
But another thing to keep in mind
is that it's not going to give
quite the performance expected for these applications.
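The pay-what-you-use accounting versus whole-VM pricing described above could be sketched as follows. The dollar figures are made up for illustration and are not real EC2 or S3 prices:

```python
# Illustrative per-resource prices in $/unit-hour; NOT real EC2 prices.
PRICE = {"dram_gb": 0.005, "nvme_gb": 0.0005, "core": 0.02}

def pocket_cost(gb_used, cores_used, tier, hours):
    """Pay-what-you-use: bill only the capacity and cores the job consumed."""
    return gb_used * PRICE[tier + "_gb"] * hours + cores_used * PRICE["core"] * hours

def whole_vm_cost(vm_gb, vm_cores, hours, tier="dram"):
    """Redis-style: the user pays for the entire VM for the whole duration."""
    return vm_gb * PRICE[tier + "_gb"] * hours + vm_cores * PRICE["core"] * hours

# A job using a fraction of a flash node versus renting a whole DRAM VM:
p = pocket_cost(gb_used=100, cores_used=2, tier="nvme", hours=0.1)
v = whole_vm_cost(vm_gb=64, vm_cores=8, hours=0.1)
```

The gap between the two is the fraction of the VM the user paid for but never used, which is what the pay-what-you-use model eliminates.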
Can I ask you a question?
Yes.
So you have costs for doing things this way, right? Yes. So more broadly, you're comparing running a job on serverless
versus running a job on VMs. I mean, I think it's really cool that you showed you made serverless cheaper with standard servers. I think that's a great outcome.
But now, having made the case for serverless, how much cheaper is serverless?
Okay, so the question is how much cheaper is serverless than doing it the traditional
way?
I don't have actual numbers to answer that question.
Some of my colleagues at Stanford actually have been doing work on this, and they've found that
you don't necessarily win on cost currently for some of
the jobs that they're running. What you gain is performance, if your job can benefit
from this massive parallelism. It depends on a lot, like the current
pricing of things, of course. There are many caveats, but I think currently the
main reasons to switch are user convenience, not having to manage these resources;
that it's good for very short jobs, because the time granularity of the billing is very fine;
and performance, if your job can benefit from the very high parallelism.
Yes?
I remember from the architecture overview, you have one management node, I think,
the controller.
The controller, mm-hmm.
But doesn't that limit scalability?
How many tasks can the system actually handle? Sure.
Okay, so you're asking about the number of jobs...
So a metadata server can handle
about 195,000 I/O operations per second.
So it depends on how many
metadata operations you're doing per job.
The controller itself,
I don't have a number of how many jobs it can register.
But that would be interesting to measure.
The solution that you tested,
is that scalable,
where you could have multiple controllers
and multiple data stores? Yes. So all three planes are independently scalable.
And so you can scale metadata servers, storage servers.
The controller is logically centralized,
but it could also be scaled up and distributed. Are there any commercial applications for this?
So we would love to have some.
Yeah, so this is really early stage.
So this is the first talk that I'm giving about this in public.
So we've been working on this for about a year,
and we're excited to talk about use cases of this for industry applications.
So if you have ideas, I would be happy to hear.
Okay.
Yes.
You mentioned using hints to optimize resource allocation.
Is that during the registration, or is there a point during the course of the job? Okay, so the question is regarding hints
and when they get passed.
Regarding job attributes,
those hints are all passed during job registration.
So the allocation gets decided at the beginning of the job.
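A registration call carrying these hints might look roughly like the following. The API shape, parameter names, and defaults here are guesses for illustration (the real interface is in the Pocket repo), not the actual signature:

```python
def register_job(name, num_lambdas=None, latency_sensitive=False,
                 capacity_gb=None, peak_mbps=None):
    """Decide the storage tier and allocation once, at registration time."""
    # Throughput-intensive jobs tolerate flash; latency-sensitive ones get DRAM.
    tier = "dram" if latency_sensitive else "nvme"
    if peak_mbps is None:  # fall back to the 600 Mb/s-per-lambda rule of thumb
        peak_mbps = (num_lambdas or 1) * 600 / 8
    return {"job": name, "tier": tier,
            "throughput_mbps": peak_mbps, "capacity_gb": capacity_gb or 100}

alloc = register_job("sort100gb", num_lambdas=250)
```

Since all hints arrive at registration, the allocation is fixed for the lifetime of the job rather than adjusted mid-run.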
You deployed this mostly in the cloud, it looks like.
Have you tried local deployments, and particularly
with your reflex RDMA infrastructure?
So we were specifically targeting the cloud
environment because of the serverless tasks.
So we haven't tried locally, but the system can be used more broadly.
And we've looked at, for example, running Spark
and using the Spark plug-in for Crail and hence Pocket as well.
So that could also be interesting.
How would the advantages shift in such an environment?
I'm curious if you've explored it yet.
We haven't explored it yet,
but one thing I can kind of hypothesize
is that the use of the DRAM tier, I think,
would become much more beneficial
than we see in the serverless environment
because the network overhead in AWS Lambda
currently is very high.
So the DRAM tier loses its appeal
in terms of the low latency.
Yeah.
So you talked about passing on the hints
at the time of registering a job
and then at the time of deregistering the job,
delete all the data.
I assume the deletion happens just by deleting the files.
Right.
You know, most likely.
Is there, you know, have you explored better ways to do this, as in passing the allocation
hints down to the lower storage layers, or, at the time of deregistration, doing
something other than just deleting the file?
So currently, the metadata server
that is managing the data for that job
deletes it.
So it doesn't have to actually touch the storage servers.
I mean, for security, you might need to propagate that
down to the storage.
But currently, it's just at the metadata server
that the data gets deleted.
And that data's blocks are then considered free blocks,
and they can be populated by new jobs.
Yeah.
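The metadata-only deletion path just described might be sketched like this. The class and method names are hypothetical, a toy model rather than Pocket's implementation:

```python
class MetadataServer:
    """Toy model: deregistration frees blocks without touching storage nodes."""

    def __init__(self, total_blocks):
        self.free_blocks = set(range(total_blocks))
        self.job_blocks = {}  # job name -> set of block ids it holds

    def allocate(self, job, n):
        """Hand n free blocks to a job's ephemeral data."""
        blocks = {self.free_blocks.pop() for _ in range(n)}
        self.job_blocks.setdefault(job, set()).update(blocks)
        return blocks

    def deregister(self, job):
        # Data is not overwritten; its blocks simply become reusable
        # and can be populated by new jobs.
        self.free_blocks |= self.job_blocks.pop(job, set())

md = MetadataServer(total_blocks=8)
md.allocate("jobA", 3)
md.deregister("jobA")
```

Since the data is never scrubbed, propagating the delete down to the storage tier, as raised in the question, would matter mainly for security.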
Yes.
So does your controller do any analytics at all?
In terms of, so the customer,
oh, sorry, the user passes the hints,
but they could be overestimated
or underestimated, for example.
Yeah.
And based on your previous experience with the job,
you adjust those hints.
Yeah.
Yeah, so that's something we would like to do as future work.
Currently, the controller assumes that the user has a good idea
and provides these hints.
But especially, yeah, so there's a great opportunity
to learn across invocations.
So, for example, on Amazon Lambda, you register this job,
and so you can track every time it's executed.
You can keep track of what you've learned about the resource allocation.
So that's something we're hoping to explore.
I guess I can talk about...
So the other opportunities for future work that we see
are harvesting Slack resources in the data center to run Pocket.
So this is ephemeral data that we're managing.
It has short lifetime.
And we know that in data centers we have many resources sitting idle.
So we see an opportunity to try and harvest those.
In this case, we need Pocket to be able to deal efficiently with resource revocation
where suddenly these resources are needed,
and so maybe we need some better fault tolerance in this case,
or at least some way of handling when data is no longer accessible because those resources that used to be idle need to be reclaimed.
And then although we've talked about Pocket here in the context of serverless analytics,
more broadly we think of it as a distributed /tmp.
And so we're interested in exploring other
use cases for this.
And I'll put up this slide here.
So there's a repo here on GitHub, so I encourage you to
check it out.
Any other questions?
We'll also have a paper appearing at the OSDI
conference in a couple of weeks.
All right, thank you.
thanks for listening
if you have questions about the material presented in this podcast
be sure and join our developers mailing list
by sending an email to
developers-subscribe
at snea.org.
Here you can ask questions and discuss this topic further
with your peers in the storage developer community.
For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.