Storage Developer Conference - #77: Pocket - Elastic Ephemeral Storage for Serverless Analytics
Episode Date: October 29, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 77.

Hi everyone, my name is Ana Klimovic and I'm a final-year PhD student at Stanford. Today I want to talk about a system called Pocket that provides elastic and ephemeral storage for serverless analytics. This is joint work with Yawen Wang and Christos Kozyrakis at Stanford, along with Patrick Stuedi, Animesh Trivedi, and Jonas Pfefferle at IBM Research in Zurich.
Imagine if you could launch thousands of tiny tasks in the cloud simultaneously.
You could achieve massive parallelism and hence near real-time performance for the jobs
you would otherwise run on your local machine or cluster.
And what if you could just focus on writing your application code while the provider in
the cloud is managing resources and charging you only for the resources your job actually consumes
at a fine time granularity.
This is the appeal of serverless computing.
Serverless computing is a new cloud execution model
that enables users to abstract away resource management
and launch tasks with high elasticity, meaning your jobs
can go from one to thousands of threads in a matter of seconds, and fine-grained resource
billing, meaning users pay only for the resources their jobs actually consume, and it's billed
at a fine time granularity, such as hundreds of milliseconds.
Now, the first applications to take advantage of serverless computing have
been IoT and web microservice type of applications. The typical use case is we have a function
that executes in response to an event, like sensor data becoming available or a user
logging into a web service. But it turns out that this high elasticity and fine-grained
resource billing makes serverless computing appealing
for a much broader range of applications,
including what we're going to refer
to as interactive analytics.
By this, I mean that a user submits a query and some kind
of input data to the cloud, and we use multiple tasks
and threads to come up with a result for this user
in real time or near real time. Now the
challenge with analytics in general is that they often involve multiple stages of execution.
And we have to pass intermediate results between these different stages. Think of the shuffle
stage between map and reduce, for example. So what this means is that if we want to run analytics on lambdas or on serverless
platforms, we need these serverless tasks to have an efficient way of communicating
intermediate results. And we're going to refer more broadly to this intermediate data as
ephemeral data, which simply means short-lived. We're talking about data that typically only
needs to live for the duration of the job itself.
So in traditional analytics platforms, if you think of a Spark deployment, for example,
the ephemeral data is exchanged directly between tasks. Mappers write their intermediate data
to local storage, and then reducers can fetch this data directly over the network. But in
serverless analytics, if we look at the serverless
platform context, this idea of directly fetching the data becomes very difficult because direct
communication between lambdas is challenging. And there are a couple of reasons for this.
Lambdas are short-lived and they're stateless. So, for example, on AWS Lambda, execution is capped at five minutes. And when a lambda exits, say when a mapper is done, the data it stored on its local storage is no longer available.
In addition, part of the draw of serverless computing is that users don't have to schedule
and manage their resources.
But this also means users have no control over the scheduling, and we don't know which IP addresses different lambdas are sitting behind. So it's actually very difficult for a reducer to know which lambda among the mappers to communicate with. What this means is that the natural approach
for sharing intermediate data in a serverless context is by using a common remote data store.
So we could have mappers write their intermediate data
to a distributed storage system,
and then even as mappers exit,
reducers are still able to fetch this data.
So we need a distributed data store.
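The pattern just described, mappers writing shuffle partitions to a shared remote store so reducers can fetch them even after the mappers exit, can be sketched as follows. This is a minimal illustration: the dict stands in for the remote store, and all names are illustrative, not Pocket's actual API.

```python
# Mappers and reducers are decoupled through a shared remote store: a
# mapper's output survives in the store even after the mapper exits.
NUM_REDUCERS = 2
ephemeral_store = {}  # stand-in for the shared remote data store

def partition_of(key):
    # Deterministic partitioning of keys to reducers.
    return sum(key.encode()) % NUM_REDUCERS

def run_mapper(mapper_id, records):
    # Group this mapper's records by destination reducer, then write
    # each partition to the shared store under a well-known key.
    parts = {}
    for key, value in records:
        parts.setdefault(partition_of(key), []).append((key, value))
    for reducer_id, part in parts.items():
        ephemeral_store[f"shuffle/{mapper_id}/{reducer_id}"] = part
    # The mapper may now exit; its output lives on in the store.

def run_reducer(reducer_id, num_mappers):
    # Fetch this reducer's partition from every mapper, including ones
    # that have long since exited.
    merged = []
    for mapper_id in range(num_mappers):
        merged.extend(
            ephemeral_store.get(f"shuffle/{mapper_id}/{reducer_id}", []))
    return merged

run_mapper(0, [("a", 1), ("b", 2)])
run_mapper(1, [("a", 3)])
reduced = run_reducer(partition_of("a"), num_mappers=2)
```

The key point is that no mapper-to-reducer network connection is ever needed, which is exactly what the lambda environment cannot provide.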
But when we think of existing storage systems
like object stores, key value stores, and databases,
they're typically not designed to meet the elasticity,
performance, and cost requirements for these kind of serverless analytics applications.
And to understand why, we look at what these requirements are. So we analyze three different
types of serverless analytics jobs, and we look at their requirements in terms of ephemeral
I/O throughput, capacity, data size, and object lifetime.
So the first type of application we look at is the distributed compilation job.
So imagine you're trying to compile some source code which might take minutes or even hours
on your local machine. You would be doing maybe make -j8 if you had an 8-core machine. But why not do make -j1000 and leverage the parallelism that you have with
these Lambda workers in the cloud? So that's exactly what we do using a framework called
GG that's being developed at Stanford. This allows us to run compilation leveraging these
serverless tasks. And in this plot here, we're looking at the total throughput with the read
throughput in the solid line and the write throughput in the dotted line for ephemeral data shared
between these tasks.
And what you'll notice is that at the beginning of the job, there's high parallelism because
every lambda is compiling an independent source file.
And so they can all run concurrently, and this high parallelism results in high I/O rates
and throughput.
In the later stages of the job, lambdas are handling archiving and linking, so they're
serialized.
They depend on the outputs of previous stage lambdas.
There might be some stragglers here.
The particular source code we're compiling in this case is the CMake project, and the ephemeral data capacity for this is less than one gigabyte, so pretty
small. Another application we look at is the MapReduce style of applications. And in particular,
we run a canonical sort of 100 gigabytes, which also means we have 100 gigabytes to
shuffle. And the interesting thing about this application is that
it requires particularly high throughput.
It's a very I/O- and throughput-intensive application.
So when we're running with 500 concurrent lambdas, we need up
to 7.5 gigabytes per second of storage throughput.
And the final application type we look at is the video
analytics job that runs in two different stages of lambdas.
So the first stage is taking in video frames and decoding them
and then writing these decoded frames to the ephemeral storage,
which is this first bump that you see.
And these lambdas are all launched together.
And as each lambda finishes decoding its batch,
it launches a second stage lambda, which will
run an MXNet classification algorithm to detect objects in this video.
And so here, the second stage lambdas are reading the ephemeral decoded video frames.
And for the particular video we're looking at here, we need around six gigabytes of capacity.
Now, another important characteristic of these applications is the granularity of their I/O accesses. So this CDF here shows a plot of the different sizes of objects we
see for these applications. And the takeaway is that they range from hundreds of bytes
to hundreds of megabytes. And the implication of this is we need both low latency,
this is important for the small objects,
but also high throughput,
which is going to be important for the large object sizes.
We need both this high throughput and low latency
for an ephemeral storage system
to address the requirements of these applications.
And in addition to having high performance storage, meaning low latency and high throughput,
we also care about optimizing cost.
But optimizing cost while maintaining high performance requires us to intelligently select
and scale the resources that we're using for this ephemeral storage.
And this can be very challenging.
So here we look at an example for a serverless
video analytics job using an ephemeral storage system with different resource configurations.
So we're plotting the performance on the Y axis and the cost per hour on the X axis.
And we have different storage technologies that we use for the ephemeral data store,
such as DRAM, NVMe Flash, and disk.
And also we vary the number of nodes in the cluster,
which impacts the overall throughput or the network bandwidth provisioned.
And we vary the number of CPU cores as well.
And the takeaway here is that finding this Pareto-efficient frontier of performance and cost is non-trivial.
In particular, we would like the system to automatically find the point beyond which adding more resources only increases cost without improving performance; in other words, the optimal performance point for the lowest cost possible. Users can try out different configurations and try to figure out
what the best configuration is, but this is a huge burden for users. In particular, for
users of serverless platforms, which are already abstracting away resource management for the
compute and memory required to run their jobs. So we don't want to require users to manage
their own storage clusters for the purposes of ephemeral data sharing.
And hence, automatically managing and right-sizing these resources is another requirement.
So we've talked about what we require from an ephemeral storage system for these applications,
but there's also something that we don't really need, and that sets us apart from traditional storage systems.
One of the unique things about ephemeral data is that it is short-lived.
So here we're showing a CDF of the object lifetime for these ephemeral objects in the
three applications we look at.
And most of these objects live for less than 20 seconds.
Furthermore, ephemeral data is relatively easy to regenerate by rerunning the tasks that generated that data.
And fault tolerance is typically baked into application frameworks. If you think about a traditional framework like Spark, it tracks lineage for its resilient distributed datasets, and so it knows which tasks need to be rerun if a particular object is lost. So fault tolerance in the
storage system is not a high requirement in this case, and that's going to influence our
design. So to summarize, the main requirements for our storage system then are high performance
for a wide range of object sizes, automatic resource scaling with an
awareness of storage technology, and fault intolerance, meaning we don't have to have
particularly high fault tolerance in this case.
Now, because existing storage systems are not meant to address these particular resource
requirements, we decided to implement our own system, which we call Pocket,
because it's designed for storage-on-the-go intermediate data sharing.
So Pocket is an elastic and distributed storage system
that targets ephemeral data sharing in serverless analytics.
And its key properties are high throughput and low latency,
automatic resource scaling and right-sizing,
and intelligent data placement
across multiple storage technology tiers.
We'll show that Pocket achieves similar performance
to a DRAM-based key value store called Redis,
while saving close to 60% in cost
for the different serverless analytics jobs that we look at.
But before getting into those results, I want to describe the design of Pocket and how we implement
it. So we follow three main principles in the design of Pocket, and the first is separating
responsibility across three different planes. There's the control, metadata, and data plane.
So the control plane is scaling resources and deciding which resources
to give to each job. The metadata plane is tracking where objects are stored across nodes
in the data plane. So an object gets chunked into fixed size blocks, 64 kilobytes by default,
and spread across storage servers in the data plane. And then the data plane is managing the actual data blocks.
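The chunking just described, an object split into fixed-size 64 KB blocks spread across data-plane servers, can be sketched from the metadata plane's point of view. This is a simplified illustration with round-robin placement; function and server names are hypothetical.

```python
# Sketch of the metadata plane's view of an object: split it into
# fixed-size blocks (64 KB by default, per the talk) and spread the
# blocks across data-plane storage servers. Round-robin placement here
# is a simplification of whatever policy the real system uses.
BLOCK_SIZE = 64 * 1024

def place_object(size_bytes, servers):
    """Return a block map: (block index, byte range, server) per block."""
    num_blocks = -(-size_bytes // BLOCK_SIZE)  # ceiling division
    block_map = []
    for i in range(num_blocks):
        start = i * BLOCK_SIZE
        end = min(start + BLOCK_SIZE, size_bytes)
        block_map.append((i, (start, end), servers[i % len(servers)]))
    return block_map

# A 200 KB object over three storage servers -> 4 blocks, the last one
# partial, wrapping back to the first server.
blocks = place_object(200 * 1024, ["srv-A", "srv-B", "srv-C"])
```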
We also designed Pocket for sub-second response time.
So we'll talk about how the implementation of the storage stack in the data plane
is optimized for fast and simple I/O operations.
And then the controller is also constantly monitoring the state of the cluster
and reacting
quickly to adjust the resources accordingly.
And then we leverage multiple storage tiers including DRAM, flash, and disk in order to
minimize cost while getting the required performance for the application.
So let's take a look at Pocket's system architecture. The data plane consists of multiple storage servers. Here we're showing a disk server, a flash server, and two DRAM storage servers. And the bars here represent their instantaneous resource utilization at a snapshot in time. We have
the metadata plane consisting of one or more metadata servers. These are going to be routing
requests from clients
to the appropriate storage server nodes.
And we have a logically centralized controller
that is doing application-driven resource management
and scaling.
So let's look at how a job interacts with Pocket.
The first step is to register with the controller.
And at this point, the job can provide optional hints
about attributes of the job, such as its latency sensitivity and
throughput requirements.
We'll talk more about these hints, and we'll dive into the
job registration process in a bit more detail in a few
slides.
What job registration involves is the controller needs to
decide an allocation and assignment of
resources for this job.
So it decides this at the beginning of the job, and it communicates this information to the metadata servers because they are the ones
that will enforce data placement. Because metadata servers are going to route client
write requests to the appropriate storage nodes. So at this point, when the controller
has communicated this data to the metadata servers, it returns to the user with the job ID and the job can now launch lambdas. And these lambdas are going to issue put and get
requests. So here's an example of a put request. These requests start by doing a metadata lookup
and then the actual storage operation goes to the storage server. Now the API that Pocket
exposes is a simple get/put API, similar to that of traditional object
stores, but it does have additional functionality tailored to this ephemeral storage use case.
So for example, Pocket supports hints about data lifetime.
Because what we realized with ephemeral data is that it's often written and read only once
for a particular object.
So for example, a mapper can write a file or an object that is destined to a particular reducer and after that reducer reads
this object it is no longer needed in the system. And so we might as well
delete it immediately to optimize garbage collection and capacity usage.
And so if this is something known about the application, the user can directly hint this, or the application framework that's launching these lambdas can provide such hints.

By default, when a job deregisters, the controller will ensure that all data the job generated gets deleted.
But you can override this by writing data with a persist flag,
and then it will last for longer than the duration of the job
until a programmable timeout occurs
or the user explicitly deletes this object.
And so this is useful for cases where you want to use Pocket
to pipe data between different jobs.
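The lifetime behavior described above, eager deletion of write-once/read-once objects, job-scoped cleanup on deregistration, and a persist flag that overrides it, can be sketched as a toy client. The class and method names here are illustrative stand-ins, not Pocket's real API.

```python
# Toy sketch of a Pocket-style get/put client with the lifetime hints
# from the talk: read_once deletes an object as soon as it is read
# (eager GC), and persist keeps it past job deregistration.
class EphemeralClient:
    def __init__(self):
        self._store = {}

    def put(self, key, value, read_once=False, persist=False):
        self._store[key] = {"value": value, "read_once": read_once,
                            "persist": persist}

    def get(self, key):
        entry = self._store[key]
        if entry["read_once"]:
            del self._store[key]  # gone after the first (and only) read
        return entry["value"]

    def deregister_job(self):
        # On deregistration, drop everything not written with persist=True.
        self._store = {k: v for k, v in self._store.items()
                       if v["persist"]}

c = EphemeralClient()
c.put("shuffle/m0/r1", b"partition", read_once=True)
c.put("final/output", b"result", persist=True)
data = c.get("shuffle/m0/r1")   # read once; the key is now gone
c.deregister_job()              # only the persisted object survives
```

The persisted object models the case described next in the talk: piping data between different jobs until a timeout or an explicit delete.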
The other types of hints Pocket supports are provided during job registration. So let's take a closer look at how resource
allocation and assignment happens. What do I mean by resource allocation? Well
the controller needs to decide how much throughput a job needs, how much capacity,
and which storage technology tier should we use.
If we have no information about the job, it's very
difficult to size this allocation.
And so Pocket will use a very conservative default
allocation that over-provisions throughput and
capacity, assumes the job is latency sensitive, and uses
DRAM first, and then spills to other tiers if necessary for capacity.
But as we learn more about a job through optional hints the user can provide,
we can optimize the cost of the resource allocation.
And here are the hints that Pocket currently supports.
The first is a latency sensitivity hint.
So this is just a Boolean true or false for now. We realized
that if a job is latency sensitive, it benefits from using the DRAM tier. So this is going
to be a remote access from a Lambda client to a storage server using DRAM. But we realized
that if the job is more throughput intensive and not so sensitive to latency,
we get equivalent performance using NVMe flash.
The bottleneck ends up being network bandwidth on the VM,
or the nodes that we're using for the storage servers, as opposed to the bandwidth of the actual storage technology in the case of NVMe.
Another hint that we support is the maximum number of concurrent workers.
So this allows us to compute a less conservative estimate for how much throughput we need.
On a particular serverless platform, each lambda typically has a network bandwidth limit.
On Amazon, we've seen roughly around 600 megabits per second.
So these are relatively wimpy workers.
But the point is you would be running a job with
many of them in parallel.
But if we know how many concurrent lambdas to expect and we know the throughput limit
per lambda, we can compute what is the peak throughput we need to provision for the job.
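This provisioning arithmetic is simple enough to write down. A minimal sketch, using the ~600 Mb/s per-lambda limit the talk reports as a conservative rule of thumb:

```python
# Back-of-envelope throughput allocation from the concurrency hint:
# peak throughput to provision = max concurrent lambdas x per-lambda
# network limit (~600 Mb/s per lambda, per the talk's AWS measurements).
PER_LAMBDA_MBPS = 600  # megabits per second, observed per-lambda limit

def provision_throughput_gbps(max_concurrent_lambdas):
    """Peak aggregate throughput to provision, in gigabytes per second."""
    total_megabits = max_concurrent_lambdas * PER_LAMBDA_MBPS
    return total_megabits / 8 / 1000  # Mb/s -> MB/s -> GB/s

# 500 concurrent lambdas -> 500 * 600 Mb/s = 300 Gb/s = 37.5 GB/s.
needed = provision_throughput_gbps(500)
```

Note this is an upper bound: as the Q&A later in the talk notes, concurrent lambdas often see less than 600 Mb/s each, so this estimate is conservative by design.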
And then we also allow the user to directly tell us if they know what is the peak throughput
required and what is the total ephemeral capacity
required. They might have this information if they run this job before and profiled it.
So once we have an allocation in terms of throughput capacity and the choice of the
storage media, the next step is to actually do a resource assignment, which means select
which servers, which storage servers in the cluster are going to be providing the resources.
And for this, Pocket uses an online bin packing algorithm.
So it's trying to fill up existing storage servers in the cluster
with the throughput and capacity requirements on the particular storage tiers
and only launching new nodes if the requirement cannot be satisfied by sharing existing nodes between jobs.
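The online bin-packing idea can be sketched as a first-fit pass over existing servers' spare throughput, launching new nodes only for whatever demand remains. This is a simplification to a single throughput dimension; the real policy also packs capacity, per storage tier, and the names are illustrative.

```python
# First-fit online bin packing for a job's throughput demand (Mb/s).
# Existing servers' spare capacity is consumed first; new nodes (each
# contributing node_mbps) are launched only for the remainder.
def assign_job(demand_mbps, servers, node_mbps=1000):
    """servers: spare Mb/s per existing node. Mutates and returns both."""
    assignment = []          # (server index, Mb/s taken from it)
    remaining = demand_mbps
    for i, spare in enumerate(servers):
        if remaining == 0:
            break
        take = min(spare, remaining)
        if take > 0:
            assignment.append((i, take))
            servers[i] -= take
            remaining -= take
    while remaining > 0:     # launch new nodes for whatever is left
        take = min(node_mbps, remaining)
        servers.append(node_mbps - take)
        assignment.append((len(servers) - 1, take))
        remaining -= take
    return assignment, servers

spare = [300, 0, 500]                  # Mb/s free on existing nodes
plan, spare = assign_job(1200, spare)  # a 1.2 Gb/s job demand
```

Here the 1.2 Gb/s demand soaks up the 300 and 500 Mb/s of spare throughput on existing nodes, and one new node is launched for the remaining 400 Mb/s.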
So this information gets recorded in what we call a job weight map.
So for example, the controller can decide that job A's data
is going to be spread across servers C and D.
So this, for example, could be a latency-sensitive application
that wants to exclusively use the DRAM tier.
And 40% of its data can be going to server C, and 60%
will be going to server D. So these choices will depend on
what is the current load on these systems, because the
controller is monitoring resource utilization.
And for each job, we will have a weight map.
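The weight-map steering in this example (40% of a job's writes to server C, 60% to server D) can be sketched with a deterministic round-robin schedule derived from the weights. The schedule-based router here is an illustrative stand-in for whatever the metadata servers do internally.

```python
import itertools

# Sketch of write-request steering through a per-job weight map.
# A repeating schedule expanded from the weights stands in for the real
# metadata servers' routing logic.
def make_router(weight_map):
    """weight_map: {server: fractional weight}. Returns a zero-argument
    function that yields the next server to write to."""
    schedule = []
    scale = 10  # resolution of the weights
    for server, weight in weight_map.items():
        schedule.extend([server] * round(weight * scale))
    counter = itertools.count()
    return lambda: schedule[next(counter) % len(schedule)]

# Job A's weight map from the talk's example: 40% to C, 60% to D.
route = make_router({"server-C": 0.4, "server-D": 0.6})
first_ten = [route() for _ in range(10)]
```

Over every ten writes, four land on server C and six on server D, matching the weights the controller chose based on current load.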
Now, in addition to selecting a resource allocation and
assignment for a job upfront when it registers,
the controller is always monitoring the state of the cluster.
And so it's looking at CPU, network bandwidth, and storage capacity utilization.
The policy for scaling the resources in the cluster up and down tries to keep utilization within a target range across these three dimensions of CPU, network bandwidth, and capacity; the range can be empirically tuned.
And in our deployment, we're using 60% as a lower end and
80% as the higher end.
So more concretely, the controller will scale up the
cluster by adding a node if any of
CPU utilization, network bandwidth, or storage capacity for a tier are being utilized above
80%.
The controller will scale down the cluster, meaning take down a node, if all of CPU, network, and storage capacity utilization for a particular tier
are being utilized under 60%.
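The policy as stated is a simple any/all rule, which can be written directly (a minimal sketch; the function name and single-tier signature are illustrative):

```python
# The scaling policy from the talk: add a node if ANY of CPU, network,
# or a tier's capacity utilization exceeds the upper target (80%);
# remove a node only if ALL of them sit below the lower target (60%).
LOW, HIGH = 0.60, 0.80

def scaling_decision(cpu, net, capacity):
    """cpu/net/capacity: utilization fractions in [0, 1] for one tier."""
    utilizations = (cpu, net, capacity)
    if any(u > HIGH for u in utilizations):
        return "scale-up"
    if all(u < LOW for u in utilizations):
        return "scale-down"
    return "steady"

d1 = scaling_decision(cpu=0.50, net=0.85, capacity=0.40)  # net over 80%
d2 = scaling_decision(cpu=0.30, net=0.55, capacity=0.20)  # all under 60%
d3 = scaling_decision(cpu=0.70, net=0.55, capacity=0.20)  # in between
```

The asymmetry (any-dimension to scale up, all-dimensions to scale down) biases the controller toward having enough resources, with the dead band between 60% and 80% preventing oscillation.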
And the actual mechanism that's used to balance load as the size of the cluster changes
is all done through this job weight map.
So here we're leveraging and relying on the fact that this ephemeral data is short-lived.
And so whatever is currently in the cluster is likely going to expire soon and be garbage collected.
So we want to avoid migrating data, because for data with such short lifetimes, migration presents high overhead.
And so for this reason, we focus on steering data for incoming jobs in order to balance load in the cluster.
So for example, if we take the four storage servers that we had and we decide that we don't need as many DRAM resources
or we can take down storage server D, what we're going to do is we're going to steer data for new incoming jobs
away from server D and onto the other storage servers, such as B
and C, so their resource utilization will increase. And if we're not steering any new
data to server D, eventually its data and its resource usage will decrease. And at this
point, we can remove that node from the cluster.
Okay. So I'll talk next about how we actually implement Pocket.
We leverage several open source code bases.
So the storage and metadata server implementation of Pocket
is based on the Apache Crail distributed storage system,
which I'll talk about next.
And to implement a fast NVMe flash storage server tier,
we use ReFlex. This is a software system that has been a key part of my thesis at Stanford; it achieves fast access to NVMe Flash over commodity Ethernet networks, and it's a purely software-based solution. Then we run the actual Pocket metadata and storage servers inside containers, which the controller orchestrates using Kubernetes.

The Apache Crail distributed storage system targets ephemeral data storage in distributed data processing frameworks. Think of Spark, for example: Crail has a plug-in that lets applications get the raw performance of the hardware they're using.
And Crail was actually originally
designed to target RDMA networks, which
have particularly low latency and enforce high performance
requirements for the software.
Now, in a public cloud environment,
RDMA networks are not readily available.
But Crail has a modular and pluggable architecture, so you can add and swap in your own network processing and storage tiers. And so we plug in ReFlex for the NVMe flash storage tier. ReFlex provides low latency and
high throughput while consuming low compute resources. And for example, with a single
core running ReFlex, you can serve 850,000 I/O operations per second. So this is around 11 times higher than what we measured with a traditional Linux storage stack using libaio, for example.
The way ReFlex achieves this kind of performance
is by doing direct access to NVMe and network queues from user space.
So we're leveraging libraries like Intel's DPDK and SPDK.
We use a polling-based execution model that runs to completion.
So when we pick up a packet from the receive queue,
we continue processing it until we submit the request to flash storage.
And so this allows us to optimize for data cache locality.
And then we're avoiding data copies by leveraging the DMA engines
on network and NVMe flash devices to forward
data directly between these. So we're integrating the network and storage processing. And then
we use adaptive batching to optimize instruction cache locality. So if we're operating at high
load and when we're doing our polling loop, we see that there's many requests in the receive
queue, we can run our execution loop with multiple requests
at each stage.
We call it adaptive batching because if we're
operating at low load, we don't have
to wait to accumulate a batch.
And so we don't hurt latency, but in the cases of high load,
we can achieve high throughput.
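The adaptive-batching idea, drain whatever is currently queued rather than waiting to fill a fixed-size batch, can be sketched in a few lines. This is a conceptual illustration only; the real ReFlex loop is a user-space C dataplane over DPDK/SPDK queues, and the names here are hypothetical.

```python
from collections import deque

# Sketch of an adaptive-batching polling loop: each iteration takes
# everything currently in the receive queue (up to a cap) and runs each
# request to completion. At low load batches are size 1, adding no
# latency; at high load they grow, improving instruction-cache locality.
def poll_loop(receive_queue, submit_to_flash, max_batch=64):
    """Drain the queue in adaptive batches; returns the batch sizes seen."""
    batch_sizes = []
    while receive_queue:
        # The batch is whatever load produced, never something we wait
        # to accumulate.
        batch = [receive_queue.popleft()
                 for _ in range(min(len(receive_queue), max_batch))]
        batch_sizes.append(len(batch))
        for request in batch:           # run each request to completion
            submit_to_flash(request)
    return batch_sizes

completed = []
q = deque(["r1", "r2", "r3", "r4", "r5"])
sizes = poll_loop(q, completed.append, max_batch=4)
```

With five queued requests and a cap of four, the loop processes one batch of four and one of one, and every request reaches the flash-submission step.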
And another key aspect of ReFlex is
achieving predictable performance and quality of service
when multiple tenants are sharing a Flash device.
So this is important particularly for NVMe Flash because of the asymmetries of read and write performance.
So ReFlex is able to enforce throughput and tail latency SLOs,
as long as they're physically feasible with the performance of the device.
ReFlex can isolate the traffic between tenants to provide their tail latency and throughput
service level objectives. And this code is open source, so you are welcome to check it
out on GitHub. So, Pocket is a system that we envision a cloud provider managing and providing.
But in order to evaluate and implement Pocket, we deploy it on EC2 nodes in Amazon.
And here are a few EC2 instance types that we use for the different storage tiers and Pocket node types.
And then we're using AWS Lambda as our serverless platform.
And the applications we use are the distributed compilation, MapReduce sort, and video analytics, which you saw earlier in this talk.
So the first thing we look at is Pocket's latency. And we compare Pocket to two existing
systems. First, S3, which is Amazon's object store. It provides serverless storage. So
users pay for the capacity and bandwidth that they need or that they use,
and they don't manage their own resources. A disadvantage of S3 is going to be performance,
particularly latency, and there's particular overhead for small objects. So here we're
doing a one kilobyte access from an AWS Lambda client. And we see that S3 takes 12 to 25 milliseconds,
depending on whether it's a put or get request.
The other system we consider is Redis.
So this is a DRAM-based key value store.
And Redis achieves significantly higher performance.
Notice it's around 230 microseconds.
Considering this is DRAM, this might sound high,
but that is the network overhead from the Lambda client.
And the disadvantages of Redis are high cost,
because we're exclusively using DRAM here, and also the fact
that users have to spin up their own storage cluster,
and scale, and manage this cluster for the purposes
of their ephemeral data sharing.
So this is a burden on users. And what we see is that Pocket with the different storage tiers, particularly the DRAM and NVMe flash tiers, offers similar performance, with the main advantage of automatically managed resources. Now, the reason why Pocket's DRAM tier is
higher latency than Redis is because Pocket requires a metadata lookup.
So in a Redis cluster, keys are hashed to particular storage nodes and there's no explicit
metadata lookup.
So this is a tradeoff, but the metadata plane and its lookup are essential for Pocket's flexible data placement and resource allocation policies.
As the cluster size changes, a key aspect of the system is its elasticity. If you were relying on hashing, you would have redistribution of keys across servers, whereas in Pocket, a job's data placement is encoded in weight maps that don't change when storage servers enter and leave the cluster.
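The contrast can be made concrete with a toy comparison: under modulo/hash placement a key's home server depends on cluster size, so adding a node relocates existing keys, while a weight map recorded at job registration keeps naming the same servers. The hash function and server names below are illustrative simplifications (real Redis uses hash slots, not plain modulo).

```python
# Modulo placement (hash-partitioned store, simplified): the home server
# is a function of cluster size, so growing the cluster remaps keys.
def hash_home(key, num_servers):
    return sum(key.encode()) % num_servers

keys = ["obj-%d" % i for i in range(100)]
before = {k: hash_home(k, 4) for k in keys}
after = {k: hash_home(k, 5) for k in keys}          # one node added
moved = sum(before[k] != after[k] for k in keys)    # keys that relocate

# Weight-map placement: the map fixed at registration still names the
# same servers no matter what else joins or leaves the cluster.
weight_map = {"server-C": 0.4, "server-D": 0.6}     # set at registration
still_valid = set(weight_map) <= {"server-A", "server-B", "server-C",
                                  "server-D", "server-E"}
```

Under modulo placement a large fraction of keys relocate when the node count changes, which is exactly the data movement Pocket's weight maps avoid for short-lived data.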
The next thing we look at is throughput scaling. So of course, if we increase the number of nodes
in our cluster for Pocket and Redis, we're going to see higher and higher throughput. So we compare
what is the throughput per node and how do they scale for the different Pocket tiers. And
we compare this against S3 and Redis. So the S3 line is flat because we don't actually
control the number of nodes in S3. The throughput we get is what it is. And what we see here
is that with two of these i3.2xlarge nodes that we're using for Pocket nodes,
we have higher throughput than S3.
So we're roughly getting 1 gigabyte per second per node.
And that is the network bandwidth of the VM.
Is there a question?
You're saying you get 1.5 gigabytes per second from S3?
That's like off by three orders of magnitude.
These are for one... Okay, I should say what the test is also.
So we're issuing 100 concurrent lambdas,
and each of them is issuing one megabyte request in a loop.
So earlier you said that there were 600 megabits per second,
and so I did the proof math there,
and it was like 80 megabytes a second. Is that the typical AWS lambda throughput So what we do notice is that when we run multiple lambdas concurrently, the per lambda network
bandwidth kind of diminishes because I think they're being put onto the same physical nodes.
And so that 600 megabits is kind of measured with maybe less
concurrent lambdas.
Okay. And to answer my question: the S3 throughput is across multiple, let's say, 100 instances?
Mm-hmm.
Got it.
Right. The reason why we see a tapering off here is that with these 100 concurrent lambdas, we're reaching the per-lambda network bandwidth limit. We also distinguish between NVMe flash and regular SATA/SAS-based SSDs, which are offered on older-generation Amazon instances. Both the disk tier and the SATA/SAS SSD storage provide lower throughput.
The solid line here is particularly
low due to low network bandwidth on these old generation
instances.
And that's why we have in the dotted line
using a larger instance that doesn't bottleneck
on the network bandwidth, but instead bottlenecks
on the actual SSD bandwidth.
So for the rest of the results, we're going to focus on Pocket's DRAM and NVMe tiers,
because these really stress the performance we need from the software.
And they're much more cost-effective on the Amazon infrastructure.
So next we look at how Pocket's right-sizing with hints helps us save on cost. Our baseline here is the default allocation, where we use 50 i3 nodes and provision 50 gigabytes per second of storage throughput, which is likely overkill for a typical job, but it's because we're being conservative.
And we look at how, as we learn more about an application, we can save on cost for the video analytics and the distributed compilation jobs.
So if we learn the number of concurrent lambdas,
and we provision using the conservative 600 megabits per second per lambda as the rule of thumb,
we can reduce cost as shown by the red bars here.
If we also learn that these jobs are not sensitive to latency,
they're actually all throughput intensive. The distributed compilation is also compute intensive, so
it is not sensitive to the latency of the storage system as much. We get further reductions
in cost. And then the green bars show what if the applications told us directly the throughput
and capacity requirements that they need.
Next, we look at how Pocket right-sizes resources as multiple jobs enter and leave the system.
So here on the x-axis, we're plotting time, and we have up arrows indicating when a job has arrived or registered with the system, and a down arrow indicating when the job deregisters.
In the orange line, we're plotting the aggregate throughput that we observe from pocket storage
servers.
And the dotted blue line shows the throughput that the controller has provisioned with the
rule of thumb that each of these nodes gives us one gigabyte per second.
That is roughly the throughput that we see these nodes providing.
Occasionally we see bursts, and that is what's happening here, with a job able to achieve higher throughput than what the controller believes has been provisioned. But the takeaway here
is that the controller is elastically scaling resources and adjusting as jobs
are entering and leaving the system. And this is all done automatically without
users explicitly saying how many nodes they need.
Now, how does Pocket affect overall execution time compared to using a system like S3 or Redis?
So we run a 100-gigabyte sort job,
and we're always storing the original input data and the final output data in S3 for this test
because that's longer-term data. The time spent on shuffled data, the ephemeral data I/O,
is what's shown per lambda in the blue bars. And for the ephemeral data storage
system, we use either S3, Redis, or Pocket. And the three sets of bars here are running
this job with 250, 500, and 1,000 concurrent workers.
So the first observation is that if we're using 250 lambdas, for example,
S3 does not provide sufficient throughput for this job
because we can see that by provisioning higher throughput for Redis and Pocket NVMe,
the execution time per lambda in this stage can decrease.
You'll notice that we're missing the S3 data points for the 500 and 1,000 tests, and this is because
we actually got request rate limit errors telling us to exponentially back off
on the rate of requests we were issuing.
The other observation here is that for Redis, we're using DRAM,
but for Pocket, we're using NVMe flash.
And as I mentioned before, because this is a throughput-intensive application,
we see similar performance using Pocket NVMe and Redis.
The last thing I want to go into is a cost analysis, which
is a bit tricky to do because it's
tough to do an apples-to-apples comparison.
But we envision that Pocket would be a system provided with a pay-what-you-use cost model to match the serverless abstraction.
So users would pay for the capacity and bandwidth that is used for their job.
And so what we do is we calculate fine-grained resource costs based on EC2 prices.
So we compute what is the approximate cost of NVMe flash per gigabyte, DRAM per gigabyte,
and the cost of a core. And we compare the cost of running each of these jobs, the sort,
video analytics, and distributed compilation,
using S3, Redis, and Pocket.
And with Redis, we assume that since users are individually spinning up these clusters,
they have to pay for the entire VM,
not just the fraction of resources that they used on it.
And so Pocket, by leveraging hints from users,
so for example using NVMe flash instead of DRAM when appropriate,
and by providing this kind of pay-what-you-use cost model,
reduces cost by roughly 60%, we find, for all three of these jobs.
You'll notice that S3 is still much cheaper.
The cost comparison here is not quite a fair one, though,
because our prices for Redis and Pocket are based on user-facing EC2 resource costs, whereas S3's pricing is
based on whatever Amazon is actually paying for those resources, and this is amortized
across the many, many jobs that they're running. So that explains the discrepancy there.
But it is true that one of the main draws of S3
is how cost-effective it is.
But another thing to keep in mind
is that it's not going to give
quite the performance expected for these applications.
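The pay-what-you-use accounting versus whole-VM pricing described above could be sketched as follows. The dollar figures are made up for illustration and are not real EC2 or S3 prices:

```python
# Illustrative per-resource prices in $/unit-hour; NOT real EC2 prices.
PRICE = {"dram_gb": 0.005, "nvme_gb": 0.0005, "core": 0.02}

def pocket_cost(gb_used, cores_used, tier, hours):
    """Pay-what-you-use: bill only the capacity and cores the job consumed."""
    return gb_used * PRICE[tier + "_gb"] * hours + cores_used * PRICE["core"] * hours

def whole_vm_cost(vm_gb, vm_cores, hours, tier="dram"):
    """Redis-style: the user pays for the entire VM for the whole duration."""
    return vm_gb * PRICE[tier + "_gb"] * hours + vm_cores * PRICE["core"] * hours

# A job using a fraction of a flash node versus renting a whole DRAM VM:
p = pocket_cost(gb_used=100, cores_used=2, tier="nvme", hours=0.1)
v = whole_vm_cost(vm_gb=64, vm_cores=8, hours=0.1)
```

The gap between the two is the fraction of the VM the user paid for but never used, which is what the pay-what-you-use model eliminates.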
Can I ask you a question?
Yes.
So you have costs for doing things this way, right? Yes. So more broadly, you're comparing running a job on serverless
versus running a job on VMs. I mean, I think it's really cool that you showed you made serverless cheaper with standard servers. I think that's a great outcome.
But now, having made the case for serverless, how much cheaper is serverless?
Okay, so the question is how much cheaper is serverless than doing it the traditional
way?
I don't have actual numbers to answer that question.
Some of my colleagues at Stanford actually have been doing work on this, and they've found that
you don't necessarily win on cost currently for some of
the jobs that they're running. What you gain is performance, if your job can benefit
from this massive parallelism. It depends on a lot, like the current
pricing of things, of course. There are many caveats, but I think currently the
main reasons to switch are user convenience, not having to manage these resources;
that it's good for very short jobs, because the time granularity of the billing is very fine;
and performance, if your job can benefit from the very high parallelism.
Yes?
I remember from the architecture overview, you have one management node, I think,
the controller.
The controller, mm-hmm.
But doesn't that limit scalability?
How many tasks can the system actually handle? Sure.
Okay, so you're asking about the number of jobs...
So a metadata server can handle
about 195,000 I/O operations per second.
So it depends on how many
metadata operations you're doing per job.
The controller itself,
I don't have a number of how many jobs it can register.
But that would be interesting to measure.
The solution that you tested,
is that scalable,
where you could have multiple controllers
and multiple data stores? Yes. So all three planes are independently scalable.
And so you can scale metadata servers, storage servers.
The controller is logically centralized,
but it could also be scaled up and distributed. Are there any commercial applications for this?
So we would love to have some.
Yeah, so this is really early stage.
So this is the first talk that I'm giving about this in public.
So we've been working on this for about a year,
and we're excited to talk about use cases of this for industry applications.
So if you have ideas, I would be happy to hear.
Okay.
Yes.
You mentioned using hints to optimize resource allocation.
Is that during the registration, or is there a point during the course of the job? Okay, so the question is regarding hints
and when they get passed.
Regarding job attributes,
those hints are all passed during job registration.
So the allocation gets decided at the beginning of the job.
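A registration call carrying these hints might look roughly like the following. The API shape, parameter names, and defaults here are guesses for illustration (the real interface is in the Pocket repo), not the actual signature:

```python
def register_job(name, num_lambdas=None, latency_sensitive=False,
                 capacity_gb=None, peak_mbps=None):
    """Decide the storage tier and allocation once, at registration time."""
    # Throughput-intensive jobs tolerate flash; latency-sensitive ones get DRAM.
    tier = "dram" if latency_sensitive else "nvme"
    if peak_mbps is None:  # fall back to the 600 Mb/s-per-lambda rule of thumb
        peak_mbps = (num_lambdas or 1) * 600 / 8
    return {"job": name, "tier": tier,
            "throughput_mbps": peak_mbps, "capacity_gb": capacity_gb or 100}

alloc = register_job("sort100gb", num_lambdas=250)
```

Since all hints arrive at registration, the allocation is fixed for the lifetime of the job rather than adjusted mid-run.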
You deployed this mostly in the cloud, it looks like.
Have you tried local deployments, and particularly
with your reflex RDMA infrastructure?
So we were specifically targeting the cloud
environment because of the serverless tasks.
So we haven't tried locally, but the system can be used more broadly.
And we've looked at, for example, running Spark
and using the Spark plug-in for Crail and hence Pocket as well.
So that could also be interesting.
How would the advantages shift in such an environment?
I'm curious if you've explored it yet.
We haven't explored it yet,
but one thing I can kind of hypothesize
is that the use of the DRAM tier, I think,
would become much more beneficial
than we see in the serverless environment
because the network overhead in AWS Lambda
currently is very high.
So the DRAM tier loses its appeal
in terms of the low latency.
Yeah.
So you talked about passing on the hints
at the time of registering a job
and then at the time of deregistering the job,
delete all the data.
I assume the deletion happens just by deleting the files.
Right.
You know, most likely.
Is there, you know, have you explored better ways to do this, as in passing the allocation
hints down to the lower storage layers, or, at the time of deregistration, doing
something other than just deleting the file?
So currently, the metadata server
that is managing the data for that job
deletes it.
So it doesn't have to actually touch the storage servers.
I mean, for security, you might need to propagate that
down to the storage.
But currently, it's just at the metadata server
that the data gets deleted.
And that data's blocks are then considered free blocks,
and they can be populated by new jobs.
Yeah.
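The metadata-only deletion path just described might be sketched like this. The class and method names are hypothetical, a toy model rather than Pocket's implementation:

```python
class MetadataServer:
    """Toy model: deregistration frees blocks without touching storage nodes."""

    def __init__(self, total_blocks):
        self.free_blocks = set(range(total_blocks))
        self.job_blocks = {}  # job name -> set of block ids it holds

    def allocate(self, job, n):
        """Hand n free blocks to a job's ephemeral data."""
        blocks = {self.free_blocks.pop() for _ in range(n)}
        self.job_blocks.setdefault(job, set()).update(blocks)
        return blocks

    def deregister(self, job):
        # Data is not overwritten; its blocks simply become reusable
        # and can be populated by new jobs.
        self.free_blocks |= self.job_blocks.pop(job, set())

md = MetadataServer(total_blocks=8)
md.allocate("jobA", 3)
md.deregister("jobA")
```

Since the data is never scrubbed, propagating the delete down to the storage tier, as raised in the question, would matter mainly for security.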
Yes.
So does your controller do any analytics at all?
In terms of, so the customer,
oh, sorry, the user passes the hints,
but they could be overestimated
or underestimated, for example.
Yeah.
And based on your previous experience with the job,
you adjust those hints.
Yeah.
Yeah, so that's something we would like to do as future work.
Currently, the controller assumes that the user has a good idea
and provides these hints.
But especially, yeah, so there's a great opportunity
to learn across invocations.
So, for example, on Amazon Lambda, you register this job,
and so you can track every time it's executed.
You can keep track of what you've learned about the resource allocation.
So that's something we're hoping to explore.
I guess I can talk about...
So the other opportunities for future work that we see
are harvesting Slack resources in the data center to run Pocket.
So this is ephemeral data that we're managing.
It has short lifetime.
And we know that in data centers we have many resources sitting idle.
So we see an opportunity to try and harvest those.
In this case, we need Pocket to be able to deal efficiently with resource revocation
where suddenly these resources are needed,
and so maybe we need some better fault tolerance in this case,
or at least some way of handling when data is no longer accessible because those resources that used to be idle need to be reclaimed.
And then although we've talked about Pocket here in the context of serverless analytics,
more broadly we think of it as a distributed /tmp.
And so we're interested in exploring other
use cases for this.
And I'll put up this slide here.
So there's a repo here on GitHub, so I encourage you to
check it out.
Any other questions?
We'll also have a paper appearing at the OSDI
conference in a couple of weeks.
All right, thank you.
thanks for listening
if you have questions about the material presented in this podcast
be sure and join our developers mailing list
by sending an email to
developers-subscribe
at snea.org.
Here you can ask questions and discuss this topic further
with your peers in the storage developer community.
For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.