Storage Developer Conference - #143: Deep Compression at Inline Speed for All-Flash Array
Episode Date: March 31, 2021...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast, episode 143.

Good morning, good afternoon, and good evening. No matter which time zone you are in, welcome to my virtual talk today about deep compression at inline speed for all-flash array.
We all love all-flash array for its consistent high throughput and low latency. This enables us to consolidate multiple workloads into a single unified storage platform. It is easy to manage and share big data among different applications. However, all-flash array still comes with a cost premium at acquisition. So in this talk, I'm going to share with you how to lower the cost of all-flash array by up to 30% with deep compression at inline speed.
I'll share with you our hardware platform, which is a programmable FPGA accelerator card. I'll talk about our simple software API and then dive a little bit into the wonderful world of hardware architectural design on FPGA devices. I'll also talk about our compression algorithm development. And lastly, I'll show some of the benchmarks.
Now, first, a few words about Pure Storage. Pure Storage started in 2009 as one of the pioneers of all-flash arrays for enterprise storage. Today, we have three major product lines. Front and center is FlashArray X, which is a performance-optimized, 100% NVMe block storage system. Recently, we just launched FlashArray C, which is capacity optimized and built on 100% QLC flash. And last but not least is the FlashBlade system. Last year, our annual revenue was $1.6 billion. Customers love our all-flash systems for their high performance and seven nines of reliability. We are very proud of our certified NPS score of 82.
Now, back to compression. Not all compression is created equal. For data storage systems, the compression algorithms that we are interested in are lossless data compression algorithms.
We can further divide these algorithms into two categories. The first one is inline compression. Inline compression is pretty fast: in general, an x86 CPU can compress data at half a gigabyte per second, or even faster. Some of the good inline compression algorithms are LZO, LZ4, and Zstandard in its faster modes. The other category is deep compression algorithms. Deep compression algorithms typically compress about 30% better than inline compression, but they run quite a bit slower. In fact, in general, deep compression is about 10x slower than inline compression. Some of the examples are Zstandard at its higher levels and Mermaid.
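To make that fast-versus-deep trade-off concrete, here is a minimal sketch that compresses the same 32-kilobyte block with Zstandard at a low level and a high level, standing in for an inline and a deep compressor respectively. The synthetic data, the chosen levels, and the use of libzstd are illustrative assumptions, not the benchmark setup used in this talk.

```c
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>   /* assumes libzstd is installed; link with -lzstd */

/* Compare a fast ("inline-like") level against a high ("deep-like") level
 * on one block of mildly compressible synthetic data. */
int main(void)
{
    enum { BLOCK = 32 * 1024 };
    char src[BLOCK];
    for (int i = 0; i < BLOCK; i++)                  /* synthetic test data */
        src[i] = (char)('A' + (i % 7) + (i / 512) % 3);

    size_t cap = ZSTD_compressBound(BLOCK);
    void  *dst = malloc(cap);
    if (!dst)
        return 1;

    int levels[] = { 1, 19 };                        /* fast vs. deep level */
    for (int i = 0; i < 2; i++) {
        size_t out = ZSTD_compress(dst, cap, src, BLOCK, levels[i]);
        if (ZSTD_isError(out)) { fprintf(stderr, "compress failed\n"); return 1; }
        printf("level %2d: %5zu -> %5zu bytes (ratio %.2f)\n",
               levels[i], (size_t)BLOCK, out, (double)BLOCK / out);
    }
    free(dst);
    return 0;
}
```

On real storage workloads, the talk quotes roughly a 30% ratio gain and roughly a 10x speed penalty for deep compression; the exact numbers from synthetic data like this will of course differ.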
We have thousands of flash arrays deployed at our customers' sites, and through our data analysis, we have discovered that not all data gets an equal chance at compression. In the flash systems we design at Pure Storage, we have had compression in mind from day one. That's why in our systems, 100% of data gets inline compressed. However, deep compression is still opportunistic. Typically, it is done at garbage collection time, and a significant amount of data never actually gets the chance to be deeply compressed. So how do we lower the cost of all-flash array by up to 30%?
The idea is actually really simple and straightforward. Today, we take the ingested data and perform inline compression with algorithms like LZO before we write to NVRAM. Then, after the data has been persisted to flash, we do opportunistic deep compression at garbage collection time. So what if we change that picture a little bit, so that we actually perform deep compression inline, before we write the data into NVRAM? If we are able to do that, then we pretty much guarantee that one hundred percent of our data is deeply compressed. And because deep compression typically compresses 30 percent better than inline compression, we can have a dollar-per-gigabyte savings of up to 30 percent. So what is a good hardware platform to be able to deeply compress data at inline speed?
We have found that a programmable FPGA accelerator card is a good choice. It is not only able to achieve the target of being 10x faster than the CPU for deep compression, it is also power efficient. It also has very low NRE for hardware development. And what is also very important for us is that it's a programmable platform that enables us to continuously improve the algorithm, and also enables us to explore other offload opportunities besides compression.
So why not an ASIC? There are actually a few off-the-shelf ASICs available to offload compression. But in general, they are all fixed to the gzip or zlib format, and by our standards, they are suboptimal in both compression speed and compression ratio. Some of the vendors have Zstandard on their roadmap. Zstandard is better: it decodes fast, but not necessarily the fastest. But these chips that promise Zstandard support are still years away from being commercially available.
Now, the second choice: what about a custom in-house ASIC? That'd be a fun project. I have done quite a few leading-edge networking chips in the past, and that was fun. So next time, when I ask my boss and get approval for about $100 million of budget and increase the team size by about 10x, I would definitely do that as the next fun project. Just kidding. Actually, even if you get the resources to do an ASIC, one of its drawbacks is that it always lags years behind in algorithm development.
And if you are a company that sells flash in the billions of dollars, then every 1% of compression ratio improvement
is tens of millions of dollars in savings each year.
So let's look at our Pure accelerator card. This is an illustration of our hardware compression offload card. It is a standard half-height, half-length PCIe adapter. Really, the only major component on this card is a mid-range FPGA device. It is currently capable of PCIe Gen 3 x8, which amounts to about 6 gigabytes per second of throughput.
On the software side, we have a really lightweight VFIO driver in user space. One of the things that I have learned in the past doing hardware offload is that it's really helpful if we can keep the hardware API as simple and as stupid as possible. If an offload project requires a team of software engineers to work on compiler design, then we are in big trouble.
On the other hand, if we just replace a commonly used software function with a hardware function call, then it is so much easier. A software compression function takes a data block stored in a source buffer, compresses the data, and outputs it in a destination buffer. So we keep exactly the same function API, but just replace the CPU computation with a few PCIe writes to notify the hardware about the source buffer address and the destination buffer address. Then we let the hardware do the work, and the software just polls and waits for the hardware to finish.
And to make things even simpler, we make sure in hardware that all compression requests are completed in strict FIFO order.
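To illustrate how simple such an API can stay, here is a minimal sketch of a hardware compression call that keeps the software function shape (source buffer in, destination buffer out) but drives the card through a few memory-mapped register writes and then polls for completion. The register layout, field names, and polling scheme are hypothetical assumptions for illustration, not Pure Storage's actual driver interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block exposed over PCIe BAR space. */
struct accel_regs {
    volatile uint64_t src_addr;   /* DMA address of the source buffer      */
    volatile uint64_t dst_addr;   /* DMA address of the destination buffer */
    volatile uint32_t src_len;    /* input block length in bytes           */
    volatile uint32_t doorbell;   /* write 1 to start the request          */
    volatile uint32_t done_len;   /* compressed length, 0 while busy       */
};

/* Same shape as the software compressor it replaces: the caller still
 * hands over a source buffer and gets compressed bytes in a destination
 * buffer, only the computation moves to the card. */
size_t hw_compress(struct accel_regs *regs,
                   uint64_t src_dma, uint32_t src_len,
                   uint64_t dst_dma)
{
    /* A few PCIe (MMIO) writes describe the request to the hardware. */
    regs->src_addr = src_dma;
    regs->dst_addr = dst_dma;
    regs->src_len  = src_len;
    regs->doorbell = 1;

    /* Software simply polls and waits for the hardware to finish. */
    while (regs->done_len == 0)
        ;                          /* spin (a real driver would bound this) */

    return regs->done_len;
}
```

Because the card completes requests in strict FIFO order, a multi-request version of this only needs a simple ring of outstanding descriptors rather than any completion-matching logic.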
Now we have chosen a hardware platform, and we have picked a really simple API. So the next thing that we need to decide is what kind of compression algorithm we are going to implement in hardware.
If you look at the many, many different flavors of compression algorithms available on the market today, they are generally all based on two fundamental techniques. The first one is LZ lossless data compression, which was published in a paper in 1977. The other technique is entropy encoding, and the most famous form of that is Huffman encoding, which was published in a paper in 1952. However, compression algorithms are still being actively developed, and different algorithms differentiate on the quality of the LZ search and the output encoding. Some of the good open-source examples are gzip and zlib, and recently LZ4 and Zstandard have been really popular.
On the closed-source side, we have LZO for inline compression and Mermaid for deep compression.
So which algorithm do we actually pick to implement in hardware? Before we decide on that, let's list the goals for our compression. First, what we are looking for in our all-flash array is block compression. The block size is typically less than 64 kilobytes, and in most cases actually less than 32 kilobytes, because we do not want to sacrifice write latency.
And the second thing that is super important for us
is that the algorithm has to be able
to support fast decoding in software.
This is super critical because all our systems support non-disruptive upgrades. Any data that is written into our flash array can live forever, and we have to guarantee that any future hardware upgrade can still read the same data back at fast speed.
And lastly, we of course want the best possible compression ratio for real workloads. So with all these three goals, we ended up deciding to take the hard path and develop an in-house algorithm for compression. It is not easy, but at least we have a vast amount of data available across our tens of thousands of flash arrays to train our algorithm and benchmark it against. Now, on to the actual hardware design for compression on an FPGA device.
So how can a mid-range FPGA device, which typically runs at only a few hundred megahertz, be 10x faster than a CPU that can run at multiple gigahertz? Well, really, the key is the freedom in architectural and logical design with programmable hardware.
If you look at a general-purpose CPU, it has become more and more complex. But fundamentally, its architecture is still based on the same von Neumann architecture that was described in 1945. It is a memory load-and-store architecture with a limited number of execution units. With an FPGA device and the freedom in logical design, we can exploit the inherent hardware parallelism. What is also really helpful in these devices is that they have thousands of SRAM blocks with an aggregate bandwidth in the neighborhood of terabytes per second. And I will show you an example of why that is super helpful.
Let's take an example of how we actually take advantage of the abundance of memory blocks on a mid-range FPGA device. The essence of LZ search is to find duplicate strings in the data stream, and one way to do that is with a hash table. At each byte position, you calculate the hash of the current byte string and update the hash table. You also want to look up the hash table with that hash to see if you have a possible match in the data that you have seen before. In our hardware design, our pipeline processes four byte positions on every clock cycle. That means we have four hash table lookups and four hash table updates per cycle, and at a frequency of about 300 MHz, this amounts to about 19 gigabytes per second of memory bandwidth.
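As a quick sanity check on those bandwidth figures, here is a back-of-the-envelope calculation; the 8-byte hash-table entry width is my assumption, since the talk does not state the entry size.

```c
#include <stdio.h>

int main(void)
{
    const double clock_hz    = 300e6;  /* ~300 MHz pipeline clock             */
    const int    accesses    = 4 + 4;  /* 4 lookups + 4 updates per cycle     */
    const int    entry_bytes = 8;      /* assumed table entry width           */

    double per_table = clock_hz * accesses * entry_bytes;        /* bytes/s */
    printf("one table:    %.1f GB/s\n", per_table / 1e9);        /* ~19 GB/s  */
    printf("two tables:   %.1f GB/s\n", 2 * per_table / 1e9);    /* ~38 GB/s  */
    printf("four engines: %.1f GB/s\n", 8 * per_table / 1e9);    /* ~153 GB/s */
    return 0;
}
```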
And of course, we don't have to stop at just one hash table per compression engine. If we have two hash tables, that doubles to 38 gigabytes per second. And we can easily put four compression engines into our FPGA device, which aggregates to 153 gigabytes per second of hash table memory bandwidth. A hash table with such high memory bandwidth can actually be fairly easily implemented on an FPGA device.
In this particular example, we just take 64 memory blocks, where each memory block can independently perform one read and one write on every clock cycle. What is needed is really just a simple write arbiter and read arbiter that distribute the four lookups and the four updates across the 64 memory blocks. And because the hash values used for the lookups and updates are essentially random numbers, the memory access conflict rate is actually fairly low, in the range of a few percent, and that does not really degrade our LZ search quality.
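Here is a small software model of that banked hash-table idea: the four positions processed per "cycle" are hashed, and the bits of each hash pick one of 64 independent memory blocks for the lookup and the update. The table sizes, the hash function, and storing position+1 to mark empty slots are illustrative choices, not the actual hardware design.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_BANKS    64            /* independent memory blocks             */
#define BANK_ENTRIES 256           /* entries per bank (illustrative)       */

static uint32_t table[NUM_BANKS][BANK_ENTRIES];   /* 0 = empty, else pos+1 */

static uint32_t hash4(const uint8_t *p)           /* hash of 4 input bytes */
{
    uint32_t v;
    memcpy(&v, p, 4);
    return (v * 2654435761u) >> 16;               /* multiplicative hash   */
}

int main(void)
{
    const uint8_t data[] = "abcabcabcabcabcabcabcabc";
    size_t n = sizeof(data) - 1;

    /* Process 4 byte positions per "cycle", as in the hardware pipeline.
     * In hardware, two lanes hitting the same bank in the same cycle would
     * conflict; this model simply serializes them. */
    for (size_t cycle_base = 0; cycle_base + 8 <= n; cycle_base += 4) {
        for (int lane = 0; lane < 4; lane++) {
            size_t   pos  = cycle_base + lane;
            uint32_t h    = hash4(&data[pos]);
            unsigned bank = h % NUM_BANKS;          /* which memory block   */
            unsigned idx  = (h / NUM_BANKS) % BANK_ENTRIES;

            uint32_t cand = table[bank][idx];       /* lookup: prior match? */
            if (cand != 0 && memcmp(&data[cand - 1], &data[pos], 4) == 0)
                printf("pos %zu: candidate match at pos %u\n",
                       pos, (unsigned)(cand - 1));

            table[bank][idx] = (uint32_t)pos + 1;   /* update the table     */
        }
    }
    return 0;
}
```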
In our hardware compression engine, we also do Huffman coding. The LZ search gives us the literals and the metadata sections. The literals are the data bytes for which we have not found any match, and the metadata sections describe the matches that we have found. For our Huffman encoding, we perform dynamic Huffman encoding for the literals, but we perform static Huffman encoding, with 64 static tables, for the metadata sections. Our static tables are trained on real workloads across our fleet. And in order to increase the Huffman decoding speed, we do multi-stream Huffman encoding for the metadata sections and the literals.
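To show why multiple streams help decode speed, here is a toy sketch that deals literals round-robin into four streams, each of which would then be entropy-coded independently, so a software decoder can work on several symbols at a time instead of waiting for the bit position of the previous symbol. The stream count, the round-robin split (rather than, say, contiguous segments), and the sample data are assumptions for illustration only.

```c
#include <stdio.h>
#include <string.h>

#define NUM_STREAMS 4

int main(void)
{
    const char *literals = "EXAMPLE LITERAL BYTES";
    size_t n = strlen(literals);

    char   streams[NUM_STREAMS][64] = {{0}};
    size_t len[NUM_STREAMS] = {0};

    /* Deal the literals across the streams round-robin. */
    for (size_t i = 0; i < n; i++) {
        int s = i % NUM_STREAMS;
        streams[s][len[s]++] = literals[i];
    }

    /* Each stream would now be Huffman-coded on its own; the decoder
     * restores the original order because the i-th literal always comes
     * from stream i % NUM_STREAMS. */
    for (int s = 0; s < NUM_STREAMS; s++)
        printf("stream %d: %.*s\n", s, (int)len[s], streams[s]);
    return 0;
}
```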
Now let's again dive a little bit into the hardware design for Huffman encoding. One of the critical paths for Huffman encoding is sorting the histogram. In the way we implement Huffman encoding, each symbol is an 8-bit byte, so for each data block we generate a histogram of 256 counts, and we need to sort those 256 numbers. A good sorting algorithm is merge sort, because it has a predictable n log n cost. But if you just implement it in hardware naively, the n log n of 256 numbers still gives you about 2,000 clock cycles. That is a very long latency.
So how do you speed that up? It's actually easy if you have the freedom to design your own hardware pipeline. Here is an example of how we actually implement our merge sort. We have a hardware pipeline of four stages. In the first stage, we have eight merge sort engines, each of which can sort 32 numbers. Once we have generated the 256 histogram numbers, we just distribute them to the eight merge sort engines, and they start sorting the 32 numbers each of them got, independently of each other. Once all eight engines have completed their work, we move to the second stage. The second stage is really simple: it's just a merge process. Imagine you already have two sorted lists of numbers; it's really simple to merge them into a single sorted list. In this case, we merge two lists of 32 numbers into a sorted list of 64 numbers. The same thing goes on in stage 3 and stage 4, and after stage 4 we have a sorted list of 256 numbers. Now, if we compute the latency of this hardware processing pipeline, it is 608 cycles, which is quite a bit of a speedup over the naive implementation of merge sort. But more importantly, because this is a pipelined design, the processing pipeline can actually sustain one sort every 256 cycles. This is really important for us to be able to sustain compression throughput, even for data block sizes of less than two kilobytes.
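Here is a small software model of that four-stage sort, assuming a 256-bucket histogram: eight independent sorts of 32 counts, followed by three rounds of pairwise merges (32+32, 64+64, 128+128). It models only the data flow; the 608-cycle latency and the one-sort-per-256-cycles throughput come from the hardware pipelining described above, not from this code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u32(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/* Merge two sorted runs a[0..na) and b[0..nb) into out. */
static void merge(const unsigned *a, size_t na,
                  const unsigned *b, size_t nb, unsigned *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

int main(void)
{
    unsigned hist[256], buf[256];
    for (int i = 0; i < 256; i++)
        hist[i] = (unsigned)rand() % 1000;        /* stand-in histogram */

    /* Stage 1: eight "engines" each sort 32 numbers independently. */
    for (int e = 0; e < 8; e++)
        qsort(&hist[e * 32], 32, sizeof(unsigned), cmp_u32);

    /* Stages 2-4: pairwise merges of sorted runs (32 -> 64 -> 128 -> 256). */
    for (size_t run = 32; run < 256; run *= 2) {
        for (size_t base = 0; base < 256; base += 2 * run)
            merge(&hist[base], run, &hist[base + run], run, &buf[base]);
        memcpy(hist, buf, sizeof(hist));
    }

    printf("sorted: min=%u max=%u\n", hist[0], hist[255]);
    return 0;
}
```

One plausible way to account for the 608 cycles, assuming this stage structure, is 32 x log2(32) = 160 cycles for the first stage plus 64 + 128 + 256 cycles for the three merge stages, with the final 256-cycle merge setting the one-sort-per-256-cycles throughput; the talk only gives the totals, so the per-stage split is my inference.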
Now let's take a step back and look at some of the micro-benchmarks. The first benchmark we run is against a block benchmark of data that is 32 kilobytes in size. We compare the performance of six different algorithms, and I'll describe each one. The first column here is the encoding speed, the second column is the decoding speed, and the third column is the compression ratio. The speed is expressed in nanoseconds per byte. Now, about the algorithms. Alitraxis 0 and Alitraxis 4 are our in-house developed software compression algorithms. LZO Pro is a commercially available inline compression algorithm that we use in some of our systems, and Mermaid 4 is the commercial deep compression algorithm that we also use in some of our systems. ARIA 0 and ARIA 3 are our hardware compression engines: ARIA 0 is the hardware version that performs inline compression, and ARIA 3 is really the interesting one, the hardware engine that does deep compression.
The first thing we can see from this benchmark is that with 32-kilobyte data blocks, our hardware inline compression is actually faster than the best inline compression, LZO Pro, running on an Intel CPU. Not only that, it also compresses the data about 10 to 15% better. So that's really a win-win on both fronts. Now, if we look at our hardware deep compression engine, it is just slightly slower than LZO inline compression running in software, but it performs more than 35% better in compression ratio. That's a huge win. However, clearly there is still room to improve in this case. If we compare the compression ratio of our hardware deep compression engine against either Mermaid 4 or our Alitraxis 4 deep compression algorithm, we still have about 5 to 10% of room to improve.
Next, let's look at the benchmarks with a four-kilobyte block size. In this case, we see that the encoding speed of the hardware engine is quite a bit slower than what we would like to see compared to LZO inline compression. But this cost is actually not really in the hardware. Our hardware is able to keep the same throughput regardless of the block size, from 32 kilobytes all the way down to two kilobytes. We have discovered that there is quite a bit of software driver tuning that we need to perform for small blocks. Now, if you look at the compression ratio, it still consistently beats inline compression, with up to 35% better compression ratio. And interestingly, in this case, our hardware deep compression engine, ARIA 3, is able to beat the best deep compression algorithm we can get our hands on in the open market, Mermaid 4, by about 10%. However, compared to our own in-house software deep compression, we still have another 10% of room to improve in terms of compression ratio.
That brings us to our continuous algorithm development. On our roadmap for hardware compression, we are going to work on LZ-plus, that is, a better LZ search algorithm. LZ search is an open problem, so there's always room to improve. Another feature we are looking at is to apply possible transformations of the data to achieve a better compression ratio. And of course, there's also the unknown unknown, that is, the new algorithm discoveries that are going to happen later. With a programmable hardware platform, we have the possibility to adapt to that.
So in summary, I've shared with you how deep compression at inline speed can potentially lower the cost per gigabyte for all-flash array by up to 30%. I hope you agree that an FPGA accelerator card is a good choice of hardware platform to deliver on that promise. I have also shared with you that domain-specific architectural and logic design is key to success. And we love our programmable platform because it enables us to do continuous improvement and non-disruptive field upgrades. With that, that concludes my talk. Thank you very much; I hope you enjoyed it. If you have any comments, please leave them here. And if you have any questions about any of the material, please feel free to contact me at my email address, which is provided on the first page of the presentation material. Thank you very much, and I hope you enjoy the rest of your virtual conference.
Thanks for listening.
If you have questions about the material presented in this podcast, you can ask them and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.