Storage Developer Conference - #143: Deep Compression at Inline Speed for All-Flash Array
Episode Date: March 31, 2021...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast, episode 143.

Good morning, good afternoon, and good evening. No matter which time zone you are in, welcome to my virtual talk today about deep compression at inline speed for all-flash array.
We all love all-flash array for its consistent high throughput and low latency. This enables us to consolidate multiple workloads into a single unified storage platform. It is easy to manage and share big data among different applications. However, all-flash array still comes with a cost premium at acquisition. So in this talk, I'm going to share with you how to lower the cost of all-flash array by up to 30% with deep compression at inline speed.
I'll share with you our hardware platform, which is a programmable FPGA accelerator card. I'll talk about our simple software API and then dive a little bit into the wonderful world of hardware architectural design on FPGA devices. I'll also talk about our compression algorithm development. And lastly, I'll show some of the benchmarks.
Now, first, a few words about Pure Storage. Pure Storage started in 2009 as one of the pioneers of all-flash arrays for enterprise storage. Today, we have three major product lines. Front and center is FlashArray X, which is a performance-optimized, 100% NVMe block storage system. Recently, we just launched FlashArray C, which is capacity optimized and built on 100% QLC flash. And last but not least is the FlashBlade system. Last year, our annual revenue was $1.6 billion. Customers love our all-flash systems for their high performance and seven nines of reliability. We are very proud of our certified NPS score of 82.
Now, back to compression. Not all compression is created equal. For data storage systems, the compression algorithms that we are interested in are lossless data compression algorithms.
We can further divide these algorithms into two categories. The first one is inline compression. Inline compression is pretty fast: in general, an x86 CPU can compress data at half a gigabyte per second, or even faster. Some of the good inline compression algorithms are LZO, LZ4, and Zstandard in its faster modes. The other category is deep compression algorithms. Deep compression algorithms typically compress about 30% better than inline compression, but they run quite a bit slower. In fact, in general, deep compression is about 10x slower than inline compression. Some of the examples are Zstandard at its higher levels and Mermaid.
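To make that fast-versus-deep trade-off concrete, here is a minimal sketch that compresses the same 32-kilobyte block with Zstandard at a low level and a high level, standing in for an inline and a deep compressor respectively. The synthetic data, the chosen levels, and the use of libzstd are illustrative assumptions, not the benchmark setup used in this talk.

```c
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>   /* assumes libzstd is installed; link with -lzstd */

/* Compare a fast ("inline-like") level against a high ("deep-like") level
 * on one block of mildly compressible synthetic data. */
int main(void)
{
    enum { BLOCK = 32 * 1024 };
    char src[BLOCK];
    for (int i = 0; i < BLOCK; i++)                  /* synthetic test data */
        src[i] = (char)('A' + (i % 7) + (i / 512) % 3);

    size_t cap = ZSTD_compressBound(BLOCK);
    void  *dst = malloc(cap);
    if (!dst)
        return 1;

    int levels[] = { 1, 19 };                        /* fast vs. deep level */
    for (int i = 0; i < 2; i++) {
        size_t out = ZSTD_compress(dst, cap, src, BLOCK, levels[i]);
        if (ZSTD_isError(out)) { fprintf(stderr, "compress failed\n"); return 1; }
        printf("level %2d: %5zu -> %5zu bytes (ratio %.2f)\n",
               levels[i], (size_t)BLOCK, out, (double)BLOCK / out);
    }
    free(dst);
    return 0;
}
```

On real storage workloads, the talk quotes roughly a 30% ratio gain and roughly a 10x speed penalty for deep compression; the exact numbers from synthetic data like this will of course differ.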
We have thousands of flash arrays deployed at our customers' sites, and through our data analysis, we have discovered that not all data gets an equal chance at compression. In the flash systems we design at Pure Storage, we have had compression in mind from day one. That's why in our systems, 100% of data gets inline compressed. However, deep compression is still opportunistic. Typically, it is done at garbage collection time, and a significant amount of data never actually gets the chance to be deeply compressed. So how do we lower the cost of all-flash array by up to 30%?
The idea is actually really simple and straightforward. Today, we take the ingested data and perform inline compression with algorithms like LZO before we write to NVRAM. Then, after the data has been persisted to flash, we do opportunistic deep compression at garbage collection time. So what if we change that picture a little bit, so that we actually perform deep compression inline, before we write the data into NVRAM? If we are able to do that, then we pretty much guarantee that one hundred percent of our data is deeply compressed. And because deep compression typically compresses 30 percent better than inline compression, we can have a dollar-per-gigabyte savings of up to 30 percent. So what is a good hardware platform to be able to deeply compress data at inline speed?
We have found that a programmable FPGA accelerator card is a good choice. It is not only able to achieve the target of being 10x faster than the CPU for deep compression, it is also power efficient. It also has very low NRE for hardware development. And what is also very important for us is that it's a programmable platform that enables us to continuously improve the algorithm, and also enables us to explore other offload opportunities besides compression.
So why not an ASIC? There are actually a few off-the-shelf ASICs available to offload compression. But in general, they are all fixed to the gzip or zlib format, and by our standards, they are suboptimal in both compression speed and compression ratio. Some of the vendors have Zstandard on their roadmap. Zstandard is better: it decodes fast, but not necessarily the fastest. But these chips that promise Zstandard support are still years away from being commercially available.
Now, the second choice: what about a custom in-house ASIC? That'd be a fun project. I have done quite a few leading-edge networking chips in the past, and that was fun. So next time, when I ask my boss and get approval for about $100 million of budget and increase the team size by about 10x, I would definitely do that as the next fun project. Just kidding. Actually, even if you get the resources to do an ASIC, one of its drawbacks is that it always lags years behind in algorithm development.
And if you are a company that sells flash in the billions of dollars, then every 1% of compression ratio improvement
is tens of millions of dollars in savings each year.
So let's look at our Pure accelerator card. This is an illustration of our hardware compression offload card. It is a standard half-height, half-length PCIe adapter. Really, the only major component on this card is a mid-range FPGA device. It is currently capable of PCIe Gen 3 x8, which amounts to about 6 gigabytes per second of throughput.
On the software side, we have a really lightweight VFIO driver in user space. One of the things that I have learned in the past doing hardware offload is that it's really helpful if we can keep the hardware API as simple and as stupid as possible. If an offload project requires a team of software engineers to work on compiler design, then we are in big trouble.
On the other hand, if we just replace a commonly used software function with a hardware function call, then it is so much easier. A software compression function takes a data block stored in a source buffer, compresses the data, and outputs it in a destination buffer. So we keep exactly the same function API, but just replace the CPU computation with a few PCIe writes to notify the hardware about the source buffer address and the destination buffer address. Then we let the hardware do the work, and the software just polls and waits for the hardware to finish.
And to make things even simpler, we make sure in hardware that all compression requests are completed in strict FIFO order.
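To illustrate how simple such an API can stay, here is a minimal sketch of a hardware compression call that keeps the software function shape (source buffer in, destination buffer out) but drives the card through a few memory-mapped register writes and then polls for completion. The register layout, field names, and polling scheme are hypothetical assumptions for illustration, not Pure Storage's actual driver interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block exposed over PCIe BAR space. */
struct accel_regs {
    volatile uint64_t src_addr;   /* DMA address of the source buffer      */
    volatile uint64_t dst_addr;   /* DMA address of the destination buffer */
    volatile uint32_t src_len;    /* input block length in bytes           */
    volatile uint32_t doorbell;   /* write 1 to start the request          */
    volatile uint32_t done_len;   /* compressed length, 0 while busy       */
};

/* Same shape as the software compressor it replaces: the caller still
 * hands over a source buffer and gets compressed bytes in a destination
 * buffer, only the computation moves to the card. */
size_t hw_compress(struct accel_regs *regs,
                   uint64_t src_dma, uint32_t src_len,
                   uint64_t dst_dma)
{
    /* A few PCIe (MMIO) writes describe the request to the hardware. */
    regs->src_addr = src_dma;
    regs->dst_addr = dst_dma;
    regs->src_len  = src_len;
    regs->doorbell = 1;

    /* Software simply polls and waits for the hardware to finish. */
    while (regs->done_len == 0)
        ;                          /* spin (a real driver would bound this) */

    return regs->done_len;
}
```

Because the card completes requests in strict FIFO order, a multi-request version of this only needs a simple ring of outstanding descriptors rather than any completion-matching logic.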
Now we have chosen a hardware platform, and we have picked a really simple API. So the next thing that we need to decide is what kind of compression algorithm we are going to implement in hardware.
If you look at the many, many different flavors of compression algorithms available on the market today, they are generally all based on two fundamental techniques. The first one is LZ lossless data compression, which was published in a paper in 1977. The other technique is entropy encoding, and the most famous form of that is Huffman encoding, which was published in a paper in 1952. However, compression algorithms are still being actively developed, and different algorithms differentiate on the quality of the LZ search and the output encoding. Some of the good open-source examples are gzip and zlib, and recently LZ4 and Zstandard have been really popular.
On the closed-source side, we have LZO for inline compression and Mermaid for deep compression.
So which algorithm do we actually pick to implement in hardware? Before we decide on that, let's list the goals for our compression. First, what we are looking for in our all-flash array is block compression. The block size is typically less than 64 kilobytes, and in most cases actually less than 32 kilobytes, because we do not want to sacrifice write latency.
And the second thing that is super important for us
is that the algorithm has to be able
to support fast decoding in software.
This is super critical because all our systems support non-disruptive upgrades. Any data that is written into our flash array can live forever, and we have to guarantee that any future hardware upgrade can still read the same data back at fast speed.
And lastly, we of course want the best possible compression ratio for real workloads. So with all these three goals, we ended up deciding to take the hard path and develop an in-house algorithm for compression. It is not easy, but at least we have a vast amount of data available across our tens of thousands of flash arrays to train our algorithm and benchmark it against. Now, on to the actual hardware design for compression on an FPGA device.
So how can a mid-range FPGA device, which typically runs at only a few hundred megahertz, be 10x faster than a CPU that can run at multiple gigahertz? Well, really, the key is the freedom in architectural and logical design with programmable hardware.
If you look at a general-purpose CPU, it has become more and more complex. But fundamentally, its architecture is still based on the same von Neumann architecture that was described in 1945. It is a memory load-and-store architecture with a limited number of execution units. With an FPGA device and the freedom in logical design, we can exploit the inherent hardware parallelism. What is also really helpful in these devices is that they have thousands of SRAM blocks with an aggregate bandwidth in the neighborhood of terabytes per second. And I will show you an example of why that is super helpful.
Let's take an example of how we actually take advantage of the abundance of memory blocks on a mid-range FPGA device. The essence of LZ search is to find duplicate strings in the data stream, and one way to do that is with a hash table. At each byte position, you calculate the hash of the current byte string and update the hash table. You also want to look up the hash table with that hash to see if you have a possible match in the data that you have seen before. In our hardware design, our pipeline processes four byte positions on every clock cycle. That means we have four hash table lookups and four hash table updates per cycle, and at a frequency of about 300 MHz, this amounts to about 19 gigabytes per second of memory bandwidth.
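As a quick sanity check on those bandwidth figures, here is a back-of-the-envelope calculation; the 8-byte hash-table entry width is my assumption, since the talk does not state the entry size.

```c
#include <stdio.h>

int main(void)
{
    const double clock_hz    = 300e6;  /* ~300 MHz pipeline clock             */
    const int    accesses    = 4 + 4;  /* 4 lookups + 4 updates per cycle     */
    const int    entry_bytes = 8;      /* assumed table entry width           */

    double per_table = clock_hz * accesses * entry_bytes;        /* bytes/s */
    printf("one table:    %.1f GB/s\n", per_table / 1e9);        /* ~19 GB/s  */
    printf("two tables:   %.1f GB/s\n", 2 * per_table / 1e9);    /* ~38 GB/s  */
    printf("four engines: %.1f GB/s\n", 8 * per_table / 1e9);    /* ~153 GB/s */
    return 0;
}
```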
And of course, we don't have to stop at just one hash table per compression engine. If we have two hash tables, that doubles to 38 gigabytes per second. And we can easily put four compression engines into our FPGA device, which aggregates to 153 gigabytes per second of hash table memory bandwidth. A hash table with such high memory bandwidth can actually be fairly easily implemented on an FPGA device.
In this particular example, we just take 64 memory blocks, where each memory block can independently perform one read and one write on every clock cycle. What is needed is really just a simple write arbiter and read arbiter that distribute the four lookups and the four updates across the 64 memory blocks. And because the hash values used for the lookups and updates are essentially random numbers, the memory access conflict rate is actually fairly low, in the range of a few percent, and that does not really degrade our LZ search quality.
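Here is a small software model of that banked hash-table idea: the four positions processed per "cycle" are hashed, and the bits of each hash pick one of 64 independent memory blocks for the lookup and the update. The table sizes, the hash function, and storing position+1 to mark empty slots are illustrative choices, not the actual hardware design.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_BANKS    64            /* independent memory blocks             */
#define BANK_ENTRIES 256           /* entries per bank (illustrative)       */

static uint32_t table[NUM_BANKS][BANK_ENTRIES];   /* 0 = empty, else pos+1 */

static uint32_t hash4(const uint8_t *p)           /* hash of 4 input bytes */
{
    uint32_t v;
    memcpy(&v, p, 4);
    return (v * 2654435761u) >> 16;               /* multiplicative hash   */
}

int main(void)
{
    const uint8_t data[] = "abcabcabcabcabcabcabcabc";
    size_t n = sizeof(data) - 1;

    /* Process 4 byte positions per "cycle", as in the hardware pipeline.
     * In hardware, two lanes hitting the same bank in the same cycle would
     * conflict; this model simply serializes them. */
    for (size_t cycle_base = 0; cycle_base + 8 <= n; cycle_base += 4) {
        for (int lane = 0; lane < 4; lane++) {
            size_t   pos  = cycle_base + lane;
            uint32_t h    = hash4(&data[pos]);
            unsigned bank = h % NUM_BANKS;          /* which memory block   */
            unsigned idx  = (h / NUM_BANKS) % BANK_ENTRIES;

            uint32_t cand = table[bank][idx];       /* lookup: prior match? */
            if (cand != 0 && memcmp(&data[cand - 1], &data[pos], 4) == 0)
                printf("pos %zu: candidate match at pos %u\n",
                       pos, (unsigned)(cand - 1));

            table[bank][idx] = (uint32_t)pos + 1;   /* update the table     */
        }
    }
    return 0;
}
```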
In our hardware compression engine, we also do Huffman coding. The LZ search gives us the literals and the metadata sections. The literals are the data bytes for which we have not found any match, and the metadata sections describe the matches that we have found. For our Huffman encoding, we perform dynamic Huffman encoding for the literals, but we perform static Huffman encoding, with 64 static tables, for the metadata sections. Our static tables are trained on real workloads across our fleet. And in order to increase the Huffman decoding speed, we do multi-stream Huffman encoding for the metadata sections and the literals.
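To show why multiple streams help decode speed, here is a toy sketch that deals literals round-robin into four streams, each of which would then be entropy-coded independently, so a software decoder can work on several symbols at a time instead of waiting for the bit position of the previous symbol. The stream count, the round-robin split (rather than, say, contiguous segments), and the sample data are assumptions for illustration only.

```c
#include <stdio.h>
#include <string.h>

#define NUM_STREAMS 4

int main(void)
{
    const char *literals = "EXAMPLE LITERAL BYTES";
    size_t n = strlen(literals);

    char   streams[NUM_STREAMS][64] = {{0}};
    size_t len[NUM_STREAMS] = {0};

    /* Deal the literals across the streams round-robin. */
    for (size_t i = 0; i < n; i++) {
        int s = i % NUM_STREAMS;
        streams[s][len[s]++] = literals[i];
    }

    /* Each stream would now be Huffman-coded on its own; the decoder
     * restores the original order because the i-th literal always comes
     * from stream i % NUM_STREAMS. */
    for (int s = 0; s < NUM_STREAMS; s++)
        printf("stream %d: %.*s\n", s, (int)len[s], streams[s]);
    return 0;
}
```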
Now let's again dive a little bit into the hardware design for Huffman encoding. One of the critical paths for Huffman encoding is sorting the histogram. In the way we implement Huffman encoding, each symbol is an 8-bit byte, so for each data block we generate a histogram of 256 counts, and we need to sort those 256 numbers. A good sorting algorithm is merge sort, because it has a predictable n log n cost. But if you just implement it in hardware naively, the n log n of 256 numbers still gives you about 2,000 clock cycles. That is a very long latency.
So how do you speed that up? It's actually easy if you have the freedom to design your own hardware pipeline. Here is an example of how we actually implement our merge sort. We have a hardware pipeline of four stages. In the first stage, we have eight merge sort engines, each of which can sort 32 numbers. Once we have generated the 256 histogram numbers, we just distribute them to the eight merge sort engines, and they start sorting the 32 numbers each of them got, independently of each other. Once all eight engines have completed their work, we move to the second stage. The second stage is really simple: it's just a merge process. Imagine you already have two sorted lists of numbers; it's really simple to merge them into a single sorted list. In this case, we merge two lists of 32 numbers into a sorted list of 64 numbers. The same thing goes on in stage 3 and stage 4, and after stage 4 we have a sorted list of 256 numbers. Now, if we compute the latency of this hardware processing pipeline, it is 608 cycles, which is quite a bit of a speedup over the naive implementation of merge sort. But more importantly, because this is a pipelined design, the processing pipeline can actually sustain one sort every 256 cycles. This is really important for us to be able to sustain compression throughput, even for data block sizes of less than two kilobytes.
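Here is a small software model of that four-stage sort, assuming a 256-bucket histogram: eight independent sorts of 32 counts, followed by three rounds of pairwise merges (32+32, 64+64, 128+128). It models only the data flow; the 608-cycle latency and the one-sort-per-256-cycles throughput come from the hardware pipelining described above, not from this code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u32(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/* Merge two sorted runs a[0..na) and b[0..nb) into out. */
static void merge(const unsigned *a, size_t na,
                  const unsigned *b, size_t nb, unsigned *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

int main(void)
{
    unsigned hist[256], buf[256];
    for (int i = 0; i < 256; i++)
        hist[i] = (unsigned)rand() % 1000;        /* stand-in histogram */

    /* Stage 1: eight "engines" each sort 32 numbers independently. */
    for (int e = 0; e < 8; e++)
        qsort(&hist[e * 32], 32, sizeof(unsigned), cmp_u32);

    /* Stages 2-4: pairwise merges of sorted runs (32 -> 64 -> 128 -> 256). */
    for (size_t run = 32; run < 256; run *= 2) {
        for (size_t base = 0; base < 256; base += 2 * run)
            merge(&hist[base], run, &hist[base + run], run, &buf[base]);
        memcpy(hist, buf, sizeof(hist));
    }

    printf("sorted: min=%u max=%u\n", hist[0], hist[255]);
    return 0;
}
```

One plausible way to account for the 608 cycles, assuming this stage structure, is 32 x log2(32) = 160 cycles for the first stage plus 64 + 128 + 256 cycles for the three merge stages, with the final 256-cycle merge setting the one-sort-per-256-cycles throughput; the talk only gives the totals, so the per-stage split is my inference.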
Now let's take a step back and look at some of the micro-benchmarks. The first benchmark we run is against a block benchmark of data that is 32 kilobytes in size. We compare the performance of six different algorithms, and I'll describe each one. The first column here is the encoding speed, the second column is the decoding speed, and the third column is the compression ratio. The speed is expressed in nanoseconds per byte. Now, about the algorithms. Alitraxis 0 and Alitraxis 4 are our in-house developed software compression algorithms. LZO Pro is a commercially available inline compression algorithm that we use in some of our systems, and Mermaid 4 is the commercial deep compression algorithm that we also use in some of our systems. ARIA 0 and ARIA 3 are our hardware compression engines: ARIA 0 is the hardware version that performs inline compression, and ARIA 3 is really the interesting one, the hardware engine that does deep compression.
The first thing we can see from this benchmark is that with 32-kilobyte data blocks, our hardware inline compression is actually faster than the best inline compression, LZO Pro, running on an Intel CPU. Not only that, it also compresses the data about 10 to 15% better. So that's really a win-win on both fronts. Now, if we look at our hardware deep compression engine, it is just slightly slower than LZO inline compression running in software, but it performs more than 35% better in compression ratio. That's a huge win. However, clearly there is still room to improve in this case. If we compare the compression ratio of our hardware deep compression engine against either Mermaid 4 or our Alitraxis 4 deep compression algorithm, we still have about 5 to 10% of room to improve.
Next, let's look at the benchmarks with a four-kilobyte block size. In this case, we see that the encoding speed of the hardware engine is quite a bit slower than what we would like to see compared to LZO inline compression. But this cost is actually not really in the hardware. Our hardware is able to keep the same throughput regardless of the block size, from 32 kilobytes all the way down to two kilobytes. We have discovered that there is quite a bit of software driver tuning that we need to perform for small blocks. Now, if you look at the compression ratio, it still consistently beats inline compression, with up to 35% better compression ratio. And interestingly, in this case, our hardware deep compression engine, ARIA 3, is able to beat the best deep compression algorithm we can get our hands on in the open market, Mermaid 4, by about 10%. However, compared to our own in-house software deep compression, we still have another 10% of room to improve in terms of compression ratio.
That brings us to our continuous algorithm development. On our roadmap for hardware compression, we are going to work on LZ-plus, that is, a better LZ search algorithm. LZ search is an open problem, so there's always room to improve. Another feature we are looking at is to apply possible transformations of the data to achieve a better compression ratio. And of course, there's also the unknown unknown, that is, the new algorithm discoveries that are going to happen later. With a programmable hardware platform, we have the possibility to adapt to that.
So in summary, I've shared with you how deep compression at inline speed can potentially lower the cost per gigabyte for all-flash array by up to 30%. I hope you agree that an FPGA accelerator card is a good choice of hardware platform to deliver on that promise. I have also shared with you that domain-specific architectural and logic design is key to success. And we love our programmable platform because it enables us to do continuous improvement and non-disruptive field upgrades. With that, that concludes my talk. Thank you very much; I hope you enjoyed it. If you have any comments, please leave them here. And if you have any questions about any of the material, please feel free to contact me at my email address, which is provided on the first page of the presentation material. Thank you very much, and I hope you enjoy the rest of your virtual conference.
Thanks for listening.
If you have questions about the material presented in this podcast, you can ask them and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.