Grey Beards on Systems - 114: GreyBeards talk computational storage with Tong Zhang, Co-Founder & Chief Scientist, ScaleFlux
Episode Date: February 22, 2021
Seeing as how one topic on last year's FMS2020 wrap-up with Jim Handy was the rise of computational storage, and it's been a long time (see GreyBeards talk with Scott Shadley at NGD Systems) since we discussed this, we thought it time to check in on the technology. So we reached out to Dr. Tong …
Transcript
Hey everybody, Ray Lucchesi here with Matt Lieb.
Welcome to the next episode of the Greybeards on Storage podcast,
a show where we get Greybeards storage bloggers to talk with system vendors and other experts
to discuss upcoming products, technologies, and trends affecting the data center today. This GreyBeards on Storage episode was recorded February 12th, 2021.
We have with us here today Tong Zhang, co-founder and chief scientist at ScaleFlux.
So Tong, why don't you tell us a little bit about yourself and what ScaleFlux is up to
with their new computational storage?
Yes, yes, great. It's really great to have this opportunity to talk about computational storage.
ScaleFlux was founded at the end of 2014 with the mission of exploring commercially viable paths to move the very simple idea of computational storage into the real world.
So actually, we coined the term computational storage.
And today we are very happy to see that computational storage has gained tremendous momentum in the industry over the years.
And over the past years, we have evolved to become the only company today that has shipped thousands of computational storage drives to data centers worldwide.
We had a Flash Memory Summit wrap-up here last fall, actually.
And Jim Handy was talking about the success,
the nascent success of computational storage,
and you guys have been around, like you said, since 2014.
I mean, that's when you started.
Obviously, you didn't have a product then,
but it's been a long haul to get to this point.
What is going on that's driving some of the adoption
of computational storage?
Yeah, yeah.
It's a very good question.
So actually we can look back a little bit, even earlier, at what computational storage means.
Loosely speaking, any data storage device, built on any storage technology, for example flash memory, magnetic recording, or even DRAM, that can carry out data processing tasks beyond its core data storage duty can be called a computational storage drive.
So this simple idea of empowering data storage devices with additional computing capability is certainly not new.
Actually, we didn't invent the idea. It really traces back at least 20 years.
Oh God, when I was at Storage Technology Corporation,
we were adding sort specific commands
to the storage solution that we had
so that we could provide that,
and even compression and decompression kinds of things
for backups and stuff like that.
But it was a fairly niche market,
and there were only a couple of vendors that we were focused on
to provide that functionality.
Nowadays, it's much more generic, I would say.
It's much more available to just about anybody who wants to use it, right?
Yes, yeah.
That's really, I think, now the right timing. Because back then, with the steady and healthy Moore's Law CMOS technology scaling of the good old days, not surprisingly, the mainstream industry chose not to adopt this idea.
But more recently, as you mentioned, this very simple idea received
a resurgence of interest. We believe it's really driven by two trends. One is that as
CMOS technology scaling slows down, there's a growing consensus in the industry
that domain-specific or heterogeneous computing
must play an increasingly important role
to complement the host CPU going forward.
And second, the significant progress of solid-state data storage has really pushed the system bottleneck
from data storage to computing.
Naturally, the concept of computational storage matched those two grand trends very nicely. So that is really how this idea is now being picked up by the industry.
Yeah. One of the things we talked about at our year-end podcast, and Matt was there, was the rise of the ARM processor in the data center in these, I'll call them, external computing engines. I mean, computational storage is just one aspect; DPUs are another, SmartNICs too. There are, gosh, a couple of products out there that are like storage processors sitting out on the network and stuff like that. It's all kind of evolving, I guess, like you say, because Moore's Law is starting to slow down and the compute is starting to become the bottleneck again.
It's interesting.
Yeah.
Yeah.
So I think over the next few years, we will see very exciting development across the board, with SmartNICs, DPUs, and computational storage drives. We see those as really complementary to each other, together forming a very solid heterogeneous computing fabric to complement the CPU and the GPU.
And each one of them, the computational storage drive, the DPU, and the SmartNIC, has its own unique position, its own well-suited application domains and computation functions to push toward.
I think they will complement each other and come together to really help the host CPU to the full extent.
A lot of these devices have specific hardware circuitry to offload time-sensitive tasks and that sort of stuff. Does computational storage have that as well,
or is it more just an ARM processor that you can run different workloads on if you want?
Yeah, absolutely. That is a very good question. Actually, that is also a question we have been asking ourselves over the years, since we started ScaleFlux to explore this area.
So now we believe transparent compression and encryption are a must for any commercially viable computational storage drive. The pervasive use of at-rest data compression and data encryption can greatly benefit from a built-in customized hardware circuit inside the storage drive that transparently carries out compression and encryption.
Yeah, Tong, self-encrypting drives have been around for quite a while, I mean, even on the disks I think, not just the SSDs and stuff like that. So that sort of logic has been around, but the compression, decompression kinds of things, that's a different animal, I'd say.
But yeah, maybe they're all doing the same sorts of logic kinds of things, maybe not.
I don't know.
Yeah.
So actually, compared with encryption, moving the compression function into the storage drive is much more challenging and also much more beneficial, for two reasons.
The first is that, compared with encryption, compression is very unfriendly to the CPU.
There's always been a ton of overhead.
Yes, because by nature the data processing flow of compression has a lot of randomness. That randomness triggers a lot of cache misses in the CPU, which is very bad from the CPU's performance perspective.
Compared with encryption, compression tends to consume many more CPU cycles.
As a result, it is much more beneficial to offload, to migrate, compression off the CPU and down to the storage drive on the I/O path. And another challenge for compression is that when we compress the data, the compressed data block size is unpredictable.
It can vary from one block to the other.
So then for the computational storage drive, we have to implement a very efficient data remapping mechanism to handle the variable-length compressed data blocks.
That also makes it very challenging to move the compression, even though the concept of compression is really not that fancy, right?
It's not like AI or machine learning. It's really kind of very old.
Even encryption. I mean, encryption, it seems to me, is a much more mathematically intensive task than compression ever was. But getting back to what you said, you said compression was more random from a cache perspective. You're talking the instruction cache or the data cache in the CPU?
Data cache, yes. Data cache.
Actually, today's modern CPU can execute private-key encryption at least one order of magnitude faster than it can execute compression.
No kidding.
That's interesting.
That's like regular LZ compression or something.
It's not anything that fancy, right?
I mean, it's just normal compression kinds
of algorithms, right?
Yeah.
For example, a single-core CPU can execute AES encryption with a few gigabytes per second of throughput. But for compression, for example Zlib, Gzip, and LZ4: at most, Gzip only gets to 50 or 60 megabytes per second,
and LZ4 only about 300 to 400 megabytes per second.
So it's way, way lower.
Yeah, order of magnitude, couple orders of magnitude,
depending on the algorithm.
That's impressive.
Yes.
So that is the reason why it is so beneficial to push the compression into the drive.
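To put those single-core numbers in perspective, here's a minimal micro-benchmark sketch, assuming Python with the standard-library zlib and the third-party cryptography package for AES; exact figures will vary widely by CPU and by how compressible the data is.

```python
# Minimal single-core micro-benchmark sketch: AES-CTR encryption vs. zlib
# (gzip-style) compression throughput. Assumes the third-party "cryptography"
# package (pip install cryptography). Numbers vary by CPU and data content.
import os
import time
import zlib

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

# 64 MiB of half-random, half-zero data: crudely mimics compressible content.
data = (os.urandom(1024) + bytes(1024)) * (32 * 1024)

def mb_per_s(fn, payload: bytes) -> float:
    start = time.perf_counter()
    fn(payload)
    return len(payload) / (time.perf_counter() - start) / 1e6

aes = Cipher(algorithms.AES(os.urandom(32)), modes.CTR(os.urandom(16))).encryptor()
print(f"AES-256-CTR : {mb_per_s(aes.update, data):8.0f} MB/s")
print(f"zlib level 6: {mb_per_s(lambda d: zlib.compress(d, 6), data):8.0f} MB/s")
```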
Yeah, but you also have to add all the sophistication of the variable block size remap.
Flash obviously does its own mapping, remapping, flash translation stuff all along, but you've got to put this on top of it.
Now you've got variable-length blocks that you're playing with that have to be somehow stacked across all this flash and stuff like that, right?
Exactly. Yeah, exactly. That is also what makes things so complicated: we have to extend the functionality of a traditional flash translation layer with variable-length block management, and also make everything inline along this I/O path.
So when we do the compression and decompression, we do not introduce any noticeable extra latency on the I/O path.
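The remapping problem Tong describes can be illustrated with a toy model: because each logical block compresses to an unpredictable size, the drive must map every logical block address to a (physical offset, compressed length) pair instead of a fixed-size slot. This is a hypothetical host-side simulation, not ScaleFlux's actual flash translation layer:

```python
# Hypothetical toy model of variable-length block remapping, not ScaleFlux's
# actual FTL: each logical block address (LBA) maps to a (physical offset,
# compressed length) pair because compressed block sizes are unpredictable.
import zlib

class ToyCompressedStore:
    def __init__(self) -> None:
        self.media = bytearray()   # stand-in for the NAND flash array
        self.l2p = {}              # LBA -> (offset, compressed length)

    def write(self, lba: int, block: bytes) -> None:
        payload = zlib.compress(block)
        self.l2p[lba] = (len(self.media), len(payload))
        self.media += payload      # append-only, like flash programming

    def read(self, lba: int) -> bytes:
        offset, length = self.l2p[lba]
        return zlib.decompress(bytes(self.media[offset:offset + length]))

store = ToyCompressedStore()
store.write(0, b"A" * 4096)             # highly compressible 4 KiB block
store.write(1, bytes(range(256)) * 16)  # less compressible 4 KiB block
assert store.read(0) == b"A" * 4096
print({lba: size for lba, (_, size) in store.l2p.items()})  # per-LBA sizes
```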
Well, I mean, the other thing with compression,
when we were doing it at the storage controller level,
it actually helped the performance because we were moving less data around, right?
You were writing less data and reading less data from the back end and stuff.
But when you're doing it at the drive, you're still effectively sending all that data out
the front end, right? What's your interface to the drive? NVMe? Yes, we are PCIe Gen 4.
And down the road, we'll migrate to PCIe Gen 5. So we really enjoy the continuous progress
of PCIe bandwidth.
It provides a very wide front-end bandwidth
for us to push this compression down into the drive.
Even today, our product uses TLC flash memory,
and very soon we
will release a QLC-based flash memory product.
We have very clearly observed
the benefit of using compression
to improve storage drive performance on both TLC
and QLC, and even more noticeably on QLC flash.
Interesting. Well, I mean, the challenge with QLC flash is they want to write in big stripes
and that sort of thing. And with compression, I guess you could pack more of that stuff in there
to try to get the stripe wider.
So when you claim, and I have no idea what capacity your current drives are at, is that
at raw capacity or as compressed capacity or how does that work?
Oh, yes.
For our current product, we start from a raw capacity of 4 TB or 8 TB.
And then once the user has our drive, the user can format it.
For example, the drive with 4 TB raw capacity can be formatted as an 8 terabyte drive or a 16 terabyte drive,
to enjoy the benefit of using compression to amplify the usable logical storage
capacity.
Yeah.
It's almost like virtual storage versus real storage and physical storage and that sort
of stuff.
Yes. More like thin provisioning.
Right, right.
Actually, thin provisioning is the other kind of word that applies.
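As a back-of-the-envelope illustration of that formatting choice, a hypothetical sketch: a 4 TB raw drive formatted as 8 TB logical only holds up as long as the average compression ratio stays at or above 2:1.

```python
# Hypothetical sketch of the thin-provisioning-style math: exposing more
# logical capacity than raw flash holds up only while the average
# compression ratio covers the exposure.
RAW_TB, LOGICAL_TB = 4.0, 8.0  # e.g. a 4 TB drive formatted as 8 TB

for ratio in (1.5, 2.0, 3.0):
    physical_tb = LOGICAL_TB / ratio   # flash consumed if fully written
    verdict = "fits" if physical_tb <= RAW_TB else "overflows raw capacity"
    print(f"{ratio}:1 compression -> {physical_tb:.1f} TB physical ({verdict})")
```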
So where are you guys finding success?
I mean, is this something you would see in enterprise data centers
or is it more in a public cloud environment or where?
Yes, in data centers and the cloud. For our current product, we see that the early adopters are mainly relational database users: MySQL, MariaDB, Oracle, those kinds of relational database customers. Because they have to handle a huge amount of data, and at the same time, the relational database is very latency sensitive.
So historically, people never turned on compression, because to do the compression they had to move the data back and forth between the CPU, or even some kind of accelerator, to do the compression and decompression.
That introduces an extra latency penalty,
which degrades their relational database performance.
So even though the majority of the data in relational databases is highly compressible,
people almost never compressed it.
But once they see our computational storage drive with built-in transparent inline compression, they can enjoy lower storage cost without any penalty on performance, because we do the compression and decompression transparently, inline.
And sometimes they even see a performance improvement,
a benefit on performance in addition to the much
lower storage cost.
Are you seeing the same level of compression applied to, let's just say, databases
as opposed to far less compressible data like video and audio and PDFs and that kind of thing?
Oh, no. We see a very wide range of data compressibility. Video, images,
and audio certainly are not compressible, because they have already been compressed by the video compression
algorithm, by the audio compression algorithm.
So that kind of scenario is really not applicable to us.
But on the other hand, for the relational database data set,
those data are very highly compressible.
Actually, customers have reported to us
that they have seen over a 5-to-1 compression ratio
on their production data.
So your compression algorithm is based on white space?
Oh, our compression algorithm is fundamentally the same as LZ,
just like in the gzip compression algorithm.
Yeah, so it's not just white space.
It's also exploiting redundancy, the number of times a particular character string or character is used.
Yeah, so GZIP is fairly compute intensive, but it does do a decent job of compressing and stuff like that.
In some cases, we've seen where there's an inline version of the compression algorithm and there's an offline
version of the compression algorithm, but you do everything inline, right?
Yes, everything is inline. And one thing our customers really like is that
they do not need to change anything in their software.
So transparent, right?
Yeah, everything is transparent. They do not need to change a single line of code in the application.
They do not need to change a single line of code in their file system or their I/O
stack.
They just plug it in, install our driver, and everything just works.
Yeah.
So we talked about compression and we talked about encryption.
Is there other functionality being done outboard like this?
Compression and encryption are like a foundation, because they are transparent,
without any demand for changes in the upper-level application. But we have also explored another direction,
where we purposely push down a data filtering kernel from the data analytics engine into the drive.
So then, inside the drive, we can pre-process, or you can think of it as filtering,
the data on behalf of the host CPU. The host CPU then only needs to handle the already filtered data.
So basically, for any data analytics pipeline, there are two steps. The first step is to prepare the data, to select the really useful data. The second step is to do the real data analytics.
But more often than not, the first stage, the data preparation or data filtering stage, tends to consume even more CPU cycles than the second step.
Yeah, I would imagine because you've got to bring all the data in, you've got to check to see if it's the right data.
And effectively half of that might not even be the right data, so you're throwing all that stuff away.
And that way you don't have to actually send it across the network.
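Schematically, the pushdown Tong and Ray are describing looks like the sketch below; every name here is hypothetical, since the actual ScaleFlux/Alibaba interface is customized and not public. The point is simply that the predicate runs where the data lives, so only qualifying rows cross back to the host:

```python
# Purely illustrative sketch of data-filtering pushdown: the drive applies a
# predicate locally so only qualifying rows travel back to the host CPU.
# All names here are hypothetical; the real interface is vendor-customized.
from typing import Callable, Iterable

Row = tuple  # e.g. (order_id, region, amount)

class ComputationalDrive:
    def __init__(self, rows: Iterable[Row]):
        self.rows = list(rows)  # data resident on the drive

    def scan_with_filter(self, predicate: Callable[[Row], bool]) -> list[Row]:
        # Runs on the drive; the host never sees rows that fail the predicate.
        return [row for row in self.rows if predicate(row)]

drive = ComputationalDrive([
    (1, "US", 120.0), (2, "EU", 80.0), (3, "US", 15.0), (4, "APAC", 300.0),
])
# Host pushes down "region == 'US' and amount > 100"; one row comes back
# instead of four, saving interface bandwidth and host CPU cycles.
us_big = drive.scan_with_filter(lambda r: r[1] == "US" and r[2] > 100)
print(us_big)  # [(1, 'US', 120.0)]
```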
So this sort of thing would require changes to protocols, or is there some sort of standard
system protocol that allows you to do these sorts of things, or is this all proprietary
to ScaleFlux?
Yeah, at this time, everything is customized. Today, led by Intel and AWS, the NVMe committee is developing an extension of the NVMe standard to support computational storage.
That will standardize the interface,
so in the future everyone can use a standardized
interface to push down computation. But today, before we have the NVMe-based standard,
everything is customized. Actually, we have developed a solution with Alibaba Cloud to do exactly this. For a database running on
the Alibaba Cloud, they customize their engine in such a way that they are able to push down
the data filtering kernel through a customized interface to our drive. Then inside our drive,
we will do the pre-processing. We will filter the data and then send back to the CPU only the data that passes this filtering stage.
I was thinking that you could do some sort
of thing with containers where you'd send the container and have it run out there on the drive, but this is more
specific to the filtering-kernel kinds of capabilities.
Is that right?
Yes.
Yeah.
I guess the other question is, is there like an ARM processor sitting on the drive that's
available for these sorts of services or?
Yes, actually, that's a very good question.
So actually, our current product is based on an FPGA,
so we don't have a built-in ARM core yet.
But very soon, our SoC chip will be available, and in our SoC chip we will have ARM cores available
to enhance programmability.
So our future SoC chip will have both hardware engines, compression, encryption, those kinds of hardware engines, plus multiple
ARM cores that can handle programmable data filtering and data selection operations.
You would think something like transcoding would also be useful
to be done outboard. If a video comes in and it's, I don't know, MP4 or something like that,
and you want to go to higher resolution or lower resolution kinds of things.
Yeah, absolutely.
That's pretty computationally intensive as well, right?
Yes, absolutely. So actually, we can expand the concept of transcoding,
not necessarily only for video. In the drive, we can transcode any data from one format to another format,
inside the drive, without needing to bother the CPU to handle that kind of data format conversion.
So what sort of other things besides video transcoding would that be?
I mean, audio transcoding, I guess, would be the same thing, but...
Oh, for example, today the industry is really heavily pushing
toward what we call in-memory data analytics, right?
But for in-memory data analytics, we need to analyze data from some storage. Suppose a big file contains the data
that we want to run this in-memory data analytics on. When data are stored on a
hard drive or solid-state drive, people use a format different from the format people use for
in-memory data analytics.
So if we build the format conversion into the drive, then when the user wants
to do in-memory data analytics on some data on the drive, transparently, on the fly, our
drive will convert the data format, the data syntax, from the on-storage data format into
the in-memory data format.
The CPU can then immediately start the in-memory data analytics without doing any kind of conversion.
Format translation. Yeah. I guess I'm not familiar with how that works out. I was trying to think of
what might be that sort of thing, but it's not like you're changing the block size or anything
like that. This is more than just that sort of work. You're actually changing the format of the
fields and stuff like that?
Yeah, it's changing the format. For example, if we have a big table,
maybe the raw data are stored on the drive row by row, but in-memory data analytics might favor column-by-column data.
So then we can convert from row by row to column by column. Because once the data are in the column-by-column format, the CPU can very nicely apply single-instruction-multiple-data,
vectorized processing. Yes?
Yes, yes, yes.
No way. You can deliver it row by row or column by column.
Actually, that is one direction we are very actively exploring.
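A small sketch of that row-to-column transcoding, assuming NumPy on the host side: once a column's values sit contiguously, the host can process them with vectorized, SIMD-friendly operations instead of walking the table row by row.

```python
# Sketch of row-to-column transcoding (assumes NumPy on the host): columnar
# layout lets the CPU apply SIMD-friendly vectorized operations per column.
import numpy as np

# Row-by-row layout, as the table might be laid out on the drive.
rows = [(1, 10.5), (2, 3.2), (3, 8.8), (4, 0.4)]

# Transcode to column-by-column: one contiguous array per column.
ids, amounts = (np.array(col) for col in zip(*rows))

# Vectorized analytics over the contiguous column, no per-row loop.
print(amounts.sum())            # 22.9
print(amounts[amounts > 5.0])   # [10.5  8.8]
```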
Right, right, right. Well, I always thought that, quite frankly, I'm not a database expert,
so all this is conjecture on my part, but
row-based databases and column-based databases were completely different animals.
To know that you could serve the same data for both is kind of an impressive concept.
Yes.
Actually, we are very excited about all kinds of possibilities and opportunities enabled
by moving compute into the storage drive.
Actually, I believe we are still just scratching the surface. There's much more waiting for us to discover.
It seems to me that compression has always run alongside, if not hand in hand with, deduplication.
Is there any deduplication in your algorithm as well?
Yes. Right now, we haven't integrated the dedupe functionality into the drive.
When people talk about dedupe, the efficiency improves once we dedupe data across multiple drives. The more data we have, the higher the chance we can dedupe.
So over there, we are exploring the possibility of doing global flash management integrated with the dedupe function. And dedupe also needs a lot of hashing computation.
So there are all kinds of possibilities there.
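For reference, the hashing Tong mentions is the heart of block-level dedupe: fingerprint each block and store any given fingerprint's data only once. A generic sketch, not shipped ScaleFlux functionality (Tong notes dedupe is still exploratory):

```python
# Minimal sketch of block-level dedupe via content hashing; a generic
# illustration, not shipped ScaleFlux functionality (dedupe is exploratory).
import hashlib

store: dict[str, bytes] = {}   # fingerprint -> unique block data
refs: list[str] = []           # logical block sequence as fingerprints

def write_block(block: bytes) -> None:
    fp = hashlib.sha256(block).hexdigest()
    store.setdefault(fp, block)  # keep only the first copy of identical data
    refs.append(fp)

for blk in (b"A" * 4096, b"B" * 4096, b"A" * 4096):  # third block is a dup
    write_block(blk)
print(f"logical blocks: {len(refs)}, unique stored: {len(store)}")  # 3 vs 2
```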
So I mean, one of the challenges with SSDs in general is that if a power fluctuation happens or you lose power, there's a potential for data sitting in DRAM cache, if there is such a thing on your drive, to be lost. How do you deal with
the power failure kinds of scenarios in a system that does this sort of thing?
Oh, yes. Actually, we take data reliability very seriously. When people talk about computational storage,
people tend to emphasize the computation too much, and at the same time they might
tend to ignore the storage. But at ScaleFlux, we strongly believe that to qualify as a viable computational storage drive,
you first have to be a good storage drive before we can talk about the computation aspect.
We spec our storage drive at the same level as any high-end,
enterprise-grade solid-state drive.
So then actually, we integrate a capacitor-based power loss
protection mechanism.
And we guarantee that whenever data reaches our drive,
even if the power goes away,
the data will never be lost.
That's very interesting.
A lot of this stuff is something that you would find in a storage controller two decades ago.
Seeing it there on the drive is very interesting.
And you're doing all this with FPGA at this point.
That's fairly interesting as well.
Yes. Actually, the reason we chose FPGA: to be honest, at the very beginning, when we started this company,
we didn't have a very clear picture of the future either.
We just believed this was a trend.
But to really explore this direction, the FPGA was certainly the best option for
us to quickly prototype, quickly engage with potential
customers, and quickly get feedback.
So over the years we used the FPGA to really kind of...
To quickly iterate your design.
Explore the path.
Yes.
Yeah, yeah, yeah.
And FPGA is much more flexible and functional as well, right?
Much more sophisticated and much more logic can be jammed into one, right?
Yes, yes.
And also we have, I would say,
the industry's best team on designing the controller,
and also very sophisticated ECC.
We put everything into a mid-range Xilinx FPGA.
Actually, many people didn't believe we could do that,
put everything into the mid-range FPGA,
plus the computation on the FPGA.
That really surprised many people.
Yeah, so the challenge, of course,
is that your team has to pretty much implement the logic that the customer requires, right?
With the ARM processor, once it comes on board, the customer, or the customer's developers, can do all this stuff themselves rather than having you guys re-implement
this logic every time.
Yes.
Yeah.
So our plan is that once our next-generation SoC-based computational
storage drive is available, we will provide an environment and a platform
to enable the customer to conveniently utilize
the ARM cores in our drive, and to seamlessly integrate
with the already built-in transparent compression
and encryption functionalities,
to really form a
highly programmable and, at the same time, high-performance computational
storage drive development platform.
So I mean, do you see data centers coming in and buying, I don't know, tens of drives or hundreds of drives or thousands of drives all at a whack to convert to utilizing your facilities? Or is it more that they try a handful of drives, see how it works, and then once they come up with a better idea of what the performance and economics are, they bring in more?
Oh, yes. Actually, we are already in deep engagement with two hyperscalers. They're already on the way to large-scale deployment of our product in their data centers. They are fully convinced about the value
and fully convinced about the trend, the future.
And I'm sure that all the major hyperscalers in the world,
they are looking at exactly the same direction.
And for example, this NVMe standardization effort was driven by AWS.
AWS clearly sees the need to push down certain computation from the CPU to their storage
and I/O path.
I mean, the other challenge with something like this is
it's hard for many companies to sole-source a technology.
Like, Intel runs into that problem.
They have to have AMD out there to provide x86 processors
and stuff like that.
Are you guys seeing the requirement to have multiple sources
for these sorts of products?
Yeah, I think certainly in the long run there should be
multiple players, and there will be,
we believe, especially enabled by this
NVMe standard. Once we have this NVMe standard open,
everybody can come in. Then the
differentiation is really the quality, because we all follow the exact same standard. But right now,
the early adopters really see big value from our product,
especially using transparent compression to achieve significant storage
cost savings.
So they see the value and then they are willing to bring in the product and largely deploy
it in their data center, even though we are the only one
on the market. Right, right, right. As far as the encryption is concerned, the keys for that
sort of stuff, is that outboard or is that on the drive or how does that work? Oh, yes. So
right now, our hardware provides a root-of-trust capability, so we are able to store
the key on board. It also depends upon future system integration and how the software stack integrates with us. But at least we can achieve this self-encrypting drive
functionality. And we can achieve more than that. They had to solve this problem before you guys.
So yeah, interesting. So you've got TLC today, you're looking at QLC. In the QLC environment, the challenges are the latencies go up considerably, the wear leveling goes down, or the endurance rather goes down considerably. I mean, how do you feel that your computational drive will work in that space?
Actually, we are very excited about the potential opportunity there for QLC.
Just because of the problems you just mentioned for QLC, that is exactly where our computational
storage drive can come in and help.
For example, we already have a working proof of concept on a QLC-based
drive. We see clearly that our transparent compression can very noticeably close the
performance gap between TLC and QLC.
Because you're writing less data at a whack
because it's compressed, and you're reading less
data off the drive because it's compressed, and the latency is better, and the wear leveling or
endurance can be better, I guess.
Yes, all of them. And once we use the QLC drive in the system, more likely the
bandwidth of the QLC memory chip will become the system bottleneck, not the PCIe bandwidth. So once we use transparent compression, logically we
amplify, we expand, the QLC flash memory bandwidth. Coupled with
the lower cost of QLC, we are confident that combining transparent compression with QLC
can displace hard disk drives in many scenarios, and can also complement TLC in many scenarios. So the real applications of QLC
will be largely amplified through our transparent compression.
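To put the bandwidth-amplification point in rough numbers, a hedged back-of-the-envelope sketch; all figures are illustrative assumptions, not measured product numbers:

```python
# Back-of-the-envelope sketch: when QLC NAND bandwidth (not PCIe) is the
# bottleneck, inline compression multiplies the logical bandwidth the host
# sees. All figures are illustrative assumptions, not product numbers.
PCIE_GEN4_X4_GBPS = 8.0   # rough usable PCIe Gen4 x4 bandwidth, GB/s
QLC_NAND_GBPS = 3.0       # assumed raw QLC flash array bandwidth, GB/s

for ratio in (1.0, 2.0, 3.0):
    logical = QLC_NAND_GBPS * ratio              # compression amplifies NAND side
    effective = min(logical, PCIE_GEN4_X4_GBPS)  # still capped by the bus
    print(f"{ratio}:1 compression -> {effective:.1f} GB/s logical throughput")
```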
You're not the only vendor that's told us that QLC will kill off disk drives, but I'm still
not convinced. But that's a subject for another discussion.
No, certainly I'm not claiming that QLC will kill the hard drive. I believe the
hard drive has its own position, absolutely. But for certain areas, for certain
kinds of use cases, once we can improve the performance of QLC and at the same time transparently
reduce the cost of QLC further, then for some scenarios it may make more sense to use
QLC rather than...
Right, right.
...disk drives in those cases.
All right.
Well, this has been great.
Matt, any last questions for Tong before we close?
No, but Ray, you did a great job on this one.
Well, you know, I've been interested in the company, quite frankly. Back in the 1990s we were doing these sorts of things with a storage controller that fit in a rack, you know, and had 180 gig drives and stuff like that. So yeah, it was crazy.
All right, Tong, anything you'd like to say to our listening audience before we close?
Sure.
Yeah, we believe that computational storage drives will
really revolutionize the whole storage and computing
industry.
And at ScaleFlux, we sincerely look forward
to working with the whole industry
to really make this happen
and to contribute to the future growth of the whole industry.
Okay.
All right.
Well, this has been great.
Thank you very much, Tong, for being on our show today.
Thank you.
Thank you very much for offering me this opportunity.
That's it for now.
Bye, Matt.
Bye, Ray. Bye, Tong.
Until next time, when we will talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it, and please review us on Apple Podcasts, Google Play, and Spotify, as this will help get the word out.