Storage Developer Conference - #105: Dual-Mode SSD Architecture for Next-Generation Hyperscale Data Centers
Episode Date: August 13, 2019...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast, Episode 105.
Good morning, everyone. My name is Feng Zhu. I'm a storage architect from the Alibaba Infrastructure Service Group.
Thanks for the opportunity to be here today to present Alibaba's continuing efforts to address the storage solution challenges for hyperscale data centers.
My presentation today is mostly on the dual-mode SSD architecture for next-generation hyperscale data centers.
So, for today's hyperscale data centers, we are facing various challenges from a storage solution perspective.
Here we summarize those challenges, especially for the Alibaba data centers.
We have five high-level issues we need to overcome from a storage perspective.
The first one is that the workloads have become much more diversified.
Take Alibaba as an example. We have so many business units, like e-commerce, search, cloud computing, cloud storage, as well as online promotion, and so on.
All of those business units run very different workloads.
And even within one business unit, because the business is growing so fast, their workloads change a lot.
Alibaba Group has mostly been buying SSDs and storage hardware from vendors,
but that generic storage hardware is usually optimized for certain workloads.
It's not really customized to the specific workloads that actually run in our data centers.
So that's one challenge we're facing.
And also, because the data center scale is becoming larger and larger, we keep facing pressure on TCO.
We buy so many SSDs every year, so how to reduce the cost is always a challenge in front of us.
And also, for the data centers, we cannot rely on just one vendor.
For each product, we usually require at least three vendors.
We want to keep multiple vendors to relieve the strain on the supply,
but more vendors also means paying more cost on qualification.
So we want to keep multiple vendors but still reduce the complexity and overhead of qualification.
That's another challenge we're facing.
And also, the standard block device is more like a black box. I think that's a well-known topic recently with open channel.
With a standard block device, from the host perspective, we cannot control the I/Os inside the device.
We just send the I/Os to the device, and the device does whatever it wants, with so many background operations, garbage collection and so forth.
But for the data centers, we want to control that data placement, especially for latency control and quality of service, which is becoming more and more important.
So we want more control and more I/O determinism.
The fourth issue we are facing is the quick response to issues and also the quick response to new feature requests.
Our experience is that when we work with hardware vendors, if there's an issue, it really takes maybe one quarter to get that issue fixed.
We report the issue and talk to the field engineers, then they talk to the internal engineering team,
they get a log from us and look at it, and say, oh, this is your problem, this is my problem.
And then, if it's truly diagnosed as a hardware problem, it shifts back to the vendor,
and it takes some time to really debug it and provide a solution.
I think it's at least one quarter. For an internet company, one quarter is too long,
so we do want a way to address those online issues very quickly.
And also, in a very dynamic business environment, for new features we also want a very quick solution to address our needs.
But again, just like bug fixes, new features also take a long time.
And in the end, I think performance is truly the most important one.
We want better and better performance, but unfortunately, for the current standard block devices, the hardware and software are optimized separately.
The software doesn't know what's happening on the hardware side,
and the hardware doesn't know what the business, the upper-level storage software, is trying to do.
This kind of gap limits the performance improvement from a storage perspective.
So because of all those challenges, our Alibaba Infrastructure Service Group started to address them by developing our own solution for SSDs, the NAND-flash-based SSDs.
We started this effort in 2016.
Until today, we have released three generations of AliFlash SSDs.
AliFlash V1 is a high-performance PCIe SSD with a host-based FTL, so we have some control over the software and hardware together.
It was developed in 2016 and then deployed at the end of 2016.
Until today, it has been in service in Alibaba data centers for more than two years, and we have more than 20K devices deployed.
So far, it's going very well, very stable.
And then we moved to AliFlash V2 in 2017, which is a U.2 SSD form factor.
We also started customization for certain specific business units and workloads,
and compared to purchasing from storage vendors, we successfully reduced the cost by roughly 20%.
AliFlash V3 is what we are currently working on. It's a dual-mode SSD with open channel support, and it is indeed my primary focus for today's presentation.
For this one, we do deep optimization for the business units' applications.
And also, as I said, it's an ongoing effort; we are in the process of productizing this dual-mode SSD.
So here, let me just give a little more detail on the productization of this AliFlash V3.
First of all, this is a self-developed open channel SSD, which is an ongoing effort in Alibaba data centers.
The graph on the right-hand side is showing the controller of this AliFlash V3. It's indeed our own controller.
We worked with an ASIC design company to customize this controller, especially to optimize the performance and also for the open channel usage mode.
We already announced the major milestone at FAST 2018 this year.
And also, this open channel V3 is indeed a platform. We are not just working with one vendor; we are working with multiple vendors.
The goal here is truly to build an ecosystem, to have a unified platform and have multiple vendors provide the hardware for this dual-mode SSD with a focus on open channel.
These are photos of the sample from one of our vendors for AliFlash V3.
So, this page summarizes why we chose the dual mode, especially open channel, for our next-generation data center storage solutions.
The right-hand side is the high-level architecture of the dual mode, from the bottom to the top.
The bottom part is the Alibaba open channel SSD; we call it AOC SSD.
It's indeed from our collaborators; we have multiple collaborators, and at the end of this presentation I'll show you our partners and collaborators for that hardware.
With this open channel SSD engine, Alibaba Group builds the software stack on top of it.
So that's what is shown here: we have our Alibaba open channel SSD specification.
So this is our own SSD specification. It's different from the open channel 1.2 and 2.0 specifications you may be familiar with in the industry.
I think we absorbed some of the good ideas from open channel 1.2 and 2.0,
and then, based on our needs and our specific requirements, we added a lot of new commands
to truly have a spec which we believe is good for productization.
And here, with this specification, we actually require dual mode, which means the hardware is exactly the same,
but we can switch the SSDs between the open channel mode and the standard device mode.
The vendors can simply configure the drives into either open channel mode or standard device mode in the fab and then ship them to us.
The reason we chose this is that, as a first generation of open channel SSDs, we want to keep the flexibility for deployment.
Our business units can still choose whether they want the open channel mode or the device mode.
So for the infrastructure group, we just have one unified hardware, with the flexibility to configure it based on the business units' needs.
On top of this open channel SSD platform, our team has our storage software stack.
We call it the Fusion Engine. It's basically a user-mode storage engine
with different capabilities, like a device manager and different tools to manage statistics and track the I/Os.
And on top of that, we connect with different applications.
And with this architecture, we believe that, first of all, we enable hardware and software co-optimization.
The hardware and the firmware are provided by vendors, but on top of that we have our open channel drivers, fully controlled by Alibaba Group,
and we have our storage engine, and we have our internal business units.
So we have those three layers combined together, and then we can do full co-design and co-optimization.
And then we essentially control the FTL, so we have the maximum flexibility to optimize for different workloads.
For a latency-sensitive application, we can choose the generic block device mode,
or, based on the special requirements of that business unit, we can customize our FTL to their specific needs.
I'll have one page to show a high-level example of what we do there.
So this provides a deep integration with applications.
And also, since we moved the FTL from the device to the host side, from the SSD hardware perspective we actually reduced the scope.
With this change, we believe we can reduce the time and the complexity of SSD qualification
when we introduce a specific hardware into our data center environment.
And also...
Can I ask?
Sure.
Is this the ultimate solution, the perfect solution?
As I said earlier, the reason we chose dual mode is kind of a trade-off.
I think we are the first one, I believe we are the first one in the industry, to productize this open channel, so there are risks here.
So we want to keep the flexibility, so the business units can choose.
If they think, okay, I'm more comfortable with the block device, they can switch the drive back to the standard device mode.
This kind of thing also sometimes depends on the expertise of the different business units.
They may not have the expertise to customize their software for open channel,
so in that case they may choose to use the block device mode, the standard device mode.
I think you can only have a few of those. It is difficult to reduce the cost of the hardware because...
Yeah, yeah, I agree with you. That's why I think there's a trade-off at this point.
When we have the first generation massively deployed in our data center, we will learn from it,
and then we will figure out the best way to deploy those devices for different business units' requirements.
And then, in stage two of our open channel, we will try to do things like reducing the memory in the SSD to reduce the cost.
That will be a stage two plan.
But in stage one, we are keeping the identical hardware.
So it's really a staging product, an intermediate product?
I would think so, yeah.
I have a similar question. So all the FTL management is moved to the host.
I would imagine the host side still has a lot of daunting tasks to do.
And then I have a question: along the way from the device to the host, is there any acceleration in between? Or no?
There will be. There will be. That's in our roadmap.
Yeah, so I think that's a very good question.
When you move the FTL to the host side, first of all, there's a lot of memory consumption there, and there's a lot of CPU computation there.
When we do this internally in Alibaba Group, we do see the concern from the business units regarding those things:
you consume so much memory, and you consume so many CPU cycles.
It's not a fundamental blocker, because nowadays in data centers we have separate computing nodes and storage nodes,
and I think those storage nodes do have enough CPU and memory to support open channel.
But if we can figure out a way to offload those things from the host, that would be good.
It's indeed in our roadmap, like you said, hardware acceleration.
It's in phase two of the open channel deployment.
Thank you.
Okay.
Okay.
So this page is the open channel architecture for the stage one dual-mode SSDs.
The right-hand side is a diagram showing the roles and responsibilities between the device and the host.
Compared to traditional SSDs, which basically encapsulate everything inside the device,
here, as I said earlier, we actually move most of the FTL to the host side.
In the device, we still keep the NVMe command handling, the media characteristics monitoring capability,
and of course the ECC engine (there's no way for the host to do the ECC), the RAID engine, XOR, and SMART.
And then, for the FTL functionalities that impact data placement and I/O scheduling, we move all of them to the host side,
including the I/O scheduling, data placement, background data refresh, garbage collection,
wear leveling, as well as the bad block management.
So here, one thing I want to mention is that even though we move all those data refresh, garbage collection, and wear leveling functions to the host side,
those are in the category of media management.
We actually just make the host side responsible for initiating the handling, but the media characteristics monitoring is still in the device.
In other words, we don't really want the host side to monitor the media characteristics, because that's too much effort,
and those kinds of things are very media-specific. I think it's better to give that control to the vendor;
they know their NAND medium better than us. So we just want the device to give us information saying,
this media is at risk, you need to do something to handle it, and then the host will initiate the handling.
If you are familiar with the SSD FTL, for that handling you probably need to refresh the data and also do some management there.
So those things will be controlled by the host, but the monitoring will still be controlled by the device.
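To make that division of labor a bit more concrete, here is a minimal C sketch of how a host-side FTL for this kind of dual-mode device might be organized. It is only an illustration under assumptions: all of the names (aoc_dev_ops, aoc_host_ftl, ftl_place_write, and so on) are hypothetical and are not the actual Alibaba open channel driver interface.

```c
/*
 * Hypothetical sketch of the stage-one responsibility split.
 * Device keeps: NVMe command handling, ECC, RAID/XOR, SMART, and
 * media characteristics monitoring.
 * Host owns: data placement, I/O scheduling, garbage collection,
 * wear leveling, bad block management, and background data refresh.
 */
#include <stdint.h>

/* Operations the device (vendor firmware) still implements. */
struct aoc_dev_ops {
    int (*read_ppa)(uint64_t ppa, void *buf, uint32_t len);        /* ECC handled in device */
    int (*write_ppa)(uint64_t ppa, const void *buf, uint32_t len);
    int (*erase_block)(uint32_t pu, uint32_t blk);
    /* The device monitors media health and reports it; in open channel
     * mode it does not relocate data on its own. */
    int (*poll_media_event)(uint32_t *pu, uint32_t *blk, int *severity);
};

/* Host-owned FTL state: everything that decides where and when I/O happens. */
struct aoc_host_ftl {
    struct aoc_dev_ops *dev;
    uint64_t *l2p_table;         /* logical-to-physical mapping lives on the host */
    uint32_t *erase_counts;      /* input to host-side wear leveling */
    uint8_t  *bad_block_bitmap;  /* host-side bad block management */
};

/* Host decides placement: pick a physical address for a logical write. */
uint64_t ftl_place_write(struct aoc_host_ftl *ftl, uint64_t lba);

/* Host schedules background work (GC, wear leveling, data refresh) at
 * times it chooses, which is what gives it control over tail latency. */
void ftl_run_background(struct aoc_host_ftl *ftl);
```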
How does the device provide that kind of information?
It's vendor-specific. But the general idea is, you know the AER, right?
Asynchronous Event Request.
So that could be a way, a mechanism, for the device to notify the host.
So the whole device interface is still largely based on NVMe?
Yeah, it's still like NVMe. Customized NVMe, yeah.
So if the device informs the host with this kind of AER, like the media going bad, does that hurt the I/O determinism?
You mean I/O determinism?
Yeah.
So, I think this approach is actually good for I/O determinism.
A traditional SSD will do that work for you, right? Without your permission.
If the traditional SSD sees that the data there has some retention issue, it will just move the data,
and the host side has nothing to do with it and doesn't have any control there.
But in this way, the device will notify the host and say, oh, you have a data retention problem,
you may choose to move the data sooner or later.
And then it's fully up to the host side when to do this work.
So from that perspective, you can actually achieve a certain level of I/O determinism.
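As a rough sketch of the flow just described, here is a hedged C example of a host-side handler that records a hypothetical media-health notification and performs the relocation only when the host decides the time is right. The event structure and all function names are invented for illustration and are not taken from the Alibaba open channel specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical media-health event as the device might report it through an
 * asynchronous notification (AER-style); the fields are illustrative only. */
struct media_health_event {
    uint32_t pu;        /* parallel unit (NAND die) */
    uint32_t block;     /* physical block the device has flagged */
    int      severity;  /* 0 = advisory, higher = more urgent */
};

#define MAX_PENDING 1024
static struct media_health_event pending[MAX_PENDING];
static int n_pending;

/* Hypothetical helper that would read the flagged block and rewrite its
 * valid data elsewhere; declared only, since this is a sketch. */
void relocate_block(uint32_t pu, uint32_t block);

/* Called when the device notifies the host: record the work, do not act yet.
 * The device only reports; deciding when to move data is the host's job. */
void on_media_event(const struct media_health_event *ev)
{
    if (n_pending < MAX_PENDING)
        pending[n_pending++] = *ev;
}

/* Called from the host FTL's background loop at a moment that does not
 * collide with latency-critical foreground I/O. */
void run_deferred_refresh(bool foreground_idle)
{
    if (!foreground_idle)
        return;  /* the host chooses "later" to protect tail latency */
    while (n_pending > 0) {
        struct media_health_event ev = pending[--n_pending];
        relocate_block(ev.pu, ev.block);
    }
}
```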
Do you want us to ask questions during the talk or towards the end?
Yeah, I think it's better to hold them for now.
We have 15 minutes, so we still have enough time.
Let me finish the presentation, then we'll move to the Q&A part.
Okay, so with this architecture, the host side actually gains the most important part.
We have direct access to the physical media and maximize the utilization of the media capability.
We fully control what is really important for the upper storage software level: the data placement, the I/O scheduling,
and also coordinating the garbage collection, eventually resulting in reduced write amplification, which is critical for the SSD application.
So before I move on to give some high-level ideas of how we customize our FTLs for different business requirements,
I just want to give a high-level overview of what we do to customize the FTLs in our data centers.
So this is just the flow of how we usually do the co-optimization and co-design.
We start from the applications. As an infrastructure group, we have many customers inside Alibaba Group,
so we talk to the different business units, like Alicloud, Alipay, Tmall, and the other e-commerce ones.
We understand their requirements, abstract them, and do the requirement analysis.
Based on that, based on the hardware we have and also what our collaborators can provide, we do co-design and co-optimization.
This is kind of a loop process.
Then, based on that, we generate in-house storage solutions, provide the solutions back to the business units on the application side,
and talk to them: do you like this? And they will provide us feedback: oh, this truly meets our requirements,
or, no, I think you should still put more effort into this or that.
So the feedback goes through several rounds, and then we arrive at a true understanding of the requirements
and come out with an optimized solution based on their needs.
So this page summarizes our software stacks for open channel.
The whole software stack is developed by ourselves, and we have two versions here.
One is the kernel-mode Alibaba open channel SSD driver with a block device FTL.
That's what we have developed and also deployed in our data centers.
On top of that, we have user-space SSD drivers integrated with our own storage engine,
what we call the Fusion Engine software, working with the internal business units.
And also, in the user-space mode, we customize different FTL solutions to optimize for certain business units' requirements,
as well as the management, monitoring, and test tools in this whole software package.
So in the next couple of slides, I will just give a little more detail on the kernel mode and the user mode, and one example of that customized FTL.
So this is the structure of our kernel-space Alibaba open channel SSD solution.
We have this multi-vendor open channel SSD hardware, of course with the firmware,
and on top of that we have our kernel mode FTL,
which currently is mostly a standard block device FTL.
And on top of that, as I said, we do this software
and hardware co-optimization, co-design,
so we always connect it to our storage engine,
what we call the Fusion Engine,
which is a user space storage engine.
And then to the applications.
So this works as a generic block device
and covers most of the legacy use cases
in Alibaba data centers right now.
Our expectation is that it will provide equivalent performance and functionality to standard SSDs.
As I said, it's step one for the deployment of the open channel SSDs.
On top of that, we do certain customizations to improve the performance and latency;
we do some I/O scheduling there to improve the quality of service.
This page shows the user-space Alibaba open channel SSD solutions.
Compared to the previous page, we no longer have the kernel-mode drivers.
With the open channel hardware, we just connect to our Fusion Engine, which is fully in user space.
In this Fusion Engine layer, we have the SSD driver, which is also in user space.
And then, based on the different business units' requirements, we actually customized several different kinds of FTLs.
On top of that, we have the user-space file system, and we're also working on a KV engine for those specific applications.
So with this user-space stack, we actually achieve much lower software overhead, better performance,
and also deep integration with certain applications,
while still keeping the optimized data placement and also the coordinated background data tasks.
Regarding the customized FTL, we are actually doing something called object SSD, which is under development.
Basically, instead of just providing the block API to the upper level, we provide a so-called object API to the upper level.
I'll have one page to show that idea.
So this is our object SSD. The bottom part is still the open channel SSD hardware.
The white boxes we show here, you can think of them as NAND blocks in the NAND SSD.
And then we have something called an object.
Basically, in an SSD you erase a NAND block, right?
In this customized FTL, we actually group multiple blocks together;
you can think of it as a superblock or a band in the standard SSD concept.
But now, with this object API, the delete granularity becomes the object.
It's no longer just an LBA or NAND blocks.
So the device is organized into objects, and then, through the Fusion Engine,
we encapsulate everything and just expose the object API to the upper-level software.
When we say upper level here, we specifically refer to Pangu,
which is our internal distributed storage system in Alibaba data centers.
They have their chunk servers. We basically organize all the data into objects
and then use an append-only write I/O pattern to write to our devices,
through the Fusion Engine and onto our open channel SSDs.
So in this way, we actually achieve much lower write amplification and significantly improve the SSD lifetime,
and also, through the user-mode I/O scheduling,
we actually improve the performance uniformity and the quality of service by 5x.
Is there awareness of the objects within the drive or in the software?
So the object is actually mapped to a physical group of NAND blocks in the drive.
But the software, from this Fusion Engine, will just expose the object APIs to the upper-level software.
So it's no longer like an LBA.
The drive is plain open channel?
Yeah, the drive is open channel.
No objects in the drive?
No objects in the drive; the software just forms the objects on the drive.
Yeah.
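To make the object abstraction a little more concrete, here is a minimal, hypothetical C sketch of what an object-style API on top of grouped NAND superblocks could look like. None of these names come from the Fusion Engine; they only illustrate the idea that allocation, append, and deletion happen at object granularity rather than per LBA.

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>   /* for ssize_t */

/* Hypothetical object handle: one object maps to a group of NAND blocks
 * (a superblock or band) chosen by the host-side FTL. */
struct obj_handle {
    uint64_t object_id;
    uint32_t pu_mask;       /* which parallel units (dies) back this object */
    uint64_t capacity;      /* bytes the object can hold */
    uint64_t write_offset;  /* append-only write pointer */
};

/* Allocate an object; the FTL picks and erases a fresh group of blocks. */
struct obj_handle *obj_create(uint64_t capacity_hint);

/* Append-only writes: the caller never chooses an LBA, only an object. */
ssize_t obj_append(struct obj_handle *obj, const void *buf, size_t len);

/* Random reads within an object are still possible. */
ssize_t obj_read(struct obj_handle *obj, uint64_t offset, void *buf, size_t len);

/* Deletion happens at object granularity: the whole group of blocks is
 * invalidated together and can be erased with little or no valid data to
 * copy, which is where the lower write amplification comes from. */
int obj_delete(struct obj_handle *obj);
```

The interesting call is the last one: because a chunk is written append-only and dropped as a unit, the blocks behind an object become invalid together, so the host FTL rarely has to copy valid pages during garbage collection.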
So this page shows, I already mentioned the I/O scheduling and quality of service improvement several times,
so here is just an example of what we do in either the kernel-level FTL or the user-mode FTL,
how we do precise I/O scheduling to improve the latency.
The left-hand side is just showing the open channel architecture again, and the right-hand side
is showing the high-level idea of how we do this precise I/O scheduling in the driver-level FTL.
So here we have those PUs. This is indeed the open channel 2.0 concept; you can think of it as just a NAND die.
If you're familiar with the standard SSD FTL, the SSD firmware actually maintains a command queue inside the SSD,
and the host side doesn't have any control over it.
You just send the I/O to the drive, but you don't know when the I/O will be executed.
But with open channel, the situation is totally different.
Now the host has the knowledge of the priority, how critical the I/O is,
and the host also has full control of the command queue for each die, each NAND die, because of open channel.
So here we just have the driver maintaining a queue for each NAND die; we call it a PU.
The red color here basically means high-priority read commands.
Here we just use read commands as an example, but it's a similar idea; it can be extended to writes.
With those different priorities, the FTL can actually reorder the commands in the queues at the driver level.
And because of this reordering, we can achieve better latency and better control of the I/O,
so the critical, high-priority ones get served first.
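As a rough sketch of that idea (not the actual driver code), the per-die queues with priority reordering might look like the following in C. The structures, the two priority levels, and the queue sizes are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PUS      64   /* one queue per NAND die (PU); the count is illustrative */
#define QUEUE_DEPTH 128

enum io_prio { PRIO_HIGH = 0, PRIO_NORMAL = 1, PRIO_LEVELS = 2 };

struct io_cmd {
    uint64_t ppa;      /* physical address within this PU */
    void    *buf;
    uint32_t len;
    bool     is_read;
};

/* One FIFO ring per priority level, per PU. Because the host owns these
 * queues in open channel mode, it can drain high-priority reads first. */
struct pu_queue {
    struct io_cmd cmds[PRIO_LEVELS][QUEUE_DEPTH];
    int head[PRIO_LEVELS];
    int tail[PRIO_LEVELS];
};

static struct pu_queue queues[NUM_PUS];

/* Enqueue a command for one die at a given priority. */
int submit_io(uint32_t pu, const struct io_cmd *cmd, enum io_prio prio)
{
    struct pu_queue *q = &queues[pu];
    int next = (q->tail[prio] + 1) % QUEUE_DEPTH;
    if (next == q->head[prio])
        return -1;                    /* queue full; the caller retries later */
    q->cmds[prio][q->tail[prio]] = *cmd;
    q->tail[prio] = next;
    return 0;
}

/* Called when the die becomes idle: pick the next command to issue.
 * Serving PRIO_HIGH before PRIO_NORMAL is the reordering that shortens
 * the tail latency of critical reads. */
bool next_io(uint32_t pu, struct io_cmd *out)
{
    struct pu_queue *q = &queues[pu];
    for (int p = 0; p < PRIO_LEVELS; p++) {
        if (q->head[p] != q->tail[p]) {
            *out = q->cmds[p][q->head[p]];
            q->head[p] = (q->head[p] + 1) % QUEUE_DEPTH;
            return true;
        }
    }
    return false;                     /* nothing pending on this die */
}
```

In a real driver the policy could of course be richer (per-tenant weights, write throttling, and so on), but the essential point is the same: the host, not the device firmware, decides the order in which commands reach each die.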
So this is just to show our experimental data from our user-mode FTL with the I/O scheduling scheme I just described on the previous page.
You can see that for the read-only workload we actually achieve a very significant improvement.
The light blue is the baseline, and the deep blue is the user-mode FTL with the I/O scheduling implemented.
It's roughly a 5x improvement, and if you look at the read and write mixed workload,
the improvement is even higher, especially for the average write latency.
So this is our current status for the open channel SSD.
As I've mentioned several times, we are in stage one.
We are the first in the industry to productize these open channel SSDs.
We already have a small-volume deployment in Alibaba data centers, and massive deployment is expected in the first half of next year.
As I said, the Alibaba open channel is not just a single vendor or a specific storage device.
We are targeting to build an ecosystem. That's why we work with many collaborators in our industry.
Intel is our strategic partner. We started working with Intel in 2017 to do the co-development and the co-validation.
Intel provides the hardware and the firmware, and we develop the whole host-side software,
and then we do co-validation, which basically means the whole software stack is validated in Intel's standard qualification flow.
Intel, as an SSD vendor, I think has a very good reputation for its SSD qualification flow and also for stability,
so we do expect that kind of collaboration will give our open channel product very good stability in the Alibaba data centers.
And in addition to Intel, we also work with Unix, Micron, Hynix, Shannon, which is in Shanghai, and also CNEX Labs.
Future work. As I said, this is just step one.
We have open channel; we opened a door to give the host side full control of the underlying hardware devices.
So we are also looking at the opportunity to combine multi-stream,
as well as further I/O scheduling and quality of service optimization,
low-latency and high-endurance solutions, and QLC.
I see the industry right now is moving to QLC, so we're also working on that.
Also 3D XPoint.
So all those are innovations to bring into our open channel platform.
Going forward, from our perspective, we believe user-mode customization and hardware-software co-design and co-optimization
are the true key methods to address the challenges hyperscale data centers face from the storage system perspective.
And as a hyperscale data center, we focus on the software, but collaboration is needed.
We welcome all the hardware and solution vendors to work with us to generate the next innovations for hyperscale data centers.
So to conclude my presentation:
hyperscale data centers are facing unique challenges, and we believe the hardware-software integrated solution is the key,
especially based on the open channel scheme.
We customize the designs for each application individually, and we are open to industry collaborations.
Right now, in addition to open channel, we are also working on near-storage computing, memory computing, and new storage media.
So if anyone is interested in that, please feel free to stop by, and we can discuss potential collaborations.
Thanks.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers in the Storage Developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.