Storage Developer Conference - #105: Dual-Mode SSD Architecture for Next-Generation Hyperscale Data Centers

Episode Date: August 13, 2019

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 105. Good morning, everyone. My name is Feng Zhu. I'm a storage architect from Alibaba Infrastructure Service Group. Today, thanks for the opportunity to be here to present Alibaba's continuous efforts to address the storage solution challenges for the hyperscale data center. My presentation today is mostly on the dual-mode SSD architecture
Starting point is 00:01:05 for the next-generation hyperscale data centers. So, for today's hyperscale data centers, we are facing various changes from a storage solution perspective.
Starting point is 00:01:23 Here we summarize the changes, especially for the Alibaba data centers. We have five high-level issues or changes we need to overcome from a storage perspective. The first one is that the workload becomes much more diversified. Like we use Alibaba as an example.
Starting point is 00:01:47 We have so many business units, like the e-commerce, search, cloud computing, cloud storage, as well as online promotion, and so on and so forth. So all those business units require very different workloads. And even within one business unit, because the business is going so fast,
Starting point is 00:02:13 their workloads are changing a lot. So for Alibaba Group, we were mostly buying the SSDs and storage hardware from the vendors. But that generic storage hardware is usually optimized for certain workloads. It's not really customized to the specific workloads that are actually running in our data centers.
Starting point is 00:02:44 So that's one challenge we're facing. And also, because the data center scale is becoming larger and larger, we keep facing pressure on the TCO. We buy so many SSDs every year. How to reduce the cost is a constant challenge for us. And also, for the data centers, we cannot rely on just one vendor. Usually, for each product, we require at least three vendors.
Starting point is 00:03:12 So we want to keep multiple vendors to release the string on the supply, but also want to, more vendors means you need to pay more costs on the qualification. So we want to keep multiple vendors, but still reduce the complexity and overhead on the qualification. So that's another challenging way of facing. And also for the
Starting point is 00:03:35 standard block device, it's more like a black box. I think open channel is a well-known concept recently. For the standard block device, from the host perspective, we cannot control the IOs inside the device. It's more like we just send the IOs to the device,
Starting point is 00:04:01 and whatever the device does, like so many background operations, garbage collection and so forth. But for the data centers, we want to control all that data placement, especially for latency control and quality of service, which become more and more important.
Starting point is 00:04:18 So we want more control and more IO determinism. The fourth issue we are facing is the quick response to issues and also to new feature requests. Our experience is that when we are working with the hardware vendors, if there's an issue, it really takes maybe one quarter to get that issue really fixed.
Starting point is 00:04:49 We report the issue and then talk to the field engineers, and then we talk to the internal engineering team, and then they get a log from us and look at it, and say, oh, this is your problem, this is my problem, and then if it's truly diagnosed as a hardware problem, then we ship it right back to the vendor, and then it takes some time to really debug it
Starting point is 00:05:12 and provide solutions. I think it's at least one quarter. For an internet company, I think that's one quarter. It's too long, so we do want to have a way to address those online issues very quickly. And also for those in a very dynamic business environment for new features, we also want a very quick solution there to address our needs. But again, just like the bug fix, the new features also takes a long time.
Starting point is 00:05:48 And in the end, I think performance is truly the most important one. We want to have better and better performance, but unfortunately I think for our current standard block devices, I think the hardware and software are kind of optimized separately. So the software part doesn't know what's happening in the hardware side, and the hardware side doesn't know what the business, the upper-level storage software is trying to do.
Starting point is 00:06:20 So this kind of gap limits the performance improvement from a storage perspective. Because of all those challenges, our Alibaba Infrastructure Service Group started to address them by developing our own solution for the SSDs, the NAND-flash-based SSDs. We started this effort in 2016, and to date we have released three generations of AliFlash SSDs.
Starting point is 00:06:57 So the AliFlash V1 is a high-performance PCIe SSD. It's host-based FTL. So we have some control on the software and hardware together. It was developed in 2016 and then deployed at the end of 2016. So until today, it has been in service in Alibaba data center for more than two years, and we have more than 20K devices deployed in the Alibaba data center. So far, it's going very well, very stable.
Starting point is 00:07:35 And then we moved to the AliFlash V2 in 2017, which is a U.2 SSD form factor. And also, we started customization for certain specific business units and certain workloads, and compared to purchasing from the storage vendors, we successfully reduced the cost by roughly 20%. AliFlash V3 is what we are currently working on. It's a dual-mode SSD with open channel support.
Starting point is 00:08:09 this is indeed my primary focus for today's presentation for this one we do deep optimization for the business units applications and also this one as I, it's an ongoing effort. We are in the progress of prioritizing this dual-mode SSD. So here, let me just give a little bit more details on the productization of this Aliflash V3. First of all, this is a self-developed open channel SSD, which is ongoing efforts in Alibaba data centers.
Starting point is 00:09:00 The graph on the right-hand side here is showing the controller of this AliFlash V3. It's indeed our own controller. We work with an ASIC design company and customize this controller, especially optimized for performance and also for the open channel usage mode. We already announced this major milestone at FAST 2018 this year. And also, this open channel V3 is indeed a platform. We are not just working with one vendor. We are working with multiple vendors.
Starting point is 00:09:41 So the goal here is truly to build an ecosystem, to have a unified platform and have multiple vendors to provide the hardware for this dual mode and SSD with focus on the open channel. So this is photos for the sample for one of our vendors for the Ali Flash Race 3. So, this page summarizes why we chose the dual remote, especially the open channel for our next generation data center storage solutions. So the right hand side is our high level architecture
Starting point is 00:10:35 of the dual mode, from the bottom to the top. The bottom part is the Alibaba open channel SSD. We call it AOC SSD. It's indeed from our collaborators; we have multiple collaborators, and at the end of this presentation I'll show you our partners and collaborators for that hardware. And with the open channel SSD engine, Alibaba Group builds the software stack
Starting point is 00:11:10 on top of it. So that's what is showing here. We have our Alibaba open channel SSD specification. So this is our own SSD specification. It's different from the, I guess you may be familiar with this open channel 1.2 and 2.0 specifications in the industry. Our SSD specification is our own SSD specifications. I think we absorbed some of the good ones from those open channel 1.2 and 2.0, and then based on our needs and our specific requirements, we had a lot of new commands here to truly have a spec which is, we believe is good for the productization.
Starting point is 00:12:00 And here, with this specification, we actually require a dual mode, which means that the hardware is exactly the same, but the SSDs can be switched between the open channel mode and the standard device mode. So the vendors can just configure the drives into either open channel mode or standard device mode in the fab and then ship them to us. The reason we chose this is because, as a first generation
Starting point is 00:12:29 of open channel SSDs, we want to keep the flexibility for deployment. So our business units can still choose if they want to do the open channel mode or they want to do the device mode. So for the infrastructure group,
Starting point is 00:12:47 so we just have one unified hardware as a flexibility to configure it based on the business unit's needs. With this open channel SCC platform, and on top of it, we have, our team has our storage software stack. We call it the Fusion Engine. It's basically a user mode storage engine
Starting point is 00:13:17 and have different capabilities like a device manager and different tools to manage the statistics and track the IOs there. And on top of that, we connect with different applications here. And with this architecture, we believe that, first of all, we enable hardware and software co-optimization. So with the hardware and the firmware provided by vendors, but on top of it, we have our open channel drivers, fully controlled by Alibaba Group. And we have our storage engine. And we have our internal business units. So we have those three layers combined together,
Starting point is 00:14:11 and then we can do a fully co-design and co-optimization. And then we essentially control the FTL, so we have the maximum flexibility to optimize for different workloads. We can choose, for the latency application, we can choose the generic block device mode, or we can, based on the special requirements for that business unit,
Starting point is 00:14:37 we can customize our FTO to their special needs. I'll have a one page to show a high-level example like what we do there. So this provides a deep integration with applications. And also, since we moved the FTL from the device to the host site, so from the SSD hardware perspective, we actually reduced the scope. So with this change, we believe we can reduce the time and the complexity of the SSD qualifications
Starting point is 00:15:12 when we introduce a specific hardware into our data center environment. And also... Can I ask? Sure. Is this not the ultimate solution, more an intermediate one? I would think so.
Starting point is 00:15:29 As I said earlier, the reason we chose DuraMod is kind of a trade-off. So I think we are the first one. I believe we are the first one in the industry to productize this open channel. So there's risks here. So we want to keep the flexibility. So the business units, they can choose. If they think, okay,
Starting point is 00:15:52 I'm more comfortable with that block device, they can switch it, drive it back to the standard device mode. This kind of thing also sometimes depends on the expertise of the different business units.
Starting point is 00:16:09 They may not have the expertise to customize their software for the open channel. So in this way, they may choose to use this block device mode, standard device mode. I think you can have a few of those. It is difficult to reduce the cost of the hardware because Yeah, yeah, I agree with you.
Starting point is 00:16:32 That's why I think there's a trade-off at this point. When we have the first generation massively deployed in our data center and then we nerf on it, and then we will figure out what's the best way to deploy those devices for different
Starting point is 00:16:49 business units requirements. And then we can, in the stage two of our open channel, we will try to do like the reduce memory in the SSD to reduce the cost. That will be a stage two plan. But in stage one, we are keeping the identical hardware. Do you know if it's really a staging product, an intermediate product? I would think so, yeah.
Starting point is 00:17:19 I have a similar question. So all the FTL management products are studied. I would imagine that cost-per-service still has a lot of daunting tasks to do. And then I have a question. So along the way from the device to the host, is there a acceleration in between? Or no?
Starting point is 00:17:38 There will be. There will be. That's in our roadmap. Yeah, so I think that's a very good question. You move FTL to the host side, first of all, there's a lot of memory consumption there, and there are a lot of CPU computation there.
Starting point is 00:17:55 When we do this internally in Alibaba Group, we do see the concern from the business units regarding those, like, you consume so many memories, and you consume so many CPU cycles there. It's not a fundamental blocker, because nowadays in data centers,
Starting point is 00:18:12 we have the separate computing load and the storage load. I think those storage loads do have enough CPU and memories there to support the open channel. But if we can figure out a way to offload those things from the host, that would be good. It's indeed in our own app, like you said, hardware acceleration.
Starting point is 00:18:37 It's in our phase two, phase two of open channel deployment. Thank you. Okay. Okay. So this page is open channel architecture for stage one dual-mode SSDs. So the right-hand side is
Starting point is 00:19:08 a paragraph to show the role and responsibilities between the device and the host. So compared to the traditional SSDs,
Starting point is 00:19:24 everything is traditional SSC basically encapsulates everything into the device, but here as I said earlier, we actually move the most of the FTOs to the host side so in device we
Starting point is 00:19:39 still keep those the MME commander and then the media and characteristics monitoring to keep those, the MME commander, and then the median characteristics monitoring capability, and of course ECC engine, there's no way for the host to do the ECC, and RAID engine, XOR, and those smart. And then for the FTL functionalities,
Starting point is 00:20:04 which is impacting the data placement and IO scheduling, we move all of them to the whole site, including the IO scheduling, data placement, background data refresh, garbage collection, rare leveling, as well as the band block management. So here I think one thing I want to mention is that even though we move all those data refresh, garbage collection, we're leveling to a host site,
Starting point is 00:20:34 I think those are in the category of the media management. We actually just moved the whole site to be responsible for the initialization for those handling, but the median characteristics monitoring is still in the device. So in other words, we are looking for something like we don't really want the whole site to monitor the media characteristics because that's too much effort. And also it's not really the – those kinds of things are very media-specific. I think it's better to give that control to the vendor. They know their land medium better than us. So we just want the device to give us information to say, this
Starting point is 00:21:29 media is in the risk. You need to do something to handle it. And then the host will do initialize the handling. I think if you are familiar with the SSDFTO, the handling, you probably need to refresh the data and also do some management there. So those will be controlled by the host. But the monitoring things will be controlled by the device still. How does the device improve the kind of information? It's vendor specific. But the general idea there is like, you know the AER, right?
Starting point is 00:22:07 Asynchronized Event Request. So that could be a way, a mirror for the device to notice the host. So the whole device interface is still largely based on MVM? Yeah, yeah, it's still, it's like the MVM, right? Yeah, customized MVM, yeah, it's still, it's like the OVM, right? Yeah, customized OVM, yeah. So if the device informs this kind of AEO, like the video, the badness, does it part be kind of determinism?
Starting point is 00:22:39 Will not be... You mean IO determinism? Yeah. So, I think this way is also for the good of IO determinism. So, for the traditional SSD, it will do its work for you, right?
Starting point is 00:22:57 Without your permission. If the traditional SSD says the data there has some retention issue, it will just move the data, and the host side has nothing to do with it and doesn't have any control there. But in this way, the device will notice the host and say, oh, you have the data retention problem.
Starting point is 00:23:16 You may choose to move the data sooner or later. And then the host side, it's fully up to the whole side with when to do this work. So from that perspective, you can actually achieve a certain level of IO determinism. General, do you want us to ask questions during the talk or towards the end? Yeah, I think it's better to listen.
Starting point is 00:23:43 Yeah, we have 15 minutes, so we still have enough time. Let me finish the presentation, then we move to the Q&A part. Okay, so with this architecture, basically the host site actually gains the most important part. We have the direct access to the physical media and maximize
Starting point is 00:24:06 the utilization of the media capability. We fully control what's really important, what is really important for the upper storage software level, the data placement that I was scheduling, and also coordinating the
Starting point is 00:24:24 garbage collection and eventually resulting in reducing the right applications is critical for the SSD application. So before I move to give some high-level idea of how we customize our FTLOs for different business requirements. I just want to give a high-level summary,
Starting point is 00:24:52 overview on what we do to customize the FTOs in our data centers. So this is just a flow of how we usually do that, do a co-optimization and co-design. So we started from the applications. As an infrastructure group, so we have many customers inside Alibaba group. So we talk to the different business units like Alicloud, Alipay, TMO, and
Starting point is 00:25:27 those e-commerce. So we understand their requirements and abstract them, do the requirement analysis here. And based on that, based on the hardware we have, and also what our collaborators can provide, we do co-design, co-optimization. This is a kind of a loop process. And then based on that, we have this generated in-house storage solutions
Starting point is 00:26:01 and then provide the solutions back to the business units application site and talk to them, do you like this? And they will provide us the feedback that, oh, this truly meets our requirements. Or, no, I think you still maybe put more efforts on this or that. So it's kind of feedback to civil runs, and then we generate the truly understanding requirements and come out
Starting point is 00:26:30 optimized solution based on their needs. So this page summarizes our software stacks for the open channel. The whole software stack is developed by ourselves. And then we have two versions here. One is Cardinal Mode Alibaba Open Channel SCC driver block device FTO. That's what we are
Starting point is 00:27:06 developed and also deployed in our data centers. On top of that, we have user space SC drivers integrated with our own storage engine. We call it the Fusion Engine software
Starting point is 00:27:23 with the internal business units. And also on the user space mode, we also customize different FTO solutions to optimize for certain business units requirements, as well as the management and monitoring and test tools in this whole software package. So in the next couple of slides, I will just give a little bit more details on the kernel mode and user mode, and one example for that customized FTO. so for this this is a structure for our kernel space
Starting point is 00:28:10 Alibaba Open Channel SAC solutions so we have this multi-vendor Open Channel SAC hardware and of course with the firmware and on top of that we have our kernel mode FTL, which currently is mostly a standard block device FTL.
Starting point is 00:28:36 And on top of that, as I said, we do this software and hardware co-optimization, co-design, so we always connect it to our storage engine, what we call the Fusion Engine, which is a user space storage engine. And then to the applications. So this works as a generic block device and covers most of the legacy use cases
Starting point is 00:29:02 in Alibaba data centers right now. Our expectation is that they will provide equivalent performance and functionality as started in the SSDs. It will be, as I said, it's a step one for deployment of the open channel SSDs. On top of that, we do certain customizations to improve the performance and latency. So we do some IO scheduling there
Starting point is 00:29:30 and to improve the quality of service. This page shows the user space Alibaba Open Channel SSD solutions. So compared to to previous page, now we no longer have the kernel mode drivers. So with the Open Channel hardware, we just connect to our Fusion engine, which is fully in the user space.
Starting point is 00:29:57 And in this Fusion engine layer, we have the SSD driver, which is also in the user space. And then based on the different business use requirements, we actually customized several different kind of FTLs. On top of that, we have the user space file system, and we're also working on the KV engine for those specific applications. So with this user space,
Starting point is 00:30:29 we actually achieve lower material software overhead, better performance, and also deep integration with certain applications. Still keep those optimized data placement and also the coordinated background data tasks. Regarding the customized FTO, we actually do something called object SSD,
Starting point is 00:30:55 which is under development to basically, instead of just providing the block API to the upper level, we have this so-called object API to upper level. I'll have one page to just show that idea here. So this is our object SSD. So from the bottom part, it's still the open channel SSD hardware. What we show those kind of, those white box here, you can think of that as a NAND block
Starting point is 00:31:36 in the NAND SSDs. And then we have something called object. It's basically so in the SSD you delete you erase a NAND block, right? So in this customized FTO we actually group multiple blocks together. You can think of it as a super block
Starting point is 00:31:57 or band in the standard SSD concept. But now with this object API, the delete granularity will become the object. It's no longer just the AOBA or NAND blocks here. So this device is organized into the object
Starting point is 00:32:18 and then through the fusion engine, then we encapsulate everything and then just expose the object API to the upper level softwares. When we say upper level here, we specifically refer to Pangu, which is our internal distributed storage system in Alibaba data centers. They have their chunk server. We basically organize all the data into objects and then utilize a pandani write
Starting point is 00:32:46 IO pattern to write it to our devices. To the future engine and to our open channel CDs. So in this way we actually achieve much lower write amplification significantly improve the SSD
Starting point is 00:33:03 lifetime and also improves that through the user mode out-scheduling and will actually improve the performance uniformity and the quality of service by 5X. Is there awareness of the objects within the drive or in the software? So the object is actually mapped to of the objects within the drive or in the software? So the object is actually mapped to a physical group of the NAND blocks in the drive. But the software actually, from this fusion engine,
Starting point is 00:33:41 will just expose the object APIs to the upper-level software. So it's no longer like a LBA. The drive is playing open channel? Yeah, the drive is open channel. No objects in the drive? No objects in the drive, it's software just to form the object in the drive. Yeah.
Starting point is 00:34:13 So this page shows, I already mentioned the IO scheduling and call your service improvement several times. So here's just an example of what we do in the either kernel level FTO or the user mode FTO, how we do the precise IOS scheduling to improve the latency. So the left-hand side is just showing still the open channel architecture, and the right-hand side is showing a high-level idea of how we do this precise IOS scheduling in the driver level FTL. So here we have those PU. So this is indeed the open channel 2.0 concept. You can think it's just a NAND die. So NAND die, I think if you're familiar with the standard SSD FTL,
Starting point is 00:35:02 the SSD firmware actually maintains a queue, a command queue inside the SSD. The host side doesn't have any control on it. You just send the I.O. to a drive, but you don't know when the I.O. will be executed. But with open channel, the situation is totally different.
Starting point is 00:35:20 Now we have this fully the host has the knowledge on the priority or how critical the L is. And also host has the full control on the command queue for each die, each NAND component die. Because the open channel. So here we just have the driver which is maintaining a queue for each NAND die. We call it PU. And then the red color here basically means
Starting point is 00:35:56 that it goes kind of high priority read commands. Here we just use read command as an example, but it's a similar idea. It can be extended to the right. With those different priorities here, the FTO can actually reorder the commands in the queues and the driver level. And then because it's reordering,
Starting point is 00:36:19 and we can achieve better latency, better control on the I.O. for those critical one and the high priority one to be commenced. So this is just to show our experiment data from our user mode FTO with the I.O. scheduling scheme which I just described in the previous page. So you can see that from the read-only workload
Starting point is 00:36:47 to actually achieve very significant improvement. The light blue is the baseline, and the deep blue is the user mode FTO with those IO scheduling implemented. So it's roughly 5x improvement, and if you look at the read and write mix workload, the improvement is even higher, especially for the average write latency.
Starting point is 00:37:22 So this is our current standards for the open channel SSD. So we already mentioned several times we're in stage one. We are the first one in the industry to productize this open channel SSDs. We already have small volume deployment in Alibaba data centers, and the massive deployment is expected in the first half of next year. As I said, the Alibaba open channel is not just a single vendor or just a specific storage devices. We are targeting to build an ecosystem. That's why we work with
Starting point is 00:38:05 many collaborators in our industry. Intel is our strategic partner. We started to work with Intel since 2017 to basically do the co-development and the co-validation.
Starting point is 00:38:25 Intel provides hardware and software, sorry, hardware and the firmware, and we develop all this whole size software, and then we do co-validation, basically means the whole software stack will be validated in Intel's standard qualification flow. Intel, as an asset vendor, I think has a very good reputation on those SSE qualification flow and also the stability. So we do expect those kind of collaboration will give us our open channel product
Starting point is 00:39:01 a very good stability in the Alibaba data centers. And in addition to Intel, we also work with Unix, Micron, Hynix, Shannon, which is in Shanghai, and also CinexNaps. Future work. As I said, this is just a step one. We have the open channel.
Starting point is 00:39:27 We open a door to give the whole site the full control on the underlayer hardware and hardware devices. So we are also looking to the opportunity to combine multi-streams. We set further IO scheduling and quality of service optimization. Low latency, high endurance solutions. QLC. I think I see the industry right now is moving to QLC, so we're also working on it right now. Also 3DXPoint.
Starting point is 00:40:07 So all those are innovations into our open channel platform. Going forward, from our perspective, we believe the user mode customization and hardware software co-design, co-optimization is the true key methods to address the data, the hyperscale data center faced challenges from the storage system perspective and as a hyperscale data center face the challenges from the storage system perspective. And as a hyperscale data center, we focus on the software,
Starting point is 00:40:52 but the collaboration is needed. We welcome all those hardware and solution vendors to work with us to generate the next innovations for the hyperscale data centers. So to conclude my presentation, so hyperscale data centers are facing unique challenges, and we believe the hardware-software integrated solution is the key, especially based on the open channel
Starting point is 00:41:25 scheme. We customize all the designs by applications individually, and we are open to the industry collaborations. Right now we are, in addition to open channel, we are also working on the near storage computing, memory computing,
Starting point is 00:41:42 and new storage media. So if anyone is interested in that, please feel free to stop by and we can discuss their potential collaborations. Thanks. Thanks. Thanks for listening.
Starting point is 00:42:08 If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further with your peers in the Storage Developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
