Storage Developer Conference - #73: Key Value SSD Explained – Concept, Device, System, and Standard

Episode Date: August 15, 2018

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcasts. You are listening to SDC Podcast Episode 73. I'm Yang-Seok Ki from Samsung, specifically from the San Jose office.
Starting point is 00:00:47 We have an R&D center in the San Jose office, so this work is actually a global collaboration across Korea and the U.S. And today I'm going to introduce our experience with the key value SSD. The main purpose of this talk is not to show off what we have done. Rather, we want to share our learning from this experience and invite the industry to work together and move forward together. That is the main reason.
Starting point is 00:01:26 So some of the data is quite recent, and we are still working to implement features and to evaluate the performance and features from different angles. So during this presentation you can interrupt me at any time and ask questions. I can share information as much as I can, as long as it does not break my confidentiality. Okay. So this talk is quite short compared to the FMS one.
Starting point is 00:02:10 So I actually collapsed many slides into a small number of slides. But I'm going to go over why we started this work, what the concept of the key value SSD is, and what kind of ecosystem we need to enable this kind of technology. And what we are working on from a standards perspective. Bill Martin is sitting over there, so he can explain more, but I'll cover it as far as I know, and he can fill in more details about the most recent progress.
Starting point is 00:02:50 And I'm going to share some experiments with this device from the system perspective. So we actually built a real system, plugged this prototype into the system, and evaluated it from different angles. So let me start with the background briefly. So this is quite well known. This slide says a lot of data is generated
Starting point is 00:03:18 from different devices and applications. With mobile devices and PCs, users are connected to the internet and the cloud and keep generating data. So for example, Netflix: within just one year, the traffic amount increased 1.5x. And across different types of applications, a lot of data is generated and increases dramatically.
Starting point is 00:03:54 So this is a quite popular graph as well. I put a slightly different name on it here. So around 2005, the cloud concept was introduced by Amazon. EC2 first came up, along with virtualization technology. Okay, so I'll hold this mic, that's better, right? Yeah, so around 2005 the cloud concept was introduced, and after that the iPhone was introduced and mobile devices became quite popular. Since then, a lot of data is generated from mobile devices, and also in the cloud,
Starting point is 00:04:41 the infrastructure itself generates a lot of data. So, to me, sort of a new era started around that time. So I can say before the cloud and after the cloud, it's sort of the era of data. So I would say the era of data. So it's a new BC/AD for IT technology. So, one of the interesting things is, since the cloud was introduced,
Starting point is 00:05:16 a lot of unstructured data was generated. So, unstructured data is contrasted with traditional relational database data. Video files, photo files, structured and semi-structured data, JSON files, VM images, compressed files: a lot of such data was generated in this period. Then what is the problem? The data we manage is already objects.
Starting point is 00:05:53 It's in the form of files, and people perceive the data as objects. But for how the data is actually stored, we are still relying on block storage, meaning that even though you store some object-like concept of data, when the data is stored, you have to split it into fixed-size chunks and distribute the chunks across physical devices.
Starting point is 00:06:19 But can you store such data directly to the device? That is the starting point of our thinking. Is this concept new? Not at all. We heard about OSD more than 10 years ago. And OSD was proposed to actually solve this kind of problem, especially to solve the metadata handling problem.
Starting point is 00:06:45 But due to the complexity of the implementation, OSD actually didn't take off. But to store an object into a device, there are, I think, two different ways. We can actually rely on the traditional OSD model. The traditional OSD model consists mainly of three components. To identify an object, you need to specify an ID, and you can associate attributes with the object,
Starting point is 00:07:14 such as, for example, if you store a photo, you can specify where I took this photo, or when I took this photo, and with whom. Such data is not directly related to your photo, but it's associated with the photo and specifies more information about your data. And the actual data is the user data. So typically, in the OSD concept,
Starting point is 00:07:40 an object is associated with three components. But there's another way to store it, which is, okay, why don't you just integrate the attributes into the identifier, so that is the key, right? So in the key value concept, you specify the identity of an object using a key and store the actual data as a value. But a key can be simple yet powerful.
Starting point is 00:08:13 Instead of having three different components, you can encode much more information about your data into the key. So just by handling the key, the upper layer application can differentiate or identify the object from the key. For example, I used the photo example, right? You can specify the place where you took the photo or the time you took the photo, and you can actually encode that information into the key instead of storing that information as separate metadata.
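To make that idea concrete, here is a minimal sketch of encoding attributes into a key on the host side. The key layout and names are hypothetical illustrations, not part of any Samsung or standards-defined format.

```c
#include <stdio.h>

/* Hypothetical example: encode photo attributes directly into the key
 * instead of storing them as separate metadata. The "type/place/time/user"
 * layout is an illustration only, not a defined format. */
static void make_photo_key(char *key, size_t len, const char *place,
                           const char *timestamp, const char *user) {
    /* A prefix scan on "photo/san-jose/" could then find every photo
     * taken in San Jose without consulting a separate metadata table. */
    snprintf(key, len, "photo/%s/%s/%s", place, timestamp, user);
}

int main(void) {
    char key[128];
    make_photo_key(key, sizeof(key), "san-jose", "2018-08-15T10:30", "alice");
    printf("%s\n", key); /* photo/san-jose/2018-08-15T10:30/alice */
    return 0;
}
```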
Starting point is 00:08:44 So, then why did we choose the key value path, not the OSD path? Because when I actually surveyed data center infrastructure, a lot of systems actually rely on key value abstraction.
Starting point is 00:09:12 So this is not a comprehensive list of applications or domains, but a snapshot of the applications which rely on key value abstraction. And quite different applications actually use key value abstraction. The most popular one is cache. Redis is an in-memory object cache which uses key value abstraction. It's mostly for DRAM, but it can also be extended to storage as well. And on the storage side,
Starting point is 00:09:50 Ceph is quite popular, and at the bottom of the Ceph layers, it actually relies on the RocksDB abstraction. Even though they introduced BlueStore, essentially it relies on the RocksDB abstraction, which basically provides key value abstraction. In the NetApp case, the SolidFire system: when you look at the SolidFire scale-out storage system,
Starting point is 00:10:16 at the bottom layer of the stack, it is basically key value, to provide more efficient deduplication and compression. And in large-scale, hyperscale data centers like Amazon and Azure, when you look at the Azure storage architecture, basically, it's one very huge key value store at the bottom. And on the database side, MongoDB is one of the NoSQL databases, and it provides
Starting point is 00:10:46 a NoSQL document store. But at the bottom, an interesting thing about MongoDB is that it has a storage layer, and you can plug different types of storage into this infrastructure. At the bottom of the MongoDB stack, it basically has a key
Starting point is 00:11:06 value abstraction. You can plug in WiredTiger, you can plug in RocksDB; different types can be plugged into the system. And another interesting one is that Facebook actually introduced MyRocks. MySQL is basically a relational database, but they wanted to replace
Starting point is 00:11:31 InnoDB with RocksDB to reduce the space user databases actually consume. They reduced, I don't remember the exact number, around 50% of user database space by replacing InnoDB with RocksDB. And recently, Caffe2 basically replaced its storage model, from NFS to key value, like LevelDB and Redis.
Starting point is 00:12:02 And service providers like Airbnb and Rakuten build object stores, but at the bottom of their stack, they're using key value abstraction. So okay, key value is quite popular in the data center. Can it help them? So then what is the problem they have right now? So, all the applications basically interact with the storage system through an object abstraction.
Starting point is 00:12:37 But at the bottom of the stack, the hardware just provides the block interface. So there is a gap between what the hardware can provide and what the application actually wants. So to bridge the gap, many systems are actually using a software-based key value store. The most popular one is RocksDB. RocksDB is a branch of LevelDB.
Starting point is 00:13:06 LevelDB was introduced by Google, and RocksDB was introduced by Facebook. And another popular one is WiredTiger. WiredTiger was acquired by MongoDB, and that is the base storage backend for MongoDB. But WiredTiger is also used by Amazon as well, for Amazon DynamoDB. So there are several popular key value stores, and those key value stores' main job is
Starting point is 00:13:34 to translate the upper layer object abstraction to the block abstraction at the bottom. So the basic idea of the key value SSD is to take the common functionality from the software key value store and move it into the device.
Starting point is 00:14:02 I will explain later why this approach makes sense. You may think, well, what if you just move a certain component into the device? You may have more penalty, because the computing capability of the host is much better than the device's. Why would you want to move certain components into the device?
Starting point is 00:14:25 But very simply put, let's get rid of that layer and put it into the device. But I need to be more specific about this statement, because it does not mean we get rid of the stack completely; rather, we can actually reduce the overhead the existing software key value store has. So by doing that, we can improve the overall
Starting point is 00:15:01 throughput of the system and also reduce the overhead the existing key value store has, like the write amplification problem and the read amplification problem and the low throughput problem. So to realize this concept, we prototyped the key value SSD concept using a new Samsung device. So just last month, we introduced into the market a new small form factor SSD. At the bottom, this is the traditional M.2 form factor SSD.
Starting point is 00:15:51 And the new form factor is called the next generation small form factor SSD. It's a long, mouthful name, but there is a reason we use a different name, but anyway. So it is the same length as the traditional M.2, but it's a little bit wider than M.2, so you can actually put the NAND chips
Starting point is 00:16:17 on both sides, and then you can put two rows like this. We also have a prototype for the U.2 form factor as well. And in terms of capacity, it can be 1 to 16 terabytes, but we prototyped the key value concept using the 1 terabyte device. So this is sort of the summary of the benefits, but I will cover them one by one in the later slides. But basically, the initial goal was to provide
Starting point is 00:17:03 better performance from the system perspective and provide better capability. And also, depending on the system, you can actually use more disk space by leveraging the key value SSD compared to block. And the point is, the main focus of the key value SSD is to provide benefit from the system perspective, not the device directly. So if you want to compare block device performance
Starting point is 00:17:38 and key value device performance directly, obviously the key value SSD may be slower than the block device for now, for several reasons. We don't have a standard yet, so we cannot optimize the operation efficiently. And as I will explain a little bit more later, a key value SSD is more complex than a block SSD, so it will have more overhead from the device perspective. So it's going to be slower. But from the system level perspective,
Starting point is 00:18:11 you can get much more benefit by using this kind of device. So the first one is, when you build a storage system, the key value SSD can provide better scalability than the block device. I will show some data about what this means. So basically, you can add more devices into the box. The key value SSD basically provides linear scalability in terms of capacity and performance.
Starting point is 00:18:40 With a traditional block device, you can add more devices to increase capacity, but performance does not grow when you provide that kind of object interface to the user. I will explain why that happens. And by doing that, what is the benefit to the user? Actually, with a key value SSD, you need just one core to saturate the device.
Starting point is 00:19:07 With a traditional block device, if you want to have an object interface, a key value interface using a software key value store, you may need to use multiple cores. Typically, when you use RocksDB, to saturate one device you may need eight or nine cores. Basically, we consume more CPU power. Because of that, it's very hard to scale the box,
Starting point is 00:19:38 to scale the system by adding more devices, because you will hit the CPU saturation point quickly. It doesn't matter how expensive the CPU is. So I will show the result using a very expensive CPU, $5,000 per unit, but it is quickly saturated when you try to implement a key value interface using the software-based solution. So you can actually reduce the number of servers by leveraging the key value SSD,
Starting point is 00:20:20 because it scales linearly, so you can provide better capacity and better performance by leveraging this. So the overall TCO will drop. Depending on how you calculate it, it can change, but based on our system, it can reduce around 20% or 30% of the racks. And key value scales out quite easily. By hashing the key across multiple nodes, you can spread the data, so it's quite easy to scale out.
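As a rough sketch of that scale-out point, distribution can be as simple as hashing the key bytes to pick a node. The FNV-1a hash and the node count below are arbitrary choices for illustration, not what any particular system uses.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sketch of scale-out by key hashing: each key is routed to one of N
 * nodes purely by a hash of its bytes, so adding servers spreads the
 * key space without central metadata. FNV-1a is just an example hash. */
static uint64_t fnv1a(const void *buf, size_t len) {
    const unsigned char *p = buf;
    uint64_t h = 14695981039346656037ULL; /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;            /* FNV prime */
    }
    return h;
}

static unsigned node_for_key(const char *key, unsigned num_nodes) {
    return (unsigned)(fnv1a(key, strlen(key)) % num_nodes);
}

int main(void) {
    printf("key goes to node %u of 4\n",
           node_for_key("photo/san-jose/2018-08-15T10:30/alice", 4));
    return 0;
}
```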
Starting point is 00:21:06 So, with the key value SSD, within a box, you can add more devices and performance and capacity go up. By adding more servers, capacity and performance also easily go up. So that is the one-page summary of the main benefits of the key value SSD. I will provide more details later.
Starting point is 00:21:32 Okay, so you have a new device. That's good, but you may have a concern, right? There is no ecosystem for it. How can you leverage it? So we fully understand that, and it's quite challenging, because we actually have to overcome this kind of big obstacle ourselves. That's why, actually, at the beginning of this presentation, I said
Starting point is 00:22:03 this is not to show off something to you. Actually, we want to invite you to work together on different aspects. So I will go one by one. So to enable this kind of technology, we believe three pieces go together. We should show the benefit by providing a product. So we prototyped the concept in our real product. But okay, you built something, but how can you show the benefit to me?
Starting point is 00:22:36 You may ask that. So we explored several applications, whether those applications can pick up this technology quickly. Applications do not have any infrastructure to pick up this technology, so we actually need to build the infrastructure, like a device driver, library, API, or a command set to support this kind of device. So we should build the core software, a prototype, to prove this concept is working. And that's not enough, right?
Starting point is 00:23:18 When we talk to customers, they say, well, this is good, but you are the only one. Then we don't want to be locked in. So to invite more device vendors, we actually open up what we have done. And basically, okay, this is the core requirement for this kind of device, and we want to make a standard, open it to the community, and invite others who can actually contribute. So we are working on standardization in NVMe and also SNIA. We started a few months ago, and discussions are ongoing right now. So
Starting point is 00:24:07 these should all go together, and if the industry sees the benefit, then we are open to work with you. Depending on your company's situation, you can contribute on the application side, or the product side, or the standards side, or the infrastructure side.
Starting point is 00:24:26 So we are quite open to that. So let me go quickly, one by one. So this slide is really busy, but in the previous slides, I talked about, okay, you have a key value store, we can remove this and move it into the device. You can simply say that, but is it really good? Then we have to think about whether it's really good
Starting point is 00:25:03 or not, and how much you want to move from the software to the hardware. If you want to move the entire key value store into the device, it's not going to work, because it's too heavy. The device does not have that capability. So how much do you want to move,
Starting point is 00:25:19 and what is the core feature you want to move? So when you look at the traditional stack, at the bottom of the stack you have a block device, and you need a block device driver. Usually you put the operating system there, and you may have a volume manager in this, but let's skip that part,
Starting point is 00:25:42 and you may put the file system on top of that, and typically the software key value store runs on top of the file system. And the application runs on top of this key value interface. This is a typical structure. The application can be MongoDB, can be DynamoDB, can be Ceph. And in the Ceph case, you manage multiple nodes across
Starting point is 00:26:08 the cluster. But if you just look at one node in the cluster, it mostly looks like this. Then what happens here? So, let's start from the key value store. The key value store typically manages an index to identify
Starting point is 00:26:24 objects within the key value store, and it also does logging to provide transactions. And what happens in the file system? We have file mapping, so block mapping. To identify a file, we have to maintain all the block location information. Basically, a file system provides two things. One is to provide the namespace, and the other is to manage the storage. So to manage storage, you need
Starting point is 00:26:53 the mapping information, to maintain mapping information about your file on the devices. And the file system also does journaling. Okay, so for its own reasons, each layer maintains this information for its own purpose. Then what happens in the SSD? The SSD also has a mapping to translate the logical
Starting point is 00:27:21 block address to the physical block address, and to provide transactions efficiently, the SSD maintains log information. So as you can see, there are a lot of redundant operations across the stack. And these actually add more overhead, drop the performance, and also reduce the available user space. But the SSD already has this kind of functionality,
Starting point is 00:27:54 and we collapse this functionality into the device. Basically, we already have it there, so it can be efficiently implemented in the device. While we are doing that, we can provide the different interface that the application actually consumes. That is the main idea. So from the user's perspective, you can think, okay, you remove this software key value store, and the device provides this interface directly. What happens here is that you actually collapse the multiple loggings and the multiple mappings into
Starting point is 00:28:45 a single mapping and a single logging inside the device. So we try to minimize the capability in the device, but that capability should be enough to support applications. We don't want to cover all applications,
Starting point is 00:29:02 because some of them don't need to be implemented in the device; they can be implemented in the software stack, like a library. So we can cover more applications, but the question is what core functionality should be pushed down to the device. So in the previous slide, I briefly explained the basic idea and whether it is actually a good idea to move that into the device. And actually, I get many questions: do you have information about Kinetic, and, oh, are you building something on top of a block inside the device? That is a common question. But in the previous slide, what I meant was, okay, you can collapse all the redundant functionality into the device. So the device itself actually manages things differently.
Starting point is 00:30:24 Traditionally, the SSD has a mapping table and translates LBA to PBA; we call that the flash translation layer. And it maintains the mapping and also handles transactions, but we completely rewrote the FTL. So the FTL itself is aware of the key value abstraction, and we manage the NAND differently. Instead of a mapping table, we use an index here.
Starting point is 00:31:02 The index is similar to the one in a host-based key value store. So this guy actually handles variable-size keys and variable-size objects by itself. So it is not block-based mapping anymore. So a new FTL is implemented, and it provides the key value interface directly from the device.
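To illustrate the contrast he is drawing, here is a rough sketch of a block FTL's mapping table next to a key value FTL's index. These structures are guesses for illustration, not Samsung's actual FTL design.

```c
#include <stdint.h>

/* Illustrative contrast only, not Samsung's actual FTL design. */

/* Traditional block FTL: a dense table indexed by LBA, one fixed-size
 * physical page address per logical block. */
typedef struct {
    uint32_t *lba_to_ppa; /* mapping table: one PPA per logical block */
    uint32_t  num_lbas;
} block_ftl_t;

/* Key value FTL: a hash index keyed by variable-size keys, each entry
 * pointing at a variable-size value somewhere on NAND. This is the part
 * that resembles a host key value store's index moved into the device. */
typedef struct kv_index_entry {
    uint8_t  key[255];           /* variable-size key, up to some max */
    uint8_t  key_len;
    uint64_t nand_addr;          /* where the value lives on flash */
    uint32_t value_len;          /* values are variable size too */
    struct kv_index_entry *next; /* hash-bucket chain */
} kv_index_entry_t;
```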
Starting point is 00:31:35 And again, I'm using the term device. What I mean is, this is actually a device; it cannot work alone. That does not exclude other possibilities. In the previous session, we talked about object storage, and whether this device can have an Ethernet interface or not; that's a completely separate problem. But we actually prefer to have a device interface, not a service interface, right now. It can evolve, but the reason is that the market is quite different for now.
Starting point is 00:32:06 It can converge later, but we are targeting main storage, not cold storage. So to provide high performance and provide enough capacity for that market segment, we need to provide a device interface. So that's why, based on NVMe as the device protocol, we extend the protocol to support key value. We introduce several new commands into the NVMe spec, and through those commands the host can communicate with this device. So currently it is not standardized,
Starting point is 00:32:55 so we are actually using vendor commands, vendor-unique commands. So to use this kind of device, you need at least some software on the host side. And obviously, this is a new device, so you need a new device driver. Currently, we have three different types of device drivers. So for Linux, we extended the community version of the NVMe device driver, adding the new commands into that, and also adding a new feature like asynchronous I/O, because the host can send
Starting point is 00:33:45 commands to the device through ioctl for now, but ioctl has a problem: it is a synchronous operation. If you do synchronous operations, your performance is going to be very bad, so we added a libaio-type infrastructure into this new device driver, so you can actually
Starting point is 00:34:08 do asynchronous operations from the application. We also have a user-space driver. It's mostly an SPDK extension, so it works quite well. There are some limitations
Starting point is 00:34:24 due to the user-level device driver itself, but in terms of functionality, we provide that. And we also implemented a Windows device driver. And we plan to open
Starting point is 00:34:39 this Windows device driver, actually all of them, but they're not open yet. Once the standards work progresses, we may be able to open them.
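For a sense of what the asynchronous path he describes could look like from the application, here is a sketch with a completion callback. None of these names come from the actual Samsung driver, and the stub completes inline only so the example runs.

```c
#include <stdio.h>

/* Hypothetical async interface in the spirit of the libaio-style driver
 * extension described above. A real driver would post the completion from
 * an interrupt or polling path instead of calling back inline. */
typedef void (*kv_complete_cb)(int status, void *ctx);

static int kv_put_async(int dev_fd, const void *key, unsigned key_len,
                        const void *value, unsigned value_len,
                        kv_complete_cb cb, void *ctx) {
    (void)dev_fd; (void)key; (void)key_len; (void)value; (void)value_len;
    cb(0, ctx);   /* stand-in: report immediate success */
    return 0;     /* 0 = command queued */
}

static void on_put_done(int status, void *ctx) {
    printf("put '%s' completed with status %d\n", (const char *)ctx, status);
}

int main(void) {
    /* Queue the put and keep working; the callback fires on completion,
     * avoiding the blocking behavior of a plain ioctl. */
    kv_put_async(-1, "photo/sj/2018", 13, "somedata", 8,
                 on_put_done, "photo/sj/2018");
    return 0;
}
```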
Starting point is 00:34:59 And on top of that, it doesn't matter which device driver you're using; there is a set of features the device should eventually provide, so we call it the abstract device interface, and this basically provides the abstract device functionality from the host perspective. So it does not need to be aware of the actual protocol, the command protocol on the wire. It provides mostly the functionality and semantics of using this device. So the basic functionality, like providing namespaces,
Starting point is 00:35:35 how we can actually see objects across multiple devices, and the object itself, and the basic operations like put, get, delete, and exist. This is actually part of the standardization, and it can be extended through discussion, but we try to minimize the number of operations at the beginning, and if we have consensus, we can extend it more and more.
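As a sketch of the shape such a minimal interface might take, here are the four basic operations as C declarations. The types and signatures are illustrative guesses, not the published SNIA API.

```c
#include <stdint.h>

/* Sketch of a minimal abstract device interface (ADI) with the four basic
 * operations named in the talk. All names here are illustrative. */
typedef struct kv_device kv_device_t; /* opaque device/namespace handle */

typedef struct {
    const void *key;
    uint16_t    length;
} kv_key_t;

typedef struct {
    void    *value;
    uint32_t length; /* in: buffer size; out: bytes stored or returned */
} kv_value_t;

int kv_put(kv_device_t *dev, const kv_key_t *key, const kv_value_t *val);
int kv_get(kv_device_t *dev, const kv_key_t *key, kv_value_t *val);
int kv_delete(kv_device_t *dev, const kv_key_t *key);
int kv_exist(kv_device_t *dev, const kv_key_t *key); /* 1 if present */
```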
Starting point is 00:36:06 So if we have too many things up front, then people are going to reject it, right? So we took the opposite approach. We put in a very minimal set, and whether we need to extend it, we can decide after discussing it through the standards activities. And on top of that, since we collapse the software stack,
Starting point is 00:36:33 that is not always good. We lose something. For example, the file system has a page cache. It improves performance a lot. But if you talk to the device directly, you lose the caching effect completely. And the kernel stack has different types of features,
Starting point is 00:36:56 like asynchronous I/O. But as I mentioned in the earlier discussion, the Linux driver through ioctl is synchronous. Can we overcome that kind of problem? So we actually implemented a library to provide better management. Especially for the user-level device driver, you have to allocate memory from huge pages,
Starting point is 00:37:22 and if you don't use that, then you have to copy, which basically negates the benefit of using the user-level driver. Right. So there are several issues we actually have to address: what is the minimum functionality so that at least we can use this device without a significant penalty?
Starting point is 00:37:45 Through this study, we actually introduced the memory manager, and this library can manage multiple devices and provide multiple queues. It also has a write-back and write-through cache to keep the state of objects consistent. So we implemented several key features into this library.
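On the huge-page point: on Linux the usual mechanism is mmap with MAP_HUGETLB, as in the minimal sketch below. The library's actual memory manager is of course more involved than this.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

/* Minimal sketch of huge-page allocation for a user-level driver. I/O
 * buffers handed to the device should come from huge pages; otherwise
 * the library has to copy, which negates the zero-copy benefit. */
int main(void) {
    size_t len = 2 * 1024 * 1024; /* one 2 MB huge page */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)"); /* needs huge pages reserved by the OS */
        return 1;
    }
    /* ... hand buf to the user-level driver as an I/O buffer ... */
    munmap(buf, len);
    return 0;
}
```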
Starting point is 00:38:38 So we can provide this kind of infrastructure, but can applications actually pick up this technology? There are different types of integration points.
Starting point is 00:39:07 So one example is, if you want to actually implement RocksDB on top of key value, there's no easy way to do that. Actually, we had to cut out a major portion of RocksDB and plug the device in underneath, for systems using RocksDB. But in terms of the performance and efficiency, it's going to improve significantly. Another model is to use a storage engine.
Starting point is 00:39:37 MongoDB has an abstraction for storage engines, so you can plug in different types of storage engines as long as they are compatible with that abstraction. So we did this kind of work with MongoDB, and we did some work on LevelDB, actually, to plug the key value SSD into the system. And another way is, for example, when they have a much higher abstraction, like OSD:
Starting point is 00:40:08 as long as you provide the OSD interface, you can plug any storage engine into the system. So the point is that some systems already have an abstraction in their system, so this new device can be easily plugged in. Some systems do not have that kind of abstraction, so we may have to change the software quite a bit. But the good thing is there are many systems that actually have that kind of abstraction, so we can plug in our library,
Starting point is 00:40:49 the key value SSD library, into the existing system. So, I briefly talked about what kind of software we need. Some of the software work Samsung can do, but actually a lot of the work, like the storage engines or the applications like caches or storage systems, that is not our work.
Starting point is 00:41:13 So we need to work with the industry to enable this technology. And we showed some proof points through our own development and study, by providing the performance numbers and the benefits, so we can work with you guys. And regarding the standard: from this slide on, I talk about standards. We are working with NVMe and SNIA right now. In the NVMe case, we actually proposed a TPAR,
Starting point is 00:41:56 and we are discussing it with the community right now. And in SNIA, we define several APIs; I used the term ADI in the previous slide, and we define what kinds of operations we need. But the one thing I'd like to highlight is that it's not about the object drive. What is different from the object drive is that this is about a key value interface,
Starting point is 00:42:24 and I discussed the difference between object and key value in the previous slides. So from the command perspective, we actually introduced four new commands, put, get, delete, and exist, and we extended
Starting point is 00:42:40 the existing commands to manage the device efficiently and to maintain the device. And this is open to any type of discussion regarding the key value SSD. Actually, I had a lot of feedback: okay, can you add this, can you add that? So you can actually join and participate in this activity in the NVMe community or the SNIA community, and show your interest and reflect your needs in the standard.
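Since the commands are vendor-unique today, a host could issue them through the standard Linux NVMe passthrough ioctl, roughly as sketched below. The opcode and the use of cdw10 are made-up placeholders, not the real vendor-unique encoding or the later standardized command set.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

/* Sketch: issuing a hypothetical vendor-unique "KV get" through the Linux
 * NVMe passthrough interface. 0xC1 is a placeholder opcode, and carrying
 * the key length in cdw10 is likewise illustrative; the real vendor
 * commands define their own encoding. */
static int kv_get_passthru(int fd, const char *key,
                           void *buf, unsigned buf_len) {
    struct nvme_passthru_cmd cmd;
    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode   = 0xC1;                  /* hypothetical vendor opcode */
    cmd.nsid     = 1;
    cmd.addr     = (uint64_t)(uintptr_t)buf;
    cmd.data_len = buf_len;
    cmd.cdw10    = (uint32_t)strlen(key); /* e.g. key length */
    /* A real command would also carry the key bytes themselves, per the
     * vendor's definition (e.g. in cdw12..15 or a separate buffer). */
    return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
}
```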
Starting point is 00:43:17 So, regarding the performance, we did some experiments by building a real system. I'm going to quickly touch on some data here. So basically, in the previous slides, I talked about, okay, what is the benefit of using the key value SSD. From the system perspective it provides better scalability,
Starting point is 00:43:50 it can scale up and scale out, and compared to the existing key value store it provides better performance. So to show that, from the single node perspective we compared against RocksDB, because RocksDB is quite popular. You can say, well, is RocksDB the right one or not? That's debatable, but it is a very popular one, so we compared the performance of RocksDB
Starting point is 00:44:18 on top of a block device versus the key value SSD using our software stack. And RocksDB is a well-known software stack from Facebook, and it was tuned appropriately to get proper performance. Depending on the workload, you can say different things, but we compared the write operations mostly. The reason is that read operations depend heavily on the caching effect. So if you have a large cache, a RocksDB cache, you read all the requests from DRAM. So it's very hard to compare and justify any number.
Starting point is 00:45:07 So what is the better way? So we mostly focused on the write operations, and there are two types, right? You can use random operations or sequential operations. For the RocksDB case, if you do sequential operations, the overhead is basically minimal. But in the real world, sequential operations do not exist in the key value space.
Starting point is 00:45:31 So most operations are random, and we mostly focused on the random operations here. So this graph, you can take it as one data point, it's actually not a strong claim, but when you use RocksDB, you have around 13 WAF. What does that mean?
Starting point is 00:45:57 In the RocksDB system, if you write one byte, RocksDB actually writes 13 bytes to the disk. So with the existing key value stores, RocksDB or WiredTiger, the main problem of the system is that they have very high write amplification and read amplification. Even though you do not write that much, the system actually writes a lot. So we can get the benefit of using the key value SSD
Starting point is 00:46:28 by reducing that kind of overhead. I also mentioned reducing the redundant parts, and the write amplification actually hurts the SSD, because you write more data than you need to, so the device actually ages very quickly. And in terms of performance, the WAF actually eats up your bandwidth. So the device provides a bandwidth, but from the application perspective you can use only a fraction of that bandwidth due to the WAF.
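As a back-of-the-envelope illustration of that bandwidth point: with a write amplification factor of W, the application sees roughly the raw bandwidth divided by W. The device bandwidth below is invented for the example; the 13x figure echoes the RocksDB number above, and the near-1 figure for the key value SSD is an assumption for contrast.

```c
#include <stdio.h>

/* Back-of-the-envelope: effective write bandwidth = raw bandwidth / WAF.
 * 2000 MB/s is a made-up device limit; 13x echoes the RocksDB figure in
 * the talk, and 1x for the KV SSD is assumed here just for contrast. */
int main(void) {
    double raw_mbps    = 2000.0;
    double waf_rocksdb = 13.0;
    double waf_kv      = 1.0;

    printf("RocksDB effective: %.0f MB/s\n", raw_mbps / waf_rocksdb); /* ~154 */
    printf("KV SSD effective:  %.0f MB/s\n", raw_mbps / waf_kv);      /* 2000 */
    return 0;
}
```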
Starting point is 00:46:57 So that is the actual performance benefit and the reliability benefit of using the key value SSD. And this is another graph, for scalability. When you add more devices, in the key value SSD case, by adding more, you actually get more performance.
Starting point is 00:47:24 But in the RocksDB case, we actually tried to saturate the devices as much as we could by leveraging all the CPU power, but after some point, the CPU is saturated. So this system has 48 cores, but it saturates very quickly after six devices. So to scale more, you have to put more cores into the system, but this system costs around $5,000 per CPU, and we only used 18, actually, for the key value SSD.
Starting point is 00:47:58 The key value SSD just uses one core, but the existing system actually uses a lot of cores, so you can easily hit the CPU saturation. So that is the main benefit from the scale-up perspective. And we also configured a system using NVMe over Fabrics. NVMe over Fabrics has very low latency, and you can actually easily disaggregate the system, and by configuring
Starting point is 00:48:27 multiple systems, you can actually get a similar performance benefit in the scale-out case. So since I'm mostly running out of time, I'm going to stop here. So the main point of the experiments is that the traditional approach of providing a key value interface using host software basically eats up the CPU power a lot,
Starting point is 00:48:59 and because of that it's very hard to make systems scalable. But the key value case offloads the core functionality from the host to the device, collapses the software stack, and removes the redundant operations. By doing that, you can easily scale up by adding more devices, in terms of performance and capacity. So we showed the benefit from scale-up, scale-out,
Starting point is 00:49:30 and also from the single storage device perspective. So again, we can discuss this more offline or by email, but basically we did some implementation and experiments, and there is great potential for this technology, so we want to work with you
Starting point is 00:49:53 guys in the industry, and also move forward together. So, thank you very much. I'm happy to take questions. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community.
Starting point is 00:50:42 For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
