Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 3x21: Under the Hood of the Data Engine with Speedb

Episode Date: February 8, 2022

Data is the most important element of artificial intelligence, but how is that data managed and stored? In this episode of Utilizing AI, Adi Gelvan of Speedb goes deep under the hood to take a look at the data engine along with Frederic Van Haren and Stephen Foskett. Facebook's RocksDB provides the basic storage for many webscale projects, managing metadata at massive scale. Because of the inherent limits of RocksDB, most cloud applications shard data across many data engines. But Speedb takes a different approach, bringing more advanced storage technology to build a compatible data engine. A good data engine can massively improve overall performance, and data scientists and AI engineers would be wise to consider the storage engine, not just the processing components and models.

Three Questions:

- Frederic Van Haren: In what areas will AI have little to no impact?
- Stephen: Is AI just a new aspect of data science or is it truly a unique field?
- Rob Telson of BrainChip: Where do you see AI having the most beneficial impact on our society?

Guests and Hosts:

- Adi Gelvan, Co-Founder and CEO of Speedb. Find out more at www.speedb.io or reach out to Adi at adi@speedb.io.
- Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren.
- Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.

Date: 2/08/2022
Tags: @SFoskett, @FredericVHaren, @speedb_io

Transcript
Starting point is 00:00:00 I'm Stephen Foskett. I'm Frederic Van Haren. And this is the Utilizing AI podcast. Welcome to another episode of Utilizing AI, the podcast about enterprise applications for machine learning, deep learning, data science, and other artificial intelligence topics. In previous episodes, we've of course talked about the importance of data science. In fact, I even added it to the intro for Utilizing AI because it is so important. I think it's safe to say that there is no AI or ML or DL or anything else that we're talking about without data, without good data. Isn't that right, Frederic? Yeah, right. I mean, in the early days of AI, a lot of people were saying you need a lot of data. So there was a heavy focus on the quantity of the data.
Starting point is 00:00:50 Nowadays, there's a much more important focus on the quality of the data, and also a focus on ethical and moral and religious considerations to make sure that the data being used is coming from sources that have been vetted. Yeah, exactly. And so it is so important because bad data, well, frankly, I guess it's like anything else in the world. Bad data will yield bad results. And so that's one reason that we were talking about quality of data. But also one of the interesting aspects of AI is, of course, that it needs a lot of data. It's ravenous. We've talked previously about the size of models and the challenges for storing and transporting that data. So when I met Adi Gelvan of Speedb, I figured that it would be a good idea to bring him into the conversation to talk about
Starting point is 00:01:45 how the data engine really works and what is underneath all this stuff that's making everything go. So Adi, it's nice to have you here. Hey, thanks for having me. It's a pleasure being here. So first, tell us a little bit about yourself. Who are you and what's your background with data? Yeah, sure. So born and raised in Israel. I had my time as an IT guy after my university period. I met a double degree in math and computer science. And at some point, I moved to business, worked for some storage companies and then I moved to the startup space and had my share in some startup companies and in the last one of them actually which is a storage unicorn called Infinidat, I met my co-founders and that's at some point that's how Speedybee started.
Starting point is 00:02:47 Yeah, it's interesting. I mean, when we talk about data engines, there are so many data engines out there and with a lot of innovation going on. So what made you give the impression that you could come up with a more innovative data engine compared to what the market was offering? That's a tricky question for an Israeli. The challenge we have here in Israel is that too many people think they can do everything better themselves. So you're talking to one of these guys. And so, no, seriously, my co-founders, Chilik and Mike, who are the brainiacs in the team, they faced a challenge in one of the projects we had where they had to pick a storage engine to be utilized in the storage system to manage the metadata.
Starting point is 00:03:41 And they decided not to develop this within the company, but to take a third party software. Storage engines are real deep tech and it's a very thin layer, but we're very sophisticated. And they went to the market and they need to pick what is the golden standard for storage engine. And they looked at everything you can think about, and they found out that RocksDB from Facebook was the most prevalent and most popular, being used by tens of thousands of customers. And they said, okay, can Facebook be wrong, right? So they tried it, and they saw that it was working great
Starting point is 00:04:23 in very small data sizes. When you spoke or when they tested it on large data sizes, they saw that it didn't really function well, and they went to the community, went to Facebook, and then they realized that RocksDB and other storage engines, there's lots of innovation around it, but no one actually took it to the next level. These components are, they were designed to manage metadata and metadata typically is small.
Starting point is 00:04:56 And 10 years ago, 15 years ago, that was really the case. But now in 2021 and 2022, metadata is the fastest growing data segment in the market. If you look at the ratio between the data and metadata today, it has completely changed. And when they said, okay, how do we actually work with a storage engine to manage large amounts of data? The answer was simply shard it, shard the problem, find workarounds, do data manipulation that the storage engine will support. Each storage engine will support a very small amount of data. And they said, okay, that's a nice workaround.
Starting point is 00:05:44 But if you're looking at the cloud era, that's a very small amount of data. And they said, okay, that's a nice workaround. But if you're looking at the cloud era, that's a very expensive workaround in terms of resources and development. So how come don't we have a storage engine or a data engine that can actually manage on a single node large capacity? And then they said, okay, we know something about data. My partners, they have mutually around 160 patents in data software and algorithms. So they said, okay, now we know what we want to do. And they left Infinidat. We met together and decided, okay, there's a good and valuable mission here. And we started SPDB. Yeah, the question I have for you there is, it's interesting that you're coming from the
Starting point is 00:06:34 storage area where there's a lot of need for high performance and very low latency. So what made you think that forking RocksDB and doing something on your own compared to working with the RocksDB team to speed up their project and their product? Was there a particular reason why you felt RocksDB wasn't cutting it for you? Yeah, so that's a very interesting question. It was actually one of the questions we faced early on. And so I had my time in business and my take on every technology that you are developing, you need to make sure a lot of people will use it
Starting point is 00:07:18 or can use it, need to use it, and will use it. And to speak through the RocksDB API was very trivial for me because we know there's a huge market, many customers suffering from the same problems. So going with the RocksDB API was super important for me. And that actually, it made us need to face some issues because SpeedDB essentially inside is totally different we have our new IP and new technology and special algorithms but we am we insisted on aligning ourselves to the rocks DB protocol that's because of
Starting point is 00:07:57 the market size and we want to solve a huge problem and we want to make sure that the adoption rate is fast now Now, from a technology point of view, the Facebook guys, super talented, it's a huge project. They made RocksDB to really, really fit their needs. Facebook's way of working with data is with very small sizes on thousands and thousands of small nodes. And it's perfect for Facebook. When you look at the real world or the enterprise world, it's totally different. No one has the resources that Facebook has. No one can actually sustain the costs of what it means.
Starting point is 00:08:42 Not many customers have hundreds of C++ developers specialized in data. That's not what they do. And when we saw what RocksDB did, it was very good for Facebook, but not really designed to enterprise scale and to large scale. When we spoke to Facebook to solve our own problem,
Starting point is 00:09:04 they said, no, no, that's not what we're going to do. It was designed for us and that's how it's going to stand. We said, okay, there's a huge market there of companies using this technology, but it doesn't really fit what they need. We think we can do it better. We are storage people. We have designed super large exabyte size systems. We think we can do that. And when we looked at all the research done around RocksDB in the market, in the academia, they all were trying to solve small scale issues within RocksDB. And we said, let's rewrite it. Let's design it to scale.
Starting point is 00:09:40 And that's what we did. That's what Spey B is about. And we think that when you look at 10 years down the line, then speedy B will probably, we hope, will be the de facto standard storage engine. We call it data engine because it's much more than storage engine. Now it's about data. I think that's the interesting thing as well, is that essentially what you've got is technology that was designed for a purpose by a large company that needed a product that did this thing. And then other people, because it's open source, other people adopted that technology. And as so often happens, it gets used in different ways that the original designers didn't intend. And I think this is one of the aspects of the modern software environment, especially with open source, that things tend to get used in ways that the designers maybe didn't intend. You know, things that were designed for use inside the firewall get used on the internet, or, you know, yeah, storage, or data, or machine learning algorithm gets used in a completely
Starting point is 00:10:47 different context. And sometimes it does take a different perspective than the originators of the technology in order to make it work. I wonder, for the benefit of the audience, let's kind of take this up a step. So the data engine is fundamentally where the, you know, how the data is stored underneath the database. And then the database is sort of the organization and management of that data. And then a machine learning application is going to be accessing data in some way. Talk to me about that chain from basically from disk to ML. Okay, right. So a database is an application, right?
Starting point is 00:11:30 It's an application that is meant for structured or unstructured data. It gathers data in a way that the application above can actually use it logically. But a database is merely another application. There are many, many use cases of applications using directly RocksDB or storage engines within them to access either a database or cloud S3 or different media. but there are many more. But when you look from the application level, the application is talking to a certain database or another application that manages the data. And that layer is talking to the media, the hardware beneath.
Starting point is 00:12:17 It can be a file system or an object or a drive, a bare metal drive. What the storage engine is, it's a very thin layer that actually determines the layout of the data on the media. Now, the storage itself is a self-contained component. It will store the data the way it's working, but the data structure of how it's being stored is determined by the storage engine. So this is a very thin layer. Many users don't even know that it exists. And it's very funny. Ten years ago, when you said storage engine,
Starting point is 00:13:05 very few people would know. Nowadays, they know because it became a bottleneck. It started making noise and problems. And now people are well aware. So now this small piece, very, very crucial and important piece that was hidden, now is facing some challenges and needs someone to actually solve it. So it was under the hood. Now gradually, it's coming up the hood. Now, when you're looking at AI, ML, and this layer, it has a very important part in the storage engine revealing itself from under the hood.
Starting point is 00:13:54 Because we were saying that in AI and ML, quantity matters. The more objects you have, the more accurate your AI will be. Now, if 10 years ago, quantity would mean the data size or the capacity, now quantity means the amount of objects. In AI and ML, objects are usually small, and you need them in very, very large capacities. These large capacities of billions and trillions of objects actually determine the ratio between the metadata and the data, which has come to the point where the metadata now sometimes is bigger than the data itself. Hence, the storage engine that was used to manage small amounts of data now needs to manage a large amount of data. And that's basically the problem.
Starting point is 00:14:59 So when you talk about storage engine and the structure, should I envision it that one day when you want to store the data on, let's say, a hard drive, and in the future you want to move that data to a memory structure, does the data structure then change because the media changed? And there probably might be more optimized ways of storing it. Or should I assume that the data structure is independent of the media, and when you go from one media to another one, that the data structure stays the same? A good question. It actually really matters. It really matters. The structure or the layout of the data on the media really depends on the specification of the data. If you're talking about tapes that belong to the prehistory period, then your access has to be sequential. When you're talking about drive, then better sequential rather than random. When you're talking about SSD and NVMe,
Starting point is 00:16:07 or even memory today, the layout does matter. And if the layout is not really calculated right or optimized for the media, then for once you can, you can utilize the media not right, you will lose the media, performance will not be sufficient. And second, the performance of the application, since it's not optimized, will suffer.
Starting point is 00:16:38 So it does matter. And if you look, for example, in the storage engine market, LevelDB of Google was designed to support spinning drives, where the RocksDB is designed to support flash drives. In SpeedyB, we designed the system to be able to support or dynamically support both spinning, flash and memory and be very very efficient according to the media you're working on. That's why I assume that as time goes by you will have to update Speedybee with
Starting point is 00:17:22 different new or new media types and the performance criteria of those media types. Now, you'd also talked a little bit about metadata and data layout. I presume that the data structure for both can be significantly different as you mentioned, right? A lot of objects, a lot of small objects where your metadata can actually become the performance
Starting point is 00:17:45 bottleneck versus your actual data. Is that a true statement then that people can assume that the data structure for both will look different? Not necessarily. So the main difference between metadata and data is that metadata is pretty much designed to, one, describe the data, give you hints about where the data is, and all sorts of things you need to know about the data to allow to access it. So the main difference between metadata and data is that metadata you're accessing all the time. When you're scanning, when you wanna get info, not necessarily fetch the data. So you need to have it very, very close
Starting point is 00:18:30 and the response time needs to be as close to zero as you can. That's why the metadata will usually reside in the memory because you need fast access. And also because you're accessing it so much, then if you're accessing it wrong, then you want the media to be very, very fast, preferably memory.
Starting point is 00:18:50 So if you do mistakes, then they're forgivable because the memory is very fast. The challenge with metadata today is that no one wants to pay the amount of memory to have all the metadata in the data. So you store it on media and now all the bad design and the algorithms um you really um pay a huge price on the on the mistakes you do okay so it's it's not about that the metadata should be treated different than the data rather
Starting point is 00:19:22 than you want very very high performance at scale scale on the metadata so the application can function right. system and really, really try to optimize their computation, their data models, the GPUs or ASICs or special purpose processors that they're using for training and inferencing. But I think a lot of them may not really even consider, as one of the things that you mentioned rings so true to me, that most people don't even consider the data engine. Even database people may not consider the data engine. I've been involved in that community for a long time. For example, in the Microsoft SQL space or in the MySQL space, there's a lot of tips about switching out the data engine and sort of weird best practices about which data engine to use, but it seems like a lot of folklore and not a lot of technology. And given that,
Starting point is 00:20:33 given the fact that even the database people or the data scientists may not know much about data engine and storage and layout and optimization, is it possible there's a heck of a lot more performance that can be wrung out of these systems with a better data engine, just like there is with processors or with networking interconnects or with other elements, flash memory, all sorts of things? Yeah, so very good question. So I will not talk in theory. I will give you the real life numbers we see from working with the customers. If you take a database and you optimize it on the database level, like DBAs do and the data science guys do, then you will improve a single digit number five, maybe two, 10%. If you work on a particular workload
Starting point is 00:21:31 with an optimized data engine, you can improve the performance 5X. That's 500%. We see cases in Speedy B that we improve a thousand percent 10 X. And I'm not talking about some weird, um, um, databases or applications. We're talking about my SQL and, uh, Mongo and cockroach and databases, you know,
Starting point is 00:22:01 like a Sandra that everyone uses. Um, I think storage engines, um, have pretty much been treated as, and I will say, an atom part. You would take it, you would install it, and you would use it as is. What we did, we said, okay's let's look inside we opened the atom and we saw that it wasn't an atom it was a molecule it actually has components that that needs to be redesigned and i can tell you today that we are working with almost um or the biggest database vendors today and we show them that when you take Speedybee and you replace it with RocksDB, you get anywhere from 200% to 1000% performance impact.
Starting point is 00:22:54 And that has a lot of value. Yeah, so how do you do that? If a customer comes to you and says, hey, we have a need to improve, how do you approach that customer I mean it there's nothing I assume it is like just in place replacements or is there a little bit of tuning and testing on some some test data just to see which options fit the best it's so I will start from from from
Starting point is 00:23:24 the bottom technically it's a simple drop-in replacement. Our API is 100% compatible with RocksDB. It's the same API. Your application will not even know that it's SpeedyBee and not RocksDB. Same API, identical. And that was very important from the go-to-market perspective. Now, it's not one size fits all because different customers have different workloads. Some of them have very, very small data sizes. If you're talking about very small data size that resides in the memory, then Speedybee's great technology will not necessarily help you. So when we talk to you, we speak to the guys of the data of the company, and
Starting point is 00:24:09 we really make sure that they're suffering from issues. And according to the issues they're suffering from, it can be IO hangs, stalls, rat amplification, stalls on the database level, wear out of the drives, we will very, very fast recognize if these are the things that we can solve. And in most cases, we do. Then we simply send the customer the library. It's a drop in your place. In 30 seconds, he's running the applications, doing his own benchmark. And happily for us, in most cases, they get back to us and say, wow, that's great. In some cases, it works, but then they seem that they have some more issues.
Starting point is 00:24:54 And then we help them, whether it's on the database level or on the speed level, and do some optimizations. But we're very happy to see that most of the customers, it's simply plug and play. Right. And then another question regarding to the storage engine is, is the data structure static? Meaning that the data structure architecture, when you start using it in a project, will it be consistent as a user adds data or can the data structure dynamically change as customers add data?
Starting point is 00:25:34 And the data might look significantly different than what they had in the early days of the installation of the product. Right. So it's like you've seen our roadmap. So on the basic level, we simply designed a new data structure that is much more efficient and much more scalable than RocksDB has. um sorry uh future abilities uh we will have um dynamic change of the data layout according to the workload sometimes you'll be write intensive sometimes you'll be read intensive and sometimes you'll be working with large objects and small objects we are currently developing um our next version that will include inside the dynamic and auto-tuning of the system to your workload. Right, and I think that can be challenging because, you know, when you profile workloads,
Starting point is 00:26:35 the workloads will also change over time. So it's, you know, you're kind of chasing your own tail, but it's the best way to get the best performance out of the product. Yeah, I can tell you that when we came out with Speedy Bee first, then results were great. When we started selling to customers, then you really realize what the real problems are, right? They're different than what you see in the lab. And we are talking a lot to our customers. And I think we have very good
Starting point is 00:27:09 hints on the most popular workloads and what we need to do. But yeah, I'm sure that we'll have to work on it on and on to make sure we improve ourselves and listen to our customers because, you know, they need to be happy in the end. Is it possible for you to tell us at all what the best workloads are? I mean, what kind of applications are people using this thing for generally? I mean, not specific companies maybe, but, you know, what sort of applications, especially in machine learning, are we seeing with Spey B? Yeah, so I think the nice thing about speedy B, we are so much low in the stack. So we are sitting behind the application
Starting point is 00:27:57 and sometimes behind the database itself and sometimes in the storage system itself. So the change we are doing in the data is so basic. It's so basic that even a small change that we do will affect your application tremendously. Sometimes if you solve a 2x problem really on the lowest level, it can be translated to a 200% on the upper level. So I can tell you that we have customers doing AI ML on streaming. We have customers who are standard, regular legacy database companies.
Starting point is 00:28:47 We are talking to some very, very big storage companies who in their storage stack, they have applications inside that needs to manage metadata. We're talking to all kinds of, I would say that what they do have in common, they simply need to deal with large amount of data and large amount of objects. And they can vary from, yeah, from- And that does describe a lot
Starting point is 00:29:18 of machine learning applications. I mean, a large amount of objects. I mean, that's one of the things we talk about with machine learning is the huge, huge numbers of parameters that are involved in the massive data sets, especially with training. Yeah. Yeah, definitely.
Starting point is 00:29:35 Excellent. So I think that the takeaway for this, for our listeners, if you're a data scientist, especially, is maybe do consider the data engine underneath your database. Maybe do consider whether that can be improved with a replacement. And there are a number of them. I mean, it doesn't have to be, you know, maybe speedy B isn't the right choice for you, but there are a lot of replacement data engines and many of them can offer a lot better performance and, you know,
Starting point is 00:30:04 different workloads than the one that just is the default. And so maybe consider that. And the same thing with machine learning engineers. Maybe consider the storage. Maybe consider the storage layer and how data is being stored, not just hardware and not just improving the, you know, adding more GPUs or adding, you know, special processors or something. Think about where in the stack the bottleneck lies. Because if there's one thing I know about the history of computing, it's all about moving bottlenecks up and down the stack. And at this point, certainly compute is a bottleneck for many applications, but many others, compute is not the
Starting point is 00:30:41 bottleneck and you need to think about other areas. So we've now reached the point in the podcast where we move on to three questions. This is a tradition here where we ask our guest three unexpected questions. We're going to get his off-the-cuff answers right now. He's not been prepped or warned what they might be. I'm going to ask one and Frederick is going to ask one and then we're also going to have a question from a previous guest as well. So let's start off. Frederick, do you want to ask the first question? Sure. So do you see any areas where AI and ML will have tremendous impact, whether positive or negative, on the world. You can see it anywhere from autonomous cars to TV to what we're doing online. No.
Starting point is 00:31:40 So I can't really tell you the places that it won't impact. And if there are, then in one or two years, it will. So no, I don't see. Okay. And that was a new question. I love it, Fred. Thanks for bringing that one. Next, for me, my question is, when you think of the field of AI, is AI a new aspect of data science, or is it truly a new field? I'm not sure I'm qualified to answer this, but I can tell you that I've been reading a lot about Yuval Noah Harari, it's a very famous historian in the past years, and he's talking about the AI revolution. And it seems like AI is just at the beginning.
Starting point is 00:32:36 And if you look forward 10, 20 years, it's going to be a world of itself. So not that I'm qualified, but it seems like it's a new science, it's a new era, and it will impact everything we do. Thanks for that. And now, as promised, we're going to use a question from a previous guest. Rob Telson, the vice president of worldwide sales
Starting point is 00:33:01 for BrainChip, asks a question. Take it away, Rob. Hi. I'm Rob Telson with BrainChip. The question for today, where do you see AI having the most beneficial impact on our society? I think AI, there are lots of pros and cons about what can happen with AI. But if you look at the pros, I think that AI is going to give, it's going to level the playing field in healthcare, and it's going to allow people in Africa get the same service or level of healthcare that people in California,
Starting point is 00:33:43 which I think is one of the most important things that humanity can do. And I think AI can leverage that. But doctor bots and the ability to remotely serve people. Yeah, that's my, that's what I think. Well, thanks so much, Adi. That reminds me of what we talked about with Sarah E. Berger just a couple of episodes ago about the impact of AI on healthcare. So if this piques your interest, maybe go back to season three, episode 19, and listen to Sarah talking about that as well. Adi, also, we look forward to your question for a future guest if you have one. And if our listeners want to contribute, please do reach out to host at utilizingai.com and we'll record your question online. So thank you Adi for
Starting point is 00:34:35 joining us. Where can people connect with you? And do you have any news or anything to share with the audience? Adi M.: Yeah, so you can reach our website at www.speedyb.io and I myself adi at speedyb.io I'll be happy to get any any question from
Starting point is 00:34:54 from any listener and one thing that we are going to do that's going to be big we are going to go open source very very soon so
Starting point is 00:35:01 we hope to be able to serve large amount of developers and customers and allow anyone to enjoy the benefits of SpeedDB. Yeah, that's great news. I'm glad to hear that because I think that in terms of trying it out, I think that that's a great way to do it. And then they can maybe move on to the enterprise product if it's a good solution for them. How about you, Frederick? What's going on in your life? Well, I'm still helping enterprises
Starting point is 00:35:29 with efficient data management and designing and deploying large-scale AI clusters. And you can find me on LinkedIn and Twitter as Frederick V. Heron. And as for me, I'm looking forward to our AI Field Day event, which is coming up May 18th through 20th. We are getting some companies signed up and we're starting to get some interest from delegates as well. If you'd like to be part of that, you can reach me at sfosket at gestaltit.com.
Starting point is 00:35:57 So thank you for listening to the Utilizing AI podcast. If you enjoyed this discussion, remember to subscribe, rate, and review the show in any podcast application, since that does help. And please do share this show with your friends and colleagues. This podcast is brought to you by gestaltit.com, your home for IT coverage from across the enterprise. For show notes and more episodes, go to utilizing-ai.com, or you can find us on Twitter at utilizing underscore AI. Thanks for listening, and we'll see you next week.
