Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 3x21: Under the Hood of the Data Engine with Speedb
Episode Date: February 8, 2022

Data is the most important element of artificial intelligence, but how is that data managed and stored? In this episode of Utilizing AI, Adi Gelvan of Speedb goes deep under the hood to take a look at... the data engine along with Frederic Van Haren and Stephen Foskett. Facebook's RocksDB provides the basic storage for many webscale projects, managing metadata at massive scale. Because of the inherent limits of RocksDB, most cloud applications shard data across many data engines. But Speedb takes a different approach, bringing more advanced storage technology to build a compatible data engine. A good data engine can massively improve overall performance, and data scientists and AI engineers would be wise to consider the storage engine, not just the processing components and models.

Three Questions:
- Frederic Van Haren: In what areas will AI have little to no impact?
- Stephen: Is AI just a new aspect of data science or is it truly a unique field?
- Rob Telson of BrainChip: Where do you see AI having the most beneficial impact on our society?

Guests and Hosts:
- Adi Gelvan, Co-Founder and CEO of Speedb. Find out more at www.speedb.io or reach out to Adi at adi@speedb.io.
- Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren.
- Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.

Date: 2/08/2022
Tags: @SFoskett, @FredericVHaren, @speedb_io
Transcript
I'm Stephen Foskett. I'm Frederic Van Haren. And this is the Utilizing AI podcast.
Welcome to another episode of Utilizing AI, the podcast about enterprise applications for
machine learning, deep learning, data science, and other artificial intelligence topics.
In previous episodes, we've of course talked about the importance of data science. In
fact, I even added it to the intro for utilizing AI because it is so important. I think it's safe
to say that there is no AI or ML or DL or anything else that we're talking about without data,
without good data. Isn't that right, Frederic? Yeah, right. I mean, in the early days of AI, a lot of people were saying you need a lot of data.
So there was a heavy focus on the quantity of the data.
Nowadays, there's a much more important focus on the quality of the data, and also a focus
on ethical and moral and religious considerations, to make sure that the data being used is coming
from sources that have been vetted.
Yeah, exactly. And so it is so important because bad data, well, frankly, I guess it's like
anything else in the world. Bad data will yield bad results. And so that's one reason that we
were talking about quality of data. But also one of the interesting aspects of AI is, of course, that it needs a lot of data.
It's ravenous.
We've talked previously about the size of models and the challenges for storing and transporting that data. Today we have Adi Gelvan of Speedb with us, and I figured that it would be a good idea to bring him into the conversation to talk about
how the data engine really works and what is underneath all this stuff that's making everything
go. So Adi, it's nice to have you here. Hey, thanks for having me. It's a pleasure being here.
So first, tell us a little bit about yourself. Who are you and what's your background with data? Yeah, sure. So born and
raised in Israel. I had my time as an IT guy after my university period. I have a double degree in
math and computer science. And at some point, I moved to business, worked for some storage companies, and then I moved
to the startup space and had my share in some startup companies. In the last one of them,
actually, which is a storage unicorn called Infinidat, I met my co-founders, and
at some point, that's how Speedb started.
Yeah, it's interesting.
I mean, when we talk about data engines, there are so many data engines out there and with a lot of innovation going on.
So what gave you the impression that you could come up with a more innovative data engine compared to what the market was offering?
That's a tricky question for an Israeli.
The challenge we have here in Israel is that too many people think they can do everything better themselves.
So you're talking to one of these guys.
And so, no, seriously, my co-founders, Chilik and Mike,
who are the brainiacs in the team, they faced a challenge in one of the projects we had where they had to pick a storage engine to be utilized in the storage system to manage the metadata.
And they decided not to develop this within the company, but to take third-party
software. Storage engines are real deep tech: it's a very thin layer, but very
sophisticated. And they went to the market and needed to pick what is the gold standard
for storage engines. And they looked at everything you can think about, and they found out that RocksDB from Facebook
was the most prevalent and most popular,
being used by tens of thousands of customers.
And they said, okay, can Facebook be wrong, right?
So they tried it, and they saw that it was working great
in very small data sizes.
When they tested it on large data sizes,
they saw that it didn't really function well,
and they went to the community, went to Facebook,
and then they realized that RocksDB and other storage engines,
there's lots of innovation around it,
but no one actually took it to the next level.
These components were designed to manage metadata, and metadata typically is small.
And 10 years ago, 15 years ago, that was really the case. But now in 2021 and 2022, metadata is the fastest growing data segment in the market.
If you look at the ratio between the data and metadata today, it has completely changed.
And when they said, okay, how do we actually work with a storage engine to manage large amounts of data?
The answer was simply shard it, shard the problem,
find workarounds, do data manipulation
that the storage engine will support.
Each storage engine will support a very small amount of data.
And they said, okay, that's a nice workaround. But if you're
looking at the cloud era, that's a very expensive workaround in terms of resources and development.
So how come we don't have a storage engine or a data engine that can actually manage large capacity on a single
node? And then they said, okay, we know something about data. My partners
have between them around 160 patents in data software and algorithms. So they said, okay,
now we know what we want to do. And they left Infinidat. We met together and decided, okay, there's a good and valuable mission here.
And we started Speedb.
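The sharding workaround described here — splitting the keyspace across many small engine instances because each single instance only handles a limited data size well — can be sketched roughly like this. This is a hypothetical illustration in Python, with plain dicts standing in for the embedded storage-engine instances:

```python
import hashlib

class ShardedStore:
    """Illustrative sketch: spread keys across N small engine instances,
    since each single instance only copes well with a limited data size."""

    def __init__(self, num_shards=4):
        # Each dict stands in for one embedded storage-engine instance.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key: bytes) -> dict:
        # Stable hash routing, so the same key always lands on the same shard.
        idx = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % len(self.shards)
        return self.shards[idx]

    def put(self, key: bytes, value: bytes) -> None:
        self._shard_for(key)[key] = value

    def get(self, key: bytes):
        return self._shard_for(key).get(key)

store = ShardedStore()
store.put(b"user:42", b"metadata-blob")
print(store.get(b"user:42"))  # b'metadata-blob'
```

The expense Adi points at lives outside this sketch: rebalancing shards, cross-shard scans, and the operational overhead of running many instances instead of one.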
Yeah, the question I have for you there is, it's interesting that you're coming from the
storage area where there's a lot of need for high performance and very low latency.
So what made you think of forking RocksDB and doing something on your own, compared
to working with the RocksDB team to speed up their project and their product? Was there a
particular reason why you felt RocksDB wasn't cutting it for you? Yeah, so that's a very
interesting question. It was actually one of the questions we faced early on.
And so I had my time in business,
and my take on every technology that you are developing is that
you need to make sure a lot of people will use it,
or can use it, need to use it, and will use it.
And so speaking the RocksDB API
was an obvious choice for me, because we know there's
a huge market, with many customers suffering from the same problems. So going with the RocksDB API was
super important for me. And that actually forced us to face some issues, because Speedb
inside is essentially totally
different. We have our new IP and new technology and special algorithms, but
we insisted on aligning ourselves to the RocksDB protocol. That's because of
the market size: we want to solve a huge problem, and we want to make sure
that the adoption rate is fast. Now, from a technology point of view, the Facebook guys are super talented, and it's a huge project.
They made RocksDB to really, really fit their needs.
Facebook's way of working with data is with very small sizes on thousands and thousands of small nodes.
And it's perfect for Facebook.
When you look at the real world or the enterprise world, it's totally different.
No one has the resources that Facebook has.
No one can actually sustain the costs of what it means.
Not many customers have hundreds of C++ developers
specialized in data.
That's not what they do.
And when we saw what RocksDB did,
it was very good for Facebook,
but not really designed for enterprise scale
and for large scale.
When we spoke to Facebook to solve our own problem,
they said, no, no, that's not what we're
going to do. It was designed for us and that's how it's going to stand. We said, okay, there's a huge
market there of companies using this technology, but it doesn't really fit what they need. We think
we can do it better. We are storage people. We have designed super large exabyte size systems.
We think we can do that. And when we looked at all the research done around RocksDB in the market, in the academia,
they all were trying to solve small scale issues within RocksDB.
And we said, let's rewrite it.
Let's design it to scale.
And that's what we did.
That's what Speedb is about. And we think that when you look 10 years down the line, Speedb will probably, we hope, be the de facto standard storage engine. We call it a data engine because it's much more than a storage engine; now it's about data.

I think that's the interesting thing as well: essentially what you've got is technology that was designed for a purpose by a large company that needed a product that did this thing.
And then other people, because it's open source, other people adopted that technology.
And as so often happens, it gets used in different ways that the original designers didn't intend.
And I think this is one of the aspects of the modern software environment, especially with
open source, that things tend to get used in ways that the designers maybe didn't intend.
You know, things that were designed for use inside the firewall get used on the internet,
or, you know, yeah, storage, or data, or machine learning algorithm gets used in a completely
different context. And sometimes it does take a different perspective than the originators
of the technology in order to make it work. I wonder, for the benefit of the audience,
let's kind of take this up a step. So the data engine is fundamentally where the, you know, how the data is stored
underneath the database. And then the database is sort of the organization and management of that
data. And then a machine learning application is going to be accessing data in some way.
Talk to me about that chain from basically from disk to ML.
Okay, right.
So a database is an application, right?
It's an application that is meant for structured or unstructured data.
It gathers data in a way that the application above can actually use it logically.
But a database is merely another application.
There are many, many use cases of applications using RocksDB or other storage engines directly within them to access either a database, or cloud S3, or different media, and there are many more. But when you look from the application level,
the application is talking to a certain database
or another application that manages the data.
And that layer is talking to the media,
the hardware beneath.
It can be a file system or an object or a drive,
a bare metal drive.
What the storage engine is, it's a very thin layer
that actually determines the layout of the data on the media. Now, the media itself
is a self-contained component. It will store whatever it's given, but the data structure of how
it's being stored is determined by the storage engine. So this is a very
thin layer. Many users don't even know that it exists.
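As a rough mental model of that chain — application on top, database in the middle, storage engine as the thin layer deciding the physical layout — here is a hypothetical sketch. The put/get-style interface is loosely modeled on embedded key-value engines like RocksDB; the class names and key scheme are made up for illustration:

```python
class StorageEngine:
    """The thin layer that decides how data is laid out on the media.
    An in-memory dict stands in for the real on-media structures."""
    def __init__(self):
        self._kv = {}
    def put(self, key, value):
        self._kv[key] = value
    def get(self, key):
        return self._kv.get(key)

class Database:
    """The database organizes records logically; it delegates the
    physical layout to the engine underneath it."""
    def __init__(self, engine):
        self.engine = engine
    def insert(self, table, row_id, row):
        # The database maps its logical model onto flat key-value pairs.
        self.engine.put(f"{table}/{row_id}", row)
    def lookup(self, table, row_id):
        return self.engine.get(f"{table}/{row_id}")

# An ML application sits on top and never sees the engine directly.
db = Database(StorageEngine())
db.insert("features", 1, {"pixels": [0.1, 0.9]})
print(db.lookup("features", 1))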
And it's very funny. Ten years ago, when you said storage engine,
very few people would know. Nowadays, they know because it became a bottleneck. It started
making noise and problems. And now people are well aware. So now this small piece,
very, very crucial and important piece that was hidden,
now is facing some challenges and needs someone to actually
solve them. So it was under the hood. Now, gradually, it's coming
out from under the hood. Now, when you're looking at AI, ML, and this layer,
it has a very important part in the storage engine
revealing itself from under the hood.
Because we were saying that in AI and ML, quantity matters. The more objects you have, the more accurate your AI will be.
Now, if 10 years ago, quantity would mean the data size or the capacity, now quantity means
the amount of objects. In AI and ML, objects are usually small, and you need them in very, very large capacities.
These large capacities of billions and trillions of objects actually determine the ratio between
the metadata and the data, which has come to the point where the metadata now sometimes is bigger than the data itself.
Hence, the storage engine that was used to manage small amounts of data
now needs to manage a large amount of data.
And that's basically the problem.
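The arithmetic behind that shift is simple: per-object metadata is roughly fixed in size, so as objects shrink and multiply, the metadata share grows. A back-of-the-envelope sketch — the 256-byte per-object metadata figure is an assumption for illustration, not a measured number:

```python
def metadata_ratio(num_objects, avg_object_bytes, meta_bytes_per_object=256):
    """Illustrative arithmetic only: assumes a fixed metadata cost per
    object, so smaller objects mean a larger metadata share."""
    data_bytes = num_objects * avg_object_bytes
    meta_bytes = num_objects * meta_bytes_per_object
    return meta_bytes / data_bytes

# A billion 1 MB objects: metadata is a rounding error.
print(metadata_ratio(1_000_000_000, 1_000_000))
# A billion 1 KB objects: metadata is roughly a quarter of the data.
print(metadata_ratio(1_000_000_000, 1_000))
```

Push the average object size down another order of magnitude and the metadata overtakes the data, which is the crossover Adi describes.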
So when you talk about storage engine and the structure, should I envision it that one day when you want to store the data on, let's say, a hard drive,
and in the future you want to move that data to a memory structure, does the data structure then change because the media changed?
And there probably might be more optimized ways of storing it. Or should I assume
that the data structure is independent of the media, and when you go from one media to another
one, that the data structure stays the same? A good question. It actually really matters.
It really matters. The structure or the layout of the data on the media really depends on the specification of the data.
If you're talking about tapes, which belong to the prehistoric period, then your access has to be
sequential. When you're talking about a spinning drive, then sequential is better than random. When you're talking about SSD and NVMe,
or even memory today, the layout does matter.
And if the layout is not really calculated right
or optimized for the media, then first,
you will not utilize the media right,
and its performance will not be sufficient.
And second, the performance of the application,
since it's not optimized, will suffer.
So it does matter.
And if you look, for example, in the storage engine market,
LevelDB from Google was designed to support spinning drives,
whereas RocksDB is designed to support flash drives.
In Speedb, we designed the system to dynamically support
spinning drives, flash, and memory,
and to be very, very efficient according to the media you're working on.
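For context, LevelDB and RocksDB are both log-structured merge-tree (LSM) engines, a design whose core trick is exactly this kind of media-aware layout: absorb random writes in memory, then flush them as sorted runs so the media only ever sees sequential writes. A toy sketch of that idea — deliberately simplified, not how any real engine is implemented:

```python
import bisect

class ToyLSM:
    """Toy log-structured merge idea: buffer random writes in a memtable,
    flush them as sorted runs so the media only sees sequential writes."""
    def __init__(self, memtable_limit=3):
        self.memtable = {}
        self.runs = []            # each run: a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # One sorted, sequential write per flush -- friendly to any media.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):      # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = ToyLSM()
for k, v in [("b", 1), ("a", 2), ("c", 3), ("a", 9)]:
    db.put(k, v)
print(db.get("a"))  # 9 -- the memtable update shadows the flushed run
```

Real engines add compaction, bloom filters, and write-ahead logs on top of this skeleton; tuning those pieces per media type is where the engineering lives.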
That's why I assume that, as time goes by, you will have to update Speedb with
new media types
and the performance criteria of those media types.
Now, you'd also talked a little bit about metadata
and data layout.
I presume that the data structure for both
can be significantly different as you mentioned, right?
A lot of objects, a lot of small objects
where your metadata can actually become the performance
bottleneck versus your actual data. Is that a true statement then that people can assume that
the data structure for both will look different? Not necessarily. So the main difference between
metadata and data is that metadata is pretty much designed to, one, describe the data, give you hints about where the data is,
and all sorts of things you need to know about the data to allow you to access it.
So the main difference between metadata and data is that metadata you're accessing all the time.
When you're scanning, when you wanna get info,
not necessarily fetch the data.
So you need to have it very, very close
and the response time needs to be as close to zero
as you can.
That's why the metadata will usually reside in the memory
because you need fast access.
And also, because you're accessing it so much,
if you're accessing it wrong,
then you want the media to be very, very fast, preferably
memory.
So if you make mistakes, they're
forgivable, because the memory is very fast.
The challenge with metadata today
is that no one wants to pay for the amount of memory needed
to hold all the metadata.
So you store it on media, and now, with all
the bad design and the algorithms, you really pay a huge price for the mistakes you make.
Okay, so it's not that the metadata should be treated differently than the data; rather,
you want very, very high performance at scale on the metadata, so the application can function right.

You know, it strikes me that a lot of data scientists and ML engineers build out their system and really, really try to optimize their computation, their data models, the GPUs or ASICs
or special purpose processors that they're using for training and inferencing. But I think a lot
of them may not really even consider, as one of the things that you mentioned rings so true to me,
that most people don't even consider the data engine. Even database people may not
consider the data engine. I've been involved in that community for a long time. For example,
in the Microsoft SQL space or in the MySQL space, there's a lot of tips about switching out the data
engine and sort of weird best practices about which data engine
to use, but it seems like a lot of folklore and not a lot of technology. And given that,
given the fact that even the database people or the data scientists may not know much about data
engine and storage and layout and optimization, is it possible there's a heck of a lot more performance that can be wrung out of
these systems with a better data engine, just like there is with processors or with networking
interconnects or with other elements, flash memory, all sorts of things? Yeah, so very good
question. So I will not talk in theory. I will give you the real life numbers we see from working with the customers.
If you take a database and you optimize it on the database level, like DBAs and the data science guys do, then you will improve by a single-digit number, five,
maybe up to 10%.
If you work on a particular workload
with an optimized data engine,
you can improve the performance 5x.
That's 500%.
We see cases at Speedb where we improve it a thousand percent, 10x.
And I'm not talking about some weird
databases or applications.
We're talking about MySQL and
Mongo and Cockroach and databases, you know,
like Cassandra, that everyone uses.
I think storage engines have pretty much been treated as, I will say, an atomic part. You would take it, you would install
it, and you would use it as is. What we did, we said, okay, let's look inside. We opened the atom and we saw that it
wasn't an atom; it was a molecule. It actually has components that need to be redesigned.
And I can tell you today that we are working with almost all of the biggest database vendors,
and we show them that when you take Speedb
and you replace RocksDB with it,
you get anywhere from 200% to 1000% performance improvement.
And that has a lot of value.
Yeah, so how do you do that?
If a customer comes to you and says,
hey, we have a need to improve,
how do you approach
that customer? I mean, I assume it isn't just an in-place
replacement, or is there a little bit of tuning and testing on some test
data, just to see which options fit best? So I will start from
the bottom: technically, it's a simple drop-in replacement.
Our API is 100% compatible with RocksDB.
It's the same API.
Your application will not even know that it's Speedb and not RocksDB.
Same API, identical.
And that was very important from the go-to-market perspective. Now, it's not one size fits all because different customers have different workloads.
Some of them have very, very small data sizes.
If you're talking about a very small data size that resides in memory, then Speedb's technology will not necessarily help you. So when we talk to a customer, we speak to the data people of the company, and
we really make sure that they're suffering from issues.
And according to the issues they're suffering from, whether it's IO hangs,
stalls, write amplification, stalls on the database level,
or wear-out of the drives, we will very, very quickly recognize whether
these are things that we can solve. And in most cases, we do. Then we simply send the customer
the library. It's a drop-in replacement. In 30 seconds, they're running their applications,
doing their own benchmark. And happily for us, in most cases, they get back to us and say, wow, that's great.
In some cases, it works, but then they see that they have some more issues.
And then we help them, whether it's on the database level or on the Speedb level, and
do some optimizations.
But we're very happy to see that most of the customers, it's
simply plug and play.
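The drop-in idea rests entirely on API compatibility: if two engines expose an identical call surface, the application cannot tell which one it is linked against. A minimal hypothetical sketch of that contract — dict-backed stand-ins, not real engine code:

```python
class RocksLikeEngine:
    """Stand-in for the incumbent engine. Both classes expose the exact
    same put/get surface, which is what makes a drop-in swap possible."""
    def __init__(self):
        self._kv = {}
    def put(self, key, value):
        self._kv[key] = value
    def get(self, key):
        return self._kv.get(key)

class AlternativeEngine(RocksLikeEngine):
    """A replacement engine: different internals could live here,
    but the API contract above stays byte-for-byte identical."""
    pass

def application(engine):
    # Application code is written against the API, not the implementation,
    # so it runs unchanged on either engine.
    engine.put("model/epoch", "17")
    return engine.get("model/epoch")

assert application(RocksLikeEngine()) == application(AlternativeEngine())
```

In the real C or C++ world the same effect comes from shipping a library that exports the same symbols and headers, so relinking is the only change the customer makes.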
Right.
And then another question regarding the storage engine: is the data structure static?
Meaning, the data structure architecture, when you start using it in a project,
will it stay consistent as a user adds data, or can the data structure dynamically change as customers add data?
And the data might look significantly different
than what they had in the early days of the installation of the product.
Right.
So it's like you've seen our roadmap. On the basic level, we simply designed a new data structure that is much more efficient and much more scalable than RocksDB's. In future versions, we will have dynamic change of the data layout according to
the workload: sometimes you'll be write-intensive, sometimes you'll be read-intensive, and sometimes
you'll be working with large objects and small objects. We are currently developing our next
version, which will include dynamic auto-tuning of the system to your workload.
Right, and I think that can be challenging because, you know, when you profile workloads,
the workloads will also change over time.
So it's, you know, you're kind of chasing your own tail,
but it's the best way to get the best performance out of the product.
Yeah, I can tell you that when we first came out with Speedb, the results were great.
When we started selling to customers, then you really realize what the real problems
are, right?
They're different than what you see in the lab.
And we are talking a lot to our customers. And I think we have very good
hints on the most popular workloads and what we need to do. But yeah, I'm sure that we'll have to
work on it on and on to make sure we improve ourselves and listen to our customers because, you know,
they need to be happy in the end. Is it possible for you to tell us at all what the
best workloads are? I mean, what kind of applications are people using this thing for
generally? I mean, not specific companies maybe, but, you know, what sort of applications,
especially in machine learning, are we seeing with Speedb? Yeah, so I think the nice thing about Speedb is that
we are so low in the stack.
So we are sitting behind the application
and sometimes behind the database itself
and sometimes in the storage system itself.
So the change we are doing in the data is so basic.
It's so basic that even a small change that we do
will affect your application tremendously.
Sometimes if you solve a 2x problem really on the lowest level, it can be translated to a 200% on the upper level.
So I can tell you that we have customers doing AI ML on streaming.
We have customers who are standard, regular legacy database companies.
We are talking to some very, very big storage companies
who in their storage stack, they have applications inside
that needs to manage metadata.
We're talking to all kinds of customers. I would say that
what they do have in common is that
they simply need to deal with large amounts of data and large numbers of objects.
And they can vary from, yeah, from-
And that does describe a lot
of machine learning applications.
I mean, a large amount of objects.
I mean, that's one of the things we talk about
with machine learning is the huge, huge numbers
of parameters that are involved in the massive data sets,
especially with training.
Yeah.
Yeah, definitely.
Excellent.
So I think that the takeaway for this, for our listeners,
if you're a data scientist, especially,
is maybe do consider the data engine
underneath your database. Maybe do
consider whether that can be improved with a replacement. And there are a number of them. I
mean, it doesn't have to be, you know, maybe Speedb isn't the right choice for you, but there are a
lot of replacement data engines and many of them can offer a lot better performance and, you know,
different workloads than the one that just is the default.
And so maybe consider that.
And the same thing with machine learning engineers.
Maybe consider the storage.
Maybe consider the storage layer and how data is being stored, not just hardware and not just improving the, you know, adding more GPUs or adding, you know, special processors or something.
Think about where in the stack the bottleneck lies. Because if there's one thing I know about
the history of computing, it's all about moving bottlenecks up and down the stack. And at this
point, certainly compute is a bottleneck for many applications, but many others, compute is not the
bottleneck and you need to think about other areas. So we've now reached the point in the podcast where we move on to three questions. This is a
tradition here where we ask our guest three unexpected questions. We're going to get his
off-the-cuff answers right now. He's not been prepped or warned what they might be.
I'm going to ask one and Frederick is going to ask one and then we're also going to have a question from a previous guest as well.
So let's start off. Frederick, do you want to ask the first question?
Sure. So do you see any areas where AI and ML will have little to no impact, whether positive or negative, on the world?
You can see it anywhere from autonomous cars to TV to what we're doing online.
No.
So I can't really tell you the places that it won't impact. And if there are, then in one or two years, it will. So no, I don't see.
Okay. And that was a new question. I love it, Fred. Thanks for bringing that one.
Next, for me, my question is, when you think of the field of AI, is AI a new aspect of data science, or is it truly a new field?
I'm not sure I'm qualified to answer this, but I can tell you that
I've been reading a lot of Yuval Noah Harari,
a very famous historian, in the past years,
and he's talking about the AI revolution.
And it seems like AI is just at the beginning.
And if you look forward 10, 20 years,
it's going to be a world of itself.
So not that I'm qualified, but it seems like it's a new science,
it's a new era, and it will impact everything we do.
Thanks for that.
And now, as promised, we're going
to use a question from a previous guest.
Rob Telson, the vice president of worldwide sales
for BrainChip, asks a question.
Take it away, Rob.
Hi. I'm Rob Telson with BrainChip. The question for today, where do you see AI
having the most beneficial impact on our society? I think there are lots of pros and cons about what can happen with AI. But if you look at the pros, I think that AI is going to
level the playing field in healthcare,
and it's going to allow people in Africa to
get the same service or level of healthcare
as people in California,
which I think is one of the most important things that humanity can do.
And I think AI can deliver that, through doctor bots and the ability to remotely serve people. Yeah, that's what I think. Well, thanks so much, Adi. That reminds me of what
we talked about with Sarah E. Berger just a couple of episodes ago about the impact of AI on
healthcare. So if this piques your interest, maybe go back to season three, episode 19,
and listen to Sarah talking about that as well. Adi, also, we look forward to your question for
a future guest if you have one. And if our listeners want to
contribute, please do reach out to host at utilizingai.com and
we'll record your question online. So thank you Adi for
joining us. Where can people connect with you? And do you
have any news or anything to share with the audience?
Yeah, so you can reach our website at www.speedb.io,
and me myself at adi@speedb.io.
I'll be happy to get any question from any listener.
And one thing that we are going to do that's going to be big:
we are going to go open source very, very soon.
So we hope to be able to serve a large number of developers and customers,
and allow anyone to enjoy the benefits of Speedb. Yeah, that's great news. I'm glad to hear that
because I think that in terms of trying it out, I think that that's a great way to do it. And then
they can maybe move on to the enterprise product if it's a good solution for them.
How about you, Frederick? What's going on in your life?
Well, I'm still helping enterprises
with efficient data management
and designing and deploying large-scale AI clusters.
And you can find me on LinkedIn and Twitter
as Frederic Van Haren.
And as for me, I'm looking forward
to our AI Field Day event, which is coming up May 18th through 20th.
We are getting some companies signed up and we're starting to get some interest from delegates as well.
If you'd like to be part of that, you can reach me at sfoskett at gestaltit.com.
So thank you for listening to the Utilizing AI podcast.
If you enjoyed this discussion, remember to subscribe, rate, and review the show in any
podcast application, since that does help. And please do share this show with your friends and
colleagues. This podcast is brought to you by gestaltit.com, your home for IT coverage from
across the enterprise. For show notes and more episodes, go to utilizing-ai.com, or you can find
us on Twitter at utilizing underscore AI. Thanks for listening, and we'll see you next week.