Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 2x10: AI and Analytics Are Driving a New Kind of Storage with Brad King of Scality
Episode Date: March 9, 2021. Big data really wasn't all that big until modern analytics and machine learning applications appeared, but now storage solutions have to scale capacity and performance like never before. In this episode, Brad King, Co-Founder of Scality, joins Chris Grundemann and Stephen Foskett to discuss this new demand for scalable storage by AI applications. Applications like autonomous driving, log analysis, and travel booking are driving a massive need for storage as AI applications detect anomalies and support business intelligence. Scality had to tune their system to handle the massive scale of data supporting these applications, with up to a petabyte of log data being added and deleted in a single day. AI-driven tools are enabling customers to do what they never could do before, and it requires a balanced infrastructure stack to make it possible. Brad suggests that companies implementing AI applications need to find a system that scales with their needs and has API-driven data access, preferably with an object-based storage model. Guests and Hosts: Brad King is Co-Founder and CTO of Scality. Connect with Brad on Twitter at @Baslking. Chris Grundemann is a Gigaom Analyst and VP of Client Success at Myriad360. Connect with Chris at ChrisGrundemann.com and on Twitter at @ChrisGrundemann. Stephen Foskett is Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett. Date: 3/9/2021 Tags: @SFoskett, @ChrisGrundemann, @Scality, @Baslking
Transcript
Welcome to Utilizing AI, the podcast about enterprise applications for machine learning,
deep learning, and other artificial intelligence topics. Each episode brings experts in enterprise
infrastructure together to discuss applications of AI in today's data center. Today, we're
discussing how storage impacts AI and how AI is driving new demands for
storage. Our guest is Brad King of Scality. Hi there. Glad to be with you all. My name is Brad
King. I'm one of the co-founders of Scality, and my official title is Field CTO. That means
basically that I meet with most of our largest customers, learn about their
businesses, and offer them interesting solutions for their storage challenges. We are a software
solution, so a software company that provides very large-scale storage systems around the world.
And my name is Chris Grundemann. I'm an independent consultant, content creator, coach, and mentor.
And I'm Stephen Foskett, publisher of Gestalt IT and organizer of Tech Field Day and host of Utilizing AI right here.
So, Brad, I'm quite familiar with Scality being from the enterprise tech space and especially the enterprise storage space.
And the company is pretty well known in the enterprise as a provider of massive, massive scale software-driven storage solutions.
But I know that Scality is also being pulled increasingly into supporting data and analytics applications and AI-driven applications.
And so I'm wondering if you can start out by just saying, you know, what is it about AI that demands a new kind of storage?
And I guess, did this catch you off guard as somebody who had been developing a product
that ended up finding this new market?
It's interesting.
I think one of the key things has been something that people have been saying for a long time is that big data analytics using data
for machine learning, all these things that we've been talking about,
people were saying big, but it wasn't really so big. It was maybe a couple hundred terabytes.
And now we're really starting to see customers hitting the petabyte range for their storage
system. So I think the reality of
managing more storage than we've ever managed before is really coming to bear here. And I think
we were hopeful that big data would land as an interesting use case for us, but we were a little bit disappointed in the early days because we found the volumes of storage were really not that significant.
What we've seen is as the demand has grown, there's been a transition as well that the providers of solutions have made those solutions much more well adapted to use object-based storage.
And that really plays into our space. So I wouldn't say we were locked and loaded for this situation, but we really do
have the tools that meet a lot of the needs. That's interesting, Brad. So one of the things
that really intrigues me about that is you obviously thought of AI, machine learning,
big data to feed that as an initial use case, but you said you were disappointed in early days,
and that's changed. Are there specific use cases? Was it a different way of doing machine learning in the past versus
now or is it just more adoption or what's changed? I think several things have changed.
There are use cases clearly that are changing dramatically. I would say the medical industry,
genomics, digital pathology are a
couple of things that are really driving massive data usage. Another one is obviously self-driving
or semi-autonomous automobiles and all of the data that's being stored so you can test algorithms.
We're really seeing growth in that space. So I think new uses have driven
the situation for sure. And I think one of the things is that some of these applications really wanted to provide storage at the same time. And I think most of the applications we're seeing today are opening up to the idea that maybe providing storage is not the best business for them, and that providing fast applications that can use a variety of storage
is an interesting model to pursue.
And I could talk more about that.
So what are those applications?
I mean, you mentioned some of the classic poster children,
autonomous driving and so on.
But truly, what are the specific applications that are demanding
massively scalable storage? So I think we do have a customer doing massively scalable storage for
collision avoidance algorithms, and they've been doing that for about five or six years now, and have gotten up to about 30 petabytes of storage. We're seeing a lot of usage of logging
and notably applications like Splunk that work with logs. They're one of the companies that's
really moved to a new model that allows you to use S3-type storage. One of the applications there is in the travel industry, where you get massive amounts of user request logs that can be transformed and monetized if you want to
communicate back to the airlines what is being asked for by end customers. But the kinds of
volumes there are just terrifying for traditional storage systems. It's very interesting. And I
think one of the questions in my mind is where this processing needs to happen. And I know
for some deep learning applications, there's definitely the modeling itself versus the inference engine.
Some of this can happen at the edge versus back in your private cloud or public cloud.
Is there a difference in use cases?
Do most use cases span both?
Or is there a difference between what kind of applications need storage in a public cloud versus a private cloud
versus something further out towards the edge and smaller and broken up? Yeah, I think one of the things we noticed is that, I would say, there are two categories. There are applications that use very standard tools, like, I don't know, maybe Spark is not so standard, but a tool like Spark or Elasticsearch or Splunk. And then you have applications that are very specific, industry-specific, for instance in the pharmaceutical industry, that are working with data sets. The usages are a little bit different there. Using GPUs for inferencing and things like that typically calls for a really fast file system, like a Weka.io, that can then be tiered off to a Scality platform.
But then the other use that we see a lot of is indexed logs that allow you to do queries
on pools of logs.
And I think that's clearly a huge usage for us
and growing all the time.
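[Editor's note: to make the tiering pattern Brad describes concrete, here is a minimal sketch of pushing finished data from a fast file-system tier to an S3-compatible object store. The endpoint, bucket, and paths are hypothetical; this illustrates the general pattern, not Scality's or Weka.io's actual tiering mechanism.]

```python
import os
import boto3

# S3-compatible client; the endpoint stands in for an on-prem object store.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.ring.example.com",  # hypothetical endpoint
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)

def tier_off(local_dir: str, bucket: str, prefix: str) -> None:
    """Upload cold data from the fast file-system tier to the capacity tier."""
    for name in os.listdir(local_dir):
        path = os.path.join(local_dir, name)
        if os.path.isfile(path):
            s3.upload_file(path, bucket, f"{prefix}/{name}")

tier_off("/mnt/fast-tier/run-0042", "training-data", "runs/run-0042")
```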
Yeah, we recently spoke with folks from Weka.io
and Splunk actually on this same podcast
and listeners can find those in the archives. And
indeed, it seems like they are absolutely seeing the kind of patterns that you're seeing in terms
of needing, you know, some applications need ultra, ultra high performance access. Some
applications just need massive, you know, programmatic access to storage. Did it require re-engineering the storage solution in order to support these applications? Because, you know, you mentioned with Weka, maybe you're doing a tiered solution, but I think that you're probably, you know, the shaft of the arrow, as it were, for some of these applications. And did that require a re-engineering of your solution? Re-engineering, maybe. It required a certain amount of tuning,
some efforts on our part. I know one of the Splunk-based applications, when the customer
first reached out to us, they said, we're ingesting about a petabyte of logs a day.
Will that work for you guys? And that's pretty breathtaking. We hit somewhere between 15 to 20 gigabytes a second
of ingest of logs during peak access time.
So that really changes things.
They were moving from managing two or three days of logs
to wanting to manage about 20 days of logs.
But then that means your file system
is ingesting a petabyte a day
and deleting a petabyte a day.
And so we had to do quite a bit of fine tuning.
The distributed nature of the storage became very important for applications like that.
Sometimes 200 servers are generating logs or pushing data into the platform. So it tests the promises that we make about scalability for sure.
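[Editor's note: a quick back-of-the-envelope check of the numbers Brad quotes, assuming a decimal petabyte spread evenly over 24 hours:]

```python
# What ingesting (and deleting) a petabyte of logs a day implies for
# sustained throughput.
PETABYTE = 10**15  # bytes (decimal)

bytes_per_second = PETABYTE / (24 * 3600)
print(f"{bytes_per_second / 10**9:.1f} GB/s sustained")  # ~11.6 GB/s
print(f"{bytes_per_second * 8 / 10**9:.0f} Gbit/s")      # ~93 Gbit/s
```

A steady average of about 11.6 GB/s is consistent with the 15 to 20 GB/s peaks mentioned above.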
Interesting.
And that kind of data coming in for sure, and probably data going out in some cases to analyze, it really makes me think about, and again, going back to that kind of use
case of, do I build a private cloud for this data or do I take advantage of public clouds?
And if so, utilize a multi-cloud strategy or not.
I wonder how much that plays into that
or how much of that do you advise on with customers
and kind of help them find the right path?
Lots of discussions are ongoing there.
I think some of the public cloud tools
are really interesting.
And so I think we're going to be seeing more and more of that. One of the more interesting things, I think, was something we had a customer do. This was together with Weka. They had machine learning algorithms running on-premises, but they're in this medical space, and time is of the essence.
And their concern was: what happens if we have a power outage, which happens for various reasons? What happens if we're offline? Do we just stop? Do our researchers stop working and twiddle their thumbs for days? And what we proposed to them is replicating the same data set into a public cloud, and they've set up everything to be able to launch their tools in a public cloud, so that within less than 24 hours they're up and running in the public cloud, and can then replicate the data back from their learnings and carry on.
So instead of deploying a whole new data center,
they're doing the same tools in the public cloud.
And I think we're seeing more and more of that with a lot of these tools being
built around Kubernetes and different things,
allowing you to choose where you do your inferencing or where you do your work and for various reasons.
In certain industries like banking, obviously, public clouds still remain kind of an outlier.
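[Editor's note: a sketch of the replicate-out, fail-over-to-cloud idea described above, using standard S3 bucket replication. The bucket names and role ARN are hypothetical, and the customer's actual setup, built with Weka and Scality tooling, may well differ.]

```python
import boto3

s3 = boto3.client("s3")  # endpoint and credentials taken from the environment

# Versioning is a prerequisite for S3 bucket replication.
s3.put_bucket_versioning(
    Bucket="onprem-research-data",
    VersioningConfiguration={"Status": "Enabled"},
)

# Continuously copy new objects to a public cloud bucket, so compute can be
# launched there against a current data set within hours of an outage.
s3.put_bucket_replication(
    Bucket="onprem-research-data",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/replication-role",  # hypothetical
        "Rules": [{
            "ID": "dr-copy",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::cloud-dr-copy"},
        }],
    },
)
```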
Given that a lot of public clouds charge ingress and egress fees, and you're talking about a petabyte a day coming in and out from this one application, I imagine they're probably not doing that to the public cloud because that would break the bank. Is there a need for sort of a hybrid
solution where you have basically high volume data on your own servers and then maybe long-term
data or something else in the cloud?
Yeah, I think there's a few tricks you can play there. You can obviously push large volumes of data to the cloud, do analysis of the data, and then just delete it. Then you don't pay the egress fees, and public clouds tend to be very, very gracious about ingress fees. So I think that's
one of the interesting models. I suspect at some point in time, some of these public cloud
companies may find that that model doesn't work for them. But I think that is one of the
situations. But clearly, that is a major preoccupation. I mean, even if the cost isn't an issue, having enough bandwidth to push a couple
hundred terabytes in or out over a 24-hour period, that's big pipes. Yeah. So, I mean, is there,
I'm curious, are there folks who are building kind of storage facilities closer to where they're
collecting data intentionally to get around those bandwidth constraints? So we're definitely seeing, I would say, the systems are being deployed as close to the
generation of the data as possible, because obviously when you do the analytics indexes
and things like that, they tend to be a little bit smaller in size. But the trend that there's a lot of talk about is machine learning on the edge, I think,
is still pretty immature.
And I think we're going to see more of that in the future.
But I think right now, really, the data is generated in big data centers.
This is very often kind of a big iron thing right now.
So I know that Scality has the Zenko product or open source
project as well, which gives customers
some transparency between various object stores.
Do you see that playing a part in the future
of infrastructure supporting AI applications? We certainly believe that it provides very
interesting opportunities. We haven't seen a massive amount of business from that. We have
some small companies that are doing AI tools that have actually used the open source version of Zenko
to make sure that they can work with all the public clouds without having to do all the
development work. They can do an AWS interface and then they can work on all the other public
clouds. We've had some customers already do that. And I think that's a very interesting application.
Otherwise, pushing data into a public cloud temporarily, I think we have some customers doing that today with Azure, pushing data into an Azure cloud and then using it, for instance, for speech-to-text and translation services that may be especially well adapted to an Azure cloud.
And we've seen some of our customers comparing results, for instance, of speech-to-text and translation between a couple of public clouds
and getting an excellent result by comparing those two.
And that Zenko technology has certainly allowed a couple of customers to do that kind of thing. Yeah, for those of you who are listening who are not familiar with that, it's basically kind of an S3 virtualization approach to objects that can move data around and provide the same kind of access. And, you know, it's great to see open source tools like that be added, not just commercial products. Right, and we end up being able to push data to multiple clouds simultaneously, one-to-many kinds of replication, so you can push data to several places and test different outcomes with it. So it's potentially very interesting in this space. And we do have some usage already. Yeah, that's very interesting. So, I mean, because that's one of
the pieces of this that was very interesting to me was, you know, the real life applications of
multi-cloud. It sounds like there is some, but it's still not quite, you know, overwhelming
demand at this point. Yeah, I think we're going to see some of these edge, more edge-like things progressing
in that space. For our key customers that are really using petabytes of data, I would say the primary concern is just that pushing those kinds of data volumes around is pretty prohibitive. We've got a customer ingesting 400 or 500 terabytes a day of data and keeping that data for a year. And they're replicating between two sites to do that. So those are big data volumes.
And you start pushing that to a public cloud, all questions about pricing aside, you have to pay a
lot of infrastructure, network infrastructure, if nothing else, to make that work.
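[Editor's note: the one-to-many, multi-cloud idea discussed above boils down to everything speaking the same S3 API, so retargeting an application is a configuration change rather than a rewrite. A minimal sketch, with hypothetical endpoints standing in for AWS, an on-premises store, and a Zenko-style gateway:]

```python
import boto3

def make_client(endpoint_url=None):
    # Same application code; only the endpoint (and credentials) change.
    return boto3.client("s3", endpoint_url=endpoint_url)

aws = make_client()                                    # native AWS S3
onprem = make_client("https://s3.dc.example.com")      # on-prem object store
multi = make_client("https://zenko.example.com:8000")  # multi-cloud gateway

# Push the same results to several places to compare outcomes across clouds.
for client in (aws, onprem, multi):
    client.put_object(Bucket="results", Key="run-7/metrics.json",
                      Body=b'{"loss": 0.031}')
```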
You know, in a way, it seems to me that it's just like any kind of enterprise application that we're used to,
in that, you know, you have to basically build an infrastructure that supports all of the various demands in sort of a balanced way. I think that the challenge really is just that AI is demanding
maybe a new mix of the traditional capabilities that we've always had, you know, in terms of
performance and scalability and storage capacity and, you know, I/Os and so on. And I think that, from a network architect's perspective, that is really the story here. The story is that we need to figure out, you know, what are the metrics that let us balance
an infrastructure to support AI applications that are different from the way that we would
have supported other applications, even a big data analytics application without AI.
Do you have any ideas about that?
I mean, what are the kinds of things that you're seeing? Well, I think one of the other pieces of that, beyond the network and these questions of public and private clouds, is the fundamental difference in this kind of work compared to traditional high-performance computing. High-performance computing used massive numbers of CPUs to do simulations, and you stored the results of that data.
Those outcomes produced petabytes of data that was later analyzed.
But if you lost a couple of petabytes of data, you could reproduce it maybe with a month of CPU.
It's not free, but it's possible. We're seeing data sets today, things like
bank logs, human usage logs, sensors from automobiles. The data is irreplaceable.
If you have something like solar events that are being captured, anything that's a true real-world application, you can't make that stuff up.
And so there's a need to have not only high performance
access, but you need to protect your data.
And typically these data volumes are way beyond
what people are willing to back up.
So having a system,
and I think that's one of the big
changes. Precious data, well stored, becomes really a priority, where it used to be that, oh, well, we lost a month of simulation, well, we redo it. I mean, the "oh, well" was probably said with a
lot more groaning than that. That's really interesting and enlightening. I know Stephen
has a much bigger storage background than I do, but for me, that's kind of eye-opening,
this idea that we have this irreplaceable data and there's so much of it that you can't replicate it.
So you've just got to have it in a mission-critical environment where you're not going to lose
anything. I think I'm answering my own question in my head here, but does that lead to any
interesting security implications? I mean, if I've got this, you know, if it's irreplaceable data, is it also invaluable?
So, potentially. I think the thing everyone is talking about right now is obviously ransomware. And I think, you know, if you're talking about bank records and what people have done on a transaction basis, wow, there's a lot of value in that data.
But on another side, in some ways, these data sets are only valuable to a company that knows what to do with them.
You know, you
think about sensor data from automobiles. If you have no idea how to turn that into an effective collision avoidance algorithm, that data, you know, a million miles from a sort of LIDAR-like sensor on the front of a car looking for people, is not resellable, except in the context of making self-driving cars better. So some of the data is only of great value to the people that are exploiting it. Other data, obviously, if it's bank records or that kind of information, yeah, super valuable.
But I think the fear,
and that's one of the big things we're seeing today,
everyone is scared of ransomware.
That's where we feel that object storage
is potentially helping out a little bit.
It doesn't mean you can't encrypt object storage, but you don't have a traditional Windows desktop hooked to a big object store, wandering through the data and encrypting it all.
So that's the thing we hear the most today is those concerns.
And obviously, it's a double whammy kind of a thing.
They encrypt your data.
They charge you to decrypt it.
And if you don't pay them, they expose everything that you stored on the internet. And whether that's a big deal or not really depends on the nature of the data.
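[Editor's note: one reason object storage can blunt ransomware is that stored objects can be made immutable, so even a compromised client cannot wander through and re-encrypt them. A minimal sketch using S3 Object Lock, with a hypothetical bucket; Scality's own protection features may work differently.]

```python
import boto3

s3 = boto3.client("s3")

# Object Lock must be enabled at bucket creation time.
s3.create_bucket(Bucket="log-archive", ObjectLockEnabledForBucket=True)

# Default retention: objects cannot be overwritten or deleted for 30 days,
# even by a client holding valid credentials.
s3.put_object_lock_configuration(
    Bucket="log-archive",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)
```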
Yeah, that makes a lot of sense. And it's, yeah, it seems right that the ransomware or locking you
out of that data is probably the largest attack vector. What about injecting bad data? Is that something that folks are worried
about? Like if I want to influence your results or have your cars crash, for instance, right? And I,
can I inject data into that data set that actually causes something to happen that shouldn't have
happened? I suppose there's always a possibility for that kind of malicious thing. I think the data ingestion is typically being done
by people that are very close to the problem at hand,
experts on the problem,
and they're doing their very best to get rid of bad data.
And that's part of what AI can do for you
is help you sort out data that's really outliers
and inappropriate.
They're already doing that in autonomous vehicles.
Boeing probably should have done a little bit more of that with their sensors: being able to use AI to understand when you're getting bad data.
And I think that's part of the process of learning to do better AI is to root out bad
data, whether it be malicious or not.
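[Editor's note: a minimal sketch of the kind of outlier screening Brad describes, here with scikit-learn's IsolationForest on synthetic data standing in for sensor features. Real pipelines would use domain-specific features and validation.]

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
readings = rng.normal(0.0, 1.0, size=(10_000, 8))  # stand-in sensor features
readings[:25] += 12.0                               # injected bad records

# fit_predict returns -1 for outliers and 1 for inliers.
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(readings)

clean = readings[labels == 1]
print(f"kept {len(clean)} of {len(readings)} records")
```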
We've talked a little bit about how AI is driving bigger and bigger datasets and analytics and logging and so on.
I wonder if some of this might just be because we can. In other words, you know, as systems have grown more capable, and as AI has allowed us to search
better through haystacks, we're growing ever bigger haystacks. You know, I mean, if we didn't
have, to stretch my metaphor, if we didn't have a barn big enough to hold the hay, and if we didn't
have, you know, an AI that could search through that hay, it wouldn't have been valuable to
collect that hay. But now
that we do, now that we've got, you know, machine learning algorithms, as you say, that are
incredibly good at finding outliers, now that we, you know, we can kind of turn up the volume on our
logging and analytics assessment simply because it exists. You know, many of us maybe would have
thrown out some of this log data, but now we can keep it. Now we can keep even more data. We can ingest even more data because we can do more with that data. This
contrasts with something like autonomous driving, where essentially until you have this, you know,
unless you're this tall, you cannot make your system, you know, autonomously drive. But in
logging and analytics and things like that, is the capability driving the data collection or is the data collection truly driving the capability?
That's a pretty chicken and eggish thing.
I mean, the reality is the cost of storage today allows you to store more data.
There's no doubt about it. I think one of the key challenges that that brings is indeed the needle in the haystack
problem.
The more data I get, the more there's a risk of having a data swamp and not a data lake
where I simply can't get intelligent things out of my data because I have too much of
it.
We've done a lot of work with indexing of data, and I think that's going to become a growing concern,
intelligently indexing your data so that you can get access to what you need.
But I think there's also a little bit of a scary component here
where senior management, CXO kind of folks,
are hearing that big data analytics are very important.
We need to store our data. We need to do wise things with it. If you're not careful, you get
so much data that you kind of drown in it before you figure out what to do. And we do see a little
bit of that. So I think you have to be very careful about those things because what we've seen in general is people start with a relatively small project.
If they're getting good results, then they really turn the dial up on the data storage.
But I do think there are some people that are out there just saving everything, hoping they'll figure out what to do with it.
And I think the experience has shown that's probably not the best approach.
That makes sense. You know, and this is where a little bit of my ignorance of storage is going to show
possibly, but is this where I've heard, obviously folks have talked about the move to data lakes
over the last years. And now I've heard of a trend of moving to what they're calling a lake house,
right? And kind of combining the best of a data lake and a data warehouse. Is that kind
of related to the indexing you're talking about and how this all works together or am I way off
base there? I wasn't familiar with that term. I like it. I would say, affirmatively, that I believe what we're seeing is people starting to do AI on the data going in, and I think that's where edge is going to be helpful.
If you can tag and provide useful information about what you're getting on your data coming in,
you can then store the data in a way that you can search it more quickly and find the kinds of things you're wanting to do data analytics and inferencing on. You can think about,
I'm not directly involved with any of these projects, but you can think about speech analysis algorithms. If you're looking for certain patterns that you want to improve on, and you can tag your data on the mobile device as it's coming in with, hmm, interesting speech pattern, have a look at this, then when the data is stored, you get that combination of a data lake, but with the warehouse component where there's indexing on key things. And that's notably what people like
Splunk are doing with their methodology is we keep fast indexes of massive volumes of data for key parameters that we're
looking for. So I think you're going to have to do a mix, that mixed world of some structured data
on top of a lot of unstructured data. And then the terminology is, I think, appropriate.
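[Editor's note: a sketch of the tag-on-ingest idea, attaching searchable metadata and tags to an object at write time so an index layer can find interesting records without scanning the whole lake. The bucket, key, and tag names are hypothetical.]

```python
import boto3

s3 = boto3.client("s3")

# Store the raw record with user metadata attached at write time.
with open("session-1234.wav", "rb") as body:
    s3.put_object(
        Bucket="speech-lake",
        Key="2021/03/09/session-1234.wav",
        Body=body,
        Metadata={"pattern": "interesting", "device": "mobile", "lang": "en"},
    )

# Tags can be queried and updated independently of the object body,
# forming the "warehouse" layer over the lake.
s3.put_object_tagging(
    Bucket="speech-lake",
    Key="2021/03/09/session-1234.wav",
    Tagging={"TagSet": [{"Key": "review", "Value": "yes"}]},
)
```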
So, Brad, now that we've talked about this a little bit, how would you sum up, from your perspective as someone who's been deeply involved in the creation of massive scalable storage systems, what message would you send to enterprises that are looking to add AI-driven applications?
I mean, what do they need to think about when it comes to storing all that data? I think a couple of things
come to top of mind. A system that can grow with your needs, that the performance grows,
that the capacity grows, that you don't fear for the loss of your data on a daily basis, I think is clearly very important.
And I think API-based, and S3 is the most obvious choice today, but a REST API-based storage system becomes really, really important, because a file system-based approach,
the historical approach, doesn't lend itself well to a variety of tools
being used to do data analytics. Having a single file system that people have to share is a very
difficult challenge. And I would say moving toward an object-based model where multiple AI applications can share the same data set and reap benefits from it, I think, is a really key component of making all this work.
And as a benefit of using that kind of a model as well, you're not committing to an on-premises or a public cloud.
You can hybridize your data storage. You can hybridize your CPU
usage, your GPU usage, all those things much more effectively when you're not tied to some sort of an SMB protocol or something like that.
Great. Well, thanks a lot for that. And thanks for providing your perspective here on utilizing AI.
Now, we have a tradition here on season two of Utilizing AI where we surprise our guests
with some questions that they aren't aware of until now and see how they react.
So I'm going to throw a couple of these your way, Brad.
And I hope that you enjoy the challenge here.
Remember, there's no wrong answers.
So first question, as a company that has indeed worked a bit around the autonomous driving industry, obviously without revealing any secret information, when will we see a full self-driving car that can truly drive anywhere at any time, if ever?
January 2026.
Ah, excellent. I don't know. I think the electrification of automobiles is one of the key elements there. But I think there's so many obstacles to overcome in the meantime that it's going to be a limited deployment.
The truly, truly autonomous car, I fear that it's farther out than we think.
How about the next question?
Just in your opinion, do you think that machine learning is a product or is it just a feature
of a product?
My perspective on that particular topic is that it isn't a product today.
Will it be a product someday?
Maybe.
My experience with our customer base is that it takes a lot of science to do machine learning.
And I think data scientists have great career perspectives ahead of them. And I know from
experience, I've talked to a lot of customers, deploying systems that quote unquote do machine
learning is far easier than learning from your data. And so I think we're going to see products that are more friendly to the common man, but being able to plug in a machine learner and have it come back and tell you everything you needed to know about your data and didn't know,
that's a faraway bridge. And one more future prediction question for you. We know that machine learning and AI are creating new jobs, as you just mentioned, with data scientists. Are there any jobs that have already been eliminated by relatively simple AI? You know, who buys insurance from an insurance salesman anymore? I think the people that have the most to fear are kind of the white-collar people that do things like selling insurance.
Because I think people use algorithms today to determine what's the best approach.
And those kinds of things have gotten so much easier. I think there's all kinds of kind of
mid-range white collar jobs that are going to be pretty wiped out by AI in the relatively near
future. Very interesting. Well, thank you so much, Brad, for joining us today and providing your
thoughts on infrastructure to support AI applications. Where can people connect with
you and follow your thoughts on enterprise AI and other topics? So certainly via LinkedIn, Twitter, and my personal email if you have a Scality question. I'd certainly be glad to reach out and discuss with folks. How about you, Chris?
Yeah, you can find me on Twitter at @ChrisGrundemann or online at ChrisGrundemann.com.
And you can find me on most social media sites at S Foskett. You can find my thoughts on enterprise tech every week on the Gestalt IT rundown at gestaltit.com. And you can find more Utilizing AI by going to utilizing-ai.com or find this podcast on Twitter at utilizing underscore AI. Thank you very much for listening to this episode of Utilizing AI.
If you enjoyed this discussion,
please remember to subscribe, rate,
and review the show on iTunes
since that really helps our visibility.
And please do share this show with your friends
who you think might be interested in these topics.
This podcast is brought to you by gestaltit.com,
your home for IT coverage from across the enterprise.
Thanks for listening and we'll see you next time.