Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 3x25: The Unique Challenges of ML Training Data with Bin Fan
Episode Date: March 15, 2022

Machine learning is unlike any other enterprise application, demanding massive datasets from distributed sources. In this episode, Bin Fan of Alluxio discusses the unique challenges of distributed, heterogeneous data to support ML workloads with Frederic Van Haren and Stephen Foskett. The systems supporting AI training are unique, with GPUs and other AI accelerators distributed across multiple machines, each accessing the same massive set of small files. Conventional storage solutions are not equipped to serve parallel access to such a large number of small files, and they often become a bottleneck to performance in machine learning training. Another issue is moving data across silos, storage systems, and protocols, which is impossible with most solutions.

Three Questions:
Frederic: What areas are blocking us today to further improve and accelerate AI?
Stephen: How big can ML models get? Will today's hundred-billion-parameter model look small tomorrow, or have we reached the limit?
Sara E. Berger: With all of the AI that we have in our day-to-day, where should be the limitations? Where should we have it, where shouldn't we have it, where should be the boundaries?

Guests and Hosts:
Bin Fan, Founding Member of Alluxio Inc. Connect with Bin on LinkedIn and on Twitter @BinFan.
Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren.
Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.

Tags: @SFoskett, @FredericVHaren, @BinFan, @Alluxio
Transcript
I'm Stephen Foskett.
I'm Frederic Van Haren.
And this is the Utilizing AI podcast.
Welcome to another episode of Utilizing AI,
the podcast about enterprise applications for machine learning,
deep learning, data science, and other enterprise AI topics.
Every time we've spoken, we've looked at various portions of the AI stack from
applications to the underlying storage devices and GPUs. We've even looked beyond at the
implications of AI for society. But today we're going back into the stack and we're going to think
about data sources. And this is one of those things as a storage nerd,
it kind of makes me interested because quite frankly, ML training is very, very different,
workload-wise, than almost any other application for enterprise storage.
Right. I totally agree. I think, I mean, we all know you need data in order to process and create AI models.
One of the challenges today is if you don't know what data you have, it's as good as not
having the data at all.
And I think the challenge that enterprises are seeing today is the ever-growing number
of data silos, as well as the number of data sets. And to your point, the architecture to process data with CPUs,
for traditional analytics, compared to GPUs, is significantly different.
Yeah, and increasingly it's a distributed access
where you've got multiple devices accessing the same data.
You've got a tremendous amount of data.
It's a very different kind of data as well.
And so today, I'm really glad to have Bin Fan from Alluxio joining us
to talk about the massive expansion of data
that is required by machine learning training.
Bin, welcome to the show.
Thank you, Stephen.
Hi, Frederic.
Thanks for inviting me to talk about what
we are observing in the field, and about how Alluxio is
engaging with machine learning and AI workloads and the data
stack.
So I'm Bin.
I'm one of the founding members of Alluxio.
I joined the company in 2015.
Before that, I was working in Google
building the next generation of a large-scale storage system.
And prior to that, I was doing my PhD at Carnegie Mellon, also working on distributed systems
and distributed networking systems.
So for the past, I would say, seven years, I have been implementing, designing, and architecting
the open-source Alluxio project.
And in the last three years, I'm focusing more on the community,
building the community, building the open source,
and engaging with the users.
And also, I'm working with a lot of ecosystem partners
and a lot of different other open source projects
just to see if there is a synergy to put Alluxio into the picture
and to help different users solve their problems.
In the past two years,
there has actually been a dramatic change in the community,
in the trend of community interest. I started to see that
a lot of users, in addition to using Alluxio for big data
and analytics workloads,
are starting to use it to accelerate
and to ease their lives managing data
for machine learning workloads.
So I guess that's why I'm here,
just to share my knowledge and share my observations.
Yeah, so you clearly have been around the block
and you probably have seen what works and what doesn't work.
I'm curious to know: what do you feel are the data challenges that customers and enterprises have today?
That's a good question.
I actually see challenges in multiple different dimensions
in serving machine learning and AI workloads.
One of the challenges is really that, 10 or 15 years ago,
a lot of data platforms, data processing platforms,
were designed with the principle of putting compute
and data storage together, really emphasizing data locality.
And this is really just following the fantastic ideas
proposed by the early Google papers,
like Google File System, MapReduce, and BigTable.
And Hadoop is modeled after these different,
very successful systems in Google.
And to that end, you really just optimize;
you really try to reduce the chance you need to read the data
from a different device
or over the network, basically.
But now in AI, today, we see fewer and fewer scenarios
where people are deploying their data platform in that way.
And actually, it's also a fact that people are moving to the cloud.
And in the cloud, natively, they have this natural, I would say,
isolation or division between the compute component and the storage, because they
are just different services.
And also, the computation today is really heavy for
machine learning and AI workloads.
People are moving from CPUs to GPUs,
so they are heavily equipped with this GPU hardware,
and naturally that no longer belongs to the storage device
or storage system.
So the first challenge I'm mentioning here is really that, given machine
learning training or serving is very data intensive, you have to move data around.
The previous approach of having co-located storage and compute
is no longer an option, so you have to do this work. Moving data and preparing data
becomes a really heavy burden for a lot of machine learning and AI workloads. So that's the
first dimension. And there are also other dimensions, like the data access pattern,
which becomes very different. If people recall what's mentioned in the very first
papers talking about big data, like the GFS, Google File System
paper: they're talking about, oh, we're only
dealing with big files, files
of at least several gigabytes, up to terabytes.
So we don't worry
about small files.
And that's perhaps true
for a lot of ETL
workloads, for batch processing.
You have the choice
just to compact a lot of this data together into huge files, which are much easier to manage.
In the AI world, we see a lot of different data types: a lot of times
audio, and also video and pictures.
So a data set is basically a collection of a massive number of relatively
small files: pictures, video clips, audio clips, things like that.
So that means you have to be prepared for a lot of reading, writing,
or checking of a lot of different small files,
and this creates a lot of pressure, especially on the metadata side.
It's basically the worst workload, a nightmare
for a traditional storage system. And we start to see a lot of people, a lot of companies,
a lot of projects thinking in that direction: oh, how do I handle the case where I have
a massive number of small files? Yeah, so that's for datasets, for the input. That's another,
different challenge.
And the third challenge I'm mentioning here is really that I see
the increasing parallelism in accessing data.
Like traditional data processing is mostly utilizing CPUs.
You have maybe 32 cores, you have 64 cores.
That's the level of parallelism you're talking about,
how each core is processing something.
But with GPUs equipped as the new weapon
to beef up machine learning workloads,
it's easy to see hundreds of different threads
doing the data processing and data reading.
And this is on a single machine; each single machine can have four
GPU cards, or sometimes 16 GPU cards, and you can have a massive GPU farm, right?
So you definitely see the parallelism is of a different order compared to traditional big data workloads.
So your system serving the data needs to be prepared for this type of workload, accessing data
with a massive amount of parallelism. So those are basically the three dimensions we're seeing.
And there is also a general trend that people are just using more and more data. That's generally true across different workloads.
For AI workloads, we definitely see people talking about huge data sets, not only in the number of files, but also in the total size of the data set.
Yeah, you have to be prepared for a much larger data set too.
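To make the small-file and parallelism challenges concrete, here is a minimal PyTorch-style sketch of the access pattern described above. It is an illustration only: the directory path, file type, batch size, and worker count are hypothetical placeholders, not details from the episode. Each DataLoader worker opens, reads, and closes individual files, so a dataset of millions of small images becomes millions of parallel metadata and read operations against the backing store.

```python
# Minimal sketch of the access pattern Bin describes: a dataset of
# millions of small files, read in parallel by many loader workers.
# Paths and parameters are hypothetical placeholders.
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset, DataLoader

class SmallFileImageDataset(Dataset):
    def __init__(self, root: str):
        # One entry per file: even listing the directory can stress the
        # metadata service when it holds millions of small objects.
        self.paths = sorted(Path(root).rglob("*.jpg"))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int):
        # Each sample is a separate open/read/close -- per-file overhead
        # that large sequential ETL reads never pay.
        with Image.open(self.paths[idx]) as img:
            return img.convert("RGB").tobytes()

loader = DataLoader(
    SmallFileImageDataset("/data/imagenet-style"),  # hypothetical path
    batch_size=256,
    num_workers=16,   # many parallel readers per machine, one machine of many
    shuffle=True,     # random order defeats readahead on the backing store
)

for batch in loader:
    pass  # the GPU side would consume batches here
```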
So with my background in storage, I can say that, yeah, the environment you're describing is very unusual.
Typically, storage solutions aren't designed, as you said,
to handle massive datasets,
especially massive numbers of small files.
Some solutions have been designed
to handle large numbers of small files,
but most of those aren't designed to handle parallel access
to all of these small files at high performance. And as you said as well, and I think this is key,
sometimes those small files aren't very small. So if you think about some of the signature
applications that machine learning is being used for, it's things like video and audio and image
processing. And sometimes those files are surprisingly big.
Especially when it comes to video files,
you could be talking about millions of video clips,
each multiple megabytes or even gigabytes in size.
That's a lot of storage right there.
And then the other challenge, I think,
is that in many cases,
companies want to use data that may be distributed across various different storage
technologies, storage devices, even different protocols for access. And that also poses a
challenge to bring those into the machine learning workflow as well. Yeah, I think small files and large files are a really common problem.
What I do see in the market a lot more is that there are multiple types of workloads on
the same hardware or same infrastructure, which basically means that traditionally you
could tune a system or an environment for small files or large files.
Today, the challenge is that some workloads have both large files and small files.
I think one of the reasons data and data movement are such a challenge for organizations
is that they severely underestimate the effort
it takes to move data around. Is that something you agree with? I mean, a couple
of years ago, people would say there are multiple silos, and we have to get rid of silos and
consolidate to a few. Meanwhile, there's a whole new data architecture coming out called data mesh, which pretty much says: keep your data silos and compute where the silos are.
Yeah, so I actually see both happening. Based on my observation, I do see
people in the field taking either approach.
They're consolidating their siloed data into what sometimes they call
a single data lake, or I have a master region on AWS.
I want to move all my data to this master region.
I definitely see that.
However, you will also see situations pushing in the other direction, which is like,
oh, through an acquisition, my company got another
studio from Europe, right?
They might be using a totally different technical stack,
and that's going to take a long while to merge.
People tell me stories like this
very often. So basically,
silos will be created,
and there will also be effort
to clean up silos. But still,
I think both
will continue to be true.
Also, another fact is that
people are moving to the data world anyway.
More and more things will be processed in a digital way.
Traditional companies are moving onto data platforms
more and more.
So the cake is just growing.
It's getting bigger and bigger,
and we will see each part get bigger and bigger too.
I would say that's my projection.
Well, so data is a piece of the puzzle.
You talked a little bit earlier about an ecosystem.
I'm curious to learn more about what you mean by an ecosystem
and how does that help the AI community?
Well, the AI community from its very beginning
has really been a collection of multiple different communities,
especially open source communities,
and they're boosting the development of AI.
Compared to 20 or 30 years ago,
I think in this new era,
in the very early days,
people just learned, oh, this is a new way of treating data, of training with data.
But soon you see TensorFlow, PyTorch, all this open source technology, available just for everyone to use. And so I think that contributes to the fact why
you see this trend
of this wave of modernizing,
getting the AI technology
or just using AI
to empower everything.
It's moving so fast,
it's so rapid,
because it has been quite available
from its early age.
And that is really helping a lot.
You can access documentation, code samples, and even training
data.
And a lot of things are just free and available on the internet.
Everyone can learn from that.
I think that demonstrates a huge power of having this,
like people sharing knowledge, sharing their source code,
sharing their ways of doing things.
Yeah, it seems like data and data movement are the Achilles' heel
of producing faster, lower-latency workloads.
It's kind of an interesting move.
Now, going back to the ecosystem,
so does that mean you're also talking to partners
at the compute and the network level,
or do you really consider the ecosystem
more of a community effort?
Both, both.
Actually, because I'm driving the ecosystem
or open source initiative in my project, in my company.
So I do talk to a lot of open source projects.
I myself am a committer for Presto.
Presto is an open source big data SQL processing engine.
So as a data layer, if you want to do good work, I mean, if you want to do reasonable work,
yes, you can just stay in your layer
and make everything well-defined
and performing as expected,
getting good throughput and latency, optimized for everything, right?
But you won't have the next level of performance or usability.
You have to go to different users and different computations.
You have to.
And that's the case.
It's good that you have these layers.
You have this processing layer,
you have the training or big data analytics.
You have this data layer corresponding to data access
or data federation.
However, from the user's perspective, it's ideal if you can just present this stack
pre-configured, with a lot of the bells and whistles settled,
so they can just use it.
That would be ideal for the user experience.
So yeah, you have to talk to the different communities,
different compute engines, including open source communities
and also partners.
That's definitely what you should do as a data solution provider.
On top of that, I just want to share some technical perspective.
It's very interesting to see.
We're providing a unified system:
the same code base,
the same project, for both computation in analytics and also training in AI.
But you can definitely tell,
for serving different workloads,
it requires different configuration
or different ways to set your data service.
And that actually is also very important.
And on top of that, even to serve different data sets,
a different type of data, ImageNet or some other,
you can tune further on top of that to achieve better performance
and to just make it more stable. Yeah, it's very fun.
The deeper you go, the better performance and more friendly user experience you will provide.
Yeah, it really is distributed and heterogeneous on both ends of the stack. So on the one hand,
you might be dealing with object storage or cloud storage or NAS or something.
On the other end, you might be dealing with TensorFlow or PyTorch, whatever, and they may expect a different type of storage.
And so that's really what you all are working on, right? Having that glue between the various different protocols, as well as enabling what we talked about at the beginning,
which is that distributed access.
Yeah, yeah, definitely.
So the vision for us is basically that we provide this data abstraction layer; in the system,
we call this the data abstraction.
So we provide this abstraction, and users facing this data abstraction
don't really need to care whether the data comes from AWS
or from HDFS in their own on-premise data warehouse.
They just say, oh, I need this ImageNet data set,
or I need these tables for TPC-DS.
And then they can run Spark,
they can run Hive and Presto,
or they can run TensorFlow,
they can run PyTorch.
And even with this same data abstraction,
you can use different ways to access data,
like the traditional POSIX way,
which is more favored in the AI community,
where a lot of people are using Python, right,
and using these types of tools to access data.
But it's the same data.
In contrast, in the more traditional analytics world,
people are using the HDFS interface.
It's more or less the industry de facto standard.
But it doesn't matter.
It's the same data abstraction,
and you can use different ways to access it.
That's what we're providing in Alluxio.
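As an illustration of this one-namespace idea, here is a small Python sketch. It assumes a hypothetical POSIX (FUSE) mount of the Alluxio namespace at /mnt/alluxio; the paths, host, and port shown are placeholders, not configuration from the episode.

```python
# Sketch: the same logical dataset seen through one namespace.
# "/mnt/alluxio" is a hypothetical FUSE mount of the Alluxio namespace;
# under it, /datasets/imagenet might be backed by S3 and /warehouse/tpcds
# by HDFS, but a Python reader never needs to know which.
from pathlib import Path

MOUNT = Path("/mnt/alluxio")  # hypothetical POSIX view (FUSE)

# An ML job reads training files exactly like local files:
sample = (MOUNT / "datasets/imagenet/train/n01440764/img_0001.jpg").read_bytes()

# An analytics engine would address the same namespace through its own
# interface instead, e.g. an HDFS-compatible URI such as
#   alluxio://master:19998/warehouse/tpcds/store_sales/
# (URI shown for illustration; host and port are placeholders).
print(len(sample), "bytes read through the POSIX view")
```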
And even among data scientists,
although we want to stick to machine learning,
I'd say even in the data analytics space,
I think Python is really taking off.
I'm seeing a lot more people accessing data through POSIX in Python
instead of going to HDFS.
But I don't know, Frederic, if you're seeing that too.
No, I definitely see that.
I think HDFS was the first kind of open source community way of storing data, right?
So you didn't have to buy those expensive, high-performing file systems.
You kind of used commodity hardware to achieve a similar performance,
but at a much lower cost.
But I do agree, you know, there's HDFS,
there's all kinds of different file systems.
I'm actually really interested that you brought up Presto
because, you know, databases are another silo, or data source, if you wish.
And I think the other problem customers or enterprises are dealing with is not just the amount of data, but also the access to that data, right?
In one case, it's a database.
In another case, it's a file system, a POSIX file system, or object, or HDFS, and so on.
And I think having a solution that creates that abstraction addresses one of the big
issues that need to be solved in order to take advantage of the available data.
And so maybe a question I have there is, I understand the data abstraction.
Now, when you talk about data orchestration,
in the end, you just provide a mechanism to find the data.
You're not moving data, right?
Oh, we're moving data.
Definitely, we're moving data.
Yeah. For example, say
the data is living in AWS S3,
but you're running your applications
in your on-premise data warehouse.
Just to let you know, we do have users, customers running in that mode.
So every time you go to S3, that's a huge cost for you.
Even with data abstraction, without movement,
you have to go to S3 and download the data,
and that can be a burden for you.
So what we do is, in addition to the abstraction layer, we also have this mechanism to tell,
oh, this is the working set, and this is the hot data,
so we can move the data closer to the computation,
and next time you don't need to go to S3.
That's definitely one of the key applications,
the key features, we see among our community.
And it's not just performance that you're worried about.
It's also the cost of accessing data in the public cloud,
and especially in S3.
Both, or either.
Sometimes people choose depending on their applications.
But I was talking to a user, and he mentioned to me
that by using this technology,
they can reduce their cost to access their data
cross-region in AWS by 50%.
That's interesting.
Cross-region, that's something I hadn't thought about, too.
That's even more expensive than accessing the data in the same region.
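As a rough back-of-the-envelope sketch of how a caching layer can halve a cross-region bill: every number below is an illustrative assumption, not a figure from the episode or an AWS quote.

```python
# Back-of-the-envelope sketch of the cross-region saving Bin mentions.
# All numbers are assumptions for illustration, not AWS pricing.
transfer_price_per_gb = 0.02   # assumed inter-region transfer price, USD/GB
monthly_reads_gb = 100_000     # assumed training reads per month, GB
cache_hit_rate = 0.5           # fraction of reads served from local cache

baseline = monthly_reads_gb * transfer_price_per_gb
with_cache = monthly_reads_gb * (1 - cache_hit_rate) * transfer_price_per_gb
print(f"baseline: ${baseline:,.0f}/mo, with cache: ${with_cache:,.0f}/mo")
# A 50% hit rate halves the transfer bill, consistent with the
# "reduce cost by 50%" anecdote above.
```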
So the data orchestration, is that done automatically, or is that the consumer saying, I would like to move it from on-premises POSIX to object in the public cloud?
So the way it works is that we're building this virtual layer there.
And when we're doing this, we want to pretty much model it after a current model
that people are very familiar with.
For example, on your laptop, you have this VFS layer on your Mac or Linux box.
So whenever you add hardware, for example, a hard disk you put in your laptop,
you basically mount this hard disk into your VFS layer, and then you can just access data there.
And it's exactly the same analogy in our case. You treat the S3 bucket, or you treat HDFS,
as a hard disk, exactly as a hard disk.
And you mount it; by the way, the command is really called mount.
You mount this S3 bucket, or you mount this HDFS cluster,
into this virtual namespace to provide the data abstraction.
And then you can access this data abstraction
through the Alluxio namespace. And then
everything happens automatically. You don't have to tell it anything. Or, you can still do that: you can still say,
oh, Alluxio, help me fetch data from this S3 bucket. You can just run the command load, or
distributed load, to do it in a parallel way. Or you just wait for the applications: the application is accessing a part of the data, maybe even a part of the files.
And then we can just fetch the logical part of this.
We will have logical block concepts in these files and just fetch the part you're touching.
So this can happen both ways.
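A minimal sketch of this mount-then-load workflow, driving the shell from Python: the command names mount and distributedLoad follow the Alluxio 2.x CLI as described above, but the bucket name, paths, and the assumption that the alluxio binary is on the PATH are placeholders.

```python
# Sketch of the mount-then-load workflow described above, driving the
# Alluxio CLI from Python. Bucket names and paths are placeholders.
import subprocess

def alluxio_fs(*args: str) -> None:
    # Thin wrapper around "alluxio fs <subcommand> ..."
    subprocess.run(["alluxio", "fs", *args], check=True)

# Mount an S3 bucket into the virtual namespace, like plugging in a disk:
alluxio_fs("mount", "/datasets/imagenet", "s3://example-bucket/imagenet")

# Optionally warm the cache up front, in parallel across workers,
# instead of waiting for the training job to fault the data in:
alluxio_fs("distributedLoad", "/datasets/imagenet")
```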
And to go back to the topic of machine learning:
we found a very interesting pattern, with some users telling us their story.
When they are doing the training, in the beginning, I thought what they would do is one of two things.
Either they would just wait for the machine learning training applications
to read certain data sets, say ImageNet.
In this way, because it's accessed,
the ImageNet data will be loaded into
the Alluxio space
and cached there.
And then you can just reduce the cost
and increase the performance.
Or, beforehand,
you run this distributed load command
to bring the data into the Alluxio space
so you can just access it.
But it turns out they're doing it in a smart way.
They're definitely smarter than I am.
They're doing both at the same time.
They're training,
so training will bring data in,
and they're also running this distributed load.
So it's a combination of both approaches.
And I find that users are geniuses at finding good ways
to use Alluxio and also to do training. They will do a lot of different optimizations I
would never imagine.
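A sketch of that combined pattern: start the parallel load in the background while training immediately begins reading through the cache. The mount point, dataset path, and the stand-in training loop are hypothetical.

```python
# Sketch of the "smart" combined pattern: a background preload fills the
# cache while training reads through the same namespace at the same time.
import subprocess
import threading
from pathlib import Path

DATASET = "/datasets/imagenet"            # path in the Alluxio namespace
MOUNT = Path("/mnt/alluxio" + DATASET)    # hypothetical FUSE view of it

def preload():
    # Background warm-up: parallel load of the dataset into the cache tier.
    subprocess.run(["alluxio", "fs", "distributedLoad", DATASET], check=True)

threading.Thread(target=preload, daemon=True).start()

# Training starts immediately; cold files fault in on first access and get
# cached, while warm files are already local thanks to the preloader.
for path in MOUNT.rglob("*.jpg"):
    _ = path.read_bytes()   # stand-in for the real training step
```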
So thanks so much for this tour through all the challenges that people are facing.
I guess, Bin, if you could give us sort of a takeaway message, what should people be
thinking about when they're considering the types of data sources that they're going to be using for machine learning training?
So thanks for inviting me to this session. I really enjoyed this discussion with
you. To go back to your question regarding the data sources: I think our mission is to make that not really relevant.
Basically, the system we're building provides this abstraction so that you can
choose whatever data source you like, whatever is cheaper to you or more
convenient to you. And whenever you need to use it and you think it's too slow,
just put Alluxio there. Or if you think you have too many of these types of data accesses,
put Alluxio there and mount them.
Well, thank you so much.
Now is the time in our podcast
when we move on to our fun lightning round.
We're going to ask our guests three questions.
They've not been warned about what these questions will be.
And so it's fun to get some off-the-cuff answers that are a little bit more, you know, abstract than the specific storage issues that we've been talking about.
So first off, Frederic, why don't you go ahead with yours?
So my question is, what areas are blocking us today to further improve and accelerate AI?
I will talk about my observation. Right now, AI is a huge, very deep stack.
From the very top level, you have to find the business value. You have to figure out
how to justify it: why do I need to introduce AI here, right? And then,
how do you translate your business problem into an algorithmic problem? How do you get data? How do
you write the different pipelines for training and get clean data?
And then, at the next level, how do you get the resources?
You have to get the GPUs, the hardware or CPUs,
and run it somewhere, and even deploying AI workloads is not that easy.
It's a huge stack. It's complicated enough.
Nowadays I've seen more and more people in the field
who are capable of mastering and building a stack like this,
but it takes a while.
It will take a few years and maybe even a decade
to get to the stage where everyone is using databases.
Databases are such a consolidated space,
so well-known and well-defined.
AI is far from that.
I think it just takes years for each layer to consolidate,
and also for the different layers
to find the right stack to go with.
And I think it's evolving. I cannot think
of a single bottleneck where everyone is stuck. So my question, I guess somewhat
predictably, is since we're talking about the size of datasets, how big can ML models get?
We have 100 billion parameter models already.
Will this look small or will we reach some kind of limit eventually?
I do talk to some modelers; I have asked them questions like this before, and I think models
will keep getting larger.
They will just keep increasing. But I do think it's going to
slow down. Right now, we're seeing huge models. And I don't know whether we, as modelers or data
scientists, are comfortable that we understand what we are doing there. Once you have too many parameters,
there are a lot of things that can happen
without us fully understanding them.
Still, people will just build larger models.
I just don't think the rate of increase will stay as high as it is right now.
Finally, as promised, we have a question from a previous podcast guest.
This one comes from Sara E. Berger, a researcher at IBM.
Sara, take it away.
Hi, I'm Sara Berger.
I am a research staff member at IBM Research.
And my question is: with all of the AI that we have in our day-to-day, where should the limitations be?
Where shouldn't we have it, where should we have it? You know, where are the boundaries?
I mean, this is a really big question. I think everyone has their own
opinions on this, and to me, I definitely think there should be a boundary. I don't want to be fully controlled by AI in my day-to-day life.
Perhaps I want to leverage AI to help me in my work,
automating a lot of different tasks.
And even during travel, or when I'm driving,
I feel comfortable using a self-driving car to help reduce my burden.
But beyond that, like making decisions for me, or taking over the decision-making process for my personal life or things like that, I don't feel comfortable. For example, with more and more recommendations, people are really
heavily depending on the recommendation system when you are seeing news or videos on YouTube.
A lot of different things are the output of AI training algorithms. And this actually
makes me, to some degree, concerned that if there is a
bug in, for example, the search engine or the recommendation engine, I may just keep
seeing very siloed views, or a very different type of information than what I want
to retrieve. So I want to be very cautious about that.
Well, thank you so much. And that's what's fun about the three questions segment is we get some
awesome answers like that. So we look forward to hearing what your question might be for a
future guest. And if one of our listeners wants to contribute one as well, just reach out to host at utilizing-ai.com and we'll record it. So Bin Fan, thank you so much for joining us. Where can
people connect with you and follow your thoughts on enterprise AI and data science? And is there
something special that you'd like to let us know about? Yeah, my Twitter handle is @BinFan. It's
just my first name and last name. You can follow me there.
And also, sometimes I repost
a lot of interesting articles on LinkedIn.
And Frederic, what's up?
Well, I'm
still working on my startup
around data management.
And from a consulting perspective,
I'm still designing
and deploying large-scale AI clusters
for companies. And you
can find me on LinkedIn and on Twitter as @FredericVHaren. And as for me, I am, as I mentioned last
week, getting pretty excited about where we're going with our future Tech Field Day events,
including our AI Field Day that we're currently planning. You will catch some of the folks
that you've heard on the podcast at that event.
And I would love to have you all involved.
So if you go to techfieldday.com,
you can learn a little bit more about that.
Or you can reach out to @SFoskett on Twitter,
and I'd love to hear from you.
So thank you for listening to the Utilizing AI podcast.
If you enjoyed this discussion,
please do subscribe in your favorite podcast application or find us on YouTube and maybe give us a thumbs up.
That always helps.
This podcast is brought to you by gestaltit.com, your home for IT coverage from across the enterprise.
For show notes and more episodes, go to utilizing-ai.com or you can follow us on Twitter at utilizing underscore AI.
Thanks for listening and we'll see you next time.