Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 3x25: The Unique Challenges of ML Training Data with Bin Fan

Episode Date: March 15, 2022

Machine learning is unlike any other enterprise application, demanding massive datasets from distributed sources. In this episode, Bin Fan of Alluxio discusses the unique challenges of distributed, heterogeneous data to support ML workloads with Frederic Van Haren and Stephen Foskett. The systems supporting AI training are unique, with GPUs and other AI accelerators distributed across multiple machines, each accessing the same massive set of small files. Conventional storage solutions are not equipped to serve parallel access to such a large number of small files, and they often become a bottleneck to performance in machine learning training. Another issue is moving data across silos, storage systems, and protocols, which is impossible with most solutions.

Three Questions:
- Frederic: What areas are blocking us today from further improving and accelerating AI?
- Stephen: How big can ML models get? Will today's hundred-billion-parameter model look small tomorrow, or have we reached the limit?
- Sara E. Berger: With all of the AI that we have in our day-to-day, where should the limitations be? Where should we have it, where shouldn't we have it, where should the boundaries be?

Guests and Hosts:
- Bin Fan, Founding Member of Alluxio Inc. Connect with Bin on LinkedIn and on Twitter @BinFan.
- Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren.
- Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett.

Tags: @SFoskett, @FredericVHaren, @BinFan, @Alluxio

Transcript
Starting point is 00:00:00 I'm Stephen Foskett. I'm Frederic Van Haren. And this is the Utilizing AI podcast. Welcome to another episode of Utilizing AI, the podcast about enterprise applications for machine learning, deep learning, data science, and other enterprise AI topics. Every time we've spoken, we've looked at various portions of the AI stack, from applications to the underlying storage devices and GPUs. We've even looked beyond at the
Starting point is 00:00:35 implications of AI for society. But today we're going back into the stack, and we're going to think about data sources. And this is one of those things that, as a storage nerd, kind of makes me interested, because quite frankly, ML training is very, very different, workload-wise, from almost any other application for enterprise storage. Right. I totally agree. I mean, we all know you need data in order to process and create AI models. One of the challenges today is that if you don't know what data you have, it's as good as not having the data at all. And I think the challenge that enterprises are seeing today is the ever-growing amount
Starting point is 00:01:19 of data silos, as well as the amount of data sets. And to your point, the architecture to process data with CPUs, the traditional analytical compared to the GPUs is significantly different. Yeah, and increasingly it's a distributed access where you've got multiple devices accessing the same data. You've got a tremendous amount of data. It's a very different kind of data as well. And so today, I'm really glad to have Ben Phan from Eluxio joining us to talk about the massive expansion of data
Starting point is 00:01:57 that is required by machine learning trading. Ben, welcome to the show. Thank you, Stephen. Hi, Patrick. Thanks for inviting me to talk about what we are observing in the field and talking about how Aluxo is engaging with machine learning and AI workloads and data stack.
Starting point is 00:02:14 So I'm Bin. I'm one of the founding members of Aluxo. I joined the company in 2015. Before that, I was working in Google building the next generation of a large-scale storage system. And prior to that, I was doing my PhD in Carnegie Mellon, also working on distributed system and distributed networking systems. So in the past, I would say, like seven years, I started implementing, designing, architecting
Starting point is 00:02:38 the open-source Aluxo project. And in the last three years, I'm focusing more on the community, building the community, building the open source, and engaging with the users. And also, I'm working with a lot of ecosystem partners and a lot of different other open source projects just to see if there is a synergy to put a Luxio into this picture and to having different users to solve their problem.
Starting point is 00:03:05 In the past two years, there is a dramatic change actually in the community, in the trend of community interest that I started to see a lot of users are switching from using Eluxio to, in addition to using Eluxio for big data workloads, analytics workloads, they start to use this for accelerating and to ease their life for managing data
Starting point is 00:03:32 for machine learning workloads. So I guess that's why I'm here, just to share my knowledge and share my observations. Yeah, so you clearly have been around the block and you probably have seen what works and what doesn't work. What do you, I'm curious to know, what do you feel like are the data challenges that customers and enterprises have today? That's a good question. I actually see the challenges in multiple different dimensions within the machine learning and AI,
Starting point is 00:04:05 serving machine learning and AI workloads. One of the challenges is really in the 15 or 10 years ago, a lot of the data platform, data processing platform are designed with the principle to put compute and data storage together to re-em really emphasizing on having data locality. And this is really just following the fantastic idea proposed by the early Google paper, like Google File System and MapReduce and BigTable.
Starting point is 00:04:36 And Hadoop is modeled after these different, very successful systems in Google. And so in that end, you really just optimize, you really try to reduce the chance you need to read the data from a different device or from a network device or from network, basically. But now in the AI, today, we see more less and less scenarios people are deploying their data platform in that way.
Starting point is 00:05:04 And actually, it's also a fact of people are moving to the cloud. And in the cloud, natively, they are doing this, they have this natural, I will say, isolation or dividing between the compute component and also the storage, because they are just different services. And also the computation is really heavily today for AI, machine learning AI workloads. People are moving from CPUs to GPUs, so they are heavily equipped by this GPU hardware,
Starting point is 00:05:35 and naturally they are not belonging to the storage device or storage system any longer. Then the first challenge I'm mentioning here is really given machine learning AI training or serving is very data intensive, you have to move data around. And the traditional, the previous approach having co-located data and storage data and compute is no longer there, is no longer option, then you have to do this work. Yeah, moving data and prepared data becomes a really heavy burden for a lot of machine learning AI workloads. So that's the first dimension. And also there are some other dimensions, like even in the data access pattern,
Starting point is 00:06:15 it becomes very different. If people recall, like what's mentioned in the very first papers that are talking about big data in the GFS, Google File System paper. They're talking about, oh, we're only dealing with the big files, the files at least several gig and two terabytes. So we don't worry about small files. And that's perhaps true
Starting point is 00:06:38 for a lot of ETL workloads, for batch processing. You have the choice just to compact a lot of this data together to see a huge files, which is much easier to manage. In the AI world, we see a for different types of, a lot of times for audio and also for video, for pictures. So it's basically a collection, a data set is a collection of a massive amount of relatively small files, pictures or video clips, audio clips, things like that.
Starting point is 00:07:25 So that means you have to be prepared to have a lot of reading and writing or checking a lot of different small files, and this creates a lot of pressure, especially on a metadata side. It's basically the worst workload. It's a nightmare for traditional storage system. And we start to see a lot of people or a lot of companies, a lot of projects are thinking towards that direction. Oh, how do I handle the case I have a massive amount of small files? Yeah, so that's for datasets. For the input, we see that's another different challenge. And the third challenge we see that's another different challenge.
Starting point is 00:08:06 And the third challenge I'm mentioning here is really I see the increasing parallelism in accessing data. Like traditional data processing is mostly utilizing CPUs. You have maybe 32 cores, you have 64 cores. That's the level of parallelism you're talking about, how each core is processing something. But with the GPU equipping this as a new weapon to beef up the machine learning workloads,
Starting point is 00:08:37 it's easy to see hundreds of different threads doing the data processing, data reading. So you are, and this is on a single machine, and on each single machine there can be four GPU cards or sometimes 16 GPU cards, and you can have a massive amount of GPU farm, right? So you definitely see the parallelism is on a different order compared to the traditional big data workloads. So your system, your serving the data needs to be prepared for this type of workloads accessing data in a massive amount of parallelism. So that's basically the three dimensions we're seeing. And also some general trend like just people are using more and more data. That's generally true for across different workloads.
Starting point is 00:09:27 For AI workloads, we definitely see people are talking about huge data sets, not only in the number of files, but also in the total size of the data set. Yeah, you have to be prepared for a much larger data set too. So with my background in storage, I can say that, yeah, the environment you're describing is very unusual. Typically, storage solutions aren't designed, as you said, to handle massive datasets, especially massive numbers of small files. Some solutions have been designed to handle large numbers of small files,
Starting point is 00:10:02 but most of those aren't designed to handle parallel access to all of these small files and high performance. And as you said as well, and I think this is key, sometimes those small files aren't very small. So if you think about some of the signature applications that machine learning is being used for, it's things like video and audio and image processing. And sometimes those files are surprisingly big. They're not, especially when it comes to video files, you could be talking about millions of video clips, each in the multi-megabyte size or even gigabyte size.
Starting point is 00:10:39 That's a lot of storage right there. And then the other challenge, I think, is that in many cases, companies want to use data that may be distributed across various different storage technologies, storage devices, even different protocols for access. And that also poses a challenge to bring those into the machine learning workflow as well. Yeah, I think that small files and large files is a really common problem. What I do see in the market a lot more is that there's multiple types of workloads on the same hardware or same infrastructure, which basically means that traditionally you
Starting point is 00:11:24 could tune a system or an environment for small files or large files. Today, the challenge is that some workloads have both large files and small files. I think one of the challenges with data and data movement in an organization as a challenge, it's because they severely underestimate the effort it takes to move data around. Is that something you agree with? I mean, today, I mean, a couple of years ago, people would say there are multiple silos, we have to get rid of many more silos and go to a few silos, while there's a whole new data architecture coming out where they're now saying data mesh which is pretty much you know keep your data silos and will compute where the silos are yeah so i actually i see both happening like i see um based on my observation i do see
Starting point is 00:12:20 like uh users or in the field, people are taking both either approaches. They're consolidating their siloed data into what sometimes they call a single data lake, or I have a master region on AWS. I want to move all my data to this master region. I definitely see that. However, there's also another, it's always, you will see a different, like, situations enabling a different direction, which is like, oh, through acquisition, my company get another studio from Europe, right?
Starting point is 00:12:56 Then they might be using a totally different technical stack. And that's going to be, take a long while to merge. Situations like this, people telling me stories like this very often. So basically, silos will be created and also there will be effort to clean silos. But still,
Starting point is 00:13:16 I think both will be true. Also, another fact is people are moving to the data world anyway. More and more things will be processed in a digital way. Like traditional companies are moving to the data, more and more than the data platform. Anyway, so the cake is just growing.
Starting point is 00:13:35 It's bigger and bigger. And we will see each part also bigger and bigger. I will say that's my projection. Well, so data is a piece of the puzzle. You talked a little bit earlier about an ecosystem. I'm curious to learn more about what you mean with an ecosystem and how does that help the AI community? Well, the AI community from its very beginning
Starting point is 00:14:01 is already very, you can think it's like a collection of multiple different communities, especially on the open source communities, and they're boosting this development of AI. Compared to 30 years or 20 years ago, I think in the new era, people are, in the very early days, people just learn this, oh, this is a new way we're treating data, we're training data. But soon, like, you see TensorFlow or this type of, like, PyTorch, all this open source technology, they are available just for everyone to use. And then, so I think that contributes to the fact why
Starting point is 00:14:45 you see this trend of this wave of modernizing, getting the AI technology or just using AI to empower everything. Like it's getting so fast, it's so rapid because it's quite available
Starting point is 00:15:00 from its early age. And that really is, the that is really helping a lot. You can resource documentation, code samples, and even training data. And a lot of things are just free and available on the internet. Everyone can learn from that. I think that demonstrates a huge power of having this, like people sharing knowledge, sharing their source code,
Starting point is 00:15:28 sharing their ways of doing things. Yeah, it seems like data and data movement is the Achilles heel of producing faster and low latency workloads. It's kind of an interesting move. Now, going back to the ecosystem, so does that mean you're also talking to partners at the compute and the network level, or do you really consider the ecosystem
Starting point is 00:15:55 more of a community effort? Both, both. Actually, because I'm driving the ecosystem or open source initiative in my project, in my company. So I do talk to a lot of open source projects. Like I myself is a committer for Presto. Presto is a big data SQL processing engine, open source processing engine. So as a data layer, if you want to do a good work, I mean, if you want to do a reasonable work,
Starting point is 00:16:26 yes, you can just stay in your layer and just make everything well-defined and perform as expected, getting good throughput, latency optimized for everything, right? But you won't have the next level of performance or usability. You have to go to different users and different computations. You have to. And that's the case.
Starting point is 00:16:52 It's good that you have these layers. You have this processing layer, you have the training or big data analytics. You have this data layer corresponding to data access or data federation. However, from user's perspective, it's ideal if you can just present this stack and pre-configured and a lot of things, whistles and bells are settled down and they can just use this.
Starting point is 00:17:20 And that will be ideal from user's experience. So yeah, you have to talk to the different community, different computes, including open source communities and also including partners. That's definitely what you should do as a data solution provider. On top of that, I just want to share some technical perspective. It's very interesting to see. We're providing a unified,
Starting point is 00:17:49 this is the same code base, the same project for both computation in analytics and also training in AI. But you can definitely tell, for serving different workloads, it requires different configuration or different ways to set your data service. And that actually is also very important. And on top of that, even to serve different data models,
Starting point is 00:18:15 a different type of data model, ImageNet or some other, and you can even tune on top of that to achieve better performance and to just make it more stable. Yeah, it's very fun. The deeper you go, the better performance and more friendly user experience you will provide. Yeah, it really is a distributed and heterogeneous on both ends of the stack. So on the one hand, you might be dealing with object storage or cloud storage or NAS or something. On the other end, you might be dealing, you know, with, you know, TensorFlow, you know, PyTorch, whatever, and they may expect a different type of storage. And so that's really what you all were working on, right, is to have that glue between the various different protocols, as well as making the enabling what we talked about at the beginning,
Starting point is 00:19:05 which is that distributed access. Yeah, yeah, definitely. So the vision for us is basically we provide this data abstraction layer, or in the system we call this data abstraction. So this data, we provide this abstraction, and users facing this data abstraction, they don't really need to care if my data is moved from AWS or moved from HDFS in their own on-premise data warehouse. They just need, oh, I need this ImageNet data set
Starting point is 00:19:38 or I need these tables for TPC-DS. And then they can run Spark, they can run Hive and Presto, or they can run TensorFlow, they can run PyTorch. And even with this same data abstraction, you can use different ways to access data, like a traditional POSX way,
Starting point is 00:19:59 which is favored, more favored in the AI community, like a lot of people using Python, right? And using this type of tools to access data. But it's the same data. In contrast, in a more traditional analytics world, people are using HDFS interface. It's more or less the industry defado. But it doesn't matter.
Starting point is 00:20:22 Like, this is the same data abstraction. You can use different ways to access the same data abstraction. That's what we're providing in Nalox. And even in the data scientists, although we want to stick to machine learning, I'd say even in the data analytics space, I think Python is really taking off. I'm seeing a lot more people using POSIX data in Python
Starting point is 00:20:42 instead of going to HDFS. But I don't know, Frederick, if you're seeing that too. No, I definitely see that. I think HDFS is the first kind of open source community way of storing data, right? So you didn't have to buy those expensive, high-performing file systems. You kind of used commodity hardware to achieve a similar performance, but at a much lower cost. But I do agree, you know, there's HDFS,
Starting point is 00:21:14 there's all kinds of different file systems. I'm actually really interested that you brought up Presto because, you know, databases is another silo or data source, if you wish. And I think what the other problem customers are dealing with or enterprises are dealing with is not the amount of data, but it's also the access to that data, right? In one case, it's a database. In another case, it's a file system, a POSIX file system, or object, or HDFS, and so on. And I think having a solution that kind of creates that abstraction is one of the big issues that need to be solved in order to take advantage of the available data.
Starting point is 00:21:59 And so maybe a question I have there is, I understand the data abstraction. Now, when you talk about data orchestration, in the end, you just provide a mechanism to find the data. You're not moving data, right? Oh, we're moving data. Definitely, we're moving data. Yeah, the data is, for example, if the data is living in AWS S3,
Starting point is 00:22:19 but you're running your applications in your on-premise data warehouse, just to let you know, we do have your on-premise data warehouse. Just to let you know, we do have users, customers running in that mode. So every time you go to S3, that's a huge cost for you. Even with data abstraction, without movement, you have to go to S3 and download the data, and that can be a burden for you.
Starting point is 00:22:47 So what we do is, in addition to the abstraction layer, we also have this mechanism to tell, oh, this is the working set, and this is the hot data, so we can move the data closer to computation next time you don't need to go to S3. That's definitely something, key applications, key application, key features we see among our community. And it's not just performance that you're worried about. It's also cost of accessing data in public cloud
Starting point is 00:23:10 and especially in S3. Both or either. Sometimes people choose, depends on their applications. But I was talking to a user and he mentioned to me by using this technology, they can reduce their cost to access their data in cross-region in AWS by 50%. That's interesting.
Starting point is 00:23:31 Cross-region, that's something I hadn't thought about, too. That's even more expensive than accessing the data in the same region. So the data orchestration, is that done automatically, or is that the consumer saying, I would like to move it from on-premises POSIX to object in the public cloud? So the way is like we're building this virtual layer there. And when we're doing this, we want to pretty much model after the current model, like people are very familiar with. For example, in your laptop, if you want to, we have this VFS layer on your Mac or Linux box. So whenever you put a hardware there, like, for example, a hard disk you put in your laptop, and then it's basically
Starting point is 00:24:28 you mount this hard disk into your VFS layer, and then you can just access data there. And it's exactly the same analogy in our case. So you treat the S3 bucket or you treat HDFS as a hard disk, exactly as a hard disk. And you mount, by the way, the command is really called mount. You mount this S3 bucket, or you mount this HDFS cluster into this virtual namespace to provide data abstraction. And then you can access this data abstraction through alexio-commerce namespace so yeah and then
Starting point is 00:25:06 everything happens automatically you don't have to tell oh you can still do that you can still tell oh alexio help me to fetch data from the s3 bucket in here you can just run the command load or distribute load like in a parallel way or you just wait for the applications to tell you, the application is accessing a part of the data, maybe even a part of the files. And then we can just fetch the logical part of this. We will have logical block concepts in these files and just fetch the part you're touching. So this can happen both. And just go back to the topic of machine learning. We found a very interesting way, like some users telling us the story.
Starting point is 00:25:52 When they are doing the training, in the beginning, I thought what they will do is they will just wait for either they will just wait for the machine learning training applications to read certain data sets, ImageNet. So in this way, the ImageNet will be loaded into, because it's accessed, so it will be loaded into a Luxio space and cached there. And then so you can just reduce the cost
Starting point is 00:26:12 and increase the performance. Or you just, like, beforehand, you run this distributed load command to bring the data into the Luxio space so you can just access this. But it turns out they're doing it in a smart way. They're definitely smarter than I. So they're doing this at the same time.
Starting point is 00:26:32 They're training. So training will bring data in. And they're also running this digital reload. So it's a combination for both approach. And I found that users are genius to find their good ways to use Luxio and also to do training. They will do a lot of different organizations I would never imagine. So thanks so much for this tour through all the challenges that people are facing.
Starting point is 00:26:57 I guess, Ben, if you could give us sort of a takeaway message, what should people be thinking about when they're considering the types of data sources that they're going to be using for machine learning training? So thanks for inviting me in this session. I think I really enjoyed this discussion with you, especially for the... Go back to your question regarding the data sources and I think our mission is to make this in not really relevant basically the system we're building is to provide this abstraction that you can choose whatever data source you like whatever cheaper to you or more convenient to you and whenever you need to use it and you think it's too slow just put locks it put locks you there or you think you have too slow, just put a lock seal there. Or you think you have too many of these type of data accesses,
Starting point is 00:27:46 put a lock seal there and mount them. Well, thank you so much. Now is the time in our podcast when we move on to our fun lightning round. We're going to ask our guests three questions. They've not been warned about what these questions will be. And so it's fun to get some off-the-cuff answers that are a little bit more, you know, abstract than the specific storage issues that we've been talking about. So first off, Frederik, why don't you go ahead with yours?
Starting point is 00:28:17 So my question is, what areas are blocking us today to further improve and accelerate AI? I will talk about my observation. I think it's right now, so AI is a huge, a very deep stack, like from very top level, you have to find a business value. You have to find, how do you justify, I need to run the AI, why I need to introduce AI here, right? And then how do you translate your business problem into a algorithmic problem? How do you get data? How do you write this different pipelines for training and get the clean data? And then next level, how do you get the resource for you? You have to get the GPUs, you have the hardware or CPUs,
Starting point is 00:29:14 and run somewhere, and even deploying AI workloads is not that easy. It's a huge stack. It's complicated enough. Mastering this stack is, I think, people are, nowadays I've seen more and more people in the field are capable of building a stack like this, but it takes a while. It will take a few years and maybe even a decade to get to the stage that everyone is using database.
Starting point is 00:29:44 Database is a so consolidated space and it's so well-known and well-defined. AI is far from that. I think it just takes years for each layer to consolidate and also for the cross different layers to find the right stack to go with. And I think it's evolving. I cannot think of a single bottleneck where people are bottlenecked. So my question, I guess somewhat
Starting point is 00:30:16 predictably, is since we're talking about the size of datasets, how big can ML models get? We have 100 billion parameter models already. Will this look small or will we reach some kind of limit eventually? I do talk to some modelers about, I ask them questions like this before, and I think it will getting larger. It will just keep increasing. But I do think it's going to be slowed down. Right now, we're seeing huge models. And I don't know, we are, as modelers or as a data scientist, we are comfortable to understand what we are doing there. So once you have too many parameters,
Starting point is 00:31:07 there's a lot of things can happen without fully understanding. I think that's... Still, people will just build larger models. It's just not... I don't think the increasement will be as high as right now. Finally, as promised, we have a question from a previous podcast guest. This one comes from Sarah E. Berger, a researcher at IBM.
Starting point is 00:31:33 Sarah, take it away. Hi, I'm Sarah Berger. I am a research staff member at IBM Research. And my question is, with all of the AI that we have in our you know day-to-day where should be the limitations where shouldn't we have it where should we have it you know where are the boundaries I mean this is a really big question I think everyone has their own different opinions on this and to me I definitely think there should be a boundary. I don't want to be fully controlled by AI in my day-to-day life. Perhaps I want to leverage AI for helping me in the work,
Starting point is 00:32:17 automating a lot of different work. And even during travel or when I'm driving, I feel like I'm comfortable to use a self-driving car to help me to reduce my burden. But beyond that, like helping me doing decisions or taking me from decision-making process for my personal life or things like that, I'm not feeling comfortable. For example, more and more recommendations, people are really heavily depending on the recommendation system when you are seeing news or videos on YouTube. A lot of different things are the output from the AI training algorithms. And then this actually makes me very, to some degree, concerned that if there is something in the bug in, for example, the search engine or the recommendation engine, I may just keep
Starting point is 00:33:12 seeing very different or very siloed views or very different type of information I want to retrieve. So I want to be very cautious about that. Well, thank you so much. And that's what's fun about the three questions segment is we get some awesome answers like that. So we look forward to hearing what your question might be for a future guest. And if one of our listeners wants to contribute one as well, just reach out to host at utilizing-ai.com and we'll record it. So Bin Fan, thank you so much for joining us. Where can people connect with you and follow your thoughts on enterprise AI and data science? And is there something special that you'd like to let us know about? Yeah, my Twitter handle is Bin Fan. It's just like a Bin Fan, my first name and last name. You can follow me there.
Starting point is 00:34:06 And also, sometimes I repost a lot of interesting articles on LinkedIn. And Frederick, what's up? Well, I'm still working on my startup around data management. And from a consulting perspective, I'm still designing
Starting point is 00:34:22 and deploying large-scale AI clusters for companies. And you can find me on LinkedIn and Twitter as Frederick V. Heron. And as for me, I am, as I mentioned last week, getting pretty excited about where we're going with our future Tech Field Day events, including our AI Field Day that we're currently planning. You will catch some of the folks that you've heard on the podcast at that event. And I would love to have you all involved. So if you go to techfielday.com,
Starting point is 00:34:51 you can learn a little bit more about that. Or you can reach out at S Foskett on Twitter, and I'd love to hear from you. So thank you for listening to the Utilizing AI podcast. If you enjoyed this discussion, please do subscribe in your favorite podcast application or find us on YouTube and maybe give us a thumbs up. That always helps. This podcast is brought to you by gestaltit.com, your home for IT coverage from across the enterprise.
Starting point is 00:35:15 For show notes and more episodes, go to utilizing-ai.com or you can follow us on Twitter at utilizing underscore AI. Thanks for listening and we'll see you next time.
