Disseminate: The Computer Science Research Podcast - Alex Isenko | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines | #1

Episode Date: June 27, 2022

Summary: Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy. Maximizing resource utilization is becoming more challenging as the throughput of training processes increases with hardware innovations (e.g., faster GPUs, TPUs, and interconnects) and advanced parallelization techniques that yield better scalability. At the same time, the amount of training data needed to train increasingly complex models is growing. As a consequence of this development, data preprocessing and provisioning are becoming a severe bottleneck in end-to-end deep learning pipelines.

In this interview Alex talks about his in-depth analysis of data preprocessing pipelines from four different machine learning domains. Additionally, he discusses a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines and extracts individual trade-offs to optimize throughput, preprocessing time, and storage consumption. Alex and his collaborators have developed an open-source profiling library that can automatically decide on a suitable preprocessing strategy to maximize throughput. By applying their generated insights to real-world use cases, an increased throughput of 3x to 13x can be obtained compared to an untuned system while keeping the pipeline functionally identical. These findings show the enormous potential of data pipeline tuning.

Questions:
0:36 - Can you explain to our listeners what is a deep learning pipeline?
1:33 - In this pipeline how does data pre-processing become a bottleneck?
5:40 - In the paper you analyse several different domains, can you go into more details about the domains and pipelines?
6:49 - What are the key insights from your analysis?
8:28 - What are the other insights?
13:23 - Your paper introduces PRESTO, the open-source profiling library, can you tell us more about that?
15:56 - How does this compare to other tools in the space?
18:46 - Who will find PRESTO useful?
20:13 - What is the most interesting, unexpected, or challenging lesson you encountered whilst working on this topic?
22:10 - What do you have planned for future research?

Links: Homepage | Paper | PRESTO

Contact Info: Email: alex.isenko@tum.de | LinkedIn

Hosted on Acast. See acast.com/privacy for more information.

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, the podcast bringing you overviews of the latest computer science research. I'm your host, Jack Wardby. We're recording today from ACM SIGMOD/PODS in Philadelphia. I'm delighted to say I'm joined by Alex Isenko, who will be talking about his paper, Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines. Alex is a PhD student at the Technical University of Munich. He is interested in ML and optimizations and generally making stuff faster. Alex, thanks for joining us on the show. Yeah, thank you. Happy to be here. Brilliant. Let's dive straight in. Can you first of all explain to our listeners
Starting point is 00:00:58 what is a deep learning pipeline? Yeah, so a deep learning pipeline in general always starts off with some data set stored somewhere. Then we have some preprocessing steps, typically any kind of transformation that you can imagine, like decoding images. Then we have the training process that ingests that data, right? It runs with that, does forward propagation and backward propagation, and applies the gradients to the model. And then we start everything anew and do that multiple times.
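To make the stages concrete, here is a minimal sketch of the kind of end-to-end pipeline Alex describes, written with TensorFlow's tf.data and Keras (the tooling the interview later focuses on). The file pattern, image size, and toy model are hypothetical placeholders for illustration, not anything taken from the paper.

```python
import tensorflow as tf

# Hypothetical folder of JPEGs standing in for the stored dataset.
files = tf.data.Dataset.list_files("data/images/*.jpg")

def preprocess(path):
    # Decode and transform one sample: the preprocessing stage.
    raw = tf.io.read_file(path)
    img = tf.io.decode_jpeg(raw, channels=3)         # raw bytes -> RGB tensor
    img = tf.image.resize(img, (224, 224)) / 255.0   # resize + normalize
    return img, 0                                    # dummy label, just for the sketch

dataset = (files
           .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))

# Toy model standing in for the real training process.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(224, 224, 3)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Each epoch re-reads and, unless materialized somewhere, re-preprocesses the whole dataset.
model.fit(dataset, epochs=3)
```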
Starting point is 00:01:37 So we iterate over the entire data set multiple times, in so-called epochs. And then hopefully at some point we achieve the desired accuracy, so that we can finally conclude and say, well, the model is trained and it does the task we want it to do. Fantastic. So in this pipeline, how does data pre-processing and provisioning become a bottleneck? Yeah, so this is actually a very good question, because this is kind of the line of
Starting point is 00:02:01 thinking that got us started on this kind of research in general. Around three years ago, when we initially thought about it, we saw that there is already a lot of research being done in the entire ML community on how to make the training go faster, because that is the major bottleneck, or it was the bottleneck, and it still is the bottleneck for specific applications. But there are something like 100 papers published each day in the machine learning community; keeping up with that is very, very hard, and so is carving out a space for yourself. And then we noticed that there are actually many, many papers already doing this training speed optimization, like using better hardware or using better software to optimize the
Starting point is 00:02:40 training, right? And we noticed, well, if this part of the entire pipeline is getting that much faster, and so many people are invested in it, what about the preprocessing, which actually has to provide the data to feed these models that run that much faster? And luckily for us, unluckily for the rest of the community, we noticed, and of course other people also noticed, that this is actually a problem. Sometimes, if you are not a professional in terms of designing such a pipeline, it's very, very easy to create a preprocessing pipeline that will not saturate your GPU, which sometimes you pay for with real money on AWS or Google Cloud, right? So it's kind of wasting money in that sense.
Starting point is 00:03:21 And the funny thing about that is that it's actually not very good for the cloud providers either, because the interesting trade-off in there is the following. If we could show you that you could actually get worse hardware and still get the same results, the specific machine which you would have bought
Starting point is 00:03:38 would be free, and the cloud providers could actually sell it to another person, because you are the person who would use this lower-end hardware. So it's like a win-win. The client spends less money to get the same result, and the cloud providers get an additional, better VM back to provision to other people, because in reality they have
Starting point is 00:03:53 quite a problem giving people those low-end VMs, because nobody wants them; people don't even know that they could have essentially the same usability as the higher-end VMs. So, yeah, this is actually why we got into preprocessing in general and saw, well, this is actually a problem for many people.
Starting point is 00:04:13 And as for how it can become a bottleneck, the gist of the paper in general is that storage consumption is really, really important when estimating the throughput, because it affects many, many different things. It affects the bandwidth of the network, however you're connected to your node and wherever your storage lies, be it an S3 bucket or your local SSD or whatever, right? And the other thing is, of course, that we typically always work with a metric called throughput, in samples per second, right? And samples per second is not really
Starting point is 00:04:44 megabytes per second, right? And the interesting thing about the preprocessing is that the same data, like the same kind of image, can sometimes be blown up and then also reduced again in storage consumption. And therefore this throughput in samples per second may differ a lot, because, for example, decoding JPEGs increases the storage consumption by a factor of 6 to 8, right? So the CPU actually has to process that much more data, which is very, very inefficient in that particular step, and therefore the throughput can actually be reduced by a lot. So, a very naive example: decoding and saving the images raw to disk is not really a good thing. But kind of everybody knows that already. What we have shown is that these particular trade-offs
Starting point is 00:05:29 are also there with other non-trivial steps, like resizing and normalization and all of those things. All of these transformations have a trade-off in terms of throughput versus storage consumption. And depending on the data set, depending on the hardware you're running on, depending on the amount of DRAM you have, all of these can change and affect your preprocessing pipeline negatively or positively. And this is the way these bottlenecks actually show up, just because every step has such a trade-off.
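As a rough illustration of the blow-up Alex mentions, here is a small, hedged snippet that compares the on-disk size of a JPEG with the size of its decoded RGB tensor. The file path is a hypothetical example, and the 6x to 8x factor will vary with image content and resolution.

```python
import os
import tensorflow as tf

path = "data/images/example.jpg"           # hypothetical sample image
raw = tf.io.read_file(path)
img = tf.io.decode_jpeg(raw, channels=3)   # uint8 RGB tensor, H x W x 3

encoded_bytes = os.path.getsize(path)
decoded_bytes = int(tf.size(img)) * img.dtype.size   # one byte per channel value

print(f"encoded JPEG : {encoded_bytes / 1e6:.2f} MB")
print(f"decoded RGB  : {decoded_bytes / 1e6:.2f} MB "
      f"({decoded_bytes / encoded_bytes:.1f}x larger)")
```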
Starting point is 00:05:51 And that leads naturally into the next question. In your paper, you analyzed several different pipelines from several different domains. Can you go into a little bit more detail about the specific domains you looked at and the pipelines that were there? Yeah, of course.
Starting point is 00:06:11 So the pipelines we looked at: we had seven representative pipelines. We tried to basically scrape the papers from the last 10 years and see which ones were actually kind of prominent, which ones were used a lot, right, to get something representative going on. And we focused on four different domains, namely computer vision, natural language processing, automatic speech recognition, so voice, right, and one kind of esoteric one called NILM, non-intrusive load monitoring. We used that because people from our chair are actually kind of doing that in their
Starting point is 00:06:48 research as well, so they kind of pushed me to it: well, yeah, let's just take a look at that as well. And it's kind of an interesting topic, where you measure electrical data and try to classify appliances from it. So we also covered that, and this is kind of the signal data that we use for that. Okay, cool. So what were the key
Starting point is 00:07:04 insights from your analysis? Yeah, so actually the key insights, there were so many, we definitely won't have enough time. Okay, give me the best ones. In this talk, yeah. Yeah, so one of the main things, to be honest, I would really like to focus on, if there would be one thing that everybody should take away from here,
Starting point is 00:07:19 it should be the following: fully offline preprocessing is not necessarily the best strategy. To explain that, we have to take a little step back and explain what fully offline means, right? What I mean by fully offline is that you pre-process the entire data set before you start the training. This is kind of the status quo and the default way to do things, because it's kind of easy and it makes sense, right? Because I already talked about the fact that we have epochs, right? So we look at a data set multiple times. And if I now
Starting point is 00:07:50 told you, well, you have to pre-process the entire data set again and again, every epoch, it's obvious that you should cache or somehow save these kinds of computations, right? And what we have shown, unfortunately, is that this is sometimes not the best strategy, especially for image problems. For image pipelines, we have shown that in all of our... we tested three image pipelines. Sorry, one image pipeline with three different data sets. It didn't even matter if the images were from ImageNet or some high-resolution data set. In all situations, we could actually find a strategy which was better than fully offline preprocessing. And that had, again, to do with the storage consumption and then, of course, the throughput.
Starting point is 00:08:29 So we actually had a throughput increase of three times, going from like 600 samples per second to 1,800 samples per second without changing the CPU hardware, just by changing this materialization strategy which I talked about. What are the other insights? Yeah, so other insights. I think one that I'm really eager to talk about as well is about compression, actually, because for people who already know about this issue of storage consumption versus throughput, compression is kind of the next logical step, right? Because compression focuses on exactly this particular issue. It focuses on the fact that you have
Starting point is 00:09:06 CPU compute power left in some sense, right? And you have to compress and decompress data, and you save storage consumption. So it sits exactly in that particular trade-off, and therefore we also analyzed it. And funnily enough, there were two quite interesting insights.
Starting point is 00:09:22 The first one, for example, was that compression really works sometimes, like a lot: it can save storage consumption and increase the throughput at the same time. And this is kind of an interesting finding, because typically, as I said, it's a trade-off, right? You trade storage consumption for CPU compute, or for throughput. And we actually noticed that for some particular strategies it's beneficial to do both; you can actually get both out of it. So you have a win-win situation, just because of the effect the storage consumption is having: the smaller the storage consumption gets, the bigger the effect on the throughput you get.
Starting point is 00:09:57 And therefore, while still having to decompress the entire data set on every single read in every epoch, you still get better throughput while keeping a small storage consumption. This was kind of an absurd finding to me. I was really not expecting something like this to happen, but it actually did. Fascinating insight, yes. Great result.
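For readers who want to see what "materialize the preprocessed samples with compression" can look like in tf.data, here is one hedged sketch using GZIP-compressed TFRecords. The shapes, file name, and random data are made up for illustration; this is not the exact strategy code from the paper.

```python
import tensorflow as tf

# Hypothetical preprocessed dataset: float32 tensors (e.g. after decode + resize).
ds = tf.data.Dataset.from_tensor_slices(tf.random.uniform((8, 224, 224, 3)))

# Materialize the preprocessed samples as a GZIP-compressed TFRecord file.
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("preprocessed.tfrecord.gz", options) as writer:
    for sample in ds:
        writer.write(tf.io.serialize_tensor(sample).numpy())

# In later epochs, read the compressed records back instead of re-preprocessing.
def deserialize(record):
    return tf.io.parse_tensor(record, out_type=tf.float32)

cached = (tf.data.TFRecordDataset("preprocessed.tfrecord.gz", compression_type="GZIP")
          .map(deserialize, num_parallel_calls=tf.data.AUTOTUNE)
          .prefetch(tf.data.AUTOTUNE))
```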
Starting point is 00:10:17 Do we have time for one more? Yeah. Yeah, so the one other thing, which is quite cool about compression in general, was that we used a data set which was provided to calibrate cameras, actually. It was kind of a smallish data set, but we found it interesting because it was provided both as PNGs and as JPEGs, and because it was also quite high resolution. And we just said, well, yeah, let's just run it through the same pipeline we already have.
Starting point is 00:10:42 How about, well, what kind of difference does a bigger resolution make, right? But what we actually found out is that the difference between JPEGs and PNGs is quite big. It was a very funny thing. The PNG data set in general, I think, if I'm not totally incorrect, was about 60 gigabytes, and the JPEGs, the same images, took about two gigabytes or something in that ballpark, right? So it's like 30 times smaller, right? So obviously you should always keep the JPEGs; why would you ever use the PNGs? And the funny thing about that is that the pipeline of course already decodes the images as well, because we have to get them into some kind of RGB matrix, so we have to decode them anyway. And JPEG decoding is
Starting point is 00:11:24 anyway slow, and this is something which you don't get with PNGs, right? So that's kind of a first trade-off, but it still didn't make a really big difference in throughput, because, well, yeah, storage consumption, right? Remember, big storage consumption is kind of bad. So it was basically equal in terms of that. But the interesting thing was about compression.
Starting point is 00:11:44 We tried GZIP and ZLIB compression. These are kind of the very default ways of doing compression on basically any kind of data, and they were already implemented in TensorFlow. And when we applied them after the decoding step, we found out that the PNG data set, so the decoded, in quotes, PNGs, actually compressed four times better
Starting point is 00:12:04 than the decoded JPEGs. And the reason for that is that even when you decode the JPEGs and get them into an RGB matrix, they still have the artifacts which they had from the encoding before, right? And the default GZIP and ZLIB compression simply didn't work that well on those images with artifacts. And by that, we actually got an increase in throughput and a decrease in storage consumption for this PNG dataset, and it was a factor of four. So the PNGs at the end, in this particular strategy, which performed the best and which we would always recommend people to use, actually took about 280 megabytes,
Starting point is 00:12:44 while the JPEG data set, for exactly the same strategy, for this best strategy that we had to offer, actually took 1.1 gigabytes. So it was like four times bigger, which is also absolutely non-trivial to think about: keeping a data set as JPEGs will, in the end, actually get you a higher storage consumption for that particular task.
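As a hedged illustration of why that can happen, the snippet below compresses the decoded pixels of the same scene stored once as PNG and once as JPEG, using Python's zlib (the codec behind the GZIP/ZLIB options mentioned above). The file paths are hypothetical, and the exact ratios will depend entirely on the images.

```python
import zlib
import tensorflow as tf

# Hypothetical pair: the same frame stored once as PNG and once as JPEG.
png_raw = tf.io.read_file("calib/frame_0001.png")
jpg_raw = tf.io.read_file("calib/frame_0001.jpg")

png_pixels = tf.io.decode_png(png_raw, channels=3).numpy().tobytes()
jpg_pixels = tf.io.decode_jpeg(jpg_raw, channels=3).numpy().tobytes()

# Compare how well the generic compressor handles each decoded pixel buffer.
for name, buf in [("decoded PNG", png_pixels), ("decoded JPEG", jpg_pixels)]:
    ratio = len(buf) / len(zlib.compress(buf, 6))
    print(f"{name}: {len(buf) / 1e6:.1f} MB raw, zlib ratio {ratio:.1f}x")
```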
Starting point is 00:13:03 Kind of a totally funny, interesting insight. And also, it was just by accident that we found that out. So it was pretty cool, too. Yeah, well, often the best things are discovered by accident, right? Exactly. That's awesome. And for my next question, in the paper, you also introduce Presto, the open source profiling library.
Starting point is 00:13:21 Can you tell us a bit more about that? Yeah, sure. So Presto is actually a very, very simple program, I would say. We're using the so-called tf.data API, the tf.data.Dataset API, which is kind of the default way to create a preprocessing pipeline right now in TensorFlow. And what we are basically doing is accessing all of the steps that you create in this data set. You can do a map, right?
Starting point is 00:13:48 And a map means you simply apply a certain transformation function over a collection of elements, which is typically the data set, right? And what we are doing is creating a split point between every step. The split point serializes and deserializes the data set, so it basically does this materialization which I'm talking about. And then we do a short profiling run and say, well, of all of these particular strategies, this is the one which performs the best, and you can optimize it automatically with a cost function if you want to. It's very, very easy to try out, to be honest. And if I'm going to be really honest, it's actually very easy to implement by yourself in
Starting point is 00:14:23 your particular framework, be it PyTorch or even TensorFlow, if you don't want to use our code. Just take a look; it's very easy to understand, in my opinion at least. And the entire optimization is basically about creating insights about your particular pipeline and actually making visible what kind of trade-offs you're looking at. Because sometimes it's really not easy to understand how certain steps affect your hardware, right? How much memory it actually takes, or how much bigger the images are when you decode them from JPEG into an RGB matrix. Like, I didn't know before.
Starting point is 00:14:58 Yeah. And those kinds of things, we found out, are actually very, very important to know for a fact, for example to know what kind of machine to schedule, right? If I know that my data set takes up two gigabytes but decoded actually takes 80, maybe I should provision a machine that has 80 gigabytes of RAM, right? And these kinds of things are what we hopefully want every listener to take away: that you actually know that there are some hidden trade-offs, right? Hence the title of the paper. There are actually some hidden trade-offs
Starting point is 00:15:27 in everybody's pipeline probably, and you should definitely profile it, because there are many, many opportunities to keep using the same hardware without paying additional money, like to reschedule it or whatever, because sometimes just picking the better strategy is the better way to speed it up, right?
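To make the idea of a split point concrete, here is a rough, hypothetical sketch, not PRESTO's actual API, that caches the pipeline after each candidate split and measures the resulting throughput over a couple of epochs. The three toy steps simply stand in for real transformations like decoding, resizing, and normalization.

```python
import time
import tensorflow as tf

# Hypothetical stand-ins for the steps of a real preprocessing pipeline.
def decode(x):    return tf.cast(x, tf.float32)
def resize(x):    return x * 2.0
def normalize(x): return x / 255.0

steps = [decode, resize, normalize]
source = tf.data.Dataset.range(10_000)

def build(split):
    """Materialize (here: cache) everything up to `split`; run the rest online."""
    ds = source
    for fn in steps[:split]:
        ds = ds.map(fn, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.cache()  # stand-in for serializing the dataset at the split point
    for fn in steps[split:]:
        ds = ds.map(fn, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.prefetch(tf.data.AUTOTUNE)

def throughput(ds, epochs=2):
    start, n = time.perf_counter(), 0
    for _ in range(epochs):          # later epochs benefit from the materialization
        for _ in ds:
            n += 1
    return n / (time.perf_counter() - start)

for split in range(len(steps) + 1):
    print(f"split after step {split}: {throughput(build(split)):,.0f} samples/s")
```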
Starting point is 00:15:45 So how does this compare to other tools in the space? Yeah, so what we focused mostly on was just generating these insights, to actually see what the common caveats and common pitfalls are that some people run into, and which we ran into as well while doing this research. So we focused mostly on that, and this kind of analysis was not really done before, I hope I can say that. But there is actually a lot of related work in that particular domain as well; many people who are publishing papers typically focus on automatic decision finding, and there's actually a lot. So, for example, one in particular I would like to mention is Plumber.
Starting point is 00:16:29 It was also published, I think, in November last year. I actually had some communication with that particular researcher as well, which was quite cool. They also use the tf.data framework, and what they do is actually also quite interesting. They also focus on throughput optimizations and those kinds of trade-offs, and on doing all of those things automatically, but the key thing they focus on is rewriting the execution graph. Because you can imagine, in our case, let's only talk about the preprocessing, it's actually just a line of transformations stacked one after the other, right? And what they focus on is using this degenerate graph structure and say, well,
Starting point is 00:17:07 but couldn't we just simply rearrange things automatically, kind of in a semi-automatic way, to make it better or faster? And this is what they do, right? And this is kind of a cool thing. But for the exact citation, please feel free to look at my paper. We are actually citing many, many of those tools, and
Starting point is 00:17:24 feel free to look at them, because many of them are very, very good. Yeah, we can also link them in the show notes. Yeah, of course, of course, yeah. That would be really cool. And maybe another general thing to note in terms of related work would be the FFCV library right now. I think
Starting point is 00:17:39 I saw a Twitter post in February that it's coming out as kind of a new data loader specialized for computer vision purposes. And it was kind of funny for me to note that they actually include many of our insights. But of course, the paper wasn't out back then yet, so they got to that on their own, of course, right? It was kind of cool for me to read that some of the things we found out were actual problems, and that they had already addressed them in their library without ever reading my paper, because they came to the same conclusions by themselves. So hopefully
Starting point is 00:18:12 we did good research. Yeah. Yeah, so it was pretty cool for me to read about them, but I'm still waiting for the paper; I think it's not out yet. I think they published the library itself, and they already have a pretty cool website going on, but I haven't read the paper, or maybe I just haven't checked it yet. Who will find these results useful? Who will find the tool useful? Yeah, so to be honest, I hope everybody who ever has something to do with preprocessing pipelines in general. That's kind of the basic answer, but the more high-level answer is, I think... because this is also one of the things: some people who I talked to at the conference here
Starting point is 00:18:49 have kind of shown me a little bit of a bigger picture of what we actually did right here. And, for example, they told me that this kind of trade-off of throughput versus storage consumption is very, very generic. It's not a particular problem of preprocessing strategies in general; it's just about everywhere
Starting point is 00:19:07 where you process data, right? Be it some SQL queries, be it in a streaming fashion, whatever, right? All of those people are kind of affected by this trade-off. And, for example, migration, right? You want to migrate a GPU model from
Starting point is 00:19:23 one GPU to another GPU. When do you migrate it? When it takes the least amount of storage, because that's when it's going to be the fastest, right? And this is actually exactly the same trade-off, where you kind of wait and try to see, well, at what point in time is the data as small as possible
Starting point is 00:19:38 so I can actually do something with it? Like either making it faster, like reading it faster from storage, or migrating it from one VRAM to another VRAM, right? So this is kind of the general take, I would say: people should just be aware of this storage consumption versus throughput trade-off, and it really affects everybody who's working in computer science, I would even say. That actually leads quite nicely into the next question I've got, which is: what is the most interesting, unexpected, or challenging lesson that you have learned while working on this topic? Oh, that's hard to answer, to be honest. Most challenging,
Starting point is 00:20:15 I think, was actually having the rigor to perform all of these experiments, to be honest. It's very easy to say, well, yeah, now I kind of already found out some things, right? It's hopefully enough to write a paper, and then the reviewers get back to you and say, well, yeah, but actually, haven't you thought about that thing and that thing? And I was like, oh, no, that's so out of scope. For example, I got a reviewer who told me to think about, well, what about multi-tenancy? How can you share pre-processing pipelines? And I was like, yeah, sure.
Starting point is 00:20:47 That's a very good idea... for the next paper. But really, no. We have, what is it, like 13 pages? And for those 13 pages we had already been working for about two years. We have so much material; we already kicked out, I think, three entire chapters which didn't fit in the paper anymore. So it was really, really hard to actually follow up on that research. So I think, to be honest, this was the most challenging thing: just having the rigor to actually perform all
Starting point is 00:21:12 of these experiments and focus on these main insights, and to try to make it useful to everybody. Because this is kind of a pet peeve of mine: there's a lot of research out there that does very, very good work, of course, but it's very hard to make it accessible to, I would say, everyday people, and by everyday people I mean people who are working in that particular area as well. So I really hope that with these hopefully easy-to-understand insights we'll make a very first step, so that everybody can hopefully avoid the same problems that we had. So, yeah. Fantastic. And one last question from me:
Starting point is 00:21:50 what do you have planned for future research? Yeah, for the future: multi-tenancy, actually. So we thought about that a lot, and, well, unfortunately multi-tenancy is a very, very hard problem, so basically everybody can try this topic by themselves; it's very, very hard to solve for specific tasks, of course. But right now I'm actually talking with people from Hivemind. It's a very cool, not really that new, project, and they are working on democratizing access to deep learning models. The cool thing about them is that they are basically trying to make it possible to simulate cluster-level GPU environments with volunteer hardware.
Starting point is 00:22:37 So you can think of it as, if you have ever heard of SETI@home, one of these kinds of volunteer computing projects. There are actually many with machine learning as well, but from what I know, these people were the first ones to actually make it work, and they actually made it work faster than a comparable GPU cluster.
Starting point is 00:22:54 So they had like 40 volunteers. I think they trained a language model on the Bengali language, if I'm not entirely incorrect. And they compared it, and the actual runtime was even faster than an actual compute center or whatever. And they used many, many different techniques in terms of how to optimize for bandwidth, for dropouts, like people
Starting point is 00:23:15 who drop out during the training, all of these kinds of things. And when we talked to these guys, they actually told me that they have an issue with CV preprocessing as well, because NLP is kind of easy: text is already really, really condensed, so you can very easily transfer it, and then the problem becomes, in quotes, "only" model training. But for CV, you actually have a lot of data that you have to send out to these volunteers. And they kind of do it in a very smart way.
Starting point is 00:23:42 And I'm kind of hoping that I can make a dent in that, but we'll see. Fantastic, and good luck for that future research. And I think we'll end it there. Thanks so much for coming on the podcast. If you're interested in knowing more about Alex's work, the links to his paper and the relevant libraries for Presto and whatnot will be put in the show notes. And you can also connect with him on LinkedIn,
Starting point is 00:24:03 and he can be reached via email at alex.isenko@tum.de. Exactly, yeah. Thank you very much for inviting me, it's been a pleasure. See you next time. See you.
