Disseminate: The Computer Science Research Podcast - Alex Isenko | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines | #1
Episode Date: June 27, 2022
Summary: Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy. Maximizing resource utilization is becoming more challenging as the throughput of training processes increases with hardware innovations (e.g., faster GPUs, TPUs, and interconnects) and advanced parallelization techniques that yield better scalability. At the same time, the amount of training data needed to train increasingly complex models is growing. As a consequence of this development, data preprocessing and provisioning are becoming a severe bottleneck in end-to-end deep learning pipelines. In this interview Alex talks about his in-depth analysis of data preprocessing pipelines from four different machine learning domains. Additionally, he discusses a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines and extracts individual trade-offs to optimize throughput, preprocessing time, and storage consumption. Alex and his collaborators have developed an open-source profiling library that can automatically decide on a suitable preprocessing strategy to maximize throughput. By applying their generated insights to real-world use cases, an increased throughput of 3x to 13x can be obtained compared to an untuned system while keeping the pipeline functionally identical. These findings show the enormous potential of data pipeline tuning.
Questions:
0:36 - Can you explain to our listeners what is a deep learning pipeline?
1:33 - In this pipeline how does data pre-processing become a bottleneck?
5:40 - In the paper you analyse several different domains, can you go into more details about the domains and pipelines?
6:49 - What are the key insights from your analysis?
8:28 - What are the other insights?
13:23 - Your paper introduces PRESTO, the open-source profiling library, can you tell us more about that?
15:56 - How does this compare to other tools in the space?
18:46 - Who will find PRESTO useful?
20:13 - What is the most interesting, unexpected, or challenging lesson you encountered whilst working on this topic?
22:10 - What do you have planned for future research?
Links: Homepage | Paper | PRESTO
Contact Info: Email: alex.isenko@tum.de | LinkedIn
Transcript
Hello and welcome to Disseminate, the podcast bringing you overviews of the latest computer science research.
I'm your host, Jack Wardby. We're recording today from ACM SIGMOD/PODS in Philadelphia.
I'm delighted to say I'm joined by Alex Isenko, who will be talking about his paper,
Where is my training bottleneck? Hidden trade-offs in deep learning pre-processing pipelines.
Alex is a PhD student at the Technical University of Munich.
He is interested in ML and optimizations and generally making stuff faster.
Alex, thanks for joining us on the show. Yeah, thank you. Happy to be here.
Brilliant. And let's dive straight in. Can you first of all explain to our listeners
what is a deep learning pipeline?
Yeah, so a deep learning pipeline in general always starts off with a data set stored somewhere. Then we have some pre-processing steps, typically any kind of transformation you can imagine, like decoding images. Then we have the training process that ingests that data, runs with it, does forward propagation and backward propagation, and applies the gradients to the model. And then we start everything anew and do that multiple times: we iterate over the entire data set for multiple so-called epochs. And then hopefully at some point we achieve the desired accuracy, so we can finally conclude and say, well, the model is trained and it does the task we want it to do.
Fantastic. So in this pipeline, how does data pre-processing and provisioning become a bottleneck?
Yeah, so this is actually a very good question, because this is the line of thinking that got us started on this kind of research in general. About three years ago, when we initially thought about it, we saw there's already a lot of research being done in the ML community on how to make the training go faster, because that was the major bottleneck, and it still is the bottleneck for specific applications. There are, what is it, something like a hundred papers published each day in the machine learning community, so keeping up with that is very hard, as is carving out a space for yourself. And we noticed that there are many, many papers already doing this training-speed optimization, using better hardware or better software to optimize the training. So we thought, well, if this part of the pipeline is getting that much faster, and so many people are invested in it, what about the pre-processing, which actually has to provide the data to feed these models that now run so much faster? Luckily for us, unluckily for the rest of the community, we noticed, and other people of course also noticed, that this is actually a problem. If you are not a professional at designing such a pipeline, it's very easy to create a pre-processing pipeline that will not saturate your GPU, which you sometimes pay for with real money in AWS or Google Cloud resources, right? So it's wasting money in that sense.
And the funny thing is that it's actually not great for the cloud providers either, because there's an interesting trade-off in there. If we could show you that you could get worse hardware and still get the same results, the specific machine you would otherwise have bought becomes free again, and the cloud provider can sell it to another person, because you are now the person using the lower-end hardware. So it's a win-win: the client spends less money to get the same result, and the cloud provider gets an additional, better-performing VM to provision to other people. In reality they have quite a problem handing out those low-end VMs, because nobody wants them; people don't know that they could be essentially just as usable as the higher-end VMs. So, yeah, this is why we got into pre-processing in general and saw, well, this is actually a problem for many people.
And as for how it can become a bottleneck, the gist of the paper is that storage consumption is really, really important when estimating the throughput, because it affects many different things. It affects the bandwidth of the network, however you're connected between your node and wherever your storage lies, be it an S3 bucket or your local SSD or whatever. The other thing is that we typically work with throughput as a metric in samples per second, and samples per second is not really megabytes per second, right? The interesting thing about pre-processing is that the same data, the same kind of image, can sometimes be blown up and then reduced again in storage consumption, and therefore the throughput in samples per second can differ a lot. For example, decoding JPEGs increases the storage consumption by a factor of 6 to 8, so the CPU has to process that much more data, which is very inefficient at that particular step, and the throughput can drop by a lot. So, a very naive example: decoding the images and saving them raw to disk is not really a good idea. But kind of everybody knows that already. What we have shown is that these trade-offs are also there for other, non-trivial steps, like resizing and normalization and all of those things. All of these transformations have a trade-off in terms of throughput versus storage consumption. And depending on the data set, depending on the hardware you're running on, depending on the amount of DRAM you have, all of these can change and affect your pre-processing pipeline negatively or positively. And this is how these bottlenecks actually show up, just because every step has such a trade-off.
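To make the blow-up Alex describes concrete, here is a minimal sketch (not from the episode or the paper) that compares an encoded JPEG with its decoded RGB tensor in TensorFlow; the file path is a placeholder and the exact factor depends on the image.

```python
# Minimal sketch: measure how much a single image grows when decoded.
# "example.jpg" is a placeholder path; the blow-up factor depends on the image.
import tensorflow as tf

jpeg_bytes = tf.io.read_file("example.jpg")           # encoded bytes on disk
decoded = tf.io.decode_jpeg(jpeg_bytes, channels=3)   # uint8 RGB tensor

encoded_mb = tf.strings.length(jpeg_bytes).numpy() / 1e6
decoded_mb = decoded.numpy().nbytes / 1e6

print(f"encoded:  {encoded_mb:.2f} MB")
print(f"decoded:  {decoded_mb:.2f} MB")
print(f"blow-up:  {decoded_mb / encoded_mb:.1f}x")    # often around 6-8x for JPEGs
```

Measured in samples per second the pipeline looks the same either way; measured in megabytes per second the decoded variant has to move several times more data per sample.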
And that leads naturally into the next question.
In your paper, you analyzed several different pipelines
from several different domains.
Can you go into a little bit more detail about the specific domains you looked at and the pipelines that were there?
Yeah, of course.
So we had seven representative pipelines. We basically tried to scrape the papers from the last ten years and see which pipelines were prominent, which ones were used a lot, to get something representative going. And we focused on four different domains, namely computer vision, natural language processing, automatic speech recognition, so that's voice, and one kind of esoteric one called NILM, non-intrusive load monitoring. We used that because people from our chair are doing that in their research as well, so they kind of pushed me towards it: well, yeah, let's just take a look at that as well. It's an interesting topic where you measure electrical data and try to classify appliances from it. So we also covered that, and this is kind of the signal data that we use.
Okay, cool. So what were the key insights from your analysis?
Yeah, so actually the key insights, there were so many,
we definitely won't have enough time.
Okay, give me the best ones.
In this talk, yeah.
Yeah, so one of the main things, to be honest, if there were one thing I'd really like everybody to take away from here, it would be the following: fully offline pre-processing is not necessarily the best strategy. To explain that, we have to take a small step back and explain what fully offline means. By fully offline I mean that you pre-process the entire data set before you start the training. This is kind of the status quo and the default way to do things, because it's easy and it makes sense: I already talked about epochs, right, so we look at the data set multiple times. And if I now told you that you have to pre-process the entire data set again and again, every epoch, it seems obvious that you should cache or somehow save those computations. What we have shown, unfortunately, is that this is not always the best strategy, especially for image problems. For image pipelines we tested, sorry, one image pipeline with three different data sets, and it didn't even matter whether the images came from ImageNet or some high-resolution data set. In all situations we could find a strategy that was better than fully offline pre-processing. And that had, again, to do with the storage consumption, and then of course the throughput. We actually got a throughput increase of three times, going from about 600 samples per second to 1,800 samples per second without changing the CPU hardware, just by changing this materialization strategy which I talked about.
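For readers who want to see what "fully offline" versus on-the-fly pre-processing looks like in code, here is a minimal tf.data sketch (my own illustration, not the paper's exact setup); the glob and cache path are placeholders.

```python
# Minimal sketch: two materialization strategies for the same image pipeline.
# "images/*.jpg" and the cache path are placeholders.
import tensorflow as tf

files = tf.data.Dataset.list_files("images/*.jpg")

def load_and_decode(path):
    return tf.io.decode_jpeg(tf.io.read_file(path), channels=3)

# Strategy A: "fully offline" -- decode once and materialize the (much larger)
# decoded tensors to disk, then re-read them every epoch.
offline = files.map(load_and_decode).cache("/tmp/decoded_cache")

# Strategy B: keep the small encoded JPEGs on disk and pay the decode cost
# every epoch; for the image pipelines discussed here this was often faster
# end to end, because far fewer bytes have to be read per sample.
online = (files
          .map(load_and_decode, num_parallel_calls=tf.data.AUTOTUNE)
          .prefetch(tf.data.AUTOTUNE))
```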
What are the other insights?
Yeah, so other insights. I think one I'm really eager to talk about as well is about compression, actually, because for people who already know about this issue of storage consumption versus throughput, compression is kind of the next logical step. Compression focuses on exactly this particular trade-off: you have some CPU compute power left over, in some sense, you spend it compressing and decompressing data, and you save storage consumption. So it sits exactly in that trade-off, and therefore we also analyzed it. And funnily enough, there were two quite interesting insights. The first one was that compression really works, sometimes a lot: it can save storage consumption and increase the throughput at the same time. This is interesting, because typically, as I said, it's a trade-off, right? You trade storage consumption for CPU compute, or for throughput. But we noticed that for some particular strategies it's actually beneficial on both axes; you can get both out of it. So you have a win-win situation, just because of the effect the storage consumption has: the smaller the storage consumption gets, the bigger the effect on throughput. And therefore, even while you still have to decompress the entire data set on every single read in every epoch, you still get better throughput while keeping a small storage consumption. This was kind of an absurd finding to me. I really did not think that something like this was going to happen, but it actually did.
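As a rough illustration of that compression trade-off, here is a sketch (my own, with placeholder names and a placeholder dataset) of materializing preprocessed samples as GZIP-compressed TFRecords and streaming them back with on-the-fly decompression, which is one way to trade spare CPU cycles for smaller files.

```python
# Minimal sketch: write preprocessed samples as GZIP-compressed TFRecords and
# read them back; names and the placeholder dataset are illustrative only.
import tensorflow as tf

# Placeholder for your decoded/preprocessed samples.
decoded_ds = [tf.zeros([224, 224, 3], dtype=tf.uint8)]

def serialize(image):
    feature = {"image": tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[tf.io.serialize_tensor(image).numpy()]))}
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()

options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("train.tfrecord.gz", options=options) as writer:
    for image in decoded_ds:
        writer.write(serialize(image))

# Each record is decompressed as it streams, so smaller files can mean less
# I/O per sample even though the CPU does some extra work.
ds = tf.data.TFRecordDataset("train.tfrecord.gz", compression_type="GZIP")
```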
Fascinating insight, yes.
Great result.
Do we have time for one more?
Yeah.
Yeah, so the other thing, which is quite cool about compression in general: we used a data set that was originally provided to calibrate cameras. It was kind of a smallish data set, but we found it interesting because it was provided both as PNGs and as JPEGs, and it was also quite high resolution. We just said, well, let's run it through the same pipeline we already have and see what kind of difference a bigger resolution makes, right? But what we actually found is that the difference between JPEGs and PNGs is quite big. It was a very funny thing. The PNG data set, if I'm not totally incorrect, was about 60 gigabytes, and the JPEG version, the same images, took about two gigabytes or so, ballpark. That's roughly a factor of 30, so obviously you should always keep the JPEGs, why would you ever use the PNGs? And the funny thing about that is that the pipeline of course decodes the images anyway, because we need them as some kind of RGB matrix, and JPEG decoding is comparatively slow, which is a cost you don't pay with PNGs. So that's kind of a first trade-off, but it still didn't make a really big difference in throughput, because, well, remember, big storage consumption is kind of bad, so it basically evened out in those terms.
But the interesting thing about compression: we tried GZIP and ZLIB compression. These are kind of the very default ways of doing compression on any kind of data, and they're already implemented in TensorFlow. And when we applied them after the decoding step, we found out that the PNG data set, so the decoded, in quotes, PNGs, compressed about four times better than the decoded JPEGs. The reason is that even when you decode the JPEGs and get them into an RGB matrix, they still carry the artifacts they had from the encoding before, and the default GZIP and ZLIB compression simply didn't work that well on those images with artifacts. So for the PNG data set we got an increase in throughput and a decrease in storage consumption, by a factor of four. At the end, the PNGs, in the particular strategy which performed best, the one we would recommend people to use, took about 280 megabytes, while the JPEG data set, for exactly the same strategy, the best strategy we had to offer, took about 1.1 gigabytes. So it was about four times bigger, which is kind of non-trivial to think about: keeping a data set as JPEGs can actually end up costing you more storage for that particular task. A totally funny, interesting insight, and we found it just by accident.
So it was pretty cool, too.
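The artifact effect is easy to check for yourself; here is a small sketch (not from the paper) that decodes one image from each format, ideally the same scene, and compares how well zlib compresses the raw pixel buffers. The file names are placeholders.

```python
# Minimal sketch: decoded JPEGs keep their blocky encoding artifacts, which
# generic codecs like zlib/GZIP tend to compress less well than decoded PNGs.
import zlib
import tensorflow as tf

png_pixels = tf.io.decode_png(tf.io.read_file("frame.png"), channels=3).numpy()
jpg_pixels = tf.io.decode_jpeg(tf.io.read_file("frame.jpg"), channels=3).numpy()

png_ratio = png_pixels.nbytes / len(zlib.compress(png_pixels.tobytes()))
jpg_ratio = jpg_pixels.nbytes / len(zlib.compress(jpg_pixels.tobytes()))

print(f"decoded PNG  compresses {png_ratio:.1f}x")
print(f"decoded JPEG compresses {jpg_ratio:.1f}x")   # typically noticeably lower
```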
Yeah, well, often the best things are discovered by accident, right?
Exactly.
That's awesome.
And for my next question, in the paper, you also introduce Presto,
the open source profiling library.
Can you tell us a bit more about that?
Yeah, sure.
So Presto is actually a very, very simple program, I would say. We're using the so-called tf.data API, the tf.data.Dataset API, which is kind of the default way to create a pre-processing pipeline in TensorFlow right now. What we are basically doing is accessing all of the steps that you create on this dataset. You can do a map, right? A map simply means you apply a certain transformation function over a collection of elements, which is typically the data set. And what we do is create a split point between every step. A split point serializes and deserializes the data set, so it basically does this materialization I've been talking about. Then we do a short profiling run and say, well, of all of these particular strategies, this is the one which performs the best, and you can optimize it automatically with a cost function if you want to. It's very easy to try out, to be honest. And if I'm going to be really honest, it's also very easy to implement yourself in your particular framework, be it PyTorch or even TensorFlow, if you don't want to use our code. Just take a look; it's very easy to understand, in my opinion at least.
And the entire optimization is basically about creating insights about your particular pipeline and making visible what kind of trade-offs you're looking at. Because sometimes it's really not easy to understand how certain steps affect your hardware, right? How much memory a step actually takes, how much bigger the images get when you decode them from JPEG into an RGB matrix. I didn't know before. And those kinds of things, we found out, are actually very important to know for a fact, for example when deciding what kind of machine to schedule. If I know that my data set takes up two gigabytes on disk but decoded is actually 80, maybe I should provision a machine that has 80 gigabytes of RAM, right? And this is what we want every listener, hopefully, to take away: there are some hidden trade-offs, as in the title of the paper, in probably everybody's pipeline, and you should definitely profile it, because there are many opportunities to keep using the same hardware without paying additional money, or to reschedule it or whatever, because sometimes just picking the better strategy is the better way to speed it up.
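To make the split-point idea concrete, here is a rough conceptual sketch (this is not PRESTO's actual API, just an illustration with made-up step functions and placeholder paths) that materializes the pipeline prefix up to each candidate split and measures the resulting throughput.

```python
# Conceptual sketch of split-point profiling (not PRESTO's real interface):
# materialize everything before the split once, run everything after it
# online, and keep whichever split gives the highest samples/second.
import time
import tensorflow as tf

def decode_fn(path):   return tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
def resize_fn(img):    return tf.image.resize(img, [224, 224])
def normalize_fn(img): return tf.cast(img, tf.float32) / 255.0

steps = [decode_fn, resize_fn, normalize_fn]
files = tf.data.Dataset.list_files("images/*.jpg")   # placeholder glob
n_profile = 512                                       # samples used for profiling

def samples_per_second(ds):
    start = time.perf_counter()
    count = sum(1 for _ in ds)
    return count / (time.perf_counter() - start)

results = {}
for split in range(len(steps) + 1):
    ds = files.take(n_profile)
    for fn in steps[:split]:                 # the "offline" prefix
        ds = ds.map(fn)
    ds = ds.cache()                          # materialize the prefix in memory
    for _ in ds:                             # warm the cache once
        pass
    for fn in steps[split:]:                 # the "online" suffix, run per epoch
        ds = ds.map(fn, num_parallel_calls=tf.data.AUTOTUNE)
    results[split] = samples_per_second(ds.prefetch(tf.data.AUTOTUNE))

best = max(results, key=results.get)
print(f"best split: after step {best} ({results[best]:.0f} samples/s)")
```

A real profiler would also account for storage consumption and could plug in a cost function, as described above, but the core idea is this kind of measured comparison between materialization strategies.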
So how does this compare to other tools in the space?
Yeah, so what we focused on most was generating these insights, actually seeing what the common caveats and common pitfalls are, the ones some people run into and which we ran into ourselves while doing this research. So we focused mostly on that, and this kind of analysis had not really been done before, I can hopefully say. But there is actually a lot of related work in that particular domain as well. Many people publishing papers typically focus on automatic decision finding. One in particular I would like to mention is Plumber. It was published, I think, in November last year, and I actually had some communication with that researcher as well, which was quite cool. They also use the tf.data framework, and what they do is also quite interesting: they also focus on throughput optimizations and those kinds of trade-offs, and on doing all of that automatically. But the key thing they focus on is rewriting the execution graph. You can imagine, in our case, let's only talk about the pre-processing, it's really just a line of transformations stacked one after another, right? What they focus on is using this degenerate graph structure and saying, well, couldn't we simply rearrange things automatically, in a kind of semi-automatic way, to make it better or faster? And this is what they do, which is a cool thing. But for the exact citation, please feel free to look at my paper. We cite many, many of those tools, and feel free to look at them, because many of them are very, very good.
Yeah, we can also link it in the show notes.
Yeah, of course, of course, yeah.
It would be really cool. And maybe another general piece of related work to note would be the FFCV library right now. I think I saw a Twitter post in February that it's coming out as a new data loader specialized for computer vision purposes. And it was kind of funny for me to note that they actually include many of our insights. Of course, our paper wasn't out back then yet, so they got there on their own. It was cool to read that some of the things we found out were actually problems had already been addressed in their library without them ever reading my paper; they came to the same conclusions by themselves. So hopefully we did good research.
Yeah, yeah.
So it was pretty cool for me to read about them, but I'm still waiting for the paper. I think it's not out yet; I think they published the library itself, and they already have a pretty cool website going on, but I haven't read the paper. Maybe I just haven't checked it out.
Who will find these results useful? Who will find the tool useful?
Yeah, so to be honest, I hope everybody who ever has anything to do with pre-processing pipelines in general. That's the basic answer, but there is a more high-level answer too, because some people I talked to at the conference here have shown me a bit of a bigger picture of what we actually did. For example, they told me that this trade-off of throughput versus storage consumption is very, very generic. It's not a particular problem of pre-processing strategies; it shows up just about everywhere you process data, be it SQL queries, be it streaming, whatever. All of those people are affected by this trade-off. Take migration, for example: you want to migrate a model from one GPU to another GPU. When do you migrate it? When it takes the least amount of storage, because that's when it's going to be fastest, right? And this is exactly the same trade-off, where you wait and try to see at what point in time the data is as small as possible so you can actually do something with it, either read it faster from storage or migrate it from one VRAM to another VRAM. So that's the general takeaway, I would say: people should just be aware of this storage consumption versus throughput trade-off, and that really affects everybody working in computer science, I would even say.
This actually leads quite nicely into my next question, which is: what is the most interesting, unexpected, or challenging lesson that you have learned while working on this topic?
Oh, that's hard to answer, to be honest. Most challenging?
I think it was actually having the rigor to perform all of these experiments, to be honest. It's very easy to say, well, I've already found out some things, hopefully enough to write a paper, and then the reviewers get back to you and say, well, yeah, but haven't you thought about that thing and that thing? And I was like, oh no, that's so out of scope. For example, I got a reviewer who told me to think about multi-tenancy: how can you share pre-processing pipelines? And I was like, yeah, sure, that's a very good idea, a very good idea for the next paper. But really not now. We have, what is it, 13 pages, and for those 13 pages we had already been working for about two years. We have so much material; we already kicked out, I think, three entire chapters which didn't fit the paper. So it was really hard to follow up on all of that research. I think that was honestly the most challenging thing: having the rigor to perform all of these experiments, to focus on these main insights, and to try to make it useful to everybody. Because this is a bit of a pet peeve of mine: there is a lot of research out there that does very good work, of course, but it's very hard to make it accessible to, I would say, everyday people, and by everyday people I mean people who are working in that particular area as well. So we really hope that with these hopefully easy-to-understand insights we make a very first step, so that everybody can avoid running into the same problems that we had. So, yeah.
Fantastic. And one last question from me:
What do you have planned for future research?
Yeah, so for the future, actually, multi-tenancy. We thought about that a lot, but unfortunately multi-tenancy is a very hard problem; everybody can try this topic for themselves, it's very hard to solve, for specific tasks of course. But right now I'm actually talking with the people from Hivemind. It's a very cool, not really that new, project, and they are working on democratizing access to deep learning models. The cool thing about them is that they are basically trying to make it possible to simulate cluster-level GPU environments with volunteer hardware. You can think of it like SETI@home, if you've ever heard of it, all of these kinds of volunteer computing projects. There are actually many for machine learning as well, but from what I know, these people were the first to actually make it work, and they made it work faster than a comparable GPU cluster. They had something like 40 volunteers, and I think they trained a language model on the Bengali language, if I'm not entirely incorrect. And when they compared it, the actual runtime was even faster than an actual compute center. They used many different techniques for optimizing for bandwidth, for dropouts, meaning people dropping out during training, all of these kinds of things. And when we talked to these guys, they told me that they have an issue with CV pre-processing as well, because NLP is kind of easy: text is already really condensed, so you can transfer it easily, and then the problem becomes, in quotes, "only" the model training. But for CV you actually have a lot of data that you have to send out to these volunteers, and they have to do that in a very smart way. I'm kind of hoping that I can make a dent in that, but we'll see.
Fantastic, and good luck for that future research.
And I think we'll end it there.
Thanks so much for coming on the podcast.
If you're interested in knowing more about Alex's work,
the links to his paper and the relevant libraries for Presto and whatnot will be put in the show notes.
And you can also connect with him on LinkedIn,
and he can be reached via email at alex.isenko@tum.de.
Exactly, yeah. Thank you very much for inviting me. It's been a pleasure.
See you next time.
See you.