Storage Developer Conference - #154: Amazon FSx For Lustre Deep Dive and its importance in Machine Learning
Episode Date: October 6, 2021
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 154.
Good morning, good afternoon, and good evening, depending on where you have joined from. Thank you so much for your time attending this session on Amazon FSx for Lustre. My name is Suman Devnath. I'm a Principal Developer Advocate with Amazon Web Services. Today, we are going to deep dive into Amazon FSx for Lustre, its use cases, and a few of its performance advantages.
So in the next 30 minutes, we are going to talk about what Amazon FSx for Lustre is and how you can use it to run your workloads faster and cheaper. We will look into its integration with Amazon S3, which is our object-based storage. We will then dive into the data processing options that you have with Amazon FSx for Lustre. And we'll end this discussion with a few performance stats you should be aware of, because performance is the most powerful workhorse behind Amazon FSx for Lustre.
So let's start by thinking about how you can run your workloads faster and cheaper with Amazon FSx for Lustre. Before we get into FSx for Lustre, let's see why our customers want to run compute workloads on AWS. What are the advantages, what are the benefits you get?
If you look at it, we generally see five basic benefits that our customers, or any user, get. First is elasticity: you get virtually unlimited infrastructure that can scale, and the agility to deploy your application at scale, which is not possible on premises. Then comes functionality: we have a rich set of instance types, various types of instances based on your performance and price needs, and a lot of automation, orchestration, networking, and virtualization solutions that can cater to your application's requirements. The next is agility. This is one of the most important things in today's world, where organizations want to try different experiments, fail fast, and quickly recover when something they thought would work did not. You need to fail fast, recover quickly, move forward, and reduce the time to result. You get that agility in AWS: you can build your infrastructure when you need it, and once you are done with it, you can just destroy it. There is no legacy storage, and no budget permanently dedicated to on-premises infrastructure that you have to live with even if you don't use it.
Then comes the global infrastructure. At Amazon Web Services, we have infrastructure and resources all across the globe, and you pick the region in which to deploy your applications or services. So you can go live in any region across the globe for your business in no time. Depending on your need, you may like to host your infrastructure and applications in one or more of the Availability Zones or Regions that we have. And it is cost optimized. This is one of the most important things: in AWS, whatever you do, you pay only for what you use. If you don't use something, you are not charged for it. It is a pure consumption model: you pay only for what you use.
Now, when we look at compute workloads on AWS, the typical data processing setup, with the data stored in S3, looks something like this: you have S3 in the middle and various sources of data coming in. It might be IoT, it might be sensor data, it might be data from your vehicles, and so on. Basically, all this ingested input data is of different types, so it is completely unstructured. It can be images, CSV files, text, anything. And you use S3 to land all that data.
Then, if you want to run some compute on that data, what you essentially do is create some EC2 instances or some compute service, make sure those compute instances get access to S3, and process that particular data set with lots of compute, lots of EC2 instances. You might then want to write or checkpoint the results back to S3, if your application demands it. And once you are done, you just destroy those compute resources. So this is the typical workflow: you store the data in S3 and use some compute instances to process it. Once that processing is done (it can be a machine learning training job running on SageMaker, or anything else), you write the checkpoint back to S3 and destroy the compute instances.
Now, there are various options for processing that data. The first is to have the EC2 instances use EBS volumes or instance storage, and to keep your S3 data on those volumes. That means you need to copy the data from S3 to either the EBS volume or the instance store of each of those EC2 instances. The next is a self-managed file system, where you take a bunch of EC2 instances, load all the data from S3, and build a self-managed file system that your clients access over the network. This is a real headache from a customer standpoint, because they have to manage that file system on their own. And then there is direct S3 access, where you access the objects in the S3 bucket over HTTP with PUT and GET, basically through the APIs.
Now let's get into each of these three options and see the disadvantages of each. The first is data processing using EBS and instance storage. This is a working model, and a lot of our customers do use it: you take the S3 bucket, make sure you have enough instances, and copy the data, or a subset of the data, from S3 to each instance's EBS volume or instance storage. So you first have to decide which part of the data goes to instance number one, which part goes to instance number two, and so on. While this is a working model, there are a lot of challenges involved. First, you need to plan the active working set beforehand: you need to know which data to move in and out ahead of time. Next, you need to shard your data set: you need to know which data will be accessed by which instance, and have that mapping beforehand. And then there is a data duplication problem: it might happen that multiple instances need to access the same data, and to do so, you have to save the same data in each instance's EBS volume or instance storage. What ends up happening is that multiple instances hold the same data in their respective EBS volumes or instance storage.
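To make that concrete, here is a minimal sketch of the manual sharding this option implies, assuming hypothetical bucket and path names; each instance pulls its own pre-assigned slice of the data onto its EBS volume with the AWS CLI:

    # On instance 1: copy its pre-assigned shard from S3 to local EBS storage.
    aws s3 sync s3://my-training-data/shard-1/ /data/shard-1/

    # On instance 2: a different, manually planned shard.
    aws s3 sync s3://my-training-data/shard-2/ /data/shard-2/

    # Any file needed by several instances has to be copied to each of them --
    # the data duplication problem just described.
    aws s3 cp s3://my-training-data/common/labels.csv /data/labels.csv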
The next option is data processing with a self-managed file system. Here you create a distributed file system backed by the data in S3, and your clients access it. This is very, very complex to manage and maintain, and it has a lot of dependencies on how you have architected your file system. It can also take a toll on performance, because there are a lot of moving parts: there is connectivity from your file servers to S3, and from your file servers to the end users or clients. It would be a very cumbersome job for any customer to maintain such a distributed file system. Now, if you look at these two options, one thing is common: you need to track the changes you are making. It's not only about accessing the data from S3; it's also about how you maintain or track the changes you make to the data, because you have to write that data back to S3. So you need to write your own scripting logic, or put some automation in place, that periodically writes the data back to S3.
The last option you have is processing the data without any intermediate storage. You have your EC2 instances and your S3 bucket, and you access the bucket directly. This works pretty well, and the way you do it is through the HTTP or RESTful APIs: GET, PUT, and so on. While this is very good, it is not appropriate for all kinds of applications. It is good for applications that need high throughput, but it is not at all good for latency-sensitive applications. One of the most important drawbacks is that this is not POSIX compliant: you are not accessing the storage through a POSIX interface, you are accessing it through the object interface, with PUT and GET, as we just discussed. And it is not good for applications that need to access the same data repeatedly, because when you access the same data multiple times, you send multiple GET requests, and you are charged for each of them, which is not at all optimal for any customer.
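For reference, "direct access" here just means the S3 object API. A small sketch with the AWS CLI (bucket and key names are made up): every read is its own GET request, so repeated reads of the same object mean repeated requests and repeated charges.

    # Read an object: one GET request per read, every time.
    aws s3api get-object --bucket my-training-data --key images/file1.jpg /tmp/file1.jpg

    # Write a result back: a PUT request.
    aws s3api put-object --bucket my-training-data --key results/out.csv --body /tmp/out.csv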
To overcome all of this, we have Amazon FSx for Lustre. The way it works is that you integrate it with S3: you create an FSx for Lustre file system on AWS and link that file system to your S3 bucket. Your data stays in S3, but it is loaded onto the FSx for Lustre file system only when a read request comes in from a client. So your data is still in S3, and you just create an FSx for Lustre file system on AWS; at that point, there is no copying of data. Then you mount that file system on your EC2 instances, and the moment you do, you can see all the objects as folders and files on those instances. But at that time, the data is still in S3; nothing is saved in FSx. It only copies the metadata, and it copies the data only when a client requests a read. That's how it works, and we are going to see that lifecycle in a moment.
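As a rough sketch of that setup with the AWS CLI (the bucket name, subnet ID, and capacity below are placeholders, and the deployment type is just one of the available choices), linking the file system to a bucket comes down to the import and export paths:

    aws fsx create-file-system \
        --file-system-type LUSTRE \
        --storage-capacity 1200 \
        --subnet-ids subnet-0123456789abcdef0 \
        --lustre-configuration \
            DeploymentType=SCRATCH_2,ImportPath=s3://my-training-data,ExportPath=s3://my-training-data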
Another use of FSx for Lustre is bursting for on-premises data repositories. A lot of times you have lots of data on premises and need to run some compute on it, but you don't have the compute resources on premises, so you may want to use compute resources on AWS. What you can do is create a bunch of EC2 instances, or other compute resources, on AWS and create an FSx for Lustre file system, then connect that file system to your on-premises storage via AWS VPN or Direct Connect (and there are various other options). With this, FSx for Lustre becomes an intermediary between your virtual machines on AWS and your data storage on premises, and it can handle the data access while you do the computation. Once the computation is over, you can just delete the FSx for Lustre file system and get rid of the compute resources. This file system supports all the popular Linux distributions, like Amazon Linux 2, RHEL, CentOS, Ubuntu, and SUSE, so you can use almost any of these clients to access the data on FSx for Lustre.
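On any of those distributions, the client install and mount are only a couple of commands. A sketch for Amazon Linux 2 (the file system DNS name and mount name are placeholders; the exact package name can vary with the kernel version):

    # Install the Lustre client on Amazon Linux 2.
    sudo amazon-linux-extras install -y lustre

    # Create a mount point and mount the FSx for Lustre file system.
    sudo mkdir -p /fsx
    sudo mount -t lustre -o relatime,flock \
        fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com@tcp:/mountname /fsx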
Now let's look at the integration with Amazon S3 in a bit more detail. As we just discussed, when you create an Amazon FSx for Lustre file system, you point the file system at an S3 bucket: you tell the file system, here is the S3 bucket, connect to it. At that point, the file system doesn't copy any data from S3. You can mount the file system from your EC2 instances and see all the files and directories stored in that particular S3 bucket, but nothing is yet saved in the FSx file system. The files are moved in real time, only when they are accessed: the file system talks to S3 only when a read request comes from one of the clients, the EC2 machines. This is what we call lazy loading. Let's take an example: you have an S3 bucket, and you have created an FSx for Lustre file system pointing to that bucket. The moment you create it, it copies the metadata from S3. So if you mount that file system on any of the EC2 instances and just run ls, you can see all the files saved in the S3 bucket, but at that point in time, nothing has been copied from S3 to the file system. Now, when one of the EC2 instances first tries to access, let's say, file1.txt, the FSx for Lustre file system goes back to S3 and copies that file into the file system. Later, if the same client tries to access it, it is served from the file system. Not only that: now that the data is in the file system, if any other client tries to access it, it is served from the file system itself; it does not go back to S3. That's why we call this lazy loading, because it doesn't load the data up front when you create the file system.
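A simple way to observe that lazy loading from a client, assuming the file system is mounted at /fsx and the file name is illustrative:

    ls /fsx                                # metadata only: instant, nothing copied yet
    time cat /fsx/file1.txt > /dev/null    # first read: FSx fetches the object from S3
    time cat /fsx/file1.txt > /dev/null    # repeat read: served from FSx, much faster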
Now, there are a few HSM commands you can use to control the data movement. The first is HSM archive, which copies a file from FSx for Lustre to S3. Let's say you made some changes to a file or a directory and want that change to go back to S3: you use HSM archive for that. You can also use HSM release, which frees up the space held by a file once it has been archived. What would typically happen is that you make some changes on your EC2 instance while doing some computational work, then you want to save that data back to S3, so you do an HSM archive. Once that is done, you don't want to waste storage space on the FSx for Lustre file system, so you fire the HSM release command and it frees up that space. And then you can use HSM restore, which brings the data back to FSx for Lustre from S3. One key thing to remember: when you do HSM release, it only frees up the space; it keeps the metadata. So when you use HSM restore, it can go back to S3 and bring your data back, because it has the metadata.
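On the client, these operations map to the Lustre lfs utility. A sketch, assuming the file system is mounted at /fsx:

    sudo lfs hsm_archive /fsx/file1.txt    # copy the (changed) file back to S3
    sudo lfs hsm_release /fsx/file1.txt    # free the space in FSx; metadata is kept
    sudo lfs hsm_restore /fsx/file1.txt    # bring the data back from S3 into FSx
    lfs hsm_state /fsx/file1.txt           # check the current state of the file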
Now let's see how it preserves the POSIX metadata across FSx and S3. The first time you create an FSx for Lustre file system, it copies the metadata, including the POSIX permissions stored in S3. Later, when files are accessed through the file system, it reads them from S3. And once you have made changes to that data on your compute, on your EC2 instances, you can call the data repository task API, and it exports all those changes back to S3. The files are stored with their POSIX permissions on the way from FSx for Lustre to S3, and later, when you read the data from S3 back into Lustre, it will have the updated POSIX permissions and so on, based on the changes you made.
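That export can be kicked off as a single bulk operation. A sketch with the AWS CLI (the file system ID and path are placeholders):

    aws fsx create-data-repository-task \
        --file-system-id fs-0123456789abcdef0 \
        --type EXPORT_TO_REPOSITORY \
        --paths training-output \
        --report Enabled=false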
Now, when you release, it is exactly as we just discussed: you use HSM release, and it frees up the space in your FSx for Lustre file system while still keeping the metadata saved. Now, this is one of our most predominant use cases: the integration with SageMaker.
Amazon SageMaker is our service for machine learning operations: you can build, train, and deploy your machine learning models on Amazon SageMaker. Typically, any data science or machine learning engineer will save all the data in S3, and that is the data used for training a particular model. Now imagine you are training for a particular problem using SageMaker, and your data is in S3. You might have to copy the whole data set from S3 onto the instance before you start the training. And because machine learning is all about experiments, you keep changing hyperparameters and retraining, so you might have to read the data multiple times from S3, which takes a lot of time and also costs you more money, because you are reading the data from S3 multiple times. To avoid that, you can put FSx for Lustre in the middle and read your data from FSx for Lustre. SageMaker then does not interact with S3 directly; it interacts only with FSx for Lustre, which keeps the data in the file system. You don't have to go back to S3 multiple times to run the same training job with different hyperparameters. And not only that, it is much faster, because with the FSx file system in the middle you can get gigabytes per second of throughput in your training lifecycle.
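In a SageMaker training job, this shows up as a file-system data source in place of an S3 one. A sketch of the input channel you would pass to create-training-job (the IDs and paths are placeholders; the rest of the job definition is unchanged):

    [{
      "ChannelName": "training",
      "DataSource": {
        "FileSystemDataSource": {
          "FileSystemId": "fs-0123456789abcdef0",
          "FileSystemType": "FSxLustre",
          "FileSystemAccessMode": "ro",
          "DirectoryPath": "/mountname/training-data"
        }
      }
    }]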
Now let's look at the data processing options you have. This is the typical workflow, the typical deployment model: you save your data in S3, create an FSx file system, and link the file system to the S3 bucket. At any point, you can use the Lustre commands to write the changes back to S3, and once you are done with the compute on that data, you can always delete the file system. Now, you get two choices when you create an FSx for Lustre file system: one is scratch, the other is persistent. Scratch, as the name suggests, is for short-term processing. There is no high availability; you just create a scratch FSx for Lustre file system, which keeps a single copy of the data, and once your workload, your computation, is done, you just delete it. If you want to use the file system long term, for processing you will need for months, years, or beyond, then you might want to go with the persistent deployment, where you get HA file servers and the data is replicated across the file servers.
Now let's talk about performance. This is from one of our customers, who are mostly into machine learning and MRI image processing. What they have done is that they now use FSx for Lustre for training their models. Previously they were using S3 directly, but now they have FSx for Lustre in the middle, so SageMaker is not talking to S3 directly. What they have seen is that their ML-based workflow time was reduced by 20 times. It reduced the time to train and deploy their models drastically, improving both performance and cost.
Now, here are the numbers to keep in mind. This is the scratch file system performance: with one terabyte of capacity, you get a base throughput of 200 megabytes per second, and it can burst up to 400 megabytes per second. If you increase the storage capacity, your throughput increases accordingly. The interesting thing here is that performance is directly proportional to the storage capacity: the more capacity you have, the bigger your file system is, the better the performance you will get. Now, by default, FSx for Lustre will be good for you, but in case you want to do some performance tuning or optimization, there are some best practices. One of them is to explore striping, and we are going to see what striping is. You can also take advantage of a bigger I/O size, because with a bigger I/O size your throughput will technically increase; that is common to any file system. And you need to make sure that the clients, the EC2 instances you are using, have a good configuration, with enough memory, CPU, and network bandwidth, so they can make the best use of the FSx for Lustre file system.
Now, what is striping and why do we use it? Striping is one of the very important things, because it shards a large file into small chunks, and those chunks can be saved across multiple file servers. When you read the file back, you get the maximum amount of parallelization, so striping can drastically improve your throughput. Striping can be done at the directory level or at the file level, so it can be per file or per directory; if you set it per directory, the parameters are inherited by all the files inside that directory. So what is striping about? Let's take an example: say you have a file of 7 MB and you have set a stripe count of 3; that means the file will be spread across three different disks. And if your stripe size is 1 MB, the 7 MB file will be chopped into seven small 1 MB chunks, and those chunks will be distributed across the three disks. So with a stripe count of three, your file is saved on three different disks. These are the parameters, or variables, to keep in mind, and you may like to use them while configuring the file system.
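Those parameters are set with the lfs utility. A sketch that reproduces the example above (the directory path is illustrative):

    # Stripe every new file in this directory across 3 disks (OSTs) in 1 MB chunks:
    # a 7 MB file becomes seven 1 MB chunks spread over three disks.
    lfs setstripe -c 3 -S 1M /fsx/training-data

    # Inspect the resulting layout of a file or directory.
    lfs getstripe /fsx/training-data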
There is also an interesting configuration you can set, called the import file chunk size. Say you have files of different sizes, which will be the case in most of your use cases. With the import file chunk size, you can pick the file size that is most dominant in your data set; take that most dominant file size, divide it by the number of disks, and use the result as your chunk size. This is something you may like to explore if a particular file size dominates your data store.
Now, these are the regions where you can use the Amazon FSx for Lustre service, and we keep adding support for the service across more regions. You can make use of this file system almost everywhere across the globe.
So now let's spend a couple of minutes where I'll show you in the console how you can create an FSx for Lustre file system. I have already created a file system, FSx for Lustre, pointing to an S3 bucket, but I'll still show you how you create one. This is a wizard, and you select FSx for Lustre; we also have another service called FSx for Windows File Server, which is for Windows clients, while this one is for Linux. When you click Next, you can give the file system a name, let's say "my file system". You can select either persistent or scratch; we can use scratch. Then you have to give the size of the file system; the minimum is 1.2 terabytes.
So let's go with that. Then you pick the right security group, VPC, subnet, and so on; these are very common across most of the services we have. The important thing is here: you can import your S3 bucket by selecting the option called "Import data from S3" and giving your S3 bucket name. Once this is done, you just click Next and create it. I'm not going to create it now, but this is how you do it. We already have one FSx for Lustre file system, and if you click on it, you see it is pointing to an S3 bucket. Let me show you that bucket: the bucket name is sniea-sdc-2020; it should be snia, but that was just a typo in the name. Now let's go to that S3 bucket and search for it, and we see that inside this bucket we have two different files; one is a zip file, but those are the two files. Now let's go to EC2. I have already created two EC2 instances and they are running, so we can log into one of them and try to mount the file system. We have server 1 and server 2, and I have already logged into those servers. And if you look at it, that particular file system is not yet mounted.
So what we can do first is create a directory. OK, it says the directory is already there, but nothing is inside it yet. Now we can connect to the file system. To do that, let's go to the file system in the console, get inside, and click on Attach. When you do the attach, you see the same files that we have in the S3 bucket, and now you can do anything with this data. But at this point, the data is not yet copied to the file system, because it has not yet been read.
The next thing we can do is try to write something to this file share. Let's create a text file. You can see the test file is there now, but if you go to S3 and refresh, you will see that the file is still not there, and the reason is that we have not archived it. So let's archive it: let's go to FSx, where we have this file, file 1, and archive it using the HSM archive command. If we now go back to S3 and refresh, we see that the file has shown up in S3. So this is how you can use the different HSM commands to process your data on your compute instances and write the changes back to S3.
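Put together, the client side of this demo boils down to a handful of commands (the DNS name and mount name below are placeholders; the real values come from the console's Attach dialog):

    sudo mkdir -p /fsx                                    # mount point
    sudo mount -t lustre \
        fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com@tcp:/mountname /fsx
    ls /fsx                                               # the S3 objects appear as files
    touch /fsx/test.txt                                   # new file: visible in FSx, not yet in S3
    sudo lfs hsm_archive /fsx/test.txt                    # push the new file to the linked S3 bucket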
Okay.
And one small thing, which I have already done here to save some time: once you create an EC2 instance, you need to install the FSx for Lustre client before you try to mount any FSx for Lustre file system. It is clearly documented in the user guide. So that's all I have for this talk.
Feel free to reach out to me over LinkedIn or any other social media platform like Twitter, and if you have any comments or queries while working on this, feel free to ping me; I'll be more than happy to answer them. Thank you so much for your time. It was an amazing experience to be at this wonderful conference. I hope you are enjoying it; have a wonderful rest of the day, and enjoy the upcoming talks. Thank you so much.
Thanks for listening. If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe@snia.org. There you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.