Storage Developer Conference - #154: Amazon FSx For Lustre Deep Dive and its importance in Machine Learning

Episode Date: October 6, 2021

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, Episode 154. Good morning, good afternoon, and good evening, depending on where you have joined from. Thank you so much for your time attending this session on Amazon FSx for Lustre. My name is Suman Debnath. I'm a Principal Developer Advocate with Amazon Web Services. And today,
Starting point is 00:00:56 we are going to deep dive on Amazon FSx for Lustre, its use cases, and a few of its performance advantages. So in the next 30 minutes, what we are going to talk about is what Amazon FSx for Lustre is and how you can make use of it to run your workloads faster and cheaper. We will look into its integration with Amazon S3, which is our object storage service. We will then dive into the data processing options that you have with Amazon FSx for Lustre.
Starting point is 00:01:28 And we'll end this discussion with a few of the performance stats that you should be aware of, because performance is the real workhorse of Amazon FSx for Lustre. So let's start by looking at how you can run your workloads faster and cheaper with Amazon FSx for Lustre. Before we get into FSx for Lustre, let's see why our customers want to run compute workloads on AWS.
Starting point is 00:02:06 What are the advantages or the benefits that you get? We generally see that there are five basic benefits that our customers, or any user, get. First is elasticity: you get virtually unlimited infrastructure that can scale, and the agility to deploy your application at scale, which is not possible on-premises. Then comes functionality: we have a rich set of instance types, various types of instances based on your need in terms of performance and
Starting point is 00:02:37 price. And we do have a lot of automation, orchestration, networking, and virtualization solutions, which can cater to your application's needs and requirements. Next is agility. This is one of the most important things in today's world, where organizations want to try different experiments, fail fast, and quickly recover when an idea that looked promising does not work out, reducing the time to result. You get that agility in AWS: you can build your infrastructure when you need it, and once you are done with it, you can just destroy it. So there is no legacy storage and no budget that is
Starting point is 00:03:27 permanently dedicated to on-premises infrastructure that you have to live with even if you don't use it. Then comes the global infrastructure. Amazon Web Services has infrastructure and resources all across the globe, and depending on which region you pick for deploying your application or services, you can go live anywhere in the world for your business in no time. Depending on your need, you may like to host your infrastructure and applications in one or more of the Availability Zones or Regions that we have.
Starting point is 00:04:09 And it is cost optimized. This is one of the most important things: in AWS, whatever you do, you pay only for what you use. If you don't use something, you are not going to get charged for it. It's completely a consumption-based model. Now, when we look at a compute workload on AWS where the data is stored in S3, the data processing typically looks something like this. You have
Starting point is 00:04:39 S3 in the middle and various sources of data coming in. It might be IoT, it might be sensor data, it might be data from automotive vehicles, and so on. So basically, all this ingested data is of different types; it is completely unstructured. It can be images, it can be CSV files, it can be text, it can be anything. And you use S3 to dump all that data.
Starting point is 00:05:11 And then if you want to run some compute on this particular data, what you need to do is essentially create some EC2 instances or some compute service, make sure those compute instances get access to the data in S3, and then do some processing on that particular data set with lots of compute, or lots of EC2 instances.
Starting point is 00:05:37 And then you might want to write or checkpoint the result back to S3 if your application demands it. And once you are done with this, you just destroy or get rid of that compute. So this is the typical workflow: you store the data in S3, make use of some compute instances, and process the data. Once that processing is done (it can be a machine learning training job running on SageMaker, or it can be anything), you may want to write the checkpoint back to S3 and destroy the compute instances.
Starting point is 00:06:16 Now, when you look at this, there are various options that you have for processing the data. The first is to give the EC2 instances EBS volumes or instance storage and keep all your S3 data there. That means you need to copy the data from S3 to either the EBS volumes or the instance store of those EC2 instances. The next is a self-managed file system, where you take a bunch of EC2 instances, load all the data from S3, and build a self-managed file system.
Starting point is 00:07:02 And then your clients can access that over the network. This creates a lot of headache from a customer standpoint, because they have to manage that file system on their own. And then we have direct S3 access, where you access the objects in your S3 bucket via HTTP PUT and GET, so basically through the APIs.
Starting point is 00:07:28 Now let's get into each of these three options and see what the disadvantages of each are. The first is data processing using EBS and instance storage. This is a working model, and a lot of our customers do use it: you take that S3 bucket, make sure you have enough instances, and copy the data, or a subset of the data, from S3 to each instance's EBS volume or
Starting point is 00:08:01 instance storage. So basically you first have to decide what part of the data should go to instance number one, what part should go to instance number two, and so on and so forth. While this is a working model, there are a lot of challenges involved. First, you need to plan the active working set beforehand: you need to plan which data to move in and out ahead of time. Next, you need to shard your data set, so you need to know which data needs to
Starting point is 00:08:39 be accessed by which instance, and you need to have that mapping beforehand. And then there is the problem of data duplication, because it might happen that multiple instances need to access the same data, and to do so you need to save the same data in each respective EC2 instance's EBS volume or instance storage. So you end up with multiple instances holding the same data in their respective EBS volumes or instance storage, as the sketch below shows.
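As a rough sketch of what that sharding looks like in practice (the bucket name and shard prefixes here are placeholders, not from the talk), each instance pulls only its assigned slice of the data set down to local storage:

    # On instance 1: copy only this instance's shard from S3 to local EBS/instance storage
    aws s3 sync s3://my-training-data/shard-01 /data/shard-01

    # On instance 2: a different shard; any files needed by both instances get copied twice
    aws s3 sync s3://my-training-data/shard-02 /data/shard-02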
Starting point is 00:09:17 The next is data processing with a self-managed file system. Here you create a distributed file system backed by the data in S3, and your clients access it over the network. This is very, very complex to manage and maintain, and it has a lot of dependencies on how you have architected your file system. It might also take a toll on you in terms of performance, because there are a lot of moving parts: you have connectivity from your file servers to S3, and then from your file servers to the end users or clients. It would
Starting point is 00:10:00 be a very cumbersome job for any customer to maintain such a distributed file system. Now, if you look at the two options we have just seen, one thing is common: you need to track the changes that you are making. It's not only about accessing the data from S3, but also about how you maintain or track the changes you make to the data, because you have to write that data back to S3. So you need to write your own scripting logic, or put some automation in place, that periodically writes the data back to S3. Now, the last option that you have is processing the data without any intermediate storage. You have your EC2 instances and you have your S3 bucket.
Starting point is 00:10:54 Now you access S3 directly. This works pretty well, and the way you do it is through the HTTP or RESTful APIs: GET, PUT, and all of that. While this is very good, it is not appropriate for all kinds of applications. It is good for applications that need high throughput, but it is not at all good for applications that are latency sensitive. And one of the most important drawbacks is that this is not POSIX compliant, because you are not going to access the storage using POSIX semantics; you are going to access the
Starting point is 00:11:31 storage using an object interface, PUT and GET, as we just discussed. And this is not good if your application needs to access the same data repeatedly, because what will end up happening is that every time you access the same data, you send another GET request, and you get charged for each one, which is not at all optimal for any customer. Now, to overcome all of this, we have Amazon FSx for Lustre. The way it works is that you integrate FSx for Lustre with S3: you create an FSx for Lustre file system
Starting point is 00:12:16 on AWS, and you link that file system to your S3 bucket. So your data is stored in S3, but it will be loaded into the FSx for Lustre file system only when a read request comes in from a client. So your data stays in S3; you just create an FSx for Lustre file system on AWS, and at that point there is no copying of data. Then you can mount that file system on your EC2 instances, and the moment you mount it, you will be able to see all the objects as folders and files on the EC2 instances. But at that time, the data is still
Starting point is 00:13:12 in S3; there is no data saved in FSx. It just copies the metadata, and it will copy the data only when a client requests a read. That's how it works, and we are going to see that lifecycle in a moment. Another advantage of FSx for Lustre is bursting for on-premises data repositories. A lot of times it might happen that you have lots and lots of data on-premises and you need to do some compute operation on that data,
Starting point is 00:13:53 but you don't have the compute resources on-premises. So you may want to use the compute resources on AWS. What you can do is create a bunch of EC2 instances, or other compute resources, on AWS and create an FSx for Lustre file system, and you can connect that file system to your on-premises storage via AWS VPN or Direct Connect, among various other options. With this, FSx for Lustre becomes an intermediary between
Starting point is 00:14:27 your virtual machines on AWS and your data storage on-premises, and it can access the data and do the computation. Once the computation is over, you can just delete the FSx for Lustre file system and get rid of the compute resources. The Lustre client has support for all the popular Linux distributions, like Amazon Linux 2, RHEL, CentOS, Ubuntu, and SUSE, so you can use
Starting point is 00:15:00 any of these clients to access the data on FSx for Lustre. Now let's look into the integration with Amazon S3 in a bit more detail. As we just discussed, when you create an Amazon FSx for Lustre file system, you point the file system at your S3 bucket: you tell the file system, okay, here is the S3 bucket, and connect the file system to it. At that point, the file system doesn't copy any data from S3, and you can just mount that file system
Starting point is 00:15:45 from your EC2 instances and see all the files and directories that are stored in that particular S3 bucket. But nothing is saved in the FSx file system; the files are moved in, in real time, only when they are accessed. The file system will go and talk to S3 only when there is a read request coming from one of the clients, the EC2 machines. This is what we call lazy loading. So let's take an example: you have an S3
Starting point is 00:16:26 bucket, and you have created an FSx for Lustre file system pointing to that particular bucket. The moment you create it, it copies the metadata from S3. So if you mount that file system on any of the EC2 instances and just run ls, you will be able to see all the files that are saved in the S3 bucket, but at that point in time, nothing has been copied from S3 to the file system. Now, when one of the EC2 instances first tries to access, let's say, file1.txt, at that point the FSx for Lustre file system will go back to S3 and copy that file into the file system. If that same client tries to access
Starting point is 00:17:22 it at a later point in time, it will be served from the file system. Not only this: now that the data is in the file system, if any other client tries to access it, it will also be served from the file system itself; it will not go back to S3. That's why we call this lazy loading, because the file system doesn't load the data up front when you actually create it.
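One way to see the lazy load from a client is to time the first and second read of a file; the mount point and file name below are placeholders:

    # First read: FSx for Lustre lazily fetches the file from the linked S3 bucket
    time cat /fsx/file1.txt > /dev/null

    # Second read: served from the file system itself, so it is much faster
    time cat /fsx/file1.txt > /dev/null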
Starting point is 00:18:21 Now, there are a few HSM commands that you can use to control the data movement. The first is hsm_archive, which copies a file from FSx for Lustre to S3: let's say you made some changes to one of the files or directories, and you want that change to go back to S3; you use hsm_archive for that. There is also hsm_release, which frees up the space used by a file once it has been archived. So what would typically happen is that you make some changes while doing computational work on your EC2 instance, then you save that data back to S3 with hsm_archive, and once that is done, if you don't want to waste storage space in the FSx for Lustre file system, you just fire the hsm_release command and it frees up that space. And then there is hsm_restore, which brings the data back from S3 into FSx for Lustre.
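As a sketch of that flow, using the standard Lustre lfs subcommands on a client that has the file system mounted (the paths here are placeholders):

    # Copy a changed file from the file system back to the linked S3 bucket
    sudo lfs hsm_archive /fsx/results/output.dat

    # Check whether the archive operation has completed
    lfs hsm_action /fsx/results/output.dat

    # Free the space used by the file; its metadata stays in the file system
    sudo lfs hsm_release /fsx/results/output.dat

    # Later, bring the file contents back from S3 on demand
    sudo lfs hsm_restore /fsx/results/output.dat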
Starting point is 00:19:06 One key thing to remember is that when you do an hsm_release, it only frees up the space; it keeps the metadata. So when you use hsm_restore, it can go back to S3 and bring your data back, because it has the metadata. Now let's see how the POSIX metadata is preserved across FSx and S3. The first time you create an FSx for Lustre file system, it copies just the metadata, including the POSIX permissions stored in S3.
Starting point is 00:19:41 Later on, when these files are accessed on the file system, it reads them from S3. And once you have made some changes to that data on your compute, on your EC2 instances, you can call the data repository task API, and it will export all those changes back to S3. The files are stored from FSx for Lustre to S3 with their POSIX permissions, so later on, when you again read the data from S3 into Lustre, it will have the updated POSIX permissions and so on, based on the changes that you made.
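The export can also be kicked off from the AWS CLI; this is a sketch with a placeholder file system ID and path (the path is relative to the file system root, and the report block, which the API requires, is simply disabled here):

    # Export changed files and metadata under the given path back to the linked bucket
    aws fsx create-data-repository-task \
        --file-system-id fs-0123456789abcdef0 \
        --type EXPORT_TO_REPOSITORY \
        --paths results \
        --report Enabled=false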
Starting point is 00:20:37 Releasing works exactly as we just discussed: you can use hsm_release, and it will free up the space in your FSx for Lustre file system but still keep the metadata saved. Now, this is one of the most predominant use cases that we have, and that is the integration with SageMaker. Amazon SageMaker is our service for machine learning operations: you can build, train, and deploy your machine learning models on Amazon SageMaker. In SageMaker, typically any data science or machine learning engineer will be saving all the data in S3, and that is the data used for training any particular model. So now imagine that you are training for a particular problem
Starting point is 00:21:40 and you are using SageMaker for that, and your data is in S3. You might have to copy the whole data set from S3 to the instance before you start the training. And because machine learning is all about experiments, you keep changing different hyperparameters and retraining, so you might have to read the data from S3 multiple times, which takes a lot of time and also costs you more money, because you are reading data from S3 multiple times.
Starting point is 00:22:20 To avoid that, you can have FSx for Lustre in the middle, and your data can be read from FSx for Lustre. SageMaker then does not interact with S3 directly; it just interacts with FSx for Lustre, which keeps the data in the file system, so you don't have to go back to S3 every time you run the same training job with different hyperparameters. And not only that, it is much faster, because with the FSx file system in the middle you can get gigabytes of throughput during your training lifecycle.
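For reference, a SageMaker training job can point an input channel directly at an FSx for Lustre file system. This sketch shows only the relevant channel definition; the job name, IDs, and directory path are placeholders, and the other required create-training-job parameters are omitted:

    # Input channel backed by FSx for Lustre; DirectoryPath begins with the
    # file system's mount name
    aws sagemaker create-training-job \
        --training-job-name my-fsx-training-job \
        --input-data-config '[{
            "ChannelName": "training",
            "DataSource": {
                "FileSystemDataSource": {
                    "FileSystemId": "fs-0123456789abcdef0",
                    "FileSystemType": "FSxLustre",
                    "FileSystemAccessMode": "ro",
                    "DirectoryPath": "/fsx/training-data"
                }
            }
        }]'
    # ...plus the usual --algorithm-specification, --role-arn, --output-data-config,
    # --resource-config, and --stopping-condition arguments, not shown here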
Starting point is 00:23:03 Now let's look into the data processing options that you have. This is the typical workflow, or the typical deployment model: you save your data on S3, you create an FSx file system, and you link the file system to the S3 bucket. At any point you can use the Lustre commands to write the changes back to S3, and once you are done with your compute on that particular data, you can always delete the file system. You get two choices when you create an FSx for Lustre file system: one is scratch, the other is persistent. Scratch, as the name suggests, is for short-term processing, where there is no high availability. You just create a scratch
Starting point is 00:24:02 FSx for Lustre file system, which keeps just a single copy, and once your workload, your computation, is done, you can just delete it. If you want to use this long term, for processing that you need for months or even years or beyond, then you might want to go with the persistent deployment, where you get HA file servers and the data is replicated across the file servers.
Starting point is 00:24:43 Now let's talk about the performance. This is from one of our customers, who are mostly into machine learning and MRI image processing. What they have done is they are now using FSx for Lustre for training their models. Previously they were using S3 directly, but now they have FSx for Lustre in the middle, so SageMaker is not talking to S3 directly. And what they have seen is that their ML-based workflow time was reduced by 20 times, so it drastically reduced the time to train and deploy
Starting point is 00:25:23 their models, improving both the performance and the cost. Now, these are the numbers that you can keep in mind. This is the scratch file system performance: with one terabyte of capacity, you get a base throughput of 200 megabytes per second, and it can burst up to 400 megabytes per second. If you increase the storage capacity, your throughput increases accordingly. So one interesting thing to note here is that your performance is directly proportional to the capacity of the storage. The more capacity you have in your file system, or the bigger your file system is,
Starting point is 00:26:09 the better performance you will get. By default, FSx for Lustre will be good for you, but in case you want to do some performance tuning or optimization, there are some best practices. One of them is striping, which you can explore, and we are going to see what striping is. You can also take advantage of a bigger IO size, because if you use a bigger IO size,
Starting point is 00:26:44 technically your throughput will increase; that is common for any file system. And you need to make sure that the clients, the EC2 instances that you are using, are of a good configuration, with enough memory, CPU, and network bandwidth, so that they can make the best use of the FSx for Lustre file system. Now, what is striping and why do we use it? Striping is very important because it shards a large file into small chunks, and all those chunks can be saved on various file servers. When you read the file back, you get the maximum amount of parallelization, so striping can drastically improve your throughput.
Starting point is 00:27:37 Striping can be done at the directory level or at the file level, so it can be set per file or per directory, and if you set it per directory, the parameters are inherited by all the files inside that directory. So what is a stripe about? Let's take an example. Say you have a file of 7 MB and you have set a stripe count of 3; that means your file will be spread across three different disks. And if your stripe size is 1 MB, that means the 7 MB file will be chopped into seven small chunks, and those chunks will be distributed across the three disks.
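As a sketch of how you would express that layout with the standard Lustre tooling (the directory path is a placeholder):

    # New files under this directory inherit a stripe count of 3 (three storage
    # targets) and a stripe size of 1 MiB
    lfs setstripe -c 3 -S 1m /fsx/big-files

    # Inspect the resulting striping layout
    lfs getstripe /fsx/big-files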
Starting point is 00:28:33 These are the parameters, or the variables, that you should keep in mind, and you may like to use them while configuring the file system. And there is an interesting configuration that you can set, which is the imported file chunk size. Let's say you have files of different sizes, which would be the case in most use cases. When you use this imported file chunk size, you can pick the file size that is most dominant in your file system; let's say that is 1 MB.
Starting point is 00:29:15 You can then use that most dominant file size, divided by the number of disks, as your chunk size. So this is something that you may like to explore if you have a particular file size that is dominant in your data store.
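The imported file chunk size is set when the file system is created and linked to S3. A sketch via the CLI, with a placeholder bucket and subnet, and the chunk size given in MiB:

    # Link the new file system to S3 and control how imported files are
    # striped across disks
    aws fsx create-file-system \
        --file-system-type LUSTRE \
        --storage-capacity 1200 \
        --subnet-ids subnet-0123456789abcdef0 \
        --lustre-configuration ImportPath=s3://my-bucket,ImportedFileChunkSize=1024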
Starting point is 00:29:47 Now, these are the regions where you can make use of the Amazon FSx for Lustre service, and we keep adding support for the same service across more regions, so you can make use of this file system almost everywhere across the globe. Now let's spend a couple of minutes where I'll show you
Starting point is 00:30:18 in the console how you can create an FSx for Lustre file system. Okay, so I have already created an FSx for Lustre file system, and I have pointed it to an S3 bucket, but I'll still show you how you can create one. This is a wizard, and you can select FSx for Lustre; we do have another service called FSx for Windows File Server, which is for Windows clients, while this one is for Linux. When you click next, you can give a name to your file system, let's say my-file-system, and you can select either persistent or scratch; we can use scratch. And you have to give the size of the file system.
Starting point is 00:30:59 The minimum is 1.2 terabytes, so let's go with that. Then you pick the right security group, VPC, subnet, etc.; these are very common across most of the services that we have. The important thing is this: here you can import your S3 bucket. You can select the option called import data from S3 and give your S3 bucket name, and once this is done, you can just click on next and create it. I'm not going to create it, but this is how you can do it.
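For reference, the same creation can be scripted; this sketch mirrors the wizard choices above, with placeholder subnet, security group, and bucket values:

    # Scratch file system of the minimum 1.2 TiB size, linked to an S3 bucket
    aws fsx create-file-system \
        --file-system-type LUSTRE \
        --storage-capacity 1200 \
        --subnet-ids subnet-0123456789abcdef0 \
        --security-group-ids sg-0123456789abcdef0 \
        --lustre-configuration DeploymentType=SCRATCH_2,ImportPath=s3://my-bucket \
        --tags Key=Name,Value=my-file-system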
Starting point is 00:31:56 We already have one FSx for Lustre file system, and if you click on it, you can see it is pointing to an S3 bucket. Let me show you that S3 bucket. The S3 bucket name is sniea-sdc-2020; it should be snia, but that was just a typo. So now let's go to that S3 bucket and search for it, and we see that inside this bucket we have two different files; it's a zip file, but these are the two files. Now let's go to the EC2 instances. I have already created two EC2 instances, and they are running, so we can log into one of them and try to mount that file system. We have server 1 and server 2, and I have already logged in to both. And if you look at it, that particular file system is not
Starting point is 00:33:00 mounted yet. So the first thing we can do is create a directory; okay, it's saying that the directory is already there, but nothing is inside it. Now we can connect to that file system. To do that, let's go to the file system in the console, get inside, and click on attach. When you do an attach, you get the mount instructions, and after mounting we see the same files that we have in the S3 bucket. You can now do anything with this data, but at this point the data is not yet copied into the file system, because it has not yet been read.
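The attach dialog shows the exact mount command for your file system; as a generic sketch, with placeholder DNS and mount names:

    # Create a mount point and mount the file system over the Lustre protocol
    sudo mkdir -p /fsx
    sudo mount -t lustre -o relatime,flock \
        fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com@tcp:/mountname /fsx

    # The objects in the linked bucket now appear as files and directories
    ls /fsx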
Starting point is 00:34:29 The next thing we can do is try to write something into this file share. So let's create a text file, and you can see the test file is there. But now, if you go to S3 and refresh, you will see that the file is still not there, and the reason is that we have not archived it. So let's try to archive it. Let's go to FSx, where we have this file, file1, and archive it with the hsm_archive command. If we now go back to S3 and refresh, we see that the file has arrived in S3. So this is how you can make use of the different HSM commands to process your data on your compute instances and write the changes back to S3. And one small thing, which I have already done here just to save some time:
Starting point is 00:35:36 once you create an EC2 instance, you need to install the FSx for Lustre client on it. This is clearly documented in the user guide, but it is something that you need to do before you try to mount any FSx for Lustre file system.
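On Amazon Linux 2, for example, the client comes from amazon-linux-extras (older guides use the lustre2.10 topic name); other distributions have their own packages, as described in the user guide:

    # Amazon Linux 2: install the Lustre client packages
    sudo amazon-linux-extras install -y lustre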
Starting point is 00:35:57 So that's all I have for this talk. Feel free to reach out to me over LinkedIn or any other social media platform like Twitter, and if you have any comments or queries while working on this, feel free to ping me; I'll be more than happy to answer them.
Starting point is 00:36:24 Thank you so much for your time. It was an amazing experience to be at this wonderful conference; I hope you are enjoying it. Have a wonderful rest of the day, and enjoy the upcoming talks. Thank you so much. Thanks for listening. If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community.
Starting point is 00:37:06 For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
