Storage Developer Conference - #14: Instantly finding a Needle of data in a Haystack of large-scale NFS environment

Episode Date: August 5, 2016

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to SDC Podcast Episode 14. Today we hear from Gregory Turetsky, Solutions Architect with Intel, as he presents Instantly Finding a Needle of Data in a Haystack of Large-Scale NFS Environment from the 2015
Starting point is 00:00:46 Storage Developer Conference. My name is Gregory Turetsky, and as of the beginning of this week, I'm working at Infinidat. So until the end of last week, I was working for a slightly bigger company, Intel. So I'll have two parts in this presentation. Most of the talk here is really related to the work I was doing in my previous life. But I definitely want to introduce the company that I've moved into as well. That's also a strange feeling. I never had to introduce my company before.
Starting point is 00:01:23 So Infinidat, has anybody heard about Infinidat before? Okay, great, yeah, that's nice.
Starting point is 00:01:30 So, the company is not new, it's established in 2011, so it's four years old for
Starting point is 00:01:35 now. It is established by people with a pretty good record in the storage
Starting point is 00:01:40 industry, Moshe Yanai and many others from XIV and other companies. The company invests heavily in getting patents in the storage field and it's growing. The product that comes out of this company is basically a unified storage appliance. You can get a single rack with up to 2 petabytes of usable capacity
Starting point is 00:02:06 with 480 six-terabyte drives, with a heavy focus on caching with flash and DRAM, so you can get about 2 terabytes of cache in front of those 2 petabytes of usable disk space. The focus is on high availability. We're talking about seven nines,
Starting point is 00:02:32 so it's about three seconds of downtime a year. Density, multi-protocol: it is unified storage. Basically, up until now, there was a block storage solution. As of this week there is a new release available that adds NAS functionality with NFS v3. It's an in-place upgrade of the existing products: you can just upgrade the operating system and you also get NAS
Starting point is 00:02:56 functionality. There are plans to add SMB, and there are plans to add object storage later on, kind of early next year. Overall consumption of this max-scale configuration of 2 petabytes usable: we are getting up to 8 kilowatts per rack. Over 750k IOPS, a RESTful API, and there are production deployments of the SAN solution in actually several companies,
Starting point is 00:03:27 over 200 petabytes deployed in the world. So it's really impressive. Talking about the NAS: as I mentioned, the NAS product was just announced. Infinidat used to have a NAS solution, which was kind of a side product in the past. This is not something that is being sold today. Today this is coming as a software upgrade to the existing InfiniBox appliances. In the first release, we are talking about over 250,000 NFS ops per second, and there are plans to optimize it significantly more.
Starting point is 00:04:03 You can get a single file system covering the entire capacity of the NAS appliance, or of the InfiniBox system. So you can get two petabytes of usable space, and with thin provisioning you can actually go as much as you want on top of it. And there is support for billions of files per file system. And with the first release, we are supporting
Starting point is 00:04:27 4,000 file systems per appliance, per system with the plan to go to over 100,000 file systems later on. As I mentioned, the focus is on highly reliable solutions. This is an N plus 2 architecture.
Starting point is 00:04:44 Everything is multiplied by 3. We support snapshots with no performance impact. And this is a really unified solution. So you get up to 2 petabytes of usable space. You can create what we call pools and you can create file
Starting point is 00:05:00 system or block devices on the same pools and access them. Connectivity-wise, we offer 24 8-gigabit Fibre Channel links to the system, or 6 to 12 10-gigabit Ethernet links. So this is pretty much about Infinidat. That's for now, and the rest of my slides are about the previous work that I wanted to share. So how many people here in the room have used a search engine in the last one or two days? So pretty much everybody. But if we look into what people were doing just a few years ago,
Starting point is 00:05:47 everybody was going to some encyclopedias or whatever. That's another type of encyclopedia from Russia, a big Soviet encyclopedia that I was using back in the 80s. But that was the idea. If you want to find something, you go and look for the index, and you can find the information. Today, if you want to find something for your own use, you go to Google or Bing or whatever and get it pretty quickly.
Starting point is 00:06:14 The problem is that at large enterprises, you don't always have this capability for your data, whether it's structured or unstructured, especially if it is unstructured data. So one thing that I was looking at is really: can we optimize the way people are looking for data in a large-scale NFS deployment? So at my previous job
Starting point is 00:06:40 we had over 1,000 NFS file servers worldwide in different locations. There are about 40-plus sites that we had. 35 petabytes of configured NFS capacity, which is divided into tens of thousands, over 100,000, NFS file systems, which are all glued together with the automounter. So we provide kind of a global namespace. There are about 90,000 compute servers accessing this data, and all of them can go and access the file systems there via NFS using the same automounter-based namespace. We're talking about over 100 billion files across those sites.
Starting point is 00:07:21 So it's pretty big. The problem is that customers are looking for data in this environment, and they're not always really sure where this data is and what kind of data they could be looking for. Let's say there is some source code. I'm working on some design. There are many, many people working on this. My source code, say: maybe it's in Git, maybe it's not in Git, but the data is actually in NFS, and I want to find where some function is defined, or where some kind of a comment is within the code, or I'm looking for some keywords. Maybe I am supporting a large environment which has existed for many years, so there are many, many scripts running one after the other,
Starting point is 00:08:02 and I'm trying to understand where some environment variable is defined. So it is somewhere in the file system, so I have no idea where it is. How can I find it quickly? Maybe I'm running a large-scale regression that generates hundreds of thousands of logs, and I want to figure out where some error messages are. I want to find all kinds of things. So what people are usually doing in that environment,
Starting point is 00:08:28 they run grep. And then they wait. And then eventually they get some result. Well, they don't run the grep on 100 billion files, obviously. But they can get some estimation. This is the file system
Starting point is 00:08:44 where my data probably resides. I'll run this recursive grep. I'll wait for half an hour. Maybe I get results. Maybe I made a mistake and I rerun the grep and wait another half an hour. So this is not really very efficient. And so these were the requirements that we defined for this project. First of all, I was part of the IT organization.
Starting point is 00:09:07 So from our point of view, we get an NFS file server. This is a black box. I cannot change anything in what the NAS provides; I can do something around it to improve my experience. Another thing that we discussed with our customers is how soon they want to see something as a result of the search. So, grep might be slow, but it runs on the actual data. So, if I created a new file, and then I run grep on it, I'll get the results immediately. If I do some kind of an indexing solution, how soon my customers are ready to... How long my customers are ready to wait
Starting point is 00:09:51 until the data that was generated appears in the index? And so this is basically the indexing SLA, and we came to the conclusion that it's within 24 hours: if something is created on disk, they want it to be available from the indexing system within 24 hours. Another thing that we wanted to ensure we do not get into is any kind of denial of service for the NAS clients. I can scan my data. I have to do some kind of crawling and look for data and index it.
Starting point is 00:10:20 I don't want to see a problem of generating too much load on the file system instead of the regular production workload. And we have different types of NAS appliances, so some of them may be stronger, some of them maybe less powerful. How can my system basically take that into consideration and throttle the load generated by the scanners to avoid denial of service? We didn't want to invest in developing many things ourselves, so the focus was really on reusing
Starting point is 00:10:52 things, whether it's off-the-shelf industry solutions or things that we already had developed within the company. Another thing that we wanted to provide is an ability not just to look for some data, right, and looking for whatever, some keyword, but an ability to provide some hints. So same way as you go to Google and you can specify,
Starting point is 00:11:12 look for something only on one specific site or within specific DNS domain, or you want to look for something and this should be just PDF files, we want a similar type of solution here. So I want to look for something that should be only within the files that are owned by user X, or files that are created before or after some timestamp, and so on. And something else that we looked into said, okay, we'll start with this NFS solution, but can we really use the same capability to extend it and provide the search capability across all data? Whether it's a SharePoint, whether it's maybe websites,
Starting point is 00:11:50 wikis, databases and so on. So can it be extensible enough to go beyond just NFS indexing? So we looked into a few options. One was to go with a commercial product. Google gives you an appliance you can install at your company. It runs basically the same algorithms that Google is using in the wild. And you can use it. There are two concerns that we had. Google licensing model is based on number of index documents. We don't want to pay
Starting point is 00:12:28 for 100 billion documents indexed. I'm sure they would be happy. The other thing was that it works great for web and internet and SharePoint. There are some concerns that we had back at the time on scalability and support for NFS
Starting point is 00:12:44 crawling with the Google appliance. And then we looked into two open-source solutions. One was for the actual indexing capability: we looked at Solr and we looked at Elasticsearch. Both rely on the Lucene indexing libraries underneath. So we basically tried to consider those two. With Solr we had some concerns with scalability. We ran some basic tests and decided not to go with that. And we also wanted to use it for some additional use cases beyond just NFS data indexing.
Starting point is 00:13:13 So that was also less applicable. And there was also Elasticsearch, which is known now as Elastic. So we decided to go with this capability. How many people here are familiar with Elastic? Okay. Elastic comes from an open source product.
Starting point is 00:13:32 There is a company behind it now that you can get commercial support from, and they continue developing this product. This is a scale-out index. You can bring up a cluster of multiple nodes very easily, and Elastic creates an index that spans multiple machines, so it basically provides: if you
Starting point is 00:13:53 need more capacity for your index or you need more performance, you get more nodes. This cluster can grow very easily. One thing that was missing is really the crawling side. Elasticsearch was, based on our benchmarks, ready to meet the challenges. We had to find a solution for the crawling. There are some open-source solutions for efficient crawling which are more web-focused. There is an open-source product
Starting point is 00:14:24 called Nutch, which is used by many companies in conjunction with Solr, for example, to do web crawling, again, similarly to Google and other web search engines. It is less applicable for file system crawling. So we decided to implement our own crawler. So how does this all work together? I'll talk later on about the components. First of all, we have to provide as an input to the system a list of NFS file systems that should be indexed, and some metadata related to these file systems. For example, we want
Starting point is 00:14:56 to ensure that we specify that file systems A, B, and C belong to project X, and file systems X, Y, and Z belong to project 2. So later on, when I'm trying to find the data, I can define some boundaries. I'm looking for the data that's related to project X only. Another thing that I want to provide as an input is whitelists and blacklists. Basically, what kind of files I do want to index
Starting point is 00:15:23 and what kind of files I don't care about. For example, maybe I don't care about the binary files; I do care about PDFs or documents or source code.
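To make that input concrete, here is a minimal sketch of what one such record could look like, written as a Python dict. The field names, paths, and patterns are illustrative assumptions, not the actual schema that was used.

```python
# Hypothetical input record for one NFS file system to be indexed.
# All names, paths, and patterns below are examples only.
filesystem_record = {
    "server": "nas-server-01",                  # NAS filer exporting this file system
    "mount_point": "/nfs/site_a/proj_x/fs001",  # automounter path seen by clients
    "project": "project_x",                     # lets searches be scoped to a project
    "whitelist": ["*.c", "*.h", "*.pl", "*.py", "*.pdf", "*.txt"],
    "blacklist": ["*.o", "*.so", "*.bin"],      # binaries are skipped by design
}
```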
Starting point is 00:15:41 So then, we have to provide some kind of a scalable solution for crawling. We have thousands and tens of thousands of file systems to scan. We have billions of files. There is no way you can do it on a single machine. We need a scaled-out architecture, and we have to provide this in our implementation. The crawler has to find what files should be indexed. Then later on we have to read those files and send them to the indexing system, which is basically the Elasticsearch cluster. Elasticsearch creates the index. Then we have to provide an interface for the end user to go and look for the data.
Starting point is 00:16:18 It could be a Google-like web UI, or, as our customers were specifically requesting, it has to be basically an in-place replacement for grep. Grep was used for many, many years. They are used to it as an interface on the Unix command line. So they said, we want to have the same thing. Just instead of going and doing the recursive search over NFS, we want this to be instant, going to the NFS index.
Starting point is 00:16:47 One more thing that we have to provide is security: how we can ensure that if I have access to the file, I can see the results, and if I don't have access to the file, I shouldn't see the results. And, again, we want to make sure that those crawlers and readers do not really kill our file servers. So you're doing an access filter after you've done all the indexing, so say you're my manager, you may have access to a file and it just wouldn't show me the results? Exactly, yes. So basically, I'll talk about this later, but the way we do this, those guys have root access to NFS, scan everything, and then we also look into the access ACLs, basically,
Starting point is 00:17:31 or Unix permissions on the files. How fine-grained is that, down to the file level? Not at this point. Right now it's at the file system level. So I'll go through each one of those elements that I had on this previous chart. First of all, we have these pools of crawlers and readers. Independently of this project, we have our in-house-developed batch scheduling system. It is similar to Sun Grid Engine, LSF, and so on. So we have something that is developed in-house that we use to manage
Starting point is 00:18:08 workload on those tens of thousands of servers. And we decided to use the same system to basically manage those pools of crawlers and readers. We didn't want to invest in something new. One thing that we can
Starting point is 00:18:23 get as a feature of the scheduling system is an ability to manage multiple queues and define all kinds of parameters for those queues. For example, how many jobs can run concurrently within the queue? So the way we defined the pool of those crawlers and the pool of readers is basically
Starting point is 00:18:40 we said, for every file server that we want to scan, we'll define a separate queue. And for every queue, we define a parameter called max running jobs. So if this is a more powerful file server, we can run more jobs in parallel on that server. If this is a less powerful server, we can define the max running limit as a lower number, so there will be fewer jobs running, and we'll have less impact on the performance of that server. So from the scheduling point of view, we can submit all the jobs for scanning into the pool, and then the scheduler will figure out how many concurrent jobs per server it can run.
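As a rough illustration of that queue-per-file-server idea, here is a sketch against a generic scheduler interface. The in-house batch system is not public, so `submit_job()`, the queue naming, and the limit numbers are all assumptions.

```python
# Static per-server concurrency limits (illustrative numbers): a stronger
# filer gets a higher "max running jobs" value for its queue.
SERVER_LIMITS = {
    "nas-server-01": 10,   # powerful filer: allow 10 concurrent scan jobs
    "nas-server-02": 3,    # weaker filer: keep the load down
}

def submit_crawl_jobs(filesystem_records, submit_job):
    """Submit one crawl job per file system into a queue named after its server."""
    for fs in filesystem_records:
        queue = "crawl_" + fs["server"]             # one queue per file server
        limit = SERVER_LIMITS.get(fs["server"], 5)  # default limit is a guess
        submit_job(queue=queue,
                   max_running=limit,               # scheduler caps concurrency per queue
                   command=["fs-crawler", fs["mount_point"]])
```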
Starting point is 00:19:27 The way that we define those jobs is basically one job per file system. If I have to scan 100 file systems, I'll run 100 jobs. Each one will crawl through the related file system. If it takes longer than I need, I can add more nodes into the crawling pool or into the readers pool, and so I can fan out more of those. We talked about the queuing: creating a queue per file server so the scheduling system basically can launch more or fewer jobs per server. We are also looking into
Starting point is 00:19:53 defining this more as dynamic versus static limits. Right now, in the first implementation, these are static limits. I'm saying that this server is capable of running more workload: it can sustain 10 concurrent scans.
Starting point is 00:20:11 This server is maybe less capable; it can do only 3 concurrent scans. What we do plan to implement there is basically an ability to define automatic throttling. If we see that the scan latency increases, then we would throttle down the number of concurrent jobs
Starting point is 00:20:30 and then maybe bring it up again. As I mentioned, there was another question. We do provide root access to NFS, so the crawler and the reader have to be able to access every single file. Two additional things that we had to consider. One, we heavily rely on the atime parameter of files
Starting point is 00:20:53 for other purposes. For example, we do other types of scans that define whether the data can be recycled. And we can notify customers saying, you know, this data was not accessed for over 180 days or for over a year. Maybe you don't need to buy a new server.
Starting point is 00:21:11 Maybe you should just go and remove this data. So the access-time parameter of files is important in our environment. And every time we introduce a new kind of scanner, especially a scanner that reads the data, like in this case, we don't want it to reset the atime. So we looked into a few options. We looked into maybe using snapshots as a kind of
Starting point is 00:21:32 read-only file system to do the scanning. For several reasons we decided not to go with snapshots. What we ended up doing is basically: our crawler, or rather not the crawler but the reader, goes to the file, reads it, and then resets the atime to the previous value.
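A minimal sketch of that atime-preserving read, assuming a POSIX-mounted NFS path and root access; real code would stream large files and handle errors.

```python
import os

def read_preserving_atime(path):
    """Read a file's content, then put its atime back so aging scans still work."""
    st = os.stat(path)                          # remember atime/mtime before reading
    with open(path, "rb") as f:
        content = f.read()                      # this read bumps atime on the server
    os.utime(path, (st.st_atime, st.st_mtime))  # restore the original timestamps
    return content
```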
Starting point is 00:21:47 So every time we read a file, we reset the atime back. So if somebody does a search and then they go and open that file, do you change any kind of... Right. If he opens the file, then this file is probably needed. That's not a big deal. And we also don't expect that somebody will open a billion
Starting point is 00:22:03 files. They will probably open one or two. Another problem that this type of scanning may create is with tiering. If we have a storage solution that supports automatic tiering based on the number of accesses, no solution today supports something that would let us exclude these scans from the decision about migrating data from tier A to tier B. Usually, if a file is accessed once, it will be fetched from the lower tier to the upper tier. So this is one of the problems that we have to remember. From the indexing point of view, again, if you're familiar with Elasticsearch, you can create multiple indexes.
Starting point is 00:22:50 Usually Elasticsearch is used to store time-series data. In that case, the indexes are usually created by timestamp, like daily or hourly, depending on the number of records that have been indexed. In our case, we're not looking into time series; this is more indexing of data. So we create an index per file system. Those indexes are sharded between multiple nodes within the Elasticsearch cluster. That's a feature provided by Elastic.
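A sketch of that index-per-file-system layout using the Elasticsearch Python client follows; the index name, shard counts, and field mappings here are assumptions for illustration, not the production settings.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-node-1:9200"])   # example cluster endpoint

def create_index_for_filesystem(fs_id, shards=5, replicas=1):
    """One index per file system, sharded across the Elasticsearch cluster nodes."""
    es.indices.create(
        index="nfs-" + fs_id,
        body={
            "settings": {"number_of_shards": shards,
                         "number_of_replicas": replicas},
            "mappings": {"properties": {
                "path":    {"type": "keyword"},
                "project": {"type": "keyword"},
                "owner":   {"type": "keyword"},
                "mtime":   {"type": "date"},
                "content": {"type": "text"},
            }},
        },
    )
```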
Starting point is 00:23:22 We define whitelists and blacklists. Basically, I can say I care about C and Perl and Python and whatever other types of files, and PDFs, but I don't care about binaries. One other thing that we looked into is an ability to define different types of parsers for the indexer. This is another feature that is available from the open-source community: there is a set of parsers from the Apache Tika project that we can use. Right now we don't use them; we use just basic readers. But in general, Tika parsers can be used to read content from different types of files. Maybe in the future we'll want to add scanning of things like image files, and I want to get from the image where it was taken
Starting point is 00:24:11 and when it was taken, from the header of the image. So we can define parsers for that and understand the JPEG format or whatever. An additional thing that we wanted to add is updates to the index. So let's say I've indexed my file system for the first time. Then there are changes happening: some files are removed, some files are created, some files are changed. I don't want to re-index the entire file system every time; I want to figure out what has changed, find those changes, and index only them.
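One way to find those changes in an incremental pass is simply to re-walk the tree and compare timestamps against the previous scan; a hedged sketch follows. Detecting deletions would need a diff against the stored inventory, which is omitted here.

```python
import os

def files_changed_since(root, last_scan_epoch):
    """Yield paths whose mtime or ctime is newer than the previous scan time."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue                  # file vanished or is unreadable
            if max(st.st_mtime, st.st_ctime) > last_scan_epoch:
                yield path                # re-read and re-index just this file
```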
Starting point is 00:24:38 So that's something else that we had to implement. We also looked into the configuration of this Elasticsearch cluster, how it should be tuned to meet our needs. An additional component of the system is the user interface. So definitely we looked into the web UI. We're looking for something very easy. Again, a Google-like, Bing-like interface.
Starting point is 00:25:11 The customers' demand was pretty clear about the command line and an in-place replacement for grep. So in many cases, they just prefer to run this instead of the grep command. There were different discussions that we had about whether we want to really provide just a grep-like interface or expose the really more powerful Lucene API, the Lucene
Starting point is 00:25:37 interface to the search. We could do something like: look for documents whose title contains something and whose text contains some other term. This is not something that you do easily in grep, but it is supported by the index. So we ended up giving both ways of searching.
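To illustrate how a grep-style invocation can map onto the richer index query underneath, here is a sketch; the command name `nfsgrep` and the field names are hypothetical, not the tool that was deployed.

```python
def grep_to_es_query(keyword, project=None, owner=None):
    """Translate a grep-like request, e.g. `nfsgrep -p project_x MY_ENV_VAR`,
    into an Elasticsearch bool query (field names are assumptions)."""
    must = [{"match_phrase": {"content": keyword}}]
    if project:
        must.append({"term": {"project": project}})
    if owner:
        must.append({"term": {"owner": owner}})
    return {"query": {"bool": {"must": must}}}
```

The same structure extends naturally to the richer queries mentioned above, such as matching on a title field and the body text at the same time.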
Starting point is 00:25:57 Right now people are actually still using the grep-like interface more; understanding and getting used to the other kind of interface requires more training. Security aspects. So I mentioned this briefly, but we do encrypt the index. There are many things that we do to ensure that the people who can access the cluster really are the ones who should be able to do so. Search access control, right? So that's something that is very important.
Starting point is 00:26:29 I don't want to expose data to somebody who is not supposed to see this data for the search. So we had to provide some kind of access control capability that was not available from the open source solution, and that was implemented in-house. Basically, this is done at the file system level granularity right now. We do want to go more to the file level, but at this point this is done at the file system level. So I'm looking at the user and group
Starting point is 00:26:55 who has access to the entire file system at the mount point level. And then when somebody runs a search command, we check which groups this person belongs to. If he belongs to a group that has access to the file system, he will see the results for that file system. If he does not, he won't get it.
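A sketch of that file-system-level check: compare the searching user's Unix groups with the ownership recorded for each file system's mount point. The metadata layout is an assumption.

```python
import grp
import pwd

def unix_groups(username):
    """Return the set of group names the user belongs to."""
    groups = {g.gr_name for g in grp.getgrall() if username in g.gr_mem}
    groups.add(grp.getgrgid(pwd.getpwnam(username).pw_gid).gr_name)
    return groups

def may_see_results(username, fs_meta):
    """fs_meta is assumed to record the mount point's owner, group and 'other' bit."""
    if fs_meta.get("other_readable"):
        return True
    return (username == fs_meta.get("owner")
            or fs_meta.get("group") in unix_groups(username))
```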
Starting point is 00:27:22 So where this project is now: there is a pilot happening with one project. There are about 85, roughly 100, file systems today, close to 100 terabytes of data, and half a billion files. So it's relatively small scale compared to, obviously, the entire environment, but this still gives us some pretty good understanding of the usability of the solution and its scalability. So out of the 71 terabytes, or slightly more now, that we index, the actual index size that we create is about 20 terabytes, or less than 20 terabytes. This comes from a few
Starting point is 00:28:05 reasons. One is because we do not scan every single file; we don't index every single file. We have those whitelists and blacklists defined, so many files are not in the index by design. And also there are all kinds of
Starting point is 00:28:22 compression and optimization done on the index side. One more thing that I didn't mention on the indexing side is a challenge: many things that are stored in our NFS environment are compressed with gzip or bzip2 and so on. And customers are interested in indexing this data also. So this is something that we are thinking about how to do in the future, maybe kind of unzipping the data as part of the scanning and indexing it, and so on.
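If compressed files were to be indexed as discussed, the read phase could decompress them transparently; a small sketch under that assumption:

```python
import bz2
import gzip

def open_for_indexing(path):
    """Open a file as text for indexing, decompressing gzip/bzip2 on the fly."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")
```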
Starting point is 00:28:52 So access control is implemented, the user interface is in place, and it essentially works. So we have this example of somebody running a search. This may take 30 minutes or more,
Starting point is 00:29:08 and the person is just waiting for the results and goes for a coffee or whatever. And then I'm just using this in-place replacement, the new command, which gives a very similar interface, and it comes back within less than a minute, really less than a second. Because, again, assuming the data is in the index, now I can take it actually further. I don't really have to specify where
Starting point is 00:29:36 to look for the data. I can search the entire index, everywhere. And the time that it takes is really the same. So this is something that was not available at all. People would not run grep -r on 100 file systems and 70 terabytes of data. And now they can do it. So this is what we have done in my previous job.
Starting point is 00:30:03 Now, as I have moved to a company that actually does work on NAS development, or SAN, we can think about what we could do as a NAS vendor to make things better, if we do have access to what the NAS is doing. One thing that I can see is that many companies are doing all kinds of scans.
Starting point is 00:30:23 Those are just a subset of scans that we were doing at my previous job. I want to do indexing. That's what we just talked about. I'm scanning all the data. I want to do replication. I'm replicating tons of data between different sites for cross-site sharing and collaboration. I'm running rsync all the time. By the way, with index, what I didn't mention,
Starting point is 00:30:46 the way it works, we scan the file system. The time the scanning ends, we run this again to find what has changed. Same thing with replication. I'm replicating data from Israel to California and Oregon and from Oregon to Texas and India and everywhere else, talking about petabytes of data replicated and
Starting point is 00:31:08 thousands or probably hundreds of thousands of replication tasks running every day copying data across different locations. So all those are doing scans of the file systems looking for changes. What has changed?
Starting point is 00:31:24 Identify those changes, copy them. Aging, this is another thing that I mentioned. So we have another type of scan that goes through the entire file system and looks at aging, at atime. If the files on file system X are not accessed for more than X days, we can indicate that this file system is a candidate for recycling. We do all kinds of legally required scans. I'm looking for
Starting point is 00:31:48 controlled technology on files and make sure that this data is not replicated to controlled countries or whatever. There might be antiviruses running. Maybe I'm doing backups. Again, I'm scanning for all the data. All those things
Starting point is 00:32:04 create a lot of noise and a lot of extra load on the file servers. They may interfere with each other. And the main reason why we do all those scans, crawling the file systems again and again, is really to find what has changed. If I were able just to generate a list of changes and say, this is the file system, I made a one-time inventory of what is there
Starting point is 00:32:27 and from that point in time I'm getting an incremental-forever type of thing, generating the list of changes, that would be just a tremendous help. So there are many things that exist today. Like, there is the inotify interface in Linux and so on.
Starting point is 00:32:44 Unfortunately this doesn't work for a NAS appliance; I cannot have this on a NAS. There are some things that are available from some vendors: they can use something like the FPolicy interface on NetApp. But this is vendor-specific; it's not something that could be used globally. And, for example, NetApp folks really don't like people using FPolicy. They are very concerned about the performance
Starting point is 00:33:05 impact from that. So, one thing that we are looking at at my new job is really: how can we do something like that as a feature of the NAS appliance? Can we provide
Starting point is 00:33:21 an easy-to-use capability, which would be inotify-like, or some kind of a SnapDiff-like solution, that really gives me an instant view into what has changed on my file system, and provide an ability to generate this report easily for the customer so they can integrate it in any kind of solution that they use, whether it is replication or an antivirus solution or indexing and so on.
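Purely as an illustration of how such a change report could plug into those tools, here is a sketch of a consumer; the record format is invented, since no such interface exists yet.

```python
def apply_change_report(report_lines, index_file, remove_from_index):
    """Consume a hypothetical per-file-system change report, one record per line,
    e.g. "MODIFY /proj_x/fs001/run42/sim.log" (the format is an assumption)."""
    for line in report_lines:
        op, path = line.split(maxsplit=1)
        if op in ("CREATE", "MODIFY"):
            index_file(path)              # re-read and re-index just this file
        elif op == "DELETE":
            remove_from_index(path)       # drop it from the index
```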
Starting point is 00:33:49 So this is one part of what we could do. Just generating this, eliminating the need for the crawler, would be helpful in many cases. But can we really go beyond that and provide a search capability as a feature of a NAS appliance? So we
Starting point is 00:34:07 have this on Windows laptops for the last, what, five, six years. I can go and search for data. If I don't really remember where my file is, I can still find it. I don't have to remember my directory structure on the local disk. Can I get something like that from my NAS appliance? Can I try not to require the customer to remember the directory structure? It still should be there; we still want to ensure POSIX compatibility and everything like that. But can we provide an extra feature with kind of indexed content?
Starting point is 00:34:42 And this is another thing that we are considering as well. So this is it. Questions, either on Infinidat or on this side?
Starting point is 00:34:54 So, we're in the NFS track, and given your history: is there anything
Starting point is 00:35:03 particularly tied to NFS in this? Is there any reason that you couldn't do this on some other file access protocol? So, there is not something
Starting point is 00:35:13 very tied to NFS here. It could definitely run on other protocols. One thing, or the things that are maybe more specific, but it's pluggable, is access controls, right? On Unix, we get the owner and the group, and
Starting point is 00:35:28 on SMB, we may have to read the ACLs or something like that. But other than that, it's not very specific to NFS. Other questions? So, what kind of customer
Starting point is 00:35:47 scenarios do you have for us? Say it again? Customer what? Scenarios. Oh, customer, okay. So, I don't know if you attended this,
Starting point is 00:36:02 but I was talking about this initially. What kind of data people are looking for, right? So we have, again, on my previous job, we had thousands of designers worldwide who are doing chip design. And all the data that is related to chip design or software development is in NFS. So whether it's in some way a source control system or maybe just sitting directly in NFS space. Now, if I have created a file, I probably know where this file is.
Starting point is 00:36:40 If I'm using data that was generated by many peers or maybe by somebody five years ago, I have no idea where this data is. I can go and find this data using this tool. If I'm looking for data in some automatically generated logs, I'm doing a design, I have some kind of a model of a chip, and then I'm running 50,000 tests against this model generating the logs. Now I have to scan through those logs and find failures. Go and run grep.
Starting point is 00:37:10 Can I do this without running this recursive grep? That may take half an hour. That's the main scenario. So can you support, for example, distributed locations, offices sharing the files, searching the files, maybe the data located in different places? All right, so one thing that we are looking at is, let's say I'm doing design of project X,
Starting point is 00:37:44 and some other team in another location is doing design of project Y, but those projects are somehow related, right: we have backward compatibility, or I want to see how my peers have implemented something. So I may, if I have access,
Starting point is 00:37:59 use this tool to find where something is stored there and kind of look at it and share some data and so on. So definitely this should help people to find data, whether this is part of their day-to-day job or someone else's job. Now, accessing this data once you have found it is a different story, right? So I've talked about this previously, I think a couple of years ago: how can we do this cross-site data sharing through NFS caching
Starting point is 00:38:28 and those replication systems that's available on the web. So the index is not stored on NFS. The index, as I mentioned, we use Elasticsearch. So Elasticsearch provides kind of its own file system implementation, somewhat similar to HDFS, where you can basically split the index across multiple nodes that participate in the cluster.
Starting point is 00:39:04 So we use local drives of those low-cost servers that we allocated for the Elasticsearch index. The index information... it's really all a REST API, you can define your API to... Sorry, I just can't hear you. So you have the
Starting point is 00:39:33 you can define your API to position so I just can't hear you because you have the index information saved in your storage, right? Right, the index is stored on the local drives of machines, right. So, do you have a plan to support searching with the API?
Starting point is 00:39:59 Oh, so Elasticsearch offers a REST API. You can basically access it directly. Now, right now, we actually do not expose it because of the access control issues. We have to provide our own layer of authentication and authorization in front of the Elasticsearch index to ensure that people can see only what they are allowed to see. So this is something that we are looking into,
Starting point is 00:40:28 or basically, well, we were looking into, offering it as an option, but that was not required by our internal customers. So they were looking for either the command line or the web UI right now. I'm sure the API demand will come in the future and we probably will have to expose a REST API in some way, but right now it's not a requirement.
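For reference, a direct query against the Elasticsearch REST API would look roughly like the sketch below; it is exactly the kind of access that is kept behind the in-house authorization layer here, and the host, index name, and field names are examples.

```python
import requests

resp = requests.get(
    "http://es-node-1:9200/nfs-fs001/_search",   # example node and per-file-system index
    json={"query": {"match_phrase": {"content": "MY_ENV_VAR"}}},
    timeout=30,
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["path"])                # 'path' is an assumed document field
```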
Starting point is 00:40:55 One more question. So, is this just one feature in your NAS system, or can it be an individual subsystem that can integrate with other
Starting point is 00:41:12 NAS systems, like from NetApp? Let me clarify this again. I mentioned this at the very beginning. There are two parts in this presentation. One is: most of this talk is about what we were doing at my previous job, basically at Intel, to index the data of third-party NAS appliances.
Starting point is 00:41:36 So this implementation has nothing to do with the specific NAS appliance. Now, as of this week, I've moved to Infinidat, which is basically a storage solution provider. And then we will have to look into what we could do, or should do, to ease this type of problem for customers. So everything I've covered here is how it is done if I'm not controlling the NAS itself. And it is not specific; it can be done with any NAS. So you said the crawler will get to know
Starting point is 00:42:05 if there is any change in a file only on the next iteration, right? So in the meanwhile, if the file changes and someone searches for it before the next iteration, can stale results be served in the meantime? Exactly, yes. And this is what I mentioned in the requirements section. That's part of the problem
Starting point is 00:42:24 of the crawling implementation in general. We agreed with the customer that a change will be indexed within 24 hours. So this defines how many crawlers we need, how many indexers we need. And as long as they're okay with this, that's fine. You're right.
Starting point is 00:42:43 If I did make a change, grep finds it immediately and my indexer finds it within 24 hours, so there is a window when the results will be different. If we go from the crawling implementation to kind of a SnapDiff- or inotify-like solution, where we know exactly when a change happens, we can eliminate this problem altogether. Is that NAS system based on SSD, memory, or...? You mean the Infinidat solution? Yeah. Okay, so again, this is not related to this part,
Starting point is 00:43:25 but the way the Infinidat solution works is basically: you get a three-node cluster with eight enclosures of HDDs, six-terabyte HDDs, so you get 480 drives within a rack, with high density, and there is a big cache in front of those HDDs. So that's how you basically get the random access and hundreds of thousands, or over 800,000, IOPS. So we're talking about a 2-terabyte RAM cache, for example, in front of those 2 petabytes of usable space on HDDs.
Starting point is 00:44:16 Other questions? Okay. Thank you very much. Thanks for listening. If you have questions about the material presented in this podcast, be sure to join our developers mailing list at snia.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
