Storage Developer Conference - #14: Instantly finding a Needle of data in a Haystack of large-scale NFS environment
Episode Date: August 5, 2016...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 14.
Today we hear from Gregory Turetsky, Solutions Architect with Intel, as he presents "Instantly Finding a Needle of Data in a Haystack of Large-Scale NFS Environment" from the 2015 Storage Developer Conference.
My name is Gregory Turetsky, and as of the beginning of this week, I'm working at Infinidat. So until the end of last week, I was working for a slightly bigger company, Intel.
So I'll have two parts in this presentation.
Most of the talk here is really related to the work I was doing in my previous life.
But I definitely want to introduce the company that I've moved into as well.
That's also a strange feeling.
I never had to introduce my company before.
So, Infinidat. Has anybody heard of Infinidat before? Okay, great, yeah, that's nice.
So, the company is not new; it was established in 2011, so it's four years old now. It was established by people with a pretty good track record in the storage industry, Moshe Yanai and many others from XIV and other companies.
The company invests heavily in getting patents in the storage field and it's growing.
The product that comes out of this company is basically a unified storage appliance. You can get a single rack with up to 2 petabytes of usable capacity, with 480 6-terabyte drives, with a heavy focus on caching with flash and DRAM, so you can get about 2 terabytes of cache in front of those 2 petabytes of usable disk space. The focus is on high availability. We're talking about seven nines, so it's about three seconds of downtime a year.
Density, multi-protocol: it is unified storage. Basically, up until now, there was a block storage solution. As of this week, there is a new release available that adds NAS functionality with NFSv3, so it's an in-place upgrade of the existing products: you can just upgrade the operating system and you also get NAS functionality. There are plans to add SMB, and there are plans to add object storage later on, around early next year.
The overall power consumption of this max-scale configuration of 2 petabytes usable is up to 8 kilowatts per rack. Over 750K IOPS, a RESTful API; there are production deployments of the SAN solution in several companies, with over 200 petabytes deployed in the world. So it's really impressive.
Talking about the NAS, as I mentioned, NAS product was just announced.
Infinidat used to have a NAS solution, which was kind of a side product in the past.
This is not something that is being sold today.
Today this is coming as a software upgrade to the existing InfiniBox appliances.
In the first release, we are talking about over 250,000 NFS ops per second,
and there are plans to optimize it significantly more.
You can get a single file system
covering the entire capacity of the NAS appliance
or of the InfiniBox system.
So you can get two petabytes of the usable space
and with a thin provision you can go
actually as much as you want on top of it.
And there is support of billions of files per file system.
And with the first release, we are supporting
4,000 file systems per appliance,
per system
with the plan to go to over 100,000
file systems later on.
As I mentioned, the focus
is on highly reliable
solutions.
This is an N plus 2 architecture.
Everything is
multiplied by 3.
We support snapshots
with no performance impact. And this is a really
unified solution. So you get
up to 2 petabytes of usable space.
You can create what we call
pools and you can create file
system or block devices on the same
pools and access them.
Connectivity wise, we offer 24 8-gigabit fiber channel links to the system
or 6 to 12 10-gigabit Ethernet links.
So this is pretty much about Infinidat.
That's for now, and the rest of my slides are about the previous work that
I wanted to share. So how many people here in the room have used a search engine in the last day or two? Pretty much everybody. But if we look at what people were doing just a few years ago, everybody was going to some encyclopedias or whatever.
everybody was going to some encyclopedias or whatever.
That's another type of encyclopedia from Russia,
a big Soviet encyclopedia that I was using back in the 80s.
But that was the idea.
If you want to find something, you go and look for the index,
and you can find the information.
Today, if you want to find something for your own use,
you go to Google or Bing or whatever and get it pretty quickly.
The problem is that at the large enterprises,
you don't always have this capability for your data,
whether it's structured or unstructured, and especially if it is unstructured data.
So one thing that I was looking at
is really can we optimize the way
people are looking for the data in the large scale NFS deployment.
So at my previous job we had over 1,000 NFS file servers worldwide in different locations; there are about 40-plus sites that we had. 35 petabytes of configured NFS capacity, which is divided into tens of thousands, or over 100,000, NFS file systems, which are all glued together with the automounter, so we provide kind of a global namespace. There are about 90,000 compute servers accessing this data, and all of them can go and access the file systems over NFS using the same automounter-based namespace. We're talking about over 100 billion files across those sites. So it's pretty big.
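As a rough illustration of what "glued together with the automounter" means, a global namespace like this is typically built from automount maps; the paths, hostnames, and options below are hypothetical, not the actual configuration described in the talk.

```
# /etc/auto.master (hypothetical)
/nfs    /etc/auto.nfs

# /etc/auto.nfs, one entry per exported file system, identical at every site
projA   -rw,hard,intr   filer01.example.com:/vol/projA
projB   -ro,hard,intr   filer02.example.com:/vol/projB
```

Any of those compute servers can then reach /nfs/projA without knowing which file server actually exports it.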
The problem is that customers are looking for the data
in this environment, and they're not always really sure where this data is and what kind
of data they could be looking for. Let's say there is source code. I'm working on some design, and there are many, many people working on this. My source code, say, is maybe in Git, maybe it's not in Git, but the data is actually in NFS, and I want to find where some function is defined, or where some comment is within the code, or I'm looking for some keywords. Maybe I am supporting a large environment
which exists for many years, so there are many, many scripts running one after the other,
and I'm trying to understand where some environment variable is defined.
So it is somewhere in the file system, so I have no idea where it is.
How can I find it quickly?
Maybe I'm running a large-scale regression that generates hundreds of thousands of logs,
and I want to figure out where some error messages are.
I want to find all kinds of things.
So what people are usually doing
in that environment,
they run grep.
And then
they wait.
And then eventually they get some result.
Well, they don't run the grep on 100 billion
files, obviously.
But they can
get some estimation. This is the file system
where my data probably resides.
I'll run this recursive grep.
I'll wait for half an hour.
Maybe I get results.
Maybe I made a mistake and I rerun the grep and wait another half an hour.
So this is not really very efficient.
And so these were the requirements that we defined for this project.
First of all, I was part of the IT organization.
So from our point of view, we get an NFS file server. This is a black box.
I cannot change anything in what the NAS provides; I can do something around it to improve my experience. Another thing that we
discussed with our customers is how soon they want to
see something as a result of the search. So, grep might be slow, but this runs on the actual
data. So, if I created a new file, and then I run grep on it, I'll get the results immediately. If I do some kind of an indexing solution,
how long are my customers ready to wait until the data that was generated appears in the index? And so this is basically the indexing SLA, and we came to the conclusion that it is 24 hours: if something is created on the disk, they want it to be available from the indexing system within 24 hours.
Another thing that we wanted to ensure we do not get into is any kind of denial of service for the NAS clients.
I can scan my data.
I have to do some kind of a crawling and look for data and index it.
I don't want to see a problem of generating too much load on the file system
instead of the regular production workload.
And we have different types of NAS appliances, so some of them may be stronger and some of them may be less powerful. How can my system take that into consideration and throttle the load generated by the scanners to avoid denial of service?
We didn't want to invest in developing many things ourselves, so the focus was really on reusing things, whether it's off-the-shelf industry solutions or things that we had already developed within the company.
Another thing that we
wanted to provide is an ability not just
to look for some data, right, and looking for whatever, some keyword, but an ability to provide some hints.
So same way as you go to Google and you can specify,
look for something only on one specific site or within specific DNS domain,
or you want to look for something and this should be just PDF files,
we want a similar type of solution here.
So I want to look for something
that should be only within the files that are owned by user X, or files that are created
before or after some timestamp, and so on. And something else that we looked into said,
okay, we'll start with this NFS solution, but can we really use the same capability
to extend it and provide the search capability across all data? Whether it's a SharePoint, whether it's maybe websites,
wikis, databases and so on. So can it be extensible enough to go beyond just NFS indexing?
So we looked into a few options. One was to go with a commercial product.
Google gives you an appliance you can install at your company.
It runs basically the same algorithms that Google is using in the wild.
And you can use it.
There are two concerns that we had.
The Google licensing model is based on the number of indexed documents. We didn't want to pay for 100 billion documents indexed.
I'm sure they would be happy.
The other thing was that
it works great for web
and internet and SharePoint.
There were some concerns that we had back at the time about scalability and support for NFS crawling with the Google appliance. And then we looked
into two open source solutions. One was for actual indexing capability.
We looked on Solr and we looked on Elasticsearch. Both rely on
Lucene indexing libraries underneath it.
So we basically tried to consider those two. With Solr we had some concerns with
the scalability. We ran some basic tests and decided not to go with that.
And we also wanted to use it for some additional use cases
beyond just NFS data indexing.
So that was also less applicable.
And there was also Elasticsearch, from the company that is now known simply as Elastic.
So we decided to go with this capability.
How many people here are familiar with Elastic?
Okay. Elastic comes from an open source product. There is a company behind it now that you can get commercial support from, and they continue developing this product.
This is a scale-out index. You can bring up a cluster of multiple nodes very easily, and Elastic creates an index that spans multiple machines, so basically, if you need more capacity for your index or you need more performance, you add more nodes. This cluster can grow very easily. Elasticsearch was, based on our benchmarks, ready to meet the challenges. One thing that was missing was really the crawling side.
We had to find a solution for the crawling. There are some open source solutions for efficient crawling which are more web-focused. There is an open source product called Nutch, which is used by many companies in conjunction with Solr, for example, to do web crawling, similarly to Google and other web search engines. It is less applicable for file system crawling. So we decided to implement our own crawler.
So how does this all work together?
I'll talk later on about the components. First of all, we have to provide as an input to the system a list of NFS file systems that
should be indexed, and some metadata related to these file systems. For example, we want
to ensure that we specify that file systems A, B, and C belong to project X, and file
systems X, Y, and Z belong to project 2.
So later on, when I'm trying to find the data,
I can define some boundaries.
I'm looking for the data that's related to project X only.
Another thing that I want to provide as an input
is whitelists and blacklists.
Basically, what kind of files I do want to index
and what kind of files I don't care about.
For example, maybe I don't care about the binary files, but I do care about PDFs or documents or source code.
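A minimal sketch of what this per-file-system input might look like, assuming a simple Python structure; the field names, paths, and patterns here are invented for illustration and are not the actual internal format.

```python
# Hypothetical input to the indexing system: file systems tagged with project
# metadata, plus whitelist/blacklist patterns controlling what gets indexed.
INDEXING_INPUT = {
    "filesystems": [
        {"path": "/nfs/site1/projx_fs01", "project": "projX"},
        {"path": "/nfs/site1/projx_fs02", "project": "projX"},
        {"path": "/nfs/site2/projy_fs01", "project": "projY"},
    ],
    # Index only these kinds of files ...
    "whitelist": ["*.c", "*.h", "*.pl", "*.py", "*.pdf", "*.log", "*.txt"],
    # ... and never these, even if they also match the whitelist.
    "blacklist": ["*.o", "*.so", "*.a", "*.bin", "core.*"],
}
```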
So then,
we have to provide some kind of a scalable
solution for crawling.
We have those thousands
and tens of thousands of file systems to scan.
We have billions of files.
There is no way you can do it on a single machine. We need a scaled out architecture
and we have to provide this in our implementation. The crawler has to find what files should
be indexed. Then later on we have to read those files and send them to the indexing system,
which is basically the Elasticsearch cluster.
Elasticsearch creates the index.
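As a sketch of the crawler stage (the real in-house implementation is not shown in the talk, so this is only an assumed shape), each crawl job walks one file system and emits the files that pass the whitelist/blacklist filter; the readers then fetch those files and push them to Elasticsearch.

```python
import fnmatch
import os

def crawl(filesystem_root, whitelist, blacklist):
    """Walk one NFS file system and yield paths that should be indexed.

    One crawl job handles one file system; the batch scheduler decides how
    many such jobs may run against the same file server at once.
    """
    for dirpath, _dirnames, filenames in os.walk(filesystem_root):
        for name in filenames:
            if any(fnmatch.fnmatch(name, pat) for pat in blacklist):
                continue  # blacklisted, e.g. binaries
            if any(fnmatch.fnmatch(name, pat) for pat in whitelist):
                yield os.path.join(dirpath, name)
```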
Then we have to provide an interface for the end user to go and look for the data.
It could be Google-like web UI,
or, as our customers were specifically requesting,
it has to be basically in-place replacement for grep.
Grep was used for many, many years.
They are used to this as an interface in the Unix command line.
So they said, we want to have the same thing.
Just instead of going and doing the recursive search for NFS,
we want this to be instant going to the NFS index.
One more thing that we have to provide is security, how we can ensure that if I have access to the file, I can see the results.
If I don't have access to the file, I shouldn't see the results as well.
And, again, we want to make sure that those crawlers and readers do not really kill our file servers.
So you're doing an access filter after you've done all the indexing, and then, say you're my manager, you may have access to a file, but it just wouldn't show me the results?
Exactly, yes.
So basically, I'll talk about this later, but the way we do this, those guys have root access to NFS, scan everything,
and then we also look into the access ACLs, basically,
or Unix permissions on the files.
How fine-grained is that? Down to the file level?
Not at this point.
Right now it's at the file system level.
So I'll go through each one of those elements that I had on this previous chart.
First of all, we have these pools of crawlers and readers.
Regardless of this project, we have our in-house developed batch scheduling system.
It is similar to Sun Grid Engine, LSF, and so on. So we have something that is developed in-house that we use to manage
workload on those tens of thousands of
servers.
And we decided
to use the same system to basically
manage those pools of crawlers and
readers. We didn't want to invest
in something new.
One thing that we can
get as a feature of the
scheduling system is an ability to manage
multiple queues and define all kinds of
parameters for those queues. For example,
how many jobs can run concurrently
within the queue?
So the way we defined the pool of those crawlers and the pool of readers is basically
we said, for every file server that we want
to scan, we'll define a separate queue.
And for every queue, we define a parameter called max running jobs.
So if this is a more powerful file server, we can run more jobs in parallel on that server.
If this is a less powerful server, we can define max running limit as a lower number.
So there will be less jobs running, and we'll have less impact on the performance of that server.
So from the scheduling point of view, we can submit all the jobs for scanning into the pool,
and then the scheduler will figure out how many concurrent jobs per server it can run.
The way that we define those jobs is basically one job per file system. If I have to scan 100 file systems, I'll run 100 jobs.
Each one will crawl through the related file system.
If it takes longer than I need, I can add more nodes into the crawling pool or into the readers pool,
and so I can fan out more of those.
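The in-house batch scheduler is not public, so as an illustrative stand-in, here is how the "one queue per file server, with a max-running-jobs limit" idea could be expressed with plain semaphores; the server names are hypothetical and the limits are just the 10-versus-3 example used later in the talk.

```python
import threading

# Hypothetical per-file-server limits standing in for the scheduler's
# "max running jobs" queue parameter: stronger filers get a higher limit.
MAX_RUNNING = {"filer01": 10, "filer02": 3}
_slots = {server: threading.Semaphore(n) for server, n in MAX_RUNNING.items()}

def run_crawl_job(server, filesystem, scan):
    """Run one crawl job without exceeding the server's concurrency limit."""
    with _slots[server]:
        scan(filesystem)
```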
We talked about the queuing, creating a queue per file server so the scheduling system basically can launch more or fewer jobs per server.
We are also looking into defining this more as dynamic versus static limits. Right now, in the first implementation, these are static limits. I'm saying that this server is capable of running more workload; it can sustain 10 concurrent scans. This server maybe is less capable; it can do only 3 concurrent scans.
What we do plan to
implement there
is basically an ability
to define automatic throttling.
If we see that the latency of scan increases,
then we would throttle down the number of concurrent jobs
and then maybe bring it up again.
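The automatic throttling is only planned, so the following is just one plausible feedback rule, assuming scan latency is sampled periodically: back off when latency rises, ramp up again when it recovers.

```python
def adjust_max_running(current_limit, observed_latency_ms,
                       target_latency_ms=50.0, floor=1, ceiling=10):
    """Return the new concurrent-job limit for one file server.

    Purely illustrative: the target latency, floor, and ceiling here are
    assumptions, not values from the talk.
    """
    if observed_latency_ms > target_latency_ms:
        return max(floor, current_limit - 1)  # NAS is struggling: throttle down
    return min(ceiling, current_limit + 1)    # headroom available: ramp back up
```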
As I mentioned, there was another question.
We do provide root access to NFS,
so the crawler and the reader have to be able to access every single file.
Two additional things that we had to consider. One, we heavily rely on the atime parameter of the files for other purposes.
For example, we do
other types of scans that
define whether
the data can be recycled.
And we can notify customers saying,
you know, this data was not accessed for over 180 days or for over a year.
Maybe you don't need to buy a new server.
Maybe you should just go and remove this data.
So access time, parameter for files, is important in our environment.
And every time we introduce a new kind of scanner,
especially the scanner that reads the data, like in this case,
we don't want this to reset the atime. So we looked into a few options. We looked into maybe using snapshots, as a kind of read-only file system, to do the scanning. For several reasons we decided not to go with the snapshots. What we ended up doing is basically our crawler, or rather not the crawler but the reader, goes to the file, reads it, and then resets the atime to the previous value. So every time we read a file, we set the atime back.
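A minimal sketch of that reader behaviour in Python, reading the file and then putting the original atime back with os.utime; note the reset is not atomic, so a genuine access happening in between could be masked.

```python
import os

def read_preserving_atime(path):
    """Read a file for indexing, then restore its original atime/mtime."""
    st = os.stat(path)                 # remember the timestamps before reading
    with open(path, "rb") as f:
        data = f.read()                # this read bumps atime on the server
    os.utime(path, (st.st_atime, st.st_mtime))  # set atime (and mtime) back
    return data
```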
So if somebody does a search and then they go
and open that file, do you change any kind of...
Right. If he opens the file, then this file is probably
needed. That's not
a big deal. And we also don't expect
that somebody will open a billion
files. They will probably open one or two.
Another problem that this type of scanning may create is with tiering. If we have a storage solution that supports automatic tiering based on the number of accesses, no solution today would let us exclude these scans from the decision about migrating data from tier A to tier B based on the number of accesses. Usually, if the data is accessed once, it will be fetched from the lower tier to the upper tier. So this is one of the problems that we have to remember.
From the indexing point of view, again, if you're familiar with Elasticsearch, you can create multiple indexes.
Usually with Elasticsearch, it is used to store time series data.
In this case, usually the indexes are created by timestamp, like daily or hourly, depending on the number of records that have been indexed.
In our case, we're not looking into the time series.
This is more indexing of data.
So we create an index per file system.
Those indexes are sharded between multiple nodes within the Elasticsearch cluster.
That's a feature provided by Elastic.
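For illustration, indexing a document into a per-file-system index might look roughly like this with recent versions of the official Elasticsearch Python client; the cluster endpoint, index naming scheme, and document fields are assumptions, not the project's actual schema.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es-node1:9200")  # hypothetical cluster endpoint

def index_name_for(filesystem_path):
    # One index per file system, e.g. /nfs/site1/projx_fs01 -> nfs-site1-projx_fs01
    return "nfs-" + filesystem_path.strip("/").replace("/", "-").lower()

def index_file(filesystem_path, file_path, content, owner, mtime):
    # Sharding of each index across the cluster nodes is handled by
    # Elasticsearch itself, as mentioned above.
    es.index(
        index=index_name_for(filesystem_path),
        document={"path": file_path, "owner": owner,
                  "mtime": mtime, "content": content},
    )
```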
We define whitelists and blacklists.
Basically, I can say I care about C and E and Perl and Python and whatever else,
types of files and PDFs, but I don't care about binaries.
One other thing that we look into is an ability to define different types of parsers for the indexer.
This is another feature that is available from the
open source community. There is a set of parsers from the Apache Tika project that we can use. Right now, we don't use them, we use just basic readers. But in general, Tika parsers can be used to read content from different types of files. Maybe in the future we'll want to add scanning of things like image files,
and I want to get from the image where it was taken
and when it was taken from the header of the image.
So we can define parsers for that, and understand the JPEG format or whatever.
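If Tika were plugged in, the hook could be as simple as the following sketch using the tika-python bindings (which talk to a local Tika server/JAR); this is an assumption about how it might be wired up, since the current implementation uses plain readers.

```python
from tika import parser  # third-party Apache Tika bindings

def extract_text_and_metadata(path):
    """Return extracted text plus metadata (e.g. EXIF fields for a JPEG,
    author/title for a PDF) instead of reading raw bytes."""
    parsed = parser.from_file(path)
    return parsed.get("content"), parsed.get("metadata")
```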
An additional thing that we wanted to add is updates to the index.
So let's say I've indexed my file system for the first time.
Then there are changes happening.
There are some files removed, some files created, some files changed.
I don't want to re-index the entire file system every time.
I want to figure out what has changed there, find those changes, and index only them.
So that's something else that we had to implement.
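One simple way to find those changes (a sketch of an approach, not necessarily the one used internally) is to re-walk the tree and compare each file's mtime/ctime against the timestamp of the previous scan; deletions are handled separately by diffing against what is already in the index.

```python
import os

def changed_since(filesystem_root, last_scan_epoch):
    """Yield files modified or created after the previous scan."""
    for dirpath, _dirnames, filenames in os.walk(filesystem_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished between the listing and the stat
            if max(st.st_mtime, st.st_ctime) > last_scan_epoch:
                yield path
```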
We also looked into configuration of this Elasticsearch cluster,
how it should be tuned to meet our needs.
Additional component of the system is the user interface.
So definitely we looked into the web UI.
We're looking for something very easy.
Again, Google-like, Bing-like interface.
Customer's demand was pretty clear about the command line
and in-place replacement for grep.
So in many cases, they just prefer to run this instead of grep command.
There are different discussions that we had about
whether we want to really provide just
grep-like interface or want to expose
really more powerful
Lucene API, Lucene
interface to the search.
We could do something like say
look for documents that in the title
contain something
and there is a text that says go.
This is not something that you do easily in grep,
but this is supported by the index.
So we ended up providing both ways of searching. Right now people are actually still using the grep-like interface more; understanding and getting used to the other kind of interface requires more training.
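As a hypothetical example of the difference, a grep-style search can only match raw text, while the index accepts Lucene-style queries that combine fields; the field names below are illustrative, not the project's actual schema.

```python
# What the grep-like wrapper effectively asks Elasticsearch to do:
grep_like = {"query": {"match_phrase": {"content": "MY_ENV_VAR"}}}

# What the richer Lucene interface allows, e.g. "the title contains
# 'design spec' AND the body contains the word 'go'":
lucene_style = {
    "query": {"query_string": {"query": 'title:"design spec" AND content:go'}}
}
# es.search(index="nfs-projx-*", body=lucene_style)  # scoped to one project's indexes
```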
Security aspects.
So I mentioned this briefly, but we do encrypt the index.
There are many things that we do to ensure that only people who really should be able to access the cluster can do so.
Search access control, right?
So that's something that is very important.
I don't want to expose data to somebody who is not supposed to see this data for the search.
So we had to provide some kind of access control capability
that was not available from the open source solution,
and that was implemented in-house.
Basically, this is done at the file system level granularity right now.
We do want to go more to the file level,
but at this point this is done at the file system level.
So I'm looking at the user and group
who has access to the entire file system
at the mount point level.
And then when somebody runs a search command,
we check which groups this person belongs to.
If he belongs to a group that has access to the file system,
he will see the results for that file system.
If he does not, he won't get it.
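A minimal sketch of that check, assuming we have recorded the Unix owner and group of each file system's mount point; the helper name and data shape are invented for illustration.

```python
import grp
import pwd

def allowed_filesystems(username, fs_acls):
    """Return the file systems whose results this user may see.

    fs_acls maps a file-system path to the owner and group of its mount
    point; access is granted at file-system granularity, as described above.
    """
    user_groups = {g.gr_name for g in grp.getgrall() if username in g.gr_mem}
    user_groups.add(grp.getgrgid(pwd.getpwnam(username).pw_gid).gr_name)
    return [fs for fs, acl in fs_acls.items()
            if acl["owner"] == username or acl["group"] in user_groups]
```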
So where this project is now,
there is a pilot happening with one project.
There are about 85, roughly about 100 file systems today,
close to 100 terabytes of data and half a billion files.
So it's a relatively small scale compared to, obviously, the entire environment,
but this still gives us some pretty good understanding of usability of the solution and scalability.
So out of this 71 terabytes, or slightly more now, that we index,
the actual index size that we create is about 20 terabytes or less than 20 terabytes.
There are a few reasons for this. One is that we do not scan every single file; we don't index every single file. We have those whitelists and blacklists defined, so many files are not in the index by design. And also there are all kinds of compression and optimization done on the index side.
One more thing that I didn't mention on the indexing side is a challenge: many things that are stored in our NFS environment are compressed with gzip or bzip2 and so on.
And then customers are interested in indexing this data also.
So this is something that we are thinking
about how to do it in the future, maybe kind of
unzipping the data as part
of the scanning and indexing it
and so on.
So access control is implemented, user
interface is in place
and it essentially works, right?
So we have this
example of somebody's running
a search.
This may take 30 minutes or more,
and the person is just waiting for the results and goes for the coffee or whatever.
And then I'm just using this in-place replacement, the new command, which gives a very similar interface, and it comes back in less than a minute, really in less than a second.
Because, again,
assuming the data is in the index,
now I can take it actually further. I don't really
have to specify where
to look for the data. I can look for
the entire index everywhere.
And the time that it takes is really the same.
So this
is something that was not available at all.
People would not run a recursive grep on 100 file systems and 70 terabytes of data.
And now they can do it.
So this is what we have done in my previous job.
Now, as I moved to the company
that actually does work on the NAS development,
or SAN,
we can think what we could do as a NAS vendor now
to make things better.
If we do have access to what the NAS is doing.
One thing that I can see is that
many companies are doing all kinds of scans.
Those are just a subset of scans that we were doing at my previous job.
I want to do indexing.
That's what we just talked about.
I'm scanning all the data.
I want to do replication.
I'm replicating tons of data between different sites for cross-site sharing and collaboration.
I'm running rsync all the time.
By the way, with the indexing, what I didn't mention: the way it works, we scan the file system, and the moment the scanning ends, we run it again to find what has changed.
Same thing with replication.
I'm replicating data from Israel to California and Oregon
and from Oregon to Texas and India and everywhere else,
talking about petabytes of data replicated
and
thousands or
probably hundreds of thousands of
replication tasks running every day
copying data
across different locations.
So all those are doing
scans of the file systems
looking for changes. What has changed?
Identify those changes, copy them.
Aging, this is another thing that I mentioned. So we have another type of scan that goes through the entire file system looking at aging, at atime. If the files on file system X are not accessed for more than X days, we can indicate that this file system is a candidate for recycling.
We do all kinds of legally required scans: I'm looking for controlled technology in files and making sure that this data is not replicated to controlled countries or whatever.
There might be antiviruses running.
Maybe I'm doing backups.
Again, I'm scanning for all the data.
All those things
create a lot of noise and a lot of extra load on the file servers.
They may interfere with each other.
And the main reason why we do all those scans and crawling the file systems again and again
is really to find what has changed.
If I were able to just get a list of changes and say: this is the file system, I made a one-time inventory of what is there, and from that point in time I'm getting an incremental-forever type of thing, a generated list of changes, that would be a tremendous help. So there are some things that exist today, like the inotify interface in Linux, or fanotify, and so on. Unfortunately, this doesn't work for a NAS appliance. I cannot have this on a NAS.
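For comparison, this is the kind of change feed inotify gives you on a local Linux file system, sketched here with the third-party watchdog package; the point of the talk is that nothing equivalent is exposed by a NAS over NFS.

```python
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class ChangeLogger(FileSystemEventHandler):
    def on_any_event(self, event):
        # On a NAS, we would want this stream of events from the appliance
        # itself instead of having to crawl for changes.
        print(event.event_type, event.src_path)

observer = Observer()
observer.schedule(ChangeLogger(), "/local/data", recursive=True)
observer.start()
```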
There are some things that are available from some vendors. You can use something like the FPolicy interface on NetApp, but this is vendor-specific; it's not something that could be used globally. And, for example, NetApp folks really don't like people using FPolicy. They are very concerned about the performance impact from that.
So, one thing that we are looking at at my new job is really: how can we do something like that as a feature of the NAS appliance?
Can we provide an easy-to-use capability, which would be inotify-like, or some kind of SnapDiff-like solution, that really gives me an instant view of what has changed on my file system, and provide an ability to generate this report easily for the customer, so they can integrate it into any kind of solution that they use, whether it is replication or an antivirus solution or indexing and so on?
So this is one part of what we could do. Just generating this, eliminating the need for the crawler, would be helpful in many cases.
But can we really go beyond that and provide a search capability as a feature of a NAS appliance, right? We have had this on Windows laptops for the last, what, five or six years. I can go and search for data. If I don't really remember where my file is, I can still find it. I don't have to remember my directory structure on the local disk. Can I get something like that from my NAS appliance? Can I try not to require the customer to remember the directory structure?
It still should be there.
We still want to ensure POSIX compatibility and everything like that.
But can we provide an extra feature with kind of indexed content?
And this is another thing that we are considering as well. That's it. Questions, either on Infinidat or on this side?

So, we're in the NFS track. Is there anything particularly tied to NFS in this? Is there any reason that you couldn't do this on some other file access protocol?
So, there is nothing very tightly tied to NFS. It could definitely run on other protocols. One thing, or the thing that is maybe more specific, but it's pluggable, is access controls, right? On Unix, we get the owner and the group, and on SMB, we may have to read the ACLs from the server, something like that. But other than that, it's not very specific to NFS.
Other questions?
So, what kind of
customer
scenarios
do you have for us?
Say it again?
Customer what?
Scenarios.
Oh, customer, okay.
So, I don't know if you
attended this,
but I was talking about this
initially.
What kind of data people are looking for, right?
So we have, again, on my previous job, we had thousands of designers worldwide who are doing chip design.
And all the data that is related to chip design or software development is in NFS.
So whether it's in some way a source control system
or maybe just sitting directly in NFS space.
Now, if I have created a file, I probably know where this file is.
If I'm using data that was generated by many peers or maybe by somebody five years ago,
I have no idea where this data is.
I can go and find this data using this tool.
If I'm looking for data in some automatically generated logs,
I'm doing a design, I have some kind of a model of a chip,
and then I'm running 50,000 tests against this model generating the logs.
Now I have to scan through those logs and find failures.
Go and run grep.
Can I do this without running this recursive grep?
That may take half an hour.
That's the main scenario.
So can you support, for example, distributed locations, where offices share files and search files, maybe with the data located in different places?
All right, so one thing that we are looking at is, let's say I'm doing the design of project X, and some other team in another location is doing the design of project Y, but those projects are somehow related, right? We have backward compatibility, or I want to see how my peers have implemented something. So I may use this tool, if I have access, to find where something is stored there, look at it, share some data, and so on. So definitely this should help people to find data, whether this is part of their day-to-day job or someone else's job.
Now, accessing this data once you've found it is a different story, right? I've talked about this previously, I think a couple of years ago, about how we can do this cross-site data sharing through NFS caching and replication systems; that talk is available on the web.
So the index is not stored on NFS.
The index, as I mentioned, we use Elasticsearch.
So Elasticsearch provides kind of its own file system implementation,
somewhat similar to HDFS,
where you can basically split the index across multiple nodes
that participate in the cluster.
So we use local drives
of those low-cost
servers that we allocated for the
Elasticsearch index.

[Inaudible audience question.]

Sorry, I just can't hear you.

Because you have the index information saved in your storage, right?
Right, the index is stored on the local drives of machines, right.
So, do you have a plan to support searching with the API?
Oh, so Elasticsearch offers a REST API.
You can basically access it directly.
Now, right now, we actually do not expose it
because of the access controls issues.
We have to provide our own layer of authentication
and authorization in front of the Elasticsearch index
to ensure that people can see only what they are allowed to see.
So this is something that we were looking into offering as an option, but it was not required by our internal customers. They were looking for either the command line or the web UI right now.
I'm sure the API demand will come in the future
and we probably will have to expose REST API
in a way, but
right now it's not a requirement.
One more question.
So, is this just one feature in your NAS system, or can it be an individual subsystem that can integrate with other NAS systems, like from NetApp?
Let me clarify this again.
I mentioned this at the very beginning.
There are two parts in this presentation.
One is, most of this talk is about what we were doing at my previous job,
basically at Intel, to index the data of third-party NAS appliances.
So this implementation has nothing to do with the specific NAS appliance.
Now, as of this week, I've moved to Infinidat,
which is basically a storage solution provider.
And then we will have to look into what we could do or should we do to ease this type of problems for customers.
So everything I've covered here is how it is done if I'm not controlling the NAS itself.
And it is not specific.
It can be done with any NAS.
So you said the crawler will only get to know about a change in a file on the next iteration, right? So in the meanwhile, if the file changes and someone searches for it before the next iteration, can the result be stale?
Exactly, yes.
And this is what I mentioned in the requirements section.
That's part of the problem
of the crawling implementation in general.
We agreed with the customer
that the change will be indexed within 24 hours.
So this defines how many crawlers we need, how many indexers we need.
how many indexers do we need.
And as long as they're okay with this,
that's fine.
You're right.
If I make a change, grep finds it immediately and my indexer finds it within 24 hours, so there is a window when the results will be different. If we go from the crawling implementation to a SnapDiff or inotify-like solution, where we know exactly when a change happens, we can eliminate this problem entirely.
Is your NAS system based on SSDs, memory, or HDDs?
You mean the Infinidat solution?
Yeah.
Okay, so again, this is not related to this part, but the way the Infinidat solution works is basically you get a three-node cluster with eight enclosures of HDDs, 6-terabyte HDDs, so you get 480 drives within a rack with high density, and there is a big cache in front of those HDDs. So that's how you basically get the random access and hundreds of thousands, or over 800,000, IOPS. So we're talking about a 2-terabyte RAM cache, for example, in front of those 2 petabytes of usable space on HDDs.
Other questions?
Okay. Thank you very much.

You can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.