Storage Developer Conference - #42: The Role of Active Archive in Long-Term Data Preservation
Episode Date: April 26, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 42.
Today we hear from Mark Pastor, Director of Archive and Technical Workflow Solutions with Quantum,
as he presents the role of Active Archive in long-term data preservation
from the 2016 Storage Developer Conference.
The session is called The Role of Active Archive in Long-Term Preservation. A little bit redundant, but
I'm representing the Active Archive Alliance, I'm with Quantum Corporation,
and I'm very much involved in archive use cases.
My specific area of focus at Quantum is archive and technical workflow.
And basically, just so you know, in Quantum terms what that means is we have other people focused on media and entertainment,
we have people focused on surveillance, and then we've got some people focused on backup,
and then I focus on what's left for the most part, even though I get involved a lot in some of the other ones as well,
like surveillance opportunities and media and entertainment, so I can speak to that. But with the Active Archive Alliance, our mission in life actually is to provide
open systems ways of developing solutions that can enable people to
access all their data all the time and that's why we call it Active Archive.
So the point of archive, of course, is data retention. But what we've seen happening in
recent years is that people have traditionally done archive by, you know, the original way
of doing it: I'll take my backups and I'll send them off to Iron Mountain, or I'll store
them somewhere for a long time, and if I need that data back I'll have to go back through the backup software to retrieve it,
because tape is a very attractive technology from a cost perspective.
And there were only a few ways of getting to tape:
either you had your own homegrown application,
which has happened in some of the key industries like oil and gas and things like that,
but for the most part, the broad market used their backup apps to get to tape.
Tape was cheap.
That's what they wanted to archive on.
And so if they ever wanted the data back,
they'd have to come back through that process.
So data is getting very, very valuable,
and intelligence in companies is getting richer.
And so they want ways to get their data back
more actively than that process.
And so there have been ways developing over the past 5 or 10 years
to give people better access to economical storage like tape.
And this is not all about tape, by the way,
but object storage has come out as an option as well.
You know, cloud, I'll talk about all of the above.
And so there's new technologies available that can offer cost-effective retention of content,
but also provide active access.
And tape and the other ones can be just as good.
So it's all about providing ease of use, scalability, cost, and compliance.
And we'll talk about the reasons why people retain data in the first place.
So my assumption is that you're all here
because you're looking at developing new solutions that store stuff. That's the Storage Developer Conference.
And so what I hope to provide
in today's discussion is really just give you a sense of what are
some common motivations that we're seeing in the marketplace.
I interact with customers quite a bit, so why are people looking to
hang on to data for longer periods of time?
And then, in helping people architect and think about total solutions for that retention of data,
I've learned there are actually a lot of considerations.
And so we'll talk about many things that enter into the conversation
when you talk about holding onto data for a long time.
And what are some of the technologies that are of interest?
Really, what are the common themes that I see across all the industries in terms of their data retention strategies?
And we'll walk through a couple of real-life examples so you can kind of see what other companies have done, just to give you a sense of that.
And by the way, I'm delighted if you guys make it interactive if you want to ask questions.
I don't want it to be death by PowerPoint unnecessarily, so feel free to jump in and ask questions.
Okay, so when we're talking about archive, basically, you know, I'm talking about
long-term preservation. Some people think 90 days might be even too short a period of time to think
about this. The reason I say that, the reason it's important to understand that, is because of
the most common thing. You know, I'm involved in selling archive solutions, and what's
one of the most common competitors I have? It's the status quo. It's people not doing anything special.
It's like, let me just buy some more primary storage, you know, and let me just keep
scaling up my storage that way. But I am seeing, without a doubt, more and more
people come to us now and saying, I can't afford to do that anymore.
The data is getting way too rich, the sensors are getting much higher resolution, cameras,
things like that.
So everybody is really feeling this data growth problem and it's hurting in all kinds of areas.
It's the cost of storing the data if they're just doing it the same way.
It's impacting the backup process, because if they keep backing that stuff up
they're increasing their window and burdening their whole system. It's really, really hard
to manage the data growth that's happening in certain environments. So anyway,
so that's why we have to think about this retention thing because people want to retain
more data because the data is more valuable and it's also growing much faster than it
ever has and it's
breaking a lot of people's environments.
And sometimes they're storing
it because they like the value. Sometimes they're
storing it because they're in an industry where they have to
whether it be medical records or financial
records, things like that. So sometimes
there's compliance motivation. Sometimes it's both.
I might have missed something, I don't know, but those are the two key
things that come to mind all the time.
I mean, I don't know a lot of people that feel great about
deleting their data, so everybody seems to
want to keep it, particularly if it cost them
a lot of investment to get it in the first place.
You know, if they have to go out and do a seismic
blast and gather a whole bunch of stuff,
that's hard to repeat.
It's kind of a strange way to word the slide, but when I say, when is archive
justified, what that means, as I talked about, is
there are people who just keep building up their primary storage.
And you know what, that's fine
if they don't have too much.
And so, as I already talked about, kind of the problems:
if your backup is busted, or your bank is busted
because you don't have the budget,
that's really when people start taking archive more seriously.
If they're talking about small amounts of data,
if it's a small shop and they're talking about 10 or 20 terabytes of data,
you know, go ahead and buy more primary storage,
because I don't know that there's a lot of solutions out there
that are going to fix that problem any different
than just buying some more primary storage.
But when you start talking about 30, 50, 100 terabytes for sure,
or more, then you can definitely demonstrate
some key economic value.
You can definitely articulate how it will help
the rest of their processes
by having an infrastructure that
is specifically oriented toward retaining data.
And then I kind of outline a table here, but these are some of the key ingredients.
I try to keep it simple.
If you want to look at the economics, you know, if somebody's going to invest in an
archive solution, what should they look at?
And so look at the cost of your primary storage, look at the cost of your backup storage,
look at the cost of your software to do the backup,
and then add up the cost of those kind of key elements for your archive solution,
you know, storage, whatever software you needed to implement it, and those sorts of things.
And you can do that analysis pretty quickly.
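To make that kind of quick analysis concrete, here is a minimal back-of-the-envelope sketch. Every per-terabyte price and the inactive fraction below are made-up placeholders, not figures from the talk or from any vendor; plug in your own quotes.

```python
# Back-of-the-envelope archive economics. All numbers are placeholders.

capacity_tb = 100          # total data under management
inactive_fraction = 0.6    # assumed share untouched in the past 6-12 months

# Status quo: everything on primary storage, all of it backed up.
primary_per_tb = 500       # assumed primary storage $/TB
backup_per_tb = 150        # assumed backup storage $/TB
backup_sw_per_tb = 50      # assumed backup software $/TB
status_quo = capacity_tb * (primary_per_tb + backup_per_tb + backup_sw_per_tb)

# Archive option: inactive data moves to low-cost archive storage,
# plus whatever software or gateway was needed to implement it.
archive_per_tb = 90        # assumed archive storage $/TB, all in
archive_sw_per_tb = 40     # assumed gateway/data-mover software $/TB

active_tb = capacity_tb * (1 - inactive_fraction)
inactive_tb = capacity_tb * inactive_fraction
with_archive = (active_tb * (primary_per_tb + backup_per_tb + backup_sw_per_tb)
                + inactive_tb * (archive_per_tb + archive_sw_per_tb))

print(f"status quo:   ${status_quo:,}")       # $70,000
print(f"with archive: ${with_archive:,.0f}")  # $35,800
```

With these placeholder numbers the archive option comes in at roughly half the cost, which is the order of magnitude the talk suggests.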
And one of the other things, actually I don't think that it's in the presentation,
but one of the things I will tell you is absolutely real.
When people are storing this data,
we're talking mostly about unstructured data, as opposed to the database transactional stuff,
because you can break that apart.
If you look at the file data, that's what I mean by unstructured data:
the stuff that comes in from sources that people work on,
not really transactional database stuff.
And if somebody's got 100 terabytes of data and you do the analysis or you ask them,
some people know, some people don't know,
how much of that data has been touched by anybody in the past
six months or the past year or something like that.
It's about,
you know, it'll range anywhere from 20
to 50%. So that means, you know,
50 to 80% of your data,
you're saving it because it's
valuable, but it's not
really being used
very actively right now.
It might be used next month,
and sometimes you have no idea when it's going to be used,
but that's why you're keeping it.
So the fact that data is inactive for a period of time means it's a good candidate to be part of an archive infrastructure,
because it doesn't have to be managed the same way
as the stuff that's changing daily,
which has got to be involved in an active backup process and things like that.
But if I have something that I'm done with,
and I'm really not going to change it,
I might want to reference it at some point in time,
or I might want to edit it later,
I can put it over in this new infrastructure,
and I don't have to have it backing up on a daily basis.
I can have a backup copy or two of it somewhere,
whether it be in the cloud or off-site
or something like that, but there's no
reason for it to be part of a normal daily process.
So we talk about durable
archives and that's really important.
It'll probably take me an hour and a half
to go through these slides if I stay at this pace.
So let me skip through some stuff just to make it quick. Every industry...
I wanted you to.
Yeah. No, go ahead. I'm actually pretty good at keeping pace. So if we bog down somewhere, I'll speed up somewhere else. Did you have a question?
No.
Okay.
So anyway, so where do we see these problems?
Everywhere.
There is absolutely no industry that I've seen that is immune to this kind of data growth
and these problems that we're seeing.
Really cool examples everywhere.
And there's a lot of great data.
Just the technology of creating, generating, and editing data has gotten so good that
there's really interesting data everywhere, and people are using it.
So a little workflow diagram.
The key point of this really is to say, for the most part,
that this is kind of like the storage tiers sitting over here,
and this is kind of the less active side of a workflow.
We call it the archive side.
So this is going to be the stuff that costs less.
So if you're going to retain data for a longer period of time, you want to try to have it live on stuff that isn't
going to cost as much. If you're involved in a highly active workflow, you're probably
going to spend more on that storage because you need it to perform much faster, you need
it to be very accessible. And so like I said, data comes in at any point.
It's funny, a lot of people think, well, data starts here,
where I'm working on it very quickly, and then over time it will migrate over here.
Well, a lot of use cases, data is going to come in, whether it be surveillance
or highway infrastructure development or autonomous car testing,
gathering test data, things like that,
data is going to come in and they're not ready to use it just yet, just because they have so much
else that they're doing. So data is going to come in, they need to find a place that they can save
it so they can go get more data, and then they'll work on it when they get to it. So a lot of times
data actually comes into the archive and then it'll sit there. That's another reason why active archive
has become a really important characteristic
because they need that data
when they need it, but it just doesn't have
to be at this moment. It might be next week.
Data comes in
at various points in time.
One of the tricks is to figure out how can data
move back and forth
in a way that's seamless to the users.
That's really the trick
of an active archive infrastructure.
So kind of wrapping all that up,
these are really kind of the three tenets of what you're looking for.
It's like you want to deliver performance for those workflows that need it.
You need to deliver low-cost capacity as that data is growing.
You want to store it cost effectively, but you also need to provide seamless access
to those people that need it wherever they are. So an active archive
is really mostly about combining low cost capacity and active access.
And the performance thing, because what we expect to happen is if there's
a performance requirement, then that data is likely to be moved into the appropriate
place for that work that's going to need that performance.
So then we talk about some of the technologies. I may not be breaking this down into the granularity
that you guys are looking for, but even from my perspective, you know, these are kind of the big categories.
So if you're looking at access, you know, for the most part, I don't have
like iSCSI here and things like that, but what I see the bulk of our customers being
happy with, looking to integrate into their workflow, is either NAS to connect to the environment,
and NAS is fine for many people,
or, you know, cloud's on everybody's mind and we're involved in cloud conversations
every day; everybody's thinking about the cloud.
And so that might suggest a RESTful interface of some form, you know, over Ethernet, whatever.
And what I'm finding is that a lot of applications,
the applications haven't moved as rapidly
to adopting those technologies,
sorry for my screen,
as we thought.
So there's a lot of applications that aren't ready,
they're not cloud ready.
And so we have to be able to accommodate those.
As a matter of fact, it turns out to be more often
than not.
Sorry about that, folks.
And this will only slow us down further.
Resume slides from the webinar.
Where do you see...
Oh, there it is.
Thank you.
If I had a touch screen, I'd go right to it. Come on. I think this connector sometimes gets a little loose, and that's what was screwing it up earlier.
All right, I'm taking my glasses off.
It's serious work time.
I'm very sorry.
I'll make up the time.
It's switching, okay.
I could just go into no-slides mode.
Only I can see my mouse.
Okay, got us to the right slide there. Yay. Okay, sorry. Okay, so I think we're right around here.
Okay, so anyway, a lot of applications have not really gotten to the cloud readiness stage
as we might have thought three years ago.
And so NAS connectivity is kind of an interesting thing for a lot of people,
or some convenient access.
You know, that's not the world of everything, but you guys know what those things need to be.
In terms of storage technologies, there's probably some cool stuff being talked about around here.
But for the most part, and please speak up if you know differently,
I'd love to be educated.
But, I mean, there's tape technology and there's disk technology.
Oh, and flash, sorry, I have flash up here. But in terms of the low-cost capacity,
there's a lot of things being done with disk. You know, there's certainly capacity disk.
There's object storage, which is, you know, kind of a huge thing that's going on that really helps the economics, if you look at how they
do things.
So I think object storage is a big piece of the disk equation
in terms of how that's getting set up to accommodate the high-capacity stuff.
And then tape technology continues to be there.
So I don't know where all your guys' heads are at on tape.
A lot of customers actually love it.
They love the economics of it.
Yes, it's absolutely been displaced in a lot of the traditional places it used to be
in backup, as I described at the opening of this discussion.
But tape is finding some pretty strong footholding in the big data world.
And when I say big data, I mean the kind of large file sets, data sets that we were talking about, unstructured stuff.
And then, of course, you do need the high-performance stuff in the active workflow piece. And I put these things here, which is
really interesting, and I think one of the key messages I would like
you to get out of today is: you can't really set up a single type of
technology and have it serve all your needs. So there's got to be something in the environment that facilitates a tiering
between storage tiers that each address one of these key aspects.
So tiering is really important.
I talk about acceleration also because there's deduplication,
there's compression, there's WAN acceleration.
Sometimes people are looking at data that needs to move from A to B.
There's different ways of doing that as well.
And then there's gateways.
So there's a lot more gateway solutions that we're seeing.
I actually think that gateways and tiering
are kind of the game changers of today
in terms of helping manage data that has to be held onto for a long time.
So those are the solutions that I think are going to make a big difference
in a lot of the environments in the next few years anyway.
Yeah?
Two points. I was going to say optical is reemerging.
Oh, fair point. Yeah, thank you.
And I work for an optical company.
Okay, thank you very much. That's a great point.
I was also going to add that what I sense is that
customers are stuck in the process tied to the backup-to-archive approach,
and you can't break it. People do try to do
RESTful API-based archive for longer-term data,
but you never get away from NetBackup and control.
Right. Yeah, I totally agree.
And you're absolutely right. And that's the inertia
that the existing ways have.
And so I think the challenge and the opportunity for
all of us in this room is
how do you... One of the reasons is
it's just hard. And as a matter of fact,
one of the things I've said in an article is that's why I'm calling these things game changers.
If you can make it easy, you know, as easy as a backup app is.
People are so familiar and comfortable with backup apps.
It's like that's the easy way to do it.
The objective for us is to bring something to them that feels as comfortable and as easy for them to get their data to the new place,
if it can go to a new place.
Now, sometimes you've got some of the backup apps and things like that.
They've integrated an archive piece, so that helps a little bit,
and that's the right answer sometimes.
They've tried with that,
and I think they've even struggled a little bit
to get those things to be adopted, you know, very broadly.
But thank you very much.
Optical is absolutely another one of these capacity technologies that can be integrated, you know, just as seamlessly as the others can.
So we'll talk a bit about that integration. So, in terms of some of the common attributes of some of these archive storage technologies: tape,
and I apologize, I did not mean to ignore optical.
I would say that optical and tape kind of have a battle.
They can compete.
Optical has some good capabilities that we'll talk about.
When we talk about compliance, you've got some of the WORM capabilities.
Tape can offer that too.
Sometimes optical is seen as even stronger in that environment.
But low cost is really key.
I did some math, and I didn't publish it in this presentation,
but if you do a three-year analysis, and I like to say do a five-year analysis
also, and I don't know where
the optical comes in, but for tape,
and it depends on your capacity, I mean
tape can come in at the
$40 to $90
per terabyte
level.
And that's all in.
That includes the gateway to get there, it includes a library,
it includes media,
it includes drives and stuff like that.
If you look at
some of the cloud solutions, the public
cloud solutions that are out there, they'll be
around $250 per
terabyte. If you add up your monthly
expense
and you multiply it by three years,
so do a monthly rate times 36,
I think you'll end up at about $250 per terabyte for the cheapest stuff.
That's like Amazon Glacier.
And if you look at object storage, it'll be probably not terribly far from that.
So one of the things that we want to make sure people understand
is if you're looking at public cloud,
and I know I'm diverging quite a bit,
but if you look at public cloud,
you can look at the options for on-premise as well,
and you might find that to be more cost-effective, whereas a lot of people think cloud's cheaper. It
is from an entry perspective, and I think I talk about that. Yeah, it's really the lowest. You can
get into a public cloud for really, really cheap. I have one terabyte, let me store it there. But if
I have to grow it to hundred-terabyte or petabyte scale, then all of a sudden it starts to add up.
And then if you look at the investment over time, if you're continuing to pay a monthly
fee for what you're storing in the cloud and you go on to the five-year horizon, well,
you get to take your five-year investment maybe and look at a capital amortization over
five years.
And all of a sudden, the cloud goes from like $240 to, I think, over $400, $430 per terabyte, something like that.
And so then you can compare that to other technologies that maybe you could have invested in your own data center.
Granted, you're all looking at solutions.
But anyway, so that's some of the math that we can look at, too.
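To put that arithmetic in one place, here's a tiny worked version using the round numbers from the talk. The monthly rate is an approximation of Glacier-class pricing at the time, not published pricing.

```python
# Worked version of the 3-year / 5-year math from the talk.
# ~$0.007/GB-month (Glacier-class, circa 2016) is about $7/TB-month.
monthly_per_tb = 7.0

three_year = monthly_per_tb * 36   # monthly rate times 36 months
five_year = monthly_per_tb * 60    # same rate over a 5-year horizon

print(f"public cloud, 3 years: ~${three_year:.0f}/TB")  # ~$250/TB
print(f"public cloud, 5 years: ~${five_year:.0f}/TB")   # ~$420/TB, versus the
                                                        # $40-$90/TB all-in figure
                                                        # quoted for on-prem tape
```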
So, object storage. A lot of times they have a multi-site story.
They talk about durability a lot of times.
Sometimes it's replication,
sometimes it's erasure code,
and by spreading the data intelligently
over different locations,
you're including protection, you're including
disaster recovery copies.
And so you get to weigh the cost of all of those things
against an object storage solution.
So whereas over here today I'm buying primary, I'm buying backup storage, I'm buying DR storage,
I'm buying software to do all that stuff, my option might be to move it over to object
storage that has all that durability already baked into it
and I don't have to worry about the software
and all that kind of stuff.
I just need to figure out, and we'll talk about this,
how do I move it over there?
That's part of the challenge.
But anyway, so there's a lot of economics
that you can talk about.
And then we already talked a lot about the gateways,
which I think is a really important part
of today's environment.
And that's probably why we haven't seen adoption
as quickly
for cloud and stuff like that, because I don't think
the gateways have been developed
quite as much yet.
Okay, so we already talked about this.
Data moves back and forth.
So let's look at some examples, and then we
have some considerations afterwards.
I actually mentioned state infrastructure customers I'm familiar with.
What these people do is they have trucks
that drive around the highway infrastructure
with cameras on them,
so the cameras are gathering and storing on the truck
in some storage a bunch of this video
and photographic and imaging content.
And they're expecting those cameras
to increase in resolution, things like that,
so that data is just going to continue to get richer.
And so they get back to the shop, and they have to ingest all that data.
So now they need a high-speed way of getting data into their workflow
because what they want to do is they want to analyze all this data,
plan out the infrastructure of the future.
So they've got these huge packets of data that they're bringing in.
How do I ingest it quickly?
I know of similar workflows; I think I'll talk about one of these later on.
But sometimes I can connect my storage system that's here over NAS,
or over some kind of... maybe it's wireless, maybe it's something else,
I don't know what it is.
But they're always looking at what's the best way to do this.
I know a lot of them are still using portable hard drives,
plugging in a USB port in a workstation and bringing it in that way.
And they've got shelves and shelves of stuff that they haven't ingested yet.
So the ingestion piece is becoming a really important part of the process as well.
Anyway, that data comes in, and like I mentioned before, they're moving it over; the customer I'm thinking about here moves it over to object storage.
And then they've got all their workflow and process and everything else over there.
And so when the data comes in here, and so this is a gateway.
I think you'll see a theme.
I've got gateways in all of these.
I happen to know from personal experience,
and I'm not plugging my company,
Quantum, but we don't sell optical today, and we have
a lot of the other things that we've talked about: we have
flash, we have disk, we have tape,
and we also have gateways.
We're not pushing gateways. We have people that
direct connect to everything,
but it just turns out that when you have
the conversation and you're looking to solve a problem, that becomes an important piece of
making it easy for people. So anyway, when it comes in, it goes over
here, and then this gateway still has the ability to provide active access for all
the people that need to work on the data, and so that's one example. They've got
multiple sites, you know, so their object storage is able to leverage the durability
the way object storage was designed to deliver it.
So actually, it's structured so that they have this redundancy.
So I think we talked about all this stuff.
The tiering really happens when a customer
wants to retrieve the data and work on it,
then the gateway is able to
migrate the data from
object storage to some other
tier. Sometimes
it's just as simple as a NAS share drag and
drop. Sometimes there's actually an active
high performance disk stage that
it moves to within the archive infrastructure.
I'll talk about that in a minute as well.
Is the object storage that you have in that slide
remote, as far as a provider, or is it private?
They have data centers in multiple locations.
So it's... but they own?
They own those locations.
Within the four walls of their network?
Correct, yeah, yeah.
That slide does show S3, so it's Amazon, right?
No, that's actually the front end of that object storage.
Okay.
Yeah.
And it turned out, and I'll talk about another one,
that they'd be open to cloud,
but it turned out they're using a processing software that,
it's funny, it's actually a web-based software system,
but the only way it knows how to connect to storage is over NAS.
And so they needed this NAS connectivity over here,
which I found kind of interesting.
Ultimately, I would expect that software will support RESTful interface.
And if that happens, that's fine.
Go ahead and connect directly to the object storage.
But then you still need a way to ingest the data and stuff like that.
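For the developers in the room, the RESTful front end being referred to here is typically an S3-style API, which is why the slide shows S3 on a private object store. Here's a minimal sketch using boto3; the endpoint, credentials, bucket, and file names are all hypothetical placeholders, and an on-premise object store would supply its own endpoint URL.

```python
import boto3

# Hypothetical on-premise object store exposing an S3-compatible front end.
# Endpoint, credentials, bucket, and key names are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.internal",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Archive a finished file directly over the RESTful interface...
s3.upload_file("survey_run.mov", "archive-bucket", "ingest/survey_run.mov")

# ...and pull it back when somebody needs to work on it again.
s3.download_file("archive-bucket", "ingest/survey_run.mov", "survey_run.mov")
```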
This is financial transactions, securities trading. They have a lot of compliance requirements.
They could have put optical here. I know they chose tape in this case.
They have high performance ingest as well.
They've got a gateway over here.
Oh, I thought I took this out.
So they happen to be using a standard tool.
I'll talk a little bit about tools.
When I need to get data back and forth from my active work environment
over to my archive infrastructure, how do I do that?
There's a lot of software packages that are out there.
There's a lot of tools.
rsync is part of the Linux toolkit, and they're just using that. They scripted that up
to take care of data movement.
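As a minimal sketch of what that kind of scripted rsync movement might look like, wrapped in Python so it could be scheduled and logged; the paths are hypothetical, and a real deployment would verify the copy before draining the source.

```python
import subprocess

SRC = "/data/ingest/"          # hypothetical daily landing area
DST = "/mnt/archive/trading/"  # hypothetical NAS share exported by the gateway

def sync_to_archive() -> None:
    # -a preserves permissions/timestamps; --partial resumes interrupted copies.
    subprocess.run(["rsync", "-a", "--partial", SRC, DST], check=True)

if __name__ == "__main__":
    sync_to_archive()  # e.g., invoked nightly from cron
```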
Data's coming in from the left
into the archive environment, so that's why
they're able to use rsync. They gather up a bunch of data
on a daily basis over here.
They sync it over to here.
They can access it and do analysis
if they want, so that was part of the requirement
also. They want
high-performance retrieval, so if they
need to look at something or access it, they
don't like the 30 seconds latency that
a tape library might have associated with it.
They really like object storage,
its performance profile, but they absolutely
needed to have an offline
copy of the data. I'll mention
offline for a second. I don't have this in here either,
but ransomware, I'm amazed
at what I mean. I don't have still a lot of, but ransomware, I'm amazed at what I mean.
And I don't have still a lot of first-hand
experience on this, but I've had people tell me
recently that there's a lot of people paying ransomware.
And just so you guys know what that is,
maybe you all do, but
villains are coming in
and they're encrypting your data.
They can hack your network, encrypt
your data, and say, if you want your data back, pay us some money,
because they have the key and you don't.
And one good way to protect against that is to have some data sitting offline.
It's like, okay, enjoy your key.
I have my data.
I'll reinstall it over here.
So anyway, whereas we used to talk about DR and an off-site copy,
this whole deal about an offline copy is also becoming really important to have.
And so that's a piece of one of the considerations.
Actually, I do mention that later on, so I won't have to talk about it again.
So I think you get the gist of that one.
I don't think I have to go through this too much.
I don't think there was anything big, new news here.
Here's a major university in the United States.
They were absolutely going out.
I'll say it.
They were going to go sign up with Amazon.
Anybody here from Amazon, or work with Amazon?
So they were already moving in that direction,
and they were having troubles getting all of their needs met properly.
It's not like they were done with the analysis yet, but somehow I know that a storage vendor
got called in and was able to have a broader discussion about what they could do.
So what they ended up doing is implementing their own campus wide archive on their own
premise instead of going to the public cloud, which is where they were really believing
that they were going as a starting point.
And so what they did here, they used the magical gateway.
The IT department tells all the departments on the campus,
when you guys are done with your work,
doing whatever it is you need to do,
and you want to save it cost-effectively,
just drag and drop it over here.
So they gave each department their own archive share.
And so they're just dragging it over here.
And then they're leveraging actually tape libraries on the back end
because cost effectiveness is really important.
Now, these guys care a lot about their data.
Everybody we talk about does, but the data is precious here.
So they've got probably at least three copies of data. These are just
replica
instances of each other.
When they send the data out to tape, they send it
to two different tape libraries in two different
locations. Then they've got an offline
copy as well that they can send off
somewhere else. They cannot lose
their data. It's very valuable to them.
So anyway, that's a simple
multi-department drag and drop
leveraging tape, leveraging a gateway.
Really important to them.
So I talk about tiering, and actually,
just to emphasize what
that's all about, here's one of the things
about making it easy. So we talked about
when we tried to make tape easy:
it's got to be close to four or five years ago,
we brought out LTFS, the Linear Tape File System,
if you guys are familiar with that,
which enabled kind of a NAS front end to tape, and that was pretty good.
Each tape cartridge is a self-describing file system.
And so that kind of changed the world of tape.
And I loved, you know, people said,
you mean I can actually get to tape without a backup app?
That's what it does.
You just drag it over, either attached to your workstation
or attached to your network.
Problem is, if it didn't have a cache of some sort,
and some people implemented that,
then you still had to deal with the latencies of tape
and things like that.
So a tiered solution might bring something off of tape
if it gets active and bring it over here
onto some kind of a higher performance tier
that can eliminate the latencies
for ongoing work with that data.
So one strategy is to keep all your fresh stuff on disk or spinning disk storage over here
for a while and then automatically migrate it over so that if somebody wants to get it within
six months or three months or whatever it is, they have the experience of getting it right back as
they expect to. And then later on, if tiering is done properly, they can still get it back. They still go to the same
place, you know, in a lot of these tiering solutions. So they still go ask
for their data the same place, but if it's more than six months or more than a
year, it's like, oh I have to wait, you know, 30, 60 seconds to get my data. That's
not a big deal for a lot of people. But that's a tiering solution implemented
in a way that can deliver that kind of access conveniently and forget about what's behind
the scenes. That's what cloud is doing today. You go to Amazon Glacier, you want a file
back, you have to wait four hours if it's on Glacier. So people are obviously willing
to accept some of the tradeoffs of the lower cost of storage.
So in an application like that, do they consider deduplication and the effects on the overall storage?
Well, deduplication is a great question.
And so here's how I think about it, and there may be a lot of good debate over a drink on this stuff.
Deduplication
is at its maximum
benefit
in a backup environment,
and the reason is because I'm backing up
this data today, a little bit
of it changed, and I'm going to back up the whole thing
again. So if I can eliminate
duplicating the work of
everything I did the day before, why am I
doing that?
So with deduplication, you can see a huge payoff when you're doing repetitive work like that.
When you talk about archive, it's a little bit different.
Now every file is a little bit different and stuff like that.
So deduplication might be able to find some...
It's more like a compression algorithm at that point in time.
So you might see some benefit from the
compression aspect of it, but not
from the repetitive file aspect
of it, in my opinion.
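To make that point concrete with a toy example, here's a fixed-block dedup sketch on made-up data: nightly full backups share nearly all their blocks, so dedup collapses them, while unrelated archive files share almost none. Real deduplication engines use smarter, variable-size chunking.

```python
import hashlib
import os

BLOCK = 4096  # toy fixed-size blocks

def block_stats(datasets: list[bytes]) -> tuple[int, int]:
    """Return (total blocks, unique blocks) across all datasets."""
    seen, total = set(), 0
    for data in datasets:
        for i in range(0, len(data), BLOCK):
            total += 1
            seen.add(hashlib.sha256(data[i:i + BLOCK]).digest())
    return total, len(seen)

# Backup case: three nightly fulls of the same 1 MB file, tiny edit each night.
base = os.urandom(1024 * 1024)
fulls = [base, base[:-9] + b"night two", base[:-9] + b"night 3.."]
print("backup fulls  (total, unique):", block_stats(fulls))   # most blocks repeat

# Archive case: three unrelated 1 MB files.
files = [os.urandom(1024 * 1024) for _ in range(3)]
print("archive files (total, unique):", block_stats(files))   # almost all unique
```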
In this picture, is the gateway
the glue between the long-tail
archive and the nearline,
or is it still backup software?
Oh, this is not backup software.
So, as I said, the IT group
said to the rest of the university
when you're done, drag your files
over here. So it's a NAS share.
There's no backup software here.
They may have backup software doing something
in each of their departments over there,
but what they set up was the campus-wide archive.
So this has nothing to do with the day-to-day backup.
This says when you're done with your project and you want to store cost effectively, drag your work over here.
Now, when they drag their work over here,
and they're comfortable because they get the guarantee from the IT guys,
they say, okay, I have less that I'm storing and backing up over here.
So they've improved their environment,
and they're paying for less primary storage
here because they were able to offload
it over here. So that's kind of what we
talked about. When you're done with stuff
and you're ready to move it to an archive phase,
figure out how you want to do that.
Yes?
You haven't talked about any security or privacy concerns.
I will.
Any of this. Yeah.
Well, the deduplication thing that was just brought up
is an area where, if you've encrypted it,
or you have data privacy concerns,
which you would have in a university environment.
One of those stacks over there is the financial aid area.
The business guys at the university have huge security concerns.
Something that works for maybe sociology,
although some of the studies are saying things that you don't want public,
etc., it kind of complicates the situation
with at least two more dimensions.
It absolutely can, and it depends on how robust this solution is.
You're absolutely right.
So does this solution, and I don't know if I'm hitting the point,
but it's a point well stated and very much real life.
If this can handle encryption that those guys might be able to control,
can they move their stuff over here and keep the key?
Is there some key management process involved that allows
them to be comfortable?
However they're solving that problem today,
if they can solve it
a different way or the same way over here,
I think that can address
that piece. They want to make sure
that nobody has access to those records.
There's a server somewhere that they
got comfortable with.
And that's been an issue,
and I don't know how well the public cloud infrastructures
are getting around that.
I mean, that's been one of the first objections people have
to public cloud.
It's like, my data's way too sensitive.
I'm not sure I'm ready to go there yet.
There may be good answers to that up there,
but not everybody's on board yet.
So security is absolutely an important piece.
Thanks for bringing that up.
Media production distribution.
So here's the media
entertainment world.
It's kind of getting repetitive
at this point in time. Single company
has multiple data centers
throughout the U.S.
These are mostly U.S., I'm sorry. I grabbed some convenient ones. New York, Denver, and Los Angeles.
Today, I'll be honest, today what they're doing is they're using these locations
to spread their object storage so that they get the site durability.
So if any of these individual locations goes down, their data still stands. So I wasn't going to drill down into object storage.
I can answer questions on that later.
But there's ways that you can spread your data.
I mean, the simplest is just to think three copies.
But that didn't save you a whole lot of money, right?
Why not just replicate your primary storage at that point in time?
But there's things like erasure code and stuff like that
where you can actually spread data across multiple sites
so that if one goes down, you still have your data,
and it's not the same as three full copies of your data.
It kind of hits your point of deduplication.
It's a different way of making data storage more efficient.
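A quick worked comparison of the overhead involved, using a hypothetical but common 8+2 erasure-code geometry for illustration:

```python
# Raw capacity needed to keep 100 TB of data durable two different ways.
data_tb = 100

# Three full replicas: survives the loss of any two copies, 3x raw capacity.
replicas_raw = data_tb * 3

# 8+2 erasure code: data split into 8 fragments plus 2 parity fragments,
# spread across sites/devices; survives the loss of any 2 fragments.
k, m = 8, 2
erasure_raw = data_tb * (k + m) / k   # only 1.25x raw capacity

print(f"3x replication:   {replicas_raw} TB raw")   # 300 TB
print(f"{k}+{m} erasure code: {erasure_raw} TB raw")  # 125 TB
```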
So they're using it for the efficiency of object storage
and the durability of their content and their data.
They're still really just distributing from one location,
but their plan as they've got this thing scaled out
will be now I can take my data and access it over here,
I can get my data and access it over here,
and I can distribute from these different locations,
and maybe they'll even spread their production resources around.
Let's see, are there any key points on there? No, I talked about it.
Okay, so we
talked about it, and actually, to be honest,
I should have hit harder on the security. I like that.
I called it compliance and things like that,
but that's a really important
point that maybe I don't talk a lot about.
So just, you know, thinking about some of the other
considerations:
for cloud,
it's just another data center.
And so as I talk about these technologies, what users don't always understand is,
well, isn't cloud a different technology?
Well, no, it's everything we just talked about, just somewhere else.
And it's like, I finally got comfortable with that.
That's really what it is.
And so whatever front end they put on, a lot of times it's a RESTful interface, for the network out there.
And so we're seeing the public cloud services,
the ones you know,
implementing all of these technologies,
including tape, including object storage
or disk, I mean optical disk.
That's all happening behind the scenes, and users aren't saying, well, what kind of technology
are you storing this stuff on?
Don't worry about it.
Ask for your data.
You get it back in four hours.
It doesn't matter to you what I'm storing it on.
I give you these guarantees.
I give you this SLA guarantee.
All you care about is your data, when you get it, how you can get it.
Forget about what's behind the scenes.
So that's all I really need to say about cloud.
It's just another implementation.
Ten minutes to go.
So data movement.
We talked a lot about that, or we mentioned it from time to time.
This is really key.
I've been in environments where the users absolutely will not tolerate having to move their data to a different place.
And there's a bunch of solutions that have been out on the market for a long time,
where what they'll do is they'll crawl the user data, and you get to set whatever policies:
has it been touched in a certain amount of time?
Is it a certain type of data? Is it in a certain location? Whatever it is.
You set up the policies, the software will go out, it'll find the data that fits your profile and say,
okay, you told me anything that looks like that can be moved over here.
And, and there's different technologies, it'll stub it somehow.
So that way the users keep going to where they're used to going to get their data, but behind the scenes this magical data
mover has moved it over to the archive. And that works great for a
lot of people. The science of stubbing data, I'm learning, has
got some depth to it.
That's deep water. Yeah. It's really deep water. Most IT people don't want stubs.
Agreed, agreed.
And I use the word stub
to represent what we're talking about.
But you're right.
You're absolutely right.
You can think of it as a link or something.
Yeah.
Because if you move the location over here
and the link stops being aware of where it went,
all of a sudden you don't have access to your data.
Sometimes it'll just move the data
into a predefined location and the users can go over there.
So that's part of what holds back a lot of the
migration to archive is how do I get my data there
in a way that the users will tolerate.
I'm seeing a lot of adoption of some
stubbing technologies. I see exactly the concerns and the issues
that you brought up.
So anyway,
that's an opportunity
to figure out how to do that right.
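As a minimal sketch of that kind of policy crawl, assuming an access-time policy and a plain symlink standing in for a vendor's stub; real data movers are far more careful about verification, atomicity, and stub formats that applications can follow.

```python
import os
import shutil
import time

ARCHIVE_ROOT = "/mnt/archive"   # hypothetical NAS share from the archive gateway
MAX_AGE = 180 * 24 * 3600       # policy: untouched for roughly six months

def crawl_and_stub(source_root: str) -> None:
    cutoff = time.time() - MAX_AGE
    for dirpath, _dirs, files in os.walk(source_root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue                          # already stubbed
            if os.stat(path).st_atime < cutoff:   # policy match: cold file
                dest = os.path.join(ARCHIVE_ROOT,
                                    os.path.relpath(path, source_root))
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                shutil.move(path, dest)           # move data to the archive tier
                os.symlink(dest, path)            # leave a "stub" behind so users
                                                  # still find it where they look
```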
I think we talked a lot about that already.
Oh, and I guess the other thing that I'd say,
one of the things that people need to be aware of,
is if you remember some of the diagrams that we talked about:
I have my applications and my primary storage and all that stuff on the left,
and one decision I want to make is I need to get my data over to this archive infrastructure.
So that's one movement, and we just talked about that on the previous page,
how you do that.
But then if this is
a tiered storage environment where
I've got some fast disk or I've got
some flash and I've got some tape or optical
on the back end, I've got another
data movement thing going on here.
So in order to implement
a real retention strategy,
I have to think
about how does my data migrate across the
whole spectrum of stuff.
Most of the time, from what I see, although I do see examples otherwise, it's easier to figure out how to do it with two parts of the solution.
The other way to do it is you've got a single vendor or some kind of a homogeneous environment
where that vendor can manage all of that up and down.
And there's some vendors out there that do that,
but users don't always want to be restricted to that.
And I talked about open systems,
and so one of the magic things to look for is
how can I take an open environment where I've got a bunch of different vendors
with different pieces throughout my environment,
and my users are happy because they always know where to find their data,
and they're also happy because they're paying the least amount they can
to retain it over a long period of time.
And they're also happy because it comes back to them fairly quickly.
So ultimately, an archive strategy isn't as simple as we thought,
and so the opportunity for all of us in the room is to design the solutions that make it simple.
And there's great ways that have been brought to market so far and that can continue to get improved.
So compliance and integrity, this is where I would put the security thing. Oops.
I don't know what's causing that.
Other speakers have had similar issues.
Really?
Yeah, it's interesting.
Okay.
Boy, I hope I don't have to take all the rest of the time
just to figure this out.
So in terms of compliance and security
and data integrity,
so let me just...
Here we go.
And I'll do this.
I'm not going to mess with it.
So it's not always about the storage
target. Sometimes people say, I want optical
because it's got the WORM piece of it.
You know, WORM, write once, read many, comes up quite a bit,
but there's other things. And there's also conditional access, and just access at all to your data.
And so all I want to do is I want to name some of the other things that people have to think about
when they implement a data retention strategy.
So the same things that you had to think about, you know, as part of your primary storage implementation,
those same questions and issues, you know, might very well come along with your data as it moves.
Data integrity is another piece.
There's a lot of great solutions out there.
If you're going to store data,
it's much more common now:
five years, ten years, twenty years.
There's a law firm that we do business with.
They actually store data for NFL injury cases,
and they have to store their data for, I think, 68 years,
I think is what they said.
So 60-something.
I didn't get the real reason why that is,
but maybe that's how long they expect the patient to live, type of thing.
But anyway, multiple decades is how long they need to store stuff.
That will bring me to another point as well.
But you want to make sure, however long that requirement is,
that that data doesn't degrade; you've heard terms like bit rot and stuff like that.
Is my data being checked on some kind of an ongoing, regular basis,
so that I know it's
not dying, whatever
media it's stored on? And there's some great solutions
out there to do that.
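The kind of ongoing checking described here is usually called a fixity audit: record a checksum when data is ingested, then re-verify it on a schedule. A minimal sketch, assuming a simple JSON manifest of relative path to SHA-256 recorded at ingest:

```python
import hashlib
import json
import os

MANIFEST = "fixity_manifest.json"  # hypothetical {relative_path: sha256} file

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def audit(root: str) -> None:
    with open(os.path.join(root, MANIFEST)) as f:
        manifest = json.load(f)
    for rel, expected in manifest.items():
        actual = sha256_of(os.path.join(root, rel))
        if actual != expected:
            print(f"possible bit rot: {rel}")  # flag for repair from another copy
```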
Sorry.
So, scale. You know,
can I continue to scale? With a lot
of primary solutions, I know when you scale them up,
as we talked about with what people do today,
one of the things they don't like is that in many cases,
depending on the architecture, as you scale up,
your performance goes to hell.
And is there a way around that?
So make sure that you understand how the storage capacity of this thing
will continue to scale in a way that doesn't stop delivering on one of the promises
that you put together in the first place,
whether it be access capability,
or ease of access, I should say.
Or do I have to move to a secondary location?
So that's the other thing
that the whole archive world is faced with,
is how do I have this infinite scale?
And there's solutions out there that do.
I mean, public cloud, they're not talking about any options.
Object storage can scale huge.
Tape libraries can get really big.
So there's a lot of great scalability options out there.
Just make sure that that's part of your solution.
And then if you have a gateway or something like that,
is there a file count limitation?
Sometimes data under management has issues with how much metadata can I handle,
and you might have to store that.
Sorry, I'm using the wrong button.
Reporting. So, like, one key thing
in that university example,
one of the key criteria those guys needed,
that had to get delivered,
is that every department needs to pay their share
of how they're using the archive.
And so IT needs to have the tools
to report and bill properly, so chargeback types of models.
So as IT looks to set up a service,
or as a cloud provider looks to set up a service,
they need to make sure they've got ways of doing the billback
and accounting and things like that.
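A toy sketch of that kind of chargeback report, assuming one directory per department under the archive share and a placeholder internal rate:

```python
import os

RATE_PER_TB_MONTH = 10.0  # placeholder internal chargeback rate

def usage_tb(share: str) -> float:
    total = 0
    for dirpath, _dirs, files in os.walk(share):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total / 1e12

def monthly_bills(archive_root: str) -> dict[str, float]:
    # e.g. /archive/physics, /archive/financial-aid: one share per department
    return {d: usage_tb(os.path.join(archive_root, d)) * RATE_PER_TB_MONTH
            for d in sorted(os.listdir(archive_root))}

if __name__ == "__main__":
    for dept, cost in monthly_bills("/archive").items():
        print(f"{dept:20s} ${cost:,.2f}")
```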
Format migration is a really important one.
This, I think, I'm seeing more work being done here,
or at least more of it's come to my attention recently.
And so what that means is, you know, we've always heard,
I'll use LTO as an example, the LTO tape format.
You know, they're out there:
it started with LTO-1, then LTO-4, then to 5,
and how do customers migrate from one to the next?
And that's only part of the equation.
So you can do that from a storage technology perspective,
but how do I do that from a data format perspective or anything else I need to worry about?
And so there's actually companies out there that are setting up solutions and or services
where they spend all the time, you know, understanding this data format and when the new one comes out,
it keeps coming out, so they're creating migration tools so that behind the scenes that data can be managed and handled for the long run.
You know, it's not a huge deal if you're talking about five, seven, even ten years; that's pretty easy to manage, I think, from my perspective.
But as you get into 10, 20, 30, 40 years,
this whole notion of digital preservation, I'm seeing more of that.
So this may be more niche-y from a market needs perspective, maybe not,
but I'm seeing businesses getting set up around that stuff.
Okay, I think pretty much a wrap.
So active archive, it's a common requirement everywhere in the world that I see.
It's all about long-term retention.
It's in all industries.
It can offer substantial benefit.
I didn't really quantify it that much, but there's huge benefits. I would say that people can spend half as
much if they address
an archive infrastructure instead of the
status quo.
You've got to look for the right balance of cost and performance.
For a lot of people, the first things that are easy to talk about
are cost, access, and performance, but then there's all those
other considerations we just talked about that you can't
forget as well.
I told you who the Active Archive Alliance is,
and if you want, there's a report that you can get.
You can either go to the website and leave your name,
or if you want to make me look really good,
you can give me your card and we'll send you something,
however you want to do it.
It doesn't matter to me.
And that's our chat.
So I appreciate the interaction.
Thank you all very much.
I think I've just been told that we're at the end of the time that we're supposed to be speaking. So maybe it's time
for you to migrate yourselves somewhere else. Any other questions you guys have? Thank you very much.
Appreciate it. Thanks for listening. If you have questions about the material presented in this
podcast, be sure and join our developers mailing list by sending an email to developers-subscribe
at snia.org. Here you can ask questions and discuss this topic further with your peers in
the developer community. For additional information about the Storage Developer Conference, visit
storagedeveloper.org.