Storage Developer Conference - #42: The Role of Active Archive in Long-Term Data Preservation
Episode Date: April 26, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 42.
Today we hear from Mark Pastor, Director of Archive and Technical Workflow Solutions with Quantum,
as he presents the role of Active Archive in long-term data preservation
from the 2016 Storage Developer Conference.
The session is called The Role of Active Archive in Long-Term Preservation. A little bit redundant, but
I'm representing the Active Archive Alliance, I'm with Quantum Corporation,
and I'm very much involved in archive use cases.
My specific area of focus at Quantum is archive and technical workflow.
And basically, just so you know, in Quantum terms what that means is we have other people focused on media and entertainment,
we have people focused on surveillance, and then we've got some people focused on backup,
and then I focus on what's left for the most part, even though I get involved a lot in some of the other ones as well,
like surveillance opportunities and media and entertainment, so I can speak to that. But with the Active Archive Alliance, our mission in life actually is to provide
open systems ways of developing solutions that can enable people to
access all their data all the time and that's why we call it Active Archive.
So the point of archive, of course, is data retention. But what we've seen happening in
recent years is that people have traditionally done archive by, you know, the original way
of doing it: I'll take my backups and I'll send them off to Iron Mountain, or I'll store
them somewhere for a long time, and if I need that data back I'll have to go back through the backup software to retrieve it,
because tape is a very attractive technology from a cost perspective.
And there were only a few ways of getting to tape:
either you had your own homegrown application,
which has happened in some of the key industries like oil and gas and things like that,
but for the most part, the broad market used their backup apps to get to tape.
Tape was cheap.
That's what they wanted to archive on.
And so if they ever wanted the data back,
they'd have to come back through that process.
So data is getting very, very valuable,
and intelligence in companies is getting richer.
And so they want ways to get their data back
more actively than that process.
And so there have been ways developing over the past 5 or 10 years
to give people better access to economical storage like tape.
And this is not all about tape, by the way,
but object storage has come out as an option as well.
You know, cloud, I'll talk about all of the above.
And so there's new technologies available that can offer cost-effective retention of content,
but also provide active access.
And tape and the other ones can be just as good.
So it's all about providing ease of use, scalability, cost, and compliance.
And we'll talk about the reasons why people retain data in the first place.
So my assumption is that you're all here
because you're looking at developing new solutions that store stuff. That's the Storage Developer Conference.
And so what I hope to provide
in today's discussion is really just give you a sense of what are
some common motivations that we're seeing in the marketplace.
I interact with customers quite a bit, so why are people looking to
hang on to data for longer periods of time?
And then, in helping people architect and think about total solutions for that retention of data,
I've learned there are actually a lot of considerations.
And so we'll talk about many things that enter into the conversation
when you talk about holding onto data for a long time.
And what are some of the technologies that are of interest?
Really, what are the common themes that I see across all the industries in terms of their data retention strategies?
And we'll walk through a couple of real-life examples so you can kind of see what other companies have done, just to give you a sense of that.
And by the way, I'm delighted if you guys make it interactive if you want to ask questions.
I don't want it to be death by PowerPoint unnecessarily, so feel free to jump in and ask questions.
Okay, so when we're talking about archive, basically, you know, I'm talking about
long-term preservation. Some people think 90 days might be even too short a period of time to think
about this. The reason I say that, the reason it's important to understand that, is because of
the most common thing. You know, I'm involved in selling archive solutions, and what's
one of the most common competitors I have? It's the status quo. It's people not doing anything special.
It's like, let me just buy some more primary storage, you know, and let me just keep
scaling up my storage that way. But I am seeing, without a doubt, more and more
people come to us now and saying, I can't afford to do that anymore.
The data is getting way too rich, the sensors are getting much higher resolution, cameras,
things like that.
So everybody is really feeling this data growth problem and it's hurting in all kinds of areas.
It's the cost of storing the data if they're just doing it the same way.
It's impacting the backup process, because if they keep backing that stuff up
they're increasing their window and burdening their whole system. It's really, really hard
to manage the data growth that's happening in certain environments. So anyway,
so that's why we have to think about this retention thing because people want to retain
more data because the data is more valuable and it's also growing much faster than it
ever has and it's
breaking a lot of people's environments.
And sometimes they're storing
it because they like the value. Sometimes they're
storing it because they're in an industry where they have to
whether it be medical records or financial
records, things like that. So sometimes
there's compliance motivation. Sometimes it's both.
I might have missed something, I don't know, but those are the two key
things that come to mind all the time.
I mean, I don't know a lot of people that feel great about
deleting their data, so everybody seems to
want to keep it, particularly if it cost them
a lot of investment to get it in the first place.
You know, if they have to go out and do a seismic
blast and gather a whole bunch of stuff,
that's hard to repeat.
It's kind of a strange way to word the slide, but when I say, when is archive
justified, what that means, as I talked about, is
there are people who just keep building up their primary storage.
And you know what, that's fine
if they don't have too much.
And so, as I already talked about, kind of the problems:
if your backup is busted, or your bank is busted
because you don't have the budget,
that's really when people start taking archive more seriously.
If they're talking about small amounts of data,
if it's a small shop and they're talking about 10 or 20 terabytes of data,
you know, go ahead and buy more primary storage,
because I don't know that there's a lot of solutions out there
that are going to fix that problem any different
than just buying some more primary storage.
But when you start talking about 30, 50, 100 terabytes for sure,
or more, then you can definitely demonstrate
some key economic value.
You can definitely articulate how it will help
the rest of their processes
by having an infrastructure that
is specifically oriented toward retaining data.
And then I kind of outline a table here, but these are some of the key ingredients.
I try to keep it simple.
If you want to look at the economics, you know, if somebody's going to invest in an
archive solution, what should they look at?
And so look at the cost of your primary storage, look at the cost of your backup storage,
look at the cost of your software to do the backup,
and then add up the cost of those kind of key elements for your archive solution,
you know, storage, whatever software you needed to implement it, and those sorts of things.
And you can do that analysis pretty quickly.
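To make that kind of quick analysis concrete, here is a minimal back-of-the-envelope sketch. Every per-terabyte price and the inactive fraction below are made-up placeholders, not figures from the talk or from any vendor; plug in your own quotes.

```python
# Back-of-the-envelope archive economics. All numbers are placeholders.

capacity_tb = 100          # total data under management
inactive_fraction = 0.6    # assumed share untouched in the past 6-12 months

# Status quo: everything on primary storage, all of it backed up.
primary_per_tb = 500       # assumed primary storage $/TB
backup_per_tb = 150        # assumed backup storage $/TB
backup_sw_per_tb = 50      # assumed backup software $/TB
status_quo = capacity_tb * (primary_per_tb + backup_per_tb + backup_sw_per_tb)

# Archive option: inactive data moves to low-cost archive storage,
# plus whatever software or gateway was needed to implement it.
archive_per_tb = 90        # assumed archive storage $/TB, all in
archive_sw_per_tb = 40     # assumed gateway/data-mover software $/TB

active_tb = capacity_tb * (1 - inactive_fraction)
inactive_tb = capacity_tb * inactive_fraction
with_archive = (active_tb * (primary_per_tb + backup_per_tb + backup_sw_per_tb)
                + inactive_tb * (archive_per_tb + archive_sw_per_tb))

print(f"status quo:   ${status_quo:,}")       # $70,000
print(f"with archive: ${with_archive:,.0f}")  # $35,800
```

With these placeholder numbers the archive option comes in at roughly half the cost, which is the order of magnitude the talk suggests.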
And one of the other things, actually I don't think that it's in the presentation,
but one of the things I will tell you is absolutely real.
When people are storing this data,
we're talking mostly about unstructured data, as opposed to the database transactional stuff,
because you can break that apart.
If you look at the file data, that's what I mean by unstructured data:
the stuff that comes in from sources that people work on,
not really transactional database stuff.
And if somebody's got 100 terabytes of data and you do the analysis or you ask them,
some people know, some people don't know,
how much of that data has been touched by anybody in the past
six months or the past year or something like that.
It's about,
you know, it'll range anywhere from 20
to 50%. So that means, you know,
50 to 80% of your data,
you're saving it because it's
valuable, but it's not
really being used
very actively right now.
It might be used next month,
and sometimes you have no idea when it's going to be used,
but that's why you're keeping it.
So the fact that data is inactive for a period of time means it's a good candidate to be part of an archive infrastructure,
because it doesn't have to be managed the same way
as the stuff that's changing daily,
which has got to be involved in an active backup process and things like that.
But if I have something that I'm done with,
and I'm really not going to change it,
I might want to reference it at some point in time,
or I might want to edit it later,
I can put it over in this new infrastructure,
and I don't have to have it backing up on a daily basis.
I can have a backup copy or two of it somewhere,
whether it be in the cloud or off-site
or something like that, but there's no
reason for it to be part of a normal daily process.
So we talk about durable
archives and that's really important.
It'll probably take me an hour and a half
to go through these slides if I stay at this pace.
So let me skip through some stuff just to make it quick. Every industry...
I wanted you to.
Yeah. No, go ahead. I'm actually pretty good at keeping pace. So if we bog down somewhere, I'll speed up somewhere else. Did you have a question?
No.
Okay.
So anyway, so where do we see these problems?
Everywhere.
There is absolutely no industry that I've seen that is immune to this kind of data growth
and these problems that we're seeing.
Really cool examples everywhere.
And there's a lot of great data.
Just the technology of creating, generating, and editing data has gotten so good that
there's really interesting data everywhere, and people are using it.
So a little workflow diagram.
The key point of this really is to say, for the most part,
that this is kind of like the storage tiers sitting over here,
and this is kind of the less active side of a workflow.
We call it the archive side.
So this is going to be the stuff that costs less.
So if you're going to retain data for a longer period of time, you want to try to have it live on stuff that isn't
going to cost as much. If you're involved in a highly active workflow, you're probably
going to spend more on that storage because you need it to perform much faster, you need
it to be very accessible. And so like I said, data comes in at any point.
It's funny, a lot of people think, well, data starts here,
where I'm working on it very quickly, and then over time it will migrate over here.
Well, a lot of use cases, data is going to come in, whether it be surveillance
or highway infrastructure development or autonomous car testing,
gathering test data, things like that,
data is going to come in and they're not ready to use it just yet, just because they have so much
else that they're doing. So data is going to come in, they need to find a place that they can save
it so they can go get more data, and then they'll work on it when they get to it. So a lot of times
data actually comes into the archive and then it'll sit there. That's another reason why active archive
has become a really important characteristic
because they need that data
when they need it, but it just doesn't have
to be at this moment. It might be next week.
Data comes in
at various points in time.
One of the tricks is to figure out how can data
move back and forth
in a way that's seamless to the users.
That's really the trick
of an active archive infrastructure.
So kind of wrapping all that up,
these are really kind of the three tenets of what you're looking for.
It's like you want to deliver performance for those workflows that need it.
You need to deliver low-cost capacity as that data is growing.
You want to store it cost effectively, but you also need to provide seamless access
to those people that need it wherever they are. So an active archive
is really mostly about combining low cost capacity and active access.
And the performance thing, because what we expect to happen is if there's
a performance requirement, then that data is likely to be moved into the appropriate
place for that work that's going to need that performance.
So then we talk about some of the technologies. I may not be breaking this down into the granularity
that you guys are looking for, but even from my perspective, you know, these are kind of the big categories.
So if you're looking at access, you know, for the most part, I don't have
like iSCSI here and things like that, but what I see the bulk of our customers being
happy with, looking to integrate into their workflow, is either NAS to connect to the environment,
and NAS is fine for many people,
or, you know, cloud's on everybody's mind and we're involved in cloud conversations
every day; everybody's thinking about the cloud.
And so that might suggest a RESTful interface of some form, you know, over Ethernet, whatever.
And what I'm finding is that a lot of applications,
the applications haven't moved as rapidly
to adopting those technologies,
sorry for my screen,
as we thought.
So there's a lot of applications that aren't ready,
they're not cloud ready.
And so we have to be able to accommodate those.
As a matter of fact, it turns out to be more often
than not.
Sorry about that, folks.
And this will only slow us down further.
Resume slides from the webinar.
Where do you see...
Oh, there it is.
Thank you.
If I had a touch screen, I'd go right to it. Come on. I think this connector sometimes gets a little loose, and that's what was screwing it up earlier.
All right, I'm taking my glasses off.
It's serious work time.
I'm very sorry.
I'll make up the time.
It's switching, okay.
I could just go into no-slides mode.
Only I can see my mouse.
Okay, got us to the right slide there. Yay. Okay, sorry. Okay, so I think we're right around here.
Okay, so anyway, a lot of applications have not really gotten to the cloud readiness stage
as we might have thought three years ago.
And so NAS connectivity is kind of an interesting thing for a lot of people,
or some convenient access.
You know, that's not the world of everything, but you guys know what those things need to be.
In terms of storage technologies, there's probably some cool stuff being talked about around here.
But for the most part, and please speak up if you know differently,
I'd love to be educated.
But, I mean, there's tape technology and there's disk technology.
Oh, and flash, sorry, I have flash up here. But in terms of the low-cost capacity,
there's a lot of things being done with disk. You know, there's certainly capacity disk.
There's object storage, which is, you know, kind of a huge thing that's going on that really helps the economics, if you look at how they
do things.
So I think object storage is a big piece of the disk equation
in terms of how that's getting set up to accommodate the high-capacity stuff.
And then tape technology continues to be there.
So I don't know where all your guys' heads are at on tape.
A lot of customers actually love it.
They love the economics of it.
Yes, it's absolutely been displaced in a lot of the traditional places it used to be
in backup, as I described at the opening of this discussion.
But tape is finding some pretty strong footholding in the big data world.
And when I say big data, I mean the kind of large file sets, data sets that we were talking about, unstructured stuff.
And then, of course, you do need the high-performance stuff in the active workflow piece. And I put these things here, which is
really interesting, and I think one of the key messages I would like
you to get out of today is: you can't really set up a single type of
technology and have it serve all your needs. So there's got to be something in the environment that facilitates a tiering
between storage tiers that each address one of these key aspects.
So tiering is really important.
I talk about acceleration also because there's deduplication,
there's compression, there's WAN acceleration.
Sometimes people are looking at data that needs to move from A to B.
There's different ways of doing that as well.
And then there's gateways.
So there's a lot more gateway solutions that we're seeing.
I actually think that gateways and tiering
are kind of the game changers of today
in terms of helping manage data that has to be held onto for a long time.
So those are the solutions that I think are going to make a big difference
in a lot of the environments in the next few years anyway.
Yeah?
Two points. I was going to say optical is reemerging.
Oh, fair point. Yeah, thank you.
And I work for an optical company.
Okay, thank you very much. That's a great point.
I was also going to add that what I sense is that
customers are stuck in the process tied to the backup-to-archive approach,
and you can't break it. People do try to do
RESTful API-based archive for longer-term data,
but you never get away from NetBackup and control.
Right. Yeah, I totally agree.
And you're absolutely right. And that's the inertia
that the existing ways have.
And so I think the challenge and the opportunity for
all of us in this room is
how do you... One of the reasons is
it's just hard. And as a matter of fact,
one of the things I've said in an article is that's why I'm calling these things game changers.
If you can make it easy, you know, as easy as a backup app is.
People are so familiar and comfortable with backup apps.
It's like that's the easy way to do it.
The objective for us is to bring something to them that feels as comfortable and as easy for them to get their data to the new place,
if it can go to a new place.
Now, sometimes you've got some of the backup apps and things like that.
They've integrated an archive piece, so that helps a little bit,
and that's the right answer sometimes.
They've tried with that,
and I think they've even struggled a little bit
to get those things to be adopted, you know, very broadly.
But thank you very much.
Optical is absolutely another one of these capacity technologies that can be integrated, you know, just as seamlessly as the others can.
So we'll talk a bit about that integration. So, in terms of some of the common attributes of some of these archive storage technologies: tape,
and I apologize, I did not mean to ignore optical.
I would say that optical and tape kind of have a battle.
They can compete.
Optical has some good capabilities that we'll talk about.
When we talk about compliance, you've got some of the WORM capabilities.
Tape can offer that too.
Sometimes optical is seen as even stronger in that environment.
But low cost is really key.
I did some math, and I didn't publish it in this presentation,
but if you do a three-year analysis, and I like to say do a five-year analysis
also, and I don't know where
the optical comes in, but for tape,
and it depends on your capacity, I mean
tape can come in at the
$40 to $90
per terabyte
level.
And that's all in.
That includes the gateway to get there, it includes a library,
it includes media,
it includes drives and stuff like that.
If you look at
some of the cloud solutions, the public
cloud solutions that are out there, they'll be
around $250 per
terabyte. If you add up your monthly
expense
and you multiply it by three years,
so do a monthly rate times 36,
I think you'll end up at about $250 per terabyte for the cheapest stuff.
That's like Amazon Glacier.
And if you look at object storage, it'll be probably not terribly far from that.
So one of the things that we want to make sure people understand
is if you're looking at public cloud,
and I know I'm diverging quite a bit,
but if you look at public cloud,
you can look at the options for on-premise as well,
and you might find that to be more cost-effective, whereas a lot of people think cloud's cheaper. It
is from an entry perspective, and I think I talk about that. Yeah, it's really the lowest. You can
get into a public cloud for really, really cheap. I have one terabyte, let me store it there. But if
I have to grow it to hundred-terabyte or petabyte scale, then all of a sudden it starts to add up.
And then if you look at the investment over time, if you're continuing to pay a monthly
fee for what you're storing in the cloud and you go on to the five-year horizon, well,
you get to take your five-year investment maybe and look at a capital amortization over
five years.
And all of a sudden, the cloud goes from like $240 to, I think, over $400, $430 per terabyte, something like that.
And so then you can compare that to other technologies that maybe you could have invested in your own data center.
Granted, you're all looking at solutions.
But anyway, so that's some of the math that we can look at, too.
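To put that arithmetic in one place, here's a tiny worked version using the round numbers from the talk. The monthly rate is an approximation of Glacier-class pricing at the time, not published pricing.

```python
# Worked version of the 3-year / 5-year math from the talk.
# ~$0.007/GB-month (Glacier-class, circa 2016) is about $7/TB-month.
monthly_per_tb = 7.0

three_year = monthly_per_tb * 36   # monthly rate times 36 months
five_year = monthly_per_tb * 60    # same rate over a 5-year horizon

print(f"public cloud, 3 years: ~${three_year:.0f}/TB")  # ~$250/TB
print(f"public cloud, 5 years: ~${five_year:.0f}/TB")   # ~$420/TB, versus the
                                                        # $40-$90/TB all-in figure
                                                        # quoted for on-prem tape
```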
So, object storage. A lot of times they have a multi-site story.
They talk about durability a lot of times.
Sometimes it's replication,
sometimes it's erasure code,
and by spreading the data intelligently
over different locations,
you're including protection, you're including
disaster recovery copies.
And so you get to weigh the cost of all of those things
against an object storage solution.
So whereas over here today I'm buying primary, I'm buying backup storage, I'm buying DR storage,
I'm buying software to do all that stuff, my option might be to move it over to object
storage that has all that durability already baked into it
and I don't have to worry about the software
and all that kind of stuff.
I just need to figure out, and we'll talk about this,
how do I move it over there?
That's part of the challenge.
But anyway, so there's a lot of economics
that you can talk about.
And then we already talked a lot about the gateways,
which I think is a really important part
of today's environment.
And that's probably why we haven't seen adoption
as quickly
for cloud and stuff like that, because I don't think
the gateways have been developed
quite as much yet.
Okay, so we already talked about this.
Data moves back and forth.
So let's look at some examples, and then we
have some considerations afterwards.
I actually mentioned state infrastructure customers I'm familiar with.
What these people do is they have trucks
that drive around the highway infrastructure
with cameras on them,
so the cameras are gathering and storing on the truck
in some storage a bunch of this video
and photographic and imaging content.
And they're expecting those cameras
to increase in resolution, things like that,
so that data is just going to continue to get richer.
And so they get back to the shop, and they have to ingest all that data.
So now they need a high-speed way of getting data into their workflow
because what they want to do is they want to analyze all this data,
plan out the infrastructure of the future.
So they've got these huge packets of data that they're bringing in.
How do I ingest it quickly?
I know of similar workflows; I think I'll talk about one of these later on.
But sometimes I can connect my storage system that's here over NAS,
or over some kind of... maybe it's wireless, maybe it's something else,
I don't know what it is.
But they're always looking at what's the best way to do this.
I know a lot of them are still using portable hard drives,
plugging in a USB port in a workstation and bringing it in that way.
And they've got shelves and shelves of stuff that they haven't ingested yet.
So the ingestion piece is becoming a really important part of the process as well.
Anyway, that data comes in, and like I mentioned before, they're moving it over; the customer I'm thinking about here moves it over to object storage.
And then they've got all their workflow and process and everything else over there.
And so when the data comes in here, and so this is a gateway.
I think you'll see a theme.
I've got gateways in all of these.
I happen to know from personal experience,
and I'm not plugging my company,
Quantum, but we don't sell optical today, and we have
a lot of the other things that we've talked about: we have
flash, we have disk, we have tape,
and we also have gateways.
We're not pushing gateways. We have people that
direct connect to everything,
but it just turns out that when you have
the conversation and you're looking to solve a problem, that becomes an important piece of
making it easy for people. So anyway, when it comes in, it goes over
here, and then this gateway still has the ability to provide active access for all
the people that need to work on the data, and so that's one example. They've got
multiple sites, you know, so their object storage is able to leverage the durability
the way object storage was designed to deliver it.
So actually, it's structured so that they have this redundancy.
So I think we talked about all this stuff.
The tiering really happens when a customer
wants to retrieve the data and work on it,
then the gateway is able to
migrate the data from
object storage to some other
tier. Sometimes
it's just as simple as a NAS share drag and
drop. Sometimes there's actually an active
high performance disk stage that
it moves to within the archive infrastructure.
I'll talk about that in a minute as well.
Is the object storage that you have in that slide
remote, as far as a provider, or is it private?
They have data centers in multiple locations.
So it's... but they own?
They own those locations.
Within the four walls of their network?
Correct, yeah, yeah.
That slide does show S3, so it's Amazon, right?
No, that's actually the front end of that object storage.
Okay.
Yeah.
And it turned out, and I'll talk about another one,
that they'd be open to cloud,
but it turned out they're using a processing software that,
it's funny, it's actually a web-based software system,
but the only way it knows how to connect to storage is over NAS.
And so they needed this NAS connectivity over here,
which I found kind of interesting.
Ultimately, I would expect that software will support RESTful interface.
And if that happens, that's fine.
Go ahead and connect directly to the object storage.
But then you still need a way to ingest the data and stuff like that.
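For the developers in the room, the RESTful front end being referred to here is typically an S3-style API, which is why the slide shows S3 on a private object store. Here's a minimal sketch using boto3; the endpoint, credentials, bucket, and file names are all hypothetical placeholders, and an on-premise object store would supply its own endpoint URL.

```python
import boto3

# Hypothetical on-premise object store exposing an S3-compatible front end.
# Endpoint, credentials, bucket, and key names are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.internal",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Archive a finished file directly over the RESTful interface...
s3.upload_file("survey_run.mov", "archive-bucket", "ingest/survey_run.mov")

# ...and pull it back when somebody needs to work on it again.
s3.download_file("archive-bucket", "ingest/survey_run.mov", "survey_run.mov")
```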
This is financial transactions, securities trading. They have a lot of compliance requirements.
They could have put optical here. I know they chose tape in this case.
They have high performance ingest as well.
They've got a gateway over here.
Oh, I thought I took this out.
So they happen to be using a standard tool.
I'll talk a little bit about tools.
When I need to get data back and forth from my active work environment
over to my archive infrastructure, how do I do that?
There's a lot of software packages that are out there.
There's a lot of tools.
rsync is part of the Linux toolkit, and they're just using that. They scripted that up
to take care of data movement.
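As a minimal sketch of what that kind of scripted rsync movement might look like, wrapped in Python so it could be scheduled and logged; the paths are hypothetical, and a real deployment would verify the copy before draining the source.

```python
import subprocess

SRC = "/data/ingest/"          # hypothetical daily landing area
DST = "/mnt/archive/trading/"  # hypothetical NAS share exported by the gateway

def sync_to_archive() -> None:
    # -a preserves permissions/timestamps; --partial resumes interrupted copies.
    subprocess.run(["rsync", "-a", "--partial", SRC, DST], check=True)

if __name__ == "__main__":
    sync_to_archive()  # e.g., invoked nightly from cron
```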
Data's coming in from the left
into the archive environment, so that's why
they're able to use rsync. They gather up a bunch of data
on a daily basis over here.
They sync it over to here.
They can access it and do analysis
if they want, so that was part of the requirement
also. They want
high-performance retrieval, so if they
need to look at something or access it, they
don't like the 30 seconds latency that
a tape library might have associated with it.
They really like object storage,
its performance profile, but they absolutely
needed to have an offline
copy of the data. I'll mention
offline for a second. I don't have this in here either,
but ransomware, I'm amazed
at what I mean. I don't have still a lot of, but ransomware, I'm amazed at what I mean.
And I don't have still a lot of first-hand
experience on this, but I've had people tell me
recently that there's a lot of people paying ransomware.
And just so you guys know what that is,
maybe you all do, but
villains are coming in
and they're encrypting your data.
They can hack your network, encrypt
your data, and say, if you want your data back, pay us some money,
because they have the key and you don't.
And one good way to protect against that is to have some data sitting offline.
It's like, okay, enjoy your key.
I have my data.
I'll reinstall it over here.
So anyway, whereas we used to talk about DR and an off-site copy,
this whole deal about an offline copy is also becoming really important to have.
And so that's a piece of one of the considerations.
Actually, I do mention that later on, so I won't have to talk about it again.
So I think you get the gist of that one.
I don't think I have to go through this too much.
I don't think there was anything big, new news here.
Here's a major university in the United States.
They were absolutely going out.
I'll say it.
They were going to go sign up with Amazon.
Anybody here from Amazon, or work with Amazon?
So they were already moving in that direction,
and they were having troubles getting all of their needs met properly.
It's not like they were done with the analysis yet, but somehow I know that a storage vendor
got called in and was able to have a broader discussion about what they could do.
So what they ended up doing is implementing their own campus wide archive on their own
premise instead of going to the public cloud, which is where they were really believing
that they were going as a starting point.
And so what they did here, they used the magical gateway.
The IT department tells all the departments on the campus,
when you guys are done with your work,
doing whatever it is you need to do,
and you want to save it cost-effectively,
just drag and drop it over here.
So they gave each department their own archive share.
And so they're just dragging it over here.
And then they're leveraging actually tape libraries on the back end
because cost effectiveness is really important.
Now, these guys care a lot about their data.
Everybody we talk about does, but the data is precious here.
So they've got probably at least three copies of data. These are just
replica
instances of each other.
When they send the data out to tape, they send it
to two different tape libraries in two different
locations. Then they've got an offline
copy as well that they can send off
somewhere else. They cannot lose
their data. It's very valuable to them.
So anyway, that's a simple
multi-department drag and drop
leveraging tape, leveraging a gateway.
Really important to them.
So I talk about tiering, and actually,
just to emphasize what
that's all about, here's one of the things
about making it easy. So we talked about
when we tried to make tape easy:
it's got to be close to four or five years ago,
we brought out LTFS, the Linear Tape File System,
if you guys are familiar with that,
which enabled kind of a NAS front end to tape, and that was pretty good.
Each tape cartridge is a self-describing file system.
And so that kind of changed the world of tape.
And I loved, you know, people said,
you mean I can actually get to tape without a backup app?
That's what it does.
You just drag it over, either attached to your workstation
or attached to your network.
Problem is, if it didn't have a cache of some sort,
and some people implemented that,
then you still had to deal with the latencies of tape
and things like that.
So a tiered solution might bring something off of tape
if it gets active and bring it over here
onto some kind of a higher performance tier
that can eliminate the latencies
for ongoing work with that data.
So one strategy is to keep all your fresh stuff on disk or spinning disk storage over here
for a while and then automatically migrate it over so that if somebody wants to get it within
six months or three months or whatever it is, they have the experience of getting it right back as
they expect to. And then later on, if tiering is done properly, they can still get it back. They still go to the same
place, you know, in a lot of these tiering solutions. So they still go ask
for their data the same place, but if it's more than six months or more than a
year, it's like, oh I have to wait, you know, 30, 60 seconds to get my data. That's
not a big deal for a lot of people. But that's a tiering solution implemented
in a way that can deliver that kind of access conveniently and forget about what's behind
the scenes. That's what cloud is doing today. You go to Amazon Glacier, you want a file
back, you have to wait four hours if it's on Glacier. So people are obviously willing
to accept some of the tradeoffs of the lower cost of storage.
So in an application like that, do they consider deduplication and the effects on the overall storage?
Well, deduplication is a great question.
And so here's how I think about it, and there may be a lot of good debate over a drink on this stuff.
Deduplication
is at its maximum
benefit
in a backup environment,
and the reason is because I'm backing up
this data today, a little bit
of it changed, and I'm going to back up the whole thing
again. So if I can eliminate
duplicating the work of
everything I did the day before, why am I
doing that?
So with deduplication, you can see a huge payoff when you're doing repetitive work like that.
When you talk about archive, it's a little bit different.
Now every file is a little bit different and stuff like that.
So deduplication might be able to find some...
It's more like a compression algorithm at that point in time.
So you might see some benefit from the
compression aspect of it, but not
from the repetitive file aspect
of it, in my opinion.
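To make that point concrete with a toy example, here's a fixed-block dedup sketch on made-up data: nightly full backups share nearly all their blocks, so dedup collapses them, while unrelated archive files share almost none. Real deduplication engines use smarter, variable-size chunking.

```python
import hashlib
import os

BLOCK = 4096  # toy fixed-size blocks

def block_stats(datasets: list[bytes]) -> tuple[int, int]:
    """Return (total blocks, unique blocks) across all datasets."""
    seen, total = set(), 0
    for data in datasets:
        for i in range(0, len(data), BLOCK):
            total += 1
            seen.add(hashlib.sha256(data[i:i + BLOCK]).digest())
    return total, len(seen)

# Backup case: three nightly fulls of the same 1 MB file, tiny edit each night.
base = os.urandom(1024 * 1024)
fulls = [base, base[:-9] + b"night two", base[:-9] + b"night 3.."]
print("backup fulls  (total, unique):", block_stats(fulls))   # most blocks repeat

# Archive case: three unrelated 1 MB files.
files = [os.urandom(1024 * 1024) for _ in range(3)]
print("archive files (total, unique):", block_stats(files))   # almost all unique
```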
In this picture, is the gateway
the glue between the long-tail
archive and the nearline,
or is it still backup software?
Oh, this is not backup software.
So, as I said, the IT group
said to the rest of the university
when you're done, drag your files
over here. So it's a NAS share.
There's no backup software here.
They may have backup software doing something
in each of their departments over there,
but what they set up was the campus-wide archive.
So this has nothing to do with the day-to-day backup.
This says when you're done with your project and you want to store cost effectively, drag your work over here.
Now, when they drag their work over here,
and they're comfortable because they get the guarantee from the IT guys,
they say, okay, I have less that I'm storing and backing up over here.
So they've improved their environment,
and they're paying for less primary storage
here because they were able to offload
it over here. So that's kind of what we
talked about. When you're done with stuff
and you're ready to move it to an archive phase,
figure out how you want to do that.
Yes?
You haven't talked about any security or privacy concerns.
I will.
Any of this. Yeah.
Well, the deduplication thing that was just brought up
is an area where, if you've encrypted it,
or you have data privacy concerns,
which you would have in a university environment.
One of those stacks over there is the financial aid area.
The business guys at the university have huge security concerns.
Something that works for maybe sociology,
although some of the studies are saying things that you don't want public,
etc., it kind of complicates the situation
with at least two more dimensions.
It absolutely can, and it depends on how robust this solution is.
You're absolutely right.
So does this solution, and I don't know if I'm hitting the point,
but it's a point well stated and very much real life.
If this can handle encryption that those guys might be able to control,
can they move their stuff over here and keep the key?
Is there some key management process involved that allows
them to be comfortable?
However they're solving that problem today,
if they can solve it
a different way or the same way over here,
I think that can address
that piece. They want to make sure
that nobody has access to those records.
There's a server somewhere that they
got comfortable with.
And that's been an issue,
and I don't know how well the public cloud infrastructures
are getting around that.
I mean, that's been one of the first objections people have
to public cloud.
It's like, my data's way too sensitive.
I'm not sure I'm ready to go there yet.
There may be good answers to that up there,
but not everybody's on board yet.
So security is absolutely an important piece.
Thanks for bringing that up.
Media production distribution.
So here's the media
entertainment world.
It's kind of getting repetitive
at this point in time. Single company
has multiple data centers
throughout the U.S.
These are mostly U.S., I'm sorry. I grabbed some convenient ones. New York, Denver, and Los Angeles.
Today, I'll be honest, today what they're doing is they're using these locations
to spread their object storage so that they get the site durability.
So if any of these individual locations goes down, their data still stands. So I wasn't going to drill down into object storage.
I can answer questions on that later.
But there's ways that you can spread your data.
I mean, the simplest is just to think three copies.
But that didn't save you a whole lot of money, right?
Why not just replicate your primary storage at that point in time?
But there's things like erasure code and stuff like that
where you can actually spread data across multiple sites
so that if one goes down, you still have your data,
and it's not the same as three full copies of your data.
It kind of hits your point of deduplication.
It's a different way of making data storage more efficient.
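A quick worked comparison of the overhead involved, using a hypothetical but common 8+2 erasure-code geometry for illustration:

```python
# Raw capacity needed to keep 100 TB of data durable two different ways.
data_tb = 100

# Three full replicas: survives the loss of any two copies, 3x raw capacity.
replicas_raw = data_tb * 3

# 8+2 erasure code: data split into 8 fragments plus 2 parity fragments,
# spread across sites/devices; survives the loss of any 2 fragments.
k, m = 8, 2
erasure_raw = data_tb * (k + m) / k   # only 1.25x raw capacity

print(f"3x replication:   {replicas_raw} TB raw")   # 300 TB
print(f"{k}+{m} erasure code: {erasure_raw} TB raw")  # 125 TB
```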
So they're using it for the efficiency of object storage
and the durability of their content and their data.
They're still really just distributing from one location,
but their plan as they've got this thing scaled out
will be now I can take my data and access it over here,
I can get my data and access it over here,
and I can distribute from these different locations,
and maybe they'll even spread their production resources around.
Let's see, are there any key points on there? No, I talked about it.
Okay, so we
talked about it, and actually, to be honest,
I should have hit harder on the security. I like that.
I called it compliance and things like that,
but that's a really important
point that maybe I don't talk a lot about.
So just, you know, thinking about some of the other
considerations:
for cloud,
it's just another data center.
And so as I talk about these technologies, what users don't always understand is,
well, isn't cloud a different technology?
Well, no, it's everything we just talked about, just somewhere else.
And it's like, I finally got comfortable with that.
That's really what it is.
And so whatever front end they put on, a lot of times it's a RESTful interface, for the network out there.
And so we're seeing the public cloud services,
the ones you know,
implementing all of these technologies,
including tape, including object storage
or disk, I mean optical disk.
That's all happening behind the scenes, and users aren't saying, well, what kind of technology
are you storing this stuff on?
Don't worry about it.
Ask for your data.
You get it back in four hours.
It doesn't matter to you what I'm storing it on.
I give you these guarantees.
I give you this SLA guarantee.
All you care about is your data, when you get it, how you can get it.
Forget about what's behind the scenes.
So that's all I really need to say about cloud.
It's just another implementation.
Ten minutes to go.
So data movement.
We talked a lot about that, or we mentioned it from time to time.
This is really key.
I've been in environments where the users absolutely will not tolerate having to move their data to a different place.
And there's a bunch of solutions that have been out on the market for a long time,
where what they'll do is they'll crawl the user data, and you get to set whatever policies:
has it been touched in a certain amount of time?
Is it a certain type of data? Is it in a certain location? Whatever it is.
You set up the policies, the software will go out, it'll find the data that fits your profile and say,
okay, you told me anything that looks like that can be moved over here.
And, and there's different technologies, it'll stub it somehow.
So that way the users keep going to where they're used to going to get their data, but behind the scenes this magical data
mover has moved it over to the archive. And that works great for a
lot of people. The science of stubbing data, I'm learning, has
got some depth to it.
That's deep water. Yeah. It's really deep water. Most IT people don't want stubs.
Agreed, agreed.
And I use the word stub
to represent what we're talking about.
But you're right.
You're absolutely right.
You can think of it as a link or something.
Yeah.
Because if you move the location over here
and the link stops being aware of where it went,
all of a sudden you don't have access to your data.
Sometimes it'll just move the data
into a predefined location and the users can go over there.
So that's part of what holds back a lot of the
migration to archive is how do I get my data there
in a way that the users will tolerate.
I'm seeing a lot of adoption of some
stubbing technologies. I see exactly the concerns and the issues
that you brought up.
So anyway,
that's an opportunity
to figure out how to do that right.
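As a minimal sketch of that kind of policy crawl, assuming an access-time policy and a plain symlink standing in for a vendor's stub; real data movers are far more careful about verification, atomicity, and stub formats that applications can follow.

```python
import os
import shutil
import time

ARCHIVE_ROOT = "/mnt/archive"   # hypothetical NAS share from the archive gateway
MAX_AGE = 180 * 24 * 3600       # policy: untouched for roughly six months

def crawl_and_stub(source_root: str) -> None:
    cutoff = time.time() - MAX_AGE
    for dirpath, _dirs, files in os.walk(source_root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue                          # already stubbed
            if os.stat(path).st_atime < cutoff:   # policy match: cold file
                dest = os.path.join(ARCHIVE_ROOT,
                                    os.path.relpath(path, source_root))
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                shutil.move(path, dest)           # move data to the archive tier
                os.symlink(dest, path)            # leave a "stub" behind so users
                                                  # still find it where they look
```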
I think we talked a lot about that already.
Oh, and I guess the other thing that I'd say,
one of the things that people need to be aware of,
is if you remember some of the diagrams that we talked about:
I have my applications and my primary storage and all that stuff on the left,
and one decision I want to make is I need to get my data over to this archive infrastructure.
So that's one movement, and we just talked about that on the previous page,
how you do that.
But then if this is
a tiered storage environment where
I've got some fast disk or I've got
some flash and I've got some tape or optical
on the back end, I've got another
data movement thing going on here.
So in order to implement
a real retention strategy,
I have to think
about how does my data migrate across the
whole spectrum of stuff.
Most of the time, from what I see, although I do see examples otherwise, it's easier to figure out how to do it with two parts of the solution.
The other way to do it is you've got a single vendor or some kind of a homogeneous environment
where that vendor can manage all of that up and down.
And there's some vendors out there that do that,
but users don't always want to be restricted to that.
And I talked about open systems,
and so one of the magic things to look for is
how can I take an open environment where I've got a bunch of different vendors
with different pieces throughout my environment,
and my users are happy because they always know where to find their data,
and they're also happy because they're paying the least amount they can
to retain it over a long period of time.
And they're also happy because it comes back to them fairly quickly.
So ultimately, an archive strategy isn't as simple as we thought,
and so the opportunity for all of us in the room is to design the solutions that make it simple.
And there's great ways that have been brought to market so far and that can continue to get improved.
So compliance and integrity, this is where I would put the security thing. Oops.
I don't know what's causing that.
Other speakers have had similar issues.
Really?
Yeah, it's interesting.
Okay.
Boy, I hope I don't have to take all the rest of the time
just to figure this out.
So in terms of compliance and security
and data integrity,
so let me just...
Here we go.
And I'll do this.
I'm not going to mess with it.
So it's not always about the storage
target. Sometimes people say, I want optical
because it's got the WORM piece of it.
You know, WORM, write once, read many, comes up quite a bit,
but there's other things. And there's also conditional access, and just access at all to your data.
And so all I want to do is I want to name some of the other things that people have to think about
when they implement a data retention strategy.
So the same things that you had to think about, you know, as part of your primary storage implementation,
those same questions and issues, you know, might very well come along with your data as it moves.
Data integrity is another piece.
There's a lot of great solutions out there.
If you're going to store data,
it's much more common now:
five years, ten years, twenty years.
There's a law firm that we do business with.
They actually store data for NFL injury cases,
and they have to store their data for, I think, 68 years,
I think is what they said.
So 60-something.
I didn't get the real reason why that is,
but maybe that's how long they expect the patient to live, type of thing.
But anyway, multiple decades is how long they need to store stuff.
That will bring me to another point as well.
But you want to make sure, however long that requirement is,
that that data doesn't degrade; you've heard terms like bit rot and stuff like that.
Is my data being checked on some kind of an ongoing, regular basis,
so that I know it's
not dying, whatever
media it's stored on? And there's some great solutions
out there to do that.
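The kind of ongoing checking described here is usually called a fixity audit: record a checksum when data is ingested, then re-verify it on a schedule. A minimal sketch, assuming a simple JSON manifest of relative path to SHA-256 recorded at ingest:

```python
import hashlib
import json
import os

MANIFEST = "fixity_manifest.json"  # hypothetical {relative_path: sha256} file

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def audit(root: str) -> None:
    with open(os.path.join(root, MANIFEST)) as f:
        manifest = json.load(f)
    for rel, expected in manifest.items():
        actual = sha256_of(os.path.join(root, rel))
        if actual != expected:
            print(f"possible bit rot: {rel}")  # flag for repair from another copy
```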
Sorry.
So, scale. You know,
can I continue to scale? With a lot
of primary solutions, I know when you scale them up,
as we talked about with what people do today,
one of the things they don't like is that in many cases,
depending on the architecture, as you scale up,
your performance goes to hell.
And is there a way around that?
So make sure that you understand how the storage capacity of this thing
will continue to scale in a way that doesn't stop delivering on one of the promises
that you put together in the first place,
whether it be access capability,
or ease of access, I should say.
Or do I have to move to a secondary location?
So that's the other thing
that the whole archive world is faced with,
is how do I have this infinite scale?
And there's solutions out there that do.
I mean, public cloud, they're not talking about any options.
Object storage can scale huge.
Tape libraries can get really big.
So there's a lot of great scalability options out there.
Just make sure that that's part of your solution.
And then if you have a gateway or something like that,
is there a file count limitation?
Sometimes data under management has issues with how much metadata can I handle,
and you might have to store that.
Sorry, I'm using the wrong button.
Reporting. So, like, one key thing
in that university example,
one of the key criteria those guys needed,
that had to get delivered,
is that every department needs to pay their share
of how they're using the archive.
And so IT needs to have the tools
to report and bill properly, so chargeback types of models.
So as IT looks to set up a service,
or as a cloud provider looks to set up a service,
they need to make sure they've got ways of doing the billback
and accounting and things like that.
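A toy sketch of that kind of chargeback report, assuming one directory per department under the archive share and a placeholder internal rate:

```python
import os

RATE_PER_TB_MONTH = 10.0  # placeholder internal chargeback rate

def usage_tb(share: str) -> float:
    total = 0
    for dirpath, _dirs, files in os.walk(share):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total / 1e12

def monthly_bills(archive_root: str) -> dict[str, float]:
    # e.g. /archive/physics, /archive/financial-aid: one share per department
    return {d: usage_tb(os.path.join(archive_root, d)) * RATE_PER_TB_MONTH
            for d in sorted(os.listdir(archive_root))}

if __name__ == "__main__":
    for dept, cost in monthly_bills("/archive").items():
        print(f"{dept:20s} ${cost:,.2f}")
```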
Format migration is a really important one.
This, I think, I'm seeing more work being done here,
or at least more of it's come to my attention recently.
And so what that means is, you know, we've always heard,
I'll use LTO as an example, the LTO tape format.
You know, they're out there:
it started with LTO-1, then LTO-4, then to 5,
and how do customers migrate from one to the next?
And that's only part of the equation.
So you can do that from a storage technology perspective,
but how do I do that from a data format perspective or anything else I need to worry about?
And so there's actually companies out there that are setting up solutions and or services
where they spend all the time, you know, understanding this data format and when the new one comes out,
it keeps coming out, so they're creating migration tools so that behind the scenes that data can be managed and handled for the long run.
You know, it's not a huge deal if you're talking about five, seven, even ten years; that's pretty easy to manage, I think, from my perspective.
But as you get into 10, 20, 30, 40 years,
this whole notion of digital preservation, I'm seeing more of that.
So this may be more niche-y from a market needs perspective, maybe not,
but I'm seeing businesses getting set up around that stuff.
Okay, I think pretty much a wrap.
So active archive, it's a common requirement everywhere in the world that I see.
It's all about long-term retention.
It's in all industries.
It can offer substantial benefit.
I didn't really quantify it that much, but there's huge benefits. I would say that people can spend half as
much if they address
an archive infrastructure instead of the
status quo.
You've got to look for the right balance of cost and performance.
For a lot of people, the first things that are easy to talk about
are cost, access, and performance, but then there's all those
other considerations we just talked about that you can't
forget as well.
I told you who the Active Archive Alliance is,
and if you want, there's a report that you can get.
You can either go to the website and leave your name,
or if you want to make me look really good,
you can give me your card and we'll send you something,
however you want to do it.
It doesn't matter to me.
And that's our chat.
So I appreciate the interaction.
Thank you all very much.
I think I've just been told that we're at the end of the time that we're supposed to be speaking. So maybe it's time
for you to migrate yourselves somewhere else. Any other questions you guys have? Thank you very much.
Appreciate it. Thanks for listening. If you have questions about the material presented in this
podcast, be sure and join our developers mailing list by sending an email to developers-subscribe
at snia.org. Here you can ask questions and discuss this topic further with your peers in
the developer community. For additional information about the Storage Developer Conference, visit
storagedeveloper.org.