Grey Beards on Systems - 122: GreyBeards talk big data archive with Floyd Christofferson, CEO StrongBox Data Solutions
Episode Date: August 20, 2021. The GreyBeards had a great discussion with Floyd Christofferson, CEO, StrongBox Data Solutions on their big data/HPC file and archive solution. Floyd is very knowledgeable on the problems of extremely large data repositories and has been around the HPC and other data-intensive industries for decades. StrongBox's StrongLink solution offers a global namespace file system that …
Transcript
Hey everybody, Ray Lucchesi here with Keith Townsend.
Welcome to another sponsored episode of the Greybeards on Storage podcast,
a show where we get Greybeards bloggers together with storage and system vendors
to discuss upcoming products, technologies, and trends in the data center. With us today is Floyd Christofferson, CEO of Strongbox Data Solutions.
So Floyd, why don't you tell us a little bit about yourself and what Strongbox Data Solutions is all about?
Sure. Thanks, Ray. And it's a pleasure to
be here today. Strongbox Data Solutions is a company that was built on the premise that there
is no one-size-fits-all storage. So it's a software solution that addresses the problem of heterogeneous
storage environments. In other words, when data goes beyond what can fit on a single storage platform.
So my background is in the media and entertainment space and in the HPC space,
where I worked at SGI for 10 years in data storage, mainly in data software solutions. And what we realized in those very large
environments that is typical in the HPC space is that the data is going to outlast the storage
that it's placed on, inevitably. And that involves moving data either between different storage types
of the same vendor or, more typically, across different
vendor storage, different file systems. I think I have data that's like 20,
maybe, no, 30 plus years old, and I'm still carrying it around. So yeah, it's outlasted any
storage that it ever existed on. No, exactly right. And how many of us have those old hard drives with your photo library and
how many copies of those photos? And just knowing what you've got and where it is. I'm just a
single-person organization. So I imagine for enterprises, it's like it goes back centuries.
Oh, exactly. Exactly. I mean, it's a huge problem. And the issue is that most people
are thinking of it, you know, kind of logically.
Well, my storage is filling up, so I've got a storage problem. And so who do I call? I call a storage vendor.
Well, the issue is, again, there is no one-size-fits-all storage. So where do you move it?
And then how do you avoid creating silos? So moving the data from one place to another, but it disconnects your users from that.
And so we formed Strongbox Data Solutions to create a product called Stronglink to address
this issue.
And that is to create a vendor-neutral software platform that can help customers manage large volumes of data, virtually any volume of data, across any storage type.
So that from a data-centric as opposed to a storage-centric point of view, they're able to manage where that data is placed, how many copies there
are, who can see it, when it should move, but without the typical interruption to user access
or the proprietary hooks, vendor specific hooks that are often the case.
And it works with primary NAS solutions and file data and that sort of stuff? Or is it certainly not
block storage kinds of things, right? We can definitely talk to block storage.
But you're right, it is mainly for unstructured data, whether that's files or objects.
And it really can leverage anything. I mean, we've got customers, for example, in dozens or in some cases, hundreds of petabytes,
where they've got pockets of NAS,
they've got luster file systems,
they may have tape,
they have a desire to move to cloud,
but they can't move it all to cloud.
And all of this orchestration of data placement
has to happen in a way that doesn't interrupt users and gives administrators complete knowledge and transparency of what they have and where it is.
Hundreds of petabytes? That kind of stuff? I mean...
Oh, yes.
We have a customer in manufacturing, or rather in the automobile industry.
Basically, it's autonomous vehicle research,
and they're generating about two petabytes a day of new data. Every day,
two petabytes of new data. And this is all telemetry from self-driving vehicles.
So all the sensors, as they get higher resolution, more cameras on test cars,
these generate a lot of data. And these data need to be kept forever
for liability and accountability purposes. And as you might imagine, that's a huge expense.
So being able to rapidly move this data off of expensive storage where it was born
and into a place that it can be preserved, protected, copy managed, everything.
That's very key. Two petabytes coming in per day, that would be two petabytes going out per day in
a normal business environment, right? That's exactly right. That's exactly right. And that's
how, you know, that's one of the aspects of StrongLink: there really is no limitation to the way it can scale.
We've got another project at DKRZ, which is the German Climate Science Research Computing
Center in Hamburg, and they've got 150 petabyte environment.
They're just adding a new supercomputer.
The project that we're working with them on, they call it an exabyte data archive, because the amount of climate science simulation data that they create is a hockey stick in how
it's growing. And they cannot interrupt to migrate from HPSS, for example, to an open standard format
or to add more storage. They do not have the time to take the whole system down
and then start with something new. They have to make it a smooth transition.
So Floyd, I'm a little bit unclear on what we're talking about as far as a solution. Are we talking
about a solution that moves data off of a system? Are we talking about a solution that is a system that you move data on to?
Are we talking about a system that keeps data in place
and then just provides the metadata?
What exactly is the solution?
Yeah, that's a good question.
So Stronglink is software that is designed
to leverage intelligence, so metadata, about the storage and the files across any storage.
So we will roll into an environment and install on customer hardware or on their VMs.
And first of all, do an analysis, harvesting all of the available metadata from the customer's data.
So they know what they've got.
They know where it is.
They know how often it's used. And if there's rich metadata in those files, we can extract that. But then, so from that,
the customer gets a picture of where their data is, and then we enable them to create policies,
you know, move that data off of expensive storage to cloud or to tape or to an object store, but without interrupting user access.
So it's a virtualization to create a global namespace across otherwise incompatible storage.
And this is a software solution that enables our customers to manage that data across any vendor storage.
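As a rough illustration of the harvest-then-apply-policy flow Floyd describes, here is a minimal sketch in Python. It assumes a hypothetical age-based rule with invented field names and tier labels; it is not StrongLink's policy syntax, just the general shape: collect file system metadata, then decide what should move to a cheaper tier while the logical path users see stays the same.

```python
import os
import time
from dataclasses import dataclass

# Hypothetical rule: files untouched for a year are candidates for a cheaper tier.
AGE_THRESHOLD_DAYS = 365

@dataclass
class FileRecord:
    path: str           # logical path users see; tiering never changes it
    size_bytes: int
    last_access: float  # epoch seconds, harvested from file system metadata
    tier: str           # e.g. "primary-nas", "object", "tape" (labels are invented)

def harvest(root: str, tier: str) -> list[FileRecord]:
    """Walk a mount point and collect basic POSIX metadata for each file."""
    records = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            records.append(FileRecord(full, st.st_size, st.st_atime, tier))
    return records

def select_for_archive(records: list[FileRecord]) -> list[FileRecord]:
    """Apply the age-based placement rule; the actual data movement happens elsewhere."""
    cutoff = time.time() - AGE_THRESHOLD_DAYS * 86400
    return [r for r in records if r.tier == "primary-nas" and r.last_access < cutoff]

if __name__ == "__main__":
    catalog = harvest("/mnt/primary", tier="primary-nas")
    to_move = select_for_archive(catalog)
    total = sum(r.size_bytes for r in to_move)
    print(f"{len(to_move)} files ({total} bytes) eligible for the archive tier")
```

The point from the conversation is the comment on `path`: only the physical placement changes, while the logical path stays stable for users and applications.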
So you go in and you ingest metadata from the environment.
Is it like a cutover kind of scenario, or are you just kind of, I'll say,
take over control of a file as you ingest its metadata?
No, it really depends on the customer's use case.
So we can sit in the data path for some use cases where the customer wants to control access, virtualize multiple tiers, for example, into a single tier.
But for high-performance workflows, you don't want anybody sitting in the data path, so there we sit outside the data path, harvest the metadata, and then enable administrators to take policy-based actions on when that data should move from one storage to another or how to manage it
across different copies for protection, or even across multiple sites. So in summary, it's a
software solution designed to use data intelligence across any storage platform to create policy-based data placement policies,
data copy policies, so that users know what they have
and can overcome the silos of data that are typically a problem in most large environments. So do you transform files into
objects or does a file have to retain its file nature throughout wherever you put it? No, this
is a very important question because we do not transform any file. We're not adding stubs. We're
not adding any kind of proprietary symlinks. Your file is your file. We're not putting agents on
the storage. What we're doing is providing a management layer that can interact with the
files in a non-proprietary way as they are. So if you're on an object store, we can present that out
through a file-based interface, SMB or NFS. If it's on NFS, we can present it out
through S3. So we're creating a global namespace that bridges these incompatibilities so that
users and their applications are mapped to the data and not to the device.
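To make that multi-protocol idea concrete, here is a minimal sketch, assuming a hypothetical global-namespace mount at /mnt/stronglink and a hypothetical S3-compatible endpoint in front of the same namespace; the mount point, endpoint URL, bucket, and key are all invented for illustration, not the product's actual layout:

```python
import boto3

# The same logical object reached two ways. All names below are hypothetical.
LOGICAL_KEY = "projects/climate/run42/output.nc"

# 1) File-based access: the global namespace is mounted like any NAS share,
#    so ordinary POSIX applications simply open the path.
with open(f"/mnt/stronglink/{LOGICAL_KEY}", "rb") as f:
    head = f.read(16)

# 2) Object access: the same namespace presented through an S3-compatible
#    front end. The endpoint is a placeholder; credentials are assumed to be
#    configured in the environment.
s3 = boto3.client("s3", endpoint_url="https://stronglink.example.internal")
obj = s3.get_object(Bucket="global-namespace", Key=LOGICAL_KEY)
same_head = obj["Body"].read(16)

# Whether the bytes physically live on NAS, object storage, or tape is a
# placement decision behind the namespace; the logical key stays the same.
assert head == same_head
```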
So one of the challenges I've had in the enterprise with implementing global namespaces in general
is user workflow and HPC is a great example of this.
So let's say that, you know,
we have three different NFS filers
that we've expanded across, and we're challenged with,
you know, the file systems turning over.
Workflow is very difficult to get integrated from an infrastructure perspective down to a consumer.
Scientists just won't change links to data sets, but yet I have this challenge of managing data.
How do you guys help with what doesn't seem like a technical problem, but much more of a political, cultural problem of training
and getting users to leverage a global file system?
Sure.
And you highlighted something that is the biggest problem,
and that is the cultural change of moving users' access from one point to another. And that's the problem that a
lot of companies face. For example, maybe they've got an Isilon that is getting
full. Well, they don't want to necessarily move the data off of the
Isilon, because now users and applications would have to be
repointed to a different mount point. So our approach is to try to maintain the way that users
see their data as much as it was before. So let me explain what that means.
So when we create a global namespace, we're creating a multi-protocol access of mapping them to the
data. So if we create a mount point, for example, that mount point goes to a file system that is a
virtual file system. It looks like a normal NAS, or if I'm coming in an S3 and expect to see an
S3 object store, it looks like an S3 object store. But the reality
is that that data might be sitting on an Isilon, it might be sitting on any storage in the background.
So we're virtualizing that so that when data needs to move in a migration event from one storage to
another, the mount point doesn't change. The data moves from store A to store B,
but the mount point doesn't change. Does that make sense?
Yeah.
I think for the most part, it makes sense.
I'm just thinking about, like, some practical challenges that happen.
So now I have the metadata from the global file system.
Then I have the metadata from the actual storage array.
Keeping those two in sync is a challenge, maybe not the challenge, but is a challenge. And another
part of that challenge is administrative, because now it seems like I can assign rights at both
layers. So access control, how is access control handled when we're abstracting the
mount points? Sure. I mean, as with everything, our mantra is that we're not adding proprietary
hooks. We're not adding other layers. We're extending what they've got. So when a user
comes into Stronglink, which is the platform, or they're mounting, they may just see a file
system mount. Whatever their user authentication is, their Active Directory group ID and user ID,
and the permissions on those files, that translates straight through to Stronglink,
so that we're not adding anything. Users are not having to create a separate authentication.
Whatever permissions they already have are what they have.
What they can see, the data they can access, and the permissions they have for those, they still have.
And it doesn't matter if it's on an S3 store, an NFS store, an SMB store, anything.
You mentioned Lustre.
I mean, so these guys are using POSIX kinds of things.
They're not even using NFS or SMB, right?
Yeah, well, NFS is a POSIX file system, but you're right.
We have to be able to bridge the semantics of multiple different types.
And that's really the virtualization that StrongLink does at the top layer.
So, and something else you mentioned before, there are multiple types of metadata.
So what StrongLink is doing is aggregating all of these.
So straight file system metadata, simple stuff like how often the file is accessed and what
those permissions are, but also, you know, rich metadata that might be in the header of those
files. Even something as simple as a Microsoft Office document, or, often from electron microscopes, lens settings,
things like that. But then there's custom business-related metadata. So what project
does this belong to? Which department? Or how many departments? What's the retention policy?
All of these metadata combined create a very rich picture that enables users to track what a project is costing the organization, and whether the data is sitting on the correct storage.
And when you've taken away that limitation that, oh, I can't move the data off of expensive storage because either I'm going to have to use a stub or a symlink, which is proprietary, or I'm going to disconnect my users. But if you've got this combination of
the richness of the metadata, so you know what you got, you know where it should be, you know
when it should move, but you can do that movement without interrupting user access, now you've got
complete control of your data and your environment. And so what sort of policies can you create here?
I mean, so data hasn't been accessed in a year, move it off to some level two hierarchy. If it
hasn't been accessed in 10 years, move it off to level three hierarchy or something like that tier?
Yeah, it can be that. It can also be more sophisticated than that. Maybe you want to manage a primary copy that is sitting on, let's say, an Isilon or your primary storage.
Over time, maybe you want to create a secondary copy that sits off-site or sits somewhere else,
or, like you said, age off of the primary. But that can be done anytime. So a use case might be bulk data that comes in on
expensive storage, like in the case of the autonomous vehicle use case,
where immediately it's moving off into a protected copy somewhere else, but under management.
Or in the case of the climate science research, the workflow is that they're assembling data out of a 150-petabyte pool that is scattered across four different tape libraries from different manufacturers, and then pushing it into Lustre using Slurm to be able to run their simulations, and then pushing it back.
You're talking multiple tape library vendors, multiple operating environments.
It's coming in. It's sitting in tape and you're pushing it into Lustre?
That's right. That's right. And so from a user's perspective,
they're coming into, in their case,
they're coming in through the Stronglink CLI,
which looks and feels just like the CLI
they used to do their HPC work in,
except that with our CLI,
you can also annotate with metadata,
you can run queries,
you can trigger it directly from Slurm so that these files for their projects, they'll just run their script in our CLI.
It'll go and assemble those files out of the tapes, wherever they may be across 75,000 tape slots in four libraries, pull them back into a staging area, push them into their Lustre environment,
pull them back, etc., etc.
This is all through Slurm scripts or whatever, right? Things like that?
Yeah, exactly. And, oh, by the way, another little highlight here is that this whole workflow is really unchanged from what they're used to with HPSS. And in fact, HPSS is
still there. The tapes are still in HPSS format, but we virtualized that. So in the background,
we've got a huge migration from HPSS to LTFS that the users are not even aware of. Oh my God. They don't even need to know.
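As a sketch of what that Slurm-driven staging step might look like from the user's side, here is a small Python wrapper a batch job could call before and after the simulation runs. The `slcli` command name, its flags, the metadata query, and the paths are hypothetical stand-ins, not StrongLink's actual CLI syntax:

```python
import subprocess

def stage_in(query: str, destination: str) -> None:
    """Ask the (hypothetical) data-management CLI to assemble files matching a
    metadata query from wherever they live (tape, object, NAS) into a Lustre
    staging area. Raises CalledProcessError if staging fails."""
    subprocess.run(
        ["slcli", "stage", "--query", query, "--dest", destination],
        check=True,
    )

def stage_out(source: str, policy: str) -> None:
    """Push results back under management so policy decides their final placement."""
    subprocess.run(
        ["slcli", "ingest", "--path", source, "--policy", policy],
        check=True,
    )

if __name__ == "__main__":
    # Typically invoked from inside an sbatch script, before and after srun.
    stage_in(query="project=cmip6 AND run=42", destination="/lustre/scratch/run42")
    # ... the simulation itself would run here via srun ...
    stage_out(source="/lustre/scratch/run42/output", policy="archive-dual-copy")
```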
Yeah, yeah.
So you're managing the tape, you're managing the Lustre, and you talk NFS and SMB and all this other stuff as well.
Oh, and S3, yeah.
So the idea is, you know, I'll always say things like the data will always outlast the storage it's lived on, right?
And there's no one size fits all.
So the challenge is that when you look at the world from a storage-centric point of view, it's like, okay, I've got a repository that fits a certain class of use case.
I need to put my data there.
Okay, when it starts filling up now, what do I do? But if you flip the script a bit and you look at it from a data-centric point
of view, then your storage resources, whatever they may be, are part of a fabric. Some are
high-performing, some are lower-performing, lower cost, and the data lifecycle always has a right place for the data
to be at this particular moment.
The limitation often is that to move to a different storage, like you said, interrupts
user access or creates friction.
If you remove the friction so that the data can move without interrupting users, and you enable the tools so the administrator can pick whatever storage,
then what you're able to do is create this data-centric fabric
where the workflows are important, availability to the users is important,
and then the physical storage is a function of where you are
in that lifecycle on any given day.
Talk to us a little bit more about scale, because right now we've talked about hundreds of petabytes.
Sure.
99% of the enterprises are not hundreds of petabytes.
Like, what's the market for this?
Yeah. Yeah, so we usually see customers north of about 200 terabytes, but especially from 500 terabytes and above, they'll have this problem.
And that problem is that one storage array isn't going to solve it, which means you're going to have to have some kind of a bridge.
Typically, it creates a silo. Maybe I want to explore moving things off of the most expensive primary and maybe have
an object store or a mezzanine tier.
Or maybe we want to start exploring moving data to cloud.
How does that work?
But again, if you move data from one thing to another, you're inevitably disconnecting
your users, which creates friction
in doing such things. If you can remove that friction so that the data movement does not
impact user access or cause change to user access, now your storage choices are much broader,
and you can make them proactively instead of reactively.
So, I mean, data movement is a critical aspect of any solution like this.
I mean, how do you optimize?
I mean, if you're moving petabytes or hundreds of terabytes a day,
we're talking pretty sophisticated environment here.
Yes. Yes, we are.
And this is, I mean, we just released the third generation of StrongLink.
And we've been in production, I guess, close to five years now.
And in those first two generations, we learned an awful lot about scale. So in what we call our
workflow engine, which is the underlying parallelized data movers within the system,
these are highly granular. So we go down, you know, we actually leverage the AVX-512 instruction set on Intel x86 processors to get really maximum parallelization.
So in the two-petabyte-a-day use case, we started from our basic three-node HA configuration and have expanded that out so that you just have enough physical bandwidth
to do that. And we can continue to do that. So a user could start from a single node
with a small amount of data movement and a relatively small amount of data to manage.
And like this use case, expand up to extreme scales. So nobody knows how fast their data is going to grow,
and nobody knows what storage they're going to need to have it on.
And so you have to create it in a policy-based fabric.
So just for grins, how many nodes does moving two petabytes of data a day take?
And what do those nodes look like?
The nodes are actually relatively simple.
They're simple, you know, one- or two-U commodity servers, a Dell R630 or PowerEdge, a typical sort of size. They're not overly beefy, and we can
do it in VM instances as well.
But it's really the network and the storage performance.
So we will go essentially at line rate.
So whatever the network can handle, in their case, it's a hundred gigabit network.
They've got 80 tape drives, a mixture of Jaguar and different generations of LTO across three libraries in that use case. And then the bottleneck is really how fast we can drive the Isilons.
And so we'll max out whatever the Isilons can handle.
So it's the parallelization.
And this is, again, why if we go back to the DKRZ use case where they're using our CLI,
this is not just a CLI from a single workstation
initiating a data movement of, let's say it's a half a petabyte into simulation. But because our
CLI is leveraging the parallelization of Stronglink, it can parallelize those jobs out across,
in their case, 11 nodes, 11 StrongLink nodes. And so
it can be a flood of data that comes in, all within the QoS and quotas that they may have established.
Right. Talk to me about the HA. You mentioned that you support multi-node HA solutions. So,
I mean, if you're going to be, let's say, front-ending my petabyte storage, I mean,
HA has got to be extremely important, right?
Oh, it is, yeah. And everything is... With our HA, this is not a typical Linux cluster.
So our minimum HA configuration is three nodes. And these three nodes, there is no head node,
there's no single point of failure, and they all work as peers. The identical software load goes on all of them.
The system is distributed out.
All operations are distributed out across them and load balanced across them.
The metadata database, now we're not storing any customer data,
but only the metadata is in the StrongLink metadata database. And that is sharded and fully replicated across all available nodes in either single or dual copies,
plus automated backups of that.
So we've got multiple layers of protection within the metadata database.
And the data itself is on the customer storage in open form, right?
What about replication across data centers? And, you know, this sort of data can be extremely
valuable and they want it to be available in disaster recovery scenarios and that sort of stuff.
Yep. And that's a very common use case. And for that we have, it was also part of our third-generation Stronglink announcement last month, what we call Stronglink Galaxy.
So a single node we call a star, and a cluster we call a constellation, because it's not a typical Linux cluster. There's no head node, and so you're not failing over a head node.
Well, we can tie together multiple Stronglink clusters, or constellations, into what we call a galaxy, which is multi-site.
And here we can have one-directional or bi-directional synchronization of all metadata as changes happen, and all or parts of the data. So DKRZ, for example, they're making massive amounts of changes to their data sets every day, every second, actually. And so as that happens in Hamburg in the primary environment, the metadata
is updated in the secondary instantly, and then they've opted to only replicate 10% of the data, the most active.
U.S. Library of Congress is another poster child for this, where it'll probably end up
being four or five data centers that are tied together.
So all of the digital assets of the U.S. Library of Congress are managed by Stronglink.
And this is a primary data center that consolidated 50 different data silos,
plus two satellite data centers in AWS, where they've got about 18 petabytes.
And then there will be at least one and possibly two other data centers. Stronglink has instances in all of these,
and they are all tied together where they exchange metadata and then by policy,
a certain portion of that data. So you mentioned the metadata is replicated across clusters, constellations, stars, those kinds of things.
So is that synchronously replicated or asynchronously replicated?
So if I make a change to the metadata on site A, let's say, how long does it take before it hits site B?
For the metadata, it's whatever the latency of the network is. So it's virtually
instantly. The data itself, it's policy-based. No, no. Yeah, I understand that. So I just want
to clarify, it's synchronously replicated? The metadata is synchronously replicated between
sites? So if I'm updating metadata on site A, the update actually doesn't take place until
the metadata is sitting in site B?
No, no. Site A is independent of site B.
So it is asynchronously replicated, but as soon as it's available. I got you. Okay.
Yeah, yeah, exactly. And of course, if there's a network interruption, it will buffer and then
catch up once the network is available again. And you mentioned you have instances in the cloud.
So if I'm a customer that has, I don't know, a couple petabytes of data
and I want to move some of it to AWS and some of it to Azure
and some of it to Google Cloud,
I end up having a Stronglink solution sitting in all those clouds?
No, you don't need to.
In the case of Library of Congress, they have instances
in the cloud because they've got other use cases. But typically, it's a single strong link instance
on-prem at the customer's data center. And then they can, by policy, connect or extend that data center into cloud-based storage, S3 or Azure at this point.
And so that just becomes a seamless extension of their primary data center.
But they won't have access to that data from the cloud.
They'd still have to have...
So Stronglink is providing access to that data in the cloud at the primary data center, let's say, in this case.
Yeah, exactly right. And you don't want to have anybody coming in the back door to your storage,
whether it's cloud-based storage or on-prem storage. So you want to control access.
So Floyd, talk to me a little bit about the control plane and control plane access when it comes to data center management and orchestration.
Are you guys partnering with any other providers to orchestrate data movement from a higher abstraction layer?
I mean, you know, people are talking now about using Kubernetes to manage data center orchestration, et cetera, from a larger
perspective? Are you guys working with any of those types of solutions?
Well, not exactly, but there's also not necessarily a specific need. So if you think about what we do,
we're abstracting or bridging silos of different storage types and presenting those out in a
vendor-neutral front end. So multiple protocol access, NFS, SMB, S3, to data that might be
sitting on storage that doesn't even support those protocols. A CLI access, a web-based control panel access.
So a common access across all of those.
And so where the user wants to store it, that's completely up to them.
So, Floyd, you mentioned Slurm as using your CLI, I guess, to move data back and forth.
Something like that presumably could be done similarly in Kubernetes or other orchestration solutions as
well, right? Yeah, potentially. I mean, candidly, none of our customers have ever asked for that.
They're usually at volumes that may not work so well for that, but it's never really come up
with our customers. Typically, they need to be able to, it's either NFS, SMB, or S3, or the CLI access directly.
And being able to bridge across the silos is really their main focus.
And the CLI is talking to your StrongLink software set of nodes sitting on the data center, presumably, right?
Correct. Everything that we do within
Stronglink is available through the API. I mean, we do have other customers. NASA is one,
for example, where their users aren't even aware that Stronglink is there. They've extended their
applications to talk to our API. Well, our CLI does the same thing. So our CLI connects to Stronglink through the API so that it gives
an environment that is very familiar. The CLI is what the researchers there are used to seeing.
The difference is that it's now a scale-out system where it's triggering the entire scale-out
resources of Stronglink from their workstation and parallelizing those jobs to get higher
performance throughput than they could have ever done previously with HPSS directly.
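Since Floyd says everything the CLI does goes through the API, here is a hedged sketch of what driving such an API from an application might look like, using Python's requests library. The endpoint, paths, query syntax, and JSON fields are hypothetical placeholders, not StrongLink's published API:

```python
import requests

BASE = "https://stronglink.example.internal/api/v1"   # hypothetical endpoint
TOKEN = "site-issued-token"                           # placeholder credential

session = requests.Session()
session.headers.update({"Authorization": f"Bearer {TOKEN}"})

# Query the metadata catalog for candidate files (query syntax is invented).
resp = session.get(f"{BASE}/files", params={"query": "project=veh-telemetry AND age>30d"})
resp.raise_for_status()
files = resp.json()

# Submit a policy-driven move job for the matching set (fields are invented).
job = session.post(
    f"{BASE}/jobs",
    json={"action": "move", "target_tier": "object-cold",
          "files": [f["id"] for f in files]},
)
job.raise_for_status()
print("submitted job", job.json().get("job_id"))
```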
So HPSS is an old-style hierarchical storage manager?
Yes. It's a legacy HSM from the 90s. HPSS originated out of IBM and several of the labs. And then SGI had DMF, which is now part of HPE. Those were the two big dogs in the HSM space. And there's a lot of data
still out there on both of those. But both of those are proprietary. They're not in open standard
format. And they follow a typical HSM hierarchy, whereas we go any-to-any. We don't have to rehydrate back through the same
paths because we're not leaving stubs or having to reconnect to symlinks. So direct from Isilon
to tape, tape to cloud, tape to object store, you know, object store to cloud, wherever to wherever.
And you mentioned just about anybody's tape library is compatible.
Yeah, any tape library.
And any modern drive supports LTFS.
So, I mean, we've got the one at DKRZ.
They've got old legacy Oracle, you know, StorageTek libraries,
Quantum libraries, you know, Spectra, IBM.
I mean, anything.
So, you know, moving petabytes, so data gravity has always been a real problem for these sorts of solutions.
But you mentioned that you're parallelizing the move activity and going at line speeds in this environment.
Yes, we are. And don't forget that we're not only going at line
speeds, but we're also creating metadata records, data provenance and file history, audit trails.
So there's a lot of intelligent metadata operations going on in the background,
but we're able to move the data very fast, essentially at line rates.
Talk to me a little bit about the metadata data protection.
I mean, normally this sort of environment, you know, would be replicated multiple times.
It would be, it might, if it's, you know, it could be RAID protected, that sort of thing.
So the metadata itself in our database, do you mean? So it is a NoSQL database that we've
heavily modified and extended. And so the key, so there's multiple levels of protection.
One is just straight backing up the database by policy, right? The second, and that's primarily in HA environments, obviously, is the sharding.
So we shard across all of the nodes, but we also create a full replica that is continually updated across all nodes.
Either a single or dual replica of the entire database. And that's continually updating.
And then the third thing, of course, is if they have a replica site,
the Galaxy configuration, where we continue to do that to the other site.
So you've got multiple layers of protection in there.
And the data itself, it is on the customer storage.
Because we can manage multiple copies at once, you may have a primary copy that is sitting on your Isilon, but then you've got a secondary copy that's sitting on a cool or cold storage or a third copy that's sitting off-prem in the cloud or another data center. Stronglink can keep all of these synchronized so that you're not having to run
separate backup operations for one class of storage, which is different from another class
of storage. You can look at all of your data across all storage types, cloud, tape, object,
disk, flash, whatever it may be, and then create a data continuity policy that can apply to them all.
Normally when you're tiering stuff, the copy, once it's moved, is actually deleted off a tier.
You support both tiering as well as copying?
Yeah, sure. Absolutely. Absolutely.
I mean, you know, it depends on the use case. I mean, in some cases, customers... I mean, the smarter way to tier is not to wait until you have to tier and then do it and then clean up.
But proactively tier. I mean, it's a very common thing where you proactively create a second copy downstream.
Right. Sometime later, delete it on the primary kind of thing.
Exactly. Whenever by policy it ages off the primary.
And so with Stronglink, if I'm a user and I'm coming to my mount point in my Mac Finder or my Windows Explorer, I see a file.
Today that file might be in Isilon.
Tomorrow it might be in an object store.
But I'm still coming to my Finder and I'm still seeing the same folder and the same file.
I don't know that it's moved from one place to another and nor should I need to care.
I just know that my file is there.
My applications can see it.
I can do whatever I want without having to worry that IT is now moving things around in the background.
And you mentioned somewhere back there you're replicating across sites,
but it's bidirectional. So I could have, yes,
I could have data being updated in site A and data being updated in site B,
and they would be replicating across to one another.
Correct. Yes. And that can be done by policy,
either one directional or bidirectional. And, you know, so we'll have, you know,
in some cases it's more of a hub and spoke,
a central repository with spokes that are contributing back to the core. But in other
cases, it might just be two different data centers that back each other up.
And you mentioned you could actually replicate to a third data center as well, or is that?
Oh yeah. Yeah. We can have as many, virtually as many as you want. I mean,
Library of Congress, they've got four.
Fully replicated data centers?
Yeah, four unique sites, exactly.
This has been great.
This is a pretty impressive solution.
I didn't think you guys had all this sophistication, especially the movement stuff is very impressive.
Is there anything you'd like to say to our listening audience before we close?
Yeah, I think the key thing, and I touched on it a couple of times, is we've been conditioned over the years, especially in larger environments, to think of our data as a storage problem.
My storage is filling up.
I got to get more storage.
My storage costs are expensive.
I need a data
reduction strategy because I can't afford to keep everything, but I don't know what I can keep and
can't keep because I don't have enough intelligence about the data. Our view is that the data is key.
And starting from the intelligence of the data that can be derived from its metadata,
and that the policies need to be able to bridge any storage type without a proprietary lock-in and without the kind of hooks, agents, stubs, symlinks that are needed. And even the pricing,
we don't charge by the volume of data. We charge by the performance that you need. So who knows how much data you're going to have five years from now? So how do you budget if the cost of your system is tied to the volume of data? So our approach has been to create a vendor-neutral platform that enables IT administrators to manage their data wherever they
want with complete control and not being locked in either by a pricing model that is unsustainable
or by proprietary hooks. And then in that way, they can focus on a data-centric management policy
that leverages whichever storage type is right for their budget,
their performance, and maybe even just their corporate policies, but to do so in a way that
keeps that data available to their users at all times. Brings up another question. So you mentioned
your pricing is by performance. So it's like terabytes ingested or terabytes accessed, or? No, no. We price it based on the number of CPU cores.
Oh, okay. I got you. It's like a VM kind of thing. Yeah. Yeah, sort of. Except it's, you know, if,
if for my workload, I know I'm going to be needing to move, you know, X amount of data,
and this is my network. So physics, you know: calculate the
number of nodes to move two petabytes a day, and you need a lot of nodes, because you're saturating
two petabytes of bandwidth. So you just add another node, and that number of CPU cores
that you've allocated for that is what you pay. And if you want to push two petabytes of data
or 200 petabytes of data, that's up to you.
I mean, Cisco doesn't ask you
how much data is pumping through the pipes.
It asks how fast you want it to move.
And then you buy the performance you need.
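To put rough numbers behind that "physics" comment, here is a back-of-the-envelope calculation, assuming 100-gigabit networking per node and movers that actually sustain line rate; both assumptions are illustrative, and real sizing would also account for the metadata, provenance, and audit work mentioned earlier, plus how fast the source and target storage can actually go:

```python
# Back-of-the-envelope: what sustained bandwidth does 2 PB/day imply, and how
# many ~100 Gb/s data movers would it take if each one ran at full line rate?
PETABYTE = 10**15                       # bytes (decimal petabyte)
daily_bytes = 2 * PETABYTE
seconds_per_day = 24 * 60 * 60

required_gbps = daily_bytes * 8 / seconds_per_day / 1e9   # gigabits per second
per_node_gbps = 100                                       # assumed NIC line rate

nodes_at_line_rate = required_gbps / per_node_gbps
print(f"~{required_gbps:.0f} Gb/s sustained, i.e. roughly {nodes_at_line_rate:.1f} "
      "nodes saturating 100 Gb/s links, before any protocol or storage overhead")
```

That works out to roughly 185 Gb/s sustained, which is only a floor: in practice more nodes are needed because the movers are also generating metadata records, provenance, and audit trails, and the storage on either end rarely sustains line rate.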
One other question.
Is there some way that a customer,
let's say he's got 500 terabytes, can try out the solution without buying it? Or, I have no idea what it takes.
Sure. We'll sit down with the customer on what their use case is, what is their environment, and we'll construct a POC, usually with a statement of work of what the objectives are of the test,
and we'll run through the test, and it either works for them,
which typically is the case, or it doesn't, and we address that.
Okay, I got you.
All right, well, this has been great.
Floyd, thank you very much for being on our show today.
It's been my pleasure, Ray. I appreciate that.
And that's it for now. Bye-bye. Next time, we will talk to another system storage technology person.
Any questions you want us to ask, please let us know. And if you enjoy our podcast,
tell your friends about it. Please review us on Apple Podcasts, Google Play, and Spotify,
as this will help get the word out.