Grey Beards on Systems - 103: Greybeards talk scale-out file and cloud data with Molly Presley & Ben Gitenstein, Qumulo
Episode Date: June 23, 2020. Sponsored by Qumulo. Ray has known Molly Presley (@Molly_J_Presley), Head of Global Product Marketing, for just about a decade now, and we both just met Ben Gitenstein (@Qumulo_Product), VP of Products & Solutions, Qumulo, on this podcast. Both Molly and Ben were very knowledgeable about the problems customers have with massive data troves.
Transcript
Hey everybody, Ray Lucchesi here with Keith Townsend.
Welcome to another sponsored episode of the Greybeards on Storage podcast.
This Greybeards on Storage podcast is brought to you today by Qumulo and was recorded on
June 16th, 2020.
We have with us here today Molly Presley, head of global product marketing, and Ben Gitenstein,
VP of product and solutions at Qumulo. So Molly and Ben, why don't you tell us a little bit about
yourselves and what's new at Qumulo? Yeah, you bet. Hey, this is Molly Presley. I'm responsible for product marketing and excited to be on this podcast, getting ready to talk about the big announcement that we have here at the end of June for Qumulo, about the next steps in how we're enabling data with applications in the cloud.
Oh, that's great. So this is sort of a cloud solution. Is that what we're talking about here? Or how is this going to play out from that perspective?
Yeah, the new announcement we have is about how to take advantage of the power of the cloud,
whether that's the applications or the infrastructure,
Qumulo spans data center, public and private cloud.
But this announcement we'll be talking about and going a little deeper into
is pretty cloud-focused about just the amount of data you can process and get access to if you use the tools up there in the
cloud. So we haven't talked to Qumulo in quite a while. Could you tell us a little bit about,
you know, the background of Qumulo and that sort of thing? Yeah, you bet. Ben, do you want to take
that part? Sure. I'm Ben Gitenstein. I'm the VP of product here at Qumulo, and I'm super excited
to be here and be a part of the show.
You know, when you think about Qumulo, I'd encourage you to think about applications and how applications deliver value for businesses.
Because at the end of the day, Qumulo provides a file system that is really only as valuable as the applications that leverage that file system to do work. Examples being life sciences organizations
that are doing research into curing diseases,
which I know is top of mind for all of us,
or media and entertainment companies
that are building the next great movie
or creating the next great television show,
or state and local organizations
that are dealing with massive deluges of data or manufacturing
companies that are looking at all the data coming off their factory floor and trying to optimize
production. All those sorts of folks, their real value, the value of that data is in the application
that leverages it. These are fairly sizable data repositories, I guess I'd call them. This is not your typical data center file environment.
Is that correct? Yeah. Qumulo is really all about the large-scale data that customers use to create
innovation. So we're really interested in genomes that get really big or video files that get really
big or video data sets that get really big or massive collections
of small files. So if you are a mortgage processor that creates billions of small files, that's a
pretty common use case for us. You know, when you think about us, I think one way to think about
Qumulo is sort of what we believe. And we are big believers in the notion that data needs to be near the compute and the users that take advantage of it.
And so that's pretty important to what we've done to date and also some of the stuff we're announcing today,
which is really all about the notion that at its heart, our job is to enable the application,
which means we need to move data to the place where the application
resides. That could be on-premises in your data center or increasingly in the public cloud,
in AWS, in GCP. And we need our products to run really well in both of those environments
and speak the native languages of those clouds, in particular, the way in which you store data.
And so a lot of what we're
going to be talking about today is how we store data and present that data so that the customer
can use the application they want to use in the environment they want to use.
And why would customers select Qumulo?
Yeah, you know, I think that when you look at the world today, data centers have really shifted.
The world is changing, workflows are changing, innovation is accelerating
at an incredibly fast pace.
And the way Qumulo has architected our software, you run the same software package in
the data center or in the cloud. So whether you want to run on x86 processors, in Google
or in Amazon, whichever cloud
you choose, Qumulo gives you that flexibility and freedom to choose, so that even if you're
making a decision today that you want to run an all-flash file system, but you know you're going
to need to move some of that data to the cloud someday, Qumulo can get you there. So there's
that freedom and flexibility that customers get out of Qumulo. And Qumulo is a software-defined solution, correct?
Yeah, we are.
Software-defined kind of from all the different aspects
of how you might think about that.
We have a software pricing model.
We leverage commodity hardware.
So you can think about industry standard hardware,
which is a Qumulo appliance, or it could be HPE hardware. We have
a partnership with Fujitsu. And then that same software you could buy in the AWS Marketplace,
for example, as well. And you mentioned Google Cloud too, is that right?
The two clouds we run in today are Amazon and GCP. And of course, we're looking at other clouds in the future. So I'm really interested in this set of use cases and this idea of running workloads close to the data,
having the data as close to the workloads as possible, because I have a big pharma background. If you're doing sequencing, the clinics and so on have these machines that speak file systems. And then you want to get
this data to cloud services that primarily speak object. So I'm hoping we'll dig into this a little bit, because I've had this practical disconnect in my career, where I'm ingesting data in a file-system, application-centric paradigm and then needing to analyze that data in a completely different type of storage methodology.
So I'm really curious as to how Qumulo solves that problem.
It's almost a producer-analyzer type of dichotomy here.
I mean, everybody gets data or creates data in sort of a file environment,
but when they want to try to analyze it, it's got different types of requirements.
Yeah, when you look at these devices, they have USB drives on them, they have CDs, et cetera.
So how does that play out in Qumulo then? Well, there's historically been a split among file systems and really storage providers, between what the cloud object stores provide and what the
on-premises file systems provide. And I think that's been driven, if I may be so bold as to
sort of conjecture as to their motivations, it's been driven by business models creating technology decisions, which I think is a mistake for customers.
And Qumulo takes a very different approach. We fundamentally believe that at the end of the day,
you should be able to ingest and analyze the data that you work with and create as a business owner
using the tools that make sense for your business, not the tools that are optimized for your storage. You should be making those decisions based on how you want to run.
Is that what you mean by business model driven development kinds of things?
So what it really means is that other storage vendors hold your data hostage. Their business
model is built around ensuring that your data lives inside of their system permanently.
And they do that because they monetize the total amount of data stored, particularly in cold
storage, for a long period of time. Qumulo has a different point of view. We monetize business
value. We enable customers to run applications. And by doing that, we make customers more successful. And then our business model is aligned with that.
So you wouldn't charge for capacity then? I mean...
Well, we charge for capacity for the data that's stored inside a Qumulo file system.
But we're talking about Shift today as an example. And Shift is built around the concept of freedom.
So the whole notion of Shift is you should be able to take your example of creating
data locally using, let's say, an Illumina sequencer. Well, that Illumina sequencer is
really optimized for a file system on the backend. It really wants to produce a bunch of files. And
in fact, for many Illumina sequencers, it produces a very large number of files, some of them
very, very small, which breaks a lot of file systems. That's not a problem Qumulo has, but that's a separate topic.
But then you got to do something with that data. Well, the first thing that happens is often a lot
of that analysis happens over a different protocol
than the protocol that originally created the data. So in file world, often the Illumina sequencer speaks SMB and the
analysis farm is a big Linux or NFS farm. And so now I've got to, even just in the world of file,
right, I've got a file problem where I've got two different protocols hitting the same data.
That's a problem we already solved. So if you run Qumulo entirely on-premises,
you can use both protocols against us. We work great in that scenario. But the bigger problem is, well, what if I want to do that analysis in AWS, which is increasingly
important for a couple of reasons. One, because my researchers are remote. My researchers
can't come into the office right now because of the pandemic we're all living through,
or they are increasingly part of other organizations.
And so they actually would never come into my office.
They're organizations we partner with.
And not only do I have a remote worker problem,
I also have an elastic compute problem,
which is at peaks, I want to turn time into a variable.
I want to compute against a large data set very fast
and not have to wait for, you know, the number of cycles that my given number of, you know, my given amount of compute on premises can get through or to wait to order more compute and have that arrive and get racked and stacked and all that.
I want to provision compute resources with code.
Well, in order to do that, I need that in the cloud in a format that the cloud can work
with. And that's what Shift does for you? Yeah, that's exactly what Shift does. So Qumulo Shift
enables you to take data from any Qumulo file system, on premises or in the cloud, and take that
data and move it. It functionally makes a copy of that into Amazon's S3 as a native S3
object or bucket. And now you can interact with it directly using your S3 optimized tooling.
So if you are using Amazon's Rekognition service or SageMaker or Beanstalk or any of their native
services that assume that the underlying content is S3, have at it.
And our product makes that super available and super easy to work with, in a reliable, production-grade sort of way.
And in the file environment, you'd have a directory and, you know, subdirectories, et cetera, et cetera.
How does that map into S3 bucket structures per se?
It's a one-to-one mapping. So we make a copy. So what happens is, when you go over to S3 and you
open up your S3 browser, you will see an object or a bucket, depending on whether we're talking
about a directory or a file. And you will see in there, the name of the object will be the name of the file.
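(A quick illustration of that one-to-one mapping in Python. This is a sketch of the idea only, not Qumulo's implementation, and the directory and file names are invented.)

```python
from pathlib import PurePosixPath

def file_to_s3_key(shift_root: str, file_path: str) -> str:
    """Sketch of the described mapping: each file under the shifted
    directory becomes an S3 object whose key is the file's path relative
    to that directory, so the object name matches the file name."""
    return str(PurePosixPath(file_path).relative_to(PurePosixPath(shift_root)))

# e.g. shifting /genomics/run42 up to a bucket (names are made up):
key = file_to_s3_key("/genomics/run42", "/genomics/run42/samples/s1.fastq")
# an S3 browser would then show an object named "samples/s1.fastq"
```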
And now you can go mess with that. Over time we'll make the
linking between S3 and Qumulo deeper and deeper. This first iteration that we're releasing for
customers is really just a copy, but already we've seen with the early customers who've been using it
a ton of value because they can immediately create a real hybrid workflow where they're
generating data with an Illumina sequencer against a Qumulo on-premises, and then they're copying
that data up to AWS S3. So Ben, let me tell you where I've seen this in real life, where this
would have reduced friction for an environment that had Illumina sequencers that centralized data into the data center.
The scientists had this challenge
of getting the data from the data center into a third-party cloud provider that consumed it via object.
The biggest problem was the scientists doing that conversion, the friction of getting it from
the data center into object, so much so that we almost wasted a whole investment in connecting
to Internet2 to facilitate that
10-gigabit file transfer. That kind of third level, that human layer, that layer 8 problem
was a big problem for us. That connects really well with what we have been hearing from customers
from the beginning and our motivation for building Shift.
And I want to make sure we broaden the conversation just a little bit and think about Qumulo in AWS and Qumulo Shift, because together, it's really the better-together story that makes
this more powerful. So Qumulo already offers a scalable multi-protocol file system for AWS.
So, for example, when you use Illumina, which I'm
only anchoring on because you mentioned it, it's a great example, but Illumina offers something
called DRAGEN, which is essentially a toolkit for doing analytics against genomic data. Well,
DRAGEN is really a file-based application and it wants to talk to a file system.
Running that on-premises is relatively straightforward.
You just put it on, load it up on your compute, have it talk to your file system.
In the cloud, where do you put that, and what file system do you talk to, and what data does it work with? With Qumulo pre-Shift, so this is something we've already been delivering on for several years,
you can build a scalable Qumulo file system in the cloud.
You can have your Illumina sequencer talk
to your local file system, your local Qumulo
in your data center, and replicate data directly
over your nice big 10-gig link,
or whatever it is you have, into AWS,
right into a Qumulo file system. You load DRAGEN onto EC2
and now you have an elastic cloud compute farm for doing
genomic analysis. That already works today. What's surprising is that in the past,
researchers have avoided using AWS for that very problem. They've had to go to niche
cloud providers that would speak NFS, because specifically the application you mentioned,
DRAGEN, that's the industry standard and it's a file system-based solution.
That's right.
And we continue to be big believers in the power of file systems.
But we also recognize that a lot of the world's most important data
needs to live in S3, specifically AWS's S3 service, because it's such an important underpinning to
other cloud services like SageMaker for machine learning, for example. And our goal with Shift
is to say, hey, look, our job is to make your data available to the place it needs to go
in order to make the application successful. And that's what we've
been really focused on, both with Shift, which makes it available to your S3-based application,
as well as with just Qumulo as a cloud-native file system in AWS or GCP.
We've been talking about AWS almost exclusively, but S3 is sort of ubiquitous nowadays. I can name probably four or five
different storage environments where S3 is an available protocol. I'm assuming that you can
actually copy or shift, let's say, the data to any of the S3 compatible solutions?
Not today. Today, we're focused pretty squarely on AWS's S3 service. Over time, you will see us bring it to GCP and other cloud object stores.
We're super focused on the problem of interop of your data and your data flow.
I got you.
The applications happen to be there, right?
Yeah, I think that's the way to think about it, Ray.
It's very different from thinking about an S3 storage target for an archive. That's a pretty different thing. There's a whole set of applications in the cloud that only read and write from that
AWS S3.
And we want to make sure that that data is available to those applications as well.
And AWS S3 is the vehicle to do that through.
And so the customer would effectively have their own AWS S3 license and billing and all
that stuff.
And that would be outside of the Qumulo purchase and all that stuff, right? That's right.
Right, right. And then, so moving the data from Qumulo to S3,
do you have some sort of special compression or data transfer characteristics that you're
bringing to bear to make this happen easier and quicker and things of that nature? Yeah, we do. We built our own
replication protocol at Qumulo. We've been working on this for several years and have
several patents around it. It really is built on several sort of design principles, one of them
being scale. We really focus on customers with very large data sets.
So our replication protocol handles really large directories with lots and lots of files in them, as an example.
And with, you know, handling all the problems that happen when you take one system and then put a really long network connection between it and another system. So what do you do about when the network has an
issue or how do you make sure you retransmit very efficiently and all that sort of stuff?
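(Qumulo's replication protocol is proprietary and patented, so as a generic sketch only: one common way to handle flaky long-haul links is to retry each chunk with exponential backoff. Everything below, including the flaky sender, is invented for illustration.)

```python
import time

def send_with_retry(send_chunk, chunk, max_attempts=5, base_delay=0.01):
    """Send one chunk, retrying with exponential backoff on network errors."""
    for attempt in range(max_attempts):
        try:
            return send_chunk(chunk)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            time.sleep(base_delay * (2 ** attempt))  # back off, then retry

# A deliberately flaky sender that drops the link twice, then succeeds.
attempts = {"n": 0}
def flaky_send(chunk):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("link dropped")
    return len(chunk)  # pretend success: report bytes sent

sent = send_with_retry(flaky_send, b"chunk-of-file-data")  # retries until it lands
```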
So there's a bunch of patents in there that make it really efficient, really fast, and really
scalable. So from a practical perspective, the intent is basically to enable applications that weren't designed for WAN, asynchronous
communication to actually work over the WAN?
I would say it slightly differently.
The answer is yes, but in a slightly different way.
So we are very focused on enabling you to run the application in the data center or in the
public cloud. So, as soon as you have decided, hey, I would like to move this, let's go back to
DRAGEN. I would like to take DRAGEN, or Adobe Premiere if you're in a media and entertainment
group. I've been running that application locally. I would like to now run
that application in the cloud because my data set has gotten large enough. It's time to move there or because I need more
compute power or because my workers are remote. Just pick the application up and move it and we'll
move the data for you. So the ideal is, and the other piece of it, since I can't control latency
as much in the cloud, you guys have a higher tolerance for that.
That's right.
And we do that by being a really scalable system.
Obviously, there's, you know, in the cloud,
you're always working on abstracted infrastructure.
And so there's some challenges you have to work with
to just understand noisy neighbor problems
and all that sort of stuff that can happen in the cloud.
And we handle all of that by building really big namespaces that are really highly available and
highly resilient to underlying components. What sort of other use cases? We've been
talking a lot about genome sequencing and stuff like that, but I'm sure you've got other
clients or customers out there using Shift for other solutions.
Yeah, there is a lot of stuff in there. I mean, it's a very
powerful sort of horizontal capability. I'd mention a couple. We talked a lot
about DRAGEN and that sort of stuff. Really, I would bucket that in innovating using file data
in the public cloud. So that's creating and running an innovation workload in the public
cloud. The other one that's really important is collaborating across organizations. So your
ability with Shift, for example, to finish working on a dataset and essentially publish to S3,
that's a really powerful lever for a research organization that wants to be able to say,
this dataset is done. Please,
researchers from across the world, go work against it. Here's the URL. Then, of course,
there's DevOps workflows. You could see something like seismic analysis where I've done some work
on-prem and then I want to publish some results of that and make it available to others and that
sort of thing. That's a really common use case with us. In fact, one of our largest cloud customers is one of the world's largest analyzers of seismic data. And the way
they use Qumulo is as the native file system in AWS for all of their seismic analysis.
And they do that because we offer multi-protocol access at scale in AWS.
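(To make the "here's the URL" publishing step concrete: for a publicly readable bucket, the virtual-hosted-style S3 address is enough. The bucket and key names below are made up; for private data you'd hand out a presigned URL, e.g. via boto3's `generate_presigned_url`, instead.)

```python
def public_s3_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    """Virtual-hosted-style URL for an object in a publicly readable bucket."""
    # us-east-1 is the one region whose endpoint omits the region name
    host = (f"{bucket}.s3.amazonaws.com" if region == "us-east-1"
            else f"{bucket}.s3.{region}.amazonaws.com")
    return f"https://{host}/{key}"

# e.g. a finished dataset that Shift has landed in S3 (names invented):
url = public_s3_url("published-genomes", "run42/results.vcf")
```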
Right, right, right. Okay, what other use cases?
You were talking about DevOps.
Where does that play into this thing?
We don't, at Qumulo, we often take it for granted,
but it turns out our API is a really powerful tool for customers.
We take it for granted because everything we do at Qumulo is API first,
so it just sort of is second nature.
But every feature available, every capability in the system is available as an API, which means
that customers that have really built DevOps pipelines can program against a Qumulo system,
whether that's programming for data access, so their workflow can literally come in over an API,
or programmatically creating, managing, and destroying their Qumulo clusters
via an API.
That's a really common DevOps workflow.
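(A sketch of what "program against the system" can look like. The endpoint paths, payload fields, and token scheme here are hypothetical, invented purely for illustration; consult the vendor's actual REST API documentation for real calls.)

```python
def api_request(base_url, token, method, path, body=None):
    """Assemble a request descriptor that any HTTP client could execute."""
    return {
        "method": method,
        "url": f"{base_url.rstrip('/')}/{path.lstrip('/')}",
        "headers": {"Authorization": f"Bearer {token}",
                    "Content-Type": "application/json"},
        "json": body,
    }

# Create, then destroy, a cluster entirely from code (hypothetical endpoints):
create = api_request("https://qumulo.example.com/v1", "TOKEN", "POST",
                     "clusters", {"name": "analysis-farm", "nodes": 4})
destroy = api_request("https://qumulo.example.com/v1", "TOKEN", "DELETE",
                      "clusters/analysis-farm")
```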
And that's with or without Shift to some extent.
That's really the Qumulo cloud service, is that it?
I would say that's core to Qumulo software.
I mean, whether you're running us on-premises or in the cloud, the API set is the same and
you just have at it.
And we've got a nice set of samples on GitHub that you can use to get started and all that sort of stuff.
OK, OK.
There's a couple of others that are really operating sort of focused on cost.
So you can use Shift to create a safe second copy of your data in S3. And then you can use S3's
data lifecycle tools to tier that stuff down into Glacier or however you'd like. And then,
of course, archiving is a very similar workflow, which is once you've decided that a project is
complete or a show is done, you can make a copy in S3 and then delete that data off of your existing
Qumulo cluster.
And all of those are available just as part of what we do in the Shift feature.
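(The tiering step leans on standard S3 machinery. A lifecycle rule in the AWS schema might look like the following; the prefix and day count are made up, and you'd apply the configuration with boto3's `put_bucket_lifecycle_configuration` against a real bucket.)

```python
# One lifecycle rule, in the shape S3 expects: objects under "archive/"
# transition to Glacier 30 days after creation.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-shift-copy-to-glacier",
            "Filter": {"Prefix": "archive/"},   # scope: only this prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "GLACIER"},
            ],
        }
    ]
}
```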
Okay.
And Shift is going to be available on the 23rd of June?
Yeah, we're making an announcement about Shift on the 23rd.
We actually already have preview customers using it today.
It'll be publicly available in the product in July. Okay. All right. Well, this has really been great. Hey,
Keith, any last questions for Molly or Ben? No, it's been a really great conversation.
Okay. So Molly or Ben, anything else you'd like to say to our listening audience before we close?
No, I think that just in closing, we really appreciate you guys hosting us in the conversation.
And as we think about how data structures are set up and data environments are changing over time, we're just really excited to have the opportunity to talk with both analysts, community, as well as the customers who are evolving their workloads to be able to use their data better.
And, you know, we welcome the opportunity to do demos and have these types of conversations
with anyone who listens to this podcast once it's been released.
Okay.
Well, this has been great.
Thank you very much, Molly and Ben, for being on our show today.
Yeah, thanks for having us.
Thanks for having me.
And thanks to Qumulo for sponsoring this podcast.
Next time, we'll talk with another system storage technology person.
Anything you want us to ask, please let us know.
If you enjoy our podcast, tell your friends about it,
and please review us on Apple Podcasts, Google Play, and Spotify
as this will help get the word out.
That's it for now.
Bye, Keith.
Bye, Ray.
Bye, Molly and Ben.
Have a great day.
Okay.
Good day.