Grey Beards on Systems - 094: GreyBeards talk shedding light on data with Scott Baker, Dir. Content & Data Intelligence at Hitachi Vantara
Episode Date: December 5, 2019
Sponsored By: Hitachi Vantara
At the Hitachi NEXT 2019 Conference last month, there was a lot of talk about new data services from Hitachi. Keith and I thought it would be a good time to sit down and talk with Scott Baker (@Kraken-Scuba), Director of Content and Data Intelligence at Hitachi Vantara, about what's going on with data.
Transcript
Hey everybody, Ray Lucchesi here with Keith Townsend.
Welcome to another sponsored episode of the Greybeards on Storage podcast.
This Greybeards on Storage podcast is brought to you today by Hitachi Vantara and is recorded on November 22nd, 2019.
We have with us here today Scott Baker, Senior Director of Content and Data Intelligence at Hitachi Vantara.
So, Scott, why don't you tell us a little bit about yourself and what's happening with Hitachi Vantara Storage Solutions and what your team is doing in the market?
Well, gentlemen, thanks so much for allowing me to join this cast with you and really share some of the thoughts that I have around this area.
A little bit about myself.
Man, a captive audience, and they all want to hear about me.
How about this?
I'll keep it light.
I really, really love to scuba dive.
I like to tell people that when I'm not underwater, this is my surface interval.
Otherwise, I really enjoy putting a mask on, throwing a tank on my back,
and hitting the water. And what's interesting about that particular reference, though, is a
lot of what I see under the water drives a lot of what I do on the surface with my job, where,
you know, we do focus around taking massive amounts of unstructured data and finding the
right home for them on our object storage platform.
But increasingly, more and more of my time continues to be spent with my team on how
we help organizations uncover additional value, additional insights within their unstructured
data repositories, moving beyond just keeping that data for compliance purposes and finding
new ways to integrate that with traditional structured data kinds of analytics and visualization tools
that downstream folks like data scientists, data operations engineers, user dev, BI, et cetera,
make use of to help organizations really get to a faster decision point.
Yeah. So, Hitachi NEXT. There was a lot of discussion about some of the data intelligence, I would say.
It's not exactly the right term, but tools that are coming out, which was kind of interesting.
So you're saying that you can effectively augment or add information to customer data, that sort of thing?
Absolutely. So, you know, I respond to this
by telling you that my team, my entire team, that's everything from marketing to sales to
engineering, you know, we sort of operate under this personal mantra that we believe that it's
our responsibility to make every bit of an organization's data available to them in the
most insightful way possible so that they can take some kind
of informed action on it.
And that simple why statement is embedded in everything that we do, because what we're
seeing is this sort of broader distribution of data across on-premises, cloud, multi-cloud,
omni-cloud, any cloud, whatever you want to call it, any form of hybrid kind of infrastructure,
this is the new normal for most organizations. It's more than just going cloud first.
But the problem that we see is that legacy or the complexity of the information infrastructure in
most enterprises today, that's that legacy architecture, redundant tools, redundant
activities, it really consumes the IT organization's budget.
And that's not just capital.
That's also the resources, time, and energy necessary.
And it really becomes a stifling point for new approaches.
And that's why the whole focus around Next was helping organizations achieve their data operations advantage, where the impetus is put on the effective management
and controls necessary to help data move effectively through the information supply
chain for that organization, adding the intelligence to that along the way, either in an
automated fashion with artificial intelligence and machine learning kinds of algorithms, or providing those services to the users that are sort of augmenting that data and making use of it
so that they can continue to rank it or prove its value. I want to get back to something you
mentioned that I hadn't heard before, information supply chain. You want to characterize what that
really represents? Well, I think there's a lot of examples in the world
around us, whether you look at it naturally, you know, from the perspective of nature,
or if you look at other activities that occur in business, you know, organizations that drive
something out in terms of production, typically physical, like my cell phone or my watch or
whatever, will have a supply chain that allows them to get the right kinds of providers of
resources into the right sort of flow of manufacturing. And then there's different
people that put screws in and batteries in and circuit boards in. And as I thought more and more about that, I said to myself, man, you know,
you'd think there's an information analog to that.
Right.
Information supply chain makes total sense because data isn't, well, ideally it isn't
produced once and never used again.
It happens a lot though.
But, you know, ideally information is produced and it flows through the organization.
And now what I think, you know, the term lifecycle is unfair for data.
It's more of a sort of an infinite loop because data, you know, constantly is getting massaged. It may change from its original source or original version, but it continues to be used within the organization. And I like to
think about the supply chain because the organization that is the data center, right?
Think about that term data center. It's not the server center. It's not the storage center. It's
not the network center. It's the data center. It becomes a hub that supports that supply chain,
providing the data to the right people, in the right place, in the right format, at the right time. So, Keith, have you ever heard of an information supply chain before?
Well, I've heard of supply chains. I'm an SAP guy, but it makes sense to kind of lead to this
or add to this conversation some stats, or at least one big stat that I got out of
Ignite a couple of weeks ago. Microsoft talked about 73% of the world's data
that's created in the past two years hasn't been analyzed.
And I don't think we realize
how big of an opportunity loss that is.
Scott, can you tell us kind of some areas
where Hitachi is using this supply chain
methodology to help customers realize the value of unlocking some of this data, especially in
object storage? Because when I think of analytics and I think of the ability to mine data, I don't
think about doing that directly from object. Right. And I'm so glad to hear you say that. In fact, as we look at our own install base,
you know, and we talk to them about unstructured data, you know, especially as you sort of align
this concept with how data is growing in terms of volume, size, complexity, et cetera.
Roughly from our customer's perspective,
90% of their unstructured data tends to go unanalyzed.
And in many cases, it's only kept around for compliance.
I mean, let's face it, guys,
no one that I know of has ever been fired for keeping data. Well, maybe Enron, but that's another story.
That's true.
But what I would say is that, inasmuch as not all data is created
equal, all data does have value, right?
The real question is, how do you surface that knowledge that's buried within the data,
especially unstructured data?
And the only way to really do that is to really grow and mature your data culture and invest
in the right kind of tools
to address that. And so when we think about that and this notion of making every bit of an
organization's data available in the most insightful way possible, what we realized
is that we needed an unstructured data analytics engine, if you will, that you can place
at the point of ingest as close to where the data is
getting created, as well as on the backend for data access. Because the data consumers that we
have out there are constantly demanding from at least our object storage, this continued
acceleration of delivery of the data in the format that they want to do whatever it is that they're going to do.
And so if we don't put in a form of automation to help, that's going to mean that the data producers will face this increased pressure to access, review, qualify, you know, remediate
and deliver that data quickly. So there has to be an automated fashion here. And one of the things that we talked about at Next that does this for us is Hitachi's content
intelligence engine, if you will. And the product is really designed to take unstructured data and
perform things such as natural language processing, to extract metadata, to blend different data streams together, to really sort of begin to
create in an automated fashion these data dictionaries so that as data scientists or
these other people within that supply chain that I mentioned are looking for the right data,
then they're able to pull from that. So, I mean, that's sort of the Hitachi content intelligence engine,
I guess that's the right term.
It works on all unstructured data, object storage as well as file storage?
Right on. So that's what I was going to say.
So, you know, globally speaking,
one of the very first things that we do is we help organizations shine a light
on the dark data that they have, right?
Let's just assume that every organization has got any number of data repositories that are out there, right? And
sort of the increasing levels of data diversity and distribution puts tremendous pressure on
organizations to find a solution. And with content intelligence, what we find is this becomes a great
tool to help organizations centralize their data, right? And really
highlights the repository focus capabilities that Hitachi's content platform, our object storage,
can provide to the rest of the organization. I don't really want to say a single source of truth,
but what I would rather say is you create a centralized data hub from which you
can consistently apply whatever the rules are that you want to apply on that data.
And because it's object storage, you could have diverse categorizations of metadata and that sort
of thing applied to an object? Is that how this would work?
Exactly. So, you know, what we would be looking at at this point is taking the logic that's either being applied to that unstructured data upon ingest in an automated fashion or through whatever the processes are for the organization to really amplify the value of the object itself and make it easier to find for people. Now, that's sort of the most horizontal and basic level is just knowing
what data you have, where it's located, how it's being used and by who, and then ultimately who's
responsible for it. That solves, I would say, 90% of the regulatory and compliance obligations
that any one regulatory body would probably set forth as it pertains to protecting
data and remaining compliant. So, Scott, help us understand, how would data scientists leverage
this? Because there's usually an abstraction layer between storage, object storage. So,
it can be object, it can be file, there's really not a difference. And giving
data scientists access to this unstructured data, usually we're, you know,
accustomed to seeing the stuff put into a distributed database like Hadoop,
and there's a layer between the storage system and the actual data application itself. How are
people leveraging these systems?
Wow, man, that really opened me up to a whole lot of things to talk about. Let me drive down to some
very simple things, right? So if you use traditional data science tools for processing
structured data, I'll just pick on Pentaho since that's one of the Hitachi products.
As the analytics tool is burning through that structured data source,
nine times, well, I shouldn't say nine times, there's a good chance it's going to run across
an unstructured row column element, like a free text field or a blob. And, you know,
aside from being able to say, you know, this particular field can't be null, it's really
hard to apply some very rigid expectations around that data
element. So in our case, what we do with the relationship between Pentaho and content
intelligence is as Pentaho is processing the data in a structured database, when it comes across
a free text field or a blob or whatever, it actually creates a communication path between itself and content
intelligence, sending the unstructured data over to content intelligence with the request that
we basically turn that unstructured data into structured information in the form of
key value pairs that we then hand back to the data analyst, right? Or we hand back to Pentaho.
So that would be one example.
No way.
So Pentaho is talking to the content intelligence and they're discussing what the key-value pairs
for some blob of data should be?
Yeah, can you imagine that?
Absolutely.
So that's one great way.
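To make that hand-off concrete, here's a minimal sketch of the pattern Scott describes: a free-text blob goes in, structured key-value pairs come back. Everything here, the function name, the regex rules, the sample row, is a hypothetical illustration, not the actual Pentaho or content intelligence API.

```python
import re

def extract_key_values(free_text: str) -> dict:
    """Turn an unstructured free-text field into structured key-value
    pairs -- a toy stand-in for the NLP/entity-extraction step."""
    pairs = {}
    # Pull out a couple of simple entities with regexes; a real engine
    # would use NLP models, dictionaries, and configurable pipelines.
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", free_text)
    if emails:
        pairs["email"] = emails
    dates = re.findall(r"\d{4}-\d{2}-\d{2}", free_text)
    if dates:
        pairs["date"] = dates
    return pairs

# When a Pentaho-style pipeline hits an unstructured blob, hand it off
# and merge the returned key-value pairs back into the structured row.
row = {"id": 42, "notes": "Met scott.baker@example.com on 2019-11-22 re: audit"}
row.update(extract_key_values(row["notes"]))
print(row)  # now carries 'email' and 'date' keys alongside the raw text
```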
Another way that we see this occurring, and I'll pick on Hadoop because you mentioned that, is that we have the ability to communicate directly with Hadoop clusters and create essentially a bridge where cold or frozen Hadoop data can be offloaded to HCP, leaving a stub or a link behind so that
Hadoop believes that it still has the data available to it. But once that data lands on
the HCP, we can trigger content intelligence to burn through that because content intelligence
can also process structured data. The difference between that and Pentaho is the ultimate reason for why you're processing it.
So with Pentaho, I'm looking to take information to refine down to a specific decision point and likely a very real time kind of an experience.
Whereas with content intelligence, it's really about data discovery.
It's not time sensitive, if you will. But we can actually index or process and understand the data in those offloaded Hadoop stores. Then they click a button and request that it come back into
the Hadoop cluster. And this really gives us an opportunity to help organizations optimize their
existing data lakes. So that would be the second example. And the third would be using that same
kind of search experience so that the data scientists could evaluate two different kinds
of data repositories, the structured data, as well as what's being stored on the edge of the business,
in the cloud, on-prem, whatever, in the unstructured repositories
so that they may find similar kinds of data sets that need to be blended together
in whatever model they're working on.
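The offload-and-stub pattern from the Hadoop example is easy to picture in miniature. The sketch below assumes an S3-compatible endpoint (the endpoint URL, bucket, and file layout are made up for illustration) and is not the actual Hadoop/HCP bridge, just the general shape of it: the consumer's path never changes, only what sits behind it does.

```python
from pathlib import Path

import boto3  # any stock S3 SDK works against an S3-compatible store

# Hypothetical endpoint and bucket, for illustration only.
s3 = boto3.client("s3", endpoint_url="https://hcp.example.com")
BUCKET = "cold-tier"

def offload_with_stub(local: Path) -> None:
    """Move a cold file to object storage, leaving a stub behind so
    the original path still resolves -- the 'Hadoop believes it still
    has the data' effect."""
    s3.upload_file(str(local), BUCKET, local.name)
    # The stub records where the data went; a recall reads this pointer.
    local.write_text(f"OFFLOADED s3://{BUCKET}/{local.name}\n")

def recall(local: Path) -> None:
    """Reverse the offload: fetch the object back over the stub."""
    key = local.read_text().strip().rsplit("/", 1)[-1]
    s3.download_file(BUCKET, key, str(local))
```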
Let me try to understand.
So content intelligence is roughly indexing both structured and unstructured data?
It can.
It can.
We focus more on the unstructured piece.
Right.
Because that's where the real processing value is.
But I'll give you one other example how we're helping customers with this.
Content intelligence has the ability to transform
information. We have a lot of organizations out there that have legacy IT, or I'm sorry,
legacy application architectures. And they tend to keep those applications around, not because
people are writing to them, but because people are still accessing the data contained within them.
And they're looking for ways to sort of pull that data out
and retire that architecture to remove the burden on IT,
free up some physical space, et cetera.
So what we've done for customers is actually use content intelligence
to connect to the application's data store, typically a database.
We essentially burn through the entire table structures of the database. And as we're
reading that data in, we convert it into an XML file that we store wherever the customer wants it.
And so now all you have to do is change maybe your SQL query statements to XPath query statements,
and you can still reference that data. And it gives them a chance to create maybe a more modern front end
to the data set itself without destabilizing the sort of the sanctity or the veracity of that
data that was once offered up through that legacy architecture.
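A toy version of that database-to-XML retirement flow, using SQLite and Python's built-in ElementTree purely for illustration. The table, the rows, and the queries are invented, but they show how a SQL lookup becomes an XPath lookup once the data is serialized out:

```python
import sqlite3
import xml.etree.ElementTree as ET

# A stand-in for the legacy application's data store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, "Acme", 99.50), (2, "Globex", 250.00)])

# Burn through the table and serialize each row as XML.
root = ET.Element("orders")
for oid, customer, total in db.execute("SELECT id, customer, total FROM orders"):
    row = ET.SubElement(root, "order", id=str(oid))
    ET.SubElement(row, "customer").text = customer
    ET.SubElement(row, "total").text = str(total)

# Where you once wrote SQL (SELECT customer FROM orders WHERE id = 2),
# you now issue an XPath query against the archived XML instead.
print(root.find("./order[@id='2']/customer").text)  # -> Globex
```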
Speaking of legacy architectures versus modern data structures,
there was a recent announcement at Next on HCP. You want to talk about that, Scott? Right on. So one of the things that we've done since about 2003 is we've always had an object storage solution available to the market.
And we originally had released this as a physical appliance.
We later converted the physical appliance into an OVA or virtual machine format to give organizations that software-only experience, if you will.
That's not really software-only.
So what we announced at Next was a branch of HCP that we refer to as HCP for cloud scale.
So it's truly a software-defined solution, right? It is built to be a containerized
object store, S3-compliant object store, to really address tier one and mission-critical workloads,
because what we're doing is we're balancing, and this is the key part, we're balancing the performance and scale-out capabilities that you would expect out of cloud solutions today,
but focusing on the performance piece as well so that you get linear scale in both throughput and capacity while also maintaining a very strong consistency with respect to how we've re-engineered the metadata database technology.
It's actually patent pending, to be quite frank.
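Because the store is S3-compliant, any stock S3 SDK should be able to address it directly. A minimal sketch, where the endpoint, bucket, credentials, and metadata below are placeholders rather than real values, that also echoes the earlier point about enriching objects with custom metadata:

```python
import boto3

# Placeholder endpoint and credentials for an S3-compliant store.
s3 = boto3.client(
    "s3",
    endpoint_url="https://hcp-cloud-scale.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Write an object with custom user metadata attached...
s3.put_object(
    Bucket="telemetry",
    Key="sensor-0042.json",
    Body=b'{"temp_c": 21.4}',
    Metadata={"source": "factory-7", "reviewed": "false"},
)

# ...and, given the strong consistency Scott describes, a read issued
# immediately after the write sees the object and its metadata.
head = s3.head_object(Bucket="telemetry", Key="sensor-0042.json")
print(head["Metadata"])  # {'source': 'factory-7', 'reviewed': 'false'}
```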
So you mentioned containers, this thing sort of runs on Kubernetes or Docker Swarm or Mesos or something like that?
Well, you know, we had originally started off on Docker and Swarm.
And I think now they're looking into the Kubernetes component of that as well.
But quite frankly, you know, running it within a container is just one example of the deployment methodology.
You know, you could also sort of kick off, I don't know, jokingly, HCP.exe, and you could install it on your own bare metal or in a virtual machine or even in a public cloud.
So, Keith, didn't you just come back from KubeCon?
I came back from KubeCon and there's a growing number of developers and operators who are extremely interested in having persistent data provided as a service inside of containers.
So this very deployment model.
Ah, gosh, that'd be interesting.
So you could almost deploy your object store almost concurrent with your application in the same pods
or in the same service slash deployment environment.
Yeah, I think the main desire is to be able to have portability in the service itself,
so you can deploy the service wherever you'd like to deploy the service.
Right, right. The container dream. Well, that was the goal here, right? As we look at the
architecture that we have today, and not to say that there's anything wrong with the current
object storage solution that we have, we just realized that to be completely fair to the audiences that
are out there that are looking for the ability to support high performance workloads without
the necessity to move data to a compute layer, and then also give them the ability to really
fine tune application performance and balance the resources, we needed to rethink a lot of the core architecture of object storage in
general, not just HCP, but in general. And that was the drive to readdress how metadata is managed,
to move to a microservices-based architecture to give us the elasticity that we needed,
and then to also give us the ability to create these clusters with master and worker nodes
to really allow that core clustering infrastructure to scale effectively for organizations.
And that was the critical piece, right, is that object storage left to the vendors that are out there is always going to show up in the data
center. And an IT person will likely say, you know, if it's not block supporting tier one scale
up workloads, and it's not file supporting, you know, user services and, you know, some of the
other things that file is capable of offering, then it must be archive. And I feel so bad for
object storage. It's capable of doing
so much more than that. I think it's being used for more for so much more than that in the world
today, especially in the cloud and that sort of stuff. All right. Well, this has been great.
Keith, any last questions for Scott before we sign off? No, he just left me with some homework
with this pointer and the ability to talk directly to these services with the metadata.
That was a bit of a bombshell.
I didn't expect that, Scott.
You always got to leave them with one.
Got to leave one.
Scott, anything else you'd like to say to our listening audience?
Listen, folks, I'll tell you, right?
So I think Sherlock Holmes said it best.
It is absolutely a capital mistake to theorize or act on anything before you have the data.
We want to give that data to you.
We want to nurture the data knowledge in that organization that you belong to.
And whether you choose to work with Hitachi or not, please, please, please develop your data culture so that you can create sustainable innovation within your business.
I've been surprised by a couple of things you said, Scott, but it started out with your title, which is Director of Content and Data Intelligence.
I'd never heard that before.
Anyways, this has been great.
Thank you very much, Scott, for being on our show today.
And thanks to Hitachi Vantara for sponsoring this podcast.
You got it.
Next time, we'll talk to another system storage technology person.
Any questions you want us to ask,
please let us know.
And if you enjoy our podcast,
tell your friends about it.
And please review us on Apple Podcasts,
Google Play, and Spotify,
as this will also help get the word out.
That's it for now.
Bye, Keith.
Bye, Ray.
And bye, Scott.
Bye, gents.
Thank you.