Grey Beards on Systems - 094: GreyBeards talk shedding light on data with Scott Baker, Dir. Content & Data Intelligence at Hitachi Vantara

Episode Date: December 5, 2019

Sponsored By: At the Hitachi NEXT 2019 conference last month, there was a lot of talk about new data services from Hitachi. Keith and I thought it would be a good time to sit down and talk with Scott Baker (@Kraken-Scuba), Director of Content and Data Intelligence at Hitachi Vantara, about what's going on with data.

Transcript
Starting point is 00:00:00 Hey everybody, Ray Lucchesi here with Keith Townsend. Welcome to another sponsored episode of the Greybeards on Storage podcast. This Greybeards on Storage podcast is brought to you today by Hitachi Vantara and is recorded on November 22nd, 2019. We have with us here today Scott Baker, Senior Director of Content and Data Intelligence at Hitachi Vantara. So, Scott, why don't you tell us a little bit about yourself and what's happening with Hitachi Vantara storage solutions and what your team is doing in the market? Well, gentlemen, thanks so much for allowing me to join this cast with you and really share some of the thoughts that I have around this area. A little bit about myself. Man, a captive audience, and they all want to hear about me.
Starting point is 00:00:53 How about this? I'll keep it light. I really, really love to scuba dive. I like to tell people that when I'm not underwater, this is my surface interval. Otherwise, I really enjoy putting a mask on, throwing a tank on my back, and hitting the water. And what's interesting about that particular reference, though, is a lot of what I see under the water drives a lot of what I do on the surface with my job, where, you know, we do focus around taking massive amounts of unstructured data and finding the
Starting point is 00:01:22 right home for them on our object storage platform. But increasingly, more and more of my time continues to be spent with my team on how we help organizations uncover additional value, additional insights within their unstructured data repositories, moving beyond just keeping that data for compliance purposes and finding new ways to integrate that with traditional structured data kinds of analytics and visualization tools that downstream folks like data scientists, data operations engineers, user dev, BI, et cetera, make use of to help organizations really get to a faster decision point. Yeah. So, at Hitachi Next, there was a lot of discussion about some of the data intelligence, I would say.
Starting point is 00:02:08 It's not exactly the right term, but tools that are coming out, which was kind of interesting. So you're saying that you can effectively augment or add information to customer data, that sort of thing? Absolutely. So, you know, I respond to this by telling you that my team, my entire team, that's everything from marketing to sales to engineering, you know, we sort of operate under this personal mantra that we believe that it's our responsibility to make every bit of an organization's data available to them in the most insightful way possible so that they can take some kind of informed action on it.
Starting point is 00:02:47 And that simple why statement is embedded in everything that we do, because what we're seeing is this sort of broader distribution of data across on-premise, cloud, multi-cloud, omni-cloud, any cloud, whatever you want to call it, any form of hybrid kind of infrastructure. This is the new normal for most organizations. It's more than just going cloud first. But the problem that we see is the complexity of the information infrastructure in most enterprises today, that legacy architecture, redundant tools, redundant activities; it really consumes the IT organization's budget. And that's not just capital.
Starting point is 00:03:27 That's also the resources, time, and energy necessary. And it really becomes a stifling point for new approaches. And that's why the whole focus around Next was helping organizations achieve their data operations advantage, where the impetus is put on the effective management and controls necessary to help data move effectively through the information supply chain for that organization, adding the intelligence to that along the way, either in an automated fashion with artificial intelligence and machine learning kinds of algorithms, or providing those services to the users that are sort of augmenting that data and making use of it so that they can continue to rank it or prove its value. I want to get back to something you mentioned that I hadn't heard before, information supply chain. You want to characterize what that
Starting point is 00:04:21 really represents? Well, I think there's a lot of examples in the world around us, whether you look at it naturally, you know, from the perspective of nature, or if you look at other activities that occur in business, you know, organizations that drive something out in terms of production, typically physical, like my cell phone or my watch or whatever, will have a supply chain that allows them to get the right kinds of providers of resources into the right sort of flow of manufacturing. And then there's different people that put screws in and batteries in and circuit boards in. And as I thought more and more about that, I said to myself, man, you know. You think there's an information analog to that.
Starting point is 00:05:10 Right. Information supply chain makes total sense because data isn't, well, ideally it isn't produced once and never used again. It happens a lot though. But, you know, ideally information is produced and it flows through the organization. And now what I think is, you know, the term lifecycle is unfair for data. It's more of a sort of an infinite loop because data, you know, constantly is getting massaged. It may change from its original source or original version, but it continues to be used within the organization. And I like to think about the supply chain because of the organization that is the data center, right?
Starting point is 00:05:50 Think about that term data center. It's not the server center. It's not the storage center. It's not the network center. It's the data center. It becomes a hub that supports that supply chain, providing the data to the right people, in the right place, in the right format, at the right time. So, Keith, have you ever heard of an information supply chain before? Well, I've heard of supply chains. I'm an SAP guy, but it makes sense to add to this conversation some stats, or at least one big stat that I got out of Ignite a couple of weeks ago. Microsoft talked about 73% of the world's data that's been created in the past two years hasn't been analyzed. And I think one of the things that we don't realize is
Starting point is 00:06:36 how big of an opportunity loss that is. Scott, can you tell us kind of some areas where Hitachi is using this supply chain methodology to help customers realize the value of unlocking some of this data, especially in object storage? Because when I think of analytics and I think of the ability to mine data, I don't think about doing that directly from object. Right. And that is, I'm so glad to hear you say that. In fact, as we look at our own install base, you know, and we talk to them about unstructured data, you know, especially as you sort of align this concept with how data is growing in terms of volume, size, complexity, et cetera.
Starting point is 00:07:27 Roughly from our customers' perspective, 90% of their unstructured data tends to go unanalyzed. And in many cases, it's only kept around for compliance. I mean, let's face it, guys, no one that I know of has ever been fired for keeping data. Well, maybe Enron, but that's another story. That's true. But what I would say is that, inasmuch as not all data is created equally, all data does have value, right?
Starting point is 00:07:53 The real question is, how do you surface that knowledge that's buried within the data, especially unstructured data? And the only way to really do that is to really grow and mature your data culture and invest in the right kind of tools to address that. And so when we think about that and this notion of making every bit of an organization's data available in the most insightful way possible, what we realized is that we needed an unstructured data analytics engine, if you will, that you can place at the point of ingest as close to where the data is
Starting point is 00:08:27 getting created, as well as on the backend for data access. Because the data consumers that we have out there are constantly demanding from at least our object storage, this continued acceleration of delivery of the data in the format that they want to do whatever it is that they're going to do. And so if we don't put in a form of automation to help, that's going to mean that the data producers will face this increased pressure to access, review, qualify, you know, remediate and deliver that data quickly. So there has to be an automated fashion here. And one of the things that we talked about at Next that does this for us is Hitachi's content intelligence engine, if you will. And the product is really designed to take unstructured data and perform things such as natural language processing, to extract metadata, to blend different data streams together, to really sort of begin to create in an automated fashion these data dictionaries so that as data scientists or
Starting point is 00:09:34 these other people within that supply chain that I mentioned are looking for the right data, then they're able to pull from that.
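To picture the ingest-time enrichment Scott is describing, here is a deliberately tiny sketch of turning raw text into key-value entries for a data dictionary. A real engine would use natural language processing models; the regular expressions below are invented stand-ins, not Hitachi's API:

```python
import re

# Hypothetical patterns standing in for a real NLP extraction step.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.\w+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "invoice_id": re.compile(r"\bINV-\d{6}\b"),
}

def extract_metadata(text: str) -> dict:
    """Turn unstructured text into key-value pairs for a data dictionary."""
    found = {}
    for key, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[key] = matches
    return found

doc = "Invoice INV-004217 issued 2019-11-22; contact scott@example.com."
print(extract_metadata(doc))
# -> {'email': ['scott@example.com'], 'date': ['2019-11-22'], 'invoice_id': ['INV-004217']}
```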
Starting point is 00:10:02 So, I mean, that's sort of the Hitachi content intelligence engine, I guess that's the right term. It works on all unstructured data, object storage as well as file storage? Right on. So that's what I was going to say. So, you know, globally speaking, one of the very first things that we do is we help organizations shine a light on the dark data that they have, right? Let's just assume that every organization has got any number of data repositories that are out there, right? And sort of the increasing levels of data diversity and distribution puts tremendous pressure on organizations to find a solution. And with content intelligence, what we find is this becomes a great tool to help organizations centralize their data, right? And it really highlights the repository-focused capabilities that Hitachi's content platform, our object storage, can provide to the rest of the organization. I don't really want to say a single source of truth, but what I would rather say is you create a centralized data hub from which you can consistently apply whatever the rules are that you want to apply on that data.
Starting point is 00:10:53 And because it's object storage, you could have diverse categorizations of metadata and that sort of thing applied to an object? Is that how this would work? Exactly. So, you know, what we would be looking at at this point is taking the logic that's either being applied to that unstructured data upon ingest in an automated fashion, or through whatever the processes are for the organization, to really amplify the value of the object itself and make it easier to find for people. Now, the most horizontal and basic level is just knowing what data you have, where it's located, how it's being used and by who, and then ultimately who's responsible for it. That solves, I would say, 90% of the regulatory and compliance obligations that any one regulatory body would probably set forth as it pertains to protecting data and remaining compliant.
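For a rough idea of what that object-level categorization looks like in practice, here is user-defined metadata attached to and read back from an object through an S3-style API. The endpoint, bucket, key, and metadata fields are all invented for the sketch, not taken from a real HCP deployment:

```python
import boto3  # third-party AWS SDK; HCP exposes S3-compatible access

# Placeholder endpoint for an on-prem object store.
s3 = boto3.client("s3", endpoint_url="https://hcp.example.internal")

s3.put_object(
    Bucket="research-data",
    Key="scans/2019/11/trial-042.dcm",
    Body=b"...image bytes...",
    Metadata={  # user-defined key-value pairs stored alongside the object
        "classification": "phi",
        "retention": "7y",
        "owner": "radiology",
    },
)

# Whoever governs the data can answer "what is this, and whose is it?"
# later without downloading the payload:
head = s3.head_object(Bucket="research-data", Key="scans/2019/11/trial-042.dcm")
print(head["Metadata"])
```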
Starting point is 00:12:00 So, Scott, help us understand, how would data scientists leverage this? Because there's usually an extraction layer between storage, object storage. So, it can be object, it can be file. There's really not a difference. And giving data scientists access to this unstructured data, usually we're, you know, accustomed to seeing the stuff put into a distributed database like Hadoop, and there's a layer between the storage system and the actual data application itself. How are people leveraging these systems? Wow, man, that really opened me up to a whole lot of things to talk about. Let me drive down to some very simple things, right? So if you use traditional data science tools for processing structured data, I'll just pick on Pentaho since that's one of the Hitachi products.
Starting point is 00:12:41 As the analytics tool is burning through that structured data source, nine times, well, I shouldn't say nine times, there's a good chance it's going to run across an unstructured row column element, like a free text field or a blob. And, you know, aside from being able to say, you know, this particular field can't be null, it's really hard to apply some very rigid expectations around that data element. So in our case, what we do with the relationship between Pentaho and content intelligence is as Pentaho is processing the data in a structured database, when it comes across a free text field or a blob or whatever, it actually creates a communication path between itself and content
Starting point is 00:13:27 intelligence, sending the unstructured data over to content intelligence with the request that we basically turn that unstructured data into structured information in the form of key-value pairs that we then hand back to the data analyst, right? Or we hand back to Pentaho. So that would be one example. No way. So Pentaho is talking to content intelligence and they're discussing what the key-value pairs for some blob of data should be? Yeah, can you imagine that?
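As a minimal sketch of that handoff pattern, assuming a hypothetical enrichment endpoint and response shape (this is not Pentaho's or Content Intelligence's actual interface):

```python
import requests  # third-party HTTP client

# Invented endpoint standing in for the content intelligence service.
ENRICH_URL = "https://content-intel.example.internal/api/enrich"

def enrich_row(row: dict) -> dict:
    """As an ETL flow reads a row, hand any free-text field out for
    key-value extraction and merge the structured result back in."""
    for column, value in list(row.items()):
        if isinstance(value, str) and len(value) > 100:  # crude blob heuristic
            resp = requests.post(ENRICH_URL, json={"text": value}, timeout=30)
            resp.raise_for_status()
            # e.g. {"sentiment": "negative", "product": "PSU-300", ...}
            row.update(resp.json())
    return row
```

The round trip is the point: the rigid pipeline keeps flowing, and the free-text field comes back as columns it can actually validate and query.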
Starting point is 00:13:59 Absolutely. So that's one great way. Another way that we see this occurring, and I'll pick on Hadoop because you mentioned that, is that we have the ability to communicate directly with Hadoop clusters and create essentially a bridge where cold or frozen Hadoop data can be offloaded to HCP, leaving a stub or a link behind so that Hadoop believes that it still has the data available to it. But once that data lands on the HCP, we can trigger content intelligence to burn through that because content intelligence can also process structured data. The difference between that and Pentaho is the ultimate reason for why you're processing it. So with Pentaho, I'm looking to take information to refine down to a specific decision point and likely a very real time kind of an experience. Whereas with content intelligence, it's really about data discovery.
Starting point is 00:15:00 It's not time sensitive, if you will. But we can actually index or process and understand the data in those offloaded Hadoop stores, and they can click a button and request that it come back into the Hadoop cluster. And this really gives us an opportunity to help organizations optimize their existing data lakes. So that would be the second example. And the third would be using that same kind of search experience so that the data scientists could evaluate two different kinds of data repositories, the structured data, as well as what's being stored on the edge of the business, in the cloud, on-prem, whatever, in the unstructured repositories, so that they may find similar kinds of data sets that need to be blended together in whatever model they're working on.
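The offload-and-stub pattern from that second example, reduced to a toy sketch: local directories stand in for the Hadoop cluster and HCP, and nothing here reflects the actual HCP bridge mechanics.

```python
import json
import shutil
from pathlib import Path

HOT = Path("hadoop_tier")   # where the cluster thinks the data lives
COLD = Path("object_tier")  # the cheaper capacity tier

def offload(name: str) -> None:
    """Move a cold file to the capacity tier, leaving a stub behind."""
    COLD.mkdir(exist_ok=True)
    shutil.move(str(HOT / name), str(COLD / name))
    stub = {"offloaded_to": str(COLD / name)}
    (HOT / (name + ".stub")).write_text(json.dumps(stub))

def recall(name: str) -> None:
    """The 'click a button' path: bring the data back, drop the stub."""
    stub_path = HOT / (name + ".stub")
    target = json.loads(stub_path.read_text())["offloaded_to"]
    shutil.move(target, str(HOT / name))
    stub_path.unlink()
```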
Starting point is 00:16:02 Let me try to understand. So content intelligence is roughly indexing both structured and unstructured data? It can. It can. We focus more on the unstructured piece. Right. Because that's where the real processing value is. But I'll give you one other example how we're helping customers with this.
Starting point is 00:16:22 Content intelligence has the ability to transform information. We have a lot of organizations out there that have legacy IT, or I'm sorry, legacy application architectures. And they tend to keep those applications around, not because people are writing to them, but because people are still accessing the data contained within them. And they're looking for ways to sort of pull that data out and retire that architecture to remove the burden on IT, free up some physical space, et cetera. So what we've done for customers is actually use content intelligence
Starting point is 00:16:56 to connect to the application's data store, typically a database. We essentially burn through the entire table structures of the database. And as we're reading that data in, we convert it into an XML file that we store wherever the customer wants it. And so now all you have to do is change maybe your SQL query statements to XPath query statements, and you can still reference that data. And it gives them a chance to create maybe a more modern front end to the data set itself without destabilizing the sort of the sanctity or the veracity of that data that was once offered up through that legacy architecture.
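In miniature, that database-to-XML conversion and query rewrite might look like the following, with sqlite and Python's standard library standing in for the legacy data store and the conversion step:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Stand-in legacy table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "Acme", 250.0), (2, "Globex", 975.5)])

# Burn through the table and emit an XML copy.
root = ET.Element("orders")
for order_id, customer, total in conn.execute("SELECT * FROM orders"):
    row = ET.SubElement(root, "order", id=str(order_id))
    ET.SubElement(row, "customer").text = customer
    ET.SubElement(row, "total").text = str(total)
ET.ElementTree(root).write("orders.xml")  # store wherever the customer wants

# SELECT total FROM orders WHERE customer = 'Globex' becomes:
print(root.find(".//order[customer='Globex']/total").text)  # 975.5
```

The old SQL question can still be answered against the XML copy after the legacy application is retired, which is exactly the trade Scott describes.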
Starting point is 00:17:40 Speaking of legacy architectures versus modern data structures, there was a recent announcement at Next on HCP. You want to talk about that, Scott? Right on. So one of the things that we've done since about 2003 is we've always had an object storage solution available to the market. And we originally had released this as a physical appliance. We later converted the physical appliance into an OVA or virtual machine format to give organizations that software-only experience, if you will. That's not really software-only, though. So what we announced at Next was a branch of HCP that we refer to as HCP for cloud scale. It's truly a software-defined solution, right? It is built to be a containerized, S3-compliant object store to really address tier one and mission-critical workloads, because what we're doing is we're balancing, and this is the key part, we're balancing the performance and scale-out capabilities that you would expect out of cloud solutions today,
Starting point is 00:18:53 but focusing on the performance piece as well so that you get linear scale in both throughput and capacity, while also maintaining very strong consistency with respect to how we've re-engineered the metadata database technology. It's actually patent pending, to be quite frank. So you mentioned containers, this thing sort of runs on Kubernetes or Docker Swarm or Mesos or something like that? Well, you know, we had originally started off on Docker and Swarm. And I think now they're looking into the Kubernetes component of that as well. But quite frankly, you know, running it within a container is just one example of the deployment methodology. You know, you could also sort of kick off, I don't know, jokingly, HCP.exe, and you could install it on your own bare metal or in a virtual machine or even in a public cloud.
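A hedged sketch of what that deploy-anywhere, S3-compliant story means for client code: the calls stay identical and only the endpoint changes. Every URL and credential below is a placeholder, including the assumed HCP for cloud scale endpoint:

```python
import boto3

def make_client(endpoint=None):
    return boto3.client(
        "s3",
        endpoint_url=endpoint,  # None means AWS itself
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

aws = make_client()                                      # public cloud
in_cluster = make_client("http://minio.default.svc:9000")  # store running in a pod
on_prem = make_client("https://hcp-cs.example.internal")   # software-defined on-prem

# Identical calls against any of them (assumes a 'demo' bucket exists on each):
for client in (aws, in_cluster, on_prem):
    client.put_object(Bucket="demo", Key="hello.txt", Body=b"hi")
```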
Starting point is 00:19:50 So, Keith, didn't you just come back from KubeCon? I came back from KubeCon and there's a growing number of developers and operators who are extremely interested in having persistent data provided as a service inside of containers. So this very deployment model. Ah, gosh, that'd be interesting. So you could almost deploy your object store concurrent with your application in the same pods or in the same service slash deployment environment. Yeah, I think the main desire is to be able to have portability in the service itself, so you can deploy the service wherever you'd like to deploy the service. Right, right. The container dream. Well, that was the goal here, right? As we look at the
Starting point is 00:20:36 architecture that we have today, and not to say that there's anything wrong with the current object storage solution that we have, we just realized that to be completely fair to the audiences that are out there that are looking for the ability to support high performance workloads without the necessity to move data to a compute layer, and then also give them the ability to really fine tune application performance and balance the resources, we needed to rethink a lot of the core architecture of object storage in general, not just HCP, but in general. And that was the drive to readdress how metadata is managed, to move to a microservices-based architecture to give us the elasticity that we needed, and then to also give us the ability to create these clusters with master and worker nodes
Starting point is 00:21:29 to really allow that core clustering infrastructure to scale effectively for organizations. And that was the critical piece, right? Object storage, left to the vendors that are out there, is always going to show up in the data center. And an IT person will likely say, you know, if it's not block supporting tier one scale-up workloads, and it's not file supporting, you know, user services and, you know, some of the other things that file is capable of offering, then it must be archive. And I feel so bad for object storage. It's capable of doing so much more than that. I think it's being used for so much more than that in the world today, especially in the cloud and that sort of stuff. All right. Well, this has been great.
Starting point is 00:22:13 Keith, any last questions for Scott before we sign off? No, he just left me with some homework with this pointer and the ability to talk directly to these services with the metadata. That was a bit of a bombshell. I didn't expect that, Scott. You always got to leave them with one. Got to leave one. Scott, anything else you'd like to say to our listening audience? Listen, folks, I'll tell you, right?
Starting point is 00:22:42 So I think Sherlock Holmes said it best. It is absolutely a capital mistake to theorize or act on anything before you have the data. We want to give that data to you. We want to nurture the data knowledge in that organization that you belong to. And whether you choose to work with Hitachi or not, please, please, please develop your data culture so that you can create sustainable innovation within your business. I've been surprised by a couple of things you said, Scott, but it started out with your title, which is Director of Content and Data Intelligence. I'd never heard that before. Anyways, this has been great.
Starting point is 00:23:16 Thank you very much, Scott, for being on our show today. And thanks to Hitachi Vantara for sponsoring this podcast. You got it. Next time, we'll talk to another system storage technology person. Any questions you want us to ask, please let us know. And if you enjoy our podcast, tell your friends about it.
Starting point is 00:23:29 And please review us on Apple Podcasts, Google Play, and Spotify, as this will also help get the word out. That's it for now. Bye, Keith. Bye, Ray. And bye, Scott. Bye, gents.
Starting point is 00:23:39 Thank you.
