Grey Beards on Systems - GreyBeards talk enterprise Hadoop with Jack Norris, CMO MapR Technologies
Episode Date: December 8, 2013. Welcome to our third episode. In this podcast we take a step up from the technical depths to talk with Jack Norris, Chief Marketing Officer of MapR Technologies, about their enterprise-class Hadoop distribution, which customers far and wide are finding a viable solution to today's Hadoop problems. This month's podcast runs a little over …
Transcript
Hey everybody, Ray Lucchesi here.
And Howard Marks here.
Welcome to the next episode of Greybeards on Storage monthly podcast,
a show where we get Greybeards storage and system bloggers to talk with systems and storage vendors
to discuss upcoming products, technologies, and trends affecting the data center today.
Now, if you're a startup or a vendor representative and want to be considered for a future Greybeards podcast,
feel free to contact either myself or Howard on Twitter, at Ray Lucchesi or at DeepStorage.net.
Welcome to the third episode of Greybeards on Storage.
The third episode was recorded on November 9, 2013.
We have with us here today Jack Norris, Chief Marketing Officer of MapR Technologies.
Why don't you tell us a little bit about yourself and your company, Jack?
Well, thanks, Ray.
So I've been in enterprise software for several decades.
I guess I qualify for the gray beards without the facial hair.
That's good.
And, you know, the interesting thing about MapR
is that we're really focused on a huge paradigm shift.
And I think this combination of data and compute really represents
the next-generation stack, if you will, for enterprise computing.
And, you know, we can't take credit for this.
This was an innovation that was really driven by Google and popularized through their white paper that they finally wrote in 2004 after just, you know, taking the search industry by storm.
And a lot of it was this architecture behind the scenes.
And reading that white paper were engineers
that subsequently were employed by Yahoo
and created this open source package called Hadoop.
And our co-founder and CTO was at Google,
had been working with Bigtable as an architect, had seen it at
scale, and was also formerly the chief software architect at Spinnaker Networks. So he really
understood what it takes for enterprise storage and what it takes to do this massive distributed
processing.
And what we did at MapR was take open source, embrace open source, but also update the data platform underneath so that you get the best of both worlds.
You get the best of open source for this new paradigm, and you get the best of enterprise storage to make it much easier to deploy, to protect, and to perform within an environment.
God, you know, I was at a, I don't know, it was an analyst meeting.
I think it might have been NetApp, but it could have been EMC.
And there was a customer talking about, you know, what they were doing and where they were going.
And he made this statement, audacious statement, that, you know, within like five to ten years,
90% of all data in the world will first land on Hadoop. Does that make sense to
you, Jack? Well, you know, it does. And I'll explain why. I think that comment is a reaction
to the data silo approach that we see now. And the approach that we have now is that to optimize analysis and operations,
you have a specialized processing stack. And because of the rate of data growth, that has
translated into specialized clusters that continue to grow.
And organizations are spending a lot of time transforming and moving and copying data, to the point where they're just buried with the processing of the data, which isn't really adding
value.
And what Hadoop allows you to do is to process the data in place. And, um, polyglot persistence is a term that's starting to be used more and more, which...
Polyglot persistence? You have to explain that one.
Yeah, well, I thought I'd work that in there.
Good choice.
The first time I heard it was from one of our data scientists, and I said, well, you know, being in charge of marketing, I know one thing:
I'll never repeat that term.
Well, there you go.
You've repeated it now.
Wrong again.
And then I heard it used in a keynote, and it started to make sense. Especially since we've heard this "hey, you're going to put all your data in one place" pitch before, right?
We've heard about enterprise application integration and having a common data interchange format.
And it just meant a lot more work, right?
And what Hadoop allows you to do is to take data and put it in one cluster and not have to worry about a transformation step.
And polyglot persistence really means you've got data in different formats that remain in their different formats.
And then you can combine them together for certain processing steps, whether that's unstructured
data and structured data.
You can perform a transformation step on the platform itself and then serve some downstream environment or activity.
So it's not necessarily that you're going to take Hadoop and replace all of your analytic clusters or data warehouses today.
But it gives you a new option.
In some cases, it's to do things like I want to offload some of the processing.
I want to offload some of the data from my data warehouse into Hadoop.
And when you're looking at the price per terabyte of Hadoop
in a fully protected enterprise environment like we provide,
you're still talking a few hundred dollars a
terabyte, not the tens of thousands and on up, depending on the platform that you've adopted.
So there's a big advantage to do that. It's very easy to scale. You just add additional nodes. It's
not a six-month process to figure out how to upgrade and expand.
So that allows organizations to continue to address the fast-growing data sources that they have.
And it allows you to do analytics in a much more flexible manner.
So you can do exotic machine learning processing or you can combine unstructured feeds coming in from web activity
or social media or sensor data directly with transactional information that exists within
the company to spur better insights and better operations.
You know, this all seems like the rebirth of batch in my mind. I'm looking back to the world of, you know, the mainframe days and stuff like that.
It was a lot of batch processing which transformed data and provided some, I would say,
it's probably not analytics per se, but, you know, analysis of what's going on inside that data stream and stuff like that.
Lots of streams of data going back and forth.
I mean, it's a scale-out version, an open-source version of this, but it's sort of the rebirth of batch from a world that has been almost all online transaction processing for the last 20 years or so, maybe longer.
Howard?
Well, Batch never really died, did it?
It did die in the open world.
Except in scripts, maybe.
That was it.
There never was a batch environment as far as I could tell.
It was all transaction driven.
Now, maybe I'm wrong.
Perhaps I'm wrong.
Well, the OLTP side was, but when you start doing analytics, that's always still been batch.
Yeah, yeah.
I mean, it's certainly much more popularized today, especially
with Hadoop coming out and stuff like that. But, you know, when I was working
on mainframes 40 years ago... or 30 years ago. God, 40 would have been too long. But 30 years ago, you know,
we talked batch all the time, and very rarely anything else. That's all there was.
Well, no, there wasn't. There was time sharing.
There was online.
There was CICS and IMS and all that stuff.
But the world revolved around batch and mainframe. And over the last 25 years, when we've moved to PCs and open and all that other stuff,
the world started to more or less revolve around client-server applications and transaction processing.
And now we're coming full circle with Hadoop.
Although it's not mainframe, it's scale-out, it's open source.
But, you know, this is the rise of batch in my mind.
Well, I think that's... I mean, there are some batch activities.
If you're doing large-scale, you know, complex clustering, that's definitely a process that's taking a huge amount of data over a period of time and processing it.
But there's also some real-time elements, some interactive elements.
Let me give you some examples.
So we've got a customer, Rubicon Project.
They're doing 90 billion
real-time auctions per day on MapR. So, you know, they're...
90 billion?
90 billion.
Jesus.
comScore, another MapR customer, they're doing 1.7 trillion events per month on Hadoop.
Oh, my God.
So they're, I mean, where do you draw the line between batch and real time?
I mean, there are some predictive analytics that are part of that, that are large scale processing,
but it's also taking high arrival rate data, processing very quickly, combining, you know,
a transactions approach plus analytics so that you can optimize the ad placement or in a financial
services customer, optimize the fraud detection that you're doing. Or in the case of Comcast
Spotlight, you know, multi-screen marketing and ad insertion. So, you know, one of the things that we focused on in our platform was to eliminate the batch requirement of Hadoop.
Really? Okay.
The batch requirement is really an artifact of the Hadoop distributed file system.
So that file system is a write-once storage layer written in Java storing its data in the Linux file system.
And we rewrote that layer into a highly scalable POSIX-compliant data layer.
So you can stream data in and do analysis directly on that data.
You don't have to physically close the file like you do in the other Hadoop distributions.
And that requirement to close the file, so that you can recognize all the data that's been appended, is what creates this batch cadence, if you will.
Right, right, right. I thought that came from the MapReduce process,
but I must admit that I understand Hadoop much less well than I should.
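To make the close-before-visibility point concrete, here's a minimal sketch against the stock HDFS Java API. The path is hypothetical; the point is that data written to an open HDFS file isn't generally picked up by downstream jobs until hflush() or close(), which is what imposes the batch cadence Jack describes.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsVisibility {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/events.log"); // hypothetical path

        // Data written here is not yet visible to readers or MapReduce jobs.
        FSDataOutputStream out = fs.create(path);
        out.write("event-1\n".getBytes(StandardCharsets.UTF_8));

        // hflush() pushes the bytes to the datanodes so new readers can see
        // them, but most ingest pipelines still wait for close() before a
        // file is considered done; hence the batch cadence.
        out.hflush();

        out.close(); // only now is the file's length finalized
        fs.close();
    }
}
```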
Well, you know, Hadoop's a broad ecosystem.
So, yes, if you're doing a MapReduce job, there's a certain latency involved in that,
even if you're using a SQL syntax against Hive, which converts that into MapReduce jobs.
Right.
Yeah, because you have to map and then you have to reduce.
Yeah.
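For anyone who hasn't written one, this is the canonical shape of a MapReduce job, the classic word count, in the stock Hadoop Java API. The latency Jack and Ray are talking about comes from the structure itself: no reducer can finish until every mapper has run over its slice of the input.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE); // map phase: emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum)); // reduce phase: total per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```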
But there's also HBase that's running on top of the platform.
And we've done a lot on the architecture to really improve the performance,
expand the number of tables that are supported, and provide the full data protection.
So we have consistent 24 by 7 low latency available in HBase.
And that's where you do some of the online transactions, if you will, that support some of these applications.
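By contrast, HBase-style access looks like this: single-row puts and gets with no job submission at all, which is what makes the low-latency, online use cases possible. A minimal sketch using the HBase client API; the table and column names are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AuctionEvents {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("auctions"))) { // hypothetical table

            // Write one event keyed by auction id: a point write, no MapReduce job.
            Put put = new Put(Bytes.toBytes("auction-42"));
            put.addColumn(Bytes.toBytes("bid"), Bytes.toBytes("price"), Bytes.toBytes("1.37"));
            table.put(put);

            // Read it back with a point lookup.
            Result r = table.get(new Get(Bytes.toBytes("auction-42")));
            System.out.println(Bytes.toString(
                r.getValue(Bytes.toBytes("bid"), Bytes.toBytes("price"))));
        }
    }
}
```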
Okay, that makes more sense. So tell us how your product, you know, essentially,
and we started to touch on it,
differs from, you know, a standard Hadoop distribution
and that sort of stuff.
You mentioned some of the enterprise capabilities
that we all know and love today.
Maybe you can talk about, you know, what those are
and kind of how your product provides that.
Yeah, so first of all, let me start with the, you know, open source Hadoop. Because when we talk about
a Hadoop distribution, there's actually a series of projects and packages that are integrated and
hardened and tested. So there are over a dozen of these components that are part of our distribution,
things like Hive and HBase and Sqoop and Flume and Oozie and ZooKeeper.
You got to love the way the open source guys name things.
I was going to say there's kind of a requirement. I think maybe in the Apache Software Foundation, you know, they have a requirement that the name has to be memorable.
And so far, I think they're all in compliance.
You didn't mention Splunk in there.
Splunk's actually a partner of ours that has a model similar to MapR's, which is they have a free version and then an enterprise version that's a paid-for version.
Okay. That's a core piece, and all that open source code is available with MapR,
and you can pull it off of GitHub and use our Maven repository to make changes and incorporate that.
In addition to that, we've got a whole management suite that makes it easy to monitor,
and a lot of functions are automated, and that's part of the value-add platform that we have.
And then I mentioned earlier about the underlying data platform.
And so if you look at MapR, there's a lot of differentiation because of that data platform.
And the re-architecture was basically: let's add value by looking at what people are trying to do, support the broadest number of applications, and provide a mission-critical environment.
So, you know, people expect if you've got data that's stored in an enterprise, you've got the ability to back it up.
You've got the ability to protect it across data centers for disaster recovery.
You've got the ability to access it with industry-standard tools and Linux commands, et cetera.
And all of those things that I just ticked through are unique features of MapR. We're the only ones that have mirroring,
the only ones that have point-in-time consistent snapshots,
the only one that has full POSIX compliance,
the only one that's distributed the name node function, so it's much easier to scale,
and we can scale orders of magnitude higher
in terms of the number of files supported, etc.
Okay, backup.
I've often talked about this; it never seemed to me that Hadoop data is ever really backed up.
It's backed up someplace else.
But you mentioned mirroring, you mentioned snapshots and stuff like that.
Do you have a backup solution in there someplace?
Yeah, so we've got snapshot capabilities. I mean, if you've got a petabyte of information...
Yeah, yeah, yeah. Having, you know, having a whole backup protocol for that is really unwieldy and costly.
Gee, there's an understatement.
Yeah. So what Hadoop does is it has replication built in, so you can establish the replication factor that you want for data.
Now, that protects against server failure, node failure, et cetera.
It doesn't protect against user application errors and corruption.
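The replication factor Jack mentions is an ordinary, settable property. Here's a minimal sketch with the Hadoop FileSystem API, using a hypothetical file path; the shell equivalent is hadoop fs -setrep.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/clickstream/part-00000"); // hypothetical file

        // Ask for three copies; the cluster re-replicates automatically
        // if a node holding one of the copies fails. This guards against
        // hardware loss, not against an application overwriting good data.
        boolean accepted = fs.setReplication(file, (short) 3);
        System.out.println("replication change accepted: " + accepted);
        fs.close();
    }
}
```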
Yeah.
And what we have as part of that underlying storage layer is a redirect-on-write, enterprise-grade, point-in-time consistent snapshot capability.
So you're telling me you could snapshot a petabyte of data? Is that what you're telling me? No way.
Absolutely, absolutely. And it's consistent. And because it's redirect-on-write, there's no excess storage. The additional storage is basically dependent on the overwrites, the updates that you're doing.
I need to see this. I would like to see a petabyte of data being snapped at once. Wouldn't you like to see that? Maybe you could do a demo for us.
Remember, it's not the petabyte of data.
It's the metadata.
Yeah, yeah, I know.
I've been there.
I've done that.
The whole idea of redirect on write is that we don't have to deal with the data.
We just have to deal with the metadata.
Right, the pointers.
Well, there's splits on writes and stuff like that.
There's a lot of work that goes into making it all kind of hang together.
And, you know, can you have multiple iterations of snapshots?
I mean, can you support that?
No, I mean, we also have a volume concept.
So you can set up different frequencies of snapshots by volume.
You set up the retention policy.
You want a snapshot every 10 minutes
and then retain one at the end of the day
and then retain one at the end of the week
and retain one at the end of the month
so that there's this rolling update.
And if you look at it, you go-
Oh my God, it really is enterprise storage.
It is close.
Exactly.
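The rolling schedule Jack sketches, frequent snapshots that thin out with age, is easy to picture as code. This is a toy model of the policy idea only, not MapR's implementation; in MapR the schedule and retention are configured per volume rather than computed like this.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/**
 * Toy rolling retention: keep every snapshot from the last hour, then one
 * per day for a week, then one per week for a month. Older snapshots age out.
 */
public class SnapshotRetention {

    /** Returns the snapshots that survive this pruning pass (input oldest-first). */
    public static List<Instant> prune(List<Instant> snaps, Instant now) {
        List<Instant> keep = new ArrayList<>();
        Instant lastDaily = Instant.MIN;
        Instant lastWeekly = Instant.MIN;
        for (Instant s : snaps) {
            Duration age = Duration.between(s, now);
            if (age.compareTo(Duration.ofHours(1)) <= 0) {
                keep.add(s);                                    // keep all recent snaps
            } else if (age.compareTo(Duration.ofDays(7)) <= 0
                    && Duration.between(lastDaily, s).compareTo(Duration.ofDays(1)) >= 0) {
                keep.add(s); lastDaily = s;                     // one per day for a week
            } else if (age.compareTo(Duration.ofDays(30)) <= 0
                    && Duration.between(lastWeekly, s).compareTo(Duration.ofDays(7)) >= 0) {
                keep.add(s); lastWeekly = s;                    // one per week for a month
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        List<Instant> snaps = List.of(
                now.minus(Duration.ofDays(10)),
                now.minus(Duration.ofDays(2)),
                now.minus(Duration.ofMinutes(30)));
        System.out.println(prune(new ArrayList<>(snaps), now));
    }
}
```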
And it's because, you know,
if you look at the DNA of our founders, there aren't too many people in the Valley that have really been an architect for enterprise storage as well as an architect at Google and understand this scale. What they understood was that a new underlying approach was needed. If you tried to use a clustered NAS as the underlying platform for Hadoop,
it won't work.
And I'll just give you one example:
once you scale to hundreds of nodes,
the recovery process, the resync process, can create I/O storms.
So you really have to have an architecture that's built to handle that distributed framework.
And we did a test at Google where we had 1,003 nodes running, and we took down 1,000 of them.
What?
We brought them back up.
And they were back up, and the cluster was running in three minutes.
And then synchronizing for the next six months.
No, no, no.
Because all nodes are participating in that resync.
As opposed to if you have a single name node or even a federated name node,
then all of those nodes have to report back to that one name node or that small cluster of name nodes.
And if you did that with any other distribution, it would be a
20-hour or so process. Wait a minute. I just need to understand what's going on here. So you have
1,003 nodes of effectively Hadoop MapR running in place and doing stuff. And you took down 1,000
nodes at one whack, and you brought them back up. Now, so 1,000 nodes of 1,003 nodes is going to be 99% of the data storage that's going down here.
The other three nodes are not going to be able to do anything effectively with that 99.7% of the data.
How does this all work?
He didn't say any activity happened while only three of the 1,003 nodes were down.
Okay, then what's the resync problem then?
Because if there's no data being changed.
Everybody's still got to checkpoint and know that no data got changed.
Exactly.
Exactly.
And when you do that with every other distribution, it's got to wait for all the block reports to report back and say, I'm alive, and here's what I have, and here's what's running on my cluster.
And that check-in process, if you will, takes hours. With MapR, because everything's distributed and every node participates, it's minutes.
That's, you know, orders of magnitude faster.
You mentioned multiple times that you have a distributed name node.
So I guess I'm saying that in a standard Hadoop distribution, there's a single name node?
Correct.
And what does the name node do for you in this space?
Well, the whole function here of Hadoop is to have this distributed, clustered file system appear as one processing unit. So the name node kind of controls where things are located,
because I need to know that before the mappers and reducers function.
So it's like the metadata server in another cluster,
in a large other cluster file system.
Right, right, right.
It's basically the key brains.
And if for some reason you lose that name node and lose the data,
then in the words of one of our customers,
then I've got a bunch of spinning rust.
So that's a key function.
There's a process to federate that, but it's still fundamentally the same process.
And in the MapR environment, is the distributed name node distributed to a number of select nodes,
or is it distributed across all the nodes in the cluster?
It basically distributes it the same way data is distributed.
So each node has a collection of data.
Not all the data, obviously, but a collection that's unique.
And the metadata is distributed in the same way.
Oh, my God.
So you almost shard the metadata.
Yeah.
And basically it's a lot faster.
It's a lot more reliable.
And to do that requires a fundamentally new architecture.
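The "shard the metadata" idea can be illustrated in a few lines: pick the metadata owner for a path the same way you'd place the data, so no single node holds the whole namespace. This is the concept only, with made-up node names; MapR's actual design uses containers and a CLDB, not a simple hash like this.

```java
import java.util.List;

/**
 * Minimal sketch of a distributed name node: instead of one server owning
 * all file metadata, hash each path to the node that owns its metadata,
 * the same way the data itself is spread around the cluster.
 */
public class MetadataShards {
    private final List<String> nodes;

    public MetadataShards(List<String> nodes) {
        this.nodes = nodes;
    }

    /** Which node holds the metadata for this path? */
    public String ownerOf(String path) {
        // floorMod keeps the index non-negative even when hashCode() is negative
        int shard = Math.floorMod(path.hashCode(), nodes.size());
        return nodes.get(shard);
    }

    public static void main(String[] args) {
        MetadataShards shards = new MetadataShards(List.of("node-1", "node-2", "node-3"));
        System.out.println(shards.ownerOf("/data/clickstream/part-00000"));
    }
}
```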
So that's, you know, we've talked a lot about these features and capabilities.
It's because of the architecture.
And I guess if there's one premise that we'd like to talk about, it's that architecture matters.
We all know that.
Oh, yeah.
And then, you know, the proof point is in,
so how are customers using it?
And, you know, I mentioned some of the customers
in the ad media space,
but we also have, you know, one of the largest retailers,
with over 2,000 nodes of MapR,
and they're using it as part of their retail merchandising operations, with numerous
use cases that Hadoop is being deployed for.
You also mentioned mirroring and that sort of stuff.
So how is that, and mirroring for disaster recovery, I think, specifically, how is that done in this environment?
You've got, let's say, a 1,003-node MapR cluster,
and I'm assuming that's the right terminology here.
Yeah.
What do you do to mirror that?
I mean, are you going to mirror it to another 1,003 nodes?
You can set it up where you can select, you know,
I only want this volume of data to be mirrored.
And because we're, you know, we rewrote that underlying layer,
what we're mirroring is just the differential changes within the block,
and those are automatically compressed and resynchronized.
If you've, you know, lost connections... You can also pre-populate the remote site with a large dump to initiate it, so you're not trying to push a petabyte over the WAN to do the initial synchronization.
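The differential mirroring Jack describes can be pictured like rsync at the block level: checksum blocks on both sides and ship only the ones that differ. A toy sketch under that assumption; MapR's real mirroring works on its own internal block structures and adds compression on the wire.

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Toy differential mirror: compare per-block checksums and return the
 * indices of blocks that must be re-sent to the remote site.
 */
public class BlockDiff {
    static byte[] checksum(byte[] block) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(block);
    }

    /** Blocks whose contents differ (or that are new) get shipped; the rest stay put. */
    static List<Integer> changedBlocks(byte[][] source, byte[][] mirror) throws Exception {
        List<Integer> changed = new ArrayList<>();
        for (int i = 0; i < source.length; i++) {
            boolean newBlock = i >= mirror.length;
            if (newBlock || !Arrays.equals(checksum(source[i]), checksum(mirror[i]))) {
                changed.add(i);
            }
        }
        return changed;
    }
}
```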
Okay.
Sometimes just shipping a box of tapes is a better idea.
Oh, God, yes.
I was trying to – well, that's a different story.
But moving my desktop to Dropbox was a bit of a challenge.
But that was only a couple hundred gig.
So, I mean, you're getting the flavor that this is an enterprise-grade distribution, and our focus is production success for customers.
And we've got many customers that started in another distribution for Hadoop and moved to MapR for some of these features as they moved from test to production deployment. And I think that's one of the beauties of the Hadoop space,
is that we're all supporting the same open source distribution.
They're the same APIs.
In a sense, they're different reference architectures.
So it's quite transparent to move applications across distributions.
Okay.
Right, because all the visible parts are the same, but you guys are replacing HDFS with what sounds like a much more enterprise class file system.
So the other question, so HDFS typically has three copies of every piece of data.
Do you guys maintain that sort of environment?
Yep, yep.
We also have automatic compression that's transparent.
So if you look at the kind of typical compression that's going on with the data sets, it's pretty much a wash.
So you are keeping multiple copies, but with the compression, it works out to be about the same size.
Oh, okay.
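The arithmetic behind "a wash": a terabyte of raw data stored with a replication factor of three occupies three terabytes of disk, but if it compresses at roughly 3:1 (a plausible ratio for text-heavy data, though it varies widely by data set), the three compressed copies total about one terabyte, roughly what a single uncompressed copy would have needed.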
And it's all software-based, of course.
Yes.
Yeah, running on different flavors of Linux.
Right. Yeah, I know.
I'm getting the "and I'm just a storage guy" kind of view, yeah.
It's like, huh.
I mean, this is why
I start off by saying it's a
paradigm shift.
What we haven't touched on is
so what does this mean for
the types of applications that
are being done?
And I think it was Howard who earlier talked about, you know, how analytics have always been kind of batch.
And a data warehouse environment is typically, you know, taking data from transactional environments through an operational data store, and then putting that in the data warehouse to process and, you know, help illuminate what happened in the business last quarter, last month, et cetera.
Yeah. And that's the model I've always imagined that Hadoop was
serving. And part of why you don't back up the data is that it's just an export from the OLTP
system. And if you lose it, you can export it again. And there are some uses like that, but
increasingly we're seeing, you know, new types of applications that are being addressed and data
sources that really aren't being leveraged today, especially the data exhaust from certain
processes, sensor data, social media, you know, different insights into customers.
And being able to react in, you know, near real time so that you can do a series of small adjustments
and the result is, you know, better top line revenue or better risk mitigation or streamlined operations
has an impact on the business now.
And that's been some of the more exciting uses and applications that we've seen.
Yeah, and I can see that as being especially handy for web-based businesses where you can react to what the data is telling you essentially immediately.
If we bought 10 million of these things and they're not selling quite as well as we'd like,
we don't have to wait until the end of the day or the end of the week to change the price.
Exactly.
And that's especially true as you look at some of the long-tail operations. If I can personalize my product or service to better meet
your needs, then I'm going to be more effective. I'm going to be a stickier vendor to you, etc.
So can you get Google AdWords to stop showing me ads for the things that I just bought because I'm done looking for them?
That's a different problem.
I would like actually – well, so does this thing run on Macs or PCs?
It's all Linux-based or how does this work?
We're running in the application space of Linux.
Okay.
Red Hat, Ubuntu, SUSE, et cetera.
And all that's there.
I'm thinking if I could just cluster my Macs together, I might have a four-node cluster or something like that.
I might be able to fire this thing up.
Yeah, well, you can also use it in the cloud.
Oh, okay, so Amazon stuff.
Yeah.
And Amazon actually is an OEM of MapR.
So you can go to the Amazon Elastic MapReduce service, and there's an EMR option, and then there's the MapR M3, M5, or M7 edition.
So we have multiple editions to basically tailor what's available for certain use cases and certain applications.
So is this like more highly resilient kinds of environments or faster?
So we're faster.
We're more dependable in terms of how you do data protection and high availability.
No, no, the different editions that you support, what's the differences between them?
So the M3 edition is the free edition.
So that's free for unlimited use.
The M5 has the high availability and data protection available.
So if you want the mirroring and you want snapshots, it's available in the M5.
And that also includes 24-by-7 support.
So you start with M3 and then you have a data loss and you realize you need to pay.
I get it.
Well, the M7 edition is the one that combines the NoSQL and Hadoop together.
So if you have enterprise-grade needs for that integrated NoSQL, we've got the file system that has the tables on the same layer. And that's a really compelling architecture for organizations that are trying to do multiple things with their cluster.
So you mentioned this NoSQL and Hadoop.
I always thought Hadoop was a NoSQL environment in and of itself.
I mean, obviously Hive and HBase provide SQL access to it, but...
Well, that SQL access is translated into MapReduce code that's processed against the files.
So when we're talking about NoSQL, it's the database capabilities. And in other distributions, you've got HBase running on Java, writing its data
to the Hadoop distributed file system, running on another instance of Java, writing its data to the
Linux file system. And the problem there is there's a disconnect between the needs of a database and the write-once storage layer of HDFS.
And there's a lot of manual administration process, and then there's some latency spikes that you get because of not only Java,
but the fact that you have to stage data and create these compactions.
So there's quite a lot of overhead in all of that.
And what we've done with MapR is
to collapse that completely. And you've got files and tables on the same layer writing to disk.
Very simple, no Java dependencies, zero administration overhead, and consistent
24 by 7 low latency. And hopefully avoiding Java gets you some performance improvements as well?
Oh, it's five to seven times faster on the benchmarks that we've run using the
Yahoo Cloud Serving Benchmark. Okay, looks like we're about done. I had one question, I think:
do you use any SSDs in this environment, and where would you use them if you had them?
Yeah, great question. So I mentioned volumes earlier, which is a feature specific to MapR.
You can have data and job placement control.
So, for instance, if you have some workload that really needs to take advantage of fast I/O processing,
you can have SSDs deployed in a few nodes in the cluster
and then direct data that's placed there and specific jobs that run there.
Okay.
I think I'm about exhausted.
Howard, do you have any other questions you want to ask?
Well, I mean, I do have one question I want to ask.
It's not actually MapR related.
It drags Jack back to his past.
F5 is killing off the ARX.
F5 is killing off ARX?
Yeah.
Yeah, which is a file virtualization tool.
And, yeah, this is, you know... file virtualization has been one of those areas that's just a graveyard.
Yeah, absolutely.
As companies buy them, they kill them off and stuff like that.
Yeah, I understand.
I look back through my database of product categories and it's like, wow, everybody here, 27 companies have tried this.
It's never caught on. Nobody's ever been successful at it. And I was wondering if Jack,
with his experience at Rainfinity and other places, has any insight into, you know, why this technology, which seems to me as a practitioner like a really interesting, good idea,
has been so dramatically unsuccessful in the market?
Well, I think EMC did it right. They took the Rainfinity technology and they embedded it
into the NAS storage and integrated that so that you can move things transparently.
So I think that was the right move. I think on the broader macro trend,
I think one of the issues is that rather than move data, it's process it in place. And that's
basically the trend we're seeing with Hadoop.
He brings it back home. Good job, Jack.
All right. Well, this has been great, Jack. I really appreciate having you on the call.
And thank you, Howard, once again for being here.
Next month, we're not exactly certain what we're going to be talking about, but I'm sure it will be very interesting.
Any last comments, Jack?
Well, thank you so much.
I really enjoyed this and appreciate you having me on the program.
It's our pleasure.
It's been our pleasure and always nice to learn something.
All right.
Well, thanks, guys, and we'll talk to you next month.
Very good.
All right.