Drill to Detail Ep.20 'MapR Platform Differentiation, Scaling Hadoop and Microservices' With Special Guest Tugdual Grall

Episode Date: March 6, 2017

Mark Rittman is joined in this episode by MapR's Tugdual Grall to talk about MapR's platform differentiation and relationship with open-source Hadoop, scaling and streaming, microservices, and MapR's platform strategy around big data workloads in the cloud.

Transcript
Starting point is 00:00:00 My guest on this episode of Drill to Detail is Tugdual Grall, a name some of you from the Oracle world might recognize, and who's now at MapR, working alongside Neeraja from MapR, who came on the show a couple of months ago to talk about Drill and MapR's analytics strategy. So, Tug, nice to speak to you again after so many years, and welcome to the show. So, thank you for the invitation. And yes, a nice and quick introduction, and it brings back old memories from our Oracle time,
Starting point is 00:00:44 because I left Oracle 10 years ago to move to the open source market, working at eXo Platform, a startup around an open source portal and social network for the enterprise. But after a few years in the role of CTO at a startup, I chose to move back to my roots, meaning I wanted to talk to developers. I wanted my users to be the developers that use my product to build applications. And I was looking for a new opportunity
Starting point is 00:01:17 and I moved to NoSQL. Spent a little bit of time at Codespace, then MongoDB, and for a little less than two years now I am working as a technical evangelist for Europe at MapR. And what brings me to MapR is big data. And you know, it's a very vague and very large scope. But what was very exciting to me, it was when I was working on NoSQL back in 2012 and 2013, most of the jobs that you have to do when you were processing a very large amount of data or when you have to integrate multiple data sources and so on, we have to move the data out of the NoSQL database to something else. It would be any Hadoop distribution or MAPR
Starting point is 00:02:07 to store more data, process more data. This is why I switched to this fantastic world two years ago. Okay, okay. So, Tug, what I wanted to talk to you about really was when we had the podcast episode with Niraja, we talked about drill and analytics and and and that kind of world really but i think what's particularly interesting with mapr is is some of the choices that mapr made over the technology you use and some of the proprietary parts of what you do and i guess
Starting point is 00:02:36 where you've sort of diverged from kind of open source hadoop and so on and given the background you've got and the fact we know each other from the oracle world i thought it'd be interesting i suppose to sort of drill into some of those kind of like products in the kind of stack you've got and try to understand from a developer's perspective, you know, what's different and potentially kind of what's better about the way you do things and so on. So let's start off initially, Tug, by just for anybody that doesn't know the MapR stack um just kind of you know paint a picture really of of what a kind of the platform looks like and and some of the key products in there and we'll drill into some of the detail of those um in a moment yeah so uh the the product that we we build on we sell is called mapper Converge Data Platform. And I think this name, Converge, or this adjective, Converge, it's very important. The convergence of data into a single platform.
Starting point is 00:03:32 This is why we have built on the engineering team, on the vision of the company has been focusing. Provide the best data platform to build any type of application. And for this, you have to make some choice. How do you store the data? How do you organize the data? And also, how do you ingest, consume, process the information? Something that was clear when Mapper was built in 2009, it was really that for big data, the Hadoop ecosystem was the way to go. But looking at the Hadoop ecosystem, something was not good enough, at least based on the vision of the founder of Mapart, is the file system HDFS. It's not a file system. It's a simple storage allowing you to store a very very large file and do put and get. But what if you want to store very very large amount of small file? What if you want to modify your file that is on a
Starting point is 00:04:39 distributed file system? It's not possible with Hadoop. It's not made for that. So the idea was initially to build this very powerful file system. It's not possible with Hadoop. It's not made for that. So the idea was initially to build this very powerful file system, allowing to store any type of files, small or big, with the same characteristic in terms of replications, data localities that you can have in the Hadoop ecosystem, but providing more infrastructure layer, like a different security model, a more efficient replication between nodes, a powerful replication between clusters. So everything that I say on the file system,
Starting point is 00:05:15 it's transparent for any developer, not talking about Hadoop developers, not talking about big data developers. But when you work on your laptop, when you develop on your laptop, when you develop on the cloud, you open a file, you save it. You don't care where it is.
Starting point is 00:05:28 You expect it to be very efficiently stored, replicated if possible. This is exactly what the file system gives to you. And a very important part, and I will come back about proprietary versus open source, a very important part is what the developer needs to know to work on MAPAR.
Starting point is 00:05:49 Nothing special. We leverage open source as an API. What we will say is we contribute to open source. Drill is a good example. We have also contributors in other ecosystem, Hadoop ecosystem component. But we also want to be sure that the APIs
Starting point is 00:06:08 that developers use to work with data run efficiently on the Mapper platform. So if you are a Spark developer, you will run with Spark. If you are a MapReduce developer, you work with MapReduce and you will consume the data from the Mapper platform transparently. So for the developer, it's transparent.
Starting point is 00:06:30 Okay. Okay. So I guess, you know, we're talking about the MapR kind of file system here and how, I guess, the technology you've got and the approach you took allows you to, as you say, have small files as well as large files etc etc you know do uh updates and so on so if we look back to i guess google file system and the and the kind of genesis of a lot of hgfs and so on um you know the choices that they took and they they kind of the the optimizations they went for and and and and so on were because they kind of had to you know so you know you had to kind of have big files and big blocks and so on there to do what they were going to do and so on so how did mapR manage to have the performance of say google file system and hdfs but still have this ability to you know
Starting point is 00:07:14 to update data to have small files and so on how did it manage to do that what's different about the approach you've got that allows so i i think it's really uh i was not at the foundation uh at the foundation of mapar so i will based on this is based on our history on dna that we we get into our body when we join the company so it's it's um the founder of mapar come from different backgrounds. They used to work, Srivast, used to work for example on pure storage. He works for NetApp, then he moved to Google working on MapReduce, on FileSystem, on Bigtable and so on. So he has this very serious enterprise storage background, but also very advanced on new vision of the data processing and storage for large-scale use case at Google. So you really start to build with the different engineers to build a file system from day
Starting point is 00:08:21 one, say, let's take what I know and what works from an enterprise-scale storage where you need to be able to store any type of file, have the replications, and so on and so on, but also keep the way Hadoop distributed processing layer are working at the top level to make it more that efficient. So, for example, back then, everybody knows that the name node of HDFS is a challenge, or at least was a big challenge for many users.
Starting point is 00:08:54 It was more or less a spoof, a single point of failure. Historically, it was also the way the replication is done inside HDFS with the same size of blocks for the replication between nodes where you read the data so where you organize the chunks and so on where for example mapper fs doesn't use a name node we have a context of a container location database that is a transactional distributed systems that have addresses, not of file, but addresses of containers that contains addresses of files.
Starting point is 00:09:32 So we have an indirection allowing us to scale better, allowing us to have more files. So more addresses, more small file and so on. And this database is CLDB, container location database, is by nature, by day one, it has been built to be distributed. So it was unreplicated. So it's not the case, for example,
Starting point is 00:09:57 of the name node initially. Also, you have different size of blocks blocks different block size in the file system to make it very efficient what you want you want to have a very large cldb with multiple gigabyte to address file very very fast you want to have a very efficient block when you read the data, so you have a 260 by default, 256 megabytes, sorry, no, megabytes. Block size, when you read the data, when you want to do, for example, a Spark job that have to read large files, you will use this block size.
Starting point is 00:10:38 But because you want to be able to update, modify a file, create a small file, when you replicate the data between nodes for high availability, so when you redo the replication between nodes inside the cluster with a default replication factor of three, you can choose how many replicates do you want. This site is only 8K, 8 kilobytes, to make it very, very efficient. So all this together, because it has been designed from day one to be an efficient file system, a real file system, allowing any type of operations that you know from a file system, this is why we are successful.
Starting point is 00:11:20 We were able to succeed on this. And also, this is where we see the proprietary side of the software. Everything that is touching the file system is written in C and C++, not in Java. Everything is with a very fast, efficient access to the hardware, to the type of disk you have. So we have a different way of storing data, at least doing the IOs when you use a very
Starting point is 00:11:50 fast SSD drive compared with a classical drive and so on. So all these make it more efficient. And this is our intellectual properties. We have a patent around the way we organize the content on the files and so on. Okay, okay. So would it then be Yarn and MapReduce and Spark and so on running on top of that, the normal kind of open source one, or is it a particular kind of variant of that from you guys that's proprietary?
Starting point is 00:12:22 No, so only the way we organize and store the data is proprietary the way you access the data is based on the open source way of doing stuff for example i have talked a lot about the file system and we will talk about the other component but you will see that it's coming naturally to that how do you access the files in Hadoop? You access the file using HDFS command. HDFS put, HDFS get, HDFS on, I don't, LS on, FS, LS on, so on.
Starting point is 00:12:54 All these command, HDFS command, are compliant and works on MAPA. And so everything that has been built, a MAPA use job where Yarn will look for the location of the data to do dynamic allocations, for example, will work exactly the same way.
Starting point is 00:13:13 Because for us, you have the file system and you have multiple protocol or multiple API to access the file system. One of them is HDFS. And this is made to migrate or use the same code that you have into a Hadoop standard project and run it on MapR. But most of the time, what people will do, they will directly access MapR using an NFS endpoint
Starting point is 00:13:43 or what we call our POSIX client, or the FUSE client, that has allow you on edge nodes that need to process or read the data to have a very, very efficient access to the information using standard IOs. So you can connect to the cluster, do LS, put VI, modify your file, and immediately refresh it. So because we use open source as an API,
Starting point is 00:14:12 when we expose the data out of our system, all the open source API or open source frameworks that are running at the top will run the same way, if not more efficient, in a more efficient manner, depending on the type of job you do okay so so i mean in a tangible i mean we'll get on to kind of map our you know db and streams in a second but in a tangible way what what does this mean as a benefit for a developer and a customer so i get i get that it's probably more efficient and it's more scalable and so on what
Starting point is 00:14:41 does it mean for the customers that are actually using this? Because you've got quite a few, you know, what do they get out of it? Yeah, so you have different, so what we have to keep in mind before answering the developer questions is MAPR from a pure ops point of view, from an infrastructure point of view, is usually a lot more efficient. For the same use case, let's talk about a Hadoop use case when you do Hive, MapReduce and all this, for the same use case on let's talk about a hadoop use case when you do hive map reduce and all this for the same use case you can you may need 30 percent less physical servers because we are more efficient to manipulate the data also it's easier to make it very highly available and so on so this is upside for the developer first of all you will not change anything so this is one thing however it's changing in the way you want to build application For the developer, first of all, it will not change anything.
Starting point is 00:15:26 So this is one thing. However, it's changing in the way you want to build application. Suppose you want to ingest log file into your file system. So you know you have many web servers and you want to take the log file and push them into the file system because you will use that to do some job with Spark and sort of job with Spark or job with analytics with this SQL or MapReduce and so on. Usually in the Hadoop way of doing it, you will have Flume on this kind of job that will take the file partially and they will aggregate the file to create very large file that you push into HDFS. So you can take the same data flow into MapR
Starting point is 00:16:06 and it will work. So is it totally transparent for developers? But in the same way, it's a lot more easier with MapR because usually what people will be doing, they will simply directly use NFS, a mount point on your web server that generates a log, saves a log directly into the cluster.
Starting point is 00:16:27 So you will simplify the ingestion process on the data flow to ingest the information into the cluster with Mapa. So this is a simplification for the developer. Besides that, all the Hadoop APIs, the Spark APIs, the SQL with drill, or Hive will work the same way. The way security works will be also based on Kerberos, so the configurations on the way you authenticate to the cluster will be similar to what they know already.
Starting point is 00:16:57 Another little part that is interesting for developers, if you want to manipulate the files, you don't need to use HDFS API. Just use, as a Java developer, Java IO API and save the file into the cluster. It will be automatically saved into a distributed file system. Okay, okay. So let's talk about MapRDB
Starting point is 00:17:21 because I think Niraja mentioned it at the time when we did the call before. So I take it MapRDB is a similar thing to say kind of, you know, HBase and it's a NoSQL database. So tell us a bit about MapRDB. Again, you know, what's the history of it and what problems it's solving really? So one of the key elements is the more we do application today, month after month, the more you have to deal with real-time data, real-time applications, interactive applications.
Starting point is 00:17:53 So this is one of the reasons inside the Hadoop, you have, for example, you have a NoSQL database called HBase, a very successful database, but it's based on HDFS. So we had some flow in terms of compaction of the data, the way you scale out and so on. So Mapper chose to implement its own NoSQL database. So first step was in Mapper 4 was to use MapperDB binary, what we call MapperDB binary. This is an OCQL database, columnar format, column-oriented OCQL database
Starting point is 00:18:30 that use the EdgeBase API. So same for the developer, it's transparent. So you will be able to use an OCQL database on tables that are oriented by column that is based directly on the file system. So everything we said about scalability efficiency
Starting point is 00:18:50 will be exactly the same with mapperDB Binary. In addition to that, last year, end of 2015, we added mapperDB JSON. So using the same engine, using the same file system, everything is part as soon as you install Mapper, you have the file system, you have the database available. You don't have product to install.
Starting point is 00:19:14 It's running on the same engine. It's running on the same binaries. You have MapperDB JSON. That is a document-oriented database that will use the same, that is using the same scalability scheme that you have with the file system, but using a document-oriented database, allowing you to store JSON and manipulate JSON document in an efficient way. And what you have seen with Neeraja last time is you can query files, you can query mapper DB binaries or edge base,
Starting point is 00:19:46 and you can query mapper DBGs on using Drill on doing SQL analytics. And once again, so this is a very important part for developers because developers will need to manipulate data and do some updates, do some manipulations, aggregations, increment, decre some manipulations, aggregations, increment, decrement some value and so on,
Starting point is 00:20:12 modify on the fly the structure of the data because the way, let's take, for example, in an insurance company, a policy for a car, a policy for a home, a policy for healthcare may have characteristics that are equivalent, like the policy ID, the name, address of the customer, and so on. But many, many things are totally different from one contract to another because you don't represent the same data. So NoSQL Engine is very useful for that. This is a flexible schema.
Starting point is 00:20:40 It's something that many applications need. Then you need, in the context of large data a large project with lots of data you need the scalability and the reliability this is where mapper db binary and mapper dbg is and will help the developers and will help the ops guy okay okay so so and there's also i see there's a product called MapR Streams and streaming, you know, streaming ingestion, streaming processing. That's a massively kind of hot area at the moment and so on. So again, you know, why did MapR create its own kind of streams product? Again, what problem is it solving and what's the kind of story behind it really?
Starting point is 00:21:22 So, it's the same story as a file system on the database and I will say for example and I will answer this question about streaming two-step. Some things that I didn't mention about MapperDB if you take a traditional Hadoop environment if you want to have HBase running a very intensive workload with very lot of queries and response and real time and modification of the data, and in the same time you want to use your Hadoop cluster to do some large analytics with MapReduce, most of the time what you have to do, what you must do, you need to create two clusters. One cluster to run HBase, one cluster to run your MapReduce jobs
Starting point is 00:22:05 or your analytics on the file system, for example. With MapRDB, you can run that on the same cluster. You have many configurations you can do in the way you will do a multi-tenancy of the data, tagging of the nodes and so on. So you have in a single cluster that is easy to administer and easy to secure, running operational analytics job with NoSQL database and file system. For the same reason, if you look at the streaming part, what you need from an application, you want to simply stream messages into the platform and be able to not only move the data from one place to another, but also do some processing, for example, with Spark or with Fling. You want to be able to process these messages,
Starting point is 00:22:51 but also emit, so publish new messages as a result. For the same reason, we said we don't want people to have to install another cluster on the site to be able to stream data in and out, because a common practice is to use Kafka to have to install another cluster on the site to be able to stream data in and out. Because a common practice is to use Kafka and create a Kafka cluster, meaning connect the Kafka cluster to ZooKeeper, have multiple brokers, configure the replications, and so on.
Starting point is 00:23:16 So the idea was, for the same reason we create MapperDB binary with the Edge-based API, let's leverage the MapR capabilities of having an easy installation, an efficient storage, replications between nodes in an efficient way in real time, but also multi-master
Starting point is 00:23:36 replication between multiple clusters and multiple geos. But do that for your streamings. Do that for your streams, for your messages. So, but using the Kafka API, because what we want with MapR, we don't want, when it's possible,
Starting point is 00:23:54 we don't want to invent a new API. It doesn't make sense. You have so many good API and good programming models that has been built by the open source community. So what we do, we leverage this API, and we simply change the way you discuss with a broker. We don't have a broker in the sense you are saving data, sending data to the cluster. Same API, same concept in the way you build your
Starting point is 00:24:19 applications, but at runtime, the way it's executed, so where the data are saved, replicated, are different. So in the case of Stream, you have a few interests, at least we can directly explain in very short sentences. One of them will be the speed, the scalability, the latency and so on.
Starting point is 00:24:47 So if you have multiple million of message per second, MapperStream will be a lot more efficient and more easier to manage than Kafka itself. But like I like to say quite often with a smile on my face, it's if every project has to send one or 10 or 20 million message per second, mapper will be everywhere. It's not everybody needs this scale
Starting point is 00:25:13 in terms of messaging, but everybody needs a better security model that is shared between the database, the file system, and the topics you have into your strings or your messaging layer because you don't want to have, in one case, use an SSH key, an SSL key,
Starting point is 00:25:34 and the other, you want to use a Kerberos ticket to authenticate and to say, I can access all of this part of the data. So a common security model is one of the benefits. Another benefit is the fact that you don't have multiple clusters, but also you want to be able, and we see that in a more and more common way in the IoT, we have customers in oil and gas industry, for example, and you have drills all over multiple plants,
Starting point is 00:26:02 and you want to be able to capture messages, send the messages on the local cluster, for example, in a region that we need to replicate the same message on a national cluster than on a worldwide cluster. Doing that with Kafka, it's not that easy because you need to install MirrorMaker, configure this, monitor this, and plus also you are losing the offset of the various consumer over the different replication where it's totally built in into a mapR. So you can publish and subscribe on any of the cluster
Starting point is 00:26:37 and it will replicate in the both directions between the different clusters, the same way the database could be used as a multi-master between different clusters, the same way the database could be used as a multi-master between different clusters. So you see, it's the same API, same developer experience, but usually a lot easier to put in production and do the configuration between the different components you want to work with.
Starting point is 00:27:01 Okay, okay. So what we've been talking about is effectively MapR doing things kind of the same but better than things already in the platform but I noticed that looking at your website and looking at some of the the white papers from MapR, you talk a lot about microservices and microservices seems to be a kind of increasingly kind of you know topical kind of interesting area to do with kind of Hadoop and big data and so on so so tell us a little bit about what microservices are and why they're important in this context and why uh map are kind of putting a lot of investment and time into this yes so you see this um you probably have seen many uh about that already, but on microservices,
Starting point is 00:27:46 it's a different way of building applications. We were talking about, in the past, a big monolithic application, and you and me, we work at a good time of the big Java EE development, so where you build the big IR files that contain multiple war files and so on and so on. This was a big IR file that contains multiple work files, and so on and so on. This was a big monolithic application
Starting point is 00:28:06 that was very, very hard to make new updates and new features or remove features or change technology. Suppose you want to be able to, let's say, the user profile in my database, my application user profile, I want to switch from a relational database to a document database because my application user profile. I want to switch from a relational database to a document database because I need schema flexibility.
Starting point is 00:28:27 Doing that in a monolith application, it's a nightmare. And startups, big startups like Netflix, I started to build application in a new way by building very small services that are very dedicated to one single thing that they will do from end to end, creation of your user profile. The creation of user profile will be not only the UI, the REST API, but also the storage
Starting point is 00:28:55 itself of the profile. And it will communicate with other services using messages. By having all these small services communicating together, so this is what we call micro services, communicating together will allow you to build a very large application on a large set of services. But if you want to upgrade a specific version of one of the services, if you want to change the technology of one of the services, if you want to be able to test a new service in parallel, suppose you are in the e-commerce platform and you want to change the technology of one of the services, if you want to be able to test a new service in parallel, suppose you are in the e-commerce platform
Starting point is 00:29:28 and you want to test a new payment page. So you know you have your V1 with a very nice UI with a credit card on PayPal integrations and you want to test something else. You just create a new service that is plugged using the same messages and you can do some A-B testing between the two versions of the service. And then you can remove one, decommission one of them in a very easy way. So what we see at MAPA is people need a platform where it's easy to deploy. And because we can run multiple
Starting point is 00:30:07 services on multiple types of applications, you can run at the top of Mapper most of your microservices because you will be able to use, you can even run MySQL on the Mapper file system. You can store data on files on the file system. You can use a NoSQL database and so on. So each services could store the data in a single platform. When I say single platform, it's not necessarily on the same location physically, but it's to be able to leverage all the securities or applications that you have
Starting point is 00:30:40 because you want to be sure you have the same quality of services for the whole data data stores and also what we see on most one of the way of deploying microservices it's using docker container you will create a new service that is a user profile management that will save data for example in the no sql. So in this case, you deploy the container containing a very small Java application, so either an embedded JT or Vertex, whatever you want to use that communicate with Mapper DBG, for example. And this is your microservices running at the top of Mapper.
Starting point is 00:31:20 The container itself is running, could be deployed, redeployed and so on. And every time you need some different services needs to communicate between them. When you have many hundreds of microservices, you need a very efficient messaging technology. This is where Mapper Streams with using the Kafka API could be used to exchange messages between the different services. And this is why we see microservices being very important for us, for our customer, and why the Converged Data Platform can help. And we have a few customers doing that. We have, for example, a customer in healthcare providing a software as a service on the cloud to allow the doctors and patients to get some information about the different steps
Starting point is 00:32:10 when you have some health process to do. And they use only microservices. Everything running on Mapper Converged Data Platform, using container to deploy new services. So this is a very easy way of developing application. It comes with some new challenges, you know, in terms of how do you manage errors, how do you, because you have to do kind of compensation, so redoing some business transaction, not talking about the database transaction, but really the business side of it, so you have to capture events.
Starting point is 00:32:47 So all this, usually you will see in the microservices application, multiple topics inside a mapper stream or Kafka to not only exchange business messages, but also emit or publish many technical information about SLAs, quality of service exceptions like that you can monitor that in real time and for this you need something very efficient in term of processing and storage. I guess that's the reason why you guys can implement this because the obvious question is well what is it that's special about the MapR platform and the way you do things that
Starting point is 00:33:23 meant that you could introduce this because I think Map is the first of the hadoop in quotes vendors to focus on this is there something particular about the way you do things or the end-to-end control you got over the platform that meant you could do this earlier than others really yes and i will say the main to make it very short as an answer for people, it's usually we are the first platform that has been built on Hadoop initially, but that was built from first day to deal with real-time data stores and applications. If you look at the other solution, you need another cluster or you need to bring another
Starting point is 00:34:07 tool. So what we try to do is provide that in a kind of all-in-one solution. So this is why we call it the convergence. But we don't want to fall on this is an important part. You can use other tools with
Starting point is 00:34:23 MAPAR. you can use everything on mapper you can use tools that have nothing to do with Kafka nothing to do with spark nothing to do with no sequel it's okay we want just to make sure that if you are running on mapper it will be easier for the developer on the system administrator it will be also faster if you use a feature that we provide inside as a platform so faster more reliable more secure okay okay so so I mean just in terms of sort of final thing to talk about with you I mean a topic that's been fairly consistent a lot of the podcast I've been doing
Starting point is 00:34:58 recently and things I've been talking about is I suppose as as hadoop and cloud and data warehousing kind of converge uh i i wonder um to what extent we'll be thinking about kind of that i suppose um things like mapr and hadoop and on premise and so on uh when actually customers are now buying things like kind of you know data warehouses a service one area that i've been working with a lot recently is google bigquery and and you know that's been quite a revelation in the fact that I've been working with a lot recently is Google BigQuery. And that's been quite a revelation in the fact that it abstracts away all the complexity around things like how the data is stored and so on and so forth.
Starting point is 00:35:33 I kind of wonder, looking forward to as things move into the cloud, what's kind of MapR's position on this? And how will MapR, I guess, still differentiate and be relevant and so on as we move into the cloud, people start to converge on vendors like Google and Amazon and so on. What's the kind of story there around MapR going forward, really? So you have two different topics in the question.
Starting point is 00:35:58 One of them, when you talk about BigQuery and data warehouse, all big data vendors are providing a SQL on Hadoop or SQL on everything to be able to either do a data warehouse offload to try to reduce the usage of very expensive data warehouse, not flexible data warehouse in a more flexible way. So this is exactly why you had the discussion with Neeraja around drill, allowing you to query almost everything. And if you look at it from a drill point of view, it's based on Dremel, Dremel being the paper on the architecture
Starting point is 00:36:36 behind big queries. So this is for the data warehouse. So yes, we are part of this. This is one of the many use cases you can run on MapR. Then you have we are part of this. This is one of the many use cases you can run on Mapper. Then you have the discussion around the cloud. And this one is a very interesting topic. And it depends how we want to see it. Sometimes what I say is using BigQuery or using any other services as software, big data software as a service
Starting point is 00:37:08 kind of stuff from cloud vendors, you go into vendor locking. When you start to use BigQuery and all the features of BigQuery, it's sometimes hard to move back. Just get back the data will be expensive. Some of the features will be Google-centric. What we try to do from an API point of view,
Starting point is 00:37:29 we try to make it open based on open source project. In the same time, we are not cloud vendor. We don't have MapR as a service on the cloud, but you can deploy and we have images for all the cloud. So you can build your MapR cluster on Amazon, on Google Compute Engine, on Azure, and leverage everything you want in terms of feature available with MapR.
Starting point is 00:37:58 So if you want to do a data warehouse offload, you will be able to use a file system or the database and do some queries using drill. But the big benefit in this case is if your enterprise is ready to put everything on the cloud, okay, and the enterprise doesn't really care. And I say care, I don't want to tell it's not safe or it's open. I think the cloud are very, very safe. But for example, you say, I don't want to tell it's not safe or it's open. I think the cloud are very, very safe. But for example, you say, I don't want to put the data on Google and I want to be sure that my data must stay in a specific country. When you talk about healthcare, when you talk about private data, you know, user data, for example, in Europe, you have some G and I don't remember the new regulation around the user data you need to be sure the data
Starting point is 00:38:48 are in a specific data center in a specific country so one of the benefits of running mapper on the cloud is you can still have a replication with a mapper cluster running on premise in your own infrastructure and you will be able to replicate the data from one to another, applying some rules. You will say, I want to replicate only these tables or these streams or this column family into the table. So, for example, all the public information that has been anonymized from the on-premise cluster could be replicated automatically to the cloud. On here, you will have more nodes, more elasticity on Google, map on Google or map on Amazon, depending on what you want to do.
Starting point is 00:39:33 So cloud is definitely a big part of what we see today. We have a balance between people say, we want to use really the cloud as a data as a service kind of layer or big data as a service using some services from Azure or from Google directly, or they want to use Mapa or other distribution installed into the cloud to have the liberty and the flexibility to move out of the cloud in a very efficient way.
Starting point is 00:40:04 Good, excellent. Excellent. Well, Tug, I mean, so just to kind of wrap things up, really. So where would people get hold of, where would people download the software? How would they find out as a developer how to learn this technology? What is the kind of equivalent in my old world of kind of OTN and technology networks and so on for MapR? So I would say you have three or four links.
Starting point is 00:40:25 I don't remember how many links I will give you. This is why I said three or four. So one of them that is, I think it's interesting for everybody, people that wants to learn MapR, but also people that want to learn Drill, Spark, Hadoop in general, you go to learn.mapr.com.
Starting point is 00:40:44 It's a free online training. Some trainings are specific to Mapar, but most of the trainings that are related to Apache Project have nothing to do, have nothing specific except sometimes one message to say, this is how you will run that on Mapar. So learn.mapar.com will be for learning big data technology and learning Mapar. Everything free,
Starting point is 00:41:05 only the certification is for a few bucks. And then you have obviously mapper.com, where you have some information, but what I really like, it's to mapper.com slash blog, where we push many articles on industries,
Starting point is 00:41:22 use case, technology, and community.mapr.com that will be similar so the Mapr blog plus the community will be similar with our old OTM website. Yes, yes. Where you have discussion forum, technical articles, interaction with the community. But also this is part of being in a big open source family. Most of the Drill, Spark, Hive, even Kafka, when we talk about architecture or design, most of the people will use Apache mailing list.
Starting point is 00:41:58 Yeah, excellent, excellent. So, well, Tug, thanks very much for this. It's been great to speak to you again. It's been quite a few years, I think since we you worked as the oc4jpm and i was trying to struggle to get oracle 9is running and and so on there so i think we've probably both we've probably both kind of done well moving on from there over time um it's been great to speak to you and um yeah appreciate that and have a good weekend and and take care same thank you
