Drill to Detail - Drill to Detail Ep.20 'MapR Platform Differentiation, Scaling Hadoop and Microservices' With Special Guest Tugdual Grall
Episode Date: March 6, 2017. Mark Rittman is joined in this episode by MapR's Tugdual Grall to talk about MapR's platform differentiation and relationship with open-source Hadoop, scaling and streaming, microservices, and MapR's platform strategy around big data workloads in the cloud.
Transcript
My guest on this episode of Drill to Detail is Tugdual Grall, a name some of you from the Oracle world might recognize, and who's now at MapR, working alongside Neeraja from MapR, who came on the show a couple of months ago to talk about Drill and MapR's analytics strategy. So, Tug, nice to speak to you again after so many years, and welcome to the show.
So, thank you for the invitation. And yes, a nice and quick introduction that brings back old memories from our Oracle time, because I left Oracle 10 years ago to move to the open source market, working at eXo Platform, a startup around an open source portal and social network for the enterprise. But after a few years in the role of CTO at a startup, I chose to move back to my roots, meaning I wanted to talk to developers. I wanted my users to be the developers that would use my product to build applications. I was looking for a new opportunity and I moved to NoSQL. I spent a little bit of time at Couchbase, then MongoDB, and for a little less than two years now I have been working as a technical evangelist for Europe at MapR. What brought me to MapR is big data. And, you know, it's a very vague and very large scope, but what was very exciting to me, when I was working on NoSQL back in 2012 and 2013, was that for most of the jobs you had to do when you were processing a very large amount of data, or when you had to integrate multiple data sources and so on, you had to move the data out of the NoSQL database to something else, typically a Hadoop distribution or MapR, to store more data and process more data. This is why I switched to this fantastic world two years ago.
Okay, okay.
So, Tug, what I wanted to talk to you about, really, was this: when we had the podcast episode with Neeraja, we talked about Drill and analytics and that kind of world. But I think what's particularly interesting with MapR is some of the choices that MapR made over the technology you use, some of the proprietary parts of what you do, and I guess where you've diverged from open source Hadoop. Given the background you've got, and the fact we know each other from the Oracle world, I thought it'd be interesting to drill into some of those products in the stack you've got and try to understand, from a developer's perspective, what's different and potentially what's better about the way you do things. So let's start off, Tug, by painting a picture, for anybody that doesn't know the MapR stack, of what the platform looks like and some of the key products in there, and we'll drill into some of the detail of those in a moment.

Yeah, so the product that we build and sell is called the MapR Converged Data Platform. And I think this name, Converged, this adjective, is very important: the convergence of data into a single platform.
This is what the engineering team and the vision of the company have been focusing on: provide the best data platform to build any type of application. And for this, you have to make some choices. How do you store the data? How do you organize the data? And also, how do you ingest, consume, and process the information? Something that was clear when MapR was founded in 2009 was that, for big data, the Hadoop ecosystem was the way to go.
But looking at the Hadoop ecosystem, something was not good enough, at least based on the vision of the founders of MapR: the file system, HDFS. It's not really a file system; it's a simple storage layer allowing you to store very large files and do puts and gets. But what if you want to store a very large number of small files? What if you want to modify a file that is on the distributed file system? It's not possible with Hadoop; it's not made for that. So the idea was initially to build a very powerful file system, allowing you to store any type of file, small or big, with the same characteristics in terms of replication and data locality that you have in the Hadoop ecosystem, but providing more at the infrastructure layer: a different security model, more efficient replication between nodes, and powerful replication between clusters.
Everything that I say about the file system is transparent for any developer, not just Hadoop developers or big data developers. When you work and develop on your laptop, or when you develop in the cloud, you open a file and you save it. You don't care where it is. You expect it to be stored very efficiently, and replicated if possible. This is exactly what the file system gives you. And a very important part, and I will come back to proprietary versus open source, is what the developer needs to know to work on MapR: nothing special.
We leverage open source as an API. What we will say is: we contribute to open source. Drill is a good example, and we also have contributors to other Hadoop ecosystem components. But we also want to be sure that the APIs developers use to work with data run efficiently on the MapR platform. So if you are a Spark developer, you will run with Spark. If you are a MapReduce developer, you work with MapReduce, and you will consume the data from the MapR platform transparently. So for the developer, it's transparent.
Okay, okay. So I guess we're talking about the MapR file system here, and how, I guess, the technology you've got and the approach you took allow you to, as you say, have small files as well as large files, do updates, and so on. So if we look back to Google File System and the genesis of a lot of HDFS, the choices that they took and the optimizations they went for were because they kind of had to: you had to have big files and big blocks to do what they were going to do. So how did MapR manage to have the performance of, say, Google File System and HDFS, but still have this ability to update data, to have small files and so on? How did it manage to do that? What's different about the approach you've got that allows that?

So, I think, well, I was not there at the foundation of MapR, so I will base this on our history, on the DNA that we get into our body when we join the company. The founders of MapR come from different backgrounds. M.C. Srivas, for example, used to work on pure storage: he worked for NetApp, then he moved to Google, working on MapReduce, on the file system, on Bigtable and so on. So he has this very serious enterprise storage background, but is also very advanced on the new vision of data processing and storage for large-scale use cases at Google.
So they really started, with the different engineers, to build a file system from day one, saying: let's take what I know and what works from enterprise-scale storage, where you need to be able to store any type of file, have replication, and so on, but also keep the way the Hadoop distributed processing layer works at the top level, and make it more efficient. So, for example, back then everybody knew that the NameNode of HDFS was a challenge, or at least was a big challenge for many users. It was more or less a SPOF, a single point of failure. Historically, there was also the issue of the way replication is done inside HDFS, with the same size of blocks used for the replication between nodes and for reading the data, so for how you organize the chunks and so on.
MapR-FS, for example, doesn't use a NameNode. We have the concept of a Container Location Database, a transactional distributed system that holds addresses, not of files, but of containers that contain the addresses of files. So we have an indirection allowing us to scale better, allowing us to have more files: more addresses, more small files and so on. And this database, the CLDB, the Container Location Database, was by nature, from day one, built to be distributed, and so replicated, which was not the case for the NameNode initially.
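To make the indirection concrete, here is a toy sketch in plain Java of the two-level lookup described here. The structure is illustrative only; the class, field, and node names are invented, not MapR's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the lookup described above: the central database only maps
// container IDs to node locations, while the file-to-container mapping
// lives inside the containers themselves. All names here are invented.
public class CldbSketch {
    // CLDB: container ID -> node holding a replica of that container
    static final Map<Integer, String> containerLocations = new HashMap<>();
    // held inside containers: file path -> container ID
    static final Map<String, Integer> fileToContainer = new HashMap<>();

    static String locateFile(String path) {
        Integer containerId = fileToContainer.get(path);
        if (containerId == null) {
            return null;
        }
        // One extra hop, but the central database tracks containers
        // (relatively few) rather than every individual file (billions),
        // which is what lets the cluster hold many more small files.
        return containerLocations.get(containerId);
    }

    public static void main(String[] args) {
        containerLocations.put(2049, "node-03");
        fileToContainer.put("/logs/web-2017-03-06.log", 2049);
        System.out.println(locateFile("/logs/web-2017-03-06.log"));
    }
}
```

The design point is the level of indirection: the lookup table that must stay in memory grows with the number of containers, not the number of files.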
Also, you have different block sizes in the file system to make it very efficient. What do you want? You want to have a very large CLDB, multiple gigabytes, to address files very, very fast. You want to have a very efficient block size when you read the data, so you have 256 megabytes by default. When you read the data, when you want to run, for example, a Spark job that has to read large files, you will use this block size. But because you want to be able to update or modify a file, or create a small file, when you replicate the data between nodes for high availability, so when you do the replication between nodes inside the cluster with a default replication factor of three (you can choose how many replicas you want), this size is only 8K, 8 kilobytes, to make it very, very efficient.
So all this together, because it has been designed from day one to be an efficient file system, a real file system allowing any type of operation that you know from a file system, this is why we were able to succeed with this. And also, this is where you see the proprietary side of the software. Everything that touches the file system is written in C and C++, not in Java. Everything has very fast, efficient access to the hardware, to the type of disk you have. So we have a different way of storing data, or at least of doing the I/Os, when you use a very fast SSD drive compared with a classical drive, and so on. All this makes it more efficient, and this is our intellectual property. We have patents around the way we organize the content in the files and so on.
Okay, okay.
So would it then be YARN and MapReduce and Spark and so on running on top of that, the normal open source ones, or is it a particular variant of those from you guys that's proprietary?
No, only the way we organize and store the data is proprietary. The way you access the data is based on the open source way of doing things. For example, I have talked a lot about the file system, and we will talk about the other components, but you will see that it follows naturally from that. How do you access files in Hadoop? You access a file using HDFS commands: hdfs dfs -put, hdfs dfs -get, hdfs dfs -ls, and so on. All these HDFS commands are compliant and work on MapR.
And so everything that has been built, for example a MapReduce job where YARN will look for the location of the data to do dynamic allocation, will work exactly the same way. Because for us, you have the file system, and you have multiple protocols or multiple APIs to access the file system. One of them is HDFS. And this exists so you can migrate, or use the same code that you have in a standard Hadoop project, and run it on MapR. But most of the time, what people will do is directly access MapR using an NFS endpoint, or what we call our POSIX client, the FUSE client, which allows edge nodes that need to process or read the data to have very, very efficient access to the information using standard I/O. So you can connect to the cluster, do ls, use vi to modify your file, and immediately refresh it. So because we use open source as an API, when we expose the data out of our system, all the open source APIs or open source frameworks that run on top will run the same way, if not in a more efficient manner,
depending on the type of job you
do. Okay, so, I mean, we'll get on to MapR-DB and Streams in a second, but in a tangible way, what does this mean as a benefit for a developer and a customer? I get that it's probably more efficient and more scalable and so on, but what does it mean for the customers that are actually using this? Because you've got quite a few. What do they get out of it?

Yeah, so what we have to keep in mind before answering the developer question is that MapR, from a pure ops point of view, from an infrastructure point of view, is usually a lot more efficient. For the same use case, let's talk about a Hadoop use case where you do Hive, MapReduce and all this, you may need 30 percent fewer physical servers, because we are more efficient at manipulating the data. It's also easier to make it highly available, and so on. So this is the upside. For the developer, first of all, it will not change anything; this is one thing. However, it does change the way you want to build applications.
Suppose you want to ingest log files into your file system. You know you have many web servers, and you want to take the log files and push them into the file system, because you will use them to do some jobs with Spark, or some analytics with SQL or MapReduce and so on. Usually, in the Hadoop way of doing it, you will have Flume or that kind of tool that will take the files in pieces and aggregate them to create very large files that you push into HDFS. You can take the same data flow into MapR and it will work; it is totally transparent for developers. But at the same time, it's a lot easier with MapR, because usually what people will do is simply use NFS directly: a mount point on the web server that generates the logs saves the logs directly into the cluster. So you simplify the ingestion process, the data flow to ingest the information into the cluster, with MapR. This is a simplification for the developer.
Besides that, all the Hadoop APIs, the Spark APIs, and SQL with Drill or Hive will work the same way. The way security works will also be based on Kerberos, so the configuration and the way you authenticate to the cluster will be similar to what developers know already. Another little part that is interesting for developers: if you want to manipulate the files, you don't need to use the HDFS API. Just use, as a Java developer, the Java I/O API and save the file into the cluster. It will be automatically saved into a distributed file system.
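As a small illustration of that last point, a developer would write with ordinary java.nio calls and nothing MapR-specific; the NFS/POSIX mount path mentioned in the comments is hypothetical, and this sketch uses a temporary directory as a stand-in for the mount so it runs anywhere.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Plain Java I/O, no HDFS API: on a real cluster the mount point would be
// an NFS/POSIX mount of the distributed file system (a hypothetical path
// like /mapr/my.cluster.com/apps/logs); here a temp dir stands in for it.
public class PlainIoIngest {
    static Path writeLog(Path mountPoint, String line) throws IOException {
        Path logFile = mountPoint.resolve("web-access.log");
        // On the cluster mount, this write would land directly in the
        // distributed file system, replicated by the platform.
        return Files.write(logFile, (line + System.lineSeparator()).getBytes());
    }

    public static void main(String[] args) throws IOException {
        Path mount = Files.createTempDirectory("mapr-mount-standin");
        Path written = writeLog(mount, "GET /index.html 200");
        System.out.println(Files.readAllLines(written).get(0));
    }
}
```

The code path is identical whether the directory is local or a cluster mount, which is the transparency being described.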
Okay, okay. So let's talk about MapR-DB, because I think Neeraja mentioned it at the time when we did the call before. I take it MapR-DB is a similar thing to, say, HBase, and it's a NoSQL database. So tell us a bit about MapR-DB. What's the history of it, and what problems is it solving, really?
So one of the key elements is that the more we build applications today, month after month, the more we have to deal with real-time data, real-time applications, interactive applications. This is one of the reasons why, inside Hadoop, you have, for example, a NoSQL database called HBase, a very successful database, but it's based on HDFS. So it has some flaws in terms of compaction of the data, the way you scale out, and so on. So MapR chose to implement its own NoSQL database. The first step, in MapR 4, was what we call MapR-DB binary. This is a column-oriented NoSQL database that uses the HBase API. So again, for the developer, it's transparent. You are able to use a NoSQL database, with tables organized by column, that is based directly on the file system. Everything we said about scalability and efficiency will be exactly the same with MapR-DB binary.
In addition to that, at the end of 2015, we added MapR-DB JSON, using the same engine and the same file system. Everything is part of the platform: as soon as you install MapR, you have the file system and you have the database available. You don't have another product to install; it's running on the same engine, on the same binaries. MapR-DB JSON is a document-oriented database that uses the same scalability scheme that you have with the file system, but as a document-oriented database, allowing you to store JSON and manipulate JSON documents in an efficient way. And what you saw with Neeraja last time is that you can query files, you can query MapR-DB binary or HBase, and you can query MapR-DB JSON using Drill, doing SQL analytics.
And once again, this is a very important part for developers, because developers will need to manipulate data: do some updates, do some manipulations, aggregations, increment or decrement some values and so on, and modify the structure of the data on the fly. Take, for example, an insurance company: a policy for a car, a policy for a home, and a policy for healthcare may have characteristics that are equivalent, like the policy ID and the name and address of the customer, but many, many things are totally different from one contract to another, because you don't represent the same data. So a NoSQL engine is very useful for that. This is a flexible schema; it's something that many applications need.
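The insurance example can be sketched with plain Java maps standing in for JSON documents; the field names below are invented for illustration, and the point is that a document database such as MapR-DB JSON can hold both shapes in the same table without a schema migration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Two policies share core fields (policy ID, customer) but carry
// type-specific fields, the flexible-schema situation described above.
// Plain maps stand in for JSON documents; all field names are invented.
public class FlexibleSchema {
    static Map<String, Object> carPolicy() {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("policyId", "P-1001");
        doc.put("customer", "A. Smith");
        doc.put("vehiclePlate", "AB-123-CD"); // only makes sense for cars
        return doc;
    }

    static Map<String, Object> homePolicy() {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("policyId", "P-1002");
        doc.put("customer", "B. Jones");
        doc.put("rebuildCost", 250000); // only makes sense for homes
        return doc;
    }

    public static void main(String[] args) {
        // Same logical collection, two different shapes, no ALTER TABLE
        System.out.println(carPolicy().keySet());
        System.out.println(homePolicy().keySet());
    }
}
```

In a relational schema, each new policy type would force a migration; here each document simply carries the fields it needs.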
Then, in the context of a large project with lots of data, you need the scalability and the reliability. This is where MapR-DB binary and MapR-DB JSON come in, and will help the developers and the ops guys. Okay, okay.
So, and there's also, I see, a product called MapR Streams, for streaming ingestion and streaming processing. That's a massively hot area at the moment. So again, why did MapR create its own streams product? What problem is it solving, and what's the story behind it, really?
So, it's the same story as the file system and the database, and I will answer this question about streaming in two steps. Something I didn't mention about MapR-DB: if you take a traditional Hadoop environment, and you want to have HBase running a very intensive workload, with a lot of queries and responses in real time and modification of the data, and at the same time you want to use your Hadoop cluster to do some large analytics with MapReduce, most of the time what you have to do, what you must do, is create two clusters: one cluster to run HBase, and one cluster to run your MapReduce jobs or your analytics on the file system, for example. With MapR-DB, you can run that on the same cluster. There are many configurations you can do around multi-tenancy of the data, tagging of the nodes, and so on. So you have, in a single cluster that is easy to administer and easy to secure, both operational and analytics jobs running against the NoSQL database and the file system.
For the same reason, if you look at the streaming part, what you need from an application is to simply stream messages into the platform and be able to not only move the data from one place to another, but also do some processing, for example with Spark or with Flink. You want to be able to process these messages, but also emit, so publish, new messages as a result. For the same reason, we said we don't want people to have to install another cluster on the side to be able to stream data in and out, because the common practice is to use Kafka and create a Kafka cluster, meaning connect the Kafka cluster to ZooKeeper, have multiple brokers, configure the replication, and so on.
So the idea was, for the same reason we created MapR-DB binary with the HBase API: let's leverage the MapR capabilities of an easy installation, efficient storage, efficient real-time replication between nodes, but also multi-master replication between multiple clusters and multiple geos, and do that for your streams, for your messages.
because what we want with MapR,
we don't want, when it's possible,
we don't want to invent a new API.
It doesn't make sense.
You have so many good API and good programming models
that has been built by the open source community.
So what we do, we leverage this API, and we
simply change the way you discuss with a broker. We don't have
a broker in the sense you are saving data, sending data to the cluster.
Same API, same concept in the way you build your
applications, but at runtime, the way it's executed,
so where the data are saved, replicated, are
different. So in the case
of Stream, you have a few interests, at least we
can directly explain in
very short sentences. One of them will be the speed,
the scalability,
the latency and so on.
So if you have multiple million of message per second,
MapperStream will be a lot more efficient
and more easier to manage than Kafka itself.
But, as I like to say quite often with a smile on my face, if every project had to send one or 10 or 20 million messages per second, MapR would be everywhere. Not everybody needs this scale in terms of messaging, but everybody needs a better security model, one that is shared between the database, the file system, and the topics you have in your streams, or your messaging layer. Because you don't want to have, in one case, an SSL key, and in the other a Kerberos ticket, to authenticate and to say: I can access all of this part of the data. So a common security model is one of the benefits.
Another benefit is the fact that you don't have multiple clusters. But also, and we see this more and more commonly in IoT, we have customers in the oil and gas industry, for example, with drills all over multiple plants, and you want to be able to capture messages, send the messages to the local cluster, for example in a region, and then you need to replicate the same messages to a national cluster, then to a worldwide cluster. Doing that with Kafka is not that easy, because you need to install MirrorMaker, configure it, monitor it, and on top of that you lose the offsets of the various consumers across the different replications, whereas with MapR it's totally built in. So you can publish and subscribe on any of the clusters, and it will replicate in both directions between the different clusters, the same way the database can be used as multi-master between different clusters. So you see, it's the same API, same developer experience, but usually a lot easier to put into production and to configure between the different components you want to work with.
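The publish/subscribe shape being described can be sketched in a few lines of plain Java. This is only an in-memory stand-in for a Kafka or MapR Streams topic (no broker, no persistence, no partitions), but the produce/consume-from-offset pattern is the same one the real APIs expose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for a topic log: producers append, consumers read
// from an offset they track themselves, as in the Kafka model discussed
// above. Real brokers add persistence, partitions and replication.
public class TopicSketch {
    static final Map<String, List<String>> topics = new ConcurrentHashMap<>();

    static void publish(String topic, String message) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(message);
    }

    static List<String> consumeFrom(String topic, int offset) {
        List<String> log = topics.getOrDefault(topic, List.of());
        return new ArrayList<>(log.subList(offset, log.size()));
    }

    public static void main(String[] args) {
        publish("sensors", "temp=21.5");
        publish("sensors", "temp=21.7");
        // A consumer that has already seen offset 0 asks for the rest
        System.out.println(consumeFrom("sensors", 1));
    }
}
```

The consumer-tracked offset is exactly what the transcript says gets lost with MirrorMaker replication and is preserved by built-in cross-cluster replication.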
Okay, okay. So what we've been talking about is effectively MapR doing things the same as, but better than, things already in the platform. But I noticed, looking at your website and at some of the white papers from MapR, that you talk a lot about microservices, and microservices seems to be an increasingly topical, interesting area to do with Hadoop and big data. So tell us a little bit about what microservices are, why they're important in this context, and why MapR is putting a lot of investment and time into this.
Yes, so you have probably seen many things about that already, but microservices are a different way of building applications. In the past we were talking about big monolithic applications, and you and me, we worked in the good times of big Java EE development, where you built big EAR files that contained multiple WAR files, and so on. This was a big monolithic application that was very, very hard to update with new features, or to remove features from, or to change technology in. Suppose you want to switch the user profile in your application from a relational database to a document database, because you need schema flexibility. Doing that in a monolithic application is a nightmare. And big startups like Netflix started to build applications in a new way, by building very small services that are dedicated to one single thing that they do from end to end, for example the creation of your user profile. The creation of the user profile includes not only the UI and the REST API, but also the storage of the profile itself. And it communicates with other services using messages.
Having all these small services communicating together, and this is what we call microservices, allows you to build a very large application as a large set of services. But you can upgrade a specific version of one of the services, change the technology of one of the services, or test a new service in parallel. Suppose you are an e-commerce platform and you want to test a new payment page. You know you have your V1, with a very nice UI with credit card and PayPal integrations, and you want to test something else. You just create a new service that is plugged in using the same messages, and you can do some A/B testing between the two versions of the service. And then you can decommission one of them in a very easy way.
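The A/B idea above reduces to a tiny routing rule. The 10% split and service names below are invented for illustration; in a streaming setup, both versions would simply subscribe to the same messages and the router would decide which response to use.

```java
import java.util.concurrent.ThreadLocalRandom;

// Route a fraction of traffic to the new payment service. The threshold
// and service names are invented; in practice both versions consume the
// same topic and the router only tags or partitions the messages.
public class AbRouter {
    static final double V2_SHARE = 0.10; // 10% of traffic to the new version

    static String route(double roll) {
        // roll is a uniform value in [0, 1)
        return roll < V2_SHARE ? "payment-v2" : "payment-v1";
    }

    public static void main(String[] args) {
        double roll = ThreadLocalRandom.current().nextDouble();
        System.out.println("order-123 -> " + route(roll));
    }
}
```

Decommissioning the losing version then means unsubscribing it from the topic and deleting the service, with no change to the rest of the application.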
So what we see at MapR is that people need a platform where it's easy to deploy. And because we can run multiple services and multiple types of applications, you can run most of your microservices on top of MapR: you can even run MySQL on the MapR file system, you can store data in files on the file system, you can use the NoSQL database, and so on. So each service can store its data in a single platform. When I say a single platform, it's not necessarily the same physical location, but it's to be able to leverage all the security and replication that you have, because you want to be sure you have the same quality of service
for all the data stores. Also, one of the common ways we see of deploying microservices is using Docker containers. You will create a new service, say user profile management, that will save data, for example, in the NoSQL database. In this case, you deploy a container containing a very small Java application, either with embedded Jetty or Vert.x, whatever you want to use, that communicates with MapR-DB JSON, for example. And this is your microservice running on top of MapR. The container itself can be deployed, redeployed, and so on. And the different services need to communicate between themselves; when you have many hundreds of microservices, you need a very efficient messaging technology. This is where MapR Streams, using the Kafka API, can be used to exchange messages between the different services.
And this is why we see microservices being very important for us and for our customers, and why the Converged Data Platform can help. We have a few customers doing that. We have, for example, a customer in healthcare providing software as a service in the cloud, to allow doctors and patients to get information about the different steps when you have some health process to go through. And they use only microservices, everything running on the MapR Converged Data Platform, using containers to deploy new services. So this is a very easy way of developing applications. It comes with some new challenges, you know, in terms of how you manage errors, because you have to do a kind of compensation, redoing some business transactions (not talking about database transactions, but really the business side of it), so you have to capture events. So in a microservices application you will usually see multiple topics inside MapR Streams or Kafka, not only to exchange business messages, but also to emit or publish a lot of technical information about SLAs, quality of service, and exceptions, so that you can monitor everything in real time. And for this, you need something very efficient in terms of processing and storage.

I guess that's the reason why you guys could implement this, because the obvious question is: what is it that's special about the MapR platform, and the way you do things, that meant you could introduce this? Because I think MapR is the first of the "Hadoop", in quotes, vendors to focus on this. Is there something particular about the way you do things, or the end-to-end control you've got over the platform, that meant you could do this earlier than others, really?

Yes, and I will say, to make it very short as an answer: we are the first platform that, while initially built on Hadoop, was built from day one to deal with real-time data stores and applications.
If you look at the other solutions, you need another cluster, or you need to bring in another tool. What we try to do is provide that in a kind of all-in-one solution. This is why we call it the Converged platform. But, and this is an important part, we don't want to force that on you. You can use other tools with MapR; you can use everything on MapR. You can use tools that have nothing to do with Kafka, nothing to do with Spark, nothing to do with NoSQL; that's okay. We just want to make sure that, if you are running on MapR, it will be easier for the developer and the system administrator, and it will also be faster if you use a feature that we provide inside the platform: faster, more reliable, more secure.

Okay, okay.
So, I mean, just as a final thing to talk about with you: a topic that's been fairly consistent across a lot of the podcasts I've been doing recently is, I suppose, as Hadoop and cloud and data warehousing converge, I wonder to what extent we'll still be thinking about things like MapR and Hadoop and on-premise, when customers are now buying things like data-warehouse-as-a-service. One area that I've been working with a lot recently is Google BigQuery, and that's been quite a revelation, in the fact that it abstracts away all the complexity around things like how the data is stored, and so on and so forth. I kind of wonder, looking forward, as things move into the cloud, what's MapR's position on this? How will MapR, I guess, still differentiate and be relevant as we move into the cloud, and people start to converge on vendors like Google and Amazon? What's the story there around MapR going forward, really?
So you have two different topics in the question. One of them, when you talk about BigQuery and data warehousing: all the big data vendors provide a SQL-on-Hadoop, or SQL-on-everything, engine to be able to do a data warehouse offload, to reduce the usage of very expensive, not very flexible data warehouses in a more flexible way. This is exactly why you had the discussion with Neeraja around Drill, allowing you to query almost everything. And if you look at it from a Drill point of view, it's based on Dremel, Dremel being the paper on the architecture behind BigQuery. So this is for the data warehouse.
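[Editor's note: as a concrete illustration of the data warehouse offload pattern Tug describes, here is a small hypothetical sketch, not taken from the episode, of composing a Drill query. Drill can run SQL directly over self-describing files via a storage plugin such as `dfs`; the paths, table, and column names below are invented for illustration.]

```python
# Hypothetical sketch: composing Drill SQL for a data warehouse offload.
# Drill queries files (JSON, Parquet, CSV) in place via a storage plugin
# prefix like `dfs`, so no ETL load into a warehouse is required first.

def drill_offload_query(source_path: str, year: int) -> str:
    """Build a Drill CTAS statement that offloads one year of raw events
    from JSON files into a Parquet-backed table for fast SQL access.
    (Illustrative only; paths and columns are made up.)"""
    return (
        f"CREATE TABLE dfs.tmp.`events_{year}` AS "
        f"SELECT event_id, user_id, event_time, amount "
        f"FROM dfs.`{source_path}` "
        f"WHERE EXTRACT(YEAR FROM CAST(event_time AS TIMESTAMP)) = {year}"
    )

print(drill_offload_query("/data/raw/events", 2016))
```

The same statement could be submitted through any of Drill's interfaces (JDBC, ODBC, or its web console); the point is that the offloaded data never needs a separate warehouse load step.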
So yes, we are part of this; this is one of the many use cases you can run on MapR.
Then you have the discussion around the cloud. And this one is a very interesting
topic. And it depends how you want to see it. Sometimes what I say is: using BigQuery, or any other big-data-software-as-a-service kind of offering from the cloud vendors, you go into vendor lock-in. When you start to use BigQuery and all the features of BigQuery, it's sometimes hard to move back. Just getting the data back will be expensive, and some of the features will be Google-centric. What we try to do, from an API point of view, is make it open, based on open source projects.
At the same time, we are not a cloud vendor. We don't have MapR as a service in the cloud, but you can deploy it, and we have images for all the clouds. So you can build your MapR cluster on Amazon, on Google Compute Engine, on Azure, and leverage everything you want in terms of features available with MapR. So if you want to do a data warehouse offload, you will be able to use the file system or the database and run some queries using Drill. But the big benefit in this case: suppose your enterprise is ready to put everything on the cloud,
okay, and the enterprise doesn't really care. And when I say care, I don't want to say the cloud is not safe; I think the clouds are very, very safe. But suppose, for example, you say: I don't want to put the data on Google, or I want to be sure that my data stays in a specific country. When you talk about healthcare, when you talk about private data, user data, for example in Europe you have the new regulation around user data, I don't remember the name, and you need to be sure the data are in a specific data center, in a specific country. So one of the benefits of running MapR
on the cloud is that you can still have replication with a MapR cluster running on premise, in your own infrastructure, and you will be able to replicate the data from one to the other, applying some rules. You will say: I want to replicate only these tables, or these streams, or this column family in the table. So, for example, all the public information that has been anonymized on the on-premise cluster could be replicated automatically to the cloud, and there you will have more nodes, more elasticity, on MapR on Google or MapR on Amazon, depending on what you want to do.
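[Editor's note: to make the selective-replication idea concrete, here is a purely illustrative Python sketch. This is not MapR's replication API, which is configured on the platform itself; the rule structure, table names, and the `anonymized` flag are invented to show the filtering logic Tug describes.]

```python
# Illustrative sketch of rule-based selective replication (hypothetical,
# not the actual MapR API): only whitelisted tables and column families
# whose rows are marked as anonymized are forwarded to the cloud cluster.

REPLICATION_RULES = {
    # table name -> column families allowed to leave the on-premise cluster
    "patients": {"public_stats"},            # the "identity" family stays on premise
    "web_events": {"clicks", "sessions"},
}

def replicate(table: str, rows: list[dict]) -> list[dict]:
    """Return the subset of rows, trimmed to the allowed column families,
    that the rules permit to replicate to the cloud-side cluster."""
    allowed = REPLICATION_RULES.get(table)
    if not allowed:
        return []  # table not whitelisted: nothing crosses over
    out = []
    for row in rows:
        if not row.get("anonymized", False):
            continue  # private, non-anonymized data stays on premise
        out.append({cf: v for cf, v in row.items() if cf in allowed})
    return out

rows = [
    {"anonymized": True, "public_stats": {"visits": 3}, "identity": {"name": "x"}},
    {"anonymized": False, "public_stats": {"visits": 9}},
]
print(replicate("patients", rows))  # → [{'public_stats': {'visits': 3}}]
```

The design choice mirrors what Tug describes: the on-premise cluster is the source of truth, and only anonymized, explicitly whitelisted slices of the data gain the extra elasticity of the cloud.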
So cloud is definitely a big part of what we see today. We see a balance between people who say, we want to use the cloud really as a data-as-a-service layer, or big data as a service, using services from Azure or from Google directly, and people who want to use MapR or another distribution installed in the cloud, to have the liberty and the flexibility to move out of the cloud in a very efficient way.
Good, excellent. Excellent.
Well, Tug, I mean, so just to kind of wrap things up, really.
So where would people get hold of,
where would people download the software?
How would they find out as a developer how to learn this technology?
What is the kind of equivalent in my old world of kind of OTN
and technology networks and so on for MapR?
So I would say you have three or four links.
I don't remember how many links I will give you.
This is why I said three or four.
So one of them that is, I think, interesting for everybody, people that want to learn MapR, but also people that want to learn Drill, Spark, Hadoop in general: you go to learn.mapr.com. It's free online training. Some trainings are specific to MapR, but most of the trainings that are related to an Apache project have nothing specific, except sometimes one message to say, this is how you would run that on MapR. So learn.mapr.com will be for learning big data technology and learning MapR. Everything is free; only the certification is a few bucks.
And then you
have obviously mapper.com, where
you have some information, but what
I really like, it's to mapper.com
slash blog, where we push many
articles on industries,
use case,
technology, and community.mapr.com
that will be similar so the Mapr blog plus the community will be similar with
our old OTM website. Yes, yes. Where you have discussion forum, technical articles,
interaction with the community. But also this is part of being in a big open source family.
Most of the Drill, Spark, Hive, even Kafka,
when we talk about architecture or design,
most of the people will use Apache mailing list.
Yeah, excellent, excellent. So, well, Tug, thanks very much for this. It's been great to speak to you again. It's been quite a few years, I think, since you worked as the OC4J PM and I was struggling to get Oracle 9iAS running, and so on. So I think we've probably both done well moving on from there over time. It's been great to speak to you, and I appreciate that. Have a good weekend and take care.
Same, thank you.