Drill to Detail - Drill to Detail Ep.12 'Gluent and the New World of Hybrid Data' with Special Guest Tanel Poder
Episode Date: December 6, 2016
Mark Rittman is joined by Gluent's Tanel Poder to talk about Hadoop, the Gluent Data Platform, the coming of the hybrid world and how Hadoop will evolve as it moves into the cloud....
Transcript
Hello and welcome to a very special edition of Drill to Detail. I'm actually at the UKOUG Tech 16 conference in Birmingham, in the hotel room of Tanel Poder, who many of you, most of you probably, will know from his background in Oracle work and so on, but he's now started his own startup called Gluent. Tanel is going to talk to me today about what Gluent is, what the story is, and especially some of his views for the future. So Tanel, great to see you again.
Yeah, great to see you, Mark, as well. Hi, everybody.
Hello. So Tanel, just for anybody who doesn't know you, a little bit of background as to what you've been doing in the past, really, and how you got to this position.
Yeah, so by now I call myself a long-term computer performance geek, right?
And more like 25 years ago or something, I was already working on Unix, even though I
was in high school or somewhere back then. And then I got introduced to Oracle, and I immediately liked it because of its sophistication.
And for the last 20 years, I've done Oracle stuff, right?
And for the last 10 or more years of that, I used to be a consultant, and I flew around the world.
I helped customers with some of their biggest and baddest Oracle databases. I troubleshot them, I fixed performance issues, and I gave them general advice on how to make Oracle better. But in the last two years or so, I've been running Gluent.
Okay so most people know you or a lot of people know you from the Oracle background but I've seen
some of your presentations recently and things you've written, and you're kind of quite into Hadoop now and big data.
So what spurred your interest there, and why did you get interested in that sort of area?
Yeah, I guess the longer story, or the back story, is that seven years ago, if you asked me, hey, I have data, whether it's relational data, images or videos, whatever, where should I put it? Then often my answer, or mostly my answer, was put it in Oracle, because Oracle was the best data management system for so many things, right? People even put images in there and stuff like that, right? But about three years ago, I saw this thing called Hadoop.
I mean, I knew about Hadoop.
I knew about Google's MapReduce.
But it was always something that Yahoo would use on their web logs or Google would use and so on.
But about three years ago, I saw a SQL engine called Impala built by Cloudera and open sourced by now.
And that was a proper C++-based, daemon-based SQL engine.
And that was the first thing, the first indication that, hey, this Hadoop thing seems serious.
And the second indication also about three years ago was that security showed up.
So a lot of these fancy new systems, which are very, very scalable and cheap, they were not enterprise ready.
But now with this engine called Impala and actual proper security throughout the whole system,
I saw that, hey, this scalable and cheap thing called Hadoop is going to be ready for enterprises soon.
And by today, it's very ready.
Okay. So, I mean, I've obviously had the same sort of thought as well. You know, you see Hadoop as kind of the obvious replacement for data warehousing and for a lot of the work that a database like Oracle would do. So is your feeling that this is going to completely take over and replace these kind of old-school databases, or what, really?
When I first read about it and later on when I researched this whole Hadoop thing more,
I saw that it is great for use cases where you are ingesting and querying a lot of streams,
events which happen somewhere else, right?
And not transactional data, really, right?
And that's where, you know, I think the obvious question here for people who also do Oracle is that
would Hadoop replace Oracle or something like that and
as long as we talk about complex transactional systems like ERP systems, Oracle is the king of that, right?
So I think even five years from now, when big companies build more complex systems
where you do complex transactions
and then this needs to be completely online all the time,
then I think Oracle is the king of that, right?
So complex transactional systems,
I mean, I will keep recommending Oracle everywhere, right?
But now everything else, again, events which happen somewhere else, feeds which come in, unstructured data, you know, I would be lying if I said that I wouldn't think that Hadoop takes over. Or, now the cat will come out of the bag and we open a can of worms: or the cloud backends. That's a different story.
So what's the story behind Gluent then?
I remember you've done a few things in the past
around building products and had a few ideas around maybe tuning areas or performance areas and so on.
But what was the story around Gluent?
How did that come about really?
Yeah, so that was an interesting lesson.
So being a performance guy, then the obvious first reaction, my reaction was, hey, I've
got to build a performance tool for Hadoop.
And then we started talking to some customers who already were kind of using Hadoop about
three years ago and then some big telcos and banks as well.
And the idea of a performance tool like a SQL optimization or general performance or
capacity planning tool for Hadoop, it didn't resonate at all.
Nobody cared because nobody even knew what to do with Hadoop, right?
So how do you get the benefit out of Hadoop?
And then we kept talking, and then basically another pattern emerged.
And the other pattern was that, hey, our data volumes are growing even more.
You know, every year people say the data volumes are exploding.
And yes, they do explode.
And the year after, the data volumes explode even more.
And at the same time, your queries need to run even faster.
People want to do things real-time and so on.
That's when I saw that the traditional SAN-storage-based transactional databases will not be able to cope with the modern requirements.
And so, however, on the other hand, we knew, I mean, we've been around enough
that we knew that there's no way that something like Hadoop will take over
your entire application infrastructure and somehow magically all
your code gets ported to Hadoop and it all works, right?
So, you know, when we started pitching Gluent, so basically the story is that Hadoop is here to stay, but your existing applications or existing databases are not going to go away
anytime soon.
So both of these worlds are here to stay and somehow you need to glue them together, right?
Okay.
In a modern enterprise.
And that's why the name Gluent.
Okay.
So who was with you at the start?
What was the kind of the team at the start and what was the kind of timeline really for
building this out?
You know, what kind of core technology did you start with and what problem did you solve
at the start then really?
Yeah, so now I've got to think about three years back, when we started thinking about this. And as I said, what kind of prompted us to do this was that Impala, you know, built by Cloudera, was released. And so I have my co-founder at Gluent, Paul Bridger. We had some startups with him before. Years ago we had something called E2SN, where we built something that was going to ride the virtualization wave, a data center optimization analysis software and so on. But we built the first prototypes of Gluent with Paul Bridger, and we used Impala as a backend and the front end was Oracle. And the simple use case really was, it was a very narrow use case back then,
was that you have a data warehouse, which has seven years of history, right?
And it's too big, it's too expensive, it's too slow because of all this data.
And you have 20,000 reports written on this Oracle-based data warehouse, let's say, right?
And so there's no way you can rewrite this on this magical new platform, but you don't
want to buy more hardware.
You don't want to buy more licenses all the time.
And you would want your queries to be 10 times faster, right?
And so the use case, what Gluent solved, was that, hey, what about putting six and a half
years of history out of seven years into Hadoop, right?
Because if you have a big data warehouse, you know, you have a thousand tables in your schema, maybe only 10 tables are big, right?
So what about offloading 90% of these 10 tables to Hadoop and use Hadoop as a very scalable and powerful extension of your
data warehouse platform, right? And so Gluent will provide this glue between this Hadoop backend and your existing database frontend, so that you wouldn't have to rewrite your reports. So you offload 90% of data away,
and all your 20,000 reports work as they did yesterday.
And actually, they work faster than they did yesterday.
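To make the offload idea concrete, here is a minimal, purely illustrative sketch with invented table names; Gluent's actual objects and syntax may differ. Conceptually, recent data stays in Oracle, older history lives in Hadoop, and a view carrying the original table's name hides the split from the existing reports:

  -- hypothetical hybrid presentation of a SALES table (illustrative only)
  CREATE OR REPLACE VIEW sales AS
    SELECT * FROM sales_recent          -- last ~6 months, still stored in Oracle
    UNION ALL
    SELECT * FROM sales_offloaded_ext;  -- external table over the offloaded Hadoop copy

  -- the existing reports keep running unchanged, for example:
  SELECT product_id, SUM(amount) FROM sales GROUP BY product_id;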
Yeah, so the funny thing is over this weekend
that I managed to announce your website earlier
than actually you planned to,
because I tweeted it, which was kind of funny.
But one of the things I was looking at on there
was trying to understand, I suppose, the product's architecture.
So you've talked about offloading there, you've talked about Impala and so on, so just kind of paint the picture, really, as to what are the components in Gluent at the moment, and how does it do this kind of transfer, allowing you to write Oracle SQL against Hadoop and so on? So what are the key components, first of all?
Yeah, so the key components really are three components. In one end you have an Oracle database. In another end you have a Hadoop cluster with a SQL engine like Impala or Hive on it, because that's what we use for heavy lifting. And the third component in between is the Gluent software. And it actually, if you imagine that between these two worlds, Hadoop and Oracle, you have
two arrows.
One arrow goes towards Oracle and the other arrow goes towards Hadoop.
So we actually have software for both.
So we have a toolset for offloading 90% of your data to Hadoop with a single command.
There's no ETL development and so on.
And the arrow that goes the other direction is our Glue and Smart Connector.
That's where most of our secret sauce lies.
And secret source code as well, of course.
And that Smart Connector is now what gives you
this transparent access to Hadoop.
So that when you run a query in Oracle
on this hybrid schema where 90% of your data is in Hadoop,
then we actually take parts of your execution plan in Oracle.
We don't rewrite SQL.
We take parts of the execution plan in Oracle,
and our smart connector sends these parts of execution plan down to Hadoop.
And we use Impala or Hive on the Hadoop side, which actually does the heavy lifting, right? So we don't have our own SQL engine written on Hadoop. There are plenty of SQL
engines in Hadoop and in the cloud, right? We just provide this sort of data virtualization
layer between these traditional database front ends and these awesome new back ends like
Hadoop.
So that really is where I guess your kind of Oracle heritage comes in, really, isn't
it? In the fact that you can take, you know how Oracle kind of writes SQL, you know how to break it down and so on there.
But you've actually talked about extending this now to SQL Server, Teradata and so on.
So how are you extending this idea from Oracle to these ones as well?
How does the technology translate? Yeah, so first, my Oracle experience and Oracle performance
and internals experience has helped because we know how Oracle works.
So that when we want to build a product which is compatible with Oracle,
then there was much less trial and error.
So we kind of knew what would work and what would not work when we do this integration. So that made things much easier, and we built things faster.
And now with SQL Server, Teradata, and also Postgres, which is
in our plans now because of customer
demand,
obviously there are different technologies
built by different vendors, but
under the hood, fundamentally, everything is the same, right? And one of the fundamental things is that in all major relational databases, you know, you write SQL and this gets compiled to an execution plan, and an execution plan is a tree of operators, right? Some operators read data, some operators join data, some do aggregation, whatever, right? If you imagine this upside-down tree, at the top of the tree is the root, and the tree goes wider as it goes down, and at the bottom you have leaves.
Like in Dremel sort of thing, yeah.
Yeah. And the leaves of the tree are where data access happens, right? And we will take
the bottom leaves of that tree,
and we will offload these leaves to Hadoop, right? So that's how we push some heavy lifting down to
Hadoop while piping a result set back to the rest of the tree of the execution plan. And that's how
you have 100% compatibility with Oracle or SQL Server, because the proprietary stuff like PL/SQL
or some model clause in Oracle, this still happens in Oracle, right?
But everything else happens in Hadoop.
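To make that concrete, here is a hypothetical example with invented names rather than Gluent's actual rewrite. The leaf operations, the scans, filter and join on offloaded tables, can be expressed as a single SQL statement for Impala or Hive, while the proprietary PL/SQL call above them stays in Oracle:

  -- query as the application runs it in Oracle (illustrative only)
  SELECT my_pkg.classify(c.segment) AS seg,     -- PL/SQL call: stays in Oracle
         SUM(s.amount)
  FROM   sales s JOIN customers c ON c.id = s.customer_id
  WHERE  s.sale_date < DATE '2016-01-01'
  GROUP  BY my_pkg.classify(c.segment);

  -- roughly the kind of leaf work that could be pushed down to Impala/Hive:
  -- SELECT c.segment, s.amount
  -- FROM   dwh.sales s JOIN dwh.customers c ON c.id = s.customer_id
  -- WHERE  s.sale_date < '2016-01-01';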
So how does this compare to, say, Big Data SQL then, or PolyBase, or the other vendor initiatives in this area?
Yeah, so it's worth saying exactly that Gluent is not the only one nor the first one who integrates databases with Hadoop. Every big vendor: Oracle has Big Data SQL, Microsoft has PolyBase, Teradata has QueryGrid, IBM has Big SQL. It's actually an interesting story that all these big vendors, who supposedly should feel pretty threatened by Hadoop, are actually embracing this enemy of theirs.
I guess these guys in these big companies, they've all been visionary enough to see that
they better jump on the train of Hadoop than to fight it.
But specifically how Gluent is different. So obviously there are
many different layers how we are different. One major thing really is that we know that every
big enterprise or mid-sized enterprise, they don't only have Oracle or they don't only have SQL
Server or only Teradata, right? So they have many, many silos by different database vendors, right?
And Gluent connects all of them through this Hadoop-based data lake, right?
So we are not building only an Oracle-specific, Oracle-centric tool like Big Data SQL is,
or a Teradata-specific, Teradata-centric tool like QueryGrid is.
So we want to connect all data to all applications.
Oracle, SQL Server, Postgres, even Sybase, because many banks still have Sybase lying around, right?
Okay. Okay. Yeah. I mean, I want to get onto that whole topic of why people might
want to do this in time actually. But one last thing is I noticed you've got something
called Gluent Advisor on the website as well that I had prematurely announced for you yesterday.
So what's Gluent Advisor then? How does that fit into things?
Yeah. So that's an interesting, actually a funny story that one of the customers, like a year ago or even more, we went to talk to them about offloading and, you know, you can take 90% of your data and then cut costs and then make things faster and so on.
And the customer said, yeah, you know, we have like tens of, you know, that business unit had tens of databases, right?
And the owner of this said, hey, man, I don't even know what's going on in these databases,
right?
So do you guys have a tool which would tell me which databases are offloadable and how offloadable they would be?
And we, of course, said, yes, we do.
And then I went back to our development team and they said, oh, crap, we've got to build an advisor tool quickly. And initially it was like a text mode tool, like a script, but now it has evolved into this pretty nice graphical tool,
which basically, you know, you just run it.
It doesn't install any agents or anything like that.
You just run it and five minutes later, you will see that if you have a 100 terabyte data warehouse, that 80 terabyte
of that data is not modified much and it's
not really used for random lookups and so on very frequently, therefore
it's safely offloadable. And whatever is not offloadable, because sometimes we see that only 40% of data can be offloaded, and then we ask why, because that's not what we typically see.
Then you can drill down and you will actually see
that somebody has this crazy batch job
which for some reason goes back five years into history
and modifies everything.
And so it's a tool which gives you an easier view
of how much you could shrink your database.
And it also tells you that if some data is so hot
that you cannot shrink it,
then you can actually see who is causing it to be so hot.
So what's the criteria then for being offloadable?
Is it data that's only kind of read from?
What's the criteria to be offloadable by your tool?
Six months ago, one of the important aspects was that
if data was ever modified, like even
once per month, then we said it's not offloadable, because our product did not allow
updates against Hadoop data, right? But this has changed now. So now we actually allowed you even,
now you can even update data which resides in Hadoop. So you can take your 90% of history, put it into Hadoop, drop it from your Oracle database,
but if you, every end of month or once per day, whatever, you still need to go back and
update some records, now we support that as well.
It's interesting.
So now basically we have this configurable parameter that will just say that, hey, if
you see less than a million modifications
per, you know, week or day or whatever, against the table or against some partitions, we still say that, yep, it's offloadable, because you can still do these updates. So that's the main criteria.
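A rough sketch of the kind of check an advisor like this could make, using Oracle's DBA_TAB_MODIFICATIONS view; the threshold and the logic here are invented for illustration and are not how Gluent Advisor is actually implemented:

  -- tables with few recent changes are candidates for offload (illustrative only)
  SELECT table_owner, table_name,
         inserts + updates + deletes AS recent_changes,
         CASE WHEN inserts + updates + deletes < 1000000
              THEN 'offload candidate' ELSE 'hot' END AS verdict
  FROM   dba_tab_modifications
  WHERE  table_owner = 'DWH'
  ORDER  BY recent_changes;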
And before we get on to the kind of business side of this, really, how extendable or pluggable is that? Because you said you use Impala as the SQL engine there. What about things like Drill or Presto or stuff like that? How much could this, in time, extend to those tools as well?
Yeah, that was an early architectural decision,
which has ended up being beautiful there.
Sounded like Donald Trump there.
Yeah.
So we see Gluent as a data virtualization layer, right?
And you have front ends like Oracle or SQL Server,
and then you have back ends like Hadoop.
And how we get Hadoop to do heavy lifting for us
is that we construct SQL.
We parse the execution plan. We understand
what the query wants to do, and we
construct a SQL statement and we send
it to Impala. And if the backend
happens to be Hive, we added that
later, now we are certified on
Hortonworks as well, and
then
if you connect to Hive as a backend,
then we just construct a slightly
different SQL.
So supporting Drill will be easy.
So we actually have it working in our lab.
So we just haven't fully certified it yet.
And then maybe that's a topic for later, supporting cloud backends like Google BigQuery, Amazon Redshift, and Amazon Athena, which was recently announced. I mean, we could even support MySQL as a backend if you wanted to.
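To give a feel for what "slightly different SQL" per backend might look like, here is a purely illustrative sketch with invented schema and project names; these are not queries Gluent actually generates:

  -- the same leaf operation rendered for two hypothetical backends
  -- Impala / Hive:
  SELECT sale_date, amount FROM dwh_offload.sales WHERE sale_date < '2016-01-01';
  -- Google BigQuery (standard SQL):
  SELECT sale_date, amount FROM `myproject.dwh_offload.sales` WHERE sale_date < DATE '2016-01-01';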
Okay, okay. So let's touch on this idea you said earlier on about,
you mentioned a few times that Gluent is like data virtualization.
So paint the picture really of where you see this starting to be useful for business.
So why would your average business who's invested in, say,
in what we might call old world technologies,
why should they be concerned about data virtualization?
And I suppose kind of like connecting applications together.
Yeah. So there are, I think, two main topics or two main streams here.
One is what we already talked about, basically cost saving, you know, shrinking your database,
putting some stuff into Hadoop, and the data virtualization
layer keeps everything transparent, so that you still can log into Oracle as you did yesterday,
you can still run your same PL/SQL, your reports, as you did yesterday.
And thanks to this data virtualization layer, transparently we push whatever needs to be
pushed down to Hadoop or this backend. So cost saving, archiving, making the database smaller for performance reasons, that's the
first use case where we started from.
And this is not the aha moment anymore.
Often we go and start talking about these topics with the customer and when there are
architects in the room and when we get to the point of
that, hey guys, you don't have to use Gluent only for making this one database smaller, but you can actually use Gluent with Hadoop as your
data sharing platform or
a data hub.
That
you could offload data, you could sync data
from your 10 SQL server databases to Hadoop,
some Teradata database to Hadoop,
and then you would still query the same data in Oracle.
And it looks like the data resides in Oracle.
Actually, in Hadoop, all the heavy lifting,
all the query processing heavy lifting happens in Hadoop.
But how it looks in your Oracle is that it just looks like a regular table to you.
So this is what is the aha moment for the architects when they suddenly see that, hey, if I have 20 databases
in my application constellation, right?
So if I have 20 databases, previously I had to create
all kinds of data feeds, replication, ETL jobs,
just to get data from one silo to another silo, right?
Because they all want to use the same data.
But now there is a paradigm shift.
What about syncing all the data as it's born in your silos,
syncing it straight to Hadoop or this data hub, right?
And Gluent connects this data hub to the rest of the enterprise.
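Purely as an illustration, with made-up table names: once a table from a SQL Server silo has been synced into the Hadoop hub, an Oracle report can join to it as if it were a local table, and the heavy lifting on the synced data happens on the Hadoop side:

  -- Oracle report joining native data with a table synced from another silo (illustrative only)
  SELECT o.order_id, o.amount, t.ticket_status
  FROM   orders o                      -- born and stored in this Oracle database
  JOIN   crm_tickets t                 -- presented here, but synced from SQL Server into Hadoop
  ON     t.order_id = o.order_id
  WHERE  t.opened_date > DATE '2016-11-01';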
Okay, so I've heard that referred to as data fabric before.
Is that kind of idea that, you know, data virtualization, data fabric, that's what you're thinking of really?
Yes, that's where it's going.
And actually, I don't use the term data virtualization that much because it's so overloaded.
It's actually a good point.
There are plenty of vendors, like even NetApp and so on, who talk about data virtualization, while what they really do is storage virtualization.
Yeah.
You know, the problem with storage virtualization is that
if you take an Oracle data file or SQL Server data file,
and you put it into cloud or whatever,
then it's still an Oracle data file.
It's still an Oracle format,
and you have to pay Oracle money to use your own data.
So how would this be different then to, say,
tools that do data federation?
Because, I mean, my background in Oracle BI,
we had to think with the Oracle BI server
that would create its own engine over different data sources.
How would this differ from that kind of thing then, really?
Yeah, the data federation, that's yet another topic.
And obviously, I wouldn't be the CEO of Gluent
if I wasn't able to say that our approach is better, right?
Yeah, yeah.
The data federation tools have been around for a long time.
And those guys who have done Oracle
and who have run distributed queries over database links,
they probably know what I'm saying,
that running distributed queries or federated queries
over DB links,
it works very well if you have 10 rows in one database
and 20 rows in another database,
right? And you join them together. It's magical, right? But now when you think about the real world
in the real modern world, right? And when you want to join a billion rows to
two billion rows in another data source, then this will basically never work, right? Because
you cannot just keep pulling data between the databases and then just join and throw data away or throw these non-surviving
rows away. So, you know, I've seen this for years that these federated
queries don't work with large datasets. So you have to be really careful what
you can actually run and then what you cannot run, right? So the federation
engine can become a bottleneck, right?
So the second problem with the, you know,
if you think about a separate federation engine,
not like Oracle or whatever,
is these federation engines have their own SQL engine, right?
So then you will end up learning a new SQL dialect
and writing apps against this federation engine.
So you cannot run your existing Oracle code anymore and just augment it with some big
data source.
So you have to use a separate engine.
You have to port your application.
So now we end up with two applications.
So we kind of see that what we do is sort of like inverse federation, so that instead of running queries and always pulling data from the silos
into some engine from processing, we offload data,
we sync it right when it's born.
We will sync everything to the Hadoop data lake or data hub, if you will,
and now when you run queries, we will push this query down to Hadoop
where all the data resides.
So whatever data sets you need to join, for example, they have all been, or most of them have been synced to this scalable backend.
And the join happens there, the heavy lifting happens there, and you just get the results back to your data.
Yeah, I mean, I think going back to the point about not having to change the application code, that is the thing, isn't it?
So a lot of data warehouse projects I've seen that are offloaded to Hadoop,
the issue then is changing all your ETL code,
changing all your kind of query code.
And that's not even tackling anything to do with, say, OLTP and that sort of thing, where those applications, you just can't rewrite them, you just don't do that.
I mean, so who do you see within an organization then?
Who do you see?
Who is your customer?
Who typically are the people that get value from this?
And who do you typically have conversations with, then, in companies where you're starting to get traction with this?
The first part is kind of easy that the initial use case for our product was cost saving.
Cost avoidance or cost saving.
That goes all the way up to CFO, right? In some cases, right? So, but often we talk to application owners
where they just wanna,
they don't wanna buy another rack
of some traditional storage array
or another rack of some Teradata or Oracle,
they're done with that, right?
So it's a cost avoidance for
basically application owners.
And often our
discussion, because of my own background as well,
the discussion often starts from DBAs.
We'll see that, hey, there's a cool technology
and we know the guys, they seem to know what they're
doing, and then we just move up there.
And then other business units,
application owners hear about what we just did
and then they come to us.
So it's cost saving, cost avoidance.
And the second angle we already are taking is, again,
I mentioned sometimes you have a business unit owner
who has a constellation of related applications.
And usually when we get in front of their architects, they have the aha moment.
Hey, we could simplify our lives so much.
We don't have to build so many data feeds.
Accessing data.
Accessing data, you know, data is born in one application or it comes in via some feed.
And if some other app needs to access it,
previously it took, I don't know,
two months to provision some additional servers
and another four months to build some ETL
and take data from one silo and put it into another silo.
And then you would continue with your business project, right?
So with Gluent, the architects often have this aha moment of,
hey, if we sync all the data to the data hub,
not only will we make our database smaller and cut cost,
in addition to that, the time to market.
Well, this is the interesting thing, isn't it? I mean, I think going beyond what you're saying there, I mean, we're talking, you know, imagine the conversations you've had at the start with DBAs,
they're with people who know the value of Hadoop
and they know kind of how hard it is to connect these things together.
But really, your market for this goes well beyond that.
And it's actually companies who have to compete with the likes of Netflix
and with Airbnb and all these companies here.
I mean, tell us a bit, you're thinking around that.
I mean, why is this a bigger thing than just kind of, I suppose,
in a way connecting kind of Oracle to Hadoop, really?
Yeah, so, I mean, maybe this is the first, you know,
I listed the two main, you know, targets, you know,
who are interested in our solution.
But the third one really, which should resonate with C-levels and so on, is basically the sexy keyword or the buzzword, digital transformation.
And you have companies like, not to mention Google, but Netflix and Ubers and so on, who are what's called digital native and cloud native, thanks to that as well, so that they are used to doing
things really fast.
You go to Uber and if somebody tells you that it takes two months to provision a server
instead of two minutes, I mean, I think somebody will get fired, right?
And that's the difference.
So it's, you know, companies who use Hadoop only for cost-saving reasons, you know, who migrate or replatform from Teradata to Hadoop for cost-saving, they're only getting a fraction of what this data lake concept and data hub concept can give you.
And it's all about speed of action.
It's all about time to market. And then how we put it is that if you have a big company,
they have tens of thousands of apps and databases.
So these are the biggest ones.
Mid-sized companies may have a thousand apps.
So all these apps are there for a separate purpose,
managed by different business units.
They're there for different reasons.
So you have thousands of silos.
And often data is born in these silos.
And it will continue to do so because they're different apps,
different requirements, different code base, and so on.
So fast forward like 20 years, you will still have a thousand silos.
Most of them may reside in cloud and maybe are like SaaS services,
but you still have a lot of silos, right?
Because of business reasons, right?
So there's no single vendor, single cloud vendor who suddenly takes over everything you do in your company.
So you have these silos where data is born, right?
But in order to compete and you have things like customer 360 and so on, right?
So you have to actually have access to all your data, right? And how do
old school companies do it today? If you want to have access to this extra data source in some
other business unit, it's going to take like nine months to get access, right? Three months for
servers, three months for Informatica installation or whatever, and then people build an ETL pipeline,
whatever, and then nine months later, we might see results and actually continue.
You go to Uber, that will take like three minutes, right?
So, you know, I don't know if they have governance in place.
I hope they have.
You know, you got to go,
you call somebody, you ask access to this data set,
and you're going to have it, right?
So what Gluent aims to do in long term,
we already have begun this,
is we want to connect all data to all applications.
So that whatever data applications you deploy on whatever platform, is it NoSQL or relational
databases, by default, all this data that's ever born in these silos, it's by default accessible to anybody else in the company with the right permissions, right?
Of course.
So if somebody wakes up on a Monday morning and says that, hey, I want to enhance my customer 360 view with this data from Tokyo, then it's just a matter of running a SQL query.
It's just a matter of adding one more query or adding one more virtual table into your report SQL, for example.
And maybe the first time you run it, you will get an error because you don't have the permissions. Then you make a phone call. And five minutes later,
you have that, you can query this data. Thanks to Cluence data virtualization,
there is no ETL development. There is no data loading. There is no pipeline building. It's
just a query. And we will pull the data in from where needed. We will cache it in Hadoop. And we
will also push down heavy lifting, right?
Okay, so this cache in Hadoop, I mean, from my background of BI and data warehousing,
that makes a very interesting kind of base on which to do some very interesting analytics.
I mean, what's your thoughts on going beyond just creating that kind of layer, and this
being useful to people for, say, machine learning and things like that?
I mean, any ideas on that at all?
Yeah, absolutely. So I think maybe how I can explain this is,
let's say you have data scientists in the company
and, you know, there's plenty of research and analysis done
on where data scientists spend their time.
And, like, you know, depending on the report,
some say 70%,
some say 90%,
that most of their time,
data scientists,
they don't spend on the science part.
They actually spend it on data plumbing.
Getting access to the data,
getting it into wherever you want to analyze it,
getting it into right format
and data cleansing
and all this stuff as well.
So data cleansing is less of an issue when the data source is a relational
database, because that database, you know,
takes care of the integrity of the data.
You don't have like missing fields and stuff like that.
So, or garbage data, whatever. Right.
But everything else takes time. Right.
So, and where Gluent comes in is that with Gluent,
we will sync your data from all these silos to this scalable backend like Hadoop, so the data will be there, it will be in a familiar format, it will be in the same data model as on the source system. So you can actually start querying this data right away, you can actually start analyzing the data as you want from day one, as opposed to spending three months getting access to the data and getting
it into the right form. So you can focus on the science part. And the same thing with
machine learning now is, again, in order to do efficient machine learning, you actually
have to have access to this data, right? And what Gluent does right now is, again, we will make sure that all the data you need is synced to Hadoop and kept in sync. And later on, when you do this machine learning, you build some sort of a pipeline which does some event enrichment, perhaps, and you can consume this enriched data while doing it. So in order to enhance your application with machine-learned data, with Gluent, how it looks is that you will only have a few more tables
showing up in your database. So your data is synced to Hadoop, your
machine learning, I don't know, Spark ML or whatever, TensorFlow Spark, whatever you run,
that will obviously happen in Hadoop. You will have to write your magic code, of course,
but the results of this data will be consumable by the same API as your application already uses, right? So you just add one more table in your report, right?
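As a hypothetical sketch of that last point, with invented names: scores produced by a Spark ML job on the Hadoop side surface in Oracle as one more table, so the existing report SQL just gains a join:

  -- illustrative only: churn scores computed in Hadoop, queried from the existing Oracle report
  SELECT c.customer_id, c.region, m.churn_score
  FROM   customers c
  JOIN   ml_churn_scores m             -- written by the ML pipeline in Hadoop, visible here as a table
  ON     m.customer_id = c.customer_id
  WHERE  m.churn_score > 0.8;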
So you posted a link on Twitter a while ago, which I thought was interesting, which was
the Google Goods project.
And it was, if anyone didn't read it, the idea is that Google, you know, Google have recognized, you know, earlier than anybody else,
that part of the challenge with having big data lakes of data is understanding the meaning, the semantic meaning, the kind of table structures and so on.
Any thoughts, I mean, let's look into the future really now, but any thoughts on how we could make it easier for people to understand the schemas coming in, if they are offloading from, say, EBS? Can we introspect stuff at all?
Any thoughts on how we might make that process a bit easier and more automatic, really?
Yeah, so that's an interesting topic, right after this data plumbing: now that you have synced data from a thousand databases, and we actually have one customer who said that they have 25,000 databases, and if they add all the columns together, it's a billion columns, how can we speed up the onboarding of that data and that sort of thing? Our angle is, I don't know, I can't say it's unique, but it sure is nice, because what we do right now is we
sync data from relational databases. We sync structured data
to this backend.
With unstructured data, you immediately have a problem with data cleansing
and garbage. You don't know what is where, what data it is
without this very extensive cataloging.
So with relational data, with the structured data syncing, it's a bit easier. So I guess
the most fundamental thing to say is that as of today, we just sync your data to Hadoop
exactly as it is in the source system. So if the developer is familiar with the Siebel
schema or EBS,
then they will see exactly familiar schema on Hadoop as well.
And some reports, some analytics, you might actually run directly on Hadoop.
So if you want to write new stuff.
But looking towards the future, the immediate next thing, what to do on the data plumbing on these thousand databases, is semi-automated data integration, right? So, and then I guess how you visualize this is,
if you have a data scientist who now logs into Hadoop or logs into this
data lake analysis platform somehow and now they want to drag
and drop things around to build a report. So when they drag a customer from Siebel to a revenue number on EBS,
then somehow we need to figure out how you join these data sets together.
So I'm not going to go too much into details,
but we have some plans for doing machine-learning-assisted, human-in-the-loop, semi-automated data integration.
Okay, okay. So that's all kind of good. And we're looking, again, looking into the future now: is cloud going to make all of this kind of effectively an irrelevant conversation? So, you know, obviously you're talking about linking Hadoop to Oracle, for example, but if we look at a project I'm working on at the moment, or a customer I'm working with, it's all kind of Google BigQuery and so on. What relevance
do you see Gluent having in the future when customers start moving workloads to the
cloud? And it's less about Hadoop then, and it's more about kind of data sitting in the cloud.
Where does Gluent sort of fit in there? What's your vision around that sort of area?
So implementation-wise, before we go to vision,
implementation-wise, in some sense,
we don't care where the data is, right?
So we offload data to a powerful backend
which is accessible via SQL.
And the first choice was Hadoop.
But this Hadoop can be in-house or on-prem
or it can be in the cloud.
Or obviously the next step from there is that
maybe you don't even need Hadoop
there because you have Google BigQuery or as I said in the beginning, Amazon just announced
something called Athena, which is somewhat like Google BigQuery, which allows you to run SQL on
Amazon S3 objects. So you might even not want to run a Hadoop cluster if you don't need all the sophistication and flexibility
in there.
So implementation-wise, we will support a cloud backend as well, because after all,
we just sync data there and we run SQL, we push down SQL to get it back.
So it's almost, and the cool thing is that the customer's front end, the Oracle or SQL Server database, will not even know a difference, because we will translate whatever needs to be translated in this virtualization layer.
So implementation-wise, there is almost no difference for do-it-yourself.
But more looking into future and strategically, then what may happen is that instead of having this one data lake in-house or a few of them,
and instead of having a thousand databases in-house, customers start using cloud services.
So you have BigQuery as the backend, and also some of the databases you will migrate to Amazon RDS or Amazon Aurora.
These are the new database engines.
And another thing is that more of your applications
become SaaS applications, right?
So that you used to have Siebel installed
in some database in-house on a local EMC storage area.
But now when you use Salesforce,
you only have a web API where you log into.
And the data is born there. People type in stuff there. So data is born in a Salesforce
app. You don't have access to Salesforce database, but you can do extracts. You can do real-time
feeds. And I think this is what the future will be about, so that there will be no single cloud.
The future will be also fragmented.
Just like today, you have Oracle now, SQL Server, you have eBusiness Suite for this vendor and SAP and so on.
Ten years in the future, you will still have Salesforce, you have Workday, you are using services by other vendors who don't even exist yet. And then you have your cloud environment,
and you probably will use Google Cloud for some things
and Amazon Cloud for other things,
just for vendor management reasons and competition reasons.
And you still have some old things ticking away, some mainframe in-house, right?
So even if 90% of your stuff is in the cloud,
I think it's going to be a handful of cloud infrastructure vendors,
and you will probably have like 50 or 100 SaaS vendors.
Yeah, yeah. I mean, excuse me, is there a plan to kind of run Gluent as a cloud service at some point then, really? Because, I mean, that would be the logical progression really, wouldn't it?
Yes.
Yes.
And yeah,
so basically data as a service.
Yeah.
And so it's on our roadmap.
I'm not going to tell you where it's going.
I'll announce it before you do that.
Yeah, you will do it like a day before.
And I think how you look at it is that right now, you know, like Salesforce is growing really fast.
And Salesforce is not about CRM only anymore, right?
They have their own machine learning services. They have their own analytics engine, stuff like that.
And you can actually upload, push your in-house data to Salesforce and integrate it
all there, right? But I think it's
somewhat utopia that all your data, all your analytics goes to Salesforce
because it's a single vendor. I mean, all companies have their own needs. So I think there is
always a, I can't even call it a niche,
because niche is too small.
I think there's always a big need for a general purpose processing platform,
you know, cloud platform, where you can put whatever you want in there
and you can run any analytics you want in there.
So no single vendor like Salesforce, Oracle, or SAP can handle everything you need.
No, and a vendor like Salesforce would always have its analytics strategy aligned with, kind of, Salesforce, so they're looking to add analytics to their tools, to their products and so on, not a general thing like you're looking to do, really. I mean, so you mentioned data as a service there. I mean, what's your view on that? I mean, I would have thought that some kind of link out to various kind of DaaS services, or even vendors like who I'm working with now, Qubit, you know, where they have kind of, you know, actual pools of e-commerce click data and so on there. I mean, any thoughts on data as a service as a kind of area as well to link to, maybe?
Yeah.
So, you know, when we talk about data as a service, there are different vendors who talk about data as a service, and some vendors are actually information services,
Yes.
They will provide
your stock data
and, you know,
events from the real world,
whatever.
So we really are not
thinking about that.
No, but you could certainly
have deals with them,
couldn't you,
where you kind of
bring in their data,
maybe,
that sort of thing,
or at least
create connectors
to those sort of services.
Yes.
Yeah.
And what I'm having in mind
for data as a service is really, it's kind of an extension of what we already are doing. But again,
you sync all, you have a big enterprise with 5,000 databases. As data is born in there,
you will sync that to the cloud. Also, you have feeds coming from Salesforce and Workday and so
on. You will sync these to the cloud as well. And this cloud environment is under your control.
It's not a Salesforce or some vendor who controls what can be done, right? So it's still a general
purpose data storage and processing platform. And where Gluent comes in is that we know that if you have 5,000 applications built on existing relational databases,
but mostly they are still relational databases,
if you want to transform your company fast and be able to use any data anywhere and do it quickly,
it cannot be an exercise that, hey, let's rewrite all of these 5,000 apps
and add some REST capability in them or whatever.
So the Gluent idea is, you know, that's where the data virtualization comes in,
is that on all these databases, you will have virtual tables.
So whatever data you want to consume as a service, as a stream maybe,
or just run reports, this will show up in your existing
database as an existing familiar table.
So you don't have to re-engineer all your 5,000 apps.
So I think Gluent's magic is that, yes, you have data as a service in the cloud, but
we will actually connect it all the way to the last mile.
Like in telecom, you have the term last-mile network, you know, from the center of the village to your house, right? So whoever controls the last-mile network is in the position to say how things will work, right?
That's it, that's it, yeah.
So I'm conscious of, it's actually, what is it now, almost three o'clock, so we're both due to speak quite soon. So Tanel, when are you speaking this week? What presentations are you doing? And how do people find out a bit more about Gluent as well while you're here?
Yeah, so about Gluent, just go to Gluent.com.
We will announce the website tomorrow, in Tanel's time; in Mark's time, it's yesterday.
So if you go to Gluent.com, you have plenty of info there.
Navigate around.
We have white papers about the platform and about the advisor and so on.
And you can get in touch.
Just Google my blog at tanelpoder.com and you can send me an email as well.
But regarding the UK OUG presentations, it's a mix.
I have my Gluent hat on.
So on Wednesday, I will talk about extending Oracle data warehouses with Hadoop,
where I will talk about both offloading for cost reasons, but also big data blending, you know, augmenting your existing analytic environment with big data without having to re-engineer everything.
And because of my own background of last 25 or 20 years, on Monday, I will talk about Linux performance tools, just having fun as well and today in an hour we will
talk about in-memory processing for databases so all interesting topics but
on Wednesday it's gonna be, I think, if your podcast goes live on Tuesday...
Yeah, exactly, exactly. Well, thank you. I'm on at the same time as you today, actually, so I think it's like Oasis versus Blur today, in terms of being on at the same time. So Tanel, thank you very much, it's been fantastic speaking to you. As Tanel said, the website is gluent.com and the white papers are there and so on and so forth. Other than that, thank you very much, and yeah, thank you for listening.
Thank you, thanks very much, Mark. Cheers.