Drill to Detail - Drill to Detail Ep.12 'Gluent and the New World of Hybrid Data' with Special Guest Tanel Poder
Episode Date: December 6, 2016
Mark Rittman is joined by Gluent's Tanel Poder to talk about Hadoop, the Gluent Data Platform, the coming of the hybrid world and how Hadoop will evolve as it moves into the cloud....
Transcript
Hello and welcome to a very special edition of Drill to Detail. I'm actually at the UKOUG Tech 16 conference in Birmingham, in the hotel room of Tanel Poder, who many of you, most of you probably, will know from his background in Oracle work and so on, but he's now started his own startup called Gluent. Tanel is going to talk to me today about what Gluent is, what the story is, and especially some of his views for the future. So Tanel, great to see you again.
Yeah, great to see you, Mark, as well. Hi, everybody.
Hello. So Tanel, just for anybody who doesn't know you, a little bit of background as to what you've been doing in the past, really, and how you got to this position.
Yeah, so by now I call myself a long-term computer performance geek, right?
And more like 25 years ago or something, I was already working on Unix, even though I
was in high school or somewhere back then. And then I got introduced to Oracle, and I immediately liked it because of its sophistication.
And for the last 20 years, I've done Oracle stuff, right?
And for the last 10 or more years of that, I used to be a consultant, and I flew around the world.
I helped customers with some of their biggest and baddest Oracle databases. I troubleshot them, I fixed performance issues, and I gave them general advice on how to make Oracle better. But in the last two years or so, I've been running Gluent.
Okay so most people know you or a lot of people know you from the Oracle background but I've seen
some of your presentations recently and things you've written, and you're kind of quite into Hadoop now and big data.
So what spurred your interest there, and why did you get interested in that sort of area?
Yeah, I guess the longer story, or the back story, is that seven years ago, if you asked me, hey, I have data, whether it's relational data, images or videos, whatever, where should I put it? Then often my answer, or mostly my answer, was put it in Oracle, because Oracle was the best data management system for so many things, right? People even put images in there and stuff like that, right? But about three years ago, I saw this thing called Hadoop.
I mean, I knew about Hadoop.
I knew about Google's MapReduce.
But it was always something that Yahoo would use on their web logs or Google would use and so on.
But about three years ago, I saw a SQL engine called Impala built by Cloudera and open sourced by now.
And that was a proper C++-based, daemon-based SQL engine.
And that was the first thing, the first indication that, hey, this Hadoop thing seems serious.
And the second indication also about three years ago was that security showed up.
So a lot of these fancy new systems, which are very, very scalable and cheap, they were not enterprise ready.
But now with this engine called Impala and actual proper security throughout the whole system,
I saw that, hey, this scalable and cheap thing called Hadoop is going to be ready for enterprises soon.
And by today, it's very ready.
Okay. So, I mean, I've obviously had the same sort of thought as well. You know, you see Hadoop as kind of the obvious replacement for data warehousing and for a lot of the work that a database like Oracle would do. So is your feeling that this is going to completely take over and replace these kind of old-school databases, or what, really?
When I first read about it and later on when I researched this whole Hadoop thing more,
I saw that it is great for use cases where you are ingesting and querying a lot of streams,
events which happen somewhere else, right?
And not transactional data, really, right?
And that's where, you know, I think the obvious question here for people who also do Oracle is that
would Hadoop replace Oracle or something like that and
as long as we talk about complex transactional systems like ERP systems, Oracle is the king of that, right?
So I think even five years from now, when big companies build more complex systems
where you do complex transactions
and then this needs to be completely online all the time,
then I think Oracle is the king of that, right?
So complex transactional systems,
I mean, I will keep recommending Oracle everywhere, right?
But now everything else, again, events which happen somewhere else, feeds which come in, unstructured data, you know, I would be lying if I said that I wouldn't think that Hadoop takes over. Or, now the cat will come out of the bag and we open a can of worms: or the cloud backends. That's a different story.
So what's the story behind Gluent then?
I remember you've done a few things in the past
around building products and had a few ideas around maybe tuning areas or performance areas and so on.
But what was the story around Gluent?
How did that come about really?
Yeah, so that was an interesting lesson.
So being a performance guy, then the obvious first reaction, my reaction was, hey, I've
got to build a performance tool for Hadoop.
And then we started talking to some customers who already were kind of using Hadoop about
three years ago and then some big telcos and banks as well.
And the idea of a performance tool like a SQL optimization or general performance or
capacity planning tool for Hadoop, it didn't resonate at all.
Nobody cared because nobody even knew what to do with Hadoop, right?
So how do you get the benefit out of Hadoop?
And then we kept talking, and then basically another pattern emerged.
And the other pattern was that, hey, our data volumes are growing even more.
You know, every year people say the data volumes are exploding.
And yes, they do explode.
And the year after, the data volumes explode even more.
And at the same time, your queries need to run even faster.
People want to do things real-time and so on.
That's when I saw that the traditional SAN-storage-based transactional databases will not be able to cope with the modern requirements.
And so, however, on the other hand, we knew, I mean, we've been around enough
that we knew that there's no way that something like Hadoop will take over
your entire application infrastructure and somehow magically all
your code gets ported to Hadoop and it all works, right?
So, you know, when we started pitching Gluent, so basically the story is that Hadoop is here to stay, but your existing applications or existing databases are not going to go away
anytime soon.
So both of these worlds are here to stay and somehow you need to glue them together, right?
Okay.
In a modern enterprise.
And that's why the name Gluent.
Okay.
So who was with you at the start?
What was the kind of the team at the start and what was the kind of timeline really for
building this out?
You know, what kind of core technology did you start with and what problem did you solve
at the start then really?
Yeah, so now I've got to think about three years back, when we started thinking about this. And as I said, what kind of prompted us to do this was that Impala, you know, built by Cloudera, was released. And so I have my co-founder at Gluent, Paul Bridger. We had some startups with him before. Years ago we had something called E2SN, where we built something that was going to ride the virtualization wave, a data center optimization analysis software and so on. But we built the first prototypes of Gluent with Paul Bridger, and we used Impala as a backend and the front end was Oracle. And the simple use case really was, it was a very narrow use case back then,
was that you have a data warehouse, which has seven years of history, right?
And it's too big, it's too expensive, it's too slow because of all this data.
And you have 20,000 reports written on this Oracle-based data warehouse, let's say, right?
And so there's no way you can rewrite this on this magical new platform, but you don't
want to buy more hardware.
You don't want to buy more licenses all the time.
And you would want your queries to be 10 times faster, right?
And so the use case, what Gluent solved, was that, hey, what about putting six and a half
years of history out of seven years into Hadoop, right?
Because if you have a big data warehouse, you know, you have a thousand tables in your schema, maybe only 10 tables are big, right?
So what about offloading 90% of these 10 tables to Hadoop and use Hadoop as a very scalable and powerful extension of your
data warehouse platform, right? And so Gluent will provide this glue between this Hadoop backend and your existing database frontend, so that you wouldn't have to rewrite your reports. So you offload 90% of data away,
and all your 20,000 reports work as they did yesterday.
And actually, they work faster than they did yesterday.
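To make the offload idea concrete, here is a minimal, purely illustrative sketch with invented table names; Gluent's actual objects and syntax may differ. Conceptually, recent data stays in Oracle, older history lives in Hadoop, and a view carrying the original table's name hides the split from the existing reports:

  -- hypothetical hybrid presentation of a SALES table (illustrative only)
  CREATE OR REPLACE VIEW sales AS
    SELECT * FROM sales_recent          -- last ~6 months, still stored in Oracle
    UNION ALL
    SELECT * FROM sales_offloaded_ext;  -- external table over the offloaded Hadoop copy

  -- the existing reports keep running unchanged, for example:
  SELECT product_id, SUM(amount) FROM sales GROUP BY product_id;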
Yeah, so the funny thing is over this weekend
that I managed to announce your website earlier
than actually you planned to,
because I tweeted it, which was kind of funny.
But one of the things I was looking at on there
was trying to understand, I suppose, the product's architecture.
So you've talked about offloading there, you've talked about Impala and so on, so just kind of paint the picture, really, as to what are the components in Gluent at the moment, and how does it do this kind of transfer, allowing you to write Oracle SQL against Hadoop and so on? So what are the key components, first of all?
Yeah, so the key components really are three components. In one end you have an Oracle database. In another end you have a Hadoop cluster with a SQL engine like Impala or Hive on it, because that's what we use for heavy lifting. And the third component in between is the Gluent software. And it actually, if you imagine that between these two worlds, Hadoop and Oracle, you have
two arrows.
One arrow goes towards Oracle and the other arrow goes towards Hadoop.
So we actually have software for both.
So we have a toolset for offloading 90% of your data to Hadoop with a single command.
There's no ETL development and so on.
And the arrow that goes the other direction is our Glue and Smart Connector.
That's where most of our secret sauce lies.
And secret source code as well, of course.
And that Smart Connector is now what gives you
this transparent access to Hadoop.
So that when you run a query in Oracle
on this hybrid schema where 90% of your data is in Hadoop,
then we actually take parts of your execution plan in Oracle.
We don't rewrite SQL.
We take parts of the execution plan in Oracle,
and our smart connector sends these parts of execution plan down to Hadoop.
And we use Impala or Hive on the Hadoop side, which actually does the heavy lifting, right? So we don't have our own SQL engine written on Hadoop. There are plenty of SQL
engines in Hadoop and in the cloud, right? We just provide this sort of data virtualization
layer between these traditional database front ends and these awesome new back ends like
Hadoop.
So that really is where I guess your kind of Oracle heritage comes in, really, isn't
it? In the fact that you can take, you know how Oracle kind of writes SQL, you know how to break it down and so on there.
But you've actually talked about extending this now to SQL Server, Teradata and so on.
So how are you extending this idea from Oracle to these ones as well?
How does the technology translate? Yeah, so first, my Oracle experience and Oracle performance
and internals experience has helped because we know how Oracle works.
So that when we want to build a product which is compatible with Oracle,
then there was much less trial and error.
So we kind of knew what would work and what would not work when we do this integration. So that made things much easier, and we built things faster.
And now with SQL Server, Teradata, and also Postgres, which is
in our plans now because of customer
demand,
obviously there are different technologies
built by different vendors, but
under the hood, fundamentally, everything is the same, right? And one of the fundamental things is that in all major relational databases, you know, you write SQL and this gets compiled to an execution plan, and an execution plan is a tree of operators, right? Some operators read data, some operators join data, some do aggregation, whatever, right? If you imagine this upside-down tree, at the top of the tree is the root, and the tree goes wider as it goes down, and at the bottom you have leaves.
Like in Dremel sort of thing, yeah.
Yeah. And the leaves of the tree are where data access happens, right? And we will take
the bottom leaves of that tree,
and we will offload these leaves to Hadoop, right? So that's how we push some heavy lifting down to
Hadoop while piping a result set back to the rest of the tree of the execution plan. And that's how
you have 100% compatibility with Oracle or SQL Server, because the proprietary stuff like PL/SQL
or some model clause in Oracle, this still happens in Oracle, right?
But everything else happens in Hadoop.
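To make that concrete, here is a hypothetical example with invented names rather than Gluent's actual rewrite. The leaf operations, the scans, filter and join on offloaded tables, can be expressed as a single SQL statement for Impala or Hive, while the proprietary PL/SQL call above them stays in Oracle:

  -- query as the application runs it in Oracle (illustrative only)
  SELECT my_pkg.classify(c.segment) AS seg,     -- PL/SQL call: stays in Oracle
         SUM(s.amount)
  FROM   sales s JOIN customers c ON c.id = s.customer_id
  WHERE  s.sale_date < DATE '2016-01-01'
  GROUP  BY my_pkg.classify(c.segment);

  -- roughly the kind of leaf work that could be pushed down to Impala/Hive:
  -- SELECT c.segment, s.amount
  -- FROM   dwh.sales s JOIN dwh.customers c ON c.id = s.customer_id
  -- WHERE  s.sale_date < '2016-01-01';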
So how does this compare to, say, Big Data SQL then, or PolyBase, or the other vendor initiatives in this area?
Yeah, so it's worth saying exactly that Gluent is not the only one nor the first one who integrates databases with Hadoop. Every big vendor: Oracle has Big Data SQL, Microsoft has PolyBase, Teradata has QueryGrid, IBM has Big SQL. It's actually an interesting story that all these big vendors, who supposedly should feel pretty threatened by Hadoop, are actually embracing this enemy of theirs.
I guess these guys in these big companies, they've all been visionary enough to see that
they better jump on the train of Hadoop than to fight it.
But specifically how Gluent is different. So obviously there are
many different layers how we are different. One major thing really is that we know that every
big enterprise or mid-sized enterprise, they don't only have Oracle or they don't only have SQL
Server or only Teradata, right? So they have many, many silos by different database vendors, right?
And Gluent connects all of them through this Hadoop-based data lake, right?
So we are not building only an Oracle-specific, Oracle-centric tool like Big Data SQL is,
or a Teradata-specific, Teradata-centric tool like QueryGrid is.
So we want to connect all data to all applications.
Oracle, SQL Server, Postgres, even Sybase, because many banks still have Sybase lying around, right?
Okay. Okay. Yeah. I mean, I want to get onto that whole topic of why people might
want to do this in time actually. But one last thing is I noticed you've got something
called Gluent Advisor on the website as well that I had prematurely announced for you yesterday.
So what's Gluent Advisor then? How does that fit into things?
Yeah. So that's an interesting, actually a funny story that one of the customers, like a year ago or even more, we went to talk to them about offloading and, you know, you can take 90% of your data and then cut costs and then make things faster and so on.
And the customer said, yeah, you know, we have like tens of, you know, that business unit had tens of databases, right?
And the owner of this said, hey, man, I don't even know what's going on in these databases,
right?
So do you guys have a tool which would tell me which databases are offloadable and how offloadable they would be?
And we, of course, said, yes, we do.
And then I went back to our development team and they said, oh, crap, we've got to build an advisor tool quickly. And initially it was like a text mode tool, like a script, but now it has evolved into this pretty nice graphical tool,
which basically, you know, you just run it.
It doesn't install any agents or anything like that.
You just run it and five minutes later, you will see that if you have a 100 terabyte data warehouse, that 80 terabyte
of that data is not modified much and it's
not really used for random lookups and so on very frequently, therefore
it's safely offloadable. And whatever is not offloadable, because sometimes we see that only 40% of data can be offloaded, and then we ask why, because that's not what we typically see.
Then you can drill down and you will actually see
that somebody has this crazy batch job
which for some reason goes back five years into history
and modifies everything.
And so it's a tool which gives you an easier view
of how much you could shrink your database.
And it also tells you that if some data is so hot
that you cannot shrink it,
then you can actually see who is causing it to be so hot.
So what's the criteria then for being offloadable?
Is it data that's only kind of read from?
What's the criteria to be offloadable by your tool?
Six months ago, one of the important aspects was that
if data was ever modified, like even
once per month, then we said it's not offloadable, because our product did not allow
updates against Hadoop data, right? But this has changed now. So now we actually allowed you even,
now you can even update data which resides in Hadoop. So you can take your 90% of history, put it into Hadoop, drop it from your Oracle database,
but if you, every end of month or once per day, whatever, you still need to go back and
update some records, now we support that as well.
It's interesting.
So now basically we have this configurable parameter that will just say that, hey, if
you see less than a million modifications
per, you know, week or day or whatever, against the table or against some partitions, we still say that, yep, it's offloadable, because you can still do these updates. So that's the main criteria.
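A rough sketch of the kind of check an advisor like this could make, using Oracle's DBA_TAB_MODIFICATIONS view; the threshold and the logic here are invented for illustration and are not how Gluent Advisor is actually implemented:

  -- tables with few recent changes are candidates for offload (illustrative only)
  SELECT table_owner, table_name,
         inserts + updates + deletes AS recent_changes,
         CASE WHEN inserts + updates + deletes < 1000000
              THEN 'offload candidate' ELSE 'hot' END AS verdict
  FROM   dba_tab_modifications
  WHERE  table_owner = 'DWH'
  ORDER  BY recent_changes;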
And before we get on to the kind of business side of this, really, how extendable or pluggable is that? Because you said you use Impala as the SQL engine there. What about things like Drill or Presto or stuff like that? How much could this, in time, extend to those tools as well?
Yeah, that was an early architectural decision,
which has ended up being beautiful there.
Sounded like Donald Trump there.
Yeah.
So we see Gluent as a data virtualization layer, right?
And you have front ends like Oracle or SQL Server,
and then you have back ends like Hadoop.
And how we get Hadoop to do heavy lifting for us
is that we construct SQL.
We parse the execution plan. We understand
what the query wants to do, and we
construct a SQL statement and we send
it to Impala. And if the backend
happens to be Hive, we added that
later, now we are certified on
Hortonworks as well, and
then
if you connect to Hive as a backend,
then we just construct a slightly
different SQL.
So supporting Drill will be easy.
So we actually have it working in our lab.
So we just haven't fully certified it yet.
And then maybe that's a topic for later, supporting cloud backends like Google BigQuery, Amazon Redshift, and Amazon Athena, which was recently announced. I mean, we could even support MySQL as a backend if you wanted to.
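To give a feel for what "slightly different SQL" per backend might look like, here is a purely illustrative sketch with invented schema and project names; these are not queries Gluent actually generates:

  -- the same leaf operation rendered for two hypothetical backends
  -- Impala / Hive:
  SELECT sale_date, amount FROM dwh_offload.sales WHERE sale_date < '2016-01-01';
  -- Google BigQuery (standard SQL):
  SELECT sale_date, amount FROM `myproject.dwh_offload.sales` WHERE sale_date < DATE '2016-01-01';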
Okay, okay. So let's touch on this idea you said earlier on about,
you mentioned a few times that Gluent is like data virtualization.
So paint the picture really of where you see this starting to be useful for business.
So why would your average business who's invested in, say,
in what we might call old world technologies,
why should they be concerned about data virtualization?
And I suppose kind of like connecting applications together.
Yeah. So there are, I think, two main topics or two main streams here.
One is what we already talked about, basically cost saving, you know, shrinking your database,
putting some stuff into Hadoop, and the data virtualization
layer keeps everything transparent, so that you still can log into Oracle as you did yesterday,
you can still run your same PL/SQL, your reports, as you did yesterday.
And thanks to this data virtualization layer, transparently we push whatever needs to be
pushed down to Hadoop or this backend. So cost saving, archiving, making the database smaller for performance reasons, that's the
first use case where we started from.
And this is not the aha moment anymore.
Often we go and start talking about these topics with the customer and when there are
architects in the room and when we get to the point of
that, hey guys, you don't have to use Gluent only for making this one database smaller, but you can actually use Gluent with Hadoop as your
data sharing platform or
a data hub.
That
you could offload data, you could sync data
from your 10 SQL server databases to Hadoop,
some Teradata database to Hadoop,
and then you would still query the same data in Oracle.
And it looks like the data resides in Oracle.
Actually, in Hadoop, all the heavy lifting,
all the query processing heavy lifting happens in Hadoop.
But how it looks in your Oracle is that it just looks like a regular table to you.
So this is what is the aha moment for the architects when they suddenly see that, hey, if I have 20 databases
in my application constellation, right?
So if I have 20 databases, previously I had to create
all kinds of data feeds, replication, ETL jobs,
just to get data from one silo to another silo, right?
Because they all want to use the same data.
But now there is a paradigm shift.
What about syncing all the data as it's born in your silos,
syncing it straight to Hadoop or this data hub, right?
And Gluent connects this data hub to the rest of the enterprise.
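Purely as an illustration, with made-up table names: once a table from a SQL Server silo has been synced into the Hadoop hub, an Oracle report can join to it as if it were a local table, and the heavy lifting on the synced data happens on the Hadoop side:

  -- Oracle report joining native data with a table synced from another silo (illustrative only)
  SELECT o.order_id, o.amount, t.ticket_status
  FROM   orders o                      -- born and stored in this Oracle database
  JOIN   crm_tickets t                 -- presented here, but synced from SQL Server into Hadoop
  ON     t.order_id = o.order_id
  WHERE  t.opened_date > DATE '2016-11-01';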
Okay, so I've heard that referred to as data fabric before.
Is that kind of idea that, you know, data virtualization, data fabric, that's what you're thinking of really?
Yes, that's where it's going.
And actually, I don't use the term data virtualization that much because it's so overloaded.
It's actually a good point.
There are plenty of vendors, like even NetApp and so on, who talk about data virtualization, while what they really do is storage virtualization.
Yeah.
You know, the problem with storage virtualization is that
if you take an Oracle data file or SQL Server data file,
and you put it into cloud or whatever,
then it's still an Oracle data file.
It's still an Oracle format,
and you have to pay Oracle money to use your own data.
So how would this be different then to, say,
tools that do data federation?
Because, I mean, my background in Oracle BI,
we had to think with the Oracle BI server
that would create its own engine over different data sources.
How would this differ from that kind of thing then, really?
Yeah, the data federation, that's yet another topic.
And obviously, I wouldn't be the CEO of Gluent
if I wasn't able to say that our approach is better, right?
Yeah, yeah.
The data federation tools have been around for a long time.
And those guys who have done Oracle
and who have run distributed queries over database links,
they probably know what I'm saying,
that running distributed queries or federated queries
over DB links,
it works very well if you have 10 rows in one database
and 20 rows in another database,
right? And you join them together. It's magical, right? But now when you think about the real world
in the real modern world, right? And when you want to join a billion rows to
two billion rows in another data source, then this will basically never work, right? Because
you cannot just keep pulling data between the databases and then just join and throw data away or throw these non-surviving
rows away. So, you know, I've seen this for years that these federated
queries don't work with large datasets. So you have to be really careful what
you can actually run and then what you cannot run, right? So the federation
engine can become a bottleneck, right?
So the second problem with the, you know,
if you think about a separate federation engine,
not like Oracle or whatever,
is these federation engines have their own SQL engine, right?
So then you will end up learning a new SQL dialect
and writing apps against this federation engine.
So you cannot run your existing Oracle code anymore and just augment it with some big
data source.
So you have to use a separate engine.
You have to port your application.
So now we end up with two applications.
So we kind of see that what we do is sort of like inverse federation, so that instead of running queries and always pulling data from the silos
into some engine from processing, we offload data,
we sync it right when it's born.
We will sync everything to the Hadoop data lake or data hub, if you will,
and now when you run queries, we will push this query down to Hadoop
where all the data resides.
So whatever data sets you need to join, for example, they have all been, or most of them have been synced to this scalable backend.
And the join happens there, the heavy lifting happens there, and you just get the results back to your data.
Yeah, I mean, I think going back to the point about not having to change the application code, that is the thing, isn't it?
So a lot of data warehouse projects I've seen that are offloaded to Hadoop,
the issue then is changing all your ETL code,
changing all your kind of query code.
And that's not even tackling anything to do with, say, OLTP and that sort of thing, where those applications, you just can't rewrite them, you just don't do that.
I mean, so who do you see within an organization then?
Who do you see?
Who is your customer?
Who typically are the people that get value from this?
And who do you typically have conversations with, then, in companies where you're starting to get traction with this?
The first part is kind of easy that the initial use case for our product was cost saving.
Cost avoidance or cost saving.
That goes all the way up to CFO, right? In some cases, right? So, but often we talk to application owners
where they just wanna,
they don't wanna buy another rack
of some traditional storage array
or another rack of some Teradata or Oracle,
they're done with that, right?
So it's a cost avoidance for
basically application owners.
And often our
discussion, because of my own background as well,
the discussion often starts from DBAs.
We'll see that, hey, there's a cool technology
and we know the guys, they seem to know what they're
doing, and then we just move up there.
And then other business units,
application owners hear about what we just did
and then they come to us.
So it's cost saving, cost avoidance.
And the second angle we already are taking is, again,
I mentioned sometimes you have a business unit owner
who has a constellation of related applications.
And usually when we get in front of their architects, they have the aha moment.
Hey, we could simplify our lives so much.
We don't have to build so many data feeds.
Accessing data.
Accessing data, you know, data is born in one application or it comes in via some feed.
And if some other app needs to access it,
previously it took, I don't know,
two months to provision some additional servers
and another four months to build some ETL
and take data from one silo and put it into another silo.
And then you would continue with your business project, right?
So with Gluent, the architects often have this aha moment of,
hey, if we sync all the data to the data hub,
not only will we make our database smaller and cut cost,
in addition to that, the time to market.
Well, this is the interesting thing, isn't it? I mean, I think going beyond what you're saying there, I mean, we're talking, you know, imagine the conversations you've had at the start with DBAs,
they're with people who know the value of Hadoop
and they know kind of how hard it is to connect these things together.
But really, your market for this goes well beyond that.
And it's actually companies who have to compete with the likes of Netflix
and with Airbnb and all these companies here.
I mean, tell us a bit, you're thinking around that.
I mean, why is this a bigger thing than just kind of, I suppose,
in a way connecting kind of Oracle to Hadoop, really?
Yeah, so, I mean, maybe this is the first, you know,
I listed the two main, you know, targets, you know,
who are interested in our solution.
But the third one really, which should resonate with C-levels and so on, is basically the sexy keyword or the buzzword, digital transformation.
And you have companies like, not to mention Google, but Netflix and Ubers and so on, who are what's called digital native and cloud native, thanks to that as well, so that they are used to doing
things really fast.
You go to Uber and if somebody tells you that it takes two months to provision a server
instead of two minutes, I mean, I think somebody will get fired, right?
And that's the difference.
So it's, you know, companies who use Hadoop only for cost-saving reasons, you know, who migrate or replatform from Teradata to Hadoop for cost-saving, they're only getting a fraction of what this data lake concept and data hub concept can give you.
And it's all about speed of action.
It's all about time to market. And then how we put it is that if you have a big company,
they have tens of thousands of apps and databases.
So these are the biggest ones.
Mid-sized companies may have a thousand apps.
So all these apps are there for a separate purpose,
managed by different business units.
They're there for different reasons.
So you have thousands of silos.
And often data is born in these silos.
And it will continue to do so because they're different apps,
different requirements, different code base, and so on.
So fast forward like 20 years, you will still have a thousand silos.
Most of them may reside in cloud and maybe are like SaaS services,
but you still have a lot of silos, right?
Because of business reasons, right?
So there's no single vendor, single cloud vendor who suddenly takes over everything you do in your company.
So you have these silos where data is born, right?
But in order to compete and you have things like customer 360 and so on, right?
So you have to actually have access to all your data, right? And how do
old school companies do it today? If you want to have access to this extra data source in some
other business unit, it's going to take like nine months to get access, right? Three months for
servers, three months for Informatica installation or whatever, and then people build an ETL pipeline,
whatever, and then nine months later, we might see results and actually continue.
You go to Uber, that will take like three minutes, right?
So, you know, I don't know if they have governance in place.
I hope they have.
You know, you got to go,
you call somebody, you ask access to this data set,
and you're going to have it, right?
So what Gluent aims to do in long term,
we already have begun this,
is we want to connect all data to all applications.
So that whatever data applications you deploy on whatever platform, is it NoSQL or relational
databases, by default, all this data that's ever born in these silos, it's by default accessible to anybody else in the company with the right permissions, right?
Of course.
So if somebody wakes up on a Monday morning and says that, hey, I want to enhance my customer 360 view with this data from Tokyo, then it's just a matter of running a SQL query.
It's just a matter of adding one more query or adding one more virtual table into your report SQL, for example.
And maybe the first time you run it, you will get an error because you don't have the permissions. Then you make a phone call. And five minutes later,
you have that, you can query this data. Thanks to Cluence data virtualization,
there is no ETL development. There is no data loading. There is no pipeline building. It's
just a query. And we will pull the data in from where needed. We will cache it in Hadoop. And we
will also push down heavy lifting, right?
Okay, so this cache in Hadoop, I mean, from my background of BI and data warehousing,
that makes a very interesting kind of base on which to do some very interesting analytics.
I mean, what's your thoughts on going beyond just creating that kind of layer, and this
being useful to people for, say, machine learning and things like that?
I mean, any ideas on that at all?
Yeah, absolutely. So I think maybe how I can explain this is,
let's say you have data scientists in the company
and, you know, there's plenty of research and analysis done
on where data scientists spend their time.
And, like, you know, depending on the report,
some say 70%,
some say 90%,
that most of their time,
data scientists,
they don't spend on the science part.
They actually spend it on data plumbing.
Getting access to the data,
getting it into wherever you want to analyze it,
getting it into right format
and data cleansing
and all this stuff as well.
So data cleansing is less of an issue when the data source is a relational
database, because that database, you know,
takes care of the integrity of the data.
You don't have like missing fields and stuff like that.
So, or garbage data, whatever. Right.
But everything else takes time. Right.
So, and where Gluent comes in is that with Gluent,
we will sync your data from all these silos to this scalable backend like Hadoop, so the data will be there, it will be in a familiar format, it will be in the same data model as on the source system. So you can actually start querying this data right away, you can actually start analyzing the data as you want from day one, as opposed to spending three months getting access to the data and getting
it into the right form. So you can focus on the science part. And the same thing with
machine learning now is, again, in order to do efficient machine learning, you actually
have to have access to this data, right? And what Gluent does right now is, again, we will make sure that all the data you need is synced to Hadoop and kept in sync. And later on, when you do this machine learning, you build some sort of a pipeline which does some event enrichment, perhaps, and you can consume this enriched data while doing it. So in order to enhance your application with machine-learned data, with Gluent, how it looks is that you will only have a few more tables
showing up in your database. So your data is synced to Hadoop, your
machine learning, I don't know, Spark ML or whatever, TensorFlow Spark, whatever you run,
that will obviously happen in Hadoop. You will have to write your magic code, of course,
but the results of this data will be consumable by the same API as your application already uses, right? So you just add one more table in your report, right?
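As a hypothetical sketch of that last point, with invented names: scores produced by a Spark ML job on the Hadoop side surface in Oracle as one more table, so the existing report SQL just gains a join:

  -- illustrative only: churn scores computed in Hadoop, queried from the existing Oracle report
  SELECT c.customer_id, c.region, m.churn_score
  FROM   customers c
  JOIN   ml_churn_scores m             -- written by the ML pipeline in Hadoop, visible here as a table
  ON     m.customer_id = c.customer_id
  WHERE  m.churn_score > 0.8;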
So you posted a link on Twitter a while ago, which I thought was interesting, which was
the Google Goods project.
And it was, if anyone didn't read it, the idea is that Google, you know, Google have recognized, you know, earlier than anybody else,
that part of the challenge with having big data lakes of data is understanding the meaning, the semantic meaning, the kind of table structures and so on.
Any thoughts, I mean, let's look into the future really now, but any thoughts on how we could make it easier for people to understand the schemas coming in, if they are offloading from, say, EBS? Can we introspect stuff at all?
Any thoughts on how we might make that process a bit easier and more automatic, really?
Yeah, so that's an interesting topic, right after this data plumbing: now that you have synced data from a thousand databases, and we actually have one customer who said that they have 25,000 databases, and if they add all the columns together, it's a billion columns, how can we speed up the onboarding of that data and that sort of thing? Our angle is, I don't know, I can't say it's unique, but it sure is nice, because what we do right now is we
sync data from relational databases. We sync structured data
to this backend.
With unstructured data, you immediately have a problem with data cleansing
and garbage. You don't know what is where, what data it is
without this very extensive cataloging.
So with relational data, with the structured data syncing, it's a bit easier. So I guess
the most fundamental thing to say is that as of today, we just sync your data to Hadoop
exactly as it is in the source system. So if the developer is familiar with the Siebel
schema or EBS,
then they will see exactly familiar schema on Hadoop as well.
And some reports, some analytics, you might actually run directly on Hadoop.
So if you want to write new stuff.
But looking towards the future, the immediate next thing, what to do on the data plumbing on these thousand databases, is semi-automated data integration, right? So, and then I guess how you visualize this is,
if you have a data scientist who now logs into Hadoop or logs into this
data lake analysis platform somehow and now they want to drag
and drop things around to build a report. So when they drag a customer from Siebel to a revenue number on EBS,
then somehow we need to figure out how you join these data sets together.
So I'm not going to go too much into details,
but we have some plans for doing machine-learning-assisted, human-in-the-loop, semi-automated data integration.
Okay, okay. So that's all kind of good. And we're looking, again, looking into the future now: is cloud going to make all of this kind of effectively an irrelevant conversation? So, you know, obviously you're talking about linking Hadoop to Oracle, for example, but if we look at a project I'm working on at the moment, or a customer I'm working with, it's all kind of Google BigQuery and so on. What relevance
do you see Gluent having in the future when customers start moving workloads to the
cloud? And it's less about Hadoop then, and it's more about kind of data sitting in the cloud.
Where does Gluent sort of fit in there? What's your vision around that sort of area?
So implementation-wise, before we go to vision,
implementation-wise, in some sense,
we don't care where the data is, right?
So we offload data to a powerful backend
which is accessible via SQL.
And the first choice was Hadoop.
But this Hadoop can be in-house or on-prem
or it can be in the cloud.
Or obviously the next step from there is that
maybe you don't even need Hadoop
there because you have Google BigQuery or as I said in the beginning, Amazon just announced
something called Athena, which is somewhat like Google BigQuery, which allows you to run SQL on
Amazon S3 objects. So you might even not want to run a Hadoop cluster if you don't need all the sophistication and flexibility
in there.
So implementation-wise, we will support a cloud backend as well, because after all,
we just sync data there and we run SQL, we push down SQL to get it back.
So it's almost, and the cool thing is that the customer's front end, the Oracle or SQL Server database, will not even know a difference, because we will translate whatever needs to be translated in this virtualization layer.
So implementation-wise, there is almost no difference for do-it-yourself.
But more looking into future and strategically, then what may happen is that instead of having this one data lake in-house or a few of them,
and instead of having a thousand databases in-house, customers start using cloud services.
So you have BigQuery as the backend, and also some of the databases you will migrate to Amazon RDS or Amazon Aurora.
These are the new database engines.
And another thing is that more of your applications
become SaaS applications, right?
So that you used to have Siebel installed
in some database in-house on a local EMC storage area.
But now when you use Salesforce,
you only have a web API where you log into.
And the data is born there. People type in stuff there. So data is born in a Salesforce
app. You don't have access to Salesforce database, but you can do extracts. You can do real-time
feeds. And I think this is what the future will be about, so that there will be no single cloud.
The future will be also fragmented.
Just like today, you have Oracle now, SQL Server, you have eBusiness Suite for this vendor and SAP and so on.
Ten years in the future, you will still have Salesforce, you have Workday, you are using services by other vendors who don't even exist yet. And then you have your cloud environment,
and you probably will use Google Cloud for some things
and Amazon Cloud for other things,
just for vendor management reasons and competition reasons.
And you still have some old things ticking away, some mainframe in-house, right?
So even if 90% of your stuff is in the cloud,
I think it's going to be a handful of cloud infrastructure vendors,
and you will probably have like 50 or 100 SaaS vendors.
Yeah, yeah. I mean, excuse me, is there a plan to kind of run Gluent as a cloud service at some point then, really? Because, I mean, that would be the logical progression really, wouldn't it?
Yes.
Yes.
And yeah,
so basically data as a service.
Yeah.
And so it's on our roadmap.
I'm not going to tell you where it's going.
I'll announce it before you do that.
Yeah, you will do it like a day before.
And I think how you look at it is that right now, you know, like Salesforce is growing really fast.
And Salesforce is not about CRM only anymore, right?
They have their own machine learning services. They have their own analytics engine, stuff like that.
And you can actually upload, push your in-house data to Salesforce and integrate it
all there, right? But I think it's
somewhat utopia that all your data, all your analytics goes to Salesforce
because it's a single vendor. I mean, all companies have their own needs. So I think there is
always a, I can't even call it a niche,
because niche is too small.
I think there's always a big need for a general purpose processing platform,
you know, cloud platform, where you can put whatever you want in there
and you can run any analytics you want in there.
So no single vendor like Salesforce, Oracle, or SAP can handle everything you need.
No, and a vendor like Salesforce would always have its analytics strategy aligned with, kind of, Salesforce, so they're looking to add analytics to their tools, to their products and so on, not a general thing like you're looking to do, really. I mean, so you mentioned data as a service there. I mean, what's your view on that? I mean, I would have thought that some kind of link out to various kind of DaaS services, or even vendors like who I'm working with now, Qubit, you know, where they have kind of, you know, actual pools of e-commerce click data and so on there. I mean, any thoughts on data as a service as a kind of area as well to link to, maybe?
Yeah.
So, you know, when we talk about data as a service, there are different vendors who talk about data as a service, and some vendors are actually information services,
Yes.
They will provide
your stock data
and, you know,
events from the real world,
whatever.
So we really are not
thinking about that.
No, but you could certainly
have deals with them,
couldn't you,
where you kind of
bring in their data,
maybe,
that sort of thing,
or at least
create connectors
to those sort of services.
Yes.
Yeah.
And what I'm having in mind
for data as a service is really, it's kind of an extension of what we already are doing. But again,
you sync all, you have a big enterprise with 5,000 databases. As data is born in there,
you will sync that to the cloud. Also, you have feeds coming from Salesforce and Workday and so
on. You will sync these to the cloud as well. And this cloud environment is under your control.
It's not a Salesforce or some vendor who controls what can be done, right? So it's still a general
purpose data storage and processing platform. And where Gluent comes in is that we know that if you have 5,000 applications built on existing relational databases,
but mostly they are still relational databases,
if you want to transform your company fast and be able to use any data anywhere and do it quickly,
it cannot be an exercise that, hey, let's rewrite all of these 5,000 apps
and add some REST capability in them or whatever.
So the Gluent idea is, you know, that's where the data virtualization comes in,
is that on all these databases, you will have virtual tables.
So whatever data you want to consume as a service, as a stream maybe,
or just run reports, this will show up in your existing
database as an existing familiar table.
So you don't have to re-engineer all your 5,000 apps.
So I think Gluent's magic is that, yes, you have data as a service in the cloud, but
we will actually connect it all the way to the last mile.
Like in telecom, you have the term last-mile network, you know, from the center of the village to your house, right? So whoever controls the last-mile network is in the position to say how things will work, right?
That's it, that's it, yeah.
So I'm conscious of, it's actually, what is it now, almost three o'clock, so we're both due to speak quite soon. So Tanel, when are you speaking this week? What presentations are you doing? And how do people find out a bit more about Gluent as well while you're here?
Yeah, so about Gluent, just go to Gluent.com.
We will announce the website tomorrow, in Tanel's time; in Mark's time, it's yesterday.
So if you go to Gluent.com, you have plenty of info there.
Navigate around.
We have white papers about the platform and about the advisor and so on.
And you can get in touch.
Just Google my blog at tanelpoder.com and you can send me an email as well.
But regarding the UK OUG presentations, it's a mix.
I have my Gluent hat on.
So on Wednesday, I will talk about extending Oracle data warehouses with Hadoop,
where I will talk about both offloading for cost reasons, but also big data blending, you know, augmenting your existing analytic environment with big data without having to re-engineer everything.
And because of my own background of last 25 or 20 years, on Monday, I will talk about Linux performance tools, just having fun as well and today in an hour we will
talk about in-memory processing for databases so all interesting topics but
on Wednesday it's gonna be, I think, if your podcast goes live on Tuesday...
Yeah, exactly, exactly. Well, thank you. I'm on at the same time as you today, actually, so I think it's like Oasis versus Blur today, in terms of being on at the same time. So Tanel, thank you very much, it's been fantastic speaking to you. As Tanel said, the website is gluent.com and the white papers are there and so on and so forth. Other than that, thank you very much, and yeah, thank you for listening.
Thank you, thanks very much, Mark. Cheers.