Drill to Detail - Drill to Detail Ep.39 'Gluent Update and Offloading to the Cloud' with Special Guest Michael Rainey

Episode Date: October 3, 2017

...

Transcript
Starting point is 00:00:00 So my guest on this week's episode of Drill to Detail is Michael Rainey, someone I worked with several years back at Rittman Mead when he headed up our data integration practice at the time. And since then, both he and I have moved on, with Michael moving to Gluent as their technical advisor. So Michael, welcome to the show and good to have you on here. Thank you, Mark. Thanks for having me. So Michael, just give us a bit of a potted history, really, of what you've been doing up until now and how you got involved in data integration and databases and so on, and then
Starting point is 00:00:41 the route into the role you're doing at Gluent at the moment. Yeah, sure thing. I started out as an application developer, like many of us do, and moved into the data warehouse world. And at the time, we had our own homegrown data warehouse system. So we had built up a VB6 app that would generate SQL Server DTS packages. So this is really old technology, and we decided to transition to Oracle Data Integrator. And that's really how I got into the Oracle world. From then, I moved on to work with you and others at Rittman Mead, and did consulting around Oracle data integration, Oracle GoldenGate and Oracle Data Integrator, for about five years. And then, you know, as things
Starting point is 00:01:33 happen, you know, you get contacted by somebody and a great opportunity comes up. And that was what happened with Gluent. Now with Gluent, I mean, technical advisor is sort of a general title. It's a wearer-of-many-hats role, if you will, because it's a startup that's very small, so we do what we need to. But I focus on account management, being a customer advocate, and then also marketing, training, content development, delivery of the training and all of that. And really anything else that's needed. Okay, okay. So it sounds, I mean, that sounds like a role in the place I work currently called a product specialist. That's the kind of role that came, I think, out of Google originally, where you become, as you say, the customer's
Starting point is 00:02:20 advocate, a technical specialist in what you do, and you act as that kind of, I suppose, funnel for the stuff coming from the customer to the company. But it must also be interesting for you, you know, working in product now as well. I mean, you and I both used to work in consulting, and product is different, really. How have you found working in a startup and working at a product company? Yeah, it is really interesting. And, you know, I'm definitely learning a lot just about how a startup works and how a business runs. And also, you know, the types of feedback you get, not just from customers but also from the industry, that help make you pivot a little bit, I guess, as to what you're developing and what you're delivering. I was about to say, one of the things that I found with product is that, you know, typically when we come from consulting, you will do anything for the individual customer. You know, a customer comes to you with a requirement and you will move heaven and earth
Starting point is 00:03:16 to meet that customer's precise requirements. But, you know, certainly when you work for a product company, you've got to balance out what are the requests from customers that are going to help you grow the business, and what are the ones that are going to take you in a route that perhaps isn't strategic. I mean, that's certainly, you know, I found the same with you, really. Yeah, and that's true. It's going from evangelizing somebody else's product, like Oracle's data integration technology, to advancing your own. Yeah, you have to be careful about what you're promising that the product will do, and also, like you said, which features should be in and which should be put on the back burner for a later time. Yeah. So, the reason I wanted to get you
Starting point is 00:04:01 on the show is, first of all, obviously it's good to have you on, and we've known each other for a while, and you've got some great opinions and thoughts on the industry. But also, I had Tanel on the show back, I think it was actually this time last year, at the UK Oracle user group. I remember interviewing him in his hotel room in Birmingham. And we talked then about the Gluent founding story. And we talked about what problem Gluent was trying to solve, and the approach they were taking with this kind of idea of a hybrid workload and environment.
Starting point is 00:04:37 And I noticed that there was an announcement recently, which was that you guys had actually done some stuff in the cloud. And I was particularly interested to come back and get a bit of an update, because first of all, I'm curious to see how you guys are getting on. But I've also got this kind of theory that as workloads like these move into the cloud, this kind of distinction between, you know, what is data warehouse technology and what is big data technology will change. But also, you know, nothing will ever move entirely to the cloud; there'll always be an on-premise workload. And so I was kind of curious to see where you guys were going with this, and what your thoughts were on this as well. So first of all, just for anybody who is new to Gluent, because I imagine a lot of people in the podcast audience would be, just tell us a bit about the basic facts: what is Gluent, the company and the software, and what do you do, and what problem do you solve, and so on? Yeah, sure thing. So Gluent is a
Starting point is 00:05:24 data virtualization software, and that's a broad term, data virtualization, so I'll break it down as to how we effect that. We have two real major components of our Gluent data platform. The first is the offload of data, which is moving data from a relational database into a big data technology like Hadoop. And the reason we came about to do this was we saw that many enterprises were struggling with their storage, or their CPU processing power, and eventually the cost as well, to continue to maintain that and build up more storage and processing power. The other aspect was that data within the enterprise is in all these different types of siloed data stores. So there's relational databases from many different types of vendors, plus big data that's in Hadoop and HDFS. And we saw that that makes it very limiting as to how you can access the data. So offloading from a relational database into Hadoop puts it into an open data format, which can then be accessed by many different engines, not just the big data technologies like Impala or Hive or other SQL-on-Hadoop engines, but even Kafka or other streaming technologies, graph technologies, anything really. The other aspect, which helps us with the data virtualization,
Starting point is 00:07:02 is transparent query, where after you offload that data to Hadoop, you can still access the data as if it never left the relational database. So we don't require any application code rewrites or migrations. It's completely transparent: the data access is virtualized. Okay, okay. I mean, that's a good explanation of it, really. So you're saying there that, first of all, you're offloading data that's being stored in, I suppose, expensive Oracle data warehouses
Starting point is 00:07:39 or kind of Teradata and so on into Hadoop storage, which obviously there's a cost saving there. But then I think, you know, the thing that makes it very interesting is this ability to then carry on with the workload going through the Oracle database. But actually, you know, in the cases where the data's now moved to Hadoop,
Starting point is 00:07:55 for example, you know, it still just kind of transparently accesses that. And what's the kind of underlying, I mean, comparing it to say Oracle Big Data SQL, which, you know, you might know, where, how does it work with gluon you know how what what kind of like sql technology to use on hadoop and how do you achieve this quite magical thing really yeah and just to yeah to to take it a step further i guess a little more detailed the offload process i mean it's that's
Starting point is 00:08:22 something that, you know, you or I could probably write a Sqoop command for and perform an offload from a relational database like Oracle. But behind the scenes, what Gluent is also doing is, well, first, it's putting the data into a storage format that is a compressed columnar format like Parquet or ORC. So it's saving as much storage as you possibly can on the Hadoop side, and also enabling faster data access for your analytics. We're also building the metadata around that table. So we're taking the Oracle table structure, or relational database table structure, and putting that into Hadoop as well. And we do a lot of the data type translations,
Starting point is 00:09:15 because SQL on Hadoop doesn't have the same data types as, you know, an Oracle relational database. So we have to do all of that behind the scenes for you. Plus, if you have stats computed in Oracle, we can move those across to Impala, say. So there's a lot more going on there. On the transparent query side, you can offload all of your table or a portion of your table.
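To make the offload idea concrete, here's a minimal PySpark sketch of the kind of movement being described: pull the cold slice of a relational table over JDBC, adjust the data types, and land it as compressed Parquet on HDFS. This is an illustration of the concept only, not Gluent's actual tooling (which also carries table metadata and optimizer stats across automatically); the connection details, table, and column names are all hypothetical.

```python
# Conceptual offload sketch; not Gluent's implementation.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format

spark = SparkSession.builder.appName("offload-sketch").getOrCreate()

# Read only the inactive slice of the table (here, orders older than Q3 2017).
df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  # hypothetical host
      .option("dbtable",
              "(SELECT * FROM sales.orders "
              " WHERE order_date < DATE '2017-07-01') t")
      .option("user", "offload_user")
      .option("password", "...")
      .load())

# One of the type translations done behind the scenes: Oracle NUMBER columns
# surface as wide DECIMALs; narrow them to types Impala/Hive handle well.
df = df.withColumn("order_id", col("order_id").cast("bigint"))

# Add a partitioning column and write compressed Parquet to HDFS,
# so downstream queries can prune partitions.
(df.withColumn("order_month", date_format(col("order_date"), "yyyy-MM"))
   .write.mode("append")
   .partitionBy("order_month")
   .parquet("hdfs:///data/offloaded/sales_orders"))

# Afterwards: CREATE EXTERNAL TABLE ... STORED AS PARQUET in Impala/Hive,
# then COMPUTE STATS so the SQL engine can plan well.
```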
Starting point is 00:09:45 So maybe you have 10% of the active data still remains in your relational database and you move the other 90% off to Hadoop. We have the ability when we run a query against this, we call it a hybrid table now, we take a look at the execution plan itself and determine which lines of that execution plan we can push down into Hadoop for the processing and save that processing power for a technology that is built for that. So you have this massive Hadoop cluster that has parallel processing across all of its nodes
Starting point is 00:10:29 and is made for that type of work. Okay, so how much, I mean, if you took maybe sort of like an EBS type sort of database, I mean, I'm not saying necessarily that one, but you took a kind of like ERP type sort of system, I mean, typically, you know, what percentage of data would you typically expect to be able to offload you know to hadoop and and how much of the how much of the kind of the transactional workload in terms of you know coverage of queries or whatever could you could you offload you think
Starting point is 00:10:56 yeah it's it's going to depend um you know and we have a a tool that will help us determine that you know it's called a gluon advisor Advisor, but you can run that against your relational database. It's just some SQL scripts that take a look at the usage of the data and see what's more active, what's being updated more often. And then we'll decide you want to offload. Most of the time, you don't want to offload active data because as you know, you know, HDFS is not, is append only. We do have the ability to go back and update offloaded data. So if you do happen to change some data later on and then you just rerun and offload and it will perform that update and essentially mimic an update
Starting point is 00:11:45 within hdfs uh so so it depends on you know how how active the data is okay all right good and just as a kind of a transparency just to some people might know that i did i worked a little bit of time um over in gluint last year hence uh hence kind of having this unnaturally kind of like good knowledge of uh of kind of how your product works. Actually, I worked there, as you know, Michael, I worked there for a couple of months just as I was leaving Ritman Mead and so on. So it's interesting to see how you guys have got on, really, and I've always been interested in where it's going.
Starting point is 00:12:17 And on that point, so when I was there and I left there, the technology to extend that to other database of database types was kind of being developed. Give us an update on how that's going and what was the, again, what was the kind of point of that and the purpose and so on? Yeah, yeah. So we've talked about Oracle a couple of times already. And as you know, others might not know, is the founders and most of us within the company are from an Oracle background, as you are. So we all decided it was best to start with Oracle first because we know it so well. of that is we decide, you know, after getting into the Oracle work and understanding what's
Starting point is 00:13:05 going on there and building the product, we determined we've probably done the most complex relational database that we could as our first start. So not to say that something like SQL Server isn't a complex database, but, you know, it is., otherwise this would have been done already. But we know that we have the appropriate pattern and have built out the product once. So we know the process we need to go through to replicate it against other databases. So that's kind of the history of where we started. Right now we have SQL Server in production as well. And then due to some interesting customer demand, we have Netiza in the works right now,
Starting point is 00:13:53 so a Netiza offload. And that's due to the end of life and need for these large enterprises to migrate off of Netiza into something else. So like I said, we've got this customer demand that just kind of came out of nowhere. And that's the interesting one, isn't it? To what extent you follow that?
Starting point is 00:14:11 Yeah, yeah. So we started working with one company, and then before you know it, we have several other requests. And so that's in the works with that one company, and we'll continue on with others. Okay. So the way I understand that is you can do the offloading from those databases, but the actual query translation is always still Oracle at the moment.
Starting point is 00:14:31 So you could actually run Oracle SQL on Matiza, or would it be on the data you've offloaded from Matiza? How would that work? Yeah. So if we, for example, this customer we're working with, it's a large financial institution that initially approached us with the need for the Natiza offload to Hadoop. So they also have, in fact, they also have a SQL Server set of databases
Starting point is 00:15:00 that they want to offload as well. Once they got into our pilot process and understanding how Gluent works, they saw that they could actually offload both data sets and then now they have them in one common location that can actually join things together and
Starting point is 00:15:18 transform the data and actually generate their analytics out of it. So imagine this, you offload the data from Netiza, from SQL Server. If you enrich the data with some Spark SQL or something like that within Hadoop, you can then, whatever table that exists within Hadoop, let's say it's in Impala, you can actually present that table. That's what our transparent query product is called, is Gluent Present. You can present that to an Oracle relational database
Starting point is 00:15:54 as if it lived there. So that's where another really big, powerful use for our product is that present-only sort of approach. And we have several customers that are just using it that way. Okay. Okay. So how did, I mean, on the last point on this, really, how did, and actually I'm confessing I never quite understood this when I was there, actually, how did you manage to get Impala to be as functional as Oracle SQL?
Starting point is 00:16:23 So when you kind of run the whole range of, and you and I know the kind of the weird and wonderful things you can run in Oracle, like model clauses and so on. How did you manage to kind of get that to then work on Impala, really? What was the kind of, I suppose this is where Tanel's knowledge of Oracle internals comes in, but what was the kind of the solution there? Yeah, and to give a little background
Starting point is 00:16:46 on on why i mentioned impala is you know along with our our knowledge and and starting with oracle as the original source for the the product we you know we we quickly realized that you know we can't be everything to everyone so we as you, Hadoop has several vendors, vendor distributions. With each vendor distribution, you get a different set of technologies. So Cloudera, you get the Impala SQL engine. Hortonworks, you would have Hive as your SQL engine. So we decided, well, we've got plenty of pipeline with the Oracle offload, and it so happens to be that almost all of those customers were Cloudera customers as well. So we took our focus down the Cloudera path. But now we have MapR in production, and then Hortonworks and Amazon EMR are both coming very soon.
Starting point is 00:17:40 So as far as a target goes, we're coming right along with those. But the way that works, I mean, with this transparent query access, I mean, one of the questions that people often ask is, you know, is it just a query pass-through? You know, how do you get it to work like that? Well, Oracle could have, well, we know it has many different analytic functions that just don't work in Impala or Hive. And there are even queries that have correct syntax that haven't been built out in that SQL engine. And so there's a lot of, you know, if you think about Oracle as a database, it turned 40 this year, right? So I think Impala is maybe three or four years old. So there's a lot of experience built up there.
Starting point is 00:18:35 What Tunnell and others have created and actually have patented now is the ability to, as I mentioned earlier, read that execution plan out of the database. And each line, it determines what can be pushed down into Impala. And we do that query translation on the fly. So one of the aspects, as I mentioned earlier, was the ability to translate data types from Oracle to Impala during that offload process when you're building up the table. So we need to make sure that the data types are correct. We're going to store the data correctly. But we also can push that query down, whether it's a join or a filter or aggregates.
Starting point is 00:19:25 We push all of that work down. Okay, okay. So far, I mean, up until the point of the cloud stuff in a moment we'll talk about, I mean, who has been the kind of the buyer of this really? I mean, so you say you act in account management as well and so on. Who typically within the organization is buying this? And what kind of organization is kind of buying into this kind of like approach really i'm just curious yeah yeah so we've we've really focused on some of the larger enterprises and you know fortune 500
Starting point is 00:19:55 fortune 1000 you know and that's where we we find the uh you know these these types of large data sets that are causing trouble on the relational database that can benefit from being offloaded and pushing work down to Hadoop. So, I mean, there's several case studies out on our website. And I can mention one company that they're called Vistra Energy and they're based out of Dallas, Texas. They're a large power company down there. And they originally came at it with, you know, looking at Gluent with the cost savings in mind, you know, trying to offload from their Exadata, Oracle Exadata machine into Hadoop. But quickly they realized they could do a lot more. They have a Hadoop cluster with smart meter data coming in.
Starting point is 00:20:58 So this sort of IoT that's coming from the consumer usage of the energy. They also have transcribed customer support calls that's coming from the consumer usage of the energy. They also have transcribed customer support calls so they can keep track of how happy or frustrated you are with their service. So they can now take this information and present it from Hadoop into their customer ERP system. And now they've kind of put these additional pieces of the customer 360 puzzle together.
Starting point is 00:21:30 And they can do some interesting analytics now where they, you know, one of the more recent bits of analytics that have come out of this exercise was, so they have a product called uh free nights so basically if you use energy a certain way uh they'll they'll offer you free nights free free energy overnight so they can now they can now tie all of this usage information in with with their customer information and and then present either you know maybe they'll send you a flyer or send you an email or present uh this option on the on
Starting point is 00:22:12 the website when you log in so so they're they're they're able to offer their customers a better product uh just by using this technology okay that's kind of what i was thinking i mean i i think that you know again sort of i spoke tanel um this time last year you know that the the obvious appeal of a technology like this is to save money really but but it's quite you know my experience is quite hard people to make a big change in their technology just to save money um especially if you're if it's a technical sale but it was really the the i suppose the additional kind of options that are now open to you and the fact that you've got your data in this centralized place you know you've got these more open formats
Starting point is 00:22:49 and it's more what you can do with it from that point onwards that is the real kind of appeal really and and you know almost this cost saving is is is like a conversation starter and a bonus but it's not the real reason you would do this and it's interesting to see that's kind of how it's worked out for you so um that's kind of good yeah i mean it's good to see it's worked out yeah it's it's it always started as the the foot in the door right uh but and really we've we've learned that that that isn't the only way to to to really you know begin speaking with with these enterprises. You ask who might be interested in an organization. It's going to be more of the architects and, of course, the CIO, those types of folks.
Starting point is 00:23:36 When we start talking with the database professionals, they really want to just know how it works. How did you do that? Are you doing it correctly? So, yeah, so that, yeah, that, and so we want to, you know, get to the folks that really are interested in the entire enterprise, you know, data architecture. Okay, okay, so let's move on then.
Starting point is 00:23:57 So the thing that, so the thing that prompted me to drop your line, just say if you're interested in coming on the show, was the announcement about Gluing and the cloud. So just tell us a bit about what that is at the start what have you done um and and what's the kind of headline features and we can drill a little bit into what that means afterwards okay yeah yeah so we you know the i think the blog post you're referring to is a product called cloud sync and that's and it's I guess, a component of the entire Gluon data platform.
Starting point is 00:24:27 So this started out really as a backup and restore service. So we're leading enterprises and leading them to offload their data into Hadoop. And we realized this would be good to offer them something, you know, another service that could back up that HDFS data, those files in HDFS, off into the cloud, into a storage service like, you know, an object store like S3 or Google Cloud Storage, whatever it may be. So we started with that. And then we really quickly realized, you know quickly realized there's more potential to this. The first one is if you're offloading or backing up your data lake from HTFS, that's maybe on-premises, off to the cloud, now you have this backup data that can be used for additional analytics.
Starting point is 00:25:21 It doesn't just have to sit there, you know, static and be used. So we have, you know, if you think of putting the data in Amazon S3, you have, now you have Amazon Athena, you could run queries against it. You have, you know, the latest, you know, the new released AWS Glue, you can perform transformations and use that data to enhance it. And then, you know, of course, if you have any other machine learning tasks or very processing-intensive tasks that you just want to ramp up a cluster, a Hadoop cluster for, and then process the data and then spin it down,
Starting point is 00:26:00 you could do that. So once the data gets out into this open storage format, then there's so many different ways you can access it. The other bit we realized was we could take the same sort of data virtualization innovation that we've done for the relational database and Hadoop, where we offload to Hadoop and then, you know, you virtualize the data access. We could do the same for Hadoop and this cloud storage. So you could offload or, you know, essentially back up a portion of your HDFS data into the cloud and then access it from a single Impala table. So we have that same sort of paradigm for HDFS in the cloud as we do with our current data virtualization. So is that effectively like doing query translation for Impala then?
Starting point is 00:27:03 So Impala can run its queries against object storage in the cloud as well? Exactly, yeah. Ah, interesting, interesting. So which clouds are you currently supporting? Is it kind of, you know, obviously Amazon, but are you looking to do Oracle and Google and so on? What's your thoughts on that? Yeah, yeah.
Starting point is 00:27:21 It's currently Amazon S3, and it's not much of a stretch to go to the others. I mean, really, it's the cloud store. So the way things work now is the customer demand will drive those features. So if we had someone come out and say, okay, look, I'm really interested in this, but we use Google Cloud Storage, then that's the direction we would go if it made sense. Okay. So what about, I mean, so would you see that as being, I mean,
Starting point is 00:28:00 so data from Oracle could end up into Hadoop, and then it would go into cloud, or could it go straight from Oracle into the cloud? I mean, how flexible is that? Yeah, if you offload into Hadoop and then back up to the cloud, that would be the way, yeah. I mean, ultimately the vision is to just run a query and Gluent helps you get the data returned. And that's, so I. And that's the end goal is limit the data movement
Starting point is 00:28:31 and virtualize the data access so you don't have to change your applications to be able to get to that same data. So do you think, and one of the things I often think is that Hadoop is still quite a complex technology on premise. And, you know, having worked with recently things like BigQuery and Athena and so on, you know, you can see the advantages of a kind of a no-op state of warehouse running in the cloud and how it gets, how it allows you to have this kind of scale, but without any of the kind of the, very little of the infrastructure work. I mean, do you still think there's going to be a role for Hadoop on-premise going forward, or is that all going to move into the cloud, do you think?
Starting point is 00:29:09 Where do you think this is going technology-wise? Yeah, yeah, that's interesting because when I started with Gluon at the beginning of the year, I think that's around the time that about three or four analysts wrote Hado is dead articles or hadoop is dying or you know that sort of thing like it's it's like the it's like the sequel is dying articles that come out you know every once in a while sql is is going away we know it's not going away it's it's actually being implemented in every technology you can think of so So with Hadoop, I mean, there's still the potential for something like HDFS to have a place in an enterprise.
Starting point is 00:29:58 So you mentioned that HDFS will still be relevant, and yet a lot of people are saying, John Pierre Dykes last week was saying that you know object store will take over from hdfs yeah do you think h do you think hdfs will stay as it is or do you you know what's your what's your thoughts on the demand for that and use for that really yeah there's so many different different technologies that use the the hadoop drivers and and and have the ability to access that data. I mean, it's, so the big thing right now, and I think JP mentioned this as well,
Starting point is 00:30:30 is Hadoop or HDFS acts like a file system. It's a logical file system. So you have your data access security, you have your controls around how, you know, a very granular level, how somebody can access the data. Whereas an object store is just you're in or out. That's it. So until it gets that sort of security, some sort of aspect of a file system on top of it, I think HDFS will still stick around for that reason. The serverless technologies and the ability to access that data
Starting point is 00:31:09 within S3 or within Google Cloud Storage or whatever other cloud storage, I mean, those technologies are very – I do think that's the way the data processing is going. Yeah, yeah. I mean, it's an interesting point you're saying that i mean i i was very much of that opinion um but certainly um it's certainly those kind of you know serverless databases server for no ops data warehouses that base level of storage and compute and so on they're they're fantastic but i mean i've been playing around with um druid
Starting point is 00:31:41 of last weekend actually and i was looking at kind of things we're doing with that and trying to get I suppose to solve that last mile of query performance with with BigQuery where you know the last 30 seconds of a query is still there and I guess the point I was looking at when I was looking at that was you you get a lot more kind of like innovation and a lot more kind of new projects springing up obviously within this kind of I suppose Hadoop world and you and on-premise world because you've got much more ability to do things at a smaller scale and i think you know where those things like bigquery come in is that they can solve a common problem very well um but there'll always be a need for kind of new things innovative things um you know more point solutions and more kind of like niche solutions um you know and hdfs probably will be
Starting point is 00:32:22 gone in its current form won't be there in a few years time because it will be in memory or something but it's all swappable out and that again is the big beauty of Hadoop that really every component can be swapped out Right, exactly and that's where if you think about
Starting point is 00:32:38 moving the data to a cloud storage then you're pretty tied to that vendor again so you're going to access data in S3 with Athena, with Glue, with QuickSight, with EMR, whatever it may be, but you're not going to go to BigQuery and access it. So that's kind of an interesting shift to think about as well. We don't want to get stuck in those silos again,
Starting point is 00:33:08 the way we kind of are now, relational databases. Yeah, yeah. So let's go, I mean, since you and I kind of worked together, there's been quite a few interesting products come out in this space that kind of I always wondered what your opinion of them would be really. And so, yeah, AWS Glue. Every so often I get a Twitter direct message from Tanel saying, AWS Glue is fantastic.
Starting point is 00:33:30 You ought to look at it. It's really, really interesting. And what's your take? First of all, for anybody who doesn't know what it is, Michael, just explain what AWS Glue is and why you and I might be interested in it. What's the kind of interesting thing with it? Yeah, so it's a serverless data integration technology that Amazon has released.
Starting point is 00:33:49 But that's not the only thing. It has a metadata catalog that ultimately replaces the metadata catalog, the Athena metadata catalog out there on AWS technologies. And to populate that catalog, it has these things called crawlers. So you can currently access data with an S3 or through any JDBC connection. So, I mean, that's quite a lot of technologies you can get to. And what it will do is go out and basically mine this metadata from these data stores and
Starting point is 00:34:27 store it away in your Glue catalog. At that point, you can use or access any of those data sources from within your data integration mapping, if you will. So it's also serverless, as we've mentioned. So it will spin up what it needs to behind the scenes, and you don't need to worry about provisioning servers or any of that. It just does it as part of its processing. It's a little bit of a graphical tool. There's a little bit of a graphical tool there's a little bit of uh you know graphical mapping to it but most of it is is uh pie spark with within a nice code window and you use the the the glue api to to perform uh transformations or or data access what have you so what's your take on it then because i mean you and i used to spend you and i and stewart back at the time used to spend ages kind of building kind of very carefully building data
Starting point is 00:35:28 mappings between source and target and and all that kind of stuff when we couldn't do anything until the business domain you know experts would kind of give us a kind of data model of the source system and so on and this glue you know it sounds kind of interesting and there's been similar ideas from google in the past google goods goods and so on. You know, what's your, what's your view on all this? You know, do you think it's too good to be true?
Starting point is 00:35:48 Do you think it's kind of, I mean, what, what do you, what's your take on it, Michael? Yeah, it,
Starting point is 00:35:52 it definitely is. It's not too good to be true. I think it's, as with any data integration tool, you, you need to, uh, understand what it can do and what it can't so there's there's
Starting point is 00:36:07 certain limitations uh as with anything the nice thing with you know i work with oracle data integrator so the way that worked was you you had some built-in templates for building your mappings but then you could develop your own or modify them. And that's kind of, you know, you take that look, or that that sort of approach with AWS Glue, where, yeah, you can use the transformations that they've defined for you, but you can also write your own Spark SQL or Spark code. So, so you can you can extend it as much as you need to. And that's where I think as it continues to get used, hopefully they'll get good product feedback from the folks that are using it in enterprises and start to improve the product.
Starting point is 00:36:57 But again, I think the serverless processing is just the way it's going to go. And, you know, you shouldn't need to worry about provisioning servers or understanding, you know, what size of server you need or how many CPUs or whatever. Just send it a job. And, you know, ultimately, I think JP mentioned this as well. That's kind of interesting with the serverless is, you know, ultimately you just define, like, what's my SLA on this query to return? And, you know, hopefully AWS Glue can, or whatever technology it is, can perform, right? So where would you, I mean, this, I mean, Tun i mean tunnel has again been interested in in in glue i
Starting point is 00:37:45 mean do you see it do you see that this being a complement to the stuff you guys do with with with kind of uh the gluant software i mean is it or is it solving the same problem or what really yeah well it i think it fits right into our vision of of you know access to all enterprise data at any time from basically from a query without without rewriting your your applications and so so one of the the original ideas around gluon is is to become this this data sharing platform and you know right now gluon solves that that physical data access that you know the the pipes for the plumbing. We can get to the data wherever it's at.
Starting point is 00:38:29 Glue and whatever other catalogs that are similar can provide that data awareness. So understanding where the data lives within the enterprise. And so when you, yeah, it's definitely a, I mean, not only a name, but it is a good compliment. I mean, it's funny to see this glue thing come out. But yeah, between the physical data access and the metadata, plus the glue has the transformations, which gluant doesn't do, I think it's a very good compliment. Okay, okay.
Starting point is 00:39:02 So the other thing that's happened since you and I were in consulting was the rise of this thing called the data engineering movement. So data engineers and kind of people with math degrees and PhDs writing ETL code and so on. I mean, what's your observation been of that? It's probably a little bit unrelated to Kaluan, but this idea of kind of software engineers and kind of math PhDs writing kind of ETL code and so on. What's your take on that yeah i think it's well i've you know listening to your podcast i've i've uh i i continue to to hear a theme which is you know where is this this uh big data technology etl tool you know what
Starting point is 00:39:42 we have and i don't know if you've found it yet. It doesn't sound like it, but, you know, you've gone through some of the stream sets and Glue might be it, but, you know, and so that's just the missing piece right now. And so the rise of the data engineer
Starting point is 00:39:58 and, you know, that great article that came out, you know, that defines that role. And that's what that role does is build the ETL pipelines that you don't have a tool to build with. Will it shift to something that's more easy to use and you don't need a PhD or something, you know, or at least you don't need to write everything in code.
Starting point is 00:40:27 I think it will. It depends on how rapidly these technologies evolve first. So it's easier to keep up with changing technologies in code, I believe, than writing code to build a platform that somebody else needs to use to write code. And also, you know, will things, you know, consolidate a little bit more? No. So it's, again, it's tough to write software on top of lots of, you can't be everything for everybody. So, you know will the the industry kind of consolidate into some a few standards so but i think ultimately it's probably going to get get closer to to a
Starting point is 00:41:14 graphical tool and you know maybe maybe glue ends up being that i don't know it's interesting i mean i think i think um yeah the the question I always had is whether or not you know everything being scripted these days is because um the paradigm has changed as people as people might say or it's kind of immaturity really in people doing it and it there was I was at a looker um event last week and and it was actually a a partner a part of it and they talked about you know they described what happened a few years ago it's almost like they described it as a situation when there was a big whale that got beached on uh the the uh on the kind of beach at san francisco and um i think what they did was to get rid of it they they um
Starting point is 00:41:53 they dynamited it this big kind of beached whale and at the time people were thinking it would just kind of like blow up and then disappear and actually what happened is that it kind of exploded and the bits went everywhere and it took them years to kind of clear it up i think something like that the point is that that's a bit like what happened with monolithic bi and etl suites you know when you and i first started doing this it was you know in our case oracle bi covered everything everything you can think of from olap to reporting to whatever and then you had etl tools like you know data integration suite that did everything there as well and it kind of you know a couple years ago it all just got blown up and and so now you've got these little kind of like point solutions you know so from the etl side what i've been noticing is is that you know there's scripting going on so things
Starting point is 00:42:33 like airflow and and so on the place that i work at is all called code but you've got little solutions coming along you've got stitch you've got kind of five tran um you've got things like dbt open source and so on and it's interesting to sort of think is this just the same components reassembling themselves or have we now moved to a kind of, are the, because really ETL was, let's be honest, it was the kind of the worst out of all the jobs you get really wasn't it, in terms of
Starting point is 00:42:55 writing mappings between tables was about the kind of the worst job you could have isn't it and now writing ETL code you're a hero you know because you're a data engineer and you just it's interesting isn't it you wonder have things changed have we moved to this i mean i think it was um uh um from from uh confluent gwen shapira you know talked about the you know to her mind writing code is a better way to do data data movement and data you know data transformation but i don't know i mean you and i are both here
Starting point is 00:43:24 around when everyone was saying that to us back about writing pls equal to do this and actually in practice it wasn't i don't know i mean what do you think what do you think on that yeah yeah the the etl developer is is the first to be blamed when when something doesn't doesn't process correctly or or perform correctly right and then you start pointing out dba as a network and yeah but but that's you know that it's it's interesting to to think about um you know when when everything sort of got blown up and and tools like odi and informatica came out their big data uh you know, approach, right? So with ODI,
Starting point is 00:44:06 because I'm most familiar with that, you know, they just built in these new templates that worked with Spark SQL. But when you go to an actual client and say, hey, you know, we've got this, you know, the ability to access your Spark data or your HTFS files and use Spark to transform it, then they say, well, but I want to use Scala instead of PySpark. We can't do anything about that. So the flexibility is gone, right? And that's where I don't know if that's what someone like Gwen would say. It's the flexibility.
Starting point is 00:44:41 But when you start getting into the technologies, you are the ETL tools and those technologies, you, you get, you get a little pigeonholed and you're stuck with what has been delivered to you. So, yeah. So I think,
Starting point is 00:44:55 yeah, I, I mean, I, I, I think it's, I, I also think it's a bit of the,
Starting point is 00:45:02 you know, immaturity where ultimately it's the, the industry is going to move towards a tool, but I definitely could be wrong. Yeah, I mean, I've asked Robin Moffat to come back on the show at some point, and of course he's now rocked up at Confluent as well. And, you know, very much there, it's obviously Kafka, and it's about data pipelines and so on. I mean, again, do you think this is, again, do you think there's a fundamental change there with things like Kafka compared to what you and I used to do with batch loading of kind of relational data and so on?
Starting point is 00:45:35 Or is it all the same, really? I mean, do you think there's some fundamental changes there or what? Yeah, it's the old, you know, it's the old it depends answer, right? Yes, yeah, yeah. I mean, you look, it's the old it depends answer, right? Yes, yeah, yeah. I mean, you look at what their announcement just a couple weeks ago and KSQL that came out, you know, when you think you're just writing, you know, writing Java code to process streaming data, now you can do it with SQL, you know, because SQL will never die.
Starting point is 00:46:03 It's always going to be the way you exit and that that's that's been that's been my kind of you know occasionally you kind of admit to things that you were perhaps wrong or you whatever and and that that thing the thing the statement a lot of people kind of old farts as i called them at the time that was saying sequel would be that would be the language of big data and and you know to my mind that was slightly self-serving statement that i think it was oracle that was saying it well they would say that but you know, to my mind, that was slightly self-serving statement that I think it was Oracle that was saying it. Well, they would say that. But, you know, SQL and batch transformations and so on. I mean, it just keeps coming back, doesn't it?
Starting point is 00:46:30 You kind of, you know, it surfaced again, that and tabular storage in BigQuery. And now Confluent announced it for, you know, for their product as well. And it's great. I mean, it's the way people have been doing it for years. So, you know so why change it? I think the SQL engines on Hadoop are definitely, they're immature,
Starting point is 00:46:57 but compared to relational database, like Oracle or SQL Server that have built up this functionality over years. But if you think about three or four years of Impala being out there and how much it can do, just imagine 10 years from now if a SQL on Hadoop engine or whatever it's running against, imagine what they can build in at that point. You know, the SQL engines on Hadoop are definitely, you know, they're immature, but they're for, you know, compared to relational database, like a relational database like Oracle or SQL Server that have, you know, built up this functionality over years. But if you think about, you know, three or four years of Impala being out there
Starting point is 00:47:50 and how much it can do, you know, just imagine 10 years from now, if a SQL on Hadoop engine or whatever it's running against, you know, imagine what they can build in at that point. You know, I think the great part about these open source tools are the fact that, you know, if you or I have a need and I guess the knowledge and the ability to do something about it and the need, you know, there isn't the functionality within that tool, we can go make it happen. So that's, you know know i can go find the jira ticket and fix that that bug you know so so that's i mean that's the power of the open source software i think is is is pretty neat yeah okay so so just to kind of round off then it's i think i think it's open
Starting point is 00:48:37 world next week oracle open world first year in about 10 years i've not i've not been um and are you gonna be there will gluant be there and will you be there or or um you know what's the kind of what's happening then yeah yeah i i won't be there that's yeah first in a few years for me that i'll be missing uh but yeah to now be there and a couple other folks so uh yeah you can definitely reach out and and um you know if you want to talk to us okay okay that's good. I mean, so brilliant. Well, look, Michael, it's been brilliant to speak to you again.
Starting point is 00:49:08 It's been a long time, probably about a year now or something, since we spoke properly. And it's great to hear what you guys are doing and how the product's developing with Gluent there. And yeah, it's been great to speak to you. And I mean, so keep us informed in the future what's going on with the product.
Starting point is 00:49:24 And it'll be kind of good to see where you how you get on really yeah definitely yeah thank you so much for having me on uh and uh yeah keep keep doing the these great shows i i really appreciate it Thank you.
