Drill to Detail - Drill to Detail Ep.39 'Gluent Update and Offloading to the Cloud' with Special Guest Michael Rainey
Episode Date: October 3, 2017...
Transcript
So my guest on this week's episode of Drill to Detail is Michael Rainey, someone I worked with several years back at Rittman Mead when he headed up our data integration practice at the time. And since then, both he and I have moved on, with Michael moving to Gluent as their technical advisor. So Michael, welcome to the show and good to have you on here.
Thank you, Mark. Thanks for having me.
So Michael, just give us a bit of a potted history, really, of what you've been doing up until now and how you got involved in data integration and databases and so on, and then the route into the role you're doing at Gluent at the moment.
Yeah, sure thing. I started out as an application developer, like many of us do, and moved into the data warehouse world. And at the time, we had our own homegrown data warehouse system. So we had
built up a VB6 app that would generate SQL Server DTS packages.
So this is really old technology, and we decided to transition to Oracle Data Integrator.
And that's really how I got into the Oracle world.
From then, I moved on to work with you and others at Rittman Mead, did consulting around Oracle data integration, Oracle GoldenGate and Oracle Data Integrator, for about five years. And then, you know, as things
happen, you know, you get contacted by somebody and have a great opportunity come up. And that was
what happened with Gluent. Now with Gluent, I mean, technical advisor is sort of a general title.
It's a wearer of hats, if you will, because it's a startup that's very small. So we do what we need
to. But I focus on account management, being a customer advocate, and then also marketing,
training, content development, delivery of the training, and really anything else that's needed.
Okay, okay. So that sounds like a role that, in the place I work currently, is called a product specialist. That's the kind of role that came, I think, out of Google originally, where you become, as you say, the customer's advocate, you're a technical specialist in what you do, and you act as that kind of, I suppose, funnel from the stuff coming from the customer back to the company.
But it's also interesting, it must be interesting for you, you know, working in product now as well. I mean, you and I both used to work in consulting, and product is different, really. How have you found working in a startup and working in a product company?
Yeah, it is really interesting.
And, you know, I'm definitely learning a lot just about how a startup works and how a business runs.
And also, you know, the types of feedback you get, not just from customers but also from the industry, that help make you pivot a little bit, I guess, as to what you're developing and what you're delivering.
I was about to say, one of the things that I found with product is that, you know,
typically when we come from consulting, you will do anything for the individual customer. You know, a customer comes to you with a requirement and you will move heaven and earth
to meet that customer's precise requirements.
But, you know, certainly when you work for a product company, you've got to kind of balance out
what are the requests from customers that are going to help you kind of grow the business and what are ones that are going to
take you in a route that perhaps isn't strategic. I mean, that's certainly what I've found. Is it the same with you, really?
Yeah, and that's true. Going from evangelizing somebody else's product, like Oracle's data integration technology, to advancing your own, you have to be careful about what you're promising that the product will do, and also, like you said, which features should be in and which should be put on the back burner for a later time.
Yeah. So the reason I wanted to get you on the show is, first of all, obviously it's good to have you on, and we've known each other for a while.
And you've got some great opinions and thoughts on the industry.
But also, I had Tanel on the show back, I think it was actually this time last year.
It was at the UK Oracle User Group.
I remember interviewing him in his hotel room in Birmingham.
And we talked then about the Gluent founding story.
And we talked about what problem Gluent was trying to solve,
and the approach they were taking with this kind of idea of a hybrid workload and environment.
And I noticed that there was an announcement recently, which was that you guys had actually done some stuff in the cloud.
And I was particularly interested to come back and get a bit of an update. Because, first of all, I'm curious to see how you guys are getting on. But also, I've got this kind of theory that as workloads like these move into the cloud, this kind of distinction between what is data warehouse technology and what is big data technology will kind of change. But also, you know, nothing will ever move entirely to the cloud; there'll always be an on-premises workload. And so I was kind of curious to see where you guys were going with this, and what your thoughts were on this as well. So first of all, just for anybody who is new to Gluent, because I imagine a lot of people in the podcast audience would be, just tell us a bit of the basic facts: what is Gluent, the company and the software, what do you do, and what problem do you solve, and so on?
Yeah, sure thing. So Gluent is data virtualization software, and that's a broad term, data virtualization, so I'll break it down as to how we effect that. We have two major components of our Gluent Data Platform. The first is the offload of data, which is moving data from a relational database into a big data technology
like Hadoop. And so the reason we came about to do this was we saw that many enterprises were
struggling with their storage or their CPU costs or, excuse me, CPU processing power, and eventually the cost as well to continue to maintain that
and build up some more storage and processing power.
The other aspect was that data within the enterprise
is in all these different types of siloed data stores.
So there's relational databases from many different types of vendors,
plus big data that's in Hadoop and HDFS. And we saw that that makes it very limiting as to how
you can access the data. So offloading from a relational database into Hadoop puts it into an
open data format, which can then be accessed by many different engines, not just the big data technologies like Impala or Hive
or other SQL-on-Hadoop engines, but even Kafka or other streaming technologies,
graph technologies, anything really.
The other aspect of that that helps us with the data virtualization is transparent query, where after you offload that data to Hadoop, you can still access the data as if it never left the relational database. So we don't require any application code rewrites or migrations. It's completely transparent, or, as you will, the data is virtualized. The data access is virtualized.
Okay, okay.
I mean, that's, I think, a good explanation of it, really.
I mean, so you're saying there that, first of all,
you're actually offloading data that's being stored in, I suppose, expensive Oracle data warehouses or Teradata and so on into Hadoop storage,
which obviously there's a cost saving there.
But then I think, you know,
the thing that makes it very interesting
is this ability to then carry on with the workload
going through the Oracle database.
But actually, you know,
in the cases where the data's now moved to Hadoop,
for example, you know,
it still just kind of transparently accesses that.
And what's the kind of underlying, I mean, comparing it to, say, Oracle Big Data SQL, which, you know, you might know, how does it work with Gluent? What kind of SQL technology do you use on Hadoop, and how do you achieve this quite magical thing, really?
Yeah, and just to take it a step further, I guess in a little more detail on the offload process: that's something that, you know, you or I could probably write a Sqoop command for and perform an offload from a relational database like Oracle. But behind the scenes, what Gluent is also doing is, well, first, it's putting the data into a storage format that is a compressed columnar format like Parquet or ORC. So it's saving as much storage as you possibly can on the Hadoop side, and also enabling faster access to the data for your analytics.
We're also building the metadata around that table.
So we're taking the Oracle table structure or relational database table structure and putting
that into Hadoop as well. And we do a lot of the data type translations, because SQL on Hadoop doesn't have the same data types as, you know, an Oracle relational database. So we have to do all of that behind the scenes for you.
Plus, if you have stats computed in Oracle,
we can move those across to Impala, say.
So we have a lot more that's going on there.
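To make that concrete, here's a minimal sketch of the bare-bones version of an offload, the part you or I could do by hand with Sqoop or Spark. The connection details, table name, and partition column are made up for illustration; Gluent's own offload adds the metadata, type translation, and stats handling described above on top of this.

```python
# A hand-rolled offload sketch (illustrative only): copy an Oracle table
# into compressed, columnar Parquet files on HDFS using Spark's JDBC reader.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("offload-sketch").getOrCreate()

src = (spark.read.format("jdbc")
       .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   # hypothetical DB
       .option("dbtable", "SALES.TRANSACTIONS")                 # hypothetical table
       .option("user", "offload_user")
       .option("password", "...")
       .load())

# Partitioning by a date-ish column keeps the offloaded data easy to prune.
(src.write.mode("append")
    .partitionBy("TXN_YEAR")
    .parquet("hdfs:///user/offload/sales_transactions"))
```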
On the transparent query side,
so you can offload all of your table
or a portion of your table.
So maybe 10% of the data, the active data, still remains in your relational database and you move the other 90% off to Hadoop. When you run a query against this, we call it a hybrid table now, we have the ability to take a look at the execution plan itself
and determine which lines of that execution plan
we can push down into Hadoop for the processing
and save that processing power
for a technology that is built for that.
So you have this massive Hadoop cluster
that has parallel processing across all of its nodes
and is made for that type of work.
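As a deliberately simplified sketch of that idea, this is not Gluent's patented implementation, just an illustration of the concept of splitting execution plan steps between engines:

```python
# Toy illustration of plan-line routing: operations that touch offloaded
# data and are expressible in the Hadoop SQL engine get pushed down;
# everything else stays in the relational database.
PUSHABLE_OPS = {"TABLE ACCESS FULL", "FILTER", "HASH JOIN", "HASH GROUP BY"}

def split_plan(plan_lines, offloaded_tables):
    """plan_lines: list of (operation, object_name) pairs from the plan."""
    pushed, retained = [], []
    for op, obj in plan_lines:
        if op in PUSHABLE_OPS and (obj is None or obj in offloaded_tables):
            pushed.append((op, obj))     # run in Impala/Hive against Parquet
        else:
            retained.append((op, obj))   # run in the source database
    return pushed, retained

# Example: the scan and filter on an offloaded table push down;
# the final sort stays behind in the database.
plan = [("SORT ORDER BY", None),
        ("FILTER", "SALES_HIST"),
        ("TABLE ACCESS FULL", "SALES_HIST")]
print(split_plan(plan, offloaded_tables={"SALES_HIST"}))
```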
Okay, so how much, I mean, if you took maybe an EBS-type sort of database, I mean, I'm not saying necessarily that one, but if you took a kind of ERP-type system, typically what percentage of data would you expect to be able to offload to Hadoop? And how much of the transactional workload, in terms of coverage of queries or whatever, could you offload, do you think?
Yeah, it's going to depend. You know, we have a tool that will help us determine that, it's called Gluent Advisor, and you can run that against your relational database. It's just some SQL scripts that take a look at the usage of the data and see what's more active, what's being updated more often, and then we'll decide what you want to offload. Most of the time, you don't want to offload active data, because, as you know, HDFS is append-only. We do have the ability to go back and update offloaded data, so if you do happen to change some data later on, then you just rerun an offload and it will perform that update and essentially mimic an update within HDFS. So it depends on, you know, how active the data is.
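To give a flavour of the kind of check an advisor-style script might run, here's an assumed illustration against Oracle's standard dictionary views, not Gluent Advisor's actual SQL: compare segment size against recent DML activity to flag cold, offload-friendly tables.

```python
# Sketch: flag large, mostly-static tables as offload candidates by joining
# segment sizes to recent insert/update/delete counts.
import cx_Oracle  # or the newer python-oracledb driver

conn = cx_Oracle.connect("advisor", "secret", "dbhost:1521/ORCL")  # hypothetical
cur = conn.cursor()
cur.execute("""
    SELECT s.owner,
           s.segment_name,
           ROUND(s.bytes / POWER(1024, 3), 1) AS size_gb,
           NVL(m.inserts + m.updates + m.deletes, 0) AS recent_dml
      FROM dba_segments s
      LEFT JOIN dba_tab_modifications m
        ON m.table_owner = s.owner
       AND m.table_name  = s.segment_name
     WHERE s.segment_type = 'TABLE'
       AND s.bytes > 10 * POWER(1024, 3)   -- only look at tables over 10 GB
     ORDER BY s.bytes DESC
""")
for owner, table, size_gb, recent_dml in cur:
    # Big tables with little or no recent DML are the best candidates.
    print(f"{owner}.{table}: {size_gb} GB, recent DML rows: {recent_dml}")
```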
Okay, all right, good. And just as a kind of transparency note, some people might know that I worked a little bit of time over at Gluent last year, hence kind of having this unnaturally good knowledge of how your product works. Actually, I worked there, as you know, Michael,
I worked there for a couple of months
just as I was leaving Ritman Mead and so on.
So it's interesting to see how you guys have got on, really,
and I've always been interested in where it's going.
And on that point, so when I was there, and when I left, the technology to extend that to other database types was kind of being developed. Give us an update on how that's going, and, again, what was the kind of point of that, and the purpose, and so on?
Yeah, yeah.
So we've talked about Oracle a couple of times already.
And as you know, and others might not know, the founders and most of us within the company are from an Oracle background, as you are. So we all decided it was best to start with Oracle first, because we know it so well. The other part of that is, after getting into the Oracle work and understanding what's going on there and building the product, we determined we'd probably done the most complex relational database that we could as our first start. That's not to say that something like SQL Server isn't a complex database, but, you know, it's a bit less so; otherwise this would have been done already. But we know that we have the appropriate pattern and have built out the product once, so we know the process we need to go through to replicate it against other databases. So that's kind of the history of where we started.
Right now we have SQL Server in production as well.
And then due to some interesting customer demand,
we have Netezza in the works right now, so a Netezza offload. And that's due to the end of life and the need for these large enterprises to migrate off of Netezza into something else.
So like I said, we've got this customer demand
that just kind of came out of nowhere.
And that's the interesting one, isn't it? To what extent do you follow that?
Yeah, yeah.
So we started working with one company,
and then before you know it, we have several other requests.
And so that's in the works with that one company,
and we'll continue on with others.
Okay.
So the way I understand that is you can do the offloading from those databases, but the actual query translation is always still Oracle at the moment. So could you actually run Oracle SQL on Netezza, or would it be on the data you've offloaded from Netezza? How would that work?
Yeah.
So, for example, this customer we're working with, it's a large financial institution that initially approached us with the need for the Netezza offload to Hadoop.
So they also have, in fact, they also have a SQL Server set of databases
that they want to offload as well.
Once they got into our pilot
process and understanding how
Gluent works, they saw that
they could actually offload both
data sets and then now they have them in
one common location that can
actually join things together and
transform the data
and actually generate
their analytics out of it.
So imagine this: you offload the data from Netezza and from SQL Server.
If you enrich the data with some Spark SQL or something like that within Hadoop, then whatever table exists within Hadoop, let's say it's in Impala, you can actually present that table. That's what our transparent query product is called: Gluent Present. You can present that to an Oracle relational database as if it lived there. So that's another really big, powerful use for our product: that present-only sort of approach.
And we have several customers that are just using it that way.
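A minimal sketch of that enrich-then-present flow, with all the database and table names invented for illustration: once both offloads have landed in Hadoop, a Spark SQL step can join the formerly siloed datasets into a new table that Impala can see, and which could then be presented back to Oracle.

```python
# Join offloaded Netezza and SQL Server data inside Hadoop with Spark SQL,
# materialising the result as a Parquet table visible via the Hive metastore.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("enrich-sketch")
         .enableHiveSupport()     # use the shared Hive metastore
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.customer_positions
    STORED AS PARQUET AS
    SELECT c.customer_id,
           c.segment,
           p.instrument,
           p.balance
      FROM offload_netezza.customers   c
      JOIN offload_sqlserver.positions p
        ON p.customer_id = c.customer_id
""")
```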
Okay.
Okay.
So, on the last point on this, really, and actually I'm confessing I never quite understood this when I was there: how did you manage to get Impala to be as functional as Oracle SQL?
You and I know the kind of weird and wonderful things you can run in Oracle, like MODEL clauses and so on. How did you manage to get that whole range to then work on Impala, really?
What was the kind of, I suppose this is where
Tanel's knowledge of Oracle internals comes in,
but what was the kind of the solution there?
Yeah, and to give a little background on why I mentioned Impala: along with our knowledge, and starting with Oracle as the original source for the product, we quickly realized that we can't be everything to everyone. As you know, Hadoop has several vendor distributions, and with each vendor distribution you get a different set of technologies. So with Cloudera, you get the Impala SQL engine; with Hortonworks, you would have Hive as your SQL engine. So we decided, well,
we've got plenty of pipeline with the Oracle offload, and it so happens to be that almost all of those customers were Cloudera customers as well.
So we took our focus down the Cloudera path.
But now we have MapR in production, and then Hortonworks and Amazon EMR are both coming very soon.
So as far as a target goes, we're coming right along with those.
But the way that works, I mean, with this transparent query access, I mean, one of the
questions that people often ask is, you know, is it just a query pass-through? You know, how do
you get it to work like that? Well, we know Oracle has many different analytic functions that just don't work in Impala or Hive.
And there are even queries that have correct syntax that haven't been built out in that SQL engine.
And so there's a lot of, you know, if you think about Oracle as a database, it turned 40 this year, right?
So I think Impala is maybe three or four years old.
So there's a lot of experience built up there.
What Tanel and others have created, and actually have patented now, is the ability to, as I mentioned earlier, read that execution plan out of the database.
And each line, it determines what can be pushed down into Impala. And we do that query translation
on the fly. So one of the aspects, as I mentioned earlier, was the ability to translate data types from Oracle to Impala
during that offload process when you're building up the table.
So we need to make sure that the data types are correct.
We're going to store the data correctly.
But we also can push that query down,
whether it's a join or a filter or aggregates.
We push all of that work down.
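To illustrate the data type translation side of that, here's a sketch of the kind of mapping table an offload tool has to maintain. These mappings are assumptions for illustration, not Gluent's actual rules, which are more involved.

```python
# Illustrative Oracle-to-Impala type mapping; real tools derive precision
# and scale from the source column metadata rather than using placeholders.
ORACLE_TO_IMPALA = {
    "VARCHAR2":      "STRING",
    "CHAR":          "STRING",
    "NUMBER":        "DECIMAL(38,18)",  # placeholder precision/scale
    "BINARY_DOUBLE": "DOUBLE",
    "DATE":          "TIMESTAMP",       # Impala of this era had no DATE type
    "TIMESTAMP":     "TIMESTAMP",
}

def translate(oracle_type: str) -> str:
    # Fall back to STRING for anything unmapped, the safest lossless choice.
    return ORACLE_TO_IMPALA.get(oracle_type.upper(), "STRING")

print(translate("Date"))  # -> TIMESTAMP
```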
Okay, okay.
So far, I mean, up until the cloud stuff we'll talk about in a moment, who has been the kind of buyer of this, really? I mean, you say you act in account management as well and so on. Who typically within the organization is buying this, and what kind of organization is buying into this kind of approach, really? I'm just curious.
Yeah, yeah. So we've really focused on some of the larger enterprises, you know, Fortune 500, Fortune 1000, and that's where we find these types of large data sets that are causing trouble on the relational database and that can benefit from being offloaded and pushing work down to Hadoop.
So, I mean, there's several case studies out on our website.
And I can mention one company: they're called Vistra Energy, and they're based out of Dallas, Texas.
They're a large power company down there.
And they originally came at it with, you know, looking at Gluent with the cost savings in mind, you know, trying to offload from their Exadata, Oracle Exadata machine into Hadoop.
But quickly they realized they could do a lot more.
They have a Hadoop cluster with smart meter data coming in.
So this is sort of IoT data that's coming from the consumer usage of the energy. They also have transcribed customer support calls, so they can keep track of how happy
or frustrated you are with their service.
So they can now take this information
and present it from Hadoop into their customer ERP system.
And now they've kind of put these additional pieces
of the customer 360 puzzle together.
And they can do some interesting analytics now
where, you know, one of the more recent bits of analytics that have come out of this exercise was... so they have a product called Free Nights. Basically, if you use energy a certain way, they'll offer you free nights, free energy overnight. So they can now tie all of this usage information in with their customer information, and then maybe they'll send you a flyer, or send you an email, or present this option on the website when you log in. So they're able to offer their customers a better product just by using this technology.
Okay, that's kind of what I was thinking. I mean, again, I spoke to Tanel this time last year, and, you know, the obvious appeal of a technology like this is to save money, really. But in my experience it's quite hard to get people to make a big change in their technology just to save money, especially if it's a technical sale. It was really, I suppose, the additional options that are now open to you, the fact that you've got your data in this centralized place, you've got these more open formats, and it's more what you can do with it from that point onwards that is the real appeal, really. Almost, this cost saving is like a conversation starter and a bonus, but it's not the real reason you would do this. And it's interesting to see that's kind of how it's worked out for you. So that's kind of good. Yeah, I mean, it's good to see it's worked out.
Yeah, it always started as the foot in the door, right? But really, we've learned that that isn't the only way to begin speaking with these enterprises. You asked who might be interested in an organization.
It's going to be more of the architects and, of course, the CIO,
those types of folks.
When we start talking with the database professionals,
they really want to just know how it works.
How did you do that?
Are you doing it correctly? So, yeah, we want to get to the folks that really are interested in the entire enterprise data architecture.
Okay, okay, so let's move on then. So the thing that prompted me to drop you a line, to say if you're interested in coming on the show, was the announcement about Gluent and the cloud. So just tell us a bit about what that is to start with: what have you done, and what are the kind of headline features? And we can drill a little bit into what that means afterwards.
Okay, yeah. So, you know, I think the blog post you're referring to is about a product called Cloud Sync, and it's, I guess, a component of the entire Gluent Data Platform.
So this started out really as a backup and restore service.
So we're leading enterprises to offload their data into Hadoop. And we realized it would be good to offer them another service that could back up that HDFS data, those files in HDFS, off into the cloud, into a storage service, you know, an object store like S3 or Google Cloud Storage, whatever it may be.
So we started with that.
And then we really quickly realized there's more potential to this. The first one is, if you're offloading or backing up your data lake from HDFS, that's maybe on-premises, off to the cloud, now you have this backup data that can be used for additional analytics. It doesn't just have to sit there, you know, static and unused. So if you think of putting the data in Amazon S3, now you have Amazon Athena; you could run queries against it. You have the newly released AWS Glue; you can perform transformations and use that to enhance the data.
And then, you know, of course, if you have any other machine learning tasks
or very processing-intensive tasks that you just want to ramp up a cluster,
a Hadoop cluster for, and then process the data and then spin it down,
you could do that.
So once the data gets out into this open storage format,
then there's so many different ways you can access it.
The other bit we realized was we could take the same sort of data virtualization innovation that we've done for the relational database and Hadoop, where we offload to Hadoop and then, you know, you virtualize the data access. We could do the same for Hadoop and
this cloud storage. So you could offload or, you know, essentially back up a portion of your HDFS data into the cloud and then access it from a single Impala table.
So we have that same sort of paradigm for HDFS in the cloud as we do with our current
data virtualization.
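For a flavour of what that looks like from the Impala side, here's a minimal sketch. The table and bucket names are made up, and this shows the generic Impala external table mechanism rather than Gluent's own tooling: data sitting in S3 can be exposed and queried like any other table.

```python
# Query Parquet files in S3 from Impala via an external table, using the
# impyla client library.
from impala.dbapi import connect

conn = connect(host="impala-coordinator", port=21050)  # hypothetical host
cur = conn.cursor()

# Same schema as the HDFS-resident table, but the data lives in object storage.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_archive
    LIKE sales_current
    STORED AS PARQUET
    LOCATION 's3a://my-backup-bucket/sales_archive/'
""")

cur.execute("SELECT COUNT(*) FROM sales_archive WHERE sale_year = 2015")
print(cur.fetchone()[0])
```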
So is that effectively like doing query translation for Impala then?
So Impala can run its queries against object storage in the cloud as well?
Exactly, yeah.
Ah, interesting, interesting.
So which clouds are you currently supporting?
Is it kind of, you know, obviously Amazon,
but are you looking to do Oracle and Google and so on?
What's your thoughts on that?
Yeah, yeah.
It's currently Amazon S3,
and it's not much of a stretch to go to the others.
I mean, really, it's the cloud store.
So the way things work now is the customer demand will drive those features.
So if we had someone come out and say, okay, look, I'm really interested in this,
but we use Google Cloud Storage, then that's the direction we would go if it made sense.
Okay.
So what about, I mean, would you see that as being, so data from Oracle could end up in Hadoop and then it would go into the cloud, or could it go straight from Oracle into the cloud? I mean, how flexible is that?
Yeah, if you offload into Hadoop
and then back up to the cloud,
that would be the way, yeah.
I mean, ultimately the vision is to just run a query, and Gluent helps you get the data returned. And that's the end goal: limit the data movement and virtualize the data access, so you don't have to change your applications to be able to get to that same data.
So, one of the things I often think is that Hadoop is still quite a complex technology on-premises. And, you know, having worked recently with things like BigQuery and Athena and so on, you can see the advantages of a kind of no-ops data warehouse running in the cloud, and how it allows you to have this kind of scale, but with very little of the infrastructure work. I mean, do you still think there's going to be a role for Hadoop on-premises going forward, or is that all going to move into the cloud, do you think? Where do you think this is going, technology-wise?
Yeah, yeah, that's interesting, because when I started with Gluent at the beginning of the year, I think that's around the time that about three or four analysts wrote 'Hadoop is dead' articles, or 'Hadoop is dying', that sort of thing. It's like the 'SQL is dying' articles that come out every once in a while: SQL is going away. We know it's not going away; it's actually being implemented in every technology you can think of. So with Hadoop, I mean, there's still the potential
for something like HDFS to have a place in an enterprise.
So you mentioned that HDFS will still be relevant, and yet a lot of people are saying, Jean-Pierre Dijcks last week was saying, that object storage will take over from HDFS. Do you think HDFS will stay as it is? Or, you know, what are your thoughts on the demand for that, and the use for that, really?
Yeah, there are so many different technologies that use the Hadoop drivers and have the ability to access that data.
I mean, so the big thing right now, and I think JP mentioned this as well, is that Hadoop, or HDFS, acts like a file system. It's a logical file system. So you have your data access security, you have your controls around how, at a very granular level, somebody can access the data. Whereas with an object store, you're just in or out. That's it. So until it gets that sort of security, some aspect of a file system on top of it, I think HDFS will still stick around for that reason. The serverless technologies and the ability to access that data within S3, or within Google Cloud Storage or whatever other cloud storage, I mean, those technologies are very... I do think that's the way the data processing is going.
Yeah, yeah.
I mean, it's an interesting point you're making there. I was very much of that opinion, and certainly those kind of serverless databases, no-ops data warehouses, that base level of storage and compute and so on, they're fantastic. But I've been playing around with Druid over the last weekend, actually, and I was looking at the kind of things we're doing with that, trying to solve that last mile of query performance with BigQuery, where, you know, the last 30 seconds of a query is still there. And I guess the point I was getting at when I was looking at that was, you get a lot more innovation and a lot more new projects springing up within this kind of, I suppose, Hadoop world and on-premises world, because you've got much more ability to do things at a smaller scale. And I think where things like BigQuery come in is that they can solve a common problem very well, but there'll always be a need for new things, innovative things, more point solutions and more niche solutions. And, you know, HDFS probably will be gone in its current form, it won't be there in a few years' time, because it will be in memory or something, but it's all swappable out. And that, again, is the big beauty of Hadoop: really, every component can be swapped out.
Right, exactly. And that's where, if you think about moving the data to a cloud storage, then you're pretty tied to that vendor again. So you're going to access data in S3 with Athena, with Glue, with QuickSight, with EMR, whatever it may be, but you're not going to go to BigQuery and access it. So that's kind of an interesting shift to think about as well. We don't want to get stuck in those silos again, the way we kind of are now with relational databases.
Yeah, yeah.
So let's go on. I mean, since you and I worked together, there have been quite a few interesting products come out in this space that I always wondered what your opinion of would be, really.
And so, yeah, AWS Glue.
Every so often I get a Twitter direct message from Tanel saying,
AWS Glue is fantastic.
You ought to look at it.
It's really, really interesting.
And what's your take?
First of all, for anybody who doesn't know what it is, Michael,
just explain what AWS Glue is and why you and I might be interested in it.
What's the kind of interesting thing with it?
Yeah, so it's a serverless data integration technology
that Amazon has released.
But that's not the only thing.
It has a metadata catalog that ultimately replaces the Athena metadata catalog out there across AWS technologies.
And to populate that catalog, it has these things called crawlers.
So you can currently access data within S3 or through any JDBC connection.
So, I mean, that's quite a lot of technologies you can get to.
And what it will do is go out and basically mine this metadata from these data stores and
store it away in your Glue catalog. At that point, you can use or access any of those
data sources from within your data integration mapping, if you will.
It's also serverless, as we've mentioned, so it will spin up what it needs to behind the scenes, and you don't need to worry about provisioning servers or any of that; it just does it as part of its processing. There's a little bit of graphical mapping to it, but most of it is PySpark within a nice code window, and you use the Glue API to perform transformations or data access, what have you.
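A minimal sketch of what one of those Glue jobs looks like, with the database, table, and S3 path invented for illustration: read a crawler-registered table from the Glue Data Catalog, apply a mapping, and write Parquet back to S3.

```python
# Hedged Glue job sketch: catalog in, ApplyMapping transform, Parquet out.
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# The source table was registered in the Data Catalog by a crawler.
src = glue_ctx.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Each mapping entry is (source_name, source_type, target_name, target_type).
mapped = ApplyMapping.apply(frame=src, mappings=[
    ("order_id", "long",   "order_id", "long"),
    ("order_ts", "string", "order_ts", "timestamp"),
    ("amount",   "double", "amount",   "double"),
])

glue_ctx.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet")
```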
So what's your take on it then? Because, I mean, you and I, and Stewart back at the time, used to spend ages very carefully building data mappings between source and target and all that kind of stuff, when we couldn't do anything until the business domain experts would give us a kind of data model of the source system and so on. And this Glue, you know, it sounds kind of interesting, and there have been similar ideas from Google in the past, Google Goods and so on. You know, what's your view on all this? Do you think it's too good to be true? What's your take on it, Michael?
Yeah, it definitely is. It's not too good to be true. I think, as with any data integration tool, you need to understand what it can do and what it can't, so there are certain limitations, as with anything. The nice thing, you know, I worked with Oracle Data Integrator, so the way that worked was you had some built-in templates for building your mappings, but then you could develop your own or modify them. And you take that sort of approach with AWS Glue, where, yeah, you can use the transformations that they've defined for you, but you can also write your own Spark SQL or Spark code. So you can extend it as much as you need to. And that's where I think, as it continues to get used,
hopefully they'll get good product feedback from the folks that are using it in enterprises
and start to improve the product.
But again, I think the serverless processing is just the way it's going to go.
And, you know, you shouldn't need to worry about provisioning servers or understanding, you know, what size of server you need or how many CPUs or whatever.
Just send it a job.
And, you know, I think JP mentioned this as well: what's kind of interesting with serverless is, ultimately, you just define, like, what's my SLA for this query to return, and hopefully AWS Glue, or whatever technology it is, can perform, right?
So where would you, I mean, Tanel has again been interested in Glue. Do you see this being a complement to the stuff you guys do with the Gluent software? Is it, or is it solving the same problem, or what, really?
Yeah, well, I think it fits right into our vision of, you know, access to all enterprise data at any time, basically from a query, without rewriting your applications. And so one of the original ideas around Gluent is to become this data sharing platform, and right now Gluent solves that physical data access, you know, the pipes or the plumbing. We can get to the data wherever it's at. Glue, and whatever other catalogs are similar, can provide that data awareness, so understanding where the data lives within the enterprise. And so, yeah, it's definitely, I mean, not only in name, but it is a good complement. I mean, it's funny to see this Glue thing come out. But yeah, between the physical data access and the metadata, plus Glue has the transformations, which Gluent doesn't do, I think it's a very good complement.
Okay, okay.
So the other thing that's happened since you and I were in consulting
was the rise of this thing called the data engineering movement.
So data engineers, kind of people with math degrees and PhDs, writing ETL code and so on. I mean, what's your observation been of that? It's probably a little bit unrelated to Gluent, but this idea of software engineers and math PhDs writing ETL code. What's your take on that?
Yeah, I think, well, listening to your podcast, I continue to hear a theme, which is, you know, where is this big data technology ETL tool? And I don't know if you've found it yet. It doesn't sound like it, but, you know, you've gone through some of them, StreamSets and such, and Glue might be it. But that's just the missing piece right now.
And so the rise of the data engineer, and, you know, that great article that came out that defines that role. And what that role does is build the ETL pipelines that you don't have a tool to build with. Will it shift to something that's easier to use, where you don't need a PhD, or at least you don't need to write everything in code? I think it will. It depends on how rapidly these technologies evolve first. It's easier to keep up with changing technologies in code, I believe, than to write code to build a platform that somebody else needs to use to write code. And also, you know, will things consolidate a little bit more? It's tough to write software on top of lots of things; you can't be everything for everybody. So will the industry consolidate into a few standards? I think ultimately it's probably going to get closer to a graphical tool, and maybe Glue ends up being that, I don't know.
It's interesting. I mean, I think, yeah, the question I always had is whether or not everything being scripted these days is because the paradigm has changed, as people might say, or it's kind of immaturity, really, in the people doing it. I was at a Looker event last week, and it was actually a partner that was part of it, and they described what happened a few years ago. It's almost like, they described it as a situation where there was a big whale that got beached on the kind of beach at San Francisco. And I think what they did to get rid of it was they dynamited it, this big beached whale. At the time, people were thinking it would just kind of blow up and then disappear, and actually what happened is that it exploded and the bits went everywhere, and it took them years to clear it up. The point is that that's a bit like what happened with monolithic BI and ETL suites. You know, when you and I first started doing this, in our case Oracle BI covered everything, everything you can think of, from OLAP to reporting to whatever, and then you had ETL tools, like, you know, Data Integration Suite, that did everything there as well. And a couple of years ago it all just got blown up, and so now you've got these little point solutions. From the ETL side, what I've been noticing is that there's scripting going on, so things like Airflow and so on; the place that I work at is all code. But you've got little solutions coming along: you've got Stitch, you've got Fivetran, you've got things like dbt, open source, and so on. And it's interesting to think: is this just the same components reassembling themselves, or have we now moved on? Because, really, ETL was, let's be honest, about the worst out of all the jobs you could get, wasn't it? Writing mappings between tables was about the worst job you could have, and now, writing ETL code, you're a hero, because you're a data engineer. It's interesting, isn't it? You wonder, have things changed? Have we moved on? I mean, I think it was Gwen Shapira from Confluent who talked about how, to her mind, writing code is a better way to do data movement and data transformation. But I don't know. I mean, you and I were both around when everyone was saying that to us about writing PL/SQL to do this, and actually, in practice, it wasn't. I don't know. I mean, what do you think on that?
Yeah, yeah. The ETL developer is the first to be blamed when something doesn't process correctly or perform correctly, right? And then you start pointing at the DBAs and the network, and yeah. But it's interesting to think about, you know, when everything sort of got blown up and tools like ODI and Informatica came out with their big data approach, right? So with ODI, because I'm most familiar with that, they just built in these new templates that worked with Spark SQL. But when you go to an actual client and say, hey, we've got the ability to access your Spark data or your HDFS files and use Spark to transform it, then they say, well, but I want to use Scala instead of PySpark. And we can't do anything about that. So the flexibility is gone, right? And that's where, I don't know if that's what someone like Gwen would say, it's the flexibility. But when you start getting into the ETL tools on top of those technologies, you get a little pigeonholed, and you're stuck with what has been delivered to you. So, yeah, I think, I mean, I also think it's a bit of the immaturity, where ultimately the industry is going to move towards a tool, but I definitely could be wrong.
Yeah, I mean, I've asked Robin Moffatt to come back on the show at some point, and of course he's now rocked up at Confluent as well.
And, you know, very much there, it's obviously Kafka, and it's about data pipelines and so on.
I mean, again, do you think this is, again,
do you think there's a fundamental change there with things like Kafka
compared to what you and I used to do with batch loading
of kind of relational data and so on?
Or is it all the same, really?
I mean, do you think there's some fundamental changes there or what?
Yeah, it's the old 'it depends' answer, right?
Yes, yeah, yeah.
I mean, you look at their announcement just a couple of weeks ago, and KSQL that came out. You know, when you think you're just writing Java code to process streaming data, now you can do it with SQL, because SQL will never die. It's always going to be the way you access it.
Yeah, and that's been my, you know, occasionally you kind of admit to things that you were perhaps wrong about, or whatever. And that statement, from a lot of people, kind of old farts as I called them at the time, that was saying SQL would be the language of big data, and, you know, to my mind that was a slightly self-serving statement, I think it was Oracle that was saying it, and well, they would say that. But, you know, SQL and batch transformations and so on, it just keeps coming back, doesn't it? You know, it surfaced again, that and tabular storage, in BigQuery. And now Confluent have announced it for their product as well. And it's great. I mean, it's the way people have been doing it for years. So why change it?
I think the SQL engines on Hadoop are definitely, they're immature compared to a relational database like Oracle or SQL Server that has built up this functionality over years. But if you think about three or four years of Impala being out there and how much it can do, just imagine 10 years from now, with a SQL-on-Hadoop engine or whatever it's running against, imagine what they can build in at that point.
I think the great part about these open source tools is the fact that, you know, if you or I have a need, and I guess the knowledge and the ability to do something about it, and there isn't the functionality within that tool, we can go make it happen. I can go find the Jira ticket and fix that bug, you know. So that's, I mean, that's the power of open source software; I think it's pretty neat.
Yeah. Okay, so just to kind of round off then, I think it's Oracle OpenWorld next week, the first year in about 10 years I've not been. Are you going to be there? Will Gluent be there, and will you be there? Or, you know, what's happening then?
Yeah, I won't be there; that's the first one in a few years for me that I'll be missing. But Tanel will be there, and a couple of other folks. So, yeah, you can definitely reach out, you know, if you want to talk to us.
Okay, okay, that's good.
I mean, so brilliant.
Well, look, Michael,
it's been brilliant to speak to you again.
It's been a long time,
probably about a year now or something,
since we spoke properly.
And it's great to hear what you guys are doing
and how the product's developing with Gluent there.
And yeah, it's been great to speak to you.
And I mean, so keep us informed in the future
what's going on with the product.
And it'll be kind of good to see where you how you get on really yeah definitely
yeah thank you so much for having me on uh and uh yeah keep keep doing the these great shows i i
really appreciate it Thank you.