Drill to Detail - Drill to Detail Ep.14 ‘Cloudy Big Data Paradigm-Shifting Christmas & New Year Special’ With Special Guest Robin Moffatt
Episode Date: December 21, 2016
Mark Rittman is joined in this Christmas & New Year special episode by none other than Robin Moffatt, head of R&D at Rittman Mead and an old colleague from Mark's consulting days, to talk about his experiences with Amazon Elastic MapReduce (EMR) for BI and analytics and how "the devil's in the detail", and to hear his take on what happened in the BI and analytics world in 2016 and what to look out for in 2017.
Transcript
Hello and welcome to the special Christmas and New Year episode of Drill to Detail, the
podcast about the world of big data, analytics and data warehousing, and I'm your host, Mark Rittman. This week, I'm pleased to be joined on the show by Robin Moffatt,
head of R&D at Rittman Mead, and who many of you will know from his blogs, conference presentations
and social media posts, and of course, he's an old colleague of mine from when I used to work there.
Robin, welcome to Drill to Detail, and it's great to have you on the show. Why don't you introduce
yourself properly to our listeners?
Yeah, thanks for having me on the show, Mark. It's a great honour. So I'm the head of R&D at Rittman Mead, and I've worked there for about five years now.
Before that, I worked at a UK retailer on their Oracle data warehousing platform, with OBIEE as well.
Before that, for my sins, I was a SQL Server DBA.
And going even further back, I started off as a COBOL programmer with a DB2 data warehouse.
So I've worked in data and analytics for about 15 years now.
So long enough to have seen cycles come and go, and to be fairly cynical when people proclaim things certain to have died or certain to be the next big thing.
So it's interesting to see how things go.
Excellent. Well, it's great to have you on, Robin.
And what we're going to do in this special extended edition of Drill to Detail
is we're going to do kind of two parts.
In the second half, we're going to look back at 2016, look back at some of the things that happened, get your opinion on a few trends and new products that came out, and also get your views on what you think will be interesting and worth looking out for in 2017 in BI and data warehousing and so on. But what I want to do in the
first half is actually have a chat about a series of blog posts that you put on the Rittman Mead blog
over the last few days about a project you did with a client to evaluate what it would be like
to move their BI system or to look at how you could move their BI system to a public cloud
and adopt some of the new open source technologies like Spark and Kafka and so on.
So I thought it was an interesting set of posts, and I'd like to go through it with you, to get your feel for where the benefits were, what worked, what didn't work, and so on. But just to start off, can you give us an overview of what the project was about, and what you were trying to achieve and understand with this piece of work?
Yeah, sure.
So this client, they contacted us, and they were interested, as you say, in what kind of benefits they could get in moving some of their work.
At the moment, they're an Oracle shop.
They've got Oracle Data Warehouse, Oracle Data Integrator, and so on.
And they have a batch process that loads chunks of data into the Oracle data warehouse.
And they were struggling in terms of performance, in terms of queries going back over long periods of time. So that's where the big data angle came into play: they were wondering whether those technologies could help them. And then also in terms of cost benefits, whether moving to open source tooling, like you say, with Spark, could help them out there. So we did a short proof of concept with them to explore the different technologies and help them understand which ones might be more relevant, and also the pitfalls around them, the additional complications. A lot of the time with these technologies you get a kind of jigsaw approach of "you can use this plus this plus this and it'll be great", but what that sometimes doesn't show are the problems, the tricky bits, in actually implementing it. So that was also included in the scope of what we did.
Yeah, and although you mentioned Oracle in the intro, I think it's fair to say that a lot of traditional BI and data warehousing shops, using any kind of database technology, Teradata, IBM and so on, are looking to see: can we
move this into the cloud?
Can we adopt these big data technologies?
How does it work?
And how much additional manual work and scripting is involved?
So you did this on the Amazon cloud, didn't you?
So you did it on Amazon EMR, is that correct?
Tell us what that is, and why you adopted it rather than, say, running the project using in-house Hadoop.
Yeah, sure. So the clients, as well as being an Oracle shop, were already on Amazon Cloud.
So they ran a lot of their stuff on Amazon's EC2 servers.
So they were already cloud-friendly.
It wasn't a case of convincing them that cloud was a good place to be running this stuff.
So since they were Amazon and we were looking at Hadoop,
Amazon's Elastic MapReduce was the obvious place to be running it.
Okay, so Elastic MapReduce.
Tell us what that is then.
So what's the difference between that and kind of traditional Hadoop?
How does the elastic part come into it?
What does it do there?
So EMR is Amazon's Hadoop as a service, and it's brilliant, because you can literally click on it and it'll provision a Hadoop cluster for you, of any configuration and size that you want. So in terms of installing and configuration, there isn't any. Even with Cloudera's distribution, you have to install it,
and there are wizards, and it's pretty easy to do,
but you still have to go through and click on things,
and if it falls over, go and look at log files,
whereas with EMR, you go to the Amazon web page, you click "I want a cluster of this size",
and it just goes and provisions it.
So it's very, very simple to use.
And the other bit that's quite attractive about it is that you can spin it up and down on demand. So you don't have to build your cluster and then have it sat there, paying for it whether it's on local tin or cloud tin. You can spin up an EMR cluster, run a job, and then shut it down again, and just pay for the time that you're using it.
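(For readers following along, provisioning a transient EMR cluster really is a single API call. Here is a minimal sketch using boto3; the cluster name, region, instance types, job script and log bucket are illustrative assumptions, not details from the project.)

```python
import boto3

emr = boto3.client("emr", region_name="eu-west-1")  # assumed region

response = emr.run_job_flow(
    Name="transient-etl-cluster",
    ReleaseLabel="emr-5.2.0",  # an EMR release current in late 2016
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}, {"Name": "Presto"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the work is done
    },
    Steps=[{
        "Name": "etl-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl_offload.py"],
        },
    }],
    LogUri="s3://my-bucket/emr-logs/",
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```

With KeepJobFlowAliveWhenNoSteps set to False, the cluster runs its step and then shuts itself down, which is what makes the pay-only-for-what-you-use model Robin describes work.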
Okay. So you said in the series of blog posts, and we'll put the links to the blog posts in the show notes for the podcast, that there were two parts to it: there was ETL offload, or certainly doing the ETL in the cloud, and there was a part about querying. Okay, so both of those are interesting, but
let's start with the ETL bit. So my understanding is the client currently uses an ETL tool to do
the work at the
moment on-premise, but you moved it all into Hadoop and you moved it into Spark and so on.
So tell us about that. How did you do it? What was involved and so on?
So we took a very small piece of the ETL work that they do that was simply taking inbound batch
files that arrived every half hour or so and joining them with some reference data that came
from a relational table. And again, for the scope of the exercise, we actually took that reference data as a simple local file. And we built out a project using PySpark,
which would load both the files and then do the necessary joins and enrichments as well
between the data sets, and then write it out to a CSV file. So it was very, very simplistic, but deliberately so, so that we didn't get bogged down in the detail of implementation.
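(To make that concrete, here is a minimal sketch of the pattern Robin describes: read an inbound batch file and a reference file, join and enrich, and write CSV. The paths, column names and enrichment are illustrative assumptions, not the client's actual code.)

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-offload-poc").getOrCreate()

# The half-hourly inbound batch file, plus the reference data taken
# down to a simple local file for the scope of the exercise.
batch = spark.read.csv("/data/inbound/batch.csv", header=True, inferSchema=True)
reference = spark.read.csv("/data/reference/products.csv", header=True, inferSchema=True)

# Join the batch data to the reference data and add a simple enrichment column.
enriched = (
    batch.join(reference, on="product_id", how="left")
         .withColumn("load_ts", F.current_timestamp())
)

# Write the result back out as CSV.
enriched.write.mode("overwrite").csv("/data/output/enriched")
```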
We started off by building a local development environment using Docker which, in the same way that EMR is in the cloud, is brilliant because you can just create an environment with a single click. With Docker, we managed to find an image that was a Spark environment with Jupyter Notebooks. So we used that for local development
of the code and prototyping,
and also the benefit of notebooks.
You can actually write out and explain how you're building something,
which was useful for the clients when we handed this back to them.
And then we took that Spark code and then ran it up on EMR on Amazon.
So you say that you wrote all the code in PySpark. Is this not going back to the old days of hand-scripting data warehouses? I mean, is this not a step backwards, really?
Arguably so.
That's one of the kind of conclusions that I mentioned at the end of that blog series,
is that it's technically possible to do it like this.
Should you build your whole ETL platform on a bunch of scripts?
Probably not.
I think something that's interesting about the way that technology is changing
is that you can take a much more granular approach to how you build things.
And for something like this, do you need a full-blown ETL system for it?
No, because it's very simple what's being done.
And if you can then reduce your costs by doing it on demand through EMR, that may well outweigh the long-term maintenance and support benefits that you'd get from a more full-blown ETL tool. But certainly you wouldn't write all of this stuff by hand at large scale. People have done that for years and years, and I started off doing that before ETL tools really existed. You end up with an absolute maintenance nightmare, very dependent on local staff who know the systems.
And it's difficult to scale.
Okay. I mean, let's be clear: looking at the blog posts, it was a prototype, it was a proof of concept, and therefore a lot of these things happen in scripts and so on.
It was also about understanding the technology. Certainly for me, before you adopt a tool that's going to generate code under the covers, you want to know what's going on with that technology, so that you can support it, or simply, from a performance point of view, understand whether it's going to scale, or validate the technology choice that underpins some of the more automated platforms you might get.
Okay. I mean, certainly, as you and I both know,
tools like Oracle Data Integrator, I'm sure, you know, other tools as well,
will generate PySpark code.
So I guess, as you're saying, you know,
it's about understanding how it works at a low level and so on.
But again, what is interesting working in this kind of area
and seeing, I suppose, from the product side,
seeing kind of like how products are built with this,
it's generally the case now that people do code this stuff, don't they, as opposed to using ETL tools? It's quite rare in practice, I think, to still see ETL tools in a big data environment. Maybe that's a maturity thing, I don't know.
I think it is a maturity thing. ODI, Oracle Data Integrator, is now catching up and, as you say, can generate PySpark code, and in the latest releases you can now do Spark Streaming and so on. But those technologies at root have been out for several years now. So for companies that are looking to adopt this new technology as it comes out, to take advantage of its benefits, you have no option but to write it by hand. But I think in the long run a managed approach to generating your code is always going to win out at large scale. Sometimes, though, that long-term payoff is countered by the short-term benefit of
being able to take advantage of the technology immediately.
Yeah, exactly.
Exactly.
And I think it's classic.
People very rarely do prototypes using ETL tools.
If you can script it in five minutes, you will do that.
Yeah.
And as the project gets more mature and you have more developers in the project,
that's when ETL tools are useful.
I mean, the other part is cost as well. Presumably this didn't cost an awful lot to get running, whereas most of the ETL tools out there that will work with big data are quite expensive. How much of an issue was cost in this, really?
One of the drivers for the client in wanting to look at these open source technologies was their current license commitments. So we looked at ODI; it's got a big data licensing option, which the client were less keen to adopt. It wasn't that they ruled it out, but "can we do this without it?" was one of the premises for the prototype.
So we'll get onto that; I'd like to talk a bit later on
about some of the new ETL technologies
and things coming out of Amazon, like Glue, for example.
I mean, that's quite interesting.
But you mentioned in the blog post about using Spark SQL.
Where did that come into it,
and what role did Spark SQL perform
within the kind of ETL process?
So we used PySpark for the set-based processing for joining the data sets and reading them in
and writing them out and so on.
But we used Spark SQL, simply because SQL is my language background, for immediately inspecting the data sets as we were building and preparing them. So it's the best of both worlds: you don't have to write the data out and then query it separately to check that the data conditions have been validated; you can actually do it in flight, as part of the code.
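(A sketch of that "inspect it in flight" pattern, carrying on from the hypothetical enriched DataFrame above: register it as a temporary view and check a data condition in SQL before anything is written out. The view and column names are assumptions.)

```python
# Expose the in-flight DataFrame to Spark SQL.
enriched.createOrReplaceTempView("enriched")

# Validate a condition mid-pipeline -- for example, that the left join
# to the reference data didn't leave any rows unmatched.
# (product_name is assumed to come from the reference side of the join.)
spark.sql("""
    SELECT COUNT(*) AS unmatched_rows
    FROM enriched
    WHERE product_name IS NULL
""").show()
```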
Okay, so once you'd processed it using Spark, and I guess Spark was the data processing layer in this, you said in the blog that you loaded it into Redshift. How did that go, really?
Yeah, so one of the things that I found fascinating in doing this was that for the past fifteen years or so that I've been working in this, you always process data and then you write it to a database. Whereas with this new set of technologies, that's no longer the default pattern. You can actually look at writing it out, which is what we did, into S3. And then from there you can load it into Redshift, which is one thing that we did, but you can also look at querying it in place, which I found fascinating. In terms of Redshift, loading it from S3 was very, very simple: it's a COPY statement with a simple DDL for the table, and it just sucks it all in. So it works very well within that ecosystem.
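(The load Robin describes is the standard Redshift pattern: a CREATE TABLE and then a COPY straight from S3. A hedged sketch driven from Python with psycopg2; the cluster endpoint, credentials, IAM role, table DDL and bucket path are all illustrative assumptions.)

```python
import psycopg2

# Assumed connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="example.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="***",
)
cur = conn.cursor()

# Simple DDL for the target table.
cur.execute("""
    CREATE TABLE IF NOT EXISTS enriched_sales (
        product_id   INTEGER,
        sale_amount  DECIMAL(10,2),
        load_ts      TIMESTAMP
    )
""")

# COPY pulls the CSV files straight in from S3.
cur.execute("""
    COPY enriched_sales
    FROM 's3://my-bucket/output/enriched/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
""")
conn.commit()
```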
So certainly, for me, this ability to spin stuff up on demand, particularly for things like ETL jobs, is a bit of a paradigm shift. And it certainly avoids the issue of hardware either sitting around not being used half the time, or being under strain when the ETL routine runs and the queries then run slow.
And did you find, for example, Robin, that this suits certain types of workload better? So, for example, elastic provisioning is quite good for ETL, but you'd maybe want a more permanent setup for your queries. What do you think on that?
Yeah, so as you say, ETL is the obvious one, particularly if you're doing periodic stuff. If you're doing stream processing, you wouldn't have that spinning up and down. But if you're doing once a day, twice a day,
or even every hour or something,
rather than having your capacity sat there idle,
or as you say, having to kind of over-provision your hardware,
it just makes so much sense.
And it was interesting with Redshift: loading the data into that, you still have to have it running. So this idea that you can now decouple your compute from your storage, from your querying, was kind of a revelation with this prototype, actually seeing it in practice.
Yeah, definitely.
And I mean, we'll get on to that now. So again, as part of the blog post series, you talked about,
you evaluated different query engines. I think you looked at Presto, Hive on Tez, and that sort
of thing. I mean, again, just walk us through at a high level what the exercise was about and the technologies you were testing out
in that.
Yeah, so part of the scope for the prototype was looking at: can we query this data? Once we've processed it, enriched it and stored it, what do the analytics on it look like? So we took the existing queries and made sure that we could run those against that data.
So one of the options was,
well, we'll go and load it into a data warehouse and cloud data warehouse.
So we'll stick it in Redshift.
That's the obvious choice.
But the other one which was interesting to me was,
well, let's just write it to S3,
which is kind of your long-term storage.
You're not actually paying for any compute
to sit against that.
You're just paying for your storage.
And so we tried it with Presto.
We tried it with Hive on Tez, as you say, because they were very easy to provision as part of the EMR cluster.
So this was kind of one of the side bits
that was interesting to the project,
which was working with all these open source tools
in different ecosystems.
So on Amazon, it's very easy to provision Presto.
Hive on Tez is there by default.
Other stuff I wanted to look at was Impala and Drill,
but those you have to kind of install and configure yourself.
So it just adds that additional friction to it.
And it was a time-boxed exercise; they would have been good to get to, but where the friction comes in, you think: well, that's fine, we'll stick with the options that are there.
And Presto was very interesting,
and it worked well enough for it to be a plausible option.
The response times were longer than Redshift.
But as with the Spark stuff,
there was a lot of performance optimization that we could have done.
We did move the data into ORC format, which is the recommendation, but didn't do any partitioning and all the rest of it. So the times were fine if you wanted to have your data in S3 long term and do periodic analyses against it, or if you don't mind setting your query running and coming back to it after lunch; but you wouldn't use it for your ad hoc, low-latency querying.
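(The ORC conversion Robin mentions is a one-line change to the PySpark write in the earlier hypothetical sketch; the partitioning the team deliberately skipped is only slightly more. The date column used as a partition key here is an assumption.)

```python
# Write the output as columnar ORC rather than CSV ...
enriched.write.mode("overwrite").orc("s3://my-bucket/output/enriched_orc/")

# ... and, had the exercise gone further, partitioned by date, so engines
# like Presto can prune partitions instead of scanning everything.
enriched.write.mode("overwrite") \
    .partitionBy("sale_date") \
    .orc("s3://my-bucket/output/enriched_orc_part/")
```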
Yeah, definitely. As you said, the major paradigm shift here is the fact that you store the data in one format and query it with lots of different tools. And the fact that S3 can be a storage layer just like HDFS is interesting. You touched there on storage formats as well, and something I certainly found
is, again, compared to the Oracle world that we came from, which was, you know, everything is
stored in tables, it's easy to query.
You mentioned ORC there, there's Parquet and that sort of thing.
I mean, how much work do you think is involved in getting that performance just right? Because we saved time there with EMR, but it sounds like it would take a lot of work to get the storage formats working properly. What were your thoughts on that?
Well, yeah, this is part of the devil in the detail of all this stuff. Conceptually it's great: you write it out to S3 and then you can query it with many different engines, and as you say, the open data format is just mind-blowing if you're coming from a proprietary database background, that you can try a whole bunch of different tools and see which one fits best. But I think there's always going to be engineering that you have to do on top of it. And in Oracle, it's not that you don't have to do it, it's just that it's very well established what you do: you partition it, you index it, you use parallelism and so on. Whereas these are such new technologies that all of those, I'm not going to say best practices, but recommended approaches, are still evolving and people are still figuring them out. And the technology changes so frequently that a document you find from last year that says "this is the best way to do it" could well be obsolete. So they bring a great deal of power, but you have to know what you're doing to take advantage of it; otherwise you may end up with something that's just not as good as what you could have done if you'd stayed within the safe world of proprietary databases.
Yeah, it's interesting. I guess with the move to cloud, with the move to Hadoop and analytics as a service,
the thought is that the need for skills and consultancy will go away, but it strikes me that there's actually quite a lot of need for understanding and skills around the nuances of building these kinds of systems. So, I suppose, separating storage from query: what does that mean in terms of how we design things and how we do things, for you?
I suppose it means that you can be a bit more refined
in how you design things.
So you don't have to have the default
that we're going to load it into Oracle at the end of it.
I think in terms of things like Kafka,
when you get onto streaming platforms,
you can do your transformations in different places.
You could do it in Spark,
you could do it in Kafka streams or something like that,
and then worry about how's it going to get consumed afterwards.
So it's this decoupling of the processing that's important.
Yeah, and you said right back at the start that one of the drivers for this piece of work was to see, in this case, whether they could get rid of Oracle, and you can insert in there any kind of database, Teradata or whatever. So what do you think on that? Is it a case of "yep, absolutely, we'll use Drill, we'll use whatever"? Do you think these new technologies are a replacement for Oracle and so on, or is it hype? What's the nuanced Yorkshireman view on this, really?
The nuanced Yorkshireman... I think there's a large chunk of analytics work that can just be
completely replaced, because I think the tools are mature enough now. I think the interesting bit is around how you get the stuff in there: do you write it by hand, or do you use a tool such as ODI to do all of that? Not only the transformation definitions, but the orchestration and management. That's the bit that you still need to build and support and scale out, and have staff who can support it. But in terms of the platforms where you run and store this, I don't see a great advantage in sticking with the old stuff to the extent that people do at the moment.
I think there's still stuff that's always going to sit on it, and obviously OLTP workload is a different question. But for the big analytics stuff, like the stuff we did here for this client, it was very simple work that was providing them great benefit from the data. It doesn't have to be complex, and it just moved so easily onto this.
Yeah, definitely.
I mean, in the last edition, when Tanel Poder was on the show, he had a good rule of thumb that I thought was interesting, which was: data that originated somewhere else, that's a great thing to put into Hadoop. So if you've generated the data in sensors or in a transactional system, and you then want to query that data, it's great to put it into Hadoop. If you want the transactional integrity, if you want all those kinds of features, that's what you want Oracle for, and so on.
But certainly, what's your take on Drill?
I mean, you've been quite an advocate of Drill.
You've been using it quite a lot.
I mean, you said you didn't use it on this project because it wasn't able to be provisioned in EMR easily,
but what's your take on Drill, really? Is that the end of formal data warehousing work, or what?
I don't think it's quite so much that.
I think it enables you to query data
without having to load it into a database first.
And SQL is what I've worked with for so long that you look at a problem
and you start breaking it down
into select and group by clauses
in your mind automatically.
And so being able to take a flat file
that someone's given you,
like a JSON or whatever,
and be able to query it from your hard disk
is just fantastic.
So yes, it runs distributed and clustered, and at huge scale as well. And I heard the podcast you did about Drill, which was really interesting, with the comparisons to Impala and so on. But I've been using it on a much more modest scale. I'm doing some work for a client where you need to do some data wrangling and work out what these data sets look like, how we can join them, and so on, and being able to run that from your laptop, without having to define your schemas first, is fantastic. Because with some of the stuff in this prototype, where the data is in S3, using Hive on Tez or using Presto you still have to go and create the external table. And they're such simple columns that there's no reason why you should have to do that; it's obvious, if you look at the file, what it is. Which is just what Drill does: it looks at the file and figures it out.
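(For context, this is the boilerplate Robin means: before Hive on Tez or Presto can see the files in S3, you declare a table over them. A hedged sketch, issued here through Spark SQL with Hive support enabled; the table, columns and path are illustrative assumptions. Drill skips this step entirely by inferring the structure from the file itself.)

```python
# The DDL you'd have to run before Hive or Presto can query the S3 data --
# exactly the step Drill makes unnecessary.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS enriched_sales_ext (
        product_id   INT,
        sale_amount  DECIMAL(10,2),
        load_ts      TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://my-bucket/output/enriched/'
""")
```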
Yeah.
So it's a massive time saving.
Yeah, exactly.
I mean, certainly with Spark and Drill there are some interesting things there. You and I know a product called Endeca that we worked with for a while, and Endeca was about data discovery and "no data left behind" and all those kinds of slogans. But Drill as the new Endeca, as Hadoop's version of Endeca, is interesting, isn't it?
Because, you know, you can sit there with a copy of Drill
and you can kind of reach out to all these different data sources
and you can query them in place.
You can reach out to, say, Hive or even Oracle or something
and bring in data, you know, as reference data.
I see Drill as being a new form of BI, really. You've got Impala, you've got all those MPP-style engines there, but Drill is a very interesting data discovery technology as well. Have you been finding that in the way you've been using it?
Yeah, I think it's exactly that: just being able to poke around in the data. The whole data wrangling side of things is not the sexy side of big data work, but it's what you end up doing an awful lot of the time. Simply: what does this file look like, how can I match this one up with that one? And being able to do that data discovery with what you've got.
Yeah. And I think with Spark SQL, I mean,
that's in a way the new data federation. If you think about what you can do with Spark SQL and the Hive compatibility: you mentioned earlier on that you used Spark SQL to bring in some reference data and so on. But going back to our days of working with tools like Oracle BI, the fact that you can bring stuff together within Spark, reach across and join data together from different sources, that's interesting as well, isn't it?
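(A hedged sketch of that federation idea in PySpark: reference data pulled over JDBC from a relational database and joined, in Spark, against flat files sitting in S3. The connection string, table and columns are illustrative assumptions, and the appropriate JDBC driver would need to be on the Spark classpath.)

```python
# Reference data fetched from a relational database over JDBC ...
ref_db = spark.read.jdbc(
    url="jdbc:oracle:thin:@//dbhost:1521/ORCL",  # assumed connection string
    table="SALES.PRODUCTS",
    properties={"user": "report_user", "password": "***"},
)

# ... joined against data sitting as flat files in S3, with no
# intermediate load of either side into a single database.
sales = spark.read.csv("s3://my-bucket/output/enriched/",
                       header=True, inferSchema=True)
federated = sales.join(ref_db, on="product_id", how="inner")
federated.show(10)
```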
And it's all free as well.
Yeah, exactly, it's all there, it's all free to play with. And with the open data formats underneath, you're not backing yourself into a corner with your tool choice: you can try something out and then just try a different one against it, and mix and match to get the optimal combination for what it is you're trying to do.
Okay, so that all sounds brilliant, but you used a phrase earlier on: "the devil's in the detail". So this all sounds fantastic, but what's the catch, really? I mean, compared to, say, Oracle, which has these regular, very predictable releases, Hadoop technology, all this stuff, is just releasing all the time. How does that play out with customers, and what do you think of that?
The devil's definitely in the detail
a lot of the time. It was a two-week project that we did, and I'll certainly have spent a day of it fighting with Java dependencies and library versions. And there was one timestamp problem where it was literally a point-one difference between code bases, and timestamps were suddenly written out in epoch format instead of character format, stuff like that. Not in a million years would Oracle ever release that kind of breaking change without it being a big thing. So Oracle gets mocked for slow releases, and arguably it's a shame that they're slow to do things, but generally when stuff comes out, it works, everyone knows how it works, and there's lots of education and advocacy around it. Whereas the pace of change of the newer technologies is so great that just keeping up with what they do is a challenge, let alone knowing in detail how each one works and how best to take advantage of it. There's a lot of work that has to be done each time you come to use it, rather than simply "well, I know Oracle, therefore it's just rinse and repeat". It's much more "I know that conceptually this bit can be used with this bit, but do they actually play nicely together?"
Okay. It certainly struck me, reading through your posts, that this sits somewhere in between. In the old days Hadoop was something very complicated; I remember you and I in the past spending ages spinning up clusters and wiring it all together. And it struck me that this was in between that: you had EMR there, which was, I suppose, the easy part, but there's still a lot of mucking around as well.
I don't know if you saw the blog post I did on kind of Google BigQuery. I mean, that again is interesting.
It's even fewer things to wire together. Did you read that post? What are your thoughts on where Hadoop as a service is going, and how it can get simpler?
I think, as you say, the more simple the user interface,
the less mucking around with configuration,
stuff like that's going to make it more accessible.
The idea of going and installing a CDH cluster or something like that, I suppose it makes sense if you've got the hardware and all the rest of it, and you need to maintain absolute control over it. But if you can get the same functionality at the click of a button and have someone else worry about it, then why would you? There's something in wanting to understand what's going on under the covers, but sometimes you just want to get in your car and drive somewhere without needing to know how the piston engine works. So I think more and more it's going to become commoditized. Your blog post referred to it as the "Gmail moment", and I liked your opening to that, saying that back in the nineties of course everyone ran their own email servers, or, like me, you ran your own email server, which I never quite got to. But certainly you do this stuff because the technology is interesting; like you say, a few years ago we were working on Hadoop and building our own clusters because it was so new, and that was the only way you could get into it. Whereas now, why would you, if you can just have it provisioned and configured automatically as well?
Yeah, yeah, definitely.
And I think that's a bit of a message to consultants out there, or a lot of you guys, because I'm no longer in that sort of area. A lot of people spend a lot of time fixating on getting Spark working, or building out clusters, and that is in a way a solved problem now. It's like building on layers of abstraction: from my perspective, things like how to get a cluster working, scaled out, reliable, how to get a data processing layer, these are things we shouldn't still be spending ages on. Where it gets interesting is how we then "leverage" that data, as our American friends would say, how we build predictive analytics on it; we shouldn't have to spend all this time just getting clusters working. And I suppose one example that takes this to the extreme, which I think I tweeted about this week, is Amazon Glue. Just for anyone who didn't read that, Amazon Glue is Amazon's
take on ETL, and it's very interesting. It's using machine learning and artificial intelligence to look at the data in the data set you're working with and predict things like which transformations you should do. Very interesting, really. And again, the amount of time that people like you and I spent in the past building ETL routines and just wiring column A to column B together: I think we're going to move on from that. Or is it just pie in the sky? What's your take on that, Robin?
Well, I looked at Glue and it
looks fascinating. I can't wait to get my hands on it, because when I first saw it I glanced at it and saw a screen full of code, and I thought: well, if it's just generating code, and I can write code, aren't you just ending up with the same kind of liability that you have from writing it by hand? But then, reading a bit more closely and seeing that it does this cataloging of your data sets, and automatic categorization, and the transformations, but with access to the code underneath, and the orchestration and management of it: yeah, it looks absolutely fascinating.
It is, yeah. If it works, it sounds fantastic. I know Oracle is working
on similar things as well. So, at the end of this piece of work you did for the client, and I appreciate it was more of an evaluation, what were the final thoughts to take away from it, and what were the client's next steps with this?
Well, they were really excited to see what could be done, and what we put together in the time frame. I think they were very open to this idea that you don't have to have a relational database to do your ETL work in, that you can actually just do it on demand, and that you can store the data without necessarily storing it with the compute attached; you don't have to load it into a data warehouse.
So I think they're going to kind of hopefully look to do more of that in the future and
move some of their analytic workload into that.
So I suppose the reality check is that the majority of projects still going on are still around Oracle and ETL tools and so on. But I would imagine the bulk of the new enquiries you guys are getting are around this sort of technology, and trying to see the value, and, I suppose, trying to see how it would fit in with what they're doing and whether it's worthwhile for them.
Yeah, definitely.
And I think in terms of maturity, people are less skeptical about it. I know a few years ago people were dismissing this as just hype. I think people have accepted that it's actually here to stay, that it's not just flash-in-the-pan stuff, that it's got, if nothing else, big cost savings to bring, and at best a great deal of flexibility and agility to give people. But larger companies with more established IT departments are still looking at how they can take what they've got at the moment and take advantage of this stuff.
Okay. One final thing on here: I noticed on the blogs that you looked at QuickSight as well, Amazon's new BI tool. But did you try connecting Oracle Data Visualization to this as well? Did that work, and what were your thoughts on it?
Yeah, so QuickSight was something that, for this particular project, we couldn't use for various reasons, but we'd used it beforehand. So in this project we used Oracle DV Desktop. It's got direct support for Presto and Redshift, and it worked great for completing the end-to-end picture of source data, transformations, storage: we proved that we could do the analyses in SQL, and then proved that you can use it from a client-facing tool as well.
So, Robin, 2016: an uneventful year politics-wise, obviously. Not much happened over in the UK and the US. But certainly a lot happened in the world of BI: there was the Gartner report, there were lots of new software releases, and some interesting things happened around analytics. So I've got five questions, five areas, I want to go through with you, to get your opinion on what happened in 2016, and then we'll go on to what you think is going to be worth looking out for in 2017.
So, first question, Robin. Oracle's BI focus, I think it's pretty fair to say, has shifted this year from enterprise BI tools like OBIEE 12c, and enterprise BI software like the BI Apps, to DV Desktop, Data Visualization Desktop. What do you think on that? Do you think it's the way of the future, or is it Oracle's last desperate throw of the dice to stay relevant in the BI market?
That's a good question.
I think DV desktop is something that they had to do.
And I think it's actually really interesting to see the rate of development around it and
the rate of releases and what they're doing with it, compared to Oracle BI, the server-based one, which has a fairly slow release cycle; obviously it's good that it's stable. DV Desktop, I think, releases every couple of months, and the stuff they're doing with the plugins around it as well, to make it extendable with an API, I think is really interesting. So I suppose the question is: how are they going to bridge the two? I don't see DV Desktop replacing the main OBIEE, but will they manage to transition people from DV Desktop into it, or will it just end up filling the same role, where you end up with single users doing their data stuff locally and losing out on the benefit of the enterprise view?
Yeah, it's interesting, isn't it?
I mean, talk about life going in kind of full circles, really.
So just for anyone who's not familiar with Oracle's desktop BI tool line-up: Oracle BI 12c is the latest release of Oracle's full enterprise, end-to-end BI platform, and it's something you and I are probably quite famous for. And there's a saying, isn't there, that technology reaches the point of perfection just as it becomes obsolete, and the great irony with OBIEE 12c is that it's a fantastic platform, but the mood and the shift in the market is more towards desktop BI tools now. And DV Desktop, Data Visualization Desktop, is a bit like, remember Discoverer from years ago, and tools like that? They were very much desktop BI tools; they had their advantages, but they were also silos of information. But certainly, DV Desktop is interesting, isn't it?
And is it something that you and the guys at Rittman Mead are using a lot now? Is it your primary BI tool, or what, really?
I wouldn't say primary, but it's something that's very relevant when we're talking with clients. Obviously, if it's an Oracle shop, then you'd rather be using that than a Tableau or whatever as an alternative desktop tool. And the stuff they're doing with data flow within it, I think, is equally interesting. I suppose the rate at which they've developed something which runs on the desktop (under the covers it's still the same server processes, but it's all encapsulated to run on a local machine), but then adding in this additional transformation stuff. Yeah, like I said, how are they going to bridge that back into the main product, or are they just going to end up with two separate offerings?
Yeah, I mean,
so just explain to us what the data transformation thing is in there. That's in the new release,
isn't it, of DV Desktop? So what's that, really?
Yeah, so that's where you can take multiple data sets and apply transformations to them, as you would in an ETL workflow, aggregate or filter your data, or join between the different data sets, to produce a final data set against which you build your visualizations.
Yeah, I suppose in a way this is the way the market's going, even if people like you and I would say the pendulum has maybe gone too far one way. Do you think that or not?
I think so, yeah. And in an ideal world you'd take something like that and give it to your tech-savvy business users, who know the data: they could build out and explain how the stuff transforms and combines, prototype the visualizations, and then give that to the enterprise department, who can then formalize it and build it into a supportable, enterprise-grade ETL process. That's the perfect world. But you risk going back to the days of people building stuff in Excel macros, with only specific people in the department knowing where that data came from or how to support it, and that's the worst of all worlds.
Yeah. I'd be interested to see also how Oracle sell it
as well, because the model behind tools like Tableau, for example, is to sell one license into a department; they call it "land and expand". Funnily enough, I downloaded the trial version of Tableau this week, for testing out on some work I'm doing at the place I'm working at the moment, and I got a phone call the next day from one of the reps, a very nice guy, but he was going to sell me this one-seat license. I can't imagine somehow Oracle doing that. It's interesting: I'd imagine the paradigm shift and the rethinking from the Oracle salespeople will be quite significant there.
Yeah, and I try not to get too close to licensing, because it's always a bit of a minefield, but from what I've understood, you can't actually buy a DV Desktop license as such; you're permitted to use it as part of a DV license in the cloud, or DV as part of OBIEE on premises. I guess that's a deliberate decision, and maybe it's a nuance that I've misunderstood, but you can't just say "I think this is great, I want to license it for a few accounts"; you actually have to
license more than that.
Yeah, definitely. But I think one thing, and I don't quite know my own opinion on it, is that I'm surprised at how good DV Desktop is, really. As you said, the rate at which plug-ins are appearing; I think the development team are definitely all-in on this. I've been using it quite a bit, and it's a good product. But it's interesting: I suppose, in a way, has that horse bolted, or is this a necessary thing to do? I don't know. It's interesting, isn't it?
I mean, top marks for Oracle doing it,
but how well they'll sell it, I don't know, really.
Yeah, and I suppose how much of the functionality
will come back into the main product?
Is it something where they can use it as a kind of prototyping vehicle, to see how features take to the market, and then migrate them into the enterprise stack?
I don't know.
I think also where there is potential for Oracle to do something very interesting
is in linking it back to the full enterprise suite.
And certainly, I mean, I was at an event recently and I was with one of the Gartner analysts
and certainly within Gartner, I guess within a lot of analyst firms,
there's certainly a lot of different opinions about the value of, you know,
what they call kind of bimodal development and so on there.
And I think any vendor that can link together these desktop tools
and the kind of curation and IT adoption of things as well is going to do well.
And I think Oracle, if they do get that link between the desktop tools
and the enterprise kind of side worked through,
and in terms of metadata curation and so on,
that could be really interesting, couldn't it?
Yeah, definitely.
Yeah, definitely.
Okay, so next question for you.
Okay, so citizen data scientists.
You must have heard that phrase out there.
Yes.
Okay.
Is that an exciting new paradigm,
or is it this year's marketing bollocks, as we say in the UK?
As we say in the trade.
Citizen data scientists.
It depends how you define data scientist.
I mean, there's always been users in the business
who kind of, they know their way around technology,
usually Excel or maybe a bit more than that,
and they understand the data
and they know how to kind of
apply appropriate analyses to it.
But if you take "data scientist" to the extreme of advanced analytics, predictive modeling and full-blown statistician, then yeah, that's bollocks. But tooling that supports making something other than a highly curated and governed model available to end users is good. Users want their data, and the value there is to be had from that data, so letting them work with it in different ways is good. But yeah, there's a certain element of hyperbole about it.
Aspirational, yeah. Forward-looking. Directional, I think, is the phrase we use in product management.
I think there's a couple of things in that that are interesting.
So first of all, citizen data scientists: to think that we will all become statisticians, well, there's a lot of mathematics, a lot of stats knowledge, needed to do that well. But the aspiration for people to do more than just look back at what's happened in the past, and to use stats, machine learning, deep learning and so on to get competitive advantage: I think that's a genuine thing, and whether or not the tools enable it yet, it is a driver and a demand from users now.
Yeah, and I think there's the danger with stuff like that that people have to understand what they're doing.
So does the tool dumb it down so much
that it becomes slightly meaningless
or could be done by the tool anyway?
And what role does the person have in that?
And I've done some work with a colleague of mine
who you all know, Jordan Mayer, who is a data scientist: he understands the maths, understands the stats. And in working with him, and starting to dabble in this kind of stuff, you realize how much you don't know, and how wrong you can get things if you put the wrong interpretation on the data. It's one thing to say "how many cans of baked beans did I sell last week?", but it's another to build a predictive model that's supposedly 80% accurate, when, if you'd tossed a coin, it would have been 50% anyway. Do you know what I mean? You've got to understand what you're doing. So whether you can make that accessible to an end user, a citizen data scientist, in a way that they can use it, I'm not sure.
Yeah, it's interesting.
I mean, certainly now we've "had enough of experts", as the famous politician said in the UK. In the days of data mining it was common to say it's too dangerous to put this in users' hands, because they can make the wrong decision and so on. I think that's a lazy thing to say, because, yes, obviously there's a lot in there around confidence factors and using the right model and so on, but the challenge to us in BI is to say: well, how can we go beyond that? I often refer to BeyondCore as a vendor that I think has an interesting vision of where this stuff can go: automating it, making it as easy as possible to get these insights, whilst also, like you said, understanding that it's easy to come to the wrong conclusion. I think it's lazy to say that we shouldn't put this in people's hands, but it's also slightly hyperbolic to say that it's now possible, really.
Yeah, I think that's probably the case.
Okay, so next one for you.
So source control. You've done, again, a lot of very detailed posts about source control
and kind of automated builds and kind of automated deployments
with OBIEE and tools like that.
So is this applying engineering rigour to tools like Oracle BI, or is this just perfecting the steam engine? Why are you doing this? Surely this is pointless with this old technology.
Because as long as people are doing development work, they should be doing it right. And that's not just from a puritanical, I-don't-like-to-see-things-done-wrong view; it causes an awful lot of trouble when people don't do it right. Simple stuff like source control: if you don't have source control, you're screwed, because sooner or later you're going to lose a file, or deploy the wrong version of a file, or you're going to go on holiday and someone else can't find the right file. So it's simply taking that, and then taking it a stage further: how can we use it for concurrent development? And then you need to understand how the particular software works. So it's a necessary evil, in a way. Yeah, it has to be done.
I was being slightly devil's advocate
there. But certainly, how much do you still see people going on site and finding RPDs numbered, and that's your version control? Is any of this sinking in?
For me, something I realized was that I'd started off with the concurrent development problem, which was: in OBIEE, how do you do concurrent development without using the one provided by Oracle, which was slightly unsatisfactory? So I wrote about that, and I talked about that, and it didn't seem to make much impact; people said "oh yeah, that makes sense", but no more than that. And then you actually go to clients and speak to them, and as you say, they're naming their RPDs "version one", "version two" on a network share. So something like concurrent development arguably isn't for everyone, even if it would be useful. Places are so far off being anywhere near able to do it that the basic stuff, like "use source control", is the message everyone's got to take heed of first. Then you can look at getting a bit more mature in your approach, automating your deployments, and once you've done that, then you can say: now let's do concurrent development, and all the flexibility, agility and scalability of your development effort that that entails. That's great, but trying to leap straight to that when people can't even do source control is a step too far.
Yeah, exactly. So I guess, as a plug for what you're doing at Rittman Mead: you're driving a lot of utilities and accelerators in that sort of area. What have you been doing there, really?
Yeah, so we've been trying to work out what makes sense for our clients: do we build a solution for this and say "this is how you do it"? But we've actually found, in speaking to clients and going and implementing this stuff with them, that because what you're interfacing with a lot of the time is enterprise change management processes and release teams, a one-size-fits-all solution just doesn't work; you start mandating too much upon them, and they say "yeah, but we just don't do things like that". So instead we're breaking down the process, and that's what the recent blog posts I did were based around: let's understand everything from the ground up, and then we can tailor particular solutions to each individual client and how they work, but with the best practices, shall we say, thrown in. It's "this is how you should be doing it", and then we can tailor it to fit perfectly.
Okay.
And, yeah, definitely.
And the other thing you've been doing as well, and I know you were responsible for this when I was there, is the performance and customer adoption stuff. Is performance still an issue you see these days? Do people still tune things the wrong way? What do you see?
Oh my God, all the time.
Maybe I get a skewed view of it
because I'm also on the Oracle OTN forums.
And the number of times people kind of say, I've got this query and it runs slow.
I've tried building an index on the database and it's still slow.
And just as with concurrent development, which was taking it too far straight away when people first need to learn source control, the same goes with performance: people need to understand why it's slow and where it's slow before they start changing things.
It's a very basic message.
But going and looking at where is your query running slowly?
Is it in the database?
Is it in the BI server?
If it's not in the database, then tuning the database is going to have no impact.
So I think it's got a long way to go.
And I don't know.
It's one of those things that sometimes the development styles taken to this stuff, performance is just an afterthought.
People work with small data volumes and developments.
They're more focused on the functionality than the non-functional requirements.
And then it goes to production and then it falls on its ass.
And then it's kind of, oh, now I suppose we ought to have a look at this.
And then the horse is bolted by then because with big deployments, they're very complex.
And sometimes you
have to say look i'm sorry you've kind of you've fundamentally done this wrong um and that's where
it gets quite painful okay okay so so number four number four question is um the rise of schema on
read schema ssql engines and data lakes you know is this is this all about making bi agile and
helping you know customers embrace this new technology or are things like schema read and
data lakes is it the end of civilization as we know it really what's your what's your take on
I think it's fantastic. I heard the show that you did with Kent Graziano, talking about whether we still need data modelling, and absolutely, 100%, yes, but just not always straight away. We talked earlier about the new technologies, what they enable, and some of the paradigm shifts, and I think this idea that you don't have to model up front is fundamental, and slightly mind-blowing when you've been doing this for a while. You realise: hang on, I don't need to define my table, I can just store this, and potentially all I want to know is the number of rows I've got. I don't care about the different columns within it; simply a count of the rows will suffice. It makes it much easier to get your data in and much easier to start poking around the data. Then, once you start needing repeatable answers out of it and a more formal view of the data, you model it, but only then. So it makes it much faster to get your data in, start storing it and start working with it, and work out: is it even useful? Is there any point modelling it, or does it turn out it doesn't have what we need within it?
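As a concrete sketch of that schema-on-read workflow, assuming PySpark and a hypothetical landing directory of JSON events (the path and the 'event_type' field are invented, not from the project discussed here):

```python
# A minimal sketch of schema-on-read with PySpark. The landing path and the
# 'event_type' field are hypothetical illustrations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# No CREATE TABLE, no upfront model -- just point at the raw files.
raw = spark.read.json("hdfs:///data/landing/events/")

# Often a row count is enough to decide whether the data is worth modelling.
print(raw.count())

# Only when you need repeatable answers do you impose structure on it.
raw.createOrReplaceTempView("events")
spark.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type").show()
```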
Okay, okay. And the last question from my side for this one: you write a lot about breakfasts, don't you? It was one of your big topics on Twitter for 2016, and I think you've now got your own hashtag for the full English breakfast. So, a question for you: English breakfast or American breakfast, which one is the best, and give reasons why.
I might not say it depends. 'It depends' is the consultant's answer. It's got to be full English.
Describe a full English to our American guests, and then I'll describe the American breakfast to our British guests.
Go on.
So a full English is a thing of beauty. You've got to have good quality sausages, good quality bacon; you've got to have good quality everything. Fried tomatoes, fried mushrooms. You've got to have black pudding in there. Hash browns are controversial, but a good addition. Good granary toast or good white toast, baked beans, and you've got to have HP Sauce: you can't have just any brown sauce, it's got to be HP. Oh, and fried eggs.
Okay. And the American breakfast is fairy cakes and fat, from what I've seen. So I think that's actually an easy one to call: a British breakfast is always the best, really. But I'd certainly recommend listeners look at your Twitter feed for the number of breakfast reviews you have on there, and beer as well. So, breakfast and beer: a healthy diet.
Exactly. Good.
Okay, so let's look forward to 2017. These are always interesting to do, and sometimes it's quite hard to put your finger on what you think the interesting things going into the next year will be. A lot of these things already exist, and there are more that you think will catch on. But we talked earlier on, when we did the planning for this, and there were three things you told me about that you thought would be good. The first one is Kafka. You've been talking a lot about Kafka in the past, and it seems to be coming up all the time as a technology to watch. So, first of all, just for anybody who doesn't know the technology, explain what Kafka is at a very high level, and tell me why you think it's interesting for 2017.
So, Kafka. I actually did a presentation on it at the UKOUG just this month, and I've memorised the opening line, which is: Kafka is publish/subscribe messaging rethought as a distributed commit log, which is what they say on all the blogs. But it's publish/subscribe messaging designed from the ground up to be distributed and highly resilient, with guaranteed message delivery. So it does an awful lot of things that position it to underpin data architectures, basically: your data pipeline.
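In Python terms, the publish/subscribe model looks something like this minimal sketch with the confluent-kafka client; the broker address and the 'events' topic are assumptions for illustration:

```python
# A minimal publish/subscribe sketch using the confluent-kafka Python client.
# Broker address, topic, key and value are all invented for the example.
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("events", key="order-1", value='{"amount": 42}')
producer.flush()  # block until the broker acknowledges delivery

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo",               # consumer groups give scalable fan-out
    "auto.offset.reset": "earliest",  # replay the commit log from the start
})
consumer.subscribe(["events"])
msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```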
Okay, okay. So Kafka is an open source product, but there's also Confluent. Where does Confluent come into it? And why do you think they're driving a lot of this interest in Kafka?
So Confluent was formed by the folk who wrote Kafka back at LinkedIn, and they contribute to the Apache Kafka open source project. You've got Apache Kafka core, which is the messaging. You've got Kafka Connect, a really interesting framework they're building around the actual messaging, as a way of getting data in and out much more easily, based on configuration files rather than having to brew your own interface each time. And there's the Confluent Platform, which has its own commercial Control Center that visualises the configuration, the delivery rates and things like that. But the way they're driving Kafka, the way the Apache Kafka project is going, is taking it beyond simply messaging.
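To illustrate the configuration-driven approach Kafka Connect takes, here's a hypothetical sketch: a connector is registered by POSTing a JSON config to the Connect worker's REST API rather than by writing bespoke integration code. The worker address, file path and topic are all assumptions:

```python
# Hypothetical sketch: a Kafka Connect connector is just configuration,
# registered via the worker's REST API (assumed here on localhost:8083).
import json
import urllib.request

connector = {
    "name": "file-source-demo",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "file": "/var/log/app/events.log",  # tail this file...
        "topic": "events",                  # ...into this Kafka topic
        "tasks.max": "1",
    },
}

req = urllib.request.Request(
    "http://localhost:8083/connectors",
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read())
```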
And they've added in Kafka Streams, so you can do stream processing. I saw a very interesting presentation at the Big Data London conference recently where they talked about queryable stateful stream processing: rather than stream processing your data and then landing it in a NoSQL store or something like that, you can actually query it in flight, which I thought was interesting for where it could take your architectures in the future. And then I saw a tweet just this week from Gwen Shapira, I think, talking about the pre-processing that's coming in Kafka, simple stuff on the Kafka Connect inbound side: masking credit card numbers, inserting values, renaming fields and things like that. So it's becoming a whole bunch more than just a messaging tool; it's a platform in its own right.
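The 'queryable state' idea is worth a sketch. Kafka Streams itself is a Java library, so this Python fragment only mimics the shape of the idea, under assumed broker and topic names: the aggregate lives inside the stream processor and can be queried directly, with no external NoSQL store.

```python
# Conceptual sketch of queryable stateful stream processing (not the Kafka
# Streams API itself). Assumes keyed messages on a made-up 'page_views' topic.
from collections import Counter
from confluent_kafka import Consumer

counts = Counter()  # the stateful aggregate, queryable in flight

def lookup(key: str) -> int:
    """What an 'interactive query' against the processor's state looks like."""
    return counts[key]

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "stateful-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["page_views"])
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    counts[msg.key().decode()] += 1  # update state per event
    # Another thread or endpoint could call lookup() here, without a
    # round trip to an external store.
```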
Okay, yeah. I see Kafka as being, I suppose, the ETL equivalent of Hive, in a way. Kafka is there and it obviously does a lot of things, but it's extensible: you've now got the streaming side and the bits you've been talking about there. But you've also got a lot of closed source products adopting Kafka interfaces and APIs as their standard. If you look at, I think it's MapR Streams, that's effectively their own proprietary technology, but they expose it and make it available via a Kafka interface. And I think with the Oracle Public Cloud, the new elastic cloud product, they've got something that's a wrapper, a commercial added-value thing, on top of Kafka as well. So certainly it looks like Kafka is here to stay, albeit as a standard, as a kind of framework, and it certainly seems to be getting that adoption, doesn't it? A key platform-enabling technology for Hadoop, really.
Yeah, definitely. And like the other Hadoop technologies, it's an open format, so simply being able to build systems around Kafka, connecting the different components together and decoupling them in that way, I think explains why it's getting such uptake. One of the things that struck me after looking at it for a while is that it's a streaming platform, but it's not just about streaming; it's also about data integration. Even if you're not building something for streaming from the outset, if you ingest your data into Kafka as it comes in, as an event stream, you can consume it as a batch if you want to, but you can also stream process it as well.
Okay. I always think Kafka is the technology that everybody says they're doing in projects but actually hasn't: it's one of those things you drop in as a very sexy technology. But it's quite hard to get running, isn't it? That's why I think that; it's actually harder than you think to get running. Is that correct, or what do you think?
So, to get a simple Kafka cluster actually running, it's fine. For me, as I said before, the devil's in the detail. It's definitely one of these things where you look at the jigsaw architecture and think, oh, we've got this piece here, I've read a blog about that piece there, and there's supposedly a connector between them. But I recently did some work trying to connect them up. Oracle GoldenGate is a great way of doing change data capture from your database, which can then create an event stream into Kafka, and I was thinking this could be a great way to prototype populating your data reservoir in HDFS from work that's happening in the database. But actually, those three simple pieces don't fit together as it currently stands, because GoldenGate prefixes the Kafka topics with the full database and schema name, and if you try to write that to HDFS, Hive gets upset, because you can't call a table with a three-part name. So it's great, and yes, there are pull requests that work around it, but out of the box this stuff still needs engineering, jiggling around, to get it to actually work. But that's just a maturity thing. I think as a fundamental technology it's here to stay. And I think it's a great piece of kit.
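For flavour, a workaround in the spirit of those pull requests might re-publish each message onto a topic named for just the table. This is a hypothetical sketch, not the actual fix: the three-part topic name, broker address and client library are all assumptions.

```python
# Hypothetical sketch: re-publish messages from a three-part topic name like
# 'ORCL.SOE.ORDERS' (database.schema.table) onto a plain 'ORDERS' topic that
# a Hive table over HDFS can live with. Names are invented.
from confluent_kafka import Consumer, Producer

BROKER = "localhost:9092"
consumer = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": "topic-renamer",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": BROKER})

consumer.subscribe(["ORCL.SOE.ORDERS"])
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    target = msg.topic().split(".")[-1]  # keep only the table name
    producer.produce(target, key=msg.key(), value=msg.value())
    producer.poll(0)  # serve delivery callbacks
```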
So is Kafka your new Logstash, then? I know you've been quite a fan of the Elastic stack, and we've actually got Elastic coming on the show in one of the next episodes. But is Kafka the new Logstash for you?
It's certainly my new love, yeah. Elastic was great, and it was probably the first...
Moved on now.
Yeah, it was the first open source project that I started working with and realised the power of. And I think it's great. And it supports Kafka, so it's the best of both worlds.
And didn't you yarn on about that as well? I remember at the time it was Elastic this, Elastic that; I was always getting an earful of it.
I know.
It was Drill and whatever, really. So luckily I don't have to listen to that anymore, which is good.
So, the next one. You're from Yorkshire, and people from Yorkshire are plain speaking and tell it how it is. Cloud is certainly an area that has had its fair degree of scepticism and hand-waving and all that kind of stuff, but something I've noticed when you and I have talked recently is that you're starting to think cloud can be interesting, that cloud is going to be more than just a word, more than just a sales opportunity. What do you think on that?
Yeah, for me it's been probably the last half year or so, looking at these things and realising that cloud's not just marketing bollocks. Part of the problem is that the marketing, and all companies do it, gets ahead of the actual technology. You try to keep up with the marketing, then you start looking under the covers and think, well, actually, this is just slideware; it's forward-looking or whatever. So the hype kills people's taste for it: oh well, cloud's just rubbish, big data's just rubbish. But in the same way that big data technologies are now maturing, and people are realising it's serious and has great benefits, I think the flexibility that cloud gives is the important thing. 'Cloud' is in a sense a bad way of describing it, because it's one of a dozen different things, but specifically I mean the elastic capabilities and the separation of your compute from your storage from your query. Cloud as a buzzword puts an awful lot of people off, and there's a whole bunch of fuss about it, but fundamentally it's going to change an awful lot about how systems can be built. Some people are going to take what they've got and just lift and shift it, running on virtual servers in the cloud, which is kind of missing the point. But similarly, it's easier said than done to say, oh, we'll just rebuild this using the new technologies; that's not a small undertaking. To properly benefit from it, though, looking to separate out your processing and decouple the parts is going to be a good idea.
So when you go back to your Christmas gathering in Yorkshire, with your whippets and your flat cap, you'll be wearing a T-shirt that says 'ask me about cloud'; you'll definitely have the zeal of the convert there.
Exactly.
So, yeah, interesting.
And I think cloud will, ironically, actually outlast Hadoop as the thing we talk about in a few years' time. With the complexity, the detail and the amount of work involved in getting Hadoop running, in the end, once it moves to the cloud, it's just going to be elastic compute and storage, really, and I think the impact on business models will be massive. Cloud has gone from being something that's sold in a hand-wavy way to something that people like you and I think about quite a bit now, in terms of what it means for how you develop systems and how you work with them. And probably you and I only really realised the elastic thing recently, and the impact that's going to have on development and so on. So, yeah, cloud's interesting.
For me it's particularly the Amazon stuff: it's all there at the click of a button, all the different pieces that you can then work with, and they all interact with each other. Before that, Oracle had BI Cloud Service, which was fine, and you could do stuff with it, but unless you fitted the particular audience for it, it was just like what we've got on-premises, only not as capable. It's moved on since then, but that was the first version of cloud that I was exposed to. Then you start looking at AWS, its capabilities, and the options you have for building out these different permutations of platforms, and it's fascinating and powerful.
Yeah. And the last one, really, is something I read an article of yours about on OTN: the Panama Papers, using graph technology. You've said in the past that graph is interesting. Tell people what it is, and why you think graph technology is something to look out for in the next year.
So, yeah, this is actually an idea that I stole from you, because you'd written about using property graph analysis on Twitter data. I thought, oh, this looks interesting, and so I took the Panama Papers data sets: various parties who have money held in various offshore funds at various addresses. Graph analysis lets you look at those and how they relate to each other, beyond simply the individual relationships. It lets you build out and visualise those patterns, and also run algorithms against them. So you can find out which particular address is used by lots of different funds, but rather than doing that in SQL, where you would simply get a list, counts of them, you can ask: which are the addresses that have lots of funds, and which are also related by another set of properties, such as the users or the countries? It's a completely different way of looking at the data. For me, the visualisation aspect makes it make a lot more sense, but then, as I say, there are also algorithms you can run against it, to come out with the results of a PageRank and so on.
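As a toy illustration of the difference, here's a sketch using the Python networkx library rather than the Oracle property-graph tooling the article itself used; the funds and addresses are invented:

```python
# Toy property-graph sketch: funds linked to the registered addresses they
# share, with an algorithm run over the whole network. Data is invented.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Fund A", "12 Harbour St"),
    ("Fund B", "12 Harbour St"),
    ("Fund C", "12 Harbour St"),
    ("Fund D", "3 Palm Ave"),
])

# Degree answers the simple SQL-style question: which address is used
# by the most funds?
print(max(g.nodes, key=g.degree))  # -> '12 Harbour St'

# PageRank weighs connectivity through the whole network, not just direct
# counts -- the kind of question that's awkward to express in SQL.
for node, score in sorted(nx.pagerank(g).items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```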
Yeah, absolutely. And for me, graph technology, and the article you wrote, and again I'll put the link to it in the show notes along with the things I've been doing with it, is a good example, isn't it, of how you can bring different query and compute engines to bear on Hadoop: the same data set, which might be sitting in HDFS or HBase or whatever, can be queried by different engines. This explosion of different ways of doing things, different compute engines we can use and so on, sort of makes BI interesting again, doesn't it?
It does. You can pick the right tool for the right job. It doesn't have to be: the primary focus is relational analysis, therefore it has to start in a relational database, and therefore we can't do this other stuff, or we'd have to copy it all over. You can store it in these open formats.
Excellent. So, yeah, three really interesting things to look forward to in 2017 there.
And just to round this up, really: I believe you're going to be at the BIWA event in January, speaking on this?
Yes, that's right. I've got a paper on the Panama Papers, and also one on OBIEE performance. And then I'm giving the Panama Papers one at Kscope in Texas in June as well.
Excellent. And I'm actually taking the idea I had with the Twitter stuff before, but adding the spatial side into it as well. Obviously one or two of you might remember the kettle thing that happened to me a little while ago, where my home automation experiments went slightly awry. What I did was capture all the Twitter activity that was happening at the time; funnily enough, that was one of the things that brought the network down. I also captured things like all of the Guardian comments and so on. What I'm going to do in this presentation is show how the posts I put on Twitter at the time, about the kettle not working and so on, started off with a few people, maybe you, retweeting them, and then went viral. It's interesting to see, again with graph technology and network analysis, how certain influencers in that group, by retweeting something, can massively explode the number of people reading and commenting on it. I thought that would be interesting, but I also want to bring in the spatial and time elements. What made me laugh at the time was that something that started off as a very British thing, in The Guardian and so on, was being retweeted around the world, Australia and so on, so given the Oracle product, which is Oracle Spatial and Graph, I thought it would be interesting to see how time and geography affected it as well. And, like you say, just to be able to look at and analyse data like tweets, which is network-like in nature, in different, more appropriate ways, really.
And I think graph is interesting, really. It's definitely, in a way, the slightly less well-known way of analysing data, but once you get your head around it and understand it, it's really interesting. And certainly the article you wrote on the Panama Papers was fantastic; I think the feedback on it was good as well.
Thanks. It was a lot of fun doing it.
Excellent. Okay, well, it's very late now, and I'm conscious that I've had you talking for a long time. So thank you very much for coming on the show, Robin. It's been great to have you on here, and hopefully we'll get you back at some point as well. In future episodes coming up, we've got Elastic coming on as the next guests, I think. But other than that, Robin, thank you very much, have a good Christmas, and see you soon.
Thanks very much, Mark. It's been a pleasure.
Excellent. Thank you.