Drill to Detail - Drill to Detail Ep.14 ‘Cloudy Big Data Paradigm-Shifting Christmas & New Year Special’ With Special Guest Robin Moffatt
Episode Date: December 21, 2016
Mark Rittman is joined in this Christmas & New Year special episode by none other than Robin Moffatt, head of R&D at Rittman Mead and an old colleague from Mark's consulting days, to talk about his experiences with Amazon Elastic MapReduce (EMR) for BI and analytics and how "the devil's in the detail", and to hear his take on what happened in the BI and analytics world in 2016 and what to look out for in 2017.
Transcript
Hello and welcome to the special Christmas and New Year episode of Drill to Detail, the
podcast about the world of big data, analytics and data warehousing, and I'm your host, Mark Rittman. This week, I'm pleased to be joined on the show by Robin Moffatt,
head of R&D at Rittman Mead, and who many of you will know from his blogs, conference presentations
and social media posts, and of course, he's an old colleague of mine from when I used to work there.
Robin, welcome to Drill to Detail, and it's great to have you on the show. Why don't you introduce
yourself properly to our listeners?
Yeah, thanks for having me on the show, Mark. It's a great honour. So I'm the head of R&D at Rittman Mead, and I've worked there for about five years now.
Before that, I worked at a UK retailer on their Oracle data warehousing platform, with OBIEE as well.
Before that, for my sins, I was a SQL Server DBA.
And going even further back, I started off as a COBOL programmer with a DB2 data warehouse.
So I've worked in data and analytics for about 15 years now.
So long enough to have seen cycles come and go, and to be fairly cynical when people proclaim things certain to have died or certain to be the next big thing.
So it's interesting to see how things go.
Excellent. Well, it's great to have you on, Robin.
And what we're going to do in this special extended edition of Drill to Detail
is we're going to do kind of two parts.
In the second half, we're going to look back at 2016, look back at some of the things that happened, get your opinion on a few trends and new products that came out, and also get your views on what you think will be interesting and worth looking out for in 2017 in BI and data warehousing and so on. But what I want to do in the
first half is actually have a chat about a series of blog posts that you put on the Rittman Mead blog
over the last few days about a project you did with a client to evaluate what it would be like
to move their BI system or to look at how you could move their BI system to a public cloud
and adopt some of the new open source technologies like Spark and Kafka and so on.
So I thought it was an interesting set of posts, and I'd like to go through it with you, to get your feel for where the benefits were, what worked, what didn't work, and so on. But just to start off, can you give us an overview of what the project was about, and what you were trying to achieve and understand with this piece of work?
Yeah, sure.
So this client, they contacted us, and they were interested, as you say, in what kind of benefits they could get in moving some of their work.
At the moment, they're an Oracle shop.
They've got Oracle Data Warehouse, Oracle Data Integrator, and so on.
And they have a batch process that loads chunks of data into the Oracle data warehouse.
And they were struggling in terms of performance, in terms of queries going back over long periods of time. So that's where the big data angle came into play: they were wondering whether those technologies could help them. And then also in terms of cost benefits, whether moving to open source tooling, like you say, with Spark, could help them out there. So we did a short proof of concept with them to explore the different technologies and help them understand which ones might be more relevant, and also the pitfalls around them, the additional complications. A lot of the time with these technologies you get a kind of jigsaw approach of "you can use this plus this plus this and it'll be great", but what that sometimes doesn't show are the problems, the tricky bits, in actually implementing it. So that was also included in the scope of what we did.
Yeah, and although you mentioned Oracle in the intro, I think it's fair to say that a lot of traditional BI and data warehousing shops, using any kind of database technology, Teradata, IBM and so on, are looking to see: can we
move this into the cloud?
Can we adopt these big data technologies?
How does it work?
And how much additional manual work and scripting is involved?
So you did this on the Amazon cloud, didn't you?
So you did it on Amazon EMR, is that correct?
Tell us what that is, and why you adopted it rather than, say, running the project using in-house Hadoop.
Yeah, sure. So the clients, as well as being an Oracle shop, were already on Amazon Cloud.
So they ran a lot of their stuff on Amazon's EC2 servers.
So they were already cloud-friendly.
It wasn't a case of convincing them that cloud was a good place to be running this stuff.
So since they were Amazon and we were looking at Hadoop,
Amazon's Elastic MapReduce was the obvious place to be running it.
Okay, so Elastic MapReduce.
Tell us what that is then.
So what's the difference between that and kind of traditional Hadoop?
How does the elastic part come into it?
What does it do there?
So EMR is Amazon's Hadoop as a service, and it's brilliant, because you can literally click on it and it'll provision a Hadoop cluster for you, of any configuration and size that you want. So in terms of installing and configuration, there isn't any. Even with Cloudera's distribution, you have to install it,
and there are wizards, and it's pretty easy to do,
but you still have to go through and click on things,
and if it falls over, go and look at log files,
whereas with EMR, you go to the Amazon web page, you click "I want a cluster of this size",
and it just goes and provisions it.
So it's very, very simple to use.
And the other bit that's quite attractive about it is that you can spin it up and down on demand. So you don't have to build your cluster and then have it sat there, paying for it whether it's on local tin or cloud tin. You can spin up an EMR cluster, run a job, and then shut it down again, and just pay for the time that you're using it.
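(For readers following along, provisioning a transient EMR cluster really is a single API call. Here is a minimal sketch using boto3; the cluster name, region, instance types, job script and log bucket are illustrative assumptions, not details from the project.)

```python
import boto3

emr = boto3.client("emr", region_name="eu-west-1")  # assumed region

response = emr.run_job_flow(
    Name="transient-etl-cluster",
    ReleaseLabel="emr-5.2.0",  # an EMR release current in late 2016
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}, {"Name": "Presto"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the work is done
    },
    Steps=[{
        "Name": "etl-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl_offload.py"],
        },
    }],
    LogUri="s3://my-bucket/emr-logs/",
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```

With KeepJobFlowAliveWhenNoSteps set to False, the cluster runs its step and then shuts itself down, which is what makes the pay-only-for-what-you-use model Robin describes work.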
Okay. So you said in the series of blog posts, and we'll put the links to the blog posts in the show notes for the podcast, that there were two parts to it: there was ETL offload, or certainly doing the ETL in the cloud, and there was a part about querying. Okay, so both of those are interesting, but
let's start with the ETL bit. So my understanding is the client currently uses an ETL tool to do
the work at the
moment on-premise, but you moved it all into Hadoop and you moved it into Spark and so on.
So tell us about that. How did you do it? What was involved and so on?
So we took a very small piece of the ETL work that they do that was simply taking inbound batch
files that arrived every half hour or so and joining them with some reference data that came
from a relational table. And again, for the scope of the exercise, we actually took that reference data as a simple local file. And we built out a project using PySpark,
which would load both the files and then do the necessary joins and enrichments as well
between the data sets, and then write it out to a CSV file. So it was very, very simplistic, but deliberately so, so that we didn't get bogged down in the detail of implementation.
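(To make that concrete, here is a minimal sketch of the pattern Robin describes: read an inbound batch file and a reference file, join and enrich, and write CSV. The paths, column names and enrichment are illustrative assumptions, not the client's actual code.)

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-offload-poc").getOrCreate()

# The half-hourly inbound batch file, plus the reference data taken
# down to a simple local file for the scope of the exercise.
batch = spark.read.csv("/data/inbound/batch.csv", header=True, inferSchema=True)
reference = spark.read.csv("/data/reference/products.csv", header=True, inferSchema=True)

# Join the batch data to the reference data and add a simple enrichment column.
enriched = (
    batch.join(reference, on="product_id", how="left")
         .withColumn("load_ts", F.current_timestamp())
)

# Write the result back out as CSV.
enriched.write.mode("overwrite").csv("/data/output/enriched")
```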
We started off by building a local development environment using Docker which, in the same way that EMR is in the cloud, is brilliant because you can just create an environment with a single click. With Docker, we managed to find an image that was a Spark environment with Jupyter Notebooks. So we used that for local development
of the code and prototyping,
and also the benefit of notebooks.
You can actually write out and explain how you're building something,
which was useful for the clients when we handed this back to them.
And then we took that Spark code and then ran it up on EMR on Amazon.
So you say that you wrote all the code in PySpark. Is this not going back to the old days of hand-scripting data warehouses? I mean, is this not a step backwards, really?
Arguably so.
That's one of the kind of conclusions that I mentioned at the end of that blog series,
is that it's technically possible to do it like this.
Should you build your whole ETL platform on a bunch of scripts?
Probably not.
I think something that's interesting about the way that technology is changing
is that you can take a much more granular approach to how you build things.
And for something like this, do you need a full-blown ETL system for it?
No, because it's very simple what's being done.
And if you can then reduce your costs by doing it on demand through EMR, that may well outweigh the long-term maintenance and support benefits that you'd get from a more full-blown ETL tool. But certainly you wouldn't write all of this stuff by hand at large scale. People have done that for years and years, and I started off doing that before ETL tools really existed. You end up with an absolute maintenance nightmare, very dependent on local staff who know the systems.
And it's difficult to scale.
Okay. I mean, let's be clear: looking at the blog posts, it was a prototype, it was a proof of concept, and therefore a lot of these things happen in scripts and so on.
It was also about understanding the technology. Certainly for me, before you adopt a tool that's going to generate code under the covers, you want to know what's going on with that technology, so that you can support it, or simply, from a performance point of view, understand whether it's going to scale, or validate the technology choice that underpins some of the more automated platforms you might get.
Okay. I mean, certainly, as you and I both know,
tools like Oracle Data Integrator, I'm sure, you know, other tools as well,
will generate PySpark code.
So I guess, as you're saying, you know,
it's about understanding how it works at a low level and so on.
But again, what is interesting working in this kind of area
and seeing, I suppose, from the product side,
seeing kind of like how products are built with this,
it's generally the case now that people do code this stuff, don't they, as opposed to using ETL tools? It's quite rare in practice, I think, to still see ETL tools in a big data environment. Maybe that's a maturity thing, I don't know.
I think it is a maturity thing. ODI, Oracle Data Integrator, is now catching up and, as you say, can generate PySpark code, and in the latest releases you can now do Spark Streaming and so on. But those technologies at root have been out for several years now. So for companies that are looking to adopt this new technology as it comes out, to take advantage of its benefits, you have no option but to write it by hand. But I think in the long run a managed approach to generating your code is always going to win out at large scale. Sometimes, though, that long-term payoff is countered by the short-term benefit of
being able to take advantage of the technology immediately.
Yeah, exactly.
Exactly.
And I think it's classic.
People very rarely do prototypes using ETL tools.
If you can script it in five minutes, you will do that.
Yeah.
And as the project gets more mature and you have more developers in the project,
that's when ETL tools are useful.
I mean, the other part is cost as well. Presumably this didn't cost an awful lot to get running, whereas most of the ETL tools out there that will work with big data are quite expensive. How much of an issue was cost in this, really?
One of the drivers for the client in wanting to look at these open source technologies was their current license commitments. So we looked at ODI; it's got a big data licensing option, which the client were less keen to adopt. It wasn't that they ruled it out, but "can we do this without it?" was one of the premises for the prototype.
So we'll get onto that; I'd like to talk a bit later on
about some of the new ETL technologies
and things coming out of Amazon, like Glue, for example.
I mean, that's quite interesting.
But you mentioned in the blog post about using Spark SQL.
Where did that come into it,
and what role did Spark SQL perform
within the kind of ETL process?
So we used PySpark for the set-based processing for joining the data sets and reading them in
and writing them out and so on.
But we used Spark SQL, simply because SQL is my language background, for immediately inspecting the data sets as we were building and preparing them. So it's the best of both worlds: you don't have to write the data out and then query it separately to check that the data conditions have been validated; you can actually do it in flight, as part of the code.
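(A sketch of that "inspect it in flight" pattern, carrying on from the hypothetical enriched DataFrame above: register it as a temporary view and check a data condition in SQL before anything is written out. The view and column names are assumptions.)

```python
# Expose the in-flight DataFrame to Spark SQL.
enriched.createOrReplaceTempView("enriched")

# Validate a condition mid-pipeline -- for example, that the left join
# to the reference data didn't leave any rows unmatched.
# (product_name is assumed to come from the reference side of the join.)
spark.sql("""
    SELECT COUNT(*) AS unmatched_rows
    FROM enriched
    WHERE product_name IS NULL
""").show()
```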
Okay, so once you'd processed it using Spark, and I guess Spark was the data processing layer in this, you said in the blog that you loaded it into Redshift. How did that go, really?
Yeah, so one of the things that I found fascinating in doing this was that for the past fifteen years or so that I've been working in this, you always process data and then you write it to a database. Whereas with this new set of technologies, that's no longer the default pattern. You can actually look at writing it out, which is what we did, into S3. And then from there you can load it into Redshift, which is one thing that we did, but you can also look at querying it in place, which I found fascinating. In terms of Redshift, loading it from S3 was very, very simple: it's a COPY statement with a simple DDL for the table, and it just sucks it all in. So it works very well within that ecosystem.
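(The load Robin describes is the standard Redshift pattern: a CREATE TABLE and then a COPY straight from S3. A hedged sketch driven from Python with psycopg2; the cluster endpoint, credentials, IAM role, table DDL and bucket path are all illustrative assumptions.)

```python
import psycopg2

# Assumed connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="example.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="***",
)
cur = conn.cursor()

# Simple DDL for the target table.
cur.execute("""
    CREATE TABLE IF NOT EXISTS enriched_sales (
        product_id   INTEGER,
        sale_amount  DECIMAL(10,2),
        load_ts      TIMESTAMP
    )
""")

# COPY pulls the CSV files straight in from S3.
cur.execute("""
    COPY enriched_sales
    FROM 's3://my-bucket/output/enriched/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
""")
conn.commit()
```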
So certainly, for me, this ability to spin stuff up on demand, particularly for things like ETL jobs, is a bit of a paradigm shift. And it certainly avoids the issue of hardware either sitting around not being used half the time, or being under strain when the ETL routine runs and the queries then run slow.
And did you find, for example, Robin, that this suits certain types of workload better? So, for example, elastic provisioning is quite good for ETL, but you'd maybe want a more permanent setup for your queries. What do you think on that?
Yeah, so as you say, ETL is the obvious one, particularly if you're doing periodic stuff. If you're doing stream processing, you wouldn't have that spinning up and down. But if you're doing once a day, twice a day,
or even every hour or something,
rather than having your capacity sat there idle,
or as you say, having to kind of over-provision your hardware,
it just makes so much sense.
And it was interesting with Redshift: loading the data into that, you still have to have it running. So this idea that you can now decouple your compute from your storage, from your querying, was kind of a revelation with this prototype, actually seeing it in practice.
Yeah, definitely.
And I mean, we'll get on to that now. So again, as part of the blog post series, you talked about,
you evaluated different query engines. I think you looked at Presto, Hive on Tez, and that sort
of thing. I mean, again, just walk us through at a high level what the exercise was about and the technologies you were testing out
in that.
Yeah, so part of the scope for the prototype was looking at: can we query this data? Once we've processed it, enriched it and stored it, what do the analytics on it look like? So we took the existing queries and made sure that we could run those against that data.
So one of the options was,
well, we'll go and load it into a data warehouse and cloud data warehouse.
So we'll stick it in Redshift.
That's the obvious choice.
But the other one which was interesting to me was,
well, let's just write it to S3,
which is kind of your long-term storage.
You're not actually paying for any compute
to sit against that.
You're just paying for your storage.
And so we tried it with Presto.
We tried it with Hive on Tez, as you say, because they were very easy to provision as part of the EMR cluster.
So this was kind of one of the side bits
that was interesting to the project,
which was working with all these open source tools
in different ecosystems.
So on Amazon, it's very easy to provision Presto.
Hive on Tez is there by default.
Other stuff I wanted to look at was Impala and Drill,
but those you have to kind of install and configure yourself.
So it just adds that additional friction to it.
And it was a time-boxed exercise; they would have been good to get to, but where the friction comes in, you think: well, that's fine, we'll stick with the options that are there.
And Presto was very interesting,
and it worked well enough for it to be a plausible option.
The response times were longer than Redshift.
But as with the Spark stuff,
there was a lot of performance optimization that we could have done.
We did move the data into ORC format, which is the recommendation, but didn't do any partitioning and all the rest of it. So the times were fine if you wanted to have your data in S3 long term and do periodic analyses against it, or if you don't mind setting your query running and coming back to it after lunch; but you wouldn't use it for your ad hoc, low-latency querying.
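(The ORC conversion Robin mentions is a one-line change to the PySpark write in the earlier hypothetical sketch; the partitioning the team deliberately skipped is only slightly more. The date column used as a partition key here is an assumption.)

```python
# Write the output as columnar ORC rather than CSV ...
enriched.write.mode("overwrite").orc("s3://my-bucket/output/enriched_orc/")

# ... and, had the exercise gone further, partitioned by date, so engines
# like Presto can prune partitions instead of scanning everything.
enriched.write.mode("overwrite") \
    .partitionBy("sale_date") \
    .orc("s3://my-bucket/output/enriched_orc_part/")
```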
Yeah, definitely. As you said, the major paradigm shift here is the fact that you store the data in one format and query it with lots of different tools. And the fact that S3 can be a storage layer just like HDFS is interesting. You touched there on storage formats as well, and something I certainly found
is, again, compared to the Oracle world that we came from, which was, you know, everything is
stored in tables, it's easy to query.
You mentioned ORC there, there's Parquet and that sort of thing.
I mean, how much work do you think is involved in getting that performance just right? Because we saved time there with EMR, but it sounds like it would take a lot of work to get the storage formats working properly. What were your thoughts on that?
Well, yeah, this is part of the devil in the detail of all this stuff. Conceptually it's great: you write it out to S3 and then you can query it with many different engines, and as you say, the open data format is just mind-blowing if you're coming from a proprietary database background, that you can try a whole bunch of different tools and see which one fits best. But I think there's always going to be engineering that you have to do on top of it. And in Oracle, it's not that you don't have to do it, it's just that it's very well established what you do: you partition it, you index it, you use parallelism and so on. Whereas these are such new technologies that all of those, I'm not going to say best practices, but recommended approaches, are still evolving and people are still figuring them out. And the technology changes so frequently that a document you find from last year that says "this is the best way to do it" could well be obsolete. So they bring a great deal of power, but you have to know what you're doing to take advantage of it; otherwise you may end up with something that's just not as good as what you could have done if you'd stayed within the safe world of proprietary databases.
Yeah, it's interesting. I guess with the move to cloud, with the move to Hadoop and analytics as a service,
the thought is that the need for skills and consultancy will go away, but it strikes me that there's actually quite a lot of need for understanding and skills around the nuances of building these kinds of systems. So, I suppose, separating storage from query: what does that mean in terms of how we design things and how we do things, for you?
I suppose it means that you can be a bit more refined
in how you design things.
So you don't have to have the default
that we're going to load it into Oracle at the end of it.
I think in terms of things like Kafka,
when you get onto streaming platforms,
you can do your transformations in different places.
You could do it in Spark,
you could do it in Kafka streams or something like that,
and then worry about how's it going to get consumed afterwards.
So it's this decoupling of the processing that's important.
Yeah, and you said right back at the start that one of the drivers for this piece of work was to see, in this case, whether they could get rid of Oracle, and you can insert in there any kind of database, Teradata or whatever. So what do you think on that? Is it a case of "yep, absolutely, we'll use Drill, we'll use whatever"? Do you think these new technologies are a replacement for Oracle and so on, or is it hype? What's the nuanced Yorkshireman view on this, really?
The nuanced Yorkshireman... I think there's a large chunk of analytics work that can just be
completely replaced, because I think the tools are mature enough now. I think the interesting bit is around how you get the stuff in there: do you write it by hand, or do you use a tool such as ODI to do all of that? Not only the transformation definitions, but the orchestration and management. That's the bit that you still need to build and support and scale out, and have staff who can support it. But in terms of the platforms where you run and store this, I don't see a great advantage in sticking with the old stuff to the extent that people do at the moment.
I think there's still stuff that's always going to sit on it, and obviously OLTP workload is a different question. But for the big analytics stuff, like the stuff we did here for this client, it was very simple work that was providing them great benefit from the data. It doesn't have to be complex, and it just moved so easily onto this.
Yeah, definitely.
I mean, in the last edition, when Tanel Poder was on the show, he had a good rule of thumb that I thought was interesting, which was: data that originated somewhere else, that's a great thing to put into Hadoop. So if you've generated the data in sensors or in a transactional system, and you then want to query that data, it's great to put it into Hadoop. If you want the transactional integrity, if you want all those kinds of features, that's what you want Oracle for, and so on.
But certainly, what's your take on Drill?
I mean, you've been quite an advocate of Drill.
You've been using it quite a lot.
I mean, you said you didn't use it on this project because it wasn't able to be provisioned in EMR easily,
but what's your take on Drill, really? Is that the end of formal data warehousing work, or what?
I don't think it's quite so much that.
I think it enables you to query data
without having to load it into a database first.
And SQL is what I've worked with for so long that you look at a problem
and you start breaking it down
into select and group by clauses
in your mind automatically.
And so being able to take a flat file
that someone's given you,
like a JSON or whatever,
and be able to query it from your hard disk
is just fantastic.
So yes, it runs distributed and clustered, and at huge scale as well. And I heard the podcast you did about Drill, which was really interesting, with the comparisons to Impala and so on. But I've been using it on a much more modest scale. I'm doing some work for a client where you need to do some data wrangling and work out what these data sets look like, how we can join them, and so on, and being able to run that from your laptop, without having to define your schemas first, is fantastic. Because with some of the stuff in this prototype, where the data is in S3, using Hive on Tez or using Presto you still have to go and create the external table. And they're such simple columns that there's no reason why you should have to do that; it's obvious, if you look at the file, what it is. Which is just what Drill does: it looks at the file and figures it out.
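(For context, this is the boilerplate Robin means: before Hive on Tez or Presto can see the files in S3, you declare a table over them. A hedged sketch, issued here through Spark SQL with Hive support enabled; the table, columns and path are illustrative assumptions. Drill skips this step entirely by inferring the structure from the file itself.)

```python
# The DDL you'd have to run before Hive or Presto can query the S3 data --
# exactly the step Drill makes unnecessary.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS enriched_sales_ext (
        product_id   INT,
        sale_amount  DECIMAL(10,2),
        load_ts      TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://my-bucket/output/enriched/'
""")
```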
Yeah.
So it's a massive time saving.
Yeah, exactly.
I mean, certainly with Spark and Drill there are some interesting things there. You and I know a product called Endeca that we worked with for a while, and Endeca was about data discovery and "no data left behind" and all those kinds of slogans. But Drill as the new Endeca, as Hadoop's version of Endeca, is interesting, isn't it?
Because, you know, you can sit there with a copy of Drill
and you can kind of reach out to all these different data sources
and you can query them in place.
You can reach out to, say, Hive or even Oracle or something
and bring in data, you know, as reference data.
I see Drill as being a new form of BI, really. You've got Impala, you've got all those MPP-style engines there, but Drill is a very interesting data discovery technology as well. Have you been finding that in the way you've been using it?
Yeah, I think it's exactly that: just being able to poke around in the data. The whole data wrangling side of things is not the sexy side of big data work, but it's what you end up doing an awful lot of the time. Simply: what does this file look like, how can I match this one up with that one? And being able to do that data discovery with what you've got.
Yeah. And I think with Spark SQL, I mean,
that's in a way the new data federation. If you think about what you can do with Spark SQL and the Hive compatibility: you mentioned earlier on that you used Spark SQL to bring in some reference data and so on. But going back to our days of working with tools like Oracle BI, the fact that you can bring stuff together within Spark, reach across and join data together from different sources, that's interesting as well, isn't it?
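(A hedged sketch of that federation idea in PySpark: reference data pulled over JDBC from a relational database and joined, in Spark, against flat files sitting in S3. The connection string, table and columns are illustrative assumptions, and the appropriate JDBC driver would need to be on the Spark classpath.)

```python
# Reference data fetched from a relational database over JDBC ...
ref_db = spark.read.jdbc(
    url="jdbc:oracle:thin:@//dbhost:1521/ORCL",  # assumed connection string
    table="SALES.PRODUCTS",
    properties={"user": "report_user", "password": "***"},
)

# ... joined against data sitting as flat files in S3, with no
# intermediate load of either side into a single database.
sales = spark.read.csv("s3://my-bucket/output/enriched/",
                       header=True, inferSchema=True)
federated = sales.join(ref_db, on="product_id", how="inner")
federated.show(10)
```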
And it's all free as well.
Yeah, exactly, it's all there, it's all free to play with. And with the open data formats underneath, you're not backing yourself into a corner with your tool choice: you can try something out and then just try a different one against it, and mix and match to get the optimal combination for what it is you're trying to do.
Okay, so that all sounds brilliant, but you used a phrase earlier on: "the devil's in the detail". So this all sounds fantastic, but what's the catch, really? I mean, compared to, say, Oracle, which has these regular, very predictable releases, Hadoop technology, all this stuff, is just releasing all the time. How does that play out with customers, and what do you think of that?
The devil's definitely in the detail
a lot of the time. It was a two-week project that we did, and I'll certainly have spent a day of it fighting with Java dependencies and library versions. And there was one timestamp problem where it was literally a point-one difference between code bases, and timestamps were suddenly written out in epoch format instead of character format, stuff like that. Not in a million years would Oracle ever release that kind of breaking change without it being a big thing. So Oracle gets mocked for slow releases, and arguably it's a shame that they're slow to do things, but generally when stuff comes out, it works, everyone knows how it works, and there's lots of education and advocacy around it. Whereas the pace of change of the newer technologies is so great that just keeping up with what they do is a challenge, let alone knowing in detail how each one works and how best to take advantage of it. There's a lot of work that has to be done each time you come to use it, rather than simply "well, I know Oracle, therefore it's just rinse and repeat". It's much more "I know that conceptually this bit can be used with this bit, but do they actually play nicely together?"
Okay. It certainly struck me, reading through your posts, that this sits somewhere in between. In the old days Hadoop was something very complicated; I remember you and I in the past spending ages spinning up clusters and wiring it all together. And it struck me that this was in between that: you had EMR there, which was, I suppose, the easy part, but there's still a lot of mucking around as well.
I don't know if you saw the blog post I did on kind of Google BigQuery. I mean, that again is interesting.
It's even fewer things to wire together. Did you read that post? What are your thoughts on where Hadoop as a service is going, and how it can get simpler?
I think, as you say, the more simple the user interface,
the less mucking around with configuration,
stuff like that's going to make it more accessible.
The idea of going and installing a CDH cluster or something like that, I suppose it makes sense if you've got the hardware and all the rest of it, and you need to maintain absolute control over it. But if you can get the same functionality at the click of a button and have someone else worry about it, then why would you? There's something in wanting to understand what's going on under the covers, but sometimes you just want to get in your car and drive somewhere without needing to know how the piston engine works. So I think more and more it's going to become commoditized. Your blog post referred to it as the "Gmail moment", and I liked your opening to that, saying that back in the nineties of course everyone ran their own email servers, or, like me, you ran your own email server, which I never quite got to. But certainly you do this stuff because the technology is interesting; like you say, a few years ago we were working on Hadoop and building our own clusters because it was so new, and that was the only way you could get into it. Whereas now, why would you, if you can just have it provisioned and configured automatically as well?
Yeah, yeah, definitely.
And I think that's a bit of a message to consultants out there, or a lot of you guys, because I'm no longer in that sort of area. A lot of people spend a lot of time fixating on getting Spark working, or building out clusters, and that is in a way a solved problem now. It's like building on layers of abstraction: from my perspective, things like how to get a cluster working, scaled out, reliable, how to get a data processing layer, these are things we shouldn't still be spending ages on. Where it gets interesting is how we then "leverage" that data, as our American friends would say, how we build predictive analytics on it; we shouldn't have to spend all this time just getting clusters working. And I suppose one example that takes this to the extreme, which I think I tweeted about this week, is Amazon Glue. Just for anyone who didn't read that, Amazon Glue is Amazon's
take on ETL, and it's very interesting. It's using machine learning and artificial intelligence to look at the data in the data set you're working with and predict things like which transformations you should do. Very interesting, really. And again, the amount of time that people like you and I spent in the past building ETL routines and just wiring column A to column B together: I think we're going to move on from that. Or is it just pie in the sky? What's your take on that, Robin?
Well, I looked at Glue and it
looks fascinating. I can't wait to get my hands on it, because when I first saw it I glanced at it and saw a screen full of code, and I thought: well, if it's just generating code, and I can write code, aren't you just ending up with the same kind of liability that you have from writing it by hand? But then, reading a bit more closely and seeing that it does this cataloging of your data sets, and automatic categorization, and the transformations, but with access to the code underneath, and the orchestration and management of it: yeah, it looks absolutely fascinating.
It is, yeah. If it works, it sounds fantastic. I know Oracle is working
on similar things as well. So, at the end of this piece of work you did for the client, and I appreciate it was more of an evaluation, what were the final thoughts to take away from it, and what were the client's next steps with this?
Well, they were really excited to see what could be done, and what we put together in the time frame. I think they were very open to this idea that you don't have to have a relational database to do your ETL work in, that you can actually just do it on demand, and that you can store the data without necessarily storing it with the compute attached; you don't have to load it into a data warehouse.
So I think they're going to kind of hopefully look to do more of that in the future and
move some of their analytic workload into that.
So I suppose the reality check is that the majority of projects still going on are still around Oracle and ETL tools and so on. But I would imagine the bulk of the new enquiries you guys are getting are around this sort of technology, and trying to see the value, and, I suppose, trying to see how it would fit in with what they're doing and whether it's worthwhile for them.
Yeah, definitely.
And I think in terms of maturity, people are less skeptical about it. I know a few years ago people were dismissing this as just hype. I think people have accepted that it's actually here to stay, that it's not just flash-in-the-pan stuff, that it's got, if nothing else, big cost savings to bring, and at best a great deal of flexibility and agility to give people. But larger companies with more established IT departments are still looking at how they can take what they've got at the moment and take advantage of this stuff.
Okay. One final thing on here: I noticed on the blogs that you looked at QuickSight as well, Amazon's new BI tool. But did you try connecting Oracle Data Visualization to this as well? Did that work, and what were your thoughts on it?
Yeah, so QuickSight was something that, for this particular project, we couldn't use for various reasons, but we'd used it beforehand. So in this project we used Oracle DV Desktop. It's got direct support for Presto and Redshift, and it worked great for completing the end-to-end picture of source data, transformations, storage: we proved that we could do the analyses in SQL, and then proved that you can use it from a client-facing tool as well.
So, Robin, 2016: an uneventful year politics-wise, obviously. Not much happened over in the UK and the US. But certainly a lot happened in the world of BI: there was the Gartner report, there were lots of new software releases, and some interesting things happened around analytics. So I've got five questions, five areas, I want to go through with you, to get your opinion on what happened in 2016, and then we'll go on to what you think is going to be worth looking out for in 2017.
So, first question, Robin. Oracle's BI focus, I think it's pretty fair to say, has shifted this year from enterprise BI tools like OBIEE 12c, and enterprise BI software like the BI Apps, to DV Desktop, Data Visualization Desktop. What do you think on that? Do you think it's the way of the future, or is it Oracle's last desperate throw of the dice to stay relevant in the BI market?
That's a good question.
I think DV desktop is something that they had to do.
And I think it's actually really interesting to see the rate of development around it and
the rate of releases and what they're doing with it, compared to Oracle BI, the server-based one, which has a fairly slow release cycle; obviously it's good that it's stable. DV Desktop, I think, releases every couple of months, and the stuff they're doing with the plugins around it as well, to make it extendable with an API, I think is really interesting. So I suppose the question is: how are they going to bridge the two? I don't see DV Desktop replacing the main OBIEE, but will they manage to transition people from DV Desktop into it, or will it just end up filling the same role, where you end up with single users doing their data stuff locally and losing out on the benefit of the enterprise view?
Yeah, it's interesting, isn't it?
I mean, talk about life going in kind of full circles, really.
So just for anyone who's not familiar with Oracle's desktop BI tool line-up: Oracle BI 12c is the latest release of Oracle's full enterprise, end-to-end BI platform, and it's something you and I are probably quite famous for. And there's a saying, isn't there, that technology reaches the point of perfection just as it becomes obsolete, and the great irony with OBIEE 12c is that it's a fantastic platform, but the mood and the shift in the market is more towards desktop BI tools now. And DV Desktop, Data Visualization Desktop, is a bit like, remember Discoverer from years ago, and tools like that? They were very much desktop BI tools; they had their advantages, but they were also silos of information. But certainly, DV Desktop is interesting, isn't it?
And is it something that you and the guys at Rittman Mead are using a lot now? Is it your primary BI tool, or what, really?
I wouldn't say primary, but it's something that's very relevant when we're talking with clients. Obviously, if it's an Oracle shop, then you'd rather be using that than a Tableau or whatever as an alternative desktop tool. And the stuff they're doing with data flow within it, I think, is equally interesting. I suppose the rate at which they've developed something which runs on the desktop (under the covers it's still the same server processes, but it's all encapsulated to run on a local machine), but then adding in this additional transformation stuff. Yeah, like I said, how are they going to bridge that back into the main product, or are they just going to end up with two separate offerings?
Yeah, I mean,
so just explain to us what the data transformation thing is in there. That's in the new release,
isn't it, of DV Desktop? So what's that, really?
Yeah, so that's where you can take multiple data sets and apply transformations to them, as you would in an ETL workflow, aggregate or filter your data, or join between the different data sets, to produce a final data set against which you build your visualizations.
Yeah, I suppose in a way this is the way the market's going, even if people like you and I would say the pendulum has maybe gone too far one way. Do you think that or not?
I think so, yeah. And in an ideal world you'd take something like that and give it to your tech-savvy business users, who know the data: they could build out and explain how the stuff transforms and combines, prototype the visualizations, and then give that to the enterprise department, who can then formalize it and build it into a supportable, enterprise-grade ETL process. That's the perfect world. But you risk going back to the days of people building stuff in Excel macros, with only specific people in the department knowing where that data came from or how to support it, and that's the worst of all worlds.
Yeah. I'd be interested to see also how Oracle sell it
as well, because the model behind tools like Tableau, for example, is to sell one license into a department; they call it "land and expand". Funnily enough, I downloaded the trial version of Tableau this week, for testing out on some work I'm doing at the place I'm working at the moment, and I got a phone call the next day from one of the reps, a very nice guy, but he was going to sell me this one-seat license. I can't imagine somehow Oracle doing that. It's interesting: I'd imagine the paradigm shift and the rethinking from the Oracle salespeople will be quite significant there.
Yeah, and I try not to get too close to licensing, because it's always a bit of a minefield, but from what I've understood, you can't actually buy a DV Desktop license as such; you're permitted to use it as part of a DV license in the cloud, or DV as part of OBIEE on premises. I guess that's a deliberate decision, and maybe it's a nuance that I've misunderstood, but you can't just say "I think this is great, I want to license it for a few accounts"; you actually have to
license more than that.
Yeah, definitely. But I think one thing, and I don't quite know my own opinion on it, is that I'm surprised at how good DV Desktop is, really. As you said, the rate at which plug-ins are appearing; I think the development team are definitely all-in on this. I've been using it quite a bit, and it's a good product. But it's interesting: I suppose, in a way, has that horse bolted, or is this a necessary thing to do? I don't know. It's interesting, isn't it?
I mean, top marks for Oracle doing it,
but how well they'll sell it, I don't know, really.
Yeah, and I suppose how much of the functionality
will come back into the main product?
Is it something where they can use it as a kind of prototyping vehicle, to see how features take to the market, and then migrate them into the enterprise stack?
I don't know.
I think also where there is potential for Oracle to do something very interesting
is in linking it back to the full enterprise suite.
And certainly, I mean, I was at an event recently and I was with one of the Gartner analysts
and certainly within Gartner, I guess within a lot of analyst firms,
there's certainly a lot of different opinions about the value of, you know,
what they call kind of bimodal development and so on there.
And I think any vendor that can link together these desktop tools
and the kind of curation and IT adoption of things as well is going to do well.
And I think Oracle, if they do get that link between the desktop tools
and the enterprise kind of side worked through,
and in terms of metadata curation and so on,
that could be really interesting, couldn't it?
Yeah, definitely.
Yeah, definitely.
Okay, so next question for you.
Okay, so citizen data scientists.
You must have heard that phrase out there.
Yes.
Okay.
Is that an exciting new paradigm,
or is it this year's marketing bollocks, as we say in the UK?
As we say in the trade.
Citizen data scientists.
It depends how you define data scientist.
I mean, there's always been users in the business
who kind of, they know their way around technology,
usually Excel or maybe a bit more than that,
and they understand the data
and they know how to kind of
apply appropriate analyses to it.
But if you take "data scientist" to the extreme of advanced analytics, predictive modeling and full-blown statistician, then yeah, that's bollocks. But tooling that supports making something other than a highly curated and governed model available to end users is good. Users want their data, and the value there is to be had from that data, so letting them work with it in different ways is good. But yeah, there's a certain element of hyperbole about it.
Aspirational, yeah. Forward-looking. Directional, I think, is the phrase we use in product management.
I think there's a couple of things in that that are interesting.
So first of all, citizen data scientists: to think that we will all become statisticians, well, there's a lot of mathematics, a lot of stats knowledge, needed to do that well. But the aspiration for people to do more than just look back at what's happened in the past, and to use stats, machine learning, deep learning and so on to get competitive advantage: I think that's a genuine thing, and whether or not the tools enable it yet, it is a driver and a demand from users now.
Yeah, and I think there's the danger with stuff like that that people have to understand what they're doing.
So does the tool dumb it down so much
that it becomes slightly meaningless
or could be done by the tool anyway?
And what role does the person have in that?
And I've done some work with a colleague of mine
who you all know, Jordan Mayer, who is a data scientist: he understands the maths, understands the stats. And in working with him, and starting to dabble in this kind of stuff, you realize how much you don't know, and how wrong you can get things if you put the wrong interpretation on the data. It's one thing to say "how many cans of baked beans did I sell last week?", but it's another to build a predictive model that's supposedly 80% accurate, when, if you'd tossed a coin, it would have been 50% anyway. Do you know what I mean? You've got to understand what you're doing. So whether you can make that accessible to an end user, a citizen data scientist, in a way that they can use it, I'm not sure.
Yeah, it's interesting.
I mean, certainly now we've "had enough of experts", as the famous politician said in the UK. In the days of data mining it was common to say it's too dangerous to put this in users' hands, because they can make the wrong decision and so on. I think that's a lazy thing to say, because, yes, obviously there's a lot in there around confidence factors and using the right model and so on, but the challenge to us in BI is to say: well, how can we go beyond that? I often refer to BeyondCore as a vendor that I think has an interesting vision of where this stuff can go: automating it, making it as easy as possible to get these insights, whilst also, like you said, understanding that it's easy to come to the wrong conclusion. I think it's lazy to say that we shouldn't put this in people's hands, but it's also slightly hyperbolic to say that it's now possible, really.
Yeah, I think that's probably the case.
Okay, so next one for you.
So source control. You've done, again, a lot of very detailed posts about source control
and kind of automated builds and kind of automated deployments
with OBIEE and tools like that.
So is this applying engineering rigour to tools like Oracle BI, or is this just perfecting the steam engine? Why are you doing this? Surely this is pointless with this old technology.
Because as long as people are doing development work, they should be doing it right. And that's not just from a puritanical, I-don't-like-to-see-things-done-wrong view; it causes an awful lot of trouble when people don't do it right. Simple stuff like source control: if you don't have source control, you're screwed, because sooner or later you're going to lose a file, or deploy the wrong version of a file, or you're going to go on holiday and someone else can't find the right file. So it's simply taking that, and then taking it a stage further: how can we use it for concurrent development? And then you need to understand how the particular software works. So it's a necessary evil, in a way. Yeah, it has to be done.
I was being slightly devil's advocate
there. But certainly, how much do you still see people going on site and finding RPDs numbered, and that's your version control? Is any of this sinking in?
For me, something I realized was that I'd started off with the concurrent development problem, which was: in OBIEE, how do you do concurrent development without using the one provided by Oracle, which was slightly unsatisfactory? So I wrote about that, and I talked about that, and it didn't seem to make much impact; people said "oh yeah, that makes sense", but no more than that. And then you actually go to clients and speak to them, and as you say, they're naming their RPDs "version one", "version two" on a network share. So something like concurrent development arguably isn't for everyone, even if it would be useful. Places are so far off being anywhere near able to do it that the basic stuff, like "use source control", is the message everyone's got to take heed of first. Then you can look at getting a bit more mature in your approach, automating your deployments, and once you've done that, then you can say: now let's do concurrent development, and all the flexibility, agility and scalability of your development effort that that entails. That's great, but trying to leap straight to that when people can't even do source control is a step too far.
Yeah, exactly. So I guess, as a plug for what you're doing at Rittman Mead: you're driving a lot of utilities and accelerators in that sort of area. What have you been doing there, really?
Yeah, so we've been trying to work out what makes sense for our clients: do we build a solution for this and say "this is how you do it"? But we've actually found, in speaking to clients and going and implementing this stuff with them, that because what you're interfacing with a lot of the time is enterprise change management processes and release teams, a one-size-fits-all solution just doesn't work; you start mandating too much upon them, and they say "yeah, but we just don't do things like that". So instead we're breaking down the process, and that's what the recent blog posts I did were based around: let's understand everything from the ground up, and then we can tailor particular solutions to each individual client and how they work, but with the best practices, shall we say, thrown in. It's "this is how you should be doing it", and then we can tailor it to fit perfectly.
Okay.
And, yeah, definitely.
And the other thing you've been doing as well, and I know you were responsible for this when I was there, is the performance and customer adoption stuff. Is performance still an issue you see these days? Do people still tune things the wrong way? What do you see?
Oh my God, all the time.
Maybe I get a skewed view of it
because I'm also on the Oracle OTN forums.
And the number of times people kind of say, I've got this query and it runs slow.
I've tried building an index on the database and it's still slow.
And just as with concurrent development, which was taking it too far straight away when people first need to learn source control, the same goes with performance: people need to understand why it's slow and where it's slow before they start changing things.
It's a very basic message.
But going and looking at where is your query running slowly?
Is it in the database?
Is it in the BI server?
If it's not in the database, then tuning the database is going to have no impact.
So I think it's got a long way to go.
And I don't know.
It's one of those things that sometimes the development styles taken to this stuff, performance is just an afterthought.
People work with small data volumes and developments.
They're more focused on the functionality than the non-functional requirements.
And then it goes to production and then it falls on its ass.
And then it's kind of, oh, now I suppose we ought to have a look at this.
And then the horse is bolted by then because with big deployments, they're very complex.
And sometimes you
have to say look i'm sorry you've kind of you've fundamentally done this wrong um and that's where
it gets quite painful okay okay so so number four number four question is um the rise of schema on
read schema ssql engines and data lakes you know is this is this all about making bi agile and
helping you know customers embrace this new technology or are things like schema read and
data lakes is it the end of civilization as we know it really what's your what's your take on
I think it's fantastic. I heard the show that you did with Kent Graziano, talking about whether we still need data modelling, and absolutely, 100%, yes, but just not always straight away. We talked earlier about the new technologies, what they enable, and some of the paradigm shifts, and I think this idea that you don't have to model up front is fundamental, and slightly mind-blowing when you've been doing this for a while. You realise: hang on, I don't need to define my table, I can just store this, and potentially all I want to know is the number of rows I've got. I don't care about the different columns within it; simply a count of the rows will suffice. It makes it much easier to get your data in and much easier to start poking around the data. Then, once you start needing repeatable answers out of it and a more formal view of the data, you model it, but only then. So it makes it much faster to get your data in, start storing it and start working with it, and work out: is it even useful? Is there any point modelling it, or does it turn out it doesn't have what we need within it?
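As a concrete sketch of that schema-on-read workflow, assuming PySpark and a hypothetical landing directory of JSON events (the path and the 'event_type' field are invented, not from the project discussed here):

```python
# A minimal sketch of schema-on-read with PySpark. The landing path and the
# 'event_type' field are hypothetical illustrations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# No CREATE TABLE, no upfront model -- just point at the raw files.
raw = spark.read.json("hdfs:///data/landing/events/")

# Often a row count is enough to decide whether the data is worth modelling.
print(raw.count())

# Only when you need repeatable answers do you impose structure on it.
raw.createOrReplaceTempView("events")
spark.sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type").show()
```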
Okay, okay. And the last question from my side for this one: you write a lot about breakfasts, don't you? It was one of your big topics on Twitter for 2016, and I think you've now got your own hashtag for the full English breakfast. So, a question for you: English breakfast or American breakfast, which one is the best, and give reasons why.
I might not say it depends. 'It depends' is the consultant's answer. It's got to be full English.
Describe a full English to our American guests, and then I'll describe the American breakfast to our British guests.
Go on.
So a full English is a thing of beauty. You've got to have good quality sausages, good quality bacon; you've got to have good quality everything. Fried tomatoes, fried mushrooms. You've got to have black pudding in there. Hash browns are controversial, but a good addition. Good granary toast or good white toast, baked beans, and you've got to have HP Sauce: you can't have just any brown sauce, it's got to be HP. Oh, and fried eggs.
Okay. And the American breakfast is fairy cakes and fat, from what I've seen. So I think that's actually an easy one to call: a British breakfast is always the best, really. But I'd certainly recommend listeners look at your Twitter feed for the number of breakfast reviews you have on there, and beer as well. So, breakfast and beer: a healthy diet.
Exactly. Good.
Okay, so let's look forward to 2017. These are always interesting to do, and sometimes it's quite hard to put your finger on what you think the interesting things going into the next year will be. A lot of these things already exist, and there are more that you think will catch on. But we talked earlier on, when we did the planning for this, and there were three things you told me about that you thought would be good. The first one is Kafka. You've been talking a lot about Kafka in the past, and it seems to be coming up all the time as a technology to watch. So, first of all, just for anybody who doesn't know the technology, explain what Kafka is at a very high level, and tell me why you think it's interesting for 2017.
So, Kafka. I actually did a presentation on it at the UKOUG just this month, and I've memorised the opening line, which is: Kafka is publish/subscribe messaging rethought as a distributed commit log, which is what they say on all the blogs. But it's publish/subscribe messaging designed from the ground up to be distributed and highly resilient, with guaranteed message delivery. So it does an awful lot of things that position it to underpin data architectures, basically: your data pipeline.
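In Python terms, the publish/subscribe model looks something like this minimal sketch with the confluent-kafka client; the broker address and the 'events' topic are assumptions for illustration:

```python
# A minimal publish/subscribe sketch using the confluent-kafka Python client.
# Broker address, topic, key and value are all invented for the example.
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("events", key="order-1", value='{"amount": 42}')
producer.flush()  # block until the broker acknowledges delivery

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo",               # consumer groups give scalable fan-out
    "auto.offset.reset": "earliest",  # replay the commit log from the start
})
consumer.subscribe(["events"])
msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```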
Okay, okay. So Kafka is an open source product, but there's also Confluent. Where does Confluent come into it? And why do you think they're driving a lot of this interest in Kafka?
So Confluent was formed by the folk who wrote Kafka back at LinkedIn, and they contribute to the Apache Kafka open source project. You've got Apache Kafka core, which is the messaging. You've got Kafka Connect, a really interesting framework they're building around the actual messaging, as a way of getting data in and out much more easily, based on configuration files rather than having to brew your own interface each time. And there's the Confluent Platform, which has its own commercial Control Center that visualises the configuration, the delivery rates and things like that. But the way they're driving Kafka, the way the Apache Kafka project is going, is taking it beyond simply messaging.
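To illustrate the configuration-driven approach Kafka Connect takes, here's a hypothetical sketch: a connector is registered by POSTing a JSON config to the Connect worker's REST API rather than by writing bespoke integration code. The worker address, file path and topic are all assumptions:

```python
# Hypothetical sketch: a Kafka Connect connector is just configuration,
# registered via the worker's REST API (assumed here on localhost:8083).
import json
import urllib.request

connector = {
    "name": "file-source-demo",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "file": "/var/log/app/events.log",  # tail this file...
        "topic": "events",                  # ...into this Kafka topic
        "tasks.max": "1",
    },
}

req = urllib.request.Request(
    "http://localhost:8083/connectors",
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read())
```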
And they've added in Kafka Streams, so you can do stream processing. I saw a very interesting presentation at the Big Data London conference recently where they talked about queryable stateful stream processing: rather than stream processing your data and then landing it in a NoSQL store or something like that, you can actually query it in flight, which I thought was interesting for where it could take your architectures in the future. And then I saw a tweet just this week from Gwen Shapira, I think, talking about the pre-processing that's coming in Kafka, simple stuff on the Kafka Connect inbound side: masking credit card numbers, inserting values, renaming fields and things like that. So it's becoming a whole bunch more than just a messaging tool; it's a platform in its own right.
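The 'queryable state' idea is worth a sketch. Kafka Streams itself is a Java library, so this Python fragment only mimics the shape of the idea, under assumed broker and topic names: the aggregate lives inside the stream processor and can be queried directly, with no external NoSQL store.

```python
# Conceptual sketch of queryable stateful stream processing (not the Kafka
# Streams API itself). Assumes keyed messages on a made-up 'page_views' topic.
from collections import Counter
from confluent_kafka import Consumer

counts = Counter()  # the stateful aggregate, queryable in flight

def lookup(key: str) -> int:
    """What an 'interactive query' against the processor's state looks like."""
    return counts[key]

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "stateful-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["page_views"])
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    counts[msg.key().decode()] += 1  # update state per event
    # Another thread or endpoint could call lookup() here, without a
    # round trip to an external store.
```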
Okay, yeah. I see Kafka as being, I suppose, the ETL equivalent of Hive, in a way. Kafka is there and it obviously does a lot of things, but it's extensible: you've now got the streaming side and the bits you've been talking about there. But you've also got a lot of closed source products adopting Kafka interfaces and APIs as their standard. If you look at, I think it's MapR Streams, that's effectively their own proprietary technology, but they expose it and make it available via a Kafka interface. And I think with the Oracle Public Cloud, the new elastic cloud product, they've got something that's a wrapper, a commercial added-value thing, on top of Kafka as well. So certainly it looks like Kafka is here to stay, albeit as a standard, as a kind of framework, and it certainly seems to be getting that adoption, doesn't it? A key platform-enabling technology for Hadoop, really.
Yeah, definitely. And like the other Hadoop technologies, it's an open format, so simply being able to build systems around Kafka, connecting the different components together and decoupling them in that way, I think explains why it's getting such uptake. One of the things that struck me after looking at it for a while is that it's a streaming platform, but it's not just about streaming; it's also about data integration. Even if you're not building something for streaming from the outset, if you ingest your data into Kafka as it comes in, as an event stream, you can consume it as a batch if you want to, but you can also stream process it as well.
Okay. I always think Kafka is the technology that everybody says they're doing in projects but actually hasn't: it's one of those things you drop in as a very sexy technology. But it's quite hard to get running, isn't it? That's why I think that; it's actually harder than you think to get running. Is that correct, or what do you think?
So, to get a simple Kafka cluster actually running, it's fine. For me, as I said before, the devil's in the detail. It's definitely one of these things where you look at the jigsaw architecture and think, oh, we've got this piece here, I've read a blog about that piece there, and there's supposedly a connector between them. But I recently did some work trying to connect them up. Oracle GoldenGate is a great way of doing change data capture from your database, which can then create an event stream into Kafka, and I was thinking this could be a great way to prototype populating your data reservoir in HDFS from work that's happening in the database. But actually, those three simple pieces don't fit together as it currently stands, because GoldenGate prefixes the Kafka topics with the full database and schema name, and if you try to write that to HDFS, Hive gets upset, because you can't call a table with a three-part name. So it's great, and yes, there are pull requests that work around it, but out of the box this stuff still needs engineering, jiggling around, to get it to actually work. But that's just a maturity thing. I think as a fundamental technology it's here to stay. And I think it's a great piece of kit.
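For flavour, a workaround in the spirit of those pull requests might re-publish each message onto a topic named for just the table. This is a hypothetical sketch, not the actual fix: the three-part topic name, broker address and client library are all assumptions.

```python
# Hypothetical sketch: re-publish messages from a three-part topic name like
# 'ORCL.SOE.ORDERS' (database.schema.table) onto a plain 'ORDERS' topic that
# a Hive table over HDFS can live with. Names are invented.
from confluent_kafka import Consumer, Producer

BROKER = "localhost:9092"
consumer = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": "topic-renamer",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": BROKER})

consumer.subscribe(["ORCL.SOE.ORDERS"])
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    target = msg.topic().split(".")[-1]  # keep only the table name
    producer.produce(target, key=msg.key(), value=msg.value())
    producer.poll(0)  # serve delivery callbacks
```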
So is Kafka your new Logstash, then? I know you've been quite a fan of the Elastic stack, and we've actually got Elastic coming on the show in one of the next episodes. But is Kafka the new Logstash for you?
It's certainly my new love, yeah. Elastic was great, and it was probably the first...
Moved on now.
Yeah, it was the first open source project that I started working with and realised the power of. And I think it's great. And it supports Kafka, so it's the best of both worlds.
And didn't you yarn on about that as well? I remember at the time it was Elastic this, Elastic that; I was always getting an earful of it.
I know.
It was Drill and whatever, really. So luckily I don't have to listen to that anymore, which is good.
So, the next one. You're from Yorkshire, and people from Yorkshire are plain speaking and tell it how it is. Cloud is certainly an area that has had its fair degree of scepticism and hand-waving and all that kind of stuff, but something I've noticed when you and I have talked recently is that you're starting to think cloud can be interesting, that cloud is going to be more than just a word, more than just a sales opportunity. What do you think on that?
Yeah, for me it's been probably the last half year or so, looking at these things and realising that cloud's not just marketing bollocks. Part of the problem is that the marketing, and all companies do it, gets ahead of the actual technology. You try to keep up with the marketing, then you start looking under the covers and think, well, actually, this is just slideware; it's forward-looking or whatever. So the hype kills people's taste for it: oh well, cloud's just rubbish, big data's just rubbish. But in the same way that big data technologies are now maturing, and people are realising it's serious and has great benefits, I think the flexibility that cloud gives is the important thing. 'Cloud' is in a sense a bad way of describing it, because it's one of a dozen different things, but specifically I mean the elastic capabilities and the separation of your compute from your storage from your query. Cloud as a buzzword puts an awful lot of people off, and there's a whole bunch of fuss about it, but fundamentally it's going to change an awful lot about how systems can be built. Some people are going to take what they've got and just lift and shift it, running on virtual servers in the cloud, which is kind of missing the point. But similarly, it's easier said than done to say, oh, we'll just rebuild this using the new technologies; that's not a small undertaking. To properly benefit from it, though, looking to separate out your processing and decouple the parts is going to be a good idea.
So when you go back to your Christmas gathering in Yorkshire, with your whippets and your flat cap, you'll be wearing a T-shirt that says 'ask me about cloud'; you'll definitely have the zeal of the convert there.
Exactly.
So, yeah, interesting.
And I think cloud will, ironically, actually outlast Hadoop as the thing we talk about in a few years' time. With the complexity, the detail and the amount of work involved in getting Hadoop running, in the end, once it moves to the cloud, it's just going to be elastic compute and storage, really, and I think the impact on business models will be massive. Cloud has gone from being something that's sold in a hand-wavy way to something that people like you and I think about quite a bit now, in terms of what it means for how you develop systems and how you work with them. And probably you and I only really realised the elastic thing recently, and the impact that's going to have on development and so on. So, yeah, cloud's interesting.
For me it's particularly the Amazon stuff: it's all there at the click of a button, all the different pieces that you can then work with, and they all interact with each other. Before that, Oracle had BI Cloud Service, which was fine, and you could do stuff with it, but unless you fitted the particular audience for it, it was just like what we've got on-premises, only not as capable. It's moved on since then, but that was the first version of cloud that I was exposed to. Then you start looking at AWS, its capabilities, and the options you have for building out these different permutations of platforms, and it's fascinating and powerful.
Yeah. And the last one, really, is something I read an article of yours about on OTN: the Panama Papers, using graph technology. You've said in the past that graph is interesting. Tell people what it is, and why you think graph technology is something to look out for in the next year.
So, yeah, this is actually an idea that I stole from you, because you'd written about using property graph analysis on Twitter data. I thought, oh, this looks interesting, and so I took the Panama Papers data sets: various parties who have money held in various offshore funds at various addresses. Graph analysis lets you look at those and how they relate to each other, beyond simply the individual relationships. It lets you build out and visualise those patterns, and also run algorithms against them. So you can find out which particular address is used by lots of different funds, but rather than doing that in SQL, where you would simply get a list, counts of them, you can ask: which are the addresses that have lots of funds, and which are also related by another set of properties, such as the users or the countries? It's a completely different way of looking at the data. For me, the visualisation aspect makes it make a lot more sense, but then, as I say, there are also algorithms you can run against it, to come out with the results of a PageRank and so on.
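As a toy illustration of the difference, here's a sketch using the Python networkx library rather than the Oracle property-graph tooling the article itself used; the funds and addresses are invented:

```python
# Toy property-graph sketch: funds linked to the registered addresses they
# share, with an algorithm run over the whole network. Data is invented.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Fund A", "12 Harbour St"),
    ("Fund B", "12 Harbour St"),
    ("Fund C", "12 Harbour St"),
    ("Fund D", "3 Palm Ave"),
])

# Degree answers the simple SQL-style question: which address is used
# by the most funds?
print(max(g.nodes, key=g.degree))  # -> '12 Harbour St'

# PageRank weighs connectivity through the whole network, not just direct
# counts -- the kind of question that's awkward to express in SQL.
for node, score in sorted(nx.pagerank(g).items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```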
Yeah, absolutely. And for me, graph technology, and the article you wrote, and again I'll put the link to it in the show notes along with the things I've been doing with it, is a good example, isn't it, of how you can bring different query and compute engines to bear on Hadoop: the same data set, which might be sitting in HDFS or HBase or whatever, can be queried by different engines. This explosion of different ways of doing things, different compute engines we can use and so on, sort of makes BI interesting again, doesn't it?
It does. You can pick the right tool for the right job. It doesn't have to be: the primary focus is relational analysis, therefore it has to start in a relational database, and therefore we can't do this other stuff, or we'd have to copy it all over. You can store it in these open formats.
Excellent. So, yeah, three really interesting things to look forward to in 2017 there.
And just to round this up, really: I believe you're going to be at the BIWA event in January, speaking on this?
Yes, that's right. I've got a paper on the Panama Papers, and also one on OBIEE performance. And then I'm giving the Panama Papers one at Kscope in Texas in June as well.
Excellent. And I'm actually taking the idea I had with the Twitter stuff before, but adding the spatial side into it as well. Obviously one or two of you might remember the kettle thing that happened to me a little while ago, where my home automation experiments went slightly awry. What I did was capture all the Twitter activity that was happening at the time; funnily enough, that was one of the things that brought the network down. I also captured things like all of the Guardian comments and so on. What I'm going to do in this presentation is show how the posts I put on Twitter at the time, about the kettle not working and so on, started off with a few people, maybe you, retweeting them, and then went viral. It's interesting to see, again with graph technology and network analysis, how certain influencers in that group, by retweeting something, can massively explode the number of people reading and commenting on it. I thought that would be interesting, but I also want to bring in the spatial and time elements. What made me laugh at the time was that something that started off as a very British thing, in The Guardian and so on, was being retweeted around the world, Australia and so on, so given the Oracle product, which is Oracle Spatial and Graph, I thought it would be interesting to see how time and geography affected it as well. And, like you say, just to be able to look at and analyse data like tweets, which is network-like in nature, in different, more appropriate ways, really.
And I think graph is interesting, really. It's definitely, in a way, the slightly less well-known way of analysing data, but once you get your head around it and understand it, it's really interesting. And certainly the article you wrote on the Panama Papers was fantastic; I think the feedback on it was good as well.
Thanks. It was a lot of fun doing it.
Excellent. Okay, well, it's very late now, and I'm conscious that I've had you talking for a long time. So thank you very much for coming on the show, Robin. It's been great to have you on here, and hopefully we'll get you back at some point as well. In future episodes coming up, we've got Elastic coming on as the next guests, I think. But other than that, Robin, thank you very much, have a good Christmas, and see you soon.
Thanks very much, Mark. It's been a pleasure.
Excellent. Thank you.