Drill to Detail - Drill to Detail Ep.10 'Oracle's Big Data Reboot, and Data Storytelling' With Special Guest Stewart Bryson
Episode Date: November 22, 2016
Mark Rittman is joined once more by Stewart Bryson, talking about Oracle's recent reboot of its cloud big data platform at Oracle OpenWorld 2016, thoughts on Dataflow ML and comparisons with Google's Cloud Dataflow and Amazon Kinesis, and data storytelling with Oracle Data Visualisation Desktop 2.0
Transcript
So, hello and welcome to Drill to Detail, the podcast about the world of big data, analytics
and data warehousing. I'm your host, Mark Rittman, and I'm very pleased to be joined
by my first ever returning guest, none other than Stewart Bryson. So Stewart,
why don't you introduce yourself just in case there's anybody out there who hasn't heard of you.
Great, thanks for having me back on, Mark. I'm honored to be both your first guest and your first returning guest. So we'll have to think about what the trifecta can be next.
So my name is Stewart Bryson. I'm owner and co-founder of Red Pill Analytics. We're an Oracle data integration and analytics company, also playing a lot in the big data space and with a lot of the cloud vendors. You know, I worked with you in a former life, and that was a great endeavor. Always happy to come on the podcast. Thanks so much for having me.
Good, Stewart.
Okay, it was great to have you back on again. So we've planned this talk for a while, actually,
and we'll be doing a roundup of Oracle OpenWorld.
So you and I were both there recently as guests of the Oracle Ace Director Program,
just to kind of be upfront about that.
But as usual, there's never any kind of, you know,
obligation to say anything good or bad or whatever, really.
So thanks to them for that, obviously.
Just the truth, right?
Just the truth, exactly.
But certainly for me, there was a keynote and a session that were particularly interesting, and that gave a bit of a flavor and a taste, really, of a lot of the products that will be affecting us and be useful to us. A lot of it was actually a bit of a reboot, really, in a way, of Oracle's big data strategy
and, I suppose, the future of some of the tools there.
It was the Thomas Kurian engineering keynote, the product keynote,
which was on one of the days, I think Wednesday or Tuesday.
And particularly in there, I don't know if you saw it,
there was a demo and a walkthrough that Jeff Pollock did,
who you and I know from the days of kind of data integration, ODI and so on.
But it was a kind of walkthrough, really, of, I suppose, a complete kind of like application
that was running in Oracle Cloud, using big data as a source
and using some of the tooling they're bringing along.
But I guess in a way it gave us a bit of a flavor, really, of what Oracle see as their
market, what they're doing with the products, and potentially some of the sort of differentiation
really between what Oracle are doing and say sort of, you know, AWS and Microsoft and so on.
So, I mean, Stewart, did you see that session, for a start? Was it one you saw?
Absolutely. Yeah, I attended. It felt very Apple-like, didn't it?
With sort of the main speaker introducing the demo guests and all of that.
It felt familiar.
Yeah, exactly.
I mean, Thomas Kurian, the command of detail he's got, and the strategy and so on, is really interesting.
So I always kind of look forward to Thomas's sessions.
But seeing someone we know, Jeff Pollock,
talking about kind of what was Dataflow ML and so on,
it was very interesting.
So what I'm going to do is,
and Stewart, you and I organised this before,
the actual video, the actual kind of keynote video
for Thomas Kurian's session,
and particularly Jeff Pollock's one,
I posted it on my blog as an embedded video a little while ago.
So I'm going to put the URL for that on the show notes.
And there's also, obviously, you can go to the Oracle site,
and it's on there as well.
What I'd like to do, though,
is to kind of go through a few parts of that with you
and just talk through, I suppose, what was talked about,
what was, I suppose, the message behind it,
what was the implication,
and just kind of walk through some of the things that they talked about
and so on.
Should be fun.
It should be fun.
It will be fun, actually.
Yeah, I think it will be.
So first of all, if we think about the opening of Jeff's demo.
Thomas Kurian came on with a kind of business application,
an application that was about sensor readings and trying to work out
which devices were going to go wrong and so on and so forth.
Yeah, it's HVAC units.
HVAC units, which is an American thing, isn't it? What are HVAC units?
Heating... oh, now you've got me on the spot.
Air conditioning?
Yeah, heating, something, air conditioning.
So it's the heating and air conditioning units.
Central heat and air.
So obviously it was a very kind of like, you know, quite glamorous kind of,
quite glamorous topic to pick up on and so on.
So just to summarize that, then.
He basically came on, and he had a kind of web-based demo
where it was looking for anomalies and kind of sensing and predicting, I suppose, sensors and HVAC units
that were going to go wrong. And it was basically running on Oracle Cloud,
using Oracle big data services as the back end, really. Okay, right, so what did you think
of that, then? What was the messaging, I suppose, that was there, the market they're aiming for?
Because they're not the first to do this, really.
What were your thoughts about the way they positioned big data and the uses of it in the market and so on?
Yeah, so I've seen that application before.
So that was the application, I don't know if they built it for this purpose, but it's from when they first rolled out their IoT solution, which was pre-cloud, or their first take at the cloud, right? So they had the full IoT application. So I had seen it before. It's interesting to see that they had plugged it into their new offering, which is good. You know, who are they targeting?
It's interesting that they're going after application developers,
is what it looks like to me,
with the idea that you can build an entire application in the Oracle Cloud,
starting from how is it exposed, who views it.
The entire sort of DevOps process was also covered in Thomas's keynote.
And with the back-end data sets, and that's where I think your common or average developer doesn't know what to do.
They don't know what to do with the data assets afterwards. And so I think it's interesting to go after, you know, it might be a startup,
it might be a big Fortune 100 customer who's looking at building a new application,
perhaps with a mobile element to it. And the idea that you have an entire platform
with even the downstream data, I think they're going after, not to go out on a limb or anything,
but everyone.
I mean, they're going after small companies, big companies.
I think it's an entire platform, and that's the pitch.
It's interesting.
I mean, I think particularly it was interesting that he opened,
so Jeff Pollack opened with, and Thomas in that section as well,
opened with a very kind of, this is a very business-like,
a very kind of practical use of the technology.
It was clearly, I think it was aimed at not the developers, but the developers' bosses.
You think about who would be in that keynote.
And it was a very kind of straight, very kind of, you know, the purpose of that demo was
not glamorous, really, was it?
I mean, it was obviously IoT and so on and so forth, but it was squarely aimed as: this is serious, this is Oracle applying
its industry knowledge and so on there. And certainly, I mean, I think the whole
thing with, and we'll talk about this later on, Oracle's reboot, or move into, say,
big data as a service: you know, how are they going to differentiate that, and how are they going to
compete for developers, as you said? I think certainly aiming at
their bosses is an interesting thing, really.
Oh, no, I 100% agree. I mean,
if you were to see something comparable from re:Invent or an AWS presentation, they would be
going after the developers. I think they would be showing APIs. They would be showing, and we saw a little bit of that,
and you're probably going to get into it,
but we saw a little bit of that in Jeff's talk or Jeff's presentation
with a custom Scala application, right?
Just enough code to tease us.
But if that were AWS, it would be about APIs.
It would be about, you know, it would be talking about the guts and
how you actually build the thing. So I think it's very interesting because AWS is sort of
grassroots. I think, I think no one's, well, I won't say no one, but AWS doesn't try to sell to
their bosses. They try to sell to the developers because it's so easy to onboard and get started
with. Whereas Oracle's just so used to selling to the developers' bosses
that I'm not sure they really know another way, or maybe
they haven't yet really figured out how to cross that bridge, if they even want to.
Or particularly, maybe, because that's the angle Oracle has, you know. I mean,
certainly, and we'll get into this later on,
but certainly, I suppose,
Oracle have got the attention,
and they are, you know,
basically embedded in lots of big companies,
lots of big investment and so on.
Exactly.
The one angle they've got that, say, AWS and so on haven't got
is the kind of the footprint of the applications
or the industry knowledge and so on there.
So certainly, not to delay this point too much, but I got a distinct message from that first part, the bit
where it introduced with a very business-like way of putting it
and its use and so on: it was aiming at people's bosses, and, as you say, maybe ADF
developers and so on who are used to building applications. But let's move on to clip number
two. This bit here, clip number two,
was very interesting, because this is where Jeff gets into
using Oracle Public Cloud.
So he logs in, and the whole question there really,
I think about, I suppose, how many developers
have access to it and how easy is it to get access
and trials with Oracle Public Cloud.
But he goes through provisions,
or certainly points to where there's some kind of
provisioned instances that are there.
And then interestingly, very clearly talks about Kafka
and talks about kind of open source components
that are then kind of overlaid with Oracle tools and so on there.
And I thought that was very interesting that Oracle are now kind of,
you know, I wouldn't say embracing,
but certainly making use now of open source products
and, you know, naming them and so on.
I mean, what's your take on that really, Stuart?
Well, so it's no surprise to you, Mark, that, you know, me and my company, we do a lot with Kafka.
And we're also Confluent partners.
So we think a lot of this product.
So if you were to look at, say, the differentiators: we've done a lot with Kinesis on AWS, which is a competitor.
Kafka is full-fledged, built out, and it's very stable and it's very feature rich.
So the fact that they went with Kafka was exciting. I think if you look at, say, you know, AWS, they don't have it.
So their Data Pipeline tool, you know, was one of the first ones, and it's aged, and it's not really what you'd expect.
It's not Spark Streaming-like.
And then you look at, say, the Google cloud with Dataflow, which is very powerful.
But their Pub/Sub product is not Kafka-like.
It's more like a messaging queue.
So I think that when we, you know, one of the real, you know, approaches that we take now is
to invest in Kafka as an ingestion engine where you can really just ingest data and not worry
about what you're going to do with it yet. And I don't think any of
the sort of cloud vendors have that type of product or style. I think Confluent brings a
lot of great features that Oracle is obviously trying to write counterparts to. The fact that
Oracle is using their own REST APIs, I'm assuming, is to sort of capture the schema of what you're
ingesting there so that it can plug in to the other elements of their cloud.
So once again, Oracle's sort of locking us in to a certain degree, right?
So they've gone after these open source products, and that's great, and they've got great
capabilities, but everyone's going to use their APIs. And so it's an Oracle APIs. So I'm not quite sure why Oracle needs to put their own REST APIs
unless they're trying to capture schema
and make it easier to plug all this stuff together,
which is what we see in Jeff's presentation
is how easy it is to connect these things, right?
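To make that "ingest first, decide later" idea concrete, here is a minimal sketch using the standard Kafka producer client from Scala. The broker address, topic name, and HVAC reading are hypothetical, and this writes straight to Kafka rather than through the Oracle REST wrapper being discussed.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SensorIngest {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Hypothetical broker address; stock string serializers for key and value
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Raw JSON readings land on a topic; no decision yet about how they'll be used downstream
    val reading = """{"unitId":"hvac-042","tempC":41.7,"ts":1479772800}"""
    producer.send(new ProducerRecord[String, String]("hvac-readings", "hvac-042", reading))
    producer.close()
  }
}
```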
Yeah, I mean, I think I would probably, in a way,
give the benefit of the doubt on this bit.
I mean, certainly the fact it's based on Kafka,
as opposed to something else that's been built in-house or whatever, is a tick in the box,
which is good. Although, you know, presumably Amazon with Kinesis, or Google with
Pub/Sub, believe theirs is better for different reasons and so on. So I guess
the layering of Oracle services over it is for a number of reasons. One is,
obviously, they want to make it available as a service within Oracle Public Cloud, and we'll cover that in a second.
One is, like you say, to make it easier to integrate and so on there.
But also there's an element there of kind of making it slightly harder
to kind of go somewhere else and so on.
I think on that I'd kind of give the benefit of the doubt.
There's probably technical reasons and there's kind of –
maybe he's driving it, I don't know.
But certainly –
I 100% agree that I don't think their purpose was lock-in.
I just think it's going to be a side effect. And I think that,
you know, if you go use Kinesis, for instance, you're locked in, because no one else has Kinesis.
And if you go use Pub/Sub on Google, you're locked in. So, I mean, we can't really accuse them of
doing anything that the other cloud vendors aren't doing, right? Exactly.
They're taking an open source product instead of building from scratch.
And a very good one, I might add, they're going to put their own wrapping around it because, you know, half of or three-fourths of what you do in the cloud is provisioning and making things easy.
So, you know, I don't fault them for it.
It's just that the excitement around Kafka is kind of tempered a bit with the fact that it's going to be Oracle Kafka.
Yeah, we'll see. We'll see. I mean, so and then the next bit is kind of interesting.
So he then gets on to talk about the fact it's elastic.
So and that for me was the biggest kind of difference.
So obviously, if you look at what Oracle was selling as Big Data Cloud Service up until now, it was effectively a BDA, the Big Data Appliance, which you would pay for out of OPEX rather than CAPEX.
You'd rent it.
They'd host it for you in a data center.
But you were buying a BDA.
And clearly that was ridiculous because your minimum cost to start up was tens of thousands of dollars.
And it was something, but again, looking at how products are launched and so on,
it was something that they could do immediately, presumably.
And the long-term direction was the services. And having worked with services
recently, you know, on the Google side, you can see the benefit of that. I mean, just before we go
further, what's your view on, I guess, moving Hadoop and big data development, or just
development in general, maybe Snowflake and so on, into services running on elastic
cloud? What's your thoughts on that?
Well, so I can take this sort of slant first from the perspective of a small company
that's building pipelines for their own data, and it's incredible. I mean,
you think about what a company, a small company, would have to do to invest in BI or analytics for themselves
just 10 years ago. It's almost impossible.
And so I think about the idea that we can provision a service with an API;
most of our data sources are also REST API based.
So the fact is that you can roll up your sleeves and, with a little bit of glue,
go and write these pipelines.
And I think that everything I just said is applicable for big companies too, especially when you're looking at money coming from
departments now. Departments have their own technical resources now, and they roll their
own, sort of, at the department level. So I think that, you know, it does abstract away one of the hardest things about getting into the big data space,
or sort of getting started with a big data project, which is that there's so many pieces. I mean,
big vendors aren't used to a thousand servers, right? Or even a hundred servers for one data
asset. They're used to one. So I think the idea that we don't have to manage that,
and it's not important at the end of the day. I think on our last podcast, Mark, we discussed that
you gave the example of the space program and that there was so much ancillary technology
that came out of that space program. You made the analogy to Hadoop that so many,
there's so much great technology has come out of the Hadoop ecosystem.
And at the end of the day, the ironic thing is that the Hadoop side of this is the easiest to sort of abstract away.
These tools, these data pipelines and data streaming and all of that, it's incredibly exciting what you can do with data now. And when you build a cloud application,
hopefully it will, you know, hold up to the hype. But when you build a cloud application
where ingesting data and processing data is really easy, and it's sort of model-as-you-go in
each of these elements, it's really exciting. And I think they're headed in the right direction.
Yeah, I mean, I think it's interesting.
I think as technologists, I know you and I,
one of the first things we always do is we think,
say it's a new product area or a new whatever,
we tend to try and install it first
because often installing it
and seeing how the services work
and seeing how the components fit together,
that's kind of how we learn things.
And I think that then leads on to how
we then take on development. So I know certainly you and I, we've spent a lot of time talking about,
you know, schema-on-read, and we've talked about Hive and Impala and all the different
technologies there, and so naturally, yeah, that's the way you think
things go. But certainly something I noticed myself personally, doing work at home
with Hadoop and Spark and so on, was that running a Hadoop cluster reliably, and particularly ingesting data reliably and provisioning it and scaling it, is kind of hard.
And in the same way, the analogy I made, I did a tweet or a blog post yesterday about this:
I used to run a mail server at home and, you know, it was fun, the fact I was running a mail server and so on. But half my mail never got through, because the server was down, or I'd been blocklisted, or whatever. And, you know, beyond a certain...
You should have got Hillary's, uh, admin, right? Hers all got through there.
Exactly. So beyond a certain point, you start to think to yourself,
is this really a good use of my time? You know, given that you could be
learning so much more. And again, looking at running a Hadoop cluster at home, you have to say to yourself, obviously it is
completely ridiculous to do that in the first place, but then you think, well, actually,
we should be doing things to build on that, not just constantly
fiddling. And I think something that's interesting, having seen a few systems recently where
big data is scaled up to petabyte levels,
is you actually can't do that anymore. It becomes something where, realistically, you cannot manage
an on-premise big data system at the volumes we get to now, and that's
why these services come in. And I think, for me, that was when the penny dropped,
really, seeing that large systems do that. So you can see why
they're delivering it as services. That's what we need;
those are the building blocks, and then you actually add value with what you build
on top. Customers don't buy, you know, intricate Hadoop administration; they buy
what they try and do with it, really.
So I think it's interesting, isn't it?
Sorry, I absolutely agree.
I'm going to use an analogy with Oracle Business Intelligence.
When they first went to the cloud,
and a lot of people asked me, obviously,
well, what do you think about that?
Do you think, are you concerned about the loss of business?
Are you concerned about Oracle making this too easy?
And I said, you know, and I've said this before. I may have even said it publicly on podcasts or things.
But I didn't get into BI because I wanted to build WebLogic clusters.
I didn't get into BI because I wanted to integrate LDAP. What excites me is the plumbing becoming easier, and it actually being about things like data pipelines, about handling data,
about moving data, about finding innovative ways and cheaper ways to analyze, onboard, and move
new data sets. And I think that on-premises, whether it's a big monolithic application or whether it's a
Hadoop cluster or anything in between, the idea of going in and editing config files and
figuring out why something's not talking to something else, although interesting, and we've
always found that sort of thing interesting, it's not good for the customer necessarily.
Well, I think it was a necessary evil, wasn't it, at the time. But I think certainly let's move beyond that, really.
And I think one thing I've also noticed is when you get to the kind of scale that people go to now, again, a customer I'm aware of now and seeing their system and so on, you need that.
The elasticity comes into it as well because some of the volumes you see and the changes, I guess,
in kind of volume over kind of seasonality and so on,
what you need is something where you build it
and it just scales up.
I mean, I think this business of, you know,
you having to kind of re-architect it, for example,
because you've gone from sort of a certain level
of transactions to another,
it's interesting really.
And I think that in a way,
there's a kind of, you know,
the famous Cary Millsap quote about performance and so on being a solved problem.
I think, in a way, infrastructure for big data analytics should be a solved problem.
The thing then is about getting value. And this is where it gets interesting, I think, for consultants and people working in this area, because it gets back to that thing again.
As a consultant, you're not there to be an expert in infrastructure;
you actually need to understand analytics and the value in it. And that's
interesting as well: you can't hide behind technical knowledge anymore. You've got to be out
there actually delivering a solution, really.
Can I jump back to the Kafka point for just
a minute and make a comparable point? So Kafka is not easy to manage. I mean, you just look at Kafka, you know, from the
cluster perspective and how to build all the different consumers and producers and the
consumer groups. It's very, very complicated. And then when you start building, you know,
actually topics and trying to figure out how to make them durable, there's a lot of fidgeting
that you have to do underneath the seams.
And so one of the things that Confluent has done
is try to enable you to do more with that Kafka cluster,
knowing that customers don't want to go and build a bunch of clusters of different types,
maybe it's your Spark cluster, your Kafka cluster, et cetera.
So that's why they've introduced Kafka Streams,
and that's why they've introduced interactive queries:
because they know that building a cluster,
any cluster, is a difficult thing.
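As a rough illustration of the Kafka Streams idea, that stream processing runs inside your own application against the Kafka cluster you already have, rather than on a separate cluster, here is a minimal sketch. It uses the current Kafka Streams API, which is slightly newer than what existed at the time of this recording, and the broker, topics, and anomaly flag are all hypothetical.

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.KStream

object AnomalyFilter {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hvac-anomaly-filter")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092") // hypothetical broker
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

    val builder = new StreamsBuilder()
    val readings: KStream[String, String] = builder.stream[String, String]("hvac-readings")

    // Keep only readings flagged as anomalous and route them to a second topic,
    // reusing the existing Kafka cluster instead of standing up anything new
    readings
      .filter((_, value) => value.contains("\"anomaly\":true"))
      .to("hvac-alerts")

    new KafkaStreams(builder.build(), props).start()
  }
}
```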
The interesting thing in the cloud, however,
is that you can have Spark clusters
and you can have Hadoop clusters.
And the idea of saying,
well, I would really like to write that in Spark,
but I don't want to have to invest in another cluster.
I don't yet have a Spark cluster running.
And Spark might be the perfect language or solution for me to go do something in, but I don't want to have to manage another cluster.
I've already got a Hadoop cluster.
I've already got a Kafka cluster.
The last thing I want is a Spark cluster.
When you go to the cloud, you can choose what you want.
I mean, in the Oracle Cloud, Spark processing is something that you can provision with a few clicks, and it already plugs into Event Hub, which is their Kafka solution. It already plugs into their Hadoop solution. On premises, you often can't choose the right solution; a lot of times we get pigeonholed because we don't want to invest in more infrastructure. And so that's what's exciting about watching Thomas's presentation, Jeff's part of that particularly: if this proves out, it will be very easy for me to choose to use Kafka, and then choose to use Spark, and then choose to use Hadoop, without really thinking about all of the infrastructure that would go into that if I were to try to roll it out on premises.
Yeah, interesting. I mean, when we come to the end of this, I'd like to talk to you about where we think this will succeed for Oracle, what kind of impact it will have.
But let's get on to the fourth part of this that I thought was interesting, was very kind of topical and relevant in terms of us as ex-data warehousing people and data modelers and so on.
There was a section when Jeff then talked about bringing in kind of data sources and targets and so on.
And two things I thought were interesting in that clip.
So first of all, he brought in data sources that were like industry models, effectively.
So he was saying, here's a schema, a data source design, that is
based on some kind of manufacturing apps and all the manufacturing
knowledge we've got. And then there's a bit there where he
auto-generated some data warehouse dimensions and facts and so on. So there's two parts
that are interesting. First of all, what's your thoughts on, I suppose, Oracle's angle
on this? Do we think that having these industry models and this
industry knowledge will be enough of a differentiator? And what do you think about the idea of,
when you saw it in the video,
auto-generating kind of data warehouse schemas and so on,
is this the holy grail of point and click,
it's done and so on?
I mean, absolutely.
We've been investigating a lot of other tools,
both BI tools and data integration tools,
mostly cloud-based and over the last year.
And they all do that to some degree, right?
They all have algorithms, and most of them are probably based on machine learning of some kind or another,
where they're trying to sort of figure out nuances, figure out connections, figure out how things plug together for you.
And I do think it's the Holy Grail, but like the Holy Grail, you know, it may be really hard to
find. I mean, it may not prove out. The proof will be in the pudding, to use, I think, a British
expression. But I think that everything we build today should be looking to reuse as much of the data that we have.
It's interesting that, you know, we've spent so much time, and I know you have and I have in
years past, with ETL tools where every single bit of context has to be added by you.
And I like the idea that we're not reinventing the wheel most of the time.
I mean, 90% of what we're doing in an ETL, I won't say that, I'll say there's a good portion of what we're doing in an ETL mapping that is just sort of looking up in the document and seeing how these two things connect.
And I think that machines are much better at figuring that out than we are.
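This is not Oracle's actual algorithm, which wasn't public, but a toy sketch of the general idea being discussed: score candidate column pairs by name-token overlap and suggest mappings above a threshold, so a human only has to accept or reject them rather than wire every connection by hand. All names and the threshold are made up for illustration.

```scala
object ColumnMatcher {
  // Split a column name like "customer_id" into lowercase tokens
  private def tokens(col: String): Set[String] =
    col.toLowerCase.split("[_\\s]+").filter(_.nonEmpty).toSet

  // Jaccard similarity between two column names' token sets
  def score(a: String, b: String): Double = {
    val (ta, tb) = (tokens(a), tokens(b))
    if (ta.isEmpty || tb.isEmpty) 0.0
    else ta.intersect(tb).size.toDouble / ta.union(tb).size
  }

  // Propose (source, target, score) mappings above a confidence threshold
  def suggest(source: Seq[String], target: Seq[String],
              threshold: Double = 0.5): Seq[(String, String, Double)] =
    for {
      s <- source
      t <- target
      sc = score(s, t)
      if sc >= threshold
    } yield (s, t, sc)

  def main(args: Array[String]): Unit =
    suggest(Seq("customer_id", "order_date"), Seq("cust_customer_id", "date_of_order"))
      .foreach(println) // both pairs score ~0.67 and would be suggested
}
```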
So absolutely. The one thing I noticed: you saw the data source was called Cube.
Right.
Is that something he named or is there some data source somewhere that's called the Cube?
I couldn't get my head around that.
I don't know.
I mean, I know theCube is certainly a show that some of them go on,
so maybe it's that. I don't know, really. It's quite funny. I mean,
you and I have done demos, and some of our demos sometimes perhaps do not stand
up to very close scrutiny, so there could be things in there like that. But I thought that
was interesting. I think that was an angle that, again, could potentially be
a differentiator for Oracle. And I'll come to this in a bit when we talk about Dataflow ML:
that experience they've got with, I suppose, industry models and sources,
the footprint they've got in those industries and the knowledge there,
and the data warehousing experience as well.
I mean, I think certainly having used them, I've been using Google BigQuery and Cloud Dataflow
and so on a lot recently for some work
I'm looking at, and you can tell it's built by engineers; it's not built by data warehouse
people. So that could be an angle for Oracle as well.
And is that a good thing or a bad
thing? I mean, let me hear what you think, because there's pros to having, you know,
a lot of robust and expressive APIs that you can do a lot with. And then there's
a certain sort of learning curve, or sort of table stakes, as I think you sometimes call it.
There's table stakes for stepping up to the table. And I think that there's a lot of customers that
I've dealt with in the past, with big BI teams, that don't have anyone on their team who could even approach some of these things, right?
Yeah, I
mean it's i think certainly if i was looking at say google's um cloud for example i think
they might benefit commercially by making it slightly easier for a on-premise data warehouse
developer to uptake this technology i mean some of it is pretty obscure and particularly for google
with you know cloud uh cloud data flow it's pretty kind of bare bones, really. How much we want to just
repeat the kind of the ways we did things in the past is a question there, really.
I guess there's probably different use cases and so on there. And there is this
whole angle, I suppose, as well, of, you know, putting it into the cloud, consumerizing it and so
on. I mean, within the context of development, it's obviously not consumerizing it for your mum or something,
but certainly making it easier and using kind of, I suppose,
you know, machine learning and classification and so on
to make things faster makes sense, really.
And, yeah, I mean, that...
Do you think we will see GUIs, you know, for some of these things?
I mean, it's the next logical step,
but you might imagine someone like Google with Dataflow saying,
no, no, no, we're not going to do that because this is for engineers.
I think if Oracle, and I don't know if they're going to do that, right?
I mean, they might very well roll out, you know, usable GUIs to connect some of these dots.
But, you know, AWS has been in this business for a long time, and it's still
almost 100% API-driven. If Oracle
could step up and build the GUIs, and I'm not saying there shouldn't be APIs,
because I should be able to code against them if I want to code against them.
But at the same time, if the promise
of some of these things
hooking together the way Jeff showed them, and auto-generating certain aspects, holds up, and they could eliminate,
you know, 60% or 70% of the coding that has to occur, where
you're just coding on the penumbra, where you're tweaking and not so
much sitting staring at a blank page trying to write some code,
I think that could really differentiate them. Because I don't think the other cloud vendors,
save maybe Azure, are really thinking about those things.
Yeah, I mean, I guess also, you know, again, Oracle are looking to take their customers from where they are to this as well.
So there's probably an element there of, I suppose, GUI stuff being more relevant for them,
almost going back to the OWB and Forms background and so on. Let's move on, because
I'm conscious of time. So the next bit, the fifth part I was interested in here, was
data pipelines. And we've covered this quite a lot already, actually, in talking
about Kinesis and Cloud Dataflow and so on. But Stewart, as an ETL developer, first of all, just for the audience,
explain what a data pipeline, a dataflow pipeline, is, as opposed to the stuff you've
done in the past. What's the difference, and what does it mean as a developer?
What can I expect from this in the future?
Yeah, I mean, you know, the
folks at Kafka usually explain this pretty well when they're talking about Kafka Streams, for
instance. But, you know, as an ETL developer in the past, you've dealt mostly with data stores,
and you think in terms of taking data from a data store, maybe transforming it and joining it with
other data stores, and putting it into
another data store. And I think from a data pipeline perspective,
you sort of lose that. The first thing that goes is the idea that a source and a target
are a store of some kind. It's really just dealing with data in motion, with data being streamed.
Now, you might manifest data sets in pseudo-stores, usually windows of some kind, but they're usually exposed via APIs. So rather than processing data and necessarily loading it somewhere, a data pipeline probably exposes that data set at
the end, but doesn't necessarily manifest it. So when you go to write, say, a data pipeline, you're
dealing with data in motion, usually at each step of the process. Of course, I won't say of course,
but usually you're going to persist it in the end. But perhaps you're just exposing it to an application with all the calculations done; you don't necessarily need to load it if that application can consume that data in motion, usually by REST or some sort of API.
So I think that's the biggest difference is not thinking about data as stores necessarily,
but thinking of it as streams really.
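A minimal sketch of that stores-versus-streams distinction, using Spark Structured Streaming in Scala as one plausible engine for it, not the specific Dataflow ML implementation: the source is a Kafka topic rather than a table, the "pseudo-store" is a time window, and the result is exposed as a queryable in-memory view rather than persisted to a target. The broker and topic names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PipelineSketch").getOrCreate()
    import spark.implicits._

    // The source is a topic, not a table: records are consumed as they arrive
    val readings = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical broker
      .option("subscribe", "hvac-readings")
      .load()
      .selectExpr("CAST(value AS STRING) AS reading", "timestamp")

    // A window is a pseudo-store: a rolling view over data in motion
    val counts = readings
      .groupBy(window($"timestamp", "5 minutes"))
      .count()

    // Expose the result as an in-memory view an application can query,
    // rather than loading it into a target data store
    counts.writeStream
      .outputMode("complete")
      .format("memory")
      .queryName("recent_reading_counts")
      .start()
      .awaitTermination()
  }
}
```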
Yeah, and the last episode of the podcast
that I recorded was with StreamSets.
And certainly for them,
the whole concept of data in motion
is core to what they do really.
And there are certain things you do differently
because of that.
There are certain ways you architect tools differently and so on there.
And, you know, as a developer, you think differently as well.
I think certainly, you know, it's the kind of the,
it's the norm within what we're doing now and big data and so on.
And it's, you know, it was interesting to see it in there
within the offering as well.
And then getting onto the next part, which was the,
I suppose for me that the kind of the centerpiece of the demo,
which was Dataflow ML. For anybody who's not heard about this, it's Oracle's reimagining, I guess, of data integration
within the cloud, with Spark underneath it and big data and so on. So Stewart,
just from what you've seen, and I know it's not out yet and some of it's under
NDA, just explain what Dataflow ML is, and maybe paint a picture of what it looks like
for a developer.
Certainly. I mean, unlike some of the other platforms we've been
discussing, like Google Cloud Dataflow and AWS Data Pipeline, Oracle's product does have
a GUI representation, which, from all the demos I've seen,
looks pretty reasonable. Now, at any point in that GUI representation, you can click and you can customize, and you can customize with code.
But the idea is really that this is going to be a streaming solution using Spark and Spark ML specifically, machine
learning, to be able to build, I won't say ETL, but data pipelines, but also with recommendation
underneath it. So that's where the ML side of it or the machine learning comes in, is that you don't
have to touch every single piece of your data pipeline.
It's going to make recommendations.
It's seen that this data set roughly maps and has some connection to this other data set or this other data flow that you've expressed.
And it's going to make some recommendations, which you can hopefully accept as new stages in your job.
And it could, as you said, the holy grail a little bit earlier, it could handle some of the
nuance that doesn't really need a human being clicking and dragging. It just simply needs a
human being accepting. And I think that that could be what's really, really different. I think the combination of the machine learning built in with the recommendations, the GUI capabilities, plus recommendations around data processing, it could be, you know, perhaps the best product on the market.
It certainly looks like it could be. I guess knowing the people behind it and knowing the development team behind it, I think this is the part of the whole demonstration and the whole kind of proposition
that certainly for me I had most confidence in and also thought could be a differentiator.
I think that the knowledge and experience, and I guess for us the familiarity is important as well,
in how it looks, people behind it and so on is interesting there.
Certainly the questions I would have are around what
the scope of this will be. So will, for example, Dataflow ML cover things like data quality?
Think about the whole product stack
you get in, say, Oracle data integration, ODI, EDQ and so on; is that part of it, or is this out
of scope? And I suppose also, what about the fact
that they also announced, I think, a cloud version of ODI as well? That was mentioned too.
That strikes me as slightly, I don't know. First
of all, what do you think about the announcement of ODI in the cloud? What was your thoughts on
that?
I mean, I think these are two, well, we know that they're two different elephants, or two, that's a terrible analogy, two completely different animals is what I meant to say.
In that ODI as a service is going to be sort of traditional ETL in the cloud.
So still that mindset of thinking about sources and targets as stores and not necessarily trying to handle the whole streaming crowd.
Although we know that ODI is going to have some, at least on-premises, is going to have some robust
data streaming capabilities coming very, very soon. Will that stuff port immediately to the cloud,
to the ODIs in the cloud solution? Then it does produce some overlap that's a bit confusing.
I think... Go ahead, Mark.
I think I put my view on ODI in the cloud.
It's very tactical.
It's for the instances where clearly
it's not a big data kind of problem that's being solved.
It's something where they just want to be able to host it in there.
It's probably running it within a sort of
Java cloud service kind of instance and so on.
I think it's something that is there tactically; I can't believe they're going to invest much in it, really.
But then again, not everything is a big data problem, really.
Yeah, I mean, not everything's about analytics, right? We tend to focus on,
you know, analytics and what big data has done for analytics. It's really commoditized data in
such a way that everybody can take a data set and do something with it.
ODI is about still integrating.
And my take on ODI in the cloud is still more about the data integration side of it.
Delivering data stores for applications that may happen to be in the cloud.
You might do data warehousing in the cloud that way. But I think that we're starting to draw a line now where the whole relational world that we're used to and all of the sort of tooling that goes with it, data integrator sort of applies there.
You're sort of starting to draw the line at analytics and saying, maybe I shouldn't do analytics that way anymore;
maybe I should bring that down and do that in the big data space, the big
data world.
But if I just need to integrate data for an application that I'm building in
the cloud, I think ODI is going to be applicable there.
And you wouldn't want to go do that in Dataflow ML,
because you're not trying to analyze data,
you're just trying to integrate data.
Yeah, exactly. So, just to finish off that clip in that video,
there was a bit at the end that again was interesting, which was
a demonstration of Big Data Discovery. It was again
looking at the output of the dataflow, and it was saying, here's a bunch
of attributes and values and so on. But what was interesting in that demonstration was
the focus on machine learning and automated predictive modeling. So, you know,
you've seen Big Data Discovery. One of the things that it lacked, and it was probably marketed
at the start, maybe slightly wrongly or whatever, as lessening the need for a
data scientist, but it didn't do that at all, obviously. There were things in the latest
release, like the shell that you could run
PySpark in. But it was interesting to see a demo there where he
focused on an attribute, and the idea
was to say, click on this, and it built a model of which attributes were the most predictive, and so on. So my
point on that to you, Stewart, is: is this the answer to getting to the citizen data scientist,
or is it a novelty? What did you take from that last section, really?
this is where faith really had to kick in for me, because that's the side of this that is the most difficult
to solve in my mind. And I think what you look at at the end of this process and what we saw
in data discovery was, in my mind, the part that I don't necessarily have faith that the product is going to be that good, especially on day one.
I mean, that's really hard to identify all of those data points.
It's really hard to expose them in such a way that it's almost dummy proof.
Not to say that what was presented there at the end shouldn't be the target state.
It should be.
What they were showing and demoing there was brilliant.
And when we think about how analytics today is different from how it used to be,
it's those sorts of data sets that are the promise of these new frameworks.
That's where I think my faith ran out a bit,
and I'm going to have to experience it and see it to believe it, because that is usually a lot of work.
It's interesting. I think certainly where we are with predictive
analytics and data science and so on is that only a very small number of people can do it,
and it's very complicated, and it's all from the command line, all this kind of stuff. Clearly it would be better if it was more
accessible, and probably the answer to that is to
automate it and give helpers and so on. I saw some demos of
that feature before the actual keynote, and the thinking behind
it is to try and find ways to bring data science and predictive analytics to people who don't
necessarily have the R skills and so on. But it's hard to do, and I think that, you know,
a lot of it is presenting potential options to you: these are potential predictive
models and so on, and then you pick between them. I think it's an unsolved problem, and it was good to see in Big Data Discovery that they were taking it further,
and it wasn't just putting a shell there. But I'm not necessarily
sure it was the answer, though it was certainly a good attempt at it. And for me as well,
being quite emotionally invested in Big Data Discovery, it's a tool I use quite a lot,
I'm quite keen for them to succeed. So I hope it does work.
But I think, like you said,
I don't know if it is necessarily the answer.
I don't know.
Who knows, really, on that?
I will say that it is a product
that probably is going to be best consumed in the cloud.
I mean, if you think about everything that's required,
you know, if you were an Oracle sales rep
trying to sell Oracle Big Data Discovery
to a customer that doesn't even have Hadoop yet, for instance,
it's a really tough sell, right? I mean, there's so many things that you have to invest in to get
that tool off the ground and to get it in people's hands. It's not like Tableau where all you need is an Excel spreadsheet. There's a lot
that's required to get that tool stood up and the data sets available. I think that it's built for
the cloud, first off. If it's truly easy to provision all of the services that are required
to get content into Big Data Discovery, and if they are truly all connected,
such that I can simply ingest data into Event Hub,
build some reasonably scalable solutions
and data flows in Dataflow ML,
and get that data into Hadoop,
at least the Hadoop that's in the Oracle Cloud,
then suddenly, you know, using all those tools,
Big Data Discovery becomes something that's much, much easier
for somebody to get their hands on.
It was always difficult to even explore or demo the product
because there's so much that was required
before you could even get
to that.
Yeah, interesting. Anyway, it was certainly good to see that at the
end of the demo, really. And it's actually quite a nice segue into the second topic we're going
to talk about, which is data storytelling. And data
storytelling, just to set the scene here: Stephen Few, who we'd call over here the kind of Jeremy Corbyn of analytics, probably the Bernie Sanders in the States,
he said, you know, numbers have an important story to tell; they rely on you to give them a clear and convincing voice.
And there's a whole movement, which you know a lot about, really, which says you can get your story across better, and you can convince people and have more influence if
you tell a story about the data that you're presenting. And I know, Stewart, you did a
session at OpenWorld, I think with Mike Durran, on, well, basically
DVD, wasn't it, and data storytelling. Tell us about what that was, what
the thinking behind it was, and what you demonstrated and talked about in that
session.
Yeah, it was a great session at OpenWorld. You know, they quit letting
people into it; it was standing room only. And that's, I think, probably as much
a comment about Mike Durran himself, but also a...
Or a very small room.
Or a very small room.
Thanks for that, Mark.
Yeah.
Or it could be about Oracle's Data Visualization Desktop tool, or data visualization strategy in general,
or it could have been the storytelling element.
I like to think that it's the storytelling element,
which is really what we want to talk about,
which is, you know, I listened to a podcast not long ago, from Freakonomics; maybe we could put this in the show notes. It was about storytelling, and the research that had been done around the storytelling
paradigm versus the non-storytelling paradigm, and the trance-like state that people tend to go into
when they think they're being told a story versus just being presented with facts, which could be
seen sort of nefariously if you wanted to. But the idea is that, you know, just simple reports or dashboards are not enough.
You've got to layer them with context.
I mean, they almost need context dripping from all of these data sets from the very onset.
And so what Mike demonstrated very well in the presentation was how easy it was to take data sets, bring it into the tool, and then layer it with context such that we're used to having dashboards, you know, and sort of the traditional BI.
We have dashboards that have zero context.
It's just data being pulled in, and maybe the user can apply a filter, but beyond that, there's no context. And what Mike demonstrated was, yes, bring the data into the tool,
but we're not always building just, you know, a reusable dashboard that can be
reused for the next year without any context. It was more about
actually layering in context and telling a story, all from within
the tool. And I think that the feedback we
got from folks, around both the capabilities of Oracle's new tool but also that sort of mindset
of what it means to tell a story versus not tell a story, that feedback was great. And I think it
pointed to the fact that what we do with data today is so different. It is, yeah. And I think certainly there's more of an awareness now from people about stories being told with data.
People are used to kind of, I suppose, infographics, for example, you know, on newspapers and so forth.
And I think, you know, it's people, I mean, election in the U.S. kind of notwithstanding, you know, the idea that data can tell a story and so on is kind of, you know, obviously accepted now.
Within, I suppose, within the kind of the Oracle tooling itself, what features were there within there that you thought were useful and interesting?
And what does it mean as a developer to develop using this kind of idea of data storytelling compared to, say, before?
Yeah, so first off, we were using the old version, I guess you'd call it 1.0, of DV Desktop.
And now 2.0 has been out since then.
It's got lots of new features.
So forgive me if I don't hit off all the new features.
But it was really a couple of things we talked about.
First is the concept of an insight, right?
So you could go and do some discovery and sort of bookmark that. A bookmark
sort of cheapens it in my mind, but it gives you all of the context for which you came to a certain
data set. So just linking together insights in a storytelling mode, which the tool has,
is one thing. But Mike also demonstrated an infographic built completely within the tool. And this was fascinating
the first time I saw it, which was, it's got layers like you think of in a graphic design
tool. So you can bring the data sets in sort of as a foundation or a backdrop,
and then you can start editing or developing layers of content on top of it, such that you could even build an infographic, using data, that expresses
what we're truly trying to express today, what's in our heart, what we really want to get across.
Sometimes a reusable dashboard is not the way to do that. And I think the rise of infographics,
the rise of different approaches to storytelling, you know, for instance, you know, we use DV
desktop at Red Pill Analytics and, you know, we do our quarterly meetings.
They're all done from within the tool, right?
So, I mean, the capabilities of being able to truly tell a story with one tool is pretty fascinating.
And I think that we're going to see a lot more in this area from all the vendors.
Yeah, I mean, I think it's kind of, it's what's appropriate for the circumstances.
So if you had a dashboard of GL data,
then someone kind of weaving a story
about the GL transactions and so on and so forth,
and how many invoices are outstanding.
That, in a way, that's not what it's about, really.
I think it would get a bit tiring, wouldn't it?
You know, after a while,
someone gets a guitar out and starts
going through the story of the GL transactions. But going back to the
base idea behind a data scientist and so on, that you're finding out something
interesting and then telling the story about it, packaging it up and so on, it is good. And
I used that approach in the latest article I did for Oracle Magazine.
Yeah, it's a great article.
Yeah, so it's kind of telling the story of the data.
It was my Strava data.
I was getting in cycling data and so on
and telling the story of how I was using the data
to try and change some kind of behavior and so on there,
but presenting it in DV desktop terms,
in terms of insights and presentations and so on,
you can see the value in that, really.
I think that the features in the version that you use, DVD1,
are fairly thin for data storytelling. I think they are. Agreed.
They're taking snapshots. They are presenting it in a sequence and so on.
Have you played around with DV Desktop 2 at all since it's been out?
I know there's quite a few changes.
Absolutely.
Give us a bit of an overview, then, and your thoughts on DV Desktop 2 from Oracle.
Yeah, knowing that time is short, I'm going to hit two real key points.
One is data wrangling, obviously.
I mean, which just goes to, you know, there's so many different places that you can process data now that sometimes it's confusing as to where to do it. But truly, if you've got data
sets that you're bringing in and you're a self-service and you need to curate that data,
I hate to use that word because it's sort of loaded, but you need to process the data,
wrangle the data, so to speak. And I think the idea that not all data processing needs to exist in a layer outside of the analytics tool is a good thing.
I think the idea that someone sitting down trying to solve a problem that they have today waiting for somebody to build some data processing is misguided.
So I think putting some basic data processing capabilities
in the tool was big.
I think also all the new sources and targets
were very, very valuable as well.
Isn't that a revelation?
I mean, how long have we waited for Oracle to support XYZ source?
And now it's coming out every quarter.
It's crazy.
Yeah.
It's crazy to see all the sources and targets that,
you know, we used to, as Oracle partners,
you know, I know we've both worked
for different Oracle partners over the years.
You know, anytime you were
talking about anything
other than the Oracle database,
it was sometimes judged as misguided.
So the fact that Oracle's really gotten on board with the idea that not all data sits in an Oracle database is a revelation.
And the fact that all of those sources and targets.
And the last point I really want to make about the new tool, and something that I've read up on but have not tested, is
the new sort of SDK for writing and developing, building your own
visualizations. You know, with all sorts of
different enterprise BI tools, we've tried to hack to bring in visualizations that aren't there;
now there's actually a supported process with proper APIs.
And I think that could truly be bring-your-own JavaScript libraries
that you plug into the tool.
That could be game changing.
Interesting.
And yeah, it does beg the question, though, with the data
wrangling in there, with the visualization and so on, to what extent it's going to eat into the sales of Big Data Discovery,
or certainly be a sufficient substitute. Something I found, again, looking at
some of the data I was bringing in from home and so on, was that I tended in the end to
veer towards DVD, and, agreed, it was enough, really. And certainly the problem,
I think, with Big Data Discovery is that it's fantastic at bringing data in, wrangling it, and
enriching it, but as a tool to present that data out to yourself or to users,
it's a bit obtuse in how it works sometimes. And given that DVD is, I presume, quite a
lot cheaper and an easy downloadable install, you do have to wonder, if you were a PM, really,
for BDD, Big Data Discovery,
how much of a threat DVD would be, really.
Is this an interesting thing?
I don't know.
What do you think?
Well, that's why I feel like I stand by my former comment,
which was that BDD is a product built for the cloud.
With Big Data Discovery, you're going to have to stand up all these things just to even get started with it,
whereas DV Desktop is so easy to get started with.
I mean, you just have your data set in a proper place and then get a feed of it.
The idea behind Big Data Discovery is you've got to have some really complex things built to be able to use it.
And that's why it probably makes sense for it to be powering the analytics in their cloud, or at least in the big data side of their cloud.
Whereas DV Desktop has the mashup capabilities:
I can connect to Redshift, which is in the Amazon cloud; I can connect to an Excel document, and also my Oracle database for my business, all from my desktop. Something that we would have fought against five to seven years ago as being crazy.
Right, Mark?
Yeah.
I mean, ironically, the interesting thing I've seen, I've sat through a couple of Tableau things recently.
Tableau 10 is out now.
And the big thing about Tableau 10 is support for Linux, and they're bringing out a proper
sort of server version. They're going the opposite direction:
at the moment, to run Tableau, you've really got to have
the desktop tool as well, but they're looking to make it so that it runs entirely in the
cloud. They're adding in stuff around governance, things like certified data sources and so on
there. And interestingly, they're adding a data wrangling feature in as well,
but they're adding it in, they're making it a separate product
because in their view, they're quite different kind of like, you know, use cases and so on.
So in their case, they're adding it in as a separate product,
which is also interesting because there's a whole kind of, I suppose, ecosystem around Tableau of data prep tools.
And actually the hot new tool in that market,
which is Paxata, I think;
they're like the Tableau of data prep.
So there's a lot,
I suppose there's a lot of cross-fertilization,
trying out different ideas and so on.
But isn't it good to be working in the Oracle space
to have a BI tool that's easy to install
and supports your data sources and looks nice?
So that for me is not a bad thing, really.
Nice to you, Mark.
Exactly.
So Stewart, it's been, obviously, as usual,
fantastic speaking to you.
And yeah, at some point,
come back and join us again.
It's been really good.
I'll see you.
Absolutely.
I look forward to my next one.
Yeah, hopefully I'll see you at BIWA in January.
Other than that, thank you very much.
Take care and see you soon.
All right.
Cheers, Mark.
Okay.