Drill to Detail - Drill to Detail Ep.29 'New-World BI Development using BigQuery, Looker, Kafka and StreamSets' With Special Guest Stewart Bryson

Episode Date: June 5, 2017

Stewart Bryson returns to the show to join Mark Rittman to discuss new-world BI and data warehousing development using Google BigQuery and Amazon Athena, Apache Kafka and StreamSets, and talks about his experiences with Looker, the cloud-native BI tool that brings semantic modeling and modern development practices to the world of business intelligence.

Transcript
Starting point is 00:00:00 So hello and welcome to another episode of Drill to Detail, the podcast series about the business, technology and strategy around analytics and big data. I'm your host, Mark Rittman, and I'm pleased to be joined once again by my old friend and former colleague, Stewart Bryson, the second funniest person in analytics, talking to us all the way from Atlanta in the US. So, Stewart, great to have you back. And why don't you introduce yourself to anybody who's new to the podcast? Yeah, great. Thanks, Mark. It's really an honor to be back for my third visit here. Always a good time. So I'm the owner and co-founder of a company here in the States called Red Pill Analytics. We started off with traditional sort of BI and data warehousing tools, but that's only about 50 or 60% of our business now. So we're very much headfirst into some of the topics you've been discussing on the last few podcasts, and really interested in having this discussion.
Starting point is 00:01:05 I think it's timely, especially considering your last few guests and the meaningful conversations you've had, so hopefully I can stand on the shoulders of giants and add something to the conversation. Excellent, excellent. Well, sure, what you're alluding to there is kind of how I put this to you, actually, when we were talking about doing it, in that we've had a lot of people, some really good presenters, really good guests, I suppose, really, on the show in the last few episodes. Talking about, I think, one way to put it is the new world of BI and analytics and development, and some of the tools that we're using, some of the platforms we're using, and how the techniques are changing and so on. And I thought it'd be good to have you back on the show as someone who, like me, has been developing, say, Oracle-based solutions, what you might call old-world solutions, in the past, to give your perspective really on what works, what doesn't work, how things are different, how things have changed, and really, you know, the Stewart opinion
Starting point is 00:02:01 really on some of these technologies that we've been talking about. And in particular, there's kind of three areas that I wanted to talk about in this show with you. Okay, so what I want to talk about really is the new cloud-based and big data-originated data stores, so things like BigQuery and Athena and some of these new things coming out, these new kind of almost data warehouse as a service cloud platforms. Some of the new BI tools that are out, and one I know that you've been working with, and I've actually had some experience with, is a tool called Looker; so this is a new take on, well, I suppose a new take on BI, optimized for these new cloud-based distributed data stores, but with some
Starting point is 00:02:40 of the kind of ideas coming back in that you and I know quite well from the world of Oracle BI and Cognos and so on, things like semantic models and metadata and so on. And then, you know, another area that we've both been looking at, and I know you're particularly interested in this, is data integration. And you've looked at things like StreamSets, and I know you work with Kafka and so on as well. Really, what is it like to create these kind of real-time data pipelines? What is it like to use tools like StreamSets? How do things differ? And I suppose, in a way, the essence of the question really is,
Starting point is 00:03:12 are some of the new tools and techniques we see coming through, is it a new paradigm? Is it a new way of doing things? Or is it the old problems being readdressed, in some cases with tooling and techniques that are kind of immature, really, you know, scripting and so on? And that's the kind of background, really. There's a lot of things to cover there, and so what I thought I'd do is take it in stages and go through the three different layers we're talking about
Starting point is 00:03:37 and then get your opinion on these things, really. So let's start. I'm game. Yeah, good. Well, you're always game for things like this, so that's good. But let's start off with something that's been a big interest of mine over the last few months, particularly in the kind of product management work I've been doing: data warehouse as a service platforms like BigQuery and so on. So first of all, do you want to explain what those things are and why customers are interested in those kind of new ways of storing data? Absolutely. If you look at, for instance, Athena and BigQuery, I don't know how much they like to be grouped together, but it's very similar to what's going on there,
Starting point is 00:04:14 which is a Presto-based query layer on top of usually object store files, JSON, Avro, those sorts of things. And the difference is you don't have this online, memory-intensive, relational database that in a lot of ways was really born to do transactions. Transactions are what we've always tried to get around using relational databases for data warehousing, find ways to load and batch using loaders instead of pure SQL and find ways around it because it's just not what they're geared for. Whereas these platforms are more for massive queries that scale. They scale for you. They don't take a lot of administration.
Starting point is 00:05:13 They don't require, from a storage perspective, a constant eye on storage. Those sort of costs are baked in. And you can imagine someone that's been doing SQL for a long time, getting their hands on some of these platforms, and the SQL access feels premature in a lot of them. But you have to think to yourselves, what's their real goal, and their goal is to query. And so from that perspective, it's refreshing from both Athena and BigQuery when you go in there and you're defining your schemas, you know, that they don't require you to specify. You can't even specify, you know, what's the length of the string? What's the precision of the integer or the number?
Starting point is 00:05:56 Those things are just taken for granted. And I can tell you that from my perspective, that's refreshing, because I can always hear in the back of my mind, you know, the Tom Kytes of the world talking about perfectly designing your tables. And that was always good advice for a relational system, but that stuff really shouldn't matter when we're talking about querying and analytics. So I think, you know, they are much different in that the access to them is built in the browser. You go to the browser to define things like tables. You may not get the full complement of DDL and the things you're used to in a relational database. But at the end of the day, Athena's got JDBC and REST. You know, they just released the REST API five days ago.
Starting point is 00:06:47 We're already using it on a project. You know, you look at BigQuery, it's mostly REST-based for access, or tools like Looker that are built to access these platforms. So what kind of customers are you finding are using these services? I mean, are they coming from a big data background? Is this a case of another sort of generation of big data systems where they're starting to take on some kind of, I suppose, data warehouse characteristics? Are these on-premise data warehouse systems that are migrating to this? Or is it net new? I mean, who typically is interested in
Starting point is 00:07:25 these, really? So I'll just do a quick shout-out for Snowflake. If you're looking for sort of a lift-and-shift mentality, if you've built around traditional BI tools and a sort of traditional mindset, that's still a relational database with really robust SQL, and that's probably what you're looking for. But when you look at the sort of customers that are looking to BigQuery and looking to Athena or those sorts of solutions, the one characteristic I think that really stands out is these are companies that are part of the digital transformation, really. They're not bystanders. They're not analysts sitting around looking at dashboards, trying to figure out the supply chain of their widgets. They're building data-driven applications. So I think when you talk about building,
Starting point is 00:08:19 and that's net new in a lot of cases, Mark, or they're being ported from on-prem big data systems. That's another sort of key that we see. I think that if you're building data-driven applications, these are applications that are using data and delivering them to customers in some way. These traditional BI tools just really don't stand up to that kind of workflow. I mean, you think about developers when they want to build an application, and that's what we're talking about, not a dashboard, but an application. They expect source control, they expect regression testing, they expect some of these deployment mechanisms and building jars and shipping jars makes a lot of sense in those scenarios. So that's one of the sort of the things
Starting point is 00:09:06 that really stands out to me. You know, Maxime, who was your previous guest and talking about what they were building at Airbnb. And I was at the Kafka Summit in New York a few weeks back and Airbnb did a presentation there. And they're not delivering data to a data store that they're pointing a BI tool. They're doing that stuff. But along the way, they're building these pipelines to interact with data, to process data in real time, to stream data. So these applications can consume them and deliver some sort of end user experience directly to their customers. So a lot of the times they're building this stuff to deliver some sort of value to their customers as opposed to just sort of internal ruminating about, you know, how many widgets they've sold, if that makes sense.
Starting point is 00:09:56 Are you finding that, so let's take two examples there. If we took, say, Amazon Athena, and say BigQuery, for example, are customers coming to you and saying, this is something we want to try and do, we've got a problem with scale, we've got a problem with kind of agility, and this is the solution? Or are they being sold it? I mean, you know, where's the problem that's being solved, and who's solving it for who, and that sort of thing, really? I think we certainly
Starting point is 00:10:23 have customers that have come to us and said, it's got to be platform. It can't be infrastructure and it can't be on-prem. We want platform. And we will carve up the edges where necessary to invest in platform. What does that mean in layman's terms? Yeah, just platform as a service. They're not looking for spinning up VMs and doing their workloads in the cloud on infrastructure virtual machines.
Starting point is 00:10:53 Although that does decrease some of the administration, it doesn't keep you from having to SSH into machines and manage storage, manage users, et cetera. So we certainly have one customer in particular, and a couple of others, that have come to us and said, you know, give us platform as a service. And those customers obviously gravitate toward, you know, the S3 and Athena model. I mean, S3 is so pervasive. Every tool you look at, from StreamSets to whatever, has a way to read and write S3. So the fact that you can build a data lake using S3 and Athena is fantastic. But I will say that predominantly we're introducing these concepts. We have traditional customers that are starting to see,
Starting point is 00:11:48 you know, that as data sets increase, and as different kinds of data sets are evolving and becoming important, they're looking for solutions, and these are some of the things that we're offering. Okay, okay. So let's look at an aspect of that then. So something that you and I, yeah, we spent a long time thinking about in the past was data warehouse design. How we design tables, how we laid them out to be efficient to query, and so on. You know, we kind of learned the gospel of Kimball and materialized views and indexes and all this kind of stuff. How does that differ if you're going to build a data warehouse in a platform like, say, Athena or BigQuery? So you mentioned two things that I'll sort of gravitate towards. So you mentioned a model and you mentioned indexes. And I think the latter is just something you don't worry about anymore. For one thing, these massively parallel engines such as BigQuery and Athena, they just sort of get that.
Starting point is 00:12:49 Not to say that there's not tweaking and tuning. There's partitioning at the file level that you need to consider that's going to make a big difference in performance. But in general, you can increase your capacity depending on how much you're willing to pay. You can have, especially with BigQuery, you can have resources already spun up and available for you. But if you take a look at the model perspective, this is significantly different. Not to say that you can't do a Kimball-like dimensional model. In these technologies, you absolutely can. And if you have complex dimensions, such as customer, you probably want to continue to sort of read
Starting point is 00:13:33 your Kimball and listen to what he says. However, some of the modeling techniques that we've struggled with over the years, many to many, those sorts of things, become almost unimportant in these platforms because of nested tables. So you can store objects inside of quote-unquote columns. So if you're looking for many to many relationships, you can actually store that right into one table. Not necessarily, you know, completely ignoring what Kimball has said about some of those techniques, but there are options in these platforms that make basic modeling. It gives you pause to think, do I continue down the same road I've always gone down, or do I try to take advantage of these new techniques? You know, the Oracle databases had nested objects and columns for years, but no one uses them because when you're in an Oracle database or any relational database,
Starting point is 00:14:33 you're typically designing for the lowest common denominator of BI tools and ETL tools. Yeah, yeah, definitely. I mean, I hit that same issue as well working with BigQuery. First of all, you build out your tables, you build out, maybe not so much facts and so on, but you definitely are thinking in dimensional terms. And the work I've been doing recently is, we might model it logically as dimensional, with facts and dimensions, but it's always then, it's always denormalized, because in an engine like, say, BigQuery,
Starting point is 00:15:08 joins are expensive. And so there are better ways of linking to dimensional attributes and so on than doing that. And I think there's probably going to be a lot of systems built over the next few years by people moving from, say, relational systems to these platforms and building it in the same way, and, you know, therefore then blaming performance and latency on things like BigQuery,
Starting point is 00:15:30 when actually it's not that, it's the joins. And I think there's going to be a whole kind of world of consultancy, and I suppose best-practice advice and so on, around this in the future, because there's a lot of systems being built, I'd say, suboptimally now, really. Yeah, I think when you looked at relational systems, and the models that we grew up building, space was always a consideration, so you never duplicated data any more than you had to, you thought long and hard before you built aggregates, all those sorts of things. Now, in this sort of cloud data store world, that's the cheapest thing. The storage is literally the cheapest thing. It's the processing power that costs. So why not compute these things and duplicate your data, pre-compute them? It's definitely a design technique that has been bubbling up for us as we've moved across this void and seen a different way to build data applications.
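To make the nested-table and denormalisation ideas above a bit more concrete, here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset and field names are hypothetical and this is just one way of doing it: a repeated RECORD column holds the order-to-product many-to-many inside the orders table, there are no lengths or precisions to declare, and a pre-computed rollup is written to its own table rather than joining at query time.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Logical types only -- no string lengths or numeric precision to manage.
# The repeated RECORD column keeps the many-to-many (orders to products)
# nested inside each order row instead of in a separate bridge table.
schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("order_ts", "TIMESTAMP"),
    bigquery.SchemaField(
        "line_items", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("product_id", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("unit_price", "NUMERIC"),
        ],
    ),
]
client.create_table(bigquery.Table("my-project.sales.orders", schema=schema))

# Storage is cheap, so pre-compute and duplicate: unnest the repeated column
# and write the rollup to its own table instead of joining at query time.
sql = """
SELECT o.order_id,
       item.product_id,
       SUM(item.quantity * item.unit_price) AS revenue
FROM `my-project.sales.orders` AS o, UNNEST(o.line_items) AS item
GROUP BY o.order_id, item.product_id
"""
job_config = bigquery.QueryJobConfig(
    destination="my-project.sales.revenue_by_order_product",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(sql, job_config=job_config).result()
```

If the rollup needs to stay fresh, the same query can simply be re-run on a schedule; that is the trade being described here, spending cheap storage to save query-time compute and avoid the joins.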
Starting point is 00:16:31 So have you, I mean, one thing again I've been noticing is that even with proper table design and pre-aggregation and so on, there is still a latency you get with all these systems. Typically, if you're using BigQuery, you're get you know a few seconds latency on every query and you know we're used to in the world of kind of say Oracle for example things like in memory and we're used to OLAP and so on have you looked at or thought about things like Druid or maybe a kind of kind of intermediate layer at all to use with some of the kind of the systems you've been building to try and get back to that point of being kind of split second response time? Has it been an issue for you or what? So it has and it hasn't. I mean, I think I just pulled the covers on Druid this week. So my first real look at it. I think the answer is yes, that these things do show up. But I would argue, Mark, that they have in relational databases for
Starting point is 00:17:26 years. I mean, the query performance that a lot of our customers deal with is poor. And I think looking at the right tool for the right job, as Gwen said in your last episode, Gwen Shapira said in your last episode, it's okay when you have something like Kafka or sort of a universal ingestion engine to go load multiple stores. And that was always sort of the absolute no-no in building relational data warehouses is to not duplicate the data. But why not? If you've got a use case that requires search, if you've got a use case that requires events, if you've got a use case that requires state, whatever the use case you have, there is loading intermediate layers. That's
Starting point is 00:18:21 all the in-memory database is, after all. So these are all techniques that I really don't think, you know, have changed that much; there's just different names. So you've had a background as a DBA in the past as well, and a lot of people listening to this might come from a more traditional DBA background and are thinking either there's nothing for me to do in these new platforms, or it's such different technology and different techniques that I wouldn't know where to start. If you were a DBA, particularly a sort of development DBA, someone working as a DBA on data warehouse projects, what would the world look like for them in this kind of new world of things like BigQuery and Athena and so on?
Starting point is 00:18:56 Well, so Maxime's, the show you did with Maxime from Airbnb is a good one for this because you're probably not going to sit around and manage indexes and manage storage and manage users. That world is, I do believe, gone. And frankly, you know, good riddance in my mind. I think that to exist in this new world, you're going to have to start thinking about becoming an engineer. Now, that doesn't mean that you have to be a hardcore coder. And something else Maxime said in your last show was that the level of development skills that you need is not necessarily what you think it is. Because, you know, if you were going to hand code a data warehouse ETL process or data process, in the old world, you're typically starting from scratch. You know, there's no execution platform,
Starting point is 00:19:59 there's no error handling. But all of these platforms, from Spark to Kafka Streams to, you know, Apache Beam and Flink and all the different runners that Beam has, you're starting at a high-level, abstracted coding perspective. So a lot of what you're doing is already built for you, and you're just instantiating those objects. So I think that there's a lot of what he said that rings true. You don't necessarily have to be a hardcore coder, but you can't be afraid to write some code. If that's going to be something that is sort of your Rubicon, so to speak, that you just don't want to do, you're going to have to find ways of plugging systems together, as enterprises are considering multiple clouds, best of breed.
Starting point is 00:20:49 There's always going to be configuration and integration work, plumbing, as I like to sometimes call it, that can go on there. But I think your traditional sort of operational DBA, sorry, don't necessarily take my absolute word on this, but I do think, from cloud to new platforms to whatever, those things are sort of evaporating. Okay, so we've been talking about Amazon, talking about Google and so on. Where's the likes of Oracle and Teradata and all those kind of vendors playing in this? I mean, both you and I have a background in Oracle and so on there. Are there still projects being started in technologies like that?
Starting point is 00:21:30 Are those vendors kind of being relevant in this market? What's your view on that, really? So we happen to have several customers doing OAC. What's that, sorry? Oh, sorry, Oracle Analytics Cloud. Yeah. So the Oracle, and I know you know that you were doing it for the sake of your listeners. So thanks.
Starting point is 00:21:51 But Oracle Analytics Cloud. And we've got one customer in particular that was a spinoff from another company. And they're 100% net new across the board. And they made a big investment in the Oracle Cloud. So everything they're doing investment in the Oracle Cloud. So everything they're doing is in the Oracle Cloud, from analytics to the database and Exadata in the cloud, GoldenGate in the cloud, et cetera. And I think that the only problem with Oracle is that they're just behind. They're building out first products that, that the other cloud vendors
Starting point is 00:22:27 already have. And I think that that's the only problem is can they, can they innovate? And there's been a lot of internal change there about the way products are built, about the way they're delivered, et cetera, that I think is good. Uh, but it's just a question of, you know, anytime you're in a race and you start from, you know, several paces back, it's going to be a challenge. Their first-gen products that they roll out over the next year are going to have to be really, really good.
Starting point is 00:22:57 I'm encouraged by Event Hub, which is Kafka-based. I'm encouraged by some of their Spark products that are coming out. So I'm encouraged by a lot of what I'm seeing; it's just, can you overcome, you know, the gap? Yeah, and I think developer access and enablement has been interesting as well. I mean, I've now finally got hold of access to the elastic Big Data Cloud Service, for example, and so I've been playing around with that. But, you know, that's through, I guess, people I know, and through kind of that history and this background as well,
Starting point is 00:23:28 I think, here. Yeah, exactly. And I think that it's all good stuff, but I think there's a huge kind of gap to be made up there, really. And it's interesting, I mean, I think we all wish well for Oracle and other companies as well, but it's interesting to see the pace at which the big cloud vendors are now releasing these new platforms, getting traction on there, and so on as well. And let's move on then, let's move on to data integration, because that's another area that you specialize in as well. So let's think about a couple of areas that I want to talk about. One is the changes that have come about through the focus on things like Kafka, you know, streaming data integration, which I talked about particularly in the last show. And I want to talk about a tool called StreamSets; we had them on the show earlier
Starting point is 00:24:09 on in the year or last year. And I know you've been working on that as well, but let's start actually with Kafka. So just tell us a little bit about what problems it solved, what you've been doing with it, and just set a little bit of context before we get into the detail. Yeah. So I do a Kafka talk I've been doing for the last, close to a year now, and I've been doing it at a lot of Oracle conferences. So the picture I like to draw is that, you know, if you think about a relational database, its main purpose has always been to deliver state. And by state, you know, inventory on hand, the state of the table, what are the current values in the
Starting point is 00:24:45 table. It doesn't matter how many inserts, updates and deletes have been issued; what you typically get from a relational database is the state of that table. Kafka's sole purpose is to deliver the events that lead up to state. And that's where data-driven platforms really thrive. So if you think about this for all of our Oracle database listeners, and I'm sure you still have quite a few of them, it's like the redo log, but it's the redo log for your entire enterprise. So, as a matter of principle, you'll ingest any data that you have there without worrying about what the downstream application of that data is going to be. And then you start building decoupled. And Robin Moffatt kind of hammered that point home in our last conversation. Decoupled: decouple the ingestion from the consumption, the production from the consumption. And I think that that paradigm is really important, even for traditional shops that are thinking about a change in the way data integration is done, because typically when you've done ETL tools in the past, maybe you put an intermediate layer there that's a table, but it's source to target, sort of, 100%. And I think the idea of not worrying about where your target,
Starting point is 00:26:12 sorry, where your sources are, when you're always having a single ingestion engine such as Kafka, is a really important paradigm shift. And I mean that in the strongest way possible. But building on that, one more point is that Kafka now is, and some of this comes from the Confluent platform, which is built around Kafka, but it is a complete data platform now. So you've got Kafka Streams. If you want to do processing, if you want to do data transformation and data pipelining and data processing, you don't necessarily need to invest in another cluster such as Spark. It's got a really expressive programming language, mostly written in Java, but there are other options. So I think that when you look at just sort of putting a stake in the ground, if you really want to transform
Starting point is 00:27:07 the way your organization sees data, you've got to put Kafka in the middle, because it's the only way to really be agnostic to where the data came from and only focus on what you're trying to deliver in this microservice, in this application, or even this data warehouse. So how is it different, then, to things like GoldenGate and enterprise service buses and every other kind of messaging and data transport mechanism that's been out there? I mean, these have been there for a long time. Why have Kafka and Confluent kind of caught on, really? No, I absolutely agree with you. GoldenGate can do the flavor of an event-driven environment. But you're still delivering it to a relational database whose purpose for most of its lifetime
Starting point is 00:28:16 has been to deliver state, not events. So Kafka is all about delivering events. And you may not even have to consume data from a table in a database. Your application can call Kafka directly through Java, through REST APIs, et cetera, get data, get schema, get events. So you can build applications. I think if that's – I made that point at the top is that it's actually part of the application. It's actually part of the consumption of the data in a way that GoldenGate and other ETL tools really haven't been in the past. So is this another way of building,
Starting point is 00:28:59 solving the same kind of use case and problem as ETL had, or is this for different types of application really? I mean, yeah, so simple question. Is it the same use case, or is it a different use case or different scenario? Consulting answer, Mark: it depends. So now let me drill into that a little bit. So it's interesting in traditional ETL, talking to users, talking to customers, drilling down to requirements, and you start trying to get from them, do you need all of your event data? Do you need every update, every insert, every delete, every change? And in those traditional workflows, even the tools muddied the waters about what is change data capture.
Starting point is 00:29:40 Well, it's not simply using a last modified date and pulling the state of your source table at that moment in time. It's about capturing all the events. And when you go talk to people who are building modern data platforms, data engineers, people who are building streaming applications, they get that. I mean, that is not something you have to explain to them. There's no nuance there, no light bulb that goes on when you finally, you know, get across what you're trying to ask. They understand that it's part of everything they're doing. So I think there are some of these problems that are not different. We need to transform data. We need to deliver it to a serving layer that might be made up of several different tools. That's still going on.
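As a rough sketch of the events-versus-state idea, and of producers and consumers being fully decoupled, here is what that pattern might look like with the confluent-kafka Python client; the broker address, topic name and event fields are made up for illustration.

```python
import json
from confluent_kafka import Producer, Consumer

# Producer side: publish every change event to a topic, with no knowledge of
# who will consume it downstream or why.
producer = Producer({"bootstrap.servers": "broker:9092"})
event = {"op": "UPDATE", "table": "inventory", "sku": "A-100", "on_hand": 42}
producer.produce("inventory-events", key=event["sku"], value=json.dumps(event))
producer.flush()

# Consumer side: a warehouse loader, search indexer or microservice subscribes
# independently, in its own consumer group, and replays the event history at
# its own pace -- the ingestion side never needs to know it exists.
consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "warehouse-loader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["inventory-events"])
msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(json.loads(msg.value()))
consumer.close()
```

Adding a second consumer group is all it takes to feed another store or application from the same stream of events, which is the "load multiple stores" point made a moment ago.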
Starting point is 00:30:34 But this sort of core concept of events as the source for analytic applications is, I think, what's really different. And yes, the tools in the traditional world are there, but they're always sort of workarounds. They're always sort of ways to make a traditional tool do something a little bit different. Whereas these data streaming tools just get that, you know, as part of the sort of a core requirement. Okay. Okay. So one of the points that Gwen mentioned in the last week's podcast was building things like error handling, for example, and just generally working with a data pipeline where the data never stops is quite a different way of, you know, quite a different environment to work with as a developer and quite a different kind of problem set and all that to working with Batch.
Starting point is 00:31:26 They're massively different. Your background is in Batch. How have you found that transition? And what are the things that are interesting and things to point out and tips and so on? So I would say definitely the tool that we've just really gravitated toward recently is StreamSets, as you mentioned. I mean, it just has this built in the error handling.
Starting point is 00:31:45 You define error handling at the pipeline level. And then for each stage, you can choose to just accept the default error handling or override the error handling. You do things like write to S3, write to file systems, write to Kafka, which is what we usually do is all error records will go to a Kafka topic. And then you can just listen to that Kafka topic as part of your data pipeline and constantly process the error records at the same pace that they're being rejected. And you don't have to code that piece of it right away. There's a bit in sort of the agile sort of way of doing things that, yeah, you do need to handle error records. You do need to process those with probably some differing logic. I think what we used to do in the old ETL world is apply that error handling logic to every single
Starting point is 00:32:41 record as it's coming through. But really, in the streaming world, you don't have to do that. So you put in the logic that you expect for all of your data that's flowing through, and then just handle the complex logic in an error stream. It's really easy to use. I mean, StreamSets is, you know, I think it was, it might have even been you, Mark, I hate to call you out, but I think you might have said you haven't seen the Informatica of the big data world. No, that's right. That was one of my kind of really provocative statements a while ago. I suppose in a way my point was I haven't yet seen the equivalent kind of slam-dunk solution in ETL for big data. I think StreamSets is it. I think
Starting point is 00:33:23 StreamSets is as close as we can get. I mean, if you look at it, you know, there's always going to be coding required, but they make that relatively easy. So, sure, just to interrupt you there, if anybody doesn't know StreamSets, just paint a picture, really, of what StreamSets is, and again how it differs from everything else that people have seen around kind of big data ETL. So I'll paint this picture by again referencing the Kafka Summit I was at. And I love Kafka, and there was lots of great content there, but every slot, every presentation was riddled with code. Riddled is the wrong word; it's all code, right? So you want to talk about, we have a new feature, here's a code sample. And there's nothing wrong with that, because if you're building high-performing data-driven applications, you might have to do a good bit of coding.
Starting point is 00:34:17 But StreamSets is pretty easy to pick up. It's a graphical UI. It's browser driven. It runs through the processing engine runs either in a standard JVM for sort of lowest common denominator processing, or it can run in a lot of the Hadoop distributions. So it can it can natively execute across Spark. It can natively execute across Hadoop clusters when those stages are being read from or written to. So for those ODI folks that are out there, it does have that sort of ODI feel about processing in the right place. So if you have a Hadoop cluster and you're going to process some data and you're pulling from or writing to it, then it's going to do most of the processing there. If you've got a Spark cluster that you've configured, you can point your pipelines, as they're called in StreamSets, to execute in Spark. And the fallback is that it can always execute in a JVM, which you can build up on a big machine if you need to. So it is and it handles Kafka reading from and writing to Kafka very easily, managing the consumer groups for you. You know, it can do a lot of what we're used to.
Starting point is 00:35:39 It handles schema. It can use the Schema Registry, or it has its own sort of built-in schema drift capabilities. So it really is a beautiful tool. And it's great for those organizations that aren't, you know, full of data engineers, because maybe they're from a traditional world and they're taking this sort of maiden voyage out into the big data world. I think it's a tool that really does bridge that gap. Okay, so there's two things I want to talk to you about on that. One is you mentioned schema drift there, and that's something I'd like to drill into a little bit. But the point you made there about GUI tools, and the fact that, I suppose in a way,
Starting point is 00:36:17 StreamSets is graphical and point-and-click and so on, the point that other guests have made, you know, Maxime and Gwen, is that actually that's inefficient, and engineers will always write code, and code is a better abstraction over data, and the logic that's required now in data integration routines is so complex that code is required. Do you think that's true, or do you think it's just immaturity, or what? So I think when you're building some of these high-throughput applications that a lot of these big data vendors like to showcase at their conferences, you're going to have to write that in code. I think there's no way around it. And there's a lot of things that code gives you, like source control merging, real
Starting point is 00:37:07 development-driven lifecycle, regression testing, all those things that developers have expected for years that have just always been absent in traditional BI tools, traditional ETL tools. So the traditional ETL tools started with the GUI and everything else, you know, the lifecycle capabilities were always secondary. I think that code gives you that capability. So if you're massively changing with a lot of developers, there's no GUI solution in my mind that's going to satisfy a data-driven workflow with lots of developers innovating often. At the same time, what StreamSets gives you is the ability, if you want to take your maiden voyage, as I said before, into big data, do you have to go hire five developers to be able to do it? I mean, put your toe in the water, give StreamSets a try. And for a lot of the projects we've done, we don't need to fall back to code.
Starting point is 00:38:09 And also they have what's called processor stages, which are intermediate stages that you can automatically serialize and deserialize the JSON or Avro for you into objects that you can just code around. So there's lots of opportunities for you to code within the StreamSets tool. But what is missing from StreamSets that you're going to get from pure code is that whole source control, regression testing, continuous integration, Jenkins deployments, Jenkins automation, all that kind of stuff that developers just expect. Obviously, you can do that with StreamSets. The artifacts are JSON after all, so you can check them in the source, etc. It's just not as natural as writing something like Java code. Okay. Do you think, and one little point there, do you think StreamSets is a little bit too focused on on-premise
Starting point is 00:39:05 I noticed that, you know, for example, as far as I'm aware there's no BigQuery integration and so on. Is it kind of designed for on-premise, and doesn't really translate to cloud? What's your view on that? So they do have Bigtable integration. I have not played with that yet, so presumably you could maybe do some things on top of that. But yes, I mean, I think, I hope that they're thinking about this, because, you know, we can do an entire platform as a service architecture from start to finish, but if we use StreamSets, we're going to have some sort of compute, because there's no way for me to deploy StreamSets really in a pure platform. No, that's right. It's predicated on it sort of being part of a data center and stuff like that really, isn't it?
Starting point is 00:40:01 Yeah. So obviously you can run it on compute infrastructure when everything else that you have is running platform. But I really do hope that they're thinking about that, because if they ran a service, and I should throw in that Confluent Cloud was just announced, and that was always something that was a weakness I felt in Confluent, because it
Starting point is 00:40:26 was very much predicated on on-prem. So hopefully, you know, and Oracle's also rolling out Kafka in the cloud, so, you know, I think we're going to see that sort of ingestion layer move to the cloud, and that's a really important step. But we need, you know, if you're in the Google Cloud, for instance, if you're doing BigQuery and you're thinking about StreamSets, it's going to be, obviously it's coding, but you're going to look at Dataflow, because it's in the cloud. It's part of the platform. The security layer is the same. And it's really easy to get up and running with Dataflow if what you're loading is BigQuery. Yeah, exactly.
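For context, a Dataflow pipeline of the kind being described is usually written with Apache Beam; the sketch below is a minimal Python example with made-up project, topic and table names, and it assumes the target BigQuery table already exists.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Stream events from Pub/Sub into BigQuery on the Dataflow runner. Everything
# here is platform: no VMs to SSH into, and security stays inside the project.
options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | "Parse" >> beam.Map(json.loads)
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-project:analytics.events",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```

Swapping the runner option to DirectRunner runs the same pipeline locally, which is handy for trying it out before running it as a managed Dataflow job.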
Starting point is 00:41:03 I mean, there's a vendor I'm trying to find the name of, actually, now. There's a vendor out there that has integrated with BigQuery that their service is available as a kind of Google Compute Engine image that can be spun up. Is it Matillion? Yeah, that's the one. Yeah, that's the one.
Starting point is 00:41:18 Matillion. And that's a very basic level. I mean, it's just a VM effectively running in the cloud. But I guess the problem that you're going to have if you run StreamSets or a tool like that on- it's just a VM effectively running in the cloud. But I guess the problem that you're going to have if you run StreamSets or a tool like that on-premise is just that moving data
Starting point is 00:41:28 from the cloud to on-premise and so on, really. But you've got to kind of play in that world, really. You've got to, I mean, I guess it's all about partnerships, really, isn't it? It is.
Starting point is 00:41:37 And, you know, I think, hopefully, that we'll see some platform from StreamSets. I think it would be a really nice addition. But at the same time, unless you're doing Google Dataflow, I mean, what other platform-based processing layer is there? PubSub is obviously the capital.
Starting point is 00:42:02 PubSub. Yeah. Well, so you got PubSub to Dataflow, that whole sort of stream that Google has. But if you're going to go build, you know, anywhere else but the Google Cloud, I think there's not really a platform driven data processing or data pipelining layer out there. We need it. It's the sort of the last. Yeah. Yeah.
Starting point is 00:42:24 It's the last step, yeah, to being able to run everything really in the cloud, which is what we all want. Okay, so before we go, one last question. So one of the things that StreamSets talk about is this idea of data drift, and I guess it's that problem you have when you're doing ETL with a big data system, where the schema can change over time and almost is designed to do that. Is that a problem that you found was one that needs to be addressed? And if so, how well did it address it? And what's your thoughts on that, really? So I think it is a problem in traditional warehouses. Well, so there's good and bad in this, right? So it's a nice-to-have, but I think Gwen actually sort of talked about this in your last episode,
Starting point is 00:43:06 is it a suite or is it a pill? I think that the idea of being able to support schema drift is great if that's what you need. But what's good about stream sets is you can accept data drift or not accept data drift. And just, you know, data drift is just the concept that new attributes may be added to a data set. They might be subtracted or removed. And you think about a traditional data warehouse with multiple staging layers. And, you know, if we went back to the old Oracle reference architecture, you had a foundation layer and then a serving layer or a presentation layer, access and performance layer, I think it was called. Imagine how many different places we would have to change a data type or a table column throughout these workflows. And I think that
Starting point is 00:44:01 if what you need is strict schema, then maybe you want that. And maybe that's a constraint that you actually want to introduce. But what if you don't want that constraint? What if your data application or your analytics application should just flow on through when new attributes are added and not fail in error. What's great about tools that support data drift, and the query engines do as well, right? So if you're using some of the standard sort of loading techniques, it doesn't get mad at you if columns have been removed. It just won't load them.
Starting point is 00:44:42 And if they've been added, it might not load them unless you specify it, unless there's a matching column in the target. But you really do want to support data drift for a good portion of your workflow. I would say for loading a data lake, you don't want to be strict; the reason you don't put a data lake in a relational database is because you don't know all the changes and new data sources and those sorts of things that would require table design, and you don't want to get slowed down by that. However, some of your downstream analytic applications, some of those that might be finance-related, some of those that might be, you know, Inmon's one-version-of-the-truth type solutions, maybe you do want the constraints there.
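As a rough illustration of the drift-tolerant loading behaviour being described, here is one way it might look with the BigQuery Python client; the bucket, dataset and table names are hypothetical, and the same idea applies to other engines and to StreamSets' own drift handling.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Append newline-delimited JSON files into a data-lake table, letting newly
# appearing attributes extend the schema instead of failing the load. Columns
# that disappear from the source are simply left NULL for new rows.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/landing/events/*.json",
    "my-project.lake.events_raw",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish
```

For the downstream, finance-grade tables where the constraint is wanted, the same load without the schema update options will reject unexpected columns instead of absorbing them.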
Starting point is 00:45:25 StreamSats or any technology that supports data drift gives you is the ability to support it or not support it. And you can support it at different stages in your workflow. It's been tremendous for getting a data lake built. I can just imagine conceptually, you know, I know you've built, I've built them with you, relational precursors to data lakes where we're trying to land all of time to be able to support in a data lake layer and maybe even several layers past that the ability for things just to sort of flow through and work until you really need to start constraining schema changes these new modern tools give you the ability to sort of to have your cake and eat it too to use a terrible cliche okay so let's get this let's get on to the last topic, is BI and Looker.
Starting point is 00:46:27 Okay, so Looker is a BI tool that is kind of the flavour of the moment within the big data and analytics world. It's web-based, it's delivered as a service. The company I'm working with at the moment, they use it to build out an analytics platform on top of BigQuery. The interesting thing with
Starting point is 00:46:54 it, two things really that are interesting with it are, first of all, that it accesses engines like BigQuery very efficiently, which makes a big difference when other tools, like say Tableau, are used to doing select star from whatever all the time, and those engines are charged on the data you access. The second one, though, and this is why on Twitter I happened to say at the time, I suspect you blew your beans when you saw this, was the way that Looker handles metadata with LookML. Do you want to explain why I thought that, and what's interesting about LookML and how different it is to what we were working with before? Certainly. I won't pick on any one vendor, although most people who know me probably know the one I have the most experience with. But traditional BI tools that have metadata layers have typically stored that in some
Starting point is 00:47:35 sort of proprietary format, maybe in binary format, maybe in convoluted XML, which is difficult to read. So, you know, even if you wanted to put those metadata layers into something like source control, it's unreadable, it's unrecognizable in most cases for when it comes time to try to do real multi-user development and source control and feature branch-driven development.
Starting point is 00:48:01 It just, obviously we've built a tool called Checkmate for Oracle Business Intelligence to try to do some of that. But at the end of the day, it's still not a natural sort of state of being, I'll say. And so what LookML is and what Looker does is they have a metadata layer that really looks like a YAML file. And I think in the previous versions, it actually was strictly YAML. But now it's a DSL language for anyone out there, you know, domain specific language. It's a configuration language that allows you to easily configure your semantic model in a text file. And portions of the presentation layer can be done that way
Starting point is 00:48:47 in Looker as well. And, you know, safe harbor doesn't mean anything to Looker folks, but their roadmap is robust for trying to make all of this sort of work from a source control perspective, even the front-end layer. So the idea that it's text-based, immediately you could say, well, that's great, I can export it out and put it into a version control tool. And that's what you probably would have done with a traditional BI tool, and maybe even written some processing around it. What Looker has done is actually taken care of that for you. They didn't try to satisfy, you know, their Subversion users, their Git on-premises or, you know, GitLab users, etc. They just said, you know what,
Starting point is 00:49:33 GitHub, we're going to go with GitHub, lowest common denominator. I think it works for GitHub Enterprise as well. Probably just Git in general, isn't it, really? I think it is, is it? Yeah, I've only used it with GitHub. Yeah, yeah, yeah, it's Git in general, yeah. Okay, so my mistake. But just making the metadata layer file-based, and then also automatically taking care of its integration with the BI tool, with source control,
Starting point is 00:50:01 so that if you enable this, then you've got a lot of options. Every user automatically gets their own branch. I'd like to see feature branch support; they've mentioned it in some of our calls, but it's not there at this time. But still, just having every user with their own branch, when they do work they can either, inside the Looker tool, integrate their code into the master branch, or we can open pull requests, it supports both, so that you can do pull request-based regression testing and deployments, pull request-based workflows, which is the way the newest version of Jenkins works in most cases. So yeah, I mean, it's just some of these problems that we've dealt with with more traditional tools for years; you know, developers are going to do what's easiest. And if what's easiest for them is to ignore source control and email artifacts around, that's what they're going to do. Looker said, you know what, we've got to make this really, really easy. And that's exactly what they've done. So it's been, you know, it's been a wonderful sort of experience. And also, one final
Starting point is 00:51:11 thing with the quality of the metadata layer is that, you know, when you reverse engineer a source for the first time, you know, Looker has an opinion about what's there and pre-creates you a metadata layer. Now, it's not going to be everything you need, but the time it takes to get up and running, no provisioning, no installation, no integration, you're just up and running. It's fantastic. And it really does increase the capacity at which BI developers can produce stuff. So in terms of building, so you mentioned there that metadata layer and that there's views and there's models and there's explores and so on there.
Starting point is 00:51:56 Exactly. Have you been able to build, in the end, is what you build still a dimensional model? Or are you able to do anything kind of more kind of interesting or more, or I guess in a way, what I'm getting to with this is more appropriate to the type of data and the type of data structures that you have within things like BigQuery. I mean, one of the problem, one of the problem cases we had was, in my current place, is that everything is many to many and everything is event level and so on, which meant that how we built the metadata had to be different. How have you found that really?
Starting point is 00:52:25 What have you built, and how has it worked? Yeah, there's some things that it doesn't do that you might expect a more dimensional-based BI tool to do. There's workarounds, they've recognized it, they've talked about roadmaps, but some of the things like drill-across, just automatically working across all your facts, aren't quite there. But things like symmetric aggregates, for example, I mean, is that something you've looked at or had
Starting point is 00:52:48 a need to? Absolutely, yeah. So maybe explain what that is, and if there's anything innovative in what Looker is doing there. Yeah, so it's the ability for you to define, on sort of a measure-by-measure basis, how the aggregation rules work and how they're to be applied across dimensions. And so, you know, this is the classic example of over-allocation that you get if your dimensional model hasn't been done right, double counting, triple counting for certain situations. Now, if you know clearly what's being stored and what grain your fact table is at, it's hard to make those mistakes. And that's what metadata layers are for. But they give you the real flexibility to determine
Starting point is 00:53:38 how roll-ups are going to occur. And they call that symmetric aggregation, and that, you know, works in a lot of cases. We found that, because Looker is more targeted at some of these sort of non-traditional data stores, I'll say, when you get into discussions with engineers, they'll come at you with a lot of solutions that are more sort of nested table solutions, BigQuery-like solutions, Athena-like solutions. And that's great. So I think that there is a learning curve. It looks and feels dimensional in a lot of cases. It is and isn't, isn't it? Exactly. In a lot of cases it isn't, is it? I mean, exactly. It's sort of, you know, it's more the way I've
Starting point is 00:54:25 thought of it. So there's a concept of views, and there's a concept of those kind of folders in the front end that you build out. Do you consider those to be subject areas, or fact tables, or just tables, or what? How do you conceptually model the semantic model within Looker for yourself? I've seen both. So I think we, coming from traditional BI tools, tend to think in terms of subject areas and tend to want to force that on whatever we build. So I think that's our starting point. I think we've had some rework at one customer going in with that mentality. So I think it is flexible, and that may not be what a lot of traditional BI tool developers want to hear, but I don't think you can necessarily declare the grain, in Kimball
Starting point is 00:55:16 terminology; you can't necessarily declare that, you know, we're going to build subject areas in this layer or that layer. I think it's got to be different there, isn't it? So it is different. It's more like universes, really. Universes in Business Objects were always a little bit more flexible than, say, business models in Oracle Business Intelligence. So I think, you know, you've got to think in terms of flexibility and what the solution is at hand. If what you're trying to deliver is a dimensional model to your end users, you might think in terms of subject areas and try to design things in that way. On the other hand, you know, I've seen through my years of using Oracle Business Intelligence, a tool that's supposed to deliver subject areas,
Starting point is 00:56:01 anything but, because the developers have just constantly worked around and worked around a very dogmatic perspective that the tool has. I think that you've got to do what works. You've got to be willing to refactor, something that developers and data engineers will consider. I think we've always, in business intelligence, and I'm talking more traditional business intelligence, we've always sort of thought that whatever we deliver has to be perfect. And we'll spend lots and lots of cycles thinking about the best way to build this, the best way to model it, such that when we deliver something, it's perfect. I think if there's one thing that using some of these new tools has taught us, it's that refactoring
Starting point is 00:56:48 is fine. Get people content, especially if the tool is flexible enough for you to be able to refactor quickly and easily. Okay. Okay. One last question then. Is there a consulting business around this? Because again, one of the things that you can often think about is that the users of
Starting point is 00:57:03 these systems, so things like BigQuery and Athena and Looker and so on, are typically, I guess, more technically kind of focused, maybe startups and so on. And there's obviously the cloud element there where a lot of stuff is done for you automatically. What's the kind of business model like around it? Is there a business or is it more kind of like a hobby or what really? So I use a phrase all the time that I stole from you. I believe I stole from you, which is I didn't get into BI to build web logic clusters. That's not, you know, or to do active directory integration
Starting point is 00:57:36 and do all these sort of things that are necessary on-prem. So that's not what gets me up in the morning. It's building things with data. And I think if you address the actual problem of helping companies do things with data, as opposed to being the only person that knows how to install these three products and make them work together. I think if you look at, especially the last few shows that you've done, there's a lot of help that enterprise customers are going to need to adopt these tools. So those organizations typically have hired people that are familiar with tools and not coding. Are they going to immediately go out for their first project and hire five developers?
Starting point is 00:58:28 No. So what we've seen a lot is that, you know, some of these are our traditional customers that we're helping take to the new world. Or, you know, new customers that maybe weren't our BI customers in the past, but have been traditional BI customers in the past. Not all of them, but many of them have folks that work there that are very familiar with the technologies. That's why they're trying to drive the organization to use them, because they can provide tremendous value, but they don't have a team or a staff. We've actually got several customers where they just haven't hired a BI team. They've just hired us, and we're doing dashboard development, pipeline development, data-driven application construction,
Starting point is 00:59:20 the type of things that, yeah, a startup would probably have full-time employees for, but organizations, departmental acquisitions, et cetera, are bringing us in to do a lot of the things that maybe a startup would do with full-time employees. Will that continue? Do we need to evolve and pivot again in the future? That's quite possible. This is a very dynamic industry at the moment. I think we're, you know, we're fully capable of doing it. And I'm excited, not daunted, by the idea, and our employees are just so happy working with these technologies. I mean, they're really excited to be doing new things. I mean, all of us are happy to try to do new things and learn new skills.
Starting point is 01:00:10 Excellent. Excellent. Well, Stewart, I'm going to let you go now because we've been speaking for about an hour. It's been great to speak to you. I mean, we'll have you back on again at some point in the future as well. But I think it's been really interesting to get the perspective of someone who's been doing this for a long time, who's seen, I suppose, old world, new world, and so on, is enthusiastic about these things, and is actually delivering customer projects. So thank you very much for coming on, Stewart, and it's been great to have you. Well, Mark, I really appreciate it. Thanks for always trusting my opinion.
Starting point is 01:00:36 Yes. And I look forward to doing another show in the future. Okay, excellent. Thank you. Cheers.
