Drill to Detail - Drill to Detail Ep.99 “Is the Modern Data Stack Dead?” with Special Guest Chris Tabb

Episode Date: February 15, 2023

...

Transcript
Starting point is 00:00:00 So welcome back to a new series of the Drill to Detail podcast, and I'm your host, Mark Rittman. I'm very pleased to be joined today by someone I knew and worked with around eight years ago or so. We went our different ways then but, great minds thinking alike, we find ourselves both running consultancies specialising in the modern data stack. So welcome to the podcast, Chris Tabb, and it's great to have you as our first guest in this new series. Thank you very much, Mark. And yeah, it seems like yesterday that we worked on that project, but it was quite a long time ago now, wasn't it?
Starting point is 00:00:48 Thanks for having me on the show. And I think, as you say, our paths have led us in the same direction, both focusing on the modern data stack, which is a hot topic of mine and something I'm very passionate about. And I think
Starting point is 00:01:04 using the experience that we leverage from back in the day, you know, before the big data bubble, I think it helps us understand what really is needed, what's required, and what the definition of the modern data stack really is. Chris, if anyone doesn't know you, you're pretty high profile on LinkedIn and on various social media sites. But more importantly, you're a co-founder of a consultancy called LEIT Data.
So just start by telling everybody what it is you do and the role you have within LEIT Data. Yeah, so I'll start with who LEIT Data is: a relatively new consultancy company, brought together by myself and two other founders who had worked together across that era of data platforms, data warehouses, and now data clouds. And we came together after a couple of successful Snowflake implementations, when we were on the other side of the fence,
Starting point is 00:02:03 and thought there's something here and we set this consultancy up and um yeah we've we've uh we focus on the modern data stack and i think what i also like to say we focus on business value that's delivered by the modern data stack um so we're very much um business focused technology supported or the ways of getting technology to support that i think rather than some people you know start with technology and find the use cases afterwards um and yeah my my my um what i do in the company now so due to the expansion and we're now working in the us as well as the uk um i now run the cco function so i'm the chief commercial officer um i think i did a post about this the other day of what i actually do people just see me probably traveling around doing conferences
Starting point is 00:02:48 doing these podcasts, and taking photos or selfies with fellow data community people. But behind the scenes it's getting to talk about and discuss the different trends and movements in the market, the different technologies out there, what's hot and what's not, and validating the approaches we implement for our clients, constantly optimising them, and looking at how we can always work smarter, not harder. And there's a hashtag I use, #MeanDataStreets. The focus of that is trying to make the data streets less mean, trying to cut through some of the complications that the modern data stack has created in its evolution,
Starting point is 00:03:37 and trying to look at how you can simplify it and reduce the time to value for our clients and the community that listens to us. I do quite a lot of LinkedIn posts and podcasts like this, and I think a lot of it is about education, data modeling being just one example of that. The project we worked on back then was very much focused on a highly optimised data model that could support what was then a very high volume, high throughput payment processing system. So, as I said, I've known you for about eight years now.
Starting point is 00:04:16 And I was kind of surprised to link what I was seeing on LinkedIn and the activity of your consultancy with the Chris Tabb that I knew from before, because the Chris Tabb that I knew from before was a very studious data architect who was central to a project that I was working on, with a company I used to work with at the time. As you say, a payment processor company, but you were doing a very technical, individual contributor type role at the time. So maybe just tell people about what you were doing then, as much as you can, and maybe what led into that, first of all. I think in the past you worked at Cognos and so on, but what was the role you were doing then, really, and how did that lead to what you're doing now? Yeah, so I think it's always a bit of a blend of different things, but
Starting point is 00:05:05 yeah, we'll start with that payment company where we met. That was, I'd put it at the time, one of the largest digital transformations that was happening: a very complex landscape, very complex architecture, and many moving parts. And the role there was very much a technical, hands-on design role, and I loved that. I can no longer do that hands-on work, but now I have a team that still feeds me that knowledge, so I can still speak comfortably about it and share some of the experiences, the battle scars, and the mistakes that happen when you embark on those sorts of projects. Yeah, it was very much an Oracle-focused implementation,
Starting point is 00:05:53 with a complex billing process. And my role there started off as the data architect, the lead data architect, and then I think I became the chief architect towards the end of it. Working on something like that, well, I feel it was a really good challenge, and I met loads of people. Actually,
Starting point is 00:06:18 four or five people from that project now work for, or are involved in, the company, and we've still kept in contact with them. But, yeah, you mentioned how I got into it. So my journey into data probably wouldn't be seen as the normal route. I was working in an administration department for a company called Cognos and just got involved in working on their products, because they had some spaces on the training courses, and I started using their own data. So my background, to some degree,
Starting point is 00:06:52 and I didn't finish university, or go to university, was business accounts, economics, and maths to some degree. I was not that good at maths; I was okay, better than my English, at least, anyway. But I think I had a business lens, and then I could go down to the technical, which has allowed me to bounce between those two different roles over my career: being hands-on as a DBA, as an ETL developer, as a report developer,
Starting point is 00:07:27 collecting requirements, because we had to do that back then, understanding what requirements look like, how they feed into projects, and looking at how to deliver on them. So I think every aspect, including that payment company, provided me with some good foundational skills and a great network. I think you're no one without your network, and I've got a fantastic network that helps me with my knowledge, people who always give advice and also keep me real on some of my ideas, which is always a key thing you need from a team: to call out when something's not possible. And, yeah, so since that payment company, I think it just kept going,
Starting point is 00:08:12 you know, we had that big data era. Yeah, the big data era. Maybe that leads on to what I call the evolution piece, really, talking about my career and how it's travelled across the data world over the past 30 years. You know, to give my age away, I started young, obviously, at 19 in the industry, and it's 30 years now, so we can do the math. Let's just interject there a little bit.
Starting point is 00:08:46 So you're starting to get into the technology, right? And this is really where I want to go with you in this conversation. So you've been doing this now probably slightly longer than me, because I didn't get into it until I was about 25 or whatever. But we've both been in this industry for quite a while now, and we've seen a lot of trends, and we've seen a lot of technologies come and go. And we're both now working in what is now called the modern data stack space, which has got a lot of similarities to what we've done in the past, but it's got things
Starting point is 00:09:17 that are different as well; it's got its challenges and it's got its benefits and so on. And the project we worked on back in the day, the one we met on, used ELT, it had SQL-based transformations, it had a big database, and so on. So there are things that haven't changed and things that have, really. So let's start a little bit with your journey. But actually, before we get into what I want to talk about with the modern data stack, you mentioned big data there. Something that happened between that big project and what we're doing now is that whole world of Hadoop, right? So how did you get involved in that, and what was your take on
Starting point is 00:09:56 that era? Yeah, so I think I'll just do a bit of background on the data warehouse first and then go on to the big data one. So go for it in that order. So I refer to the data warehouse era as the beginning; I think of it as the basic era. In the early 90s it was Oracle for your database, maybe Informatica for your ETL, and Cognos or Business Objects for your BI: three vendors all working together, each doing what they do best.
Starting point is 00:10:27 But it was reserved for the big players, you know, and it was expensive. And later on, that moved into the consolidation era, where you've got Oracle with Exadata, ODI, which we worked with, and OBIEE. So a single vendor, but three major components: your data storage, your ETL, and the associated things that came with it back then. So it was more like a one-stop orchestration,
Starting point is 00:10:52 quality, governance sort of product. And then OBIE, which is the equivalent of your reporting tools like Dapp, Tableau, and Power BI nowadays. bi nowadays but you know that was the oracle was offering them and that that was only reserved for the the big players it required a lot of planning and and uh budget to go and secure that and then then as you say yeah that big data what what did i make of it so what what that what i introduced was a um an era where it was open to the masses you know there was many many people would get hold of it it was all commodity hardware but it was very much technology um focus and i think that that created a bigger gap between
Starting point is 00:11:39 the business and the technology. The engineers weren't really connected to what they were doing; they were using this approach of just sticking everything in a data lake, so there was no ownership, no idea of how it was going to be used. I think the biggest problem in that era is that data modeling just got forgotten. The view was schema-on-read, we'll model it later. It meant that there wasn't as much thought going into how you'd combine this data, and not as much thought about how you'd optimise it for its usage, not only for storage, but for access as well. So I think that was a major downside of big data: it was
Starting point is 00:12:30 very hard to tune unless you knew the access path and how it was going to come in so you tune it for one scenario then you have to tune for another scenario and it was only really good for large volumes of data and not everyone had the sort of volume of data that really warranted, you know, map produce as a way of querying it. You know, large file sizes where, you know, you're only wanting to get the odd record from it. So it didn't quite live up to the dream or what it promised.
Starting point is 00:13:04 And it got data a bit of a bad name, I think, to some degree, with lots of failed projects. And I joke that renaming these things means you can go and get a new budget from the board to do another data project, because the last one failed. But each of these eras has given us something, and by learning from them and combining the different methodologies, approaches, and frameworks, it led us to the modern data stack. You know, dbt, which we mentioned
Starting point is 00:13:40 um before that that approach of that data ops approach wouldn't have existed unless we had influence either from the devops or you know the the do one where you could check in check out code you could you know you could use your um cicd your jenkins your deployment framework because it was very easy to look at the code changes it wasn't like the tools we referred to before like like ODI, where it's a separate repository where you're putting all the metadata. So to export it out, you couldn't look at it visually to say what the differences were and whether to accept them or not.
Starting point is 00:14:15 Or if you did, you'd have to have specialist knowledge. It was more complex. More complex to live it or to work in that data ops approach. And we've now, with the templating, the ginger templating, and the Python wrappers and everything there, that's something that was good that came out of that big data era that's now being used in the modern data stack. I think that the modern data stack itself has evolved. And I think if someone asks you what the modern data stack um i think that the modern data stack itself is is has evolved and i think if someone asked you what the modern data stack was five years ago they would just list five tran dbt
Starting point is 00:14:54 snowflake and um maybe maybe um tableau look up probably actually if you go back five six years ago that would have been the blueprint probably. Looker being that first SAS 1-ounce, Snowflake not being the first, but being the predominant one, and DVT becoming the transformation logic of choice. And most of them would have Airflow. That was probably the standard one.
Starting point is 00:15:20 But yeah, things have moved on, I think, now. Okay, okay. So there's a few things in that, what you just said there, that's kind of interesting, I want to sort of dig into really so yeah so you you mentioned about big data there and you talk about sort of i suppose it's more technical approach and so on um so so who did did you find that this type of person that was working on these kind of projects changed a bit when big big big data came along and that and that that in itself led to a lot of this talk
Starting point is 00:15:43 around cicd and so on so first of all you know has the persona of the practitioner changed over time do you think 100 and i think that that term data engineer got born you know from during that period before there were etl developers and um i i speak to joe reese, you know, the author of the Fundamentals of Data Engineering. And I got to read it and help review it as he was writing it. So I think that role has evolved during that period and even fragmented to some degree because, you know, we used to have an ETl developer that used to build the pipeline but then with all of that cicd you needed more of a devops person so the the skill
Starting point is 00:16:32 set required and i think this is how i end up joint getting getting more into this this um linkedin and doing these podcasts i think is you know i was commenting on the skills needed to be a data engineer and you know the list was like 20 30 different technologies and there's no way of one person having that that's a unicorn so yeah it did influence a new skill set into it i think this is where um there's there's a team now and especially a little while ago i had to have so many different complementary skills now to achieve that creation of the data pipeline from source right the way through to a modern layer.
Starting point is 00:17:11 So that was dictable. That was caused by the number of different technologies around them as well as it becoming more of an engineering sort of data developer sort of skill set needed less guis were involved you know the products we were talking about earlier all were very much gui led okay so so so i suppose the logic so one of the one of the things that i noticed you getting involved in on on on linkedin other forums is the debate around is the modern data stack dead okay um so maybe just start off by just explaining what is this kind of i suppose meme or kind of i suppose
Starting point is 00:17:51 trend or or thought going around to say this thing that you know we call the modern data stack it's it's dead why why is it dead and why people are saying it's dead first of all yeah so i think the first thing is you try and ask people to define what the modern data stack is and i did this and this is what i challenged a lot of people out there it was very hard we got very mixed views of what it was so some answer would go straight down to technology just list the technologies uh and others others would say that the quoting the reason for it because it didn't have data modeling in it or it didn't have lineage and each of these reasons or rationale that i saw for saying the modern data stack was dead
Starting point is 00:18:31 let me say it's not dead it's just not been done correctly you know we haven't looked at some of the best practice and problems we've already solved in the past and why haven't they why haven't they just been implemented correctly and i put this down to the the um bubble or the the very high increase in engineers that entered into this industry but they entered into it in the big data world and they entered into that and then they moved into the modern data stack. So a lot of that best practice and knowledge that we had prior to that was reserved to a smaller amount. And most of those people may be architects now. I refer to them as recovering architects nowadays,
Starting point is 00:19:16 based on recovering data scientists is what Joe refers to himself as. So it's that lack of knowledge or best advice that was not being given to the people that were using modern data stack that's caused this issue. So people started coining with the post-modern data stack. I mean, just because we've run out, and just because we've run out and and that i said just because you know your first mod your first implementation modern data stack was didn't go well you know don't go
Starting point is 00:19:51 and call it a post-modern data stack to give it a new name maybe call it modern data stack 2.0 i even joked say maybe modern data stack 10.3 patch 2 it was a good oracle version you know um but what about the argument which some people are making which i think has got some value to it saying that um with i suppose the move towards analytics engineering and the modern data stack you are accumulating a lot of human capital costs and all the sort of things that you know in a way we went through in the past where we moved away from say scripting dba scripts or database scripts and we moved to say graphical tools with repositories you know um and and so on do you think there's a valid argument to say that we're accumulating a lot of
Starting point is 00:20:29 human capital costs in this that people aren't necessarily aware of until it becomes a cost yes i do i think i think there is there is some truth in that and i think i refer to something as the meta metadata and um i think that automation and working smarter lens. So if you think of all of the data as it gets created, when you have a system or something designed, they know the attribution of that data, they know the business usage of it. If, say, for example, going back to that project as well,
Starting point is 00:20:59 you had all the business services mapped out, from high-level right to the L5 business services, level five. So you actually know what attribute is linked to each business service, how that attribute is stored, and then that metadata in the old days is, you know, that data will be available, it will be CDC'd somewhere, and then the data team will go and redo it all over again. You know, they go and define that that all and i've never seen a project where all that metadata that once it's been collected or defined at its creation or its inception has actually followed through so i think the
Starting point is 00:21:37 what's what i do like i think is happening and i think this is a good thing is the the world of operations the world of analytics is coming closer together. Products, Snowflake having the uni store meaning you can have, maybe not replace OLTP totally, but you can have more apps running there. The team's working much closer together. I think if we can have used knowledge graph information to collect all the metadata that then can be transferred down to create DDL but also create business services or also then go and follow through into how it's stored in the analytics to to help catalog it define the ownership um the more the more we
Starting point is 00:22:18 can do with automation and with um metadata driven approaches uh and reuse of metadata metadata management um i think the less will need that manual effort or or um cost associated to it i mean you can go on the finops aspect of it as well you know how how um how you can do cost optimization based or recharging based on business business usage or um you know domain usage of your data you know supporting data products data mesh approaches where you know you know who's using it there's good contracts in place um you're reducing that cost of ownership uh and you're sharing the cost based on the usage and consumption. Okay. Okay. Interesting. You mentioned about, about, about cost there,
Starting point is 00:23:06 Okay, interesting. You mentioned cost there, cost attribution and so on. There was a post on Twitter today, I think, you might've seen it,
Starting point is 00:23:14 from high touch talking about using their tool, the reverse ETL tool to help with the usage based billing for SAS companies. And it was, I think Lauren was, was commented on Twitter about it and sort of said the worst possible thing you can do is use your reverse etl tool to do your customer billing so what's your take on i suppose activating data and the use of say reverse etl tools for that kind of
Starting point is 00:23:35 oh i mean this is another favorite topic of mine and and actually i did a post actually while i was with lauren in new york last year about etl in etl out um so so maybe i'll touch before i touch before i go into my view on reverse etl let's just talk about the use case of going for that billing and you think of things like socks compliance and lineage and accountantship and ownership so you know what gets gone what goes into your billing system is it needs to be done by an accountant. It needs to have some level of controls over it. It needs lineage to understand where that data has come from. And it needs that whole end-to-end pipeline supported if it's a production system
Starting point is 00:24:20 or with production controls. So unless you can guarantee or provide lineage or provide that end-to-end assurance and security controls around that, who's going to stop maybe some data engineer going and ingesting another accountancy process and quite easily creating fraud internally or unintentionally posting the wrong information. And anything that touches your accountancy system, it needs to have a level of governance controls around it. And this is where I'll go on to my view of reverse ETL tools. So if we go back to our day, back in the day if i asked for informatica to bring the data
Starting point is 00:25:07 in and i asked the data stage to take the data out and i asked to go and buy two licenses for it do you think i would have got the budget do you think i'd still be in this job now probably not i mean it's too too separate they and this i think this has happened where you know the modern data science evolved and the likes of high touch and sensors they've come into the market because
Starting point is 00:25:32 we've created now business processes on top of our modern data stack and we've used Fivetran or something to bring it in and Fivetran doesn't put the data back out so I think if you step back and if you were to design a product now you'd look at something that did both and maybe provided that more of a governance controls maybe provide that lineage maybe provide the security controls and this is a
Starting point is 00:25:56 a production production class pipeline that needs support needs monitoring on it it has different different security controls around it, different governance associates. I did a my post was a joke, I had a picture of two Aston Martins, ETL in, ETL out. We need an excuse to buy two products.
Starting point is 00:26:19 So there are products that can provide that end-to-end one, the Riverage one that we work with, not a plug, but there's others as well. So yeah, that simplification of modern data stack, I'd look at what can provide that whole end-to-end
Starting point is 00:26:34 capability and be very very careful if you're going to be using it to post accounts information. And Lauren would have had a very different approach of tackling that answer, believe me. Okay.
Starting point is 00:26:52 So, you know, at risk of sounding like a couple of old farts saying that everything now is new, is awful, and so on. I mean, so you've got a consultancy that specializes in the modern data stack, and you were like, I think you were Snowflake partner of the year last year and and all that so what does it what does that what does the data stack look like that you your company builds and and what products um do you particularly think are good in the market at the moment then in that sort of the area we've been talking about yeah so i'll go back to that whole simplification piece as well so five trends great as an ingest and it's been the market leader for ages.
Starting point is 00:27:28 But you still need an orchestration tool with it as well. So if I was to break down what I think the modern data stack components are, and so you need ingest, you need orchestration, you need ability for transformation, you need storage and compute, you need some data modeling and data contracts capability. Observability, I'll touch on that in a minute. But a data ops framework as well. And maybe you might need some reverse ETL.
Starting point is 00:27:55 You'll definitely need some visualization. You'll definitely need some machine learning capability. So if I was to map that to what I think of the players, Snowflake, I'll start with them. Snowflake for storage and compute, I think it's the de facto one. It's the only real challenger to it, I'd say, is BigQuery and then certain use cases. So as a de facto, I'd still think that they've got the competitive edge.
Starting point is 00:28:21 I think their roadmap with Snowpark is great. Yeah, we've done very well with them and and we like working on that um if i talk about um the next one which would be you know you ingest your orchestration and optional reverse etl and that would be riveri um riveri um provides that single SaaS ETL product. The underlying architecture, I know personally that it's built state-of-the-art on Kubernetes, sort of like modern architecture from a development perspective. And it does that orchestration for you. It does that reverse ETL if you need it.
Starting point is 00:29:02 It has a lot of metadata-driven components, which if I went back before and I had to build that self, it's complicated unless you do it properly. And I know that getting someone like yourself in as well, you know that we designed well, but a lot of time we get in and the horse is bolted. So having that ingest orchestration reverse ETL with transformation that you can pull out the
Starting point is 00:29:25 dbt or you can run your own scripts from it and that that would be what I choose there I think another cool player in that market is coalesce so coalesce we've seen them quite a while I've seen recently it's more of a well it's very much a dbt replacement or alternative should I. It has that GUI, so your drag and drop approach to building your data pipelines. And it provides, go back to that lineage piece anyway, it provides a lot of that lineage and some of the complexity with very large-scale DBT projects and maybe some of the in some of the inefficiencies they may they may
Starting point is 00:30:06 introduce because many data engineers work at the same time maybe not with a common data model um yeah i think having something like that helps encourage that data modeling approach um i'd say that sql dbm would be my go-to choice for doing data modeling now um you know having that model be deployed and then you build pipelines that match onto it I'd suggest SQL DBM would be my go-to choice for doing data modeling now. Having that model be deployed and then you build pipelines that match onto it. Visualization, I think the thought spot would be my go-to one. Why is that? Yeah, let me share my view on that one.
Starting point is 00:30:42 We'll give the other players as well. So I'll give the other ones that I think are in there as well. looker you know was one of the leaders i think right at the beginning it's acquisition by google has made it more of a google focus than the licensing model of that it's come a little bit more expensive but you know it's still a good product um tableau i think that that life cycle of developing Tableau dashboards, but you need to have Tableau developers, that time to market, time to value, time to getting the users in touch, the report took a bit of time. Power BI, you know, it's great from a, if you're already a Microsoft shop, it did the job, but you still need those developers.
Starting point is 00:31:25 And they seem to encourage you just downloading it to Excel to go do pivot tables and things like that as well so that that's not what what um you know i think it's the best use of all that data after it's been processed and modeled so what their thoughts what and you know there's others out there as well um but um thoughts what it's that self-service capability i think it's It's that if you've got it modelled, it's a very good start schemers, and you've built some, you know, the metadata on top of it, which understands the context of data, you can then prevent the need for having, you know, an army or loads of dashboards being developed
Starting point is 00:32:01 and not knowing who they're being used by, and give it to clients or maybe not as tech-savvy people to actually explore that data, find insight they wouldn't have found easy. And I think what a dashboard does when you build it, you give it to someone and they say, oh, yeah, but what about this now? Then they go off and have to go and make an alteration to it. If the tool allows that person to ask that second question and that third question without the need to involve tech again that's what that's why i
Starting point is 00:32:31 like it and i suppose from a consultancy sort of you know um uh perspective you think that's a bit of an anti-pattern because we know we want to put people and develop it so we do help with thought spot we help in that initial setup but it setup. But it doesn't require that long-term consultancy assistance from us. Whereas if it was maintaining Tableau dashboards and things like that, we'd probably more repeat work. But yes, that's my reason and rationale for that one. Okay. So what's your take on, I suppose, headless BI and metrics layers and so on?
Starting point is 00:33:05 Because that separation, I mean, in some respects, it's nothing new, having the idea of a semantic model. But the other bit, I suppose, is who within the organization would kind of set up and maintain it? And what does it mean about the kind of workflow? And what does it mean also about how maybe the industry might get reconfigured around different ecosystems and so on? So what's your take on metrics layers and semantic models and i suppose the elephant in the room being the dbt labs one yeah so i mean if you go back and again that that semantic there
Starting point is 00:33:35 is nothing new if i go back to my days on cognos you know you you had the metadata there and they are the catalogs as they refer to them them, as business object had its universe. So I think the key thing is centralization of business logic. And things like data vault methodologies, which I know you're very much in with as well, having that centralized business logic layer, whether the rules are consolidated, managed, governed, reused is essential.
Starting point is 00:34:09 Whether you do that or – I think it's defining the rules where they go. And the more reuse, the more I'd like to have them pre-built in the database. If you have some sort of complex um in different teams using different products maybe some things could be centralized in maybe your thought spot model or your um maybe you're using microsatural or something so you may have some some logic put in that but the the key to getting one source of the truth and making sure everyone is reporting the same way is making sure those business rules are in one location. in modeling DataVault or a similar approach, as long as it's structured in a way that allows that business logic to be put in one place and managed,
Starting point is 00:35:12 that's the end goal. How you achieve that, I'm less concerned on. Moving on a little bit from, I suppose, products and so on, you've mentioned a few things around data modeling there. You've mentioned about, I think you mentioned data contracts earlier on and so on and you've also just mentioned about centralization of logic okay so probably one of the one of the sort of the again one of the trends or or things that are interesting industry at the moment is around the idea of data contracts and data meshes and so
Starting point is 00:35:39 on right so and and what's your take on that and what's your take around what's your take on that? And what's your take on, I suppose, the idea of there being a central kind of warehouse with all the logic in there? And I suppose centralized versus decentralized. What do you find works in practice over the years? Okay. So I'll start to touch on data contracts first of all. So anyone from a development background, so it could be your API contract. So it's a metadata structure or structure that's defined with metadata that's used to transfer data between one location and another in a standardized format. So, for example, if we went back to the old world of ISO codes, you know, ISO codes provide structure and they provide consistency on how we do currency codes or country codes.
Starting point is 00:36:28 And then you can move on to more with like payment contracts. You know, a payment contract was an ISO 8583 that had all the attributes needed for you to make a credit card payment or a direct debit payment. No, sorry, a debit card payment. So data contracts have been around in the application
Starting point is 00:36:46 world for many many years um i think why they're why they're becoming more prominent in the in the data world now is we've got these data product we've got a data product concept we've got data mesh as a concept or a methodology and what that provide what that needs now is also what what that requires is a ability to be able to um exchange information in a consistent way that people can build on top of so um where where that where that works between the um um the the the in the data product world i'm sorry that centralized, decentralized approach. So, and we talked about business rules as well later, and I think maybe touching on the role of the analytics engineer as well,
Starting point is 00:37:33 a subset of a data engineer. So what companies or larger companies should provide is domain-based structures that can be subscribed to by multiple data products. And they would be subscribed to by those multiple data products by using data contracts. Those data contracts would be, I think when we did that, I called them data access layers and data input layers,
Starting point is 00:38:00 which are dills and downs. So it's a way of you exchanging information between different parts of your data platform in the same way you would with an API contract. So why are they so important? I think without having that defined and the governance around it, or the guardrails I like to refer to it as,
Starting point is 00:38:21 your data products or your data mesh um um platform approach will have fixed dependencies on on things that you'd have no control over so unless these contracts you've got a they said they're they're available so for um you know and supported and any changes will be communicated or or any new new attributes will be added in a controlled manner. You know, you, you're building on top of sand or building on top of things that could, could cause your project to fail later.
Starting point is 00:38:54 And you have a complex monolith application again, really. Okay. So, so imagine, so let's put this into, into sort of practical things. Imagine your,
Starting point is 00:39:04 your, your company's being brought in to architect and build an analytics layer for a reasonable-sized, I don't know, sort of Series C, Series D kind of company, and you're talking about a warehouse architecture. Okay, do you tend to sort of recommend what we might consider to be a traditional kind of Kimball-style warehouse, or do you talk about things like data meshes or what's the what's the kind of starting point really from a for a design
Starting point is 00:39:30 that you would kind of like put your name behind really yeah I mean we it's it's always a bit of hybrid there's no one size fits all and I think the there's a few characteristics that I'd use um the stability of the source systems. So for a large transformation or maybe a company that has gone through many acquisitions, they may have multiple CRM systems, maybe another acquisition comes on later, I'd very much go down the route of a data vault approach
Starting point is 00:39:58 for that one to insulate them from those and build some solid foundations that you'll be able to react quick later if there were any changes and which are also planned changes and it's very hard to work in those environments to do you know legacy testing and new world testing in the same place unless you use that approach so that would be one one approach i always am very much in favor of a kimball kimball insight layer you know or information layer presentation layer um it works very well for things like sports well we're doing dimensions and facts and having it having it modeled in a way that i think how we how we think and operate you know we think in
Starting point is 00:40:35 measures we think in dimensions we think in hierarchy so um for human interaction analysis i'd always have that as the the end product presentation layer but then on the flip side if it's being used to do machine learning i might do a feature model so you know you'd go that route for that but the key thing to any of these approaches is to have some structure and not just have a one big data lake dumping everything in there so if if you don't go full data vault but just at least have some clear either separation of the schemas that goes what's your what's your customer information your co information so so i always bring in the data based on the source system it come from so you have that clear separation and
Starting point is 00:41:18 then there's that middle layer that um you know you you'd you'd mix and match depending on the environment you may go data vault you may go um i'm not going to go for that it's an inman-esque style if he has the simplest building uh structure and then predominantly a kimball kimball um style uh dimensional facts as a presentation there with you know sourcing that same data in a different structure but in a consistent way. And that's where that business rule aspect comes into it. Because if you have gone to a situation where you've got multiple different ways that data is being presented,
Starting point is 00:41:54 you want them both to be able to use the same business rules. So that's why I'd always introduce a common layer, which may be what you refer to as the semantic or business rules layer that any of them can subscribe to and also if you make the changes make them it can be replayed easily so you can go and replay and recorrect or or adjust any historical information very easily that that would be complex if you didn't have that okay okay so just to kind of round things off really i mean so so you're you're a consultant i'm a consultant okay and one of the things off really, I mean, so, so you're, you're a consultant, I'm a consultant. Okay. And one of the things that really surprised me when I,
Starting point is 00:42:26 when I got back into this world after a little kind of a couple of years I took out was, was going to a looker conference and explaining that we were a looker partner at the time and we were consultants and people not knowing what on earth we were talking about or why on earth, why they want to hire us in. And so to you, Chris, where do you see the value in a consultancy um these days for people implementing these kind of projects and you
Starting point is 00:42:52 know and where where does that given that given that most companies we speak to really ideally want to hire their own data team where does the consultant come in and what unique value do we bring into it do you think so i think there think there's two, probably it's which stage. So let's start right at the beginning. And you may have your own team, but bringing in someone that's done this together as a collective team in many locations before, the acceleration you can get and preventing you going down any rabbit holes or
Starting point is 00:43:25 areas that you may not have foreseen without without out seeing it done properly in many many places and also you know it's what's best practice now um so getting the right foundations in and the right level of foundations at the right time and a roadmap to get to where you can have that competitive advantage against anyone in your in your area um building your own team full-time is quite expensive so i think what we've we seem to do now is we have the right amount of a skill at the right level so we have our principals we've got our seniors we've got our engineers we've got our juniors you don't need to have a principal the whole time for our whole project you have them right at the front and then after that you use the right level at the right price and we can be cost effective i think as an alternative route to you know to having a
Starting point is 00:44:14 very large bloated team with multiple skills because you've got a very complex architecture um you know we people like us can come in and show where that simplification can happen where you can actually save costs from an operational cost, from a simplification of a delivery. I refer to something as delivery friction. So we look at all the different components of delivering a data pipeline. And how can we simplify that? Sometimes it's a people process and way of working,
Starting point is 00:44:41 or frameworks, or templates, or metadata-driven approaches. So not saying that everyone in there, those people have a job and their job is dependent on their work. My mission or role in a company is to work smarter, not harder, more efficiently, quicker time to value. Maybe you don't need the same size teams you've got now, but maybe your team won't always be the first ones what do you think about what do you think about um there's there's project there's project products out there like say portable or mozart data that that are kind of i suppose modern data stack in a box right so do you have any exposure experience to those at all and and and is that an alternative or or where do you see that fitting in? Is it maybe a different stage in maturity? What's your thoughts on those products?
Starting point is 00:45:27 So Portable, I know Ethan very well. I'm going to his meet-up next week. So Portable, I'd say, he refers to as the longhorn adapters. So they're not your common. So they're not your large sales force. Well, maybe they do that, but they're not the your large sales force and think well maybe we do that but they're not not the big big irps in fact so um yeah i think it's baked down to a use case and i think it does work that simplification and it's it's what you need at the side the maturity of a company in the level
Starting point is 00:45:58 you're at what what level of complexity you need we go back to that modern data stack so what's modern for one is, you know, maybe not modern to someone else. If you're like Coca-Cola or Pepsi, you know, the modern data stack is going to be some severe, you know, machine learning algorithm that can run against all sorts of things, optimization on production line information. So not everyone needs that level.
Starting point is 00:46:22 If you're a two-man band, a spreadsheet might be your modern data stack so it's it's working out um of the of the skill set skills of the team you've got of the budget you've got and of your of the complexity of environment and your requirements what's right for you and that goes back down to me and you being both technologists that have moved into developing consultancy companies you know i think both of us have that we say what's right for the client when and less you know less pushing the pushing a vendor-driven solution it's a business value-driven solution that is supported by what we think are the best vendors okay okay so just to round things up then how do people find out more about your company um and maybe get in touch with you yeah so um you'll
Starting point is 00:47:05 find me i'm quite predominant on linkedin so there's not many chris tabs so if you look for chris tab you'll find me uh i use the hashtag mean data streets i've got a couple of couple of the ones i use is bring back data modeling but i've had to spell it in two ways because Americans have got one L, we've got two. But yeah, and our company's www.leit-data.com. So you can check out our website. We are on Twitter as well. And I think we're also on a, I'm not too sure I have to check the other outlets as well, but they're the main ones.
Starting point is 00:47:41 Oh, YouTube as well. But you'll find me on quite a few podcasts yes so if you google hashtag mean data streets you'll find a lot more content fantastic well chris it's been fantastic speaking to you thank you very much for uh for sharing your thoughts on on the industry and where we are now um good luck with uh with the company not too good luck because obviously there's uh us as well in the market but but certainly you're a good guy. You know what you're talking about. There's room for both of us. Exactly.
Starting point is 00:48:08 But well done though. And you've done, the company's done really well. So best of luck and thank you very much. Thank you very much, Mark. Thanks for having me on. you
