Drill to Detail - Drill to Detail Ep.79 'Scaling the Modern Data Analytics Stack' with Special Guests Drew Banin and Stewart Bryson

Episode Date: April 20, 2020

Mark Rittman is joined by special guests Drew Banin, co-founder of Fishtown Analytics and maintainer of dbt (data build tool), and Stewart Bryson, long-time friend of the show and CEO/co-founder of Red Pill Analytics, to talk about scaling modern data stack projects from startups to the enterprise: how do you deal with data quality issues when there's no central record of customers, how do we introduce data governance and enterprise requirements and meet the needs of enterprise architects, and how do we scale concepts such as agile and analytics-as-engineering beyond our initial champion and data team?

The dbt Viewpoint
Fishtown Analytics and Drew Banin
Multi-Channel Marketing Attribution using Segment, Google BigQuery, dbt and Looker
Getting Started with dbt
Red Pill Analytics

Transcript
Starting point is 00:00:00 So welcome to Drill to Detail, and I'm your host, Mark Rittman. So I'm joined today by two very special guests, one of whom is a long-term friend of the show, Stewart Bryson. Thank you very much. Always a pleasure, Mark. Anytime. And I'm also joined today by none other than Drew Banin from Fishtown Analytics. So Drew, it's great to have you on the show for the first time. Thanks, Mark. Happy to be here. So Stewart, for anybody that doesn't know you, just maybe tell us what you do, who you are, and I suppose how we know each other. Yeah, certainly. So we were colleagues once upon a time. And then for a brief time,
Starting point is 00:00:45 we were competitors. And I wouldn't call us that today, although we work in the same area. It's a very, you know, collaborative relationship our companies have. I'm the CEO and founder of Red Pill Analytics. We are a analytics and data company. I like to say that. We build systems for customers, usually using modern cloud technologies. And we try to help customers migrate. A lot of our customers are legacy customers, having old legacy tools.
Starting point is 00:01:22 Some of the ones we worked on together, Mark, back in the day, taking those customers to new, easier to use cloud native technologies, and helping them find value in what the cloud provides. So consulting company and a services company primarily, Red Pill. And we're going to talk about DBT today from Fishtown. And we're a partner of DBT and use it on most of our projects. At least, you know, we recommend it on most of our projects. Okay. So Drew, I mean, Drew, it's your first time on the show. We've had your colleague Tristan on before, but just tell us what you do at Fishtown. And I suppose in a way, how did you get there? What was your background, really?
Starting point is 00:02:06 Sure thing. I'm one of the maintainers of DBT, which is the open source product that we create at Fishtown Analytics. So DBT is used by, I think, our latest metrics are like 1,700 companies every week. It's really taking off in the data modeling space. And a lot of us are kind of collaboratively thinking through the best way to model data and perform analytics kind of at scale. So the way we got here is Tristan, Connor, and I, the three co-founders of Fishtone Analytics, used to work at a company called RJ Metrics together, based in Philadelphia. And RJ Metrics was a leading BI tool back in the day.
Starting point is 00:02:47 And sort of with the advent of data warehouses like Redshift and Snowflake and BigQuery, the industry started changing from a sort of all-in-one type of BI tool to a composition of different best-in-class tools. And so what we found was different tools popped up, like Stitch Data spun out of RJ Metrics, for instance. There's also Fivetran in the ETL space. They do the data ingestion part. There's data warehouses that are phenomenal
Starting point is 00:03:15 at storing and querying data. There are many great BI tools as well. And sort of in about 2016, the summer of 2016, Tristan and I really identified that there was this big glaring hole in the data pipeline The data stack that goes from a data loader to the data warehouse the BI tool We were missing a modeling layer And so we started building dbt initially to just create views stacked on top of views on redshift And it's really grown pretty significantly from there.
Starting point is 00:03:47 So Stuart and I have talked, I suppose, quite a bit really on this show in the past about some of the projects we've been doing with technologies like DBT, but also things like Looker and Segment and Redshift and BigQuery and so on. And they're typically, or certainly the early adopters of those have been what you might call kind of startup companies or companies that in a way get the whole modern data stack and the techniques and tools around that. I mean, Stuart, just give us a flavor of some of the companies in that space that you've been working with around, you know, with tools like dbt and the modern stack tools that we've been talking about. Yeah, certainly. So it usually begins and Drew set this up nicely,
Starting point is 00:04:25 usually begins with a cloud data warehouse. I think that just this whole discussion, I think, is enabled by the new cloud data warehouses. I mean, we wouldn't be talking about agile tools, data pipelining and modeling just in time the way we are today without that. So, you know, Snowflake, Redshift, BigQuery, excuse me. And, you know, with those cloud data warehouses, it really enables a whole, not just, you know, faster performance and better integration in the cloud. It's also just about the agility with which we can move.
Starting point is 00:05:06 And I think we had all these constraints. They were almost like weights on us from these legacy tools. And once we sort of shed those and started ingesting our data into these cloud data warehouses, Fivetran, we're a partner of. We use Fivetran on most of our projects. For those sources where Fivetran hasn't enabled us, we use a collection of other things. Sometimes that's custom code using APIs from the cloud services
Starting point is 00:05:36 layer. Sometimes there are things like stream sets, other technologies that can enable data to be ingested. But once it gets there, now is the place where DBT can kind of take over. So DBT kind of provides that glue between this ingestion layer, which is possible from a lot of modern technologies, a data warehouse. And of course, downstream from that is hopefully some sort of a more modern analytics tool, something like Looker, something like even Data Studio or QuickSight and Amazon, things like Mode. Yeah, I'd even put Power BI in there.
Starting point is 00:06:13 So Drew's exactly right that there was sort of this missing step that we needed. And I think that dbt really fills that gap nicely. So Drew, official analytics took a decision to focus on a certain type of customer when you started the business, and you've kept that focus. So why did you focus on those types of customer for DBT and for the sort of things you were doing? It's a really good question.
Starting point is 00:06:39 In the early days of building DBT, we worked with as many clients as we could in as many different sort of diverse environments that we could find um we had a consulting contract that we signed with with all these clients that made it very easy to get started and uh sort of step into the uh the fire if you will and um figure out how to make dbt work really well in that environment and we certainly like learned a whole lot very quickly the first you know six months or a year of delivering analytics consulting
Starting point is 00:07:11 with dbt and i think that the the product that dbt is today is a function of that experience where we very early got to see concrete use cases and the types of things that are very similar between different deployments dbt or the things that vary wildly between different deployments. And we got to sort of like code for those different use cases and similarities and differences in a way that you can still see today in the product. So I think the one thing to say there is historically, we did a lot more consulting than we really do these days.
Starting point is 00:07:44 We bootstrapped initially and we make open source software. We, in fact, didn't have a hosted product that you could pay for until years into our existence. And so consulting was how we paid the bills and got really good feedback to playing DBT in different environments. But as DBT has taken off, we've really shifted as an organization to being a lot more of a product company and these days our our primary focus is uh building dbt and delivering great services to say dbt cloud clients on top of it okay but but i guess the question i was really interested in there as well as what you just said there was the focus you had on on vc funding companies so so yeah that that for me was, how much was DBT a function of those are the companies you were working with and that's what they needed, or was it the other way around? So why was DBT a
Starting point is 00:08:32 success with those types of companies? Sure. I lived in New York City for almost the first year of Fishtone Analytics. And I just remember taking the subway up and down Manhattan, and I would see DBT users in the subway advertisements. And I think it was a function of when you got to a VC stage as a company, the number of tools that you started using exploded, and the amount of reporting that you needed to do similarly exploded. In fact, Tristan uses the term Cambrian explosion to talk about things like this, where it was really just like it's a new era for a lot of these companies. And that's really where the need for DBT presents itself.
Starting point is 00:09:13 So if you have all your analytics being conducted on top of a transactional database and you've got like two reports that you look at once a week. Like you don't need a very complicated data stack. I would even say like, I think dbt is still an appropriate tool to use there, but I don't think it's like a must have in that environment. It's when you start growing these things and you care a lot more about consistency and different types of reporting. And you have this plurality of different data sources that you're reporting on and different types of
Starting point is 00:09:46 operational and analytical needs at the other end on the BI and analytics and data science side, that's really when it becomes so obvious that it's a problem that you are doing the data modeling. If like, say, if you don't use a tool like dbt, you're still modeling data, you're just doing it in the biz layer. And that's where discrepancies arise. So we targeted those companies just because these companies had the problem that dbt helps you solve and we sort of knew how to work with them. They were very happy to work with a small team that could kind of deliver services to them in an agile way. Okay. Stuart, did you find that it was, it was that type of company that was initially interested in dbt and those
Starting point is 00:10:33 tools from your side as well? Absolutely. I mean, I think Drew's point about getting to that venture funded area is when you start having some money to spend on tools, but also that becomes your analytics requirements increase because now you have to report to people on your progress and your success. And I think there's an analytics requirement there that introduces itself when a company accepts some venture money, right? Suddenly you have to start reporting to a broader audience with consistency. We've seen that with startups.
Starting point is 00:11:08 We, I'd say we're like 50% of our, 50% of our business right now is startups. And the other 50% is, you know, legacy. I will say legacy, but traditional big companies, Fortune 100, Fortune 300. And it's certainly, when you look at DBT, it's perfect. And I think we're going to talk later about how it's good for the other environments as well. But that startup company still has that agile idea about how to do projects with the money to spend on good
Starting point is 00:11:43 tools. And I think when we start to look at like how that compares to say traditional big companies, they have the budgets, but they're usually sort of being dragged down by older processes, slower, slower, we'll say project methodologies, those that are sort of tied to older tools. And those are the ones that are interesting to try to change the culture. I don't think you have to change the culture to use tools like DBT in a startup. Okay. Okay. So really what I want to talk about in this episode really was to sort of build on that. So two things really.
Starting point is 00:12:25 So first of all, something I've been finding on projects I've been working on, and Stuart, you may recognize the same thing as well, is that I suppose the price of success, it's as you engage for longer with customers and those customers themselves grow, then the complexity of the projects gets more. So you start to get exponentially more complex, I suppose, problems to solve in terms of integrating source A and source B and source C
Starting point is 00:12:53 where there's been no upfront, I suppose, setting up of a say a customer hub or a product hub or, or anything to sort of say, this is the definitive source of information for the company. And there's also, I suppose, complexity in terms of um data quality so one of the things that i i find with startup type companies is they've not yet had the deal had to deal with i suppose the issues around data quality and and the issues the challenges around um a single sort of version of the truth um and so really wanted first of all wanted to get your thoughts on both from the product side
Starting point is 00:13:27 and from the consulting side on how do you scale modern data stack projects as the company grows, the complexity grows. And the other side of this I want to talk about later on is how can we take some of these tools and techniques, particularly DBT, into enterprise companies, where they already may have solved some of the issues around data quality, but they've got issues around, say, velocity of the project, or agility, or things like that, really. But let's start off really with this issue around complexity. So Stuart, first of all, have you noticed this as well, as you engage for longer with a customer, and you know, you might start with say dbt or looker or whatever and a
Starting point is 00:14:05 very agile project but as you start to scale it you start to issues around complexity absolutely um the interesting thing is that when you look at traditional etl tools i'm thinking informatica oracle data integrator um the like right they always had capabilities and add-ons to do some of this. They had data quality add-ins. They had, you know, sort of master data management add-ons. You could buy these things and they tried to, outside of the transformation layer, you could bolt on these add-ons to try to handle this. And what we saw mostly with customer after customer is that those utilities or those add-ons were almost never used. And a big part of why they were never used is because they weren't ingested into the data transformation process themselves.
Starting point is 00:15:00 They sat outside of it. They tried to feed a data transformation process. They tried to ingest a data transformation process. And it just never worked well. And so I think what I saw over those years, Mark, and you might have a different perspective on this, is that we were doing the master data management and the data quality in the ETL tool anyway. And what that meant was we were using ETL to do data quality and we were using ETL to do data governance. And so I think that not much has changed in that area when you approach it with dbt. The one difference being, I think the ability to, to be more agile with pure SQL-based transformations. That's the other thing that's just really brilliant about DBT is it's just SQL. It makes no apologies for that. And this is the language that I find that almost everyone I work with knows. And when they don't know it, it's maybe one of the easier languages to learn.
Starting point is 00:16:03 So I think the idea of doing data quality and governance and master data management, these things in SQL makes a whole lot of sense and doing it in the overall directed acyclic graph, the overall graph of dependencies. I think what's not perhaps clear in the tool, and I would argue it wasn't necessarily clear in the earlier tools, is, you know, where is the data quality in your DAG? And I think with the appropriate model structure, and with the appropriate sort of, you can document in dbt with the schema.yaml file, and then you also have these tests that you can build in. So I've never seen
Starting point is 00:16:45 an etl tool or a data engineering tool with tests built in like this so i think there's not necessarily a user guide or that tells you um here's where you stick your data quality here's where you stick your uh governance your master data management, but the tool certainly can support those approaches. I think you just use the tool for what the tool does, which is SQL-based transformations, and then you need a project structure in such a way that dictates the data quality and what we think of traditional data quality
Starting point is 00:17:24 and cleaning the data first goes here. And then next in the folder or the model structure is, you know, this is where we'll start to imbue this with our opinions about logic and transformations. And then down this step is where we'll start to perhaps define some things for master data management, etc. So I think the tool certainly supports it, but there's no screen that says, okay, insert data quality here.
Starting point is 00:17:54 But again, sorry to reiterate, but I just don't think those were used in the older tools anyway. True. Is this a problem you're trying to solve from a product perspective? Yeah, it's an interesting question because it varies tremendously by organization and the types of problems that folks are solving with dbt. And when I think about dbt, these days I think of it more as like a compiler, like a SQL generating and running engine where we want to give tools to these analysts use dbt to Create the project that they need for their circumstances. We want to be a little bit less prescriptive about exactly how you should structure project. Although Stuart. I think your example is the canonical structure. You talking about like staging your raw data and then sort of doing more advanced combinations of the data on top of it.
Starting point is 00:18:46 So, sure, Stuart, I think you're right to identify things like documentation and DVTs, data testing. I think there are some areas of the product where we're not totally serving testing needs today. And it's sort of a function of the space we're operating in. So maybe the context I can give here is when we think about how DBT should solve some given problem, we always like to do the thought exercise of relating it back to how software engineers solve similar problems.
Starting point is 00:19:21 Because if we have any good ideas here, it's things we've learned from software engineering best principles, right? You want to version control your code, do code review, add automated testing, things like that. And so one of the things we're missing, for instance, is a notion of unit testing. So this is maybe a slightly different type of testing
Starting point is 00:19:41 than what dbt currently supports, but I think it would go a long way to giving folks assurance that their data transformations are correct um so i think there's an opportunity to take the core dbt compiler and sort of give you new interfaces for generating and running sql that that can assert that your data in transformations do what they're supposed to do do you think that so do you think that there needs to be a methodology or a bit more prescriptive approach with how we do projects in this area because one of the things that i find is going back to the complexity thing as well people if you're thinking about what i suppose a common thing with the the sas applications that we use as data sources now is they often
Starting point is 00:20:23 several of them will be a source of customers and there'll also be financial data. There'll be product data. There's no one single authoritative source of this information. Whereas in the old ERP systems we used to work with, there would be a one single customer table. And do you, I mean, do you find true that, that from the telemetry you get and the, and the feedback you get that custom, that get, that developers of dbt now, at what point does it become an overly complex job to bring in the third source, the fourth source, and then to try and think about maybe kind of the lifecycle of data and so on these sound like boring old man sort of things but but you know any any any kind of business is gonna have that as an issue as they go along and i wonder maybe this is a question about how far the product goes and what's out of scope and what's in scope
Starting point is 00:21:13 but you know what are your thoughts on that really sure so i think taking a few steps back from that as a starting point the one of the good things about dbt and the approach it sort of forces you into is that it does compel you to make small, discrete changes that get version-controlled and shipped. The reason I say that is because frequently you don't start with four different sources of truth of what a user is. You add them over time.
Starting point is 00:21:43 And so the thing that needs to be in your head when you're writing these data models is like, what are the universes of things that we might need to address or change here in the future? And so it really starts with user identity number one, whether or not you set yourself up for success to have two, three, four. And it's not the most fun problem
Starting point is 00:22:07 to kind of stitch all these identities together in a way that's sensible and easy to debug if things go wrong or explain to your colleagues when they're saying things that are surprising. But no, ultimately, I think the way that we solve that is, again, it's very similar to the way that software engineers solve this. Sure, there are libraries that do sort of specific things that sort of automate that away from you. But in a lot of cases, you solve this with what you would call a design pattern. And so it's sort of a template for pattern recognition where it's, I have this problem
Starting point is 00:22:45 and a good way to solve it is this solution. So we thought a lot about design patterns in the early days. And there's actually a decent book called SQL Design Patterns that has, it's like pretty dated anymore, I think. But I enjoyed reading it. We, to this end, have cooked up a playbook recently that talks about how to do user attribution. And Mark, I think you actually published a similar type of article not so long ago too, right? And so it's this kind of thing where it's hard to solve these problems for the first time,
Starting point is 00:23:15 but it's so rare anymore that any one organization is solving a problem that's totally unique to them. I think what we need to kind of do more of is get people talking about and writing about how they solve the problems that they're running into and sort of helping advance the whole field so that we have a sort of template for how to approach these problems. So Stuart, I described you once as the Karl Marx of agile methodologies. So is this arguably really about agile methodologies and how do you have a design while still being agile? I think so. So I know in the, in the old world, we would buy additional tools to do some of these things, right? You can imagine, you know, if you're using an ETL, I know Informatica sold
Starting point is 00:23:55 these, right? Informatica sold these master data management add-ons where you could map your, your source tables and it would give you a product table as an output. And I can tell you that from experience, most customers rewrote that. Any of these add-ons I've seen over the years that try to deliver something to a transformation layer, a customer table, usually performed pretty poorly. They often weren't batch SQL-based. So I do think that it's something that should be solved in SQL. I think that trying to write an application
Starting point is 00:24:33 that may use array-based processing or whatever, I think it should be solved in SQL. And frankly, I think it's a problem that is just another version of a data transformation problem that needs to be solved. So I believe it should be solved in dbt in the way that we solve other problems. I could see somebody writing a module, you know, dbt has the concept of a module, which is reusable code that everyone doesn't have to, you know, rewrite from scratch. So I could imagine some, it certainly has support for if someone,
Starting point is 00:25:08 whoever that person might be, decided to write sort of a customer module, a product module. And I know there's some companies out there doing things like this. So I think that I truly believe that it belongs in SQL. And I think Drew's point about the software development life cycle. Here's the thing. It's like all these products we used in the past, Mark, they built data, ETL products, and they built them using SDLC. They used version control. They used feature branch development. They did CI and CD, but then they were delivering a product that they didn't think should be used in that way. They didn't think the audience for their product were developers, in whatever term that means. And I think the idea that, as Drew mentioned, version control,
Starting point is 00:26:01 small changes, being able to go back and look at patterns, collaboratively, you know, collaboratively solving these patterns of finding a single customer, or a single product is one that belongs in the tool. I think that the rigor, going back to the agile question, you know, it's a question of how do you inject rigor into a process for the first time? And I think the problem with non-agile projects is they inject rigor at the beginning when it's not needed. I think what DBT really is challenging for the enterprise, but in a good way, in that it causes the enterprise to rethink the way they've traditionally built data warehouses. And that is that they were modeled first with often a whole lot of design and thought before anybody started to think about what the DDL would look like and what the data load process would look like. Spending a lot of time there.
Starting point is 00:27:06 And then you would go back and build a data integration process to then load that target. And I think that in the design process of building that perfect model was usually a whole lot of SQL. We were, that data architect was usually SQL literate. And they were querying databases to figure out, you know, what's the granularity of this, and what's the granularity of that, and how would these things ultimately join. But then they were stepping away from SQL, and they were building a document or writing a document that defined that model, and then they would hand it over to ETL developers and say, now load this model.
Starting point is 00:27:47 I think what's interesting about DBT, and I would say preferred, and I would even say, you know, optimal about the design process is, instead of modeling to a target, you're slowly building a target, step by step. And Mark, if you think about where Oracle Data Integrator came from with synopsis and that whole concept of interfaces, we may lose the entire audience here, but stick with me. They had that idea as well, that you would build a data integration process, an interface at a time as they called it. And you focus on that first join, that first pattern, that first combination of data sets, and then solve it and then go on to the next step. And I think the idea that the model will evolve is the thing that, frankly, we're having to convince the enterprise to do. But I think it's valuable. I think it's more optimal anyway. And I think it's
Starting point is 00:28:46 really the way we think about it anyway. But on that point, before I hand over to Drew, I think there is, although on projects, the bane of my life was the enterprise architect and the enterprise architecture team who would kind of pedantically pick through the design I've got and say this entity here isn't right and so on, but at least it joined up in the end and they would have, there would be a central kind of thought process around, around, around that thing now. So I suppose really to Drew, the point you're trying to make, I think as well, Stuart, is, is, it's,
Starting point is 00:29:17 it's the opposite way around to wait to the agile way in which we've been developing stuff in dbt um you know true what's what's your experience been with with working with enterprise accounts and do they have the same problems to solve that smaller companies do and do they are they changing the way in which they model these things and think about things or is that influencing some of the design for dbt going forward sure so i think ultimately the problems that the very large companies and the very small companies out there using dbt are encountering are, are more similar than they are different. Um, there are some key differences, um, that I'd be happy to talk about that, that do
Starting point is 00:29:57 sort of inform maybe some product changes we'd want to make in the future. But, uh, for the most part there, just like Stuart said, said, they're sort of solving these problems a little bit more iteratively and they're doing it in SQL. And the fact that it's in SQL means that different folks within the organization can better collaborate
Starting point is 00:30:14 on top of this DBT project, these data models that they're building. So in that way, it's a lot more similar with the kind of workflows we see at very large companies that it is different to very small ones. kind of workflows we see at very large companies That it is different to very small ones, but I'll give you an example of the thing that does differ
Starting point is 00:30:31 We find that a lot of large companies Don't love the idea of running their dbt tests on production data. They have much stricter security requirements they They have whole security teams. I always think it's fascinating. They have security teams that are larger than our engineering team building dbt. And that's fair and appropriate for where they are. And what that means is some of our key axioms that we kind of operate from in dbt get tested a little bit. So what we see a lot more of
Starting point is 00:31:02 is totally different dev and CI test and production data sets that different folks have different levels of access to. In some cases, it means that no individual person actually has permission to run the entire set of data transformations required because it's actually three different groups of people with different configured roles. So these things all still
Starting point is 00:31:25 work well in a dbt capacity, you can sort of structure things in a nice way at a project level, and everything, the projects kind of blend together. But there's a lot of work that we can do to make it work better in these environments. And we're interested in prioritizing that work. I'm curious, if I can jump in, Mark, I'm curious, like, what are some of those things you're thinking about adding? Is it too soon? Or can you talk about, like, what it would look like in the tool or what you're thinking about adding? managed by, let's say, the marketing team and a DAG that's managed by, just say, HR. And so it's always like sensitive data. Analysts focus on HR. And so they're going to have their own separate DAGs, but you still want to combine these
Starting point is 00:32:13 things at the very end to build like a holistic view of documentation. Maybe they both pull from some base project that provides sort of like common utility models and macros. And so being able to have different projects that depend on each other and making it easier to only run the models that you actually want to run as a part of a given development workflow
Starting point is 00:32:31 or deploy them in production, only run some of them. You can do all this today. It's just not naturally supported. You're kind of fighting against DBT to do it in some cases. I'm going to add something to that, Mark, if you don't mind. I just had a call with a customer, the days are blending,
Starting point is 00:32:50 so I don't know if it was today or yesterday, where we're introducing dbt. And their question was very much... This feature, Drew, would go a long way in that they had the development team on, it was a big Zoom, by the way, large screen. They had the development team on, they had the development team on, it was big zoom, by the way, large screen, the development team on, they had the data architects on, but then they also had representatives from the operations team. And in big organizations, there are teams that all they do is make sure
Starting point is 00:33:15 that the loads are running and not failing. And they were asking about rerun ability and yes it's possible i can go into the dbt tool i can pass a model statement and run just a portion of the model and uh you know frankly he said we're not going to check out any get repos so so that's not what our team does so what's next i said well there's a rest api you can you can use a curl command uh next um then, so I started talking about, well, you know, in dbt cloud, which they're thinking about dbt cloud, you can build jobs and you can create jobs. And I think that the ability for you to have almost an operations pain, it's not the load and it's not the CICD load, but it's almost an operations view of this, which if you did have separate graphs in the way you're describing, maybe you would have almost
Starting point is 00:34:12 a job that's just a pain that's designed almost completely for management and rerunning of failed jobs. I think that would go a long way. Is that on the roadmap, Drew? Is that something that you've been thinking about? So let me say one thing on this topic. One of the key design constraints we've placed upon ourselves that has been most helpful for folks running DBT is that every DBT job should be adept at it. And so in the worst case, you can just hit the rerun job button. And if the thing that failed was transient in nature, if just rerunning the same code will fix it, that'll get you to a good place.
Starting point is 00:34:52 And then there's sort of a fork in the road, right? So on the one hand, that will only fix some types of problems, like a network flip or the database did something funny. We see that sometimes for sure. only fix some types of problems, like a network flip or the database did something funny. You know, we see that sometimes for sure. But most failures aren't in that class of failure from what we're seeing. It's like a logic error or the data changed out from under you in a way that you weren't expecting in your data transformations.
Starting point is 00:35:18 And so I think it's a compelling idea giving you more tools to sort of rerun parts of a job, like rerun from build is a common thing to see in a CI tool. CircleCI has this functionality for their workflows. I can totally imagine that. I do want to optimize dbt cloud for being
Starting point is 00:35:37 accessible to different types of people. It fundamentally is a user interface over dbt core. This is a really good candidate feature. I do think what it kind of requires us to do is maybe think harder about the different ways that our run can fail and understand like okay based on this this failure what are the set of actions that are even sensible to to try to do again and maybe get some more understanding of like do you really want to run this thing again or is like i'll give you example. Not every dbt job necessarily in dbt cloud is fully adempotent.
Starting point is 00:36:09 You could have an operation that does something not adempotent where you wouldn't want to do it twice. That's like not a best practice. You shouldn't do it. But I think that's one case where we want to have more information in dbt to help guide users to do the the right thing um in in case of failure actually on that on a related point something i've always been interested in is rj metrics you know you you obviously um have this kind of history in there and and rj metrics in a way it it solved that it sort of solved that problem didn't it but but then obviously when you went to form fishtown and do dbt you consciously chose not to build another, another RJ metrics. You've chosen,
Starting point is 00:36:47 I suppose, to make dbt, a kit car, a sports car, rather than a coach with, with every feature in it and everything else. Is that a choice? Is that,
Starting point is 00:36:57 is that a conscious choice? If, if it is a conscious choice, it's guided by, I think the, what's called the Unix philosophy, which is that tools should do one thing and do them well and be composable. There's too many problems out there to try to solve all of them in one tool.
Starting point is 00:37:13 And so we aggressively focus on our goal, which is helping folks model their data and document and test and provide a workflow around it. Maybe the one thing I do want to say there is this is where the distinction between core and cloud comes up. And it's an important one that everyone's really well aligned on what goes where. So for us, dbt core is open source Apache 2 licensed. That's the thing that compiles and runs SQL.
Starting point is 00:37:36 We at no point are interested in creating functionality in dbt cloud that you have to pay for that does like core compilation or SQL running. That's all going to be open source. And the thing we're primarily building in dbt cloud that you have to pay for that does like core compilation or sql running that's all going to be open source and the thing where you're primarily building a dbt cloud are things like permissioning single sign-on and otherwise user interfaces and job scheduling so sort of stateful things that you know they require a persistence layer we'll build those things in the cloud and so this kind of thing we're talking about here stewart it's it's an interesting where you kind of, when you think of a feature, you think like, where does
Starting point is 00:38:07 it go? Is it core or is it cloud? And in this case, it sounds like that's a core thing. We'd want to have dbt, we want to give dbt core the ability to say like, I just ran this command, these models failed, rerun from failures. And then we provide the UI inside of dbt cloud to let you tap into that. I do want to be clear about one thing, which is, I don't want to go back to the orchestration from the old tools. I'm very happy with DAG based execution. I think if you look at when we used to build an old tools, we would build mappings, right? That is what you would think about for a model, right? And maybe in these old tools, it would be a couple of models in DBT. But then you would go and in usually a different tool,
Starting point is 00:38:52 or at least a different area of the tool, you would go and glue these things together with orchestration layers. And it was usually sets of serialized or parallelized processes going down the way. And what I don't want to do is go back to that necessarily in that you would spend 50% of your time, that's a rough estimate, on the orchestration layer. That's 50% that's gone in dbt. I don't have to do that. I don't have to. And by the way, when I'm working on one small piece of something, in the old tools, we typically were only working on that one piece and only testing that one piece while we were working on it. dataset that's mine. And that's what happens in BigQuery. And you're in your own and dbt cloud makes that simple. You're in your own dataset. So you're not disrupting anyone. And the fact that I can do dbt run, when I'm working on some small piece of it, that it runs the entire DAG is kind of like a unit or I would almost say a regression test that's easily available to the developer
Starting point is 00:40:07 that's focusing on something small. So I think that that's better than the older tools. So I don't want to go back to heavy-handed orchestration, but I do think that if... I'm kind of on the fence with you on this on where it goes, but I could clearly see if you're adding multiple DAGs, we still have the ability to run just a small portion of the model with includes and excludes and all that.
Starting point is 00:40:33 If perhaps in dbt cloud is where this would belong in such a way that instead of me having to type that with the dbt command, there's some way in the UI that I could say, you know, that could help me maybe even from the lineage documentation, you know, zero in maybe even lasso a section. So I'm just, I'm shooting for the sky here, lasso a section and say, run that, that would be really, really cool. There you go. So I'm conscious of time.
Starting point is 00:41:02 And there's one last thing I wanted to talk about, not, not quite in so much depth, but I think it's very relevant. So I suppose the central, I mean, telling Drew here about his own company and his own philosophy, but the central philosophy with Fishtown and a lot of this is analytics being a branch of engineering and the modern kind of analytic workflow and so on there. How far do you think you see that gets in enterprise companies, Drew? Is it just the believers or is this something you find gets take up
Starting point is 00:41:35 beyond the small core people that bring you in? It's a novel concept to some of the folks we interact with at larger companies that they are in fact doing something that looks more like engineering than not. They're not familiar with thinking of themselves as operating that role, and they're certainly not familiar with a lot of the tools in some cases. This is a big part of why we built an integrated developer environment in dbt cloud um i think a lot of people rightfully so balked at the idea of pip installing dbt on windows and then figuring out how to use git in order to run their first model um so what i what i do see is that this is pretty constant at these larger companies that are using dbt,
Starting point is 00:42:26 there's always a champion somewhere. Much more so, even in the software, it's the mindset. They think that that's the thing that they need to do to operate at the level they're interested in operating at. They understand how version control and code view and CICD obviates whole classes of problems that they weren't solving particularly well before even. And so I think that that's invariant, that there's somebody there on the customer's end.
Starting point is 00:42:59 I guess I almost said the user's end. Not all these people are dbt cloud customers. There's open source users out there for sure. And all these users ends, there's somebody who really gets it and is able to, I need to find better words for this, but I want to say preach the gospel, if you will,
Starting point is 00:43:15 to the other folks in their organization where they can sort of understand the benefits even though it is very new and certainly there's a lot to learn up front as well. Stuart, I think you've always been a very credible person talking to enterprise customers about new techniques and technologies.
Starting point is 00:43:32 So what tips have you got around getting take up of this within enterprise customers and what works and what doesn't work and where's the interest you find? So I definitely agree with the champion concept, Drew. We see that. And I'll come back to that in just a minute. But in general, we sort of have a triage process.
Starting point is 00:43:53 It's almost like a flow chart. And when we start talking to customers, perhaps they're like, tell us what to do. And we have a lot of that. We have strategy engagements where they know there's a lot of possibility out there in a new world and they're trying to think about how to get started and they know that we can help them. We sort of have a first question in the flow chart is a graphical click and drag UI,
Starting point is 00:44:15 an absolute must. And we have some customers that are on the fence there, tell us more. We have some customers like, look, we hate that anyway. We wish we weren't using it. And then we have some customers that are like, absolutely, of course, how else would you do it? And I think when we have that latter section, we might introduce the concept. We'll start with CICD and configuration as code and talk about all the value there. Automated testing, automated building and testing, and talk about those and see if there's any light in their eyes about any of that. And a lot of times with, frankly, you know, for enterprise customers, there's absolutely no movement on any of those concepts. And we just sort of stop. It's not for them, right? Now, there's also the brand of customer on the other
Starting point is 00:45:06 end of the spectrum that are like, yeah, we absolutely are tired of, they're talking SDLC, or maybe they have got a new boss that said, everything has to be CICD, configuration is code, these things are important. Tell us how to do that. And that's great. But what we see more often than not is a champion, a couple of people that have seen the light, they are tired of not having testing, they're tired of struggling with graphical tools that can't generate the code they wanted to generate. They know how to write SQL, they can't get their tool to generate SQL. They know the value of not just, you know, committing code, but committing code often and merging code. They get all that, but they know, or we discover along the way that we bump into these operations people that are like,
Starting point is 00:45:59 where's the audit table? And, or operations people that are like, how can I tell my, you know, very, very unskilled, sorry, unskilled operator how to restart this DDT job? Where is the things like row level security and single sign on which, you know, I'm not minimizing those, they're important. So a lot of times these champions know we're going to bump into some of this. And that's the challenging part when you've got a champion who knows the value of all these things, but they can usually get a development team signed, sealed, and delivered on building it this way. And usually building it with dbt where we struggle is getting the rest of the organization that, that is also involved in owning the solution on board. So to that point in a way, true, what's the, what's the problem that you guys are trying to solve? Is it about making analytics an engineering profession or is it beyond that?
Starting point is 00:47:13 I mean, what's the bigger problem you're trying to solve really, I think, in Fishtown is a question I think is interested in. Sure. Ultimately, we're on a mission here to elevate the analytics profession. We think that these are important professionals in any organization and serve as a function of the tooling and the tasks that they were historically responsible for. OK, we think that we think so. We're on a mission here to elevate the analyst profession. So we think that these analysts are important members of their organizations. And one of the things they were lacking historically was good tooling. There's an abundance of products that you can buy that solve point solutions that these analysts have. But there's kind of a dearth of tools that they can
Starting point is 00:48:06 use. And so with DVC, we're trying to give these people tools so that they can do higher leverage tasks. And everyone kind of has their own reasons for caring about this. I know Tristan, he was an analyst back in the day, and I think he's said a lot of spreadsheets over email in his day. And he realized that was maybe a low leverage use of his time in some ways. For me, one of the things I point to is that you can't have a conversation about data in 2020 without talking about the misuse of data. And so I think it's really important that we get these analysts in positions where they are in the room when people are talking about which data do we collect? What do we do with it? And so I think we can only really do that by leveling them up from spreadsheet jockeys to people that have well formed and well considered thoughts on the data's organization, the knowledge that the organization commands, you know? So that's my personal reason for being so interested in this problem. So ultimately, we're on a mission to empower analysts to create and disseminate this organizational knowledge. And everyone cares about that for a different reason.
Starting point is 00:49:14 That's my personal reason why. So I mean, that I think is interesting thing. So if you think about what I think the thing that the thing that struck me about DBT and the thing that got me interested in it after my probably year or so of skepticism when Deepna and Kristen kept talking about it all the time on the show, was about the philosophy of making the repeatability and having a structure to what you're doing. And Stuart and I both know that the ETL developer
Starting point is 00:49:43 was the worst job on a project. And the analysts were people who just kind of fiddled around with the numbers, came up with a number that looked roughly like they were looking for. There was no repeatability and more times than not, the number was wrong anyway. And that's not our projects, by the way, that's projects that we came in and rescued. But it's about elevating that role, really, isn't it? It's about, to analysts, it's about elevating that role really isn't it it's about to analysts it's about making their job um engineering is one part of it but it's about repeatability and so on
Starting point is 00:50:11 and about scaling the impact of the knowledge and the insights and so on beyond that small team to the rest of the business really and whether that involves the analysts in the company or using git and whatever or some other kind of variant of that. And that is, I think, the question for me is, is it realistic that people who are in the finance department or in whatever department, you know, will they be using this? But I think the other thing to bear in mind is that not every enterprise customer is going to be some crusty old business that has got 500-year-old people in there
Starting point is 00:50:40 using crappy versions of kind of crystal reports and so on. The next enterprise is that is Amazon. It's it's companies like that. And so it's not an option to say we're just going to do it the old way. You've got to adopt these new techniques and tech and technologies and approaches. And because the next enterprise, the next big customer will be Amazon and so on really. Yeah, sure. So to me, that speaks to,
Starting point is 00:51:02 it speaks to the tooling problem most. And this is something Connor on my team talks about a ton. He talks about problems that you have to throw over the fence. And so our hope was with dbt, we can take a lot of these problems that you had to wait for someone else to solve and make them problems you can solve yourself, do the good version of it once and not have to think too hard about it again in the future. That's a concept that holds true at small companies, large companies, old curmudgeonly analysts and sprightly young advanced ones.
Starting point is 00:51:32 We can all kind of benefit from automating these tasks. And to me, it's really a tooling problem. These folks are doing it anyway, but it's Excel macros or things like that. It's helping them use higher leverage tools. Okay. So let Stuart have the last comment of the show. Excellent. So I just want to piggyback on what you just said, Drew, which is the throw it over the fence thing. If there's one thing that you can sort of describe a traditional
Starting point is 00:51:56 team, it is with a lot of fences and a lot of people sitting and waiting for things to come over fences so that they can throw them over downstream fences. Right. And I think that the idea that now I think there will be some traditional enterprises that are going to balk at what I'm about to say, but the idea that an analyst could get into a Git repo and fix a little piece of logic somewhere that is wrong. Right now, of course, in the Git flow process, you can have, you know, reviewers and all of that. But that is something that's absolutely impossible in old tools. They don't know how to open the old tool. They don't, they're not allowed to in a lot of
Starting point is 00:52:39 cases. And there's no way for them to really comment on or describe the problem. So there's documents and requirements documents and all these things that are generated and built so that these two teams can communicate with one another about what is probably an incorrect where clause at the end of the day. Right. SQL and has learned Git, and I think anyone can do that, by the way, maybe they don't come to the table with those skills, could get into a pull request and comment on it, or actually check it out, make the change, submit it, open a pull request, and let a more senior developer say, no, that's not it. But still let them, you know, suspect that might be the problem and flag it in that way, I think is a really valuable thing. It's the idea that an analyst can participate and an ETL developer can participate in the analysis as well.
Starting point is 00:53:39 And I think maybe these two roles start to blend. And so for me, that's really where it goes. Mark, you said the last word. I got one last thing to say, I promise. And that is, I just wanted to, you know, for all of the Oracle Data Integrator people that might be listening, we've written a Oracle Data Integrator
Starting point is 00:54:00 to DVT conversion utility. We've used it with one customer so far. So we're interested in testing it with some others and if that interests you we would love to help you out with that fantastic so stewart where would people how people find out more about this uh this utility then and about red pill we haven't put it on our website yet because it's relative it's like a week old so we've used it with one customer it's converted all their all their odi mappings to dbt models so it's not there yet we it'll be there soon they can obviously reach out to me they can find me in the show notes i'd love to talk to them
Starting point is 00:54:37 about it we're going to have a marketing blitz about it at a certain point and the folks at snowflake are super excited about it as well So they're going to eventually be talking about it is my understanding. Fantastic. And Drew, how do people find out more about dbt? Oh yeah. Check us out at get dbt.com or github.com slash Fishtown analytics slash dbt. Or follow me on Twitter.
Starting point is 00:55:02 I'm at Drew Manon. I mostly tweet about DBT these days. Fantastic. It's been great having you both on the show. Really, really interesting. Thank you so much. And yeah, great to have you. And speak soon.
Starting point is 00:55:14 Thanks, Mark. Thanks, Mark. Thank you.
