Drill to Detail - Drill to Detail Ep.79 'Scaling the Modern Data Analytics Stack' with Special Guests Drew Banin and Stewart Bryson
Episode Date: April 20, 2020

Mark Rittman is joined by special guests Drew Banin, co-founder of Fishtown Analytics and maintainer of dbt (data build tool), and Stewart Bryson, long-time friend of the show and CEO/co-founder of Red Pill Analytics, to talk about scaling modern data stack projects from startups to the enterprise: how do you deal with data quality issues when there's no central record of customers, how do we introduce data governance and meet enterprise requirements and the needs of enterprise architects, and how do we scale concepts such as agile and analytics-as-engineering beyond our initial champion and data team?

Show links:
The dbt Viewpoint
Fishtown Analytics and Drew Banin
Multi-Channel Marketing Attribution using Segment, Google BigQuery, dbt and Looker
Getting Started with dbt
Red Pill Analytics
Transcript
So welcome to Drill to Detail, and I'm your host, Mark Rittman.
So I'm joined today by two very special guests, one of whom is a long-time friend of the show, Stewart Bryson.
Thank you very much.
Always a pleasure, Mark. Anytime. And I'm also joined today by none other than Drew Banin from Fishtown Analytics.
So Drew, it's great to have you on the show for the first time. Thanks, Mark. Happy to be here.
So Stewart, for anybody that doesn't know you, maybe just tell us what you do, who you are,
and I suppose how we know each other. Yeah, certainly. So we were colleagues once upon a time. And then for a brief time,
we were competitors. And I wouldn't call us that today, although we work in the same area.
It's a very, you know, collaborative relationship our companies have. I'm the CEO and founder of Red Pill Analytics. We are an analytics and data company, I like to say.
We build systems for customers,
usually using modern cloud technologies.
And we try to help customers migrate.
A lot of our customers are legacy customers,
having old legacy tools.
Some of the ones we worked on together,
Mark, back in the day,
taking those customers to new, easier to use cloud native technologies, and helping them find value in what the cloud provides. So consulting company and a services company primarily,
Red Pill. And we're going to talk about DBT today from Fishtown. And we're a
partner of DBT and use it on most of our projects. At least, you know, we recommend it on most of our
projects. Okay. So Drew, I mean, Drew, it's your first time on the show. We've had your colleague
Tristan on before, but just tell us what you do at Fishtown. And I suppose in a way, how did you
get there? What was your background, really?
Sure thing.
I'm one of the maintainers of DBT, which is the open source product that we create at
Fishtown Analytics.
So DBT is used by, I think, our latest metrics are like 1,700 companies every week.
It's really taking off in the data modeling space. And a lot of us are kind of
collaboratively thinking through the best way to model data and perform analytics kind of at scale.
So the way we got here is Tristan, Connor, and I, the three co-founders of Fishtown Analytics,
used to work at a company called RJ Metrics together, based in Philadelphia. And RJ Metrics was a leading BI tool back in the day.
And sort of with the advent of data warehouses like Redshift and Snowflake and BigQuery,
the industry started changing from a sort of all-in-one type of BI tool to a composition
of different best-in-class tools.
And so what we found was different tools popped up,
like Stitch Data spun out of RJ Metrics, for instance.
There's also Fivetran in the ETL space.
They do the data ingestion part.
There's data warehouses that are phenomenal
at storing and querying data.
There are many great BI tools as well.
And sort of in about 2016, the summer of 2016,
Tristan and I really identified that there was this big glaring hole in the data stack that goes from a data loader to the data warehouse to the BI tool: we were missing a modeling layer. And so we started building dbt, initially just to create views stacked on top of views on Redshift.
And it's really grown pretty significantly from there.
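To make that concrete, here's a minimal sketch of the kind of model being described; the model and table names are hypothetical, but the pattern is standard dbt: each model is a SELECT that dbt materializes as a view by default, and ref() stacks one view on another and gives dbt the dependency graph.

    -- models/staging/stg_orders.sql (hypothetical names throughout)
    -- A dbt model is just a SELECT; by default dbt creates it as a view.
    select
        id as order_id,
        user_id,
        created_at
    from raw_source.orders

    -- models/marts/fct_orders.sql
    -- ref() points at another model, so this view is stacked on the one above.
    select
        order_id,
        user_id,
        created_at
    from {{ ref('stg_orders') }}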
So Stuart and I have talked, I suppose, quite a bit really on this show in the past
about some of the projects we've been doing with technologies like DBT,
but also things like Looker and Segment and Redshift and BigQuery and so on.
And they're typically, or certainly the early adopters of those have been
what you might call kind of startup companies or companies that in a way get the whole modern data stack and the techniques and tools around that.
I mean, Stewart, just give us a flavor of some of the companies in that space that you've been working with, you know, with tools like dbt and the modern stack tools that we've been talking about.
Yeah, certainly.
So it usually begins, and Drew set this up nicely, with a cloud data warehouse. This whole discussion, I think, is enabled by the new cloud data warehouses: Snowflake, Redshift, BigQuery. We wouldn't be talking about agile tools, or data pipelining and modeling just-in-time the way we are today, without that. And those cloud data warehouses enable not just faster performance and better integration in the cloud; it's also about the agility with which we can move.
And I think we had all these constraints.
They were almost like weights on us from these legacy tools.
And once we sort of shed those
and started ingesting our data into these cloud data warehouses,
Fivetran, we're a partner of.
We use Fivetran on most of our projects.
For those sources where Fivetran hasn't enabled us, we use
a collection of other things. Sometimes that's custom code using APIs from the cloud services
layer. Sometimes there are things like stream sets, other technologies that can enable data
to be ingested. But once it gets there, that's where dbt can take over. So dbt provides the glue between this ingestion layer, which is fed by a lot of modern technologies, and the data warehouse.
And of course, downstream from that is hopefully some sort of a more modern analytics tool,
something like Looker, something like even Data Studio, or QuickSight on Amazon, things like Mode. Yeah, I'd even put Power BI in there.
So Drew's exactly right that there was sort of this missing step that we needed.
And I think that dbt really fills that gap nicely.
So Drew, Fishtown Analytics took a decision to focus on a certain type of customer when you started the business, and you've kept that focus. So why did you focus on those types of customer for dbt and for the sort of things you were doing?
It's a really good question.
In the early days of building DBT,
we worked with as many clients as we could, in as many different, diverse environments as we could find. We had a consulting contract that we signed with all these clients that made it very easy to get started and sort of step into the fire, if you will, and figure out how to make dbt work really well in that environment. And we certainly learned a whole lot very quickly in the first, you know, six months or a year of delivering analytics consulting with dbt. I think that the product dbt is today is a function of that experience, where we very early got to see concrete use cases, and the types of things that are very similar between different deployments of dbt, or the things that vary wildly between different deployments. And we got to sort of code for those different use cases and similarities and differences in a way that you can still see today in the product.
So I think the one thing to say there is historically,
we did a lot more consulting than we really do these days.
We bootstrapped initially and we make open source software.
We, in fact, didn't have a hosted product that you could pay for until years into our existence.
And so consulting was how we paid the bills and got really good feedback deploying dbt in different environments. But as dbt has taken off, we've really shifted as an organization to being a lot more of a product company, and these days our primary focus is building dbt and delivering great services to dbt Cloud clients on top of it. Okay, but I guess the question I was really interested in there, as well as what you just said, was the focus you had on VC-funded companies. The question for me was: how much was dbt a function of those being the companies you were working with and that's what they needed, or was it the other way around? Why was dbt a success with those types of companies? Sure. I lived in New York City for almost the first year of Fishtown Analytics. And I just remember taking the subway up and down Manhattan, and I would see dbt users
in the subway advertisements.
And I think it was a function of when you got to a VC stage as a company, the number
of tools that you started using exploded, and the amount of reporting that you needed
to do similarly exploded.
In fact, Tristan uses the term Cambrian explosion to talk about things like this, where it was really just like it's a new era for a lot of these companies.
And that's really where the need for DBT presents itself.
So if you have all your analytics being conducted on top of a transactional database and you've got, like, two reports that you look at once a week, you don't need a very complicated data stack.
I would even say like,
I think dbt is still an appropriate tool to use there,
but I don't think it's like a must have in that environment.
It's when you start growing these things and you care a lot more about
consistency and different types of reporting.
And you have this plurality of different data sources that you're reporting
on and different types of
operational and analytical needs at the other end on the BI and analytics and data science side,
that's really when it becomes so obvious that data modeling is a problem you have. If you don't use a tool like dbt, you're still modeling data; you're just doing it in the BI layer. And that's where discrepancies arise.
So we targeted those companies just because these companies had the problem that dbt helps you solve
and we sort of knew how to work with them. They were very happy to work with a small team that could kind of deliver services to them in an agile way.
Okay. Stewart, did you find that it was that type of company that was initially interested in dbt and those tools from your side as well?
Absolutely. I mean,
I think Drew's point about getting to that venture-funded stage is right: you start having some money to spend on tools, but also your analytics requirements increase, because now you have to report to people on your progress and your success. There's an analytics requirement that introduces itself when a company accepts some venture money, right? Suddenly you have to start reporting to a broader audience with consistency.
We've seen that with startups.
I'd say 50% of our business right now is startups. And the other 50% is, I'll say legacy, but traditional big companies: Fortune 100, Fortune 300.
And certainly, when you look at dbt for that startup, it's perfect. I think we're going to talk later about how it's good for the other environments as well. But that startup company still has that agile idea about how to do projects, with the money to spend on good tools. When we look at how that compares to, say, traditional big companies: they have the budgets, but they're usually being dragged down by older processes, slower project methodologies, those tied to older tools. And those are the ones where it's interesting to try to change the culture. I don't think you have to change the culture to use tools like dbt in a startup. Okay. So really, what I wanted to talk about in this episode builds on that. Two things, really.
So first of all, something I've been finding on projects I've been working on, and Stewart, you may recognize the same thing as well, is that, I suppose, the price of success is that as you engage for longer with customers, and those customers themselves grow, the complexity of the projects grows too. You start to get exponentially more complex problems to solve in terms of integrating source A and source B and source C, where there's been no upfront setting-up of, say, a customer hub or a product hub, or anything to say: this is the definitive source of information for the company.
And there's also, I suppose, complexity in terms of data quality. One of the things that I find with startup-type companies is they've not yet had to deal with the issues around data quality, and the challenges around a single version of the truth. So first of all I wanted to get your thoughts, both from the product side and from the consulting side, on how you scale modern data stack projects as the company grows and the complexity grows.
And the other side of this I want to talk about later on is how we can take some of these tools and techniques, particularly dbt, into enterprise companies, where they may already have solved some of the issues around data quality, but they've got issues around, say, velocity of the project, or agility, or things like that. But let's start off with this issue around complexity. So Stewart, first of all, have you noticed this as well? As you engage for longer with a customer, you might start with, say, dbt or Looker or whatever on a very agile project, but as you start to scale it you start to see issues around complexity. Absolutely.
The interesting thing is that when you look at traditional ETL tools, I'm thinking Informatica, Oracle Data Integrator, and the like, right, they always had capabilities and add-ons to do some of this. They had data
quality add-ins. They had, you know, sort of master data management add-ons. You could buy
these things and they tried to, outside of the transformation layer, you could bolt on these
add-ons to try to handle this. And what we saw mostly with customer after customer is that
those utilities or those add-ons were almost never used. And a big part of why they were never used
is because they weren't integrated into the data transformation process themselves. They sat outside of it. They tried to feed a data transformation process, or to ingest from a data transformation process, and it just never worked well. And so I think what I saw over those years, Mark, and you might have a different perspective on this, is that we were doing the master data management and the data quality in the ETL tool anyway. And what that meant was we were using ETL
to do data quality and we were using ETL to do data governance. And so I think that not much
has changed in that area when you approach it with dbt, the one difference being, I think, the ability to be more agile with pure SQL-based transformations. That's the other thing that's just really brilliant about dbt: it's just SQL.
It makes no apologies for that.
And this is the language that I find that almost everyone I work with knows.
And when they don't know it, it's maybe one of the easier languages to learn.
So I think the idea of doing
data quality and governance and master data management, these things in SQL makes a whole
lot of sense and doing it in the overall directed acyclic graph, the overall graph of dependencies.
I think what's not perhaps clear in the tool, and I would argue it wasn't
necessarily clear in the earlier tools, is, you know, where is the data quality in your DAG?
And I think with the appropriate model structure, you can document in dbt with the schema.yml file, and then you also have these tests that you can build in. I've never seen an ETL tool or a data engineering tool with tests built in like this. So there's not necessarily a user guide that tells you, here's where you stick your data quality, here's where you stick your governance, your master data management, but the tool certainly can support those approaches.
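As a sketch of what that looks like in practice, documentation and tests live together in a schema.yml file; the model and column names here are hypothetical:

    # models/staging/schema.yml (hypothetical model and columns)
    version: 2

    models:
      - name: stg_customers
        description: "One row per customer, de-duplicated across source systems."
        columns:
          - name: customer_id
            description: "Primary key for the cleaned customer record."
            tests:
              - unique
              - not_null

Running dbt test then generates and runs the SQL assertions for the declared tests.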
I think you just use the tool for what the tool does,
which is SQL-based transformations,
and then you need a project structure that dictates where things go: the data quality, what we think of as traditional data quality, and cleaning the data first, goes here. Then next in the folder or model structure is where we'll start to imbue this with our opinions about logic and transformations. And then down another step is where we'll start to perhaps define some things for master data management, et cetera.
So I think the tool certainly supports it, but there's no screen that says, okay, insert data quality here.
But again, sorry to reiterate, but I just don't think those were used in the older tools anyway.
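One common way to encode that layering, and it's a convention rather than something dbt enforces, is in the folder structure and dbt_project.yml; the project and folder names here are illustrative:

    # dbt_project.yml (excerpt; names are illustrative)
    models:
      my_project:
        staging:     # clean and standardize raw data; data-quality tests live here
          materialized: view
        marts:       # business logic, conformed customer and product models
          materialized: table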
Drew, is this a problem you're trying to solve from a product perspective?
Yeah, it's an interesting question because it varies tremendously by organization
and the types of problems that folks are solving with dbt.
And when I think about dbt these days, I think of it more as like a compiler, a SQL-generating and -running engine, where we want to give tools to the analysts who use dbt to create the project that they need for their circumstances. We want to be a little bit less prescriptive about exactly how you should structure a project. Although, Stewart, I think your example is the canonical structure: staging your raw data and then doing more advanced combinations of the data on top of it.
So, sure, Stewart, I think you're right to identify things like documentation and dbt's data testing. I think there are some areas of the product where we're not totally serving testing needs today. And it's sort of a function of the space we're operating in.
So maybe the context I can give here is
when we think about how DBT should solve some given problem,
we always like to do the thought exercise
of relating it back to how software engineers
solve similar problems.
Because if we have any good ideas here,
it's things we've learned from software engineering best
principles, right?
You want to version control your code, do code review,
add automated testing, things like that.
And so one of the things we're missing, for instance,
is a notion of unit testing.
So this is maybe a slightly different type of testing
than what dbt currently supports, but I think it would go a long way to giving folks assurance that their data transformations are correct. So I think there's an opportunity to take the core dbt compiler and give you new interfaces for generating and running SQL that can assert that your data transformations do what they're supposed to do.
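dbt didn't have unit tests in this sense at the time, but a hand-rolled assertion can be written today as a custom data test: a SQL file under tests/ that fails if it returns any rows. A hypothetical example:

    -- tests/assert_order_amounts_reconcile.sql (hypothetical models and columns)
    -- Fails if any order's amount disagrees with the sum of its payments.
    select o.order_id
    from {{ ref('fct_orders') }} as o
    join (
        select order_id, sum(amount) as total_paid
        from {{ ref('stg_payments') }}
        group by order_id
    ) as p on o.order_id = p.order_id
    where o.amount != p.total_paid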
Do you think there needs to be a methodology, or a bit more prescriptive approach, with how we do projects in this area? Because going back to the complexity thing, a common thing with the SaaS applications that we use as data sources now is that several of them will be a source of customers, and there'll also be financial data, and product data. There's no one single authoritative source of this information.
Whereas in the old ERP systems we used to work with, there would be one single customer table. And do you find, Drew, from the telemetry and the feedback you get from developers using dbt now, at what point does it become an overly complex job to bring in the third source, the fourth source, and then to try to think about the lifecycle of data and so on? These sound like boring old-man sorts of things, but any kind of business is going to have that as an issue as they go along. And I wonder, maybe this is a question about how far the product goes and what's out of scope and what's in scope, but what are your thoughts on that? Sure. So, taking a few steps back from that as a starting point: one of the good things about dbt, and the approach it sort of forces you into, is that it does compel you
to make small, discrete changes that get version-controlled
and shipped.
The reason I say that is because frequently you don't start
with four different sources of truth of what a user is.
You add them over time.
And so the thing that needs to be in your head
when you're writing these data models is like,
what are the universes of things that we might need
to address or change here in the future?
And so it really starts with user identity number one,
whether or not you set yourself up for success
to have two, three, four.
And it's not the most fun problem
to kind of stitch all these identities together
in a way that's sensible and easy to debug if things go wrong
or explain to your colleagues when they're saying things that are surprising.
But no, ultimately, I think the way that we solve that is,
again, it's very similar to the way that software engineers solve this.
Sure, there are libraries that do sort of specific things that sort of automate that away from you.
But in a lot of cases, you solve this with what you would call a design pattern.
And so it's sort of a template for pattern recognition: I have this problem, and a good way to solve it is this solution. So we thought a lot about design patterns in the early days. And there's actually a decent book called SQL Design Patterns; it's pretty dated by now, I think, but I enjoyed reading it.
We, to this end, have cooked up a playbook recently that talks about how to do user attribution.
And Mark, I think you actually published a similar type of article not so long ago too, right?
And so it's this kind of thing where it's hard to solve these problems for the first time,
but it's so rare these days that any one organization is solving a problem that's totally unique to them.
I think what we need to kind of do more of is get people talking about
and writing about how they solve the problems that they're running into and sort of helping
advance the whole field so that we have a sort of template for how to approach these problems.
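As a sketch of the kind of design pattern being described, identity stitching often starts with a model that unions each source's identifiers into one mapping, keyed on something shared such as an email address; the sources and columns here are hypothetical:

    -- models/staging/stg_user_identities.sql (hypothetical sources and columns)
    -- Union each source's identifiers into one mapping; downstream models can
    -- then assign a canonical user_id per shared key (email, in this sketch).
    select 'app' as source, cast(id as varchar) as source_user_id, lower(email) as email
    from {{ ref('stg_app_users') }}
    union all
    select 'crm' as source, cast(contact_id as varchar) as source_user_id, lower(email) as email
    from {{ ref('stg_crm_contacts') }}
    union all
    select 'billing' as source, cast(customer_id as varchar) as source_user_id, lower(email) as email
    from {{ ref('stg_billing_customers') }}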
So Stewart, I described you once as the Karl Marx of agile methodologies. So is this arguably really about agile methodologies, and how do you have a design while still being agile?
I think so. I know in the old world, we would buy additional tools to do some of these things, right? You can imagine, if you're using an ETL tool, I know Informatica sold these master data management add-ons where you could map your source tables and it would give you a product table as an output.
And I can tell you that from experience, most customers rewrote that.
Any of these add-ons I've seen over the years that try to deliver something
to a transformation layer, a customer table, usually performed pretty poorly.
They often weren't batch SQL-based.
So I do think that it's something that should be solved in SQL.
I think that trying to write an application
that may use array-based processing or whatever,
I think it should be solved in SQL.
And frankly, I think it's a problem
that is just another version of a data transformation problem
that needs to be solved.
So I believe it should be solved in dbt in the way that we solve other problems. I could see somebody writing a package; dbt has the concept of a package, which is reusable code that everyone doesn't have to rewrite from scratch. So I could imagine, and it certainly has support for this, someone, whoever that person might be, deciding to write sort of a customer package, a product package. And I know there's some companies out there doing things like this. So I truly believe
that it belongs in SQL. And to Drew's point about the software development life cycle, here's the thing: all these vendors whose products we used in the past, Mark, built their data and ETL products using the SDLC. They used version control. They used feature-branch development. They did CI and CD. But then they were delivering a product that they didn't think
should be used in that way. They didn't think the audience for their product were developers,
in whatever sense of that term. And I think the idea that, as Drew mentioned, version control, small changes, being able to go back and look at patterns, collaboratively solving these patterns of finding a single customer or a single product, is one that belongs in the tool. Going back to the agile question, it's a question of how you inject rigor into a process for the first time.
And I think the problem with non-agile projects is they inject rigor at the beginning when it's
not needed. I think dbt really is challenging for the enterprise, but in a good way, in that it causes the enterprise to rethink the way they've traditionally built data warehouses.
And that is that they were modeled first with often a whole lot of design and thought
before anybody started to think about what the DDL would look like and what the data load process would look like. Spending a lot of time there.
And then you would go back and build a data integration process to then load that target.
And in the design process of building that perfect model there was usually a whole lot of SQL. That data architect was usually SQL-literate.
And they were querying databases to figure out, you know, what's the granularity of this, and what's the granularity of that,
and how would these things ultimately join. But then they were stepping away from SQL,
and they were building a document or writing a document that defined that model, and then they
would hand it over to ETL developers and say, now load this model.
I think what's interesting about dbt, and I would say preferred, even optimal, about the design process is: instead of modeling to a target, you're slowly building a target, step by step. And Mark, if you think about where Oracle Data Integrator came from, with Sunopsis and that whole concept of interfaces, we may lose the entire audience here, but stick with
at a time as they called it. And you focus on that first join, that first pattern, that first combination of data sets, and then solve it and
then go on to the next step. And I think the idea that the model will evolve is the thing that,
frankly, we're having to convince the enterprise to do. But I think it's valuable. I think it's
more optimal anyway. And I think it's
really the way we think about it anyway. But on that point, before I hand over to Drew: on projects, the bane of my life was the enterprise architect and the enterprise architecture team, who would pedantically pick through the design I'd got and say this entity here isn't right, and so on. But at least it all joined up in the end, and there was a central thought process around that thing. So I suppose, Drew, the point you're trying to make, and I think you as well, Stewart, is that it's the opposite way round to the agile way in which we've been developing stuff in dbt. Drew, what's your experience been with working with enterprise accounts? Do they have the same problems to solve that smaller companies do? And are they changing the way in which they model these things and think about things, or is that influencing some of the design for dbt going forward?
going forward sure so i think ultimately the problems that the very large companies and the
very small companies out there using dbt are encountering are, are more similar than they
are different. Um, there are some key differences, um, that I'd be happy to talk about that, that do
sort of inform maybe some product changes we'd want to make in the future. But, uh, for the most
part there, just like Stuart said, said, they're sort of solving these problems
a little bit more iteratively
and they're doing it in SQL.
And the fact that it's in SQL
means that different folks
within the organization
can better collaborate
on top of this DBT project,
these data models that they're building.
So in that way, the kind of workflows we see at very large companies are a lot more similar to those at very small ones than they are different. But I'll give you an example of one thing that does differ.
We find that a lot of large companies don't love the idea of running their dbt tests on production data. They have much stricter security requirements; they have whole security teams. I always think it's fascinating: they have security teams that are larger than our engineering team building dbt.
And that's fair and appropriate for where they are.
And what that means is some of our key axioms
that we kind of operate from in dbt
get tested a little bit.
So what we see a lot more of
is totally different dev and CI test and production data sets
that different folks have different levels of access to.
In some cases, it means that no individual person
actually has permission to run the entire set
of data transformations required
because it's actually three different groups of people
with different configured roles.
So these things can all still work in a dbt capacity; you can structure things in a nice way at a project level, and the projects kind of blend together. But there's a lot of work that we can do to make dbt work better in these environments, and we're interested in prioritizing that work.
I'm curious, if I can jump in, Mark: what are some of those things you're thinking about adding? Is it too soon, or can you talk about what it would look like in the tool? Sure. So one thing we've been thinking about is supporting multiple projects: say, a DAG that's managed by the marketing team and a DAG that's managed by, just say, HR. HR is always, like, sensitive data, and specific analysts focus on HR. And so they're going to have their own separate DAGs, but you still want to combine these things at the very end to build a holistic view of documentation. Maybe they both pull from some base project that provides common utility models and macros.
And so being able to have different projects
that depend on each other
and making it easier to only run the models
that you actually want to run
as a part of a given development workflow
or deploy them in production,
only run some of them.
You can do all this today.
It's just not naturally supported.
You're kind of fighting against dbt to do it in some cases.
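The mechanism you'd reach for today is dbt's packages: a packages.yml in one project can pull another project in from Git or from a local path, after which dbt deps makes its models and macros available. The repository and path below are hypothetical:

    # packages.yml (hypothetical repository and path)
    packages:
      - git: "https://github.com/example-org/base-dbt-project.git"
        revision: 0.1.0
      - local: ../hr_dbt_project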
I'm going to add something to that, Mark, if you don't mind.
I just had a call with a customer, the days are blending,
so I don't know if it was today or yesterday,
where we're introducing dbt.
And their question was very much...
This feature, Drew, would go a long way. They had the development team on, it was a big Zoom, by the way, large screen, they had the data architects on, but then they also had representatives from the operations team. And in big organizations, there are teams whose whole job is to make sure that the loads are running and not failing. And they were asking about rerunability. And yes, it's possible: I can go into the dbt tool, I can pass a model selection and run just a portion of the DAG. But, you know, frankly, he said, we're not going to check out any Git repos, that's not what our team does, so what's next? I said, well, there's a REST API, you can use a curl command. Next. Then I started talking about, well, in dbt Cloud, which they're thinking about, you can build and schedule jobs.
And I think the ability for you to have almost an operations pane, it's not the load and it's not the CI/CD run, but an operations view of this, which, if you did have separate graphs in the way you're describing, maybe you would have almost a pane that's designed almost completely for management and rerunning of failed jobs. I think that would go a long way.
Is that on the roadmap, Drew? Is that something
that you've been thinking about? So let me say one thing on this topic.
One of the key design constraints we've placed upon ourselves, and it has been most helpful for folks running dbt, is that every dbt job should be idempotent. So in the worst case, you can just hit the rerun-job button. And if the thing that failed was transient in nature, if just rerunning the same code
will fix it, that'll get you to a good place.
And then there's sort of a fork in the road, right?
So on the one hand, that will only fix some types of problems, like a network blip, or the database did something funny. We see that sometimes for sure.
But most failures aren't in that class of failure from what we're seeing.
It's like a logic error or the data changed out from under you in a way that you weren't
expecting in your data transformations.
And so I think it's a compelling idea, giving you more tools to rerun parts of a job; rerun-from-failed is a common thing to see in a CI tool. CircleCI has this functionality for their workflows.
for their workflows.
I can totally imagine that.
I do
want to optimize dbt cloud for being
accessible to different types of people. It fundamentally
is a user interface over
dbt core. This is a really good
candidate feature.
I do think what it requires us to do is think harder about the different ways that a run can fail, and understand, okay, based on this failure, what is the set of actions that are even sensible to try again, and get some more understanding of whether you really want to run this thing again. I'll give you an example: not every dbt job in dbt Cloud is necessarily fully idempotent. You could have an operation that does something non-idempotent, where you wouldn't want to do it twice. That's not a best practice, you shouldn't do it, but I think that's one case where we want to have more information in dbt to help guide users to do the right thing in case of failure.
Actually, on a related point, something I've always been interested in is RJMetrics. You obviously have this history there, and RJMetrics, in a way, solved that problem, didn't it? But then obviously when you went to form Fishtown and build dbt, you consciously chose not to build another RJMetrics.
You've chosen, I suppose, to make dbt a kit car, a sports car, rather than a coach with every feature in it. Is that a conscious choice? If it is a conscious choice, it's guided by, I think, what's called the Unix philosophy, which is that tools should do one thing, do it well, and be composable.
There's too many problems out there to try to solve all of them in one tool.
And so we aggressively focus on our goal, which is helping folks model their data and
document and test and provide a workflow around it.
Maybe the one thing I do want to say there is this is where the distinction between core
and cloud comes up.
And it's an important one that everyone's really well aligned
on what goes where.
So for us, dbt core is open source Apache 2 licensed.
That's the thing that compiles and runs SQL.
We are at no point interested in creating functionality in dbt Cloud that you have to pay for that does core compilation or SQL running. That's all going to be open source. The things we're primarily building in dbt Cloud are things like permissioning, single sign-on, other user interfaces, and job scheduling: stateful things that require a persistence layer. We'll build those things in the cloud. And
so this kind of thing we're talking about here, Stewart, is interesting: when you think of a feature, you think, where does it go? Is it core or is it cloud? And in this case, it sounds like that's a core thing. We'd want to give dbt Core the ability to say, I just ran this command, these models failed, rerun from failures. And then we provide the UI inside of dbt Cloud to let you tap into that. I do want to be clear about one thing, which is, I don't want to go back to
the orchestration from the old tools. I'm very happy with DAG based execution. I think if you
look at how we used to build in the old tools, we would build mappings, right? A mapping is roughly what you would think of as a model; maybe in these old tools it would be a couple of models in dbt. But then you would go, usually in a different tool, or at least a different area of the tool, and glue these things together with
orchestration layers. And it was usually sets of serialized or parallelized processes going down the way.
And what I don't want to do is go back to that, in the sense that you would spend 50% of your time, a rough estimate, on the orchestration layer. That's 50% that's gone in dbt; I don't have to do that. And by the way, when I'm working on one small piece of something in the old tools, we typically were only working on that one piece, and only testing that one piece, while we were working on it. In dbt, I work in a dataset that's mine; that's what happens in BigQuery. You're in your own dataset, and dbt Cloud makes that simple, so you're not disrupting anyone. And the fact that I
can do a dbt run when I'm working on some small piece of it, and it runs the entire DAG, is kind of like a unit test, or I would almost say a regression test, that's easily available to the developer who's focusing on something small.
So I think that that's better than the older tools.
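That per-developer isolation is configured in profiles.yml: each developer points their dev target at their own dataset, so a full dbt run lands in their sandbox. A sketch for BigQuery, with hypothetical project and dataset names:

    # ~/.dbt/profiles.yml (excerpt; project and dataset names are hypothetical)
    my_project:
      target: dev
      outputs:
        dev:
          type: bigquery
          method: oauth
          project: example-gcp-project
          dataset: dbt_stewart     # each developer sets their own dataset
          threads: 4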
So I don't want to go back to heavy-handed orchestration,
but I do think... I'm kind of on the fence with you on where this goes. I could clearly see, if you're adding multiple DAGs, we still have the ability to run just a small portion of the model with includes and excludes and all that. Perhaps dbt Cloud is where this would belong, in such a way that instead of me having to type that with the dbt command, there's some way in the UI, maybe even from the lineage documentation, to zero in, maybe even lasso a section and say: run that. I'm shooting for the sky here, but that would be really, really cool.
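For reference, the includes and excludes mentioned here are dbt's node selection flags on the command line; the model and tag names below are hypothetical:

    # Run just a slice of the DAG (hypothetical model and tag names)
    dbt run --models staging.finance            # everything under one folder
    dbt run --models +fct_orders                # fct_orders plus all upstream parents
    dbt run --models tag:nightly --exclude tag:deprecated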
There you go. So I'm conscious of time.
And there's one last thing I wanted to talk about, not quite in so much depth, but I think it's very relevant. I'm telling Drew here about his own company and his own philosophy, but the central philosophy with Fishtown and a lot of this is analytics being a branch of engineering, and the modern analytic workflow and so on. How far do you see that getting in enterprise companies, Drew? Is it just the believers, or is this something you find gets taken up beyond the small core of people that bring you in?
It's a novel concept to some of the folks we interact with
at larger companies that they are in fact doing something that looks more like engineering than not.
They're not familiar with thinking of themselves as operating that role, and they're certainly not familiar with a lot of the tools in some cases.
This is a big part of why we built an integrated development environment in dbt Cloud. I think a lot of people, rightfully so, balked at the idea of pip-installing dbt on Windows and then figuring out how to use Git in order to run their first model. So what I do see, pretty constantly at these larger companies that are using dbt, is that there's always a champion somewhere. And more so even than the software skills, it's the mindset. They think that this is the thing they need to do to operate at the level they're interested in operating at. They understand how version control and code review and CI/CD obviate whole classes of problems that they weren't solving particularly well before.
And so I think that's invariant: there's somebody there on the customer's end, I guess I almost said the user's end, because not all these people are dbt Cloud customers; there are open source users out there for sure. At all these users' ends, there's somebody who really gets it and is able to, I need to find better words for this, but I want to say preach the gospel, if you will, to the other folks in their organization, so they can understand the benefits even though it is very new, and certainly there's a lot to learn up front as well.
Stewart, I think you've always been a very credible person talking to enterprise customers about new techniques and technologies. So what tips have you got around getting take-up of this within enterprise customers? What works and what doesn't work, and where do you find the interest?
So I definitely agree with the champion concept, Drew.
We see that.
And I'll come back to that in just a minute.
But in general, we sort of have a triage process.
It's almost like a flow chart.
And when we start talking to customers,
perhaps they're like, tell us what to do.
And we have a lot of that.
We have strategy engagements where they know
there's a lot of possibility out there in a new world
and they're trying to think about how to get started and they know that we
can help them. The first question in the flow chart is: is a graphical click-and-drag UI an absolute must? We have some customers that are on the fence there, tell us more. We have some customers like, look, we hate that anyway, we wish we weren't using it. And then we have some customers that are like, absolutely, of course, how else would you do it? And when we have that latter group, we might introduce the concept. We'll start with CI/CD and configuration-as-code and talk about all the value there.
Automated building and testing, and see if there's any light in their eyes about any of that. And a lot of times, frankly, for enterprise customers there's absolutely no movement on any of those concepts, and we just sort of stop. It's not for them, right? Now, there's also the brand of customer on the other end of the spectrum that are like, yeah, we're absolutely tired of this, they're talking SDLC, or maybe they've got a new boss that said everything has to be CI/CD, configuration-as-code, these things are important, tell us how to do that. And that's great. But what we see more often than not is a champion,
a couple of people that have seen the light, they are tired of not having testing, they're tired of
struggling with graphical tools that can't generate the code they wanted to generate.
They know how to write SQL, they can't get their tool to generate SQL. They know the value of not just,
you know, committing code, but committing code often and merging code. They get all that. But they know, or we discover along the way, that we'll bump into these operations people that are like, where's the audit table? Or operations people that are like, how can I tell my very unskilled, sorry, unskilled operator how to restart this dbt job? Where are things like row-level security and single sign-on? Which, you know, I'm not minimizing; those are important. So a lot
of times these champions know we're going to bump into some of this. And that's the challenging part: a champion who knows the value of all these things can usually get a development team signed, sealed, and delivered on building it this way, and usually building it with dbt. Where we struggle is getting the rest of the organization that is also involved in owning the solution on board.
So to that point, in a way, Drew, what's the problem that you guys are trying to solve? Is it about making analytics an engineering profession, or is it beyond that? What's the bigger problem you're really trying to solve at Fishtown, is the question I'm interested in.
Sure. Ultimately, we're on a mission here to elevate the analytics profession. We think that these analysts are important members of their organizations, and that they've been underserved as a function of the tooling and the tasks that they were historically responsible for. One of the things they were lacking historically was good tooling.
There's an abundance of products that you can buy that solve point problems that these analysts have, but there's kind of a dearth of tools that they can use. And so with dbt, we're trying to give these people tools so that they can do higher-leverage
tasks. And everyone kind of has their own reasons for caring about this. I know Tristan, he was an
analyst back in the day, and I think he sent a lot of spreadsheets over email in his day. And he realized that was maybe a low-leverage use of his time in some ways. For me, one of the things
I point to is that you can't have a conversation about data in 2020 without talking about the
misuse of data. And so I think it's really important that we get these analysts in positions
where they are in the room when people are talking about: which data do we collect, and what do we do with it? And I think we can only really do that by leveling them up from spreadsheet jockeys to people that have well-formed and well-considered thoughts on the data, and the knowledge, that the organization commands, you know? So that's my personal reason for being so interested in this problem. So ultimately, we're on a mission to empower analysts to create and
disseminate this organizational knowledge. And everyone cares about that for a different reason.
That's my personal reason why. So that, I think, is the interesting thing. The thing that struck me about dbt, the thing that got me interested in it after my year or so of skepticism while previous guests like Tristan kept talking about it all the time on the show, was the philosophy of repeatability and having a structure to what you're doing.
And Stewart and I both know that the ETL developer was the worst job on a project. And the analysts were people who just kind of fiddled around with the numbers and came up with a number that looked roughly like what they were looking for. There was no repeatability, and more times than not the number was wrong anyway.
And that's not our projects, by the way, that's projects that we came in and rescued.
But it's about elevating that role, really, isn't it? For analysts, engineering is one part of it, but it's about repeatability and so on, and about scaling the impact of the knowledge and the insights beyond that small team to the rest of the business, whether that involves the analysts in the company using Git or some other kind of variant of that.
And the question for me is: is it realistic that people in the finance department, or whatever department, will be using this? But the other thing to bear in mind is that not every enterprise customer is going to be some crusty old business that has got 500-year-old people in there using crappy versions of Crystal Reports and so on. The next enterprise is Amazon; it's companies like that. So it's not an option to say, we're just going to do it the old way. You've got to adopt these new techniques and technologies and approaches, because the next enterprise, the next big customer, will be Amazon and so on, really.
Yeah, sure. So to me, that speaks to the tooling problem most.
And this is something Connor on my team talks about a ton.
He talks about problems that you have to throw over the fence.
And so our hope was with dbt, we can take a lot of these problems that you had to wait
for someone else to solve and make them problems you can solve yourself, do the good version
of it once and not have to think too hard about it again in the future.
That's a concept that holds true at small companies and large companies, for old curmudgeonly analysts and sprightly young advanced ones alike.
We can all kind of benefit from automating these tasks.
And to me, it's really a tooling problem.
These folks are doing it anyway, but it's Excel macros or things like that.
It's helping them use higher leverage tools.
Okay.
So let Stewart have the last comment of the show.
Excellent. So I just want to piggyback on what you just said, Drew, which is
the throw-it-over-the-fence thing. If there's one thing that describes a traditional team, it is a lot of fences, and a lot of people sitting and waiting for things to come over fences so that they can throw them over downstream fences, right?
Now, there will be some traditional enterprises that are going to balk at what I'm about to say, but the idea that an analyst could get into a Git repo and fix a little piece of logic somewhere that is wrong, and of course, in the Git flow process, you can have reviewers and all of that, is something that's absolutely impossible in old tools. Analysts don't know how to open the old tool, and they're not allowed to in a lot of cases. And there's no way for them to really comment on or describe the problem.
So there's documents and requirements documents and all these things that are generated and built so that these two teams can communicate with one another about what is probably an incorrect where clause at the end of the day.
Right. The idea that an analyst who knows SQL and has learned Git, and I think anyone can learn that, by the way, even if they don't come to the table with those skills, could get into a pull request and comment on it, or actually check it out, make the change, and open a pull request, and let a more senior developer say, no, that's not it, but still let them flag what they suspect might be the problem in that way, I think is a really valuable thing.
It's the idea that an analyst can participate and an ETL developer can participate in the analysis as well.
And I think maybe these two roles start to blend. And so for me,
that's really where it goes.
Mark, you said I'd get the last word, and I've got one last thing to say, I promise.
And that is, for all of the Oracle Data Integrator people that might be listening: we've written an Oracle Data Integrator to dbt conversion utility. We've used it with one customer so far.
So we're interested in testing it
with some others, and if that interests you, we would love to help you out with that. Fantastic. So Stewart, where would people find out more about this utility, and about Red Pill? We haven't put it on our website yet, because it's only about a week old. We've used it with one customer, and it's converted all their ODI mappings to dbt models. So it's not there yet, but it'll be there soon. They can obviously reach out to me, they can find me in the show notes, and I'd love to talk to them about it. We're going to have a marketing blitz about it at a certain point, and the folks at Snowflake are super excited about it as well, so they're going to eventually be talking about it, is my understanding.
Fantastic.
And Drew,
how do people find out more about dbt?
Oh yeah.
Check us out at getdbt.com, or github.com/fishtown-analytics/dbt. Or follow me on Twitter; I'm @drewbanin. I mostly tweet about dbt these days.
Fantastic.
It's been great having you both on the show.
Really, really interesting.
Thank you so much.
And yeah, great to have you.
And speak soon.
Thanks, Mark.
Thanks, Mark. Thank you.