The Infra Pod - Let's learn how to survive in the Modern Data Stack in 2023... Chat with Josh Wills (ex-Slack)

Episode Date: July 3, 2023

The Modern Data Stack swept the industry over the last few years, and even though it's still the buzz, it can seem more daunting than empowering. Ian and Tim sat down to chat with Josh Wills (ex-Director of Data at Slack / Cloudera) to talk about all things data, including his dbt + duckdb package that he wants no more open source contributions to. Sit down and relax with this one, but we're sure you'll pick up something here with our YAIGers.

Transcript
Starting point is 00:00:00 Welcome back to our pod. Do a quick intro round again. This is Tim from Essence VC and Ian, take it away. This is Ian. I've started companies, and I'm currently helping turn Snyk into a data company. And I am exceedingly excited to be joined today by Josh Wills. Josh, why don't you give us a little intro to yourself?
Starting point is 00:00:23 I am a gainfully unemployed data person, is how I refer to myself these days. Worked on a number of open source projects over the years. My current kind of pet project is dbt-duckdb. I used to run data engineering at Slack. Did that for a couple of years. Rewrote a bunch of Slack search. Worked on climate tech at a company called WeaveGrid. I used to work really hard on the Google ad auction many years ago.
Starting point is 00:00:43 I used to be director of data science at Cloudera, where I wrote this famous definition of a data scientist. I'm just kind of a survivor. I've been doing data stuff for a long time, from before big data to big data to modern data stack to whatever the postmodern data stack is going to be. I'm somehow around. That's me. I mean, you almost need no introduction at this point in terms of how the definition and where you kind of sit in the industry today. Yeah. I'm sort of the it boy of data. It makes like literally no sense to me.
Starting point is 00:01:13 Is this your goal? Like when you're a small child, you go up and be like, I'm going to be the boy band equivalent in data to like the backstreet boys. Tragically, I think it was once when I joined Cloudera, it was kind of early on and like, you know, there was Hillary Mason and there was DJ Patel and there was Jeff Hammerbacher. And I, and I knew these people and I was like, I can, I can do that job. I can be a professionally famous person. Sure. Why not? And so I set out to do it. And then like, like a lot of things, it was a very strong, be careful what you wish for kind of vibe to it. Like, I don't really go to conferences ever anymore because I feel deeply uncomfortable being in a space where like
Starting point is 00:01:45 everyone knows who I am. Like the feeling of like walking in public and having people recognize you is just like deeply unsettling to me. I'm very much just like a nerd who's happy sitting in his office coding and like shit posting on blue sky. That's, that's my happy place. So oops, I fucked up. Here I am. I'm trying to make the best of it, I guess. Yeah. But I have a really interesting question. Like, how does one actually become a data influencer?
Starting point is 00:02:13 Like, what got you to this point? Like, if somebody wants to grow up to be like Josh, what is that path? God damn it. This is deeply troubling. Tim, you say this to me. I was having a conversation on Twitter the other day about, I got to meet Ralph Kimball once. The DBT folks, they did like a writing program. Someone wrote a great thing on dimensional modeling introduction to Kimball stuff, right? And I know Ralph Kimball. I met him and
Starting point is 00:02:32 worked with him when I was at Cloudera. Meeting him and getting to work on something with him was just like the absolute thrill of my life. Like I was utterly in thrall, you know what I mean? And the thought that someone else in the world would feel that way about me is like deeply uncomfortable for me because I'm like the forest gump of data. I just kind of been like the right place at the right time repeatedly through no fault of my own. Anyway, how do you get to be me? I guess like my hope is that people listen to me or like what I say because I've done the work. I've built a lot of data infrastructure and I've built it through multiple different epochs from like, again, like old school Google stuff through like Hadoop stuff through modern data stack stuff through whatever is coming next.
Starting point is 00:03:15 I've done the work a lot. And I like to say that in doing the work, I discover things that are useful, do the work and be funny. That's that's my advice. By the standards of people who talk about data, I am funny. I'm also like by far the least funny person in my family. Like my brother is an actor. My sister is a literal comedy writer, but by the standards of people who talk about data, I'm very funny. And so people like listening to me talk about this stuff. And if you can't be funny, be controversial. That seems like the approach that certain people take, I guess.
Starting point is 00:03:45 That seems like a sort of a shortcut to it, I guess. I mean, there's something to be said for that. I don't know. People just aren't funny. It kind of seems like anyone can be controversial, but some people are just like not that funny. It's such as life. Yeah.
Starting point is 00:03:57 I actually like really admire the data community and its ability to consistently create memes about itself. Like in the data influencer world, everyone's just making fun of random data memes all the time which is just wonderful it seems relatively unique to the data community i had to call out the untitled 01 pi notebook account meow books or whatever the python notebook it sort of does that for the ml community and he or she i don't actually know the person's identity, are absolutely hysterical. I'm going to dig up this thing, so y'all can send you guys a link to check this out. It's really, really funny, spot-on memes about the actual practice of doing machine learning and modeling and stuff like that.
Starting point is 00:04:35 Anyway, it's not just data. There's other things. Well, we'll definitely put in the show notes. I do enjoy a solid meme. So one of the things, just like you mentioned this, you've been around for eons, epochs, lots of generations of data tech. Yes. You know, when we think about the Hadoop era to like, let's say to the Snowflake era, to
Starting point is 00:04:52 like this future era, what do you think has been the primary driver? Has there a consistency across all these epochs, all these different generations of improvement, or have they all been like different angles looking at the problem in a different way because there's been a change in audience that's one of the things i'm kind of curious since you've been around for so long and you have you know have this broad experience i'm curious is there anything that's been true throughout that's been the primary focus like is data tech just driven by like performance or query size or like you know accessibility like i'm kind of curious to get your understanding as you think about like the different segmentations or different epochs of the data world yeah yeah i guess the two sort of dominant themes for me are hardware and convenience hadoop happened as a
Starting point is 00:05:39 thing so you gotta kind of project ourselves way back in time to before hadoop kind of the appliance era like the teradata epoch the net Epoch and stuff like that, right? These sort of tightly integrated appliances, this gigantic box you would buy and plug into your data center to do your data warehousing, your data stuff, right? Hadoop came along because hardware got really cheap. Disk in particular got really, really cheap. And when disk got really cheap, we came up with this whole map-produced thing, and we started pushing the compute out to the disk and all that kind of good stuff, right? So hardware got really cheap. In particular, disks got really cheap. And then I sort of think of Spark as actually being roughly part of that same epoch.
Starting point is 00:06:20 Spark was to do, but then memory got really cheap. And so keeping a lot of stuff in RAM to do processing became a much more viable option. So same kind of thing. The open source thing was like convenience. You could just download, you could just download Spark. And again, you had to be like fairly technical competent to know what you were doing, but you didn't have to talk to a salesperson. You didn't have to sit through a presentation. You could just do it. Right. So those are the two big things, hardware and convenience.
Starting point is 00:06:46 The cloud just sort of took both of those things and like, you know, jacked them up, right? S3 is the infinite disk. It was like, have you guys ever mounted S3 to a local file system? Yeah, I do. Highly recommend doing it. Just like, I did this, it's been a long time,
Starting point is 00:07:01 but like when you do that, the file system will say the S3 thing has 21 exabytes of storage. Like from the perspective of your local file system, like S three is basically infinite. Right. And then obviously the whole cloud sales distribution model was even more convenient than open source. You could just like sign up for the thing and click and use it. And what snowflake figured out how to do better than anyone else was leverage kind of the new hardware architecture of the cloud, of S3, of figuring out how to take advantage of this infinite object store in a performant and cost-effective way.
Starting point is 00:07:35 And then again, so between that and the convenience factor, it just basically blew Hadoop out of the water and that gave us the modern data stack. And then if I'm like looking forward, I think what's interesting right now for a lot of folks, what folks has excited is the local laptop. It's like back to the future, right? It's like your, your Macintosh M2, this thing like fricking cranks, right? I run, I run my stable diffusion. I run my mid journey. I run duct EB. I can do a lot of stuff like locally on this very, very powerful machine that otherwise is just used for like video conferencing and web browsing and stuff like that, right? And so that's, if it can be even more convenient than the cloud, maybe,
Starting point is 00:08:13 you don't even have to sign up for anything. If you just have to like say pip install something and off you go. So like hardware and convenience, that's the theme over and over again. More hardware architecture, changing how we build things in a way that makes it more convenient for people to consume. That's the whole ballgame. I think that's a beautiful
Starting point is 00:08:31 explanation and great simplification for what we've seen. Also, you've been gone through these different eras also of how data has changed in terms of importance of the organization. As we've had, it's become easier and more accessible. There's more people that are data-informed. There's more data going to the organization, right? As we've had, it's become easier and more accessible. There's more people that are data informed. There's more data going to the organization. There's more people that need to work with data. You know, how have you seen that change? You mentioned the fact that you coined
Starting point is 00:08:53 or defined the concept of data science. I'm really curious to understand, like, how you think about data practitioners, who is interacting with data and how that's going to change over time. And do some of these, like, new example like a duck db which you kind of you spoke about sort of i don't really like duck db that much most people don't know that about me they think them i could rarely very rarely ever talk about duck db t this is sort of an ongoing joke on
Starting point is 00:09:20 twitter how much i love duck db um I do. How do I think DuckTB in the sort of shifting roles from data science? I think how has it evolved? Where was it before? Where is it now? Where are we going in terms of the accessibility? We talked about convenience. We talked about speed, which are definitely these big
Starting point is 00:09:40 enablers to pulling, enabling more companies to use data. I'm kind of curious to think like how do you think that's going to impact like other roles in the organization more on the engineering side of the host like how does how does like the fact that we now have these like relatively complicated data organizations that help that sometimes tend to be these like weird silos off to the side from the rest of the r&d stack i'm kind of the leading question i'm asking is like how do you see data getting
Starting point is 00:10:05 infused back into the rest of the company? But more importantly, how does data and engineering fuse together into one? It's a great point. It's been something I've worried about for a long time. I think one of my great frustrations running data at Slack was that Slack, the web app was like sort of PHP monolith web server and like, you know,cript and and running off of my sql and all this kind of stuff like it's whole stack and then our data stack was kafka jvm based and you know rk is our kind of baseline file format all the hadoop spark presto stuff like all really jvm based stuff and then like various dashboarding tools and stuff like that such that like the stack is so different the engineering profile you're looking for is so different. The tooling is so different.
Starting point is 00:10:47 You very much do end up in these silos of like, you don't understand each other. You don't speak the same language in terms of tooling. You don't really kind of get the engineering bonding experience of building stuff together that I think is like fundamental to teams and stuff like that. And that's just a huge problem. And the modern data stack has not made that better. We switch out some of the Hadoop stuff for Snowflake and stuff or DBT or whatever, but it's like,
Starting point is 00:11:11 it's still fundamentally a radically different stack. And that bums me out. Like that sucks. My thinking on this, right, is that if you look at the architecture of a data warehouse, the central kind of component of it has always been like the storage query interface, whether it was the Hadoop era, the central kind of component of it has always been like the storage query interface, whether it was the Hadoop era, the appliance era, or like the snowflake era,
Starting point is 00:11:31 it's still, it's like the most critical, most valuable components of the stacks. It's very, very tightly coupled together. And in doing so, we kind of have oriented the entire world around this thing. It is the sun through which everything else revolves around. What's exciting to me about DuckDB is that it's not that. It's modular. It's a little embeddable component, again, akin to SQLite that you can kind of inject anywhere you need it to go. So one of my things that I've been kind of super excited about in the data contracts space, data contracts are, I'm hoping, is kind of like ideally a transition point in the data contracts space. Data contracts are, I'm hoping is kind of like ideally a transition point for the data community because data contracts are just integration tests. That's really all they are. It's an integration, like a production integration test between an
Starting point is 00:12:16 upstream system and a downstream system. And whether that's like between front ends and back ends or one API service, another API service, we know how to build integration tests. That's what continuous integration is all about, right? If DuckDB allows us to run our data pipelines as part of integration testing, as part of the release process for any app in our system very quickly in a modular way on top of Jenkins or CircleCI or whatever one of those nightmarish CIDC systems you use, that's a huge unlock. That's a huge opportunity for us to bring different engineering teams together.
Starting point is 00:12:52 A lot of the stuff I've done around DBT has been like, how do I get more engineers across the entire company to use DBT? DBT asks very little of you. It's SQL and Jinja. Any idiot can write SQL and Jinja. It is about the lowest common denominator. And that's like a good thing.
Starting point is 00:13:09 Like in a way that like writing Scala and Spark or whatever is not the easiest thing in the world to do. SQL and Jinja is the easiest thing in the world to do. And that means it's available to everybody, any engineer, right? That is like very much my dream is like we get these tooling to be open and dumb enough for everyone to use
Starting point is 00:13:27 and modular enough for us to stick it anywhere we need it to go. That's what I'm hoping for with like DuckDB and dbt going forward. Whether or not we get there, I don't know, but that's what I'm hoping for. Yeah.
Starting point is 00:13:38 Yeah, that's really interesting because dbt, you know, has been sort of notoriously being looked at as, oh, it's a simple product, it's a simple tool. But like you said, its insertion point is pretty strategic for most people. And there is not much tools out there that was targeting that sort of persona.
Starting point is 00:13:57 I guess if you just do DBT, and maybe we can talk about that later, DBT seems to have to invent their own category in some way or invent their own engineer category, analytical engineers, and start to no longer be a tool. They're growing up into something you need a definitive role for. And then you have to have a definitive ecosystem around with a bunch of startups, a bunch of things they can do. And we're seeing the latest evolution of it from Tristan recently. And I guess, how do you view DBT now? Are the tools ahead of the ecosystem
Starting point is 00:14:28 where we're trying to get everybody to learn two new wave of things? I just wonder, when do we stop seeing DBT as a simple tool for most people? And has it been a simple transition for most people out there? Because sometimes it's really hard to tell. There's a weird contradiction between DBT as a tool and dbt as a company and a platform and the ecosystem
Starting point is 00:14:51 which is curious curiously your take on are we still mostly in this early innings of folks trying to just even learn dbt uh or now dbt or this whole ecosystem has grown to a point where like hey yeah we need analytical engineering we need to hire for this role we embrace a complete new way of thinking and dbt is a catalyst for it or i don't know if that question makes sense to them it makes very little sense tim i'm not gonna lie i had a super hard time following what the hell you were talking about just there i'll do my best here um i guess we got to talk for a second about the whole modern data stack thing like which is like one of those Mike marketing terms, right? The original modern data stack, which now has like a bazillion tools and it's become like a meme, right? The, the, the I chart that is the
Starting point is 00:15:35 modern data stack tooling chart, right? Originally was just, you know, five Tran for ingest redshift for your data warehouse and looker for your bi layer that was it was the whole thing you know it was great the virtue of it especially compared to hadoop and all the kind of big data stuff that came before it was that any random analyst could like sign up for all of these things right and pretty much set the whole thing up themselves they didn't you didn't you know it would be like one generalist engineer in one day could wire up this entire thing and then like let the analysts kind of go nuts and start working and stuff like that right so dbt emerged out of out of fishtown analytics is consulting because they essentially discovered over time that lookers persistent derived table
Starting point is 00:16:20 abstraction was like not good like it was basically a mistake um to and people essentially were like embedding way too much of their transformational logic into their looker models when really you just wanted this kind of clear abstraction between like the heavy duty sequel centric transformation lifting and the like bi layer like metrics dimensions abstractions and stuff the worker provided right, so dbt like emerges to sort of fill this little fairly inconsequential niche in the modern data stack, right? So now it's Fivetran, Redshift, and then obviously Snowflake replacing Redshift, dbt and Looker, and that's the modern data stack. And again, fundamentally something that like, rando analysts can like,
Starting point is 00:17:01 pretty much do themselves. And like, that's, that's again a huge convenience unlock that's a really big deal i'm an investor in dbt i invested in them very early on in 2019 and i invested in them before they had a product or any idea what they were doing i invested essentially in like a python library guys i invested in a python library which you know generally speaking not like the smartest kind of angel investment to make. Like, is Python libraries being defensible? Probably not, right? But I invested in them because people loved it. People absolutely loved it.
Starting point is 00:17:34 Whenever I talked to anybody in the data community or analysts anywhere who used dbt, they would not shut up about how great it was. And I was like, okay, well, they have no product. I think they were saying at the time they were going to build like a cloud IDE. And I'm like, oh, my God, have no product. I think they were saying there at the time they were going to build like a cloud IDE. And I'm like, oh my God, that's the worst idea I've ever heard. But okay, they'll probably figure it out. It'll be fine. Right. So, so I invested just because people loved it.
Starting point is 00:17:54 That was the whole thing. And what I've come to slowly dawn on me over time is that DBT, the tool is very distinct from DBT, the business. DBT, the business is really like a metadata business it is a business user facing like so for instance a weed grid was the company i worked at for a couple years doing climate tech stuff right what was our stack we used meltano for ingest we use snowflake as our data warehouse we use mode as our bi layer and we use dbt as a transformation layer we use dbt cloud we paid for d as a transformation layer. We use dbt cloud.
Starting point is 00:18:25 We paid for dbt cloud. Why did we pay for dbt cloud? I'm a pretty good engineer. I can run dbt myself. Why am I paying for this thing? I'm paying for it because we had some dashboards in mode that were end customer facing. We used them to power things we showed to our customers. And one day the data pipelines failed and we ended up showing like out of date data to end customers. And it was very embarrassing, right? And dbt cloud
Starting point is 00:18:53 has this feature where it can integrate with mode and tell you on your mode dashboard, Hey, when was the data that's powered this dashboard last updated? That's it. That's the feature. Was I willing to pay like, you know, 50 bucks a month or whatever for that? Hell yeah. Without a doubt, I would have paid $500 a month for it. Don't tell DBT that. Let's keep the prices as low as we can, right? Providing that context from the data team and what the data team was doing to the end business users who are consuming the output product of that, that's where the money is. That's where the value is. In the same way that Atlassian and Jira and all these sort of tools are a huge business to effectively allow the business to communicate with the engineering team in a systematic, structured way, dbt as a product, dbt cloud exists to bridge the gap between the business
Starting point is 00:19:39 users who are consuming data and the upstream data analytical teams that are constructing the data and building data. That that's the whole product, the whole new model contracts thing they're doing. I know like a lot of people are like making fun of that and stuff. I think it's genius. I think it's brilliant. I think it is a obvious dbt cloud feature for end business user who doesn't know, get doesn't know. YAML doesn't care to be able to click a button and say, Hey,
Starting point is 00:20:02 I need this table to not change. And if it does, I need someone to tell me about it, right? And then have that sort of propagate upstream automatically to the DBT pipeline so that business users don't just have read-only access to that metadata, but write access to it as well. Like that's where the money is. That's where the value is, bridging those gaps. That's like the whole business right there. Everything else is just kind of noise as far as I'm concerned. Yeah. That's a great synopsis. I mean, I think like at the end of the day, the way I read what you just said and explained is like DBT, the product is really what democratizing data inside the organization. And I think that really does explain. I don't know if they're making it more accessible, making the work of the data team
Starting point is 00:20:43 liminal. There's a great book called Seeing Like a State. You guys have read Seeing Like a State? Wonderful book, Seeing Like a State. Like, how does a state think about its territory as an abstraction, right? As an organization, right? And it's a wonderful book about all the weird things governments do to make sort of the invisible things visible to the government, the way the government understands. That to me is like, again,
Starting point is 00:21:06 back to the sort of the brilliant simplicity of dbt, it's just SQL and YAML. That's the genius of it. It's accessible to anybody to sort of see it and read it and understand what's going on in it. And now just have chat dbt explain it to you. Like, if you don't understand what it's doing, just copy and paste in there
Starting point is 00:21:21 and it'll tell you what it's doing, right? To me, that's the business because that was a massive unmet need. That's again, like we're talking about how the, the data stack is often its own world separate from the engineering stack, the data products, the work of the data team is often often its own stack isolated from the rest of the business, not just the engineering team, but the finance team,
Starting point is 00:21:41 the customer success team, the marketing team, it's all, that's, that's why we, that's why everyone builds their own goddamn data stack. Just so you can like have a fucking clue what's going on, right. Solving that problem. That's where the value is. Yeah. Amazing. So I'd love to transition back earlier. You mentioned this DBT plugin for DuckDB. Oh yeah. I want to like, we talked, I swore that would be Alex engineer and the data engineer and how, how like business facing metrics and analytics and data is changing. But, you know, and chat GPT is just another example though of how data and the
Starting point is 00:22:13 instrumentation of data and the orchestration of data for online products, you know, businesses built around, like, I mean, Uber is the greatest example of what is effectively a data business at the end of the day at the core of it. How do I have a market of cars and riders and match them up and have that be accurate? Every market is a data business. Google's a data business. Facebook's a data business. Airbnb's a data business. Everyone who's matching buyers and sellers, you're a data business. In large degree, it's also true of a lot of SaaS companies, right? Like Salesforce is very much a
Starting point is 00:22:43 data business in terms of its recording component. But even at scale of like how you build a efficient Salesforce, like an actual people selling stuff and Salesforce is a tool you buy, like that is to scale to a thousand reps. That is the data business and Salesforce does do a ton of data optimization. So all of these tools, like at the end of the day, for any at-skill enterprise use case, it's entirely a data business. And so I was super interested to talk a little bit about your dbt plugin and also how you see about how does that shift some of this toolbox back to the engineering? Because one of the things I've experienced in my work at Snyk, but in other companies that I've started in data, is there's often this divide where you have the engineers, software engineers,
Starting point is 00:23:27 and the data engineering mindset. They're very different. They're like almost different modalities. And I'm curious to understand how you see an opportunity to bring these two together. Is that through the work you've been thinking about in terms of like DuckDB and dbt together? And also just like broader speaking, how do these worlds collide?
Starting point is 00:23:42 Because they definitely need to, right? At some point, they all live in the same ecosystem. But that's, if you actually ask a software engineer, they wouldn't think about any of the data problems when they're actively doing data engineering. Yeah, exactly. A bunch of great questions there, man. I think the first thing I want to address is like the mindset shift of a data engineer or data or an analytics engineer, for that matter to a like again general purpose software engineer general purpose software engineer is generally like request response engineering like request comes in do stuff very very quickly return response and that's the job and do that you know several bazillion times every sort of request is
Starting point is 00:24:19 independent of every other request all that kind of stuff like we architect everything that way data engineering is not that. Data engineering is very like what I call like throughput optimized engineering, which is to say like the work's not done until every record shows up in the table. Like missing end records is like effectively the same as not being finished. And that's what we're kind of optimizing for throughput. And so the mindset shift that goes with it also goes hand in hand with like all of the tooling differences and stuff like that. Right. It's a problem. It's a concern. I haven't figured
Starting point is 00:24:49 out how to solve that yet. Like how do I sort of bring that data engineering mindset mentality to software engineers who aren't used to thinking that way? I'm not super sure. That's a great question. My original reason for doing dbt.duckdb like way back in the day like a long long time ago like october 2020 back when like no one had really ever heard of duck db or people that started heard of dbt obviously at that point but no one had ever really heard of duck db it was you know weird little embedded database coming out of the netherlands right coming out of cwi i was unemployed at the time i left slack and helped out with covid stuff for a while but really kind of had nothing to do which is looking for a fun project to work on. And I guess like I wanted to make it easier to get started with dbt. That was my initial goal. Right. And I think dbt.db has largely done that.
Starting point is 00:25:35 And I'm very happy about it. Like I see a lot of demos and getting started guides for dbt that use duck db as they're getting started because it's so just brain dead simple like pip install it off you go there's no server to run nothing like that but again because duck db is postgres flavored you can do all of the dbt stuff inside of it so that's that's fantastic and that's great that makes me really happy so that was like my initial goal for the project it's succeeded in my goal to the level that I expected it to, right? Which has been good. The other side of dbt.tv, the stuff that I've been working on lately, it's like a lot more interesting and fun for me, is sort of pushing the limits of what dbt can do.
Starting point is 00:26:15 And like kind of going beyond what I think dbt labs would actually be happy or comfortable with people using dbt for. And that's much more interesting to me like the first of this was really was when dbt introduced python models as a first class citizen but like it's not really a first class citizen it's it's a it's at best a third class citizen the way dbt works right now for dbt snowflake and dbt bigquery and stuff it's just kind of a pain to use it's not very natural whereas with dbt.db our Python model strategy is run your Python code. That's the strategy. Run it right there. Do exactly what it says. Run it in process. We're running the database in process. So like, why the hell not? Let's run the Python models
Starting point is 00:26:54 in process too. Let's go with it, right? More recently, though, this one thing I'm excited about is there's this wonderful April Fool's joke called dbt.excel. Did you guys see this? Did this go by? Okay, Tim didn't see it because Tim was probably doing a deal or whatever. Ian didn't see it. I didn't see it either. No, I didn't. I'm completely out of the loop. You guys missed out. What are you doing? You must be on Blue Sky or something. DBT Excel.com, that thing? DBT Excel.com, DBT hyphen Excel.
Starting point is 00:27:20 Yeah, I know. I've seen the Twitter. Actually, I clicked on it quickly. I can't remember what it is. Yeah, so this is the fine folks at BDataDriven, which is another data consulting firm in Europe, did this as a joke. They hacked in the sense of like they forked and modified dbt.db to make it work against Excel files as the underlying substrate. And so dbt.db is great. It can run against CSV files, Parquet files, all kinds of different stuff. They made it run against Excel files. And I was kind of like, this is great. This is fantastic. How can we make this a first class concept in dbt.db? And so what I'm supporting now in like the next rev of dbt.db, which it would be coming out this week, but instead of working on it, I'm doing podcasts. So hopefully it'll come out next week.
Starting point is 00:28:13 I've added a sort of a plugin capability to dbt.db so that it can consume any kind of data that like duckdb can kind of like work with, whether it's like Excel or an iceberg table or a SQL alchemy connection, anything you can do in Python that kind of constructs something that looks like a data frame, roughly, you can now integrate into dbt.db going forward. And in doing so, I'm trying to make dbt.db into that kind of missing piece of the data contract stack that I have been looking for. I'm trying to create something that like where you can write a dbt.db job that can integrate like your sort of staging database from your your integration testing environment with your data pipeline code and like exercise your data pipelines as part of every single production push you do super super fast
Starting point is 00:28:55 just against the staging data that's kind of what i'm going for with this like pushing dbt to be like a complete sort of data stack where you don't need actually anything else beyond it. Like you don't even need Fivetran. You don't need any upstream stuff. Now that's not to say you shouldn't use that stuff. You should absolutely use those tools. Absolutely. When you're moving large amounts of data in production, but for testing or just for like your data side quest or whatever, let's just get you going. Like, let's not make you learn eight different tools. It's just like, let's just go. That's what I want to go after. That's what I want to solve. You referenced a modern data
Starting point is 00:29:28 stack in a box. I think Post in your GitHub, right? I do. This idea you have the whole complete platform basically running in a single node or your laptop and you can go from there. That's the power of DuckDB as well. What is the future of this? Because DuckDB or MotherDuck
Starting point is 00:29:43 is trying to bring the local experience as fast and great as possible, but also extend as a scalable thing in the future. Is there a better story here, like modern data stack in a box is where you actually want to start and can actually grow naturally towards? I worked on modern data stack in a box with Jacob Mattson, who did all of the heavy lifting on that project. It was his idea and his vision to kind of make that happen. Obviously, we worked closely together. He used a lot of stuff I wrote in the process and stuff. And he remains a very good friend. I think the world of him.
Starting point is 00:30:13 I think there's a few things here, Tim. I don't recommend anyone replace their data warehouse with DuckDB. Definitely not tomorrow. Probably not this year, I don't think. There's still a lot of stuff that DuckDuty can't do yet, doesn't do yet that we need to build. You know, Jordan Tagani at MotherDuck wrote this big data is dead blog post, right?
Starting point is 00:30:34 And that post touched me like in a few different ways. My takeaway from it at least is that we as an industry have so over-indexed on scale and processing big data at the expense of literally every other aspect of the developer experience, of the data experience, that this is the only thing that matters. And he's absolutely right. And I was kind of reflecting on why is that the case? And I was kind of thinking back to my pre-Google days, right? Where I would come across a problem where my data stack couldn't handle it. Like if I was building like, you know, a big recommender model or something like that, or trying to fit like a ginormous logistic regression model, I would hit a wall.
Starting point is 00:31:17 And again, this is back in the aughts. This is again, an eternity ago, right? I would hit a wall and like, I couldn't do it. And I think the initial thing that resonated for people about Hadoop and big data and stuff like that was that you could, you could do it for the first time. These enormously intractable and problems that everybody had were something that could actually be solved. You could like do it yourself. You didn't have to write a bunch of custom distributed computing code. You could just do this quote, simple thing called MapReduce or like write a Hive query or or whatever and you could do it and that was really exciting for people and for some reason i kind of like
Starting point is 00:31:49 look back on it now we must have all just been crazy we just had this collective delusion that like because that could solve that problem we should do everything that way absolutely everything right and again we've made things simpler over time. BigQuery, Snowflake, we've made it much simpler to solve these big problems. But we still seem to operate in this mentality that every single thing we do should be done on the biggest, baddest, most powerful distributed compute infrastructure ever. Because what if we have to do the huge thing? We got to make sure we have all this stuff at our disposal. Jordan's post is basically like, that's, you know, insane. Like we've lost our minds. That's not right. That's not true. Now at the same time though, there will still be those problems, right? They
Starting point is 00:32:34 will still be a thing. They will still crop up from time to time. Even if we're kind of going from the mindset of like our entire personality does not have to be built around like the petabyte scale data problem we're solving. We still have to solve those problems. And to me, I think the promise of MotherDuck is to say, you can start in DuckDB and you can run things as far as you want in DuckDB and do all your stuff in DuckDB. And if you hit a problem where you need that scale and it turns out, okay, shit, I actually do need a hundred computers to do this and I need all this stuff, right? Then MotherDuck is going to be there for you. MotherDuck is right there in DuckDB, speaks the exact same syntax as your local DuckDB, and it will take your problem, and it will crank that thing for you, and it will solve that problem for you. That, to me, that's the promise of
Starting point is 00:33:20 MotherDuck. And that's what's exciting about it to me. And that's why I invested in it. This is like, that's just awesome. That's kind of where I feel like it's going to bring us to a much better place. And I think not for nothing, the whole hybrid execution thing of like stuff happening locally and remotely on the cloud. I think that's the future. I think everybody's going that way. I think maybe snowflake will be the last holdout. I'm not sure, but I just, that's, that's the. That's the way forward. You're going to see everybody go down this route, without a doubt. That's my two cents.
Starting point is 00:33:51 I'm feeling energized. I think it's interesting because one of the things you bring up is why scale? And I think also the truth of the matter is that Web 1.0 and Web 2.0, the only people that really had data problems were people at scale. Yes. And the only buyers in the room to fund vendors were people who had scale data problems. And so we kind of think of, we've got to prove the value of data.
Starting point is 00:34:10 Well, who is the data and who actually would create value from like processing data in some regard, right? Yeah, so it makes a lot of sense that we have like hyper over-indexed on data. And now the rest of the long tail, I guess, if you will, has kind of started to emerge, right?
Starting point is 00:34:24 Where while every company can be a data company, because like we've had this huge you know the cloud has occurred and compute is cheaper and you know there's just a lot more people so now we can have an industry that does both high end but also can focus on like you know the other 90 percent of people and i think that's super exciting yeah i'm kind of curious to hear you know i really like a lot of the stuff you've worked on i think the idea of curious to hear you know i really like a lot of stuff you've worked on i think the idea of audited data stack in the box enabling like actual you know cicd workflows and actual workflows for data that isn't just like do it live you know we all i think we all learned over 20 years that do it live is like yeah it's great but maybe not not how you sleep
Starting point is 00:35:02 well right like you definitely sleep well that's? No, definitely not how you sleep well. That's true. Yeah, it's just like eight hours. I'm not slept in like 20-ish years. So yeah. I use all that leftover like data PTSD, right? Even now at this point, you're like, yeah. Totally.
Starting point is 00:35:15 I'm curious to think, like, do you, how do you think cost plays into this discussion between some of the sort of the centralized big cloud like Snowflake versus the DuckDB? Do you think there's a world where you'll see people with centralized data warehouse and pull stuff out and then use potentially DuckDB and other technologies just to like optimize or reduce costs?
Starting point is 00:35:34 Because I'm kind of curious, there's a thing that comes in my mind, there's been a lot of discussions about cost in the ecosystem in our, you know, sort of last couple of five minutes here. I'm really interested to hear, like, is cost important? Does cost play into some of the discussion around DuckDB? And how do these things interrelate if they do at all?
Starting point is 00:35:50 I mean, I think definitely in a non-zero interest rate environment, right? I think everyone's much more mindful of costs and stuff like that, right? You can think of DuckDB, I think, as like in some ways enabling an arbitrage play, which is, again, the computer you're having this conversation on is sort of like absurdly
Starting point is 00:36:07 embarrassingly powerful. You know, I think like what AWS has been doing around the Lambda instances and Lambda instances have gotten embarrassingly kind of insanely powerful and stuff like that over time. I think we're using DuckDB, like honestly, y'all is like a catch all for this kind of new world that again, I think we're heading to, cause I don't think like DuckDB is not going to be alone here like click house already has like click house local which looks fantastic and click house is a fantastic piece of technology right i think google is going to do this a lot of people are going to do this right we are going to take advantage of
Starting point is 00:36:36 the fact that all of this amazing compute that's not running in a data center in gcp or aws and i guess sometimes the ones that are, are fantastic. And we've actually spent a bunch of money on them. And maybe we ought to like take advantage of the fact that they're really good at this stuff. You know, at the same time, you know, it's an interesting question. I think what would be most amazing for a lot of folks,
Starting point is 00:36:56 I think is sort of clarity around the cost stuff. I think when people like, like Snowflake is fantastic. It's an incredible piece of technology. The most annoying thing about it is figuring out like, where exactly is the money going? Someone said that there's like entire software companies like Capital One, which is a bank, literally started an enterprise software company to help you figure out your Snowflake costs, right?
Starting point is 00:37:21 That seems like an opportunity for me. You know, Ian, it's not just the straight costs. It's what about clarity of the costs? What about unambiguously, how much does it cost for us to do this stuff? And like, can we understand and project the growth coming forward? Like, can we get a handle on it? Is I think, you know, a huge opportunity,
Starting point is 00:37:37 I think as much as the savings and stuff, just the clarity and visibility is a huge thing. Yeah. Yeah. Amazing. Like I think about that a lot. Like, I guess we just don't have the framework to even think about like making decisions about what we run in the run in certain places,
Starting point is 00:37:51 where is a place to invest is, Hey, this query or this workflow hyper expensive. Maybe we should invest our data engineer time over here to like, maybe we should pull it over to the centralized warehouse and do something off to the side. It's a parquet files and S3, you know,
Starting point is 00:38:04 some artisanal, some pistol3, some artisanal optimization, but we don't have that picture. It gives away because we don't have that clarity. We don't have the clarity and we don't have the flexibility. We don't have a database-ish thing that can run anywhere. We can just pick it up and move it. We can stick it on our Kubernetes clutter. We can run it on Lambda.
Starting point is 00:38:20 We can fire up a giant machine with a terabyte of data. We can do whatever we want. I mean, again, that's to me, that's the hardware thing and the convenience thing that I think drives the next trend. It's just yet another shift in hardware and convenience. It's a lot of optionality, which is like a great place for us to be. Josh, this has been super awesome. I've really enjoyed our conversation with you. Tim probably has as well.
Starting point is 00:38:45 It's always a laugh and always a good time. Thank you. I hope you have a great rest of your day. Do you got any things to plug? Like you got in this last little bit, I'll give you the, you know, if you want to plug some stuff, you know. I don't, I think, you know,
Starting point is 00:38:57 the hard thing for me right now is, you know, dbt.db is the thing I work on these days. It's my baby. I love it. I would actually like people to stop using it so much because they're filing like bugs and like asking for help. Full source it, you know?
Starting point is 00:39:11 Maybe I should close source it at this point. I just like, it's kind of my baby and it's my fun little toy and like people are using it for real things now. And I'm like, oh God, I don't know. But what happens? This is what I was hoping for at all. So my request for everyone would be,
Starting point is 00:39:25 please stop using my stuff so that I can keep having fun building it. That's my request. Amazing. Tim, you got last word. It's all you. No, no. Well, hey, Josh, you're the legend.
Starting point is 00:39:36 We'll do that plug for you and the extra tweets just to honor that. It's all I ask. Yeah, super great to have you on it. Thank you. Thank you all so much. All right, later.
