The Data Stack Show - 194: Building Retail Churn Prediction on DuckDB with Clint Dunn of Wilde

Episode Date: June 19, 2024

Highlights from this week's conversation include:

Clint's Background and Journey in Data (0:51)
Starting a Data Career (2:01)
Transition to Startup SaaS World (4:27)
Clint's Connection to a Federal Reserve Database (5:31)
Challenges in Predictive Modeling (10:27)
Data Input Challenges (15:50)
Marketers' Workflow and Data Integration (18:29)
Soft ROI vs. Hard ROI in Data Analysis (21:31)
Balancing Internal Marketing and the Data Team's Value (22:35)
Simplifying Data Inputs for Predictive Models (25:09)
Data Analysis Workflow and Tech Stack (29:06)
Open Data Formats and Impact on Data Platforms (34:40)
The S3 and Ecosystem Model (37:08)
In-browser SQL Queries with DuckDB (39:24)
Data Security Concerns and Solutions (41:47)
Clean Rooms and Data Sharing (43:32)
Final Thoughts and Takeaways (47:35)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Hi, I'm Eric Dodds. And I'm John Wessel. Welcome to the Data Stack Show. The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work. Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. The Data Stack Show is brought to you by RudderStack, the warehouse
Starting point is 00:00:25 native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com. Welcome back to the show. We're here with Clint Dunn from Wilde. Clint, welcome to the Data Stack Show. We are super excited to chat with you. Thanks for having me, guys. I'm super excited. All right. Well, give us just a brief background. Tell us about yourself. Yeah, I'm the co-founder of Wilde. We do LTV and churn predictions for retail brands. In a prior life, I worked at Afterpay in the marketing data science department. Before that, I was building some data teams at small e-com companies. So Clint, one of the topics I'm really excited to talk about is DuckDB. We both, I think you more
Starting point is 00:01:20 than me, have an affinity for it. So I'm excited to talk about that. And then we'll definitely have to talk some about your experience as a head of data and working in data, producing business outcomes as well. Yeah, I love it. There's not a lot of podcasts I get to go on and talk about technical stuff. So I'm enjoying that. Yeah. Awesome. All right. Yeah.
Starting point is 00:01:41 You ready to dig in? Let's do it. Okay, Clint, so many interesting things about your background. You gave us a brief introduction, but how did you start your data career? Did you study data, or sorry, did you study engineering or anything technical in school? No, I was an economics major, and we did a little bit of SAS, which is a real old-school programming language. Common, and, oh yeah, economics. Yeah. And I had finance internships all through school, but I think the turning point for me was going into my senior year. I was working at a UFC Gym. We basically franchised out gyms across the
Starting point is 00:02:26 country oh yeah sure my whole job was like i was selling basically the whole summer the delaware valley the rights to the gyms in the delaware valley and i got roped into some meeting with like the president of the company at some point and he was like right, we got a major problem here. We are giving free memberships away as basically free trials indefinitely to a number of customers. And somebody raised their hand right away. I was like, all right, how many people is this affecting? He was like, we have no idea, but I'm guessing 4%. And so I kind of started talking to the technical team. I was like, how is it possible?
Starting point is 00:03:04 Don't we have a SQL database? How do we not know? And nobody really could work it out. And I just got this hunger from that point to start answering business questions a little bit better than putting your finger in the air. Yeah. And so you were actually selling. You were out selling rights to gym memberships. Not literally. people were kind
Starting point is 00:03:26 of coming to us i was doing some of the analysis to say like how many stores could we put into the delaware valleys i was going like all through pennsylvania it's like we're gonna stick one in mechanicsburg there's gonna be one in harrisburg we can get two in this other city yeah yeah kind of boots on the ground analysis yeah okay and then so where did things go from there? So I ended up working in a fractional CFO and accounting company after school. And I was like, the data guy. And I think that's kind of a common situation a lot of folks start their career in where they're kind of tasked with like broad data responsibilities and maybe not the skills to do them. So like I said, I knew like a little bit of SaaS.
Starting point is 00:04:06 I stayed after work every day and taught myself Python. And I was like really good at Excel and just tried to figure things out. But it was a little bit of everything at that job. Yeah. And then when did you get into sort SaaS world, software as a service? Yeah. Honestly, I've come to it a lot more recently. I've been mostly in the retail side of things.
Starting point is 00:04:34 Yeah. So with that job, we were a fractional CFO in an accounting company. We were working with startups. So I got to look at data for a couple of SaaS companies, but I did a lot of like food and bev and, you know, kind of traditional e-com analysis for folks. And then I went and was at a fractional marketing company doing basically the same thing. And then was in-house at Hairstory eventually. So, yeah, I'm like kind of a retail guy through and through it really. Yeah.
Starting point is 00:05:02 Okay. Interesting. Now you skipped over one really important piece of your history that we talked about briefly before the show. And that is the fact that there's a database at the Federal Reserve in Kansas City that is named Clint, which is a surprising resemblance to your name. Can you give us the quick story on how you got a database at the Federal Reserve named after you? It was a horrible mistake, and no one knows this about me. Yeah, J-10, it was the Federal Reserve Bank of Kansas City. I had an internship there one summer.
Starting point is 00:05:39 I was like an undergrad. Everyone else in the research department has PhDs. They'd never really had an intern before. And so I didn't really have any right to be there, and I was well an undergrad. Everyone else in the research department has PhDs. They'd never really had an intern before. So I didn't really have any right to be there. And I was well aware of it. But they had this project where I was supposed to catalog every economic event since World War II. It's pretty obvious and actually kind of cool and maybe relevant to data folks in companies now. Nobody really knows when things happen. If you just ask somebody when Hurricane Katrina
Starting point is 00:06:08 happened, it's very hard to pull that out, but it will affect a lot of analyses that you're doing, especially in the South and the odds. The head data scientist had this idea to catalog all these things. My job was to go through these old binders that secretaries
Starting point is 00:06:24 had typed up manually in the 50s and 60s and left to her. So I was digitizing all of them. It's really cool. And I gained like a huge appreciation for what like was like, right? Like the gas gas shortages and what like the 70s. Yeah, what it was like day by day. But, but yeah, I as a joke named it after myself and like had a backronym, a terrible backronym, the chronologically linked timeline
Starting point is 00:06:51 with L I N capitalized, I thought it was like, I thought it was like a really dumb joke and I realized like all federal databases are named after people. It's like Fred and Edgar and Noah, I think a few more that i can't think of so yeah i knew it had like caught on when the head of research came down to my desk one day it was like i heard you're the guy working on that clint database aren't you no can't escape it no that is how to leave a mark as an intern for sure. A lasting mark. Right? Yeah.
Starting point is 00:07:27 But you've had a similar experience. I have. We were, there is actually, we were joking back and forth, John and I on LinkedIn this week, cause he tagged me in a post where someone said, you know,
Starting point is 00:07:39 you build a quick prototype and you end up, you know, with a database called something like a John test or whatever. And so the joke is around RutterSack, our version of that is EricDB, which when I first joined RutterSack I was getting access to the product
Starting point is 00:07:56 and I asked for a schema in Snowflake so I could do testing and build prototypes. And there's still a lot of production workflows that run out of Eric DB today, four years later. We'll eventually fix that.
Starting point is 00:08:11 But yes, you know, it's great when you're onboarding like a new employee and they're like, what is Eric DB? Yeah. Yeah, you got to be really careful what you name after yourself.
Starting point is 00:08:26 Yeah, exactly. That's very true. Well, one thing I'd like to hear about. So I want to talk about Wild and what you're doing there. But can you talk about your experiences as a data professional before founding Wild? How did those sort of shape what you wanted to do at Wild? Like, what were the problems that, were there problems you kept running into?
Starting point is 00:08:51 And maybe just start with like a brief overview of what Wild does so that the listeners have some context. Yeah, sure. So for Wild, we're basically sucking up information about your customer, either from Shopify or from a data warehouse. And then we're basically sucking up information about your customer, either from Shopify or from a data warehouse. And then we're using that information to predict a few things about them. Kind of primary points are future lifetime value.
Starting point is 00:09:15 So how much are they going to spend in the future? And then the probability that they're going to come back and make another purchase. The reason we're starting with those two, and we'll probably build other models in the future, those two though are foundational to the way e-com and retail brands operate their businesses, right? It is the economic basis for the business.
Starting point is 00:09:37 And I would argue also the decision point on which you should handle every customer. I call it like horizontally important and vertically important and so what i saw when i was internal to these brands was setting up these models to predict things relatively similar to the brand you basically can use the same information same same inputs, outputs are the same. Everyone needs them. And it's deceptively hard, right? Like from a coding perspective, you can get this up and running in a day or two, but like
Starting point is 00:10:11 productionize it to run all the testing that you need to communicate internally with stakeholders and kind of productize what your, your predictions is actually really hard. And so, uh, yeah, when when i started wild it was basically just to solve those problems yeah makes total sense in terms of the stakeholders like i'd be interested to know can you dig into that a little bit more so you're on the data side and you have these stakeholders who and let's just take lifetime value prediction, for example, right? So some customer has made a purchase or maybe not. Maybe they've, you know, there's some characteristics that you're using as an input. But let's say they've made some sort of purchase or a couple purchases.
Starting point is 00:10:57 And then you're running some sort of model that predicts, you know, what is their eventual lifetime value over some time period, however many years or whatever. So is the business asking for that? I just love to hear, what's the genesis story? As you as the data person, how does that come up within the organization? Who on the business side is asking for that? Yeah, it's a really good question. I call this the LTV maturity curve. I see a lot of companies start off where like finance and operations owns LTV. And so they'll usually kind of do like a historical analysis. So they'll take cohorts of customers. Yep.
Starting point is 00:11:37 They'll draw those like classic cohorted lines. Yep. Come up with a term rate and AOV and then kind of like back into an LTV number. And that works pretty well until the business starts changing. And those lines start going up and down. And it's very hard to interpret like what is good LTV? What's the reason for things going up? And so, and it's not very actionable.
Starting point is 00:11:58 And so usually the marketing team then will go to the finance team and say, look, we need like, it's great that we have an understanding economically of how well our customers are performing and how profitable they are, but we need to take action on those profit profitability of signals. And so a lot of companies will start building like RFM models. You guys have played around with those basically recency, frequency, monetary value. So how recently have they purchased?
Starting point is 00:12:25 How frequently have they purchased? What's the kind of AOV or like total revenue to thing? I, you want to do it. Those are great, but usually you segment those into three by three kind of grids. So for each letter, you have three segments. You end up with like nine segments, which is way too many to actually market to and so i i kind of consider the end of that maturity curve being the ltv number just one number super simple and it's tied into what the finance team was trying to do originally which is understand
Starting point is 00:12:59 the profitability of these individual customers so i think finance usually is driving these conversations and then they're kind of proselytizing the importance of, you know, economic viability, especially right now in the e-com world. Yeah. And then everyone else kind of needs to get on board. And what is, so marketing gets these values and what are they doing? Like some sort of segmentation and then they like dump these people into different campaigns. Could you just give a couple examples of like what is the specific number they're trying to move or like an example segment
Starting point is 00:13:30 yeah so i think the basis of like the ltv and churn predictions is that they are again horizontally and vertically important and what i mean by that is vertically important, it's a C-suite level metric, but it's also actionable for the tactical folks who are actually executing on campaigns. Marketing is a great example, but I also think CX should be using it, operations should be using it. You kind of go down the list.
Starting point is 00:13:58 Everyone can leverage these and use it as a North Star. In terms of use cases, a super simple one is, we've had a lot of success talking about this one, it's Klaviyo has actually some of these predictions. Yeah. But they have this black box model and nobody really knows what the accuracy is.
Starting point is 00:14:16 Nobody can really pull out what the predictions are. Yep. And so we've had customers compare us to Klaviyo and figure out that Klaviyo was over-predicting churn by four times. Wow. And so I think it goes to show the importance of data teams in this stack in validating the numbers that marketers are actually taking action on rather than just kind of trusting what's in other people's platforms. Yep. I have a question.
Starting point is 00:14:42 And John, I mean, you've used Klaviyo heavily previously. And so question for both of you, what are the mechanics of why Klaviyo is over-reporting? I mean, I know it's a black box, but you know, Clint, you're building these models, but what is the data input problem or the regression problem that would cause that? I'll let Clint take this one. I have a suspicion, but yeah, I'm curious what you found. Because my knowledge is a couple years old. Let him validate your suspicion and then just come.
Starting point is 00:15:19 I'll tell you. Yeah, that's what I thought. Shoot, I want to hear the suspicion first. All right, I'll do it. Yeah, let's hear the thought. Shoot, I want to hear the suspicion first. All right, I'll do it. Yeah, let's hear the suspicion. That's more fun. Like, you've got... So Klaviyo has first-party access
Starting point is 00:15:31 to your Shopify data. So, like, theoretically, you have access to the same data, right? What I would guess is they built a more generic model, right? And are just going to run everything through a more generic model. And you're able to build
Starting point is 00:15:43 a more, like, bespoke, focused model as far as predicting. That's my high-level hypothesis. To interject on the suspicion there, isn't that what makes machine learning applications on Shopify so appealing, though, is because the ecosystem is consistent, right?
Starting point is 00:16:06 I mean, Shopify has a consistent data model. If you're going to try to scale that for someone like Klaviyo, like... Yeah. The fields are named the same for every customer. Sure, yeah. Like, as simple as that. Yeah.
Starting point is 00:16:16 Okay. Enlighten us. Yeah. No, I think that's one element of it, too, right? Like, I would say there's probably three elements. The first is some model differences. And I don't know what their model is and they don't give accuracy. Yeah.
Starting point is 00:16:33 So I can't really speculate on the metrics on, you know. I wish I was because then I'd know a lot more and I'd feel a lot more comfortable with what their predictions are. But I think the second element is some brands do have sales outside of Klaviyo. Or sorry, outside of Shopify. Sure. So, you know, one of our brands, they own, you know, three dozen retail locations that they own and manage throughout the country. And so that information is actually not flowing through Shopify. Klaviyo is not including it. So they're missing like're missing really important indicators.
Starting point is 00:17:08 And that's fairly common, I think. Because in my past life, we didn't have physical locations, but we had phone sales that didn't go through Shopify. And that's just another application. Yeah, fascinating. Right, but those, yeah, i guess yeah that's that is super and we're not talking like one or two phone sales we're talking like 20 30 percent of revenue yeah yeah anytime you're mixing sales channels right like things get much more complex and but i think
Starting point is 00:17:37 that's where like data team shine is simplifying all that so this data team we were working with right we're sitting on top of their in this, we're sitting on top of their, in this case, we're sitting on top of their warehouse rather than their Shopify instance. So the data team was able to do the identity resolution from in-store to online and kind of handle that so that we are looking at like one unified understanding of who the customer is. Yeah. So you have a table with each customer and then their combined order history across point of sale and then Shopify. Yeah, exactly. Yep. Okay, I have another question on the sort of business results side of things.
Starting point is 00:18:15 And this is, I think, again, just based on your experience, like, well, both with Wild, right? Because you're sort of producing some sort of output. And I'm going to pick on marketers here because I've been a marketer for most of my career. That's fun to do. Yeah, because it's great. It's great. But a lot of times, and I think this is changing to some extent
Starting point is 00:18:35 because marketers are getting increasingly technical. I think there are a lot of good dynamics. But at the end of the day, you talk about Klaviyo's model versus Wild's model or whatever. But the marketer doesn't actually care at the end of the day, you know, you talk about Klaviyo's model, you know, versus Wilde's model or whatever, but like the marketer doesn't actually care at the end of the day, right? They just want the score so that they can do something with it. So how do you think about that based on your past experience and then with Wilde as well,
Starting point is 00:18:57 where to your point, like the details are extremely important, right? I mean, the underlying data concerns are extremely important, but the end customer doesn't actually really care about that, right? Like, so how do you think about balancing that? Because you're producing some sort of outcome or you're producing some sort of output that's really critical to the business
Starting point is 00:19:20 that has all these important components, but like your customer's like, yeah, I mean, I don't really care. I just let me know who to email. Right. Yeah. I think so. I guess from like a data perspective,
Starting point is 00:19:32 generally, whether I'm in house or, you know, building data products, I'm not a huge believer in dashboards. I think they're like, like valuable, but I don't really think that's what our end goal
Starting point is 00:19:45 should be as data people. I think what we should be trying to do is integrate ourselves as tightly as possible with other people's workflow. Yeah. So in the Klaviyo example, like I really, like my ideal case is if I'm internal to a brand, that you don't ever have to leave Klaviyo as a marketer, right? That like the intelligence that we have as a data team is being pushed to you and you're not having to go somewhere else to get it. Yep. Yep. I love it.
Starting point is 00:20:13 John, thoughts on that? I mean, you did a bunch of this. Yeah, no, I think, I mean, a lot of people, I think talking just general data maturity, you over the last five years, it's like, wow, data collection is really easy, right? So there's a, Clint and I were talking about this before the show. There are a lot of Azure and AWS bills
Starting point is 00:20:35 that are high right now because data collection is really easy, right? Yes. And then you've got all the data in this database and like you can query it and that's exciting. And you can even easily hook up a BI tool to it, right? But that's all so unopinionated stuff, right? There's no structure, there's no business framework, nothing.
Starting point is 00:21:00 It's just whatever the analyst or data engineer whatever like whatever's in their mind and their level of in sync with the business which is often not very in sync determines the outcome so by getting it in the destination tool like you just enforced like a structure some business like logic you're forcing a certain number into a certain field, like it, like, even if it didn't have anything to do with the workflow, even that like structure and opinionation, I think is helpful. Yeah, definitely. That moves you from like soft ROI to hard ROI. Yeah, right. Like soft ROI is like building a dashboard, and you might inform some decisions. And I think there's the classic question in data, right? Like, how much does our data team generate? And that's very difficult to do if all of your ROI is soft ROI. But if you're
Starting point is 00:21:50 able to go into, I think, you know, reducing an Azure bill is like one example. But I think if you can actually generate top line revenue, and point to like, hey, we enabled this. Yeah, that's the gold standard yeah right that's hard roi that's actually something you can point to you can ask for more heads on your team because of it yeah yeah do you think about both you and john there's almost like an internal marketing element to this and what i mean by that is i i totally agree, right? Like, let's get the churn score or the predictive LTV value into Klaviyo or whatever tool, right? Like, so they can integrate it into their workflow. There's no disruption, right?
Starting point is 00:22:36 But to some extent, that can create a dynamic where, I don't want to say this, I mean, it almost looks too easy to where all of the work that went into that from the data team is undervalued. And so you can't get another head on your team. How do you think through that element, right? Because it's a lot of things, John and I talk about this a lot.
Starting point is 00:23:00 A lot of times things that are really well done seem easy when you see the final product, you know, which is great. And that's part of the point, but then you don't want that to come back and bite you. Yeah. Well, I mean, in marketing, right? That's, we've talked about that a lot in marketing,
Starting point is 00:23:17 whereas you read through something and it's like perfect logical flow, good messaging, like all the things. And you think in your mind, like I could have done that. Yeah. Right. and then like when you're actually on the marketing side of it trying to do that like it's impossibly hard it's very difficult yeah yeah and data has like a little bit of advantage over that because there's at least the technical aspect like well that seems kind of hard but there still is like that like really clean delivered product of like oh all you did was like fill out cltv and clavio like how hard is that yeah yeah yeah i was actually talking to a head of data recently who's having
Starting point is 00:23:51 this problem right now and we were kind of you know half joking half talking through like what do you do because he has made things look really easy and then you know the marketing team is like coming back and being like okay well like you could just do this and it'll take like a week right and and like actually educating them on how hard the data world is and like you know just getting clean data is really hard just tracking customer interactions really hard yep as you guys know very well so yeah there there is a bit of like internal marketing and I think also good data leaders. They're mixed between marketers
Starting point is 00:24:29 and kind of product managers. I'm a big believer in the kind of like product mindset internally. Yeah, you got to do a little bit of both. Yeah, I love it. Okay, we're going to switch gears here because John, I know you're chomping in the bed with a bunch of technical questions
Starting point is 00:24:43 and I cannot wait to hear about this. I'm going to ask a question to transition. Let's talk about Wild now. Can you give us just a... My question is what's happening under the hood? You're connecting to either a table in the warehouse that has
Starting point is 00:24:59 certain data or Shopify. Let's just use the Shopify example. What data are you pulling in from Shopify? Or's just use the Shopify example. What data are you pulling in from Shopify? Or do you access from Shopify? Yeah, we try and keep it pretty narrow. We're looking at some customer and demographic information. And we're looking at a lot of transaction kind of order history. Okay, I mean, that's it, right? So I mean, yeah, can you give us a set? Is that like 30 columns or like six columns when you pull it from the api you know there's i think a couple hundred just from those like two endpoints really
Starting point is 00:25:31 uh once you can blow everything right all right yeah they return everything yeah yeah we look at like five or six columns okay wow yeah so that is near yeah i think you said no pii like you can do it without pii and we can do it without PII. Yeah, you can hash an email before you send it to us. Yeah, we're trying to keep the scope really narrow because I think a lot of folks want to fit as many demographic pieces of information or, you know, interactions in.
Starting point is 00:25:58 And again, as you guys know, it's really hard to collect that information. It's really hard to clean it and organize it. And so, you know, I think like our onboarding engagements would be like five times longer if we wanted to collect a bunch of information from different platforms. So just keep it super narrow and we get 95% of the benefit. And I think from talking to you in the past, like you realize with your model that your models that you're working on now like the signal to noise ratio
Starting point is 00:26:26 like if you pulled in every single data point you could from Shopify like super high noise but as you narrow it down like the beauty and like a really good like predictive model is like we know the like five things that matter or however many it is right and then everything else like if we get a slight like increase like you have to like is it worth it and then everything else like if we get a slight like increase like you have to like is it worth it and is it truly a slight increase every time or is it like a one-off like i think that's the yeah that's the beauty of like a simple inputs into a like sophisticated model yeah i mean conceptually speaking when you're talking about retail purchases online or in store uh there's a lot exogenous to anything that you can measure.
Starting point is 00:27:08 So if I go down to my bodega guy like every day, I might be a super loyal customer. But if I move apartments, I'll never go back to that bodega again. And that bodega guy is probably not going to know that for that reason you know what the reasoning was but there are a lot of reasons outside of our actual purchase behavior or interaction with a brand that dictate our journey with that brand right i still remember like i did a lot of a lot of work with shopify a lot of work with shopify apps i still remember the sales pitch because we were really wrestling with the pricing problem like we had thousands of SKUs and pricing is hard especially at scale I still remember this like model this guy was selling me and he was like yeah we take in like hundreds of data points we look at behavioral data we look at visits and we produce dynamic prices for like each of your you know 20,000 items
Starting point is 00:28:00 and we demoed it and a there was basically no way to like prove like cool is this like more sales or more margin than you know they didn't have that built into the product yet and b we ended up with like whoops new pricing for like 20 000 skews so we had to like roll it all back which was a nightmare but that was a really good lesson of like all right like less is better here and understandable is way like more to be desired than like something that like is eking out each little like percentage of quote like efficiency you know in a model yep fragmentism might be like the most important characteristic in any of these models, right? Interpretability and getting it out the door really quickly is going to get 95% of the value that your stakeholders expect. Right.
Starting point is 00:28:51 All right. So tech stack. Yeah. You have to go there, right? Yeah. Yeah. So, okay. So we started with, you know, Shopify API endpoint, maybe a data warehouse.
Starting point is 00:29:01 Yeah. So what happens next in the high level flow yeah so this Shopify integration relatively new for us I mean we've been very warehouse focused for a while now and so my co-founder and I were kind of looking at different technologies because we're starting from scratch we have kind of freedom to do what we want and so we landed on basically the flow for us is we land data in S3 for cold storage. We use dbt for transformation. Awesome tool. And then we're actually using duck BB and mother duck for all of our kind of
Starting point is 00:29:40 storage and transformation warehousing needs and on the back end and the front end so that's been yeah i've been learning that stack lately this is definitely a first i don't think we've had anyone on the show who has used duct db like in production in this way yeah i think it's the first and we were talking before the show about BI tools and browser BI tools. So if you remember, I guess it's been 10 plus years now since Tableau came out. And like one of the major things there was their query engine or their storage solution, like as part of the tool where you can extract the data and then you can like manipulate it on your desktop and this amazingly like fast experience that was like one of the big deals there then they take it to the web ironically right and like they had a bunch of trouble early on i remember they like hired somebody from aws to try to help figure out the like web version of tableau you know obviously eventually got something that, that was good enough, but that I, but I still remember that initial experience of like, Hey, I've got this massive, like millions of lines file. I extract it and I can use it in Tableau and it's awesome.
Starting point is 00:30:54 So tell me about your workflow and like how that might, like, I think you've had a similar experience with DuckDB, different workflow, but maybe similar. Yeah, so a couple things that we've really liked working with it is first off, it pulls the front-end analysis that we're doing a lot closer to the data team.
Starting point is 00:31:18 So I don't know any software engineering, front-end or back-end really, but I'm an okay data guy, and I can go into our data stack, right, into our actual proper data repo, and modify the queries that exist on the front end. And so having that close connection with the front-end web app is kind of ridiculous for any data team.
Starting point is 00:31:50 that you normally see. There's not that handoff organizationally from data to a software team. I'm like, okay, now we need to abstract what's going on here. We're going to have to move it into some other framework. We're doing SQL queries from the front end and it's fast so I know there's got to be
Starting point is 00:32:10 some software engineers listening that are like, no, this is a terrible idea, here's all the reasons you need that layer. DuckDB is so polarizing. My co-founder is a software engineer, and he's, you know, coming to the data stack, and he's the one who's really been pushing for that. So he can go fight all the software engineers. Yeah, I was gonna say. Yeah, we'll put him on the LinkedIn.
Starting point is 00:32:42 Yeah, yeah. It's been great. I think another awesome benefit for us is on the backend analysis. We do this cold storage in S3. If I want to run an analysis on data that we have parked, you can do
Starting point is 00:33:01 you can basically glob everything from different S3 buckets. So, you know, we'll do bucket-star, and I can select a bunch of different S3 buckets simultaneously. And so we can basically do ephemeral analysis on multiple brands without actually joining and moving that data together. So that's been kind of a nice added benefit for us. Oh, wow. So are you, or have you looked into Iceberg at all as part of that? We haven't, yeah, no. I think Patrick did a little bit and was getting very intrigued, so I was asking about it the other day. Is Iceberg the one that was just acquired recently? Yeah, well, the commercial part of it was acquired by Databricks. Okay. Yeah. Right. Yeah, we were talking about
Starting point is 00:33:55 it. I think that's on the horizon for us. Have you played with that at all? Not really. But I was reading about this really interesting workflow with Snowflake, of basically people using Snowflake as the write layer into Iceberg, and then DuckDB as a read from it. So then you're cutting your Snowflake compute, right? Because it's just being used in the ingestion, but the read out of Iceberg tables is just straight with MotherDuck or DuckDB. So, I mean, it'll be so interesting, because Iceberg is an open format. Even though the commercial, you know, commercial company got acquired by Databricks, that's still an open format.
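The Snowflake-for-writes, DuckDB-for-reads split described above can be sketched on the read side. This is only an illustration, not anyone's production setup: the table location and columns are invented, and the query assumes DuckDB's `iceberg` extension, which provides `iceberg_scan`.

```python
# Hypothetical read side of the split: Snowflake ingests and writes Iceberg
# tables to S3; DuckDB only reads them, so no warehouse compute is spent on
# queries. Table path and columns are made up for the sketch.
read_query = """
    INSTALL iceberg;
    LOAD iceberg;
    SELECT order_month, SUM(amount) AS revenue
    FROM iceberg_scan('s3://lake/orders')  -- Iceberg table written by Snowflake
    GROUP BY order_month;
"""

# Against a real table this would run as:
#   import duckdb
#   duckdb.connect().execute(read_query).fetchall()
```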
Starting point is 00:34:40 Yeah, it's in Apache. Yeah, yeah, an incubated project. But it'll be so interesting to have, all right, say I want to store everything in that format, and then you just have these engines, right? You're like, all right, Snowflake engine, you're going to do my writes; DuckDB, you're going to do my reads. Or any other number of combinations. It's going to be really fascinating, the cost savings, right, and then just the creative things you can do when you're able to modularize and split it up like that. And
Starting point is 00:35:11 then i'm sure there's some kind of ai like application here too where you've got everything in like the same format like it'd be easier to access i've been seeing some of these narratives recently and i i haven't gone super deep admittedly but like do you think this kind of structure hurts or helps a snowflake or a data bricks but like an open structure like using ice yeah like where yeah where you don't need to put your storage into one of those platforms and where they become like purely a compute layer well i don't know i think i don't think it hurts them too i don't think it hurts them like that much but because the way they're going so like snowflake just really you can like the last couple weeks i'm sure you've probably seen it like they've got the full like python notebook experience in browser you know for snowflakes they're doing that they already have the streamlet stuff so like they're just going all out like all the things that we can use compute
Starting point is 00:36:10 for, and they're going to have ML and AI models and stuff. So their compute time is all going to be more and more used on that stuff. You're going to be spending a ton of money in compute for AI/ML stuff, and a ton of money with them for your Python notebooks. And even maybe querying will start to move down the list as far as what you're spending money on. So I don't know. I mean, I think it probably helps. It might help the industry in general, a little bit of a rising tide for everybody, because, depending on how it works out, you might end up with a standard where pretty much everybody just uses Iceberg, because it will work with Databricks and it works with Snowflake and, you
Starting point is 00:36:54 know x y or z other thing that you want so that i don't know that might help the general industry but it's hard to say whether i feel like it really helps like an individual snowflake or Databricks or hurts them. Yeah, it is interesting though, the episode that we had with Andrew Lamb from Influx, you know, and Influx does time series stuff. So different use case, but he was, I just made this connection.
Starting point is 00:37:24 One of the things that we talked about on that show was that his prediction is that things would move towards essentially having everything in S3 and then an ecosystem around that, to your point, you know, where it's like, okay, Snowflake does this, DuckDB does this, right? And building an ecosystem around that model. And so, Clint, it's fascinating that you guys are actually, I mean, you have adopted that for your product, right? I mean, we were talking about this just in terms of analytical workflows like within a company, right? That that's actually how you run your entire product. Yeah. I mean, I think from what we've experienced so far,
Starting point is 00:38:07 like it is not as easy as standing it up right now and just, you know, letting it rip in there. I think there's probably a little ways to go in terms of accessibility. Right. But it's definitely interesting and opens up some pretty cool capabilities.
Starting point is 00:38:22 I mean, I can speak to the DuckDB thing alone. I was telling you guys earlier, we have more than 500 brands that we're doing analysis for. And so we have all this transaction data. It's two and a half billion dollars, with a B, in the last year in total GMV that these brands have done. And so I started doing an analysis with DuckDB. I just pulled up a Jupyter notebook. You know, it's a one-liner to connect to these S3 files, and I'm off and away writing SQL inside of a Jupyter notebook. And on, you know, hundreds of millions of rows, I'm getting instantaneous queries on my local machine. That's, that's crazy. Yeah. And it's like, you know, just connecting to Snowflake from my Jupyter notebook would kind of be a pain. So yeah, there are some elements where it's just so easy as an analyst to get something out, and I've just been able to focus on the fun of being
Starting point is 00:39:19 an analyst again rather than all the kind of like engineering setup yeah is isn't there some pretty some pretty cool things in browser things you can do with duck db and mother duck um standpoint yes i think that's i think a lot of that's related to like what we're doing on the front end right now which is like we're basically running these sequel queries directly from the front end right yeah i'm not well versed i think on the front end i'm like excited that i can i had some discussions i had some discussions around this and i wish i could represent it better but it's that same like where we started that same tableau concept of basically like you have this like extremely fast compressed like data set that the query experience feels just about instant,
Starting point is 00:40:08 but it's in browser, which historically has been a huge problem for just about any BI tool that I've used in browser. And I'm sure they're continuing to improve that part of the product, but it's pretty cool to see it. I'll be interested to see what MotherDuck's
Starting point is 00:40:24 go-to-market strategy is is because they do kind of have two disparate use cases right now, which is like run stuff really fast on your local machine. And then one that's, you know, run stuff really fast in your browser. And I don't think they're necessarily mutually exclusive because obviously we're using both with good effect,
Starting point is 00:40:42 but yeah, it'll be interesting to see which one they kind of lean on and which one proves more valuable. Clint, one thing we talked about before we hit record kind of related to this, and it came back to mind because you were talking about having, you know, querying a bunch of different data sets.
Starting point is 00:41:00 Obviously, there's a security concern related to that, right? So, I mean, maybe you've stripped PII or whatever. How are you thinking about that? Because my mind is instantly going towards all sorts of interesting use cases, right? I mean, you can provide insights across different customers, you know, because everyone's in retail.
Starting point is 00:41:16 You could provide sort of, you know, reporting, benchmarking. I mean, there's all sorts of, like, interesting product possibilities. But from a data perspective, you have to tread really carefully there, right? Because, you know, there are agreements that you have with each customer about like how you're managing their customer data. You know, security concerns around like if you're combining all of that in a single place and, you know, I mean, how are you approaching that side of it as you're working with data across all of your customers? Yeah, so we, I mean,
Starting point is 00:41:49 we strip PII for everything. We're hashing customer information as well as oftentimes when we're joining information, we often are looking at merchant anonymized information as well. So it's kind of like the first layer. The second is we actually spin up
Starting point is 00:42:03 separate DBs for each customer. So each customer lives in their own DB environment. And then when we join it, it's being joined similarly using DuckDB. So there's no like hard table where the data is landing. So I think we probably have some work to do on all of that but like it gives us a pretty good model where we're both getting some flexibility without just mixing a lot of and to be honest like i talk to a lot of vendors who do kind of push all the data together it is like a standard yeah yeah it's pretty standard yeah yep yeah we're trying to be data conscious on it. And one thing I will throw out, like none of the tooling out there
Starting point is 00:42:46 is really designed to work across a bunch of these databases. And so we're really having to like grok a few of the tools, you know, because we're basically running a different dbt instance for each. We have one central dbt repo, obviously, but like each customer is getting
Starting point is 00:43:04 their kind of like own dbt repo obviously but like each customer is getting their kind of like own dbt runs and so it's yeah it's a lot harder to manage this way yeah yeah all right well we're getting close to the buzzer here but we had talked before recording about clean rooms and that's probably a good place to end like as far as what you're thinking about for the future of wild so tell us about clean rooms how does that relate to like what you're doing as a product yeah i think at first blush it it feels far afield but uh what we've learned you know looking at 500 brands 600 brands data at this point is that a lot of data exists outside of the Shopify ecosystem because so many of these brands have gotten omni-channel now yeah a lot of them are selling in retailers a lot of them selling in Amazon and so I started doing research earlier this year on like okay if I'm
Starting point is 00:43:56 a big brand how do I solve this because I'm not going to be satisfied just not having this data right we kind of started getting into clean room world. And what we really learned there is like accessibility for clean rooms is a huge issue. You obviously Samoa or live ramp and and Snowflake both have products. They acquired two companies for data clean rooms, but they're technically and technically expensive and monetarily expensive and you know most retail brands are not using those technical tools and so what we've been exploring lately is basically productizing a lot of these clean rooms so we can continue sharing data with brands but then also
Starting point is 00:44:41 with their retailers oh Oh, wow. Okay, so, but you're sort of building it on like existing clean room technology from someone like a Snowflake? No, so we, no, we're not actually. We'll build some of that ourselves. Yeah, we have some hypotheses about that, but yeah, probably too early to say now. Yeah.
Starting point is 00:45:03 But yeah, we'll be be building around stack for that love it all right john any final questions before we hop off no i think the data sharing part like if you don't know what a clean room is right maybe a quick little definition of that for somebody and then in general i think data sharing is a really big like place for this stuff to go next whether it's sharing to be in app like and like i use clavio and i want to share it to clavio i don't want to like etl it like that's too hard like let me just share it or i want to share it to salesforce or whatever so i think that general concept is big but if you could just like focus the clean room piece like tell
Starting point is 00:45:40 people what that is yeah so i actually really dislike the term clean room we refer to them as collaboration room which i think is like a bit more explains what you're actually trying to do rather than what the tech is you know effectively if john you own a brand and i own a brand and we want to share information about our customers neither of us wants to share a list of our customers. We don't want to expose that. And so you can use these clean rooms, or as we call them, collaboration rooms, as basically a third party where you can dump the information in. And then neither of us can look at the individual PII, but we can do aggregated queries of that data, kind of predetermined aggregated queries. And so,
Starting point is 00:46:26 conceptually speaking, it sounds a little bit esoteric, but the actual use cases are quite interesting. So, you know, Amazon has a clean room solution. And so you actually if you're running on Shopify and Amazon, you can do things like you can give Amazon a list of your Shopify customers so that you can target them in Amazon's ad platform. And Amazon won't actually know who those customers are. And you can do that same thing with Google and Facebook, a few other platforms, the TV platforms have the same technology. It also means that you can go to a retailer.
Starting point is 00:47:00 So if you're selling in Kroger, you can get customer level sales information from Kroger. All of that is like kind of inaccessible to most brands because of their revenue and because of the tech requirements. But the big brands can tell you how many new versus returning customers they have in a retailer. That's fascinating. Yeah, that is fascinating. Super fascinating. It's been pretty fun to learn about.
Starting point is 00:47:24 Yeah, for sure. Well, as you build that product out, keep us posted and we'll have you back on the show because I think that's a huge topic for us to tackle. Yeah, definitely. That'd be awesome. Clint, well, thank you so much for joining us on the show. It's been a fascinating conversation
Starting point is 00:47:40 and we'll have you back on sometime soon. I'd love that. It's been a blast. Thanks for having me, guys. Yeah, thanks, Clint. The Data Stack Show is brought to you by Rudderstack, the warehouse-native customer data platform. Rudderstack is purpose-built to help data teams
Starting point is 00:47:55 turn customer data into competitive advantage. Learn more at rudderstack.com.
