The Data Stack Show - 194: Building Retail Churn Prediction on DuckDB with Clint Dunn of Wilde
Episode Date: June 19, 2024

Highlights from this week's conversation include:
- Clint's Background and Journey in Data (0:51)
- Starting a Data Career (2:01)
- Transition to Startup SaaS World (4:27)
- Clint's Connection to a Federal Reserve Database (5:31)
- Challenges in Predictive Modeling (10:27)
- Data Input Challenges (15:50)
- Marketers' Workflow and Data Integration (18:29)
- Soft ROI vs. Hard ROI in Data Analysis (21:31)
- Balancing Internal Marketing and Data Team's Value (22:35)
- Simplifying Data Inputs for Predictive Models (25:09)
- Data Analysis Workflow and Tech Stack (29:06)
- Open Data Formats and Impact on Data Platforms (34:40)
- The S3 and Ecosystem Model (37:08)
- In-browser SQL Queries with DuckDB (39:24)
- Data Security Concerns and Solutions (41:47)
- Clean Rooms and Data Sharing (43:32)
- Final Thoughts and Takeaways (47:35)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Hi, I'm Eric Dodds.
And I'm John Wessel.
Welcome to the Data Stack Show.
The Data Stack Show is a podcast where we talk about the technical, business, and human
challenges involved in data work.
Join our casual conversations with innovators and data professionals to learn about new
data technologies and how data teams are run at top companies.
The Data Stack Show is brought to you by RudderStack, the warehouse native customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.
Welcome back to the show. We're here with Clint Dunn from Wilde. Clint, welcome to the Data Stack Show. We are super excited to chat with you.
Thanks for having me, guys. I'm super excited.
All right. Well, give us just a brief background. Tell us about yourself.
Yeah, I'm the co-founder of Wilde. We do LTV and churn predictions for retail brands. In a prior life, I worked at Afterpay in the marketing data science department. Before that, I was building data teams at small e-com companies.
So Clint, one of the topics I'm really excited to talk about is DuckDB. We both, I think you more
than me, have an affinity for it. So I'm excited to talk about that. And then we'll definitely have to talk some about your experience as a head of data and working in data, producing business outcomes as well.
Yeah, I love it.
There's not a lot of podcasts I get to go on and talk about technical stuff.
So I'm enjoying that.
Yeah.
Awesome.
All right.
Yeah.
You ready to dig in?
Let's do it.
Okay, Clint. So many interesting things about your background. You gave us a brief introduction, but how did you start your data career? Did you study data, or sorry, did you study engineering or anything technical in school?
No, I was an economics major, and we did a little bit of SAS, which is a real old-school programming language, common in economics. And I kind of had finance internships all through school. But I think the turning point for me was, going into my senior year, I was working at a UFC gym. We basically franchised out gyms across the country.
Oh yeah, sure.
My whole job was, I was selling, basically the whole summer, the rights to the gyms in the Delaware Valley. And I got roped into some meeting with the president of the company at some point, and he was like, all right, we got a major problem here. We are giving free memberships away, as basically free trials, indefinitely, to a number of customers.
And somebody raised their hand right away.
I was like, all right, how many people is this affecting?
He was like, we have no idea, but I'm guessing 4%.
And so I kind of started talking to the technical team.
I was like, how is it possible?
Don't we have a SQL database?
How do we not know?
And nobody really could work it out.
And I just got this hunger from that point to start answering business questions a little bit better than putting your finger in the air.
Yeah.
And so you were actually selling.
You were out selling rights to gym memberships.
Not literally. People were kind of coming to us. I was doing some of the analysis to say, like, how many stores could we put into the Delaware Valley? I was going all through Pennsylvania. It's like, we're going to stick one in Mechanicsburg, there's going to be one in Harrisburg, we can get two in this other city.
Yeah, yeah. Kind of boots-on-the-ground analysis.
Yeah. Okay. And then so where did things go from there?
So I ended up working at a fractional CFO and accounting company after school. And I was like the data guy. And I think that's kind of a common situation a lot of folks start their career in, where they're kind of tasked with broad data responsibilities and maybe not the skills to do them. So like I said, I knew a little bit of SAS.
I stayed after work every day and taught myself Python.
And I was like really good at Excel and just tried to figure things out.
But it was a little bit of everything at that job.
Yeah.
And then when did you get into the SaaS world, software as a service?
Yeah.
Honestly, I've come to it a lot more recently.
I've been mostly in the retail side of things.
Yeah.
So with that job, we were a fractional CFO and accounting company.
We were working with startups.
So I got to look at data for a couple of SaaS companies, but I did a lot of like food and bev and, you know, kind of traditional e-com analysis for folks.
And then I went and was at a fractional marketing company doing basically the same thing.
And then was in-house at Hairstory eventually.
So, yeah, I'm like kind of a retail guy through and through it really.
Yeah.
Okay.
Interesting. Now you skipped over one really important piece of your history that we talked about
briefly before the show.
And that is the fact that there's a database at the Federal Reserve in Kansas City that
is named Clint, which is a surprising resemblance to your name. Can you give us
the quick story on how you got a database at the Federal Reserve named after you?
It was a horrible mistake, and no one knows this about me. Yeah, J-10,
it was the Federal Reserve Bank of Kansas City. I had an internship there one summer.
I was an undergrad. Everyone else in the research department had PhDs. They'd never really had an intern before. And so I didn't really have any right to be there, and I was well aware of it.
But they had this project where I was supposed to catalog every economic event since World War II.
It's pretty obvious and actually kind of cool and maybe relevant to data folks in companies now.
Nobody really knows when things happen.
If you just ask somebody when Hurricane Katrina
happened, it's very hard to
pull that out, but it will affect
a lot of analyses that you're doing, especially in the South and in the aughts.
The head data
scientist had this idea to catalog all these things.
My job was to go through these old
binders that secretaries
had typed up manually in the 50s and 60s and left to her.
So I was digitizing all of them.
It's really cool.
And I gained a huge appreciation for what it was like, right? Like the gas shortages in the 70s.
Yeah, what it was like day by day.
But yeah, as a joke, I named it after myself and had a backronym, a terrible backronym: the Chronologically LINked Timeline, with the L-I-N capitalized. I thought it was a really dumb joke, and then I realized all federal databases are named after people. It's like FRED and EDGAR and NOAA, and I think a few more that I can't think of.
So yeah, I knew it had caught on when the head of research came down to my desk one day and was like, I heard you're the guy working on that CLINT database, aren't you?
Nope. Can't escape it.
Now that is how to leave a mark as an intern, for sure. A lasting mark.
Right?
Yeah.
But you've had a similar experience.
I have.
We were,
there is actually,
we were joking back and forth,
John and I on LinkedIn this week,
cause he tagged me in a post where someone said,
you know,
you build a quick prototype and you end up,
you know,
with a database called something like a John test
or whatever. And so the joke is
around RudderStack, our version
of that is EricDB, which
when I first joined RudderStack
I was getting access to the product
and I asked for a schema in Snowflake
so I could do
testing and build prototypes.
And there's
still a lot of production workflows
that run out of Eric DB today,
four years later.
We'll eventually fix that.
But yes, you know,
it's great when you're onboarding
like a new employee
and they're like,
what is Eric DB?
Yeah.
Yeah, you got to be really careful
what you name after yourself.
Yeah, exactly.
That's very true.
Well, one thing I'd like to hear about.
So I want to talk about Wilde and what you're doing there.
But can you talk about your experiences as a data professional before founding Wilde?
How did those sort of shape what you wanted to do at Wilde?
Like, what were the problems that,
were there problems you kept running into?
And maybe just start with like a brief overview
of what Wilde does so that the listeners have some context.
Yeah, sure.
So for Wilde, we're basically sucking up information about your customer, either from Shopify or from a data warehouse.
And then we're using that information to predict a few things about them.
Kind of primary points are future lifetime value.
So how much are they going to spend in the future?
And then the probability that they're going to come back and make another purchase.
The reason we're starting with those two,
and we'll probably build other models in the future,
those two though are foundational
to the way e-com and retail brands
operate their businesses, right?
It is the economic basis for the business.
And I would argue also the decision point
on which you should handle every customer.
I call it horizontally important and vertically important. And so what I saw when I was internal to these brands was, setting up these models to predict things is relatively similar brand to brand. You basically can use the same information, same inputs; the outputs are the same. Everyone needs them.
And it's deceptively hard, right?
Like, from a coding perspective, you can get this up and running in a day or two. But to productionize it, run all the testing that you need, communicate internally with stakeholders, and kind of productize what your predictions are, is actually really hard.
And so, yeah, when I started Wilde, it was basically just to solve those problems.
Yeah, makes total sense. In terms of the stakeholders, I'd be interested to know, can you dig into that a little bit more? So you're on the data side and you have these stakeholders. And let's just take lifetime value prediction, for example, right? So some customer has made a purchase or maybe not.
Maybe they've, you know, there's some characteristics that you're using as an input.
But let's say they've made some sort of purchase or a couple purchases.
And then you're running some sort of model that predicts, you know, what is their eventual
lifetime value over some time period, however many years
or whatever. So is the business asking for that? I just love to hear, what's the genesis story?
As you as the data person, how does that come up within the organization? Who on the business side
is asking for that? Yeah, it's a really good question. I call this the LTV maturity curve. I see a lot of companies start off where like finance and operations owns LTV.
And so they'll usually kind of do like a historical analysis.
So they'll take cohorts of customers.
Yep.
They'll draw those like classic cohorted lines.
Yep.
Come up with a churn rate and AOV and then kind of back into an LTV number.
And that works pretty well until the business starts changing.
And those lines start going up and down.
And it's very hard to interpret like what is good LTV?
What's the reason for things going up?
And it's not very actionable.
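The "back into an LTV number" step Clint describes is simple arithmetic; a minimal sketch, with entirely made-up numbers, assuming a constant per-period churn rate:

```python
# Back-of-envelope cohort LTV, the "finance-owned" method described above.
# All numbers are hypothetical illustrations.

aov = 60.0          # average order value ($) from the cohort analysis
churn_rate = 0.40   # fraction of customers who don't come back each period

# With a constant churn rate, expected orders per customer is a
# geometric series: 1 + (1-c) + (1-c)^2 + ... = 1 / c
expected_orders = 1 / churn_rate
ltv = aov * expected_orders

print(f"expected orders per customer: {expected_orders:.2f}")  # 2.50
print(f"back-of-envelope LTV: ${ltv:.2f}")                     # $150.00
```

The constant-churn assumption is exactly what breaks when, as Clint says, "the business starts changing" and the cohort lines start moving around.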
And so usually the marketing team then will go to the finance team and say, look, it's great that we have an understanding economically of how well our customers are performing and how profitable they are, but we need to take action on those profitability signals.
And so a lot of companies will start building RFM models. Have you guys played around with those? Basically recency, frequency, monetary value. So how recently have they purchased? How frequently have they purchased? What's the kind of AOV or total revenue, depending on how you want to do it.
Those are great, but usually you segment those into three-by-three kind of grids. So for each letter, you have three segments. You end up with like nine segments, which is way too many to actually market to. And so I kind of consider the end of that maturity curve being the LTV number. Just one number, super simple. And it's tied into what the finance team was trying to do originally, which is understand the profitability of these individual customers. So I think finance usually is driving these conversations, and then they're kind of proselytizing the importance of, you know, economic viability, especially right now in the e-com world.
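The per-letter scoring behind an RFM model can be sketched in a few lines; the cutoffs and customers below are hypothetical, not anyone's actual segmentation:

```python
# Minimal RFM scoring sketch: score each customer 1-3 on Recency,
# Frequency, and Monetary value, then combine into a segment label.
# Cutoffs and customer data are hypothetical.

from datetime import date

customers = {
    "a@example.com": {"last_order": date(2024, 6, 1), "orders": 9, "revenue": 720.0},
    "b@example.com": {"last_order": date(2023, 11, 5), "orders": 2, "revenue": 90.0},
}

today = date(2024, 6, 15)

def score(value, cuts, reverse=False):
    """Map a value to 1-3 given two cutoffs; reverse=True means lower is better."""
    s = 1 + sum(value > c for c in cuts)
    return 4 - s if reverse else s

for email, c in customers.items():
    days_since = (today - c["last_order"]).days
    r = score(days_since, (30, 90), reverse=True)  # recent purchase -> high score
    f = score(c["orders"], (1, 4))
    m = score(c["revenue"], (100, 500))
    print(email, f"R{r}F{f}M{m}")  # e.g. a@example.com -> R3F3M3
```

Even this toy version shows the combinatorial problem Clint points out: three scores per letter multiply into far more segments than a marketer can act on, which is why a single LTV number wins at the end of the maturity curve.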
Yeah.
And then everyone else kind of needs to get on board.
And so marketing gets these values, and what are they doing? Like, some sort of segmentation, and then they dump these people into different campaigns? Could you just give a couple examples of, like, what is the specific number they're trying to move, or an example segment?
Yeah, so I think the basis of the LTV and churn predictions is that they are, again, horizontally and vertically important. And what I mean by that is: vertically important, it's a C-suite-level metric,
but it's also actionable for the tactical folks
who are actually executing on campaigns.
Marketing is a great example,
but I also think CX should be using it,
operations should be using it.
You kind of go down the list.
Everyone can leverage these and use it as a North Star.
In terms of use cases, a super simple one, and we've had a lot of success talking about this one: Klaviyo actually has some of these predictions.
Yeah.
But they have this black box model
and nobody really knows what the accuracy is.
Nobody can really pull out what the predictions are.
Yep.
And so we've had customers compare us to Klaviyo
and figure out that Klaviyo was over-predicting churn by four times.
Wow.
And so I think it goes to show the importance of data teams in this stack in validating the numbers that marketers are actually taking action on rather than just kind of trusting what's in other people's platforms.
Yep.
I have a question.
And John, I mean, you've used Klaviyo heavily previously.
And so question for both of you, what are the mechanics of why Klaviyo is over-reporting? I
mean, I know it's a black box, but you know, Clint, you're building these models, but what
is the data input problem or the regression problem that would cause that?
I'll let Clint take this one.
I have a suspicion, but yeah, I'm curious what you found.
Because my knowledge is a couple years old.
Let him validate your suspicion and then chime in.
I'll tell you.
Yeah, that's what I thought.
Shoot, I want to hear the suspicion first.
All right, I'll do it.
Yeah, let's hear the suspicion.
That's more fun.
Like, you've got...
So Klaviyo has first-party access
to your Shopify data.
So, like, theoretically,
you have access to the same data, right?
What I would guess
is they built a more generic model, right?
And are just going to run everything
through a more generic model.
And you're able to build
a more, like, bespoke, focused model
as far as predicting.
That's my high-level hypothesis.
To interject on the suspicion there,
isn't that what makes
machine learning applications on Shopify
so appealing, though,
is because the ecosystem is consistent, right?
I mean, Shopify has a consistent data model.
If you're going to try to scale that
for someone like Klaviyo, like...
Yeah.
The fields are named the same for every customer.
Sure, yeah.
Like, as simple as that.
Yeah.
Okay.
Enlighten us.
Yeah.
No, I think that's one element of it, too, right?
Like, I would say there's probably three elements.
The first is some model differences.
And I don't know what their model is and they don't give accuracy.
Yeah.
So I can't really speculate on the metrics there, you know. I wish I could, because then I'd know a lot more, and I'd feel a lot more comfortable with what their predictions are. But I think the second element is some brands do have sales outside of Klaviyo.
Or sorry, outside of Shopify.
Sure.
So, you know, one of our brands, they have three dozen retail locations that they own and manage throughout the country.
And so that information is actually not flowing through Shopify.
Klaviyo is not including it.
So they're missing really important indicators.
And that's fairly common, I think.
Because in my past life, we didn't have physical locations,
but we had phone sales that didn't go through Shopify.
And that's just another application.
Yeah, fascinating.
Right. But those, yeah, I guess, yeah, that is super.
And we're not talking like one or two phone sales. We're talking like 20, 30 percent of revenue.
Yeah. Yeah, anytime you're mixing sales channels, right, things get much more complex. But I think that's where data teams shine, is simplifying all that. So this data team we were working with, in this case, we're sitting on top of their warehouse rather than their Shopify instance.
So the data team was able to do the identity resolution from in-store to online and kind of
handle that so that we are looking at like one unified understanding of who the customer is.
Yeah. So you have a table with each customer and then their combined order history across point of sale and then Shopify.
Yeah, exactly.
Yep.
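The unified table described above amounts to a union of orders keyed on a resolved customer identifier; a sketch under assumptions (plain emails as the key, hypothetical field names and data):

```python
# Sketch of the identity-resolution result: union orders from Shopify
# and point-of-sale into one history per customer. In practice the key
# would be the data team's resolved customer ID, not a raw email.
# All data and field names are hypothetical.

from collections import defaultdict

shopify_orders = [
    {"email": "a@example.com", "total": 50.0, "channel": "shopify"},
    {"email": "b@example.com", "total": 80.0, "channel": "shopify"},
]
pos_orders = [
    {"email": "a@example.com", "total": 35.0, "channel": "pos"},
]

combined = defaultdict(list)
for order in shopify_orders + pos_orders:
    combined[order["email"]].append(order)

# One unified order history per customer, across both channels
for email, orders in combined.items():
    lifetime = sum(o["total"] for o in orders)
    print(email, f"{len(orders)} orders, ${lifetime:.2f} lifetime")
```

Feeding a model only the Shopify list would miss a third of customer a's spend here, which is the "missing really important indicators" problem in miniature.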
Okay, I have another question on the sort of business results side of things.
And this is, I think, again, just based on your experience, like, well, both with Wilde, right? Because you're sort of producing some sort of output. And I'm going to pick on marketers here
because I've been a marketer for most of my career.
That's fun to do.
Yeah, because it's great.
It's great.
But a lot of times,
and I think this is changing to some extent
because marketers are getting increasingly technical.
I think there are a lot of good dynamics.
But at the end of the day,
you talk about Klaviyo's model versus Wilde's model or whatever. But the marketer doesn't actually care at the end of the day, right?
They just want the score so that they can do something with it.
So how do you think about that based on your past experience and then with Wilde as well,
where to your point, like the details are extremely important, right? I mean, the underlying data concerns
are extremely important,
but the end customer doesn't actually
really care about that, right?
Like, so how do you think about balancing that?
Because you're producing some sort of outcome
or you're producing some sort of output
that's really critical to the business
that has all these important components,
but like your customer's like,
yeah, I mean, I don't really care.
I just let me know who to email.
Right.
Yeah.
I think so.
I guess from like a data perspective,
generally,
whether I'm in house or,
you know,
building data products,
I'm not a huge believer in dashboards.
I think they're like,
like valuable,
but I don't really think that's what our end goal
should be as data people. I think what we should be trying to do is integrate ourselves as tightly
as possible with other people's workflow. Yeah. So in the Klaviyo example, like I really,
like my ideal case is if I'm internal to a brand, that you don't ever have to leave Klaviyo as a
marketer, right? That like the intelligence that we have as a data team is being pushed to you and you're
not having to go somewhere else to get it.
Yep.
Yep.
I love it.
John, thoughts on that?
I mean, you did a bunch of this.
Yeah, no, I think, I mean, a lot of people, talking just general data maturity over the last five years, it's like, wow, data collection is really easy, right?
So there's a,
Clint and I were talking about this before the show.
There are a lot of Azure and AWS bills
that are high right now
because data collection is really easy, right?
Yes.
And then you've got all the data in this database
and like you can query it and that's exciting.
And you can even easily hook up a BI tool to it, right?
But that's all still unopinionated stuff, right?
There's no structure, there's no business framework, nothing.
It's just whatever the analyst or data engineer, whatever's in their mind, and their level of in-sync with the business, which is often not very in sync, determines the outcome. So by getting it into the destination tool, you've just enforced a structure, some business logic. You're forcing a certain number into a certain field. Even if it didn't have anything to do with the workflow, even that structure and opinionation, I think, is helpful.
Yeah, definitely. That moves you from soft ROI to hard ROI, right? Like, soft ROI is building a dashboard, and you might inform some decisions. And I think there's the classic question in data, right? Like, how much does our data team generate? And that's very difficult to answer if all of your ROI is soft ROI. But if you're able to, I think, you know, reducing an Azure bill is one example. But if you can actually generate top-line revenue and point to, like, hey, we enabled this.
Yeah, that's the gold standard.
Yeah, right. That's hard ROI. That's actually something you can point to. You can ask for more heads on your team because of it.
Yeah, yeah. Do you think about, both you and John, there's almost like an internal marketing element to this? And what I mean by that is, I totally agree, right? Like, let's get the churn score or the predictive LTV value into Klaviyo or whatever tool, right?
Like, so they can integrate it into their workflow.
There's no disruption, right?
But to some extent, that can create a dynamic where,
I don't want to say this,
I mean, it almost looks too easy to where all of the work that went into that
from the data team is undervalued.
And so you can't get another head on your team.
How do you think through that element, right?
Because it's a lot of things,
John and I talk about this a lot.
A lot of times things that are really well done
seem easy when you see the final product, you know,
which is great.
And that's part of the point,
but then you don't want that to come back and bite you.
Yeah.
Well, I mean, in marketing, right?
That's, we've talked about that a lot in marketing,
whereas you read through something
and it's like perfect logical flow,
good messaging, like all the things.
And you think in your mind, like, I could have done that. Yeah. Right.
And then when you're actually on the marketing side of it, trying to do that, it's impossibly hard. It's very difficult.
Yeah. Yeah. And data has a little bit of an advantage over that, because there's at least the technical aspect, like, well, that seems kind of hard. But there still is that really clean, delivered product of, like, oh, all you did was fill out CLTV in Klaviyo. How hard is that?
Yeah, yeah. Yeah, I was actually talking to a head of data recently who's having this problem right now. And we were kind of, you know, half joking, half talking through, like, what do you do? Because he has made things look really easy, and then, you know, the marketing team is coming back and being like, okay, well, you could just do this, and it'll take like a week, right? And actually educating them on how hard the data world is. Like, you know, just getting clean data is really hard. Just tracking customer interactions, really hard.
Yep. As you guys know very well.
So yeah, there is a bit of internal marketing,
and I think also good data leaders.
They're mixed between marketers
and kind of product managers.
I'm a big believer in the kind of like
product mindset internally.
Yeah, you got to do a little bit of both.
Yeah, I love it.
Okay, we're going to switch gears here
because, John, I know you're chomping at the bit
with a bunch of technical questions
and I cannot wait to hear about this.
I'm going to ask a question to transition.
Let's talk about Wild now.
Can you give us
just a... My question is
what's happening under the hood?
You're connecting to
either a table in the warehouse that has
certain data or Shopify. Let's just use the Shopify example. What data are you pulling in from Shopify, or do you access from Shopify?
Yeah, we try and keep it pretty narrow. We're looking at some customer and demographic information, and we're looking at a lot of transaction, kind of order-history, data.
Okay. I mean, that's it, right? So, yeah, can you give us a sense? Is that like 30 columns or like six columns when you pull it from the API?
You know, there's, I think, a couple hundred just from those two endpoints, really.
They just blow everything out, right?
All right, yeah, they return everything.
Yeah. We look at like five or six columns.
Okay. Wow. Yeah, so that is narrow. And I think you said no PII, like you can do it without PII?
And we can do it without PII.
Yeah, you can hash an email before you send it to us.
Yeah, we're trying to keep the scope really narrow
because I think a lot of folks want to fit
as many demographic pieces of information
or, you know, interactions in.
And again, as you guys know,
it's really hard to collect that information.
It's really hard to clean it and organize it.
And so, you know, I think like our onboarding engagements would be like five times longer
if we wanted to collect a bunch of information from different platforms.
So just keep it super narrow and we get 95% of the benefit.
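The "hash an email before you send it to us" idea above is a one-liner in practice; the normalization rules (lowercase, strip whitespace) are an assumption, not a documented Wilde requirement:

```python
# Hashing an email before sending it to a vendor, as mentioned above:
# the vendor gets a stable pseudonymous join key, not the raw address.
# Normalization rules here (strip + lowercase) are an assumption.

import hashlib

def hash_email(email: str) -> str:
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

key = hash_email("Jane.Doe@example.com")
print(key)  # 64 hex chars; the same address always yields the same key
```

Because the hash is deterministic, both sides can still join records on it, while the raw address never leaves the brand's systems.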
And I think from talking to you in the past, you've realized with the models that you're working on now, the signal-to-noise ratio: if you pulled in every single data point you could from Shopify, super high noise. But as you narrow it down, the beauty in a really good predictive model is, we know the five things that matter, or however many it is, right? And then for everything else, if we get a slight increase, you have to ask, is it worth it? And is it truly a slight increase every time, or is it a one-off? I think that's the beauty of simple inputs into a sophisticated model.
Yeah. I mean, conceptually speaking, when you're talking about retail purchases, online or in-store, there's a lot exogenous to anything that you can measure.
So if I go down to my bodega guy like every day, I might be a super loyal customer.
But if I move apartments, I'll never go back to that bodega again.
And that bodega guy is probably never going to know what the reasoning was. But there are a lot of reasons outside of our actual purchase behavior or interaction with a brand that dictate our journey with that brand, right?
I still remember, I did a lot of work with Shopify, a lot of work with Shopify apps. I still remember this sales pitch, because we were really wrestling with the pricing problem. We had thousands of SKUs, and pricing is hard, especially at scale. I still remember this model this guy was selling me, and he was like, yeah, we take in hundreds of data points, we look at behavioral data, we look at visits, and we produce dynamic prices for each of your, you know, 20,000 items. And we demoed it, and A, there was basically no way to prove, like, cool, is this more sales or more margin? They didn't have that built into the product yet. And B, we ended up with, like, whoops, new pricing for 20,000 SKUs, so we had to roll it all back, which was a nightmare. But that was a really good lesson of, like, all right, less is better here. And understandable is way more to be desired than something that's eking out each little percentage of, quote, efficiency in a model.
Yep. Pragmatism might be the most important characteristic in any of these models, right? Interpretability and getting it out the door really quickly is going to get 95% of the value that your stakeholders expect.
Right.
All right.
So tech stack.
Yeah.
You have to go there, right?
Yeah.
Yeah.
So, okay.
So we started with, you know, Shopify API endpoint, maybe a data warehouse.
Yeah.
So what happens next in the high-level flow?
Yeah, so the Shopify integration is relatively new for us. I mean, we've been very warehouse-focused for a while now. And so my co-founder and I were kind of looking at different technologies. Because we're starting from scratch, we have kind of freedom to do what we want. And so we landed on, basically, the flow for us is: we land data in S3 for cold storage. We use dbt for transformation.
Awesome tool.
And then we're actually using DuckDB and MotherDuck for all of our kind of storage and transformation warehousing needs, on the back end and the front end.
So that's been... yeah, I've been learning that stack lately. This is definitely a first. I don't think we've had anyone on the show who has used DuckDB in production in this way.
yeah i think it's the first and we were talking before the show about BI tools and browser BI tools. So if you remember, I guess it's been 10 plus years now since Tableau came out. And like one of the major things there was their query engine or their storage solution, like as part of the tool where you can extract the data and then you can like manipulate it on your desktop and this amazingly like fast experience that was like one of the big
deals there then they take it to the web ironically right and like they had a bunch of trouble early
on i remember they like hired somebody from aws to try to help figure out the like web version of
tableau you know obviously eventually got something that, that was good enough, but that I, but I still remember that initial experience of like, Hey, I've got this
massive, like millions of lines file. I extract it and I can use it in Tableau and it's awesome.
So tell me about your workflow and like how that might, like, I think you've had a similar
experience with DuckDB, different workflow, but maybe similar.
Yeah, so a couple things that we've really liked
working with it is
first off, it pulls
the front-end analysis that we're
doing a lot closer to the
data team.
So I don't
know any software engineering front-end
or back-end really, but I'm
an okay data guy
and I can go into our data stack, right, like our actual proper data repo, and modify the queries
that exist on the front end. And so having that close connection with the front-end web app
is kind of ridiculous for any data team. So what you're saying then is it's almost missing
that traditional middleware, middle layer piece
that you normally see.
There's not that handoff organizationally
from data to a software team.
I'm like, okay, now we need to abstract what's going on here.
We're going to have to move it into some other framework.
We're doing SQL queries from the front end
and it's fast
so I know there's got to be
some software engineers listening
that are like no this is a terrible idea
here's all the reasons
you need that layer
DuckDB is such a great polarizing topic.
My co-founder is a software engineer, and he's, you know, coming to
the data stack, and he's the one who's really been pushing for that. So he can go fight all the
software engineers. Yeah, I was gonna say, yeah, we'll put him on the LinkedIn.
Yeah, yeah, it's been great. I think another
awesome benefit for us is
on the backend analysis.
We do this cold storage
in S3.
If I want to run an analysis on
data that we have parked,
you can basically glob everything from different S3 buckets.
So, you know, we'll do like bucket star, and I can select a bunch of different S3 buckets simultaneously.
And so we can basically do like ephemeral analysis on multiple brands without actually joining and moving that data
together. So that's been kind of a nice added benefit for us. Oh, wow. So have you looked
into Iceberg at all, like as part of it? We haven't, yeah. No, I think Patrick did a little bit and was
like getting very intrigued. So I was asking about it the other
day. Is Iceberg the one that was just acquired recently? Yeah, like the other one, the commercial...
Yeah, the commercial part of it was acquired by Databricks. Okay. Yeah, right. Yeah, we were talking about
it. I think that's on the horizon for us. Have you played with that at all? Not really, but I was
reading about this really interesting workflow with Snowflake of basically people using Snowflake for the write layer into Iceberg and then DuckDB as a read from it.
So then you're like cutting your Snowflake compute, right?
Because it's just being used in the ingestion.
But the read out of Iceberg tables is just straight with MotherDuck or DuckDB.
So, like, I mean, it'll be so interesting because Iceberg is an open format.
Even though the, like, commercial, you know, commercial company got acquired by Databricks.
Like, that's still an open format.
Yeah, it's in Apache.
Yeah, yeah, incubated project.
But it'll be so interesting
to have, like, all right, so say I want to store everything in that format, and then you
just have these engines, right? You're like, all right, Snowflake engine, you're gonna do my
writes; DuckDB, you're gonna do my reads; or, you know, any other number of combinations. Like,
it's going to be really fascinating like the cost savings right and then
just creative things you can do when you're able to modularize and split up you know like that and
then i'm sure there's some kind of ai like application here too where you've got everything
in like the same format like it'd be easier to access i've been seeing some of these narratives
recently, and I haven't gone super deep, admittedly, but do you think this kind of structure hurts or helps a Snowflake or a Databricks?
Like, an open structure, like using Iceberg, yeah, where you don't need to put your
storage into one of those platforms and where they become purely a compute layer. Well, I don't know. I don't think it hurts
them that much, because of the way they're going. So, like, Snowflake, just really, like,
the last couple weeks, I'm sure you've probably seen it, they've got the full Python
notebook experience in browser, you know, for Snowflake. They're doing that. They already have the Streamlit stuff. So, like, they're just going all out, like, all the things that we can use compute
for, and they're gonna have ML and AI models and stuff. So, like, compute time
is all going to be more and more used on that stuff. Like, you're going to be spending a
ton of money in compute for AI/ML stuff and a ton of money with them for your Python notebooks. And even maybe querying will start to move down
the list as far as like what you're spending money on. So I don't know. I mean, I think it
probably helps. It might help the industry in general and a little bit of like a rising tide
for everybody because you've got like kind of because you potentially like
depending on how it works out like you might end up with a standard of like pretty much everybody
just uses iceberg because it will work with databricks and it works with snowflake and you
know x y or z other thing that you want so that i don't know that might help the general industry
but it's hard to say whether i feel like it really helps like an individual snowflake or Databricks
or hurts them.
Yeah, it is interesting though,
the episode that we had with Andrew Lamb from Influx,
you know, and Influx does time series stuff.
So different use case,
but he was, I just made this connection.
One of the things that we talked about on that show was that his prediction is that things would move towards essentially having everything in S3 and then an ecosystem around that, to your point, you know, where it's like, okay, Snowflake does this, DuckDB does this, right? And building an ecosystem around that model.
And so, Clint, it's fascinating that you guys are actually,
I mean, you have adopted that for your product, right?
I mean, we were talking about this just in terms of analytical workflows
like within a company, right?
That that's actually how you run your entire product.
Yeah.
I mean, I think from what we've experienced so far,
like it is not as easy as just standing it up right now and, you know,
letting it rip in there.
I think there's probably a little bit of ways to go
in terms of accessibility.
Right.
But it's definitely interesting
and opens up some pretty cool capabilities.
I mean, I can speak to the DuckDB thing alone. I was telling you guys earlier, like we have more than 500 brands that we're
doing analysis for. And so we have all this transaction data. It's two and a half billion
dollars, with a B, in the last year in total GMV that these brands have done. And so I started
doing an analysis in DuckDB. I just pulled up a Jupyter notebook. You know, it's like a one-liner to connect to these S3 files, and I'm off and away writing SQL inside of
a Jupyter notebook. And on, you know, hundreds of millions of rows, I'm getting instantaneous
queries on our local machine. That's crazy. Yeah, and it's like, you know, just connecting to Snowflake from my Jupyter notebook
would kind of be a pain. Um, so yeah, there are some elements where it's just
so easy as an analyst to get something out, and I've just been able to focus on the fun of being
an analyst again rather than all the kind of engineering setup. Yeah. Isn't there some pretty
cool in-browser stuff you can do with DuckDB and MotherDuck? Yes, I think
a lot of that's related to what we're doing on the front end right now, which is like
we're basically running these SQL queries directly from the front end. Right. Yeah, I'm not well versed, I think, on the front end. I had some
discussions around this, and I wish I could represent it better, but it's that
same Tableau concept from where we started: basically, you have this
extremely fast, compressed data set where the query experience
feels just about instant,
but it's in browser, which historically
has been a huge problem for
just about any BI tool that I've used in
browser. And I'm sure they're
continuing to improve that part of the product, but
it's pretty cool
to see it. I'll be interested to see
what MotherDuck's
go-to-market strategy is, because they do kind of have
two disparate use cases right now,
which is like run stuff really fast on your local machine.
And then one that's,
you know,
run stuff really fast in your browser.
And I don't think they're necessarily mutually exclusive because obviously
we're using both with good effect,
but yeah,
it'll be interesting to see which one they kind of lean on
and which one proves more valuable.
Clint, one thing we talked about
before we hit record kind of related to this,
and it came back to mind
because you were talking about having,
you know, querying a bunch of different data sets.
Obviously, there's a security concern
related to that, right?
So, I mean, maybe you've stripped PII or whatever.
How are you thinking about that?
Because my mind is instantly going towards
all sorts of interesting use cases, right?
I mean, you can provide insights across different customers,
you know, because everyone's in retail.
You could provide sort of, you know, reporting, benchmarking.
I mean, there's all sorts of, like,
interesting product possibilities.
But from a data perspective,
you have to tread really carefully there, right?
Because, you know, there are agreements that you have with each customer about like how you're managing their customer data.
You know, security concerns around like if you're combining all of that in a single place and, you know, I mean, how are you approaching that side of it as you're working with data across all of your customers?
Yeah, so we, I mean,
we strip PII for everything.
We're hashing customer information
as well as oftentimes
when we're joining information,
we often are looking at
merchant anonymized information as well.
So it's kind of like the first layer.
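As a rough sketch of that first layer: a salted one-way hash replaces raw identifiers with stable join keys before anything lands in storage. The salt value and field names here are hypothetical, not Wilde's actual scheme.

```python
import hashlib

SALT = b"per-merchant-secret"  # hypothetical; in practice a secret kept per merchant

def hash_pii(value: str) -> str:
    """Stable, one-way key for a customer identifier (email, phone, etc.)."""
    return hashlib.sha256(SALT + value.strip().lower().encode()).hexdigest()

record = {"email": "Jane@Example.com", "amount": 42.0}
# Only the hashed key and non-PII fields ever land in cold storage.
safe = {"customer_key": hash_pii(record["email"]), "amount": record["amount"]}
# Normalizing before hashing means the same customer always gets the same key,
# so joins still work even though the raw email is never stored.
print(safe["customer_key"][:12])
```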
The second is we actually spin up
separate DBs for each customer.
So each customer lives in their own DB environment.
And then when we join it, it's being joined similarly using DuckDB.
So there's no like hard table where the data is landing.
So I think we probably have some work to do on all of that, but like it gives us a pretty good model where we're both getting some
flexibility without just mixing a lot of data. And to be honest, like I talk to a lot of vendors who do
kind of push all the data together. It is like a standard. Yeah. Yeah, it's pretty standard. Yeah.
yep yeah we're trying to be data conscious on it. And one thing I will throw out, like none of the tooling out there
is really designed to work across
a bunch of these databases.
And so we're really having to like
grok a few of the tools,
you know, because we're basically running
a different dbt instance for each.
We have one central dbt repo, obviously,
but like each customer is getting
their kind of like own
dbt runs. And so it's, yeah, it's a lot harder to manage this way. Yeah. Yeah. All right, well, we're
getting close to the buzzer here, but we had talked before recording about clean rooms, and that's
probably a good place to end, as far as what you're thinking about for the future of Wilde. So tell us about clean rooms. How does that relate to what you're doing
as a product? Yeah, I think at first blush it feels far afield, but what we've learned, you
know, looking at 500, 600 brands' data at this point, is that a lot of data exists outside of the Shopify ecosystem, because so many
of these brands have gotten omni-channel now yeah a lot of them are selling in retailers a lot of
them selling in Amazon and so I started doing research earlier this year on like okay if I'm
a big brand how do I solve this because I'm not going to be satisfied just not having this data
right we kind of started getting into clean room world. And what we really learned there is like accessibility for
clean rooms is a huge issue.
Obviously, you know, LiveRamp and Snowflake both have products.
They acquired two companies for data clean rooms, but they're technically expensive
and monetarily expensive. And, you know, most retail
brands are not using those technical tools and so what we've been exploring lately is basically
productizing a lot of these clean rooms so we can continue sharing data with brands but then also
with their retailers oh Oh, wow.
Okay, so, but you're sort of building it on like existing clean room technology
from someone like a Snowflake?
No, so we, no, we're not actually.
We'll build some of that ourselves.
Yeah, we have some hypotheses about that,
but yeah, probably too early to say now.
Yeah.
But yeah, we'll be building our own stack for that.
Love it. All right, John, any final questions before we hop off? No, I think the data sharing part. Like,
if you don't know what a clean room is, right, maybe a quick little definition of that for somebody. And
then, in general, I think data sharing is a really big place for this stuff to go next, whether
it's sharing to be in
app, like, I use Klaviyo and I want to share it to Klaviyo. I don't want to, like, ETL it. Like,
that's too hard. Like, let me just share it. Or I want to share it to Salesforce or whatever. So I
think that general concept is big. But if you could just focus the clean room piece, like, tell
people what that is. Yeah. So I actually really dislike the term clean room. We refer to them as
collaboration rooms, which I think a bit better explains what you're actually trying to do
rather than what the tech is. You know, effectively, if, John, you own a brand and I own a brand, and we
want to share information about our customers neither of us wants to share a list of our
customers. We don't want to expose that. And so you can use these clean rooms, or as we call them,
collaboration rooms, as basically a third party where you can dump the information in. And then
neither of us can look at the individual PII, but we can do aggregated queries of that data,
kind of predetermined aggregated queries. And so,
conceptually speaking, it sounds a little bit esoteric, but the actual use cases are quite
interesting. So, you know, Amazon has a clean room solution. And so, actually, if you're
running on Shopify and Amazon, you can do things like give Amazon a list of your Shopify
customers so that you can target them in Amazon's ad platform.
And Amazon won't actually know who those customers are.
And you can do that same thing with Google and Facebook,
a few other platforms, the TV platforms have the same technology.
It also means that you can go to a retailer.
So if you're selling in Kroger,
you can get customer level sales information from Kroger.
All of that is like kind of inaccessible to most brands because of their revenue and because of the tech requirements.
But the big brands can tell you how many new versus returning customers they have in a retailer.
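The mechanics Clint describes can be sketched in a few lines of plain Python. The hashed email lists and the threshold value are invented for illustration; a real clean room runs this logic inside a neutral environment that neither party controls. Each side submits hashed identifiers, and only aggregates above a minimum size ever come back out.

```python
import hashlib

def hashed(emails):
    """Each party hashes its own list before submitting it."""
    return {hashlib.sha256(e.strip().lower().encode()).hexdigest() for e in emails}

brand = hashed(["ann@x.com", "bob@x.com", "cat@x.com"])
retailer = hashed(["bob@x.com", "cat@x.com", "dan@x.com"])

MIN_AGGREGATE = 2  # refuse answers small enough to identify an individual

def shared_customer_count(a, b):
    """A predetermined aggregate query: the overlap's size, never its members."""
    overlap = len(a & b)
    return overlap if overlap >= MIN_AGGREGATE else None

print(shared_customer_count(brand, retailer))  # 2
```

Neither party ever sees the other's list; the only thing that crosses the boundary is an aggregate, which is what makes "new versus returning customers at a retailer" answerable without exposing any individual.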
That's fascinating.
Yeah, that is fascinating.
Super fascinating.
It's been pretty fun to learn about.
Yeah, for sure.
Well, as you build that product out,
keep us posted and we'll have you back on the show
because I think that's a huge topic for us to tackle.
Yeah, definitely.
That'd be awesome.
Clint, well, thank you so much for joining us on the show.
It's been a fascinating conversation
and we'll have you back on sometime soon.
I'd love that.
It's been a blast.
Thanks for having me, guys.
Yeah, thanks, Clint.
The Data Stack Show is brought to you by Rudderstack,
the warehouse-native customer data platform.
Rudderstack is purpose-built to help data teams
turn customer data into competitive advantage.
Learn more at rudderstack.com.