The Data Stack Show - 226: Building Trust in Marketing Data: An Engineer's Guide to Attribution Architecture with Lew Dawson of Momentum Consulting
Episode Date: January 29, 2025

Highlights from this week’s conversation include:
Lew’s Background and Journey in Data (1:06)
Attribution Challenges (2:16)
Attribution War Stories (8:09)
Defining Attribution (12:32)
Complexities of Attribution (16:08)
Multi-Touch Attribution Challenges (21:31)
Campaign Creation Difficulties (23:27)
UTM Parameters Explained (26:01)
Challenges in Data Extraction (31:17)
Transforming and Merging Data (36:28)
Behavioral Data and Identity Resolution (40:29)
Hierarchical Structure of Campaigns (44:03)
Challenges of Data Consistency (49:38)
Mitigating Freeform Data Issues (52:21)
Creating Unique Join Keys (55:30)
The Importance of Defining Requirements (58:42)
Final Thoughts and Takeaways (1:00:39)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Hi, I'm Eric Dodds.
And I'm John Wessel.
Welcome to the Data Stack Show.
The Data Stack Show is a podcast where we talk about the technical, business, and human
challenges involved in data work.
Join our casual conversations with innovators and data professionals to learn about new
data technologies and how data teams are run at top companies.
Welcome back to the Data Stack Show.
We have a special guest today, Lew Dawson,
from Momentum Consulting.
Lew, you have such an interesting background and have done lots of different things.
We met when you were a RudderStack customer,
now you're a RudderStack partner.
And so you and I have actually talked many times about one of our favorite subjects, which is attribution and all of the related data and reporting stuff. And so I am pumped to spend
a whole hour talking with you about that. So welcome to the show, and give us just a high-level
background of your journey in data.
Yeah, thanks, Eric.
Awesome to be on the show.
Thanks for letting me come on.
In short, my background real quickly.
I started writing code back in the late '90s, building websites.
So I got started really early, loved it, and have been doing it for over 25 years now.
I got started early in the data warehousing space and spent a long time doing that. Then I moved over to the marketing space, doing the early days of MarTech and implementing a lot of MarTech technologies from scratch for companies.
I did a little cybersecurity in there, and now I'm back to really solving MarTech full time. That's the niche I found where businesses really need help and can really use my consulting services: how do you implement a proper and great MarTech ecosystem?
So that's where we are today.
And that's how I got here.
Awesome.
So Lew, you were talking before the show about attribution.
And we're going to dig deep today.
We're going to be pulling
out wires, asking, you know, where did that one go? It's going to be fun.
Yeah. So what attribution
topic are you most excited to jump into?
Oh man, attribution is a deep and wide topic. I think
this one interests me immensely because it's a hard business problem and a hard data problem to solve.
It touches every facet of a business and every facet of data, starting with coordinating with leadership, product, and marketing.
So yes, you have to deal with people.
It's scary, right?
All the way to those really scary, down-in-the-basement engineers.
Then, talking about the data side for a second, you have to figure out: how do I model my
data? How do I make sure my data is accurate? And how do I accurately represent it to the people who
care about that data so they can make good marketing decisions? And it's a cycle that continues over and over.
And hopefully, if done right, you optimize your ad and retention ecosystem, and you keep
getting better and better and better.
And you continue to grow conversions by using that attribution data.
How hard is it to get there?
You made it sound so simple. We're going
to break it down today. So let's dig in.
Yeah, let's do it.
Lew, I'm so pumped to have you on
the show. And I'm kicking myself because we have talked a lot, you know, I guess over the
last year plus.
Like every week. Every week you talk.
Every week, every single week. And
somehow I just haven't invited you on the
show. I've been keeping the secret of our conversations, but now we're going to expose
that to the world. So you gave us a brief overview in the intro, but go back maybe just a couple of
roles. We met when you were doing data and MarTech stuff at Allbirds, which is a RudderStack customer.
And so that's how we met.
That's how we connected.
We've maintained our friendship.
Now you're doing consulting.
So go back a couple of roles,
maybe prior to Allbirds,
and tell us kind of about that story.
And then give us an overview of your consultancy, Momentum,
and the types of projects that you work on.
Yeah, of course.
In short, I really got started in the entire data ecosystem back
before I got out of college.
I worked at Teradata for a long time.
The data warehousing company, probably a lot of viewers are familiar with it
to some degree.
And they were one of the primary vendors at that time for data warehousing.
And so I got a lot of exposure to data warehousing there,
large-scale processing of data.
And then somehow, I don't recall the details, doesn't really matter,
I moved over to Intuit for a while
and early on was tasked with rewriting the personalization engine
on the marketing website. A lot of that was: how do
we optimize what the customer sees on the marketing website, so the engagement part of it?
When they come, how do we really optimize for conversion when they get here? How do we get
them into the product? And so that was really my big exposure
and my big realization that this is a cool technology.
This is a cool area to focus on.
And I like it.
Like, it's really interesting, interesting problems to solve.
So built that for a while.
Then, like I mentioned, cybersecurity for a while.
I've always been interested in security.
That's less interesting on this show, probably.
And then ultimately
I ended up at RudderStack
through an acquisition.
Sorry, not RudderStack, my apologies. At Allbirds.
And I ended up on the data team.
And that was early on.
We wanted to implement Customer 360
so we could improve
our acquisition
and retention campaigns,
but especially retention.
And so we developed out,
well, we partnered with RudderStack early on.
I think y'all were a really, really early stage startup
at that point, if I remember correctly.
I think we were one of your earlier customers.
And so we worked with you on a lot of stuff.
I think if I remember correctly,
some of what we ended up implementing
fed directly into
some of your requirements, so, you know, we built off of each other.
Yeah. An awesome relationship.
Yeah. Ultimately that culminated in a somewhat successful Customer
360 at Allbirds, and then in me realizing that I really enjoy doing this for a lot of different
customers and I enjoy the data space. And that got me to consulting for multiple different companies, doing a myriad of different
things in the MarTech space. But I do always love talking about acquisition because it's such a
challenging problem.
Yeah.
And so that's what I'm doing today with Momentum Consulting:
anything MarTech related. I do other stuff outside of that as well, but it's generally
marketing-focused MarTech. And the niche I've really carved out for myself, in short,
is a combination of providing a strategy for how to implement MarTech
for folks from marketing and product all the way up to leadership, so communicating with them, getting
their requirements, et cetera, and occasionally
actually implementing solutions, working with them, either me or a team of people, to
implement them. So that's what we do at Momentum Consulting.
Love it. Okay, I want to start
out. There's so much to cover, but I want to start out with a question for both of you (this question is for you too, John): maybe brief
anecdotes about an attribution war story
that was either wildly successful or a huge
failure. Either one, but it needs to be on either
end of the spectrum. So John, why don't you go first?
So, Attribution War Story,
huge win or huge failure?
Oh, I'm definitely going huge failure
because it was the most fun.
It was kind of a two-part failure.
Those are kind of more frequent.
Yeah.
Those come to mind more quickly, too.
This was a fun one.
It came to mind during prep, actually.
And so you picture yourself,
this was several years ago now. It's the board meeting, and at the time I'm sitting in an IT spot. I eventually started managing marketing and IT. But I'm sitting in an IT spot, and we had a marketing leader in there,
at the board meeting, presenting. And they're presenting the overall acquisition performance
and talking through that. So they're presenting the ROAS, the return on ad spend,
a super common metric, and they're saying it's so good, things are great, it's 800 percent
return on ad spend, which is quite high.
That's, yeah, quite high.
And my data brain starts churning a little bit. Fast forward a little
bit: I ended up taking over that group and digging in really deep on the attribution and all
of that, and we found two major problems, one of which I think was already there and one
of which I think we created. The first one was the most obvious problem, but it happens a lot:
conversion events were firing twice. So that eight was a four. And that is a massive
financial difference if you're trying to understand your ROI on ads.
And your willingness to deploy budget.
Yeah, right.
So that was an early find of like, ooh, this is not good.
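To make John's "that eight was a four" concrete, here is a minimal Python sketch of how double-fired conversion events inflate ROAS and how de-duplicating on an order identifier corrects it. The `ConversionEvent` type, the `order_id` field, and the dollar figures are assumptions made up for illustration; the episode does not describe how the fix was actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversionEvent:
    order_id: str   # hypothetical unique order identifier
    revenue: float  # order value in dollars


def roas(events, ad_spend):
    """Return on ad spend: attributed revenue divided by spend."""
    return sum(e.revenue for e in events) / ad_spend


def dedupe_by_order(events):
    """Keep only the first event seen for each order_id."""
    seen = {}
    for event in events:
        seen.setdefault(event.order_id, event)
    return list(seen.values())


# Illustrative data: every order fires its conversion event twice.
raw_events = [
    ConversionEvent("A-1", 200.0), ConversionEvent("A-1", 200.0),
    ConversionEvent("A-2", 200.0), ConversionEvent("A-2", 200.0),
]
ad_spend = 100.0

inflated = roas(raw_events, ad_spend)                    # 8.0, reported as "800%"
corrected = roas(dedupe_by_order(raw_events), ad_spend)  # 4.0 after de-duplication
print(inflated, corrected)
```

De-duplicating on a stable key only works if every conversion event carries one, which is itself a tracking requirement worth checking early.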
And then the second one was just a bizarre one and was hard to find.
So this was a B2B site.
We had some larger orders, but not every order was large. There was some bizarre bug where orders over a thousand
dollars didn't get captured correctly. It had something to do with placement. So
on a typical day we would have several orders over a thousand, but not a lot,
and it was just off. And it was the hardest thing to find because, you know, odds are, if you
pick a
random order as a sample, it's not over a thousand dollars. But there were enough of them that
it was a big problem overall. So those are two attribution data
challenges where, yeah, it was tough.
All right, your turn, Lew.
Yeah, like John was saying,
the failure comes to mind the quickest.
And the one that comes to mind immediately is attribution,
specifically conversion.
Well, attribution stopped working;
specifically, conversion events stopped firing in a lot of cases.
And no one noticed this for a while, right?
So you basically see a massive drop-off.
And theoretically they're depending on this data in order to make decisions on how to reallocate ads. But there was a failure on both the data side and the marketing side. On the data side, we didn't notice because we weren't alerting on it. And on the marketing side, it's like, were you guys actually using the data and paying attention to the
data? How did you not notice a massive drop-off? Right.
So that's definitely the one that always sticks with me.
You really need robust alerting and monitoring mechanisms on your
data, which is one of the many, many problems for acquisition
you have to solve.
Yeah, yeah, totally.
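As a rough illustration of the kind of alerting Lew is calling for, the Python sketch below flags a day whose conversion count falls far below the recent average. The function name, the seven-day window, and the 50 percent threshold are all assumptions chosen for the example; a production monitor would likely account for seasonality and day-of-week effects.

```python
from statistics import mean


def conversion_drop_alert(daily_counts, window=7, threshold=0.5):
    """Return True when the latest day's conversions fall below
    `threshold` times the mean of the preceding `window` days."""
    if len(daily_counts) < window + 1:
        return False  # not enough history to judge
    baseline = mean(daily_counts[-(window + 1):-1])
    latest = daily_counts[-1]
    return baseline > 0 and latest < threshold * baseline


# Illustrative daily conversion counts (oldest first).
healthy = [120, 115, 130, 125, 118, 122, 127, 121]
broken = [120, 115, 130, 125, 118, 122, 127, 12]

print(conversion_drop_alert(healthy))  # False
print(conversion_drop_alert(broken))   # True
```

A check like this would run on a schedule against the warehouse table that holds conversion events, so a tracking outage surfaces in hours rather than weeks.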
Okay, well, let's start our deep dive.
And we're sort of at the edge of the hill here. And I'll nudge the car towards the slope.
Towards the precipice, yes.
Okay, I want to talk about why attribution is hard.
But Lew, can you just give us a high-level definition
of attribution? What is the business problem that you're trying to solve using data? Because I think
this is probably something that a lot of our listeners have exposure to, but perhaps some
of them don't, and the levels of exposure may differ, and it can look so different at different
companies. So kind of level set us with just a really high level definition. What
problem are we trying to solve when it comes to the subject of attribution? Yeah, I think this is
actually one of the challenges: defining this. This is one of those very early challenges
among the many challenges of attribution. But in short, attribution is taking all of the traffic that you receive
into your engagement properties,
so where the customers are coming
to actually do their final conversion.
So taking e-commerce, for example,
like a website that's selling things:
you want to understand, if they converted,
so they checked out, they bought a product,
where did they come from?
You want to attribute where they came from to an order to understand essentially your customer acquisition cost, to understand how well your ads are performing.
So you mentioned ROAS earlier, John, things like that. At the end of the day, you want to understand, number one, how efficiently am I spending my dollars? Number two, how well are my customers converting across my various channels? And then tangentially, number three, how well am I retaining customers across different channels? So that's the highest level. The last thing I'll say is it's a challenging problem because every business is a little different in how they want to look at it
when you dig into the details. And then further down,
different businesses have different stakeholders with different weight in that. Giving you a
slightly more specific example: at some businesses, let's say
the acquisition team is kind of the driver, so their leader has more weight than
the engagement or the retention leader. Or let's say the engagement leadership has greater weight;
they might care more about conversion. Right.
So especially at larger companies where it's KPI-driven development, let's say.
People care about getting promotions, so they care about boosting their KPIs.
Yeah.
They're going to potentially care more about prioritizing their KPIs and boosting their
metrics,
like conversion and engagement, versus maximizing revenue.
I think that's one of the challenges
of just defining the problem:
what are you trying to optimize for?
What are you trying to measure?
So I think that's such a good point,
but let's, okay.
And I think your definition is great.
You have customers coming,
you have customers coming through channels.
You use the example of an e-commerce website,
but it could be a store.
Actually, more and more, you have e-commerce companies that started online who are actually launching
brick-and-mortar presence.
You have these channels, and you want to know
we're trying all of these things to get more people
to walk into the store, to come to our website,
and then ultimately make some sort of purchase.
So when I hear that definition,
I think before I actually had to face this challenge,
it's really easy to think,
okay, I'm pretty savvy with technology and with data,
and so we have a set of channels,
and so we need a measurement mechanism.
We need to see the conversion.
I'm pretty good at math.
I feel like I can tackle that.
That is
not untrue,
but I think
it's easy
to start out with an idea of like,
okay, that doesn't seem like that hard of a problem.
And it actually turns out to be a very difficult problem.
But why is that?
Break down for us the different dimensions
of why actually putting that math problem together
is really challenging.
Because there is an entire multi-billion dollar industry
of software focused on this,
and that doesn't even include all of the effort and time
and compute that goes into companies
that are hand-rolling this on their own stack
and their own infra.
Yeah, absolutely.
It's a multifaceted challenge, like I said. From the
business perspective, all the people involved is one challenge. Just to give you one super
quick example, and we can dig into this later: you need to structure campaigns a certain way,
the wording, how you define them, et cetera. So that's a people problem, but it then becomes a data problem.
So then getting to the data side, there's a massive number of data challenges to actually make this
work. Again, using that same example, you get the data on the other side. Well, what if the
campaign name is not the same every single time, even when it comes from the same source, right? What if someone's browser mangles it?
Well, now you can't attribute without additional logic.
You can't attribute 100% accurately every single person coming in
and every single conversion, right?
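The "additional logic" Lew mentions could take many forms. As one hedged Python sketch, the function below collapses cosmetic variations of a campaign name (URL encoding, case, spaces versus hyphens) into a single key. The normalization rules and the sample campaign names are assumptions for illustration, not the approach described in the episode, and real data would need rules tuned to how names actually get mangled.

```python
import re
from urllib.parse import unquote_plus


def normalize_campaign(raw):
    """Collapse cosmetic variations of a campaign name into one key."""
    text = unquote_plus(raw)                # undo URL encoding (%20 and +)
    text = text.strip().lower()
    text = re.sub(r"[\s\-]+", "_", text)    # spaces and hyphens become underscores
    text = re.sub(r"[^a-z0-9_]", "", text)  # drop any other stray characters
    return re.sub(r"_+", "_", text).strip("_")


# Four spellings of what a marketer would consider the same campaign.
variants = [
    "Spring_Sale_2024",
    "spring%20sale%202024",
    " spring-sale-2024 ",
    "Spring+Sale+2024",
]
keys = {normalize_campaign(v) for v in variants}
print(keys)  # {'spring_sale_2024'}
```

One caveat: `unquote_plus` treats a literal `+` as a space, which is only correct when the value has not already been URL-decoded upstream.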
So data challenges are, I don't want to say immense here, but there are a lot of them
and they're complex. From the data side, there are a few challenges.
One I just highlighted is data accuracy: getting the data in fully accurately and correctly.
So that's acquiring it, transforming it, and spitting it out correctly.
Yep.
Then, we talked about getting the data in, but part of it is just generating the data.
So on your engagement portion, on your website or your mobile app, it's: am I even generating the
data necessary to track where someone came from and also what they purchased? Right. So again,
I came in, I checked out on my Shopify cart. How do we get the data that says I, Lew,
purchased this product and I came in from these channels? How do I then merge that data with two
things: the session data, sorry, the
behavioral data, and then also all the ad spend data that I pulled in? How do I merge
those together to say, oh yeah, for Lew, I spent 30 cents showing him an ad, I spent 10 showing him
that ad? Right. So you have to connect all that data. There's sort of a huge
data connection problem that's way more complex than it seems on the surface. Next, there's a
what-do-I-do-with-that-data problem. So it's, that's cool, you've connected it.
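The connection step described above can be sketched, in a very simplified form, as a join between order rows and ad spend rows on a shared campaign key. Everything in this Python example is hypothetical: the field names, the campaign names, and the assumption that each order already carries a clean campaign value. In practice this join usually happens in SQL in the warehouse and is far messier.

```python
from collections import defaultdict

# Hypothetical order rows, each already tagged with the campaign behind the session.
orders = [
    {"order_id": "o1", "campaign": "spring_sale", "revenue": 80.0},
    {"order_id": "o2", "campaign": "spring_sale", "revenue": 120.0},
    {"order_id": "o3", "campaign": "brand_search", "revenue": 50.0},
]
# Hypothetical spend rows pulled from the ad platforms.
spend = [
    {"campaign": "spring_sale", "cost": 40.0},
    {"campaign": "brand_search", "cost": 25.0},
    {"campaign": "retargeting", "cost": 10.0},
]


def campaign_report(orders, spend):
    """Full outer join of revenue and cost on the campaign key."""
    revenue = defaultdict(float)
    for order in orders:
        revenue[order["campaign"]] += order["revenue"]
    cost = defaultdict(float)
    for row in spend:
        cost[row["campaign"]] += row["cost"]
    report = {}
    for campaign in sorted(set(revenue) | set(cost)):
        r = revenue.get(campaign, 0.0)
        c = cost.get(campaign, 0.0)
        # ROAS is undefined when nothing was spent on the campaign.
        report[campaign] = {"revenue": r, "cost": c, "roas": r / c if c else None}
    return report


report = campaign_report(orders, spend)
for name, row in report.items():
    print(name, row)
```

Using an outer join rather than an inner join keeps campaigns that spent money but produced no orders visible, which is often exactly the row a marketer needs to see.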
You now have data, but the presence of data alone doesn't help you. Now you have to figure out what to do with that data so that it gives you something
you can take action on to evolve your business, to improve your conversion,
to improve your revenue and profit.
And that's a challenge on its own.
It's like, how do you first figure out what's important?
And then whose hands do I get that into so the correct decisions can be made and we can evolve
these campaigns: boost the good ones, kill the bad ones? And then lastly, again, it's a
people problem. How do I coordinate everyone to do all the things correctly across all these technologies we just talked about
to make sure that nothing breaks
and that everything is done
in a normalized enough fashion
that we can continue to do this over and over?
Did I miss anything?
No.
Lew, I want to expand on the people problem of this
because I think this is really fascinating. A hundred percent, I think you hit
all of the major components there. There's this people problem where people
have to do the right things, like you said: name the campaigns the same way
every time for the same campaign. But there's also this people problem where, at its fundamental level, we are taking this big pile of
money, of revenue for the company, and trying to figure out who gets credit for what. And that
creates drama in most companies, right? Like you said, if you're driving hard
on, all right, you're the Amazon channel
or you're the inside sales team
or you're the whatever,
each of them wants their fair share,
their fair credit or attribution
for whatever they contributed.
Many of them have financial incentives.
That adds a whole other dimension to this problem besides extremely big technical problems.
Yep.
Yeah.
And that's especially prevalent.
Again, you figure out what you're measuring,
and that adds a whole layer of problems, especially when you get into multi-touch attribution, which we'll talk about later in greater detail.
But in short, it's like people get partial credit.
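To show what "partial credit" can look like in practice, here is a small Python sketch contrasting last-touch attribution with a linear multi-touch model that splits an order's revenue evenly across every touch in the path. The linear model is just one common scheme picked for illustration; the speakers do not say which model they use, and the channel names and revenue figure are made up.

```python
from collections import defaultdict


def last_touch_attribution(touchpoints, revenue):
    """Give all of the credit to the final touch before conversion."""
    if not touchpoints:
        return {}
    return {touchpoints[-1]: revenue}


def linear_attribution(touchpoints, revenue):
    """Split the revenue evenly across every touch in the path."""
    if not touchpoints:
        return {}
    credit = defaultdict(float)
    share = revenue / len(touchpoints)
    for channel in touchpoints:
        credit[channel] += share
    return dict(credit)


# Hypothetical path to a single $100 order.
path = ["facebook", "email", "google", "email"]

last_touch = last_touch_attribution(path, 100.0)
linear = linear_attribution(path, 100.0)
print(last_touch)  # {'email': 100.0}
print(linear)      # {'facebook': 25.0, 'email': 50.0, 'google': 25.0}
```

The same order gives Facebook either nothing or $25 depending on the model, which is exactly the kind of difference that leads a channel owner to dispute the numbers.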
Yeah, that becomes a huge problem
when a business partner or stakeholder decides,
I disagree with that.
Like, I think I should have gotten more credit
for that one.
Yeah, there's all sorts of people issues here.
I think they're almost as prevalent
as the technology issues.
Okay, let's dig into the tech stack a little bit.
And Lew, let's walk through the sequence that you discussed.
Because I want to dig into the people side more a little bit later.
Because I think, arguably, to your point,
if you can solve the people side, then that actually paves a pathway for the tech side. But let's talk about the stack really quick, just to orient everyone. So we talked about collecting the data.
Are you even collecting the data? So let's start there before we even get to accuracy.
So where is this data coming from? What are the data sources, and what mechanisms are
you using for this capture? If you're going to go in and put
together a strategy, just describe the types of, you know, pipelines, I guess, or data sources.
But actually, take it up even one step from that, crazily enough. You know, it feels like
table stakes, but having been in a
number of the platforms, I have to say this:
even being able to create
those campaigns
comes before that.
Right. So it's like,
yeah, it sounds
stupid to say, but some of those platforms
are actually a little bit on the harder
side to even create
campaigns successfully,
to get them started.
And you're talking about someone
going into an advertising platform.
You have to create some entity that's a campaign.
It has to target some subset of users.
You're sending something,
text or images or something, that's going out
to reach these
people. It has to go to a valid landing page. Theoretically it should
be a tailored landing page. It is the easier part, but nonetheless it
still is a barrier in itself. For someone who's new to this whole paradigm of, let's say, an e-commerce website, that's the first barrier they have
to hop over: how do I even run an ad? And that on its own would take
time to learn for one platform, let alone, you know, Facebook, Google.
There are a number of different platforms, right?
So I would say that's first.
Yeah.
Speak to the listeners who are on the other end of the pipeline where the campaigns and landing pages are generating data,
but they're on the other end of the pipeline,
so they're seeing this come through,
and they probably see it as tables of data.
Speak to them a little bit about what are the things
that you would say, here are things to keep in mind
about that process of even, let's just call them assets.
You have to have some sort of assets
that are actually going to generate this data.
There's a campaign that's being served,
someone's clicking on something,
they land on some landing page or something like that, right? Which sort of ultimately
generates the data. What is the data professional on the receiving end of the pipeline?
What are the main things they need to know about that whole process?
You're referring specifically to all that data flowing in on the other end?
Yeah, totally.
Understood. Okay, there are a number of
things that have to be orchestrated on that end. Let me know if this doesn't completely answer
your question.
Yeah, yeah.
But there are a number of different areas that have to be orchestrated
together to get all that data, right, which we'll talk about in a second. But effectively,
that data only flows in if you enable the campaigns, and further, only if you are
collecting behavioral data manually or your platform is in some fashion collecting
the data, especially the UTM params that are in the URL.
Yep. And really quickly,
just for those who don't know what UTM parameters are, give us a
quick breakdown on UTM parameters, because I think that will become important later in the
conversation.
Yeah, it's kind of an antiquated paradigm and technology at this point. But in
short, well, two things. First, query params in a URL: after the question mark, you'll see key-value pairs.
So the key is some sort of text, and then an equals sign.
And then you'll see more text and then possibly an ampersand.
You'll see that repeating over and over.
Those are query params.
They give you the ability to essentially add additional data and/or metadata
that modifies the behavior of the experience the customer is seeing
in a lot of cases, or just tracks data.
So UTM is Urchin Tracking Metrics, I think.
I can't remember the M.
But nonetheless, Urchin was a company that, I would say to a degree,
was the initial starter of a lot of
what we would call modern analytics.
They were the company that developed
what is now Google Analytics.
Google actually bought them and turned it into Google Analytics.
So in short, there's a specific set of UTM params.
So UTM name for the campaign, or is it UTM campaign?
It's UTM campaign.
There are a few of those, and they're standard,
and they're used to track various dimensions
of a specific campaign.
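For reference, UTM is commonly expanded as "Urchin Tracking Module," and the five conventional parameters are `utm_source`, `utm_medium`, `utm_campaign`, `utm_term`, and `utm_content`. The Python sketch below pulls them out of a landing page URL with the standard library; the example URL and its values are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# The five conventional UTM parameter names.
UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content")


def extract_utms(url):
    """Return the UTM key-value pairs present in a URL's query string."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; keep the first one.
    return {key: query[key][0] for key in UTM_KEYS if key in query}


# Hypothetical landing page URL carrying three UTM params plus an unrelated one.
url = (
    "https://shop.example.com/landing"
    "?utm_source=facebook&utm_medium=paid_social"
    "&utm_campaign=spring_sale&color=blue"
)
utms = extract_utms(url)
print(utms)
```

Here `color=blue` is an ordinary query param that changes what the page shows, while the `utm_` keys only carry tracking metadata, which mirrors the distinction drawn above.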
Yeah.
So those ideally come in on every channel
and every time a user comes from an external site or an external entity
into your engagement experience. I say ideally because that doesn't always happen, due to a
myriad of reasons, and it's yet another reason why this is challenging.
Yeah, I think that's one of the
fascinating things. You know, query params are used for all sorts of things in software, right? I mean,
they can filter a list, they can do whatever, right? And I think when Urchin
decided to use that back in the day as essentially a way to capture metadata
about the source of where a user's coming from, it was a very elegant way to solve a pretty tricky problem in a ubiquitous
manner. Then Google Analytics as a free tool gets worldwide mass adoption as the go-to way
to track web analytics, which means UTMs, for better and now probably for worse,
are cemented as the way.
So you have five dimensions as key value pairs
that drive marketing reporting for most of the world.
And they're five arbitrary dimensions.
They're completely made up.
This is something I didn't know,
but they're completely made up.
You can type whatever you want.
It could be, you know, anything, and you can have as many as you want. But, like you said,
because of the Google Analytics adoption, these are the five that somebody at Urchin,
like 20 years ago, decided on, and we've kind of standardized on that.
Yeah, I think the other
part to that, like you were saying, John, is that in addition to people being able to decide what goes in there,
each platform suggests you use certain UTM params differently too.
Yeah, that's right.
To make it extra challenging. So it's like, here's how we generally do it on here, but you can do it
whatever way you want.
Yeah, it is the worst kind of standard because it's completely
unenforceable and interpreted differently, right? So
while there is a standard as far as these five things, people use them so wildly differently
it's almost not worth having it.
Right, right. Well, and that's kind of why I wanted to
speak to that a little bit for the person who's on the receiving end of that, because
my gut is to say, come on, we have five dimensions here,
this can't be that hard. But it
actually is pretty
tricky to get things tight, even just tagging those five dimensions as metadata.
I think at the end of the day, and this foreshadows a conversation we'll have later,
so we'll build up to it a little bit here, there are ways that you can do this.
You can make it work across all these paradigms.
And we'll unpack some of those
just to let the reader know some better ways.
Yes.
Ooh, I like that, Lew, foreshadowing.
Yes, actually, Lew, I'm excited.
You have some immensely helpful methodologies here
to help overcome that.
Okay, so then we have to collect the data.
And so you have to create the campaigns and the assets, then we're collecting data.
And so you're using pipelines to do that.
So there's probably behavioral data and structured data that's coming in.
Well, yeah, so collecting the data.
There are kind of two phases to collecting the data.
First, it's getting the data out of the source system,
so out of Google Ads, Facebook Ads.
Again, this whole thing is crazy,
and there's a myriad of challenges there.
Number one, everyone does it differently.
So the schema, the data structure, is
completely different.
And then number two, some of these platforms
make it
really challenging to get the data out, both because the naming is convoluted
and complex and because of throttling. Facebook is a great example of this. Their
paradigm for how much data you can get out within a time frame is completely dependent on your audience size, the audience that you reach in Facebook. So the larger the audience
you reach, the more data you can get out at a time, which makes sense when I say that out loud
at a high level, but it creates some pretty tough challenges when it's like, yeah, we're always
getting throttled, we're so far behind collecting the data.
So that's one thing,
just getting the data out of the source system.
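One generic way extraction jobs cope with throttling is to retry with exponential backoff. The Python sketch below shows the pattern only; `ThrottledError` and the `fetch_page` callable are hypothetical stand-ins, not part of any real ad platform SDK, and each platform documents its own rate-limit signals and recommended wait times.

```python
import time


class ThrottledError(Exception):
    """Hypothetical error raised when the source API rate-limits a request."""


def fetch_with_backoff(fetch_page, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch_page, waiting twice as long after each throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch_page()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of retries, surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Demo with a fake source that is throttled twice before succeeding.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError()
    return ["impression_row"]


waits = []
rows = fetch_with_backoff(flaky_fetch, sleep=waits.append)  # record waits instead of sleeping
print(rows, waits)  # ['impression_row'] [1.0, 2.0]
```

Backoff keeps a job from hammering a throttled API, but it does not raise the quota, so falling "so far behind" can still happen when volume outpaces the limit.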
And then the other challenge,
which is a little bit easier,
is getting that data into a place where you can transform it,
where you can do this actual attribution.
Generally, that's going to be a data warehouse.
Sometimes people favor data lakes, and sometimes
they'll go data lake, so like S3, and then into a data warehouse. But nonetheless,
wherever you store your data, you have to get it in there, right?
And we're talking some pretty large volumes of data for some of these companies.
It's not trivial.
It's not data
that'll just take 30 seconds to stream. We're talking about impressions. Go ahead.
Yeah, I mean, there's also
just this like bad alignment with some of these companies with like your interest in like google
meta whoever's interest as far as like they don't want you to get the data out they just want you
to trust like they're like oh like you're get the data out. They just want you to trust. Like, they're like, oh, like your, you know,
return on ROI is this or whatever is this.
Like, they don't really want you to dig into it.
I mean, let's face it.
A, it's better for them,
because, you know, it's costly to be streaming all that data
out of their system.
That costs them money.
And then B, for the bigger thing of like, yeah, just trust us.
Like, we'll tell you if it's going well or not.
Yeah.
That's a fantastic,
like foreshadowing point too,
that we'll have to touch on.
It's like,
yeah,
well,
how does,
how does Facebook,
how does Google track a conversion?
Are they tracking the same way?
Are they tracking like every single user who came to your site?
Does that count as a conversion?
Like they,
they say they don't,
but it is a black box.
And when you go and calculate some of these and you compare them,
they're wildly different.
Like your calculation with your RudderStack behavioral data versus
their calculations.
It's like,
so sometimes you question,
is the fox guarding the henhouse?
Because they're incentivized to boost the conversions you're seeing,
because then it will theoretically grow their ad revenue,
because you'll be like, oh yeah, I'm going to spend more. Yeah. Well, yep. So
that's an interesting call, John. Good.
Yeah. And then you have the attribution fighting
problem too, where, say,
you've got multiple platforms you're using for advertising,
multiple for retention.
You've got this kind of war of like,
oh, I want to take credit for this one
and it's some kind of retention tool.
I'm going to take credit for it.
And in reality, the numbers rarely add up.
Say the actual revenue is $100.
The credit they claim adds up to $200.
Like, well, I only got $100,
but this attribution data adds up to $200.
All of these can't be right.
It's just another challenge.
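Editor's sketch: the double-counting described here is plain arithmetic. The platform names and dollar amounts below are made up for illustration.

```python
# Hypothetical numbers: one $100 order, claimed in full by two platforms.
actual_revenue = 100.0
platform_claims = {"ad_platform": 100.0, "retention_tool": 100.0}

claimed_total = sum(platform_claims.values())
overclaim_ratio = claimed_total / actual_revenue

print(claimed_total)    # 200.0
print(overclaim_ratio)  # 2.0
```

A ratio above 1.0 means the platforms are collectively claiming more revenue than was actually earned, which is the signal that their self-reported attribution cannot all be right.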
Okay, so we're collecting data from source systems
on the advertising side.
We need to collect data from the website
or the digital property.
So Allbirds used RudderStack for that.
So that's the behavioral data.
That's capturing page view data,
conversion data, etc.
And so you're streaming that
to the data store.
So a data lake or a data warehouse.
Okay.
So now we are with
the person who's on the receiving end of that,
and they have probably
a lot of different
tables.
That's an understatement.
Tables both in terms of numbers, and then a lot
of data within those tables.
Yes. Yeah. Go ahead.
Well, I'm saying, okay, what do we do now?
Yeah.
So at that point now, the data all has to be transformed,
and ultimately it has to all be merged together.
The precursor to all that, which a lot of engineering folks especially struggle with, is: okay, what's the end state?
Like, what are we trying to accomplish here?
Because it's very tough to actually merge the data together
if you can't really say what you're trying to get out of it, what the end state is.
So that's usually the first step, which we'll unpack in a minute. But talking directly to your point,
once you figure that out, and once you say, okay, I want to actually understand,
for my website, across all the channels that we're advertising on, so like Facebook, Google, et cetera,
like how well are users converting on each one of those?
Let's just say channel level to start with.
Keep it easy.
So Facebook is a channel.
Google Ads is a channel.
Just to clarify, how well am I converting there?
So I spend advertising dollars.
People are clicking on ads.
They come to my site.
When we say converting, it's just like, okay,
how many people who come from Facebook actually buy something
where I make money on the purchase based on the advertising dollar
that I put towards the ad that they clicked on.
Yeah. So I spent $10 on the ad.
How much did the user purchase?
Like, did they purchase first of all?
And then how much did they purchase?
Essentially, did I get more back than I put in?
Right.
That's ultimately the question you want to answer.
Yep.
Yep.
And that, that then ladders up to all sorts of different interesting things.
The other thing I mentioned too is like,
you might want to measure conversion as the other fairly big thing.
Now I'm not a huge believer in measuring conversion
because that can be gamed.
We can talk about that later.
But nonetheless, like those are kind of the two main things.
Yeah.
So basically what you have to do there,
it's a transformation problem. So you
have to get all that behavioral data that you've collected on the
website. So that's got UTM params, user conversions, things like that. Yep. Usually what you
have to do as well is get all your order data. So that gives you your conversions, the amount
that user spent. Sometimes you've merged those two to a degree to make sure they align closely.
So obviously, as John said early on, sometimes you can't get 100% of the data in behavioral.
So that's why you'd want to merge in your actual e-commerce data, like let's say Shopify or whatever.
Then you have to merge in your ad spending data.
So we're talking Google ads and Facebook ads here.
So you have to actually then figure out, okay, how do I normalize that data to figure out
per channel, how much did I spend?
And usually this is temporal data.
So you do like per day or a week, month or year, et cetera.
Yeah, yeah.
Same with all those other two I should mention, right?
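Editor's sketch: the channel-level merge described here can be illustrated with plain Python. The rows, channel names, and amounts are hypothetical, and the data is assumed to be already cleaned. A real pipeline would more likely do this in SQL inside the warehouse.

```python
from collections import defaultdict

# Hypothetical, already-cleaned rows: (day, channel, amount).
orders = [
    ("2025-01-01", "facebook", 40.0),
    ("2025-01-01", "google", 25.0),
    ("2025-01-01", "facebook", 20.0),
]
ad_spend = [
    ("2025-01-01", "facebook", 30.0),
    ("2025-01-01", "google", 50.0),
]


def sum_by_key(rows):
    """Total the amounts per (day, channel) join key."""
    totals = defaultdict(float)
    for day, channel, amount in rows:
        totals[(day, channel)] += amount
    return totals


revenue = sum_by_key(orders)
spend = sum_by_key(ad_spend)

# Join on (day, channel) and ask: did I get more back than I put in?
report = {
    key: {
        "spend": spend[key],
        "revenue": revenue.get(key, 0.0),
        "return_on_ad_spend": revenue.get(key, 0.0) / spend[key],
    }
    for key in spend
    if spend[key] > 0  # skip keys with no spend to avoid dividing by zero
}

for key, metrics in sorted(report.items()):
    print(key, metrics)
```

A return above 1.0 means the channel brought in more than was spent on it that day. Everything here hinges on both tables sharing the same `(day, channel)` key.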
And then lastly,
once you've merged all that together, you have to generate data from that, like metrics,
measurements. And that's, you know, like I talked about a minute ago, it's that conversion, it's
that revenue, et cetera.
Okay, so I want to ask two things. One of them is that
I'm going to play dumb and ask about the keys that you join on at a very high level. And then
the second is I actually want to circle back to your way of thinking about UTM parameters and how
to solve some of the problems around that, because you have a couple of ways, and we've actually talked for a long time about some ways of overcoming some of the challenges there.
But, okay, one join key, and I'm massively oversimplifying this, but I think it's fun.
I think it's important to get into the details.
Hopefully helpful.
One of the join keys that makes sense to me is that you have behavioral data from the
website that contains the UTM values from a page view. So someone clicks on an ad, they come to the
site. We'll use RudderStack as an example. As you and I talked about a ton with the Allbirds stuff,
it fires a page call that goes into your warehouse, it gets flattened into a table,
and there's a column that says UTM campaign from that page view that has the timestamp
on that table. Then the data that comes from
the source advertising systems, there's some campaign
and ad, there's an ad, a row of data,
however it is, you have to clean it probably.
Not probably, you do actually.
I know that from experience.
I can't play completely dumb here.
You clean it up and you essentially get
some clean tables that are rows of data
where there's a URL that you input
into the source advertising system
when you deploy the ad so that when they click on it,
the user goes there.
So at a very high level,
you can join on UTM keys or sort of the components of
the URL in order to tie like, okay, I spent this much money on this ad, and then I see this many
UTMs in the behavioral data, you know, and then you can sort of correlate that to conversion.
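Editor's sketch: joining on "the components of the URL" can be illustrated with Python's standard `urllib.parse`. Both URLs below are hypothetical. The first stands in for the destination URL entered in the ad platform, the second for the URL captured on a page view.

```python
from urllib.parse import urlsplit, parse_qs


def utm_values(url):
    """Return the utm_* query parameters of a URL as a plain dict."""
    query = parse_qs(urlsplit(url).query)
    return {k: v[0] for k, v in query.items() if k.startswith("utm_")}


# Hypothetical destination URL configured on the ad in the ad platform.
ad_url = "https://example.com/sale?utm_source=facebook&utm_campaign=overstock_sale"
# Hypothetical URL recorded on a page view in the behavioral data.
page_view_url = (
    "https://example.com/sale?utm_source=facebook&utm_campaign=overstock_sale&ref=123"
)

same_campaign = utm_values(ad_url) == utm_values(page_view_url)
print(utm_values(page_view_url))
print(same_campaign)
```

Comparing only the `utm_*` parameters ignores unrelated query parameters such as the made-up `ref=123`. It still assumes the values arrive spelled and encoded identically on both sides.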
Now, what makes this really gnarly is that you have to do that on a unique user level, right? Like, because you have
to tie the purchase and the page view and the conversion and all of that to like a unique user
so that you can say, okay, well, this page view is associated with this user is associated with
this actual transaction that has a dollar value tied to it. And so there's almost a user reconciliation, identity
resolution type element to this too, where you have to make sure that you're
reconciling that cleanly from a user standpoint. Am I thinking about that correctly?
Yes, you absolutely are. And there's even more to it as well. The plot thickens. So
you're spot on. It actually is the identity resolution problem. And that identity is
basically the, we're going to say channel for right now because we're doing channel level, but it
depends on what level you're doing, right? So channel, ad set, ad. At each level, it's an identity resolution problem. Oh, right. Yes. Yeah. At each entity, right? Because you have
to reconcile all the different disparate data from the source system to whatever key you're
going to join on. Yeah. So taking channel, you have to do identity resolution on what are all the channels?
So in this case, theoretically, it's Facebook, Google. Then you have to figure out,
Okay.
For each one of those channels, what are, what's the order values that we talked about?
And so your join key in the end is those two
channels. But then there's the part where I was saying there's a little bit more to it. You also
have to figure out your spending in the ad platform, which again needs a join key. And that
ultimately has to be a combination of what did I spend
at a channel level, and then joining that with the other two to get channel,
orders, and spending. Right, ad spending, conversion dollars, and channel. And so the combination of
those three at a high level, that's how your joining
works. And again, think about how that gets more complex each level you go down.
Just touching on ad set for a second,
for listeners out there who may know a little bit less, an ad set is the step below
a campaign. So within a campaign, you have an ad set.
And an ad set or an ad group is,
it's basically, it can be multiple ads.
We'll unpack later why you'd want to do that.
But for now, just think multiple ads. And so now your join key is ad set and campaign.
So campaign would be like, you know, overstock sale.
Or channel, sorry.
Yeah, that's fine.
You could have like, you know, so you have overstock sale,
but that could be a campaign in Google, a campaign in Facebook.
Then you could have an ad set that's like, you know,
shirts and an ad set that's like shoes or pants or whatever
that are like these sort of logical groupings.
Yeah.
And then you may have ads within an ad set
that are like blue shirts or green shirts or something.
And so you have like a pretty complex hierarchy
even to try to triangulate all of that.
But spend your, yeah.
And so that's your join key, right?
So your join key is the combination of all those things.
So at whatever altitude you want to look at.
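Editor's sketch: a join key that is "the combination of all those things" at a chosen altitude can be modeled as a tuple. The level names follow the hierarchy discussed here. The example row and its values are hypothetical.

```python
# Hierarchy levels from coarsest to finest, as described in the conversation.
LEVELS = ("channel", "campaign", "ad_set", "ad")


def join_key(row, altitude):
    """Build the join key from every level down to the chosen altitude."""
    depth = LEVELS.index(altitude) + 1
    return tuple(row[level] for level in LEVELS[:depth])


# Hypothetical row describing one ad.
row = {
    "channel": "facebook",
    "campaign": "overstock_sale",
    "ad_set": "shirts",
    "ad": "blue_shirts",
}

print(join_key(row, "channel"))  # ('facebook',)
print(join_key(row, "ad_set"))   # ('facebook', 'overstock_sale', 'shirts')
```

The lower the altitude, the more fields have to match exactly across the behavioral, order, and spend data, which is why deeper levels are harder to join.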
Wow.
And so again, this gets back to the,
what does the business want to measure?
What's the outcome?
You have to decide that up front,
but a lot of people don't understand
that you have to decide that up front.
I mean, I guess you don't technically have to.
You can always do it later,
but to really do it well, you should decide it up front.
Yeah. That may be actually, I want to, sorry to interrupt you there, Lou. Not at all. I just
want to reiterate that may actually be one of the most helpful things I've ever heard about
attribution, where it's like, decide what you want up front, because there are so many ways to slice this. And altitude, I think, is a great
word for that. Like, you can go so granular and get so close to the ground with a magnifying glass,
right? Or you can be at 30,000 feet. And none of those are wrong, but trying to do every level of altitude is impossible.
Yeah.
At least a bad idea.
At a minimum,
rarely ever worth the effort.
Right.
But I think that gets to the second part exactly of sure.
You can do any altitude,
but a naive,
a naive individual might be like,
Oh,
let's just go all the way down.
Like,
and then we'll have the data all the way up.
Sure. You can do that, but that actually is the hardest to implement.
It's harder to implement the deeper you go, but then also it's harder to
gain information that you can use to make actionable decisions the lower you go. In a lot of cases, I equate this to
stock trading a little bit. The more information you have, possibly the better decisions
you can make, but also the worse decisions you can make. So if you're
trying to pick between two stocks, it's an optimization
problem, like stock A or stock B. And likewise, you're trying to pick
between advertisement A and advertisement B.
Because you're at the ad level, you're trying to figure out which one do I do. There are
a lot of different data points that you can decide on, so it's not always
going to be a straight answer: I should always go with A, or I should always go with B. Same with stock trading, right? Because stock trading is economics based,
it's news based. So there's a myriad of different things you have to look at in order to actually
decide which ad should I boost, which ad should I kill, or should I do nothing? And so the decisions get more complicated the lower you go.
Because you also have more data, and you have to
decide across more ads: which ones do I want to keep?
Which ones do I want to get out?
Same with stocks, more stocks you're looking at the more it's like, which
ones do I trade more of, which ones do I get out of, et cetera.
Right?
Like it's a, it's a Kelly criterion optimization problem, whether it's stocks or ads, like
you could apply it kind of the same way.
Yeah.
And so those are your join keys, just taking that back.
And then also, if you think about it for a second, the other challenge of just generating
the join keys, which I want to call out to people like I highlighted earlier, is that the data is not consistent. I think that's actually one of the biggest challenges
at any level, but especially as you get lower, because the join keys get more complex.
Say 100 different users came to my website through Facebook. For 95,
the campaign name was correct.
But the campaign
had a space in it. And so for 5,
the space in the campaign name
isn't represented as
percent 20, it's represented as
plus, right? So now,
theoretically, if you're matching directly,
like doing a direct string match,
you actually have
two different campaign names. If you do a
naive, I'm just going to do a direct string match in order to create my join keys,
you now have two different campaigns even though they were the same campaign.
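Editor's sketch: the `%20` versus `+` mismatch, and one possible normalization, using the standard library's `urllib.parse.unquote_plus`. The campaign name is hypothetical. Treating every `+` as a space is an assumption that would be wrong for names containing a literal plus sign.

```python
from urllib.parse import unquote_plus


def normalize_campaign(raw):
    """Decode %20 and + to spaces, then collapse whitespace and lowercase."""
    decoded = unquote_plus(raw)
    return " ".join(decoded.split()).lower()


a = "Overstock%20Sale"
b = "Overstock+Sale"

print(a == b)                                          # False
print(normalize_campaign(a) == normalize_campaign(b))  # True
```

Normalizing before building join keys collapses the two spellings into one campaign. It only covers the variations anticipated in advance, so it reduces the problem rather than eliminating it.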
Yeah, but the characters were different. Yeah. So that creates a whole different set of
challenges. It's basically creating standardized keys. You have
to standardize those names. You have to figure out which ones are the same, but also which ones
actually are different even though they look similar. Yeah. Well, totally, because
I think it's easy to conceive of the modeling problem.
It's like, okay, multiple levels of altitude, yes, that can get complex.
But if you don't assume that you're going to have dirty data,
it's like, okay, that can get complex, but that's doable, right?
But the dirty data problem compounds
because you have the different levels of altitude
within each platform.
You have all the different platforms.
You have the fact that the data is actually delivered differently
in all these different platforms.
And because they're all different tech,
the conventions can break in all sorts of different ways.
And so the long tail becomes absolutely insane. Well, and even if you have your team
completely aligned, marketing, data team, you know, the whole team aligned, all of your stuff
is named perfectly correctly every time in every platform, which never happens, even if that
were the case, this is free-form data. Like any user, yeah, if you're an evil
person and you want to mess with some marketing people, let me give you some tips. No, but really, any
user can advertently or inadvertently, like you said, introduce a little space. Any of the millions
of people that may be on your website. And then all of a sudden you have two campaigns for that
one little record. And so it is an unsolvable problem to get to perfect. Yep. Yeah. Or John Doe decides he wanted to do
something different because he's new to the company and he doesn't really know or understand,
or he's like, I don't want to read all the material, and he names the campaign differently, or he
modifies the currently named campaign because there's something like a spelling error. Yeah, yeah, sure.
Yeah, it's a million.
Now you've splintered your campaign, right?
Yeah, yeah.
Yeah, and the highlight, John,
like that's a great point.
Like there, it's freeform.
It's an absolute nightmare.
Yeah.
Okay, so we're clearly gonna have to turn this
into a two-part series
because we are maybe 5% of the way through the conversation.
At least two parts, if not more.
One thing I do want to cover really quickly,
because this is great.
I actually think we've gotten pretty deep down
into the stack and into the data.
But Lou, talk us through some of the ways
that you mitigate some of that freeform data challenge
and the inherent limitations of the prevailing
five-dimension metadata methodology
that is so ubiquitous because of Google Analytics.
So what are some ways when you think about the system design?
And one thing I love about the way that you think about this approach
that we've talked about many times is that
this is sort of a holistic way of thinking about the problem
both in terms of the inputs
and then also in terms of join keys even, right?
And sort of the way that you even think about solving the modeling problem.
So just walk us through a different way to think about that that can help you move beyond being
beholden to five free-form dimensions that are, you know, impossible to solve for. Yeah. Two things
before I get into that. So number one, you know, this isn't perfect.
First of all,
right. Like there's still,
as John eloquently put it,
like it's still,
it's free form.
It's yeah.
It's impossible to get perfect,
but I mean,
this is improvement.
Number one.
And then number two,
I think credit where credit's due.
Like I've been kicking a general idea like this around for a while.
And I was talking with Eric about how to do this better, and Eric
mentioned his old firm had come up with a way to do this as well, and they'd come up
with a pretty good way. And it was a combination of this, you know, my thinking
and that conversation. So Eric, thank you. You actually helped out a lot in this space, you
and your, you know, your team of folks, like you and Benji. So this is definitely not just me, right?
This is far from me coming up with this. Many conversations over, over 30 months. Yeah.
but in short you know the it is a key right like at the end of the day if you think about it from
that perspective and actually i'm sorry one more thing super quick that i want to highlight that i wanted
to highlight before is i think this is so important what i'm about to say in a second
it's so important again to like define what you're trying to do up front because doing this up front
will save you so much trouble and will enable you to do like historical merging of your data versus
if you don't do this until later
on it's going to be tough to nearly impossible to go back and like do your historical attribution
so getting into the meat of it it's really at the end of the day you have to develop i think
success to be more successful at this and take out a lot of the like, Hey, UTM, UTM params at,
especially at lower altitudes are really hard to merge together and create a
key from the verge key.
It's just create that merge key up front at the end of the day.
So it's create that merge key up front and attach it to every single
campaign.
So every single campaign,
every single ad set,
every single ad has a unique key and that it's a spaceless key,
right? Like it's a key that's gonna be tough for a browser to munch. I'm not saying it's impossible,
but it's gonna be very tough. And you attach that to every single essentially ad. And it's that
unique join key is a query param. and then there's some nuances to that
obviously which you and i eric have talked about before we can unpack here but if basically if you
do that job up front you could use that join key to do to skip all of the challenges we just talked
about and just join on that key right yep and you're generating that usually as some sort of hash correct so you basically
and so what are the inputs to that hash? Because one interesting thing about this
that you and I have talked about, Lou, is that if you limit yourself to five dimensions,
well, one, at a base level,
just from a strict technical standpoint,
you only have five dimensions and you don't want to add spaces and other things like that.
And so practically what ends up happening is
probably the best way to say it would be that
the people who are creating the key value pairs,
generally who are marketers, get very creative
in how they package information
into those five dimensions. Yeah. Well, and I think just like for people that are less technical,
you're talking about key value pair and such, like it can be as simple. And I think we've done
this before. I've done this before of like, hey, we're going to start at one and we're going to
put the number one in there. And then in a reference sheet, number one equals that trade show we went to that was in London.
You can say whatever you want.
And then you can categorize it in 12 different ways for later groupings.
And then when somebody changes their mind, you go rechange all those categories.
And it works.
Yeah.
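Editor's sketch: the reference-sheet idea (start at one, look the number up, regroup later) could be modeled like this. `CampaignRegistry` and its method names are hypothetical. In practice this might simply be a spreadsheet loaded into the warehouse.

```python
class CampaignRegistry:
    """Hypothetical reference sheet: an ID maps to a description and categories."""

    def __init__(self):
        self._next_id = 1
        self._entries = {}

    def register(self, description, **categories):
        """Hand out the next ID; the counter only moves upward."""
        campaign_id = self._next_id
        self._next_id += 1
        self._entries[campaign_id] = {
            "description": description,
            "categories": dict(categories),
            "retired": False,
        }
        return campaign_id

    def recategorize(self, campaign_id, **categories):
        """Change groupings in the reference sheet only; the ID stays the same."""
        self._entries[campaign_id]["categories"].update(categories)

    def retire(self, campaign_id):
        """Mark an ID as no longer in use without freeing it up."""
        self._entries[campaign_id]["retired"] = True

    def lookup(self, campaign_id):
        return self._entries[campaign_id]


registry = CampaignRegistry()
first = registry.register("Trade show in London", type="event")
registry.recategorize(first, type="conference", region="EMEA")
registry.retire(first)
second = registry.register("Overstock sale", type="promo")

print(first, second)  # 1 2
```

Because the counter only moves forward, a retired number is not handed out again, so historical rows tagged with it keep their original meaning.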
And I think the key there is you rechange them only on your system of record, like internally. Yes, exactly. Or you augment it. Right. So, like, what you're asking. But you don't reuse that number again. One is toast. Do not reuse it. Exactly. Right.
There are a couple of nuances to this to unpack, and you just hit on one, John. But basically, you don't necessarily need to hash all that data, Eric, every single thing you're interested in.
You're hashing on an agreed-upon set of columns. So it could even be the five UTM params if you
want to keep it simpler. And you're just hashing that, and you're hashing it, like John
said, one and done. Meaning once you generate your hash, you
never change it. Even if you change the UTM params, you keep a stable hash, because it's your join key.
Right.
So that's one, you know, that's one gotcha.
One key piece is like, you have to be diligent about not changing your hashes when you change things internally.
Another is like, you have to be diligent about tracking this.
So you have to have a system of record.
So sometimes that gets a little complicated. A
simple way to do it could be a spreadsheet that you feed into
your data warehouse. People make mistakes, so you just
have to be careful. By and large, yeah, I would
say a hash is highly resilient to collisions,
meaning, you know, the same input
should always generate the same output,
and any variation in the input should generate a wildly different output.
You know, the internet is very broken if that's not true
with modern hashing algorithms.
So that's why hashing is probably the best way to do it,
generally, because, I mean, that fits that paradigm very well.
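Editor's sketch: generating a stable, spaceless join key by hashing an agreed-upon set of columns, then attaching it to the ad's destination URL as a query parameter. This is one possible reading of the approach, not the speakers' actual implementation. The column list, the `jk` parameter name, the 16-character truncation, and the example values are all assumptions.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: the team has agreed to hash these five UTM fields, in this order.
HASH_COLUMNS = ("utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content")


def make_join_key(record, length=16):
    """Hash the agreed-upon columns into a short, spaceless, URL-safe key."""
    joined = "\x1f".join(str(record.get(col, "")) for col in HASH_COLUMNS)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:length]


def tag_url(url, key, param="jk"):
    """Append the join key to the ad's destination URL as a query parameter."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True) + [(param, key)]
    return urlunsplit(parts._replace(query=urlencode(query)))


# System of record: the key is generated once and stored, never recomputed.
system_of_record = {}

ad = {
    "utm_source": "facebook",
    "utm_medium": "paid_social",
    "utm_campaign": "overstock sale",
    "utm_content": "blue shirts",
}
key = make_join_key(ad)
system_of_record[key] = dict(ad)

# Later the campaign is renamed internally; the stored key stays the same.
system_of_record[key]["utm_campaign"] = "overstock sale 2025"

print(tag_url("https://example.com/sale?utm_source=facebook", key))
```

The key contains only hexadecimal characters, so there is no space for a browser to re-encode. Truncating the digest trades some collision resistance for a shorter URL. Metadata changes happen in `system_of_record` while the key on the live ad stays untouched.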
And Lou, one thing I love, just to circle back
to what you mentioned earlier and which I called out,
but I'm really saying this
to myself, almost to
assuage my pain from past
life. This is me doing
a little self-therapy.
You're helping people out.
That's why we do this show.
I think it's good to get those out there
and help people out.
Defining... the hash thing, as we've talked about it, really can be a game changer because it just solves so many different issues. But one thing about it that you
have to be careful of is you can pack as much information as you want into the hash, right? So I could have a thousand columns of data
that I want to pack into a hash
and this system of record and whatever,
and then I have the ability to unpack all of that, right?
But to your point, Lou,
the thing is, what do you need to hash?
It's the requirements that you defined up front.
That's what you actually need to hash, right? Those requirements.
And so, man, that's just such good advice in terms of like getting super sharp on that,
because that determines the level of complexity that the system needs to serve.
Not that that can't be changed over time, but in all of these things, there's really
no limit to how much you can add.
And of course our tendency is to just say, well, we might need to use that.
And so you tend to add more and more and more, you know, or do what you said,
which is like, let's just do every level of altitude, right? Right. So changing it over time is definitely
the reason why I say it's really important, ideally, to define this up front, define what
you're trying to accomplish. While changing it over time is not impossible, changing it over time adds a massive layer of complexity,
when undoubtedly you have to do a full refresh
of your data ecosystem, like, say, if you're using dbt.
So it just generates a lot of complexity
if you ever have to go back and regenerate historical data.
This is the think-like-an-accountant part of the show, right?
Because if you put that accounting hat
on, you're like, oh, I'm going to have to regenerate all these financials and do this to the bank.
Like, think, if you grab an accountant and pull them into your team, they
would do this perfectly. Maybe that's the strategy we've all been missing. Yes, totally.
Yeah. Okay, well, unfortunately we are over time, but Lou, let's get you back on as soon as we can.
Because, okay, we're at the point now where we're deep in the stack.
We understand at a high level, like the input, some of the complexities,
why this turns into a really gnarly problem.
And we have a way to do this way better with a hash.
We just scratched the surface
there. I think there's a lot more to talk about, but we literally have not even talked about like,
okay, you're producing a metric and that is the other side of it that gets even crazier. So,
so come back on and we'll start where we left off. We'll dig back into the hash and talk about
some specific methodologies here.
I think this has already been super helpful.
I've got a teaser for the next show.
The other thing, like zoomed way back out.
Like if I'm just listening in,
like it's like, man, that sounds really complicated.
Like when does it make sense to do this?
Like we gotta answer that question.
Yes.
Okay, so agenda for next show.
Deeper into the hash, attribution models, right?
And then especially when to apply advanced techniques
that include machine learning.
And then, Lou, also I think another thing
that would be really helpful is how is,
I mean, this sounds cliche,
but legitimately how is AI shaping this, right?
I mean, there are some things around that
that I think are super important as well.
So stay tuned for part two. I already can't wait because this is so fun. The Data Stack Show is brought to you by
RudderStack, the warehouse native customer data platform. RudderStack is purpose-built
to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.