The Data Stack Show - 227: The Art & Science of Marketing Attribution: From UTMs to Machine Learning with Lew Dawson of Momentum Consulting

Episode Date: February 5, 2025

Highlights from this week’s conversation include:
Welcome Back, Lew (0:14)
Recap of Previous Discussion (1:03)
Benefits of Hashing Information (2:33)
Using Hashes for Data Context (4:24)
Hashing and ...Query Parameters (7:24)
Static Values for Hashing (11:10)
Identity Resolution in Data Attribution (14:36)
Methodologies for User Tracking (16:37)
Combining Data Sources for Attribution (21:13)
Understanding Data Gaps (25:25)
Defining Objectives and KPIs (27:50)
Identity Resolution Challenges (28:46)
User and Session Stitching (32:01)
Trusting Ad Platforms (35:23)
Defining Attribution (38:09)
The Credit Dilemma (40:18)
First Touch Attribution Explained (41:47)
Linear Attribution Model (43:21)
B2C and B2B Attribution Scenarios (45:22)
Timeframes in Attribution (47:29)
Understanding Lookback Windows (49:34)
Google Analytics Changes (51:20)
Attribution After Conversion (53:26)
Online vs. Offline Attribution (55:49)
Discipline in Tracking (58:52)
Challenges in Coordination (1:00:12)
QR Codes and Data Integration (1:01:55)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Hi, I'm Eric Dodds. And I'm John Wessel. Welcome to the Data Stack Show. The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work. Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. Lou, welcome back to the Data Stack Show.
Starting point is 00:00:34 We ran out of time last time talking about attribution stuff, although we did make a lot of progress. But we're going to dive right back in. So yeah, thanks for giving us even more of your time. Yeah, thanks. It's good to see you again. And this is the first three-part show for the Datastack show? This is the first ever?
Starting point is 00:00:50 Yes, this is the first ever three-part show. All right. Yeah, the first one we had to, we knew it was going to be a big multi-part show, which is super exciting. I'm excited, yeah. Congratulations, Lou. It's an honor.
Starting point is 00:01:03 So last time we walked through, I was reflecting on this a little bit, really an immense amount of work to get to the point where, you know, we have these various data sources coming in. So we talked about data, structured data from, you know from advertising platforms that have information about your campaigns, your ad groups, your ads. Those all contain UTM parameters. We talked about
Starting point is 00:01:35 behavioral data coming in so that you can see when a user lands in your website or mobile app, and then all of the actions they perform, ultimately culminating ideally in some sort of conversion event. We also talked about how UTM parameters are what now feels like a fairly primitive way of packaging information about your campaigns into a URL so that that metadata is observable by other systems. And we talked about a really clever methodology for hashing information so that you can overcome some of the limitations of that system. So why don't we start there? So we had just started to dip into the world of talking about the hash. Just give us a quick refresher on why that is so useful as compared with using the standard, you know, sort of, let's say, traditional taxonomy of the five UTM parameters.
Yeah, absolutely. So in short, if you recall, there were a number of challenges we highlighted with the traditional UTM parameters, like, let's say, UTM campaign. Campaign is a big offender because it's freeform, so you could have a scenario where your campaign taxonomy has a space or some sort of special character in it that browsers at times mangle or handle differently. So spaces can be represented as %20, sometimes they're represented with pluses, and different ecosystems do different things.
And so you run into the scenario, first of all, where your UTM parameters get mangled, shall we say. And so now, just to have full and proper attribution, you have to go to the trouble of standardizing those names. Two, three, four, five, 10, 15, 20 variations you'll sometimes see on a particular campaign name, and you have to go through and figure out, how am I going to standardize those so they actually point to a single identity for that campaign?
Starting point is 00:03:39 So that's a big problem right there. So the way that one of the primary ways that's solved is with an ID. So you roll up all those distinct values, so UTM source, campaign, term, et cetera, and do a single unique identifier that also is not easy to mangle by any sort of platform. And that's two benefits. One I just described, less likely to be mangled. The other one is that's now your join key too. So instead of having to do that resolution
Starting point is 00:04:08 and figuring out like the standardization of that as you back into your join key, now your join key is just coming in as part of your data and it's much easier. So that's the main reason why you'd want to like look into that new set. Yep. Okay, I have a couple of questions here.
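To make the approach Lou just described concrete, here is a minimal sketch of hashing a fixed set of campaign fields into a stable identifier that doubles as the join key. The field names, the SHA-256 choice, and the truncation are illustrative assumptions, not a prescribed standard.

```python
import hashlib

def campaign_id(source: str, medium: str, campaign: str, ad_group: str) -> str:
    """Build a stable, mangle-resistant ID from a fixed set of campaign fields.

    The same inputs always produce the same ID, so it can serve as the join
    key between ad-platform exports and clickstream data in the warehouse.
    """
    # Normalize before hashing so "Spring Sale" and "spring sale" resolve to one ID.
    raw = "|".join(part.strip().lower() for part in (source, medium, campaign, ad_group))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]  # truncated for URL friendliness

# This ID is what lives in the metadata spreadsheet or table and gets appended
# to the ad's landing URL (e.g. ?cid=<id>) instead of free-form campaign names.
print(campaign_id("facebook", "paid_social", "Spring Sale 2025", "retargeting"))
```

The key design point, as the conversation returns to below, is that once an ID has been generated the inputs behind it never change; any additional or evolving metadata lives in other columns of the mapping table rather than in the hash itself.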
Starting point is 00:04:25 I actually have one point that I realized we did not get to last time, which is another major benefit of using a hash. Because you can package a bunch of information into the hash, and so the concept there would be this is limited. Your mileage may vary using a spreadsheet to do this. Most companies actually do use a spreadsheet. But for the sake of, you know, the example, let's say, I guess what I'm saying is there are more stable ways to do this
Starting point is 00:04:56 and a lot of great tools out there that, you know, you can build the hashing system with that, you know, is a little bit easier to govern than a spreadsheet. But let's say you have this in a spreadsheet. You can actually add as much information as you want, or as we talked about last time, as much information as is helpful based on what you're trying to discover
Starting point is 00:05:15 as far as attribution. And so you're not limited to those five UTM parameters. You can actually, I mean, theoretically, you could actually just have the hash if you wanted to. But irregardless, you can start to use, it creates a context where you can free up one or more of the UTM parameters to use for other things. One of those is actually that a lot of ad platforms support dynamically pulling in the ID of the individual ad itself
Starting point is 00:05:47 when the ad is clicked, which can be super handy. You could just append that into another UTM parameter, but then it wouldn't be packaged in the hash necessarily. What we've seen a lot is you actually will use UTM content, for example, to pull in the advertising id using curly brackets because you can package the actual content that you want in other columns in the spreadsheet that are not represented as values in the utm keys but are in the hash and so that actually can speed up ad level reporting downstream because you have the hash and then you have the actual ad ID
Starting point is 00:06:28 in the UTM itself, which is represented on the click and all that sort of stuff. So I forgot that we didn't talk about that, Lou, but that's another really clever thing that can speed up some downstream modeling. Yeah, absolutely. I mean, it's a great call out. And even another one, like just iterating through all
Starting point is 00:06:45 those it makes it much easier to have stable um identifiers for particular campaign ad set ad because that also allows you to make changes to that ad in your metadata table so like where you do that mapping while keeping the same slash stable identifier. So that's another huge one too that we didn't touch on yet. Yep. So yeah, there are multiple benefits for sure. That's a great call.
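As a rough illustration of the URL pattern being discussed, here is a sketch of the final ad URL: the hash rides in its own parameter, and a freed-up UTM slot carries the platform's dynamic ad-ID macro. The parameter names and the {{ad.id}} macro are examples only; each platform documents its own macros (Google Ads calls these ValueTrack parameters).

```python
from urllib.parse import urlencode

BASE_URL = "https://www.example.com/landing"  # hypothetical landing page

def build_ad_url(campaign_hash: str, source: str, medium: str,
                 ad_id_macro: str = "{{ad.id}}") -> str:
    """Assemble the landing URL that gets pasted into the ad platform.

    The hash packages the full campaign taxonomy, which frees up a UTM slot
    to carry the ad ID that the platform substitutes at click time.
    """
    params = {
        "cid": campaign_hash,      # stable hash / join key (non-standard param)
        "utm_source": source,
        "utm_medium": medium,
    }
    # The macro is appended outside urlencode() so its braces are not
    # percent-encoded before the platform can substitute the real ad ID.
    return f"{BASE_URL}?{urlencode(params)}&utm_content={ad_id_macro}"

print(build_ad_url("1f3a9c2b7d4e5f60", "facebook", "paid_social"))
```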
Starting point is 00:07:12 Yeah. Yeah, the quality control on the input is awesome. Like being able to actually control a stable campaign name with a sequence number, for example, you know, can be really helpful. So I have a question. Yeah. So when I think of a sequence number, for example, can be really helpful. I have a question. When I think of a hash here, I'm thinking of taking some arbitrary amount of information, creating a unique ID that's specifically linked to that information.
Starting point is 00:07:37 In this case, we're doing that in a metadata table and then taking that and putting that into a query parameter or we're doing that somehow on the front end like through a tool or yeah you want to speak to that lynn yeah sure so kind of twofold to answer your question let me know if i don't completely address it but basically what you're doing is you're going to pick the parameters up front that you want to hash and that's generally going to be like your identifier, some sort of unique identifier, whether that's you created from scratch
or we can get into solutions later. But what you do is you generate that unique identifier, so like a SHA-256, or even an MD5, of a set of parameters. And then you take that and put it into that particular ad, so like in Facebook Ads you put that in as a UTM param on that particular ad, right? And so whenever the user clicks on that particular ad, that identifier will automatically be sent as part of the query params.
Does that answer your question? Yeah, so we're not also somehow... Because I was thinking, how do you get dynamic parameters into the hash? No, no, no. Yeah, the dynamic parameters you would actually get from the advertising platform through their mechanism, like on click, they can insert dynamic parameters. It's just, the point there was that it's nice to just,
Starting point is 00:09:07 once you start, you really don't want to, in my opinion, for the purposes of attribution, Lou, tell me what you think about this. We're probably aligned, but if you can stick within the five UTM parameters, there are a lot of benefits of that. You don't really want to go way outside of that because then those aren't honored by every single system.
Starting point is 00:09:26 You're adding more complexity. And so the way that you can pull in dynamic parameters without having to add a bunch of additional parameters is that you use existing UTM values because they're not required anymore because you can hash all of the metadata. So in value one,
Starting point is 00:09:46 I've got a bunch of stuff crammed into a hash. In value two, I may have some dynamic stuff as well as three, four, five potentially. Yep. And it doesn't really, like it does, but it does not matter really what you put in the hash because as you both pointed out, it's static and it's going to be stable
Starting point is 00:10:03 and it's chosen by you in your system where you're going to track all these Ashes to campaigns. And then again, as you both pointed out, there's the aspect of dynamic parameters, which the ad platform will put, you'll configure it in your ad and then they, so like Google ads has value track. It's a bracket and then whatever parameter, dynamic parameter you want to put associated with that UTM param now, the way at the end of the day, all those dynamic prams you'll resolve. Those is in addition on your ad, you also have that identifier, right?
Starting point is 00:10:39 So you're going to use that identifiers, your join key, and then all those dynamic parameters, which will come in on your tracking pixel, like your click screen. So like if you use Redistack, you'll be able to get those at runtime because the ad platform will substitute those in at runtime. Then one other question, again, this is probably just from a software background.
Starting point is 00:10:59 A lot of times, like if you generate a hash, like it changes if you change the other data. In this case, you generate one time. And if you made some change, you'd probably, it'd be better, you'd want to keep it static, right? That's exactly what you're spot on. What I was referring to earlier is that's another benefit of using the hash
Starting point is 00:11:15 is you pick a static set of values and then you never change those values. And since it's your metadata and the covers, like you can keep those static while adding all sorts of other metadata that you can change while still having a staple id so exactly yep yeah and usually two other quick thoughts on that and then i want to move on to talking about identity resolution because it's a lot juicier but there are like these are it sounds so simple but i mean the nuances here like they're yeah yeah the url tricks are fascinating to me. Generally, I think it can be a good practice
Starting point is 00:11:47 to use an arbitrary UTM value for the hash. Lou, actually, we haven't discussed this specifically. In the past, I've used an arbitrary URL parameter for the hash itself so that there's more flexibility in using the ones that most systems honor out of the box. Yeah, I think it's at least wise to use wherever possible, whenever possible, use a non-standard one
Starting point is 00:12:18 so a platform doesn't step on it. Yes. If you're putting it in like utm campaign what is the platform that they put it on steps on it it's like yeah yeah just lost your attribution right exactly so one other nice thing i think this is the last the last point about the hashing in the urls one other benefit and this is something that i just I didn't think about a ton until we just dug into this problem a bunch, but the URL length can become an issue in certain cases. If you have a really long URL, it can often get truncated
Starting point is 00:12:56 or even the way that certain browsers or applications may capture it, they may capture a truncated version of it. It can create challenges. The other thing that hash allows you to do is keep a pretty trim URL so that you don't run into string length issues. That's not a big deal if you're capturing URLs as a string in a Rutter stack payload, for example, but if it's going into other systems or if a website or application is doing something
Starting point is 00:13:25 where it's parsing or interacting with it. Or another thing that you don't think about a ton is if an application, or I didn't think about a ton, if you have an application or a website that appends a bunch of additional parameters to the URL to actually do things like filtering and other things in the application, you can get these really long, gnarly... Like e-commerce search. Yeah, exactly.
So again, it just sort of gives you the ability to have these really nice... Basically as long as you need, but as short as possible to sort of mitigate that. Right. Okay, I think we have unhashed all of the hash. Okay, Lou, let's talk about identity resolution.
Starting point is 00:14:10 And we started to touch on this last time, but I want to dig a little bit deeper. And so if we think about where we're at in this journey, we are in the data store at this point, right? So let's say we have all of our data in, we're using all the URL tricks to sort of enforce quality control, have good URLs, have our join keys,
Starting point is 00:14:30 pick up extra bonus information, if you will, that can be inserted dynamically from the ad platforms. And so we have all this data in our warehouse, and really what we arrive at is that before we can start really doing attribution, and we'll talk about what that means because that means a lot of different things. I'm so excited to chat about that and hear your definitions. But we have a ton of data and there are actually multiple identity resolution problems
Starting point is 00:15:02 that we have to solve in order to produce let's just call or what would you call it like a baseline data set or what how would you describe like the sort of let's say the end point of prep and the starting point of like now i can actually begin the work of doing some you know insights around attribution Is that a baseline table or data set? Yeah, there's kind of, not kind of, there are two things effectively you need. And sometimes they're packaged into one, but you need a, you basically need a,
Starting point is 00:15:38 here's a session. So something that tells you, here's a session that occurred and how the user came in on that session. and then you need a, how did that session convert or not convert effectively is the other question you need to answer. So that can be in the same dataset. So like writer using writer stack, for example, like their e-commerce spec, you have a page, right?
Starting point is 00:16:01 So it's your initial page view event. And then order completed would be our conversion event. Like if it's an e-commerce company who's selling stuff, right? So you need those two at a minimum. Now that's not to say you have to have those two. You could have like a writer stack page view, and then you could join that with Shopify orders as long as there's a way like some sort of join key he joined those two right yep but basically those are at a minimum the two things you effectively need to attribute yeah or can I ask it yeah go for it I was I want to ask a question there because I think this is and John I'm interested in your opinion too because you've done lots of that
Starting point is 00:16:43 both of you have done an immense amount of that type of joining in the warehouse. There are two ways to do this and I want to know the best way to do this because I have tried multiple ways in the past and I don't have as much experience with e-commerce. Maybe there are other ways to think about this. But you have, let's call it, the session-based methodology,
which is where I have some way of following the same user across multiple sessions. You can persist the attribution data, so let's call it the hash and whatever else. So I could persist that across sessions, somehow store that. I could have some other way to tie the user's behavior together. RudderStack provides an anonymous ID.
Starting point is 00:17:37 Maybe you want to do both. But you essentially follow that user and perhaps the actual attribution data itself through the sessions until there's a conversion. But there's another way, another methodology, which would be relying on user level identity resolution so that it's like, okay, as long as I can get the attribution data on the first, on whatever session, I don't necessarily have to persist it through if I have a way to tie the user to a Shopify order. And so then I'm actually looking for
Starting point is 00:18:15 some instance of attribution data in a session, and then I can see there's an order at some point downstream with its own timestamp, right? And so then I can say, okay, well, that user came in here and then they eventually made this order. But in that case, I actually would be trying to join on tying the attribution data to the user from the page view event
Starting point is 00:18:41 or whatever that behavioral event is. Then I need to tie the actual user data to the Shopify from the page view event or whatever that behavioral event is. Then I need to tie the actual user data to the Shopify orders table, which means I'm using email or some trait of the user and I'm running the join that way. Does that make sense? Those two like rad methodologies, like I follow a session through
Starting point is 00:18:55 or like I capture the attribution data at some point in time, but I have a way to know that it's a user and then I tie that user to some conversion data like an order table downstream. Am I thinking about the broad methodologies of doing that right or is there another way to? I can speak a little bit to Shopify
Starting point is 00:19:15 and I'm sure this changes on a regular basis. We were using it fairly early on when Shopify was first bringing on large businesses, large view count, lots of traffic. So over, I don't know, seven years ago, five years ago, something like that. Oh, that's right, yeah. It was interesting because, and this might be different now, but Shopify will attempt to do this for you.
Starting point is 00:19:42 And they will be pulling UTM parameters, give you some session information. But it always felt pretty incomplete. And Lou, I don't know if that's been your experience too. So there's that part of it. And there's a part like, well, I can do my own, like a writer stack type thing, grab an anonymous ID, etc. Do some SQL gymnastics
Starting point is 00:19:58 to get it to work. So we went more that route where, like, we did both, actually. We did both actually for a while. It's like, okay, we'll just pull the data out of Shopify. How did it attribute it? Let's try that. And there's gaps and we're not sure how to fill the gaps. And then we went in the other way.
Starting point is 00:20:15 It was like, okay, fire anonymous ID, collect the email address and writer stack at checkout, which associates with anonymous ID. And then we did pretty simple attribution models would run first or last usually click usually first and and just use that information essentially yep so kind of both yeah yeah honestly just kind of my hypothesis but lou yeah yeah so you definitely you don't have to connect user data, right? It doesn't have to be user level identity resolution. As you pointed out in your first one, it's like, it can be just session and let's say
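Here is a minimal sketch of the stitching John describes, assuming a pageview table keyed by anonymous ID with the campaign hash captured on landing, an anonymous-ID-to-email mapping collected at checkout, and an orders table keyed by email. All table and column names are illustrative.

```python
import pandas as pd

pageviews = pd.DataFrame({        # landing page views with attribution params
    "anonymous_id": ["a1", "a1", "a2"],
    "cid":          ["hash_google", "hash_facebook", "hash_facebook"],
    "ts":           pd.to_datetime(["2025-01-01", "2025-01-05", "2025-01-03"]),
})
identities = pd.DataFrame({       # anonymous_id -> email, captured at checkout
    "anonymous_id": ["a1"],
    "email":        ["jane@example.com"],
})
orders = pd.DataFrame({           # conversions pulled from the commerce platform
    "email":    ["jane@example.com"],
    "revenue":  [120.0],
    "order_ts": pd.to_datetime(["2025-01-06"]),
})

# Resolve identity, then attach each order to that user's earlier touches.
touches = pageviews.merge(identities, on="anonymous_id").merge(orders, on="email")
touches = touches[touches["ts"] <= touches["order_ts"]]

# First touch: credit the earliest touch; swap idxmin() for idxmax() for last touch.
first_touch = touches.loc[touches.groupby("email")["ts"].idxmin()]
print(first_touch[["email", "cid", "revenue"]])
```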
Starting point is 00:20:56 order, but you are correct. There also can be session user order. It depends on what kind of metrics you're trying to derive, which can be a topic we could talk about later, or I told these separate data stack conversation, right around, you know, like your customer feature table. So that's one thing. And then the other thing I'll point out, John, you kind of alluded to, like, I don't know, it felt kind of incomplete. If you, if one wants to do attribution as well on sessions that don't convert. Right.
Starting point is 00:21:28 So let's say come in really at the end of the day, like you want to include direct, right? Like you want to know how much direct traffic you're getting through specifically both for attribution. Like you can attribute to direct traffic conversions, but if you want to see how much direct traffic is coming through, including traffic, that's not converting for like your row as calculations or whatever, you can't get that through Shopify alone. Because you can only really tie sessions to conversions versus seeing all your traffic come through.
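As a rough sketch of why those non-converting sessions matter once you start computing something like ROAS (return on ad spend) per channel, consider a report like the one below; the tables and numbers are invented for illustration.

```python
import pandas as pd

sessions = pd.DataFrame({   # every session from clickstream, converted or not
    "session_id": [1, 2, 3, 4, 5],
    "channel":    ["facebook", "facebook", "google", "direct", "direct"],
    "revenue":    [120.0, 0.0, 0.0, 0.0, 80.0],   # 0.0 = session did not convert
})
spend = pd.DataFrame({
    "channel": ["facebook", "google"],
    "spend":   [60.0, 40.0],
})

report = (
    sessions.assign(converted=sessions["revenue"] > 0)
    .groupby("channel")
    .agg(sessions=("session_id", "count"),
         conversions=("converted", "sum"),
         revenue=("revenue", "sum"))
    .reset_index()
    .merge(spend, on="channel", how="left")
)
report["roas"] = report["revenue"] / report["spend"]   # NaN for unpaid channels like direct
print(report)
```

Without the sessions that never converted, the session counts and conversion rates here would be meaningless, which is the gap being described with conversion-only sources.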
So that's one area where Shopify sometimes falls over, that Shopify attribution. One thing I'll point out too, which I recently discovered, is that the attribution is different depending on where you get the attribution data from. So the Shopify REST API's landing site entry on the order gives you different attribution than what you see if you look at the GraphQL API. Oh, that is interesting. So that's something to keep an eye on too: the attribution actually is different, and it's not clear what models are being used, like what attribution models. It's not clearly documented. So that's the pitfall you run into when you want to start getting more advanced: it's unclear how things are being calculated in different scenarios. And lastly, if you try to combine them or compare them, they're always going to be different, and the same is going to be true for Google Analytics or, let's say, even Facebook Ads. The conversion metrics are going to be very different in those platforms versus if you were to properly calculate them on your own, which is another thing we can chat about.
Starting point is 00:23:19 Yeah, we definitely are going to chat about that. So I just thought of something I think we skipped in our URL stuff, but I think you're talking about direct traffic. We didn't talk at all about like ad blockers
or other reasons why we might be missing attribution. I mean, you talked about mangled URLs, but yeah. We did, you're right. But I feel like that's one that comes up a lot, the ad blocker thing, especially if you're in tech or, you know, advertising something with very technical
Starting point is 00:23:49 users so i guess let's just apply it to the id resolution any thoughts around that i mean because shopify is not going to be immune to that however they attribute nor is most solutions yeah it's a really good point and this is this is this actually plays into what the comment i just made of it's a little challenging to like use multiple sources of data like shopify combined with clickstream because again like they they attribute differently but your point which is so devalid shopify will capture will capture more data so sometimes there's absolutely a level of data you're going to lose with clickstream to your point ad blockers pixels will get dropped like they just they won't fire they don't render they'll fire too late in the page life cycle like as the user's leaving
Starting point is 00:24:38 developers didn't know there's like a you know a pixel api that will fire in the background if you use it properly so So things like that. So if you truly want like the most accurate picture, yes, you really do need to meld those data sets and you will be able to get, you'll see a subset of Shopify orders that do not have click streams or completion. Yeah, absolutely. And you should be able to generally get the attribution data from those because that data generally will be in the UTM frames,
Starting point is 00:25:07 which will go to Shopify. So that's if the request is going to Shopify, so they'll be able to capture those. Right. But yeah, it's a really good point. That's a even bigger challenge on top. Go ahead.
Starting point is 00:25:18 Yeah. Cause that's essentially what we ended up doing was one, like resign the fact of like, okay, we don't know how Shopify is doing this. We don't know what model they're using. The other side, too, to your point, we've got more data, especially on non-conversions with Rudderstack. And I was like, okay, well, if we have gaps that Shopify can fill,
Starting point is 00:25:39 we don't have from Rudderstack, would we rather it be blank or rather it be what Shopify said? And that was pretty clear. Even though we don't know the exact model Shopify is using, we'd rather know and have Shopify's data than nothing. And I think, Lou, I appreciate so much how you, throughout this whole conversation, have returned us to the just wonderful reminder that it kind of depends on what metrics you're trying to produce.
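A tiny sketch of that "use our own attribution where we have it, fall back to what Shopify captured" logic John describes might look like the following; the column names are hypothetical.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id":        [1001, 1002, 1003],
    "clickstream_cid": ["hash_facebook", None, None],           # from your own tracking
    "shopify_landing": ["hash_facebook", "hash_google", None],  # parsed from Shopify's landing-site UTMs
})

# Prefer your own clickstream attribution, fall back to Shopify's capture,
# and only then mark the order as unknown/direct.
orders["attribution"] = (
    orders["clickstream_cid"]
    .fillna(orders["shopify_landing"])
    .fillna("unknown / direct")
)
print(orders[["order_id", "attribution"]])
```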
Starting point is 00:26:03 And so I'll give two examples here. So one would be, let's say you're a business that doesn't have a lot of repeat purchasers. People tend to come in, they buy one item and they don't ever... You're not necessarily building a relationship with them because it's highly transactional. There are businesses out there like that.
Starting point is 00:26:27 Sort of at one end of the spectrum. The other end of the spectrum might be a game where sessions really matter because you want to understand what was unique about a session in which someone clicked on an ad and then eventually made a purchase. And so the difference in how necessary
Starting point is 00:26:47 persisting all that stuff throughout the sessions and getting really tight on multiple session over session is really important. That's also a lot heavier duty modeling to do user-level session reporting that includes attribution data. There are considerations on the front end around persisting that data across the sessions and that you know, that that sort of gets heavy handed,
Starting point is 00:27:07 right? You don't necessarily have to do that. But there are situations where that can be extremely helpful, because it reflects, you know, the insight that you want to actually uncover about your particular business, you know, in the context in which someone, you know, click something or does a conversion. Yeah, absolutely. Yeah. And, you know, one other does a conversion yeah absolutely yeah and you know one other thing which is small but we didn't really talk about we don't need to highlight too much but like you might have multiple properties that funnel into this data too so like you have a mobile app sure web app yeah multiple web apps right like so there as you said there are so many variables so i think think this goes back into the people part.
Starting point is 00:27:46 And again, it's before people do all this, my strong urging is to define what you're trying to accomplish and define how deep you want to go. But most importantly, define the KPIs that you're trying to look at and you're trying to measure against. And then you can back and do the best solution to measure those KPIs. Yep.
Starting point is 00:28:09 And last time we talked about this concept of altitude, which I think is really helpful, right? Like determine your cruising altitude before you, you know, before you start barreling down the runway. I mean, I think that's the problem with all of this is like I could totally picture jumping into this and then somebody getting really deep. you start barreling down the runway. I think that's the problem with all of this. I could totally picture jumping into this and then somebody getting really deep on like, we're going to solve device stitching between desktop and mobile app.
Starting point is 00:28:36 We're going to solve that and just really myopically focus on that and miss an end-to-end solution for just attribution. Aside from that. Okay. I want to talk, I don't want to dig too deep into identity resolution because that is, we could do a three hour show literally just on that, which actually is not a bad idea because that is a really fascinating topic in and of itself. And that gets back to the, you know, the customer feature table I mentioned, which might be a different segment altogether, right? As big part of that is identity resolution.
Starting point is 00:29:06 So that we don't go down a rabbit hole because I want to make sure that we dig into, we haven't even gotten to attribution models. And so we've got to go there. We'll get to the baseline data set first, but give us just a quick rundown, Lou, of how are you, we have all this data in, you know, we have all these disparate data sets.
Starting point is 00:29:25 Not only do we need to join them using the join key, which at a very high level, again, as you have a behavioral event that's tied to a user with a hash value, you have the hash value in your data from the ad platforms. And so you have a join key where you can pull this together. But the reason identity resolution is a big deal is actually, I'll say this, the most immediately apparent reason it's a big deal is because you have the initial visit
Starting point is 00:29:56 from the user that contains the hash that represents, okay, they clicked on an ad, or came from some source. And then often the distinct timestamped behavioral event that represents a conversion is separate from that, right? It happened, you know, there's some purchase event or add to cart or subscribe or whatever that, you know, downstream event is. And so you need to make sure that you can say, like, this is actually the same user in order to associate whatever value the conversion is to, you know, that campaign and that it was actually the same user who performed that to avoid, you know, double counting and all that sort of stuff. but I would classify as a related identity resolution problem is that if you are running a campaign across multiple different platforms and the concept of a campaign transcends,
Starting point is 00:30:58 which is usually the case, right? Let's just say Spring sale 2025 is my campaign. And I actually want to push that campaign out across multiple different channels. You have to build an identity for that campaign from multiple different data sets. Again, that's one of those things where if you don't think about that going in,
Starting point is 00:31:21 you think about, okay, I need to tie these user events together. But you also, in a lot of cases, have to tie disparate data sets for campaigns together to create, let's call it a campaign entity that includes data from multiple different platforms and that kind of has to be normalized. Because let's say you want to look at how much, what was our return on ad spend across every single platform
Starting point is 00:31:44 for spring 2025 sale? And so you have to aggregate that. So that's my conception. What am I missing? And just give us a high level of how do you begin to approach this, again, without taking us down another three-episode rabbit hole, if that's possible. Yeah, totally. It's totally possible.
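For the cross-platform campaign entity Eric describes, one minimal approach is to union each platform's spend export and roll it up through the metadata table that maps platform-level hashes to a single logical campaign. The schemas below are invented for illustration.

```python
import pandas as pd

facebook_spend = pd.DataFrame({"cid": ["hash_spring_fb"], "spend": [500.0]})
google_spend = pd.DataFrame({"cid": ["hash_spring_gads"], "spend": [800.0]})

# Metadata table: each platform-level hash points at one logical campaign.
campaigns = pd.DataFrame({
    "cid":      ["hash_spring_fb", "hash_spring_gads"],
    "campaign": ["spring_sale_2025", "spring_sale_2025"],
    "platform": ["facebook", "google_ads"],
})

spend = pd.concat([facebook_spend, google_spend], ignore_index=True)
by_campaign = (
    spend.merge(campaigns, on="cid")
         .groupby("campaign", as_index=False)["spend"].sum()
)
print(by_campaign)   # total spend for spring_sale_2025 across both platforms
```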
Starting point is 00:32:03 Great observation. There are a couple of things I'll clarify there. So for a complex user, yes, you're right. That definitely starts becoming a challenge, stitching multiple data sets. So a more advanced user, like you said, is going to want to know effectively a campaign across multiple platforms,
Starting point is 00:32:21 possibly retention, acquisition, engagement. It could be all of those. Yes, they're going to want to have different. First come in on web and then purchase later on mobile and all those different ways of challenging. Yeah, exactly. So just to give a concrete example, you're going to want to know how many emails did I send in months I VO,
Starting point is 00:32:39 how much ad spending did I have on that campaign campaign etc right so yep at the end of the day you're right you don't want to stitch multiple data sets together so that is challenging but i would say for the simpler users this is a little bit less of a challenge and this again goes back to which we won't beat a dead horse but goes back to what are you trying to accomplish and for simpler users i don't think you necessarily need to stitch together all of those channels. In most cases, it can be mainly orders, click stream, and possibly, depending again on what exactly you're trying to measure, possibly a couple like ad channels to look at like you're spending. Now, one other thing I'll point out too is you're absolutely correct that this problem is one or more identity stitchings. And that is, you talked about stitching a user, which in some cases, yes, like you're stitching a user together and a session.
Starting point is 00:33:34 You don't have to always stitch a user together. It can just be a session. Oh, yeah. To your point, again, it still is the identity resolution problem even for session and that it's a temporal problem so you're stitching one to end sessions over time so there's your temporal part together so you're effectively going what's you know what are all the sessions that point to a single version right so that's your node you're pointing all yours. Yes. That you're resolving.
Starting point is 00:34:05 Right. So it definitely is still an identity resolution problem, but it's somewhat of a different identity resolution problem depending on how you're looking at, how you're looking at, sorry, depending on what you're looking at to measure. Yep. Is what I would say.
Starting point is 00:34:20 Yeah. Yeah. Go ahead. I was just going to say, the way you described that is great because you have the campaign, let's say a campaign platform IDRES problem, you have the user IDRES problem,
Starting point is 00:34:34 then you introduce the idea of sessions. You could actually just look at sessions or user, but then in some cases you may want to look at both, and that's when things can get really gnarly because then you're looking at tying sessions, not only tying sessions to the attribution data and to a conversion, but then also tying users to sessions themselves.
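As a rough sketch of the session side of that temporal stitching, here is a simple sessionization pass using a 30-minute inactivity rule; the threshold and column names are assumptions rather than a standard.

```python
import pandas as pd

events = pd.DataFrame({
    "anonymous_id": ["a1"] * 5,
    "ts": pd.to_datetime([
        "2025-01-01 10:00", "2025-01-01 10:05", "2025-01-01 10:20",   # session 1
        "2025-01-01 12:00", "2025-01-01 12:02",                       # session 2
    ]),
}).sort_values(["anonymous_id", "ts"])

# A new session starts on a user's first event or after a gap of more than 30 minutes.
gap = events.groupby("anonymous_id")["ts"].diff()
is_new_session = gap.isna() | (gap > pd.Timedelta(minutes=30))
events["session_num"] = is_new_session.groupby(events["anonymous_id"]).cumsum()

# All of a user's sessions now roll up to a single node (anonymous_id here, or an
# email/user_id after identity resolution) -- the "one-to-N sessions over time"
# stitching described above.
print(events.groupby(["anonymous_id", "session_num"])["ts"].agg(["min", "max"]))
```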
Starting point is 00:34:55 You're getting into some pretty serious modeling. Which I think, to zoom out, is why it's easier said than done to just say, oh, well, just tell the marketing team that they can switch over to use the data that we have in the warehouse, right? Because they're doing some like really helpful things under the hood. You know, we could argue about the accuracy of that, but the sort of session level, user level, campaign level stuff you get out of the box is like, you know, it's very hard
Starting point is 00:35:22 to hand roll. Yeah. And I think that's part of the reason why people a lot of times will fall back to the platform to get conversion, which I think is, okay, like for a user who's just starting out, they don't, there's a point in time and the life cycle of a business for sure, that's fine. You just, you broadly care about how much you're spending,
Starting point is 00:35:42 how much you're converting your business, it's super small. But there's a point pretty early on where it's like okay i can't trust ad platforms anymore because i don't know if facebook is you know attributing over the last year we'll talk more about in a second in our attribution models but like treating over the last year if that user ever came to my site it's counting as a conversion right like yep yep you just don't know so yeah that's a it's very easy pitfall to fall into? Like, you just don't know. So, yeah, that's a, it's a very easy pitfall to fall into when you're like, oh, this is too challenging now.
Starting point is 00:36:10 We have the data, but it's too challenging. Let's just fall back to the platforms. Right, yeah. So, go ahead. Okay. Identity resolution is hard. We'll do a separate episode on that. By the way, amazing job threading the needle
Starting point is 00:36:23 on not, you know, getting us down a 30-minute rabbit hole there. I'm so glad we're here. It only took us two hours to get to the point where we have, let's call it a baseline data set for attribution. We have joined campaign data from a platform with some user-level data and or perhaps some session-level data. And we've done the appropriate level of identity resolution
Starting point is 00:36:53 across those different areas that we talked about, appropriate to our cruising altitude for the metrics that we want to produce. Okay, so now we have a table, or maybe more accurately, like a couple of tables, you know, that are, that can be joined to produce different metrics and different reporting for attribution. But now I think we have a bunch of decisions to make,, but this question's for both of you. Where do you start once you have this data set? Of course, where you want to measure, but you mentioned first and last touch.
Starting point is 00:37:35 We haven't even really talked about multi-touch. There's a machine learning aspect. Actually, maybe we start here. Lou, can you give us a breakdown of what are attribution models? I know that may sound silly, but especially for the listeners who haven't done a lot of research on this
Starting point is 00:37:57 or haven't built a lot of this, what are attribution models? Take us from very basic to maybe the more extreme end of the spectrum in terms of complexity. Absolutely. Yeah. So just to recap real quickly, attribution, it's at the end of the day, you're trying to figure out what channel or channels and my marketing ecosystem contributes to the conversion. So like in the e-commerce, for example, what channels contributed to the sale of a product to a user.
Starting point is 00:38:34 So you converted them for a prospect to an actual customer. So establishing that. Now let's set up a scenario of we have multiple different channels that we have campaigns going on right now. So let's say, for example, we have Google search ads. Then we also have Facebook ads. And then maybe we're using Klaviyo. So a user, setting up a scenario, a user searches for my cool company's product and sees a Google ad. Google ads are super prominent these days.
Starting point is 00:39:07 They're somewhat hard not to click. So you accidentally or you intentionally click on one, right? So now you go to that website and you establish that, okay, me as this anonymous user, I've come to this website. I didn't click on the Google ad. And you're like, ah, crap. You go back. I didn't mean to click on that then later you're in facebook and you see for my company and adigan for the same campaign that they're running on facebook and you actually click on that well
Starting point is 00:39:36 now you've come to the website again but this time instead of coming from google ads you've come from Facebook ads. And you're like, okay, actually, maybe this product is cool. I'm going to buy it. Right. And so you actually do go and buy it. Well, now who gets the credit is the ultimate issue. That's the, that's in that shell behind, you know, like attribution models. So like, you know, to your point, so there's been a conversion now,
Starting point is 00:40:04 but there's been two distinct events on two platforms that have contributed to the sale of this product. Yep. So it gets the credit. That's where attribution, the various attribution. Yep. I just wanted to say, of course, marketing gets the credit. Totally. We're that simple.
Starting point is 00:40:24 Yeah, that's cute but i mean think about it like it can get really wild if you've got like if you have like a sales team involved too and like we're talking not ecom anymore but maybe like sass like well the sales talk to them and the marketing did this and like i mean you can yeah yeah wild with an attribution model so this goes back to the people problem i alluded to yes again people are defensive about their KPIs when they're tied to their budget. Everybody wants credit, yeah. Right? Yeah.
Starting point is 00:40:48 When they're tied to their budget and their bonus. So first touch, what are the various basic levels? Yeah, first and last touch, which is sort of the most basic. Yeah, and I can unpack those, but yeah, go ahead. Yeah, so can you unpack those in the context of the scenario that you just the example you just gave yeah exactly so uh last touch is the more common of the two it's probably one of the most common but basically in the scenario laid out the user first clicked on google ads then last second right before the conversion they clicked on facebook ads so in a last touch
Starting point is 00:41:25 paradigm facebook ads would get 100 of the credit for that conversion for that sale because that was the last thing that the user clicked conversely if it was first touch google ads was the first thing they clicked on yep so that will get 100 of the credit for the conversion because that was the first thing they clipped on. Yep. So that will get 100% of the credit for the conversion because that was the first thing they clipped on. And so just to play that out, when we're calculating return on ad spend or ROAS, in Last Touch, you would basically say, okay, Facebook has a really good ROAS, but Google doesn't
Starting point is 00:42:00 because we are running a Last Touch model and Facebook's getting 100% of the credit. Yeah. So in that particular scenario, just like if you were just doing those two things for that, that one user. Yeah, exactly. Facebook would have 100% and Google ads would have 0%. Yes. Okay. Now multi-touch. Here's a funny question that I've never heard any stats on so you know you know like back in the day that like almost everybody did the little question how'd you hear about us question right so what do you think the stats are if i asked that user saw google saw facebook
Starting point is 00:42:39 clicked on facebook and he said how'd you hear about us google's a choice facebook's a choice and maybe you could be fancy and dynamically only populate those two choices. What do you think the stats are on something like this? Do you think most people are going to go with, well, Facebook where they won't know? Other. Other? Well, that's just out of laziness.
Starting point is 00:42:58 I'm saying you dynamically populate Facebook or Google. Yes, yes, yes. This is maybe a product that we've just invented here. This is a new product that's a really interesting i bet it would no i'm willing to bet money it would not be accurate to what actually happened yeah right right yeah people are notoriously yeah and even when they're trying to be like inaccurate about that yeah totally multi-touch so yes yes yes yes is linear attribution is probably one of the
Starting point is 00:43:28 more common of the slight less commons and linear attribution is everything that was touched gets equal credit so in this case now with linear attribution google ads would receive 50 and facebook ads would receive 50 so the thing I'll add to this is that seems, that seems like the way to go on the surface. Like, it's like, oh, well, that's way better. Right. And actually I believe that was, I had a conversation with Eric a long time ago about this and asked him like, which one do you recommend? I think it was you, Eric. And you're like, we recommend, we don't recommend linear attribution i'll just throw that in up front because that ultimately leads to infighting
Starting point is 00:44:10 among businesses people yes i do remember this conversation yes yeah right and i was like oh that's like as in the other ones don't lead to infighting well bold right like they don't exactly like they all do but this one in particular because people start like people start thinking they don't get the proper credit in certain scenarios more than ever and people start fighting over it and sure enough yeah like i have seen that happen before now where it's like even though it seemed good on the surface like at the end of the day like it's not such a good idea yeah and it's way more complex to calculate too which go ahead yeah yeah well i want to get into that but just a couple
Starting point is 00:44:49 examples i i remember in this conversation and let's take a b2c and a b2b example so in b2c let's say you have you know a paid search team let's say you have you have a team that is doing paid social, and let's say you have an email team. And so you can imagine that the paid search team, let's just imagine a sequence where the paid search team is getting a bunch of initial clicks following what you said. Maybe paid social is actually driving signups for the newsletter or signup for a coupon. And then the lifecycle team or the email team
Starting point is 00:45:25 is actually sending messages to this user to stay top of mind. And they eventually click on a link in an email and they make a purchase. And so the challenge is the Google team saying, they wouldn't have purchased if they didn't know about us and we're creating all this awareness and we gave them the first brand experience. And the email team's like, well, we're optimizing to the point where they actually convert. And if we weren't doing that, they wouldn't actually make a purchase.
Starting point is 00:45:47 And it's like, well, the challenge is both of those things are technically true, but if you have different teams optimizing towards different KPIs within that framework that's hard on the B2B side, it can be tricky, especially when you have a sales-supported motion where maybe you are serving a bunch of ads, maybe you have a free trial in your product experience that's driven by the product team,
Starting point is 00:46:12 but then you have an SDR that reaches out and actually books the meeting with the salesperson who closes it. It's the same scenario, the exact same scenario. Let's talk about calculating. You said it's really hard to calculate. So dig into that a little bit for us. Yeah.
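Before Lou digs into the mechanics, here is a compressed sketch of first touch, last touch, and linear attribution over a single user's touches, with a lookback window included since it comes up next. The 30-day window and the data are purely illustrative.

```python
from datetime import datetime, timedelta

# One user's ad touches and their eventual conversion.
touches = [
    {"channel": "google_ads",   "ts": datetime(2025, 3, 1)},
    {"channel": "facebook_ads", "ts": datetime(2025, 3, 8)},
]
conversion = {"ts": datetime(2025, 3, 8, 12), "revenue": 100.0}

def attribute(touches, conversion, model="last_touch", lookback_days=30):
    """Split conversion revenue across touches under a simple model."""
    window_start = conversion["ts"] - timedelta(days=lookback_days)
    eligible = [t for t in touches if window_start <= t["ts"] <= conversion["ts"]]
    if not eligible:
        return {"direct": conversion["revenue"]}          # nothing in the window
    if model == "first_touch":
        winners = [min(eligible, key=lambda t: t["ts"])]  # min timestamp
    elif model == "last_touch":
        winners = [max(eligible, key=lambda t: t["ts"])]  # max timestamp
    elif model == "linear":
        winners = eligible                                # equal credit to every touch
    else:
        raise ValueError(model)
    share = conversion["revenue"] / len(winners)
    credit = {}
    for t in winners:
        credit[t["channel"]] = credit.get(t["channel"], 0.0) + share
    return credit

for model in ("first_touch", "last_touch", "linear"):
    print(model, attribute(touches, conversion, model))
```

The same skeleton extends to weighted multi-touch by swapping the equal split for, say, a time-decay weight, which is where both the calculation and the arguments over credit get harder.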
Starting point is 00:46:38 So it definitely creates a lot more work and it's a lot easier to get wrong and creates a lot more testing to try and do multi-touch because you're no longer just at a high level you're no longer looking for i'll blow this down do you're no longer looking for a min timestamp or a max timestamp right effectively that's such a good way to describe how it gets more complex yeah exactly So now you're looking for a distinct set. Remember, this is a temporal problem. So you're looking for a distinct set of attribution traits over time. And then you have to aggregate all that together. And this is a temporal problem again, so you're doing that over time. So it really just is a lot more complicated to calculate.
Starting point is 00:47:29 And you introduce a lot of decisions. So I just, I hear you talk about that and you say, you know, you have to have, you have to pull together a sequence of distinct timestamps over some period of time. Right? And so the immediate question that comes to my mind is, what period of time, right? Is that a day? Is that an hour? Is that a year? Right? And I mean, that, so talk through that a little bit, right? Because that's non-trivial, both in terms of, you know, the actual reporting that you're going to produce, but also if you think about longer time periods, you could have an immense number of touch points, which you're talking about large data volumes, all that.
Starting point is 00:48:14 So walk us through those questions. And actually, Lou, walk us through those questions in terms of there are some established time periods in the ad platforms themselves, which can be initially helpful, but generally becomes problematic pretty quickly. Yeah. So the biggest one, which I believe is you were being kind enough to set up and lead to was the look back window for a particular model, right? So it's, as Eric was alluding to okay it's a time based problem so how far do you look back so i you know on that conversion from the facebook ad to the conversion i converted a specific point in time so how far back do i look to attribute because let's say for example that google ad i clicked eight days ago right and then
Starting point is 00:49:07 that facebook ad obviously i clicked when i converted so do you include or exclude that in linear do you include or exclude that facebook attribution again like it's you have to answer the question of what's my timeframe because you included if it's within the timeframe and you exclude it, if it's outside the timeframe. And I think I said Facebook there, but I meant Google. Sorry. Oh yeah. Yeah.
Starting point is 00:49:34 And the original, my apologies. Yep. So that's the biggest issue is look back. Now, if you think about that in terms of, you think about that in terms of linear attribution, now you have to figure out what are all distinct points of attribution within that time window.
Starting point is 00:49:51 And you have to take a snapshot at each conversion. You have to look back that many days, right? So it's a private conversion. You have to look at the time window for that conversion, right? So it becomes computationally pretty complex very quickly. Yep. And so the ad platforms, like you said, you can go in and look at conversion data
Starting point is 00:50:11 in the ad platforms themselves. And maybe this is a good opportunity to talk through, one, there are sort of built-in look-back windows. And then two, why do you eventually not want to rely on the conversion data in the ad platform? Yes, great call. Sorry, you mentioned that. I didn't touch on that yet.
Starting point is 00:50:30 I packed a bunch of stuff into that one question. And you had trouble going and doing attribution on the original first touch question. Yes, there are some more common ones. So that 7, 14, and 30 days are the more common ones I believe I've seen. I think probably 14 or 15 days are usually more common ones I believe I've seen. I think probably 14 or 15 days. They're usually the ones I've seen most people settle on. So like last few weeks. Yep.
Starting point is 00:50:53 There are benefits and pitfalls to each one of those. So the further back you go. So it's like, and one caveat, one side note real quickly. This was one of the reasons why google universal analytics so google three google analytics 360 was terrible at computing is by default it was six months right so it's basically covering everything yeah by default well yeah they changed i didn't know that ga4 yeah so it wasn't yeah wow no yeah it isn't six months i'm pretty sure it was pretty long i think ga4 went to 30 days of feminine correctly so it's better but basically you could argue and everyone
Starting point is 00:51:33 has a different opinion on this but like there's a certain point in time where like you should not be attributing a like three six nine months back visit to a bot. So you have to make that decision and calculate that. And I would say that's the decision and those are some of the more common ones. And then in e-commerce, you can have multiple conversions, right? So if you're set to first and then got a first impression
Starting point is 00:52:00 or first click from Google, then they buy like 10 things in six months like you're just racking up on that one google you know impression as far as your like return on investment right right yeah yeah that's a great point yeah which i mean actually it's an interesting point you may want you may actually want to have that view when you think about something if we think about and maybe i'm I'm getting a little ahead here, but if you think about answering your question, which channels bring in more users
Starting point is 00:52:35 who are high lifetime value users over a longer period of time, right? So we're not trying to answer what's driving the conversion, we're just saying okay, when someone first experiences our brand, which channels are the ones that tend to produce high lifetime value users over time, right? You actually do have to look over a long window. You know, that can be problematic in the ad platform itself. But again, I'm probably jumping the gun on like metrics and reporting. That can be a challenge in the ad platform itself, right? If you're trying to look over a longer period or even get the lifetime value data you know you really have to do that in your own data store yeah for sure and this really good i'm glad you brought that up john i did i didn't even highlight that one too and that's
Starting point is 00:53:14 actually another decision point right there is you have the option to include or exclude attribution once a conversion has occurred right so like that's another that's yet another decision points deciding what are the distinct timestamps right right yeah so like if i convert and then you know to john's point again like i convert in a day or two or i buy another product in a day or two and i technically have nothing new in there like do my old attribution points count if they're still within the window like am i still attributing that to, you know, the Google ads and the Facebook ads? Or is it once I get a conversion,
Starting point is 00:53:50 now that would be direct because there was nothing new in there. So you also have to make that decision when writing your model too. Now I will say last point real quickly, like I've generally seen it where once a user converts like that, you don't attribute things in the past again to that but you can right it depends on the business but go ahead john yeah i was yeah that's super interesting because i was thinking by channel and and i guess i'm just wondering out loud have either of you seen any like robust studies on like multi-touch attribution where somebody's actually trying to study like
Starting point is 00:54:25 Yeah, that's super interesting, because I was thinking by channel, and I guess I'm just wondering out loud: have either of you seen any robust studies on multi-touch attribution where somebody's actually trying to study consumer behavior and understand, per channel or per time frame, what actually makes more of a difference, versus an aggregate? Yeah, right. So I just don't know if there are any models out there that claim, "We studied consumer behavior and this model is more accurate because of that." Yeah. I don't know. I don't know,
Starting point is 00:54:51 but I feel like RudderStack is in a pretty good position to study that if they can get access — you know, work with enough of their customers to look at that data. You could probably start figuring that out if you got 20, 30, 50, 100 customers on board to study it.
Starting point is 00:55:06 That'd be interesting. Yeah, that is really interesting. We're just coming up with product ideas all over the place here. I will say it also gets interesting
Starting point is 00:55:14 when — generally, if it's worth it for a company to understand that at a fairly detailed level, they tend to be a larger company,
Starting point is 00:55:23 and they have a lot of channels. And then you introduce a host of other challenges around things like television advertising. Once you start layering in those components, the situation gets even more complex. Well, then at that point, from a consumer behavior standpoint, do you care? Or do you just go into the ML and AI stuff?
Starting point is 00:55:43 Yes, okay, that's a great segue. Or actually — before you go off that, real quickly — now you're getting into, and it's a good point, what would officially be termed online versus offline attribution. And you're right, there's marketing mix modeling, MMM, which tries to model for some of that. That's a whole other paradigm that companies potentially get into, too, if they do print ads or track customer walk-ins
Starting point is 00:56:14 at their physical stores. That adds a whole other layer of complexity to this whole paradigm.
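For a sense of what marketing mix modeling looks like in its simplest form, here's an illustrative sketch: a plain least-squares regression of weekly conversions against spend by channel, including an offline channel like TV. Real MMM adds adstock, saturation, and seasonality terms; the numbers and channel names here are made up for the example and aren't from the episode.

```python
import numpy as np

# Hypothetical weekly data: spend per channel (including offline TV) and total conversions.
channels = ["google_ads", "facebook_ads", "tv"]
spend = np.array([
    [1000, 500,    0],
    [1200, 450, 2000],
    [ 800, 600, 2000],
    [1500, 700,    0],
    [1100, 650, 3000],
], dtype=float)
conversions = np.array([130, 210, 190, 160, 240], dtype=float)

# Fit conversions ~ intercept + sum(beta_i * spend_i) with ordinary least squares.
X = np.column_stack([np.ones(len(spend)), spend])
coef, *_ = np.linalg.lstsq(X, conversions, rcond=None)

print("baseline (intercept):", round(coef[0], 2))
for name, beta in zip(channels, coef[1:]):
    print(f"estimated conversions per dollar of {name}: {beta:.4f}")
```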
Starting point is 00:56:37 We talked about linear multi-touch attribution. Let's quickly talk about weighted multi-touch, and then dig into machine learning and, you know, more probabilistic components. Yeah. So, with weighted, generally what I've seen is you weight the more recent touchpoints more heavily — you give them a higher percentage. So the last click wouldn't get 100 percent, but it would get a larger, higher weighted percentage than, you
Starting point is 00:57:03 know, the earlier ones. So, going back to our example again, Facebook Ads would potentially get a higher weighted percentage than Google Ads. And that becomes challenging, once again, if you have two, three, four, five, six different channels or campaigns. Right. Yep. Well, people will get angry if they were earlier in the cycle but got less credit. So I think that again just highlights one of the challenges of some of these more exotic, shall we say, calculations — in addition to the fact that it's even more complex to calculate, because now how do you choose the percentages for each point in time? You have
Starting point is 00:57:47 to come up with some sort of mathematical model, or buy one. Yep.
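To illustrate the "choose percentages for each point in time" problem, here's a minimal sketch of one common choice: an exponential time-decay weighting where more recent touchpoints get a larger share of the credit. The half-life and channel names are assumptions for the example, not a recommendation from the episode.

```python
from datetime import datetime

# Half-life in days: a touchpoint this old gets half the weight of one at conversion time.
HALF_LIFE_DAYS = 7.0

def time_decay_credit(touchpoints, conversion_time):
    """Distribute 100% of conversion credit across touchpoints by recency.

    touchpoints: list of dicts like {"channel": str, "ts": datetime}
    Returns {channel: share_of_credit} summing to 1.0.
    """
    raw = {}
    for tp in touchpoints:
        age_days = (conversion_time - tp["ts"]).total_seconds() / 86400.0
        weight = 0.5 ** (age_days / HALF_LIFE_DAYS)
        raw[tp["channel"]] = raw.get(tp["channel"], 0.0) + weight
    total = sum(raw.values())
    return {channel: weight / total for channel, weight in raw.items()}

touchpoints = [
    {"channel": "google_ads", "ts": datetime(2025, 1, 1)},     # older, gets less credit
    {"channel": "facebook_ads", "ts": datetime(2025, 1, 10)},  # more recent, gets more
]
print(time_decay_credit(touchpoints, datetime(2025, 1, 12)))
```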
Starting point is 00:58:15 Okay, one brief side note — I did think of another URL tip. Actually, I guess "tip" isn't necessarily the right word. We talked about it because, on the surface, it's the most straightforward use case: you have to put a URL and the parameters into the ad platform so that you can track when someone clicks an ad. But there's also a huge benefit to being disciplined about doing that on all of your own channels, right? The two main ones are email and SMS, where you're sending a message to a user through your own platform. Now, a lot of those tools have some level of attribution built in, but if you want to do multi-touch attribution or explore machine learning, having the same join key makes things way easier.
Starting point is 00:58:39 And then another big one that's so easy to miss is in-app touchpoints, where you may consider something an experiment or a touchpoint that you can include as well. That's another one. Like a push notification. Sure, yeah — something going out from the app itself, or maybe a section of the app that's promotional,
Starting point is 00:59:03 or whatever that is. Ubiquitous tagging, I guess, would be the concept there. Yeah, that's actually a really good point. I didn't touch on that at all. Fantastic point. When I was talking earlier, I mentioned doing it unique to, like, the ad campaign, ad set level. I guess I briefly touched on it with the campaign. That would be the campaign level, right?
Starting point is 00:59:25 Pretty much, yes. Okay, so I have a unified campaign — like new product X — that I want to advertise across both retention, so email, SMS, et cetera, and new customer acquisition, so prospects. Yeah. You might want to — you're right —
Starting point is 00:59:42 you might want to track that as a single identifier across multiple channels, exactly, and then join it later. Yep. Affiliate can be helpful too, right? Because, again, it kind of goes back to this: if you think of a campaign as abstracted across channels — agnostic to channel — having the hash join key is really helpful. It's easy to forget, and it takes a lot of discipline, but if you are disciplined about it, it can be really helpful.
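As a rough illustration of that "same join key everywhere" discipline — ads, email, SMS, in-app, affiliate — here's a sketch that appends standard UTM parameters plus a short campaign hash to a destination URL. Carrying the hash in utm_content and the specific hashing recipe are illustrative assumptions; the earlier conversation covered the hashing idea, not this exact code.

```python
import hashlib
from urllib.parse import urlencode, urlparse, urlunparse, parse_qsl

def tagged_url(base_url, source, medium, campaign):
    """Append UTM parameters plus a short campaign hash as a shared join key.

    The hash is derived from the campaign name only, so email, SMS, in-app, and ad
    links for the same campaign all carry an identical identifier to join on later.
    """
    campaign_hash = hashlib.sha256(campaign.encode("utf-8")).hexdigest()[:12]

    parts = urlparse(base_url)
    params = dict(parse_qsl(parts.query))
    params.update({
        "utm_source": source,
        "utm_medium": medium,
        "utm_campaign": campaign,
        "utm_content": campaign_hash,  # illustrative spot to carry the join key
    })
    return urlunparse(parts._replace(query=urlencode(params)))

# Same campaign tagged consistently across an owned channel and a paid channel.
print(tagged_url("https://example.com/new-product-x", "email", "retention", "new_product_x"))
print(tagged_url("https://example.com/new-product-x", "google", "cpc", "new_product_x"))
```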
Starting point is 01:00:14 And the challenge, too, is that in a smaller scenario, like Lew was saying, you probably just start with the platforms. But then in a larger scenario you have more teams, so now you're trying to coordinate this across teams. You're not just standardizing one team, like the team that's working on email —
Starting point is 01:00:30 you're standardizing a bunch of teams to all do it the same way. That in and of itself is a challenge. I'll tell you one thing that I've done in the past — and this is probably a good insight into me as a person, and probably actually both of you as well,
Starting point is 01:00:48 because I know both of you pretty well. But events are actually pretty tricky because it is actually something that happens at a distinct point in time but is very manual. It's essentially manual data even if you digitally scan someone's badge or whatever it is.
Starting point is 01:01:08 I mean, they put their name in an iPad or whatever. But I've actually generated synthetic events to send into the data store, tagged with a link that has a hash, because it's so much easier to represent that as a timestamped event, right? If you think about what we just talked about with multi-touch attribution, it could be that they click on an ad, maybe they get an email, maybe they come to an event. So synthetic events can actually be really useful for representing things that are really hard to timestamp, or offline data that doesn't come in a format that's easy to timestamp.
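Here's a rough sketch of what such a synthetic, timestamped touchpoint event might look like before it's loaded into the data store. The event schema, the campaign hash value, and the field names are hypothetical stand-ins — the episode doesn't prescribe a specific format.

```python
import json
from datetime import datetime, timezone

def build_synthetic_event(user_id, channel, campaign_hash, occurred_at):
    """Represent an offline touchpoint (e.g. a conference badge scan) as a
    timestamped event shaped like the online touchpoints it sits alongside."""
    return {
        "type": "touchpoint",
        "user_id": user_id,
        "channel": channel,            # e.g. "event"
        "utm_content": campaign_hash,  # same join key used on tagged links
        "ts": occurred_at.isoformat(),
        "synthetic": True,             # flag so it's distinguishable downstream
    }

event = build_synthetic_event(
    user_id="user_123",
    channel="event",
    campaign_hash="3f9a1c2b4d6e",
    occurred_at=datetime(2025, 1, 15, 10, 30, tzinfo=timezone.utc),
)
print(json.dumps(event, indent=2))  # in practice this would be sent to the warehouse/CDP
```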
Starting point is 01:01:36 So, yeah, that's another one. That's the beauty of QR codes that everybody discovered in 2020, right? That is true. It's so funny — yeah, QR codes. QR codes can also have hashes and URL parameters added to them. Okay, that concludes part two of our deep dive on attribution with Lew Dawson of Momentum Consulting. Tune in next week for the third and final installment, where we go deeper into multi-touch attribution, talk about reporting and measurement, and of course discuss AI's impact on attribution. The Data Stack Show is brought to you by RudderStack, the warehouse-native
Starting point is 01:02:25 customer data platform. RudderStack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at rudderstack.com.
