The Data Stack Show - 189: Customer Data Modeling, The Data Warehouse, Reverse ETL, and Data Activation with Ryan McCrary of RudderStack

Episode Date: May 16, 2024

Highlights from this week’s conversation include:
Ryan's Background and Roles in Data (0:05)
Data Activation and Dashboard Staleness (1:27)
Profiles and Data Activation (2:54)
Customer-Facing Experience... and Product Management (3:40)
Profiles Product Overview (5:10)
Use Cases for Profiles (6:44)
Challenges with Data Projects (9:19)
Entity Management and Account Views (15:33)
Handling Entities and Duplicates (17:55)
Challenges in Entity Management (22:18)
Product Management and Data Solutions (26:08)
Reverse ETL and Data Movement (31:58)
Accessibility of Data Warehouses (36:14)
Profiles and Entity Features (37:47)
Cohorts Creation and Use Cases (41:17)
Customer Data and Targeting (43:09)
Activations and Reverse ETL (45:57)
ML and AI Use Cases (55:53)
Data Activation and ML Predictions (57:02)
Spicy Take and Future Product Features (59:47)
ETL Evolution and Cloud Tools (1:00:50)
Unbundling and Future Trends (1:02:10)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week, we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. We are here with Ryan McCrary, who is a product manager at RudderStack, so close to home. And Ryan, you've been building a bunch of stuff that is intended to get business users
Starting point is 00:00:41 closer to data. And I'm really excited to dig into that whole problem because I think it's been a topic of late in the last year or two in the data space, or at least has reached fever pitch with venture-backed companies. So I want to talk about that, but briefly, give us a background. Yeah. So my original background is as a software engineer. I've been at RudderStack for a few years now in a number of different roles. So kind of approaching this from different phases of the customer journey. So started off in customer success engineering, working with our existing customers. Moved from there into solutions engineering. So
Starting point is 00:01:15 building higher level solutions for prospects and customers. And now I'm on the product team helping build out the actual solution that we needed this whole time. Awesome. Awesome. Yeah. So Ryan, one of the topics I wanted to dig into was data activation. I think I've had two or three conversations over just the last week around people complaining a little bit maybe on dashboards and like, oh, you make a dashboard, it gets stale, and then really desiring like, hey, what if I could have the data in the tools I already
Starting point is 00:01:44 use? And I think that's one of the big things on data activation. So happy to talk about that. And then I'm excited to hear what you want to talk about. Yeah. So from a data activation perspective, I mean, that's kind of the impetus for what we're trying to accomplish with Profiles and specifically with like that last bit of the activation piece.
Starting point is 00:02:02 And so we've talked about this, you know, a lot internally, but to your point, like, yes, everyone's used to like the world of BI and, you know, here's a view into our data, but it's largely worthless if it doesn't, A, align with what they're seeing in the downstream tools, but also just being in the downstream tools is kind of the prerequisite to that. So, you know, a dashboard's great, but unless you can act upon it in the tool where you live, whether you're marketing product, you know, advertising, anything like that, it's really largely useless. So really understanding not only how we model the data and visualize it, but then what do we actually do with it is kind of the key there. So that's kind of what we're, I guess, discussing today. Yeah, I'm excited. All right, well, let's dig in.
Starting point is 00:02:43 Okay, Ryan, with so much to talk about here, you mentioned Profiles, which is a RudderStack product. We did an episode on this a while ago, actually. So I want to get an overview of that. But first, I actually want to dig in a little bit to your experience going through multiple customer-facing roles before becoming a product manager at a data company. And I think this is really interesting. So you mentioned that you started at RudderStack on the customer success side, then you moved into a solutions architect role, and then eventually moved into the product. So my main question is, I mean, I'm assuming the answer is yes,
Starting point is 00:03:25 that, you know, being customer facing has helped you as a product manager. So maybe that's not true; if it's not, tell us. But if it is true, what are the specific ways in which that's influenced your role as a product manager, the way you think about building product? Yeah. So, I mean, obviously, in both of those previous roles, working very closely with customers, which, you know, I think is the way to understand what we're actually trying to solve for and what we're trying to build. And I mean, what you see pretty quickly is that everyone believes that what they're doing is very unique. But at the end of the day, that really aligns around a handful of use cases. And, you know, solving it over the years, we've seen a lot of tools kind of come into vogue and, you know, dbt probably being the primary one. And I'll, you know, go ahead and caveat that most of our customers that use Profiles also use dbt.
Starting point is 00:04:15 So we're not thinking of the tool as a replacement for that, really more as an enhancement. And so, you know, the strength of dbt, which is also kind of the pitfall for this particular application, is that you can do anything with it. There is no opinionation. I mean, it's largely just a better SQL interface, you know, but that's an oversimplification. And so Profiles kind of introduces a light layer of opinionation, specifically around customer data, right? So around the entities, as we call them, you're actually interacting with from a business perspective. And so, you know, our approach to that kind of data modeling is really around, as I refer to those entities, largely for most of our users,
Starting point is 00:04:56 that is a customer or a user or, you know, a person essentially. And so basically what we do is we do two primary things. Identity resolution, you know, so what are all of the various... This is the Profiles product. Yes. So for those listeners who didn't listen to the previous episode, which I love that we're just digging in here. Okay. So yeah, this is the Profiles product. So give us a breakdown. So RudderStack Profiles is the product. What does it do? Yeah. Sorry. A lot of excitement there just to jump in,
Starting point is 00:05:23 but yeah, the two kind of primary building blocks are the identity graph. So, you know, we have this customer journey across online, offline, different, you know, data sets. How are we reconciling that to a single user? And then that's really the foundation for, okay, what do we want to know about those users and making sure that we're doing that on the solid foundation of an identity graph that we believe to be true and trust. And so that's really it from a level of opinionation, right? Like we think about less of a just fully
Starting point is 00:05:50 unstructured or relational data set, but really more of how do we coalesce around that individual entity, that individual user? And I've mentioned entities a couple of times now. We can also have accounts or businesses or households or anything like that, which can be related to each other as entities, but they don't have to be. And so that level of opinionation is kind of what we built this on. You know, the reason for that is to find out, you know, that single view of the customer, whatever that entity is. Now we've got to get that, like I mentioned before, into an actionable place, right? And so like, what does that look like in a marketing automation or a CRM or something like that? That's really the reason for that opinionation.
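As an illustration of the identity graph idea described above, here is a minimal sketch of deterministic stitching: identifiers that are seen together get merged, directly or transitively, into one profile. This is not how Profiles is implemented (Profiles generates SQL that runs in the warehouse); the function name `build_identity_graph`, the ID types, and the sample values are all hypothetical, and a union-find structure is simply one common way to compute the groups.

```python
from collections import defaultdict


def build_identity_graph(edges):
    """Group identifiers that are linked, directly or transitively.

    `edges` is an iterable of pairs of (id_type, value) tuples that were
    observed together, e.g. an anonymous ID seen on a login with an email.
    Returns a list of sets, one set of identifiers per resolved profile.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two groups

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())


# Hypothetical observations: each pair was seen on the same event or record.
edges = [
    (("anonymous_id", "a1"), ("email", "jane@example.com")),
    (("email", "jane@example.com"), ("user_id", "u42")),
    (("anonymous_id", "a2"), ("user_id", "u42")),
    (("anonymous_id", "b9"), ("email", "sam@example.com")),
]
profiles = build_identity_graph(edges)
```

Note that `a1` and `a2` never appear in the same pair, yet they land in the same profile because both connect to `u42`. That transitive behavior is what makes a single bad link able to merge many users.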
Starting point is 00:06:26 And that's really kind of where the opinionation kind of ends. Okay. So, right. And obviously you and I both work for RudderStack. So, and John, you are an unbiased, you know, consultant. So I want to ask you a question here. I think there are actually maybe a number of questions, John, that you may have about Profiles, and then we'll get into like the activation piece and getting business users closer to the data.
Starting point is 00:06:48 But like, John, why would you use this, right? I mean, you were a heavy dbt user in multiple different roles previously. And I guess like another way to ask this question would be like, I've never been in a business where anyone asks for an identity graph. So to hear Ryan say like, okay, you have tools that help make writing SQL better, which is awesome, right? I mean, those are, we all use them, right? And so to add in an opinionated layer around identity resolution is really interesting
Starting point is 00:07:19 because that's not really, if you just go out and talk to a bunch of people, no, there aren't a bunch of people saying like, we need an identity graph. And so John, I mean, I'm sure you have answers for that, Ryan, but like, John, why is that? I mean, why is that a need? You know? Yeah. It's funny. I remember first talking about Profiles as a product and then some other, you know, similar products. And I had the same question of like, why would you not just do this in dbt? Right. Yeah. It's like that, and I'm sure, whatever. Or SQL. Yeah. Or whatever. Yeah. And I think the funny part is there are a select number of people, you've even, I think, talked to these people, that fully did this all themselves, custom
Starting point is 00:08:03 rolled. Maybe they just wrote custom software to do it, SQL, whatever. But I think that's a very small number of people. So then the next layer down is like, okay, to your point, like, I'm probably not trying to solve, like, an identity graph, but I want the result of like, hey, I want all my customer data in one place and I want to start adding features like churn prediction, I want to add a lead score, you know, I want to add those things and have them all in the same place. So once you get to that, and once you get sold, I think the other big concept, if you're kind of sold on the idea that we want to have first-party data and like have data in a warehouse versus like, keep it all in a marketing tool or all in an ERP or
Starting point is 00:08:46 something, I think that's the other key. But if you're there, then like you back into like, oh, we need to solve identity resolution, we need to solve some of these other problems. And then when you get there, I think it is easy to get trapped in like, oh, like we'll just write some SQL, won't be that bad, right? And then I think when you get there, and then you start down the path, you realize like, oh, like this is harder than I thought. Yeah. This is way messier than I thought. And then like any data project, you're like, even if you did it all yourself, you have to maintain it. Yeah. And that is the kicker with any data project, where even if you did an excellent job the first time, you don't remember what your past self did because it's so complicated.
Starting point is 00:09:30 And a team member that maybe has to come in and maintain it when you've moved on, maybe you're kind of doing special projects or something. It's just so hard. Yeah, yeah. I mean, I think that maybe is the story. So, you know, I think our listeners may know, like I used to do a bunch of marketing stuff at RudderStack and we did a ton of work around understanding attribution
Starting point is 00:09:51 and we did all of that in the warehouse, you know, with our own first-party data, whatever. And we happen to have this guy named Benji. We haven't had him on the show, actually. Probably be good to get Benji on the show. And he's just, like, unbelievable with SQL. And so I just had some attribution needs, and so I just went to him and, you know, I didn't ask for an identity graph. I was like, hey, actually I need to see like first touch and then I need to see like a couple other things or whatever, right? And then so, you know, five or six thousand lines of SQL later, you know, I have a table where I just every week I'm asking him to add another. Right. Right. You know, and he's really the only one who could do it, you know?
Starting point is 00:10:34 Oh, absolutely. Yeah. So that's kind of like John said, like, no one, and we mentioned it before, like, no one's ever saying, like, you know, what I really want to do today is like, get someone to build me a really solid ID graph. That's just like, gets me going. I think every, like, everything starts from a use case. So even, you know, John was saying, like, he didn't think about the ID graph as more of the features, but even the features themselves are driven by a business use case. Like attribution, like, okay, attribution is cool, but why, right? So you're trying to like actually measure and quantify that. And so, you know, if you think kind of top down, like you have that objective, you know,
Starting point is 00:11:14 we want to understand where we should spend more money, what's being effective, or even like from a multi-structure perspective, like what's the next best action. That then informs the features that we want to build. And then you always end up in that place of, to build that feature, we need to understand the full customer journey. And so that's where it really does rely on the base of that being the ID graph. The second thing I'll mention is, I've been the victim of some of Benji's work and other people where you have this thing. And I'll be completely honest, when I first, you know, saw the MVP of Profiles,
Starting point is 00:11:46 I didn't get it either. I thought to myself, and this is a part of Profiles, you know, it is, at the end of the day, it outputs SQL that you can read and audit. And I thought to myself,
Starting point is 00:11:54 like, this is something I can just write and do. Oh, interesting. Okay, so let's stop there for a second. So Profiles, so we're talking about the identity graph.
Starting point is 00:12:02 Profiles does additional stuff. But like the actual, like what happens is Profiles generates SQL and then that runs in your warehouse. Yeah, it's all in your warehouse. The data we're consuming is in your warehouse. The SQL is being shipped to your warehouse, and Python and some of the ML models. And then the outputs are actually in your warehouse.
Starting point is 00:12:20 So the tables that we generate are in your warehouse as well. So nothing leaves your warehouse. So we're generating SQL. Yeah. And so, and we see that a lot when we first, you know, talk to folks about Profiles, they're like, I can see the SQL, can I just use this? And the answer is yes, like you can just use it and that's fine. But to John's point, that's not the issue, right? Like to come like, yeah, Benji, like we've got a working model, we've got attribution solved, whatever. As soon as someone comes and says, hey, we have a new data set, we need this as an input, or like this is a new part of the customer
Starting point is 00:12:47 journey, or this is a new data set that's going to inform a feature. That's when the whole thing falls apart to shoehorn that in. And this is where you see data teams struggle. And well, really the business team struggle with getting what they need from the data teams is they go to a data team, they say, I need this simple thing added to this dashboard or to this metric. Can you just do it? And the answer is yes, It's simple. It's simple. And it is, but the problem isn't that feature. It's by shoehorning that into a multi-thousand line model, you risk affecting adjacent features. So customer success comes to you and says, can you add this in? This is simple. You say yes. And as soon as you get it deployed,
Starting point is 00:13:23 now sales is yelling at you because their dashboards are broken. And so the impetus for profiles there is to kind of encapsulate some of that so that you're not affecting other parts of the model as you add to it. Okay, so you mentioned customer success and sales. And so I actually want to ask about this because I'm genuinely curious.
Starting point is 00:13:39 I mean, I'm obviously fairly close to some of this stuff, but I don't know exactly how the sausage is made. So we get to hear all of that to bring our curiosities. And John, please jump in here. But like, okay, so one interesting thing that I know about RudderStack because I work here is that when I was doing the attribution stuff with Benji, we were looking at it all on a user level, right? So like, it's first touch, we're looking at leads, we're looking at, you know, how does someone enter the site? And then we're like, what did they go on to do? Right? And we're, you know, you break that down by channel. And there's all this sort of, you know, stuff,
Starting point is 00:14:14 right? Do they request a demo? Do they sign up for the app? Do they do all these other things? But the customer success team is actually much more interested in sort of a collective account view, right? They don't necessarily care about the lead number they want to know what like an account is doing in the product how much data are they sending which features are they using etc that is actually pretty interesting because like when because i also asked for that because of course we're doing attribution yeah eventually i asked benji to add columns that were representative of, I guess you could say, an account roll-up, right? So you have a user, but then I also want to know how many other users are associated with this account. Yeah.
Starting point is 00:14:56 That actually is where things got really wild. And if we thought it was complicated before, that's when Benji, like, that's when it got really crazy, right? Because. Right. That's when he quit. I think you're not. Yeah. That's when I inherited it.
Starting point is 00:15:12 That's actually, he did change roles. Not because of that. I didn't. Yeah. But can we talk about that? So, like, we talked about an identity graph. But, like, if we just have that on a user level, that's fine. Like Benji sort of rolled like a V1 of that in SQL. But how do you think about these different sort of, I mean, you said entity, like what does entity mean? But like
Starting point is 00:15:34 account user is kind of a classic version of that. Yeah. That's a really common one. You know, another common one in a different space would be like a household, right? So like a roll up of multiple users, but an account, we think of the same way. And so like to kind of where we started with this, when you think about like, let's take RudderStack as a product, as an example, from a sales and marketing perspective, we care about getting an individual across the line,
Starting point is 00:15:57 whatever we define a conversion, right? Like signing up for the app, setting up a source, whatever the case may be. From a customer success perspective, not to say they're not worried about the individual user, but when they're thinking about what is the overall product adoption or health score of this account, they're thinking of all of the many individuals within that
Starting point is 00:16:17 and how they're behaving. Because different people are going to use different parts of the product. Yeah, exactly. And some might not use it at all, right? So when you think about, again, using RudderStack as an example, you know, the front-end engineer who may be responsible for most of the instrumentation,
Starting point is 00:16:30 they're never in the app. So if you're looking at it from, if you're looking on an individual basis, you'll say like, hey, this front-end engineer is very uninvolved. Sure, they're purely upstream of the API. Like, they're just sending it to an endpoint.
Starting point is 00:16:40 Yeah, they're just sending it. But if you look at it on an account level, you might say, oh, wow, well, you know, the business user is in here daily, you know, looking at the health score of, you know, their health dashboard, understanding the, you know, the overall volumes, their downstream destinations, they're the ones that are getting the emails that says like this threshold is dropped, go in and check. And so like, on one sense, there's the aggregation of those. On the other side, there's also excluding those. So like I work very closely
Starting point is 00:17:05 with a lot of our customers. So I'm in some of their workspaces. So if you were looking on an individual level or if you weren't calculating the account entity correctly, you might say like, wow, this is a really active account.
Starting point is 00:17:14 Like they're in there setting stuff up. This person's in there every day. But then you realize that's me. That's like an internal employee acting on the customer's behalf, but I'm part of their account. And so it's not only important
Starting point is 00:17:24 to include the right metrics in that, but also to actually exclude things that might, you know, influence that incorrectly. Yeah, like dev versus prod also, right? Like you may see a bunch of activity in a dev environment. Yep. That's interesting. John, how did you, like, entities?
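The account roll-up with exclusions described above can be sketched in a few lines. This is only an illustration under stated assumptions: the event shape, the `INTERNAL_DOMAIN` value, and the `account_rollup` function are hypothetical, and in practice this logic would more likely live in SQL in the warehouse.

```python
from collections import defaultdict

# Hypothetical: the vendor's own email domain, used to spot internal staff
# who are working inside a customer's workspace on the customer's behalf.
INTERNAL_DOMAIN = "vendor.example"


def account_rollup(events):
    """Aggregate user-level events into account-level metrics.

    Skips events from internal employees and from non-production
    environments so they do not inflate the account's apparent activity.
    """
    active_users = defaultdict(set)
    event_counts = defaultdict(int)
    for event in events:
        if event["email"].endswith("@" + INTERNAL_DOMAIN):
            continue  # internal employee acting inside the customer account
        if event["environment"] != "prod":
            continue  # dev or staging activity is not real product usage
        active_users[event["account_id"]].add(event["email"])
        event_counts[event["account_id"]] += 1
    return {
        account: {"active_users": len(users), "events": event_counts[account]}
        for account, users in active_users.items()
    }


events = [
    {"account_id": "acme", "email": "dev@acme.example", "environment": "prod"},
    {"account_id": "acme", "email": "ops@acme.example", "environment": "prod"},
    {"account_id": "acme", "email": "ops@acme.example", "environment": "prod"},
    {"account_id": "acme", "email": "ryan@vendor.example", "environment": "prod"},
    {"account_id": "acme", "email": "dev@acme.example", "environment": "dev"},
]
rollup = account_rollup(events)
```

Without the two filters, the same account would show three users and five events, which is the kind of misleading "really active account" signal described in the conversation.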
Starting point is 00:17:40 Like, talk about entities a little bit. And like, did you face any of that? So, I mean, the funny thing about entities is if I had to pick something in data that almost everybody handles poorly, it would be entities. But it's just like one thing where, interacting with like companies I've worked for, companies I've worked with, yeah, like getting that, it's so hard for them. And I think some of this is, a lot of products are tiered around entities, where like you can do the individual user, but you have to be enterprise to like do, that's part of it. Yeah, and then people like end up just hacking things, so they don't have to, you know, upgrade. Yeah. But the second, well, this is probably even bigger than entities, and I have to bring this up, is duplicates. Yeah. So when you're doing ID resolution, like there's no magic, right? There's no AI magic that like can deduplicate your customer records yet.
Starting point is 00:18:29 Yeah. But, and then of course, there's two different types of duplicates. One, which ID res solves of like, it's in two different systems and we have an ID and we can like stitch them together. The other one is the one that is the tough one where they're truly duplicates.
Starting point is 00:18:44 They have different IDs. There's no like clear way to do that. But I'm sure people would be interested in like how people, you know, how people are addressing that problem or how you've seen customers address that problem. Yeah, I mean, that's a tale as old as time. I mean, when we think about stitching users together in Profiles, it largely is a deterministic system, but the way that we stitch is based on the ID types themselves. And so that gives us the ability to map back and find some of those outliers. So when we think about setting up the initial ID graph, we have some scripts that will run some QA on that. And there are some that are very easy to spot. There are others that are more difficult. So we
Starting point is 00:19:19 worked with a customer recently who, you know, we built this out. They were very pleased with it. But when we did the kind of QA of the ID graph, we found there was a single user that had, I think they were stitched to like 10,000 different identifiers across the user stack. And you might be in trouble if. Yeah. And there's two things to think about there. One is depending on the use case, you may not care because you may know, hey, that's an internal user impersonating folks. Oh, sure. Testing. If we're stitching it together, like for marketing use cases, it doesn't matter.
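The kind of QA pass on an ID graph mentioned above can be sketched as a count of identifiers per ID type for each resolved profile, flagging any profile that exceeds a limit. This is a hedged illustration, not the scripts Ryan refers to: the `flag_overstitched` function, the data shape, and the limit values are all hypothetical.

```python
from collections import Counter


def flag_overstitched(profiles, limits):
    """Return profiles whose identifier counts exceed a per-ID-type limit.

    `profiles` maps a profile ID to a list of (id_type, value) tuples.
    `limits` maps an ID type to the maximum count considered normal;
    ID types without a limit are not checked.
    """
    flagged = {}
    for profile_id, identifiers in profiles.items():
        counts = Counter(id_type for id_type, _ in identifiers)
        over = {
            id_type: count
            for id_type, count in counts.items()
            if id_type in limits and count > limits[id_type]
        }
        if over:
            flagged[profile_id] = over
    return flagged


# Hypothetical resolved profiles and hypothetical limits.
profiles = {
    "p1": [("email", "a@x.example"), ("anonymous_id", "1"), ("anonymous_id", "2")],
    "p2": [("email", "b@x.example"), ("email", "c@x.example")]
    + [("anonymous_id", str(i)) for i in range(5)],
}
limits = {"email": 1, "anonymous_id": 3}
flagged = flag_overstitched(profiles, limits)
```

A flagged profile is a prompt to investigate rather than an automatic error: depending on the use case, the cause may be a harmless internal test user or an instrumentation flaw worth fixing.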
Starting point is 00:19:52 That means that person might get a couple extra emails to their other emails. And so it's not a huge deal. When we're thinking about, you know, maybe custom offers or if we're doing more sophisticated things like, you know, there are folks that use their customer data for like password unlocks or account unlocks. That's much more important to be stitched to that user. So we find, you know, part of it is that could seem like a bug of like, oh, wow, profiles is stitching all these people together. In a way, it's a feature because it helps point out then instrumentation flaws.
Starting point is 00:20:15 And so what we realized with this particular customer is they, it was their policy, their standard operating procedure was that some of their employees would impersonate other users to place orders on their behalf. And so that seems, you know, fine, but then you realize it only takes one node to stitch all those together, right? Like when you now sign in as this person, you know, anonymous IDs is a good example. Every time you clear your cookies or launch your browser, you may get a new anonymous ID. So a user having a bunch of anonymous IDs, not a red flag, but if two users have a bunch and now you've impersonated that one, you're now stitched to all of those other anonymous IDs
Starting point is 00:20:48 and everything that those are stitched to. And so we do have mechanisms around excluding specific nodes, whether that's things, you know, often we'll know that, like, let's just ignore internal email addresses. But we can also do it programmatically where we feed duplicates above a certain outlier into a table that are excluded in the future. And so a good example is for most operational systems, you know that a user should, I don't want to say most,
Starting point is 00:21:10 but by and large, a user should have one email address. And so for most systems, if they have two email addresses, that is something we want to take a look at and understand why were those stitched together? Because that's going to be more wide reaching. And so we can also put thresholds around individual ID types of like, again, for anonymous IDs, we're okay with the threshold of, you know, anything below 250, whereas emails, we want it to be exactly one per user or internal IDs as well. So it helps kind of find some of those anomalies. And again, a lot of times
Starting point is 00:21:37 that's the challenge that we're helping solve: this goes back to instrumentation. And sometimes, and with the... Or like data inputs in general. Exactly. Yeah. And so like, you know, for the customer I was just referring to, we actually realized they had really good server-side identification on these users. So we're really just able to basically ignore anonymous IDs. Oh, interesting. Because we know there's, you know, web browsing behavior. Yeah. We know that there are systems in place that we're not going to get rid of that are merging some of these together, but we're using a much more robust, you know, internal identification system. And so it was really fine to ignore those. But that showed us
Starting point is 00:22:12 that and then also allowed us to speed up the project because that's a lot of stitching you don't have to do. Sure. Oh, gosh. So I have a funny example of this. And I think this happens a lot in businesses. So at a previous company, we had an order management system not connected to some of the online systems we used, and people would enter orders, right? And then we had some integrations that would flow between the systems. And it was funny, you got me thinking about it with the ID graph thing, with like one node tied to a bunch of different IDs. So we had this customer in there, you know, and it started popping up on analytics reports. It was, let's say, Jane Smith. It was some person's name, and they would just have this massive number of orders. We're like, nobody's ever talked to her. We need to give her a call.
Starting point is 00:22:54 This is our best customer. Yeah, this is our best customer. Who is this? So it was funny, and I think this is true of a lot of OMS and even CRMs. What had first, it grabbed the name off of the first order that came in and it stuck that. And it was just an integration where it was all the Amazon orders. So everything that came in from Amazon, it grabbed it, grabbed the first one that came in was like Jane Smith and then it just stacked them up.
Starting point is 00:23:17 So if you did reporting off of it, it was like, wow, who is this Jane Smith? So I think that happens in a lot of these systems. And if you're just browsing like one record at a time like operationally like
Starting point is 00:23:29 it just doesn't show up but once you get into the data problems like it it shows up in a big way sometimes. Yeah. Yeah that's super
Starting point is 00:23:36 interesting. And I think one thing I'll add to that is you know I mentioned that everything we're doing is SQL that you it's running on
Starting point is 00:23:42 your warehouse that you can see. And I think that's a big, that's something that's appealing to me because, you know, traditionally, using these black box systems, you don't ever see what's happening. And, you know, like you're... Yeah, I dealt with that for who knows how long until they realized like, oh shoot, like this has been happening forever. Yeah. And in a closed system, that just happens and you're unaware, whereas if you can see the SQL that's running, you can, yeah, debug, you know, what might be causing it. Sure.
Starting point is 00:24:04 Yeah, yeah. Well, I mean, going back to, and then I want to talk about, I mean, we haven't even gotten to the pictures, but that's fine. Brooks is actually not here today for all the listeners. And so we can go long, which is... I get invited when the producer's not here. That's exactly right. The producer's gone. Let's get Ryan on the phone. One of the... So actually, so John, I want to return to something you said. So you said one of the things that, you know, most companies do poorly from a data perspective is entities, right? I think that is probably most of the explanation of why every Salesforce is the biggest nightmare. Yeah. Right? Yeah, it is. Yeah. All Salesforce customization is like trying to wrangle entities
Starting point is 00:24:47 into a system that is like a lead and contact account opportunity, like whatever, you know? That's why Salesforce developers exist. Sure. Yeah. And they make a very good living. I know. Yeah.
Starting point is 00:25:01 But it is essentially like fairly complex entity resolution inside of a system that doesn't support, that only supports, like that is only designed for a simple problem. But even just that simple, like, parent company, child company, or, yes, like multiple people in one company, like that's easy enough to mess up. But once you have parent companies and they spin off and then they merge back together and they change names a hundred times, like that's the challenging data problem, especially over time. Like, do you want to update that information forever? Or do you want to keep a record of like, in 1997, they were this and then, you know, like, in data, it's like that slowly changing dimension problem, which almost nobody does. They just, you know, retroactively. Sure. Yeah, we talk about it a lot. But we just retroactively, like, just, you know,
Starting point is 00:26:00 update it every day when they change names, or, you know, get acquired. Okay, so let's switch gears a little bit here. So, A, that's really interesting. And I have a bunch more questions, actually. Actually, okay. One more question on this to close it out, just from a like product manager standpoint, just because I think it's really interesting to think about how we build data products
Starting point is 00:26:19 generally, right? We've had a ton of people on the show, but this is very interesting to me. So as a product manager, one of the things that you spend a huge amount of your time on is a product that generates SQL, the first output of which is an identity graph, which is something that no one asks for inside of a company, but that is required in order to like resolve entities or whatever. How do you think about, and that's a very, that seems like a very difficult problem, right? Where it's like, you don't like, no one's asking for this, but it's actually what you need. I mean, you're hurting my feelings right now. Like you work on a product whose primary output
Starting point is 00:27:01 no one wants. Thank you. I mean, I think it goes back to what we were saying before is like, you have to solve that to solve the actual problem. Right. And so, and what is the actual problem, actually? Like, I know you mentioned this, but just to say, like, because identity graph is a stepping stone. Yeah. I mean, the actual problem is to solve business use cases in the tools where these business stakeholders live. You know, like, again, like we talked about Salesforce.
Starting point is 00:27:24 Like, I don't care who you are. You're not going to get your sales team out of Salesforce. Yeah, of course not. You shouldn't. Yeah, you're not going to get your marketing team out of Customer.io, Iterable, Braze, whatever you use. Like, that's where they are going to live. That's where they are doing their jobs. And so, you know, all of this is for nothing if we can't make it useful. Totally. So I guess, like, maybe to put a little bit of a sharper point on the question, the solution to that problem and what you're building lives really far upstream of Salesforce. I mean, I guess you could argue about the distance, right? But the person in Salesforce probably should never know about the intricacies of, like, entity resolution or all of that that's happening, you know, in the data warehouse, right? Yeah.
Starting point is 00:28:09 How do you think about that just as a product manager? And like, you have this outcome that needs to happen in a business. And then you have this really technical process. Well, even I guess what's interesting about profiles is like, the identity graph is actually just a stepping stone to produce like computed user traits, right? Yeah. And so it's even upstream of the stuff that the data team produces. Yeah. Yeah. I mean, at a high level, you know, it's a single product profiles, but there are really two interfaces for it. There is the actual data definitions, you know, this ID stitching,
Starting point is 00:28:44 the building of the features, you know, which eventually result in these output tables that we mentioned in the warehouse. But it also, yeah, like I said, it's all for nothing if you can't access it. And so, you know, we have a UI that essentially, you know,
Starting point is 00:28:58 so backing up, I guess, profiles is a set of configuration files that connect to the warehouse, you know, build these queries, run these queries, build output tables. And then that's all done in, like, a version-controlled environment.
Starting point is 00:29:09 So you can manage that in whatever version control you use. And then that actual, you know, Git repo can be connected within the RudderStack UI and allow the business users to interact with it. Oh, interesting.
Starting point is 00:29:22 Okay. So the data team is doing all of this in their own dev workflow with config files. Yep. But then the actual user interface, like the RudderStack web app, is reading the outputs? Yeah, it's connected to that Git repo, which is what is being used to kind of source and build those tables. And then the UI sits on top of the warehouse tables as well. So, you know, I always have to preface this when I'm doing demos or explaining the product to folks is that the UI is admittedly slow because it's actually pulling from the warehouse. Everything that you see exists in your warehouse. Wow. Okay. And so of course, the reason for that has to be to expose that data to someone who's not on the data team, because why wouldn't I just go, like, into a warehouse? I'm literally already there and I have the config files, right?
Starting point is 00:30:13 Yeah. Okay, so walk us through that. Yeah, I mean, if you think of it, it's a spectrum, right? So there are a lot of teams that operate in a lot of different ways. In some teams, you know, you have to teach everyone how to use the BI tool or how to understand how to query this data to get what they need. And so the way that we think of it is, you know, how can the data team live where they want to live? Can they have a technical tool and use software development best practices, but then give that to the non-technical stakeholder? How can they, you know,
Starting point is 00:30:43 have that in a way that they can see and understand it and then understand like, at what grain am I sending this to the downstream tool? Like, what do I actually want to send? And again, different teams operate different ways. Some teams send everything. Some will, you know, slice that data according to their needs and send, you know, subsets of that. Some will send full audiences as just lists of users, some will send traits and then do the dynamic audiences in the downstream tool. It really kind of caters to whatever they need. We talk about it.
Starting point is 00:31:13 It's kind of funny. I mean, this is a technical audience, so I can be honest here. But we talk about it as being, you know, for the two different users, you know, there's the technical solution for the technical user and then the UI version for the non-technical user. But they're really all for the technical user. The technical user wants the business teams to have that self-serve as much as, maybe more than, the business team wants self-serve. Because they don't love sending CSVs. Exactly. And they don't want to handle all the
Starting point is 00:31:39 tickets. So they're both for the data engineer, but that's a necessary solution for the technical. So it's like a presentation layer. So like, hey, look, I can show you what this thing does. Yeah. Yeah. Okay. Okay.
Starting point is 00:31:52 So can we take a slight but related detour and quickly talk about reverse ETL? Yeah, the producer's not here. We can do whatever we want. The producer's not here, so we can do whatever we want. So this concept of reverse ETL, you know, has cropped up in the last couple of years, but I think it's actually an old idea, right? I mean, this has been happening. It's yeah, it's ETL actually. I mean, you've actually mentioned this. Like you've talked with a bunch of companies who just call it ETL. Yeah. I mean, it is. I mean,
Starting point is 00:32:19 I would say I would put reverse ETL in the same bucket as the ID graph. It's not, no one's like, give me a reverse ETL. I mean, they are now because we've told them they want it, but like no one's out there. Like, you know what I really want to do today is like get into some really cool reverse ETL. Like it's just a, it's just a means to an end. Like it's in the same way that the ID graph
Starting point is 00:32:37 is what we need to build, you know, reliable data solutions. Reverse ETL is what we need to get those data solutions into the tools where we actually can use them. Okay, so spicy data take here from both of you. Like, how did it become...
Starting point is 00:32:52 There's obviously a ton of buzz around it, right? I mean, RudderStack has a reverse ETL pipeline, right? Yeah. But then the other thing is just it seemed like this... It seemed like a quote-unquote
Starting point is 00:33:04 industry unto its own. But now every company is building this, right? Even, like, the marketing tools, right? So John, I mean, you see this every day, right? I mean, it's actually just ETL, data movement, and any company can build a pipeline to slurp it up. But how did it become, like, a thing for a couple of years? Yeah, we talked about this a little bit before the show today. And my theory on it is, I had this pulled up, but I think Snowflake IPO'd in 2020, around 2020, you know, biggest IPO in tech history, really splashy. So that freed up a bunch of money for startups, right? And then it seems, I think those were Eric's words, like, people form startups around features of products.
Starting point is 00:33:47 And then you had all these tiny little slices of, like, we do ETL. Well, we do reverse ETL. We do observability. We do transformation. I mean, just every little slice imaginable, right? And they all got funding. And then in the last couple of years,
Starting point is 00:34:02 AI has kind of been the focus, right? So the funding's a little bit drier in the data space nowadays. And you're seeing some merging of companies and some acquisitions and some others that are like, I don't know if they're going to make it. But so I feel like it's just the macro environment that created it, honestly. Like, in another time, you know, let's say Fivetran, would Fivetran just be, like, the data pipes company? They do reverse and maybe transformations too? Like, I don't know, maybe. Which is kind of what Alteryx is,
Starting point is 00:34:30 right? Yeah, like, because that was, like, a generation before, right? Exactly. Yeah. So, I mean, who knows? Remains to be seen. I agree. I think I would add on one layer, and Ryan, feel free to disagree with any of this, because we love a spicy take when the producer's out. Yeah. But I mean, the intent is good, right? Like, I think, yeah, I mean, to your point, Ryan, you're like, who cares about an identity graph if you're not getting it into some tool that marketing can use to send a campaign to, like, increase conversions, right? Or whatever their use case is downstream, right? And of course, if you're just writing a Python script to do that, you know, or you have some custom ETL job, like, that's annoying to manage over time, and it's, you know, arguably not the best use of the data team's time. And so having that as a managed
Starting point is 00:35:16 service, like, of course makes sense. But I do agree that, you know, of course, like, it is probably a feature. And we see that now, right? Like, now marketing teams can literally self-serve from their own platform, like, data that's available in the warehouse. Yeah, I think some of that comes from, too, I mean, to John's point about the Snowflake IPO, I mean, I think you've seen a huge acceleration too of just the accessibility of a data warehouse. And so you've got teams that normally wouldn't have had access to that. Now you can just go sign up for Snowflake for free or BigQuery or whatever.
Starting point is 00:35:48 And so these are smaller, you know, maybe even younger software teams that a lot of times maybe aren't even data teams. They're just the software. They're just the engineering team.
Starting point is 00:35:57 And so ETL is not a concept that they're well-versed in. And so there is a place for, I think, reverse ETL from that perspective. But I think as you enter into a mature data team, you see that it becomes much more of a, you know, just kind of table stakes. Yeah. Yeah. I think you bring up a great point, Ryan, because historically databases were very locked up. Yeah. Like if it's a production
Starting point is 00:36:21 database, it's lock and key. You lock developers out of it. You've got, like, a couple of ops people that have access to it. You've got privacy concerns. You've got uptime concerns. We don't want to take down production databases. So some of it too is that, like, oh, I can go click a button, sign up for this thing, and have a database. Like, this is cool. And then I can move just the data I want, and it's not going to impact, you know, production, and I can anonymize things. Like, some of that is, like, we kind of unlocked what used to be, like, a lot more tightly held. I think about the early days of Data Studio in just how, well, we probably don't need to go down that path. But it was, there was a certain element of magic to it.
Starting point is 00:37:04 Oh, it was certainly magic at the time, yeah. It's so easy to get data into BigQuery. Yes. It's so easy to just lay Data Studio right on top of this and, you know, do, like, really cool reporting things that were so, so hard, yeah, any other way. Now, of course, like, you know, Ryan's "ugh" was like, yes, there are a number of things about that. You prefer him to call it Looker Data Studio? Oh, gosh. Oh, yeah.
Starting point is 00:37:33 Might be worse. Okay, yeah. Well, okay, that's a totally other episode. That's a totally separate episode. Okay, so we talked about reverse ETL a little bit. Thank you for the spicy takes. But let's just say I have any reverse ETL pipeline. It doesn't matter, right?
Starting point is 00:37:48 But profiles is outputting this identity graph. You build all these traits in profiles or, you know, what do you call them? Features. Features, okay. Okay, so features, user features, entity features, I guess, if I need to be very accurate. Yes. And so I just have this table, or maybe a set of tables, that are like, okay, this is my entity, and here is, like, everything I know about this entity. So do I just slap a reverse ETL
Starting point is 00:38:12 job on there and, like, I'm off to the races? And this is, of course, a leading question because, yeah, you as a product manager just shipped two features. One is called cohorts and one is called activations. And cohorts is actually sort of like an opinion about creating subsets of this giant table that represents an entity. Yeah. And so why don't I just use a reverse ETL job to, like, connect to this entity table, yeah, and then send the data where I want? Yeah, I mean, you can. Ultimately, like, that was kind of the original intent, you know. It's just like, hey, you have this, you can send all of it or some of it, you know, wherever you want. And, you know, I think that becomes a challenge at scale because, you know, who knows what or how you want to send that? You know, like, that's a level
Starting point is 00:39:03 of opinionation for the business to understand. And so what we did, so... Or just, I mean, honestly, just a ton of data. Yeah. Hundreds of columns. Yeah. I mean, if you think about sending all of that to a downstream tool, I mean, yes, most modern tools support custom traits and things like that, but you do just, you ship that mess somewhere else now. And so you have to deal with that. And so I've mentioned entities a couple of times. Early on in the product, we had the concept of entities since day one. We noticed customers kind of almost hacking those.
Starting point is 00:39:37 So like a good example would be, you know, multiple customers we found were stitching users as an entity together and then had a second entity that was like known users or customers, so essentially, like, re-computing that identity graph for users where they had an... Oh, interesting. Right. So, like, you want the whole ID graph when you think about things like attribution, when you want to, you know, you have a bunch of anonymous users but you still understand how they got there or what their behavior is. But then when it comes to targeting them, you know, again, to the same point of tons of columns, that's tons of just empty records
Starting point is 00:40:08 that you don't need to send to your marketing automation or ESP or anything like that. And so we found them kind of hacking this together as a user's cohort and a customer's cohort, or a user's entity and then a customer's entity, which was just, at the end of the day, like a subset of that, but it's just driving compute. A filter.
Starting point is 00:40:24 Yeah, and so cohorts was kind of born out of that. It's, okay, you have your entity. You know, you can now define, on those traits that exist in the entity, a cohort, which is a subset of that entity graph. And all it's really doing is filtering that stitched master ID based on some criteria that exist about those. So now you have an entity, and then you can have a known user or a customer cohort within that, or really any type of cohort that you'd want. Those cohorts can also have different features than the main set. And so that's kind of why we saw customers beginning to break that out into a different entity, because if you think about, you know, something simple like just calculating an aggregate of LTV on customers,
Starting point is 00:41:09 even if everything's null, to calculate that on all of your anonymous users takes time and compute. And so you really want to actually compute those features on the cohort which they actually apply to. Yep. That makes total sense. Yeah, that's super interesting. Have you seen cohort creation kind of follow team use cases? And so I guess like the immediate thing that came to my mind was if I have a known users cohort, like as someone who works in product or I'm trying to understand feature adoption
Starting point is 00:41:40 or I'm trying to understand, I'm trying to increase lifetime value of my customers or e-com or whatever. Or do those like, do cohorts sort of fall along business lines or what kind of patterns are you seeing there? Well, that's where it gets really interesting. We've seen, you know, customers in different verticals and really even different structures of internal teams that have taken those different ways. And so in some cases, yes, it's by kind of function, you know, so the product team wants to look at this, the marketing team wants to look at this different cohort. We've also seen cohorts acting as journey steps or funnel steps where you can have mutually
Starting point is 00:42:11 exclusive criteria for each of these, and folks can move, you know, between them and you can target those accordingly. That's something where, you know, we're still deciding where we probably won't have a heavy opinion on that because I think it really depends on the team and how they operate. And so some teams will have just that basic, you know, user cohort and the known users, and then they will activate or send that known users cohort, either the whole thing, or they'll segment on those features that exist and send subsets of those with them. That's where, you know, maybe a more resource constrained team where there's a single data
Starting point is 00:42:42 engineer that's saying, this is the clear definition of a customer. You guys go run with it and send it to the tools how you want. And then we see teams with more robust data teams where they say, here are the five primary cohorts that we have defined and split the customers into and then put features on. And so these are your entry points. And that could be something like US customers, or, you know, we've worked with an e-com customer recently that their primary ones are business and residential and those two teams operate very differently. Oh, interesting. Yeah.
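Editor's sketch (not part of the conversation): the cohort idea described above, a filter over the stitched entity table with features computed only on the rows they apply to, might look roughly like this in Python. The names EntityRow, main_id, known_users, and ltv_by_user are hypothetical and are not RudderStack Profiles APIs; per the discussion, the real product generates SQL that runs in the warehouse.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EntityRow:
    """One row of a stitched entity table (hypothetical shape)."""
    main_id: str                      # stitched ID from the identity graph
    email: Optional[str] = None       # None means the user is still anonymous
    order_totals: List[float] = field(default_factory=list)


def known_users(rows: List[EntityRow]) -> List[EntityRow]:
    """A cohort is just a filter over the entity rows."""
    return [r for r in rows if r.email is not None]


def ltv_by_user(cohort: List[EntityRow]) -> Dict[str, float]:
    """Compute the LTV feature only for rows in the cohort,
    so no work is spent on anonymous users."""
    return {r.main_id: sum(r.order_totals, 0.0) for r in cohort}


# Illustrative data: two known users and one anonymous visitor.
ENTITY_ROWS = [
    EntityRow("u1", "a@example.com", [20.0, 5.0]),
    EntityRow("u2"),                          # anonymous, excluded from the cohort
    EntityRow("u3", "c@example.com"),
]

if __name__ == "__main__":
    print(ltv_by_user(known_users(ENTITY_ROWS)))
```

The full entity table is still available for attribution-style questions; only the cohort is carried forward for targeting and for the more expensive features.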
Starting point is 00:43:10 The sky's kind of the limit as to how those are segmented. Interesting. All right, John. Cohorts. Did you try to do this? We did. You were rolling a bunch of stuff. Yeah, it's funny. And this is kind of a sad story, but we... How sad? Our producer leaves and we get, like, these hot takes.
Starting point is 00:43:32 No, it's a sad story of one of those companies that was funded in that, you know, 2020 range that built an awesome product that got acquired and then basically killed. But yeah, we actually... there's a small, primarily email tool, but they really built pretty robust kind of customer data features into it. Like I said, they no longer exist. But one of the things we did was feed custom entities into that. And they did some neat things like computing, like, predictive stuff inside the tool as well. But that was something that we found was really helpful for, you know, for targeting and for email and customer messaging inside that tool. And then getting
Starting point is 00:44:13 insights, like one of the cooler things that we did is we had this cut up like product ranking thing where it was like an X and Y axis. And it scored it on like views and conversions. So like what are your like high view, low conversion or low view, high conversion all on like a X and Y axis. So that was something that we like piped it into. And then from a customer data standpoint, I think the biggest problem we faced where we were selling B2B and B2C
Starting point is 00:44:44 was how do you pinpoint the customers you should reach out to, especially, like, businesses? Because you'd get purchases and they'd be from some big names. You're like, wow. And, you know, that wasn't necessarily the only, like, indicator that would be a good customer. But that was an interesting one, because you can't reach out to everybody, but you do want to reach out, especially if a business buys something, just like, well, what else do you buy? Like, where else are you buying? And that was probably one of the more interesting customer problems. Yeah, we were working on that, like, at the very end of my time, thinking through, like, all right, how can we rank them? Let's find properties to rank
Starting point is 00:45:16 them on and then give, like, a call sheet or an email sheet or something to a sales team, and then automating that further. Yeah, so that was probably the most interesting. Man, how rare to have, like, some sort of, like, customer engagement tool that actually handles entities well. I don't... that may be the first time. It gets killed, then it's gone. Yeah, it's sad. That is really bad. Yeah. Okay, well, actually, speaking about that. Okay, so cohorts is one of the things you recently launched, right? But speaking about email tools, there's this other piece of this called activations. So what is activations?
Starting point is 00:45:54 And to put a spicy take on it, like, that sounds just like reverse ETL. It is. It is reverse ETL. Reverse ETL. Wait. That's just ETL. Yeah. So they, like, cancel out.
Starting point is 00:46:12 It cancels out. Yeah. I mean, so at a high level, essentially what we saw is that we're providing a way to define these entities in like a trustworthy manner for the data team to own that definition. And then for the data team to segment that further into, you know, again, cohorts that different business units or teams or, you know, different phases of the customer journey cared about. And so that became the grain at which we saw people needing to actually get that into the downstream tool. You know, like, you've built your ID graph, you've built your features, you've subset that into, you know, usable buckets
Starting point is 00:46:50 of users, and then that's where it was like, okay, now we've got it to a place where we can actually action on it. And so, like, again, beating a dead horse here, but, like, you still can't do anything until it's in the tool where you want it. Sure. And we're literally just talking about materialized views in the warehouse. Yeah, yeah. And so that's literally the grain at which it was like, okay, now you need to get this into the downstream tool. Because RudderStack is building these,
Starting point is 00:47:15 we know exactly how the views are materialized, where they live in the warehouse. And so it becomes very simple then for a non-technical user to say, you know, I'm looking at this UI, which again is built on top of Snowflake or warehouse data. I want either this cohort, or even a further segment of this cohort, or even some traits or features of this cohort, in my marketing tool. And so activations is basically, you know, a UI, you know, low-number-of-clicks way to get that there. I, like,
Starting point is 00:47:43 I wish it was one click. That's, like, what I was really going for. But honestly, because you have to kind of map it... It's a feature, not a bug. It's guardrails. Yeah, you have to map it to the fields in the tool that you're sending them to. So, but the idea is that gives you a centralized place to say, you know, again, business
Starting point is 00:48:00 user is exploring that, saying this is what I want to get or a subset of this and then put it in the downstream tool. And then that's now connected in the UI, at least to that cohort. So they can see all the places it's being sent or what sub slices of that are being sent. And so it really kind of ties a bow on that notion that I mentioned of the data team owning the definitions and the config and the business stakeholders owning the interface to that. Yep. So to go back to the spicy take, you are actually actively just turning reverse ETL, like, melting it into, like, it's just under the hood, and a business
Starting point is 00:48:32 user, like, goes to look at data and then they're just like, I just want it in this tool, which actually is just... Yeah, I'm like, reverse ETL is bogus. Use my reverse ETL. Yeah. Yeah. Exactly. Awesome. Yeah. That's hilarious. Like what tools are supported out of the gate or commonly used? So downstream, it's any of the integrations
Starting point is 00:48:51 that RudderStack already has. Oh, all right. Okay. Wow. So there's a big library. Yeah. Nice. So anything where you would send
Starting point is 00:48:56 clickstream or reverse ETL data is automatically supported by activations. Okay. I will trade your one click for, like, sending data anywhere. Okay, I gotta ask a question, though, and John, this is for both of you, okay? And I don't care who goes first.
Starting point is 00:49:10 You guys can fight over it. But we talked about, and I mean, I kind of know this for myself, and maybe this is just because I was, like, a very technical marketer and I, like, actually did go into the warehouse, so I'm probably not the right... But why is it important? You mentioned the data team can own the definitions, right? Couldn't I just go in and create a bunch of definitions? Why is it important to have that dynamic? Like you as a marketer? Sure.
Starting point is 00:49:39 Well, because you would do it wrong. Okay, hold on. I didn't need to create a definition? Me personally? Yes. I mean, that's a good question. And, I mean, my answer to that is that data is never as clean as you want it to be. And so, okay, a good example: we worked with a customer recently where they, in their downstream tool, wanted, like, a list of... they wanted to see and do activities on recently active users. And this was a product like RudderStack, you know, a SaaS-based product.
Starting point is 00:50:16 And so someone on the marketing team, probably someone brilliant like you, like, just literally grabbed, like, you know, a count of, like, did they have a session in the last week, and they were using that as active users. And, you know, it wasn't converting like they thought it would or had in the past. And the data team was able to come in and say, you know, this tool is primarily used through a browser extension. And so, like, if you're signing in, you're probably not really using the tool well. If you're using the tool well, you're using it from a browser extension and you're never signing in. And so they were able to come in and essentially correct that feature by the definition of it, but everything downstream remained the same, right? So everything that marketing had already built around saying, sure, recently active, was still fine, but the definition
Starting point is 00:50:57 was just done more reliably around the business concepts. And that's not, I mean, I love to knock against marketing people, don't get me wrong. It's my favorite activity. Me too. But that was just someone doing the best with what they had. And they didn't realize because that's what existed in that marketing tool based on just, like, clickstream data. It was what they had available. Yeah.
Starting point is 00:51:16 But the business has access to those metrics in a different data set that's not available in the marketing tool that can say like, oh, this was the average number of minutes they used the tool last week and that's much better. And so that's why I think it's important for the data team to own those definitions. But the marketing team, as much as I hate to admit this, is always going to know how to use those better. Like how do we actually target folks?
Starting point is 00:51:37 Yeah, but it's like filters on, like, core business definitions that shouldn't change, because it can create situations where someone is sending a campaign and actually reporting something that's inaccurate. Exactly, yeah. Interesting. Yeah. Yeah.
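Editor's sketch (not part of the conversation): the "recently active" story above, where the data team swaps the definition of a feature while everything marketing built on the feature name stays the same, could be illustrated like this. All names here (web_sessions_7d, extension_minutes_7d, build_features, campaign_audience) are hypothetical and only assume that the downstream tool filters on a boolean recently_active trait.

```python
from typing import Any, Callable, Dict, List

User = Dict[str, Any]


def recently_active_v1(u: User) -> bool:
    """Original definition: any web sign-in session in the last week."""
    return u["web_sessions_7d"] > 0


def recently_active_v2(u: User) -> bool:
    """Corrected definition: time spent in the browser extension last week."""
    return u["extension_minutes_7d"] > 0


def build_features(users: List[User],
                   recently_active: Callable[[User], bool]) -> List[User]:
    """Data team side: owns how the feature is computed."""
    return [{"id": u["id"], "recently_active": recently_active(u)} for u in users]


def campaign_audience(features: List[User]) -> List[str]:
    """Marketing side: only depends on the feature name, not its definition."""
    return [f["id"] for f in features if f["recently_active"]]


# Illustrative data: "a" signs in on the web but never uses the extension,
# "b" uses the extension heavily and never signs in.
USERS = [
    {"id": "a", "web_sessions_7d": 3, "extension_minutes_7d": 0},
    {"id": "b", "web_sessions_7d": 0, "extension_minutes_7d": 240},
]

if __name__ == "__main__":
    print(campaign_audience(build_features(USERS, recently_active_v1)))
    print(campaign_audience(build_features(USERS, recently_active_v2)))
```

The point of the split is that campaign_audience never changes: correcting the definition changes who lands in the audience without touching anything built downstream.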
Starting point is 00:51:52 Well, and I think just as a general concept, some of the best tools are tools that bridge two teams or more than one team together. I mean, that is a lot of the value of really any SaaS tool is like, okay, this is how marketing looks at it. This is how data looks at it. And providing like clarity and an interface to work that out. Because that flip side is often true too, right? Where the data team like does things, they model things technically correct
Starting point is 00:52:21 and they do everything right, like, technically. And then marketing is like, yeah, but this isn't useful, yeah, because of, like, X, Y, and Z, like, business rules, yeah, or just, like, oddities of, like, how some system's set up that can't be changed or whatever. Yeah. So, like, you have to have that. These tools, like, you know, profiles, force you to kind of get it on paper and to agree on it. And then, like, I think it can flesh out a lot of the problems with data definitions. Because a lot of times the data teams and marketing teams aren't working together, because the marketing team can have their fully enclosed black box and just, like, do stuff. And they would never even know if it was wrong because they don't have anything to, like, compare it
Starting point is 00:53:01 against. Yep. So it sounds like both of you are advocating for this world. I mean, there is this whole self-serve analytics, data democratization, blah, blah, blah, right? That's another episode where we can talk about how that didn't materialize or, when it did, had some very severe issues. But what's interesting is you could say, okay, we're just going to send this data to your tool and then you can do whatever you want with it. But to your point, Ryan, I think what's interesting is without context, or sort of without an agreement on what the meaning of some of those core business definitions are, which ultimately, like, materialize as, you know, some sort of a column or a table or something like that, that's actually how it exists physically, quote unquote, in the business. But it
Starting point is 00:53:45 seems like there's this desire to create, how do I want to say this, get the marketer, me, closer to those, like, physical assets in the warehouse, but in an environment that has, like, a bunch of safeguards. Yeah, exactly. Okay, interesting. And I think to the vision of, okay, we need clear ownership for pieces of this thing, because, like, actually, if you're at a certain scale, like, nobody can do everything, right? Yeah. So you have clear ownership, but then you also have, like we were talking about earlier, you have visibility between teams. Like, I'm not the owner here, but I can at least go see, like, what is happening at a high level on this downstream thing that impacts me. And then same on the other side of, like, I can see, like, the results of what I did. I think that's a really positive thing. So the teams feel like they're actually doing something useful versus
Starting point is 00:54:41 completely siloed of like, I ship it across the wall. I don't know what happens. I don't know what happens before me or after me. It's like an assembly line mentality in a bad way versus like the bold visibility of like what's going on and then like clear ownership lines in the process. Super interesting. All right. So, two more questions before we end here. I actually have no idea how long we've been recording, which is a great feeling. We have, yeah. I don't know. That's such a great feeling.
Starting point is 00:55:11 All right. We may change the format of the show. Yeah. Just, you know, the two episodes, add video, you know. Yeah. Okay. So two questions, one for Ryan and then one for both of you. Actually, we'll end on a spicy take.
Starting point is 00:55:25 Great. This isn't the spicy take. Okay, great. This isn't the spicy take. This is for you. So what are you building next? Okay, so you have your identity graphs, profiles, you generate an identity graph. No one asks for it. Everyone needs it. You build traits on top of that becomes this sort of table that represents everything you know about an entity. You create cohorts that are business definitions. Business users go in, they look at a cohort, they can filter it. They send the data to their tools. I mean, this sounds like a great world. What are you building next? Yeah. A couple of things we're focusing on right now. One is already live, but we're doing a lot of work around it, but is around the ML piece of this you know a lot of teams that are approaching us and wanting
Starting point is 00:56:05 to use profiles are wanting to do so to build kind of that solid foundation to start thinking about ML use cases. You know, everyone's trying to do, right? Yeah, like, that's what you do. Do you mean AI, Ryan? Oh, we've gone this whole time and nobody said AI. I think we're like an hour in. Crazy. No, you did mention a... oh, you said there's no AI tool to... Yeah, you're right, you're right. You did mention AI. I have not said that. That should be a game. Although, as a product manager, you did say it's a feature, not a bug. That's true.
Starting point is 00:56:33 I have to call that out. Yeah, I think I'm contractually obligated to say that. That is true, yeah. So, when we think about, you know, a lot of folks are doing this to build that foundation to start to, you know, leverage some of these more advanced techniques. By the nature of profiles, you know, I mentioned a couple of times, we're outputting this table that's got all of these features that you're defining. We also, from the start, have stored historical snapshots of that over every run. Oh, interesting.
Starting point is 00:56:57 A lot of folks run this. Like every job run and you materialize a view that is this point in time? Yeah, so the view that you look at in the UI is pointing to the most recent run, but all the previous runs are in there and you can kind of set the retention that you want for those. That sounds fruitful for ML. Yeah, and so, you know, that was the intention
Starting point is 00:57:14 is we can do this for, you know, teams to have a good foundation for their ML, but then we realized, you know, we know exactly how we're writing this, we can do some of that ML for them. And so our predictions product, you know, kind of sits on top of that and allows you to say, well, you've got all these users and all of their you know their feature evolution
Starting point is 00:57:29 day over day, you know, if you can, and some of these features are as simple as defining, like, which feature would you say is a conversion, and then what are the ones you want to exclude? Like, obviously you don't want to predict on, like, my first name and my state. But, you know, excluding those, like, what are the things that are changing day over day? And then we can give you, you know, with this, how far of an outlook you want, we can run models on that. We'll say, hey, this is the, you know, either a lead score or churn score, but this is the propensity to do this defined conversion action.
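[Editor's note: The idea described here, using per-run snapshots of a feature table plus a chosen conversion feature, an exclusion list, and an outlook window to build training data, can be sketched roughly as below. This is only an illustration under stated assumptions, not RudderStack's implementation: the snapshot rows, the column names (run_date, user_id, sessions_7d, state, converted), and the function build_training_rows are all hypothetical.]

```python
from datetime import date, timedelta

# Hypothetical per-run snapshots of a feature table (one row per user per run).
# Column names are illustrative only, not an actual product schema.
SNAPSHOTS = [
    {"run_date": date(2024, 5, 1), "user_id": "u1", "state": "SC", "sessions_7d": 2, "converted": False},
    {"run_date": date(2024, 5, 2), "user_id": "u1", "state": "SC", "sessions_7d": 5, "converted": False},
    {"run_date": date(2024, 5, 3), "user_id": "u1", "state": "SC", "sessions_7d": 6, "converted": True},
    {"run_date": date(2024, 5, 1), "user_id": "u2", "state": "GA", "sessions_7d": 1, "converted": False},
    {"run_date": date(2024, 5, 2), "user_id": "u2", "state": "GA", "sessions_7d": 1, "converted": False},
    {"run_date": date(2024, 5, 3), "user_id": "u2", "state": "GA", "sessions_7d": 0, "converted": False},
]


def build_training_rows(snapshots, label_col, exclude, horizon_days):
    """Turn point-in-time snapshots into (user, date, features, label) rows.

    The label is 1 if the user's first conversion falls within `horizon_days`
    after the snapshot date, else 0.
    """
    # First date on which each user shows the conversion flag.
    first_conversion = {}
    for row in sorted(snapshots, key=lambda r: r["run_date"]):
        if row[label_col] and row["user_id"] not in first_conversion:
            first_conversion[row["user_id"]] = row["run_date"]

    last_run = max(r["run_date"] for r in snapshots)
    skip = set(exclude) | {label_col, "run_date", "user_id"}
    rows = []
    for row in snapshots:
        if row[label_col]:
            continue  # already converted, nothing left to predict
        window_end = row["run_date"] + timedelta(days=horizon_days)
        converted_on = first_conversion.get(row["user_id"])
        will_convert = converted_on is not None and converted_on <= window_end
        if not will_convert and window_end > last_run:
            continue  # outcome window not fully observed yet
        features = {k: v for k, v in row.items() if k not in skip}
        rows.append((row["user_id"], row["run_date"], features, int(will_convert)))
    return rows


if __name__ == "__main__":
    for r in build_training_rows(SNAPSHOTS, "converted", exclude=["state"], horizon_days=1):
        print(r)
```

[The resulting rows could then be fed to any classifier to produce a propensity score; the retraining cadence and model choice are left out of this sketch.]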
Starting point is 00:57:57 Okay. Fascinating. You go find the training data. I mean, I guess it's all there, but it's trained on the customer's data. So yeah. So we train it on that. It trains, you know, there's defaults, usually trains weekly and then runs.
Starting point is 00:58:07 On the inputs to whatever you define as like conversion or something? Yep. Interesting. And so thinking about what else can we layer onto that? So we've recently built that attribution, you know, so you can do first touch, last touch, multi-touch. And then, you know, eventually that gives you
Starting point is 00:58:17 on a per user basis, what's the next best action for this person based on, you know, actual trained data. And then, you know, we're building some other things around like LTV prediction and category prediction and things like that. So that's really exciting. And then the other is, you know, everything I've mentioned today is a batch process. So this runs on a defined cadence, calculates these things, writes them to the warehouse. Yeah. And so real-time features is kind of something that we're beginning to work on currently. And, you know, that's the ability to have that daily aggregate run,
Starting point is 00:58:45 which then gives you quick access to that historical aggregation. And then you can compare that to events coming through in real time and access those either through our API, which I haven't really mentioned yet, or to like tack on and send to the downstream tool so that you have that kind of real-time access.
Starting point is 00:58:59 And that's being used in beta by some customers right now for, like, fraud detection, you know, different things like that. But understanding your users at that point of contact versus, you know, having to wait for the batch process. Yeah. Exciting. Super interesting.
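[Editor's note: The pattern described here, a batch run that produces historical aggregates which are then compared against events arriving in real time, might look something like the sketch below. It is a simplified illustration, not the product's API: DAILY_AGGREGATES, score_event, the field names, and the ratio threshold are all assumptions made for the example.]

```python
# Hypothetical output of the daily batch run: per-user historical aggregates.
DAILY_AGGREGATES = {
    "u1": {"avg_order_value": 40.0, "orders_30d": 12},
    "u2": {"avg_order_value": 15.0, "orders_30d": 3},
}


def score_event(event, aggregates, ratio_threshold=5.0):
    """Enrich a live event with batch traits and flag unusually large amounts.

    An event is marked suspicious when its amount is at least
    `ratio_threshold` times the user's historical average order value.
    """
    traits = aggregates.get(event["user_id"])
    if traits is None or traits["avg_order_value"] <= 0:
        # No usable history: pass the event through without a judgment.
        return {**event, "known_user": False, "suspicious": False}
    ratio = event["amount"] / traits["avg_order_value"]
    return {
        **event,
        "known_user": True,
        "amount_ratio": round(ratio, 2),
        "suspicious": ratio >= ratio_threshold,
    }


if __name__ == "__main__":
    print(score_event({"user_id": "u1", "amount": 400.0}, DAILY_AGGREGATES))
    print(score_event({"user_id": "u2", "amount": 30.0}, DAILY_AGGREGATES))
```

[The enriched event could be returned from an API lookup or forwarded to a downstream tool; a real fraud check would use far more signals than a single ratio.]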
Starting point is 00:59:11 Yeah. Wow. Okay. Well, we'll definitely have to have you back on to talk about that. We'll make sure the producer isn't here. Okay. So last spicy take, I have to end on a spicy take. Okay.
Starting point is 00:59:22 So reverse ETL is getting turned into a feature of a bunch of other products. Ryan is at the spearhead of that, obviously. He's changing the entire industry right now. What is another one that you see getting, like, a sort of, let's say, like, cottage, you know, sort of data explosion, VC-backed product that is just going to get turned into a feature? My first thought would be probably observability. Like, there's a lot in that space where we're like, how does that not get rolled into... Yeah, there's so many places it could get rolled into. Like, it could get rolled into orchestration. It could get rolled into, like, the warehouse itself or the data pipeline tools. Sure. So that's the one that feels the most... It's useful, but it's, you know, like, I don't say it doesn't do anything, because it's useful and you can have alerts and stuff. But at the same time, it's not like core, like, actually, like, hey, this is a data pipeline
Starting point is 01:00:21 that moves data from here to here that I need. That would be my guess. Yeah, and there are so many tools that have access to the same stuff that could just build that. Yeah. And then in catalog, yeah, the catalog, it's maybe a similar space too. Yeah. I mean, both of those things are kind of just like, it's just a matter of time until Snowflake and Databricks... Right, right. They probably already have products or have acquired companies who have done that. Yeah, right.
Starting point is 01:00:49 I'm going to stick to my... I'm going to stick in the same vein as Reverse ETL, but I think ETL. I think your traditional like... Interesting. Fivetran, Stitch, Hevo, all those players. I mean, they built huge businesses around this, but I think we've already started to see it some,
Starting point is 01:01:02 but I think the actual cloud tools themselves should just start writing that data to the warehouse. Yep. And kind of cut out the need to, like, have a dedicated... Like zero ETL, just directly. Like, exactly, they may just have, like, data shares basically with all of these, like, HubSpots and, like, these big providers. Well, I mean, that's kind of what you see with the, I mean, going back to reverse ETL, like, you see this with marketing platforms that are just like, we'll just plug into Snowflake. Yeah. Yeah.
Starting point is 01:01:30 Yeah. Exactly. Yeah. I agree. I agree. I mean, the other thing is for, let's call it traditional ETL, like some point, the big cloud providers, I mean, they already have tools that can do this. Yeah.
Starting point is 01:01:43 Right. And so at some point they just acquire or build or something, the connectors or this, the zero ETL thing. Yeah. That's interesting. That, and actually that sort of goes back to like a bundling, right. Where you see, you know, a lot of, you know, you mentioned, you know, Alteryx or some of these other, right. Yeah. Yeah. It's back to bundling. Yeah, exactly. Which is super interesting. Which is back to like the lock-in thing, which a lot of people are trying to get away from. And then it'll probably unbundle again, you know, next phase.
Starting point is 01:02:10 Well, I think it'll be... Actually, yeah, that's a whole other subject. But we recently had Andrew Lamb on the show from Influx. And this whole idea around object storage and Apache Arrow actually creating some crazy unbundling sort of on the analytics stack side of things, which is really interesting. So I don't know.
Starting point is 01:02:29 We'll see. Alrighty. I can't wait to see how long we recorded. Yeah. Was it like 25 minutes or? Yeah, it was more than 25. It was more than 25. I think so.
Starting point is 01:02:39 All right, Ryan, thanks for coming back. Have you been on the show? I was saying coming back. I don't think so. Wow. Yeah. Okay. Thank you for coming on. Yeah. Thanks for having me.
Starting point is 01:02:45 Great. All right. Well, when you get the ML stuff sorted out... Yeah. Let us all know. Walk over to my office. And when you get AI figured out too. Yes. There we go. Okay, here's my third mention. We needed another mention. Yeah, three AI mentions. All right, thanks for joining us. Subscribe if you haven't, and we'll catch you on the next one. eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com. Thank you.
