The Data Stack Show - 215: Data Sharing and the Truth About Data Clean Rooms with Patrick Devlin of Wilde AI

Episode Date: November 13, 2024

Highlights from this week's conversation include:
Patrick's Background and Journey to Wilde (1:12)
The Evolution of QR Codes (4:09)
Marketing Analytics and Clean Rooms (9:52)
Challenges in Data Sharing (13:20)
Technical Challenges with Clean Rooms (15:37)
Exploring Current Data Infrastructure (19:11)
Data Orchestration Tools (22:50)
Performance Tuning and Data Syncing (24:00)
Choosing Data Tools (26:08)
MotherDuck and Data Warehousing (30:31)
Flexible Data Architecture (32:40)
DuckDB Implementation (35:36)
Data Marketplace Concept (38:34)
Asset Availability in Data Queries (42:21)
Transition from Software Engineering to Data Stack (46:36)
Data Contracts and Type Safety (49:10)
Database Schema Perspectives (50:27)
Final Thoughts and Takeaways (51:35)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Hi, I'm Eric Dodds. And I'm John Wessel. Welcome to the Data Stack Show. The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work. Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. Welcome back to the show, everyone. We are here with Patrick Devlin from Wilde AI. We actually talked with your co-founder recently on the show,
Starting point is 00:00:42 so excited to dig into some more of the technical details with you, Patrick. Thanks for joining us. Yeah, thanks for having me. Pleasure to be here. All right. Well, give us just a quick background on how you got into the world of data and ended up building a product that uses a ton of data. So I've been in software for about a decade now, been in the startup world pretty much my whole career. And yeah, I got together with Clint, like you mentioned, beginning this year, and we started working on a predictive LTV product, doing that kind of stuff, data-related there.
Starting point is 00:01:12 And most of my career has been on the software engineering side. And as of late, I've definitely entered the world of data engineering and the data stack. And yeah, I'm excited to chat about what we're doing here at Wild. And hopefully some of the information could be useful to the listeners. Yeah. So Patrick, one of the topics that I'm really excited about, okay, maybe two topics. One of them is Duck TV and Mother Duck. I always appreciate a conversation about that. And the other one is your experience with clean rooms. We've talked a little bit about the marketing around clean rooms and we're excited to compare marketing versus reality when
Starting point is 00:01:49 it comes to clean rooms. So what are some topics you're excited about? Yeah, definitely. I mean, DuffDB and MotherDuck is a huge part of our stack and our architecture. So it's going to be great. And yeah, I think just like this whole like data sharing space is really interesting. And the concept of these like embedded OLAP databases really unlocks some interesting stuff to innovate on how data sharing actually happens. And I think there's probably some inspiration being taken from data clean rooms. Ultimately, we decided not to go that route. But yeah, I'm happy to take you through that story and journey on how we made that decision. Definitely. Well, let's dig in.
Starting point is 00:02:33 All right, let's do it. All right, Patrick, you gave us a brief intro, but dig a little bit deeper into the startups that you worked at, you know, early in your career, and then you had a pretty good run at the most recent one. Yeah, definitely. So I spent time actually in the blockchain space, we were building like NFT collectible cards for sports players. It was really exciting. I think we were probably a little early there. And we quickly realized that force hands just like don't they don't care about the blockchain which is a shocker yeah yeah exactly so we ended up like building a lot of tech on top to like obfuscate you know that part of the architecture and i think ultimately like dug ourselves into a hole there but i moved on. It was actually my boss at the time
Starting point is 00:03:26 went over to the DTX company, which is now Flowcode. And he brought me along. He said, I need engineers. I was like, yep. Sounds great. I need a job. Yeah, that's a very,
Starting point is 00:03:38 that's very like startup to startup transition. I love it. Yeah, definitely. Definitely. So I came on as the founding engineer at Flowcode and got to work on a ton of different systems. It was super exciting. I was there for four and a half years and moved on to Wild, I guess, December of 2023. So almost a year now.
Starting point is 00:03:59 Yeah, almost a year. Very cool. Okay. We have hundreds of questions about clean rooms and DuckDB and MotherDuck, Yeah, almost a year. We were chatting before the show. I remember, this has got to be 2011 or 2012, there was a guy in the co-working space where I was working, and he tweeted, I think it was as abbreviated as, I have never scanned a QR code. And this tweet went viral. I mean, this guy's not like a super well-known guy, right? But it really just encapsulated
Starting point is 00:04:48 like the joke of QR codes at that point because you had all these different apps and it was a hugely painful process and it was literally harder to like interact with a QR code than it was to just like Google the company's name. But Flowcode's a successful company and like that whole dynamic changed. So you were on the inside
Starting point is 00:05:07 building some of this technology really at a great time for that. So what was it like to be on the inside as that tide was completely changing? Yeah, yeah. Yeah, so I would say I think some of the inspiration was the fact that
Starting point is 00:05:22 kind of the Eastern Hemisphere has been using qr codes and it's it's very prolific over there and i remember ta tim armstrong he had taken a trip over there and essentially was like inspired by all of this sort of like physical activation and they're using qr codes and so when he came back, he brought that to the team. We're trying to figure out exactly what we were going to do. I think one of the requirements was that we wanted to bridge this world of offline to online, and we want to make it, it's got to be quick,
Starting point is 00:06:00 and it also needs to be super easy to use. So a very little barrier to entry and the qr code sort of enabled that because you don't need an app it's built into your phone and so you can really create this i guess like qr code ubiquity without requiring people to do too much yep now the caveat there is there was definitely this, like, huge education portion that needed to happen. Like you said, like, no one over here in the U.S. really is, like, engaging or scanning QR codes. They kind of looked at this, like, old archaic tech.
Starting point is 00:06:40 Right. And so that was, like, a problem that we spent a lot of time trying to figure out and we spent a lot of marketing material on simply just like educating users how to scan it and what does that look like and then kovat happened and that was all the user education perfect that was all the user education we needed that's great it the user education we needed. That was great. It really enabled us to switch into how can we actually, now people are using
Starting point is 00:07:11 them, how can we create a really strong experience around it. We spent a lot of time designing and making sure the QR codes were integrated into the brand. We didn't want them to like stick out as right. Um, so yeah, I think that like COVID sort of has unfortunate as it was,
Starting point is 00:07:34 it's freed us up to explore other parts of, uh, how flow code could come together. And really it's like on the consumer side, you're solving this offline to online sort of problem but as a business you don't really understand that attribution on the marketing on those like physical marketing spaces um and so flow code unlocks that and really it's like a huge data products behind right you know behind the qr codes Marketing analytics, for sure. Yeah, definitely. What was the trickiest... Okay, last question
Starting point is 00:08:09 because I don't want to waste too much time on this. It is fascinating. It's like, how many QR code companies died before the hardware and then COVID and made it ubiquitous? It's wild to think about. What was the trickiest technical problem you faced at Flowcode?
Starting point is 00:08:26 Yeah, that's great. It was one of my first projects. It was probably the most fun thing that I worked on there. And I got to work on it with Neil Cohen, who was our chief architect. He's a brilliant guy. And we essentially built the QR code generator from scratch. Really? Yeah.
Starting point is 00:08:46 And because there was such a requirement to make sure the design was so flexible that you could completely integrate it into your brand. Interesting. So we didn't just pull some top of the shelf. Sure. I mean, there's tons of frameworks. Yeah, exactly. Yeah. And we were just, yeah, we were too constrained by what was out there.
Starting point is 00:09:11 So we ended up like rebuilding that. And that was probably the most math I've done since college. You're never going to use this in your job. Exactly. That's great. Well, speaking of marketing job. It's not going to work. Exactly. That's great. Well, speaking of marketing analytics, that's probably a great segue into the clean room topic. And as our listeners, especially our longtime listeners know,
Starting point is 00:09:34 I used to be a marketer, which I can say now. You can say that. I can say that. But it actually makes it harder to joke about marketers because I'm not on the inside anymore. that, right? But it actually makes it harder to, you know, joke about marketers, you know, because I'm not on the inside anymore. Either way, it's not going to stop you. It won't stop me. Clean rooms are always fascinating to me because it's a really big, there's so much opportunity there, but it's so fraught with peril when you think about all the situations where you would want to share data, right? Even with like a
Starting point is 00:10:06 non-competitive brand, right? Like it can be so helpful to share data and there's kind of ways of doing that, but almost all of them are very technically painful. They create a lot of risk exposure. You're a lot of times dealing with a physical file, you know, or some sort of manual data munging, like, you know, and so it's like, you actually have the, there has to be a lot of juice for the squeeze to be worth it because it's really expensive. Right. So as you were talking about in the show, like the marketing around clean rooms is like, you just dump your data in there and it just like works you know okay so with that context yeah like you did a ton of clean room research at wild so can you just walk us through the story there like what were you trying to accomplish and then what did you learn about
Starting point is 00:10:59 clean rooms i don't think we've actually dug deep into clean rooms on the show brooks i don't correct us yeah so what was your use case and then educate us into clean rooms on the show. Brooks, correct us. Yeah, so what was your use case? And then educate us about clean rooms. Yeah, definitely. So yeah, just like some brief history on what we're on Wild and what we've been doing since I started there is really this LTD product at its core. We come in, we can ingest some data.
Starting point is 00:11:21 We run our predictive model on top and we give you some really interesting insights. This was sort of like a four-way foray into this data sharing marketplace that we had this idea for. And we thought that was, back up for one second, the data sharing piece is really between the consumer brand and the retail. We're like hyper hyper focus on essentially like solving the boundary the data boundary between online data and retail data and like a retailer that sells multiple consumer brands correct yeah yeah so typically that's you know that's a black box for you as a consumer brand you don't really know who those customers are. And when you do, you're like tier one brand or retailer
Starting point is 00:12:09 and you have a one-to-one engagement. It took you a year to set up. There's probably a data clean room involved in that situation. Or probably FTC. So yeah, we were super focused on just like bridging the gap between retail and online. And we had this hypothesis that we needed to build this huge corpus of consumer brands into our platform and trust system. And then we can go and take that to the retailer and be like,
Starting point is 00:12:37 hey, look, we have all of these brands that are really interested in consumer data that you have and sort of jumpstart the marketplace through that process. So yeah, that sort of switched. We found that retailers actually have a bunch of brands that they want to work with. There's this pressure to monetize data. And so we found that they actually are willing to come in and bring the consumer brands into this platform. That's a nice discovery. Yeah. Yeah.
Starting point is 00:13:14 So, you know, I go to markets like changed a bit there. And data sharing was sort of core to our infrastructure, whether it was going to be like the first step of our product or the third or fourth. We always knew we had to execute on it. So we spent a lot of time talking with folks and seeing what is a data clean room. And it turns out no one really knows. It's not the dirty secret.
Starting point is 00:13:43 It's an empty secret. Everyone's got their own definition of it. And so everyone talks about it differently. And I think the name kind of sucks too because it has this implicit definition that data is put into this room and then you can analyze it safely. But really, a lot of data clean rooms support different types of sharing architecture. So some data clean rooms, you actually have to move data.
Starting point is 00:14:14 Some data clean rooms are just all in place sharing, which we see this kind of in place concepts of data sharing is we're very bullish on. We see that as like the future of how brands interact with their data really quickly what would the flow there be so like if you actually have to move data i'm like writing data into a clean room or like sharing a table into a clean room yeah so i mean it depends on there's so many different factors and like each each avenue has like a different clean room solution so if you're like
Starting point is 00:14:56 snowflake yeah exactly potentially there's like regional boundaries as well like maybe you're on the same architecture but your data is stored in different regions. So in some cases, there just inherently needs to be like replication. Right. So we started to dig into Samoa, which was acquired by Snowflake, which is now just Snowflake Data Cleanrooms. And we looked at Habu, which was actually required by LiveRamp, as well as potentially being the infrastructure provider for us at Wild to support this data sharing case. We quickly found out it would just be way too expensive.
Starting point is 00:15:37 But also, behind the scenes, yes, there's some cryptography going on to enable secure queries. But ultimately, the major use cases for the data cleanroom are all in place sharing. So you're actually not moving data. You just sit on top of your data tables from, let's take the Snowflake example. If you're company A, you have Snowflake tables in US East. Company B and I have some tables in US East as well. The clean room is really just this protected space that tells you what you can and cannot query. And it'll go down and actually pull information
Starting point is 00:16:30 from those disparate data sets. You just get an authorized materialization of whatever you can query. Yeah. I think at one point, it's been a while since we looked at this, but I think at one point we found that it was just a bunch of Jinja templates telling you what you can and can't query. So, take that for what it is. config so like i i have some sort of config in the clean room and they're just translating that
Starting point is 00:17:07 to a jinja template which just creates like restrictions on what's queryable yes yeah one answer yeah exactly interesting bases have had permissions for a lot of years like it's just funny like it i mean i'm sure there's some subtleties around some of it but it's like yeah cool so this sounds a lot like a real level security where some of the other like database like they've been around for a lot of years sounds like marketing yeah I think it's yeah yeah it was kind of this like glorified permission system and like you said John I'm sure there's like some details that were missing there but i think the core learning was okay they're sharing data it's in place sharing so there's no replication right and the other one is they use
Starting point is 00:17:56 there's like some cryptography going on to make sure that you can write queries against that sort of clean room state but make sure that nothing's exposed. Or you can't be a bad actor in that situation. Yeah, and that would be unique. I can't think of any technologies that let... Like writing, a lot of times it's like, oh, we can give you read-only access to a certain piece of this. Writing's more complicated.
Starting point is 00:18:21 And that would be novel. And it is like, the ability to do that without moving your data is pretty awesome. Yeah. Like, that is cool, but... But yeah. Okay. So, some cryptography, some permission stuff, but those obviously didn't meet your needs at Wild. Yes.
Starting point is 00:18:39 Yeah, I mean, it was, like, it's a huge boundary to even just to even just maintain and provide a data clean room. Not only as us, if we were a consumer, like a consumer brand, but also in our situation, we would need to run these things for every data sharing situation within the platform. So we just, you know, it quickly like became unscalable in that sense. And yeah, we opted, I don't know if you're going to chat about this later, but just to give a little teaser, we opted to focus on the in-sharing piece or in-place data sharing. I think that was the biggest learning out of the data clean rooms. Even though there's some replication going on in some scenarios
Starting point is 00:19:31 where you're in the cross-vendors, there's still that in-place sharing. Yeah, so I think that's the perfect segue to get kind of into your current data stack. So part of the infrastructure I thought was really cool and we talked about it previously is most applications, when you write the application, you start typically some kind of relational database like a Postgres, for example, or SQL Server or whatever you're on.
Starting point is 00:19:59 And then you're writing to this common database. And then you get your first client and you're like, yeah. And then basically the only thing that's separating clients is like an ID column or something. So all the data is stored together, all the data is together. And then getting access back out. So say, you know, client like,
Starting point is 00:20:16 oh, I want this custom report or I want this thing or that. Like now you have to like kind of reverse engineer out and make sure you've got filters on every single little thing to make sure this client ID is always true if it's this client viewing the data so you have all this you know work built into it so part of your model i think was really interesting that i'd love for you to expand on is you don't do that you actually store the data by client to begin with separately yeah yeah and yeah i think our because we're building a data product, we're not
Starting point is 00:20:45 the data producers. The consumer is the data producer. And so we've essentially built a system using DAXer and DLT to sync and land data
Starting point is 00:21:01 into S3 on our end. And then because we have that kind of raw level data, we've got the flexibility to put sort of any type of compute engine on top. I think in the future, you know, we'd definitely look into experimenting with Iceberg and getting a little more sophisticated on that front. But as things work today, yeah, we land things in S3 and we use DAX there, which I think in probably the majority of cases would be used to orchestrate your internal
Starting point is 00:21:34 analytics. We actually use it to orchestrate all the pipes needed to process consumer's data. So every job, every asset in data center is partitioned by the consumer wow yeah yeah just a couple things like one like what i'm curious like what other architectures you considered and then two for maybe explain dlt a bit. I think a lot of people are familiar with DBT, but this is another three-letter tool. With a D and a T. With a D and a T. Yeah. Yeah, so talk about the architecture some and then DLT.
Starting point is 00:22:16 Definitely. So we knew we needed something to orchestrate all of the pipes. We have this ML model that we run, but before that we needed a clean. So we knew there was a piece of ingestion, transformation, running the actual models on top of that clean data set, and then actually doing like some additional transformations after the
Starting point is 00:22:44 model gets run and then prepping it for you know the platform and the product so we looked at you know dexter like airflow could hand roll some of that orchestration world data another shot out there but yeah ultimately we went with dexter they had good support for our partitioning which we know is going to be critical to this system that required to essentially process like i said not only like not our internal data set that we're producing and we'll probably have something else for that let's see if that could generate like the generated enough scale, but yeah, we opted for Daxter and then DLT.
Starting point is 00:23:27 You can find it at DLT hub. What's it stand for? Data load, data load transfer. The data load tool. Yeah. Data load tool. Yeah.
Starting point is 00:23:36 If you Google it, the Delta live tables comes up a lot, which is a data. It's different. It's different. It's different. Yeah. So, I mean, I guess the, the beauty of building like a greenfield project is we can sort of experiment with a lot of these different tools and DLT was,
Starting point is 00:23:57 I think at a second, it wasn't even, didn't even have a major release. They just released a stable, stable version like last month, but we didn't have any problems major release. They just released a stable version last month. But we didn't have any problems using it. It was great. From the beginning, it's all Python-based. It runs... I think what's interesting about DLTHub is you can have high concurrency, and it stores the pipeline state and the destination.
Starting point is 00:24:23 So every time a DAXX or job spins up and it's like, hey, sync client A's data to S3, it'll go and grab the state from S3, know exactly where to pick up, and it just runs from there. So yeah, after a little performance tuning, we were able to sync data pretty fast. This is important for, you know, like onboarding into the products.
Starting point is 00:24:47 I think we talked to some other vendors and, you know, this process of like back-selling historical transaction data, you know, can take up to days depending on the size of the business. And yeah, DLT Hub has been great at, you know, accelerating that so timing wise I'm curious because I like this would have been four or five probably five years ago now
Starting point is 00:25:12 there's an analytics tool built for Shopify and this was even before Shopify had pretty good analytics but this is before they redid all their dashboards and stuff so the promise of the tool was a give you better analytics analytics out of the box, then Shopify, and then I think it hooked up to a couple other things. This was pre, I think Dasity does this now, but it was before all that. It would hook up to your email and Google Ads and stuff.
Starting point is 00:25:35 But I remember hooking it up, and then this pop-up came up, and it's like, estimated time, come back in like 48 hours or 72 hours or something like that. I was like, what? It was not a good experience. It really
Starting point is 00:25:53 so I totally understand how you're saying the parallelization and speed is a big deal. Especially if it's a larger brand on Shopify. It can take a while to backfill data definitely yeah what what other tools besides zlt did you look at is question one and then one a is as a software engineer does it give you a little bit of pause to be using in
Starting point is 00:26:19 production a tool that just had their first stable release like a month ago like yes and no because honestly like it's really up to the vendor to decide whether it's a stable release or not so it's like it's sort of inherently like made up in that way so like you could d on like i mean they went from like 0.6 to like a 1.0 release. And you're just like, wait, what? Yeah, and it's so subjective. Gmail was in beta for what, like a decade? Right? Do you remember that?
Starting point is 00:26:54 Yeah, that's true. So I feel like some of that's the personalities of the founders. They just stay pre-stable for like three months until they really reach some level. Or it can be the opposite where they can just, oh, here's the first release. It's stable. Right.
Starting point is 00:27:11 I mean, it's definitely, it's like a signal. It's an indicator. You should be like, okay, this is what I'm getting into. But I don't think it's like the end all be all. Sorry, what was the other part of the question? So in terms of that stage in the pipeline, so you chose Dagster.
Starting point is 00:27:30 Oh, yeah, yeah. In terms of that stage of the pipeline, like what ways did you have, like what you sort of gave a number of options you explored for like the orchestration piece and making sure that you can do that with robust partitioning support. But in terms of that last mile before it hits S3,
Starting point is 00:27:46 what other tools did you look at in order to solve that part of the pipeline? Yeah, yeah. I mean, we looked at Airbyte, and I'm sure there was a few others as well. But we liked the fact that DLT was super easy to just run and spin up, and it had already an embedded integration into DAXer versus like when I got.
Starting point is 00:28:11 So like the first sync I ever did of this product was on Airbyte. And I had like, I think I had my Kubernetes running on my freaking laptop to get the same yeah and like i don't know it's just a little off-putting granted they do have this like really mature connector um community they've got a ton of options there you know for what we were solving i think dlt just fit in really nicely there yeah is there something with the architecture that maybe is different? I'm thinking about this architecture with the separate data pods, if you want to call them that. So there's something different about that than maybe a standard company. Because I think at least so far, you have a little bit of a unique scenario here where a lot of your customers are on Shopify,
Starting point is 00:29:04 I think almost all of them. So it's almost like you already have the data standardized. Now you're just landing it in different places. Whereas like I think previously, like one of the reasons it was so crucial to like get everything into a common database is this was a standardization step. Like the reason we're all on the same table with like a client ID is because when we write to this table like client name always matches up with client name and address, at least it's in the right field. Whereas if we just start sharing things separately
Starting point is 00:29:32 with no orchestration tool, no common schema, then you're just going to end up with a mess. Yeah, you just need a Shopify clean room, right? Yeah, exactly. Yeah, the system has worked great for us. We haven't really had any issues. Yeah, exactly. sit on top of that raw information and really start to do some interesting SQL to get the data on how we want it to look. You mentioned Shopify being our only source, but in theory, we can sync from any source. DLT manages schema evolution. And so as long as our DVT jobs are set up to
Starting point is 00:30:23 know what that source looks like and how we can actually pull it into some of our mark tables, I think, yeah, I'm happy to dive in on that part of the stack. Yeah. If you're interested. Yeah. I'm really curious on MotherDuck or DuckTB. Is there the ability? So I've got all this data.
Starting point is 00:30:46 You've standardized, so it's stored in the same schema in S3. What does it look like to essentially, if you needed to do this, say for internal analytics, how would you union across a bunch of different S3 accounts about buckets? So it's as simple as like a glob pattern. So as long as you have your S3 folder schema set up, which we
Starting point is 00:31:13 partition by client and then data source type and so on, you can select via glob pattern to query across everything, query across just a subset. And what's interesting about the dbt package is it will compile the SQL that you write as a part of your staging files into the actual S3 URL. So when you go and run those first jobs, when you go and run those first transformations, they're going straight to, you know, S3 to grab that information.
Starting point is 00:31:50 And that potentially, I guess if it's a, if it's like a CTE or a Vue or something like that, that could end up getting compiled all the way up to, you know, a materialized table. Yeah. Right. So yeah, it's been great to work with on that front. And part of the decision with MotherDuck, which has provided a lot of scale for us, sort of bringing that modern data warehouse and tooling
Starting point is 00:32:13 into that DuckDB and process land, is the fact that we can serve an entire platform without having to front all of this data with an analytics API or something to negotiate the contract between what data we're producing from our predictive models and what's actually showing up on the front end and being queried. Yeah. And the other thing is this,
Starting point is 00:32:43 I mean, the architecture is so flexible right because because like let's say you've got a big client you're working or a big prospect you're working with and you're like we want to use snowflake or databricks or something like it's in s3 right or you decide like you said you want to dive more into iceberg as a format yeah it's all there you have the data landed you have a partition like it this is probably the most like easy like architecture that i've seen to just like make a pivot and be like all right cool s3s are the snowflakes the computer databricks is the computer yeah whatever yeah so that's pretty cool yeah yeah definitely i think that's sort of how we're thinking about the architecture with this data marketplace and data sharing is with what we found with the data cleaners is like depending on where the data is there's a whole different path for the data cleaner and how it
Starting point is 00:33:38 comes together and what data needs to be replicated to execute on it. We wanted to create a system that could essentially sit the compute on top and have the flexibility to go into Snowflake or BigQuery or a blob storage and be able to pull that information together. And so that was a requirement of meeting our clients where their data is. And yeah, that has implications on security as well, right? If you don't have to replicate data on our side,
Starting point is 00:34:14 then they don't have to essentially worry about that from a compliance standpoint. I'm interested in the story of how you ended up trying DuckDB for this solution? Because to your point, John, the set of requirements itself is interesting, right? And sort of the stack leading up and doing partitioning with Dagster and then landing it in the client partitions in S3. But what was the process you went through? You're like, okay, I now have this data in S3
Starting point is 00:34:44 and then I have this use case that I need to implement. Just give us a little bit of the story of like what cycles you went through and then how you ended up, you know, deciding. I mean, I have to give some credit here.
Starting point is 00:34:58 Who's the one who told me about DuckDD and got me on it. And yeah, I mean, as soon as he mentioned it, I downloaded it, took a look at, you know, how it functions within your local development. I think something I think about a lot is what is the developer experience going to be like?
Starting point is 00:35:18 And do the tools that we choose create a really strong experience here? And partially from being just a bootstrapped to a small team, we need to be able to push new features extremely easy. And I think
Starting point is 00:35:37 having DuckDB be able to sit on that raw data and just run locally and be able to perform really well on top of this not huge data set. Everyone's got their own definition of what big data actually is. I guess in my standards, we're not dealing with big data. So everything can be run also with my data. So everything can be brought in also with my laptop. And the fact that DuckDV can sort of
Starting point is 00:36:08 enable that workflow allows me to iterate faster, allows Clint, who's done a ton of work on the data side of things, allows him to iterate really fast.
Starting point is 00:36:17 And I think that's definitely been a huge part of allowing us to push something out and get something into production. Yeah. Also, disclaimer for wild customers, your production
Starting point is 00:36:30 instance is not running on Patron's laptop. That is correct. But it is a really interesting model that I don't know of any other vendor that's doing where you can have this developer local developer experience with a duck db from a cost standpoint you're getting to use all that like you know just amazing capacity of some of these modern like laptops and then not wasting credits on like i'm developing all day and wasting credits and then when you're in production though like there's the the truthfully managed service production thing. So it's a cool model.
Starting point is 00:37:07 Well, I actually, and Patrick, interested to hear your thoughts on this, I immediately thought about hiring as well, where you think about bringing in a really good engineer and they're probably going to be like, man, that sounds so nice. It's interesting from that regard as well where it's like, okay, when I'm hiring, I can show you a pretty awesome
Starting point is 00:37:31 set of tools that we're using that will allow you to move really fast, maybe without some of the traditional restrictions. That's a huge part of it. And this should be baked into any decision on infrastructure choices and things like that well i think you bring up a really good point on the hiring because i feel like my analogy for this i feel like colleges pick up on this like maybe 15 years ago 20 years ago where it's like oh guess what like the dorms matter like if they're nice like kids want to come here yeah if the like food's good kids want to come here you know yeah because at one point i think it was a little more bent toward like just the academics and like you
Starting point is 00:38:09 know who has the best ranking here and here yeah and then like i don't know 20 years ago however many years it was like somebody's like oh student experience like that impacts people in work here and i still think there's companies that like have some set of tools that's like because they've always had them or because they're cheaper or whatever. And then they just are shocked. We just can't find the right people. We can't find the right talents. Maybe nobody wants to work with this set of tools. Yeah.
Starting point is 00:38:34 Okay, Patrick, I want to talk about the marketplace a little bit to bring this all full circle. So the marketplace in the context of Wild, I mean, there's one component of just being able to do secure data sharing or join data sets to find some common stuff. within Wild where you could engage and then get some sort of, you could consume some sort of data that shows crossover or whatever value that you could provide from a data perspective by showing information to the consumer brand and to the retailer, right? Here's some sort of crossover or whatever. It's essentially like an analytics product or a data product. But a marketplace is actually distinctly different in that you're still showing what's available,
Starting point is 00:39:26 but then it's actually an asset that the end user of your platform does something with. I can consume this, I can actually do something with it. Can you walk us through the differences between you providing a data product that relies on the sharing
Starting point is 00:39:44 and then providing an asset that you're selling or facilitating a transaction relative to? Yeah, definitely. So, yeah, as far as the dynamics of the Stata marketplace, it's less transactional and more of a subscription to access the retailer's data. Okay, interesting. And we see this as like a batteries sort of included experience. So because we've created this system to securely share and pull nuggets of information
Starting point is 00:40:22 out of both the retail consumer data or your sales within that retailer and your online sales, because we have that sort of granular, like row level data, we can start to, we can build up from there. And essentially like, I don't know how to describe this without giving up too much of the train secret here the secret sauce
Starting point is 00:40:49 yeah exactly we've been calling them like these ephemeral warehouses so because this like new era of WIC embedded OLAP databases,
Starting point is 00:41:06 we can run them really efficiently and we can spin them up whenever we want. So what the process is essentially like, we can create a warehouse, we can pull in data or query data from sort of external locations, run transformations, run models on top of that, pull out information and aggregates, and then persist just the sort of de-identified non-PII.
Starting point is 00:41:36 And because that's all run in processing and memory, we essentially let it all go your data stays with you and we actually created this system where we'll store the results because we needed to show so the product and the buffer right right but a process of getting to those results if you want a really good understanding of that customer crossover, you need to start at the sort of smallest and lowest grain you can get. And so that's kind of the system we're building towards and how we enable this kind of secure compute environment without replicating data.
Starting point is 00:42:21 That's really, yeah, that's really interesting. I'm thinking of it like in the physical world. It's like, rather than like, hey hey i'm going to give you a physical key to my house it's like we implemented a digital lock here you can have like temporary access to the house do a thing you need to do like get the result from that thing and then you lose access to the house and and you're not storing the pii they the pii stays inside their infrastructure right and yet they can get the results like customer lifetime value or like whatever things that maybe you calculated off of the data that's pretty cool yeah and i think that's where
Starting point is 00:42:58 the like some of the inspiration around the data clean rooms happened like yes uh we don't need cryptography because we're sort of in this black box environment that wild controls so it's not like we're allowing the retailer and the consumer to like go in and create each other's data and yeah we don't have to move that data set anywhere like you're saying yeah Yeah, that's cool. I mean, it is really fascinating to think about. I use the term asset, but it's fascinating to think about what's happening under the hood, right? Where really it's the availability of this asset and you essentially generate it when it's needed. But one question I have is, because you're dealing with time-stamped transaction data, the queries are going to take longer over time, right?
Starting point is 00:43:49 So in an ideal state where you have a customer who's, you know, a really awesome customer, they're growing like crazy, right? Like, how are you thinking about the scale? Because it's not a static data set and it tends to grow over time and then from a machine learning standpoint right like the larger data set's going to allow you to provide more accurate results but there are obviously performance implications you know relative to that yeah yeah no that's a great question and kind of the usage we're solving for are like small, medium consumer brands and then sort of small, medium, like mid-tier retailers. And so when you think about the quantity of transactions happening within the quantity of your sales happening within that retailer
Starting point is 00:44:38 and just historically how many transactions you as a consumer brand have had, you're still in this small data realm with tons and tons of room to grow. And so I think in that space, I guess the short answer is like, it's just all, we can do it all on a single machine a single process and we can now have access to you know these tremendous like tremendously large boxes that we can deploy at will and we if we can run you know we can run most things in memory but yeah i think it does pose a really interesting question on like okay what where does this architecture go when you go beyond that scale that you're talking about? Yeah.
Starting point is 00:45:31 And yeah, I mean, because it's not really the customer we're going for, we haven't done much thinking there, but it could be like a really interesting space to solve how someone that gets cashed throughout this system. And how do we reduce the load on the actual data provider and things like that. A problem that your investors would love for you to bring up if it became a serious pain point.
Starting point is 00:45:58 But you're right though, the brilliance of spot instances where, hey, I need 30 minutes of this ginormously oversized machine to do a thing is perfect use case. And since you're in the retail space, a transaction is a physical good, so somebody had to ship it and deliver it, typically. And that does not scale in the same way that digital goods do, where somebody takes a thousand pictures or whatever the other digital
Starting point is 00:46:23 thing is. So I think that's the other constraint that you would have to be, it'd have to be a massive retailer to, I think, for you guys to have problems. Definitely. I mean, I hope we have to solve that problem. I hope you have that problem. Yeah.
Starting point is 00:46:36 That'd be great. That's the goal. But, yeah. Okay, I think we're close to the buzzer here, but one more question for you. So you have a background in software engineering, doing really cool projects like building your own custom QR code generator and digging into all the associated math. to me about our conversation today is that you're running a really a very sophisticated data stack that a lot of companies would run for their own just internal analytics pipeline infrastructure. But you're running it in production, like across multiple clients with some,
Starting point is 00:47:22 you know, intelligent partitioning and querying and everything. So I think we can officially say you could probably put data engineer on your resume. I'll take that. I mean, I guess I can't bestow that, but John probably can. Granted. Yeah, as C4CTO, you can grant that.
Starting point is 00:47:44 Data engineer, Amoritis, or on a very... Yeah. But can you just walk through, like, going from, let's say traditional, which, you know, I think is a loaded term, but more traditional software engineering as a software developer, and then actually building out this data stack as a software product. What are some of the things that have really stuck out to you or that you've learned or that are wildly different? Building out a data pipeline infrastructure as software as opposed to a more traditional code base, which I know you have as well. No, I think it's definitely like my background in software has definitely influenced a lot of these architecture decisions. And yeah, it's to be said whether that's, you know, good or bad. But I think some of, like, the biggest differences for me coming into the data world was around the tooling. You just don't have the same, like, tooling as you do, like, a backend microservice stack or a frontend stack.
Starting point is 00:48:46 And I think part of the reasoning there is the majority of data stuff is built in Python. And Python's built for flexible, explorable, easy-to-write workloads. when you build a microservice architecture, you need really contracts between your APIs and the data. Everything is producing and consuming. And you want this really secure contract between the two. And I think
Starting point is 00:49:18 on the data side, that evolution of using Python has created a system where you don't have a lot of that guarantee a lot of those guarantees and we're sort of building some there's like new tooling coming out to sort of solve that problem you see like you see like sdf coming out with their like column level lineage. You're able to classify your data across your transformation stack
Starting point is 00:49:48 so you know exactly where your PII is landing. So there's some like interesting stuff that allows you to ensure like type safety and data contracts. And I think the other thing is a little more of a tangent, but the way I've thought about like database schemas on an ltp system where you have high throughput high transactions is completely different than how my brain thinks about what the heck we need like what are these mark tables going to look
Starting point is 00:50:22 like what's going to be in them and how just like why he can get and so there was like a shift there and thinking about yeah this sort of like this storage and schema of of on the data side and on the old tp side but yeah i think the other i guess the last piece I'll touch on there is sort of related to the tooling front. But yeah, I just, I value like a really good developer experience and anything that I can do to build that into the data stack, I will probably take a bet on
Starting point is 00:50:58 because I can see, I just know how important that is and how, yeah, like what type of efficiencies you can gain from that. So yeah, I think that's the other piece love it and I think the developer experience I don't have a good way to quantify this but I think if you like as you're optimizing for that picking tooling a lot around that I think it translates into the customer experience like there's just something about it where if it's easier to do the right thing as a developer
Starting point is 00:51:24 then the right thing gets done more which makes it better you know it makes it better it makes a better product so yeah i think that's cool yeah for sure i mean the more you can iterate and experiment the better the end result is gonna be yeah I guess one of the parallels you could draw is like, now with front-end frameworks, you can write some new code and immediately see that on the page and see what it looks like. Yep. And on the data side, if you're running DDT on Snowflake, there's going to be, that's like, you're going to write some new sequel and then you're going to wait and it's going to execute and just all this stuff.
Starting point is 00:52:09 And I think with DuckDB and bringing that more locally, yeah, you can just fly through some of these workloads. Yeah. And that's been particularly exciting for me. Super cool. Well, Patrick, this has been so fun. The time really flew by. Thanks so much for joining us and keep us posted on how the marketplace goes.
Starting point is 00:52:28 Yeah, definitely. No, I really appreciate it. Thanks a bunch. Thanks for coming on the show. The Data Stack Show is brought to you by Rudderstack, the warehouse native customer data platform. Rudderstack is purpose-built to help data teams turn customer data into competitive advantage. Learn more at ruddersack.com.
