The Data Stack Show - 90: The Modern Data Stack Has a Join Problem with Ahmed Elsamadisi of Narrator AI

Episode Date: June 8, 2022

Highlights from this week’s conversation include:
Ahmed’s background and career journey (2:27)
Why the modern data stack “sucks” (4:53)
The limitations of progress (9:13)
Showing data with only 11... columns (11:55)
Managing one table that rules them all (19:02)
Viewing the world as timestamped activities (32:40)
When this model becomes harder to use (35:15)
The two parts you need in a company (44:41)
Those who use Narrator (48:32)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Kostas, today we're talking with Ahmed from Narrator. And I am so excited about this conversation because maybe this is the first guest we've had who has made the bold assertion that the modern data stack,
Starting point is 00:00:39 or at least a subset of it, because I want to be fair here, sucks. And we talk so much about the ways that people are combining these tools to sort of build architecture. And it was problematic for Ahmed, and he's building a company to try to solve that. So my burning question, not to, you know, steal the thunder from you, but why does the modern data stack suck? I think that's going to be a great conversation. Yeah, absolutely. Absolutely. I mean, it's always nice to see like people that have a more radical view of like the
Starting point is 00:01:10 things that are happening out there. And I think this is something that we need, because it's healthy to rethink and not just take, you know, whatever we call the best practice and just continue with that. We should always be challenging our methods and the things that we are doing, the products that we are building. And I think that Ahmed is doing an amazing job in that. So yeah, I'm really looking forward to seeing why the modern data stack sucks, in a way,
Starting point is 00:01:43 but also see what's like the alternative that he's building there, which might not be an alternative at the end, right? Like it might be something that works like pretty well with the modern data stack anyway, but understand like a little bit more in depth, like what the solution is there and what the problem is that they are solving. All right. Well, let's dig in and figure it out.
Starting point is 00:02:07 Ahmed, welcome to the Data Stack Show. We are so excited to chat with you. I'm so excited to be here and dive into all these details. All right. Okay. So give us your background. You have built various iterations of the modern data stack many, many times over, but give us the timeline.
Starting point is 00:02:23 So how did you get into data and what are you doing today? Yeah. So I started my career actually in robotics. So I was really interested in how humans and robots interact together to make decisions. Self-driving cars, kind of bigger projects in human-robot interaction, and eventually made my way to AI for missile defense for the US government.
Starting point is 00:02:42 So understanding kind of missiles go through space and how to unlock them. Kind of got burnt out from that intensity, switched to WeWork and built WeWork's data team and data infrastructure that you see today. So that's when I implemented the data stack many, many times. Then I decided that there was a fundamental problem, went on like a tour of all these big companies to be like, how do you solve this problem? And pretty much realized that there are fundamental problems in data that these different approaches haven't solved.
Starting point is 00:03:12 So I decided to really rethink data. And that's where I ended up founding Narrator, a single table approach to answer any question in data. And it's an 11-column table that you can use to answer any question. And it makes asking and answering questions with data really easy. And the really special thing about Narrator is that that single table is a standard. So whether you're like airline companies, media companies, e-commerce, sales, crypto, banks, which are all different companies we have in those sectors,
Starting point is 00:03:41 you can use the same exact 11 columns to answer any of your questions, allowing us to share and reuse analyses. So really bringing that data world together, enabling that data analyst to really make the best decisions. Love it. Okay. I know that our audience's ears are burning just like Kostas and I's are because we want to know what those 11 columns are, but I will make this a substantial show. I want to start out, and you mentioned this when we were talking right before the show as well, that you've built sort of different iterations of the Modern Data Stack nine or 10 times. And when we were prepping for the show, you were like, it sucked. And that's such an interesting thing to hear because in the industry in general,
Starting point is 00:04:23 and of course, in the show, one thing we hear a lot is like, well, you got to move towards the modern data stack, right? Or these are the components of the modern data stack, or this is sort of the right architecture, etc. And I want to know why, I'd love for you to be as specific as possible. Why did you come to the conclusion like, this sucks? Because, you know, most of the industry is trying to push, you know, push everyone towards modern data stack. Yeah. And I think that everyone who has implemented the stack more than once will tell you that
Starting point is 00:04:57 it seems like the only way and it's a necessary evil. So at a high level core, you have your data everywhere and you dump it into a warehouse. We call this EL. And there's a lot of tools that have automated this process. It's kind of been solved. Then you have your warehouse and there's a lot of different warehouses you can use in different flavors, different benefits, and that's been solved. Then you have your middle layer. We call this a transformation layer where you actually use data and write SQL to represent the questions that you need to answer. That table gets materialized and put it to your BI tool. That BI tool then allows you to build
Starting point is 00:05:30 dashboards and visualize it. And anyone who's ever done this will tell you what happens. So what happens is that you build a dashboard, then there's a follow-up question or your team is like, yeah, but I want to understand, slice the number of emails by how many people are repeat purchasers. And they go, cool, let's go back to the data team. Let's build a new transformation. Let's build a new materialized view and let's build a new dashboard. And as time goes by, those number of transformations you have in the middle continue to grow. The number of data that's similar in multiple transformation continues to grow. It actually gets so messy that you often have 700, 800 transformations, each answering a series of questions. Then you end up dealing with, hey, how come these two dashboards don't match?
Starting point is 00:06:13 How come these numbers don't match? How come my warehouse is this slow? How come everything is so expensive? And because of this entire cycle of constantly needing to go back to build these new transformations, the time to answer a question goes into weeks and months. Every new question goes into these complex thousand-line SQL queries so you can answer it. What we've done is we've actually built different ways to manage this middle layer, but we haven't
Starting point is 00:06:40 solved it. So back then, like 40 years ago, this was called Microsoft stored procedures. And you would do that in like SQL Server. Then we added more ways to build a staging layer. Then we had Luigi, which was Spotify's version of it. Then we added Airflow. Then we have dbt. Now all this history has kind of built better ways to manage that SQL query. But the fundamental is that you still need a thousand-line SQL query to answer the series of questions.
Starting point is 00:07:08 And that doesn't go away. Now, why does that happen? Why do you need a thousand-line complex SQL query? And that itself is the underlying problem, and that is because data is actually captured in separate systems that don't relate. And you need to figure out ways to stitch it together: how do you tie an email to an order?
Starting point is 00:07:27 Because everybody wants to know email attribution to order. Well, to tie it, you need to go from email to web page, web page to parameters, parameters to click, click to copy that parameters, assume no duplication. And that complexity of doing that simple join across all these systems ends up generating these really complex queries. And I think that's where the modern data stack consistently fails.
Starting point is 00:07:50 And no matter whether you're Apple or Airbnb or Spotify, everyone will tell you that they have an entire team of people doing it. It has now become a job that people call the analytics engineer, whose entire job is to build these transformations so you can answer questions. And every company will tell you how long does it take you to answer a follow-up question? How long does it take you to answer every new ad hoc question you get? Is there an infinite backlog on the data team of questions? And that's always the case.
Starting point is 00:08:17 And I think that's the problem we need to solve is why the transformation layer causes all these kind of roadblocks. And that's the problem that I went out to really innovate in solving. Yeah. Can you dig in just a little bit more into... You said that some of these modern tools, right? So you have, you know, stored procedures and then, you know, Airflow and then now dbt.
Starting point is 00:08:44 What's the fundamental limitation there, right? Like they're making it easier to manage these transformations, but are they not making it easier to actually write them? Is that like a fundamental underlying challenge with like the structure of the data and the disaggregation?
Starting point is 00:09:03 Or like dig into like, what are the, like there's progress that's been made, but what are the, what are the limitations of that progress? Yeah. I don't think it's a problem with the tooling. It's a problem of the approach. So the approach of building custom tables for every question, to me, is like if you were building a car and every piece needed to be custom cast and molded. You need more of a world of interchangeable parts, where different pieces can fit together really easily. So right now, the fundamental problem is that SQL today requires you to join based on a key. And if that key doesn't exist, you put a person to hack at it with a bunch of complexity
Starting point is 00:09:46 to do it. That is the problem. Now, the tool you're using to manage SQL doesn't really matter if your SQL doesn't solve this fundamental problem. And I think that is the core problem that we realized is that it is actually a join problem because joins depend on foreign keys and foreign keys don't exist. So to solve it, you actually need to reinvent how you join and how you structure data, not how you manage transformations. Managing transformations is like dbt, love the tool. I love Tristan as well. This is one of the best tools to manage transformations in this traditional, what we call the traditional way of doing data,
Starting point is 00:10:26 which is known as the modern data stack. But that way itself is fundamentally flawed. You need a different way that allows you to work in the way that modern data actually really is flowing, which is how do you ask and answer questions and bridge all your systems quickly, easily, in seconds? And that's the point. And that's the thing that we have to really highlight because a lot of questions that appear so complex, where you have to write so much SQL, appear so easy in Narrator that you
Starting point is 00:10:50 can answer them with a couple of clicks. And that's because we solved that underlying problem that lies within SQL, which is joining data. Okay, so let's dig in. How do you do that with only 11 columns? Because it sounds, honestly, in many ways, it sounds too good to be true, right? And I know that we want to talk about, you know, there's no decision that you make technically that doesn't have a trade-off. And so I want to get there as well. But if you think about even a moderately sized company that, say, has sort of maybe some behavioral data in their warehouse. They're, you know, loading a bunch of structured data, you know, say from marketing tools or CRMs or whatever, right?
Starting point is 00:11:32 You have a bunch of materialized views. It doesn't, it's not that hard to have, you know, tens, hundreds, thousands of tables, right? Like you can get there really quickly, right? And if you do get there too quickly, everyone knows, you know, the pain that that creates. So it sounds crazy that you like solve all that with an 11 column table.
Starting point is 00:11:52 So tell us, how do you do that? Yeah. So first, I think we like to say one table because of the kind of like shock factor. It's like 95% one table. There are ways you can add additional tables, but that's not the core. So the core single table that we're going to discuss is known as an activity schema.
Starting point is 00:12:08 You can see it at activityschema.com. It's an open source project that kind of discusses this one table approach. And it is really just kind of taking the way that we speak about data and really bringing it to the way you kind of structure it. So it's just a time series table where it's customer, time, action, and you just abstract three features. So it's feature one, feature two, feature three,
Starting point is 00:12:32 a couple of additional columns, but that's kind of the core of it, which is that's it. So it's customer, time, action, and features. And you're thinking, well, like, wait, Ahmed, like, how can I just put everything I need in three features? Like I have so many features I need.
Starting point is 00:12:46 I need like a hundred features. And that's where the tool of dataset comes in. What Narrator provides is a way to pull in, what we call borrowing, features from different activities. So let's take a simple example: I want to know, for every email, did that email lead to an order? I want to know what the campaign of that email is. I want to know, when that person came to our website,
Starting point is 00:13:10 from that email, how many pages did they view, and I want to know what page they landed on. That seems like we already are talking about 10, 15 features, right? But if you break it down to actions, you have the opened email action, which has one feature, which is the campaign. You have the visited website action, which has path, which is also one feature on that. You have the viewed page action, which might have some features on it, but what matters is just the fact that the customer viewed a page, and the fact that they completed an order.
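Editor's sketch (not from the episode): the email example above can be illustrated in a few lines of Python. The column names (`customer`, `ts`, `activity`, `feature_1`), the sample rows, and the rule that an email gets credit for whatever the same customer does before their next email are all assumptions made for illustration, and this is not Narrator's actual implementation.

```python
from datetime import datetime

# Hypothetical activity stream: one row per customer action.
# Column names and values are illustrative assumptions only.
stream = [
    {"customer": "a@x.com", "ts": datetime(2022, 6, 1, 9, 0), "activity": "opened_email", "feature_1": "summer_sale"},
    {"customer": "a@x.com", "ts": datetime(2022, 6, 1, 9, 5), "activity": "visited_website", "feature_1": "/landing"},
    {"customer": "a@x.com", "ts": datetime(2022, 6, 1, 9, 6), "activity": "viewed_page", "feature_1": "/shoes"},
    {"customer": "a@x.com", "ts": datetime(2022, 6, 1, 9, 10), "activity": "viewed_page", "feature_1": "/cart"},
    {"customer": "a@x.com", "ts": datetime(2022, 6, 1, 9, 20), "activity": "completed_order", "feature_1": None},
    {"customer": "a@x.com", "ts": datetime(2022, 6, 5, 8, 0), "activity": "opened_email", "feature_1": "newsletter"},
    {"customer": "b@x.com", "ts": datetime(2022, 6, 2, 10, 0), "activity": "opened_email", "feature_1": "summer_sale"},
    {"customer": "b@x.com", "ts": datetime(2022, 6, 2, 10, 3), "activity": "viewed_page", "feature_1": "/home"},
]


def emails_with_borrowed_features(rows):
    """One output row per opened email, with columns borrowed from later activities."""
    rows = sorted(rows, key=lambda r: (r["customer"], r["ts"]))
    out = []
    for i, row in enumerate(rows):
        if row["activity"] != "opened_email":
            continue
        # Collect what the same customer did after this email and before their next email.
        window = []
        for later in rows[i + 1:]:
            if later["customer"] != row["customer"] or later["activity"] == "opened_email":
                break
            window.append(later)
        visits = [r for r in window if r["activity"] == "visited_website"]
        out.append({
            "customer": row["customer"],
            "campaign": row["feature_1"],
            "landing_path": visits[0]["feature_1"] if visits else None,
            "page_views": sum(r["activity"] == "viewed_page" for r in window),
            "led_to_order": any(r["activity"] == "completed_order" for r in window),
        })
    return out


result = emails_with_borrowed_features(stream)
for line in result:
    print(line)
```

Each activity carries at most one feature here, yet the output row has five columns, because the extra columns are computed from time ordering rather than stored.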
Starting point is 00:13:44 So now it's four activities. And all I'm doing is really pulling the data from each activity. So if I want to know in between those emails that they have an order, I can pull the fact when the next time from that order is. I can count how many page views they had in between that
Starting point is 00:14:00 and say that's the number of page views. I can grab the first page view from the started session activity. You can pretty much think about it as taking this really long table and doing a very clever, fancy pivot and pulling the columns I want from each of these activities. And when you do that, what it turns out is that if you actually represent your business as this really rich customer journey, you don't need that many features per action, but you do have a lot of actions. And those actions are where all the nice rich information comes. And because time and the counting and all that stuff is given to you by
Starting point is 00:14:32 narrator out of the box, you don't need to add features like first visited page, last visited page, number of visited pages, number of visited pages, last 30 seconds, all those can be recomputed on the spot when you're answering the question that you need instantly. Does that make sense? Yeah, it does. So how do we populate this one table from the raw data that we have, right? Yes.
Starting point is 00:14:58 I mean, obviously, let's say this is the data model that makes sense to have on your data warehouse, like for analytical workloads. Obviously, the data that is coming is not modeled for that, right? So again, we are going to do the extraction and the loading of the data. So after we have staged the data and we have loaded into the data warehouse, how do we get to the point where we have, let's say, a well-curated one table to rule them all? Yeah, great question.
Starting point is 00:15:27 So, Narrator provides a very, very thin layer that's known as our transformation layer. And this is not like a dbt transformation layer because you're really just mapping columns. You're pretty much saying, like, for example, my internal database has a users table. And I want to have an activity like added user. And you're just mapping that to the 11 columns. So you're saying like
Starting point is 00:15:48 the timestamp is the created_at of this table, the action is added user. Here's the features that I care about. And it's a very thin layer to map it. It's so thin that it averages
Starting point is 00:15:58 around 12 minutes to write. I think most customers that have experienced it see how easy it is to kind of take your data, whether it's already event-based or relational or tickets, and we have like a library of all these common transformations in our doc site, and you just kind of like map it to that simple structure
Starting point is 00:16:15 that is this building block. So you define each activity, and then Narrator migrates that data, does a bunch of caching, does a bunch of things to make that really nice and easy and fast to use, and provides you with an interface to actually ask and answer these questions. And the good thing about doing it with activities is that you only ever need to add a new building block when you have a new concept to add, not when you have a new question. So often in tables that you materialize in the modern data stack, every time there's a new way of relating data, you build a new table. In Narrator, you don't do that.
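Editor's sketch (not from the episode): the "thin mapping" idea can be pictured as a function that renames source columns into the activity structure. The `users` rows, the choice of email as the customer identifier, and the `plan` feature are hypothetical, and only the columns named in the conversation (customer, time, action, three features) are shown. In Narrator this mapping is written as a short SQL query, so the Python below is only an analogy.

```python
from datetime import datetime

# Hypothetical rows from an internal `users` table.
users = [
    {"id": 1, "email": "a@x.com", "created_at": datetime(2022, 1, 3, 12, 0), "plan": "free"},
    {"id": 2, "email": "b@x.com", "created_at": datetime(2022, 2, 7, 8, 30), "plan": "pro"},
]


def to_added_user(row):
    """Map one source row onto the activity structure; no joins, just column mapping."""
    return {
        "customer": row["email"],    # assumed customer identifier
        "ts": row["created_at"],     # the timestamp is the table's created_at
        "activity": "added_user",    # the action name for this building block
        "feature_1": row["plan"],    # a feature we care about (illustrative)
        "feature_2": None,
        "feature_3": None,
    }


activity_rows = [to_added_user(u) for u in users]
for row in activity_rows:
    print(row)
```

A new function like this is only needed when a new concept appears (a new kind of action), not when a new question is asked of the existing activities.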
Starting point is 00:16:47 You just build what's called a dataset and that's done by a couple clicks. Every time you have a new concept added to your company, then you add it. So you're often doing these activity transformations within the first week, and then you add one every other month. It's like really rare that you're adding a bunch of new activities. Instead, you're taking the building blocks that you've kind of built and you're reassembling them to answer all sorts of questions. And how did you, how is this table like implemented?
Starting point is 00:17:12 Is this like a materialized view that gets like populated inside the data warehouse, is it like a logical view? Like what's the... It's an actual table. Yeah, it's an actual table. It's a table that we can insert into, we update, we manipulate. Narrator does a lot of additional things like identity stitching across all your systems and handling fraud users and anonymous sessions and all that stuff. So we're actually just constantly updating and mutating this one single table.
Starting point is 00:17:40 And we're sorting it and partitioning it based on your warehouse to optimize performance. And we do a lot of stuff on that one table to make it really performant and really nice and fast. And then data set queries are all, there's no free SQL in there. So you're actually using the data set to answer any question and all the queries those generate are super optimized for speed and on that single table.
Starting point is 00:18:03 So in your warehouse, you'll have a schema or a data set, depending on which warehouse you use, that's called narrator. And in there, you'll see the activity schema or the activity stream is often what it's called. Okay. Yeah. All right. So let's talk a little bit more about the management of this table, right?
Starting point is 00:18:21 I mean, obviously this table relies on the underlying data that is getting loaded into the data warehouse, right? How do you do things like, okay, let's say accidentally someone drops a table, right? One that is used as a source. What happens then? Is there a removal, let's say, that's going to be reflected also with deletions on this table? What's the logic behind working with data that might
Starting point is 00:18:53 cease to exist at some point? Or it might be figured out that it's the wrong data, right? How does this work? Yeah, great question. So one of the benefits of only having modeled activities is that our average query length is 20 lines. They're really small queries. And we're updating this thing incrementally. So every five, 10 minutes, we're inserting the new data into the activity stream. Let's say we go to insert it and the query for an activity's transformation fails because data is not there. We take that activity and that transformation, we put it into what's called a maintenance state. So anyone using that data will get a flag.
Starting point is 00:19:30 Hey, this data isn't up to date. Something went wrong. You get notified. You can go in and fix it and resync it. And once the data gets resynced, the maintenance state goes away. We also provide out-of-the-box anomaly detection. So if that data ever stops producing rows, you can write your own custom alerts on it.
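Editor's sketch (not from the episode): the incremental sync and maintenance-state behavior described above might look roughly like this. The class name `ActivitySync`, the `fetch_new_rows` callback, and the in-memory `source` dictionary are hypothetical stand-ins; this is a sketch of the idea under those assumptions, not Narrator's code.

```python
from datetime import datetime


class ActivitySync:
    """Incrementally appends one activity's new rows to the stream (hypothetical)."""

    def __init__(self, name, fetch_new_rows):
        self.name = name
        self.fetch_new_rows = fetch_new_rows  # callable(since) -> list of rows
        self.last_synced = None
        self.in_maintenance = False

    def run(self, stream):
        try:
            rows = self.fetch_new_rows(self.last_synced)
        except Exception:
            # The source query failed: flag the activity instead of writing bad data.
            self.in_maintenance = True
            return 0
        stream.extend(rows)
        if rows:
            self.last_synced = max(r["ts"] for r in rows)
        self.in_maintenance = False  # a successful resync clears the flag
        return len(rows)


# Hypothetical source that can be "dropped" to simulate a missing table.
source = {
    "dropped": False,
    "rows": [
        {"customer": "a@x.com", "ts": datetime(2022, 6, 1), "activity": "completed_order"},
        {"customer": "b@x.com", "ts": datetime(2022, 6, 2), "activity": "completed_order"},
    ],
}


def fetch_orders(since):
    if source["dropped"]:
        raise RuntimeError("source table is missing")
    return [r for r in source["rows"] if since is None or r["ts"] > since]


stream = []
sync = ActivitySync("completed_order", fetch_orders)

first = sync.run(stream)          # initial load
source["dropped"] = True
sync.run(stream)                  # source is gone, so the activity is flagged
flagged = sync.in_maintenance

source["dropped"] = False         # someone fixes the source and adds a row
source["rows"].append({"customer": "a@x.com", "ts": datetime(2022, 6, 3), "activity": "completed_order"})
third = sync.run(stream)          # only the new row is inserted
print(first, flagged, third, sync.in_maintenance)
```

Because every activity has the same structure and a timestamp, the same loop works for all of them, which is part of why checks on a single table are cheap.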
Starting point is 00:19:47 So we've done a lot of stuff to make sure that as your data is migrating, it's correct. We do a lot of duplication checks for IDs and stuff like that as well to ensure that the data that you're inserting into your warehouse is always accurate. And the benefit is again, because it's a single table, we can do a lot more checks very cheaply and easily because we have guaranteed structure and guaranteed assumptions. So the narrator is always incremental. It's always time series. All these things get a lot of benefits from it.
Starting point is 00:20:11 So that's what ended up happening a lot with this thing. So people often find managing those like a single table, actually the easiest part, like super cheap because, and it's often on the raw data because it's so simple. You're often just pointing a timestamp from your raw tables to a structure. Like there's really few like complex queries that you're putting in activities. All that stuff happens in data sets. Mm-hmm. All right.
Starting point is 00:20:35 And so, okay, let's focus a little bit more on the modeling side of things now. One of the things that I have like experience, like when I'm talking like with companies or like I'm observing what the company is doing with their data is how the semantics of the sale might change for the same thing. Like, what is a user, for example? Like a customer, how a customer is perceived by sales or how a customer is perceived by product or how a customer is perceived by marketing. Right. And just to give an example, like you go with sales and chat with them and you start talking, hearing about like prospects and leads and opportunities and contacts and you know, like all these things that we pretty much like we all learn to live with because Salesforce became a thing and their schema became, let's say, the way of representing sales in the world.
Starting point is 00:21:28 Yeah. So how do you deal with that? Because from what I understand, a core concept of your modeling is that everything is around the concept of the user, like the customer, let's say, right? How do you differentiate with that? And how do you make this accessible to people that use different syntax and semantics about the same concepts? Yeah. So honestly, this is actually one of the best parts about Narrator. One thing that we see a lot when you're depending on dashboarding is that you have to force everyone to abide by one definition. Total sales has to be total sales and total customers has to be total customers. What you see a lot in Narrator is that you could have multiple identifiers that get mapped to what is sort of your global customer. And customer could be, we have
Starting point is 00:22:18 companies that are ride sharing where the customer is a car; at WeWork, the customer is a building. You have different ways of defining customer. So what we see a lot is the idea of that entity having events. So like you might have a created lead activity. You might have a pre-started opportunity. You might have a closed opportunity. You might have a signed contract, sent contract, moved in, made a payment, like started subscription.
Starting point is 00:22:48 And the reason why that's so important is when you deal with that argument, and I've had this at WeWork a lot. Well, when is a sale? Is it when they sign? Is it when they move in? Is it when they pay their first invoice? Is it when they start their subscription? Well, when is the sale? You don't have to actually fight that battle anymore.
Starting point is 00:23:04 Instead, what you do with Narrator is you have this concept of dataset, where you have the activities and you can represent them differently. And then when you go to create your KPI, which is like your key performance indicator that Narrator creates, you can then choose very explicitly what that is.
Starting point is 00:23:18 And the user then sees the KPI, they can always click into it and see the underlying dataset and say, oh, this says you did a timestamp of the first opportunity created. And because of opportunity down activity, it's just a lot easier to get that transparency. So when you're modeling the data, you don't need to model based on how it's going to be used.
Starting point is 00:23:36 You need to model based on what it is. And then when it's being used for like a specific question, the user can actually choose very specifically whether they want it from the sales perspective or the invoice perspective. And then there's also the global just company KPI, which the company has decided is the thing that they're going to track and they're going to call that total sales.
Starting point is 00:23:54 And you can always click on it and say, oh, they're using signed contract as their definition for total sales. And I think by kind of creating those three layers, whether it's a company global KPI, which people are using to measure any data set, which is answering
Starting point is 00:24:07 specific questions and then having your building blocks represent real actions that the customer is taking, it just kind of creates very little space for ambiguity.
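Editor's sketch (not from the episode): the separation between building blocks and KPI definitions can be illustrated with a lookup from KPI name to activity name. The activity names, the `kpi_definitions` dictionary, and the counting rule are assumptions for illustration only.

```python
# Hypothetical activity stream: the building blocks model what happened,
# not how it will be counted.
stream = [
    {"customer": "acme", "activity": "signed_contract"},
    {"customer": "acme", "activity": "paid_first_invoice"},
    {"customer": "globex", "activity": "signed_contract"},
]

# The company-level KPI is just a named choice of activity (assumed shape).
kpi_definitions = {"total_sales": "signed_contract"}


def kpi_value(name, rows, definitions):
    """Count rows of whichever activity the KPI currently points at."""
    activity = definitions[name]
    return sum(1 for r in rows if r["activity"] == activity)


by_contract = kpi_value("total_sales", stream, kpi_definitions)

# Redefining the KPI changes one mapping; the modeled activities are untouched.
kpi_definitions["total_sales"] = "paid_first_invoice"
by_invoice = kpi_value("total_sales", stream, kpi_definitions)

print(by_contract, by_invoice)
```

Anyone reading the KPI can look up which activity it points at, which is the transparency being described.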
Starting point is 00:24:16 Like, a question that we don't often get in Narrator is, but what does this actually mean? It's like, oh, just click on the dataset and see exactly
Starting point is 00:24:22 what that means. Oh, what is that? And because the words are like created opportunity, or the word might be like made payment, you can be like, oh, and you can click onto that and see the exact SQL. And that SQL is 20 lines, so you can easily understand it. But it creates that separation so that the data team isn't fighting. And if the company decides, actually, we're redefining total sales to look at it based on when the first invoice is made, they don't even talk to data about that.
Starting point is 00:24:48 Like the data is already modeled. You just choose that for your dataset, and that can be done without involving data at all. And everything will just cascade nicely because again, your building blocks are what you're modeling, not the final results. So you're representing the world as these activities. Everything else happens in Narrator. And you can build datasets to combine them.
Starting point is 00:25:10 You can build KPIs. And you can change those things without thinking about going back to the data model ever. Kostas, I'd like to dig into that with a specific question. And this is inherently biased because I actually got to use Narrator, kick the tires on it, which was really, really cool. And so I'd love to know, because unfortunately, I didn't dig in with our analyst team and data engineering team, but I was sort of like a consumer of a question that we were trying to ask. And in fact, I will tell you what the question is, because maybe that'll be helpful. And then I have like a specific question about how something's happening under the hood. So the question
Starting point is 00:25:48 we were trying to answer, which again, sounds like an easy question, but like actually ends up being difficult to answer is how much does consumption of a particular type of blog content, you know, well, A, does that seem to influence an opportunity being created in a certain time period? I have a couple of questions. It's like, okay, whatever this thought leadership or engineering or whatever, does increasing consumption, is that a leading indicator that there's increased likelihood or whatever? Okay. this is my question. And actually, it was very elegant the way this happened because the resultant narrator
Starting point is 00:26:30 actually had both a first touch and influenced view that were very easy to get, which is really cool. But here's my question under the hood. What makes that, and correct me if I'm wrong here because you know, because I'm not an expert in SQL. But part of what makes that difficult in raw SQL is actually not necessarily like looking at page views and then sort of saying like, okay, was that user associated with an opportunity eventually, right? You actually may have multiple users who have entered the funnel, but are related to the same account, which is also related to the opportunity. But in Salesforce, of course, with their data structure, not everyone is. And so when you talk about something like influence, as opposed to something very linear, like first touch, user did A, did B happen at some specified time period, right? Now you're talking about
Starting point is 00:27:26 a group of users who are associated with a different object or different table in the warehouse. What you want to know about is the opportunity, which is a different table in the warehouse. And so there's a ton of key crossing across those tables, right? To do something here. And this is actually also all assuming that your behavioral data like has a layer of identity stitching as well where you have like unique IDs for like the anonymous behavior because that can also
Starting point is 00:27:54 happen pre-identified, blah, blah, blah. Anyways, you get it. I won't keep going. Awesome. So first of all, that's a great question. Like it bridges multiple systems. It shows you're asking an analysis and you probably have seen our narratives, which is one of the benefits of standardizing data.
Starting point is 00:28:09 It allows us to build and reuse analyses, which is our intelligence to generate these beautiful stories that help you understand your data automatically for you in seconds that actually provide real answers. And that question that you asked has a lot of nice complexities to it, right? Like multiple systems, multiple tables you're talking about. How do you think about bringing that together?
Starting point is 00:28:28 And all sorts of different pieces that make that really, really complicated. And if you talk to your data team that set it up, they'll probably tell you that they set up those activities in one 45-minute session. Because that's our proof of concept usually, one 45-minute session. So they set that up, got that answer, gave it to you, allowed you to self-serve it. The entire setup was 45 minutes. So what did they do? So two pieces here that are really critical.
Starting point is 00:28:54 One is that in Narrator, because we built an entire company based on a single table, we got really good identity stitching. So we have a very, very proper way of stitching that data. Two, all that thing that you're talking about of this thing happening and the first time they ever do it, everything is changing in time. If you notice, Narrator does everything as a function of time.
Starting point is 00:29:16 So what that probably looked like, I don't know the exact setup, but it probably looked like something like viewed content was an activity and it had an anonymous ID of whatever that cookie was of that user who viewed the content. And based, you had a, probably like a contact or an account ID that was like your global identifier, which is how you thought about your business, which is that account creates the opportunity that account creates a lead and all sorts of pieces. Right. So the user, so you have like
Starting point is 00:29:44 an account identifier and that applies to like both the users and the opportunity and you pass that through on the activities is how that's happening. So that's your customer. And then you create what's called, Narrator allows you to create tiny little snippets that match data together.
Starting point is 00:29:56 So you probably have one more snippet that's like, hey, we know that this cookie is now this account ID. There's a lot of explanation of how that works. And then that's it. So they build those three transformations and they're able to stitch that together, combine it. And then when you're asking that question, and if you use our tool,
Starting point is 00:30:12 you can right-click on any piece of data, see the exact customers, right-click and see that customer's entire journey. So you can see that customer viewed a page, viewed a blog, viewed blog, viewed blog, created opportunity, viewed blog, viewed blog, viewed blog, created opportunity, viewed blog, viewed blog, viewed blog. And they're able to understand the difference between that.
Starting point is 00:30:31 And we talked about the differences between knowing how many there were. That's a simple, give me the count of them. Knowing the rate, give me the count divided by the time from the first one. Giving me the first content they viewed versus the last content they viewed. All those things, we're using words like first, count, last, but we're still talking about actions that the customer is taking. And that's kind of the beauty is that the way you ask the question, you kind of, to express questions, you kind of convert them into these action-based questions because you're saying, how did the customer, you already combined the fact that it has to be
Starting point is 00:31:03 the same person because you're not asking how does something affect something else and nothing is tying it together. You often tie it together by a person. And you talk about these two building blocks, viewing content and creating an opportunity. And you're looking at a conversion rate and you're trying to optimize that. So you've already done
Starting point is 00:31:19 the way that you've asked the question, you've done 80% of the hard part of preparing data. And all they did was take that same structure of how you're imagining the data happened, customer views the blog, they create an opportunity. And we enable you to create that structure.
Starting point is 00:31:33 And then we quickly enable you to actually structure that data using the way that you asked it. So that's what makes that experience so seamless and look kind of like magical because you've done three things in your head for us already. And we just kind of represented the way you think about it.
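As a rough illustration of the setup described above (not Narrator's actual implementation), the sketch below models a single activity table in plain Python. The field names, the `stitch` and `journey` helpers, and the sample rows are all hypothetical. The only ideas taken from the conversation are that anonymous page views carry a cookie ID, that a small mapping ties each cookie to an account ID, and that a customer's journey is their activities ordered by time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Activity:
    ts: datetime
    activity: str                       # e.g. "viewed_content", "created_opportunity"
    customer: Optional[str]             # global identifier (account ID), if known
    anonymous_id: Optional[str] = None  # cookie ID for pre-identified behavior


def stitch(activities, cookie_to_account):
    """Fill in the account ID on anonymous rows using a cookie -> account mapping."""
    stitched = []
    for a in activities:
        # Keep an existing customer ID; otherwise look the cookie up (None if unknown).
        customer = a.customer or cookie_to_account.get(a.anonymous_id)
        stitched.append(Activity(a.ts, a.activity, customer, a.anonymous_id))
    return stitched


def journey(activities, customer):
    """Return one customer's activity names in time order."""
    rows = sorted((a for a in activities if a.customer == customer), key=lambda a: a.ts)
    return [a.activity for a in rows]


# Hypothetical sample data: two cookies belong to the same account,
# a third cookie has not been identified yet.
raw = [
    Activity(datetime(2022, 1, 3), "viewed_content", None, "cookie-1"),
    Activity(datetime(2022, 1, 5), "viewed_content", None, "cookie-2"),
    Activity(datetime(2022, 1, 9), "created_opportunity", "acct-42"),
    Activity(datetime(2022, 1, 4), "viewed_content", None, "cookie-9"),
]
cookie_to_account = {"cookie-1": "acct-42", "cookie-2": "acct-42"}

stitched = stitch(raw, cookie_to_account)
acct_journey = journey(stitched, "acct-42")
print(acct_journey)
```

Because both cookies resolve to the same account, the two anonymous views and the opportunity land in one journey, which is the "influence" view Eric describes. The unidentified cookie stays unattributed rather than being guessed at.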
Starting point is 00:31:47 Yeah. Super interesting. Super helpful. Okay. And I can verify it was really cool to see that happen. Okay. Ahmed, I do want to play devil's advocate and I'm actually going to ask Kostas a question here because this is beyond my technical depth. But when you talk about activities as sort of the way that you view the entire world, you're talking about essentially converting every type of data into event data. And Kostas, I mean, there are a few things that come to my mind, but I would love to know, Kostas, that's a non-trivial sort of lens to put on all data, what comes to your mind as, you know, potential challenges, benefits, whatever, when you view the entire world as sort of timestamped activities? Yeah, that's a pretty interesting question. Usually the problem that we have with that is that there are questions that
Starting point is 00:32:46 you can better answer when you, let's say, keep track of like everything that has happened, right? Where having events there is like the way to do it. And there are questions that are like much easier to answer where you just keep, let's say, or you have already replicated the current state of your concept or entity or whatever you want to call it, right? So usually the problems that you have with events is that, yeah, it really helps you to measure change, for example, and stuff like that. But if you want to see at the end how like things look right now, you will probably have like to go and like replicate the whole, let's say, journey, like get the data there and go and replicate like the current state.
Starting point is 00:33:41 That's like from a very, let's say, it's a naive description that I'm giving, but it's like usually what like people have to deal with from an engineering perspective when you have to decide, am I going like to work with mutable states or like go and keep like events there and work with events. And usually like events give you like this extra expressivity, but there's some kind of explosion in terms of like the amount of data that you have to deal with or like what it means to go and replicate the whole, like the state by iterating all the different events that you have. Now, obviously there are like situations, like there are things that you can do only if you have events, right? If you want to see, let's say, what is the journey of your customer, you need to have
Starting point is 00:34:33 all the events there. Otherwise, how you're going to do that, right? So having this kind of turning everything into an event makes sense in a way. But the question is, and that's like a question that I have for Ahmed, is when does having, let's say, this model become a problem? What are, let's say, the questions that are not impossible to answer, but harder to answer because you add like this different way of like describing the world, right?
Starting point is 00:35:10 So great question. So a couple of things to kind of highlight. So one of the things, the benefit of kind of having, we at Narrator put this like really intense, rigid structure, and it allowed us to kind of solve a lot of the core problems using data sets. So one thing that you can easily do in any activity is say, give me the last ever updated subscription or give me the last ever status of this company.
Starting point is 00:35:36 And when you can use words like last ever, it makes it really easy to know what the current state is. So we find a lot with our customers is that if things are changing, you can get like, if you let's say you have a contract object and that contract object is changing. If you want to know the current contract,
Starting point is 00:35:52 you say, give me the last ever updated contract and you get that contract object then. However, sometimes when you're asking questions, you're saying you want to know what the contract was at the moment when that person submitted a ticket. Those questions are nearly impossible to do with non-event data. But with Narrator, you can say, give me the last before.
Starting point is 00:36:11 Before you submit a ticket, give me the last before updated contract. And now give me the state. So you can actually benefit of doing like generating state comes from instantly with the last ever. But you can also generate state at any given moment in time. This was inspired by, if you're familiar with like the, it used to be a very big database paradigm known as the Lambda architecture, where you have like a streaming layer and then you kind of do a batch layer to process it. But one of the benefits of that approach allows you to structure data any moment in time. And those change questions can be seen. The second thing you asked is like,
Starting point is 00:36:42 what about things that aren't changing? Like your customer's age maybe, or like their gender or some of these things that have changed less often than you think. Well, I said that narrator is mostly a single table. We do have what's known as what we call it, like kind of like an attribute table, which is on this customer, because everything's centered around the customer. You can just kind of create, we have a materialized view. That's like a dim customer, for example. You can add all the kind of static attributes of the customer that makes it really easy. You often don't add stuff like when they first signed up, you don't add timestamps there.
Starting point is 00:37:13 If you actually do, Narrator will alert you saying you shouldn't do that. But usually it's like your name, address, blah, blah, blah. You can put it there. If it's changing, you make an activity. So like you might have an updated address activity and you want to know when we first acquired this customer, what was their first updated address? Or give me the last one to know what their current updated address was. So that's kind of how we handle a lot of these cases and we handle them in product. So the thing about this single table approach, and I'll tell you the honest truth, it has two huge, huge, huge downsides. The first downside is that a single table, querying it
Starting point is 00:37:50 is really hard. Take all the SQL you've learned and kind of throw it away because you can't imagine, when I say last before, like, Kostas, you've done this before, but you can imagine that SQL query is very non-trivial. And very probably, if you write it without realizing, you might do something very inefficient.
Starting point is 00:38:09 Like you'd think, oh, I can just use the last value window function. Good luck. What happens if it doesn't exist? What happens if it duplicates? All those things that can do. So the querying of that is extremely difficult, which is the challenge of having a single table. And the second thing is, if you notice I'm doing something with every question you're
Starting point is 00:38:27 asking me, where I'm doing this thing that looks kind of, that makes Narrator work, where I'm actually translating your question to be a little bit more defined in this activity way. There's a mental thing that I've experienced and I've mastered, but a lot of our customers take a couple of weeks to learn, is this new way of thinking about how to think. Because in SQL, you can imagine stacking the data
Starting point is 00:38:48 and joining and how it works. But this new approach, you have to relearn the mental model of how to combine data. You need these like temporal relationships that you call. I think we actually find it
Starting point is 00:39:00 most customers who come from like a deep SQL background have a harder time learning our relationships than people who come from like a marketing or product mindset because they're used to thinking about things from a customer perspective while SQL's often thought about from a table perspective. So that mental model learning is a big overhead and then knowing that that table is really hard to query by hand is really hard. So what we decided to do was build a company around it.
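To make the "last ever" and "last before" relationships concrete, here is a minimal sketch in plain Python rather than SQL. It is an illustration under assumptions, not how Narrator generates its queries: the tuple layout, the function names `last_ever` and `last_before`, and the sample contract and ticket rows are all hypothetical. It also makes explicit the two edge cases raised above, namely what happens when no earlier activity exists and what happens when timestamps tie.

```python
from datetime import datetime

# Hypothetical activity rows: (timestamp, customer, activity, value)
stream = [
    (datetime(2022, 1, 1), "c1", "updated_contract", "basic"),
    (datetime(2022, 2, 1), "c1", "updated_contract", "pro"),
    (datetime(2022, 2, 15), "c1", "submitted_ticket", "T-1"),
    (datetime(2022, 3, 1), "c1", "updated_contract", "enterprise"),
    (datetime(2022, 1, 10), "c2", "submitted_ticket", "T-2"),
]


def last_ever(stream, customer, activity):
    """Value of the most recent matching activity, or None if there is none."""
    rows = [r for r in stream if r[1] == customer and r[2] == activity]
    # If two rows tie on timestamp, max() keeps whichever appears first in the
    # stream; a real system would need an explicit tie-breaker.
    return max(rows, key=lambda r: r[0])[3] if rows else None


def last_before(stream, customer, activity, moment):
    """Value of the most recent matching activity strictly before `moment`."""
    rows = [
        r for r in stream
        if r[1] == customer and r[2] == activity and r[0] < moment
    ]
    return max(rows, key=lambda r: r[0])[3] if rows else None


# Current state: the last ever updated contract.
current = last_ever(stream, "c1", "updated_contract")

# State at a moment in time: the contract as of the ticket submission.
at_ticket = last_before(stream, "c1", "updated_contract", datetime(2022, 2, 15))

# Edge case: c2 submitted a ticket but never had a contract activity.
missing = last_before(stream, "c2", "updated_contract", datetime(2022, 1, 10))

print(current, at_ticket, missing)
```

The same activity rows answer both the "what is the contract now" question and the "what was the contract when the ticket came in" question. The missing case returns `None` instead of silently borrowing another row, which is one of the traps with a naive window function.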
Starting point is 00:39:29 Like the reason why activity, this single table isn't just an open source project is we found that like we open sourced it and people tried to use it and they were like, hey, this sucks. And I'm like, yeah, you're right. Like querying this thing takes you forever. So we spent years building and iterating over an experience so you can actually generate any table using this tool called Dataset,
Starting point is 00:39:50 which Eric got to see, which is a really just seamless way of combining data. And it makes it look very seamless and nice. So we solved that problem with product and we solved the second problem with just iterating. So we often give customers examples.
Starting point is 00:40:04 We do a lot of documentation. We do a lot of blogging. We do a lot of automatic analysis. We generate, we have a series of templates that helps you see how to ask and answer a question. We have an entire library
Starting point is 00:40:14 depending on your industry that gives you a bunch of different types of questions that you can ask and shows you how to map it to an area of this world and how to answer it
Starting point is 00:40:22 using your own data in a couple of, in under 10 minutes each. So something like that is like a huge educational overhaul. But one of the things that you said that I did find beautiful is that you talked about this language that Salesforce created. I studied Salesforce for a while because I think they're one of the most interesting companies. Because prior to Salesforce, everyone had their own definitions of structuring sales data. Every company had their own sales data models. It was like, nowadays we're like, of course,
Starting point is 00:40:49 every sales company can be represented with leads, opportunities, tasks, and contacts. But that's really a Salesforce state. They changed how we thought about data and they standardized all of data for sales. And the thing that we like to think about as narrator is that's exactly what we're doing for data. We're like, here's a standard data model. And yes, it is very rigid. But we've shown you that you can do so much with it and answer so many questions with it and do all these things. And we've taken all the trade-offs and the downsides of using this data model and said that is narrator's job to make that solvent. So making sure that's super easy to create for anyone, whether technical or not technical, using our tool like Dataset. Making sure you can see the value in instant
Starting point is 00:41:28 beautiful analysis. Making sure that the benefits of the assumptions can be shown by giving you stuff like automatic anomaly detection, instant analyses that can answer any question, templates to understanding CAC or LTV and all sorts of template analysis. We gave you so much so that you can see value in learning this mental model overhead of thinking differently about data. And that's the goal. And that's kind of why I ended up saying like, the modern data stack sucks and all these approaches
Starting point is 00:41:54 because they're just so different. Each one is so, every company you go into, you have to learn a new way of how they represented their data, the thousands of tables they built. With Narrator, I can switch between any of our companies that use us and instantly answer any question because it's a standard way of thinking. It's a standard way of answering questions. And we've shown that it's flexible enough to answer any question.
Starting point is 00:42:15 If you've come to any of my talks, I always say: message me on Twitter, tweet me, LinkedIn, email me directly, and give me a question I can't answer. I'll tell you if I can't answer it, I'll post publicly that I can't answer it, or else I'll show exactly how you can answer that question in Narrator. And having done this for five years,
Starting point is 00:42:37 you often see that almost all questions can be easily answered in this structure. You just got to think about it a little differently. And that's the downside to it, is that thinking differently, but we do believe the upside of the value of speed is just so incredible that it's a no brainer for us. Yeah, absolutely.
Starting point is 00:42:53 And that's where the opportunity is. And that's why you're like building a company, right? For that. Okay. I find interesting what you're saying about like the comparison with Salesforce, because I think there are similarities, but there are also some big differences. Salesforce went there and had one domain that they had to model, which was sales. Now, with Narrator, you do like, let's say, in a way like, not the opposite, but you're saying, okay, I have one model that is abstract enough and expressive enough to go and cover, let's say, all the different domains out there, right?
Starting point is 00:43:38 So your work in a way, it's exponentially harder than Salesforce, I would say, because you'd have to deal with all these different domains and people that's working there and trying to help them think in a different way. But obviously, also, if you manage to do that, the reward is going to be probably even bigger. But I have a question that has to do with like, at the end, like the expressivity, because we keep talking all this time and we are talking about like customers and users, right?
Starting point is 00:44:15 Like the center of like this data model is around like the concept that you have a user there who acts, so you have an entity and activities, right? Is this all that we need in a company? Or there are also like other activities and other things that are happening that, let's say, maybe in the future, like, Narrator will also address? So great question. So there's two parts there. So what we've done is not trying to abstract away your business. I think that if we're trying to build a model that represents every business, that's a very
Starting point is 00:44:50 hard thing. What we've done is we've built a model that represents how we ask questions. And what we've actually solved is behavioral and change. We build a model that's really good at understanding change. And what we've shown is that every question can be actually a function of understanding change. So when you think about a company, we think about a single table. It's per core.
Starting point is 00:45:10 We talk about like, oh, we have a ride sharing company. They have two streams. What's called a customer stream and a scooter stream. So their customer stream is everything customer opens an app, customer buys, customer rides, starts the ride. Customer submits a ticket, a customer makes a payment, customer moves scooter, enters a new zone, customer parks. But then you also have a separate stream,
Starting point is 00:45:34 which is where the customer is actually a scooter. And the scooter gets ridden, a scooter ends the ride, a scooter goes into maintenance, a scooter gets repaired, a scooter gets purchased, a scooter gets launched. All sorts of things happen to a scooter. A scooter gets presented to a customer. So it turns out that everything in a company, I'll say 99% of things, can be represented as some sort of global entity that you're trying to understand how it's changing and its actions. And whether the actions are done to it, done
Starting point is 00:46:05 because of it, done by it, it's independent. It's just that this action has happened in time to this core object. It's really representative of how we speak. It's like there's a noun, a verb, and you're just talking about these actions that are happening. So what we see is that most companies have one stream, but some companies like us, narrator, we have two streams. We have a company stream and a person stream. And we use the person stream to understand people behavior. But we use the company stream to understand like our financial reporting and our like onboarding and a company's onboarding and a company adds a user and a company does these behavior that we care about it from the company's perspective. Company
Starting point is 00:46:43 pays an invoice. So you can create more than one stream and narrator makes those multiple streams really easy to switch between. But yeah, so the thing that I'll say is that everything in the business can be represented as some sort of the entity that you're trying to see how it's changing and change. And narrator has really done, by implementing this really strict data model, has allowed us to really focus on how do we understand change? And whether we generate the current state of a business from doing the
Starting point is 00:47:11 last ever of a chain change, it's really, that's our really secret sauce. And we help people really think in a way of change instead of thinking of, about static things and thinking about things like first signed up and first attribution model to thinking more about if you're looking for a customer and the first time they visited a website and the answers from that.
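The two-stream idea from the ride-sharing example can be sketched as follows. This is a hypothetical illustration, not Narrator's schema: the `rides` records, the `build_streams` and `count_by_entity` helpers, and the activity names are all assumptions. The point is only that the same real-world event can be written once per stream, keyed by whichever entity that stream is about.

```python
from collections import defaultdict

# Hypothetical source records for rides (one row per ride).
rides = [
    {"ts": "2022-05-01T09:00", "customer_id": "cust-1", "scooter_id": "sc-7"},
    {"ts": "2022-05-01T10:30", "customer_id": "cust-2", "scooter_id": "sc-7"},
]


def build_streams(rides):
    """Emit each ride into a customer stream and a scooter stream."""
    streams = {"customer": [], "scooter": []}
    for r in rides:
        # Same event, two perspectives: the customer did something,
        # and something happened to the scooter.
        streams["customer"].append((r["ts"], r["customer_id"], "started_ride"))
        streams["scooter"].append((r["ts"], r["scooter_id"], "got_ridden"))
    return streams


def count_by_entity(stream):
    """Count activities per entity within one stream."""
    counts = defaultdict(int)
    for _ts, entity, _activity in stream:
        counts[entity] += 1
    return dict(counts)


streams = build_streams(rides)
rides_per_customer = count_by_entity(streams["customer"])
rides_per_scooter = count_by_entity(streams["scooter"])
print(rides_per_customer, rides_per_scooter)
```

Each stream keeps the same three-part shape (time, entity, activity), so a question like "how often is this scooter ridden" and a question like "how often does this customer ride" use the same logic against different streams.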
Starting point is 00:47:32 So that's what we've really mastered is that change. And we still look for ways to things that don't get represented by change. I'm going to ask a question because I think Narrator is a really interesting example. In fact, we had an interesting conversation recently, Kostas, about roles in the data space, right?
Starting point is 00:47:51 Data engineer, analytics engineer, analyst, data scientist, even. So Ahmed, narrator sort of exists between different spaces in many ways, right? Like in the data world. So who's the user? And in some ways, like, maybe the question in this is leading a little bit, but is there a new sort of user that narrator imagines? You know, or are there a set of users like,
Starting point is 00:48:21 you know, who actually is interacting with it in an organization? Yeah. So I'm about to give you a very controversial opinion. Love it. We love a hot take. I like to think that everybody got into data to answer questions and make an impact by using data.
Starting point is 00:48:40 That job used to be called a data analyst. Data analysts were people who used to take questions and ask good questions to derive answer. And whether you're a product person operating as a data analyst or you're a data engineer answering a question and you're operating as a data analyst, I think that the tool that we built is for people who want to answer a question and those people are data analysts.
Starting point is 00:49:01 What we've seen in companies really interestingly happen is that it turns out that job of a data analyst kind of disappeared. And now we have like seven roles that do part of the data analyst job. So we have the analytics engineer. We have the data engineer. We have the data scientist. We have the BI engineer. We have the insights engineer or the insights analyst.
Starting point is 00:49:24 And one of them is doing dashboards, one of them is building PowerPoints, one of them is building tables, all trying to answer a question. And what we've thought about is that what if we got rid of all of them and forced everyone to be a data analyst and you just enable data analysts to, once the data is in your warehouse, like there's data engineering splits into two parts, getting your data into your warehouse and pipelining and capturing data. And then there's the data engineering to structure data.
Starting point is 00:49:49 Forgetting the first part, let's get rid of the second part. Let's get rid of the analytics engineer. Let's get rid of the data scientist. Let's get rid of the BI engineer. Let's just kind of make everyone, because everyone at the end of the day is trying to answer questions. And you may or may not be able to enable that data analyst with very limited SQL knowledge, but really the ability to ask good questions
Starting point is 00:50:08 to do that work end-to-end. Create the dashboard, create the analysis, create the story, represent the data, the way to answer that question, and do that all in under 10 minutes. And I think the future of the world is going to be where everyone becomes a data analyst.
Starting point is 00:50:27 I think that's the value that drives business value. Those are the people who are helping make decisions. That's really what everyone really wants is to answer questions. And I think the more we stop focusing on the means to an end, we start focusing on the end, the more that these data analysts are going to be the ones who are going to just kind of take over every company. And I think when you think about it, every company's ability to answer questions is their competitive advantage. And I bet the more that these people who are data engineers, who are trained into asking and answering questions, and like a data scientist who has great skillset, instead of working
Starting point is 00:50:58 on like preparing the data, instead are actually answering questions, you'll find a lot more insights, you'll find them a lot faster, and your business will grow. And I think that's the future, is the world just becomes all about data analysts. And Narrator becomes, just like Salesforce is the tool for salespeople, Narrator becomes a tool for data analysts to answer any question. Love it. That's a super powerful vision. Well, Ahmed, we're here at the buzzer. Thank you so much for giving us some of your time. I learned so much about the way that you're approaching sort of drastic simplification, at least for analysts with a single table.
Starting point is 00:51:32 And we'd love to chat with you again soon to hear how things are going. Yeah, I love it and excited to be here. Thank you. If anyone's interested, just follow me on LinkedIn or Twitter and you'll see everything I do. Well, Kostas, my first takeaway is I was thinking about the intro we recorded, and I said the word
Starting point is 00:51:49 sucks like 20 times. And I realized that my kids are young, so if they came home from school and said sucks, I would probably say like, hey, you're not allowed to say that. You're too young. But in the world of sort of publicly accessible content, my son could play this episode back to me and say, well, you said so. So that's my main takeaway. So that cat's out of the bag.
Starting point is 00:52:18 No, I actually, I think one of the interesting things was the controversial take on the role of the data analyst and sort of the connected roles of data engineering. Ahmed basically said, when you're collecting data or sort of managing pipelines that do ingestion, that data engineering role will stay. But the data engineering around the transformation layers we talked about on the show, he thinks should go away. And in fact, he thinks that, you know, sort of anyone who has questions around data will become a data analyst. Certainly a really interesting take. I will say, you know, I don't know if I wholesale agree with it, but here's what I do agree with. The mindset that the tedious nature of manual labor as it relates to preparing data for simple things should go away. It is a good thing for technology to abstract those things away from a human having to go through a laborious multi-thousand line, you know, coding exercise
Starting point is 00:53:25 to do things that aren't actually that difficult. And so, you know, narrator is certainly a very opinionated way of doing that by turning everything into an activity. But I agree with the vision that, you know, the laborious nature of some of the preparation work, it should go away, right? That's not a great use of really smart people's time. Yeah, I agree. I mean, there's obviously like a lot of space there for improvement when it comes like to the ergonomics of like working with data, what I will, what I will keep like from this conversation that we had with Ahmed is like how hard it is to change the way that like people think and they
Starting point is 00:54:07 have learned like great in their work. Right? Like it is amazing. I mean, if you just like take a step back and listen to what Ahmed was saying, like, like, like, it's not like something really complicated. He says you just have like to think in terms of actions, right? I mean, okay. It's not like something great.
Starting point is 00:54:28 Yeah. Of course you have like a user and the user does something, right? Like it might be a sign up. Like, so the user is signing up or like the user is like signing both or like all that stuff. But even like this, let's say, simple change in the way that we think, it's like very, very hard to implement. And changing that for a whole industry, it's obviously a very, very big and hard task.
Starting point is 00:54:55 Yeah. It's very interesting. It says a lot of how change happens and how incremental or not incremental it is at the end. So that's what I keep from the conversation. And I'm really curious to see what the future will be for an opinionated solution like this one, like Narrator, that has to do with how people think, right? Sure. It's kind of like a change there.
Starting point is 00:55:21 So that's what I keep. And I want to see how things will change with how people like interact and use the product. Yeah, I agree. I think they'll do well. Whatever the final solution looks like, people who are thinking like Ahmed are certainly the ones who are going to invent the next iteration,
Starting point is 00:55:37 you know, of sort of the way that we interact with data sort of on the layer on top of the raw data. All right. Well, thank you so much for joining us. Tell a friend if you haven't told a friend about The Data Stack Show and you enjoy it, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.
Starting point is 00:56:13 The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
