The Data Stack Show - 90: The Modern Data Stack Has a Join Problem with Ahmed Elsamadisi of Narrator AI
Episode Date: June 8, 2022
Highlights from this week’s conversation include:
Ahmed’s background and career journey (2:27)
Why the modern data stack “sucks” (4:53)
The limitations of progress (9:13)
Showing data with only 11... columns (11:55)
Managing one table that rules them all (19:02)
Viewing the world as timestamped activities (32:40)
When this model becomes harder to use (35:15)
The two parts you need in a company (44:41)
Those who use Narrator (48:32)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. Kostas,
today we're talking with Ahmed from Narrator. And I am so excited about this conversation because
maybe this is the first guest we've had who has made the bold assertion that the modern data stack,
or at least a subset of it, because I want to be fair here, sucks. And we talk so much about the ways that people are combining these tools to sort of build
architecture.
And it was problematic for Ahmed, and he's building a company to try to solve that.
So my burning question, not to, you know, steal the thunder from you, but why does the
modern data stack suck?
I think that's going to be a great conversation.
Yeah, absolutely.
Absolutely. I mean, it's always nice to see like people that have a more radical view of like the
things that are happening out there.
And I think this is something that we need because it helps us rethink and
not just take, you know, whatever we call the best practice and just continue
with that.
Like we should always be challenging our methods and the things that we are
doing, the products that we are building.
And I think that Ahmed is doing an amazing job in that.
So yeah, I'm really looking forward to seeing why the modern data stack sucks in a way,
but also see what's like the
alternative that he's building there, which might not be an alternative at the
end, right?
Like it might be something that works like pretty well with the modern data
stack anyway, but understand like a little bit more in depth, like what the
solution is there and what the problem is that they are solving.
All right.
Well, let's dig in and figure it out.
Ahmed, welcome to the Data Stack Show.
We are so excited to chat with you.
I'm so excited to be here and dive into all these details.
All right.
Okay.
So give us your background.
You have built various iterations of the modern data stack many, many times over, but give
us the timeline.
So how did you get into data and what are you doing today?
Yeah.
So I started my career actually in robotics.
So I was really interested in how humans and robots interact
together to make decisions.
Self-driving cars, kind of bigger projects, and human-robot
interaction. I eventually made my way to AI for missile defense
for the US government.
So understanding kind of how missiles go through space and how to knock them down.
Kind of got burnt out from that intensity, switched to WeWork and built
WeWork's data team and data infrastructure that you see today.
So that's when I implemented the data stack many, many times.
I decided that there was a fundamental problem, went on like a tour of all these
big companies to be like, how do you solve this problem?
And pretty much realized that there are fundamental
problems in data that these different approaches haven't solved.
So I decided to really rethink data.
And that's where I ended up founding Narrator, a single table approach to answer any question
in data.
And it's an 11-column table that you can use to answer any question.
And it makes asking and answering questions with data really easy. And the really special thing about Narrator is
that that single table is a standard. So whether you're like airline companies, media companies,
e-commerce, sales, crypto, banks, which are all different companies we have in those sectors,
you can use the same exact 11 columns to answer any of your questions, allowing us to share and reuse analyses. So really bringing that data world together,
enabling that data analyst to really make the best decisions.
Love it. Okay. I know that our audience's ears are burning just like Kostas and I's are
because we want to know what those 11 columns are, but I will make this a substantial show.
I want to start out, and you mentioned this when
we were talking right before the show as well, that you've built sort of different iterations
of the Modern Data Stack nine or 10 times. And when we were prepping for the show, you were like,
it sucked. And that's such an interesting thing to hear because in the industry in general,
and of course on the show, one thing
we hear a lot is like, well, you've got to move towards the modern data stack, right? Or these
are the components of the modern data stack, or this is sort of the right architecture, etc.
And I want to know why, I'd love for you to be as specific as possible. Why did you come to the
conclusion like, this sucks? Because, you know, most of the industry is trying to push, you know, push everyone towards
modern data stack.
Yeah.
And I think that everyone who has implemented the stack more than once will tell you that
it seems like the only way and it's a necessary evil.
So at a high level core, you have your data everywhere and you dump it into
a warehouse. We call this EL. And there's a lot of tools that have automated this process. It's
kind of been solved. Then you have your warehouse and there's a lot of different warehouses you can
use in different flavors, different benefits, and that's been solved. Then you have your middle
layer. We call this a transformation layer where you actually use data and write SQL to represent
the questions that you need to answer.
That table gets materialized and put into your BI tool. That BI tool then allows you to build
dashboards and visualize it. And anyone who's ever done this will tell you what happens. So what
happens is that you build a dashboard, then there's a follow-up question or your team is like, yeah,
but I want to understand, slice the number of emails by how many people are repeat purchasers. And they go, cool, let's go back to the data team. Let's build a new transformation.
Let's build a new materialized view and let's build a new dashboard. And as time goes by,
the number of transformations you have in the middle continues to grow. The amount of data that's
similar across multiple transformations continues to grow. It actually gets so messy that you often have 700, 800 transformations,
each answering a series of questions.
Then you end up dealing with, hey, how come these two dashboards don't match?
How come these numbers don't match?
How come my warehouse is this slow?
How come everything is so expensive?
And because of this entire cycle of constantly needing to go back to build
these new transformations, the time to answer a question goes into weeks and months.
Every new question goes into these complex thousand line SQL queries so you can answer
it.
What we've done is we've actually built different ways to manage this middle layer, but we haven't
solved it.
So back then, like 40 years ago, this was called Microsoft stored
procedures. And you would do that in like SQL Server. Then we added more ways to build a staging
layer. Then we had Luigi, which was Spotify's version of it. Then we added
Airflow. Then we have dbt. Now all this history has kind of built better ways to manage that
SQL query. But the fundamental is that you still need
a thousand line SQL query
to answer the series of questions.
And that doesn't go away.
Now, why does that happen?
Why do you need a thousand line complex SQL query?
And that itself, the underlying problem,
and that is because data is actually captured
in separate systems that don't relate.
And you need to figure out ways to stitch them together.
How do you tie an email to an order?
Because everybody wants to know email attribution to order.
Well, to tie it, you need to go from email to web page,
web page to parameters, parameters to click,
click to copying those parameters, and assume no duplication.
And that complexity of doing that simple join
across all these systems
ends up generating these really
complex queries. And I think that's where the modern data stack consistently fails.
And no matter whether you're Apple or Airbnb or Spotify, everyone will tell you that they have an
entire team of people doing it. It has now become a job that people call the analytics engineer,
whose entire job is to build these transformations so you can answer questions.
And every company will tell you how long does it take you to answer a follow-up question?
How long does it take you to answer
every new ad hoc question you get?
Is there an infinite backlog on the data team of questions?
And that's always the case.
And I think that's the problem we need to solve
is why the transformation layer
causes all these kind of roadblocks.
And that's the problem that I went out to really innovate in solving.
Yeah.
Can you dig in just a little bit more into...
You said that some of these modern tools, right?
So you have, you know, stored procedures and then, you know, Airflow and then now dbt.
What's the fundamental limitation there, right?
Like they're making it easier
to manage these transformations,
but are they not making it easier
to actually write them?
Is that like a fundamental underlying challenge
with like the structure of the data
and the disaggregation?
Or like dig into like, what are the, like there's progress that's been made, but what are the,
what are the limitations of that progress? Yeah. I don't think it's a problem with the
tooling. It's a problem of the approach. So the approach of building custom tables for every
question, to me,
is like building a car where every piece needs to be cast and molded custom.
You need more of a world of interchangeable parts where different pieces can fit together really easily.
So right now, the fundamental problem is that SQL today requires you to join based on a key.
And if that key doesn't exist, you put a person to hack at it with a bunch of complexity
to do it. That is the problem. Now, the tool you're using to manage SQL doesn't really
matter if your SQL doesn't solve this fundamental problem. And I think that is the core problem that
we realized is that it is actually a join problem because joins depend on foreign keys and foreign
keys don't exist.
So to solve it, you actually need to reinvent how you join and how you structure data, not how you manage transformations.
For managing transformations, there's dbt. Love the tool.
I love Tristan as well.
This is one of the best tools to manage transformations in this traditional, what we call the traditional way of doing data,
which is known as the modern data stack.
But that way itself is fundamentally flawed.
You need a different way that allows you to work in the way that modern data actually really is flowing,
which is how do you ask and answer questions
and bridge all your systems quickly, easily, in seconds?
And that's the point.
And that's the thing that we have to really highlight
because a lot of questions that appear so complex, where you have to write so much SQL, appear so easy in Narrator that you
can answer them with a couple of clicks. And that's because we solved that underlying problem that
lies within SQL, which is joining data.
Okay. So let's dig in. How do you do that with only 11 columns?
Because it sounds, honestly, in many ways, it sounds too good to be true, right?
And I know that we want to talk about, you know, there's no decision that you make technically that doesn't have a trade-off.
And so I want to get there as well.
But if you think about even a moderately sized company that, say, has sort of maybe some
behavioral data in their warehouse.
They're, you know, loading a bunch of structured data, you know, say from marketing tools or CRMs or whatever, right?
You have a bunch of materialized views.
It doesn't, it's not that hard to have, you know, tens, hundreds, thousands of tables,
right?
Like you can get there really quickly, right?
And if you do get there too quickly,
everyone knows, you know, the pain that that creates.
So it sounds crazy that you like solve all that
with an 11 column table.
So tell us, how do you do that?
Yeah.
So first, I think we like to say one table
because of the kind of like shock factor.
It's like 95% one table.
There is like ways you can add additional tables,
but that's not the core.
So the core single table that we're going to discuss is known as an activity schema.
You can see it at activityschema.com.
It's an open source project that kind of discusses this one table approach.
And it is really just kind of taking the way that we speak about data and really bringing
it to the way you kind of structure it.
So it's just a time series table
where it's customer, time, action,
and you just abstract three features.
So it's feature one, feature two, feature three,
a couple of additional columns,
but that's kind of the core of it,
which is that's it.
So it's customer, time, action, and features.
And you're thinking, well, like,
wait, Ahmed, like,
how can I just put everything I need in three features?
Like I have so many features I need.
I need like a hundred features.
And that's where the tool of dataset comes in.
What Narrator provides is a way to pull in, what we call borrowing, features from different
activities.
So let's take a simple example: I want to know, for every email,
did that email lead to an order?
I want to know what the campaign of that email is.
I want to know, when that person came to our website
from that email, how many pages did they view, and I want to know what page they
landed on. That seems like we already are talking about 10, 15 features, right?
But if you break it down to like actions, you have open the email action, which has
one feature, which is the campaign.
You have the visited website action, which has path, which is also
one feature on that.
You have the viewed page action, which might have some features on it, but what matters is just
the fact that the customer viewed a page, and then the fact that they completed an order.
So now it's four activities.
And all I'm doing is really pulling the data
from each activity.
So if I want to know whether, in between those emails,
they had an order,
I can pull the time of the next
order.
I can count how many page views they had in between that
and say that's the number of page views.
I can grab the first page view
from the started session activity. You can pretty much think about it as taking this really long table and
doing a very clever, fancy pivot and pulling the columns I want from each of these activities.
And when you do that, what it turns out is that if you actually represent your business as this
really rich customer journey, you don't need that many features per action, but you do have a lot
of actions. And those actions are where all the nice rich information comes.
And because time and the counting and all that stuff is given to you by
Narrator out of the box, you don't need to add features like first visited
page, last visited page, number of visited pages, number of visited
pages in the last 30 seconds. All those can be recomputed on the spot when you're
answering the question that you need instantly.
Does that make sense?
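As a rough illustration of the "borrowing features" idea described above, here is a minimal Python sketch. It is not Narrator's implementation; the `Activity` class, the activity names, the sample rows, and the `email_dataset` function are all hypothetical. It only shows how one long customer/time/action table with a single feature per row can be pivoted into a per-email dataset.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Activity:
    """One row of a simplified activity stream (hypothetical layout)."""
    customer: str
    ts: datetime
    activity: str
    feature_1: Optional[str] = None


# Hypothetical sample data: campaign is the feature on emails, path on page views.
stream = [
    Activity("c1", datetime(2022, 6, 1, 9, 0), "opened_email", "summer_sale"),
    Activity("c1", datetime(2022, 6, 1, 9, 5), "viewed_page", "/landing"),
    Activity("c1", datetime(2022, 6, 1, 9, 6), "viewed_page", "/pricing"),
    Activity("c1", datetime(2022, 6, 1, 9, 30), "completed_order", "order-1"),
    Activity("c1", datetime(2022, 6, 8, 9, 0), "opened_email", "fall_sale"),
    Activity("c2", datetime(2022, 6, 2, 10, 0), "opened_email", "summer_sale"),
    Activity("c2", datetime(2022, 6, 2, 10, 1), "viewed_page", "/landing"),
]


def email_dataset(stream):
    """Build one row per email, borrowing features from later activities."""
    ordered = sorted(stream, key=lambda a: (a.customer, a.ts))
    emails = [a for a in ordered if a.activity == "opened_email"]
    rows = []
    for email in emails:
        # The window closes at the same customer's next email, if there is one.
        later = [e.ts for e in emails
                 if e.customer == email.customer and e.ts > email.ts]
        window_end = min(later) if later else datetime.max
        in_window = [a for a in ordered
                     if a.customer == email.customer
                     and email.ts < a.ts < window_end]
        views = [a for a in in_window if a.activity == "viewed_page"]
        orders = [a for a in in_window if a.activity == "completed_order"]
        rows.append({
            "customer": email.customer,
            "campaign": email.feature_1,                           # from the email itself
            "first_page": views[0].feature_1 if views else None,   # first view in between
            "page_views": len(views),                              # count in between
            "ordered": bool(orders),                               # any order in between
        })
    return rows


dataset = email_dataset(stream)
for row in dataset:
    print(row)
```

Note that none of the derived columns (first page, page-view count, order flag) is stored on the activities; they are computed from timestamps when the question is asked.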
Yeah, it does.
So how do we populate this one table from the raw data that we have, right?
Yes.
I mean, obviously, let's say this is the data model that makes sense to have on your data warehouse,
like for analytical workloads.
Obviously, the data that is coming is not modeled for that, right?
So again, we are going to do the extraction and the loading of the data.
So after we have staged the data and we have loaded into the data warehouse,
how do we get to the point where we have, let's say,
a well-curated one table to rule them all?
Yeah, great question.
So, Narrator provides a very, very thin layer that's known as our transformation layer.
And this is not like a dbt transformation layer because you're really just mapping columns.
For example, my internal database has a users table,
and I want to have an activity like added user.
You're just mapping that table
to the 11 columns.
So you're saying
the timestamp is the created at
of that table,
and the action is added user.
Here's the features
that I care about.
And it's a very thin layer
to map it.
It's so thin that it averages
around 12 minutes to write.
I think most customers
that have experienced it
see
how easy it is to
take your data,
whether it's already event-based or relational or tickets. We have a library of all these
common transformations in our doc site, and you just kind of map it to that simple structure,
which is one per building block. So you define each activity, and then Narrator migrates that data,
does a bunch of caching, does a bunch of things to make that really nice and easy and fast to use,
and provides you with an interface to actually ask and answer these questions.
And the good thing about doing it with activities is that you only ever need to add a new building
block when you have a new concept to add, not when you have a new question. So often in tables
that you materialize in the modern data stack, every time there's a new way of relating data,
you build a new table. In Narrator, you don't do that. You just build what's called modern data stack, every time there's a new way of relating data, you build a new table.
In the area, you don't do that.
You just build what's called a dataset and that's done by a couple clicks.
Every time you have a new concept added to your company, then you add it.
So you're often doing these activity transformations within the first week,
and then you add one every other month.
It's like really rare that you're adding a bunch of new activities.
Instead, you're taking the building blocks that you've kind of built and
you're reassembling them to answer all sorts of questions.
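To make the "thin mapping layer" concrete, here is a small sketch using Python's built-in `sqlite3` module. The `users` table, its columns, and the output column names (`customer`, `ts`, `activity`, `feature_1`) are assumptions made for illustration; they are not Narrator's actual schema or SQL. The point is only that the transformation maps existing columns onto a fixed structure and does no joining.

```python
import sqlite3

# In-memory database with a hypothetical source table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER, email TEXT, created_at TEXT, plan TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "a@example.com", "2022-01-05 10:00:00", "free"),
        (2, "b@example.com", "2022-02-11 14:30:00", "pro"),
    ],
)

# The whole "transformation": map source columns to the activity structure.
ADDED_USER_SQL = """
SELECT
    email        AS customer,   -- who did it
    created_at   AS ts,         -- when it happened
    'added_user' AS activity,   -- what happened
    plan         AS feature_1   -- one feature worth keeping
FROM users
ORDER BY created_at
"""

added_user_rows = conn.execute(ADDED_USER_SQL).fetchall()
for row in added_user_rows:
    print(row)
```

A new question would not require touching this query; only a new concept, such as a new kind of action, would call for another mapping like it.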
And how is this table implemented?
Is this like a materialized view that gets populated inside the
data warehouse, or is it like a logical view?
Like what's the...
It's an actual table.
Yeah, it's an actual table.
It's a table that we can insert into, we update, we manipulate.
Narrator does a lot of additional things like identity stitching across all your systems, handling fraud users and anonymous sessions, and all that stuff.
So we're actually just constantly updating and mutating this one single table.
And we're sorting it and partitioning it based on your warehouse to optimize performance.
And we do a lot of stuff on that one table
to make it really performant and really nice and fast.
And then data set queries are all,
there's no free SQL in there.
So you're actually using the data set to answer any question
and all the queries those generate
are super optimized for speed and on that single table.
So in your warehouse, you'll have a schema or a data set,
depending on which warehouse you use, that's called narrator.
And in there, you'll see the activity schema
or the activity stream is often what it's called.
Okay.
Yeah.
All right.
So let's talk a little bit more about the management of this table, right?
I mean, obviously this table relies on the underlying data that is
getting loaded into the data warehouse, right?
How do you do things like, okay, let's say accidentally someone
like drops a table, right?
Like that is used like as a source.
What happens then?
Like, is this like changes, is there a removal, let's say that that's going to be reflected also with
deletions on this table? What's the logic behind working with data that might
cease to exist at some point? Or it might be figured out that it's the wrong data, right?
How does this work? Yeah, great question. So one of the
benefits of only having modeled activities is that
our average query length is 20 lines. They're really small queries. And
we're updating this thing incrementally.
So like every five, 10 minutes, we're inserting the new data into the activity stream.
Let's say we go to insert it and the query fails because the data is not there. We take that activity and that transformation, and we put it into what's called a maintenance state.
So anyone when using that data will get a flag.
Hey, this data isn't up to date.
Something went wrong.
You get notified.
You can go in and fix it and resync it.
And now the data gets resynced and the maintenance flag goes away.
We also provide out of the box anomaly detection.
So if that data ever stops producing rows, you can write
your own custom alerts on it.
So we've done a lot of stuff to make sure that as your data is migrating, it's correct.
We do a lot of duplication checks for IDs and stuff like that as well to ensure that the data
that you're inserting into your warehouse is always accurate.
And the benefit is again, because it's a single table, we can do a lot more checks very cheaply
and easily because we have guaranteed structure and guaranteed assumptions.
So the narrator is always incremental.
It's always time series.
All these things get a lot of benefits from it.
So that's what ended up happening a lot with this thing.
So people often find managing that single table is actually the easiest part, like
super cheap, and it's often on the raw data because it's so simple.
You're often just pointing a timestamp from your raw tables to a structure.
Like there's really few like complex queries that you're putting in activities.
All that stuff happens in data sets.
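The incremental update and maintenance-state behavior described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Narrator's actual mechanism: `sync_activity`, the `status` dictionary, the `activity_id` field, and the two source functions are hypothetical names invented for the example.

```python
def sync_activity(name, fetch_new_rows, stream, status):
    """Append unseen rows for one activity; flag the activity on failure."""
    try:
        new_rows = fetch_new_rows()
    except Exception as exc:
        # The source query failed (for example, a dropped table). Keep the
        # existing rows and flag the activity rather than hiding the problem.
        status[name] = f"maintenance: {exc}"
        return 0
    seen = {row["activity_id"] for row in stream}
    inserted = 0
    for row in new_rows:
        if row["activity_id"] not in seen:  # simple duplicate-ID check
            stream.append(row)
            seen.add(row["activity_id"])
            inserted += 1
    status[name] = "ok"
    return inserted


stream, status = [], {}


def broken_source():
    # Stands in for a query against a table that no longer exists.
    raise RuntimeError("source table users not found")


def working_source():
    # Stands in for the same query after the source has been fixed.
    return [
        {"activity_id": "u1", "activity": "added_user"},
        {"activity_id": "u1", "activity": "added_user"},  # duplicate row
        {"activity_id": "u2", "activity": "added_user"},
    ]


failed = sync_activity("added_user", broken_source, stream, status)
status_after_failure = status["added_user"]
recovered = sync_activity("added_user", working_source, stream, status)
print(status_after_failure, "->", status["added_user"], "rows:", len(stream))
```

Because every activity shares the same structure, the same check can run against all of them, which is the cheapness argument made in the conversation.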
Mm-hmm.
All right.
And so, okay, let's focus a little bit more on the modeling side of things now.
One of the things that I have experienced, when I'm talking with companies
or observing what a company is doing with their data, is how the semantics might change for the same thing.
Like, what is a user, for example?
Like a customer, how a customer is perceived by sales or how a customer is perceived by product or how a customer is perceived by marketing. Right. And just to give an example, like you go with sales and chat with them and you
start talking, hearing about like prospects and leads and opportunities and
contacts and you know, like all these things that we pretty much like we all
learn to live with because Salesforce became a thing and their schema became, let's say, the way of representing sales in the world.
Yeah.
So how do you deal with that?
Because from what I understand, a core concept of your modeling is that everything is around the concept of the user, like the customer, let's say, right? How do you differentiate that? And how do you make this accessible to people who use different terms and semantics for the same concepts?
Yeah.
So honestly, this is actually one of the best parts about narrator that you can actually, one thing that we see a lot when you're depending on dashboarding is that you have to force everyone to abide by one definition.
Total sales has to be total sales and total customers has to be total customers.
What you see a lot in Narrator is that you could have multiple identifiers
that get mapped to your global customer. And a customer could be, well, we have
companies that are ride sharing where the customer is a car. We have companies where the customer, like at
WeWork, is a building. You have different ways of defining a customer.
So what we see a lot is the idea of that entity having events.
So like you might have a created lead activity.
You might have a started opportunity.
You might have a closed opportunity.
You might have a signed contract, sent contract,
moved in, made a payment, like started subscription.
And the reason why that's so important is when you deal with that argument, and I've had this at WeWork a lot.
Well, when is a sale?
Is it when they sign?
Is it when they move in?
Is it when they pay their first invoice?
Is it when they start their subscription?
Well, when is the sale?
You don't have to actually fight that battle anymore.
Instead, what you do with Narrator is
you have this concept of dataset,
where you have the activities
and you can represent them differently.
And then when you go to create your KPI,
which is like your key performance indicator
that Narrator creates,
you can then choose very explicitly what that is.
And when the user sees the KPI,
they can always click into it
and see the underlying dataset and say,
oh, this uses the timestamp
of the first opportunity created.
And because opportunity created is an activity, it's just a lot easier to get that transparency.
So when you're modeling the data, you don't need to model based on how it's going to be
used.
You need to model based on what it is.
And then when it's being used for like a specific question, the user can actually choose very
specifically whether they want it from the sales perspective
or the invoice perspective.
And then there's also the global just company KPI,
which the company has decided
is the thing that they're going to track
and they're going to call that total sales.
And you can always click on it and say,
oh, they're using signed contract
as their definition for total sales.
And I think by kind of creating those three layers,
the company global KPI, which people are using to measure,
the dataset, which is answering specific questions,
and your building blocks, which represent real actions that the customer is taking,
it just kind of creates very little space for ambiguity.
Like a question that we don't get in Narrator often is,
but what does this actually mean?
It's like, oh, just click on the dataset and see exactly what that means.
And because the words are like created opportunity or made payment,
you can click onto that and see the exact SQL.
And that SQL is 20 lines, so you can easily understand it.
But it creates that separation so that the data team isn't fighting.
And if the company decides, actually, we're redefining total sales to look at it based on when the first invoice is made, they don't
even need to talk to the data team about that.
Like the data is already modeled.
You just choose that for your dataset, and that can be done
without involving the data team at all.
And everything will just cascade nicely because again, your building
blocks are what you're modeling, not the final results.
So you're representing the world as these activities.
Everything else happens in narrator.
And you can build data sets to combine them.
You can build KPIs.
And you can change those things without thinking about going back to data model ever.
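A tiny sketch of why redefining a KPI does not require remodeling: if activities describe what happened, a KPI can be little more than a pointer to one of them. The activity names, the sample rows, and the `kpis` dictionary below are hypothetical and chosen only to mirror the "when is a sale" argument above; this is not how Narrator stores KPIs.

```python
from datetime import date

# Hypothetical activity rows; the modeling says what happened, not what a "sale" is.
stream = [
    {"customer": "acme", "ts": date(2022, 3, 1), "activity": "signed_contract"},
    {"customer": "acme", "ts": date(2022, 4, 1), "activity": "paid_invoice"},
    {"customer": "acme", "ts": date(2022, 5, 1), "activity": "paid_invoice"},
    {"customer": "globex", "ts": date(2022, 3, 15), "activity": "signed_contract"},
]

# The KPI definition is just a named choice of activity.
kpis = {"total_sales": "signed_contract"}


def kpi_value(name, kpis, stream):
    """Count customers who have done the activity the KPI points at."""
    activity = kpis[name]
    return len({row["customer"] for row in stream if row["activity"] == activity})


sales_by_contract = kpi_value("total_sales", kpis, stream)

# The company changes its mind: a sale now means a paid invoice.
kpis["total_sales"] = "paid_invoice"
sales_by_invoice = kpi_value("total_sales", kpis, stream)

print(sales_by_contract, sales_by_invoice)
```

The activity rows are identical in both cases; only the definition that reads them changed.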
Can I ask us, I'd like to dig into that with a specific question.
And this is inherently biased because I actually got to use narrator, kick the tires on it,
which was really, really cool.
And so I'd love to know, because unfortunately, I didn't dig in with our analyst team and data engineering team, but I was sort of like a consumer of a question that we were trying to ask.
And in fact, I will tell you what the question is, because maybe that'll be helpful.
And then I have like a specific question about how something's happening under the hood. So the question
we were trying to answer, which again, sounds like an easy question, but like actually ends
up being difficult to answer is how much does consumption of a particular type of blog content,
you know, well, A, does that seem to influence an opportunity being created in a certain time period?
I have a couple of questions.
It's like, okay, whatever this thought leadership or engineering or whatever,
does increasing consumption, is that a leading indicator that there's increased likelihood or
whatever? Okay. this is my question.
And actually, it was very elegant the way this happened because the result in Narrator
actually had both a first touch and influenced view that were very easy to get, which is
really cool.
But here's my question under the hood.
What makes that, and correct me if I'm wrong here because you know, because I'm not an expert in SQL. But part of what makes that difficult in raw SQL is actually not necessarily like looking at page views and then sort of saying like, okay, was that user associated with an opportunity eventually, right? You actually may have multiple users who have entered the funnel, but are related to the
same account, which is also related to the opportunity.
But in Salesforce, of course, with their data structure, not everyone is.
And so when you talk about something like influence, as opposed to something very linear,
like first touch, user did A, did B happen at some specified time period, right? Now you're talking about
a group of users who are associated with a different object or different table in the
warehouse. What you want to know about is the opportunity, which is a different table in the
warehouse. And so there's a ton of key crossing across those tables, right? To do something
here. And this is actually also all assuming that your behavioral data
like has
a layer of identity stitching as well
where you have like unique IDs for like
the anonymous behavior because that can also
happen pre-identified, blah, blah, blah. Anyways,
you get it. I won't keep going.
Awesome. So first
of all, that's a great question. Like
it bridges multiple systems.
It shows you're asking an analysis
and you probably have seen our narratives,
which is one of the benefits of standardizing data.
It allows us to build and reuse analyses,
which is our intelligence to generate these beautiful stories
that help you understand your data automatically for you in seconds
that actually provide real answers.
And that question that you asked
has a lot of nice complexities to it, right?
Like multiple systems, multiple tables you're talking about.
How do you think about bringing that together?
And all sorts of different pieces that makes that really, really complicated.
And if you talk to your data team that set it up,
they'll probably tell you that they set up those activities in one 45-minute session.
Because that's what our proof of concept usually is: one 45-minute session.
So they set that up, get that answer, gave it to you, allowed you to self-serve it.
The entire setup was 45 minutes.
So what did they do?
So two pieces here that are really critical.
One is that in Narrator, because we built an entire company based on a single table,
we've got really good identity stitching.
So we have a very, very proper way of stitching that data.
Two, all that stuff that you're talking
about, of this thing happening and the
first time they ever do it,
everything is changing in time. If you notice,
Narrator does everything as a function of time.
So
what that probably looked like,
I don't know the exact setup, but it probably looked
like something like viewed content
was an activity and it had an anonymous ID of whatever that cookie was of that user who viewed the content.
And you had probably like a contact or an account ID that was like your global identifier,
which is how you thought about your business, which is that account creates the opportunity,
that account creates a lead, and all sorts of pieces. Right. So you have like an account identifier
and that applies to both the users and the opportunity,
and you pass that through on the activities.
That's how that's happening.
So that's your customer.
And then Narrator allows you to create tiny little snippets
that match data together.
So you probably have one more snippet that's like,
hey, we know that this cookie is now this account ID.
There's a lot of explanation of how that works.
And then that's it.
So they build those three transformations
and they're able to stitch that together, combine it.
And then when you're asking that question,
and if you use our tool,
you can right-click on any piece of data,
see the exact customers,
right-click and see that customer's entire journey.
So you can see that customer viewed a page,
viewed a blog, viewed blog, viewed blog,
created opportunity, viewed blog, viewed blog,
viewed blog.
And they're able to understand the difference between that.
And we talked about the differences between knowing how many there were.
That's a simple, give me the count of them.
Knowing the rate, give me the count divided by the time from the first one.
Giving me the first content they viewed versus the last content they viewed.
All those things, we're using words like first, count, last, but we're still talking about actions that the customer is taking.
And that's kind of the beauty is that the way you ask the question, you kind of, to
express questions, you kind of convert them into these action-based questions because
you're saying, how did the customer, you already combined the fact that it has to be
the same person because you're not asking how does something affect something else
and nothing is tying it together.
You often tie it together by a person.
And you talk about these two building blocks,
viewing content and creating an opportunity.
And you're looking at a conversion rate
and you're trying to optimize that.
So in the way that you've asked the question,
you've already done 80% of the hard part of preparing data.
And all they did was take
that same structure
of how you're imagining the data happened,
customer views the blog,
they create an opportunity.
And we enable you to create that structure.
And then we quickly enable you
to actually structure that data
using the way that you asked it.
So that's what makes that experience
so seamless and look kind of like magical
because you've done three things
in your head for us already.
And we just kind of represented the way you think about it.
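As a rough illustration of the setup Ahmed describes above (a viewed-content activity carrying a cookie, an opportunity activity carrying an account ID, and a small mapping that ties cookies to accounts), the Python sketch below stitches identities and then computes the count, first, and last content viewed before the opportunity. This is not Narrator's implementation; the `Activity` fields, the `stitch` and `views_before_first_opportunity` helpers, and the sample data are all assumptions invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Activity:
    ts: datetime
    activity: str                  # e.g. "viewed_content", "created_opportunity"
    anonymous_id: Optional[str]    # cookie, when the row has one
    account_id: Optional[str]      # global identifier, when the row has one
    feature: Optional[str] = None  # e.g. which page was viewed


def stitch(activities, cookie_to_account):
    """Fill in account_id from the cookie mapping and return rows sorted by time."""
    stitched = []
    for a in activities:
        account = a.account_id or cookie_to_account.get(a.anonymous_id)
        stitched.append(Activity(a.ts, a.activity, a.anonymous_id, account, a.feature))
    return sorted(stitched, key=lambda a: a.ts)


def views_before_first_opportunity(stream, account_id):
    """Summarize content views that happened before the account's first opportunity."""
    rows = [a for a in stream if a.account_id == account_id]
    opp = next((a for a in rows if a.activity == "created_opportunity"), None)
    views = [
        a for a in rows
        if a.activity == "viewed_content" and (opp is None or a.ts < opp.ts)
    ]
    return {
        "count": len(views),
        "first": views[0].feature if views else None,
        "last": views[-1].feature if views else None,
        "converted": opp is not None,
    }


# Hypothetical data: two cookies that both belong to the same account.
raw = [
    Activity(datetime(2022, 1, 1), "viewed_content", "cookie-1", None, "/blog/a"),
    Activity(datetime(2022, 1, 3), "viewed_content", "cookie-2", None, "/blog/b"),
    Activity(datetime(2022, 1, 5), "created_opportunity", None, "acct-9"),
    Activity(datetime(2022, 1, 7), "viewed_content", "cookie-1", None, "/blog/c"),
]
cookie_to_account = {"cookie-1": "acct-9", "cookie-2": "acct-9"}

stream = stitch(raw, cookie_to_account)
summary = views_before_first_opportunity(stream, "acct-9")
```

Because both cookies resolve to the same account, views from two different users count toward the one opportunity, which is the "influence" case raised in the question.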
Yeah. Super interesting. Super helpful. Okay. And I can verify it was really cool to see that
happen. Okay. Ahmed, I do want to play devil's advocate and I'm actually going to ask Kostas
a question here because this is beyond my technical depth. But when you talk about
activities as sort of the way that you view the entire world, you're talking about essentially converting every type of data into event data.
And Kostas, I mean, there are a few things that come to my mind, but I would love to know, since that's a non-trivial sort of lens to put on all data, what comes to your mind as, you know, potential challenges,
benefits, whatever, when you view the entire world as sort of timestamped activities?
Yeah, that's a pretty interesting question.
Usually the problem that we have with that is that there are questions that
you can better answer when you, let's say, keep track of like everything that has happened,
right? Where having events there is like the way to do it. And there are questions that are like
much easier to answer where you just keep, let's say, or you have already replicated the current state of your concept
or entity or whatever you want to call it, right?
So usually the problems that you have with events is that, yeah, it really helps you
to measure change, for example, and stuff like that. But if you want to see at the end how like things look right now, you will
probably have like to go and like replicate the whole, let's say, journey,
like get the data there and go and replicate like the current state.
That's like from a very, let's say, it's a naive description that I'm giving, but it's like usually what like people have to deal with from an engineering perspective when you have to decide, am I going like to work with mutable states or like go and keep like events there and work with events. And usually like events give you like this extra expressivity,
but there's some kind of explosion in terms of like the amount of data
that you have to deal with or like what it means to go and replicate
the whole, like the state by iterating all the different events that you have.
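The trade-off described here, that with events you have to replay the journey to see how things look right now, can be sketched in a few lines of Python. The event tuples and the `replay` helper are hypothetical and only meant to show the idea of folding events into a current-state snapshot.

```python
from datetime import datetime

# Hypothetical change events: (timestamp, entity, field, new value).
events = [
    (datetime(2022, 1, 1), "acct-1", "plan", "free"),
    (datetime(2022, 2, 1), "acct-2", "plan", "free"),
    (datetime(2022, 3, 1), "acct-1", "plan", "pro"),
]


def replay(events):
    """Rebuild current state by applying every event in time order."""
    state = {}
    for ts, entity, field, value in sorted(events, key=lambda e: e[0]):
        state.setdefault(entity, {})[field] = value  # later events overwrite earlier ones
    return state


current = replay(events)
```

The cost Kostas points at is visible in the loop: the snapshot is only as cheap as the number of events that have to be iterated to produce it.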
Now, obviously there are like situations,
like there are things that you can do only if you have
events, right?
If you want to see, let's say, what is the journey of your customer, you need to have
all the events there.
Otherwise, how you're going to do that, right?
So having this kind of turning everything into an event makes sense in a way.
But the question is, and that's like a question that I have for Ahmed, is when
does having, let's say, this model become a problem?
What are, let's say, the questions that are not impossible to answer, but harder
to answer because you have like this
different way of like describing the world, right?
So great question.
So a couple of things to kind of highlight.
So one of the benefits is that we at Narrator put in this like
really intense, rigid structure, and it allowed us to kind of solve a lot
of the core problems using data sets.
So one thing that you can easily do in any activity
is say, give me the last ever updated subscription
or give me the last ever status of this company.
And when you can use words like last ever,
it makes it really easy to know what the current state is.
So we find a lot with our customers
is that if things are changing,
you can get like,
if you let's say you have a contract object
and that contract object is changing.
If you want to know the current contract,
you say, give me the last ever updated contract
and you get that contract object then.
However, sometimes when you're asking questions,
you're saying you want to know
what the contract was at the moment
when that person submitted a ticket.
Those questions are nearly impossible to do with non-event data.
But with Narrator, you can say, give me the last before.
Before you submit a ticket, give me the last before updated contract.
And now give me the state.
So you actually get the benefit that generating current state comes instantly with the last ever.
But you can also generate state at any given moment in time.
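To make the two relationships concrete, here is a minimal Python sketch of "last ever" and "last before" over a list of activity rows. The function names, the row layout, and the sample data are assumptions for illustration, not Narrator's API.

```python
from datetime import datetime


def last_ever(stream, customer, activity):
    """Most recent occurrence of an activity for a customer, or None if it never happened."""
    rows = [r for r in stream if r["customer"] == customer and r["activity"] == activity]
    return max(rows, key=lambda r: r["ts"], default=None)


def last_before(stream, customer, activity, before_ts):
    """Most recent occurrence strictly before a given moment, or None."""
    rows = [
        r for r in stream
        if r["customer"] == customer and r["activity"] == activity and r["ts"] < before_ts
    ]
    return max(rows, key=lambda r: r["ts"], default=None)


# Hypothetical stream: the contract changes once before and once after a ticket.
stream = [
    {"customer": "c1", "ts": datetime(2022, 1, 1), "activity": "updated_contract", "feature": "basic"},
    {"customer": "c1", "ts": datetime(2022, 2, 1), "activity": "submitted_ticket", "feature": None},
    {"customer": "c1", "ts": datetime(2022, 3, 1), "activity": "updated_contract", "feature": "premium"},
]

ticket = last_ever(stream, "c1", "submitted_ticket")
current_contract = last_ever(stream, "c1", "updated_contract")
contract_at_ticket = last_before(stream, "c1", "updated_contract", ticket["ts"])
```

Here `current_contract` answers the "how do things look right now" question, while `contract_at_ticket` answers the point-in-time question that a table holding only the latest state cannot.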
This was inspired by, if you're familiar with like the, it used to be a very big database paradigm
known as the Lambda architecture, where you have like a streaming layer and then you kind of do a
batch layer to process it. But one of the benefits of that approach is that it allows you to structure data
at any moment in time. And those change questions can be seen. The second thing you asked is like,
what about things that aren't changing? Like your customer's age maybe, or like their gender or some of these things
that have changed less often than you think.
Well, I said that narrator is mostly a single table.
We do have what's known as what we call it, like kind of like an attribute table,
which is on this customer, because everything's centered around the customer.
You can just kind of create, we have a materialized view.
That's like a dim customer, for example. You can add all the kind of static attributes of the customer that makes it really
easy. You often don't add stuff like when they first signed up, you don't add timestamps there.
If you actually do, Narrator will alert you saying you shouldn't do that. But usually it's like your
name, address, blah, blah, blah. You can put it there. If it's changing, you make an activity.
So like you might have an updated address activity and you want to know when we first acquired this customer, what was their
first updated address? Or give me the last one to know what their current updated address was.
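The split between static attributes and changing ones might look roughly like the Python sketch below: a small attribute table keyed by customer, plus an `updated_address` activity for anything that changes. The names `dim_customer`, `address_activities`, and `customer_profile` are hypothetical, and the data is invented.

```python
from datetime import date

# Static attributes live in a customer-keyed table (no timestamps here).
dim_customer = {"c1": {"name": "Ada Example"}}

# Anything that changes over time becomes an activity instead.
address_activities = [
    {"customer": "c1", "ts": date(2021, 5, 1), "activity": "updated_address", "feature": "12 Oak St"},
    {"customer": "c1", "ts": date(2022, 2, 1), "activity": "updated_address", "feature": "98 Elm Ave"},
]


def customer_profile(customer):
    """Combine static attributes with the first and latest address activity."""
    rows = sorted(
        (r for r in address_activities if r["customer"] == customer),
        key=lambda r: r["ts"],
    )
    profile = dict(dim_customer.get(customer, {}))
    profile["first_address"] = rows[0]["feature"] if rows else None
    profile["current_address"] = rows[-1]["feature"] if rows else None
    return profile


profile = customer_profile("c1")
```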
So that's kind of how we handle a lot of these cases and we handle them in product. So the thing
about this single table approach, and I'll tell you the honest truth, it has two huge, huge, huge downsides.
The first downside is that a single
table, querying it
is really hard.
Take all the SQL you've learned and kind of
throw it away because you can't
imagine, when I say last before,
like, Kostas, you've
done this before, but you can imagine that SQL query is
very non-trivial.
And very probably, if you write it without realizing, you might do it very inefficiently.
Like you'd think, oh, I can just use the last value window function.
Good luck.
What happens if it doesn't exist?
What happens if it duplicates?
All those things that can do.
So the querying of that is extremely difficult, which is the challenge of having a single
table.
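For a sense of why "last before" is awkward in raw SQL, the sketch below runs one possible formulation against an in-memory SQLite database from Python. It uses a correlated subquery rather than a window function, which is a choice made for this example and not necessarily how Narrator generates its queries. The table name `activity_stream`, its columns, and the rows are assumptions. The data deliberately includes the two edge cases mentioned: a customer with no prior contract, and two contract updates sharing a timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity_stream ("
    "activity_id INTEGER PRIMARY KEY, customer TEXT, ts TEXT, activity TEXT, feature TEXT)"
)
conn.executemany(
    "INSERT INTO activity_stream VALUES (?, ?, ?, ?, ?)",
    [
        (1, "c1", "2022-01-01", "updated_contract", "basic"),
        (2, "c1", "2022-01-01", "updated_contract", "basic-v2"),  # duplicate timestamp
        (3, "c1", "2022-02-01", "submitted_ticket", None),
        (4, "c1", "2022-03-01", "updated_contract", "premium"),
        (5, "c2", "2022-02-15", "submitted_ticket", None),        # no contract at all
    ],
)

# For each ticket, look up the latest contract update strictly before it.
# Ties on ts are broken by activity_id so the result is deterministic;
# a missing contract simply yields NULL instead of dropping the ticket row.
LAST_BEFORE_SQL = """
SELECT t.customer,
       t.ts AS ticket_ts,
       (SELECT c.feature
          FROM activity_stream AS c
         WHERE c.customer = t.customer
           AND c.activity = 'updated_contract'
           AND c.ts < t.ts
         ORDER BY c.ts DESC, c.activity_id DESC
         LIMIT 1) AS contract_at_ticket
  FROM activity_stream AS t
 WHERE t.activity = 'submitted_ticket'
 ORDER BY t.customer, t.ts
"""

result = conn.execute(LAST_BEFORE_SQL).fetchall()
```

Even in this small form the query has to decide explicitly what happens on ties and on missing rows, and a correlated subquery per ticket can be slow on a large table, which is the inefficiency risk raised above.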
And the second thing is, if you notice I'm doing something with every question you're
asking me, where I'm doing this thing that kind of makes Narrator work,
where I'm actually translating your question to be a little bit more defined in this activity
way.
There's a mental thing that I've experienced and I've mastered, but a lot of our customers
take a couple of weeks to learn, is this new way of thinking
about how to think.
Because in SQL,
you can imagine stacking the data
and joining and how it works.
But this new approach,
you have to relearn the mental model
of how to combine data.
You need these like
temporal relationships,
as we call them.
I think we actually find that
customers who come from
like a deep SQL background have a harder time
learning our relationships than people who come from like a marketing or product mindset,
because they're used to thinking about things from a customer perspective, while SQL is often
thought about from a table perspective. So that mental model learning is a big overhead,
and then that table is really hard to query by hand.
So what we decided to do was build a company around it.
Like the reason why activity, this single table isn't just an open source project is
we found that like we open sourced it and people tried to use it and they were like,
hey, this sucks.
And I'm like, yeah, you're right.
Like querying this thing takes you forever.
So we spent years building and iterating over an experience
so you can actually generate any table
using this tool called Dataset,
which Eric got to see,
which is a really just seamless way
of combining data.
And it makes it look very seamless and nice.
So we solved that problem with product
and we solved the second problem
with just iterating.
So we often give customers examples.
We do a lot of documentation.
We do a lot of blogging.
We do a lot of automatic analysis.
We generate,
we have a series of templates
that helps you see
how to ask and answer a question.
We have an entire library
depending on your industry
that gives you
a bunch of different
types of questions
that you can ask
and shows you how to map it
to an area of this world
and how to answer it
using your own data
in a couple of,
in under 10 minutes each. So something like that is like a huge educational overhaul.
But one of the things that you said that I did find beautiful is that you talked about this
language that Salesforce created. I studied Salesforce for a while because I think they're
one of the most interesting companies. Because prior to Salesforce, everyone had their own
definitions of structuring sales data. Every
company had their own sales data models. It was like, nowadays we're like, of course,
every sales company can be represented with leads, opportunities, tasks, and contacts.
But that's really a Salesforce thing. They changed how we thought about data and they
standardized all of data for sales. And the thing that we like to think about as narrator
is that's exactly what we're doing for data. We're like, here's a standard data model.
And yes, it is very rigid.
But we've shown you that you can do so much with it and answer so many questions with it and do all these things.
And we've taken all the trade-offs and the downsides of using this data model and said that it is Narrator's job to solve that.
So making sure that's super easy to create for anyone, whether technical or not technical, using our tool like Dataset. Making sure you can see the value in instant
beautiful analysis. Making sure that the benefits of the assumptions can be shown by giving you
stuff like automatic anomaly detection, instant analyses that can answer any question, templates
to understanding CAC or LTV and all sorts of template analysis. We gave you so much so that you can see the value
in learning this mental model overhead
of thinking differently about data.
And that's the goal.
And that's kind of why I ended up saying like,
the modern data stack sucks, and all these approaches,
because they're just so different.
At every company you go into,
you have to learn a new way
of how they represented their data,
the thousands of tables they built.
With Narrator, I can switch between any of the companies that use us and instantly
answer any question because it's a standard way of thinking. It's a standard way of answering questions.
And we've shown that it's flexible enough to answer any question.
If you've come to any of my talks, I always say:
message me on Twitter, tweet me, LinkedIn,
email me directly,
and give me a question I can't answer.
If I can't answer it,
I'll post publicly that I can't answer it,
or else I'll show exactly how you can answer that question in Narrator.
And having done this for five years,
you often see that
almost all questions can be easily answered
in this structure.
You just got to think about it a little differently.
And that's the downside to it, is that thinking differently, but we do believe
the upside of the value of speed is just so incredible that it's a no brainer for
us.
Yeah, absolutely.
And that's where the opportunity is.
And that's why you're like building a company, right?
For that.
Okay.
I, I find interesting what you're saying about like the, like the comparison with, with Salesforce, because I think there are similarities, but there are also some big differences.
Salesforce went there and had one domain that they had to model, which was sales. Now, with Narrator, you do like, let's say, in a way like, not the opposite, but you're
saying, okay, I have one model that is abstract enough and expressive enough to go and cover,
let's say, all the different domains out there, right?
So your work in a way, it's exponentially harder than Salesforce, I would say, because
you have to deal with all these different domains and the people that are working there and
trying to help them think in a different way.
But obviously, also, if you manage to do that, the reward is going to be probably even bigger.
But I have a question that has to do with like,
at the end, like the expressivity,
because we keep talking all this time
and we are talking about like customers and users, right?
Like the center of like this data model
is around like the concept that you have a user there
who acts, so you have an entity and activities, right?
Is this all that we need in a company? Or there are also like other activities and other things
that are happening that, let's say, maybe the future, like, narrator will also address?
So great question. So there's two parts there. What we've done is not trying to abstract away
your business.
I think that if we were trying to build a model that represents every business, that's a very
hard thing.
What we've done is we've built a model that represents how we ask questions.
And what we've actually solved is behavioral and change.
We build a model that's really good at understanding change.
And what we've shown is that every question can be actually a function of understanding
change.
So when you think about a company, we think about a single table.
It's per core.
We talk about like, oh, we have a ride sharing company.
They have two streams.
What's called a customer stream and a scooter stream.
So their customer stream is everything customer opens an app, customer buys, customer rides,
starts the ride.
Customer submits a ticket, a customer makes a payment,
customer moves scooter, enters a new zone, customer parks.
But then you also have a separate stream,
which is where the customer is actually a scooter.
And the scooter gets ridden, a scooter ends the ride,
a scooter goes into maintenance, a scooter gets repaired,
a scooter gets purchased, a scooter gets launched.
All sorts of things happen to a scooter.
A scooter gets presented to a customer.
So it turns out that everything in a company, I'll say 99% of things, can be represented as some sort of global entity that you're trying to understand how it's changing and its actions.
And whether the actions are done to it, done
because of it, done by it, it's independent. It's just that this action has happened in time to this
core object. It's really representative of how we speak. It's like there's a noun, a verb,
and you're just talking about these actions that are happening. So what we see is that most
companies have one stream, but some companies like us,
narrator, we have two streams. We have a company stream and a person stream. And we use the person
stream to understand people behavior. But we use the company stream to understand like our
financial reporting and our like onboarding and a company's onboarding and a company adds a user
and a company does these behavior that we care about it from the company's perspective. Company
pays an invoice.
So you can create more than one stream and narrator makes those multiple streams really
easy to switch between.
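One way to picture the two-stream idea from the ride-sharing example is that a single source record, a ride, produces one row in the customer stream and one row in the scooter stream, each keyed by a different entity. The Python sketch below assumes a hypothetical `rides` source and hypothetical row fields; it only illustrates the shape, not how Narrator stores streams.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical source records from the ride-sharing example.
rides = [
    {"customer_id": "cust-1", "scooter_id": "scoot-7", "started_at": datetime(2022, 6, 1, 9, 0)},
    {"customer_id": "cust-2", "scooter_id": "scoot-7", "started_at": datetime(2022, 6, 1, 12, 0)},
    {"customer_id": "cust-1", "scooter_id": "scoot-3", "started_at": datetime(2022, 6, 2, 9, 0)},
]


def to_stream_rows(ride):
    """Emit the same ride once per stream, keyed by that stream's entity."""
    customer_row = {
        "entity": ride["customer_id"], "ts": ride["started_at"],
        "activity": "started_ride", "feature": ride["scooter_id"],
    }
    scooter_row = {
        "entity": ride["scooter_id"], "ts": ride["started_at"],
        "activity": "was_ridden", "feature": ride["customer_id"],
    }
    return customer_row, scooter_row


customer_stream, scooter_stream = [], []
for ride in rides:
    c_row, s_row = to_stream_rows(ride)
    customer_stream.append(c_row)
    scooter_stream.append(s_row)


def count_by_entity(stream, activity):
    """Count how often each entity in a stream performed (or received) an activity."""
    counts = defaultdict(int)
    for row in stream:
        if row["activity"] == activity:
            counts[row["entity"]] += 1
    return dict(counts)


rides_per_customer = count_by_entity(customer_stream, "started_ride")
rides_per_scooter = count_by_entity(scooter_stream, "was_ridden")
```

The same counting function works on both streams because both share one schema; only the meaning of `entity` changes.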
But yeah, so the thing that I'll say is that everything in the business can be represented
as some sort of entity whose changes you're trying to see.
And Narrator, by implementing this really strict data model, has allowed
us to really focus on how do we understand change?
And that includes generating the current state of a business by doing the
last ever of a change. That's really our secret sauce.
And we help people really think in a way of change instead of thinking
of, about static things and thinking about things like first signed up
and first attribution model
to thinking more about
if you're looking for a customer
and the first time they visited a website
and the answers from that.
So that's what we've really mastered
is that change.
And we still look for ways to
things that don't get represented by change.
I'm going to ask a question
because I think Narrator
is a really interesting example.
In fact, we had an interesting conversation recently, Kostas, about roles in the data space, right?
Data engineer, analytics engineer, analyst, data scientist, even.
So Ahmed, narrator sort of exists between different spaces in many ways, right?
Like in the data world.
So who's the user?
And in some ways, like,
maybe the question in this is leading a little bit,
but is there a new sort of user that narrator imagines?
You know, or are there a set of users like,
you know, who actually is interacting with it
in an organization?
Yeah.
So I'm about to give you a very controversial opinion.
Love it.
We love a hot take.
I like to think that everybody got into data to answer questions and make an impact by
using data.
That job used to be called a data analyst.
Data analysts were people who used to take questions
and ask good questions to derive answers.
And whether you're a product person operating as a data analyst
or you're a data engineer answering a question
and you're operating as a data analyst,
I think that the tool that we built is for people who want to answer a question
and those people are data analysts.
What we've seen in companies really interestingly happen
is that it turns out that job of a data analyst kind of disappeared.
And now we have like seven roles that do part of the data analyst job.
So we have the analytics engineer.
We have the data engineer.
We have the data scientist.
We have the BI engineer.
We have the insights engineer or the insights analyst.
And one of
them is doing dashboards, one of them is building PowerPoints, one of them is building tables,
all trying to answer a question.
And what we've thought about is that what if we got rid of all of them and forced everyone
to be a data analyst and you just enable data analysts to, once the data is in your
warehouse, like there's data engineering splits into two parts, getting your data into your
warehouse and pipelining and capturing data.
And then there's the data engineering to structure data.
Forgetting the first part, let's get rid of the second part.
Let's get rid of the analytics engineer.
Let's get rid of the data scientist.
Let's get rid of the BI engineer.
Let's just kind of make everyone, because everyone at the end of the day is trying to
answer questions.
And you may or may not be able to enable that data analyst with very limited SQL knowledge,
but really the ability to ask good questions
to do that work end-to-end.
Create the dashboard, create the analysis,
create the story, represent the data,
the way to answer that question,
and do that all in under 10 minutes.
And I think the future of the world is going to be
where everyone becomes a data analyst.
I think that's what drives business value.
Those are the people who are helping make decisions.
That's really what everyone really wants is to answer questions.
And I think the more we stop focusing on the means to an end, we start focusing on the end, the more that these data analysts are going to be the ones who are going to just kind of take over every company.
And I think when you think about it, every company's ability to answer questions is their
competitive advantage.
And I bet the more that these people who are data engineers, who are trained into asking
and answering questions, and like a data scientist who has great skillset, instead of working
on like preparing the data, instead are actually answering questions, you'll find a lot more
insights, you'll find them a lot faster, and your business will grow. And I think that's the future: the world just becomes all
about data analysts. And just like Salesforce is the tool for salespeople,
Narrator becomes a tool for data analysts to answer any question.
Love it. That's a super powerful vision. Well, Ahmed, we're here at the buzzer. Thank you so
much for giving us some of your time. I learned so much about the way that you're approaching
sort of drastic simplification,
at least for analysts with a single table.
And we'd love to chat with you again soon
to hear how things are going.
Yeah, I love it and excited to be here.
Thank you.
If anyone's interested,
just follow me on LinkedIn or Twitter
and you'll see everything I do.
Well, Kostas, my first takeaway is I was thinking about the intro we recorded, and I said the word
sucks like 20 times.
And I realized that my kids are young, so if they came home from school and said
sucks, I would probably say like, hey, you're not allowed to say that.
You're too young. But in the world of sort of publicly accessible content,
my son could play this episode back to me and say,
well, you said so.
So that's my main takeaway.
So that cat's out of the bag.
No, I actually, I think one of the interesting things
was the controversial take on the role of the data analyst and sort of the sort of connected roles of data engineering.
Ahmed basically said, when you're collecting data or sort of managing pipelines that do ingestion, that data engineering role will stay.
But the data engineering around the transformation layers we talked about on the show, he thinks, should go away. And in fact, he thinks that, you know, sort of anyone who has
questions around data will become a data analyst. Certainly a really interesting take. I will say,
you know, I don't know if I wholesale agree with it, but here's what I do agree with: the mindset that the tedious nature of manual labor as it relates to
preparing data for simple things should go away. It is a good thing for technology to abstract
those things away from a human having to go through a laborious multi-thousand line, you know, coding exercise
to do things that aren't actually that difficult. And so, you know, narrator is certainly a very
opinionated way of doing that by turning everything into an activity. But I agree with the vision that,
you know, the laborious nature of some of the preparation work, it should go away, right?
That's not a great use of really smart people's time.
Yeah, I agree. I mean, there's obviously like a lot of space there for improvement when it
comes like to the ergonomics of like working with data. What I
will keep like from this conversation that we had with Ahmed is like how hard
it is to change the way that like people think and how they
have learned to operate in their work.
Right?
Like it is amazing.
I mean, if you just like take a step back and listen to what Ahmed was saying,
like, like, like, it's not like something really complicated.
He says you just have like to think in terms of actions, right?
I mean, okay.
It's not like something great.
Yeah.
Of course you have like a user and the user does something right.
Like which it might be a sign up.
Like, so the user is signing up or like the user is like signing
both or like all that stuff.
But even like this, let's say, simple change in the way that we think,
it's like very, very hard to implement.
And changing that for a whole industry, it's obviously a very, very big and hard task.
Yeah.
It's very interesting.
It says a lot about how change happens and how incremental or not incremental it is at the end.
So that's what I keep from the conversation.
And I'm really curious to see what the future will be for an opinionated solution like this one, like Narrator, that has to do with how people think, right?
Sure.
It's kind of like a change there.
So that's what I keep.
And I want to see how things will change with how people like interact and use the product.
Yeah, I agree.
I think they'll do well.
Whatever the final solution looks like,
people who are thinking like Ahmed
are certainly the ones who are going to
invent the next iteration,
you know, of sort of the way that we interact with data
sort of on the layer on top of the raw data.
All right.
Well, thank you so much for joining us. Tell a friend if you haven't told a friend
about The Data Stack Show and you enjoy it, and we will catch you on the next one.
We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me,
Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.