The Data Stack Show - 78: The Etymology of Reverse ETL & Why It’s a Key Piece Of The Modern Data Stack with Boris Jabes of Census
Episode Date: March 9, 2022
Highlights from this week’s conversation include:
Boris’ background career journey (2:32)
The origins of “reverse ETL” (6:39)
Reverse Fivetran (16:35)
Product as an experience (22:41)
Fivetran users vs Census users (24:14)
How to add value to a data dump (26:56)
Ways companies are creating IP (33:48)
The cascade effect of the modern data stack (37:56)
Defining “data federation” (43:51)
Lessons from building a product (49:10)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, one platform for all your customer data pipelines.
Learn more at rudderstack.com.
And don't forget, we're hiring for all sorts of roles. You have the chance to meet Costas and I
live in person coming up soon in Austin, Texas. We're both going to be at Data Council Austin.
The event is the 23rd and 24th of March, but both of our companies are hosting a happy hour on the 22nd, the night
before the event.
So you can come out and have a drink with Costas and I.
Costas, why should our visitors join us live in Austin?
For tequila, of course.
That could make things very interesting.
I mean, yeah, it's a happy hour. People should come. It's before the main event, so without being tired from the event or anything, come over there and meet in person, something that we all miss because of all this mess with COVID. Have some fun, talk about what we are doing, and yeah, relax and have fun.
It's going to be a great time. Learn more at datastackshow.com. There'll be a banner there you can click on to register for the happy hour, and we will see you in Austin in March.
Welcome to the Data Stack Show. Today, we're going to talk with Boris from Census. It's categorized as a reverse ETL tool, and I have a sneaking suspicion that Costas is going to ask about the reverse ETL terminology.
But what I'm going to ask about is this: what's interesting about Census, taking data from the warehouse and pushing it out to other tools in the stack, is that it kind of assumes that there has to be some value created in the warehouse beyond just the raw data that was loaded there.
And so I want to know what Boris is seeing as far as how that impacts the way that he thinks about customers, the product that they're building, and the ways that companies are trying to do that, right?
I mean, dbt is obviously sort of a new way, but I'm really interested in that.
How about you, Costas?
Well, first of all, I have to figure out who came up with the term reverse ETL.
Yes, the etymology of tech terms is such a tasty subject.
Yeah, I mean, it's more of a marketing term probably, to be honest, but I also have the suspicion... I mean, you know, Census is probably the first company that was in this space, so it probably has to do with them.
Like, it's something that's related to them.
So I want to learn what's the story behind it.
And outside of this, I want to ask Boris and try to understand what's the difference between getting data, for example, from Marketo and pushing it into the data warehouse, and doing the inverse, which is to take from the data warehouse and push it back to Marketo. Where are the different challenges there? Why are they different? Why do we need different tools? And who is using them? Is the user the same? Why do we have different product categories at the end? That's what I want to understand. And I hope he's the right person to have this conversation.
Well, let's go find out.
Let's do it.
Boris, welcome to the Data Stack Show.
Hey, nice to be here.
All right.
Give us the brief background on where you came from and what you do today at Census.
Where I came from.
So, originally from Canada, if that's the real meaning of the question.
It's mainly a geographic question.
Yeah, it's a geographic question. I'm a Canadian who lives in San Francisco, through a variety of stops along the way. But my career started at Microsoft. I have always been a tool builder. So I started my career on what I consider kind of the ultimate tool, which is Visual Studio, which is the tool that tool builders use to make software. So it's a particularly interesting challenge to start your career in. And I spent quite a few years working on developer tools. And then about a decade ago, I started my first company, which was actually in the field of what you'd call identity management and single sign-on, for the people that kind of know these things.
And after I sold that company, it kind of led to the problem I started to solve in 2018 with Census, which was to get data from product and analytics teams out into the rest of the business. We were just frustrated by the lack of bridging between those two worlds.
And so that's how our company was born. And so today I'm the CEO of Census, and, you know, we're mostly based in San Francisco in the U.S., I think kind of a mix at this point of like 50-50 remote and San Francisco, and, you know, kind of humming along.
Yeah. One quick question on the way that you notice problems around data silos and other things: was that both in your company and with your customers, or was it primarily something you learned building the company yourself?
I guess I see it everywhere. So like, once you see, you can't unsee.
Yeah. Yeah. I think, you know, great startups, great founders tend to... They don't look at the world, let's call it, from the MBA perspective.
Like, ah, there's a market opportunity there.
They just want to build something, right?
Right.
And which I don't knock like identifying market opportunities.
But I find that you tend to get obsessive about trying to solve a problem,
either that you've experienced or that you see and you can't unsee.
In my case, it was both.
So when I see software as a service
and I see people using all the amazing apps, right?
Some of our customers have like 300 apps,
if you can imagine that in their organization.
I think that's wonderful, right?
That means that lots of people get to use the tools that they want.
People can be productive. There's, you know, people have best of breed user interfaces and all that stuff. But invariably, and maybe other people don't see it as immediately as I do, but I just can't not see it, is data is replicated ad hoc across all of these applications. And what is the data? Well, it's the same kinds of
things over and over again. And that feels wrong to me, right? And I feel like we need to help
solve that problem. And so there's all sorts of tools that have existed over the decades to try
to solve what people call data integration. It's not like a new concept. And the kind of unique
perspective we brought to it when we started the company in 2018 was there was this
treasure trove of data in data warehouses and product analytics teams and product teams
that everyone on the product and engineering side used. We all were very comfortable using
those things, whether that's from your operator console or from your Amplitude analytics tool,
whatever, right? Like we were all living and breathing it. And sales and marketing and success and support teams were not.
And so we built this bridge, right?
That went from the data warehouse out towards the business tools.
And in 2018, that was a weird and novel thing.
So people didn't even know what to kind of call this.
Yeah.
So how did we come up with the term reverse ETL? And who came up with this term?
Yeah. So when we first started, this was in approximately August, September 2018, is when we were building the first version of Census.
And we were talking to our first customers, and our first two or three customers, basically on their own, decided to describe our product as reverse Fivetran.
Okay.
If I were really specific.
Because they knew-
No way.
That's great.
And so they did that, right?
We were just kind of like... again, you'll meet a lot of first-time founders, and how do you describe your product is a classic conundrum. And people get too complicated or use buzzwords. So we were just like, it connects to your data warehouse. And we were keeping it
real simple and we weren't trying to complicate it with buzzwords. And then they were like,
so it's kind of like Fivetran in reverse. And we're like, yes, that works for you. That's a
great... Let's go with that. Now, of course, we didn't put that on our website. That would seem really weird.
But in colloquial speech, that's how people were reasoning about our software.
And obviously, you're not going to launch your company that way.
So in our first year back in 2018, 2019, we were just going around finding our first customers,
just getting them on the product and riffing on all
sorts of ways in which we could call this. And funny enough, around... I'm going to say June,
July, August of 2019, around there. I'm sorry. Yeah, 2019. One of our customers was actually
working in tandem with folks at Fishtown Analytics,
which is now dbt Labs. And for the folks who might not know, because now it feels like ancient history, the company that builds dbt was originally selling consulting services rather than selling the software. And so one of our customers was consulting with them. So they were paying for our software and they were developing a really cool data stack. And they were working with one of the folks there, almost one of their first consultants.
She became the community manager. Her name is Claire Carroll. And she started taking notes on what things she was seeing out there when she was working with customers. And so out came, sometime in the summer of 2019, this Notion doc that was, you know, linked off of the internet somewhere, right? Which has long disappeared, in which she was kind of taking notes. It was literally a page of just notes. And in it, there was this thing going like, and then there's this Census thing, like reverse ETL, which from her perspective, it's like, instead of branding it reverse Fivetran, it made sense to just say, oh, it's like reverse ETL.
So that's the first evidence of that word ever showing up in writing to my knowledge.
So the reason we weren't using that term at the time was I have the unfortunate problem of being like too knowledgeable or too nerdy or too mathematically obsessed or oriented,
which is
like the word is technically a misnomer since ETL has no direction. It feels weird. At least
Fivetran in reverse actually was a reasonable descriptor, right? But reverse ETL actually
seemed like a mathematically incorrect way of describing the thing, but at least it's a generic term. So, you know, we kind of bandied it around for a while for fun in 2019.
And then we launched the product and the company in 2020. And it just very quickly
became the de facto name for this, and far be it from me to argue with the public, right? It doesn't seem like a worthwhile way to spend my time.
So that's my personal recollection of the kind of birth of that word.
And then, you know, when we did our Series A announcement,
which was in February of 2021,
these last couple of years are all blending together.
Then the ecosystem landscape machinery kind of kicked into high gear. And, you know, in the same way that engineers like to think about data stacks, venture capitalists like to think in terms of landscapes. Everyone famously knows the marketing landscape, and now the data landscape is just as complicated. And so, you know, this is like the kind of output
they like to produce.
This is like a success for them.
It's like, I've managed to put every logo
I've ever heard of into a single chart
with squares around them.
So that's, I think, when reverse ETL really became a household concept: when it started showing up in those.
That is some high-quality lore. Like, even the detail of the Notion doc, it's perfect. So thank you for that bit of history. Okay. My follow-up question to that, Costas, about where the term came from is, okay, so I agree, mathematically it's not, you know, technically accurate. But I think even beyond that, my bigger question is, in some ways it's very singular, right? Like a line on the chart that, you know, whatever, us in the data industry create or an investor creates. But you're building tooling in this space. Do you think that's a sufficient term to describe at least what you envision that you're building, or the problem you're solving?
No. I mean, you're giving me really too much rope there to say whatever I want, but that's the point of the show. Before this call, right, Costas had described himself as a plumber since he had worked in pipelines for so long.
And I think there's great pride to be taken in building excellent data pipelines.
It's something that we pride ourselves on, and I'm sure you do as well.
And our customers do.
But it's not what I think the product is actually about.
It's not what excites our users, right?
When I think of great software, especially tools, I mean, there's software of all kinds,
right?
But when you think of great tools, you're basically trying to make someone else, right?
Your user, kind of a more awesome version of themselves, right?
That's just the best way to think about it.
And our users are not trying to become really good data pipeline people. That's not their goal. And when we started
the company, I was not thinking, you know what I'd love to do is just spend my life building great
data pipelines. That's not what the core animus was. It is absolutely an essential means to reach our end. But what I wanted to solve
and what I get to see with our users every day is I wanted to bridge the gap between what I called
analytics and product organizations and the go-to-market organizations. I was very frustrated
that that gap existed. And there are a lot of
tools out there that had taken stabs at this, right? Famously, there were tools like Segment
that connected the code that you wrote in your app directly into your marketing tools. This was a huge
step forward. But I kept seeing this problem that the data organizations that were emerging,
the BI organizations that were emerging, were disconnected from the rest of go-to-market, right?
Finance, support, sales, like just the whole world of the company.
And so just building that connection was important to me.
And you don't just have to build data pipelines to make that work, right?
You have to change the relationship between those teams and the data organization. And if you ask data teams all over the world, and you ask
them what their day-to-day life is like, they will tell you that they're really crumbling under
kind of load, like support load of getting data requests, having to solve like yet another
dashboard. They're very overworked like IT teams, right?
And what I felt they needed to move towards, and what I think Census's underlying goal should be
for them is not to make pipelines that run faster than the pipelines they could write.
That's a good to have, right? And I'm glad that our pipelines are superior to the ones you would
build yourself. But actually to turn your data organization into a,
we use this term a lot nowadays, right?
But we really meant it from the beginning,
which is like a kind of product or platform team,
because it's the only way to serve your whole company at scale.
Otherwise you're just the hated service org, right?
You're the IT team that no one really likes because everyone's always stuck
behind 32 requests.
And so that was a huge kind of part of what Census has always been about and
continues to be about. So see,
it's not really about the plumbing. It's about saying,
how do I turn the data team into the
most essential part of your whole company that everyone else depends on?
And so, you know, you may have caught
me saying this earlier, but I think of Census a lot more as a data federation tool rather than a
data pipeline tool. That's why it's called Census. Because my goal is to say, at a company, there
should only be one version of the truth. There should only be one census of your users, your
data, et cetera. And everything else in the company should be naturally kind of a cache on that data,
pulling from that information as seamlessly as possible.
And that's what Census does.
Boris, can you elaborate a little bit more on how this reverse Fivetran, or whatever we want to call it, right, is actually different? And what are the challenges that Fivetran does not have, right?
Right, right.
When it takes the data from the apps or the database and pushes it into the data...
Totally, totally.
Yeah, I mean, that's a great question.
And, you know, from the outside of almost any company, any software, any tool, right, people always think, how complicated can it be? It's reverse Fivetran, right?
So as soon as you distill things into like two words, it's like, then you somehow lose
all the underlying complexity.
So there's a couple really significant ways in which this is different and, you know,
difficult in its own right for people to build.
The first is when you're pulling data from SaaS applications into your warehouse, you're actually dealing with very consistent source data, right?
So if you go to all the various ELT tools, right, they'll show you the ERD for all these applications, right? And they're fairly stable. And what you're doing is you're saying, let me get Salesforce, let me pull the schema and dump it into the warehouse. And warehouses, to their credit,
are very easy places to say, here's a table, just dump it, right? I'm not trivializing the work of
building great pipelines there. But you're basically going from a kind of raw data structure
that is not changing super often, with read APIs off those products that are
generally the first API that any SaaS product will build, down into a data warehouse, which is of a
low N, right? There are only so many data warehouses, and they are fairly consistent at being able to write
a raw table in, right? And then all the little details, of course, emerge of trying to get that
just right and incremental, et cetera. When you're thinking about this in reverse, the first thing is everyone's data models are different, right?
You're at the end of the data refinery. So it's not the raw data from Salesforce that's always
the same schema. It's whatever entities your company has evolved, right? What your data
organization thinks is essential about your users and your workspaces. And maybe you have a many-to-many
model of your user base versus maybe you don't, right? Maybe it's one-to-one or there are no organizations. It's all just B2C,
right? All these various patterns are bespoke to your company. And that's where Census starts,
right? It has to first take your distilled version of the data at the end of all your pipeline
transformations and say, okay, we'll work with this, right?
And then we have to write into applications.
And there's two problems there.
One is writing data. The APIs are terrible because most SaaS applications focus first
and foremost on easy read APIs.
And the write APIs are very heterogeneous and generally very poorly designed.
And then if you screw that up, the damage is really, really high. So I think that is
the most important aspect of this. So when you think about a product like ours, even if you were
to do this yourself, right? So you're an engineer at your company and you're going to build these things. You will generally be reticent to do a lot because your upside is like, I got the pipeline done,
who gets promoted for that? And the downside is very significant, right? Because you're going to
accidentally put a million things into Marketo that you weren't supposed to put in. And no one
knows how to delete those things. Guess what? Deleting is hard in SaaS applications. And so now your marketing team is angry. You've sent emails to the customers that
are wrong. So the downsides are very high. And so a lot of what... I think that's actually what
generally held back this side of the company. This is why the product and analytics,
that whole world was actually evolving very well because it's agile.
But this side, it's like one project a year, one project a quarter, right?
And so that's really what we were trying to change here.
And so what do you have to do?
You have to validate data more deeply.
You have to do a lot more fine-grained ways of like writing data in.
So we have, you know, all sorts of different capabilities. You can use Census to say, hey, I only want to update what's there. I don't want
you to create new stuff. Or I want you to write into Salesforce, but I also don't want you to
overwrite this field if it's already there. Because again, there's much more subtle stuff
going on when you're in these operational workflows. There's an email that's going to
come out automatically at this. There's a salesperson who's going to make a phone call an hour later
based on what's happening in there.
And so we have a lot more subtle capabilities
to ensure that you're not breaking your operational world.
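The write behaviors described here (update only what is already there, and do not overwrite a field that already has a value) can be sketched in a few lines of Python. This is only an illustration under assumptions: `SyncPolicy`, `plan_writes`, and the field names are hypothetical, the destination is a plain dict standing in for a SaaS application, and nothing here reflects how Census actually implements its syncs.

```python
from dataclasses import dataclass, field


@dataclass
class SyncPolicy:
    """Hypothetical knobs for a warehouse-to-app sync."""
    create_new: bool = True  # False means "only update what's there"
    # Fields that are only filled when the destination has no value yet.
    fill_only_fields: set = field(default_factory=set)


def plan_writes(source_rows, destination, key, policy):
    """Compute the writes a sync would make, without touching the destination.

    source_rows: list of dicts from the warehouse model.
    destination: dict of {key value: existing record} standing in for the app.
    Returns a list of (action, key value, changed fields) tuples.
    """
    plan = []
    for row in source_rows:
        record_key = row[key]
        existing = destination.get(record_key)
        if existing is None:
            # Record is missing in the app: create it only if the policy allows.
            if policy.create_new:
                plan.append(("create", record_key, dict(row)))
            continue
        changes = {}
        for name, value in row.items():
            if name == key:
                continue
            already_set = existing.get(name) not in (None, "")
            if name in policy.fill_only_fields and already_set:
                continue  # protect a value someone already put there
            if existing.get(name) != value:
                changes[name] = value
        if changes:
            plan.append(("update", record_key, changes))
    return plan


# Stand-in for what already lives in the app (e.g. a CRM).
crm = {"a@x.com": {"email": "a@x.com", "owner": "Dana", "plan": "free"}}
rows = [
    {"email": "a@x.com", "owner": "Sam", "plan": "pro"},
    {"email": "b@x.com", "owner": "Sam", "plan": "pro"},
]
cautious = SyncPolicy(create_new=False, fill_only_fields={"owner"})
print(plan_writes(rows, crm, "email", cautious))
```

Producing a plan first, rather than writing directly, is one way to review what a sync would change before anything reaches the operational tool.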
And so one way to reframe what Census does, as opposed to pipelines,
is that it's actually kind of a continuous deployment tool
for data, and it has all of the, you know, the needs there.
Yeah, 100%. And actually, I want to extra emphasize what you are saying about the difference between reading and writing from the source application. And something that I want to add and make sure that our audience is aware of is that actually, by the way, Claire did something very right. She named it ETL and not ELT. And that's, yeah, that's very, very important, because the fact that we can do ELT, which means we extract whatever we can and just load it and dump it there, and then we can have models that we version on dbt or whatever, and we can go back and fix problems if we have problems, it's huge. And we don't realize that. If you go to an ETL engineer that was working, I don't know, with Oracle systems 30 years ago, they had the same problems that you have, because everything was so costly, and transformation is something that can destroy something, especially if you do it on the fly. So exactly as you said, it's a completely different... I mean, mathematically, it is the same thing, but in terms of the engineering that you need to put there, it's very, very...
Yeah. And look, I think a lot about product as an experience as well.
And if you think of the user that is trying to pull data into a warehouse, that ELT scenario that we've been all very familiar with for the last decade.
If you think about what they're trying to accomplish, almost all of them, it's in the name, right? It's analysis. They're trying to
pull it in so they can do some kind of analysis. How much money did we make? How much money could
we make? It usually comes back to one of those two things. And so the use case is very... There's
lots of kinds of analysis, but it's analysis. Whereas for our user, analysis is not the goal.
The goal is operations, right? It's automating something. It's, hey, I want to send emails to send a promotion about a shoe that you should buy, but tied to the specific segment of users that are likely to not retain if we don't send them the shoe, et cetera, et cetera, right? And so you're trying to get fine-grained detail into your email
system, not to do a spreadsheet, right, but so that an email comes out, or a sales call comes out,
or a better support experience comes out. That is a very different end user need.
And so I think when the person wakes up in the morning and opens up our tool
versus opens up an ELT product, what they're thinking about is different. I think they're
actually just trying to solve different problems.
Quick question before Eric asks his question: are the users different between a Fivetran user and a Census user?
Yeah. I mean, I'm sure you see the same thing as I do
in terms of data teams range dramatically in size.
So I admire the crap out of a lot of our users
who are data teams of one,
who are three things in one body, so to speak.
And so they pull the data in, they model the data, they push the data out, they do all of it on their own.
But I think when a data team grows, it actually ends up being different people.
Yeah, because there is a user who is,
you could think of it as like,
almost like maybe the concept,
what are people getting?
Remember you used to talk about
the forward deployed engineer?
Remember that concept?
Was it Palantir that first started using that term?
I think data teams now have all sorts of roles, right?
There's the core platform building kind of people.
There's ML who, you know,
people just sitting there doing like really cool analyses
that hopefully are worth the money.
I don't know.
And then there's this kind of forward-deployed analyst, let's call it.
Your job is actually not just to sit there and pontificate on what is revenue,
but actually to go help the marketing team, the sales team,
and the support team to improve the operational excellence of the company.
And so, yeah, I think that person might, on a different week, be doing something related to Fivetran and analysis. But on a day-to-day, I think, at scale, in your data team, this is actually different sets of people.
Yeah. Eric, all yours.
You saw me chomping at the bit. So
Boris, I'm interested in what I'll call maybe like the, the chicken and egg problem a little bit. And
I'll lead in by, I was thinking the other day,
like Google analytics is still so pervasive, but relative to what's available now, it's
so primitive in many ways. I mean, GA4 is a little bit better, but I was thinking about it and it's
like, okay, well, part of the reason is because like you have sort of have packaged collection
and visualization and disaggregating those things creates really big
challenges on both sides. Right. And so like, okay, just people kind of go to it. So you think
about Fivetran and it's like, okay, well, I'm taking, you know, data with largely known schemas
and dumping it into a place that can ingest known schemas, like, you know, whatever schemas and it's
great. When you think about like the
practical, I want to send emails or I want a salesperson to prioritize something.
There's an assumption. I think that there's been some sort of value created beyond the initial dump
into the warehouse.
Yeah.
And I'm just interested to know, how do you approach that? Because every business's data is different, different metrics, you know, all that sort of stuff. Are you reaching into the warehouse and trying to enable the creation of that value? I mean, tons of companies are doing it with dbt, but in many ways you need to have something to send that isn't there when the data arrives.
Yeah.
Yeah.
Yeah.
No, this is a, this might be my favorite question and topic and thing to think about.
You have to generate some kind of IP.
That's a way more succinct way to say it.
Yeah.
And so I think a company has two kinds of IP. There is the widget
that you make, and there is how you sell it and market it and support it and all the kind of... Yeah.
Those are both a kind of IP, right? And our industry focuses like 99% on how to make better
widgets and how the source code is your ultimate IP and all these things.
And I think all of this, call it how the sausage is made, how it's sold, how it's
supported, how it's marketed is absolutely IP. And if you have none, if the way you
send an email about promotions about your shopping cart can be solved by your Stripe automatic shopping cart reminder checkbox.
I don't know if they have that, but let's say they did.
Yeah.
Then great.
Then you don't need any of these things, right?
You have no IP of your own, right?
So I guess that puts the onus a little bit on companies actually thinking about what makes them unique. But here's what's happening and has been happening for years now. I think your point about Google Analytics being kind of all encapsulated is actually a really good metaphor for this entire modern data stack, right? We tend to think about the modern data stack as all these various tools and
the phases, right? And the data comes in and then it's transformed and it's,
you know, all these things.
But in a way, the modern data stack is taking every single SaaS app and,
you know, making them fall on their side, right? So Google Analytics ingests data, stores data,
renders, visualizes data, allows you to query the data,
reports on the data.
It models the data, right?
It has everything in the app.
And then repeat that times thousands of applications.
And so as long as everything you need can be done inside that
silo, then those products are great. And what the modern data stack does in some ways is just
reinventing that. It's like, well, now we can ingest all applications into one single storage
layer. Okay. And then you can store everything in one place. You can visualize it all in one way. So is that a useful architecture versus
30 apps that each implement their own end-to-end data stack? And I think the key question there is,
does your IP involve joining data? And if it doesn't, then this entire modern data stack
could actually be, you could potentially throw it out,
right? And be like, we have a billing system. All of our information about how much money we made
is in the billing system. You can query the billings. All that matters is then the question,
does the billing system give me an interface that I can render and visualize and query?
And if they don't, then of course, then you need to pull the data out so you can query it, right?
But see, this is, I think, the transition. Once upon a time, people were pulling
data out into their database, their data warehouse, because you couldn't query Stripe using SQL,
right? Right. Yep. But that's going to change. All of them are going to increase how they make
their data queryable. But what you can never do is, from inside Stripe or Google Analytics,
join and query data, right? So that's not possible.
And so that is what uniquely the data warehouse and the data stack does.
So then is there insight?
Is there insight for your various teams
that comes from joining data together?
Well, in the real world, always, right?
The, your sales prioritization example or your marketing email, right? Those two examples.
You could tie that to product activity. Well, that's one source of data. That's assuming your
entire product is one database, which it almost never is nowadays, right? So it could be multiple
services and data. It's going to also be tied to financial information about that customer,
which comes
from what? Well, some kind of invoicing data, right? Which might be one billing system,
might be multiple, right? It's going to be tied to their level of engagement with your team.
So that might be your support data is getting joined into that as well. And that's just me
kind of rattling these along, right? I bet you the best companies have really interesting
ways of modeling, you know, their users, their customers, their value, whether that's to forecast
it or to automate it or whatever. So I think the long and short of it is yes. When you use Census,
the goal is not to just take something from Fivetran into your warehouse and then back out
into Salesforce with no intermediate step. If,
if then I don't know what you're doing, uh, then you're just,
you're getting the base value, which is like,
I can take something from one app and put it into another app,
which is still good. Right. So take like a Zendesk metric,
dump it into your warehouse and then take it from the warehouse and put it
into Salesforce. Like that's still something.
And I actually think it's a better architecture than connecting those apps directly.
Yeah, sure.
You at least have a hub.
Yeah.
But I think real value.
Yeah.
What that person, again,
if you're just setting up a pipeline
that's raw to raw,
then yeah, yeah.
Your job is not that interesting.
Yeah.
But the reason we employ data teams
is that they're actually
sitting there going,
I think I could take
these disparate pieces of information,
clean them, distill them, merge them, and come up with new valuable insight.
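Editor's note: the kind of join described here, tying product activity, invoicing data, and support data to one customer, can be sketched in a few lines. This is only an illustrative sketch, not anything from Census or the speakers. The row shapes, the field names (`account_id`, `amount_usd`, `status`), and the function `build_account_profiles` are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical rows, roughly as they might look after landing in a warehouse.
product_events = [
    {"account_id": "a1", "event": "page_view"},
    {"account_id": "a1", "event": "page_view"},
    {"account_id": "a2", "event": "page_view"},
]
invoices = [
    {"account_id": "a1", "amount_usd": 1200},
    {"account_id": "a2", "amount_usd": 300},
    {"account_id": "a2", "amount_usd": 200},
]
support_tickets = [
    {"account_id": "a2", "status": "open"},
]


def build_account_profiles(events, invoice_rows, tickets):
    """Join three sources on account_id into one summary row per account."""
    profiles = defaultdict(
        lambda: {"page_views": 0, "revenue_usd": 0, "open_tickets": 0}
    )
    for event in events:
        if event["event"] == "page_view":
            profiles[event["account_id"]]["page_views"] += 1
    for invoice in invoice_rows:
        profiles[invoice["account_id"]]["revenue_usd"] += invoice["amount_usd"]
    for ticket in tickets:
        if ticket["status"] == "open":
            profiles[ticket["account_id"]]["open_tickets"] += 1
    return dict(profiles)


profiles = build_account_profiles(product_events, invoices, support_tickets)
```

No single source app could produce `profiles` on its own, which is the point being made about joins: the combined row only exists once the data sits in one place.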
One quick follow-up question, because I know I want to leave enough time for Costas
to ask about the term data federation, because he and I talk about that all the time. And he
has some really interesting thoughts, but what are the ways that you see, I love the,
the paradigm of IP. What are the ways that you see companies creating that? And I'll just,
the, the, the context behind that question is, I mean, some of the most interesting ways I see
that happening is through tools like dbt, where you're sort of creating like interesting models.
Of course, I think there are a lot of companies who just maybe even write SQL on the warehouse to perform the joins to create those data sets. What else are you seeing though?
How are companies creating that IP? Is there anything interesting in the way that that IP
is being generated in the context of those joins? Right. So I think it's always helpful for us all
to step back and remember that we are very, very, very deep in
the most cutting edge, sophisticated companies. And to your point, Google Analytics is still
so widely deployed. And so the majority of this does not happen in dbt, does not happen in all
these places, but there is business logic everywhere.
There's business logic everywhere.
So there's the query that you wrote ad hoc in your database.
Yes.
There is, if we were to be really honest,
probably the largest repository of these kinds of,
of this kind of logic, this kind of query,
is not in dbt and GitHub,
which I think that's, what's great there is it's starting to become a better repository for this.
I really hope our entire industry moves towards that model.
But it's probably, and don't freak out, in Salesforce SOQL queries and Apex code.
I agree with you wholeheartedly, actually. And I think the traditional, you know, kind of, if we think about the sophistication stages, right, they're crossing the chasms, etc., etc., right?
Silicon Valley and broadly speaking, software companies have moved to this new paradigm, right?
Because their most important signals come from their software.
And your CRM doesn't store that. So the data warehouse is the perfect kind of query engine and storage and computation layer for that information.
And the number of signals that we generate, I don't even know how many events the average kind of software company generates now.
But it's a lot, right?
That is why we store these things there now.
But if you think of non-software companies, which again, eventually everyone will be a software company, right? So, so this is why it's like, we all skate to where the puck is going, but there
are still furniture companies in the world, right? And you would probably find that the bulk of the
intelligence, the IP that I'm talking about lives kind of glommed onto their Salesforce instance
in a collection of maybe checked in, probably not checked in code, code that looks like a query
sometimes, like Salesforce has a query language called SOQL, or it's more imperative code like
Apex. And the real goal of Census is to kind of move that into a kind of Git-backed, kind of open, standard language called SQL.
Yeah.
And yeah, that's, I think, the journey that we're going to see over the next...
But it'll take, I'm talking easily a decade plus.
Oh, sure.
We all in our industry, and it's why we're so exuberant and why we all raise all this capital, is like, we think these things happen much faster than they do. You know,
I started my first company, like I said, 10 years ago on a very simple premise that was about
if we're all going to live in SaaS, you need to have your employee identity, your password,
your login, like centralized and federated. Right. And it seems to make sense. You can't have
8,000 passwords, right? In a company
that's not, like, that doesn't work. It's been over a decade and we're still in the infancy of
that market. Like that's how long these things take. And so I think data, we're very much in the
early stage. For sure. Back in when I was doing consulting, we used to joke about, you know, companies of all types and sizes.
It's like, OK, I've never seen a Salesforce that's not like some sort of Frankenstein.
And it's easy to talk down to that.
Right. Because it's actually very painful. Right.
Like it does create pain. But in reality, like it's pretty
advanced for a lot of the companies doing it and enables them to accomplish things that are
like, what else can they do? I mean, of course, like the modern data stack, but like,
it is very helpful and it is pretty advanced to be able to customize all of this business logic
inside of the tool. So that's such a helpful perspective.
Yeah.
And I think there's going to be this interesting cascade, right?
So I think the data community has so much still, and it's exciting, right?
That's why a lot of us work in this space.
And there's so much to distill from the world of engineering, of software engineering, down into, let's call it, the broader world of data.
So now, thank goodness, but like,
we're still at the early days of everyone realizing that you could treat your queries
as a piece of code that can be versioned, right? That's still, we're still at the beginning of
that, right? And then there's going to be all the other things that go around the software
development lifecycle for data. And even there, we have to get quite a bit more sophisticated,
right? If we're going to support these kinds of workflows.
So I'll give you an example.
One of the reasons you're...
Because if the cascade is like software engineering,
let's call it to data organizations,
and then down to business organizations.
So if you think of that Salesforce that you saw in your consulting days,
everyone always says, you're right.
It's a mess. It's a mess.
It's got all sorts of stuff.
There's like a field called blah, blah, blah, underscore two. You know,
it's like there's tons and tons of them, but what,
how many people in the modern data stack actually run like something
equivalent to a migration when their data schema has changed? Right.
Very few, if not none. And so we still
have to, you know, get more sophisticated in how we manage data in the core, let's call it.
But as we do, I think a lot of that will then be able to have this amazing downstream effect on
the rest of the business. Yeah. I really, you really made me think, Boris,
with the comment that you made about Salesforce
and the business logic there,
because you remind me of something extremely painful,
which is if and how you can replicate
the results of formulas on Salesforce.
So I don't know if like Fivetran is doing it today or like they figured out
how to do it,
but it's pretty much impossible because there is a piece of logic there,
which is executed whenever you make an API call. Right.
And that's like, I think.
That's a beautiful microcosm, by the way, of this whole thing.
You're absolutely right.
You're absolutely right.
Yeah.
But that's like the thing.
And I think that's what justifies and makes this category of reverse ETL or whatever we
want to call it like important, because at the end, you might be able to export the data
from Salesforce, but the business logic is not something that you can export.
Like you need someone to replicate it, which is a completely different story, right?
Exactly.
So you need to get the data out, but that's not enough.
You need also whatever you are going to do with this data
to push it back again, right?
And these systems are like, I mean, many times I say,
like when you get like a salesperson,
you can ask many things from the salesperson, but you
cannot ask them to leave Salesforce. That's where they live. They don't want to learn about
YouTube. They don't care about that stuff. The only thing that they care about is their quota.
That's what they should do. They shouldn't care. Why should they care about whatever
shiny technology we have? They would be engineers if they cared about that.
But there's versioning, man.
It's awesome.
Ah, yeah, yeah. Sure, sure.
You're right.
It's a QW.
Exactly.
Exactly.
No, absolutely.
Absolutely.
It's a, what's the term people like, you know, people live in their pane of glass, right?
And it's just like, you can't get them out of there.
And I think there were like some attempts to like do that with stuff like Looker, for example.
Yes, yes.
The previous version of BI tools, we were like, yeah, ask your salespeople to go and
work from within Looker, and then there will be links to go back to Salesforce.
Like, no, why?
Do you know who suffers from this the most?
It's actually
kind of tech founders in the Valley because they start their company and they're like, yeah,
I got Looker. My salespeople are just going to go there. And it's because like, they're also
deluded because they see this as easy, right? Because you and I can do it. And I'm like, no,
they're not, man. They're really not. I promise you they're not. And he's like,
it's easy.
Like for sure they're going to do that.
Like I can do it.
And I was like, uh-huh, uh-huh.
And it sometimes takes years for them to realize like,
oh yeah, I hired a VP sales.
Yeah, I'm like, they ended up doing their own thing.
I'm like, uh-huh, uh-huh.
They do their own thing.
They do their own thing.
So I think, yeah, tech founders particularly,
I think suffer from not seeing this.
Yeah, because it's also extremely easy to burn money.
Actually, it's one of the reasons that you exist.
So why not pay 50 grand to buy a license for this thing, right?
So yeah, anyway, that's another very interesting conversation that we need to do at some point.
But yeah, that was a very, very interesting point that you made there.
But I want to go, you used the term federation.
Eric mentioned that I want to ask about that, but traditionally, and like from, you're just
like an engineer, like federation and ETL are like two completely different things.
Yes.
Actually the opposite.
Like when you are talking about federation, it's more about, no, I'm not going to collect
the data into one place.
I'm going to ask each data source, and then I will federate the results and present the results.
That's true.
So if this is what you are thinking of as a solution, or unless you have a different definition, I would be more than
happy to discuss that. Where do we stand today, and where do you see it going? Right? Like,
because today, I don't know, technically speaking, this is not federation
that we have.
No, no, no. I think that's a very reasonable technical pushback. So let me start
with an analogy I tend to use with my team. You're going to appreciate it
because I think you're close enough in age to me, but I'm starting to notice that younger
people are like, what is he talking about? So your laptop, your computer has an operating system in
it. And it provides a lot of things for you, the user, and for the applications that are built on
it. And I think that when we move to the web, there are certain things that we kind of lost
along the way. We gained a lot, so that's fine.
But we lost a few things along the way.
So one is login, right?
So when you log in, you're gonna be able to log in once.
And then like, you don't open Word and go, please log in.
You don't open Photoshop or whatever and it says, please log in.
With the caveat that everything now is a web app.
So like, that's different now, but let's put a pin in that.
So that was, you know, your identity, your user identity was just given as part of the
operating system to all the other applications.
So they just were receivers of that knowledge and just used it.
And in the same way, there's a file system in your operating system, right?
Your computer has a file system.
And when you open a file in Word and you want to open that in Excel, it's the same file. They don't both have to
implement a file system to be able to read and write data. And so I think when we moved to the
web, we lost both of these things. And funny enough, both companies I've started are solving
these two things. And so when I think of data federation, the reason I use that term is I think that in order to have a wealth of SaaS applications exist, which is what I want, right?
You're going to always hit this natural friction around replicating the data correctly and consistently, right?
Because it's a distributed system and they all want to speak about the same things. So this is just, you're always, the more apps you have that all
speak roughly about the same things, you're going to have master data management problems. You're
going to have all the things that kind of as a distributed systems minded software engineer,
you can think through and they're hard. And it only gets worse for every N plus one application you want to use.
And so I think there's only two ways in the long run that this gets resolved.
One is the one I don't want, which is everything gets progressively acquired by larger companies. And because then they can create that integration, right?
They can create the tight integration
between Slack and Salesforce.
I'm sure they will.
And Microsoft is,
and maybe it's because I started my career at Microsoft
that I saw this,
because Microsoft is basically the best company
in history at doing this.
Having built unbelievably great technology
to do interoperation between its applications.
They do this because they can work together
and they can force Excel to do something
that then Word will also abide by.
And so that's one option.
And we see this, right?
The more we get in the later stages of SaaS,
which is now year 20 of SaaS, right?
Like we see these pressures.
And the only alternative that I think of
is that for some of these things
you need to come up
with a different model than just independently replicating the data in bespoke ways in every
application. And so that's why I use the term data federation, because I believe that as a company,
if you want to use the maximum number of SaaS applications with the most freedom and not to
be tied to one vendor, you want to be able to own your data and then seamlessly have it be usable in any application.
So today, my only option to be able to enable that world for people
is to say, okay, what is a place?
Let's work from first principles, right?
Well, you need to store all the data
in a way that is most cost-effective and scalable.
Data warehouses.
It's that or S3, right?
It's like either just raw storage or data warehouse.
Those are the best tools we have from first...
If something better came along, I'll take it, right?
But right now that is what's best.
And then I want seamless ability
to use that data from any application.
If I could eliminate the data pipelines and just say,
you know, your app is built directly off the data, that'd be great. But because of the way OLAP,
you know, warehouses are designed, because of the incentive structures in the market today,
you can't, you don't get that, right? So there are tools, by the way, like Salesforce has
this concept, they're the only one, but they have this concept like external objects where you can have an
externally backed data store, but it's slow.
And then you don't get all the features and you don't get the formulas and you don't get
the indexes and you don't get all the things.
So thus what Census does, which is we will push the data into the internal file system
of each of those products, thus turning them into a kind of high performance
cache on a single data store.
And that's what I mean by data federation.
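Editor's note: a rough way to picture "each product as a cache on a single data store" is a sync loop that treats the warehouse rows as the source of truth and writes only what differs into each application's own store. The sketch below is a toy under stated assumptions, not how Census works internally. The name `sync_destination`, the dictionaries standing in for a CRM and a helpdesk, and the `id` and `plan` fields are all hypothetical.

```python
def sync_destination(warehouse_rows, destination, key="id"):
    """Upsert rows that are new or changed into a destination's local store.

    `destination` is a plain dict keyed by record id, standing in for an
    application's internal storage. Returns the number of records written.
    """
    written = 0
    for row in warehouse_rows:
        record_id = row[key]
        if destination.get(record_id) != row:
            destination[record_id] = dict(row)  # copy so the stores stay independent
            written += 1
    return written


# The warehouse is the single source of truth in this sketch.
warehouse = [
    {"id": "c1", "plan": "pro"},
    {"id": "c2", "plan": "free"},
]

crm = {}                                          # empty destination
helpdesk = {"c1": {"id": "c1", "plan": "free"}}   # holds a stale copy of c1

crm_writes = sync_destination(warehouse, crm)
helpdesk_writes = sync_destination(warehouse, helpdesk)
second_pass = sync_destination(warehouse, crm)    # nothing changed, nothing written
```

After the sync both destinations hold the same records, and a repeat run writes nothing, which is the cache-like behavior the analogy points at.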
Yeah, makes total sense.
I have a, it's not exactly like a product question.
It's more like, it's probably like a, yeah, it is a product question, but it has more to do with like the experience of building a product.
Sure, sure.
So since you first launched Census, what have you learned by building this product?
Great question.
I think I would say that I've learned the most about our users, right? And data teams as a whole. And so it's been really fun to watch them on this journey over the last three years, just working with people. It really is the thing that always comes to mind, which is the first experience I had when we started
selling this to users was, hey, great, this is going to save me time. Or this allows me to do
the thing that I didn't know how to build. I don't know how to write this kind of connector.
So it's great. I write SQL. I don't know how to write Python. That was the initial
experience we had. And that was not surprising. That was not something I was like, ah,
what a discovery I've made. But then, and we talked a bit about this, but it became very
visceral to me. After a little while of, especially in the early days, our early users using our
software, but now it's become kind of, it happens more often. I started seeing a very unusual
reaction from our users that actually caused me real pain.
Like I was worried. I was actually really like, are we screwing up here? This seems bad. These
are bad. These are not the words you want to hear from a user, right? You want to hear excitement,
power, enjoyment, right? And multiple customers started using effectively expressions of fear.
They started like genuinely saying, I'm scared in so many words.
One, one, one customer was like, like, this is,
I feel like I'm holding a machine gun, like I paid him. I was like, well, that's not the feeling I want to engender in you. But you know,
so I could have shied away from that. I could have been really freaked out,
but, but I started to think about it.
And what I realized is Census is,
this is what I mean by it's not just a data pipeline.
It's giving these users a power they've never had before, right?
The power to do analysis is not new.
It's massively improved with great tools.
But the ability to analyze data
is something they always had.
But the ability to, from your something they always had. But the ability to,
from your vantage point
on the data organization,
to cause a marketing email
to get sent,
to cause a salesperson
to wake up in the morning
with a task to call this person,
that did not exist before Census.
And of course it's scary.
Like now it's your fault if something breaks. Or breaking would
be ideal. Like if Census said, hey, sorry, the pipeline can't go today, that's not
even, that's actually bad, but nowhere near the worst case scenario. The worst case
scenario is you push bad data, extra data, data that is going to be embarrassing when it goes out. And so that was
the emotion that they were trying to convey to me. And so now I spend a lot of time really thinking
about how can we build capability into Census that improves your confidence. So I think this
is the point. We have a lot of experience in the world of software on how to be agile, but safe, right?
Code reviews, testing, unit testing, like just decades of it. And our education and our content, right, is to try to teach how to make this less scary, but also to
embrace a little bit of the fear, right? Because I don't want people to go back to, I'm only going
to press the go button once a year because I don't want to break things. And so that's probably
the biggest thing I've learned is the biggest hindrance to deploying Census is actually helping people overcome
this new responsibility, this fear that comes with it. And I'm like, but on the other side is
so much power, so much growth, so much more your team will be able to do. And so you should embrace
it. But it is genuinely scary. And so that's a first in my life to have built a product that freaks people out.
Yeah, no, no, no.
I mean, it's a good problem to have because, of course, it's, I think, an indication of the value that...
I'll give you an example in how this manifests.
Speaking of product, we can do a very narrow...
Because I think this is not solved with one giant, my marketing team's
going to hate me, like one giant whiz-bang feature that you can announce, right? It's,
it's a collection of very like fine-grained thinking, like small features here and there.
And so I'll give you an example. So there are a lot of products that when you write into them,
to your point about like reading and writing is very different. They have, there's a term in compilers, you know, about there's defined behavior,
and then there's undefined behavior, and then there's unspecified behavior,
which is actually like a different thing, which means like it'll work,
but I can't tell you what's going to happen.
So when you write duplicates into some systems, not all, that's the beauty of it, right?
We support like 50 different applications and like all different,
you know, different behaviors.
Some of them will behave
in very unusual ways
when you sync duplicates.
So some of them will reject it.
Some of them will just pick one
and you won't know which one, right?
And so that is something
when we built the very first version
of Census all those years ago,
we just said,
here, let's take the table and, like, just efficiently, our goal was to get speed. So it's like, let's get it as
efficiently as possible into the destination. And then we didn't know, like, oh, turns out
people have plenty of duplicates. Like, the warehouse is not enforcing, you know, unique IDs, so they're
syncing duplicates. And we were powering through. We're like, super fast, like, yay, go sync
millions of duplicates.
No problem. And then you're back to the same old problem of the sales team or the support team or
the success team or the marketing team is like, this data is wrong. Screw the data team. Let's
go back to doing our own thing. I don't like these guys. And so now we've added the capability.
It's built in. You can't turn it off, which is we will block duplicates from being synced. Like, we
will block them. There are even some people who are frustrated by this because
it's errors, and they're like, but it's not an error. But it was like, we're going to treat
it like an error because you're not realizing this has annoying downstream effects
on your team. So it's, you know, it's a million things like that that we've had to kind of invest
in.
Yeah, I totally, I totally get that. I think
what people don't, there are like two things that I think people don't realize when they
start using products like Census. One is that the Census team has to learn to work
with a technology that is completely opaque, right?
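Editor's note: the duplicate blocking Boris describes a moment earlier, where rows that share a key are treated as errors instead of being pushed to a destination that may pick one arbitrarily, could look something like the sketch below. It is a minimal illustration and not Census's implementation. The function `split_duplicates`, the `id` key, and the sample rows are assumptions for the example.

```python
from collections import Counter


def split_duplicates(rows, key="id"):
    """Separate rows with a unique key from rows whose key appears more than once.

    Every row sharing a duplicated key is rejected, because there is no safe
    way to guess which copy the destination should keep.
    """
    counts = Counter(row[key] for row in rows)
    syncable = [row for row in rows if counts[row[key]] == 1]
    rejected = [row for row in rows if counts[row[key]] > 1]
    return syncable, rejected


# Hypothetical warehouse table with no uniqueness constraint on "id".
rows = [
    {"id": "u1", "email": "a@example.com"},
    {"id": "u2", "email": "b@example.com"},
    {"id": "u2", "email": "b2@example.com"},
]

syncable, rejected = split_duplicates(rows)
```

Here only `u1` would be synced, and both `u2` rows would be reported back to the data team as errors to fix at the source.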
You have Salesforce on the other side, and it's very interesting.
There were edge cases that we couldn't replicate by having access to the whole infrastructure and all the knowledge that Salesforce itself has.
So imagine now that you have Boris and his team and they try like to interoperate with Marketo.
I don't know how many people have worked with Marketo, but I mean.
Is there an off the record version of this podcast?
I mean, it's the best. It's the dominant marketing platform and for a good reason, I'm sure.
But like interoperating with it is a completely different thing.
There are errors that are not documented.
There are behaviors that are not documented.
They are not documented for a very good reason, because all these APIs, they were not built
for Boris and his team to send data.
Right?
They have a completely different specification.
That's one thing that people keep forgetting, I think.
The other thing that they keep forgetting is that as we add more and more systems into
this stack or architecture or whatever, we are actually building a super complicated
distributed system.
Right.
And distributed systems have some very specific rules, and delivery semantics
are something that might sound very theoretical, but it's actually very, very practical.
And I don't expect anyone in sales to know that one of the ways that we can deal with that is to have
at-least-once delivery semantics, right?
Yeah, sure. I mean, it doesn't work at the end. Like, because I'm getting
PTSD. I remember using the words eventual consistency in front of a marketing team, and they were like,
no, no, we need it. We can't have it be eventual. And I'm like, it has to be eventual. The speed of light
is not negotiable. And like, oh, that's what you mean? I'm like, because in their brain, eventual
meant like it'll come up tomorrow. And I was like, wow, I forgot that this is a term
that we use in distributed systems
that has nowhere near the same meaning.
Also, it goes both ways.
Real time doesn't mean real time to a lot of people.
So yeah.
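Editor's note: at-least-once delivery, mentioned above, means the sender keeps retrying until it sees an acknowledgment, so the receiver may get the same write more than once. That is only safe when the write is idempotent, for example an upsert keyed on a record id. The sketch below illustrates the idea with a made-up `FlakyDestination` class that applies the first write but loses its acknowledgment. Both the class and `deliver_at_least_once` are hypothetical and not any vendor's API.

```python
class FlakyDestination:
    """Hypothetical destination that applies the first write but loses its ack."""

    def __init__(self):
        self.records = {}
        self.calls = 0

    def upsert(self, key, value):
        self.calls += 1
        self.records[key] = value  # idempotent: repeating the write changes nothing
        if self.calls == 1:
            raise TimeoutError("acknowledgment lost")


def deliver_at_least_once(destination, key, value, max_attempts=3):
    """Retry until the destination acknowledges; return the number of attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            destination.upsert(key, value)
            return attempt
        except TimeoutError:
            continue  # we cannot know whether the write landed, so send it again
    raise RuntimeError("delivery failed after retries")


destination = FlakyDestination()
attempts = deliver_at_least_once(destination, "u1", {"page_views": 3})
```

The record was written twice, yet the destination holds exactly one copy. If the write were an append instead of a keyed upsert, the retry would have created a duplicate, which is why delivery semantics end up mattering to the sales and marketing teams downstream.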
The reason I'm saying that is because I think
there's like a very important element
and that's, we are all responsible for being in this market and that's education.
Like we need to make sure that like outside of building,
like actually I think that like,
it might sound a little bit exaggerated,
but part of the product is also education.
Like how we can help people understand what they can do and how they can do
it with their,
with their technology,
because there are limits, and engineering is about trade-offs, and we have to make these trade-offs.
Otherwise, like we are not going to have products that work.
Yep.
Yep.
No, I think that's a, I think interesting products tend to have this educational component
and I wholeheartedly agree that that's part of the journey we're all on. And especially, again, the world is large.
And one of the things I have learned
is the world is nowhere near as sophisticated
as people think it is.
Oh, yeah.
I tell people this even more.
Like Silicon Valley is not even as sophisticated
as you think it is, right?
And like we work,
you and I work with some of the best, right?
And it's like, sometimes I'm like, wow, this is, I remember I used to do really fancy demos in the early days. Really,
I would try to drop in words like AI to just, again, you're like, yeah, like you need all
these things and da-da-da-da. And it's like, and then one day out of expedience, I didn't have
time that day. I did the dumbest version of the Census demo.
This is back in 2018, 2019. I did the dumbest, where there were two metrics you could set up in
12 seconds. Like, count page views. You know what I mean? It was like, count pages.
And then I was like, let's just put that in Salesforce for a customer success team to know
how many times they've visited your product. That was it. That was the demo. And I was, I was actually concerned like, and embarrassed for them at
first. Cause I was like, they were in awe. Like people were like, this is the greatest thing
since sliced bread. And I was like, this isn't the, what, this is the basics. This is not the,
this is not the wow demo. This is not the wow demo. Like why, why are you guys wowing? And,
and it's like, you forget how, how starved people are for this. Right.
And then you're right. It goes hand in hand with,
then you start delivering stuff and then they, yeah, you have to,
we have to find a way to,
we're going to have to do a book like distributed systems for, for,
for regular people. Cause yeah. Yeah.
Just cause it's because I think it's too intuitive for you and I.
We know it so well that we take it for granted.
And then you end up in these weird miscommunications.
And I think the need to educate is doubly so.
Because you are right that we need to educate just to serve our own users.
Think of what Fishtown, dbt, have to do, right? To teach the concept of version control
is like super valuable.
Just to teach that is unbelievably valuable.
And if I think about what we're doing
is we're turning the data team
into this kind of like company platform team.
So we need to help them explain
what's happening to everybody else.
Otherwise they will also fail.
So we have to act as like their advocates
to the rest of the company.
And like, that's super essential. So you're right. The education is unbelievably important.
Yep. A hundred percent.
Yep. Hopefully these conversations help.
Oh, this is great. Boris, this has been such a fun conversation. Brooks actually let us run a
little bit long, which is super fun when we get permission to do that. But we're at time here.
This has been such a fun conversation, really helpful for me.
And I think definitely for our listeners as well.
So thanks for the time.
I mean, thank you.
Thanks for having me.
First of all, I have to say that Boris is so articulate.
I find myself jealous of his ability
to explain complex things
and even dip into the world of,
you know, sort of formal computer science
in a way that's so accessible. So, hey, I appreciated that and learned a ton from him.
My takeaway is around the way that he described sort of value that's created in the warehouse
as it relates to data that's transformed, say for downstream tools, sort
of creating value with data, right?
And he described that as any data that needs to be joined in order to produce some sort
of valuable asset.
He described that as IP, which I think is such a helpful way to frame the concept of
creating whatever kind of value we're creating in the warehouse, right?
Whether it's a unified customer profile or packaging some sort of analytical component
from one business unit and sharing it with another. So I really, I just really appreciated
that. I think it's been helpful for me to think through that. Yeah. I mean, okay. It was like an
amazing conversation I think we had with him in general. There are like many insights for someone to take from this conversation.
What I keep, I really liked how he's using the term federation.
This was like something that we discussed also during the show.
Traditionally, federation has a different meaning,
but it makes a lot of sense the way that he's using the term federation.
And that was very interesting.
And it was also super interesting to discuss with him about all the challenges around building a product like this.
So hopefully we are going to have him again in the future and we have more stuff to chat about.
Absolutely.
Well, thanks again for joining us on the show.
Lots of great episodes coming up.
So we'll catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com. Ciao.