The Data Stack Show - 78: The Etymology of Reverse ETL & Why It’s a Key Piece Of The Modern Data Stack with Boris Jabes of Census

Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, one platform for all your customer data pipelines. Learn more at rudderstack.com. And don't forget, we're hiring for all sorts of roles. You have the chance to meet Costas and I live in person coming up soon in Austin, Texas. We're both going to be at Data Council Austin.

Starting point is 00:00:38 The event is the 23rd and 24th of March, but both of our companies are hosting a happy hour on the 22nd, the night before the event. So you can come out and have a drink with Costas and I. Costas, why should our visitors join us live in Austin? For tequila, of course. That could make things very interesting. I mean, yeah, it's a happy hour. People should come.

Starting point is 00:01:06 It's before the main event. So without getting, being tired from the event or anything, like come over there, meet in person, something that we all miss because of all this mess with COVID. Have some fun.

Starting point is 00:01:19 Talk about what we are doing and yeah, relax and have fun. It's going to be a great time. Learn more at datastackshow.com. There'll be a banner there you can click on to register for the happy hour and we will see you in Austin in March. Welcome to the Data Stack Show. Today, we're going to talk with Boris from Census, and it's categorized as a reverse ETL tool, but I have a sneaking suspicion that Costas is going to ask about the reverse ETL terminology. But what I'm going to ask about is, you know, what's interesting about Census, so, you know, taking data from the warehouse and pushing it out to other tools in the stack, is that it kind of assumes that there has to be some value created in the warehouse beyond just the raw data that was loaded there. And so I want to know what Boris is seeing as far as how that impacts the way that he thinks about customers, the product that they're building, and the ways that companies are trying to do that, right?

Starting point is 00:02:18 I mean, dbt is obviously sort of a new way, but I'm really interested in that. How about you, Costas? Well, first of all, I have to figure out who came up with the term reverse ETL. Yes, the etymology of tech terms is such a tasty subject. Yeah, I mean, it's more of a marketing term probably, to be honest, but it's something that, like, because I have also the suspicion, I mean, you know, like Census is probably the first company that was like in this space, I mean, so it probably has to do with them. Like, it's something that's related to them.

Starting point is 00:02:50 So I want to learn, like, what's the story behind it. And outside of this, I want to ask Boris and, like, try to understand what's the difference between getting data, for example, from Marketo and pushing it into the data warehouse and doing the inverse, which is take from the data warehouse and push it back to Marketo. What are the different challenges there? Why are they different?

Starting point is 00:03:15 Why do we need different tools? And who is using them? Is the user the same? Why do we have different product categories at the end? That's what I want to understand. And I hope he's the right person to have this conversation. Well, let's go find out. Let's do it.

Starting point is 00:03:31 Boris, welcome to the Data Stack Show. Hey, nice to be here. All right. Give us the brief background on where you came from and what you do today at Census. Where I came from. So originally from Canada, if that's the real meaning of the question. It's mainly a geographic question. Yeah, it's a geographic question. I'm a Canadian who lives in San Francisco through a variety of stops along the way. But my career started at Microsoft. I

Starting point is 00:03:57 have always been a tool builder. So I started my career on what I consider kind of the ultimate tool, which is Visual Studio, which is the tool that tool builders use to make software. So it's a particularly interesting challenge to start your career in. And I spent quite a few years working on developer tools. And then about a decade ago, I started my first company that was actually in the field of what you call identity management and single sign-on for the people that kind of know these things. And that, after I sold that company, it kind of, like started to solve in 2018 with Census, which was to get kind of data from product and analytics teams out into the rest of the

Starting point is 00:04:54 business. We were just frustrated by the lack of bridging between those two worlds. And so that's how our company was born. And so today I'm the CEO of Census and, you know, we're mostly based in San Francisco in the U.S. I think it's kind of a mix at this point of like 50/50 remote and San Francisco and, you know, kind of humming along. Yeah. One quick question on the way that you noticed problems around data silos and other things: was that both in your company and with your customers, or was it primarily something you learned building the company yourself? I guess I see it everywhere. So like once you see, you can't unsee. Yeah. Yeah. I think, you know, great startups, great founders tend to,

Starting point is 00:05:44 they don't look at the world, let's call it from the MBA perspective. Like, ah, there's a market opportunity there. They just want to build something, right? Right. And I don't knock identifying market opportunities. But I find that you tend to get obsessive about trying to solve a problem, either one that you've experienced or one that you see and you can't unsee.

Starting point is 00:06:06 In my case, it was both. So when I see software as a service and I see people using all the amazing apps, right? Some of our customers have like 300 apps, if you can imagine that in their organization. I think that's wonderful, right? That means that lots of people get to use the tools that they want. People can be productive. There's, you know, people have best of breed user interfaces and all that stuff. But invariably, and maybe other people don't see it as immediately as I do, but I just can't not see it, is data is replicated ad hoc across all of these applications. And what is the data? Well, it's the same kinds of

Starting point is 00:06:45 things over and over again. And that feels wrong to me, right? And I feel like we need to help solve that problem. And so there's all sorts of tools that have existed over the decades to try to solve what people call data integration. It's not like a new concept. And the kind of unique perspective we brought to it when we started the company in 2018 was there was this treasure trove of data in data warehouses and product analytics teams and product teams that everyone on the product and engineering side used. We all were very comfortable using those things, whether that's from your operator console or from your Amplitude analytics tool, whatever, right? Like we were all living and breathing it. And sales and marketing and success and support teams were not.

Starting point is 00:07:28 And so we built this bridge, right? That went from the data warehouse out towards the business tools. And in 2018, that was a weird and novel thing. So people didn't even know what to kind of call this. Yeah. So how did we come up with the term reverse ETL? And who came up with this term? Yeah. So when we first started, this was in approximately August, September 2018, is when we were building the first version of Census.

Starting point is 00:08:01 And we were talking to our first customers, and our first two or three customers, basically on their own, decided to describe our product as reverse Fivetran. Okay. If I'm being really specific. Because they knew- No way. That's great. And so they did that, right?

Starting point is 00:08:20 We were just kind of like, again, you'll meet a lot of first-time founders. How do you describe your product is a classic conundrum. And people get too complicated and use buzzwords. So we were just like, it connects to your data warehouse. And we were keeping it real simple and we weren't trying to complicate it with buzzwords. And then they were like, so it's kind of like Fivetran in reverse. And we're like, yes, if that works for you. That's a great... Let's go with that. Now, of course, we didn't put that on our website. That would seem really weird. But in colloquial speech, that's how people were reasoning about our software. And obviously, you're not going to launch your company that way.

Starting point is 00:08:55 So in our first year back in 2018, 2019, we were just going around finding our first customers, just getting them on the product and riffing on all sorts of ways in which we could call this. And funny enough, around... I'm going to say June, July, August of 2019, around there. I'm sorry. Yeah, 2019. One of our customers was actually working in tandem with folks at Fishtown Analytics, which is now dbt Labs. And for the folks who might not know, because now it feels like ancient history, the company that builds dbt was originally selling consulting services rather than selling the software. And so one of our customers was consulting with them. So they were paying for our software and they were developing a really cool data stack. And they were working with one of the folks at... Almost one of their first consultants.

Starting point is 00:10:07 She became the community manager. Her name is Claire Carroll. And she started taking notes on what things she was seeing out there when she was working with customers. And so out came like sometime in the summer of 2019, this Notion doc that was like, you know, linked off of the internet somewhere, right? Which has long disappeared, in which she was kind of taking notes. It was literally a page of just notes. And in it, there was this thing going like, and then there's this Census thing, like reverse ETL, which from her perspective, it's like, instead of branding it reverse Fivetran, it made sense to just say, oh, let's call it reverse ETL. So that's the first evidence of that word ever showing up in writing to my knowledge.

Starting point is 00:10:59 So the reason we weren't using that term at the time was I have the unfortunate problem of being like too knowledgeable or too nerdy or too mathematically obsessed or oriented, which is like the word is technically a misnomer since ETL has no direction. It feels weird. At least Fivetran in reverse actually was a reasonable descriptor, right? But reverse ETL actually seemed like a mathematically incorrect way of describing the thing, but at least it's a generic term. So, you know, we bandied it around for a while for fun in 2019. And then we launched the product and the company in 2020. And it just very quickly became the de facto name for this, and far be it from me to kind of argue with the public, right? It doesn't seem like a worthwhile way to spend my time. So that's my personal recollection of the kind of birth of that word.

Starting point is 00:11:52 And then, you know, when we did our Series A announcement, which was in February of 2021, these last couple of years are all blending together. Then the VC ecosystem landscape machinery kind of kicked into high gear. And, you know, in the same way that engineers like to think about data stacks, venture capitalists like to think in terms of data landscapes, or landscapes. Everyone famously knows the marketing landscape, and now the data landscape is just as complicated. And yeah, and so, you know, this is like the kind of output

Starting point is 00:12:26 they like to produce. This is like a success for them. It's like, I've managed to put every logo I've ever heard of into a single chart with squares around them. So that's, I think, when reverse ETL really became a household concept, is when it started showing up in those.

Starting point is 00:12:39 That is some high quality lore. Like even the detail of the Notion doc, it's perfect. Like it's perfect. So I thank you for that bit of history. Okay. My follow-up question to that, Costas, about where the term came from is, okay. So I agree. Like mathematically it's not, you know, technically accurate. But I think even beyond that, my bigger question is, in some ways it's very singular, right? Like a line on the chart, you know, that, you know, whatever, us in the data industry create or an investor creates. But you're building tooling in this space, do you think that's a sufficient term to describe at

Starting point is 00:13:27 least what you, like what you envision that you're building or like the problem you're solving? No, I mean, you're giving me really too much rope there to say whatever I want, but that's the point of the show. Before this call, right, Costas had described himself as a plumber since he had worked in pipelines for so long. And I think there's great pride to be taken in building excellent data pipelines. It's something that we pride ourselves on, and I'm sure you do as well. And our customers do. But it's not what I think the product is actually about. It's not what excites our users, right?

Starting point is 00:14:07 When I think of great software, especially tools, I mean, there's software of all kinds, right? But when you think of great tools, you're basically trying to make someone else, right? Your user, kind of a more awesome version of themselves, right? That's just the best way to think about it. And our users are not trying to become really good data pipeline people. That's not their goal. And when we started the company, I was not thinking, you know what I'd love to do is just spend my life building great data pipelines. That's not what the core animus was. It is absolutely an essential means to reach our end. But what I wanted to solve

Starting point is 00:14:49 and what I get to see with our users every day is I wanted to bridge the gap between what I called analytics and product organizations and the go-to-market organizations. I was very frustrated that that gap existed. And there are a lot of tools out there that had taken stabs at this, right? Famously, there were tools like Segment that connected the code that you wrote in your app directly into your marketing tools. This was a huge step forward. But I kept seeing this problem that the data organizations that were emerging, the BI organizations that were emerging, were disconnected from the rest of go-to-market, right? Finance, support, sales, like just the whole world of the company.

Starting point is 00:15:33 And so just building that connection was important to me. And you don't just have to build data pipelines to make that work, right? You have to change the relationship between those teams and the data organization. And if you ask data teams all over the world what their day-to-day life is like, they will tell you that they're really crumbling under kind of load, like support load of getting data requests, having to solve like yet another dashboard. They're very overworked, like IT teams, right? And what I felt they needed to move towards and what I think Census's underlying goal should be for them is not to make pipelines that run faster than the pipelines they could write.

Starting point is 00:16:16 That's a good to have, right? And I'm glad that our pipelines are superior to the ones you would build yourself. But actually to turn your data organization into a, we use this term a lot nowadays, right? But we really meant it from the beginning, which is like a kind of product or platform team, because it's the only way to serve your whole company at scale. Otherwise you're just the hated service org, right? You're the IT team that no one really likes because everyone's always stuck

Starting point is 00:16:42 behind 32 requests. And so that was a huge kind of part of what Census has always been about and continues to be about. So see, it's not like really about the plumbing. It's about saying, how do I turn the data team into the most essential part of your whole company that everyone else depends on? And so that's, you know, you may have caught me saying this earlier, but I think of Census a lot more as a data federation tool rather than a

Starting point is 00:17:11 data pipeline tool. That's why it's called Census. Because my goal is to say at a company, there should only be one version of the truth. There should only be one census of your users, your data, et cetera. And everything else in the company should be naturally kind of a cache on that data, pulling from that information as seamlessly as possible. And that's what Census does. Boris, can you elaborate a little bit more on how this reverse Fivetranity or whatever we want to call it, right?

Starting point is 00:17:44 is actually different? And what are the challenges that Fivetran does not have, right? Right, right. Taking the data from the apps or the database and pushing it into the data warehouse. Totally, totally. Yeah, I mean, that's a great question. And everyone, you know, from the outside of almost any company, any software, any tool, right?

Starting point is 00:18:02 People always think it's, how complicated can it be? It's reverse Fivetran, right? So as soon as you distill things into like two words, then you somehow lose all the underlying complexity. So there's a couple really significant ways in which this is different and, you know, difficult in its own right for people to build. The first is when you're pulling data from SaaS applications into your warehouse, you're actually dealing with very consistent source data, right? So if you go to all the various ELT tools, right, they'll show you the ERD for all these applications, right? And they're fairly stable. And what you're doing is you're saying, let me get Salesforce, let me pull the schema and dump it into the warehouse. And warehouses, to their credit,

Starting point is 00:18:49 are very easy places to say, here's a table, just dump it, right? I'm not trivializing the work of building great pipelines there. But you're basically going from a kind of raw data structure that is not changing super often, with read APIs off those products that are generally the first API that any SaaS product will build, down into a data warehouse, which is of a low N, right? There's only so many data warehouses, and they're fairly consistent at being able to write a raw table in, right? And then all the little details, of course, emerge of trying to get that just right and incremental, et cetera. When you're thinking about this in reverse, the first thing is everyone's data models are different, right? You're at the end of the data refinery. So it's not the raw data from Salesforce that's always

Starting point is 00:19:32 the same schema. It's whatever entities your company has evolved, right? What your data organization thinks is essential about your users and your workspaces. And maybe you have a many-to-many model of your user base versus maybe you don't, right? Maybe it's one-to-one or there are no organizations. It's all just B2C, right? All these various patterns are bespoke to your company. And that's where Census starts, right? It has to first take your distilled version of the data at the end of all your pipeline of transformations and say, okay, we'll work with this, right?

Starting point is 00:20:09 And then we have to write into applications. And there's two problems there. One is that for writing data, the APIs are terrible, because most SaaS applications focus first and foremost on easy read APIs. And the write APIs are very heterogeneous and generally very poorly designed. And then if you screw that up, the damage is really, really high. So I think that is the most important aspect of this. So when you think about a product like ours, even if you were to do this yourself, right? So you're an engineer at your company and you're going to build these things. You will generally be reticent to do a lot because your upside is like, I got the pipeline done,

Starting point is 00:20:51 who gets promoted for that? And the downside is very significant, right? Because you're going to accidentally put a million things into Marketo that you weren't supposed to put in. And no one knows how to delete those things. Guess what? Deleting is hard in SaaS applications. And so now your marketing team is angry. You've sent emails to the customers that are wrong. So the downsides are very high. And so a lot of what... I think that's actually what generally held back this side of the company. This is why the product and analytics, that whole world was actually evolving very well because it's agile. But this side, it's like one project a year, one project a quarter, right? And so that's really what we were trying to change here.
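To make the risk on the write side concrete, here is a minimal Python sketch of the kind of guardrails a sync into a SaaS tool might use: compute a plan before writing anything, optionally refuse to create new records, leave certain fields alone when they already hold a value, and abort if the plan would create more records than expected. This is not Census's API or implementation. `plan_sync`, `apply_plan`, the `update_only` mode, `fill_only`, and `max_creates` are all hypothetical names chosen for illustration, and the destination is a plain dictionary standing in for an application such as Marketo or Salesforce.

```python
from dataclasses import dataclass, field


@dataclass
class SyncPlan:
    """What a sync would do, computed before anything is written."""
    creates: list = field(default_factory=list)   # full rows to insert
    updates: list = field(default_factory=list)   # (key value, changed fields)
    skipped: list = field(default_factory=list)   # key values left untouched


def plan_sync(source_rows, destination, key, mode="upsert", fill_only=()):
    """Build a SyncPlan without modifying the destination.

    source_rows: list of dicts, e.g. rows of a warehouse model
    destination: dict mapping key value -> existing record (hypothetical stand-in)
    mode:        "upsert" creates and updates; "update_only" never creates
    fill_only:   fields written only when the destination value is empty
    """
    plan = SyncPlan()
    for row in source_rows:
        existing = destination.get(row[key])
        if existing is None:
            if mode == "update_only":
                plan.skipped.append(row[key])
            else:
                plan.creates.append(dict(row))
            continue
        changes = {}
        for name, value in row.items():
            if name == key or existing.get(name) == value:
                continue  # nothing to change for this field
            if name in fill_only and existing.get(name) not in (None, ""):
                continue  # do not overwrite a value that is already set
            changes[name] = value
        if changes:
            plan.updates.append((row[key], changes))
    return plan


def apply_plan(plan, destination, key, max_creates=100):
    """Apply a plan, refusing to run if it would create too many records."""
    if len(plan.creates) > max_creates:
        raise RuntimeError(
            f"plan would create {len(plan.creates)} records, limit is {max_creates}"
        )
    for row in plan.creates:
        destination[row[key]] = row
    for key_value, changes in plan.updates:
        destination[key_value].update(changes)
```

Separating planning from applying means the plan can be inspected, or rejected by a limit such as `max_creates`, before any record in the operational tool is touched, which is where the high cost of a mistake sits.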

Starting point is 00:21:35 And so what do you have to do? You have to validate data more deeply. You have to do a lot more fine-grained ways of like writing data in. So we have, you know, all sorts of different capabilities. You can use Census to say, hey, I only want to update what's there. I don't want you to create new stuff. Or I want you to write into Salesforce, but I also don't want you to overwrite this field if it's already there. Because again, there's much more subtle stuff going on when you're in these operational workflows. There's an email that's going to come out automatically off of this. There's a salesperson who's going to make a phone call an hour later

Starting point is 00:22:07 based on what's happening in there. And so we have a lot more subtle capabilities to ensure that you're not breaking your operational world. And so one way to reframe what Census does, as opposed to pipelines, is that it's actually kind of a continuous deployment tool for data, and it has all of the, you know, the needs there. Yeah, 100%. And actually I want to like extra emphasize what you are saying about like the difference between reading and writing from the source application. And something that I want to add and make sure that our audience

Starting point is 00:22:45 is aware of is that actually, by the way, Claire did something very right. She named it ETL and not ELT. And that's, yeah, but that's very, very important because the fact that we can do ELT, which means we extract whatever we can and just load it and dump it there. And then we can have models that we version on dbt or whatever. We can go back and fix problems if we have problems. It's huge. And we don't realize that. If you go like to an ETL engineer that was working, I don't know,

Starting point is 00:23:20 with Oracle systems 30 years ago, they had the same problems that you had because everything was so costly, but transformation is something that can destroy something, especially if you do it on the fly. So exactly as you said, it's a completely different... I mean, mathematically, it is the same thing, but in terms of the engineering that you need to put there, it's very, very... Yeah. And look, I think a lot about product as an experience as well. And if you think of the user that is trying to pull data into a warehouse, that ELT scenario that we've been all very familiar with for the last decade. If you think about what they're trying to accomplish, almost all of them, it's in the name, right? It's analysis. They're trying to

Starting point is 00:24:08 pull it in so they can do some kind of analysis. How much money did we make? How much money could we make? It usually comes back to one of those two things. And so the use case is very... There's lots of kinds of analysis, but it's analysis. Whereas for our user, analysis is not the goal. The goal is operations, right? It's automating something. It's, hey, I want to send emails to send a promotion about a shoe that you should buy, but tied to the specific segment of users

Starting point is 00:24:41 that are likely to not retain if we don't send them the shoe, et cetera, et cetera, right? And so you're trying to get fine-grained detail into your email system, but not to do a spreadsheet, right? So that an email comes out, or a sales call comes out, or a better support experience comes out. That is a very different end user need. And so I think when the person wakes up in the morning and opens up our tool versus opens up an ELT product, what they're thinking about is different. I think they're actually just trying to solve different problems. Quick question before Eric asks his question. Are the users different between a Fivetran user and a Census user?

Starting point is 00:25:23 Yeah. I mean, I'm sure you see the same thing as I do in terms of data teams range dramatically in size. So I admire the crap out of a lot of our users who are data teams of one, who are three things in one body, so to speak. And so they pull the data in, they model the data, they push the data out, they do all of it,

Starting point is 00:25:47 all on their own. But I think when a data team grows, it actually ends up being different people. Yeah, because there is a user who is, you could think of it as like, almost like maybe the concept, what are people getting? Remember you used to talk about

Starting point is 00:26:05 the forward deployed engineer? Remember that concept? Was it Palantir that first started using that term? I think data teams now have all sorts of roles, right? There's the core platform building kind of people. There's ML who, you know, people just sitting there doing like really cool analyses that hopefully are worth the money.

Starting point is 00:26:22 I don't know. And then there's this kind of forward-deployed analyst, let's call it. Your job is actually not just to sit there and pontificate on what is revenue, but actually to go help the marketing team, the sales team, and the support team to improve the operational excellence of the company. And so, yeah, I think that person might, on a different week, be doing something related to Fivetran and analysis. But on a day-to-day, I think, at scale, your data team, this is actually

Starting point is 00:26:49 different sets of people. Yeah. Eric, all yours. You saw me chomping at the bit. So Boris, I'm interested in what I'll call maybe like the chicken and egg problem a little bit. And I'll lead in by, I was thinking the other day, like Google Analytics is still so pervasive, but relative to what's available now, it's so primitive in many ways. I mean, GA4 is a little bit better, but I was thinking about it and it's like, okay, well, part of the reason is because you sort of have packaged collection and visualization, and disaggregating those things creates really big challenges on both sides. Right. And so like, okay, people just kind of go to it. So you think

Starting point is 00:27:32 about Fivetran and it's like, okay, well, I'm taking, you know, data with largely known schemas and dumping it into a place that can ingest known schemas, like, you know, whatever schemas, and it's great. When you think about like the practical, I want to send emails or I want a salesperson to prioritize something, there's an assumption, I think, that there's been some sort of value created beyond the initial dump into the warehouse. Yeah. And I'm just interested to know, like, how do you approach that? Because every business's data is different, different metrics, you know, all that sort of stuff.

Starting point is 00:28:13 Are you like reaching into the warehouse and trying to enable the creation of that value? I mean, tons of companies are doing it with dbt, but like in many ways you need to have something to send that isn't there when the data arrives. Yeah. Yeah. Yeah. No, this might be my favorite question and topic and thing to think about. You have to generate some kind of IP.

Starting point is 00:28:38 That's a way more succinct way to say it. Yeah. And so I think a company has two kinds of IP. There is the widget that you make, and how you sell it and market it and support it and all the kind of... Yeah. Those are both a kind of IP, right? And our industry focuses like 99% on how to make better widgets and how the source code is your ultimate IP and all these things. And I think all of this, call it how the sausage is made, how it's sold, how it's supported, how it's marketed, is absolutely IP. And if you have none, if the way you

Starting point is 00:29:18 send an email about promotions about your shopping cart can be solved by your Stripe automatic shopping cart reminder checkbox. I don't know if they have that, but let's say they did. Yeah. Then great. Then you don't need any of these things, right? You have no IP of your own, right? So I guess that puts the onus a little bit on companies actually thinking about what makes them unique. But here's what's happening and has been happening for years now. I think your point about Google Analytics being kind of all encapsulated is actually a really good metaphor for this entire modern data stack, right? We tend to think about the modern data stack as all these various tools and the phases, right? And the data comes in and then it's transformed and it's,

Starting point is 00:30:12 you know, all these things. But in a way the modern data stack is taking every single SaaS app and, you know, making them fall on their side, right? So Google Analytics ingests data, stores data, renders, visualizes data, allows you to query the data, reports on the data. It models the data, right? It has everything in the app. And then repeat that times thousands of applications.

Starting point is 00:30:42 And so as long as everything you need can be done inside that silo, then those products are great. And what the modern data stack does in some ways is just reinventing that. It's like, well, now we can ingest all applications into one single storage layer. Okay. And then you can store everything in one place. You can visualize it all in one way. So is that a useful architecture versus 30 apps that each implement their own end-to-end data stack? And I think the key question there is, does your IP involve joining data? And if it doesn't, then this entire modern data stack could actually be, you could potentially throw it out, right? And be like, we have a billing system. All of our information about how much money we made

Starting point is 00:31:30 is in the billing system. You can query the billings. All that matters is then the question, does the billing system give me an interface that I can render and visualize and query? And if they don't, then of course, then you need to pull the data out so you can query it, right? But see, this is, I think, the transition. Once upon a time, people were pulling data out into their database, their data warehouse, because you couldn't query Stripe using SQL, right? Right. Yep. But that's going to change. All of them are going to increase how they make their data queryable. But what you can never do is, from inside Stripe or Google Analytics, join and query data, right? So that's not possible.

Starting point is 00:32:07 And so that is what uniquely the data warehouse and the data stack does. So then is there insight? Is there insight for your various teams that comes from joining data together? Well, in the real world, always, right? The, your sales prioritization example or your marketing email, right? Those two examples. You could tie that to product activity. Well, that's one source of data. That's assuming your entire product is one database, which it almost never is nowadays, right? So it could be multiple

Starting point is 00:32:39 services and data. It's going to also be tied to financial information about that customer, which comes from what? Well, some kind of invoicing data, right? Which might be one billing system, might be multiple, right? It's going to be tied to their level of engagement with your team. So that might be your support data getting joined into that as well. And that's just me kind of rattling these off, right? I bet you the best companies have really interesting ways of modeling, you know, their users, their customers, their value, whether that's to forecast it or to automate it or whatever. So I think the long and short of it is yes. When you use Census,

Starting point is 00:33:19 the goal is not to just take something from Fivetran into your warehouse and then back out into Salesforce with no intermediate step. If that's all you're doing, then you're just getting the base value, which is like, I can take something from one app and put it into another app, which is still good. Right. So take like a Zendesk metric, dump it into your warehouse and then take it from the warehouse and put it into Salesforce. Like that's still something.
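
As a rough illustration of the "intermediate step" discussed here, the sketch below joins product activity, billing, and support data into one row per customer before anything would be synced out. It is a minimal sketch, not Census's or any vendor's implementation: the table names, column names, and sample rows are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3


def make_demo_warehouse():
    """Create an in-memory stand-in for a warehouse (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE product_events (customer_id TEXT, event TEXT);
        CREATE TABLE invoices (customer_id TEXT, amount REAL);
        CREATE TABLE support_tickets (customer_id TEXT, status TEXT);
        INSERT INTO product_events VALUES
            ('acme', 'page_view'), ('acme', 'page_view'),
            ('acme', 'page_view'), ('globex', 'page_view');
        INSERT INTO invoices VALUES
            ('acme', 100.0), ('acme', 50.0), ('initech', 20.0);
        INSERT INTO support_tickets VALUES
            ('globex', 'open'), ('acme', 'closed');
        """
    )
    return conn


def build_customer_model(conn):
    """Return one row per customer, joining activity, billing, and support data."""
    # Each source is aggregated on its own first, so the joins cannot
    # multiply rows (three events times two invoices would otherwise be six rows).
    query = """
        SELECT c.customer_id,
               COALESCE(e.page_views, 0)   AS page_views,
               COALESCE(i.total_billed, 0) AS total_billed,
               COALESCE(t.open_tickets, 0) AS open_tickets
        FROM (SELECT customer_id FROM product_events
              UNION SELECT customer_id FROM invoices
              UNION SELECT customer_id FROM support_tickets) AS c
        LEFT JOIN (SELECT customer_id, COUNT(*) AS page_views
                   FROM product_events GROUP BY customer_id) AS e
               ON e.customer_id = c.customer_id
        LEFT JOIN (SELECT customer_id, SUM(amount) AS total_billed
                   FROM invoices GROUP BY customer_id) AS i
               ON i.customer_id = c.customer_id
        LEFT JOIN (SELECT customer_id, COUNT(*) AS open_tickets
                   FROM support_tickets WHERE status = 'open'
                   GROUP BY customer_id) AS t
               ON t.customer_id = c.customer_id
        ORDER BY c.customer_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in build_customer_model(make_demo_warehouse()):
        print(row)
```

The joined table is the part no single source application can produce on its own. A raw-to-raw pipeline would skip `build_customer_model` entirely and copy one source table as is.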

Starting point is 00:33:42 And I actually think it's a better architecture than connecting those apps directly. Yeah, sure. You at least have a hub. Yeah. But I think real value. Yeah. What that person, again, if you're just setting up a pipeline

Starting point is 00:33:53 that's raw to raw, then yeah, yeah. Your job is not that interesting. Yeah. But the reason we employ data teams is that they're actually sitting there going, I think I could take

Starting point is 00:34:03 these disparate pieces of information, clean them, distill them, merge them, and come up with new valuable insight. One quick follow-up question, because I know I want to leave enough time for Costas to ask about the term data federation, because he and I talk about that all the time. And he has some really interesting thoughts, but what are the ways that you see, I love the, the paradigm of IP. What are the ways that you see companies creating that? And I'll just, the, the, the context behind that question is, I mean, some of the most interesting ways I see that happening is through tools like dbt, where you're sort of creating like interesting models.

Starting point is 00:34:41 Of course, I think there are a lot of companies who just maybe even write SQL on the warehouse to perform the joins to create those data sets. What else are you seeing though? How are companies creating that IP? Is there anything interesting in the way that that IP is being generated in the context of those joins? Right. So I think it's always helpful to step back and remember that we are very, very, very deep in the most cutting edge, sophisticated companies. And to your point, Google Analytics is still so widely deployed. And so the majority of this does not happen in dbt, does not happen in all these places, but there is business logic everywhere. There's business logic everywhere.

Starting point is 00:35:28 So there's the query that you wrote ad hoc in your database. Yes. There is, if we were to be really honest, probably the largest repository of these kinds of, of this kind of logic, this kind of query, is not in dbt and GitHub, which I think that's, what's great there is it's starting to become a better repository for this. I really hope our entire industry moves towards that model.

Starting point is 00:35:59 But it's probably, and don't freak out, in Salesforce: SOQL queries and Apex code. I agree with you wholeheartedly, actually. And I think the traditional, you know, kind of, if we think about the sophistication stages, right, the crossing the chasm, etc., etc., right? Silicon Valley and broadly speaking, software companies have moved to this new paradigm, right? Because their most important signals come from their software. And your CRM doesn't store that. So the data warehouse is the perfect kind of query engine and storage and computation layer for that information. And the number of signals that we generate, I don't even know how many events the average kind of software company generates now. But it's a lot, right? That is why we store these things there now.

Starting point is 00:36:49 But if you think of non-software companies, which again, eventually everyone will be a software company, right? So this is why it's like, we all skate to where the puck is going, but there are still furniture companies in the world, right? And you would probably find that the bulk of the intelligence, the IP that I'm talking about, lives kind of glommed onto their Salesforce instance in a collection of maybe checked in, probably not checked in code, code that looks like queries sometimes, like Salesforce has a query language called SOQL, or it's more imperative code like Apex. And the real goal of Census is to kind of move that into a kind of Git-backed, kind of open, standard language called SQL. Yeah. And yeah, that's, I think, the journey that we're going to see over the next...

Starting point is 00:37:36 But it'll take, I'm talking easily a decade plus. Oh, sure. We all in our industry, and it's why we're so exuberant and why we all raise all this capital, is like, we think these things happen much faster than they do. You know, I started my first company, like I said, 10 years ago on a very simple premise that was about if we're all going to live in SaaS, you need to have your employee identity, your password, your login, like centralized and federated. Right. And it seems to make sense. You can't have 8,000 passwords, right? In a company, like, that doesn't work. It's been over a decade and we're still in the infancy of

Starting point is 00:38:13 that market. Like that's how long these things take. And so I think data, we're very much in the early stage. For sure. Back when I was doing consulting, we used to joke about, you know, companies of all types and sizes. It's like, OK, I've never seen a Salesforce that's not like some sort of Frankenstein. And it's easy to talk down to that. Right. Because it's actually very painful. Right. Like it does create pain. But in reality, like it's pretty advanced for a lot of the companies doing it and enables them to accomplish things that are like, what else can they do? I mean, of course, like the modern data stack, but like,

Starting point is 00:38:56 it is very helpful and it is pretty advanced to be able to customize all of this business logic inside of the tool. So that's such a helpful perspective. Yeah. And I think there's going to be this interesting cascade, right? So I think the data community has so much still, and it's exciting, right? That's why a lot of us work in this space. And there's so much to distill from the world of engineering, of software engineering, down into, let's call it, the broader world of data. So now, thank goodness, but like,

Starting point is 00:39:26 we're still at the early days of everyone realizing that you could treat your queries as a piece of code that can be versioned, right? That's still, we're still at the beginning of that, right? And then there's going to be all the other things that go around the software development lifecycle for data. And even there, we have to get quite a bit more sophisticated, right? If we're going to support these kinds of workflows. So I'll give you an example. One of the reasons you're... Because if the cascade is like software engineering,

Starting point is 00:39:55 let's call it to data organizations, and then down to business organizations. So if you think of that Salesforce that you saw in your consulting days, everyone always says, you're right. It's a mess. It's a mess. It's got all sorts of stuff. There's like a field called blah, blah, blah, underscore two. You know, it's like there's tons and tons of them, but what,

Starting point is 00:40:13 how many people in the modern data stack actually run like something equivalent to a migration when their data schema has changed? Right. Very few, if not none. And so we still have to, you know, get more sophisticated in how we manage data in the core, let's call it. But as we do, I think a lot of that will then be able to have this amazing downstream effect on the rest of the business. Yeah. You really made me think, Boris, with the comment that you made about Salesforce and the business logic there,

Starting point is 00:40:51 because you remind me of something extremely painful, which is if and how you can replicate the results of formulas on Salesforce. So I don't know if like Fivetran is doing it today or like they figured out how to do it, but it's pretty much impossible because it's a piece of logic there that is executed whenever you make an API call. Right. And that's like, I think.

Starting point is 00:41:20 That's a beautiful microcosm, by the way, of this whole thing. You're absolutely right. You're absolutely right. Yeah. But that's like the thing. And I think that's what justifies and makes this category of reverse ETL, or whatever we want to call it, important, because at the end, you might be able to export the data from Salesforce, but the business logic is not something that you can export.

Starting point is 00:41:43 Like you need someone to replicate it, which is a completely different story, right? Exactly. So you need to get the data out, but that's not enough. You need also whatever you are going to do with this data to push it back again, right? And these systems are like, I mean, many times I say, like when you get like a salesperson, you can ask many things from the salesperson, but you

Starting point is 00:42:06 cannot ask them to leave Salesforce. That's where they live. They don't want to learn about YouTube. They don't care about that stuff. The only thing that they care about is their quota. That's what they should do. They shouldn't care. Why should they care about whatever shiny technology we have? They would be engineers if they cared about that. But there's versioning, man. It's awesome. Ah, yeah, yeah. Sure, sure. You're right.

Starting point is 00:42:31 It's a QW. Exactly. Exactly. No, absolutely. Absolutely. It's a, it's a, people,

Starting point is 00:42:38 what's the term people like, you know, people live in their pane of glass, right? And it's just like, you can't get them out of there. And I think there were like some attempts to like do that with stuff like Looker, for example.

Starting point is 00:42:49 Yes, yes. The previous version of BI tools, we were like, yeah, ask your salespeople to go and work from within Looker, and then there will be links to go back to Salesforce. Like, no, why? Do you know who suffers from this the most? It's actually kind of tech founders in the Valley because they start their company and they're like, yeah, I got Looker. My salespeople are just going to go there. And it's because like, they're also

Starting point is 00:43:14 deluded because they see this as easy, right? Because you and I can do it. And I'm like, no, they're not, man. They're really not. I promise you they're not. And he's like, it's easy. Like for sure, they're going to do that. Like I can do it. And I was like, uh-huh, uh-huh. And it sometimes takes years for them to realize like, oh yeah, I hired a VP sales.

Starting point is 00:43:34 Yeah, I'm like, they ended up doing their own thing. I'm like, uh-huh, uh-huh. They do their own thing. They do their own thing. So I think, yeah, tech founders particularly, I think, suffer from not seeing this. Yeah, because it's also extremely easy to burn money. Actually, it's one of the reasons that you exist.

Starting point is 00:43:48 So why not pay 50 grand to buy a license for this thing, right? So yeah, anyway, that's another very interesting conversation that we need to have at some point. But yeah, that was a very, very interesting point that you made there. But I want to go, you used the term federation. Eric mentioned that I want to ask about that, but traditionally, and like from the view of an engineer, federation and ETL are like two completely different things. Yes.

Starting point is 00:44:15 Actually the opposite. Like when you are talking about federation, it's more about, no, I'm not going to collect the data into one place. I'm going to ask each data source and then I will federate the results and present the results. That's true. So if this is like what you are thinking of as a solution, or unless you have like a different definition, I would be more than happy to discuss that, where do we stand today and where do you see it going, right? Because today, I don't know, like technically speaking, this is not federation that we have. No, no, no, I think that's a very reasonable technical pushback. So let me start

Starting point is 00:44:52 with an analogy I tend to use with my team, and you're going to appreciate it because I think you're close enough in age to me, but I'm starting to notice that like younger people are like, what is he talking about? So your laptop, your computer has an operating system in it. And it provides a lot of things for you, the user, and for the applications that are built on it. And I think that when we moved to the web, there are certain things that we kind of lost along the way. We gained a lot, so that's fine. But we lost a few things along the way. So one is login, right?

Starting point is 00:45:31 So when you log in, you're gonna be able to log in once. And then like, you don't open Word and it goes, please log in. You don't open Photoshop or whatever and it says, please log in. With the caveat that everything now is a web app. So like, that's different now, but let's put a pin in that. So that was, you know, your identity, your user identity was just given as part of the operating system to all the other applications. So they just were receivers of that knowledge and just used it.

Starting point is 00:45:58 And in the same way, there's a file system in your operating system, right? Your computer has a file system. And when you open a file in Word and you want to open that in Excel, it's the same file. They don't both have to implement a file system to be able to read and write data. And so I think when we moved to the web, we lost both of these things. And funny enough, both companies I've started are solving these two things. And so when I think of data federation, the reason I use that term is I think that in order to have a wealth of SaaS applications exist, which is what I want, right? You're going to always hit this natural friction around replicating the data correctly and consistently, right? Because it's a distributed system and they all want to speak about the same things. So this is just, you're always, the more apps you have that all

Starting point is 00:46:50 speak roughly about the same things, you're going to have master data management problems. You're going to have all the things that kind of as a distributed systems minded software engineer, you can think through and they're hard. And it only gets worse for every N plus one application you want to use. And so I think there's only two ways in the long run that this gets resolved. One is the one I don't want, which is everything gets progressively acquired by larger companies. And because then they can create that integration, right? They can create the tight integration between Slack and Salesforce. I'm sure they will.

Starting point is 00:47:29 And Microsoft is, and maybe it's because I started my career at Microsoft that I saw this, because Microsoft is basically the best company in history at doing this. Having built unbelievably great technology to do interoperation between its applications. They do this because they can work together

Starting point is 00:47:42 and they can force Excel to do something that then Word will also abide by. And so that's one option. And we see this, right? The more we get in the later stages of SaaS, which is now year 20 of SaaS, right? Like we see these pressures. And the only alternative that I think of

Starting point is 00:48:01 is that for some of these things you need to come up with a different model than just independently replicating the data in bespoke ways in every application. And so that's why I use the term data federation, because I believe that as a company, if you want to use the maximum number of SaaS applications with the most freedom and not be tied to one vendor, you want to be able to own your data and then seamlessly have it be usable in any application. So today, my only option to be able to enable that world for people is to say, okay, what is a place?

Starting point is 00:48:36 Let's work from first principles, right? Well, you need to store all the data in a way that is most cost-effective and scalable. Data warehouses. It's that or S3, right? It's like either just raw storage or data warehouse. Those are the best tools we have from first... If something better came along, I'll take it, right?

Starting point is 00:48:54 But right now that is what's best. And then I want seamless ability to use that data from any application. If I could eliminate the data pipelines and just say, you know, your app is built directly off the data, that'd be great. But because of the way OLAP, you know, warehouses are designed, because of the incentive structures in the market today, you can't, you don't get that, right? So there are tools, by the way, in like Salesforce has this concept, they're the only one, but they have this concept like external objects where you can have an

Starting point is 00:49:26 externally backed data store, but it's slow. And then you don't get all the features and you don't get the formulas and you don't get the indexes and you don't get all the things. So thus what Census does, which is we will push the data into the internal file system of each of those products, thus turning them into a kind of high performance cache on a single data store. And that's what I mean by data federation. Yeah, makes total sense.

Starting point is 00:49:55 I have a, it's not exactly like a product question. It's more like, it's probably like a, yeah, it is a product question, but it has more to do with like the experience of building a product. Sure, sure. So since you first launched Census, what have you learned by building this product? Great question. I think I would say that I've learned the most about our users, right? And data teams as a whole. And so it's been really fun to watch them on this journey over the last three years, just working with people. It really is the thing that always comes to mind, which is the first experience I had when we started selling this to users was, hey, great, this is going to save me time. Or this allows me to do the thing that I didn't know how to build. I don't know how to write this kind of connector.

Starting point is 00:50:56 So it's great. I write SQL. I don't know how to write Python. That was the initial experience we had. And that was not surprising. That was not something I was like, ah, what a discovery I've made. But then, and we talked a bit about this, but it became very visceral to me. After a little while of, especially in the early days, our early users using our software, but now it's become kind of, it happens more often. I started seeing a very unusual reaction from our users that actually caused me real pain. Like I was worried. I was actually really like, are we screwing up here? This seems bad. These are bad. These are not the words you want to hear from a user, right? You want to hear excitement,

Starting point is 00:51:36 power, enjoyment, right? And multiple customers started using effectively expressions of fear. They started like genuinely saying, I'm scared, in so many words. One customer was like, this is, I feel like I'm holding a machine gun, like I paid him. I was like, well, that's not the feeling I want to engender in you. But you know, so I could have shied away from that. I could have been really freaked out, but I started to think about it. And what I realized is Census is, this is what I mean by it's not just a data pipeline.

Starting point is 00:52:14 It's giving these users a power they've never had before, right? The power to do analysis is not new. It's massively improved with great tools. But the ability to analyze data is something they always had. But the ability to, from your something they always had. But the ability to, from your vantage point on the data organization,

Starting point is 00:52:30 to cause a marketing email to get sent, to cause a salesperson to wake up in the morning with a task to call this person, that did not exist before Census. And of course it's scary. Like now it's your fault if something breaks or breaking would

Starting point is 00:52:49 be ideal. Like if Census said, hey, sorry, the pipeline can't go today, that's actually bad, but nowhere near the worst case scenario. The worst case scenario is you push bad data, extra data, data that is going to be embarrassing when it goes out. And so that was the emotion that they were trying to convey to me. And so now I spend a lot of time really thinking about how can we build capability into Census that improves your confidence. So I think this is the point. We have a lot of experience in the world of software on how to be agile, but safe, right? Code reviews, testing, unit testing, like just decades of it. And our education and our content, right, is to try to teach how to make this less scary, but also to embrace a little bit of the fear, right? Because I don't want people to go back to, I'm only going

Starting point is 00:53:55 to press the go button once a year because I don't want to break things. And so that's probably the biggest thing I've learned, is the biggest hindrance to deploying Census is actually helping people overcome this new responsibility, this fear that comes with it. And I'm like, but on the other side is so much power, so much growth, so much more your team will be able to do. And so you should embrace it. But it is genuinely scary. And so that's a first in my life, to have built a product that freaks people out. Yeah, no, no, no. I mean, it's a good problem to have because, of course, it's, I think, an indication of the value that... I'll give you an example of how this manifests.

Starting point is 00:54:39 Speaking of product, we can do a very narrow... Because I think this is not solved with one giant... My marketing team is going to hate me solved with one giant, my marketing team's going to hate me, like one giant whiz bang feature that you can announce, right? It's, it's a collection of very like fine grain thinking, like small features here and there. And so I'll give you an example. So there are a lot of products that when you write into them, to your point about like reading and writing is very different. They have, there's a term in compilers, you know, about there's defined behavior, and then there's undefined behavior, and then there's unspecified behavior,

Starting point is 00:55:12 which is actually like a different thing, which means like it'll work, but I can't tell you what's going to happen. So when you write duplicates into some system, not all, that's the beauty of it, right? We support like 50 different applications and like all different, you know, different behaviors. Some of them will behave in very unusual ways when you sync duplicates.

Starting point is 00:55:32 So some of them will reject it. Some of them will just pick one and you won't know which one, right? And so that is something when we built the very first version of Census all those years ago, we just said, here, let's take the table and like just efficiently, our goal was to get speed, so it's like, let's get it as

Starting point is 00:55:48 efficiently as possible into the destination. And then we didn't know, like, oh, turns out people have plenty of duplicates. Like the warehouse is not enforcing, you know, unique IDs. So they're syncing duplicates and like we were like powering through. We're like super fast, like yay, go sync millions of duplicates. No problem. And then you're back to the same old problem of the sales team or the support team or the success team or the marketing team is like, this data is wrong. Screw the data team. Let's go back to doing our own thing. I don't like these guys. And so now we've added the capability. It's built in. You can't turn it off, which is we will block duplicates from being synced, like we

Starting point is 00:56:25 will block them. Because there are even some people who are like frustrated by this, because it shows up as errors and they're like, but it's not an error. But I was like, we're going to treat it like an error, because you're not realizing this has annoying downstream effects on your team. So it's, you know, it's a million things like that that we've had to kind of invest in. Yeah, I totally get that. There are like two things that I think people don't realize when they start using products like Census. One is that the Census team has to learn to work with a technology that is completely opaque, right?
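
The duplicate-blocking behavior described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Census's actual implementation: the function name `split_duplicates`, the row shape, and the choice to block every row that shares a key (rather than silently picking one) are all hypothetical.

```python
from collections import defaultdict


def split_duplicates(rows, key):
    """Separate rows that are safe to sync from rows whose key is not unique.

    Hypothetical helper: every row sharing a key value with another row is
    blocked and reported, so the destination never has to guess which one wins.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Keys seen exactly once are safe to send to the destination.
    syncable = [group[0] for group in groups.values() if len(group) == 1]
    # Keys seen more than once are surfaced as errors instead of being synced.
    blocked = {value: group for value, group in groups.items() if len(group) > 1}
    return syncable, blocked


if __name__ == "__main__":
    warehouse_rows = [
        {"email": "a@example.com", "plan": "pro"},
        {"email": "b@example.com", "plan": "free"},
        {"email": "a@example.com", "plan": "free"},  # same key, conflicting data
    ]
    syncable, blocked = split_duplicates(warehouse_rows, key="email")
    print("would sync:", syncable)
    print("would block:", blocked)
```

Treating the conflict as an error matches the reasoning in the conversation: a destination that picks one duplicate arbitrarily produces data the downstream team cannot trust, while a blocked row is visible and fixable in the warehouse.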

Starting point is 00:57:08 You have Salesforce on the other side, and it's very interesting. There were cases that we couldn't predict, even being inside Salesforce. There were edge cases that we couldn't replicate by having access to the whole infrastructure and all the knowledge that Salesforce itself has. So imagine now that you have Boris and his team and they try to interoperate with Marketo. I don't know how many people have worked with Marketo, but I mean. Is there an off the record version of this podcast? I mean, it's the best. It's the dominant marketing platform and for a good reason, I'm sure. But like interoperating with it is a completely different thing. There are errors that are not documented.

Starting point is 00:58:11 There are behaviors that are not documented. They are not documented for a very good reason, because all these APIs, they were not built for Boris and his team to send data. Right? They have a completely different specification. That's one thing that people keep forgetting, I think. The other thing that they keep forgetting is that as we add more and more systems into this stack or architecture or whatever, we are actually building a super complicated

Starting point is 00:58:38 distributed system. Right. And distributed systems have some very specific rules, and delivery semantics are something that might sound very theoretical but is actually very, very practical. And I don't expect anyone in sales to know that one of the ways that we can deal with that is to have at-least-once delivery semantics, right? Yeah, sure. I mean, it doesn't work at the end. Like, because I'm getting PTSD, I remember using the term eventual consistency in front of a marketing team and they were like, no, no, we need it, we can't have it be eventual. And I'm like, it has to be eventual. The speed of light

Starting point is 00:59:18 is not negotiable. And they're like, oh, that's what you mean? I'm like, because in their brain eventual meant like it'll come up tomorrow. And I was like, wow, I forgot that this is a term that we use in distributed systems that has nowhere near the same meaning. Also, it goes both ways. Real time doesn't mean real time to a lot of people. So yeah. The reason I'm saying that is because I think

Starting point is 00:59:40 there's like a very important element and that's, we are all responsible for being in this market and that's education. Like we need to make sure that like outside of building, like actually I think that like, it might sound a little bit exaggerated, but part of the product is also education. Like how we can help people understand what they can do and how they can do it with their,

Starting point is 01:00:02 with their technology, because there are limits and engineering is about trade-offs and we have to make these trade-offs. Otherwise, like we are not going to have products that work. Yep. Yep. No, I think interesting products tend to have this educational component and I wholeheartedly agree that that's part of the journey we're all on. And especially, again, the world is large. And one of the things I have learned

Starting point is 01:00:28 is the world is nowhere near as sophisticated as people think it is. Oh, yeah. I tell people this even more. Like Silicon Valley is not even as sophisticated as you think it is, right? And like we work, you and I work with some of the best, right?

Starting point is 01:00:43 And it's like, sometimes I'm like, wow, this is, I remember I used to do really fancy demos in the early days. Really, I would try to drop in words like AI to just, again, you're like, yeah, like you need all these things and da-da-da-da. And it's like, and then one day out of expedience, I didn't have time that day. I did the dumbest version of the Census demo. This is back in 2018, 2019. I did the dumbest one, where there were two metrics you could set up in 12 seconds. Count page views. You know what I mean? It was like, count pages. And then I was like, let's just put that in Salesforce for a customer success team to know how many times they've visited your product. That was it. That was the demo. And I was actually concerned, like, and embarrassed for them at

Starting point is 01:01:30 first. Cause I was like, they were in awe. Like people were like, this is the greatest thing since sliced bread. And I was like, this isn't the, what, this is the basics. This is not the, this is not the wow demo. This is not the wow demo. Like why, why are you guys wowing? And, and it's like, you forget how, how starved people are for this. Right. And then you're right. It goes hand in hand with, then you start delivering stuff and then they, yeah, you have to, we have to find a way to, we're going to have to do a book like distributed systems for, for,

Starting point is 01:01:57 for regular people. Cause yeah. Yeah. It's because I think it's too intuitive for you and I. We know it so well that we take it for granted. And then you end up in these weird miscommunications. And I think the need to educate is doubly so. Because you are right that we need to educate just to serve our own users. Think of what Fishtown and dbt have to do, right? To teach the concept of version control is like super valuable.

Starting point is 01:02:26 Just to teach that is unbelievably valuable. And if I think about what we're doing is we're turning the data team into this kind of like company platform team. So we need to help them explain what's happening to everybody else. Otherwise they will also fail. So we have to act as like their advocates

Starting point is 01:02:43 to the rest of the company. And like, that's super essential. So you're right. The education is unbelievably important. Yep. A hundred percent. Yep. Hopefully these conversations help. Oh, this is great. Boris, this has been such a fun conversation. Brooks actually let us run a little bit long, which is super fun when we get permission to do that. But we're at time here. This has been such a fun conversation, really helpful for me. And I think definitely for our listeners as well.

Starting point is 01:03:08 So thanks for the time. I mean, thank you. Thanks for having me. First of all, I have to say that Boris is so articulate. I find myself jealous of his ability to explain complex things and even dip into the world of, you know, sort of formal computer science

Starting point is 01:03:26 in a way that's so accessible. So, hey, I appreciated that and learned a ton from him. My takeaway is around the way that he described sort of value that's created in the warehouse as it relates to data that's transformed, say for downstream tools, sort of creating value with data, right? And he described that as any data that needs to be joined in order to produce some sort of valuable asset. He described that as IP, which I think is such a helpful way to frame the concept of creating whatever kind of value we're creating in the warehouse, right?

Starting point is 01:04:06 Whether it's a unified customer profile or packaging some sort of analytical component from one business unit and sharing it with another. So I really, I just really appreciated that. I think it's been helpful for me to think through that. Yeah. I mean, okay. It was like an amazing conversation I think we had with him in general. There are like many insights for someone to take from this conversation. What I keep, I really liked how he's using the term federation. This was like something that we discussed also during the show. Traditionally, federation has a different meaning, but it makes a lot of sense the way that he's using the term federation.

Starting point is 01:04:48 And that was very interesting. And it was also super interesting to discuss with him about all the challenges around building a product like this. So hopefully we are going to have him again in the future and we have more stuff to chat about. Absolutely. All right. Well, thanks again for joining us on the show. Lots of great episodes coming up. So we'll catch you on the next one.

Starting point is 01:05:12 We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com. Ciao.
