The Data Stack Show - 253: Why Traditional Data Pipelines Are Broken (And How to Fix Them) with Ruben Burdin of Stacksync

Episode Date: July 16, 2025

This week on The Data Stack Show, Eric welcomes back Ruben Burdin, Founder and CEO of Stacksync, as they dismantle the myths surrounding zero-copy ETL and traditional data integration methods. Ruben reveals the complex challenges of two-way syncing between enterprise systems like Salesforce, HubSpot, and NetSuite, highlighting how existing tools often create more problems than solutions. He also introduces Stacksync's approach, which uses real-time SQL-based synchronization to simplify data integration, reduce maintenance overhead, and enable more efficient operational workflows. The conversation exposes the limitations of current data transfer techniques and offers a glimpse into a more declarative, flexible approach to managing enterprise data across multiple systems. You won't want to miss it.

Highlights from this week's conversation include:

- The Pain of Two-Way Sync and Early Integration Challenges (2:01)
- Zero Copy ETL: Hype vs. Reality (3:50)
- Data Definitions and System Complexity (7:39)
- Limitations of Out-of-the-Box Integrations (9:35)
- The CSV File: The Original Two-Way Sync (11:18)
- Stacksync's Approach and Capabilities (12:21)
- Zero Copy ETL: Technical and Business Barriers (14:22)
- Data Sharing, Clean Rooms, and Marketing Myths (18:40)
- The Reliable Loop: ETL, Transform, Reverse ETL (27:08)
- Business Logic Fragmentation and Maintenance (33:43)
- Simplifying Architecture with Real-Time Two-Way Sync (35:14)
- Operational Use Case: HubSpot, Salesforce, and Snowflake (39:10)
- Filtering, Triggers, and Real-Time Workflows (45:38)
- Complex Use Case: Salesforce to NetSuite with Data Discrepancies (48:56)
- Declarative Logic and Debugging with SQL (54:54)
- Connecting with Ruben and Parting Thoughts (57:58)

The Data Stack Show is a weekly podcast powered by RudderStack, customer data infrastructure that enables you to deliver real-time customer event data everywhere it's needed to power smarter decisions and better customer experiences.
Each week, we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Hi, I'm Eric Dodds. And I'm John Wessel. Welcome to the Data Stack Show. The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work. Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. Before we dig into today's episode,
Starting point is 00:00:30 we want to give a huge thanks to our presenting sponsor, RutterSack. They give us the equipment and time to do this show week in, week out, and provide you the valuable content. RutterSack provides customer data infrastructure and is used by the world's most innovative companies to collect, transform, and deliver their event data RutterSack provides customer data infrastructure
Starting point is 00:01:01 joined us at Data Council. We did a little bit of a lightning round at Data Council, Ruben. So we'll take our time to dive deep. Thanks for joining us again for your second slot on the show. Yeah, thanks so much for hosting. Great, well give us just a little bit of background for those who didn't hear the Data Council show.
Starting point is 00:01:17 Give us a little bit of background on yourself and then just the one or two sentence overview of Stacksync. Yeah, perfect. So, my name is Ruben. I'm co-founder and CEO at Stacksync. So I am based here in San Francisco, California, building Stacksync with our team. I'm originally from France, so a bit of my background, you know, I did study computer science, double degree in computer science and one degree in business as well, back in
Starting point is 00:01:39 Switzerland. And then I worked as well, you know, in Germany and in Singapore. And actually, this is also where I got really in touch with the world of two way syncing because I was working, you know, in a company and I was as a consultant and was in charge of putting everything in place, you know, from accounting software to ERP to CRM and I, you know, all of these tools to work in two way sync. So what I did in the CRM reflected to the ERP and vice versa.
Starting point is 00:02:08 And there were no products on the market. You know, like then I searched, you know, I tried to build, you know, some alternatives myself, you know, with somehow, you know, workado, etc. And none of them really worked, was really complex. And I just couldn't leave the company because everybody was afraid to take this work over. And this is where actually I realized, you know, like this is where I realized, you know, like this is where I realized, you know, this is a big whale problem. Everybody's complaining it should exist. And so, and there I committed, you know, I started an entrepreneurial journey and now
Starting point is 00:02:34 here we are, we did YC. So Stacksync basically we were running for a year and a half and we did YC, Y Combinator in the Winter 24 batch. And since then, you know, we moved to San Francisco and really got this explosive growth that we have at the moment. Awesome. Well, one thing that I'm excited to talk about is where two-way sync fits into the stack because there are a lot of companies who, you know, sort of use a
Starting point is 00:02:59 traditional sort of in, transform, out type loop. So I'm excited to dig into that and just learn more about two-way sync in general, dive deeper than we went last time. How about you Ruben, what do you wanna talk about? Absolutely, so I'm super excited to talk about this two-way sync, what it fits in the stack, but as well as like, you know, I'm very surprised
Starting point is 00:03:20 how marketing is actually reshaping the perception of people on zero on, you know, zero copy ETL, this kind of trend, you know, and how actually exists right now as we stand, you know, it's most of marketing and little tech, right? And it's crazy how much, you know, this tech people actually go into this, this, this fantasy of vendors, you know, selling it. And so, yeah, extremely happy actually also to decode that a little bit further. And yeah. Great. Well, let's dig in. Let's do it. Ruben, I love having second time guests because I can say, if you haven't heard the first show,
Starting point is 00:03:59 go listen to it and you'll get context and listen to them in a row. So we can kind of dig right into some of the spicier topics. The first one of which we'll cover is that zero copy isn't a real thing. And so very excited to dig into that. But first, I want to do two things. give our listeners a high level overview of StackSync. We'll get way more into that later. You mentioned when we were recording the intro that you were a consultant, you were trying to get all these tools to talk together, you know, CRM to talk to the ERP,
Starting point is 00:04:33 and it was a really brutal problem. You tried all these tools. What was the worst integration problem that you faced during that period? Where you just, that period where you just You know, you just said this is so Gnarly and bad that you know, did you ever think about giving up? I mean, what was the nastiest problem? Absolutely. I mean like this is I mean, you know building to with sync with workflow automation tools or code
Starting point is 00:05:00 Was as much like really brutal. First of all, you have to think about the whole architecture, then you start building. So after maybe like a day or two, you get the workflow. And then you realize, okay, now when I edit a record on one side, it goes to the other side, and vice versa. And then you say, okay, well, cool, this is great. So now, okay, so this is for one record, which I just created. What about for update? Oh, and now you have to figure out an entire update logic, right, and you realize, you know,
Starting point is 00:05:28 I'm gonna take the entire record. So whenever, you know, I have a record of it in Salesforce, I'm gonna update into my HubSpot and vice versa. And then you say, well, okay, so I just, because Salesforce only tells you that the record has been updated, but not which field. So now you say, well, intuitively, I'm gonna take the entire record. I'm gonna push it into HubSpot, but not which field. So now you say, well, intuitively, I'm going to take the entire record.
Starting point is 00:05:45 I'm going to push it into HubSpot. But then you override the email. So now you have a, for sure, you have a marketing person in the company creating a sequence, and say every time Mark, the email is updated, enrolls this into a welcome email sequence. So the guy, so customers start now receiving welcome emails every time something is updated on the CRM. And yep and so that's as a structure you have to know exactly what
Starting point is 00:06:08 changed now you have to run an entire system which detects which field was updated into the same record and now you start storing data and then you have deletes as well right and once you have done it say all the credit operations you're like well that's great but now this is for one record. Let's backfill the data. And now you have to backfill all the historical data of the same system. And now it becomes extremely complex. And now every time that your system misses one event
Starting point is 00:06:36 for any reason, one custom field added, someone changed the name of a field or something like this, it bugs and you lost the data. There is no monitoring. And so the maintenance over like the first three weeks were so huge. It was taking my full time integrator position for a single contact sync pipeline. And so, well, I'm laughing. It's a painful laugh because this is over 10 years ago.
Starting point is 00:07:05 But I have to confess that I was a victim to the promise from the sales team that I can't remember if it was Salesforce or HubSpot, but it doesn't matter which one it was, but they're like, oh yeah, we just integrate with, you know, let's just say it was HubSpot. They're like, we have a direct integration with Salesforce. All your data just goes back and forth. And you just configure a few things, right? And I was like, oh, a direct integration with Salesforce. All your data just goes back and forth. You just configure a few things, right? And I was like, oh, well, that's great.
Starting point is 00:07:28 And I could not have been more wrong. And actually the thing that, and I'm sure things have gotten better since then, because it's a long time ago, but it was impossible to troubleshoot, especially with large batches, right? I mean, you upload a bunch of leads, you do all of that. But the other thing actually that I'm interested in is that was one contact sync flow. But if we take Salesforce and HubSpout as an example, any data engineer or analyst who has worked on data from either or both systems knows that a contact record
Starting point is 00:08:03 and what that means is not the same in these different systems. And the database design is different and they can actually have pretty dramatically different meanings even across teams. And so that's what you described sound really painful with the assumption that it's a shared definition across the two systems and across the two teams, which is actually more rare than it is common. And I think probably what you had 10 years ago is probably even worse now, because the systems are more complex instead of getting simpler because technology evolved. Because it's even worse. And I'm telling you, now we're just talking about how to sync systems,
Starting point is 00:08:42 databases, CRMs, between each other, et cetera. But now you're mentioning the problem of the definition of a sync. So it means you need to have filtering, right? But filtering from A to B is not the same as filtering from B to A, even with the same definition, right? So this goes very crazy.
Starting point is 00:08:59 And also if you want to really go even deeper, no, contacts are not standalone. Contacts belong to a company. So now you want to really go even deeper, no, contacts are not standalone. Contacts belong to a company. So now you have to associate. Now you have to associate different association models. Also, you have ordering, right? Because you have to first create a company, and then the contact,
Starting point is 00:09:16 because the contact belongs to the company. But like in other systems, you might have different association systems. I mean, so where you have to create a record, to your company, and then associate them, you know, so it really, it's really different. And so you have all of these complexities to actually maintain. And this is what makes Twasync very hard.
Starting point is 00:09:35 So in the very beginning, it's very hard to go into the marketing and say, yeah, you know, it's very easy, you know, we integrate, etc. And for example, say, there is this very big, very big, et cetera. And for example, let's say there is this very big pain on the market at the moment, right? Which is HubSpot, Zendesk, for example, right? Integration of HubSpot and Zendesk. You know, the team at HubSpot is gonna tell you, yeah, of course we integrate with Zendesk.
Starting point is 00:09:58 You can use Zendesk for your customer support, HubSpot as your CRM, and this is gonna work fine because we have two async. So first of all, only contact and companies and tickets are associated. But actually the tickets, it's not the tickets as you imagine them in HubSpot. So the Zendesk tickets is gonna have to be synced
Starting point is 00:10:14 to the service hub of HubSpot. So you have to subscribe to the Zendesk of HubSpot to get the sync. But because you use Zendesk of HubSpot to get it synced. Which is a separate body. Because you use Zendesk, you don't need the service hub of HubSpot. You're actually just buying towards the same product to actually use it. This is not the tickets you want to sync. So eventually, it's just not syncing.
Starting point is 00:10:41 You won't have for all of your marketing contacts, right? So you want, so because, you know, contacts will sync, you know, have different definition in HubSpot and in Zendesk, every person you send a marketing email, you know, even a call lead from HubSpot, you don't want it to sync to your Zendesk server system, which is made for your customers or the people with very high intent. And so what about this transformation
Starting point is 00:11:06 which happened in between? Custom objects are not supported, associations are not supported, it's completely crazy. And so, and this integration are very costly and they still sell. Still sell. Yeah, it's wild, yeah. That's why the original two-way sync is called a CSV file.
Starting point is 00:11:23 Exactly, exactly, exactly. I mean, is called a CSV file. Exactly. CSV file and lots of VLOOKUPs and stuff. You mentioned in the intro that it was hard for you to leave the company just because no one knew how all this worked, right? And CSV files can be that way, where it's like, there's one person who knows how to get it just right. Yeah, it's crazy. I mean, and this is where Stacksync came, all about came to be, right? So Stacksync basically also, you know, really deep dives into this nature and to this reality of two-way syncing between enterprise systems. And where we really are, you know, really as a leader is really when it works at scale, right?
Starting point is 00:12:07 So Stacksync basically really gets this two-way sync at scale. And it requires a whole complex engineering, which is almost a database level, you know, conflict management, you know, technology, which is like, it's been developed over like tens of years. And so, yeah, so just more about Stacksync, you know, Stacksync is today's leader in real time and two-way syncing between enterprise systems and databases. So, Stacksync supports CRMs like
Starting point is 00:12:32 Salesforce and HubSpot, Zoho, et cetera, but also ERPs like NetSuite, ACP, Acumatica, and all of these tools basically, they can be synchronized in two-way sync with databases, such as Postgres, Snowflake, BigQuery, MongoDB, MySQL, OracleDB, you name it. And so what ReleaseTaxing enables to do is really to actually bypass all of these IPaaS tools, all of these complex in-house code, custom code logics, and just have a two-way sync as you would think it is in a human manner. Just simplify your architecture diagram to a very baby level, right? Just like I have, like, no, two, one, you know, one CRM, one database. Whenever you modify something into the CRM, it goes into your database.
Starting point is 00:13:18 And when you modify that same table, no, not on the table, the same table, back, it's actually going to write back into your CRM or ERP. And that's what Stacksync really offers in real time with millions of records per minute, technically at big times. I love it. Okay, I have a ton of questions about that, but I promised the listeners we would do the spicy,
Starting point is 00:13:39 we would get to a spicy take early, which is zero copy isn't a thing. So I want to move around back to Saxon specifics, but this really piqued my interest when we were chatting before the show. I think you used the phrase zero copy isn't real. I think that was the phrase. Okay. So give us the spicy take because it has been a major topic of discussion.
Starting point is 00:14:08 Product launches, feature launches, a lot of ink has been spilled. And I'm sure all of our listeners have heard of it, but for those that haven't, what is the promise of zero copy? Give us a baseline of what does zero copy mean. Yeah. So let's get zero copy or maybe in its full term, zero copy ETL or zero ETL even sometimes. It's basically the fact that right now, basically, you have to use five trends, stitch, or byte, et cetera,
Starting point is 00:14:36 to transfer your data from an external source to the warehouse. And so with this, there are a lot of recent tech developments, which actually tells you, okay, maybe we can agree on a common data format. And actually, every system would pump on the same storage. So there is no copy between system actually, you have one source of truth of data. And you know, the CRM would actually pump on this data. So that warehouse will pump on this data. Actually,
Starting point is 00:15:03 there is no transfer, just a single place, a single storage, right? And to this, there are technical and business challenges that at least 10 years or 15 years before it's solved if it will ever be, right? Because it's really a business problem actually. It's a very root. And every tech problem is a business problem in the end. And so-
Starting point is 00:15:24 Dig into that a little bit more. What is the business problem tech problem is a business problem in the end. Well, dig into that a little bit more. What is the business problem? What is the business problem and why will it potentially not get solved? Yeah. So it's basically like ETL in general is basically like we need to have that different data sources and we want to bring everything in the same place. So we can actually have, you know, data, you know, data available for insights and reporting and do all sorts of like operational things. So it's a business problem. We need to do some real
Starting point is 00:15:48 stuff and make some real money with this. And now what happens with this zero ETL is that, okay well shipping data from A to B is actually very long, costly, and hard to get very accurate at scale. Yes, yep. So then what happens is that Stacksync, I mean Stacksync, or any vendor actually will just transfer data, but it's gonna get long. So people say, well, let's agree on a common data format and a common place of storage. So we don't have to copy.
Starting point is 00:16:19 Everybody can just come and grab what they want, but there is no transfer, you know, there is no transport, it's just a common place. Okay, so that's a very good idea. But then this has a very big business issue. Why this almost cannot happen is because data warehouse, I mean, this common data format has to be sort of efficient for everybody.
Starting point is 00:16:41 Data warehouse work in a very different way than we used to make some query at scale. efficient for everybody. Then there were how work in a very different way than, you know, we're still make some query at scale. CRM, which is to retrieve records with different indexes, you know, to make it very fast for users, which work on a daily basis. Right. So this means that the storage of this two different, you know, the Salesforce and this Snowflake have to be very different in
Starting point is 00:17:02 nature, in nature. So just performance wise and businesswise, it's a problem. But also strategy-wise, right? I mean, like, what do you think? Do you think Salesforce is going to open up their backend and all of those are business secrets to everybody, right? It's like, you know, this is not, you know, the schema, the schema will never be exposed because also like
Starting point is 00:17:24 in the database schema, a company has much more than the data which is exposed to the customer. They have also a lot of metadata, a lot of organization, relations, optimizations, and you know like the storage just cannot be the same because some part of it needs to be masked. So now you have to have very deep row level and column level access rights, right? So this causes another problem of security. How do you make sure this never leaks? And so you have all of these business problems
Starting point is 00:17:50 which actually make sure you always have to make copy simply because the data which you operate in cannot be endangered. And also we have a common place of storage. And I'm throwing another question to the industry. So Salesforce and Snowflake would share the same storage? Where is the storage? Who pays for this? Where is it? Who accepts to have latency? Do we all lock in the same vendor into the same Amazon S3 bucket or Google Blob storage? Do we have to lock in into that storage? Because now if we move,
Starting point is 00:18:23 we have to move everybody. So it's even more if we move, we have to move everybody, right? So it's even more locked in. So we have so many problems that happen. And so all these issues, right, are something which make, you know, zero ETL or zero copy really something which is, you know, almost fantasy as today. Yep, yep. It was interesting.
Starting point is 00:18:42 We had a guest on the show from, Yep, yep. So, you know, you can of course see that there would be benefit in sharing, you know, having some crossover data there that benefits both parties. So they had really similar things to say about clean rooms. Cause they thought, oh, well, we'll just use clean rooms to facilitate this. But when they really started to dig into it, they found similar things to what you are saying where it is that you can do some things with it, but it is not, you know, it doesn't quite live up to everything that the marketing says it is. As far as this full seamless functionality where you can just dump data
Starting point is 00:19:38 into a clean room and all this magic happens, it's actually, you know, there is actually a lot of work in order to figure out how to make it work well with both parties. And so they ended up actually building a, you know, sort of a different architecture, but fascinating, fascinating. It's the power of marketing. Yeah, the power of marketing. And this is, and this gets even more critical, right?
Starting point is 00:19:58 Because like, you know, if you really deep dive into how it really works, okay, so you go to the Salesforce to Snowflake data sharing, right, it's a zero copy, zero caffeine, zero everything. It's like, there is zero, zero nothing, zero calories. Right, it's very Buddhist. It's pure. Blank. Yes, pure. Pure data, right?
Starting point is 00:20:23 Just pure data as a red piece. And then you go into this and you say, well, in the documentation, you have a five minutes replication lag. So I mean, like, if I really read zero copy and I really understanding as a human would five minutes, you know, it's already, there is a problem if it's the same place and sorry, five minutes and replication lag. So if you have the word replication into the documentation of something, which is zero copy, that's concerning, right? That's concerning.
Starting point is 00:20:49 And so eventually what is all of this, like my take on what is data cloud, data sharing with Snowflake or HubSpot, Snowflake data share, this is just Salesforce Snowflake account. Yes, that's what I'm doing. It's just like, is they manage the entire ETL pipeline for you, this is just Salesforce Snowflake account. They manage the entire ETL pipeline for you, and they give you a Snowflake account, which you can actually just grab the credentials,
Starting point is 00:21:15 you can grab the credentials and just query it. That's it, you can't write to it, you can't transfer it, you can't do anything, you can add custom fields, you can just query it. It's just a dump of data, which is locked into your Snowflake. And so what that means, for companies, you say, well, it's great, instead of going to FavTran, I can actually buy all of these tools.
Starting point is 00:21:35 But it's a very big problem now, because when you have plenty of pipelines with FavTran, et cetera, you have both discounts. But when you actually have to buy this small item from Salesforce, this small sync from Hubbot, this small sync from Zendesk, you have to actually contract with 10 or 15 different data sources, I mean, vendors to actually get data into your data warehouse instead of just having one ETL, which is maintained by your data engineering team.
Starting point is 00:22:03 So now you have a distribution of ownership of these pipelines, which to people who have nothing to do with pipelines because they are just working in Zendesk or just working in Salesforce. And so that's a very important strategy and cost problem as well, which also make zero ETL. So zero ETL is complex from a technical standpoint, if not saying almost impossible. It's complex from a strategy standpoint, and it's complex from cost standpoint.
Starting point is 00:22:31 So actually, all components which drive a business in reality are just not present into this zero-copy landscape. So this is, it's the most not existing, basically. Yeah, it is a really not existing, basically. Yeah, it is a really unfortunate term because I can see a narrow use case. When I say narrow, what I mean is there's a business team working in some tool and usually have an operations person.
Starting point is 00:23:00 And there is a use case for being able to write a query in that tool and pull in some data set or something. And actually I remember we had someone from Braze on the show. Braze is a marketing tool. You can send customer communications and create customer journeys through Braze. And they launched a tool that allowed people to write SQL queries and you could sort of pull data in. And he thought, power users will love this, but like adoption was way more than he thought and like a ton of people use it, right? And so there is this interesting use case there
Starting point is 00:23:34 where it's like, okay, you need some specific thing. Your data team has probably materialized a couple of views that have some things and some valuable data fields, just pull them into your tool. That's actually a totally valid use case, but calling it zero copy ETL is really misleading because that's not actually the value that it provides or even really a good description of what's actually happening. You're just querying data from a very high context individual system.
Starting point is 00:24:04 So it is unfortunate. Absolutely. I see though a value in zero ETL and a legit point, I mean, which is not legit, you know, really per se, but at least it can be legit from a business perspective. Let's say, for example, let's say you have a system, like say in your P with a very complex data structure, a complex format, you know, which is quite hard to actually expose over APIs or the vendor just doesn't do it because of strategy like SAP. So it's very hard
Starting point is 00:24:32 to get data out of SAP for no real apparent reason, just because I don't want you to get out of the ecosystem. And so maybe for vendors like this, selling this zero, I mean, I mean, data share. It's really data share. See, data share can be valuable because they can expose data which you cannot get access via APIs independently, right? You need to get access to your own data in some way and it's not possible to make it accessible to APIs.
Starting point is 00:25:01 So that maybe because of scale, because of complexity, because of data types, this kind of things could make sense, but it's because of technical or business limitation that the business has, and this is where data share makes sense. But data sharing for HubSpot or for Salesforce doesn't really make sense because the APIs
Starting point is 00:25:24 still enable some sort of real time sync. And so that's why there is no real need for this. So we have to really introduce hard limitation on the API. So at least it would be the only one but it is critical monopoly, which would be a very big scandal. Right, right.
Starting point is 00:25:42 We're gonna take a quick break from the episode to talk about our sponsor, RutterSack. Now I could say a bunch of nice things Right, right. is clean and then to stream it everywhere it needs to go. Yeah, Eric. As you know, customer data can get messy. And if you've ever seen a tag manager, you know how messy it can get. So RutterStack has really been one of my team's secret weapons. We can collect and standardize data from anywhere, web, mobile, even server side, Now, rumor has it that you have implemented the longest running production instance of RutterStack at six years and going. Yes, I can confirm that.
Starting point is 00:26:31 And one of the reasons we picked RutterStack was that it does not store the data and we can live stream data to our downstream tools. One of the things about the implementation that has been so common over all the years and with so many RutterStack customers is that it wasn't a wholesale replacement of your stack. It fit right into your existing tool set. Yeah and even with technical tools Eric things like Kafka or PubSub but you don't have to have all that complicated customer data infrastructure. Well if you need to stream clean customer data to your entire stack,
Starting point is 00:27:23 So, let's talk about two-way sync. And let me frame this a little bit because there's a very common loop that has been reliable for a really long time. And actually sort of pre-existed the modern data stack, right? Because people did it with, you know, whatever they would write pipelines in Python or whatever, you know, if you were completely hand rolling it. But let's just, we'll frame it in the terms of the modern data stack. So I have Salesforce. I'm on a data team, right? The go-to-market team uses Salesforce for all the business stuff, right? It's where they track leads and campaigns and opportunities and everything. I need to combine that data with other data, some model, I need to, whatever, right?
Starting point is 00:28:10 Enrich it. So I use 5Tran, I pull the data into some data source, Snowflake, Databricks, et cetera. I model the data. I'm using DBT or some transformation layer, to enrich the data point, do whatever transformations I need to run, and then I reverse ETL it back into Salesforce. And then of course that gets pulled in, five-train again, and that's the loop. And that's been a very reliable loop for a long time.
Starting point is 00:28:44 Again, there are sort of tools to do that now, but again, it's been going on for years. What is the problem with that loop? Yeah, so that's a very interesting question. Because then you say, well, if there is no problem, you can build two-way sync manually, right? And as we were saying in the beginning of the podcast, it was an absolute mess to maintain. So the problem with two-way sync is that two-way sync is an extremely hard problem to actually get right,
Starting point is 00:29:08 because of the limitation of current tools, right? So for example, I was mentioning, okay, if you modify a record in Salesforce, it would tell you that the record changed, but it doesn't tell you which field changed. So now what do you send to your HubSpot or NetSuite? You override the entire record. So every time that you have a, you know,
Starting point is 00:29:28 if you override the email, even with the same value, you know, even though it might look the same, you might have a rule which says, every time you update this field, send a welcome email. So now every time that you update something in the CRM, even a completely unrelated field like first name or last name, you would actually send a welcome email, because the whole record was overwritten,
Starting point is 00:29:47 and you lose data. And so, even if it's one way. In the loop of, like, Salesforce, ETL, transformation, reverse ETL, I mean, that's not really two-way sync, but you kind of handle all of the nasty stuff in the transformations, right?
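[Editor's note: the failure mode Ruben describes, pushing the whole record and firing field-level automations for fields that never changed, comes down to missing field-level diffs. A minimal sketch of the idea in Python; the record shapes and field names here are invented for illustration, this is not Salesforce's or Stacksync's actual API.]

```python
def changed_fields(old: dict, new: dict) -> dict:
    """Return only the fields whose values actually differ."""
    return {k: v for k, v in new.items() if old.get(k) != v}

# Only first_name changed, so the email field is NOT re-sent downstream,
# and no spurious "email updated" automation fires.
old = {"email": "ada@example.com", "first_name": "Ada", "last_name": "L."}
new = {"email": "ada@example.com", "first_name": "Adaline", "last_name": "L."}
patch = changed_fields(old, new)
```

Forwarding only `patch`, instead of the whole of `new`, is what a field-level change feed gives you that a record-level one does not.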
Starting point is 00:30:10 That's where most of the work happens. Ultimately, those things get crazy. Is that the big challenge with that and why the value proposition of two-way sync is attractive, or is that loop not suited well for specific use cases, for example? Because I think that people use the loop for everything, right? It's the go-to for any, this is probably a dramatic oversimplification, right? And it's like analytics, ops, operational data, whatever. It's just throw it into the loop
Starting point is 00:30:51 and we'll figure it out in the transformation layer, right? Which gets complicated and expensive, I know is one issue, but walk us through the other issues. Absolutely. So, I mean, like, now we're talking about one pipeline, right? Maybe Salesforce contacts to Snowflake, you transform with dbt, you put this into another table, which is a staging table,
Starting point is 00:31:11 and then you have, like, a reverse ETL which is scheduled, hopefully coordinated, but you know, orchestration is still, like, a piece of the problem. Huge, actually, yeah. And it would, you know, query from Snowflake and send it back to Salesforce. So here is a reverse ETL vendor.
Starting point is 00:31:26 They have a data storage in between which actually compares what you send as a query. So they're running the diff, yep. It's running the diff, and then, like, only the difference gets shipped back to Salesforce. So this is how it's all stitched together. But so this means that you have one ETL vendor, one reverse ETL vendor,
Starting point is 00:31:43 you have at least two tables in your warehouse. I mean, that is the, yeah, I mean, for a baby startup company, 200 maybe. Yeah, at least two tables, and this, and your dbt engine, right? So that's cool. And now we're only talking about contacts, because companies is another one. And if you need to add contacts and companies and associate them, you need to also make sure the companies sync back first, you know, before the contacts, because the contacts belong to companies. And so, like, if you want to associate a contact with a company, you need to
Starting point is 00:32:19 have the Salesforce IDs in Snowflake. So you can send, you know, create the record in, I mean, Salesforce. So it means that, you know, you need to first do the first loop of companies, get the IDs back for the companies you created into Snowflake, and only then you can start with the contacts. But you have this kind of managed fields
Starting point is 00:32:42 which need some feedback, because you first have to create a company. So first create a company from Snowflake to Salesforce, get the ID back into Snowflake, use that ID to create a contact, and get the contact back, because then you need to create opportunities or something like this.
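[Editor's note: the ID feedback loop described here, companies first, then contacts that reference the returned IDs, can be sketched as a two-pass sync. The `fake_create` stand-in below just mints IDs; a real implementation would call the Salesforce API, and all names are illustrative.]

```python
import itertools

_ids = itertools.count(1)

def fake_create(obj_type, record):
    # Stand-in for an API create call that returns the new record's ID.
    return f"{obj_type}-{next(_ids):03d}"

def sync(companies, contacts):
    # Pass 1: create companies and capture the IDs the CRM hands back.
    company_ids = {c["name"]: fake_create("account", c) for c in companies}
    # Pass 2: contacts can now reference a real parent-company ID.
    synced = []
    for ct in contacts:
        record = dict(ct, account_id=company_ids[ct["company"]])
        record["id"] = fake_create("contact", record)
        synced.append(record)
    return synced

synced = sync([{"name": "Acme"}], [{"email": "ada@acme.test", "company": "Acme"}])
```

The point of the sketch is the ordering constraint: pass 2 cannot start until pass 1's IDs have round-tripped back, which is exactly the orchestration burden being described.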
Starting point is 00:32:58 And so all of this orchestration now gets you tables and tables, et cetera. So the promise of having a simpler architecture is wrong. And what this means, concretely, is that complex tech is not the problem, but heavy maintenance, that's the challenge. And this is why the loop is not working, and this is not even real time.
Starting point is 00:33:22 So this is only for analytics use cases where, you know, you need to ship some sort of aggregated metrics once a day, some stuff like this. So it's a big deal, and one more vendor, and all of this orchestration you have to do just for shipping some metrics back. You know, so don't tell me this is easy, right? This is pretty complex.
Starting point is 00:33:42 And now, the other thing, sorry to interrupt, but the other thing, now that you're talking through that, that is really interesting about it is that I said earlier, we'll just fix everything in the transformation layer. But in reality, actually, for anyone who's built these systems inside of companies, what actually happens, and is a very pernicious problem, is that the logic lives in different places, right? So you may have logic in Salesforce that runs on load, super common, right? I load a bunch of data, I run a bunch of Apex to do some operation, to apply some business logic, right? You may have logic that runs on the ETL pipeline on ingest into Snowflake.
Starting point is 00:34:24 And then you also may have logic that runs in the individual reverse ETL jobs that are coming out of it, right? And so the thing about that is, you don't really have a single source of truth. I mean, maybe you have one giant table that sort of represents it, and then you materialize things on top of that. But it is actually very difficult to ensure that all of the logic does actually live in the model, and to check all the dependencies and all that sort of stuff. It's like, well, we don't really have time to do that.
Starting point is 00:35:05 It's like, okay, well, we'll just do it in the reverse ETL pipeline, or we'll just write, like, something in, you know, in Salesforce, right? And so then your business logic starts to get spread all over the place. Absolutely. And so basically, by grouping, you know, this ETL and reverse ETL vendor and putting that in real time, actually you decrease basically the number of tables from hundreds to just one, you know, because it's the same table that you read and write data from. You also simplify all of these, you know, complex transformation layers, which now all sit within
Starting point is 00:35:35 your dbt or Coalesce transformations. You know, Coalesce is another, you know, sort of dbt competitor, and very popular as well, with some no-code features. And so what I really see is that your architecture gets completely simplified by this: instead of having loops, you actually end up just having a bi-directional arrow. And a bi-directional arrow is not the same as two arrows in opposite directions, right?
Starting point is 00:36:01 Because it's two tools that don't talk to each other. And also, what people don't think about is that if you have a conflict, a data conflict, which happens, same data updated at the same time, et cetera, if you are badly orchestrated, you're just gonna swap values, and actually your data will just not look the same anymore. It's gonna revert values, it's gonna swap values.
Starting point is 00:36:19 It's gonna be really complex to maintain. And now the CMO walks up to you and says, why? We texted this customer in a different segment, we lost the deal. And you say, well, because actually there was a technical... We don't care about the technical issues. Like, you have to make this pipeline work. So if your position is at stake,
Starting point is 00:36:37 you would not use these kinds of tools. And that's where, like, robust tooling like Stacksync really unlocks the scale. Because scale also, like, has a very different impact, right? It's going to take ages, you know, just to ship, to make the pipeline run. And if it takes four hours to run from HubSpot to Snowflake and four hours from Snowflake to HubSpot, it just means that, you know, it takes eight hours just to run a pipeline.
Starting point is 00:37:03 And when you have too much, it becomes impossible, because it takes more than 24 hours. So that's where really the challenge is. Also, you have the problem of loops, data types, you have the problem of authentication, managing two different vendors with potentially two different people in the data team who are responsible for each tool. You know, it's really an exposure of leadership to bad decisions. And so, you know, people are not really held responsible for choosing bad tooling. And we see this, like, you know, so
Starting point is 00:37:34 today the industry is filled up with bad tools. Like, honestly, a lot of tools are crap. But leadership is actually responsible for having bad business results, and this is just digging a hole by having the wrong tooling. So an IT investment is actually an investment in your own leadership. So as a data leader, or even as a CEO or CFO, investing in the right tooling is actually, like, ensuring your business performance will be driven by the right data at the right pace. Right?
Starting point is 00:38:12 So actually, like, your company will not be underwater with simple, normal growth. So let's walk through a use case, because I agree that, I mean, the loop has been a reliable architecture for analytical use cases and will continue to be, right? I mean, it's actually wonderful in many ways for that. You know, and then reverse ETL sort of adds this sort of slight operational benefit to it for, like, analytical-type stuff, right? Where you need to get some data point into some tool or whatever. So I mean, it's not going to go anywhere, but I think we should talk about the
Starting point is 00:38:55 operational use cases and how you do two-way sync. And so what I'd love to do is let's pick two examples. And my use case is, I'll just make up a use case, you can tell me how close I am to what your customers experience. I have this use case where we're doing lead intake on some website or app, and those leads are coming into HubSpot where they get marketed to, they're sending emails, all that sort of stuff. Right? And then, and so the marketing team or the demand generation team is using HubSpot to do all of that. And then some subset of that, you know, of those need to make it into Salesforce, or
Starting point is 00:39:55 let's say all of those need to make it into Salesforce, but the sales team, or the support team or whoever, really only is going to focus on some subset of those based on some characteristics, you know, whatever those specific fields are, et cetera. Okay. So I could theoretically run the loop, but the problem is, let's say I'm a pretty big company, and so those leads are actually coming in at volume, and they have a 10-minute SLA, okay?
Starting point is 00:41:05 that I would have as a data team of like, okay, well, how do I actually make that work? Is that? Yes, this is actually a correct challenge. And I would even say it more clearly, right? It's like, okay, so now you make a marketing campaign, you know, on HubSpot, and this person immediately logs in, you know, responds to the email.
Starting point is 00:41:20 It's an absolute, you know, sweet spot. This person signs up and actually now goes into Salesforce. If your pipeline didn't run yet, it will be inserted into Salesforce by the signup, as well as from the HubSpot-to-Salesforce pipeline. So basically now you're going to have a duplicate. So you have to make sure that your pipelines are also robust with upserts, and not only updates, right?
Starting point is 00:41:48 And how do you do upserts? Well, you need to query the data first and write data second, right? So actually you need to first basically get the data from HubSpot to Salesforce as an upsert, which is a very big, I mean, it queries data
Starting point is 00:42:07 so that we know if the record is actually present in Salesforce, and then if it's present, update, or else insert, right? So that's two API calls. And so at scale, this is extremely hard, and doing this in batches is also complicated. It's very complex. So now, let's say,
Starting point is 00:42:24 most pipelines do query records, like, one by one, which is very problematic, because if you have batches of 10,000 or 100,000 contacts, which is the size of a marketing campaign, you're actually going to be, like, basically over-consuming your API limit for the entire week on Salesforce. And HubSpot is also rate-limited, right? Per second. So it's not going to work. So this is challenging in terms of setup, it's a complex setup, and also complex maintenance.
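[Editor's note: one way around the one-query-per-record problem is a single bulk existence check per batch, then splitting the batch into updates and inserts. A sketch, with invented field names; the bulk lookup itself would be one API or SQL call in practice.]

```python
def plan_upserts(incoming, existing_emails):
    # One bulk lookup (existing_emails) replaces one API query per record.
    updates = [r for r in incoming if r["email"] in existing_emails]
    inserts = [r for r in incoming if r["email"] not in existing_emails]
    return updates, inserts

campaign = [{"email": f"user{i}@example.test"} for i in range(5)]
updates, inserts = plan_upserts(
    campaign, {"user0@example.test", "user3@example.test"}
)
```

A 100,000-contact campaign then costs a handful of bulk calls rather than 200,000 per-record round trips.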
Starting point is 00:42:55 Well, now with two-way sync, basically Stacksync tells you, okay, you create a two-way sync between Salesforce and Snowflake, you create a two-way sync between HubSpot and Snowflake. So now you have everything in place. So now you just have two bidirectional arrows, it's not even a triangle. In Snowflake, you have all the contacts, companies, et cetera, all the tables of Salesforce and also the tables of HubSpot.
Starting point is 00:43:23 Whenever, you know, because Stacksync is real time, right? This is very important, because the previous setup was not real time, right? So it's even- Oh yeah, it would run on scheduled jobs, yep. Exactly, so because StackSync is real time, let's say a new contact is created on HubSpot. Yep.
Starting point is 00:43:39 As soon as it's created, StackSync will ship it, right? It's sub-second, or maybe one second, two seconds latency. It's gonna go directly into your Snowflake. StackSync, we also have a feature called triggers, which actually enables you to say: when you observe a certain data event being transferred, also trigger this workflow or this database query. So you say, when a contact is created or updated,
Starting point is 00:44:02 right, into Snowflake, from HubSpot to Snowflake, also run this query. Because StackSync also runs a diffing, it tells you: if email was updated, or if ID is a field that changed, which means it's a new record, write this record into the Salesforce table in my Snowflake.
Starting point is 00:44:23 So now you made one simple SQL query, because the entire Salesforce data is also real time in the sync, it's fresh. So now you can do an upsert, right? And be pretty confident that this will not lead to a duplicate. So with a simple upsert operation, you actually know exactly, and you can actually upsert on many fields,
Starting point is 00:44:39 and you can actually upset on many fields, which you cannot do on Salesforce because on Salesforce you might not even be able to query based on a given field. In Snowflake, you can do... Can you filter as well? So, I totally understand that example where you offload... Because that's actually a very...
Starting point is 00:44:58 That's a pretty efficient... That's a super simple query, right? I mean, that's about as simple as that can run. Just search based on email or ID or whatever. Yeah, yeah. It's super easy. But could you also filter, right? So let's say I want to modify the filter going back to the use case.
Starting point is 00:45:14 I want to modify the filter so that I can say, even, you know, even if this exists in Salesforce, I don't want to send it because it doesn't meet some sort of qualification, right? Or, you know, I want to send it with a flag, whatever it is, whatever that sort of filtering is, right? So that I can kind of determine, like, when different types of things get sent. Absolutely.
Starting point is 00:45:39 You can also make this filter, and you say, for example, when a contact is created and the segment is X, Y, and Z, send it to Salesforce. Or maybe it's a very large company, so send it to the Salesforce of this child company which serves the large companies' network. And then if it's a smaller company, send it to the Salesforce of this other company, the subsidiary which serves SMEs. Oh, like literally a separate Salesforce instance. Exactly, because now you can synchronize.
Starting point is 00:46:10 Oh, right. So even if you had a single marketing intake, you could send it to... Oh, interesting. Okay, yeah, I was thinking about it in a far too linear way. Exactly. And this is just a simple SQL query. So now what we did, there is no table transformation or anything.
Starting point is 00:46:31 There is just, like, a query which is triggered at the right moment to maintain this real-time feeling across your systems, and a simple SQL query, or even a dbt transformation, which can run, right? You can also do batch once an hour, once a day, as your current use case. But now this is really real time.
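[Editor's note: the segment-based routing sketched above boils down to a tiny bit of trigger logic. The segment names and destination instances below are invented for illustration, not Stacksync configuration.]

```python
def route(contact):
    # Decide which downstream CRM instance (if any) should receive this contact.
    segment = contact.get("segment")
    if segment == "enterprise":
        return "salesforce-main"
    if segment == "smb":
        return "salesforce-subsidiary"
    return None  # doesn't qualify: don't sync it at all

destination = route({"email": "ada@acme.test", "segment": "enterprise"})
```

In the SQL-trigger version, the same branching would live in the `WHERE` clause and the choice of target table, but the decision itself is this small.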
Starting point is 00:46:47 And now when you insert your record, so now I'm gonna say, a record has been created in HubSpot. It has been synced to Snowflake. In Stacksync, you created a trigger which says, okay, when a contact having these properties is created, and email is the field that changed, then also put it into Salesforce.
Starting point is 00:47:05 So into the Salesforce table in Snowflake. So now we run an upsert, which prevents the emergence of duplicates in Snowflake. And with two-way sync, I mean, because of two-way sync, this data is gonna be inserted into Salesforce. Yep.
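[Editor's note: the duplicate-proof upsert itself is one statement once the data lives in a SQL engine. A sketch using SQLite as a stand-in for the warehouse table; in Snowflake the equivalent would be a `MERGE` statement, and the table and column names are invented.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salesforce_contact (email TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO salesforce_contact VALUES ('ada@acme.test', 'Ada')")

# Upsert keyed on email: updates if present, inserts if not. No separate
# read-then-write step, so no duplicate can emerge from a race.
conn.execute(
    "INSERT INTO salesforce_contact (email, name) VALUES (?, ?) "
    "ON CONFLICT(email) DO UPDATE SET name = excluded.name",
    ("ada@acme.test", "Adaline"),
)
rows = conn.execute("SELECT email, name FROM salesforce_contact").fetchall()
```

Running the same statement again with a new email would insert a second row instead of updating, which is the whole point: one statement covers both cases.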
Starting point is 00:47:22 And so now you just created a contact in HubSpot and you have it in Salesforce and it passed into your Snowflake. So it's also available to all of your other systems and analytics, you know, and dashboards to actually be observed. So you have real-time analytics, operational purposes, because now, you know,
Starting point is 00:47:37 you can react in real time, because this takes maybe two to three seconds, with every query time and all this, maybe two seconds or three seconds of latency at maximum. So three seconds later, you created a contact in HubSpot and you have it in Salesforce, and you can trigger, like, a welcome sequence or whatever. That's really operational. And this entire pipeline has been built with one simple query, one trigger, with Stacksync,
Starting point is 00:48:02 and you're handling all the API calls. Everything, and even batching. And StackSync even works in a very clever manner: if you go into Snowflake and you modify, let's say, one million records at a time, because Salesforce has a different rate limiting, and HubSpot too, StackSync will just, you know, batch all these records as fast as possible
Starting point is 00:48:24 within your allowed API rate limits, which you can also configure, and send this data. So it might take a bit more time, but for Salesforce, we go up to one million records per minute, half a million records per minute on HubSpot, so it can go very fast. And so all of this architecture, and therefore the maintenance, is simplified with a single vendor. So one tool, which has just triggers, SQL queries, and dbt transformations. Yep. Nothing else, you know.
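[Editor's note: the rate-limit-aware batching described here is, at its core, chunking a large change set into fixed-size bulk calls. A sketch; the batch size is an illustrative number, not Stacksync's actual configuration.]

```python
def chunk(records, batch_size):
    # Drain a large update as a few bulk calls instead of one call per record,
    # so a million-row change stays within the destination's API rate limit.
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

# A million-row change set in batches of 10,000 is 100 bulk calls.
n_batches = sum(1 for _ in chunk(list(range(1_000_000)), 10_000))
```

A real sender would additionally pace these batches per second against each destination's limit, but the chunking is the part that turns "a week of API budget" into minutes.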
Starting point is 00:48:52 Okay. So let's, I said we'll do easy mode and then let's do something harder. And maybe this isn't harder, but I'll try. This is the example that came to mind. Okay, let's move from Salesforce to NetSuite ERP. Any ERP, but whatever, let's just say NetSuite. Things get more complicated when you think about operational use cases where you have a billing department who's using the ERP to send invoices,
Starting point is 00:49:25 manage payments and receivables. But in Salesforce, let's say that the salesperson or the customer support person is working with an individual. This contact and the complication, it can get crazy, but the complication is a lot of times you have to make multiple hops to go from the contact in Salesforce to a unique invoice ID, or purchase order number, right?
Starting point is 00:49:56 Because you have the contact, and then they have a company in Salesforce, and it's not always a one-to-one match of what they're called in the different systems, right? Between the two systems there are some discrepancies. You know, that's just a simple example, right? There are keys that you can use, but there are some complications there. But if you have to make a hop for that operational use case, where there are different keys and then a physical asset,
Starting point is 00:50:54 like an invoice, related to a company, how would you handle a situation like that? Because there we get into some really interesting data modeling challenges and some data discrepancies. Yes, absolutely. So basically, if you have several hops, for example, say you have to query first, you have an email in Salesforce, but also in NetSuite, so you have to query the contact by email, which might not even be possible. Then you have to get... Right, that might not be possible. So then you have to do company name and then... Exactly. So then you have to get the contact, then you have
Starting point is 00:51:24 to get the company, then you have to get the opportunities, and then you have to get, like, the list of all the invoices for the opportunities, and then get the ID of the invoice to actually get the payment. So all of these are very complex workflows. So if you are in the low-code world, you have to build maybe, like, a workflow with 15 steps and merges. And it's brutal. Yeah, it's very bad. And so what happens is that now, if you have your entire data real time, fresh, in your Snowflake, you can craft your SQL query to be actually very powerful. So actually, like, in your workflow, you would have one Snowflake query.
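[Editor's note: the multi-hop lookup, contact to company to invoice, collapses into that one query once the tables sit side by side in the warehouse. A sketch with SQLite standing in for Snowflake; the schema and values are invented for illustration.]

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE contact (email TEXT, company_id INTEGER);
    CREATE TABLE company (id INTEGER, name TEXT);
    CREATE TABLE invoice (id TEXT, company_id INTEGER, amount REAL);
    INSERT INTO contact VALUES ('ada@acme.test', 1);
    INSERT INTO company VALUES (1, 'Acme');
    INSERT INTO invoice VALUES ('INV-42', 1, 99.0);
""")

# One declarative query replaces a 15-step sequence of API calls:
# contact -> company -> invoice in a single round trip to the warehouse.
row = db.execute("""
    SELECT i.id, i.amount
    FROM contact c
    JOIN company co ON co.id = c.company_id
    JOIN invoice i ON i.company_id = co.id
    WHERE c.email = ?
""", ("ada@acme.test",)).fetchone()
```

Each join here would have been a separate rate-limited API call against Salesforce or NetSuite in the workflow-tool version.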
Starting point is 00:51:57 Then you have a check, to check if it returns a result. Because maybe, like, you know, you are at the very millisecond point in time where data wasn't available or something. So make a check, right? And then, like, once you have aggregated all the data with a single query, you're not overloading your Snowflake,
Starting point is 00:52:19 your NetSuite with many API calls, right? You are just querying Snowflake, which is much more, you know, loadable, so to speak. Then you have your insert: you take your data and you insert it into the right Salesforce or HubSpot or NetSuite table, and that creates an update in your NetSuite. So two-way sync really makes it that your tables in your Snowflake, or in your Postgres even,
Starting point is 00:52:47 are really a real time, you know, read and write interface to your enterprise system. It's an equivalent to using an API. Basically, you're just using an API via SQL. That's the only thing you're doing. That's a very easy way to understand it. So, actually, every time you get an insert, it's gonna make, actually, you're making an API call,
Starting point is 00:53:05 but actually this API call can be one million rows. This read can be filtered with any kind of business and custom logic that you have. It's really extremely... it's an API which is as flexible as SQL and as well documented as SQL. And you can see your data directly. Yeah, yeah. No, I think that is the... Yeah, Brooks is telling us we're at the buzzer here, but what a great,
Starting point is 00:53:35 I mean, I think that's the paradigm shift, and it took me a minute to get there. Maybe because I'm a little slow, which is why I fell for the marketing hype on Salesforce and HubSpot sync. You run the loop, which isn't really great for operational use cases, especially ones that are time sensitive, right? And then your logic has to grow as part of a gigantic Frankenstein model that gets crazier and crazier over time, right? And so it's really interesting, because I would almost describe StackSync as inverting the problem, right? Where it's like, you literally don't think about APIs, you just write a SQL query for
Starting point is 00:54:25 the use case that you want to solve. It's super fast, because it's not a heavy query. And then you don't even worry about the APIs, it just happens, right? And you're adding these queries over time, and modifying the logic is isolated. So it's very easy for anyone to reason about what's going on with, say, Salesforce contact to invoice, whatever use case. It's very visible logic which is easy to build and to migrate and to debug, right? And especially, it's a very declarative way to operate, right?
Starting point is 00:55:06 An API is very event-driven: do this, do that. And then SQL is very declarative, right? It's like, you know, on top of everything, every data you can see from a top-level perspective, just, like, pull everything you have, everything you need, and get it back, you know? And this is very declarative, which really enables you
Starting point is 00:55:22 to build much more robust pipelines. So that's where StackSync really puts this declarative, SQL way to operate on top of traditional event-driven, dirty APIs. Just from a simplification perspective, just try to connect to the NetSuite API. In a week, we're going to be still there, and we're going to say, okay, well, maybe it's useful to have something that manages it for me. That, my friend, is a great sales pitch. Yeah, and that's not even talking about the integration piece, right?
Starting point is 00:55:56 Just, like, the authentication, right? Yeah, that's brutal. The documentation. I know a few survivors from back in the 90s who actually understood how the API worked. So it's a bit of a legend.
Starting point is 00:56:14 So right now, basically, at StackSync, we're also launching workflow automation tooling which actually plugs into the syncs. So for example, say you have a two-way sync, some data events are transferred in real time. You can trigger, you can say: when you see these transfers, when you see a new contact, tell me. And this "tell me" can be anything between a sequence of
Starting point is 00:56:46 dbt transformations, can be a workflow automation, can be a Slack notification to your sales team, can be anything. And this is really, say a contact changed status, boom, notification to the relevant sales rep. And all of these kinds of enrichment. I was holding a webinar last week about how you can actually say: every time there is a new contact created, go to LinkedIn, you know, get real-time live data about the entire LinkedIn profile, make a summary, and fill up all of these, you know, database fields,
Starting point is 00:57:18 which are actually, like, the CRM fields. And now every time I would just type an email into my CRM, I would see all of these fields populating immediately, instantly. And so this is really with live data. And all of these enrichment use cases are exactly what we're building. So really this upgrade into enterprise-scale mode of your operations, this is what Stacksync actually does. And Stacksync now has also become the leader in NetSuite two-way sync.
Starting point is 00:57:44 So if you have any struggle in your team, with NetSuite, Shopify, Zendesk, HubSpot, or Salesforce involved, you know, happy to chat and actually help you architect your best use case in your precise business scenario. Cool. Ruben, this has been awesome. I really appreciate the time. Love that we dug in.
Starting point is 00:58:04 Love that we demystified zero-whatever-it-is. Zero something. I guess zero something is an oxymoron, but this has been great. Congrats on all the success, and we hope you have much more in the future. Thank you so much, guys, thank you so much for hosting. The Data Stack Show is brought to you by RudderStack. Learn more at rudderstack.com.
