The Data Stack Show - 108: You Can’t Separate Data Reliability From Workflow with Gleb Mezhanskiy of Datafold

Episode Date: October 12, 2022

Highlights from this week's conversation include:

Gleb's background and career journey (2:51)
The adoption problems (10:53)
How Datafold solves these problems (18:08)
The vision for Datafold (26:27)
Incorporating Datafold as a data engineer (38:53)
The importance of the data engineer (42:12)
Something to keep in mind when designing data tools (46:46)
Implementing new technology into your company (53:18)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Today, we are talking with Gleb from Datafold. Datafold is a data quality tool, and they have a super interesting approach to data quality. Kostas, one of my burning questions is around when to implement data quality tooling and processes in the life cycle of a
Starting point is 00:00:46 company. Because a lot of times, and you and I both know this from experience, you a lot of times approach it in a reactionary way, right? Like something's breaking, a dashboard breaks, you're trying to do some sort of analysis, you want to launch something new. And you just you run into data quality issues that really limit what you're doing. And so then you, you begin to implement those processes. And so I know that Gleb worked with companies who were trying to tackle data quality all across the spectrum. And so I just want to hear from him. What does that look like today? Do you have to be reactionary? Is it worth the time cost it takes to do that proactively. So that's, that's what I want to learn about. How about you?
Starting point is 00:01:26 Yeah. I want to hear from him how you start building a technology and a product around data quality. Because data quality is one of these things that's so broad, and there are so many different ways that you can implement it, so many different parts of the data stack where you can go and start working on it. So I'd love to hear from him about his experience in starting a business and also a product around that, the hard decisions that you have to make
Starting point is 00:01:59 in order to start, so yeah, like I think we'll start from there and I'm sure that like we will have like plenty of opportunities to go much deeper into like the product itself and the technology and all the different choices made around it. I agree. All right, well, let's dig in and talk with Gleb. Gleb, welcome to the Data Stack Show. We're so excited to chat with you about all things data and specifically DataFold. Thanks for having me.
Starting point is 00:02:29 We're excited. All right, well, let's start where we always do. We'd love to hear, tell us how you got started in data with your career and then what you were doing before DataFold. So yeah, we'd just love to hear about how you, the path that led you to starting Datafold. Yeah, absolutely. So my original academic background is economics and computer science. And I started my career around 2013.
Starting point is 00:03:02 And so I joined a big company called Autodesk as a data engineer. Autodesk focuses on B2B software for creators, but at the time, they were putting together a consumer division. And I ended up essentially
Starting point is 00:03:15 almost creating a data platform from scratch for that consumer division, tying all the different metrics from different apps that Autodesk acquired. And now it is a really interesting time because if you remember, 2013 is where I think Spark was almost just released.
Starting point is 00:03:35 Looker had just come out of stealth, same with Snowflake. So a lot of the tools and companies and technologies that we now consider really foundational were, at that time, super early stage, super cutting edge. And so that was a really exciting time to tinker with data. And after a year at Autodesk, I moved to Lyft, where I ended up being one of the first, one of the founding members of the Lyft data team. And at that time, Lyft was in a super hyper-growth stage. Data infrastructure was kind of barely there. So we had one Redshift cluster and we were building everything on top of it.
Starting point is 00:04:19 And things were constantly on fire. And I remember days when essentially the entire analytics team would basically go get boba because Redshift was completely underwater from all the queries that everyone tried to run. And I initially was tasked with building all sorts of different data applications, from forecasting supply and demand to understanding driver behavior. But I was so frustrated with the quality, not with the quality, but with the tools that I had at my disposal. And that wasn't necessarily a Lyft problem, but basically at that time, with what was available for data engineers in general off the shelf, the experience was quite terrible, from just
Starting point is 00:05:04 not being able to run queries, to not being able to test things and standard data, trace dependencies, all that was extremely time consuming. And so I kind of gravitated naturally to building tools. First time, simple tools for my team members, for example, build a dev environment where people could prototype their ETO jobs before pushing down to production. Before that, we kind of tested in production, which was really a bad idea. I also had some really, really bad horror stories that I think led me ultimately on my journey to build VitaFold.
Starting point is 00:05:40 So one of them, which I quite often cite: we had this practice of data engineers being on call. Essentially, the data at Lyft has been so important that the entire company pretty much ran either on fully automated decision making and machine learning or on meetings organized around reviewing dashboards, seeing how the company performed. And so delivering data on time by certain SLAs has always been super important. That's why we had on-call engineers making sure that whenever a pipeline was clogged or late, people could actually address that. So I was the on-call engineer
Starting point is 00:06:18 one day, and I was woken up by a PagerDuty alarm at, I think, 4 a.m. because some really important job computing rides had failed. And so I looked at the error and found some bad data that had entered the pipeline, implemented a two-line SQL filter, and pushed the change. Everything looked good. Did some quick data checks, got a plus one from my buddy. And then went back to sleep.
Starting point is 00:06:47 Everything seemed normal and green. And then you kind of see where it's going. So yes, next day I came to work. And then probably two hours in the workday, I was forwarded an email from, I think, CFO that was looking at the dashboards and everything was kind of all over the place, looking really weird. And so the craziest thing is not that I managed to break lots of dashboard. The craziest thing is that it took us about six hours of sitting in a war room
Starting point is 00:07:14 with me, a few other senior data engineers, and trying to understand what's going on. And it took us six hours to actually pinpoint the issue to the fix, the hotfix that I made. And obviously, if I was able to make such a trivial mistake and bring down so many tables and have a really big business impact with just a two-line SQL hotfix, if you extrapolate this to how much loss is happening just due to data breakage around the industry, it's actually enormous and it's really easy to break things. And luckily for myself, I wasn't fired back then. I was actually put in charge of building tools to make sure that this exact error doesn't happen again. And so we introduced a real-time anomaly detection system that was helpful and then
Starting point is 00:08:06 focused on improving the developer workflow to make sure that developers actually don't introduce such issues into production. And one of the interesting learnings that I had at Lyft that ultimately informed, I think, the way Datafold approaches the topic of data quality is that why we introduce... So the first reaction when we had this issue and other... I was not the only one breaking things, was that we need a system that would catch anomalies in production. We need something that would detect when things are broken because, well, we don't
Starting point is 00:08:43 want the CFO to forward us an email, you know, a dashboard screenshot, and say, hey guys, I think this is wrong, right? This is a really bad situation, to learn about it from stakeholders. And so we implemented a really sophisticated real-time anomaly detection system that would compute metrics in real time using Apache Flink, both from our ETL-transformed data as well as some streaming, so events. And that was somewhat impactful, but we really struggled to get adoption and really struggled to make an impact with that as much as we hoped. And the ultimate problem was, one, that system was kind of too late in the process.
Starting point is 00:09:25 So by the time you detect something, it's kind of already broken. And so a lot of the teams really struggled to see the value there. And the second challenge was that it, in a way, existed outside of workflow. So if someone is going to introduce a break and change like I did, right, or a bug, and they have to learn about this from a system that kind of detects this bug in production. They have to kind of drop everything, whatever they're doing.
Starting point is 00:09:52 They have to go outside of the workflow and then focus on investigating whatever anomalies that system finds. And we found that disrupting the workflow is actually a really expensive way to get on top of data quality issues. And so we started focusing back at Lyft to build tools that would actually prevent things from breaking.
Starting point is 00:10:12 And that also informed what we're trying to do at Datafold. So our philosophy is proactive data quality and shifting left, which means that as much as possible, we try to detect issues very early on in the dev process, ideally in the IDE when someone types SQL, but at least in staging or pull request review, not in production when things have already done the damage. But that's another story. That's probably another question. But that's pretty much how I ended up building Datafold. Love the story. I'd love to dig into the adoption problems a little bit more because I think that's really interesting, in large part because, you know, there are a number of sort of data observability,
Starting point is 00:10:58 data quality tools that are betting on anomaly detection as sort of the primary way to solve that problem. Would you say, and maybe this is just me rephrasing what you already said, but would you say that the anomaly detection as you described it at Lyft was actually just a less manually intensive way of doing alerting, right? It was sort of almost like, you know, a more efficient way of doing alerting, right? But alerting tells you that something is already broken. And that's why, you know, people were just kind of like, well, we already are getting alerted when something's broken. The fact that we can do that more efficiently isn't exciting. Is that how you would describe that dynamic? Or I'd just love to know more about that adoption problem. Yeah, I think there are a few challenges here. Well, I think in general, alerting and monitoring is valuable. So
Starting point is 00:11:58 we actually were able to attack really bad incidents in production where we would not do a really important step in the billing process. And that would have led to massive loss, a major loss for the company. We would basically not bill people who weren't on their bill. And that system actually helped us detect not just data quality issues, but a real production issue. So definitely not discounting the value of alerting. I think the challenge is that a lot of the times we forget about where the data quality or general issues are coming from.
Starting point is 00:12:39 And I like to think about this in a very simple form. So I think ultimately, when I think about data quality, although it's a very big topic and a large problem, it's ultimately either we break data, where we, the people building data pipelines or touching data in various ways, introduce changes that, you know, change the definition of things. Like we change how a session is calculated and then that potentially throws off all our calculus around conversion, right? Or maybe we change the schema of an event because a microservice can no longer provide given fields. And then some machine learning model that relied on this is no longer, you know, doing machine learning. Things like that. Right? So that's kind of, we break data. And then there's also a category, which I call they break data, when
Starting point is 00:13:38 external things happen and those external things are, for example, we may buy data from our vendor and that vendor ships us a bogus data set that doesn't fit all the CDs, right? Just completely outside of our control. Or we have, let's say, running Airflow and Airflow is known to have a really funky scheduler and sometimes things don't get scheduled or sometimes they may be marked as completed, but actually left in a standing state and well, not really our fault, like infrastructure fault. And so I think where data monitoring really is helpful is in detecting things that are in the they break data category, right?
Starting point is 00:14:19 Things that are outside of our control, as well as maybe being like the last final defense for stuff that we break, that people break. But I think where we are really missing sometimes, like the counterintuitive challenge is that we tend to attribute failures to like external factors, where in fact, it's us, like generally the people in the company, breaking things one way or the other. And so what I think the major learning for me was that we really need to invest in building systems that are more like anti-fragile and more robust to breaking. And that means potentially having better systems to how we do data in production, right? So like in general, improving how we're straight jobs and having data contracts.
Starting point is 00:15:10 But the part where I am most excited about is improving how we develop data. So improving the development process, in particular, how we introduce changes to data products, right? Be that events, be that transformations in SQL or postpart, or even end user data applications like Looker dashboards or certain machine learning models. I think that's probably my bet is that I see a huge opportunity in improving the status quo of, oh, everything is broken in really improving the change management process. And then if you do that, then there are less fewer things that are breaking in production.
Starting point is 00:15:51 And the other challenges that we saw is data is inherently noisy, right? And it's always changing. And so when we're talking about data monitoring, we're talking about typically unsupervised machine learning, where a system would learn a pattern of data. And by pattern, I mean that can be anything from a typical number of rows that a data set gets daily, right? Or what is the distribution in a given dimension or metric column in a data set? And then when the reality doesn't conform to that baseline, we get an alert, right? But given that our business is changing, especially in high growth companies always, and given
Starting point is 00:16:34 that we are operating in the world where even not even a unicorn data team, not even at a unicorn startup earlier stage,, is kind of common to have thousands of tables in production and tens of thousands of columns. You know, we can find anomalies there all day long, right? And the real challenge is how do you actually identify what is important? What is worth fixing? What is a real challenge? What is a business issue versus data quality issue?
Starting point is 00:17:04 And that's probably what really makes adoption of the quality platforms. That's what really held us back and lived with our real-time anomaly detection system. And that's why I also think change management
Starting point is 00:17:15 is so important is because if we can prevent preventable errors early on in the dev cycle, if we prevent that production, then we have less noise
Starting point is 00:17:24 to deal with and fraud, right production and we have less noise to live in fraud right we just have fewer things to worry about yeah absolutely no that's super helpful well we probably should have should have asked this question earlier but can you tell us and i think we kind of we we kind of touched on on many things, but tell us how Datafold specifically solves these problems. You sort of described the problems really well. I would love to hear about what the product does specifically. Yeah. And how you've built it in response to those things that you've experienced.
Starting point is 00:18:05 Yeah, absolutely. So I think to describe how Datapult approaches it probably makes sense for me just to outline certain like beliefs that I have about the space. And I think the three principles for kind of reliable data engineering that I have is that one, to really improve data quality, we need to improve the workflow. So not invent tools that would kind of sit outside of workflow and send us notifications, but look at how people go about, you know, writing SQL models or modifying SQL models or, you know, developing dashboards and improve their workflows so that they are much less likely to introduce bugs.
Starting point is 00:18:49 And so that it's less painful for them to develop and they can develop faster. The, you know, example of that is right now we've pretty much adopted the notion of version control in data space, right? Maybe even five years ago, that was kind of novel, but by now everyone agreed that we should version control everything. Events, transformations, even BI tools, right? Maybe even five years ago, it was kind of novel, but by now everyone agreed that we should version control everything. Events, transformations, even BI tools, right? Even reverse ETL, even event schemas, everything.
Starting point is 00:19:11 And so that means we have ability to stage our changes and also have a conversation about the changes in what's called a full request or a merge request, right? And so if we can't get to that stage,
Starting point is 00:19:24 this is exactly sort of a secret breaker where within the if we can't get to Earth at that stage, this is exactly sort of a circuit breaker where within the workflow you can improve things. This, as an example. I think the other principle that's really important is that we have to know what we ship. And it's kind of obvious, and it's kind of
Starting point is 00:19:39 humiliating to admit that we, you know, data engineers, latest engineers, a lot of times we don't really know what we're doing. We think we know our data, but data is far more complex. And, you know, it's not uncommon for us when we write SQL query with a new data set, just
Starting point is 00:19:56 like not really knowing what's in there, right? We make certain assumptions. I think this column has a certain distribution, or I think this column actually has values. But a lot of times we actually don't know, and those assumptions are wrong. And if we develop our data products with wrong assumptions, we are just saving the tab for bugs and for errors. And so I think it's really important to know the data that you're working with.
Starting point is 00:20:20 So examples of how data flops, there is, for example, we provide profiling. So anytime you're dealing with any new dataset, we visualize all the distributions in the dataset as well as provide you with certain quality metrics. For example, what's the field rate in that column? Is this column unique or not? And that really helps avoid a lot of the base errors when we are writing code or building dashboards on top of a dataset. The other important thing about knowing what you ship is understanding dependencies within beta because the hardest thing is it's hard enough to understand a given dataset, right?
Starting point is 00:20:54 But it's even harder to understand where does the data come from or to understand what is actually the source of a given dataset as well as understanding where it goes. So if I don't understand where it comes from, I may not know how actually it represents my business, right? If I don't know which event, for example, a given table is based on, I may not know exactly what this table is describing. If I don't know who is consuming my data, I'm likely to break something for someone eventually, right?
Starting point is 00:21:28 And so understanding dependencies within the data platform is super important, and that solves with lineage. So DataFold provides column-level lineage, which essentially allows you to answer the question of taking a given column table, how it's built, and what downstream data applications, such as BI dashboards, machine learning models, and on it, really, really foundational information. And then, finally, I think the third principle is automation. So I think no matter how great your audit checks are, or your processes, unless this is so easy to do for people that they don't actually need to do it,
Starting point is 00:22:09 it just happens automatically, they won't do it, right? So we can't assume people will test something. We have to test it pretty much in a mandatory way. And then software, that concept is long, you know, I adopted long time ago, right? So when we ship software, we have CI, CD pipelines that build staging, a lot of unit tests, integration tests, sometimes even UI tests automatically
Starting point is 00:22:33 for every single commit, right? So we need to also get to this point with data because this is the only way to actually be able to catch issues and not rely on people to do the right thing. And so the example of how this comes together in Datafold is that whenever someone makes change to any place in their data pipeline, let's take maybe a SQL table model in dbt or DAX or Airflow as an example, Datafold provides full impact analysis of this change.
Starting point is 00:23:08 So basically, we call it DataDiff. We show exactly how a change to a source code on, let's say, SQL affects the data produced. So let's say I changed a few lines of SQL, and now the definition of my session duration column changed. So the question is how, right? Am I doing the right thing? So we compute that difference, and we show exactly how the data is going to be changing.
Starting point is 00:23:36 And we're not only doing this for the script, for the table, for the job that you're modifying, we're also providing a full impact analysis downstream. So we're showing you, okay, if you're changing this table and this column, these are all the downstream changes, cascading changes that you'll see in other tables and ultimately in dashboards and maybe also in the reverse CTL applications. And so we can trace how data in Salesforce that gets synced five layers down will be affected by any given change to the data pipeline. And that all happens in CI. So we don't need people to do anything.
Starting point is 00:24:14 It's basically completely integrated with the CI-CD process. And it basically populates the report back in the pull request so that the developer knows what they're doing. But very importantly, they can look in other team members and have them have full context about the change. And this is really important because kind of full request reviews for data engineers is almost like a meme topic because no one understands what's going on. And it's really hard to get the full context of a change
Starting point is 00:24:41 unless you really see what it does, right? So, but now with Datafold, a team can have a conversation about, this is what the change is going to do, look in the stakeholders, and I make sure that nothing breaks and no one is surprised. So that's an example of how, how Datafold does it. You've been describing for like quite some now all the different dimensions involved in data quality. And I think we all understand data quality is just complicated. Data infrastructure gets more and more complicated, more moving parts. It's a stack.
Starting point is 00:25:22 I don't know. We can call it whatever we want, but the only certain thing is that it's too many moving parts that they have to be orchestrated and work together. There are just too many actors that interact with data, and each one of these actors might change something in a very unpredictable way. So my question is, deciding to start building a product and a business around that, right?
Starting point is 00:25:49 What's your vision? What's your goal? Let's say from five years from now, what you would like to see the product being able to deliver to your customers? And second, where do you start from in realizing this vision? Because obviously it's not something that you can build from one day to the other, right? So I'd love to hear a little bit of how you, as a founder of the company, and with obviously an obsession around this problem,
Starting point is 00:26:19 do you try to realize this vision? Yeah, absolutely. So my vision for Datapult is to build a platform that would automate data quality assurance from left to right in two dimensions. So the first dimension is the developer workflow. And what it means is if we consider where where data fault sits now, we're kind of really catching things in staging. So when someone is opening a pull request, and they are about to leverage the code, how can we catch as many errors as possible before that pull request
Starting point is 00:27:00 gets merged, right? And so we can go left from here, meaning that we can detect issues even earlier. For example, as someone is typing SQL in the IDE and plugging into their IDE and provide a lot of decision support in that point so that
Starting point is 00:27:19 they don't even write bogus code. That's where, again, software industry is moving, right? We have sort of code auto-completion as well bogus code. That's where, again, software industry is moving, right? We have sort of code auto-completion as well as static code analysis and flaking of all sorts of issues in the IDE. And I think we'll get there in the data space.
Starting point is 00:27:37 And we have a lot of the fundamental information to do that. And what we need is to really plug into the workflow and into the tools where people develop those workflows. And then going to the right in the developer workflow also means that once someone ships the code to production, we also are going to provide continuous monitoring and make sure that that code stays reliable, that these data products stay reliable.
Starting point is 00:28:07 And to the point of data monitoring, right, I still believe it's very valuable, but we decided to start a little bit to the left of that, first in the dev lifecycle. My vision for what a good data monitoring solution is, it's not only about detection. I think that's the easy part. The much harder part that I'm excited about solving is the root cause analysis and prioritization. Because again, it's not helpful to receive like a hundred notifications about anomalies on a daily basis. We need to basically say, Hey, you have a hundred potential issues, but really
Starting point is 00:28:44 you should focus on these top three, because this one goes into CFO dashboard, this one powers your online machine learning model, and the other one probably gets into Salesforce. So that's one. And the way you do this is by relying on column level lineage and the metadata of dependencies, basically, because we understand how the data is used. And then the other part is root cause analysis, right? Is it a business issue? Is it a data quality issue? Is it an issue that is isolated and given an ETL job?
Starting point is 00:29:12 Or is it actually propagating among multiple people and maybe can we pin it to a source data, like an event or a vendor data set? So doing that, again, relying on lineage as well as sophisticated analysis is what we are really excited about doing. So that's the kind of the workflow dimension, right? Kind of going to the left all the way to when people start doing developments and then going to the right into production.
Starting point is 00:29:36 Now, the second dimension of what we want to cover is the data flow itself. So right now, if you consider the data flow, ultimately we have source data like events. We have production data extracts. We have vendor data sets. We have other kind of exported data that all gets in the warehouse. Then we have the transformation layer and then ultimately data application layer,
Starting point is 00:30:01 meaning BI tools, dashboards, reports, machine learning, reverse TL. So right now, Datafold really shines in the transformation piece. And then we've already extended into the source data and into the application. So for example, for the source part, the big problem is how do we integrate data reliably in the warehouse? Because everyone is moving data around. And for example, it's very customary to copy data from
Starting point is 00:30:27 transactional stores into the warehouse, for example, from Postgres to Snowflake. And doing this at scale is really hard because you have to sync data from data stores that are changing rapidly at pretty high volume into another data store. And what we've noticed is that no matter whether data teams are using vendors or open source tooling, pretty much everyone runs into issues there. And there has been no tooling that would help validate the consistency of data replication and making sure that the data that you end up with in your warehouse actually stays reliable.
Starting point is 00:31:12 So we've actually shipped about three months ago a tool called DataDip in an open source repo under my license that essentially solved this problem of validating data replication between data stores and does this at scale and really fast. So you can basically validate a billion rows in under five minutes between systems like Postgres and Snowflake. And so this is how we essentially address the source and target data quality. An example, right? There are more problems there,
Starting point is 00:31:34 but this is a really, really major one that we saw. And then if we, again, ship to the right from the data transformation layer to the data apps, the core there is to understand how data is used. So once data leaves the warehouse, where it goes and how it's used. And so we've been building integrations with all sorts of tools from BI to data,
Starting point is 00:31:56 versus data activation layer. And that is really important because you can paint the picture of how every single data point is ultimately consumed. You can catch errors way earlier. So for example, we recently shipped integration with PyTouch in a data acceleration company. And so now if anyone is modifying anything in the data pipeline that eventually affects the sync to, let's say Salesforce or MailChimp, they will see a flag that data replication process will be impacted. So that's my vision, sort of like going into the left to right, both in how the
Starting point is 00:32:32 data flows and how people work with data. Henry Suryawirawanacke... All right. That's super interesting. Let's dive a little bit deeper, like starting from the left. Okay. And you mentioned DataDeep, which is an open source tool that can help you like run very efficiently and fast, like diffs between like a source database,
Starting point is 00:32:51 like a Postgres and a destination like Snowflake, correct? Yeah. And you mentioned like, you were talking about that and you were, you, you said that like, it's really, no matter what kind of technology or vendor you're using, to connect the transactional database with a data warehouse, it's really hard to make this at scale like war without issues. Can you tell me about a little bit the issues that you have seen out there around that, because I think people tend to take that stuff for granted.
Starting point is 00:33:24 Alex Bialik- That's right. Yeah. Yeah. Yeah. Yeah. So basically what happens is you have a transactional store, right? That underpins your, let's say, microservice or monolith application. And that transactional store gets a lot of writes, a lot of changes to data. And then you're trying to replicate this into an analytical store, which may not be actually
Starting point is 00:33:44 that great with handling a lot of changes, right? It assumes you just append data. And the way that typically those replication processes work, both for vendors and for open source tooling, is relying on what's called change data capture event stream. So every time anything changes in the introduction store, it would emit an event that, for example, a given value in a given row is changed to a certain value, right? And then all of these events are transported over the network using different streaming systems into another, into let's say your analytical warehouse. And so the types of issues that can be is as simple as event loss. So sometimes these
Starting point is 00:34:28 pipelines build on messaging systems that are not guaranteeing, for example, right order or exactly one's delivery. And if that is the guarantees that you lack, then eventually you may have certain inconsistencies. You just have to accept that there was a happening. The other type of issue is, which is very common, is soft deletes or hard deletes. For example, if a row is deleted from a transactional store, making sure that it's also deleted from that analytical store where it gets synced is actually a pretty hard problem because the change data capture streams might not be able to capture that. And so just two basic examples of when things can go
Starting point is 00:35:12 wrong. And then you have infrastructure layer on top of that, right? So what if you have an outage event stream or a delay in an event stream, how do you know that what data is impacted. And the reason why it gets harder or even critically important is because transactional data is considered a source of truth because if you have a table that underpins your production system and you report users in it,
Starting point is 00:35:41 you kind of rely on that as your source of truth about users because everyone knows events are kind of messy and lossy. And so if that data becomes not reliable in your analytical store, then you kind of lost all your last resort data sources.
Starting point is 00:35:55 And that's why it's really paramount for data teams that they keep that data really reliable. And what makes things complicated is how do you approach data quality there is that how do you check that data is consistent, right? Let's say even if you think your replication
Starting point is 00:36:13 process is reliable, right? We know that all things break eventually, but how do you make sure that they are consistent? How do you measure consistency? It's actually a very hard problem because you might have a billion rows table in Postgres, a billion rows table in Databricks or your data lake. So how do you check the consistency? You can run count star, but that just gives you a number of rows, right?
Starting point is 00:36:37 And validating consistency across that volume of data and being able to pinpoint down to individual value and row is very hard. And it's very hard to do it in a performant way because you can't also put a lot of load in your traditional store. And so the way that Datadiff solves it is by relying on hashing, but doing it in a pretty clever way where obviously if we hash both tables and we hashes, and there is any kind of inconsistency, we'll detect that. But the other problem is that the hashes won't match if you have one record out of a billion missing, and they won't match if you have a million records missing. And you won't know what's the magnitude of the inconsistencies. And so that's why we are doing just a single hash. We're actually breaking down tables into blocks and then checking the hashes in those blocks
Starting point is 00:37:32 and then pretty much doing a binary search to identify down into individual row where the data is inconsistent. And so what you achieve with that approach is both speed. So you can dip billion rows in under five minutes, but also accuracy. It can pinpoint down to individual rows and values that are off. And it's a really hard balance to get right. So we're really excited about that tool. Stas Mislavskyi That's awesome.
Starting point is 00:38:00 And like, okay, let's say I'm a data engineer and I hear about this amazing data div tool, right? How do I incorporate it in my like everyday work? Let's say I have like a typical scenario with Postgres using the BISUM, Kafka, writing on S3 and loading the data on like Snowflake, right? Like I would assume that this is like a very typical like... Yeah, definitely. Many things can...
Starting point is 00:38:29 I mean, things like there's latency, first of all, right? Like there's no way that at the same time that something happens on Postgres will also like be reflected on Snowflake. So I have this amazing new tool and I can use it. So how do I become productive with it? Like how do I improve my workflow? Yeah, absolutely. So when you're dealing with streaming data replication,
Starting point is 00:38:56 there is no way around pretty much watermarking. And by that, I mean sort of drawing the line and saying that all the data which is older than this timestamp should be consistent, right? Because we do expect, you know, lags and delays. And that's the first thing. Once you establish what the watermark is, then you just basically say, okay, I expect data as of certain timestamps to be consistent. You connect DataDip to both data source. And we use, you know, typical drivers for that. so Postgres, let's say, and Snowflake. And then you can run the Datadip process, which typically takes within seconds or minutes.
Starting point is 00:39:37 And then ideally, because we're talking about continuous data checking, you would run it on schedule. So probably you would want to wrap it in some sort of an orchestrator like Airflow or Dijkstra and run it every hour, every day, or every minute, depending on your SLAs for quality. And it's essentially a Python library that also has a CLI. So it's really easy to embed in a data orchestrator or you can even run it in a cron job. So that's also, also quite feeble. Stas Milius Ivovitch.
Starting point is 00:40:08 All right. That's great. And actually you give me the right material to move like to my next question, which is about developer experience and the importance of the data engineer in this whole flow of like monolithic data quality, right? Like, again, I, I, I won't like to, to repeat that, although I might start getting like a little bit like boring, but data is something that can go wrong for many different reasons and because it's interact with so many different actors.
Starting point is 00:40:40 And when I say actors, I mean like actual people, like the CFO, you know, made a mistake on the query or like whatever, clicked. Hour three, which is like, I don't know, like let's say
Starting point is 00:40:53 eventually consistent, like when we got the snaps from the data there, it wasn't that consistent at that point or whatever, right? Like there are technical reasons, there are like human reasons
Starting point is 00:41:04 that might touch that. So, but the role of the data engineer at the end is like the person who is responsible, right? Like to deliver, let's say, like safeguard like the infrastructure and also the quality of the data. So what I hear from you when you talk is that you put like a lot of effort
Starting point is 00:41:24 in like building the right tooling specifically for data engineers, right? It's not like you are building, let's say, a platform that is going to be used by an analyst or someone who only knows what Excel is and doing stuff on Excel or whatever. We are talking about data engineers here. We are talking about developers. We are talking about people who have like a specific like relationship with software engineering. You mentioned, for example, like CI, CD best practices there and like pull requests and like GitOps and like all that stuff.
Starting point is 00:41:54 So first of all, I'd like to hear from you why you believe that like the data engineer is like so important in realizing your vision behind like the product and the company, because unless I'm wrong, right? So correct me if I'm wrong. Yeah, Costas, you bring up a really great point. I think, yes, the reality is that it used to be, even five years ago, that data engineers would be people who would own the entire data pipeline,
Starting point is 00:42:22 and they would build everything, be responsible for everything and would have control of everything. As you point out, the reality now is very different. So we have people on the left of the data, right? So software engineers contributing a lot with,
Starting point is 00:42:37 you know, instrumenting events in the microservices, as well as, you know, the tables that we talked about, I copied from Postgres. Ultimately, software engineers own them, right? And software engineers might not actually have the context. And then to the right, we have less technical people
Starting point is 00:42:52 who are on the data consumption side and analysts, analytics engineers, and then even people from other teams like financial analysts that now kind of have to become familiar with data pipelines because they need to rely on analytical data to get their job done. I think the right way to think about what data fault ultimately solves and improves is the workflow of a data developer.
Starting point is 00:43:17 And I define this really broadly, right? So it can be a software engineer who defines an event that eventually gets consumed in analytics, or it can be a financial, you know, a P&A analyst who contributes to a dbt job because that informs how they build financial model. I think the reason why data engineers, analytics engineers are really central in this conversation is that even though multiple teams contribute to data, ultimately still that is the persona and that is the team that defines
Starting point is 00:43:48 the majority of the data stack and how data is consumed and produced. And they are almost center of the collaboration around data. And so giving them tools that would empower them to do their job faster is really important. But what Datafold does actually goes beyond that and it goes into the true,
Starting point is 00:44:11 I don't want to say like a corny phrase, but I'll say it, data democratization. Because if you have, if it's so easy to test any change in data pipeline, no matter who does it, software engineer, analytics engineer, data analyst, data engineer, then it's much easier and less risky to involve other folks, other teams to collaborate on data pipeline than before, right? And the very, like one of the worst bottlenecks we're seeing in companies and data-driven companies that data engineering teams become really bad bottlenecks for their business because data scientists cannot build ML models unless they have clean data. And so they push down that data equation, data engineers, and then data engineers don't
Starting point is 00:44:59 let other people contribute data five-fives because they are afraid that people may break stuff. And so they basically become a bottleneck, get overworked, and the company, the business doesn't move as fast as possible. And so if we make change management and data testing so, well, easy and automated, regardless of who is making the change, right, then anyone who catches data will do a better job and we'll be catching errors for everyone. So basically we're able to elevate the data quality throughout the pipeline and we can help the business move faster because more people
Starting point is 00:45:32 will be able to reliably contribute to building of data applications. Does that make sense? Yeah, absolutely. And one last question from me and then I'll give the stage back to Eric because I'm sure he has many questions. And we have to also mention that, as you say, there are many people that are touching the data inside the organization. And probably one of the most dangerous ones is marketeers, right? And they really won't like to touch the data. So we need to hear from them and see experience of like ruining data around, right?
Starting point is 00:46:08 But before, before, before we do that, like one last question, which is like a little bit of like a selfish question, like as a product person that builds like stuff for developers and engineers. Can you share with me, like based on your experience so far, like what was like something that really, really important to keep in mind when you're designing an experience and the product for the developer and the engineer? What is the first thing that you bring in your mind when you start brainstorming about a new tool for data engineers. Yes. I think there are, there are hard to say like, what is the most important thing,
Starting point is 00:46:48 but I'll probably call out a few. I think better. Yes. I think one, and I think you called it out cost us is that data stack is so heterogeneous, right? So many tools,
Starting point is 00:47:02 so many different databases and data application frameworks. And how do you generalize your product so that you work as equally well as possible with most of them? How do you distill your vision to certain principles and patterns that could work with any tool and it would allow you to build, especially in the data quality space, improve the workflow regardless of what stack people are using. And ultimately, you will inevitably have to focus on certain stack. For example, Datapoll really focuses on what is called modern data stack.
Starting point is 00:47:44 So cloud data workhouses or really mature modern data lakes, right? Like what you can get with, you know, Trino, these data that are really kind of almost kind of self-serve, right? For data teams, as opposed to more legacy systems, right? So we have to make these calls. But even though, even if you limit your scope of the modern data stack, it's still really hard. And I would say we really practice, but I think we so far
Starting point is 00:48:13 did a reasonably good job. I think the second problem is how do you integrate in the workflow of data engineers, again, given that different teams go about things in different ways. For example, if you want to do any kind of testing before production, then you have to rely on teams having version control for their pipelines, as well as
Starting point is 00:48:36 having a staging environment. Because if you don't have staging, there's nothing you can test. So you have to basically take a new version of code, doesn't matter if it's a dbt code, Looker code, or PySpark code, and basically test it, right? The way Datafold does it is by comparing
Starting point is 00:48:52 it to production and showing you exactly how things are going to be changing, right? But data teams may be building staging in so many different ways. Sometimes they may be using synthetic data, sometimes they're using production data to build staging. And so how do you generalize is a really big problem. And so again, finding common patterns that would be most applicable to most teams.
Starting point is 00:49:14 And sometimes it means betting on maybe less popular ways of doing things, but betting on those things eventually will become mainstream. One example of that, actually, that tough trade-off that we had to make is, so you consider how data transformations are orchestrated these days. The most popular tool is Airflow, right? Used by perhaps hundreds of thousands of data stacks in the world. And then there's also dbt, which is an emerging tool, which is a more modern approach to building that as well as maybe Docter. But those tools have orders of magnitude
Starting point is 00:49:52 smaller adoption than Airflow. And we had to make a call early on to actually not focus on building data quality automation for Airflow. We still support it, but not as deeply as we would support, let's say, DBT. And the reason is that with Airflow. We still support it, but not as deeply as we would support, let's say, dbt. And the reason is that with Airflow, it is very, very hard to build a reliable staging environment. Just the way the tool works, it almost kind of forces you to test in production.
Starting point is 00:50:17 And so that makes it really hard to work and implement any kind of really reliable change management process within Airflow. Whereas with tools like Daxter or dbt, development and staging environments come with a tool. And so it's really easy for us to come in and actually build on top of that. And these are really hard calls because by the time we started prioritizing and trading with dbt and Daxter, it wasn't the power that they will actually win. We just thought that their approach is actually the one which eventually will take over because it's more robust and more reliable. But it was a really tough
Starting point is 00:50:52 call to make. Super interesting. Eric? Yeah, that is fascinating. I have more questions on that, but if I got into that, Brooks wouldn't be happy because we do have to end the show at some point. Gleb, I have a question about the adoption and implementation of data fold or
Starting point is 00:51:18 technology like data fold in the lifecycle of a company. And I'm speaking from experience here as a marketer to Kostas' point, who has messed up a lot of data. Yeah. You know, messed up a lot of reporting by sort of introducing new data. And I'll explain the question like this, you know, and actually, you know, even with just sort of a personal anecdote with my experience, you know, multiple different companies, you're growing really fast or, you know,
Starting point is 00:51:51 you're launching some new data initiative and you have, you know, limited resources that are working on that. And, you know, say it's an analyst and maybe you're borrowing some engineering time. You don't have like a fully formed data team yet. Yet the company is growing really quickly. In many ways, it's almost like what you described at Lyft, right? Where data is the lifeblood of the company, but it's growing so fast that everything seems
Starting point is 00:52:18 to be on fire. And what's difficult in that situation is to slow down enough to implement both the processes and discipline it takes to sort of change your workflows, like implement new tooling, etc. Because you're moving so fast. And it really is a challenge because you inevitably create technical debt later on. So you wish you would have done it. But when you're in the moment, it's hard to, you know, sort of, it's hard to be that forward thinking because you're dealing with what's right in front of you. How do you see that play out with your customers? How have you dealt with that? And do you have any advice for our listeners who are thinking about that very challenge where it's like wow i would love to i would love to implement this but you know it's just really hard for where we're at as a company
Starting point is 00:53:13 yeah i would say ultimately it's not even about data fold i think this comes down to how a company and a data team thinks about putting together the data stack. And I think it comes down to doing the right amount of tinkering. And by that, I mean that if you look at the modern data stack today, you have really great tools that take care of most of your needs, kind of ingesting data, moving it around, transforming it, processing it. And you really can assemble a stack in such a way that it just works. And then you spend your time thinking about business metrics, thinking about analysis and how to drive your business forward. And then what is probably not the right way to go about this or some of the mistakes that teams make is then when they think they, for some reason, need to ticker more than they actually do,
Starting point is 00:54:09 they would start adopting, take open source projects and start running them in-house because they think they will save costs or because they think they need control or they would be kind of afraid of vendor lock-in. And sometimes it's premature optimization. Sometimes it's kind of engineering ambition out of control.
Starting point is 00:54:30 I'm sorry for engineers, but just right amount, right? So if the data team is adopting all the good tools and focuses on the things that they should be focusing on, that's great. And in that world, data fold is extremely easy to implement because we integrate with the standard modern data stack tools. The integration takes like about an hour and you basically improve your workflow in a day. So you connect, let's say, dbt cloud, like GitHub, your warehouse, and things
Starting point is 00:55:04 are just working out of the box. So it's actually not hard. You don't need to spend a lot of time writing tests or anything. Where you may have run into longer implementation times is when, let's say, you have a really legacy beta platform or you have a really unorthodox stack for some reason where you start to kind of you build a framework internally that you know is like a certain patterns that are not common then you you know then you will probably use our sdk
Starting point is 00:55:40 as opposed to like turnkey integrations and that may turn into a longer-term innovation project. But all that is to say that for a data team on modern data stack, using best practices, Datafold is extremely fast to implement. So I don't think, unlike assertion tests, where you do need to invest a lot of time to writing them, Datafold, or the compression testing capability, is actually really fast to implement. So you don't have to choose between moving fast and moving fast and kind of having reliable data.
Starting point is 00:56:10 You can have both. You can move fast with a stable infrastructure. Love it. That is super helpful. Well, Gleb, this has been such a wonderful conversation. Super helpful. We've learned a ton. So thank you for giving us some of your time
Starting point is 00:56:27 to talk about Datafold and data quality in general. Thank you so much, Eric Costas. Also, really appreciate you asking really kind of deep questions and helping me clarify my thinking. So really enjoyed it as well. Awesome show. What an interesting conversation, Costas. I think that my big takeaway, and I'll probably be thinking a lot about this this week, is the initial discussion around anomaly detection, which I thought was really interesting and how it was really helpful
Starting point is 00:57:06 in some ways in his experience at Lyft, but then also fell flat in some ways within the company because of some inherent limitations. And then the more I thought about that in connection with the way that he describes starting at the point in the stack where data fold sort of solves the initial problem and then moving left and right with like in being involved in sort of the like pre-production process or the deployment process was really interesting. And I think to summarize that, it would be that, you know, anomaly detection, AI, all detection, AI, all of that technology can be really useful, but ultimately you have humans involved in creating, managing, processing data. And so you have to have tooling that actually is interjected into the process that those humans
Starting point is 00:57:57 are using. And Gleb seems to have a really good handle on that. And so that's just a really good reminder, I think, about data quality and the particular characteristics of trying to solve it. Yeah, I agree. I think what I really enjoyed from the conversation, what I'm going to keep and probably think a little bit more about it is about the role of the data engineer. I think we are seeing the data engineering discipline becoming much more solid and well established as something important around having your data infrastructure and working with data.
Starting point is 00:58:43 But also, it's evolution, like it's going through like a very rapid evolution right now. Like we hear that like from Glyph by saying like how a couple of years ago you could, and when I say a couple of years, we are talking about like two, three years ago, it was not that far away, you could have like as a data engineer, like the complete control over over your pipelines and the infrastructure. So it was much easier. I mean, at least you had control over what was going wrong.
Starting point is 00:59:13 But that's not the case anymore. You have so many different stakeholders and so many different moving parts in the infrastructure that makes it yet challenging. And it's part of the evolution of the role. So yeah, I think that's something that like we, we should investigate more in general. I agree. Alrighty.
Starting point is 00:59:33 Well, thank you for joining us on the Data Sack Show. As always subscribe. If you haven't tell a friend, we love new listeners and we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com.
Starting point is 00:59:57 That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
