The Data Stack Show - 108: You Can’t Separate Data Reliability From Workflow with Gleb Mezhanskiy of Datafold
Episode Date: October 12, 2022

Highlights from this week's conversation include:
- Gleb's background and career journey (2:51)
- The adoption problems (10:53)
- How Datafold solves these problems (18:08)
- The vision for Datafold (26:27)
- Incorporating Datafold as a data engineer (38:53)
- The importance of the data engineer (42:12)
- Something to keep in mind when designing data tools (46:46)
- Implementing new technology into your company (53:18)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Discussion (0)
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. Today, we are talking with Gleb from Datafold. Datafold is a data quality tool, and they have a super
interesting approach to data quality. Costas, one of my burning questions is around when to
implement data quality tooling and processes in the life cycle of a
company. Because a lot of times, and you and I both know this from experience, you
approach it in a reactionary way, right? Like something's breaking, a dashboard breaks, you're
trying to do some sort of analysis, you want to launch something new. And you just run into
data quality issues that really limit what
you're doing. And so then you begin to implement those processes. And so I know that
Gleb worked with companies who were trying to tackle data quality all across the spectrum.
And so I just want to hear from him. What does that look like today? Do you have to be reactionary?
Is it worth the time cost it takes to do that proactively? So that's what I want to learn about. How about you?
Yeah.
I want to hear from him, like, how do you start building a technology in the
product around data quality?
Because data quality is one of these things that's so broad, and
there are so many different ways that you can implement it, in so
many different parts of the data stack where you can go and start working on it. So I'd love to hear
from him about his experience in starting a business and
a product around that, the hard decisions that you have to make
in order to start. So yeah, I think we'll start from there, and I'm sure that
we will have plenty of opportunities to go much deeper into the product itself
and the technology and all the different choices made around it.
I agree.
All right, well, let's dig in and talk with Gleb.
Gleb, welcome to the Data Stack Show.
We're so excited to chat with you about all things data and specifically Datafold.
Thanks for having me.
We're excited.
All right, well, let's start where we always do.
We'd love to hear how you got started in data with your career, and what you
were doing before Datafold.
So yeah, we'd just love to hear about the path that led you to starting Datafold.
Yeah, absolutely.
So my original academic background is economics and computer science.
And I started my career around 2013
as a data engineer. I joined a big company
called Autodesk,
which focuses on B2B
software for creators.
But at the time,
they were putting together
a consumer division.
And I ended up essentially
almost creating a data platform
from scratch
for that consumer division,
tying all the different metrics
from different apps
that Autodesk acquired.
And it was a really interesting time, because if you remember, 2013 is when I think Spark
had just been released,
and Looker had just come out of stealth, same as Snowflake.
So a lot of the tools and companies and technologies that we now consider really foundational at that time
were super early stage, super cutting edge. And so that was a really exciting time to tinker with
data. And after a year at Autodesk, I moved to Lyft, where I ended up being one of the first, one of the founding members of the Lyft data team.
And at that time, Lyft was in a super hyper-growth stage.
Data infrastructure was kind of barely there.
So we had one Redshift cluster
and we're building everything on top of it.
And things were constantly on fire.
And I remember days when essentially entire analytics team would basically go get Boba
because Redshift was completely underwater by all the queries that everyone tried to run.
And I initially was tasked with building all sorts of different data applications
from forecasting supply and demand to understanding driver behavior.
But I was so frustrated, not with the quality, but with the tools that I had
at my disposal. And that wasn't necessarily Lyft; it was basically, at that time, what was available
for data engineers in general off the shelf. The experience was quite terrible, from just
not being able to run queries,
to not being able to test things and standard data, trace dependencies, all that was extremely
time consuming. And so I kind of gravitated naturally to building tools. First time,
simple tools for my team members, for example, build a dev environment where people could
prototype their ETL jobs before pushing them to production.
Before that, we kind of tested in production, which was really a bad idea.
I also had some really, really bad horror stories that I think led me ultimately
on my journey to build Datafold.
So one of them, which I quite often cite: we had this practice of data engineers being on call.
Essentially, data at Lyft was so important that the entire company pretty much ran either on fully automated decision making and machine learning, or on meetings organized around reviewing dashboards, seeing how the company performed. And so delivering data on time
by certain SLAs
always has been super important.
That's why we had on-call engineers making sure that
whenever a pipeline was clogged
or late, people could actually address
that. So I was an on-call engineer
one day,
and I was woken up by a PagerDuty alarm
at, I think, 4 a.m. because
there was some really important job computing rides that failed.
And so I looked at the error and found some bad data that had entered the pipeline, implemented a two-line SQL filter, and pushed the change.
Everything looked good.
Did some quick data checks, got a plus one from my buddy.
And then went back to sleep.
Everything seemed normal and green.
And then you kind of see where it's going.
So yes, next day I came to work.
And then probably two hours into the workday, I was forwarded an email from, I think, the CFO,
that was looking at the dashboards and everything was kind of all over the place,
looking really weird.
And so the craziest thing is not that I managed to break lots of dashboards.
The craziest thing is that it took us about six hours of sitting in a war room
with me, a few other senior data engineers,
and trying to understand what's going on.
And it took us six hours to actually pinpoint the issue to the fix,
the hotfix that I made.
And obviously, if I was able to make such a trivial mistake and bring down so many tables and have a really big business impact with just a two-line SQL hotfix,
if you extrapolate this to how much loss is happening just due to data breakage around the industry, it's actually enormous and it's really easy to break things.
And luckily for myself, I wasn't fired back then.
I was actually put in charge of building tools to make sure that this exact error doesn't happen again. And so we introduced a real-time anomaly detection system that was helpful and then
focused on improving the developer workflow to make sure that developers actually don't
introduce such issues into production.
And one of the interesting learnings that I had at Lyft that ultimately informed, I
think, the way Datafold approaches the topic of data quality.
So the first reaction, when we had this issue and others
(I was not the only one breaking things), was that we need a system that would
catch anomalies in production.
We need something that would detect when things are broken because, well, we don't
want CFO to forward us an email,
you know, with a dashboard screenshot, and say, hey guys, I think this is wrong, right?
That is a really bad way to learn about an issue, from your stakeholders.
And so we implemented a really sophisticated real-time anomaly detection system that would compute metrics in real time using Apache Flink,
both from our ETL transformed data as well as some streaming, so events.
And that was somewhat impactful, but we really struggled to get adoption
and really struggled to make an impact with that as much as we hoped.
And the ultimate problem was that one, that system was kind of too late into the process.
So by the time you detect something, it's kind of already broken.
And so a lot of the teams really struggled to see the value there.
And the second challenge was that it, in a way, existed outside of workflow.
So if someone is going to introduce a break and change like I did, right, or a bug,
and they have to learn
about this from a system that kind of detects
this bug in production. They have to kind of
drop everything, whatever they're doing.
They have to go outside of the workflow
and then focus on investigating whatever anomalies
that system finds.
And we found that disrupting the
workflow is actually a really expensive
way to get on top of data
quality issues.
And so we started focusing back at Lyft to build tools that would actually prevent things from breaking.
And that also informed what we're trying to do at Datafold.
So our philosophy is in proactive data quality and shifting left, which means that as much
as possible, we try to detect issues very early
on in the dev process, ideally in the IDE when someone types SQL, but at least in staging or
pull request review, not in production when things have already done the damage. But that's another story.
That's probably another question. But that's pretty much how I ended up building Datafold.
Love the story. I'd love to dig into the adoption problems a little bit more because I think that's really
interesting in large part because, you know, there are a number of sort of data observability,
data quality tools that are betting on anomaly detection as sort of the primary way to solve that problem.
Would you say, and maybe this is just me rephrasing what you already said, but would you say that
the anomaly detection as you described it at Lyft was actually just a less manually intensive way of doing alerting, right? It was sort of
almost like, you know, a more efficient way of doing alerting, right? But alerting
tells you that something is already broken. And that's why, you know, people were just kind of
like, well, we already are getting alerted when something's broken. The fact that we can do that
more efficiently isn't exciting. Is that how you would describe that dynamic? Or, I'd just love to know more about that adoption problem.
Yeah, I think there are a few challenges here. Well, I think in general, alerting and monitoring is valuable. So
we actually were able to catch really bad incidents in production where we would not do a really
important step in the billing process.
And that would have led to a major loss for the company.
We would basically fail to bill people who should have been billed.
And that system actually helped us detect not just data quality issues, but a real production
issue.
So definitely not discounting the value of alerting.
I think the challenge is that a lot of the times we forget about where the data quality or general issues are coming from.
And I like to think about this in a very simple form. So I think ultimately, when I think about data quality, although it's a very big topic and a large problem, it's ultimately either we break data: the people who develop data pipelines or touch data in various ways introduce changes that, you know, change the definition of things.
Like we change how a session is calculated and then that potentially throws off all our calculus around conversion, right?
Or maybe we change the schema of an event because a microservice can no longer provide given fields. And then some machine learning model that relied on this is no longer,
you know, doing machine learning.
Things like that.
Right?
So that's kind of, we break data.
And then there's also a category, which I call they break data, when
external things happen. And those external things are, for example, we
may buy data from a vendor, and that vendor ships us a bogus data set
that doesn't meet all the SLAs, right? Just completely outside of our control. Or we have,
let's say, Airflow running, and Airflow is known to have a really funky scheduler, and sometimes
things don't get scheduled, or sometimes they may be marked as completed but actually left in a hanging state. And, well,
that's not really our fault; it's an infrastructure fault.
And so I think where data monitoring really is helpful is in detecting things that are
in the they break data category, right?
Things that are outside of our control, as well as maybe being like the last final defense for stuff that we break,
that people break.
But I think where we are really missing sometimes, like the counterintuitive challenge is that
we tend to attribute failures to like external factors, where in fact, it's us, like generally
the people in the company, breaking things one way or the other.
And so what I think the major learning for me was that we really need to invest in building systems
that are more like anti-fragile and more robust to breaking. And that means potentially having
better systems for how we do data in production, right? So, like, in general, improving how we orchestrate jobs and having data contracts.
But the part where I am most excited about is improving how we develop data.
So improving the development process, in particular, how we introduce changes to data products,
right?
Be that events, be that transformations in SQL or Spark, or even end-user data
applications like Looker dashboards or certain machine learning models.
I think that's probably my bet is that I see a huge opportunity in improving the status
quo of, oh, everything is broken in really improving the change management process.
And then if you do that, then there are fewer things that are breaking in production.
And the other challenges that we saw is data is inherently noisy, right?
And it's always changing.
And so when we're talking about data monitoring, we're talking about typically unsupervised
machine learning, where a system would learn a pattern of data.
And by pattern, I mean that can be anything from a typical number of rows that a data set gets daily, right?
Or what is the distribution in a given dimension or metric column in a data set?
And then when the reality doesn't conform to that baseline, we get an alert, right?
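The baseline-and-alert loop Gleb describes can be pictured in a few lines. This is a toy illustration, not the Flink system from the conversation: it flags any day whose row count strays too far from a trailing-window baseline.

```python
from statistics import mean, stdev

def detect_anomalies(daily_counts, window=7, threshold=3.0):
    """Flag days whose row count deviates from the trailing window's
    baseline by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(daily_counts)):
        baseline = daily_counts[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            sigma = 1e-9  # avoid division by zero on a perfectly flat baseline
        z = abs(daily_counts[i] - mu) / sigma
        if z > threshold:
            anomalies.append((i, daily_counts[i], round(z, 1)))
    return anomalies

# A table that normally gets ~1000 rows/day, with one broken day.
counts = [1000, 1020, 980, 1010, 995, 1005, 1015, 1008, 30, 1012]
print(detect_anomalies(counts))  # flags day 8 (30 rows)
```

Note how the day after the incident is not flagged: the broken value has polluted the baseline, which is one reason real systems need more care than this sketch.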
But given that our business is always changing, especially in high-growth companies, and given
that we are operating in a world where, even for a data team not at
a unicorn, at an earlier-stage startup, it is kind of common to have
thousands of tables in production and tens of thousands of columns.
You know, we can find anomalies there all day long, right?
And the real challenge is how do you actually identify what is important?
What is worth fixing?
What is a real issue?
What is a business issue versus a data quality issue?
And that's probably what really holds back adoption
of data quality platforms.
That's what really
held us back at Lyft with our real-time
anomaly detection system.
And that's why I also think
change management
is so important
is because
if we can prevent
preventable errors
early on in the dev cycle,
if we prevent
them from reaching production,
then we have less noise
to deal with, right? We just have fewer things to worry about.
Yeah, absolutely. No, that's super helpful. Well,
we probably should have asked this question earlier, but can you tell us... and I think
we kind of touched on many things, but tell us how Datafold specifically solves these problems.
You sort of described the problems really well.
I would love to hear about what the product does specifically.
Yeah.
And how you've built it in response to those things that you've experienced.
Yeah, absolutely.
So I think to describe how Datafold approaches it, it probably makes sense for me just to outline
certain like beliefs that I have about the space.
And I think the three principles for kind of reliable data engineering that I have is
that one, to really improve data quality, we need to improve the workflow. So not invent tools that would kind of sit outside of workflow
and send us notifications, but look at how people go about, you know, writing SQL models or modifying
SQL models or, you know, developing dashboards and improve their workflows so that they are much
less likely to introduce bugs.
And so that it's less painful for them to develop and they can develop faster.
An example of that: right now, we've pretty much adopted the notion of version control in the data space, right?
Maybe even five years ago, that was kind of novel, but by now everyone has agreed that
we should version control everything.
Events, transformations,
even BI tools, right?
Even reverse ETL,
even event schemas, everything.
And so that means we have
the ability to stage our changes
and also have a conversation
about the changes
in what's called a pull request
or a merge request, right?
And so if we can get
to that stage,
this is exactly
sort of a circuit breaker where, within the
workflow, you can improve things. This, as an
example. I think the other
principle that's really important is
that we have to know what we ship.
And it's kind of obvious, and
it's kind of
humbling to admit that we,
you know, data engineers, analytics engineers, a lot
of times don't really know what we're doing. We think
we know our data, but data is far more complex.
And, you know,
it's not uncommon
for us when we write SQL query
with a new data set, just
like not really knowing what's in there,
right? We make certain assumptions. I think this column
has a certain distribution, or I think
this column actually has values. But a lot of times
we actually don't know, and those assumptions are wrong.
And if we develop our data products with wrong assumptions, we are just setting ourselves
up for bugs and for errors.
And so I think it's really important to know the data that you're working with.
So, as an example of how Datafold helps there: we provide profiling.
So anytime you're dealing with any new dataset, we visualize all the distributions in the dataset as well as provide you with certain quality metrics.
For example, what's the fill rate in that column?
Is this column unique or not?
And that really helps avoid a lot of the basic errors when we are writing code or building
dashboards on top of a dataset.
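The profiling metrics mentioned here (fill rate, uniqueness) are easy to picture in code. A minimal sketch, with made-up column data, of what such a profiler computes per column:

```python
def profile_column(values):
    """Compute basic quality metrics for one column:
    fill rate (share of non-null values), distinct count, uniqueness."""
    non_null = [v for v in values if v is not None]
    n = len(values)
    distinct = len(set(non_null))
    return {
        "fill_rate": len(non_null) / n if n else 0.0,
        "distinct": distinct,
        "is_unique": bool(non_null) and distinct == len(non_null),
    }

# Hypothetical columns from a users table.
user_ids = [1, 2, 3, 4, 5]
emails = ["a@x.com", None, "b@x.com", None, "a@x.com"]
print(profile_column(user_ids))  # fully filled and unique
print(profile_column(emails))    # partially filled, contains duplicates
```

Knowing up front that `emails` is only 60% filled and non-unique is exactly the kind of assumption check that prevents a bad join or a broken dashboard later.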
The other important thing about knowing what you ship is understanding dependencies within
data, because it's hard enough to understand a given dataset, right?
But it's even harder to understand where does the data come from or to understand what is
actually the source of a given dataset as well as understanding where it goes.
So if I don't understand where it comes from, I may not know how actually it represents
my business, right?
If I don't know which event, for example, a given table is based on, I may not know
exactly what this table is describing.
If I don't know who is consuming my data, I'm likely to break something for
someone eventually, right?
And so understanding dependencies within the data platform is super important, and that
solves with lineage.
So Datafold provides column-level lineage, which essentially allows you to answer the
question of, for a given column or table, how it's built and what downstream data applications, such as BI dashboards or machine
learning models, depend on it. Really, really foundational
information. And then, finally, I think the third principle is
automation. So I think no matter how great your audit checks are,
or your processes, unless this is so easy to do for people that they don't actually need to do it,
it just happens automatically, they won't do it, right?
So we can't assume people will test something.
We have to test it pretty much in a mandatory way.
And in software, that concept was, you know,
adopted a long time ago, right?
So when we ship software, we have CI, CD pipelines
that build staging, a lot of unit tests,
integration tests, sometimes even UI tests automatically
for every single commit, right?
So we need to also get to this point with data
because this is the only way to actually be able
to catch issues and not rely on people
to do the right thing.
And so the example of how this comes together in Datafold is that whenever someone makes change to any place in their data pipeline,
let's take maybe a SQL table model in dbt, Dagster, or Airflow as an example,
Datafold provides full impact analysis of this change.
So basically, we call it DataDiff.
We show exactly how a change to a source code on, let's say, SQL
affects the data produced.
So let's say I changed a few lines of SQL,
and now the definition of my session duration column changed.
So the question is how, right?
Am I doing the right thing?
So we compute that difference, and we show exactly how the data is going to be changing.
And we're not only doing this for the script, for the table, for the job that you're modifying,
we're also providing a full impact analysis downstream.
So we're showing you, okay, if you're changing this table and this column, these are all the
downstream changes, cascading changes that you'll see in other tables and ultimately in dashboards
and maybe also in the reverse ETL applications. And so we can trace how data in Salesforce that
gets synced five layers down will be affected by any given change to the data pipeline.
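The downstream impact analysis described here amounts to a graph traversal over column-level lineage. A small illustrative sketch, with a hypothetical lineage graph (the table and column names are invented, not Datafold's):

```python
from collections import deque

# Hypothetical column-level lineage: each key feeds the listed downstream columns.
LINEAGE = {
    "orders.session_duration": ["daily_metrics.avg_session"],
    "daily_metrics.avg_session": ["cfo_dashboard.engagement", "salesforce_sync.engagement"],
    "cfo_dashboard.engagement": [],
    "salesforce_sync.engagement": [],
}

def downstream_impact(column, lineage):
    """Breadth-first traversal: everything transitively fed by `column`."""
    impacted, queue = set(), deque([column])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return sorted(impacted)

print(downstream_impact("orders.session_duration", LINEAGE))
# ['cfo_dashboard.engagement', 'daily_metrics.avg_session', 'salesforce_sync.engagement']
```

With this kind of graph, a change to one column can be mapped, before merge, to every dashboard and sync it could affect.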
And that all happens in CI.
So we don't need people to do anything.
It's basically completely integrated with the CI-CD process.
And it basically populates the report back in the pull request so that the developer knows what they're doing.
But very importantly, they can loop in other team members
and have them have full context about the change.
And this is really important because pull request reviews
for data engineers are almost like a meme topic,
because no one understands what's going on.
And it's really hard to get the full context of a change
unless you really see what it does, right?
But now with Datafold, a team can have a conversation about what the change
is going to do, loop in the stakeholders, and make sure that nothing breaks and no one is
surprised. So that's an example of how Datafold does it.
You've been describing, for quite some time now, all the different dimensions involved in data quality.
And I think we all understand data quality is just complicated.
Data infrastructure gets more and more complicated, more moving parts.
It's a stack.
I don't know.
We can call it whatever we want, but the only certain thing is that it's too many moving parts
that they have to be orchestrated and work together.
There are just too many actors that interact with data,
and each one of these actors might change something
in a very unpredictable way.
So my question is,
deciding to start building a product and a business around that, right?
What's your vision?
What's your goal?
Let's say, five years from now, what would you like to see the product being able to
deliver to your customers?
And second, where do you start from in realizing this vision?
Because obviously it's not something that you can build from one day to the other, right?
So I'd love to hear a little bit about how you, as the founder of the company,
and with obviously an obsession around this problem,
try to realize this vision.
Yeah, absolutely.
So my vision for Datafold is to build a platform that would automate data
quality assurance from left to right in two dimensions.
So the first dimension is the developer workflow.
And what it means is, if we consider where Datafold sits now, we're kind
of really catching things in staging. So when someone is opening a pull request and they are
about to merge the code, how can we catch as many errors as possible before that pull request
gets merged, right? And so we can go left from here, meaning that
we can detect issues even earlier.
For example,
as someone is typing SQL
in the IDE and plugging
into their
IDE and provide a lot of
decision support in that point so that
they don't even write bogus code.
That's where, again, the software industry
is moving, right? We have
code auto-completion as well as static code analysis
and flagging of all sorts of
issues in the IDE.
And I think we'll get there in the data space.
And we have
a lot of the fundamental information
to do that.
And what we need is to really plug
into the workflow and into the tools where people
develop those workflows. And then going to the right in the developer workflow also means that
once someone ships the code to production, we also are going to provide continuous monitoring
and make sure that that code stays reliable, that these data products stay reliable.
And to the point of data monitoring, right, I still believe it's very valuable, but we decided to start a little bit to the left of that,
first in the dev lifecycle.
My vision for what a good data monitoring solution is, it's not only about detection.
I think that's the easy part. The much harder part that I'm excited about solving is the root
cause analysis and prioritization.
Because again, it's not helpful to receive like a hundred notifications
about anomalies on a daily basis.
We need to basically say, Hey, you have a hundred potential issues, but really
you should focus on these top three,
because this one goes into CFO dashboard, this one powers your online machine learning model,
and the other one probably gets into Salesforce. So that's one. And the way you do this is by
relying on column level lineage and the metadata of dependencies, basically, because we understand
how the data is used. And then the other part is root cause analysis, right?
Is it a business issue?
Is it a data quality issue?
Is it an issue that is isolated in a given ETL job?
Or is it actually propagating across multiple pipelines,
and maybe we can pin it to a source data set,
like an event or a vendor data set?
So doing that, again, relying on lineage
as well as sophisticated analysis
is what we are really excited about doing.
So that's the kind of the workflow dimension, right?
Kind of going to the left, all the way to when people start doing development, and then going to the right into production.
Now, the second dimension of what we want to cover is the data flow itself.
So right now, if you consider the data flow,
ultimately we have source data like events.
We have production data extracts.
We have vendor data sets.
We have other kind of exported data that all gets in the warehouse.
Then we have the transformation layer
and then ultimately data application layer,
meaning BI tools, dashboards, reports,
machine learning, reverse ETL.
So right now, Datafold really shines in the transformation piece.
And then we've already extended into the source data and into the application.
So for example, for the source part, the big problem is how do we integrate data
reliably in the warehouse?
Because everyone is moving data around.
And for example, it's very customary to copy data from
transactional stores into the warehouse, for example, from Postgres to Snowflake.
And doing this at scale is really hard because
you have to sync data from data stores that are changing
rapidly at pretty high volume into another data
store.
And what we've noticed is that no matter whether data teams are using vendors or open source tooling, pretty much everyone runs into issues there.
And there has been no tooling that would help validate the consistency of data replication
and making sure that the data that you end up with in your warehouse actually stays reliable.
So we've actually shipped, about three months ago, a tool called data-diff in an open source repo under MIT license that essentially solves this problem of validating data replication
between data stores and does this at scale and really fast.
So you can basically validate a billion rows in under five minutes
between systems like Postgres and Snowflake.
And so this is how we essentially address
the source and target data quality.
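The trick that makes cross-database diffing fast at that scale is checksumming whole key ranges and drilling down only into segments whose checksums disagree, so matching data never has to be transferred row by row. A toy in-memory sketch of that idea (the real data-diff tool pushes the hashing into each database as SQL; this just illustrates the recursion):

```python
import hashlib

def segment_hash(rows):
    """Hash a sorted run of (key, value) rows into one digest."""
    h = hashlib.sha256()
    for key, value in rows:
        h.update(f"{key}:{value}".encode())
    return h.hexdigest()

def diff_tables(a, b, lo, hi, segments=4, min_span=2):
    """Compare two {key: value} tables over key range [lo, hi) by
    hashing segments and recursing only into segments that differ."""
    rows_a = sorted((k, v) for k, v in a.items() if lo <= k < hi)
    rows_b = sorted((k, v) for k, v in b.items() if lo <= k < hi)
    if segment_hash(rows_a) == segment_hash(rows_b):
        return []                      # whole range matches: one hash comparison
    if hi - lo <= min_span:            # small enough: fetch and compare rows
        return sorted(set(rows_a) ^ set(rows_b))
    step = max(1, (hi - lo) // segments)
    diffs = []
    for start in range(lo, hi, step):
        diffs.extend(diff_tables(a, b, start, min(start + step, hi), segments, min_span))
    return diffs

src = {i: f"row{i}" for i in range(100)}
dst = dict(src)
dst[42] = "corrupted"                  # one divergent row
del dst[77]                            # one missing row
print(diff_tables(src, dst, 0, 100))   # surfaces only the rows that differ
```

When the tables agree, the comparison costs a single checksum per side; the per-row work is proportional to the number of divergent rows, not the table size.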
An example, right?
There are more problems there,
but this is a really, really major one that we saw.
And then if we, again, ship to the right
from the data transformation layer to the data apps,
the core there is to understand how data is used.
So once data leaves the warehouse,
where it goes and how it's used.
And so we've been building integrations
with all sorts of tools from BI to data,
versus data activation layer.
And that is really important
because you can paint the picture
of how every single data point is ultimately consumed.
You can catch errors way earlier. So for example, we recently shipped an integration with Hightouch,
a data activation company. And so now, if anyone is modifying anything in the data pipeline
that eventually affects the sync to, let's say Salesforce or MailChimp, they will see a flag that data replication process will be impacted.
So that's my vision, sort of like going into the left to right, both in how the
data flows and how people work with data.
All right.
That's super interesting.
Let's dive a little bit deeper, like starting from the left.
Okay.
And you mentioned data-diff, which is an open source tool that can help you
run very efficient and fast diffs between a source database,
like Postgres, and a destination like Snowflake, correct?
Yeah.
And you were talking about that, and you said that, no matter what kind of technology
or vendor you're using to connect the transactional database with a data
warehouse, it's really hard to make this work at scale without issues.
Can you tell me a little bit about the issues that you have seen out there
around that? Because I think people tend to take that stuff for granted.
That's right. Yeah.
So basically what happens is you have a transactional store, right?
That underpins your, let's say, microservice or monolith application.
And that transactional store gets a lot of writes, a lot of changes to data.
And then you're trying to replicate this into an analytical store, which may not be actually
that great with handling a lot of changes, right?
It assumes you just append data.
And the way that typically those replication processes work, both for vendors and for
open source tooling, is relying on what's called change data capture event stream.
So every time anything changes in the transactional store, it would emit an event
that, for example, a given value in a given row has changed to a certain value, right? And then
all of these events are transported over the network using different streaming systems into
another, into, let's say, your analytical warehouse. And so the types of issues that can occur are as simple as event loss. Sometimes these
pipelines are built on messaging systems that don't guarantee, for example, right order or
exactly-once delivery. And if those are the guarantees that you lack, then eventually you
may have certain inconsistencies; you just have to accept that they happen.
The other type of issue is, which is very common, is soft deletes or hard deletes.
For example, if a row is deleted from a transactional store, making sure that it's also deleted
from that analytical store where it gets synced is actually a pretty hard problem because
the change data capture
streams might not be able to capture that. And so just two basic examples of when things can go
wrong. And then you have the infrastructure layer on top of that, right? So what if you have an outage
in the event stream, or a delay in the event stream? How do you know what data is impacted?
And the reason why it gets even more critically important
is that transactional data
is considered a source of truth.
Because if you have a table
that underpins your production system
and you record users in it,
you kind of rely on that
as your source of truth about users,
because everyone knows events
are kind of messy and lossy.
And so if that data becomes unreliable
in your analytical store,
then you've kind of lost
your last-resort data source.
And that's why it's really paramount
for data teams to keep that data
really reliable.
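The failure modes Gleb describes, out-of-order delivery and dropped delete events, can be sketched with a toy replay of a CDC stream. This is purely illustrative Python, not Datafold or any vendor's code; the event format and tables are invented for the example:

```python
# Toy illustration of how change-data-capture replication can silently
# diverge from the source when delivery guarantees are weak.

def apply_cdc(events):
    """Replay a CDC event stream into a naive replica (dict of row_id -> value)."""
    replica = {}
    for op, row_id, value in events:
        if op == "upsert":
            replica[row_id] = value
        elif op == "delete":
            # A delete event that was dropped in transit never reaches this branch.
            replica.pop(row_id, None)
    return replica

# Source of truth after all writes: row 1 was updated twice (final value "v2"),
# and row 2 was deleted.
source = {1: "v2"}

# The stream the warehouse actually received: the two updates to row 1
# arrived out of order, and the delete for row 2 was lost.
received = [
    ("upsert", 2, "x"),
    ("upsert", 1, "v2"),
    ("upsert", 1, "v1"),   # late-arriving older update overwrites the newer one
]

replica = apply_cdc(received)
print(replica)             # {2: 'x', 1: 'v1'} -- a stale value and a phantom row
print(replica == source)   # False: the replica quietly drifted from the source
```

Nothing in this replay raises an error, which is exactly the point: without an explicit consistency check, the drift is invisible.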
And what makes things complicated in how you approach data quality there is: how do you check that data is consistent, right? Let's say even
if you think your replication
process is reliable, right? We know
that all things break eventually, but how do you
make sure that they are consistent? How do you
measure consistency? It's actually a very hard problem
because you might have a billion rows
table in Postgres, a billion rows table in Databricks or your data lake.
So how do you check the consistency?
You can run count star, but that just gives you a number of rows, right?
And validating consistency across that volume of data and being able to pinpoint down to individual value and row is very hard.
And it's very hard to do it in a performant way, because you also can't put a lot of load on your transactional store.
And so the way that data-diff solves it is by relying on hashing,
but doing it in a pretty clever way. Obviously, if we hash both tables and compare the hashes, and there is any kind of inconsistency, we'll detect that.
But the other problem is that the hashes won't match if you have one record out of a billion missing, and they won't match if you have a million records missing.
And you won't know what's the magnitude of the inconsistencies.
And so that's why we are not doing just a single hash.
We're actually breaking down tables into blocks and then checking the hashes in those blocks
and then pretty much doing a binary search to identify down into individual row where
the data is inconsistent.
And so what you achieve with that approach is both speed.
So you can diff a billion rows in under five minutes, but also accuracy:
it can pinpoint the individual rows and values that are off.
And it's a really hard balance to get right.
So we're really excited about that tool.
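The block-hashing approach Gleb describes can be sketched roughly like this. It is an illustrative toy, not data-diff's actual implementation; the in-memory tables, `min_block` threshold, and hashing scheme are assumptions made for the example:

```python
# Sketch of block hashing with binary search: hash coarse blocks of each table,
# recurse only into blocks whose hashes disagree, and narrow down to the
# individual mismatched rows without comparing every row on both sides.
import hashlib

def block_hash(rows, lo, hi):
    """Hash the rows with primary keys in [lo, hi)."""
    h = hashlib.sha256()
    for pk in range(lo, hi):
        if pk in rows:
            h.update(f"{pk}:{rows[pk]}".encode())
    return h.hexdigest()

def diff(src, dst, lo, hi, min_block=4):
    """Return primary keys whose values differ (or exist on one side only)."""
    if block_hash(src, lo, hi) == block_hash(dst, lo, hi):
        return []                      # whole block matches: skip it entirely
    if hi - lo <= min_block:           # small enough: compare row by row
        return [pk for pk in range(lo, hi) if src.get(pk) != dst.get(pk)]
    mid = (lo + hi) // 2               # otherwise split and recurse
    return diff(src, dst, lo, mid, min_block) + diff(src, dst, mid, hi, min_block)

# One corrupted value and one missing row out of 1,000, found without ever
# comparing the matching rows individually.
source = {pk: f"val-{pk}" for pk in range(1000)}
replica = dict(source)
replica[137] = "corrupted"
del replica[512]
print(diff(source, replica, 0, 1000))  # [137, 512]
```

In a real system the hashes would be computed inside each database by a SQL query, so only small digests travel over the network; that is what makes the approach fast and gentle on the transactional store.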
That's awesome.
And okay, let's say I'm a data engineer and I hear about
this amazing data-diff tool, right?
How do I incorporate it in my everyday work?
Let's say I have a typical scenario with Postgres using
Debezium, Kafka, writing on S3 and loading the data on Snowflake, right?
I would assume that this is a very typical...
Yeah, definitely.
Many things can...
I mean, things like there's latency, first of all, right?
Like there's no way that at the same time that something happens on Postgres
will also like be reflected on Snowflake.
So I have this amazing new tool and I can use it.
So how do I become productive with it?
Like how do I improve my workflow?
Yeah, absolutely.
So when you're dealing with streaming data replication,
there is no way around pretty much watermarking.
And by that, I mean sort of drawing the line
and saying that all the data which is older than this timestamp
should be consistent, right? Because we do expect, you know, lags and delays. And that's the first
thing. Once you establish what the watermark is, then you just basically say, okay, I expect data
as of a certain timestamp to be consistent. You connect data-diff to both data sources.
And we use, you know, typical drivers for that, so Postgres, let's say, and Snowflake.
And then you can run the data-diff process, which typically completes within seconds or minutes.
And then ideally, because we're talking about continuous data checking, you would run it on schedule. So probably you would want to wrap it in some sort of an orchestrator like
Airflow or Dagster and run it every hour, every day, or every minute,
depending on your SLAs for quality.
And it's essentially a Python library that also has a CLI.
So it's really easy to embed in a data orchestrator or you can
even run it in a cron job.
So that's also quite flexible.
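The watermarking idea can be sketched as follows. This is illustrative only, not the data-diff API; the row format, the `updated_at` column, and the `max_lag` value are assumptions for the example. In practice you would invoke the data-diff CLI itself from cron or an Airflow/Dagster task over the watermarked window:

```python
# Sketch of watermarking for streaming replication checks: draw a line at
# now - max_lag and only expect rows older than that line to be consistent,
# since fresher rows may legitimately still be in flight.
from datetime import datetime, timedelta, timezone

def watermark(now, max_lag):
    """Everything updated before this timestamp should already be replicated."""
    return now - max_lag

def rows_to_check(rows, wm):
    """Keep only the rows old enough to fall under the watermark."""
    return {pk: v for pk, (v, updated_at) in rows.items() if updated_at < wm}

now = datetime(2022, 10, 12, 12, 0, tzinfo=timezone.utc)
wm = watermark(now, max_lag=timedelta(minutes=15))

# Each row carries its value and its last-updated timestamp; one row is
# fresh enough that it may still be lagging behind in the warehouse.
source = {
    1: ("a", now - timedelta(hours=2)),
    2: ("b", now - timedelta(minutes=5)),   # too fresh: excluded from the check
}
print(rows_to_check(source, wm))  # {1: 'a'}
```

The same `updated_at < watermark` predicate would simply become a WHERE clause on both sides of the scheduled diff, so lag never shows up as a false alarm.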
All right.
That's great.
And actually you gave me the right material to move to my next question,
which is about developer experience and the importance of the data engineer in
this whole flow of data quality, right?
Again, I don't want to repeat myself, although I might be getting
a little bit boring, but data is something that can go wrong
for many different reasons, because it interacts with so many different actors.
And when I say actors, I mean actual people: the CFO,
you know, made a mistake on a query or clicked whatever.
Or S3, which is, let's say, eventually consistent,
so when we got the snapshot of the data there,
it wasn't that consistent at that point, or whatever, right?
There are technical reasons,
there are human reasons
that might touch that.
So, but the role of the data engineer at the end
is the person who is responsible, right?
To deliver, let's say,
to safeguard the infrastructure
and also the quality of the data.
So what I hear from you when you talk
is that you put a lot of effort
into building the right tooling
specifically for data engineers, right? It's not like you are building, let's say,
a platform that is going to be used by an analyst or someone who only knows what Excel is and doing
stuff on Excel or whatever. We are talking about data engineers here. We are talking about
developers. We are talking about people who have like a specific like relationship with software
engineering.
You mentioned, for example, CI/CD best practices and pull requests and
GitOps and all that stuff.
So first of all, I'd like to hear from you why you believe the data engineer
is so important in realizing your vision behind the product and the company.
Unless I'm wrong, right?
So correct me if I'm wrong.
Yeah, Costas, you bring up a really great point.
I think, yes, the reality is that it used to be,
even five years ago, that data engineers would be people
who would own the entire data pipeline,
and they would build everything, be responsible for everything
and would have control of everything.
As you point out,
the reality now is very different.
So we have people on the left
of the data, right?
So software engineers
contributing a lot with,
you know, instrumenting events
in the microservices,
as well as, you know,
the tables that we talked about
that get copied from Postgres.
Ultimately, software engineers own them, right?
And software engineers might not actually have the context.
And then to the right, we have less technical people
who are on the data consumption side
and analysts, analytics engineers,
and then even people from other teams
like financial analysts that now kind of have to become
familiar with data pipelines
because they need to rely on analytical data to get their job done.
I think the right way to think about what data fault ultimately solves
and improves is the workflow of a data developer.
And I define this really broadly, right?
So it can be a software engineer who defines an event
that eventually gets consumed in analytics,
or it can be, you know, an FP&A analyst who contributes to a dbt job, because that informs how they
build their financial model.
I think the reason why data engineers, analytics engineers are really central in this
conversation is that even though multiple teams contribute to data, ultimately still that is the persona
and that is the team that defines
the majority of the data stack
and how data is consumed and produced.
And they are almost center
of the collaboration around data.
And so giving them tools
that would empower them to do their job faster
is really important.
But what Datafold does actually goes beyond that and it goes into the true,
I don't want to say like a corny phrase, but I'll say it, data democratization.
Because if you have, if it's so easy to test any change in data pipeline,
no matter who does it, software engineer, analytics engineer, data analyst, data engineer, then it's much easier and less risky to involve
other folks, other teams to collaborate on data pipeline than before, right?
And one of the worst bottlenecks we're seeing in data-driven companies
is that data engineering teams become really bad bottlenecks for their business, because data scientists cannot build
ML models unless they have clean data.
And so they push that data curation down to data engineers, and then data engineers don't
let other people contribute to data pipelines because they are afraid that people may break
stuff.
And so they basically become a bottleneck, get overworked, and the business
doesn't move as fast as possible.
And so if we make change management and data testing easy and automated, regardless
of who is making the change, then anyone who touches data will do a better job, and
we'll be catching errors for everyone. So basically we're able to elevate the data quality throughout the pipeline
and we can help the business move faster because more people
will be able to reliably contribute to building of data applications.
Does that make sense? Yeah, absolutely. And one last
question from me and then I'll give the stage back to Eric because
I'm sure he has many questions.
And we have to also mention that, as you say, there are many people touching the data inside the organization.
And probably one of the most dangerous ones is marketers, right?
And they really want to touch the data.
So we need to hear from them and their experience of ruining data, right?
But before we do that, one last question, which is a
little bit of a selfish question, as a product person that builds
stuff for developers and engineers.
Can you share with me, based on your experience so far, what is
something really important to keep in mind when you're designing an experience and a product for the developer and the engineer?
What is the first thing that comes to your mind when you start brainstorming about a new tool for data engineers?
Yes, I think it's hard to say what is the most important thing, but I'll probably call out a few.
I think one, and I think you called it out, Costas, is that the data stack is so
heterogeneous, right?
So many tools,
so many different databases and data application frameworks. And
how do you generalize your product so that you work as equally well as possible with most of
them? How do you distill your vision into certain principles and patterns that could work with any tool and
would allow you, especially in the data quality space,
to improve the workflow regardless of what stack people are using?
And ultimately,
you will inevitably have to focus on certain stack.
For example, Datafold really focuses on what is called the modern data stack.
So cloud data warehouses or really mature, modern data lakes, right?
Like what you can get with, you know, Trino: stacks that are
almost self-serve for data teams,
as opposed to more legacy systems.
So we have to make these calls.
But even if you limit your scope to the modern data
stack, it's still really hard.
And I would say that, so far,
I think we did a reasonably good job.
I think the second problem is how do you integrate in the workflow of data
engineers, again, given that different teams go about things in different ways. For example, if you
want to do any kind of testing
before production,
then you have to rely
on teams having version control
for their pipelines, as well as
having a staging environment. Because if you don't have
staging, there's nothing you
can test. So you have to basically take
a new version of code, doesn't matter if
it's a dbt code, Looker code, or
PySpark code, and
basically test it,
right? The way Datafold does it is by comparing
it to production and showing you exactly how things
are going to be changing, right?
But data teams may be building
staging in so many different ways. Sometimes
they may be using synthetic data, sometimes
they're using production data to build staging.
And so how do you generalize is a really big problem.
And so again, finding common patterns that would be most applicable to most teams.
And sometimes it means betting on maybe less popular ways of doing things, betting that those things eventually will become mainstream. One example of that, actually a tough trade-off that we had to make:
consider how data transformations are orchestrated these days.
The most popular tool is Airflow, right?
Used by perhaps hundreds of thousands of data stacks in the world.
And then there's also dbt, which is an emerging tool
with a more modern approach
to building that, as well as maybe Dagster.
But those tools have orders of magnitude
smaller adoption than Airflow.
And we had to make a call early on
to actually not focus on building
data quality automation for Airflow.
We still support it,
but not as deeply as we would support, let's say, dbt. And the reason is that with Airflow, it is very, very hard to build a reliable staging
environment. Just the way the tool works, it almost kind of forces you to test in production.
And so that makes it really hard to work and implement any kind of really reliable change
management process within Airflow. Whereas with tools like Dagster or dbt, development and staging environments come with the tool. And so
it's really easy for us to come in and actually build on top of that. And these are really hard
calls, because by the time we started prioritizing and integrating with dbt and Dagster, it wasn't
clear that they would actually win. We just thought that their approach is actually the one which eventually
will take over because it's more
robust and more
reliable. But it was a really tough
call to make.
Super interesting.
Eric? Yeah, that is fascinating.
I have more questions on that,
but if I
got into that, Brooks wouldn't be happy
because we do have to end the show at
some point. Gleb, I have a question about the adoption and implementation of Datafold, or
technology like Datafold, in the lifecycle of a company. And I'm speaking from experience here as a marketer,
to Kostas' point, who has messed up a lot of data.
Yeah.
You know, messed up a lot of reporting by sort of introducing new data. And
I'll explain the question like this, you know, and actually, you know, even with just sort of
a personal anecdote with my experience,
you know, multiple different companies,
you're growing really fast or, you know,
you're launching some new data initiative and you have, you know, limited resources
that are working on that.
And, you know, say it's an analyst
and maybe you're borrowing some engineering time.
You don't have like a fully formed data team yet.
Yet the company is growing really quickly.
In many ways, it's almost like what you described at Lyft, right?
Where data is the lifeblood of the company, but it's growing so fast that everything seems
to be on fire.
And what's difficult in that situation is to slow down enough to implement both the processes and discipline it takes to sort of change your workflows, like implement new tooling, etc.
Because you're moving so fast.
And it really is a challenge because you inevitably create technical debt later on.
So you wish you would have done it. But when you're in the moment, it's hard to, you know, sort of, it's hard to be that forward thinking because you're dealing with
what's right in front of you. How do you see that play out with your customers? How have you dealt
with that? And do you have any advice for our listeners who are thinking about that very challenge where it's like wow i would love to
i would love to implement this but you know it's just really hard for where we're at as a company
Yeah, I would say ultimately it's not even about Datafold. I think this comes down to how a company and a data team think about putting together the data stack.
And I think it comes down to doing the right amount of tinkering.
And by that, I mean that if you look at the modern data stack today, you have really great tools that take care of most of your needs, kind of ingesting data, moving it around, transforming it, processing it.
And you really can assemble a stack in such a way that it just works.
And then you spend your time thinking about business metrics, thinking about analysis and how to drive your business forward.
And then what is probably not the right way to go about this, or some of the
mistakes that teams make, is when they think they, for some reason,
need to tinker more than they actually do.
They would take open source projects
and start running them in-house
because they think they will save costs,
or because they think they need control,
or they would be kind of afraid of vendor lock-in.
And sometimes it's premature optimization.
Sometimes it's kind of engineering ambition out of control.
I'm sorry, engineers, but there's just a right amount, right?
So if the data team is adopting all the good tools and focuses on the things that they
should be focusing on, that's great.
And in that world, Datafold is extremely easy to implement because we integrate
with the standard modern data stack tools.
The integration takes like about an hour and you basically
improve your workflow in a day.
So you connect, let's say, dbt Cloud, GitHub, your warehouse, and things
are just working out of the box.
So it's actually not hard.
You don't need to spend a lot of time writing tests or anything.
Where you may run into longer implementation times
is when, let's say, you have a really legacy data platform,
or you have a really unorthodox stack
for some reason, where you've built a framework internally with
certain patterns that are not common. Then you will probably use our SDK
as opposed to the turnkey integrations, and that may turn into a longer integration project. But all that is to say that for a data team on the modern data stack,
using best practices, Datafold is extremely fast to implement.
So unlike assertion tests,
where you do need to invest a lot of time in writing them,
Datafold, or the regression testing capability,
is actually really fast to implement.
So you don't have to choose between moving fast and having reliable data.
You can have both.
You can move fast with a stable infrastructure.
Love it.
That is super helpful.
Well, Gleb, this has been such a wonderful conversation.
Super helpful.
We've learned a ton.
So thank you for giving us some of your time
to talk about Datafold and data quality in general.
Thank you so much, Eric and Costas.
Also, really appreciate you asking really kind of deep questions
and helping me clarify my thinking.
So really enjoyed it as well.
Awesome show.
What an interesting conversation, Costas. I think that my big takeaway, and I'll probably be thinking a lot about this this week,
is the initial discussion around anomaly detection, which I thought was really interesting and how it was really helpful
in some ways in his experience at Lyft, but then also fell flat in some ways within the company
because of some inherent limitations. And then the more I thought about that in connection with
the way that he describes starting at the point in the stack where Datafold sort of solves the initial problem,
and then moving left and right, being involved in sort of the pre-production
process or the deployment process, was really interesting. And I think to summarize that,
it would be that, you know, anomaly detection, AI, all of that technology can be
really useful, but ultimately you have humans involved in creating, managing, processing data.
And so you have to have tooling that actually is interjected into the process that those humans
are using. And Gleb seems to have a really good handle on that. And so that's just a really good reminder, I think, about data quality and
the particular characteristics of trying to solve it.
Yeah, I agree. I think what I really enjoyed from the conversation, what I'm going to keep and
probably think a little bit more about it is about the role
of the data engineer.
I think we are seeing the data engineering discipline becoming much more solid and well
established as something important around having your data infrastructure and working
with data.
But also, it's going through a very
rapid evolution right now.
We heard that from Gleb, saying how a couple of years ago,
and when I say a couple of years we are talking about two,
three years ago, it was not that far away, you could, as a data
engineer, have complete control over your pipelines and the infrastructure.
So it was much easier.
I mean, at least you had control over what was going wrong.
But that's not the case anymore.
You have so many different stakeholders and so many different moving parts in the infrastructure
that makes it yet challenging.
And it's part of the evolution of the role.
So yeah, I think that's something that we should investigate more in general.
I agree.
Alrighty.
Well, thank you for joining us on the Data Stack Show.
As always, subscribe if you haven't, tell a friend, we love new listeners,
and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.