The Data Stack Show - 115: What Is Production Grade Data? Featuring Ashwin Kamath of Spectre
Episode Date: November 30, 2022

Highlights from this week's conversation include:

- Ashwin's background in the data space (2:43)
- The unique nature of working with data in finance (7:32)
- Technological challenges of working in the finance data space (13:55)
- The third-party data factor and judging if it is reliable enough (17:07)
- What made Ashwin decide to go out and build his own company? (31:47)
- Defining data decay and data scoring and why both are important (37:52)
- Advice on the importance of data quality (42:10)
- Final takeaways and wrap-up (50:49)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome to the Data Stack Show, Kostas. We love talking to data professionals
who work in industries where they have certain requirements around the data. And Ashwin from
Spectre Data has worked in the finance industry for a really long time at multiple different
types of companies, from consumer lending to a hedge fund,
and now he's started his own company.
And needless to say,
people who have done that
are generally extremely intelligent.
So I know it's going to be a good conversation.
What I want to ask about: he actually worked for a company called Affirm,
which was sort of the first big player
in financing purchases online
and getting really rapid approvals, if you will, for items that are not like buying a house.
You're buying a computer or something like that, or even stuff that's not that expensive.
I'm really interested to ask him about that a little bit. I just want to hear a little bit
about that because I kind of remember when Affirm started showing up on all these websites and you could finance these purchases for a smaller amount.
So just entertain me.
I'm going to ask him like one or two questions about that to satisfy my curiosity.
But of course, it's about Spectre.
So what are you interested in asking him about Spectre? Yeah, I want to start, first of all, by asking him to share
some of his knowledge about how data is used, or what the unique
challenges are of working with data in the finance sector. I mean, it's a heavily,
let's say, data-driven sector, right? With its own unique challenges.
And then from there,
talk with him
about how he decided to build Spectre
and what Spectre is, right?
So let's do that with him.
Yeah.
I may start by stealing your question about the finance
data, so I apologize in advance.
All right.
Let's dig in and talk with Ashwin.
Ashwin, welcome to the Data Stack Show.
We are so excited to chat with you and learn from you.
Great to have you guys.
Great to be here, guys.
It's very nice to meet all of you.
Okay.
Give us, give us your background.
You spent a ton of time in finance, so give us that story,
but also how you got into data in the first place.
Yeah, so my name is Ashwin.
I am the CEO and founder of a data platform company called Spectre,
which I started about a year ago.
I've been in the data space for close to a decade now.
I used to work at a FinTech company out in San Francisco
called Affirm.
It was a buy now, pay later company
where I used to deal with data
both on the underwriting side,
building models to figure out
whether or not someone is creditworthy
and, on the fraud side, whether they are who they say they are, as well as on the back office side with
reporting and funding of the loan portfolio. And then in 2018, I moved out to New York,
where I'm currently based, to join a quantitative hedge fund called Two Sigma,
where I used to work on the alternative
data portfolio, basically bringing in enormous amounts of data from external third-party sources,
putting that to use within the trading engines, everything end-to-end from cleaning of data,
standardization, building the underlying data infrastructure to make sure all of this is
working and flowing, preparing the data for research-ready purposes, taking final research and analysis, putting that
into the system, making sure that that's being computed on an ongoing basis.
And finally, kind of layering all of this with a data quality system that makes sure
that the data as it flows between different stages of the pipeline is in a good and healthy
state for the trading systems.
Wow. So deep end-to-end experience across the entire pipeline. You've done so much in finance,
and so I want to ask about sort of the specific nature of working with data in finance.
But first, this is just a personal curiosity.
I remember when Affirm started showing up on websites. So, I mountain bike, and I remember the approval being really fast, right? How did you approach
that problem? Because that's a pretty, I mean, as a user, that's amazing, right? I'm about to buy
this thing and it's not like I'm buying a house, I'm buying whatever, but it's enough to where I
want to finance it. And you can get approved for one really fast, but from like an infrastructure
perspective, being in the industry, that's heavy duty.
How did you approach doing that?
Because you're doing it like pretty early, I think.
Yeah, you see this a lot in data systems and machine learning systems, especially
in today's day and age, where there is a lot of crunching and data processing that happens in a more offline setting
to create and train these models that when used in an online setting, they basically get this
like feed of features from whatever behaviors the user has already kind of displayed at the time of
that decision being made. And so the model itself, when it runs, can actually produce a result in under a second,
right?
However, that computation that is happening within that one second is taking into account
tons and tons of data that's been crunched in a more offline setting and has been kind
of prepared already for the online version.
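To make that pattern concrete, here is a minimal sketch of the offline/online split: a batch job crunches historical behavior into a feature store ahead of time, and the online path does only a cheap lookup plus one model evaluation. Everything here is hypothetical, not Affirm's actual system; the logistic scorer just stands in for a real trained model.

```python
import math

# --- Offline (batch) side: the heavy crunching happens ahead of time ---
def build_feature_store(raw_events):
    """Aggregate historical behavior into per-user features."""
    store = {}
    for event in raw_events:
        feats = store.setdefault(event["user_id"],
                                 {"n_purchases": 0, "total_spend": 0.0})
        feats["n_purchases"] += 1
        feats["total_spend"] += event["amount"]
    return store

# --- Online side: sub-second decision using precomputed features ---
def score_application(user_id, feature_store, weights, bias=-1.0):
    """Cheap lookup plus one linear-model evaluation; no heavy compute here."""
    feats = feature_store.get(user_id, {"n_purchases": 0, "total_spend": 0.0})
    z = bias + sum(weights[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))  # probability-like score

store = build_feature_store([{"user_id": "u1", "amount": 120.0}])
print(score_application("u1", store, {"n_purchases": 0.3, "total_spend": 0.001}))
```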
Super interesting.
So you process all these features offline.
You're basically just completing the model with known inputs, the precomputed features that allow sort of the last-mile compute.
Correct.
Correct. And then when it even comes to the specific features, you won't even believe some of the features that are being utilized here.
Things like: what kind of website did you come to this site from?
How are you filling out the form?
Are you copy pasting?
Are you not?
Are you?
Really?
No way.
There's a lot that can be told about from a fraud perspective about who this person is just
by the behaviors that they display when kind of interacting with the website.
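As a rough illustration of how those behavioral signals might be turned into model features, here is a hedged sketch; the telemetry format and field names are invented, not Affirm's.

```python
from urllib.parse import urlparse

def behavioral_features(session):
    """Turn raw front-end telemetry for one session into fraud features."""
    referrer_domain = urlparse(session.get("referrer", "")).netloc
    events = session.get("events", [])
    paste_count = sum(1 for e in events if e["type"] == "paste")
    keystrokes = sum(1 for e in events if e["type"] == "keydown")
    fill_seconds = session.get("submit_ts", 0) - session.get("form_start_ts", 0)
    return {
        "referrer_domain": referrer_domain,
        "pasted_into_form": paste_count > 0,  # e.g., pasted a name or ID number
        "paste_to_keystroke_ratio": paste_count / max(keystrokes, 1),
        "form_fill_seconds": fill_seconds,    # suspiciously fast fills stand out
    }
```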
That is so interesting, because to me, those are very much marketing or user experience data points, right?
Like how someone interacts with the site.
But you're actually using those as features to detect fraud and stuff.
That is so interesting. Fascinating. Okay. Well, I'm not going to go down that rabbit hole because
we have too much to talk about. Tell us about the unique nature of working with data in finance. So
you did it at Affirm, then you were managing these huge pipelines for sort of non-financial data at a
hedge fund.
And Spectre works with a lot of financial firms. So give us the landscape of working with data
in the finance industry or FinTech.
Yeah, I think the biggest thing that I have seen
with data in finance is how important data quality is.
I think because the nature of decisions being made
in this industry and this sector
are very high stakes in nature,
and each decision can have meaningful impact
in the form of a trade going out,
whether or not that's going to be a long or a short,
an underwriting decision being made,
whether or not I'm going to give money to someone, it is extremely important that the data being fed into these
models, being used to create these decisions, is in a good quality state.
And so what we start to see is that the topology of how the data network slash pipelines
are configured, so to speak,
will look pretty similar to other industries.
But I think the way that the data quality
side of things is approached
is usually as a first-class principle
rather than something that you layer on top
after the fact with a kind of best-efforts hope, so to speak.
Yes. So just to make that a little bit more explicit, I'm just thinking of examples here. So
in a non-financial industry let's say we have a consumer mobile app or something right and
you don't make a good recommendation, and so the person doesn't add
an additional thing to cart on checkout, right?
Which is unfortunate and may affect a certain subset of users.
But if you make a bad loan, you're upside down financially, in that it doesn't take very many of those to significantly skew the bottom line.
Is that kind of what you're getting at in terms of the critical nature of the quality?
Exactly. Or even in a trading setting, a simple
example, we're pulling in data from some sort of external source
and over the last week, the data hasn't updated.
And if you don't have good data quality monitoring
to notice that issue, that data
continues to flow into the final trading system. The trading system gets a forecast saying there's
been no change in a company's forecast, and so it starts shorting a stock, right?
The stakes are high. It is a seemingly easy problem to detect,
but when you kind of take the infinite variety
of data quality issues that could occur
and that are pretty difficult to predict
in and of themselves,
it is actually a much more difficult problem
to make sure that everything is like working correctly,
even when no one's like looking at the data all the time.
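A check for that specific failure mode can be as simple as comparing the newest timestamp in the feed against a staleness budget. A minimal sketch, with the threshold and the alerting left as assumptions:

```python
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(days=1)  # hypothetical budget for this feed

def feed_is_fresh(latest_record_ts: datetime) -> bool:
    """Return False (and alert) if the feed has stopped updating."""
    age = datetime.now(timezone.utc) - latest_record_ts
    if age > MAX_STALENESS:
        # In a real system this would page someone and halt downstream trading jobs.
        print(f"ALERT: feed is {age} old; do not trade on it")
        return False
    return True
```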
Well, let's dig into that a little bit. So you talked about alternative data, which is sort of my understanding is that it's sort of inputs of a large variety that are not
necessarily directly related to the trading price of a particular stock, right? Or stocks in general,
right? So it's not like trading data from the
actual exchange itself. It's inputs from outside of that that may influence it. Can you give us
some examples of what those things would be? And the other thing I'd like to know is the breadth
of those sources. How many are there? How many do you include when you're modeling?
How do you even approach that decision?
Yeah, there is a ton of data.
There's actually a whole segment of alternative data called open source intelligence, or
open source alternative data, which is really accessible.
You're thinking web behavioral data, scraped data from different types of websites.
There is so much that can be told about the state of a business just from their online digital presence.
I think, you know, if I look at
a trend of job postings from a specific company, right?
Who are they hiring?
Who are they keeping around?
LinkedIn data is like massive, right?
What are the trends of job positions being held at different companies versus their competitors?
Foot traffic data is another big one.
Where are people going and moving?
Credit card transaction data from banks.
Generally, all of these are anonymized in nature,
so we're not really taking it from the perspective of
personally identifying information.
We're trying to look at this from a more holistic, somewhat macro,
somewhat micro scale of how that data fits in to model the
overall economic environment that different businesses play in.
Super interesting.
And just from the sound of it, my guess is that those are really large data sets.
Yes. You can sometimes look at like terabytes and terabytes of data, especially when it becomes important to start to look at the historical nature of that data and how things changed over time.
It's very important to be able to have a large enough history that you can see those trends as they shape out and as they form.
And you can start to look at, okay, here's what we're seeing today versus here's what we saw a quarter or two ago
versus here's what we saw two years ago.
This is what we can make a prediction about
in the next quarter, right?
And that helps make those decisions
as they kind of play out, right?
Absolutely.
Okay, well, one last question for me,
and I'm going to tee this up as a lead-in for you, Kostas,
because I'm going to let you have dessert and ask all about the product, because I want to do that, but I've been hogging the mic.
What were some of the big problems you faced?
I think especially thinking about the hedge fund and all the alternative data inputs from a technological perspective, right? So we're
talking about terabytes of data. We're talking about losing huge amounts of money if a simple
thing like data freshness falls behind. What were the issues you faced and how did you try to solve
those? Yeah, I think the number one issue was the handoff between a development environment and a production environment being quite slow.
And this is pretty agnostic to the hedge fund space.
I think we see this across every other industry. And the idea is that there is a lot that has gone into making it really quick to start
to explore data, to start to build analysis on top of it.
Usually you see this done in some sort of Jupyter notebook environment, like some sort
of local environment.
And then when it comes to actually, say, productionizing that analysis, that data pipeline,
you know, everything kind of falls flat.
There aren't really any standards here.
Every company is doing their own thing.
The infrastructure layer looks completely different when you look from one company to
the next.
Some companies are using Docker containers.
Other companies are just putting scripts onto servers and running them in a local conda
environment on that server.
And no one knows what's in that conda environment.
It's a complete mess.
Then when you take one step further and say, okay, now we want to also make sure
that the quality of the data that is being output by my data pipeline continues
to remain consistent over time.
And if I make changes to my data pipeline, I want to know that, okay,
something might go wrong at the data layer itself. Now the ballgame is even more difficult to deal with,
right? Monitoring data is itself some sort of recurring process that
needs to run and look at the data and observe that data over time, and then almost apply another type
of machine learning anomaly detector
on top of the output or the metrics that are being computed about that data and
make sure that that data is being consistent, right?
And I think that's part of the challenge with data science and data engineering
is how do you get this infrastructure layer that does a lot of this for you
without having to spend an inordinate amount of effort just on the
infrastructure component and allow you to focus more on what this business logic looks like.
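To sketch what that looks like, assuming pandas and a simple rolling z-score standing in for the anomaly detector he describes:

```python
import pandas as pd

def compute_metrics(df: pd.DataFrame) -> dict:
    """The recurring job: snapshot simple metrics about today's data."""
    return {
        "row_count": len(df),
        "null_fraction": float(df.isna().mean().mean()),
    }

def is_anomalous(history: pd.Series, window: int = 30, threshold: float = 3.0) -> bool:
    """Watch the metric history itself: flag today's value if it sits
    more than `threshold` standard deviations from the recent mean."""
    recent = history.iloc[-(window + 1):-1]
    mu, sigma = recent.mean(), recent.std()
    if sigma == 0:
        return history.iloc[-1] != mu
    return abs(history.iloc[-1] - mu) / sigma > threshold
```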
Yeah. Yeah. I mean, because what you're describing, I mean, you have data science
and data engineering, but a lot of what you're describing actually is more DevOps- and SRE-flavored work, right?
Where uptime and monitoring and alerting and responses and, okay.
That's super interesting.
Costas.
I'd say operationalizing data science is really an engineering problem.
Yeah.
I don't think the world has realized that yet.
Absolutely.
So, Ashwin, I have a... I'm super, super, super curious to hear from you
about third-party data.
In most cases, we talk with people, I mean,
you mostly struggle to collect your own data, right?
It's like the data that your own company is generating one way or another.
And you're trying to make sure that you don't miss anything and give access to everyone inside the organization to do that.
But you mentioned like third-party data, and I don't know much about it.
So I'd love to hear from you.
First of all, how do you go shopping for third-party data?
Like, how do you, like, how, how does this even work?
Right?
It's like, I'm going to Amazon.
I'm like, okay, I'm looking for, I don't know, two pounds of like data that
has this and that characteristic, right?
So can you tell us a little bit about the whole lifecycle of getting third-party and incorporating
third-party data into the product that you are building, right? Especially when it comes to
go out there and find this data, procure the data, maintain that, and all these things.
Yeah, it's a pretty laborious process,
but it does kind of follow the same steps
of what you would imagine from an e-commerce purchase
or procurement process,
with a few caveats in between
around making sure the data meets the compliance requirements of your company,
and making sure that you can evaluate that data in a way that allows you to see that the data is
useful to you, to your company, without getting the data for free. And that's the biggest
challenge here, right? There is this skewed incentive between the buyer and seller of data to say, hey, I want to let you try this data without you actually using it for a real decision process.
Right.
But let's, let's kind of go through the whole process from start to finish.
First, you would think, okay, there's some use case at hand for which you're looking for third-party data.
There are several ways to go about finding that.
The most obvious is to go Google it, right? Say I'm looking for data because I'm prospecting for a marketing purpose, and I want all US-based
companies that have chief financial officers based within the US.
The best source for that would be through something like a LinkedIn.
And finding data for that purpose generally involves looking through these data catalogs, data marketplaces, that essentially have a bunch of metadata about each of these data sets, enough information that at a high level you can say a data set meets the criteria for what you're looking for. You reach out to the vendor, you initiate a conversation.
Generally, this looks very similar to any type of B2B sales process where you go
through some evaluation of that data.
There's typically no demo in the process, because data itself is a very
abstract, ephemeral kind of concept.
So the demo phase actually looks like you providing some sort of requirements around what you're trying to do.
Some sample data will be provided back based on those specifications. That is evaluated in and of itself.
That is evaluated in and of itself.
I know within kind of the hedge fund world,
usually you'll look at
some sort of historical amount of data as well.
So you can test in a backtesting purpose.
And if that meets the criteria,
then you go into kind of the negotiation side of things,
discuss a unit price on how much data you're looking for.
Generally with more bulk,
you get a better unit price per record of data.
And there are a lot of levers that you have to think through.
The first is: what kind of sample do I need?
What kind of coverage do I need, in terms of geography, in
terms of sectors, industries, depending on the specific data at hand.
Then second, you need to think through how often that data needs to be refreshed or updated.
The world is constantly changing.
The data itself is changing.
Making sure that that refresh rate meets the criteria that you're looking for is extremely
important.
Third, how are you going to access that data?
Is this going to be a push-based access where the data vendor pushes data to, say, an S3
bucket and you pull it out of there?
Or is this going to be pull-based access, where I'm pulling it out of an API and figuring
out on my own how I'm going to store it?
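For the push-based case, the consumer's side can be as small as listing and downloading whatever the vendor has dropped under an agreed prefix. A sketch with boto3, where the bucket and prefix names are made up:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "vendor-deliveries"   # hypothetical bucket the vendor pushes to
PREFIX = "alt-data/2022-11/"   # hypothetical agreed-upon prefix

def download_new_deliveries(dest_dir: str) -> list:
    """Fetch every file the vendor has pushed under the agreed prefix."""
    downloaded = []
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    for obj in resp.get("Contents", []):
        local_path = f"{dest_dir}/{obj['Key'].rsplit('/', 1)[-1]}"
        s3.download_file(BUCKET, obj["Key"], local_path)
        downloaded.append(local_path)
    return downloaded
```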
Yeah.
This all gets written into a contract.
Once the contract is signed, and usually you go through some amount of compliance
audit as well to make sure that the data was collected in a way that
meets your business's requirements,
you get access to the data from there.
Okay.
And how do you judge if the data is good enough for you?
Okay, you said they give you a sample of the data, right?
But are there some summary statistics that are provided, for example?
How can you formalize this process, if it can be formalized?
And how do you do it without them revealing the data set,
obviously, because they don't want to do that.
So how do you go through that?
How does that work?
Yeah.
Aggregate statistics helps a lot, right?
Being able to understand, okay, there are 20 columns in the data set, and
there's a specific segment you're looking at, let's say US only. This might be a global data set, but you only care about the US segment.
Yeah.
You want to know, out of the population of US data points, how many null values are in those other columns?
What are the distributions in those columns?
That's a pretty easy way to get a sense of the completeness of the data for what you care about.
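In pandas, that kind of aggregate profiling of a vendor sample is a few lines; the "country" column here is a hypothetical stand-in for however the segment is marked in the feed.

```python
import pandas as pd

def profile_segment(df: pd.DataFrame, segment: str = "US") -> pd.DataFrame:
    """Completeness profile of the one segment you actually care about."""
    subset = df[df["country"] == segment]
    return pd.DataFrame({
        "null_fraction": subset.isna().mean(),  # per-column null rate
        "distinct_values": subset.nunique(),    # per-column cardinality
    })

# e.g., profile_segment(pd.read_csv("vendor_sample.csv"))
```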
When it comes to evaluation itself, generally, you're going to want to put that data through a similar process to how you plan to use it in a live setting, right?
Once you actually have the real data at hand, right?
And kind of test it from a statistical point of view.
Does this meet your needs from a predictive side of things? Or, if you're collecting data for fraud or underwriting purposes, does the richness of the data coming in seem correct? If you're looking at data about people, say
a scrape from LinkedIn, you might want to cross-check some of the
entries back to LinkedIn.
It is a manual process, but that can go a long way toward making
you confident in trusting that data.
Yeah, that makes sense.
Actually, it reminds me a little bit of a problem that I have faced.
Well, I'm not the only one; it's anyone who's building
query engines or databases.
You have this system in production, and then
you need to debug it, right?
And you're like, okay, but to reproduce, let's say, the query,
I need the query, first of all.
And I need to know the data, or at least the
statistics around the data.
And it's unbelievably hard to do that because getting access and like taking
a look at that information, it's like something that's extremely proprietary
for like many companies, right?
It's not like, yeah, take a look at my database and see exactly what kind of
information I keep here for my users.
It cannot easily happen.
And it changes also.
This thing changes too fast and it's even hard to go and do, let's say,
regression testing using some baseline queries and data sets.
It's hard.
It's hard to define these requirements.
And we are talking about a very deterministic system.
We are talking about software at the end.
We are not talking about training models, right?
I mean, it's not like we know exactly what is happening inside the model, right?
It's more of a black box.
So it's a very fascinating area and a very hard problem, but
people don't really realize it, I think.
You see this challenge especially, going back to the skewed incentives, whenever you go through one of these data evaluation processes. It's very common to get the golden set of data from the vendor, which is the best segment of data they can offer you, so that you can see how great and how powerful this data is. Then when you get your hands on the real data, after having signed,
say, a one- or two-year contract, you realize that, hey, the rest of this data set is not nearly
as high quality as the sample that they supplied, right? What's even worse is when you start to
build stuff on top of that data, and if you don't have the monitoring in place to watch for things like
data decay, data scores getting out of whack, that sort of thing, suddenly there's a larger than
average number of outliers appearing in the data. It's very easy for something that
worked in the first six months of releasing a new model or releasing a new data pipeline
to suddenly start behaving very poorly over time, right?
And that's why it's extremely important with third-party data, especially because you are
not the source of that data.
You don't know what's happening to that data from its true source till it gets to you.
You only know what happens after it gets to you onward.
It's important to put those tests in place, put those guardrails in place to make sure
that the data conforms to and stays consistent with the assumptions that you made when you
first started developing against it.
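One way to pin those assumptions down is to capture them as an explicit baseline when you first develop against the feed, and assert against it on every delivery. A sketch; the columns and thresholds are hypothetical.

```python
import pandas as pd

# Baseline captured when you first developed against the feed (hypothetical).
BASELINE = {
    "required_columns": {"company_id", "job_title", "posted_at"},
    "max_null_fraction": {"company_id": 0.0, "job_title": 0.05},
    "min_daily_rows": 10_000,
}

def violated_assumptions(df: pd.DataFrame) -> list:
    """Return the list of broken guardrails (empty means healthy)."""
    violations = []
    missing = BASELINE["required_columns"] - set(df.columns)
    if missing:
        violations.append(f"missing columns: {sorted(missing)}")
    for col, limit in BASELINE["max_null_fraction"].items():
        if col in df.columns and df[col].isna().mean() > limit:
            violations.append(f"null fraction of {col!r} above {limit}")
    if len(df) < BASELINE["min_daily_rows"]:
        violations.append("daily volume below baseline")
    return violations
```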
Yeah, yeah, 100%. I have a question. It's about, let's say the third-party data again, but it's about something
like that happened like months earlier.
So you're building a model, right?
Like you are trying to achieve something.
You have an objective there.
Let's say you're trying to, I don't know, do some scoring
or predict the behavior, right?
And you usually start by having some data, right?
Like it's a bit of a chicken and egg problem.
Like you have some observations and you try to model something
based on these observations, right?
Sure.
How do you reach a point where you're like, I need third-party data?
And how do you know what kind of data to go and look out there for?
Right?
Because it's one thing to be like, okay, this is like the data that I can get from
my company because it's a clickstream data.
These are the sources where I can capture data, blah, blah, blah.
It's much more straightforward in my mind, at least, to have
the whole space of different options around the data that we can use.
Okay.
Procuring data means that you have, I don't know, an open space there, things that you don't even know exist, right?
So, from a model training and building point of view, how do you
identify the data that you need, and how do you
reach the point where you say: I don't have this type of data,
I'll go out there and try to find a data source?
Yeah, I think education about the different possible data segments out
there is probably the first step.
And I think it's going to become much more common for data scientists to
just be more aware of what's out there.
I think third-party data is just kind of coming to the light.
I would say five years ago, the only real buyers of third-party data were the hedge funds on the Street.
But over time, now it's kind of being adopted by several other industries. I see it a lot more commonly used within the marketing space for prospecting
and lead generation, being able to use what we call intent data to understand someone just
visited a specific site. And that is maybe a competitor site, maybe that shows interest in
them being a buyer of that product. That's like a good candidate for me to either run an email campaign against them
or run an advertising campaign against them, right?
And so I think we start to see a little bit more
of just people being more educated
about what types of data there is.
I don't know that I have a great sense
of when is the right time to start thinking about that.
Usually what I see is that people start to adopt third-party data either very early on when they're building and training new models as a way to kind of bootstrap that initial data segment. So instead of taking the approach, okay, if I collect a thousand observations
and I can build my model, I say, okay, if I just buy a thousand observations,
I can create my model and then I will keep filling that with more first-party
data as I collect the first-party data.
And the second is a more augmentative purpose, where I say, okay, I have this
first-party data stream that's coming in.
It would be really good to know this other information about these
users based on information I can find
from their digital presence. So being able to feed that in as an
additional data source, and keep augmenting that internal first-party
data stash with third-party data, is another approach that I've seen be very successful.
That's super interesting.
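The augmentation pattern he describes is, mechanically, often just a left join of the first-party table against the purchased feed. A toy sketch with invented columns:

```python
import pandas as pd

first_party = pd.DataFrame({
    "email_domain": ["acme.com", "globex.com"],
    "signup_source": ["organic", "ads"],
})

third_party = pd.DataFrame({
    "email_domain": ["acme.com", "initech.com"],
    "company_headcount": [1200, 90],
    "industry": ["manufacturing", "software"],
})

# A left join keeps every first-party user; the vendor fields fill in
# where the feed has coverage and come back NaN where it doesn't.
enriched = first_party.merge(third_party, on="email_domain", how="left")
print(enriched)
```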
All right.
So you obviously have had a very interesting and exciting career in the financial sector, right?
And you ended up building a company and a product.
Tell us a little bit about that, and also what made you decide to go and build.
What kind of problems did you see out there that made you think, oh, that's
worth pursuing as a business, worth leaving my career, my safety,
my comfort zone where I know what can happen, to go and do
a company and a product, right?
Tell us a little bit more about that.
Yeah.
So I think the biggest motivator for me was just seeing the
sophistication of technology at these more established companies and
understanding that the data industry is going to continue to grow at the incredible pace that it is.
But when it comes to an understanding of how to handle data in production settings,
there has been what I believe to be a pretty big lack of innovation there.
Every company that I see is doing their own thing.
They generally all start with something like an Apache Airflow, where they run their data
pipelines, and then they're building their own kind of data quality stack on the side.
And then eventually they upgrade into something else.
And that something else tends to be completely different from one company to the next.
It always requires a tremendous amount of skilled data engineering support
to be able to deliver on that, especially at the infrastructure layer.
And so that was the biggest driver for me: being able to say, okay, there is a way to
generalize some of this technology to basically create an out-of-the-box data infrastructure layer that makes it really
simple to go from development to production, and to have a system that actually helps
you do it, rather than you carrying this inordinate burden of configuring things in exactly the right
way so that everything works correctly. And when it comes down to the problems that we're really looking to solve, we say, okay, on the exploratory side, on the development of data pipelines and machine learning pipelines and machine learning models, there's a tremendous amount of tooling that already exists and kind of solves those problems, right? It's going to continue to improve, but we want to focus on what it means to take that
and put it into a system
so that when the data scientist decides
to move on to the next project,
they can come back six months later
and know that their initial project is still working
and is running appropriately
the way that they expected when they launched it.
So major problems that we see,
the first is around kind of the DevOps side of things, right?
How do I, when I have data pipelines running
in a local environment,
how do I push that into a production setting, right?
How do I make sure it's running on servers,
it's running on some sort of schedule,
or maybe running whenever the data itself is updating? How do I make sure dependencies are being tightly managed based on
how data is flowing from one step to the other, based on intermediate data inputs and outputs
between each of these stages? And then finally, how do I tie this back to data quality in a way
that guarantees that if there are data quality issues that occur
somewhere in between in the middle of the pipeline, that that data doesn't continue
to spread and contaminate downstream analysis.
And yeah, there's a pretty good analogy I have to go off of this.
It's kind of like the way that the manufacturing industry thinks about the assembly line, right? When you think about why the assembly line exists,
a lot of it comes down to this idea of being able to install quality control
checkpoints between the different stages of the assembly line, right?
And the reason why factories are designed this way is because
recalls are extremely expensive, right?
Both reputationally and logistically, right?
Bringing back all the items, restating them.
The same kind of exists in the data world, right?
If I push out a data report, and that goes out to, say, my CEO, and they make a decision off of that, and it turns out it was made off of incorrect data: now, reputationally, my data team is at risk, but also logistically, I have
to go restate all the data that went into making that report and republish that report.
Right.
So if you had to describe Spectre as a platform, would you say it's like a DataOps platform?
Is it an ETL platform?
Is it a data quality tool?
What would you call it?
Yeah, I would say a data operations platform is the closest way to describe it.
We think of things in four layers.
There's the storage layer, which we don't really handle but integrate with: your Snowflakes, your BigQuerys, your data lakes, et cetera.
Then you have your compute layer, which is data moving from one
storage area to the next.
Usually in transit, some transformation is happening.
This is your ETL, your compute stack.
Then you have your data quality layer, which reads the data and makes sure
that it's in a good state, a healthy state. And then finally, we have the control
layer, which is the brain of the system that makes sure that as data goes from one step to the next,
that it's taking into account what's happening at the data quality side of things, to make sure that a data pipeline doesn't actually run if the sources and the inputs are in a bad or unhealthy state.
Right.
So you mentioned two interesting terms a little bit earlier.
You said something about data decay.
Yeah.
Okay.
And data scoring.
So tell us a little bit more about these terms.
I'm pretty sure they have to do with quality, obviously, but I'm
very, very curious to learn more about the semantics of these terms
and how they are represented in the platform. Yeah, so data decay is basically the
idea that
over time, data
stops producing the same kind of
predictive value that it did
when you first developed against it.
And being able to
catch that issue in an
unsupervised fashion
is part of what our
platform helps do, right?
So basically the outputs of data pipelines
are automatically monitored
to detect statistically significant changes in the data,
across the main dimensions
of data quality: volume, freshness,
anomalies within the data, data
distributions, cardinality, nullness, et cetera.
But without going into the semantics of the specific dimensions, being able to spot those
issues in a way that doesn't require you to program rules about how your data is
going to change is actually a very,
very powerful concept, right? It allows the data scientists to work on the business logic of how
their data is being transformed, focus on the outputs and results, and have the system
detect when something is off because a statistically significant inconsistency has
arisen.
Yeah, but okay.
I understand that we are using these characteristics of the data
as a proxy that something might be going wrong, right?
But it doesn't necessarily mean that it actually goes wrong.
So when you have a model on the other side that is doing something, right?
Like we are using it for a reason, like we have some kind of like
business objective tied to it.
How do we... I mean, let's say, okay, we go to the data scientist and raise
a flag and say: hey dude, suddenly I see more null values than previously.
So that's an anomaly.
Or suddenly we see that the cardinality is changing
dramatically, right?
What does this mean for the data scientist?
What can the data scientist do with this information?
Because, okay, it might be a false positive or whatever, right?
It doesn't necessarily mean that something will continue to be wrong
with the model and how it performs as a service.
So what's happening there?
How is that part taken care of?
Yeah.
So this is actually the part that I think is the most
fascinating about the platform: the system actually takes in input from the data scientists to understand what's important about each data set that it's monitoring, so that it can better track issues, find issues,
and start to build resolution patterns for those issues as well.
So in fact, one of the big things and big initiatives
that we're taking on right now is trying to understand
that when an issue is resolved,
what was that resolution that was taken
so that the system
can recommend that resolution the next time it occurs, right? And so instead of you having to
go into the data and say, delete outliers, the system itself says, click this button and we
will go delete the outliers for you, right? But ultimately, what it comes down to is building
an AI system for the
data engineers and data scientists of the world.
Yeah.
That's super interesting.
Okay.
One last question from my side and then I'll give the microphone back to Eric.
So, okay.
Obviously you are very into data quality, right?
And you have a lot of experience with that, both from building a product
and from your previous work.
If you had to give advice to someone who is assigned to start
building, let's say, a new data platform, or to start investing in data
infrastructure for a company, right?
Yeah.
What would you say to them about how much attention they should pay to
quality from day one, or when they should start caring about it, if it
shouldn't happen on day one?
I think it has to happen on day one,
at least to start that process
of understanding and thinking about
what data quality means
for that specific use case,
that specific problem.
Now, how do I put this?
I think that over time,
data quality is going to become
more and more of a solved problem, right?
There's going to be better tooling available, and it's going to be easier and easier to actually set up a data quality stack from scratch.
Today, operationalizing data quality is actually very difficult, right?
Being able to continuously collect metrics about data as it's changing and then have those metrics itself be monitored for anomalies and issues, it takes a lot to get that system up and running.
Oftentimes you see people buy it off the shelf, but data quality tools are quite expensive in and of themselves. And so what we recommend is figure out what is specifically very important.
That's the kind of thing where, if this goes wrong, it's going to be a deal breaker;
this is just absolutely incorrect.
This might be things like: if you have a column that represents the price of an item
and it goes negative, that's clearly a wrong thing. So maybe write a check for that, something like the sketch below.
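A deal-breaker rule like that can be a few framework-free lines that fail loudly instead of letting bad rows flow downstream; the column name is a placeholder.

```python
import pandas as pd

def check_no_negative_prices(df: pd.DataFrame, col: str = "price") -> None:
    """Hard assertion: a price should never be negative."""
    bad = df[df[col] < 0]
    if not bad.empty:
        raise ValueError(f"{len(bad)} rows have negative {col!r} values")
```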
What I find a bit unfortunate is that I think there's a lot
that can be learned about the data
just through these unsupervised systems that
continuously observe and track how that data
is changing over time. And I think that
is going to become more and more democratized over time.
So I would say, for everyone out there: keep your hopes up.
There's definitely something coming down the line.
That's great.
Eric, the mic is all yours.
Yeah.
Okay.
So I'm going to continue on that line of questioning.
I know we're close to the buzzer here.
But I would love to know what your advice would be for our listeners who really resonate with what you're saying about data quality and about some of the challenges with,
say, like your typical sort of orchestration tools like Airflow and blah, blah, blah. But
the reality is like, that's what they've got, right? And maybe they're not actually
dealing with data that requires sort of the level of quality or accuracy
where maybe it's just first-party data, right?
And they don't have a ton of third-party data.
But they know that quality is really important.
What advice would you give to them?
I mean, you've built this stuff from the ground up
and now you're building a company that solves it.
What advice would you give to them, though?
The people who really value data quality,
but sort of have the tools that they have, and want to implement this at their company?
What should they do and what are the next steps that you would recommend for them?
Yeah, I think the biggest thing that I see people get bogged down by and confused about is the appropriate way to orchestrate data quality jobs, so to speak, right?
At some companies, data quality is put directly into the data processing pipeline, such that as soon as my processing is done, my data quality check happens immediately after.
And that's one series of steps that occurs.
My biggest recommendation here is to really think about, from a rules-based perspective, what matters for data quality, and to structure that as independent jobs that run alongside the data processing steps, so that downstream pipelines take that data quality status into account, right?
So let's say I have the five rules that matter to me to say my data is healthy. I put those in a separate Airflow process that basically asserts true or false: is my data in a healthy state?
And I use the status of that to determine whether or not another pipeline is allowed to run if it uses that data. That takes into account the data processing as well as
the data quality, and ties them together in a way that gives you this level of robustness.
And this is actually exactly what we're trying to do with Spectre.
Basically, build that dynamic DAG of interactions between the data processing system and the
data quality system.
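Using plain Airflow rather than Spectre (whose internals aren't public here), the gating pattern he describes might look like this: a quality task asserts true or false, and a ShortCircuitOperator stops the downstream pipeline when the assertion fails. The paths, rules, and schedule are all hypothetical.

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator

def data_is_healthy() -> bool:
    """A few rules standing in for 'the five rules that matter', as one gate."""
    df = pd.read_parquet("/data/ingested/daily.parquet")  # hypothetical input
    return (
        len(df) > 0
        and df["price"].min() >= 0
        and df["price"].notna().mean() > 0.95
    )

def build_report() -> None:
    print("building downstream report...")  # stand-in for real processing

with DAG("gated_pipeline", start_date=datetime(2022, 11, 1), schedule="@daily") as dag:
    gate = ShortCircuitOperator(task_id="quality_gate", python_callable=data_is_healthy)
    report = PythonOperator(task_id="build_report", python_callable=build_report)
    gate >> report  # the report task is skipped whenever the gate returns False
```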
Yeah, that's fascinating because, I mean,
not that DAGs aren't capable of considering the things that you just mentioned,
but a lot of times it just deals with
data completeness or data freshness, right?
Where a job runs and then
a lot of companies sort of manage
all of the debt that's created along the way
just with massive compute on the warehouse, right?
Yeah, and human data support teams, right?
Yeah, yeah, yeah, yeah, for sure.
This is like one of the biggest things, right?
So you've got Airflow, and you've got your data quality system,
which is in its own isolated place,
reporting issues one after the other. And then your data processing
system has no idea, right?
So your data processing system just continues to process the data, and you
get 10 chains deep into creating a report.
And then you realize, oh, wait, but the data that was initially ingested into
the company was already in a bad state.
None of this should have even run.
Right.
Yeah.
Yeah.
But it's very hard to set up those network topologies in a way that guarantees that data is only going to be processed if it's in a good, healthy state.
Yep.
Yeah, for sure.
No, that is so instructive.
I'm even thinking about our own
pipelines, the ones that I have purview over.
You've inspired a lot of thinking there.
Where can people go to learn more about
Spectre data and about you?
If they want to dig into this
and learn more about the concepts
that you're talking about,
where can they go?
Yeah, so I am most reachable on LinkedIn,
so you can find me there; my profile is Ashwin Kamath.
Our website is a great resource to find more information about the product.
That's www.spectredata.com.
And we have a contact us form there where you can reach out to the rest of our team as well.
Awesome.
Very cool.
And we will put those in the show notes as well,
so you can go to datastackshow.com.
Ashwin, thank you so much.
This has been absolutely fascinating.
I feel like we could go for another hour,
but Brooks is telling us that we're at the buzzer.
So thank you so much for your time.
Yeah, thank you all for having me.
And this was a great, great show. I think my biggest takeaway, Kostas,
maybe this is a weird way to say it,
but a lot of people think about Big Brother
as being the government.
And really, Big Brother is just hedge funds
that have data about us copying and pasting,
and that influencing things.
Don't say that.
You might be in danger now.
It's true.
That's true.
No, but it is amazing.
I mean, the things that he brought up about web behavior,
about foot traffic data, about credit card transactions,
all this sort of stuff.
I mean, it's a little bit
scary in many ways.
They are anonymized.
That's true.
No, but it's
wild. I mean, the stuff that
he's done and
that level of data modeling
and that level of granularity
is amazing.
And I think
as he said, the actual infrastructure to drive that is incredible, right?
The blunt way to say it is that the two industries that are actually driving infrastructure forward are porn and finance.
They're the ones on sort of the significant-scale
innovation side of things.
And I think we saw that with Ashwin.
Yeah, yeah, absolutely.
And what I found super interesting is how you can talk
about a topic that we have discussed a lot already, right?
Like data quality, for example, and how much of a different perspective someone
can bring because they're coming from a different industry, right?
Even the terminology that he was using about data quality was very
different compared to what we have heard from other vendors that are building
data tooling, right?
So that's what I find super, super interesting.
I feel so privileged to be doing this show, because I have the opportunity to compare these different, let's say, theses around how to build a product,
which come from the bias that each person has because of the industry they're solving the problem for, right?
And of course, at the end you see who's going to win, which also tells us
which industry has a much better, let's say, understanding
of the problem.
So yeah, super, super interesting.
Yeah.
Now that we're talking about this, I regret not asking him if he had worked with Deephaven because they work in the finance industry and do like real-time data feeds.
So we can follow up with him.
If he has, actually, we should get him.
And I think it's Pete.
Is that right, Brooks?
Pete from Deephaven.
Brooks is giving me the thumbs up.
Off screen. Great. Well, let's do that. Let's follow up with him, Brooks.
And if so, then we can do like a finance data podcast. Maybe we could actually get
Sri, who used to be at Robinhood and is now
at Stripe. That'd be cool.
Yep.
All right.
Well, thanks for entertaining our banter for another episode.
And we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.