The Data Stack Show - 115: What Is Production Grade Data? Featuring Ashwin Kamath of Spectre
Episode Date: November 30, 2022

Highlights from this week's conversation include:

- Ashwin's background in the data space (2:43)
- The unique nature of working with data in finance (7:32)
- Technological challenges of working in the finance data space (13:55)
- The third-party data factor and judging if it is reliable enough (17:07)
- What made Ashwin decide to go out and build his own company? (31:47)
- Defining data decay and data scoring and why both are important (37:52)
- Advice on the importance of data quality (42:10)
- Final takeaways and wrap-up (50:49)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome to the Data Stack Show, Kostas. We love talking to data professionals
who work in industries where they have certain requirements around the data. And Ashwin from
Spectre Data has worked in the finance industry for a really long time at multiple different
types of companies, from consumer lending to a hedge fund,
and now he's started his own company.
And needless to say,
people who have done that
are generally extremely intelligent.
So I know it's going to be a good conversation.
What I want to ask about: he actually worked for a company called Affirm,
which was sort of the first big player
in financing purchases online
and getting really rapid approvals, if you will, for items that are not like buying a house.
You're buying a computer or something like that, or even stuff that's not that expensive.
I'm really interested to ask him about that a little bit. I just want to hear a little bit
about that because I kind of remember when Affirm started showing up on all these websites and you could finance these purchases for a smaller amount.
So just entertain me.
I'm going to ask him like one or two questions about that to satisfy my curiosity.
But of course, it's about Spectre.
So what are you interested in asking him about Spectre? Yeah, I want to start, first of all, by asking him to share
some of his knowledge about how data is used, or what the unique
challenges are of working with data in the finance sector. I mean, it's a heavily,
let's say, data-driven sector, right? With its own unique challenges.
And then from there,
talk with him
about how he decided to build Spectre
and what Spectre is, right?
So let's do that with him.
Yeah.
I may start by stealing your question about the finance
data, so I apologize in advance.
All right.
Let's dig in and talk with Ashwin.
Ashwin, welcome to the Data Stack Show.
We are so excited to chat with you and learn from you.
Great to have you guys.
Great to be here, guys.
It's very nice to meet all of you.
Okay.
Give us, give us your background.
You spent a ton of time in finance, so give us that story,
but also how you got into data in the first place.
Yeah, so my name is Ashwin.
I am the CEO and founder of a data platform company called Spectre,
which I started about a year ago.
I've been in the data space for close to a decade now.
I used to work at a FinTech company out in San Francisco
called Affirm.
It was a buy now, pay later company
where I used to deal with data
both on the underwriting side,
building models to figure out
whether or not someone is creditworthy
and, on the fraud side, whether they are who they say they are, as well as on the back office side with
reporting and funding of the loan portfolio. And then in 2018, I moved out to New York,
where I'm currently based, to join a quantitative hedge fund called Two Sigma,
where I used to work on the alternative
data portfolio, basically bringing in enormous amounts of data from external third-party sources,
putting that to use within the trading engines, everything end-to-end from cleaning of data,
standardization, building the underlying data infrastructure to make sure all of this is
working and flowing, preparing the data for research-ready purposes, taking final research and analysis, putting that
into the system, making sure that that's being computed on an ongoing basis.
And finally, kind of layering all of this with a data quality system that makes sure
that the data as it flows between different stages of the pipeline is in a good and healthy
state for the trading systems.
Wow. So deep end-to-end experience across the entire pipeline. You've done so much in finance,
and so I want to ask about sort of the specific nature of working with data in finance.
But first, this is just a personal curiosity.
I remember when Affirm started showing up on websites. So, I mountain bike, and I remember the approval being really fast, right? How did you approach
that problem? Because that's a pretty, I mean, as a user, that's amazing, right? I'm about to buy
this thing and it's not like I'm buying a house, I'm buying whatever, but it's enough to where I
want to finance it. And you can get approved for one really fast, but from like an infrastructure
perspective, being in the industry, that's heavy duty.
How did you approach doing that?
Because you're doing it like pretty early, I think.
Yeah, you see this a lot in data systems and machine learning systems, especially
in today's day and age, where there is a lot of crunching and data processing that happens in a more offline setting
to create and train these models that when used in an online setting, they basically get this
like feed of features from whatever behaviors the user has already kind of displayed at the time of
that decision being made. And so the model itself, when it runs, can actually produce a result in under a second,
right?
However, that computation that is happening within that one second is taking into account
tons and tons of data that's been crunched in a more offline setting and has been kind
of prepared already for the online version.
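To make that pattern concrete, here is a minimal sketch of the offline/online split: a batch job crunches historical behavior into a feature store ahead of time, and the online path does only a cheap lookup plus one model evaluation. Everything here is hypothetical, not Affirm's actual system; the logistic scorer just stands in for a real trained model.

```python
import math

# --- Offline (batch) side: the heavy crunching happens ahead of time ---
def build_feature_store(raw_events):
    """Aggregate historical behavior into per-user features."""
    store = {}
    for event in raw_events:
        feats = store.setdefault(event["user_id"],
                                 {"n_purchases": 0, "total_spend": 0.0})
        feats["n_purchases"] += 1
        feats["total_spend"] += event["amount"]
    return store

# --- Online side: sub-second decision using precomputed features ---
def score_application(user_id, feature_store, weights, bias=-1.0):
    """Cheap lookup plus one linear-model evaluation; no heavy compute here."""
    feats = feature_store.get(user_id, {"n_purchases": 0, "total_spend": 0.0})
    z = bias + sum(weights[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))  # probability-like score

store = build_feature_store([{"user_id": "u1", "amount": 120.0}])
print(score_application("u1", store, {"n_purchases": 0.3, "total_spend": 0.001}))
```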
Super interesting.
So you process all these features offline.
You're basically just completing the model with known inputs, the precomputed features that allow sort of the last-mile compute.
Correct.
Correct. And then when it even comes to the specific features, you won't even believe some of the features that are being utilized here.
Things like: what kind of website did you come to this site from?
How are you filling out the form?
Are you copy pasting?
Are you not?
Are you?
Really?
No way.
There's a lot that can be told about from a fraud perspective about who this person is just
by the behaviors that they display when kind of interacting with the website.
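As a rough illustration of how those behavioral signals might be turned into model features, here is a hedged sketch; the telemetry format and field names are invented, not Affirm's.

```python
from urllib.parse import urlparse

def behavioral_features(session):
    """Turn raw front-end telemetry for one session into fraud features."""
    referrer_domain = urlparse(session.get("referrer", "")).netloc
    events = session.get("events", [])
    paste_count = sum(1 for e in events if e["type"] == "paste")
    keystrokes = sum(1 for e in events if e["type"] == "keydown")
    fill_seconds = session.get("submit_ts", 0) - session.get("form_start_ts", 0)
    return {
        "referrer_domain": referrer_domain,
        "pasted_into_form": paste_count > 0,  # e.g., pasted a name or ID number
        "paste_to_keystroke_ratio": paste_count / max(keystrokes, 1),
        "form_fill_seconds": fill_seconds,    # suspiciously fast fills stand out
    }
```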
That is so interesting, because to me, those are very much marketing or user experience data points, right?
Like how someone interacts with the site.
But you're actually using those as features to detect fraud and stuff.
That is so interesting. Fascinating. Okay. Well, I'm not going to go down that rabbit hole because
we have too much to talk about. Tell us about the unique nature of working with data in finance. So
you did it at Affirm, then you were managing these huge pipelines for sort of non-financial data at a
hedge fund.
And Spectre works with a lot of financial firms. So give us the landscape of working with data
in the finance industry or FinTech.
Yeah, I think the biggest thing that I have seen
with data in finance is how important data quality is.
I think because the nature of decisions being made
in this industry and this sector
are very high stakes in nature,
and each decision can have meaningful impact
in the form of a trade going out,
whether or not that's going to be a long or a short,
an underwriting decision being made,
whether or not I'm going to give money to someone, it is extremely important that the data being fed into these
models, being used to create these decisions, is in a good quality state.
And so what we start to see is that the topology of how the data network slash pipelines
are configured, so to speak,
will look pretty similar to other industries.
But I think the way that the data quality
side of things is approached
is usually as a first-class principle
rather than something that you layer on top
after the fact with a kind of best-efforts hope, so to speak.
Yes. So just to make that a little bit more explicit, I'm just thinking of examples here. So
in a non-financial industry let's say we have a consumer mobile app or something right and
you don't make a good recommendation, and so the person doesn't add
an additional thing to cart on checkout, right?
Which is unfortunate and may affect a certain subset of users.
But if you make a bad loan, you're upside down financially, in that it doesn't take very many of those to significantly skew the bottom line.
Is that kind of what you're getting at in terms of the critical nature of the quality?
Exactly. Or even in a trading setting, a simple
example, we're pulling in data from some sort of external source
and over the last week, the data hasn't updated.
And if you don't have good data quality monitoring
to notice that issue, that data
continues to flow into the final trading system. The trading system gets a forecast saying there's
been no change in a company's forecast, and so it starts shorting a stock, right?
The stakes are high. It is a seemingly easy problem to detect,
but when you kind of take the infinite variety
of data quality issues that could occur
and that are pretty difficult to predict
in and of themselves,
it is actually a much more difficult problem
to make sure that everything is like working correctly,
even when no one's like looking at the data all the time.
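A check for that specific failure mode can be as simple as comparing the newest timestamp in the feed against a staleness budget. A minimal sketch, with the threshold and the alerting left as assumptions:

```python
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(days=1)  # hypothetical budget for this feed

def feed_is_fresh(latest_record_ts: datetime) -> bool:
    """Return False (and alert) if the feed has stopped updating."""
    age = datetime.now(timezone.utc) - latest_record_ts
    if age > MAX_STALENESS:
        # In a real system this would page someone and halt downstream trading jobs.
        print(f"ALERT: feed is {age} old; do not trade on it")
        return False
    return True
```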
Well, let's dig into that a little bit. So you talked about alternative data, which is sort of my understanding is that it's sort of inputs of a large variety that are not
necessarily directly related to the trading price of a particular stock, right? Or stocks in general,
right? So it's not like trading data from the
actual exchange itself. It's inputs from outside of that that may influence it. Can you give us
some examples of what those things would be? And the other thing I'd like to know is the breadth
of those sources. How many are there? How many do you include when you're modeling?
How do you even approach that decision?
Yeah, there is a ton of data.
There's actually a whole segment of alternative data called open source intelligence, or
open source alternative data, which is really accessible.
You're thinking web behavioral data, scraped data from different types of websites.
There is so much that can be told about the state of a business just from their online digital presence.
I think, you know, if I look at
a trend of job postings from a specific company, right?
Who are they hiring?
Who are they keeping around?
LinkedIn data is like massive, right?
What are the trends of job positions being held at different companies versus their competitors?
Foot traffic data is another big one.
Where are people going and moving?
Credit card transaction data from banks.
Generally, all of these are anonymized in nature,
so we're not really taking it from the perspective of
personally identifying information.
We're trying to look at this from a more holistic, somewhat macro,
somewhat micro scale of how that data fits in to model the
overall economic environment that different businesses play in.
Super interesting.
And just from the sound of it, my guess is that those are really large data sets.
Yes. You can sometimes look at like terabytes and terabytes of data, especially when it becomes important to start to look at the historical nature of that data and how things changed over time.
It's very important to be able to have a large enough history that you can see those trends as they shape out and as they form.
And you can start to look at, okay, here's what we're seeing today versus here's what we saw a quarter or two ago
versus here's what we saw two years ago.
This is what we can make a prediction about
in the next quarter, right?
And that helps make those decisions
as they kind of play out, right?
Absolutely.
Okay, well, one last question for me,
and I'm going to tee this up as a lead-in for you, Kostas,
because I'm going to let you have dessert and ask all about the product, because I want to do that, but I've been hogging the mic.
What were some of the big problems you faced?
I think especially thinking about the hedge fund and all the alternative data inputs from a technological perspective, right? So we're
talking about terabytes of data. We're talking about losing huge amounts of money if a simple
thing like data freshness falls behind. What were the issues you faced and how did you try to solve
those? Yeah, I think the number one issue was the handoff between a development environment and a production environment being quite slow.
And this is pretty agnostic to the hedge fund space.
I think we see this across every other industry. And the idea is that there is a lot that has gone into making it really quick to start
to explore data, to start to build analysis on top of it.
Usually you see this done in some sort of Jupyter notebook environment, like some sort
of local environment.
And then when it comes to actually, say, productionizing that analysis, that data pipeline,
you know, everything kind of falls flat.
There aren't really any standards here.
Every company is doing their own thing.
The infrastructure layer looks completely different when you look from one company to
the next.
Some companies are using Docker containers.
Other companies are just putting scripts onto servers and running them in a local conda
environment on that server.
And no one knows what's in that conda environment.
It's a complete mess.
Then when you take one step further and say, okay, now we want to also make sure
that the quality of the data that is being output by my data pipeline continues
to remain consistent over time.
And if I make changes to my data pipeline, I want to know that, okay,
something might go wrong at the data layer itself. Now the ballgame is even more difficult to deal with,
right? Monitoring data is itself some sort of recurring process that
needs to run and look at the data and observe that data over time, and then almost apply another type
of machine learning anomaly detector
on top of the output or the metrics that are being computed about that data and
make sure that that data is being consistent, right?
And I think that's part of the challenge with data science and data engineering
is how do you get this infrastructure layer that does a lot of this for you
without having to spend an inordinate amount of effort just on the
infrastructure component and allow you to focus more on what this business logic looks like.
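To sketch what that looks like, assuming pandas and a simple rolling z-score standing in for the anomaly detector he describes:

```python
import pandas as pd

def compute_metrics(df: pd.DataFrame) -> dict:
    """The recurring job: snapshot simple metrics about today's data."""
    return {
        "row_count": len(df),
        "null_fraction": float(df.isna().mean().mean()),
    }

def is_anomalous(history: pd.Series, window: int = 30, threshold: float = 3.0) -> bool:
    """Watch the metric history itself: flag today's value if it sits
    more than `threshold` standard deviations from the recent mean."""
    recent = history.iloc[-(window + 1):-1]
    mu, sigma = recent.mean(), recent.std()
    if sigma == 0:
        return history.iloc[-1] != mu
    return abs(history.iloc[-1] - mu) / sigma > threshold
```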
Yeah. Yeah. I mean, because what you're describing, I mean, you have data science
and data engineering, but a lot of what you're describing actually is more DevOps- and SRE-flavored work, right?
Where uptime and monitoring and alerting and responses and, okay.
That's super interesting.
Costas.
I'd say operationalizing data science is really an engineering problem.
Yeah.
I don't think the world has realized that yet.
Absolutely.
So, Ashwin, I have a... I'm super, super, super curious to hear from you
about third-party data.
In most cases, we talk with people, I mean,
you mostly struggle to collect your own data, right?
It's like the data that your own company is generating one way or another.
And you're trying to make sure that you don't miss anything and give access to everyone inside the organization to do that.
But you mentioned like third-party data, and I don't know much about it.
So I'd love to hear from you.
First of all, how do you go shopping for third-party data?
Like, how do you, like, how, how does this even work?
Right?
It's like, I'm going to Amazon.
I'm like, okay, I'm looking for, I don't know, two pounds of like data that
has this and that characteristic, right?
So can you tell us a little bit about the whole lifecycle of getting third-party and incorporating
third-party data into the product that you are building, right? Especially when it comes to
go out there and find this data, procure the data, maintain that, and all these things.
Yeah, it's a pretty laborious process,
but it does kind of follow the same steps
of what you would imagine from an e-commerce purchase
or procurement process,
with a few caveats in between
around making sure the data meets the compliance requirements of your company,
and making sure that you can evaluate that data in a way that allows you to see that the data is
useful to you, to your company, without getting the data for free. And that's the biggest
challenge here, right? There is this skewed incentive between the buyer and seller of data to say, hey, I want to let you try this data without you actually using it for a real decision process.
Right.
But let's, let's kind of go through the whole process from start to finish.
First, you would think, okay, there's some use case at hand for which you're looking for third-party data.
There are several ways to go about finding that.
The most obvious is to go Google it, right? Say I'm looking for data because I'm prospecting for a marketing purpose, and I want all US-based
companies that have chief financial officers based within the US.
The best source for that would be through something like a LinkedIn.
And finding data for that purpose generally involves looking through these data catalogs, data marketplaces, that essentially have a bunch of metadata about each of these data sets, enough information that at a high level you can say a data set meets the criteria for what you're looking for. You reach out to the vendor, you initiate a conversation.
Generally, this looks very similar to any type of B2B sales process where you go
through some evaluation of that data.
There's typically no demo in the process, because data itself is a very
abstract, ephemeral kind of concept.
So the demo phase actually looks like you providing some sort of requirements around what you're trying to do.
Some sample data will be provided back based on those specifications. That is evaluated in and of itself.
That is evaluated in and of itself.
I know within kind of the hedge fund world,
usually you'll look at
some sort of historical amount of data as well.
So you can test in a backtesting purpose.
And if that meets the criteria,
then you go into kind of the negotiation side of things,
discuss a unit price on how much data you're looking for.
Generally with more bulk,
you get a better unit price per record of data.
And there are a lot of levers that you have to think through.
The first is: what kind of sample do I need?
What kind of coverage do I need, in terms of geography, in
terms of sectors, industries, depending on the specific data at hand.
Then second, you need to think through how often that data needs to be refreshed or updated.
The world is constantly changing.
The data itself is changing.
Making sure that that refresh rate meets the criteria that you're looking for is extremely
important.
Third, how are you going to access that data?
Is this going to be a push-based access where the data vendor pushes data to, say, an S3
bucket and you pull it out of there?
Or is this going to be pull-based access, where I'm pulling it out of an API and figuring
out on my own how I'm going to store it?
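For the push-based case, the consumer's side can be as small as listing and downloading whatever the vendor has dropped under an agreed prefix. A sketch with boto3, where the bucket and prefix names are made up:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "vendor-deliveries"   # hypothetical bucket the vendor pushes to
PREFIX = "alt-data/2022-11/"   # hypothetical agreed-upon prefix

def download_new_deliveries(dest_dir: str) -> list:
    """Fetch every file the vendor has pushed under the agreed prefix."""
    downloaded = []
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    for obj in resp.get("Contents", []):
        local_path = f"{dest_dir}/{obj['Key'].rsplit('/', 1)[-1]}"
        s3.download_file(BUCKET, obj["Key"], local_path)
        downloaded.append(local_path)
    return downloaded
```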
Yeah.
This all gets written into a contract.
Once the contract is signed, and usually you go through some amount of compliance
audit as well to make sure that the data was collected in a way that
meets your business's requirements,
you get access to the data from there.
Okay.
And how do you judge if the data is good enough for you?
Okay, you said they give you a sample of the data, right?
But are there some summary statistics that are provided, for example?
How can you formalize this process, if it can be formalized?
And how do you do it without them revealing the data set,
obviously, because they don't want to do that.
So how do you go through that?
How does that work?
Yeah.
Aggregate statistics helps a lot, right?
Being able to understand, okay, there are 20 columns in the data set, and
there's a specific segment you're looking at, let's say US only. This might be a global data set, but you only care about the US segment.
Yeah.
You want to know, out of the population of US data points, how many null values are in those other columns?
What are the distributions in those columns?
That's a pretty easy way to get a sense of the completeness of the data for what you care about.
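In pandas, that kind of aggregate profiling of a vendor sample is a few lines; the "country" column here is a hypothetical stand-in for however the segment is marked in the feed.

```python
import pandas as pd

def profile_segment(df: pd.DataFrame, segment: str = "US") -> pd.DataFrame:
    """Completeness profile of the one segment you actually care about."""
    subset = df[df["country"] == segment]
    return pd.DataFrame({
        "null_fraction": subset.isna().mean(),  # per-column null rate
        "distinct_values": subset.nunique(),    # per-column cardinality
    })

# e.g., profile_segment(pd.read_csv("vendor_sample.csv"))
```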
When it comes to evaluation itself, generally, you're going to want to put that data through a similar process to how you plan to use it in a live setting, right?
Once you actually have the real data at hand, right?
And kind of test it from a statistical point of view.
Does this meet your needs from a predictive side of things? Or, if you're collecting data for fraud or underwriting purposes, does the richness of the data coming in seem correct? If you're looking at data about people, say
a scrape from LinkedIn, you might want to cross-check some of the
entries back to LinkedIn.
It is a manual process, but that can go a long way toward making
you confident in trusting that data.
Yeah, that makes sense.
Actually, it reminds me a little bit of a problem that I have faced.
Well, I'm not the only one; it's anyone who's building
query engines or databases.
You have this system in production, and then
you need to debug it, right?
And you're like, okay, but to reproduce, let's say, the query,
I need the query, first of all.
And I need to know the data, or at least the
statistics around the data.
And it's unbelievably hard to do that because getting access and like taking
a look at that information, it's like something that's extremely proprietary
for like many companies, right?
It's not like, yeah, take a look at my database and see exactly what kind of
information I keep here for my users.
It cannot easily happen.
And it changes also.
This thing changes too fast and it's even hard to go and do, let's say,
regression testing using some baseline queries and data sets.
It's hard.
It's hard to define these requirements.
And we are talking about a very deterministic system.
We are talking about software at the end.
We are not talking about training models, right?
I mean, it's not like we know exactly what is happening inside the model, right?
It's more of a black box.
So it's a very fascinating area and a very hard problem, but
people don't really realize it, I think.
You see this challenge especially, going back to the skewed incentives, whenever you go through one of these data evaluation processes. It's very common to get the golden set of data from the vendor, which is the best segment of data they can offer you, so that you can see how great and how powerful this data is. Then when you get your hands on the real data, after having signed,
say, a one- or two-year contract, you realize that, hey, the rest of this data set is not nearly
as high quality as the sample that they supplied, right? What's even worse is when you start to
build stuff on top of that data, and if you don't have the monitoring in place to watch for things like
data decay, data scores getting out of whack, that sort of thing, suddenly there's a larger than
average number of outliers appearing in the data. It's very easy for something that
worked in the first six months of releasing a new model or releasing a new data pipeline
to suddenly start behaving very poorly over time, right?
And that's why it's extremely important with third-party data, especially because you are
not the source of that data.
You don't know what's happening to that data from its true source till it gets to you.
You only know what happens after it gets to you onward.
It's important to put those tests in place, put those guardrails in place to make sure
that the data conforms to and stays consistent with the assumptions that you made when you
first started developing against it.
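One way to pin those assumptions down is to capture them as an explicit baseline when you first develop against the feed, and assert against it on every delivery. A sketch; the columns and thresholds are hypothetical.

```python
import pandas as pd

# Baseline captured when you first developed against the feed (hypothetical).
BASELINE = {
    "required_columns": {"company_id", "job_title", "posted_at"},
    "max_null_fraction": {"company_id": 0.0, "job_title": 0.05},
    "min_daily_rows": 10_000,
}

def violated_assumptions(df: pd.DataFrame) -> list:
    """Return the list of broken guardrails (empty means healthy)."""
    violations = []
    missing = BASELINE["required_columns"] - set(df.columns)
    if missing:
        violations.append(f"missing columns: {sorted(missing)}")
    for col, limit in BASELINE["max_null_fraction"].items():
        if col in df.columns and df[col].isna().mean() > limit:
            violations.append(f"null fraction of {col!r} above {limit}")
    if len(df) < BASELINE["min_daily_rows"]:
        violations.append("daily volume below baseline")
    return violations
```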
Yeah, yeah, 100%. I have a question. It's about, let's say the third-party data again, but it's about something
like that happened like months earlier.
So you're building a model, right?
Like you are trying to achieve something.
You have an objective there.
Let's say you're trying to, I don't know, do some scoring
or predict the behavior, right?
And you usually start by having some data, right?
Like it's a bit of a chicken and egg problem.
Like you have some observations and you try to model something
based on these observations, right?
Sure.
How do you reach a point where you're like, I need third-party data?
And how do you know what kind of data to go and look out there for?
Right?
Because it's one thing to be like, okay, this is like the data that I can get from
my company because it's a clickstream data.
These are the sources where I can capture data, blah, blah, blah.
It's much more straightforward in my mind, at least, to have
the whole space of different options around the data that we can use.
Okay.
Procuring data means that you have, I don't know, an open space there, things that you don't even know exist, right?
So, from a model training and building point of view, how do you
identify the data that you need, and how do you
reach the point where you say: I don't have this type of data,
I'll go out there and try to find a data source?
Yeah, I think education about the different possible data segments out
there is probably the first step.
And I think it's going to become much more common for data scientists to
just be more aware of what's out there.
I think third-party data is just kind of coming to the light.
I would say five years ago, the only real buyers of third-party data were the hedge funds on the Street.
But over time, now it's kind of being adopted by several other industries. I see it a lot more commonly used within the marketing space for prospecting
and lead generation, being able to use what we call intent data to understand someone just
visited a specific site. And that is maybe a competitor site, maybe that shows interest in
them being a buyer of that product. That's like a good candidate for me to either run an email campaign against them
or run an advertising campaign against them, right?
And so I think we start to see a little bit more
of just people being more educated
about what types of data there is.
I don't know that I have a great sense
of when is the right time to start thinking about that.
Usually what I see is that people start to adopt third-party data either very early on when they're building and training new models as a way to kind of bootstrap that initial data segment. So instead of taking the approach, okay, if I collect a thousand observations
and I can build my model, I say, okay, if I just buy a thousand observations,
I can create my model and then I will keep filling that with more first-party
data as I collect the first-party data.
And the second is a more augmentative purpose, where I say, okay, I have this
first-party data stream that's coming in.
It would be really good to know this other information about these
users based on information I can find
from their digital presence. So being able to feed that in as an
additional data source, and keep augmenting that internal first-party
data stash with third-party data, is another approach that I've seen be very successful.
That's super interesting.
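The augmentation pattern he describes is, mechanically, often just a left join of the first-party table against the purchased feed. A toy sketch with invented columns:

```python
import pandas as pd

first_party = pd.DataFrame({
    "email_domain": ["acme.com", "globex.com"],
    "signup_source": ["organic", "ads"],
})

third_party = pd.DataFrame({
    "email_domain": ["acme.com", "initech.com"],
    "company_headcount": [1200, 90],
    "industry": ["manufacturing", "software"],
})

# A left join keeps every first-party user; the vendor fields fill in
# where the feed has coverage and come back NaN where it doesn't.
enriched = first_party.merge(third_party, on="email_domain", how="left")
print(enriched)
```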
All right.
So you obviously have had a very interesting and exciting career in the financial sector, right?
And you ended up building a company and a product.
Tell us a little bit about that, and also what made you decide to go and build.
What kind of problems did you see out there that made you think, oh, that's
worth pursuing as a business, worth leaving my career, my safety,
my comfort zone where I know what can happen, to go and do
a company and a product, right?
Tell us a little bit more about that.
Yeah.
So I think the biggest motivator for me was just seeing the
sophistication of technology at these more established companies and
understanding that the data industry is going to continue to grow at the incredible pace that it is.
But when it comes to an understanding of how to handle data in production settings,
there has been what I believe to be a pretty big lack of innovation there.
Every company that I see is doing their own thing.
They generally all start with something like an Apache Airflow, where they run their data
pipelines, and then they're building their own kind of data quality stack on the side.
And then eventually they upgrade into something else.
And that something else tends to be completely different from one company to the next.
It always requires a tremendous amount of skilled data engineering support
to be able to deliver on that, especially at the infrastructure layer.
And so that was the biggest driver for me: being able to say, okay, there is a way to
generalize some of this technology to basically create an out-of-the-box data infrastructure layer that makes it really
simple to go from development to production, and to have a system that actually helps
you do it, rather than you carrying this inordinate burden of configuring things in exactly the right
way so that everything works correctly. And when it comes down to the problems that we're really looking to solve, we say, okay, on the exploratory side, on the development of data pipelines and machine learning pipelines and machine learning models, there's a tremendous amount of tooling that already exists and kind of solves those problems, right? It's going to continue to improve, but we want to focus on what it means to take that
and put it into a system
so that when the data scientist decides
to move on to the next project,
they can come back six months later
and know that their initial project is still working
and is running appropriately
the way that they expected when they launched it.
So major problems that we see,
the first is around kind of the DevOps side of things, right?
How do I, when I have data pipelines running
in a local environment,
how do I push that into a production setting, right?
How do I make sure it's running on servers,
it's running on some sort of schedule,
or maybe running whenever the data itself is updating? How do I make sure dependencies are being tightly managed based on
how data is flowing from one step to the other, based on intermediate data inputs and outputs
between each of these stages? And then finally, how do I tie this back to data quality in a way
that guarantees that if there are data quality issues that occur
somewhere in between in the middle of the pipeline, that that data doesn't continue
to spread and contaminate downstream analysis.
And yeah, there's a pretty good analogy I have to go off of this.
It's kind of like the way that the manufacturing industry thinks about the assembly line, right? When you think about why the assembly line exists,
a lot of it comes down to this idea of being able to install quality control
checkpoints between the different stages of the assembly line, right?
And the reason why factories are designed this way is because
recalls are extremely expensive, right?
Both reputationally and logistically, right?
Bringing back all the items, restating them.
The same kind of exists in the data world, right?
If I push out a data report, and that goes out to, say, my CEO, and they make a decision off of that, and it turns out it was made off of incorrect data: now, reputationally, my data team is at risk, but also logistically, I have
to go restate all the data that went into making that report and republish that report.
Right.
So if you had to describe Spectre as a platform, would you say it's like a DataOps platform?
Is it an ETL platform?
Is it a data quality tool?
What would you call it?
Yeah, I would say a data operations platform is the closest way to describe it.
We think of things in four layers.
There's the storage layer, which we don't really handle but integrate with: your Snowflakes, your BigQuerys, your data lakes, et cetera.
Then you have your compute layer, which is data moving from one
storage area to the next.
Usually in transit, some transformation is happening.
This is your ETL, your compute stack.
Then you have your data quality layer, which reads the data and makes sure
that it's in a good state, a healthy state. And then finally, we have the control
layer, which is the brain of the system that makes sure that as data goes from one step to the next,
that it's taking into account what's happening at the data quality side of things, to make sure that a data pipeline doesn't actually run if the sources and the inputs are in a bad or unhealthy state.
Right.
So you mentioned two interesting terms a little bit earlier.
You said something about data decay.
Yeah.
Okay.
And data scoring.
So tell us a little bit more about these terms.
I'm pretty sure they have to do with quality, obviously, but I'm
very, very curious to learn more about the semantics of these terms
and how they are represented in the platform. Yeah, so data decay is basically the
idea that
over time, data
stops producing the same kind of
predictive value that it did
when you first developed against it.
And being able to
catch that issue in an
unsupervised fashion
is part of what our
platform helps do, right?
So basically the outputs of data pipelines
are automatically monitored
to detect statistically significant changes in the data,
across the main dimensions
of data quality: volume, freshness,
anomalies within the data, data
distributions, cardinality, nullness, et cetera.
But without going into the semantics of the specific dimensions, being able to spot those
issues in a way that doesn't require you to program rules about how your data is
going to change is actually a very,
very powerful concept, right? It allows the data scientists to work on the business logic of how
their data is being transformed, focus on the outputs and results, and have the system
detect when something is off because a statistically significant inconsistency has
arisen.
Yeah, but okay.
I understand that we are using these characteristics of the data
as a proxy that something might be going wrong, right?
But it doesn't necessarily mean that it actually goes wrong.
So when you have a model on the other side that is doing something, right?
Like we are using it for a reason, like we have some kind of like
business objective tied to it.
How do we... I mean, let's say, okay, we go to the data scientist and raise
a flag and say: hey dude, suddenly I see more null values than previously.
So that's an anomaly.
Or suddenly we see that the cardinality is changing
dramatically, right?
What does this mean for the data scientist?
What can the data scientist do with this information?
Because, okay, it might be a false positive or whatever, right?
It doesn't necessarily mean that something will continue to be wrong
with the model and how it performs as a service.
So what's happening there?
How is that part taken care of?
Yeah.
So this is actually the part that I think is the most
fascinating about the platform: the system actually takes in input from the data scientists to understand what's important about each data set that it's monitoring, so that it can better track issues, find issues,
and start to build resolution patterns for those issues as well.
So in fact, one of the big things and big initiatives
that we're taking on right now is trying to understand
that when an issue is resolved,
what was that resolution that was taken
so that the system
can recommend that resolution the next time it occurs, right? And so instead of you having to
go into the data and say, delete outliers, the system itself says, click this button and we
will go delete the outliers for you, right? But ultimately, what it comes down to is building
an AI system for the
data engineers and data scientists of the world.
Yeah.
That's super interesting.
Okay.
One last question from my side and then I'll give the microphone back to Eric.
So, okay.
Obviously you are very into data quality, right?
And you have a lot of experience with that, both from building a product
and from your previous work.
If you had to give advice to someone who is assigned to start
building, let's say, a new data platform, or to start investing in data
infrastructure for a company, right?
Yeah.
What would you say to them about how much attention they should pay to
quality from day one, or when they should start caring about it, if it
shouldn't happen on day one?
I think it has to happen on day one,
at least to start that process
of understanding and thinking about
what data quality means
for that specific use case,
that specific problem.
Now, how do I put this?
I think that over time,
data quality is going to become
more and more of a solved problem, right?
There's going to be better tooling available, and it's going to be easier and easier to actually set up a data quality stack from scratch.
Today, operationalizing data quality is actually very difficult, right?
Being able to continuously collect metrics about data as it's changing and then have those metrics itself be monitored for anomalies and issues, it takes a lot to get that system up and running.
Oftentimes you see people buy it off the shelf, but data quality tools are quite expensive in and of themselves. And so what we recommend is figure out what is specifically very important.
That's the kind of thing where, if this goes wrong, it's going to be a deal breaker;
this is just absolutely incorrect.
This might be things like: if you have a column that represents the price of an item
and it goes negative, that's clearly a wrong thing. So maybe write a check for that, something like the sketch below.
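A deal-breaker rule like that can be a few framework-free lines that fail loudly instead of letting bad rows flow downstream; the column name is a placeholder.

```python
import pandas as pd

def check_no_negative_prices(df: pd.DataFrame, col: str = "price") -> None:
    """Hard assertion: a price should never be negative."""
    bad = df[df[col] < 0]
    if not bad.empty:
        raise ValueError(f"{len(bad)} rows have negative {col!r} values")
```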
What I find a bit unfortunate is that I think there's a lot
that can be learned about the data
just through these unsupervised systems that
continuously observe and track how that data
is changing over time. And I think that
is going to become more and more democratized over time.
So I would say, for everyone out there: keep your hopes up.
There's definitely something coming down the line.
That's great.
Eric, the mic is all yours.
Yeah.
Okay.
So I'm going to continue on that line of questioning.
I know we're close to the buzzer here.
But I would love to know what your advice would be for our listeners who really resonate with what you're saying about data quality and about some of the challenges with,
say, like your typical sort of orchestration tools like Airflow and blah, blah, blah. But
the reality is like, that's what they've got, right? And maybe they're not actually
dealing with data that requires sort of the level of quality or accuracy
where maybe it's just first-party data, right?
And they don't have a ton of third-party data.
But they know that quality is really important.
What advice would you give to them?
I mean, you've built this stuff from the ground up
and now you're building a company that solves it.
What advice would you give to them, though?
The people who really value data quality,
but sort of have the tools that they have, and want to implement this at their company?
What should they do and what are the next steps that you would recommend for them?
Yeah, I think the biggest thing that I see people get bogged down by and confused about is the appropriate way to orchestrate data quality jobs, so to speak, right?
At some companies, data quality is put directly into the data processing pipeline, such that as soon as my processing is done, my data quality check happens immediately after.
And that's one series of steps that occurs.
My biggest recommendation here is to really think about, from a rules-based perspective, what matters for data quality, and to structure that as independent jobs that run alongside the data processing steps, so that downstream pipelines take that data quality status into account, right?
So let's say I have the five rules that matter to me to say my data is healthy. I put those in a separate Airflow process that basically asserts true or false: is my data in a healthy state?
And I use the status of that to determine whether or not another pipeline is allowed to run if it uses that data. That takes into account the data processing as well as
the data quality, and ties them together in a way that gives you this level of robustness.
And this is actually exactly what we're trying to do with Spectre.
Basically, build that dynamic DAG of interactions between the data processing system and the
data quality system.
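Using plain Airflow rather than Spectre (whose internals aren't public here), the gating pattern he describes might look like this: a quality task asserts true or false, and a ShortCircuitOperator stops the downstream pipeline when the assertion fails. The paths, rules, and schedule are all hypothetical.

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator

def data_is_healthy() -> bool:
    """A few rules standing in for 'the five rules that matter', as one gate."""
    df = pd.read_parquet("/data/ingested/daily.parquet")  # hypothetical input
    return (
        len(df) > 0
        and df["price"].min() >= 0
        and df["price"].notna().mean() > 0.95
    )

def build_report() -> None:
    print("building downstream report...")  # stand-in for real processing

with DAG("gated_pipeline", start_date=datetime(2022, 11, 1), schedule="@daily") as dag:
    gate = ShortCircuitOperator(task_id="quality_gate", python_callable=data_is_healthy)
    report = PythonOperator(task_id="build_report", python_callable=build_report)
    gate >> report  # the report task is skipped whenever the gate returns False
```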
Yeah, that's fascinating because, I mean,
not that DAGs aren't capable of considering the things that you just mentioned,
but a lot of times it just deals with
data completeness or data freshness, right?
Where a job runs and then
a lot of companies sort of manage
all of the debt that's created along the way
just with massive compute on the warehouse, right?
Yeah, and human data support teams, right?
Yeah, yeah, yeah, yeah, for sure.
This is like one of the biggest things, right?
So you've got Airflow, and you've got your data quality system,
which is in its own isolated place,
reporting issues one after the other. And then your data processing
system has no idea, right?
So your data processing system just continues to process the data, and you
get 10 chains deep into creating a report.
And then you realize, oh, wait, but the data that was initially ingested into
the company was already in a bad state.
None of this should have even run.
Right.
Yeah.
Yeah.
But it's very hard to set up those network topologies in a way that guarantees that data is only going to be processed if it's in a good, healthy state.
Yep.
Yeah, for sure.
No, that is so instructive.
I'm even thinking about our own
pipelines, the ones that I have purview over.
You've inspired a lot of thinking there.
Where can people go to learn more about
Spectre data and about you?
If they want to dig into this
and learn more about the concepts
that you're talking about,
where can they go?
Yeah, so I am most reachable on LinkedIn,
so you can find me there; my profile is Ashwin Kamath.
Our website is a great resource to find more information about the product.
That's www.spectredata.com.
And we have a contact us form there where you can reach out to the rest of our team as well.
Awesome.
Very cool.
And we will put those in the show notes as well,
so you can go to datastackshow.com.
Ashwin, thank you so much.
This has been absolutely fascinating.
I feel like we could go for another hour,
but Brooks is telling us that we're at the buzzer.
So thank you so much for your time.
Yeah, thank you all for having me.
And this was a great, great show. I think my biggest takeaway, Kostas,
maybe this is a weird way to say it,
but a lot of people think about Big Brother
as being the government.
And really, Big Brother is just hedge funds
that have data about us copying and pasting,
and that influencing things.
Don't say that.
You might be in danger now.
It's true.
That's true.
No, but it is amazing.
I mean, the things that he brought up about web behavior,
about foot traffic data, about credit card transactions,
all this sort of stuff.
I mean, it's a little bit
scary in many ways.
They are anonymized.
That's true.
No, but it's
wild. I mean, the stuff that
he's done and
that level of data modeling
and that level of granularity
is amazing.
And I think
as he said, the actual infrastructure to drive that is incredible, right?
The blunt way to say it is that the two industries that are actually driving infrastructure forward are porn and finance.
They're the ones on sort of the significant-scale
innovation side of things.
And I think we saw that with Ashwin.
Yeah, yeah, absolutely.
And what I found super interesting is how you can talk
about a topic that we have discussed a lot already, right?
Like data quality, for example, and how much of a different perspective someone
can bring because they're coming from a different industry, right?
Even the terminology that he was using about data quality was very
different compared to what we have heard from other vendors that are building
data tooling, right?
So that's what I find super, super interesting.
I feel so privileged to be doing this show, because I have the opportunity to compare these different, let's say, theses around how to build a product,
which come from the bias that each person has because of the industry they're solving the problem for, right?
And of course, at the end you see who's going to win, which also tells us
which industry has a much better, let's say, understanding
of the problem.
So yeah, super, super interesting.
Yeah.
Now that we're talking about this, I regret not asking him if he had worked with Deephaven because they work in the finance industry and do like real-time data feeds.
So we can follow up with him.
If he has, actually, we should get him.
And I think it's Pete.
Is that right, Brooks?
Pete from Deephaven.
Brooks is giving me the thumbs up.
Off screen. Great. Well, let's do that. Let's follow up with him, Brooks.
And if so, then we can do like a finance data podcast. Maybe we could actually get
Sri, who used to be at Robinhood and is now
at Stripe. That'd be cool.
Yep.
All right.
Well, thanks for entertaining our banter for another episode.
And we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.