The Data Stack Show - 51: Democratizing AI and ML with Tristan Zajonc of Continual
Episode Date: September 1, 2021
Topics in this wide-ranging conversation include:
- Tristan's background with Cloudera and the need for continual operational ML and AI (3:15)
- How the complexity of Continual is hidden behind a simplicity of use (14:48)
- Focusing on data that lives within a data warehouse (18:43)
- Understanding features in the ML conversation (22:47)
- The three layers of Continual (26:11)
- The importance of SQL to Continual (30:19)
- Caching layers and the data warehouse centric approach (38:28)
- Betting on the warehouse being a central component of data stack architecture (43:34)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show.
Today, we have Tristan Zajonc.
His last name does not sound like it's spelled; it's pronounced "Zients," we confirmed with him.
And he founded a company, his second company actually, called Continual. And they do really interesting machine learning stuff on top of your existing cloud warehouse,
which I think is just going to be a fascinating topic.
One of the questions that I have, which in the parlance of machine learning is probably going to be predictable, is when you think about machine learning readily available on top of your
existing warehouse, in many ways, that's kind of almost a democratization of machine learning,
which inside of a lot of companies is still really hard to operationalize at scale, just
because there are so many moving parts and pieces.
But this is something that Tristan actually saw on the ground building tools for people
to operationalize
data science. So I want to ask him, even though the promise of machine learning is still so
exciting, it's still just actually a pretty hard problem when it comes down to the practical
implementation. That is my burning question. And I've already been talking too much. So Kostas,
what's your question? And then I'll give you plenty of time on the mic during the show. Yeah, I want to learn more about the product itself. They have a very interesting
approach, where they enhance data warehouses with ML capabilities, in a way. So I want to see how
they do it, what kind of like components and what kind of abstractions they have built on top of a data warehouse and what is missing, like how they are dealing with the latency problem, for example.
I'll probably have quite a few like technical questions to ask Tristan and I'll focus on that.
Yeah, that's great. Yeah. It'll be interesting to see, you know, a lot of times when you see
technology like this, you see the introduction of sort of new frameworks or even languages or languages that are sort of a variation of an existing language.
So it'll be interesting to see if they're using sort of established paradigms or they're introducing new paradigms that sort of make delivering this easier for them, but may be difficult for users.
So without further ado, let's jump in and talk with Tristan.
Let's do it.
Tristan, welcome to the Data Stack Show.
So many things to talk about, and we really appreciate you taking the time.
Thanks so much for having me. It's a pleasure. Okay, so we're going to talk lots about ML and AI and hear about Continual, but let's start out by just having you tell us your background.
You're a two-time founder, so congratulations. That's a huge accomplishment. But what led you
to Continual, and what's your background? Yeah, well, that's a little bit of a long
story. Let me see how condensed I can make it. So I'm a statistician by training. I graduated
from grad school in the 2012 era when the rise of sort of the word data
science was happening, big data was happening. Of course, the cloud was sort of well underway at
that point. I was trying to figure out what to do next and had the entrepreneurial itch after
seeing what I perceived to be a missing product in the market around enabling data
science within the enterprise. And so in 2013, I founded a company
called Sense, which was really one of the first enterprise data science platforms out there.
It was targeting code-first data scientists, right? The rise of open source data science
tooling was well underway. The rise of the big data ecosystem, Hadoop, Spark, et cetera,
was well underway. And it felt like there needed to be a new statistical computing or data science platform
that was geared towards these users. And increasingly, it became clear, not only geared
towards those users, but actually also serve the needs of the enterprise to bring a team of those
users together, enable collaboration, enable operationalization, et cetera. So that was really
2013. We raised a seed round, grew that company basically to product market fit.
And then right before our series A ended up getting acquired by Cloudera, the big data
platform company, the leading provider of Hadoop. I spent three years at Cloudera, had an amazing time.
That product, Sense, that I had built became their Data Science Workbench product; I guess they call
it Cloudera Machine Learning now. Unsurprisingly, they realized that the pinnacle application on top of a data platform
really is AI ML doing predictions.
Sort of once you've stored the data, once you've processed it, once you've done some
basic analytics, you really want to go beyond analytics to predictive analytics or AI ML.
And there's a whole class of users that don't write Java, they write Python, and they wanted
to enable those users. And so I spent three very, very pleasurable years at Cloudera building out their data science
platform. And then the entrepreneurial itch started scratching again and decided to leave
Cloudera about two years ago to found Continual. And the reason I left Cloudera, basically the problem that I saw
at Cloudera really was there was tremendous buy-in for AI and ML to have a pervasive aspect across
large-scale enterprises or businesses. Every customer that I talked to, I was sort of in the
CTO for ML role at Cloudera, which was kind of partly outbound, partly inbound. Every customer
I talked to was sort of all bought into this idea
of like the AI first, AI centric enterprise. They could like rattle off a dozen use cases
or more. In a meeting, they would often show me those slides that often look like vendor slides
where there's tons and tons of use cases. But they were really all struggling to actually make that
vision a reality. And at Cloudera, we offered an incredible portfolio of different products and capabilities just by being a very broad platform. But what I was seeing was just
a lot of companies weren't succeeding. And the reason was just the sheer complexity of actually
moving AI and ML from the R&D phase into the operational and production phase, sort of this
continual, operational phase. So, you know, I founded Continual; unsurprisingly, the name is continual.ai. So it has this idea of
continual, you know, operational ML and AI. We have a very unique take on that, which I can talk
about. But yeah, that was sort of, that was the initial genesis. I love it. And I want to,
I want to circle back to why AI and ML are hard, because I think that's just a helpful topic to
discuss, especially from someone who's actually built tooling around it, because I think you get
to experience the problem in a unique way if you're actually building tools to solve for it.
But before we go there, can you just give us the brief high-level overview of what does Continual
do? Yeah, no, absolutely. So Continual
is, we like to say it's a continual AI ML platform that sits directly on cloud data warehouses like
Snowflake, Redshift, BigQuery, Azure Synapse. It enables anybody to build predictive models that
never stop learning from data. So it has this core recognition that the world is fundamentally changing,
that data is continually arriving, that predictions and models need to be
continually maintained. And so a typical application would be maintaining a customer
churn forecast, an inventory forecast, shipping arrival time, out of stock event, whether equipment
was going to fail. And we're just building a sort of a
fundamentally easier way to do that. And the way we accomplish that is
we kind of put the data warehouse at the center. So I can talk much more about that. We think that
data is increasingly flowing into the data warehouse. And that's the place where you should
build an experience and workflow around that. It will play well with all the rest of the ecosystem
and it will just sort of 10x simplify both the process of building machine learning,
but also equally important, the process of maintaining and iterating on those predictive
models that you have. Yeah, so I'm happy to go into more depth, but it's a platform that's
fundamentally declarative. Like SQL, we try to make AI this process where it's very data-centric. You're focused on what are the features and the input
signals to these predictive models? What are the things that you're trying to predict,
like customer churn? And there's no need. We don't think there needs to be in this
sort of modern data stack era, any Kubernetes, any containers, any Python pipelines,
80, 90, 95 even percent of AI ML use cases that I see
within the enterprise we think can be solved in this sort of dramatically simpler way. And yeah,
and that's what Continual is doing. I feel like in that two minute explanation,
you gave us enough fodder to do five or six podcast episodes. So much to talk about. So let's just first, could you just give us a quick,
because there's so many things. And I think next, maybe we can jump to sort of the warehouse being
the center of the stack and talking through like modern stack architecture, because I think that's
a really interesting subject. But before we do that, and I think a good way to sort of get there with context
is to talk about why AI and ML are hard. So you mentioned that you had some tools that you had
built at Cloudera, but you noticed that, and it's a really interesting thing, AI and ML are sort of
like, as Kostas and I will say, the marketing kind of leads the actual practical usage inside of orgs, where it's the promise of the future.
And it is like, we all believe that for sure.
And we know there's power there.
When the rubber meets the road, it's actually pretty hard to like operationalize it.
Why is that?
What are, could you just hit the top sort of couple points of like, what are the barriers
that block companies from actually making it a reality and driving value?
Yeah, I feel like there are a lot of people who are also missing the mark
in terms of solving the problem.
So there's some people that think, OK, well, what we need to do is we need a notebook that
can access compute resources in the cloud.
Right.
So they build a notebook.
I mean, it's a worthwhile, worthwhile tool.
Right.
You can launch a notebook in the cloud and get access to a GPU, right?
That might be solving one particular thing.
You might have people that are building like some easier way or a different interface to
actually train a model, right?
But training a model, it actually isn't that hard.
If you hire a data scientist who has some basic skill sets, calling scikit-learn
fit or XGBoost is typically not that hard.
But then there's people saying, okay, well, okay, the problem, and I think this is starting
to go down the right direction.
The problem is really productionization and operationalization.
Now, the naive answer to that is, okay, the answer is we'll put a predictive model inside
a container or something like that, and we'll have a model deployment platform.
But from my experience, all of those really miss the mark. If you ask why isn't AI and ML being successful in the business and why isn't
it actually being embedded in these business processes so that it can have this impact.
And the fundamental, fundamental problem there really is around the continual nature of ML,
right? So there's, there's very rarely a static model that you can deploy.
But even if there is a static model, the data that's feeding into that model is not static,
right?
So you have data continually coming in about your customers, about products that they're purchasing, about the inventory on your shelves, about the mileage, how much jiggle there is
in your aircraft engine that might have a maintenance issue.
And so all those things, the data coming in is changing,
even if the model is not changing. Now, almost certainly the model is also changing because the
world is fundamentally changing. And then if the model is changing or the data that's going into
that model is changing, the predictions are changing, right? So you need those updated
predictions. So in order to really embed an AI/ML experience or insight into a product or into an operational system within an
enterprise, you've got to think about that continual nature, right? And you've got to build
a workflow. So then the next step is, okay, we recognize that that's the problem. Well,
how do you solve that? And if you go and you look out and look at the canonical stack diagram,
right? Uber's Michelangelo platform; Uber has documented their internal
ML platform. And if you go and look at that, you see, wow, there's about seven different
distributed systems in this diagram, right? It's all of this crazy pipeline jungle. There's data
storage systems, there's training systems, there's monitoring systems. And then kind of
patching it all together is this crazy, what at least looks to me like spaghetti of DAGs, to manage all their training and inference and testing and performance monitoring, and all of that sort of thing.
And so I think you get to that and, increasingly, either you don't have the in-house capability to pull that off or the ROI ends up not being there.
It becomes so expensive to build and maintain these models that you say,
hey, let me go and work on other more pressing problems.
And so we think that that's a solvable problem.
We think that there's a way to sort of like,
just like in the Hadoop ecosystem,
which I'm familiar with,
we went from like the MapReduce era, right?
Where you wrote all this Java code to do basic analytics
to figure out how many customers churned.
And then now, of course, we just go and kick off, open up Snowflake and run a query there.
The same sort of thing can happen for ML, but it doesn't just need to be an easier interface. It
also needs to be this continual operational system. And that I think is the trick, right?
It's not just an easier interface; there are some people who kind of put a prediction statement inside of a
SQL statement. That's really not enough. That's not solving the core problem. You need to think of an easier way to build and maintain both the model and predictions.
Yeah, so that's my diagnosis, at least.
So Tristan, you mentioned the complexity that someone can see in this architecture with
all the different distributed systems and this spaghetti of pipelines, blah, blah, blah,
and all that stuff.
How does Continual simplify that?
Actually, there are two questions. One is like,
what kind of complexity Continual exposes to the user? And the other is what complexity is hidden
and how you manage to hide it, right? So can you say a little bit more about that? Because it's
super interesting. I always find it fascinating. I think one of the reasons that I love technology is that
it gives you this opportunity to build something very, very complex in terms of how it operates,
because that's how the world works, but hide it behind a lot of simplicity. And I think that's
very common what we see with technology. So I'd love to
hear more about how you do that. Yeah, no, I love that analogy. I think it is true that the history
of technology is in many ways like the history of the hierarchy of abstractions, right? All the way,
you know, you think about programming languages or, and all the way down to the hardware level,
at each layer, there's another abstraction that hopefully isn't leaking and therefore makes building on top of it dramatically simpler. But in terms of Continual
and ML, the way I think about it is: just step back and think for a moment, pause,
sort of don't look at all the technology. What is ML, right? What is a machine
learning model, a predictive model exactly? It's really nothing more than a function
that takes some inputs. So data, those inputs, you know, are often called features, right? So
signals or features, they could be about your customers, right? That could be, let's say that
you're doing a customer churn problem. Those things are like, well, has the customer used
the product? How much have they used it in the last seven days? So there's a set of inputs.
And then there's a target, and that target could be something that you're trying to predict. So that's, in this case, customer churn. Now,
customer churn could be a few different definitions of customer churn, 30 days, 90 days,
100 days. So you see how quickly once you go down this path, you have a lot of predictive models,
even if you only think of one use case. Then there's a function between those two things.
Now, increasingly, that function is a very, very complicated transformation
between inputs and the prediction tasks that you're doing. But if you think about the level
of abstraction that ideally you should be able to achieve is, hey, manage your input signals,
your features, and manage what you're trying to predict. What's inside that, the transform
function should really, well, it's going to be learned
by machine learning, but really you shouldn't have to think too much about it, right?
That's not something that feels like an essential complexity that you should have to manage.
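This model-as-a-learned-function framing can be sketched in a few lines of Python. Everything here is invented for illustration: a single activity feature, a churn label, and the crudest possible "learning" (picking a threshold). A real system would learn a far more complicated transform, but the shape, features in, prediction out, is the same.

```python
# A predictive model is just a learned function from features to a target.
# Hypothetical sketch: "learn" a churn predictor from one input feature
# (days active in the last week) by picking the best separating threshold.

def fit(examples):
    """examples: list of (active_days_last_7, churned) pairs."""
    best_t, best_acc = 0, 0.0
    for t in range(8):  # try every possible threshold
        acc = sum((a < t) == c for a, c in examples) / len(examples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    # the "model" is nothing more than this returned function
    def predict(active_days_last_7):
        return active_days_last_7 < best_t  # True -> predicted to churn
    return predict

model = fit([(7, False), (6, False), (5, False), (2, True), (1, True), (0, True)])
model(1)  # -> True (low recent activity, predicted to churn)
model(6)  # -> False
```

The point of the sketch is the abstraction boundary: the caller only ever manages the inputs and the target; how the transform is learned stays hidden inside `fit`.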
And then the second part of it, so that's the whole world of automated machine learning
kind of deals with that kind of that transform function and figuring out how to, okay, let's
go to compare a bunch of models and figure out the best models and the best architectures that will give us the
best predictive performance. There's a second dimension to that, which is the operational and
continual dimension, which is also what we focus on, which is saying, okay, well now if you're
going to operationalize this, you need a way to continually retrain and continually predict.
But that should just be policy, right? That's
how often do you want things to be retrained? How often do you want things to be predicted?
So what Continual does is it really gives you a workflow to, one, manage and collaborate around
all your features. So you do that with SQL. You say, hey, here's how I'm going to model
my business. Here's my customers. Here's the features on them. Here's my products. Here's
the features on those. Here's my stores. Here's the features on those, et cetera. Then you can manage your prediction targets. What are the
things you're trying to predict? And everything else is automated, right? The process of training
models and retraining models, comparing models, the process of maintaining the prediction.
We do that. We automate all of that. We try to distill that down to this essential complexity. Now, we bet on
that one way to do that is your data is in your data warehouse, right? You kind of need to say,
what's the level of abstraction below you, right? And what we've bet on, and I think this has been
an amazing enabler for us, is we bet that the future is the data warehouse. The future is SQL
from a data management perspective and data transformation perspective. Now, all we need
is an AI ML system that's
operational and plays well with that ecosystem and has a workflow that works for that ecosystem,
that user, et cetera. That's fascinating. So when we are talking about the use cases that
Continual is built for, we are talking about doing predictions and machine learning and AI using, like, pretty structured data, right?
Or do you see also like use cases?
We usually, when we think about ML and AI, the first thing that we think about is like
image recognition, right?
Computer vision.
Is this something that also can be part of Continual or the focus right now is mainly
on, like, structured and business data?
So that's a great question. And so we are
in the short, medium term, we are really focused on data that you typically see within an enterprise
that lives in a data warehouse. And that tends to be a structured data that does have a relational
dimension to it, right? So customers buy products, et cetera, and also very clearly has a temporal
dimension. So we have an abstraction that's sort of very tailored towards relational and temporal data and building both features and building predictions on top of that relational temporal data.
Now, the model though, and what's very exciting is the model is easily extensible to richer types of data. For instance, we already support text data, right? We can use text data as features.
So conceptually, if you think about computer vision, right? Conceptually, computer vision is
nothing more than an image type going into a function. And then let's say you're trying to do classification,
well, that would be a class or a category or Boolean or something as the output type.
So that level of abstraction still works. It can even be more sophisticated than that. It can even be like a video comes in on one side and a segmentation video comes out on the other side. And really, you can think of that as a function too. Now, not all of those use cases are happening in a data warehouse; that's probably not the dominant use case, certainly not the dominant use case we see. We do see a ton of text data
and the need to leverage and extract information from text data. And we do increasingly see image
data, right? So Snowflake, for instance, just announced support for unstructured information,
including images, text, PDFs, et cetera. And a lot of times you want to extract insights from that,
and then put those insights back into your data warehouse so that you can then query them. Right. So the data
warehouse, our belief is the data warehouse still is going to be the place where a lot of that,
a lot of that stuff happens. Now, if you're building an autonomous car, right. That's
processing a whole bunch of real-time streams. No, that's not going to be that architecture.
Yeah. Makes total sense. And like all these use cases with
IoT and, like, machine learning at the edge and all that stuff, they're like more
specialized. But yeah, that's super, super interesting. And if I understand correctly,
okay, and I'm coming more from the world of data engineering, not that much from the world of
machine learning. So I'm still learning about that. And the way that things work is that,
let's say we have a data warehouse.
So we push our data there, we collect, doesn't matter how we do it.
And from this raw data that we have, the next step is to go and create some features, right?
And from once we have done that, the next step is to feed these features into a model
and train a model.
Is this correct, first of all, or am I missing something?
Yeah, that's absolutely correct.
Although I would say, yeah, I mean, just don't forget the continual part.
The end goal that you're really trying to end up at is a continual process by which you both maintain that model, at least on some frequency, weekly, monthly, et cetera.
And it depends on if you're doing real-time or continual batch.
But let's say that you're doing customer churn or something like that, or inventory,
you're almost always continually maintaining that prediction, right? So it's not a one-off script.
Absolutely. Absolutely. Yeah. Yeah. I'm talking mainly about, let's say, the transformation of
the data. I'm not talking that much about the operations right now.
And I'm wondering, how do we go from the raw data to the features?
How users do that?
And let's get an example, a more concrete example.
Let's say the use case here is, like, churn.
So how would a user that is going to start implementing Continual today, let's say, and assuming
that they have all the data in their data warehouse, they can do the first step, which
is going from their own data to the features.
And what do these features look like, also? Because I hear the word feature a lot, feature stores,
like all the stuff around them, but at the end, what are these features, right? Yeah. So a feature is something that you believe,
given your business insights, your understanding as a human of the business,
that you think is going to be predictive of whatever you're trying to predict in this case,
churn, right? So a classic feature would be something like, in this case, let's
say you have clickstream data coming through RudderStack and into your data warehouse.
You might then want to say, well, I have a deep insight that activity over the last few
days, like let's say seven days, is very important.
And so you might want to embed that knowledge, basically your business knowledge, and you
would define that as a feature.
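A feature like that seven-day activity window is, in practice, often just a SQL aggregation over the raw event data. Here's a minimal sketch using Python's built-in sqlite3 in place of a warehouse; the table, column names, and the hard-coded "as of" date are all invented for illustration:

```python
import sqlite3

# Hypothetical raw clickstream table, as events might land in a warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (customer_id TEXT, event_date TEXT);
    INSERT INTO events VALUES
        ('c1', '2021-08-30'), ('c1', '2021-08-29'), ('c1', '2021-08-10'),
        ('c2', '2021-07-01');
""")

# The feature is nothing more than a windowed aggregation expressed in SQL.
feature_sql = """
    SELECT customer_id,
           COUNT(*) AS events_last_7_days
    FROM events
    WHERE event_date >= date('2021-09-01', '-7 days')
    GROUP BY customer_id
"""
features = dict(conn.execute(feature_sql).fetchall())
print(features)  # {'c1': 2} -- c2 had no activity in the window
```

In a real warehouse this query would typically live as a view or a dbt model, so downstream use cases can reuse the same definition.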
And you would really want to be able to reuse that feature across all of your downstream
use cases, right? So you don't just have customer churn. You also have something about the other products that they might want to buy. You have their LTV calculation, lifetime value. You might have net expansion and net contraction. So something that's maybe not a binary churn metric or upsell to the premium plan. So what we see is typically once you're in a, let's say you're
dealing with these sales and marketing use cases, you might start with churn, but very, very quickly.
I mean, if it becomes easy to build predictive models, very quickly, you go from one model
to a dozen in that very narrow domain, even putting aside all the other ones, other parts
of your business that you could impact and that you use the same features for those
downstream use cases. And so one of the benefits of a feature store is the ability to easily
reuse your features in multiple applications. Another aspect I think that is maybe less well
understood outside the feature store and ML community is the temporal nature of features.
If you're trying to predict something
like customer churn, it's critical. And let's say customer churn in the next month, right?
It's critical that you have the ability to go back in time and ask yourself, hey, what was that
feature a month ago, two months ago, three months ago? So that you can then look at the future
ground truth. How does a machine learning model learn?
It needs to see some examples of that actually happening, a customer churning.
And so the way you do that is you go to your historical data and you look back in time
and you say, okay, did the person with these characteristics a year ago then churn in the next 30 days? That is, by 11 months ago.
And so you need to, in your feature store, you need to make sure you define your features in a way that allows you to kind of have this time machine characteristic. Sometimes people
say it's called point in time correct or temporal join. You'd be able to go back in time and say,
I need to get that feature at that particular point in time. And so what Continual does is
it gives you a whole workflow to define those features and make sure that you organize them
properly. Make sure you attach them to the right entity or customers.
You make sure they have a time index appropriately.
Make sure when you train your models, you get the features for the right point in time.
You don't have data leakage.
And so that's all very important.
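The "time machine" requirement Tristan describes can be sketched as an as-of lookup: given a training timestamp, return the latest feature value recorded at or before it, never after. A toy version in plain Python (the data is invented; real feature stores do this as a point-in-time join at scale):

```python
from bisect import bisect_right

# customer -> [(timestamp, feature_value), ...], sorted by timestamp
history = {"c1": [(1, 3), (30, 0), (60, 5)]}

def feature_as_of(customer, t):
    """Point-in-time correct lookup: latest value at or before time t.
    Using a value recorded after t would leak future information
    into training (exactly the data leakage being warned about)."""
    points = history[customer]
    i = bisect_right([ts for ts, _ in points], t) - 1
    return None if i < 0 else points[i][1]

feature_as_of("c1", 45)  # -> 0 (value recorded at t=30; the later 5 is off-limits)
feature_as_of("c1", 0)   # -> None (nothing known yet at t=0)
```

Labeling works the same way in reverse: you look up the features as of some past date, then check whether churn actually happened in the window after that date.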
Now, we also bet that the way you should define those features to your question, well, how
do you actually do that?
How do you define a feature?
You should do that with SQL. Increasingly, SQL really is this incredibly
powerful lingua franca. It scales beautifully. It just, especially when you deal with, you know,
at scale, it just becomes this very, very powerful language. So that's how we think about that
process. So is it accurate to describe that one of the things that Continual offers to someone who has a data warehouse is to actually extend the data warehouse with a feature store?
Yes, exactly. So there are really three layers to Continual.
So one is there's a feature store, but that is a virtual feature store that is on top of your data warehouse.
Right. So we replicate no data into our system. We define essentially views and organize views
on top of your existing data
and give you a workflow for that.
You can also have native integration,
for instance, with DBT.
So if you're coming from the world of DBT,
the data build tool,
you can define those features and your targets
and pretty much your whole model using DBT.
So that's kind of at the core.
And that's why we say we're a data-first
platform for AI. We really think that's the most important thing. That's modeling your business.
And that's the most important thing where you really need to bring all your expertise to bear.
Above that, in terms of training models, we have this declarative AI engine. You can think of it
like an auto ML system that has this very flexible ability to pull in data, this temporal relational data, and make sort of state-of-the-art predictions over time.
And then the final thing we have is we have this continual ML operations aspect.
So we don't just train that model once.
It's not upload a CSV file and get a bunch of models.
It's really about maintaining both the model and the prediction and giving you visibility on top of it all. And that maybe sounds like a lot, but it really is not, because the only thing that you're actually
doing is you're really just defining your data, right? The rest of it's all kind of just happening
automatically, kind of on autopilot. And the end result is you basically get state-of-the-art,
continually improving predictions inside your data warehouse and with a workflow that makes it not
only easy to build
that, but also easy to maintain and also easy to iterate. And we think that that basically,
I mean, our goal is really like, imagine a company that has 500 models. What is the system that's
going to be able to do that? Putting aside whether it's Continual or somebody else.
In my view, it's got to be a high level declarative system. That's the only way to
manage 500 models. If you go and try to manage 500 models and the continual life cycle of 500
models using a whole bunch of custom Airflow DAGs that you write, where every single
one is a custom script maintained by a data scientist. I mean, that is just not a recipe.
That's not the future that I think is possible. It's maybe the status quo today, the way we do it today. But I think we all need to be striving for some sort
of higher level experience. If we really want AI and ML to become pervasive, there's got to be some
higher level experiences that we invent as technologists. Yeah. And just like to make it a
little bit more clear, when you are talking about this declarative language, you're talking about an approach to operationalizing models similar to what Terraform, for example, has done for cloud infrastructure.
Is this correct?
Yeah, that's a fantastic example, you know, analogy.
So if you think about managing cloud infrastructure, you manage that now with a declarative approach.
You manage it using Terraform. If you think about managing containers,
right, you manage it by using Kubernetes likely, and you define declarative, here's what I want
to happen. And then Kubernetes goes and makes it happen. And if a thing fails and machines fail,
it fixes those problems, right? And it maintains the number of replicas that you want.
And so, yes, we think that, you know, our experience is
very much tailored to that. You have this configuration, you can push into the system,
we go make it happen. You can do that in a UI, you can actually do it in version control,
just like you would with Terraform or Kubernetes, you know, manifests, you can do it like that.
The second element, though, I would say, is this data element. So it's not just a bunch of YAML; there's also SQL there, in order to define your input features and your output targets, and you do that using the language of SQL. And that allows the whole system to become declarative. On one hand, SQL is itself a declarative language, so we have a declarative language for the necessary data manipulation that you need to organize and model your business. And then the continual operations aspect is declarative as well. The best analogy, right?
It's exactly like Terraform.
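Tristan's Terraform and Kubernetes analogy boils down to a reconcile loop over declared state. The following is a minimal, hypothetical sketch in Python of what that could look like; every name, field, and query here is invented for illustration and is not Continual's actual configuration format or API.

```python
# Hypothetical sketch of a declarative model spec plus a reconcile loop,
# in the spirit of Terraform/Kubernetes. All names and fields are invented.

# Desired state: what the user declares (in practice, YAML in version control).
desired = {
    "model": "customer_churn",
    "entity": "user_id",
    "target_query": "SELECT user_id, churned_30d FROM churn_labels",
    "feature_query": "SELECT user_id, plan, sessions_7d FROM user_features",
    "refresh": "daily",
}

# Current state: what the system last materialized (the feature query is stale).
current = dict(desired, feature_query="SELECT user_id, plan FROM user_features")

def diff(desired: dict, current: dict) -> dict:
    """Return the keys whose declared value differs from the current state."""
    return {k: v for k, v in desired.items() if current.get(k) != v}

def reconcile(desired: dict, current: dict) -> list:
    """Translate the diff into actions, the way a controller would."""
    changes = diff(desired, current)
    actions = []
    if "feature_query" in changes or "target_query" in changes:
        actions.append("retrain")      # the data definition changed
    if "refresh" in changes:
        actions.append("reschedule")   # the maintenance policy changed
    return actions

print(reconcile(desired, current))  # a changed feature query triggers a retrain
```

The point of the pattern is that the user only edits the declared state; the controller figures out what work is needed, exactly as Kubernetes does for replicas.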
That's super interesting.
A quick question: you keep mentioning SQL, how important it is, and how much you are betting on it.
Do you see some kind of limitations in the expressivity that SQL has in terms of creating features or in the ergonomics of the language?
The reason that I'm asking is because a very good example that we have seen in this space is dbt, right? It was a project that came to life exactly because of the limitations, more around the ergonomics, that SQL has, right? And dbt came and brought into the game all these best practices and all these nice tools that engineers used to have, and brought them into data. So do you see any kind of limitations with SQL? And if yes, how do you see that we can overcome
this? Yes. I mean, you absolutely need to combine SQL with a workflow around SQL, for instance dbt, right? If you just have a bunch of shell scripts lying around with SQL statements in them, that's not going to be a great way to manage your data. But if you're trying to model your business and your data is already in the data warehouse,
you should really use the power of SQL. And I think as you embrace that philosophy more and more, you realize how far it will go. For machine learning, there are things where the ergonomics matter; I come from a Python background, I lived in Python and R and all of those tools. There are instances where you kind of think, okay, that might be a little bit easier to express in, or to wrap up in, a Python syntax. But increasingly, I just don't see that. I think the data that's coming into models is more and more raw, with respect to machine learning models. It used to be that
you needed to do a tremendous amount of feature engineering. In our system, you can still do feature engineering to bring your business insight to bear, but increasingly the model itself is doing, internal to it, some degree of feature engineering, which is
really just part of the model. So for instance, if you look at the history of computer vision, and even tabular data, increasingly you can push raw data into those models, raw images, raw tabular data, with very minimal preprocessing.
And then some of the complicated feature engineering that's very ML specific
and maybe SQL is not as well suited for, that can happen sort of internal to the model.
And I think that type of feature engineering really doesn't need to be exposed to the end user, right?
So if you think about the type of features
where the business user or the user
needs to bring their own insights to bear,
I have a hard time thinking of where SQL, you know, has let me down in terms of that.
It makes total sense.
I mean, and I think, again, as I said, I'm not coming from ML, but a big part of the success of deep learning is exactly that: the model itself generates optimal features that help build better models in the end.
Because I remember back in the beginning of the 2000s, when we didn't have deep learning yet, most of the computer vision papers that you would see getting published were about what kind of features we can create so that some very specific niche use case of computer vision could be tackled a little bit better. And I think part of the revolution with deep learning is exactly that.
Yeah, absolutely. I mean, you can't write an edge detector on an image in SQL.
I grant you that.
But that's not what you need to do anymore, right?
For the state-of-the-art models, you just need to pass in a raw image.
And increasingly, you might even be able to say something like a question on that image,
like how many cars are there in this image, right?
So you might have this whole area of visual question answering.
So even something kind of as wild as that, if you think about from a data perspective,
it's really no more than an image coming in, and a column with a question, and a column with an
answer. And that's just mind-blowing to me. It's almost amazing that that's possible. And I think the overwhelming trajectory is towards that. A lot of models don't even need data, increasingly.
So if you look at what's happening with, for instance, OpenAI and their GPT-3 type of work,
and all, of course, speech recognition in many parts of the domain, you actually don't even
need to bring any data to bear. There's no model training; it's an API. But within the enterprise, some people ask me, okay, is it all just going to move to this, everything's an API? The answer there is clearly no, because for customer churn within your business, you have to look at your historical churn patterns. There's no way you can just
predict customer churn given a user's demographics. If you don't have some history there,
same thing with inventory forecasting, predictive maintenance use cases. There's a set of use cases
where fundamentally they're data-driven; they're driven off the data of your business. And so what we're doing is really trying to provide the easiest experience for those types of use cases.
I keep saying that there are data problems where, how to say that, the business context is very important. You cannot take the churn model that can predict what is happening at DoorDash and just use it in Continual, right? It just doesn't work, because every company is dealing with a completely different view of the world.
Yeah, no, absolutely.
I mean, I think that's actually why the data warehouse
is so powerful.
And in terms of a data strategy for companies,
I mean, a lot of times people say, well, isn't, for instance, all of AI and ML going to get verticalized? People have said that about BI as well, right?
So you have these sales and marketing use cases like churn
and you have inventory forecasting use cases.
But what I've seen is twofold.
One, even for those very standard use cases
that every business has,
the data is different, right? So of course, the signals that you're getting from all of your
different touch points, your websites, your products, I mean, all the things that are
sending you data, all of that data is very bespoke to your business, right? So Strava is very
different than DoorDash, right? But they might both still have churn at the end; they're both trying to predict, maintain, and reduce churn.
And even actually this one thing sort of surprised me,
even as I've worked with more and more companies,
even the definition of churn is very bespoke.
So I just was chatting with a company that had Stripe data, so you think, okay, very standard. But in their world, churn was defined as the customer has to be 30 days out: they could cancel their account due to a bad credit card, but if it's just 15 days and then they manage to put in a new credit card, it's not churn. And the beautiful thing about the data warehouse is that it gives the data professional, the data scientist, the power to model their business in the ways that are unique. It has that flexibility, but tries to hide everything else.
Yeah, and I have a hard time seeing how for many, many use cases you can get rid of that.
And so, you know, our bet is that that level of complexity, the ability to model your business
and the need to model your business is going to persist for most sophisticated data-driven
companies.
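The bespoke churn definition Tristan describes can be expressed directly in SQL. Below is a minimal sketch using SQLite for portability (in practice this would run in the warehouse); the table, columns, and the exact 30-day threshold are assumptions taken from his anecdote, not a real schema.

```python
# A sketch of a bespoke churn definition: a cancellation only counts as churn
# if the account stayed inactive for 30+ days (a new card within 15 days is
# not churn). Table and column names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscriptions (
    user_id TEXT,
    canceled_at TEXT,      -- card failed / account canceled
    reactivated_at TEXT    -- NULL if the user never came back
);
INSERT INTO subscriptions VALUES
    ('a', '2021-01-01', '2021-01-10'),  -- back after 9 days: not churn
    ('b', '2021-01-01', '2021-03-01'),  -- back after 59 days: churn
    ('c', '2021-01-01', NULL);          -- never came back: churn
""")

rows = conn.execute("""
SELECT user_id,
       CASE
         WHEN reactivated_at IS NULL THEN 1
         WHEN julianday(reactivated_at) - julianday(canceled_at) >= 30 THEN 1
         ELSE 0
       END AS churned
FROM subscriptions
ORDER BY user_id
""").fetchall()

print(rows)  # [('a', 0), ('b', 1), ('c', 1)]
```

Because the definition is just a query, changing the business rule (say, 30 days to 45) is a one-line edit rather than a new pipeline.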
Absolutely.
Yeah, I totally agree with that.
I have a little bit of a more technical question to ask you.
I was checking and going through the architectures
of different feature stores.
And one of the things that I've seen,
which pretty much exists in every feature store architecture,
is a caching layer to serve the features.
And the reason to do that is because low latency in some use cases is extremely important, right?
Now, data warehouses on the other side, and OLAP in general, were built with a completely different perception of what time is, right? In the past, for example, the data warehouse was built in a way that it could run queries
for hours or even days, right?
So latency is a completely different thing when we are talking about data warehouses
and then compared to transactional databases or caching layers.
So how do you deal with that when you get a data warehouse centric approach?
Yeah, no, that's a great, great question because you're right. You're absolutely right. I think
there's widespread recognition that a feature store or something called the feature store should
be at the center of your data, your ML strategy. And in part, because we see that, hey, that's
one of the most important bits and also where a lot of complexity can come in.
The way I think about it, there are really three parts. So just stepping back: what is a feature store? Maybe not everybody in the audience knows the term. What exactly is it? I think it needs to offer three things. The first is collaboration, sort of define it once, right?
Sharing of the definitions of
features across your business, right? You should not have data scientists duplicating feature
definitions. You should have the features properly governed. That's probably the easiest one to
solve, right? And you could probably solve some of that with your existing tools, right? Like by following dbt best practices, you might, you know, have a virtual feature store. The second one is
what's called
point in time correctness, which is this idea of a time machine. You really, for training purposes,
need to be able to go back in time and reproduce a feature at any point in time, or at least at
regular intervals that you're going to train on. And you actually need to be able to do that on a per-row basis; for every user, you need to potentially go back a different amount of time. So it's not just Snowflake's or Databricks' time machine kind of backup functionality. You
actually need to be able to do sort of a temporal join where you get the features at a particular
moment in time. You need to do that to construct your training data set so that you can then
forecast churn into the future without any data leakage. And the third one, which is what you're pointing out, and this is only applicable for real-time serving use cases that can't be prematerialized, need to be done on the real-time path, and cannot be passed in from the client, is that you need a way to serve features with low latency.
So if you have somebody coming in, let's just say you're doing a search personalization
type application, you have somebody who is typing in a search query, that search query comes in,
you find the relevant records, then typically you want to re-rank it based on maybe the previous
click stream of that user and what they've been doing and their history of actions. And you have
a set of features that you need to very, very quickly look up and say, okay, what have they
most recently looked at? Did they click on those things? Whatever it is, you need to use those features. And for that, typically today, you need a caching layer on top of a database.
And so a lot of the feature store work, if you look at what some of the open source feature
stores are doing, it's really about trying to find an architecture for that caching.
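The point-in-time correctness requirement from earlier, the "time machine," is essentially a temporal join. Here is a minimal sketch, again using SQLite purely for illustration with invented table names: for each training row, we join the most recent feature value recorded at or before that row's timestamp, so no future data leaks into the training set.

```python
# A sketch of a point-in-time (temporal) join for building a training set.
# For each training example we take the latest feature value whose timestamp
# is at or before the example's as-of time. Table names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE features (user_id TEXT, ts TEXT, sessions_7d INTEGER);
INSERT INTO features VALUES
    ('a', '2021-01-01', 3),
    ('a', '2021-02-01', 9);   -- must NOT leak into a January training row

CREATE TABLE training_examples (user_id TEXT, as_of TEXT, churned INTEGER);
INSERT INTO training_examples VALUES
    ('a', '2021-01-15', 0),
    ('a', '2021-02-15', 1);
""")

rows = conn.execute("""
SELECT t.user_id, t.as_of, f.sessions_7d, t.churned
FROM training_examples t
JOIN features f
  ON f.user_id = t.user_id
 AND f.ts = (SELECT MAX(ts) FROM features
             WHERE user_id = t.user_id AND ts <= t.as_of)
ORDER BY t.as_of
""").fetchall()

print(rows)  # the 2021-01-15 row sees sessions_7d = 3, not the later value 9
```

This is the per-row "go back a different amount of time" behavior Tristan distinguishes from a whole-database backup snapshot.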
It's interesting.
We actually started looking at that very closely when we founded Continual.
And what we saw was, first of all, data is in the data warehouse and people want to leave it there to the degree possible. And second, that
a huge number of use cases can be solved with this sort of continual batch mindset. It's a tremendous
simplifying approach in terms of your architecture. And third, we have a bunch of ideas around how to do that cache if you want to do the real-time use cases, but we're waiting a little bit. It'll be interesting to see where the data warehouses themselves go.
Some of the cloud platforms are building some capabilities indirectly. There's emerging data
stores, things like Materialize, which have certain ability to do that. Obviously, the real-time
databases, things like Rockset. I know Snowflake, for instance, is very focused on high concurrency,
low latency. So it'll be interesting to see how that converges. That's definitely an open question. The architectural complexity that emerges from trying to maintain consistency between these environments, when in some ways you'd like to just express a SQL statement and have it taken care of for you, seems to me something that's going to be eliminated over time.
So I think it's a very interesting question. It's unclear where it will go exactly, in terms of will there be a dual-write system with the cache, or will we converge towards new functionalities built directly into the data warehouse, or will there even be tailor-made data stores that have this characteristic of historical data stores.
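The dual-write-versus-cache question can be made concrete with a toy read-through cache in front of a slow store. This is purely an illustrative sketch; the class, the function names, and the in-memory "warehouse" stand-in are all invented, and real systems would use something like Redis rather than a Python dict.

```python
# Illustrative read-through feature cache in front of a slow store.
# All names are invented; this is a toy, not any real serving system.
import time

# Pretend warehouse lookup: correct but slow (hundreds of ms in reality).
WAREHOUSE = {"user:a": {"sessions_7d": 3, "last_click": "shoes"}}

def warehouse_lookup(key: str) -> dict:
    return WAREHOUSE[key]

class FeatureCache:
    """Tiny TTL cache illustrating the low-latency serving layer."""
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)

    def get_features(self, key: str) -> dict:
        now = time.monotonic()
        hit = self.store.get(key)
        if hit and hit[0] > now:
            return hit[1]                      # fast path: cache hit
        value = warehouse_lookup(key)          # slow path: go to the warehouse
        self.store[key] = (now + self.ttl, value)
        return value

cache = FeatureCache()
print(cache.get_features("user:a"))  # first call fills the cache
print(cache.get_features("user:a"))  # second call is served from the cache
```

The consistency problem Tristan describes is visible even here: until the TTL expires, the cache can serve a value the warehouse has since updated, which is exactly the complexity he expects to be absorbed into the platform over time.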
It is super interesting. We actually talked with Materialize recently on the show,
fascinating conversation, super smart team over there. And then a favorite topic of ours is what are the warehouses building? And they've already advanced in so many different ways,
but they're building in things that are going to make this really interesting.
But speaking of the warehouse, Tristan, and we can land the plane on this question because
we're coming up on time, but you've mentioned multiple times, and this is a really interesting
topic, I think, in general, but the importance of the data warehouse in the context of the
modern stack.
So zooming the lens out from sort of the specifics, could you just tell us why? I mean, you're betting on the
warehouse being a central component of the modern way that companies are architecting their data
stacks and all of these different tools. Why are you doing that? And then why do you think the time
for that is now? It seems like there's a
new crop of companies that are sort of making this bet. And why is that? Why do you think
that's happening at this point in time? Yeah, so let's say it's twofold. The first is that I've spent a decade experiencing data infrastructure, data engineering infrastructure, machine learning infrastructure. I mean, the number one problem I see is complexity,
right? These stacks just get incredibly complicated to manage, move data between,
and particularly with any velocity from a developer perspective. Maybe it's a death by a thousand paper cuts. This is sort of the historical era: I mean Hadoop; the knock on the Hadoop ecosystem is complexity.
But even putting aside Hadoop, if you use all the raw building blocks of a cloud vendor like AWS, it gets very, very tricky.
And the complexity gets very hard,
which makes it very costly to build new use cases,
very costly to maintain them
and to iterate on them and build new functionality. And then it just compounds over time. And so I think the big thing about the data warehouse, the first thing
is by putting the data warehouse at the center, by betting on a cloud-managed data warehouse that's elastic, that offers workload isolation. So you can have your data scientists going crazy in one isolated cluster on the same shared data. It's an incredibly
liberating experience if you've experienced the alternative. If you've experienced the complexity
of shared compute, of multiple disparate systems where you're moving between them,
of multiple different languages, you're moving from MapReduce to SQL to Parquet files to Python
to all of that. The big data ecosystem is an incredibly powerful model, but the data warehouse is much, much simpler. And as I've matured, or just as I've experienced what happens when you deal with too much
complexity, I have a natural affinity towards the simplicity of the data warehouse and the power of
the data warehouse. That's the number one. The second one, which is more towards the ecosystem, there needs to be some common
foundation by which products can integrate and an ecosystem can develop. And the data warehouse, I think, is now emerging as an amazing point where different products can collaborate in a
kind of turnkey way while still allowing the flexibility that you want.
So for instance, ingestion: you have RudderStack, you have Fivetran, you have that whole community that's making what previously were these crazy Airflow DAGs, from Salesforce into your data warehouse or from your logs into your data warehouse, completely turnkey, and now it's landing in the data warehouse. Of course, you can do transformation with dbt. You can do data monitoring with Bigeye and Soda. And I'm sure I'm forgetting some. And you can,
of course, have your BI tools. There's a whole new class of BI tools that thank God are actually
running the analytics inside the data warehouse, so there's no data movement. So you can build
all your reporting off of that. Increasingly, again, using tools like RudderStack or Census, Hightouch, et cetera, you can move the data out of the data warehouse and actually make it actionable, kind of weaponize it, so it can actually have an impact on your
business. And all of these things are doing it in this completely turnkey way. I was seeing that whole stack emerging, and the genesis of Continual was really saying, wow, there's not really a way to do operational ML. You want to do these
predictions on top of that stack in that ecosystem. And please don't tell me I'm going to kick up a Kubernetes cluster
to do that if I've embraced that stack. And so, yeah, I really, you know, I'm tremendously excited
by the ability to, you know, drive down complexity, to simplify things. I think if we really want
data and AI and ML to be central and pervasive across the enterprise, you've got to have a simple, productive, low-cost stack from a manpower perspective, if you really want it to become pervasive. So that's why I'm bullish. Yeah, I couldn't agree more.
And I think there are a lot of situations or sort of use cases for activating data where you had ecosystems of tools crop up in order to do those things,
to your point, in a fairly complex way. And then five years later, the warehouse technology is
advanced enough to where it's like, oh, well, actually, the most elegant solution was there
all along. It was your data warehouse, right? And just
the ecosystem around it of pipelines to your point, and really the data warehouse technology
itself hadn't quite gotten to the point where it was elegant, but now it really is. I love the way
that you described it as number one, simple, and then number two, the need for a central hub. And it's such an obvious choice for both of those points. Yeah. And it's always a journey, right? I mean, I think most technologies,
you start complicated, you start low level, then some patterns and abstractions emerge,
and then people build fundamentally new and easier experiences. And I think there's a
perpetual search for that. And yeah, I mean, Hadoop and that ecosystem is incredibly powerful.
It's an incredibly powerful technology.
If you look at how Facebook runs, a lot of it's on that sort of that technology.
But if you need all that power, all that flexibility and the ability to dive into the code yourself,
that's a great ecosystem to buy into.
But also, I think even the Hadoop ecosystem bet on SQL, right? Very quickly in the rise of Hadoop and the big data ecosystem, Hive became a thing, which was the big data query language. And then it was quickly, well, how do we get faster querying? There were various projects trying to do faster querying. And then increasingly it was segmentation of the compute. And then of course the rise of the cloud kind of disrupted a lot of that architecture. So yeah, there's a journey there. I think we're right now at this
moment where there's this convergence on the modern data stack. I really think over the next
five years, a huge amount of innovation is going to happen there. And if you buy into that ecosystem, you're going to be able to free ride on all that innovation that's happening in a million different corners. Yeah, absolutely. One of our previous guests described being able to quickly derive value out of ML
as the next phase of the data stack, whereas analytics is sort of maturing to the point where,
I mean, as simple as it sounds like this sort of self-serve analytics across the organization,
still a lot of companies haven't figured out, but the technology is now there to where there are known playbooks for how
to do that. And ML is going to be the next phase of that. And I really think in a lot of ways,
that's true because once you have the data clean enough to produce really powerful analytics,
then it's like, okay, well, great. Now let's really turn the heat up and start optimizing
the business in some interesting
ways with this data that's really well-suited for machine learning use cases.
Yeah, I mean, absolutely.
I mean, that's the thesis for the company.
I need to talk to this person who you talk to.
We'll have to recruit them.
Yeah, I think there's a classic pyramid where if you look at it, the sort of AI and ML is
at the top.
I think there's still some other things that are going to come that still need to happen.
We still need to push into the application domain and make sure that we can handle not just the backend operations of the business, but the applications themselves. But I'm very bullish on this path. And I've seen
the simplicity of it now. It kind of makes me very excited. Very cool. Well, we are at time,
but really quickly, Tristan, is there a way, I mean, just hearing about Continual, I keep thinking this is really cool. I want to see it in action. Is there a way for our listeners to check it out and try it? What's the process like there? Absolutely. We're in early access now; we launched about a month or two ago. So you can go to continual.ai and you can learn a bunch more about Continual.
If you type in your email there, we absolutely will reach out to you within 24 hours and set up
a demo. So we can give you a demo. We're taking early access customers. We typically do a demo
and then onboard folks to try it out. We're hoping to get out something in terms of general
availability soon.
So stay tuned for that.
But yeah, I look forward to hearing
from anybody who's interested.
I think we have a little last plug here.
Yeah, cool.
And just to confirm,
all the major warehouses, right?
So sort of-
All the major cloud warehouses, yeah.
All the major cloud data warehouses.
Awesome.
Very cool.
Well, definitely encourage the audience
to give it a try.
Really cool product. And Tristan, we really thank you again for the time. This has been
an awesome conversation and we'd love to have you back on the show sometime soon.
Absolutely. My pleasure. Thanks so much.
Well, I think my first big takeaway is that you and Tristan are incredibly smart people,
and it was really fun to hear you dig into the tech.
But my second one was his excitement about the data warehouse, which has really been a continual theme, I think, throughout.
Actually, it was a big theme last season.
We've heard it in the last couple of shows about how the cloud data warehouse is just
enabling so many different things, which when Redshift first came out, I don't think anyone would have,
I mean, I'm sure there were very future looking people
who sort of imagined this world where everything's connected
around the warehouse, but I don't think a lot of people imagined the sort of things that we're talking about as far as Continual goes.
And so that's just really cool.
And I can't wait to see how that innovation continues to unfold.
Absolutely.
I don't think that the data warehouse we will be talking about a couple of years from now is going to look very similar to what Redshift was when it started in 2012, for example.
I found it very interesting, for example, that Tristan mentioned at some point about Snowflake supporting more unstructured
types of data like images and free text.
And yeah, the data warehouse becomes like a much broader concept, right?
It's more like a data platform in general, and it's going to fuel many different use
cases.
And of course, one of the most important ones, from what it seems, is going to be built around AI and ML.
So yeah, it's very fascinating to see
what people like Tristan do
and the stuff that they are building
and how they are enhancing the data warehouse
with non-traditional data warehousing capabilities
like machine learning.
And I'm really looking forward to seeing, in a couple of months from now, what the product is going to look like.
Super interesting for me, very engaging.
I mean, that's something that we see especially with founders: they're very passionate about the products and the technologies they build, and it's always a lot of fun to discuss with them.
Yeah, absolutely.
Well, thanks again for joining us on the Data Stack Show.
Great set of shows lined up over the next couple of weeks.
So make sure to stay tuned
and we will catch you on the next one.
We hope you enjoyed this episode
of the Data Stack Show.
Be sure to subscribe
on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds,
at eric at datastackshow.com. That's E-R-I-C
at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.