The Data Stack Show - 51: Democratizing AI and ML with Tristan Zajonc of Continual
Episode Date: September 1, 2021
Topics in this wide-ranging conversation include:
- Tristan's background with Cloudera and the need for continual operational ML and AI (3:15)
- How the complexity of Continual is hidden behind a simplicity of use (14:48)
- Focusing on data that lives within a data warehouse (18:43)
- Understanding features in the ML conversation (22:47)
- The three layers of Continual (26:11)
- The importance of SQL to Continual (30:19)
- Caching layers and the data warehouse centric approach (38:28)
- Betting on the warehouse being a central component of data stack architecture (43:34)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show.
Today, we have Tristan Zajonc.
His last name does not sound like it's spelled; it's pronounced "Zients," we confirmed with him.
And he founded a company, his second company actually, called Continual. And they do really interesting machine learning stuff on top of your existing cloud warehouse,
which I think is just going to be a fascinating topic.
One of the questions that I have, which in the parlance of machine learning is probably going to be predictable, is when you think about machine learning readily available on top of your
existing warehouse, in many ways, that's kind of almost a democratization of machine learning,
which inside of a lot of companies is still really hard to operationalize at scale, just
because there are so many moving parts and pieces.
But this is something that Tristan actually saw on the ground building tools for people
to operationalize
data science. So I want to ask him, even though the promise of machine learning is still so
exciting, it's still just actually a pretty hard problem when it comes down to the practical
implementation. That is my burning question. And I've already been talking too much. So Kostas,
what's your question? And then I'll give you plenty of time on the mic during the show. Yeah, I want to learn more about the product itself. They have a very interesting
approach, where they enhance data warehouses with ML capabilities, in a way. So I want to see how
they do it, what kind of like components and what kind of abstractions they have built on top of a data warehouse and what is missing, like how they are dealing with the latency problem, for example.
I'll probably have quite a few like technical questions to ask Tristan and I'll focus on that.
Yeah, that's great. Yeah. It'll be interesting to see, you know, a lot of times when you see
technology like this, you see the introduction of sort of new frameworks or even languages or languages that are sort of a variation of an existing language.
So it'll be interesting to see if they're using sort of established paradigms or they're introducing new paradigms that sort of make delivering this easier for them, but may be difficult for users.
So without further ado, let's jump in and talk with Tristan.
Let's do it.
Tristan, welcome to the Data Stack Show.
So many things to talk about, and we really appreciate you taking the time.
Thanks so much for having me. It's a pleasure. Okay, so we're going to talk lots about ML and AI and hear about Continual, but let's start out by just having you tell us your background.
You're a two-time founder, so congratulations. That's a huge accomplishment. But what led you
to Continual, and what's your background? Yeah, well, that's a little bit of a long
story. Let me see how condensed I can make it. So I'm a statistician by training. I graduated
from grad school in the 2012 era when the rise of sort of the word data
science was happening, big data was happening. Of course, the cloud was sort of well underway at
that point. I was trying to figure out what to do next and had the entrepreneurial itch after
seeing what I perceived to be a missing product in the market around enabling data
science within the enterprise. And so in 2013, I founded a company
called Sense, which was really one of the first enterprise data science platforms out there.
It was targeting code-first data scientists, right? The rise of open source data science
tooling was well underway. The rise of the big data ecosystem, Hadoop, Spark, et cetera,
was well underway. And it felt like there needed to be a new statistical computing or data science platform
that was geared towards these users. And increasingly, it became clear, not only geared
towards those users, but actually also serve the needs of the enterprise to bring a team of those
users together, enable collaboration, enable operationalization, et cetera. So that was really
2013. We raised a seed round, grew that company basically to product market fit.
And then right before our series A ended up getting acquired by Cloudera, the big data
platform company, the leading provider of Hadoop. I spent three years at Cloudera, had an amazing time.
That product, Sense, that I had built became their Data Science Workbench product; I guess they call
it Cloudera Machine Learning now. Unsurprisingly, they realized that the pinnacle application on top of a data platform
really is AI ML doing predictions.
Sort of once you've stored the data, once you've processed it, once you've done some
basic analytics, you really want to go beyond analytics to predictive analytics or AI ML.
And there's a whole class of users that don't write Java, they write Python, and they wanted
to enable those users. And so I spent three very, very pleasurable years at Cloudera building out their data science
platform. And then the entrepreneurial itch started scratching again and decided to leave
Cloudera about two years ago to found Continual. And the reason I left Cloudera, basically the problem that I saw
at Cloudera really was there was tremendous buy-in for AI and ML to have a pervasive aspect across
large-scale enterprises or businesses. Every customer that I talked to, I was sort of in the
CTO for ML role at Cloudera, which was kind of partly outbound, partly inbound. Every customer
I talked to was sort of all bought into this idea
of like the AI first, AI centric enterprise. They could like rattle off a dozen use cases
or more. In a meeting, they would often show me those slides that often look like vendor slides
where there's tons and tons of use cases. But they were really all struggling to actually make that
vision a reality. And at Cloudera, we offered an incredible portfolio of different products and capabilities just by being a very broad platform. But what I was seeing was just
a lot of companies weren't succeeding. And the reason was just the sheer complexity of actually
moving AI and ML from the R&D phase into the operational and production phase, sort of this
continual, operational phase. So, you know, I founded Continual; unsurprisingly, the name is continual.ai. So it has this idea of
continual, you know, operational ML and AI. We have a very unique take on that, which I can talk
about. But yeah, that was sort of, that was the initial genesis. I love it. And I want to,
I want to circle back to why AI and ML are hard, because I think that's just a helpful topic to
discuss, especially from someone who's actually built tooling around it, because I think you get
to experience the problem in a unique way if you're actually building tools to solve for it.
But before we go there, can you just give us the brief high-level overview of what does Continual
do? Yeah, no, absolutely. So Continual
is, we like to say it's a continual AI ML platform that sits directly on cloud data warehouses like
Snowflake, Redshift, BigQuery, Azure Synapse. It enables anybody to build predictive models that
never stop learning from data. So it has this core recognition that the world is fundamentally changing,
that data is continually arriving, that predictions and models need to be
continually maintained. And so a typical application would be maintaining a customer
churn forecast, an inventory forecast, shipping arrival time, out of stock event, whether equipment
was going to fail. And we're just building a sort of a
fundamentally easier way to do that. And the way we accomplish that is
we kind of put the data warehouse at the center. So I can talk much more about that. We think that
data is increasingly flowing into the data warehouse. And that's the place where you should
build an experience and workflow around that. It will play well with all the rest of the ecosystem
and it will just sort of 10x simplify both the process of building machine learning,
but also equally important, the process of maintaining and iterating on those predictive
models that you have. Yeah, so I'm happy to go into more depth, but it's a platform that's
fundamentally declarative. Like SQL, we try to make AI this process where it's very data-centric. You're focused on what are the features and the input
signals to these predictive models? What are the things that you're trying to predict,
like customer churn? And there's no need. We don't think there needs to be in this
sort of modern data stack era, any Kubernetes, any containers, any Python pipelines,
80, 90, 95 even percent of AI ML use cases that I see
within the enterprise we think can be solved in this sort of dramatically simpler way. And yeah,
and that's what Continual is doing. I feel like in that two minute explanation,
you gave us enough fodder to do five or six podcast episodes. So much to talk about. So let's just first, could you just give us a quick,
because there's so many things. And I think next, maybe we can jump to sort of the warehouse being
the center of the stack and talking through like modern stack architecture, because I think that's
a really interesting subject. But before we do that, and I think a good way to sort of get there with context
is to talk about why AI and ML are hard. So you mentioned that you had some tools that you had
built at Cloudera, but you noticed that, and it's a really interesting thing, AI and ML are sort of
like, as Kostas and I will say, the marketing kind of leads the actual practical usage inside of orgs, where it's the promise of the future.
And it is like, we all believe that for sure.
And we know there's power there.
When the rubber meets the road, it's actually pretty hard to like operationalize it.
Why is that?
What are, could you just hit the top sort of couple points of like, what are the barriers
that block companies from actually making it a reality and driving value?
Yeah, I feel like there are a lot of people who are also missing the mark
in terms of solving the problem.
So there's some people that think, OK, well, what we need to do is we need a notebook that
can access compute resources in the cloud.
Right.
So they build a notebook.
I mean, it's a worthwhile, worthwhile tool.
Right.
You can launch a notebook in the cloud and get access to a GPU, right?
That might be solving one particular thing.
You might have people that are building like some easier way or a different interface to
actually train a model, right?
But training a model, it actually isn't that hard.
If you hire a data scientist who has some basic skill sets, calling scikit-learn
fit or XGBoost is typically not that hard.
But then there's people saying, okay, well, okay, the problem, and I think this is starting
to go down the right direction.
The problem is really productionization and operationalization.
Now, the naive answer to that is, okay, the answer is we'll put a predictive model inside
a container or something like that, and we'll have a model deployment platform.
But from my experience, all of those really miss the mark. If you ask why isn't AI and ML being successful in the business and why isn't
it actually being embedded in these business processes so that it can have this impact.
And the fundamental, fundamental problem there really is around the continual nature of ML,
right? So there's, there's very rarely a static model that you can deploy.
But even if there is a static model, the data that's feeding into that model is not static,
right?
So you have data continually coming in about your customers, about products that they're purchasing, about the inventory on your shelves, about the mileage, how much jiggle there is
in your aircraft engine that might have a maintenance issue.
And so all those things, the data coming in is changing,
even if the model is not changing. Now, almost certainly the model is also changing because the
world is fundamentally changing. And then if the model is changing or the data that's going into
that model is changing, the predictions are changing, right? So you need those updated
predictions. So in order to really embed an AI/ML experience or insight into a product or into an operational system within an
enterprise, you've got to think about that continual nature, right? And you've got to build
a workflow. So then the next step is, okay, we recognize that that's the problem. Well,
how do you solve that? And if you go and you look out and look at the canonical stack diagram,
right? Uber's Michelangelo platform; Uber has documented their internal
ML platform. And if you go and look at that, you see, wow, there's about seven different
distributed systems in this diagram, right? It's all of this crazy pipeline jungle. There's data
storage systems, there's training systems, there's monitoring systems. And then kind of
patching it all together is this crazy, what at least looks to me like spaghetti of DAGs, to manage all their training and inference and testing and performance monitoring, and all of that sort of thing.
And so I think you get to that and, increasingly, either you don't have the in-house capability to pull that off or the ROI ends up not being there.
It becomes so expensive to build and maintain these models that you say,
hey, let me go and work on other more pressing problems.
And so we think that that's a solvable problem.
We think that there's a way to sort of like,
just like in the Hadoop ecosystem,
which I'm familiar with,
we went from like the MapReduce era, right?
Where you wrote all this Java code to do basic analytics
to figure out how many customers churned.
And then now, of course, we just go and kick off, open up Snowflake and run a query there.
The same sort of thing can happen for ML, but it doesn't just need to be an easier interface. It
also needs to be this continual operational system. And that I think is the trick, right?
It's not just an easier interface; there are some people who kind of put a prediction statement inside of a
SQL statement. That's really not enough. That's not solving the core problem. You need to think of an easier way to build and maintain both the model and predictions.
Yeah, so that's my diagnosis, at least.
So Tristan, you mentioned the complexity that someone can see in this architecture with
all the different distributed systems and this spaghetti of pipelines, blah, blah, blah,
and all that stuff.
How does Continual simplify that?
Actually, there are two questions. One is like,
what kind of complexity Continual exposes to the user? And the other is what complexity is hidden
and how you manage to hide it, right? So can you say a little bit more about that? Because it's
super interesting. I always find it fascinating. I think one of the reasons that I love technology is that
it gives you this opportunity to build something very, very complex in terms of how it operates,
because that's how the world works, but hide it behind a lot of simplicity. And I think that's
very common what we see with technology. So I'd love to
hear more about how you do that. Yeah, no, I love that analogy. I think it is true that the history
of technology is in many ways like the history of the hierarchy of abstractions, right? All the way,
you know, you think about programming languages or, and all the way down to the hardware level,
at each layer, there's another abstraction that hopefully isn't leaking and therefore makes building on top of it dramatically simpler. But in terms of Continual
and ML, the way I think about it is: just step back and think for a moment, pause,
sort of don't look at all the technology. What is ML, right? What is a machine
learning model, a predictive model exactly? It's really nothing more than a function
that takes some inputs. So data, those inputs, you know, are often called features, right? So
signals or features, they could be about your customers, right? That could be, let's say that
you're doing a customer churn problem. Those things are like, well, has the customer used
the product? How much have they used it in the last seven days? So there's a set of inputs.
And then there's a target, and that target could be something that you're trying to predict. So that's, in this case, customer churn. Now,
customer churn could be a few different definitions of customer churn, 30 days, 90 days,
100 days. So you see how quickly once you go down this path, you have a lot of predictive models,
even if you only think of one use case. Then there's a function between those two things.
Now, increasingly, that function is a very, very complicated transformation
between inputs and the prediction tasks that you're doing. But if you think about the level
of abstraction that ideally you should be able to achieve is, hey, manage your input signals,
your features, and manage what you're trying to predict. What's inside that, the transform
function should really, well, it's going to be learned
by machine learning, but really you shouldn't have to think too much about it, right?
That's not something that feels like an essential complexity that you should have to manage.
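This model-as-a-learned-function framing can be sketched in a few lines of Python. Everything here is invented for illustration: a single activity feature, a churn label, and the crudest possible "learning" (picking a threshold). A real system would learn a far more complicated transform, but the shape, features in, prediction out, is the same.

```python
# A predictive model is just a learned function from features to a target.
# Hypothetical sketch: "learn" a churn predictor from one input feature
# (days active in the last week) by picking the best separating threshold.

def fit(examples):
    """examples: list of (active_days_last_7, churned) pairs."""
    best_t, best_acc = 0, 0.0
    for t in range(8):  # try every possible threshold
        acc = sum((a < t) == c for a, c in examples) / len(examples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    # the "model" is nothing more than this returned function
    def predict(active_days_last_7):
        return active_days_last_7 < best_t  # True -> predicted to churn
    return predict

model = fit([(7, False), (6, False), (5, False), (2, True), (1, True), (0, True)])
model(1)  # -> True (low recent activity, predicted to churn)
model(6)  # -> False
```

The point of the sketch is the abstraction boundary: the caller only ever manages the inputs and the target; how the transform is learned stays hidden inside `fit`.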
And then the second part of it, so that's the whole world of automated machine learning
kind of deals with that kind of that transform function and figuring out how to, okay, let's
go to compare a bunch of models and figure out the best models and the best architectures that will give us the
best predictive performance. There's a second dimension to that, which is the operational and
continual dimension, which is also what we focus on, which is saying, okay, well now if you're
going to operationalize this, you need a way to continually retrain and continually predict.
But that should just be policy, right? That's
how often do you want things to be retrained? How often do you want things to be predicted?
So what Continual does is it really gives you a workflow to, one, manage and collaborate around
all your features. So you do that with SQL. You say, hey, here's how I'm going to model
my business. Here's my customers. Here's the features on them. Here's my products. Here's
the features on those. Here's my stores. Here's the features on those, et cetera. Then you can manage your prediction targets. What are the
things you're trying to predict? And everything else is automated, right? The process of training
models and retraining models, comparing models, the process of maintaining the prediction.
We do that. We automate all of that. We try to distill that down to this essential complexity. Now, we bet on
that one way to do that is your data is in your data warehouse, right? You kind of need to say,
what's the level of abstraction below you, right? And what we've bet on, and I think this has been
an amazing enabler for us, is we bet that the future is the data warehouse. The future is SQL
from a data management perspective and data transformation perspective. Now, all we need
is an AI ML system that's
operational and plays well with that ecosystem and has a workflow that works for that ecosystem,
that user, et cetera. That's fascinating. So when we are talking about the use cases that
Continual is built for, we are talking about doing predictions and machine learning and AI using, like, pretty structured data, right?
Or do you see also like use cases?
We usually, when we think about ML and AI, the first thing that we think about is like
image recognition, right?
Computer vision.
Is this something that also can be part of Continual or the focus right now is mainly
on, like, structured and business data?
So that's a great question. And so we are
in the short, medium term, we are really focused on data that you typically see within an enterprise
that lives in a data warehouse. And that tends to be a structured data that does have a relational
dimension to it, right? So customers buy products, et cetera, and also very clearly has a temporal
dimension. So we have an abstraction that's sort of very tailored towards relational and temporal data and building both features and building predictions on top of that relational temporal data.
Now, the model though, and what's very exciting is the model is easily extensible to richer types of data. For instance, we already support text data, right? We can use text data as features.
So conceptually, if you think about computer vision, right? Conceptually, computer vision is
nothing more than an image type going into a function. And then let's say you're trying to do classification,
well, that would be a class or a category or Boolean or something as the output type.
So that level of abstraction still works. It can even be more sophisticated than that. It can even be like a video comes in on one side and a segmentation video comes out on the other side. And really, you can think of that as a function too. Now, not all of those use cases are happening in a data warehouse; that's probably not the dominant use case, certainly not the dominant use case we see. We do see a ton of text data
and the need to leverage and extract information from text data. And we do increasingly see image
data, right? So Snowflake, for instance, just announced support for unstructured information,
including images, text, PDFs, et cetera. And a lot of times you want to extract insights from that,
and then put those insights back into your data warehouse so that you can then query them. Right. So the data
warehouse, our belief is the data warehouse still is going to be the place where a lot of that,
a lot of that stuff happens. Now, if you're building an autonomous car, right. That's
processing a whole bunch of real-time streams. No, that's not going to be that architecture.
Yeah. Makes total sense. And like all these use cases with
IoT and, like, machine learning at the edge and all that stuff, they're like more
specialized. But yeah, that's super, super interesting. And if I understand correctly,
okay, and I'm coming more from the world of data engineering, not that much from the world of
machine learning. So I'm still learning about that. And the way that things work is that,
let's say we have a data warehouse.
So we push our data there, we collect, doesn't matter how we do it.
And from this raw data that we have, the next step is to go and create some features, right?
And from once we have done that, the next step is to feed these features into a model
and train a model.
Is this correct, first of all, or am I missing something?
Yeah, that's absolutely correct.
Although I would say, yeah, I mean, just don't forget the continual part.
The end goal that you're really trying to end up at is a continual process by which you both maintain that model, at least on some frequency, weekly, monthly, et cetera.
And it depends on if you're doing real-time or continual batch.
But let's say that you're doing customer churn or something like that, or inventory,
you're almost always continually maintaining that prediction, right? So it's not a one-off script.
Absolutely. Absolutely. Yeah. Yeah. I'm talking mainly about, let's say, the transformation of
the data. I'm not talking that much about the operations right now.
And I'm wondering, how do we go from the raw data to the features?
How users do that?
And let's get an example, a more concrete example.
Let's say the use case here is, like, churn.
So how would a user that is going to start implementing Continual today, let's say, and assuming
that they have all the data in their data warehouse, they can do the first step, which
is going from their own data to the features.
And what do these features look like, also? Because I hear the word feature a lot, feature stores,
like all the stuff around them, but at the end, what are these features, right? Yeah. So a feature is something that you believe,
given your business insights, your understanding as a human of the business,
that you think is going to be predictive of whatever you're trying to predict in this case,
churn, right? So a classic feature would be something like, in this case, let's
say you have clickstream data coming through RudderStack and into your data warehouse.
You might then want to say, well, I have a deep insight that activity over the last few
days, like let's say seven days, is very important.
And so you might want to embed that knowledge, basically your business knowledge, and you
would define that as a feature.
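A feature like that seven-day activity window is, in practice, often just a SQL aggregation over the raw event data. Here's a minimal sketch using Python's built-in sqlite3 in place of a warehouse; the table, column names, and the hard-coded "as of" date are all invented for illustration:

```python
import sqlite3

# Hypothetical raw clickstream table, as events might land in a warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (customer_id TEXT, event_date TEXT);
    INSERT INTO events VALUES
        ('c1', '2021-08-30'), ('c1', '2021-08-29'), ('c1', '2021-08-10'),
        ('c2', '2021-07-01');
""")

# The feature is nothing more than a windowed aggregation expressed in SQL.
feature_sql = """
    SELECT customer_id,
           COUNT(*) AS events_last_7_days
    FROM events
    WHERE event_date >= date('2021-09-01', '-7 days')
    GROUP BY customer_id
"""
features = dict(conn.execute(feature_sql).fetchall())
print(features)  # {'c1': 2} -- c2 had no activity in the window
```

In a real warehouse this query would typically live as a view or a dbt model, so downstream use cases can reuse the same definition.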
And you would really want to be able to reuse that feature across all of your downstream
use cases, right? So you don't just have customer churn. You also have something about the other products that they might want to buy. You have their LTV calculation, lifetime value. You might have net expansion and net contraction. So something that's maybe not a binary churn metric or upsell to the premium plan. So what we see is typically once you're in a, let's say you're
dealing with these sales and marketing use cases, you might start with churn, but very, very quickly.
I mean, if it becomes easy to build predictive models, very quickly, you go from one model
to a dozen in that very narrow domain, even putting aside all the other ones, other parts
of your business that you could impact and that you use the same features for those
downstream use cases. And so one of the benefits of a feature store is the ability to easily
reuse your features in multiple applications. Another aspect I think that is maybe less well
understood outside the feature store and ML community is the temporal nature of features.
If you're trying to predict something
like customer churn, it's critical. And let's say customer churn in the next month, right?
It's critical that you have the ability to go back in time and ask yourself, hey, what was that
feature a month ago, two months ago, three months ago? So that you can then look at the future
ground truth. How does a machine learning model learn?
It needs to see some examples of that actually happening, a customer churning.
And so the way you do that is you go to your historical data and you look back in time
and you say, okay, did the person with these characteristics a year ago then churn in the next 30 days? That is, by 11 months ago.
And so you need to, in your feature store, you need to make sure you define your features in a way that allows you to kind of have this time machine characteristic. Sometimes people
say it's called point in time correct or temporal join. You'd be able to go back in time and say,
I need to get that feature at that particular point in time. And so what Continual does is
it gives you a whole workflow to define those features and make sure that you organize them
properly. Make sure you attach them to the right entity or customers.
You make sure they have a time index appropriately.
Make sure when you train your models, you get the features for the right point in time.
You don't have data leakage.
And so that's all very important.
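The "time machine" requirement Tristan describes can be sketched as an as-of lookup: given a training timestamp, return the latest feature value recorded at or before it, never after. A toy version in plain Python (the data is invented; real feature stores do this as a point-in-time join at scale):

```python
from bisect import bisect_right

# customer -> [(timestamp, feature_value), ...], sorted by timestamp
history = {"c1": [(1, 3), (30, 0), (60, 5)]}

def feature_as_of(customer, t):
    """Point-in-time correct lookup: latest value at or before time t.
    Using a value recorded after t would leak future information
    into training (exactly the data leakage being warned about)."""
    points = history[customer]
    i = bisect_right([ts for ts, _ in points], t) - 1
    return None if i < 0 else points[i][1]

feature_as_of("c1", 45)  # -> 0 (value recorded at t=30; the later 5 is off-limits)
feature_as_of("c1", 0)   # -> None (nothing known yet at t=0)
```

Labeling works the same way in reverse: you look up the features as of some past date, then check whether churn actually happened in the window after that date.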
Now, we also bet that the way you should define those features to your question, well, how
do you actually do that?
How do you define a feature?
You should do that with SQL. Increasingly, SQL really is this incredibly
powerful lingua franca. It scales beautifully. It just, especially when you deal with, you know,
at scale, it just becomes this very, very powerful language. So that's how we think about that
process. So is it accurate to describe that one of the things that Continual offers to someone who has a data warehouse is to actually extend the data warehouse with a feature store?
Yes, exactly. So there are really three layers to Continual.
So one is there's a feature store, but that is a virtual feature store that is on top of your data warehouse.
Right. So we replicate no data into our system. We define essentially views and organize views
on top of your existing data
and give you a workflow for that.
You can also have native integration,
for instance, with DBT.
So if you're coming from the world of DBT,
the data build tool,
you can define those features and your targets
and pretty much your whole model using DBT.
So that's kind of at the core.
And that's why we say we're a data-first
platform for AI. We really think that's the most important thing. That's modeling your business.
And that's the most important thing where you really need to bring all your expertise to bear.
Above that, in terms of training models, we have this declarative AI engine. You can think of it
like an auto ML system that has this very flexible ability to pull in data, this temporal relational data, and make sort of state-of-the-art predictions over time.
And then the final thing we have is we have this continual ML operations aspect.
So we don't just train that model once.
It's not upload a CSV file and get a bunch of models.
It's really about maintaining both the model and the prediction and giving you visibility on top of it all. And that maybe sounds like a lot, but it really is not, because the only thing that you're actually
doing is you're really just defining your data, right? The rest of it's all kind of just happening
automatically, kind of on autopilot. And the end result is you basically get state-of-the-art,
continually improving predictions inside your data warehouse and with a workflow that makes it not
only easy to build
that, but also easy to maintain and also easy to iterate. And we think that that basically,
I mean, our goal is really like, imagine a company that has 500 models. What is the system that's
going to be able to do that? Putting aside whether it's Continual or somebody else.
In my view, it's got to be a high level declarative system. That's the only way to
manage 500 models. If you go and try to manage 500 models and the continual life cycle of 500
models using a whole bunch of custom Airflow DAGs that you write, where every single
one is a custom script maintained by a data scientist. I mean, that is just not a recipe.
That's not the future that I think is possible. It's maybe the status quo today, the way we do it today. But I think we all need to be striving for some sort
of higher level experience. If we really want AI and ML to become pervasive, there's got to be some
higher level experiences that we invent as technologists. Yeah. And just like to make it a
little bit more clear, when you are talking about this declarative language, you're talking about an approach to operationalizing models similar to what Terraform, for example, has done for cloud infrastructure.
Is this correct?
Yeah, that's a fantastic example, you know, analogy.
So if you think about managing cloud infrastructure, you manage that now with a declarative approach.
You manage it using Terraform. If you think about managing containers,
right, you manage it by using Kubernetes likely, and you define declarative, here's what I want
to happen. And then Kubernetes goes and makes it happen. And if a thing fails and machines fail,
it fixes those problems, right? And it maintains the number of replicas that you want.
And so, yes, we think that, you know, our experience is
very much tailored to that. You have this configuration, you can push into the system,
we go make it happen. You can do that in a UI, you can actually do it in version control,
just like you would with Terraform or Kubernetes, you know, manifests, you can do it like that.
The second element, though, I would say, is this data element. So it's not just a bunch of YAML; there's also SQL there, in order to define your input features and your output targets, and you do that using the language of SQL. And that allows the whole system to become declarative. On one hand, SQL is itself a declarative language, so we have a declarative language for the necessary data manipulation that you need to organize and model your business. And then the continual operations aspect is declarative as well. The best analogy, right?
It's exactly like Terraform.
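Tristan's Terraform and Kubernetes analogy boils down to a reconcile loop over declared state. The following is a minimal, hypothetical sketch in Python of what that could look like; every name, field, and query here is invented for illustration and is not Continual's actual configuration format or API.

```python
# Hypothetical sketch of a declarative model spec plus a reconcile loop,
# in the spirit of Terraform/Kubernetes. All names and fields are invented.

# Desired state: what the user declares (in practice, YAML in version control).
desired = {
    "model": "customer_churn",
    "entity": "user_id",
    "target_query": "SELECT user_id, churned_30d FROM churn_labels",
    "feature_query": "SELECT user_id, plan, sessions_7d FROM user_features",
    "refresh": "daily",
}

# Current state: what the system last materialized (the feature query is stale).
current = dict(desired, feature_query="SELECT user_id, plan FROM user_features")

def diff(desired: dict, current: dict) -> dict:
    """Return the keys whose declared value differs from the current state."""
    return {k: v for k, v in desired.items() if current.get(k) != v}

def reconcile(desired: dict, current: dict) -> list:
    """Translate the diff into actions, the way a controller would."""
    changes = diff(desired, current)
    actions = []
    if "feature_query" in changes or "target_query" in changes:
        actions.append("retrain")      # the data definition changed
    if "refresh" in changes:
        actions.append("reschedule")   # the maintenance policy changed
    return actions

print(reconcile(desired, current))  # a changed feature query triggers a retrain
```

The point of the pattern is that the user only edits the declared state; the controller figures out what work is needed, exactly as Kubernetes does for replicas.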
That's super interesting.
A quick question: you keep mentioning SQL, how important it is, and how much you are betting on it.
Do you see some kind of limitations in the expressivity that SQL has in terms of creating features or in the ergonomics of the language?
The reason that I'm asking is because a very good example that we have seen in this space is dbt, right? It was a project that came to life exactly because of the limitations, more around the ergonomics, that SQL has, right? And dbt came and brought into the game all these best practices and all these nice tools that engineers used to have, and brought them into data. So do you see any kind of limitations with SQL? And if yes, how do you see that we can overcome
this? Yes. I mean, you absolutely need to combine SQL with a workflow around SQL, for instance dbt, right? If you just have a bunch of shell scripts lying around with SQL statements in them, that's not going to be a great way to manage your data. But if you're trying to model your business and your data is already in the data warehouse,
you should really use the power of SQL. And I think as you embrace that philosophy more and more, you realize how far it will go. For machine learning, there are things where the ergonomics matter; I come from a Python background, I lived in Python and R and all of those tools. There are instances where you kind of think, okay, that might be a little bit easier to express in, or to wrap up in, a Python syntax. But increasingly, I just don't see that. I think the data that's coming into models is more and more raw, with respect to machine learning models. It used to be that
you needed to do a tremendous amount of feature engineering. In our system, you can still do feature engineering to bring your business insight to bear, but increasingly the model itself is doing, internal to it, some degree of feature engineering, which is
really just part of the model. So for instance, if you look at the history of computer vision, and even tabular data, increasingly you can push raw data into those models, raw images, raw tabular data, with very minimal preprocessing.
And then some of the complicated feature engineering that's very ML specific
and maybe SQL is not as well suited for, that can happen sort of internal to the model.
And I think that type of feature engineering really doesn't need to be exposed to the end user, right?
So if you think about the type of features
where the business user or the user
needs to bring their own insights to bear,
I have a hard time thinking of where SQL, you know, has let me down in terms of that.
It makes total sense.
I mean, and I think, again, as I said, I'm not coming from ML, but a big part of the success of deep learning is exactly that: the model itself generates optimal features that help build better models in the end.
Because I remember back in the beginning of the 2000s, when we didn't have deep learning yet, most of the computer vision papers that you would see getting published were about what kind of features we can create so that some very specific niche use case of computer vision could be tackled a little bit better. And I think part of the revolution with deep learning is exactly that.
Yeah, absolutely. I mean, you can't write an edge detector on an image in SQL.
I grant you that.
But that's not what you need to do anymore, right?
For the state-of-the-art models, you just need to pass in a raw image.
And increasingly, you might even be able to say something like a question on that image,
like how many cars are there in this image, right?
So you might have this whole area of visual question answering.
So even something kind of as wild as that, if you think about from a data perspective,
it's really no more than an image coming in, and a column with a question, and a column with an
answer. And that's just mind-blowing to me. It's almost amazing that that's possible. And I think the overwhelming trajectory is towards that. A lot of models don't even need data, increasingly.
So if you look at what's happening with, for instance, OpenAI and their GPT-3 type of work,
and all, of course, speech recognition in many parts of the domain, you actually don't even
need to bring any data to bear. There's no model training; it's an API. But within the enterprise, some people ask me, okay, is it all just going to move to this, everything's an API? The answer there is clearly no, because for customer churn within your business, you have to look at your historical churn patterns. There's no way you can just
predict customer churn given a user's demographics. If you don't have some history there,
same thing with inventory forecasting, predictive maintenance use cases. There's a set of use cases
where fundamentally they're data-driven; they're driven off the data of your business. And so what we're doing is really trying to provide the easiest experience for those types of use cases.
I keep saying that there are data problems where, how to say that, the business context is very important. You cannot take the churn model that can predict what is happening at DoorDash and just use it in Continual, right? It just doesn't work, because every company is dealing with a completely different view of the world.
Yeah, no, absolutely.
I mean, I think that's actually why the data warehouse
is so powerful.
And in terms of a data strategy for companies,
I mean, a lot of times people say, well, isn't, for instance, all of AI and ML going to get verticalized? People have said that about BI as well, right?
So you have these sales and marketing use cases like churn
and you have inventory forecasting use cases.
But what I've seen is twofold.
One, even for those very standard use cases
that every business has,
the data is different, right? So of course, the signals that you're getting from all of your
different touch points, your websites, your products, I mean, all the things that are
sending you data, all of that data is very bespoke to your business, right? So Strava is very
different than DoorDash, right? But they might both still have churn at the end; they're both trying to predict, maintain, and reduce churn.
And even actually this one thing sort of surprised me,
even as I've worked with more and more companies,
even the definition of churn is very bespoke.
So I just was chatting with a company that had Stripe data, so you think, okay, very standard. But in their world, churn was defined as the customer has to be 30 days out: they could cancel their account due to a bad credit card, but if it's just 15 days and then they manage to put in a new credit card, it's not churn. And the beautiful thing about the data warehouse is that it gives the data professional, the data scientist, the power to model their business in the ways that are unique. It has that flexibility, but tries to hide everything else.
Yeah, and I have a hard time seeing how for many, many use cases you can get rid of that.
And so, you know, our bet is that that level of complexity, the ability to model your business
and the need to model your business is going to persist for most sophisticated data-driven
companies.
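The bespoke churn definition Tristan describes can be expressed directly in SQL. Below is a minimal sketch using SQLite for portability (in practice this would run in the warehouse); the table, columns, and the exact 30-day threshold are assumptions taken from his anecdote, not a real schema.

```python
# A sketch of a bespoke churn definition: a cancellation only counts as churn
# if the account stayed inactive for 30+ days (a new card within 15 days is
# not churn). Table and column names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscriptions (
    user_id TEXT,
    canceled_at TEXT,      -- card failed / account canceled
    reactivated_at TEXT    -- NULL if the user never came back
);
INSERT INTO subscriptions VALUES
    ('a', '2021-01-01', '2021-01-10'),  -- back after 9 days: not churn
    ('b', '2021-01-01', '2021-03-01'),  -- back after 59 days: churn
    ('c', '2021-01-01', NULL);          -- never came back: churn
""")

rows = conn.execute("""
SELECT user_id,
       CASE
         WHEN reactivated_at IS NULL THEN 1
         WHEN julianday(reactivated_at) - julianday(canceled_at) >= 30 THEN 1
         ELSE 0
       END AS churned
FROM subscriptions
ORDER BY user_id
""").fetchall()

print(rows)  # [('a', 0), ('b', 1), ('c', 1)]
```

Because the definition is just a query, changing the business rule (say, 30 days to 45) is a one-line edit rather than a new pipeline.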
Absolutely.
Yeah, I totally agree with that.
I have a little bit of a more technical question to ask you.
I was checking and going through the architectures
of different feature stores.
And one of the things that I've seen,
which pretty much exists in every feature store architecture,
is a caching layer to serve the features.
And the reason to do that is because low latency in some use cases is extremely important, right?
Now, data warehouses on the other side, and OLAP in general, were built with a completely different perception of what time is, right? In the past, for example, the data warehouse was built in a way that it could run queries
for hours or even days, right?
So latency is a completely different thing when we are talking about data warehouses
and then compared to transactional databases or caching layers.
So how do you deal with that when you get a data warehouse centric approach?
Yeah, no, that's a great, great question because you're right. You're absolutely right. I think
there's widespread recognition that a feature store or something called the feature store should
be at the center of your data, your ML strategy. And in part, because we see that, hey, that's
one of the most important bits and also where a lot of complexity can come in.
The way I think about it, there are really three parts. So just stepping back: what is a feature store? Maybe not everybody in the audience knows the term. What exactly is it? I think it needs to offer three things. The first is collaboration, sort of define it once, right?
Sharing of the definitions of
features across your business, right? You should not have data scientists duplicating feature
definitions. You should have the features properly governed. That's probably the easiest one to
solve, right? And you could probably solve some of that with your existing tools, right? Like by following dbt best practices, you might, you know, have a virtual feature store. The second one is
what's called
point in time correctness, which is this idea of a time machine. You really, for training purposes,
need to be able to go back in time and reproduce a feature at any point in time, or at least at
regular intervals that you're going to train on. And you actually need to be able to do that on a per-row basis; for every user, you need to potentially go back a different amount of time. So it's not just Snowflake's or Databricks' time machine kind of backup functionality. You
actually need to be able to do sort of a temporal join where you get the features at a particular
moment in time. You need to do that to construct your training data set so that you can then
forecast churn into the future without any data leakage. And the third one, which is what you're pointing out, and this is only applicable for real-time serving use cases that can't be prematerialized, need to be done on the real-time path, and cannot be passed in from the client, is that you need a way to serve features with low latency.
So if you have somebody coming in, let's just say you're doing a search personalization
type application, you have somebody who is typing in a search query, that search query comes in,
you find the relevant records, then typically you want to re-rank it based on maybe the previous
click stream of that user and what they've been doing and their history of actions. And you have
a set of features that you need to very, very quickly look up and say, okay, what have they
most recently looked at? Did they click on those things? Whatever it is, you need to use those features. And for that, typically today, you need a caching layer on top of a database.
And so a lot of the feature store work, if you look at what some of the open source feature
stores are doing, it's really about trying to find an architecture for that caching.
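The point-in-time correctness requirement from earlier, the "time machine," is essentially a temporal join. Here is a minimal sketch, again using SQLite purely for illustration with invented table names: for each training row, we join the most recent feature value recorded at or before that row's timestamp, so no future data leaks into the training set.

```python
# A sketch of a point-in-time (temporal) join for building a training set.
# For each training example we take the latest feature value whose timestamp
# is at or before the example's as-of time. Table names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE features (user_id TEXT, ts TEXT, sessions_7d INTEGER);
INSERT INTO features VALUES
    ('a', '2021-01-01', 3),
    ('a', '2021-02-01', 9);   -- must NOT leak into a January training row

CREATE TABLE training_examples (user_id TEXT, as_of TEXT, churned INTEGER);
INSERT INTO training_examples VALUES
    ('a', '2021-01-15', 0),
    ('a', '2021-02-15', 1);
""")

rows = conn.execute("""
SELECT t.user_id, t.as_of, f.sessions_7d, t.churned
FROM training_examples t
JOIN features f
  ON f.user_id = t.user_id
 AND f.ts = (SELECT MAX(ts) FROM features
             WHERE user_id = t.user_id AND ts <= t.as_of)
ORDER BY t.as_of
""").fetchall()

print(rows)  # the 2021-01-15 row sees sessions_7d = 3, not the later value 9
```

This is the per-row "go back a different amount of time" behavior Tristan distinguishes from a whole-database backup snapshot.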
It's interesting.
We actually started looking at that very closely when we founded Continual.
And what we saw was, first of all, data is in the data warehouse and people want to leave it there to the degree possible. And second, that
a huge number of use cases can be solved with this sort of continual batch mindset. It's a tremendous
simplifying approach in terms of your architecture. And third, we have a bunch of ideas around how to do that cache if you want to do the real-time use cases, but we're waiting a little bit. It'll be interesting to see where the data warehouses themselves go.
Some of the cloud platforms are building some capabilities indirectly. There's emerging data
stores, things like Materialize, which have certain ability to do that. Obviously, the real-time
databases, things like Rockset. I know Snowflake, for instance, is very focused on high concurrency,
low latency. So it'll be interesting to see how that converges. That's definitely an open question. The architectural complexity that emerges from trying to maintain consistency between these environments, when in some ways you'd like to just express a SQL statement and have it taken care of for you, seems to me something that's going to be eliminated over time.
So I think it's a very interesting question. It's unclear where it will go exactly, in terms of will there be a dual-write system with the cache, or will we converge towards new functionalities built directly into the data warehouse, or will there even be tailor-made data stores that have this characteristic of historical data stores.
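The dual-write-versus-cache question can be made concrete with a toy read-through cache in front of a slow store. This is purely an illustrative sketch; the class, the function names, and the in-memory "warehouse" stand-in are all invented, and real systems would use something like Redis rather than a Python dict.

```python
# Illustrative read-through feature cache in front of a slow store.
# All names are invented; this is a toy, not any real serving system.
import time

# Pretend warehouse lookup: correct but slow (hundreds of ms in reality).
WAREHOUSE = {"user:a": {"sessions_7d": 3, "last_click": "shoes"}}

def warehouse_lookup(key: str) -> dict:
    return WAREHOUSE[key]

class FeatureCache:
    """Tiny TTL cache illustrating the low-latency serving layer."""
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)

    def get_features(self, key: str) -> dict:
        now = time.monotonic()
        hit = self.store.get(key)
        if hit and hit[0] > now:
            return hit[1]                      # fast path: cache hit
        value = warehouse_lookup(key)          # slow path: go to the warehouse
        self.store[key] = (now + self.ttl, value)
        return value

cache = FeatureCache()
print(cache.get_features("user:a"))  # first call fills the cache
print(cache.get_features("user:a"))  # second call is served from the cache
```

The consistency problem Tristan describes is visible even here: until the TTL expires, the cache can serve a value the warehouse has since updated, which is exactly the complexity he expects to be absorbed into the platform over time.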
It is super interesting. We actually talked with Materialize recently on the show,
fascinating conversation, super smart team over there. And then a favorite topic of ours is what are the warehouses building? And they've already advanced in so many different ways,
but they're building in things that are going to make this really interesting.
But speaking of the warehouse, Tristan, and we can land the plane on this question because
we're coming up on time, but you've mentioned multiple times, and this is a really interesting
topic, I think, in general, but the importance of the data warehouse in the context of the
modern stack.
So zooming the lens out from sort of the specifics, could you just tell us why? I mean, you're betting on the
warehouse being a central component of the modern way that companies are architecting their data
stacks and all of these different tools. Why are you doing that? And then why do you think the time
for that is now? It seems like there's a
new crop of companies that are sort of making this bet. And why is that? Why do you think
that's happening at this point in time? Yeah, so let's say it's twofold. The first is that I've spent a decade experiencing data infrastructure, data engineering infrastructure, machine learning infrastructure. I mean, the number one problem I see is complexity,
right? These stacks just get incredibly complicated to manage, move data between,
and particularly with any velocity from a developer perspective. Maybe it's a death by a thousand paper cuts. This is sort of the historical era: I mean Hadoop; the knock on the Hadoop ecosystem is complexity.
But even putting aside Hadoop, if you use all the raw building blocks of a cloud vendor like AWS, it gets very, very tricky.
And the complexity gets very hard,
which makes it very costly to build new use cases,
very costly to maintain them
and to iterate on them and build new functionality. And then it just compounds over time. And so I think the big thing about the data warehouse, the first thing
is by putting the data warehouse at the center, by betting on a cloud-managed data warehouse that's elastic, that offers workload isolation. So you can have your data scientists going crazy in one isolated cluster on the same shared data. It's an incredibly
liberating experience if you've experienced the alternative. If you've experienced the complexity
of shared compute, of multiple disparate systems where you're moving between them,
of multiple different languages, you're moving from MapReduce to SQL to Parquet files to Python
to all of that. The big data ecosystem is an incredibly powerful model, but the data warehouse is much, much simpler. And as I've matured, or just as I've experienced what happens when you deal with too much
complexity, I have a natural affinity towards the simplicity of the data warehouse and the power of
the data warehouse. That's the number one. The second one, which is more towards the ecosystem, there needs to be some common
foundation by which products can integrate and an ecosystem can develop. And the data warehouse, I think, is now emerging as an amazing point where different products can collaborate in a
kind of turnkey way while still allowing the flexibility that you want.
So for instance, ingestion: you have RudderStack, you have Fivetran, you have that whole community that's making what previously were these crazy Airflow DAGs, from Salesforce into your data warehouse or from your logs into your data warehouse, completely turnkey, and now it's landing in the data warehouse. Of course, you can do transformation with dbt. You can do data monitoring with Bigeye and Soda. And I'm sure I'm forgetting some. And you can,
of course, have your BI tools. There's a whole new class of BI tools that thank God are actually
running the analytics inside the data warehouse, so there's no data movement. So you can build
all your reporting off of that. Increasingly, again, using tools like RudderStack or Census, Hightouch, et cetera, you can move the data out of the data warehouse and actually make it actionable, kind of weaponize it, so it can actually have an impact on your
business. And all of these things are doing it in this completely turnkey way. I was seeing that whole stack emerging, and the genesis of Continual was really saying, wow, there's not really a way to do operational ML. You want to do these
predictions on top of that stack in that ecosystem. And please don't tell me I'm going to kick up a Kubernetes cluster
to do that if I've embraced that stack. And so, yeah, I really, you know, I'm tremendously excited
by the ability to, you know, drive down complexity, to simplify things. I think if we really want
data and AI and ML to be central and pervasive across the enterprise, you've got to have a simple, productive, low-cost stack from a manpower perspective, if you really want it to become pervasive. So that's why I'm bullish. Yeah, I couldn't agree more.
And I think there are a lot of situations or sort of use cases for activating data where you had ecosystems of tools crop up in order to do those things,
to your point, in a fairly complex way. And then five years later, the warehouse technology is
advanced enough to where it's like, oh, well, actually, the most elegant solution was there
all along. It was your data warehouse, right? And just
the ecosystem around it of pipelines to your point, and really the data warehouse technology
itself hadn't quite gotten to the point where it was elegant, but now it really is. I love the way
that you described it as number one, simple, and then number two, the need for a central hub. And it's such an obvious choice for both of those points. Yeah. And it's always a journey, right? I mean, I think most technologies,
you start complicated, you start low level, then some patterns and abstractions emerge,
and then people build fundamentally new and easier experiences. And I think there's a
perpetual search for that. And yeah, I mean, Hadoop and that ecosystem is incredibly powerful.
It's an incredibly powerful technology.
If you look at how Facebook runs, a lot of it's on that sort of that technology.
But if you need all that power, all that flexibility and the ability to dive into the code yourself,
that's a great ecosystem to buy into.
But also, I think even the Hadoop ecosystem bet on SQL, right? Very quickly in the rise of Hadoop and the big data ecosystem, Hive became a thing, which was the big data query language. And then it was quickly, well, how do we get faster querying? There were various projects trying to do faster querying. And then increasingly it was segmentation of the compute. And then of course the rise of the cloud kind of disrupted a lot of that architecture. So yeah, there's a journey there. I think we're right now at this
moment where there's this convergence on the modern data stack. I really think over the next
five years, a huge amount of innovation is going to happen there. And if you buy into that ecosystem, you're going to be able to free ride on all that innovation that's happening in a million different corners. Yeah, absolutely. One of our previous guests described being able to quickly derive value out of ML
as the next phase of the data stack, whereas analytics is sort of maturing to the point where,
I mean, as simple as it sounds like this sort of self-serve analytics across the organization,
still a lot of companies haven't figured out, but the technology is now there to where there are known playbooks for how
to do that. And ML is going to be the next phase of that. And I really think in a lot of ways,
that's true because once you have the data clean enough to produce really powerful analytics,
then it's like, okay, well, great. Now let's really turn the heat up and start optimizing
the business in some interesting
ways with this data that's really well-suited for machine learning use cases.
Yeah, I mean, absolutely.
I mean, that's the thesis for the company.
I need to talk to this person who you talk to.
We'll have to recruit them.
Yeah, I think there's a classic pyramid where if you look at it, the sort of AI and ML is
at the top.
I think there's still some other things that are going to come that still need to happen.
We still need to push into the application domain and make sure that we can handle not just the backend operations of the business, but the applications themselves. But I'm very bullish on this path. And I've seen
the simplicity of it now. It kind of makes me very excited. Very cool. Well, we are at time,
but really quickly, Tristan, is there a way, I mean, just hearing about Continual, I keep thinking this is really cool. I want to see it in action. Is there a way for our listeners to check it out and try it? What's the process like there? Absolutely. We're in early access now; we launched about a month or two ago. So you can go to continual.ai and you can learn a bunch more about Continual.
If you type in your email there, we absolutely will reach out to you within 24 hours and set up
a demo. So we can give you a demo. We're taking early access customers. We typically do a demo
and then onboard folks to try it out. We're hoping to get out something in terms of general
availability soon.
So stay tuned for that.
But yeah, I look forward to hearing
from anybody who's interested.
I think we have a little last plug here.
Yeah, cool.
And just to confirm,
all the major warehouses, right?
So sort of-
All the major cloud warehouses, yeah.
All the major cloud data warehouses.
Awesome.
Very cool.
Well, definitely encourage the audience
to give it a try.
Really cool product. And Tristan, we really thank you again for the time. This has been
an awesome conversation and we'd love to have you back on the show sometime soon.
Absolutely. My pleasure. Thanks so much.
Well, I think my first big takeaway is that you and Tristan are incredibly smart people,
and it was really fun to hear you dig into the tech.
But my second one was his excitement about the data warehouse, which has really been a continual theme, I think, throughout.
Actually, it was a big theme last season.
We've heard it in the last couple of shows about how the cloud data warehouse is just
enabling so many different things, which when Redshift first came out, I don't think anyone would have,
I mean, I'm sure there were very future looking people
who sort of imagined this world where everything's connected
around the warehouse, but I don't think a lot of people imagined the sort of things that we're talking about as far as Continual goes.
And so that's just really cool.
And I can't wait to see how that innovation continues to unfold.
Absolutely.
I don't think that the data warehouse we will be talking about a couple of years from now is going to look very similar to what Redshift was when it started in 2012, for example.
I found it very interesting, for example, that Tristan mentioned at some point about Snowflake supporting more unstructured
types of data like images and free text.
And yeah, the data warehouse becomes like a much broader concept, right?
It's more like a data platform in general, and it's going to fuel many different use
cases.
And of course, one of the most important ones, from what it seems, is going to be built around AI and ML.
So yeah, it's very fascinating to see
what people like Tristan do
and the stuff that they are building
and how they are enhancing the data warehouse
with non-traditional data warehousing capabilities
like machine learning.
And I'm really looking forward to seeing, in a couple of months from now, what the product is going to look like.
Super interesting for me, very engaging.
I mean, that's something that we see especially with founders: they're very passionate about the products and the technologies they build, and it's always a lot of fun to discuss with them.
Yeah, absolutely.
Well, thanks again for joining us on the Data Stack Show.
Great set of shows lined up over the next couple of weeks.
So make sure to stay tuned
and we will catch you on the next one.
We hope you enjoyed this episode
of the Data Stack Show.
Be sure to subscribe
on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds,
at eric at datastackshow.com. That's E-R-I-C
at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.