The Data Stack Show - 140: Stream Processing for Machine Learning with Davor Bonaci of DataStax
Episode Date: May 31, 2023

Highlights from this week's conversation include:
- Davor's journey from Google and what he was building there (3:32)
- How work in stream processing changed Davor's journey (5:10)
- Analytical predictive models and infrastructure (9:39)
- How Kaskada serves as a recommendation engine with data (14:05)
- Kaskada's user experience as an event processing platform (20:06)
- Enhancing typical feature store architecture to achieve better results (23:34)
- What is needed to improve stream and batch processes (27:39)
- Using another syntax instead of SQL (36:44)
- DataStax acquiring Kaskada and what will come from that merger (40:24)
- Operationalizing and democratizing ML (47:54)
- Final thoughts and takeaways (56:04)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show.
Kostas, we have another exciting one today.
We're going to talk about stream processing. We have actually talked about this subject a good bit on the show, but this is pretty interesting because Davor from Kaskada, which was recently
acquired by DataStax, we'll talk about that a little bit, built a technology that's really focused on stream processing specifically for ML use cases
and kind of closing the gap between the actual sort of building of insights and features
and then actually serving those.
And it's pretty fascinating. What I'm really interested in is what they saw or sort of maybe the lack in the market
that they noticed that caused them to want to build something new, in large part because
you have a lot of really good, really high power stream processing tools.
You have things like feature stores.
You have all sorts of interesting low latency ways to serve stuff.
The pieces are there in order to sort of actually build
and deliver cool stuff, you know, even from a stream.
But obviously it wasn't sufficient.
So that's what I want to ask about.
How about you?
Yeah, I think it's a very interesting case.
I mean, Kaskada is a very interesting case because it is a stream processing engine,
but it emerged as the solution to a problem that is very use case specific, and it has to do with machine learning.
So it's going to be very interesting.
And one of the things that I want to talk about is how we go from something like a feature
store, which is supposed to be one of the possible solutions out there to ML problems,
to something like Kaskada, right?
What are the differences?
Why do we need something that is, let's say, more unified in a way
in terms of both like the technology,
but also like the experience of the user that uses the solution.
So that's like one of the things that I'm very
interested to discuss.
And yeah, learn about the journey, and also about, you know, getting acquired by DataStax.
Not so much about the acquisition itself, but more about the vision. Like how
something that is...
If you think about it, it's actually quite interesting.
DataStax is based on Apache Cassandra, which is like a 10-year-old technology, right?
And then you have something that is super, let's say, new in terms of the requirement
and the need.
And even the technologies it's built on.
Yep.
And it's very interesting to see how these things come together and why,
and what's the potential output of this, right?
So it will be very interesting to discuss about all these things.
All right.
Well, let's dig in and talk with Davor.
Davor, welcome to the Data Stack Show. Hi, great to be with you.
All right. Well, we have some exciting things to talk about. You've had quite the journey
over the last couple of months in terms of acquisitions, open sourcing stuff,
which is all really cool. Let's look back a little bit in history, though, because this isn't the first time you've open
sourced technology related to streaming, which is kind of cool that you've been able to do that a
couple times now.
But you were at Google.
And can you tell us a little bit about Google and what you were doing there and then what you
built an open source there?
Oh, yeah.
So when I was at Google, it was like the early days of Google Cloud, and we were building a unified programming model for batch and stream processing that ultimately resulted in the Apache Beam project.
It was a quite successful project, with a relatively large number of companies around the world using it and contributing to it.
And then a few years later, my co-founder and I left Google to start a company in a similar space.
That led to the founding of Kaskada, which really tried to nail the problem of building predictive behavioral machine learning models from event-based data.
Obviously, we were working on this problem for quite a while, and that resulted in an acquisition by DataStax about three months ago. And yeah, happy to be with you talking about all of this journey across, you know, from Google to Kaskada to DataStax and everything in
between. Sure. You've had such a focus on streaming. I'm interested to know, I mean,
you obviously are, you know, have been a professional software engineer and working
with data for quite some time now. Did you have an interest in stream processing or is that
something that you found at Google
and started to work on at Google?
I started working at Google.
I was not looking at stream processing pre, I guess, 2013.
Yeah.
And basically started around late 2013 looking into it.
And now I guess this year would be a decade
that I am looking at this
problem. Yeah. A decade of streaming. Well, I mean, I guess it's interesting to think back on
2013. I mean, there were infrastructure limitations that made certain stream processing
things pretty difficult or at a minimum, like pretty expensive. So can you tell us, so you get into stream processing at Google,
you work on Beam, you open source Beam, but Beam and then, you know, of course,
there are a number of other technologies out there around stream processing, you know, even
within Apache, but those weren't sufficient for what you wanted to do. So why build something new when at
that point, you know, there were multiple major players and multiple different architectures
running pretty large organizations at scale for stream processing use cases?
Yeah. So when we started looking at the problem of machine learning, we discovered that neither batch solutions, nor streaming solutions, nor, you know, Beam solve this problem well, right? And so if you start thinking about building behavioral
machine learning, right? So think about these are kind of recommendation engines,
churn prediction models, right? Something about predicting the future, future action,
future interest based on what has happened in the past, right?
Like when you look at the nature of that problem,
you have to process historical data, observe feature values,
generate training examples at the right points in time
to be able to train the model.
That problem looks more like analytics, looks more like batch,
looks more like historical data processing. And then you have this kind of inference problem
where you want to take real-time data and give it the most recent feature vector
to give it to the model and then produce a real-time prediction. And so when you look at
that problem, it's not well solved by batch because you have too big of a latency. It's not well solved by a streaming system because it's very hard to get this kind of historical component on top of it. And so we made the conclusion that, fundamentally, existing systems are not well built for that.
Obviously, other people around the same time have been looking at the same problems, and they found ways of hacking certain things together to solve them.
And from that work, feature stores or common feature stores came to be.
They tried to create an online store and offline store.
It's really kind of a divergent architecture to try to solve these different use cases
on top of the same data.
And we are more system builders than, you know, people hacking things on top of systems.
So we took the problem really deep and then designed the system, you know, that's really
built for the problem at hand.
And the problem at hand we see is easily connecting to the data, describing features in an easy
way where you can iterate in a place like a notebook, test hypotheses, test a lot of features very quickly, that gives you immediate backfill analysis of features at any point in time.
And once you train the model,
like really with a click of a button
or checking features as code into production,
you can compute and serve those features with low latency,
right, all from the same system
that is purpose-built for this problem.
And that's kind of how Kaskada was born.
And, you know, we found some funding for it.
We found a team for it and the team built the product.
And then we took it to market.
And, you know, I guess the rest is history.
Very cool.
Can we actually talk about, you sort of mentioned that there's, you know, you have these sort
of two separate problems, right?
And that there's sort of an analytics type use case, which is looking historically, and
then you have the actual sort of ingestion of the real-time data that allows you to sort
of feed the model and actually create an experience, right?
Like a hotel recommendation or, you know, a product recommendation or
something. So can you describe the way that you saw that materialize in terms of both infrastructure
and teams? Were there different teams working on those separate problems? You know, like,
because a lot of times you'll see data science working on the model and, you know, sort of more of the like analytics predictive piece. And then it's a pretty heavy engineering problem to actually
like grab the feature and then like, you know, it needs to be served in a website or app.
Can you describe the common patterns around that breakdown and how people sort of hack that
together? Yeah, absolutely. So we think that there are two fundamental problems. Problem number one is really finding predictive signal inside of your data. And that is very company-specific: data coming from your app, right? Think of it, clickstream, tap stream, or, you know,
engagement information coming from the app. That's a lot of data and it's relatively hard
to find what is really predictive signal that tells you what the user might be interested in,
whether they'll, you know, buy something, whether they will, you know, renew a subscription,
what they may be interested in, why they are
here, and so on.
That's the problem of finding quality predictive signal out of clickstream, event stream data.
That problem becomes harder the messier the data gets.
If you are getting it from multiple places, from multiple applications,
right? And schemas and other things evolve over time. So kind of figuring things out there tends to be more of a data modeling problem, an extraction of useful signal.
And we feel that's a key part of getting machine learning and AI right. There is a different problem,
it's a problem of scale.
And that is kind of: how can I, once I know the model, once I know what my features are, operate that model at scale with low latency and good unit economics?
And that problem gets harder the more scale you have. And those are two problems, and usually
two different people are best to solve two different problems. We've seen in the data
community a lot of talk recently or the last few years about the scale side of things. And I think
that's very warranted because it's a hard problem. And
people pushing the boundaries here tend to work at big companies, typically in the Bay Area that
have a really large scale, and then they start hitting these problems. And I think that's a
really hard and difficult problem to solve. But I just want to make sure that we don't forget that to get to
a really good AI, what most people should do is focus on extracting quality signal.
The better the signal is, the more predictive it is. It's easier to build a model. It's cheaper.
And it's actually doing work that is company-specific.
It's very leveraged work.
Whereas distributed systems, they are very common and horizontal and not specific to
a company that may be doing it.
So we often think about this infrastructure as being more horizontal, and it should probably be done in an open source
community with other people that can kind of jointly innovate on it.
And then companies really, we think, should focus on their quality signal from their data
because that's really leveraged for them.
That's a unique business value to them.
Makes total sense.
Okay, so walk us, let's do a breakdown of Kaskada with kind of maybe a sample company.
So let's say, you know, I'm a company that, you know, sells maybe it's retail products,
you know, online or something, you know, sort of large scale e-commerce.
I have multiple websites, maybe even multiple mobile apps. I'm probably ingesting some sort of log data from my production databases. And so I have multiple brands and I want to know if someone's purchased
these things from this brand, what other products from this other brand could I maybe cross-sell them on, right?
What does it look like for my company to implement Kaskada?
Like, you know, who are the people involved and how do we implement it?
Yeah.
So what you have described here,
if I can generalize a bit, is a recommendation engine.
Yeah. Right. And people have been looking at recommendation engines for a while. It's like one of the first use cases of machine learning.
And obviously in many industries, they have been successfully implemented. The interesting thing is, when you look at recommendation engines and the quality of them, it's quite interesting what you can find, right?
And so I'll start with a few examples here, right?
So let's say that today you buy a couch, right?
What is the chance that you're going to buy a couch next week?
Well, you know, basic recommendation model will conclude you bought a couch this week,
you might buy a couch next week.
But we both know that's not how it works.
Yeah.
Right.
And so there are some recommendation engines that, you know, fail in miserable ways like this, without understanding who you are and why you bought it.
Right. If you are a reseller of couches, sure, you know, more couches this week,
more couches next week. But if you're buying for your own home, if you bought the couch this week,
maybe you're interested in a coffee table, but not in another couch. Right. Like we have to
really understand who the customer is, why they are here to be able to provide good recommendations.
That's key, right?
Like sometimes recommendations are just totally off.
And if you search online, you'll find examples where people kind of laugh at the quality of these
things when they are not done well.
So when you think about how can I do this well,
it's about understanding motivation and driving signal from interaction on a digital platform
to understand why the person is here, right?
And so it's what they are searching,
not just what they are buying,
how frequently they are searching something, right?
And then being able to do this quickly to give them an in-session, personalized experience based on the reason why they are here today, for example.
I think that's key.
How you do that, you have to focus on the signal coming from their interactions with the app.
And in every case we looked at, we always find that we can separate somebody buying a couch for themselves and somebody who is a reseller of couches.
As long as you focus on their interactions on the site, it tends to look very different. And if you derive the signal out of the event-based data, then the model can latch onto it, learn, and give good predictions that separate one experience from another.
That's key.
And that's what we like to enable customers to do. And most often, once they use our technology, they find things that they had not known about their user base before they started. And that's what we consider success. Once you discover predictive things and segmentation of your users that was not clear before you started, that is success.
Then you are discovering something about your business,
about your users from your data,
and that makes the company better.
Makes total sense.
That's what we are all about.
Yeah.
So let's get practical for a second.
So if I'm a user of, if I'm implementing Kaskada, right?
Like I, you know, I get it, I get it set up, right?
And running.
And so are there just endpoints that I point my, you know, app and website and production
databases at?
Like it just, will it just ingest them no matter the schema?
Is it as simple as that?
Yeah, so we obviously want to load data from as many places as possible, right?
And we try to make that as easy as possible.
Obviously, we can't read it from everywhere, but we
can read it from common places that people store data. We typically suggest, for doing some early exploration, starting with Parquet-formatted files, structured data with a schema, stored in some cloud-storage-type place, perhaps managed by Iceberg or something like that. That's what we usually recommend.
But we can read it from plenty of places
usually with a few lines of code
just kind of specifying the location
and then we can read structured data relatively easily.
We do not shine on unstructured data today.
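To make that concrete, here is a minimal sketch of loading schema'd, Parquet-formatted events from cloud storage into a DataFrame for exploration, using pyarrow. The bucket path is a hypothetical example, and the actual Kaskada loading call will differ, so treat this as the general shape rather than the product's API.

```python
# Minimal sketch: read structured, Parquet-formatted event data from
# cloud storage for early exploration. The path is a hypothetical
# example; S3 access also assumes credentials are configured.
import pyarrow.dataset as ds

events = ds.dataset("s3://example-bucket/events/", format="parquet")

df = events.to_table().to_pandas()  # pull into a pandas DataFrame
print(df.dtypes)   # inspect the schema
print(df.head())   # peek at a few raw events
```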
Yep, makes total sense.
And then once the data makes it into Kaskada,
what's the user experience like? How am I trying to find the signal in the noise using Kaskada in the platform? Yeah. So first we tell people usually to
use the tools they like. Everything we do today is API first, right?
So you can open a Jupyter notebook,
IPython notebook, do one pip install.
That's one line of code.
Then you load the data from somewhere.
That's another line of code.
And then after that, you can build features,
test features, and use all the machine learning libraries that you like,
right? Scikit-learn, PyTorch, whatever you like, we generally support. So think of our product as
API first, data frames in, data frames out, and you can connect it with all the tools that exist
in the machine learning ecosystem that obviously practitioners have learned to love over the last couple of years.
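As a rough sketch of that "data frames in, data frames out" loop, assuming the feature engine hands back a plain pandas DataFrame; the load_feature_frame helper and the column names are placeholders, not the actual Kaskada API.

```python
# Hedged sketch of the notebook workflow described above: features come
# back as a DataFrame and feed straight into scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def load_feature_frame() -> pd.DataFrame:
    # Placeholder: in practice this would be a query against the feature
    # engine returning one row per (entity, observation time).
    return pd.read_parquet("training_examples.parquet")

df = load_feature_frame()
X = df[["session_count_30d", "searches_7d"]]  # hypothetical feature columns
y = df["converted"]                           # hypothetical label column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```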
Yep, makes total sense.
All right, well, I've been hammering you with questions.
Kostas, please jump in here because I know you probably have a ton of questions yourself.
Yeah, thank you, Eric. So, Davor, I want to ask you, you mentioned a couple of different, let's say, product or technology categories, like feature stores, feature engines.
And then obviously there's also the whole idea of having a stream processing engine.
So what is Kaskada, primarily? Is it a streaming data processing engine, a feature store, or something else?
Yeah, it's a hard...
You know, we obviously need a label for people to understand, and at Kaskada, we call it a feature engine.
It's like a feature store,
but really focused on generating features
as opposed to storing and serving them.
And that's kind of how we coined the term feature engine.
And some other companies have caught on,
like there is another company, I think, called Sumatra
that also tried similar approaches in this space.
So we consider ourselves a feature engine, right?
The engine that can help you generate feature values at any point in time or at the time
of now for inference and so on.
So generation of features from underlying raw data, we call that a feature engine. Recently, we open sourced the Kaskada code,
and we started calling it modern open source event processing. Because what we figured out is that what we built actually generalizes to all processing of events, right? Be it batch, be it in streaming mode, be it in any way, shape, or form. And so our website now talks about modern open-source event processing
as our positioning today.
And that's more naturally how it evolved rather than our intention.
Our intention wasn't to build a generic event processing system.
It's just that we discovered that by accident, by solving,
I guess, the machine learning problem well. Yeah, makes total sense. All right. So
if someone takes a look at, let's say, a typical feature store architecture, you usually see
two main components there. You have, let's say, the offline processing that happens there,
or like, let's say, the batch processing,
where you go get all your historical data,
use that to build the model, whatever,
and as part of that, you also define the features that you need for that, right?
And then, of course, you have the online version,
which is, okay, once new data comes, we need
to compute the features that we have previously defined and use them somehow, right?
With Kaskada, and usually in feature stores, you have different technologies implemented inside, right?
Which kind of makes sense because historically,
let's say, data processing platforms are focusing either on one or the other.
Like they are either streaming or batch.
Yep.
How do you work with Kaskada?
Like if I decide to use Kaskada, am I going to have two architectures implemented? Or one?
How does it work?
Yeah, so this architecture of online store and offline store, this is what I think is a "hack," quote-unquote, around how can I stitch existing systems together to solve the problem. And I realized that they are not really built for it, so I need to put multiple of them together and use them in different places to try to get the outcome and unit economics that I like.
And so if we kind of look at these two paths, I think streaming systems are really good on this inference path: take the recent data, compute something that is relatively recent with low latency, and serve the results. These are kind of materialized views on top of event-based data. And I think we
have good systems to do that. On the batch side of things, I think obviously we have Spark and
other systems that can process vast amounts of data. But often we find what people are missing is the ability to test hypotheses, to try to find signal that is actually relevant for their use case.
And doing that in a batch system, then running a backfill job that populates it at all possible points in time, for all entities, for all features, that's really not great. And most of these values computed will never be used.
And so we think that the right solution to this problem is take a feature definition that is
described easily, declaratively, and that can easily cross this training to production gap.
It can run in training without doing complicated backfill that stores everything at every point
in time, but compute features or training examples when you need them, generating easily, with simple, tiny queries, complicated data-dependent windows and data-dependent features, and deliver
them to training and literally with a click of a button, be able to maintain real-time
materialized views over streams for a production use case.
And so that's kind of how we view it, right? It's just one single architecture
purpose-built to process streams or event-based data, be it historic, be it real-time.
Yeah. Okay. That makes sense. And what I hear is that building a system like Kaskada, or trying to solve the problem that Kaskada is solving, we need to innovate, let's say, on two fronts. One is the technology itself, right?
Like something that can incorporate both, let's say, the streaming and the batch paradigms in one paradigm.
But it's also, from what I hear, like a user experience or developer experience, probably.
We need to figure out, like, what's the right way for our user, in this case, like an ML
engineer or a data scientist, to interact with the data, and help them, guardrail them into figuring out much faster what's the signal out of the noise, right? So let's talk a little bit more about that, because I'm pretty sure that people
have heard a lot in the past couple of years about how to work with streaming data, how to have low
latency, high throughputs, distributed system, blah, blah, blah, all that stuff. Yep.
But I think this experience part is still very new and still mainly unexplored, to be honest.
So what does it take, from your experience building Kaskada, to deliver this experience?
What is needed, and what did you build to address that?
So I think it's really important to be able to interact with data in a natural declarative way
where you can just kind of state the intention of what you are trying to compute and the underlying
system figures out the best way of implementing that. So, right, like these really high levels of abstraction, where you describe in a natural way what it is that you need to compute. So, let's
talk about machine learning features, right? Like there is a feature definition.
The feature definition can be something as simple as number of sessions you have had in the last month, right?
It's a very simple feature.
You have one window, right?
It's a one-month window.
You're counting number of logins.
That's probably number of sessions, right, in a particular window.
Great.
We can define that.
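In plain pandas terms, that single feature is just a trailing 30-day count of login events per user; a small sketch with made-up data and column names, just to pin down the computation (this is not Kaskada syntax):

```python
# Sketch: "number of sessions in the last month" as a trailing 30-day
# count per user. Data and column names are illustrative.
import pandas as pd

logins = pd.DataFrame({
    "user_id": ["a", "a", "a", "b"],
    "ts": pd.to_datetime(["2023-01-01", "2023-01-20", "2023-03-01", "2023-01-05"]),
})

# For each login event, count that user's logins in the trailing 30 days.
sessions_30d = (
    logins.set_index("ts")
          .sort_index()
          .groupby("user_id")["user_id"]
          .rolling("30D")
          .count()
)
print(sessions_30d)
```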
But then in machine learning use cases, you have more things.
The thing is when to observe this feature, right?
Like a streaming system makes one simple assumption.
The only time you are interested in observing this feature is the time of now.
Yeah.
Right?
Like what happened three years ago?
Well, that's not a concern for a streaming system.
But somebody building machine learning models, right, needs to observe this feature at a specific point in time that matches the model context, matches how the prediction is being made.
And those observation times happen at different points in time for different users, for different entities.
And now we have to describe
what we want in a natural way.
So we want to count number of sessions
in the last month. We need
to observe it
30 days
before or after a certain event.
Maybe 30 days after they signed up for service.
Maybe that's the right point in time to observe that feature.
Then you have to explain to the system when that time is.
And then usually in machine learning or in supervised learning,
you have the concept of labeling it.
So you have to observe something at that point in time and then move it to the future to compute the label, what has actually happened.
So that's how a practitioner, how an ML engineer, thinks about the problem.
So what's the feature definition? When it should be observed in a data-dependent way
and how to label that example
at some other point in time, right?
So those are the natural abstractions
that ML engineers
or any ML practitioner cares about.
And these are kind of quite difficult to do in the tabular way that SQL has championed.
And so what we have is a simple query language that can do these aggregations, right?
Like this feature definition looks like SQL in a few lines of code.
And the system takes care of the rest.
I think that's the real power that we bring to our community.
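To spell out that observe-then-label flow, here is a hedged pandas sketch of generating one training example per user: observe the feature 30 days after signup, then compute the label from what happened in the following 60 days. All tables, columns, and the 60-day label horizon are illustrative assumptions, not Kaskada code.

```python
# Point-in-time training-example generation, per the idea above:
# observe a feature at a data-dependent time (30 days after signup),
# then label from what happened afterward. Everything here is made up.
import pandas as pd

signups = pd.DataFrame({"user_id": ["a", "b"],
                        "signup_ts": pd.to_datetime(["2023-01-01", "2023-02-01"])})
logins = pd.DataFrame({"user_id": ["a", "a", "b"],
                       "ts": pd.to_datetime(["2023-01-10", "2023-01-25", "2023-02-02"])})
purchases = pd.DataFrame({"user_id": ["a"],
                          "ts": pd.to_datetime(["2023-03-15"])})

rows = []
for _, s in signups.iterrows():
    observe_at = s.signup_ts + pd.Timedelta(days=30)   # when to observe the feature
    label_until = observe_at + pd.Timedelta(days=60)   # how far ahead to label
    sessions = ((logins.user_id == s.user_id)
                & (logins.ts > observe_at - pd.Timedelta(days=30))
                & (logins.ts <= observe_at)).sum()
    bought = ((purchases.user_id == s.user_id)
              & (purchases.ts > observe_at)
              & (purchases.ts <= label_until)).any()
    rows.append({"user_id": s.user_id, "sessions_30d": int(sessions),
                 "label": bool(bought)})

print(pd.DataFrame(rows))
```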
All right, that's super cool.
And, I don't know, I think one of the main issues that SQL probably always had as a declarative language, which, by the way, is the whole point of a declarative language, right? Like, I'm going to describe to you what I want, and you, the database, go figure it out, deal with the ugly details. But it was never easy or intuitive, let's say, to work with time.
And that's one part.
Some other things that are hard are anything that has to do with a more imperative kind of programming,
like loops and all these things. So can you tell us a little bit more about, let's say, the new syntax that you figured out is best for working with time?
Because obviously, we're talking about events here.
Time is always present, right?
Exactly.
Events are pretty much, what I usually tend to say, like time series data, but not quite: with more dimensionality, with more metadata.
That is exactly right.
So yeah, but please tell us: what are the constructs that are missing from SQL? The most important, I think,
difference that we bring to our community is the concept of
a timeline.
So when an event happens, it really describes a change, right?
Like you logged in, that really increases the number of sessions by one, right?
And so if we want to process this data over time, it's really about how the feature value changes over time. It's really a timeline. It's a graph, right? It's not a computation at the end of time. It is how the feature value has changed over time. And these events just describe points in time when the feature value went from 10 to 11. And so our constructs produce timelines. When you say
summing integers, all systems will tell you, okay, the sum is 50 at the end of time, or current sum
is 42. We don't tell you that current sum is 42
or the total sum is 50.
We produce a timeline, right?
The sum has changed this way over the period of time, right?
And that is the output,
the basic output of primitive operations.
You produce a timeline
that describes how features have changed over time.
And then you have these kind of time selectors, let's call them that way.
All right, like time selectors that select when such a feature should be observed,
when such a feature should be labeled.
All right, so you can kind of manipulate timelines.
Right, like that's how I would describe Kaskada.
It's built for manipulating timelines.
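A tiny sketch of that distinction: the same stream of integer events viewed as an end-of-time aggregate (what most systems return) versus as a timeline of how the sum evolved (the kind of output the primitives described above produce). The values are made up to echo the 10-to-11 and 42 examples.

```python
# Same events, two views: the end-of-time aggregate vs. the timeline.
import pandas as pd

events = pd.Series(
    [10, 1, 20, 11],
    index=pd.to_datetime(["2023-01-01", "2023-01-03",
                          "2023-02-10", "2023-03-05"]),
)

print(events.sum())      # 42: a single "current sum" at the end of time
print(events.cumsum())   # the timeline: how the sum changed over time
# 2023-01-01    10
# 2023-01-03    11
# 2023-02-10    31
# 2023-03-05    42
```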
Okay.
That's super interesting.
You mentioned that the syntax is SQL-like, right?
It is declarative, so that certainly matches SQL's perspective on things.
We don't have, you know, SELECT, star, FROM, WHERE, and these types of, you know, keywords in the language.
Yeah, so from a usability standpoint, because, okay, SQL is something pretty much everyone knows,
right?
If you have worked with data, even for a short period of time in your life, you have seen
SQL.
So I'll say that it's up there, together with Excel and JavaScript, in terms of how global the syntax is.
Why go after, let's say, a completely different syntax
instead of enriching standard SQL with new constructs, right?
So we have had these debates for a long time.
We generally chose to make some changes
as opposed to adding some additions,
because if we were just adding additions,
certain things would be unnatural
and would surprise people, right?
And so we decided that doesn't make sense.
This tabular model that SQL enforces is not the best underlying concept for building these
abstractions.
On the other hand, yes, it's a trade-off with some learning curve that Kaskada may
introduce.
But we think of that as, you know, these are simple concepts.
Like if you just understand that this is a timeline, and the definition of what you're computing is all the same, but you're just selecting where.
If you understand the concepts, these are very tiny snippets. Anytime you start using a new product, there is some learning curve.
Excel has its own DSL
inside Excel. People have been using Excel. Everybody uses Excel.
This is of that nature. You describe
some formula that looks like a few functions and a few selectors.
You don't need to go to school to do this, right?
You read the documentation, you look at three examples,
and, you know, you should know what's going on, right?
Yeah, that makes sense.
All right.
It's like a library, right?
Like, you have to understand, you know, the constructs, the user model of it, and then you start using it.
Yeah, makes a lot of sense. Your experience with...
Because, okay, you are...
We've been talking all this time about, primarily, let's say, ML practitioners.
These are people that primarily live in Python land, right?
So, okay, I mean, if they have to use SQL, they can do it, but let's say their native language is Python.
So what was your experience working with them? Like with people that are coming from a very imperative programming kind of environment and getting into a declarative one?
Yeah.
So we try to merge these worlds.
So if you go to our website and kind of see the flavor of what we built,
it looks like Python.
It has a pipe operator, just like Python.
We recognize that the primary programming language for our community is Python,
that most ML libraries are built for Python, right?
And so we try to be as close to Python as we can and make it super easy to integrate
with IPython notebooks, right?
That has been a specific design point all along.
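To give a flavor of that pipe style in plain Python, here is an illustrative sketch; the query-like comment and the pipe helper are made up for this example and are not the actual Kaskada syntax.

```python
# Illustrative only: a made-up, pipe-flavored feature definition in the
# Python-ish spirit described above, NOT the real Kaskada language:
#
#   Logins | when(Logins.user_id == "a") | count(window=trailing_days(30))
#
# The same idea as plain Python with a tiny pipe helper:
from functools import reduce

def pipe(value, *fns):
    """Thread a value through a series of functions, left to right."""
    return reduce(lambda acc, fn: fn(acc), fns, value)

logins = [("a", "2023-01-10"), ("a", "2023-01-25"), ("b", "2023-02-02")]

result = pipe(
    logins,
    lambda evs: [e for e in evs if e[0] == "a"],  # filter to one entity
    len,                                          # aggregate: count
)
print(result)  # 2
```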
All right.
Okay.
We could keep chatting about that stuff for hours, but
there's also something great that happened lately with Kaskada. That was the acquisition, or the merger with DataStax. So I'd love to understand why this happened.
What's the vision behind merging these two products together?
Everyone knows DataStax and Apache Cassandra.
Apache Cassandra has been around for a while.
It's not something new.
And it's like a database system with very specific use cases.
So tell us more about that.
What should we expect as the child of this marriage?
Yep, absolutely.
So obviously, DataStax is rooted in Apache Cassandra.
Apache Cassandra is one of the first big data systems
that have been built.
All right, it's over a decade old
and it's still being used by so many companies
to store and serve transactional data.
Netflix uses it for everything.
Uber uses it for everything, right?
And plenty of others, right?
This is a really key storage system
even over a decade after it was originally built.
And it has been proven time and time again,
if you really want to scale with good unit economics, you go to Cassandra. That has been
kind of widely understood. And obviously, DataStax has been a company around Cassandra,
helping users adopt it. Over the last few years, DataStax moved into the database-as-a-service
market with the launch of Astra,
which is like a fully managed database-as-a-service product that makes usage of Cassandra easier and cloud native, to support high-growth applications.
And so what we've been looking at is what is the real opportunity here?
Obviously, databases are not super interesting in 2023,
like many people see databases as a solved problem.
But AI is obviously the interest of most high growth apps today.
And so DataStax's strategy is to serve smart, high-growth applications
for the decade to come. And these applications obviously need a really good storage system like
Apache Cassandra to serve and store transactional data, but that's not enough for the apps that are
going to be built in the next decade, right? They need streaming capabilities.
They need to compute things from real-time data to serve real-time derived data inside
the applications.
And they need things like smart predictions, recommendation engines, churn prediction,
and many other things that personalize the app experience.
And so what we are really building here is the best solution to build modern, smart,
high growth applications.
And you need a storage system.
You need a compute system.
You need the AI system to be able to serve high growth applications for a decade to come.
Okay, that's super exciting.
So which parts of this vision are served by Cassandra, and what is Kaskada adding to that?
Like how, together, do they materialize this vision? Yeah. So Cassandra is obviously the storage system that has great unit economics and it scales
infinitely.
So Cassandra is the best way to store user-specific information and be able to serve it with low
latency.
Then we have in our portfolio streaming systems, right?
Based on Apache Pulsar mostly, but Kafka compatible
that can ingest data
coming from anywhere,
coming from high growth applications.
And then we bring Kaskada into the fold,
which is really about computing things
that you need for real-time machine learning.
And then you can, again, store and serve out of Cassandra.
So it's really about completing the story,
completing the picture for serving high-growth applications.
You can ingest data, you can store data,
you can manipulate data to compute what you need
to be able to build smart, high-growth applications.
Yeah, that makes total sense.
And just to remind our audience, Kaskada got open sourced recently, right?
So there is a GitHub repo out there with, let's say, the core engine of Kaskada for event processing.
It's also built on top of some very interesting technologies.
We have Apache Arrow here.
We have Rust.
So even if someone doesn't, let's say, have to use it in production, I think it's worth just going and seeing how the system is built, the assumptions it makes.
It's a very modern system, and I think it's going to be an inspiration for many people who want either to use it or to build something like that.
So go and check it on GitHub.
Go check kaskada.io, you can get all the links from there.
And I think what is important is for you to get feedback from all the people, right?
So, go ahead, please.
Yeah, we'd love to engage with folks in the community, listen to their feedback, and obviously advance the state of the art in event processing, particularly for ML use cases.
And so we certainly invite everybody to come along,
join us, provide comments,
and even participate or contribute as they see fit.
So everybody's welcome.
That's awesome.
Is there a requirement for the open-source Kaskada to have Cassandra also, or can it be used as a standalone solution for something? It can be totally used
standalone, right? So just for
quick evaluation, you can do a
real simple pip install and you can
play with it on your machine.
It requires no connections anywhere, requires no installation of Cassandra, right?
Like for trying things out, just a simple pip install.
I think it can't be easier.
Okay, that's awesome.
Eric, all yours again.
Yeah, this has been such a fascinating conversation and it is exciting to look under the hood.
You know, Arrow and Rust and other technologies like that. Certainly very exciting, you know, not only for Kostas and me, but I think for our audience. Do you envision this solving the, let's say, operationalizing-ML problem and closing the gap between those two problems we discussed?
Do you envision Kaskada making that problem a lot easier for larger companies?
You've mentioned a couple of gigantic organizations.
And of course, you know,
if you're doing real-time recommendations,
you need to be a company of a certain scale, right?
You need to have enough data
and you need to have enough engineering resources,
you know, in order for that to be worth it.
Even to your point, you know,
the unit economics have to work out,
you know, for your recommendations engine
to, you know to have positive ROI.
But do you think something like Kaskada can actually also democratize that process for
companies who maybe don't have multiple different teams who can manage the different parts of this?
Do you envision it or have you even seen with your users or customers,
it actually making it easier for maybe a single team to sort of build and
ship things that maybe would have taken them another couple of years to get to just simply
from a resource standpoint, team standpoint, or fragmented infrastructure standpoint?
That is exactly what we hope the impact on our community to be, right?
Like nothing that I have talked about is novel in the sense that it couldn't be built, right?
Like this is software.
Anything can be built.
Sure.
Like literally we have not invented new math, right? Like anything can be built.
Right, it's just the complexity and how many people you need to be able to reliably get to success, right?
Like people have built real-time recommendations.
You can find online posts from, I don't know, Netflix that talk about these problems and
how many years it took them to get to the system they have today, right?
And obviously businesses like that had a business need to
solve the problem and there was nothing available in the market and they had to
figure out the problem because it was lucrative for them. It was highly leveraged. So it was
worth it for them to solve. And we, I think, are significantly reducing the total cost of ownership of building something like this. It's becoming much, much cheaper to do it.
Which means that there are so many more models that can be put in production because if the
cost is so much smaller, there are so many more models that have a positive ROI, right? That have a
lucrative ROI. And that's what we hope is the ultimate impact once, you know, this gets adopted
in larger numbers. Yeah, it makes total sense. Well, let's end with maybe some practical advice on how, like, you know, if I have a,
let's say I'm, you know, a machine learning engineer or working, you know, in, in kind
of the context of data science, and I want to try this out, would you recommend, you
know, playing around with building, you know, trying to build some features in Kaskada that I maybe have already built with my existing system
and just experiencing how much more flexible it is?
Or would you suggest maybe starting out
with more of an exploratory exercise
and trying to find that signal in the noise that you talked about?
I think it depends who you're talking to.
If you are a person
who thinks about the signal, right, then I would say, right, like, try it, just do a pip install
on your laptop and just play with it, right? And, you know, consider it success if you discover
new things that you did not know an hour ago. That is success. You've discovered a new predictive signal
that was not obvious to you. If you are a person who cares about
extracting signal, play and just explore
and measure yourself on what you have learned from data that you didn't
know before. If you are more an engineer who cares about
reliability, stability
of production, unit economics,
what's the latency in production,
then I would say
the best thing would be to implement three simple features, check in features as code, and see how easy it is to populate a feature store that you can just kind of serve from any database, like Cassandra or something else, with a simple API call to give the most recent feature vector.
I would say, focus on the getting-to-production part if that's what you care about.
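As a sketch of that serving path, here is what a "simple API call to give the most recent feature vector" can look like against Cassandra, using the open-source DataStax Python driver (cassandra-driver). The keyspace, table, and columns are hypothetical; only the driver calls themselves are real.

```python
# Hedged sketch: fetch the most recent feature vector for one entity
# from Cassandra at request time. Schema is hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # assumes a reachable Cassandra node
session = cluster.connect("features")   # hypothetical keyspace

def latest_feature_vector(user_id: str) -> dict:
    # One partition-key lookup: the low-latency serving path.
    row = session.execute(
        "SELECT sessions_30d, searches_7d FROM user_features WHERE user_id = %s",
        (user_id,),
    ).one()
    return dict(row._asdict()) if row else {}

print(latest_feature_vector("user-123"))
```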
Yep.
Makes total sense.
One last question.
And this one's more for
maybe the listeners who are early in their
career. Maybe they work more on the
data engineering or operational side, less
on the machine learning side.
But they know, okay, I need to
familiarize myself with ML
because it's going to increasingly infiltrate
many aspects of data within an organization.
So regardless of technology,
you've been operationalizing ML for a long time now.
Do you have any advice for that person
who's really good on the op side,
but maybe they want to
explore the ML side? Yeah. So I'd say the advice would be: you are really well positioned
and you are in a place that is likely going to be interesting for a long time, right? We are discovering that data is really powerful
and every company is becoming data company
and seeing how they can leverage data
that they have in the best way possible.
So you're kind of really well positioned, right?
If you are on the engineering side,
you probably care more about reliability, latency,
throughput, unit economics, and so on.
And I think here you want to understand the systems the best you can, understand what they are built for.
And every single time when you are evaluating a system, ask yourself what the system is
not built for, what has been sacrificed to achieve the benefits you just talked about.
Like, what did you ignore?
What did you deprioritize, right?
Like that type of architectural analysis, I would, you know, wish everybody understood and focused on. So not running for the next cool thing, but really understanding the trade-offs in the design of different systems, what they are built for, and how to apply them well.
That kind of understanding and knowing, I think, unlocks an engineer's career.
And you start growing and growing. So understanding the systems, and particularly what is not prioritized to achieve the benefits that people like to talk about.
Such wise advice.
Yeah.
Noticing what's not there is often much more powerful than
simply understanding what's there. So wonderful advice. Davor, this has been such a great
conversation and we're so glad that you gave us some time to come on the show.
Thank you so much. It was a great conversation. I really enjoyed talking to both of you.
A fascinating conversation with Davor of Kaskada, which was acquired by DataStax. He told this really interesting story about what they envision in terms of Kaskada being integrated into DataStax, which sort of operates a lot of
stuff on top of Cassandra. So lots of cool stuff there, I think, for the future. But Kaskada is also open source, and it does a lot of interesting things in terms of making it easier to not only discover interesting potential features and data sets, but also deliver those and serve those, which is really interesting. One of the things that I thought was fascinating about this conversation was the decision to
essentially create a new language as part of the system.
Because the system in and of itself is capable of doing some really interesting, cool things. But they chose to sort of write a language that,
this is probably a really bad way to describe it,
but it's almost a mix between SQL and Python, right?
It's declarative, but it's in the flavor of Python,
which I thought was fascinating.
And so it really does seem like they're kind of meeting
in the middle of these two
worlds of sort of the operational side and more of the statistical side. So, I don't know, that was a fascinating approach. I'm certainly going to be thinking about that one. What stuck out to you?
Yeah, a hundred percent. I think there are two things to keep from this conversation. One has to do with building the technology itself, and how hard a part of the problem it is, and why it's not something that can be, let's say, solved with just stitching together technologies. You really need to start thinking in first principles and build a new system, in a way, right?
That's one thing.
But that's, let's say, the bread and butter of innovation and technology, right?
What I found extremely interesting is how important the user experience also is.
And that's the connection with what you're saying about the language.
The reason they ended up building a new language is because they were trying to figure out
what's the right way for our users, in this case ML engineers,
to interact and work with the data, and somehow guardrail them into figuring
out what's the signal out of all this noise out there.
Right?
Yep.
And exactly what you said, like, they had to find the good things from all the different paradigms out there and put them together in a way that feels native to their user, which is the ML engineer. And the ML engineer lives in Python land.
They use Python. You cannot change that. All the libraries are in Python. No matter how they work
with the data, when they will have to do some processing
with the data, Python will be needed.
So it is important to build the right experiences there.
And we see that the need for this experience
also drives innovation,
like building a new language
on top of the processing system that we have.
And that's something that I think we will see more and more of
in the data infrastructure space as we try to democratize access
to all these technologies, which is probably something that will get even
further accelerated because of all the recent developments with AI, ML, and all
that stuff.
So yeah, that's what I keep, and I'm looking forward to chatting again and seeing what comes out from putting Cassandra together with Kaskada.
Absolutely.
Well, another good one in the books.
Thanks for listening to the Data Stack Show as always, and we will catch you on the next one. You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.