The Data Stack Show - 140: Stream Processing for Machine Learning with Davor Bonaci of DataStax

Episode Date: May 31, 2023

Highlights from this week’s conversation include:
- Davor’s journey from Google and what he was building there (3:32)
- How work in stream processing changed Davor’s journey (5:10)
- Analytical predictive models and infrastructure (9:39)
- How Kaskada serves as a recommendation engine with data (14:05)
- Kaskada’s user experience as an event processing platform (20:06)
- Enhancing typical feature store architecture to achieve better results (23:34)
- What is needed to improve stream and batch processes (27:39)
- Using another syntax instead of SQL (36:44)
- DataStax acquiring Kaskada and what will come from that merger (40:24)
- Operationalizing and democratizing ML (47:54)
- Final thoughts and takeaways (56:04)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Costas, we have another exciting one today. We're going to talk about stream processing. We have actually talked about this subject a good bit on the show, but this is pretty interesting because Davor from Kaskada, which was recently
Starting point is 00:00:40 acquired by Data Stacks, we'll talk about that a little bit, built a technology that's really focused on stream processing specifically for ML use cases and kind of closing the gap between the actual sort of building of insights and features and then actually serving those. And it's pretty fascinating. What I'm really interested in is what they saw or sort of maybe the lack in the market that they noticed that caused them to want to build something new, in large part because you have a lot of really good, really high power stream processing tools. You have things like feature stores. You have all sorts of interesting low latency ways to serve stuff.
Starting point is 00:01:29 The pieces are there in order to sort of actually build and deliver cool stuff, you know, even from a stream. But obviously it wasn't sufficient. So that's what I want to ask about. How about you? Yeah, I think it's a very interesting case. I mean, Cascada is a very interesting case because it is a streaming processing engine, but it emerged as the solution to a problem that is very use case specific and it has
Starting point is 00:02:04 to do with machine learning. So it's going to be very interesting. And one of the things that I want to talk about is how we go from something like a feature store, which is supposed to be one of the possible solutions out there in DML problems to something like a scatter, right? What are the differences? Why we need something that is, let's say, more unified in a way
Starting point is 00:02:35 in terms of both like the technology, but also like the experience of the user that uses the solution. So that's like one of the things that I'm very interested to discuss. And yeah, like learn about like the journey and also learn about the journey of, you know, like
Starting point is 00:02:55 getting acquired by data stacks and... Not about like the acquisition, but more about like the vision. Like how something that is... If you think about it, it's actually quite interesting. Data Stacks is based on Apache Cassandra, which is like a 10-year-old technology, right? And then you have something that is super, let's say, new in terms of the requirement
Starting point is 00:03:24 and the need. And even the technologies it's built on. Yep. And it's very interesting to see how these things come together and why, and what's the potential output of this, right? So it will be very interesting to discuss about all these things. All right. Well, let's dig in and talk with Devor.
Starting point is 00:03:46 Devor, welcome to the Data Stack Show. Hi, great to be with you. All right. Well, we have some exciting things to talk about. You've had quite the journey over the last couple of months in terms of acquisitions, open sourcing stuff, which is all really cool. Let's look back a little bit in history, though, because this isn't the first time you've open sourced technology related to streaming, which is kind of cool that you've been able to do that a couple times now. But you were at Google. And can you tell us a little bit about Google and what you were doing there and then what you
Starting point is 00:04:23 built an open source there? Oh, yeah. So when I was at Google, it was like early days of Google Cloud and we've been building a unified programming model for batch and stream processing that ultimately resulted in the Apache Beam project. It was a quite successful project with a relatively large number of companies around the world using it, contributing it. And then a few years later, me and my co-founder left Google to start a company in a similar space.
Starting point is 00:05:02 That led to Cascada, the founding of Cascada, that really tried to nail the problem of building predictive behavioral machine learning model from event-based data. Obviously, we were working at this problem for quite a while that resulted in an acquisition by DataStax about three months ago. And yeah, happy to be with you talking about all of this journey across, you know, from Google to Cascada to DataStax and everything in between. Sure. You've had such a focus on streaming. I'm interested to know, I mean, you obviously are, you know, have been a professional software engineer and working with data for quite some time now. Did you have an interest in stream processing or is that something that you found at Google
Starting point is 00:05:46 and started to work on at Google? I started working at Google. I was not looking at stream processing pre, I guess, 2013. Yeah. And basically started around late 2013 looking into it. And now I guess this year would be a decade that I am looking at this problem. Yeah. A decade of streaming. Well, I mean, I guess it's interesting to think back on
Starting point is 00:06:09 2013. I mean, there were infrastructure limitations that made certain stream processing things pretty difficult or at a minimum, like pretty expensive. So can you tell us, so you get into stream processing at Google, you work on Beam, you open source Beam, but Beam and then, you know, of course, there are a number of other technologies out there around stream processing, you know, even within Apache, but those weren't sufficient for what you wanted to do. So why build something new when at that point, you know, there were multiple major players and multiple different architectures running pretty large organizations at scale for stream processing use cases? Yeah. So when we started looking at the problem of machine learning, we discovered that neither
Starting point is 00:07:01 batch solutions, neither streaming solutions, neither, you know, Beam solves this problem well, right? And so if you start thinking about building behavioral machine learning, right? So think about these are kind of recommendation engines, churn prediction models, right? Something about predicting the future, future action, future interest based on what has happened in the past, right? Like when you look at the nature of that problem, you have to process historical data, observe feature values, generate training examples at the right points in time
Starting point is 00:07:39 to be able to train the model. That problem looks more like analytics, looks more like batch, looks more like historical data processing. And then you have this kind of inference problem where you want to take real-time data and give it the most recent feature vector to give it to the model and then produce a real-time prediction. And so when you look at that problem, it's not well solved by batch because you have too big of a latency. It's not well solved by streaming system because you can't have... It's very hard to get this kind of historical component on top of it. And so we made the
Starting point is 00:08:19 conclusion that the fundamentally existing systems are not well built for that. Obviously, other people around the same time have been looking at the same solutions and that they found ways of hacking certain things together to solve the problem. And from that work, feature stores or common feature stores came to be. They tried to create an online store and offline store. It's really kind of a divergent architecture to try to solve these different use cases on top of the same data. And we are more of a system builders than, you know, hacking things on top of systems. So we took the problem really deep and then designed the system, you know, that's really
Starting point is 00:09:03 built for the problem at hand. And the problem at hand we see is easily connecting to the data, describing features in an easy way where you can iterate in a place like a notebook, test hypothesis, test a lot of features very quickly that gives you immediate backfill analysis of features at any point in time. And once you train the model, like really with a click of a button or checking features as code into production, you can compute and serve those features with low latency,
Starting point is 00:09:39 right, all from the same system that is purpose-built for this problem. And that's kind of how Cascada was born. And, you know, we found some funding for it. We found a team for it and the team built the product. And then we took it to market. And, you know, I guess the rest is history. Very cool.
Starting point is 00:09:58 Can we actually talk about, you sort of mentioned that there's, you know, you have these sort of two separate problems, right? And that there's sort of an analytics type use case, which is looking historically, and then you have the actual sort of ingestion of the real-time data that allows you to sort of feed the model and actually create an experience, right? Like a hotel recommendation or, you know, a product recommendation or something. So can you describe the way that you saw that materialize in terms of both infrastructure and teams? Were there different teams working on those separate problems? You know, like,
Starting point is 00:10:39 because a lot of times you'll see sort of data science is working on the model and, you know, sort of more of the like analytics predictive model and, you know, sort of more of the like analytics predictive piece. And then it's a pretty heavy engineering problem to actually like grab the feature and then like, you know, it needs to be served in a website or app. Can you describe the common patterns around that breakdown and how people sort of hack that together? Yeah, absolutely. So we think that there are two fundamental problems. Problem number one is really finding predictive signal inside of your data. And that is very kind of company data your app, right? Think of it, click stream, tap stream, or, you know, engagement information coming from the app. That's a lot of data and it's relatively hard to find what is really predictive signal that tells you what the user might be interested in,
Starting point is 00:11:38 whether they'll, you know, buy something, whether they will, you know, renew a subscription, what they may be interested in, why they are here, and so on. That's the problem of finding quality predictive signal out of clickstream, event stream data. That problem becomes harder the more data gets messy. If you are getting it from multiple places, from multiple applications, right? And schemas and other things evolve over time. So kind of figuring things out there tends to be more of a data modeling extraction of useful signal. And we feel that's a key part of getting machine learning and AI right. There is a different problem,
Starting point is 00:12:25 it's a problem of scale. And that is kind of how can I, once I know the model, once I know what my features are, how to open that model at scale with low latency and good unit economics. And that problem gets harder the more scale you have. And those are two problems and usually
Starting point is 00:12:49 two different people are best to solve two different problems. We've seen in the data community a lot of talk recently or the last few years about the scale side of things. And I think that's very warranted because it's a hard problem. And people pushing the boundaries here tend to work at big companies, typically in the Bay Area that have a really large scale, and then they start hitting these problems. And I think that's a really hard and difficult problem to solve. But I just want to make sure that we don't forget that to get to a really good AI, what most people should do is focus on extracting quality signal. The better the signal is, the more predictive it is. It's easier to build a model. It's cheaper.
Starting point is 00:13:42 And it's actually doing work that is company-specific. It's very leveraged work. Whereas distributed systems, they are very common and horizontal and not specific to a company that may be doing it. So we often think about this infrastructure being more horizontal and should probably be done in an open source community with other people that can kind of jointly innovate on it. And then companies really, we think, should focus on their quality signal from their data because that's really leveraged for them.
Starting point is 00:14:20 That's a unique business value to them. Makes total sense. Okay, so walk us, let's do a breakdown of Cascada with kind of maybe a sample company. So let's say, you know, I'm a company that, you know, sells maybe it's retail products, you know, online or something, you know, sort of large scale e-commerce. I have multiple websites, maybe even multiple mobile apps. I'm probably ingesting some sort of log data from my production databases. And so I have multiple brands and I want to know if someone's purchased these things from this brand, what other products from this other brand could I maybe cross sell them on, right?
Starting point is 00:15:12 What does it look like for my company to implement Cascada? Like, you know, who are the people involved and how do we implement it? Yeah. So what you have described here, if I can generalize a bit, is a recommendation engine. Yeah. Right. And people have been looking at recommendation engines for a while. It's like one of the first use cases of machine learning. And obviously in many industries, they have been successfully implemented. The interesting thing is when you look at recommendation engines and quality of them is it's quite interesting what you can find, right?
Starting point is 00:15:55 And so I'll start with a few examples here, right? So let's say that today you buy a couch, right? What is the chance that you're going to buy a couch next week? Well, you know, basic recommendation model will conclude you bought a couch this week, you might buy a couch next week. But we both know that's not how it works. Yeah. Right.
Starting point is 00:16:15 And so there are some recommendation engines that, you know, fail in miserable ways, you know, in this way without understanding who you are and why you bought it. Right. If you are a reseller of couches, sure, you know, more couches this week, more couches next week. But if you're buying for your own home, if you bought the couch this week, maybe you're interested in a coffee table, but not in another couch. Right. Like we have to really understand who the customer is, why they are here to be able to provide good recommendations. That's key, right? Like sometimes recommendations are just totally off.
Starting point is 00:16:53 And if you search online, you'll find examples where people kind of laugh at quality of these things when they are not done well. So when you think about how can I do this well, it's about understanding motivation and driving signal from interaction on a digital platform to understand why the person is here, right? And so it's what they are searching, not just what they are buying, how frequently they are searching something, right?
Starting point is 00:17:25 And then being able to do this quickly to give them in a session, personalized experience based on the reason why they are here today, for example. I think that's key. How you do that, you have to focus on the signal coming from their interactions with the app. And I have, in every case we looked, we always find that we can separate somebody buying a couch for themselves and somebody who is a reseller of couches. As long as you focus on their interactions on the site, it tends to look very different. And if you derive the signal out of the event-based data, then the model can latch on it, learn, and give good predictions that separate one experience from another. That's key.
Starting point is 00:18:17 And that's what we like to enable customers to do. And most often, once they use our technology, they find things that they have not known about their user base or user base before they started. And that's what we consider success. Once you discover predictive things and segmentation of your users that was not clear before you started, That is success. Then you are discovering something about your business, about your users from your data, and that makes the company better. Makes total sense.
Starting point is 00:18:55 That's what we are all about. Yeah. So let's get practical for a second. So if I'm a user of, if I'm implementing Cascada, right? Like I, you know, I get it, I get it set up, right? And running. And so are there just endpoints that I point my, you know, app and website and production databases at?
Starting point is 00:19:19 Like it just, will it just ingest them no matter the schema? Is it as simple as that? Yeah, so we obviously want to load data from as many places as possible, right? And we try to make that as easy as possible. Obviously, we can't read it from anywhere or we can't read it from everywhere, but we can read it from common places that people store data. We typically suggest people for doing some early exploration to start with parquet formatted files with kind of scheme, structured data in parquet formatted files stored in some cloud storage type place, perhaps managed by Iceberg or something like that is what
Starting point is 00:20:04 we usually recommend. But we can read it from plenty of places usually with a few lines of code just kind of specifying the location and then we can read structured data relatively easily. We do not shine on unstructured data today. Yep, makes total sense. And then once the data makes it into Cascada,
Starting point is 00:20:27 what's the user experience like? How am I trying to find the signal and the noise using Cascada in the platform? Yeah. So first we tell people usually to use the tools they like. Everything we do today is API first, right? So you can open a Jupyter notebook, IPython notebook, do one pip install. That's one line of code. Then you load the data from somewhere. That's another line of code.
Starting point is 00:20:59 And then after that, you can build features, test features, and use all the machine learning libraries that you like, right? Scikit-learn, PyTorch, whatever you like, we generally support. So think of our product as API first, data frames in, data frames out, and you can connect it with all the tools that exist in the machine learning ecosystem that obviously practitioners have learned to love over the last couple of years. Yep, makes total sense. All right, well, I've been hammering you with questions. Costas, please jump in here because I know you probably have a ton of questions yourself.
Starting point is 00:21:41 Yeah, thank you, Eric. So, Davor, like I have, I want to ask you, you mentioned like a couple of different, let's say, like product or technology categories, like feature stores, feature engines. And so then obviously there's also like the whole idea of like having like a streaming processing engine. So what is Cascada, like, primarily? Is it like a streaming data processing engine, feature store, or something else?
Starting point is 00:22:12 Yeah, it's a hard... You know, we obviously need a label for people to understand, and at Cascada, we call it a feature engine. It's like feature store, but really focused on generating features
Starting point is 00:22:28 as opposed to storing and serving them. And that's kind of how we coined the term feature engine. And some other companies have caught on, like there is another company, I think, called Sumatra that also tried similar approaches in this space. So we consider ourselves a feature engine, right? The engine that can help you generate feature values at any point in time or at the time of now for inference and so on.
Starting point is 00:22:57 So generation of features from underlying raw data, we call that feature engine. Recently, we open sourced Cascada code, and we started calling it modern open source event processing. Because what we figured out that what we built is actually generalizes to all processing of events, right? Be it batchy, be it in streaming mode, be it in any way, shape, or form. And so our website now talks about modern open-source event processing as our positioning today. And that's more naturally how it evolved rather than our intention. Our intention wasn't to build a generic event processing system. It's just that we discovered that by accident, by solving,
Starting point is 00:23:47 I guess, the machine learning problem well. Yeah, makes a little sense. All right. So if someone takes a look in, let's say, a typical feature store architecture, you usually see two main components there. You have, let's say, the offline processing that happens there, or like, let's say, the batch processing, where you go get all your historical data, use that to build the model, whatever, and as part of that, you also define the features that you need for that, right? And then, of course, you have the online version,
Starting point is 00:24:24 which is, okay, once new data comes, we need to turn in the features that we have previously defined and use them somehow, right? With Cascada, and usually like in feature stores, you have different technologies implementing inside, right? Which kind of makes sense because historically, let's say data processing platforms are focusing either in one or the other. Like they are either like streaming or bots.
Starting point is 00:24:52 Yep. How do you work with Cascada? Like if I decide like to use Cascada, am I going to have two architectures implemented? One? How does it work? Yeah, so this architecture of online store and offline store, this is what I think is
Starting point is 00:25:10 hack, under quotation here around how can I stitch existing systems to solve the problem and I realize that they are not really built for it so I need to put multiple of them and use them in different places to try to get the outcome and unit economics that I like. And so if we kind of look at these two paths, I think streaming systems are really good in this inference path as take the recent data, compute something that is relatively recent with low latency and serve the results. These are kind of materialized views on top of event-based data. And I think we
Starting point is 00:25:53 have good systems to do that. On the batch side of things, I think obviously we have Spark and other systems that can process vast amounts of data. But often we find when you think is the ability to test hypothesis, to try to find signal that is actually relevant for their use case. And that can, doing that in a batch system and then running a backfill job that populates it at all possible points in time for all entities, for all features, that's really not great. And most of these values computed will never be used. And so we think that the right solution to this problem is take a feature definition that is
Starting point is 00:26:59 described easily, declaratively, and that can easily cross this training to production gap. It can run in training without doing complicated backfill that stores everything at every point in time, but compute feature or training examples when you need them, generate easily with simple queries, with tiny queries, complicated data-dependent windows and data-dependent features, deliver them to training and literally with a click of a button, be able to maintain real-time materialized views over streams for a production use case. And so that's kind of how we view it, right? It's just one single architecture purpose-built to process streams or event-based data, be it historic, be it real-time.
Starting point is 00:27:58 Yeah. Okay. That makes sense. And what I hear is that building a system like Ascada, or trying to solve the problem like Cascada is like solving, we need to innovate, like, let's say in two fronts, like one is like the technology itself, right? Like something that can incorporate like both, let's say, the streaming and the bots paradigms in one paradigm. But it's also, from what I hear, like a user experience or developer experience, probably. We need to figure out, like, what's the right way for our user, in this case, like an ML
Starting point is 00:28:36 engineer or like a data scientist, to interact with the data and help them, guardrail them, like figuring out much faster what's the signal out of the noise, right? So let's talk a little bit more about that because I'm pretty sure that people have heard a lot in the past couple of years about how to work with streaming data, how to have low latency, high throughputs, distributed system, blah, blah, blah, all that stuff. Yep. But I think these experience parts, it's still like very new and still like mainly unexplored to be honest. So what it takes from your experience by building Cascada
Starting point is 00:29:19 to deliver this experience? What is needed and how you, what did you build to address that? So I think it's really important to be able to interact with data in a natural declarative way where you can just kind of state the intention of what you are trying to compute and the underlying system figures out the best way of implementing that. So, right, like these really high levels of abstractions where you describe in a natural way what is that you need to compute. So, let's talk about machine learning features, right? Like there is a feature definition. The feature definition can be something as simple as number of sessions you have had in the last month, right?
Starting point is 00:30:13 It's a very simple feature. You have one window, right? It's a one-month window. You're counting number of logins. That's probably number of sessions, right, in a particular window. Great. We can define that. But then in machine learning use cases, you have more things.
Starting point is 00:30:30 Thing is when to observe this feature, right? Like streaming system makes one simple assumption. The only time you are interested in observing this feature is the time of now. Yeah. Right? Like what happened three years ago? Well, streaming, that's not a concern for streaming system.
Starting point is 00:30:49 But somebody building machine learning models, right, like needs to observe this feature that in a specific point in time that matches the model context, matches how the prediction is being made. And that is at the very, those features happen, those times happen at different points in time for
Starting point is 00:31:07 different users, for different entities. And now we have to describe what we want in a natural way. So we want to count number of sessions in the last month. We need to observe it 30 days before or after certain event.
Starting point is 00:31:23 Maybe 30 days after they signed up for service. Maybe that's the right point in time to observe that feature. Then you have to explain to the system when that time is. And then usually in machine learning or in supervised learning, you have the concept of labeling it. So you have to observe something at that point in time and then move it to the future to compute the label, what has actually happened. So that's how a practitioner, how ML engineer thinks about the problem. So what's the feature definition? When it should be observed in a data-dependent way
Starting point is 00:32:05 and how to label that example at some other point in time, right? So those are the natural abstractions that ML engineers or any ML practitioner cares about. And these are kind of quite difficult to do in a tabular way kind of the SQL has championed. And so what we have is a simple query language that can do these aggregations, right?
Starting point is 00:32:35 Like this feature definition looks like SQL in a few lines of code. And the system takes care of the rest. I think that's the real power that we bring to our community. All right, that's super cool. And one of the, I don't know, I think one of the main issues that SQL probably always had as a declarative language, which by the way, is like the definition of a declarative language as a whole point, right?
Starting point is 00:33:28 Like I'm going to describe to you what I wanted, like you, the database go and figure it out, at least with like the ugly details, but it was never easy like or intuitive, let's say, to work with time. And that's one part. Some other things that are hard is anything that has to do with more imperative kind of programming, like loops and all these things. So can you tell us a little bit more about, let's say, the new syntax that you figured out is best for working with time? Because obviously, we're talking about events here. Time is always present, right?
Starting point is 00:34:18 Exactly. Even if we are not talking about a mem. a meme, but events is pretty much like what I usually tend to say is like like like time series data, but they are not with more dimensionality like with more metadata. That is exactly right.
Starting point is 00:34:38 So yeah, but please tell us like what are the constructs that are like missing from Chico? The most important I think difference that we bring to our community is the concept of a timeline. So when an event happens, it really describes a change, right? Like you logged in, that really increases number of sessions by one, right? And so if we want to process this data over time, it's really about that the feature value
Starting point is 00:35:04 changes over time. It's really a timeline. It's a time, right? It's really about that the feature value changes over time. It's really a timeline. It's a graph, right? It's not a computation at the end of time. It is how the feature value has changed over time. And these events just describe points in time when the feature value went from 10 to 11. And so our constructs produce timelines. When you say summing integers, all systems will tell you, okay, the sum is 50 at the end of time, or current sum is 42. We don't tell you that current sum is 42 or the total sum is 50. We produce a timeline, right?
Starting point is 00:35:49 The sum has changed this way over the period of time, right? And that is the output, the basic output of primitive operations. You produce a timeline that describe how features have changed over time. And then you have these kind of time selectors, let's call them that way. All right, like time selectors that select when such a feature should be observed, when such a feature should be labeled.
Starting point is 00:36:18 All right, so you can kind of manipulate timelines. Right, like that's how I would describe Cascara. It's built for manipulating timelines. Okay. That's super interesting. You mentioned that the syntax is like SQL-like,
Starting point is 00:36:38 right? It is declarative, so that certainly matches SQL perspective on things. We don't have, you know, select, start, from, where, and these types of, you know, keywords in the language. Yeah, so from a usability standpoint, because, okay, SQL is something pretty much everyone knows, right? If you have worked with data, even for a short period of time in your life, you have seen
Starting point is 00:37:14 SQL. So it's a very, I'll say that it's like together with Excel and JavaScript out there in terms of how global the syntax is. Why go after, let's say, a completely different syntax instead of enriching standard SQL with new constructs, right? So we have had these debates for a long time. We generally chose to make some changes as opposed to add some additions
Starting point is 00:37:49 because if we were just adding additions, certain things would be unnatural and would surprise people, right? And so we decided that doesn't make sense. This tabular model that SQL enforces is not the best underlying concept for building these abstractions. On the other hand, yes, it's a trade-off between some learning curve that Cascada may introduce.
Starting point is 00:38:19 But we think of that as, you know, these are simple concepts. Like if you just understand that this is a timeline and the definition of what you're computing is all the same, but you're just selecting where. If you understand the concepts, these are very tiny snippets that anytime you start a new, using a new product, there is some learning curve. Excel has its own DSL inside Excel. People have been using Excel. Everybody uses Excel.
Starting point is 00:38:50 This is of that nature. You describe some formula that looks like few functions and few selectors. You don't need to go to school to do this, right? You read the documentation, you look at three examples, and, you know, you should know what's going on, right? Yeah, that makes sense. All right.
Starting point is 00:39:16 The library, right? Like, you have to understand, you know, constructs, you know, user model of it, and then you start using it. Yeah, makes a lot of sense. user model of it, and then you start using it. Yeah. Makes a lot of sense. Your experience with... Because, okay, you are... We've been talking all this time about primarily, let's say,
Starting point is 00:39:35 ML practitioners. These are people that primarily live in Python lands, right? So, okay, I mean, if they have the USQL, they can do it, but like, let's say their native language is like Python. So what did you, like, how, what was your experience working with them? Like with people that they are coming like from a very like imperative
Starting point is 00:39:59 programming kind of like environment and getting into like a declaratives? Yeah. So we try to merge these worlds. So if you go to our website and kind of see the flavor of what we built, it looks like Python. It has a pipe operator, just like Python. We recognize that primary programming language for our community is Python, that most ML libraries are built for Python, right?
Starting point is 00:40:26 And so we try to be as close to Python as we can and make it super easy to integrate with IPython notebooks, right? That has been a specific design point all along. All right. Okay. We could keep chatting about that stuff for hours, but there's also something great that's happened lately about Cascada. That was the acquisition or it merged with data stacks. So I'd love to understand why this happened.
Starting point is 00:41:07 What's the vision behind merging these two products together? Everyone knows data stacks and Apache Cassandra. Apache Cassandra has been around for a while. It's not something new. And it's like a database system with very specific use cases. So tell us more about that. What should we expect as the child of this marriage? Yep, absolutely.
Starting point is 00:41:44 So obviously, data stacks is rooted in Apache Cassandra. Apache Cassandra is one of the first big data systems that have been built. All right, it's over a decade old and it's still being used by so many companies to store and serve transactional data. Netflix uses it for everything. Uber uses it for everything, right?
Starting point is 00:42:14 And plenty of others, right? This is a really key storage system even over a decade after it was originally built. And it has been proven time and time again, if you really want to scale with good unit economics, you go to Cassandra. That has been kind of widely understood. And obviously, DataStax has been a company around Cassandra, helping users adopt. Over the last few years, Datastacks moved into database as a service market with the launch of Astra,
Starting point is 00:42:52 which is like a fully managed database as a service product that makes usage of Cassandra easier, cloud native and to support high growth applications. And so what we've been looking at is what is the real opportunity here? Obviously, databases are not super interesting in 2023, like many people see databases as a solved problem. But AI is obviously the interest of most high growth apps today. And so data stack strategy is to serve smart, high growth applications
Starting point is 00:43:27 for the decade to come. And these applications obviously need a really good storage system like Apache Cassandra to serve and store transactional data, but that's not enough for the apps that are going to be built in the next decade, right? They need streaming capabilities. They need to compute things from real-time data to serve real-time derived data inside the applications. And they need things like smart predictions, recommendation engines, churn prediction, and many other things that personalize the app experience. And so what we are really building here is the best solution to build modern, smart,
Starting point is 00:44:14 high growth applications. And you need a storage system. You need a compute system. You need the AI system to be able to serve high growth applications for a decade to come. Okay, that's super exciting. So how is this, like which parts of this vision is like served from Cassandra and what is like Cascada adding to that? Like how together they materialize this vision. Yeah. So Cassandra is obviously the storage system that has great unit economics and it scales infinitely.
Starting point is 00:44:51 So Cassandra is the best way to store user-specific information and be able to serve it with low latency. Then we have in our portfolio streaming systems, right? Based on Apache Pulsar mostly, but Kafka compatible that can ingest data coming from anywhere, coming from high growth applications. And then we bring Cascada into the fold,
Starting point is 00:45:16 which is really about computing things that you need for real-time machine learning. And then you can, again, store and serve out of Cassandra. So it's really about completing the story, completing the picture for serving high-growth applications. You can ingest data, you can store data, you can manipulate data to compute what you need to be able to build smart, high-growth applications.
Starting point is 00:45:48 Yeah, that makes total sense. And just like to remind our audience, Cascada got open source recently, right? So there is GitHub repo out there with, let's say, the core engine of Cascada for event processing. It's also like building on top of like some very interesting like technologies. We have Apache Arrow here. We have Rust. So I think even if someone doesn't, let's say, have to use it in production, I think just going and seeing how the system is built, the assumptions. It's a very modern system, and I think it's going to be an inspiration from many people who want either to use or build something like that.
Starting point is 00:46:41 So go and check it on GitHub. Go check like Cascada.io like you can get like all the links from there. And I think what is important
Starting point is 00:46:51 is for you to get feedback from all the people. Right? So go ahead, please. Yeah, we'd love to engage with folks
Starting point is 00:46:59 in the community listen to their feedback and obviously advance the state ofof-the-art in event processing, particularly for ML use cases. And so we certainly invite everybody to come along, join us, provide comments, and even participate or contribute as they see fit.
Starting point is 00:47:20 So everybody's welcome. That's awesome. Is there a requirement for the open-source Cascada to have Cassandra also, or it can be used as a standalone solution for something? It can be totally used
Starting point is 00:47:35 standalone, right? So just for quick evaluation, you can do a real simple pip install and you can play with it on your machine. It requires no connections anywhere, requires no installation of Cassandra, right? Like for trying things out, just a simple pip install. I think it can't be easier. Okay, that's awesome.
Starting point is 00:47:56 Eric, all yours again. Yeah, this has been such a fascinating conversation and it is exciting to look under the hood. You know, Arrow and Rust and other technologies like that. Certainly very exciting, you know, not only for Casas and I, but I think our audience. Do you envision this solving the, let's say, operationalizing ML and closing the gap between those two problems we discussed? Do you envision Cascada making that problem a lot easier for larger companies? You've mentioned a couple of gigantic organizations. And of course, you know, if you're doing real-time recommendations, you need to be a company of a certain scale, right?
Starting point is 00:48:52 You need to have enough data and you need to have enough engineering resources, you know, in order for that to be worth it. Even to your point, you know, the unit economics have to work out, you know, for your recommendations engine to, you know to have positive ROI. But do you think something like Cascada can actually also democratize that process for
Starting point is 00:49:11 companies who maybe don't have multiple different teams who can manage the different parts of this? Do you envision it or have you even seen with your users or customers, it actually making it easier for maybe a single team to sort of build and ship things that maybe would have taken them another couple of years to get to just simply from a resource standpoint, team standpoint, fragmented infrastructure. That is exactly what we hope the impact on our community to be, right? Like nothing that I have talked about is novel in a sense that it couldn't be be, right? Like nothing that I have talked about is novel in a sense
Starting point is 00:49:47 that it couldn't be built, right? Like this is software. Anything can be built. Sure. Like literally we have not invented new math, right? Like anything can be built. It's right.
Starting point is 00:49:59 Like it's just the complexity and how many people you need to be able to reliably get to success, right? Like people have real-time recommendations. You can find online posts from, I don't know, Netflix that talk about these problems and how many years it took them to get to the system they have today, right? And obviously businesses like that had business need to solve the problem and there was nothing available in the market and they had to
Starting point is 00:50:29 figure out the problem because it was lucrative for them. It was highly leveraged. So it was worth it for them to solve. And we, I think, are significantly reducing the total cost of ownership of building something like this, it's becoming much, much cheaper to do it. Which means that there are so many more models that can be put in production because if the cost is so much smaller, there are so many more models that have a positive ROI, right? That have a lucrative ROI. And that's what we hope is the ultimate impact once, you know, this gets adopted in larger numbers. Yeah, it makes total sense. Well, let's end with maybe some practical advice on how, like, you know, if I have a, let's say I'm, you know, a machine learning engineer or working, you know, in, in kind
Starting point is 00:51:33 of the context of data science, and I want to try this out, would you recommend, you know, playing around with building, you know, trying to build some features in Cascada that I maybe have already built with my existing system and just experiencing how much more flexible it is? Or would you suggest maybe starting out with more of an exploratory exercise and trying to find that signal and the noise that you talked about? I think it depends who you're talking to. If you are a person
Starting point is 00:52:05 who thinks about the signal, right, then I would say, right, like, try it, just do a pip install on your laptop and just play with it, right? And, you know, consider it success if you discover new things that you did not know an hour ago, That is success. You've discovered a new predictive signal that was not obvious to you. If you are a person who cares about extracting signal, play and just explore and measure yourself on what you have learned from data that you didn't know before. If you are more an engineer who cares about reliability, stability
Starting point is 00:52:48 of production, unit economics, what's the latency in production, then I would say the best thing would be implement three simple features, check in features as code, and see how easy it is
Starting point is 00:53:03 to populate a feature store that you can just kind of serve it from any database like a Sandra or something else with a simple API call to give the most recent feature vector. I just, I would say, focus on getting to production part if that's what you care about. Yep. Makes total sense. One last question. And this one's more for
Starting point is 00:53:28 maybe the listeners who are early in their career. Maybe they work more on the data engineering or operational side, less on the machine learning side. But they know, okay, I need to familiarize myself with ML because it's going to increasingly infiltrate many aspects of data within an organization.
Starting point is 00:53:52 So regardless of technology, you've been operationalizing ML for a long time now. Do you have any advice for that person who's really good on the op side, but maybe they want to explore the ML side? Yeah. So I'd say advice would be like, you are really well positioned and you are in a place that is likely going to be interesting for a long time, right? We are discovering that data is really powerful and every company is becoming data company
Starting point is 00:54:29 and seeing how they can leverage data that they have in the best way possible. So you're kind of really well positioned, right? If you are on the engineering side, you probably care more about reliability, latency, throughput, unit economics, and so on. And I think here you want to understand the systems the best you can, understand what they are built for. And every single time when you are evaluating a system, ask yourself what the system is
Starting point is 00:55:02 not built for, what has been sacrificed to achieve the benefits you just talked to me about that enabled you to achieve that? Like, what did you ignore? What did you deprioritize, right? Like those type of architectural analysis, I would, you know, wish everybody understood and focused. So not running for the next cool thing, but really understanding the trade-offs in the design of different systems, what they are built for, and how to apply them well. That would be like understanding and knowing that I think unlocks an engineer's career. And you started growing and growing. So understanding the systems and particularly what is not prioritized to achieve the benefits that people like to talk about.
Starting point is 00:56:00 Such wise advice. Yeah. Noticing what's not there is often much more powerful than simply understanding what's there. So wonderful advice. Davor, this has been such a great conversation and we're so glad that you gave us some time to come on the show. Thank you so much. It was a great conversation. I really enjoyed talking to both of you. A fascinating conversation with Dvor of Cascada, which was acquired by Datastacks. It caused this really interesting story about what they envision
Starting point is 00:56:31 in terms of Cascada being integrated into Datastacks, which sort of operates a lot of stuff on top of Cassandra. So lots of cool stuff there, I think, for the future. But Cascada is also open source, and it does a lot of interesting things in terms of making it easier to not only discover interesting potential features and data sets, but also deliver those and serve those, which is really interesting. One of the things that I thought was fascinating about this conversation was the decision to essentially create a new language as part of the system. Because the system in and of itself is capable of doing some really interesting, cool things. But they chose to sort of write a language that, this is probably a really bad way to describe it, but it's almost a mix between SQL and Python, right? It's declarative, but it's in the flavor of Python, which I thought was fascinating.
Starting point is 00:57:40 And so it really does seem like they're kind of meeting in the middle of these two worlds of sort of the operational side and more of the statistical side so that i don't know that's that was a fascinating approach i'm certainly gonna gonna be thinking about that one what stuck out to you yeah a hundred percent i think like the most there are like two things like keep like from this conversation one has to do with like building the technology itself and like how part of the problem it is and why it's not something that can be, let's say solved with just like stitching together technologies, but you really need like
Starting point is 00:58:21 to start thinking like in first principles and build a new system in a way, right? That's one thing. But that's, let's say, the bread and butter of innovation and technology, right? What I found extremely interesting is how important the user experience also is. And that's the connection with what you're saying about the language. The reason they ended up building a new language is because they were trying to figure out what's the right way for our users, in this case ML engineers, to interact and work with the data and somehow like got railed them into figuring
Starting point is 00:59:06 out what's the signal out of all this noise out there. Right? Yep. And exactly what you said, like it's, they had to find the good things from all the different Paradox shops out there and put them together in a way that feel native to their user, which is the ML engineer. And the ML engineer lives in Python launch. They use Python. You cannot change that. All the libraries are in Python. No matter how they work with the data, when they will have to do some processing with the data, Python will be needed.
Starting point is 00:59:47 So it is important to build the right experiences there. And we see that the need for this experience also drives innovation, like building a new language on top of the processing system that we have. And that's something that I think we will see more and more of in the data infrastructure space as we try to make like democratize access to all these technologies, which is probably something that will get even
Starting point is 01:00:18 further accelerated because of all the recent developments with AI, ML, and all that stuff. So yeah, that's what I keep and I'm looking forward to chat again and see what comes out from putting Cassandra together with Cascada. Absolutely. Well, another good one in the books. Thanks for listening to the Data Stack Show as always, and we will catch you on the next one. you can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.
