The Data Stack Show - 112: Python Native Stream Processing with Zander Matheson of bytewax

Episode Date: November 9, 2022

Highlights from this week’s conversation include:

- Zander’s background and career journey (2:32)
- Introducing Bytewax (5:16)
- The difference between systems (10:57)
- Bytewax’s most common use cases (16:15)
- How Bytewax integrates with other systems (20:25)
- The technology that makes up Bytewax (24:31)
- Comparing Bytewax to other systems (34:17)
- What’s next for Bytewax (36:31)

Try it out: bytewax.io

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Today, we're going to talk with Zander from ByteWax. And Costas, I love the topic of streaming,
Starting point is 00:00:34 which won't surprise you at all. But this is really interesting. We've actually had some stream processing type technologies on the show, actually multiple. However, ByteWax is stream processing in the Python ecosystem, which is really interesting. It actually makes a ton of sense, you know, just at a high level when you think about, you know,
Starting point is 00:00:57 how prevalent Python is in terms of data. And my question is not going to surprise you at all, but I want to know where the motivation came from. So Xander was a Heroku and GitHub working as a data scientist, and so there was some sort of need, I'm sure, that he saw at those companies that motivated him to start ByteLax and do sort of stream processing and some MLOps stuff in Python. So that's what I'm going to ask you about,
Starting point is 00:01:24 and I'm sure you have some technical questions. So tell me what you're going to dig into.

Costas Pardalis- Yeah, I'd love to go deeper into the technical side of things and see how it is built and what are the technical decisions that were made there. And one thing that I definitely want to discuss with him is, what's the difference between, let's say, a stream processing system that is built in 2022 compared to the previous generation of streaming engines like Samza and Storm and Spark Streaming, all these frameworks that have been around for a while. So, yeah, I'm very excited about it.
Starting point is 00:02:03 I'd love to see what's the difference there, what's new.

Eric Dodds- Totally. All right. Well, let's dig in and chat with Zander.

Zander Matheson- Let's do it.

Eric Dodds- Zander, welcome to the Data Stack Show. We're super excited to chat.
Starting point is 00:02:16 Zander Matheson- Thanks. I'm excited to be here and thanks for having me.

Eric Dodds- Totally. Okay. Well, tell us about your background. Lots of data science type stuff, but yeah, tell us where you came from and what you're doing today.

Zander Matheson- Yeah, I'll give a little bit more of my background. So the last, you know, I don't know,
Starting point is 00:02:34 five or six years I've been working as a data scientist. I was at GitHub and before that, Heroku. But before that, I have a little bit of a mixed path on how I ended up here. I was actually a civil engineer. That's what I went to school for and worked as a civil engineer. Then I went to business school. And after business school, I got into working in technology. And I worked at a company, a speech technology company, so doing speech recognition, text-to-speech. And I was supposed to do biz dev and sales.
Starting point is 00:03:13 And I liked the technology more than I liked doing that. And I was better at working with technology. So that's how I kind of ended up in where I am today. It was right around the time when people started using neural networks for speech recognition. And I was just like, I thought it was so cool. And that's what led me into the data side of things. Yeah, very cool. Okay, so I have a question about, I'm always interested in, on the show, I probably ask this. I don't know if I've asked this in a while, but it always fascinates me sort of when people kind of take a career,
Starting point is 00:03:47 you know, their career changes, which is very common. But are there any lessons from like your study or work in civil engineering that you took with you into data science? You know, because they're both technical in their own way,
Starting point is 00:04:03 but like pretty different. So yeah, I'd love to know if there's any lessons you do yeah so when i was working as a civil engineer i was like working with hydrology models which are stochastic models and so there's a lot of like you know similar basis and theoretical work to like both of these things so there was a takeaway there in my like general education for civil engineering did like some programming it was just like part of what we did but not a lot but yeah that's so that's kind of from i guess from the technical point of point of things but decomposition of problems is just like a general concept. And, you know, thinking, being able to analyze things regardless is a skill, I guess, that transfers. I don't know, maybe.
Starting point is 00:04:56 Yes. Yeah, that's great. I love that. It's kind of like pipelines, you know, for like in the context of civil civil engineering, pipelines for data is great. Yeah. It's a great connection there. Okay. Well, tell us about ByteWax. Sure.
Starting point is 00:05:14 Yeah. So ByteWax today, there's ByteWax the company, and then there's ByteWax the open source project. So ByteWax the open source project is. So ByteWax, the open source project, is a Python framework for data processing. And it's focused on processing streams. So stream processing or stateful stream processing is kind of like the focus of the framework. And so you can think of it similarly to Flink or Spark Structured Streaming or Kafka Streams, and that you can do more complex or advanced analysis on streaming data. So you can do aggregations, you can do windowing, et cetera, et cetera. And then it's in Python, it's Python native. So you can use like the Python ecosystem to build these stream processing workflows.
Starting point is 00:06:09 And with ByteWax, those are called data flows. And that comes from the data flow, like a compute paradigm that is underneath ByteWax. Tell us about the company as well. Yeah. So ByteWax the company, we the company as well. Yeah, so ByteWax the company we started actually in the beginning of 2021. We started ByteWax and we actually started ByteWax to bring to market a hosted machine learning platform was the idea we had worked on this problem at GitHub and few of
Starting point is 00:06:42 us left GitHub to build that. And we had an idea that eventually led to this pivot where we started building a stream processing framework. And the idea was that all machine learning, when you're running a machine learning model in real time, you have this idea of pre-processing data for features or whatever, joining additional data on, and then you run the inference with the model, and then you'll do some post-processing potentially. And so we made a framework where you could create a DAG in Python to do these different steps, and then you would just deploy them. And we would go scaffold up some infrastructure to run this on Kubernetes, and you could scale
Starting point is 00:07:24 them and everything. And that was the initial idea. And people started using it for, to process data in real time without doing any machine learning. It's just, yeah. And it wasn't really meant for that because it didn't have like state built in. So we ended up just kind of scratching a bunch of that and, and restarting with a different execution underneath.
Starting point is 00:07:46 Fascinating. Okay, so let's dig into that a little bit. So the first part, so you and several people from GitHub were building a platform for machine learning, which is interesting. We've actually heard that on the show a ton, right? And we talked to people with machine learning and it's kind of like, well, tell us about MLOps ml ops and everyone's like oh like like there are a lot of things that have
Starting point is 00:08:10 gotten better and like but there's so much about that as well so that was that's sort of the goal was sort of like like ml ops pain like let's go solve that and then people just started using it for like a data flow that didn't include the machine learning part? Yeah. I mean, yeah, as a simple, that's a simplification, I think in some sense, but it captures the essence in a better way. Yeah, it was. So there was like four year, four or five years ago, we were, when I was at GitHub,
Starting point is 00:08:42 we were working on a machine learning platform and We did a bunch of work to extend Kubernetes and make our own custom operators and resource definitions so that you could run training jobs and do hyperparameter stuff and do all this MLOps stuff on Kubernetes. And so we took some of those ideas and we built into platforms so that the idea was smaller businesses without the larger teams to like build all that ops infrastructure could just like, you know, by wax deploy my, my workflow thing. And then we kind of handle the rest and give you like metrics and alerting and scaling and stuff like that. Super interesting. metrics and alerting and scaling and stuff like that super interesting okay so the next question
Starting point is 00:09:27 is like why do you think it's always fascinating to me when you're like you built something and then people use it in kind of a way that you didn't expect why do you think users were using it in the way that they were because obviously obviously there was a component that met a need that was unmet, and they're like, oh, this is a better way to do this. What were the driving factors or use cases or technical limitations they were facing that initial version of ByteWax unlocked
Starting point is 00:09:56 in terms of the streaming piece of it? Yeah. So many of these groups were still going to do some, use some machine learning down the road, but there was not a really good way in Python to do the feature transformations they needed to do in real time against streams. And so they were trying to transform data in real time for that aspect. That was it. So there was like, in Python, there wasn't an easy way to essentially like hook up to Kafka and then run some like, use some Python tools, whenever it is NumPy, etc, etc, to like transform that data. And then you have a feature set that you will then serve
Starting point is 00:10:39 in real time to the model. Super interesting. Okay. Costas, I have a hundred more questions, but please jump in here because there's so much to talk about and I know you have some great questions brewing on the technical side. Costas Pintasenac, Yeah. Yeah. So you mentioned Zeller, you mentioned at the beginning, like some frameworks for streaming processing, like Flink, which is probably the most well-known one. There have been a few others in the past, like we've got Storm from Twitter. There's another one.
Starting point is 00:11:14 SAMSA. Yeah, SAMSA. There have been quite a few attempts in building streaming processing systems, right? What's the difference between a system like PiedWalks today compared to, let's say, a paradigm that was implemented probably like a decade ago, right, because like these are like all these tools, they started like pretty much to appear like end of like the 2000s, beginning to 2010, right? So what's the difference there?
Starting point is 00:11:47 David PĂ©rez- Yeah, I think many of those I mean, I'm not an expert in all those systems, so, you know, I could be missing things, but many of these systems, there's like a series of trade-offs make for like correctness, you know, latency, scalability, et cetera. And so I think many of these systems or And so I think many of these systems, or not, I think many of these systems take different approaches and like, you know, one thing for another potentially,
Starting point is 00:12:13 and that may lead to them being dropped off and something new coming along. Other things may have, you know, like we are like software engineers, et cetera. We are like a little bit like trend followers. And so some things seem to like fall out of trend and then they don't like gain the adoption. So certain like ecosystems of tools that, you know, they were integrated with or worked well with maybe, you know, fell by the wayside and new systems came up and those new technologies kind of worked better with those systems.
Starting point is 00:12:53 So like you had Spark was like becoming mid whatever, like 2014, 2015 or whatever, Spark was like becoming a thing and people were moving away from some of the different MapReduce frameworks and using Spark for different things. So I don't know. It seems like there are these trends, and sometimes it's like a trade-off. You make an architectural decision
Starting point is 00:13:16 that lets you have better performance, but it's like you're giving up correctness. Or maybe you have to add state in a different way, or like to persist state or, you know, checkpoint things in state, you like give up some performance. So there's all these trade-offs. And I think that maybe that leads to a plethora of different services. Yeah, makes a lot of sense. So someone like today that has to choose between, let's say, let's take Flink,
Starting point is 00:13:48 because Flink is quite dominant, let's say, like in the set-frame processing when a state is involved, and ByteWalks, right? Why would choose one or the other? Why I would go and use and work with ByteWalks instead of Flink? Yeah, so I mean, it comes back to those trade-offs again. Flink is a more mature product and you have SQL,
Starting point is 00:14:12 you can use Flink, and there are many different bindings for languages, and they've built out a bunch of features that ByteWax doesn't have. But the trade-off that we are sort of investing in there is it's a lot more involved to get up and running and started with Flink for like a big group of users that maybe don't have
Starting point is 00:14:33 the experience with that whole Java ecosystem, you know, tuning the JVM, et cetera, et cetera. And so our thought is like, we'll make a trade-off of maybe not having the exact same robustness that Flink has today, but we'll give an experience where it's a lot easier to get started and quicker to get started for this subset of users that are maybe a newer group of users, like machine learning engineers,
Starting point is 00:14:58 data scientists, or newer data engineers. So that's the trade-off that we've been playing in with the ecosystem. Yeah. Yeah. Yeah. Developer experience is quite important. Yeah. Yeah. I mean, everyone who has had the time point, like to work with systems like this, or like maintain these systems, like I think they have like many horror stories too. It's not easy. I mean, okay. It's easy to... Let's take Kafka, for
Starting point is 00:15:25 example. Getting a couple of Docker images and running them locally, yeah, it works. But from that point, to take it to production and keep it in production, it's a huge, huge distance in terms of the complexity of all the operations that are involved in Greece. So I totally understand what you mean by the experience of the developer there. And what are the use cases? Like the users of ByteWalks today, you shared a very interesting story of how you started from machine learning. You saw that they are using it for other reasons and ended up like creating like ByteWalks, what are the most common use cases that you see there where ByteWalks is involved?
Starting point is 00:16:12 David PĂ©rez- Yeah. So kind of can align them maybe to application. So IoT, security, financial industry, there's many different use cases within those applications. So you have anomaly detection for things like fraud or malfunction usage is, and they can then make some decisions based on that. One thing I'll say about Biwax is it's a very generally applicable framework. I mean, it is a data processing tool for literally any set of data, which makes it also very difficult sometimes to position it correctly. But yeah, so there's applications, like I said, in those industries, and there are various different use cases.
Starting point is 00:17:18 And then coming back to what I was talking about for machine learning and how we ended up with this streaming framework is that stream processing framework is that application is around features. So today when you train a machine learning model, you're going to like generate a feature set and then you're going to like in some data set and then you're going to use that to train the model. And you're like, that's sweet. I'm getting great performance. I have these awesome features. And then you need to use that somehow in production. And so there's been all this like tooling around that you have like Feast,
Starting point is 00:17:51 which is a open source thing from Tekton and many other feature stores, et cetera. But you don't really have like a good tool for like creating those features on the fly that's that still allows you to use that Python ecosystem. So that's another big use case. It's like whatever you have to do to features when you're working in static data, you have to do that in real time. And so you're going to need state
Starting point is 00:18:14 because you're going to need to know like, I don't know, how many pickups did this Uber driver do in the last 30 minutes? And I'm going to feed that into my model to know if they're the person to recommend for the next one or whatever it might be for recommendation engines, personalization, et cetera. So that's a big like group of users or potential users, I guess.
Starting point is 00:18:35 Yeah, that's awesome. And like when you're talking about real time feature generation, like what is real time in terms of like time itself? Like, are we talking about milliseconds? Are we talking about seconds? Like what are the requirements when it comes like to a model feature? David PĂ©rez- I mean, I think that the, I don't know, there's an actual definition, I think of real time and sub second, but for a streaming
Starting point is 00:19:00 system, for a real-time feature transformation, I assume it would be talking about end-to-end latency. So you have from the moment a user interacts with something to the feature being then stored in the database where it can then be served. So some low-latency key value stored or whatever, that would be the total latency or that would be like the time. And I think you're probably going to be over a second in that.
Starting point is 00:19:32 I'm not sure, but you might be near a second, so you could, it could possibly be near real time and not actually real time. Yeah. Yeah. So how do you see like, what's the architecture there? Like we have the model somewhere and we need to fit it with, like, features that we create, like, in real time, right? And we have, like, the system where the user interacts with. This data will be, like, stored on something like the database, whatever.
Starting point is 00:20:01 But how do we get, like, byte works together with, like, the existing production how do we get like Byteworks together with like the existing production environment that we have? And we create, let's say a system where we can almost real time, real time generate features, feed them to the recommenders or whatever models we have there and do something with it. Like, how do you, how, how does this work? Alex Raucer- Yeah. To make it like a little bit more concrete i we we work a bit with the folks at
Starting point is 00:20:28 working on the feast project so i'm just okay keep along that vein yeah in fact if you byrex uh is you can use byrex as like a materialization engine with these so it like turns offline features into online features but anyway so the architecture looks like you have some streaming platform be it kafka or red panda or pulsar so you have some system there and upstream of that you have interactions with your service where you're collecting the data so that can be don't know, you can be ingesting from logs or you can be collecting telemetry data, whatever. It's going into Kafka. And then downstream of Kafka would be,
Starting point is 00:21:15 or Kafka alternative would be Bywax. So it's a consumer and it's listening to Kafka and then it's doing the transformations and then it's writing to Kafka and then it's, you know, doing the transformations and then it's writing to the online feature store, which is going to be like Redis or Dynanote, BTP, your Postgres or MySQL or something that can be rather low latency. And it's probably also writing out to the offline feature store. So that's like your data warehouse or data layer or some storage, some like
Starting point is 00:21:47 a more analytical storage engine. And that's so that it can be reused for retraining or determining new life for new models, et cetera. So I don't know if I answered the question at all there, but you'll have, most likely you'll have some orchestration layer that's managing some serving thing so that's where your model's action actually going to be loaded so maybe your model's pickled and it's stored in an object store or it's um it's part of a an image or something in some way shape or form, your model is loaded into memory
Starting point is 00:22:26 and it's in a service. Let's say it's in a pod in Kubernetes and it's service traffic. So it's like a microservice and a request will be made there. And once that request is made, that feature that has been gathered in real time has to be ready in that low latency database
Starting point is 00:22:45 because that model serving microservice will reach out for that feature set and use that in conjunction with whatever was just requested to actually make the prediction. So you have other things as well on top of that. Like MLOps is just like so many layers right now, which is what we're getting into, but at the very base you have like services with the model loaded into memory, you have some databases and you have like a streaming platform and you
Starting point is 00:23:15 have some compute processing system to process the data both in real time and in batch. GIO NAKPILIAKOUPOUEIRA OK. That was super insightful, to be honest. It's like a very common question that I ask like pretty much everyone who has to do something with ML. And I think it's like a common question for everyone who doesn't work in ML Ops, to be honest, because it is like a very complex environment out there. And it was great like to hear
Starting point is 00:23:47 of that and see couples who like understand like how features stores interact with different technologies. Just like us, Bytebox, you described. So that was, that was great and very, very helpful. Okay. Let's get like a little bit more into what makes ByteWolfs what it is, right? Like the technologies used, like you mentioned something about like tiny data flow at the beginning, which is like a different computation model.
Starting point is 00:24:15 What else is out there? I've seen like, it's like you've got baseline Python together with Rust. Yeah. So tell us a little bit more about the choices there and like the pros and the cons and like how did you end up like using, making the choices that you made?
Starting point is 00:24:29 Yeah. So I'm going to go back to when we pivoted from the machine learning platform. It was a hosted thing. And when we pivoted, we had some frustrations with the hosted thing
Starting point is 00:24:42 because it was very difficult to get people to send data into our hosted environment. to run it in their environment so we could have a different go-to-market motion that wasn't about the host platform. We decided, okay, let's look at our options. And we found Timely Dataflow. And we were like, oh, this is really interesting because it can run as a single thread on a local machine. So we can give people the ability
Starting point is 00:25:22 to write these stream processing workflows in a way that they can then run on their local machine. And it can also scale up to many different workers. So that was part of the reason why we chose Timely Dataflow. And it's a cool project because it uses a little different approach in the architecture in comparison to the things you're more familiar with like Spark. Anyway, so yeah, Timely Dataflow is an awesome library. The person who created it, Frank McSherry, I don't know if you've had him on the podcast. I think his title is Chief Science Officer at something like that and materialize, but he was part of the team that started materialize and materialize uses timely data flow, but also uses something called differential data flow. and it's based off this project called NIAID, NIAID Timely Data Flow, which was a Microsoft research project back in the day.
Starting point is 00:26:31 Like I said, it's a Rust library. We wanted to provide a Python framework. So we were like, okay, we need to figure out how to make some sort of like FFI. We found, it just so happens, there's a library called Py03, which allows you to kind of marry Rust and Python. So we were able to marry the Timely Dataflow library into this Python framework. And so in actuality, the execution and runtime is rust and it's timely data flow and then you have
Starting point is 00:27:08 like your processing functions that are written in python and what's what is what is really cool about the whole like thing is we have the ability to like move different things down to the rust layer if we want for performance so say for example like recently when we we released like a kafka input configuration capability so you don't have to like write all the kafka consumer code we made that in the rust layer and the thought was okay cool if we do the input and the output in the Rust layer and we provide some serialization in that Rust layer. In essence, you can move data from one place to another without having to serialize input into Python, but you could still interact with it as if it was like Python code, so you can get some performance benefits there.
Starting point is 00:27:58 But yeah, PyO3 is really cool library. And the reason I think it's really interesting is much like how Python leverages C and there's like to speed up the code, you can use PyO3 in a similar way to speed up your Python code. And you can get like really insane performance boosts like 20, 30x sometimes, which is like, yeah, which is incredible.
Starting point is 00:28:24 And PyO3 is your gateway to that like great performance hit to keep you on the journey. So how does like byte works work when you want to start using the Python ecosystem, right? Like, okay, I get it. Like I can write Python and this Python will have some by means like to the Rust code and this Rust code is going to be like the more native, let's say, code is
Starting point is 00:28:49 going to be executed from that, but that's one thing and it's another thing when you start adding all the different libraries around, right? Like how does this work, especially like in a streaming environment, right? David PĂ©rez- Yeah. So sometimes you have to pass around types. But so PyO3 allows you to compile Rust and use it from Python. It also allows you to interpret Python in Rust, in Rust execution.
Starting point is 00:29:21 So Timely Dataflow has this concept of operators, which you're, you might be familiar with in other frameworks where you have things like map and filter and et cetera, et cetera. And so there's like some code around that, that that's how you like control flow of data through, through the, the data flow in those operators, you pass, you can pass them Python functions. And so those Python functions are running as Python in they're being run from the interpreter, the PyPI interpreter, and they're running like on the REST execution layer. So you're not, you will see, you'll have to take a hint from serializing data from a rust type
Starting point is 00:30:06 potentially to a python type and and vice versa but then you're using the native python types and in it or i don't want to get people confused with a native you're using the python types with the python code and so that's why many of these libraries will just work and you're not kind of constrained in the UDF world, like the user defined function world, where it's, you know, if that, if that library doesn't exist, you just don't have access to it. Yeah. Makes sense. And like, for example, like I, like in Python, you have like people, for
Starting point is 00:30:41 example, like using like frameworks, like pandas, right. Like to, to interact and like work with data. Can I use pandas and like interact with the pandas data frame and use that like as the way that I work with data inside by the works right now? Yeah. So, I mean, that's one of the great like things to use the stateful map with. So if I'm like accumulating data from, you know, on a per user basis over a certain amount of time, I can like stack that into a data frame and then I can
Starting point is 00:31:11 actually pass that data frame onto the next step and then I can do whatever compute function there I could do it in. Earlier stuff, but I could also pass it on. Yeah. I mean, it's fun. It's really fun to use it that way because you get like because you get access to all the stuff that you want to use. You don't have to think about it because you're like, I've been using this for X number of years. I know the API.
Starting point is 00:31:36 100%. No, no, no. That's a very, very important point, in my opinion. Inventing a new API and having to educate everyone to use the new API, it just takes forever. It's not easy. And people choose one or the other for reasons.
Starting point is 00:31:57 Pandas is being used and it's good enough. Why someone would change that? Now, does it have limitations that they are not related to the API itself? Yeah, of course. That's why you see stuff like PySpark, for example, where you have a Pandas API, but the backend that does the computation on the bug is like Spark, so you can scale
Starting point is 00:32:14 like on a lot of data, right? Yeah. Or you'd like something like Bytebooks and do like streaming processing, like on the backend. But I think, and just to go back to what we were discussing at the beginning about the previous generation of, let's say,
Starting point is 00:32:31 streaming processing environments and the difference in product in the developer experience, I think that's a very big differentiator. When you had tools like Storm or Sling or whatever, they were imposing a way and an API too. It wasn't just what was happening with the system behind.
Starting point is 00:32:51 Now we see a big shift, I think, with developer productivity tools and all the systems that we are building. The system, we try to fit it to what the developer knows. Let's get the API that has proved its semantics and people know how to use it. And let's work behind the scenes and change the things to make them better. And I think that's amazing, in my opinion. It's great. I love to see projects doing that. David PĂ©rez- Yeah.
Starting point is 00:33:24 It's always the magic feeling, right? Like when the tool meets your expectation for the experience, that's when you feel the magic of like, yeah, this just worked. Cool. Yeah. Well, yeah. And that was the feeling I got when ByteWax was ready to use, and I was making a data flow and I could just use
Starting point is 00:33:47 the Python library as I wanted to. Like, I always play around with this library called River. It's like a Python machine learning library for, like, online learning. And it just,
Starting point is 00:33:57 so fun because it just, like, works. You just import it, do some anomaly detection, and it just works. It's really fun. Yeah. That's cool.
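River's actual API differs, but the "learn one, score one" online pattern being described can be shown with a stdlib-only toy detector: a running mean and variance (Welford's algorithm) that flags values that sit far from what the stream has seen so far. Everything here is invented for illustration; it is not River code.

```python
import math

class RunningZScore:
    """Toy online anomaly scorer in the learn_one/score_one style."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def learn_one(self, x):
        # Update the running statistics with one observation.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def score_one(self, x):
        # How many standard deviations x sits from the running mean.
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std else 0.0

detector = RunningZScore()
stream = [10.0, 10.2, 9.9, 10.1, 10.0, 42.0]  # last value is a spike
scores = []
for x in stream:
    scores.append(detector.score_one(x))  # score first...
    detector.learn_one(x)                 # ...then learn, one at a time
print(scores[-1] > 3)  # True: the spike scores far above the rest
```

The appeal in the conversation is exactly this shape: each element is scored and learned one at a time, so the same code works whether the values come from a list or a live stream.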
Starting point is 00:34:09 All right. I have two more questions before I give the microphone back to Eric. So the first one, what you mentioned is Materialize. Materialize is again a system that has been built on top of timely data flow, but it's based on SQL. What's the difference between a system like ByteWax and Materialize? Yeah. So at the very bottom, they're both using timely data flow.
Starting point is 00:34:39 So, but the timely data flow aspect is more about managing the flow of data. Both companies, Materialize and ByteWax, have exposed different ways to interact with that data. So the effort behind Materialize is to expose a SQL layer so that it's more portable. And I think you can use... I don't know if they're at this point yet, but they're nearly at a point where you can interact with it like it's a Postgres database.
Starting point is 00:35:17 I think you can use psycopg2 and you can just write queries against it. They had to do a bunch of work between timely data flow and the user interacting with it with SQL. And there's another project called differential data flow that incorporates timely and then builds on top of it to kind of handle a bunch of that: converting the SQL queries into what that data flow would actually look like.
Starting point is 00:35:48 Yeah, so similarities are at the data flow level. And then we kind of diverge, right? But they have done really interesting work. And I don't know, if you can, I would say you should talk to Frank, Frank McSherry. Get him on the podcast, because he would be able to speak about it. Yeah, absolutely.
Starting point is 00:36:14 In a more elegant way than I can. Yeah, yeah, absolutely. No, that's something that we should definitely do. Okay. And my last question, what's next? What's the roadmap for ByteWax and what excites you there? Yeah. So ByteWax is both an open source project and a company,
Starting point is 00:36:36 and that company has to be sustainable. So we have to make ByteWax into a commercially viable product as well as an open source project. That's something that will come in the future, not the most immediate. More immediately, it's just about more adoption. ByteWax has many of the things you need to build advanced analysis on streams, and so it's, you know, getting it out there, getting the word out there,
Starting point is 00:37:06 awareness that the project exists, and getting some adoption. That's the more immediate future. And then we'll turn to how we can add additional features that make it easier to run ByteWax at scale, to integrate ByteWax with organizations' existing infrastructure and other systems, and provide a paid version of ByteWax as well.
Starting point is 00:37:31 One other piece that's on the roadmap for the next little bit: we positioned ByteWax as a library for doing advanced analysis on streaming data, for machine learning use cases or cyber security and stuff like that. We've had a lot of people we've spoken to, or who started to use ByteWax, tell us, oh yeah, that's really interesting, and it's on our roadmap, but it's a year out, or 18 months out, or nine months out. What we need to solve today is we just need to ingest data, or move it from one place to another. Which you can deal with in ByteWax, but, you know, writing the code yourself. So we might experiment a little bit on how we could potentially serve those users as well, and make some effort in building a few different connections between
Starting point is 00:38:27 different sources of data so you can, you know, get started with that really easily and maybe just set some configurations and then deploy it. Interesting. It's super exciting. I think we have many reasons to record another episode in, I don't know, a quarter or two from now. So looking forward to that. Eric, it's all yours.
Starting point is 00:38:49 Oh, why thank you, Costas. My pleasure. I'd actually like to follow up with what you just talked about in terms of different users and the way they're using ByteWax, but also return to something you mentioned about your original mission, you know, when you came out of GitHub and started ByteWax, which was to sort of almost democratize certain parts of, you know, this ML workflow for smaller companies, right? Who didn't have a whole team to actually manage the infrastructure side of the house when it came to like the ML side of it.
Starting point is 00:39:27 And one thing that was interesting to me, and, you know, this could just be the top things that came to mind, but a lot of the use cases you mentioned or examples you gave tend to be things that companies at scale are really interested in, but the smaller companies can't do. So, for example, like computing features, you know, pushing them to an in-memory store so that they can be served at, you know, second-level or slightly sub-second latency. That's a huge use case, right? Because delivering recommendations like that, you know, changing conversion by one percent, can mean huge stuff for the bottom line. But in the spirit of democratization, what does that look like for what ByteWax is now? And is that sort of basic data pipeline use case how you see that? I'd just love to know, do you feel like you've sort of retained that original mission, or
Starting point is 00:40:32 that component of your mission? I mean, to some extent, the reason we have a Python framework is some of that democratization. I mean, it's one of the most widely used programming languages, and it is like the de facto language of data. Yep. If you exclude SQL. I mean, is that a programming language or... yeah. But I think that we're at a little different layer of companies than we were targeting before, potentially. Yeah, I'm not 100% sure. The medium enterprise or whatever hasn't gotten to that level of sophistication where they were actually building models and deploying them and all that. And so we were maybe a little bit ahead of where the market was. Yeah. So we weren't able to democratize anyone, because no one was thinking to be democratized, you know. Yeah, yeah.
Starting point is 00:41:41 And if we were to look at streaming, I think you're pointing out something that's interesting: we're still focused on more advanced stuff, and we're still kind of ahead of it in terms of the broader market. The broader market is trying to do more ingestion and some transformation, and move data around and get it into a centralized thing. But what I think we can do there to try and democratize is try and build infrastructure that makes it easy. Like tools, I guess, that make it easy for people to use data flows to also do that ingestion part.
Starting point is 00:42:20 I did a, there's an open source spotlight thing that Data Talks Club does, and it comes out next week. And I do an example demo where you can use ByteWax to take server logs. It was something we did at GitHub. We didn't have telemetry across our products, and we wanted to know what was going on. So we would just ingest data from web server logs. And we would, it was either there or at the load balancer, but anyway, we'd take that kind of data, request data, and like put it
Starting point is 00:42:51 in the data warehouse so that we could query it and then derive insights. So I've made a demo doing that with ByteWax and this tool called DuckDB. And maybe my spirit was in democratization. It was like, here's the lightest-weight thing you can possibly use to take data and understand what's going on with your product. And yeah, I don't know, it was fun. Maybe it's not what anyone will actually use, but I hope we can help more people use streaming data. That is kind of why we're building ByteWax: we want more people to be able to do it. Yeah, totally. I mean, I would say, you know, it sounds like giving people easier access to leverage streams in Python, in itself, to your point, is a big part of that, right?
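The demo described here can be sketched with only the standard library: parse web server log lines, count requests per path, and land the counts somewhere queryable. sqlite3 stands in for DuckDB so the example is self-contained, and the log lines, table name, and regex are invented for illustration; the real demo uses ByteWax for the dataflow.

```python
import re
import sqlite3
from collections import Counter

# Pull "METHOD /path" out of a combined-log-format request line.
LOG_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP')

def parse_path(line):
    m = LOG_LINE.search(line)
    return m.group("path") if m else None

logs = [
    '1.2.3.4 - - [09/Nov/2022] "GET /pricing HTTP/1.1" 200 512',
    '1.2.3.4 - - [09/Nov/2022] "GET /docs HTTP/1.1" 200 2048',
    '5.6.7.8 - - [09/Nov/2022] "GET /pricing HTTP/1.1" 200 512',
]

# "Stream" the lines through a running count, then sink to the store.
counts = Counter(p for p in map(parse_path, logs) if p)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (path TEXT, hits INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", counts.items())

# Now the "warehouse" is queryable for insight, as in the demo.
top = conn.execute(
    "SELECT path, hits FROM page_views ORDER BY hits DESC LIMIT 1"
).fetchone()
print(top)  # ('/pricing', 2)
```

The shape matches the story in the episode: ingest request data, aggregate it, and put it somewhere you can query to derive insights.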
Starting point is 00:43:46 Like, I mean, if you think about someone who wants to do something interesting with a Kafka stream, you know, especially if that Kafka stream is managed by a completely different team, right? I mean, that's really hard to do at a lot of companies, right? You know, either from a technical standpoint or, you know, a cross-functional standpoint or whatever. So yeah, that's super interesting. Well, for our listeners, we should have mentioned this earlier, but where can they go to learn more about ByteWax or try it out? Yeah. So we have
Starting point is 00:44:20 our website, bytewax.io, B-Y-T-E-W-A-X.io. There's like the documentation. There's a bunch of examples and API documentation. And then the GitHub repo. Also, we have like a ton of examples in there. So right now, those are the two places to go and try it out. You can pip install it and then just run one of the examples and get a feel for how it works.
Starting point is 00:44:46 I have a bunch of blogs too that have tutorials in them and they're more like positioned for like certain use cases. So there's like a cybersecurity one and there's like a level two, like order book from an exchange kind of one and anomaly detection for IoT. So if you have like more of an application, maybe that's the better place to start there
Starting point is 00:45:08 and look and see if it's solved there, and if it's not, go dig through the examples and learn how to do it. And obviously come to the Slack channel. We have a Slack channel where you can join, and people dump in their, you know, problems they're working through, and we're happy to work through them.
Starting point is 00:45:26 I love doing that. So I always find it fun to build a data flow. So that's another place to get started. Yeah, I love it. Well, Zander, thank you so much for taking the time. Man, we learned so much. It was so fascinating. I loved hearing you and Costas break down all the technical details.
Starting point is 00:45:47 So super fun for me. And yeah, we'd love to have you back on the show as ByteWax grows and hear how things are going. Cool. Yeah. Thank you for having me. I hope we didn't go like too deep down the rabbit hole there, but you can never go too deep down the rabbit hole on the Data Stack Show. We're definitely involved in the stack. At that point, we were at the bottom of the stack.
Starting point is 00:46:09 Yeah. Yeah. All right. Well, thanks again. Cool. Yeah. Thanks, guys. It's great to chat.
Starting point is 00:46:16 Thank you, Zander. It was great. Super interesting. One interesting takeaway I have is that it seems like there is a proliferation in the ways that you can do stream processing, which is really interesting and, I think, really good for the industry, right? Like with Materialize, as you talked about a little bit on the show, you can, you know, accomplish something similar with SQL, right? You know, you have traditional streaming tools like Kafka, right? And with Confluent, you know, you can do interesting things with transformation and stream processing or whatever. And so it is
Starting point is 00:46:55 interesting. It seems like in general, if you look at all these technologies, they're responding to a big need in the market related to stream processing. You know, it's just really interesting. You can sort of access it through different interfaces, right, or developer experiences, depending on the tool. So do you think that that's true? Like, why is that? If you think about it, stream processing has been around for a long time, but there seem to be a lot of new technologies cropping up around it.
Starting point is 00:47:21 Yeah. Well, and I think, okay, to be honest, I think there's some kind of explosion in the solutions that have to do with interacting with data in general, so it's not just streaming. But I think stream processing was one of these paradigms that was really struggling with developer experience. It's just a very different type of data to work with, and the ways and the APIs that existed in the past, and still exist, are different in terms of what most of the developers out there are used to on the user
Starting point is 00:48:01 end. So what I find extremely interesting is how much attention is being paid to the developer experience right now. You see that we build a new system not just because we have a new super performant way of doing things, but primarily because we want to make things easier for developers to work with and more accessible to more developers out there.
Starting point is 00:48:30 Yep. So I think that's going to be like a very common theme across new products. And if you remember when we had that panel about streaming, there was like a discussion about the developer experience. Actually, I think the most important topic that all of the panelists talked about was the developer experience. Yep. So, and we see this happening right now. We see like new systems coming up that primarily focus on like the developer experience. Yep.
Starting point is 00:49:00 Totally agree. Well, maybe we should do another streaming panel. Because you kind of have like a battle of different ecosystems, right? Like SQL versus Python, etc. So maybe we can have Brooks have like a battle with streaming, which would be great. But that's all the time we have for today. Thank you, as always, for listening. Subscribe if you haven't, and we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to
Starting point is 00:49:30 subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
