The Data Stack Show - 112: Python Native Stream Processing with Zander Matheson of bytewax
Episode Date: November 9, 2022

Highlights from this week’s conversation include: Zander’s background and career journey (2:32), Introducing bytewax (5:16), The difference between systems (10:57), Bytewax’s most common use cases (16:15), How bytewax integrates with other systems (20:25), The technology that makes up bytewax (24:31), Comparing bytewax to other systems (34:17), What’s next for bytewax (36:31). Try it out: bytewax.io

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show.
Today, we're going to talk with Zander from ByteWax.
And Costas, I love the topic of streaming,
which won't surprise you at all.
But this is really interesting.
We've actually had stream processing type technologies on the show, actually multiple. However, ByteWax is stream processing in the Python ecosystem,
which is really interesting.
Actually makes a ton of sense, you know,
just at a high level when you think about, you know,
how prevalent Python is in terms of data.
And my question is not going to surprise you at all,
but I want to know where the motivation came from.
So Zander was at Heroku and GitHub working as a data scientist, and so there was some sort of need, I'm sure, that he saw at those companies that motivated him to start ByteWax and do stream processing and some MLOps stuff in Python.
So that's what I'm going to ask you about,
and I'm sure you have some technical questions.
So tell me what you're going to dig into.
Yeah, I'd love to go deeper into the technical side of things and see how it is built and what technical decisions were made there.
And one thing that I definitely want to discuss with him is like, what's the
difference between, let's say streaming processing system that is built in 2022 compared to the previous generation of streaming engines like Sling and Storm and Spark streaming and like all these frameworks that they have been around like for a while.
So, yeah, I'm very excited about it.
I'd love to see like, what, what's the difference there?
Like what's new.
Totally.
All right.
Well, let's dig in and chat with Xander.
Let's do it.
Zander, welcome to the Data Stack Show.
We're super excited to chat.
Thanks.
I'm excited to be here and thanks for having me.
Totally.
Okay.
Well, tell us about your background.
Lots of data science type stuff, but yeah, tell us where you came
from and what you're doing today.
Yeah, I'll give a little bit more of my background. So for the last, I don't know, five or six years, I've been working as a data scientist. I was at GitHub, and before that, Heroku. But before that, I have a little bit of a mixed path on how I ended up here.
I was actually a civil engineer.
That's what I went to school for and worked as a civil engineer.
Then I went to business school.
And after business school, I got into working in technology.
And I worked at a company, a speech technology company, so doing speech recognition, text-to-speech.
And I was supposed to do like biz staff and sales.
And I liked the technology more than I liked doing that.
And I was better at working with technology.
So that's how I kind of ended up in where I am today.
It was right around the time when people started using neural networks for speech recognition.
And I was just like, I thought it was so cool.
And that's what led me into the data side of things.
Yeah, very cool. Okay, so I have a question; I'm always interested in this on the show. I don't know if I've asked this in a while, but it always fascinates me when people's careers change, which is very common.
But are there any lessons
from like your study or work
in civil engineering
that you took with you into data science?
You know, because they're both technical
in their own way,
but like pretty different.
So yeah, I'd love to know if there are any lessons you took.

Yeah, so when I was working as a civil engineer, I was working with hydrology models, which are stochastic models. And so there's a lot of similar basis and theoretical work to both of these things, so there was a takeaway there. In my general education for civil engineering, we did some programming; it was just part of what we did, but not a lot. So that's kind of, I guess, from the technical point of things. But decomposition of problems is just a general concept. And being able to analyze things regardless is a skill, I guess, that transfers. I don't know, maybe.
Yes.
Yeah, that's great. I love that.
It's kind of like pipelines, you know, in the context of civil engineering; pipelines for data is great.
Yeah.
It's a great connection there.
Okay.
Well, tell us about ByteWax.
Sure.
Yeah.
So ByteWax today, there's ByteWax the company, and then there's ByteWax the open source project. So ByteWax, the open source project, is a Python framework for data processing.
And it's focused on processing streams.
So stream processing or stateful stream processing is kind of like the focus of the framework.
And so you can think of it similarly to Flink or Spark Structured Streaming or Kafka Streams, and that you can do
more complex or advanced analysis on streaming data. So you can do aggregations, you can do
windowing, et cetera, et cetera. And then it's in Python, it's Python native. So you can use like the Python ecosystem to build these stream processing workflows.
And with ByteWax, those are called data flows.
And that comes from the data flow, like a compute paradigm
that is underneath ByteWax.
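To make the dataflow idea concrete, here is a toy sketch in plain Python of a pipeline of operators that each event flows through. This is not the ByteWax API; the function names and events are made up purely to illustrate the compute paradigm.

```python
# A toy illustration of the dataflow model: a pipeline of operators
# (parse, filter, transform) that each streaming event passes through.
# This is NOT the bytewax API -- just a sketch of the paradigm.

def build_dataflow(steps):
    """Compose a list of operator functions into one pipeline."""
    def run(events):
        out = []
        for event in events:
            item = event
            for step in steps:
                item = step(item)
                if item is None:  # a filter operator dropped the event
                    break
            else:
                out.append(item)
        return out
    return run

# Operators are plain Python functions, like the map/filter steps
# you would attach to a dataflow.
parse = lambda line: dict(zip(("user", "clicks"), line.split(",")))
keep_active = lambda e: e if int(e["clicks"]) > 0 else None
to_pair = lambda e: (e["user"], int(e["clicks"]))

flow = build_dataflow([parse, keep_active, to_pair])
print(flow(["alice,3", "bob,0", "carol,5"]))  # [('alice', 3), ('carol', 5)]
```

In the real framework the pipeline would consume from a live stream rather than a list, but the shape of the program is the same: a chain of small functions the engine pushes each event through.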
Tell us about the company as well.
Yeah.
So ByteWax the company, we started actually in the beginning of 2021. And we started ByteWax to bring to market a hosted machine learning platform; that was the idea. We had worked on this problem at GitHub, and a few of us left GitHub to build that. And we had an idea that
eventually led to this pivot where we started building a stream processing framework. And
the idea was that all machine learning, when you're running a machine learning model in real
time, you have this idea of pre-processing data for features or whatever, joining additional data on, and then you run the inference
with the model, and then you'll do some post-processing potentially.
And so we made a framework where you could create a DAG in Python to do these different
steps, and then you would just deploy them.
And we would go scaffold up some infrastructure to run this on Kubernetes, and you could scale
them and everything.
And that was the initial idea.
And people started using it to process data in real time without doing any machine learning.
And it wasn't really meant for that because it didn't have like state built in.
So we ended up just kind of scrapping a bunch of that and restarting with a different execution underneath.
Fascinating.
Okay, so let's dig into that a little bit.
So the first part, so you and several people from GitHub
were building a platform for machine learning,
which is interesting.
We've actually heard that on the show a ton, right?
And we talked to people with machine learning
and it's kind of like, well, tell us about MLOps, and everyone's like, oh, there are a lot of things that have gotten better, but there's so much pain there as well. So that was sort of the goal: MLOps pain, let's go solve that. And then people just started using it for a data flow that didn't include the machine learning part?
Yeah.
I mean, yeah, that's a simplification, I think, in some sense, but it captures the essence. Yeah, it was.
So four or five years ago, when I was at GitHub, we were working on a machine learning platform. We did a bunch of work to extend Kubernetes
and make our own custom operators and resource definitions
so that you could run training jobs and do hyperparameter stuff
and do all this MLOps stuff on Kubernetes.
And so we took some of those ideas and built them into a platform. The idea was that smaller businesses, without the larger teams to build all that ops infrastructure, could just say: ByteWax, deploy my workflow. And then we kind of handle the rest and give you metrics and alerting and scaling and stuff like that.
Super interesting. Okay, so the next question is: it's always fascinating to me when you build something and then people use it in a way that you didn't expect. Why do you think users were using it in the way that they were? Because obviously there was a component that met a need
that was unmet, and they're like,
oh, this is a better way to do this.
What were the driving factors or use cases
or technical limitations they were facing
that initial version of ByteWax unlocked
in terms of the streaming piece of it?
Yeah.
So many of these groups were still going to use some machine learning down the road, but there was not a really good way in Python to do the feature transformations they needed to do in real time against streams.
And so they were trying to transform data in real time for that aspect.
That was it. So in Python, there wasn't an easy way to essentially hook up to Kafka and then use some Python tools, whatever it is, NumPy, et cetera, to transform that data. And then you have a feature set that you will then serve
in real time to the model. Super interesting. Okay. Costas, I have a hundred more questions, but please jump in here because there's
so much to talk about and I know you have some great questions brewing on the technical side.
Yeah.
Yeah.
So, Zander, you mentioned at the beginning some frameworks for
streaming processing, like Flink, which is probably the most well-known one.
There have been a few others in the past, like we've got Storm from Twitter.
There's another one.
SAMSA.
Yeah, SAMSA.
There have been quite a few attempts in building streaming processing systems, right?
What's the difference between a system like ByteWax today compared to, let's say, a paradigm that was implemented probably a decade ago? Because all these tools started to appear pretty much at the end of the 2000s, beginning of the 2010s, right?
So what's the difference there?
Yeah, I think many of those... I mean, I'm not an expert in all those systems, so I could be missing things, but in many of these systems, there's a series of trade-offs made for correctness, latency, scalability, et cetera.
And so I think many of these systems take different approaches, trading one thing for another potentially, and that may lead to them being dropped off and something new coming along.
Other things may have, you know,
like we are like software engineers, et cetera.
We are like a little bit like trend followers.
And so some things seem to like fall out of trend and then they don't like gain the adoption.
So certain like ecosystems of tools that, you know, they were integrated with or worked well with maybe, you know, fell by the wayside and new systems came up
and those new technologies kind of worked better with those systems.
So, like, mid-2014, 2015 or whatever, Spark was becoming a thing
and people were moving away from some of the different MapReduce frameworks
and using Spark for different things.
So I don't know.
It seems like there are these trends,
and sometimes it's like a trade-off.
You make an architectural decision
that lets you have better performance,
but it's like you're giving up correctness.
Or maybe, to add state in a different way, or to persist state or checkpoint things in state, you give up some performance.
So there's all these trade-offs.
And I think that maybe that leads to a plethora of different services.
Yeah, makes a lot of sense.
So someone today who has to choose between, let's say, Flink, because Flink is quite dominant in stateful stream processing, and ByteWax, right? Why would they choose one or the other? Why would I go and work with ByteWax instead of Flink?
Yeah, so I mean, it comes
back to those trade-offs again.
Flink is a more mature product
and you have SQL,
you can use Flink, and there are many different
bindings for languages, and they've
built out a bunch of features
that ByteWax doesn't have.
But
the trade-off that we are
sort of investing in there is it's a lot more involved
to get up and running and started with Flink for like a big group of users that maybe don't have
the experience with that whole Java ecosystem, you know, tuning the JVM, et cetera, et cetera.
And so our thought is like, we'll make a trade-off of maybe not having the exact same robustness that Flink has today,
but we'll give an experience
where it's a lot easier to get started
and quicker to get started
for this subset of users that are
maybe a newer group
of users, like machine learning engineers,
data scientists, or newer data engineers.
So that's the trade-off
that we've been playing in with the ecosystem.
Yeah. Yeah. Yeah. Developer experience is quite important.
Yeah.
Yeah. I mean, everyone who has had, at some point, to work with systems like this, or maintain these systems, I think they have many horror stories too. It's not easy. I mean, okay, it's easy to... Let's take Kafka, for example. Getting a couple of Docker images and running them locally, yeah, it works. But from that point, to take it to production and keep it in production, it's a huge, huge distance in terms of the complexity of all the operations that are involved there. So I totally understand what you mean by the experience of the developer there.
And what are the use cases? Like, for the users of ByteWax today. You shared a very interesting story of how you started from machine learning, saw that people were using it for other reasons, and ended up creating ByteWax. What are the most common use cases that you see there where ByteWax is involved?
Yeah.
So you can align them maybe to applications. So IoT, security, the financial industry, and there are many different use cases within those applications. So you have anomaly detection for things like fraud or malfunctioning usage, and people can then make some decisions based on that.
One thing I'll say about ByteWax is it's a very generally applicable framework. I mean, it is a data processing tool for literally any set of data, which makes it also very
difficult sometimes to position it correctly.
But yeah, so there's applications, like I said, in those industries, and there are various
different use cases.
And then, coming back to what I was talking about for machine learning and how we ended up with this stream processing framework: that application is around features. So today, when you train a machine learning model, you're going to generate a feature set in some data set, and then you're going to use that to train the model.
And you're like, that's sweet.
I'm getting great performance.
I have these awesome features.
And then you need to use that somehow in production.
And so there's been all this like tooling around that you have like Feast,
which is a open source thing from Tekton and many other feature stores, et cetera.
But you don't really have a good tool for creating those features on the fly that still allows you to use that Python ecosystem.
So that's another big use case.
It's like whatever you have to do to features
when you're working in static data,
you have to do that in real time.
And so you're going to need state
because you're going to need to know like,
I don't know, how many pickups did this Uber driver do
in the last 30 minutes?
And I'm going to feed that into my model
to know if they're the person to recommend for the next one
or whatever it might be for recommendation
engines, personalization, et cetera.
So that's a big like group of users or potential users, I guess.
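As a rough illustration of that kind of stateful feature, here is a plain-Python sketch of counting a driver's pickups over a trailing 30-minute window. The class and field names are hypothetical; in a real deployment this state would live inside the stream processor, not a standalone object.

```python
# Hypothetical sketch of a windowed stateful feature: "pickups in the
# last 30 minutes" per driver, computed as events arrive.
from collections import defaultdict, deque

WINDOW_SECS = 30 * 60  # trailing 30-minute window

class PickupCounter:
    def __init__(self):
        # per-driver state: timestamps of recent pickups
        self.events = defaultdict(deque)

    def record(self, driver_id, ts):
        """Add a pickup at time ts and return the in-window count."""
        q = self.events[driver_id]
        q.append(ts)
        while q and ts - q[0] > WINDOW_SECS:
            q.popleft()  # evict pickups older than the window
        return len(q)

counter = PickupCounter()
counter.record("driver-1", 0)            # first pickup
print(counter.record("driver-1", 600))   # 600s later, both in window -> 2
print(counter.record("driver-1", 2000))  # pickup at t=0 aged out -> 2
```

The returned count is exactly the kind of value the model-serving side would later read back from the online feature store.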
Yeah, that's awesome.
And like when you're talking about real time feature generation, like what is real
time in terms of like time itself?
Like, are we talking about milliseconds?
Are we talking about seconds?
Like what are the requirements when it comes like to a model feature?
I mean, I don't know if there's an actual definition of real time as sub-second, but for a streaming system, for a real-time feature transformation, I assume we would be talking about end-to-end latency.
So you have from the moment a user interacts with something to the feature being then stored in the database
where it can then be served.
So some low-latency key value stored or whatever,
that would be the total latency
or that would be like the time.
And I think you're probably going to be over a second in that.
I'm not sure, but you might be near a second, so it could possibly be near real time and not actually real time.
Yeah.
Yeah.
So what's the architecture there? Like, we have the model somewhere and we need to feed it with features that we create in real time, right?
And we have, like, the system where the user interacts with.
This data will be, like, stored on something like the database, whatever.
But how do we get ByteWax together with the existing production environment that we have?
And we create, let's say a system where we can almost real time, real time generate
features, feed them to the recommenders or whatever models we have there and do
something with it.
Like, how do you, how, how does this work?
Yeah.
To make it a little bit more concrete: we work a bit with the folks working on the Feast project, so I'll keep along that vein. In fact, you can use ByteWax as a materialization engine with Feast, so it turns offline features into online features. But anyway, the architecture looks like this: you have some streaming platform, be it Kafka or Redpanda or Pulsar. So you have some system there, and upstream of that you have interactions with your service where you're collecting the data. That can be, I don't know, you can be ingesting from logs or you can be collecting telemetry data, whatever.
It's going into Kafka.
And then downstream of Kafka, or the Kafka alternative, would be ByteWax. So it's a consumer, and it's listening to Kafka, and then it's doing the transformations, and then it's writing to the online feature store, which is going to be like Redis or DynamoDB, or your Postgres or MySQL, or something that can be rather low latency.
And it's probably also writing out to the offline feature store.
So that's like your data warehouse or data layer or some storage, some like
a more analytical storage engine.
And that's so it can be reused for retraining or for new models, et cetera.
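A minimal sketch of that pipeline, with in-memory stand-ins for Kafka and the two feature stores. All names here are illustrative, not any real API.

```python
# Sketch of the architecture just described: events come off a stream
# (Kafka stood in by a list), get transformed, and the features land in
# both an "online" store (fast key-value) and an "offline" store
# (append-only history for retraining).

online_store = {}    # stand-in for Redis/DynamoDB: latest features by key
offline_store = []   # stand-in for the warehouse: full history

def transform(event):
    """Feature transformation step, e.g. deriving a feature from raw fields."""
    return {"user": event["user"], "amount_usd": event["amount_cents"] / 100}

def process(stream):
    for event in stream:            # in production: a Kafka consumer loop
        features = transform(event)
        online_store[features["user"]] = features  # serve at low latency
        offline_store.append(features)             # keep for retraining

process([{"user": "alice", "amount_cents": 1250},
         {"user": "alice", "amount_cents": 300}])
print(online_store["alice"])  # {'user': 'alice', 'amount_usd': 3.0}
print(len(offline_store))     # 2
```

Note the split: the online store only keeps the latest value per key, while the offline store keeps every row, which mirrors the serving-versus-retraining distinction in the transcript.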
So I don't know if I answered the question at all there, but most likely you'll have some orchestration layer that's managing some serving thing. That's where your model is actually going to be loaded. So maybe your model is pickled and stored in an object store, or it's part of an image. In some way, shape, or form, your model is loaded into memory and it's in a service. Let's say it's in a pod in Kubernetes and it's serving traffic.
So it's like a microservice
and a request will be made there.
And once that request is made,
that feature that has been gathered in real time
has to be ready in that low latency database
because that model serving microservice will reach out for that feature set
and use that in conjunction with whatever was just requested
to actually make the prediction.
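The serving path just described might look roughly like this, with a stubbed model and a dict standing in for the low-latency feature store. Everything here is hypothetical, sketched only to show the request-time lookup.

```python
# Sketch of the model-serving microservice path: per request, fetch the
# precomputed real-time features from the low-latency store and combine
# them with the request payload to run inference.

feature_store = {"driver-1": {"pickups_last_30m": 4}}  # filled by the stream job

def model_predict(features, request):
    # stand-in for a real model loaded into memory (e.g. an unpickled model)
    return features["pickups_last_30m"] >= 3 and request["distance_km"] < 5

def handle_request(request):
    """What the serving microservice does for each incoming request."""
    features = feature_store[request["driver_id"]]  # low-latency lookup
    return {"recommend": model_predict(features, request)}

print(handle_request({"driver_id": "driver-1", "distance_km": 2}))
# {'recommend': True}
```

The key point from the transcript survives even in this toy version: the feature has to already be sitting in the fast store by the time the request arrives.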
So you have other things as well on top of that.
Like, MLOps is just so many layers right now, which is what we're getting into. But at the very base, you have services with the model loaded into memory, you have some databases, you have a streaming platform, and you have some compute processing system to process the data both in real time and in batch.
OK.
That was super insightful, to be honest.
It's like a very common question that I ask like pretty much everyone
who has to do something with ML.
And I think it's like a common question for everyone who doesn't work in
ML Ops, to be honest, because it is like a very complex environment out there.
And it was great to hear all of that and see how feature stores interact with different technologies, just like the ByteWax setup you described. So that was great and very, very helpful.
Okay.
Let's get a little bit more into what makes ByteWax what it is, right? Like the technologies used. You mentioned something about Timely Dataflow at the beginning, which is a different computation model. What else is out there? I've seen you've got Python together with Rust.
Yeah.
So tell us a little bit more about the choices there, the pros and the cons, and how you ended up making the choices that you made.
Yeah.
So I'm going to go back
to when we pivoted
from the machine learning platform.
It was a hosted thing.
And when we pivoted,
we had some frustrations
with the hosted thing
because it was very difficult
to get people to send data into our hosted environment. We wanted something people could run in their environment, so we could have a different go-to-market motion that wasn't about the hosted platform.
We decided, okay, let's look at our options.
And we found Timely Dataflow.
And we were like, oh, this is really interesting
because it can run as a single thread on a local machine.
So we can give people the ability
to write these stream processing workflows in a way that they can then run on their local machine.
And it can also scale up to many different workers.
So that was part of the reason why we chose Timely Dataflow.
And it's a cool project because it uses a little different approach in the architecture in comparison to the things you're more familiar with like Spark.
Anyway, so yeah, Timely Dataflow is an awesome library.
The person who created it, Frank McSherry, I don't know if you've had him on the podcast.
I think his title is Chief Scientist or something like that at Materialize. He was part of the team that started Materialize, and Materialize uses Timely Dataflow but also something called Differential Dataflow. And it's based off this project called Naiad, the Naiad timely dataflow work, which was a Microsoft Research project back in the day.
Like I said, it's a Rust library.
We wanted to provide a Python framework.
So we were like, okay,
we need to figure out how to make some sort of like FFI.
We found, it just so happens,
there's a library called PyO3, which allows you to kind of marry Rust and Python. So we were able to marry the Timely Dataflow library into this Python framework. And so in actuality, the execution and runtime is Rust and Timely Dataflow, and then you have
your processing functions that are written in Python. And what is really cool about the whole thing is we have the ability to move different things down to the Rust layer if we want, for performance. So for example, recently we released a Kafka input configuration capability, so you don't have to write all the Kafka consumer code. We made that in the Rust layer, and the thought was: okay, cool, if we do the input and the output in the Rust layer, and we provide some serialization in that Rust layer, in essence you can move data from one place to another without having to deserialize it into Python. But you could still interact with it as if it was Python code, so you can get some performance benefits there.
But yeah, PyO3 is a really cool library. And the reason I think it's really interesting is that, much like how Python leverages C to speed up code, you can use PyO3 in a similar way to speed up your Python code. And you can get really insane performance boosts, like 20 to 30x sometimes, which is incredible. PyO3 is your gateway to that performance to keep you on the journey.
So how does ByteWax work when you want to start using the Python ecosystem, right? Like, okay, I get it: I can write Python, and this Python will have some bindings to the Rust code, and this Rust code is the more native, let's say, code that is going to be executed. But that's one thing, and it's another thing when you start adding all the different libraries around, right? How does this work, especially in a streaming environment?
Yeah.
So sometimes you have to pass around types.
But so PyO3 allows you to compile Rust and use it from Python.
It also allows you to interpret Python in Rust,
in Rust execution.
So Timely Dataflow has this concept of operators, which you might be familiar with from other frameworks, where you have things like map and filter, et cetera. And so there's some code around that; that's how you control the flow of data through the dataflow. In those operators, you can pass Python functions. And so those Python functions are running as Python, they're being run by the Python interpreter, and they're running on the Rust execution layer.
So you'll have to take a hit from serializing data from a Rust type potentially to a Python type, and vice versa. But then you're using the Python types with the Python code. And so that's why many of these libraries will just work, and you're not constrained in the UDF world, the user-defined function world, where if that library doesn't exist, you just don't have access to it.
Yeah.
Makes sense.
And, for example, in Python you have people using frameworks like pandas, right, to interact and work with data. Can I use pandas, interact with a pandas DataFrame, and use that as the way I work with data inside ByteWax right now?
Yeah.
So, I mean, that's one of the great things to use the stateful map with. If I'm accumulating data on a per-user basis over a certain amount of time, I can stack that into a DataFrame, and then I can actually pass that DataFrame on to the next step and do whatever compute function there. I could do it in earlier steps, but I could also pass it on.
Yeah.
I mean, it's fun.
It's really fun to use it that way because you get access to all the stuff that you want to use.
You don't have to think about it because you're like, I've been using this for X number of years.
I know the API.
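Outside of any framework, the stateful-map-plus-pandas pattern being described can be sketched like this. The function and state handling are illustrative, not ByteWax's actual operator API; the state here is a single list for one user, for simplicity.

```python
# Sketch of a "stateful map" that stacks incoming events into a pandas
# DataFrame, then uses ordinary pandas code you already know for the
# compute step. Illustrative only, not a real framework API.
import pandas as pd

state = []  # state held between events (one user's rows, for simplicity)

def stateful_map(event):
    state.append(event)
    df = pd.DataFrame(state)   # accumulate events into a DataFrame
    return df["value"].mean()  # any familiar pandas compute

print(stateful_map({"user": "a", "value": 10}))  # 10.0
print(stateful_map({"user": "a", "value": 20}))  # 15.0
```

The appeal the speakers describe is visible even here: the per-event logic is just the pandas API, with the framework only responsible for keeping `state` alive between events.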
100%.
No, no, no.
That's a very, very important point, in my opinion. Inventing a new API and having to educate
everyone to use the new API,
it just takes
forever. It's not easy.
And people choose one or the
other for reasons.
Pandas is being
used and it's good enough.
Why someone would change
that? Now, does it have limitations that are not related to the API itself?
Yeah, of course.
That's why you see stuff like PySpark, for example, where you have a pandas API, but the backend that does the computation is Spark, so you can scale to a lot of data, right?
Yeah.
Or something like ByteWax, with streaming processing on the backend.
But I think, and just to go back to what we were discussing
at the beginning
about the previous generation
of, let's say,
streaming processing environments
and the difference in product
in the developer experience,
I think that's a very big differentiator.
When you had tools like Storm
or Sling or whatever,
they were imposing a way and an API too.
It wasn't just what was happening with the system behind.
Now we see a big shift, I think, with developer productivity tools and all the systems that we are building.
The system, we try to fit it to what the developer knows.
Let's get the API that has proved its semantics and people know how to use it.
And let's work behind the scenes and change the things to make them better.
And I think that's amazing, in my opinion.
It's great.
I love to see projects doing that.
Yeah. It's always the magic feeling, right? When the tool meets your expectation for the experience, that's when you feel the magic of, yeah, this just worked. Cool.
Yeah.
Well, yeah.
And that was the feeling I got when ByteWax was ready to use.
And I was like making a data flow
and I could just use
the Python library
as I wanted to.
Like, I always play around
with this library called River.
It's like a Python
machine learning library
for, like, online learning.
And it just,
so fun because it just, like, works.
You just, like, import it,
do some anomaly detection,
and it just, like,
it just works.
It's really fun.
Yeah.
That's cool.
All right.
I have two more questions before I give the microphone back to Eric. So the first one: you mentioned Materialize. Materialize is, again, a system that has been built on top of Timely Dataflow, but it's based on SQL. What's the difference between a system like ByteWax and Materialize?
Yeah.
So at the very bottom, they're both using Timely Dataflow. But the Timely Dataflow aspect is more about managing the flow of data.
Both companies, Materialize and ByteWax, have exposed different ways to interact with that data.
So the effort behind Materialize is to expose a SQL layer so that it's more portable.
And I think you can use...
I don't know if they're at this point yet,
but they're nearly at a point where you can interact with it
like it's a Postgres database.
I think you can use psycopg2 and just write queries against it.
They had to do a bunch of work between Timely Dataflow and the user interacting with it with SQL. And there's another project called Differential Dataflow that incorporates Timely and then builds on top of it to handle a bunch of that: converting the SQL queries into what that dataflow would actually look like.
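As a toy illustration of what "converting a SQL query into a dataflow" means (and emphatically not how Materialize or Differential Dataflow actually implement it), here is how a query like `SELECT page, COUNT(*) FROM events GROUP BY page` can decompose into a map operator and a stateful reduce operator that emits incremental updates:

```python
from collections import defaultdict

# SELECT page, COUNT(*) FROM events GROUP BY page
# expressed as a chain of dataflow-style operators over a stream.

events = [
    {"page": "/home", "user": "a"},
    {"page": "/docs", "user": "b"},
    {"page": "/home", "user": "c"},
]

def map_op(stream, fn):
    """Stateless operator: transform each item as it flows past."""
    for item in stream:
        yield fn(item)

def reduce_count(stream):
    """Stateful operator: keep running counts per key and emit the
    updated count after each input, like an incrementally maintained
    materialized view."""
    counts = defaultdict(int)
    for key in stream:
        counts[key] += 1
        yield key, counts[key]

keyed = map_op(events, lambda e: e["page"])  # extract the GROUP BY key
updates = list(reduce_count(keyed))          # incremental COUNT(*)
# updates == [("/home", 1), ("/docs", 1), ("/home", 2)]
```

The point of the toy: instead of recomputing the whole aggregate per query, each arriving event produces an updated result, which is the behavior a streaming SQL layer has to synthesize from the query text.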
Yeah, so similarities are at the data flow level.
And then we kind of diverge, right?
But they have done really interesting work.
And, I don't know, I would say you should talk to Frank, Frank McSherry. Get him on the podcast, because he would be able to speak about it...
Yeah, absolutely.
In a more elegant way than I can.
Yeah, yeah, absolutely.
No, that's something that we should definitely do.
Okay.
And my last question, what's next?
What's the roadmap for Bytewax and what excites you there?
Yeah.
So Bytewax is both an open source project and a company, and that company has to be sustainable. So we have to make Bytewax into a commercially viable product as well as an open source project. That's something that will come in the future, though not the most immediate. More immediately, it's just about more adoption. Bytewax has many of the things you need to build advanced analysis on streams, so it's getting it out there, getting the word out, awareness that the project exists, and getting some adoption.
That's the more immediate future.
And then we'll turn to how we can add additional features
that make it easier to run Bytewax at scale, to integrate Bytewax with existing organizations' infrastructure and other systems, and provide
a paid version of ByteWax as well.
One other piece that's on the roadmap for the next little bit: we positioned Bytewax as a library for doing advanced analysis on streaming data, for machine learning use cases or cybersecurity and stuff like that. A lot of people we've spoken to, or who started to use Bytewax, tell us, oh yeah, that's really interesting, and it's on our roadmap, but it's a year out, or 18 months out, or nine months out. What we need to solve today is we just need to ingest data, or move it from one place to another. You can deal with that with Bytewax, but you're writing the code yourself. So we might experiment a little bit on how we could serve those users as well: make some effort in building a few different connectors between different sources of data so you can get started really easily, maybe just set some configurations and then deploy it.
Interesting.
It's super exciting.
I think we have many reasons to record another episode in, I don't know, a quarter or two from now.
So looking forward to that.
Eric, it's all yours.
Oh, why thank you, Costas.
My pleasure.
I'd actually like to follow up with what you just talked about in terms of different users and the way they're using Bytewax, but also return to something you mentioned about your original mission when you came out of GitHub and started Bytewax, which was to sort of almost democratize certain parts of this ML workflow for smaller companies, right? Companies who didn't have a whole team to actually manage the infrastructure side of the house when it came to the ML side of it.
And one thing that was interesting to me, and this could just be the top things that came to mind, but a lot of the use cases you mentioned or examples you gave tend to be things that companies at scale are really interested in, but smaller companies can't do. So, for example, computing features and pushing them to an in-memory store so that they can be served at second-level or slightly sub-second latency. That's a huge use case, right? Because delivering recommendations like that, changing conversion by one percent, can mean huge stuff for the bottom line. But in the spirit of democratization, what does that look like for what Bytewax is now? Is that sort of basic data pipeline use case how you see that? I'd just love to know: do you feel like you've retained that original mission, or that component of your mission?
I mean, to some extent, the reason we have a Python framework is some of that democratization. It's one of the most widely used programming languages, and it is the de facto language of data.
Yep.
If you exclude SQL, if that counts as a programming language. But yeah, I think that we're at a little different layer of companies than we were targeting before, potentially.
Yeah, I'm not 100% sure. The medium enterprise or whatever hasn't gotten to that level of sophistication where they were actually building models and deploying them and all that. And so we were maybe a little bit ahead of where the market was.
Yeah.
So we weren't able to democratize anyone, because no one was asking to be democratized, you know?
Yeah.
And if we were to look at streaming, I think you're pointing something out that's interesting: we're still focused on more advanced stuff, and we're still kind of ahead of the broader market there. The broader market is trying to do more ingestion and some transformation, move data around and get it into a centralized thing. But what I think we can do there to try and democratize is build infrastructure that makes it easy. Tools, I guess, that make it easy for people to use dataflows to also do that ingestion part.
I did a... there's an Open Source Spotlight thing that DataTalks.Club does, and it comes out next week. I do an example demo where you can use Bytewax to take server logs. It was something we did at GitHub. We didn't have telemetry across our products, and we wanted to know what was going on, so we would just ingest data from web server logs. It was either there or at the load balancer, but anyway, we'd take that kind of request data and put it in the data warehouse so that we could query it and then derive insights. So I've made a demo doing that with Bytewax and this tool called DuckDB. And maybe my spirit was in democratization: here's the lightest-weight thing you can possibly use to take data and understand what's going on with your product. And, I don't know, it was fun. Maybe it's not actually what anyone will actually use, but...
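The shape of that demo, parse web server logs, land them somewhere queryable, then derive insights, can be sketched with the standard library alone. This is a hypothetical stand-in, not the actual demo: the real one used Bytewax and DuckDB, while this sketch parses Apache-style common-log lines and loads them into stdlib sqlite3 instead:

```python
import re
import sqlite3

# Apache "common log format" line, e.g.:
# 127.0.0.1 - - [09/Nov/2022:10:00:00 +0000] "GET /docs HTTP/1.1" 200 512
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) (?P<size>\d+)'
)

def parse(line):
    """Return a dict of request fields, or None for unparseable lines."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

lines = [
    '127.0.0.1 - - [09/Nov/2022:10:00:00 +0000] "GET /docs HTTP/1.1" 200 512',
    '10.0.0.5 - - [09/Nov/2022:10:00:01 +0000] "GET /docs HTTP/1.1" 200 310',
    '10.0.0.5 - - [09/Nov/2022:10:00:02 +0000] "GET /pricing HTTP/1.1" 404 90',
]

# "Warehouse" step: land parsed requests into a queryable table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE requests (ip TEXT, path TEXT, status INT)")
for line in lines:
    r = parse(line)
    if r:
        db.execute(
            "INSERT INTO requests VALUES (?, ?, ?)",
            (r["ip"], r["path"], int(r["status"])),
        )

# Insight step: which pages get the most traffic?
top = db.execute(
    "SELECT path, COUNT(*) AS hits FROM requests "
    "GROUP BY path ORDER BY hits DESC"
).fetchall()
# top == [("/docs", 2), ("/pricing", 1)]
```

In a streaming version, the parse step would run continuously inside a dataflow operator instead of over a fixed list, but the parse / land / query pipeline shape is the same.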
Yeah, I hope we can help more people use streaming data. That's kind of why we're building Bytewax: we want more people to be able to do it.
Yeah, totally. I mean, I would say, you know, it sounds like giving people easier access to leverage streams in Python, in itself, to your point, is a big part of that, right? Like, I mean, if you think about someone who wants to do something interesting with a Kafka stream, especially if that Kafka stream is managed by a completely different team, right? I mean, that's really hard to do at a lot of companies, right? You know, either from a technical standpoint or a cross-functional standpoint or whatever. So yeah, that's super interesting. Well, for our listeners, we should have mentioned this earlier, but where can they go to learn more about Bytewax or try it out?
Yeah. So we have our website, bytewax.io, that's B-Y-T-E-W-A-X dot io.
There's like the documentation.
There's a bunch of examples and API documentation.
And then the GitHub repo.
Also, we have like a ton of examples in there.
So those are, right now, those are the two places to go and try it out.
It's like you can pip install it and then just like run one of the examples
and get a feel for how it works.
I have a bunch of blogs too that have tutorials in them, and they're positioned more for certain use cases. So there's a cybersecurity one, a level-two order book from an exchange kind of one, and anomaly detection for IoT. So if you have more of a specific application, maybe that's the better place to start. Look there and see if it's solved, and if it's not, go dig through the examples and learn how to do it.
And obviously come to the Slack channel. We have a Slack channel you can join, and people dump in, you know, problems they're working through, and we're happy to work through them.
I love doing that.
So I always find it fun to build a data flow.
So that's another place to get started.
Yeah, I love it.
Well, Zander, thank you so much for taking the time.
Man, we learned so much.
It was so fascinating.
I loved hearing you and Costas break down all the technical details.
So super fun for me.
And yeah, we'd love to have you back on the show as Bytewax grows and hear how things are going.
Cool.
Yeah.
Thank you for having me.
I hope we didn't go like too deep down the rabbit hole there, but you can never go too deep down the rabbit hole on the Data Stack Show.
We're definitely involved in the stack.
At that point, we were at the bottom of the stack.
Yeah.
Yeah.
All right.
Well, thanks again.
Cool.
Yeah.
Thanks, guys.
It's great to chat.
Thank you, Zander.
It was great.
Super interesting. One interesting takeaway I have is that it seems like there is a proliferation in the ways that you can do stream processing, which is really interesting, and I think really good for the industry, right? Like with Materialize, which you talked about a little bit on the show, you can accomplish something similar with SQL, right? You have traditional streaming tools like Kafka, and with Confluent you can do interesting things with transformation and stream processing or whatever. And so it is interesting. It seems like in general, if you look at all these technologies, they're responding to a big need in the market related to stream processing. You can sort of access it through different interfaces, right, or developer experiences, depending.
So do you think that's true? Like, stream processing has been around for a long time, but there seem to be a lot of new technologies cropping up around it.
Yeah.
Well, to be honest, I think there's some kind of explosion in the solutions that have to do with interacting with data in general, so it's not just streaming. But I think stream processing was one of those paradigms that was really struggling with developer experience. It's just a very different type of data to work with, and the APIs that existed in the past, and still exist, are different in terms of what most developers out there are used to on the user end. So what I find extremely interesting
is how much attention to the developer experience
is happening right now.
You see that we build a new system
not just because we have a new
super performant way of doing things,
but primarily because we want to make things
easier for developers to work with and more accessible to more developers out there.
Yep.
So I think that's going to be like a very common theme across new products.
And if you remember when we had that panel about streaming, there was like a discussion
about the developer experience. Actually, I think the most important topic that all of the panelists talked about was the developer experience.
Yep.
So, and we see this happening right now.
We see like new systems coming up that primarily focus on like the developer experience.
Yep.
Totally agree.
Well, maybe we should do another streaming panel.
Because you kind of have like a battle of different ecosystems, right?
Like SQL versus Python, etc.
So maybe we can have Brooks organize a streaming battle, which would be great.
But that's all the time we have for today.
Thank you, as always, for listening.
Subscribe if you haven't, and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.