The Data Stack Show - 41: Doing MLOps on Top of Apache Pulsar and Trino with Joshua Odmark of Pandio
Episode Date: June 23, 2021
Highlights from this week's episode:
- Joshua started his first company at age 15 and then sold two more startups after that (2:15)
- Embracing the open source movement and not reinventing the wheel if you don't have to (12:15)
- Pulsar seemed built to address Kafka's weaknesses (17:23)
- Using Redis as a coordinator for federated learning and taking advantage of its portability (23:05)
- The pillars of Pandio and some practical use cases (31:24)
- Feature stores and model versioning (38:23)
- Seeing Pulsar as the future because of the ability to run tens of millions of topics (41:04)
The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Well, on today's show, we get to discuss a topic that always brings a little bit of spice to the conversation, and that is Kafka and Kafka-related technologies.
And to make it even more interesting, we're going to talk with the founder and CTO of
Pandio, whose tooling is built on top of Apache Pulsar. They do a lot of other things, and we'll talk
about ML orchestration, some other things they do. Costas, I am just really interested. I always
love a conversation where there's very opinionated discussion around Kafka and things that compare
with Kafka. So that's what I want to hear.
Maybe you'll get to that question on the technical side, but I can't wait to hear what someone
building on Pulsar has to say about Kafka.
Yeah, absolutely.
Those are also my burning questions, to be honest.
Pulsar is, it's not a very new technology, but it's gaining a lot of traction lately.
And I'm very curious to learn why,
what are the differences and why Pulsar instead of Kafka. And that's going to definitely be part
of the conversation that we'll have today with Josh. Great. Well, we're going to talk with Josh,
the CTO and founder of Pandio. Josh, welcome to the show. You have a long history working with data and you're doing
some really interesting things. You're doing some really interesting things at Pandio and we want to
talk about all sorts of things, including sort of machine learning orchestration and all of that.
But before we get going, could you just give us your background? And I think specifically,
you've done a lot of different things, but we'd love to hear
about your journey working at data-related companies and maybe just provide a little
bit of perspective on how things have changed over such a long career with data.
Yeah, thanks so much for having me.
I really appreciate it.
So yeah, so my name is Joshua Odmark.
I'm currently the founder and CTO of Pandio.
We consider ourselves an AI orchestration company.
So we help companies operationalize AI at the end of the day.
I started my career incredibly early.
This usually shocks people, but I started my first company when I was 15 years old.
Now it's not as exciting as that almost seems, but I mainly started it because I did not like working your traditional job.
So basically what I did was just repair computers.
So as my friends went to work at Burger King or serving at a local restaurant or dishwashing, I basically earned the same amount of money, but I set my own schedule.
And that kind of set the bug early of being a serial entrepreneur.
And then shortly after that is where my data journey begins.
And it got very interesting.
So as a senior in high school, so still not even out of high school yet, I started a company
in the early 2000s with somebody that I'd only met online.
I only knew them as sort of an online handle.
And yeah. Really quickly, what platform did you meet them on?
Just because, you know, we're so used to that's such a common thing now.
I mean, of course, like sort of digital dating and other things like that.
But in the 90s, you know, there weren't that many venues, actually.
Yeah, yes, it was sort of interesting.
So ICQ was the main sort of messaging platform that I communicated on.
But what's really interesting is, I mean, this is sort of hard these days.
You guys will probably appreciate that.
But back then, if you like emailed the owner of a website, they were like happy to sort of talk to you.
You know, that was like not a common thing back then. So you could sort of easily
connect with people online if they ran a popular website that you were a member of. And so that's
how I actually met this person. He owned the website. The website had a very popular forum.
This was in the days of Hot Scripts, if you guys remember that, where you could go download little snippets of PHP or JavaScript and things like that.
He had a competitor to that that also had a forum.
So I met him through that and we just sort of conversed that way, either through the inbox of that forum, which was vBulletin, or ICQ.
So a little bit of AIM in there as well, the AOL Instant Messenger.
But yeah, so and then mostly actually over email, because a lot of those things were clunky back in
the day. People didn't normally keep that stuff running 24 seven. So it was very rarely sort of
a real time communication back then, at least in my experience.
But I sort of got to know him by just offering to help for free.
That was like very successful for me back then.
It was just, it helped me learn, but it made connections.
And we just grew to be sort of virtual friends, if you will.
And then we started a company and the premise was pretty straightforward.
This was back when Google released its sort of PageRank algorithm. So we had the idea of taking his popular websites, which were already like PageRank six and seven, which is out of ten. That's pretty decent back then. And we sold links, which is now not acceptable. But back then, this was like a relatively new thing. Nobody was doing it. There was no precedent.
Yeah, it wasn't good or bad.
And we were able to get like Template Monster and Dotster
and some of those people to sort of join.
And the whole point of it was to make money with the actual links,
but also to double dip and promote our own properties
as part of those links.
And so we ended up having 10-plus
websites that had a huge amount of traffic. So probably millions of unique visits a month
back in the early 2000s, which is, you know, today it's kind of like, oh, interesting. But
back then that was massive. So that was sort of the, I was going to say golden era, but
you know, there were so many weird things. But for SEO, it was a wild West. I mean, you could
do, you know, like the classic, you joke about it, but it's like, you know,
white text on a white background type stuff. But man, back then a lot of that stuff worked really
well. Yeah, exactly. And, you know, interesting little tidbit. So you
guys may have heard of Neil Patel. So he's kind of a big name in the SEO space these days. We were
one of his first customers because what happened was we started to get so many inbound requests to
sort of help with these SEO-related things. We just farmed everything out to him, and so we ended up being one of his first and biggest
customers back then. Yeah, so I don't know him very well, but I got to know him decently well.
You know, that was also all virtual. I ended up meeting him in person later, but at the
beginning, yeah, it was all virtual. What's fascinating, though, is this went gangbusters.
It just blew up.
Within the first month, we were making like $20,000 a month, which, again, not a huge sum of money.
But I'm also 18 years old in high school. Sure.
But that was just the first month.
And it kept going.
It got pretty nuts.
So that was my first foray. And what's, to me, most fascinating is, if you remember back then, there weren't really any ways to analyze data.
So I remember all things considered, you know, it was fun making money, having success quickly, just being in high school and not really knowing what I'm doing, et cetera, et cetera.
But what was most fascinating was all the data that those sites generated. So I remember like AWStats was a big sort of thing offered through
cPanel back in the day, but it couldn't really handle huge amounts of data. So I remember
spinning up big boxes, which again, there's no cloud back then. So you're renting the actual
hardware from a provider, which for us was theplanet.com back then.
So I remember just renting this huge box because I wanted to run SQL queries against all this data,
see where people were visiting from, what they were doing on the website and things like that.
So that was truly when my data journey first began, when I was sort of understanding the power of being able to sift through these millions of visitors and tens of millions of page views and trying to learn what that meant, what sections of the website were popular.
Because things like Google Analytics that give you these canned dashboards didn't exist really either.
So you're on your own to sort of figure that out.
But that was sort of a wild ride. We
sold it less than two years later. And I just sort of cruised for five years because that was a nice
little windfall. Not enough to retire on, but I finished my degree and then got back into the
startup world. And it's interesting from an entrepreneur perspective, that was my first proper startup and it went gangbusters and it was like it was easy.
You know, everything just worked.
And then doing the next startup and the startup after that, I thought it'd be easy.
But of course, you know, building companies is incredibly hard.
So I've since sold to other companies, but, you know, had my fair share of failures in that mix as well.
So it was a pretty wild ride.
But what a cool story.
Yeah, it was tons of fun.
And I met just lots and lots of people.
It was really fascinating.
And it was just me and another guy that sort of did it.
And the two of us sort of built it into something pretty amazing. So that was a lot of fun. And surprisingly enough, I was not
a software engineer or anything of the type back then. I was more like a graphic designer, in reality.
Yeah, so I was very much into the arts. I loved math and science in school and things like that, but
I never really had a practical purpose for it.
But as part of that startup, after about a year of doing this, it was mainly on autopilot, which was also fantastic.
But our only big expense was programming.
So I was like, oh, I'll pick it up, you know, cut some expenses and learn something.
Yeah.
And that's when it was like, oh, I love this.
So I've been a software engineer ever since that day.
Very cool.
Or that time.
Yeah.
One question, and this is a little bit,
because I want to get to Pandio and I want to talk about MLOps.
And I know that Costas has a lot of technical questions.
But one thing that's interesting, it struck me when you said,
you know, you were spinning up big boxes, but I just love this story. I just, I hope for our audience, I know it is for
Costas and me, it's just bringing back a lot of really good memories thinking about, you know,
ICQ and the AIM usernames that were kind of like a bad tattoo that you regret,
that you had to stick with. But, you know, you talked about spinning up big boxes because you
wanted to analyze all this data with SQL, because you didn't really have these sort of out-of-the-box SaaS
analytics providers. One thing that's really interesting is you kind of have these phases,
right? So there was sort of maybe more like bare metal analytics, which you were doing.
Then you have this huge wave of SaaS analytics tools still around, right?
But then you have a lot of companies actually coming back full circle to writing SQL on big
data sets, you know, on the warehouse or other sort of different data stores. And then you kind
of have this in between with tools like Looker, you know, where there's sort of, or leveraging
Looker, like LookML or dbt, which sort of support the entire process, would just love your perspective on that. Because when you were talking about analyzing data,
I thought, okay, this is in the 90s. But you hear people use the exact same language today,
you know, decades later. Yeah, it is interesting, because it's almost cyclical in nature. It's like,
even when you look at the cloud, you know, it's kind of like a lot of people consider the cloud
is almost a step back to mainframes, just, you know, a lot sexier and things like that.
But yeah, I mean, for me, it has been interesting.
And I'm the type to where like I hate reinventing the wheel, like with a passion.
So I always go look for these tools, and sort of in the early days, open source, you know, wasn't really a thing.
Open source was your buddy or Hot Scripts.
And there's no licensing with that.
You're just using somebody else's script.
It's not been vetted, et cetera, et cetera.
But to me, the open source movement has just created all sorts of very interesting things.
And we talked to a lot of companies today.
Their entire offering,
they may not sort of talk about that publicly, but it is open source. And it's always interesting to me to find out things that are open source, like Athena at AWS is built on Presto and things
like that. So I think what's been fascinating is the open source kind of movement has allowed
entrepreneurs like myself to create tools like this. And to me,
I absolutely love that because then I can just use those to make my life easier versus having
to create it all myself. It's like if we had to do that, our progress would be so much slower.
And especially when you get into the sort of specialized stuff. So, you know, you guys involved
with ETL and things like that,
a lot of people assume that that stuff is easy.
And then when you get into the actual data of like, you know,
sifting through form fills from your website or something like that,
or your lead gen or something,
you start to realize how crummy the data can sort of be in almost any industry
and how difficult that is to sort of deal with.
And then the sheer amounts of data.
That's been what's been getting very interesting.
As we talk to a lot of enterprises, they've got so much data.
It's absolutely absurd.
And they're only using a small fraction of it.
And they realize kind of how ridiculous that is, but it's just so hard and they can't
understand the cost of analyzing it or using all of it or
what's the ROI. So it's an interesting space to see all these tools sort of pop up that are slowly
addressing all these problems. And then when you move into the machine learning, the thing that
was always fascinating to me about machine learning is it's just like traditional software.
Now, obviously the differences of there's some pretty hardcore math and matrices behind it and all that.
But at the end of the day, operationally, it feels very similar.
It's just more of everything, more CPU, more data, more storage, more memory, more pods and Kubernetes, et cetera, et cetera.
So then your problems become more painful if you don't have
the right infrastructure, et cetera, et cetera. So it's been interesting, but I've just been
so thankful for the open source movement. And I myself try to contribute back. We're contributors
to Pulsar. We're about to contribute back to Trino and Presto. And then I've contributed to other things in the past as well.
So it's really amazing.
And I'm thankful that that movement has sort of blown up these days.
Yeah, I think we are definitely building on the shoulders of giants
when it comes to open source.
You mentioned a couple of projects, Josh.
Can you tell us a little bit more about the product,
the offering that Pandio has
and how it relates to open source projects? Yeah, sure. So one of the things that was sort
of very interesting to me is, so we have a managed service offering for Apache Pulsar.
So Apache Pulsar is a sort of traditional distributed messaging system. It handles typical workloads like streaming, pub/sub, and traditional job worker queues.
And then it also has a very interesting component to it where you can actually host serverless functions inside of it.
So you can do things like you have like an inbound topic.
You can place a function on top of that topic. And then what it sort of spits out on the other side for the output topic runs through that lightweight compute thing.
So you can do things like ETL and really anything you could imagine.
Routing is very popular as well.
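To make the topic-in, function, topic-out idea concrete, here is a minimal sketch of a Pulsar Function in Python, assuming hypothetical topic names and a made-up cleanup rule; the Function base class and process signature come from the Pulsar Functions Python SDK.

```python
# Hypothetical sketch of a Pulsar Function doing lightweight in-stream ETL.
# It would be attached to an input topic and an output topic at deploy time,
# e.g. (illustrative names only):
#   pulsar-admin functions create --py clean_events.py \
#     --classname clean_events.CleanEvents \
#     --inputs persistent://public/default/raw-events \
#     --output persistent://public/default/clean-events
import json

from pulsar import Function


class CleanEvents(Function):
    def process(self, input, context):
        # 'input' is the payload of a message arriving on the input topic.
        event = json.loads(input)

        # Drop records that are unusable; returning None publishes nothing.
        if "user_id" not in event:
            return None

        # A tiny transformation, standing in for real ETL or routing logic.
        event["user_id"] = str(event["user_id"]).strip().lower()

        # The return value is written to the configured output topic.
        return json.dumps(event)
```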
But when I came to Pulsar, because before Pandio, I was working in the insurance space. And so we were
involved with a lot of the big providers of insurance names that everybody's kind of familiar
with. And so we were delivering machine learning to them and building machine learning for them.
So we did some very interesting projects. Like there's one company who has satellite imagery
of the entire United States, and they wanted to measure the roofs of all homes in the United
States. That was the premise of what they wanted us to achieve. And so that's a massive project,
very interesting. And how you sort of solve that is a lot of fun to think through.
But at the end of the day, what was most interesting is data logistics became very painful for us in many of those cases. So that's just the literal
movement of data. So that was hundreds of petabytes of data to sort of deal with to do that particular
project. And we tried to use everything out there. So, you know, they were in the cloud. So we use the cloud providers
services, then that tipped over, then we sort of shifted to some other things like Kafka,
and found out like Kafka sort of doesn't handle that stuff particularly well. So in my sort of
process of doing that, I started to see the value of a logistics piece of software. And my journey there, we
actually built something custom based on Redis because I was very good at Redis personally,
and our team had a strong sort of experience with Redis. So we were able to do something
with Redis that was very fascinating, but it was very niche. So with Pandio, I wanted to find something, and I ended up exploring Pulsar. And the thing that's interesting about Pulsar is it almost feels like it was built to address Kafka's weaknesses. The brokers in Pulsar are stateless. This makes scaling it
a lot easier. So you can sort of actually properly scale those horizontally and you can scale the
compute, which is effectively the broker, independently of your storage. Storage is handled with Apache
BookKeeper in Pulsar. So you can scale those independently, scale up the storage or scale up
the CPU or scale them together. That's very powerful. It's also built more for the container
driven world. So it doesn't rely on low level sort of kernel related stuff to achieve speed or
things like that. So it's more portable and more cloud native at the end of the day. And it sort of solves the topic limitation.
So very interesting customer use case that we have is there's a large media company who
was hitting the sort of limits of Kafka with its number of topics.
So based on the way it's architected, you can only have so many topics.
And this sort of depends on how you set up your cluster.
But typically, a few thousand, you're not going to go above that.
They wanted to create one topic per one user in their system.
So that was hundreds of thousands of topics.
For Pulsar, this is pretty easy, again, because it was designed differently.
So there's lots of things like that.
Some ancillary edge cases
where Pulsar is just more interesting.
Additionally, it supports all the messaging types.
So from one SDK, you can do streaming, PubSub, or queuing.
So that's also very fascinating.
Although on the flip side,
I've found that for a lot of developers
that's like a curveball to them.
They're like, wait a second,
like one software to do vastly different messaging patterns.
But once they get sort of past that, you know,
they sort of see the value in being able to just choose, right inside a single
SDK, to do any of those messaging patterns.
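As a rough illustration of the "one SDK, several messaging patterns" point, here is a sketch using the Pulsar Python client; the broker URL, topic, and subscription names are placeholders.

```python
# Sketch: pub/sub fan-out and work-queue semantics from the same client and topic.
# All names below are illustrative.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")
topic = "persistent://public/default/orders"

# Pub/sub: an Exclusive subscription gets its own full copy of the stream.
analytics = client.subscribe(topic, "analytics-sub",
                             consumer_type=pulsar.ConsumerType.Exclusive)

# Queuing: a Shared subscription load-balances messages across many workers.
worker = client.subscribe(topic, "worker-pool-sub",
                          consumer_type=pulsar.ConsumerType.Shared)

# Producers are the same regardless of how consumers subscribe.
producer = client.create_producer(topic)
producer.send(b'{"order_id": 1}')

client.close()
```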
Yeah. I mean, Pulsar has been around for a long time, but, you know, the community is a fraction of the size of Kafka's.
So us getting involved, we sort of became known for Pulsar because there aren't a lot of providers
for it. So we run some pretty large Pulsar installations, especially in the finance world,
because another interesting difference with Kafka is Kafka is not really built for full durability.
It's built more for speed,
whereas Pulsar is sort of built for durability.
So for example, Kafka by default,
fsyncs to disk on a threshold basis.
Pulsar does it for every single message. So that's much more
interesting to like the banking world, for example, because they want zero message loss
under any sort of circumstance. But yeah, in sort of doing Pulsar, we've now offered it as a managed
service because it's gaining a lot of interest, both from people who have hit the limits of Kafka,
which is typically almost always in the Fortune 1000 that are hitting those limitations,
but also people who are setting out to develop new systems. Pulsar, because of its ability to
scale higher, is a much better fit for machine learning. And that really is why we're involved with Pulsar at the
end of the day, is our focus at Pandio is to help companies achieve really any form of AI or ML.
It's quite shocking: something like 87% of executives want it, but only 15% have it.
There's a lot of reasons for that. But yeah, I mean, so that's sort of what led us down the road of Pulsar in a nutshell.
That's super fascinating, Josh.
I have quite a few questions around Pulsar and also how to use it today.
But I'm very curious about the custom solution you talked about building on Redis
and the limitations that you found in Kafka that made you go to use Redis.
Can you share a little bit more information about that?
Like what you managed to build on Redis?
I mean, I love it as a technology because it's amazing the stuff
that people have managed to build on top of Redis.
And it's always very interesting to hear about this.
So it would be amazing if you could share a little bit more about it.
Yeah, so I'm somewhat limited in some of the things I can talk about, but in general, Redis basically acted as a coordinator for us. What was very interesting to us about Redis is
it was extremely portable. So we sort of treated it the way it was meant to be treated: not as a proper data store, more as
a caching layer. But because of some of the embedded Lua and things like that, you can
add in some crazy powerful function-like logic, at least around how keys are managed in
Redis. So it basically acted as a coordinator, because when it comes to that particular issue,
we had very small payloads. So it was basically a lot of coordination happening. So instead of passing
anything to do with an image, we created basically just a metadata payload. So imagine
it was just a reference to where it was in S3, as an example.
Or if the image had to be split up and then there was four pieces of the image or 10 pieces
of the image, and those need to be coordinated in a way.
Because what we had basically is it felt like a mesh network of machine learning.
So a lot of people sort of call that today
like federated learning.
So we built Redis basically as a way
to coordinate a lot of federated learning.
So that can be specifically around
like rural and metro areas.
You would have a model at the end of the day
that was specific for like Chicago
or San Francisco or
Los Angeles. So we use Redis really to sort of coordinate that federated learning and keep track
of what had been done. And it worked very well because you could just take the quote unquote
database that Redis created, which is just a single file at the end of the day and move that around to sort of restore where you were.
So that helped us in scenarios where we were attempting to do some learning and we wanted to
halt it. So maybe we processed like 12% of images and then we wanted to a week later,
start back up at that sort of spot. So it gave us the durability to be able to do that easily,
because Redis is dead simple, easy to install,
easy to make portable by moving keys around
that you had created on one VM
and you want to put it on another VM now.
So it just was, at the end of the day,
the easiest way to sort of do some of that coordination.
And the way we structured
it was, what ends up being a topic in Kafka was basically just a namespace inside of Redis.
And so, you know, we could sort of pre-calculate how many of those we needed. So maybe that was
10,000, for example. And then we knew the payload size of what was being
coordinated, because it was just, you know, absolute paths to S3 objects. And then we could calculate the
memory that would sort of be needed to do that. And then we sharded it ourselves. I can't really go
into too much detail on how we did that, but it's basically the same way databases shard,
based on keys and things like that.
But yeah, so it wasn't really too advanced at the end of the day.
Again, it just was coordination.
But we needed to use Redis because it was just blazingly fast,
and we needed a lot of them.
So thousands of those individual Redis instances.
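Purely to illustrate the coordination pattern being described (not Pandio's actual implementation, which Josh notes he can't detail), here is a sketch with redis-py: small metadata payloads pointing at S3 objects go into a per-region work queue, and progress is tracked so a halted run can resume. Every key name and payload field here is made up.

```python
# Hypothetical sketch: Redis as a lightweight coordinator for image-processing work.
# Payloads are just metadata (pointers into S3), never the images themselves.
import json

import redis

r = redis.Redis(host="localhost", port=6379)


def enqueue_piece(region: str, s3_path: str, part: int, total: int) -> None:
    # Push a small metadata payload onto the region's work queue.
    payload = json.dumps({"s3_path": s3_path, "part": part, "total": total})
    r.rpush(f"queue:{region}", payload)


def next_piece(region: str):
    # Pop the next piece of work; None means the queue is drained.
    raw = r.lpop(f"queue:{region}")
    return json.loads(raw) if raw else None


def mark_done(region: str, s3_path: str) -> None:
    # Progress lives in Redis too, so a run halted at, say, 12% can resume later.
    r.hset(f"progress:{region}", s3_path, "done")


enqueue_piece("chicago", "s3://roof-imagery/chicago/tile-0001.tif", 1, 4)
work = next_piece("chicago")
if work:
    mark_done("chicago", work["s3_path"])
```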
That's super interesting. While you were talking and mentioned like coordination,
I started thinking about whether this is the kind of problem that you could solve with something
like ZooKeeper or etcd, because they are used a lot for service coordination and stuff like that.
But then you mentioned the thousands of instances, so I'm not so sure if
something like this could be used. But yeah...
Yeah, you definitely could have done ZooKeeper. I mean, you know, I suppose this is the case with most developers: if you
plan or architect something, or you sort of understand the requirements, then you want
to sort of fit the things you know into it, you know what I mean? Me being mainly web-based programming languages,
so like PHP, Ruby, and things like that, and a little bit of Python, ZooKeeper is like that jump into
Java that none of us were really ready for, you know. But yeah, absolutely.
No, I mean, at
the end finding the right solution is not always...
How to say that?
It's not like solving an equation, right?
There's no one solution.
I mean, it has to do with the team.
It has to do with the circumstances you're in when you're doing it.
And at the end, that's also what is fun with technology.
I mean, there are so many different tools
that can be used to solve the same problem out there.
And yeah, Redis is one of them.
That's why, as I said, I'm always fascinated to hear
what people manage to do with Redis.
It's amazing.
So is this something, like this kind of problem
that you talked about solving with Redis,
is this something that you could do today with Pulsar?
Yeah, Pulsar would have been a lot easier,
mainly because it sort of handles the distributed nature of it.
I mean, Redis today, I haven't used it too much recently. This was kind of a while ago that this was built,
but we sort of had to make it distributed. We didn't really need the atomic nature of Redis,
but Pulsar sort of handles that for you. And the whole question of things needing to be backed up or moved around is removed, or handled by
Pulsar itself. You know, so for example, if you wanted to process things again, you can easily
replay messages in Pulsar. You know, it's got like a reader interface. So if you've got a
thousand messages on the topic that you've already processed, you can just create a new subscriber or use a reader interface to go back to, like, offset zero. So some of those things are
just handled for you. And the publishers, the producers and consumers, if you wanted to sort
of scale one up huge or the other up huge or both up huge, it's just easier to sort of do that.
You don't have to build a lot of that yourself with Pulsar. So
I would have loved to build that solution with Pulsar.
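To make the replay point concrete, here is a small sketch using the Pulsar Python client's reader interface to walk a topic again from the earliest retained message; the topic name and the reprocess function are placeholders.

```python
# Sketch: re-reading an already-processed topic from the beginning with a reader.
import pulsar


def reprocess(payload: bytes) -> None:
    # Stand-in for whatever work needs to be redone over old messages.
    print(payload)


client = pulsar.Client("pulsar://localhost:6650")
reader = client.create_reader("persistent://public/default/image-metadata",
                              pulsar.MessageId.earliest)

# Walk the retained backlog without creating or disturbing any subscription.
while reader.has_message_available():
    msg = reader.read_next()
    reprocess(msg.data())

client.close()
```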
It's to the point now where there's actually a fair number of companies who are, if you remember the traditional concept of an
enterprise service bus, moving towards using something like Pulsar to be the fabric.
There's this term called data fabric, where, you know, you have something like an
ESB that sort of touches everything.
So it's both messaging patterns; it's access to the warehouse, the data lake, the data
marts, you know, et cetera, et cetera, et cetera.
And then it gives you some pretty interesting controls, having that middleware.
None of these are new concepts, you know. Middleware has been around forever,
the ESB has been around forever. But because something like Pulsar has so much more additional
capability, the serverless function-type stuff does some interesting things, and you can put
business logic in the middle of things. And then just traditionally, you know, Pulsar also can sort of store things indefinitely.
So, you know, with Kafka, they have like an offloading function that they just came out
with relatively recently, but it's not seamless.
With Pulsar, like you can offload to HDFS or S3 or any blob storage, and then you can read back out seamlessly from the SDK.
You don't have to put it back into Pulsar.
So things like that are just very interesting
and make it interesting to use as a data lake.
You know, some people are doing that.
Just a lot of very interesting use cases.
Yeah, it's super interesting.
And you mentioned that many of the use cases that
you're dealing with at Pandio right now are around ML orchestration. Can you tell us a little bit
more about this? How does something like Pulsar help with ML orchestration, and what is involved there?
And I think also, it's not just Pulsar, right? It's Pulsar together with Presto. Is
this correct?
Yeah, so Pandio really has three pillars at the end of the day. So accessibility is the first one.
So we just use like an open source data virtualization technology, which is Trino.
These are all kind of optional in your journey to AI, but these are the things that most people
need. So Trino is interesting because it can sort of connect
to almost any data source, even flat file systems.
And it lets you connect to maybe 5, 10 data sources, 15, 20,
it doesn't really matter.
And then execute a single SQL statement against it
so you can join like data and S3 flat files
with data in Snowflake and things like that. So that's very interesting.
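As a sketch of what that single federated SQL statement might look like through the Trino Python client, here is an assumed setup where a Hive catalog fronts flat files on S3 and a second catalog points at Snowflake; all catalog, schema, and table names are hypothetical.

```python
# Sketch: one SQL statement spanning two catalogs via Trino.
import trino

conn = trino.dbapi.connect(host="trino.example.internal", port=8080, user="analyst")
cur = conn.cursor()

cur.execute("""
    SELECT c.customer_id, count(*) AS clicks
    FROM hive.weblogs.click_events AS e      -- flat files on S3, exposed via a Hive catalog
    JOIN snowflake.crm.customers   AS c      -- a table living in Snowflake
      ON e.user_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY clicks DESC
    LIMIT 10
""")

for row in cur.fetchall():
    print(row)
```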
So that helps solve the data accessibility issue where they've got data in some place and they just
need to get to it. And then Pulsar acts as sort of the foundational component just because the
movement of data becomes very difficult. And this is why Pulsar is very interesting. So I mentioned earlier that machine learning is a lot like traditional software, just more of everything. And so we focus typically on the heavy data use cases. So that might be a billion-dollar media company generating click data and impression data. And so what they want to do is detect fraud in click data.
So they've got just an enormous amount of data coming in.
So click stream of data, impression stream of data.
And they want to, one, just be able to handle that data.
So Pulsar is great at that just for ingesting data.
It can scale out massively and sort of handle a lot of data with few resources by comparison, especially to Kafka.
Kafka is kind of the number one competitor when it comes to that.
And then we sort of built out our machine learning framework to do it in stream.
So we definitely focus on real time or near real time, but it doesn't have to be. That just happens to be a space where
there's not a lot of tools out there to help people do things like that. So a use case might
be a media company has a stream of clicks coming in and they want to segment them as fraud or not
fraud. And so they can sort of use the Pandio service to ingest that and then apply a machine learning model against that live stream of clicks. So in real time, it can route a click to fraud or not fraud. And that helps them do various different things. Cybersecurity is another big one: think syslogs of access patterns, both logging into systems traditionally, like through, you know,
an employee logging in or, you know, some third party logging in, or somebody accessing a file
on a file system. So all these things are getting streamed into some central system.
Doesn't actually have to be central. That's another interesting thing about distributed Pulsar.
You can do this at the edge.
It doesn't have to be centralized.
But that's a whole other topic.
But that's just a very interesting use case where you may not want traditional clustering of your data.
You just may want to flag anomalies.
That's all you're after.
You just want to know, is something weird happening?
Is somebody accessing a file that they haven't accessed in two years and they're accessing a lot of them you know an interesting use case for one
of our customers was they used this to find an employee who was downloading everything off of
the company servers so they were clearly doing a data dump and were likely going to leave the company.
They legitimately with their user account were downloading every single file.
And this was with a medical company.
So very sensitive to somebody doing something like that.
And they caught them and fired them that day.
So there's lots of use cases like that.
Again, it is weighted towards real-time or near real-time,
but it works traditionally as well for things that are less important.
Maybe you need to run something once a day or once a month,
but we certainly excel in huge amounts of data,
like Disney Plus streaming amounts of data,
as well as things that are real-time and need to make actions quickly.
The faster you act, the more money you save.
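Here is a rough sketch, not Pandio's actual framework, of the in-stream scoring pattern described above: a consumer pulls clicks off a topic, applies a model, and routes each click to a "fraud" or "clean" output topic. The scoring function and every name are hypothetical.

```python
# Hypothetical sketch of in-stream scoring: consume clicks, score, route by label.
import json

import pulsar


def fraud_score(click: dict) -> float:
    # Stand-in for a real trained model (e.g. one loaded from a model registry).
    return 0.9 if click.get("clicks_last_minute", 0) > 50 else 0.1


client = pulsar.Client("pulsar://localhost:6650")
consumer = client.subscribe("persistent://public/default/clicks", "fraud-scorer",
                            consumer_type=pulsar.ConsumerType.Shared)
fraud_out = client.create_producer("persistent://public/default/clicks-fraud")
clean_out = client.create_producer("persistent://public/default/clicks-clean")

while True:
    msg = consumer.receive()
    click = json.loads(msg.data())

    # Route the click based on the model's score.
    out = fraud_out if fraud_score(click) > 0.5 else clean_out
    out.send(msg.data())

    consumer.acknowledge(msg)
```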
Makes sense.
And what about Trino?
You mentioned Trino as a data virtualization solution.
How is this used in Pandio right now?
So that was mainly a function of, you know,
to really provide value with the second two pieces of Pandio,
which is the middle piece, which is like logistics.
And then the third piece, which is the actual machine learning. So actually building models,
training, and doing the inference. For a lot of people, especially big enterprises,
it takes them a while to sort of make a decision and move data. So for example, they might have decided to use
Snowflake as a data warehouse or something like that. Like they've chosen that as their future.
And so now you have to wait for sprint by sprint things to get in there to actually move data into
Snowflake. So we found there's just an opportunity where before someone made that decision or when they were on their road to sort of implementing that decision, having something like Trino is very interesting because it can just virtualize that as a sort of stopgap.
And we found too, even when you have like a very forward thinking enterprise or company in general, they usually never move all of their data into a
warehouse or a data lake. It's like the 80/20 rule feels like it pops up everywhere, you know. So
there's always, like, the 20 that they want to get access to but can't really, for various
different reasons, you know. So Trino, again, a lot of companies just like to use that because it
made getting data into their pipelines a lot easier. And Trino's dead simple. I mean,
it's easy to pitch, you know. It's like, hey, do you want to run SQL against all of your data?
Yeah, I'd love that, you know. It's easy to demonstrate. And it's, I mean,
it's not easy to run, but it's not hard to run. So,
you know, we just offer that as a managed service because it fits really well into
kind of AI orchestration. That's super interesting. There is a lot of noise lately around
feature stores. I don't know if you have heard about them, like products like Tecton or open source solutions like Feast.
What's the relationship with the feature store compared to what you are talking about doing in Pandio?
Do you see these things working together?
Do you think there's an overlap there?
How is this landscape around them starting to form?
Yeah. So for us, you know, we're heavily sort of focused on
the actual training and deployment of models. So a lot of those relationships and even like the
data catalog people or even existing sort of MLOps platforms, there's a lot of synergies there. You
know, from us, we look at those as things we plug into to make things easier. So
for example, data catalogs can actually feed into something like Trino, if
you're using a Hive catalog to sort of index blob storage data. When it comes to feature stores
and model versioning, those are very powerful things, but I sort of consider
those things cutting edge. It's kind of like, I wish more people knew about them, you know?
And it's almost like, like I talked to some advanced enterprises, some of the biggest
companies in the world, and I'm shocked that they can count the number of models they have
in production on their hands, you know? So tools like that sort of make it a lot easier.
But yeah, so for me, like we consider those things
as like things we would plug into.
It's very much about the Python library we built, you know?
So plugging into things like that, again,
it comes back to not having to reinvent the wheel.
Like we're, you know, dead focused on something very specific.
And then these things that can make the road easier
or allow these things to be democratized easier
or the operational component of it easier.
We love to partner with those types of things.
We don't have any sort of plans to build some of that stuff out.
That's super interesting.
So Josh, one last question from me because I completely monopolized the conversation today and then I'll give the stage
to Eric. I'm very curious about something and I have this in my mind like from the beginning of
our conversation, also for personal reasons. You mentioned that one of the limitations that
Kafka has is about the number of topics.
And you mentioned that there are companies, especially in the Fortune 1000 group of companies,
that they reach this limitation.
Can you share with us some use cases that are causing these kinds of limitations to be triggered?
Because obviously, when Kafka was designed, they had in their mind that Kafka was going to be used in a way where nobody would need, like, thousands of topics, right? And by the way, the reason that I'm
interested in this is because at my previous company, Blendo, we were using Kafka and we had
to figure out a way to deal with these limitations. So I'd love to hear from you about this.
So I'll
get a little pie in the sky on you
guys here, but it's sort of fun to sort of think about. So not a ton of companies, but some
companies are sort of understanding that you can use things like Kafka and Pulsar to segment your
data in very powerful ways. So in the same way, like an index in the database would allow you to sort of
segment data. So that would mean that you would have an interesting use case for creating a lot
of topics. So one might be, you know, a lot of companies sort of create these segments. So let's
take media, for example. So they've got like segments for their customers. So they might have,
you know, people living in major metro areas, or they might have high-income earners.
So they have these like segments of their customers, but they're limited on how they can do those segments.
You know, it has to be sort of categorical or high level.
I like to use the Facebook news feed as like an example of why this is sort of important. And I'll tie it to some
specific use cases. So what's interesting is like your feed on Facebook is very much tailored to you
as an individual. So you can imagine how would you sort of create a machine learning model
that is tailored to an individual. So that's like the holy grail. So instead of doing
categoricals, like if you earn between, you know, 50 grand and 75 grand, you're in this segment.
Imagine if I could create one specifically for you, you know? So something like this would involve,
I now need to segment your data exclusively. So the things you like on the internet, things you look
at with your consent, imagine if that happened on a platform like Facebook. So there's a major
media company that we help out doing this right now. So they're involved with shopping.
So if you could create a topic that was specific to an individual user, so now I can do very
interesting things.
So I was born and raised in Michigan, so I'm a big Detroit Lions fan.
So it's pretty easy to sort of loosely understand that I might like Detroit sports.
So that'd be more like a categorical model, but it becomes very hard to track that I like
an individual player. So
Matthew Stafford was a quarterback for the Lions. He was just traded, which was a whole big thing. You know
what I mean? So for that shopping network to sort of track my preference for an individual
player, that's where they start to lose the minutiae of things. And for shopping, that can
be very important because while Stafford is no longer a Detroit Lion, I'm still a fan of him. So I might still want to buy his
jersey or something on the new team that he's on. So that's like a capability today that someone is
trying to achieve that they couldn't. I mean, so what they do is they create a topic that is specific to that user.
And then they train a model specifically using that user's data.
And so it ends up looking like a federated sort of learning way where they've got their master model that has all the categorical stuff.
And then they've got the federated model that's specific to that user.
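A toy sketch of the per-user-topic idea, assuming a Pulsar tenant and namespace dedicated to user streams; the names and event shape are purely illustrative.

```python
# Hypothetical sketch: fan events out to one topic per user, so each user's stream
# can later feed a model trained only on that user's data.
import json

import pulsar

client = pulsar.Client("pulsar://localhost:6650")
producers = {}  # cache of per-user producers


def topic_for(user_id: str) -> str:
    # Pulsar tolerates very large topic counts, so one topic per user is viable.
    return f"persistent://shop/user-streams/user-{user_id}"


def publish_event(event: dict) -> None:
    user_id = event["user_id"]
    if user_id not in producers:
        producers[user_id] = client.create_producer(topic_for(user_id))
    producers[user_id].send(json.dumps(event).encode("utf-8"))


publish_event({"user_id": "42", "action": "view", "item": "stafford-jersey"})
```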
It ends up feeling very much like we all have cell phones.
So we all have acronyms or names we call our spouses or pets and things like that. And the
keyboard, as you type messages, it starts to learn what you're doing individually or the things you
do yourself. It's very much like that, but for, like, everything. So to be able to do things like that,
it's easy on a... well, I shouldn't say easy. It's an amazing accomplishment on a phone,
but segregating it is easy because it's just on your phone. You know, you're already sandboxed in
that way. But when you're a major shopping network, that's not so easy. You need to create that segmentation. So this ties into, like, to me, I envisioned a future
where companies would have tens of thousands of models, minimum. Now they've got like hundreds if
you're lucky. Like, it's rare that I find a company that's got hundreds of models in production.
It's typically like 10 or 20, you know. So I imagine a future where you want to have tens of thousands of models as an individual company.
What would that look like?
It's going to be federated.
It's going to be distributed.
What does that look like?
And so that's where I saw Pulsar as the future, because you can do tens of millions of topics, and that can be the baseline of the stream of data for each model. Now, hosting, you
know, 10 million models is its own difficult thing, but we've got some fascinating technology to
actually do that. So that is a focus for us too. And I was trying to, I was, like, working backwards
from the Terminator example: like, if Skynet were to happen, what would it actually look like from an infrastructure stack standpoint or data sharing?
What would that look like?
You know, you would need, to me, millions of niche models.
The ability to sort of, from a mesh network standpoint, share the outputs of one model as the input of another model in some huge mesh network.
And so that's why, to me, something like Pulsar and the Python library I built is kind of like moving in that direction
because I thought that's the next step.
It's going to be someone needs to create thousands of models, not 10.
What do they need to do that?
So that's kind of what you'll find
at Pandio. But again, I don't want to say if Skynet happens, blame Pandio, but that's kind
of like from a technical perspective, that's where my thought process kicked off years ago.
This is great. Eric, it's all yours.
Well, thank you. I have just been so fascinated by this conversation. And really,
I think, in the best way possible, there are so many more questions that I have,
and probably Costas does as well. But we are at time, and I want to be respectful of your time. Josh,
this has been really great. Loved your story. Loved hearing about Pandio and all the interesting
things you're doing. So why don't we have you back on the show?
I'd love to dig into a couple of the specific things you talked about as far as use cases,
et cetera.
So we'll have you back on the show again, and we'll continue the conversation.
Awesome.
Well, thanks, guys.
I really appreciate it.
It's lots of fun from my perspective.
I really enjoy doing things like this.
So thank you again.
Well, Redis is a really fascinating tool and we've seen lots of companies
do really interesting things with it. That might be the most interesting Redis use case that I've
heard about, but that's actually not my biggest takeaway. I think my biggest takeaway was taking
a stroll down memory lane and just reminiscing a little bit about ICQ and AIM and spinning up servers
to run SQL.
I mean, that's great.
I just really enjoyed that.
So that's my big takeaway.
I hope we can do more of that on future episodes.
Yeah.
Although the bitter side of this is that it reminds me how old I am. But yeah, it was great to hear how things were done back in the 90s, to be honest.
So this was great.
That was an amazing part of our conversation.
It was a great conversation in general, to be honest.
I mean, Josh has a lot of experience with many different things that have to do with data
and building actual products on top of data.
So it was amazing to hear about all these use cases and the products that he has built
even before all the latest hype
over building products over data.
And of course, it was amazing to hear
how they used Redis on that.
What I'll keep from the conversation that we had
is about ML and how machine learning is actually deployed right now and how early we are in the commercialization, let's say, of machine learning.
There are amazing things that are happening and a lot of work that still has to be done.
And that, of course, means that there's a lot of opportunity out there, both for building amazing technologies, but also building amazing companies.
So let's see what happens in the next couple of years.
I think it's going to be fascinating.
It absolutely will.
Well, thank you again for joining us on The Data Stack Show.
More exciting episodes like this one coming up.
So until then, we'll catch you later.
We hope you enjoyed this episode of The Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your
feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at
datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.