The Data Stack Show - 41: Doing MLOps on Top of Apache Pulsar and Trino with Joshua Odmark of Pandio

Episode Date: June 23, 2021

Highlights from this week's episode:

- Joshua started his first company at age 15 and then sold two more startups after that (2:15)
- Embracing the open source movement and not reinventing the wheel if you don't have to (12:15)
- Pulsar seemed built to address Kafka's weaknesses (17:23)
- Using Redis as a coordinator for federated learning and taking advantage of its portability (23:05)
- The pillars of Pandio and some practical use cases (31:24)
- Feature stores and model versioning (38:23)
- Seeing Pulsar as the future because of the ability to run tens of millions of topics (41:04)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Well, on today's show, we get to discuss a topic that always brings a little bit of spice to the conversation, and that is Kafka and Kafka-related technologies. And to make it even more interesting, we're going to talk with the founder and CTO of
Starting point is 00:00:43 Pandio, and it's tooling built on top of Apache Pulsar. They do a lot of other things, and we'll talk about ML orchestration, some other things they do. Costas, I am just really interested. I always love a conversation where there's very opinionated discussion around Kafka and things that compare with Kafka. So that's what I want to hear. Maybe you'll get to that question on the technical side, but I can't wait to hear what someone building on Pulsar has to say about Kafka. Yeah, absolutely. That's also my burning questions, to be honest.
Starting point is 00:01:18 Pulsar is, it's not a very new technology, but it's gaining a lot of traction lately. And I'm very curious to learn why, what are the differences and why Pulsar instead of Kafka. And that's going to definitely be part of the conversation that we'll have today with Josh. Great. Well, we're going to talk with Josh, the CTO and founder of Pandio. Josh, welcome to the show. You have a long history working with data and you're doing some really interesting things. You're doing some really interesting things at Pandio and we want to talk about all sorts of things, including sort of machine learning orchestration and all of that. But before we get going, could you just give us your background? And I think specifically,
Starting point is 00:02:03 you've done a lot of different things, but we'd love to hear about your journey working at data-related companies and maybe just provide a little bit of perspective on how things have changed over such a long career with data. Yeah, thanks so much for having me. I really appreciate it. So yeah, so my name is Joshua Odemark. I'm currently the founder and CTO of Pandio. We consider ourselves an AI orchestration company.
Starting point is 00:02:27 So we help companies operationalize AI at the end of the day. I started my career incredibly early. This usually shocks people, but I started my first company when I was 15 years old. Now it's not as exciting as that almost seems, but I mainly started it because I did not like working your traditional job. So basically what I did was just repair computers. So as my friends went to work at Burger King or serving at a local restaurant or dishwashing, I basically earned the same amount of money, but I set my own schedule. And that kind of set the bug early of being a serial entrepreneur. And then shortly after that is where my data journey begins.
Starting point is 00:03:07 And it got very interesting. So as a senior in high school, so still not even out of high school yet, I started a company in the early 2000s with somebody that I'd only met online. I only knew them as sort of an online handle. And yeah. Really quickly, what platform did you meet them on? Just because, you know, we're so used to that's such a common thing now. I mean, of course, like sort of digital dating and other things like that. But in the 90s, you know, there weren't that many venues, actually.
Starting point is 00:03:39 Yeah, yes, it was sort of interesting. So ICQ is the main sort of messaging platform that I communicated on. But what's really interesting is, I mean, this is sort of hard these days. You guys will probably appreciate that. But back then, if you like emailed the owner of a website, they were like happy to sort of talk to you. You know, that was like not a common thing back then. So you could sort of easily connect with people online if they ran a popular website that you were a member of. And so that's how I actually met this person. He owned the website. The website had a very popular forum.
Starting point is 00:04:19 This was in the days of Hot Scripts, if you guys remember that, where you could go download little snippets of PHP or JavaScript and things like that. He had a competitor to that that also had a forum. So I met him through that and we just sort of conversated that way, either through the inbox of that forum, which was vBulletin or ICQ. So a little bit of AIM in there as well. The AOL is the messenger. But yeah, so and then mostly actually over email, because a lot of those things were clunky back in the day. People didn't normally keep that stuff running 24 seven. So it was very rarely sort of a real time communication back then, at least in my experience. But I sort of got to know him by just offering to help for free.
Starting point is 00:05:09 That was like very successful for me back then. It was just, it helped me learn, but it made connections. And we just grew to be sort of virtual friends, if you will. And then we started a company and the premise was pretty straightforward. This was back when Google released its sort of page rank algorithm. So we had the idea of taking his popular websites, which were already like page rank six and seven, which is out of tens. That's pretty decent back then. And we sold links, which is now not acceptable. But back then, this was like a relatively new thing. Nobody was doing it. There was no precedence. Yeah, it wasn't good or bad. And we were able to get like Template Monster and Dotster and some of those people to sort of join.
Starting point is 00:05:55 And the whole point of was to make money with the actual links, but also to double dip and promote our own properties as part of those links. And so we ended up having 10 plus websites that were huge amount of traffic. So probably millions of unique visits a month back in the early 2000s, which is, you know, today it's kind of like, oh, interesting. But back then that was massive. So, and that's, that was sort of the, the, like, I was going to say golden era, but you know, there's so much, so many weird things, but for SEO, it was a wild West. I mean, you could
Starting point is 00:06:31 do, I mean, you know, like the, the classic, you joke about it, but it's like, you know, white text on a white background type stuff. But man, back then a lot of that stuff worked really well. Yeah, exactly. And, you know, interesting little tidbit. So you guys may have heard of Neil Patel. So he's kind of a big name in the SEO space these days. We were one of his first customers because what happened was we started to get so many inbound requests to sort of help with these SEO related things. We just farmed everything out to him and so we ended up being one of his first and biggest customers back then yeah so so i i don't know him very well but i got to know him decently well for you know that was also all virtual i ended up meeting him in person later but but at the
Starting point is 00:07:20 beginning yeah it was all virtual what's fascinating though, is this went gangbusters. It just blew up. Within the first month, we were making like $20,000 a month, which, again, not a huge sum of money. But I'm also 18 years old in high school. Sure. But that was just the first month. And it kept going. It got pretty nuts. So that was my first foray. And what's, to me, most fascinating is, if you remember back then, there weren't really any ways to analyze data.
Starting point is 00:07:51 So I remember all things considered, you know, it was fun making money, having success quickly, just being in high school and not really knowing what I'm doing, et cetera, et cetera. But what was most fascinating was all the data that those sites generated. So I remember like AWStats was a big sort of thing offered through cPanel back in the day, but it couldn't really handle huge amounts of data. So I remember spinning up big boxes, which again, there's no cloud back then. So you're renting the actual hardware from a provider, which for us was theplanet.com back then. So I remember just renting this huge box because I wanted to run SQL queries against all this data, see where people were visiting from, what they were doing on the website and things like that. So that was truly when my data journey first began, when I was sort of understanding the power of being able to sift through these millions of visitors and tens of millions of page views and trying to learn what that meant, what sections of the website were popular.
Starting point is 00:08:55 Because things like Google Analytics that give you these canned dashboards didn't exist really either. So you're on your own to sort of figure that out. But that was sort of a wild ride. We sold it less than two years later. And I just sort of cruised for five years because that was a nice little windfall. Not enough to retire on, but I finished my degree and then got back into the startup world. And it's interesting from an entrepreneur perspective, that was my first proper startup and it went gangbusters and it was like it was easy. You know, everything just worked. And then doing the next startup and the startup after that, I thought it'd be easy.
Starting point is 00:09:35 But of course, you know, building companies is incredibly hard. So I've since sold to other companies, but, you know, had my fair share of failures in that mix as well. So it was a pretty wild ride. But what a cool story. Yeah, it was tons of fun. And I met just lots and lots of people. It was really fascinating. And it was just me and another guy that sort of did it.
Starting point is 00:10:02 And the two of us sort of built it into something pretty amazing so that was that was a lot of fun and surprisingly enough i was not a software engineer anything of the type back then i was more like a graphic designer in reality yeah so i was very much into the arts i loved math and science and school and things like that but the i never really had a practical purpose. But as part of that startup, after about a year of doing this, it was mainly on autopilot, which was also fantastic. But our only big expense was programming. So I was like, oh, I'll pick it up, you know, cut some expenses and learn something. Yeah.
Starting point is 00:10:40 And that's when it was like, oh, I love this. So I've been a software engineer ever since that day. Very cool. Or that time. Yeah. One question, and this is a little bit, because I want to get to Pandio and I want to talk about MLOps. And I know that Costas has a lot of technical questions.
Starting point is 00:11:00 But one thing that's interesting, it struck me when you said, you know, you were spinning out big boxes, but I just love this story. I just, I hope for our audience, I know it is for Costas and me, it's just bringing back a lot of really good memories thinking about, you know, ICQ and the AIM usernames that were kind of like a bad tattoo that you regret, that you had to stick with. But, you know, you talked about spinning up big boxes because you wanted to analyze all this data with SQL because you talked about spinning up big boxes because you wanted to analyze all this data with SQL because you didn't really have this sort of out of the box like SaaS analytics providers. One thing that's really interesting is you kind of have these phases,
Starting point is 00:11:34 right? So there was sort of maybe more like bare metal analytics, which you were doing. Then you have this huge wave of SaaS analytics tools still around, right? But then you have a lot of companies actually coming back full circle to writing SQL on big data sets, you know, on the warehouse or other sort of different data stores. And then you kind of have this in between with tools like Looker, you know, where there's sort of, or leveraging Looker, like LookML or dbt, which sort of support the entire process, would just love your perspective on that. Because when you were talking about analyzing data, I thought, okay, this is in the 90s. But you hear people use the exact same language today, you know, decades later. Yeah, it is interesting, because it's almost cyclical in nature. It's like,
Starting point is 00:12:19 even when you look at the cloud, you know, it's kind of like a lot of people consider the cloud is almost a step back to mainframes, just, you know, a lot sexier and things like that. But yeah, I mean, for me, it has been interesting. And I'm the type to where like I hate reinventing the wheel, like with a passion. So I always go look for these tools and sort of in the early days, open source, you know, wasn't source wasn't really a thing. Open source was your buddy or hot scripts. And there's no licensing with that. You're just using somebody else's script.
Starting point is 00:12:53 It's not been vetted, et cetera, et cetera. But to me, the open source movement has just created all sorts of very interesting things. And we talked to a lot of companies today. Their entire offering, they may not sort of talk about that publicly, but it is open source. And it's always interesting to me to find out things that are open source, like Athena at AWS is built on Presto and things like that. So I think what's been fascinating is the open source kind of movement has allowed entrepreneurs like myself to create tools like this. And to me, I absolutely love that because then I can just use those to make my life easier versus having
Starting point is 00:13:32 to create it all to myself. It's like if we had to do that, our progress would be so much slower. And especially when you get into the sort of specialized stuff. So, you know, you guys involved with ETL and things like that, a lot of people assume that that stuff is easy. And then when you get into the actual data of like, you know, sifting through form fills from your website or something like that, or your lead gen or something, you start to realize how crummy the data can sort of be in almost any industry
Starting point is 00:14:01 and how difficult that is to sort of deal with. And then the sheer amounts of data. That's been what's been getting very interesting. As we talk to a lot of enterprises, they've got so much data. It's absolutely absurd. And they're only using a small fraction of it. And they realize kind of how ridiculous that is, but it's just so hard and they can't understand the cost of analyzing it or using all of it or
Starting point is 00:14:26 what's the ROI. So it's an interesting space to see all these tools sort of pop up that are slowly addressing all these problems. And then when you move into the machine learning, the thing that was always fascinating to me about machine learning is it's just like traditional software. Now, obviously the differences of there's some pretty hardcore math and matrices behind it and all that. But at the end of the day, operationally, it feels very similar. It's just more of everything, more CPU, more data, more storage, more memory, more pods and Kubernetes, et cetera, et cetera. So then your problems become more painful if you don't have the right infrastructure or et cetera, et cetera. So then your problems become more painful if you don't have the right infrastructure or et cetera, et cetera. So it's been interesting, but I've just been
Starting point is 00:15:12 so thankful for the open source movement. And I myself try to contribute back. We're contributors to Pulsar. We're about to contribute back to Trino and Presto. And then I've contributed to other things in the past as well. So it's really amazing. And I'm thankful for that movement has sort of blown up these days. Yeah, I think we are definitely building over the shoulders of giants when it comes to open source. You mentioned a couple of projects, Josh. Can you tell us a little bit more about the product,
Starting point is 00:15:44 the offering that Pandio has and how it relates with open source projects? Yeah, sure. So one of the things that was sort of very interesting to me is, so we have a managed service offering for Apache Pulsar. So Apache Pulsar is a sort of traditional distributed messaging system. It handles typical workloads of like streaming, pub, sub, and traditional job worker queues. And then it also has a very interesting component to it where you can actually host serverless functions inside of it. So you can do things like you have like an inbound topic. You can place a function on top of that topic. And then what it sort of spits out on the other side for the output topic runs through that lightweight compute thing. So you can do things like ETL and really anything you could imagine.
Starting point is 00:16:36 Routing is very popular as well. But when I came to Pulsar, because before Pandio, I was working in the insurance space. And so we were involved with a lot of the big providers of insurance names that everybody's kind of familiar with. And so we were delivering machine learning to them and building machine learning for them. So we did some very interesting projects. Like there's one company who has satellite imagery of the entire United States, and they wanted to measure the roofs of all homes in the United States. That was the premise of what they wanted us to achieve. And so that's a massive project, very interesting. And how you sort of solve that is a lot of fun to think through.
Starting point is 00:17:22 But at the end of the day, what was most interesting is data logistics became very painful for us in many of those cases. So that's just the literal movement of data. So that was hundreds of petabytes of data to sort of deal with to do that particular project. And we tried to use everything out there. So, you know, they were in the cloud. So we use the cloud providers services, then that tipped over, then we sort of shifted to some other things like Kafka, and found out like Kafka sort of doesn't handle that stuff particularly well. So in my sort of process of doing that, I started to see the value of a logistics piece of software. And my journey there, we actually built something custom based on Redis because I was very good at Redis personally, and our team had a strong sort of experience with Redis. So we were able to do something
Starting point is 00:18:18 with Redis that was very fascinating, but it was very niche. So with Pandio, I wanted to find something and I ended up exploring Pulsar. And the thing that's interesting about Pulsar is they solve, it almost feels like Pulsar are stateless. This makes scaling it a lot easier. So you can sort of actually properly scale those horizontally and you can scale the compute, which is effectively the broker independent of your storage. Storage is handled with Apache Bookkeeper and Pulsar. So you can scale those independently, scale up the storage or scale up the CPU or scale them together. That's very powerful. It's also built more for the container driven world. So it doesn't rely on low level sort of kernel related stuff to achieve speed or things like that. So it's more portable and more cloud native at the end of the day. And it sort of solves the topic limitation. So very interesting customer use case that we have is there's a large media company who
Starting point is 00:19:35 was hitting the sort of limits of Kafka with its number of topics. So based on the way it's architected, you can only have so many topics. And this sort of depends on how you set up your cluster. But typically, a few thousand, you're not going to go above that. They wanted to create one topic per one user in their system. So that was hundreds of thousands of topics. For Pulsar, this is pretty easy, again, because it was designed differently. So there's lots of things like that.
Starting point is 00:20:04 Some ANSI layer edge cases where Pulsar is just more interesting. Additionally, it supports all the messaging types. So from one SDK, you can do streaming, PubSub, or queuing. So that's also very fascinating. Although on the flip side, I've found a lot of developers that that's like a curve ball to them.
Starting point is 00:20:24 They're like, wait a second, like one software to do vastly different messaging patterns. So, but once they get sort of past that, you know, they sort of see the value and being able to just choose right inside a single SDK to do any of those messaging patterns. Yeah. I mean, so it, and because Pulsar was really, and it's, it's been around for a long time, but you know, community is, is a fraction of the size of Kafka. So us getting involved, we sort of became known for Pulsar because there isn't a lot of providers
Starting point is 00:20:58 for it. So, so we run some pretty large Pulsar installations, especially in the finance world, because another interesting difference with Kafka is Kafka is not really built for full durability. It's built more for speed, whereas Pulsar is sort of built for durability. So for example, Kafka by default, F-syncs to disk on a threshold basis. Pulsar does it every single message. So that's much more interesting to like the banking world, for example, because they want zero message loss
Starting point is 00:21:32 under any sort of circumstance. But yeah, in sort of doing Pulsar, we've now offered it as a managed service because it's gaining a lot of interest, both from people who have hit the limits of Kafka, which is typically almost always in the Fortune 1000 that are hitting those limitations, but also people who are setting out to develop new systems. Pulsar, because of its ability to scale higher, is a much better fit for machine learning. And that really is why we're involved with Pulsar at the end of the day, is our focus at Pandio is to help companies achieve really any form of AI or ML. It's quite shocking to something like 87% of executives want it, but only 15% have it. There's a lot of reasons for that. But yeah, I mean, so that's sort of what led us down the road of Pulsar in a nutshell.
Starting point is 00:22:27 That's super fascinating, Josh. I have quite a few questions around Pulsar and also how to use it today. But I'm very curious about the custom solution you talked about building on Redis and the limitations that you found in Kafka that made you go to use Redis. Can you share a little bit more information about that? Like what you managed to build on Redis? I mean, I love it as a technology because it's amazing the stuff that people have managed to build on top of Redis.
Starting point is 00:22:57 And it's always very interesting to hear about this. So it would be amazing if you could share a little bit more about it. Yeah, so I'm somewhat limited in some of the things I can talk about, but in general, so the way Redis basically acted as a coordinator for us. So what was very interesting to us about Redis is it was extremely portable. So we sort of treated it like it was meant to be not as a proper data store, more as a caching layer. But because of some of the embedded Lua and things like that, you can do you can add in some crazy powerful function like logic, at least around how keys are managed in Redis. So but it basically acted as a coordinator coordinator because when it comes to that particular issue
Starting point is 00:23:46 we had very small payloads so it was basically a lot of coordination happening so instead of passing anything to do with an image we created basically just a metadata payload so it was like imagine like it was just a reference to where it was in S3 as an example. Or if the image had to be split up and then there was four pieces of the image or 10 pieces of the image, and those need to be coordinated in a way. Because what we had basically is it felt like a mesh network of machine learning. So a lot of people sort of call that today like federated learning.
Starting point is 00:24:27 So we built Redis basically as a way to coordinate a lot of federated learning. So that can be specifically around like rural and metro areas. You would have a model at the end of the day that was specific for like Chicago or San Francisco or Los Angeles. So we use Redis really to sort of coordinate that federated learning and keep track
Starting point is 00:24:52 of what had been done. And it worked very well because you could just take the quote unquote database that Redis created, which is just a single file at the end of the day and move that around to sort of restore where you were. So that helped us in scenarios where we were attempting to do some learning and we wanted to halt it. So maybe we processed like 12% of images and then we wanted to a week later, start back up at that sort of spot. So it gave us the durability to be able to do that easy because Redis is dead simple, easy to install, easy to make portable by moving keys around that you had created on one VM
Starting point is 00:25:35 and you want to put it on another VM now. So it just was, at the end of the day, the easiest way to sort of do some of that coordination. And the way we structured it was what, what ends up being a topic in Kafka was basically just the namespace inside of Redis. And so, you know, we could sort of pre-calculate how many of those we needed. So maybe that was 10,000, for example. And then we knew the payload size of what was being coordinated because it was just you know absolute past s3 objects and then we could calculate the
Starting point is 00:26:14 memory that would sort of be needed to do that and then we sharded it ourselves can't really go into too much detail how we did that but it it's basically the same way database is shard, based on keys and things like that. But yeah, so it wasn't really too advanced at the end of the day. Again, it just was coordination. But we needed to use Redis because it was just blazingly fast, and we needed a lot of them. So thousands of those individual Redis instances.
Starting point is 00:26:46 That's super interesting. While you were talking and mentioned like coordination, I started thinking if this is like kind of problem that you could solve with something like Zookeeper or XD because they are used a lot for service coordination and stuff like that. But then you mentioned about like the thousands of instances. So I'm not so sure like if something like this could be used but yeah is this like you yeah you definitely could have done zookeeper i mean you know i suppose this is the most this is the case with most developers it's kind of like if if you if you plan or architect or something or you sort of understand the requirements and then you want
Starting point is 00:27:23 to sort of fit in the things you know to it you know what i mean me being mainly web-based programming languages so like php ruby and things like that and a little bit of python zookeepers like that that jump into java that none of us were really ready for you know but but yeah absolutely no i mean it's at the end finding the right solution is not always... How to say that? It's not like solving an equation, right? There's no one solution. I mean, it has to do with the team.
Starting point is 00:27:53 It has to do with the circumstances when you're doing. And at the end, that's also what is fun with technology. I mean, there are so many different tools that can be used to solve the same problem out there. And yeah, Redis is one of them. That's why, as I said, I'm always fascinated to hear what people manage to do with Redis. It's amazing.
Starting point is 00:28:10 So is this something, like this kind of problem that you talked about solving with Redis, is this something that you could do today with Pulsar? Yeah, Pulsar would have been a lot easier, mainly because it sort of handles the distributed nature to it. I mean, Redis today, I haven't used it too much recently. This was kind of a while ago that this was built, but we sort of had to make it distributed. We didn't really need the atomic nature of Redis, but Pulsar sort of handles that for you. And the sort of nature of things need to be backed up or moved around is sort of removed or handled by
Starting point is 00:28:47 Pulsar itself. You know, so for example, if you wanted to process things again, you can easily replay messages in Pulsar. So they, you know, it's got like a reader interface. So if you've got thousand messages on the topic that you've already processed. You can just create a new subscriber or use a reader interface to go back to like offset zero. So some of those things are just handled for you and the publishers, the producers and consumers, if you wanted to sort of scale one up huge or the other up huge or both up huge, it's just easier to sort of do that. You don't have to build a lot of that yourself with pulsar so i would have loved to build that solution with pulsar it's to the point now where there's actually there's a fair number of companies who are if you remember the traditional concept of like a
Starting point is 00:29:38 enterprise service bus a lot of companies are moving towards using something like Pulsar to be the fabric. There's this like term called data fabric where, you know, something like you have an ESB that sort of touches everything. So it's both messaging patterns, it's access to the warehouse, the data lake, the data marts, you know, et cetera, et cetera, et cetera. And then it gives you some pretty interesting controls having that middleware these are none of these are new concepts you know middleware has been around forever esb has been around for forever but because something like pulsar has so much more additional
Starting point is 00:30:15 capabilities the serverless uh function type stuff does some interesting stuff then you can put business logic in the middle of things and And then just traditional, you know, Pulsar also can sort of store things indefinitely. So, you know, with Kafka, they have like an offloading function that they just came out with relatively recently, but it's not seamless. With Pulsar, like you can offload to HDFS or S3 or any blob storage, and then you can read back out seamlessly from the SDK. You don't have to put it back into Pulsar. So things like that are just very interesting and make it interesting to use as a data lake.
Starting point is 00:30:56 You know, some people that are doing that. Just a lot of very interesting use cases. Yeah, it's super interesting. And you mentioned that many of the use cases that you're dealing with at Pandio right now is around ML orchestration. Can you tell us a little bit more about this? How something like Pulsar helps with ML orchestration and what is involved there? And I think also, it's not just Pulsar, right? It's like Pulsar and together with Presto. Is this correct? Yeah, so Pandio really has three pillars at the end of the day. So accessibility is the first one.
Starting point is 00:31:30 So we just use like an open source data virtualization technology, which is Trino. These are all kind of optional in your journey to AI, but these are the things that most people need. So Trino is interesting because it can sort of connect to almost any data source, even flat file systems. And it lets you connect to maybe 5, 10 data sources, 15, 20, it doesn't really matter. And then execute a single SQL statement against it so you can join like data and S3 flat files
Starting point is 00:32:02 with data in Snowflake and things like that. So that's very interesting. So that helps solve the data accessibility issue where they've got data in some place and they just need to get to it. And then Pulsar acts as sort of the foundational component just because the movement of data becomes very difficult. And this is why Pulsar is very interesting. So I mentioned earlier that machine learning is a lot like traditional software, just more of everything. And so we focus typically on the heavy data use cases. So that might be a billion dollar media company is generating click data and impression data. And so what they want to do is they want to detect fraud and click data. So they've got just an enormous amount of data coming in. So click stream of data, impression stream of data. And they want to, one, just be able to handle that data. So Pulsar is great at that just for ingesting data.
Starting point is 00:33:00 It can scale out massively and sort of handle a lot of data with few resources by comparison, especially to Kafka. Kafka is kind of the number one competitor when it comes to that. And then we sort of built out our machine learning framework to do it in stream. So we definitely focus on real time or near real time, but it doesn't have to be. That just happens to be a space where there's not a lot of tools out there to help people do things like that. So a use case might be a media company has a stream of clicks coming in and they want to segment them as fraud or not fraud. And so they can sort of use the Pandio service to ingest that and then apply a machine learning model against that live stream of clicks. So in real time, it can route a click to fraud or not fraud. And that helps them do various different things. Cybersecurity is another big one. syslogs of access patterns, both logging into systems, traditionally, like through, you know, an employee logging in or, you know, some third party logging in, or somebody accessing a file
Starting point is 00:34:14 on a file system. So all these things are getting streamed into some central system. Doesn't actually have to be central. That's another interesting thing about distributed Pulsar. You can do this at the edge. It doesn't have to be centralized. But that's a whole other topic. But that's just a very interesting use case where you may want to flag traditional clustering of your data. You just may want to flag anomalies. That's all you're after.
Starting point is 00:34:39 You just want to know, is something weird happening? Is somebody accessing a file that they haven't accessed in two years and they're accessing a lot of them you know an interesting use case for one of our customers was they used this to find an employee who was downloading everything off of the company servers so they were clearly doing a data dump and were likely going to leave the company. They legitimately with their user account were downloading every single file. And this was with a medical company. So very sensitive to somebody doing something like that. And they caught them and fired them that day.
Starting point is 00:35:19 So there's lots of use cases like that. Again, it is weighted towards real-time or near real-time, but it works traditionally as well for things that are less important. Maybe you need to run something once a day or once a month, but we certainly excel in huge amounts of data, like Disney Plus streaming amounts of data, as well as things that are real-time and need to make actions quickly. The faster you act, the more money you save.
Starting point is 00:35:47 Makes sense. And what about Trino? You mentioned Trino as a data virtualization solution. How is this used in Pandio right now? So that was mainly a function of, you know, to really provide value with the second two pieces of Pandio, which is the middle piece, which is like logistics. And then the third piece, which is the actual machine learning. So actually building models,
Starting point is 00:36:11 training, and doing the inference. A lot of people are, especially big enterprises, it takes them a while to sort of make a decision and move data. So for example, they might have accepted or decided to use Snowflake as a data warehouse or something like that. Like they've chosen that as their future. And so now you have to wait for sprint by sprint things to get in there to actually move data into Snowflake. So we found there's just an opportunity where before someone made that decision or when they were on their road to sort of implementing that decision, having something like Trino is very interesting because it can just virtualize that as a sort of stopgap. And we found too, even when you have like a very forward thinking enterprise or company in general, they usually never move all of their data into a warehouse or a data lake it's like the 80 20 rule feels like it pops up everywhere you know so there's always like the 20 that they want to get access to but can't really so i like various
Starting point is 00:37:19 different reasons you know so trino again a lot of companies just like to use that because it and made getting data into their pipelines a lot easier and trino's dead simple i mean it's easy to pitch you know it's like hey do you want to run sql against all of your data yeah i'd love that you know it's easy to demonstrate you know so and it's it's i mean it's not easy to run but it's not hard to run. So, you know, we just offer that as a managed service because it fits really well into kind of AI orchestration. That's super interesting. There is a lot of noise lately around feature stores. I don't know if you have heard about them, like products like Tecton or open source solutions like Feast.
Starting point is 00:38:06 What's the relationship with the feature store compared to what you are talking about doing in Pandio? Do you see these things working together? Do you think there's an overlap there? How is this landscape around them starts to form? Yeah. So for us, you know, we're heavily sort of focused on the actual training and deployment of models. So a lot of those relationships and even like the data catalog people or even existing sort of MLOps platforms, there's a lot of synergies there. You know, from us, we look at those as things we plug into to make things easier. So for example, like data catalogs and that can actually feed into something like Trino. If
Starting point is 00:38:52 you're using a Hive catalog to the sort of index blob storage data. When it comes to future stores and model versioning. So those are very powerful things, but, but there's sort of like, I consider those things like cutting edge. It's kind of like, I wish more people knew about them, you know? And it's almost like, like I talked to some advanced enterprises, some of the biggest companies in the world, and I'm shocked that they can count the number of models they have in production on their hands, you know? So tools like that sort of make it a lot easier. But yeah, so for me, like we consider those things as like things we would plug into.
Starting point is 00:39:32 It's very much about the Python library we built, you know? So plugging into things like that, again, it comes back to not having to reinvent the wheel. Like we're, you know, dead focused on something very specific. And then these things that can make the road easier or allow these things to be democratized easier or the operational component of it easier. We love to partner with those types of things.
Starting point is 00:39:58 We don't have any sort of plans to build some of that stuff out. That's super interesting. So Josh, one last question from me because I completely monopolized the conversation today and then I'll give the stage to Eric. I'm very curious about something and I have this in my mind like from the beginning of our conversation, also for personal reasons. You mentioned that one of the limitations that Kafka has is about the number of topics. And you mentioned that there are companies, especially in the Fortune 1000 group of companies, that they reach this limitation.
Starting point is 00:40:40 Can you share with us some use cases that are causing this kind of limitations to be triggered? Because obviously, when Kafka was designed, they had in their mind that Kafka is going to be used in a way that nobody will need like thousands of topics right and by the way the reason that i'm interested in this is because in my previous company blender we're using kafka and we had to figure out a way to deal with these limitations so i'd love to hear from you about this so i'll get a little pie in the sky on you guys here, but it's sort of fun to sort of think about. So not a ton of companies, but some companies are sort of understanding that you can use things like Kafka and Pulsar to segment your data in very powerful ways. So in the same way, like an index in the database would allow you to sort of
Starting point is 00:41:25 segment data. So that would mean that you would have an interesting use case for creating a lot of topics. So one might be, you know, a lot of companies sort of create these segments. So let's take media, for example. So they've got like segments for their customers. So they might have, you know, living in major metro areas, or they might have high income earners. So they've got like segments for their customers. So they might have, you know, living in major metro areas or they might have high income earners. So they have these like segments of their customers, but they're limited on how they can do those segments. You know, it has to be sort of categorical or high level. I like to use the Facebook news feed as like an example of why this is sort of important. And I'll tie it to some specific use cases. So what's interesting is like your feed on Facebook is very much tailored to you
Starting point is 00:42:14 as an individual. So you can imagine how would you sort of create a machine learning model that is tailored to an individual. So that's like the holy grail. So instead of doing categoricals, like if you earn between, you know, 50 grand and 75 grand, you're in this segment. Imagine if I could create one specifically for you, you know? So something like this would involve, I now need to segment your data exclusively. So the things you like on the internet, things you look at with your consent, imagine if that happened on a platform like Facebook. So there's a major media company that we help out doing this right now. So they're involved with shopping. So if you could create a topic that was specific to an individual user, so now I can do very
Starting point is 00:43:07 interesting things. So I was born and raised in Michigan, so I'm a big Detroit Lions fan. So it's pretty easy to sort of loosely understand that I might like Detroit sports. So that'd be more like a categorical model, but it becomes very hard to track that I like an individual player. So Matthew Stafford is a quarterback for Lions. He was just traded as a whole big thing. You know what I mean? So for that shopping network to sort of track the, my preference of an individual person, that's where they start to lose the minutia of things. And for shopping, that can
Starting point is 00:43:43 be very important because while Stafford is no longer a Detroit Lion, I'm still a fan of him. So I might still want to buy his jersey or something on the new team that he's on. So that's like a capability today that someone is trying to achieve that they couldn't. I mean, so what they do is they create a topic that is specific to that user. And then they train a model specifically using that user's data. And so it ends up looking like a federated sort of learning way where they've got their master model that has all the categorical stuff. And then they've got the federated model that's specific to that user. It ends up feeling very much like we all have cell phones. So we all have acronyms or names we call our spouses or pets and things like that. And the
Starting point is 00:44:33 keyboard, as you type messages, it starts to learn what you're doing individually or the things you do yourself. It's very much like that, but for like everything so to be able to do things like that it's easy on a well i shouldn't say easy it's an amazing accomplishment on a phone but segregating it is easy because it's just on your phone you know you're already sandboxed in that way but when you're a major shopping network that's not so easy you need to create that segmentation so this ties into like to me i envisioned a future where companies would have tens of thousands of models minimum now they've got like hundreds if you're lucky like it's rare that i find a company that's got hundreds of models in production it's typically like 10 or 20 you know so I imagine a future where you want to have tens of thousands of models as an individual company.
Starting point is 00:45:29 What would that look like? It's going to be federated. It's going to be distributed. What does that look like? And so that's where I saw Pulsar as the future is because you can do tens of millions of topics, and that can be the baseline of the stream of data for each model now hosting you know 10 million models is its own difficult thing but we've got some fascinating technology to actually do that so that is a focus that we do too and i was trying to i was like working backwards from the terminator examples like if skynet were to happen, what would it actually look like from an infrastructure stack standpoint or data sharing?
Starting point is 00:46:08 What would that look like? You know, you would need, to me, millions of niche models. The ability to sort of, from a mesh network standpoint, share the outputs of one model as the input of another model in some huge mesh network. And so that's why, to me, something like Pulsar in the Python library I built is kind of like moving in that direction because I thought that's the next step. It's going to be someone needs to create thousands of models, not 10. What do they need to do that? So that's kind of what you'll find
Starting point is 00:46:46 at Pandio. But again, I don't want to say if Skynet happens, blame Pandio, but that's kind of like from a technical perspective, that's where my thought process kicked off years ago. This is great. Eric, it's all yours. Well, thank you. I have just been so fascinated by this conversation. And really, I think I'm in the best way possible. There are so many more questions, I think, that I haven't probably cost us as well. But we are at time, and I want to be respectful of your time. Josh, this has been really great. Loved your story. Loved hearing about Pandio and all the interesting things you're doing. So why don't we have you back on the show?
Starting point is 00:47:25 I'd love to dig into a couple of the specific things you talked about as far as use cases, et cetera. So we'll have you back on the show again, and we'll continue the conversation. Awesome. Well, thanks, guys. I really appreciate it. It's lots of fun from my perspective. I really enjoy doing things like this.
Starting point is 00:47:40 So thank you again. Well, Redis is a really fascinating tool and we've seen lots of companies do really interesting things with it. That might be the most interesting Redis use case that I've heard about, but that's actually not my biggest takeaway. I think my biggest takeaway was taking a stroll down memory lane and just reminiscing a little bit about IRC and AIM and spinning up servers to run SQL. I mean, that's great. I just really enjoyed that.
Starting point is 00:48:14 So that's my big takeaway. I hope we can do more of that on future episodes. Yeah. Although the bitter side of this is that it reminds me how old I am. But yeah, it was great to hear how things were done back in the 90s, to be honest. So this was great. That was an amazing part of our conversation. It was a great conversation in general, to be honest. I mean, Joe has a lot of experience with many different things that have to do with data
Starting point is 00:48:42 and building actual products on top of data. So it was amazing to hear about all these use cases and the products that he has built with many different things that have to do with data and building actual products on top of data. So it was amazing to hear about all these use cases and the products that he has built even before all the latest hype over building products over data. And of course, it was amazing to hear how they used Redis on that. What I'll keep from the conversation that we had
Starting point is 00:49:02 is about ML and how machine learning is actually deployed right now and how early we are in the commercialization, let's say, of machine learning. There are amazing things that are happening and a lot of work that still has to be done. And that, of course, means that there's a lot of opportunity out there, both for building amazing technologies, but also building amazing companies. So let's see what happens in the next couple of years. I think it's going to be fascinating. It absolutely will. Well, thank you again for joining us on The Data Stack Show. More exciting episodes like this one coming up.
Starting point is 00:49:38 So until then, we'll catch you later. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
