The Infra Pod - From Spark to Eventual: Reinventing Data for the AI Era (Chat with Sammy from Eventual)
Episode Date: December 15, 2025
In this episode of The Infra Pod, hosts Tim from Essence VC and co-host Ian Livingston (Keycard) interview Sammy Sidhu, CEO of Eventual, a multimodal data processing platform. Sammy shares his journey from AI research and self-driving cars to founding Eventual, discusses the challenges of processing unstructured and multimodal data, and explores the future of data engineering, scalability, and the role of agents in modern data pipelines.
Timestamps:
02:47 — Data processing challenges & founding Eventual
09:40 — Real-world use cases & business impact
24:20 — The future of data engineering & tools
40:00 — Closing thoughts & where to learn more
Transcript
Welcome to the InfraPod.
This is Tim from Essence, and Ian, let's go.
This is Ian Livingston, co-founder of KeyCard,
working to make your agent secure and safe for everybody.
I couldn't be more excited to be joined today by Sammy Sidhu,
CEO of Eventual, the multimodal data processing platform.
Sammy, what in the world got you to start Eventual?
What was the insight?
It made you say, you know what?
I'm going to go build a company because that's a bit of a crazy thing to do.
It is. And I think I learned that along the way. You know, to be honest, given the problems I was facing in my previous life, I feel like it had to be done. I went down kind of an interesting path where I was an AI researcher and computer vision researcher working on self-driving cars turned data guy, where I feel like it's normally the other way around. And yeah, to build self-driving cars, we needed to process lots of data, things that come off
cameras, LiDAR, all that kind of stuff. And finding the right data and getting the right
features out of it and processing it to get in the right format was really painful. And along
the way, you know, I had developed this love-hate relationship with, you know, the tool at
the time, which was Apache Spark. And there was a moment where it was like 2 o'clock in the
morning and I was like going through a bunch of like JVM logs. And I'm just like, okay,
there has to be a better way. And I think in that moment I was like, okay, I think,
I think my future was set on, like, building something that was going to displace it.
And what was it about that moment?
Like, what was the actual pain you were feeling?
Were you like, oh, God, what am I doing with my life?
This is terrible.
This has to get fixed.
Was it, like, a visceral reaction or was it some built-up frustration?
It was a lot of questions, to be honest, right?
Like, the thing I was trying to do is, like, I had a bunch of images in S3.
Right?
There were just, you know, images from the car.
And I wanted to run a model over this data.
Pretty much just grab an image, process it, write it out somewhere.
And along the way, I had to, like, learn about, you know, the JVM, what kind of jar I have to put in there.
You know, I got all of these, like, random OOMs.
And I'm just like, hey, man, I'm just trying to run a model with a fucking image, right?
Like, why am I going through all of this?
And I think, you know, what ended up happening is, like, I ended up building a system that didn't have to use Spark and just essentially became like a glorified batch queue.
And what I learned is that for every single, like, use case we would have to do, we would have to change, like, things around it.
And so just as, you know, Spark and these other data processing engines kind of unified and became a general platform to solve multiple use cases,
I had the idea of like, hey, what if we could do this for all this unstructured and multimodal data?
You know, these are problems that we were facing four years ago and we're still seeing them today.
I mean, if not growing, right?
Like, there's so much more with the rise of large language models and all the diffusion models, those are fun things.
You know, they've proven out the value of deep neural nets and how more data equals
way better outcomes, especially at the scale of compute and training.
I'm curious, sort of like, you know, you were working on self-driving cars and images
and had this general challenge.
Is there, like, a fundamental architectural change that has to happen to enable the
multimodal data processing?
Like, what's the thing about the existing data processing workloads, as an example,
where you're like, this is just, like, the wrong architecture?
Like, it's not designed for building and dealing with this type of data that's, you know,
sequenced in this way or whatever.
Yeah, help us all understand what
the delta is there that makes them basically impractical for these use cases.
Yeah, of course.
I think the way I think about it is kind of like the shape of like where the data starts
to where it ends.
So if you look at like a lot of use cases for like, let's say, you know, with analytics today,
like imagine you have like a clickstream data or like orders data.
And this is like, you know, you have orders coming in from all over the world getting stored
on a table.
And this is a lot of rows of data.
But the shape of the query is often, okay, for
the last 30 days, I want to get these orders, and then I want to aggregate, break it down
by like state or zip code or city. And what ends up happening is I start with a lot of data.
I filter a lot of it out, and I then group it together and aggregate it. And almost along
every step of the way, it gets smaller and smaller. So at the very end, it's like a small enough
table that you can just power like a web UI with or like a, you know, a dashboard. But if you look
at multimodal and unstructured data, it's actually almost the exact opposite, right? Oftentimes
you start with a bunch of, you know, URLs, let's say a bunch of images a
user uploaded. Then I'm going to go from these URLs to downloads and bytes. Then I'm
going to get these bytes of images and then actually try to decode them into real images. And
each step of the way, you actually increase data volume by like 20x. And what ends up happening
at the end of the day is that these systems that are largely designed for like being really
fast for analytics and these like columnar operations actually kind of fall apart at these like
very high inflation row operations. And that's kind of like the problem that we're addressing here.
And so the most common, you know, symptom you see of this is your machine just OOMing
all the time, like just going out of memory. Because if I have, let's say, a single metadata
file of all my assets, in the Spark world that's often in one partition.
Then as soon as I try to like download those URLs and process them, you know, you have to then kind of
start tuning. Like, okay, you know, even though I know this is like one megabyte of metadata,
I want to expand it to 200 partitions so I can actually get parallelism and make sure I don't
OOM. So that's kind of like the rough reason why
a lot of the existing infrastructure is not really well suited for what we're doing today.
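To make that shape contrast concrete, here's a small back-of-envelope sketch in Python. All of the numbers are made up for illustration; they are not Eventual's figures.

```python
# Illustrative arithmetic only: the two pipeline "shapes" Sammy describes.
# An analytics query shrinks data at every step; a multimodal pipeline inflates it.

# Analytics shape: filter, then group/aggregate -> fewer rows each step.
orders_rows = 500_000_000            # a year of orders (made-up figure)
last_30_days = orders_rows // 12     # filter to roughly one month
by_zip_code = 40_000                 # aggregate to one row per zip code
print(f"analytics: {orders_rows:,} -> {last_30_days:,} -> {by_zip_code:,} rows")

# Multimodal shape: URL -> downloaded bytes -> decoded pixels -> more bytes each step.
url_bytes = 200                      # the row starts as a tiny URL string
jpeg_bytes = 2 * 1024 * 1024         # downloading it yields a ~2 MB JPEG
pixel_bytes = 4000 * 3000 * 3        # decoding to RGB pixels is ~36 MB (~18x the JPEG)
print(f"multimodal: {url_bytes} B -> {jpeg_bytes:,} B -> {pixel_bytes:,} B per row")
```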
And what type of use cases, like image data is one.
I'm sure volumetric, like 3D video, 3D images is like another.
Like what type of data types are not suited to your traditional OLAP, you know,
Spark-based workloads for data processing?
Like, is it audio, video images?
Is there other types of telemetry?
Help us understand the variety of problems that can be solved when you think of this
in the context of, like, multimodal data versus just sort of linear OLAP rows and columns.
Yeah.
I think the big one we see today, the one that I've been spending most of my time thinking about now, is documents, actually.
And documents are very interesting.
So is video.
And the reason why it's so interesting is that these are types of data that require a very diverse set of tasks that you want around them.
You have to run models on them.
You have to run vision models on them.
You have to run, like, a lot of text operations.
But the thing that makes them really challenging
is that they're extremely nested.
And a lot of the systems today
kind of just try to, like, almost explode them
and then try to make it into something
where, like, you lose that lineage from being a full document
to just being, like, a bunch of sentences of data, basically.
And so what we try to do is keep that all together
and keep the data in, like, the form that it should be in,
but still give you the primitives to actually, like, express
compute over the nested nature of it.
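As a rough illustration of "keeping it together" versus "exploding it", here's a sketch in plain Python. The field names and structure are hypothetical, not Daft's actual schema or API.

```python
# Nested: one record per document; pages and blocks stay attached to their source,
# so lineage from a block back to its page and document comes for free.
nested_doc = {
    "doc_id": "contract-001.pdf",
    "pages": [
        {"page": 1, "blocks": [{"kind": "heading", "text": "Terms"},
                               {"kind": "table", "n_rows": 12}]},
        {"page": 2, "blocks": [{"kind": "paragraph", "text": "..."}]},
    ],
}

# Exploded: one flat row per block; lineage has to be carried as extra columns and
# stitched back together by hand if you ever need the whole document again.
exploded_rows = [
    {"doc_id": "contract-001.pdf", "page": 1, "kind": "heading", "text": "Terms"},
    {"doc_id": "contract-001.pdf", "page": 2, "kind": "paragraph", "text": "..."},
]

print(len(nested_doc["pages"]), "pages kept nested;", len(exploded_rows), "flat rows")
```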
So it sounds like, yeah, go ahead.
No, I was going to say, it sounds like not only do you solve, like, hey, we're going to just make it possible here.
Like, you can pick up and start running a thing and not have it break, and not spend your entire time trying to tune it and fuss around
with VM sizes and all the craziness.
But you're also, like, just making it easier for people to pick up and, like, actually tackle these problems.
It's not just, like, the pure systems-level optimization.
It's actually that we can start solving, like, novel document use cases in ways where you would have to
go layers and layers of stuff on top of, like, your traditional Spark or whatever to actually do the same thing.
Because it's just not designed for these types of use cases.
Is that it?
No, it's very true.
And that's exactly it.
And like, I guess a, you know, a thing we see today is like if I have one 5,000 page document versus 5,000 one page documents, you have to approach it very differently.
And the variability in length, right, would play a lot into it, especially in a traditional, like, OLAP, PySpark, or whatever workflow.
Like, the variability in length, that's just all complexity.
You have to handle it in different ways.
I'm very curious to understand.
Can you give us a use case, like an example use case, of a thing that you enable with these documents
that's pretty common in, say, you know, an enterprise context, but that would be very difficult for a company to traditionally do with, like, PySpark or, you know, something based on Spark or some of these other more traditional data processing frameworks?
Yeah.
So of the two big classes of workloads we support today, one is where the end artifact is a dataset.
And that's one kind of, like, you know, avenue we support.
And the other one is essentially where your end artifact
is supporting an application.
And so one big use case we support today is if you're a business that has a constant stream
of documents coming in or of audio coming in or images coming in, one of the things that
we help do is be able to read that efficiently, run models over them to things like
summarize, caption, compute embeddings, and then actually keep it in some kind of like
vector store.
And so for this idea of just, like, you know, being able to handle the different, like, variability
of load and diversity of documents, Daft is actually really well
suited. So this is kind of like where you have almost a continuous nature of, like, just
things dropping into S3, and it has to show up there with a very tight SLA. Now, the other use case is
actually really interesting where if you're like an AI lab, you need to train on a lot of data, both
for pre-training and post-training. And one of the things that we do is help go from all the
cruft of the internet to, like, the most valuable data for your model to train on. And both of
these have very similar extraction phases inside the query.
And I think this is interesting.
Because when you mentioned Spark to me, well, I used to work on Spark a lot, right?
Spark has always been more like, your datasets have structure.
You know, you're able to extract stuff pretty much with all these sorts of, you know, operators.
And it's meant for scale, right?
Like, it's the better Hadoop, really.
Like, you run 100 nodes, now you can run 50 nodes, right?
Or 30 nodes.
And everybody just kind of keeps ballooning.
And I think you're talking about similarly how AI labs need scale, but the data is so much more complicated, right?
There's, like, a huge amount of nesting and things to scan; vision is still often involved
when it comes to, like, parsing PDFs. And so how does Daft kind of work? Because
I think I hear you talking about, like, the performance side, talking about the scalability side,
but do you see people that are trying to even extract data from these documents? Is there, like,
a different way to even interact with documents that people would not be able to use with Spark?
Do you have, like, some sort of really simple way to programmatically try to
get data out of these documents? Because these PDFs are no longer structured data anymore, right?
It's not, I have a Hive metastore of tables and columns.
You know, it's almost like, on page 25 I want this table, or, like, maybe get all the spreadsheets or something, right?
It's a very, very different way of programming.
So I think beyond just the scalability, maybe talk about how you think people can interact with this data.
That's a really good question.
I think you highlight something really important here, which is, again, in the big data
era, the thing that was big was essentially the cardinality of rows, like the number of
rows, and your operations were essentially expressing compute across these rows. But in the
multimodal era, it's almost like the big data part is what's inside the row, right? And the
complicated bits are inside the row, and that requires coordination. And so some of the things
that are very different now from, let's say, the Spark era, is that the operations you want to
run are pretty much in the engine themselves. You want to, like, lowercase your string. You want to
find some regular expression match. Spark has all those built-ins. But now when you actually want
to process, say, a document or a video, you're running these very complex functions that,
for example, call LLMs, call a visual model, maybe call an API service. And there are so many
things that can fail. And so oftentimes what we see is someone starts out with an open source
project. There's actually a lot of really good AI tools out there today to just get started
on your laptop. I think one of my favorites is one called Docling, which is, like,
a Python library where you can actually point it at a PDF and it actually runs, like, you know,
LLMs in agentic ways, where it can look at your document, run OCR models, and then be like,
okay, I'm going to use, like, an agent to figure out where the semantic boundaries are that I should actually,
you know, cut this up at. So that works great for running on, like, one or 10 documents on a laptop.
But if I'm now running this over thousands of documents in the cloud, you know,
if you run them as UDFs, now you have a thousand UDFs trying to spin up their own models,
trying to spin up their own PyTorch VLM, and it just kind of explodes.
And so one of the things that Daft does here is we actually don't just treat things as a UDF.
We actually treat these almost like resources, where you say, okay, I need to use an LLM,
I need to use a VLM, and we can actually, like, enable your scalar functions to actually, like, use these almost like tools.
And then the engine kind of handles the important bits of actually being able to like scale that
and let the function do the things that it needs to do
and let the models do what they need to do.
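Here's a minimal sketch of the "models as resources" idea in plain Python. This is not Daft's actual API; FakeVLM is a stand-in for an expensive vision-language model. The point is only that the engine, not the per-row function, should own the model instance.

```python
from functools import lru_cache

class FakeVLM:
    """Stand-in for a model that is slow to load and heavy in memory."""
    def __init__(self):
        print("loading model weights...")          # imagine seconds of startup, GBs of RAM
    def caption(self, image_bytes: bytes) -> str:
        return f"caption for {len(image_bytes)} bytes"

# Anti-pattern: the per-row function owns the model, so a thousand parallel UDF
# instances each try to spin up their own copy.
def caption_per_call(image_bytes: bytes) -> str:
    model = FakeVLM()
    return model.caption(image_bytes)

# Resource-style: one model instance per worker (faked here with a cache); the
# scalar function just borrows it, so scaling the function doesn't multiply models.
@lru_cache(maxsize=1)
def shared_vlm() -> FakeVLM:
    return FakeVLM()

def caption_with_resource(image_bytes: bytes) -> str:
    return shared_vlm().caption(image_bytes)

for img in [b"\xff\xd8...", b"\xff\xd8......"]:
    print(caption_with_resource(img))              # model loads once, not once per row
```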
And the part that we're working on in our product
is actually super interesting
is that we actually enabled this to be fault-tolerant,
not at the infrastructure level, but at the data level.
And so what that means is that, let's say you run into a PDF
that's corrupted.
That shouldn't bring your whole pipeline down.
What we can actually do is actually trace, like, that PDF
through the whole pipeline, see what caused it to fail,
and then almost, like, quarantine it.
So you can actually address it later.
The important bit here is that, you know,
GPUs are really expensive to run. And so
the thing is we can actually make sure that those
are well utilized. And the bits that are running
the business logic can also run very
efficiently.
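A sketch of what "fault tolerance at the data level" can look like, in plain Python. This is illustrative, not Eventual's implementation; the idea is that a bad row gets recorded and quarantined with its error instead of taking the whole job down.

```python
def parse_pdf(path: str) -> dict:
    # Stand-in parser that raises on one bad input.
    if "corrupt" in path:
        raise ValueError("unparseable xref table")
    return {"path": path, "pages": 10}

def run_pipeline(paths: list[str]):
    results, quarantine = [], []
    for path in paths:
        try:
            results.append(parse_pdf(path))
        except Exception as err:
            # Record the failing row and the reason, then keep the pipeline moving.
            quarantine.append({"path": path, "error": str(err)})
    return results, quarantine

ok, bad = run_pipeline(["a.pdf", "corrupt.pdf", "b.pdf"])
print(len(ok), "processed;", bad)   # 2 processed; the bad row is kept for later triage
```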
Yeah, I remember working on Spark.
Like, all the big data papers, even the
Spark papers, were all talking about, like,
stragglers in your cloud. Like, I run a
thousand VMs. Ten of them
will be bad, right? You know, Amazon.
Like, it's YOLO. Some of them will just, like,
not process at all. So you have to like
kill the stragglers. You know, some of these are
just dragging the whole cluster's compute down.
I think that kind of thing hasn't really been in anyone's attention span anymore.
But you're talking about, like, yeah, data can bring you down.
And I think in a traditional data world, it has happened too, right?
You know, you have things that are stored in Parquet or maybe Avro.
They have to be able to understand the data.
But oftentimes, like, JSON could be anything, right?
So, like, it can be kind of YOLO.
And so are there any unique challenges,
I guess, to address sort of, like, this data lineage,
and sort of, like, almost how you're decoupling the model and the code?
Because like I said, you don't want, like, a thousand of them.
If I have a single node running 50 Doclings,
they independently have to reload the same resources and model over and over.
It's a huge inefficiency.
So what are some unique challenges that you had to spend a lot of time on?
Is there a good example of, like, hey, this is a pretty gnarly problem I didn't realize
would take so much attention?
Or something like that.
I think one that keeps me up is inflation.
And this is quite interesting where it's like, if you think about like, you know, bad data in like the parquet world, it's like if you see like a 30 gigabyte parquet file, you're like, they probably didn't like produce it correctly, right?
Like you shouldn't have a 30 gigabyte parquet file.
But the thing with PDFs is that the thing that kills your system isn't like a, you know, a hundred megabyte PDF because, you know, if you see 100 megabyte PDF, it's probably because they store a bunch of images in there.
And when you inflate it, it's not actually like that bad.
But sometimes you have like this 100 kilobyte PDF that becomes like almost like a zip bomb for your system, right?
These things that have, like, these very high inflation, where they look really small on disk, they look like they're, you know, going to be fine.
And then when you actually try to process them, they actually just kill your system.
Why is that?
I actually don't understand the problem.
Like 100K PDF, why would it kill your system?
Is it because like the structuredness inside?
There's like a million nested tables or whatever?
Or what happens?
Yeah, it's when there's a lot of redundant data.
That's when that kind of happens.
So when you actually work with, like, compression and all that, that's where it gets you.
So the most common example, where I saw this in the Parquet world, is imagine you have, like, a file where you have a column that represents, like, I don't know, did this person go to the grocery store.
And the values are probably, like, a Y or an N, like a yes or a no.
And what ends up happening is that gets, like, dictionary encoded.
So it only stores pretty much, like, one bit.
And then you actually run compression over it.
And so your, you know, massive data set becomes, like, a really small parquet file.
And then when you actually, like, decode it, you un-dictionary-encode it.
You decompress it and it ends up becoming, like, a billion rows.
And when you store it in the in-memory representation, it completely just kills you.
And so, like, this is the case for, like, multimodal data as well.
Where, like, you know, if you have a JPEG, JPEGs are actually amazing, like, codecs.
Like, it's a great codec.
But when you actually try to, like, store it as, like, pixels, you can inflate it 20, 30 times.
With video, it's even worse.
And so with PDFs, images, and video,
this is kind of the thing that kind of kills your engine.
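Some rough arithmetic behind both of these examples; every number here is invented to show the scale of inflation, not measured.

```python
# Parquet case: a yes/no column over a billion rows dictionary-encodes to ~1 bit per
# value on disk, but decodes to a full value per row in memory.
rows = 1_000_000_000
on_disk_mib = rows / 8 / 1024**2          # ~1 bit per value, before compression
in_memory_gib = rows * 1 / 1024**3        # ~1 byte per decoded value
print(f"parquet column: ~{on_disk_mib:.0f} MiB on disk -> ~{in_memory_gib:.0f} GiB decoded")

# PDF "zip bomb" case: a 100 KB PDF whose pages render to full-page RGB images.
pdf_bytes = 100 * 1024
rendered_bytes = 50 * 2550 * 3300 * 3     # 50 pages at 300 dpi letter size, RGB
print(f"pdf: {pdf_bytes / 1024:.0f} KiB -> ~{rendered_bytes / 1024**3:.1f} GiB, "
      f"~{rendered_bytes / pdf_bytes:,.0f}x inflation")
```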
It's interesting.
One of the things I was thinking about as you talk,
like, I'm kind of curious:
a lot of, like, traditional data processing pipelines
are there to drive business analytics, right?
Or to drive sort of like long-running,
generally asynchronous tasks where the freshness requirement was 24 hours
or even a week.
I'm curious, like, as you think about these use cases,
is there also any shift?
in terms of, like, the time to insight or the time to run?
And, like, is it still, like, broadly a batch workflow?
Or do you think this is more of a streaming thing?
Like, how do you think, like, Eventual, the work that you're doing on multimodal,
and then the broad industry change from, hey, we're not just working on OLAP,
you know, on column, row table information systems anymore?
We're now dealing with, like, you know, high-dimensional data, documents, video, images,
audio, whatever.
Yeah.
I think there's been three really big shifts
that kind of like change things quite a bit.
I think the first one, you know,
comparing, like, when I worked on self-driving to today,
is that before if you wanted to work with like images
or like documents, you kind of had to like have some expertise
with computer vision and learn how to use like PyTorch
or like, you know, OpenCV and all lots of stuff.
And so the barrier to like work with these modalities is actually quite high.
But nowadays with, like, these
types of, you know, language models and visual language models, the barrier is kind of, like, gone.
Like, one of our users used to, like, train CV models to detect if, like, a camera on, like, a dash cam was facing forward on a road or back on a road.
And they would have run a classifier that would, like, predict a one or zero.
But now they just put it into a visual language model and say, hey, they literally write the text, like, the prompt, is this facing forward or backward?
And so now, like, the barrier to actually get, like, value from these types of modalities has gone down. Like, it's just so much
easier to do. So what that means is like a lot of your data that was previously like,
you know, almost like useless to your business is now very valuable because you can actually
like extract value out of it without putting a lot of investment into it. I think the second
thing that is very different now is that GPUs have gotten a lot more efficient. You know,
running batch inference over a million, you know, items before, it was like, oh, that's actually something
I have to think about the cost of. But now it's like, just do it. Like, you know, they're a thousand X more
efficient than they were like five years ago. Let's just go for it. And third is, I think, the thing
that we're seeing the most of today, which is it's no longer humans that are the users of this data.
You now have these like intermediate systems like agents actually like leveraging these data and
kind of like refining it before the user even sees it. In terms of what that means for your
data pipelines, things go from, like, an SLA of a day to now having to be a lot quicker.
Because reducing the SLA from, like, 24 hours to, let's say, you know, one hour or one minute means that these systems can actually take action sooner and get you insights or value sooner as well.
Interesting. What's the intersection between sort of the work that you're doing with Eventual and, like, agents?
Are you like, is part of the thing that you're optimizing for is actually integrating agents into the pipeline or as a feedback loop?
And what does that look like broadly from the way people are building pipelines today?
I think we're still kind of early in that evolution.
But I'm very curious to understand what you're hearing and seeing.
Yeah.
So with agents, today I see them used in two places.
And I actually have a third take as well.
But the two today are, one is traditional retrieval.
We need to process all the data, populate these things like vector DBs or indices
so that an agent can actually retrieve from it and use it for asking questions and whatever like that.
So like the traditional rag use cases that we see here.
But another big one that I see is like refined search.
And so one of the use cases that we power today is like a document company.
And you know, as users make actions and changes, we need to be able to, like, get that data and re-index it really quickly.
And then write it back out to, like, whatever retrieval index they use.
So that when they do search or AI recommendations or AI features, it's fresh.
And so that makes the user very productive if you can do that.
And so in that case, you have users and agents both actually reading from these indices.
And the second case is actually using agents within each row, essentially.
And this is kind of like the example I was talking about with the document where like I think before you would have an engineer kind of like optimized for the structure of a document they might know.
So, like, let's say it's like, oh, I'm processing 10Ks, like, financial 10Ks.
Therefore, I'm going to, like, write an algorithm that actually, like, knows the structure of a 10K and then, like, do that very effectively.
But what I'm seeing more of now is folks are uploading, like, you know, varied documents to a source and actually relying more on AI to process it at, like, a per document level and having the intelligence there rather than have an engineer try to, like, figure out how to, like, write an algorithm to process all these, like, you know, each type of, like, unique category of document.
And I feel like there's so much
complexity just on parsing documents alone, right?
Because the kind of issues we talked about,
and even the use cases, are really about, like,
I want to have this accessibility to this data
and have the understanding of how to parse
this variety of stuff here.
But I'm always curious: a huge part of Daft
and sort of, like, the messaging is about scalability
and performance as well.
And so do you see everybody
parsing data at scale and growing at a pretty
fast pace right now with this sort of data?
Like, there's so much demand and it's going crazy?
Or do you see, like, most people trying to do this in some small-scale way to
try to figure out what they're doing with it?
Like, what are you seeing in the trend lines, and what kind of data is actually getting the
most usage at scale, versus just, you know, doc cleaning a little bit here and there?
I actually see things exploding quite a bit.
And the way I think about this is like, you know, I call it the Granola effect.
Like, as soon as you have, you know, AI voice transcription and, like, LLMs being good enough,
now everyone in the Valley is, like, using Granola to, like, record pretty much every meeting,
because, like, it is so cheap to do so.
Like, for me, it's now recording every single meeting I'm doing.
I can now, at the end of the day, like, ask questions about it, get, like, the overall shape
of, like, my meetings, things I can improve on.
And what we're going to see more and more, like, by having the processing infrastructure
to, like, make this really cheap and easy, I think,
is folks, like, recording all their Zoom meetings, recording all their documents, bringing in, like, all their Slack history, their GitHub, their processes, kind of, like, pretty much all the sources of data that get generated in their business, coming into one place.
And the only barrier I see today is actually like the cost and the complexity of integrating into these systems.
So interesting. And so, yeah, I think video and audio are so easy to capture these days, and the volume of every single one of them is so high.
Like this Zoom recording we're doing right here, it's not that tiny, you know? I had to go through a bunch of tools and stuff like that. And so, like, I guess if you solve the individual file problem, which is not easy, is the scale problem pretty much just, like, a standard scaling problem? Like, try to add more nodes, try to coordinate these, like, it's pretty much a standard distributed systems problem? Or even at scale, are there actually pretty unique challenges as well? I'm just curious, like, is there something even on that sort of, like, scalability side that is also a little bit different
from the traditional Spark side as well?
Yeah.
No, I think it gets quite interesting
where running the models
becomes, like, actually the really challenging bit.
And I think currently what we've seen a lot of folks do
just to kind of like get something out the door
is a lot of these inference providers
just provide you like a rest API.
And so the natural tendency, you know,
that we see folks have is, like, they get, like,
you know, their Spark pipeline
and they just do, like, a REST call to the inference provider.
And then what ends up happening
is, like, you run the Spark job,
and now you have, like,
you know, a hundred thousand requests per second just, like, hitting an inference provider that previously
had no load. And then you actually just, like, have retries. The thing goes down. And what ends up
happening is it can actually, like, chug along for a bit. And then, I don't know, a really bad retry,
let's say you hit all of your quota, will actually just kill the full job. And so that's kind of
one of the reasons why we're actually trying to bake in models as a core concept here.
We're like, we can actually make this much, much more reliable. These are not really problems
you hit if you're just trying to process like thousands of rows or tens of thousands of
rows. But when you actually try to go production scale, this is something that we see a lot.
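A sketch of the kind of guardrails an engine can bake in around inference calls: bounded concurrency plus capped retries with backoff, so a big batch job can't stampede a provider or die on one bad retry. The limits, failure rate, and "endpoint" here are all hypothetical stand-ins.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 32      # cap concurrent requests instead of 100k at once
MAX_RETRIES = 5

def call_inference(payload: str) -> str:
    # Stand-in for an HTTP call that sometimes gets rate limited.
    if random.random() < 0.2:
        raise RuntimeError("429 Too Many Requests")
    return f"result for {payload}"

def call_with_backoff(payload: str) -> str:
    for attempt in range(MAX_RETRIES):
        try:
            return call_inference(payload)
        except RuntimeError:
            time.sleep(min(0.1 * 2 ** attempt, 5.0))   # exponential backoff, capped
    # Out of retries: quarantine the row rather than killing the whole job.
    return f"QUARANTINED: {payload}"

def run(payloads: list[str]) -> list[str]:
    # Bound concurrency so a big batch job can't stampede the provider.
    with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
        return list(pool.map(call_with_backoff, payloads))

print(run([f"doc-{i}" for i in range(100)])[:3])
```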
What do you think the future of like the data stack looks like? I mean, you must have a very
specific take, right? So, you know, the data stack of, say, 2021 was like dbt plus maybe Snowflake
plus some stuff. And that's how people drove business, how they'd build business analytics,
and then like a thousand different companies around it,
the modern data stack.
Like, yeah, what's the future look like here?
Yeah, that's a good question.
I'm actually a really big believer that we don't need some kind of, like, unified storage platform
that's kind of like, you know,
the Databricks for multimodal,
but actually just more like keeping things open,
keeping things in, like, good old formats like Parquet,
and just keeping everything in object storage.
And then just using the right tool for the job.
I think you have this explosion of tools like, you know, DuckDB, which is fantastic for, like, doing low-scale analytics.
It should be able to interact with the system.
If you're trying to do multimodal data processing, a tool like Daft and Eventual is really good for that.
And so the way I see this is like you should just store all your data in one place.
And I think that's object storage and just use the right tool for the job.
And just keep it as open and, as, you know, I like to say, as barely engineered as possible.
Just keep your images as JPEGs in S3, that's fine.
Keep your videos as videos in S3, that's fine.
But just use the right tool for the job,
and that tool should be good at reading and processing that data.
And, you know, one big part you just mentioned,
because these are pretty much,
these file formats have never been, like, designed for quick lookup, right?
These are, like, really about rendering,
retaining information about all the metadata,
and about rendering onto clients. PDFs and all these formats,
I feel like, have always been designed for that,
like JSON, just the unstructuredness of it.
But when you talk about, like, latency, scalability, reliability,
the data side becomes quite important.
And you don't want to, like, go from scratch,
assuming you know nothing about the data, every single time.
And so we always have this idea about indexing,
like the ideas in Parquet of creating, like, this specific schema
and statistics and cardinality, where all this information is really
helping the query engine know how to
even perform its strategies, right?
I don't think there's anything like that at all right now.
And so do you see yourself having to, like, implement or come up with a metadata-schema-ish
thing to make unstructured data become more structured, to help you on the processing side?
Or do you think it's too hard, you know, and you just store PDFs as they are? You know,
I'm just curious what you think about that side of things.
Because in my mind, like, if I want to architect a system that actually works reliably
at scale, you basically have to turn unstructured into structured quickly to be able to actually
do the best job at it. But we don't have any of that.
That's a really interesting question. And I do
think at the metadata level, it makes a lot of sense. And that is stuff that we're working on internally
as well. One example I think about is if I have a video just sitting in storage, let's say I have, like,
a one-gigabyte video. And let's say a system wants to be like, okay, I want to jump in at, like, 35
seconds, the current way you would have to do it is you start at the beginning and you read the video
until you get to 35 seconds. But a much better way is actually just being able to
be like, okay, I'm going to, like, index the keyframes, right? I'm going to, like, know exactly
which byte I should start at depending on, like, where I want to jump in the video.
And so by, like, actually indexing that level of metadata, it can actually save you a lot
of bandwidth and misery, actually. And so it actually
makes a lot of sense. But that doesn't necessarily mean that we have to actually like
reinterpret the video file. It just means that we almost like add like an overlay
metadata or statistics on top of it. And that could be stored separately, I believe that. And it's
a pure optimization. And I think the one thing that I see is that, you know, as time goes
on, we kind of like bridge together the serving layer and like the data layer, like kind
of more like the query engine layer of these assets. Like, let's say a PDF. Say there's a single
PDF that you have in S3, and you want to be able to, like, serve that to your user.
And the way we see that is folks that are like, okay, I want to get this asset, and the
backend will, like, pre-sign that URL and send it over to the user.
But if your query engine wants to read it, I can read the metadata layer, and then also
grab that PDF.
Now, I only have one copy of my data that I have to keep around to govern, to do, like,
permissioning on, and to retain, essentially.
And every additional copy you would have to make in some other system just adds
more complexity.
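For the keyframe-index idea earlier in this answer, here's a tiny sketch of what that overlay metadata could look like. The offsets are made up; the point is that a small (timestamp, byte offset) index lets a reader seek close to the 35-second mark instead of decoding the video from the start.

```python
import bisect

keyframe_index = [        # (seconds, byte offset into the file), sorted by time
    (0.0, 0),
    (10.0, 8_400_000),
    (30.0, 25_100_000),
    (40.0, 33_800_000),
]

def seek_offset(seconds: float) -> int:
    """Byte offset of the last keyframe at or before `seconds`."""
    times = [t for t, _ in keyframe_index]
    i = bisect.bisect_right(times, seconds) - 1
    return keyframe_index[max(i, 0)][1]

# Jumping to 35s: start reading at the 30s keyframe's offset and decode forward.
print(seek_offset(35.0))  # 25100000
```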
Traditionally, a lot of these, like, data pipelines have fallen into some category of, like, data engineering tasks. What do you think the future of building apps that are data intensive looks like? Like, you know, the key value of this generation of AI, because it changes every time we have a new type of model or architecture or some new discovery, is the fact that we can now provide at runtime a bunch of additional context, and that context finally changes what this thing can actually do. I'm curious what you think the future of how we
build apps looks like in an agent context, and where things like these data processing pipelines,
these serving layers, kind of fit into that architecture and that workflow. And who does it?
Like, does data engineering continue to be, like, a specialized thing? Or does it, like,
you know, as developers skill up, actually more converge back to sort
of a generalized development skill?
No, it's a really interesting question. I have a more opinionated
take on what it looks like for multimodal data engineering. And before, what it would look like
is you have to have a pretty good idea of what you're looking for in
some unstructured data to kind of, like, start the effort of extracting that data.
So an example of that is like if I have a flow of images coming in, at the time where I get
this image, I would say, okay, I want to extract out these signals.
Like, does this image have a cat in it?
And then write that to some, like, tabular source.
And this would be the function of the data engineering team or the ML team, which is like run
this transformation.
But, you know, with the things that we talked about earlier, which is like things are
getting a lot more accessible now, I actually view a lot of that being automated, where
I think now what I actually see is, you know, someone who's an engineer in the company
or leader in the company asking the question, like, hey, I want to find out what are some
things I can do better in my Zoom calls and in my emails to improve, like, I don't know,
my open rates. And then actually being able to like, you know, have an agent or an LLM actually
make some hypotheses, try to figure out the signals you want to actually extract out, run those
pipelines, extract out those signals, try to bring it all together and almost give you a report
at the very end of the day.
And so the way I actually see this is that you're actually giving instructions to these agents,
as well as almost like a budget.
And that's why these engines have to really understand the concept of a budget,
especially using these models.
Because it's not just compute anymore.
It's all of these APIs.
It's all of these like models you have to use.
But that's kind of how I see the future, especially with this like, you know,
field of multimodal data engineering because there isn't really a status quo yet.
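A sketch of the "budget" concept as code: the engine meters model and API spend for a job and stops before blowing past a cap, since the cost is no longer just compute. The pricing and cap are invented for illustration.

```python
class Budget:
    """Tracks model/API spend for one job against a hard cap."""
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, tokens: int, usd_per_1k_tokens: float = 0.002) -> float:
        cost = tokens / 1000 * usd_per_1k_tokens
        if self.spent_usd + cost > self.cap_usd:
            raise RuntimeError(
                f"budget exceeded: ${self.spent_usd + cost:.2f} > ${self.cap_usd:.2f}"
            )
        self.spent_usd += cost
        return cost

budget = Budget(cap_usd=5.00)
for doc_tokens in [1_200, 800, 4_000]:   # per-document token counts, illustrative
    budget.charge(doc_tokens)
print(f"spent ${budget.spent_usd:.4f} of ${budget.cap_usd:.2f}")
```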
So I think this is probably a good timing to jump into our favorite section called the Spicy Future.
Spicy Futures.
So, Mr. Sammy, you've got to give us some spicy hot take here.
What do you believe that most people don't believe yet?
I think I kind of said this a little bit already, which is, I don't think there's that much bad
architecture in data.
I think it's just bad tools.
And I think we've seen this a lot, right?
So one of the things that, like, we've kind of done for one of our users, Mobilize,
is that, you know, one of the things they were doing a lot is, like, for every single
type of work that they were doing, they were actually
storing like a different version of their data.
Like, oh, yeah, this is like a more condensed version.
This is more of like an analytical version.
This is more of a different version.
And one of the things that we did is just, like, made Daft really, really, really good
at reading tons of little files that sit over all their clouds.
And by doing that, we actually improved bandwidth by over, like, 100x,
while, like, not storing a billion versions of your data.
And so my take is that, like, I think we can actually solve a lot of the common
bad architecture practices in the cloud with better engineering.
Yeah, that's fascinating.
I think when you say bad architecture of data, I thought of Lambda architecture right away.
Like, we used to have to produce two different type of data streams.
One is for real time.
One is for batch processing.
And because they are optimized for different tools, you almost are forced to basically tee off
your data, you know, in two ways.
And I guess for multimodal data, you have to tee off, like, in eight ways.
And we're almost seeing the same thing right now.
Where, like, we're seeing the Lambda architecture for multimodal, right?
You're like, oh, yeah, here's my online and here's my stream case and here's my batch case.
And I'm just saying like, I don't think we have to do that.
Like just as what we saw with Lambda is like as time went on, companies kind of started
converging and they started doing like things like micro-batch and whatever else instead
of doing like two explicit pipelines as well.
Yeah.
And I think the hard part of doing that on the traditional data warehouse is, like,
OLAP and OLTP, like, the Holy Grail of HTAP has never had enough maturity and is just so damn
hard to actually do in practice. And for you, I guess you feel pretty optimistic, like,
this is the one that can do the Holy Grail here, right? Like, you just shouldn't go off and do
all these things. Because I think it's general consensus that HTAP would be great if it actually
worked, right? But it's, like, so hard to build. And for you, it's like, I don't know if this is the exact same
case, right? I don't think people are using it for real-time purposes,
for the most part, right?
So it's actually still batch.
It's just a different type of data.
I mean, correct me if I'm wrong,
do you find even yourself
becoming almost like an HTAP version of things
on the other side,
because there are so many different types of usages,
both on real time and scale and everything too?
So that's the thing that's quite interesting
where it's like it's becoming increasingly online.
So like an example of this is like,
imagine you do have like an agent
or an application that comes across a document.
And you're like, okay, the agent or application
wants to process it.
Should they have to send it to a completely different provider that provides an online version of it versus the one that already is provided in the batch system?
I mean, today they might have to do that, but I don't think they should have to.
It is almost like the HTAP of multimodal, I would say.
Or like, batch or online, it shouldn't really matter.
I'm curious, like, HTAP.
Yeah, HTAP's a holy grail, but, like, no one system will ever actually enable true HTAP, right?
Because the optimizations are so fucking different.
I'm so curious to understand, from your perspective, in this world, where does the core of the data system and the query get run?
Is it getting run against data managed by this core system, or is the serving layer still somehow different?
The way I think about it is like, what is like the origin of data, right?
Is the origin of the data sitting in, like, some managed data lake, or is it sitting in some online source of data?
But I think the thing that we do need is almost like unifying that compute.
And is that about, like, mapping the different locations so that they can be self-discoverable?
Like, one of the things I've, I'm sure you look to MCP.
I don't know how much you spent with it.
But, like, one of the challenges for, like, MCP and using MCP to drive any agent is just, like,
the list of potential things an agent could do, just completely clouds the context window.
And so as a result, you kind of actually have to have these layers where it's like,
okay, agent, you're trying to do a job.
Now use this MCP to help you plan and figure out what other MCPs you could use.
And then, like, slowly over time, they kind of build up the context window so it
can be managed. I'm very curious, like, what you think that management layer looks like
in the future of data, because I think you'll end up with the same problem, especially as we
move to more agentic workflows, as to how these things get discovered across. And maybe the
solution here is, like, this is what data catalogs are for, but I think it's a very interesting
question. It is a very interesting question. And I think that they're both hard for different
reasons. I think with, like, the, you know, the example you gave is that the workflow is
very dynamic. Like, depending on what the agent is trying to do, it might be just very different
in terms of what you're trying to actually get out of the tool.
Versus I think in a lot of cases we see for multimodal is that the workflow or the thing you're trying to accomplish is actually very similar.
Like in this case, I want to go from a document to some kind of question answered at the very end of it.
And you might want to run it in batch or online.
But the thing that's very different is almost like where the data is coming from and where it should end up.
And so I think that's where we can build better engineering to make that possible.
So fascinating.
So, I mean, before we end, we
probably want to really talk about, like,
what you think will be a huge unlock
for folks to truly get
this sort of, like, HTAP.
Like, is Daft basically there,
I can now do real time, batch, everything already,
like, people can just go pip install
Daft and we can get that already? Or is it, like,
something you're fundamentally still working on, and do you
feel like it's coming pretty soon? Like,
is there, like, an example of a thing you're
unlocking, and what do you think
the timeline is, almost like, this is,
I think, 2027 will be the HTAP systems coming out from Daft, or something like that, yeah.
It's really, yeah, that's a good question.
Yeah, so, I mean, today, Daft, like, you can pip install Daft today, and it's a batch multimodal
query engine, right?
You know, you can think of it as, like, Spark for multimodal.
And it's pretty good at these very large-scale queries.
Like, I think the biggest query we run today is, like, an 8-petabyte audio transformation
query that we run for a customer.
And it can handle that pretty well.
But if you want this, like, you know, something almost like closer to HTAP,
that's what we build in our product. We have something called EV Cloud, which is in early
access right now. And what that kind of enables is leveraging Daft and actually building a lot on top
of it, including things like dealing with fallibility, data lineage, and all that stuff. And what
that enables is like going from your sources of data to making it very usable in a short amount
of time, like, you know, on the order of minutes of SLA. And so that's kind of like the first phase
of our plan. The second phase of the plan is then actually enabling you to bring data in from
many different sources, kind of the stuff that we talked about,
or, like, you know, applying the Granola effect.
And then actually making that very usable for all your applications.
The analogy I see here is like, I think the cloud data warehouse, like for analytics
was really powerful, not just for being such a, you know, quick and cost-effective way
of running analytics, but I think the great enabler was that it enabled companies
to kind of put all their data in one place, have all of their teams kind of go to one place
to, you know, run queries and build their applications on.
And it also had, you know, one
unified interface for all their tools to talk to.
I kind of see that for all the unstructured and multimodal data.
That's what we're trying to build here.
Right now, we're not too opinionated about storage.
We work with partners like Databricks and Lance and all of those folks,
to store your data in those formats.
And we want to just make that as easy as possible to interact with.
So your data just shows up.
Awesome.
So for folks that want to try out Daft or learn more about Daft, where should they go?
How do they get more information?
Yeah, just go to Eventual.
If you want to learn about our open source, just go to daft.ai,
and if you're interested in EV Cloud and data ingestion for multimodal data, go to eventual.ai.
Awesome. Thanks for being on our pod, Sammy. It was super fun.
Thanks so much. Thanks for having me.
