The Data Stack Show - 133: Building the Data Warehouse for Everything Else with Sammy Sidhu of Eventual

Episode Date: April 5, 2023

Highlights from this week’s conversation include:

- Sammy’s background in data and tooling (2:46)
- Going from multipurpose engineering to a CTO position (5:14)
- Changes in technology and deep learning models (7:31)
- The state of self-driving and adoption (13:49)
- What is Eventual and what are they solving in the space? (20:54)
- What are Daft and data frames and how do they work? (28:11)
- Building a query optimizer (33:42)
- Sammy’s take on what is going on in data and future possibilities (45:18)
- Eventual’s future and its impact on the space (51:44)
- Final thoughts and takeaways (53:47)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Kostas, I'm so excited that the people that we get to talk to on the show continually amaze me. We're going to talk with Sammy,
Starting point is 00:00:30 who's building Eventual, but he was at DeepScale, self-driving AI technology acquired by Tesla. He was at Lyft, built the Level 5 team there, working on self-driving stuff, acquired by Toyota. Thedriving stuff acquired by Toyota.
Starting point is 00:00:45 The team was acquired by Toyota. And is now building tooling to help people who do those sorts of things have a much better experience shipping sort of large scale projects and models around complex data. And what I want to ask Sammy is, he's had a dedicated focus on a very similar type of problem over a pretty long period of time. And if you think back, and we talk a lot about the recent history of data technology, but if you think back to sort of, I guess, 2015, when he started at deep
Starting point is 00:01:25 scale, you know, there were still a lot of limitations in terms of running models at scale and, you know, data storage and all this other sort of stuff, right? And so I can't wait to hear the trends that he thinks are most important from his perspective, and then what, you know, what he's building at Eventual based on all that experience. How about you? Yeah, I have plenty to talk about with him. First of all, we have to talk about frames. We have to talk about Pandas, the Python ecosystem around data.
Starting point is 00:02:02 And I'd like to hear more about like Eventual itself, like what it is and what's the vision. It's why they built this thing. So I think we are going to have like a very interesting conversation, especially because we have like a person here who has done like all these things around training models and building models and self-driving cars and like all that stuff. And who today is like starting from scratch, like a company that has to do with like developer tooling and data warehouses in ML, but I think that says
Starting point is 00:02:37 a lot about like current state of like tooling and technology that people have in ML. And I want to hear that from him. Like I'm very interested, like to see what's going on and why he made his decision. Right. So let's do that. Let's talk with him. All right. Well, let's dig in.
Starting point is 00:02:57 Let's do it. Sammy, welcome to the Data Sack Show. So much to talk about. So thank you for joining us. Glad to talk about. So thank you for joining us. Glad to be here. Okay, you have an amazing story and you seem to have this knack for getting acquired by large automotive corporations, which is really fun, times in a row. But take us back to the beginning. So can you just give us sort of the overview of your journey in, of your journey in data? Oh, yeah, for sure. So once again, I'm Sammy.
Starting point is 00:03:30 It all started when I went to Berkeley, where I focused on high performance computing, aka making things run fast and deep learning. So this is kind of the era where deep learning started taking off, neural networks were starting to seep into things. And I had found a research lab that focused on putting the two together. So I worked on everything from making models train really quickly on large supercomputers to making small neural networks. And my professor and PI, who I worked with, had the idea of starting a company. So they started a company called DeepSkill, which I joined as one of the early engineers. And during that process, we started with three people
Starting point is 00:04:10 to where we ended up with nearly 40. And I was a CTO towards the end. So there I worked on everything from building deep learning compilers, training new novel research models, to building entire data engineering stacks to process things like point clouds, images, radar, you name it. Towards the end of that, we actually got acquired by Tesla, the autopilot team, where the majority of my team got absorbed into autopilot working on training models or building infrastructure, taking all the learnings that we developed at DeepScale at the time. After that whole ordeal, I went to Lyft, where I continued working on self-driving, where they were a little bit earlier. I brought everything I've learned from before and trained better models for perception, built entire data
Starting point is 00:04:54 engineering pipelines, and really refined that process of how do you actually ship for self-driving. I did that for about three years, and my team my team was acquired once again by Toyota this time. After that, I was like, hey, you know what? I picked up a lot and I'm really kind of tired of the tooling I've had to deal with using systems that were designed for tabular data, things like Spark and BigQuery and kind of adapting it to make it work with images and point clouds and whatnot. So I was like, hey, things can be all better. And my co-founder and I, who I had met at Lyft, decided to start Eventual to kind of build a data warehouse for everything else.
Starting point is 00:05:31 Love it. And so much to ask about Eventual. One thing I'd like to start with, because I think it'd be really helpful for a lot of our listeners, And I'm just plain curious, going from being sort of a multi-purpose engineer, obviously with an emphasis on sort of infrastructure that's feeding sort of heavy duty data science stuff, very practical, wearing lots of hats like early stage company, and then you become CTO.
Starting point is 00:06:11 What was that transition like? And what was sort of, what were the things that like really stuck out to you sort of becoming CTO, as opposed to like, I'm executing a lot of practical stuff day to day to like, you know, get the software to work? Yeah, it's an interesting shift. When you're an engineer, the kind of fulfillment you get is like, what did I accomplish this week or today? Or what feature did I shift? And then when you kind of transition into like a manager, right, or a tech lead, it's like, what features or what product or what impacted my team shift. So you kind of have to like, change your mindset a little bit to get more fulfillment and happiness from that and then when you're a cto it's even one degree removed from that which is what is my company
Starting point is 00:06:51 shipping how are they making their customers happy and the thing that you can't measure is are the decisions i'm making now which have an impact a year from now gonna pay off right so you kind of have to like modify the discount factor in your head for every step of the way. Yeah, that's super tough. Was it a steep learning curve sort of thinking about that discount factor and having to think much further ahead than you
Starting point is 00:07:17 ever had in sort of more practical engineering roles? I think so. Because in the beginning when you transition, it's like you still try to do a little bit of your old job while doing some of the new responsibilities. And what you end up doing is a bad job at both.
Starting point is 00:07:33 And so you kind of have to like learn how to like, okay, I have a team to kind of manage the things I used to do. I need to now focus and get better at the things that are now more important for me. Yeah, that makes total sense.
Starting point is 00:07:44 Thank you for indulging me on that. No, it's just that's, you know, sort of like engineer to manager to CTO. You know, those are just very different, as you said. I'd love to get your perspective on the sort of the problem space that you operated in and are still operating in over a pretty long period of time, right?
Starting point is 00:08:07 So you were building a lot of this stuff back at deep scale. And then you sort of saw that through to Lyft. And then, of course, you worked on Toyota. and now are building your own thing. And so over that span of time, you know, there have been a number of things that have changed, right? I mean, the amount of data available to us or the amount of like non-tabular, like complex data seems to me to have grown a significant amount of that time. I mean, you actually even have like deep learning generating like a bunch of that net news. So it's, which is really crazy to think about. It's not just like capturing, you know, images of the real world, but then also the technology has changed a lot as well. Right.
Starting point is 00:08:55 So like infrastructure, you know, multiple new technologies in sort of the data science space have come out. Other things haven't changed. As you look back sort of over your experience, and maybe even sort of tying that into things that led you to founding Eventual, what are the main changes you've seen over that time period since you started at DeepScale? Yeah, it's really interesting. I think there's been step functions along the way. So I think initially when you're talking about 2015, 2016, the model you can ship, the model that you would build for doing perception or whatever task you were doing,
Starting point is 00:09:35 was kind of limited by how long it would take to train on a single machine. So it would be like you would have a server racked up somewhere in your office and you typically want to wait more than three to four days to train it on. And that kind of set the limit of how much data you could train on
Starting point is 00:09:54 and how big your model could be. And at the time, you would kind of just like, your iteration time would be limited by that. So you had to be very careful with what you trained and the data that went into it. And then kind of the next step that happened is things like distributed training
Starting point is 00:10:10 became very ubiquitous. We had frameworks that made it a lot easier. We had people who had more expertise with it. With that, the volume of data was now proportionate to how many GPUs
Starting point is 00:10:20 you can train on in parallel. And with that, that gave us a huge explosion in both the amount of data we can train on. So we went from things that were, you know, tens of thousands of images or hundreds of thousands of images to now millions of images that we train on at any given time. And then the second thing was that our iteration time
Starting point is 00:10:39 went from something like four or five days to maybe a day or sometimes even hours. And with that, you could actually crank out a lot more iterations of your model. And in kind of self-driving, the thing I believe is the person who gets the most iteration essentially wins. Because every time you can do something, get feedback and improve on that, it's kind of like the sharper journey that you can take for improving your overall system. So that was the next step.
Starting point is 00:11:05 And then we kind of hit a point where the models weren't really advancing anymore. Or another way to put that is the models were no longer the bottleneck. Before there was a point where if I just, you know, change my model to the latest and greatest paper, I would just get like a, you know, a jump in performance. But we kind of hit a point where that didn't make a difference anymore. And the point that was really crucial was, well, how is my data quality? What can I improve my dataset? Are things badly annotated? Do I have examples with conflicting truth?
Starting point is 00:11:35 And that's where the data game became really important. So that's where we were, instead of changing the model, we would just dive into the data set, find the failures, figure out what to do with it, and then iterate on the data sets. And these data systems became very important. that back when you had these, let's call them physical constraints, you mentioned that you had to be really careful, which makes a ton of sense, right? Because you're trying to maximize the amount of, you're trying to maximize every hour that this bottle is running, and you're still trying to create a fast cycle. Do you think anything was lost?
Starting point is 00:12:22 A, did you have to be, like, did you have the license to be less careful? Like in a world where you can just run this thing on so many GPUs simultaneously and like you don't have that physical constraint, like, do you have to be less careful? And then are there consequences for that where it's like, well, I mean, maybe that was helpful in some ways.
Starting point is 00:12:44 It kind of was. Like the analogy I draw in my head is I mean, maybe that was helpful in some ways. It kind of was. The analogy I draw in my head is, if you think about programmers way back in the day, where if you wrote code and ran it through the compiler and it took forever, you had to be very careful what you decided to compile, right? Or things that if you're putting it through a punch card, you have to be very careful what you're doing.
Starting point is 00:13:02 But nowadays, we kind of just write code and smash compile, and then we get an error warning right away. Yeah. Then that's kind of the same question of like, oh, we're programmers back in the day better. And my answer is like, I think programmers today get to focus on higher level concepts rather than the exact like doing the job of the compiler. I kind of view it the same way, which is this is kind of the having just a lot more GPU with a lot more compute is simply just trading off like human time for computer time. Yeah, that makes total sense.
Starting point is 00:13:30 That makes total sense. That's such a good analogy. One thing I'd love to, we actually have had, we had a guest on the show previously who had also worked in a self-driving space, which is really interesting. And one conversation we had with him, this is actually, this has been a while ago,
Starting point is 00:13:50 I think, Costas, maybe a year and a half ago. Peter from Aquarium, I think. Yeah, I know Peter. Oh, okay, great. We worked in the same lab together. Oh, no way. You worked in the same lab. Okay, that's awesome.
Starting point is 00:14:01 I wasn't aware of that connection. Okay, this is great. So we'll get an updated take from you on Peter's take from a while ago. So there's been an immense amount of work that you and Peter both have personally invested into the self-driving space. But for the common person out there, the news headlines tend to lead like the practical experience of self-driving. So where, like, could you describe the state of self-driving from the perspective of someone who's like built literally like core technology that's enabling this to become a reality in terms of like mass adoption of this or sort of, you know, implementation? Yeah, it is interesting. It also depends on
Starting point is 00:14:47 what part of landscape you're looking at. So on one side, you have, I call it the bottom up self driving, which is things like, you know, if you buy a car five years ago, it would have features like AEB, which is automatic emergency braking. Like if you're going too fast for something, it would brake automatically. So that was kind of the bare bones of safety features that you would have in a car. But now you have things like adaptive cruise control, automated lane keep, and it's kind of going higher and higher every year.
Starting point is 00:15:17 And then you have the other front, which is what I call like top-down, which is you're having things like robo-taxis with no steering wheels, and eventually that technology will trickle down to the everyday man. So they're both making progress. But if you actually think about like for a, you know, the bottom up approach, the progress can be gradual, right?
Starting point is 00:15:37 You can have a car every year where the features get better. And I think Tesla has shown this example quite a bit, which is it does more places it's a little bit better and you have companies like mercedes and you know audi also putting out these features but for the level five thing where i get into a taxi with no driver no steering wheel that's like a binary threshold yeah it works or it doesn't yeah so for that one it's i do imagine one day we'll pass that threshold. But for now, the bottom-up approach, the stuff I did at DeepScale and
Starting point is 00:16:09 Tesla, I feel like is making the most progress. Yeah. Are the technologies that drive both of those, what's the relationship between sort of the underlying work? Like was the baseline work feeding both of those efforts or are they approached pretty differently? They're completely different. So if you're just tackling highways, it's actually very simple systems. And we can actually build systems that can drive 90% of the highways in the US without too much difficulty. Wow. Yeah. Wow. So it's not too bad. And we are seeing cars that are starting to do that. Yeah.
Starting point is 00:16:46 However, when you start thinking about like, oh, I want to do the off ramps, on ramps and the cities, then that's when things just get very hard. I think self-driving is the most insane case of the long tail I've ever seen where the last 10% of work is just ungodly. Yeah. How do you... I mean, certainly part of that last 10% of work is dealing with actual changes,
Starting point is 00:17:07 right? I mean, maps have actually gotten incredibly good at incorporating user feedback, even in real time, which is amazing, right? I mean, you've seen this happen over the last couple of years where like, you know, Google Maps and Apple maps, right. Well, like prompt you and say like, is the rec still there or whatever, right. You know, or is the construction still there? And so those feedback loops are amazing.
Starting point is 00:17:31 So those maps are getting better. But when you like, how does that change? Right. Because traffic patterns change, construction happens, all of those like incorporating userating user feedback into navigation instructions to quickly recalculate which street you need to go on
Starting point is 00:17:49 is related to that. But when you have a car with no steering wheel, do you approach that differently? I mean, they're still using map technology, but it's pretty different, or it would seem different. Yeah, so the mapping technology is interesting for that. But also just even general perception, just trying to understand what's going on around the car. It's quite interesting because for the case of not having a steering wheel in the car, you kind of need to get to a critical point where you can adapt to the change in the world as fast as the change is happening in the world.
Starting point is 00:18:22 So I'll give an example. When I was working in self-driving, one thing that came out of nowhere was we had models and mathematical models to kind of represent the motion for pedestrians on the street, how a person would walk on the street, or if they might be on a bike or a motorcycle. But then out of nowhere, suddenly SF had thousands of electric scooters.
Starting point is 00:18:43 Oh, yeah. People would now be on the sidewalk with the road. Yeah, the bird craze, yeah. Yeah, it was wild. And you would see all these random scooters in the street. And then now your whole set of priors before are now useless. Yep. Right?
Starting point is 00:18:55 So you now have to adapt to that change rapidly. Yep. And so that's kind of how it is with self-driving. Why it's so hard is the world isn't like static. It changes. And you need to be able to keep your data loops, your model loops, your ability to ship as fast as the world is changing.
Starting point is 00:19:13 Yeah, that makes total sense. One really specific question is, did you work on distributing models, like distributing updated models to the actual like fleets themselves right because like you have this challenge of like getting data to update the model but then you actually have to redistribute that to
Starting point is 00:19:34 the fleet is that problematic or is it is that actually pretty streamlined now I would say it's pretty streamlined the hard part is doing everything to from the point where you have a trained model to being like I can safely deploy this.
Starting point is 00:19:49 You have to do simulation. You have to simulate this model and the whole stack on tens of thousands of GPUs and simulate over all your historical data. Do things like hardware simulation and then finally do a small rollout and then finally the full rollout. It's usually a lot of work and why self driving is so ops intensive.
Starting point is 00:20:09 Yeah. Yes. Okay, so no, no listener can complain about QA anymore. I would say a lot of the staff at self driving companies are QA. Yeah. Yeah, that's wild. Okay, I could keep going. One more question for me, but then I'm going to hand the mic off to Kostas. Do you own a car? And if so, what kind of car do you own? Because you've been acquihired by multiple car companies. So I just need to know this. So I'm a horrible person to ask this because I love driving. So I drive like a... I actually am a car guy. Oh, nice.
Starting point is 00:20:47 I drive like a 1990s BMW that I work on. Oh, I love it. And then for my daily commute, I have a Toyota RAV4. Yeah, totally. That's great. I have an old 1985 Land Cruiser that I work on. So I'm the same way. It's like this is, you know, pretty low tech, but super fun.
Starting point is 00:21:06 Yeah. I mean, it's a stick shift. I enjoy driving it. Yeah. Love it. All right. Costas. Thank you, Eric.
Starting point is 00:21:14 All right, Sam. So you have funded something new that's called DaVinci All, right? So can you tell us a little bit more about like, first of all, like what DaVinci All is? And then I'd like to ask you what made you start working in that. Yeah, the way I would sum up Eventual is that we're building the data warehouse for everything else. So you could think of things like BigQuery or Presto or Athena.
Starting point is 00:21:42 And these work amazing for things like Talbot data. Or anything that fits in an Excel spreadsheet. But what if you have thousands or millions of images or video or 3D scans? It doesn't really work quite well. What we're doing is building something native for that.
Starting point is 00:21:58 And to do that, we're actually building an open source query engine to help query this type of data. The way I like to why we need a new query engine is to think the first step we have to do there is think about what is kind of the natural user interface for Talbot data. If you ask most people, hey, what's the most natural interface to Talbot data, they'll tell you SQL. I agree that SQL makes a lot of sense for Talbot data. But if I'm starting to talk about images and video, do I really want
Starting point is 00:22:26 to use SQL to query video or like images or random complex data? It doesn't really make quite a lot of sense. What does make sense is having something where the ecosystem is. If you do any type of machine learning or complex data processing, you're probably using Python. You're probably using tools like PyTorch and TensorFlow and various image processing libraries there. So we're building a data frame library that's distributed, utilizing a RAID cluster underneath, but is native to the Python ecosystem.
Starting point is 00:22:57 You can use your normal Python functions, your normal Python objects, but under the hood, we have a really powerful vectorized compute engine written in Rust, and also have a powerful query engine and query optimizer and all the special things that you would want in a data frame. So, all right. Talking about like data warehousing, like for imaging and for like, let's say data types that are outside of the typical relational model that people
Starting point is 00:23:22 have been experiencing so far. So if this thing does not exist today, right? Like how you were doing the work that you were doing all these years, right? Like what was the current, like what were the states of the tooling there? And how good of an experience it was? Yeah. So for companies I worked at or companies I've helped out, the first step of the process, which usually does not change,
Starting point is 00:23:52 is the equivalent of having a bunch of files on your desktop folder. The way that most people do it is they have a bunch of individual images or video just sitting in the S3 or Google Storage bucket. And in the beginning, that's completely fine. They use that directly to train. But then they're like, hey, you know what, I want to kind of version what data I have or keep track of some metadata. And so you end up building a system that first, you know, you have the individual files and something like S3, but then you use something like BigQuery or Postgres to store all the metadata.
Starting point is 00:24:30 Then it kind of starts evolving from there, which is, okay, now we need an easy way to access the data. So then you typically build an abstraction on top of it using like a workload engine. So now you have something where it's like the Talbot data is here, the complex data is here. You have a pointer that points to it, and then you have something like Airflow or Spark to kind of process it. And then what ends up happening is that this spaghetti gets piled more and more on top, to eventually you have three teams managing this one system. That's very limited. And the work we're trying to do is kind of build an engine that kind of bridges the two, which is you can process all the top data that you need, write really expressible queries, but then also have things like image columns, video columns, and you can kind of interact
Starting point is 00:25:04 with both in one place. The next step for the data warehouse is how do you actually store the data together? Such at a point where things don't go stale from one another, things are versioned together and a schema is tied together as well. So what made you leave the work that you were doing, right? Like training the models and like a lot of stuff, like turning to like were doing, right? Like, training the models and, like, a lot of stuff, like, turning to, like, tooling, right? Because my feeling is that from what I hear from you is that
Starting point is 00:25:30 yeah, like, okay, they found additional work that's getting done, like, it's amazing, but we have risked probably, like, a point right now where we're slowed down because, like, the tooling is not there yet, right? And I think, like, for anyone who has been, like, an engineer for some time, and regardless of what engineer you are, right? And I think like for anyone who has been like an engineer for some time, and
Starting point is 00:25:45 regardless of what engineer you are, right, like data engineers or front-end engineers or whatever, like you can see that not every field, let's say, has access to the same type of tools, right? Like what is happening like with front-end development, for example, like the tooling that is out there, like all the different... I know that people keep complaining all the time about all the JavaScript libraries and all these
Starting point is 00:26:12 things, but at the end, that's also an indication of growth and progress in terms of the tooling. In my feeling, at least, if you compare the tooling that a data engineer or an ML engineer has today compared to the tooling that a front-end an ML engineer has today compared to the tooling that a front-end engineer has, there's no comparison
Starting point is 00:26:28 there. It's a very different experience in terms of how much the tools help you do your job. So, how big of a problem do you think of the NBT is?
Starting point is 00:26:41 I guess the answer is easy because you moved on and you're building a company around that. But I'd love to communicate to the people out there
Starting point is 00:26:52 the complexity and how much of a problem it is for not having the right tools out there to do your job. Yeah.
Starting point is 00:27:01 That's a good question. I would kind of put it this way, which is, for the problems you mentioned for front-end, and also I believe for tabular, there's kind of a path for graduation, if you will, right? If you're starting off with a data science project or doing some data mountain engineering, and it's just small sets of data, you can use Pandas and you'll be completely fine.
Starting point is 00:27:25 And once you need to graduate to like, hey, I have more data now. It's taking too long. I can't fit on one machine anymore. You have tools to go to. You have data warehouses. You have things like you have a bunch of different tools that you can go to. However, the kind of the domain that we're tackling, there isn't really a path besides I build a custom infrastructure for my problem solving. So what's happening is that barrier can actually slow your progress quite a bit.
Starting point is 00:27:47 So what we see for a lot of startups and people just doing projects is that their data set size that they can process and kind of comb through is completely limited with whatever they can process in one machine. And if you think about what the implication of that is, is that complex data typically is a lot larger too.
Starting point is 00:28:04 So if you're processing video or images, that's actually not very much data at all. So what we're trying to do is kind of build a tool, you know, via DAF to kind of give people a path to graduate, to give people a way that like they can start off processing data this way. And when they need to scale up, it will scale with them. And then finally, when they're a larger company around it, we give them all the benefits of a data warehouse that you typically find in something like Snowflake or BigQuery. Yeah. So, okay.
Starting point is 00:28:32 So what's Daft? Yeah, I'm glad you asked. So yeah, Daft is a Python data frame that's distributed, that's essentially made for complex data. And what is a data frame? Yeah, a data frame is, if you're familiar with pandas, or polars, that's essentially DataFrame. A DataFrame is a way to represent a dataset
Starting point is 00:28:51 with a set of columns. And it's kind of very similar to when you SQL a database and you essentially get a set of columns and a row that represent those columns. The thing with DAT that's a little bit different is that a column can be something you're used to, like an integer or like a flow or string. But it can also be an image or
Starting point is 00:29:10 a video. You can do operations like okay, I'm loading in these MRI data into my data frame. I have things like the patient name, the patient ID, whatever. And then I can do something like an explode on the MRI image and actually get individual slices. And I can could do something like an explode on the MRI image and actually get individual slices.
Starting point is 00:29:25 And I can then natively run a model on these slices and determine, is there cancer? And determine like, oh yeah, what's the difference between these different frames? You can run all these different operations just using the normal tools you would use in Python. So it gives the machine learning engineer or data scientist a lot of
Starting point is 00:29:41 power with a very simple idea. How is this difference to having, let's say, a standard data frame or even Pandas and be used like a data type that comes to our binary data, like a byte array or something like that, right? Because one of the things that I have noticed that all the stuff that we are talking about here is primarily binary data, right? So what's the difference? What is needed on top of that, like to make this experience like better with working with this data? Yeah, yeah. So I think the biggest thing there is how do you actually represent it in a way that's efficient? So the actual
Starting point is 00:30:21 operations you do are fast. The second thing is, what does the user interface look like? And then the third thing is, how do you actually make this scalable? I think with Pandas, you can put things like NumPy arrays or images inside the Pandas data frame as an object. But the problem is then distributing it over a cluster or natively using what you're used to like PyTorch or TensorFlow, is difficult. So what we kind of do is we represent it itself as an image column and can do things very intelligently under the hood. So for example, if you have something that is an image
Starting point is 00:30:56 and you want to send it across the cluster, we can do things like keeping it in a format that's most efficient, if it was like a JPEG, for example. I think another thing that's pretty interesting is that if you are going to distribute it, the tools that you have are usually not as powerful. So I think the biggest tool that most self-driving companies would use is Spark.
Starting point is 00:31:17 And if you're using the Spark DataFrame API, one of the key things there is that you're kind of limited to types that can support. So you could do things like integers and strings, and they do have a way to just make whatever object you're dealing with into just a bunch of bytes. But the problem is then is that's not very pleasant for the user. I as a user have to constantly convert back and forth from whatever I'm doing to bytes. And then whenever I'm trying to read it back, I convert bytes back to the thing I'm trying to do.
Starting point is 00:31:46 And that's just not very nice. Yeah, I would also assume that like probably, let's say query optimizers for these systems are probably oblivious to this type of data, right? Like they cannot really use the information from a binary array or something like that and optimize the query itself, right? So do you see opportunities to add this kind of functionality if you are more semantically
Starting point is 00:32:11 aware of what is stored there? Yeah, 100%. So we have a very powerful query optimizer in DAF that kind of handles these use cases. I think one of the things that happened, for example, with Spark is when people do get frustrated using the data from API, which you can only use bytes, they drop down to the very low level API, which is called an RDD. Yeah. Which is just, I have a big collection of rows of whatever. And you lose a query optimizer.
Starting point is 00:32:40 You lose all of the out-of-core processing. You lose kind of all the benefits of a Spark DataFrame. So for us, we can combine the best of both. We have a DataFrame API that's very intuitive. You can use GPUs very efficiently. But we can also optimize quite a bit when you write your queries. So whenever you do calls in DAF, it's very lazy. It stacks up and makes a very good query plan.
Starting point is 00:33:04 And we find the most efficient way to actually run it on your cluster. It stacks up and makes a very query plan. And we find the most efficient way to actually run it on your cluster. Yeah, absolutely. And I think it's also completely the whole purpose of having something in Python, right? Which is how easy it is to work with it when you go to the RDD level, where you pretty much have to read the publications that the folks did back in Berkeley when the King was back.
Starting point is 00:33:25 To figure out how to work with these. When you reach the point where you have to write an RDD, it feels almost like, oh, I write Java, but now it's probably better to use JNI or something like that to go and either operate with C and start writing C. So that's not what you want to do here, right? Like, yeah, sure, if you want to optimize, do it. But at the end, it shouldn't be like how everything happens, right? That's just like a bad experience. Question. You keep talking about images, right? But I would
Starting point is 00:34:05 assume there are also other formats that, like other types of data there, right? I don't know, like a radar probably is not generating images, it generates something else, I don't know. Or it can have audio, right? Is there support enough for this, or the focus
Starting point is 00:34:21 is on images right now? They are supported. So we kind of support audio, these different types of modalities that essentially wouldn't fit in a regular tool that you're using there. We've had some pretty interesting use cases. We had a user who was dumping protobuffs
Starting point is 00:34:39 from a Kafka stream just into S3. And they're like, hey, I want to just query a bunch of these protobufs without having to like ingest everything. So what they could do with DAF is just think, okay, read all these S3 files, deserialize it using my proto schema, and then find the ones that have these fields.
Starting point is 00:34:56 And rather than building, you know, ingesting everything into BigQuery or some big heavyweight data warehouse, they could just spin up DAF and essentially write their query in four lines of code. Oh, that's super cool. And you mentioned like the query optimizer, right? Like taking into consideration
Starting point is 00:35:12 like the semantics of the types that you are working with. Tell us a little bit more about that. Like how do you build like a query optimizer with information related to an image, right? Like how does this work? Yeah. So what we found is that there's a lot of simple operations
Starting point is 00:35:31 you do in a query optimizer that can give you like 90% of the speedups. So things like, what we found is if you're processing a data frame, like most data scientists do, you add every column you potentially could use and you kind of just stack on top of it. And so one of the simple things that we can do, for example, is say, okay, the columns we don't
Starting point is 00:35:49 need, let's not actually process them. We can do things like, oh, I actually only need this many rows of data at the end. Let's only read those many rows of data from the very beginning. So doing a lot of these operations can actually drastically speed things up. The thing where it gets interesting for complex data is we can actually factor out a lot of the heavy computations outside of the complex data. So for example, if I have multiple tables or data frames of things with images or audio or something heavyweight,
Starting point is 00:36:20 and I want to do a join on something like a key or some kind of timestamp, you actually don't need to shuffle around the really heavyweight, and I want to do a join on something like a key or some kind of timestamp, you actually don't need to shuffle around the really heavyweight data. You can actually figure out what is the data I actually need to keep or the data I want to emit, and then compute that first and then send over the binary data. So we do operations like these that are much more native or complex data, essentially. Yeah, that makes a lot of sense. And you mentioned distributed processing at some point, and I think also you
Starting point is 00:36:48 mentioned Rave. Tell us a little bit more about that. Because from what I understand, we have, let's say, the data frame, which is the API, how the user interacts with the data, but then somehow the actual processing needs to happen, right? And how does this work with Daft and Venture? And I don't know if there's any difference in there, but I'd love to hear about that. Yeah, I would kind of break it down into multiple layers.
Starting point is 00:37:22 At the top layer, it's like what you mentioned, you have the user API, and this is what the user is telling Da that what they want to do, I want to select these columns or run this model on this column. And essentially, that gets translated into what we call like a query plan or logical plan. And this is kind of like a correct compute graph, if you will, of these are the operations are going to happen. So then the next step there is we get this plan, and we figure out what is the most optimal way to run this. And finally, once we have that optimized plan, we can actually then schedule it onto a distributed cluster using Ray. So each of these steps for a given partition or a given subset of data gets scheduled as an individual function on your cluster. So that's kind of like the three layers of that. The part that gets really interesting and what Eventual is kind of working on is how should you be storing this data instead? So DAF makes it really easy to query data that's
Starting point is 00:38:12 just sitting around an S3, for example, but how should you store it to make it easily accessible, have schemas and kind of all the benefits of a data warehouse? And that's kind of what Eventual is building on the side. So DAF is an open source tool. It's really powerful, but the stuff about how do you actually catalog that data and store it is kind of the main product of Eventual. Yeah, that makes a lot of sense. And why Ray? Why did you choose Ray for
Starting point is 00:38:35 why not Spark, for example? I don't know, like something else. Why Ray? So the main reason, I mean, Ray is pretty low level, which lets us kind of have a lot of control of what we're doing. But the second part is we're very opinionated about not doing anything Java
Starting point is 00:38:52 related in our ecosystem with Python. Like I can't tell you how many probably wasted weeks of my life when you get some random error in Python you have to scroll through thousands of lines of Java logs. In a Spark cluster. When you read the Spark logs in, I don't know,
Starting point is 00:39:07 CloudWatch or something, you go through thousands of Java and you figure out, oh, I forgot a comma. It's just not a fun experience. We wanted just something that's very native, very simple to use, and something where if a user makes a mistake, which they probably will,
Starting point is 00:39:23 that's really easy to bug. Yeah, 100%. I think like big part of the pain that people have with Spark is actually like how to debug this thing. I have heard like many corner stories around that. And like what it means like to deal with all the stack traces coming from the JVM. Yeah.
Starting point is 00:39:48 I've had friends at FANG companies that when we were starting a venture, they were like, hey, if you can come up with a way for me to know, if you could just find a better way to present Spark logs, it would pay you a lot of money. Yeah, yeah. I have heard from people, I think the worst thing that I have heard is from people saying to me, we just cannot find the logs. Especially when you have
Starting point is 00:40:12 running Spark on EMR and that kind of case in production, it can get extremely painful to do the actual debugging. And yeah, it's one thing if you try to do that and you are like a Java developer. It's another thing when you're primarily, you know, you're a data engineer and you're
Starting point is 00:40:32 writing your code like in Python, right? And then you have like to go and figure out what the JVM is doing there. Like it's, yeah, it's bad. I can't get that. Yeah. And I think the other painful part is like, kind of like the not like hair on fire, but the other thing, which is like,
Starting point is 00:40:50 why is my program slow? Why is it not running as fast as it could be? And then just profiling and knowing what's actually happening under the hood is very hard with these JVM tools and interopting with Python. Yeah. Yeah, 100%.
Starting point is 00:41:04 And you also mentioned Rust at some point. So how Rust fits the equation here? Because, okay, you were talking about Python and being opinionated on that, but now we also have Rust, right? Yeah. What's going on there? So the thing is, when you're dealing with
Starting point is 00:41:21 this large amounts of data, unfortunately, Python's not fast enough. So under the hood, kind of like the user API is all in Python. You can run Python functions, Lambda functions, Python objects. But the stuff that's actually doing the processing under the hood is Rust. So we've crossed the boundary from Python into Rust to actually do all the hard computing. So on the top level, the user APIs in Python are plans and what happens in Python, but the functions that get called to actually do the number crunching is in Rust. Funny enough,
Starting point is 00:41:52 we started with C++ because that's my background, but there's actually not really that many C++ programs anymore. So we're like, hey, let's make the investment. Let's move to Rust. And it's actually been really amazing. I'm like really happy we made that move. So our whole core engine is written in Rust and it makes things very performant and actually really easy for contributors to jump in and get their hands dirty.
Starting point is 00:42:15 So how do you, by the way, you know, that's like we are, that's like a little bit of a different question from what we've been discussing so far. But like as a C++ developer going to Rust, like what was your experience? And I know it's like a of a controversial question that I'm asking right now because there's a lot
Starting point is 00:42:30 of language wars happening out there about C++ is dying, Rust is all the thing, no, Rust is bad, go and use Ling or whatever, but how was your experience? Let's see.
Starting point is 00:42:49 So I would say, I mean, I love C++. I've been doing it for over a decade. But the thing that I really like about Rust over C++ is that C++ essentially is optimizing for backwards compatibility. What that does mean is there's not things that get improved over time. The thing with Rust is when you start building it, if you're just a noob, everything kind of comes with sensible defaults. I think it's very underrated. In C++, you can set things up such that it's optimized so it won't copy.
Starting point is 00:43:18 It won't do this. It won't do that. But you have to set it up and you need to have someone experienced on your team to kind of lay that groundwork. But Rust, it comes out of the box. And the things like the build system and dependency management, all of that come out pretty good out of the box.
Starting point is 00:43:32 I think just coming out of the box strong is so underrated. Other than that, I feel like you could technically do everything in Rust. It's just a lot more groundwork. 100%. I think it's also like a modern developer experience
Starting point is 00:43:48 between the two ecosystems. How was the experience of working, they're operating between Python and Rust. I know that they work pretty well together, but how was your experience with that?
Starting point is 00:44:04 Oh, it was just night and day compared to c++ like with c++ like there are some tools around it but they're not great so usually you end up writing stuff in c++ and then use something like pybind or you use something like cython to kind of bridge the two and writing cython code just sucks it's not python which is not fun it's not c++ what you're used to. It becomes this weird in-between. But with Rust, they have this project called Py03. Yep.
Starting point is 00:44:31 And it's been amazing. You kind of just write your Rust code, declare what you want it to be, and it just magically works in Python. Nice, nice. I think we should have another episode at some point, going deeper into that stuff, because it's very fascinating. There are some very interesting lessons in terms of
Starting point is 00:44:50 how to build a good developer experience from these really complicated systems. Building a compiler and all the ecosystem around the compiler because it's not just the compiler itself. It's huge. So I think we should do that at some point. What we're seeing is the data ecosystem. It's funny,
Starting point is 00:45:10 the whole Python data ecosystem is kind of migrating to Rust. It's kind of cool. If you look up, Polaroid is written in Rust as well. I believe you guys had ByteWax on the show as well. They're the same thing. 100%. Stuff like with materialized,
Starting point is 00:45:25 all these things, like a timely data flow. Yeah, there's a lot of work getting done right now when it comes to data using REST. This is very fascinating. And you mentioned
Starting point is 00:45:37 a couple of different projects there. And I want to ask you, there are many things happening, right? A lot of innovation. We see Poland, for example. There are even stuff like the IBIS, I don't know how it's called, like they pronounce it. Whatever. There are maybe projects out there that they start from the data frame concept or the Pandas concept, and they try to build on top of that. As a person who's in a way doing something similar, tell us a little bit more about how you feel about what's going on right now in the industry, things that get you excited, what you are paying attention at, and what you would recommend us also to pay attention to.
Starting point is 00:46:28 Yeah, I mean, Pandas is sticky. Pandas is very sticky. And, you know, I had a hard time understanding why for a while. And then I went to PyData, which is a conference on a lot of the numerical tools within Python. I went to a talk that was teaching Pandas
Starting point is 00:46:52 developers or Pandas users how to use Python. It occurred to me that there's an entire population of people who know Pandas but not Python, which had never occurred to me. Yeah. Wow. This is That's wow. Okay.
Starting point is 00:47:10 But they were teaching like, oh, this is a for loop. This is how you make a function. Like they were teaching like these operations where I was like, wow, like I never realized that people who know a framework within a language, but not the language. And so I think when you build tools that kind of cater to that crowd, you kind of unlock a lot of the data scientists and users who are used to these types of tools. That's why I think IBIS is really cool because it gives you the API of something like Pandas, but then you can target a backend like BigQuery or whatever else where you don't need to change that much code. Yep.
Starting point is 00:47:43 Yep. A hundred percent. or whatever else where you don't need to change that much code. Yep, 100%. That's a very interesting project. They have a crazy amount of support for different backends, which is awesome. You can use from DuckDB to, I don't know, like Trino and Snowflake or whatever, and don't use the same code. It's very interesting.
Starting point is 00:48:03 Sorry, I interrupted you. No, that's cool. Yeah. I mean, I, these things, I think the data from concept is here to stay. I just think what the future data frame looks like is still unclear. What should we be looking at for getting like a glimpse of the future around that stuff, like who are are the teams outside of Venture, obviously, doing interesting stuff in this space?
Starting point is 00:48:30 I think there's a lot of cool concepts that we should pay attention to and I think are important for the future. I think one of the things that DuckDB is doing, which is fantastic, is one is I can just query data without worrying about the format or where it is or anything like that. So I think the concept of like, the format doesn't really matter anymore. I don't have to think about like, you know, Spark, you have to be very particular what exact version
Starting point is 00:48:55 of Parquet you're using and whatnot. Like that concept should just go away, right? The next part is being federated. I don't have to ingest my data to query it. I should just be able to like, give it an S3 path or a Google storage path and it should just work. I think those are concepts that are a must-have in whatever new tool comes out. The thing I think DuckDB is not right, I think, is being distributed. I think it's really important because not everything for the future is going to be on one machine. I think a lot of teller data that might be a case for some companies, but there are cases where you need to go
Starting point is 00:49:29 distributed or handle fault tolerance. I think that's one of the things that DAT is focusing on. And then finally, I think one of the things I'm really passionate about is making sure these systems can integrate well with enterprise tooling. We've used JDBC for a long time, and I hate it, and I think most people do too, but a lot of these new open formats, like the one that Arrow is building, the Voltron, is really cool. I'm really looking forward to that as well.
Starting point is 00:49:56 Yeah, Voltron and the Arrow ecosystem, I think it does some very interesting things and has a very interesting amplification effect to this industry. It's very interesting to see what they are building and what the effect that these things have. They have a very strong relationship to
Starting point is 00:50:16 partners also, right? Our core is built on Arrow as well for our de-civilization. I think it's an important tool and it kind of just makes you interrupt really easily. Yeah, I mean I think interoperability
Starting point is 00:50:31 is always a big issue in the data infrastructure space, which it seems that Arrow is managing to change that. Obviously things do not change from one day to the other, but it's amazing to see how fast, for example, systems
Starting point is 00:50:48 like BigQuery and Snowflake speak right now, like Arrow, right? And that says a lot about how important and how powerful the concept is. All right. I want to give some time also to Eric, because I'm monopolizing
Starting point is 00:51:04 the conversation here. We definitely need to get you back. We have a lot to talk about. But before I give the microphone back to Eric, something that you want to share with us about Eventual and Daft that is exciting and is coming up soon? Yeah, so Daft is doing its 0.1 release. We're fully going into beta. We have a lot of
Starting point is 00:51:27 really cool features, including our entire new core built in Rust. We're supporting these different types of what we call Daft types, like supporting images, videos, and these other data types very naturally. We're planning to do a launch at the end of the month, and we implore you to check it out. Our website is
Starting point is 00:51:43 getdaft.io. Check it out and start the GitHub. Awesome. Eric. Yes. I think we have time for one more question and I wanted to zoom out a little bit or maybe even say one more topic because I rarely stick to one more question. I wanted to zoom out a little bit and talk about the different ways you see currently and then envision seeing eventual and sort of the related, you know, in depth and the related technologies being used. And so if you think about the sort of obvious ones, even from our conversation, right? You know, processing imagery in the context of a self-driving car, right? Or algorithms that run, you know, to provide certain recommendations and, you know, in an app like Instagram, right? Which is very image heavy, right?
Starting point is 00:52:38 But what are some of the other, like, interesting ways that you see this being used on complex data? I mean, you mentioned you can operate on audio files and other stuff like that. But there are sort of some of the obvious ones that make sense to anyone who's listening. But what are some of the more non-obvious ones that you think will be really influential? Yeah, it's a good question. So right now for us, we're focusing on the most underserved market, which is people dealing with complex, you know, these things like images, videos, and what you mentioned. But in the spectrum of complex data, there are things in between that are still unserved.
Starting point is 00:53:13 Like the example of, I think the big one I think of is recommendation systems. So if you have something like Facebook and they're trying to recommend you, oh, what post they should show you, some of the data that they process for that is a user in one of the columns might be a list of interactions that they might have had or a list of like options of what they've done. It's something that's kind of complex, but not like super complex. But right now, if we try to operate that in existing systems, that would run very, very slowly.
Starting point is 00:53:42 Yep. So even these things of like nested data is actually very slow in existing systems. And that's something that we're planning to tackle next. Yeah. Fascinating. All right. Well, congratulations on Eventual and Daft.
Starting point is 00:53:56 Super exciting. We'll have you back on so we can continue to dig into all of our juicy questions. Thank you so much, Sammy. Thank you. Thanks for having me. Costas, talking with Sammy, what an incredible story, right? If you are in a Tesla and you're driving down the highway and you let go of the wheel and the car is safely carrying you at 70
Starting point is 00:54:17 miles per hour without you giving the vehicle any input, Sammy is a huge part of why that's possible because of what he's worked on. And I can't tell you how much I love that he drives an old BMW from the 90s that's a stick shift, and he works on it himself as sort of a car guy, which is so great. I mean, that sort of, you know, that sort of wonderful, wonderful story doesn't come along every day. So that was possibly my favorite part of the episode. But I also just really appreciated, really appreciated his thoughtful perspective on just the problem of dealing with complex data in general. And I was astounded by when, you know, towards the end of the episode, when we talked about like, okay,
Starting point is 00:55:10 you have like imagery, you know, for Instagram and self-driving cars. And obviously that's a huge problem space for complex data. You asked him what other things, and he said, I mean, actually just hierarchical data, you know, nested data is unbelievably slow when you try to work with it at scale.
Starting point is 00:55:28 And so it's like, oh yeah, this is still really early innings for sort of solving problems around complex data. So excited to see what eventual grows into. Yeah, a hundred percent for me. Okay. Yeah, 100%. For me, okay, it was like an amazing opportunity that's like talking with him because, first of all, we talked a lot about like things that are, how to the end is like the vision there is like to build a new type of data warehouse that can be used by ML people that are working with non-tabular data. But it's interesting to see like how many times, no matter like what we were talking about, we ended up talking about the developer experience and how important this is and how also
Starting point is 00:56:25 like silent this developer experience can be like, I think like what he said and what he said about the Pandas ecosystem was incredible. Yeah. I would never like to hear something like that. So tooling is important. It's like the foundations that we need if we want to accelerate progress, right? And that's why like I really talking with Sami because he gave like some like very deep insights of why tooling and developer experience is important and how this can
Starting point is 00:56:58 be addressed and how eventual is doing it right for the use case that they have, like in their minds and the problems that they are trying to solve. So let's have him back to the show again soon. I'm sure we have more to talk about with him. Indeed. Well, subscribe if you haven't. Tell a friend and we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show.
Starting point is 00:57:21 Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
