The Data Stack Show - 139: Decoupling the Execution Engine From Python’s Pandas with Aditya Parameswaran of Ponder

Episode Date: May 24, 2023

Highlights from this week’s conversation include:
- Aditya’s background and journey in the data space (2:47)
- What does Ponder do? (5:18)
- 101 on Pandas and why people utilize it (6:42)
- The challenge of translating Pandas to a big data platform (16:11)
- Data warehouses and ML workflows (21:27)
- The differences in the “zoo” of data languages (26:56)
- Why do ML and data engineering have to be so different in languages? (34:39)
- Builders should be adapting to the users and not the other way around (39:32)
- Will we see a singular data interface in the future? (46:19)
- Aditya’s most surprising discovery in his research (50:40)
- Final thoughts and takeaways (53:18)

Read more of Aditya's work:
- Pandas vs. SQL – Part 1: The Food Court and the Michelin-Style Restaurant
- Pandas vs. SQL – Part 2: Pandas Is More Concise
- Pandas vs. SQL – Part 3: Pandas Is More Flexible
- Pandas vs. SQL – Part 4: Pandas Is More Convenient

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week, we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Kostas, we have a great topic today. We're actually going to try to bridge the gap
Starting point is 00:00:31 between a Python library used for data science machine learning use cases, mainly locally, pandas, and the data warehouse, which was really surprising to me when I first heard it. But Aditya, who started this company called Ponder, is also a professor at Berkeley, is a really fascinating guy. So I think maybe I want to start out by doing a 101 on pandas.
Starting point is 00:01:03 I think it's something that a you know, a lot of people in and around data are familiar with. But I think it'd be great to just do a 101 on pandas. And then I'm going to do a, I'm going to do a two for one here in the intro, which is unfair. I'm going to ask about the, you know, the transition or not the transition, but covering the gap from pandas to the warehouse. I can't wait. It's going to be awesome. Absolutely. Okay, I have many things that I'd love to ask him. First of all, I think there's Modine itself,
Starting point is 00:01:37 which is a pretty popular open source project. It has almost 9,000 stars, for example, on GitHub. But outside of the project itself and how it works, what the secret shows behind it and why it is important, what I really want to talk about with him is why there is such a gap, let's say, between data practitioners in terms of tooling, especially if we compare what data engineering
Starting point is 00:02:12 tooling is compared to the AI ML tooling out there. Why this is happening at the end? Why SQL alone is not enough? And Token understands better what causes that SQL alone is not enough. And to understand better
Starting point is 00:02:26 what causes that and what can we do to bridge this gap. And I think we have the right person to do that because Aditi outside and Berkeley outside of the core data-based research
Starting point is 00:02:44 a lot of work that has been done has to do with how the core, let's say, database research. A lot of work that has been done has to do with how people interact and can be more productive with data. So there's a lot of wisdom there that can be served, first of all, with us and obviously with our audience. So let's go do that. Let's do it. Aditya, welcome to the Data Stack Show.
Starting point is 00:03:04 We're so excited to chat with you. Thank you so much for having me. I'm excited to chat. All right. Lots to talk about with Ponder, but give us your background. So you've done a lot of data stuff. You've also taught a lot of people about data in the university setting. So just give us your background and kind of the brief story of what led to Ponder.
Starting point is 00:03:25 Yeah. So I'm a database guy, I guess, or at least a data guy. I've been part of the data community for over a decade now. I got my PhD at Stanford, then spent a year at MIT. Then I became a faculty member. I was at Illinois, the University of Illinois, for a few years and then returned to the Bay Area. And so I'm a professor at the University of California, Berkeley. I do research on data systems, broadly defined.
Starting point is 00:03:59 And the work that we do is trying to make data systems better. And the way we do it is to try to look at the problems people currently have with data tooling and trying to make it more usable, more efficient, more intelligent, and so on. And the course of that research, we've explored a bunch of different tools, a bunch of different topics. At some point, we started looking at data science tooling and specifically pandas. And we realized that a lot of people love pandas and they found pandas to be incredibly useful and we can dive into what pandas is and why it's so cool and
Starting point is 00:04:36 so useful, but they were having problems with it. They were having problems in scale. They were having problems with sort of getting stuck on how to use it and so on. And so we kicked off a couple of open source projects towards making pandas better. One of which is this tool called Modin, which is a drop-in replacement for pandas. And this was led by one of the PhD students I was advising. This led to Modin, which is an open source project, broadly used and adopted by a lot of folks.
Starting point is 00:05:08 And so we said, hey, this is how do we amplify the impact even more? So why don't we go ahead and found a company around this? And so that's what we did about two years ago. And Fonda is a result of that. Fonda is a company behind open source modem. We sort of forked in our trajectory from open source modem to other products. But I'm happy to dive into why we did that and where we're going. Yeah, absolutely. Well, I'd love to dig into the Pandas 101 stuff because I think just orienting us to how Pandas
Starting point is 00:05:47 is used, where it fits into the data science ecosystem and the Python ecosystem would be helpful. But can you just give us the brief rundown of what Ponder does and the problem that it solves before we go back to the root? Totally. So what Ponder does is it allows you to run data science at scale directly in your data warehouse. So that's the headline. What does that mean? What that means is, for example, take Pandas, which is a popular data science tool you can use pandas as is so use the same api use the same scripts that you've used and curated over many years and now all of that runs inside
Starting point is 00:06:35 your data warehouse completely transparent to the user so the user doesn't need to know where it's running it's just now running in your data warehouse. And basically that you get all of the benefits that come with that. So obviously data warehouses are incredibly scalable. They are incredibly reliable. There are security guarantees, there are governance guarantees, there's provenance, all of that good stuff that happens all for free because now the execution engine is the data warehouse. So that's what Honda is doing. Love it. Okay. So let's rewind and talk about pandas. So I know a lot of our listeners are probably familiar with pandas, but we love digging down to the 101 and providing context. So maybe what would be helpful is,
Starting point is 00:07:18 can you paint a picture of sort of maybe like a vanilla Python flow and then how pandas sort of, like why do people love pandas compared with sort of, you know, maybe like more vanilla, you know, libraries or flows? Totally. So what is pandas? So pandas is basically, I would call it the language of data science.
Starting point is 00:07:49 It is, it's the Swiss army knife for data science. It's used for everything ranging from data cleaning, data extraction, data transformation, analysis, visualization, and modeling. It's basically doing everything in the data science world that you would need to. And it's been it's an incredibly complex APIs and this complex API of more than 700 plus functions, each of which have maybe thousands of parameter combinations each, right? Like it's a really incredibly complex,
Starting point is 00:08:26 incredibly detailed API that has been lovingly curated, improved upon over the course of many decades, right? So it's like, it's the result of many years of love and attention from the open source community to build something that's just like super useful for the data science and AI community. It is just the first tool that you would go to
Starting point is 00:08:47 if you wanted to get the job done on any kind of data transformation task that you have. The library. The library. It is the library, right? And so literally anything that you want to do from a data transformation standpoint, data analysis standpoint, visualization standpoint, data cleaning standpoint, you would use Pandas.
Starting point is 00:09:09 So it basically incorporates ideas and APIs from the spreadsheet universe. It incorporates ideas and APIs from the relational universe, the database universe. It incorporates ideas and APIs from the linear algebra universe, which is why it makes it a great fit for the ML and AI side of the equation. It's used by, I think, around 25% of all software developers. And remember, that includes a lot of web developers and so on. It's very popular compared to a lot of other tools out there. It's, I think, seven or eight times as popular as Spark. So just to put that in perspective, if you're talking about the space of the data science community
Starting point is 00:09:57 relative to the data engineering community, it's that much. It's very close to SQL. SQL is also an incredibly popular language. It's very close to SQL. Yep, fascinating an incredibly popular language. It's very close to SQL. Yep. Fascinating. Okay. So we're sold on pandas. If you're going to do data science, ML work, it's the library. Install pandas and it'll make your life easier. But there are problems, obviously, or Ponder wouldn't exist. So can you, and I'd love to approach this from the user experience standpoint. I know you've done a lot of research, actually, on sort of user experience, and you're a database
Starting point is 00:10:35 guy. So can you walk us through, like, what is it like to use Pandas? And where does it fail for the user in terms of the things that Ponder solves? Right. So, Pandas, while it's an incredibly rich, incredibly intuitive, incredibly expressive, convenient API, and if you would like to learn more about why Pandas, why Pandas over something like SQL for the purpose of data science, I have a, like a four part blog post series that I can share a few books.
Starting point is 00:11:10 We'll put it in the show notes. We'll have him put it in the show notes. So visit datasetshow.com. So coming back to what happens when people try to use pandas. If you try to use pandas at scale, you basically will run into issues. Why is that? A, pandas is single-sided. So even if you try to operate on a large data set
Starting point is 00:11:37 on a beefy machine, you're going to get pretty much the same performance on a not-so-beefy machine because it doesn't take advantage of multiple cores. It's very inefficient with memory, and it does most of the processing of data in memory as opposed to on disk. So you're limited by the
Starting point is 00:11:52 amount of memory that you have. It keeps making redundant copies. It's also like often what ends up happening is you'll be midway through your workflow, you run out of memory and it'll crash. And finally, it has no optimization built in. So every operator is run by itself.
Starting point is 00:12:10 It doesn't actually take into account the fact that, hey, there are multiple operators chained together. Maybe I can reorder them to make it run faster. It's not something that Pandas does. So to recap, Pandas is bad with memory. It doesn't do any optimization, doesn't take into account multiple cores. So all of this means that if you're trying to run pandas on datasets that are more than a few hundred thousand rows or a few million rows, depending on the type of task that you're
Starting point is 00:12:37 trying to accomplish, you're stuck, right? It just, it'll break down. You won't be able to get your job done. So what we've seen, at least in terms of workarounds for people in practice, there's a few workarounds. One is you operate on a sample of your data in Pandas and then convince someone or you may be yourself, translate that workload to run it in spark or in sql so that you could run it at scale so that's one one approach the other approach like i said is you just run it on a sample and then be satisfied with that sample right and the insights that you get with that sample hopefully you've taken a random enough sample that those insights translate into the entire data set, but if not your toast. The third approach, which is what we are adopting
Starting point is 00:13:29 at Fonda is to say, hey, we keep the pandas workflow as is. We'll run it at scale for you. You don't have to do this translation back and forth to a language that allows you to run it at scale. You can just stay with pandas. And there's no need to interact with a data engineer who will do that translation for you. There's no need for these long feedback loops. In fact, there was a company that we spoke to early in the days of Ponder who said it took them six months to translate a panda's workflow into PySpark. Like six months. I think this is on the higher end. The more typical numbers that we've heard
Starting point is 00:14:07 are like three to four weeks of translation cost from Pandas to a big data framework. And so our value proposition is that, hey, we can just do it for you, right? Like it'll be automatic, it'll be out of the box. You don't need to worry about it. Yeah. So is it an oversimplification?
Starting point is 00:14:23 As you were talking, I thought, okay, well, I mean, it seems simple. Like Pandas is really good for local development on smaller sample datasets, right? I mean, that seems to be sort of the developer experience that Pandas has optimized for. Is that an oversimplification or is that the main sort of mode of usage that you see? It is the main mode of usage that we see. And it is because of Pandas' limitations that is the main mode of execution and development that we see.
Starting point is 00:14:56 And it's useful to decouple two aspects, right? The one is Pandas the API and Pandas the execution engine. Right now we can complete the two and say it's one thing, right? Pandas the API and Pandas the execution engine. Right now, we conflate the two and say it's one thing. Pandas is broken. But Pandas is not broken. In fact, the API is great. People love the API.
Starting point is 00:15:13 People say, hey, the API is convenient. It's expressive. It's rich. It's convenient. All of that is great. But the execution engine is broken. And so with Fonda, what we are saying is that why do we conflate the two? Let's not fault the API for the execution engine.
Starting point is 00:15:29 Let's throw out the execution, throughout the execution engine. We will be the execution engine. We'll keep the benefits of the API as is. And so that's the shift. In fact, Modin, when we started out, which was open source project that Ponder was built out of, was Fandas on distributed computing engines. So it was on Ray and Dask, which are both distributed computing engines in the Python universe. And we started with that, it got incredibly popular. And then we migrated to data warehouses because people were like,
Starting point is 00:16:02 I don't want to manage array and a task cluster in addition to managing a data warehouse. Can't you just help me run pandas on data warehouses directly? And so we were like, okay, let's go build that. And so that's what led us to build the Ponder product. Yeah. Okay. I want to ask about the data warehouse aspect because usually you don't start there when you're thinking
Starting point is 00:16:25 about Python and machine learning use cases. So I want to dig into that. But can we dig into a little bit more of the challenge of translating pandas to what you called more of a big data platform, right? So you develop something in pandas locally because of the limitations, you're doing it on sample data. And then of course, like, okay, well, we need to run a model in production, we need to serve features, it needs to make recommendations in our app or whatever you're doing. Can you describe the user experience problems with going from pandas to like okay i need to migrate this to py spark and like run production on you know on spark to actually deliver this stuff
Starting point is 00:17:13 yeah so what happens why is it so hard to do this right i quoted three weeks four weeks six months to do this translation from pandas to PySpark or to SQL. Why is it so hard? So it turns out it's because pandas has a different data model and API than the relational universe or even the Spark universe. And what are these differences? So first, pandas is an ordered data model. So it actually assumes order for the rows.
Starting point is 00:17:49 Both the relational universe and the Spark universe don't assume order. And people can rely on the order, which is what makes Pandas so intuitive. There are these notions of row and column indexes. So you can label the rows and refer to them using these row indexes. Again, a super convenient feature that is not available in relational databases or Spark. And then there is a bunch of different kind of API-centric operations that are very easy to do in Pandas that are really hard to do in SQL, linear algebra style operations, spreadsheet style operations. If you wanted to do things like pivots,
Starting point is 00:18:31 very easy to do in pandas, very hard to do in databases, although some databases are starting to build in some of that capability. Linear algebra is extremely hard. If you wanted to do a transpose in a database, you're toast, right? In pandas, you can do a transpose. You can multiply matrices together. You can do things like that.
Starting point is 00:18:50 So all of these conveniences in terms of the data model and the API, you need to hand-retranslate all of that into something like PySpark or SQL to get the same effects. It's actually a really hard problem, even from a research angle. It's a really hard problem to get your head around, which is why it took us a while to sort of build this out, right? This is why it originated as research in Berkeley, and then we developed a product around it. Yeah, that's fascinating. I mean, it almost seems like you develop a feature in pandas
Starting point is 00:19:28 and you can get there fast with pandas because of all of these ergonomics that are really nice at the API. And so you know what the end state is and you kind of know the basic data model, but I almost envision it as like you know what you want to say, you know, you know, the meaning of a statement that you want to make. But if you translate that to a different language, you actually have to know like the context of the culture and, you know, conjugation and, you know, all of these different things that like make it actually pretty difficult to get the same
Starting point is 00:20:01 end result. Even though you know what that is, you have to do a bunch of background work to sort of migrate all of this underpinning to sort of produce the same concept. Absolutely. I think that's a great analogy. And I think part of this translation process is that because it's usually done manually, it also leads to bugs, right? And those bugs are not discovered until you run it in production. And then you're like, hey, there's a bug. And then you go back to the person who wrote the script, and they're like, no, I actually meant something else. And then you repeat this process. And each time, it's like three to four weeks of translation, hand translation. And like,
Starting point is 00:20:43 it's lost in translation. That's what happens. And just to put a sharp point on it, when we say translation, what we're talking about is you produce things with pandas, say you are using linear algebra or you're doing some matrix math or whatever. But if you have to reproduce that in SQL,
Starting point is 00:21:00 especially if you assume ordering and pandas and that's taken care of automatically, to reproduce that in SQL, that's what we're talking about, right? So you're writing like potentially thousands of lines of SQL to sort of reproduce like, you know, linear algebra or matrix math with assumed order that you have to reproduce in SQL, right? You have to basically hand roll the, you know, any time stuff in SQL is crazy anyway. So you're hand rolling all that. Is that the translation? Yep.
Starting point is 00:21:27 Often it'll be a single line, 10 characters in Pandas. This translates to 400 lines in SQL. You're absolutely right. We will handle all of the things like indexes, order, the fact that the API is much wider
Starting point is 00:21:43 in Pandas than in SQL, all of that is handled transparently to the user. Yep. Okay. Last question for me. I always say that it's a lie usually, but last question for me, because I know Kostas has a bunch of, and I actually want to hear Kostas' questions about pullers as well.
Starting point is 00:22:00 But let's talk about the data warehouse because when you think about ML workflows, I mean, the default, okay, so pandas locally, for sure. But when you think about running ML stuff in production, the standard reference point for most people is you're running Spark on some sort of Databricks, you know, data lake flavored infrastructure. So let's talk about the warehouse with Ponder, because you're talking about a flow that, you know, is maybe just out of ref, maybe not like a default reference point, right? Like, I'm taking Panda's work, and I'm putting it onto a data warehouse. I mean, I'm thinking about are there compute implications? Like, you know, data structure implications. Can you talk about the,
Starting point is 00:22:51 I mean, this is feedback that you got from users of Modin, but walk us through that because you don't think about, okay, I'm going to take my pandas and dump it, you know, straight to Snowflake or something like that. Yeah, so maybe a bit of history here would be relevant.
Starting point is 00:23:07 So we've seen the worlds of data science and ML on the one hand, and data analytics and data warehouse on the other hand kind of evolve separately over the past several decades, right? So the Python-centric tools like Pandas, and Pandas doesn't exist in isolation. There's also NumPy and Scikit-learn and what have you. And the visualization libraries like Matplotlib and Seaborn, etc. So this is all in the data scientist toolbox. And then there is the analytics universe, which traditionally OLAP-centric databases, which then over the last decade or so,
Starting point is 00:23:54 also incorporated Spark to deal with somewhat semi-structured, nested JSON, et cetera, data, but still caters largely to the data engineering and data engineer persona rather than the data science persona. So, and data engineers are pretty comfortable with Spark, a solid framework that allows you to process data at scale in a pretty more, in slightly more flexible operators than SQL, but SQL is still the lingua franca for the analytics universe.
Starting point is 00:24:30 And so these two worlds have evolved separately. There've been several attempts to bridge the divide between the data science and AI worlds and the database worlds. One of those attempts has been to add in Python-style UDFs to databases, which didn't end up, it's not super popular. It adds more constraints. It's not something that has a lot of adoption.
Starting point is 00:24:57 I mean, it's useful, but it's not as broadly adopted as the data science stack. There's also been attempts to add in ML primitives to databases, again, not having a lot of adoption. But broadly, I think these two universes have stayed separate, partly because it's so hard to map the data science primitives to the data warehouse primitives. The national world is very different from the data science world. And that's what kind of our research unlocked, right?
Starting point is 00:25:24 So our research on saying, hey, we can take the Pandas API, we can distill it down to a small set of core operators, which can then be mapped to the relational operators. That research unlocked the possibility to then be like, hey, now that we've done this for Pandas, you can do it for other data science libraries as well. So other ML libraries, other visualization libraries, all of them can be now pushed
Starting point is 00:25:49 into this data warehouse. And so our belief is that the libraries are great. The APIs are great. The execution engines don't need to be tightly coupled with the APIs and the execution engines could be anything. It could be databases coupled with the APIs. And the execution engines could be anything. It could be databases or data warehouses. It could be distributed computing engines like Dask and Ray. It could be even Spark. All of these are just execution engines. You shouldn't need to know
Starting point is 00:26:16 the details of Ray or Dask or Spark or SQL in order to be able to use your favorite data science or AI or visualization library. That's our belief. I love it. All right, Kostas. I've been monopolizing and I could keep going, but of course, I need to hand the mic to you. That's good. You're asking some incredibly interesting questions and getting some amazingly interesting answers for what DTL show.
Starting point is 00:26:44 I don't know. Maybe you should continue asking the questions. Maybe I'm not needed here. But I'll do my best to also ask interesting questions. So Aditya, I was listening all this time, but you were talking about pandas, the APIs. I love the whole focus around the user experience or the developer experience, if you want to be more precise, which is great. And we have many things to talk about there.
Starting point is 00:27:14 But before we do that, we have pandas, we have koalas, we have bears out there, like a whole zoo of, let's say, different frameworks that somehow relate to Pandas, right? Because we also have to keep in mind, historically, as we say, Pandas has been built for a long time. It's not something new. What are the differences between all these? And what's your take also in what sparked the need to build these new interpretations of what Pandas is trying to do? Assuming all the work that has been put on Pandas all these years, right? Yeah. So I think, again, it's useful to talk a little bit about history here. I'm a database professor. I keep bringing up lessons from database history. So if you are used to SQL and all of your work is in SQL, then it's hard for people to come in and be like,
Starting point is 00:28:31 hey, you should change all of your existing workflows and use this other language or library. Likewise, you've seen a similar groundswell of adoption for pandas over the years. Like I said, one in four developers use pandas. And pandas is not going anywhere. Just like SQL isn't going anywhere, pandas is not going anywhere.
Starting point is 00:28:50 And after now a decade of adoption, Spark is one-fifth that of pandas, right? So I think APIs come and go. There have been many attempts at building better data APIs, but it's an uphill battle to convince users to be like, hey, you know what, you should switch to my better data
Starting point is 00:29:12 API. And yes, that better data API may solve some ergonomic issues for the user being like, hey, this is a slightly better developer experience, slightly better ergonomics, slightly better this and that, or slightly simpler. It's not as complex as Pandas. It's slightly, it deals with better,omics slightly better this and that or slightly simpler it's not as complex as pandas it's slightly it deals with this better it's slightly more performant but people are not going to adopt it for those reasons alone because you and i these different
Starting point is 00:29:36 languages you're very comfortable in your language i'm very comfortable in my language yeah we have a language in common which is english i'm not going to switch and learn an entirely new language just because you asked me to. I'd much rather deal with the limitations of the languages that I speak. And if there are tools that will help me deal with that, which is what we are trying to do with Pandas, keep speaking the language you want to speak, we'll just make it better.
Starting point is 00:29:58 I think that's a much more compelling value proposition than saying, hey, now we want you to go from ground zero, convince people to adopt an entirely new framework. For example, you brought up Colas, which I'll respond in a second. But Colas is another one. It's a language. It's a commendable effort and commendable amount of adoption that they've gotten in a short amount
Starting point is 00:30:25 of time, but still minuscule compared to Pandas adoption. And I think there are a couple of reasons. One, people, if you're used to Pandas, you're used to Pandas. You're not likely to switch to Polars. And now with tools like Modin and Ponder, you don't necessarily need to switch to Polars to get performance benefits, right? The other piece is Polar is not trying to support arbitrarily large data sets. So the way we would want to support the data warehouse, you can just scale up the backends. Polar is trying to get the most of mileage
Starting point is 00:31:02 from a single machine as opposed to a data warehouse. So difference is an approach. And I should say that I feel the pain of anyone developing data querying languages. I've developed a bunch of them over the years. And what I found repeatedly is that it's a lot of effort. It's a lot of blood, sweat, and tears that goes into it. But often it's still very hard to compete with the likes of Pandas and SQL and Spark now, even though Spark is like one-fifth that of Pandas. It's still very hard to compete with those because people
Starting point is 00:31:37 have just gotten used to it. Yeah. Yeah. That makes a lot of sense. And okay, you mentioned Koalas. I also mentioned Koalas because I would like to make the connection there with Spark, right? Because Spark is, especially if you come from, let's say, the more analytical space, do you think of Spark as the de facto tool
Starting point is 00:32:02 when you want to work with Python. Even data warehouses like Snowflake with Snowpark and all the work that they have done there, still, I think whenever we put Python together with data, the mind of everyone goes directly to Spark. And there is PySpark, but before that there was Koalas. What was Koalas and why it was created and how it
Starting point is 00:32:31 converged to Pandas? Yeah, so Koalas was actually an early attempt at bringing the Pandas API to Spark. And this was... I'd like to think that it was inspired by Modem because it started after Modem, but maybe they came to this realization independently. Either way, I think it was an effort to bring the Pandas API to the Spark universe.
Starting point is 00:33:00 We have a more detailed comparison with Modem, but A, Qualys is very tied to Spark, doesn't quite support the same API as Pandas does. It supports a subset thereof. And due to the fact that it's very tied to Spark, doesn't actually get as much performance as we would like because it has to deal with the Spark idiosyncrasies. But yeah, it is addressing a similar problem that Ponder is in that it's saying, hey, you continue to use the Pandas API on your existing data infrastructure, in this case, Spark. So it is addressing a similar problem as Ponder is from that perspective. Yeah. Your other question on sort of PySpark, Koalas, Snowpark as bridging the Python universes and the analytics universes.
Starting point is 00:33:55 I think that is spot on. Let's leave aside Koalas for now, but PySpark and Snowpark definitely bring the benefits of the Python centric universe to the analytics universe. However, it is telling people to rewrite their workflows in things like Snowpark or PySpark. And that's usually not something that data scientists would do. That's something that data engineers would do. So this is a tool targeted at data engineers rather than data scientists.
Starting point is 00:34:24 So different persona. Yeah, and that's really my next question. Why are these personas so different? I mean, that's why, like, okay, we have, let's say, Spark with Koalas as an attempt to bring a Pythonic API to the data, right? Why do we need PySpark? Why do we need this whole concept of the data frame there? And why another iteration to satisfy, let's say,
Starting point is 00:34:54 this different persona, which is, let's say, the data engineering persona, right? Why ML and data engineering have to be so different at the end? Well, I mean, if you abstract enough, it's the same thing at the end. In a way, you have pipelines, you have data that needs to go through different stages of processing to end up with a data set that is going to be input to something. It's not that much different. Why is this happening?
Starting point is 00:35:24 Yeah, so that's a really insightful question. And someone who is schooled in the database universe, I'm like, hey, why should we have all of these? Ultimately, it's just like data transformations. Everyone should just write SQL, right? That was my initial instinct getting into all of this is like why people should just write sql i think the i think we underestimate the the challenges that come with folks who are not schooled in a computer science background who are getting into things like data science and ml and ai and don't know enough about distributed systems don't know enough about computer systems, don't know enough about computer science, but are trying to get useful work done with data.
Starting point is 00:36:08 A lot of these folks have taken like data science and coding bootcamps. So they know enough Python. They know enough Pandas. That's about all that they learn in these data science and coding bootcamps. Telling them to be like, hey, can you write all of
Starting point is 00:36:25 your workloads in Spark? Can you write all of your workloads in Snowpark? It's so much of a heavy lift for them because their mental models can't accommodate sort of thinking about distributed systems, managing a cluster, rewriting workloads that they are comfortable writing in Python or in Pandas into something like PySpark or Snowpark, that simply is out of the question for them. Imagine, for example, you're a biologist. You've just learned enough Python or data science to try to do some genomics data analysis. Now you're being told, hey, you know what?
Starting point is 00:37:00 Now you need to have a deep understanding of distributed systems and databases to just get your job done. And that's crazy. In fact, one of the origin stories for Modin was this group of genomics researchers who did all of their work in Python and Pandas, but were actually writing a Spark job to generate a sample of their data that they could then analyze in Pandas. They were learning how to write Spark so that they could extract a sample so that they can play around with it just the way they wanted in Pandas.
Starting point is 00:37:36 And it's just such a heavy lift for a lot of these folks and just expecting them to be like, you know what, here's another tool. Under the hood, it's all data transformation, right? Why don't you learn and do it? Maybe easy for you and me, easy for other folks
Starting point is 00:37:50 who are schooled in computer science, easy for folks who are schooled in distributed systems. Really heavy lift for someone who's in finance, who's coming from Excel, the Excel world, who's learned just enough Python
Starting point is 00:38:00 and Pandas to get by, telling them, hey, retransfers is heavy lift, biology, finance, insurance.fers, it's heavy lift, biology, finance, insurance. There are all of these industries that are becoming data rich, and they now have people who know a little bit of programming, a little bit of data science, want to get insights from data, but they can't do it because they get stuck. Yeah.
Starting point is 00:38:19 Yeah. A hundred percent. I think this is a great point that you're making here because we need to, and I want to get into the discussion towards more what our industry should be doing more of. Because I have a feeling that, okay, we take many things for granted
Starting point is 00:38:40 because we are also living in a way, like in our own bubble where, oh yeah, distribution system, sure, yeah, that's fun. Whatever, but come on, dude. Don't tell everyone you're interested in that stuff. People have other things that they have to do
Starting point is 00:38:55 and they want to use the tools that they are comfortable with, which has to do with the user experience and the developer experience. It's a very good point because we see, especially lately, so many languages coming up. And we also
Starting point is 00:39:12 have a very short memory in this industry, I think, because just see the SQL flavors out there, right? And how difficult it is to just go from high to a different SQL system. We're not talking even about anything else. We don't think about that stuff. We're like, oh, my language is better. You should be using that. Yeah, sure.
Starting point is 00:39:33 Okay. Whatever. And then you go to market and you're crushed because people have a better thing to do than believe in your dream. So I think it's, it's also like, I think very good, like to talk about that stuff, like on, on the show, like to communicate that also like to whoever thinks of like going and building something new out there, like think about the user and where they come from. And I think it's more important for the builder to adapt to the user than the builder assuming that the user is going to adapt to you, right? Exactly. Exactly. We have, as tool builders,
Starting point is 00:40:13 we have a tendency to build, it's like computer scientists building tools for computer scientists or system builders building tools for system builders. Not everyone is like you and not everyone is happy with a tool that was chucked over the wall at them.
Starting point is 00:40:29 Right? Like nobody is going to be like, hey, yeah, you sent me this new tool, but you're forcing me to abandon my existing workflows, all of my existing scripts and use this. Like the people
Starting point is 00:40:38 are not going to do that. Yes, they might because it's a fad for a short period of time. And yes, you might convince some people, but eventually it'll revert back to the old practices, right? So we do, again, lesson from history. In the early 2010s, there was a vast number of NoSQL systems that came up, some of which were addressing OLTP issues, but some were addressing OLAP analytics issues.
Starting point is 00:41:09 And there were so many of them, right? And now, if you think about it, the only ones that have survived of those languages that were invented expressly to deal with the challenges of big data, so to speak. The only thing that has survived at the test of time is Spark and SQL. And Spark 2 supports SQL, right? Like, it's not just by Spark. So, and again, I feel like I'm constantly peddling my blog post, but I recently had a blog post that was targeting this big data is dead blog post that was put out there recently. And I said, hey, big data is actually not dead.
Starting point is 00:41:52 I mean, a lot of organizations still have big data. And I trace a little bit of the history of NoSQL systems and how we are reliving the lessons of old that ultimately we have to adapt to our users rather than expecting the users to adapt to us. 100%. I think another good example of that is how many companies have died trying to kill spreadsheets. It's all happening.
Starting point is 00:42:18 It's simple as that. Yeah, exactly. Okay, so follow-up question to this topic. We do have a kind of dichotomy right now, even in very mature data organizations, right? You have the ML teams, they have their tooling, they're using Pandas, whatever they want to use, and you also have a data engineering organization in there.
Starting point is 00:42:47 And usually they use different tools, right? And in many cases, you see this ending up in a lot of duplication of both efforts and infrastructure, right? So how can we bridge that, but at the same time, let's say, respecting what its persona wants, right? Because in the same way that data scientists want their pandas, the data engineers, they want their PySpark, right? So, yeah, how we can do that? Yeah, I think this is a really interesting question and it's going to see continued evolution over the next five to 10 years, I think, as these personas become a little bit more blurry and we see a little bit more consolidation in the data stacks. So what we see right now is very simple, right?
Starting point is 00:43:40 It's like we have your data warehouses on the one hand, you often have a Spark cluster as well, sometimes, or maybe one of the two in certain cases. And then your data scientists and ML and AI folks often will just pull out a bunch of data as much as they can fit in memory and go off to work on it using their Jupyter notebooks and Python standards. That's what is happening right now. So there are often multiple sort of infrastructures, like beefy machines, maybe that the data scientists are working on. You might have a Spark cluster. You might also have a data warehouse. And so there is a convergence between the Spark universe and the
Starting point is 00:44:28 warehousing universe. And we are seeing that happening both from the traditional data warehousing companies like Snowflake or BigQuery or what have you, as well as database. I think those are converging. We still see data scientists pulling data out and operating on it in their laptops or their VC machines. That's still happening. And that's part of the piece that Ponder is trying to address, saying that, hey, you don't need to use your laptops and deal with possibly PII data because your data scientist likes to operate on it laptop. That shouldn't be happening.
Starting point is 00:45:05 You should still keep all of that in a heavy governance sort of environment, which is your data warehouse, and allow your data scientists to directly operate on it. Of course, there are challenges in completing this vision. Can you map every single data science library that a person wants to use to a data warehouse setting? We don't know yet. I mean, we are making progress on that front, but an ideal universe that I see in the future is
Starting point is 00:45:29 everyone interacts with a common set of infrastructure, right? Like we at a data warehouse, Beats, Spark, what have you, that's just one infrastructure, but people use their preferred languages to interact with it. And then there's middleware that allows you to translate between various languages. So if you're using PySpark, maybe you directly access your infrastructure using PySpark, but you could also use Pandas. You could also use visualization libraries. You could also use BI tools.
Starting point is 00:45:57 And in all of these cases, there is an intermediate layer that understands and can translate from whatever language the API is invoking to the infrastructure, right? So you can act as a middle person and do that translation for you. That is the ideal universe that I would like to go in because it would simplify a lot of work. Why do we have to manage three different infrastructures just because one doesn't speak Pandas or one doesn't speak Spark.
Starting point is 00:46:26 You should have just one infrastructure that is heavily managed, heavily provisioned for reliable, for tolerant, scalable. And then users can use it in whatever way they want. How far away we are right now from this reality of having this middleware or, let's say, getting in the data space a similar infrastructure as we have in compilers. Because that's what I visualize in my head.
Starting point is 00:46:57 When you talk about this middleware, it's more like we have the frontend of the compiler. It can be like Parse, I don't, like SQL or like PySpark or whatever. But then you have like on the back end, something like an EPM, for example, that can target like different targets, right? Like in one case, of course, that's like hardware, but it can be like a database system, right? But what is, and I ask you this both from your experience in the industry, but also in academia, because I feel it's also an academic question. It's not just a matter of someone writing the code down there. So how far away were you from that and what it would take us to do it?
Starting point is 00:47:40 Yeah, I think from the industry perspective, actually, let me start with the academic perspective. What we've shown with Wisponda and with Modin is that it's possible. These worlds don't need to be separate and that you can translate from a data science friendly API to a relational API. You can translate that is feasible, right? And it is doable in a performant way. And we continue to do this to the few other data science libraries like NumPy and Matplotlib and so on.
Starting point is 00:48:12 So it is doable, right? So the proof points are there. Now, what are the challenges and how long it will take us to get there? I think part of the challenges is that there is a long chain of data science libraries, very large number of data science libraries. I think Ponder, for example, can help you get to the ones that are very popularly used and adopted. So Pandas, NowPy, Scikit-learn, and so on.
Starting point is 00:48:35 That's how we are addressing this and mapping it to the popular backends as of now. So the Snowflakes and the BigQuerys of the world and possibly in the future Spark or what have you. So Ponder can help with the heavy adopted libraries and the heavy adopted databases, but it's a very long chain of databases and a very long chain of libraries. This, I think, has to come with a combination
Starting point is 00:49:01 of open source contributions because people are invested in having their libraries operate data on data at scale. So open source contributions for that other database vendors being like, hey, I want you to be able to run Pandas at scale on my database, be it MongoDB or CockroachDB or whatever bespoke database that you have. I want to be able to run that there,
Starting point is 00:49:22 so I will help you build those integrations. I think that long tail is going to take a while for us to fill out, but I think we could get to 80-90% of the popular interfaces and backends relatively soon, is my guess. Okay. That's super interesting. I'm always, personally, also very interested in that, to see how they make. All right, that's all from my side. I think we're getting closer to the buzzer, as Eric usually says. So I'll give the microphone back.
Starting point is 00:49:58 Oh my goodness! I did. Yeah, I did. So I'll give the mic back to Eric. And from my side, Aditya, like, I think we need more time. Like we need to spend more time and specifically talk about the user experience. I think it's it's something that is missing like so much for like the data. Let's say in industry, we can name it like this. And there's a lot to learn, especially by seeing what happens in other disciplines of software engineering.
Starting point is 00:50:32 If you see what front-end engineers experience today, or what tooling they have, or what progress has been done there, and compare it, what has happened in data, I think there's a lot of so I'd love to hear your opinion and have a conversation with you on that. So hopefully we can have you again in the future.
Starting point is 00:50:58 Sounds great. All right. Aditya, just one last question for me because we're close to the end here. And this is, I'm going to ask about your research, just because you've done so much interesting research. What is one of the most surprising discoveries that you've heard multiple times in the course of the podcast. But I think the most important lesson that I have learned from research is don't ignore the user. Literally, don't ignore the user.
Starting point is 00:51:42 Don't ignore their workflows. Don't ignore their economics. don't ignore their workflows, don't ignore their economics, don't ignore their practices, don't ignore the bigger organizational context they fit into. And often early on in my research, I had a tendency to chuck tools over the wall at users and people would not adopt that. And so it's been a sobering lesson over the course of the last decade to be like, you know what, we have to go where the users are and build tools for them and learn about their limitations rather than simply saying, here's another new shiny tool.
Starting point is 00:52:20 Because often the infrastructure investment in managing a new Shiny tool is not worth the added benefits you get from it. So that's been a sobering lesson. But it's been great to work very closely with users both at Berkeley as well as this company, learn about their challenges and build around those as opposed to building new tools from scratch. And I think that's been quite rewarding too, because they feel the pain, the users feel the pain. And then when they're like, hey, you solved it for me, that's great. A different level of excitement and enrichment altogether. I love it. Wise words, I think, in all the work we do. So Aditya, thank you so much for joining us on the show and giving us some of your time. It's been wonderful.
Starting point is 00:53:13 It's been a pleasure. Thank you so much for having me and I'd love to be back. Kostas, I love how practical that conversation with Aditya from Ponder was. I mean, we dug into the specifics of pandas, but then we also got pretty theoretical. He has some really interesting ideas about the way that things should operate. And I think my big takeaway, and this is actually from your conversation with him, so I'm going to unashamedly steal your thunder. But there's this idea, I think, you know, he didn't necessarily use these terms, but this idea of almost like artificial scarcity on execution engines.
Starting point is 00:53:57 And he envisions this world where the ergonomics are such that you can use whatever language you want and then plug that into whatever execution engine you want that makes sense for your business. And currently, there are a lot of limitations based on the languages and then they're mating to different execution engines. But if you remove those barriers, it actually becomes really interesting to think. And your point about compilers was really interesting. So when you think about sort of going from pandas to maybe like a snowflake warehouse, that doesn't seem like a normal mode of operation based on sort of typical ML workflows necessarily
Starting point is 00:54:38 when you're going into production. But I love the vision. So that was really exciting. And I'll think about that a lot because I think that is where things probably should go. Yeah. Yeah. For me, I mean, it was like a very fascinating conversation.
Starting point is 00:54:55 I think the whole focus around the user experience or like developer experience, like the opportunities that exist there, like to build and deliver value and all the insights from a DTR on that stuff, like was, was amazing. And also I think, I don't know, it's like the second professor that we have now on the show after Andy from CMU. And I don't know, I kind of like it. It's really nice to have these kind of unicorns where you have
Starting point is 00:55:30 these very academic people who are also like all this stuff. Maybe we should do like call them have them like both on a show. We should do a panel
Starting point is 00:55:40 because I think that's interesting. Like they really do. It's like they've summarized a huge legacy of academic research into things that are very practical in the market. And a unicorn is a great way to describe that.
Starting point is 00:55:55 And they have also started in part their own company. They're not just academics that talk only in theory. They have seen how the social is made which i think makes them like even more interesting and i don't know it's just like the personalities i think like having the two of them like on the panel it would be interesting we need to do it yeah someone starts out by saying you know know, I'm a database guy. Like it's probably going to be a good conversation.
Starting point is 00:56:26 And that just sounds cool. Like I'm a database guy. Yeah. Yeah, absolutely. But yeah, a lot of, so many sides. And I hope like we will spend more time with him to go even deeper in things around working with data systems on scale and seeing all these things from the user perspective, just like the technology. Absolutely.
Starting point is 00:56:54 Well, thanks for joining the Data Stack Show. Subscribe if you haven't, tell a friend, and we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.
Starting point is 00:57:22 The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
