Drill to Detail - Drill to Detail Ep.44 'Pandas, Apache Arrow and In-Memory Analytics' With Special Guest Wes McKinney

Episode Date: December 8, 2017

Mark is joined in this episode of Drill to Detail by Wes McKinney, to talk about the origins of the Python Pandas open-source package for data analysis, his subsequent work as a contributor to the Kudu (incubating) and Parquet projects within the Apache Software Foundation, and Arrow, an in-memory data structure specification for use by engineers building data systems and the de-facto standard for columnar in-memory processing and interchange.

Transcript
Starting point is 00:00:00 So hello and welcome to another episode of Drill to Detail, the podcast series about the world of big data and analytics, and I'm your host, Mark Rittman. So my guest this week is Wes McKinney, someone I knew of from his work on the Pandas Python data analysis toolkit, and most recently from his work on the Apache Arrow in-memory data format. So I'm very pleased to have Wes on the show today, and Wes, thank you for joining us. Thanks for having me. So Wes, just tell us a bit about, I suppose,
Starting point is 00:00:39 what was your career up until where you are now, and what was the work you did? Introduce the work you did with Pandas and how you got into that, really. Well, I started my career in 2007. I had a math degree, and I got a job working in quantitative finance at AQR Capital Management up in Greenwich, Connecticut. And it turned out that a lot of my job was manipulating data and cleaning and preparing data and doing data analysis. And so I got interested very early on in data analysis tools, primarily to make myself more productive, so that I would enjoy my job more
Starting point is 00:01:26 and would be able to get more, get more work done. Um, but I discovered, um, you know, at that, at that time that I was passionate for, uh, toolmaking and building tools to enable other people to be, to be productive as well. Um, and, uh, you know, at that time I, this was 2007, 2008, there really was not a data community or a data analysis community in Python. There was a fairly robust scientific computing community that was just becoming mature and more accepted at that time. But for statistical computing and data analysis, data science as a field did not exist at that time. But for statistical computing and data analysis, data science as a field did not exist at that time. But there really was not a very big community for Python. And so I felt there was an opportunity there to make the software stack more amenable to people doing statistics and statistical computing, and that Python was a good language to build those types of tools.
Starting point is 00:02:29 So I made a, you know, I guess in retrospect it was a fairly risky bet given that it wasn't clear that there was going to develop a large ecosystem of Python programmers, but it seemed to me like the right thing to do. And so I spent several years throughout the late 2000s and early 2010s essentially working to make the Python ecosystem viable as a tool for statistical computing and data science. So that included building the Pandas project and making that into a successful open source project and writing my book, Python for Data Analysis, which was published five years ago this fall. So what led you to, I suppose, what led you to focus on Python?
Starting point is 00:03:16 I guess you were using it through work, but what particularly made you want to stick with Python rather than using R or something instead? Well, I was doing general exploratory data analysis and data manipulation, but I was also building software. So I was building some small software production software systems that needed to do some amount of data processing and data analysis, but that were mostly about automating various business processes with a lot of different steps and things that could go wrong. And so really, I needed a tool set that was functional both for doing production software engineering, as well as exploratory data analysis.
Starting point is 00:04:01 And so in 2007, the conventional wisdom for doing software engineering was to use Java for everything. But I found that the Java stack was not especially favorable for doing interactive and exploratory computing. And similarly, for data analysis, people used Matlab and they were starting to use R at that time. R really wasn't nearly as popular back then. But these were languages that were not especially well-su for doing software engineering and had the right components for, it had the right kind of user interface and the right tools to build productive data manipulation and data analysis tools. But we, you know, there were, there was no library like pandas or libraries that solved, you know, those kinds of problems. So I was attracted to it because it seemed like
Starting point is 00:05:07 a promising environment to build the best of both worlds. But it was certainly a chicken and egg problem because not only was there not a community of people dealing with statistical computing, there just weren't the libraries. So it was kind of both, we've got to build, in order to build a community, I had to build software for the people to use
Starting point is 00:05:34 because people are not going to use Python to solve those problems if there wasn't a viable tool set for them to start and be productive in their work. Yeah. Yeah, good. I mean, I came across pandas actually only a few years ago i was um i was doing some work at home just taking some data and manipulating it and trying to do some kind of uh regression analysis and that sort of thing on it and and that was my first introduction to pandas and i was blown away
Starting point is 00:06:00 actually by how how useful it was and and how I suppose it how it solved problems for me so well and that was even in the times of when R was popular but maybe for people who don't know what pandas was just maybe trying to outline maybe in layman's terms you know what did that code do and what did it give you kind of beyond what say Python could do out of the box? Well one of the and one of the the biggest things that people use Pandas for, and I think one of the reasons why it's become so popular, is that it makes getting access and just getting access to data very easy. basically comma-separated files or other kinds of text files or data coming from databases or spreadsheets, Excel files, really any kind of tabular data. You know, the first part of the data analysis process is to get access to data, and Pandas makes that very fast and convenient.
Starting point is 00:07:00 It provides containers for dealing with data in memory. So when you load data from a file, you load data from a database, it provides what's called a data frame object, which is essentially a table or it could be thought of as tabular data or data having some kind of labeling or index. And so it's become very popular as a tool for dealing with doing exploratory data analysis on time series data, on, you know, data coming from databases and CSV files. It's used for data cleaning and data preparation for modeling and statistics and machine learning. So essentially, it solves all of the, you know, what I call the unsexy parts of data science. And people complain about or talk about how data cleaning and data preparation can be, you know, 80 to 90 percent of the time that you spend doing data science. And Pandas is concerned with making that data cleaning and data preparation as painless as possible for the user. So you find
Starting point is 00:08:18 that, you know, really anyone who's doing machine learning or data science in Python, you know, in probably 80 or 90 percent of cases are using pandas for data access in the initial kind of data wrangling to get their data ready for modeling. And I think I used it in combination with Spark as well, actually. I think it was PySpark I was using it with. So even outside of just classic Python, it was useful as well, really. Yeah, that's yeah that's true yeah it gets used uh yeah i think uh so in the in the pandas core team we we talk about pandas as being the um the middleware that is uh kind of gluing the ecosystem together so it's the it's the point of contact between different components it's used for for handling
Starting point is 00:09:02 data in many different circumstances so if you're using Spark, data that's going to Spark or coming from Spark may be in Pandas format. Yeah, excellent. So that was quite a while ago actually now, and you went on to, I mean obviously Pandas is still going now, but you went on to form a company called Datapad, so what was that about then really? So we formed Datapad so what was that about then really so we formed uh we formed uh the datapad uh with
Starting point is 00:09:26 um so a long time colleague and collaborator of mine um chang shi he uh he and i worked together at at aqr he's also um a fellow mit grad and um so i i got him you know we we collaborated on pandas in in the early days so he was you know, we collaborated on Pandas in the early days. So he was, you know, one of the earliest Pandas contributors. And we wanted to start an analytics company. And we used the, you know, we intended to use the Python, you know, the Python data stack to build the back endend systems for the products that we were building, and we would contribute back and make the open source better as we built vertical products for business users. And so our initial product and what we spent the majority of 2013 and 2014 building was a cloud hosted visual analytics product which could work well for business intelligence use cases it was designed to
Starting point is 00:10:33 be cloud cloud first and working well with with kind of the modern kind of analytics environment where you have data spread across many different sources and things, things like Salesforce, you know, marketing automation products. Because, you know, in modern businesses, data is fragmented all over the place. And so we wanted to make it simple to collect all that data in a hosted environment that would make, you know, integrating that data together and then exploring and analyzing it and visualizing it we uh and we you know we were building all that with uh you know with the the you know the
Starting point is 00:11:10 python stack under uh under the hood um so we yeah so we you know we we uh we had venture investors and we spent uh you know we spent about two years um you know working on working on the company prior to being uh acquired by cloud era at the end of uh of 2014. all right so that that explains how you went to cloud era then because i remember i remember at the time um reading that you'd gone there and the fact that you were involved in pandas and you went there as well and that was quite a big deal at the time i mean so your company was acquired by by uh cloud era what did you what did you um i suppose what was your kind of aim really in joining cloud era and and what were you what was the Ibis project you're working on there as well?
Starting point is 00:11:46 What was the kind of the idea there really? Well, so I'd known the Cloudera founders and many Cloudera employees for a long time. So they were already friends and supporters of our open source work on Pandas as well as Datapad. And I found that as we were building the Datapad product that one of the hardest things that we were dealing with was the, well, there were multiple hard things, but the backend systems, there were a lot of low-level data management and in-memory analytics problems that we were tackling. And that were, you know, basically the open source stack did not provide the best foundation for building that type of a product in the cloud, which needed to be able to have, you know,
Starting point is 00:12:48 linear scalability and very low latency analytics, very fast slicing and dicing. And, you know, even now when you look at people building these types of products, in a lot of cases, there really are not off-the-shelf open source projects to pick up that can meet the kind of performance and scalability requirements that an interactive visual analytics application has. And you see a lot of custom software being built. And so for me, critically, one of the biggest issues was the performance and scalability
Starting point is 00:13:25 of the Python data stack. And by the end of 2013, having been working on Datapad for about a year, I'd accumulated a list of complaints and grievances about the internal architecture relationship with the rest of the Python scientific computing and data science stack and sort of famously gave a talk at the end of 2013 that was called 10 Things I Hate About Pandas. And so really I think as much as we were enjoying building our company, I was very motivated about tackling some of these systems, kind of like architectural challenges that face the entire data science world. And I felt that Cloudera provided with its team of distributed systems engineers and people building databases. I was able to work
Starting point is 00:14:26 with the Impala team and the Kudu team. And really, I wanted to be able to work more closely with individuals who had experience in distributed storage and in-memory analytics to tackle some of these, what I felt are really serious challenges for the future of open source data science that we would be able to, you know, start building a better platform for the future there. So, I mean, Kudu, I mean, yeah, we talked about Kudu on the show kind of quite a while ago, actually, and that, to me, struck me as very kind of interesting and revolutionary and so on. And so what's your view on kudu really i mean
Starting point is 00:15:09 is that still i'm conscious that you've got arrow i'm conscious there's products like drew it out there as well what's your take on kudu really and and if that's relevant now or problem it solves and that sort of thing yeah so so so kudu you know for those who don't know, it's Apache Kudu. It's a project that was originally started at Cloudera and is now part of the Apache Software Foundation. It is a distributed column store. So it is a system that stores and manages data in a sort of fault tolerant and robust way but is not a does not have its own query processing engine so it's so the idea is that kudu provides a storage a scalable storage back-end that is designed for
Starting point is 00:15:58 fast analytical processing but also supports inserts and updates. So you can use it as a real-time place to capture real-time data. Then you can do fast analytics on that data. And so the idea is that rather than, you know, traditionally, if you were building an analytic database, you would have a vertically integrated system of storage and query processing and query language. So usually, you know, if you looked at. So usually, if you looked at a traditional analytic SQL system, something like Vertica, say, HP Vertica, the query language is SQL. It has its own storage system and its own query engine.
Starting point is 00:16:40 And so the idea of something like Kudu is that it decouples the storage and data management part of the problem from the query engine so that you can have many different kinds of query engines processing data that is being delivered by Kudu. So I think it's a novel system and a very powerful concept. And when I joined Cloudera at the end of 2014, Kudu was still in stealth, so to speak. It was not publicly available or known. And so that was one of the things that really excited me about joining Cloudera was that I felt that there was, you know, really innovative work happening around in-memory analytics and distributed storage.
Starting point is 00:17:30 And so to be able to work with, you know, Todd Lipkon and the Kudu team and the Impala team, you know, it seemed like a very, you know, fruitful kind of intellectual opportunity to get involved in some cutting-edge you know distributed you know big data technology yeah it was Mike Percy that was on the show actually remember when he first came on and it was certainly struck me as it combined with combined with impala obviously it solved the problem about be able to do updates inserts and so on but it struck me as I think it was positioned at the time as being you know the next thing on for my space really and it certainly struck me as being very kind of useful do you think that do you think that kind of problems being I suppose the problem is solving there is be able to decouple the
Starting point is 00:18:16 storage from the query make it distributed and so on is that a problem do you think that is people's minds these days do you think that having that decoupled query and and and kind and just, I suppose, query and storage, is that a problem in people's minds now? Or has it been surpassed really by, say, cloud and that sort of thing? Well, I think the original decoupling of storage and query was... And so you had the Hadoop file system based on Google MapReduce paper and so forth. And so at the time when Kudu was created,
Starting point is 00:18:57 the two primary storage systems for the Hadoop ecosystem were HDFS, the Hadoop file system, and HBase. And so what would happen was that people were using HBase for real-time data collection and serving and HDFS for long-term storage and batch processing. And so if you had real-time data or data that was very rapidly changing, you would often have, you know, you would build some kind of complicated sort of marionette of a real-time data capture with Kafka or HBase and HDFS.
Starting point is 00:19:39 And if you wanted to make the data available, it could get quite complicated to manage the metadata around how to expose the right data sets to query engines like Impala to process. So I think that in large enterprise settings, these problems have been experienced for a while, and particularly around real-time or fast-changing data. I think the challenges of making that work at scale have been well-known for a while. And I think that, I mean, I'm not an expert in the latest and greatest cloud offerings from different... I'm thinking like BigQuery and stuff like that. Yeah, but my understanding is that the public cloud providers
Starting point is 00:20:27 have created solutions to assist with this particular problem or for people who are doing analytics in AWS or in Google Cloud that are solving problems in a similar way that Kudu is solving the problem, but Kudu is certainly oriented problem um but oriented at what kudu is certainly oriented at the hadoop ecosystem yeah yeah okay so so you then went on to work towards my start and work on apache arrow um so so what is that then what what problem did that solve in the stack and and how is that different from say kudu and and say parquet and
Starting point is 00:21:01 so on right well so when i uh when i started up at so when I started up at Cloudera at the end of 2014, so first of all, I had my list of complaints and grievances with pandas as far as interoperability with other systems, basically dealing with close to the metal problems of data access and data movement. And I, you know, at that point I'd written many different, I'd written tons of different connectors from pandas to many other storage formats and data processing systems. So I'd experienced the pain of the fragmentation and incompatibility or non... ...that I worked on in Cloudera was exploring if we could build a, we could build Python user-defined functions, UDFs for Impala, and immediately was, and I was aware of Kudu as well. And so I was interested in Kudu integration between Python, you know, integration between Python and Kudu.
Starting point is 00:22:12 And so I was in one of the initial, you know, the first problems that I had to solve there was how to exchange memory between an analytical SQL engine like Impala or a distributed column store like Kudu and the Python data science environment, namely Pandas and so forth. And so I very quickly, and we had built an in-memory distributed column store and query engine for Datapad, and that was one of our biggest pieces of IP distributed column store and query engine for Datapad.
Starting point is 00:22:48 And that was one of our biggest pieces of IP was a columnar query engine. And so I tried to synthesize the learning from building the Datapad query engine with what was going on with Kudu and Impala and to a certain extent at that time with Spark as well. And so I very quickly found myself wanting effectively a middleware technology to be a standardized format for exchanging tabular data, column-oriented data, between these types of systems, between Spark and Python, Impala and Kudu,
Starting point is 00:23:29 and really any system that deals with analytical datasets and data processing. But certainly to create a large open source project and build a community is a very complicated thing. And so I set about, you know, starting at the end of 2014 and throughout 2015, I set about finding allies to essentially folks in other open source communities who had experienced similar kinds of problems to try to see if this was a problem that other people were experiencing around data interoperability and in-memory analytics. So it took most of 2015 to assemble a collection of like-minded open source developers to create the Apache Arrow project. We encountered a group of developers
Starting point is 00:24:27 led by Jacques Nadeau, who's now the CCO and co-founder of Dremio. And they had wanted to spin the kind of in-memory columnar data structures out of the Apache Drill project into a standalone software component so that it could be more easily reused in other Java projects. And so this was summer 2015. And we spent the latter half of 2015
Starting point is 00:24:56 basically working to get more people involved with the idea of something like Arrow existing and to see if we could agree on the path forward in terms of the project design and the governance structure. And so it took quite a bit of logistics to put the project together. We decided to do it within the Apache Foundation to make it easier for vendors to collaborate without concerns about governance and conflicts of interest.
Starting point is 00:25:28 But the idea was that what we wanted was a companion technology that would be a companion technology to columnar storage formats like Parquet and Ork. It would be useful for runtime format for data processing. So you would use it inside, you would use Arrow inside query engines as the place where you put data while it's being processed. And it could serve to, as a tool for connecting systems together without any overhead. So we wanted to be able to share data between, say, Python and Impala or Kudu and Python without losing any performance to conversions or serialization. But at the same time,
Starting point is 00:26:13 yeah, but at the same time, this memory format would need to be suitable for as a primary runtime format for query processing. So it would need to be kind of laid out in memory in a way that is efficient for CPUs and GPUs and so forth. Yeah, I remember reading about it at the time and thinking it was very impressive. I mean, so it's, I suppose Apache Arrow is a component that you'd find in other pieces of software. It's not something that has a front end to itself and so on. I mean, that's a very altruistic thing for you to do, to kind of put all this time into building something that's effectively like an interchange format and a memory format. I mean, what motivates you to do that, really? I mean, rather than just work for someone and release it as commercial or do something else, really?
Starting point is 00:26:55 Well, it's, I mean, I think the idea of creating standardized, open-sourced technology, I mean, it is altruistic in a way, standardized open source technology. I mean, it is altruistic in a way, but I think the main motivation is to make things simpler for data system developers. So right now, the status quo is that when you build a query engine or a data processing system, that you need to define data
Starting point is 00:27:26 structures, runtime data structures, where you put, so when you read data out of a Parquet file, or you read data from Kudu, or you read data from a SQL database, you have to place that data someplace in memory. And so what systems traditionally would do is they would define their own proprietary data structures to hold the data in memory while it's being processed. And so whenever you want to move that data from one runtime to another, you have to convert between two incompatible runtime formats, which incurs a conversion and copying penalty. And so there's multiple sources of efficiency in creating the Arrow technology will be sitting on the shelf and available to use as a runtime data format. So you won't have to design your own in-memory data format. Additionally, if two systems are using Arrow, then they can be composed and plugged
Starting point is 00:28:37 together without any overhead. And so that will give rise, and I believe in the long run, that will give rise to much more systems that are a lot more heterogeneous. They could be heterogeneous in programming languages. So you could see systems that have code that is written in Java as well as C or C++. And so by eliminating that barrier around data access and data movement will give rise to, I think, much more interesting software in the future and free up, you know, system engineers to work on problems that are further up the stack around just computation, cogeneration, you know, parallelization and so forth and so I found that I found over the years that I've spent just a huge amount of time writing data converters and data connectors
Starting point is 00:29:34 and dealing with just like data access and serialization and so to make that problem go away I think will be a major boon for the future as we were freed up to kind of work on higher level concerns rather than more mundane details of you know of getting access to data taking something that's your own kind of toolkit and then making it into something other people can use
Starting point is 00:29:56 and then make it into an open source project that takes quite a lot of work really and I suppose what's involved what I suppose what the extra steps to make it usable by somebody else? And the other question really is what's involved in making a project an Apache project? Right, yeah, well the difference between software that you build for your own use or use in an internal project versus a generally available open source project is a pretty huge delta. Right, right. So you have to, you know, and certainly Apache projects function based on, you know,
Starting point is 00:30:39 open and transparent development process and consensus. And so, you know, it is much easier in some ways to be able to make unilateral design decisions and also to not have to concern yourself with architectures and deployment environments that don't concern you. So, for example, I am not actively a Windows user, but this software does get used on Windows environments, and so we have to spend a lot of time building the Apache Arrow project and dealing with Windows compatibility and deployment on essentially all of the major platforms and different Linux distributions, different versions of Xcode on Mac OS,
Starting point is 00:31:26 different versions of Microsoft Visual Studio. So there's a lot of, I think, the development environment, maintaining a productive development environment becomes a lot more important. So having good developer documentation is essential if you want to attract developers and contributors to the project. You have to have developer documentation. You have to have a lot more user documentation than you would for a purely internal software project. So with the Apache Foundation, there is more work still as far as open source projects go
Starting point is 00:32:02 in that we very carefully track IP that is contributed to the project to make sure that software is appropriately licensed, that when users contribute code that they have distribution rights to that code and they are not copy and pasting code from Stack Overflow or from projects which may have incompatible licenses. And so this does come up occasionally that you'll see a code snippet
Starting point is 00:32:31 and it may have come from a source of unknown origin. And so I think I see the mark of a project released from an Apache project is being sort of a gold label for open source in that the IP has been very strongly vetted by the community. the origin of the code, so that companies can use Apache projects in a commercial setting without nearly as much fear of having IP contamination. And so the idea is that when you see something that's Apache licensed or that's part of an Apache project, there's certainly the distinction between being an Apache project and being Apache licensed are not the same thing. The Apache process is much more, there's a lot more hoops to jump through to get software released as an Apache project. But the commercial, the broader open source community and particularly commercial users benefit from that careful oversight of IP and licensing. Yeah, sure. I suppose the other thing is there's quite a few Apache projects out there,
Starting point is 00:33:52 even, I suppose, within this kind of space. And the other reason I thought to contact you was I saw the blog post by Daniel Abadi saying about Ork and Parquet and actually kind of in his blog post asking whether there was a point in having Arrow as a third column store Apache project. I mean, what was your view on, maybe just recapping what that blog post was, and what was your view on that? And this idea that there would be, maybe kind of Apache Arrow was a little bit superfluous. What's your sort of take on that?
Starting point is 00:34:18 Right. So the article that you're referring to, Daniel Labate is a professor of computer science at Yale University. He's well known in the analytic database world as he co-created a technology called C-Store, which formed the basis for the Vertica analytic columnar SQL engine. And so he's a well-known figure in this community. And he wrote an article last month in October about Arrow as a storage, as a column store for database systems. And so he looked at the project through a very, in my opinion, a quite narrow, not unreasonable lens, but a fairly narrow lens of thinking about Arrow as it relates to the runtime of an analytic database.
Starting point is 00:35:27 Because traditionally, you know, systems will store data on, database systems will store data on disk. If it's a columnar database, then it will store data in something like the Parquet format, which was, so Parquet for background is designed based on ColumnIO, which is a Google technology that's part of Dremel, which is a widely used analytic database technology at Google. So Parquet is kind of like open source ColumnIO. So essentially, Professor Abadi's blog post was about whether an in-memory analytic database runtime needs different design decisions in its storage format from an on-disk one. So essentially, what does Arrow bring to the table compared with an on-disk columnar storage technology? So he did conclude at the end of the article, and I'll leave the reader to read the article and make their own conclusions. But he did conclude that an in-memory analytic database system would benefit from different decisions from one that is on disk, that the characteristics of RAM versus disk
Starting point is 00:36:48 will lead to different design decisions. But I think, yeah, so I wrote a follow-up blog post to respond to the article because, I mean, I think Arrow is interesting strictly from the point of view of analytic databases, but really database systems is only one possible application of Arrow. So I think in his article, there was really no mention of the data science world and problems that are solved around system interoperability and data interchange, particularly zero copy data interchange between systems. So if you just strictly look at Arrow from the perspective of you're building an analytic database, you don't really care about sending and receiving data between other systems.
Starting point is 00:37:43 So it was a pretty narrow analysis, not an invalid one, but there's plenty of other reasons for the Arrow project to exist. I think I'm glad that he wrote the article because I think that it's instigated a fruitful discussion about the different use cases for Arrow and gotten more people thinking about the problem space. So I think it's been, the last month or six weeks has been pretty interesting. Excellent, excellent, that's good, isn't it? So what's the kind of, I suppose, what's the roadmap for Arrow, and where do you see it going, and what problems do you see it solving,
Starting point is 00:38:15 or what problems do you wanna solve, maybe, going into the future, really? Right, well, so I, the reason I got involved in Arrow in the first place, as we were talking about earlier in the interview, was that my motivation was to have an efficient bridge between the data science world and the distributed storage and analytic database world. And so we've spent most of the last two years working on hardening the details of the Arrow format specification, the Arrow metadata, like dealing with data types and what constitutes a timestamp and what kinds of dates and times and different, you know, what is a decimal, all the different data types that are supported in Arrow. So now that we've reasonably stabilized the Arrow format,
Starting point is 00:39:12 now we need to go and build different data processing systems which natively deal in Arrow data so that Arrow isn't strictly being used for data interchange. So you can use Arrow with Pandas right now, but Pandas has its own proprietary, quote-unquote, proprietary in-memory format. And so in the context of, you know, back on my talk in 2013 about the 10 things I hate about Pandas, you know, really, for me, Arrow has been built to be a better long-term in-memory format
Starting point is 00:39:50 for data processing in systems like pandas. And so the future for what I'm going to be spending my time doing is building a new and much higher performance in-memory data processing engine to be the kind of future backbone of pandas that people can use for many years into the future. And I've given a number of talks this year about this topic that I'm quite interested, I'm very interested in building native in-memory
Starting point is 00:40:27 data processing for Arrow in a way that can be used not just for pandas and for Python users but can be used more broadly across the entire open source data science world so I would like to build these libraries in a way that they can be used in R
Starting point is 00:40:43 or it can be used in Julia or it can even be used in Java. And by virtue of systems adopting and using Arrow, they will be able to use. So if we have a native Arrow based data processing engine, which is used in Python, that same code can be shared in different programming languages. And so I think my long-term vision and what I would like to see is a great deal more collaboration happening amongst different programming communities who are doing data science and in-memory analytics. Traditionally, there has not been a great deal of collaboration in these communities,
Starting point is 00:41:20 and you see people from the R community and the Python community do talk, but very rarely do they collaborate on software projects. So to kind of socialize this idea, I got together with Hadley Wickham from the R community last year. And we used the Arrow technology to build a small file format called Feather. And the idea is that it's a way to store, it uses the arrow format to store data frames on disk, and you can read and write them interoperably from R and Python. And so it was kind of one of the only examples of a real genuine software collaboration
Starting point is 00:41:58 between the R and Python communities where there was actual like a library of C code that is shared between an R project and a Python project. And I really see this as a model for the future so that we don't have, you know, everyone in the R community building their own custom implementations of everything and everyone in the Python community building their own custom implementations of everything, that we're able to create some shared, what I describe as shared infrastructure for data science. And that will enable us to essentially to build software that is, you know, curated and improved by a much larger community of developers. It'll be much higher performance.
Starting point is 00:42:37 It will have much better memory use, much more scalable and performant. And as one community contributes, so as, you so as the community contributes more to the shared ecosystem of analytical tools, the entire data science world will benefit. And so I hope that we see in the future that work happening in the R community will benefit the Python community and vice versa. Yeah, and so as more programming languages come to the fold and develop communities of data scientists that we will be able to share technology and, you know, kind of the rising tide will lift all boats, so to speak. And one really interesting thing that's been happening since we started Apache Arrow is that there is a community of
Starting point is 00:43:26 Ruby developers who are building bindings for the Arrow libraries and creating modern data science tools for Ruby. So some of them are in Japan. There's a developer, Kohei Suto, who's on the PMC for Arrow. He's been leading a project called Red Data Tools, which is based on Apache Arrow and is with the intent of building a richer data science stack for Ruby. So the fact that we have the Python community, people like me collaborating with the Ruby community on data science tools, I think is really exciting. And I think that we'll see more of that going into the future.
Starting point is 00:44:07 Okay, fantastic. So just to finish up then, really, where would people find out about Apache Arrow? Where on the web? And also, are you speaking at any events in the near future? Well, there's the Arrow website, which is arrow.apache.org. And certainly following me on Twitter is always
Starting point is 00:44:26 because I generally will tweet out about things that are related to Arrow or in this general domain I do not have any I do not have any talks planned at the moment but I usually usually do announce them when they happen. On my website, on westmckinney.com, you can find prior talks that I've given, slide decks and videos and so forth that go into some more detail and perhaps slightly more articulate, presenting the material that we've been discussing here about how, essentially,
Starting point is 00:45:10 how the vision around interoperable and standardized in-memory data processing technology is important to the data science world. There's also a Twitter handle for Arrow. It's at Apache Arrow. We also will tweet about other talks that have been given by members of the community.
Starting point is 00:45:33 Right now, we're focused on growing the developer community and building Arrow integrations in different systems and just building more collaborations with different communities that would benefit from being involved. Fantastic.
Starting point is 00:45:50 Well, Wes, it's been great speaking to you. Thank you very much for coming on the show. And yeah, thank you very much and take care. Thank you.
