Drill to Detail - Drill to Detail Ep.44 'Pandas, Apache Arrow and In-Memory Analytics' With Special Guest Wes McKinney
Episode Date: December 8, 2017. Mark is joined in this episode of Drill to Detail by Wes McKinney to talk about the origins of the Python Pandas open-source package for data analysis, his subsequent work as a contributor to the ...Kudu (incubating) and Parquet projects within the Apache Software Foundation, and Arrow, an in-memory data structure specification for use by engineers building data systems and the de facto standard for columnar in-memory processing and interchange.
Transcript
So hello and welcome to another episode of Drill to Detail, the podcast series about
the world of big data and analytics, and I'm your host Mark Rittman.
So my guest this week is Wes McKinney, someone I knew of from his work on the Pandas Python data analysis toolkit,
and most recently from his work on the Apache Arrow in-memory storage format.
So I'm very pleased to have Wes on the show today,
and Wes, thank you for joining us.
Thanks for having me.
So Wes, just tell us a bit about your career up until where you are now, introduce the work you did with Pandas, and how you got into that, really.
Well, I started my career in 2007. I had a math degree and I got a job working in quantitative finance at AQR Capital Management up in Greenwich, Connecticut.
And it turned out that a lot of my job was manipulating data and cleaning and preparing data and doing data analysis.
And so I got interested very early on in data analysis tools
primarily to make myself more productive so that I would enjoy my job more
and would be able to get more work done. But I discovered at that time that I had a passion for toolmaking and building tools to enable other people to be productive as well. And at that time, this was 2007, 2008, there really was not a data community or a data analysis community in Python.
There was a fairly robust scientific computing community that was just becoming mature and more accepted at that time.
But for statistical computing and data analysis, data science as a field did not exist at that time, and there really was not a very big community for Python. And so I felt there was an opportunity there to make the software stack more amenable to
people doing statistics and statistical computing, and that Python was a good language
to build those types of tools.
So I made what, in retrospect, was a fairly risky bet, given that it wasn't clear that a large ecosystem of Python programmers was going to develop, but it seemed to me like the right thing to do. And so I spent several years throughout the late 2000s and early 2010s
essentially working to make the Python ecosystem viable
as a tool for statistical computing and data science.
So that included building the Pandas project
and making that into a successful open source project and writing my book, Python for Data Analysis, which was published five years ago this fall.
So, I suppose, what led you to focus on Python?
I guess you were using it through work, but what particularly made you want to stick with Python rather than using R or something instead?
Well, I was doing general exploratory data analysis and data manipulation,
but I was also building software.
So I was building some small production software systems
that needed to do some amount of data processing and data analysis,
but that were mostly about automating various business processes with a lot
of different steps and things that could go wrong. And so really, I needed a tool set that was
functional both for doing production software engineering, as well as exploratory data analysis.
And so in 2007, the conventional wisdom for doing software engineering
was to use Java for everything. But I found that the Java stack was not especially favorable for
doing interactive and exploratory computing. And similarly, for data analysis, people used Matlab
and they were starting to use R at that time.
R really wasn't nearly as popular back then.
But these were languages that were not especially well-suited for doing software engineering. Python, on the other hand, was suitable for software engineering and had the right kind of user interface and the right tools to build productive data manipulation and data analysis tools. But there was no library like pandas, or libraries that solved those kinds of problems. So I was attracted to Python because it seemed like
a promising environment to build the best of both worlds. But it was certainly a chicken and egg
problem because not only was there not a community of people dealing with statistical computing,
there just weren't the libraries.
So it was kind of both: in order to build a community, I had to build software for people to use, because people are not going to use Python to solve those problems if there isn't a viable tool set for them to start with and be productive in their work.
Yeah, good. I mean, I came across Pandas actually only a few years ago. I was doing some work at home, just taking some data, manipulating it and trying to do some kind of regression analysis on it, and that was my first introduction to Pandas. I was blown away by how useful it was and how well it solved problems for me, and that was even at a time when R was popular. But for people who don't know what Pandas is, could you maybe outline, in layman's terms, what it does and what it gives you beyond what, say, Python could do out of the box?
Well, one of the biggest things that people use Pandas for, and I think one of the reasons why it's become so popular, is that it makes getting access to data very easy: basically comma-separated files or other kinds of text files, data coming from databases or spreadsheets, Excel files, really any kind of tabular data.
You know, the first part of the data analysis process is to get access to data,
and Pandas makes that very fast and convenient.
It provides containers for dealing with data in memory.
So when you load data from a file, you load data from a database, it provides what's called a data frame object, which is essentially a table or it could be thought of as tabular data or data having some kind of labeling or index.
And so it's become very popular as a tool for dealing with doing exploratory data analysis on time series data, on, you know, data coming from databases and CSV files.
It's used for data cleaning and data preparation for modeling and statistics and machine learning.
So essentially, it solves what I call the unsexy parts of data science. People talk about how data cleaning and data preparation can be 80 to 90 percent of the time that you spend doing data science, and Pandas is concerned with making that data cleaning and data preparation as painless as possible for the user. So you find that really anyone who's doing machine learning or data science in Python, in probably 80 or 90 percent of cases, is using Pandas for data access and the initial data wrangling to get their data ready for modeling.
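To make that concrete, here is a minimal sketch of the kind of data access and cleaning workflow being described, using the Pandas API; the file name and column names are hypothetical, purely for illustration.

```python
# A small sketch of a typical Pandas workflow: load tabular data, clean it,
# prepare it, and do a quick exploratory aggregation.
import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["order_date"])  # fast access to tabular data
df = df.dropna(subset=["customer_id"])                     # basic data cleaning
df["revenue"] = df["quantity"] * df["unit_price"]          # preparation for modelling
summary = df.groupby("region")["revenue"].sum()            # quick exploratory analysis
print(summary)
```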
And I think I used it in combination with Spark as well, actually.
I think it was PySpark I was using it with.
So even outside of just classic Python, it was useful as well, really.
Yeah, that's true. So in the Pandas core team, we talk about Pandas as being the middleware that is kind of gluing the ecosystem together. It's the point of contact between different components, and it's used for handling data in many different circumstances. So if you're using Spark, data that's going to Spark or coming from Spark may be in Pandas format.
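As a hedged sketch of that point-of-contact role, the standard PySpark API lets you move data between Spark and Pandas in both directions; the data and column names below are made up for the example.

```python
# Moving data between Pandas and Spark: Pandas -> Spark with createDataFrame,
# Spark -> Pandas with toPandas().
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("pandas-interchange").getOrCreate()

pdf = pd.DataFrame({"city": ["London", "Leeds"], "visits": [120, 80]})
sdf = spark.createDataFrame(pdf)                        # Pandas -> Spark

result = sdf.groupBy("city").sum("visits").toPandas()   # Spark -> Pandas
print(result)
```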
Yeah, excellent.
So that was quite a while ago actually now, and, I mean, obviously Pandas is still going now, but you went on to form a company called Datapad. So what was that about then, really?
So we formed Datapad with a long-time colleague and collaborator of mine, Chang She. He and I worked together at AQR, and he's also a fellow MIT grad. We collaborated on Pandas in the early days, so he was one of the earliest Pandas contributors.
And we wanted to start an analytics company.
And we intended to use the Python data stack to build the back-end systems for the products that we were building,
and we would contribute back and make the open source better as we built vertical products for business users.
And so our initial product, and what we spent the majority of 2013 and 2014 building, was a cloud-hosted visual analytics product which could work well for business intelligence use cases. It was designed to be cloud-first and to work well with the modern analytics environment, where you have data spread across many different sources, things like Salesforce and marketing automation products. Because in modern businesses, data is fragmented all over the place, and so we wanted to make it simple to collect all that data in a hosted environment that would make integrating that data together, and then exploring, analyzing and visualizing it, straightforward. We were building all of that with the Python stack under the hood. We had venture investors, and we spent about two years working on the company prior to being acquired by Cloudera at the end of 2014.
All right, so that explains how you went to Cloudera then, because I remember at the time reading that you'd gone there, and the fact that you were involved in Pandas and went there as well was quite a big deal at the time. So your company was acquired by Cloudera. What was your aim, really, in joining Cloudera, and what was the Ibis project you were working on there?
What was the idea there, really?
Well, so I'd known the Cloudera founders and many Cloudera employees for a long time.
So they were already friends and supporters of our open source work on Pandas as well as Datapad.
And I found, as we were building the Datapad product, that one of the hardest things we were dealing with, well, there were multiple hard things, but in the back-end systems there were a lot of low-level data management and in-memory analytics problems we were tackling. And basically the open source stack did not provide the best foundation for building that type of product in the cloud,
which needed to be able to have, you know,
linear scalability and very low latency analytics,
very fast slicing and dicing.
And, you know, even now when you look at people
building these types of products,
in a lot of cases,
there really are not off-the-shelf open source projects to pick up that can meet the kind of performance and scalability requirements that an interactive visual analytics application has.
And you see a lot of custom software being built.
And so for me, critically, one of the biggest issues was the performance and scalability
of the Python data stack.
And by the end of 2013, having been working on Datapad for about a year, I'd accumulated a list of complaints and grievances about the internal architecture of Pandas and its relationship with the rest of the Python scientific computing and data science stack, and sort of famously gave a talk at the end of 2013 called 10 Things I Hate About Pandas.
And so, really, as much as we were enjoying building our company, I was very motivated to tackle some of these systems and architectural challenges that face the entire data science world. And I felt that Cloudera, with its team of distributed systems engineers and people building databases, provided that opportunity. I was able to work with the Impala team and the Kudu team. And really, I wanted to work more closely with individuals who had experience in distributed storage and in-memory analytics, to tackle what I felt were really serious challenges for the future of open source data science, so that we would be able to start building a better platform for the future there.
So, I mean, we talked about Kudu on the show quite a while ago, actually, and to me it struck me as very interesting and revolutionary. So what's your view on Kudu, really? I'm conscious that you've got Arrow, and I'm conscious there are products like Druid out there as well. What's your take on Kudu, whether it's still relevant now, the problem it solves, that sort of thing?
Yeah, so Kudu, for those who don't know, it's Apache Kudu.
It's a project that was originally started at Cloudera and is now part of the Apache Software Foundation.
It is a distributed column store.
So it is a system that stores and manages data in a fault-tolerant and robust way, but it does not have its own query processing engine. So the idea is that Kudu provides a scalable storage back-end that is designed for fast analytical processing but also supports inserts and updates. So you can use it as a place to capture real-time data, and then you can do fast analytics on that data.
Traditionally, if you were building an analytic database, you would have a vertically integrated system of storage, query processing and query language. So if you looked at a traditional analytic SQL system, something like Vertica, say, HP Vertica, the query language is SQL, and it has its own storage system and its own query engine. The idea of something like Kudu is that it decouples the storage and data management part of the problem from the query engine, so that you can have many different kinds of query engines processing data that is being delivered by Kudu.
So I think it's a novel system and a very powerful concept.
And when I joined Cloudera at the end of 2014,
Kudu was still in stealth, so to speak.
It was not publicly available or known.
And so one of the things that really excited me about joining Cloudera was that I felt there was really innovative work happening around in-memory analytics and distributed storage. To be able to work with Todd Lipcon and the Kudu team and the Impala team seemed like a very fruitful intellectual opportunity to get involved in some cutting-edge distributed big data technology.
Yeah, it was Mike Percy that was on the show, actually. I remember when he first came on, and it struck me that, combined with Impala, it obviously solved the problem of being able to do updates and inserts and so on. It was positioned at the time as the next thing for that space, really, and it certainly struck me as very useful. Do you think the problem it's solving there, being able to decouple the storage from the query engine and make it distributed, is that a problem that's in people's minds these days? Is that decoupling of query and storage a problem in people's minds now?
Or has it been surpassed, really, by, say, cloud and that sort of thing?
Well, I think the original decoupling of storage and query was the Hadoop ecosystem itself. So you had the Hadoop file system, based on the Google MapReduce paper, and so forth.
And so at the time when Kudu was created,
the two primary storage systems for the Hadoop ecosystem
were HDFS, the Hadoop file system, and HBase.
And so what would happen was that people were using HBase for real-time data collection and serving
and HDFS for long-term storage and batch processing.
And so if you had real-time data, or data that was very rapidly changing, you would often build some kind of complicated contraption of real-time data capture with Kafka or HBase, alongside HDFS.
And if you wanted to make the data available,
it could get quite complicated to manage the metadata around how to expose the right data sets to query engines like Impala to process.
So I think that in large enterprise settings, these problems have been experienced for a while, and particularly around real-time or fast-changing data.
I think the challenges of making that work at scale have been well-known for a while.
And I think that, I mean, I'm not an expert in the latest and greatest cloud offerings from the different providers...
I'm thinking like BigQuery and stuff like that.
Yeah, but my understanding is that the public cloud providers have created solutions to assist with this particular problem, for people who are doing analytics in AWS or in Google Cloud, that solve the problem in a similar way to Kudu, but Kudu is certainly oriented at the Hadoop ecosystem.
Yeah, okay. So you then went on to start and work on Apache Arrow. So what is that then? What problem did that solve in the stack, and how is it different from, say, Kudu and, say, Parquet and so on?
Right. Well, when I started at Cloudera at the end of 2014, first of all, I had my
list of complaints and grievances with pandas as far as interoperability with other systems,
basically dealing with close to the metal problems of data access and data movement.
And at that point I'd written tons of different connectors from Pandas to many other storage formats and data processing systems, so I'd experienced the pain of the fragmentation and incompatibility of these systems. One of the first things I worked on at Cloudera was exploring whether we could build Python user-defined functions, UDFs, for Impala. And I was aware of Kudu as well, so I was interested in integration between Python and Kudu.
And so one of the first problems that I had to solve there was how to exchange memory between an analytical SQL engine like Impala, or a distributed column store like Kudu, and the Python data science environment, namely Pandas and so forth. Now, we had built an in-memory distributed column store and query engine for Datapad, and that columnar query engine was one of our biggest pieces of IP.
And so I tried to synthesize the learning
from building the Datapad query engine
with what was going on with Kudu and Impala
and to a certain extent at that time with Spark as well.
And so I very quickly found myself wanting effectively a middleware technology
to be a standardized format for exchanging tabular data, column-oriented data,
between these types of systems, between Spark and Python, Impala and Kudu,
and really any system that deals with analytical datasets and data processing.
But certainly, to create a large open source project and build a community is a very complicated thing. And so, starting at the end of 2014 and throughout 2015, I set about finding allies, essentially folks in other open source communities who had experienced similar kinds of problems, to see whether data interoperability and in-memory analytics were problems other people were experiencing too. So it took most of 2015 to assemble a collection of like-minded open source developers to create the Apache Arrow project. We encountered a group of developers led by Jacques Nadeau, who's now the CTO and co-founder of Dremio.
And they had wanted to spin
the kind of in-memory columnar data structures
out of the Apache Drill project
into a standalone software component so that it could be more easily reused in other Java projects.
And so this was summer 2015.
And we spent the latter half of 2015
basically working to get more people involved with
the idea of something like Arrow existing and
to see if we could agree on the path forward
in terms of the project design and the governance structure.
And so it took quite a bit of logistics to put the project together.
We decided to do it within the Apache Foundation
to make it easier for vendors to collaborate
without concerns about governance and conflicts of interest.
But the idea was that what we wanted was a companion technology to columnar storage formats like Parquet and ORC. It would be useful as a runtime format for data processing, so you would use Arrow inside query engines as the place where you put data while it's being processed. And it could serve as a tool for connecting systems together without any overhead. So we wanted to be able to share data between, say, Python and Impala, or Kudu and Python, without losing any performance to conversions or serialization. But at the same time, this memory format would need to be suitable as a primary runtime format for query processing, so it would need to be laid out in memory in a way that is efficient for CPUs and GPUs and so forth.
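To illustrate the companion relationship being described, here is a minimal sketch using the pyarrow library; the file and column names are hypothetical. The Arrow Table is the in-memory columnar form, Parquet is the on-disk format it pairs with, and the conversions to and from Pandas are designed to be as cheap as possible.

```python
# Arrow as the in-memory columnar format sitting between Pandas and Parquet.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"user_id": [1, 2, 3], "amount": [9.99, 24.50, 3.75]})

table = pa.Table.from_pandas(df)          # Pandas -> Arrow columnar memory
pq.write_table(table, "events.parquet")   # Arrow -> Parquet on disk

table2 = pq.read_table("events.parquet")  # Parquet -> Arrow, no Pandas required
df2 = table2.to_pandas()                  # Arrow -> Pandas when you want it
```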
Yeah, I remember reading about it at the time and thinking it was very impressive.
I mean, I suppose Apache Arrow is a component that you'd find in other pieces of software; it's not something that has a front end of its own.
I mean, that's a very altruistic thing for you to do, to kind of put all this time into building something that's effectively like an interchange format and a memory format.
I mean, what motivates you to do that, really? I mean, rather than just work for someone
and release it as commercial or do something else, really?
Well, I think the idea of creating standardized, open source technology, I mean, it is altruistic in a way, but I think the main motivation is to make things simpler for data system developers.
So right now, the status quo is that when you build a query engine or a data processing system, you need to define runtime data structures: when you read data out of a Parquet file, or from Kudu, or from a SQL database, you have to place that data someplace in memory. And so what systems traditionally would do is define
their own proprietary data structures to hold the data in memory while it's being processed.
And so whenever you want to move that data from one runtime to another, you have to convert between two incompatible runtime formats, which incurs a conversion and copying penalty.
And so there are multiple sources of efficiency. With Arrow, the technology will be sitting on the shelf and available to use as a runtime data format, so you won't have to design your own in-memory data format. Additionally, if two systems are using Arrow, then they can be composed and plugged
together without any overhead. And I believe in the long run that will give rise to systems that are a lot more heterogeneous. They could be heterogeneous in programming languages, so you could see systems that have code written in Java as well as C or C++.
And so eliminating that barrier around data access and data movement will, I think, give rise to much more interesting software in the future, and free up systems engineers to work on problems that are further up the stack around computation, code generation, parallelization and so forth. I've found over the years that I've spent a huge amount of time writing data converters and data connectors and dealing with data access and serialization, and so making that problem go away will, I think, be a major boon for the future, as we're freed up to work on higher-level concerns rather than the more mundane details of getting access to data.
Taking something that's your own kind of toolkit, making it into something other people can use, and then making it into an open source project, that takes quite a lot of work, really. What's involved, I suppose, what are the extra steps to make it usable by somebody else?
And the other question really is what's involved in making a project an Apache project?
Right, yeah, well the difference between software that you build for your own use or use in an internal project versus a generally available open source project
is a pretty huge delta.
Right, right.
So certainly, Apache projects function based on an open and transparent development process and consensus. Within an internal project, it is much easier in some ways to make unilateral design decisions, and also to not have to concern yourself with architectures and deployment environments that don't affect you.
So, for example, I am not actively a Windows user,
but this software does get used on Windows environments,
and so we have to spend a lot of time building the Apache Arrow project
and dealing with Windows compatibility
and deployment on essentially all of the major platforms
and different Linux distributions,
different versions of Xcode on Mac OS,
different versions of Microsoft Visual Studio. So I think maintaining a productive development environment becomes a lot more important. Having good developer documentation is essential if you want to attract developers and contributors to the project, and you have to have a lot more user documentation than you would for a purely internal software project.
So with the Apache Foundation,
there is more work still as far as open source projects go
in that we very carefully track IP
that is contributed to the project
to make sure that software is appropriately licensed,
that when users contribute code
that they have distribution rights to that code
and they are not copy and pasting code
from Stack Overflow or from projects which may have incompatible licenses.
And so this does come up occasionally that you'll see a code snippet
and it may have come from a source of unknown origin.
And so I see the mark of software released from an Apache project as being sort of a gold label for open source, in that the IP and the origin of the code have been very carefully vetted by the community, so that companies can use Apache projects in a commercial setting without nearly as much fear of IP contamination.
And certainly there is a distinction here: being an Apache project and being Apache licensed are not the same thing. There are a lot more hoops to jump through to get software released as an Apache project, but the broader open source community, and particularly commercial users, benefit from that careful oversight of IP and licensing.
Yeah, sure.
I suppose the other thing is there's quite a few Apache projects out there,
even, I suppose, within this kind of space.
And the other reason I thought to contact you was that I saw the blog post by Daniel Abadi about ORC and Parquet, in which he asked whether there was a point in having Arrow as a third columnar storage Apache project. Maybe just recap what that blog post was, and what was your view on it, and on this idea that maybe Apache Arrow was a little bit superfluous? What's your sort of take on that?
Right. So, the article that you're referring to: Daniel Abadi is a professor of computer science at Yale University.
He's well known in the analytic database world as he co-created a technology called C-Store,
which formed the basis for the Vertica analytic columnar SQL engine.
And so he's a well-known figure in this community.
And he wrote an article last month, in October, about Arrow as a column store for database systems. He looked at the project through, in my opinion, a quite narrow, not unreasonable, but fairly narrow lens of thinking about Arrow as it relates to the runtime of an analytic database.
Because traditionally, database systems will store data on disk. If it's a columnar database, then it will store data in something like the Parquet format. Parquet, for background, is designed based on ColumnIO, which is a Google technology that's part of Dremel, a widely used analytic database technology at Google; so Parquet is kind of like open source ColumnIO. Essentially, Professor Abadi's blog post was about whether an in-memory analytic database runtime
So essentially, what does Arrow bring to the table compared with an on-disk columnar storage technology?
So he did conclude at the end of the article, and I'll leave the reader to read the article and make their own conclusions.
But he did conclude that an in-memory analytic database system would benefit from different decisions from one that is on disk, that the characteristics of RAM versus disk
will lead to different design decisions.
But I wrote a follow-up blog post to respond to the article, because I think Arrow is interesting strictly from the point of view of analytic databases, but really database systems are only one possible application of Arrow.
So I think in his article, there was really no mention of the data science world and problems that are solved around system interoperability and data interchange,
particularly zero copy data interchange between systems.
So if you strictly look at Arrow from the perspective of building an analytic database, you don't really care about sending data to and receiving data from other systems.
So it was a pretty narrow analysis, not an invalid one,
but there's plenty of other reasons for the Arrow project to exist.
I'm glad that he wrote the article, because I think it's instigated a fruitful discussion about the different use cases for Arrow and gotten more people thinking about the problem space. So the last month or six weeks has been pretty interesting.
Excellent, excellent, that's good, isn't it?
So what's the roadmap for Arrow, where do you see it going, and what problems do you see it solving, or want to solve, going into the future, really?
Right, well, the reason I got involved in Arrow in the first place, as we were talking about earlier in the interview, was that my motivation was to have an efficient bridge between the data science world and the distributed storage and analytic database world. And so we've spent most of the last two years working on hardening the details of the Arrow format specification and the Arrow metadata: dealing with data types, what constitutes a timestamp, what kinds of dates and times there are, what a decimal is, all the different data types that are supported in Arrow.
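As a small sketch of the kind of type and metadata decisions being described, the pyarrow library lets you declare an Arrow schema with those types explicitly; the field names here are hypothetical.

```python
# Declaring an Arrow schema: timestamps, exact decimals and strings are all
# first-class types in the format specification.
import pyarrow as pa

schema = pa.schema([
    ("event_time", pa.timestamp("ms")),   # what constitutes a timestamp
    ("price", pa.decimal128(12, 2)),      # exact decimal with precision and scale
    ("label", pa.string()),
])
print(schema)
```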
So now that we've reasonably stabilized the Arrow format,
now we need to go and build different data processing systems
which natively deal in Arrow data
so that Arrow isn't strictly being used for data interchange.
So you can use Arrow with Pandas right now, but Pandas has its own, quote-unquote, proprietary in-memory format.
And so, going back to my talk in 2013 about the 10 things I hate about Pandas, really, for me, Arrow has been built to be a better long-term in-memory format for data processing in systems like Pandas. And so what I'm going to be spending my time doing in the future is building a new and much higher-performance in-memory data processing engine to be the future backbone of Pandas, that people can use for many years to come.
I've given a number of talks this year about this topic. I'm very interested in building native in-memory data processing for Arrow in a way that can be used not just for Pandas and for Python users, but more broadly across the entire open source data science world. So I would like to build these libraries in a way that they can be used in R, or in Julia, or even in Java. And by virtue of systems adopting and using Arrow, they will be able to share that work. So if we have a native Arrow-based data processing engine which is used in Python, that same code can be shared across different programming languages. And so my long-term vision, and what I would like to see,
is a great deal more collaboration happening
amongst different programming communities
who are doing data science and in-memory analytics.
Traditionally, there has not been a great deal of collaboration in these communities. People from the R community and the Python community do talk, but very rarely do they collaborate on software projects.
So to kind of socialize this idea, I got together with Hadley Wickham from the R community last year.
And we used the Arrow technology to build a small file format called Feather.
And the idea is that it uses the Arrow format to store data frames on disk,
and you can read and write them interoperably from R and Python.
And so it was kind of one of the only examples
of a real genuine software collaboration
between the R and Python communities
where there was an actual library of C code shared between an R project and a Python project.
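A minimal sketch of the Feather idea described here, from the Python side, using the pyarrow.feather module; the file name and data are made up for illustration, and the R call mentioned in the comment is one way the same file can be read from the R side.

```python
# Writing a data frame to Feather from Python; the same file can be read
# interoperably from R (e.g. via the feather/arrow packages there).
import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"species": ["setosa", "virginica"], "petal_length": [1.4, 5.5]})
feather.write_feather(df, "measurements.feather")

df2 = feather.read_feather("measurements.feather")  # round-trip back into Pandas
print(df2)
```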
And I really see this as a model for the future so that we don't have, you know,
everyone in the R community building their own custom implementations of everything
and everyone in the Python community building their own custom implementations of everything,
that we're able to create what I describe as shared infrastructure for data science. And that will enable us to build software that is curated and improved by a much larger community of developers. It'll be much higher performance, have much better memory use, and be much more scalable. And as each community contributes more to the shared ecosystem of analytical tools, the entire data science world will benefit. So I hope that in the future, work happening in the R community will benefit the Python community and vice versa. And as more programming languages come into the fold and develop communities of data scientists, we will be able to share technology, and the rising tide will lift all boats, so to speak. One really interesting thing that's
been happening since we started Apache Arrow is that there is a community of
Ruby developers who are building bindings for the Arrow libraries and creating modern data
science tools for Ruby. Some of them are in Japan. There's a developer, Kouhei Sutou, who's on the PMC for Arrow, and he's been leading a project called Red Data Tools,
which is based on Apache Arrow
with the intent of building a richer data science stack for Ruby.
So the fact that we have the Python community,
people like me collaborating with the Ruby community
on data science tools, I think is really exciting.
And I think that we'll see more of that going into the future.
Okay, fantastic.
So just to finish up then, really,
where would people find out about Apache Arrow?
Where on the web?
And also, are you speaking at any events in the near future?
Well, there's the Arrow website,
which is arrow.apache.org.
And certainly following me on Twitter is always a good idea, because I generally will tweet about things that are related to Arrow or in this general domain. I do not have any talks planned at the moment, but I usually do announce them when they happen.
On my website, wesmckinney.com, you can find prior talks that I've given, slide decks and videos and so forth, which go into some more detail and perhaps present slightly more articulately the material that we've been discussing here: essentially, how the vision around interoperable and standardized in-memory data processing technology is important to the data science world.
There's also a Twitter handle for Arrow.
It's @ApacheArrow.
We also will tweet about
other talks that have been given by members
of the community.
Right now, we're focused on
growing the developer community
and building Arrow integrations
in different
systems and just building
more collaborations with different communities
that would benefit from being involved.
Fantastic.
Well, Wes, it's been great speaking to you.
Thank you very much for coming on the show.
And yeah, thank you very much and take care.
Thank you.