Pandas - History and Future - Talk Python to Me Ep.462

Transcript

This transcript was generated automatically and may contain errors.

Hey Wes, welcome to Talk Python to Me.

Thanks for having me.

You know, honestly, I feel like it's been a long time coming, having you on the show. You've had such a big impact in the Python space, especially the data science side of that space. It's high time to have you on the show, so welcome. Good to have you.

Yeah, it's great to be here. I've been heads down a lot the last N years. I actually haven't been, because I think a lot of my work has been more like data infrastructure and working at even a lower level than Python. So I haven't been engaging as much directly with the Python community. But it's been great to kind of get back more involved and start catching up on all the things that people have been building.

So being at Posit gives me more exposure to what's going on with people who are using Python in the real world.

Yeah, there's a ton of stuff going on at Posit that's super interesting, and we'll talk about some of that. You know, sometimes it's just really fun to build and work with people building things.

Wes McKinney's background

Well, before we dive into Pandas and all the things that you've been working on after that, let's just hear a quick bit about yourself for folks who don't know you.

So, yeah, my name is Wes McKinney. I grew up in Akron, Ohio, mostly, and I started getting involved in Python development around 2007, 2008. I was working in quant finance at the time. I started building a personal data analysis toolkit that turned into the Pandas project, and then open-sourced that in 2009.

I started getting involved in the Python community, and I spent several years writing my book, Python for Data Analysis, and then working with the broader scientific Python data science community to help enable Python to become a mainstream programming language for doing data analysis and data science. And in the meantime, I've become an entrepreneur. I've started some companies and have been working to, you know, innovate and improve the computing infrastructure that powers data science tools and libraries like Pandas. So that's led to some other projects like Apache Arrow and Ibis and some other things.

And, yeah, in recent years, I've worked on a startup, Voltron Data, which is still, you know, very much going strong and has a big team and is off to the races. And I've had a long relationship with Posit, formerly RStudio, and they were my home for doing Arrow development from 2018 to 2020. They helped me incubate the startup that became Voltron Data. And so I've gone back to work full-time there as a software architect to help them with their Python strategy to make sort of their data science platform a delight to use for the Python user base.

About Posit

I'm pretty impressed with what they're doing. I didn't realize the connection between Voltron and Posit, but I have had Joe Cheng on the show before to talk about Shiny for Python. And I've seen him demo a few really interesting things, how it integrates into notebooks these days, some of the stuff that you all are doing. Maybe give people a quick elevator pitch on that while we're on that subject.

Yeah, so Posit started out in 2009 as RStudio, and it didn't start out intending to be a company. JJ Allaire and Joe Cheng built a new IDE (integrated development environment) for R because what was available at the time wasn't great. They made that into, I think, probably one of the best data science IDEs that's ever been built. It's really an amazing piece of tech.

So it started becoming a company with customers and revenue in the 2013 time frame, and they've built a whole suite of tools to support enterprise data science teams to make open-source data science work in the real world. But the company itself, it's a certified B corporation, has no plans to go public or IPO. It is dedicated to the mission of open-source software for data science and technical communication, and basically building itself to be a 100-year company that has a revenue-generating enterprise product side and an open-source side so that the open-source feeds the enterprise part of the business. The enterprise part of the business generates revenue to support the open-source development, and the goal is to be able to sustainably support the mission of open-source data science for hopefully the rest of our lives.

It's an amazing company. It's been one of the most successful companies that dedicates a large fraction of its engineering time to open-source software development. I'm very impressed with the company and JJ Allaire, its founder. I'm excited to be helping it grow and become a sustainable long-term fixture in the ecosystem.

Many people know JJ Allaire created ColdFusion, which was the original dynamic web development framework in the 1990s. He and his brother Jeremy and some others built Allaire Corp to commercialize ColdFusion. They built a successful software business that was acquired by Macromedia, which was eventually acquired by Adobe. He did go public as Allaire Corp during the dot-com bubble, and JJ went on to found a couple of other successful startups.

I think he found himself in his late 30s 15 years ago, or around the age I am now, having been very successful as an entrepreneur, no need to make money, and looking for a mission to spend the rest of his career on. Identifying data science and statistical computing as open-source, in particular making open-source for data science work, was the mission that he aligned with and something that he had been interested in earlier in his career, but he had gotten busy with other things. I think it's really refreshing to work with people who are really mission-focused and focused on making impact in the world, creating great software, empowering people, increasing accessibility, and making most of it available for free on the Internet.

WebAssembly and the browser

WebAssembly has opened up this whole kind of new world of possibilities, and so it's transformative, I think.

In the R community, they have WebR, which is similar to PyScript and Pyodide in some ways, like compiling the whole R stack to WebAssembly. There was just an article I saw on Hacker News where they worked on figuring out how to trick LLVM into compiling Fortran code, like legacy Fortran code, to WebAssembly, because when you're talking about this whole scientific computing stack, you need the linear algebra and all of the 40 years of Fortran code that has been built to support scientific applications. You need all that to compile too and run in the browser.

Apache Arrow

Yeah, so around the mid-2010s, 2015, I started working at Cloudera, which was one of the pioneers in the big data ecosystem. I had spent five or six years working on Pandas, so I had gone through the experience of building it from top to bottom. It was this full-stack system that had its own mini query engine and all of its own algorithms and data structures, all the stuff that we had to build from scratch.

And I started thinking about, what if it was possible to build some of the underlying computing technology like data readers, like file readers, all the algorithms that power the core components of Pandas, like group operations, aggregations, filtering, selection, all those things. What if it were possible to have a general-purpose library that isn't specific to Python, but is really, really fast, really efficient, and has a large community building it so that you could take that code with you and use it to build many different types of libraries, not just data frame libraries, but also database engines and stream processing engines and all kinds of things.

And one of the problems we realized we needed to solve, this was like a group of other open-source developers and me, was that we needed to create a way to represent data that was not tied to a specific programming language and that could be used for a very efficient interchange between components, and the idea is that you would have this immutable, like this kind of constant data structure, which is like it's the same in every programming language, and then you can use that as the basis for writing all of your algorithms.

So we started with building the Arrow format and standardizing it, and then we built a whole ecosystem of components, like library components and different programming languages for building applications that use the Arrow format, so that includes not only tools for building and interacting with the data, but also file readers, so you can read CSV files and JSON data and Parquet files, read data out of database systems. Wherever the data comes from, we want to have an efficient way to get it into the Arrow format.

And then we moved on to building data processing engines that are native to the Arrow format, so that Arrow goes in, the data's processed, Arrow goes out. So DuckDB, for example, supports Arrow as a preferred input format and DuckDB is more or less Arrow-like in its internals.

And so if you were building Pandas now, you could build a Pandas-like library based on the Arrow components in much less time, and it would be fast and efficient and interoperable with the whole ecosystem of other projects that use Arrow.

And it turns out, as is true with many open-source software projects, that the social problems are harder than the technical problems. So if you can solve the people coordination and consensus problems, solving the technical issues is much easier by comparison.

Modern hardware and performance

Yeah, so in Arrow land, when we're talking about analytic efficiency, it mainly has to do with how a modern CPU or GPU works. When the data is arranged in a column-oriented format, it can be moved efficiently through the CPU cache pipelines, so the data is made available efficiently to the CPU cores.

And so we spent a lot of energy in Arrow making decisions, firstly, to enable very cache-efficient (CPU cache or GPU cache) analytics on the data. The other thing is that modern CPUs (GPUs have a different kind of multicore parallelism model) have focused on adding what are called single instruction, multiple data (SIMD) intrinsics: built-in operations in the processor where you can now process up to 512 bits of data in a single CPU instruction.
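
To make the row-versus-column distinction concrete, here is a small standard-library sketch. It is only an illustration of the memory layout idea, with invented field names; pure Python will not use SIMD instructions, and Arrow's real buffers are more elaborate than the `array` module.

```python
from array import array

# Row-oriented: each record is its own object, so the values of one field
# are scattered across memory.
rows = [
    {"symbol": "AAPL", "price": 189.5, "volume": 100},
    {"symbol": "MSFT", "price": 410.2, "volume": 250},
    {"symbol": "AAPL", "price": 190.1, "volume": 75},
]

# Column-oriented: every numeric field is one contiguous, typed buffer.
columns = {
    "symbol": [r["symbol"] for r in rows],
    "price": array("d", (r["price"] for r in rows)),    # packed 8-byte floats
    "volume": array("q", (r["volume"] for r in rows)),  # packed 8-byte ints
}

# Scanning one field touches only that field's buffer, which is the access
# pattern that CPU caches and vector instructions are designed for.
total_volume = sum(columns["volume"])

# The price buffer is exactly 8 bytes per value with nothing in between.
price_bytes = columns["price"].itemsize * len(columns["price"])

print(total_volume, price_bytes)
```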

Ibis and the multi-engine data stack

I think one of the more interesting areas in recent years has been new DataFrame libraries and DataFrame APIs that transpile or compile to, and execute on, different backends. And so around the time that I was helping start Arrow, I created this project called Ibis, which is basically a portable DataFrame API that knows how to generate SQL queries and compile to Pandas and Polars and different DataFrame backends.

And the goal is to provide a really productive DataFrame API that gives you portability across different execution backends with the goal of enabling what we call the multi-engine data stack. So you aren't stuck with using one particular system because all of the code that you've written is specialized to that system.

So maybe you could work with, you know, DuckDB on your laptop or Pandas or Polars with Ibis on your laptop. But if you have… If you need to run that workload someplace else, maybe with, you know, ClickHouse or BigQuery, or maybe it's a large big data workload that's too big to fit on your laptop and you need to use Spark SQL or something, that you can just ask Ibis, say, hey, I want to do the same thing on this larger dataset over here. And it has all the logic to generate the correct, you know, query representation and run that workload for you. So it's super useful.

Ritchie Vink started the Polars project, which is kind of a reimagining of Pandas data frames written in Rust and exposed in Python. And Polars, of course, is built on Apache Arrow at its core. So, you know, building an Arrow-native data frame library in Rust and, you know, all the benefits that come with, you know, building Python extensions in Rust, you know, you avoid the GIL and you can manage the multithreading in a systems language, all that fun stuff.

And so I think the mantra with Polars was we don't want to support the eager execution by default that Pandas provides. We want to be able to build expressions so that we can do query optimization and take inefficient code and under the hood rewrite it to, you know, be more efficient, which is, you know, what you can do with a query optimizer.

SQLGlot

So, the SQLGlot project was started by Toby Mao. He's a Netflix alum and, you know, a really talented developer who's created this SQL query transpilation framework library for Python and, you know, kind of underlying core library. And so, the problem that's being solved there is that SQL, despite being a quote-unquote standard, is not at all standardized across different database systems.

And so, if you want to take your SQL queries written for one engine and use them someplace else, without something like SQLGlot, you would have to manually rewrite them and make sure you get the typecasting and coalescing rules correct. And so, SQLGlot understands the intricacies and the quirks of every database dialect, SQL dialect, and knows how to correctly translate from one dialect to another.

And so, Ibis now uses SQLGlot as its underlying engine for query transpilation and generating SQL outputs. So, originally, Ibis had its own kind of bad version of SQLGlot, a SQL transpilation layer that was powered by, I think, SQLAlchemy and a bunch of custom code. And so, I think they've been able to delete a lot of code in Ibis by moving to SQLGlot.

So, and Toby, like, his company, Tobiko Data, they're building a product called SQLMesh that's powered by SQLGlot. So, very cool project and maybe a bit in the weeds, but if you've ever needed to convert a SQL query from one dialect to another, yeah, SQLGlot is here to save the day.

All right, Wes, I think we're getting short on time, but, you know, I know everybody appreciated hearing from you and hearing what you're up to these days. Anything you want to add before we wrap up?

I don't think so. Yeah. I enjoyed the conversation and yeah. Yeah, there's a lot of stuff, a lot of stuff going on and still plenty of things to get excited about. So I think often people feel like, you know, all the exciting problems in the Python ecosystem have been solved, but there's still a lot to do and yeah, we've made a lot of progress in the last, you know, 15 plus years, but, you know, in some ways it feels like we're just getting started. We are just getting started. Excited to see where things go next.
