I use the reticulate R to Python interface quite regularly. Besides the occasional grumbling when a numeric is once again converted to a float, the transfer time of pandas DataFrames and their memory usage have been a regular topic.
To get some data and maybe a definitive answer to the question “What is the best way to transfer tabular data from Python to R?” I set up a pair of tiny scripts.
tl;dr: It depends. If you have relatively small tables (< 10,000 rows), returning pandas DataFrames is your best choice. For anything larger, returning Arrow Tables seems to be the way to go.
The setup: transfer data from pandas DataFrames on the Python side to data.frames / data.tables on the R side.
Comparison of transfer times for different numbers of rows with a fixed number of columns, filled with random numbers.
Sizes: 20 columns of random integers between 1 and 100, with row counts of:
Use a minimal class that generates the random input and stores the resulting tables as members (see the sketch below).
Use 4 different methods to get the different sizes from Python to R. Generating everything up front keeps data creation out of the measured transfer time. (Will repetition make it quicker?)
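A minimal sketch of what such a generator class could look like, assuming pandas and pyarrow on the Python side; the class and method names are my own, and only two of the four return types (the two named in the tl;dr) are shown:

```python
import numpy as np
import pandas as pd
import pyarrow as pa


class TableGenerator:
    """Pre-generates one random table so the benchmark measures transfer only."""

    def __init__(self, n_rows, n_cols=20):
        rng = np.random.default_rng()
        # 20 columns of random integers between 1 and 100 (high is exclusive)
        data = rng.integers(1, 101, size=(n_rows, n_cols))
        self.df = pd.DataFrame(data, columns=[f"col_{i}" for i in range(n_cols)])
        self.table = pa.Table.from_pandas(self.df)

    def as_pandas(self):
        # reticulate converts the returned DataFrame to an R data.frame
        return self.df

    def as_arrow(self):
        # returned as an Arrow Table, which the R arrow package picks up
        return self.table
```

From R, the class can then be instantiated via reticulate's `import()` and each accessor timed separately, so only the Python-to-R conversion lands in the measurement.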
R: 4.2.0 with libraries:
Python: 3.9.12 with libraries:
If you want to check the results on your machine, you can find the code on GitHub.