Sharing n-dimensional data between Python and R

python

numpy

arrays

tips

parquet

Author

Daniel Kick

Published

October 24, 2024

Motivating Problem

Storing multidimensional data is a pain. In R or python you can save the n-dimensional array to disk, but not in a way that is easily accessible by the other. Admittedly, this is a rare issue to want to solve (It’s taken 3 years of deep learning for it to come up) but it’s worth discussing regardless as it illustrates a stumbling block when working between these languages. Think of it as the advanced version of trying to get the 0th entry in a R vector.

Zooming out, if we have a 3d array of data (say observations by nucleotides by SNPs) we have a few options on how to save it.

We could save it in a format designed for n-dim arrays (e.g. numpy’s .npz format). I’m not aware of any options that would allow for reading into both R and python (there certainly could be some) so we’ll ignore this option.
We could reduce this to a set of 2d arrays and save the lot. Then we can use any format for tabular data – text files, sqlite databases, parquet files – all would be an option. The downside is we’d be working with (potentially) many files. This isn’t a deal breaker¹ but it would be nice to avoid.
We could store the data and dimensions separately. We would end up with only two files each with a 1d array representing the values and dimensions. We could even put the dimensions in the name of the file if we wanted to and lets us use all the file types in option 2. Reshaping the data sounds nicer than reconstituting it so let’s go with that.

I’m a fan of parquet files for storing large amounts of data, especially when they need to be accessible by multiple programming languages. So the plan will be to turn the n-dim array into a 1-dim array and store the original dimension separately. On the other end we’ll reshape this 1-dim array back into the right shape. Easy, right?

The Curse of Dimensionality (but not that one)

It’s easiest to see the problem with an example. Suppose we want to “remake” 3d data of shape (2, 3, 4). We’re starting with an array of values (vals) which are the integers 0-23. We can reshape this array using .reshape and then look at at the 0th plane of the data.

# original python
import numpy as np
arr_shape = (2,3,4)
vals = np.arange(np.prod(arr_shape))

x = np.array(vals).reshape(arr_shape)
x[0, :, :]

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

When we repeat this in R (selecting the 1st plane because R is not 0 indexed) we don’t get the same values.

# original R
arr_shape <- c(2,3,4)
vals <- seq(0, prod(arr_shape)-1 )

x <- array(vals, dim = arr_shape)
x[1, , ]

     [,1] [,2] [,3] [,4]
[1,]    0    6   12   18
[2,]    2    8   14   20
[3,]    4   10   16   22

What’s going on and how do we fix it?

It’s easiest to see what’s happening by looking for the slice that begins with 0.

x[0, 0, ]

array([0, 1, 2, 3])

x[, 1, 1]

[1] 0 1

These values are in the last dimension in python but the first in R! In effect, numpy and R ‘fill’ values from the opposite directions.

So how do we fix this? We have to change the shape of the array so that it fills properly and then “rotate” the dimensions until we have the right shape. We’ll do this by

Reversing the desired dimensions
Filling the values into the array
Swapping the axes (swapping the first and last, second and penulitmate, etc. )

Finally, here’s how this would be done for either language:

Option 1: Remaking a Numpy array in R

Here’s the python we want to mimic:

# original python
import numpy as np
arr_shape = (2,3,4)
vals = np.arange(np.prod(arr_shape))

x = np.array(vals).reshape(arr_shape)
x[0, :, :]

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

arr_shape <- c(2,3,4)
vals <- seq(0, prod(arr_shape)-1 )

x <- array(vals, dim = rev(arr_shape)) # reverse the dims
x <- aperm(x, rev(seq_along(arr_shape)))    # transpose by permuting dims
x[1, , ]

     [,1] [,2] [,3] [,4]
[1,]    0    1    2    3
[2,]    4    5    6    7
[3,]    8    9   10   11

Option 2: Remaking a R array in Numpy

Here’s the R we want to mimic:

# original R
arr_shape <- c(2,3,4)
vals <- seq(0, prod(arr_shape)-1 )

x <- array(vals, dim = arr_shape)
x[1, , ]

     [,1] [,2] [,3] [,4]
[1,]    0    6   12   18
[2,]    2    8   14   20
[3,]    4   10   16   22

import numpy as np
arr_shape = (2,3,4)
vals = np.arange(np.prod(arr_shape))

arr_shape = list(arr_shape)
arr_shape.reverse() # now is [4, 3, 2]
x = np.array(vals).reshape(arr_shape)
# we only want to iterate over the first half of the list.
# Iterating over all of it will swap all axes and then swap them back
# Coercing 1/2 the length to an int will round down values in the case of odd dims
for i in range( int(len(arr_shape) / 2) ): 
  j = (len(arr_shape)-1)-i
  x = x.swapaxes(i, j)
x[0, :, :]

array([[ 0,  6, 12, 18],
       [ 2,  8, 14, 20],
       [ 4, 10, 16, 22]])

Footnotes

Once I split a file with genomes into several thousand files just to minimize what had to be read in.↩︎