We’re very chuffed to announce the release of nanoparquet 0.5.1 (and 0.5.0). nanoparquet is a small, self-sufficient R package for reading and writing Parquet files.
You can install it from CRAN with:
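```r
install.packages("nanoparquet")
```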
This blog post will go over some of the improvements in nanoparquet 0.5.0 and 0.5.1.
You can see the full list of changes in the release notes, here and here.
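The examples below assume that the package is attached:

```r
library(nanoparquet)
```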
List columns#
Parquet has a LIST type for columns whose values are variable-length
sequences of scalars. nanoparquet 0.5.0 adds support for reading and
writing such columns.
Note: for now nanoparquet supports only a single level of nesting. Each element of a list column must be an atomic vector, not a list of lists, and all elements of a column must have the same scalar type.
To write a list column, put a regular R list into your data frame.
Each element must be an atomic vector (integer, double, or character),
NULL for a missing list, or an empty vector for an empty list.
NA values inside an element vector encode missing elements.
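For example, with some made-up scores:

```r
df <- data.frame(id = 1:4)
df$scores <- list(
  c(80L, 95L, 70L),  # three elements
  100L,              # a single element
  NULL,              # missing list
  integer()          # empty list
)
tmp <- tempfile(fileext = ".parquet")
write_parquet(df, tmp)
read_parquet_schema(tmp)
```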
# A data frame: 5 × 14
file_name r_col name r_type type type_length repetition_type converted_type
<chr> <int> <chr> <chr> <chr> <int> <chr> <chr>
1 /var/fold… NA sche… <NA> <NA> NA <NA> <NA>
2 /var/fold… 1 id integ… INT32 NA REQUIRED INT_32
3 /var/fold… 2 scor… list(… <NA> NA OPTIONAL LIST
4 /var/fold… 2 list <NA> <NA> NA REPEATED <NA>
5 /var/fold… 2 elem… <NA> INT32 NA OPTIONAL INT_32
# ℹ 6 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
# precision <int>, field_id <int>, children <list>
read_parquet() reads LIST columns back as R list columns:
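Reading back the file we just wrote:

```r
read_parquet(tmp)
```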
id scores
1 1 80, 95, 70
2 2 100
3 3 NULL
4 4
infer_parquet_schema() shows how nanoparquet maps each column of a data frame to a Parquet type. For list columns, the r_type column has the form list(integer):
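For the scores data frame this looks like:

```r
infer_parquet_schema(df)  # df is the data frame with the scores list column
```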
# A data frame: 4 × 6
r_col name r_type type type_length repetition_type
<int> <chr> <chr> <chr> <int> <chr>
1 1 id integer INT32 NA REQUIRED
2 2 scores list(integer) <NA> NA OPTIONAL
3 2 list <NA> <NA> NA REPEATED
4 2 element <NA> INT32 NA OPTIONAL
A LIST column occupies three rows in the schema: the outer list node,
a repeated group node, and the leaf element node.
When you need to specify the element type explicitly, you can use
parquet_schema():
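For example, to write id as a plain INT32 without a converted type, the call looks roughly like this; the exact notation for list element types is best checked in ?parquet_schema:

```r
schema <- parquet_schema(
  id = "INT32",
  scores = list("INT32")  # element type of the list column (see ?parquet_schema)
)
tmp2 <- tempfile(fileext = ".parquet")
write_parquet(df, tmp2, schema = schema)
read_parquet_schema(tmp2)
```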
# A data frame: 5 × 14
file_name r_col name r_type type type_length repetition_type converted_type
<chr> <int> <chr> <chr> <chr> <int> <chr> <chr>
1 /var/fold… NA sche… <NA> <NA> NA <NA> <NA>
2 /var/fold… 1 id integ… INT32 NA REQUIRED <NA>
3 /var/fold… 2 scor… list(… <NA> NA OPTIONAL LIST
4 /var/fold… 2 list <NA> <NA> NA REPEATED <NA>
5 /var/fold… 2 elem… <NA> INT32 NA OPTIONAL INT_32
# ℹ 6 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
# precision <int>, field_id <int>, children <list>
New types#
bit64::integer64#
Parquet’s INT64 type holds 64-bit integers. R’s native integer type is only 32 bits wide, so nanoparquet maps INT64 to double by default. nanoparquet 0.5.1 adds support for bit64::integer64 vectors, which give you true 64-bit integer arithmetic in R.
write_parquet() now writes bit64::integer64 columns as INT64:
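For example:

```r
library(bit64)
df64 <- data.frame(
  id = as.integer64(c(
    "1000000000000000000",
    "2000000000000000000",
    "3000000000000000000"
  ))
)
tmp <- tempfile(fileext = ".parquet")
write_parquet(df64, tmp)
read_parquet_schema(tmp)
```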
# A data frame: 2 × 14
file_name r_col name r_type type type_length repetition_type converted_type
<chr> <int> <chr> <chr> <chr> <int> <chr> <chr>
1 /var/fold… NA sche… <NA> <NA> NA <NA> <NA>
2 /var/fold… 1 id double INT64 NA REQUIRED INT_64
# ℹ 6 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
# precision <int>, field_id <int>, children <list>
To read INT64 columns back as bit64::integer64 instead of the default
double, use the read_int64_type option. The bit64 package must be
installed; if it isn’t, nanoparquet throws a clear error.
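For example:

```r
read_parquet(
  tmp,  # the INT64 file written above
  options = parquet_options(read_int64_type = "integer64")
)
```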
# A data frame: 3 × 1
id
<int64>
1 0e18
2 2e18
3 3e18
blob::blob#
read_parquet() previously returned raw BYTE_ARRAY and
FIXED_LEN_BYTE_ARRAY columns (i.e. those without a string, UUID, or
decimal annotation) as plain lists of raw vectors. They are now returned as
blob::blob objects, which print more neatly and come with the full set of
blob helpers. write_parquet() now also accepts blob::blob columns, so
round-tripping binary data is straightforward:
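For example:

```r
library(blob)
dfb <- data.frame(id = 1:3)
dfb$payload <- blob(
  as.raw(1:5),        # five bytes
  charToRaw("hello"), # five bytes
  as.raw(255)         # one byte
)
tmp <- tempfile(fileext = ".parquet")
write_parquet(dfb, tmp)
read_parquet(tmp)
```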
id payload
1 1 blob[5 B]
2 2 blob[5 B]
3 3 blob[1 B]
nanoparquet as a filter#
In Unix, a filter is a program that reads from standard input and writes
to standard output, making it a composable building block in shell pipelines.
write_parquet() now supports writing to standard output via
file = ":stdout:":
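You would normally do this from a script, since it writes binary Parquet data to the standard output:

```r
write_parquet(mtcars, ":stdout:")
```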
The most common use case is from the command line:
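For example, to write a data set to a Parquet file from a shell:

```sh
Rscript -e 'nanoparquet::write_parquet(mtcars, ":stdout:")' > mtcars.parquet
```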
You can build this into a data pipeline. For example, to convert a CSV file to Parquet and then process the Parquet data with another tool in one go, without writing an intermediate .parquet file to disk, you can do:
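One way this might look, assuming a data.csv file and the DuckDB CLI as the downstream tool:

```sh
cat data.csv |
  Rscript -e 'nanoparquet::write_parquet(read.csv("stdin"), ":stdout:")' |
  duckdb -c "SELECT COUNT(*) FROM read_parquet('/dev/stdin')"
```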
Since nanoparquet 0.4.0, read_parquet() can also read from an R
connection, so you can pipe Parquet data in as well:
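For example, assuming the DuckDB CLI is available to produce Parquet data on its standard output:

```sh
duckdb -c "COPY (SELECT 1 AS x) TO '/dev/stdout' (FORMAT parquet)" |
  Rscript -e 'print(nanoparquet::read_parquet(file("stdin")))'
```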
Acknowledgements#
We thank all contributors to nanoparquet so far, for opening issues, submitting pull requests, and providing feedback: @Aariq, @alvarocombo, @apalacio9502, @atsyplenkov, @cboettig, @ChandlerLutz, @cmrnp, @D3SL, @damonbayer, @DavideMessinaARS, @eitsupi, @gksmyth, @hadley, @jack-davison, @jeroenjanssens, @lbm364dl, @lschneiderbauer, @mrcaseb, @pmarks, @PMassicotte, @r2evans, @RealTYPICAL, @tanho63, @thisisnic, @torfason, @TurnaevEvgeny, @Upipa, @vankesteren, @vincentarelbundock, @wlandau, @YipengUva, and @yutannihilation.
