read_parquet_polars() imports the data as a Polars DataFrame.
scan_parquet_polars() imports the data as a Polars LazyFrame.
Usage
read_parquet_polars(
  source,
  ...,
  n_rows = NULL,
  row_index_name = NULL,
  row_index_offset = 0L,
  parallel = "auto",
  hive_partitioning = NULL,
  hive_schema = NULL,
  try_parse_hive_dates = TRUE,
  glob = TRUE,
  rechunk = TRUE,
  low_memory = FALSE,
  storage_options = NULL,
  use_statistics = TRUE,
  cache = TRUE,
  include_file_paths = NULL
)
scan_parquet_polars(
  source,
  ...,
  n_rows = NULL,
  row_index_name = NULL,
  row_index_offset = 0L,
  parallel = "auto",
  hive_partitioning = NULL,
  hive_schema = NULL,
  try_parse_hive_dates = TRUE,
  glob = TRUE,
  rechunk = FALSE,
  low_memory = FALSE,
  storage_options = NULL,
  use_statistics = TRUE,
  cache = TRUE,
  include_file_paths = NULL
)
Arguments
- source
Path to a file. You can use globbing with * to scan/read multiple files in the same directory (see examples).
- ...
Ignored.
- n_rows
Maximum number of rows to read.
- row_index_name
If not NULL, this will insert a row index column with the given name into the DataFrame.
- row_index_offset
Offset to start the row index column (only used if the name is set).
- parallel
This determines the direction of parallelism. "auto" will try to determine the optimal direction. Can be "auto", "columns", "row_groups", "prefiltered", or "none". See 'Details'.
- hive_partitioning
Infer statistics and schema from Hive partitioned URL and use them to prune reads. If NULL (default), it is automatically enabled when a single directory is passed, and otherwise disabled.
- hive_schema
A list containing the column names and data types of the columns by which the data is partitioned, e.g. list(a = pl$String, b = pl$Float32). If NULL (default), the schema of the Hive partitions is inferred.
- try_parse_hive_dates
Whether to try parsing hive values as date/datetime types.
- glob
Expand path given via globbing rules.
- rechunk
In case of reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks.
- low_memory
Reduce memory usage at the cost of lower performance.
- storage_options
Experimental. List of options necessary to scan parquet files from different cloud storage providers (GCP, AWS, Azure). See the 'Details' section.
- use_statistics
Use statistics in the parquet file to determine if pages can be skipped from reading.
- cache
Cache the result after reading.
- include_file_paths
Character value indicating the column name that will include the path of the source file(s).
Details
On parallel strategies
The prefiltered strategy first evaluates the pushed-down predicates in parallel and determines a mask of which rows to read. Then, it parallelizes over both the columns and the row groups while filtering out rows that do not need to be read. This can provide significant speedups for large files (i.e. many row groups) with a predicate that filters clustered rows or filters heavily. In other cases, prefiltered may slow down the scan compared to other strategies.
The prefiltered setting falls back to auto if no predicate is given.
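As a minimal sketch of how the prefiltered strategy might be requested, the snippet below writes mtcars to a temporary Parquet file and scans it lazily with a filter() that acts as the predicate. The temporary file, the choice of mtcars, and the cyl == 6 condition are illustrative assumptions, and a file this small will not show any speedup. The example assumes the dplyr, polars, and tidypolars packages are installed.

```r
library(dplyr)
library(tidypolars)

# Illustrative setup (an assumption, not part of the function's contract):
# write a small Parquet file so the example is self-contained.
dest <- tempfile(fileext = ".parquet")
polars::as_polars_df(mtcars)$write_parquet(dest)

# Lazily scan the file and request the "prefiltered" strategy.
# The filter() call supplies the predicate; without it, the setting
# falls back to "auto" as described above.
out <- scan_parquet_polars(dest, parallel = "prefiltered") |>
  filter(cyl == 6) |>
  compute() # run the query and return a Polars DataFrame

out
```

On real data, it may be worth timing the same query with parallel = "auto" before settling on prefiltered, since the benefit depends on how selective and how clustered the filter is.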