read_parquet_polars()
imports the data as a Polars DataFrame.
scan_parquet_polars()
imports the data as a Polars LazyFrame.
Usage
read_parquet_polars(
source,
...,
n_rows = NULL,
row_index_name = NULL,
row_index_offset = 0L,
parallel = "auto",
hive_partitioning = NULL,
hive_schema = NULL,
try_parse_hive_dates = TRUE,
glob = TRUE,
rechunk = TRUE,
low_memory = FALSE,
storage_options = NULL,
use_statistics = TRUE,
cache = TRUE,
include_file_paths = NULL
)
scan_parquet_polars(
source,
...,
n_rows = NULL,
row_index_name = NULL,
row_index_offset = 0L,
parallel = "auto",
hive_partitioning = NULL,
hive_schema = NULL,
try_parse_hive_dates = TRUE,
glob = TRUE,
rechunk = FALSE,
low_memory = FALSE,
storage_options = NULL,
use_statistics = TRUE,
cache = TRUE,
include_file_paths = NULL
)
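A minimal sketch of both entry points, assuming these functions come from the tidypolars package and that a local Parquet file exists (the path below is hypothetical):

library(tidypolars)

path <- "data/flights.parquet"                  # hypothetical file
df <- read_parquet_polars(path)                 # eager: returns a Polars DataFrame
lf <- scan_parquet_polars(path, n_rows = 1000)  # lazy: returns a Polars LazyFrame
lf$collect()                                    # nothing is read until collect()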
Arguments
- source
Path(s) to a file or directory. To authenticate when scanning cloud locations, see the storage_options parameter.
- ...
These dots are for future extensions and must be empty.
- n_rows
Stop reading from the source after reading n_rows rows.
- row_index_name
If not NULL, this will insert a row index column with the given name. See the sketch under Examples below.
- row_index_offset
Offset to start the row index column (only used if the name is set by row_index_name).
- parallel
This determines the direction and strategy of parallelism.
"auto" (default): will try to determine the optimal direction.
"prefiltered": this strategy first evaluates the pushed-down predicates in parallel and determines a mask of which rows to read. Then, it parallelizes over both the columns and the row groups while filtering out rows that do not need to be read. This can provide significant speedups for large files (i.e. many row groups) with a predicate that filters clustered rows or filters heavily. In other cases, "prefiltered" may slow down the scan compared to other strategies. Falls back to "auto" if no predicate is given.
"columns", "row_groups": use the specified direction.
"none": no parallelism.
See the sketch under Examples below.
- hive_partitioning
Infer statistics and schema from Hive-partitioned sources and use them to prune reads. If NULL (default), it is automatically enabled when a single directory is passed, and otherwise disabled.
- hive_schema
A list containing the column names and data types of the columns by which the data is partitioned, e.g. list(a = pl$String, b = pl$Float32). If NULL (default), the schema of the Hive partitions is inferred. See the sketch under Examples below.
- try_parse_hive_dates
Whether to try parsing Hive values as date/datetime types.
- glob
Expand the path(s) according to globbing rules.
- rechunk
Reallocate to contiguous memory when all chunks/files are parsed.
- low_memory
Reduce memory pressure at the expense of performance.
- storage_options
Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, Azure, and Hugging Face; the keys supported by each provider are listed in the upstream Polars documentation.
Hugging Face (hf://): accepts an API key under the token parameter, e.g. c(token = YOUR_TOKEN), or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables. See the sketch under Examples below.
- use_statistics
Use statistics in the Parquet file to determine whether pages can be skipped during reading.
- cache
Cache the result after reading.
- include_file_paths
Include the path of the source file(s) as a column with this name.
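Examples

A hedged sketch of inserting a row index while reading (the file path is hypothetical):

read_parquet_polars(
  "data/flights.parquet",
  row_index_name = "row_nr",
  row_index_offset = 1L  # start the index at 1 instead of the default 0
)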
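Scanning with the "prefiltered" strategy. It only pays off when a predicate is pushed down, so a filter is applied before collecting (the file path and column name are hypothetical; pl comes from the polars package):

library(polars)

lf <- scan_parquet_polars("data/flights.parquet", parallel = "prefiltered")
lf$filter(pl$col("dest") == "LAX")$collect()  # the predicate is pushed into the scan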
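Scanning a Hive-partitioned directory laid out like data/year=2023/month=1/part-0.parquet (the layout and the partition types are assumptions):

scan_parquet_polars(
  "data/",  # a single directory: hive_partitioning is enabled automatically
  hive_schema = list(year = pl$Int32, month = pl$Int32)
)$collect()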
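Authenticating against Hugging Face via storage_options (the dataset path is hypothetical):

read_parquet_polars(
  "hf://datasets/some-user/some-dataset/data.parquet",
  storage_options = c(token = Sys.getenv("HF_TOKEN"))
)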
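Combining glob expansion with include_file_paths to record which file each row came from (the paths are hypothetical):

scan_parquet_polars(
  "data/*.parquet",  # glob = TRUE (default) expands the pattern
  include_file_paths = "file"
)$collect()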