read_csv_polars() imports the data as a Polars DataFrame.
scan_csv_polars() imports the data as a Polars LazyFrame.
Usage
read_csv_polars(
  source,
  ...,
  has_header = TRUE,
  separator = ",",
  comment_prefix = NULL,
  quote_char = "\"",
  skip_rows = 0,
  schema = NULL,
  schema_overrides = NULL,
  null_values = NULL,
  ignore_errors = FALSE,
  cache = FALSE,
  infer_schema_length = 100,
  n_rows = NULL,
  encoding = "utf8",
  low_memory = FALSE,
  rechunk = TRUE,
  skip_rows_after_header = 0,
  row_index_name = NULL,
  row_index_offset = 0,
  try_parse_dates = FALSE,
  eol_char = "\n",
  raise_if_empty = TRUE,
  truncate_ragged_lines = FALSE,
  include_file_paths = NULL,
  dtypes,
  reuse_downloaded
)
scan_csv_polars(
  source,
  ...,
  has_header = TRUE,
  separator = ",",
  comment_prefix = NULL,
  quote_char = "\"",
  skip_rows = 0,
  schema = NULL,
  schema_overrides = NULL,
  null_values = NULL,
  ignore_errors = FALSE,
  cache = FALSE,
  infer_schema_length = 100,
  n_rows = NULL,
  encoding = "utf8",
  low_memory = FALSE,
  rechunk = TRUE,
  skip_rows_after_header = 0,
  row_index_name = NULL,
  row_index_offset = 0,
  try_parse_dates = FALSE,
  eol_char = "\n",
  raise_if_empty = TRUE,
  truncate_ragged_lines = FALSE,
  include_file_paths = NULL,
  dtypes,
  reuse_downloaded
)
Arguments
- source
Path(s) to a file or directory. When needing to authenticate for scanning cloud locations, see the storage_options parameter.
- ...
These dots are for future extensions and must be empty.
- has_header
Indicate if the first row of the dataset is a header or not. If FALSE, column names will be autogenerated in the following format: "column_x", with x being an enumeration over every column in the dataset starting at 1.
- separator
Single byte character to use as separator in the file.
- comment_prefix
A string, which can be up to 5 symbols in length, used to indicate the start of a comment line. For instance, it can be set to # or //.
- quote_char
Single byte character used for quoting. Set to NULL to turn off special handling and escaping of quotes.
- skip_rows
Start reading after a particular number of rows. The header will be parsed at this offset.
- schema
Provide the schema. This means that polars doesn't do schema inference. This argument expects the complete schema, whereas schema_overrides can be used to partially overwrite a schema. This must be a list. Names of list elements are used to match to inferred columns.
- schema_overrides
Overwrite dtypes during inference. This must be a list. Names of list elements are used to match to inferred columns.
- null_values
Character vector specifying the values to interpret as NA values. It can be named, in which case names specify the columns in which this replacement must be made (e.g. c(col1 = "a")).
- ignore_errors
Keep reading the file even if some lines yield errors. You can also use infer_schema = FALSE to read all columns as UTF8 to check which values might cause an issue.
- cache
Cache the result after reading.
- infer_schema_length
The maximum number of rows to scan for schema inference. If NULL, the full data may be scanned (this is slow). Set infer_schema = FALSE to read all columns as pl$String.
- n_rows
Stop reading from the source after reading n_rows.
- encoding
Either "utf8" or "utf8-lossy". Lossy means that invalid UTF8 values are replaced with "?" characters.
- low_memory
Reduce memory pressure at the expense of performance.
- rechunk
Reallocate to contiguous memory when all chunks/files are parsed.
- skip_rows_after_header
Skip this number of rows when the header is parsed.
- row_index_name
If not NULL, this will insert a row index column with the given name.
- row_index_offset
Offset to start the row index column (only used if the name is set by row_index_name).
- try_parse_dates
Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl$String.
- eol_char
Single byte end of line character (default: "\n"). When encountering a file with Windows line endings ("\r\n"), one can go with the default "\n". The extra "\r" will be removed when processed.
- raise_if_empty
If FALSE, parsing an empty file returns an empty DataFrame or LazyFrame.
- truncate_ragged_lines
Truncate lines that are longer than the schema.
- include_file_paths
Include the path of the source file(s) as a column with this name.
- dtypes
- reuse_downloaded
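As a minimal sketch of how has_header, separator, and comment_prefix combine, the snippet below reads a small semicolon-separated file that has no header row and contains one comment line. The file contents and the names path, df, and out are made up for this illustration; the result is converted with as.data.frame() only to make it easy to inspect in base R.

```r
library(tidypolars)

# Hypothetical input: semicolon-separated, no header, one comment line
path <- tempfile(fileext = ".csv")
writeLines(c("1;a", "# a note that should be skipped", "2;b"), path)

df <- read_csv_polars(
  path,
  has_header = FALSE,   # first row is data, so names are autogenerated
  separator = ";",      # single byte separator
  comment_prefix = "#"  # lines starting with "#" are ignored
)

# Convert to a base R data frame to inspect names and row count
out <- as.data.frame(df)
names(out)
nrow(out)
```

Following the description of has_header above, the autogenerated names should be of the form "column_x", numbered from 1.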
Examples
### Read or scan a single CSV file ------------------------
# Setup: create a CSV file
dest <- withr::local_tempfile(fileext = ".csv")
write.csv(mtcars, dest, row.names = FALSE)
# Import this file as a DataFrame for eager evaluation
read_csv_polars(dest) |>
  arrange(mpg)
#> shape: (32, 11)
#> ┌──────┬─────┬───────┬─────┬───┬─────┬─────┬──────┬──────┐
#> │ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ vs ┆ am ┆ gear ┆ carb │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ i64 ┆ f64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
#> ╞══════╪═════╪═══════╪═════╪═══╪═════╪═════╪══════╪══════╡
#> │ 10.4 ┆ 8 ┆ 472.0 ┆ 205 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 10.4 ┆ 8 ┆ 460.0 ┆ 215 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 13.3 ┆ 8 ┆ 350.0 ┆ 245 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 14.3 ┆ 8 ┆ 360.0 ┆ 245 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 14.7 ┆ 8 ┆ 440.0 ┆ 230 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 27.3 ┆ 4 ┆ 79.0 ┆ 66 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
#> │ 30.4 ┆ 4 ┆ 75.7 ┆ 52 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 2 │
#> │ 30.4 ┆ 4 ┆ 95.1 ┆ 113 ┆ … ┆ 1 ┆ 1 ┆ 5 ┆ 2 │
#> │ 32.4 ┆ 4 ┆ 78.7 ┆ 66 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
#> │ 33.9 ┆ 4 ┆ 71.1 ┆ 65 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
#> └──────┴─────┴───────┴─────┴───┴─────┴─────┴──────┴──────┘
# Import this file as a LazyFrame for lazy evaluation
scan_csv_polars(dest) |>
  arrange(mpg) |>
  compute()
#> shape: (32, 11)
#> ┌──────┬─────┬───────┬─────┬───┬─────┬─────┬──────┬──────┐
#> │ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ vs ┆ am ┆ gear ┆ carb │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ i64 ┆ f64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
#> ╞══════╪═════╪═══════╪═════╪═══╪═════╪═════╪══════╪══════╡
#> │ 10.4 ┆ 8 ┆ 472.0 ┆ 205 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 10.4 ┆ 8 ┆ 460.0 ┆ 215 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 13.3 ┆ 8 ┆ 350.0 ┆ 245 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 14.3 ┆ 8 ┆ 360.0 ┆ 245 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 14.7 ┆ 8 ┆ 440.0 ┆ 230 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 27.3 ┆ 4 ┆ 79.0 ┆ 66 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
#> │ 30.4 ┆ 4 ┆ 75.7 ┆ 52 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 2 │
#> │ 30.4 ┆ 4 ┆ 95.1 ┆ 113 ┆ … ┆ 1 ┆ 1 ┆ 5 ┆ 2 │
#> │ 32.4 ┆ 4 ┆ 78.7 ┆ 66 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
#> │ 33.9 ┆ 4 ┆ 71.1 ┆ 65 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
#> └──────┴─────┴───────┴─────┴───┴─────┴─────┴──────┴──────┘
### Change the datatype of some columns when reading the file ------------------------
scan_csv_polars(
  dest,
  schema_overrides = list(gear = polars::pl$String, carb = polars::pl$Float32)
) |>
  arrange(mpg) |>
  compute()
#> shape: (32, 11)
#> ┌──────┬─────┬───────┬─────┬───┬─────┬─────┬──────┬──────┐
#> │ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ vs ┆ am ┆ gear ┆ carb │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ i64 ┆ f64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ str ┆ f32 │
#> ╞══════╪═════╪═══════╪═════╪═══╪═════╪═════╪══════╪══════╡
#> │ 10.4 ┆ 8 ┆ 472.0 ┆ 205 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4.0 │
#> │ 10.4 ┆ 8 ┆ 460.0 ┆ 215 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4.0 │
#> │ 13.3 ┆ 8 ┆ 350.0 ┆ 245 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4.0 │
#> │ 14.3 ┆ 8 ┆ 360.0 ┆ 245 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4.0 │
#> │ 14.7 ┆ 8 ┆ 440.0 ┆ 230 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4.0 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 27.3 ┆ 4 ┆ 79.0 ┆ 66 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1.0 │
#> │ 30.4 ┆ 4 ┆ 75.7 ┆ 52 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 2.0 │
#> │ 30.4 ┆ 4 ┆ 95.1 ┆ 113 ┆ … ┆ 1 ┆ 1 ┆ 5 ┆ 2.0 │
#> │ 32.4 ┆ 4 ┆ 78.7 ┆ 66 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1.0 │
#> │ 33.9 ┆ 4 ┆ 71.1 ┆ 65 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1.0 │
#> └──────┴─────┴───────┴─────┴───┴─────┴─────┴──────┴──────┘
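By contrast with schema_overrides, the schema argument expects the complete schema, so no inference takes place. The sketch below illustrates this on a hypothetical two-column file; the file contents and the names path, df, and out are assumptions made for this example, and the conversion with as.data.frame() is only there to check the resulting R types.

```r
library(tidypolars)

# Hypothetical two-column file used only for this illustration
path <- tempfile(fileext = ".csv")
writeLines(c("id,name", "1,a", "2,b"), path)

# Every column is listed, so no data type is inferred from the data
df <- read_csv_polars(
  path,
  schema = list(id = polars::pl$Int32, name = polars::pl$String)
)

# Convert to a base R data frame to check the column types
out <- as.data.frame(df)
str(out)
```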
### Read or scan all CSV files in a folder ------------------------
# Setup: create a folder "output" that contains two CSV files
dest_folder <- withr::local_tempdir(tmpdir = "output")
dir.create(dest_folder, showWarnings = FALSE)
dest1 <- file.path(dest_folder, "output_1.csv")
dest2 <- file.path(dest_folder, "output_2.csv")
write.csv(mtcars[1:16, ], dest1, row.names = FALSE)
write.csv(mtcars[17:32, ], dest2, row.names = FALSE)
list.files(dest_folder)
#> [1] "output_1.csv" "output_2.csv"
# Import all files as a LazyFrame
scan_csv_polars(dest_folder) |>
  arrange(mpg) |>
  compute()
#> shape: (32, 11)
#> ┌──────┬─────┬───────┬─────┬───┬─────┬─────┬──────┬──────┐
#> │ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ vs ┆ am ┆ gear ┆ carb │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ i64 ┆ f64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
#> ╞══════╪═════╪═══════╪═════╪═══╪═════╪═════╪══════╪══════╡
#> │ 10.4 ┆ 8 ┆ 472.0 ┆ 205 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 10.4 ┆ 8 ┆ 460.0 ┆ 215 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 13.3 ┆ 8 ┆ 350.0 ┆ 245 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 14.3 ┆ 8 ┆ 360.0 ┆ 245 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 14.7 ┆ 8 ┆ 440.0 ┆ 230 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 27.3 ┆ 4 ┆ 79.0 ┆ 66 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
#> │ 30.4 ┆ 4 ┆ 75.7 ┆ 52 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 2 │
#> │ 30.4 ┆ 4 ┆ 95.1 ┆ 113 ┆ … ┆ 1 ┆ 1 ┆ 5 ┆ 2 │
#> │ 32.4 ┆ 4 ┆ 78.7 ┆ 66 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
#> │ 33.9 ┆ 4 ┆ 71.1 ┆ 65 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
#> └──────┴─────┴───────┴─────┴───┴─────┴─────┴──────┴──────┘
# Include the file path to know where each row comes from
scan_csv_polars(dest_folder, include_file_paths = "file_path") |>
  arrange(mpg) |>
  compute()
#> shape: (32, 12)
#> ┌──────┬─────┬───────┬─────┬───┬─────┬──────┬──────┬─────────────────────────────────┐
#> │ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ am ┆ gear ┆ carb ┆ file_path │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ i64 ┆ f64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ str │
#> ╞══════╪═════╪═══════╪═════╪═══╪═════╪══════╪══════╪═════════════════════════════════╡
#> │ 10.4 ┆ 8 ┆ 472.0 ┆ 205 ┆ … ┆ 0 ┆ 3 ┆ 4 ┆ output/file46a33f763c83/output… │
#> │ 10.4 ┆ 8 ┆ 460.0 ┆ 215 ┆ … ┆ 0 ┆ 3 ┆ 4 ┆ output/file46a33f763c83/output… │
#> │ 13.3 ┆ 8 ┆ 350.0 ┆ 245 ┆ … ┆ 0 ┆ 3 ┆ 4 ┆ output/file46a33f763c83/output… │
#> │ 14.3 ┆ 8 ┆ 360.0 ┆ 245 ┆ … ┆ 0 ┆ 3 ┆ 4 ┆ output/file46a33f763c83/output… │
#> │ 14.7 ┆ 8 ┆ 440.0 ┆ 230 ┆ … ┆ 0 ┆ 3 ┆ 4 ┆ output/file46a33f763c83/output… │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 27.3 ┆ 4 ┆ 79.0 ┆ 66 ┆ … ┆ 1 ┆ 4 ┆ 1 ┆ output/file46a33f763c83/output… │
#> │ 30.4 ┆ 4 ┆ 75.7 ┆ 52 ┆ … ┆ 1 ┆ 4 ┆ 2 ┆ output/file46a33f763c83/output… │
#> │ 30.4 ┆ 4 ┆ 95.1 ┆ 113 ┆ … ┆ 1 ┆ 5 ┆ 2 ┆ output/file46a33f763c83/output… │
#> │ 32.4 ┆ 4 ┆ 78.7 ┆ 66 ┆ … ┆ 1 ┆ 4 ┆ 1 ┆ output/file46a33f763c83/output… │
#> │ 33.9 ┆ 4 ┆ 71.1 ┆ 65 ┆ … ┆ 1 ┆ 4 ┆ 1 ┆ output/file46a33f763c83/output… │
#> └──────┴─────┴───────┴─────┴───┴─────┴──────┴──────┴─────────────────────────────────┘
### Read or scan all CSV files that match a glob pattern ------------------------
# Setup: create a folder "output_glob" that contains three CSV files,
# two of which follow the pattern "output_XXX.csv"
dest_folder <- withr::local_tempdir(tmpdir = "output_glob")
dir.create(dest_folder, showWarnings = FALSE)
dest1 <- file.path(dest_folder, "output_1.csv")
dest2 <- file.path(dest_folder, "output_2.csv")
dest3 <- file.path(dest_folder, "other_output.csv")
write.csv(mtcars[1:16, ], dest1, row.names = FALSE)
write.csv(mtcars[17:32, ], dest2, row.names = FALSE)
write.csv(iris, dest3, row.names = FALSE)
list.files(dest_folder)
#> [1] "other_output.csv" "output_1.csv" "output_2.csv"
# Import only the files whose names match "output_XXX.csv" as a LazyFrame
scan_csv_polars(paste0(dest_folder, "/output_*.csv")) |>
  arrange(mpg) |>
  compute()
#> shape: (32, 11)
#> ┌──────┬─────┬───────┬─────┬───┬─────┬─────┬──────┬──────┐
#> │ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ vs ┆ am ┆ gear ┆ carb │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ i64 ┆ f64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
#> ╞══════╪═════╪═══════╪═════╪═══╪═════╪═════╪══════╪══════╡
#> │ 10.4 ┆ 8 ┆ 472.0 ┆ 205 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 10.4 ┆ 8 ┆ 460.0 ┆ 215 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 13.3 ┆ 8 ┆ 350.0 ┆ 245 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 14.3 ┆ 8 ┆ 360.0 ┆ 245 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ 14.7 ┆ 8 ┆ 440.0 ┆ 230 ┆ … ┆ 0 ┆ 0 ┆ 3 ┆ 4 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 27.3 ┆ 4 ┆ 79.0 ┆ 66 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
#> │ 30.4 ┆ 4 ┆ 75.7 ┆ 52 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 2 │
#> │ 30.4 ┆ 4 ┆ 95.1 ┆ 113 ┆ … ┆ 1 ┆ 1 ┆ 5 ┆ 2 │
#> │ 32.4 ┆ 4 ┆ 78.7 ┆ 66 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
#> │ 33.9 ┆ 4 ┆ 71.1 ┆ 65 ┆ … ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
#> └──────┴─────┴───────┴─────┴───┴─────┴─────┴──────┴──────┘
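The null_values, row_index_name, and row_index_offset arguments are not covered by the examples above. The following is a minimal sketch under assumed inputs: the file contents, the "missing" placeholder, and the names path, df, and out are hypothetical and chosen only for illustration.

```r
library(tidypolars)

# Hypothetical file where the string "missing" marks an absent value
path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,10", "2,missing", "3,30"), path)

df <- read_csv_polars(
  path,
  null_values = "missing",    # interpret this string as a missing value
  row_index_name = "row_id",  # add a row index column with this name
  row_index_offset = 1        # start the index at 1 instead of 0
)

# Convert to a base R data frame to inspect the result
out <- as.data.frame(df)
out
```

To restrict the replacement to specific columns, null_values can instead be a named vector, as in the c(col1 = "a") form described in the Arguments section.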