
read_csv_polars() imports the data as a Polars DataFrame.

scan_csv_polars() imports the data as a Polars LazyFrame.

Usage

read_csv_polars(
  source,
  ...,
  has_header = TRUE,
  separator = ",",
  comment_prefix = NULL,
  quote_char = "\"",
  skip_rows = 0,
  schema = NULL,
  schema_overrides = NULL,
  null_values = NULL,
  ignore_errors = FALSE,
  cache = FALSE,
  infer_schema_length = 100,
  n_rows = NULL,
  encoding = "utf8",
  low_memory = FALSE,
  rechunk = TRUE,
  skip_rows_after_header = 0,
  row_index_name = NULL,
  row_index_offset = 0,
  try_parse_dates = FALSE,
  eol_char = "\n",
  raise_if_empty = TRUE,
  truncate_ragged_lines = FALSE,
  include_file_paths = NULL,
  dtypes,
  reuse_downloaded
)

scan_csv_polars(
  source,
  ...,
  has_header = TRUE,
  separator = ",",
  comment_prefix = NULL,
  quote_char = "\"",
  skip_rows = 0,
  schema = NULL,
  schema_overrides = NULL,
  null_values = NULL,
  ignore_errors = FALSE,
  cache = FALSE,
  infer_schema_length = 100,
  n_rows = NULL,
  encoding = "utf8",
  low_memory = FALSE,
  rechunk = TRUE,
  skip_rows_after_header = 0,
  row_index_name = NULL,
  row_index_offset = 0,
  try_parse_dates = FALSE,
  eol_char = "\n",
  raise_if_empty = TRUE,
  truncate_ragged_lines = FALSE,
  include_file_paths = NULL,
  dtypes,
  reuse_downloaded
)

Arguments

source

Path(s) to a file or directory. When needing to authenticate for scanning cloud locations, see the storage_options parameter.

...

These dots are for future extensions and must be empty.

has_header

Indicate whether the first row of the dataset is a header. If FALSE, column names will be autogenerated in the following format: "column_x", with x being an enumeration over every column in the dataset starting at 1.

separator

Single byte character to use as separator in the file.

comment_prefix

A string, which can be up to 5 symbols in length, used to indicate the start of a comment line. For instance, it can be set to # or //.

quote_char

Single byte character used for quoting. Set to NULL to turn off special handling and escaping of quotes.

skip_rows

Start reading after a particular number of rows. The header will be parsed at this offset.

schema

Provide the schema. This means that polars doesn't do schema inference. This argument expects the complete schema, whereas schema_overrides can be used to partially overwrite a schema. This must be a list. Names of list elements are used to match to inferred columns.

schema_overrides

Overwrite dtypes during inference. This must be a list. Names of list elements are used to match to inferred columns.

null_values

Character vector specifying the values to interpret as NA values. It can be named, in which case names specify the columns in which this replacement must be made (e.g. c(col1 = "a")).

ignore_errors

Keep reading the file even if some lines yield errors. You can also use infer_schema = FALSE to read all columns as pl$String and check which values might cause an issue.

cache

Cache the result after reading.

infer_schema_length

The maximum number of rows to scan for schema inference. If NULL, the full data may be scanned (this is slow). Set infer_schema = FALSE to read all columns as pl$String.

n_rows

Stop reading from the source after reading n_rows.

encoding

Either "utf8" or "utf8-lossy". Lossy means that invalid UTF8 values are replaced with "?" characters.

low_memory

Reduce memory pressure at the expense of performance.

rechunk

Reallocate to contiguous memory when all chunks/files are parsed.

skip_rows_after_header

Number of rows to skip after the header is parsed.

row_index_name

If not NULL, this will insert a row index column with the given name.

row_index_offset

Offset to start the row index column (only used if the name is set by row_index_name).

try_parse_dates

Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl$String.

eol_char

Single byte end of line character (default: "\n"). When encountering a file with Windows line endings ("\r\n"), one can go with the default "\n". The extra "\r" will be removed when processed.

raise_if_empty

If FALSE, parsing an empty file returns an empty DataFrame or LazyFrame.

truncate_ragged_lines

Truncate lines that are longer than the schema.

include_file_paths

Include the path of the source file(s) as a column with this name.

dtypes

[Deprecated] Use schema_overrides instead.

reuse_downloaded

[Deprecated] No replacement.

Value

The scan function returns a LazyFrame, the read function returns a DataFrame.

Examples

### Read or scan a single CSV file ------------------------

# Setup: create a CSV file
dest <- withr::local_tempfile(fileext = ".csv")
write.csv(mtcars, dest, row.names = FALSE)

# Import this file as a DataFrame for eager evaluation
read_csv_polars(dest) |>
  arrange(mpg)
#> shape: (32, 11)
#> ┌──────┬─────┬───────┬─────┬───┬─────┬─────┬──────┬──────┐
#> │ mpg  ┆ cyl ┆ disp  ┆ hp  ┆ … ┆ vs  ┆ am  ┆ gear ┆ carb │
#> │ ---  ┆ --- ┆ ---   ┆ --- ┆   ┆ --- ┆ --- ┆ ---  ┆ ---  │
#> │ f64  ┆ i64 ┆ f64   ┆ i64 ┆   ┆ i64 ┆ i64 ┆ i64  ┆ i64  │
#> ╞══════╪═════╪═══════╪═════╪═══╪═════╪═════╪══════╪══════╡
#> │ 10.4 ┆ 8   ┆ 472.0 ┆ 205 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 10.4 ┆ 8   ┆ 460.0 ┆ 215 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 13.3 ┆ 8   ┆ 350.0 ┆ 245 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 14.3 ┆ 8   ┆ 360.0 ┆ 245 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 14.7 ┆ 8   ┆ 440.0 ┆ 230 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ …    ┆ …   ┆ …     ┆ …   ┆ … ┆ …   ┆ …   ┆ …    ┆ …    │
#> │ 27.3 ┆ 4   ┆ 79.0  ┆ 66  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
#> │ 30.4 ┆ 4   ┆ 75.7  ┆ 52  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 2    │
#> │ 30.4 ┆ 4   ┆ 95.1  ┆ 113 ┆ … ┆ 1   ┆ 1   ┆ 5    ┆ 2    │
#> │ 32.4 ┆ 4   ┆ 78.7  ┆ 66  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
#> │ 33.9 ┆ 4   ┆ 71.1  ┆ 65  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
#> └──────┴─────┴───────┴─────┴───┴─────┴─────┴──────┴──────┘

# Import this file as a LazyFrame for lazy evaluation
scan_csv_polars(dest) |>
  arrange(mpg) |>
  compute()
#> shape: (32, 11)
#> ┌──────┬─────┬───────┬─────┬───┬─────┬─────┬──────┬──────┐
#> │ mpg  ┆ cyl ┆ disp  ┆ hp  ┆ … ┆ vs  ┆ am  ┆ gear ┆ carb │
#> │ ---  ┆ --- ┆ ---   ┆ --- ┆   ┆ --- ┆ --- ┆ ---  ┆ ---  │
#> │ f64  ┆ i64 ┆ f64   ┆ i64 ┆   ┆ i64 ┆ i64 ┆ i64  ┆ i64  │
#> ╞══════╪═════╪═══════╪═════╪═══╪═════╪═════╪══════╪══════╡
#> │ 10.4 ┆ 8   ┆ 472.0 ┆ 205 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 10.4 ┆ 8   ┆ 460.0 ┆ 215 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 13.3 ┆ 8   ┆ 350.0 ┆ 245 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 14.3 ┆ 8   ┆ 360.0 ┆ 245 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 14.7 ┆ 8   ┆ 440.0 ┆ 230 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ …    ┆ …   ┆ …     ┆ …   ┆ … ┆ …   ┆ …   ┆ …    ┆ …    │
#> │ 27.3 ┆ 4   ┆ 79.0  ┆ 66  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
#> │ 30.4 ┆ 4   ┆ 75.7  ┆ 52  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 2    │
#> │ 30.4 ┆ 4   ┆ 95.1  ┆ 113 ┆ … ┆ 1   ┆ 1   ┆ 5    ┆ 2    │
#> │ 32.4 ┆ 4   ┆ 78.7  ┆ 66  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
#> │ 33.9 ┆ 4   ┆ 71.1  ┆ 65  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
#> └──────┴─────┴───────┴─────┴───┴─────┴─────┴──────┴──────┘


### Change the datatype of some columns when reading the file ------------------------

scan_csv_polars(
  dest,
  schema_overrides = list(gear = polars::pl$String, carb = polars::pl$Float32)
) |>
  arrange(mpg) |>
  compute()
#> shape: (32, 11)
#> ┌──────┬─────┬───────┬─────┬───┬─────┬─────┬──────┬──────┐
#> │ mpg  ┆ cyl ┆ disp  ┆ hp  ┆ … ┆ vs  ┆ am  ┆ gear ┆ carb │
#> │ ---  ┆ --- ┆ ---   ┆ --- ┆   ┆ --- ┆ --- ┆ ---  ┆ ---  │
#> │ f64  ┆ i64 ┆ f64   ┆ i64 ┆   ┆ i64 ┆ i64 ┆ str  ┆ f32  │
#> ╞══════╪═════╪═══════╪═════╪═══╪═════╪═════╪══════╪══════╡
#> │ 10.4 ┆ 8   ┆ 472.0 ┆ 205 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4.0  │
#> │ 10.4 ┆ 8   ┆ 460.0 ┆ 215 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4.0  │
#> │ 13.3 ┆ 8   ┆ 350.0 ┆ 245 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4.0  │
#> │ 14.3 ┆ 8   ┆ 360.0 ┆ 245 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4.0  │
#> │ 14.7 ┆ 8   ┆ 440.0 ┆ 230 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4.0  │
#> │ …    ┆ …   ┆ …     ┆ …   ┆ … ┆ …   ┆ …   ┆ …    ┆ …    │
#> │ 27.3 ┆ 4   ┆ 79.0  ┆ 66  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1.0  │
#> │ 30.4 ┆ 4   ┆ 75.7  ┆ 52  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 2.0  │
#> │ 30.4 ┆ 4   ┆ 95.1  ┆ 113 ┆ … ┆ 1   ┆ 1   ┆ 5    ┆ 2.0  │
#> │ 32.4 ┆ 4   ┆ 78.7  ┆ 66  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1.0  │
#> │ 33.9 ┆ 4   ┆ 71.1  ┆ 65  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1.0  │
#> └──────┴─────┴───────┴─────┴───┴─────┴─────┴──────┴──────┘
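
### Read only the first rows and add a row index ------------------------

# A minimal sketch, not part of the original examples: n_rows limits how many
# rows are read, and row_index_name / row_index_offset add a row index column.
# The file name "dest_rows" and the column name "row_id" are arbitrary choices
# made for this illustration.
dest_rows <- withr::local_tempfile(fileext = ".csv")
write.csv(mtcars, dest_rows, row.names = FALSE)

first_rows <- read_csv_polars(
  dest_rows,
  n_rows = 5,
  row_index_name = "row_id",
  row_index_offset = 1
)
first_rows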


### Read or scan all CSV files in a folder ------------------------

# Setup: create a folder "output" that contains two CSV files
dest_folder <- withr::local_tempdir(tmpdir = "output")
dir.create(dest_folder, showWarnings = FALSE)
dest1 <- file.path(dest_folder, "output_1.csv")
dest2 <- file.path(dest_folder, "output_2.csv")

write.csv(mtcars[1:16, ], dest1, row.names = FALSE)
write.csv(mtcars[17:32, ], dest2, row.names = FALSE)
list.files(dest_folder)
#> [1] "output_1.csv" "output_2.csv"

# Import all files as a LazyFrame
scan_csv_polars(dest_folder) |>
  arrange(mpg) |>
  compute()
#> shape: (32, 11)
#> ┌──────┬─────┬───────┬─────┬───┬─────┬─────┬──────┬──────┐
#> │ mpg  ┆ cyl ┆ disp  ┆ hp  ┆ … ┆ vs  ┆ am  ┆ gear ┆ carb │
#> │ ---  ┆ --- ┆ ---   ┆ --- ┆   ┆ --- ┆ --- ┆ ---  ┆ ---  │
#> │ f64  ┆ i64 ┆ f64   ┆ i64 ┆   ┆ i64 ┆ i64 ┆ i64  ┆ i64  │
#> ╞══════╪═════╪═══════╪═════╪═══╪═════╪═════╪══════╪══════╡
#> │ 10.4 ┆ 8   ┆ 472.0 ┆ 205 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 10.4 ┆ 8   ┆ 460.0 ┆ 215 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 13.3 ┆ 8   ┆ 350.0 ┆ 245 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 14.3 ┆ 8   ┆ 360.0 ┆ 245 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 14.7 ┆ 8   ┆ 440.0 ┆ 230 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ …    ┆ …   ┆ …     ┆ …   ┆ … ┆ …   ┆ …   ┆ …    ┆ …    │
#> │ 27.3 ┆ 4   ┆ 79.0  ┆ 66  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
#> │ 30.4 ┆ 4   ┆ 75.7  ┆ 52  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 2    │
#> │ 30.4 ┆ 4   ┆ 95.1  ┆ 113 ┆ … ┆ 1   ┆ 1   ┆ 5    ┆ 2    │
#> │ 32.4 ┆ 4   ┆ 78.7  ┆ 66  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
#> │ 33.9 ┆ 4   ┆ 71.1  ┆ 65  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
#> └──────┴─────┴───────┴─────┴───┴─────┴─────┴──────┴──────┘

# Include the file path to know where each row comes from
scan_csv_polars(dest_folder, include_file_paths = "file_path") |>
  arrange(mpg) |>
  compute()
#> shape: (32, 12)
#> ┌──────┬─────┬───────┬─────┬───┬─────┬──────┬──────┬─────────────────────────────────┐
#> │ mpg  ┆ cyl ┆ disp  ┆ hp  ┆ … ┆ am  ┆ gear ┆ carb ┆ file_path                       │
#> │ ---  ┆ --- ┆ ---   ┆ --- ┆   ┆ --- ┆ ---  ┆ ---  ┆ ---                             │
#> │ f64  ┆ i64 ┆ f64   ┆ i64 ┆   ┆ i64 ┆ i64  ┆ i64  ┆ str                             │
#> ╞══════╪═════╪═══════╪═════╪═══╪═════╪══════╪══════╪═════════════════════════════════╡
#> │ 10.4 ┆ 8   ┆ 472.0 ┆ 205 ┆ … ┆ 0   ┆ 3    ┆ 4    ┆ output/file46a33f763c83/output… │
#> │ 10.4 ┆ 8   ┆ 460.0 ┆ 215 ┆ … ┆ 0   ┆ 3    ┆ 4    ┆ output/file46a33f763c83/output… │
#> │ 13.3 ┆ 8   ┆ 350.0 ┆ 245 ┆ … ┆ 0   ┆ 3    ┆ 4    ┆ output/file46a33f763c83/output… │
#> │ 14.3 ┆ 8   ┆ 360.0 ┆ 245 ┆ … ┆ 0   ┆ 3    ┆ 4    ┆ output/file46a33f763c83/output… │
#> │ 14.7 ┆ 8   ┆ 440.0 ┆ 230 ┆ … ┆ 0   ┆ 3    ┆ 4    ┆ output/file46a33f763c83/output… │
#> │ …    ┆ …   ┆ …     ┆ …   ┆ … ┆ …   ┆ …    ┆ …    ┆ …                               │
#> │ 27.3 ┆ 4   ┆ 79.0  ┆ 66  ┆ … ┆ 1   ┆ 4    ┆ 1    ┆ output/file46a33f763c83/output… │
#> │ 30.4 ┆ 4   ┆ 75.7  ┆ 52  ┆ … ┆ 1   ┆ 4    ┆ 2    ┆ output/file46a33f763c83/output… │
#> │ 30.4 ┆ 4   ┆ 95.1  ┆ 113 ┆ … ┆ 1   ┆ 5    ┆ 2    ┆ output/file46a33f763c83/output… │
#> │ 32.4 ┆ 4   ┆ 78.7  ┆ 66  ┆ … ┆ 1   ┆ 4    ┆ 1    ┆ output/file46a33f763c83/output… │
#> │ 33.9 ┆ 4   ┆ 71.1  ┆ 65  ┆ … ┆ 1   ┆ 4    ┆ 1    ┆ output/file46a33f763c83/output… │
#> └──────┴─────┴───────┴─────┴───┴─────┴──────┴──────┴─────────────────────────────────┘


### Read or scan all CSV files that match a glob pattern ------------------------

# Setup: create a folder "output_glob" that contains three CSV files,
# two of which follow the pattern "output_XXX.csv"
dest_folder <- withr::local_tempdir(tmpdir = "output_glob")
dir.create(dest_folder, showWarnings = FALSE)
dest1 <- file.path(dest_folder, "output_1.csv")
dest2 <- file.path(dest_folder, "output_2.csv")
dest3 <- file.path(dest_folder, "other_output.csv")

write.csv(mtcars[1:16, ], dest1, row.names = FALSE)
write.csv(mtcars[17:32, ], dest2, row.names = FALSE)
write.csv(iris, dest3, row.names = FALSE)
list.files(dest_folder)
#> [1] "other_output.csv" "output_1.csv"     "output_2.csv"    

# Import only the files whose names match "output_XXX.csv" as a LazyFrame
scan_csv_polars(paste0(dest_folder, "/output_*.csv")) |>
  arrange(mpg) |>
  compute()
#> shape: (32, 11)
#> ┌──────┬─────┬───────┬─────┬───┬─────┬─────┬──────┬──────┐
#> │ mpg  ┆ cyl ┆ disp  ┆ hp  ┆ … ┆ vs  ┆ am  ┆ gear ┆ carb │
#> │ ---  ┆ --- ┆ ---   ┆ --- ┆   ┆ --- ┆ --- ┆ ---  ┆ ---  │
#> │ f64  ┆ i64 ┆ f64   ┆ i64 ┆   ┆ i64 ┆ i64 ┆ i64  ┆ i64  │
#> ╞══════╪═════╪═══════╪═════╪═══╪═════╪═════╪══════╪══════╡
#> │ 10.4 ┆ 8   ┆ 472.0 ┆ 205 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 10.4 ┆ 8   ┆ 460.0 ┆ 215 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 13.3 ┆ 8   ┆ 350.0 ┆ 245 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 14.3 ┆ 8   ┆ 360.0 ┆ 245 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ 14.7 ┆ 8   ┆ 440.0 ┆ 230 ┆ … ┆ 0   ┆ 0   ┆ 3    ┆ 4    │
#> │ …    ┆ …   ┆ …     ┆ …   ┆ … ┆ …   ┆ …   ┆ …    ┆ …    │
#> │ 27.3 ┆ 4   ┆ 79.0  ┆ 66  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
#> │ 30.4 ┆ 4   ┆ 75.7  ┆ 52  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 2    │
#> │ 30.4 ┆ 4   ┆ 95.1  ┆ 113 ┆ … ┆ 1   ┆ 1   ┆ 5    ┆ 2    │
#> │ 32.4 ┆ 4   ┆ 78.7  ┆ 66  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
#> │ 33.9 ┆ 4   ┆ 71.1  ┆ 65  ┆ … ┆ 1   ┆ 1   ┆ 4    ┆ 1    │
#> └──────┴─────┴───────┴─────┴───┴─────┴─────┴──────┴──────┘
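

### Skip comment lines and declare custom missing values ------------------------

# A minimal sketch, assuming a small hand-written file whose content is made up
# for this illustration: lines starting with "#" are comments, and the string
# "missing" marks a missing value.
dest_small <- withr::local_tempfile(fileext = ".csv")
writeLines(
  c(
    "id,score",
    "# this line is a comment and should be skipped",
    "1,10.5",
    "2,missing",
    "3,7.25"
  ),
  dest_small
)

# comment_prefix drops the comment line; null_values turns "missing" into a
# null value instead of forcing the column to be read as a string
small <- read_csv_polars(
  dest_small,
  comment_prefix = "#",
  null_values = "missing"
)
small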