This function streams a LazyFrame that is larger than RAM directly to a .csv
file without collecting it in the R session, thus preventing crashes caused
by insufficient memory.
Usage
sink_csv(
  .data,
  path,
  ...,
  include_bom = FALSE,
  include_header = TRUE,
  separator = ",",
  line_terminator = "\n",
  quote = "\"",
  batch_size = 1024,
  datetime_format = NULL,
  date_format = NULL,
  time_format = NULL,
  float_precision = NULL,
  null_values = "",
  quote_style = "necessary",
  maintain_order = TRUE,
  type_coercion = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  no_optimization = FALSE
)
Arguments
- .data
A Polars LazyFrame.
- path
Output file (must be a .csv file).
- ...
Ignored.
- include_bom
Whether to include UTF-8 BOM (byte order mark) in the CSV output.
- include_header
Whether to include header in the CSV output.
- separator
Separate CSV fields with this symbol.
- line_terminator
String used to end each row.
- quote
Byte to use as quoting character.
- batch_size
Number of rows that will be processed per thread.
- datetime_format, date_format, time_format
A format string used to format date and time values. See ?strptime for accepted values. If no format is specified, the default fractional-second precision is inferred from the maximum time unit found in the Datetime columns (if any).
- float_precision
Number of decimal places to write, applied to both Float32 and Float64 datatypes.
- null_values
A string representing null values (defaulting to the empty string).
- quote_style
Determines the quoting strategy used:
  - "necessary" (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, delimiter or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field).
  - "always": This puts quotes around every field.
  - "non_numeric": This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, then quotes will be used even if they aren't strictly necessary.
- maintain_order
Whether to maintain the order in which the data was processed (default is TRUE). Setting this to FALSE will be slightly faster.
- type_coercion
Coerce types such that operations succeed and run on minimal required memory (default is TRUE).
- predicate_pushdown
Applies filters as early as possible at scan level (default is TRUE).
- projection_pushdown
Select only the columns that are needed at the scan level (default is TRUE).
- simplify_expression
Various optimizations, such as constant folding and replacing expensive operations with faster alternatives (default is TRUE).
- slice_pushdown
Only load the required slice from the scan level. Don't materialize sliced outputs (default is TRUE).
- no_optimization
Sets the following optimizations to FALSE: predicate_pushdown, projection_pushdown, slice_pushdown, simplify_expression. Default is FALSE.
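The three quote_style strategies described above can be illustrated with a small base-R sketch. This is not how Polars implements quoting: quote_field() is a hypothetical helper written only for illustration, it handles a single string at a time, it assumes embedded quotes are escaped by doubling them (the common CSV convention), and it uses R's as.numeric() as a stand-in for "parses as a valid float or integer".

```r
# Illustrative re-implementation of the quoting rules for one field.
# quote_field() is hypothetical; it is not part of tidypolars or Polars.
quote_field <- function(x, style = "necessary", separator = ",", quote = "\"") {
  # Double any embedded quote characters, then wrap the field in quotes.
  wrap <- function(s) {
    paste0(quote, gsub(quote, paste0(quote, quote), s, fixed = TRUE), quote)
  }
  needs_quote <- switch(style,
    always = TRUE,
    # Quote every field that does not parse as a number.
    non_numeric = is.na(suppressWarnings(as.numeric(x))),
    # Quote only fields containing a quote, the separator, or a line break.
    necessary = grepl(quote, x, fixed = TRUE) ||
      grepl(separator, x, fixed = TRUE) ||
      grepl("[\r\n]", x),
    stop("unknown quote style: ", style)
  )
  if (needs_quote) wrap(x) else x
}

quote_field("abc")                  # left unquoted
quote_field("a,b")                  # quoted: contains the separator
quote_field("abc", "always")        # quoted: every field is quoted
quote_field("1.5", "non_numeric")   # left unquoted: parses as a number
quote_field("abc", "non_numeric")   # quoted: not a number
```

The same field can therefore be written differently depending on the strategy, which matters when the output is read by a parser that is strict about quoting.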
Examples
if (FALSE) { # \dontrun{
# This is an example workflow where sink_csv() is not very useful because
# the data would fit in memory. It simply is an example of using it at the
# end of a piped workflow.
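# Load dplyr (for the verbs) and tidypolars (for scan_csv_polars() and sink_csv()):
library(dplyr)
library(tidypolars)
# Create files for the CSV input and output: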
file_csv <- tempfile(fileext = ".csv")
file_csv2 <- tempfile(fileext = ".csv")
# Write some data in a CSV file
fake_data <- do.call("rbind", rep(list(mtcars), 1000))
write.csv(fake_data, file = file_csv, row.names = FALSE)
# In a new R session, we could read this file as a LazyFrame, do some operations,
# and write it to another CSV file without ever collecting it in the R session:
scan_csv_polars(file_csv) |>
  filter(cyl %in% c(4, 6), mpg > 22) |>
  mutate(
    hp_gear_ratio = hp / gear
  ) |>
  sink_csv(path = file_csv2)
} # }
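As a further sketch, the formatting arguments can be combined in a single call. The example below assumes tidypolars is installed and uses the argument names exactly as listed in the Usage section; the tiny input data frame and the variable names in_file and out_file are made up for illustration. No expected output is shown, since the exact rendering of quoted headers and values is best checked by inspecting the file.

```r
library(tidypolars)

# Small input file with one missing value (written as an empty field).
in_file <- tempfile(fileext = ".csv")
out_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:3, label = c("a", NA, "c")),
  file = in_file, row.names = FALSE, na = ""
)

# Stream to a semicolon-separated file, quoting every field and
# writing missing values as "NA" instead of the empty string.
scan_csv_polars(in_file) |>
  sink_csv(
    path = out_file,
    separator = ";",
    quote_style = "always",
    null_values = "NA"
  )

# Inspect the raw output to see the effect of the options.
readLines(out_file)
```

Reading the result back with readLines() rather than a CSV parser makes the separator, quoting, and null representation directly visible.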