Several blog posts and GitHub repos try to benchmark packages for
dataframe manipulation (data.table, dplyr, duckplyr, etc.). When it comes
to benchmarking tidypolars, I have seen several implementation mistakes
that make the results quite unreliable. This is not to say that
tidypolars will always be faster than alternatives. In
fact, there is rarely one tool that is better than all the others in all
aspects of dataframe manipulation. The goal of this vignette is to give
some advice on how best to benchmark tidypolars.
Summary: instead of this code:
my_function <- function(dat) {
  my_polars_data <- as_polars_df(dat)
  my_polars_data |>
    <some slow operation>
}
bench::mark(
  my_function(my_r_data_frame)
)
use this code:
my_polars_data <- as_polars_lf(my_r_data_frame)
my_function <- function(dat) {
  dat |>
    <some slow operation> |>
    compute() # or compute(engine = "streaming")
}
bench::mark(
  my_function(my_polars_data)
)
Do not include as_polars_df() or as_polars_lf() in the timing
Some benchmarks do something like the following:
my_function <- function(dat) {
  dat <- as_polars_df(dat)
  dat |>
    <some slow operation>
}
bench::mark(
  my_function(my_r_data_frame)
)
The issue with this approach is that as_polars_df() converts the R data.frame to a Polars DataFrame, and this conversion takes time that is then counted as part of the benchmark. As highlighted in the “Getting started” vignette, as_polars_df() and as_polars_lf() are convenience functions for demo and testing purposes. The recommended way to use tidypolars is by using the dedicated readers (scan_parquet_polars(), read_parquet_polars(), etc.) to import the data directly as a Polars DataFrame or LazyFrame.
Using as_polars_df() and as_polars_lf() is fine to get the data ready to be benchmarked, but those operations should not be included in the timing.
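As a minimal sketch of this separation, the conversion below happens once, before bench::mark() is called, so that only the query itself is timed. The simulated data, the slow_query() helper, and the choice of filter() followed by summarize() are illustrative assumptions standing in for "some slow operation"; they are not taken from any particular benchmark.

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

# Hypothetical test data, created as a regular R data.frame
set.seed(123)
my_r_data_frame <- data.frame(
  g = sample(letters, 1e5, replace = TRUE),
  x = rnorm(1e5)
)

# The R -> Polars conversion happens once, outside of the timed expression
my_polars_data <- as_polars_df(my_r_data_frame)

# Hypothetical query standing in for "some slow operation"
slow_query <- function(dat) {
  dat |>
    filter(x > 0) |>
    group_by(g) |>
    summarize(mean_x = mean(x))
}

# Only the query is timed, not the conversion
bench::mark(
  slow_query(my_polars_data)
)
```

If the conversion cost itself is of interest, it can be measured in a separate bench::mark() call rather than mixed into the query timing.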
Use lazy execution when possible
Polars provides DataFrames and LazyFrames. Operations on DataFrames are executed in “eager mode”, meaning that there is no optimization happening behind the scenes, for instance to efficiently reorder operations. In real-life workflows, it is strongly recommended to use LazyFrames since they allow for a large number of optimizations.
In benchmarks or demos, you should therefore prefer as_polars_lf() over as_polars_df(). Using LazyFrames means that you also need to collect results to trigger computation. This can be done using compute() (to return a Polars DataFrame), collect() (to return an R data.frame), or as_tibble() (to return a tibble).
Depending on the objective of the benchmark, each of the three options can be valid. compute() is faster because it avoids the extra step of converting a Polars DataFrame to R. However, the output of compute() cannot be passed directly to ggplot2, for example.
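To make the difference between the three options concrete, here is a small sketch that times each of them on the same lazy query. The simulated data and the lazy_query object are assumptions made for illustration. check = FALSE is needed because the three expressions return objects of different classes, which bench::mark() would otherwise flag as unequal results.

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

# Hypothetical data, converted to a LazyFrame before any timing starts
set.seed(123)
my_polars_data <- as_polars_lf(data.frame(
  g = sample(letters, 1e5, replace = TRUE),
  x = rnorm(1e5)
))

# Hypothetical lazy query: nothing is computed until results are collected
lazy_query <- my_polars_data |>
  group_by(g) |>
  summarize(mean_x = mean(x))

bench::mark(
  compute   = compute(lazy_query),   # returns a Polars DataFrame
  collect   = collect(lazy_query),   # returns an R data.frame
  as_tibble = as_tibble(lazy_query), # returns a tibble
  check = FALSE # outputs have different classes, so skip the equality check
)
```

Which of the three rows is the relevant one depends on what the benchmarked workflow does next with the result.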
In summary, instead of doing:
my_polars_data <- as_polars_df(my_r_data_frame)
my_function <- function(dat) {
  dat |>
    <some slow operation>
}
bench::mark(
  my_function(my_polars_data)
)
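you should do:
my_polars_data <- as_polars_lf(my_r_data_frame)
my_function <- function(dat) {
  dat |>
    <some slow operation> |>
    compute()
}
bench::mark(
  my_function(my_polars_data)
)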
Use streaming mode when possible
In addition to lazy execution, Polars comes with a streaming engine that is able to perform operations on larger-than-RAM data. It is also faster than the default engine in many cases, not just with larger-than-RAM data.
Note: at the time of writing (August 2025), the streaming engine is not yet used by default, but it should become the default in the next few months.
Using it is extremely simple: pass engine = "streaming" to compute(), collect() or as_tibble().
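As a rough sketch, the two engines can be compared on the same lazy query as shown below. The simulated data and the query are illustrative assumptions; on a small example like this one the difference may be negligible or even reversed, so the comparison is only meaningful on data and queries that resemble the real workload.

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

# Hypothetical data, converted to a LazyFrame before any timing starts
set.seed(123)
my_polars_data <- as_polars_lf(data.frame(
  g = sample(letters, 1e5, replace = TRUE),
  x = rnorm(1e5)
))

# Hypothetical lazy query
lazy_query <- my_polars_data |>
  filter(x > 0) |>
  group_by(g) |>
  summarize(mean_x = mean(x))

bench::mark(
  default   = compute(lazy_query),
  streaming = compute(lazy_query, engine = "streaming"),
  check = FALSE # row order after group_by() may differ between runs
)
```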
Do not try to run tidypolars in parallel
Some benchmarks take advantage of tools such as future or mirai to run code in parallel. However, tidypolars shouldn’t be used with such tools. All polars code that is used internally is already optimized to run on all cores and isn’t guaranteed to play well with other parallel frameworks.
In some cases, you may want to use those frameworks on Polars objects, but this should be the exception rather than the rule.