ℹ️ This is the R package “tidypolars”. The Python one is here: markfairbanks/tidypolars
Overview
tidypolars provides a polars backend for the tidyverse. The aim of tidypolars is to enable users to keep their existing tidyverse code while using polars in the background to benefit from large performance gains. The only thing that needs to change is the way data is imported in the R session.
See the “Getting started” vignette for a gentle introduction to tidypolars.
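For example, here is a minimal sketch of that workflow (illustrative only; `as_polars_df()` comes from the polars package, and the dplyr verbs used below are among those translated by tidypolars):

library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

# The only change: import the data as a polars DataFrame...
iris_pl <- as_polars_df(iris)

# ...the tidyverse code itself stays the same
iris_pl |>
  select(starts_with("Sepal")) |>
  filter(Sepal.Length > 5) |>
  mutate(sepal_area = Sepal.Length * Sepal.Width)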
Since most of the work is rewriting tidyverse code into polars syntax, tidypolars and polars have very similar performance.
A small benchmark

The main purpose of this benchmark is to show that polars and tidypolars perform similarly and to give an idea of that performance. For more thorough, representative benchmarks of polars, take a look at the DuckDB benchmarks instead.
library(collapse, warn.conflicts = FALSE)
#> collapse 2.1.3, see ?`collapse-package` or ?`collapse-documentation`
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
library(polars)
library(tidypolars)
large_iris <- data.table::rbindlist(rep(list(iris), 100000))
large_iris_pl <- as_polars_lf(large_iris)
large_iris_dt <- lazy_dt(large_iris)
format(nrow(large_iris), big.mark = ",")
#> [1] "15,000,000"
bench::mark(
  polars = {
    large_iris_pl$
      select(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))$
      with_columns(
        pl$when(
          (pl$col("Petal.Length") / pl$col("Petal.Width") > 3)
        )$then(pl$lit("long"))$
          otherwise(pl$lit("large"))$
          alias("petal_type")
      )$
      filter(pl$col("Sepal.Length")$is_between(4.5, 5.5))$
      collect()
  },
  tidypolars = {
    large_iris_pl |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      filter(between(Sepal.Length, 4.5, 5.5)) |>
      compute()
  },
  dplyr = {
    large_iris |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      filter(between(Sepal.Length, 4.5, 5.5))
  },
  dtplyr = {
    large_iris_dt |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      filter(between(Sepal.Length, 4.5, 5.5)) |>
      as.data.frame()
  },
  collapse = {
    large_iris |>
      fselect(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")) |>
      fmutate(
        petal_type = data.table::fifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      fsubset(Sepal.Length >= 4.5 & Sepal.Length <= 5.5)
  },
  check = FALSE,
  iterations = 40
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 5 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars     108.38ms 149.19ms     6.32     2.13MB    0.158
#> 2 tidypolars 118.42ms 247.73ms     3.61     1.22MB    0.452
#> 3 dplyr         2.94s    3.75s     0.269    1.79GB    0.658
#> 4 dtplyr     653.02ms 729.33ms     1.36     1.72GB    2.79
#> 5 collapse   269.73ms 368.59ms     2.69   745.96MB    2.63
# NOTE: do NOT take the "mem_alloc" results into account.
# `bench::mark()` doesn't report accurate memory usage for packages calling
# Rust code.

If you want to do your own benchmarks, please take a look at “How to benchmark tidypolars” first for some best practices.
Installation
tidypolars is built on polars, which is not available on CRAN. This means that tidypolars also can’t be on CRAN. However, you can install it from R-universe.
Sys.setenv(NOT_CRAN = "true")
install.packages("tidypolars", repos = c("https://community.r-multiverse.org", 'https://cloud.r-project.org'))The development version contains the latest improvements and bug fixes:
# install.packages("remotes")
remotes::install_github("etiennebacher/tidypolars")

Related work
Several packages have been developed to handle large data more efficiently while keeping the tidyverse syntax:
- arrow: one of the closest alternatives to tidypolars. Also has lazy evaluation and query optimizations, and uses Acero in the background to translate dplyr code and perform computations.
  - How is tidypolars different?: Polars (and therefore tidypolars) uses an unofficial Arrow memory specification. All operations are implemented (and optimized) from scratch, meaning that query optimizations can be very different from Acero. The list of R functions that are translated to the Arrow engine may also differ.
- collapse: has very fast operations but still needs to import all data into memory, which prevents using larger-than-RAM datasets.
  - How is tidypolars different?: tidypolars provides lazy evaluation that is more memory-efficient since it doesn't import all data into memory. It also provides a streaming engine to handle larger-than-RAM datasets.
- dbplyr: allows using dplyr for data stored in a relational database by translating R code to SQL queries. The performance will therefore depend on the SQL backend used.
  - How is tidypolars different?: tidypolars doesn't translate R code to SQL but directly evaluates it with Polars.
- dtplyr: uses data.table in the background for better performance but needs to import all data into memory, which prevents using larger-than-RAM datasets.
  - How is tidypolars different?: same as for collapse.
- duckplyr: one of the closest alternatives to tidypolars. Uses DuckDB in the background, also provides lazy evaluation and query optimizations. Can perform operations directly on R data.frames.
  - How is tidypolars different?: similar to arrow, the list of R functions that are optimized in Polars or DuckDB isn't identical, so the use case will determine which tool runs the fastest. duckplyr also relies on a fallback mechanism that will run the code in “standard R” if the function cannot be translated. tidypolars is more conservative and will error in this case, avoiding importing data that may crash the session because of its size.
- sparklyr: uses Apache Spark in the background and requires installing Spark. Can perform distributed processing.
  - How is tidypolars different?: tidypolars doesn't require installing another tool and focuses on processing data on a single machine, not on distributed processing.
Therefore, if you need to handle data that is larger than memory, you have three options: arrow, duckplyr, and tidypolars. The best one will probably depend on the use case and on your constraints (e.g. tidypolars is available via R-universe but isn’t on CRAN). Regarding performance, one should refer to the DuckDB benchmarks to compare tools. Keep in mind that accurately benchmarking data processing tools is hard; those benchmarks give useful information but don’t necessarily apply to all contexts.
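As an illustration of the larger-than-RAM workflow with tidypolars, here is a minimal, hypothetical sketch: the file path and column names are placeholders, `pl$scan_parquet()` from the polars package builds a lazy query without reading the file, and `compute()` only materializes the result of the query.

library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical file and columns: lazily scan a parquet file that may not fit in RAM
sales_lf <- pl$scan_parquet("path/to/large_sales.parquet")

sales_lf |>
  filter(year == 2024) |>
  group_by(region) |>
  summarize(total = sum(amount)) |>
  compute()  # only the small aggregated result ends up in memory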
Contributing
Did you find bugs or errors in the documentation? Do you want tidypolars to support more functions?
Take a look at the contributing guide for instructions on bug reports and pull requests.
Acknowledgements
The website theme was heavily inspired by Matthew Kay’s ggblend package: https://mjskay.github.io/ggblend/.
The package hex logo was created by Hubert Hałun as part of the Appsilon Hex Contest.
