More Efficient Tidyverse Code, Using Polars in the Background • tidypolars

ℹ️ This is the R package “tidypolars”. The Python one is here: markfairbanks/tidypolars

Overview

tidypolars provides a polars backend for the tidyverse. The aim of tidypolars is to enable users to keep their existing tidyverse code while using polars in the background to benefit from large performance gains. The only thing that needs to change is the way data is imported in the R session.
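
As a rough illustration of the point above, the sketch below converts a built-in dataset to a polars DataFrame and then applies ordinary dplyr verbs to it. It assumes that dplyr, polars, and tidypolars are installed; the dataset (`mtcars`), the conversion factor, and the column name `kpl` are arbitrary choices made for this example only.

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

# The only step that differs from a regular dplyr workflow:
# the data is held as a polars DataFrame instead of an R data.frame.
mtcars_pl <- as_polars_df(mtcars)

# The rest is standard tidyverse code, evaluated by polars.
result <- mtcars_pl |>
  filter(cyl > 4) |>
  mutate(kpl = mpg * 0.425) |> # approximate miles/gallon to km/litre
  summarize(mean_kpl = mean(kpl), .by = cyl) |>
  arrange(cyl) |>
  as.data.frame() # convert back to an R data.frame at the end
```

The final `as.data.frame()` call is only needed when the result has to be handed to code that expects a regular R data.frame.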

See the “Getting started” vignette for a gentle introduction to tidypolars.

Since tidypolars mostly translates tidyverse code into polars syntax, its performance is very close to that of polars itself.

Small benchmark

The main purpose of this benchmark is to show that polars and tidypolars perform similarly and to give an idea of the performance to expect. For more thorough and representative benchmarks of polars, take a look at the DuckDB benchmarks instead.

library(collapse, warn.conflicts = FALSE)
#> collapse 2.1.3, see ?`collapse-package` or ?`collapse-documentation`
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
library(polars)
library(tidypolars)

large_iris <- data.table::rbindlist(rep(list(iris), 100000))
large_iris_pl <- as_polars_lf(large_iris)
large_iris_dt <- lazy_dt(large_iris)

format(nrow(large_iris), big.mark = ",")
#> [1] "15,000,000"

bench::mark(
  polars = {
    large_iris_pl$
      select(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))$
      with_columns(
        pl$when(
          (pl$col("Petal.Length") / pl$col("Petal.Width") > 3)
        )$then(pl$lit("long"))$
          otherwise(pl$lit("large"))$
          alias("petal_type")
      )$
      filter(pl$col("Sepal.Length")$is_between(4.5, 5.5))$
      collect()
  },
  tidypolars = {
    large_iris_pl |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      filter(between(Sepal.Length, 4.5, 5.5)) |>
      compute()
  },
  dplyr = {
    large_iris |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      filter(between(Sepal.Length, 4.5, 5.5))
  },
  dtplyr = {
    large_iris_dt |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      filter(between(Sepal.Length, 4.5, 5.5)) |>
      as.data.frame()
  },
  collapse = {
    large_iris |>
      fselect(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")) |>
      fmutate(
        petal_type = data.table::fifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      fsubset(Sepal.Length >= 4.5 & Sepal.Length <= 5.5)
  },
  check = FALSE,
  iterations = 40
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 5 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars     108.38ms 149.19ms     6.32     2.13MB    0.158
#> 2 tidypolars 118.42ms 247.73ms     3.61     1.22MB    0.452
#> 3 dplyr         2.94s    3.75s     0.269    1.79GB    0.658
#> 4 dtplyr     653.02ms 729.33ms     1.36     1.72GB    2.79 
#> 5 collapse   269.73ms 368.59ms     2.69   745.96MB    2.63

# NOTE: do NOT take the "mem_alloc" results into account.
# `bench::mark()` doesn't report the accurate memory usage for packages calling
# Rust code.

If you want to do your own benchmarks, please take a look at How to benchmark tidypolars first for some best practices.

Installation

tidypolars is built on polars, which is not available on CRAN. This means that tidypolars can’t be on CRAN either. However, you can install it from R-multiverse:

Sys.setenv(NOT_CRAN = "true")
install.packages(
  "tidypolars",
  repos = c("https://community.r-multiverse.org", "https://cloud.r-project.org")
)

The development version contains the latest improvements and bug fixes:

# install.packages("remotes")
remotes::install_github("etiennebacher/tidypolars")

Related work

Several packages have been developed to handle large data more efficiently while keeping the tidyverse syntax:

- arrow: one of the closest alternatives to tidypolars. It also has lazy evaluation and query optimizations, and uses Acero in the background to translate dplyr code and perform computations.
  - How is tidypolars different? Polars (and therefore tidypolars) uses an unofficial Arrow memory specification. All operations are implemented (and optimized) from scratch, meaning that query optimizations can be very different from Acero. The list of R functions that are translated to the Arrow engine may also differ.
- collapse: has very fast operations but still needs to import all data into memory, which prevents using larger-than-RAM datasets.
  - How is tidypolars different? tidypolars provides lazy evaluation, which is more memory-efficient since it doesn’t import all data into memory. It also provides a streaming engine to handle larger-than-RAM datasets.
- dbplyr: allows using dplyr for data stored in a relational database by translating R code to SQL queries. The performance will therefore depend on the SQL backend used.
  - How is tidypolars different? tidypolars doesn’t translate R code to SQL but directly evaluates it with Polars.
- dtplyr: uses data.table in the background for better performance but needs to import all data into memory, which prevents using larger-than-RAM datasets.
  - How is tidypolars different? Same as for collapse.
- duckplyr: one of the closest alternatives to tidypolars. It uses DuckDB in the background and also provides lazy evaluation and query optimizations. It can perform operations directly on R data.frames.
  - How is tidypolars different? As with arrow, the list of R functions that are optimized in Polars or DuckDB isn’t identical, so the use case will determine which tool runs the fastest. duckplyr also relies on a fallback mechanism that runs the code in “standard R” if a function cannot be translated. tidypolars is more conservative and errors in this case, which avoids importing data that may crash the session because of its size.
- sparklyr: uses Apache Spark in the background and requires installing Spark. It can perform distributed processing.
  - How is tidypolars different? tidypolars doesn’t require installing another tool and focuses on processing data on a single machine, not on distributed processing.

Therefore, if you need to handle data that is larger than memory, you have three options: arrow, duckplyr, and tidypolars. The best one will probably depend on the use case and on your constraints (e.g. tidypolars is available via R-multiverse but isn’t on CRAN). Regarding performance, refer to the DuckDB benchmarks to compare tools. Keep in mind that accurately benchmarking data processing tools is hard; those benchmarks give useful information but don’t necessarily apply to all contexts.
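
To make the lazy workflow mentioned above more concrete, here is a minimal sketch that scans a file lazily, so that no data is loaded until the query runs. It assumes that the installed version of tidypolars exports `scan_csv_polars()`; the temporary CSV file written from `iris` is a stand-in for a much larger file and is not part of the package.

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

# Stand-in for a large file on disk (hypothetical data for this example).
path <- tempfile(fileext = ".csv")
write.csv(iris, path, row.names = FALSE)

# Scanning only registers the source; the file is not read yet.
iris_lazy <- scan_csv_polars(path)

# The query is built lazily and only executed by compute().
out <- iris_lazy |>
  filter(Species == "setosa") |>
  select(starts_with("Sepal")) |>
  compute() |>
  as.data.frame()
```

Because the filter and the column selection are part of the query that polars runs, the query optimizer has the chance to skip rows and columns that are not needed instead of importing the whole file first.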

Contributing

Did you find some bugs or some errors in the documentation? Do you want tidypolars to support more functions?

Take a look at the contributing guide for instructions on bug reports and pull requests.

Acknowledgements

The website theme was heavily inspired by Matthew Kay’s ggblend package: https://mjskay.github.io/ggblend/.

The package hex logo was created by Hubert Hałun as part of the Appsilon Hex Contest.
