⚠️ This is the R package “tidypolars”. The Python one is here: markfairbanks/tidypolars ⚠️
Overview
tidypolars provides a polars backend for the tidyverse. The aim of tidypolars is to enable users to keep their existing tidyverse code while using polars in the background to benefit from large performance gains.
See the example below and the “Getting started” vignette for a gentle introduction to tidypolars.
Installation
tidypolars is built on polars, which is not available on CRAN. This means that tidypolars also can’t be on CRAN. However, you can install it from R-universe.
Windows or macOS
install.packages(
  'tidypolars',
  repos = c('https://etiennebacher.r-universe.dev', getOption("repos"))
)
Linux
install.packages(
  'tidypolars',
  repos = c('https://etiennebacher.r-universe.dev/bin/linux/jammy/4.3', getOption("repos"))
)
Example
Suppose that you already have some code that uses dplyr:
library(dplyr, warn.conflicts = FALSE)
iris |>
  select(starts_with(c("Sep", "Pet"))) |>
  mutate(
    petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
  ) |>
  filter(between(Sepal.Length, 4.5, 5.5)) |>
  head()
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width petal_type
#> 1          5.1         3.5          1.4         0.2       long
#> 2          4.9         3.0          1.4         0.2       long
#> 3          4.7         3.2          1.3         0.2       long
#> 4          4.6         3.1          1.5         0.2       long
#> 5          5.0         3.6          1.4         0.2       long
#> 6          5.4         3.9          1.7         0.4       long
With tidypolars, you can provide a Polars DataFrame or LazyFrame and keep the exact same code:
library(tidypolars)
iris |>
  as_polars_df() |>
  select(starts_with(c("Sep", "Pet"))) |>
  mutate(
    petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
  ) |>
  filter(between(Sepal.Length, 4.5, 5.5)) |>
  head()
#> shape: (6, 5)
#> ┌──────────────┬─────────────┬──────────────┬─────────────┬────────────┐
#> │ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ petal_type │
#> │ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---        │
#> │ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str        │
#> ╞══════════════╪═════════════╪══════════════╪═════════════╪════════════╡
#> │ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ long       │
#> │ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ long       │
#> │ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ long       │
#> │ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ long       │
#> │ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ long       │
#> │ 5.4          ┆ 3.9         ┆ 1.7          ┆ 0.4         ┆ long       │
#> └──────────────┴─────────────┴──────────────┴─────────────┴────────────┘
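The example above uses a DataFrame. As a minimal sketch of the LazyFrame variant, the same pipeline can be started from a LazyFrame, which defers execution until the result is requested. This assumes that dplyr, polars and tidypolars are installed, and it reuses polars::as_polars_lf() and compute(), the same calls that appear in the benchmark further down in this README. The variable name res is only illustrative.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidypolars)

# Build a lazy query: nothing is computed until compute() is called.
res <- iris |>
  polars::as_polars_lf() |>
  select(starts_with(c("Sep", "Pet"))) |>
  mutate(
    petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
  ) |>
  filter(between(Sepal.Length, 4.5, 5.5)) |>
  compute() # run the query and return a Polars DataFrame

res
```

The tidyverse verbs are unchanged from the DataFrame example; only the conversion step at the start and the compute() call at the end differ.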
If you’re used to the tidyverse functions and syntax, this will feel much easier to read than the pure polars syntax:
library(polars)
# polars syntax
pl$DataFrame(iris)$
  select(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))$
  with_columns(
    pl$when(
      (pl$col("Petal.Length") / pl$col("Petal.Width") > 3)
    )$then(pl$lit("long"))$
      otherwise(pl$lit("large"))$
      alias("petal_type")
  )$
  filter(pl$col("Sepal.Length")$is_between(4.5, 5.5))$
  head(6)
#> shape: (6, 5)
#> ┌──────────────┬─────────────┬──────────────┬─────────────┬────────────┐
#> │ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ petal_type │
#> │ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---        │
#> │ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str        │
#> ╞══════════════╪═════════════╪══════════════╪═════════════╪════════════╡
#> │ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ long       │
#> │ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ long       │
#> │ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ long       │
#> │ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ long       │
#> │ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ long       │
#> │ 5.4          ┆ 3.9         ┆ 1.7          ┆ 0.4         ┆ long       │
#> └──────────────┴─────────────┴──────────────┴─────────────┴────────────┘
Since most of the work done by tidypolars is translating tidyverse code into polars syntax, tidypolars and polars have very similar performance.
The small benchmark below runs the same pipeline with polars, tidypolars, dplyr, dtplyr and collapse. For more serious benchmarks about polars, take a look at the DuckDB benchmarks.
library(collapse, warn.conflicts = FALSE)
#> collapse 2.0.11, see ?`collapse-package` or ?`collapse-documentation`
library(dtplyr)
large_iris <- data.table::rbindlist(rep(list(iris), 100000))
large_iris_pl <- as_polars_lf(large_iris)
large_iris_dt <- lazy_dt(large_iris)
format(nrow(large_iris), big.mark = ",")
#> [1] "15,000,000"
bench::mark(
  polars = {
    large_iris_pl$
      select(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))$
      with_columns(
        pl$when(
          (pl$col("Petal.Length") / pl$col("Petal.Width") > 3)
        )$then(pl$lit("long"))$
          otherwise(pl$lit("large"))$
          alias("petal_type")
      )$
      filter(pl$col("Sepal.Length")$is_between(4.5, 5.5))$
      collect()
  },
  tidypolars = {
    large_iris_pl |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      filter(between(Sepal.Length, 4.5, 5.5)) |>
      compute()
  },
  dplyr = {
    large_iris |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      filter(between(Sepal.Length, 4.5, 5.5))
  },
  dtplyr = {
    large_iris_dt |>
      select(starts_with(c("Sep", "Pet"))) |>
      mutate(
        petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      filter(between(Sepal.Length, 4.5, 5.5)) |>
      as.data.frame()
  },
  collapse = {
    large_iris |>
      fselect(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")) |>
      fmutate(
        petal_type = data.table::fifelse((Petal.Length / Petal.Width) > 3, "long", "large")
      ) |>
      fsubset(Sepal.Length >= 4.5 & Sepal.Length <= 5.5)
  },
  check = FALSE,
  iterations = 40
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 5 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars      126.4ms 288.63ms     2.89     19.6KB   0.0722
#> 2 tidypolars 152.53ms 202.25ms     4.37   332.28KB   1.09
#> 3 dplyr         5.62s    6.06s     0.164    1.79GB   0.476
#> 4 dtplyr     835.67ms    1.03s     0.957    1.72GB   2.34
#> 5 collapse   487.95ms 653.16ms     1.50   745.96MB   1.09
# NOTE: do NOT take the "mem_alloc" results into account.
# `bench::mark()` doesn't report the accurate memory usage for packages calling
# Rust code.
Contributing
Did you find some bugs or some errors in the documentation? Do you want tidypolars to support more functions?
Take a look at the contributing guide for instructions on bug reports and pull requests.
Acknowledgements
The website theme was heavily inspired by Matthew Kay’s ggblend package: https://mjskay.github.io/ggblend/.