
Is tidypolars slower than polars?

No, or only marginally. tidypolars does not manipulate the data itself: it only translates tidyverse syntax into polars syntax. polars is still in charge of all the data manipulation under the hood.

There is therefore a small overhead, because tidypolars still needs to parse the expressions and rewrite them in polars syntax (see the Parsing expressions vignette), but it should be marginal. Here’s a small benchmark comparing the performance of polars and tidypolars on a grouped aggregation and on a filter:

library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

# Simulated data: 20 million rows with one grouping column and two integer columns
pl_test <- pl$DataFrame(
  grp = sample(letters, 2*1e7, TRUE),
  val1 = sample(1:1000, 2*1e7, TRUE),
  val2 = sample(1:1000, 2*1e7, TRUE)
)

bench::mark(
  polars = pl_test$
    group_by("grp")$
    agg(
      pl$col("val1")$mean()$alias("x"), 
      pl$col("val2")$sum()$alias("y"),
      pl$col("val1")$median()$alias("z")
    ),
  tidypolars = pl_test |> 
    group_by(grp) |> 
    summarize(
      x = mean(val1),
      y = sum(val2),
      z = median(val1)
    ),
  check = FALSE,
  iterations = 15
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars        300ms    336ms      2.98  412.18KB    0.213
#> 2 tidypolars    353ms    424ms      2.29    3.67MB    2.61

bench::mark(
  polars = pl_test$
    filter(pl$col("grp") == "a" | pl$col("grp") == "b"),
  tidypolars = pl_test |> 
    filter(grp == "a" | grp == "b"),
  check = FALSE,
  iterations = 15
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars       44.1ms   50.7ms      20.1    21.5KB     1.44
#> 2 tidypolars   50.8ms     56ms      17.9    82.6KB     4.47

Am I stuck with tidypolars?

No. tidypolars always returns polars DataFrames, LazyFrames, or Series. If at some point you want to use polars directly, either because you need more control or because you want to reduce your number of dependencies, you can easily do so.
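As a minimal sketch of what this means in practice, the snippet below mixes one tidypolars step with one polars step on the same object. The toy data and the names `df`, `filtered` and `total` are hypothetical and only serve the illustration. It assumes the same package setup as the benchmark above.

```r
library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data, for illustration only
df <- pl$DataFrame(
  grp = c("a", "a", "b"),
  val = c(1, 2, 3)
)

# tidypolars step: dplyr syntax, translated to polars behind the scenes
filtered <- df |>
  filter(grp == "a")

# polars step: the result is still a polars DataFrame,
# so polars methods can be called on it directly
total <- filtered$select(pl$col("val")$sum()$alias("total"))
total
```

Because the intermediate object never leaves polars, dropping tidypolars later should only require rewriting the dplyr-style calls, not converting any data.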

Do I still need to load polars?

Yes. tidypolars doesn’t provide any functions to create polars DataFrames or LazyFrames, or to read data, so you’ll still need polars for this.
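A short sketch of this division of labour: polars creates the object and tidypolars handles the manipulation. It assumes that `as_polars_df()` is available in your version of polars, and it uses the built-in `mtcars` dataset purely as an example.

```r
library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

# Creation is done with polars (assumption: as_polars_df() exists in your version)
cars <- as_polars_df(mtcars)

# Manipulation can then use tidyverse syntax through tidypolars
res <- cars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg))
res
```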

Can I see some benchmarks with other tools?

Accurately benchmarking data wrangling tools is difficult, and I won’t try to do it here. Refer to the DuckDB benchmarks instead.