
Is tidypolars slower than polars?

No, or only marginally. tidypolars does not manipulate the data itself; it only translates tidyverse syntax into polars syntax. polars still performs all the data manipulation under the hood.

The only overhead comes from parsing the expressions and rewriting them in polars syntax (see the Parsing expressions vignette), and this should be marginal. Here’s a small benchmark comparing the performance of polars and tidypolars on a grouped aggregation and on a filter:

library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

pl_test <- pl$DataFrame(
  grp = sample(letters, 2*1e7, TRUE),
  val1 = sample(1:1000, 2*1e7, TRUE),
  val2 = sample(1:1000, 2*1e7, TRUE)
)

bench::mark(
  polars = pl_test$
    group_by("grp")$
    agg(
      pl$col("val1")$mean()$alias("x"), 
      pl$col("val2")$sum()$alias("y"),
      pl$col("val1")$median()$alias("z")
    ),
  tidypolars = pl_test |> 
    group_by(grp) |> 
    summarize(
      x = mean(val1),
      y = sum(val2),
      z = median(val1)
    ),
  check = FALSE,
  iterations = 15
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars        260ms    315ms      3.09   404.6KB    0.221
#> 2 tidypolars    278ms    302ms      3.17    2.46MB    2.11

bench::mark(
  polars = pl_test$
    filter(pl$col("grp") == "a" | pl$col("grp") == "b"),
  tidypolars = pl_test |> 
    filter(grp == "a" | grp == "b"),
  check = FALSE,
  iterations = 15
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars       43.2ms   54.6ms      16.7    40.3KB     1.19
#> 2 tidypolars   46.4ms   48.6ms      20.5    81.7KB     3.15

Am I stuck with tidypolars?

No. tidypolars always returns polars DataFrames, LazyFrames, or Series. If at some point you want to use polars directly, because you need more control or want to reduce your number of dependencies, you can easily do so.
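
As a minimal sketch of what this looks like in practice, the example below filters with tidypolars and then continues with plain polars methods on the result. The small dataset and the names `df`, `filtered`, and `total` are made up for illustration.

```r
library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

# Small illustrative dataset (hypothetical values)
df <- pl$DataFrame(
  grp = c("a", "a", "b"),
  val = c(1, 2, 3)
)

# tidyverse syntax, translated to polars by tidypolars
filtered <- df |>
  filter(grp == "a")

# The result is still a polars DataFrame, so polars methods apply directly
total <- filtered$select(pl$col("val")$sum())
```

Because nothing tidypolars-specific is attached to the output, the tidypolars step could later be replaced by the equivalent polars call without touching the rest of the pipeline.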

Do I still need to load polars?

Yes, because tidypolars doesn’t provide any functions to create a polars DataFrame or LazyFrame, or to read data. You’ll still need to use polars for this.

Can I see some benchmarks with other tools?

Making accurate benchmarks of data wrangling tools is difficult, and I won’t try to do it here. Refer to the DuckDB benchmarks instead.