Is tidypolars slower than polars?

No, or only marginally. tidypolars does not manipulate the data itself; it only translates tidyverse syntax into polars syntax. polars still performs all the data manipulation under the hood.

There is therefore some minor overhead, because tidypolars still has to parse the expressions and rewrite them in polars syntax (see the Parsing expressions vignette), but this should be marginal. Here’s a small benchmark comparing the performance of polars and tidypolars:

library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

pl_test <- pl$DataFrame(
  grp = sample(letters, 2*1e7, TRUE),
  val1 = sample(1:1000, 2*1e7, TRUE),
  val2 = sample(1:1000, 2*1e7, TRUE)
)

bench::mark(
  polars = pl_test$
    group_by("grp")$
    agg(
      pl$col("val1")$mean()$alias("x"), 
      pl$col("val2")$sum()$alias("y"),
      pl$col("val1")$median()$alias("z")
    ),
  tidypolars = pl_test |> 
    group_by(grp) |> 
    summarize(
      x = mean(val1),
      y = sum(val2),
      z = median(val1)
    ),
  check = FALSE,
  iterations = 15
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars        383ms    427ms      2.32  417.81KB    0.166
#> 2 tidypolars    542ms    678ms      1.49    3.67MB    0.746

bench::mark(
  polars = pl_test$
    filter(pl$col("grp") == "a" | pl$col("grp") == "b"),
  tidypolars = pl_test |> 
    filter(grp == "a" | grp == "b"),
  check = FALSE,
  iterations = 15
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars         56ms   71.5ms      11.9    21.6KB     0   
#> 2 tidypolars   66.2ms  101.9ms      10.0    82.2KB     1.54

Am I stuck with tidypolars?

No. tidypolars always returns polars DataFrames, LazyFrames or Series. If at some point you want to use polars directly, because you need more control or want to reduce your number of dependencies, you can easily do so.
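
As a minimal sketch of this interoperability, the snippet below starts with a tidypolars verb and then continues with native polars methods on the result. The toy data and the names `dat`, `step1` and `step2` are invented for illustration; the calls themselves are the same ones used in the benchmark above.

```r
library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data, for illustration only
dat <- pl$DataFrame(
  grp = c("a", "a", "b"),
  val = c(1, 2, 3)
)

# Step 1: tidyverse syntax via tidypolars; the result is still a polars DataFrame
step1 <- dat |>
  filter(grp == "a")

# Step 2: continue on the same object with native polars syntax
step2 <- step1$
  group_by("grp")$
  agg(pl$col("val")$sum()$alias("total"))
```

Because `step1` is an ordinary polars object, dropping tidypolars later should only require rewriting the tidyverse-style steps, not the data they produce.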

Do I still need to load polars?

Yes, because tidypolars doesn’t provide any functions to create polars DataFrames or LazyFrames, or to read data. You’ll still need to use polars for this.
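
A short sketch of this division of labour, assuming polars' `as_polars_df()` is available to convert an existing R data.frame: polars creates the object, and tidypolars supplies the dplyr verbs applied to it. The names `mtcars_pl` and `by_cyl` are illustrative only.

```r
library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

# polars is responsible for creating the DataFrame
# (assumption: as_polars_df() converts a base R data.frame)
mtcars_pl <- as_polars_df(mtcars)

# tidypolars is responsible for the tidyverse-style manipulation
by_cyl <- mtcars_pl |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg))

# Convert back to a base R data.frame for inspection
as.data.frame(by_cyl)
```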

Can I see some benchmarks with other tools?

Benchmarking data wrangling tools accurately is difficult, and I won’t try to do it here. Refer to the DuckDB benchmarks instead.