R and Polars expressions
When we use the tidyverse, we use R expressions in mainly three places: filter(), mutate(), and summarize().
library(dplyr, warn.conflicts = FALSE)
filter(mtcars, am + gear > carb)
mutate(mtcars, x = (qsec - mean(qsec)) / sd(qsec))
mtcars |>
group_by(cyl) |>
summarize(x = mean(qsec) / sd(qsec))
This is very convenient but creates a challenge for tidypolars. Indeed, while it is possible to pass R functions directly to a Polars Data/LazyFrame, it is strongly discouraged to do so because it doesn’t take advantage of Polars optimizations.
Polars comes with dozens of built-in functions for maths (median, var, arccos, …), string manipulation (len_chars, starts_with, …), and date-time (hour, quarter, ordinal_day, …). All of these functions are optimized internally and are run in parallel under the hood, which will not be the case if we pass R functions.
However, using these Polars expressions would imply that we need to learn these new functions and this new syntax. To avoid doing that, tidypolars will automatically translate R expressions into Polars ones. Basically, you can keep writing R expressions in most situations, and they will automatically be translated to Polars syntax.
However, there are some situations where this might not work, so this vignette explains the process and the limitations.
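As a rough illustration of what this translation buys you, the sketch below computes the same standardized column three ways: with dplyr on a regular data.frame, with the same R expression on a Polars DataFrame (translated by tidypolars), and with a hand-written Polars expression. The object names (res_dplyr, res_tidypolars, res_polars) are made up for this example, and it assumes that polars and tidypolars are installed.

```r
library(dplyr, warn.conflicts = FALSE)
library(polars)
library(tidypolars)

# 1. Plain dplyr on a data.frame
res_dplyr <- mtcars |>
  mutate(x = (qsec - mean(qsec)) / sd(qsec))

# 2. The same R expression on a Polars DataFrame: tidypolars translates it
res_tidypolars <- mtcars |>
  as_polars_df() |>
  mutate(x = (qsec - mean(qsec)) / sd(qsec)) |>
  as.data.frame()

# 3. A hand-written Polars expression, for comparison
qsec <- pl$col("qsec")
res_polars <- as_polars_df(mtcars)$with_columns(
  ((qsec - qsec$mean()) / qsec$std())$alias("x")
) |>
  as.data.frame()

# The three results should agree up to floating-point tolerance
all.equal(res_dplyr$x, res_tidypolars$x)
all.equal(res_dplyr$x, res_polars$x)
```

The second form is the one this vignette is about: the code reads like dplyr, but what actually runs is equivalent to the third form.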
How does tidypolars translate R expressions into Polars expressions?
When tidypolars receives an expression, it runs a function translate() several times until all components are translated to their Polars equivalent. There are four possible components: single values, column names, external objects, and functions.
Single values, column names, and external objects
If you pass a single value, like x = 1 or x = "a", it is wrapped into pl$lit(). This is also the case for external objects, with the difference that these need to be wrapped in {{ }} and are evaluated before being wrapped into pl$lit().
Column names, like x = mpg, are wrapped into pl$col().
x = "a" -> x = pl$lit("a")
x = {{ some_value }} -> x = pl$lit(*value*)
x = mpg -> x = pl$col("mpg")
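To make the three mappings above concrete, here is a minimal sketch that writes the translated expressions by hand with polars only. The variable some_value and the column names label, scaled, and mpg_copy are chosen for illustration; this shows the kind of expression the translation is meant to produce, not tidypolars' internal code.

```r
library(polars)

# An external object living in the R session (hypothetical value)
some_value <- 10

df <- as_polars_df(mtcars)$with_columns(
  # single value:    "a"            -> pl$lit("a")
  pl$lit("a")$alias("label"),
  # external object: evaluated first, then wrapped into pl$lit()
  (pl$col("mpg") * pl$lit(some_value))$alias("scaled"),
  # column name:     mpg            -> pl$col("mpg")
  pl$col("mpg")$alias("mpg_copy")
)

out <- as.data.frame(df)
head(out[, c("mpg", "label", "scaled", "mpg_copy")])
```

A literal is broadcast to every row, while pl$col() refers to an existing column, which is why the two cases need different wrappers.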
Functions
Functions are split into two categories: built-in functions (i.e., functions provided by base R or by other packages) and user-defined functions (UDF), which are written by the user (you).
Built-in functions
In the first case, tidypolars checks the function name and whether there’s an equivalent function in Polars. For example, the R function sd(x, na.rm = TRUE) is converted to std(x, na.rm = TRUE). Since R and Polars functions often don’t share the same name, this check relies on a custom list containing all equivalencies between R and Polars functions. You can see the list of supported R functions at the bottom of this vignette. Note that most essential base R functions are supported, as well as some functions from dplyr or stringr, for example.
Here, replacing R’s sd() function with std() is not enough because the argument x (which usually is a variable in the dataset) is not in the right format to be used by Polars. Therefore, tidypolars calls translate() a second time on the inside of the function.
We now have std(pl$col("x")). To end this example, we need to see what happens to additional arguments. You can see that we didn’t modify na.rm = TRUE for now. This is because Polars doesn’t have this argument in std() (it automatically drops the missing values). Internally, tidypolars checks whether additional arguments are accepted and emits a warning if this is not the case:
library(tidypolars)
library(polars)
mtcars |>
as_polars_df() |>
mutate(x = sd(mpg, na.rm = TRUE))
#> Warning:
#> Not all arguments of sd() are used by Polars.
#> The following argument(s) will not be used: `na.rm`.
#> shape: (32, 12)
#> ┌──────┬─────┬───────┬───────┬───┬─────┬──────┬──────┬──────────┐
#> │ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ am ┆ gear ┆ carb ┆ x │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
#> ╞══════╪═════╪═══════╪═══════╪═══╪═════╪══════╪══════╪══════════╡
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 4.0 ┆ 6.026948 │
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 4.0 ┆ 6.026948 │
#> │ 22.8 ┆ 4.0 ┆ 108.0 ┆ 93.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 1.0 ┆ 6.026948 │
#> │ 21.4 ┆ 6.0 ┆ 258.0 ┆ 110.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 1.0 ┆ 6.026948 │
#> │ 18.7 ┆ 8.0 ┆ 360.0 ┆ 175.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 2.0 ┆ 6.026948 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 30.4 ┆ 4.0 ┆ 95.1 ┆ 113.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 2.0 ┆ 6.026948 │
#> │ 15.8 ┆ 8.0 ┆ 351.0 ┆ 264.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 4.0 ┆ 6.026948 │
#> │ 19.7 ┆ 6.0 ┆ 145.0 ┆ 175.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 6.0 ┆ 6.026948 │
#> │ 15.0 ┆ 8.0 ┆ 301.0 ┆ 335.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 8.0 ┆ 6.026948 │
#> │ 21.4 ┆ 4.0 ┆ 121.0 ┆ 109.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 2.0 ┆ 6.026948 │
#> └──────┴─────┴───────┴───────┴───┴─────┴──────┴──────┴──────────┘
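The recursive process described in this section can be sketched in a few lines of base R. This is a toy illustration only, not tidypolars' actual implementation: toy_translate() and the lookup table r_to_polars are hypothetical names, the table covers just three functions, and extra arguments such as na.rm are silently dropped instead of triggering a warning.

```r
# Hypothetical lookup table: R function name -> Polars method name.
# tidypolars relies on its own, much larger list.
r_to_polars <- c(sd = "std", mean = "mean", median = "median")

toy_translate <- function(expr) {
  # Column name: mpg -> pl$col("mpg")
  if (is.symbol(expr)) {
    return(bquote(pl$col(.(as.character(expr)))))
  }
  # Single value: 1 or "a" -> pl$lit(1) or pl$lit("a")
  if (!is.call(expr)) {
    return(bquote(pl$lit(.(expr))))
  }
  # Function call: translate every argument first (the recursive step)
  fn <- as.character(expr[[1]])
  args <- lapply(as.list(expr)[-1], toy_translate)
  if (fn %in% names(r_to_polars)) {
    # Known function: sd(x) -> <translated x>$std()
    # Only the first argument is kept in this toy version.
    method <- as.name(r_to_polars[[fn]])
    return(as.call(list(call("$", args[[1]], method))))
  }
  # Operators such as `-`, `/` and `(` are kept, with translated operands
  as.call(c(list(expr[[1]]), args))
}

# The result is an unevaluated call, so polars is not needed to inspect it
translated <- toy_translate(quote(sd(x, na.rm = TRUE)))
print(translated)
print(toy_translate(quote((qsec - mean(qsec)) / sd(qsec))))
```

Because the function works on unevaluated calls, it only shows the shape of the translation. Evaluating the result would additionally require the polars package and a DataFrame to apply the expression to.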
User-defined functions
User-defined functions (UDF) are more challenging. Indeed, it is technically possible to inspect the code inside a UDF, but rewriting it to match Polars syntax would be extremely complicated. In this situation, you will have to rewrite your custom function using Polars syntax so that it returns a Polars expression. For example, we could make a function to standardize a column like this:
pl_standardize <- function(x) {
(x - x$mean()) / x$std()
}
Remember that the column name used as x will end up wrapped into pl$col(), so to check that your function returns a Polars expression, you have to provide a pl$col() call:
pl_standardize(pl$col("mpg"))
#> polars Expr: [([(col("mpg")) - (col("mpg").mean())]) // (col("mpg").std())]
This function correctly returns a Polars expression, so we can now use it like any other function:
mtcars |>
as_polars_df() |>
mutate(x = pl_standardize(mpg))
#> shape: (32, 12)
#> ┌──────┬─────┬───────┬───────┬───┬─────┬──────┬──────┬───────────┐
#> │ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ am ┆ gear ┆ carb ┆ x │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
#> ╞══════╪═════╪═══════╪═══════╪═══╪═════╪══════╪══════╪═══════════╡
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 4.0 ┆ 0.150885 │
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 4.0 ┆ 0.150885 │
#> │ 22.8 ┆ 4.0 ┆ 108.0 ┆ 93.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 1.0 ┆ 0.449543 │
#> │ 21.4 ┆ 6.0 ┆ 258.0 ┆ 110.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 1.0 ┆ 0.217253 │
#> │ 18.7 ┆ 8.0 ┆ 360.0 ┆ 175.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 2.0 ┆ -0.230735 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 30.4 ┆ 4.0 ┆ 95.1 ┆ 113.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 2.0 ┆ 1.710547 │
#> │ 15.8 ┆ 8.0 ┆ 351.0 ┆ 264.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 4.0 ┆ -0.711907 │
#> │ 19.7 ┆ 6.0 ┆ 145.0 ┆ 175.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 6.0 ┆ -0.064813 │
#> │ 15.0 ┆ 8.0 ┆ 301.0 ┆ 335.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 8.0 ┆ -0.844644 │
#> │ 21.4 ┆ 4.0 ┆ 121.0 ┆ 109.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 2.0 ┆ 0.217253 │
#> └──────┴─────┴───────┴───────┴───┴─────┴──────┴──────┴───────────┘
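The same pattern applies to other custom transformations. As a further sketch, the hypothetical function pl_rescale() below maps a column to the [0, 1] range using the $min() and $max() expression methods. The function name is invented for this example.

```r
library(polars)
library(tidypolars)

# Hypothetical UDF written with Polars syntax: min-max rescaling
pl_rescale <- function(x) {
  (x - x$min()) / (x$max() - x$min())
}

# Check that it returns a Polars expression when given a pl$col() call
pl_rescale(pl$col("mpg"))

# Use it like any other function inside mutate()
out <- mtcars |>
  as_polars_df() |>
  mutate(x = pl_rescale(mpg)) |>
  as.data.frame()

# The rescaled column should span 0 to 1
range(out$x)
```

As with pl_standardize(), the only requirement is that the body uses Polars methods on x so that the return value is a Polars expression.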
Special case: across()
across() is a very useful function that applies a function (or a list of functions) to a selection of columns. It accepts built-in functions, UDFs, and anonymous functions.
mtcars |>
as_polars_df() |>
mutate(
across(
.cols = contains("a"),
list(mean = mean, stand = pl_standardize, ~ sd(.x))
)
)
#> shape: (32, 23)
#> ┌──────┬─────┬───────┬───────┬───┬──────────┬───────────┬────────────┬────────┐
#> │ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ gear_3 ┆ carb_mean ┆ carb_stand ┆ carb_3 │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
#> ╞══════╪═════╪═══════╪═══════╪═══╪══════════╪═══════════╪════════════╪════════╡
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ 0.735203 ┆ 1.6152 │
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ 0.735203 ┆ 1.6152 │
#> │ 22.8 ┆ 4.0 ┆ 108.0 ┆ 93.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ -1.122152 ┆ 1.6152 │
#> │ 21.4 ┆ 6.0 ┆ 258.0 ┆ 110.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ -1.122152 ┆ 1.6152 │
#> │ 18.7 ┆ 8.0 ┆ 360.0 ┆ 175.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ -0.503034 ┆ 1.6152 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 30.4 ┆ 4.0 ┆ 95.1 ┆ 113.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ -0.503034 ┆ 1.6152 │
#> │ 15.8 ┆ 8.0 ┆ 351.0 ┆ 264.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ 0.735203 ┆ 1.6152 │
#> │ 19.7 ┆ 6.0 ┆ 145.0 ┆ 175.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ 1.97344 ┆ 1.6152 │
#> │ 15.0 ┆ 8.0 ┆ 301.0 ┆ 335.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ 3.211677 ┆ 1.6152 │
#> │ 21.4 ┆ 4.0 ┆ 121.0 ┆ 109.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ -0.503034 ┆ 1.6152 │
#> └──────┴─────┴───────┴───────┴───┴──────────┴───────────┴────────────┴────────┘
Similarly, UDFs and anonymous functions will error if they don’t return a Polars expression:
mtcars |>
as_polars_df() |>
mutate(
across(
.cols = contains("a"),
.fns = list(
mean = mean,
function(x) {
(x - mean(x)) / sd(x)
},
~ sd(.x)
)
)
)
#> Error in `mutate()`:
#> ! Could not evaluate an anonymous function in `across()`.
#> ℹ Are you sure the anonymous function returns a Polars expression?
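Following the error hint, one possible fix is to rewrite the anonymous function with Polars syntax so that it returns a Polars expression. The sketch below assumes that across() evaluates anonymous functions on the selected columns as the error message suggests. The list name stand is chosen for illustration, and the behaviour may differ between tidypolars versions.

```r
library(polars)
library(tidypolars)

out <- mtcars |>
  as_polars_df() |>
  mutate(
    across(
      .cols = contains("a"),
      .fns = list(
        mean = mean,
        # Polars methods instead of R's mean() and sd()
        stand = function(x) {
          (x - x$mean()) / x$std()
        }
      )
    )
  ) |>
  as.data.frame()

# New columns are named <column>_<function name>, e.g. carb_stand
head(out[, c("carb", "carb_mean", "carb_stand")])
```

Built-in functions such as mean can stay as they are because tidypolars translates them. Only the custom function body has to be written with Polars methods.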
List of base R and tidyverse functions supported by tidypolars
Package | Function | Notes |
---|---|---|
base | abs | |
base | acos | |
base | acosh | |
base | all | |
base | any | |
base | asin | |
base | asinh | |
base | atan | |
base | atanh | |
base | ceiling | |
base | cos | |
base | cosh | |
base | cummin | |
base | cumsum | |
base | diff | |
base | exp | |
base | floor | |
base | grepl | |
base | ifelse | |
base | ISOdatetime | |
base | length | |
base | log | |
base | log10 | |
base | max | |
base | mean | |
base | min | |
base | nchar | |
base | paste0 | |
base | paste | |
base | rank | |
base | rev | |
base | round | |
base | sin | |
base | sinh | |
base | sort | |
base | sqrt | |
base | strptime | |
base | tan | |
base | tanh | |
base | tolower | |
base | toupper | |
base | unique | |
base | which.min | |
base | which.max | |
dplyr | between | |
dplyr | case_match | |
dplyr | case_when | |
dplyr | coalesce | |
dplyr | consecutive_id | |
dplyr | first | |
dplyr | group_keys | |
dplyr | group_vars | |
dplyr | if_else | |
dplyr | lag | |
dplyr | min_rank | |
dplyr | n | |
dplyr | nth | |
dplyr | n_distinct | |
dplyr | last | |
lubridate | ddays | |
lubridate | dhours | |
lubridate | dmilliseconds | |
lubridate | dminutes | |
lubridate | dseconds | |
lubridate | dweeks | |
lubridate | make_date | |
lubridate | make_datetime | In lubridate::make_datetime(), when there is an overflow (for example hours = 25), then it is automatically converted to the higher unit (for example 1 day and 1h). In Polars, this returns NA. |
stats | median | |
stats | lag | |
stats | sd | |
stats | var | |
stringr | regex | |
stringr | str_count | |
stringr | str_dup | |
stringr | str_ends | |
stringr | str_extract | |
stringr | str_extract_all | |
stringr | str_length | |
stringr | str_pad | |
stringr | str_remove | |
stringr | str_remove_all | |
stringr | str_replace | |
stringr | str_replace_all | |
stringr | str_split | |
stringr | str_split_i | |
stringr | str_squish | |
stringr | str_starts | |
stringr | str_sub | |
stringr | str_trim | |
stringr | str_to_lower | |
stringr | str_to_title | |
stringr | str_to_upper | |
stringr | str_trunc | |
stringr | word | |
tidyr | replace_na | |
tools | toTitleCase | |