R and Polars expressions
Source:vignettes/r-and-polars-expressions.Rmd
When we use the tidyverse, we use R expressions in mainly three places: filter(), mutate(), and summarize().
library(dplyr, warn.conflicts = FALSE)
filter(mtcars, am + gear > carb)
mutate(mtcars, x = (qsec - mean(qsec)) / sd(qsec))
mtcars |>
  group_by(cyl) |>
  summarize(x = mean(qsec) / sd(qsec))
This is very convenient but creates a challenge for tidypolars. While it is possible to pass R functions directly to a Polars DataFrame or LazyFrame, doing so is strongly discouraged because it doesn’t take advantage of Polars optimizations.
Polars comes with dozens of built-in functions for maths (median, var, arccos, …), string manipulation (len_chars, starts_with, …), and date-time (hour, quarter, ordinal_day, …). All of these functions are optimized internally and run in parallel under the hood, which is not the case when we pass R functions.
However, using these Polars expressions directly means learning new functions and a new syntax. To avoid that, tidypolars automatically translates R expressions into Polars ones: in most situations, you can keep writing R expressions as usual.
However, there are some situations where this might not work, so this vignette explains the process and the limitations.
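To make the benefit concrete, here is a small sketch that computes the same standardized column twice: once with hand-written Polars syntax, and once with a plain R expression that tidypolars translates. The hand-written version is only an approximation of what the translation produces, and the object names dat, by_hand, and translated are chosen for this illustration.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidypolars)
library(polars)

dat <- as_polars_df(mtcars)

# Hand-written Polars syntax: new functions and a new syntax to learn
by_hand <- dat$with_columns(
  ((pl$col("qsec") - pl$col("qsec")$mean()) / pl$col("qsec")$std())$alias("x")
)

# Same computation written as an R expression and translated by tidypolars
translated <- dat |>
  mutate(x = (qsec - mean(qsec)) / sd(qsec))
```

Both objects should hold the same x column, but only the second one requires no knowledge of the Polars syntax.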
How does tidypolars translate R expressions into Polars expressions?
When tidypolars receives an expression, it runs a function translate() several times until all components are translated to their Polars equivalent. There are four possible components: single values, column names, external objects, and functions.
Single values, column names, and external objects
If you pass a single value, like x = 1 or x = "a", it is wrapped into pl$lit(). This is also the case for external objects, with the difference that these need to be wrapped in {{ }} and are evaluated before being wrapped into pl$lit().
Column names, like x = mpg, are wrapped into pl$col().
x = "a" -> x = pl$lit("a")
x = {{ some_value }} -> x = pl$lit(*value*)
x = mpg -> x = pl$col("mpg")
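The rules above can be sketched with a toy translator written in base R. This is not the actual translate() function used by tidypolars: toy_translate() and its arguments are hypothetical names, and it returns the Polars syntax as a string instead of a real Polars expression. It also previews the recursive handling of function calls (described in the next section) for a few arithmetic operators and mean() only.

```r
# Toy illustration only: this is NOT the real tidypolars translate() function.
toy_translate <- function(expr, columns, env = parent.frame()) {
  # Column names are wrapped into pl$col()
  if (is.symbol(expr)) {
    name <- as.character(expr)
    if (!name %in% columns) {
      stop("Unknown column: ", name, " (external objects need {{ }})")
    }
    return(sprintf('pl$col("%s")', name))
  }
  # Single values are wrapped into pl$lit()
  if (is.atomic(expr) && length(expr) == 1) {
    return(sprintf("pl$lit(%s)", deparse(expr)))
  }
  if (!is.call(expr) || !is.symbol(expr[[1]])) {
    stop("Unsupported expression")
  }
  fn <- as.character(expr[[1]])
  # {{ obj }} is parsed as two nested `{` calls: evaluate obj, then wrap it
  if (fn == "{" && length(expr) == 2 && is.call(expr[[2]]) &&
      identical(expr[[2]][[1]], as.name("{"))) {
    return(sprintf("pl$lit(%s)", deparse(eval(expr[[2]][[2]], env))))
  }
  # Function calls: translate every argument first, then the function itself
  args <- unname(vapply(
    as.list(expr)[-1], toy_translate, character(1),
    columns = columns, env = env
  ))
  if (fn == "(") {
    return(args[1])
  }
  if (fn %in% c("+", "-", "*", "/") && length(args) == 2) {
    return(sprintf("(%s %s %s)", args[1], fn, args[2]))
  }
  if (fn == "mean" && length(args) == 1) {
    return(paste0(args, "$mean()"))
  }
  stop("No translation for function: ", fn)
}

some_value <- 10
cols <- names(mtcars)

cat(toy_translate(quote("a"), cols), "\n")
#> pl$lit("a")
cat(toy_translate(quote({{ some_value }}), cols), "\n")
#> pl$lit(10)
cat(toy_translate(quote(mpg), cols), "\n")
#> pl$col("mpg")
cat(toy_translate(quote((qsec - mean(qsec)) / 2), cols), "\n")
#> ((pl$col("qsec") - pl$col("qsec")$mean()) / pl$lit(2))
```

The last call shows why several passes are needed: the outer division can only be translated once its two arguments have been translated.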
Functions
Functions are split into two categories: built-in functions (i.e., functions provided by base R or by other packages) and user-defined functions (UDFs), which are written by the user (you).
Built-in functions
In the first case, tidypolars checks the function name and whether it has already been translated internally. For example, if we call the R function mean(x, trim = 2), then it looks for a translation of mean(). You can see the list of supported R functions at the bottom of this vignette. Note that most essential base R functions are supported, as well as many functions from dplyr or stringr, for example.
Now that tidypolars knows that a translation of mean() exists, it parses the arguments in the call to translate them to the Polars syntax: internally, x is converted to pl$col("x") if there is a column "x" in the data. Sometimes, additional arguments do not have an equivalent in Polars. This is the case for the argument trim here. In this case, tidypolars ignores this argument and warns the user:
library(tidypolars)
library(polars)
mtcars |>
  as_polars_df() |>
  mutate(x = mean(mpg, trim = 2))
#> Warning:
#> Package tidypolars doesn't know how to use some arguments of `mean()`.
#> The following argument(s) will be ignored: `trim`.
#> shape: (32, 12)
#> ┌──────┬─────┬───────┬───────┬───┬─────┬──────┬──────┬───────────┐
#> │ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ am ┆ gear ┆ carb ┆ x │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
#> ╞══════╪═════╪═══════╪═══════╪═══╪═════╪══════╪══════╪═══════════╡
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 4.0 ┆ 20.090625 │
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 4.0 ┆ 20.090625 │
#> │ 22.8 ┆ 4.0 ┆ 108.0 ┆ 93.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 1.0 ┆ 20.090625 │
#> │ 21.4 ┆ 6.0 ┆ 258.0 ┆ 110.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 1.0 ┆ 20.090625 │
#> │ 18.7 ┆ 8.0 ┆ 360.0 ┆ 175.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 2.0 ┆ 20.090625 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 30.4 ┆ 4.0 ┆ 95.1 ┆ 113.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 2.0 ┆ 20.090625 │
#> │ 15.8 ┆ 8.0 ┆ 351.0 ┆ 264.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 4.0 ┆ 20.090625 │
#> │ 19.7 ┆ 6.0 ┆ 145.0 ┆ 175.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 6.0 ┆ 20.090625 │
#> │ 15.0 ┆ 8.0 ┆ 301.0 ┆ 335.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 8.0 ┆ 20.090625 │
#> │ 21.4 ┆ 4.0 ┆ 121.0 ┆ 109.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 2.0 ┆ 20.090625 │
#> └──────┴─────┴───────┴───────┴───┴─────┴──────┴──────┴───────────┘
This behavior can be changed to throw an error instead.
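As a hedged sketch of how this could look: the example below assumes that the behavior is controlled by a global option named tidypolars_unknown_args that accepts "warn" (the default) or "error". Check the package's help page on global options for the name and values that apply to your installed version. The objects old and res are only used for this illustration.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidypolars)
library(polars)

# Assumption: the option is called `tidypolars_unknown_args` and accepts
# "warn" (default) or "error". Verify this in the package documentation.
old <- options(tidypolars_unknown_args = "error")

# With "error", the unknown argument `trim` should stop the evaluation,
# so we capture the error message instead of getting a DataFrame.
res <- tryCatch(
  mtcars |>
    as_polars_df() |>
    mutate(x = mean(mpg, trim = 2)),
  error = function(e) conditionMessage(e)
)
res

# Restore the previous setting
options(old)
```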
User-defined functions
User-defined functions (UDF) are more challenging. Indeed, it is technically possible to inspect the code inside a UDF, but rewriting it to match Polars syntax would be extremely complicated. In this situation, you will have to rewrite your custom function using Polars syntax so that it returns a Polars expression. For example, we could make a function to standardize a column like this:
pl_standardize <- function(x) {
  (x - x$mean()) / x$std()
}
Remember that the column name used as x will end up wrapped into pl$col(), so to check that your function returns a Polars expression, you have to provide a pl$col() call:
pl_standardize(pl$col("mpg"))
#> polars Expr: [([(col("mpg")) - (col("mpg").mean())]) // (col("mpg").std())]
This function correctly returns a Polars expression, so we can now use it like any other function:
mtcars |>
  as_polars_df() |>
  mutate(x = pl_standardize(mpg))
#> shape: (32, 12)
#> ┌──────┬─────┬───────┬───────┬───┬─────┬──────┬──────┬───────────┐
#> │ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ am ┆ gear ┆ carb ┆ x │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
#> ╞══════╪═════╪═══════╪═══════╪═══╪═════╪══════╪══════╪═══════════╡
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 4.0 ┆ 0.150885 │
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 4.0 ┆ 0.150885 │
#> │ 22.8 ┆ 4.0 ┆ 108.0 ┆ 93.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 1.0 ┆ 0.449543 │
#> │ 21.4 ┆ 6.0 ┆ 258.0 ┆ 110.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 1.0 ┆ 0.217253 │
#> │ 18.7 ┆ 8.0 ┆ 360.0 ┆ 175.0 ┆ … ┆ 0.0 ┆ 3.0 ┆ 2.0 ┆ -0.230735 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 30.4 ┆ 4.0 ┆ 95.1 ┆ 113.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 2.0 ┆ 1.710547 │
#> │ 15.8 ┆ 8.0 ┆ 351.0 ┆ 264.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 4.0 ┆ -0.711907 │
#> │ 19.7 ┆ 6.0 ┆ 145.0 ┆ 175.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 6.0 ┆ -0.064813 │
#> │ 15.0 ┆ 8.0 ┆ 301.0 ┆ 335.0 ┆ … ┆ 1.0 ┆ 5.0 ┆ 8.0 ┆ -0.844644 │
#> │ 21.4 ┆ 4.0 ┆ 121.0 ┆ 109.0 ┆ … ┆ 1.0 ┆ 4.0 ┆ 2.0 ┆ 0.217253 │
#> └──────┴─────┴───────┴───────┴───┴─────┴──────┴──────┴───────────┘
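The same pattern applies to other custom transformations. As a further sketch, the hypothetical helper pl_rescale() below (not part of tidypolars or polars) rescales a column to the [0, 1] range using only Polars expression methods, so it returns a Polars expression and can be used in mutate():

```r
library(dplyr, warn.conflicts = FALSE)
library(tidypolars)
library(polars)

# Hypothetical helper: min-max rescaling written with Polars syntax.
# `x` is expected to be a Polars expression such as pl$col("mpg").
pl_rescale <- function(x) {
  (x - x$min()) / (x$max() - x$min())
}

out <- mtcars |>
  as_polars_df() |>
  mutate(mpg_scaled = pl_rescale(mpg))
```

As before, calling pl_rescale(pl$col("mpg")) on its own is a quick way to check that the function returns a Polars expression.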
Special case: across()
across() is a very useful function that applies a function (or a list of functions) to a selection of columns. It accepts built-in functions, UDFs, and anonymous functions.
mtcars |>
  as_polars_df() |>
  mutate(
    across(
      .cols = contains("a"),
      list(mean = mean, stand = pl_standardize, ~ sd(.x))
    )
  )
#> shape: (32, 23)
#> ┌──────┬─────┬───────┬───────┬───┬──────────┬───────────┬────────────┬────────┐
#> │ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ gear_3 ┆ carb_mean ┆ carb_stand ┆ carb_3 │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
#> ╞══════╪═════╪═══════╪═══════╪═══╪══════════╪═══════════╪════════════╪════════╡
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ 0.735203 ┆ 1.6152 │
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ 0.735203 ┆ 1.6152 │
#> │ 22.8 ┆ 4.0 ┆ 108.0 ┆ 93.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ -1.122152 ┆ 1.6152 │
#> │ 21.4 ┆ 6.0 ┆ 258.0 ┆ 110.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ -1.122152 ┆ 1.6152 │
#> │ 18.7 ┆ 8.0 ┆ 360.0 ┆ 175.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ -0.503034 ┆ 1.6152 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 30.4 ┆ 4.0 ┆ 95.1 ┆ 113.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ -0.503034 ┆ 1.6152 │
#> │ 15.8 ┆ 8.0 ┆ 351.0 ┆ 264.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ 0.735203 ┆ 1.6152 │
#> │ 19.7 ┆ 6.0 ┆ 145.0 ┆ 175.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ 1.97344 ┆ 1.6152 │
#> │ 15.0 ┆ 8.0 ┆ 301.0 ┆ 335.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ 3.211677 ┆ 1.6152 │
#> │ 21.4 ┆ 4.0 ┆ 121.0 ┆ 109.0 ┆ … ┆ 0.737804 ┆ 2.8125 ┆ -0.503034 ┆ 1.6152 │
#> └──────┴─────┴───────┴───────┴───┴──────────┴───────────┴────────────┴────────┘
Similarly, UDFs and anonymous functions will error if they don’t return a Polars expression:
mtcars |>
  as_polars_df() |>
  mutate(
    across(
      .cols = contains("a"),
      .fns = list(
        mean = mean,
        function(x) {
          (x - mean(x)) / sd(x)
        },
        ~ sd(.x)
      )
    )
  )
#> Error in `mutate()`:
#> ! Could not evaluate an anonymous function in `across()`.
#> ℹ Are you sure the anonymous function returns a Polars expression?
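A possible fix, sketched here under the assumption that anonymous functions are evaluated on Polars expressions in the same way as named UDFs, is to write the body of the anonymous function with Polars syntax. The name stand given to the list element and the object out are choices made for this illustration:

```r
library(dplyr, warn.conflicts = FALSE)
library(tidypolars)
library(polars)

out <- mtcars |>
  as_polars_df() |>
  mutate(
    across(
      .cols = contains("a"),
      .fns = list(
        mean = mean,
        # Polars methods instead of mean() and sd(), so that the function
        # returns a Polars expression
        stand = function(x) {
          (x - x$mean()) / x$std()
        },
        ~ sd(.x)
      )
    )
  )
```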
List of base R and tidyverse functions supported by tidypolars
Package | Function | Notes |
---|---|---|
base | abs | |
base | acos | |
base | acosh | |
base | all | |
base | any | |
base | asin | |
base | asinh | |
base | atan | |
base | atanh | |
base | ceiling | |
base | cos | |
base | cosh | |
base | cummin | |
base | cumsum | |
base | diff | |
base | exp | |
base | floor | |
base | grepl | |
base | ifelse | |
base | ISOdatetime | |
base | length | |
base | log | |
base | log10 | |
base | max | |
base | mean | |
base | min | |
base | nchar | |
base | paste0 | |
base | paste | |
base | rank | |
base | rev | |
base | round | |
base | sin | |
base | sinh | |
base | sort | |
base | sqrt | |
base | strptime | |
base | substr | |
base | tan | |
base | tanh | |
base | tolower | |
base | toupper | |
base | unique | |
base | which.min | |
base | which.max | |
dplyr | between | |
dplyr | case_match | |
dplyr | case_when | |
dplyr | coalesce | |
dplyr | consecutive_id | |
dplyr | dense_rank | |
dplyr | first | |
dplyr | group_keys | |
dplyr | group_vars | |
dplyr | if_else | |
dplyr | lag | |
dplyr | lead | |
dplyr | last | |
dplyr | min_rank | |
dplyr | n | |
dplyr | nth | |
dplyr | n_distinct | |
dplyr | row_number | Doesn’t work when x is missing. |
lubridate | day | |
lubridate | ddays | |
lubridate | dhours | |
lubridate | dmilliseconds | |
lubridate | dminutes | |
lubridate | dseconds | |
lubridate | dweeks | |
lubridate | make_date | |
lubridate | make_datetime | In lubridate::make_datetime(), when there is an overflow (for example hours = 25), then it is automatically converted to the higher unit (for example 1 day and 1h). In Polars, this returns NA. |
lubridate | mday | |
lubridate | month | |
lubridate | quarter | |
lubridate | wday | Requires week_start == 7. If label = TRUE, it returns a string variable and not a factor as in lubridate. |
lubridate | yday | |
lubridate | year | |
stats | median | |
stats | lag | |
stats | sd | |
stats | var | |
stringr | regex | |
stringr | str_count | |
stringr | str_detect | |
stringr | str_dup | |
stringr | str_ends | |
stringr | str_extract | |
stringr | str_extract_all | |
stringr | str_length | |
stringr | str_pad | |
stringr | str_remove | |
stringr | str_remove_all | |
stringr | str_replace | |
stringr | str_replace_all | |
stringr | str_replace_na | |
stringr | str_split | |
stringr | str_split_i | |
stringr | str_squish | |
stringr | str_starts | |
stringr | str_sub | |
stringr | str_trim | |
stringr | str_to_lower | |
stringr | str_to_title | Letters following apostrophe will be capitalized as well, which differs from the stringr implementation. |
stringr | str_to_upper | |
stringr | str_trunc | |
stringr | word | |
tidyr | replace_na | |
tools | toTitleCase | Letters following apostrophe will be capitalized as well, which differs from the tools implementation. |