Mutating joins — left_join.RPolarsDataFrame • tidypolars

Mutating joins add columns from y to x, matching observations based on the keys.

Usage

# S3 method for class 'RPolarsDataFrame'
left_join(
  x,
  y,
  by = NULL,
  copy = NULL,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = "na",
  relationship = NULL
)

# S3 method for class 'RPolarsDataFrame'
right_join(
  x,
  y,
  by = NULL,
  copy = NULL,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = "na",
  relationship = NULL
)

# S3 method for class 'RPolarsDataFrame'
full_join(
  x,
  y,
  by = NULL,
  copy = NULL,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = "na",
  relationship = NULL
)

# S3 method for class 'RPolarsDataFrame'
inner_join(
  x,
  y,
  by = NULL,
  copy = NULL,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = "na",
  relationship = NULL
)

# S3 method for class 'RPolarsLazyFrame'
left_join(
  x,
  y,
  by = NULL,
  copy = NULL,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = "na",
  relationship = NULL
)

# S3 method for class 'RPolarsLazyFrame'
right_join(
  x,
  y,
  by = NULL,
  copy = NULL,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = "na",
  relationship = NULL
)

# S3 method for class 'RPolarsLazyFrame'
full_join(
  x,
  y,
  by = NULL,
  copy = NULL,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = "na",
  relationship = NULL
)

# S3 method for class 'RPolarsLazyFrame'
inner_join(
  x,
  y,
  by = NULL,
  copy = NULL,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL,
  na_matches = "na",
  relationship = NULL
)

Arguments

x, y

Two Polars DataFrames or LazyFrames.

by

Variables to join by. If NULL, the default, *_join() will perform a natural join, using all variables in common across x and y. A message lists the variables so that you can check they're correct; suppress the message by supplying by explicitly.

by can take a character vector, like c("x", "y"), if x and y are in both datasets. To join on variables that don't have the same name, use named elements in the character vector, like c("x1" = "x2", "y"). If you use a character vector, the join can only be done using strict equality.

Finally, by can be a specification created by dplyr::join_by(). Unlike the character vector input shown above, join_by() uses unquoted column names, e.g. join_by(x1 == x2, y). In dplyr, join_by() accepts equality and inequality operators such as == and >. For now, only equality operators are supported here.

copy, keep

Not used.

suffix

If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2.

...

Not used.

na_matches

Should two NA values match?

"na", the default, treats two NA values as equal.
"never" treats two NA values as different and will never match them together or to any other values.

Note that when joining Polars Data/LazyFrames, NaN values are always considered equal, whatever the value of na_matches. This differs from the original dplyr implementation.
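
As a minimal sketch of the two na_matches settings described above, the following joins two small frames that each contain one missing key. The frames a and b and their column names are hypothetical, and the code assumes a version of polars and tidypolars in which pl$DataFrame() returns an RPolarsDataFrame. Under the documented behaviour, the default should keep the row with the missing key while "never" should drop it.

```r
library(dplyr)
library(polars)
library(tidypolars)

# Hypothetical data: each frame has one missing key
a <- pl$DataFrame(key = c(1, NA), val_a = c("a1", "a2"))
b <- pl$DataFrame(key = c(1, NA), val_b = c("b1", "b2"))

# Default: the two missing keys are matched with each other
matched <- inner_join(a, b, by = "key", na_matches = "na")

# "never": a missing key does not match anything
unmatched <- inner_join(a, b, by = "key", na_matches = "never")

# Compare the number of rows kept by each setting
matched$height
unmatched$height
```

An inner join is used here because it makes the difference visible in the row count; with a left join both results would keep every row of a and differ only in whether val_b is filled in.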

relationship

Handling of the expected relationship between the keys of x and y. Must be one of the following:

NULL, the default, is equivalent to "many-to-many". It doesn't expect any relationship between x and y.
"one-to-one" expects each row in x to match at most 1 row in y and each row in y to match at most 1 row in x.
"one-to-many" expects each row in y to match at most 1 row in x.
"many-to-one" expects each row in x to match at most 1 row in y.
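
The "many-to-one" case can be sketched by placing the frame with repeated keys on the left. The data are small illustrative frames, and the code assumes a version of polars and tidypolars in which pl$DataFrame() returns an RPolarsDataFrame. The second call deliberately states the wrong expectation to illustrate that the check is directional.

```r
library(dplyr)
library(polars)
library(tidypolars)

country <- pl$DataFrame(iso = c("FRA", "DEU"), value = 1:2)
country_year <- pl$DataFrame(
  iso = rep(c("FRA", "DEU"), each = 2),
  year = rep(2019:2020, 2),
  value2 = 3:6
)

# x = country_year, y = country: several rows of x share a single row of y,
# so "many-to-one" is the matching expectation
res <- left_join(country_year, country, by = "iso", relationship = "many-to-one")

# "one-to-many" expects each row in y to match at most 1 row in x,
# which does not hold here, so the join is expected to raise an error
err <- tryCatch(
  left_join(country_year, country, by = "iso", relationship = "one-to-many"),
  error = function(e) e
)

res$height
inherits(err, "error")
```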

Examples

test <- polars::pl$DataFrame(
  x = c(1, 2, 3),
  y1 = c(1, 2, 3),
  z = c(1, 2, 3)
)

test2 <- polars::pl$DataFrame(
  x = c(1, 2, 4),
  y2 = c(1, 2, 4),
  z2 = c(4, 5, 7)
)

test
#> shape: (3, 3)
#> ┌─────┬─────┬─────┐
#> │ x   ┆ y1  ┆ z   │
#> │ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 │
#> ╞═════╪═════╪═════╡
#> │ 1.0 ┆ 1.0 ┆ 1.0 │
#> │ 2.0 ┆ 2.0 ┆ 2.0 │
#> │ 3.0 ┆ 3.0 ┆ 3.0 │
#> └─────┴─────┴─────┘

test2
#> shape: (3, 3)
#> ┌─────┬─────┬─────┐
#> │ x   ┆ y2  ┆ z2  │
#> │ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 │
#> ╞═════╪═════╪═════╡
#> │ 1.0 ┆ 1.0 ┆ 4.0 │
#> │ 2.0 ┆ 2.0 ┆ 5.0 │
#> │ 4.0 ┆ 4.0 ┆ 7.0 │
#> └─────┴─────┴─────┘

# default is to use common columns, here "x" only
left_join(test, test2)
#> Joining by `x`
#> shape: (3, 5)
#> ┌─────┬─────┬─────┬──────┬──────┐
#> │ x   ┆ y1  ┆ z   ┆ y2   ┆ z2   │
#> │ --- ┆ --- ┆ --- ┆ ---  ┆ ---  │
#> │ f64 ┆ f64 ┆ f64 ┆ f64  ┆ f64  │
#> ╞═════╪═════╪═════╪══════╪══════╡
#> │ 1.0 ┆ 1.0 ┆ 1.0 ┆ 1.0  ┆ 4.0  │
#> │ 2.0 ┆ 2.0 ┆ 2.0 ┆ 2.0  ┆ 5.0  │
#> │ 3.0 ┆ 3.0 ┆ 3.0 ┆ null ┆ null │
#> └─────┴─────┴─────┴──────┴──────┘

# we can specify the columns on which to join with join_by()...
left_join(test, test2, by = join_by(x, y1 == y2))
#> shape: (3, 4)
#> ┌─────┬─────┬─────┬──────┐
#> │ x   ┆ y1  ┆ z   ┆ z2   │
#> │ --- ┆ --- ┆ --- ┆ ---  │
#> │ f64 ┆ f64 ┆ f64 ┆ f64  │
#> ╞═════╪═════╪═════╪══════╡
#> │ 1.0 ┆ 1.0 ┆ 1.0 ┆ 4.0  │
#> │ 2.0 ┆ 2.0 ┆ 2.0 ┆ 5.0  │
#> │ 3.0 ┆ 3.0 ┆ 3.0 ┆ null │
#> └─────┴─────┴─────┴──────┘

# ... or with a character vector
left_join(test, test2, by = c("x", "y1" = "y2"))
#> shape: (3, 4)
#> ┌─────┬─────┬─────┬──────┐
#> │ x   ┆ y1  ┆ z   ┆ z2   │
#> │ --- ┆ --- ┆ --- ┆ ---  │
#> │ f64 ┆ f64 ┆ f64 ┆ f64  │
#> ╞═════╪═════╪═════╪══════╡
#> │ 1.0 ┆ 1.0 ┆ 1.0 ┆ 4.0  │
#> │ 2.0 ┆ 2.0 ┆ 2.0 ┆ 5.0  │
#> │ 3.0 ┆ 3.0 ┆ 3.0 ┆ null │
#> └─────┴─────┴─────┴──────┘

# we can customize the suffix of common column names not used to join
test2 <- polars::pl$DataFrame(
  x = c(1, 2, 4),
  y1 = c(1, 2, 4),
  z = c(4, 5, 7)
)

left_join(test, test2, by = "x", suffix = c("_left", "_right"))
#> shape: (3, 5)
#> ┌─────┬─────────┬────────┬──────────┬─────────┐
#> │ x   ┆ y1_left ┆ z_left ┆ y1_right ┆ z_right │
#> │ --- ┆ ---     ┆ ---    ┆ ---      ┆ ---     │
#> │ f64 ┆ f64     ┆ f64    ┆ f64      ┆ f64     │
#> ╞═════╪═════════╪════════╪══════════╪═════════╡
#> │ 1.0 ┆ 1.0     ┆ 1.0    ┆ 1.0      ┆ 4.0     │
#> │ 2.0 ┆ 2.0     ┆ 2.0    ┆ 2.0      ┆ 5.0     │
#> │ 3.0 ┆ 3.0     ┆ 3.0    ┆ null     ┆ null    │
#> └─────┴─────────┴────────┴──────────┴─────────┘
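
The examples on this page only call left_join(). As a rough sketch of how the four documented join types differ, the block below applies each of them to the same pair of frames. The names left_tbl and right_tbl are hypothetical, and the code assumes a version of polars and tidypolars in which pl$DataFrame() returns an RPolarsDataFrame.

```r
library(dplyr)
library(polars)
library(tidypolars)

# Keys 1 and 2 are shared; 3 exists only on the left, 4 only on the right
left_tbl <- pl$DataFrame(x = c(1, 2, 3), y1 = c(1, 2, 3))
right_tbl <- pl$DataFrame(x = c(1, 2, 4), y2 = c(1, 2, 4))

# Only the rows whose key appears in both frames
res_inner <- inner_join(left_tbl, right_tbl, by = "x")

# All rows of left_tbl, with nulls where right_tbl has no match
res_left <- left_join(left_tbl, right_tbl, by = "x")

# All rows of right_tbl, with nulls where left_tbl has no match
res_right <- right_join(left_tbl, right_tbl, by = "x")

# All rows of both frames
res_full <- full_join(left_tbl, right_tbl, by = "x")

# Row counts of the four results
c(res_inner$height, res_left$height, res_right$height, res_full$height)
```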

# the argument "relationship" ensures the join matches the expectation
country <- polars::pl$DataFrame(
  iso = c("FRA", "DEU"),
  value = 1:2
)
country
#> shape: (2, 2)
#> ┌─────┬───────┐
#> │ iso ┆ value │
#> │ --- ┆ ---   │
#> │ str ┆ i32   │
#> ╞═════╪═══════╡
#> │ FRA ┆ 1     │
#> │ DEU ┆ 2     │
#> └─────┴───────┘

country_year <- polars::pl$DataFrame(
  iso = rep(c("FRA", "DEU"), each = 2),
  year = rep(2019:2020, 2),
  value2 = 3:6
)
country_year
#> shape: (4, 3)
#> ┌─────┬──────┬────────┐
#> │ iso ┆ year ┆ value2 │
#> │ --- ┆ ---  ┆ ---    │
#> │ str ┆ i32  ┆ i32    │
#> ╞═════╪══════╪════════╡
#> │ FRA ┆ 2019 ┆ 3      │
#> │ FRA ┆ 2020 ┆ 4      │
#> │ DEU ┆ 2019 ┆ 5      │
#> │ DEU ┆ 2020 ┆ 6      │
#> └─────┴──────┴────────┘

# We expect each row in "x" to match only one row in "y", but this is not
# true: each row of "x" matches two rows of "y"
tryCatch(
  left_join(country, country_year, join_by(iso), relationship = "one-to-one"),
  error = function(e) e
)
#> <RPolarsErr_error: Execution halted with the following contexts
#>    0: In R: in $collect():
#>    0: During function call [pkgdown::build_site_github_pages(new_process = FALSE, install = TRUE)]
#>    1: Encountered the following error in Rust-Polars:
#>       	join keys did not fulfill 1:1 validation
#> >

# A correct expectation would be "one-to-many":
left_join(country, country_year, join_by(iso), relationship = "one-to-many")
#> shape: (4, 4)
#> ┌─────┬───────┬──────┬────────┐
#> │ iso ┆ value ┆ year ┆ value2 │
#> │ --- ┆ ---   ┆ ---  ┆ ---    │
#> │ str ┆ i32   ┆ i32  ┆ i32    │
#> ╞═════╪═══════╪══════╪════════╡
#> │ FRA ┆ 1     ┆ 2019 ┆ 3      │
#> │ FRA ┆ 1     ┆ 2020 ┆ 4      │
#> │ DEU ┆ 2     ┆ 2019 ┆ 5      │
#> │ DEU ┆ 2     ┆ 2020 ┆ 6      │
#> └─────┴───────┴──────┴────────┘
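
The Usage section also lists methods for RPolarsLazyFrame, which the examples above do not exercise. A minimal sketch, assuming pl$LazyFrame() and the $collect() method are available in the installed version of polars, and using hypothetical object names:

```r
library(dplyr)
library(polars)
library(tidypolars)

country_lazy <- pl$LazyFrame(iso = c("FRA", "DEU"), value = 1:2)
country_year_lazy <- pl$LazyFrame(
  iso = rep(c("FRA", "DEU"), each = 2),
  year = rep(2019:2020, 2),
  value2 = 3:6
)

# The join is only added to the query plan; nothing is computed yet
lazy_res <- left_join(
  country_lazy,
  country_year_lazy,
  by = "iso",
  relationship = "one-to-many"
)

# Run the query and materialize the result as a Polars DataFrame
res <- lazy_res$collect()
res$height
```

Because the query runs only when it is collected, a violated relationship expectation on LazyFrames would presumably surface at that point rather than at the call to left_join().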