By default, duplicates are searched for across all variables. You can restrict the search to a subset of variables, and you can choose to keep the first occurrence, keep the last occurrence, or remove all duplicated rows.

Usage

# S3 method for RPolarsDataFrame
distinct(.data, ..., keep = "first", maintain_order = TRUE)

# S3 method for RPolarsLazyFrame
distinct(.data, ..., keep = "first", maintain_order = TRUE)

duplicated_rows(.data, ...)

Arguments

.data

A Polars Data/LazyFrame

...

Any expression accepted by dplyr::select(): variable names, column numbers, select helpers, etc.

keep

Either "first" (keep the first occurrence of the duplicated row), "last" (keep the last occurrence) or "none" (remove all occurrences of duplicated rows).

maintain_order

Whether to maintain row order. This is the default, but it can slow down processing on large datasets and it prevents the use of streaming.
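
The keep and maintain_order arguments described above can be sketched as follows. This is a minimal illustration, not output taken from the package documentation: it assumes that dplyr is attached to provide the distinct() generic, and it reuses the toy data from the Examples section. The object names keep_last, keep_none and unordered are hypothetical.

```r
library(dplyr, warn.conflicts = FALSE)  # assumed to provide the distinct() generic
library(tidypolars)                     # provides the Polars methods

# Same toy data as in the Examples section
pl_test <- polars::pl$DataFrame(
  iso_o = c(rep(c("AA", "AB"), each = 2), "AC", "DC"),
  iso_d = rep(c("BA", "BB", "BC"), each = 2),
  value = c(2, 2, 3, 4, 5, 6)
)

# Keep the last row of each iso_o group instead of the first
keep_last <- distinct(pl_test, iso_o, keep = "last")

# Drop every row whose iso_o value appears more than once
keep_none <- distinct(pl_test, iso_o, keep = "none")

# Give up the row-order guarantee; the rows kept are unique on iso_o,
# but their order in the result is no longer guaranteed
unordered <- distinct(pl_test, iso_o, maintain_order = FALSE)
```

With this data, "AA" and "AB" each appear twice in iso_o, so keep = "none" leaves only the "AC" and "DC" rows, while keep = "last" retains the rows with value 2 (the second "AA" row) and 4 for those two groups.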

Examples

pl_test <- polars::pl$DataFrame(
  iso_o = c(rep(c("AA", "AB"), each = 2), "AC", "DC"),
  iso_d = rep(c("BA", "BB", "BC"), each = 2),
  value = c(2, 2, 3, 4, 5, 6)
)

distinct(pl_test)
#> shape: (5, 3)
#> ┌───────┬───────┬───────┐
#> │ iso_o ┆ iso_d ┆ value │
#> │ ---   ┆ ---   ┆ ---   │
#> │ str   ┆ str   ┆ f64   │
#> ╞═══════╪═══════╪═══════╡
#> │ AA    ┆ BA    ┆ 2.0   │
#> │ AB    ┆ BB    ┆ 3.0   │
#> │ AB    ┆ BB    ┆ 4.0   │
#> │ AC    ┆ BC    ┆ 5.0   │
#> │ DC    ┆ BC    ┆ 6.0   │
#> └───────┴───────┴───────┘
distinct(pl_test, iso_o)
#> shape: (4, 3)
#> ┌───────┬───────┬───────┐
#> │ iso_o ┆ iso_d ┆ value │
#> │ ---   ┆ ---   ┆ ---   │
#> │ str   ┆ str   ┆ f64   │
#> ╞═══════╪═══════╪═══════╡
#> │ AA    ┆ BA    ┆ 2.0   │
#> │ AB    ┆ BB    ┆ 3.0   │
#> │ AC    ┆ BC    ┆ 5.0   │
#> │ DC    ┆ BC    ┆ 6.0   │
#> └───────┴───────┴───────┘

duplicated_rows(pl_test)
#> shape: (2, 3)
#> ┌───────┬───────┬───────┐
#> │ iso_o ┆ iso_d ┆ value │
#> │ ---   ┆ ---   ┆ ---   │
#> │ str   ┆ str   ┆ f64   │
#> ╞═══════╪═══════╪═══════╡
#> │ AA    ┆ BA    ┆ 2.0   │
#> │ AA    ┆ BA    ┆ 2.0   │
#> └───────┴───────┴───────┘
duplicated_rows(pl_test, iso_o, iso_d)
#> shape: (4, 3)
#> ┌───────┬───────┬───────┐
#> │ iso_o ┆ iso_d ┆ value │
#> │ ---   ┆ ---   ┆ ---   │
#> │ str   ┆ str   ┆ f64   │
#> ╞═══════╪═══════╪═══════╡
#> │ AA    ┆ BA    ┆ 2.0   │
#> │ AA    ┆ BA    ┆ 2.0   │
#> │ AB    ┆ BB    ┆ 3.0   │
#> │ AB    ┆ BB    ┆ 4.0   │
#> └───────┴───────┴───────┘
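
The Usage section also lists a method for lazy frames, which the examples above do not exercise. The following is a hedged sketch of how that might look, assuming the $lazy() and $collect() methods of the polars R package and that dplyr is attached for the distinct() generic. The names lazy_distinct and result are hypothetical.

```r
library(dplyr, warn.conflicts = FALSE)  # assumed to provide the distinct() generic
library(tidypolars)

pl_test <- polars::pl$DataFrame(
  iso_o = c(rep(c("AA", "AB"), each = 2), "AC", "DC"),
  iso_d = rep(c("BA", "BB", "BC"), each = 2),
  value = c(2, 2, 3, 4, 5, 6)
)

# Build a lazy query: nothing is computed at this point.
# maintain_order = FALSE is used here because, per the argument
# description, maintaining order prevents the use of streaming.
lazy_distinct <- pl_test$lazy() |>
  distinct(iso_o, iso_d, maintain_order = FALSE)

# Run the query and materialise the result as a DataFrame
result <- lazy_distinct$collect()
```

Because row order is not maintained here, the four unique (iso_o, iso_d) combinations may come back in any order.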