These functions are selection helpers. They are intended to be used inside functions that accept a vector as an argument (that is, over(), crossover() and all their variants) to extract values of a variable.

  • dist_values() returns all distinct values (or in the case of factor variables: levels) of a variable x which are not NA.

  • seq_range() returns a sequence spanning the range() of a variable x, that is, from its minimum to its maximum.

dist_values(x, .sep = NULL, .sort = c("asc", "desc", "none", "levels"))

seq_range(x, .by)

Arguments

x

An atomic vector or list. For seq_range(), x must be numeric or a date.

.sep

A character vector containing regular expression(s) which are used for splitting the values (works only if x is a character vector).

.sort

A character string indicating which sorting scheme is applied to the distinct values: ascending ("asc"), descending ("desc"), "none" or "levels". The default is "asc", unless x is a factor, in which case the default is "levels".

.by

A number (or date expression) representing the increment of the sequence.

Value

dist_values() returns a vector of the same type as x, with the exception of factors, which are converted to type "character".

seq_range() returns a vector of type "integer" or "double".

Examples

Selection helpers can be used inside dplyover::over(), which in turn must be used inside dplyr::mutate() or dplyr::summarise(). Let's first attach dplyr:

library(dplyr)

# For better printing
iris <- as_tibble(iris)

dist_values() extracts all distinct values of a column variable. This is helpful when creating dummy variables in a loop using over().

iris %>%
  mutate(over(dist_values(Species),
              ~ if_else(Species == .x, 1, 0)
              ),
         .keep = "none")
#> # A tibble: 150 x 3
#>   setosa versicolor virginica
#>    <dbl>      <dbl>     <dbl>
#> 1      1          0         0
#> 2      1          0         0
#> 3      1          0         0
#> 4      1          0         0
#> # ... with 146 more rows
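
To make explicit what the loop above computes, here is a minimal base R sketch of the same dummy-variable construction. It does not call over() or dist_values(); it only approximates them, and `species`, `lvls` and `dummies` are illustrative names introduced here. The built-in data set is accessed as datasets::iris so the sketch does not depend on the tibble conversion above.

```r
# Illustrative base R sketch (not dplyover's implementation).
species <- datasets::iris$Species

# Distinct, non-NA values as character, in level order
lvls <- levels(droplevels(species))

# One 0/1 column per distinct value; sapply() names the columns after `lvls`
dummies <- sapply(lvls, function(l) ifelse(species == l, 1, 0))

colnames(dummies)
#> [1] "setosa"     "versicolor" "virginica"
head(dummies, 4)
```

Every row of `dummies` contains exactly one 1, because each observation belongs to exactly one species.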

dist_values() is just a wrapper around unique(). However, it differs in five ways:

(1) NA values are automatically stripped. Compare:

unique(c(1:3, NA))
#> [1]  1  2  3 NA
dist_values(c(1:3, NA))
#> [1] 1 2 3

(2) Applied to factors, dist_values() returns all distinct levels as character. Compare the following:

fctrs <- factor(c(1:3, NA), levels = c(3:1))

fctrs %>% unique() %>% class()
#> [1] "factor"

fctrs %>% dist_values() %>% class()
#> [1] "character"

(3) By default, the output is sorted in ascending order for non-factors, and in the order of the underlying "levels" for factors. This can be controlled by setting the .sort argument. Compare:

# non-factors
unique(c(3,1,2))
#> [1] 3 1 2

dist_values(c(3,1,2))
#> [1] 1 2 3
dist_values(c(3,1,2), .sort = "desc")
#> [1] 3 2 1
dist_values(c(3,1,2), .sort = "none")
#> [1] 3 1 2

# factors
fctrs <- factor(c(2,1,3, NA), levels = c(3:1))

dist_values(fctrs)
#> [1] "3" "2" "1"
dist_values(fctrs, .sort = "levels")
#> [1] "3" "2" "1"
dist_values(fctrs, .sort = "asc")
#> [1] "1" "2" "3"
dist_values(fctrs, .sort = "desc")
#> [1] "3" "2" "1"
dist_values(fctrs, .sort = "none")
#> [1] "2" "1" "3"
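
The four sorting schemes for factors can be reproduced with base R alone. The following is a rough sketch that mirrors the documented outputs above; it is not dplyover's implementation, and `vals`, `by_levels`, `by_none`, `by_asc` and `by_desc` are names made up for this illustration.

```r
# Illustrative base R sketch of the documented .sort behaviour for factors.
fctrs <- factor(c(2, 1, 3, NA), levels = c(3:1))

# Strip NA values first, as dist_values() is described to do
vals <- fctrs[!is.na(fctrs)]

by_levels <- levels(droplevels(vals))        # order of the underlying levels
by_none   <- as.character(unique(vals))      # order of first appearance
by_asc    <- sort(by_none)                   # ascending
by_desc   <- sort(by_none, decreasing = TRUE) # descending

by_levels
#> [1] "3" "2" "1"
by_none
#> [1] "2" "1" "3"
by_asc
#> [1] "1" "2" "3"
by_desc
#> [1] "3" "2" "1"
```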

(4) When used on a character vector, dist_values() can take a separator .sep to split the elements accordingly:

c("1, 2, 3",
  "2, 4, 5",
  "4, 1, 7") %>%
  dist_values(., .sep = ", ")
#> [1] "1" "2" "3" "4" "5" "7"
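
For comparison, a base R sketch of the same split-then-deduplicate step, assuming the separator is passed to strsplit() as a regular expression. The names `x` and `parts` are illustrative only.

```r
# Illustrative base R sketch: split each string, flatten, deduplicate, sort.
x <- c("1, 2, 3",
       "2, 4, 5",
       "4, 1, 7")

parts <- unlist(strsplit(x, ", "))

sort(unique(parts))
#> [1] "1" "2" "3" "4" "5" "7"
```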

(5) When used on lists, dist_values() automatically simplifies its input into a vector using unlist():

list(a = c(1:4), b = (4:6), c(5:10)) %>%
  dist_values()
#>  [1]  1  2  3  4  5  6  7  8  9 10
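
The list case can be written out by hand in base R as follows. This is a sketch of the described behaviour, not the package's code; `lst` and `flat` are hypothetical names.

```r
# Illustrative base R sketch: flatten the list, then take sorted distinct values.
lst <- list(a = c(1:4), b = (4:6), c(5:10))

# use.names = FALSE drops the element names that unlist() would otherwise attach
flat <- unlist(lst, use.names = FALSE)

sort(unique(flat))
#>  [1]  1  2  3  4  5  6  7  8  9 10
```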

seq_range() generates a numeric sequence between the min and max values of its input variable. This is helpful when creating many dummy variables with varying thresholds.

iris %>%
  mutate(over(seq_range(Sepal.Length, 1),
              ~ if_else(Sepal.Length > .x, 1, 0),
              .names = "Sepal.Length.{x}"),
         .keep = "none")
#> # A tibble: 150 x 3
#>   Sepal.Length.5 Sepal.Length.6 Sepal.Length.7
#>            <dbl>          <dbl>          <dbl>
#> 1              1              0              0
#> 2              0              0              0
#> 3              0              0              0
#> 4              0              0              0
#> # ... with 146 more rows

Note that when .by has no decimal places, as in the example above, min and max are wrapped in ceiling() and floor() respectively, so the sequence runs from 5 to 7 although Sepal.Length ranges from 4.3 to 7.9. This prevents the creation of variables that contain only 0 or only 1. Compare the output below with the example above:

iris %>%
  mutate(over(seq(round(min(Sepal.Length), 0),
                  round(max(Sepal.Length), 0),
                  1),
              ~ if_else(Sepal.Length > .x, 1, 0),
              .names = "Sepal.Length.{x}"),
         .keep = "none")
#> # A tibble: 150 x 5
#>   Sepal.Length.4 Sepal.Length.5 Sepal.Length.6 Sepal.Length.7 Sepal.Length.8
#>            <dbl>          <dbl>          <dbl>          <dbl>          <dbl>
#> 1              1              1              0              0              0
#> 2              1              0              0              0              0
#> 3              1              0              0              0              0
#> 4              1              0              0              0              0
#> # ... with 146 more rows
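
The difference between the two sets of thresholds can be checked directly in base R. The sketch below assumes that the endpoints are obtained with ceiling(min(x)) and floor(max(x)), which matches the output shown earlier but is not taken from the package source; `inward` and `rounded` are illustrative names.

```r
# Illustrative base R sketch comparing inward-rounded and round()ed endpoints.
x <- datasets::iris$Sepal.Length

range(x)
#> [1] 4.3 7.9

inward  <- seq(ceiling(min(x)), floor(max(x)), by = 1)
rounded <- seq(round(min(x), 0), round(max(x), 0), by = 1)

inward
#> [1] 5 6 7
rounded
#> [1] 4 5 6 7 8

# The two extra thresholds produce constant columns:
all(x > 4)   # every value exceeds 4, so that dummy is always 1
#> [1] TRUE
any(x > 8)   # no value exceeds 8, so that dummy is always 0
#> [1] FALSE
```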

seq_range() also works on dates:

some_dates <- c(as.Date("2020-01-02"),
                as.Date("2020-05-02"),
                as.Date("2020-03-02"))


some_dates %>%
  seq_range(., "1 month")
#> [1] "2020-01-02" "2020-02-02" "2020-03-02" "2020-04-02" "2020-05-02"
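
The date case corresponds to calling base R's seq() on the minimum and maximum date, as the following sketch shows. Whether seq_range() performs exactly this call internally is an assumption; `dates` is an illustrative name.

```r
# Illustrative base R sketch: a monthly sequence from the earliest to the latest date.
dates <- as.Date(c("2020-01-02", "2020-05-02", "2020-03-02"))

seq(min(dates), max(dates), by = "1 month")
#> [1] "2020-01-02" "2020-02-02" "2020-03-02" "2020-04-02" "2020-05-02"
```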