These functions are selection helpers. They are intended to be used inside all functions that accept a vector as argument (that is, over() and crossover() and all their variants) to extract values of a variable.
dist_values() returns all distinct values (or, in the case of factor variables, levels) of a variable x which are not NA.
seq_range() returns the sequence between the range() of a variable x.

    dist_values(x, .sep = NULL, .sort = c("asc", "desc", "none", "levels"))

    seq_range(x, .by)

Argument | Description |
---|---|
x | An atomic vector or list. For |
.sep | A character vector containing regular expression(s) which are used for splitting the values (works only if x is a character vector). |
.sort | A character string indicating which sorting scheme is to be applied to distinct values: ascending ("asc"), descending ("desc"), "none" or "levels". The default is "asc", except when x is a factor, in which case the default is "levels". |
.by | A number (or date expression) representing the increment of the sequence. |
dist_values() returns a vector of the same type as x, with the exception of factors, which are converted to type "character".
seq_range() returns a vector of type "integer" or "double".
Selection helpers can be used inside dplyover::over(), which in turn must be used inside dplyr::mutate or dplyr::summarise. Let's first attach dplyr:
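A minimal setup sketch for the examples that follow. Attaching dplyover and converting iris to a tibble are assumptions made here: the former provides over() and the selection helpers, and the latter matches the tibble printing shown in the outputs.

```r
library(dplyr)     # mutate(), summarise(), if_else() and the %>% pipe
library(dplyover)  # assumption: provides over(), dist_values() and seq_range()

# Assumption: convert iris to a tibble so results print in the compact tibble format.
iris <- as_tibble(iris)
```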
dist_values() extracts all distinct values of a column variable. This is helpful when creating dummy variables in a loop using over().

    iris %>%
      mutate(over(dist_values(Species),
                  ~ if_else(Species == .x, 1, 0)
                  ),
             .keep = "none")
    #> # A tibble: 150 x 3
    #>   setosa versicolor virginica
    #>    <dbl>      <dbl>     <dbl>
    #> 1      1          0         0
    #> 2      1          0         0
    #> 3      1          0         0
    #> 4      1          0         0
    #> # ... with 146 more rows

dist_values() is just a wrapper around unique. However, it has five differences:
(1) NA values are automatically stripped. Compare:
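A short comparison of the two calls, assuming dplyover is installed and attached. The input vector is illustrative.

```r
library(dplyover)

x <- c(1:3, NA)

# unique() keeps the missing value as a distinct element.
unique(x)
#> [1]  1  2  3 NA

# dist_values() drops it.
dist_values(x)
#> [1] 1 2 3
```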
(2) Applied to factors, dist_values() returns all distinct levels as character. Compare the following:

    fctrs <- factor(c(1:3, NA), levels = c(3:1))

    fctrs %>% unique() %>% class()
    #> [1] "factor"
    fctrs %>% dist_values() %>% class()
    #> [1] "character"

(3) By default, the output is sorted in ascending order for non-factors and in the order of the underlying "levels" for factors. This can be controlled by setting the .sort argument. Compare:

    # non-factors
    unique(c(3,1,2))
    #> [1] 3 1 2
    dist_values(c(3,1,2))
    #> [1] 1 2 3
    dist_values(c(3,1,2), .sort = "desc")
    #> [1] 3 2 1
    dist_values(c(3,1,2), .sort = "none")
    #> [1] 3 1 2

    # factors
    fctrs <- factor(c(2,1,3, NA), levels = c(3:1))
    dist_values(fctrs)
    #> [1] "3" "2" "1"
    dist_values(fctrs, .sort = "levels")
    #> [1] "3" "2" "1"
    dist_values(fctrs, .sort = "asc")
    #> [1] "1" "2" "3"
    dist_values(fctrs, .sort = "desc")
    #> [1] "3" "2" "1"
    dist_values(fctrs, .sort = "none")
    #> [1] "2" "1" "3"

(4) When used on a character vector, dist_values() can take a separator .sep to split the elements accordingly:

    c("1, 2, 3",
      "2, 4, 5",
      "4, 1, 7") %>%
      dist_values(., .sep = ", ")
    #> [1] "1" "2" "3" "4" "5" "7"

(5) When used on lists, dist_values() automatically simplifies its input into a vector using unlist:
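A small sketch of the list case, assuming dplyover is installed and attached. The list contents are illustrative, and an unnamed list is used so that no element names are carried into the result.

```r
library(dplyover)

# Three overlapping integer vectors stored in an unnamed list (illustrative data).
lst <- list(1:4, 4:6, 5:10)

# The list is flattened first, then reduced to its distinct values.
dist_values(lst)
#>  [1]  1  2  3  4  5  6  7  8  9 10
```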
seq_range() generates a numeric sequence between the min and max values of its input variable. This is helpful when creating many dummy variables with varying thresholds.

    iris %>%
      mutate(over(seq_range(Sepal.Length, 1),
                  ~ if_else(Sepal.Length > .x, 1, 0),
                  .names = "Sepal.Length.{x}"),
             .keep = "none")
    #> # A tibble: 150 x 3
    #>   Sepal.Length.5 Sepal.Length.6 Sepal.Length.7
    #>            <dbl>          <dbl>          <dbl>
    #> 1              1              0              0
    #> 2              0              0              0
    #> 3              0              0              0
    #> 4              0              0              0
    #> # ... with 146 more rows

Note that if the input variable does not have decimal places, min and max are wrapped in ceiling and floor respectively. This prevents the creation of variables that contain only 0 or 1. Compare the output below with the example above:

    iris %>%
      mutate(over(seq(round(min(Sepal.Length), 0),
                      round(max(Sepal.Length), 0),
                      1),
                  ~ if_else(Sepal.Length > .x, 1, 0),
                  .names = "Sepal.Length.{x}"),
             .keep = "none")
    #> # A tibble: 150 x 5
    #>   Sepal.Length.4 Sepal.Length.5 Sepal.Length.6 Sepal.Length.7 Sepal.Length.8
    #>            <dbl>          <dbl>          <dbl>          <dbl>          <dbl>
    #> 1              1              1              0              0              0
    #> 2              1              0              0              0              0
    #> 3              1              0              0              0              0
    #> 4              1              0              0              0              0
    #> # ... with 146 more rows

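The difference between the two examples comes down to how the endpoints of the range are rounded. The following base R sketch illustrates the arithmetic only, using the built-in iris data; it is not the internal implementation of seq_range().

```r
# Observed range of the variable: 4.3 to 7.9.
rng <- range(iris$Sepal.Length)

# Rounding inward (ceiling of min, floor of max) keeps every threshold inside the data.
inward <- seq(ceiling(rng[1]), floor(rng[2]), by = 1)
inward
#> [1] 5 6 7

# Plain rounding can push the endpoints outside the observed range.
rounded <- seq(round(rng[1]), round(rng[2]), by = 1)
rounded
#> [1] 4 5 6 7 8

# Thresholds outside the range yield constant columns:
all(iris$Sepal.Length > 4)  # every row exceeds 4, so that column is all 1
#> [1] TRUE
any(iris$Sepal.Length > 8)  # no row exceeds 8, so that column is all 0
#> [1] FALSE
```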
seq_range() also works on dates:
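A hedged sketch of the date case, assuming dplyover is installed and attached. The dates are illustrative, and the step string "1 month" assumes that .by accepts the same date expressions as the by argument of base seq() for dates.

```r
library(dplyover)

# Illustrative dates, deliberately unsorted.
some_dates <- as.Date(c("2020-01-02", "2020-05-02", "2020-03-02"))

# Assumption: .by takes a step string such as "1 month", as base seq() does for dates.
monthly <- seq_range(some_dates, "1 month")
monthly
```

Under that assumption, the result should run from the earliest to the latest date in monthly steps, comparable to seq(min(some_dates), max(some_dates), by = "1 month").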