These functions are selection helpers. They are intended to be used inside over() to extract parts or patterns of the column names of the underlying data.
cut_names() selects strings by removing (cutting off) the specified .pattern. This functionality resembles stringr::str_remove_all().
extract_names() selects strings by extracting the specified .pattern. This functionality resembles stringr::str_extract().
cut_names(.pattern, .remove = NULL, .vars = NULL)
extract_names(.pattern, .remove = NULL, .vars = NULL)
Argument | Description |
---|---|
.pattern | Pattern to look for. |
.remove | Pattern to remove from the variable names provided in |
.vars | A character vector with variable names. When used inside |
A character vector.
Selection helpers can be used inside dplyover::over(), which in turn must be used inside dplyr::mutate or dplyr::summarise. Let's first attach dplyr (and stringr for comparison):
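The attach step could look like the following. Loading dplyover is an assumption here, made because the examples call cut_names() and extract_names() without a namespace prefix; all three packages are assumed to be installed.

```r
library(dplyr)     # mutate(), summarise() and the %>% pipe
library(stringr)   # str_remove_all() and str_extract(), used for comparison
library(dplyover)  # assumption: provides over(), cut_names(), extract_names()
```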
Now let's compare cut_names() and extract_names() to their stringr equivalents, stringr::str_remove_all() and stringr::str_extract().
We can observe two main differences:
(1) cut_names() and extract_names() only return strings where the function was applied successfully, that is, where characters have actually been removed or extracted. stringr::str_remove_all() returns unmatched strings as is, while stringr::str_extract() returns NA.
cut_names("Width", .vars = names(iris))
#> [1] "Sepal." "Petal."
str_remove_all(names(iris), "Width")
#> [1] "Sepal.Length" "Sepal." "Petal.Length" "Petal." "Species"
extract_names("Length|Width", .vars = names(iris))
#> [1] "Length" "Width"
str_extract(rep(names(iris), 2), "Length|Width")
#> [1] "Length" "Width" "Length" "Width" NA "Length" "Width" "Length" "Width"
#> [10] NA
(2) cut_names() and extract_names() return only unique values:
cut_names("Width", .vars = rep(names(iris), 2))
#> [1] "Sepal." "Petal."
str_remove_all(rep(names(iris), 2), "Width")
#> [1] "Sepal.Length" "Sepal." "Petal.Length" "Petal." "Species"
#> [6] "Sepal.Length" "Sepal." "Petal.Length" "Petal." "Species"
extract_names("Length|Width", .vars = names(iris))
#> [1] "Length" "Width"
str_extract(rep(names(iris), 2), "Length|Width")
#> [1] "Length" "Width" "Length" "Width" NA "Length" "Width" "Length" "Width"
#> [10] NA
The examples above do not show that cut_names() removes every occurrence of the .pattern argument within a name, while extract_names() extracts only the first match:
cut_names("Width", .vars = "Width.Petal.Width")
#> [1] ".Petal."
str_remove_all("Width.Petal.Width", "Width")
#> [1] ".Petal."
extract_names("Width", .vars = "Width.Petal.Width")
#> [1] "Width"
str_extract("Width.Petal.Width", "Width")
#> [1] "Width"
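The three behaviours described above (drop non-matching names, deduplicate, remove all matches versus extract the first one) can be approximated in base R. The functions cut_names_sketch() and extract_names_sketch() below are hypothetical stand-ins written for illustration; they are not dplyover's implementation and ignore the .remove argument.

```r
# Hypothetical base-R stand-ins; not dplyover's actual implementation.
cut_names_sketch <- function(pattern, vars) {
  hit <- grepl(pattern, vars)            # keep only names that contain the pattern
  unique(gsub(pattern, "", vars[hit]))   # remove every match, then deduplicate
}

extract_names_sketch <- function(pattern, vars) {
  m <- regexpr(pattern, vars)            # first match per name, -1 if there is none
  unique(regmatches(vars, m))            # non-matching names are dropped here
}

cut_names_sketch("Width", rep(names(iris), 2))
#> [1] "Sepal." "Petal."
extract_names_sketch("Length|Width", names(iris))
#> [1] "Length" "Width"
cut_names_sketch("Width", "Width.Petal.Width")
#> [1] ".Petal."
extract_names_sketch("Width", "Width.Petal.Width")
#> [1] "Width"
```

The sketch uses gsub() for cutting and regexpr() for extracting, which mirrors the "all matches" versus "first match" distinction between str_remove_all() and str_extract().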
Within over(), cut_names() and extract_names() automatically use the column names of the underlying data:
iris %>%
  mutate(over(cut_names(".Width"),
              ~ .("{.x}.Width") * .("{.x}.Length"),
              .names = "Product_{x}"))
#> # A tibble: 150 x 7
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Product_Sepal
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>           <dbl>
#> 1          5.1         3.5          1.4         0.2 setosa           17.8
#> 2          4.9         3            1.4         0.2 setosa           14.7
#> 3          4.7         3.2          1.3         0.2 setosa           15.0
#> 4          4.6         3.1          1.5         0.2 setosa           14.3
#> # ... with 146 more rows, and 1 more variable: Product_Petal <dbl>
iris %>%
  mutate(over(extract_names("Length|Width"),
              ~ .("Petal.{.x}") * .("Sepal.{.x}"),
              .names = "Product_{x}"))
#> # A tibble: 150 x 7
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Product_Length
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>            <dbl>
#> 1          5.1         3.5          1.4         0.2 setosa            7.14
#> 2          4.9         3            1.4         0.2 setosa            6.86
#> 3          4.7         3.2          1.3         0.2 setosa            6.11
#> 4          4.6         3.1          1.5         0.2 setosa            6.9
#> # ... with 146 more rows, and 1 more variable: Product_Width <dbl>
What problem does cut_names() solve?
In the example above, using cut_names() might not seem helpful, since we could easily use c("Sepal", "Petal") instead. However, there are cases where the data contains many similar pairs of variables sharing a common prefix or suffix. If we want to loop over them using over(), cut_names() comes in handy.
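To make the prefix/suffix case concrete, here is a minimal base-R sketch. The data frame df and its _pre/_post columns are invented for illustration, and the stem extraction only approximates what cut_names("_pre$") is described as doing; the loop stands in for the iteration that over() would perform.

```r
# Hypothetical data: three measures, each recorded before and after a treatment.
df <- data.frame(
  height_pre = c(1, 2),   height_post = c(2, 4),
  weight_pre = c(10, 20), weight_post = c(11, 22),
  score_pre  = c(5, 6),   score_post  = c(7, 9)
)

# Approximation of cut_names("_pre$"): keep matching names, cut off the suffix.
is_pre <- grepl("_pre$", names(df))
stems <- unique(sub("_pre$", "", names(df)[is_pre]))
stems
#> [1] "height" "weight" "score"

# Loop over the stems and build one difference column per variable pair.
for (s in stems) {
  df[[paste0(s, "_diff")]] <- df[[paste0(s, "_post")]] - df[[paste0(s, "_pre")]]
}
df[paste0(stems, "_diff")]
```

With only three pairs, typing the stems by hand would be just as quick; the benefit of deriving them from the column names grows with the number of pairs.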
The usage of extract_names() might be less obvious. Let's look at raw data from a customer satisfaction survey which contains the following variables.
csatraw %>% glimpse(width = 50)
#> Rows: 150
#> Columns: 15
#> $ cust_id    <chr> "61297", "07545", "03822", "8~
#> $ type       <chr> "existing", "existing", "exis~
#> $ product    <chr> "advanced", "advanced", "prem~
#> $ item1      <dbl> 3, 2, 2, 4, 4, 3, 1, 3, 3, 2,~
#> $ item1_open <chr> "12", "22", "21, 22, 23", "12~
#> $ item2a     <dbl> 2, 2, 2, 3, 3, 0, 3, 2, 2, 0,~
#> $ item2b     <dbl> 3, 2, 5, 5, 2, NA, 3, 3, 4, N~
#> $ item3a     <dbl> 2, 3, 3, 2, 3, 2, 3, 3, 0, 1,~
#> $ item3b     <dbl> 2, 4, 5, 3, 5, 3, 4, 2, NA, 2~
#> $ item4a     <dbl> 0, 2, 0, 0, 3, 3, 3, 2, 2, 2,~
#> $ item4b     <dbl> NA, 3, NA, NA, 5, 2, 3, 5, 3,~
#> $ item5a     <dbl> 2, 3, 2, 2, 3, 1, 3, 2, 3, 1,~
#> $ item5b     <dbl> 5, 2, 3, 4, 1, 3, 3, 1, 3, 2,~
#> $ item6a     <dbl> 2, 2, 3, 1, 3, 3, 3, 2, 3, 2,~
#> $ item6b     <dbl> 3, 1, 2, 2, 5, 4, 4, 2, 2, 2,~
The survey has several items, each consisting of two sub-questions (variables) 'a' and 'b'. Let's say we want to calculate the product of those two variables for each item. extract_names() helps us select all variables containing 'item' followed by a digit by using the regex "item\\d" as .pattern.
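As a rough check of what this pattern picks up, the extraction can be approximated in base R on the column names listed in the glimpse() output. This is a sketch of the matching logic only, not a call to extract_names() itself.

```r
# Column names copied from the glimpse() output of csatraw.
cols <- c("cust_id", "type", "product", "item1", "item1_open",
          "item2a", "item2b", "item3a", "item3b", "item4a", "item4b",
          "item5a", "item5b", "item6a", "item6b")

# First match of "item" followed by a digit in each name; non-matches are dropped.
m <- regexpr("item\\d", cols)
items <- unique(regmatches(cols, m))
items
#> [1] "item1" "item2" "item3" "item4" "item5" "item6"
```

Note that "item1" is part of the result, because both 'item1' and 'item1_open' match the pattern.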
However, 'item1' and 'item1_open' are not followed by 'a' and 'b'. extract_names() lets us exclude these items by setting the .remove argument to [^item1]: