These functions are selection helpers.
They are intended to be used inside over() to extract parts or patterns of
the column names of the underlying data.
cut_names() selects strings by removing (cutting off) the specified .pattern.
This functionality resembles stringr::str_remove_all().
extract_names() selects strings by extracting the specified .pattern.
This functionality resembles stringr::str_extract().
cut_names(.pattern, .remove = NULL, .vars = NULL)
extract_names(.pattern, .remove = NULL, .vars = NULL)
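| Argument | Description |
|---|---|
| .pattern | Pattern to look for. |
| .remove | Pattern to remove from the variable names provided in .vars. |
| .vars | A character vector with variable names. When used inside over(), the column names of the underlying data are used automatically. |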
A character vector.
Selection helpers can be used inside dplyover::over(), which in turn must be
used inside dplyr::mutate() or dplyr::summarise(). Let's first attach dplyr
(and stringr for comparison):
library(dplyr)
library(stringr)
Now compare cut_names() and extract_names() to their stringr
equivalents stringr::str_remove_all() and stringr::str_extract():
We can observe two main differences:
(1) cut_names() and extract_names() only return strings where the function
was applied successfully (when characters have actually been removed or
extracted). stringr::str_remove_all() returns unmatched strings as is, while
stringr::str_extract() returns NA.
cut_names("Width", .vars = names(iris))
#> [1] "Sepal." "Petal."
str_remove_all(names(iris), "Width")
#> [1] "Sepal.Length" "Sepal." "Petal.Length" "Petal." "Species"
extract_names("Length|Width", .vars = names(iris))
#> [1] "Length" "Width"
str_extract(rep(names(iris), 2), "Length|Width")
#> [1] "Length" "Width" "Length" "Width" NA "Length" "Width" "Length" "Width"
#> [10] NA
(2) cut_names() and extract_names() return only unique values:
cut_names("Width", .vars = rep(names(iris), 2))
#> [1] "Sepal." "Petal."
str_remove_all(rep(names(iris), 2), "Width")
#> [1] "Sepal.Length" "Sepal." "Petal.Length" "Petal." "Species"
#> [6] "Sepal.Length" "Sepal." "Petal.Length" "Petal." "Species"
extract_names("Length|Width", .vars = names(iris))
#> [1] "Length" "Width"
str_extract(rep(names(iris), 2), "Length|Width")
#> [1] "Length" "Width" "Length" "Width" NA "Length" "Width" "Length" "Width"
#> [10] NA
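Both differences can be reproduced with base R alone. The sketch below is only an approximation of the behaviour described above: `cut_sketch()` and `extract_sketch()` are hypothetical names that are not part of dplyover, and the package's actual implementation may differ.

```r
# Hypothetical helpers: a base-R approximation, not dplyover's source code.
cut_sketch <- function(pattern, vars) {
  cut <- gsub(pattern, "", vars)  # remove every match, like str_remove_all()
  cut <- cut[cut != vars]         # difference (1): drop names the pattern did not change
  unique(cut)                     # difference (2): keep each result only once
}

extract_sketch <- function(pattern, vars) {
  hit <- regexpr(pattern, vars)   # first match per name, -1 where there is none
  unique(regmatches(vars, hit))   # regmatches() skips non-matching names
}

vars <- rep(names(iris), 2)

cut_sketch("Width", vars)
#> [1] "Sepal." "Petal."
extract_sketch("Length|Width", vars)
#> [1] "Length" "Width"
```

The filtering step and the call to unique() are what set these helpers apart from a plain str_remove_all() or str_extract() call.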
What the examples above do not show is that cut_names() removes every
occurrence of .pattern within a name, while extract_names() extracts only
the first occurrence:
cut_names("Width", .vars = "Width.Petal.Width")
#> [1] ".Petal."
str_remove_all("Width.Petal.Width", "Width")
#> [1] ".Petal."
extract_names("Width", .vars = "Width.Petal.Width")
#> [1] "Width"
str_extract("Width.Petal.Width", "Width")
#> [1] "Width"
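The same "all occurrences" versus "first occurrence" contrast exists in base R, which may help to see what each variant would return. This is an analogy only and does not call dplyover; the variable names are made up for the illustration.

```r
x <- "Width.Petal.Width"

# Removing: gsub() drops every occurrence, sub() only the first one.
removed_all   <- gsub("Width", "", x)
removed_first <- sub("Width", "", x)

# Extracting: regexpr() finds the first occurrence, gregexpr() finds all.
first_match <- regmatches(x, regexpr("Width", x))
all_matches <- regmatches(x, gregexpr("Width", x))[[1]]

removed_all    # behaves like cut_names() / str_remove_all()
#> [1] ".Petal."
removed_first
#> [1] ".Petal.Width"
first_match    # behaves like extract_names() / str_extract()
#> [1] "Width"
all_matches
#> [1] "Width" "Width"
```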
Within over(), cut_names() and extract_names() automatically use the
column names of the underlying data:
iris %>%
  mutate(over(cut_names(".Width"),
              ~ .("{.x}.Width") * .("{.x}.Length"),
              .names = "Product_{x}"))
#> # A tibble: 150 x 7
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Product_Sepal
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>           <dbl>
#> 1          5.1         3.5          1.4         0.2 setosa           17.8
#> 2          4.9         3            1.4         0.2 setosa           14.7
#> 3          4.7         3.2          1.3         0.2 setosa           15.0
#> 4          4.6         3.1          1.5         0.2 setosa           14.3
#> # ... with 146 more rows, and 1 more variable: Product_Petal <dbl>
iris %>%
  mutate(over(extract_names("Length|Width"),
              ~ .("Petal.{.x}") * .("Sepal.{.x}"),
              .names = "Product_{x}"))
#> # A tibble: 150 x 7
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Product_Length
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>            <dbl>
#> 1          5.1         3.5          1.4         0.2 setosa            7.14
#> 2          4.9         3            1.4         0.2 setosa            6.86
#> 3          4.7         3.2          1.3         0.2 setosa            6.11
#> 4          4.6         3.1          1.5         0.2 setosa            6.9
#> # ... with 146 more rows, and 1 more variable: Product_Width <dbl>
What problem does cut_names() solve?
In the example above using cut_names() might not seem helpful, since we could easily
use c("Sepal", "Petal") instead. However, there are cases where we have
data with a lot of similar pairs of variables sharing a common prefix or
suffix. If we want to loop over them using over() then cut_names() comes
in handy.
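To make the "many similar pairs" situation concrete, here is a base-R sketch with made-up data in which every question has a `_pre` and a `_post` column. The data frame, the column names and the `diff_` prefix are all assumptions for illustration. The `stems` vector corresponds to what cut_names("_pre") would be expected to return for these columns, and the loop stands in for what over() would do with it.

```r
# Made-up data: each stem (q1, q2, q3) has a "_pre" and a "_post" column.
df <- data.frame(
  q1_pre = c(1, 2), q1_post = c(3, 5),
  q2_pre = c(2, 2), q2_post = c(2, 4),
  q3_pre = c(0, 1), q3_post = c(4, 1)
)

# Cut the shared suffix off the matching names to get the stems.
pre_cols <- grep("_pre", names(df), value = TRUE)
stems <- unique(gsub("_pre", "", pre_cols))
stems
#> [1] "q1" "q2" "q3"

# Build one new column per stem from its pair of columns.
for (s in stems) {
  df[[paste0("diff_", s)]] <- df[[paste0(s, "_post")]] - df[[paste0(s, "_pre")]]
}

names(df)[7:9]
#> [1] "diff_q1" "diff_q2" "diff_q3"
```

With only three stems, typing c("q1", "q2", "q3") by hand would be just as easy. The benefit shows once the number of pairs grows or changes between data sets.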
The usage of extract_names() might be less obvious. Let's look at raw data
from a customer satisfaction survey, which contains the following variables:
csatraw %>% glimpse(width = 50)
#> Rows: 150
#> Columns: 15
#> $ cust_id    <chr> "61297", "07545", "03822", "8~
#> $ type       <chr> "existing", "existing", "exis~
#> $ product    <chr> "advanced", "advanced", "prem~
#> $ item1      <dbl> 3, 2, 2, 4, 4, 3, 1, 3, 3, 2,~
#> $ item1_open <chr> "12", "22", "21, 22, 23", "12~
#> $ item2a     <dbl> 2, 2, 2, 3, 3, 0, 3, 2, 2, 0,~
#> $ item2b     <dbl> 3, 2, 5, 5, 2, NA, 3, 3, 4, N~
#> $ item3a     <dbl> 2, 3, 3, 2, 3, 2, 3, 3, 0, 1,~
#> $ item3b     <dbl> 2, 4, 5, 3, 5, 3, 4, 2, NA, 2~
#> $ item4a     <dbl> 0, 2, 0, 0, 3, 3, 3, 2, 2, 2,~
#> $ item4b     <dbl> NA, 3, NA, NA, 5, 2, 3, 5, 3,~
#> $ item5a     <dbl> 2, 3, 2, 2, 3, 1, 3, 2, 3, 1,~
#> $ item5b     <dbl> 5, 2, 3, 4, 1, 3, 3, 1, 3, 2,~
#> $ item6a     <dbl> 2, 2, 3, 1, 3, 3, 3, 2, 3, 2,~
#> $ item6b     <dbl> 3, 1, 2, 2, 5, 4, 4, 2, 2, 2,~
The survey has several items, each consisting of two sub-questions stored in
variables ending in 'a' and 'b'. Let's say we want to calculate the product of
those two variables for each item. extract_names() helps us to select all
variables containing 'item' followed by a digit using the regex "item\\d" as .pattern.
However, 'item1' and 'item1_open' do not come as a pair ending in 'a' and
'b'. extract_names() lets us exclude these items by setting the .remove
argument to "^item1":
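Before using both patterns inside over(), their combined effect can be checked on the column names with base R. This is a sketch under the assumption that names matching .remove are dropped first and .pattern is then applied to the remaining names. The vector `vars` is typed in by hand from the glimpse() output above, so the csatraw data itself is not needed.

```r
# Column names copied from the glimpse() output (assumed to be complete).
vars <- c("cust_id", "type", "product", "item1", "item1_open",
          paste0("item", rep(2:6, each = 2), c("a", "b")))

# Step 1: drop every name that matches the .remove pattern "^item1".
keep <- vars[!grepl("^item1", vars)]

# Step 2: extract the first match of the .pattern "item\\d", keep unique values.
items <- unique(regmatches(keep, regexpr("item\\d", keep)))
items
#> [1] "item2" "item3" "item4" "item5" "item6"

# Contrast: the character class "[^item1]" matches any character other than
# i, t, e, m or 1, so it would drop every name except "item1".
vars[!grepl("[^item1]", vars)]
#> [1] "item1"
```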