over() makes it easy to create new columns inside a dplyr::mutate() or
dplyr::summarise() call by applying a function (or a set of functions) to
an atomic vector or list, using a syntax similar to dplyr::across().
The main difference is that dplyr::across() transforms or creates new columns
based on existing ones, while over() creates new columns based on a
vector or list to which it applies one or several functions.
Whereas dplyr::across() allows tidy-selection helpers to select columns,
over() provides its own helper functions to select strings or values based
on either (1) values of specified columns or (2) column names. See the
examples below and vignette("why_dplyover") for more details.
over(.x, .fns, ..., .names = NULL, .names_fn = NULL)
Argument | Description |
---|---|
.x | An atomic vector or list to apply functions to. Alternatively a < |
.fns | Functions to apply to each of the elements in .x. Possible values are: For examples see the example section below. Note that, unlike |
... | Additional arguments for the function calls in .fns. |
.names | A glue specification that describes how to name the output columns. This can use {x} and {fn}. Note that, depending on the nature of the underlying object in .x This standard behavior (interpretation of Alternatively, a character vector of length equal to the number of columns to be created can be supplied to .names. |
.names_fn | Optionally, a function that is applied after the glue specification in .names |
A tibble with one column for each element in .x and each function in .fns.
Similar to dplyr::across(), over() works only inside dplyr verbs.
It has two main use cases. They differ in how the elements in .x are used.
Let's first attach dplyr:

    library(dplyr)
Here the values in .x are used as inputs to one or more functions in .fns.
This is useful when we want to create several new variables based on the same
function with varying arguments. A good example is creating a set of lagged
variables.
    tibble(x = 1:25) %>%
      mutate(over(c(1:3),
                  ~ lag(x, .x)))
    #> # A tibble: 25 x 4
    #>       x   `1`   `2`   `3`
    #>   <int> <int> <int> <int>
    #> 1     1    NA    NA    NA
    #> 2     2     1    NA    NA
    #> 3     3     2     1    NA
    #> 4     4     3     2     1
    #> # ... with 21 more rows
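To make the "same function, varying arguments" idea concrete, here is a conceptual sketch of the lag example that uses only dplyr and base R. It is not how over() is implemented; it only illustrates the pattern of applying a function to each element of a vector and returning the results as a set of columns. The names lag_steps and lagged are made up for this illustration.

```r
library(dplyr)

# Values to loop over (hypothetical name, chosen for this sketch)
lag_steps <- 1:3

lagged <- tibble(x = 1:25) %>%
  mutate(
    # Build one lagged vector per step, name each by its step,
    # and hand the result to mutate() as an unnamed tibble so that
    # its columns are unpacked into the data.
    as_tibble(
      setNames(lapply(lag_steps, function(n) lag(x, n)),
               as.character(lag_steps))
    )
  )

names(lagged)
```

Compared with this sketch, over() takes care of iterating, naming, and binding the columns in a single call.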
Let's create a dummy variable for each unique value in 'Species':

    iris %>%
      mutate(over(unique(Species),
                  ~ if_else(Species == .x, 1, 0)),
             .keep = "none")
    #> # A tibble: 150 x 3
    #>   setosa versicolor virginica
    #>    <dbl>      <dbl>     <dbl>
    #> 1      1          0         0
    #> 2      1          0         0
    #> 3      1          0         0
    #> 4      1          0         0
    #> # ... with 146 more rows
With over() it is also possible to create several dummy variables with
different thresholds. We can use the .names argument to control the output
names:

    iris %>%
      mutate(over(seq(4, 7, by = 1),
                  ~ if_else(Sepal.Length < .x, 1, 0),
                  .names = "Sepal.Length_{x}"),
             .keep = "none")
    #> # A tibble: 150 x 4
    #>   Sepal.Length_4 Sepal.Length_5 Sepal.Length_6 Sepal.Length_7
    #>            <dbl>          <dbl>          <dbl>          <dbl>
    #> 1              0              0              1              1
    #> 2              0              1              1              1
    #> 3              0              1              1              1
    #> 4              0              1              1              1
    #> # ... with 146 more rows
A similar approach can be used with dates. Below we loop over a date
sequence to check whether the date falls within a given start and end
date. We can use the .names_fn argument to clean the resulting output
names:

    # some dates
    dat_tbl <- tibble(start = seq.Date(as.Date("2020-01-01"),
                                       as.Date("2020-01-15"),
                                       by = "days"),
                      end = start + 10)

    dat_tbl %>%
      mutate(over(seq(as.Date("2020-01-01"),
                      as.Date("2020-01-21"),
                      by = "weeks"),
                  ~ .x >= start & .x <= end,
                  .names = "day_{x}",
                  .names_fn = ~ gsub("-", "", .x)))
    #> # A tibble: 15 x 5
    #>    start      end        day_20200101 day_20200108 day_20200115
    #>    <date>     <date>     <lgl>        <lgl>        <lgl>
    #>  1 2020-01-01 2020-01-11 TRUE         TRUE         FALSE
    #>  2 2020-01-02 2020-01-12 FALSE        TRUE         FALSE
    #>  3 2020-01-03 2020-01-13 FALSE        TRUE         FALSE
    #>  4 2020-01-04 2020-01-14 FALSE        TRUE         FALSE
    #>  5 2020-01-05 2020-01-15 FALSE        TRUE         TRUE
    #>  6 2020-01-06 2020-01-16 FALSE        TRUE         TRUE
    #>  7 2020-01-07 2020-01-17 FALSE        TRUE         TRUE
    #>  8 2020-01-08 2020-01-18 FALSE        TRUE         TRUE
    #>  9 2020-01-09 2020-01-19 FALSE        FALSE        TRUE
    #> 10 2020-01-10 2020-01-20 FALSE        FALSE        TRUE
    #> 11 2020-01-11 2020-01-21 FALSE        FALSE        TRUE
    #> 12 2020-01-12 2020-01-22 FALSE        FALSE        TRUE
    #> 13 2020-01-13 2020-01-23 FALSE        FALSE        TRUE
    #> 14 2020-01-14 2020-01-24 FALSE        FALSE        TRUE
    #> 15 2020-01-15 2020-01-25 FALSE        FALSE        TRUE
over() can summarise data in wide format. In the example below, we want to
know, for each group of customers (new, existing, reactivate), what
percentage of the respondents gave which rating on a five-point Likert scale
(item1). A usual approach in the tidyverse would be to use
count %>% group_by %>% mutate, which yields the same result in the usually
preferred long format. Sometimes, however, we might want this kind of summary
in wide format, and in this case over() comes in handy:

    csatraw %>%
      group_by(type) %>%
      summarise(over(c(1:5),
                     ~ mean(item1 == .x)))
    #> # A tibble: 3 x 6
    #>   type          `1`   `2`   `3`   `4`    `5`
    #>   <chr>       <dbl> <dbl> <dbl> <dbl>  <dbl>
    #> 1 existing   0.156  0.234 0.234 0.266 0.109
    #> 2 new        0.0714 0.268 0.357 0.214 0.0893
    #> 3 reactivate 0.0667 0.267 0.133 0.4   0.133
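For comparison, the long-format approach mentioned above (count, then group_by, then mutate) might look like the following sketch. It uses a small made-up tibble named ratings in place of csatraw, so the numbers are illustrative only; the column name share is likewise an assumption.

```r
library(dplyr)

# Hypothetical toy data standing in for csatraw
ratings <- tibble(
  type  = c("new", "new", "existing", "existing", "existing", "existing"),
  item1 = c(1, 2, 2, 2, 5, 5)
)

long_shares <- ratings %>%
  count(type, item1) %>%          # one row per type/rating combination
  group_by(type) %>%
  mutate(share = n / sum(n)) %>%  # share of each rating within its type
  ungroup()

long_shares
```

The result has one row per combination of type and rating, whereas the over() call returns one row per type with one column per rating.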
Instead of a vector we can provide a named list of vectors to calculate the top two and bottom two categories on the fly:
    csatraw %>%
      group_by(type) %>%
      summarise(over(list(bot2 = c(1:2),
                          mid  = 3,
                          top2 = c(4:5)),
                     ~ mean(item1 %in% .x)))
    #> # A tibble: 3 x 4
    #>   type        bot2   mid  top2
    #>   <chr>      <dbl> <dbl> <dbl>
    #> 1 existing   0.391 0.234 0.375
    #> 2 new        0.339 0.357 0.304
    #> 3 reactivate 0.333 0.133 0.533
over() can also loop over columns of a data.frame. In the example below we
want to create four different dummy variables of item1: (i) the top and (ii)
bottom category as well as (iii) the top two and (iv) the bottom two categories.
We can create a lookup data.frame and use all columns but the first as input to
over(). In the function call we make use of base R's match(), where .x
represents the new values and recode_df[, 1] refers to the old values.

    recode_df <- data.frame(old  = c(1, 2, 3, 4, 5),
                            top1 = c(0, 0, 0, 0, 1),
                            top2 = c(0, 0, 0, 1, 1),
                            bot1 = c(1, 0, 0, 0, 0),
                            bot2 = c(1, 1, 0, 0, 0))

    csatraw %>%
      mutate(over(recode_df[,-1],
                  ~ .x[match(item1, recode_df[, 1])],
                  .names = "item1_{x}")) %>%
      select(starts_with("item1"))
    #> # A tibble: 150 x 6
    #>   item1 item1_open  item1_top1 item1_top2 item1_bot1 item1_bot2
    #>   <dbl> <chr>            <dbl>      <dbl>      <dbl>      <dbl>
    #> 1     3 12                   0          0          0          0
    #> 2     2 22                   0          0          0          1
    #> 3     2 21, 22, 23           0          0          0          1
    #> 4     4 12, 13, 11           0          1          0          0
    #> # ... with 146 more rows
over() works nicely with comma-separated values stored in character vectors.
In the example below, the column csat_open contains one or more
comma-separated reasons why a specific customer satisfaction rating was given.
We can easily create a column for each response category with the help of
dist_values(), a wrapper around unique() that can split vector elements
using a separator:

    csat %>%
      mutate(over(dist_values(csat_open, .sep = ", "),
                  ~ as.integer(grepl(.x, csat_open)),
                  .names = "rsp_{x}",
                  .names_fn = ~ gsub("\\s", "_", .x)),
             .keep = "none") %>%
      glimpse
    #> Rows: 150
    #> Columns: 6
    #> $ rsp_friendly_staff <int> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,~
    #> $ rsp_good_service   <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,~
    #> $ rsp_great_product  <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,~
    #> $ rsp_no_response    <int> 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1,~
    #> $ rsp_too_expensive  <int> 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,~
    #> $ rsp_unfriendly     <int> 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,~
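The idea of collecting the distinct values of a comma-separated column can be sketched in base R as follows. This is only an approximation of what dist_values() is described as doing (unique values after splitting on a separator); the sorting step is an assumption suggested by the alphabetical column order in the output above, and the function split_distinct and the vector reasons are hypothetical names used for illustration.

```r
# Hypothetical helper: split each element on `sep`, then keep the
# distinct pieces. Sorting is an assumption, not confirmed behaviour
# of dist_values().
split_distinct <- function(x, sep = ", ") {
  pieces <- unlist(strsplit(x, sep, fixed = TRUE))
  sort(unique(pieces))
}

# Made-up example data
reasons <- c("good service, friendly staff",
             "friendly staff",
             "too expensive")

distinct_reasons <- split_distinct(reasons)
distinct_reasons
```

Each element of the resulting vector would then become one iteration of over(), and therefore one output column.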
Here strings are supplied to .x to construct column names (sharing the
same stem). This allows us to dynamically use more than one column in the
function calls in .fns. To work properly, the strings need to be
turned into symbols and evaluated. For this, dplyover provides a dedicated
helper function, .(), that evaluates strings and helps to declutter the
otherwise rather verbose code. .() supports glue syntax and takes a string
as argument.
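What "turned into symbols and evaluated" means can be shown outside of dplyr with a minimal base R sketch. It builds column names from a stem, converts them to symbols with as.name(), and evaluates them with the iris data frame as the environment. This illustrates the general mechanism only and makes no claim about how .() is implemented; the variable names are chosen for this example.

```r
stem <- "Sepal"

# Strings -> symbols
width_sym  <- as.name(paste0(stem, ".Width"))
length_sym <- as.name(paste0(stem, ".Length"))

# Symbols -> column values, evaluated with the data frame as environment
sepal_sum <- eval(width_sym, iris) + eval(length_sym, iris)

head(sepal_sum, 4)
```

Inside over(), .x plays the role of stem, so the same construction can be repeated for every stem supplied.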
Below are a few examples using two columns in the function calls in .fns.
For the two-column case, across2() provides a more intuitive API that is
closer to the original dplyr::across(). Using .() inside over() is most
useful for cases with more than two columns.
Consider the following example of a purrr-style formula in .fns using .():

    iris %>%
      mutate(over(c("Sepal", "Petal"),
                  ~ .("{.x}.Width") + .("{.x}.Length")
                  ))
    #> # A tibble: 150 x 7
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal Petal
    #>          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl> <dbl>
    #> 1          5.1         3.5          1.4         0.2 setosa    8.6   1.6
    #> 2          4.9         3            1.4         0.2 setosa    7.9   1.6
    #> 3          4.7         3.2          1.3         0.2 setosa    7.9   1.5
    #> 4          4.6         3.1          1.5         0.2 setosa    7.7   1.7
    #> # ... with 146 more rows
The above syntax is equivalent to the more verbose:

    iris %>%
      mutate(over(c("Sepal", "Petal"),
                  ~ eval(sym(paste0(.x, ".Width"))) +
                    eval(sym(paste0(.x, ".Length")))
                  ))
    #> # A tibble: 150 x 7
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal Petal
    #>          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl> <dbl>
    #> 1          5.1         3.5          1.4         0.2 setosa    8.6   1.6
    #> 2          4.9         3            1.4         0.2 setosa    7.9   1.6
    #> 3          4.7         3.2          1.3         0.2 setosa    7.9   1.5
    #> 4          4.6         3.1          1.5         0.2 setosa    7.7   1.7
    #> # ... with 146 more rows
.() also works with anonymous functions:

    iris %>%
      summarise(over(c("Sepal", "Petal"),
                     function(x) mean(.("{x}.Width"))
                     ))
    #> # A tibble: 1 x 2
    #>   Sepal Petal
    #>   <dbl> <dbl>
    #> 1  3.06  1.20
A named list of functions:
    iris %>%
      mutate(over(c("Sepal", "Petal"),
                  list(product = ~ .("{.x}.Width") * .("{.x}.Length"),
                       sum = ~ .("{.x}.Width") + .("{.x}.Length"))
                  ),
             .keep = "none")
    #> # A tibble: 150 x 4
    #>   Sepal_product Sepal_sum Petal_product Petal_sum
    #>           <dbl>     <dbl>         <dbl>     <dbl>
    #> 1          17.8       8.6          0.28       1.6
    #> 2          14.7       7.9          0.28       1.6
    #> 3          15.0       7.9          0.26       1.5
    #> 4          14.3       7.7          0.3        1.7
    #> # ... with 146 more rows
Again, use the .names argument to control the output names:

    iris %>%
      mutate(over(c("Sepal", "Petal"),
                  list(product = ~ .("{.x}.Width") * .("{.x}.Length"),
                       sum = ~ .("{.x}.Width") + .("{.x}.Length")),
                  .names = "{fn}_{x}"),
             .keep = "none")
    #> # A tibble: 150 x 4
    #>   product_Sepal sum_Sepal product_Petal sum_Petal
    #>           <dbl>     <dbl>         <dbl>     <dbl>
    #> 1          17.8       8.6          0.28       1.6
    #> 2          14.7       7.9          0.28       1.6
    #> 3          15.0       7.9          0.26       1.5
    #> 4          14.3       7.7          0.3        1.7
    #> # ... with 146 more rows
over2() to apply a function to two objects.
All members of the <over-across function family>.