I would first like to thank Dean Attali for writing the original post that inspired this one. Creating vectorized functions is definitely a common problem when dealing with vectors and creating functions in R and having public discourse on the best ways to do this is great for the whole community, thanks Dean!
The Vectorize function is sometimes suggested as a solution to the problem of iterating over elements of a vector, such as How to use dplyr’s mutate in R without a vectorized function by Dean Attali. This function at first seems a perfect solution to this problem (automatically turn any function in a vectorized one!). However like many things in life you may find it to be too good to be true.
Are you sure your function cannot be vectorized?
Many (most?) functions in base R and extension packages such as those in the tidyverse are vectorized over their primary arguments. To be vectorized means the function works not just on a single value, but on the whole vector of values at the same time.
Some examples of vectorized functions are + and == and is.na().
1:3 + 1
## [1] 2 3 4
1:2 == 2
## [1] FALSE TRUE
is.na(c(1, NA, 3))
## [1] FALSE TRUE FALSE
However even functions you may not first think are vectorized often are, such as paste().
color <- c("blue", "red", "green")
object <- c("ball", "hat", "coat")
name <- c("Sally", "Hank", "Darla")
paste("A", color, object, "for", name)
## [1] "A blue ball for Sally" "A red hat for Hank"
## [3] "A green coat for Darla"
Often when people think they need to loop over elements their problem can actually be rewritten to work with the already vectorized functions in R.
For instance Dean’s example function to perform the following task.
Given a path some/path/abc/001.txt, this function will return abc_001.txt
library(tidyverse)
patient_name <- function(path) {
path_list <- str_split(path, "/") %>% unlist()
paste(path_list[length(path_list) - 1], path_list[length(path_list)], sep = "_")
}
# Vectorize it with Vectorize
patient_name_v <- Vectorize(patient_name)
At first it seems like this code would require you to iterate element by element to
have a vectorized form. However note the unlist()
in the implementation.
Often needing to unlist()
is an indication that you are dealing with an already vectorized
function. This is true in this case, stringr::str_split()
is vectorized over
its inputs. Knowing this we can use vapply()
with the tail()
function to
extract and then paste the rows.
patient_name_better <- function(path) {
path_list <- str_split(path, "/")
last_two <- vapply(path_list, tail, character(2), 2)
paste0(last_two[1, ], "_", last_two[2, ])
}
This is an improvement, the code is easier to read and while we have a loop in
the vapply()
call, we are taking advantage of the vectorized paste0()
.
However in this case an even better alternative is available. R has vectorized
functions basename() and dirname() to retrieve the
basename (filename) and directory name of a file path. So we can use these
directly along with paste()
.
patient_name_best <- function(path) {
paste0(basename(dirname(path)), "_", basename(path))
}
This gives us a very concise implementation and because all of these functions are implemented directly in C this is also very fast.
# Construct 100 paths
paths <- rep(c("some/path/abc/001.txt", "another/directory/xyz/002.txt"), 100)
res <- microbenchmark::microbenchmark(
patient_name_v(paths),
patient_name_better(paths),
patient_name_best(paths)
)
bench <- summary(res)[c("expr", "median")]
bench
## expr median
## 1 patient_name_v(paths) 18824.948
## 2 patient_name_better(paths) 1162.661
## 3 patient_name_best(paths) 159.825
In this simple case the median runtime is 16x faster for the better version and 117x faster for the best version!
But what if my problem really cannot use vectorized functions?
There are cases where your code really cannot be rewritten in this way, so
is Vectorize()
a good solution in that case? I argue no it is not, for the
following reasons.
Vectorize does not generate type stable functions.
The function generated by Vectorize
wraps the input function in a call to
mapply()
under the hood, with the default argument SIMPLIFY = TRUE
. This
means the type of the function output depends on the input. For example
# Everything seems good, character vector output
patient_name_v("some/path/abc/001.txt")
## some/path/abc/001.txt
## "abc_001.txt"
# But now my output is a named list()!
patient_name_v(character())
## named list()
# And now it is a list with no names!
patient_name_v(NULL)
## list()
Type stability is also the reason it is best to avoid sapply()
or mapply()
in favor of the type stable vapply()
or purrr::map_*()
and
purrr::pmap_*()
functions.
You can call Vectorize(SIMPLIFY = FALSE)
when you generate the vectorized
function, but this will cause the function to return a list of values rather
than a vector. Because many vectorized functions do not work with list inputs,
this often means you will then need to post-process your output.
Vectorize does not generate functions with easily inspect-able code
Because of the way Vectorize()
generates the function all generated functions
have the same body when printed.
patient_name_v
## function (path)
## {
## args <- lapply(as.list(match.call())[-1L], eval, parent.frame())
## names <- if (is.null(names(args)))
## character(length(args))
## else names(args)
## dovec <- names %in% vectorize.args
## do.call("mapply", c(FUN = FUN, args[dovec], MoreArgs = list(args[!dovec]),
## SIMPLIFY = SIMPLIFY, USE.NAMES = USE.NAMES))
## }
## <environment: 0x7fcb24915508>
This means you lose the easy inspectibility of functions. Being able to easily
see the implementation of functions in R is one of the strengths of the
ecosystem. So losing this behavior makes your functions much more difficult for
users (or yourself in the future) to understand. It is possible to retrieve
the original function definition, but doing so requires you to examine how
Vectorize()
works; by storing the function in a variable called FUN
.
environment(patient_name_v)$FUN
## function(path) {
## path_list <- str_split(path, "/") %>% unlist()
## paste(path_list[length(path_list) - 1], path_list[length(path_list)], sep = "_")
## }
## <bytecode: 0x7fcb24eeeeb0>
Vectorize functions use do.call()
, which can have unexpected performance consequences
This is best explained by Hadley Wickham in http://rpubs.com/hadley/do-call2,
but the gist is do.call()
ends up doing a lot more work than you might expect
and in some cases has performance implications, although in this particular
case they will be of minor concern. So because all functions generated with Vectorize()
use do.call()
they inherit these issues.
Vectorize does not actually make your code execute faster
Perhaps the most important reason is that Vectorize()
will not make your code
faster. People often want to vectorize their function because they have observed that vectorized functions are fast. This is usually true,
however it is true not because they are vectorized, but because vectorized
functions are often written in C code (or call other functions which are).
Vectorize()
essentially just wraps your code in a loop and runs it
repeatedly, so it cannot improve the running time.
So what should I do?
Because of these issues I think a cleaner solution is first, try to rewrite
your function to take advantage of existing vectorized functions. If that is
not possible define your original function in an internal helper, which then
calls the equivalent type stable map_()
or vapply()
function on the
original function.
patient_name2 <- function(path) {
patient_name_one <- function(path) {
path_list <- str_split(path, "/") %>% unlist()
paste(path_list[length(path_list) - 1], path_list[length(path_list)], sep = "_")
}
vapply(patient_name_one, path, character(1))
}
This takes only a few more lines of code than the original and solves the
majority of the issues with Vectorize()
. The function is now type stable for
all inputs, the function body remains inspectable and you avoid the potential
pitfalls of using do.call()
.
In Dean’s original case he was trying to use patient_name
in a call to
dplyr::mutate()
. Rather than using Vectorize()
to generate a new function
in this case I would instead suggest an idiom like the following.
df <- data_frame(path = paths)
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
df %>% mutate(name = map_chr(path, patient_name))
## # A tibble: 200 x 2
## path name
## <chr> <chr>
## 1 some/path/abc/001.txt abc_001.txt
## 2 another/directory/xyz/002.txt xyz_002.txt
## 3 some/path/abc/001.txt abc_001.txt
## 4 another/directory/xyz/002.txt xyz_002.txt
## 5 some/path/abc/001.txt abc_001.txt
## 6 another/directory/xyz/002.txt xyz_002.txt
## 7 some/path/abc/001.txt abc_001.txt
## 8 another/directory/xyz/002.txt xyz_002.txt
## 9 some/path/abc/001.txt abc_001.txt
## 10 another/directory/xyz/002.txt xyz_002.txt
## # … with 190 more rows
This is similarly concise to using Vectorize()
but is also type stable for all possible
inputs.