Beware of Vectorize
Posted on April 12, 2018
I would first like to thank Dean Attali for writing the original post that inspired this one. Creating vectorized functions is a common problem when working with vectors in R, and public discourse on the best ways to do it is great for the whole community. Thanks, Dean!
The Vectorize() function is sometimes suggested as a solution to the problem of iterating over the elements of a vector, as in How to use dplyr’s mutate in R without a vectorized function by Dean Attali. At first it seems a perfect solution (automatically turn any function into a vectorized one!). However, like many things in life, it turns out to be too good to be true.
Are you sure your function cannot be vectorized?
Many (most?) functions in base R and extension packages such as those in the tidyverse are vectorized over their primary arguments. To be vectorized means the function works not just on a single value, but on the whole vector of values at the same time.
However, even functions you might not expect to be vectorized often are, such as paste().
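As a small illustration of that point, here is a minimal sketch of paste() working element-wise over two vectors. The vectors dirs and files are made-up example data, not taken from Dean's post.

```r
# Hypothetical example data: directory names and file names of equal length
dirs  <- c("abc", "def", "ghi")
files <- c("001.txt", "002.txt", "003.txt")

# paste() is vectorized: it combines the inputs element by element,
# so no explicit loop is needed
joined <- paste(dirs, files, sep = "_")
print(joined)
```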
Often, when people think they need to loop over elements, their problem can be rewritten to work with the already vectorized functions in R.
Take, for instance, Dean’s example function, which performs the following task:
Given a path some/path/abc/001.txt, this function will return abc_001.txt
At first it seems like this code would require you to iterate element by element to have a vectorized form. However, note the unlist() in the implementation. Often, needing to unlist() is an indication that you are dealing with an already vectorized function. This is true in this case: stringr::str_split() is vectorized over its inputs. Knowing this, we can use vapply() with the tail() function to extract the trailing components of each path and then paste them together.
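The approach just described might look roughly like the sketch below. To keep it free of package dependencies it swaps in base R's strsplit() for stringr::str_split() (both return one character vector per input path); the example paths are assumptions for illustration.

```r
# Hypothetical input paths
paths <- c("some/path/abc/001.txt", "other/path/def/002.txt")

# strsplit() is vectorized: it returns a list with one character vector per path
# (base R stand-in for stringr::str_split())
parts <- strsplit(paths, "/", fixed = TRUE)

# For each path, keep the last two components and join them with "_".
# vapply() guarantees a character vector with one string per path.
patient_names <- vapply(
  parts,
  function(p) paste(tail(p, 2), collapse = "_"),
  character(1)
)
print(patient_names)
```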
This is an improvement: the code is easier to read and, while there is still a loop in the vapply() call, we are taking advantage of the vectorized stringr::str_split().
However, in this case an even better alternative is available. R has the vectorized functions basename() and dirname() to retrieve the basename (filename) and directory name of a file path, so we can use these directly along with paste().
This gives us a very concise implementation, and because all of these functions are implemented directly in C, it is also very fast.
In this simple case the median runtime is 9x faster for the better version and 44x faster for the best version!
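For reference, the basename() and dirname() version described above could be sketched as follows. The function name patient_name_best and the example paths are hypothetical, and this is only one way to combine the three functions.

```r
# Hypothetical name for the fully vectorized version
patient_name_best <- function(path) {
  # dirname() drops the file name, basename() then keeps the last directory;
  # both work on whole character vectors at once
  paste0(basename(dirname(path)), "_", basename(path))
}

paths <- c("some/path/abc/001.txt", "other/path/def/002.txt")
best_names <- patient_name_best(paths)
print(best_names)
```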
But what if my problem really cannot use vectorized functions?
There are cases where your code really cannot be rewritten in this way, so is Vectorize() a good solution in that case? I argue that it is not, for the following reasons.
Vectorize does not generate type stable functions
The function generated by Vectorize() wraps the input function in a call to mapply() under the hood, with the default argument SIMPLIFY = TRUE. This means the type of the function's output depends on its input. For example, a zero-length input returns an empty list rather than an empty vector of the expected type.
Type stability is also the reason it is best to avoid sapply() in favor of the type stable vapply().
You can call Vectorize() with SIMPLIFY = FALSE when you generate the vectorized function, but this will cause the function to return a list of values rather than a vector. Because many vectorized functions do not work with list inputs, this often means you will then need to post-process your output.
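A minimal sketch of the type instability discussed in this section, using a made-up function double_it: the Vectorize() wrapper returns a numeric vector for non-empty input but a list for empty input, while vapply() returns a numeric vector in both cases.

```r
# Hypothetical scalar function used only for illustration
double_it <- function(x) x * 2

v_double <- Vectorize(double_it)

non_empty <- v_double(1:3)         # simplified to a numeric vector
empty     <- v_double(integer(0))  # list(), not numeric(0)

# vapply() declares the result type up front, so it is stable
stable_empty <- vapply(integer(0), double_it, numeric(1))

print(class(non_empty))
print(class(empty))
print(class(stable_empty))
```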
Vectorize does not generate functions with easily inspectable code
Because of the way Vectorize() generates the function, all generated functions have the same body when printed.
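One way to see this for yourself is to compare the bodies of two unrelated vectorized functions. The functions add_one and shout below are made up for the sketch.

```r
# Two unrelated, hypothetical scalar functions
add_one <- function(x) x + 1
shout   <- function(s) toupper(s)

v_add_one <- Vectorize(add_one)
v_shout   <- Vectorize(shout)

# Both wrappers share the same generic body; only their formals differ
same_body <- identical(body(v_add_one), body(v_shout))
print(same_body)

# Printing either wrapper shows the generic wrapper code,
# not the body of add_one() or shout()
print(v_add_one)
```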
This means you lose the easy inspectability of functions. Being able to easily see the implementation of functions in R is one of the strengths of the ecosystem, so losing this behavior makes your functions much more difficult for users (or yourself in the future) to understand. It is possible to retrieve the original function definition, but doing so requires you to know how Vectorize() works: it stores the original function in a variable called FUN in the generated function's enclosing environment.
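As a rough sketch of that retrieval, the original function can be pulled out of the wrapper's enclosing environment. This relies on how Vectorize() is currently implemented in base R (the original is kept there as FUN), so treat it as an implementation detail rather than a supported interface; add_one is a made-up example function.

```r
# Hypothetical scalar function
add_one <- function(x) x + 1
v_add_one <- Vectorize(add_one)

# The wrapper's enclosing environment holds the original function as FUN
original <- environment(v_add_one)$FUN
print(body(original))
```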
Vectorize functions use do.call(), which can have unexpected performance consequences
This is best explained by Hadley Wickham in http://rpubs.com/hadley/do-call2, but the gist is that do.call() ends up doing a lot more work than you might expect and in some cases has performance implications, although in this particular case they will be of minor concern. Because all functions generated with Vectorize() use do.call(), they inherit these issues.
Vectorize does not actually make your code execute faster
Perhaps the most important reason is that Vectorize() will not make your code faster. People often want to vectorize their function because they have observed that vectorized functions are fast. This is usually true; however, it is true not because they are vectorized, but because vectorized functions are often written in C code (or call other functions which are). Vectorize() essentially just wraps your code in a loop and runs it repeatedly, so it cannot improve the running time.
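To make the "wraps your code in a loop" point concrete, the sketch below counts how often the underlying function runs. The function slow_square and the counter calls are made up for illustration; the wrapped function is invoked once per element, exactly as an explicit loop would do.

```r
# Counter incremented every time the scalar function runs
calls <- 0

# Hypothetical scalar function that records each invocation
slow_square <- function(x) {
  calls <<- calls + 1
  x^2
}

v_square <- Vectorize(slow_square)
result <- v_square(1:5)

print(result)
print(calls)  # one call per element of the input
```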
So what should I do?
Because of these issues, I think a cleaner solution is to first try to rewrite your function to take advantage of existing vectorized functions. If that is not possible, define your original function as an internal helper and call it through the equivalent type stable vapply() over the inputs.
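A minimal sketch of this helper-plus-vapply() pattern, assuming the path example from earlier in the post. The helper name patient_name_one and the use of base strsplit() are choices made for this illustration, not necessarily the original code.

```r
patient_name <- function(paths) {
  # Internal helper: handles exactly one path (hypothetical name)
  patient_name_one <- function(path) {
    parts <- strsplit(path, "/", fixed = TRUE)[[1]]
    paste(tail(parts, 2), collapse = "_")
  }

  # vapply() loops over the inputs and enforces one string per element,
  # so the result is always a character vector, even for empty input
  vapply(paths, patient_name_one, character(1), USE.NAMES = FALSE)
}

print(patient_name(c("some/path/abc/001.txt", "other/path/def/002.txt")))
print(patient_name(character(0)))
```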
This takes only a few more lines of code than the original and solves the majority of the issues with Vectorize(). The function is now type stable for all inputs, the function body remains inspectable, and you avoid the potential pitfalls of using do.call().
In Dean’s original case he was trying to use patient_name in a call to dplyr::mutate(). Rather than using Vectorize() to generate a new function, in this case I would instead suggest an idiom like the following.
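One possible form of such an idiom is to call vapply() directly inside mutate(). In the sketch below, patient_name() is a stand-in scalar implementation and df is made-up data; a base R fallback is included only so the sketch still runs when dplyr is not installed.

```r
# Stand-in scalar function: works on a single path only
patient_name <- function(path) {
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  paste(tail(parts, 2), collapse = "_")
}

# Hypothetical data
df <- data.frame(
  path = c("some/path/abc/001.txt", "other/path/def/002.txt"),
  stringsAsFactors = FALSE
)

if (requireNamespace("dplyr", quietly = TRUE)) {
  # The idiom: loop with vapply() inside mutate(), declaring the result type
  result <- dplyr::mutate(
    df,
    name = vapply(path, patient_name, character(1), USE.NAMES = FALSE)
  )
} else {
  # Base R fallback so the sketch runs without dplyr
  result <- df
  result$name <- vapply(df$path, patient_name, character(1), USE.NAMES = FALSE)
}

print(result)
```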
This is similarly concise to using Vectorize() but is also type stable for all possible inputs.