Recently, I became interested in maximizing lookup performance in R. Many problems require looking up values in tables. I decided to perform some tests to show how different lookup mechanisms performed in R.
If you’re not familiar with R, here’s a short description. R is a very popular language (and environment) for working with data. You can download a current version from The R Project web site. R was designed to make it easy to define and manipulate vectors and matrices.
In R, it’s easy to define a vector:
> a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) > a [1] 1 2 3 4 5 6 7 8 9 10
You can refer to elements by index:
> a[3] [1] 3
It’s also possible to name elements in a vector, then refer to them by name:
> b <- c(Joe=1, Bob=2, Jim=3) > b["Bob"] Bob
This can be very convenient: you can use every vector in R as a table. Internally, R represents the variable names using a second vector. You can access the name vector through the names function:
> names(b) [1] "Joe" "Bob" "Jim"
It’s often useful to look up data in a table of values, and it is tempting to use names to look up items in R vectors. For small data sets, this is OK. Unfortunately, you might run into some performance problems with large tables. Let’s take a closer look at the performance of table lookups in R.
To start with, let’s build a large, labeled vector in R. To make things simple, I wrote a function to generate a labeled vector for a given input size. I filled each vector with sequential numerical values. Then, I assigned each value a label by translating each value to a character string.
labeled.array <- function(n) { a <- 1:n from <- "0123456789" to <- "ABCDEFGHIJ" for (i in 1:n) { names(a)[i] <- chartr(from,to,i) } a }
Here’s an example of the output of this function:
> a.20 <- labeled.array(20) > a.20 B C D E F G H I J BA BB BC BD BE BF BG BH BI BJ CA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Now, let’s do the same thing for environments. We’ll create a new environemnt, specifying that a hash table should be used. Then, we’ll assign identical values to the labeled.array function:
labeled.environment <- function(n) { e <- new.env(hash=TRUE,size=n) from <- "0123456789" to <- "ABCDEFGHIJ" for (i in 1:n) { assign(x=chartr(from,to,i), value=i, envir=e) } e }
Here are a couple example showing how you fetch values from the environment object:
> get("B",envir=e.20) [1] 1 > get("CA",envir=e.20) [1] 20
Alternately, you can use the [[ operator:
> e.20[["B"]] [1] 1 > e.20[["CA"]] [1] 20
We’d like to show how long it takes to look up values in an array using different methods. Let’s start by creating a set of arrays and environments for testing:
arrays <- list() for (i in 10:15) { arrays[[as.character(2 ** i)]] <- labeled.array(2 ** i)} environments <- list() for (i in 10:15) { environments[[as.character(2 ** i)]] <- labeled.environment(2 ** i)}
Notice that I created a set of different size arrays and environments for testing. I created sets for different powers of 2, between 1024 and 32768.
Now, the fun part. Let’s write a test function to compute the time required to perform the lookups on different data objects. We’ll design the test function so that we can vary the input data type, lookup operation, data size, and number of repetitions.
To do the calculations, I took advantage of a nifty R feature. R allows you to pass around R functions, then execute these expressions later. This allowed me to write a single lookup expression as a function, then apply this function to different data objects.
We will show how to do the lookup in a few different ways:
- by index
- by label, using the single bracket lookup operator
- by label, using the double bracket lookup operator with exact matches only (the default)
- by label, using the double bracket lookup operator with inexact matches allowed
I performed 1024 lookups for each lookup type and data size.
Here is the script that I used to calculate the results.
test_expressions <- function(description, fun, data, reps) { # data should be a list # fun should be a function that takes a data object, length, # and number of repetitions # description should be a char cat(paste(description,"\n")) results <- vector() for (n in names(data)) { results[[n]] <- system.time(fun(data[[n]], as.integer(n), reps))[["user.self"]] } print(results) } test_expressions( "first element, by index:", function(d,l,r) { s <- 0 for (v in 1:r) {s <- s + d[1]} }, arrays, 1024) test_expressions( "last element, by index:", function(d,l,r) { s <- 0 for (v in 1:r) {s <- s + d[l]} }, arrays, 1024) # useful definitons for translation from <- "0123456789" to <- "ABCDEFGHIJ" test_expressions( "arrays, first element, by label, single bracket:", function(d,l,r) { s <- 0 min <- chartr(from,to,1) for (v in 1:r) {s <- s + d[min]} }, arrays, 1024) test_expressions( "arrays, last element, by label, single bracket:", function(d,l,r) { s <- 0 max <- chartr(from,to,l) for (v in 1:r) {s <- s + d[max]} }, arrays, 1024) test_expressions( "arrays, first element, by label, double bracket, exact (default):", function(d,l,r) { s <- 0 min <- chartr(from,to,1) for (v in 1:r) {s <- s + d[[min]]} }, arrays, 1024) test_expressions( "arrays, last element, by label, double bracket, exact (default):", function(d,l,r) { s <- 0 max <- chartr(from,to,l) for (v in 1:r) {s <- s + d[[max]]} }, arrays, 1024) test_expressions( "arrays, first element, by label, double bracket, not exact:", function(d,l,r) { s <- 0 min <- chartr(from,to,1) for (v in 1:r) {s <- s + d[[min, exact=FALSE]]} }, arrays, 1024) test_expressions( "arrays, last element, by label, double bracket, not exact:", function(d,l,r) { s <- 0 max <- chartr(from,to,l) for (v in 1:r) {s <- s + d[[max, exact=FALSE]]} }, arrays, 1024) test_expressions( "environments, first element, by label:", function(d,l,r) { s <- 0; min <- chartr(from,to,1); for (v in 1:r) {s <- s + get(x=min,envir=d)} }, environments, 1024) test_expressions( "environments, last element, by label:", function(d,l,r) { s <- 0; max <- chartr(from,to,l); for (v in 1:r) {s <- s + get(x=max,envir=d)} }, environments, 1024)
Here are the results of running these tests on my computer (an aluminum Macbook, 2 GHz, 4 GB RAM):
first element, by index: 1024 2048 4096 8192 16384 32768 0.010 0.003 0.004 0.003 0.005 0.004 last element, by index: 1024 2048 4096 8192 16384 32768 0.010 0.004 0.004 0.004 0.003 0.004 arrays, first element, by label, single bracket: 1024 2048 4096 8192 16384 32768 0.268 0.282 0.588 1.439 2.728 5.397 arrays, last element, by label, single bracket: 1024 2048 4096 8192 16384 32768 0.173 0.278 0.582 1.517 2.713 5.266 arrays, first element, by label, double bracket, exact (default): 1024 2048 4096 8192 16384 32768 0.002 0.002 0.002 0.002 0.003 0.002 arrays, last element, by label, double bracket, exact (default): 1024 2048 4096 8192 16384 32768 0.036 0.070 0.136 0.273 0.549 1.107 arrays, first element, by label, double bracket, not exact: 1024 2048 4096 8192 16384 32768 0.010 0.003 0.003 0.002 0.003 0.003 arrays, last element, by label, double bracket, not exact: 1024 2048 4096 8192 16384 32768 0.042 0.069 0.137 0.275 0.551 1.112 environments, first element, by label: 1024 2048 4096 8192 16384 32768 0.012 0.005 0.006 0.006 0.005 0.005 environments, last element, by label: 1024 2048 4096 8192 16384 32768 0.012 0.005 0.006 0.005 0.006 0.005
Notice that lookups by array are fastest, followed by lookups of the first element using the double-bracket notation. Lookups with the single bracket notation take substantially longer, but take approximately the same time as looking up the last element by double-bracket notation.
If you test this yourself with other array sizes, you’ll find that the performance of lookups by index is essentially constant, but that lookups by reference scale linearly with the size of the array (in the worst case).
The reason is that vectors are implemented as an array of values and an array of names. Looking up a value by location takes essentially constant time. Looking up a value by label requires R to scan every label in the array of names. In the worst case, R has to scan every name in the array. When you use single-bracket notation, R tried to match all elements with a given label, including fuzzy matches. That means that R scans all the element in the array when you use single bracket notation. Using double-bracket notation works slightly better; R will return the first matching element that it finds, so it performs well when looking up the first element, but badly when looking up the last.
For those of you with an algorithm background, here’s a summary:
- Array value lookup by index: Θ(1) time
- Array value lookup by label, using the single bracket lookup operator: : Θ(n) time
- Array value lookup by label, using the double bracket lookup operator with exact matches only (the default): O(1) time
- Array value lookup by label, using the double bracket lookup operator with inexact matches allowed: O(1) time
Notice that lookups in the environment are much, much faster. (If you look at other numbers of elements, you’ll find that performance is essentially constant.) The reason for this is that environments are implemented using hash tables. Hash tables are slower than index lookups by a constant factor: it takes a little time to calculate the hash for an item. (By the way, I haven’t checked exactly which hash implementation is used in R. Depending on the implementation, hash lookups could take up to O(n) time in the worst case.)
For those of you with an algorithm background, here’s a summary:
- Environment value lookup by label: O(1) time on average
So, what are the lessons here? First, if the performance of your program is OK, don’t worry about what type of look up you are using. Using environments can be less elegant and more confusing than vectors. It is usually more important to write correct code that you and other people can understand and maintain than it is to write fast code.
But suppose that you have measured the performance of your program and it’s not good enough for your application. If you’re looking up elements by index (and not value), you should stick with vectors: vectors are slightly faster than environments. However, if your program contains a large number of lookups by label, you should consider replacing vectors with environments.
If you have already written a program that uses lots of single bracket notation and want to switch to environments, you have two choices:
- Change to the double bracket operator. This function (yes, operators are functions in R) works with environments.
- Redefine the single-bracket operator so that it works with environments. Doing this is possible, but slightly tricky.
By default, single bracket notation is implemented using a primitive function:
> `[` .Primitive("[")
The primitive function will return an error message when applied to environments; you can’t just define a new method `[.environment`. (I tried this first.) As an alternative, you can create a wrapper around the primitive function that uses the get method for environments to look up the value:
`[` <- function(x, y, ...) { if (class(x) == "environment") get(x=y,envir=x) else .Primitive("[")(x,y,...) }
Doing this for the assignment function is much trickier, but also possible.
One final lesson: if you are looking up exactly one value in R, consider using the double bracket notation (such as x[[i]]) and not the single bracket notation (such as x[i]). As we showed above, the double bracket operator will be at least as fast as the single bracket operator when looking up values in vectors. Additionally, the double-bracket notation makes it easier to change your implementation to use environment objects for storage.