“Apply” functions keep you from having to write loops to perform some operation on every row or every column of a matrix or data frame, or on every element in a list. For example, the built-in data set state.x77 contains eight columns of data describing the 50 U.S. states in 1977. If you wanted the average of each of the eight columns, you could do this:
> avgs <- numeric (8)
> for (i in 1:8)
+     avgs[i] <- mean (state.x77[,i])   # The "+" is R's continuation character; don't type it
> avgs
[1]  4246.4200  4435.8000     1.1700    70.8786     7.3780    53.1080   104.4600 70735.8800
This is comparatively slow, and the penalty grows with the size of the data: explicit loops in interpreted R code carry a lot of overhead. A more compact way to do this is to use the apply() function. In this example, apply() extracts each column as a vector, one at a time, and passes it to the median() function (note that we have switched from means to medians here).
> apply (state.x77, 2, median)
Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost       Area
    2838.5       4519       0.95     70.675       6.85      53.25      114.5      54277
The 2 means “go by column”; a 1 would have meant “go by row.” Of course, if we had used a 1, we would have computed 50 medians, one for each row. If we had had a three-dimensional array we could have used a 3 there. The third argument specifies the function to be applied to each column. We can use any function that makes sense there: a built-in function, one of our own, or even a function that we write on the spot. If your function returns a vector of constant length, R will stick the vectors together into a matrix. However, if your function returns vectors of different lengths, R will have to create a list (see more details below).
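To make the “direction” argument concrete, here is a minimal sketch of going by row and of using a 3 on a three-dimensional array. The array arr is made up purely for illustration; it is not part of any data set discussed here.

```r
# Row-wise: one median per state, so 50 values in all.
row_medians <- apply(state.x77, 1, median)
length(row_medians)

# A small made-up 2 x 3 x 4 array, used only for illustration.
arr <- array(1:24, dim = c(2, 3, 4))

# Margin 3: apply sum() to each of the four 2 x 3 slices.
slice_sums <- apply(arr, 3, sum)
slice_sums   # 21 57 93 129
```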
The special cases of mean and sum have been taken care of already with the built-in colMeans, colSums, rowMeans, and rowSums functions. These are highly efficient and worth using.
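As a quick check, the sketch below compares the built-in shortcuts with the equivalent apply() calls on state.x77. The variable names are my own.

```r
# Column means two ways: general-purpose apply() and the specialized colMeans().
means_apply <- apply(state.x77, 2, mean)
means_fast  <- colMeans(state.x77)
print(all.equal(means_apply, means_fast))

# Row sums two ways (adding unrelated columns is meaningless, but it shows the equivalence).
sums_apply <- apply(state.x77, 1, sum)
sums_fast  <- rowSums(state.x77)
print(all.equal(sums_apply, sums_fast))
```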
In this example, we construct a function “on the fly” and pass it to apply. This particular function computes the median and maximum of each column of state.x77.
> apply (state.x77, 2, function(x) c(median (x), max(x)))
     Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
[1,]     2838.5   4519       0.95   70.675   6.85   53.25 114.5  54277
[2,]    21198.0   6315       2.80   73.600  15.10   67.30 188.0 566432
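The rows of that result are labelled only [1,] and [2,]. One possible refinement, sketched here, is to have the on-the-fly function return a named vector, so that apply() uses those names as row names. The labels "median" and "max" are my own choice.

```r
# Return a named vector from the anonymous function; the names become row names.
summary_tab <- apply(state.x77, 2, function(x) c(median = median(x), max = max(x)))

rownames(summary_tab)        # "median" "max"
summary_tab["max", "Area"]   # largest area among the 50 states
```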
If you pass additional arguments to apply, those arguments get passed down to the function you’re having apply call. So if you wanted to calculate the mean of each column after trimming the highest and lowest 10%, you could do this:
> apply (state.x77, 2, mean, trim=.1)
 Population      Income  Illiteracy    Life Exp      Murder     HS Grad       Frost        Area
 3384.27500  4430.07500     1.09750    70.91775     7.29750    53.33750   106.80000 56575.72500
This is particularly handy for passing the na.rm=TRUE argument to functions like max, so that missing values are dropped before the function is applied.
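Here is a small sketch of passing that argument through apply(). The matrix m is made up for illustration, with one missing value in its first column.

```r
# A made-up 2 x 3 matrix with one missing value.
m <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 2)

# Without na.rm, the column containing NA yields NA.
max_default <- apply(m, 2, max)

# na.rm = TRUE is handed down to max() for every column.
max_no_na <- apply(m, 2, max, na.rm = TRUE)

max_default   # NA 4 6
max_no_na     # 1 4 6
```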
Does apply() loop?
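Yes. Clearly something has to loop. lapply() does its looping in compiled code (C), not in R’s own interpreted code, and that is where the apply family gets its speed. In current versions of R, though, apply() itself is written in R and runs an ordinary for loop internally, so its main benefit is compactness rather than speed. On large problems the difference between compiled and interpreted looping can be the difference between finishing and crashing. Note: After writing this I got curious about the extent to which apply() increases speed. I used commands like this:
> system.time (for (j in 1:20000) colMeans (state.x77))
> system.time (for (j in 1:20000) apply (state.x77, 2, mean))
> system.time (for (j in 1:20000) for (i in 1:8) mean (state.x77[,i]))
expecting the last one to be reported as the slowest. Actually, though, the middle one was. I’m not sure of the whole story, but a likely part of it is that apply() does extra bookkeeping (converting its input, working out dimensions, assembling the result) around its own R-level loop, while colMeans() does everything in compiled code.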
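If you want to repeat this comparison yourself, the following is a rough sketch. The repetition count reps is an arbitrary value I picked so that it runs quickly, and the timings will differ from machine to machine and between R versions, so treat the numbers as indicative only.

```r
reps <- 200   # arbitrary repetition count, kept small so the sketch runs quickly

t_colmeans <- system.time(for (j in seq_len(reps)) colMeans(state.x77))
t_apply    <- system.time(for (j in seq_len(reps)) apply(state.x77, 2, mean))
t_loop     <- system.time(for (j in seq_len(reps)) {
  avgs <- numeric(ncol(state.x77))
  for (i in seq_len(ncol(state.x77))) avgs[i] <- mean(state.x77[, i])
})

# Elapsed seconds for each approach; exact values depend on the machine.
timings <- c(colMeans = t_colmeans[["elapsed"]],
             apply    = t_apply[["elapsed"]],
             loop     = t_loop[["elapsed"]])
print(timings)
```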
Sometimes you expect apply() to return a vector but you get a list
I include this topic because it has bedeviled me in the past. Suppose I have this matrix a, and I want to find the smallest number in each row. This is easy:
> a <- matrix (c(5, 2, 7, 1, 2, 8, 4, 5, 6), 3, 3)
> a
     [,1] [,2] [,3]
[1,]    5    1    4
[2,]    2    2    5
[3,]    7    8    6
> apply (a, 1, min)
[1] 1 2 6
So apply() works on each row, one at a time, to tell me the smallest number in each row. What if I want the index of the smallest number in each row? That is, I want the answer to the question “in which column can the minimum value be found?” That sounds easy, too: we’ll use the which() function, which returns the indices within a vector for which the vector holds the value TRUE.
> which (c(F, F, T, F, T, T, F))   # Example of "which": where are the TRUEs?
[1] 3 5 6
#
# For each row, find the column in which that row has its smallest value.
#
> apply (a, 1, function(x) which(x == min(x)))
[[1]]
[1] 2

[[2]]
[1] 1 2

[[3]]
[1] 3
What has happened here is that there’s a tie in the second row. Our function returns a single value for rows 1 and 3, but two values for row 2. apply() can’t arrange results of different lengths into a vector or matrix, so it returns a list. The [[1]] tells us that the first element of the list has no name.
If we needed to do this we might impose a rule like “if there’s a tie pick out the first one.”
> apply (a, 1, function(x) which(x == min(x))[1])
[1] 2 1 3
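An alternative worth knowing about is base R’s which.min(), which returns the position of the first minimum and so builds in the “pick the first one” rule. A minimal sketch using the same matrix a:

```r
a <- matrix(c(5, 2, 7, 1, 2, 8, 4, 5, 6), 3, 3)

# which.min() always returns a single index (the first minimum), so apply() can return a vector.
first_min_col <- apply(a, 1, which.min)
first_min_col   # 2 1 3

# The explicit version with the tie-breaking rule gives the same answer.
explicit <- apply(a, 1, function(x) which(x == min(x))[1])
```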
Lapply and sapply: avoiding loops on lists and data frames
The regular apply() function can be used on a data frame, because apply() converts the data frame to a matrix before doing anything else. When you use it on the columns of a data frame, passing the number 2 for the second argument, it usually does what you expect. It will work on the rows of a data frame, too, but remember: apply extracts each row as a vector, one at a time. Every element of a vector must have the same kind of data, so unless every column of the data frame has the same kind of data, R will end up converting the elements of the row to a common format (like character).
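The conversion is easy to see on a tiny example. The data frame df below is made up for illustration, with one character column and one numeric column.

```r
# A made-up data frame with mixed column types.
df <- data.frame(name = c("a", "b"), value = c(1.5, 2), stringsAsFactors = FALSE)

# Column by column, each column keeps its own type.
col_classes <- sapply(df, class)
col_classes   # name: "character", value: "numeric"

# Row by row through apply(), every row arrives as a character vector.
row_classes <- apply(df, 1, function(row) class(row))
row_classes   # "character" "character"
```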
The lapply() function works on any list, not just a rectangular one. (The “l” in “lapply” stands for “list.”) In that way it’s more general than apply(), although it does not work by row or column on matrices or higher-dimensional arrays. You don’t need to specify the “direction” as you do with apply(); just pass the function. However, lapply() always returns a list. Usually I want a vector, and that’s what sapply() tries to produce. The “s” in “sapply” stands for “simplify.” Here’s an example using the built-in barley data frame. My question is, how many distinct values of each variable are there? We can count them by seeing how many unique entries there are, so length(unique(x)) will do the trick.
> library (lattice)                                     # Make this data available
> dim (barley)                                          # Barley has 120 rows
[1] 120   4
> lapply (barley, function(x) length(unique(x)))        # Returns a list
$yield
[1] 114

$variety
[1] 10

$year
[1] 2

$site
[1] 6

> sapply (barley, function(x) length(unique(x)))        # Simplifies output to a vector
  yield variety    year    site
    114      10       2       6
> apply (barley, 2, function(x) length(unique(x)))      # Also works on data frames (but not non-data frame lists)
  yield variety    year    site
    114      10       2       6
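To show lapply() and sapply() on a list that is not rectangular, here is a small sketch. The list pieces is made up for illustration. I also include vapply(), a base R relative of sapply() that makes you state the type and length of each result up front and stops with an error if a result does not match.

```r
# A made-up list whose elements have different lengths.
pieces <- list(a = 1:3, b = c(10, 20), c = 7)

len_list <- lapply(pieces, length)   # always a list
len_vec  <- sapply(pieces, length)   # simplified to a named vector: a = 3, b = 2, c = 1

# vapply() promises exactly one numeric value per element.
piece_means <- vapply(pieces, mean, numeric(1))
piece_means   # a = 2, b = 15, c = 7
```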
Tapply: avoiding loops when applying a function to subsets
tapply() is a very powerful function that lets you break a vector into pieces, and then apply some function to each of the pieces. (For you Excel users, tapply() produces things that correspond to Excel’s pivot tables.) It’s sort of like sapply(), except that with sapply() the pieces are always elements of a list. With tapply() you get to specify how the breakdown is done. For example, suppose I want to find the average yield for each site in the barley data from the last example.
> tapply (barley$yield, barley$site, mean)
   Grand Rapids          Duluth University Farm          Morris       Crookston          Waseca
       24.93167        27.99667        32.66667            35.4           37.42        48.10833
tapply() returns a vector with one element for each unique value of barley$site. The element for Grand Rapids, for example, gives the average of all the elements of barley$yield for which barley$site == "Grand Rapids". I have found tapply() to be incredibly useful. If you want to cross-tabulate by more than one variable, construct a list of your tabulating variables and pass that to tapply(). Here we break yields down by year and site.
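One way to see what tapply() is doing is to carry out the breakdown by hand with split() and then sapply() over the pieces. This is a sketch on made-up numbers (the vectors yield and site here are hypothetical, not the barley data), and it illustrates the idea rather than how tapply() is implemented.

```r
# Made-up yields at two hypothetical sites.
yield <- c(10, 20, 30, 40, 50, 60)
site  <- c("A", "B", "A", "B", "A", "B")

# One step: break yield down by site and average each piece.
by_tapply <- tapply(yield, site, mean)

# Two steps: split() makes a list of pieces, sapply() averages each one.
by_split <- sapply(split(yield, site), mean)

by_tapply   # A = 30, B = 40
```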
> tapply (barley$yield, list (barley$year, barley$site), mean)
     Grand Rapids   Duluth University Farm   Morris Crookston   Waseca
1932     20.81000 25.70000        29.50667 41.51333     31.18 41.87000
1931     29.05334 30.29333        35.82667 29.28667     43.66 54.34667
We’ve learned something: 1931 was a much better year, except in Morris. (There’s some suspicion that Morris was in fact incorrectly recorded in this well-known data set.) 1932 appears before 1931 in the table because that’s how the levels of “year” were set up in S-Plus. (If this bothers you see Reordering the levels of a factor.) Years appear in the rows because they came first in the list. Of course a three- or higher-way table can be made in this way as well.
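The points about row order and about which variable lands in the rows can be checked on a small made-up example. Everything below (yield, year, site) is hypothetical data, with the factor levels deliberately set so that 1932 comes before 1931.

```r
# Made-up data: two years, two sites, two observations per cell.
yield <- c(10, 20, 30, 40, 50, 60, 70, 80)
year  <- factor(c("1931", "1931", "1932", "1932", "1931", "1931", "1932", "1932"),
                levels = c("1932", "1931"))   # level order controls row order
site  <- c("A", "B", "A", "B", "A", "B", "A", "B")

# First variable in the list goes to the rows, second to the columns.
tab <- tapply(yield, list(year, site), mean)
tab

rownames(tab)      # "1932" "1931", following the factor levels
tab["1931", "A"]   # mean of 10 and 50
```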