Good R Programming Techniques
R differs from most programming languages in the way it executes the code given to it. Because of these eccentricities, there are several guidelines to keep in mind when writing functions that should run faster and consume less memory. All of the guidelines are optional; however, the larger the datasets become, the greater the benefit of following them.
- Avoid recursion. Because of the way R executes code, recursive functions run slower and can consume large amounts of memory.
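As a minimal sketch of the recursion guideline, the two functions below compute a factorial recursively and with a single vectorized call. The names fact_recursive and fact_vectorized are illustrative only and are not part of the original notes.

```r
# Recursive version: every call adds another frame to the call stack.
fact_recursive <- function(n) {
  if (n <= 1) {
    return(1)
  }
  n * fact_recursive(n - 1)
}

# Vectorized version: one call to prod() over the sequence 1..n.
fact_vectorized <- function(n) {
  prod(seq_len(n))
}

fact_recursive(10)   # 3628800
fact_vectorized(10)  # 3628800
```

Both give the same answer; the vectorized form avoids the nested function calls altogether.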
- If a function uses the same complex, constant calculation several times, consider assigning it to its own variable.
- Before:
    w <- log(x) + log(y) + 10
    v <- log(x) + log(y) + 32
- After:
    t <- log(x) + log(y)
    w <- t + 10
    v <- t + 32
- If you use a variable only once, consider eliminating it.
- Before:
    f <- function(n = 125000) {
      x <- runif(n)
      sum(x + 1)
    }
- After:
    f <- function(n = 125000) {
      sum(runif(n) + 1)
    }
- When temporary variables are needed, reusing one name for temporaries of the same size that are not required simultaneously can avoid unneeded copying.
- Before:
    g <- function(n = 125000) {
      tmp <- runif(n)
      tmp1 <- 2 * tmp + tmp^2
      tmp2 <- tmp1 - trunc(tmp1)
      mean(tmp2 > 0.5)
    }
- After:
    g1 <- function(n = 125000) {
      tmp <- runif(n)
      tmp <- 2 * tmp + tmp^2
      tmp <- tmp - trunc(tmp)
      mean(tmp > 0.5)
    }
- Try not to use loops.
- In R, explicit loops are usually much slower than vectorized operations. In this example, 15 is added to each element of a 100000-element vector in four different ways; compare the timings.
    n <- 100000
    a <- runif(n)
    d <- vector(mode = typeof(a), n)
    system.time(a + 15, gcFirst = TRUE)
    system.time(lapply(a, function(x) x + 15), gcFirst = TRUE)
    system.time(sapply(a, function(x) x + 15), gcFirst = TRUE)
    system.time(for (i in seq(along.with = a)) d[i] <- a[i] + 15, gcFirst = TRUE)
- Use vectorized arithmetic, sapply, or lapply instead.
- Before:
    over.thresh <- function(x, threshold) {
      for (i in seq_along(x))
        if (x[i] < threshold)
          x[i] <- 0
      x
    }
- After:
    over.thresh2 <- function(x, threshold) {
      x[x < threshold] <- 0
      x
    }
- For operations on individual elements of a list, use the apply family of functions, such as lapply, tapply, etc.
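A small sketch of the apply family on a list follows. The list scores and its contents are made up for illustration.

```r
# A hypothetical list whose elements have different lengths.
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# lapply() applies mean() to each element and returns a list.
means_list <- lapply(scores, mean)

# sapply() does the same but simplifies the result to a named vector.
means_vec <- sapply(scores, mean)

means_vec
```

Here means_vec holds the values 2, 15 and 5 under the names a, b and c, with no explicit loop written.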
- Avoid looping over a named data set. If necessary, save the names with tnames <- names(x), remove them with names(x) <- NULL, perform the loop, and then reassign the names with names(x) <- tnames.
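The save, strip, loop, restore sequence for names can be sketched as follows. The vector x and the doubling inside the loop are placeholders for real data and real work.

```r
x <- c(a = 1, b = 2, c = 3)

tnames <- names(x)      # save the names
names(x) <- NULL        # remove them before looping

for (i in seq_along(x)) {
  x[i] <- x[i] * 2      # placeholder for the real per-element work
}

names(x) <- tnames      # reassign the names afterwards
x
```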
- Avoid growing data sets in a loop. Always create a data set of the desired size before entering the loop; this greatly improves memory allocation. If you don’t know the exact size, overestimate it and then shorten the vector at the end of the loop.
- Before:
    grow <- function() {
      nrow <- 1000
      x <- NULL
      for(i in 1:(nrow)) {
        x <- rbind(x, i:(i+9))
      }
      x
    }
    system.time(grow(), gcFirst=TRUE)
- After:
    no.grow <- function() {
      nrow <- 1000
      x <- matrix(0, nrow = nrow, ncol = 10)
      for(i in 1:nrow) {
        x[i, ] <- i:(i + 9)
      }
      x
    }
    system.time(no.grow(), gcFirst=TRUE)
- When an element is added to an existing vector, R allocates a new vector whose length is that of the current vector plus the additional element. It then copies the existing vector and the new element into the new vector. In contrast, overwriting an element in a vector requires copying only the replacement elements.
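When the final size is not known in advance, the overestimate-then-shorten approach might look like the sketch below. The function collect_even_squares is a made-up example: it keeps the squares of the even inputs, so the result can never be longer than the input.

```r
collect_even_squares <- function(values) {
  out <- numeric(length(values))  # overestimate: at most one result per input
  k <- 0                          # number of slots actually filled
  for (v in values) {
    if (v %% 2 == 0) {
      k <- k + 1
      out[k] <- v^2               # overwrite a slot instead of growing out
    }
  }
  out[seq_len(k)]                 # shorten to the part that was used
}

collect_even_squares(1:10)  # 4 16 36 64 100
```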
The following are good programming techniques in general.
- Always use parentheses to make groupings explicit.
- In x < 5 && y > 10 || z < 6 it is not clear what should be happening.
    x <- 6
    y <- 10
    z <- 5
    w <- 15
    x < 5 && y > 10 || z < 6 && w < 25      # TRUE
    (x < 5 && y > 10) || (z < 6 && w < 25)  # TRUE
    x < 5 && (y > 10 || z < 6) && w < 25    # FALSE
    (x < 5 && y > 10 || z < 6) && w < 25    # TRUE
    x < 5 && (y > 10 || z < 6 && w < 25)    # FALSE
- -2^2 equals -4, not 4 like (-2)^2.
- Always use { and } in functions, loops, if, else, and other statements.
- Example of ambiguity:
    test <- function(x, y) {
      if(x) if(y) 5 else 6
    }
    (test(TRUE, TRUE))    # [1] 5
    (test(TRUE, FALSE))   # [1] 6
    (test(FALSE, TRUE))   # NULL
    (test(FALSE, FALSE))  # NULL
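With braces, the reader no longer has to guess which if an else belongs to. The sketch below writes out both possible readings; the names test_inner_else and test_outer_else are illustrative.

```r
# else attached to the inner if (this is how R reads the unbraced version).
test_inner_else <- function(x, y) {
  if (x) {
    if (y) {
      5
    } else {
      6
    }
  }
}

# else attached to the outer if (what the author may have intended).
test_outer_else <- function(x, y) {
  if (x) {
    if (y) {
      5
    }
  } else {
    6
  }
}

test_inner_else(TRUE, FALSE)  # 6
test_outer_else(FALSE, TRUE)  # 6
```

Note that test_outer_else(TRUE, FALSE) returns NULL, because neither branch supplies a value in that case.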
- Use the return() function at the end of functions.
- Always use TRUE and FALSE instead of T and F.
Writing Functions
Functions in R can do three things:
- Be passed values.
- Return a value.
- Produce side effects, that is, anything the function causes other than returning a value, such as the text output of print() or the opening of a DVI viewer by print.latex(). This lecture does not deal with side effects.
The R function
Statement
The basic R function statement looks like this:
    FUNname <- function( arglist ) { code }
- FUNname is the name that you have selected for your function.
- You can select any non-reserved word as your function name.
- However, a function and a variable cannot have the same name in the same environment; assigning one replaces the other.
- arglist is a comma-separated list of 0 or more arguments that can be passed to the function.
- code is the statements that perform the actions of the function.
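As a concrete instance of the template, here is a small function with one argument in its arglist. The name circle_area and its default value are chosen purely for illustration.

```r
# FUNname = circle_area, arglist = radius (with a default), code = the body.
circle_area <- function(radius = 1) {
  pi * radius^2
}

circle_area()   # area of a circle with the default radius 1
circle_area(2)  # area of a circle with radius 2
```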
Function Return Values
Functions are designed to return values. It is called returning because the value is taken from the function and handed back to the calling code.
- Functions have 2 ways to return values.
- The value of the last statement evaluated in the function.
- Example:
    f1 <- function() {
      10
    }
    > f1()
    [1] 10
- The value passed to return(). Calling return() always causes the function to return immediately, so any later statements are skipped.
- Example:
    f2 <- function() {
      return(20)
      10
    }
    > f2()
    [1] 20
Function Arguments
The function argument statement looks like this
VARname
or
VARname = VALUE
- Function arguments are how values are passed to the function by the calling code.
- By convention, the first argument is the main data object being passed to the function. The class of the first argument is used for matching S3 methods.
- The function arguments are treated like variables inside the function’s code.
- Example:
    f3 <- function(x) {
      x + 5
    }
    > f3(5)
    [1] 10
- Arguments can be given a default value, which is used if the calling code does not explicitly set the argument. This is done using the VARname = VALUE form.
- Example:
    f4 <- function(x, y=5) {
      x + y
    }
    > f4(5)
    [1] 10
    > f4(5,7)
    [1] 12
- You can test whether the calling code has set an argument to a value using the test function missing. The function missing(x) returns a logical TRUE if the value of x has not been set by the calling code.
- Example:
    f5 <- function(x) {
      if( missing(x) ) {
        return("x is missing")
      } else {
        return("x is not missing")
      }
    }
    > f5()
    [1] "x is missing"
    > f5(10)
    [1] "x is not missing"
- The arglist can also have a special type of argument, written as three dots (...). This argument can hold a variable number of arguments. In R functions it is mostly used for passing parameters on to other functions.
- Example:
    f6 <- function(z, ...) {
      print(paste("The value of z is", z, sep=" "))
      paste("The value of f5(...) is", f5(...), sep=" ")
    }
    > f6(5, x=3)
    [1] "The value of z is 5"
    [1] "The value of f5(...) is x is not missing"
    > f6(5)
    [1] "The value of z is 5"
    [1] "The value of f5(...) is x is missing"

    f6.5 <- function(z, ...) {
      print(paste("The value of z is", z, sep=" "))
      b <- list(...)
      cat(paste(names(b), b, sep=':', collapse=" "))
      cat("\n")
    }
    f6.5(5)
    f6.5(5, a="sdd", "desg")
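A common use of the dots argument is to forward optional settings to a base function. The sketch below assumes a hypothetical wrapper, mean_rounded, that passes anything in the dots on to mean().

```r
# digits is handled by the wrapper; everything in ... goes on to mean().
mean_rounded <- function(x, digits = 1, ...) {
  round(mean(x, ...), digits)
}

values <- c(1, 2, NA, 4)

mean_rounded(values)                # NA, because mean() sees the NA
mean_rounded(values, na.rm = TRUE)  # 2.3, na.rm was forwarded to mean()
```

The wrapper never has to list na.rm or trim among its own arguments; the caller can still supply them.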
Useful Functions for Use in Functions
There are many functions that are designed to only be used in other functions. In this section we go over some of the more useful functions.
- The as. family of coercion functions, such as as.vector(x), as.data.frame(x), or as.double(x). These functions are used to change the data type of x to a new data type.
- Example:
    f7 <- function(x) {
      as.character(x)
    }
    > class(f7(7))
    [1] "character"
    > class(f7("a"))
    [1] "character"
- The is. family of test functions, such as is.null(), is.numeric(), and is.data.frame(). These functions are used to tell what kind of variable the function has been passed.
- Example:
    f8 <- function(x) {
      is.character(x)
    }
    > f8(9)
    [1] FALSE
    > f8("abc")
    [1] TRUE
- The on.exit() function records the expression passed to it and executes that expression when the function exits, either naturally or as the result of an error.
- Example:
    f9 <- function() {
      opar <- par(mai = c(1,1,1,1))
      on.exit(par(opar))
      par()$mai
    }
    par()$mai
    # [1] 1.0627614 0.8543768 0.8543768 0.4376077
    f9()
    # [1] 1 1 1 1
    par()$mai
    # [1] 1.0627614 0.8543768 0.8543768 0.4376077
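The three helpers are often combined in one function. The sketch below uses two hypothetical functions: to_number checks and coerces its input with the is. and as. families, and format_short uses on.exit() to restore a global option without needing a graphics device.

```r
# Accept numbers or character strings holding numbers.
to_number <- function(x) {
  if (is.character(x)) {
    x <- as.numeric(x)
  }
  if (!is.numeric(x)) {
    stop("x must be numeric or a character string holding a number")
  }
  x
}

# Temporarily format with 3 significant digits, then restore the old setting.
format_short <- function(x) {
  old <- options(digits = 3)
  on.exit(options(old))
  format(x)
}

to_number("16")      # 16
format_short(pi)     # "3.14"
getOption("digits")  # same value as before the call
```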
R Gotchas
R has several ‘features’ that can trip up those who are not aware of them.
Environments
- Environments are where R stores its objects (variables, functions, etc.).
- All environments (except the empty environment) have a parent environment.
- Code running in an environment can read objects in the parent environment. If it assigns a new value to such an object with <-, a new object holding the new value is created in the current environment instead. The new object now masks the object from the parent environment. Use the alternative assignment operator <<- to modify variables in the parent environment.
    a <- 1
    test <- function() {
      print(a)   # 1, read from the parent environment
      a <- 2     # creates a local a that masks the parent's a
      print(a)   # 2
      invisible(NULL)
    }
    test()
    print(a)     # 1, the parent's a is unchanged
    test2 <- function() {
      print(a)   # 1
      a <<- 3    # modifies a in the parent environment
      print(a)   # 3
      invisible(NULL)
    }
    test2()
    print(a)     # 3
- Each call to a function receives its own environment, whose parent is the environment in which the function was defined (for a function defined at the top level, the global environment). When the function returns, that environment is normally discarded.
- Because of this, errors of omission can lead to very odd bugs.
- This is the correct code:
    a <- 5
    test1 <- function(b=1) {
      a <- 10 * b
      if(a > 5) {
        return(45)
      }
      return(b)
    }
    test1()
    # [1] 45
- Here is the same code, but a critical line has been commented out, leading to a function that runs without error but produces incorrect output.
    a <- 5
    test2 <- function(b=1) {
      #a <- 10 * b
      if(a > 5) {
        return(45)
      }
      return(b)
    }
    test2()
    # [1] 1
- When debugging, it’s a good idea to check that all of the variables in question have been assigned in the function.
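The sketch below shows where a function looks for a variable that it does not define itself: in the environment where the function was defined, not in the function that happens to call it. The names show_a and caller are illustrative.

```r
a <- "global"

# show_a() has no local a, so a is looked up where show_a() was defined.
show_a <- function() {
  a
}

caller <- function() {
  a <- "caller"  # local to caller(); invisible to show_a()
  show_a()
}

caller()  # "global"
```

This is why a forgotten assignment inside a function can silently pick up an unrelated top-level variable with the same name.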
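T and F vs. TRUE and FALSE
- Always use TRUE and FALSE instead of T and F. TRUE and FALSE are reserved words, whereas T and F are ordinary variables that can be reassigned.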
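A short sketch of how T can go wrong: because T is an ordinary variable, an assignment elsewhere in the code changes what it means. The helper check_flag is hypothetical.

```r
check_flag <- function(flag) {
  if (flag) "on" else "off"
}

T <- 0                        # allowed: T is just a variable
masked <- check_flag(T)       # "off", T is now 0
rm(T)                         # remove the masking variable again
restored <- check_flag(T)     # "on", T is back to its usual value
explicit <- check_flag(TRUE)  # "on", TRUE cannot be reassigned
```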
Object name confusion
- If the attach or with functions are used, there is a good chance that this will lead to confusion on the user’s part about which objects are being referenced.
- Let’s say we want to add the object mod to column a in the data frame junk.
    mod <- 15
    junk <- data.frame(a = 1:10)
    with(junk, a + mod)
    # [1] 16 17 18 19 20 21 22 23 24 25
- What happens if there is a column in junk named mod?
    mod <- 15
    junk <- data.frame(a = 1:10, mod = 6:15)
    with(junk, a + mod)
    # [1] 7 9 11 13 15 17 19 21 23 25
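One way to sidestep the confusion is to name the data frame explicitly, so that it is obvious which mod is meant. The sketch reuses the objects from the example above; the result names are illustrative.

```r
mod <- 15
junk <- data.frame(a = 1:10, mod = 6:15)

# Explicit references: junk$a is the column, mod is the standalone object.
explicit_sum <- junk$a + mod       # 16 17 ... 25

# Inside with(), mod silently refers to the column junk$mod instead.
masked_sum <- with(junk, a + mod)  # 7 9 ... 25
```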
Basic Intro to debugging in R
The 4 most important debugging functions in R are cat(), debug(), traceback(), and str().
- cat() – This is good for determining whether a part of the code is being run.
- str() – This is the key to determining what an object actually contains. It gives a comprehensive summary of the contents of the object. It takes a while to learn to read its output easily, but it is invaluable for determining the structure of complex lists.
- traceback() – When a function dies with an error, the user can call traceback(). This prints the stack trace at the time of the error. The higher the number, the deeper in the call stack the function is.
- debug() – Marks a function so that the debugger is called whenever the function is called.
- To advance a line, press Enter at an empty prompt or enter n.
- To continue to the next debugged function call, enter c.
- To print a stack trace of all active function calls, enter where.
- To quit, enter Q.
- Anything else entered at the prompt is evaluated as an expression in the current environment.
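A minimal sketch of cat() and str() used while developing a function follows. The function scale_values is hypothetical; debug() and traceback() appear only as comments because they are meant for interactive sessions.

```r
scale_values <- function(x, factor = 2) {
  # cat() confirms that this part of the code is reached.
  cat("scale_values called with", length(x), "values\n")
  result <- list(input = x, scaled = x * factor)
  # str() shows the structure of the list being returned.
  str(result)
  result
}

res <- scale_values(1:3)

# In an interactive session:
#   debug(scale_values)    # step through the next call line by line
#   undebug(scale_values)  # switch stepping off again
#   traceback()            # after an error, print the call stack
```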
— CharlesDupont – 03 Jun 2005