Prerequisite: Planning 2 or dean's permission
Units: 3.0
Classroom: online via Microsoft Teams
Class Time: Thursday: 9:30 AM-12:30 PM
Office Hour: Thursday: 12:30 PM -12:45 PM - Right after class time
Instructor: Zhuo Yao, Ph.D.
Instructor: Archt. Carmela C. Quizana
This is a set of notes, examples, data and projects for a short course in advanced R programming, aimed at graduate students and senior-level undergraduates. It is intended to help students develop the skills they'll need in their research or employment. Note that this lab session use the book "Cookbook for R" as a starting point.
Generally, while doing programming in any programming language, you need to use various variables to store various information. Variables are nothing but reserved memory locations to store values. This means that, when you create a variable you reserve some space in memory.
You may like to store information of various data types like character, wide character, integer, floating point, double floating point, Boolean etc. Based on the data type of a variable, the operating system allocates memory and decides what can be stored in the reserved memory.
In contrast to other programming languages like C and java in R, the variables are not declared as some data type. The variables are assigned with R-Objects and the data type of the R-object becomes the data type of the variable. There are many types of R-objects. The frequently used ones are −
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
If you are using a GUI for R, there is likely a menu-driven way of installing packages. This is how to do it from the command line:
install.packages('reshape2')
In each new R session where you use the package, you will have to load it:
library(reshape2)
If you use the package in a script, put this line in the script.
To update all your installed packages to the latest versions available:
update.packages()
If you are using R on Linux, some of the R packages may be installed at the system level by the root user, and can’t be updated this way, since you won’t haver permission to overwrite them.
In contrast to other programming languages like C and java in R, the variables are not declared as some data type. The variables are assigned with R-Objects and the data type of the R-object becomes the data type of the variable. There are many types of R-objects. The frequently used ones are −
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
Elements from a vector, matrix, or data frame can be extracted using numeric indexing, or by using a boolean vector of the appropriate length.
In many of the examples, below, there are multiple ways of doing the same thing. Indexing with numbers and names
With a vector:
# A sample vector
v <- c(1,4,4,3,2,2,3)
v[c(2,3,4)]
#> [1] 4 4 3
v[2:4]
#> [1] 4 4 3
v[c(2,4,3)]
#> [1] 4 3 4
With a data frame:
# Create a sample data frame
data <- read.table(header=T, text='
subject sex size
1 M7
2 F6
3 F9
4 M 11
')
# Get the element at row 1, column 3
data[1,3]
#> [1] 7
data[1,"size"]
#> [1] 7
# Get rows 1 and 2, and all columns
data[1:2, ]
#> subject sex size
#> 1 1 M7
#> 2 2 F6
data[c(1,2), ]
#> subject sex size
#> 1 1 M7
#> 2 2 F6
# Get rows 1 and 2, and only column 2
data[1:2, 2]
#> [1] M F
#> Levels: F M
data[c(1,2), 2]
#> [1] M F
#> Levels: F M
# Get rows 1 and 2, and only the columns named "sex" and "size"
data[1:2, c("sex","size")]
#> sex size
#> 1 M7
#> 2 F6
data[c(1,2), c(2,3)]
#> sex size
#> 1 M7
#> 2 F6
The simplest way to import data is to save it as a text file with delimiters such as tabs or commas (CSV).
data <- read.csv("datafile.csv")
# Load a CSV file that doesn't have headers
data <- read.csv("datafile-noheader.csv", header=FALSE)
The function read.table() is a more general function which allows you to set the delimiter, whether or not there are headers, whether strings are set off with quotes, and more. See ?read.table for more information on the details.
data <- read.table("datafile-noheader.csv",
header=FALSE,
sep="," # use "\t" for tab-delimited files
)
On some platforms, using file.choose() will open a file chooser dialog window. On others, it will simply prompt the user to type in a filename.
data <- read.csv(file.choose())
By default, strings in the data are converted to factors. If you load the data below with read.csv, then all the text columns will be treated as factors, even though it might make more sense to treat some of them as strings. To do this, use stringsAsFactors=FALSE:
data <- read.csv("datafile.csv", stringsAsFactors=FALSE)
# You might have to convert some columns to factors
data$Sex <- factor(data$Sex)
Another alternative is to load them as factors and convert some columns to characters:
data <- read.csv("datafile.csv")
data$First <- as.character(data$First)
data$Last <- as.character(data$Last)
# Another method: convert columns named "First" and "Last"
stringcols <- c("First","Last")
data[stringcols] <- lapply(data[stringcols], as.character)
Data can also be loaded from a URL. These (very long) URLs will load the files linked to below.
data <- read.csv("http://www.cookbook-r.com/Data_input_and_output/Loading_data_from_a_file/datafile.csv")
# Read in a CSV file without headers
data <- read.csv("http://www.cookbook-r.com/Data_input_and_output/Loading_data_from_a_file/datafile-noheader.csv", header=FALSE)
# Manually assign the header names
names(data) <- c("First","Last","Sex","Number")
The data files used above:
"First","Last","Sex","Number"
"Currer","Bell","F",2
"Dr.","Seuss","M",49
"","Student",NA,21
"Currer","Bell","F",2
"Dr.","Seuss","M",49
"","Student",NA,21
Suppose your data has fixed-width columns, like this:
First Last Sex Number
Currer BellF 2
Dr.SeussM 49
"" Student NA 21
One way to read it in is to simply use read.table() with strip.white=TRUE, which will remove extra spaces.
read.table("clipboard", header=TRUE, strip.white=TRUE)
However, your data file may have columns containing spaces, or columns with no spaces separating them, like this, where the scores column represents six different measurements, each from 0 to 3.
subject sex scores
N 1M 113311
NE 2F 112231
S 3F 111221
W 4M 011002
In this case, you may need to use the read.fwf() function. If you read the column names from the file, it requires that they be separated with a delimiter like a single tab, space, or comma. If they are separated with multiple spaces, as in this example, you will have to assign the column names directly.
# Assign the column names manually
read.fwf("myfile.txt",
c(7,5,-2,1,1,1,1,1,1), # Width of the columns. -2 means drop those columns
skip=1,# Skip the first line (contains header here)
col.names=c("subject","sex","s1","s2","s3","s4","s5","s6"),
strip.white=TRUE) # Strip out leading and trailing whitespace when reading each
#> subject sex s1 s2 s3 s4 s5 s6
#> 1N 1 M 1 1 3 3 1 1
#> 2NE 2 F 1 1 2 2 3 1
#> 3S 3 F 1 1 1 2 2 1
#> 4W 4 M 0 1 1 0 0 2
# subject sex s1 s2 s3 s4 s5 s6
#N 1 M 1 1 3 3 1 1
#NE 2 F 1 1 2 2 3 1
#S 3 F 1 1 1 2 2 1
#W 4 M 0 1 1 0 0 2
# If the first row looked like this:
# subject,sex,scores
# Then we could use header=TRUE:
read.fwf("myfile.txt", c(7,5,-2,1,1,1,1,1,1), header=TRUE, strip.white=TRUE)
#> Error in read.table(file = FILE, header = header, sep = sep, row.names = row.names, : more columns than column names
The read.xls function in the gdata package can read in Excel files.
library(gdata)
data <- read.xls("data.xls")
There are numerous methods for exporting R objects into other formats . For SPSS, SAS and Stata, you will need to load the foreign packages. For Excel, you will need the xlsReadWrite package.
To A Tab Delimited Text File
write.table(mydata, "c:/mydata.txt", sep="\t")
To an Excel Spreadsheet
library(xlsx)
write.xlsx(mydata, "c:/mydata.xlsx")
To SPSS
write out text datafile and an SPSS program to read it
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sps", package="SPSS")
To SAS
write out text datafile and a SAS program to read it
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas", package="SAS")
To Stata
export data frame to Stata binary format
library(foreign)
write.dta(mydata, "c:/mydata.dta")
Some sample data to work with:
# Make some data
# X increases (noisily)
# Z increases slowly
# Y is constructed so it is inversely related to xvar and positively related to xvar*zvar
set.seed(955)
xvar <- 1:20 + rnorm(20,sd=3)
zvar <- 1:20/4 + rnorm(20,sd=2)
yvar <- -2*xvar + xvar*zvar/5 + 3 + rnorm(20,sd=4)
# Make a data frame with the variables
dat <- data.frame(x=xvar, y=yvar, z=zvar)
# Show first few rows
head(dat)
#> x y z
#> 1 -4.252354 4.5857688 1.89877152
#> 2 1.702318 -4.9027824 -0.82937359
#> 3 4.323054 -4.3076433 -1.31283495
#> 4 1.780628 0.2050367 -0.28479448
#> 5 11.537348 -29.7670502 -1.27303976
#> 6 6.672130 -10.1458220 -0.09459239
# Correlation coefficient
cor(dat$x, dat$y)
#> [1] -0.7695378
It is also possible to run correlations between many pairs of variables, using a matrix or data frame.
# A correlation matrix of the variables
cor(dat)
#>xy z
#> x 1.0000000 -0.769537849 0.491698938
#> y -0.7695378 1.000000000 0.004172295
#> z 0.4916989 0.004172295 1.000000000
# Print with only two decimal places
round(cor(dat), 2)
#> x yz
#> x 1.00 -0.77 0.49
#> y -0.77 1.00 0.00
#> z 0.49 0.00 1.00
Linear regressions, where dat$x is the predictor, and dat$y is the outcome. This can be done using two columns from a data frame, or with numeric vectors directly.
# These two commands will have the same outcome:
fit <- lm(y ~ x, data=dat) # Using the columns x and y from the data frame
fit <- lm(dat$y ~ dat$x) # Using the vectors dat$x and dat$y
fit
#>
#> Call:
#> lm(formula = dat$y ~ dat$x)
#>
#> Coefficients:
#> (Intercept)dat$x
#> -0.2278 -1.1829
# This means that the predicted y = -0.2278 - 1.1829*x
# Get more detailed information:
summary(fit)
#>
#> Call:
#> lm(formula = dat$y ~ dat$x)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15.8922 -2.5114 0.2866 4.4646 9.3285
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.2278 2.6323 -0.0870.932
#> dat$x-1.1829 0.2314 -5.113 7.28e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6.506 on 18 degrees of freedom
#> Multiple R-squared: 0.5922, Adjusted R-squared: 0.5695
#> F-statistic: 26.14 on 1 and 18 DF, p-value: 7.282e-05
Linear regression with y as the outcome, and x and z as predictors.
Note that the formula specified below does not test for interactions between x and z.
# These have the same result
fit2 <- lm(y ~ x + z, data=dat)# Using the columns x, y, and z from the data frame
fit2 <- lm(dat$y ~ dat$x + dat$z) # Using the vectors x, y, z
fit2
#>
#> Call:
#> lm(formula = dat$y ~ dat$x + dat$z)
#>
#> Coefficients:
#> (Intercept)dat$xdat$z
#> -1.382 -1.5641.858
summary(fit2)
#>
#> Call:
#> lm(formula = dat$y ~ dat$x + dat$z)
#>
#> Residuals:
#>Min 1Q Median 3QMax
#> -7.974 -3.187 -1.205 3.847 7.524
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.3816 1.9878 -0.695 0.49644
#> dat$x-1.5642 0.1984 -7.883 4.46e-07 ***
#> dat$z 1.8578 0.4753 3.908 0.00113 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.859 on 17 degrees of freedom
#> Multiple R-squared: 0.7852, Adjusted R-squared: 0.7599
#> F-statistic: 31.07 on 2 and 17 DF, p-value: 2.1e-06
The topic of how to properly do multiple regression and test for interactions can be quite complex and is not covered here. Here we just fit a model with x, z, and the interaction between the two.
To model interactions between x and z, a x:z term must be added. Alternatively, the formula x*z expands to x+z+x:z.
# These are equivalent; the x*z expands to x + z + x:z
fit3 <- lm(y ~ x * z, data=dat)
fit3 <- lm(y ~ x + z + x:z, data=dat)
fit3
#>
#> Call:
#> lm(formula = y ~ x + z + x:z, data = dat)
#>
#> Coefficients:
#> (Intercept)xz x:z
#> 2.2820 -2.1311 -0.1068 0.2081
summary(fit3)
#>
#> Call:
#> lm(formula = y ~ x + z + x:z, data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.3045 -3.5998 0.3926 2.1376 8.3957
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.282042.20064 1.037 0.3152
#> x -2.131100.27406 -7.7768e-07 ***
#> z -0.106820.84820 -0.126 0.9013
#> x:z 0.208140.07874 2.643 0.0177 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.178 on 16 degrees of freedom
#> Multiple R-squared: 0.8505, Adjusted R-squared: 0.8225
#> F-statistic: 30.34 on 3 and 16 DF, p-value: 7.759e-07
Find today's scripts here
setwd("C:/Users/zyao/Desktop/CE261_R_Exercise")
rm(list=ls(all=TRUE))
ipak <- function(pkg){
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}
packages <- c("ggplot2", "plyr", "reshape2", "data.table","RColorBrewer", "scales", "grid", "rJava","stringr","ggthemes","pander", "RSQLite","sqldf","XLConnect","RCurl","data.table","reader","fBasics",'PerformanceAnalytics','ggthemes')
ipak(packages)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
county <- c("06031", "06089", "06045", "06035", "06023", "06107", "06021", "06053", "06079", "06083")
URLs <- replicate(10,"http://www.zhuoyao.net/ARCH30353/vmtdata/06001.csv")
for (i in 1:length(URLs) ) {
URLs[i]<- gsub("06001", county[i], URLs[i], ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
}
for (i in 1:length(URLs) ) {
download.file(URLs[i],destfile=sprintf("%s.csv", county[i]) , method='auto')
}
files <- list.files(pattern = "\\.csv$")
path1 <- "C:/Users/zyao/Desktop/CE261_R_Exercise/"
df <- do.call("rbind",
lapply(files,
function(x)
cbind(read.table(paste(path1, x, sep=''),
sep = ",", col.names = paste0("V",seq_len(17)), fill = TRUE))))
head (df)
df1 <- df[,2:17]
colnames(df1) <- c("County","Date_Time","VMT_Total","HHD_VMT", 'Truck_wo_HHD_VMT','OCC','AvgSpd_All','AvgHHD_Wt','AvgHHD_Axle','AvgSpd_HHD','Avg_Non_HHD_Wt','AvgAxle_Non_HHD','AvgSpd_Non_HHD','SectionLength','LaneMiles','DetrNos')
dfU <- df1[,1:3]
dfU$Date_Time=strptime(dfU$Date_Time,format="%Y-%m-%d %H:%M")
dfU[,1]<-as.character(dfU[,1])
dfU$County1<- dfU$County
for (i in 1:nrow(dfU) ) {
dfU$County1[i]<- str_sub(dfU$County[i], 23,-6)
}
dfU$Date<- dfU$Date_Time
dfU[,5]<-as.character(dfU[,5])
for (i in 1:nrow(dfU) ) {
dfU$Date[i]<- str_sub(dfU$Date_Time[i], 6,10)
}
data.long <- dfU[,2:5]
colnames(data.long) <- c("Date_Time","VMT_Total","County",'Date')
daf <- melt(data.long, id.vars=c("County","Date","Date_Time"), variable.name="Total VMT")
daf[,3]<-as.Date(daf[,3])
resultsA <- "C:/Users/zyao/Desktop/CE261_R_Exercise/PlotsA"
resultsB <- "C:/Users/zyao/Desktop/CE261_R_Exercise/PlotsB"
county.graphA <- function(df, na.rm = TRUE, ...){
###create list of counties in data to loop over
county_list <- unique(df$County)
###create for loop to produce ggplot2 graphs
###create plot for each county in df
county.graphA <- function(df, na.rm = TRUE, ...){
county_list <- unique(df$County)
for (i in seq_along(county_list)) {
p <- ggplot(subset(df, df$County==county_list[i]),aes(Date_Time, value))+
geom_line(size=0.5, aes(group=1),colour="blue") +
geom_point(size=1, colour="red") +theme_hc(bgcolor = "darkunica")+
scale_x_date(date_breaks="1 month", labels=date_format("%B"))+xlab("2012") + ylab("Total VMT Traveled") +
ggtitle(paste(county_list[i], ' County, California \n',
"County Level Detector-based VMT Total\n",
sep=''))+
theme(legend.title=element_blank(),legend.position="top")
ggsave(p, file=paste(resultsA,'/',county_list[i], ".png", sep=''),scale=3)
}
}
county.graphA(daf)
###create list of counties in data to loop over ###create for loop to produce ggplot2 graphs
county.graphB <- function(df, na.rm = TRUE, ...){
county_list <- unique(df$County)
for (i in seq_along(county_list)) {
p <- ggplot(subset(df, df$County==county_list[i]),aes(Date_Time, value))+
geom_line(size=0.5, aes(group=1),colour="blue") +
geom_point(size=1, colour="red") +theme_hc(bgcolor = "darkunica")+
scale_x_date(date_breaks = "1 month", labels=date_format("%B"))+
xlab("2012") + ylab("Total VMT Traveled") +
ggtitle(paste(county_list[i], ' County, California \n',
"County Level Detector-based VMT Total \n",
sep=''))+
theme(legend.title=element_blank(),legend.position="top",text = element_text(size=30),plot.title = element_text(hjust = 0.5))
ggsave(p, file=paste(resultsB,'/',county_list[i], ".png", sep=''),width = 40, height = 10, dpi = 200)
}
}
county.graphB(daf)