ARCH 30353: Planning 3 - Introduction to Urban and Regional Planning

Prerequisite: Planning 2 or dean's permission

Units: 3.0

Classroom: online via Microsoft Teams

Class Time: Thursday: 9:30 AM-12:30 PM

Office Hour: Thursday: 12:30 PM -12:45 PM - Right after class time

Instructor: Zhuo Yao, Ph.D.

Instructor: Archt. Carmela C. Quizana


This is a set of notes, examples, data and projects for a short course in advanced R programming, aimed at graduate students and senior-level undergraduates. It is intended to help students develop the skills they'll need in their research or employment. Note that this lab session uses the book "Cookbook for R" as a starting point.


Lecture Notes

  1. Basics
  2. Numbers
  3. Strings
  4. Formulas
  5. Data input and output
  6. Manipulating data
  7. Statistical analysis
  8. Graphs
  9. Scripts and functions
  10. Tools for experiments

R - Data Types

In any programming language, you use variables to store information. A variable is a reserved memory location that holds a value, so creating a variable reserves some space in memory.

You may want to store information of various data types, such as character, wide character, integer, floating point, double floating point, or Boolean. Based on the data type of a variable, the operating system allocates memory and decides what can be stored in the reserved memory.

Lecture Notes

  1. Basics
  2. Data input and output
  3. Manipulating data
  4. Statistical analysis
  5. Graphs

1. Basics of R

1.1 Installing and using packages

If you are using a GUI for R, there is likely a menu-driven way of installing packages. This is how to do it from the command line:

install.packages('reshape2')

In each new R session where you use the package, you will have to load it:

library(reshape2)

If you use the package in a script, put this line in the script.

To update all your installed packages to the latest versions available:

update.packages()

If you are using R on Linux, some of the R packages may be installed at the system level by the root user, and can’t be updated this way, since you won’t have permission to overwrite them.

1.2 Data Types

In contrast to programming languages such as C and Java, variables in R are not declared with a data type. A variable is assigned an R object, and the data type of that object becomes the data type of the variable. There are many types of R objects. The most frequently used are:

Vectors
Lists
Matrices
Arrays
Factors
Data Frames
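
A minimal sketch of the six object types listed above, using class() to check which type each variable takes on from the object assigned to it. The variable names and values are illustrative only.

```r
# One small object of each frequently used type (names are illustrative)
v <- c(1.5, 2, 3)                      # vector (numeric)
l <- list(1, "a", TRUE)                # list: elements may differ in type
m <- matrix(1:6, nrow = 2)             # matrix: two dimensions
a <- array(1:24, dim = c(2, 3, 4))     # array: any number of dimensions
f <- factor(c("low", "high", "low"))   # factor: categorical values
d <- data.frame(id = 1:3, grp = f)     # data frame: columns of equal length

# class() reports the type that the assigned object gives each variable;
# [1] keeps the first entry when an object has more than one class
types <- sapply(list(v = v, l = l, m = m, a = a, f = f, d = d),
                function(obj) class(obj)[1])
print(types)
```

No variable was declared with a type here; each one simply holds whatever object was assigned to it.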

1.3 Indexing into a data structure

Elements from a vector, matrix, or data frame can be extracted using numeric indexing, or by using a boolean vector of the appropriate length.

In many of the examples below, there are multiple ways of doing the same thing.

Indexing with numbers and names

With a vector:

# A sample vector
v <- c(1,4,4,3,2,2,3)

v[c(2,3,4)]
#> [1] 4 4 3
v[2:4]
#> [1] 4 4 3

v[c(2,4,3)]
#> [1] 4 3 4

With a data frame:

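# Create a sample data frame
# (stringsAsFactors=TRUE makes sex a factor in any R version,
# which matches the factor output shown further down)
data <- read.table(header=TRUE, stringsAsFactors=TRUE, text='
 subject sex size
       1   M    7
       2   F    6
       3   F    9
       4   M   11
 ')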

# Get the element at row 1, column 3
data[1,3]
#> [1] 7
data[1,"size"]
#> [1] 7


# Get rows 1 and 2, and all columns
data[1:2, ]
#>   subject sex size
#> 1       1   M    7
#> 2       2   F    6
data[c(1,2), ]
#>   subject sex size
#> 1       1   M    7
#> 2       2   F    6


# Get rows 1 and 2, and only column 2
data[1:2, 2]
#> [1] M F
#> Levels: F M
data[c(1,2), 2]
#> [1] M F
#> Levels: F M

# Get rows 1 and 2, and only the columns named "sex" and "size"
data[1:2, c("sex","size")]
#>   sex size
#> 1   M    7
#> 2   F    6
data[c(1,2), c(2,3)]
#>   sex size
#> 1   M    7
#> 2   F    6
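
The examples above index with numbers and names; this section also mentions indexing with a boolean (logical) vector. A short sketch of that case, rebuilding the same sample values with data.frame() so that it runs on its own:

```r
v <- c(1, 4, 4, 3, 2, 2, 3)

# A logical vector of the same length keeps the positions that are TRUE
keep <- v > 2
v[keep]

# The same idea selects rows of a data frame
data <- data.frame(subject = 1:4,
                   sex     = c("M", "F", "F", "M"),
                   size    = c(7, 6, 9, 11))
big <- data[data$size > 6, ]   # rows where size exceeds 6, all columns
big
```

The comparison produces one TRUE or FALSE per element, so the condition and the selection can be written in a single expression.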

2. Data input and output

2.1 Delimited text files

The simplest way to import data is to save it as a text file with delimiters such as tabs or commas (CSV).

data <- read.csv("datafile.csv")


# Load a CSV file that doesn't have headers
data <- read.csv("datafile-noheader.csv", header=FALSE)

The function read.table() is a more general function which allows you to set the delimiter, whether or not there are headers, whether strings are set off with quotes, and more. See ?read.table for more information on the details.

data <- read.table("datafile-noheader.csv",
   header=FALSE,
   sep="," # use "\t" for tab-delimited files
)

2.2 Loading a file with a file chooser

On some platforms, using file.choose() will open a file chooser dialog window. On others, it will simply prompt the user to type in a filename.

data <- read.csv(file.choose())

2.3 Treating strings as factors or characters

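In R versions before 4.0.0, strings in the data are converted to factors by default; from R 4.0.0 on, the default is stringsAsFactors=FALSE and they remain character strings. If you load the data below with read.csv in an older version, all the text columns will be treated as factors, even though it might make more sense to treat some of them as strings. To keep them as strings in any version, use stringsAsFactors=FALSE: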

data <- read.csv("datafile.csv", stringsAsFactors=FALSE)

# You might have to convert some columns to factors
data$Sex <- factor(data$Sex)

An alternative is to load them as factors and convert some columns to character strings:

data <- read.csv("datafile.csv", stringsAsFactors=TRUE)

data$First <- as.character(data$First)
data$Last  <- as.character(data$Last)

# Another method: convert columns named "First" and "Last"
stringcols <- c("First","Last")
data[stringcols] <- lapply(data[stringcols], as.character)
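
A self-contained sketch of both approaches. As an assumption for the sake of a runnable example, it reads the sample rows from an inline string through the text argument instead of from datafile.csv, and it sets stringsAsFactors explicitly in both calls so the result does not depend on the R version.

```r
# Inline copy of the sample rows (stand-in for datafile.csv)
csv_text <- '"First","Last","Sex","Number"
"Currer","Bell","F",2
"Dr.","Seuss","M",49
"","Student",NA,21'

# Approach 1: keep text columns as character, then convert Sex to a factor
chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
chr$Sex <- factor(chr$Sex)

# Approach 2: load text columns as factors, then convert the name columns
fac <- read.csv(text = csv_text, stringsAsFactors = TRUE)
stringcols <- c("First", "Last")
fac[stringcols] <- lapply(fac[stringcols], as.character)

str(chr)
str(fac)
```

Both data frames end up with character name columns and a factor Sex column; which route is more convenient depends on how many columns need converting.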

2.4 Loading a file from the Internet

Data can also be loaded from a URL. These (very long) URLs will load the files linked to below.

data <- read.csv("http://www.cookbook-r.com/Data_input_and_output/Loading_data_from_a_file/datafile.csv")

# Read in a CSV file without headers
data <- read.csv("http://www.cookbook-r.com/Data_input_and_output/Loading_data_from_a_file/datafile-noheader.csv", header=FALSE)

# Manually assign the header names
names(data) <- c("First","Last","Sex","Number")

The data files used above:

datafile.csv:

"First","Last","Sex","Number"
"Currer","Bell","F",2
"Dr.","Seuss","M",49
"","Student",NA,21

datafile-noheader.csv:

"Currer","Bell","F",2
"Dr.","Seuss","M",49
"","Student",NA,21

2.5 Fixed-width text files

Suppose your data has fixed-width columns, like this:

  First     Last  Sex Number
 Currer     Bell    F      2
    Dr.    Seuss    M     49
    ""   Student   NA     21

One way to read it in is to simply use read.table() with strip.white=TRUE, which will remove extra spaces.

read.table("clipboard", header=TRUE, strip.white=TRUE)

However, your data file may have columns containing spaces, or columns with no spaces separating them, like this, where the scores column represents six different measurements, each from 0 to 3.

subject  sex  scores
   N  1    M  113311
   NE 2    F  112231
   S  3    F  111221
   W  4    M  011002

In this case, you may need to use the read.fwf() function. If you read the column names from the file, it requires that they be separated with a delimiter like a single tab, space, or comma. If they are separated with multiple spaces, as in this example, you will have to assign the column names directly.

# Assign the column names manually
read.fwf("myfile.txt", 
 c(7,5,-2,1,1,1,1,1,1), # Width of the columns. -2 means drop those columns
 skip=1,# Skip the first line (contains header here)
 col.names=c("subject","sex","s1","s2","s3","s4","s5","s6"),
 strip.white=TRUE)  # Strip out leading and trailing whitespace when reading each
#>   subject sex s1 s2 s3 s4 s5 s6
#> 1N  1   M  1  1  3  3  1  1
#> 2NE 2   F  1  1  2  2  3  1
#> 3S  3   F  1  1  1  2  2  1
#> 4W  4   M  0  1  1  0  0  2
# subject sex s1 s2 s3 s4 s5 s6
#N  1   M  1  1  3  3  1  1
#NE 2   F  1  1  2  2  3  1
#S  3   F  1  1  1  2  2  1
#W  4   M  0  1  1  0  0  2


# If the first row held the column names separated by a delimiter
# (one name per column), header=TRUE could be used instead.
# With the header line of the sample file above, which is separated
# by multiple spaces, the call fails:
read.fwf("myfile.txt", c(7,5,-2,1,1,1,1,1,1), header=TRUE, strip.white=TRUE)
#> Error in read.table(file = FILE, header = header, sep = sep, row.names = row.names, : more columns than column names
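
A runnable sketch of the read.fwf() call with manually assigned column names. It writes the sample rows to a temporary file as a stand-in for myfile.txt. The exact column positions in the rows are an assumption: they are laid out so that each line matches the widths used in this section (7 for subject, 5 for sex, 2 dropped, then six 1-character scores).

```r
# Write the sample rows to a temporary file (stand-in for myfile.txt)
fwf_file <- tempfile(fileext = ".txt")
writeLines(c("subject  sex  scores",
             "   N  1    M  113311",
             "   NE 2    F  112231",
             "   S  3    F  111221",
             "   W  4    M  011002"), fwf_file)

# Negative widths are skipped; skip=1 drops the header line
scores <- read.fwf(fwf_file,
                   widths = c(7, 5, -2, 1, 1, 1, 1, 1, 1),
                   skip = 1,
                   col.names = c("subject", "sex",
                                 "s1", "s2", "s3", "s4", "s5", "s6"),
                   strip.white = TRUE)
scores
```

Because the widths are counted in characters, a row that is shifted by even one space would put the wrong digits into s1 to s6, so it is worth checking the result against a few known rows.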

2.6 Excel files

The read.xls function in the gdata package can read in Excel files.

library(gdata)
data <- read.xls("data.xls")

2.7 Exporting Data

There are numerous methods for exporting R objects into other formats. For SPSS, SAS and Stata, you will need to load the foreign package. For Excel, the example below uses the xlsx package.

To a Tab-Delimited Text File

write.table(mydata, "c:/mydata.txt", sep="\t")

To an Excel Spreadsheet

library(xlsx)
write.xlsx(mydata, "c:/mydata.xlsx")

To SPSS

Write out a text data file and an SPSS program to read it:

library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sps",   package="SPSS")

To SAS

Write out a text data file and a SAS program to read it:

library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas",   package="SAS")

To Stata

Export the data frame to Stata binary format:

library(foreign)
write.dta(mydata, "c:/mydata.dta") 
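
A small round trip for the tab-delimited case, as a way to check what an export actually wrote. In this sketch mydata is a made-up data frame, and the output goes to a temporary file rather than c:/mydata.txt.

```r
# Made-up data frame standing in for mydata
mydata <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

out_file <- tempfile(fileext = ".txt")

# row.names = FALSE leaves out the row-name column that
# write.table() would otherwise add as the first column
write.table(mydata, out_file, sep = "\t", row.names = FALSE)

# Read the file back to confirm that the values survived the export
back <- read.table(out_file, header = TRUE, sep = "\t")
back
```

Without row.names = FALSE the file has one more data column than header names, which read.table() then interprets as row names.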

3. Manipulating Data

3.1 General

  1. Sorting
  2. Randomizing order
  3. Converting between vector types - Numeric vectors, Character vectors, and Factors
  4. Finding and removing duplicate records
  5. Comparing vectors or factors with NA
  6. Recoding data
  7. Mapping vector values - Change all instances of value x to value y in a vector

3.2 Factors

  1. Renaming levels of a factor
  2. Re-computing the levels of factor
  3. Changing the order of levels of a factor

3.3 Data Frames

  1. Renaming columns in a data frame
  2. Adding and removing columns from a data frame
  3. Reordering the columns in a data frame
  4. Merging data frames
  5. Comparing data frames - Search for duplicate or unique rows across multiple data frames.
  6. Re-computing the levels of all factor columns in a data frame

3.4 Restructuring data

  1. Converting data between wide and long format
  2. Summarizing data - Collapse a data frame on one or more variables to find mean, count, standard deviation, standard error of the mean, and confidence intervals
  3. Converting between data frames and contingency tables - Data frames with individual cases, data frames with counts, and contingency tables
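
As an illustration of the summarizing item above, here is a base-R sketch that collapses a data frame on one grouping variable and reports count, mean, standard deviation, and standard error of the mean. The data and the helper name summarize_group are made up for this example; it is one possible approach, not the one used in the lecture notes.

```r
# Made-up measurements in two groups
obs <- data.frame(group = c("a", "a", "a", "b", "b", "b"),
                  value = c(1, 2, 3, 4, 6, 8))

# Summarize one group: count, mean, standard deviation, standard error
summarize_group <- function(g) {
  n <- nrow(g)
  s <- sd(g$value)
  data.frame(group = g$group[1], n = n, mean = mean(g$value),
             sd = s, se = s / sqrt(n))
}

# split() breaks the data frame up by group; rbind stacks the summaries
summ <- do.call(rbind, lapply(split(obs, obs$group), summarize_group))
summ
```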

3.5 Sequential data

  1. Calculating a moving average
  2. Averaging a sequence in blocks - Convert a sequence into blocks of a given length and average within each block.
  3. Finding sequences of identical values
  4. Filling in NAs with last non-NA value
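
The first two items in this list can be sketched in a few lines of base R. The sequence, the window length of 3, and the block length of 2 are arbitrary choices for illustration.

```r
x <- c(2, 4, 6, 8, 10, 12)   # made-up sequence

# Centered moving average over 3 points; the first and last positions
# have no complete window, so they come out as NA
ma <- stats::filter(x, rep(1/3, 3), sides = 2)
as.numeric(ma)

# Block average: label each element with its block number,
# then average the elements that share a label
block_len   <- 2
block_id    <- ceiling(seq_along(x) / block_len)   # 1 1 2 2 3 3
block_means <- tapply(x, block_id, mean)
block_means
```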

4. Statistical analysis

4.1 Regression and correlation

Some sample data to work with:

# Make some data
# X increases (noisily)
# Z increases slowly
# Y is constructed so it is inversely related to xvar and positively related to xvar*zvar
set.seed(955)
xvar <- 1:20 + rnorm(20,sd=3)
zvar <- 1:20/4 + rnorm(20,sd=2)
yvar <- -2*xvar + xvar*zvar/5 + 3 + rnorm(20,sd=4)

# Make a data frame with the variables
dat <- data.frame(x=xvar, y=yvar, z=zvar)
# Show first few rows
head(dat)
#>           x           y           z
#> 1 -4.252354   4.5857688  1.89877152
#> 2  1.702318  -4.9027824 -0.82937359
#> 3  4.323054  -4.3076433 -1.31283495
#> 4  1.780628   0.2050367 -0.28479448
#> 5 11.537348 -29.7670502 -1.27303976
#> 6  6.672130 -10.1458220 -0.09459239

4.1.1 Correlation

# Correlation coefficient
cor(dat$x, dat$y)
#> [1] -0.7695378

4.1.2 Correlation matrices (for multiple variables)

It is also possible to run correlations between many pairs of variables, using a matrix or data frame.

# A correlation matrix of the variables
cor(dat)
#>            x            y           z
#> x  1.0000000 -0.769537849 0.491698938
#> y -0.7695378  1.000000000 0.004172295
#> z  0.4916989  0.004172295 1.000000000


# Print with only two decimal places
round(cor(dat), 2)
#>       x     y    z
#> x  1.00 -0.77 0.49
#> y -0.77  1.00 0.00
#> z  0.49  0.00 1.00

4.1.3 Linear regression

The following fits a linear regression in which dat$x is the predictor and dat$y is the outcome. This can be done using two columns from a data frame, or with numeric vectors directly.

# These two commands will have the same outcome:
fit <- lm(y ~ x, data=dat)  # Using the columns x and y from the data frame
fit <- lm(dat$y ~ dat$x) # Using the vectors dat$x and dat$y
fit
#> 
#> Call:
#> lm(formula = dat$y ~ dat$x)
#> 
#> Coefficients:
#> (Intercept)        dat$x  
#>     -0.2278      -1.1829  

# This means that the predicted y = -0.2278 - 1.1829*x


# Get more detailed information:
summary(fit)
#> 
#> Call:
#> lm(formula = dat$y ~ dat$x)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -15.8922  -2.5114   0.2866   4.4646   9.3285 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  -0.2278     2.6323  -0.087    0.932    
#> dat$x        -1.1829     0.2314  -5.113 7.28e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.506 on 18 degrees of freedom
#> Multiple R-squared:  0.5922, Adjusted R-squared:  0.5695 
#> F-statistic: 26.14 on 1 and 18 DF,  p-value: 7.282e-05
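
The fitted coefficients define a prediction equation of the form y = intercept + slope*x. The sketch below rebuilds the sample data with the same seed, computes predictions by hand from the coefficients, and compares them with predict(). The two x values in new_x are arbitrary and chosen only for illustration.

```r
# Rebuild the sample data so this block runs on its own
set.seed(955)
xvar <- 1:20 + rnorm(20, sd=3)
zvar <- 1:20/4 + rnorm(20, sd=2)
yvar <- -2*xvar + xvar*zvar/5 + 3 + rnorm(20, sd=4)
dat  <- data.frame(x=xvar, y=yvar, z=zvar)

fit <- lm(y ~ x, data=dat)
b   <- coef(fit)   # named vector with entries "(Intercept)" and "x"

# Prediction by hand for two illustrative x values
new_x   <- c(0, 10)
by_hand <- b[["(Intercept)"]] + b[["x"]] * new_x

# predict() applies the same equation to new data
by_predict <- predict(fit, newdata = data.frame(x = new_x))

all.equal(by_hand, unname(by_predict))
```

Note that predict() with newdata relies on the data= form of the model; a model fitted as lm(dat$y ~ dat$x) has no column named x to look up in the new data frame.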

4.1.4 Linear regression with multiple predictors

Linear regression with y as the outcome, and x and z as predictors.

Note that the formula specified below does not test for interactions between x and z.

# These have the same result
fit2 <- lm(y ~ x + z, data=dat)    # Using the columns x, y, and z from the data frame
fit2 <- lm(dat$y ~ dat$x + dat$z)  # Using the vectors x, y, z
fit2
#> 
#> Call:
#> lm(formula = dat$y ~ dat$x + dat$z)
#> 
#> Coefficients:
#> (Intercept)        dat$x        dat$z  
#>      -1.382       -1.564        1.858  

summary(fit2)
#> 
#> Call:
#> lm(formula = dat$y ~ dat$x + dat$z)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -7.974 -3.187 -1.205  3.847  7.524 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  -1.3816     1.9878  -0.695  0.49644    
#> dat$x        -1.5642     0.1984  -7.883 4.46e-07 ***
#> dat$z         1.8578     0.4753   3.908  0.00113 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.859 on 17 degrees of freedom
#> Multiple R-squared:  0.7852, Adjusted R-squared:  0.7599 
#> F-statistic: 31.07 on 2 and 17 DF,  p-value: 2.1e-06

4.1.5 Interactions

The topic of how to properly do multiple regression and test for interactions can be quite complex and is not covered here. Here we just fit a model with x, z, and the interaction between the two.

To model interactions between x and z, a x:z term must be added. Alternatively, the formula x*z expands to x+z+x:z.

# These are equivalent; the x*z expands to x + z + x:z
fit3 <- lm(y ~ x * z, data=dat) 
fit3 <- lm(y ~ x + z + x:z, data=dat) 
fit3
#> 
#> Call:
#> lm(formula = y ~ x + z + x:z, data = dat)
#> 
#> Coefficients:
#> (Intercept)            x            z          x:z  
#>      2.2820      -2.1311      -0.1068       0.2081  

summary(fit3)
#> 
#> Call:
#> lm(formula = y ~ x + z + x:z, data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.3045 -3.5998  0.3926  2.1376  8.3957 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  2.28204    2.20064   1.037   0.3152    
#> x           -2.13110    0.27406  -7.776    8e-07 ***
#> z           -0.10682    0.84820  -0.126   0.9013    
#> x:z          0.20814    0.07874   2.643   0.0177 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.178 on 16 degrees of freedom
#> Multiple R-squared:  0.8505, Adjusted R-squared:  0.8225 
#> F-statistic: 30.34 on 3 and 16 DF,  p-value: 7.759e-07
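
A quick way to confirm that the two formula spellings describe the same model is to fit both and compare the coefficients. This sketch rebuilds the sample data with the same seed so that it runs on its own; the names fit_star and fit_long are made up for the comparison.

```r
# Rebuild the sample data so this block runs on its own
set.seed(955)
xvar <- 1:20 + rnorm(20, sd=3)
zvar <- 1:20/4 + rnorm(20, sd=2)
yvar <- -2*xvar + xvar*zvar/5 + 3 + rnorm(20, sd=4)
dat  <- data.frame(x=xvar, y=yvar, z=zvar)

fit_star <- lm(y ~ x * z, data=dat)         # shorthand form
fit_long <- lm(y ~ x + z + x:z, data=dat)   # expanded form

# Both fits carry the same four terms with the same estimates
names(coef(fit_star))
same <- all.equal(coef(fit_star), coef(fit_long))
same
```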

4.2 t-test

4.3 Frequency tests - Chi-square, Fisher’s exact, exact Binomial, McNemar’s test

4.4 ANOVA

4.5 Logistic regression

5. Graphs

5.1 Graphs with ggplot2

5.1.1 Bar and line graphs

5.1.2 Plotting means and error bars

5.1.3 Plotting distributions - Histograms, density curves, boxplots

5.1.4 Scatterplots

5.1.5 Titles

5.1.6 Axes - Control axis text, labels, and grid lines.

5.1.7 Legends

5.1.8 Lines - Add lines to a graph.

5.1.9 Facets - Slice up data and graph the subsets together in a grid.

5.1.10 Multiple graphs on one page

5.1.11 Colors (ggplot2)

5.2 Miscellaneous

5.2.1 Output to a file - PDF, PNG, TIFF, SVG

5.2.2 Shapes and line types - Set the shape of points and patterns used in lines

5.2.3 Fonts - Use different fonts in your graphs

5.3 Basic graphs with standard graphics

5.3.1 Histogram and density plot

5.3.2 Scatterplot

5.3.3 Box plot

5.3.4 Q-Q plot

R References

  1. Courses Taught by Hadley Wickham
  2. R Programming - Robin Evans
  3. A First Course in Statistical Programming with R
  4. Introduction to Visualizing Spatial Data in R

Plotting

  1. Beautiful plotting in R: A ggplot2 cheatsheet
  2. Cookbook for R
  3. R Graphical Manual
  4. ggplot2 Documentation

In Class Exercise

Find today's scripts here

Set a path where all the files are to be stored

setwd("C:/Users/zyao/Desktop/CE261_R_Exercise")

Load libraries

rm(list=ls(all=TRUE)) 

ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg)) 
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}

usage

packages <- c("ggplot2", "plyr", "reshape2", "data.table", "RColorBrewer", "scales", "grid", "rJava", "stringr", "ggthemes", "pander", "RSQLite", "sqldf", "XLConnect", "RCurl", "reader", "fBasics", "PerformanceAnalytics")
ipak(packages)

options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

create a list of counties

county <- c("06031", "06089", "06045", "06035", "06023", "06107", "06021", "06053", "06079", "06083")

generate an array of URLs to be replaced

URLs <- replicate(10,"http://www.zhuoyao.net/ARCH30353/vmtdata/06001.csv")

Loop to generate the working URLs

for (i in 1:length(URLs) ) {
URLs[i]<- gsub("06001", county[i], URLs[i], ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)
}
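
The same URLs can also be built without a template and a loop, because paste0() works on whole vectors. A minimal sketch using the first three county codes from the list above; base_url is a name introduced here for illustration.

```r
county   <- c("06031", "06089", "06045")   # first three codes from the list
base_url <- "http://www.zhuoyao.net/ARCH30353/vmtdata/"

# paste0() is vectorized: it returns one URL per county code
URLs <- paste0(base_url, county, ".csv")
URLs
```

This avoids relying on the placeholder code 06001 appearing exactly once in the template string.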

executing the downloading

for (i in 1:length(URLs) ) {
download.file(URLs[i],destfile=sprintf("%s.csv", county[i]) , method='auto')
}

Create a file list of .csv files only

files <- list.files(pattern = "\\.csv$")
path1 <- "C:/Users/zyao/Desktop/CE261_R_Exercise/"

read in each .csv file in file_list and rbind them into a data frame

df <- do.call("rbind", 
      lapply(files, 
             function(x) 
               cbind(read.table(paste(path1, x, sep=''), 
                       sep = ",", col.names = paste0("V",seq_len(17)), fill = TRUE))))
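
A reduced, self-contained sketch of the same read-and-stack pattern. It writes two tiny made-up CSV files into a temporary folder (vmt_demo is a hypothetical name) and uses two columns instead of 17, so the mechanics can be tried without the downloaded data.

```r
# Hypothetical demo folder with two small header-less CSV files
tmp_dir <- file.path(tempdir(), "vmt_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.table(data.frame(a = 1:2, b = c("x", "y")),
            file.path(tmp_dir, "one.csv"), sep = ",",
            row.names = FALSE, col.names = FALSE)
write.table(data.frame(a = 3:4, b = c("z", "w")),
            file.path(tmp_dir, "two.csv"), sep = ",",
            row.names = FALSE, col.names = FALSE)

# full.names = TRUE returns complete paths, so no prefix has to be pasted on
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file with the same column names, then stack the pieces row-wise
pieces   <- lapply(files, read.table, sep = ",",
                   col.names = c("V1", "V2"), fill = TRUE)
combined <- do.call(rbind, pieces)
combined
```

Giving every file the same col.names is what lets rbind() line the pieces up, even when a file has short rows that fill = TRUE pads with NA.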

check on first 6 rows of the dataframe df

head (df)

Subset to the 2nd~17th column of the data

df1 <- df[,2:17]

rename cols

colnames(df1) <- c("County","Date_Time","VMT_Total","HHD_VMT", 'Truck_wo_HHD_VMT','OCC','AvgSpd_All','AvgHHD_Wt','AvgHHD_Axle','AvgSpd_HHD','Avg_Non_HHD_Wt','AvgAxle_Non_HHD','AvgSpd_Non_HHD','SectionLength','LaneMiles','DetrNos')

subset again

dfU <- df1[,1:3]

Convert datetime to an appropriate date-time format; strptime() returns the POSIXlt format

dfU$Date_Time=strptime(dfU$Date_Time,format="%Y-%m-%d %H:%M")

substring to county name

dfU[,1]<-as.character(dfU[,1])
dfU$County1<- dfU$County
for (i in 1:nrow(dfU) ) {
dfU$County1[i]<- str_sub(dfU$County[i], 23,-6)
}

substring the date/time to date only

dfU$Date<- dfU$Date_Time
dfU[,5]<-as.character(dfU[,5])

for (i in 1:nrow(dfU) ) {
dfU$Date[i]<- str_sub(dfU$Date_Time[i], 6,10)
}
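
The steps above parse the timestamp with strptime() and then cut out the month and day by character position. A hedged alternative sketch on two made-up timestamps: strptime() returns a POSIXlt object, as.POSIXct() converts it to the more compact POSIXct class, and format() extracts the month-day part for the whole vector at once. The tz = "UTC" setting is an assumption made here to keep the example independent of the local time zone.

```r
stamps <- c("2012-01-15 08:30", "2012-12-03 17:05")   # made-up timestamps

# strptime() parses text into a POSIXlt date-time
lt <- strptime(stamps, format = "%Y-%m-%d %H:%M", tz = "UTC")
class(lt)

# as.POSIXct() stores each time as a single number, which is the
# more common choice for a data frame column
ct <- as.POSIXct(lt)
class(ct)

# format() is vectorized, so no loop or substring position is needed
md <- format(ct, "%m-%d")
md
```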

data.long <- dfU[,2:5]

colnames(data.long) <- c("Date_Time","VMT_Total","County",'Date')

melt data to long format

daf <- melt(data.long, id.vars=c("County","Date","Date_Time"), variable.name="Total VMT")

daf$Date_Time=strptime(daf$Date_Time,format="%Y-%m-%d %H:%M")

daf[,3]<-as.Date(daf[,3])

Define the paths of two output folders, PlotsA and PlotsB (the code below only stores the paths and does not create the folders)

resultsA <- "C:/Users/zyao/Desktop/CE261_R_Exercise/PlotsA"
resultsB <- "C:/Users/zyao/Desktop/CE261_R_Exercise/PlotsB"
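
The two assignments above only store path strings. A minimal sketch of creating such folders from R before plots are saved into them; to keep the example runnable on any machine, it assumes folders under the session's temporary directory instead of the Desktop paths.

```r
# Hypothetical output folders under the session's temporary directory
resultsA <- file.path(tempdir(), "PlotsA")
resultsB <- file.path(tempdir(), "PlotsB")

for (folder in c(resultsA, resultsB)) {
  # recursive = TRUE also creates missing parent folders;
  # showWarnings = FALSE keeps reruns quiet if the folder already exists
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
}

dir.exists(c(resultsA, resultsB))
```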

create graphing function A

county.graphA <- function(df, na.rm = TRUE, ...){

  ### create list of counties in data to loop over
  county_list <- unique(df$County)

  ### create for loop to produce ggplot2 graphs
  ### create plot for each county in df
  for (i in seq_along(county_list)) {
    p <- ggplot(subset(df, df$County==county_list[i]), aes(Date_Time, value)) +
      geom_line(size=0.5, aes(group=1), colour="blue") +
      geom_point(size=1, colour="red") + theme_hc(bgcolor = "darkunica") +
      scale_x_date(date_breaks="1 month", labels=date_format("%B")) +
      xlab("2012") + ylab("Total VMT Traveled") +
      ggtitle(paste(county_list[i], ' County, California \n',
                    "County Level Detector-based VMT Total\n",
                    sep='')) +
      theme(legend.title=element_blank(), legend.position="top")

    ### save one PNG per county into the PlotsA folder
    ggsave(p, file=paste(resultsA,'/',county_list[i], ".png", sep=''), scale=3)
  }
}

run graphing function on long df

county.graphA(daf)

create graphing function B

### create list of counties in data to loop over
### create for loop to produce ggplot2 graphs

county.graphB <- function(df, na.rm = TRUE, ...){
  county_list <- unique(df$County)
  for (i in seq_along(county_list)) {
    p <- ggplot(subset(df, df$County==county_list[i]), aes(Date_Time, value)) +
      geom_line(size=0.5, aes(group=1), colour="blue") +
      geom_point(size=1, colour="red") + theme_hc(bgcolor = "darkunica") +
      scale_x_date(date_breaks = "1 month", labels=date_format("%B")) +
      xlab("2012") + ylab("Total VMT Traveled") +
      ggtitle(paste(county_list[i], ' County, California \n',
                    "County Level Detector-based VMT Total \n",
                    sep='')) +
      theme(legend.title=element_blank(), legend.position="top",
            text = element_text(size=30), plot.title = element_text(hjust = 0.5))

    ggsave(p, file=paste(resultsB,'/',county_list[i], ".png", sep=''), width = 40, height = 10, dpi = 200)
  }
}

run graphing function on long df

county.graphB(daf)

Any comments and feedback on my teaching are highly appreciated!