Preliminaries
Model Formulae
If you haven’t yet read the tutorial on Model Formulae, now would be a good time!
Statistical Modeling
There is not space in this tutorial, and probably not on this server, to cover the complex issue of statistical modeling. For an excellent discussion, I refer you to Chapter 9 of Crawley (2007).
Here I will restrict myself to a discussion of linear modeling. However, transforming variables to make them linear (see Simple Nonlinear Correlation and Regression) is straightforward, and a model including interaction terms is as easy to create as changing plus signs to asterisks. (If you don’t know what I’m talking about, read the Model Formulae tutorial.) R also includes a wealth of tools for more complex modeling, among them…
- glm( ) for generalized linear models (covered in another tutorial)
- gam( ) for generalized additive models
- lme( ) and lmer( ) for linear mixed-effects models
- nls( ) and nlme( ) for nonlinear models
- and I’m sure there are others I’m leaving out
My familiarity with these functions is “less than thorough” (!), so I shall refrain from mentioning them further.
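To make the earlier remark about interaction terms concrete, here is a minimal formula sketch (plain R formula syntax; y, a, and b are placeholder names, not variables from the data set used below)…
y ~ a + b     # additive model: main effects of a and b only
y ~ a * b     # shorthand for a + b + a:b, i.e., main effects plus their interaction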
Warning: an opinion follows! I am very much opposed to the current common practice in the social sciences (and education) of “letting the software decide” what statistical model is best. Statistical software is an assistant, not a master. The decision about which variables are retained in the model and which are excluded should remain in the hands of the researcher or data analyst, and those decisions should be based on a thorough knowledge of the phenomenon being modeled. Therefore, I will not discuss techniques such as stepwise regression or hierarchical regression here. Stepwise regression is implemented in R (see below for a VERY brief mention), and hierarchical regression (SPSS-style) can easily enough be done, although it is not automated (that I am aware of). The basic rule of statistical modeling should be: ALL STATISTICAL MODELS ARE WRONG! Giving the software free rein to include and exclude model variables in no way changes or improves upon that situation. It is not more “objective,” and it does not result in better models, only less well understood ones.
Preliminary Examination of the Data
For this analysis, we will use a built-in data set called state.x77. It is one of many helpful data sets R includes on the states of the U.S. However, it is not a data frame, so after loading it, we will coerce it into a data frame…
> state.x77 # output not shown
> str(state.x77) # clearly not a data frame!
> st = as.data.frame(state.x77) # so we'll make it one
> str(st)
'data.frame': 50 obs. of 8 variables:
$ Population: num 3615 365 2212 2110 21198 ...
$ Income : num 3624 6315 4530 3378 5114 ...
$ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
$ Life Exp : num 69.0 69.3 70.5 70.7 71.7 ...
$ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
$ HS Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
$ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
$ Area : num 50708 566432 113417 51945 156361 ...
A couple of things have to be fixed before this will be convenient to use, and I also want to add a population density variable…
> colnames(st)[4] = "Life.Exp" # no spaces in variable names, please
> colnames(st)[6] = "HS.Grad"
> st[,9] = st$Population * 1000 / st$Area
> colnames(st)[9] = "Density" # create and name a new column
> str(st)
'data.frame': 50 obs. of 9 variables:
$ Population: num 3615 365 2212 2110 21198 ...
$ Income : num 3624 6315 4530 3378 5114 ...
$ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
$ Life.Exp : num 69.0 69.3 70.5 70.7 71.7 ...
$ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
$ HS.Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
$ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
$ Area : num 50708 566432 113417 51945 156361 ...
$ Density : num 71.291 0.644 19.503 40.620 135.571 ...
If you are unclear on what these variables are, or want more information, see the help page: ?state.x77.
A straightforward summary is always a good place to begin, because for one thing it will find any variables that have missing values. Then, seeing as how we are looking for interrelationships between these variables, we should look at a correlation matrix, or perhaps a scatterplot matrix…
> summary(st)
Population Income Illiteracy Life.Exp
Min. : 365 Min. :3098 Min. :0.500 Min. :67.96
1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625 1st Qu.:70.12
Median : 2838 Median :4519 Median :0.950 Median :70.67
Mean : 4246 Mean :4436 Mean :1.170 Mean :70.88
3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575 3rd Qu.:71.89
Max. :21198 Max. :6315 Max. :2.800 Max. :73.60
Murder HS.Grad Frost Area
Min. : 1.400 Min. :37.80 Min. : 0.00 Min. : 1049
1st Qu.: 4.350 1st Qu.:48.05 1st Qu.: 66.25 1st Qu.: 36985
Median : 6.850 Median :53.25 Median :114.50 Median : 54277
Mean : 7.378 Mean :53.11 Mean :104.46 Mean : 70736
3rd Qu.:10.675 3rd Qu.:59.15 3rd Qu.:139.75 3rd Qu.: 81163
Max. :15.100 Max. :67.30 Max. :188.00 Max. :566432
Density
Min. : 0.6444
1st Qu.: 25.3352
Median : 73.0154
Mean :149.2245
3rd Qu.:144.2828
Max. :975.0033
>
> cor(st) # correlation matrix
Population Income Illiteracy Life.Exp Murder
Population 1.00000000 0.2082276 0.107622373 -0.06805195 0.3436428
Income 0.20822756 1.0000000 -0.437075186 0.34025534 -0.2300776
Illiteracy 0.10762237 -0.4370752 1.000000000 -0.58847793 0.7029752
Life.Exp -0.06805195 0.3402553 -0.588477926 1.00000000 -0.7808458
Murder 0.34364275 -0.2300776 0.702975199 -0.78084575 1.0000000
HS.Grad -0.09848975 0.6199323 -0.657188609 0.58221620 -0.4879710
Frost -0.33215245 0.2262822 -0.671946968 0.26206801 -0.5388834
Area 0.02254384 0.3633154 0.077261132 -0.10733194 0.2283902
Density 0.24622789 0.3299683 0.009274348 0.09106176 -0.1850352
HS.Grad Frost Area Density
Population -0.09848975 -0.332152454 0.02254384 0.246227888
Income 0.61993232 0.226282179 0.36331544 0.329968277
Illiteracy -0.65718861 -0.671946968 0.07726113 0.009274348
Life.Exp 0.58221620 0.262068011 -0.10733194 0.091061763
Murder -0.48797102 -0.538883437 0.22839021 -0.185035233
HS.Grad 1.00000000 0.366779702 0.33354187 -0.088367214
Frost 0.36677970 1.000000000 0.05922910 0.002276734
Area 0.33354187 0.059229102 1.00000000 -0.341388515
Density -0.08836721 0.002276734 -0.34138851 1.000000000
>
> pairs(st) # scatterplot matrix
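By the way, if all those decimal places make the correlation matrix hard to scan, rounding it is an easy convenience…
> round(cor(st), 2)          # same matrix to two decimal places (output not shown)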

I propose we cast “Life.Exp” into the role of our response variable and attempt to model (predict) it from the other variables. Already we see some interesting relationships. Life expectancy appears to be moderately to strongly related to “Income” (positively), “Illiteracy” (negatively), “Murder” (negatively), “HS.Grad” (positively), and “Frost” (number of days per year with frost, positively). I call particular attention to this last relationship. These are all bivariate relationships; i.e., they don’t take into account any of the remaining variables.
Here’s another interesting relationship, which we would find statistically significant. The correlation between average SAT scores by state and average teacher pay by state is negative. That is to say, the states with higher teacher pay have lower SAT scores. (How much political hay has been made of that? These politicians either don’t understand the nature of regression relationships, or they are spinning the data to tell the story they want you to hear!!) We would also find a very strong negative correlation between SAT scores and percent of eligible students who take the test. I.e., the higher the percentage of HS seniors who take the SAT, the lower the average score. Here is the crucial point. Teacher pay is positively correlated with percent of students who take the test. There is a potential confound here. In regression terms, we would call percent of students who take the test a “lurking variable.”
The advantage of looking at these variables in a multiple regression is we are able to see the relationship between two variables with the others taken out. To say that another way, the regression analysis “holds the potential lurking variables constant” while it is figuring out the relationship between the variables of interest. When SAT scores, teacher pay, and percent taking the test are all entered into the regression relationship, the confound between pay and percent is controlled, and the relationship between teacher pay and SAT scores is shown to be significantly positive.
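The SAT data themselves are not part of this tutorial, but the phenomenon is easy to illustrate with simulated numbers. The following sketch is purely hypothetical (the variable names and coefficients are made up); the point is only that a bivariate slope can reverse sign once the lurking variable is entered into the model…
> set.seed(42)
> percent = runif(200, 5, 80)                  # percent of seniors taking the test
> pay = 25 + 0.2*percent + rnorm(200, sd=3)    # pay rises with participation
> sat = 1000 + 3*pay - 3*percent + rnorm(200, sd=15)
> coef(lm(sat ~ pay))              # bivariate: the slope for pay should come out negative
> coef(lm(sat ~ pay + percent))    # percent controlled: the slope for pay should be positive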
The point of this long discussion is this. Life expectancy does have a bivariate relationship with a lot of the other variables, but many of those variables are also related to each other. Multiple regression allows us to tease all of this apart and see in a “purer” (but not pure) form what the relationship really is between, say, life expectancy and average number of days with frost. I suggest you watch this relationship as we proceed through the analysis.
We should also examine the variables for outliers and distribution before proceeding, but for the sake of brevity, we’ll skip it.
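If you did want a quick look, a few lines like the following would do it (one possible approach, not part of the analysis below)…
> par(mfrow=c(3,3))                               # one histogram per variable
> for (i in 1:9) hist(st[,i], main=names(st)[i])
> par(mfrow=c(1,1))
> boxplot(st$Area)                                # boxplots will flag suspected outliers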
The Maximal Model (Sans Interactions)
We begin by throwing all the predictors into a (linear additive) model…
> options(show.signif.stars=F) # I don't like significance stars!
> names(st) # for handy reference
[1] "Population" "Income" "Illiteracy" "Life.Exp" "Murder"
[6] "HS.Grad" "Frost" "Area" "Density"
> model1 = lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
+ HS.Grad + Frost + Area + Density, data=st)
> summary(model1)
Call:
lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder +
HS.Grad + Frost + Area + Density, data = st)
Residuals:
Min 1Q Median 3Q Max
-1.47514 -0.45887 -0.06352 0.59362 1.21823
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.995e+01 1.843e+00 37.956 < 2e-16
Population 6.480e-05 3.001e-05 2.159 0.0367
Income 2.701e-04 3.087e-04 0.875 0.3867
Illiteracy 3.029e-01 4.024e-01 0.753 0.4559
Murder -3.286e-01 4.941e-02 -6.652 5.12e-08
HS.Grad 4.291e-02 2.332e-02 1.840 0.0730
Frost -4.580e-03 3.189e-03 -1.436 0.1585
Area -1.558e-06 1.914e-06 -0.814 0.4205
Density -1.105e-03 7.312e-04 -1.511 0.1385
Residual standard error: 0.7337 on 41 degrees of freedom
Multiple R-squared: 0.7501, Adjusted R-squared: 0.7013
F-statistic: 15.38 on 8 and 41 DF, p-value: 3.787e-10
It appears higher populations are related to increased life expectancy and higher murder rates are strongly related to decreased life expectancy. Other than that, we’re not seeing much. Another kind of summary of the model can be obtained like this…
> summary.aov(model1)
Df Sum Sq Mean Sq F value Pr(>F)
Population 1 0.4089 0.4089 0.7597 0.388493
Income 1 11.5946 11.5946 21.5413 3.528e-05
Illiteracy 1 19.4207 19.4207 36.0811 4.232e-07
Murder 1 27.4288 27.4288 50.9591 1.051e-08
HS.Grad 1 4.0989 4.0989 7.6152 0.008612
Frost 1 2.0488 2.0488 3.8063 0.057916
Area 1 0.0011 0.0011 0.0020 0.964381
Density 1 1.2288 1.2288 2.2830 0.138472
Residuals 41 22.0683 0.5383
…but it’s less useful for modeling purposes.
Now we need to start winnowing down our model to a minimal adequate one. The least significant of the slopes in the above summary table was that for “Illiteracy”, but I’ll tell ya what. I kinda want to hold on to “Illiteracy” for a while, and “Area” was not much better, so I’m going to toss out “Area” first.
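(An aside: if you’d like R to show you the consequence of dropping each predictor one at a time from the current model before you commit to anything, the built-in drop1( ) function will do that. I won’t rely on it below, but it’s there…)
> drop1(model1, test="F")     # F-test for each single-term deletion (output not shown)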
The Minimal Adequate Model
We want to reduce our model to a point where all the remaining predictors are significant, and we want to do this by throwing out one predictor at a time. “Area” will go first. To do this, we could just recast the model without the “Area” variable, but R supplies a handier way…
> model2 = update(model1, .~.-Area)
> summary(model2)
Call:
lm(formula = Life.Exp ~ Population + Income + Illiteracy + Murder +
HS.Grad + Frost + Density, data = st)
Residuals:
Min 1Q Median 3Q Max
-1.5025 -0.4047 -0.0608 0.5868 1.4386
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.094e+01 1.378e+00 51.488 < 2e-16
Population 6.249e-05 2.976e-05 2.100 0.0418
Income 1.485e-04 2.690e-04 0.552 0.5840
Illiteracy 1.452e-01 3.512e-01 0.413 0.6814
Murder -3.319e-01 4.904e-02 -6.768 3.12e-08
HS.Grad 3.746e-02 2.225e-02 1.684 0.0996
Frost -5.533e-03 2.955e-03 -1.873 0.0681
Density -7.995e-04 6.251e-04 -1.279 0.2079
Residual standard error: 0.7307 on 42 degrees of freedom
Multiple R-squared: 0.746, Adjusted R-squared: 0.7037
F-statistic: 17.63 on 7 and 42 DF, p-value: 1.173e-10
The syntax of update( ) is cryptic, so here’s what it means. Inside the new formula, a dot means “same.” So read the above update command as follows: “Update model1 using the same response variable and the same predictor variables, except remove (minus) ‘Area’.” Notice that removing “Area” has cost us very little in terms of R-squared, and the adjusted R-squared actually went up, due to there being fewer predictors.
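In other words, the update( ) command above is just shorthand for refitting the model explicitly, which would have looked like this…
> model2 = lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
+             HS.Grad + Frost + Density, data=st)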
The two models can be compared as follows…
> anova(model1, model2)
Analysis of Variance Table
Model 1: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
Frost + Area + Density
Model 2: Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
Frost + Density
Res.Df RSS Df Sum of Sq F Pr(>F)
1 41 22.0683
2 42 22.4247 -1 -0.3564 0.6621 0.4205
As you can see, removing “Area” had no significant effect on the model (p = .4205). Compare the p-value to that for “Area” in the first summary table above. (Ignore the negative sum of squares in this last output. It happened because of the order in which I listed the models. Try it the other way around if you’re unhappy with negative SSs.)
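That is, listing the smaller model first gives the same F and p-value with a positive sum of squares…
> anova(model2, model1)       # output not shown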
What goes next, “Income” or “Illiteracy”? Fair is fair…
> model3 = update(model2, .~.-Illiteracy)
> summary(model3)
Call:
lm(formula = Life.Exp ~ Population + Income + Murder + HS.Grad +
Frost + Density, data = st)
Residuals:
Min 1Q Median 3Q Max
-1.49555 -0.41246 -0.05336 0.58399 1.50535
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.131e+01 1.042e+00 68.420 < 2e-16
Population 5.811e-05 2.753e-05 2.110 0.0407
Income 1.324e-04 2.636e-04 0.502 0.6181
Murder -3.208e-01 4.054e-02 -7.912 6.32e-10
HS.Grad 3.499e-02 2.122e-02 1.649 0.1065
Frost -6.191e-03 2.465e-03 -2.512 0.0158
Density -7.324e-04 5.978e-04 -1.225 0.2272
Residual standard error: 0.7236 on 43 degrees of freedom
Multiple R-squared: 0.745, Adjusted R-squared: 0.7094
F-statistic: 20.94 on 6 and 43 DF, p-value: 2.632e-11
Things are starting to change a bit. R-squared went down again, as it will always do when a predictor is removed, but once more adjusted R-squared increased. “Frost” is also now a significant predictor of life expectancy, and unlike in the bivariate relationship we saw originally, the relationship is now negative.
Looks like income should go next…
> model4 = update(model3, .~.-Income)
> summary(model4)
Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost +
Density, data = st)
Residuals:
Min 1Q Median 3Q Max
-1.56877 -0.40951 -0.04554 0.57362 1.54752
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.142e+01 1.011e+00 70.665 < 2e-16
Population 6.083e-05 2.676e-05 2.273 0.02796
Murder -3.160e-01 3.910e-02 -8.083 3.07e-10
HS.Grad 4.233e-02 1.525e-02 2.776 0.00805
Frost -5.999e-03 2.414e-03 -2.485 0.01682
Density -5.864e-04 5.178e-04 -1.132 0.26360
Residual standard error: 0.7174 on 44 degrees of freedom
Multiple R-squared: 0.7435, Adjusted R-squared: 0.7144
F-statistic: 25.51 on 5 and 44 DF, p-value: 5.524e-12
R-squared went down hardly at all. Adjusted R-squared went up. “Income” will not be missed.
Now all the predictors are significant except “Density”…
> model5 = update(model4, .~.-Density)
> summary(model5)
Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost,
data = st)
Residuals:
Min 1Q Median 3Q Max
-1.47095 -0.53464 -0.03701 0.57621 1.50683
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16
Population 5.014e-05 2.512e-05 1.996 0.05201
Murder -3.001e-01 3.661e-02 -8.199 1.77e-10
HS.Grad 4.658e-02 1.483e-02 3.142 0.00297
Frost -5.943e-03 2.421e-03 -2.455 0.01802
Residual standard error: 0.7197 on 45 degrees of freedom
Multiple R-squared: 0.736, Adjusted R-squared: 0.7126
F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12
Adjusted R-squared slipped a bit that time, but not significantly as we could confirm by…
> anova(model5, model4) # output not shown
We have (implicitly by not saying otherwise) set our alpha level at .05, so now “Population” must go. This could have a substantial effect on the model, as the slope for “Population” is very nearly significant. Only one way to find out…
> model6 = update(model5, .~.-Population)
> summary(model6)
Call:
lm(formula = Life.Exp ~ Murder + HS.Grad + Frost, data = st)
Residuals:
Min 1Q Median 3Q Max
-1.5015 -0.5391 0.1014 0.5921 1.2268
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.036379 0.983262 72.246 < 2e-16
Murder -0.283065 0.036731 -7.706 8.04e-10
HS.Grad 0.049949 0.015201 3.286 0.00195
Frost -0.006912 0.002447 -2.824 0.00699
Residual standard error: 0.7427 on 46 degrees of freedom
Multiple R-squared: 0.7127, Adjusted R-squared: 0.6939
F-statistic: 38.03 on 3 and 46 DF, p-value: 1.634e-12
We have reached the minimal adequate model.
Stepwise Regression
What we did in the previous section can be automated using the step( ) function…
> step(model1, direction="backward")
Start: AIC=-22.89
Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
Frost + Area + Density
Df Sum of Sq RSS AIC
- Illiteracy 1 0.305 22.373 -24.208
- Area 1 0.356 22.425 -24.093
- Income 1 0.412 22.480 -23.969
<none> 22.068 -22.894
- Frost 1 1.110 23.178 -22.440
- Density 1 1.229 23.297 -22.185
- HS.Grad 1 1.823 23.891 -20.926
- Population 1 2.510 24.578 -19.509
- Murder 1 23.817 45.886 11.706
Step: AIC=-24.21
Life.Exp ~ Population + Income + Murder + HS.Grad + Frost + Area +
Density
Df Sum of Sq RSS AIC
- Area 1 0.143 22.516 -25.890
- Income 1 0.232 22.605 -25.693
<none> 22.373 -24.208
- Density 1 0.929 23.302 -24.174
- HS.Grad 1 1.522 23.895 -22.917
- Population 1 2.205 24.578 -21.508
- Frost 1 3.132 25.506 -19.656
- Murder 1 26.707 49.080 13.072
Step: AIC=-25.89
Life.Exp ~ Population + Income + Murder + HS.Grad + Frost + Density
Df Sum of Sq RSS AIC
- Income 1 0.132 22.648 -27.598
- Density 1 0.786 23.302 -26.174
<none> 22.516 -25.890
- HS.Grad 1 1.424 23.940 -24.824
- Population 1 2.332 24.848 -22.962
- Frost 1 3.304 25.820 -21.043
- Murder 1 32.779 55.295 17.033
Step: AIC=-27.6
Life.Exp ~ Population + Murder + HS.Grad + Frost + Density
Df Sum of Sq RSS AIC
- Density 1 0.660 23.308 -28.161
<none> 22.648 -27.598
- Population 1 2.659 25.307 -24.046
- Frost 1 3.179 25.827 -23.030
- HS.Grad 1 3.966 26.614 -21.529
- Murder 1 33.626 56.274 15.910
Step: AIC=-28.16
Life.Exp ~ Population + Murder + HS.Grad + Frost
Df Sum of Sq RSS AIC
<none> 23.308 -28.161
- Population 1 2.064 25.372 -25.920
- Frost 1 3.122 26.430 -23.876
- HS.Grad 1 5.112 28.420 -20.246
- Murder 1 34.816 58.124 15.528
Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, data = st)
Coefficients:
(Intercept) Population Murder HS.Grad Frost
7.103e+01 5.014e-05 -3.001e-01 4.658e-02 -5.943e-03
…but I’m against it! The “direction=” option can be set to “forward”, “backward”, or “both”.
Confidence Limits on the Estimated Coefficients
Easy…
> confint(model6)
2.5 % 97.5 %
(Intercept) 69.05717472 73.015582905
Murder -0.35700149 -0.209128849
HS.Grad 0.01935043 0.080546980
Frost -0.01183825 -0.001985219
Set the desired confidence level with the “level=” option (not conf.level as in many other R functions).
Predictions
Predictions can be made from a model equation using the predict( ) function…
> predict(model6, list(Murder=10.5, HS.Grad=48, Frost=100))
1
69.77056
Feed it the model name and a list of labeled values for the predictors. The values to be predicted from can be vectors.
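For example, several predictions can be requested at once (output not shown; the particular values here are just an illustration)…
> predict(model6, data.frame(Murder=c(5,10,15), HS.Grad=c(60,50,40), Frost=c(120,100,80)))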
Regression Diagnostics
The usual diagnostic plots are available…
> par(mfrow=c(2,2)) # visualize four graphs at once
> plot(model6)
> par(mfrow=c(1,1)) # reset the graphics defaults
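Individual diagnostic plots can also be requested one at a time with the “which=” option, a detail we won’t pursue here…
> plot(model6, which=2)       # just the normal Q-Q plot of the residuals, for example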

Extracting Elements of the Model Object
The model data object is a list containing quite a lot of information. Aside from the summary functions we used above, there are two ways to get at this information…
> model6[[1]] # extract list item 1: coefficients
(Intercept) Murder HS.Grad Frost
71.036378813 -0.283065168 0.049948704 -0.006911735
> model6[[2]] # extract list item 2: residuals
Alabama Alaska Arizona Arkansas California
0.36325842 -0.80873734 -1.07681421 0.93888883 0.60063821
Colorado Connecticut Delaware Florida Georgia
0.90409006 0.48472687 -1.23666537 0.10114571 -0.17498630
Hawaii Idaho Illinois Indiana Iowa
1.22680042 0.23279723 0.26968086 0.05432904 0.19534036
Kansas Kentucky Louisiana Maine Maryland
0.61342480 0.79770164 -0.56481311 -1.50150772 -0.32455693
Massachusetts Michigan Minnesota Mississippi Missouri
-0.48235430 0.96231978 0.80350324 -1.11037437 0.59509781
Montana Nebraska Nevada New Hampshire New Jersey
-0.94669741 0.38328311 -0.70837880 -0.54666731 -0.46189744
New Mexico New York North Carolina North Dakota Ohio
0.10159299 0.53349703 -0.05444180 0.91307523 0.07808745
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
0.18464735 -0.41031105 -0.51622769 0.10314800 -1.23162114
South Dakota Tennessee Texas Utah Vermont
0.05138438 0.58330361 1.19135836 0.72277428 0.46958000
Virginia Washington West Virginia Wisconsin Wyoming
-0.06731035 -1.04976581 -1.04653483 0.60046076 -0.73927257
There are twelve elements in this list. They can also be extracted by name, and the supplied name only has to be long enough to be recognized. So instead of asking for list item 2, you can ask for…
> sort(model6$resid) # extract residuals and sort them
Maine Delaware South Carolina Mississippi Arizona
-1.50150772 -1.23666537 -1.23162114 -1.11037437 -1.07681421
Washington West Virginia Montana Alaska Wyoming
-1.04976581 -1.04653483 -0.94669741 -0.80873734 -0.73927257
Nevada Louisiana New Hampshire Pennsylvania Massachusetts
-0.70837880 -0.56481311 -0.54666731 -0.51622769 -0.48235430
New Jersey Oregon Maryland Georgia Virginia
-0.46189744 -0.41031105 -0.32455693 -0.17498630 -0.06731035
North Carolina South Dakota Indiana Ohio Florida
-0.05444180 0.05138438 0.05432904 0.07808745 0.10114571
New Mexico Rhode Island Oklahoma Iowa Idaho
0.10159299 0.10314800 0.18464735 0.19534036 0.23279723
Illinois Alabama Nebraska Vermont Connecticut
0.26968086 0.36325842 0.38328311 0.46958000 0.48472687
New York Tennessee Missouri Wisconsin California
0.53349703 0.58330361 0.59509781 0.60046076 0.60063821
Kansas Utah Kentucky Minnesota Colorado
0.61342480 0.72277428 0.79770164 0.80350324 0.90409006
North Dakota Arkansas Michigan Texas Hawaii
0.91307523 0.93888883 0.96231978 1.19135836 1.22680042
That settles it! I’m moving to Hawaii. (This could all just be random error, of course, but more technically, it is unexplained variation. It could mean that Maine is performing the worst against our model, and Hawaii the best, due to factors we have not included in the model.)
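If you’re curious what all twelve elements are called, names( ) will list them…
> names(model6)               # output not shown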
Beta Coefficients
Beta or standardized coefficients are the slopes we would get if all the variables were on the same scale, which is accomplished by converting them to z-scores before doing the regression. Betas allow a comparison of the relative importance of the predictors, which neither the unstandardized coefficients nor the p-values do. Scaling, or standardizing, the data vectors can be done using the scale( ) function…
> model7 = lm(scale(Life.Exp) ~ scale(Murder) + scale(HS.Grad) + scale(Frost),
+ data=st)
> summary(model7)
Call:
lm(formula = scale(Life.Exp) ~ scale(Murder) + scale(HS.Grad) +
scale(Frost), data = st)
Residuals:
Min 1Q Median 3Q Max
-1.11853 -0.40156 0.07551 0.44111 0.91389
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.541e-15 7.824e-02 -5.8e-14 1.00000
scale(Murder) -7.784e-01 1.010e-01 -7.706 8.04e-10
scale(HS.Grad) 3.005e-01 9.146e-02 3.286 0.00195
scale(Frost) -2.676e-01 9.477e-02 -2.824 0.00699
Residual standard error: 0.5532 on 46 degrees of freedom
Multiple R-squared: 0.7127, Adjusted R-squared: 0.6939
F-statistic: 38.03 on 3 and 46 DF, p-value: 1.634e-12
The intercept goes to zero, and the slopes become standardized beta values. So an increase of one standard deviation in the murder rate is associated with a drop of 0.778 standard deviations in life expectancy, if all the other variables are held constant. VERY roughly, these values can be interpreted as correlations with the other predictors controlled. The same result could have been obtained via the formula for standardized coefficients…
> -0.283 * sd(st$Murder) / sd(st$Life.Exp)
[1] -0.778241
Here the first value, -0.283, is the unstandardized regression coefficient for “Murder” (rounded to three decimal places).
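If you’d rather not standardize the variables inside the model formula, the same betas can be calculated directly from the coefficients of model6 (this is just the formula above applied to all three predictors at once)…
> coef(model6)[-1] * sapply(st[c("Murder","HS.Grad","Frost")], sd) / sd(st$Life.Exp)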
Partial Correlations
Another way to remove the effect of a possible lurking variable from the correlation of two other variables is to calculate a partial correlation. Base R has no built-in partial correlation function, so to create one, copy and paste this script into the R Console…
### Partial correlation coefficient
### From formulas in Sheskin, 3e
### a,b=variables to be correlated, c=variable to be partialled out of both
pcor = function(a,b,c)
{
(cor(a,b)-cor(a,c)*cor(b,c))/sqrt((1-cor(a,c)^2)*(1-cor(b,c)^2))
}
### end of script
This will put a function called pcor( ) in your workspace. To use it, enter the names of the variables you want the partial correlation of first, and then enter the name of the variable you want partialled out (controlled)…
> pcor(st$Life.Exp, st$Murder, st$HS.Grad)
[1] -0.6999659
The partial correlation between “Life.Exp” and “Murder” with “HS.Grad” held constant is about -0.7.
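As a cross-check, and as another way of thinking about what a partial correlation is, the same value can be obtained by correlating the residuals from two simple regressions, each with “HS.Grad” as the predictor…
> cor(resid(lm(Life.Exp ~ HS.Grad, data=st)),
+     resid(lm(Murder ~ HS.Grad, data=st)))     # same value as pcor( ) gave above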
Making Predictions From a Model
To illustrate how predictions are made from a model, I will fit a new model to another data set…
> rm(list=ls()) # clean up (WARNING! this will wipe your workspace!)
> data(airquality) # see ?airquality for details on these data
> na.omit(airquality) -> airquality # toss cases with missing values
> lm(Ozone ~ Solar.R + Wind + Temp + Month,
data=airquality) -> model
> coef(model)
(Intercept) Solar.R Wind Temp Month
-58.05383883 0.04959683 -3.31650940 1.87087379 -2.99162786
The model has been fitted and the regression coefficients displayed. Suppose now we wish to predict for new values: Solar.R=200, Wind=11, Temp=80, Month=6. One way to do it is this…
> (prediction <- c(1,200,11,80,6) * coef(model))
(Intercept) Solar.R Wind Temp Month
-58.053839 9.919365 -36.481603 149.669903 -17.949767
> ### Note: putting the whole statement in parentheses not only stores the values
> ### but also prints them to the Console.
> sum(prediction)
[1] 47.10406
Our predicted value is 47.1. How accurate is it?
We can get a confidence interval for the predicted value, but first we have to ask an important question. Is the prediction being made for the mean response for cases like this? Or is it being made for a new single case? The difference this makes to the CI is shown below…
### Prediction of mean response for cases like this...
> predict(model, list(Solar.R=200,Wind=11,Temp=80,Month=6), interval="conf")
fit lwr upr
1 47.10406 41.10419 53.10393
### Prediction for a single new case...
> predict(model, list(Solar.R=200,Wind=11,Temp=80,Month=6), interval="pred")
fit lwr upr
1 47.10406 5.235759 88.97236
As you can see, the interval is much wider for a single new case. These are 95% intervals by default; set a different confidence level with the “level=” option (not “conf.level=” as in many other R functions). The function predict( ) has a somewhat tricky syntax, so here’s how it works. The first argument should be the name of the model object. The second argument should be a list of new data values labeled with the variable names. That’s all that’s really required. The “interval=” option can take values of “none” (the default), “confidence”, or “prediction”; you need only type enough of it so R knows which one you want.
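For example, a 90% prediction interval for the same new case would be requested like this (output not shown)…
> predict(model, list(Solar.R=200, Wind=11, Temp=80, Month=6), interval="pred", level=0.90)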