Plotting Model Results
Problem:
Often times, you would like to generate graphics based on a model you fit in R. These plots may include a number of diagnostic plots, or just plotting the fitted line with prediction and confidence bands.
"Solution":
NOTE: I use the term "Solution" loosely; in R, there is never just one solution. This is just the solution I have found.
NOTE: The code distinguished with a ">" is what you would type at the R command line. The R output, if any, and any of my comments have been commented out using a "#".
This solution is a nearly exact excerpt from specific sections in the following book:
- Introductory Statistics with R by Peter Dalgaard (Springer, 2002).
Specifically, I have excerpted paragraphs and have used and modified the R code from
Chapter 5: Regression and Correlation Section 1, 2, and 3 and
Chapter 10: Linear Models Section 8.
All of the data sets used in Peter Dalgaard's book can be found in the
ISwR
add-on package, which is available for download/installation on the
CRAN website (
http://cran.r-project.org/).
For this solution, we will be using the
thuesen
built-in data set from the
ISwR
add-on package. According to the
ISwR
description, the
thuesen
data set "contains ventricular shortening velocity and blood glucose for type 1 diabetic patients."
thuesen
contains two columns:
- blood.glucose - fasting blood glucose (mmol/l) (numeric)
- short.velocity - mean circumferential shortening velocity (%/s) (numeric)
In case some of you have problems installing the
ISwR
add-on package, I decided to generate the
thuesen
data set "from scratch" and proceed from there.
> thuesen<-data.frame(
blood.glucose=c(15.3, 10.8, 8.1, 19.5, 7.2, 5.3, 9.3,
11.1, 7.5, 12.2, 6.7, 5.2, 19.0, 15.1, 6.7, 8.6, 4.2, 10.3,
12.5, 16.1, 13.3, 4.9, 8.8, 9.5),
short.velocity=c(1.76, 1.34, 1.27, 1.47, 1.27, 1.49, 1.31, 1.09,
1.18, 1.22, 1.25, 1.19, 1.95, 1.28, 1.52, NA, 1.12, 1.37, 1.19,
1.05, 1.32, 1.03, 1.12, 1.70))
> thuesen
# blood.glucose short.velocity
# 1 15.3 1.76
# 2 10.8 1.34
# 3 8.1 1.27
# 4 19.5 1.47
# 5 7.2 1.27
# 6 5.3 1.49
# 7 9.3 1.31
# 8 11.1 1.09
# 9 7.5 1.18
# 10 12.2 1.22
# 11 6.7 1.25
# 12 5.2 1.19
# 13 19.0 1.95
# 14 15.1 1.28
# 15 6.7 1.52
# 16 8.6 NA
# 17 4.2 1.12
# 18 10.3 1.37
# 19 12.5 1.19
# 20 16.1 1.05
# 21 13.3 1.32
# 22 4.9 1.03
# 23 8.8 1.12
# 24 9.5 1.70
For those of you who are able to properly install the
ISwR
package, you can load the library and read in the
thuesen
data set as follows:
library(ISwR)
data(thuesen)
This will then add the
thuesen
data set as an object to your current workspace:
ls()
# [1] "last.warning" "thuesen"
Either way, you will end up with a data set called
thuesen
.
Let's fit a simple linear regression model using the
thuesen
data. For convenience, let's assign the value(s) returned by the
lm
function under the name
lm.velo
:
lm.velo<-lm(short.velocity ~ blood.glucose, data=thuesen)
lm.velo
#
# Call:
# lm(formula = short.velocity ~ blood.glucose, data = thuesen)
#
# Coefficients:
# (Intercept) blood.glucose
# 1.09781 0.02196
As Peter Daalgard states on page 97:
"The result of
lm
is a
model object. This is a distinctive concept of the S language (of which R is a dialect). Where other statistical systems focus on generating printed output that can be controlled by setting options, you get instead the result of a model
fit encapsulated in an object from which the desired quantities can be obtained using
extractor functions. An
lm
object does in fact contain much more information than you see when it is printed."
A basic extractor function is the
summary
function:
> summary(lm.velo)
#
# Call:
# lm(formula = short.velocity ~ blood.glucose, data = thuesen)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.40141 -0.14760 -0.02202 0.03001 0.43490
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.09781 0.11748 9.345 6.26e-09 ***
# blood.glucose 0.02196 0.01045 2.101 0.0479 *
# ---
# Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
#
# Residual standard error: 0.2167 on 21 degrees of freedom
# Multiple R-Squared: 0.1737, Adjusted R-squared: 0.1343
# F-statistic: 4.414 on 1 and 21 DF, p-value: 0.0479
Let's first
plot just the points and the fitted line of the model fit :
plot(thuesen$blood.glucose, thuesen$short.velocity,
xlab="Blood Glucose", ylab="Short Velocity",
main="Plot of Thuesen Data")
abline(lm.velo)
Now let's consider
residuals and
fitted values. Two further extractor functions are
fitted
and
resid
.
As Peter Daalgard states on page 100 - 101:
- "The fitted function returns fitted values - the y-values that you would expect for the given x-values according to the best fitting straight line", which was found by
lm
.
- "The residuals shown by the
resid
function is the difference between this and the observed *short.velocity*".
> fitted(lm.velo)
# 1 2 3 4 5 6 7 8
# 1.433841 1.335010 1.275711 1.526084 1.255945 1.214216 1.302066 1.341599
# 9 10 11 12 13 14 15 17
# 1.262534 1.365758 1.244964 1.212020 1.515103 1.429449 1.244964 1.190057
# 18 19 20 21 22 23 24
# 1.324029 1.372346 1.451411 1.389916 1.205431 1.291085 1.306459
> resid(lm.velo)
# 1 2 3 4 5 6
# 0.326158532 0.004989882 -0.005711308 -0.056084062 0.014054962 0.275783754
# 7 8 9 10 11 12
# 0.007933665 -0.251598875 -0.082533795 -0.145757649 0.005036223 -0.022019994
# 13 14 15 17 18 19
# 0.434897199 -0.149448964 0.275036223 -0.070057471 0.045971143 -0.182346406
# 20 21 22 23 24
# -0.401411486 -0.069916424 -0.175431237 -0.171085074 0.393541161
Notice, observation 16 has a missing
short.velocity, so no fitted value or residual is calculated. For ease of plotting, we will want to have calculated fitted values and residuals for all observations. A easy way to handle missing values is by using the
na.exclude
method for missing value (i.e.
NA
) handling. We do this using the
options
function:
> options(na.action="na.exclude")
This overrides the default option of
na.action = "na.omit"
.
na.omit
removes any incomplete cases from the data set before calculating the fitted values and residuals. On the other hand,
na.exclude
inset
NA=s for cases omitted by =na.omit
in order to pad the residuals and predictions to the correct length.
In the future, you can determine the default of a particular option by using the
getOption
function in a similar fashion to this:
# To see default na.action option:
> getOption("na.action")
# [1] "na.omit"
Peter Dalgaard discuss more ways to handle missing values on page 101.
We now have to refit
lm.velo
after changing the
na.action
option, and if we look at the fitted values, we see that observation 16 has a missing fitted value (
NA
):
> lm.velo<-lm(short.velocity ~ blood.glucose, data=thuesen)
> fitted(lm.velo)
# 1 2 3 4 5 6 7 8
# 1.433841 1.335010 1.275711 1.526084 1.255945 1.214216 1.302066 1.341599
# 9 10 11 12 13 14 15 16
# 1.262534 1.365758 1.244964 1.212020 1.515103 1.429449 1.244964 NA
# 17 18 19 20 21 22 23 24
# 1.190057 1.324029 1.372346 1.451411 1.389916 1.205431 1.291085 1.306459
Now let's create a
plot where residuals are displayed by connecting observations to corresponding points on the fitted line :
> plot(thuesen$blood.glucose, thuesen$short.velocity,
xlab="Blood Glucose", ylab="Short Velocity",
main="Plot of Thuesen Data")
> abline(lm.velo)
> segments(x0=thuesen$blood.glucose,
y0=fitted(lm.velo),
x1=thuesen$blood.glucose,
y1=thuesen$short.velocity)
We can also generate a simple
plot of residuals versus fitted values :
> plot(fitted(lm.velo), resid(lm.velo),
xlab="Fitted Values from lm.velo",
ylab="Residuals from lm.velo",
main="Residuals versus Fitted values from lm.velo")
We can also get an indication of whether the residuals might have come from a normal distribution by checking for a straight line on a %BLUR%
Q-Q plot :
> qqnorm(resid(lm.velo))
> qqline(resid(lm.velo))
Illustration and explanation of further
regression diagnostics is given in section 10.8. A basic set of
four regression diagnostic plots is available via the plot method for the
lm
objects:
- residual versus fitted values
- Q-Q normal distribution plot of standardized residuals
- plot of the square root of the absolute value of the standardized residuals
- plot of "Cook's distance", which is a measure of the influence of each observation on the regression coefficients
> par(mfro=c(2,2))
> plot(lm.velo)
# To return to the default of one graph:
> par(mfrow=c(1,1))
The last thing I am going to demonstrate how to
plot the fitted line of your model including the prediction and confidence bands . There is a complete explanation of the procress to produce this plot in section 5.3.
As Peter Dalgaard explains on page 106:
"First we create a new data frame in which the
blood.glucose variable contains the values at which we want predictions to be made":
> pred.frame<-data.frame(blood.glucose=4:20)
Next we generate two vectors which "contain the result of
predict
for the new data in =pred.frame=", specifically (1) the prediction limits, and (2) the confidence limits:
> pp<-predict(lm.velo, int="p", newdata=pred.frame)
> pc<-predict(lm.velo, int="c", newdata=pred.frame)
"Now for the plotting: first we create a standard scatterplot, except we ensure that it has enough room for the prediction limits", by incorporating the
ylim
argument.
> plot(thuesen$blood.glucose, thuesen$short.velocity,
ylim=range(thuesen$short.velocity, pp, na.rm=T),
xlab="Blood glucose", ylab="Short Velocity",
main="Plot with Confidence and Prediction Bands")
"Finally the curves are added, using as c-values the
blood.glucose used for the prediction, and setting the line types and colours to more sensible values."
matlines(pred.frame$blood.glucose, pc, lty=c(1,2,2), col="black")
matlines(pred.frame$blood.glucose, pp, lty=c(1,3,3), col="black")
We now probably want to set the
na.action
option back to
"na.omit"
instead of ="na.exclude"=":
options(na.action="na.omit")
Acknowledgements:
I would like to thank Richard Urbano for posing this problem.