 Audience: Statisticians and persons from other quantitative
disciplines who are interested in multivariable regression analysis of
univariate responses, in developing, validating, and graphically
describing multivariable predictive models, and in
covariable adjustment in randomized
clinical trials.
The course will be of particular interest to:

Applied statisticians who want to learn new methodology for
flexibly fitting all types of multivariable regression models while
making estimation of optimal covariable
transformations an explicit part of the modeling process.
 Those who want to learn how to develop models that are likely to
predict future observations as accurately as they predicted responses
from the data used to fit the models.
 Statisticians who want to learn how to graphically present
complex regression models to nonstatisticians.
 Analysts who would like to learn how to combine multiple
imputation with regression models to handle missing and incomplete
data.
 Quantitatively minded epidemiologists and others who need to use
binary and time-to-event (survival)
models for analyzing and predicting outcomes in observational studies.
 Biostatisticians, health services and outcomes researchers, and
economists who need to study or predict health outcomes or
resource utilization.
 Biostatisticians working in clinical trials
who would like to learn why covariable adjustment is needed
even in perfectly balanced randomized trials and to be introduced to
developing analytic plans for such adjustment.
 Prerequisites: A good general knowledge of statistical estimation
and inference methods and a good command of ordinary linear
regression.
Those who want to run the laboratory exercises
themselves or who want to use S-Plus to apply the methods taught in
this course in their everyday work should have had a previous
introduction to S-Plus.
Participants are encouraged to read
references [2, 3, 5] in advance.
Those interested in covariable adjustment in
randomized clinical trials may also want to read [4].
 Course Format:
Generally, the first two-thirds of each day of this three-day course will
consist of lectures on statistical methodology and graphical methods
for interpreting complex models and presenting them to
nonstatisticians. In the remaining time students will gain
hands-on experience in using the freely available S-Plus
Design library for developing, checking, and validating
multivariable predictive models, testing formal hypotheses, and
graphically interpreting the models, all using real
datasets. See
http://hesweb1.med.virginia.edu/biostat/s/Design.html for
information related to Design and to datasets for learning and
testing modeling methods.
Course Description
The first part of the course presents the following
elements of multivariable predictive modeling for a single response
variable: using regression splines to relax linearity assumptions,
perils of variable selection and overfitting, where to spend degrees
of freedom, shrinkage, imputation of missing data, data reduction,
and interaction surfaces. Then a
default overall modeling strategy will be described. This is followed
by methods for graphically understanding models (e.g., using
nomograms) and using resampling to estimate a model's likely
performance on new data. Next comes an overview of the freely available
S-Plus Design library, which
facilitates most of the steps of the modeling process.
Statistical methods related to binary
logistic models will then be covered.
Three of the following
case studies will be presented:
an exploration of voting tendencies over U.S. counties in the 1992
presidential election, an interactive
exploration of the survival status of Titanic passengers,
an interactive case study in developing a survival time
model for critically ill patients, and a case study in Cox
regression.
In the hands-on computer lab students will develop, validate, and
graphically describe multivariable regression models themselves.
The course will also survey the advantages
of modeling (covariable adjustment) in randomized trials and will
provide some guidance in developing a prospective statistical plan
for use in a Phase III clinical trial.
The methods covered in this course will apply to almost any regression
model, including ordinary least squares, logistic regression models,
and survival models.
 Objectives:

Be familiar with modern methods for fitting multivariable regression models:
  accurately
  in a way the sample size will allow, without overfitting
  uncovering complex nonlinear or nonadditive relationships
  testing for and quantifying the association between one or more predictors and the response, with possible adjustment for other factors
Be able to understand the different types of missing data and how to use multiple imputation to incorporate partial covariable information
Be able to validate models for predictive accuracy and to detect overfitting
Be able to interpret fitted models using both parameter estimates and graphics
Be able to critique the literature to detect models that are likely to be unreliable
Understand the benefits of covariable adjustment in randomized studies
 Outline:

Planning for Modeling
Covariable Adjustment in Randomized Clinical Trials
  Gaining efficiency
  Reducing bias even with perfect balance
Notation for Regression Models
Interpreting Model Parameters
  Nominal Predictors
  Interactions
Relaxing Linearity Assumption for Continuous Predictors
  Simple Nonlinear Terms
  Splines for Estimating Shape of Regression Function and Determining Predictor Transformations
  Cubic Spline Functions
  Restricted Cubic Splines
  Nonparametric Regression
  Advantages of Splines over Other Methods
Tests of Association
Assessment of Model Fit
  Regression Assumptions
  Modeling and Testing Interactions
Missing Data
  Types of Missingness
  Understanding Patterns of Missing Values
  Problems with Simple Alternatives to Imputation
  Strategies for Developing Imputation Algorithms
  Single Conditional Mean Imputation
  Multiple Imputation
  S-Plus Software for Fitting Models and Adjusting Variances for Multiple Imputation
Multivariable Modeling Strategy
  Pre-Specification of Predictor Complexity
  Variable Selection
  Overfitting and Limits on Number of Predictors
  Shrinkage
  Data Reduction
Resampling, Validating, Describing, and Simplifying the Model
  The Bootstrap
  Model Validation
  Graphically Describing the Fitted Model
  Simplifying the Model by Approximating It
S-Plus Design library
Case Study using Least Squares Multiple Regression: Voting Patterns in U.S. Counties
Binary Logistic Regression
  The Model
  Assessment of Model Fit
  Quantifying Predictive Ability
  Validating & Describing the Fitted Model
  S-Plus Functions
Interactive Case Study: Binary Logistic Model for Survival of Titanic Passengers
  Missing Data
  Nonparametric Regression
  Development of Logistic Model
  Multiple Imputation to Handle Missing Passenger Ages
Interactive Case Study: Development of a Long-Term Survival Model for Critically Ill Patients
Cox Proportional Hazards Model
  The Model
  Checking Goodness of Fit
  Quantifying Predictive Ability
  Validation
  S-Plus Functions
Case Study in Cox Regression
  Choosing the Number of Parameters
  Checking Proportional Hazards
  Testing Interactions
  Describing Predictor Effects
  Validating the Model
  Presenting the Model
 Instructor:
Dr. Harrell is Professor of Biostatistics and Statistics and Chief
of the Division of Biostatistics and Epidemiology, Department of
Health Evaluation Sciences, University of Virginia School of Medicine,
Charlottesville. He received his Ph.D. in biostatistics
from the University of North Carolina, Chapel Hill in 1979, where he
studied under P.K. Sen. Dr. Harrell has been involved in statistical
computing since 1969 and is the author of many S-Plus functions and SAS
procedures. Since 1973 he has been involved in medical applications
of statistics, especially in the area of survival analysis and
clinical prediction modeling. He is an editorial consultant for the
Journal of Clinical Epidemiology, is on the editorial board of
Statistics in Medicine, is co-managing editor of the journal
Health Services and Outcomes Research Methodology, and is
a consultant to the FDA and the pharmaceutical industry.
 Handouts: Participants will receive copies of the
206 slides that will be presented
and a copy of the book
on which the course is based, Regression
Modeling Strategies [1], written by the instructor. See
http://hesweb1.med.virginia.edu/biostat/rms for information
about this text.
Background
Regression models are frequently used to develop diagnostic,
prognostic, and health resource utilization models in clinical, health
services, outcomes, pharmacoeconomic, and epidemiologic research, and
in a multitude of non-health-related areas. Regression models
are also used to adjust for patient heterogeneity in randomized
clinical trials, to obtain tests that are more powerful and valid than
unadjusted treatment comparisons.
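As a rough illustration of the efficiency gain described above, the following Python sketch simulates a 1:1 randomized trial with one strongly prognostic baseline covariable and compares the standard error of the estimated treatment effect with and without adjustment. It is not taken from the course materials or from the S-Plus Design library. The sample size, effect sizes, the covariable name `age`, and the helper `ols_se` are all assumptions chosen for illustration, and the example uses an ordinary linear model only.

```python
import numpy as np

def ols_se(X, y):
    """Return OLS coefficients and their standard errors (illustrative helper)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)        # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)   # covariance matrix of the coefficients
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(2024)
n = 200
treat = rng.permutation(np.repeat([0.0, 1.0], n // 2))  # balanced 1:1 randomization
age = rng.normal(size=n)                                 # hypothetical standardized covariable
y = 1.0 * treat + 2.0 * age + rng.normal(size=n)         # assumed data-generating model

ones = np.ones(n)
b_unadj, se_unadj = ols_se(np.column_stack([ones, treat]), y)
b_adj, se_adj = ols_se(np.column_stack([ones, treat, age]), y)

print(f"unadjusted: effect={b_unadj[1]:.2f}, SE={se_unadj[1]:.3f}")
print(f"adjusted:   effect={b_adj[1]:.2f}, SE={se_adj[1]:.3f}")
```

Under these assumptions the adjusted analysis removes the outcome variation explained by the covariable, so the treatment effect is estimated with a smaller standard error. For nonlinear models such as logistic regression, see reference [4].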
Models must be flexible enough to fit nonlinear and nonadditive
relationships, but unless the sample size is enormous, the approach to
modeling must avoid common problems with data mining or data dredging
that result in overfitting and a failure of the predictive model to
validate on new subjects.
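The overfitting problem described above can be made concrete with a small simulation. The Python sketch below fits least squares to a deliberately small dataset with many noise predictors and uses the bootstrap to estimate how optimistic the apparent R-squared is. The dataset, the number of resamples, and the function names (`fit_ols`, `r2`, `optimism_corrected_r2`) are assumptions for illustration. The course labs use the S-Plus Design library rather than Python, so this is only a sketch of the resampling idea, not the course software.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares coefficients for design matrix X (including intercept column)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def r2(X, y, beta):
    """R-squared of fixed coefficients beta evaluated on data (X, y)."""
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def optimism_corrected_r2(X, y, n_boot=200, seed=0):
    """Return (apparent R2, bootstrap optimism-corrected R2)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    apparent = r2(X, y, fit_ols(X, y))
    optimism = 0.0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample subjects with replacement
        beta_b = fit_ols(X[idx], y[idx])          # refit the model on the bootstrap sample
        # optimism = performance on the bootstrap sample minus performance on the original data
        optimism += r2(X[idx], y[idx], beta_b) - r2(X, y, beta_b)
    return apparent, apparent - optimism / n_boot

# Simulated example: 40 subjects, 15 candidate predictors, only the first one matters
rng = np.random.default_rng(1)
n, p = 40, 15
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = 0.5 * X[:, 1] + rng.normal(size=n)

apparent, corrected = optimism_corrected_r2(X, y)
print(f"apparent R2 = {apparent:.2f}, optimism-corrected R2 = {corrected:.2f}")
```

With few subjects per estimated parameter, the corrected value falls well below the apparent one, which is the kind of gap that signals a model unlikely to validate on new subjects.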
All standard regression models have assumptions that must be verified
for the model to have power to test hypotheses and for it to be able
to predict accurately. Of the principal assumptions (linearity,
additivity, distributional), this short course will emphasize methods for
assessing and satisfying the first two. Practical but powerful tools
are presented for validating model assumptions and presenting model
results. This course provides methods for estimating the shape of the
relationship between predictors and response using the widely
applicable method of augmenting the design matrix using restricted
cubic splines.
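To show what augmenting the design matrix with restricted cubic splines can look like in practice, here is a minimal Python sketch using the truncated power basis, constrained to be linear beyond the outer knots. The function name `rcs_basis`, the knot placement, and the scaling of the nonlinear terms are assumptions made for this example. The S-Plus Design library has its own spline functions, whose exact conventions are not shown on this page.

```python
import numpy as np

def rcs_basis(x, knots):
    """Columns for a restricted cubic spline in x: x itself plus k-2 nonlinear terms."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def pos3(u):
        # truncated cubic: (u)_+^3
        return np.maximum(u, 0.0) ** 3

    d = t[k - 1] - t[k - 2]            # distance between the last two knots
    scale = (t[k - 1] - t[0]) ** 2     # assumed scaling to keep columns on the scale of x
    cols = [x]
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[k - 2]) * (t[k - 1] - t[j]) / d
                + pos3(x - t[k - 1]) * (t[k - 2] - t[j]) / d)
        cols.append(term / scale)
    return np.column_stack(cols)

# Usage: 5 knots at fixed quantiles of x (an assumed placement)
x = np.linspace(0, 10, 101)
knots = np.quantile(x, [0.05, 0.275, 0.5, 0.725, 0.95])
B = rcs_basis(x, knots)
X = np.column_stack([np.ones_like(x), B])   # intercept, x, and 3 nonlinear columns
print(X.shape)
```

The extra columns are added to the design matrix like any other predictors, so any regression model that accepts a design matrix can estimate the shape of the relationship. With k knots the predictor costs k-1 degrees of freedom.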
References
 [1]

F. E. Harrell.
Regression Modeling Strategies, with Applications to Linear
Models, Survival Analysis and Logistic Regression.
Springer, New York, 2001.
 [2]

F. E. Harrell, K. L. Lee, and D. B. Mark.
Multivariable prognostic models: Issues in developing models,
evaluating assumptions and adequacy, and measuring and reducing errors.
Statistics in Medicine, 15:361-387, 1996.
 [3]

F. E. Harrell, P. A. Margolis, S. Gove, K. E. Mason, E. K. Mulholland,
D. Lehmann, L. Muhe, S. Gatchalian, and H. F. Eichenwald.
Development of a clinical prediction model for an ordinal outcome:
The World Health Organization ARI Multicentre Study of clinical signs and
etiologic agents of pneumonia, sepsis, and meningitis in young infants.
Statistics in Medicine, 17:909-944, 1998.
 [4]

W. W. Hauck, S. Anderson, and S. M. Marcus.
Should we adjust for covariates in nonlinear regression analyses of
randomized trials?
Controlled Clinical Trials, 19:249-256, 1998.
 [5]

A. Spanos, F. E. Harrell, and D. T. Durack.
Differential diagnosis of acute meningitis: An analysis of the
predictive value of initial observations.
Journal of the American Medical Association, 262:2700-2707,
1989.
Hardcopy Handouts and Books Supplied to Participants:

http://hesweb1.med.virginia.edu/biostat/rms/rms.pdf
 Printout of this web page
 Printout of web page
http://hesweb1.med.virginia.edu/biostat/s/data

http://hesweb1.med.virginia.edu/biostat/rms/ShortCourse.hw.pdf

http://hesweb1.med.virginia.edu/biostat/teaching/biostat.mod/formulas.pdf
 Copy of solutions to the above lab assignments
 Self-quizzes and quiz solutions
 Copy of Regression Modeling Strategies