BIOS 330: Regression Modeling Strategies

Frank E. Harrell, Jr.
Professor of Biostatistics
Department of Biostatistics
Vanderbilt University School of Medicine
Teaching Assistant: Ryan Jarrett
Assistant: Tawanna Peters (615)322-2001
Office Hours: 10:45-11:45 Thursdays and by appointment, room 11122, 2525 West End
9 January - 19 April 2018, Final Work Due 2018-05-02
Grades are due by 11:59pm on Saturday, 2018-05-05. Last official exam day 2018-05-03
Tuesday, Thursday 9:15-10:45
8102, 8th Floor, 2525 West End

This course covers many aspects of multivariable regression modeling as it is commonly used in prognostic, diagnostic, and epidemiologic modeling, clinical trials, and prediction in general.

Course schedule

Syllabus | Handouts | R scripts in RMS 2nd edition | Papers to Read | Concepts to master |
Supplemental Material on Biostatistical Modeling including interactive R demonstrations | Document updates | Biostatistics for Biomedical Research Notes | Blog

Note: If you use Google Chrome or Chromium to view the handouts, the first time you click on a sound file the browser will download the playlist file (.m3u) for that .mp3 sound file. Click on the down arrow next to the name of the downloaded file on the bottom left of the browser window, and select "Always open files of this type".


The instructor's book Regression Modeling Strategies, 2nd edition, 2015 is available from Amazon and other book sellers in addition to the Vanderbilt bookstore.


Accurate estimation of patient prognosis or of the probability of a disease or other outcomes is important for many reasons.
  1. Prognostic estimates can be used to inform the patient about likely outcomes of her disease.
  2. A physician can use estimates of diagnosis or prognosis as a guide for ordering additional tests and selecting appropriate therapies.
  3. Outcome assessments are useful in the evaluation of technologies; for example, diagnostic estimates derived both with and without using the results of a given test can be compared to measure the incremental diagnostic information provided by that test over what is provided by prior information.
  4. A researcher may want to estimate the effect of a single factor (e.g., treatment given) on outcomes in an observational study in which many uncontrolled confounding factors are also measured. Here the simultaneous effects of the uncontrolled variables must be controlled (held constant mathematically if using a regression model) so that the effect of the factor of interest can be more purely estimated. An analysis of how variables (especially continuous ones) affect the patient outcomes of interest is necessary to ascertain how to control their effects.
  5. Predictive modeling is useful in designing randomized clinical trials. Both the decision concerning which patients to randomize and the design of the randomization process (e.g., stratified randomization using prognostic factors) are aided by the availability of accurate prognostic estimates before randomization. Lastly, accurate prognostic models can be used to test for differential therapeutic benefit or to estimate the clinical benefit for an individual patient in a clinical trial, taking into account the fact that low-risk patients must have less absolute benefit (e.g., a smaller change in survival probability).

To accomplish these objectives, researchers must create multivariable models that accurately reflect the patterns existing in the underlying data and that are valid when applied to comparable data in other settings or institutions. Models may be inaccurate due to violation of assumptions, omission of important predictors, high frequency of missing data and/or improper imputation methods, and, especially with small datasets, overfitting.
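
The remark in item 5 that low-risk patients must have less absolute benefit can be illustrated with a little arithmetic. The sketch below is not from the course materials: it assumes, purely for illustration, that treatment multiplies every patient's odds of a bad outcome by the same odds ratio (0.7 here, a hypothetical value), and it computes the resulting absolute risk reduction at several baseline risks. The function names are made up for this example.

```python
# Illustrative sketch: absolute benefit under an assumed constant odds ratio.
# The odds ratio of 0.7 and the baseline risks are hypothetical values.

def treated_risk(baseline_risk, odds_ratio):
    """Risk under treatment when treatment multiplies the odds by odds_ratio."""
    odds = baseline_risk / (1.0 - baseline_risk)
    treated_odds = odds * odds_ratio
    return treated_odds / (1.0 + treated_odds)


def absolute_risk_reduction(baseline_risk, odds_ratio):
    """Difference between untreated and treated risk for one patient."""
    return baseline_risk - treated_risk(baseline_risk, odds_ratio)


for risk in (0.02, 0.10, 0.30, 0.50):
    arr = absolute_risk_reduction(risk, odds_ratio=0.7)
    print(f"baseline risk {risk:.2f}: absolute risk reduction {arr:.4f}")
```

Under this assumption the relative effect is the same for everyone, yet the absolute risk reduction for a patient with a baseline risk of 0.02 is far smaller than for a patient with a baseline risk of 0.50.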


Many types of regression models are increasingly being used in developing clinical prediction models for diagnosis, prognosis, and other applications in epidemiology, health services research, health economics, clinical trials, business, finance, and prediction in general. Popular models include logistic models for binary and ordinal responses, survival models, quantile regression, and models for longitudinal data analysis, many of which are covered in this course. All regression models have assumptions that must be verified for them to have power to test hypotheses and to be able to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this course will emphasize methods for assessing and satisfying the first two as these methods apply to all regression models. To deal with the linearity assumption, this course provides methods for estimating the shape of the relationship between predictors and response using the widely applicable method of piecewise polynomials. Emphasis will be given to interpreting fitted models using effect plots (e.g., odds ratio charts) and nomograms. Even when assumptions are satisfied, overfitting can ruin a model's predictive ability for future observations. Methods for data reduction will be introduced to deal with the common case where the number of potential predictors is large in comparison with the number of observations. Methods of model validation (bootstrap and cross-validation) will be introduced, as will auxiliary topics such as modeling interaction surfaces, dealing with missing data, variable selection, collinearity, and shrinkage. All methods covered will apply to almost any regression model. The course will include detailed case studies in developing, validating, and interpreting clinical prediction and epidemiologic models.
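
As a rough illustration of the bootstrap model validation mentioned above, the following sketch estimates how much an apparent R² is inflated by overfitting and subtracts that optimism. It is a minimal stand-in written with the Python standard library, not the course's R software: the model is a simple straight-line fit, the data are simulated, and all function names are hypothetical.

```python
import random

# Illustrative sketch of bootstrap optimism correction for R^2.
# Assumptions: simple least-squares line, simulated data, hypothetical names.


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(model, xs, ys):
    """R^2 of a fitted (intercept, slope) model evaluated on the data given."""
    intercept, slope = model
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - sse / sst


def optimism_corrected_r2(xs, ys, n_boot=200, seed=1):
    """Return (apparent R^2, apparent R^2 minus average bootstrap optimism)."""
    rng = random.Random(seed)
    n = len(xs)
    apparent = r_squared(fit_line(xs, ys), xs, ys)
    optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        model = fit_line(bx, by)
        # Optimism: performance on the bootstrap sample minus performance
        # of the same fitted model on the original sample.
        optimism += r_squared(model, bx, by) - r_squared(model, xs, ys)
    return apparent, apparent - optimism / n_boot


# Simulated data: a true linear signal plus Gaussian noise.
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(30)]
ys = [2.0 + 0.5 * x + data_rng.gauss(0, 2) for x in xs]

apparent, corrected = optimism_corrected_r2(xs, ys)
print(f"apparent R^2 = {apparent:.3f}, optimism-corrected R^2 = {corrected:.3f}")
```

The same resampling logic applies to any performance index. For a model with many candidate predictors, every modeling step (including any variable selection) would need to be repeated inside the loop.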

In the course much attention is paid to dealing with missing data using multiple imputation, the use of bootstrapping, enhancements to ordinary maximum likelihood estimation and inference, and testing general or complex hypotheses using general contrasts and likelihood ratio tests. Quantifying the predictive discrimination and calibration accuracy of models is also a key area of emphasis.
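
To make the terms discrimination and calibration concrete, here is a small sketch for a binary outcome. It computes the c-index (concordance probability, equal to the area under the ROC curve), the Brier score, and a crude calibration-in-the-large summary (observed event rate minus mean predicted probability). The predicted probabilities and outcomes are made-up numbers, and the function names are hypothetical. The course itself treats these indexes in far more depth.

```python
# Illustrative sketch: simple accuracy indexes for predicted probabilities.
# The data below are invented for the example.


def c_index(probs, outcomes):
    """Proportion of (event, non-event) pairs in which the event has the
    higher predicted probability; ties count one half."""
    concordant = 0.0
    pairs = 0
    for p_event, y_event in zip(probs, outcomes):
        for p_other, y_other in zip(probs, outcomes):
            if y_event == 1 and y_other == 0:
                pairs += 1
                if p_event > p_other:
                    concordant += 1.0
                elif p_event == p_other:
                    concordant += 0.5
    return concordant / pairs


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def calibration_in_the_large(probs, outcomes):
    """Observed event rate minus mean predicted probability."""
    return sum(outcomes) / len(outcomes) - sum(probs) / len(probs)


probs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
outcomes = [0, 0, 1, 0, 1, 1]

print("c-index:", c_index(probs, outcomes))
print("Brier score:", brier_score(probs, outcomes))
print("calibration-in-the-large:", calibration_in_the_large(probs, outcomes))
```

In this toy example 8 of the 9 event/non-event pairs are concordant, so the c-index is 8/9. The c-index measures only rank ordering, so a model can discriminate well while being poorly calibrated, which is why both aspects are assessed.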


Students must have mastered ordinary linear regression and have had an introduction to maximum likelihood estimation. Mastery of regular algebra is assumed, and students must have been introduced to linear algebra. Good working knowledge of R is required.

Learning Objectives

To become familiar with modern methods for fitting multivariable regression models
  1. accurately
  2. in a way the sample size will allow, without overfitting
  3. uncovering complex non-linear or non-additive relationships
  4. testing for and quantifying the association between one or more predictors and the response, with possible adjustment for other factors
Students will be introduced to the bootstrap and will learn how to deal with missing data, how to validate models for predictive accuracy, and how to detect overfitting. They will be able to interpret fitted models using both parameter estimates and graphics, and to critique the literature to determine when models are likely to be unreliable.

Reading Assignments

Recommended Supplemental Reading

  • Steyerberg EW. Clinical Prediction Models. New York: Springer; 2009.


Class Announcements

  • Will appear on this wiki

Class Discussion Group


This is the world's best statistics discussion/Q&A site and is the best place to ask questions that are not particular to the course (e.g., questions not about specific assignments). Tag questions related to course topics as regression-strategies.

Help with Assignments

regmod is a good way to post class-specific questions and answers because you can return to the discussion group weeks later and still benefit from seeing answers regarding a specific topic. The discussion group is an excellent way to keep in touch with the class and even more to ask and answer questions. I hope that all students will use it to
  • ask or answer any question whatsoever related to group assignments
  • ask or answer any logistical or purely technical questions related to individual work assignments
  • ask or answer any questions about modeling or statistical computing concepts that are not directly related to a pending individual work assignment
Be sure to check existing topics before posting your message, to avoid creating unnecessary new topics that make it more difficult for others to navigate the discussion board.

We will use channels rms and bios330 instead of the Google group. rms is for statistical questions/answers/discussion and bios330 is for course logistics.


The course uses R and the rms and Hmisc packages, plus several other R packages to be listed here as the class progresses. Students are expected to turn in their assignments in html format created using Markdown with knitr. See KnitrHowto for some useful setup as well as here and here. R and knitr are most easily run by RStudio. This template is highly recommended.

Class Format

The majority of the course is "flipped" so that class time can be devoted to clarifying concepts, methods, and strategies, and problem solving. Students must read assigned sections in the primary textbook and accompanying course handouts, and listen to audio narration and watch videos linked from the handouts, in advance of the class for which those topics are to be discussed.

Help Sessions

The instructor has an open office hour after each Thursday's class in his office. Other meetings can be scheduled as needed.

Assignments and Grading

Assignments are due by 5pm on the date listed. Projects must be done independently unless marked as group assignments. Work turned in must be as concise as clarity will allow. Students should pay attention to interpreting results, not just obtaining them. knitr must be used (see above). Assignments must list those who actively participated. html files which include code should be emailed to the teaching assistant or sent via Slack personal message.

For the final project you will do an in-depth analysis of a dataset you are interested in that contains many predictors of various types (at least one being continuous unless you receive special instructor permission) and has a binary, continuous, ordinal, or possibly a right-censored response variable. The dataset may not be one used in the course or any of the texts. The dataset should have a sufficient number of observations and the meaning of the data should be such that development of a predictive model makes sense. The analyses you perform on the dataset should use several of the methods we learned in the course. Extra weight is given to selection of appropriate methods when grading the project. The analysis must include at least one simulation studying the properties of one of the procedures used in developing the model.

Homework Assignments

Cumulative assignments are here. After the due date, solution sets will be distributed for approximately 2/3 of the assignments (including assignment #s 1 and 3). For other assignments, individuals or groups submitting the best solution in LaTeX/knitr/Markdown will receive extra credit and will have their solution (with attribution) added to the solution set for future students.
  1. Assignments 2-3 and 8 are group assignments. Constitution of groups is shown at the top of the assignment. Group members are randomized separately for each group assignment.

Assignment 0 is a reproducible R report that you should run during the first week of class to make sure that you have all software properly installed.

Cumulative solutions to selected problems are here, with knitr source files here.

Weights Used for Final Grade

  • Individual projects (n=5): 3
  • Group projects (n=4): 1
  • Final project (n=1): 8
  • Quizzes (n=6): 1/3
  • Class participation: 2.5
All components are graded on a 0-1 scale before weighting, so group and individual projects get an effective weight of 19 vs. final project of 8.
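
The arithmetic behind these effective weights can be checked with a short sketch. It takes the per-item weights and counts listed above and assumes, as a labeled guess rather than a stated course policy, that the final grade is the weighted sum of the 0-1 component scores divided by the total weight. The dictionary and function names are hypothetical.

```python
# Illustrative sketch of the grade weighting described above.
# Assumption (not stated in the syllabus): the final grade is the weighted
# sum of 0-1 scores divided by the total of all effective weights.

WEIGHTS = {  # weight per single item
    "individual": 3.0,
    "group": 1.0,
    "final": 8.0,
    "quiz": 1.0 / 3.0,
    "participation": 2.5,
}
COUNTS = {  # number of items of each type
    "individual": 5,
    "group": 4,
    "final": 1,
    "quiz": 6,
    "participation": 1,
}


def effective_weights():
    """Total weight carried by each component type (weight times count)."""
    return {name: WEIGHTS[name] * COUNTS[name] for name in WEIGHTS}


def final_grade(scores):
    """scores maps each component type to a list of 0-1 scores, one per item."""
    total_weight = sum(effective_weights().values())
    earned = sum(WEIGHTS[name] * sum(scores[name]) for name in WEIGHTS)
    return earned / total_weight


eff = effective_weights()
print("individual + group projects:", eff["individual"] + eff["group"])
print("final project:", eff["final"])
print("total weight:", sum(eff.values()))
```

Individual projects contribute 5 × 3 = 15 and group projects 4 × 1 = 4, which gives the 19 quoted above. Quizzes contribute 6 × 1/3 = 2 and class participation 2.5, for a total weight of 31.5.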

Reading Assignments

Papers are here. See also this excellent resource on splines.
  1. By 2018-01-17: relaxLinear: smi79spl, gia14opt, col16qua
  2. By 2018-01-21: multivar: giu11spe, gra91eff
  3. By 2018-02-03: datasetsCaseStudies: nic99reg, spa89dif
  4. By 2014-02-04: multivar: gre00whe, smi92pro; accuracy (all 4 papers), validation (all papers), logistCal
  5. modelUncertainty: bor15vie
  6. Added for MLE: mle/jen86jud

Document Updates

Document | Last Revision | What
Syllabus | 2018-01-11 |
R Scripts in Book | |
Assignments | 2018-04-05 | Assignment 9
Solutions | 2018-04-30 | Assignment 9
Solution knitr source | |
Final due date | |

Bibliographic Databases

Useful Material From Courses at Other Universities

Other Links

  • TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration

Topic attachments

Attachment | Size | Date | Who | Comment
hw4.Rnw | 7.2 K | 08 Mar 2015 - 09:41 | FrankHarrell | knitr source file for Assignment 4
sat.r | 0.5 K | 12 Jan 2013 - 19:19 | FrankHarrell | R code to create SAT dataset in RMS Chapter 2
Topic revision: r187 - 11 Dec 2018, FrankHarrell
