# BIOS 330: Regression Modeling Strategies

Frank E. Harrell, Jr.
f.harrell@vanderbilt.edu
Professor of Biostatistics
Department of Biostatistics
Vanderbilt University School of Medicine
Teaching Assistants: Lisa Lin and Hannah Weeks (contact them through Slack)
Assistant: Tawanna Peters (615)322-2001
Office Hours: After class, or by appointment, room 11122, 2525 West End
7 January - 16 April 2020, Final Work Due 2020-05-01
Grades are due by 11:59pm on Saturday, 2020-05-04. Last official exam day 2020-05-02
Tuesday, Thursday 3:30-5:00
8102, 8th Floor, 2525 West End

This course covers many aspects of multivariable regression modeling as it is commonly used in prognostic, diagnostic, and epidemiologic modeling, clinical trials, and prediction in general.

## Syllabus | Handouts | Study Questions | R scripts in RMS 2nd edition | Papers to Read | Concepts to master | Supplemental Material on Biostatistical Modeling including interactive R demonstrations | Document updates | Biostatistics for Biomedical Research Notes | Blog

Note: If you use Google Chrome or Chromium to view the handouts, the first time you click on a sound file the browser will download the playlist file (.m3u) for that .mp3 sound file. Click on the down arrow next to the name of the downloaded file on the bottom left of the browser window, and select "Always open files of this type".

## Text

The instructor's book, Regression Modeling Strategies (2nd edition, 2015), is available from Amazon and other booksellers in addition to the Vanderbilt bookstore.

## Motivation

Accurate estimation of patient prognosis or of the probability of a disease or other outcomes is important for many reasons.
1. Prognostic estimates can be used to inform the patient about likely outcomes of her disease.
2. A physician can use estimates of diagnosis or prognosis as a guide for ordering additional tests and selecting appropriate therapies.
3. Outcome assessments are useful in the evaluation of technologies; for example, diagnostic estimates derived both with and without using the results of a given test can be compared to measure the incremental diagnostic information provided by that test over what is provided by prior information.
4. A researcher may want to estimate the effect of a single factor (e.g., treatment given) on outcomes in an observational study in which many uncontrolled confounding factors are also measured. Here the simultaneous effects of the uncontrolled variables must be controlled (held constant mathematically if using a regression model) so that the effect of the factor of interest can be more purely estimated. An analysis of how variables (especially continuous ones) affect the patient outcomes of interest is necessary to ascertain how to control their effects.
5. Predictive modeling is useful in designing randomized clinical trials. Both the decision concerning which patients to randomize and the design of the randomization process (e.g., stratified randomization using prognostic factors) are aided by the availability of accurate prognostic estimates before randomization. Lastly, accurate prognostic models can be used to test for differential therapeutic benefit or to estimate the clinical benefit for an individual patient in a clinical trial, taking into account the fact that low-risk patients must have less absolute benefit (e.g., lower change in survival probability).

To accomplish these objectives, researchers must create multivariable models that accurately reflect the patterns existing in the underlying data and that are valid when applied to comparable data in other settings or institutions. Models may be inaccurate due to violation of assumptions, omission of important predictors, high frequency of missing data and/or improper imputation methods, and, especially with small datasets, overfitting.
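The remark in item 5 that low-risk patients must have less absolute benefit can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that treatment multiplies every patient's odds of the outcome by the same odds ratio; the value 0.7 and the baseline risks are hypothetical and do not come from the course materials. Python is used here only as a neutral illustration; the course itself uses R.

```python
def risk_under_treatment(baseline_risk, odds_ratio):
    """Risk after multiplying the baseline odds by a constant odds ratio."""
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    treated_odds = baseline_odds * odds_ratio
    return treated_odds / (1.0 + treated_odds)


def absolute_risk_reduction(baseline_risk, odds_ratio):
    """Difference between the untreated and the treated risk."""
    return baseline_risk - risk_under_treatment(baseline_risk, odds_ratio)


# Hypothetical relative treatment effect, assumed equal for all patients.
HYPOTHETICAL_ODDS_RATIO = 0.7

if __name__ == "__main__":
    # Hypothetical baseline risks for a low-, medium-, and high-risk patient.
    for risk in (0.05, 0.20, 0.50):
        arr = absolute_risk_reduction(risk, HYPOTHETICAL_ODDS_RATIO)
        print(f"baseline risk {risk:.2f}: absolute risk reduction {arr:.3f}")
```

Under this assumed constant relative effect, the patient with a 5% baseline risk gains less in absolute terms than the patients at 20% or 50%, which is why an accurate estimate of baseline risk is needed to estimate individual benefit.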

## Description

Many types of regression models are increasingly being used in developing clinical prediction models for diagnosis, prognosis, and other applications in epidemiology, health services research, health economics, clinical trials, business, finance, and prediction in general. Popular models include logistic models for binary and ordinal responses, survival models, quantile regression, and models for longitudinal data analysis, many of which are covered in this course. All regression models have assumptions that must be verified for them to have power to test hypotheses and to be able to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this course will emphasize methods for assessing and satisfying the first two as these methods apply to all regression models. To deal with the linearity assumption, this course provides methods for estimating the shape of the relationship between predictors and response using the widely applicable method of piecewise polynomials. Emphasis will be given to interpreting fitted models using effect plots (e.g., odds ratio charts) and nomograms. Even when assumptions are satisfied, overfitting can ruin a model's predictive ability for future observations. Methods for data reduction will be introduced to deal with the common case where the number of potential predictors is large in comparison with the number of observations. Methods of model validation (bootstrap and cross-validation) will be introduced, as will auxiliary topics such as modeling interaction surfaces, dealing with missing data, variable selection, collinearity, and shrinkage. All methods covered will apply to almost any regression model. The course will include detailed case studies in developing, validating, and interpreting clinical prediction and epidemiologic models.

The course pays much attention to dealing with missing data using multiple imputation, the use of bootstrapping, enhancements to ordinary maximum likelihood estimation and inference, and testing general or complex hypotheses using general contrasts and likelihood ratio tests. Quantifying the predictive discrimination and calibration accuracy of models is also a key area of emphasis.

## Prerequisites

Students must have mastered ordinary linear regression and have had an introduction to maximum likelihood estimation. Mastery of regular algebra is assumed, and students must have been introduced to linear algebra. Good working knowledge of R is required.

## Learning Objectives

To become familiar with modern methods for fitting multivariable regression models:
1. accurately
2. in a way the sample size will allow, without overfitting
3. uncovering complex non-linear or non-additive relationships
4. testing for and quantifying the association between one or more predictors and the response, with possible adjustment for other factors

Students will be introduced to the bootstrap and will learn how to deal with missing data, how to validate models for predictive accuracy, and how to detect overfitting. They will be able to interpret fitted models using both parameter estimates and graphics, and to critique the literature to determine when models are likely to be unreliable.

• Steyerberg EW. Clinical Prediction Models. New York: Springer; 2009.

## Class Announcements

• Announcements will be posted on the Slack `rms330` channel

## Class Discussion Group

#### General Concepts

This is the world's best statistics discussion/Q&A site and is the best place to ask questions that are not particular to the course (e.g., questions not about specific assignments). Tag questions related to course topics as `regression-strategies`.

#### Help with Assignments

• vbiostatcourse.slack.com
• Channel `bios330` for logistics, private and group messaging, questions about group assignments, stat computing issues
• Channel `rms` for questions and answers and short to medium-length discussions

Before posting a message, check the existing topics to avoid creating unnecessary new ones, which make the discussion board harder for others to navigate.

## Software

The course uses R and the `rms` and `Hmisc` packages, plus several other R packages to be listed here as the class progresses. Students are expected to turn in their assignments in `html` format created using Markdown with `knitr`. See KnitrHowto for some useful setup as well as here and here. R and `knitr` are most easily run from RStudio. This template is highly recommended.

## Class Format

The majority of the course is "flipped" so that class time can be devoted to clarifying concepts, methods, and strategies, and problem solving. Students must read assigned sections in the primary textbook or accompanying course handouts, and listen to audio narration and watch videos linked from the handouts when helpful, in advance of the class for which those topics are to be discussed. Study questions are provided before class and students should attempt to answer them by themselves or in small groups. These questions form the basis for in-class discussions.

## Help Sessions

The instructor has an open office hour after each Thursday's class in his office. Other meetings can be scheduled as needed.

Assignments are due by 5pm on the date listed. Projects must be done independently unless marked as group assignments. Work turned in must be as concise as clarity will allow. Students should pay attention to interpreting results, not just obtaining them. `knitr` must be used (see above). Assignments must list those who actively participated. `html` files that include code should be emailed to the teaching assistant or sent via Slack personal message.

For the final project you will do an in-depth analysis of a dataset that interests you. It must contain many predictors of various types (at least one being continuous unless you receive special instructor permission) and have a binary, continuous, ordinal, or possibly right-censored response variable. The dataset may not be one used in the course or any of the texts. It should have a sufficient number of observations, and the meaning of the data should be such that development of a predictive model makes sense. The analyses you perform on the dataset should use several of the methods we learned in the course. When grading the project, extra weight is given to the selection of appropriate methods. The analysis must include at least one simulation studying the properties of one of the procedures used in developing the model.

### Homework Assignments

Cumulative assignments are here. After the due date, solution sets will be distributed for approximately 2/3 of the assignments (including assignment #s 1 and 3). For other assignments, individuals or groups submitting the best solution in LaTeX/knitr/Markdown will receive extra credit and will have their solution (with attribution) added to the solution set for future students.
1. Assignments 2-3 and 8 are group assignments. Constitution of groups is shown at the top of the assignment. Group members are randomized separately for each group assignment.

Assignment 0 is a reproducible R report that you should run during the first week of class to make sure that you have all software properly installed.

Turn in your solutions by sending a direct message on Slack to the instructor and TAs and attaching the html or pdf file.

Cumulative solutions to selected problems are here.

### Weights Used for Final Grade

• Individual projects (n=5): 3
• Group projects (n=4): 1
• Final project (n=1): 8
• Quizzes (n=6): 1/3
• Class participation: 2.5

All components are graded on a 0-1 scale before weighting and each weight applies per item, so with the counts listed above the individual and group projects together get an effective weight of 5 × 3 + 4 × 1 = 19 vs. 8 for the final project.
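As a rough sketch of how these weights could combine into a single number, the code below takes the counts and weights in the list above at face value and makes two assumptions that the syllabus does not state: each weight applies to every item in its component, and the final grade is the weighted sum divided by the total weight. The component names and the example scores are hypothetical, and Python is used only for illustration (the course itself uses R).

```python
from fractions import Fraction

# (number of items, weight per item), copied from the list above.
WEIGHTS = {
    "individual": (5, Fraction(3)),
    "group": (4, Fraction(1)),
    "final": (1, Fraction(8)),
    "quiz": (6, Fraction(1, 3)),
    "participation": (1, Fraction(5, 2)),
}


def effective_weights(weights=WEIGHTS):
    """Total weight carried by each component (count times per-item weight)."""
    return {name: n * w for name, (n, w) in weights.items()}


def final_grade(scores, weights=WEIGHTS):
    """Weighted average on a 0-1 scale.

    `scores` maps each component name to a list of 0-1 scores, one per item.
    Dividing by the total weight is an assumption, not a stated course rule.
    """
    total = sum(n * w for n, w in weights.values())
    earned = Fraction(0)
    for name, (n, w) in weights.items():
        items = scores[name]
        if len(items) != n:
            raise ValueError(f"{name}: expected {n} scores, got {len(items)}")
        earned += w * sum(Fraction(str(s)) for s in items)
    return float(earned / total)


if __name__ == "__main__":
    # Hypothetical student, for illustration only.
    example = {
        "individual": [0.9, 0.8, 1.0, 0.7, 0.85],
        "group": [0.9, 0.95, 0.8, 1.0],
        "final": [0.88],
        "quiz": [1, 0.5, 1, 1, 0.75, 1],
        "participation": [0.9],
    }
    for name, weight in effective_weights().items():
        print(f"{name}: effective weight {float(weight)}")
    print("final grade:", round(final_grade(example), 3))
```

Exact fractions are used so that the quiz weight of 1/3 does not introduce rounding error into the totals.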

1. By 2020-01-15: relaxLinear: smi79spl, gia14opt, col16qua
2. By 2020-01-18: multivar: gra91eff
3. By 2020-01-23: missingData: pen15mul, don06rev, hei06imp (skim), hip07reg (skim), jan10mis (skim), muchado
4. By 2020-01-25: multivar: giu11spe, gre00whe, smi92pro, ril18min, ril18mina
5. By 2020-01-30: datasetsCaseStudies: nic99reg, spa89dif
6. By 2020-02-02: multivar: accuracy (all 4 papers), validation (all papers)
7. modelUncertainty: bor15vie

| Document | Last Revision | What |
|---|---|---|
| Syllabus | 2020-02-04 | |
| Handouts | 2019-11-29 | |
| R Scripts in Book | | |
| Assignments | 2020-02-17 | Assignment 5 |
| Solutions | 2020-02-17 | Assignment 4 |
| Solution knitr source | | |
| Study Questions | | |
| Final due date | | |