MSCI Biostatistics II Course Schedule -- February 2019

Key:
A = Whitlock and Schluter (text)
B = Biostatistics for Biomedical Research (handout)
H = Harrell (text) and the corresponding chapter in RMS Handouts

Date Room Material to read/watch
prior to class
Session Topics Assignments Examples
Monday 3 Light 411CD Course Webpage
Syllabus
Key Concepts
A17.1-17.7, A17.10-1
B10-B10.5.8
H2.4.7
Simple Linear Regression   Getting Started
  4 Light 411CD A18, B10.6-B10.8 Multiple Linear Regression HW 1 Lion Age from Nose
  5 Light 411CD B 10.9-10.10, 9.4-9.9 " HW 2  
  6 Light 411CD B13-13.6.1 Analysis of Covariance in Randomized Studies HW 3 Mole Rats
  7 Light 411CD Hxiii, H1, H2-2.3.1, classification blog Regression Modeling Strategies: Introduction HW 4 Chunk tests
Cat vars in regression
Monday 10 Light 411CD H2.3.2-2.4.1-2.4.6 demos Methods for Multivariable Models HW 5 Nonlinear relationships
  11 Light 411CD H2.4.8-2.7.2 (omit 2.5.1) " HW 6  
  12 Light 411CD H3, H4-4.1, 4.3-4.4 Missing Data, Multivariable Modeling Strategies Check-in
HW 7
Plot RCS
Missing data
  13 Light 411CD H4.5-4.7.2, 4.7.5,table before 4.8 " HW 8 Data reductions (principle components)
Variable clustering
  14 Light 411CD H4.9-4.12 and this " HW 9 Outliers
Monday 17 Light 411CD A19, H5 (omit 5.5), B10.11 Bootstrap, Validating and Describing the Model Prep for critique Internal Validation
  18 Light 411CD   Student led discussion of papers    
  19 Light 411CD A17.9, H10-10.5 Binary Logistic Models HW 10 Simple Logistic
  20 Light 411CD H10.6-10.10, B6.10.3, H10.11 (Bayes summary), B13.6.2-13.7, blog, blog " Analysis plan for final project Logistic
Bayes
  21 Light 411CD H11, H12, NNT, HTE, HTE, Risk Difference Binary Logistic Model Case Studies HW 11  
Monday 24 Light 411CD B19,
More Analyses of Diagnostic Yield,
Simple Decision Analysis,
H13.1-13.3, H14.1 and Table 14.1, B4.1.2, B5.12.4-.5
Risk-Based Diagnostic Research
Proportional Odds Ordinal Logistic Models
HW 12 Ordinal predictors
Ordinal regression
  25 Light 412 B7.6, Power B7.8.3-.6, H15-15.2, 15.5.2-15-25, 15-30, 15-34,
17.1-17.3, 17.5.1,
18.1, 18.4-18.6
Ordinal Regression, Survival Analysis HW 13  
  26 Light 411CD   Survival Analysis   Cox Model
  27 Light 411CD B15 Analysis of Serial Measurements Final Project
  28 Light 411CD   Student Presentations of Final Project  


In Class, Hands-on Examples

TASK: Simple linear regression
The lion's nose, ABD example 17.1, page 541
Dataset: 17-e-1
N observations: 32 male lions
Response: Age (years)
Predictor: Proportion of black pigmentation on nose

  • Why would understanding the Age-Nose relationship be helpful?
  • Construct a scatter plot of the response and explanatory variable.
  • How would you describe the relationship in the plot?
  • Write down the linear regression model of Age and Proportion of black pigmentation on nose.
    • Interpret the constant coefficient.
    • Interpret the slope coefficient.
    • Interpret the variance coefficient.
    • How would the model change if the predictor were on the percent scale?
    • How would the model change if Age were on the decade scale?
    • What scale is the variance coefficient?
  • Fit a linear regression model
    • What are the parameter estimates?
    • What is the uncertainty of the model parameters?
  • Plot the linear regression model
  • Plot the linear regression model with CI bands.
  • Assess the model fit
    • Plot a scatter plot of the studentized residuals and the predictor variable.
      • What aspect of model fit does this plot communicate?
      • What would the plot look like if the model fits well?
      • What would the plot look like if the model fits poorly?
  • How might you use the model?
  • What is the difference between "estimating the mean" and "prediction of age for a new lion"?
    • Is there a difference between estimating a population parameter and predicting an individual outcome?
    • Which of the two (population vs individual outcome) are we "more certain"?
  • Plot the linear regression model with PI band

getvdata 17-e-1

//Scatter plot
twoway (scatter ageinyears proportionblack), ytitle(Age (years)) xtitle(Proportion black)

// Regression
regress ageinyears proportionblack

// Add plot data
plotdata, at(proportionblack = .1(.1).9)

// Generate yhat and confidence intervals
regress_yhat_ci

// Plot regression line and CI
twoway (rarea ageinyears_hat_lb ageinyears_hat_ub proportionblack if _plotindicator == 1, sort fcolor(bluishgray%60)) ///
 (line ageinyears_hat proportionblack if _plotindicator == 1, sort lcolor(bluishgray) lwidth(thick)) ///
 (scatter ageinyears proportionblack), ///
 ytitle(Age (years)) ///
 xtitle(Proportion black) ///
 legend(off)
 
// Generate yhat and forecast confidence intervals (use stub to identify new vars)
regress_yhat_ci, forecast stub(f_)

// Plot regression line
twoway (rarea f_ageinyears_hat_lb f_ageinyears_hat_ub proportionblack if _plotindicator == 1, sort fcolor(lime%70)) ///
 (rarea ageinyears_hat_lb ageinyears_hat_ub proportionblack if _plotindicator == 1, sort fcolor(bluishgray%60)) ///
 (line ageinyears_hat proportionblack if _plotindicator == 1, sort lcolor(bluishgray) lwidth(thick)) ///
 (scatter ageinyears proportionblack), ///
 ytitle(Age (years)) ///
 xtitle(Proportion black) ///
 legend(off)

// Generate residuals
generate ehat = ageinyears - ageinyears_hat

// Studentized residuals
predict student_ehat, rstudent

// QQplot for residual diagnostic
qnorm student_ehat

// Constant variance diagnostic
twoway (scatter  student_ehat proportionblack)

TASK: Adjusting for the effects of a covariate
Mole rats , ABD example 18.4, page 620
Dataset: 18-e-4
N observations: 35
Response: lnenergy (log daily energy expenditure)
Predictors: lnmas (log body mass); caste (worker, lazy)

Questions to consider:
  • What relationship are the researchers studying?
  • Plot the data.
  • Write down a linear model which will help researchers understand the relationship of interest.
    • How many parameters are in the model?
  • Write down the null and alternative hypotheses of interaction effect in terms of the model parameters.
    • Write down the null and alternative hypotheses of interaction effect in terms of a comparison of two models.
  • In terms of the research question, what does the presence an interaction effect mean?
  • Fit the model.
  • Plot the estimated relationship of the null model and the estimated relationship of the alternative model. See figure 18.4-1.
  • Plot model diagnostics of both models (null and alternative).
  • Test the hypothesis.

// Read data
getvdata 18-e-4

// Generate worker indicator variable
encode caste, generate(lazy)
gen worker = 1*(caste == "worker")

// Construct regression model
regress lnenergy lnmass i.lazy c.lnmass#i.lazy

// Generate data for plot
plotdata, at(lazy=1; lnmass = 4.2(.1)5.5)
plotdata, at(lazy=2; lnmass = 3.5(.1)5)

// Generate predicted values and CIs
regress_yhat_ci

// Generate plot of regression model
twoway  ///
 (rarea lnenergy_hat_lb lnenergy_hat_ub lnmass if _plotindicator == 1, sort fcolor(red%30)) ///
 (line lnenergy_hat lnmass if _plotindicator==1, sort lcolor(red) lwidth(vthick)) ///
 (rarea lnenergy_hat_lb lnenergy_hat_ub lnmass if _plotindicator == 2, sort fcolor(blue%30)) ///
 (line lnenergy_hat lnmass if _plotindicator==2, sort lcolor(blue) lwidth(vthick)) ///
 (scatter lnenergy lnmass if lazy==1, mcolor(red)) ///
 (scatter lnenergy lnmass if lazy==2, mcolor(blue)) ///
 , legend(order(5 "Lazy" 6 "Worker")) ///
 ytitle(Energy expenditure (log scale)) ///
 xtitle(Body mass (log scale))

TASK: Chunk tests
Mole rats , ABD example 18.4, page 620
Dataset: 18-e-4
N observations: 35
Response: lnenergy (log daily energy expenditure)
Predictors: lnmas (log body mass); caste (worker, lazy)

Goal: Generate table of standard hypothesis tests.

getvdata 18-e-4
gen worker = 1*(caste == "worker")

/* 
# Chunk test deliciousness

In the following example, we are going to perform a series of chunk tests.  We
are going to demonstrate how to perform a chunk test with two different methods.
The first method is a comparison of models.  The second is a direct test of the 
betas.  The two methods are equivalent and will give identical answers.  

Depending on your circumstances, one method may be easier to implement than another.

## METHOD 1: Comparison of models

### Interaction test

    FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass
 REDUCED: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker 
 
### Total impact of lnmass

    FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass
 REDUCED: E[ lnenergy | - ] = b0 +             b2 worker 

### Total impact of caste

    FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass
 REDUCED: E[ lnenergy | - ] = b0 + b1 lnmass              

### Overall regression

    FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass
 REDUCED: E[ lnenergy | - ] = b0
 
### Steps to compare two models

1. Fit and store each model (use `est store` command)
2. Use `ftest` command

### Note:

+ The comparison of models approach is only valid if the regression models are fit
using the same observations.  This is often an issue if there is missing data as 
the default practice is to disregard observations that have missing covariates.
The consequence is that the REDUCED model may use more observations than the FULL 
model.  We will cover an example of how to compare models when there is missing data.
*/

regress lnenergy lnmass worker i.worker#c.lnmass
est store full

regress lnenergy lnmass worker
est store no_int // for no interaction

regress lnenergy worker
est store no_mass // model with no lnmass

regress lnenergy lnmass
est store no_caste // model with no caste variable

regress lnenergy 
est store no_vars // model with no predictors 

ftest full no_int // (also available from regress output)
ftest full no_mass
ftest full no_caste
ftest full no_vars // (also available from regress output)

/*
## METHOD 2: Betas

### Interaction test

 Ho: b3 = 0
 Ha: b3 != 0
 
### Total impact of lnmass

 Ho: b1 = 0 AND b3 = 0
 Ha: b1 != 0 OR b3 != 0 

### Total impact of caste

 Ho: b1 = 0 AND b2 = 0
 Ha: b1 != 0 OR b2 != 0              

### Overall regression

 Ho: b1 = 0 AND b2 = 0 AND b3 = 0
 Ha: b1 != 0 OR b2 != 0 OR b3 != 0
 
### Steps to test betas

1. Fit the full model
2. Use the `regress, coeflegend` command to identify labels for betas
3. Use `test` command

### Note:

+ This approach avoids issues with missing data.
*/

regress lnenergy lnmass worker i.worker#c.lnmass
regress, coeflegend

test (1.worker#c.lnmass)              // Interaction (also available from regress output)
test (lnmass 1.worker#c.lnmass)        // Total impact of lnmass
test (worker 1.worker#c.lnmass)        // Total impact of caste
test (lnmass worker 1.worker#c.lnmass) // Overall regression (also available from regress output)


/*
## Table of usefull hypothesis tests

You will want to organize the results of commonly used hypothesis tests in a 
table like the following:

| Variable      | df | Test stat | p-value |
|---------------|----|-----------|---------|
| lnmass        |  2 |           |         |
|   interaction |  1 |           |         |
| caste         |  2 |           |         |
|   interaction |  1 |           |         |
| Overall       |  3 |           |         |

*/

TASK: Categorical Variables in STATA
Dataset: nhgh
N observations: 6805
Response: Height
Predictors: Race and ethnicity, sex
/***
# Incorporating categorical variables into a regression model

Example 1: Height as a function of race-ethnicity (re)

  + Option 1 - estimate the mean for each re

               E[ ht | re ] =   a1 * Mexican American
                              + a2 * Other Hispanic
                              + a3 * Non-Hispanic White
                              + a4 * Non-Hispanic Black
                              + a5 * Other Race Including Multi-Racial


  + Option 2 - estimate the mean in a reference group (say Non-Hispanic Black),
               then estimate the difference in mean height between the reference
               group and each of the remaining re group

               E[ ht | re ] =   b0
                              + b1 * Mexican American
                              + b2 * Other Hispanic
                              + b3 * Non-Hispanic White
                              + b4 * Other Race Including Multi-Racial
***/

getvdata nhgh

// Group means coding
regress ht ibn.re, noconst

/*
Note: When you use the noconstant option, disregard any thing from the SS table or
the overall chunk test
*/

// Reference group coding
regress ht i.re

/*
Which is the reference group?
How do you tell stata which group to use as reference?
  + help fvvarlist
*/

regress ht ib3.re //         ib#.           use # as base, #=value of variable
regress ht ib(3).re //       ib(##).        use the #th ordered value as base (**)
regress ht ib(first).re //   ib(first).     use smallest value as base (the default)
regress ht ib(last).re //    ib(last).      use largest value as base
regress ht ib(freq).re //    ib(freq).      use most frequent value as base

/***
Example 2: Height as a function of race-ethnicity (re) and sex
***/

regress ht i.re##i.sex
plotdata, at(sex = 1/2; re = 1/5)
regress_yhat_ci
decode(sex), generate(s_sex)
decode(re), generate(s_re)
gen label = s_re + " " + s_sex
egen float row = rank(-10*re+sex) if _plotindicator == 1
point_ci ht_hat ht_hat_lb ht_hat_ub row if _plotindicator == 1 & sex == 1,  ///
 labelvar(label) ///
 point_options(mcolor(magenta)) ///
 ci_options(lwidth(thick)) ///
 plot_options(title(Mean heights NHANES) xtitle(Height (cm)) ytitle(Race and Sex))

TASK: Basic Diagnostics
Dataset: nhgh
N observations: 6805
Response: Height
Predictors: Race and ethnicity, sex
getvdata nhgh

TASK: Variable Clustering
The SUPPORT Study, Dupont text section 3.25, page 138
Description: (link)
N observations: 1000
************************************
* THE SUPPORT STUDY
* Dupont text, section 3.25
************************************
clear
getvdata support
keep slos meanbp sex
label variable slos "Length of Stay (days)"
label variable meanbp "Mean Arterial Pressure (mm Hg)"

scatter slos meanbp
twoway (scatter slos meanbp), ///
  yscale(log) ///
  ylabel(, angle(horizontal))
  
egen meanbpcat = cut(meanbp), at(0 25.71429 51.42857 77.14286 102.8571 128.5714 154.2857 181) icodes

* STEP FUNCTION
regress slos i.meanbpcat
regress_yhat_ci, stub(step_)
twoway (line step_slos_hat meanbp if step_slos_hat > 0, sort), ///
  yscale(log) ///
  ylabel(, angle(horizontal)) ///
  xline(25.71429   51.42857   77.14286   102.8571  128.5714   154.2857 ) 

summarize ibn.meanbpcat
regress slos ibn.meanbpcat ibn.meanbpcat#c.meanbp, noconstant
regress_yhat_ci, stub(pwl_) drop
twoway (line pwl_slos_hat meanbp if pwl_slos_hat > 0, sort), ///
  yscale(log) ///
  ylabel(, angle(horizontal)) ///
  xline(25.71429   51.42857   77.14286   102.8571  128.5714   154.2857 ) 
   
* LINEAR SPLINE
mkspline ls_bp_ 7 = meanbp, displayknots
twoway (scatter ls_bp_1 meanbp), xline(25.71429   51.42857   77.14286   102.8571   128.5714   154.2857 ) 
twoway (scatter ls_bp_2 meanbp), xline(25.71429   51.42857   77.14286   102.8571  128.5714   154.2857) 
twoway (scatter ls_bp_3 meanbp), xline(25.71429   51.42857   77.14286   102.8571  128.5714   154.2857 )
twoway (scatter ls_bp_4 meanbp), xline(25.71429   51.42857   77.14286   102.8571  128.5714   154.2857 )
list in 1/5

regress slos ls_bp_*
regress_yhat_ci, stub(ls_)

twoway (line ls_slos_hat meanbp if ls_slos_hat > 0, sort), ///
  yscale(log) ///
  ylabel(, angle(horizontal)) ///
  xline(25.71429   51.42857   77.14286   102.8571  128.5714   154.2857 ) 
    
  
* RESTRICTED SPLINE
mkspline rcs_bp_ = meanbp, nknots(7) cubic displayknots
* 40 59.6829 69 78 101.8579 113 139 

twoway (scatter rcs_bp_1 meanbp), xline(40 59.6829 69 78 101.8579 113 139 ) 
twoway (scatter rcs_bp_2 meanbp), xline(40 59.6829 69 78 101.8579 113 139 ) 
twoway (scatter rcs_bp_3 meanbp), xline(40 59.6829 69 78 101.8579 113 139 )

twoway (scatter rcs_bp_1 meanbp) ///
 (scatter rcs_bp_2 meanbp) ///
 (scatter rcs_bp_3 meanbp) ///
 (scatter rcs_bp_4 meanbp) ///
 (scatter rcs_bp_5 meanbp) ///
 (scatter rcs_bp_6 meanbp) ///
 , xline(40 59.6829 69 78 101.8579 113 139 )

regress slos rcs_bp_*
regress_yhat_ci, stub(rcs_) drop

twoway (line rcs_slos_hat meanbp if rcs_slos_hat > 0, sort), ///
  yscale(log) ///
  ylabel(, angle(horizontal)) ///
  xline(40 59.6829 69 78 101.8579 113 139 ) 

  
 * TEST OF LINEARITY
 test (rcs_bp_2 rcs_bp_3 rcs_bp_4 rcs_bp_5 rcs_bp_6)

* OR FULL vs REDUCED 
est store rcs // full
regress slos meanbp
regress_yhat_ci, stub(sl_) drop // simple linear
est store simple_linear
ftest simple_linear rcs
 
 * COMPARE CUBIC TO LINEAR
twoway (line rcs_slos_hat meanbp, sort) (line sl_slos_hat meanbp, sort), ///
  yscale(log range(3 250)) ///
  ylabel(2(2)10 10(10)50 50(50)250, angle(horizontal))

TASK: Restricted Cubic Splines & Partial Effect Plots
Framingham Heart Study, Dupont example 3.11, page 102
Dataset: Levy (1999)
N observations: 4699
Response: log serum cholesterol
Predictors: bmi, age, sex
* 1. Specify the model
*--------------------------------------------------
* E[ log scl | - ] = b0 + b1 * sex +
*    b2 * age + b3 * age'
*    b4 * bmi + b5 * bmi'
*    b6 * age * sex + b7 * age' * sex
*    b8 * bmi * sex + b9 * bmi' * sex
* V[ log scl | - ] = sigma^2


* 2. Specify the partial effect plots of interest
*--------------------------------------------------
*    Plot 1: Y = E[ log scl | - ]
*            X = age
*            line for males and line for females
*            other covariates set at median
*
*    Plot 2: Y = E[ log scl | - ]
*            X = bmi
*            line for males and line for females
*            other covariates set at median

* Use centile to get range and median of variables

getvdata 2.20.Framingham
gen logscl = log(scl)
label define map_sex 1 "Male" 2 "Female"
label values sex map_sex
centile age bmi, centile(5 50 95)

* 3. Generate data for plotting
*    All of the predictors in the model need to be 
*    specified.
*--------------------------------------------------
plotdata, at(age=34/60; bmi=25; sex=1/2)
plotdata, at(age=45; bmi=19/33; sex=1/2)

* OR specify each line separately
*    This approach can simplify the `if` statements
*    when generating the partial effect plots

* plotdata, at(age=34/60; bmi=25; sex=1)
* plotdata, at(age=34/60; bmi=25; sex=2)
* plotdata, at(age=45; bmi=19/33; sex=1)
* plotdata, at(age=45; bmi=19/33; sex=2)

* 4. Generate restricted cubic splines
*
*    Why use `mkspline_plotindicator`?
*
*    To correctly calculate the default knots for
*    the predictor variable after plotting data
*    has been appended to the dataset.
*
*    If you want to use custom knots, or if no
*    plotting data have been appended to the data,
*    then `mkspline` will work just fine.
*--------------------------------------------------
mkspline_plotindicator bmi, nknots(3)
mkspline_plotindicator age, nknots(3)

* 5. Fit the model
*--------------------------------------------------
regress logscl rcs_age* rcs_bmi* i.sex i.sex#c.(rcs_bmi*) i.sex#c.(rcs_age*)
* Shorter way:
regress logscl i.sex##c.(rcs*)

* 6. Model diagnostics (more to come)
*--------------------------------------------------
regress_yhat_ci
predict studres, rstudent
qnorm studres

* 7. Generate partial effect plots
*--------------------------------------------------
twoway ///
 (rarea logscl_hat_lb logscl_hat_ub age if sex == 1, sort fcolor(cranberry%50)) ///
 (line logscl_hat age if sex == 1, sort lcolor(cranberry) lwidth(thick)) ///
 (rarea logscl_hat_lb logscl_hat_ub age if sex == 2, sort fcolor(dknavy%50)) ///
 (line logscl_hat age if sex == 2, sort lcolor(dknavy) lwidth(thick)) ///
 if _plotindicator == 1, ///
 legend(order(2 "Male" 4 "Female")) ///
 ytitle(Mean log scl) ///
 xtitle(Age) ///
 note(BMI = 25)

 twoway ///
 (rarea logscl_hat_lb logscl_hat_ub bmi if sex == 1, sort fcolor(cranberry%50)) ///
 (line logscl_hat bmi if sex == 1, sort lcolor(cranberry) lwidth(thick)) ///
 (rarea logscl_hat_lb logscl_hat_ub bmi if sex == 2, sort fcolor(dknavy%50)) ///
 (line logscl_hat bmi if sex == 2, sort lcolor(dknavy) lwidth(thick)) ///
 if _plotindicator == 2, ///
 legend(order(2 "Male" 4 "Female")) ///
 ytitle(Mean log scl) ///
 xtitle(BMI) ///
 note(age = 45)

TASK: Missing Data
The SUPPORT Study, Dupont text section 3.25, page 138
Description: (link)
N observations: 1000
Response: Total RCC cost
************************************
* THE SUPPORT STUDY
* Dupont text, section 3.25
************************************
getvdata support
keep age sex dzgroup num_co edu income scoma meanbp hrt resp temp totcst
misstable summarize
misstable pattern
replace edu = . if edu == .z
replace income = . if income == .z
misstable summarize
mi query
mi set flong
mi register imputed edu income totcst
codebook num_co
codebook scoma
mi impute chained (pmm, knn(10)) totcst edu (ologit) income = age i.sex i.dzgroup num_co  scoma meanbp hrt resp temp, add(10) augment rseed(23450)

list income totcst edu _mi_id if _mi_m == 0 & _mi_id  < 10
list income totcst edu _mi_id if _mi_m == 1 & _mi_id  < 10
list income totcst edu _mi_id if _mi_m == 2 & _mi_id  < 10

mkspline rcs_age_=age, cubic nknots(3)
mkspline rcs_meanbp_=meanbp, cubic nknots(3)
mkspline rcs_hrt_=hrt, cubic nknots(3)

list rcs* if _mi_m == 1 & _mi_id  < 10
list _mi_miss in 1/10
list _mi_miss _mi_m if _mi_m == 1 & _mi_id  < 10

mi estimate : regress totcst rcs_* i.sex i.dzgroup num_co scoma i.income
mi estimate, saving("miexample", replace) : regress totcst rcs_* i.sex i.dzgroup num_co scoma i.income
help mi estimate
mi estimate, coeflegend
mi test (rcs_age_1 rcs_age_2)
mi test (2.income 3.income 4.income)
mi test (2.dzgroup 3.dzgroup 4.dzgroup 5.dzgroup 6.dzgroup 7.dzgroup 8.dzgroup)

TASK: Variable Clustering
The SUPPORT Study, Dupont text section 3.25, page 138
Description: (link)
N observations: 1000
getvdata support
keep age sex dzclass race meanbp wblc hrt resp temp alb bili crea sod ph glucose bun urine adlsc
* findit hcavar
tabulate sex, generate(s)
tabulate dzclass, generate(dz)
hcavar s2 dz2 dz3 dz4 adlsc urine bun glucose ph sod crea bili alb temp resp hrt wblc meanbp age

TASK: Data Reduction with Principle Components
Glycohemoglobin from NHANES 2009-2010, BBR 4-22, page 51
Description: (link)
Dataset: nhgh (from Department collection of datasets)
N observations: 6795
getvdata nhgh

* SUBSET AGE GE 21 and DX = 0 and TX = 0
keep if age >= 21 & dx == 0 & tx == 0
drop dx tx

* GENERATE INDICATOR VARIABLES FOR RE
quietly tabulate re, generate(re_)
list re re_*

* PREDICTIVE CAPABILITY OF CONTINUOUS VARIABLES
* must have EGENMORE installed, use command "findit egenmore" for user defined function
spearman2 gh age wt leg arml armc waist tri sub

* VARIABLE CLUSTERING
* findit hcavar
hcavar wt ht bmi leg arml armc waist tri sub age sex re_*

* PRINCIPLE COMPONENTS
centile waist bmi, centile(5 95)
twoway (scatter waist bmi), xline(20 40) yline(74 125)
pca waist bmi, components(2)
predict waist_bmi_pc*, score

* INTERNAL COEFFICIENTS
regress waist_bmi_pc1 waist bmi
regress waist_bmi_pc2 waist bmi

twoway (scatter waist_bmi_pc2 waist_bmi_pc1)
centile waist_bmi_pc1 waist_bmi_pc2, centile(5 95)
twoway (scatter waist_bmi_pc2 waist_bmi_pc1), xline(-1.95 2.49) yline(-.466 .473)

* USE PC IN REGRESSION
regress SCr waist_bmi_pc1 age leg armc

TASK: Outliers
Framingham Heart Study, Dupont example 3.11, page 102
Dataset: Levy (1999) (Available in department collection of datasets.)
N observations: 4699
Response: Log Systolic Blood Pressure
Predictors: log(bmi), age, sex, log(serum cholesterol)
*************************
* FRAMINGHAM HEART STUDY
* Dupont, Section 3.11
*************************

getvdata 2.20.Framingham

* For demonstration purposes:
replace age = 103 if id == 2642

generate logsbp = log(sbp)
label variable logsbp "Log Systolic Blood Pressure"
generate logbmi = log(bmi)
label variable logbmi "Log Body Mass Index"
generate logscl = log(scl)
label variable logscl "Log Serum Cholesterol"

label define sexlabel 1 "Male" 2 "Female" 
label values sex sexlabel

* MODEL 3.23
* E[log spb | ... ] = b0 + b1 logbmi + b2 age + b3 logscl + b4 female
*                        + b5 female * logbmi + b6 female * age
*                        + b7 female * logscl
regress logsbp logbmi age logscl i.sex i.sex#c.(logbmi age logscl)

* OUTLIERS
dfbeta

* FIT DIAGNOSTICS
predict yhat, xb
predict ehat, residuals
predict standard_resid, rstandard
predict student_resid, rstudent
predict cooksd, cooksd
predict leverage, leverage
predict dfits, dfits

* PLOTS
twoway (scatter student_resid yhat) (lowess student_resid yhat, lwidth(vthick)), yline(2 -2)

* COOKS D
generate index = _n
twoway (scatter cooksd index)

* DFIT
twoway (scatter dfits index)

* DFBETAS
graph matrix _dfbeta_*

TASK: R2 Internal Validation in STATA
Framingham Heart Study, Dupont example 3.11, page 102
Dataset: Levy (1999) (Available in department collection of datasets.)
N observations: 4699
Response: Log Systolic Blood Pressure
Predictors: log(bmi), age, sex, log(serum cholesterol)
*************************
* FRAMINGHAM HEART STUDY
* Dupont, Section 3.11
*************************

getvdata 2.20.Framingham

generate logsbp = log(sbp)
label variable logsbp "Log Systolic Blood Pressure"
generate logbmi = log(bmi)
label variable logbmi "Log Body Mass Index"
generate logscl = log(scl)
label variable logscl "Log Serum Cholesterol"

label define sexlabel 1 "male" 2 "female" 
label values sex sexlabel

quietly regress logsbp i.sex##c.(logbmi age logscl)

* R2 Bootstrap VALIDATION
validate_regress, reps(100) seed(3243)

* In-sample calibration plot
regress_yhat_ci

twoway (scatter logsbp logsbp_hat) ///
 (line logsbp_hat logsbp_hat, sort) ///
 (lowess logsbp logsbp_hat)
 
 * In-sample Mean absolute prediction error
 gen abs_error = abs(logsbp - logsbp_hat)
 sum abs_error
 
 * In-sample Rank discrimination
 ** Spearman
 spearman logsbp logsbp_hat
 
 ** In-sample Kendall's tau
 ktau logsbp logsbp_hat

TASK: Crash course in logistic regression
RMS notes 10.1.3, 10.5 letter W
Dataset: sex.age.response & acath

TASK: Logistic regression, RCS, interactions
RMS notes 10.5 letter A
Dataset: acath
Response: sigdz
Predictors: age, sex, cholesterol
/*
0.  Question

1.  Translate question to model

2.  Identify covariates for adjustment

    a. Missing data strategy

3.  Determine parameter budget

    a.  Effective sample size --> 15:1 rule --> budget
    b.  Data reduction needed?
    c.  Interactions?
    d.  Non-linear terms
    e.  Allocation of parameters according to clinical understanding
   
4.  Fit the model

5.  Model diagnostics

    a.  Overly influential observations (dfbetas, dfits)
    b.  Residual diagnostics (linear model usually)
    
6.  Measures of model performance (Optimism corrected)

    a.  Discrimination
    b.  Calibration
    
7.  Understanding the model (Partial effect plots)

8.  Hypothesis tests

    a.  Total predictor impact
    b.  Linearity
    c.  Interaction

*/

getvdata acath

/*
0. What are the associations of age, cholesterol, and sex with significant coronary disease
1. logodds[ SCD | - ] = 
2. Covariates for adjustment
*/

keep sex age choleste sigdz
misstable pattern
drop if choleste == .z /* For demonstration purposes.  Imputation is a better option. */

/*
3.  Paramter budget
*/
tab sigdz //
display 768/15 // Data reduction probably not needed.
/*
3c.  Interactions?
3d.  Non-linear terms
3e.  Allocation of parameters according to clinical understanding


| Full interaction                   | A * f(B) + B * f(A) Interaction | f(A*B)                      |
|------------------------------------|---------------------------------|-----------------------------|
|   1                                | 1                               | 1                           |
| + sex                              | + sex                           | + sex                       |
| + rcs(age, 4 knots)                | + rcs(age, 4)                   | + rcs(age, 4)               |
| + rcs(cholesterol, 4 knots)        | + rcs(cholesterol, 4)           | + rcs(cholesterol, 4)       |
| + rcs(age, 4)*rcs(cholesterol, 4)  | + age * rcs(cholesterol, 4)     | + rcs(age * cholesterol, 4) |
|                                    | + cholesterol * rcs(age, 4)     |                             |
| + rcs(age, 4)*sex                  | + rcs(age, 4) * sex             | + rcs(age, 4)*sex           |
|------------------------------------|---------------------------------|-----------------------------|

How many parameters do each of these models have?
*/


/* OUT OF ORDER
Need to make plans for partial effect plots.  Plan is to generate heatmap of 

 Age |                 Z = predicted log odds
     |                 
     |
     |
     +--------------  
         cholesterol
    
 Holding sex = 1 
*/
centile age cholest, centile(0 3 50 97 100)
plotdata, at(age = 34(2)70; choleste = 150(5)340; sex = 1)


// Time permitting, will come back to this part
// It isn't essential
plotdata, at(age = 17(2)81; choleste = 30(5)575; sex = 1)
sum age cholest if _plotindicator == 0
bnormpdf age cholest, dens(npdf) m1(52.2) m2(229.9) s1(9.9) s2(50.6) rho( 0.0139)
centile npdf if _plotindicator == 0, centile(5)
gen drop5 = 1*(npdf < .000016) if _plotindicator == 2
twoway (scatter age cholest if _plotindicator == 2 & drop5 == 0) ///
 (scatter age cholest if _plotindicator == 0)


/*
4. Fit the model
*/
mkspline_plotindicator age, nknots(4)
mkspline_plotindicator choleste, nknots(4)
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex

/*
5. Diagnostics
*/
gen row = _n
predict leverage, hat
predict deltad, ddeviance
twoway scatter leverage row if _plotindicator == 0
twoway scatter deltad row if _plotindicator == 0

// DFBETAS not part of STATA base.  Can generate with ldfbeta package.
// search ldfbeta
//ldfbeta // Will be slow for larger datasets.
//twoway scatter DFrcs_ag row
//twoway scatter DFrcs_ch row
//twoway scatter DF2 row

/*
6. Apparent measures of performance
*/
logistic_phat_ci // to get phat
brier sigdz sigdz_phat if _plotindicator == 0
lroc
//ssc install somersd
somersd sigdz sigdz_phat

/*
Apparent calibration with histogram rug plot
*/

twoway ///
 (lowess sigdz sigdz_phat, lcolor(blue) lwidth(thick)) ///
 (line sigdz sigdz, sort lcolor(red) lwidth(medium)) ///
 (histogram sigdz_phat, fraction bin(100) fcolor(black) lwidth(none)) ///
 , title(Apparent calibration) ///
 legend(order(1 "Apparent calibration" 2 "Ideal calibration")) ///
 xtitle(Predicted probability)
 
 
/*
Optimism corrected measures of performance
*/
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
validate_logistic, reps(100) seed(023)


/*
Partial effect plots
*/
twoway ///
 (contour sigdz_phat age choleste if _plotindicator == 1, levels(100) format(%0.0g) heatmap zlabel(#7))

twoway ///
 (contour sigdz_phat age choleste if _plotindicator == 1, levels(100) format(%0.0g) heatmap zlabel(#7)) ///
 (scatter age choleste if _plotindicator == 0, mcolor(white%0) mfcolor(white%0) mlcolor(black%10))

twoway ///
 (contour sigdz_phat age choleste if _plotindicator == 2 & drop5 == 0, levels(100) format(%0.0g) heatmap zlabel(#7) interp(none)) ///
 (scatter age choleste if _plotindicator == 0, mcolor(white%0) mfcolor(white%0) mlcolor(black%8))


/*
8.  Hypothesis Tests
*/
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
est store full

// Minus sex
test (1.sex 1.sex#c.rcs_age_1 1.sex#c.rcs_age_2 1.sex#c.rcs_age_3)
testparm i.sex i.sex#c.*
// OR
quietly logistic sigdz rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*)
est store minus_sex
lrtest full minus_sex

// Minus age
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
testparm rcs_age* c.rcs_age*#c.* c.rcs_age*#i.*
//       |        |              + Interaction of age with categorical vars
//       |        + Interaction of age with continuous variables
//       + Age terms without interaction
// OR
quietly logistic sigdz i.sex rcs_chol* 
est store minus_age
lrtest full minus_age

// Minus chol
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
testparm rcs_chol* c.rcs_chol*#c.* c.rcs_chol*#i.*
// OR
quietly logistic sigdz i.sex rcs_age* c.(rcs_age*)#i.sex
est store minus_chol
lrtest full minus_chol

// Minus chol nonlinear
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
// OR
quietly logistic sigdz i.sex rcs_age* rcs_choleste_1 c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
est store minus_chol_nonlinear
lrtest full minus_chol_nonlinear

// Minus chol interaction
quietly logistic sigdz i.sex rcs_age* rcs_chol* c.(rcs_age*)#i.sex
est store minus_chol_interaction
lrtest full minus_chol_interaction

TASK: Bayes, a tantalizing morsel
RMS notes 10.11
Dataset: sex.age.response
getvdata sex.age.response
display (5/invnormal(0.975))^2
display ((20 / invnormal(0.975)) / 10)^2
bayes, ///
 prior({response:_cons}, t(0,10,3)) ///
 prior({response:age}, normal(0,1)) ///
 prior({response:sex}, normal(0,6.5)) ///
 mcmcsize(40000) ///
 thinning(10) ///
 rseed(123) ///
 saving(stata-bayes-results.dta, replace) ///
 coef : logistic response age sex

bayesgraph diagnostics _all
 
use stata-bayes-results.dta, clear
 
sunflower eq1_p1 eq1_p2
twoway (scatter eq1_p1 eq1_p2, mcolor(blue%2) msymbol(circle) mlwidth(none))
gen age_gt_zero = 1*(eq1_p1 > 0)
sum age_gt_zero
gen sex_gt_zero = 1*(eq1_p2 > 0)
sum sex_gt_zero
gen sex_age_gt_zero = 1*(eq1_p1 > 0 & eq1_p2 > 0)
sum sex_age_gt_zero

TASK: Regression with ordinal predictors with a moderate (5ish - 20ish) number of levels
If an ordinal variable has few levels, we generally include it into a regression model as a nominal categorical variable. If an ordinal variable has many levels, we generally treat it as a continuous variable and include it into the regression as a restricted cubic spline. However, what are the options for an ordinal variable with a middle number of levels?
Dataset: support
getvdata support

* Collapse levels 6 and 7
codebook num_co
replace num_co = 6 if num_co == 7

* To compare models
plotdata, at(num_co = 0/6)

* Model of mean los where num_co is a nominal categorical variable
regress slos i.num_co
regress_yhat_ci, stub(i_) /* store results with i_ prefix */
twoway scatter i_slos_hat num_co if _plotindicator == 1

* Model of mean los where num_co association is linear
regress slos c.num_co
regress_yhat_ci, stub(c_)
twoway scatter c_slos_hat num_co if _plotindicator == 1

* Model of mean los where num_co association is quadratic
gen num_co_squared = num_co*num_co
regress slos c.num_co c.num_co_squared
regress_yhat_ci, stub(c2_)
twoway scatter c2_slos_hat num_co if _plotindicator == 1

* Model of mean los where num_co association is quadratic with indicator variable for group 0
gen zero_co = 1*(num_co == 0)
regress slos c.num_co c.num_co_squared zero_co
regress_yhat_ci, stub(c2_gi_)
twoway scatter c2_gi_slos_hat num_co if _plotindicator == 1

* Compare estimated means from all models
list num_co *_slos_hat if _plotindicator == 1

TASK: Crash course in ordered logistic regression
Dataset: Simulated HIV data provided by Bryan Shepherd as part of the Liu, Shepherd, Li, Harrell (2017) publication.
use "https://biostatdata.app.vumc.org/tgs/simhiv.dta", clear
// Simulated HIV data provided by Bryan Shepherd as part of the Liu, Shepherd, Li, and Harrell (2017) publication.

encode site, generate(site_id)
generate sex = real(male)
encode init_class, generate(class)
encode route, generate(iroute)
encode preARTAIDS_clinical, generate(base_aids)
drop site male init_class route preARTAIDS_clinical

/*
 post_cd4 ~ site + male + rcs(age) + init_class + route + 
            rcs(nadir_cd4) +rcs(init_year)+ rcs(pre_rna)+
            preARTAIDS_clinical
         
         
To create the partial-effect plot:
  What is the variable on the x-axis?
  What is the range of the x-axis variable?
  What value are the other predictors held-at?

E[post_cd4|-] |       ****
              |     *
              |    *
              |  *
              |*
              |
              +--------------
                 nadir_cd4
*/

plotdata, at(          ///
  nadir_cd4 = 2(2)550; ///  This is the x-axis variable
  init_year = 2007;    ///  All other covariates in the
  pre_rna = 35000;     ///  model are set to a specific
  age = 35;            ///  value.
  site_id = 1;         ///
  sex = 1;             ///
  class = 1;           ///
  iroute = 1;          ///
  base_aids = 1)

mkspline_plotindicator age, nknots(5)
mkspline_plotindicator nadir_cd4, nknots(5)
mkspline_plotindicator init_year, nknots(5)
mkspline_plotindicator pre_rna, nknots(4)

set matsize 800

ologit post_cd4 i.site_id i.sex rcs* i.base_aids i.iroute i.class if _plotindicator == 0      

ologit_yhat

/*
Hypothesis tests (can also perform with lrtest command)
*/
testparm rcs_age*   // total impact of age
testparm rcs_nadir* // total impact of nadir cd4
testparm i.site_id  // total impact of site

/*
Partial effect plot
*/
twoway (line post_cd4_mean nadir_cd4 if _plotindicator == 1, sort)

TASK: Crash course in Cox regression
RMS notes 18.1
Dataset: prostate


Lesson Plans (restricted access)
Topic revision: r162 - 25 Feb 2020, ThomasStewart
 

This site is powered by FoswikiCopyright © 2013-2020 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback