MSCI Biostatistics II Course Schedule -- February 2019

Key:
A = Whitlock and Schluter (text)
B = Biostatistics for Biomedical Research (handout)
H = Harrell (text) and the corresponding chapter in RMS Handouts

Date Room Material to read/watch
prior to class
Session Topics Assignments Examples
Friday 1 Light 411AB Course Webpage
Syllabus
Key Concepts
A17.1-17.7, A17.10-1
B10-B10.5.8
H2.4.7
Simple Linear Regression Q 29  
Monday 4 Light 411AB A18, B10.6-B10.8 Multiple Linear Regression Q 1 Lion Age from Nose
  5 Light 411AB B 10.9-10.10, 9.4-9.9 " Q 2 Mole Rats
  6 Light 411AB B13-13.6.1 Analysis of Covariance in Randomized Studies Q 10 Chunk Tests Mole Rats
  7 Light 411AB Hxiii, H1, H2-2.3.1, classification blog Regression Modeling Strategies: Introduction Q 21  
  8 Light 411AB H2.3.2-2.4.1-2.4.6 demos Methods for Multivariable Models Q 9 Cat vars in regression
Monday 11 Light 411AB H2.4.8-2.7.2 (omit 2.5.1) " Q 17 Plot RCS
  12 Light 411AB H3, H4-4.1, 4.3-4.4 Missing Data, Multivariable Modeling Strategies Q 30 Missing data
  13 Light 411AB H4.5-4.7.2, 4.7.5,table before 4.8 " Q 23 Data reductions (principle components)
Variable clustering
  14 Light 411AB H4.9-4.12 "   Outliers
  15 Light 411AB A19, H5 (omit 5.5), B10.11 Bootstrap, Validating and Describing the Model Q 24 Internal Validation
Monday 18 Light 411AB   Student led discussion of papers Paper critique
  19 Light 411AB A17.9, H10-10.5 Binary Logistic Models Q 25 & 27  
  20 Light 411AB H10.6-10.10, H10.11 (Bayes summary), B13.6.2-13.7 " Analysis plan for final project Logistic
  21 Light 411AB H11, H12, NNT, HTE, Risk Difference Binary Logistic Model Case Studies   Bayes
  22 Light 411AB B19, More Analyses of Diagnostic Yield, Simple Decision Analysis, H13.1-13.3, H14.1 and Table 14.1, B4.1.2 Risk-Based Diagnostic Research
Proportional Odds Ordinal Logistic Models
Q 31 Ordinal predictors
Monday 25 Light 411AB Power, H15-15.2, 15.5.2-15-25, 15-30, 15-34,
17.1-17.3, 17.5.1,
18.1, 18.4-18.6
Ordinal Regression, Survival Analysis -- Ordinal regression
  26 Light 431 AB   Survival Analysis  
  27 Light 411AB B15 Analysis of Serial Measurements
  28 Light 411AB   Student Presentations of Final Project Final Project - Due 11:59PM  


In Class, Hands-on Examples

TASK: Simple linear regression
The lion's nose, ABD example 17.1, page 541
Dataset: 17-e-1
N observations: 32 male lions
Response: Age (years)
Predictor: Proportion of black pigmentation on nose

  • Why would understanding the Age-Nose relationship be helpful?
  • Construct a scatter plot of the response and explanatory variable.
  • How would you describe the relationship in the plot?
  • Write down the linear regression model of Age and Proportion of black pigmentation on nose.
    • Interpret the constant coefficient.
    • Interpret the slope coefficient.
    • Interpret the variance coefficient.
    • How would the model change if the predictor were on the percent scale?
    • How would the model change if Age were on the decade scale?
    • What scale is the variance coefficient?
  • Fit a linear regression model
    • What are the parameter estimates?
    • What is the uncertainty of the model parameters?
  • Plot the linear regression model
  • Plot the linear regression model with CI bands.
  • Assess the model fit
    • Plot a scatter plot of the studentized residuals and the predictor variable.
      • What aspect of model fit does this plot communicate?
      • What would the plot look like if the model fits well?
      • What would the plot look like if the model fits poorly?
  • How might you use the model?
  • What is the difference between "estimating the mean" and "prediction of age for a new lion"?
    • Is there a difference between estimating a population parameter and predicting an individual outcome?
    • Which of the two (population vs individual outcome) are we "more certain"?
  • Plot the linear regression model with PI band

getvdata 17-e-1

//Scatter plot
twoway (scatter ageinyears proportionblack), ytitle(Age (years)) xtitle(Proportion black)

// Regression
regress ageinyears proportionblack

// Add plot data
plotdata, at( proportionblack = .1(.1).9)

// Generate yhat and confidence intervals
regress_yhat_ci

// Plot regression line and CI
twoway (rarea ageinyears_hat_lb ageinyears_hat_ub proportionblack if _plotindicator == 1, sort fcolor(bluishgray%60)) ///
 (line ageinyears_hat proportionblack if _plotindicator == 1, sort lcolor(bluishgray) lwidth(thick)) ///
 (scatter ageinyears proportionblack), ///
 ytitle(Age (years)) ///
 xtitle(Proportion black) ///
 legend(off)
 
// Generate yhat and forecast confidence intervals (use stub to identify new vars)
regress_yhat_ci, forecast stub(f_)

// Plot regression line
twoway (rarea f_ageinyears_hat_lb f_ageinyears_hat_ub proportionblack if _plotindicator == 1, sort fcolor(lime%70)) ///
 (rarea ageinyears_hat_lb ageinyears_hat_ub proportionblack if _plotindicator == 1, sort fcolor(bluishgray%60)) ///
 (line ageinyears_hat proportionblack if _plotindicator == 1, sort lcolor(bluishgray) lwidth(thick)) ///
 (scatter ageinyears proportionblack), ///
 ytitle(Age (years)) ///
 xtitle(Proportion black) ///
 legend(off)

// Generate residuals
generate ehat = ageinyears - ageinyears_hat

// Studentized residuals
predict student_ehat, rstudent

// QQplot for residual diagnostic
qnorm student_ehat

// Constant variance diagnostic
twoway (scatter  student_ehat proportionblack)

TASK: Adjusting for the effects of a covariate
Mole rats , ABD example 18.4, page 620
Dataset: 18-e-4
N observations: 35
Response: lnenergy (log daily energy expenditure)
Predictors: lnmas (log body mass); caste (worker, lazy)

Questions to consider:
  • What relationship are the researchers studying?
  • Plot the data.
  • Write down a linear model which will help researchers understand the relationship of interest.
    • How many parameters are in the model?
  • Write down the null and alternative hypotheses of interaction effect in terms of the model parameters.
    • Write down the null and alternative hypotheses of interaction effect in terms of a comparison of two models.
  • In terms of the research question, what does the presence an interaction effect mean?
  • Fit the model.
  • Plot the estimated relationship of the null model and the estimated relationship of the alternative model. See figure 18.4-1.
  • Plot model diagnostics of both models (null and alternative).
  • Test the hypothesis.
// Read data
getvdata 18-e-4

// Generate worker indicator variable
gen worker = 1*(caste == "worker")

// Construct regression model
regress lnenergy lnmass worker c.lnmass#i.worker

// Generate data for plot
plotdata, at(lnmass = 3.8(.1)5.3; worker = 0/1)

// Calculate yhat and ci
regress_yhat_ci

// Generate plot of regression model
twoway (rarea lnenergy_hat_lb lnenergy_hat_ub lnmass if worker == 1, sort fcolor(blue%50) lcolor(blue)) ///
 (line lnenergy_hat lnmass if worker == 1, sort lcolor(blue) lwidth(vthick)) ///
 (rarea lnenergy_hat_lb lnenergy_hat_ub lnmass if worker == 0, sort fcolor(red%50) lcolor(red)) ///
 (line lnenergy_hat lnmass if worker == 0, sort lcolor(red) lwidth(vthick)) ///
 if _plotindicator == 1, ///
 ytitle(Energy expenditure (log scale)) ///
 xtitle(Body mass (log scale)) ///
 legend(order(1 "Worker" 3 "Lazy"))

// Test of interaction (comparison of two models)
// FULL model: E[y|-] = b0 + b1 lnmass + b2 worker + b3 lnmass * worker
// REDU model: E[y|-] = b0 + b1 lnmass + b2 worker
//
// Test of interaction (written interms of beta coefficients
// Ho : b3 = 0
// Ha : b3 != 0
//
// From the output:
// ---------------------------------------------------------------------------------
//       lnenergy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
// ----------------+----------------------------------------------------------------
// worker#c.lnmass|    .418622   .4147347     1.01   0.321    -.4272351    1.264479
//
// By hand:
// FULL R2 = 0.4278
// REDU R2 = 0.4090
// p = 3
// q = 1
// n = 35
// Fstat


display (0.4278 - 0.4090)*(35 - 3 - 1)/(1 - 0.4278)/1
display Ftail(1, 35 - 3 - 1, 1.018525)

getvdata 18-e-4
gen worker = 1*(caste == "worker")

/* 
# Chunk test deliciousness

In the following example, we are going to perform a series of chunk tests.  We
are going to demonstrate how to perform a chunk test with two different methods.
The first method is a comparison of models.  The second is a direct test of the 
betas.  The two methods are equivalent and will give identical answers.  

Depending on your circumstances, one method may be easier to implement than another.

## METHOD 1: Comparison of models

### Interaction test

    FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass
 REDUCED: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker 
 
### Total impact of lnmass

    FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass
 REDUCED: E[ lnenergy | - ] = b0 +             b2 worker 

### Total impact of caste

    FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass
 REDUCED: E[ lnenergy | - ] = b0 + b1 lnmass              

### Overall regression

    FULL: E[ lnenergy | - ] = b0 + b1 lnmass + b2 worker + b3 worker * lnmass
 REDUCED: E[ lnenergy | - ] = b0
 
### Steps to compare two models

1. Fit and store each model (use `est store` command)
2. Use `ftest` command

### Note:

+ The comparison of models approach is only valid if the regression models are fit
using the same observations.  This is often an issue if there is missing data as 
the default practice is to disregard observations that have missing covariates.
The consequence is that the REDUCED model may use more observations than the FULL 
model.  We will cover an example of how to compare models when there is missing data.
*/

regress lnenergy lnmass worker i.worker#c.lnmass
est store full

regress lnenergy lnmass worker
est store no_int // for no interaction

regress lnenergy worker
est store no_mass // model with no lnmass

regress lnenergy lnmass
est store no_caste // model with no caste variable

regress lnenergy 
est store no_vars // model with no predictors 

ftest full no_int // (also available from regress output)
ftest full no_mass
ftest full no_caste
ftest full no_vars // (also available from regress output)

/*
## METHOD 2: Betas

### Interaction test

 Ho: b3 = 0
 Ha: b3 != 0
 
### Total impact of lnmass

 Ho: b1 = 0 AND b3 = 0
 Ha: b1 != 0 OR b3 != 0 

### Total impact of caste

 Ho: b1 = 0 AND b2 = 0
 Ha: b1 != 0 OR b2 != 0              

### Overall regression

 Ho: b1 = 0 AND b2 = 0 AND b3 = 0
 Ha: b1 != 0 OR b2 != 0 OR b3 != 0
 
### Steps to test betas

1. Fit the full model
2. Use the `regress, coeflegend` command to identify labels for betas
3. Use `test` command

### Note:

+ This approach avoids issues with missing data.
*/

regress lnenergy lnmass worker i.worker#c.lnmass
regress, coeflegend

test (1.worker#c.lnmass)              // Interaction (also available from regress output)
test (lnmass 1.worker#c.lnmass)        // Total impact of lnmass
test (worker 1.worker#c.lnmass)        // Total impact of caste
test (lnmass worker 1.worker#c.lnmass) // Overall regression (also available from regress output)


/*
## Table of usefull hypothesis tests

You will want to organize the results of commonly used hypothesis tests in a 
table like the following:

| Variable      | df | Test stat | p-value |
|---------------|----|-----------|---------|
| lnmass        |  2 |           |         |
|   interaction |  1 |           |         |
| caste         |  2 |           |         |
|   interaction |  1 |           |         |
| Overall       |  3 |           |         |

*/

TASK: Categorical Variables in STATA
Dataset: nhgh
N observations: 6805
Response: Height
Predictors: Race and ethnicity, sex
/***
# Incorporating categorical variables into a regression model

Example 1: Height as a function of race-ethnicity (re)

  + Option 1 - estimate the mean for each re

               E[ ht | re ] =   a1 * Mexican American
                              + a2 * Other Hispanic
                              + a3 * Non-Hispanic White
                              + a4 * Non-Hispanic Black
                              + a5 * Other Race Including Multi-Racial


  + Option 2 - estimate the mean in a reference group (say Non-Hispanic Black),
               then estimate the difference in mean height between the reference
               group and each of the remaining re group

               E[ ht | re ] =   b0
                              + b1 * Mexican American
                              + b2 * Other Hispanic
                              + b3 * Non-Hispanic White
                              + b4 * Other Race Including Multi-Racial
***/

getvdata nhgh

// Group means coding
regress ht ibn.re, noconst

/*
Note: When you use the noconstant option, disregard any thing from the SS table or
the overall chunk test
*/

// Reference group coding
regress ht i.re

/*
Which is the reference group?
How do you tell stata which group to use as reference?
  + help fvvarlist
*/

regress ht ib3.re //         ib#.           use # as base, #=value of variable
regress ht ib(3).re //       ib(##).        use the #th ordered value as base (**)
regress ht ib(first).re //   ib(first).     use smallest value as base (the default)
regress ht ib(last).re //    ib(last).      use largest value as base
regress ht ib(freq).re //    ib(freq).      use most frequent value as base

/***
Example 2: Height as a function of race-ethnicity (re) and sex
***/

regress ht i.re##i.sex
plotdata, at(sex = 1/2; re = 1/5)
regress_yhat_ci
decode(sex), generate(s_sex)
decode(re), generate(s_re)
gen label = s_re + " " + s_sex
egen float row = rank(-10*re+sex) if _plotindicator == 1
point_ci ht_hat ht_hat_lb ht_hat_ub row if _plotindicator == 1 & sex == 1,  ///
 labelvar(label) ///
 point_options(mcolor(magenta)) ///
 ci_options(lwidth(thick)) ///
 plot_options(title(Mean heights NHANES) xtitle(Height (cm)) ytitle(Race and Sex))

TASK: Restricted Cubic Splines & Partial Effect Plots
The Class of 1988: A Statistical Portrait
Dataset: sat
N observations: 50
Response: Average SAT scores for each state
Predictors: Percentage of students taking the SAT within each state

TASK: Restricted Cubic Splines & Partial Effect Plots
Framingham Heart Study, Dupont example 3.11, page 102
Dataset: Levy (1999)
N observations: 4699
Response: log serum cholesterol
Predictors: bmi, age, sex
* 1. Specify the model
*--------------------------------------------------
* E[ log scl | - ] = b0 + b1 * sex +
*    b2 * age + b3 * age'
*    b4 * bmi + b5 * bmi'
*    b6 * age * sex + b7 * age' * sex
*    b8 * bmi * sex + b9 * bmi' * sex
* V[ log scl | - ] = sigma^2


* 2. Specify the partial effect plots of interest
*--------------------------------------------------
*    Plot 1: Y = E[ log scl | - ]
*            X = age
*            line for males and line for females
*            other covariates set at median
*
*    Plot 2: Y = E[ log scl | - ]
*            X = bmi
*            line for males and line for females
*            other covariates set at median

* Use centile to get range and median of variables

getvdata 2.20.Framingham
gen logscl = log(scl)
label define map_sex 1 "Male" 2 "Female"
label values sex map_sex
centile age bmi, centile(5 50 95)

* 3. Generate data for plotting
*--------------------------------------------------
plotdata, at(age=34/60; bmi=25; sex=1/2)
plotdata, at(age=45; bmi=19/33; sex=1/2)

* 4. Generate restricted cubic splines
*--------------------------------------------------
mkspline_plotindicator bmi, nknots(3)
mkspline_plotindicator age, nknots(3)

* 5. Fit the model
*--------------------------------------------------
regress logscl rcs_age* rcs_bmi* i.sex i.sex#c.(rcs_bmi*) i.sex#c.(rcs_age*)
* Shorter way:
regress logscl i.sex##c.(rcs*)

* 6. Model diagnostics (more to come)
*--------------------------------------------------
regress_yhat_ci

* 7. Generate partial effect plots
*--------------------------------------------------
twoway ///
 (rarea logscl_hat_lb logscl_hat_ub age if sex == 1, sort fcolor(cranberry%50)) ///
 (line logscl_hat age if sex == 1, sort lcolor(cranberry) lwidth(thick)) ///
 (rarea logscl_hat_lb logscl_hat_ub age if sex == 2, sort fcolor(dknavy%50)) ///
 (line logscl_hat age if sex == 2, sort lcolor(dknavy) lwidth(thick)) ///
 if _plotindicator == 1, ///
 legend(order(2 "Male" 4 "Female")) ///
 ytitle(Mean log scl) ///
 xtitle(Age) ///
 note(BMI = 25)

 twoway ///
 (rarea logscl_hat_lb logscl_hat_ub bmi if sex == 1, sort fcolor(cranberry%50)) ///
 (line logscl_hat bmi if sex == 1, sort lcolor(cranberry) lwidth(thick)) ///
 (rarea logscl_hat_lb logscl_hat_ub bmi if sex == 2, sort fcolor(dknavy%50)) ///
 (line logscl_hat bmi if sex == 2, sort lcolor(dknavy) lwidth(thick)) ///
 if _plotindicator == 2, ///
 legend(order(2 "Male" 4 "Female")) ///
 ytitle(Mean log scl) ///
 xtitle(BMI) ///
 note(age = 45)

TASK: Missing Data
The SUPPORT Study, Dupont text section 3.25, page 138
Description: (link)
N observations: 1000
Response: Total RCC cost
************************************
* THE SUPPORT STUDY
* Dupont text, section 3.25
************************************
getvdata support
keep age sex dzgroup num_co edu income scoma meanbp hrt resp temp totcst
misstable summarize
misstable pattern
replace edu = . if edu == .z
replace income = . if income == .z
misstable summarize
mi query
mi set flong
mi register imputed edu income totcst
codebook num_co
codebook scoma
mi impute chained (pmm, knn(10)) totcst edu (ologit) income = age i.sex i.dzgroup num_co  scoma meanbp hrt resp temp, add(10) augment rseed(23450)

list income totcst edu _mi_id if _mi_m == 0 & _mi_id  < 10
list income totcst edu _mi_id if _mi_m == 1 & _mi_id  < 10
list income totcst edu _mi_id if _mi_m == 2 & _mi_id  < 10

mkspline rcs_age_=age, cubic nknots(3)
mkspline rcs_meanbp_=meanbp, cubic nknots(3)
mkspline rcs_hrt_=hrt, cubic nknots(3)

list rcs* if _mi_m == 1 & _mi_id  < 10
list _mi_miss in 1/10
list _mi_miss _mi_m if _mi_m == 1 & _mi_id  < 10

mi estimate : regress totcst rcs_* i.sex i.dzgroup num_co scoma i.income
mi estimate, saving("miexample", replace) : regress totcst rcs_* i.sex i.dzgroup num_co scoma i.income
help mi estimate
mi estimate, coeflegend
mi test (rcs_age_1 rcs_age_2)
mi test (2.income 3.income 4.income)
mi test (2.dzgroup 3.dzgroup 4.dzgroup 5.dzgroup 6.dzgroup 7.dzgroup 8.dzgroup)

TASK: Variable Clustering
The SUPPORT Study, Dupont text section 3.25, page 138
Description: (link)
N observations: 1000
getvdata support
keep age sex dzclass race meanbp wblc hrt resp temp alb bili crea sod ph glucose bun urine adlsc
* findit hcavar
tabulate sex, generate(s)
tabulate dzclass, generate(dz)
hcavar s2 dz2 dz3 dz4 adlsc urine bun glucose ph sod crea bili alb temp resp hrt wblc meanbp age

TASK: Data Reduction with Principle Components
Glycohemoglobin from NHANES 2009-2010, BBR 4-22, page 51
Description: (link)
Dataset: nhgh (from Department collection of datasets)
N observations: 6795
getvdata nhgh

* SUBSET AGE GE 21 and DX = 0 and TX = 0
keep if age >= 21 & dx == 0 & tx == 0
drop dx tx

* GENERATE INDICATOR VARIABLES FOR RE
quietly tabulate re, generate(re_)
list re re_*

* VARIABLE CLUSTERING
* findit hcavar
hcavar wt ht bmi leg arml armc waist tri sub age sex re_*

* PRINCIPLE COMPONENTS
centile waist bmi, centile(5 95)
twoway (scatter waist bmi), xline(20 40) yline(74 125)
pca waist bmi, components(2)
predict waist_bmi_pc*, score

* INTERNAL COEFFICIENTS
regress waist_bmi_pc1 waist bmi
regress waist_bmi_pc2 waist bmi

twoway (scatter waist_bmi_pc2 waist_bmi_pc1)
centile waist_bmi_pc1 waist_bmi_pc2, centile(5 95)
twoway (scatter waist_bmi_pc2 waist_bmi_pc1), xline(-1.95 2.49) yline(-.466 .473)

* USE PC IN REGRESSION
regress SCr waist_bmi_pc1 age leg armc

TASK: Outliers
Framingham Heart Study, Dupont example 3.11, page 102
Dataset: Levy (1999) (Available in department collection of datasets.)
N observations: 4699
Response: Log Systolic Blood Pressure
Predictors: log(bmi), age, sex, log(serum cholesterol)
*************************
* FRAMINGHAM HEART STUDY
* Dupont, Section 3.11
*************************

getvdata 2.20.Framingham

* For demonstration purposes:
replace age = 103 if id == 2642

generate logsbp = log(sbp)
label variable logsbp "Log Systolic Blood Pressure"
generate logbmi = log(bmi)
label variable logbmi "Log Body Mass Index"
generate logscl = log(scl)
label variable logscl "Log Serum Cholesterol"

label define sexlabel 1 "Male" 2 "Female" 
label values sex sexlabel

* MODEL 3.23
* E[log spb | ... ] = b0 + b1 logbmi + b2 age + b3 logscl + b4 female
*                        + b5 female * logbmi + b6 female * age
*                        + b7 female * logscl
regress logsbp logbmi age logscl i.sex i.sex#c.(logbmi age logscl)

* OUTLIERS
dfbeta

* FIT DIAGNOSTICS
predict yhat, xb
predict ehat, residuals
predict standard_resid, rstandard
predict student_resid, rstudent
predict cooksd, cooksd
predict leverage, leverage
predict dfits, dfits

* PLOTS
twoway (scatter student_resid yhat) (lowess student_resid yhat, lwidth(vthick)), yline(2 -2)

* COOKS D
generate index = _n
twoway (scatter cooksd index)

* DFIT
twoway (scatter dfits index)

* DFBETAS
graph matrix _dfbeta_*

TASK: R2 Internal Validation in STATA
Framingham Heart Study, Dupont example 3.11, page 102
Dataset: Levy (1999) (Available in department collection of datasets.)
N observations: 4699
Response: Log Systolic Blood Pressure
Predictors: log(bmi), age, sex, log(serum cholesterol)
*************************
* FRAMINGHAM HEART STUDY
* Dupont, Section 3.11
*************************

getvdata 2.20.Framingham

generate logsbp = log(sbp)
label variable logsbp "Log Systolic Blood Pressure"
generate logbmi = log(bmi)
label variable logbmi "Log Body Mass Index"
generate logscl = log(scl)
label variable logscl "Log Serum Cholesterol"

label define sexlabel 1 "male" 2 "female" 
label values sex sexlabel

quietly regress logsbp i.sex##c.(logbmi age logscl)

* R2 VALIDATIOn
validate_regress, reps(100) seed(3243)

TASK: Crash course in logistic regression
RMS notes 10.1.3, 10.5 letter W
Dataset: sex.age.response & acath

TASK: Logistic regression, RCS, interactions
RMS notes 10.5 letter A
Dataset: acath
Response: sigdz
Predictors: age, sex, cholesterol
/*
0.  Question

1.  Translate question to model

2.  Identify covariates for adjustment

    a. Missing data strategy

3.  Determine parameter budget

    a.  Effective sample size --> 15:1 rule --> budget
    b.  Data reduction needed?
    c.  Interactions?
    d.  Non-linear terms
    e.  Allocation of parameters according to clinical understanding
   
4.  Fit the model

5.  Model diagnostics

    a.  Overly influential observations (dfbetas, dfits)
    b.  Residual diagnostics (linear model usually)
    
6.  Measures of model performance (Optimism corrected)

    a.  Discrimination
    b.  Calibration
    
7.  Understanding the model (Partial effect plots)

8.  Hypothesis tests

    a.  Total predictor impact
    b.  Linearity
    c.  Interaction

*/

getvdata acath

/*
0. What are the associations of age, cholesterol, and sex with significant coronary disease
1. logodds[ SCD | - ] = 
2. Covariates for adjustment
*/

keep sex age choleste sigdz
misstable pattern
drop if choleste == .z /* For demonstration purposes.  Imputation is a better option. */

/*
3.  Paramter budget
*/
tab sigdz //
display 768/15 // Data reduction probably not needed.
/*
3c.  Interactions?
3d.  Non-linear terms
3e.  Allocation of parameters according to clinical understanding


| Full interaction                   | A * f(B) + B * f(A) Interaction | f(A*B)                      |
|------------------------------------|---------------------------------|-----------------------------|
|   1                                | 1                               | 1                           |
| + sex                              | + sex                           | + sex                       |
| + rcs(age, 4 knots)                | + rcs(age, 4)                   | + rcs(age, 4)               |
| + rcs(cholesterol, 4 knots)        | + rcs(cholesterol, 4)           | + rcs(cholesterol, 4)       |
| + rcs(age, 4)*rcs(cholesterol, 4)  | + age * rcs(cholesterol, 4)     | + rcs(age * cholesterol, 4) |
|                                    | + cholesterol * rcs(age, 4)     |                             |
| + rcs(age, 4)*sex                  | + rcs(age, 4) * sex             | + rcs(age, 4)*sex           |
|------------------------------------|---------------------------------|-----------------------------|

How many parameters do each of these models have?
*/


/* OUT OF ORDER
Need to make plans for partial effect plots.  Plan is to generate heatmap of 

 Age |                 Z = predicted log odds
     |                 
     |
     |
     +--------------  
         cholesterol
    
 Holding sex = 1 
*/
centile age cholest, centile(0 3 50 97 100)
plotdata, at(age = 34(2)70; choleste = 150(5)340; sex = 1)


// Time permitting, will come back to this part
// It isn't essential
plotdata, at(age = 17(2)81; choleste = 30(5)575; sex = 1)
sum age cholest if _plotindicator == 0
bnormpdf age cholest, dens(npdf) m1(52.2) m2(229.9) s1(9.9) s2(50.6) rho( 0.0139)
centile npdf if _plotindicator == 0, centile(5)
gen drop5 = 1*(npdf < .000016) if _plotindicator == 2
twoway (scatter age cholest if _plotindicator == 2 & drop5 == 0) ///
 (scatter age cholest if _plotindicator == 0)


/*
4. Fit the model
*/
mkspline_plotindicator age, nknots(4)
mkspline_plotindicator choleste, nknots(4)
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex

/*
5. Diagnostics
*/
gen row = _n
predict leverage, hat
predict deltad, ddeviance
twoway scatter leverage row if _plotindicator == 0
twoway scatter deltad row if _plotindicator == 0

// DFBETAS not part of STATA base.  Can generate with ldfbeta package.
// search ldfbeta
//ldfbeta // Will be slow for larger datasets.
//twoway scatter DFrcs_ag row
//twoway scatter DFrcs_ch row
//twoway scatter DF2 row

/*
6. Apparent measures of performance
*/
logistic_phat_ci // to get phat
brier sigdz sigdz_phat if _plotindicator == 0
lroc
//ssc install somersd
somersd sigdz sigdz_phat

/*
Apparent calibration with histogram rug plot
*/

twoway ///
 (lowess sigdz sigdz_phat, lcolor(blue) lwidth(thick)) ///
 (line sigdz sigdz, sort lcolor(red) lwidth(medium)) ///
 (histogram sigdz_phat, fraction bin(100) fcolor(black) lwidth(none)) ///
 , title(Apparent calibration) ///
 legend(order(1 "Apparent calibration" 2 "Ideal calibration")) ///
 xtitle(Predicted probability)
 
 
/*
Optimism corrected measures of performance
*/
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
validate_logistic, reps(100) seed(023)


/*
Partial effect plots
*/
twoway ///
 (contour sigdz_phat age choleste if _plotindicator == 1, levels(100) format(%0.0g) heatmap zlabel(#7))

twoway ///
 (contour sigdz_phat age choleste if _plotindicator == 1, levels(100) format(%0.0g) heatmap zlabel(#7)) ///
 (scatter age choleste if _plotindicator == 0, mcolor(white%0) mfcolor(white%0) mlcolor(black%10))

twoway ///
 (contour sigdz_phat age choleste if _plotindicator == 2 & drop5 == 0, levels(100) format(%0.0g) heatmap zlabel(#7) interp(none)) ///
 (scatter age choleste if _plotindicator == 0, mcolor(white%0) mfcolor(white%0) mlcolor(black%8))


/*
8.  Hypothesis Tests
*/
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
est store full

// Minus sex
test (1.sex 1.sex#c.rcs_age_1 1.sex#c.rcs_age_2 1.sex#c.rcs_age_3)
testparm i.sex i.sex#c.*
// OR
quietly logistic sigdz rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*)
est store minus_sex
lrtest full minus_sex

// Minus age
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
testparm rcs_age* c.rcs_age*#c.* c.rcs_age*#i.*
//       |        |              + Interaction of age with categorical vars
//       |        + Interaction of age with continuous variables
//       + Age terms without interaction
// OR
quietly logistic sigdz i.sex rcs_chol* 
est store minus_age
lrtest full minus_age

// Minus chol
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
testparm rcs_chol* c.rcs_chol*#c.* c.rcs_chol*#i.*
// OR
quietly logistic sigdz i.sex rcs_age* c.(rcs_age*)#i.sex
est store minus_chol
lrtest full minus_chol

// Minus chol nonlinear
logistic sigdz i.sex rcs_age* rcs_chol* c.rcs_age_1#c.(rcs_chol*) c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
// OR
quietly logistic sigdz i.sex rcs_age* rcs_choleste_1 c.rcs_choleste_1#c.(rcs_age*) c.(rcs_age*)#i.sex
est store minus_chol_nonlinear
lrtest full minus_chol_nonlinear

// Minus chol interaction
quietly logistic sigdz i.sex rcs_age* rcs_chol* c.(rcs_age*)#i.sex
est store minus_chol_interaction
lrtest full minus_chol_interaction

#Bayes
TASK: Bayes, a tantalizing morsel
RMS notes 10.11
Dataset: sex.age.response
getvdata sex.age.response
display (5/invnormal(0.975))^2
display ((20 / invnormal(0.975)) / 10)^2
bayes, ///
 prior({response:_cons}, t(0,10,3)) ///
 prior({response:age}, normal(0,1)) ///
 prior({response:sex}, normal(0,6.5)) ///
 mcmcsize(40000) ///
 thinning(10) ///
 rseed(123) ///
 saving(stata-bayes-results.dta, replace) ///
 coef : logistic response age sex

bayesgraph diagnostics _all
 
use stata-bayes-results.dta, clear
 
sunflower eq1_p1 eq1_p2
twoway (scatter eq1_p1 eq1_p2, mcolor(blue%2) msymbol(circle) mlwidth(none))
gen age_gt_zero = 1*(eq1_p1 > 0)
sum age_gt_zero
gen sex_gt_zero = 1*(eq1_p2 > 0)
sum sex_gt_zero
gen sex_age_gt_zero = 1*(eq1_p1 > 0 & eq1_p2 > 0)
sum sex_age_gt_zero

TASK: Regression with ordinal predictors with a moderate (5ish - 20ish) number of levels
If an ordinal variable has few levels, we generally include it into a regression model as a nominal categorical variable. If an ordinal variable has many levels, we generally treat it as a continuous variable and include it into the regression as a restricted cubic spline. However, what are the options for an ordinal variable with a middle number of levels?
Dataset: support
getvdata support

* Collapse levels 6 and 7
codebook num_co
replace num_co = 6 if num_co == 7

* To compare models
plotdata, at(num_co = 0/6)

* Model of mean los where num_co is a nominal categorical variable
regress slos i.num_co
regress_yhat_ci, stub(i_) /* store results with i_ prefix */
twoway scatter i_slos_hat num_co if _plotindicator == 1

* Model of mean los where num_co association is linear
regress slos c.num_co
regress_yhat_ci, stub(c_)
twoway scatter c_slos_hat num_co if _plotindicator == 1

* Model of mean los where num_co association is quadratic
gen num_co_squared = num_co*num_co
regress slos c.num_co c.num_co_squared
regress_yhat_ci, stub(c2_)
twoway scatter c2_slos_hat num_co if _plotindicator == 1

* Model of mean los where num_co association is quadratic with indicator variable for group 0
gen zero_co = 1*(num_co == 0)
regress slos c.num_co c.num_co_squared zero_co
regress_yhat_ci, stub(c2_gi_)
twoway scatter c2_gi_slos_hat num_co if _plotindicator == 1

* Compare estimated means from all models
list num_co *_slos_hat if _plotindicator == 1

TASK: Crash course in ordered logistic regression
Dataset: Simulated HIV data provided by Bryan Shepherd as part of the Liu, Shepherd, Li, Harrell (2017) publication.
use "https://biostatdata.app.vumc.org/tgs/simhiv.dta", clear
// Simulated HIV data provided by Bryan Shepherd as part of the Liu, Shepherd, Li, and Harrell (2017) publication.

encode site, generate(site_id)
generate sex = real(male)
encode init_class, generate(class)
encode route, generate(iroute)
encode preARTAIDS_clinical, generate(base_aids)
drop site male init_class route preARTAIDS_clinical

/*
 post_cd4 ~ site + male + rcs(age) + init_class + route + 
            rcs(nadir_cd4) +rcs(init_year)+ rcs(pre_rna)+
            preARTAIDS_clinical
         
         
To create the partial-effect plot:
  What is the variable on the x-axis?
  What is the range of the x-axis variable?
  What value are the other predictors held-at?

E[post_cd4|-] |       ****
              |     *
              |    *
              |  *
              |*
              |
              +--------------
                 nadir_cd4
*/

plotdata, at(          ///
  nadir_cd4 = 2(2)550; ///  This is the x-axis variable
  init_year = 2007;    ///  All other covariates in the
  pre_rna = 35000;     ///  model are set to a specific
  age = 35;            ///  value.
  site_id = 1;         ///
  sex = 1;             ///
  class = 1;           ///
  iroute = 1;          ///
  base_aids = 1)

mkspline_plotindicator age, nknots(5)
mkspline_plotindicator nadir_cd4, nknots(5)
mkspline_plotindicator init_year, nknots(5)
mkspline_plotindicator pre_rna, nknots(4)

set matsize 800

ologit post_cd4 i.site_id i.sex rcs* i.base_aids i.iroute i.class if _plotindicator == 0      

ologit_yhat

/*
Hypothesis tests (can also perform with lrtest command)
*/
testparm rcs_age*   // total impact of age
testparm rcs_nadir* // total impact of nadir cd4
testparm i.site_id  // total impact of site

/*
Partial effect plot
*/
twoway (line post_cd4_mean nadir_cd4 if _plotindicator == 1, sort)

TASK: Crash course in Cox regression
RMS notes 18.1
Dataset: prostate


Lesson Plans (restricted access)
Topic revision: r137 - 25 Feb 2019, ThomasStewart
 

This site is powered by FoswikiCopyright © 2013-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback