Statistical Thinking in Biomedical Research
Division of Biostatistics & Epidemiology
Department of Health Evaluation Sciences
University of Virginia School of Medicine
Rob Abbott, Viktor Bovbjerg, Mark Conaway, Frank Harrell
hesweb1.med.virginia.edu/biostat/teaching/clinicians
Office of Continuing Medical Education
Fall 2002
Slide: 1
Section 1: Introduction to Statistical Concepts
Frank Harrell

What do biostatistics & epidemiology have to offer to
biomedical research?
 Statistical inference
 Study design issues
 Descriptive statistics
 Measuring change
 Statistical resources at UVa
Slide: 2
What Do Biostatistics & Epidemiology Offer?

Help in developing concrete objectives and data acquisition
methods that meet the objectives, concrete descriptions
of primary and secondary endpoints
 Appropriate experimental and study design

Sources of bias
 Measurement issues
 Efficiency/power
 Maximizing use of a given number of animals
 Interpretability of findings
 Reproducibility of analyses
Choice of appropriate design depends crucially on the type of
experiment/disease and treatment being studied.
 Likelihood that the sample will yield estimates precise enough
to make experiments conclusive or to affect medical practice
 More efficient use of data
 Formulate analysis plans without making inappropriate assumptions
 Estimate sample size (if fixed)
Slide: 3
Example

Objective: Does an intervention (could be a new drug,
patient education intervention, method of administration) improve
a response?
 What is meant by "response" and how will it be
measured?
 Is it based on symptoms, physiologic measurements, anatomical
measurements, or a combination?
 Does the "response" variable truly
measure the effect of treatment?
 When is the response measured: 30 days after enrollment,
30 days after discharge from the hospital, anytime within a
5-year follow-up period, during the procedure, immediately
upon reperfusion after a coronary artery is unclamped,
after steady state, ...?
 If a "time to" endpoint, how much time is given to
enrolling patients and how much to follow-up?
 Where will patients come from and which group do they represent?
 How will the study results be used?
Slide: 4
Statistical Inference: Examples

"How did my 5 patients do after I put them on an
ACE inhibitor?": Describe results.
 "How do patients with condition x respond after being on
an ACE inhibitor for 6 months?": Infer → Need to take a sample of
patients of interest to approximate what would be observed
had all such patients been treated that way.
 "What is the in-hospital mortality after open heart surgery
at my hospital so far this year?": Describe; whole population
captured.
 "What is the in-hospital mortality after open heart surgery
likely to be this year, given results from last year?": Infer →
Estimate probability of death for patients like those seen in
1996.
 Inference = observations → some general truth
 Answering research questions usually requires inferential
reasoning because you want to make a statement in general, not
just a statement about your specific study.
 The ability to do so depends on how the observations were collected
as well as on their number
Slide: 5
Infinite Data Case

Suppose that one had an infinitely large amount of data of the
kind under consideration
 Inference not required
 Do need to determine if the infinite dataset would answer the
question of interest (QOI)

Subjects relevant?
 Measurements biased?
 Measurements relevant (e.g. measure cholesterol reduction
but not survival time)?
 Data collection process adequate?
 Patient-to-patient variability still too great for
conclusions to be applied to individuals?
Slide: 6
Finite Dataset

Compute an estimate of something, e.g. expected reduction in blood pressure
 Approximates what would have been observed with infinite
data
 Can estimate likely error in this approximation
 Probabilistic thinking: likely absolute error is a function of:

Sample size
 Subject to subject variability
 Intrasubject variability if using multiple
observations/subject
 Systematic bias
 Some subjects not getting desired experimental condition
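A minimal sketch of how the likely error depends on sample size and subject-to-subject variability. It assumes a simple random sample, a normal approximation (z = 1.96 for 0.95 confidence), and one observation per subject; the function name and the standard deviation of 10 mmHg are hypothetical, chosen only for illustration.

```python
import math

def margin_of_error(sd, n, z=1.96):
    """Approximate 0.95 margin of error for a sample mean: z * sd / sqrt(n).

    sd is the subject-to-subject standard deviation, n the number of subjects.
    Assumes independent observations and a normal approximation.
    """
    return z * sd / math.sqrt(n)

sd = 10.0  # hypothetical SD of blood pressure reduction, in mmHg
for n in (25, 100, 400):
    # Each fourfold increase in n halves the margin of error.
    print(n, round(margin_of_error(sd, n), 2))
```

This captures only the random component of the error. Systematic bias, and subjects not receiving the intended experimental condition, are not reduced by increasing the sample size.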
Slide: 7
Steps Involved in Statistical Inference

Statistical inference is based on the fact that when laws of
probability govern data collection, one can infer from the sample to
infinite data results
 First ensure that an infinite dataset would answer the QOI
 Assess results using a sample
 Compute the likely closeness with which sample results
approximate infinite dataset results
 Internal validity (chance), external validity
(generalizability)
Slide: 8
Study Design Issues

Concretely define study objectives
 Design study so that infinite data would answer the QOI
 Conduct experiment making efficient use of resources,
minimize likely error
 Standardize measurement devices
 Quantify and minimize intra- and inter-observer
variability of measurements
 Define terms: symptoms, signs, diagnoses, disease severity,
risk factors,
treatment or experimental conditions, control condition, events
 Use standardized assessment instruments when possible
 Definite animal/patient entry criteria
 Concomitant therapies/laboratory conditions
 Dosing of active control agents must be optimal
 Account for accommodation (tolerance) to drug effect
 In comparison studies, masked and random assignment of
experimental conditions
 Masked assessment of specimens/subject responses
 Masked specification of analysis[6]
 Masked reporting: write manuscript before data
analysis[7]
Slide: 9
Response Variables

Continuous measurements are best (e.g., mmHg, not
"hypertension")
 Time lapse after experimental condition/how often to measure
 May have to wait until after an acute but temporary
derangement
 Time of assessment may need to correspond with phases of
disease development/trajectory of disease severity as well as
lifespan of the technology
 Binary response when time of event not important (e.g.,
procedural death); still need to justify duration of observation
 Time to event: all subjects without event need to be
followed a minimum duration to capture some of the
clinically relevant period.
A sufficiently large subset of
subjects should be followed until the end of the clinically
relevant period.
 Ordinal responses can be useful and have good statistical
power, e.g., no event within 30d, mild myocardial infarction,
moderate MI, severe MI, death within 30d. For diagnostic
studies may need to at least include a "gray zone".
 When there are multiple responses they should be (1) prioritized,
or (2) combined into a summary scale. Need to "go out on a limb"
and prespecify which results will be emphasized when study
results are publicized.
 Example: may combine systolic and diastolic b.p. into
mean arterial b.p.
 Otherwise, there are multiple comparison problems. To
preserve the overall type I error (false positive rate), one would need to
be more conservative, which reduces power.
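The multiple comparison problem can be made concrete with a small calculation. The sketch below assumes k independent tests, each at level alpha, which is a simplification (real responses are usually correlated); the Bonferroni adjustment is shown as one common conservative choice, not as a method named in these notes. Function names are hypothetical.

```python
def familywise_error(alpha, k):
    """Chance of at least one false positive among k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(alpha, k):
    """Per-test level that keeps the overall type I error at or below alpha."""
    return alpha / k

for k in (1, 2, 5, 10):
    print(k, round(familywise_error(0.05, k), 3), bonferroni_alpha(0.05, k))
```

With five response variables each tested at 0.05, the chance of at least one false positive is roughly 0.23. Testing each at the stricter adjusted level restores the overall rate but makes every individual test harder to pass, which is the loss of power referred to above.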
Slide: 10
Types of Studies/Believability of Results

Single-arm (pilot, Phase III)
 Comparisons with a reference standard
 Toxicity
 Pharmacokinetic
 Correlating two responses
 Dose-finding/dose titration
 Estimation of dose- or time-response curves within subjects
 Beware of problems with noncomparative studies of therapies:
Treatment response = natural history + Hawthorne effect +
placebo effect + bias caused by investigator enthusiasm + real
treatment effect
 Comparative: ≥ 2 arms
 Unacceptable Studies:

Observational studies where subjects were selected on
the basis of their outcomes (e.g., consecutive series of
100 open heart patients who lived)
 Comparison with historical controls unless time trends
fully understood and excellent subject baseline
descriptors in both studies
 Randomized controlled trial (RCT) where physicians only
allowed patients to be randomized who were invincible
 RCT where entry criteria do not reflect patients seen in practice
 RCT of a procedure or therapy that is obsolete by the
time the results are disseminated (or mode of use is
obsolete)
 Any study where positive results were
derived only after torturing the data (multiple subgroups
or response variables examined)
 Experiment in which measurements have extremely large
variability across replications within the same animal,
or ones in which measurements were "optimized" by
nonreplicable "tweaking"
 An average ranking of quality for comparative
studies^{1}:

double masked RCT with masked analysis &
manuscript writing
 double masked RCT
 single masked RCT
 unmasked RCT
 prospectively designed and conducted
cohort study
 prospective case-control study
 retrospective cohort study
 retrospective case-control study
 See Chalmers et al. [4] for a rating scale for
study quality. Also see [3].
Slide: 11
Randomized Experiments

Randomly allocate patients to treatments while
masked
 If sample size is at all reasonable, should balance all known and
unknown risk factors
 Even if there is an apparent imbalance in one factor, you'll
see imbalances in the other direction if you look at enough
other factors
 Best not to look at patient characteristics stratified by
treatment; report statistics for overall sample
 Randomization is best done using a computer program, with
treatment assignments revealed at the last moment
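As an illustration of computer-generated randomization, here is a minimal sketch of a permuted-block allocation schedule. Blocking is an assumption added here (it keeps arm sizes balanced throughout enrollment); the notes above only say that a computer program should be used. The function name, block size, and seed are hypothetical.

```python
import random

def permuted_block_schedule(n_subjects, treatments=("A", "B"), block_size=4, seed=2002):
    """Return a list of treatment assignments built from randomly permuted blocks.

    Each block contains every treatment equally often, so arm sizes never
    differ by more than block_size / len(treatments) during enrollment.
    """
    if block_size % len(treatments) != 0:
        raise ValueError("block_size must be a multiple of the number of treatments")
    rng = random.Random(seed)  # fixed seed so the schedule can be audited and reproduced
    schedule = []
    while len(schedule) < n_subjects:
        block = list(treatments) * (block_size // len(treatments))
        rng.shuffle(block)
        schedule.extend(block)
    return schedule[:n_subjects]

schedule = permuted_block_schedule(20)
print(schedule.count("A"), schedule.count("B"))
```

In practice the schedule would be generated ahead of time and held by someone not involved in enrollment, with each assignment released only once a subject is irrevocably entered, in keeping with revealing assignments at the last moment.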
Slide: 12
External Validity of Study Findings

Knowledge of pathophysiology can allow extrapolation of
results to a group of subjects not represented in study
 Example: Reduction of probability of myocardial infarction by
aspirin in men → reduction in women
 But what if aspirin increases GI bleeding in women more than
in men?
 Differences in dosing, side effects, compliance can cause
different results in another population
 Relative effects of treatments frequently carry over to
other types of subjects even though absolute effects do
not^{2}
Slide: 13
Pitfalls in Analysis & Interpretation

Highlighting results found by data dredging; need to at
least document the context
 Deleting "outliers" based on observed response values

Unscientific, results nonreplicable
 Instead use robust statistical methods
 Irreproducible analyses based on point-and-click software without
audit trail
 Concentrating exclusively on hypothesis testing. Null hypotheses
are generally boring and do not answer questions about clinical
significance. It's better to think in terms of being able to
estimate, with sufficient precision, the effects of interest.
 Using P-values to provide evidence supporting a hypothesis;
they can only be used to quantify evidence against a
hypothesis.
"Absence of evidence is not evidence of absence"
[1]
 P = 0.4 → insufficient sample size or no effect; don't know which
 Relying too much on standard deviations as descriptive statistics.
Standard deviations are not very meaningful if the distribution of
the data is non-Gaussian and especially if asymmetric.
 Using standard errors to describe anything other than the
precision of a summary estimate. Standard errors do not
describe variability across subjects. To describe precision,
it's better to use confidence limits on summary statistics.
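The distinction between standard deviation, standard error, and confidence limits can be sketched as follows. This assumes a normal approximation with z = 1.96; for small samples a t quantile would be more appropriate. The function name and the data are hypothetical.

```python
import math
import statistics

def mean_with_limits(values, z=1.96):
    """Return (mean, sd, se, (lower, upper)) for a sample.

    sd describes variability across subjects.
    se = sd / sqrt(n) describes only the precision of the mean.
    The limits are approximate 0.95 confidence limits for the mean.
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    se = sd / math.sqrt(n)
    return mean, sd, se, (mean - z * se, mean + z * se)

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
mean, sd, se, (lower, upper) = mean_with_limits(values)
print(round(mean, 2), round(sd, 2), round(se, 2), round(lower, 2), round(upper, 2))
```

The standard error shrinks as more subjects are added while the standard deviation does not, which is why the standard error says nothing about how much individual subjects differ from one another.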
Slide: 14
Descriptive Statistics

Number of nonmissing measurements, central tendency, perhaps
intersubject variability
 Mean and especially standard deviation may not be meaningful
unless data are normally distributed
 Don't expect normality for biological variables
 Deciding on statistics to use on basis of test of normality
assumes such tests have power near 1.0
 For continuous variables, a good summary is obtained from
the 3 quartiles (25th, 50th, 75th percentiles,
50th = median)
 Describes central tendency, spread, symmetry
 For continuous variables for which totals may be relevant
(e.g., costs), supplement this with the sample mean
 Computing means on transformations (e.g., geometric mean) and
then backtransforming is problematic
 Standard errors are not descriptive statistics
 For discrete numeric variables representing counts or
interval scale values, where the number of possible
categories <10, use the mean and outer
quartiles or mean and selected proportions.
Median will not be sensitive and is erratic
because of heavy ties in data.
 Nominal (polytomous) variables → proportions in k-1 of the
k categories
 Binary variables → mean (proportion of "positives")
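A minimal sketch of the three-quartile summary for a continuous variable, using Python's standard library. The data are hypothetical lengths of stay in days, chosen to be skewed; the "inclusive" interpolation method is one of several quantile definitions and is an assumption made here.

```python
import statistics

def three_quartiles(values):
    """Return the 25th, 50th (median), and 75th percentiles of a sample."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3

los = [1, 2, 2, 3, 3, 4, 5, 7, 30]  # hypothetical lengths of stay, in days
print(three_quartiles(los))
print(round(statistics.mean(los), 2))
```

For these skewed data the sample mean lies above the 75th percentile because of a single long stay. The quartiles describe where most subjects fall, while the mean remains the relevant supplement when totals such as costs matter.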
Slide: 15
Analysis of Paired Observations

Frequently one makes multiple observations on same
experimental unit
 Can't analyze as if independent
 When two observations made on each unit (e.g., pre-post),
it is common to summarize each pair using a measure of effect
→ analyze effects as if (unpaired) raw data
 Most common: simple difference, ratio, percent change
 Can't take effect measure for granted
 Subjects having large initial values may have largest
differences
 Subjects having very small initial values may have
largest post/pre ratios
Slide: 16
What's Wrong with Percent Change?

Depends on point of reference: which term is used in the
denominator?
 Example:
Treatment A: 0.05 proportion having stroke
Treatment B: 0.09 proportion having stroke
Treatment A reduced proportion of stroke by 44%
Treatment B increased proportion by 80%
 Two increases of 50% result in a total increase of 125%, not
100%
 Percent change (or ratio) not a symmetric measure
 Simple difference or log ratio are symmetric
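The arithmetic in the stroke example can be checked with a short sketch. The function name is hypothetical; the proportions are those given above.

```python
import math

def percent_change(reference, new):
    """Percent change from `reference` to `new`; the reference is the denominator."""
    return 100.0 * (new - reference) / reference

a, b = 0.05, 0.09  # proportions having stroke on treatments A and B

# Same two numbers, different answers depending on the reference:
print(round(percent_change(b, a)))  # A relative to B: about -44
print(round(percent_change(a, b)))  # B relative to A: 80

# Two successive 50% increases multiply rather than add:
print(100 * (1.5 * 1.5 - 1))        # total increase of 125%, not 100%

# Symmetric alternatives: swapping the reference only flips the sign.
print(round(b - a, 2), round(a - b, 2))
print(round(math.log(b / a), 3), round(math.log(a / b), 3))
```

Log ratios also add across successive changes, since log(1.5) + log(1.5) = log(2.25), which is one reason the log ratio behaves more predictably than percent change.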
Slide: 17
Objective Method for Choosing Effect Measure

Goal: Measure of effect should be as independent of
baseline value as possible^{3}
 Plot difference in pre and post values vs. the average of the
pre and post values. If this shows no trend, the simple
differences are adequate summaries of the effects, i.e., they
are independent of initial measurements.
 If a systematic pattern is observed, consider repeating the
previous step after taking logs of both the pre and post
values. If this removes any systematic relationship between
the average and the difference in logs, summarize the data
using logs, i.e., take the effect measure as the log ratio.
 Other transformations may also need to be examined
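The procedure above is graphical, but the same check can be sketched numerically by computing the least-squares slope of the differences on the averages, on the raw scale and then on the log scale. This is only a stand-in for looking at the plot; the function names and the data (a hypothetical treatment that lowers every value by 20%) are assumptions for illustration.

```python
import math

def slope(x, y):
    """Least-squares slope of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def difference_vs_average(pre, post, log_scale=False):
    """Return (averages, differences) of paired values, optionally after taking logs."""
    if log_scale:
        pre = [math.log(v) for v in pre]
        post = [math.log(v) for v in post]
    averages = [(a + b) / 2 for a, b in zip(pre, post)]
    differences = [b - a for a, b in zip(pre, post)]
    return averages, differences

pre = [50.0, 80.0, 100.0, 150.0, 200.0]
post = [0.8 * v for v in pre]  # hypothetical multiplicative effect

raw_slope = slope(*difference_vs_average(pre, post))
log_slope = slope(*difference_vs_average(pre, post, log_scale=True))
print(round(raw_slope, 3), round(log_slope, 3))
```

Here the raw differences trend with the average, so a simple difference would depend on the baseline level, while the differences in logs show no trend, pointing to the log ratio as the effect measure. Real data are noisy, so the plot itself should still be inspected rather than relying on a single slope.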
Slide: 18
Biostat / Epi Resources at UVa

We profit from strong ties with Statistics in the College
of Arts & Sciences
 Who to contact: Faculty & Executive Secretary of the Division of
Biostatistics & Epidemiology, Dept. of HES, Box 800717,
712

Frank E Harrell Jr, Professor and Chief
fharrell@virginia.edu, 48712
General, cardiovascular, nephrology, critical care medicine, health
services research and outcomes evaluation, pharmaceutical research
 Robert D Abbott, Professor
abbott@virginia.edu, 41687
GCRC, cardiovascular epidemiology, dementia studies, Medicine
 Mark R Conaway,
Professor
mconaway@virginia.edu, 48510
General, clinical trial design, longitudinal data, oncology
 Gina R Petroni, Associate Research Professor
gpetroni@virginia.edu, 48363
Cancer Center
 Viktor E Bovbjerg,
Assistant Professor
bovbjerg@virginia.edu, 48430
Epidemiology, chronic disease studies, especially
cardiovascular disease and diabetes
 Jae K Lee, Assistant Professor
jklee@virginia.edu, 48712
Statistical genetics, genomics,
bioinformatics
 Jim Patrie MS, Biostatistician III
jpatrie@virginia.edu, 48576
GCRC, general, experimental design
 Eric Bissonette MS, Biostatistician III
ebisson@virginia.edu, 48586
Cancer Center, interventional radiology, general
 Xin Wang MS, Biostatistician III
xwang@virginia.edu, 48519
Health services research and outcomes evaluation,
cardiology, general
 Mark Smolkin MS, Biostatistician II
msmolkin@virginia.edu, 48712
Cancer Center
 Mir Siadaty MD, MS, Biostatistician II
msiadaty@virginia.edu, 48712
Bioinformatics, general
 Brenda Lee
Executive Secretary
712
Inquiries, appointments
 General Email address: biostat@virginia.edu
 Department of Statistics Consulting Lab
 Biostatistics Consulting Service (hourly charge for data
analysis); see hesweb1.med.virginia.edu/biostat/services.htm
 Free (but currently limited because almost all our time is now
funded by successful grant applications): developing grant
proposals, study planning meetings

For Divisions not providing general ongoing support of a
portion of an M.S. biostatistician and a faculty biostatistician
or epidemiologist, assistance is on a first-come, first-served basis
with a minimum of 120 days advance notice before the grant
application deadline
 Clinical Trials Office (Lori Elder)
Slide: 19
How to Collaborate with Statisticians & Epidemiologists?

Willingness to explain the
details of the study. An
appropriate choice of the outcome, design, sample size,
data collection, etc. requires some knowledge of
the area being studied. Explaining these things can
not only provide a more efficient way of doing the
study, but can sometimes help to clarify issues that
may have been taken for granted.
Slide: 20
Collaboration Issues

Collaboration should begin early
 Too often the statistician will uncover a fatal flaw in data
collection too late, e.g., recording measurements
as ranges rather than raw data
 Early understanding on authorship; depends on whether e.g.
statistician serves as a "number cruncher" vs. as part of
the investigation or manuscript writing or she
develops/assimilates new methods for the purpose
of the project
 Best way to fund epi/biostat involvement is through %FTE
of grant support or general ongoing funding from the
divisional level
 Long-term goal of Division of Biostatistics &
Epidemiology is to have statisticians on staff who have
long-term collegial relationships with biomedical
researchers and a good understanding of specific
subject matter areas
Slide: 21
Education Opportunities at UVa

One-semester clinical trials methodology course
taught once/year by Lori Elder and Gina Petroni
 Short courses: Statistical Thinking in Biomedical
Research offered twice yearly through CME
 M.S. in Health Evaluation Sciences: Clinical
Investigation, Health Services Research and Outcomes
Evaluation, and Epidemiology tracks
 Comprehensive Introduction to Clinical Investigation,
intensive 4-week course for housestaff and other physicians
(month of July, beginning 2001)
 Health Evaluation Sciences Research Conference: Wednesdays
4:0:00p, 3rd floor Hospital West, Room 3182
 First Wednesday of the month:
Clinical Investigation Seminar sponsored jointly by the Center for
the Study of Complementary and Alternative medicine and DHES

Forum for research methodology, clinical trials design
 Ideal for presenting work/grant proposals in progress, for
critique by methodologists
 Contact fharrell@virginia.edu if you want to be put on
the emailing list
 Short courses in clinical investigation and biomedical ethics
References
 [1]

D. G. Altman and J. M. Bland.
Absence of evidence is not evidence of absence.
British Medical Journal, 311:485, 1995.
 [2]

J. C. Bailar III and F. Mosteller.
Medical Uses of Statistics.
NEJM Books, Boston, second edition, 1995.
 [3]

C. Begg, M. Cho, S. Eastwood, R. Horton, D. Moher, I. Olkin, et al.
Improving the quality of reporting of randomized controlled trials.
The CONSORT statement.
Journal of the American Medical Association, 276:637-639,
1996.
 [4]

T. C. Chalmers, H. Smith, B. Blackburn, B. Silverman, B. Schroeder, D. Reitman,
and A. Ambroz.
A method for assessing the quality of a randomized control trial.
Controlled Clinical Trials, 2:31-49, 1981.
 [5]

T. J. Cole.
Sympercents: symmetric percentage differences on the 100 log_e
scale simplify the presentation of log transformed data.
Statistics in Medicine, 19:3109-3125, 2000.
 [6]

CPMP Working Party.
Biostatistical methodology in clinical trials in applications for
marketing authorizations for medicinal products.
Statistics in Medicine, 14:1659-1682, 1995.
 [7]

P. C. Gøtzsche.
Blinding during data analysis and writing of manuscripts.
Controlled Clinical Trials, 17:285-293, 1996.
 [8]

L. Kaiser.
Adjusting for baseline: Change or percentage change?
Statistics in Medicine, 8:1183-1190, 1989.
 [9]

R. A. Kronmal.
Spurious correlation and the fallacy of the ratio standard revisited.
Journal of the Royal Statistical Society A, 156:379-392, 1993.
 [10]

T. A. Lang and M. Secic.
How to Report Statistics in Medicine: Annotated Guidelines for
Authors, Editors, and Reviewers.
American College of Physicians, Philadelphia, 1997.
 [11]

J. S. Maritz.
Models and the use of signed rank tests.
Statistics in Medicine, 4:145-153, 1985.
 [12]

L. Törnqvist, P. Vartia, and Y. O. Vartia.
How should relative changes be measured?
American Statistician, 39:43-46, 1985.
 1
 RCTs include crossover studies, which can be of
excellent quality when there are no carryover effects or when
carryover effects are understood well enough to be "subtracted out".
 2
 Because absolute risks of events vary with
disease severity, dictating that risk differences must vary.
 3
 Because of regression to
the mean, it may be impossible to make the measure of change
truly independent of the initial value. A high initial value
may be that way because of measurement error. The high value will
cause the change to be less than it would have been had the
initial value been measured without error. Plotting
differences against averages rather than against initial
values will help reduce the effect of regression to the mean.