Biostatistician Responsibilities

Requirements for Statistical Reports

All statistical reports produced by biostatisticians must be reproducible, meaning that the report can be "executed" to re-run all code and regenerate the final (usually PDF) report. It is highly recommended that you use R + knitr with the LaTeX packages spaper and knitrl to do this. An example report's source code may be found under Examples. The LaTeX packages are also available there, under packages. Copy the source code of these packages to texmf/tex/ under your home directory so that LaTeX will find them automatically.

The templates directory is also of interest. Note that the example report spaper.Rnw makes use of optional features of spaper whereby you can declare that some sections of the report are "supplemental"; these are shipped to a separate document, spapersupp.tex, for compilation using pdflatex. You also have the option to have supplemental sections moved to the end of the report, or to have them appear inline in the original order. For usage instructions for spaper see this in addition to the example and templates. Note: If you want an investigator to be able to omit the code when printing, the example script spaper.Rnw shows how to make knitr print all the code at the end of the report; use the global option knitrSet(echo=FALSE) to suppress code from appearing throughout the report. During the development stage, however, it is better to intersperse the code with the output to make the code easier to check.
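
One common knitr idiom for collecting all code at the end of a report is a final chunk that re-uses every earlier chunk via ref.label. The sketch below is a hedged illustration of that idiom, not a copy of what spaper.Rnw does: the miniature report is made up, and it uses knitr's own opts_chunk$set() rather than the Hmisc knitrSet() wrapper so that it depends only on knitr. Highlighting is turned off purely to keep the generated LaTeX easy to read.

```r
library(knitr)

# A tiny stand-in for a report in Rnw syntax (hypothetical, not spaper.Rnw)
rnw <- c(
  "<<setup, include=FALSE>>=",
  "knitr::opts_chunk$set(echo = FALSE, highlight = FALSE)",
  "@",
  "",
  "<<analysis>>=",
  "total <- sum(1:10)",
  "total",
  "@",
  "",
  "\\section*{Code}",
  "<<allcode, ref.label=knitr::all_labels(), echo=TRUE, eval=FALSE>>=",
  "@"
)

# With text=, knit() returns the generated LaTeX instead of writing a file
tex <- knit(text = rnw, quiet = TRUE)
cat(tex)
```

In the generated LaTeX the analysis chunk contributes only its output, and the code of all chunks is listed once under the final Code heading without being re-run (eval=FALSE).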

Besides our longstanding requirement of making statistical reports reproducible, there are many reasons for including the code that produced the report in the report itself. These reasons include:
  • Documentation: The code is the ultimate documentation for what was done. When the code gets separated from the report it is difficult for a statistician colleague (and sometimes for the investigator) to tell exactly what was done. Having code in the report is a component of making the research fully reproducible whenever one is relying on printed reports.
  • Quality control: Another statistician who looks at the code may notice a mismatch between the code and the intention of the analysis.
  • Improving programming skills: A supervisor may see a staff biostatistician's code more frequently and find it easier to write suggestions for programming efficiency and clarity.
Because of these reasons, beginning 2015-02-10 it is required that virtually all statistical reports include the code used to manipulate the data, do calculations, and produce graphs and tables. This allows:
  • other statisticians to see exactly what you did when an investigator happens to ask a question
  • the investigator to see that this is real work and not just making software menu selections
  • investigators who leave Vanderbilt mid-project to have an easier time getting their new statistician up to speed
  • the statistician who created the report to sometimes save time listing model covariates when they can easily be seen in the printed model formula

Exceptions to this policy:
  • Reports that are delivered to committees overseeing a study, e.g., a data monitoring committee or clinical trial steering committee
  • Final version of a report that serves as a manuscript

Note that when a statistical report is submitted as an online supplement to a journal article, the journals will applaud having the code in this supplement.

It is important to include standard model output when fitting statistical models so the report contains the sample size actually used in each analysis as well as R-squared and other measures. This is best done in knitr when using the rms package as follows, so that the results are typeset using LaTeX.

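library(rms)            # lrm() and its print method come from the rms package
f <- lrm(...)
print(f, latex=TRUE)    # typeset the standard model output using LaTeX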
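
To illustrate why the sample size actually used belongs in the report, here is a minimal sketch with simulated data. It is an assumption-laden stand-in: it uses base R's glm() instead of rms::lrm() so that it runs without extra packages, the data are invented, and it assumes R's default na.action (na.omit).

```r
set.seed(1)

# Simulated data: 100 subjects, 15 of whom are missing the predictor
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, size = 1, prob = plogis(0.5 * d$x))
d$x[1:15] <- NA

# Under the default na.action, incomplete rows are dropped without a message
fit <- glm(y ~ x, family = binomial, data = d)

n_total <- nrow(d)    # rows in the data set
n_used  <- nobs(fit)  # rows that actually entered the fit
cat("Rows in data:", n_total, " Rows used in fit:", n_used, "\n")
```

The data set has 100 rows but only 85 enter the fit, a discrepancy a reader can spot only if the model output is printed in the report.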

With the knitrl package and the Hmisc knitrSet function, the best way to control the appearance of code is to use knitrSet(echo=TRUE) (this is the default behavior anyway).

Additional Notes

  • For slow code sections, specifying cache=TRUE in the chunk header, or as a global option in the call to knitrSet(), caches the slow calculations so that subsequent runs are fast
  • Commonly used code that is long can be source()'d into the document and should be documented in some central place
  • Commonly used code that is not long can be handled very effectively with the "shared" code mechanism in knitr, e.g.

# Contents of shared.R

## @knitr code-segmentA
some code here

## @knitr code-segmentB
more code here

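# Analysis file my.Rnw

% Read the labeled code segments from shared.R
<<echo=FALSE>>=
read_chunk('shared.R')
@

% Bring in the code defined in shared.R for code-segmentA
<<code-segmentA>>=
@

% Likewise for B
<<code-segmentB>>=
@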
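
The shared-code mechanism rests on knitr's read_chunk(), which registers each "## @knitr label" segment of a script under its label. The following sketch can be run outside a report to check which labels a script defines; the script contents are placeholders written to a temporary file, not the real shared.R.

```r
library(knitr)

# Hypothetical stand-in for shared.R, written to a temporary file
shared <- tempfile(fileext = ".R")
writeLines(c(
  "## @knitr code-segmentA",
  "x <- 1:10",
  "",
  "## @knitr code-segmentB",
  "mean(x)"
), shared)

# Register the labeled segments with knitr
read_chunk(shared)

# Labels now known to knitr; an empty chunk with a matching label
# in the .Rnw file is filled with the corresponding code when knitted
segment_labels <- all_labels()
print(segment_labels)
```

Because the segments are looked up by label, the same shared.R can serve several reports, and a change to a segment propagates to each report the next time it is run.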

If you have commonly used scripts that are not part of an R package, there is an easy way to bring them into a report reproducibly and transparently if the scripts are on a web server. GitHub can be great for this, but you can also use Box, Dropbox, and our own servers, plus wiki attachments. Here are some examples:

getRs('importREDCap.r', put='source')   # goes to rscripts Github
# Same thing longhand:
source('', echo=TRUE)
source('', echo=TRUE)

For extensive data management steps such as data import and merging, especially when the code is long, it is sometimes best to move those steps to another file (you might call it create.r or create.Rnw in your project directory) and automatically document where the source code is by showing your source call. You could also write text in your report explaining where this code is, but that would be less automated and would have to be updated if you restructured the directories.

source('', echo = FALSE)
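
As a self-contained sketch of this pattern, the example below creates a hypothetical create.r in a temporary project directory and then sources it. The directory, the data frames, and the object name analysis_data are all invented for illustration.

```r
# Hypothetical project layout: a data-management script named create.r
project_dir <- tempfile("project")
dir.create(project_dir)
create_path <- file.path(project_dir, "create.r")

# Stand-in for the import and merging steps
writeLines(c(
  "baseline <- data.frame(id = 1:4, age = c(51, 63, 47, 70))",
  "labs     <- data.frame(id = c(1L, 2L, 4L), crp = c(3.2, 8.1, 1.4))",
  "analysis_data <- merge(baseline, labs, by = 'id', all.x = TRUE)"
), create_path)

# In the report, this visible call documents where the code lives,
# while echo = FALSE keeps the long script itself out of the output
source(create_path, echo = FALSE)

print(dim(analysis_data))
```

The merged data frame is available to later chunks, and the path shown in the source() call stays correct as long as the call itself runs, which is the advantage over describing the location in prose.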

Other useful links:

Submission of Statistical Reports for Review

Topic revision: r6 - 11 Feb 2016, FrankHarrell
