How to use knitr

Introduction

knitr is a newer way to create reproducible documents with R and LaTeX. It has several advantages over Sweave, including the following:
  1. A wider variety of graphics devices are supported
  2. Support for the LaTeX listings package is built in
  3. The Sweavel.sty macro is built in
    • NEW Sweavel.sty was improved in June 2015 to use a method that does not require color but still distinguishes input from output from main text. Get the style file here and see SweaveTemplate for more information.
  4. NEW The knitrl.sty LaTeX package supersedes Sweavel. See https://github.com/harrelfe/rlatex
  5. Specifying figure captions in chunk headers
  6. Ability to specify other graphics parameters in chunk headers so as to not clutter the R code output in your report
  7. Better caching of compute-intensive code chunks
  8. Easy to include animations in pdf reports
  9. You don't need to put print() around lattice objects to make the plot be rendered
  10. Chunks can produce multiple plots, and you can easily have LaTeX format them into a matrix of plots
Most of this page is specific to using knitr for LaTeX output.

Basics

  • Your code needs to be in a file with a .Rnw extension. Otherwise, the files are not found automatically.
  • Chunk options need to be valid R expressions. For example, in Sweave you would use <<eval=true>>=, whereas in knitr you need to use <<eval=TRUE>>=. For the same reason, file paths and other character values must be put in quotes.
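To make the point about chunk options concrete, here is a minimal sketch that knits a small .Rnw fragment held in a string. The chunk names and contents are made up for illustration; it assumes only that the knitr package is installed.

```r
library(knitr)

# A tiny .Rnw fragment (hypothetical content). Chunk options are R
# expressions: logicals are TRUE/FALSE and character values are quoted.
rnw <- c(
  "<<skipped, eval=FALSE>>=",
  "Sys.sleep(60)   # shown in the report but never run",
  "@",
  "<<shown, echo=FALSE, results='asis'>>=",
  "cat('value:', 1 + 1)",
  "@"
)

# With text= supplied, knit() returns the LaTeX output as a string
tex <- knit(text = rnw, quiet = TRUE)
cat(tex)
```

Writing eval=false or results=asis here would fail, because false and asis are not defined R objects.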

Workflow

  • Make sure you install the knitr R package. You can use install.packages("knitr")
  • In each session in which you want to compile your document, first load the knitr package using library(knitr)
  • Once you have a .Rnw file with R code and LaTeX markup, run knit("filename.Rnw") within R to get a .tex file.
  • Then compile your .tex file as usual. I like to use system("pdflatex filename") from within R on a linux machine.
  • Linux Shell Script for Running knitr in Batch Mode: Alternatively, you can run the R knit function in a linux shell using a script. For example, if you put the following commands in ~/bin/knitr and do chmod +x ~/bin/knitr you can run file xxx.Rnw through knitr in batch mode, and get an automatic pop-up window showing progress, knitr options sensed for each chunk, and errors.

#!/bin/bash
# Usage: knitr xxx   (knits xxx.Rnw in a pop-up xterm that stays open)
rm -f messages.txt
xterm -hold -e R --no-save --no-restore -e "require(knitr); knit('$1.Rnw')"
echo PDF graphics produced:
ls -lgt *.pdf
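
The same workflow can also be driven entirely from R. The following is only a sketch: knit_to_tex is a hypothetical helper name, not part of knitr, and the pdflatex step is switched off by default because it assumes a working LaTeX installation.

```r
library(knitr)

# Hypothetical helper (not part of knitr): knit an .Rnw file and,
# optionally, run pdflatex on the resulting .tex file.
knit_to_tex <- function(rnw, compile = FALSE) {
  old <- setwd(dirname(normalizePath(rnw)))  # work next to the document
  on.exit(setwd(old))
  tex <- knit(basename(rnw), quiet = TRUE)   # returns the .tex file name
  if (compile) system2("pdflatex", c("-interaction=nonstopmode", tex))
  file.path(getwd(), tex)
}

# Demo document written to a temporary directory
demo <- file.path(tempdir(), "demo.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<sum>>=",
  "1 + 1",
  "@",
  "\\end{document}"
), demo)

tex_file <- knit_to_tex(demo)  # use compile = TRUE if pdflatex is installed
```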

NEW Marking Changes

When providing collaborators with multiple updates to statistical reports it is often a good idea to make it easy for them to find recent changes. The LaTeX changebar style is an excellent way to do this using vertical bars in the (by default) right margin. This is particularly useful with knitr because if a section you mark as having changed spans a code chunk, all the output from that chunk will have a vertical bar in the margin. A user-defined macro \changes in the example document in the attachments in SweaveLatex makes it easy to use multiple change bars corresponding, for example, to multiple dates on which edits are made. The bars can vary in color and/or line thickness, and a key is added to the text at the point at which \changes is issued. To produce a final document without change bars, it is easy to set a LaTeX variable so that changebar markup is ignored.

The changebar LaTeX package only allows up to 4 different colors (e.g., 4 types of changes / 4 authors). A more general approach uses the marchange LaTeX package available here. This does not draw a continuous vertical line for the block of changes but puts a beginning and an ending symbol in the right margin to indicate the block, for example a down arrow and an up arrow of a certain color. The user can specify a different symbol and color for each block. As with changebar with the \changes macro, the beginning block commands are \cbstarta \cbstartb ... but unlike \changes the ending commands are \cbenda \cbendb .... A table is automatically constructed to define the block colors and symbols. The LaTeX marchange.sty file contains a full example.

R output

  • Where you would use <<results=tex>>= in Sweave, you now need to use <<results="asis">>=
  • Errors and warnings are handled differently than in Sweave. In Sweave, the warnings and errors from R code would appear on screen after running the Sweave() function on the file. Now they appear in the output .tex file. They can be controlled with the warning and error chunk options.
  • Output will be formatted according to the current R options. For example, you can set options(digits = 2, scipen = 6) in the R code.
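
As a small illustration of the warning option, the sketch below knits the same chunk twice, once with warning=TRUE and once with warning=FALSE, and checks whether the warning text ends up in the LaTeX output. The chunk name and the warning message are made up for the example.

```r
library(knitr)

# Build a one-chunk .Rnw fragment with the given value of the warning option
rnw_with <- function(warn) c(
  sprintf("<<check, echo=FALSE, warning=%s>>=", warn),
  "warning('unit mismatch')",
  "@"
)

shown  <- knit(text = rnw_with("TRUE"),  quiet = TRUE)
hidden <- knit(text = rnw_with("FALSE"), quiet = TRUE)

grepl("mismatch", shown,  fixed = TRUE)   # warning text is in the .tex output
grepl("mismatch", hidden, fixed = TRUE)   # warning text is left out
```

Depending on the knitr version, a warning suppressed this way may be printed to the R console instead of being dropped altogether.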

Inline R output

  • As in Sweave, you can use \Sexpr{}.
  • You can conveniently format this inline R output by setting the inline option of knit_hooks. This way you don't have to wrap everything inside of \Sexpr{} with round(). Here is an example (non-numeric values are passed through unchanged):
  knit_hooks$set(inline = function(x) {
   if (is.numeric(x)) round(x, 3) else x})
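
Here is a sketch of the hook in action. It sets the hook in a setup chunk inside the document, as in the setup example under Setting options, because knitr only installs its LaTeX output hooks automatically when no hook has been changed before knit() is called. The text of the fragment is made up for illustration.

```r
library(knitr)

rnw <- c(
  "<<setup, include=FALSE>>=",
  "knit_hooks$set(inline = function(x) if (is.numeric(x)) round(x, 3) else x)",
  "@",
  "The value of pi is \\Sexpr{pi}."
)

tex <- knit(text = rnw, quiet = TRUE)
cat(tex)
```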

R code in output document

  • You have the option to include your code in the output report. There are lots of options about this under "Code Decoration" on the knitr website.
  • If you do include your source code in the output document, any comments will also be displayed. This is different from the behavior in Sweave, which omits comments.
  • The code has background shading and is highlighted.
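  • The code can be tidied using the function tidy_source() from the formatR package (formerly tidy.source()), and this is controlled by the tidy option. Tidying was on by default in early versions of knitr; current versions default to tidy=FALSE. I dislike tidying because it gets rid of spacing, so I keep it off, but you can also customize it using tidy.opts. See the knitr website for details.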

Plots

  • To include figures, you don't need to use any options in the chunk headers, unless you want to suppress plots. In Sweave, you needed to use <<fig=true>>=.
  • Instead of wrapping the R chunk in \begin{figure} \end{figure}, you can automatically get the figure environment by adding fig.cap="caption".
    • Sometimes, within the figure caption, you may want to refer to an R object that is created within that chunk. Normally, the object must be previously defined. You can set opts_knit$set(eval.after = "fig.cap") to allow you to refer to R objects created within a chunk in the fig.cap argument. Note that eval.after is a package option rather than a chunk option.
  • You can include multiple graphs within one R chunk, and they will all be included by default. If you include two figures within a chunk, they are automatically put side-by-side and all with one caption as indicated by fig.cap. I tried putting two figures in a chunk and using a vector of length 2 for fig.cap, but only the first caption was used.
    • You can put captions on individual subfigures by using the option fig.subcap. You also need to have \usepackage{subfig} in the preamble for it to work.
  • You can choose the device for figures using the 'dev' option. You can set it to "pdf", "postscript", "tiff". It should be a character vector, meaning you can ask for the plots to be made with more than one device.
  • You can automatically put all generated figures into a separate directory using the "fig.path" chunk option. You need to give it a character string. I use fig.path = "graphics/plot", which automatically makes a new folder called "graphics" in my working directory, puts all the figures inside, and causes the file name of each plot to begin with "plot". If your chunk is unnamed, the rest of the file name will be "unnamed-chunk-45" by default. If you have a name for your chunk, the chunk name will be part of the file name.
  • Amazing discovery: If you have multiple plots within one chunk that will run over more than one page, if fig.show is set to "hold", adding a fig.cap in the chunk header will cause only the first page to display on the output document. If you leave out the fig.cap, all the plots will display on the output. This is because the figure environment (used when you use a figure caption) does not allow spanning over multiple pages. One work-around is to make the figures smaller (see Plot sizes below).
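
The following sketch pulls together fig.cap, dev and fig.path for a single plot. It runs in a temporary directory so that the graphics folder does not clutter your project; the chunk name scatter and the caption are made up for illustration.

```r
library(knitr)

old_wd <- setwd(tempdir())   # figures are written relative to the working dir

rnw <- c(
  "<<scatter, echo=FALSE, dev='pdf', fig.path='graphics/plot', fig.cap='Speed versus stopping distance'>>=",
  "plot(cars)",
  "@"
)

tex <- knit(text = rnw, quiet = TRUE)
plot_files <- list.files("graphics", pattern = "^plotscatter")
setwd(old_wd)

cat(tex)      # contains a figure environment, caption and \includegraphics
plot_files    # file name starts with "plot" followed by the chunk name
```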

Creating figure labels for latex

  • For figures, the chunk names are used to make the labels for LaTeX that you can refer to using \ref{}. To refer to a figure created in a chunk called "jill", the default label would be "fig:jill". (The chunk name is automatically prefixed by the fig.lp option, whose default is "fig:".) For example, if in LaTeX you wanted to use \ref{fig:jill}, you would put the R code that makes the plot inside of a chunk with this header: <<jill>>=, together with a fig.cap option so that a figure environment is generated. If you want to add the prefix manually or not at all, you can set fig.lp to "". This can be done in the chunk header or with the opts_chunk$set() function (see section on options, below).
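
A quick way to see which label knitr generates is to knit a captioned chunk and look for \label in the output. This sketch does so for the default prefix and for fig.lp set to an empty string; the caption text is made up.

```r
library(knitr)

old_wd <- setwd(tempdir())   # keep the generated figure files out of the way

# One captioned chunk named "jill"; 'extra' appends further chunk options
jill_rnw <- function(extra = "") c(
  sprintf("<<jill, echo=FALSE, fig.cap='A plot'%s>>=", extra),
  "plot(1:10)",
  "@"
)

with_prefix <- knit(text = jill_rnw(), quiet = TRUE)
no_prefix   <- knit(text = jill_rnw(", fig.lp=''"), quiet = TRUE)
setwd(old_wd)

grepl("\\label{fig:jill}", with_prefix, fixed = TRUE)
grepl("\\label{jill}", no_prefix, fixed = TRUE)
```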

Plot sizes

  • These are controlled by fig.height, fig.width, out.height, out.width. The first two need numeric arguments, while the second two need character arguments (in quotes).
  • fig.width and fig.height both have a default value of 7 (inches). They control the size of the plot made by the plotting device, not the size in your output document.
  • out.height and out.width require units. Here's an example: out.width="4in"
  • If you give character arguments to fig.height or fig.width, you'll get the following warnings:
Warning message:
‘mode(width)’ and ‘mode(height)’ differ between new and previous
         ==> NOT changing ‘width’ & ‘height’

fig.show

  • This applies when multiple figures are created within one R chunk.
  • Controls how to show or arrange the plots.
  • "asis" is the default. It puts each figure in a separate figure environment. If you include a figure caption (fig.cap), it puts the same caption on all figures within that chunk.
  • "hold" This will put all the figures created in that chunk inside one figure environment.
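
The sketch below combines the size options with fig.show for a chunk that draws two plots. It counts the figure environments and \includegraphics commands in the output for "asis" and for "hold". The chunk name, caption and sizes are made up for illustration.

```r
library(knitr)

old_wd <- setwd(tempdir())

# Two plots in one chunk; fig.width/fig.height are numeric (inches on the
# device), out.width is a character string with units.
pair_rnw <- function(show) c(
  sprintf("<<pair, echo=FALSE, fig.width=4, fig.height=3, out.width='2.5in', fig.cap='Two views', fig.show='%s'>>=", show),
  "plot(cars)",
  "hist(cars$speed)",
  "@"
)

# Count fixed-string occurrences of 'pattern' in the knitted output
count <- function(pattern, x) lengths(regmatches(x, gregexpr(pattern, x, fixed = TRUE)))

asis <- knit(text = pair_rnw("asis"), quiet = TRUE)
hold <- knit(text = pair_rnw("hold"), quiet = TRUE)
setwd(old_wd)

count("\\begin{figure}", asis)      # one figure environment per plot
count("\\begin{figure}", hold)      # a single shared figure environment
count("\\includegraphics", hold)    # still two plots inside it
```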

Caching results

  • knitr provides an easy way to automatically save results.
  • To save the results of an R chunk, set the cache option of that chunk to TRUE.
  • According to the section on caching on the knitr options page http://yihui.name/knitr/options, there are three ways to indicate dependencies. This refers to how knitr knows which chunks with cache=TRUE need to be re-run at what time. knitr can figure out which R chunks (within the current file) have been modified since the last run. Based on this and the dependency options you set, knitr will decide which cached chunks to rerun. Here are the three ways:
    • Automatically. Let knitr figure out automatically which chunks the cached chunk depends on. This is done by setting the autodep option to TRUE. It is FALSE by default.
    • Explicitly indicate which chunks each cached chunk depends on. This is set using the dependson option, which is NULL by default. You can give a character string which is the name of the chunk that the current chunk depends on, or you can give numbers indicating the position of that chunk in the document.
    • Neither. By default, if you don't set automatic or explicitly list the chunks that a chunk depends on, that chunk is assumed to be independent. In this case, it will only be re-executed if the code within the chunk is modified.
  • NB: If you are sourcing a separate R file within a chunk, knitr will not be able to recognize when the file has been modified, so any chunk that depends on results from a sourced R code file should not be cached unless every time the file is modified, you delete the cached files.
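
To see caching at work, the sketch below knits a small two-chunk document three times in a temporary directory. The first chunk appends a line to a log file each time it is really executed, so the log shows how often that happened. The chunk names and the file name runs.log are made up for illustration.

```r
library(knitr)

old_wd <- setwd(tempdir())   # keep cache/ and the log out of the project
unlink(c("cache", "runs.log"), recursive = TRUE)

# 'base' logs every real execution; 'derived' declares its dependency on it
make_rnw <- function(x) c(
  "<<base, cache=TRUE>>=",
  "cat('base ran\\n', file = 'runs.log', append = TRUE)",
  sprintf("x <- %s", x),
  "@",
  "<<derived, cache=TRUE, dependson='base'>>=",
  "x * 2",
  "@"
)

run1 <- knit(text = make_rnw(21), quiet = TRUE)   # executes the chunks
run2 <- knit(text = make_rnw(21), quiet = TRUE)   # unchanged: uses the cache
runs_unchanged <- length(readLines("runs.log"))
run3 <- knit(text = make_rnw(50), quiet = TRUE)   # code of 'base' modified
runs_modified <- length(readLines("runs.log"))
setwd(old_wd)

c(runs_unchanged, runs_modified)
```

The log gains a line only when the code of base changes. Because derived lists base in dependson, it should be rebuilt on the third run as well.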

Setting options

  • You can set options for an individual R chunk inside of the chunk header (<<>>=)
  • Using Sweave, global options could be set using latex \SweaveOpts{}. In knitr, options can be set globally via R code. The global options set within an R chunk will take effect in the next chunk, and remain until they are reset (or the R session ends).
  • There is an example here. It includes setting the par object globally.
Here is an example. (Note that using comments or line breaks in the call of opts_chunk$set causes problems. I added comments here only to explain the arguments.) NB: As of R version 3 there was a problem with the call h = knitr:::hilight_source(x, 'latex', list(prompt=FALSE, size='normalsize')); adding highlight=FALSE to the list corrected the error.
<<setup, include=FALSE>>=
setwd("/home")
source("../file.R", chdir = TRUE)

opts_chunk$set(
   dev="pdf",
   fig.path="graphics/plot",                                         # puts all figures in a folder in the current directory called "graphics." If you use this, you don't need to create the graphics folder first. It also starts each figure's filename with "plot"
   fig.lp = "",                        
   out.width=".47\\textwidth",
   fig.keep="high",
   fig.show="hold",
   fig.align="center",
   comment=NA)                                                      # Suppresses "##" in R output. The "##" is useful if you want people to be able to copy and paste your code in the output file, though.
# this allows for code formatting inline.  
knit_hooks$set(inline = function(x) {
   if (is.numeric(x)) return(knitr:::format_sci(x, 'latex'))
   x = as.character(x)
   h = knitr:::hilight_source(x, 'latex', list(prompt=FALSE, size='normalsize', highlight=FALSE))
   h = gsub("([_#$%&])", "\\\\\\1", h)
   h = gsub('(["\'])', '\\1{}', h)
   gsub('^\\\\begin\\{alltt\\}\\s*|\\\\end\\{alltt\\}\\s*$', '', h)})
par(las = 1)
options(width = 90, scipen = 6, digits = 3)
@
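
A much smaller sketch shows that options set globally in one chunk take effect in the chunks that follow. Here a setup chunk turns off code echoing and the "##" prefix, and the next chunk is knitted under those settings. The chunk contents are made up for illustration.

```r
library(knitr)

rnw <- c(
  "<<setup, include=FALSE>>=",
  "opts_chunk$set(echo = FALSE, comment = NA)",
  "@",
  "<<result>>=",
  "1 + 1",
  "@"
)

tex <- knit(text = rnw, quiet = TRUE)
cat(tex)   # the output line reads "[1] 2" without the "##" prefix
```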

knitr Setup for Making Publication-Ready Reports/Articles/Books | Example Report (see attachments there)

Code to put in .Rprofile

The .Rprofile file is a file on the machine of your R installation (likely your computer) that R automatically runs on start up. You can add code there as a way to automate code that you always want to run at the beginning of every R session.

Here is an example of some code to put in your .Rprofile:

options(
   help_type = 'html',
   lib='/usr/local/lib/R/site-library', 
   width = 200,
   browser = 'kfmclient newTab', 
   repos = "http://debian.mc.vanderbilt.edu/R/CRAN/", 
   warnPartialMatchDollar = TRUE,
   #warnPartialMatchArgs = TRUE,
   show.error.locations = "top",      # this gives the location of the error. Source location highest on the stack (inside)
   xdvicmd = 'okular')

library(knitr)
The knitrSet function in the Hmisc package helps in setting up knitr usage. Typical usage:

<<echo=FALSE,results='hide'>>=
require(Hmisc)
knitrSet()
@

Code to Use at the Beginning of a Report or Chapter

If you are developing a report that is a single unit, you may want something like the following after \begin{document}

<<echo=FALSE>>=
require(Hmisc)
knitrSet() 
knitrSet(w=4.5, h=3.5)
@
The last command will override the default size of plots to 4.5 by 3.5 inches. Figure environments are generated if you specify cap= in <<>>=

If you have more than one chapter or section and you want different figure labels or file name prefixes to appear for each section/chapter, use something like:

knitrSet('xxx')
where xxx is the short name for the section. Then if you create a figure using

<<yyy,cap='Figure caption',scap='Short figure caption for table of figures'>>=
the graphic file will be named xxx-yyy.pdf and the figure will have a label of fig:xxx-yyy.

Code to Use for a Chunk with Various Options

<<bigplot,h=7,w=7,cap='A \\textbf{caption} for the figure'>>=   # need to double backslashes to escape them
<<example2,cap=paste('Survival curves for study', study_name)>>=
<<this,results='asis'>>=   # need to put character values in quotes with knitr, unlike Sweave
<<that,ps=6,mfrow=c(2,2)>>=
plot(something)  # Figure (*\ref{fig:xxx-that}*)   [symbolic reference from R to LaTeX]
Note that chunk headers must be contained on one physical line in the noweb source, so to enter a very long caption you need to turn auto word wrap off in your editor.

Using a Macro Preprocessor with knitr to Factor Out Repetitive Operations

NOTE: There is a simpler way to do this with knitr using the knit_expand function. Also see here. This page will be updated in the future.
One frequently needs to write a large number of LaTeX text and R code chunks in which only a handful of variables change from one section to another. For example, there may be a key independent or dependent variable that is replaced by another variable, but the remainder of the sentences and R code remain unchanged. Code maintenance and editing is made much easier by factoring out common text and code. The right macro preprocessor can allow one to use a simple syntax for variable name and other substitutions.

The Python pyexpander preprocessor is a good fit for this task, and it allows one to use arbitrary Python code to do such things as appending characters to base variable names or capitalizing the first letter of each word in a section title. The main drawback to pyexpander is that each $ used by LaTeX or R must be escaped by prefixing it with \. To install the preprocessor on an Ubuntu/Debian-based linux system, get the .deb file from http://sourceforge.net/projects/pyexpander/files/deb then use the command sudo dpkg -i ...deb . This creates an executable program named expander.py. Then define a bash script in ~/bin named pye2r :

expander.py --eval "$2" "$1.pye" > "$3.Rnw"
Here is an example invocation of pye2r : pye2r test 'x="age";y="bp"' test2 which will preprocess file test.pye to create file test2.Rnw, replacing macro variables x and y with variable names age and bp.

To run an entire analysis report on different key variables it is useful to create a bash script to run pye2r with a series of variable substitutions, then to combine all the resulting .Rnw files into one combined .Rnw file for insertion into the master .Rnw file. The master file will typically contain all the LaTeX and knitr setup along with code chunks that do not need to be repeated, such as descriptive statistics on the entire dataset. In the following example, four bone mineral density (BMD) measurements, corresponding to four bones, are analyzed in turn. When one BMD measure is the key predictor, the other three are adjusted for. Only the main analysis variable $v will appear in certain interactions. The combined processed file allbmd.Rnw is run by knitr inside the master document using the command \Sexpr{knit_child('allbmd.Rnw')} in a LaTeX chunk. Here is the script named maker that runs pye2r:

pye2r repbmd 'v="hip";oth1="lumbar";oth2="femur";oth3="forearm"' hip
pye2r repbmd 'v="lumbar";oth1="hip";oth2="femur";oth3="forearm"' lumbar
pye2r repbmd 'v="femur";oth1="hip";oth2="lumbar";oth3="forearm"' femur
pye2r repbmd 'v="forearm";oth1="hip";oth2="lumbar";oth3="femur"' forearm
cat hip.Rnw lumbar.Rnw femur.Rnw forearm.Rnw > allbmd.Rnw
rm -f hip.Rnw lumbar.Rnw femur.Rnw forearm.Rnw
Here is the beginning of repbmd.pye. Note that chunk names are prefixed by $v to create unique chunk names that start with hip, lumbar, etc. $v is the base name of the main variable of interest, and 0, 26, 52 are appended to the variable name $v to denote the BMD measure at baseline (time 0), 26w, and 52w.

% Usage: ./maker
$begin
$extend(v,oth1,oth2,oth3)    $# allow referencing by $v not just $(v)
$py(v0=v+"0")                $# create new variable with 0 appended
$py(v26=v+"26")
$py(v52=v+"52")
$py(oth10=oth1+"0")
$py(oth20=oth2+"0")
$py(oth30=oth3+"0")
$py(vupper=str.title(v))     $# capitalize first letters
$extend(oth10,oth20,oth30,v0,v26,v52,vupper)  $# allow easy referencing of new var

\section{Analysis of $vupper Bone Mineral Density}
\subsection{26w BMD and Baseline Predictors}
In this section I model 26w interpolated BMD using baseline BMD (all
four measures, not just the target measure), baseline weight, age,
sex, race, \$t\$-score stratum, and treatment.  I first fit a saturated
model with respect to all of the continuous predictors, using
restricted cubic splines with 5 knots, then potentially use partial
\$\chi^2\$ (blinded to nonlinearity components) to reassign degrees
of freedom.  
<<$v-sat26,results="asis">>=
f <- ols($v26 ~ sex*rcs($v0,5) + rcs($oth10,5) + rcs($oth20,5) +
         rcs($oth30,5) + sex*rcs(wt0,5) + trtp + rcs(age,5) + race + sex +
         blppar + bltscgrp, data=d)
options(prType='latex')
print(f, coefs=FALSE)
lan(f)
@

The global test for nonlinearity is nowhere close to being significant, so
all nonlinear effects will be dropped.
$end
The beginning of file allbmd.Rnw follows.

\section{Analysis of Hip Bone Mineral Density}
\subsection{26w BMD and Baseline Predictors}
In this section I model 26w interpolated BMD using baseline BMD (all
four measures, not just the target measure), baseline weight, age,
sex, race, $t$-score stratum, and treatment.  I first fit a saturated
model with respect to all of the continuous predictors, using
restricted cubic splines with 5 knots, then potentially use partial
$\chi^2$ (blinded to nonlinearity components) to reassign degrees
of freedom.  
<<hip-sat26,results="asis">>=
f <- ols(hip26 ~ sex*rcs(hip0,5) + rcs(lumbar0,5) + rcs(femur0,5) +
         rcs(forearm0,5) + sex*rcs(wt0,5) + trtp + rcs(age,5) + race + sex +
         blppar + bltscgrp, data=d)
options(prType='latex')
print(f, coefs=FALSE)
lan(f)
@

The global test for nonlinearity is nowhere close to being significant, so
all nonlinear effects will be dropped.
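
As mentioned in the note at the top of this section, knit_expand can do this kind of substitution without an external preprocessor. The following is only a sketch of how the BMD example might look: it uses knit_expand's default {{ }} delimiters, shortens the model formula, and the template variable names v, vupper and others are my own choices.

```r
library(knitr)

# Template with {{ }} placeholders; only the start of the section is shown
template <- c(
  "\\section{Analysis of {{vupper}} Bone Mineral Density}",
  "<<{{v}}-sat26,results=\"asis\">>=",
  "f <- ols({{v}}26 ~ sex*rcs({{v}}0,5) + {{others}}, data=d)",
  "@"
)

bones <- c("hip", "lumbar", "femur", "forearm")

# Expand the template once per bone, adjusting for the other three measures
sections <- vapply(bones, function(b) {
  knit_expand(
    text   = template,
    v      = b,
    vupper = paste0(toupper(substring(b, 1, 1)), substring(b, 2)),
    others = paste0("rcs(", setdiff(bones, b), "0,5)", collapse = " + ")
  )
}, character(1))

cat(sections[["hip"]])
```

The expanded text could then be written to allbmd.Rnw with writeLines(sections, "allbmd.Rnw") and included from the master document with knit_child as before. No $ escaping is needed with this approach.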

Topic revision: r40 - 25 Nov 2016, FrankHarrell
 
