---
title: "R Workshop"
author: "Cole Beck"
references:
- URL: http://miktex.org/download
id: miktex
title: 1
- URL: http://www.tug.org/mactex/morepackages.html
id: mactex
title: 2
- URL: http://yihui.name/knitr/
id: knitr
title: 3
- URL: http://yihui.name/en/2013/10/markdown-or-latex/
id: yihui
title: 4
- URL: https://github.com/yihui/knitr-examples
id: knitr-examples
title: 5
- URL: http://yihui.name/knitr/options/#chunk_options
id: knitr-chunk
title: 6
- URL: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/RConfiguration/Rprofile
id: rprofile
title: 7
- URL: https://github.com/harrelfe/rscripts
id: rscripts
title: 8
- URL: http://www.statmethods.net/advgraphs/ggplot2.html
id: ggplot
title: 9
output:
pdf_document:
number_sections: yes
toc: yes
toc_depth: 1
html_document:
number_sections: yes
toc: yes
toc_depth: 1
---
```{r, echo = FALSE}
library(knitr)
opts_chunk$set(comment = NA)
options(width = 85)
```
# Installing LaTeX
LaTeX is an open source typesetting system used for document preparation. It should be installed for R and RStudio to produce PDF reports. It is a tool for advanced users who wish to compose advanced tables, use cross-referencing, indexing, or have finer control over document layout. Others may use Markdown as an alternative to produce PDF and HTML reports.
## Windows
Install [MiKTeX][miktex] [-@miktex] - choose Basic MiKTeX Installer
Applications will likely be installed in `C:\Program Files (x86)\MiKTeX 2.9\miktex\bin`.
Use the `mpm` application to install additional LaTeX packages.
## OS X
Install [MacTeX][mactex] [-@mactex] - choose BasicTeX.pkg
Applications will likely be installed in `/usr/texbin`.
Use the `tlmgr` command-line program to install additional LaTeX packages. As an example, this installs the `framed` style package:
```
sudo tlmgr install framed
```
## Linux
Install TeXLive - Ubuntu flavored distributions may use the command-line program `apt-get`:
```
sudo apt-get install texlive-base texlive-latex-recommended texlive-latex-extra
```
Applications will likely be installed in `/usr/bin`.
Additional LaTeX packages may require manual installation in `/usr/share/texlive/texmf-dist/tex/latex`.
---
## Test Installation
After LaTeX has been installed, test that `pdflatex` runs from the command line prompt:
```
pdflatex --version
```
If this fails to return information about `pdflatex`, check that it is installed in the proper location. If found the LaTeX application path needs to be added to your environment path, or you may need to log out and back in.
---
# Navigating RStudio
* Creating scripts (R Script, R Markdown, R Sweave)
* Creating projects
* Environment
* Files
* Plots
* Packages
* Help
## Running a R script
When an R script is open, RStudio can run one line at a time, run all highlighted lines, or run the entire script.
A script can also be run in entirety by calling the `source` function, which works with local files as well as URLs.
```{r, eval=FALSE}
source("https://github.com/harrelfe/rscripts/raw/master/introda.r")
```
---
# R Packages
Install these packages `Hmisc, rms, knitr, rmarkdown` through RStudio.
Packages contain R code and may often require the installation of other packages. When I install `Hmisc`, I see that its dependency `acepack` is automatically installed:
```
Installing package into ‘/home/beckca/R/x86_64-pc-linux-gnu-library/3.1’
(as ‘lib’ is unspecified)
also installing the dependency ‘acepack’
trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/acepack_1.3-3.3.tar.gz'
Content type 'application/x-gzip' length 33590 bytes (32 Kb)
opened URL
==================================================
downloaded 32 Kb
trying URL 'http://debian.mc.vanderbilt.edu/R/CRAN/src/contrib/Hmisc_3.14-6.tar.gz'
Content type 'application/x-gzip' length 611348 bytes (597 Kb)
opened URL
==================================================
downloaded 597 Kb
* installing *source* package ‘acepack’ ...
.
.
.
* DONE (acepack)
* installing *source* package ‘Hmisc’ ...
.
.
.
* DONE (Hmisc)
```
---
## Commands for Package Management
### Install and updating
```{r, eval=FALSE}
# install several packages
install.packages(c('Hmisc', 'rms', 'knitr', 'rmarkdown'))
# update all packages
update.packages(checkBuilt=TRUE, ask=FALSE)
```
### Loading
Note that `library` is a bit of misnomer as R uses packages, not libraries. From a technical standpoint, it's nice to recognize the distinction. You may see `require` used in its place. If the package is not installed, `library` will trigger an error while `require` will return FALSE. Once loaded the functions created in the package are available to your R session.
```{r, eval=FALSE}
library(Hmisc)
require(Hmisc)
```
---
# knitr
[knitr][knitr] [-@knitr] was designed as a replacement for Sweave - it is used for dynamic report generation. It combines the content of your document with the R code that produces results. Blocks of code are called chunks.
By default RStudio uses Sweave, but this can be changed to knitr in the options.
`Tools -> Global Options, Sweave panel, "Weave Rnw files using..." `
Here's an example of using the `knitr` package to strip out all of this file's R chunks and store them in an R script.
```{r, eval=FALSE}
library(knitr)
# may need path to file
purl('intro.Rmd')
```
## Rnw/LaTeX/PDF or Rmd/markdown/HTML+
Reports can be generated by two methods, Rnw (noweb) files or Rmd (markdown) files. Syntax-wise there are two major differences:
1. Declaration of code chunks
1. Content is formatted with LaTeX (Rnw) or Markdown (Rmd)
LaTeX is much more difficult to learn but it offers superior control over document layout and formatting. Markdown offers more flexibility in the type of output, such as PDF, HTML and Word. How do you choose which to use? A good rule of thumb is to ask yourself "must I print the results nicely on paper?". If yes, Rnw ([Yihui Xie][yihui] [-@yihui]).
Some R objects are easily converted into LaTeX (see `Hmisc::latex`). If you are using any LaTeX functions, you should also choose Rnw files.
You can see good examples of both methods [here][knitr-examples] [-@knitr-examples].
Output is created by clicking the `Compile PDF` or `Knit HTML` button.
---
### Code Chunks
An Rnw chunk looks like this:
```
<>=
# lines of R code
@
```
Here's an R chunk in Rmd:
```{r, chunk-options}`r ''`
# lines of R code
```
There are many chunk options, see the [online documentation][knitr-chunk] [-@knitr-chunk].
### Nice Defaults with knitrSet
Dr. Harrell has predefined several options and functions that are useful when creating Rnw files. Load the `Hmisc` package to make the `knitrSet` function available for use. In the past this function could be included by adding it to your `.Rprofile` file [(example Rprofile)][rprofile] [-@rprofile]. If `knitrSet` is included in your `Rprofile`, it should be removed.
The following chunk will load `Hmisc` and run this function.
```
<>=
require(Hmisc)
knitrSet()
@
```
### Rmarkdown Template
Any of the Rmd files found [here][rscripts] [-@rscripts] would make a good template for a reproducible report. There are two chunks you should consider including in any report. As previously mentioned `knitrSet` should be added to the beginning of the report. The computing environment should be added to the end.
It should look similar to this.
```{r setup,echo=FALSE}`r ''`
require(Hmisc)
knitrSet(lang='markdown')
```
```
# Analysis Start
# content
# Analysis End
# Computing Environment
```
```{r rsession,echo=FALSE}`r ''`
si <- sessionInfo(); si$loadedOnly <- NULL
print(si, locale=FALSE)
```
```{r cite,results='asis',echo=FALSE}`r ''`
print(citation(), style='text')
```
The `Hmisc::getRs` function can be used to open one of the sample rscripts within RStudio.
---
# Importing Data Sets
In R, a data set is called a `data.frame`. Consisting of rows and columns of data, they are the fundamental data structure in R.
R reads in delimited files, such as comma-separated or tab-delimited, as data frames. Typically this is done with one of the following functions.
* read.table - work with data in table format
* read.csv - CSV files
* read.delim - tab-delimited files
* scan - more general function for reading in data
Several options are useful:
* header - indicator if first line contains column headers
* sep - field separator
* quote - quote character
* na.strings - strings to convert to NA value
* nrows - number of rows to read
* skip - number of rows to skip at start of file
* stringsAsFactors - read strings as character or factor values
A data set may also be imported by clicking the `Import Dataset` button in the Environment pane in RStudio.
The base package `datasets` includes several data sets. In this example we'll use one to create a file that can be read in as a CSV.
```{r}
head(state.x77)
write.csv(state.x77, file='state.csv')
state <- read.csv('state.csv', stringsAsFactors=FALSE)
```
---
## Importing from Other Applications
Other methods are available to read in data sets created outside of R.
For example, `Hmisc` includes the functions `sas.get` and `sasxport.get` to work with SAS data sets.
Stat/Transfer is a good option but is not free nor open source.
See the `foreign` package for other options.
## getHdata
The `Hmisc::getHdata` function downloads ready-to-use data sets from the web site for `Hmisc` and `rms`.
```{r}
require(Hmisc, warn.conflicts=FALSE, quietly=TRUE)
# list available data sets
getHdata()
# download and load data set
getHdata(counties)
head(counties)
# view data dictionary for file - opens in web browser
getHdata(counties, what="contents")
contents(counties)
```
---
## Structure of a data.frame
A `data.frame` consists of observations (rows) of variables (columns). Examine the `hospital` data set.
```{r}
getHdata(hospital)
contents(hospital)
hospital
```
Once loaded, note that it is displayed in RStudio's `Environment` pane. It gives information about the `data.frame`. This can also be shown through the following functions.
```{r}
str(hospital)
names(hospital)
rownames(hospital)
dim(hospital)
```
The `str` function returns the structure of an R object. In this case the hospital data set is made up of columns of different types: integer, character, factor, numeric. The `Hmisc::describe` function is especially helpful when taking a first look at a data set.
```{r}
describe(hospital)
```
---
# Saving and Loading R Binary Objects
Saving data sets as text files (like CSV) is not always a good idea. It is more efficient (time and space) to work with R binary objects. The `save` function will store R objects (including the global environment). Save a single data set with the `.rda` extension, and save a collection of objects with the `.RData` extension. These objects can be restored with the `load` function. By default these files are saved with compression.
```{r}
# save single data set
save(hospital, file='hosp.rda', compress=TRUE)
# save everything
save.image(file='myintro.RData')
```
Restart R and the following will restore the `hospital` data set.
```{r}
load('hosp.rda')
```
`Hmisc::Save` and `Hmisc::Load` may also be used on single objects - the file will keep the name of the object unless otherwise provided. Compression is also enabled.
```{r}
Save(hospital)
Load(hospital)
```
---
# R Basics
## Assignment
```{r}
x <- "my string"
y = 5
```
## Printing
```{r}
print(x)
cat(x, "\n")
x
sum(y)
(x <- 1)
```
---
## Vectors
Container of one type. Elements are indexed from 1 to n, and can be accessed by this index (more on this later).
```{r}
z <- c(1,-5,20)
(z <- c(10, z, 36))
z[2]
```
---
## Numeric
### Basic arithmetic
```{r}
z + 1
z - 2
z * 3
z / 4
z ^ 5
# mod, the remainder
z %% 6
# integer division
z %/% 6
# recreate z
(z %/% 6) * 6 + z %% 6
sqrt(z)
```
### Numeric Functions
Many other functions work with numeric values, see ?S3groupGeneric
```{r}
abs(z)
sign(z)
round(z / 7, 2)
floor(z / 7)
ceiling(z / 7)
# exponential function
exp(z)
# natural logarithms
log(z)
# trig functions
sin(z)
```
### Descriptive Statistics
```{r}
sum(z)/length(z)
mean(z)
median(z)
sd(z)
min(z)
max(z)
range(z)
quantile(z)
quantile(z, probs=c(0.1, 0.5, 0.9))
```
---
### Generating Sequences and Random Data
See ?seq and ?Distributions
```{r}
numeric(5)
seq(1, 10)
seq(10)
1:10
seq(1, 10, by=2)
rep(1, 5)
rep(1:3, each=3)
rep(1:3, times=3)
rep(1:3, times=1:3)
rnorm(5, mean=0, sd=1)
rbinom(5, size=1, prob=0.5)
rpois(5, lambda=3)
runif(5, min=0, max=10)
```
Recreate random data with seeds - values are reproducible across computers
```{r}
# the seed can be any integer value
set.seed(1)
rnorm(5)
set.seed(1)
rnorm(5)
```
### Floating-point precision
Beware precision errors with numeric values.
```{r}
seq(10)/10 == seq(0.1, 1, by=0.1)
abs(seq(10)/10 - seq(0.1, 1, by=0.1)) < 10^-7
```
---
## Integer
Mostly unnecessary, useful to represent data exactly or to pass to C or Fortran code
```{r}
as.integer(1)
class(1L)
class(c(1L, 1))
```
## Logical
```{r}
logical(5)
rep(TRUE, 5)
as.logical(c(-1,0,1))
1+1 == 2
3 < 2
# logical negation
!FALSE
x <- seq(10)
# logical or
x %% 2 == 0 | x %% 3 == 0
# logical and
x %% 2 == 0 & x %% 3 == 0
(x <- rnorm(5) > 0)
# return index for true elements
which(x)
any(x)
all(x)
```
## Character
```{r}
character(5)
LETTERS[seq(5)]
letters[seq(5)]
x <- "My string"
toupper(x)
tolower(x)
nchar(x)
substr(x, 4, 6)
# [set of values]
grep("[aeiou]", letters)
grepl("[aeiou]", letters)
gsub("[a-m]", "_", x)
(y <- paste(x, "is", x))
paste(y, "!", sep='')
# this might not do what you think
paste(letters[seq(10)], sep='')
paste(letters[seq(10)], collapse='')
# strsplit turns vector into list
strsplit(c("1;2","3;4","5:6"), split=";")
# use unlist if only one string
unlist(strsplit(y, split=" "))
```
---
## Factor
Factor variables are categorical variables; they may appear as character strings, but they are stored as numerics. They may be important to models or plotting functions.
```{r}
set.seed(5)
race <- sample(c('white','black','other'), size=10, replace=TRUE, prob=c(0.5,0.3,0.2))
# factor values are created in alphabetical order
(r2 <- factor(race))
r2 <- factor(race, levels=c('white','black','asian','other'))
levels(r2)
nlevels(r2)
levels(r2) <- list(white='white', black='black', other=c('other', 'asian'))
# this fails
c(r2, "hispanic")
# coerce to character string first
factor(c(as.character(r2), "hispanic"))
```
---
## Date
Internally, dates are stored as days from 01/01/1970 (this is day 0). Format options can be checked from the ?strptime help file.
```{r}
Sys.Date()
# need consistent format
as.Date(c("05-03-01", "11-11-25"), format="%y-%m-%d")
# accepts numeric data, if origin is specified
# Excel typically uses an origin of "1900-01-01"
(x <- as.Date(seq(0, by=365, length.out=3), origin="2000-01-01"))
as.numeric(x)
format(x, format="%m/%d/%Y")
# seq.Date(x[1], x[2], by=14)
seq(x[1], x[2], by=14)
diff(x)
as.numeric(diff(x), units='weeks')
```
---
## Date-time
Date-time variables can be stored in two ways. POSIXct variables are stored as seconds from 01/01/1970. POSIXlt variables have each date/time component saved in a list container. Typically, you should use POSIXct.
Careful consideration of the timezone should be used when working with date-time variables. By default R will take the timezone from the operating system. It can be helpful to explicitly select your time zone, and additionally using `UTC` whenever possible.
```{r}
Sys.time()
Sys.timezone()
x <- as.POSIXct("2014-10-10 10:20")
# see ?strptime for format options
y <- as.POSIXlt("10/17/2014 20:10", tz="UTC", format="%m/%d/%Y %H:%M")
unclass(x)
unclass(y)
difftime(Sys.time(), x, units="days")
base::round.POSIXt(x, units="hours")
```
---
## Special Types
* Undefined - `NULL`
* Missing - `NA`
* Infinite - `Inf` and `-Inf`
* Not a Number - `NaN`
```{r}
x <- c(1,4,NA,0)
is.na(x)
# note calculations with NA result in NA
is.infinite(x / 0)
is.nan(x / 0)
```
---
## Mixing Types
Mixing types together in a vector will coerce all variables to be of the same, least flexible, type.
```{r}
# str shows the structure of an R object, similar to Environment pane
str(c("a", 1))
str(c(5L, 1))
str(c(TRUE, 0))
str(c(FALSE, "false"))
str(c(100, Sys.Date()))
str(c('today', Sys.time()))
str(c(factor(c("apple", "orange")), "banana"))
```
---
## Containers
* vector - supports one type, with one dimension
* matrix - supports one type, with two dimensions
* array - supports one type, with many dimensions
* data.frame - supports one type per column, but may vary by column
* list - supports many types
Each have two important attributes, dimension length and dimension names.
```{r}
x <- c(1, 10, 100)
length(x)
names(x) <- c("ones", "tens", "hundreds")
# assign names during vector creation
(x <- c(ones=1, tens=10, hundreds=100))
names(x)
# print vector without names
unname(x)
# remove names
names(x) <- NULL
mx <- matrix(1:6, nrow=2)
# number of rows
nrow(mx)
# number of columns
ncol(mx)
# display number of rows and columns
dim(mx)
rownames(mx) <- c('odd', 'even')
colnames(mx) <- c('a', 'b', 'c')
# assign names during matrix creation
(mx <- matrix(1:6, nrow=2, dimnames=list(c('odd', 'even'), c('a','b','c'))))
str(mx)
```
---
## Vectors
### Accessing Elements
Indexing is vital concept. It allows access to specific parts of a container, for extraction or replacement. There are four main ways to use indexing.
* index by element number
* index by negation
* index by element name
* index on truth of logical comparison
Note that in R, the first element is indexed at 1. Many programming languages start at 0 instead.
```{r}
x <- c(royals=4, angels=0, tigers=1, orioles=7)
# index by number, returned in order specified
x[c(4,3)]
# index by negation
x[-c(1,2)]
# index by name
x[c("orioles", "angels")]
# index by truth
x[x > 3]
x[x > 3 & x < 7]
# index replacement
x[c("royals","orioles")] <- c(7,4)
# create new element
x["as"] <- 0
# non-existent elements are NA
x[6]
```
### Functions for Vectors
* head, tail
* rev
* match, `%in%`
* summary
* sort, order
* table - contingency table
* unique
* duplicated
* union, intersect, setdiff
* split, unsplit
```{r}
# these examples use state data sets, see ?state
head(state.region)
tail(state.region, n=10)
state.region %in% c("West", "South", "Central")
c("West", "South", "Central") %in% state.region
match(state.region, c("West", "South", "Central"))
summary(state.area)
sort(state.area)
# display order by element index
order(state.area)
table(state.region)
table(state.division, state.region)
unique(state.region)
duplicated(state.region)
set.seed(2)
x <- sample(20, 10)
y <- sample(20, 10)
union(x, y)
# union takes two arguments, unique takes one (a vector)
all(union(x, y) == unique(c(x, y)))
intersect(x, y)
# elements in x, not in y
setdiff(x, y)
# elements in y, not in x
setdiff(y, x)
# create list of state names by region
split(state.name, state.region)
```
---
### Recycling
Some vector operations require vectors of equal length. If the vectors are of unequal length, elements in the shorter vector may be recycled.
```{r}
# c(10,1) becomes c(10,1,10,1,10)
c(1,10,100,1000,10000) / c(10,1)
# letters[1:2] becomes letters[c(1,2,1,2)]
paste(letters[1:2], letters[23:26])
```
---
## Matrices
Two dimensional vector. Values are stored by column by default.
Useful functions:
* `%*%` - matrix multiplication
* rowSums, colSums
* t - transpose
* cbind - add column
* rbind - add row
* solve
* upper.tri, lower.tri
* diag - access diagonal, or create identity matrix
* det - calculate the determinant
```{r}
matrix(1:5)
matrix(seq(8), nrow=4, ncol=2)
(x <- matrix(seq(8), nrow=4, ncol=2, byrow=TRUE))
rowSums(x)
colSums(x)
y <- matrix(c(1,1,0,0), nrow=2)
x %*% y
x - y[rep(1,4),]
# add a column, note the use of recycling
cbind(x, 3)
rbind(x, c(9, 10))
x <- matrix(c(4,12,-5,-10), nrow=2)
all.equal(x %*% solve(x), diag(2))
solve(x, c(10, 40))
x <- matrix(seq(25), nrow=5)
t(x)
upper.tri(x)
```
### Indexing
* by row(s)
* by column(s)
* by comparison
Remember that the row and column index could be either the row/column number or the row/column name.
```{r}
x[3,]
x[,3]
x[3,3]
x[c(1,5),]
x[3,c(3,4)]
x[x %% 3 == 0] <- 0
x
x[lower.tri(x)] <- 0
x
```
## Arrays
Arrays are similar to matrices, however they support one, two or more dimensions.
Index by `array[d1ix, d2ix, d3ix, ..., dnix]`
```{r}
# this creates 4 2x3 matrices
x <- array(1:24, dim=c(2,3,4))
x[2,3,4]
x[,,4]
```
---
## Data Frames
Data.frames are the fundamental data structures in R. They have rows and columns like matrices, however different columns may contain different variable types.
Remember that matrices use `rownames` and `colnames` to access row/column names. Data frames use `row.names` and `names` instead.
When creating a data.frame with a column of character values, make sure to decide if the values should be converted to factor variables. This can be set with the `stringsAsFactors` argument.
```{r}
x <- data.frame(id=1:3, animal=c('cat','dog','rabbit'),
speed=c('faster','fast','fastest'))
y <- data.frame(id=1:3, animal=c('cat','dog','rabbit'),
speed=c('faster','fast','fastest'), stringsAsFactors=FALSE)
y[3,]
y[,c(2,3)]
y[,'speed']
y[y[,'speed'] == 'fastest',]
y[y[,'speed'] == 'fastest', 'animal']
y[,'speed.factor'] <- factor(y[,'speed'])
```
There are other methods to select or create data.frame columns:
```{r}
y[,3]
y[,'speed']
y$speed
y[['speed']]
y[,'speed', drop=FALSE]
y['speed']
y[,'new'] <- NA
y <- cbind(y, newer=0)
y
```
It is good to write code that is consistent and easy to understand. For this reason I recommend using the `df[,colname]` syntax.
There are also a couple of ways to remove columns:
```{r}
y[,'new'] <- NULL
y <- y[,-1]
y <- y[,-match('newer', names(y))]
y
```
Useful functions:
* merge - join data frames by common columns
* with, within - simplify expressions using column names
```{r}
x <- data.frame(id=seq(4,24,4), gender=rbinom(6, 1, 0.5))
y <- data.frame(id=seq(6,24,6), smoker=rbinom(4, 1, 0.25))
merge(x, y)
merge(x, y, by=0)
merge(x, y, by.x='id', by.y='id', all.x=TRUE, all.y=FALSE)
merge(x, y, all=TRUE)
# modify data frame, and reference columns without repeating data frame name
x <- within(x, {
weight <- round(rnorm(nrow(x), 120+gender*60, 10))
age <- sample(15:25, nrow(x), replace=TRUE)
})
# shorter than x[order(x[,'gender'], x[,'age']),]
x[with(x, order(gender, age)),]
```
---
## Lists
Lists are generic containers. Think of them as a vector where each element can by anything you want, including more lists.
```{r}
emptylist <- vector('list', 10)
x <- list(abb=state.abb, area=state.area, region=state.region, animals=y, other=5)
```
### Indexing
Extracting with a single bracket returns a list. Extracting with two brackets returns the contents at the given index.
```{r}
x[1]
x[1:2]
x[[1]]
x[["area"]]
# while not recommended, $ provides short-cut
x$area
x[4:5] <- NULL
```
A list can be flattened to produce a vector with `unlist`.
```{r}
# notice that all values become character strings, because the first element
# contained character values (the least flexible variable type)
unlist(x)
```
---
## Control Structure
Control structures are used to change if and when lines of code in a program are run. The control comes from conditions being true or false.
### Branching
`if` statements execute a block of code once or not at all. Multiple conditions can be tested so that only the block under the first condition evaluated as true will be executed. A default block of code can be evaluated if none of the conditions pass.
Several lines of code can be combined in a block by surrounding them with braces `{...}`. It is good practice to use braces, though they are not required for a single statement.
```{r}
x <- 5
if(x >=0) {
print(sqrt(x))
}
x <- -4
if(x >=0) {
print(sqrt(x))
} else if(x < 0) {
print(sqrt(x*-1))
}
x <- 0
if(x > 0) {
print(sqrt(x))
} else if(x < 0) {
print(sqrt(x*-1))
} else {
print(0)
}
```
The `ifelse` function can be used to test a vector of values.
```{r}
set.seed(3)
ifelse(rnorm(10) > 0, 1, 0)
```
### Iteration
Three mechanisms can be used to run blocks of code multiple times.
* for - repeat over sequence
* while - repeat while condition passes
* repeat - repeat until forced break
```{r}
set.seed(5)
nc <- 5
x <- sample(nc, 100, replace=TRUE)
cnt <- numeric(nc)
for(i in seq(nc)) {
cnt[i] <- sum(x == i)
}
cnt
i <- 3
while(i <= 100) {
isprime <- all(i %% seq(2, sqrt(i)) != 0)
if(isprime) cat(i, "")
i <- i + 1
}
a <- 1
b <- 2
repeat {
tmp <- a
a <- b
b <- a + tmp
if(b > 100) break
print(b)
}
```
## Functions
Functions allow blocks of code to be run repeatedly. Variables are passed into functions as arguments. When the function completes, it may also return a variable (including a list, which may hold several elements).
Don't access variables from within a function that were created outside of the function (unless they are passed into the function as arguments).
### Return Values
```{r}
prQ <- function(x) {
qs <- quantile(x, probs=c(0.5, 0.25, 0.75))
return(sprintf("%s (%s, %s)", qs[1], qs[2], qs[3]))
}
prQ(1:9)
# return doesn't have to be explicitly called
factors <- function(x) {
Filter(function(i) x %% i == 0, seq(1, floor(x/2)))
}
factors(64)
# NULL can be returned as well
doNothing <- function(x) NULL
```
### Arguments
A function can have many arguments. A generic argument `...` can be used to specify any number of additional arguments. Arguments can be given a default value.
```{r}
powerAdd <- function(p1=1, p2=1, p3=1) {
p1 + p2^2 + p3^3
}
powerAdd()
powerAdd(3,2,1)
powerAdd(5, p3=3)
powerAdd(p2=16)
# take any number of arguments
morePowerAdd <- function(p1=1, ...) {
extra <- unlist(list(...))
sum(p1, extra^(1+seq_along(extra)))
}
morePowerAdd()
morePowerAdd(5, 4, 3, 2, 1)
# use extra arguments for another function
callPlot <- function(msg, ...) {
print(sprintf("calling plot: %s", msg))
plot(...)
NULL
}
callPlot("log", 1:10, log(1:10))
```
## Putting It Together
This creates a sample data set with multiple visits for each patient:
```{r}
size <- 1000
set.seed(475)
x <- data.frame(id=sample(100, size, replace=TRUE),
visitdate=sample(365*2, size, replace=TRUE)
)
male <- sample(c(0,1,NA), 100, replace=TRUE, prob=c(45, 45, 10))
age <- round(runif(100, 40, 80))
x <- merge(x, cbind(male, id=seq(100)))
x <- merge(x, cbind(age, id=seq(100)))
x[,'visitdate'] <- as.Date(x[,'visitdate'], origin='2010-01-01')
x <- x[with(x, order(id, visitdate)),]
row.names(x) <- NULL
```
It's common to operate on a `data.frame` by breaking it into small subsets. Calculate visit number and days since first visit:
```{r}
uids <- unique(x[,'id'])
for(i in seq_along(uids)) {
# note x is already ordered by visitdate
id.x <- which(x[,'id'] == uids[i])
x[id.x, 'visitno'] <- seq_along(id.x)
x[id.x, 'daysOn'] <- x[id.x, 'visitdate'] - x[id.x[1], 'visitdate']
}
```
This method may work great, unless rows are added or removed. What if we wanted to add a last visit that's three years after the first visit?
First consider a function that is given a subset.
```{r}
# x should only have one unique ID value, and be ordered by visitdate
addLastVisit <- function(x) {
last <- x[1,]
last[,'visitdate'] <- last[,'visitdate'] + 365*3
last[,'visitno'] <- nrow(x) + 1
last[,'daysOn'] <- 365*3
rbind(x, last)
}
# quick example
addLastVisit(x[x[,'id'] == 100,])
```
Because our new `data.frame` may have a unknown size, we can store the intermediate subsets in a list. `lapply` is a convenient function to use for this.
```{r}
# now x is saved as 100 element list
x.list <- lapply(unique(x[,'id']), FUN=function(i) addLastVisit(x[x[,'id'] == i,]))
```
The finals steps are to convert the list back to a `data.frame`, and, if we care, update row names. `do.call` is powerful, but can be difficult to learn. The first argument is a function. The second argument should be a list, whose elements are the arguments to the referenced function. In this case, the 100 elements of `x.list` are the arguments to the `rbind` function, which should make sense as we are row-binding these data frames together.
```{r}
x.df <- do.call(rbind, x.list)
row.names(x.df) <- NULL
```
The R package `plyr` was written to handle similar tasks.
---
# Transforming Variables
Variables can be transformed by applying functions or arithmetic.
```{r}
getHdata(diabetes)
contents(diabetes)
# create BMI
diabetes[,'bmi'] <- 703*diabetes[,'weight']/diabetes[,'height']^2
# scale cholesterol to [0,1]
diabetes[,'chol.scaled'] <- (diabetes[,'chol']-min(diabetes[,'chol'], na.rm=TRUE))/
(max(diabetes[,'chol'], na.rm=TRUE)-min(diabetes[,'chol'], na.rm=TRUE))
head(diabetes)
getHdata(olympics.1996)
contents(olympics.1996)
# take log of population
olympics.1996[,'log.pop'] <- log(olympics.1996[,'population'])
# create factor variable using cut points
olympics.1996[,'medals.factor'] <- cut(olympics.1996[,'medals'],
c(0,10,20,50,100,Inf), right=FALSE)
head(olympics.1996)
```
---
# Subsetting data frames
Several examples of subsetting or indexing the `hospital` data set are below. Also shown is the `subset` function, a convenience function for accomplishing this task.
```{r}
# select rows by index
hospital[1:3,]
# select rows by name
hospital[c("5","10","15","20"),]
# select columns by index, and first 10 rows
hospital[seq(10),c(1,5)]
# select columns by name, and 5 random rows
hospital[sample(nrow(hospital), 5), c('age','sex')]
# be careful when selecting a single column
# no longer data.frame, but vector
hospital[seq(5),'service']
# maintain data.frame structure
hospital[seq(5),'service', drop=FALSE]
# select rows that satisfy condition
# every 5th row
hospital[seq(nrow(hospital))%%5 == 0,]
# men
hospital[hospital[,'sex'] == 'male',]
subset(hospital, sex == 'male')
# 30-40 year olds, but only first 5
hospital[hospital[,'age'] %in% 30:49, c('id','age')][1:5,]
subset(hospital, age %in% 30:49, select=c(id, age))[1:5,]
# temp greater than 99
hospital[hospital[,'temp'] > 99, c('id','temp')]
subset(hospital, temp > 99, c(id, temp))
# combining conditions with AND
# women with duration over 10
hospital[hospital[,'sex'] == 'female' & hospital[,'duration'] > 10,]
# `with` saves some typing
hospital[with(hospital, sex == 'female' & duration > 10),]
# `subset` saves even more
subset(hospital, sex == 'female' & duration > 10)
# combining conditions with OR
# antibiotic or bculture
hospital[hospital[,'antibiotic'] == 'yes' | hospital[,'bculture'] == 'yes',]
# combining conditions with complicated combinations
# men younger than 30 or women older than 60
hospital[with(hospital, (sex == 'male' & age < 30) | (sex == 'female' & age > 60)),]
# select columns that match pattern
hospital[seq(10), grep("^s", names(hospital))]
# select empty data.frame
hospital[FALSE,]
```
---
# Data Aggregation
Summary statistics can be computed with `tapply`, `aggregate`, and `Hmisc::summarize`.
```{r}
getHdata(DominicanHTN)
contents(DominicanHTN)
# calculate mean age by village
with(DominicanHTN, tapply(age, village, mean))
# calculate mean sbp by gender
aggregate(sbp ~ gender, data=DominicanHTN, FUN=mean)
# calculate median age by village and gender
aggregate(age ~ village + gender, DominicanHTN, median)
# calculate quantile of sbp by village and gender
with(DominicanHTN, summarize(sbp, llist(village, gender), quantile))
```
---
# Formulas
The previous `aggregate` example introduced the `~` operator. The tilde is used to create a model formula, which consists of a left-hand side and right-hand side. Many R functions utilize formulas, such as plotting functions and model-fitting functions. The left-hand side consists of the response variable, while the right-hand side may contain several terms. You may see the following operators within a formula.
* y ~ a + b, `+` indicates to include both a and b as terms
* y ~ a:b, `:` indicates the interaction of a and b
* y ~ a*b, equivalent to y ~ a+b+a:b
* y ~ (a+b)^2, equivalent to y ~ (a+b)*(a+b)
* y ~ a + I(b+c), `I` indicates to use `+` in the arithmetic sense
See ?formula for more examples.
---
# Models
The most simple model-fitting function is `lm`, which is used to fit linear models. It's primary argument is a formula. Using the DominicanHTN data set, we can fit sbp by age.
```{r}
(m <- lm(sbp ~ age, data=DominicanHTN))
```
This creates an `lm` object, and several functions can be used on model objects. The internal structure of a model object is a list - its elements may be accessed just like a list.
* coef
* fitted
* predict
* residuals
* vcov
* summary
* anova
```{r}
names(m)
m$coefficients
coef(m)
head(fitted(m))
predict(m, data.frame(age=c(40, 60)))
head(residuals(m))
vcov(m)
summary(m)
coef(summary(m))
anova(m)
```
---
# R Graphics with ggplot2
The examples [here][ggplot] [-@ggplot] say it best.
```{r}
library(ggplot2)
# create factors, categorical variables
m2 <- within(mtcars, {
gear.factor <- factor(gear, levels=c(3,4,5), labels=c("3gears","4gears","5gears"))
am.factor <- factor(am, levels=c(0,1), labels=c("Automatic","Manual"))
cyl.factor <- factor(cyl, levels=c(4,6,8), labels=c("4cyl","6cyl","8cyl"))
})
# kernel density
qplot(mpg, data=m2, geom="density", fill=gear.factor, alpha=I(.5),
main="Distribution of Gas Mileage", xlab="Miles Per Gallon", ylab="Density"
)
```
```{r, fig.width=5.5, fig.height=3.7}
# scatterplot
qplot(hp, mpg, data=m2, shape=am.factor, color=am.factor,
facets=gear.factor~cyl.factor, size=I(3), xlab="Horsepower",
ylab="Miles per Gallon"
)
# boxplot
qplot(gear, mpg, data=m2, geom=c("boxplot", "jitter"), fill=gear.factor,
main="Mileage by Gear Number", xlab="", ylab="Miles per Gallon"
)
```
Example with regression:
```{r, fig.width=5.5, fig.height=3.6}
# scatterplot
p <- qplot(wt, mpg, data=m2, color=cyl.factor,
main="Regression of MPG on Weight", xlab="Weight", ylab="Miles per Gallon"
)
# add regression line
p + geom_smooth(method="lm")
p + geom_smooth(method="loess")
```
## Graphics Devices
Plots created with `ggplot2` can be saved to a file with the `ggsave` function. The file format is determined by the file name. Some are vector file formats (pdf, svg), while others (bmp, jpeg, png) are raster formats. Vector files can be scaled without pixelation.
Create 6"x6" png at 300 dpi, useful for print publications:
```{r, fig.show='hide'}
qplot(hp, mpg, data=m2, shape=am.factor, color=am.factor,
facets=gear.factor~cyl.factor, size=I(3), xlab="Horsepower",
ylab="Miles per Gallon"
)
ggsave("mileageByHorses.png", width=6, height=6, dpi=300)
```
Plot to PDF:
```{r, fig.show='hide'}
qplot(gear, mpg, data=m2, geom=c("boxplot", "jitter"), fill=gear.factor,
main="Mileage by Gear Number", xlab="", ylab="Miles per Gallon")
ggsave("mileageByGear.pdf")
```
---
# Links
[miktex]: http://miktex.org/download "MiKTeX"
[mactex]: http://www.tug.org/mactex/morepackages.html "MacTeX"
[knitr]: http://yihui.name/knitr/ "knitr"
[yihui]: http://yihui.name/en/2013/10/markdown-or-latex/ "markdown or latex"
[knitr-examples]: https://github.com/yihui/knitr-examples "examples"
[knitr-chunk]: http://yihui.name/knitr/options/#chunk_options "chunk options"
[rprofile]: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/RConfiguration/Rprofile "~/.Rprofile"
[rscripts]: https://github.com/harrelfe/rscripts "R script examples"
[ggplot]: http://www.statmethods.net/advgraphs/ggplot2.html "ggplot2 examples"