Programming Tips For Statisticians
Programming is a huge part of what modern statisticians do, yet few of us ever take time to study good programming practices. Some may complain,
but I'm a statistician, not a programmer. Consider that while
we are statisticians and not mathematicians, we spend a lot of time developing our mathematical skills because they are essential to our jobs. In this age, programming is just as essential to being a good statistician.
This page contains some of the barest basics every statistician should know and practice.
Key phrases: Good programming advice for statisticians. Good programming practices for statisticians. Programming Style Guide for Statisticians.
Meaningful Variable Names
Good variables names have appropriate meanings. Bad variable names have no meaning or inappropriate meanings. Consider a trivial script to print the first three subjects' ages in a dataset (coded in R).
for(i in 1:3){print(age[i])}
In a short script like this the meaning of i is clear, but in a complex script dozens of lines long we would be forced to constantly remember
i means row number.
Good programming minimizes the need for a good memory. A better name would be something like
for(subject in 1:3){print(age[subject])}
or
for(row in 1:3){print(age[row])}
Choose a variable naming format and be consistent. What format you choose is a matter of personal preference, but read up on the
style guides for nonstatistical programming languages for good advice. Programmers of nonstatistical languages have thought carefully about what makes a good name. Co-opt good style practices from other languages for your own use. Taking advantage of capitalization allows readable names like
glyburideFilldate and
PersistenceEndDate.
If creating variables that become part of a dataset, then
a good variable name should work in many languages. As statisticians, we frequently use several programming languages on a given dataset. We may do one analysis in R and another in Stata. We use the best program we have available for a job. A good variable name will work seamlessly with all the languages we use. This means being mindful of using special characters like _ and . in variable names that may work well with one language and poorly with another.
Indentation
The above example made many a programmer's skin crawl because it lacked the most basic element of any good script, an indentation scheme. A better script would have been
for(row in 1:3){
print(age[row])
}
The specifics of a good scheme will vary based on personal aesthetics, but here are some basic pointers.
- indent the body of for loops, foreach loops, and while loops
- indent the body of conditional statements (if statements), especially if the conditional statement has several actions within the body
- choose an indent, e.g. 2 spaces or 4 spaces, and stick with it
- don't mix tabs and spaces, use one or the other for indentation
- be consistent with nested indentation, i.e. if indenting with 4 spaces, the body of an if statement within a while loop will be indented 8 spaces, 4 for the while loop and 4 for the if statement.
Non-significant Whitespace
Put spaces (and sometimes linebreaks) between things for readability. Noonelikestoreadasentencewithoutanyspacebetweenthewords. It's still intelligible, but it takes longer to understand. Besides, storage is cheap.
Consider changing the following
a<-4;b<-6;c<-a+b
for(x in a:b){print(c:x)}
f<-function(x,y){return(x^2+y/3)}
to
a <- 4
b <- 6
c <- a + b
for (x in a:b) { print(c:x) }
f <- function(x, y) { x^2 + y/3 }
and your eyes will thank you.
Comments
How much to comment is up for debate. Some will say "comment copiously", while others will say "don't comment - write clear code". Consider this line of code:
$line =~ s/\,/\;/g; # substitute commas with semicolons everywhere in the line
To an expert Perl programmer, the comment is annoying. But to a novice Perl programmer, the comment is helpful. To help you decide how much to comment, ask yourself the following questions?
- Who needs to be able to read my code? What's their level of experience with the language I'm using?
- Do these comments make my code easier for me to read? ... easier to read for everyone else who needs to read this code?
- Will I remember what I'm doing in this section of code a year from now? Will these comments help me remember?
- Does this comment provide information that the line of code does not?
In the example above, a comment like
# commas will be used only as data delimiters provides new information that helps explain why that line of code was written.
When commenting functions, consider including the following.
- The function's purpose (if the name doesn't make this obvious).
- The arguments (including variable types).
- The return value(s).
Code Reuse
Don't reinvent the wheel if you can help it. Often, what you need to do has been done before and is available in your statistical program by default or online as an add on. The time you save not reinventing the wheel can be spent testing and understanding the add on. More often than not, you'll break even on time and get better, more reliable results. That said, don't trust everything you find online. Some code has been thoroughly tested and debugged; some hasn't. You have to do the work to ensure you understand and trust the code. Ask the obvious questions. Is this a default function or did you pick up the code off the street? Did the code go through a review process? Does it have good documentation? Do you understand it? You may not need to understand every detail of a default function in a validated statistical package, but you do need to for an unvalidated script you grabbed off the web.
Naming Scripts
Personal preference will vary widely here. Here are some questions to ask yourself?
- Does the name have to be short because you have to type it 50 more times, or can it be longer and contain important information?
- Should version be part of the script name?
- Does the script name convey the script's purpose?
- Will you remember the name a year from now?
- A year from now, will you remember the purpose of the script from just its name?
- Does the script name appear in a good place in the alphabetically sorted folder where it lives? Do you want the script to appear next to the data it processes? ... next to the data or analyses it outputs? Do you want different versions of the script to alphabetize in proper order?
Documentation
Good code documentation is as important as good knowledge of a programming language. Documentation may be done in files that accompany your script or in the script itself. Good documentation should:
- Make it easier for someone else to understand what your code is for and what it does.
- Make it easier to reproduce your results.
- Make you write better code by forcing you to clarify your thinking.
Exactly what makes good documentation will depend on your audience and the purpose of the documentation. Good documentation frequently contains:
- The purpose of the code.
- Brief logic for each major block of your code.
- Where things are located. If you are working on a massive project with many dataset and dozens of scripts, things can be hard to find when you restarting on a project after months away from it.
- How to reproduce your results. In complex projects, you don't go from raw uncleaned data to the final report with just one script. Several scripts are run in a sequence. Good documentation captures that so that if you had nothing but the raw data, the documentation, and your scripts, you could reproduce your final report without having to remember everything you did.
- Good programming minimizes the need for a good memory. So does good documentation.
- Creation and modification dates, along with why the modification was required. Although ideally, this is better part of a good version control system.
Aside: For a higher level discussion of making code truly readable, see
Literate Programming and Web. These ideas led to the ideas of
Noweb,
Sweave, and Literate Data Analysis (
Sweave, Leisch 2002 pdf).
Version Control System
There is nothing worse than making a major overhaul to a script, realizing it wasn't the right thing to do, and then having to figure out how to undo all your changes because you didn't save the old version of your script. Have a system where you routinely save versions of your script and track what changes are made in each version. In general, before making any new changes to a script, save a working version according to your version control system.
Be Clear and Correct First; Fast and Clever Second
Before you spend time finding clever ways to make your code run fast, first make your code as clear as you can and make sure it is correct. Then you can optimize your code as time permits.
Get Bang for Your Buck
Commenting, documenting, tracking versions, all take time. In the long run, the time should pay off with better code, more readable code, more reproducible analyses, faster restart times when coming back to a project after months away from it, the ability to quickly undo bad choices, and so on. A modest time investment in developing and maintaining these practices can pay off big over time. Find the right amount of time to spend so you get bang for your buck.
Test With a Sample of Your Data
If you are working with a big data set, it can be difficult to test your code if it takes a long time to run. In this case, take a sample of you data and use that for testing.
Testing and debugging
Many programming languages have debuggers available. Debuggers are programs that will allow you to trace your code as it is run. In other words you can see what value a particular variable contains at a specific point in your code. You may also consider using print statements within your code. Here are a few places where print statements can be helpful:
- Entering and leaving a function
- The length of a vector before looping through it
- After processing each row within a dataset (print the record ID or a count)
Unclutter Where The Language and Your Experience Level Allow
Some languages have very strict syntax. Others like R are more flexible and provide opportunities to unclutter your code. Consider the following R code.
a <- 3;
b <- 5;
if (a > b) {
"a is greater than b";
} else
if (a < b) {
"a is less than b";
} else {
"a and b are equal";
}
The following code is read by most people more easily and by R just as well as the code cluttered with semicolons and curly brackets.
a <- 3
b <- 5
if (a > b) "a is greater than b" else
if (a < b) "a is less than b" else
"a and b are equal"
When you can unclutter your code requires some understanding of how R reads your code. Consequently, the novice R programmer may prefer to write the first code because it requires less understanding of subtleties in R. The experienced R programmer may prefer the second code because it is easier to read. Learning where you can unclutter your code can pay off with more readable code.
Performance
If you know how many elements will be in a vector, create the vector with that size. This is much more efficient than appending values onto the end of the vector.
Here's an example:
squaredBad<-function(size) {squared<-c();for(i in 1:size){squared<-append(squared, i^2)};squared}
squaredGood<-function(size) {squared<-numeric(size);for(i in 1:size) {squared[i]<-i^2};squared}
x<-squaredBad(100000)
y<-squaredGood(100000)
It would be even more efficient if you could accomplish this without a for loop:
squaredBest <- function(size) { seq(size)^2 }
z <- squaredBest(100000)
Floating-point Arithmetic
Computers use a binary representation to store numbers, thus making it difficult to store the exact representation of decimal numbers. This can lead to cases where two numbers that should be equal are not.
0.68 == 0.68 #=> [1] TRUE
which(seq(0.6, 0.7, 0.01) == 0.68) #=> numeric(0)
In the R code above, the sequence (0.60, 0.61, …, 0.68, 0.69, 0.70) should have a match found at the ninth element, 0.68. Instead, the code returns no matches.
In such cases, you may want to check for "closeness".
tol <- function(x, y, delta) { (x < y + delta) & (x > y - delta) }
which(tol(seq(0.6, 0.7, 0.01), 0.68, .00000001)) #=> [1] 9
This R code rephrases the question as are these two numbers within
+0.00000001 of each other. This
close enough question allows for the small imprecision caused by how the computer stores numbers.
Code Review
Get more eyes than your own on your code. Depending on the complexity of the code and project, you may want anything from a simple
over the shoulder review where you walk someone through your code on the monitor to a
formal walk-through where you assemble a group to review your code. Peer reviews and formal walk-throughs are a big time investment, but they have a big payoff. The errors you catch as a team alone will make this worthwhile. It is also an opportunity to get personalized feedback on your programming practices, as well as a chance for everyone to learn from each other. One caveat. Just like when reviewing someone's writing, artwork, cooking, etc., feelings can easily be bruised. It is important for all participants to be respectful towards each other. In more formal reviews, it is helpful for everyone to have a checklist of things they should check in the code and supporting documents. See the References and Useful Links section below for more information on code reviews.
One of the principles under which the Department of Biostatistics operates is that faculty and staff make themselves available for helping each other. A review of samples of your work will be done in a friendly fashion. This is also true at the R Clinic, which is an excellent opportunity to obtain feedback on small sections of your code that might represent larger sections.
References and Useful Links