Procedures for Preparing Datasets for Analysis and for Transmission to the Department of Biostatistics | NEW Data Sharing

Data must be in a rectangular format with one observation or subject per row. If data are in a spreadsheet, there should be only one array per sheet. Usually, all the data can be combined into one array using the guidelines below.

Basic data guidelines

  1. Think through the analysis before collecting any data.
  2. Give each column a unique, simple, 1-word name, with no spaces, beginning with a letter, and place this name in the first row. There should be no parentheses or special characters in the variable names. You can use camel case to make them easier to read. For example, "tumorChar" for "tumor characteristic," or "estrogenSymp" for "estrogen symptoms."
  3. Put only one variable in a column. Do not combine variables in the same column.
  4. Enter each patient (or unit of analysis) on a separate line (row), beginning on the second line (the row after the column names). If patients are observed multiple times and you want a longitudinal analysis, you most likely need to use a separate row for each observation of a patient. Guidance from a statistician before collecting data is recommended especially in this case.
  5. Enter quantitative data as numbers. Avoid entering letters, words, string variables (e.g.,NA, 22%, <3.6), or anything that resembles a cartoon curse word, @#&*%,. A computer will have to be able to read this. In Excel, all columns with quantities should be formatted as numbers or dates (not as general or text).
  6. Do not enter nominal or binary data using numbers. (For example no=0 and yes=1) This way, the information is contained in the data and does not require a data dictionary. Before collecting the data, decide on the possible answer choices. You can set this up in redCap as a dropdown list or with radio buttons. In excel, you can set up a dropdown list. Make sure the answer choices are always entered consistently, including any capitalization. R statistical software can handle these values.
  7. Give each research participant or patient a unique case number (1,2,3, etc.)- in the first column. Delete patient name, SS#, MR#, and any identifying information before sending it to a statistician. Always save the spreadsheet with a password.
  8. Enter cases and controls in the same spreadsheet. Use one variable to define the control group For example, call the variable "group," and use the values "case" and "control" or "drug A" and "drug B".
  9. Quantify. Enter continuous measurements when possible. If you decide to categorize later it will be possible, but if you only collect categories, it will be impossible to decide to use the original measurements later.
  10. Plan in advance how you will record missing values. Be consistent. An example would be leaving all missing values blank or entering "NA" for missing values.
  11. Try to design your spreadsheet or database so that it does not rely on an elaborate data dictionary or system to understand. However, if you have a very large number of variables or very complex data, you may need to use variable naming conventions, in which case you should have a separate file that explains this.
  12. Have a biostatistician review the coding before data entry and again after the first 10 patients have been entered. We love helping on the front end because it helps avoid problems later.
  13. See here for some great ideas about data preparation.

Preferred Dataset Formats

  • R .rda files composed of a data frame object.
  • Comma separated variables with field names in one row and optionally with long field labels (descriptions) in another row. These are desired over .xls because they don't require intermediate steps of converting.
  • Spreadsheets, formatted as described under comma separated variables
The following two formats are desirable because they facilitate transmission of variable labels (long variable names) and value labels (definitions for coded variables).

Preferred Method for Transmitting Datasets to the Department of Biostatistics

Stata and SPSS files can be read into R using the functions stata.get and spss.get in the Hmisc package. These functions have many options. stata.get automatically senses date variables. With spss.get you can specify a list of variables to be converted to dates in R.

For Department members: Stat/Transfer runs under Windows or Linux and can convert any SAS binary format to Stata to import into R.

Example Spreadsheets | Dallal's Notes

  • Spreadsheet from Heaven: Excel spreadsheet to demonstrate the proper way of entering data for a clinical research project.

  • Spreadsheet from hell: Excel spreadsheet to demonstrate improper ways of entering data for a clinical research project.

"Research demands involvement. It cannot be delegated very far."
Topic attachments
I Attachment Action Size Date Who Comment
Spreadsheetfromhell.xlsxls Spreadsheetfromhell.xls manage 17.5 K 17 May 2007 - 11:37 ColeBeck  
spreadsheetfromheaven.csvcsv spreadsheetfromheaven.csv manage 1.1 K 28 Jan 2008 - 22:20 FrankHarrell Spreadsheet from Heaven (csv version exported from spreadsheet)
spreadsheetfromheaven.xlsxls spreadsheetfromheaven.xls manage 16.5 K 17 May 2007 - 11:37 ColeBeck  
Topic revision: r16 - 17 Jun 2016, FrankHarrell

This site is powered by FoswikiCopyright © 2013-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback