Line: 1 to 1  

 
Changed:  
< < 
Checklist for Authors  References  
> > 
Checklist for Authors  References  General Issues  
Statistical Problems to Document and to Avoid

This list is not all-inclusive, and nominations for additional entries are welcomed. The list is not meant to imply that a lead author should wait until the end to correct statistical errors, especially with regard to design flaws. To make your research reproducible, you need a statistical analysis plan that is completed before starting any data analysis that reveals patterns about which you are hypothesizing. The plan is best developed in close collaboration with a statistician. Another reason that collaborating with a statistician up front is important is the high frequency with which statisticians find flaws in the response variable chosen by the investigator, with regard to information content, power, precision, and the more subtle issue of finding that the response variable's definition depends on another variable, e.g., when the response variable has different meanings for different subjects.  
Line: 121 to 121  
 
Added:  
> > 

Line: 1 to 1  

 
Changed:  
< < 
Checklist for Authors  References  Editable References  
> > 
Checklist for Authors  References  
Statistical Problems to Document and to Avoid  
Added:  
> >  This list is not all-inclusive, and nominations for additional entries are welcomed. The list is not meant to imply that a lead author should wait until the end to correct statistical errors, especially with regard to design flaws. To make your research reproducible, you need a statistical analysis plan that is completed before starting any data analysis that reveals patterns about which you are hypothesizing. The plan is best developed in close collaboration with a statistician. Another reason that collaborating with a statistician up front is important is the high frequency with which statisticians find flaws in the response variable chosen by the investigator, with regard to information content, power, precision, and the more subtle issue of finding that the response variable's definition depends on another variable, e.g., when the response variable has different meanings for different subjects.  
Design and Sample Size Problems

Use of an improper effect size 
Line: 1 to 1  

Checklist for Authors  References  Editable References  Statistical Problems to Document and to Avoid  
Line: 71 to 71  
Lack of insignificant variables in the final model: Unless the sample size is huge, this is usually the result of the authors using stepwise variable selection or some other approach for filtering out "insignificant" variables. Hence the presence of a table of variables in which every variable is significant is usually the sign of a serious problem.  
Changed:  
< <  Authors frequently use strategies involving removing insignificant terms from the model without making an attempt to derive valid confidence intervals or Pvalues that account for uncertainty in which terms were selected (using for example the bootstrap or penalized maximum likelihood estimation). A paper in J Clin Epi March 2009 cited Ockham's razor as a principle to be followed when building a model, not realizing that parsimony resulting from utilizing of the data at hand to make modeling decisions only seems to result in parsimony. Removing insignificant terms causes bias, inaccurate (too narrow) confidence intervals, and failure to preserve type I error in the resulting model's Pvalues, which are calculated as though the model was completely prespecified.  
> >  Authors frequently use strategies involving removing insignificant terms from the model without making an attempt to derive valid confidence intervals or P-values that account for uncertainty in which terms were selected (using, for example, the bootstrap or penalized maximum likelihood estimation). J Clin Epi 2009-03-01, Volume 62, Issue 3, Pages 232-240 cited Ockham's razor as a principle to be followed when building a model, not realizing that parsimony resulting from utilizing the data at hand to make modeling decisions only seems to result in parsimony. Removing insignificant terms causes bias, inaccurate (too narrow) confidence intervals, and failure to preserve type I error in the resulting model's P-values, which are calculated as though the model were completely prespecified.  
Overfitting and lack of model validation: When a multivariable model is reported, an unbiased validation (at least an internal validation) should be reported in the paper unless
 
Line: 80 to 80  
The 20:1 rule is as follows. Let m denote the effective sample size (the number of subjects if the response variable is a fully observed continuous one; the number of events if doing a survival analysis; the lower of the number of events and number of non-events if the response is dichotomous) and p denote the number of candidate predictor terms that were examined in any way with respect to the response variable. p includes nonlinear terms, product terms, different transformations attempted, the total number of cutoffs attempted to be applied to continuous predictors, and the number of variables dropped from the final model in a way that was unblinded to the response. If the ratio of m to p exceeds 20, the model is likely to be reliable and there is less need for the model to be validated. When a validation is needed, the best approach is typically the bootstrap. This is a Monte Carlo simulation technique in which all steps of the model-building process (if the model was not prespecified) are repeated for each of, say, 150 samples with replacement of size n from the original sample containing n subjects.  
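The optimism bootstrap described above can be sketched in a few lines. This is an illustrative Python version under simplifying assumptions (ordinary least squares on simulated data, R-squared as the accuracy index); in a real analysis every non-prespecified model-building step would be repeated inside the loop:

```python
import numpy as np

rng = np.random.default_rng(42)

def r2(y, yhat):
    # conventional coefficient of determination
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

def fit_predict(X, y, Xnew):
    # ordinary least squares with intercept
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta

# simulated data: 60 subjects, 10 candidate predictors (m:p well under 20:1,
# so overfitting is expected); only the first predictor carries signal
n, p = 60, 10
X = rng.normal(size=(n, p))
y = 0.5 * X[:, 0] + rng.normal(size=n)

apparent = r2(y, fit_predict(X, y, X))

# optimism bootstrap: refit on each resample, score on resample (training)
# and on the original data; the average gap estimates the optimism
B = 150
optimism = 0.0
for _ in range(B):
    idx = rng.integers(0, n, n)
    Xb, yb = X[idx], y[idx]
    optimism += (r2(yb, fit_predict(Xb, yb, Xb))
                 - r2(y, fit_predict(Xb, yb, X))) / B

corrected = apparent - optimism
print(f"apparent R^2 = {apparent:.3f}, optimism = {optimism:.3f}, "
      f"corrected R^2 = {corrected:.3f}")
```

The corrected index is the apparent index minus the estimated optimism, which is substantial here because m:p is only 6:1.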
Changed:  
< < 
Failure to validate predictive accuracy with full resolution  
> > 
Failure to validate predictive accuracy with full resolution  
When a predictive model or instrument is intended to provide absolute estimates (e.g., risk or time to event), it is necessary to validate the absolute accuracy of the instrument over the entire range of predictions that are supported by the data. It is not appropriate to use binning (categorization) when estimating the calibration curve. Instead, the calibration curve should be estimated using a method that smoothly (without assuming linearity) relates predicted values (formed from a training set) to observed values (in an independent test set, or overfitting-corrected using resampling; see Stat in Med 15:361; 1996). For testing whether the calibration curve is ideal (i.e., is the 45-degree line of identity), consider using the single d.f. Spiegelhalter z-test (Stat in Med 5:421; 1986). The mean absolute error and the 90th percentile of absolute calibration error are useful summary statistics. All of these quantities and tests are provided by the R rms package.
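As a sketch of the Spiegelhalter z-test mentioned above: the statistic below follows the form used in the rms package's val.prob (stated here as an assumption to check against the 1986 paper), applied to simulated predictions that are either truly calibrated or too extreme:

```python
import numpy as np

def spiegelhalter_z(y, p):
    # Spiegelhalter (Stat in Med 5:421; 1986) calibration test statistic;
    # formula as commonly implemented, stated as an assumption
    num = np.sum((y - p) * (1 - 2 * p))
    den = np.sqrt(np.sum((1 - 2 * p) ** 2 * p * (1 - p)))
    return num / den

rng = np.random.default_rng(1)
n = 2000
p = rng.uniform(0.05, 0.95, n)      # the model's predicted risks

y_ok = rng.binomial(1, p)           # outcomes truly generated from p
p_true = 0.5 + 0.7 * (p - 0.5)      # true risks shrunk toward 0.5, i.e.
y_over = rng.binomial(1, p_true)    # the stated predictions are too extreme

z_ok = spiegelhalter_z(y_ok, p)
z_over = spiegelhalter_z(y_over, p)
print(f"well calibrated: z = {z_ok:.2f}; overconfident: z = {z_over:.2f}")
```

A large |z| flags departure from the line of identity; overconfident (too extreme) predictions, typical of overfitted models, produce a strongly positive z here.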
Use of Imprecise Language  Glossary 
Line: 1 to 1  

Line: 67 to 67  
Inappropriate model specification: Researchers frequently formulate their first model in such a way that it encapsulates a model specification bias that affects all later analytical steps. For example, the March 2009 issue of J Clin Epi has a paper examining the relationship between change in cholesterol and mortality. The authors never questioned whether the mortality effect was fully captured by a simple change, i.e., is the prediction equation of the form f(post) - f(pre) where f is the identity function? It is often the case that the effect of a simple difference depends on the "pre" value, that the transformation f is not linear, or that there is an interaction between pre and post. All of these effects are contained in a flexible smooth nonlinear regression spline surface (tensor spline) in three dimensions where the predictors are pre and post. One can use the surface to test the adequacy of its special case (post - pre) and to visualize all patterns.

Use of stepwise variable selection  
Changed:  
< <  Stepwise variable selection, univariable screening, and any method that eliminates "insignificant" predictor variables from the final model causes a multitude of serious problems related to bias, significance, improper confidence intervals, and multiple comparisons. Stepwise variable selection should be avoided unless backwards elimination is used with an alpha level of 0.5 or greater.  
> >  Stepwise variable selection, univariable screening, and any method that eliminates "insignificant" predictor variables from the final model cause a multitude of serious problems related to bias, significance, improper confidence intervals, and multiple comparisons. Stepwise variable selection should be avoided unless backwards elimination is used with an alpha level of 0.5 or greater. See also here.  
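A small simulation illustrating why screening out "insignificant" predictors is dangerous: with 20 candidate predictors that are pure noise, univariable screening at alpha = 0.05 still "finds" at least one predictor in roughly two-thirds of datasets (1 - 0.95^20 ≈ 0.64). The sample sizes and counts here are arbitrary illustration choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n, p, n_sim = 50, 20, 200
hits = 0  # simulations in which screening retains at least one predictor
for _ in range(n_sim):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)  # outcome unrelated to every predictor
    pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(p)])
    if (pvals < 0.05).any():
        hits += 1

frac = hits / n_sim
print(f"fraction of pure-noise datasets where screening keeps >= 1 "
      f"'significant' predictor: {frac:.2f} (theory: {1 - 0.95**20:.2f})")
```

Any model built from the survivors of such a screen inherits this inflated type I error.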
Lack of insignificant variables in the final model: Unless the sample size is huge, this is usually the result of the authors using stepwise variable selection or some other approach for filtering out "insignificant" variables. Hence the presence of a table of variables in which every variable is significant is usually the sign of a serious problem. 
Line: 1 to 1  

Line: 21 to 21  
The mean and standard deviation are not descriptive of variables that have an asymmetric distribution, such as variables with a heavy right tail (which includes many clinical lab measurements). Quantiles are always descriptive of continuous variables no matter what the distribution. A good 3-number summary is the lower quartile (25th percentile), median, and upper quartile (75th percentile). The difference in the outer quartiles is a measure of subject-to-subject variability (it is an interval containing half the subjects' values). The median is always descriptive of "typical" subjects. By comparing the difference between the upper quartile and the median with the difference between the median and the lower quartile, one obtains a sense of the symmetry of the distribution. Above all, don't provide descriptive statistics such as "the mean hospital cost was $10,000 plus or minus $20,000." Nonparametric bootstrap confidence intervals will prevent impossible values being used as confidence limits.
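A sketch of the recommended descriptive approach: a quartile summary plus a nonparametric bootstrap percentile interval for the mean, whose limits cannot be impossible (e.g., negative) values because they are means of resampled nonnegative costs. The lognormal "hospital cost" data are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
# illustrative heavy right-tailed cost data (lognormal is an assumption)
cost = rng.lognormal(mean=8.5, sigma=1.2, size=300)

# 3-number summary: lower quartile / median / upper quartile
q1, med, q3 = np.percentile(cost, [25, 50, 75])
print(f"quartile summary: {q1:.0f} / {med:.0f} / {q3:.0f}")

# nonparametric bootstrap percentile CI for the mean: each limit is the
# mean of a resample of observed nonnegative costs, so it cannot go negative
boot_means = np.array([rng.choice(cost, size=cost.size).mean()
                       for _ in range(2000)])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {cost.mean():.0f}, 95% bootstrap CI = ({lo:.0f}, {hi:.0f})")
```

Contrast this with "mean plus or minus 2 SD", which for such skewed data would report a nonsensical negative lower cost.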
When using the Wilcoxon-Mann-Whitney test for comparing two continuous or ordinal variables, use difference estimates that are consistent with this test. The Wilcoxon test does not test for difference in medians or means. It tests whether the Hodges-Lehmann estimate of the difference between two groups is zero. The HL estimate is the median difference over all possible pairs of subjects, the first from group 1 and the second from group 2. See http://en.wikipedia.org/wiki/MannWhitney_U for an example statement of results, and a good reference for this. Recommended Software Note: When there are excessive ties in the response variable, such as a variable with clumping at zero, quantiles such as the median may not be good descriptive statistics (the mean with associated bootstrap confidence limits may be better), and the Hodges-Lehmann estimate does not work well.  
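The Hodges-Lehmann shift estimate is simple to compute directly as the median of all pairwise between-group differences; a sketch with tiny made-up groups:

```python
import numpy as np
from scipy import stats

def hodges_lehmann(x, y):
    # median of all pairwise differences y_j - x_i: the shift estimate
    # consistent with the Wilcoxon-Mann-Whitney test
    diffs = np.subtract.outer(np.asarray(y, float), np.asarray(x, float))
    return np.median(diffs)

g1 = [1, 2]
g2 = [3, 5]
hl = hodges_lehmann(g1, g2)   # pairwise diffs: 2, 1, 4, 3 -> median 2.5
res = stats.mannwhitneyu(g2, g1, alternative="two-sided")
print(f"HL estimate = {hl}, Wilcoxon-Mann-Whitney P = {res.pvalue:.3f}")
```

Reporting the HL estimate alongside the Wilcoxon P-value keeps the effect estimate consistent with the test actually performed.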
Added:  
> >  Although seen as appealing by some, the so-called "number needed to treat" suffers from a long list of problems and is not recommended.  
Failure to include confidence intervals: Confidence intervals for key effects should always be included. Studies should be designed to provide sufficiently narrow confidence intervals so that the results contain useful information. See sections 3.4.4, 3.5.3, 3.7.4 of this and The End of Statistical Significance?

Inappropriate choice of measure of change  
Line: 60 to 63  
Multivariable Modeling Problems

Inappropriate linearity assumptions  
Changed:  
< <  In general there is no reason to assume that the relationship between a predictor and the response is linear. However, categoring a continuous predictor can cause major problems. Good solutions are to use regression splines or nonparametric regression.  
> >  In general there is no reason to assume that the relationship between a predictor and the response is linear. However, categorizing a continuous predictor can cause major problems. Good solutions are to use regression splines or nonparametric regression.  
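A sketch contrasting the two approaches on simulated data: dichotomizing the predictor at its median versus a restricted cubic spline fit by least squares. The spline basis follows the truncated-power form used by the R rms package (stated here as an assumption), and the sine-shaped truth is invented for illustration:

```python
import numpy as np

def rcs_basis(x, knots):
    # restricted cubic spline basis (linear in the tails), following the
    # truncated-power formulation used e.g. by R's rms::rcs (an assumption)
    t = np.asarray(knots, float)
    denom = (t[-1] - t[0]) ** 2
    cols = [x]
    for j in range(len(t) - 2):
        term = (np.clip(x - t[j], 0, None) ** 3
                - np.clip(x - t[-2], 0, None) ** 3
                * (t[-1] - t[j]) / (t[-1] - t[-2])
                + np.clip(x - t[-1], 0, None) ** 3
                * (t[-2] - t[j]) / (t[-1] - t[-2]))
        cols.append(term / denom)
    return np.column_stack(cols)

def ols_r2(Z, y):
    # least-squares fit with intercept; return R^2
    A = np.column_stack([np.ones(len(y)), Z])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=200)

# (a) dichotomize at the median: discards most of the information in x
r2_split = ols_r2((x > np.median(x)).astype(float)[:, None], y)
# (b) 5-knot restricted cubic spline: flexible and smooth
knots = np.percentile(x, [5, 27.5, 50, 72.5, 95])
r2_spline = ols_r2(rcs_basis(x, knots), y)
print(f"R^2 dichotomized: {r2_split:.2f}; R^2 spline: {r2_spline:.2f}")
```

With only four regression degrees of freedom, the spline recovers the nonlinear shape that the median split throws away.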
Inappropriate model specification: Researchers frequently formulate their first model in such a way that it encapsulates a model specification bias that affects all later analytical steps. For example, the March 2009 issue of J Clin Epi has a paper examining the relationship between change in cholesterol and mortality. The authors never questioned whether the mortality effect was fully captured by a simple change, i.e., is the prediction equation of the form f(post) - f(pre) where f is the identity function? It is often the case that the effect of a simple difference depends on the "pre" value, that the transformation f is not linear, or that there is an interaction between pre and post. All of these effects are contained in a flexible smooth nonlinear regression spline surface (tensor spline) in three dimensions where the predictors are pre and post. One can use the surface to test the adequacy of its special case (post - pre) and to visualize all patterns.

Use of stepwise variable selection  
Line: 68 to 71  
Lack of insignificant variables in the final model: Unless the sample size is huge, this is usually the result of the authors using stepwise variable selection or some other approach for filtering out "insignificant" variables. Hence the presence of a table of variables in which every variable is significant is usually the sign of a serious problem.  
Changed:  
< <  Authors frequently use strategies involving removing insignificant terms from the model without making an attempt to derive valid confidence intervals or Pvalues that account for uncertainty in which terms were selected (using for example the bootstrap or penalized maximum likelihood esetimation). A paper in J Clin Epi March 2009 cited Ockham's razor as a principle to be followed when building a model, not realizing that parsimony resulting from utilizing of the data at hand to make modeling decisions only seems to result in parsimony. Removing insignificant terms causes bias, inaccurate (too narrow) confidence intervals, and failure to preserve type I error in the resulting model's Pvalues, which are calculated as though the model was completely prespecified.  
> >  Authors frequently use strategies involving removing insignificant terms from the model without making an attempt to derive valid confidence intervals or P-values that account for uncertainty in which terms were selected (using, for example, the bootstrap or penalized maximum likelihood estimation). A paper in J Clin Epi March 2009 cited Ockham's razor as a principle to be followed when building a model, not realizing that parsimony resulting from utilizing the data at hand to make modeling decisions only seems to result in parsimony. Removing insignificant terms causes bias, inaccurate (too narrow) confidence intervals, and failure to preserve type I error in the resulting model's P-values, which are calculated as though the model were completely prespecified.  
Overfitting and lack of model validation: When a multivariable model is reported, an unbiased validation (at least an internal validation) should be reported in the paper unless
 
Line: 76 to 79  
 
Changed:  
< <  When a validation is needed, the best approach is typically the bootstrap. This is a Monte Carlo simulation technique in which all steps of the modelbuilding process (if the model was not prespecified) are repeated for each of, say, 150 samples with replacement of size n from the original sample containing n subjects.  
> > 
When a validation is needed, the best approach is typically the bootstrap. This is a Monte Carlo simulation technique in which all steps of the model-building process (if the model was not prespecified) are repeated for each of, say, 150 samples with replacement of size n from the original sample containing n subjects.
Failure to validate predictive accuracy with full resolution: When a predictive model or instrument is intended to provide absolute estimates (e.g., risk or time to event), it is necessary to validate the absolute accuracy of the instrument over the entire range of predictions that are supported by the data. It is not appropriate to use binning (categorization) when estimating the calibration curve. Instead, the calibration curve should be estimated using a method that smoothly (without assuming linearity) relates predicted values (formed from a training set) to observed values (in an independent test set, or overfitting-corrected using resampling; see Stat in Med 15:361; 1996). For testing whether the calibration curve is ideal (i.e., is the 45-degree line of identity), consider using the single d.f. Spiegelhalter z-test (Stat in Med 5:421; 1986). The mean absolute error and the 90th percentile of absolute calibration error are useful summary statistics. All of these quantities and tests are provided by the R rms package.  
Use of Imprecise Language  Glossary

It is important to distinguish rates from probabilities, odds ratios from risk ratios, and various other terms. The word risk usually means the same thing as probability. Here are some common mistakes seen in manuscripts: 
Line: 1 to 1  

Line: 61 to 61  
Multivariable Modeling Problems

Inappropriate linearity assumptions: In general there is no reason to assume that the relationship between a predictor and the response is linear. However, categorizing a continuous predictor can cause major problems. Good solutions are to use regression splines or nonparametric regression.  
Added:  
> > 
Inappropriate model specification: Researchers frequently formulate their first model in such a way that it encapsulates a model specification bias that affects all later analytical steps. For example, the March 2009 issue of J Clin Epi has a paper examining the relationship between change in cholesterol and mortality. The authors never questioned whether the mortality effect was fully captured by a simple change, i.e., is the prediction equation of the form f(post) - f(pre) where f is the identity function? It is often the case that the effect of a simple difference depends on the "pre" value, that the transformation f is not linear, or that there is an interaction between pre and post. All of these effects are contained in a flexible smooth nonlinear regression spline surface (tensor spline) in three dimensions where the predictors are pre and post. One can use the surface to test the adequacy of its special case (post - pre) and to visualize all patterns.  
Use of stepwise variable selection: Stepwise variable selection, univariable screening, and any method that eliminates "insignificant" predictor variables from the final model cause a multitude of serious problems related to bias, significance, improper confidence intervals, and multiple comparisons. Stepwise variable selection should be avoided unless backwards elimination is used with an alpha level of 0.5 or greater.

Lack of insignificant variables in the final model: Unless the sample size is huge, this is usually the result of the authors using stepwise variable selection or some other approach for filtering out "insignificant" variables. Hence the presence of a table of variables in which every variable is significant is usually the sign of a serious problem.  
Added:  
> >  Authors frequently use strategies involving removing insignificant terms from the model without making an attempt to derive valid confidence intervals or P-values that account for uncertainty in which terms were selected (using, for example, the bootstrap or penalized maximum likelihood estimation). A paper in J Clin Epi March 2009 cited Ockham's razor as a principle to be followed when building a model, not realizing that parsimony resulting from utilizing the data at hand to make modeling decisions only seems to result in parsimony. Removing insignificant terms causes bias, inaccurate (too narrow) confidence intervals, and failure to preserve type I error in the resulting model's P-values, which are calculated as though the model were completely prespecified.  
Overfitting and lack of model validation: When a multivariable model is reported, an unbiased validation (at least an internal validation) should be reported in the paper unless
 
Line: 82 to 86  
 
Changed:  
< < 
Graphics  Handouts  
> > 
Graphics  Handouts  Advice from the PGF manual (chapter 6)  

Line: 1 to 1  

Design and Sample Size Problems

Use of an improper effect size  
Changed:  
< <  If a study is designed to detect a certain effect size with a given power, the effect size should never be the observed effect from another study, which may be estimated with error and be overly optimistic. The effect size to use in planning should be the biologically relevant effect one would regret missing.  
> >  If a study is designed to detect a certain effect size with a given power, the effect size should never be the observed effect from another study, which may be estimated with error and be overly optimistic. The effect size to use in planning should be the clinically or biologically relevant effect one would regret missing. Usually the only information from prior studies that is useful in sample size estimation is (in the case of a continuous response variable with a symmetric distribution) an estimate of the standard deviation or of the correlation between two measurements on the same subject measured at two different times, or (in the case of a binary or time-to-event outcome) event probabilities in control subjects.  
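For a continuous response, the planning calculation then needs only the regret-missing difference and a prior-study SD; a normal-approximation sketch of the usual two-sample formula (without the small-sample t correction):

```python
import math
from scipy.stats import norm

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    # two-sample comparison of means, normal approximation:
    # delta is the smallest clinically relevant difference (the effect one
    # would regret missing), sd the within-group SD taken from prior data
    za = norm.ppf(1 - alpha / 2)
    zb = norm.ppf(power)
    return math.ceil(2 * (za + zb) ** 2 * (sd / delta) ** 2)

# e.g. regret-missing difference of 0.5 units, prior-study SD of 1.0
print(n_per_group(delta=0.5, sd=1.0))  # 63 per group
```

Note that only delta and the SD enter; an "observed effect" from another study plays no role in the calculation.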
Relying on standardized effect sizes  
Changed:  
< <  Many researchers use Cohen's standardized effect sizes in planning a study. This has the advantage of not requiring pilot data. But such effect sizes are not biologically meaningful and may hide important issues as discussed by Lenth. Studies should be designed on the basis of effects that are relevant to the investigator and human subjects.  
> >  Many researchers use Cohen's standardized effect sizes in planning a study. This has the advantage of not requiring pilot data. But such effect sizes are not biologically meaningful and may hide important issues as discussed by Lenth. Studies should be designed on the basis of effects that are relevant to the investigator and human subjects. If, for example, one plans a study to detect a one standard deviation (SD) difference in the means and the SD is large, one can easily miss a biologically important difference that happened to be much less than one SD in magnitude. Note that the SD is a measure of how subjects disagree with one another, not a measure of an effect (e.g., the shift in the mean).  
General Statistical Problems

Inefficient use of continuous variables: Categorizing continuous predictor or response variables into intervals, as detailed here, causes serious statistical inference problems including bias, loss of power, and inflation of type I error.  
Line: 113 to 113  
 
Deleted:  
< < 
 FrankHarrell  26 Jun 2004; updated 4, 7 Jul, 29 Aug, 25 Oct, 24 Nov 2004, 7 Jan 05, 20 Oct, 24 Dec 06, 8 Mar, 3 May, 29 Jun, 11 Dec 07  DanielByrne  Section on Statistical Reporting 7 Jan 05  
Line: 1 to 1  

Line: 110 to 110  
Useful Articles and Web Sites with Statistical Guidance for Authors
 
Added:  
> > 
 
Changed:  
< < 
 FrankHarrell  26 Jun 2004; updated 4, 7 Jul, 29 Aug, 25 Oct, 24 Nov 2004, 7 Jan 05, 20 Oct, 24 Dec 06, 8 Mar, 3 May 07  DanielByrne  Section on Statistical Reporting 7 Jan 05  
> > 
 FrankHarrell  26 Jun 2004; updated 4, 7 Jul, 29 Aug, 25 Oct, 24 Nov 2004, 7 Jan 05, 20 Oct, 24 Dec 06, 8 Mar, 3 May, 29 Jun, 11 Dec 07  DanielByrne  Section on Statistical Reporting 7 Jan 05  
Line: 1 to 1  

Line: 16 to 16  
Inappropriate use of parametric tests: When all that is desired is an unadjusted (for other variables) P-value and a parametric test is used, the resulting inference will not be robust to extreme values, will depend on how the response variable is transformed, and will suffer a loss of power if the data are not normally distributed. Parametric methods are more necessary when adjusting for confounding or for subject heterogeneity, or when dealing with a serially measured response variable.  
Changed:  
< <  When one wants a unitless index of the strength of association between two continuous variables and only wants to assume that the true association is monotonic (is always decreasing or always increasing), the nonparametric Spearman's rho rank correlation coefficient is a good choice. A good nonparametric approach to getting confidence intervals for means and differences in means is the bootstrap.  
> > 
When one wants a unitless index of the strength of association between two continuous variables and only wants to assume that the true association is monotonic (is always decreasing or always increasing), the nonparametric Spearman's rho rank correlation coefficient is a good choice. A good nonparametric approach to getting confidence intervals for means and differences in means is the bootstrap. Recommended Software  
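A minimal illustration of Spearman's rho on a monotonic but nonlinear association, using scipy (the cubic relationship is invented for illustration):

```python
import numpy as np
from scipy import stats

x = np.arange(1.0, 11.0)
y = x ** 3  # strongly nonlinear but perfectly monotonic association

rho, _ = stats.spearmanr(x, y)  # rank-based: unaffected by the nonlinearity
r, _ = stats.pearsonr(x, y)     # linear correlation understates the association
print(f"Spearman rho = {rho:.2f}, Pearson r = {r:.2f}")
```

Because rho depends only on ranks, it equals 1 for any strictly increasing relationship, whereas Pearson's r is attenuated by the curvature.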
Inappropriate descriptive statistics: The mean and standard deviation are not descriptive of variables that have an asymmetric distribution, such as variables with a heavy right tail (which includes many clinical lab measurements). Quantiles are always descriptive of continuous variables no matter what the distribution. A good 3-number summary is the lower quartile (25th percentile), median, and upper quartile (75th percentile). The difference in the outer quartiles is a measure of subject-to-subject variability (it is an interval containing half the subjects' values). The median is always descriptive of "typical" subjects. By comparing the difference between the upper quartile and the median with the difference between the median and the lower quartile, one obtains a sense of the symmetry of the distribution. Above all, don't provide descriptive statistics such as "the mean hospital cost was $10,000 plus or minus $20,000." Nonparametric bootstrap confidence intervals will prevent impossible values being used as confidence limits.  
Changed:  
< < 
When using the WilcoxonMannWhitney test for comparing two continuous or ordinal variables, use difference estimates that are consistent with this test. The Wilcoxon test does not test for difference in medians or means. It tests whether the HodgesLehmann estimate of the difference between two groups is zero. The HLestimate is the median difference over all possible pairs of subjects, the first from group 1 and the second from group 2. See http://en.wikipedia.org/wiki/MannWhitney_U for an example statement of results, and a good reference for this. Recommended Software Note : When there are excessives ties in the response variable such as a variable with clumping at zero, quantiles such as the median may not be good descriptive statistics (the mean may be better), and the HodgesLehmann estimate does not work well.  
> > 
When using the Wilcoxon-Mann-Whitney test for comparing two continuous or ordinal variables, use difference estimates that are consistent with this test. The Wilcoxon test does not test for difference in medians or means. It tests whether the Hodges-Lehmann estimate of the difference between two groups is zero. The HL estimate is the median difference over all possible pairs of subjects, the first from group 1 and the second from group 2. See http://en.wikipedia.org/wiki/MannWhitney_U for an example statement of results, and a good reference for this. Recommended Software Note: When there are excessive ties in the response variable, such as a variable with clumping at zero, quantiles such as the median may not be good descriptive statistics (the mean with associated bootstrap confidence limits may be better), and the Hodges-Lehmann estimate does not work well.  
Failure to include confidence intervals: Confidence intervals for key effects should always be included. Studies should be designed to provide sufficiently narrow confidence intervals so that the results contain useful information. See sections 3.4.4, 3.5.3, 3.7.4 of this and The End of Statistical Significance?

Inappropriate choice of measure of change  
Line: 107 to 107  
 
Changed:  
< < 
Useful Articles with Statistical Guidance for Authors  
> > 
Useful Articles and Web Sites with Statistical Guidance for Authors  
 
Added:  
> > 
 
Changed:  
< < 
 FrankHarrell  26 Jun 2004; updated 4, 7 Jul, 29 Aug, 25 Oct, 24 Nov 2004, 7 Jan 05, 20 Oct, 24 Dec 06, 8 Mar 07  DanielByrne  Section on Statistical Reporting 7 Jan 05  
> > 
 FrankHarrell  26 Jun 2004; updated 4, 7 Jul, 29 Aug, 25 Oct, 24 Nov 2004, 7 Jan 05, 20 Oct, 24 Dec 06, 8 Mar, 3 May 07  DanielByrne  Section on Statistical Reporting 7 Jan 05  
Line: 1 to 1  

Checklist for Authors  References  Editable ReferencesStatistical Problems to Document and to Avoid  
Line: 19 to 19  
When one wants a unitless index of the strength of association between two continuous variables and only wants to assume that the true association is monotonic (is always decreasing or always increasing), the nonparametric Spearman's rho rank correlation coefficient is a good choice. A good nonparametric approach to getting confidence intervals for means and differences in means is the bootstrap.
Inappropriate descriptive statisticsThe mean and standard deviation are not descriptive of variables that have an asymmetric distribution such as variables with a heavy right tail (which includes many clinical lab measurements). Quantiles are always descriptive of continuous variables no matter what the distribution. A good 3number summary is the lower quartile (25th percentile), median, and upper quartile (75th percentile). The difference in the outer quartiles is a measure of subjecttosubject variability (it is an interval containing half the subjects' values). The median is always descriptive of "typical" subjects. By comparing the difference between the upper quartile and the median with the difference between the median and the lower quartile, one obtains a sense of the symmetry of the distribution. Above all don't provide descriptive statistics such as "the mean hospital cost was $10,000 plus or minus $20,000." Nonparametric bootstrap confidence intervals will prevent impossible values being used as confidence limits.  
Added:  
> > 
When using the WilcoxonMannWhitney test for comparing two continuous or ordinal variables, use difference estimates that are consistent with this test. The Wilcoxon test does not test for difference in medians or means. It tests whether the HodgesLehmann estimate of the difference between two groups is zero. The HLestimate is the median difference over all possible pairs of subjects, the first from group 1 and the second from group 2. See http://en.wikipedia.org/wiki/MannWhitney_U for an example statement of results, and a good reference for this. Recommended Software Note : When there are excessives ties in the response variable such as a variable with clumping at zero, quantiles such as the median may not be good descriptive statistics (the mean may be better), and the HodgesLehmann estimate does not work well.  
Failure to include confidence intervalsConfidence intervals for key effects should always be included. Studies should be designed to provide sufficiently narrow confidence intervals so that the results contain useful information. See sections 3.4.4, 3.5.3, 3.7.4 of this and The End of Statistical Significance?Inappropriate choice of measure of change  
Line: 108 to 110  
Useful Articles with Statistical Guidance for Authors
 
Changed:  
< < 
 FrankHarrell  26 Jun 2004; updated 4, 7 Jul, 29 Aug, 25 Oct, 24 Nov 2004, 7 Jan 05, 20 Oct, 24 Dec 06  DanielByrne  Section on Statistical Reporting 7 Jan 05  
> > 
 FrankHarrell  26 Jun 2004; updated 4, 7 Jul, 29 Aug, 25 Oct, 24 Nov 2004, 7 Jan 05, 20 Oct, 24 Dec 06, 8 Mar 07  DanielByrne  Section on Statistical Reporting 7 Jan 05  
Line: 1 to 1  

Checklist for Authors  References  Editable ReferencesStatistical Problems to Document and to Avoid  
Line: 20 to 20  
Inappropriate descriptive statisticsThe mean and standard deviation are not descriptive of variables that have an asymmetric distribution such as variables with a heavy right tail (which includes many clinical lab measurements). Quantiles are always descriptive of continuous variables no matter what the distribution. A good 3-number summary is the lower quartile (25th percentile), median, and upper quartile (75th percentile). The difference in the outer quartiles is a measure of subject-to-subject variability (it is an interval containing half the subjects' values). The median is always descriptive of "typical" subjects. By comparing the difference between the upper quartile and the median with the difference between the median and the lower quartile, one obtains a sense of the symmetry of the distribution. Above all, don't provide descriptive statistics such as "the mean hospital cost was $10,000 plus or minus $20,000." Nonparametric bootstrap confidence intervals will prevent impossible values from being used as confidence limits.Failure to include confidence intervals  
Changed:  
< <  Confidence intervals for key effects should always be included. Studies should be designed to provide sufficiently narrow confidence intervals so that the results contain useful information. See sections 3.4.4, 3.5.3, 3.7.4 of this.  
> >  Confidence intervals for key effects should always be included. Studies should be designed to provide sufficiently narrow confidence intervals so that the results contain useful information. See sections 3.4.4, 3.5.3, 3.7.4 of this and The End of Statistical Significance?  
Inappropriate choice of measure of change  
Changed:  
< <  In a single-subject-group study in which there are paired comparisons (e.g., pre vs. post measurements), researchers too easily take for granted the appropriate measure of change (simple difference, percent change, ratio, difference of square roots, etc.). It is important to choose a change measure that can be taken out of context, i.e., is independent of baseline. See MeasureChange, STBRsylConcepts, and STBRsylEffectEvidence for more information.  
> >  In a single-subject-group study in which there are paired comparisons (e.g., pre vs. post measurements), researchers too easily take for granted the appropriate measure of change (simple difference, percent change, ratio, difference of square roots, etc.). It is important to choose a change measure that can be taken out of context, i.e., is independent of baseline. See MeasureChange, STBRsylConcepts, and STBRsylEffectEvidence for more information. In general, change scores cause more problems than they solve. For example, one cannot use summary statistics on percent changes because of improper cancellation of positive and negative changes.  
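The cancellation problem with percent change is easy to demonstrate. In this contrived two-subject example, one measurement doubles and the other halves, yet the mean percent change suggests a net improvement, while a symmetric measure (the log-ratio) correctly averages to about zero:

```python
import math
import statistics

# Two hypothetical subjects: one doubles (100 -> 200), one halves (200 -> 100)
pre  = [100, 200]
post = [200, 100]

pct_change = [100 * (b - a) / a for a, b in zip(pre, post)]
print(statistics.fmean(pct_change))  # -> 25.0, a spurious "average improvement"

# Log-ratios are symmetric, so opposite changes cancel properly
log_ratio = [math.log(b / a) for a, b in zip(pre, post)]
print(statistics.fmean(log_ratio))   # ~0.0
```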
Use of change scores in parallel-group designsWhen there is more than one subject group, for example a two-treatment parallel-group randomized controlled trial, it is very problematic to incorporate change scores into the analysis. First, it may be difficult to choose the appropriate change score as mentioned above (e.g., relative vs. absolute). Second, regression to the mean and measurement error often render simple change scores inappropriate. Third, when there is a baseline version of the final response variable, it is necessary to control for subject heterogeneity by including that baseline variable as a covariate in an analysis of covariance. Then the baseline variable needs to appear in both the left-hand and right-hand sides of the regression model, making interpretation of the results more difficult. It is much preferred to keep the pure response (follow-up) variable on the left side and the baseline value on the right side of the equation.Inappropriate analysis of serial data (repeated measures)Some researchers analyze serial responses, when multiple response measurements are made per subject, as if they were from separate subjects. This will exaggerate the real sample size and make P values too small. Some researchers still use repeated measures ANOVA even though this technique makes assumptions that are extremely unreasonable for serial data. An appropriate methodology should be used, such as generalized least squares or mixed effects models with an appropriate covariance structure, GEE, or an approach related to GEE in which the cluster bootstrap or the cluster sandwich covariance estimator is used to correct a working independence model for within-subject correlation.Making conclusions from large P values  More Information  Absence of Evidence is not Evidence of Absence  
Changed:  
< <  In general, the only way that a large P value can be interpreted is for example "The study did not provide sufficient evidence for an effect." One cannot say " P = 0.7, therefore we conclude the drug has no effect". Only when the corresponding confidence interval excludes both clinically significant benefit and harm can one make such a conclusion. A large P value by itself merely means that a higher sample size is required to allow conclusions to be drawn. See section 3.8 of this for details.  
> >  In general, the only way that a large P value can be interpreted is for example "The study did not provide sufficient evidence for an effect." One cannot say " P = 0.7, therefore we conclude the drug has no effect". Only when the corresponding confidence interval excludes both clinically significant benefit and harm can one make such a conclusion. A large P value by itself merely means that a higher sample size is required to allow conclusions to be drawn. See section 3.8 of this for details, along with this.  
FilteringThere are many ways that authors have been seduced into taking results out of context, particularly when reporting the one favorable result out of dozens of attempted analyses. Filtering out (failing to report) the other analyses is scientifically suspect. At the very least, an investigator should disclose that the reported analyses involved filtering of some kind, and she should provide details. The context should be reported (e.g., "Although this study is part of a planned oneyear followup of gastric safety for Cox2 inhibitors, here we only report the more favorable short term effects of the drug on gastric side effects."). To preserve type I error, filtering should be formally accounted for, which places the burden on the investigator of undertaking often complex Monte Carlo simulations.  
Line: 107 to 108  
Useful Articles with Statistical Guidance for Authors
 
Changed:  
< < 
 FrankHarrell  26 Jun 2004; updated 4, 7 Jul, 29 Aug, 25 Oct, 24 Nov 2004, 7 Jan 05, 20 Oct 06  DanielByrne  Section on Statistical Reporting 7 Jan 05  
> > 
 FrankHarrell  26 Jun 2004; updated 4, 7 Jul, 29 Aug, 25 Oct, 24 Nov 2004, 7 Jan 05, 20 Oct, 24 Dec 06  DanielByrne  Section on Statistical Reporting 7 Jan 05  
Line: 1 to 1  

Checklist for Authors  References  Editable ReferencesStatistical Problems to Document and to Avoid  
Line: 45 to 45  
In a randomized comparison of treatments, the "intent to treat" analysis should be emphasized.
Missing Data  
Changed:  
< <  It is not appropriate to merely exclude subjects having incomplete data from the analysis. No matter how missing data are handled, the amount of missing baseline or response data should be carefully documented, including the proportion of missing values for each variable being analyzed and a description of the types of subjects having missing variables. The latter may involve an exploratory analysis predicting the tendency for a variable to have a missing value, based on predictors that are usually not missing. When there is a significant proportion of subjects having incomplete records, multiple imputation is advisable.  
> >  It is not appropriate to merely exclude subjects having incomplete data from the analysis. No matter how missing data are handled, the amount of missing baseline or response data should be carefully documented, including the proportion of missing values for each variable being analyzed and a description of the types of subjects having missing variables. The latter may involve an exploratory analysis predicting the tendency for a variable to have a missing value, based on predictors that are usually not missing. When there is a significant proportion of subjects having incomplete records, multiple imputation is advisable. Adding a new category to a variable to indicate missingness renders interpretation impossible and causes serious biases. See the October 2006 issue of Journal of Clinical Epidemiology for more about this and for other useful papers about missing data imputation.  
Changed:  
< <  A commonly used approach to handling dropouts in clinical trials is to use the "last observation carried forward" method. This method has been proven to be completely inappropriate. One of several problems with this method is that it treats imputed (carried forward) values as if they were real measurements. This results in overconfidence in estimates of treatment effects (standard errors and P values are too low and confidence intervals are too narrow).  
> >  A commonly used approach to handling dropouts in clinical trials is to use the "last observation carried forward" method. This method has been proven to be completely inappropriate in all situations. One of several problems with this method is that it treats imputed (carried forward) values as if they were real measurements. This results in overconfidence in estimates of treatment effects (standard errors and P values are too low and confidence intervals are too narrow).  
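A contrived sketch of that overconfidence, using hypothetical final-visit values where four dropouts have earlier measurements carried forward: treating the duplicated values as real data inflates the apparent sample size and shrinks the standard error.

```python
import math
import statistics

# Hypothetical final-visit values; four subjects dropped out, so their
# week-4 values were carried forward from week 2 (LOCF duplicates)
observed = [12.1, 9.8, 11.4, 10.9, 13.0, 10.2]
carried  = [10.5, 10.5, 11.0, 11.0]

locf = observed + carried
se_locf = statistics.stdev(locf) / math.sqrt(len(locf))
se_obs  = statistics.stdev(observed) / math.sqrt(len(observed))
print(round(se_obs, 3), round(se_locf, 3))  # LOCF makes the SE look smaller
```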
Multiple Comparison ProblemsWhen using traditional frequentist statistical methods, adjustment of P values for multiple comparisons is necessary unless  
Line: 81 to 81  
Graphics  Handouts
 
Added:  
> > 
 
 
Changed:  
< < 
Tables  Examples  
> > 
Tables  Examples (see section 4.2)  
As stated in Northridge et al (see below), "The text explains the data, while tables display the data. That is, text pertaining to the table reports the main results and points out patterns and anomalies, but avoids replicating the detail of the display." In many cases, it is best to replace tables with graphics.
Ways Medical Journals Could Improve Statistical Reporting  
Line: 106 to 107  
Useful Articles with Statistical Guidance for Authors
 
Changed:  
< < 
 FrankHarrell  26 Jun 2004; updated 4, 7 Jul, 29 Aug, 25 Oct, 24 Nov 2004, 7 Jan 05  DanielByrne  Section on Statistical Reporting 7 Jan 05  
> > 
 FrankHarrell  26 Jun 2004; updated 4, 7 Jul, 29 Aug, 25 Oct, 24 Nov 2004, 7 Jan 05, 20 Oct 06  DanielByrne  Section on Statistical Reporting 7 Jan 05  
Line: 1 to 1  

Checklist for Authors  References  Editable ReferencesStatistical Problems to Document and to Avoid  
Added:  
> > 
Design and Sample Size ProblemsUse of an improper effect sizeIf a study is designed to detect a certain effect size with a given power, the effect size should never be the observed effect from another study, which may be estimated with error and be overly optimistic. The effect size to use in planning should be the biologically relevant effect one would regret missing.Relying on standardized effect sizesMany researchers use Cohen's standardized effect sizes in planning a study. This has the advantage of not requiring pilot data. But such effect sizes are not biologically meaningful and may hide important issues as discussed by Lenth. Studies should be designed on the basis of effects that are relevant to the investigator and human subjects.  
General Statistical ProblemsInefficient use of continuous variablesCategorizing continuous predictor or response variables into intervals, as detailed here, causes serious statistical inference problems including bias, loss of power, and inflation of type I error.  
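A small simulation (entirely hypothetical data) illustrating the information loss from categorizing: dichotomizing a continuous predictor at its median attenuates its observed association with the response.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation, computed from first principles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

random.seed(4)
x = [random.gauss(0, 1) for _ in range(500)]
y = [xi + random.gauss(0, 1) for xi in x]      # true correlation ~0.71

cut = sorted(x)[len(x) // 2]                   # median split
x_binary = [1.0 if xi > cut else 0.0 for xi in x]

# The dichotomized predictor carries visibly less information
print(round(pearson(x, y), 2), round(pearson(x_binary, y), 2))
```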
Line: 84 to 89  
Tables  ExamplesAs stated in Northridge et al (see below), "The text explains the data, while tables display the data. That is, text pertaining to the table reports the main results and points out patterns and anomalies, but avoids replicating the detail of the display." In many cases, it is best to replace tables with graphics.  
Added:  
> > 
Ways Medical Journals Could Improve Statistical Reporting
 
Useful Articles with Statistical Guidance for Authors
 
Changed:  
< <   FrankHarrell  26 Jun 2004; updated 4, 7 Jul, 29 Aug, 25 Oct, 24 Nov 2004  
> > 
 FrankHarrell  26 Jun 2004; updated 4, 7 Jul, 29 Aug, 25 Oct, 24 Nov 2004, 7 Jan 05  DanielByrne  Section on Statistical Reporting 7 Jan 05  
Line: 1 to 1  

Checklist for Authors  References  Editable ReferencesStatistical Problems to Document and to Avoid  
Line: 10 to 10  
Some analysts use tests or graphics for assessing normality in choosing between parametric and nonparametric tests. This is often the result of an unfounded belief that nonparametric rank tests are not as powerful as parametric tests. In fact on the average nonparametric tests are more powerful than their parametric counterparts, because data are nonnormally distributed more often than they are Gaussian. At any rate, using an assessment of normality to choose a test relies on the assessment having nearly perfect sensitivity. If a test of normality has a large type II error, there is a high probability of choosing the wrong approach. Coupling a test of normality to a final nonparametric vs. parametric test only has the appearance of increasing power. If the normality test has a power of 1.0 one can, for example, improve on the 0.96 efficiency of the Wilcoxon test vs. the t test when normality holds. However, once the uncertainty of the normality test is accounted for, there is no power gain.
Inappropriate use of parametric testsWhen all that is desired is an unadjusted (for other variables) P value and a parametric test is used, the resulting inference will not be robust to extreme values, will depend on how the response variable is transformed, and will suffer a loss of power if the data are not normally distributed. Parametric methods are more necessary when adjusting for confounding or for subject heterogeneity, or when dealing with a serially measured response variable.  
Added:  
> >  When one wants a unitless index of the strength of association between two continuous variables and only wants to assume that the true association is monotonic (is always decreasing or always increasing), the nonparametric Spearman's rho rank correlation coefficient is a good choice. A good nonparametric approach to getting confidence intervals for means and differences in means is the bootstrap.  
Inappropriate descriptive statistics  
Changed:  
< <  The mean and standard deviation are not descriptive of variables that have an asymmetric distribution such as variables with a heavy right tail (which includes many clinical lab measurements). Quantiles are always descriptive of continuous variables no matter what the distribution. A good 3-number summary is the lower quartile (25th percentile), median, and upper quartile (75th percentile). The difference in the outer quartiles is a measure of subject-to-subject variability (it is an interval containing half the subjects' values). The median is always descriptive of "typical" subjects. By comparing the difference between the upper quartile and the median with the difference between the median and the lower quartile, one obtains a sense of the symmetry of the distribution. Above all, don't provide descriptive statistics such as "the mean hospital cost was $10,000 plus or minus $20,000."  
> >  The mean and standard deviation are not descriptive of variables that have an asymmetric distribution such as variables with a heavy right tail (which includes many clinical lab measurements). Quantiles are always descriptive of continuous variables no matter what the distribution. A good 3-number summary is the lower quartile (25th percentile), median, and upper quartile (75th percentile). The difference in the outer quartiles is a measure of subject-to-subject variability (it is an interval containing half the subjects' values). The median is always descriptive of "typical" subjects. By comparing the difference between the upper quartile and the median with the difference between the median and the lower quartile, one obtains a sense of the symmetry of the distribution. Above all, don't provide descriptive statistics such as "the mean hospital cost was $10,000 plus or minus $20,000." Nonparametric bootstrap confidence intervals will prevent impossible values from being used as confidence limits.  
Failure to include confidence intervalsConfidence intervals for key effects should always be included. Studies should be designed to provide sufficiently narrow confidence intervals so that the results contain useful information. See sections 3.4.4, 3.5.3, 3.7.4 of this.  
Added:  
> > 
Inappropriate choice of measure of changeIn a single-subject-group study in which there are paired comparisons (e.g., pre vs. post measurements), researchers too easily take for granted the appropriate measure of change (simple difference, percent change, ratio, difference of square roots, etc.). It is important to choose a change measure that can be taken out of context, i.e., is independent of baseline. See MeasureChange, STBRsylConcepts, and STBRsylEffectEvidence for more information.Use of change scores in parallel-group designsWhen there is more than one subject group, for example a two-treatment parallel-group randomized controlled trial, it is very problematic to incorporate change scores into the analysis. First, it may be difficult to choose the appropriate change score as mentioned above (e.g., relative vs. absolute). Second, regression to the mean and measurement error often render simple change scores inappropriate. Third, when there is a baseline version of the final response variable, it is necessary to control for subject heterogeneity by including that baseline variable as a covariate in an analysis of covariance. Then the baseline variable needs to appear in both the left-hand and right-hand sides of the regression model, making interpretation of the results more difficult. It is much preferred to keep the pure response (follow-up) variable on the left side and the baseline value on the right side of the equation.  
Inappropriate analysis of serial data (repeated measures)Some researchers analyze serial responses, when multiple response measurements are made per subject, as if they were from separate subjects. This will exaggerate the real sample size and make P values too small. Some researchers still use repeated measures ANOVA even though this technique makes assumptions that are extremely unreasonable for serial data. An appropriate methodology should be used, such as generalized least squares or mixed effects models with an appropriate covariance structure, GEE, or an approach related to GEE in which the cluster bootstrap or the cluster sandwich covariance estimator is used to correct a working independence model for within-subject correlation.  
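A sketch of the cluster bootstrap idea on simulated serial data: resample whole subjects, not individual measurements, when estimating the standard error of the overall mean. The data-generating values are arbitrary; the point is that the naive standard error (which pretends the repeated measurements are independent subjects) is far too small when within-subject correlation is strong.

```python
import math
import random
import statistics

# Simulated serial data: 6 subjects, 4 strongly correlated measurements each
random.seed(1)
subjects = []
for _ in range(6):
    level = random.gauss(10, 3)          # between-subject variation dominates
    subjects.append([level + random.gauss(0, 0.5) for _ in range(4)])

all_obs = [y for subj in subjects for y in subj]

# Naive SE pretends the 24 measurements are 24 independent subjects
naive_se = statistics.stdev(all_obs) / math.sqrt(len(all_obs))

# Cluster bootstrap: resample whole subjects with replacement
rng = random.Random(2)
boot_means = []
for _ in range(1000):
    sample = rng.choices(subjects, k=len(subjects))
    boot_means.append(statistics.fmean([y for subj in sample for y in subj]))
cluster_se = statistics.stdev(boot_means)

print(round(naive_se, 2), round(cluster_se, 2))  # cluster SE is much larger here
```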
Changed:  
< < 
Making conclusions from large P values  More Information  
> > 
Making conclusions from large P values  More Information  Absence of Evidence is not Evidence of Absence  
In general, the only way that a large P value can be interpreted is for example "The study did not provide sufficient evidence for an effect." One cannot say " P = 0.7, therefore we conclude the drug has no effect". Only when the corresponding confidence interval excludes both clinically significant benefit and harm can one make such a conclusion. A large P value by itself merely means that a higher sample size is required to allow conclusions to be drawn. See section 3.8 of this for details.
Filtering  
Line: 29 to 35  
 
Added:  
> > 
There must be a complete accounting of all subjects or animals who entered the study. Response rates to followup assessments must be quoted, and interpretation of study results will be very questionable if more than perhaps 10% of subjects do not have response data available unless this is by design.
In a randomized comparison of treatments, the "intent to treat" analysis should be emphasized.
Missing DataIt is not appropriate to merely exclude subjects having incomplete data from the analysis. No matter how missing data are handled, the amount of missing baseline or response data should be carefully documented, including the proportion of missing values for each variable being analyzed and a description of the types of subjects having missing variables. The latter may involve an exploratory analysis predicting the tendency for a variable to have a missing value, based on predictors that are usually not missing. When there is a significant proportion of subjects having incomplete records, multiple imputation is advisable. A commonly used approach to handling dropouts in clinical trials is to use the "last observation carried forward" method. This method has been proven to be completely inappropriate. One of several problems with this method is that it treats imputed (carried forward) values as if they were real measurements. This results in overconfidence in estimates of treatment effects (standard errors and P values are too low and confidence intervals are too narrow).  
Multiple Comparison ProblemsWhen using traditional frequentist statistical methods, adjustment of P values for multiple comparisons is necessary unless  
Line: 37 to 51  
P values should be adjusted for filtering as well as for tests that are reported in the current paper.
Multivariable Modeling Problems  
Added:  
> > 
Inappropriate linearity assumptionsIn general, there is no reason to assume that the relationship between a predictor and the response is linear. However, categorizing a continuous predictor can cause major problems. Good solutions are to use regression splines or nonparametric regression.  
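For illustration only, a linear spline basis can be built by adding a hinge term per knot (restricted cubic splines are usually preferred in practice; the ages and knots below are hypothetical):

```python
def linear_spline_basis(x, knots):
    """Design-matrix columns for a linear spline: x itself plus one
    hinge term max(0, x - k) per knot, so the fitted slope can change
    at each knot while the curve stays continuous."""
    return [[xi] + [max(0.0, xi - k) for k in knots] for xi in x]

ages = [25.0, 40.0, 55.0, 70.0]
basis = linear_spline_basis(ages, knots=[35.0, 60.0])
print(basis)
# -> [[25.0, 0.0, 0.0], [40.0, 5.0, 0.0], [55.0, 20.0, 0.0], [70.0, 35.0, 10.0]]
```

Fitting an ordinary regression on these columns lets the data determine the shape of the relationship instead of forcing a single straight line or arbitrary categories.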
Use of stepwise variable selectionStepwise variable selection, univariable screening, and any method that eliminates "insignificant" predictor variables from the final model causes a multitude of serious problems related to bias, significance, improper confidence intervals, and multiple comparisons. Stepwise variable selection should be avoided unless backwards elimination is used with an alpha level of 0.5 or greater.Lack of insignificant variables in the final model  
Line: 57 to 73  
 
Changed:  
< <   FrankHarrell  26 Jun 2004; updated 4, 7 Jul 2004  
> > 
Graphics  Handouts
Tables  ExamplesAs stated in Northridge et al (see below), "The text explains the data, while tables display the data. That is, text pertaining to the table reports the main results and points out patterns and anomalies, but avoids replicating the detail of the display." In many cases, it is best to replace tables with graphics.Useful Articles with Statistical Guidance for Authors
 
Line: 1 to 1  

Checklist for Authors  References  Editable ReferencesStatistical Problems to Document and to Avoid  
Added:  
> > 
 
General Statistical ProblemsInefficient use of continuous variablesCategorizing continuous predictor or response variables into intervals, as detailed here, causes serious statistical inference problems including bias, loss of power, and inflation of type I error.  
Line: 9 to 10  
Some analysts use tests or graphics for assessing normality in choosing between parametric and nonparametric tests. This is often the result of an unfounded belief that nonparametric rank tests are not as powerful as parametric tests. In fact on the average nonparametric tests are more powerful than their parametric counterparts, because data are nonnormally distributed more often than they are Gaussian. At any rate, using an assessment of normality to choose a test relies on the assessment having nearly perfect sensitivity. If a test of normality has a large type II error, there is a high probability of choosing the wrong approach. Coupling a test of normality to a final nonparametric vs. parametric test only has the appearance of increasing power. If the normality test has a power of 1.0 one can, for example, improve on the 0.96 efficiency of the Wilcoxon test vs. the t test when normality holds. However, once the uncertainty of the normality test is accounted for, there is no power gain.
Inappropriate use of parametric testsWhen all that is desired is an unadjusted (for other variables) P value and a parametric test is used, the resulting inference will not be robust to extreme values, will depend on how the response variable is transformed, and will suffer a loss of power if the data are not normally distributed. Parametric methods are more necessary when adjusting for confounding or for subject heterogeneity, or when dealing with a serially measured response variable.  
Changed:  
< < 
Inappropriate descriptive statistics  
> > 
Inappropriate descriptive statistics  
The mean and standard deviation are not descriptive of variables that have an asymmetric distribution such as variables with a heavy right tail (which includes many clinical lab measurements). Quantiles are always descriptive of continuous variables no matter what the distribution. A good 3-number summary is the lower quartile (25th percentile), median, and upper quartile (75th percentile). The difference in the outer quartiles is a measure of subject-to-subject variability (it is an interval containing half the subjects' values). The median is always descriptive of "typical" subjects. By comparing the difference between the upper quartile and the median with the difference between the median and the lower quartile, one obtains a sense of the symmetry of the distribution. Above all, don't provide descriptive statistics such as "the mean hospital cost was $10,000 plus or minus $20,000."  
Changed:  
< < 
Making conclusions from large P valuesIn general, the only way that a large P value can be interpreted is for example "The study did not provide sufficient evidence for an effect." One cannot say " P = 0.7, therefore we conclude the drug has no effect". Only when the corresponding confidence interval excludes both clinically significant benefit and harm can one make such a conclusion. A large P value by itself merely means that a higher sample size is required to allow conclusions to be drawn.  
> > 
Failure to include confidence intervalsConfidence intervals for key effects should always be included. Studies should be designed to provide sufficiently narrow confidence intervals so that the results contain useful information. See sections 3.4.4, 3.5.3, 3.7.4 of this.Inappropriate analysis of serial data (repeated measures)Some researchers analyze serial responses, when multiple response measurements are made per subject, as if they were from separate subjects. This will exaggerate the real sample size and make P values too small. Some researchers still use repeated measures ANOVA even though this technique makes assumptions that are extremely unreasonable for serial data. An appropriate methodology should be used, such as generalized least squares or mixed effects models with an appropriate covariance structure, GEE, or an approach related to GEE in which the cluster bootstrap or the cluster sandwich covariance estimator is used to correct a working independence model for within-subject correlation.Making conclusions from large P values  More InformationIn general, the only way that a large P value can be interpreted is for example "The study did not provide sufficient evidence for an effect." One cannot say " P = 0.7, therefore we conclude the drug has no effect". Only when the corresponding confidence interval excludes both clinically significant benefit and harm can one make such a conclusion. A large P value by itself merely means that a higher sample size is required to allow conclusions to be drawn. See section 3.8 of this for details.  
FilteringThere are many ways that authors have been seduced into taking results out of context, particularly when reporting the one favorable result out of dozens of attempted analyses. Filtering out (failing to report) the other analyses is scientifically suspect. At the very least, an investigator should disclose that the reported analyses involved filtering of some kind, and she should provide details. The context should be reported (e.g., "Although this study is part of a planned oneyear followup of gastric safety for Cox2 inhibitors, here we only report the more favorable short term effects of the drug on gastric side effects."). To preserve type I error, filtering should be formally accounted for, which places the burden on the investigator of undertaking often complex Monte Carlo simulations.  
Line: 23 to 29  
 
Added:  
> > 
 
Multiple Comparison ProblemsWhen using traditional frequentist statistical methods, adjustment of P values for multiple comparisons is necessary unless
 
Changed:  
< < 
Overfitting and Lack of Model Validation  
> > 
Multivariable Modeling ProblemsUse of stepwise variable selectionStepwise variable selection, univariable screening, and any method that eliminates "insignificant" predictor variables from the final model causes a multitude of serious problems related to bias, significance, improper confidence intervals, and multiple comparisons. Stepwise variable selection should be avoided unless backwards elimination is used with an alpha level of 0.5 or greater.Lack of insignificant variables in the final modelUnless the sample size is huge, this is usually the result of the authors using a stepwise variable selection or some other approach for filtering out "insignificant" variables. Hence the presence of a table of variables in which every variable is significant is usually the sign of a serious problem.Overfitting and lack of model validation  
When a multivariable model is reported, an unbiased validation (at least an internal validation) should be reported in the paper unless
 
Line: 37 to 50  
When a validation is needed, the best approach is typically the bootstrap. This is a Monte Carlo simulation technique in which all steps of the model-building process (if the model was not prespecified) are repeated for each of, say, 150 samples with replacement of size n from the original sample containing n subjects.  
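A sketch of the bootstrap optimism correction for a single-predictor least-squares model, on simulated data. This simplified version refits only the coefficients in each resample; a faithful validation would repeat every model-building step (variable selection, transformations, etc.) inside the loop, as the text describes.

```python
import random

def fit_slope_intercept(x, y):
    """Ordinary least squares for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(x, y, slope, intercept):
    my = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

def bootstrap_optimism(x, y, n_boot=150, seed=0):
    """Average over bootstrap samples of (apparent R^2 in the resample)
    minus (that refitted model's R^2 on the original data)."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    opt = []
    for _ in range(n_boot):
        boot = [rng.choice(idx) for _ in idx]
        bx, by = [x[i] for i in boot], [y[i] for i in boot]
        s, c = fit_slope_intercept(bx, by)
        opt.append(r_squared(bx, by, s, c) - r_squared(x, y, s, c))
    return sum(opt) / n_boot

random.seed(3)
x = [random.gauss(0, 1) for _ in range(30)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]
s, c = fit_slope_intercept(x, y)
apparent = r_squared(x, y, s, c)
corrected = apparent - bootstrap_optimism(x, y)
print(round(apparent, 3), round(corrected, 3))  # corrected estimate is lower
```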
Changed:  
< <   FrankHarrell  26 Jun 2004; updated 4 Jul 2004  
> > 
Use of Imprecise Language  GlossaryIt is important to distinguish rates from probabilities, odds ratios from risk ratios, and various other terms. The word risk usually means the same thing as probability. Here are some common mistakes seen in manuscripts:
 
Line: 1 to 1  

Checklist for Authors  References  Editable ReferencesStatistical Problems to Document and to AvoidGeneral Statistical ProblemsInefficient use of continuous variables  
Categorizing continuous predictor or response variables into intervals, as detailed here, causes serious statistical inference problems including bias, loss of power, and inflation of type I error.

Relying on assessment of normality of the data
Some analysts use tests or graphics for assessing normality in choosing between parametric and nonparametric tests. This is often the result of an unfounded belief that nonparametric rank tests are not as powerful as parametric tests. In fact, on average, nonparametric tests are more powerful than their parametric counterparts, because data are non-normally distributed more often than they are Gaussian. At any rate, using an assessment of normality to choose a test relies on the assessment having nearly perfect sensitivity. If a test of normality has a large type II error, there is a high probability of choosing the wrong approach. Coupling a test of normality to a final nonparametric vs. parametric test only has the appearance of increasing power. If the normality test had a power of 1.0 one could, for example, improve on the 0.96 efficiency of the Wilcoxon test vs. the t test when normality holds. However, once the uncertainty of the normality test is accounted for, there is no power gain.
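The power claim can be checked directly. The sketch below (illustrative only; the sample sizes and lognormal alternative are hypothetical) simulates a location shift of lognormal data and compares the two-sample t test against a normal-approximation Wilcoxon rank-sum test:

```python
import math
import random

def t_stat(a, b):
    """Two-sample pooled-variance t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / (sp * math.sqrt(1 / na + 1 / nb))

def wilcoxon_z(a, b):
    """Normal approximation to the Wilcoxon rank-sum statistic (no ties)."""
    na, nb = len(a), len(b)
    ranked = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    w = sum(i + 1 for i, (_, g) in enumerate(ranked) if g == 0)
    mean = na * (na + nb + 1) / 2
    var = na * nb * (na + nb + 1) / 12
    return (w - mean) / math.sqrt(var)

rng = random.Random(3)
n, sims, shift = 50, 300, 0.6
t_hits = w_hits = 0
for _ in range(sims):
    a = [math.exp(rng.gauss(0, 1)) + shift for _ in range(n)]  # shifted group
    b = [math.exp(rng.gauss(0, 1)) for _ in range(n)]
    t_hits += abs(t_stat(a, b)) > 1.98    # approximate t critical value, 98 df
    w_hits += abs(wilcoxon_z(a, b)) > 1.96
print(f"estimated power: t {t_hits / sims:.2f}, Wilcoxon {w_hits / sims:.2f}")
```

Under this heavy-tailed alternative the rank test rejects far more often than the t test, illustrating that the 0.96 efficiency figure is the parametric test's best case, not the typical one.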
Inappropriate use of parametric tests  
When all that is desired is an unadjusted (for other variables) P value and a parametric test is used, the resulting inference will not be robust to extreme values, will depend on how the response variable is transformed, and will suffer a loss of power if the data are not normally distributed. Parametric methods are more necessary when adjusting for confounding or for subject heterogeneity, or when dealing with a serially measured response variable.

Inappropriate descriptive statistics
The mean and standard deviation are not descriptive of variables that have an asymmetric distribution, such as variables with a heavy right tail (which includes many clinical lab measurements). Quantiles are always descriptive of continuous variables no matter what the distribution. A good 3-number summary is the lower quartile (25th percentile), median, and upper quartile (75th percentile). The difference in the outer quartiles is a measure of subject-to-subject variability (it is an interval containing half the subjects' values). The median is always descriptive of "typical" subjects. By comparing the difference between the upper quartile and the median with the difference between the median and the lower quartile, one obtains a sense of the symmetry of the distribution. Above all, don't provide descriptive statistics such as "the mean hospital cost was $10,000 plus or minus $20,000."
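A small illustration (hypothetical right-skewed cost data, not from the original page) contrasts the two summaries:

```python
import math
import random
import statistics

# Hypothetical lognormal "hospital cost" data: heavy right tail, all positive
rng = random.Random(4)
costs = [1000 * math.exp(rng.gauss(1.5, 1.0)) for _ in range(500)]

mean, sd = statistics.fmean(costs), statistics.stdev(costs)
q1, median, q3 = statistics.quantiles(costs, n=4)  # 25th, 50th, 75th percentiles

# The SD rivals the mean, so "mean +/- SD" suggests negative costs
print(f"mean +/- SD     : {mean:,.0f} +/- {sd:,.0f}")
# q3 - q1 is an interval containing half the subjects' values, and
# (q3 - median) > (median - q1) reveals the right skew
print(f"3-number summary: {q1:,.0f} / {median:,.0f} / {q3:,.0f}")
```

The quartile summary remains interpretable for any distribution, whereas the mean and SD here describe a symmetric variable that does not exist in the data.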
Making conclusions from large P values
In general, the only way a large P value can be interpreted is, for example, "The study did not provide sufficient evidence for an effect." One cannot say "P = 0.7; therefore we conclude the drug has no effect." Only when the corresponding confidence interval excludes both clinically significant benefit and harm can one make such a conclusion. A large P value by itself merely means that a larger sample size would be required to draw conclusions.

Filtering
There are many ways that authors have been seduced into taking results out of context, particularly when reporting the one favorable result out of dozens of attempted analyses. Filtering out (failing to report) the other analyses is scientifically suspect. At the very least, an investigator should disclose that the reported analyses involved filtering of some kind, and she should provide details. The context should be reported (e.g., "Although this study is part of a planned one-year follow-up of gastric safety for Cox-2 inhibitors, here we only report the more favorable short-term effects of the drug on gastric side effects."). To preserve type I error, filtering should be formally accounted for, which places on the investigator the burden of undertaking often-complex Monte Carlo simulations.
Here is a checklist of various ways of filtering results, all of which should be documented, and in many cases, rethought:  
Multiple Comparison Problems  
When using traditional frequentist statistical methods, adjustment of P values for multiple comparisons is necessary unless
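When adjustment is needed, a simple step-down (Holm) correction, sketched below as an illustration rather than taken from the original page, controls the family-wise type I error without any independence assumptions:

```python
def holm_adjust(pvals):
    """Holm step-down adjusted P values (family-wise error control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest p by (m - k), then enforce monotonicity
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Three hypothetical endpoint P values from one study
result = holm_adjust([0.01, 0.04, 0.03])
print(result)  # approximately [0.03, 0.06, 0.06]
```

After adjustment only the first endpoint remains significant at the 0.05 family-wise level; reporting the raw 0.03 and 0.04 as "significant" would be exactly the kind of unacknowledged multiplicity this section warns against.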
-- FrankHarrell - 26 Jun 2004; updated 4 Jul 2004