Main.CatContinuous r1.12 - 19 Mar 2008 - 22:21 - FrankHarrell


Problems Caused by Categorizing Continuous Variables (key reference: [1])

  1. Loss of power and loss of precision of estimated means, odds, hazards, etc.
  2. Categorization assumes that the relationship between the predictor and the response is flat within intervals; this assumption is far less reasonable than a linearity assumption in most cases
  3. To model a continuous predictor more accurately when categorization is used, multiple intervals are required. The needed dummy variables will spend more degrees of freedom than will fitting a smooth relationship, hence power and precision will suffer. And because of sample size limitations in the very low and very high range of the variable, the outer intervals (e.g., outer quintiles) will be wide, resulting in substantial heterogeneity of subjects within those intervals, and in residual confounding.
  4. Categorization assumes that there is a discontinuity in response as interval boundaries are crossed
  5. Categorization only seems to yield interpretable estimates such as odds ratios. For example, suppose one computes the odds ratio for stroke for persons with a systolic blood pressure > 160 mmHg compared to persons with a blood pressure <= 160 mmHg. The interpretation of the resulting odds ratio will depend on the exact distribution of blood pressures in the sample (the proportion of subjects > 170, > 180, etc.). On the other hand, if blood pressure is modeled as a continuous variable (e.g., using a regression spline, quadratic, or linear effect) one can estimate the ratio of odds for exact settings of the predictor, e.g., the odds ratio for 200 mmHg compared to 120 mmHg.
  6. When the risk of stroke is being assessed for a new subject with a known blood pressure (say 162), the subject does not report to her physician "my blood pressure exceeds 160" but rather reports 162 mmHg. The risk for this subject will be much lower than that of a subject with a blood pressure of 200 mmHg.
  7. If cutpoints are determined in a way that is not blinded to the response variable, calculation of P-values and confidence intervals requires special simulation techniques; ordinary inferential methods are completely invalid. For example, if cutpoints are chosen by trial and error in a way that utilizes the response, even informally, ordinary P-values will be too small and confidence intervals will not have the claimed coverage probabilities. The correct Monte Carlo simulations must take into account both multiplicities and uncertainty in the choice of cutpoints. For example, if a cutpoint is chosen that minimizes the P-value and the resulting P-value is 0.05, the true type I error can easily be above 0.5 [2].
  8. Likewise, categorization that is not blinded to the response variable results in biased effect estimates [3, 4].
  9. "Optimal" cutpoints do not replicate over studies. Hollander, Sauerbrei, and Schumacher [2] state that "... the optimal cutpoint approach has disadvantages. One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the 'optimal' cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast cancer patients, 12 studies were included with 12 different cutpoints ... Interestingly, neither cathepsin-D nor the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society of Clinical Oncology."
  10. Cutpoints are arbitrary and manipulable; cutpoints can be found that result in either positive or negative associations [5].
  11. If a confounder is adjusted for through categorization, residual confounding will remain; it can be accounted for by including the continuous form of the predictor in the model in addition to the categories.
  12. A better approach that maximizes power and that only assumes a smooth relationship is to use a restricted cubic spline (regression spline; piecewise cubic polynomial) function for predictors that are not known to predict linearly. Use of flexible parametric approaches such as this allows standard inference techniques (P-values, confidence limits) to be used.
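
As a rough illustration of item 12, the sketch below builds the basis columns of a restricted cubic spline using the truncated power form found in standard regression texts. The function name rcs_basis and the blood pressure knot locations are hypothetical choices made for this example, and the model-fitting step (regressing the response on these columns) is left out.

```python
def rcs_basis(x, knots):
    """Basis columns [x, X2, ..., X_{k-1}] of a restricted cubic spline.

    Truncated power form with k knots t_1 < ... < t_k:
      X_{j+1} = (x - t_j)_+^3
                - (x - t_{k-1})_+^3 * (t_k - t_j) / (t_k - t_{k-1})
                + (x - t_k)_+^3 * (t_{k-1} - t_j) / (t_k - t_{k-1})
    for j = 1, ..., k-2. The fitted curve is linear beyond the outer knots.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def cube_plus(u):
        # Truncated cubic: u^3 when u is positive, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    span = t[-1] - t[-2]          # distance between the last two knots
    columns = [float(x)]          # the linear term is always included
    for j in range(k - 2):
        columns.append(
            cube_plus(x - t[j])
            - cube_plus(x - t[-2]) * (t[-1] - t[j]) / span
            + cube_plus(x - t[-1]) * (t[-2] - t[j]) / span
        )
    return columns


# Hypothetical knot locations for systolic blood pressure (mmHg).
BP_KNOTS = [110, 130, 150, 180]

if __name__ == "__main__":
    for bp in (100, 120, 162, 200):
        print(bp, rcs_basis(bp, BP_KNOTS))
```

With k knots the spline spends k - 1 degrees of freedom on the predictor, and a subject at 162 mmHg gets her own predicted value rather than the average of everyone above 160.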

Java applet demonstrating the information loss from dichotomizing a continuous variable. See this also.
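
The applet is not reproduced here, but the information loss can be sketched with a small simulation. The setup is an assumption made for illustration only: a normally distributed predictor, a linear effect with normal errors, and a split at the sample median. Under those conditions the correlation with the response is expected to shrink by a factor of about sqrt(2/pi), roughly 0.80, once the predictor is dichotomized. The function names and default values are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def dichotomization_loss(n=20000, slope=0.5, seed=1):
    """Return (r_continuous, r_dichotomized) for simulated data.

    Assumed model (illustrative only): x ~ N(0, 1), y = slope * x + N(0, 1).
    The dichotomized predictor is 1 above the sample median of x, else 0.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [slope * xi + rng.gauss(0.0, 1.0) for xi in x]
    cut = sorted(x)[n // 2]                       # sample median as cutpoint
    x_bin = [1.0 if xi > cut else 0.0 for xi in x]
    return pearson(x, y), pearson(x_bin, y)


if __name__ == "__main__":
    r_cont, r_bin = dichotomization_loss()
    print("continuous r:", round(r_cont, 3))
    print("dichotomized r:", round(r_bin, 3))
    print("ratio:", round(r_bin / r_cont, 3))
```

Because the required sample size scales roughly with 1/r^2, a ratio near 0.80 corresponds to needing on the order of 1.5 times as many subjects to recover the same power, under the assumptions stated above.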

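The type I error inflation described in item 7 can be checked in the same spirit. The sketch below simulates data with no true association, then compares a prespecified median cutpoint with the cutpoint that gives the smallest P-value among all splits between the 10th and 90th percentiles. The z-test with known unit variance, the sample size, and the candidate range are simplifying assumptions for this example, not the method of any cited paper.

```python
import math
import random


def two_sided_p(z):
    """Two-sided normal P-value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate_min_p(n=100, n_sims=2000, lo=0.1, hi=0.9, alpha=0.05, seed=2):
    """Return (rejection rate at median cut, rejection rate at min-P cut).

    Data are generated under the null: x and y are independent N(0, 1).
    For each candidate cutpoint the two groups' mean y values are compared
    with a z-test that treats the variance of y as known (equal to 1).
    """
    rng = random.Random(seed)
    first, last = int(lo * n), int(hi * n)    # allowed sizes of the low group
    reject_fixed = 0
    reject_min = 0
    for _ in range(n_sims):
        pairs = sorted((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
                       for _ in range(n))
        y = [p[1] for p in pairs]             # responses ordered by x
        total = sum(y)
        running = 0.0
        p_values = {}
        for k in range(1, n):                 # k = subjects below the cut
            running += y[k - 1]
            if first <= k <= last:
                mean_low = running / k
                mean_high = (total - running) / (n - k)
                z = (mean_high - mean_low) / math.sqrt(1.0 / k + 1.0 / (n - k))
                p_values[k] = two_sided_p(z)
        reject_fixed += p_values[n // 2] < alpha
        reject_min += min(p_values.values()) < alpha
    return reject_fixed / n_sims, reject_min / n_sims


if __name__ == "__main__":
    fixed_rate, min_p_rate = simulate_min_p()
    print("prespecified median cut:", fixed_rate)
    print("minimum P-value cut:", min_p_rate)
```

The prespecified cutpoint should reject at close to the nominal 0.05 rate, while the data-driven cutpoint rejects far more often even though no association exists.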

[1] Patrick Royston, Douglas G. Altman, and Willi Sauerbrei. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med, 25:127-141, 2006. Key: roy06dic
Annotation: continuous covariates; dichotomization; categorization; regression; efficiency; clinical research; residual confounding; destruction of statistical inference when cutpoints are chosen using the response variable; varying effect estimates when cutpoints are changed; effects difficult to interpret after dichotomizing; nice plot showing effect of categorization; PBC data
[2] Norbert Hollander, Willi Sauerbrei, and Martin Schumacher. Confidence intervals for the effect of a prognostic factor after selection of an 'optimal' cutpoint. Stat Med, 23:1701-1713, 2004. Key: hol04con
Annotation: cutpoints; true type I error can be much greater than nominal level; one example where nominal is 0.05 and true is 0.5; minimum P-value method; CART; recursive partitioning; bootstrap method for correcting confidence interval; based on heuristic shrinkage coefficient; "It should be noted, however, that the optimal cutpoint approach has disadvantages. One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the 'optimal' cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast cancer patients, 12 studies were included with 12 different cutpoints ... Interestingly, neither cathepsin-D nor the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society of Clinical Oncology."; dichotomization; categorizing continuous variables; refs alt94dan, sch94out, alt98sub
[3] D. G. Altman, B. Lausen, W. Sauerbrei, and M. Schumacher. Dangers of using 'optimal' cutpoints in the evaluation of prognostic factors. J Nat Cancer Inst, 86:829-835, 1994. Key: alt94dan
Annotation: cutpoints; dichotomizing continuous variables
[4] G. Schulgen, B. Lausen, J. Olsen, and M. Schumacher. Outcome-oriented cutpoints in quantitative exposure. Am J Epi, 140:172-184, 1994. Key: sch94out
Annotation: cutpoint; dichotomization; categorizing continuous variables
[5] Howard Wainer. Finding what is not there through the unfortunate binning of results: The Mendel effect. Chance, 19(1):49-56, 2006. Key: wai06fin
Annotation: can find bins that yield either positive or negative association; especially pertinent when effects are small; "With four parameters, I can fit an elephant; with five, I can make it wiggle its trunk." (John von Neumann)

-- FrankHarrell - 27 Jun 2004, 21 Jun 2006, 25 Sep 2007, 7 Jan 2008

Copyright © 2003-2008 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.