Chun's Stata Notes

In this page, [EMS] refers to Essential Medical Statistics, a very good introductory book on biostatistics.

Getting help

Help pages: Type help followed by the name of a command or a function will display the corresponding help page.
  • Connection between help pages and GUI: On the top right of most help pages, there are links labelled as dialogs. Click one of them to get to the menu counterpart(s) of the command.
  • Pointers to other relevant commands: At the end of a help page, there often is a section titled "Also see", which includes pointers to relevant commands. The links in this section often point me to the commands that suit my need.
  • Help on topic groups: For example, help estimation commands shows all estimation/model fitting commands, and help functions shows all functions.
  • You can type help followed by nothing or go to Help -> Contents to start a help window and continue from there.

Useful web pages:

Useful web pages accessible from Stata
  • net from : UCLA's Stata programs. Type net cd teach to enter the directory for teaching demonstrations.
  • net from : Stata's web page.
  • You can use net to check current site, net cd dirname to change directory to "dirname" (as if in Unix), net describe pkgname to describe a package (named "pkgname"), net install pkgname to install a package, and net link linkname to go to another site.


Updates: Type update for information of your current installation. Type update query to compare the current installation with the newest available from You also can set the program to check for updates automatically.

Stata news: Just type news.

Set memory: Type memory for current memory setting. When Stata starts, it allocates a fixed amount of memory to use. You can use set memory 50m, perm to set the amount to 50 Megabytes in the future.

Calculator: Stata can be used as a simple command-line calculator by using the display (or di) command. For example, di 15/4 gives 3.75, di log(3) gives the natural logarithm of 3, di log10(5) gives the base-10 logarithm of 5, di chi2tail(2, 3.87) gives the p-value of a test statistic 3.87 compared to the chi-squared distribution with 2 degrees of freedon.

Functions for tail probabilities: Stata doesn't have a clear, consistent format.
  • normal(1.96) gives the standard normal left-tail probability of Pr(z ≤ 1.96)
  • ttail(10, 1.8) gives the right-tail probability Pr(t10 ≥ 1.8)
  • Ftail(2, 10, 3.5) gives the right-tail probability of Pr(F2,10 ≥ 3.5)
  • chi2tail(3, 2.5) gives the right-tail probability of Pr(χ23 ≥ 2.5)
  • Binomial(50, 29, 0.3) gives the right-tail probability of Pr(X ≥ 29 | n = 50, p = 0.3).
  • For details of more such functions, type help density functions.

Exploring data

I find it useful to learn the basic information about a dataset through commands describe, label list, notes, and st. For more details, try codebook, labelbook.

Variable types: (Type help data types for details.) Almost all statistical software packages need to store variables in different ways depending on the nature of the variables. When a software package reads in a data set in a non-default format (e.g. ASCII/text files), it has to decide on the types of variables based on the patterns it sees. This may cause a problem if you are not careful.
  • Example 1: If you have a value 16o (with zero mistyped as letter 'o') in variable "height", or if "." or "?" is used for missing values, the software may think these are possible values of the variable and thus the variable must not be numeric. As a result, the variable may be stored as categorical or character strings. Later on, when you want to calculate the mean (or any other analysis), the software may give you an error message because it cannot do this on categorical variables or strings.
  • Example 2: If you have an "ID" variable and code it using numbers, or if you have a categorical variable with five categories and code it as 1-5, the software may think the variable is numeric and store it as numbers. Later on, when you do analysis on this variable, the software will treat them as numbers instead of factors and thus give wrong results, without any warning at all.
  • When you see strange error messages or strange results, check if the variables are in correct types. Stata command describe can give you some basic information on variable types. Some variables may be disguised by labels; use label list to see all the labels used in the data. You can change variable types using command format if you want to.

Summary: The commands summarize and inspect can give you basic information about variables. To get more detailed information, add the , detail option after the command. For example, su age language, d will give detailed summary information on variables age and language.

Count: To count the number of records meeting some criteria, use the count command. For example, count mi(language) gives the number of records with missing values for variable language, count if age<9 gives the number of records with age<9.

Tables: tabulate is a more powerful command than table. To get some additional statistics, you can try tabulate var1 var2, cchi2 chi2 column expected row.


Type help graph intro for more information on Stata graphics. Graph Examples provides examples to show what types graphs Stata can create.

Bar chart for a categorical variable: You have to create multiple variables, each for a category. For example, use tabulate sex, gen(aa) will create two variables aa1 and aa2 as indicator variables of the two sexes. To make it cleaner, I can use tabulate sex, gen(sex) to create sex1 and sex2. Then I use graph hbar (rawsum) sex1-sex2 to draw a horizontal bar graph. Horizontal bars tend to be better than vertical bars.

Pie chart is easier. graph pie, over(sex) will do it. Pie charts tend to be not as effective as bar charts, unless extensive labelling is done.

Cumulative frequency distribution of a variable abc can be drawn by first generating a new variable (say, cumabc) using cumul abc, gen(cumabc), and then using line cumabc abc, sort.

Scatterplot with jittering: Try twoway (scatter fev1 sex, jitter(30)).

Bar chart for a continuous variable: In the Peru lung study, suppose you want to draw average fev1 for the four sex-respsymptoms groups with error bars. You need first to calculate some values:
collapse (mean) meanfev1=fev1 (sd) sdfev1=fev1 (count) n=fev1, by(sex resp)
gen hifev1 = meanfev1 + invttail(n-1,0.025) * (sdfev1 / sqrt(n))
gen lofev1 = meanfev1 - invttail(n-1,0.025) * (sdfev1 / sqrt(n))
To make the graphs nicer, you may want to define labels:
label define sexlabel 0 female 1 male
label value sex sexlabel
label define symplabel 0 "no symptom" 1 symptom
label value respsymptoms symplabel
Then draw the graph using graph bar meanfev1, over(sex) over(resp). To draw a nicer bar with error bars, follow the instruction here.

Common issues

Labels: You can create labels for datasets, variables, or the values of variables so that the output will be easier to read and interpret. To list all existent value labels on the current dataset, use label dir or label list; for detailed information, use labelbook. To create a label for a dataset, use label data; to create a label for a variable, use label variable; to created labels for the values of a variable, you need first to define the labels by using label define and then to assign them to the variable by using label value. For the last situation, there is an example under the bar chart section in this page. Type help label for details.

Analysis by subsets: Suppose you have two groups of subjects defined by a variable, say group, and you want to study a variable, say abc, across the groups. Stata has inconsistent ways of doing this depending on the command you use. Some commands require a by var: prefix, some require a by(var) option after a comma after the command, and some require a over(var) option.

Creation of indicator variables: xi will generate indicator (dummy) variables for the variable(s) that marked with i.. The names for the newly created indicator variables by default start with _I; using the prefix() option gets rid of this default. Example commands are:
xi, prefix(): logistic mf i.agegrp
xi, prefix(): logistic mf i.agegrp*area
xi, prefix(): logistic mf i.agegrp*i.area

Model fitting: Stata views the latest model fit (or "estimation" in Stata terminology, generated through an "estimation command") as the current model, and you can issue "postestimation" commands to obtain the results of the fitted model or to use the fitted model for prediction. Type help estimation commands to see all estimation/model fitting commands.

Reture values: Many commands can return values, which can be used in the next command or saved for future use. The return values can be in either r-class or e-class. You can always type return list and ereturn list to see what results can be returned. General commands generate r-class return values and estimation commands (i.e. model fitting commands) generate e-class return values. Type help return for details. For example, if I want to use the parameter estimates of a logistic regression in my future analysis, I can print them by matrix list e(b) or extract them right after fitting the regression by matrix define A = e(b). This saves the coefficients into a matrix A.

Likelihood ratio test: If you want to compare two models, you have to store the results of a model. For example, the command est store A will store the results of then currect model as "A". Later on, you can compare the results stored in "A" and those of the current model by lrtest A . or lrtest . A or just lrtest A, where "." stands for the current model.

Note that the likelihood ratio test can be used ONLY to compare two nested models (one full model and one reduced model). When you use lrtest to compare models that are not nested, Stata will assume the model with fewer parameters as the reduced model and generate results. But the results are not valid.

Continuous outcome variables

T-test and ANOVA (EMS 7-9)
Confidence interval: ci is the command.

T-test on a variable stratified by another variable: An example is to test if fev1 is different in children less than 9 years old and in those older than 9 years old. If the stratifying variable doesn't have two levels, you need to generate a new variable first and carry out the t-test:
gen agegroup = age < 9
ttest fev1, by(agegroup)

ANOVA: anova is the general command. For one-way ANOVA models, oneway and loneway are more useful. For the data in EMS Table 9.1(a), oneway hemoglbn type, t will generate the results in EMS Table 9.1(c). (anova hemoglbn type will generate the same results.)

Multiple comparison tests in ANOVA: In the command oneway hemoglbn type, b sch si, the options after comma makes the command generate multiple comparison test results using Bonferroni, Scheffé, and Sidak procedures. To carry out Tukey's multiple comparison test, you can use the prcomp command in the "sg101" package. [Type net search sg101 and click on the link and follow instructions for installation.] Once you have installed the package, type prcomp hemoglbn type, anova tukey to get Tukey's test results.

Linear regression (EMS 10-12)
regress is one major command. See Regression with Stata for details.

For the data in EMS Table 10.1, regress plasma weight will generate the results in EMS 10.2 and 10.4.

If you do a simple linear regression and want to draw the fitted line and its confidence interval, try the following. In the Peru lung study, try the following:
regress fev1 age
scatter fev1 age || lfit fev age
scatter fev1 age || lfitci fev age
rvfplot, yline(0)

To carry out a F-test to see if age and height combined will have significant effect on the outcome after sex is taken into account, use test command. For example, try this:
regress fev1 age height sex
test age height
The test command allows you to carry out various tests. Check its help page for details.

To obtain residuals and put them into a variable, say aa, use command predict aa, resid. To obtain Cook's distance in a variable, say bb, use predict bb, cooksd. To obtain standardized residuals in a variable, say cc, use predict cc, rstandard.

qnorm draws an inverse normal plot. swilk performs Shapiro-Wilk test for normality.

Categorical outcome variables

Analyses on contigency tables (EMS 15-17)
  • Single binomial distribution: point estimate, CI, and z-test
    • prtesti 1000 123 .13, count will generate the results in EMS 15.5-6.
    • cii 1000 123, wald will generate point estimate and CI.
  • Risk difference: point estimate, CI, and z-test
    • prtesti 240 20 220 80, count will generate the results in EMS 16.3 (data in EMS Table 16.2).
    • csi 20 80 220 140 will genrate point estimate and CI, but with a test on odds ratio.
  • Risk ratio: point estimate, CI
    • csi 39 6 29961 59994 will generate the results in EMS 16.5 (data in EMS Table 16.4), but with a test on odds ratio.
  • Odds ratio: point estimate, CI, and test
    • cci 81 995 57 867, woolf will generate the results in EMS 16.7 (data in EMS Table 16.5), but with a test based on the square of the z-statistic (which is equivalent to the z-test in EMS).
    • csi 81 57 995 867, or woolf will generate the same results.
  • Pearson's chi-squared test and Fisher's exact test
    • tabi 20 220 \ 80 140, chi2 will generate the results in EMS 17.2 (data in EMS Table 16.2 or 17.1).
    • tabi 1 12 \ 3 9, exact will generate the results in EMS 17.3 (data in EMS Table 17.2).
    • tabi 20 18 12 \ 32 20 8 \ 18 12 10 will generate the results in EMS 17.4 (data in EMS Table 17.4).
  • Tests of trend and of departure from trend:
    • We need to input data in EMS Table 17.5 like this:
input menarche triceps freq
1 0 15
0 0 156
1 1 29
0 1 197
1 2 36
0 2 150
    • tabodds menarche triceps [weight=freq] will generate the results in EMS 17.5.
    • logistic menarche triceps [weight=freq] will generate logistic regression results.
    • ptrendi 15 156 0 \ 29 197 1 \ 36 150 2 will generate the results in EMS 17.5, using a formula for V that is slightly different from that in EMS p. 175 (N − 1 is replaced by N). (Note that ptrendi can be installed by typing ssc install ptrend.)

Analyses on stratified contigency tables (EMS 18, 21)
For data in EMS Table 18.2, there are multiple approaches to studying the effect of area type (rural or urban) on the probability of having antibodies to leptospirosis, while taking into account the effect due to gender. We are assuming the effect as measured by odds ratio is the same for males and females; in other words, we assume no interaction between area type and gender. On the basis of this assumption, we are estimating that "common" effect.
  • Mantel-Haenszel method: point estimate, CI, and test
    • We need to input data in EMS Table 18.2 like this:
input gender area antibody freq
1 1 1 36
1 1 0 14
1 0 1 50
1 0 0 50
0 1 1 24
0 1 0 126
0 0 1 10
0 0 0 90
    • mhodds antibody area gender [weight=freq], compare(1,0) will generate the results in EMS 18.4.
    • cc antibody area [weight=freq], by(gender) also will generate these results as well as the result of test of heterogeneity (EMS 18.5). Note that test of heterogeneity is equivalent to test of interaction effect between area and gender.
  • Logistic regression: logistic antibody area gender [weight=freq] will generate similar, but different, results using logistic regression.
  • Conditional logistic regression: It will generate another different set of results. The clogit command treats [weight=] differently from we did in the above commands. Therefore, we need to input data as 400 rows before carrying out conditional logistic regression.
set obs 400
gen gender = 0
replace gender = 1 in 1/150
gen area = 0
foreach i in 1/50 151/300 {
  replace area = 1 in `i'
gen antibody = 0
foreach i in 1/36 51/100 151/174 301/310 {
  replace antibody = 1 in `i'
** conditional logistic regression
clogit antibody area, group(gender) or

  • Matched studies: mcci 184 54 14 63 will generate the McNemar's test result and OR estimate in EMS 21.2-3, but its CI calculations are based on formulae different from those in EMS. We also can try the Mantel-Haenszel method or conditional logistic regression, which are equivalent for paired binary data:
set obs 630
gen speciman = int((_n+1)/2)
gen bell = 0
replace bell = 1 if mod(_n,2)==0
gen outcome = 0
replace outcome = 1 in 1/368
replace outcome = 1 if _n>=369 & _n<=476 & mod(_n,2)==0
replace outcome = 1 if _n>=477 & _n<=504 & mod(_n,2)==1

** Mantel-Haenszel method
mhodds outcome bell speciman
** conditional logistic regression (same results)
clogit outcome bell, group(speciman) or
** another command to do conditional logistic regression
xtlogit outcome bell, fe i(speciman) or
** random-effect logistic regression (different results)
xtlogit outcome bell, re i(speciman) or

  • The following calculates the kappa statistics (EMS 36.3) on the agreement between the two tests:
input bell katokatz freq
1 1 184
1 0 54
0 1 14
0 0 63
kap bell katokatz [weight=freq], tab

Logistic regression (EMS 19-20)
See Logistic Regression with Stata and here for more information on logistic regression. help logistic estimation commands will list relevant commands (but without logistic, wierd).

Examples in EMS 19-20 use the Onchocerciasis data, which can be downloaded here.

  • Study solely the effect of residence area (EMS 19.2)
    • logistic mf area or logit mf area, or generates output in odds scale
    • cc mf area, woolf generates the same results. This is because the input variable area is binary.
    • logistic mf area, coef or logit mf area generates output in parameter scale (i.e. log(odds) scale)

  • Study solely the effect of age group (EMS 19.4-5) (See the "Common issues" section for explanations of xi, est store, and lrtest.)
    • Age group as a categorical variable (EMS 19.4): xi, prefix(): logistic mf i.agegrp generates output in odds scale.
      • Wald test for age group effect: After fitting the above regression, type test agegrp_1 agegrp_2 agegrp_3.
      • Likelihood ratio test for age group effect:
** Model with age group
xi, prefix(): logistic mf i.agegrp
est store A
** Model without age group
logistic mf
lrtest A, stat
    • Age group as a continuous variable (EMS 19.5): logistic mf agegrp
      • We can generate a graph to see the fit of this model compared to the data. Use table agegrp to see which groups tend to have high influence on results.
xi, prefix(): logistic mf i.agegrp
predict risk1
gen logodds1 = log(risk1/(1-risk1))
logit mf agegrp
matrix define A=e(b)
** Draw the group-wise risk and the predicted risk curve
scatter risk1 agegrp||function 1-1/(exp(A[1,2]+A[1,1]*x)+1),range(agegrp)
graph rename graph1, replace
** Draw group-wise log-odds and fitted regression line as EMS Figure 19.3
scatter logodds1 agegrp || function A[1,2]+A[1,1]*x, range(agegrp)
    • Compare the above two models to see if the extra parameterization of treating age group as a categorical variable is significantly better. We cannot directly compare the two models because they appear not to be nested. (Actually, in this case, they are nested, as shown below. In general, always make sure the models are nested.) We can do this:
xi, prefix() i.agegrp
** Model with age group as categories but parameterized to include
** the continuous agegrp variable
logistic mf agegrp agegrp_2 agegrp_3
est store A
** Model with agegrp as a continuous variable
logistic mf agegrp
lrtest A, stat

  • Study two variables simultaneously: residence area and age group (See the "Common issues" section for explanations of xi, est store, and lrtest.)
    • No interaction between the two variables (EMS 20.2): xi, prefix(): logistic mf area i.agegrp
    • Interaction is modeled: xi, prefix(): logistic mf i.agegrp*area
    • The following code generate EMS Figure 20.1. Use tab area agegrp to see which groups tend to have high influence on results.
xi, prefix(): logistic mf i.agegrp*area
predict risk2
gen logodds2 = log(risk2/(1-risk2))
xi, prefix(): logistic mf area i.agegrp
predict risk3
gen logodds3 = log(risk3/(1-risk3))
** sort data so that the line plot will not be messy
sort agegrp
** Draw group-wise log-odds and fitted log-odds as EMS Figure 20.1
** Copy the next two lines into one line before hitting Enter
scatter logodds2 agegrp || line logodds3 agegrp if area==0
  || line logodds3 agegrp if area==1
** Graph in probability scale.  The difference between two areas
** are no longer the same for different age groups.
scatter risk2 agegrp || line risk3 agegrp if area==0
  || line risk3 agegrp if area==1

    • To test for interaction, we need to fit two models, one with interaction and one without interaction, and compare their likelihood using a likelihood ratio test. The following can be used to test for interaction while treating age group as a categorical variable:
** Model with full interaction between age group and area
xi, prefix(): logistic mf i.agegrp*area
est store A
** Model without interaction
xi, prefix(): logistic mf area i.agegrp
lrtest A, stat
The following can be used to test for interaction while treating age group as a continous variable:
** Model with interaction between age group and area
gen agebyarea = agegrp*are
logistic mf agegrp area agebyarea
est store A
** Model without interaction
logistic mf area agegrp
lrtest A, stat

When we treated age group as a categorical variable, the interaction model with three extra parameters only increased the 2x(log-likelihood) by 5.27 over the non-interaction model, with corresponding p-value 0.153. When we treated age group as a continuous variable, the interaction model with only one extra parameter increased the 2x(log-likelihood) by 6.46 over the non-interaction model (a different non-interaction model), with corresponding p-value 0.011. This is a good example to demonstrate the fact that results should always be interpreted with context. These two "interactions" are different interactions! In practice, the test with fewer degrees of freedom often is more powerful.

The interaction analysis treating age group as a categorical variable is comparing the fitness of the two parallel zig-zagged lines with the eight dots (previous code is copied here):
scatter logodds2 agegrp || line logodds3 agegrp if area==0
  || line logodds3 agegrp if area==1

The interaction analysis treating age group as a continous variable is comparing the fitness of the blue lines and the red lines (assuming agebyarea and logodds2 have been defined as above):
logistic mf agegrp area agebyarea
matrix define B=e(b)
logistic mf agegrp area
matrix define C=e(b)
** Copy the next five lines into one line before hitting Enter
scatter logodds2 agegrp
  || function B[1,4]+B[1,1]*x, range(agegrp) lc(blue)
  || function B[1,4]+B[1,2]+(B[1,1]+B[1,3])*x, range(agegrp) lc(blue)
  || function C[1,3]+C[1,1]*x, range(agegrp) lc(red)
  || function C[1,3]+C[1,2]+C[1,1]*x, range(agegrp) lc(red)
  • If the outcome variable has more than two categories and the categories are naturally ordered, the appropriate analysis is ordinal logistic regression (also called proportional odds model). The Stata command for it is ologit.

Count outcome variables

Analyses on rates (EMS 22-23)
  • Single Poisson distribution: rate estimate, s.e. of the rate estimate, CI. cii 873 57, poisson will generate results in EMS 22.5-6, with numbers in count per child-year, not per 1000 child-years. In the latter scale, the CI will be from 49.45 to 84.59, different from that in EMS 22.6. In fact, the CI calculated by Stata is based on the Poisson likelihood function. The lower bound is λL such that Pr[Poisson(λL) ≥ 57] = 0.025, and the upper bound is λU such that Pr[Poisson(λU) ≤ 57] = 0.025. If the data are input as below, strate will generate CI as in EMS 22.6.
  • Rate difference: iri 33 24 355 518 gives results in EMS 23.2. If the data are input as below, stir poor and ir infection poor childyear give the same results.
  • Rate ratio: We need to input data in EMS Table 23.2 like this:
input infection childyear poor stove outcome
28 251 1 1 1
 5  52 0 1 1
 5 104 1 2 1
19 466 0 2 1
gen unit = childyear/infection
stset unit [freq=infection]
Note that the command stset is important. It declares data as survival-time data, with a total of 57 failures (sum of the frequency variable infection). Stata will treat the first row as 28 copies and will multiply the time variable by 28. This is why we had to define a time variable unit as the total child-year divided by the number of infections.
    • Overall and stratified failure rates and their CIs: strate gives overal failure rate and strate stove poor gives results stratified by stove and poor.
    • stmh poor, compare(1,0) gives rate ratio and its CI as in EMS 23.2, but with a different test. Alternatively, we may carry out a Poisson regression: poisson outcome poor [freq=infection], e(unit) irr.
    • stmh poor, by(stove) compare(1,0) gives the results in EMS 23.3: Rate ratio for each stove stratum, Mantel-Haenszel estimate of the assumed common rate ratio acrosss stove strata, the associated CI, test for rate ratio, and test for heterogeneity. [Note stmh poor stove, compare(1,0) also gives some of the information.] Again, we may carry out a Poisson regression: poisson outcome poor stove [freq=infection], e(unit) irr. Note that test for heterogeneity is equivalent to test for interaction effect between poor and stove.

Poisson regression (EMS 24)
After opening the caerphilly_ems.dta file, type des, label list, and st to learn the structure of the data.
  • Summarize the rates stratified by current smoking status: stir cursmoke will generate the results in EMS Table 24.3.
  • Study solely the effect of current smoking status: poisson mi cursmoke, e(folltime) will generate the results in EMS Table 24.5, while adding the option irr will generate the results in rate ratio (EMS Table 24.4; same as stmh cursmoke).
  • Study solely the effect of social class:
    • Social class as a categorical variable: xi, prefix(): poisson mi i.socclass, e(folltime) ir will generate the results in EMS Table 24.6.
    • Social class as a continuous variable: poisson mi socclass, e(folltime) ir will generate the results in EMS Table 24.7.
    • Compare the two models to see if treating social class as a categorical variable significantly improves the fit compared to treating it as a continuous variable. Note that these two models are nested. You can see which groups have high influence on the fit by strate socclass.
** Model with social class as a categorical variable
xi, prefix(): poisson mi i.socclass, e(folltime) ir
est store A
predict irate, ir
label var irate "incidence rate"
** Model with social class as a continuous variable
poisson mi socclass, e(folltime) ir
lrtest A, stat
matrix define A=e(b)
** Graphical comparison
scatter ir socclass || function exp(A[1,2]+A[1,1]*x), range(socclass)

  • Study current smoking status and social class simultaneously:
    • stmh cursmoke, by(socclass) generate the results in EMS Table 24.8.
    • xi: poisson mi cursmoke i.socclass, e(folltime) ir generate the results in EMS Table 24.9.

Survival outcome variables

Survival Analysis with Stata is a good resource on survival analysis. Type help st for more details.

Analysis on hazards (EMS 26)
After opening the pbc_ems.dta file, type des, label list, and st to learn the structure of the data.

The follow up time is recorded in years. We define a new variable to reflect the time in days by gen days = time*365.25, and re-declare the survival-time data structure by stset days, failure(d).
  • sts list if cenc0==1 will generate EMS Table 26.2. ltable days d if cenc0==1, graph will generate EMS Table 26.2 and Figure 26.2. sts graph if cenc0==1, g will generate a different flavor of EMS Figure 26.2.
  • sts graph, by(cenc0) or ltable days d, by(cenc0) notable graph overlay will generate EMS Figure 26.3.
  • sts graph, by(cenc0) na will generate a graph of cumulative hazard. To generate EMS Figure 26.4, we need to add more options: sts graph, by(cenc0) na xscale(log) yscale(log) xlabel(10 100 1000 5000) ylabel(.001 .01 .1 1 5).
  • Mantel-Cox method: stmc cenc0 will generate the results in EMS 26.5.

Cox proportional hazards regression (EMS 27)
  • stcox cenc0 will generate the results in EMS 27.2.

EMS other chapters

Nonparametrics (EMS 30.2)
  • One sample: An observation has two parts of information: magnitude and sign. A t-test statistic is calculated based on both. We can replace the magnitude information by the corresponding ranks. Wilcoxon's signed rank test (signrank) is calculated based on the rank information and sign. A sign test (signtest) is calculated based on the sign information alone, effective making the outcome as binomial.
  • Two samples: Wilcoxon's rank sum test (ranksum)
  • More than two samples: Kruskal-Wallis test (kwallis)

  • Rank correlations: spearman can be used to calculate Spearman's rank correlation coefficients. ktau can be used to calculate Kendall's rank correlation coefficients.

ROC curve (EMS 36.2)
  • In the Peru lung study, lower volumn of air a child could breathe out in a second tends to be associated with respiratory symptoms (coded as 1). Thus, the two variables appear in negative association. However, the command roctab assumes positive association, and we have to define a new variable before hand: gen negfev1 = -fev1. Then we can use roctab respsymptoms negfev1, sum graph to output the results and EMS Figure 36.1.
  • If we want to compare the ROC curves for boys and girls, we can use roccomp respsymptoms negfev1, by(sex) sum graph.

Advanced topics

Matrix is very useful in statistics. Stata has many matrix related functions. Type help matrix to start learning them.

Programming is useful in many scenarios: (1) When you analyze data using a protocol repeatedly, you may want to put the code into a program so that you can just call the program instead of repeating the many commands. (2) When you carry out a simulation, you will repeat a procedure many times. It will be very efficient to write a program and execute the program. (3) When people implement a statistical method in Stata, they write and distribute programs. Type help program to start learning Stata programming.

Note: Stata is not a good program to carry out sophisticated simulations, for which R is a better choice.

Generation of random variables: uniform() will generate a random number between 0 and 1 (i.e. from the uniform(0, 1) distribution). To generate a random number from another distribution, apply the corresponding inverse cumulative distribution function. For example, invnormal(uniform()) will generate a random number from N(0, 1). help density functions will list many such functions.

For example, if you want to generate a variable with 100 random numbers from a normal distribution with mean 3 and SD = 7, use:
set obs 100
gen aa = invnormal(uniform()) * 7 + 3

Generation of normal random variables: Normal variables are used so often that Stata has a designated function for drawing them. drawnorm can be used to generate normal variables, independent or dependent.

Sampling: sample can be used to sample from existing values without replacement. bsample and bootstrap can be used to sample from existing values with replacement.

Simulation: simulate is a major command used in simulations. It calls another command repeatedly and stores the results of each call. The command that is called often is a program defined by the user.

The following programs give you a sense of how people write simulation programs in Stata:
  • is a program to simulate the repeated sampling procedure similar to that in EMS 4.5. The program treats the data as a population, and draws a random sample from a variable and calculates the sample mean. This procedure is repeated 100 times to demonstrate the variation of the sample mean as an estimate of the population mean. The program outputs summary and histogram of the 100 estimates. It also draw samples in different colors depending on whether the 95% confidence interval calculated based on the sample covers the real average or not.
    • You may need to modify the program to run it on your computer. After you have saved the program, go to Stata and press ctrl-8 [or, Window -> Do-file Editor -> New Do-file], then open the program. Modify the four lines near the beginning to try it on any dataset/variable/sample size/number of samples. Make sure the pathname is correct for the data file in your computer. After this, you can run the program by going to File -> Do and choose the program.
    • You can change the sample size to other numbers and re-run the program to see the effect of sample size on variation of the estimates.

  • is a program to show the variation of sampling and its impact on test results as a function of sample means and sample SDs.

  • is a program to show the variation of regression line estimate as a function of sample size and standard deviation in the error term.

Stata graphics, under the hood
Stata graphics, under the hood was a talk by Vince Wiggins on the capabilities of Stata graphics. The talk was given from within Stata. You need to install the talk before starting it.
net from
net describe uk04
net install uk04
After this, you need to check if the PERSONAL ado directory exists. Type sysdir to see what directory is designated for PERSONAL. Then check if the directory exist; if not, create it yourself. Then use ukgrtalk to install the files needed for the talk. After this, you can type help ukgrtalk and follow the instructions.

System settings and limits
Try query to see the various system parameter settings. To change a setting, use set command. To obtain the current setting dynamically (e.g. in a program), use c-class return values. For details, type creturn list, which gives a more comprehensive list than query does. Click on the links of the output to see corresponding help pages. Type help limits will list all the limitations your version of Stata imposes.

Topic revision: r73 - 23 Apr 2007, ChunLi

This site is powered by FoswikiCopyright © 2013-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback