Statistics for Biomedical Research Discussion Board

Go to WelcomeGuest to learn about twiki. You will see a place to click to go to the TWikiRegistration page to do a one-time registration where you assign yourself a twiki ID and password. Your twiki ID has to be of the form FirstLast containing your first and last names. You must have an ID and password to add or change content on our site, e.g., to ask or answer questions on this discussion board.

Final Exam

Problem 1

Problem 2

Problem 3

Could you clarify the experiment that the investigator is doing? Are there really two treatment groups? I don't think that the experimental procedure is clear enough for us to entirely evaluate what is wrong with the analysis.

Reply by Chris

Sure. There are 40 total animals in two independent groups (e.g. 20 in a treatment group, 20 in a control group). Each animal receives a baseline measurement (of something, the details aren't important) and a follow up measurement at a future time. The investigator thinks she is interested in measuring change, so she first calculates the percent change for each animal, followed by the mean percent change in each group, and then compares the two means using a t-test. As the question implies, this was an awful analysis choice and there are (at least) three major errors. What are they? How should you have analyzed the data?

Problem 4

I'm having trouble recoding the variables, such as drug and wild, so that I can perform the necessary tests to make comparisons between mouse genotype and drug treatment. How do I rename these variables so that I can do the statistical tests of comparison?

Reply by Chris

It is hard for me to guess what your exact problem is, but you can recode variables as factors using R-commander menus: Data... Manage Variables... Recode numeric variables as factors. That may be (and probably is) one possible way to get the correct answer, but if that is not working for you, there are lots of ways to do these problems. If you can't figure it out in the remaining time, at least indicate the test that you would have used to get some partial credit.

Part 1: Is this question most appropriately handled similarly to the serial data analysis on the last day of class, where every mouse gets boiled down to one number, and xy plots done in lattice? Alternatively, can this be done in Rcmdr?

Reply by Chris

This question refers back to topics covered in the first few weeks of the class-- basic descriptive statistics and plots. Your answers here (and other plots that you might look at but not turn in-- only turn in 1 table and 1 figure) should help you decide the appropriate analyses needed to do the remaining parts of this problem.

Yes, it can be done in R commander.

Part 1: Should we omit extreme or outlying variables from our descriptive analysis, table, and figure? The way the question is phrased, "subsequent analysis", it seems as if all data should be included in the initial part of the question. Is this correct?

Reply by Chris

  • Definitely remove the extreme data points when doing the remaining parts of problem 4
  • For part 1, definitely say which data points you removed in your paragraph describing the data
  • You are right that the question is not clear as to whether or not the extreme values should be in the table and figure, so either is acceptable for full credit. I would recommend removing the extreme points from any plots and tables (like you would do for a paper) because that will be the most helpful when answering the other parts of the problem. You only need to turn in one figure & table, but looking at additional plots may be helpful.

Part 3: Because there are more subjects in the Drug A group compared to the Drug B group, R limits the types of statistical tests I can perform. Is there a way to get around this?

Reply by Chris

Yes, I can get the answer in R and R-commander. If you are using R-commander, remember that you need to define factor variables for some of the tests to be available.

In general, there are many ways to get the correct answer to each part of problem 4. If you are concerned about a particular number, you could do a problem two different ways and check that answers match.

Part 4: This question about significant change doesn't specify if we are separating drugA and drugB. I can report the results of two tests, asking whether there is a significant change for drugA or for drugB. Or I can report one test, asking if the change for drugA is significantly different from the change in drugB.

Reply by Chris

Part 4 does not specify drug A or drug B, so drug should not be part of the test. You need to test the size variables at the two time points without considering drug.

Problem 5

Part 1: I have no idea what you want here. The assumptions depend on the analysis I design in part 2. If my analysis is non-parametric, then normality is a non-issue (for instance). What sort of assumptions are independent of my analytical method?

Reply by Chris

In any analysis you perform, you will be making between a few and a lot of assumptions. It is important to be aware of the assumptions you are making and to verify that they are reasonable before conducting any analysis, which is often done using a combination of scientific knowledge and basic descriptive statistics and plots. Part 1 is about describing an appropriate type of plot that could help you to understand if your assumptions are reasonable or not. In this problem, SBP is measured at multiple time points for each animal and there also are some potential problems with missing data-- there could be assumptions needed to deal with either of these issues.

Number 1: In order to "design a graphic", don't we need data? Or are we to assume that we can make up our own data for this question?

Reply by Chris

You do not have to actually create a plot/graphic for this question, just describe the type of plot you would create to check the modeling assumptions. If you prefer to also sketch what it might look like with some made up data, that would be fine but is not expected or required.

By analogy, if you were given a general situation in which a continuous outcome variable was measured in a treatment and control group, what kind of plot would you suggest to check modeling assumptions? I would always begin that analysis by looking at a dot plot (or box plot) of the outcome in the treatment and control groups. This question is different in that it deals with a generic repeated measures scenario, so there are other types of descriptive plots that are always useful.

Homework 4

I can't figure this stuff out. I can't find the information I need in the notes. Could we have a help session where we do problems that are like the homework problems?

Reply by Chris

There are examples in the book, with additional examples gone over in class that are very similar to the questions being asked. I developed the notes given in class based on the book material. Everything you need to answer these questions was covered in class and in most cases is repeated in the EMS book.

Regarding help sessions, we are available every Thursday after class to answer questions for up to one hour. We are also willing to arrange additional help sessions if needed.

Problem 1

To find the probability of an event, don't you divide the occurrence of that event by the total tries, as in getting 3 when rolling dice? For example, if you roll the dice 12 times, and you get 3 two times, the probability of rolling 3 is 1/6, right?

Reply by Chris

If you roll the dice 12 times and get 3 two times, you are right that your observed probability of getting a 3 was 2/12 = 1/6. In that situation each roll of the dice is independent.

As a side note (this is beyond the scope of the question), we can calculate the exact probability of rolling a total of 3 exactly twice in twelve throws of a fair pair of dice. First, find the probability that a single throw totals three = Pr(roll 1 then 2 or 2 then 1) = Pr(roll 1 then 2) + Pr(roll 2 then 1) = Pr(roll 1)*Pr(roll 2) + Pr(roll 2)*Pr(roll 1) = 1/6*1/6 + 1/6*1/6 = 2/36. This calculation uses the facts that (a) the first and second die are independent, and (b) the two events, rolling 1 then 2 and rolling 2 then 1, are mutually exclusive (both can't occur on the same throw). The second step is to use the Binomial distribution to find Pr(getting two 3's in 12 throws) using n = 12, p = 2/36, and r = 2. I get Pr(getting two 3's in 12 throws) = 0.115.
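
The binomial step can be checked in R. This is only a sketch of the calculation described above; the variable names are made up for illustration.

```r
# Probability that one throw of two fair dice totals 3:
# the outcomes (1,2) and (2,1) out of 36 equally likely pairs.
p_three <- 2 / 36

# Binomial probability of exactly r = 2 such throws out of n = 12.
pr_two_threes <- dbinom(2, size = 12, prob = p_three)
round(pr_two_threes, 3)   # 0.115

# The same number from the binomial formula written out.
choose(12, 2) * p_three^2 * (1 - p_three)^10
```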

So in problem 1, the number of times one picked a certain kind of flower was given out of the total number of flowers. Wouldn't you then say the probability of being a particular color and height is the number of flowers of that color and height out of the total number of flowers?

Reply by Chris

Yes; if you add up those 4 probabilities, they should sum to 1

Why does multiplying the individual probabilities of color and height not equal the observed number of flowers of that same color and height?

Reply by Chris

I am not sure what you mean by this-- if you multiply probabilities (which are numbers <= 1), you will end up with something that is also <= 1. That won't ever be equal to the observed counts. My guess is that you are assuming independence between flower color and height. That assumption held for the dice rolls, but it does not hold here.
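
To see what the independence assumption does, here is a small sketch with made-up counts (these are not the homework numbers, and the object names are hypothetical). It compares the observed joint probabilities with the products of the marginal probabilities; the two agree only when color and height are independent.

```r
# Hypothetical counts (not the homework data): rows = color, columns = height.
flowers <- matrix(c(30, 10,
                    20, 40),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(color = c("red", "white"),
                                  height = c("tall", "short")))

joint    <- prop.table(flowers)   # observed joint probabilities; they sum to 1
p_color  <- rowSums(joint)        # marginal probabilities of color
p_height <- colSums(joint)        # marginal probabilities of height

# Joint probabilities that would hold if color and height were independent.
expected_if_indep <- outer(p_color, p_height)

joint
expected_if_indep
```

Multiplying `expected_if_indep` by the total number of flowers gives the counts you would expect under independence, which is the scale on which they can be compared with the observed counts.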

When calculating a CI for the OR needed in Q1 part 3, how do I get "n"? Also how do I calculate standard deviation/ variance?

Reply by Chris

There is a specific section of the book that goes over how to do this, with the formula and an example. I also covered an example during lecture.

I have spent considerable time on Q1 p4 and remain lost. It sounds like you are asking for a partial F test or something.

Reply by Chris

Again, the book and the notes given during lecture covered the test we are using. In the lecture notes, my example was using low birth weight as the binary outcome and race as a 3-level categorical predictor (rather than a 4-level predictor in this problem).

Quiz 3 (Take Home)

Problem 4

Is question (4) on the quiz meant to be a multiple choice question? Are we to choose the variables that would not be summarized well by the mean?

Reply by Chris

Yes, it is essentially a multiple choice question, but with more than one answer possibly being correct. That is, the correct answer could be (a) only, (a) and (b), or any other combination of a, b, and c.

Midterm

R Problems

If you get an error cannot coerce class "labelled" into a data.frame when analyzing or graphing the lead dataset, you can either install and load the Hmisc package, or use a version of the lead dataset that does not have labels. It may be obtained here.

I keep having problems with R commander freezing when I am working with a dataset. Is there a way around this problem?

Reply by Frank

Please specify the operating system you were using and what you were doing right before the freeze. If you need access to another computer or need to bring your computer to us to take a look at the problem please let us know.

Problem 1

There are only 19 observations, but 20 is stated... am I missing something?

Reply by Chris

You are right in that there are 19 points on the graph, but there are 20 observations. Yes, you are missing something (but I can't give away what that is).

Problem 2

Problem 3

Problem 4

How do I find the exact confidence interval? I can't seem to find anything in my notes, or the class handout on this topic.

Reply by Frank

Think about how we often use two different probability distributions, one of them penalizing us more for not knowing the true variance.

Can you explain the difference between part 8 and part 9? I am having difficulty differentiating between exactly what is being asked in each question.

Reply by Frank

This is something that the handouts cover fairly explicitly. I'll just say that predicting the result for an individual subject with X=x is much harder than predicting the mean of a bunch of subjects whose X variable equals x.
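
As a rough illustration with simulated data (not the exam data; all names and numbers here are assumptions), R's predict() can return both kinds of interval for a linear model. The interval for a single new subject is always wider than the interval for the mean at the same x, because it also has to allow for the scatter of individuals around the mean.

```r
set.seed(1)                       # simulated data, for illustration only
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50, sd = 4)
fit <- lm(y ~ x)

new <- data.frame(x = 5)
ci_mean  <- predict(fit, new, interval = "confidence")   # mean of subjects with x = 5
pi_indiv <- predict(fit, new, interval = "prediction")   # one new subject with x = 5

ci_mean
pi_indiv
```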

Is there an Equation that can be used to calculate sigma in question 10?

Reply by Frank

See section 7.4.6 in the handouts, or let software do it.

Problem 5

A clarification from class. You said that obtaining F by using the 'compare two models' function, one model adjusted and the other model full is not the way to go for obtaining partial F, but it was how you obtain partial SSR's for adjusted variables?

Reply by Frank

The 'compare two models' menu is good for both purposes, I think.

#2 "Interpret the t statistic for the age x sex effect." What are you talking about here? I can give you a t-statistic for age or sex, or an F-statistic for the whole model

Reply by Frank

The question is fairly specific. Test the interaction effect in isolation, i.e., test for whether the effect of age is modified by sex (equivalent to whether the effect of sex is modified by age).

How would I use t-tests to assess whether each of the lead levels is needed in predicting maxfwt once the other lead level is adjusted for? What is meant by "adjusted for"?

Reply by Frank

Adjusted for implies there is more than one variable in a multiple regression model. In a model, the effect of each predictor variable is adjusted for the effects of all the other variables in the model. The partial t or F test for a predictor is a test of association of that predictor with the outcome, adjusted for the effects of the other variables. [Another way to say adjusted for is to say controlling for or after holding constant the other variables.]
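
A minimal sketch of the idea, using simulated data rather than the lead dataset (the variables x1, x2, and y are made up). Here y truly depends only on x1, but x2 is correlated with x1, so x2 looks important on its own and much less so once x1 is in the model.

```r
set.seed(2)                             # simulated data, for illustration only
x1 <- rnorm(100)
x2 <- 0.7 * x1 + rnorm(100, sd = 0.5)   # x2 is correlated with x1
y  <- 1 + 2 * x1 + rnorm(100)           # y depends on x1 only

unadjusted <- lm(y ~ x2)        # x2 alone
adjusted   <- lm(y ~ x1 + x2)   # effect of x2 adjusted for x1

coef(summary(unadjusted))["x2", ]   # slope, SE, t, P for x2 ignoring x1
coef(summary(adjusted))["x2", ]     # partial t-test for x2 holding x1 constant
```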

How would I find the weighted combination of lead levels that best predicts maxfwt?

Reply by Frank

Think about what regression coefficients mean at a fundamental level.

Is distance from the smelting plant represented by the "area" variable in the lead dataset?

Reply by Frank

Yes; the name of the variable isn't such a good choice.

Scatterplot error "Cannot coerce "labelled" into a data.frame" How do I fix the problem so I can view a scatterplot. Only occurs when trying to set the group as "sex".

Reply by Frank

See R Problems above.

Does the sex variable need to be recoded?

Reply by Frank

It is already a factor variable and does not need to be recoded.

When making a linear model, how do you allow the slope of one variable to vary with another?

Reply by Frank

When the slope of one variable varies according to the values of another variable, interaction or synergism is present. Different software packages have different methods for adding interaction terms to the model. Some require you to create a new variable that is the product of two variables, e.g. age*(sex=='male'). In R's modeling formula you separate two variables by an asterisk instead of a plus to force the main effects and the (automatically created) products (interaction effects) to be included. For example, a model y ~ height*weight would generate a model with an intercept, a slope for height, a slope for weight, and a slope for the product of the two.
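
For example, with simulated data (the numbers are made up; only the formula syntax matters here), the asterisk formula and a model with a hand-made product variable give the same fit:

```r
set.seed(3)                    # simulated data, for illustration only
height <- rnorm(40, 170, 10)
weight <- rnorm(40, 70, 12)
y <- 5 + 0.1 * height + 0.2 * weight + 0.01 * height * weight + rnorm(40)

fit_star <- lm(y ~ height * weight)
names(coef(fit_star))
# "(Intercept)" "height" "weight" "height:weight"

# The same model with the product variable created by hand.
hw <- height * weight
fit_manual <- lm(y ~ height + weight + hw)
all.equal(unname(coef(fit_star)), unname(coef(fit_manual)))
```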

For #4 what is meant by combining the two lead levels? Does this mean I need to define a new variable in R or group ld72 and ld73 in some way? To "add age and sex", does this mean defining a new linear model such as maxfwt ~ ld72*ld73*age*sex?

Reply by Frank

Beware of things like x1*x2*x3*x4 which generates a complex model with slopes for all combinations of all variables; this is too many slopes to manage. Notation like x1+x2+x3*x4 will generate different slopes for x3 by levels of x4 (or for x4 by levels of x3) but will be simple with regard to x1 and x2. No grouping of ld72 and ld73 is needed.
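
You can see how quickly the terms multiply without fitting anything. The sketch below only expands the two formulas; x1 to x4 and y are placeholder names. If all four variables are continuous, each term corresponds to one slope (factor variables generate more).

```r
# Count the terms each formula expands to (no data needed).
full   <- attr(terms(y ~ x1 * x2 * x3 * x4), "term.labels")
simple <- attr(terms(y ~ x1 + x2 + x3 * x4), "term.labels")

length(full)    # 15: 4 main effects, 6 two-way, 4 three-way, 1 four-way
simple          # "x1" "x2" "x3" "x4" "x3:x4"
```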

HW #3

Does anyone know how to do problem #5 of homework 3? I am getting crazy numbers, and don't know how to calculate the predicted values and residuals.

Reply by Chris

I suggest looking at a simple plot of the data points of y versus x, adding the regression line, and seeing if it makes sense that it is a "best fit" line. The following is some R code for (1) entering the x and y data, (2) inputting the intercept (a) and slope (b; you have to calculate these numbers), (3) plotting the data, and (4) adding a line to the plot with intercept (a) and slope (b). There are so few data points that you can even do this by hand.

x <- 1:5
y <- c(98,198,315,380,530)
a <- ...
b <- ...
plot(x,y)
abline(a,b)

Predicted values fall on the regression line. Residuals are the difference between your observed and predicted values.

Reply by Frank

As a check, guesstimate the intercept and slope of the regression line on the plot and compare them to your estimates. Your estimates are derived from the formulas in the notes. Once you use these formulas you can compute the predicted values using the equation $\hat{y} = a + b \times x$, five times. If you are brave enough to use low-level R commands to do this you can use

yhat <- a + b*x
yhat     # prints the results
y - yhat # prints residuals

Question 5: What is the "fitting criterion"?

I was wondering about the third and final part of question 5. When you have your new a and b, what is meant by "fitting criterion"? Is that calculating the predicted values and residuals for each observation again like in 5.2? Thanks!

Reply by Chris

When it says "compute the fitting criterion" I interpret that to mean "compute the sse" ('sse' is the sum of the squared errors, also known as the residual sum of squares). You can either use the R function that Frank gave you, or you can do it by hand by getting the residuals and predicted values (as in 5.2) for each of the 4 scenarios. It doesn't say to do this explicitly in the homework, but you should compare the sse you get from the 4 scenarios to the sse you get using the original 'a' and 'b'
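
Here is a sketch of what such a comparison could look like, using made-up data rather than the homework values. The helper sse() is hypothetical (it is not necessarily the function handed out in class), and lm() is used only to get the least-squares a and b for this toy data.

```r
# Hypothetical data (not the homework values).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared errors for a line with intercept a and slope b.
sse <- function(a, b, x, y) {
  yhat <- a + b * x        # predicted values
  sum((y - yhat)^2)        # squared residuals, summed
}

fit   <- lm(y ~ x)         # least-squares estimates
a_hat <- coef(fit)[[1]]
b_hat <- coef(fit)[[2]]

sse(a_hat, b_hat, x, y)          # the smallest SSE any line can achieve
sse(a_hat + 0.5, b_hat, x, y)    # perturbing a increases SSE
sse(a_hat, b_hat - 0.2, x, y)    # perturbing b increases SSE
```

Any change to the least-squares a or b gives a larger sse, which is what makes the least-squares line the "best fit" under this criterion.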

What is meant in question 4?

Reply by Frank

Think about the quantity of interest that a two-group comparison in a clinical trial aims to estimate, and think about what it is that you don't know.

HW #2

Do non-parametric tests have degrees of freedom?

Reply by Chris

You're right about the non-parametric tests not having degrees of freedom in the same sense as the parametric tests. If you look up p-values for Wilcoxon tests using a table, they are based on n (for one sample test), or n1 and n2 (for two sample tests). The question in the homework should read "Describe how the degrees of freedom are calculated for the parametric tests." I corrected the assignment just now.

Factor variables

Where can I find the instructions for recoding a variable? I'm trying to recode "treatment" as a factor variable so I can group calculations based on treatment status.

Reply by Chris

I think you have figured this out, but you can't summarize by groups until you have defined a variable as a factor. By default, R thinks numeric variables are continuous, not categorical, so you have to define it appropriately. Anyway, in R commander that is found in the Data menu, Manage variables in active data set, and Convert numeric variables to factors. It creates the appropriate "factor" command, something like dataset$trt <- factor(dataset$trt, labels=c('trt','ctrl'))
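
A small sketch of that command on a made-up data set (the data frame and its values are hypothetical). Note that the labels are assigned to the sorted numeric codes, so it is worth checking that the order of the labels matches your coding.

```r
# Hypothetical data set; 'trt' is coded 1/2 as in many raw files.
dataset <- data.frame(trt = c(1, 2, 1, 2, 1, 2),
                      y   = c(5.1, 3.8, 6.0, 4.2, 5.5, 4.0))

# Labels are assigned to the sorted codes: 1 -> 'trt', 2 -> 'ctrl'.
dataset$trt <- factor(dataset$trt, labels = c('trt', 'ctrl'))

is.factor(dataset$trt)
group_means <- tapply(dataset$y, dataset$trt, mean)   # mean of y within each group
group_means
```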

Importing the data

How do I import the dataset for this assignment?

Reply by Frank

Click on the link to the dataset from the web page listing all the homework assignments. In Rcmdr under Windows you can paste the entire URL into the file name box when using the Import menu.

Computing P-values

I'm having trouble calculating the p-value from the t value. Is there a concise equation for this? Also, how would I calculate t over the interval tn-1,1-a/2?

Reply by Frank

If you have computed a test statistic and do not have the P-value already computed by the software you are using, you can use one of the web stat calculators to compute the P-value, or use the R command (script) window. Type for example

2*(1 - pt(1.96, 1000))
to get a 2-tailed P-value when a t-statistic is 1.96 and the degrees of freedom is 1000. For computing the same type of P-value for a normal (z) test use 2*(1-pnorm(1.96)) when z=1.96.

If you are using an Rcmdr menu to perform a statistical test, the P-value should be part of the output.
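
If the second part of the question is about the critical value t with n-1 degrees of freedom at 1-a/2 (that is my reading of it), qt() is the inverse of pt(). A sketch with made-up numbers:

```r
t_stat <- -2.3   # hypothetical t-statistic
n <- 20          # hypothetical sample size
df <- n - 1

# Two-tailed P-value; abs() makes it work for negative t as well.
p_value <- 2 * pt(-abs(t_stat), df)

# Critical value t_{n-1, 1-alpha/2} for alpha = 0.05.
alpha <- 0.05
t_crit <- qt(1 - alpha / 2, df)

p_value
t_crit
abs(t_stat) > t_crit   # TRUE exactly when p_value < alpha
```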

Calculating test statistics "by hand"

Can I calculate statistics (i.e., t, p, etc.) by using just mean, n, and standard deviation, without loading the data set? Can I compare two means without the data set?

Reply by Chris

Yes, you can calculate t-statistics by hand using the sample mean (xbar), the sample standard deviation (s) and the sample size (n) for the one-sample problem. If you want to compare two means, like a treatment and control group, you have to know the mean in each group, the standard deviation in each group, and the sample size in each group. To get the T-statistic, you plug these numbers into formulas which are given in the book or notes.
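
As a sketch, here are the usual textbook formulas typed into R with made-up summary statistics. The two-sample version below assumes equal variances in the two groups (the pooled t-test); check the book or notes for the version expected in the homework.

```r
# One-sample t-statistic from summary statistics (hypothetical numbers).
xbar <- 5.2; s <- 1.5; n <- 25; mu0 <- 4.5
t_one <- (xbar - mu0) / (s / sqrt(n))
p_one <- 2 * pt(-abs(t_one), df = n - 1)

# Two-sample t-statistic with a pooled variance (assumes equal variances).
m1 <- 5.2; s1 <- 1.5; n1 <- 25
m2 <- 4.4; s2 <- 1.7; n2 <- 30
s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
t_two <- (m1 - m2) / (s_pooled * sqrt(1 / n1 + 1 / n2))
p_two <- 2 * pt(-abs(t_two), df = n1 + n2 - 2)

c(t_one = t_one, p_one = p_one, t_two = t_two, p_two = p_two)
```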

Can I get really large test statistics

I know the population mean is mu. Is this the same as mu_0 ? I came up with a t-value that is extremely out there of -6.84 and it doesn't even sound close. Is this possible?

Reply by Chris

Right, mu is the population mean; mu_0 is some number that you pick (often 0) based on the hypothesis you want to test.

You will get more extreme t-statistics when the observed mean is "far" from the hypothesized mean, where "far" is relative to the standard error of the mean (which itself is a function of the standard deviation and the sample size). I don't have the numbers in front of me for this HW problem, but it is possible to get t-statistics which are that big (or even bigger) when the null hypothesis is not true. By "big" I mean big in absolute value-- test statistics are "big" or "extreme" when they are far from 0.

Is it required to present a summary table of all the variables in the sepsis data set (Homework 2)?

Reply by Chris

No, you do not need to include a table in the homework. It may be helpful to look at summary statistics before doing the hypothesis tests.

Questions about reporting confidence intervals

When calculating the confidence interval, I have seen it interpreted as the interval in which the means differ or the interval in which the true population mean lies. In one case, the interval would contain 0 if the means do not differ, but not in the other case. Which is best to report?

Reply by Frank

We have to be clear about sample vs. population characteristics, the latter being unknowable and the former providing estimates (with a margin of error) for the truth. But you can define a confidence interval as the set of all possible values that if hypothesized (H0: mean or mean difference equal to that value) would not be rejected at the 0.05 level. An interval for a difference in means will contain zero if the P-value for the t-test is >0.05. When reporting we don't usually give all this wording but just report something like mean difference followed by 0.95 confidence interval [lower,upper] and sometimes P-value.
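
The link between the interval and the test can be seen with simulated data (the group names and numbers below are made up). t.test() reports the mean difference, its 0.95 confidence interval, and the P-value from the same calculation, so the interval covers zero exactly when the P-value is at least 0.05.

```r
set.seed(4)                         # simulated data, for illustration only
ctrl <- rnorm(20, mean = 10, sd = 2)
trt  <- rnorm(20, mean = 11, sd = 2)

res <- t.test(trt, ctrl)            # Welch two-sample t-test, 0.95 CI by default
ci  <- res$conf.int
contains_zero <- ci[1] <= 0 && ci[2] >= 0

unname(res$estimate[1] - res$estimate[2])   # mean difference
ci                                          # 0.95 confidence interval [lower, upper]
res$p.value
contains_zero == (res$p.value >= 0.05)      # TRUE: the two statements agree
```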

HW #1

Importing data

I am not able to use any of the FEV data sets with R: *.sav, *.sdd etc. What format can I import?

Reply by Chris

I got the R dataset (the .sav) to work using R commander. Using the menus, it is "Data... Load data set" and then find where you saved the download file. Alternatively, the .csv file should also work and, to import that, follow what we did in class. In particular, for R-commander, the menu is found in "Data... Import Data... from text file or clipboard" then changing the options so that you (1) name your dataset and (2) change the Field Separator to 'commas'

Reply by Frank

To import a dataset from our DataSets area you can right click on an R .sav file and save the address (URL). Under Rcmdr you can paste this URL into the file name field in the import dialog.
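
Outside of the menus, the same two imports can be done from the R command window. The sketch below writes a small stand-in data set to temporary files and reads it back, so the file names are placeholders and not the course files. It assumes the .sav file is an R binary file of the kind written by save(), which is what the "Load data set" menu reads.

```r
# Stand-in data; the files below are temporary files, not the course files.
fev <- data.frame(age = c(9, 8, 7), fev = c(1.7, 1.6, 1.3))

csv_file <- tempfile(fileext = ".csv")
write.csv(fev, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file)       # comma is the default field separator

sav_file <- tempfile(fileext = ".sav")
save(fev, file = sav_file)           # an R binary file, as written by save()
rm(fev)
loaded_names <- load(sav_file)       # restores the object 'fev' and returns its name
loaded_names
```

load() also accepts a connection, so wrapping a saved address in url() should work in place of a local file name.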
Topic revision: r73 - 13 Apr 2013, FrankHarrell