Estimators and Statistical Tests

TODO: give the structure of this chapter
  motivation of statistical tests
  notion of estimator, bias, MSE ...
  MLE (should be in a section of its own)
  Bayesian methods (in a section of their own) (TODO)

TODO: list the more important tests (Student and Chi^2)

Student T test:
  compare a mean with a given number
  compare the means of two samples
There are generalizations for more than two samples (analysis of
variance) and for non-gaussian samples (Wilcoxon). One can devise
similar tests to compare the variance of a sample with a given number,
or to compare the variances of two samples.

Chi^2 test:
  compare the distribution of a qualitative variable with
  predetermined values,
  compare the distribution of a qualitative variable in two samples.
One can also use it to check whether two qualitative variables are
independent. However, it is only an approximation, valid for large
samples (more than 100 observations, more than 10 observations per
class).

TODO: check if we can do without the Chi^2 test:
  - binary variable: binom.test
  - multinomial test: ???
  - independence Chi^2: fisher.test
  - two variables: fisher.test

* Introduction to statistical tests:

TODO: REWRITE THIS SECTION

We want to answer questions of the kind "Does tobacco increase the
risk of cancer?", "Does the proximity of a nuclear waste reprocessing
plant increase the risk of leukemia?", "Is the mean of the population
from which this sample was drawn zero, given that the sample mean is
0.02?"

Let us detail the problem "Do those two samples have the same mean?"
(a simplification of the problem "Do those two samples come from the
same population?"). Consider a first population, on which a
statistical variable (with a gaussian distribution) is defined, from
which we get a sample. We do the same for a second population, with
the same population mean. We can then consider the statistical
variable

  sample mean in the first sample - sample mean in the second sample

and find its distribution. If we measure a certain value of this
difference, we can compute the probability of obtaining a difference
at least as large. If P( difference > observed difference ) < alpha
(for a given value of alpha, say 0.05), we reject the hypothesis "the
two means are equal", with a risk equal to alpha.

But beware, this result is not certain at all. There can be two kinds
of error: either wrongly claim that the means are different (this
happens with a probability alpha), or wrongly claim that they are
equal. Beware again, those tests are only valid under certain
conditions (gaussian variables, same variance, etc.).

If we really wish to be rigorous, we do not consider a single
hypothesis but two: for instance "the means are equal" and "the means
are different"; or "the means are equal" and "the first mean is larger
than the second". We would use the second formulation if we can, a
priori, rule out the possibility that the first mean be lower than the
second -- but this has to come from information independent of the
samples at hand.

Statistical tests will never tell us "the hypothesis is true": they
merely reject, or fail to reject, the hypothesis stating "there is
nothing significant". (This is very similar to the development of
science as explained by K. Popper: we never prove that something is
true, we merely keep trying to prove it wrong and keep failing to do
so.)
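Here is a minimal sketch of this idea on simulated data (the sample
sizes, the shift of 0.3 and the use of a standard gaussian are
arbitrary choices for the illustration, not part of the problem): we
simulate the distribution of the difference of the sample means under
the null hypothesis and see how often it is at least as large as the
difference actually observed.

# Two "observed" samples (here, simulated)
observed.x <- rnorm(50, mean = 0)
observed.y <- rnorm(50, mean = 0.3)
observed.difference <- mean(observed.x) - mean(observed.y)
# Distribution of the difference of the sample means when
# H0 is true, i.e., when both samples come from the same population
N <- 10000
difference <- numeric(N)
for (i in 1:N) {
  x <- rnorm(50)
  y <- rnorm(50)
  difference[i] <- mean(x) - mean(y)
}
# Two-sided "p-value": probability of a difference at least as large
mean( abs(difference) >= abs(observed.difference) )

In practice, of course, we would not simulate anything but directly
call t.test(observed.x, observed.y), as we shall see later in this
chapter.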
+ H0 (null hypothesis) and H1 (alternative hypothesis)

Let us consider two hypotheses: the null hypothesis H0, "there is no
noticeable effect" (for instance, "tobacco does not increase the risk
of cancer", "the proximity of a waste reprocessing plant does not
increase the risk of leukemia") and the alternative hypothesis H1,
"there is a noticeable effect" (e.g., "tobacco increases the risk of
cancer").

The alternative hypothesis can be symmetric ("tobacco increases or
decreases the risk of cancer") or not ("tobacco increases the risk of
cancer"). Choosing an asymmetric alternative hypothesis means
rejecting, a priori, half of the possibilities: it can be a prejudice,
so think carefully before choosing an asymmetric alternative
hypothesis.

H0 is sometimes called the "conservative hypothesis", because it is
the hypothesis we keep if the results of the test are not conclusive.

+ Type I error

Wrongly rejecting the null hypothesis (i.e., wrongly concluding "there
is an effect" or "there is a noticeable difference").

For instance, if the variable X follows a gaussian distribution, we
expect to get values "in the middle" of the bell-shaped curve. If we
get extreme values, we shall reject, sometimes wrongly, the null
hypothesis (that the mean is actually zero). The type I error
corresponds to the red part in the following plot.

%G
colorie <- function (x, y1, y2, N=1000, ...) {
  for (t in (0:N)/N) {
    lines(x, t*y1+(1-t)*y2, ...)
  }
}
# No, there is already a function to do this
colorie <- function (x, y1, y2, ...) {
  polygon( c(x, x[length(x):1]), c(y1, y2[length(y2):1]), ... )
}
x <- seq(-6,6, length=100)
y <- dnorm(x)
plot(y~x, type='l')
i <- x < qnorm(.025) | x > qnorm(.975)
colorie(x[i], y[i], rep(0,sum(i)), col='red')
lines(y~x)
title(main="Type I error")
%--

+ p-value

The probability, if the null hypothesis is true, of getting a result
at least as extreme as the one observed. If we reject the null
hypothesis on the basis of this result, the p-value is the risk we
take of committing a type I error.

+ Type II error

Wrongly accepting the null hypothesis (i.e., wrongly concluding "there
is no statistically significant effect" or "there is no difference").
In all rigour, it is not an error, because one never says "H0 is true"
but "we do not reject H0 (yet)". It is not an error, but a missed
opportunity.

The type II error is the area of the red part in the following plot.
The middle bell-shaped curve is the distribution predicted by the null
hypothesis, the other one is the actual distribution of the
population.

%G
x <- seq(-6,6, length=1000)
y <- dnorm(x)
plot(y~x, type='l')
y2 <- dnorm(x-.5)
lines(y2~x)
i <- x > qnorm(.025) & x < qnorm(.975)
colorie(x[i], y2[i], rep(0,sum(i)), col='red')
lines(y~x)
lines(y2~x)
title(main="Type II error")
%--

+ Power

The power of a test is the probability of rejecting the null
hypothesis when it is actually false, i.e., 1 minus the probability of
a type II error. It depends on the actual distribution of the
population (in particular, on how far it is from the null hypothesis)
and on the sample size.

Before carrying out an experiment, one can use the power to choose the
sample size: for instance, we may require that the power of the test
of H0: << mu = mu0 >> against H1: << abs(mu - mu0) > epsilon >> with a
confidence level alpha=0.05 be at least 0.80 ("tradition" suggests a
power equal to 0.80, and a confidence level 0.05). The "power.t.test"
function performs those computations (for Student's T test). For
instance, with epsilon = 0.1 and a standard deviation of 1:

> power.t.test(delta=.1, sd=1, sig.level=.05, power=.80,
               type='one.sample')
     One-sample t test power calculation
              n = 786.8109
          delta = 0.1
             sd = 1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

You can ask the question in any direction. For instance, the
experiment has already been carried out, with a sample of size n: what
is the minimal difference in means we can expect to spot if we want
the power of the test to be 0.80?

> power.t.test(n=100, sd=1, sig.level=.05, power=.80,
               type='one.sample')
     One-sample t test power calculation
              n = 100
          delta = 0.2829125
             sd = 1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Here is a plot of this minimum noticeable difference against the
sample size.
%G
N <- seq(10,200, by=5)
delta <- NULL
for (n in N) {
  delta <- append(delta,
    power.t.test(n=n, sd=1, sig.level=.05, power=.80,
                 type='one.sample')$delta)
}
plot(delta~N, type='l', xlab='sample size')
delta <- NULL
for (n in N) {
  delta <- append(delta,
    power.t.test(n=n, sd=1, sig.level=.01, power=.80,
                 type='one.sample')$delta)
}
lines(delta~N, col='red')
delta <- NULL
for (n in N) {
  delta <- append(delta,
    power.t.test(n=n, sd=1, sig.level=.001, power=.80,
                 type='one.sample')$delta)
}
lines(delta~N, col='blue')
legend(par('usr')[2], par('usr')[4], xjust=1,
       c('significance level=.05',
         'significance level=.01',
         'significance level=.001'),
       col=c(par('fg'), 'red', 'blue'),
       lwd=1, lty=1)
title(main='Sample size and difference detected for tests of power 0.80')
%--

You might want to read:
http://www.stat.uiowa.edu/~rlenth/Power/2badHabits.pdf
http://www.stat.uiowa.edu/techrep/tr303.pdf

+ Simple hypothesis

A hypothesis is said to be simple if it entails a complete knowledge
of the distribution of the random variables. For instance, if we are
looking for the mean of a gaussian variable of variance 1 (we already
know the variance), the hypothesis H0: "the mean is zero" is simple.
On the contrary, if we are looking for the mean of a gaussian variable
whose variance we do not know (we know the variable is gaussian, but
we are not interested in its variance), this hypothesis is not simple.

Here is another example: we have a sample that comes either from a
population 1 (described by a random variable of mean 3 and variance 2)
or from a population 2 (described by a random variable of mean 1 and
variance 1). In this case, both hypotheses H0: "the sample comes from
population 1" and H1: "the sample comes from population 2" are simple.

+ Composite hypothesis

A non-simple hypothesis. Quite often, the alternative hypothesis (H1)
is composite: if H1 is true, we know that the distribution has a
certain form, with a parameter that does not take the value specified
by H0 -- but we do not know which value it takes.

+ Confidence interval

Let us consider a random variable whose distribution is not completely
known: for instance, we know it is a gaussian distribution of variance
1, but the mean is unknown. If we estimate this mean as a single
number, we are sure to be wrong: the actual mean might be close to our
proposal, but there is no reason it should be exactly this one, up to
the umpteenth decimal. Instead, we can give a confidence interval --
note the use of an indefinite article: there are many such intervals
(you can shift it to the left or to the right -- and doing so will
change its width).

TODO: check that I give an example with several intervals.

Here are two interpretations of this notion of "confidence interval".
(1) It is an interval in which we have a 95% probability of finding
the sample mean. (2) More naively, it is an interval that has a 95%
probability of containing the population mean (i.e., the actual mean).
Actually, these two interpretations are equivalent. As I was very
dubious, let us check it on a simulation (here, I draw the population
mean at random, and this prior distribution does not play any role).

TODO: say that it is bayesian...

# Sample size
n <- 100
# Number of simulations
N <- 1000
v <- vector()
for (i in 1:N) {
  m <- runif(1)
  x <- m+rnorm(n)
  int <- t.test(x)$conf.int
  v <- append(v, int[1] < m & m < int[2])
}
sum(v)/N

We do find the population mean inside the confidence interval in about
95% of the simulations.

Beware: the p-value is NOT the probability that the null hypothesis be
true. For instance:

> n <- 10                  # Sample size
> m <- rnorm(1)            # Unknown mean
> x <- m + rnorm(n)        # Sample
> t.test(x)$p.value        # p-value
[1] 0.2574325

But what is the probability that the mean be exactly zero?
In our simulation, this probability is zero: the mean is almost surely
non-zero.

> m
[1] 0.7263967

+ UMP (Uniformly Most Powerful) tests

We would like to set the power of a test, in the same way we choose
the level of confidence. But there is a problem: if H1 is not a simple
hypothesis (i.e., if it is not a single hypothesis, such as "mu=0",
but a set of hypotheses, such as "mu!=0", "mu>0" or "mu>1"), the power
is not a number but a function, which depends on the simple hypotheses
contained in H1.

A test is UMP (Uniformly Most Powerful) for a given confidence level
alpha if, for any simple hypothesis in H1, it has the largest power
among the tests with confidence level alpha.

+ Non-parametric tests

Most of the time, statistical tests assume that the random variables
studied are gaussian (and even, when there are several, that they have
the same variance). Non-parametric tests do not make such assumptions
-- we say that they are more "robust" -- but, as a counterpart, they
are less powerful.

Here are a few examples of parametric and non-parametric tests.

TODO: is it the right place for this list?

  Aim                      Parametric test       Non-parametric test
  -------------------------------------------------------------------
  Compare two means        Student's T test      Wilcoxon's U test
  Compare more than        Anova (analysis       Kruskal--Wallis test
  two means                of variance)
  Compare two variances    Fisher's F test       Ansari--Bradley or
                                                 Mood test
  Compare more than        Bartlett test         Fligner test
  two variances

+ Robustness

A (parametric) test is robust if its results are still valid when its
assumptions are no longer satisfied (especially if the random
variables studied are no longer gaussian).

+ Resistance

A statistic (mean, median, variance, trimmed mean, etc.) is resistant
if it does not depend much on extreme values. For instance, the mean
is not resistant, while the median is: a single extreme value can
drastically change the mean, while it will not change the median.

> x <- rnorm(10)
> mean(x)
[1] -0.08964153
> mean(c(x,10^10))      # We add an extreme value
[1] 909090909
> median(x)
[1] -0.2234618
> median(c(x,10^10))    # We add an extreme value
[1] -0.1949206

+ Pearson residuals

One can sometimes spot outliers with the Pearson residuals:

  sample density / density according to the model - 1

TODO: Example with circular data.

+ Outlier detection

TODO
- Should you want them, there are statistical tests for the presence
  of outliers
- Outlier detection benchmark: use classical examples, such as
    library(robustbase)
    ?hkb

One should not try to automatically detect and remove the outliers.
The first problem is that this is a multi-stage procedure (first the
outlier detection and removal, then the rest of the analysis), whose
overall performance should be assessed (the rest of the analysis will
not perform as well as you think it should, because the data has been
tampered with). The second problem is that outlier removal is a
black-and-white decision: either we keep the observation, or we remove
it completely -- if we are unsure whether the observation is an actual
outlier, we are sure to make mistakes. Methods that remain efficient
even in the presence of outliers ("robust methods") do exist: use
them!

+ Breaking point

The breaking point (also called the breakdown point) is the proportion
of observations you can tamper with without being able to make the
estimator arbitrarily "large". It is a measure of the robustness of an
estimator. For instance, the breaking point of the mean is zero: by
changing a single observation, you can make the mean arbitrarily
large.
On the other hand, the breaking point of the median is 50%: if you
change a single observation, the median will change, but its value
will be bounded by the rest of the cloud of points -- to make the
median arbitrarily large, you would have to move (say) the top half of
the data points. The trimmed mean has a breaking point somewhere in
between.

+ TODO: A few robust estimators

Location (mean): trimmed mean, median
Dispersion (standard deviation): IQR, MAD, Sn, L-moments, MCD
rlm, cov.rob
library(rrcov)
library(robustbase)   (includes roblm)

+ Three means of performing statistical tests

1. We can look for a parameter of the distribution describing the
population. If possible, we shall take an unbiased estimator, with the
lowest variance possible. (However, we shall see later that certain
biased estimators may be more interesting, because their mean square
error is lower -- this is the case for ridge regression.)

2. Instead of a single value of this parameter, we can look for an
interval containing it: we are then building a confidence interval.

3. We can also want to know whether this parameter has (or not) a
predefined value: we then perform a test.

+ Criticism of statistical tests

A p-value close to zero can mean two very different things: either the
null hypothesis is very wrong, or it is just slightly wrong but our
sample is large enough to spot it. In the latter case, the difference
between reality and the hypothesis is statistically significant but,
for all practical purposes, negligible.

TODO: this problem with p-values is a starting point for the
development of bayesian methods.
http://www.stat.washington.edu/www/research/online/1994/bic.ps

+ Decision Theory

Among all the possible tests, we want one for which the risks of type
I and type II errors are as low as possible. Among the tests that
cannot be improved (i.e., we cannot modify them to get one with the
same type I error risk and a lower type II error risk, or conversely),
there is no means of choosing THE best one. Indeed, we can plot those
tests in the (type I error risk) x (type II error risk) plane: we get
a curve. However, decision theory allows everyone to choose, among
those tests, the one that best suits one's own taste for risk. One
crude way of proceeding is to choose an upper bound on the type I
error risk (alpha) and minimize the type II error risk.

For more details, read Simon French's book, "Decision Theory: an
introduction to the mathematics of rationality", Ellis Horwood series
in mathematics and its applications, Halsted Press, 1988.

* The Zoo of statistical tests: Parametric Tests

+ Statistical tests under R

Most R functions that perform (classical) tests are in the "stats"
package (it is already loaded).

library(help="stats")

The list of functions is huge: look for those containing the "test"
string.

> apropos(".test")
 [1] "ansari.test"          "bartlett.test"        "binom.test"
 [4] "Box.test"             "chisq.test"           "cor.test"
 [7] "fisher.test"          "fligner.test"         "friedman.test"
[10] "kruskal.test"         "ks.test"              "mantelhaen.test"
[13] "mcnemar.test"         "mood.test"            "oneway.test"
[16] "pairwise.prop.test"   "pairwise.t.test"      "pairwise.wilcox.test"
[19] "power.anova.test"     "power.prop.test"      "power.t.test"
[22] "PP.test"              "prop.test"            "prop.trend.test"
[25] "quade.test"           "shapiro.test"         "t.test"
[28] "var.test"             "wilcox.test"

+ Reading a test result

One would expect those functions to yield a result such as "Null
hypothesis rejected" or "Null hypothesis not rejected" -- bad luck.
The user has to know how to interpret the results, with a critical
eye.
The result is mainly a number, the p-value: the probability, if the
null hypothesis is true, of getting a result at least as extreme as
the one observed. If it is close to one, we do not reject the
hypothesis, i.e., the test did not find anything statistically
significant; if it is close to zero, we can reject the null
hypothesis.

More precisely, before performing the test, we choose a confidence
level alpha (often 0.05; for human health, you will be more
conservative and choose 0.01 or even less; if you want results from
little data and do not mind that those results are less reliable, you
can take 0.10): if p < alpha, we reject the null hypothesis; if
p > alpha, we do not reject it. Of course, you have to choose the
confidence level alpha BEFORE performing the test (otherwise, you
would choose it so that it produces the results you want)...

For instance,

> x <- rnorm(200)
> t.test(x)
        One Sample t-test
data:  x
t = 3.1091, df = 199, p-value = 0.002152
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.07855896 0.35102887
sample estimates:
mean of x
0.2147939

If we reject the null hypothesis (here: "the mean is zero"), we will
be wrong with a probability 0.002152, i.e., in 2 cases out of 1,000.

We can check that with simulations: if the null hypothesis H0 is true,
in 95% of the cases, we have p > 0.05. More precisely, if H0 is true,
the p-value is uniformly distributed in [0,1].

%G
p <- c()
for (i in 1:1000) {
  x <- rnorm(200)
  p <- append(p, t.test(x)$p.value)
}
hist(p, col='light blue')
%--

%G
p <- sort(p)
p[950]
p[50]
x <- 1:1000
plot(p ~ x, main="p-value of a Student T test when H0 is true")
%--

That was the type I error risk (wrongly rejecting the null
hypothesis). Let us now focus on the type II error risk. This time,
the population mean is non-zero. The risk will depend on the actual
value of this mean -- but we do not know this value. If the mean is
close to zero, the risk is high (we need a lot of data to spot small
differences); if it is farther, the risk is lower.

%G
# Sample size
n <- 10
# Number of simulations
# (sufficiently large to get a good approximation
# of the probability)
m <- 1000
# Number of points to draw the curve
k <- 50
# Maximum value for the actual mean
M <- 2
r <- vector()
for (j in M*(0:k)/k) {
  res <- 0
  for (i in 1:m) {
    x <- rnorm(n, mean=j)
    if( t.test(x)$p.value > .05 ){
      res <- res + 1
    }
  }
  r <- append(r, res/m)
}
rr <- M*(0:k)/k
plot(r~rr, type="l",
     xlab='difference in mean',
     ylab="type II error probability")

# Comparison with the curve produced by "power.t.test"
x <- seq(0,2,length=200)
y <- NULL
for (m in x) {
  y <- append(y, 1-power.t.test(delta=m, sd=1, n=10,
                                sig.level=.05,
                                type='one.sample')$power)
}
lines(x,y,col='red')

# Theoretical curve
# (This is a Z test, not too different)
r2 <- function (p,q,conf,x) {
  p(q(1-conf/2)-x) - p(q(conf/2)-x)
}
f <- function(x) {
  p <- function (t) { pnorm(t, sd=1/sqrt(10)) }
  q <- function (t) { qnorm(t, sd=1/sqrt(10)) }
  r2(p,q,.05,x)
}
curve( f(x), add=T, col="blue" )

# Theoretical curve (T test)
f <- function(x) {
  p <- function (t) { pt(sqrt(10)*t, 10) }   # Is this correct?
  q <- function (t) { qt(t, 10)/sqrt(10) }
  r2(p,q,.05,x)
}
curve( f(x), add=T, col="green" )

legend(par('usr')[2], par('usr')[4], xjust=1,
       c('simulation', 'power.t.test',
         '"exact" value, Z test', '"exact" value, T test'),
       col=c(par('fg'),'red','blue','green'),
       lwd=1, lty=1)
title(main="Type II error risk in a Student T test")
%--

We saw earlier that, if the null hypothesis is true, the p-values are
uniformly distributed in [0,1]. Here is their distribution if H0 is
false, for different values of the population mean.
%G N <- 10000 x <- 100*(1:N)/N plot( x~I(x/100), type='n', ylab="cumulated percents", xlab="p-value" ) do.it <- function (m, col) { p <- c() for (i in 1:N) { x <- m+rnorm(200) p <- append(p, t.test(x)$p.value) } p <- sort(p) x <- 100*(1:N)/N lines(x ~ p, type='l', col=col) } do.it(0, par('fg')) do.it(.05, 'red') do.it(.1, 'green') do.it(.15, 'blue') do.it(.2, 'orange') abline(v=.05) title(main='p-value distribution') legend(par('usr')[2],par('usr')[3],xjust=1,yjust=1, c('m=0', 'm=0.05', 'm=.01', 'm=.015', 'm=.02'), col=c(par('fg'), 'red', 'green', 'blue', 'orange'), lty=1,lwd=1) %-- The manual gives this striking example, that explains why R does not give a clear result "Null hypothesis rejected" or "Null hypothesis not rejected". If we look at the data (with the eyes, on a plot -- always plot the data) we can state with little risk of error that the means are very different. However, the p-value is very high and suggests not to reject the null hypothesis that the means are equal. One problem is the size of the confidence interval: it is extremely large. Thus you should carefully look at the data. (On this example, there was one more problem: to perform a Student T test, the data have to be gaussian, to have the same variance, to be independant -- this is far from being the case). TODO: somewhere in this document, state how to correctly perform a test. 0. Choose a confidence level, a test 1. Plot the data 2. Check the assumptions. Change the test if needed. 3. Perform the test. Check the confidence interval. TODO: where is my rant against T-values? %G 600 200 x <- 1:10 y <- c(7:20, 200) boxplot(x,y, horizontal=T) %-- %G 600 200 boxplot(x,y, log="x", horizontal=T) %-- > t.test(x, y) Welch Two Sample t-test data: x and y t = -1.6329, df = 14.165, p-value = 0.1245 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -47.242900 6.376233 sample estimates: mean of x mean of y 5.50000 25.93333 Of course, if we remove the outlier, 200, the result conforms with our intuition. > t.test(1:10,y=c(7:20)) Welch Two Sample t-test data: 1:10 and c(7:20) t = -5.4349, df = 21.982, p-value = 1.855e-05 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -11.052802 -4.947198 sample estimates: mean of x mean of y 5.5 13.5 Similarly, if we use a non parametric test -- what we should have done from the very beginning. > wilcox.test(x,y) Wilcoxon rank sum test with continuity correction data: x and y W = 8, p-value = 0.0002229 alternative hypothesis: true mu is not equal to 0 Warning message: Cannot compute exact p-value with ties in: wilcox.test.default(x, y) + The zoo of parametric tests + WARNING Most of these tests are only valid for gaussian variables. Furthermore, if there are several samples, they often ask them to be independant and have the same variance. + Student's T test Here, we want to find the mean of a random variable or, more precisely, to compare it with a predefined value. ?t.test Assumptions: The variable is gaussian (if it is not, you can use Wilcoxon's U test). We already gave an example above. Here is the theory. The null hypothesis is H0: "the mean is m", the alternative hypothesis is H1: "the mean is not m". 
We compute sample mean - m T = --------------------------------------- sample standard deviation / sqrt(n) (beware: there are two formulas for the standard deviation: the population standard deviation, with "n" in the denominator, and the sample standard deviation, with "n-1" -- R only uses the latter) (here, n is the sample size) and we reject H0 if abs( T ) > t_{n-1} ^{-1} ( 1 - alpha/2 ) where T_{n-1} is the student T distribution with n-1 degrees of freedom. (With gaussian independant identically distributed random variables, T indeed follows this distribution -- you can actually define the Student T distribution that way.) Here are a few Student T probability distribution functions. The larger n, the closer it is from a gaussian distribution. %G curve(dnorm(x), from=-5, to=5, add=F, col="orange", lwd=3, lty=2) curve(dt(x,100), from=-5, to=5, add=T, col=par('fg')) curve(dt(x,5), from=-5, to=5, add=T, col="red") curve(dt(x,2), from=-5, to=5, add=T, col="green") curve(dt(x,1), from=-5, to=5, add=T, col="blue") legend(par('usr')[2], par('usr')[4], xjust=1, c('gaussian', 'df=100', 'df=5', 'df=2', 'df=1'), col=c('orange', par('fg'), 'red', 'green', 'blue'), lwd=c(3,1,1,1,1), lty=c(2,1,1,1,1)) title(main="Student's T probability distribution function") %-- Let us now see how to use this, by hand. We start with a sample from a gaussian population, whose mean we want to estimate. > x <- rnorm(200) > m <- mean(x) > m [1] 0.05875323 We now try to find an interval, centered on this sample mean, in which we ahve a 95% probability of finding the actual (population) mean (it is zero). x <- rnorm(100) n <- length(x) m <- mean(x) m alpha <- .05 m + sd(x)/sqrt(n)*qt(alpha/2, df=n-1, lower.tail=T) m + sd(x)/sqrt(n)*qt(alpha/2, df=n-1, lower.tail=F) We get [m - 0.19, m + 0.19]. The "t.test" function provides a similar result. > t.test(x, mu=0, conf.level=0.95) One Sample t-test data: x t = 0.214, df = 99, p-value = 0.831 alternative hypothesis: true mean is not equal to 0 95 percent confidence interval: -0.1987368 0.2467923 sample estimates: mean of x 0.02402775 You can also experimentally check this with, with a simulation. > m = c() > for (i in 1:10000) { + m <- append(m, mean(rnorm(100)) ) + } > m <- sort(m) > m[250] [1] -0.1982188 > m[9750] [1] 0.1999646 The interval is [-0.20, +0.20]: we get approximately the same result as the theoretical computations (this will always be the case with Monte-Carlo methods: the approximation is coarse, the convergence is slow, we need a lot of iterations to get a reliable result). We can also plot the result of those simulations. We see that the actual mean is sometimes outside the confidence interval. %G N <- 50 n <- 5 v <- matrix(c(0,0),nrow=2) for (i in 1:N) { x <- rnorm(n) v <- cbind(v, t.test(x)$conf.int) } v <- v[,2:(N+1)] plot(apply(v,2,mean), ylim=c(min(v),max(v)), ylab='Confidence interval', xlab='') abline(0,0) c <- apply(v,2,min)>0 | apply(v,2,max)<0 segments(1:N,v[1,],1:N,v[2,], col=c(par('fg'),'red')[c+1], lwd=3) title(main="The population mean need not be in the confidence interval") %-- + Student's T test: robustness Let us perform a simulation to see what happens with non normal data -- a situation in which one should be using the Wilcoxon U test. Let us first check with a distribution less dispersed that the gaussian one: the uniform distribution. > N <- 1000 > n <- 3 > v <- vector() > for (i in 1:N) { x <- runif(n, min=-1, max=1) v <- append(v, t.test(x)$p.value) } > sum(v>.05)/N [1] 0.932 Now with gaussian data. 
> N <- 1000 > n <- 3 > v <- vector() > for (i in 1:N) { x <- rnorm(n, sd=1/sqrt(3)) v <- append(v, t.test(x)$p.value) } > sum(v>.05)/N [1] 0.952 Thus, the p-value is wrong when the variable is no longer gaussian: when we think we are rejecting the null hypothesis with a 5% risk of type I error, this risk is actually (for a uniform distribution, which is very far from pathological) 7%. Let us seewhat happens to the confidence interval: we should be in it in 95% of the cases. For a uniform distribution: > N <- 1000 > n <- 3 > v <- vector() > for (i in 1:N) { x <- runif(n, min=-1, max=1) r <- t.test(x)$conf.int v <- append(v, r[1]<0 & r[2]>0) } > sum(v)/N [1] 0.919 For a normal distribution: > N <- 1000 > n <- 100 > v <- vector() > for (i in 1:N) { x <- rnorm(n, sd=1/sqrt(3)) r <- t.test(x)$conf.int v <- append(v, r[1]<0 & r[2]>0) } > sum(v)/N [1] 0.947 For the uniform distribution, we are in the confidence interval in 92% of the cases instead of the expected 95%. Actually, when the distribution is not gaussian, the results can be "wrong" in two ways: either the distribution is less dispersed than the gaussian, and the p-value is higher than stated by the test and the confidence interval provided is too small; or the distribution is more dispersed than the gaussian, the p-value will always be high, the confidence interval will be huge, and the test will no longer test anything (it loses its power, it will always state: there is nothing noticeable -- you can say it is blinded by the outliers). TODO: I will say this later... However, if the sample is much larger, the error becomes negligible. > N <- 1000 > n <- 100 > v <- vector() > for (i in 1:N) { x <- runif(n, min=-1, max=1) v <- append(v, t.test(x)$p.value) } > sum(v>.05)/N [1] 0.947 > N <- 1000 > n <- 100 > v <- vector() > for (i in 1:N) { x <- runif(n, min=-1, max=1) r <- t.test(x)$conf.int v <- append(v, r[1]<0 & r[2]>0) } > sum(v)/N [1] 0.945 Let us now check the power of the test. %G N <- 1000 n <- 10 delta <- .8 v <- vector() w <- vector() for (i in 1:N) { x <- delta+runif(n, min=-1, max=1) v <- append(v, t.test(x)$p.value) w <- append(w, wilcox.test(x)$p.value) } plot(sort(v), type='l', lwd=3, lty=2, ylab="p-valeur") lines(sort(w), col='red') legend(par('usr')[1], par('usr')[4], xjust=0, c('Student', 'Wilcoxon'), lwd=c(2,1), lty=c(2,1), col=c(par("fg"), 'red')) %-- %G N <- 1000 n <- 100 delta <- .1 v <- vector() w <- vector() for (i in 1:N) { x <- delta+runif(n, min=-1, max=1) v <- append(v, t.test(x)$p.value) w <- append(w, wilcox.test(x)$p.value) } plot(sort(v), type='l', lwd=3, lty=2) lines(sort(w), col='red') legend(par('usr')[1], par('usr')[4], xjust=0, c('Student', 'Wilcoxon'), lwd=c(2,1), lty=c(2,1), col=c(par("fg"), 'red')) %-- %G N <- 1000 n <- 100 delta <- .8 v <- vector() w <- vector() for (i in 1:N) { x <- delta+runif(n, min=-1, max=1) v <- append(v, t.test(x)$p.value) w <- append(w, wilcox.test(x)$p.value) } plot(sort(v), type='l', lwd=3, lty=2) lines(sort(w), col='red') legend(par('usr')[1], par('usr')[4], xjust=0, c('Student', 'Wilcoxon'), lwd=c(2,1), lty=c(2,1), col=c(par("fg"), 'red')) %-- From this, Student's T test seems robust: if the data are less dispersed than with a gaussian distribution, the p-value and the confidence interval are underestimated, but not too much. This effect disappeard if the sample is large enough. But wait! Let us now see waht happens with a distribution more dispersed than the gaussian -- e.g., the Cauchy distribution. 
> N <- 1000 > n <- 3 > v <- vector() > for (i in 1:N) { + x <- rcauchy(n) + v <- append(v, t.test(x)$p.value) + } > sum(v>.05)/N [1] 0.974 > N <- 1000 > n <- 3 > v <- vector() > for (i in 1:N) { + x <- rcauchy(n) + r <- t.test(x)$conf.int + v <- append(v, r[1]<0 & r[2]>0) + } > sum(v)/N [1] 0.988 The given confidence interval is too large: we are not in it in 95% of the cases, but much more. Things do not improve with an even larger sample. > N <- 1000 > n <- 100 > v <- vector() > for (i in 1:N) { + x <- rcauchy(n) + v <- append(v, t.test(x)$p.value) + } > sum(v>.05)/N [1] 0.982 > N <- 1000 > n <- 100 > v <- vector() > for (i in 1:N) { + x <- rcauchy(n) + r <- t.test(x)$conf.int + v <- append(v, r[1]<0 & r[2]>0) + } > sum(v)/N [1] 0.986 Let us now check the test power: this is disastrous... %G N <- 1000 n <- 10 delta <- 1 v <- vector() w <- vector() for (i in 1:N) { x <- delta+rcauchy(n) v <- append(v, t.test(x)$p.value) w <- append(w, wilcox.test(x)$p.value) } plot(sort(v), type='l', lwd=3, lty=2) lines(sort(w), col='red') %-- %G N <- 1000 n <- 100 delta <- 1 v <- vector() w <- vector() for (i in 1:N) { x <- delta+rcauchy(n) v <- append(v, t.test(x)$p.value) w <- append(w, wilcox.test(x)$p.value) } plot(sort(v), type='l', lwd=3, lty=2) lines(sort(w), col='red') %-- From this, we conclude that if the data are more dispersed than with a gaussian distribution, the power of the test decreases and nothing remains. Increasing th sample size does not improve things. In these situations, one should use non-parametric tests (or a parametric test adapted to the distribution at hand, or transform the data so that it looks more normal -- be beware, using the same data to choose the distribution or the trans formation and to perform the test will yield biased p-values). > N <- 1000 > n <- 100 > v <- vector() > w <- vector() > for (i in 1:N) { + x <- rcauchy(n) + v <- append(v, t.test(x)$p.value) + w <- append(w, wilcox.test(x)$p.value) + } > sum(v>.05)/N [1] 0.976 > sum(w>.05)/N [1] 0.948 > N <- 1000 > n <- 100 > v <- vector() > w <- vector() > for (i in 1:N) { + x <- 1+rcauchy(n) + v <- append(v, t.test(x)$p.value) + w <- append(w, wilcox.test(x)$p.value) + } > sum(v<.05)/N [1] 0.22 > sum(w<.05)/N [1] 0.992 + Z test It is similar to the Student T test, but this time, we know the exact value of the variance: in this cas, the distribution of the test statistic is nolonger a Student T distribution but a gaussian distribution (often denoted Z). In the real world, when we do not know the mean, we do not know the variance either -- thus, this test has little practical utility. For large samples, Student's T distribution is well approximated by a gaussian distribution, so we can perform a Z test instead of a T test (that would be relevant if we we computing everything ourselves, but as the computer is there...) + Student T test: comparing the mean of two samples. We now consider two samples and we would like to know if they come from the same population. More simply, we would like to know if they have the same mean. Here, we assume that the data are normal (do a qqplot or a Shapiro test), have the same variance (use an F test). If those conditions are not satisfied, you can use a Wilcoxon test. We also assume that the variables are independant (this is not always the case: for instance, you might want to compare the length of the left and right humerus on a bunch of human skeletons; to get rid of the dependance problem, you can consider the length difference for each individual and compare it with zero. 
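Here is a minimal sketch of the paired comparison just described, on
simulated data (the sample size and the simulated "left"/"right"
lengths are arbitrary choices for the illustration): testing the
per-individual differences against zero is exactly what a paired
Student T test does.

# Simulated paired measurements: the right-hand values are strongly
# correlated with the left-hand ones
n <- 30
left  <- rnorm(n, mean=30, sd=2)
right <- left + rnorm(n, mean=0.1, sd=0.5)
# One-sample T test on the per-individual differences
t.test(right - left)
# Equivalent paired T test
t.test(right, left, paired=TRUE)

Both calls yield exactly the same statistic, degrees of freedom and
p-value.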
For two samples of size n (it also works for samples of different
sizes, but the formula is more complicated), one can show that the
statistic

           difference between the means
  t = --------------------------------------
       sqrt( (sum of the variances) / n )

follows a Student distribution with 2n-2 degrees of freedom. We shall
reject the null hypothesis "the means are equal" if

  abs(t) > abs( t(alpha, 2n-2) )

We could perform the test by hand, as above, but there is already a
function to do so.

?t.test

> x <- rnorm(100)
> y <- rnorm(100)
> t.test(x,y)
        Welch Two Sample t-test
data:  x and y
t = -1.3393, df = 197.725, p-value = 0.182
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.46324980  0.08851724
sample estimates:
  mean of x   mean of y
-0.03608115  0.15128514

> t.test(x, y, alternative="greater")
        Welch Two Sample t-test
data:  x and y
t = -1.3393, df = 197.725, p-value = 0.909
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -0.4185611        Inf
sample estimates:
  mean of x   mean of y
-0.03608115  0.15128514

Here is the example from the manual: the study of the efficiency of a
soporific drug.

> ?sleep
> data(sleep)
> plot(extra ~ group, data = sleep)
> t.test(extra ~ group, data = sleep)
        Welch Two Sample t-test
data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.0794
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.3654832  0.2054832
sample estimates:
mean in group 1 mean in group 2
           0.75            2.33

Here, as the p-value is greater than 5%, we would conclude that the
drug's effects are not noticeable in this sample. However, if you work
for the drug's manufacturer, you would prefer a different conclusion.
You can get it if you change the alternative hypothesis, which becomes
H1: "the drug increases the time of sleep". (If you are honest, you
choose the null and alternative hypotheses before performing the
experiment, and this choice must be backed by prior data or knowledge:
remarking that in this very experiment the sample mean is greater is
not sufficient.)

> t.test(extra ~ group, data = sleep, alternative="less")
        Welch Two Sample t-test
data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.0397
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
       -Inf -0.1066185
sample estimates:
mean in group 1 mean in group 2
           0.75            2.33

TODO: There should be a section "How to lie with statistics"

1. Reusing the same data several times to increase the statistical
   significance of an effect. (Examples: choose H1 after looking at
   the data, use the data to choose the transformation to apply, etc.)

2. Using a statistical test whose assumptions are not satisfied (it
   can either yield more significant results or less significant
   results: e.g., if the test assumes that ******************** (deleted?)

3. Performing the same experiment several times and only publishing
   the results that go in the direction you want. (Adverts by tobacco
   companies claiming that passive smoking is harmless are done that
   way.)

4. Misleading plots
   - Adding a lot of unnecessary details
   - 3D effects: the most striking is the 3-dimensional pie chart,
     with the part of interest at the forefront, so that it is
     deformed and looks larger.
   - You can also plot a quantity as a sphere (or any 3-dimensional
     object) at different scales: for a quantity twice as large, you
     can multiply all the dimensions by 2, thereby getting a volume
     eight times as large.
If you have more than two samples, you can perform an ANalysis Of VAriance (Anova). If you have two non-gaussain samples, you can perform a Wilcoxon U test. If you have more that two non-gaussian samples, you can turn to non-parametric analysis of variance with the Kruskal--Wallis test. + Robustness of Student's 2-sample T test To compare means with a Student T test, we assume that: the samples are gaussian, independant and have the same variance. Let us check what happens with gaussian non-equivariant samples. TODO: check that there is no confusion between the Student T test and Welch's test (var.equal=T for the first, var.equal=F (the default) for the second). %G N <- 1000 n <- 10 v <- 100 a <- NULL b <- NULL for (i in 1:N) { x <- rnorm(n) y <- rnorm(n, 0, v) a <- append(a, t.test(x,y)$p.value) b <- append(b, t.test(x/var(x), y/var(y))$p.value) } plot(sort(a), type='l', col="green") points(sort(b), type="l", col="red") abline(0,1/N) %-- And for the power: %G N <- 1000 n <- 10 v <- 100 a <- NULL b <- NULL c <- NULL d <- NULL for (i in 1:N) { x <- rnorm(n) y <- rnorm(n, 100, v) a <- append(a, t.test(x,y)$p.value) b <- append(b, t.test(x/var(x), y/var(y))$p.value) c <- append(c, t.test(x, y/10000)$p.value) d <- append(d, wilcox.test(x, y)$p.value) } plot(sort(a), type='l', col="green") points(sort(b), type="l", col="red") points(sort(c), type="l", col="blue") points(sort(d), type="l", col="orange") abline(0,1/N) legend(par('usr')[1], par('usr')[4], c('Student Test', 'Renormalized Student Test', 'Student Test renormalized with the sample variances', "(Non-Parametric) Wilcoxson's U Test"), col=c('green', 'blue', 'red', 'orange'), lwd=1,lty=1) title(main="Student's T test on non-equivariant samples") %-- With non-equivariant samples, the p-value of the Student test remains correct, but the power dramatically decreases. If the data look gaussian but have different variances, you had better normalize them andto perform a Student T test than perform a non-parametric test. TODO: What about the Welch test??? + Several means of comparing means There are several ways of comparing two means. %G 600 300 data(sleep) boxplot(extra ~ group, data=sleep, horizontal=T, xlab='extra', ylab='group') %-- With a Student T test, if the data are gaussian (or a Welch test, if they are gaussian but have different variances). > t.test(extra ~ group, data=sleep) Welch Two Sample t-test data: extra by group t = -1.8608, df = 17.776, p-value = 0.0794 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.3654832 0.2054832 sample estimates: mean in group 1 mean in group 2 0.75 2.33 > t.test(extra ~ group, data=sleep, var.equal=T) Two Sample t-test data: extra by group t = -1.8608, df = 18, p-value = 0.07919 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.3638740 0.2038740 sample estimates: mean in group 1 mean in group 2 0.75 2.33 With a Wilcoxon test, if we do not know that the data are gaussian, or if we suspect they are not (from a Shapiro--Wilk test, for instance). > wilcox.test(extra ~ group, data=sleep) Wilcoxon rank sum test with continuity correction data: extra by group W = 25.5, p-value = 0.06933 alternative hypothesis: true mu is not equal to 0 Warning message: Cannot compute exact p-value with ties in: wilcox.test.default(x = c(0.7, -1.6, With an analysis of variance (we will present this later -- with only two samples, it yields exactly the same result as the Student T test). 
> anova(lm(extra ~ group, data=sleep))
Analysis of Variance Table
Response: extra
          Df Sum Sq Mean Sq F value  Pr(>F)
group      1 12.482  12.482  3.4626 0.07919 .
Residuals 18 64.886   3.605
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

With a non-parametric analysis of variance, i.e., a Kruskal--Wallis
test.

> kruskal.test(extra ~ group, data=sleep)
        Kruskal-Wallis rank sum test
data:  extra by group
Kruskal-Wallis chi-squared = 3.4378, df = 1, p-value = 0.06372

Here are the results:

  Method                  p-value
  -------------------------------
  Welch test              0.0794
  Student T test          0.07919
  Wilcoxon                0.06933
  Analysis of Variance    0.07919
  Kruskal--Wallis Test    0.06372

+ The Chi^2 and variance computations

We are looking for the variance of a sample (whose mean is unknown).
Here is the theory. The null hypothesis is H0: "the (population)
variance is v", the alternative hypothesis is H1: "the variance is not
v". We compute the statistic

  Chi2 = (n-1) * (sample variance) / v

and we reject the null hypothesis H0 if

  Chi2 > Chi2_{n-1} ^{-1} ( 1 - alpha/2 )
or
  Chi2 < Chi2_{n-1} ^{-1} ( alpha/2 )

where Chi2_{n-1} is the Chi^2 distribution with n-1 degrees of
freedom.

Here is the Chi^2 probability distribution function, with various
degrees of freedom.

%G
curve(dchisq(x,2), from=0, to=5, add=F, col="red",
      ylab="dchisq(x,i)")
n <- 10
col <- rainbow(n)
for (i in 1:n) {
  curve(dchisq(x,i), from=0, to=5, add=T, col=col[i])
}
legend(par('usr')[2], par('usr')[4], xjust=1,
       paste('df =',1:n),
       lwd=1, lty=1, col=col)
title(main="Chi^2 Probability Distribution Function")
%--

We can get a confidence interval for the standard deviation, by hand,
as follows.

alpha <- .05
x <- rnorm(200)
n <- length(x)
v <- var(x)
sd(x)
sqrt( (n-1)*v / qchisq(alpha/2, df=n-1, lower.tail=F) )
sqrt( (n-1)*v / qchisq(alpha/2, df=n-1, lower.tail=T) )

We get [0.91, 1.11]. We can check this with a simulation:

v <- c(0)
for (i in 1:10000) {
  v <- append(v, var(rnorm(200)) )
}
v <- sort(v)
sqrt(v[250])
sqrt(v[9750])

We get [0.90, 1.10]. With smaller samples, the estimation is less
reliable: the confidence interval is larger, [0.68, 1.32].

v <- c(0)
for (i in 1:10000) {
  v <- append(v, var(rnorm(20)) )
}
v <- sort(v)
sqrt(v[250])
sqrt(v[9750])

I have not found an R function that performs those computations (the
"var.test" function works with two samples), so we can write our own.

%G
chisq.var.test <- function (x, var=1, conf.level=.95,
                            alternative='two.sided') {
  result <- list()
  alpha <- 1-conf.level
  n <- length(x)
  v <- var(x)
  result$var <- v
  result$sd <- sd(x)
  chi2 <- (n-1)*v/var
  result$chi2 <- chi2
  p <- pchisq(chi2,n-1)
  if( alternative == 'less' ) {
    stop("Not implemented yet")
  } else if (alternative == 'greater') {
    stop("Not implemented yet")
  } else if (alternative == 'two.sided') {
    if(p>.5) p <- 1-p
    p <- 2*p
    result$p.value <- p
    result$conf.int.var <- c(
      (n-1)*v / qchisq(alpha/2, df=n-1, lower.tail=F),
      (n-1)*v / qchisq(alpha/2, df=n-1, lower.tail=T)
    )
  }
  result$conf.int.sd <- sqrt( result$conf.int.var )
  result
}
x <- rnorm(100)
chisq.var.test(x)

# We can check that the results are correct by looking at
# the distribution of the p-values: it should be uniform
# in [0,1].
v <- NULL
for (i in 1:1000) {
  v <- append(v, chisq.var.test(rnorm(100))$p.value)
}
plot(sort(v))
%--

We can also compare the results with those of the "var.test" function,
which works with two samples.
Either graphically,

%G
p1 <- NULL
p2 <- NULL
for (i in 1:100) {
  x <- rnorm(10)
  p1 <- append(p1, chisq.var.test(x)$p.value)
  p2 <- append(p2, var.test(x, rnorm(10000))$p.value)
}
plot( p1 ~ p2 )
abline(0,1,col='red')
%--

or with a computation (we shall see later what it means: this is a
test on a regression, telling us that p1=p2, with a p-value equal to
0.325).

> summary(lm(p1-p2~0+p2))
Call:
lm(formula = p1 - p2 ~ 0 + p2)
Residuals:
      Min        1Q    Median        3Q       Max
-0.043113 -0.007930  0.001312  0.009386  0.048491
Coefficients:
    Estimate Std. Error t value Pr(>|t|)
p2 -0.002609   0.002638  -0.989    0.325
Residual standard error: 0.01552 on 99 degrees of freedom
Multiple R-Squared: 0.009787,   Adjusted R-squared: -0.0002151
F-statistic: 0.9785 on 1 and 99 DF,  p-value: 0.325

+ Fisher distribution (F test) and comparison of the variances of two
  samples

Here, we want to know whether two samples come from populations with
the same variance (we are not interested in the mean). We proceed as
for the comparison of means but, instead of considering the difference
of the means, we consider the quotient of the variances.

One will use such a test before a Student T test (to compare the means
of two samples), to check that the equal-variance assumption is valid.
Here is the example with which we had illustrated the Student T test:
indeed, the variances do not seem too different.

?var.test

> var.test( sleep[,1] [sleep[,2]==1], sleep[,1] [sleep[,2]==2] )
        F test to compare two variances
data:  sleep[, 1][sleep[, 2] == 1] and sleep[, 1][sleep[, 2] == 2]
F = 0.7983, num df = 9, denom df = 9, p-value = 0.7427
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.198297 3.214123
sample estimates:
ratio of variances
         0.7983426

Here is the theory behind this test. The null hypothesis is H0: "the
two populations have the same variance", the alternative hypothesis is
H1: "the two populations do not have the same variance". We compute
the statistic

       variance of the first sample
  F = -------------------------------
       variance of the second sample

(here, I assume that the two samples have the same size, otherwise the
formula becomes more complicated) and we reject H0 if

  F < F _{n1-1, n2-1} ^{-1} ( alpha/2 )
or
  F > F _{n1-1, n2-1} ^{-1} ( 1 - alpha/2 )

where F is Fisher's distribution and n1 and n2 are the sample sizes.
Practically, we compute the quotient of variances with the largest
variance in the numerator and we reject the null hypothesis that the
variances are equal if

  F > F _{n1-1, n2-1} ^{-1} ( 1 - alpha/2 ).

Here is an example, where the computations are performed by hand (when
the two samples have the same size, this upper critical value is the
reciprocal of the lower one).

> x <- rnorm(100, 0, 1)
> y <- rnorm(100, 0, 2)
> f <- var(y)/var(x)     # Largest variance in the numerator
> f
[1] 5.232247
> alpha <- 0.05
> qf(alpha/2, 99, 99)
[1] 0.6728417
> f > 1 / qf(alpha/2, 99, 99)
[1] TRUE

If we have more than two samples, we can use Bartlett's test. If we
have two non-gaussian samples, we can use Ansari's or Mood's
non-parametric test. If there are more than two non-gaussian samples,
we can use Fligner's test.

?bartlett.test
?ansari.test
?mood.test
?fligner.test

* The Zoo of Statistical Tests: discrete variables and the Chi^2 test

+ Binomial test

In a sample of 100 butterflies, we found 45 males and 55 females. Can
we conclude that, in the whole population, one sex is more frequent
than the other? The number of female butterflies in a sample of 100
animals follows a binomial distribution B(100,p) and we want to test
the null hypothesis H0: "p=0.5" against the alternative hypothesis H1:
"p different from 0.5".
> binom.test(55, 100, .5) Exact binomial test data: 55 and 100 number of successes = 55, number of trials = 100, p-value = 0.3682 alternative hypothesis: true probability of success is not equal to 0.5 95 percent confidence interval: 0.4472802 0.6496798 sample estimates: probability of success 0.55 In this example, the difference is not statistically significant. + Mock binomial test (not important) If we were doing the computations by hand, we would not use the binomial test, but an approximation, with the "prop.test". But as the computer performscarries out the computations for us, we need not use it. %G p <- .3 col.values <- c(par('fg'),'red', 'blue', 'green', 'orange') n.values <- c(5,10,20,50,100) plot(0, type='n', xlim=c(0,1), ylim=c(0,1), xlab='exact', ylab='approximate') for (i in 1:length(n.values)) { n <- n.values[i] x <- NULL y <- NULL for (a in 0:n) { x <- append(x, binom.test(a,n,p)$p.value) y <- append(y, prop.test(a,n,p)$p.value) } o <- order(x) lines(x[o],y[o], col=col.values[i]) } legend(par('usr')[1],par('usr')[4], as.character(n.values), col=col.values, lwd=1,lty=1) title(main="Comparing the binomial test and its approximation") %-- We can also compare the distribution of p-values of these two tests. %G p <- .3 n <- 5 N <- 1000 e <- rbinom(N, n, p) x <- y <- NULL for (a in e) { x <- append(x, binom.test(a,n,p)$p.value) y <- append(y, prop.test(a,n,p)$p.value) } x <- sort(x) y <- sort(y) plot(x, type='l', lwd=3, ylab='p-value') lines(y, col='red') legend(par('usr')[2], par('usr')[3], xjust=1, yjust=0, c('exact', 'approximation'), lwd=c(3,1), lty=1, col=c(par("fg"),'red')) title(main="Binomial test (H0 true)") %-- %G p1 <- .3 p2 <- .5 n <- 5 N <- 1000 e <- rbinom(N, n, p1) x <- y <- NULL for (a in e) { x <- append(x, binom.test(a,n,p2)$p.value) y <- append(y, prop.test(a,n,p2)$p.value) } x <- sort(x) y <- sort(y) plot(x, type='l', lwd=3, ylab='p-value') lines(y, col='red') legend(par('usr')[2], par('usr')[3], xjust=1, yjust=0, c('exact', 'approximation'), lwd=c(3,1), lty=1, col=c(par("fg"),'red')) title(main="Binomial test (H0 false)") %-- We remark that if H0 is true, the p-value is over-estimated (the test is too conservative, it is less powerful, it does not see anything significant while there is), but the wronger H0, the better the approximation. + Another binomial test TODO: alternative to the binomial test: glm(y~x, family=binomial) # Logistic regression (well, for p=0.5) Question: and for p != 0.5? You can do the same for multilogistic regression + Chi^2 test (very important) The binomial test is fine, but it does not generalize: it allows you to study a binary variable, nothing more. But sometimes we need a comparable test to study qualitative variables with more than 2 values or to study several qualitative variables. There is no such "exact multinomial test" (you can devise one, but you would have to implement it...): instead, one uses the approximate Chi2 test. The Chi2 test is a non parametric , non-rigorous (it is an approximation) test to compare distributions of qualitative variables. In spite of that, it is the most important discrete test. One can show that if (X1, X2, ..., Xr) is a multinomial random variable, then ( X_1 - n p_1 )^2 ( X_r - n p_r )^2 Chi^2 = ------------------- + ... + -------------------- n p_1 n p_r asymptotically follows a Chi^2 distribution with r-1 degrees of freedom. 
This tis is just an asymptotic result, that is sufficiently true if n >= 100 (the sample is large enough) n p_i >= 10 (the theoretical frequencies (counts) are not too small) In particular, we get another mock binomial test. %G p <- .3 col.values <- c(par('fg'),'red', 'blue', 'green', 'orange') n.values <- c(5,10,20,50,100) plot(0, type='n', xlim=c(0,1), ylim=c(0,1), xlab='exact', ylab='approximate') for (i in 1:length(n.values)) { n <- n.values[i] x <- NULL y <- NULL z <- NULL for (a in 0:n) { x <- append(x, binom.test(a,n,p)$p.value) y <- append(y, chisq.test(c(a,n-a),p=c(p,1-p))$p.value) z <- append(z, prop.test(a,n,p)$p.value) } o <- order(x) lines(x[o],y[o], col=col.values[i]) lines(x[o],z[o], col=col.values[i], lty=3) } legend(par('usr')[1],par('usr')[4], as.character(c(n.values, "prop.test", "chisq.test")), col=c(col.values, par('fg'), par('fg')), lwd=1, lty=c(rep(1,length(n.values)), 1,3) ) title(main="The binomial test and its approximations") %-- Or a mock multinomial test: let us check, with a simulation, that the p-values are close. %G # A Monte Carlo multinomial test multinom.test <- function (x, p, N=1000) { n <- sum(x) m <- length(x) chi2 <- sum( (x-n*p)^2/(n*p) ) v <- NULL for (i in 1:N) { x <- table(factor(sample(1:m, n, replace=T, prob=p), levels=1:m)) v <- append(v, sum( (x-n*p)^2/(n*p) )) } sum(v>=chi2)/N } multinom.test( c(25,40,25,25), p=c(.25,.25,.25,.25) ) # 0.13 chisq.test( c(25,40,25,25), p=c(.25,.25,.25,.25) ) # 0.12 N <- 100 m <- 4 n <- 10 p <- c(.25,.25,.1,.4) x <- NULL y <- NULL for (i in 1:N) { a <- table( factor(sample(1:m, n, replace=T, prob=p), levels=1:m) ) x <- append(x, multinom.test(a,p)) y <- append(y, chisq.test(a,p=p)$p.value) } plot(y~x) abline(0,1,col='red') title("Monte Carlo Multinomial Test and Chi^2 Test") %-- Here is the distribution of the p-values of a Chi^2 test. %G # We sample 10 subjects in a 4-class population. # We repeat the experiment 100 times. N <- 1000 m <- 4 n <- 10 p <- c(.24,.26,.1,.4) p.valeur.chi2 <- rep(NA,N) for (i in 1:N) { echantillon <- table(factor(sample(1:m, replace=T, prob=p), levels=1:m)) p.valeur.chi2[i] <- chisq.test(echantillon,p=p)$p.value } plot( sort(p.valeur.chi2), type='l', lwd=3 ) abline(0, 1/N, lty=3, col='red', lwd=3) title(main="p-values in a Chi^2 test") %-- + Independance Chi^2 Let us consider the following situation: we measure two qualitative variables each with two values on a sample. In the whole population, the four classes occur with proportions 10%, 20%, 60%, 10%. A B total C 10 20 30 D 60 10 70 total 70 30 100 We can get a sample as follows. TODO: this code is ugly... %G foo <- function (N) { population1 <- c(rep('A',10), rep('B',20), rep('A',60), rep('B',10)) population1 <- factor(population1, levels=c('A','B')) population2 <- c(rep('C',10), rep('C',20), rep('D',60), rep('D',10)) population2 <- factor(population2, levels=c('C','D')) o <- sample(1:100, N, replace=T) table( population2[o], population1[o] ) } a <- foo(1000) op <- par(mfcol=c(1,2)) plot( a, shade=T ) plot( t(a), shade=T ) par(op) %-- We would like to know wether these variables are independant or not. To do so, we take a sample (as always, we do not know the population direclty, we only know it through samples -- so we are not supposed to know the proportions I mentionned above); we compute the marginal proportions (i.e., the "total" row and column); the product of this row and column, so as to get the proportions one would observe if the variables were independant; we can then compare those proportions with the observed ones. 
> n <- 100 > a <- foo(n) > a/n A B C 0.09 0.15 D 0.65 0.11 > a1 <- apply(a/n,2,sum) # The "total" row > a1 A B 0.74 0.26 > a2 <- apply(a/n,1,sum) # The "total" column > a2 C D 0.24 0.76 > b <- a1 %*% t(a2) > b C D [1,] 0.1776 0.5624 [2,] 0.0624 0.1976 > chisq.test(as.vector(a),p=as.vector(b)) Chi-squared test for given probabilities data: as.vector(a) X-squared = 591.7683, df = 3, p-value = < 2.2e-16 TODO I think the syntax is different chisq.test(rbind(as.vector(a), as.vector(b))) Is the result the same? TODO Remark One can also use the Chi^2 for a homogeneity test, i.e., to check if two samples come from the same population. It IS an independance Chi^2 (independance between the variable and the sample number): the syntax is the same. chisq.test(rbind(as.vector(a), as.vector(b))) + Fisher test: independance of two qualitative variables Here, we want to check if two variables, given by a contingency table, are independant. It sounds like the Chi^2 test, but this time, it is an exact test, not an approximation. Let us take the example we had examined above with the Chi^2 test. > a A B C 9 15 D 65 11 > fisher.test(a) Fisher's Exact Test for Count Data data: a p-value = 1.178e-05 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.03127954 0.32486526 sample estimates: odds ratio 0.1048422 With a smaller sample, it is less clear. > fisher.test( foo(10) ) Fisher's Exact Test for Count Data data: foo(10) p-value = 1 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.003660373 39.142615141 sample estimates: odds ratio 0.3779644 %G n1 <- 10 n2 <- 100 N <- 1000 x1 <- rep(NA,N) x2 <- rep(NA,N) for (i in 1:N) { x1[i] <- fisher.test(foo(n1))$p.value x2[i] <- fisher.test(foo(n2))$p.value } plot( sort(x1), type='l', lwd=3, ylab='p-valeur') lines( sort(x2), col='blue', lwd=3 ) abline(0,1/N,col='red',lwd=3,lty=3) abline(h=c(0,.05),lty=3) abline(v=c(0,N*.05),lty=3) title(main="p-value of a Fisher test, H0 false") legend(par('usr')[1],par('usr')[4], c("n=10", "n=100"), col=c(par('fg'), 'blue'), lwd=3, lty=1) %-- %G foo <- function (N) { population1 <- c(rep('A',2), rep('B',8), rep('A',18), rep('B',72)) population1 <- factor(population1, levels=c('A','B')) population2 <- c(rep('C',2), rep('C',8), rep('D',18), rep('D',72)) population2 <- factor(population2, levels=c('C','D')) o <- sample(1:100, N, replace=T) table( population2[o], population1[o] ) } n1 <- 10 n2 <- 100 N <- 1000 x1 <- rep(NA,N) x2 <- rep(NA,N) for (i in 1:N) { x1[i] <- fisher.test(foo(n1))$p.value x2[i] <- fisher.test(foo(n2))$p.value } plot( sort(x1), type='l', lwd=3, ylab='p-valeur', ylim=c(0,1)) lines( sort(x2), col='blue', lwd=3 ) abline(0,1/N,col='red',lwd=3,lty=3) abline(h=c(0,.05),lty=3) abline(v=c(0,N*.05),lty=3) title(main="p-valueof a Fisher test, H0 true") legend(par('usr')[2], .2, xjust=1, yjust=0, c("n=10", "n=100"), col=c(par('fg'), 'blue'), lwd=3, lty=1) %-- * The Zoo of Statistical Tests: non-parametric tests + Sign test It is a nom-parametric test on the median of a random variable -- with no assumption on it. The idea is simple: we count the number of values that are above the proposed median -- we know that of ot is the actual median, this number follows a binomial distribution, because each value has exactly one chance out of two to be above it. I have not found a function to perform this test, so I wrote my own. 
* The Zoo of Statistical Tests: non-parametric tests

+ Sign test

It is a non-parametric test on the median of a random variable -- with
no assumption on its distribution. The idea is simple: we count the
number of values that are above the proposed median -- we know that, if
it is the actual median, this number follows a binomial distribution,
because each value has exactly one chance out of two of being above it.

I have not found a function to perform this test, so I wrote my own.

  sign.test <- function (x, mu=0) {
    # Does not handle NAs, nor values equal to mu
    n <- length(x)
    y <- sum(x > mu)          # number of values above the proposed median
    binom.test(y, n, p=0.5)   # under H0, y ~ Binomial(n, 1/2)
  }

A quick simulation (code omitted) compares, for various sample sizes n,
the empirical probability of wrongly rejecting H0 at the 5% level with
its theoretical value:

  > res
        Empirical Value Theoretical Value   n
   [1,]            0.06              0.05  96
   [2,]            0.04              0.05 138
   [3,]            0.05              0.05   6
   [4,]            0.06              0.05 135
   [5,]            0.03              0.05 138
   [6,]            0.04              0.05  83
   [7,]            0.03              0.05 150
   [8,]            0.05              0.05  91
   [9,]            0.05              0.05 144
  [10,]            0.04              0.05 177

+ Wilcoxon's U test: comparing two "means"

TODO: read and correct (if needed) this part.

It is a non-parametric test: we do not know (or assume) anything about
the distribution of the variables; in particular, we think it is not
gaussian (to check this, look at a quantile-quantile plot or perform a
Shapiro--Wilk test) -- otherwise, we would use Student's T test, which
is more powerful. However, there IS an assumption: the distribution is
symmetric (if it is not, consider the sign test). For this reason, we
can speak of the mean or of the median interchangeably -- for the whole
distribution they coincide (but for a sample, which may be asymmetric,
they differ).

TODO: the following description is not that of the Wilcoxon test I know,
which assumes the variable is symmetric, considers the variables
Xij = (Xi + Xj)/2 and counts the number of those variables that are
above the proposed mean.

The recipe is as follows. Take the two samples, concatenate them and
sort them. Then, look whether the two samples are "well shuffled" or
whether the elements of one sample are rather at the beginning while
those of the other are rather at the end. The null hypothesis is
H0: "P(X1_i > X2_j) = 0.5".

We first sort each sample (separately) and compute

  U1 = number of pairs (i,j) such that X1_i > X2_j
       + (1/2) * number of pairs (i,j) such that X1_i = X2_j
  U2 = number of pairs (i,j) such that X2_j > X1_i
       + (1/2) * number of pairs (i,j) such that X1_i = X2_j
  U  = min(U1, U2)

Here is another way of computing this: concatenate the two samples, rank
them, and set

  R1 = sum of the ranks of the first sample
  R2 = sum of the ranks of the second sample
  U2 = n1*n2 + n1(n1+1)/2 - R1
  U1 = n1*n2 + n2(n2+1)/2 - R2
  U  = min(U1, U2)

Here is an example from the manual page (here, we imagine that prior
information suggests that x > y, so we choose an asymmetric alternative
hypothesis).

  help.search("wilcoxon")
  ?wilcox.test

> x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30)
> y <- c(0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29)
> wilcox.test(x, y, paired = TRUE, alternative = "greater",
+             conf.level=.95, conf.int=T)

        Wilcoxon signed rank test

data:  x and y
V = 40, p-value = 0.01953
alternative hypothesis: true mu is greater than 0
95 percent confidence interval:
 0.175   Inf
sample estimates:
(pseudo)median
          0.46

Let us check, on an example, that the test remains valid for
non-gaussian distributions.

  N <- 1000
  n <- 4
  v <- vector()
  w <- vector()
  for (i in 1:N) {
    x <- runif(n, min=-10, max=-9) + runif(n, min=9, max=10)
    v <- append(v, wilcox.test(x)$p.value)
    w <- append(w, t.test(x)$p.value)
  }
  sum(v>.05)/N
  sum(w>.05)/N

We get 1 and 0.93: we make fewer type I errors with Wilcoxon's test, but
its power is lower, i.e., we miss many opportunities of rejecting the
null hypothesis:

  # Probability of rejecting H0 when H0 is false (power)
  N <- 1000
  n <- 5
  v <- vector()
  w <- vector()
  for (i in 1:N) {
    x <- runif(n, min=0, max=1)
    v <- append(v, wilcox.test(x)$p.value)
    w <- append(w, t.test(x)$p.value)
  }
  sum(v<.05)/N
  sum(w<.05)/N

We get 0 (the power of the test is zero: it never rejects the null
hypothesis -- if the sample is very small and we know nothing about the
distribution, we cannot say much) against 0.84.

Let us consider a more ungaussian distribution.
  # Probability of rejecting H0 (here, H0 is actually true: the
  # distribution is centred at zero, so these are type I error rates)
  N <- 1000
  n <- 5
  v <- vector()
  w <- vector()
  for (i in 1:N) {
    x <- runif(n, min=-10, max=-9) + runif(n, min=9, max=10)
    v <- append(v, wilcox.test(x)$p.value)
    w <- append(w, t.test(x)$p.value)
  }
  sum(v<.05)/N
  sum(w<.05)/N

We get 0 and 0.05. For n=10, we would get 0.05 in both cases.

Let us now check the confidence interval.

  N <- 1000
  n <- 3
  v <- vector()
  w <- vector()
  for (i in 1:N) {
    x <- runif(n, min=-10, max=-9) + runif(n, min=9, max=10)
    r <- wilcox.test(x, conf.int=T)$conf.int
    v <- append(v, r[1]<0 & r[2]>0)
    r <- t.test(x)$conf.int
    w <- append(w, r[1]<0 & r[2]>0)
  }
  sum(v)/N
  sum(w)/N

We get 0.75 and 0.93.

TODO: I do not understand. As the U test is non-parametric, it should
give larger confidence intervals and make fewer mistakes. Yet the
confidence interval of the Wilcoxon test covers the true value much less
often than that of Student's T test. ???
However, these simulations show that Student's T test is robust.

TODO: read what I just wrote and check the power.

The following example tests whether the variable is symmetric around its
mean.

> x <- rnorm(100)^2
> x <- x - mean(x)
> wilcox.test(x)

        Wilcoxon signed rank test with continuity correction

data:  x
V = 1723, p-value = 0.005855
alternative hypothesis: true mu is not equal to 0

Idem for the median.

> x <- x - median(x)
> wilcox.test(x)

        Wilcoxon signed rank test with continuity correction

data:  x
V = 3360.5, p-value = 0.004092
alternative hypothesis: true mu is not equal to 0

If there are more than two samples, you can use the Kruskal--Wallis
test.

  ?kruskal.test

+ Kolmogorov--Smirnov test (comparing two distributions)

We want to check whether two quantitative random variables follow the
same distribution.

> ks.test( rnorm(100), 1+rnorm(100) )

        Two-sample Kolmogorov-Smirnov test

data:  rnorm(100) and 1 + rnorm(100)
D = 0.43, p-value = 1.866e-08
alternative hypothesis: two.sided

> ks.test( rnorm(100), rnorm(100) )

        Two-sample Kolmogorov-Smirnov test

data:  rnorm(100) and rnorm(100)
D = 0.11, p-value = 0.5806
alternative hypothesis: two.sided

> ks.test( rnorm(100), 2*rnorm(100) )

        Two-sample Kolmogorov-Smirnov test

data:  rnorm(100) and 2 * rnorm(100)
D = 0.19, p-value = 0.0541
alternative hypothesis: two.sided

The idea: we consider the empirical cumulative distribution functions of
the two samples; the test statistic D is the maximum (vertical) distance
between them.

+ Shapiro--Wilk test

This test checks whether a random variable is gaussian. It might seem to
be a special case of the Kolmogorov--Smirnov test but, actually, we
compare a random variable with the whole family of gaussian
distributions, without specifying the mean and variance, while the K-S
test requires a completely specified distribution.

> shapiro.test(rnorm(10))$p.value
[1] 0.09510165
> shapiro.test(rnorm(100))$p.value
[1] 0.8575329
> shapiro.test(rnorm(1000))$p.value
[1] 0.1853919
> shapiro.test(runif(10))$p.value
[1] 0.5911485
> shapiro.test(runif(100))$p.value
[1] 0.0002096377
> shapiro.test(runif(1000))$p.value
[1] 2.385633e-17

It is a good idea to look at the quantile-quantile plot to see what
happens, because the data might be non-gaussian in a benign way: either,
as here, because the data are less dispersed than gaussian data, or
because the deviation from a gaussian distribution is statistically
significant but practically negligible -- quite common if the sample is
very large.

+ Other non-parametric tests

There is a wealth of other tests; we shall not detail them and merely
refer to the manual.
  ?kruskal.test
  ?ansari.test
  ?mood.test
  ?fligner.test
  library(help=ctest)
  help.search('test')

* Estimators

TODO: State the structure of this part -- and reorder it...
  1. Generalities about estimators
  2. Least Squares estimators. Example: NLS.
  3. Maximum Likelihood Estimators (also: REML, Penalized Likelihood)
  4. GMM Estimators

Quite often, the data we are studying (the "statistical series") is not
the whole population but just a sample of it -- e.g., when you study
birds on an island, you will not measure all of them, you will simply
catch a few dozen specimens and work on them. You can easily compute the
statistical parameters of this sample (mean, standard deviation, etc.),
but they are only approximations of those parameters for the whole
population: how can we measure the precision of those approximations?

+ Examples

Here are a few concrete examples of this situation.

An industrialist must choose a variety of maize. It will be used as
(part of) an animal feed and we want the variety that contains the most
protein: we want to know the average protein content of each variety --
not that of the sample at hand, but that of the variety as a whole. We
might find that the sample of one variety has a higher protein content
than the sample of another: but were the samples sufficiently large, was
the difference sufficiently significant to extrapolate the result to the
whole population?

An industrialist must choose a variety of maize. It will be sold for
human consumption and, to ease its conditioning and packaging, we want
all the ears to have the same size, i.e., we want the standard deviation
to be as small as possible. If we find that the standard deviation of
the sample of one variety is smaller than that of another variety, can
we extrapolate the result to the varieties as a whole? Were the samples
sufficiently large? Was the difference between the standard deviations
significant?

+ Modelling a statistical experiment

In simple terms: we draw, at random, subjects from a population and we
measure, on the resulting sample, the statistical variables of interest
(size, protein content, etc.).

In algorithmic terms (a small implementation of this recipe is sketched
at the end of this section):

  Repeat a large number of times:
    Take a sample from the population
    Estimate, on this sample, the quantity of interest
      (this quantity is called a "statistic")
  Compare those estimations (you can plot them as a histogram)
  with the value on the whole population.

In more mathematical terms: let X1, X2, ..., Xn be independent,
identically distributed random variables (iidrv). From the values those
variables take at a single point of the universe, we try to get some
information about their common distribution. For instance, we expect
(X1+...+Xn)/n to be an estimation of the mean of this distribution: is
it really the case? In what sense is it an "estimation"? What is the law
of this sample mean? How can we measure the quality of this estimation?

+ Some vocabulary

+ Estimator

Let us consider a statistical experiment, i.e., iidrv X1, X2, ..., Xn;
we assume we know the family their distribution belongs to, but not its
parameters (for instance, we know it is a gaussian distribution, but we
do not know the mean, or the variance -- or even both). An estimator is
a function of X1, X2, ..., Xn that gives an estimation of such a
parameter. (If we see the random variables X1, X2, ..., Xn as maps from
the universe Omega to the real line R, an estimator is the random
variable obtained by composing (X1,...,Xn) with a map R^n ---> R.)

For instance, (X1+...+Xn)/n is an estimator of the mean.
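As announced above, here is a small sketch (my own illustration) of the
sampling-experiment recipe, applied to this estimator of the mean:

%G
# Sampling distribution of the sample mean (X1+...+Xn)/n, by simulation.
# The "population" is gaussian with mean 2 and standard deviation 3.
n <- 20        # sample size
N <- 10000     # number of repetitions of the experiment
estimations <- rep(NA, N)
for (i in 1:N) {
  s <- rnorm(n, mean=2, sd=3)   # take a sample from the population
  estimations[i] <- mean(s)     # estimate the quantity of interest
}
hist(estimations, breaks=50, probability=T, col='light blue',
     main="Sampling distribution of the sample mean")
abline(v=2, col='red', lwd=3)   # the population mean
%--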
(Remark: we said that it was an estimator of the mean, but nowhere in
the definition does the mean appear. Indeed, it is also an estimator of
the variance, of the standard deviation, or of any quantity we could
think of -- but we will see shortly that it is a "good" estimator of the
mean and a bad estimator of all those other quantities.)

Quite often, estimators come in families; for instance, the formula
(X1+...+Xn)/n is a family of estimators of the mean: we have one for
each n.

+ Unbiased estimator

An unbiased estimator is an estimator whose expectation is indeed the
value of the parameter. For instance, the sample mean (X1+...+Xn)/n is
an unbiased estimator of the population mean. On the contrary, the
variance of the sample is a biased estimator of the population variance
(but you can easily transform it into an unbiased estimator: just
replace the "n" in the denominator by "n-1").

We can check this with a small simulation, as follows. Let us draw 5
numbers at random, following a gaussian distribution of mean 0 and
variance 1. If we repeat this experiment 10000 times, we get (an
estimation of) the expectation of the empirical mean: it is roughly
equal to the population mean (here, 0).

  x <- vector()
  for (i in 1:10000) {
    x <- append(x, mean(rnorm(5)))
  }
  mean(x)
  [1] 9.98948e-05

If we proceed likewise with the variance computed with an "n"
denominator (the "population variance" formula applied to the sample),
we see that its expectation is noticeably different from the population
variance (we get 0.8 instead of 1): this estimator is biased.

  n <- 5
  x <- vector()
  for (i in 1:10000) {
    t <- rnorm(n)
    t <- t - mean(t)
    t <- t*t
    x <- append(x, sum(t)/n)
  }
  mean(x)
  [1] 0.806307

To get an unbiased estimator of the variance, it suffices to replace the
"n" in the definition of the variance by "n-1". These two notions of
variance are called "population variance" and "sample variance".

                          1
  Population variance = --- * Sum ( X_i - mean(X_j) )^2
                          n    i

                            1
  Sample variance     = ----- * Sum ( X_i - mean(X_j) )^2
                          n-1    i

Let us check, with another simulation, that this is indeed unbiased.

  n <- 5
  y <- vector()
  for (i in 1:10000) {
    t <- rnorm(n)
    t <- t - mean(t)
    t <- t*t
    y <- append(y, sum(t)/(n-1))
  }
  mean(y)
  [1] 1.000129

But this is not the only unbiased estimator of the variance: if we use
the first formula with the actual mean (that of the population, not that
of the sample), the estimator is also unbiased (but this is unrealistic:
there is no reason why we should know the population mean).

  n <- 5
  z <- vector()
  for (i in 1:10000) {
    t <- rnorm(n)
    t <- t - 0
    t <- t*t
    z <- append(z, sum(t)/n)
  }
  mean(z)
  [1] 1.001210

We can graphically compare those three estimators of the variance.

%G
n <- 5
x <- y <- z <- vector()
for (i in 1:10000) {
  t <- rnorm(n)
  z <- append(z, sum(t*t)/n)
  t <- t - mean(t)
  t <- t*t
  x <- append(x, sum(t)/n)
  y <- append(y, sum(t)/(n-1))
}
boxplot(x, y, z)
%--

%G
plot(density(x))
points(density(y), type="l", col="red")
points(density(z), type="l", col="blue")
%--

%G
op <- par( mfrow = c(3,1) )
hist(x, xlim=c(0,5), breaks=20)
hist(y, xlim=c(0,5), breaks=20)
hist(z, xlim=c(0,5), breaks=20)
par(op)
%--

Their square roots (three estimators of the standard deviation) are more
symmetric.
%G
x <- sqrt(x)
y <- sqrt(y)
z <- sqrt(z)
boxplot(x, y, z)
%--

%G
plot(density(x))
points(density(y), type="l", col="red")
points(density(z), type="l", col="blue")
%--

%G
op <- par( mfrow = c(3,1) )
hist(x, xlim=c(0,5), breaks=20)
hist(y, xlim=c(0,5), breaks=20)
hist(z, xlim=c(0,5), breaks=20)
par(op)
%--

Question: are these estimators of the standard deviation biased? Check
your answer with a few simulations. Do you understand why? (Hint: they
are all biased. The expectation is an integral and, in general, the
integral of a square root is not the square root of the integral; more
precisely, since the square root is concave, Jensen's inequality tells
us that E[sqrt(V)] <= sqrt(E[V]), so the square root of an unbiased
estimator of the variance underestimates the standard deviation.)

There are many other notions, besides bias, that measure the quality of
an estimator or of a sequence of estimators. For instance, an estimator
is "consistent" if, when the sample size grows, it converges (in
probability) to the desired value. Thus, the weak law of large numbers
states that the sample mean is a consistent estimator of the population
mean.

+ Maximum Likelihood Estimators (MLE)

But where do we find the estimators in the first place? For simple
quantities, such as the mean or the variance, we have a formula for the
whole population and we hope it will work for a sample. For the mean, it
works; for the variance, it yields a biased estimator, but we can tweak
it into an unbiased one. But if we want to find a parameter of a more
complicated law, for which we have no "population formula" (imagine, for
instance, that you want to find the parameter(s) of an exponential,
Poisson, Gamma, Beta, Weibull, Binomial, etc. distribution), how do we
proceed?

The Maximum Likelihood method is a recipe to find such a candidate
estimator. We do not know whether it will have "good" properties -- in
particular, it will often be biased -- but it is a good start. The idea
is to find the value of the parameter for which the probability of
observing the results actually observed in the sample is maximal -- this
quantity is called the "likelihood": we maximize it.

For instance, let us use this method to find an estimator of the mean of
a gaussian random variable of variance 1 from a 5-element sample
(actually, in this case, we get the sample mean -- the method is useful
in more complex situations).

%G
# The population mean
m <- runif(1, min=-1, max=1)

# The n-element sample
n <- 5
v <- rnorm(n, mean=m)

# Likelihood
N <- 1000
l <- seq(-2, 2, length=N)
y <- vector()
for (i in l) {
  y <- append(y, prod(dnorm(v, mean=i)))
}
plot(y ~ l, type='l')

# Population mean
points(m, prod(dnorm(v, mean=m)), lwd=3)

# Sample mean
points(mean(v), prod(dnorm(v, mean=mean(v))), col='red', lwd=3)
%--

The "optimize" function finds such a maximum, in dimension 1 (here, we
simply maximize the standard gaussian density itself, whose maximum is
at 0).

> optimize(dnorm, c(-1,2), maximum=T)
$maximum
[1] 1.307243e-05

$objective
[1] 0.3989423

In higher dimensions, you can use the "optim" function (it looks for a
minimum):

  f <- function (arg) {
    x <- arg[1]
    y <- arg[2]
    (x-1)^2 + (y-3.14)^2
  }
  optim(c(0,0), f)

We get:

  $par
  [1] 0.9990887 3.1408265

  $value
  [1] 1.513471e-06

  $counts
  function gradient
        59       NA

  $convergence
  [1] 0

  $message
  NULL

We can ask for a specific optimization method, for instance, simulated
annealing -- actually, in this example, we get an imprecise result.

  optim(c(0,0), f, method="SANN")

One can also use the "nlm" function; a minimal example follows.
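(This short sketch is my own addition, on the same toy function f as
above; "nlm" performs a Newton-type minimization and returns the
minimizer in its $estimate component.)

  # Minimize f with nlm(), starting from the origin
  f <- function (arg) {
    x <- arg[1]
    y <- arg[2]
    (x-1)^2 + (y-3.14)^2
  }
  nlm(f, c(0,0))   # $estimate should be close to c(1, 3.14)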
Let us consider the eruption lengths of the Faithful geyser and let us
try to approximate them as a mixture of two gaussian variables.

%G
f <- function (x, p, m1, s1, m2, s2) {
  p*dnorm(x, mean=m1, sd=s1) + (1-p)*dnorm(x, mean=m2, sd=s2)
}
data(faithful)
fn <- function(arg) {
  prod(f(faithful$eruptions, arg[1], arg[2], arg[3], arg[4], arg[5]))
}
start <- c(.5,
           min(faithful$eruptions), var(faithful$eruptions),
           max(faithful$eruptions), var(faithful$eruptions))
p <- optim(start, function(a){ -fn(a)/fn(start) },
           control=list(trace=1))$par
hist(faithful$eruptions, breaks=20, probability=T, col='light blue')
lines(density(faithful$eruptions, bw=.15), col='blue', lwd=3)
curve(f(x, p[1], p[2], p[3], p[4], p[5]), add=T, col='red', lwd=3)
#curve(dnorm(x, mean=p[2], sd=p[3]), add=T, col='red', lwd=3, lty=2)
#curve(dnorm(x, mean=p[4], sd=p[5]), add=T, col='red', lwd=3, lty=2)
legend(par('usr')[2], par('usr')[4], xjust=1,
       c('sample density', 'theoretical density'),
       lwd=3, lty=1,
       col=c('blue', 'red'))
%--

We have just seen, in the previous examples, that the likelihood is
often defined as a product, with as many factors as observations. As
such large products are numerically unwieldy, we usually prefer to work
with their logarithm.

TODO: the "fitdistr" function, in the MASS package, computes univariate
maximum likelihood estimators.

TODO: The "mle" function in the "stats4" package is a wrapper around
"optim" that provides functions to compute the information matrix, the
covariance matrix, confidence intervals, etc.

+ log-likelihood

Here, by "log-likelihood" we shall often mean not the logarithm of the
likelihood itself but -2 log(likelihood): it is the quantity we try to
minimize in the maximum likelihood method. You might wonder where the
"-2" coefficient comes from: for gaussian variables, -2 log(likelihood)
is, up to an additive constant, the sum of squares of the residuals
divided by the variance.

+ Fisher information

If L is the likelihood and p the parameter to estimate, then the "score"
is defined as

        d log L
  U = -----------
          d p

and the Fisher information as

               (  d^2 log L  )
  I(p) = - E  ( ------------- ).
               (    d p^2     )

The Fisher information measures the sharpness of the peak of the
log-likelihood at its maximum.

+ Likelihood Ratio (LR) test

TODO: I have NOT said anything about "H0" and "H1"...

We remark that

                   L under H0
  LR = - 2 log  ---------------------------
                   L with the MLE parameters

                   L(b0)
     = - 2 log  -----------      (if b is the parameter to estimate
                   L(b)           and b0 its value under H0)

where L is the likelihood, approximately follows a Chi^2 distribution
with m degrees of freedom, where m is the number of parameters fixed by
the null hypothesis.

TODO: explain why, in the gaussian case, it IS a Chi^2 distribution.
TODO: explain how we estimate the variance of the estimator.
TODO: an example (take one gaussian distribution and model it as a
mixture of two gaussians; take a mixture of two gaussian distributions
and model it as a mixture of two gaussians; in both cases, test whether
the means of the two gaussian components are the same). A minimal sketch
for the simplest case -- testing the mean of a single gaussian sample --
is given at the end of this section.

The "Wald test" and the "score test" are approximations of the
Likelihood Ratio (LR) test. More precisely, if we denote by b the
parameter to estimate, the "score" (it is a vector) is

                              ( d log L(b) )
  U(b) = Nabla log L(b)  =   ( ------------ )
                              (   d b_i     ) i

and the "information" (it is a matrix) is

            (        d^2                )
  I(b)  =   (  - ------------- log L(b) )
            (    d b_i d b_j            ) i,j

The "Wald statistic" is an order-2 Taylor expansion of the Likelihood
Ratio (LR),

  W = (b - b0)' * I(b) * (b - b0),

and the "Score statistic" (it does not depend on b: it is very
imprecise, but very fast to compute) is

  S = U(b0)' * I(b0)^-1 * U(b0).
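Here is a minimal sketch of a likelihood ratio test in the simplest
setting (my own toy example, not the mixture example the TODO above asks
for): we test H0: mu = 0 for a gaussian sample, estimating the variance
by maximum likelihood under each hypothesis.

  # Likelihood ratio test of H0: mu = 0 for a gaussian sample
  set.seed(1)
  x <- rnorm(50, mean=0.3)
  loglik <- function (mu, sigma) sum(dnorm(x, mean=mu, sd=sigma, log=TRUE))
  # Log-likelihood at the MLE under H1 (mu free) and under H0 (mu = 0);
  # the MLE of the variance uses a 1/n denominator
  l1 <- loglik(mean(x), sqrt(mean((x - mean(x))^2)))
  l0 <- loglik(0,       sqrt(mean(x^2)))
  LR <- -2 * (l0 - l1)
  pchisq(LR, df=1, lower.tail=FALSE)   # approximate p-value
  t.test(x)$p.value                    # for comparison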
+ MLE: mode

We can see the likelihood as the probability density function of the
parameter, conditional on the sample (strictly speaking, this is only
true in a bayesian setting, with a flat prior and after normalization).

TODO: plot

The idea of the Maximum Likelihood method is to take the mode of this
function. But there are a few problems. First, we completely forget the
distribution of this estimator: we often look at its variance but, if it
is not normal, this is far from sufficient. (We can use this density to
measure the quality of the estimator or to do simulations (parametric
bootstrap).) Besides, the mode is not always a good choice. Why not take
the mean (this is called bagging -- more about this later) or the
median? What do we do if the mode is not unique? Or, more realistically,
if the likelihood has several local maxima? Or a high narrow peak (so
narrow that its probability is low) and a wider, smaller peak (so that
its probability is high)?

%G
op <- par(mfrow=c(2,2))
curve(dnorm(x-5) + dnorm(x+5), xlim=c(-10,10))
curve(dnorm(x-5) + .4*dnorm(x+5), xlim=c(-10,10))
curve(dnorm(5*(x-5)) + .5*dnorm(x+5), xlim=c(-10,10))
par(op)
%--

However, I have not managed to devise a situation in which we really
observe such a phenomenon (if you have such an example, tell me).
Actually, such a situation might be a sign that the model is wrong.

Ah, I finally got such an example. I simply took a statistic that does
not extrapolate well from a sample to a population: the maximum. (The
statistics for which we might want to find an estimator may, or may not,
have pleasant properties: the maximum is one of the worst statistics you
can imagine.)

%G
get.sample <- function (n=10, p=1/100) {
  ifelse( runif(n)>p, runif(n), 2 )
}
N <- 1000
d <- rep(NA, N)
for (i in 1:N) {
  d[i] <- max(get.sample())
}
hist(d, probability=T, ylim=c(0, max(density(d)$y)), col='light blue')
lines(density(d), type='l', col='red', lwd=3)
%--

%G
get.sample <- function (n=100, p=1/100) {
  ifelse( runif(n)>p, runif(n), 2 )
}
N <- 1000
d <- rep(NA, N)
for (i in 1:N) {
  d[i] <- max(get.sample())
}
#hist(d, breaks=seq(0,3,by=.02),
#     probability=T, ylim=c(0, max(density(d)$y)), col='light blue')
#lines(density(d), type='l', col='red', lwd=3)
plot(density(d), type='l', col='red', lwd=3)
%--

%G
get.sample <- function (n=100, p=1/100) {
  ifelse( runif(n)>p, runif(n), 2+2*runif(n) )
}
N <- 1000
d <- rep(NA, N)
for (i in 1:N) {
  d[i] <- max(get.sample())
}
#hist(d, breaks=seq(0,4,by=.05),
#     probability=T, ylim=c(0, max(density(d)$y)), col='light blue')
#lines(density(d), type='l', col='red', lwd=3)
plot(density(d), type='l', col='red', lwd=3)
%--

However, some methods replace the estimator (a single value) by a whole
probability distribution: on the computational side, all the methods
relying on the bootstrap (e.g., bagging); on the theoretical side, the
"bayesian methods".

+ Bayesian methods

We have just seen that it could be interesting to have, not an estimator
(a single number, possibly with its variance), but a whole probability
distribution. Bayesian methods push this remark a little further. We
assume that the parameters of the model were chosen at random according
to a given distribution -- the prior distribution. Bayesian methods then
try to find the posterior distribution, i.e., the distribution of those
parameters given the actually observed sample.

TODO: write this up.

  Z: data
  t: parameters
  The model gives us P(Z|t).
  We assume we know P(t) (the prior distribution).
  We then compute P(t|Z) (the posterior distribution) from P(Z|t) and
  P(t), with the Bayes formula -- hence the name:

    P(t|Z) = P(Z|t) P(t) / P(Z),
    where P(Z) = Integral of P(Z|t) P(t) dt.
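As a concrete toy illustration of this update (my own example, not part
of the original notes), here is the conjugate Beta prior / binomial
likelihood case, where the posterior is available in closed form:

%G
# Bayesian update of a binomial proportion with a Beta prior.
# With a Beta(a,b) prior and k successes out of n trials, the posterior
# is Beta(a+k, b+n-k) (conjugacy) -- no numerical integration needed.
a <- 1; b <- 1                  # flat prior
k <- 7; n <- 10                 # hypothetical data
curve(dbeta(x, a, b), 0, 1, lty=3, ylim=c(0, 3.5),
      xlab="p", ylab="density")
curve(dbeta(x, a+k, b+n-k), add=TRUE, lwd=3, col="blue")
legend("topleft", c("prior", "posterior"),
       lty=c(3,1), lwd=c(1,3), col=c(par('fg'), "blue"))
title(main="Prior and posterior distributions")
%--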
TODO: MCMC (Markov Chain Monte Carlo) to sample from the posterior
distribution

  parametric bootstrap = computational implementation of the maximum
                         likelihood method
  MLE              = mode
  bagged estimator = mean

TODO: EM algorithm (to find the MLE in a 2-component mixture model)
  1. Find initial estimates (m1, m2: random; s1, s2: sd of the whole
     sample).
  2. Compute the responsibilities:
       gamma(i)   = probability that observation i comes from the
                    second component
       1-gamma(i) = probability that observation i comes from the
                    first component
  3. New estimates = weighted means and variances (use gamma(i),
     resp. 1-gamma(i), as weights).
  4. Iterate.
TODO: you can generalize this algorithm to other models.
TODO: this is a special case of REML
TODO: put those more complicated examples in the "bootstrap"
chapter/section?

TODO: remark that bayesian methods can be applied iteratively.
  Initial data --> prior
  New data     --> posterior
  Update prior
  New data     --> posterior
  Update prior
  ...

TODO: The theory behind the Kalman Filter (State Space Models) is
bayesian.

+ Bayesian statistics and quantum mechanics

Bayesian statistics are very similar to quantum mechanics: in quantum
mechanics, reality is a superposition of many possible realizations,
each with its probability; similarly, the result of a bayesian analysis
is a whole probability distribution, giving all the possible answers,
each with its probability.

+ Bayesian Networks

TODO

This is very similar to neural networks, but we can interpret the
resulting model and add (prior) information to it.

+ Fuzzy logic

TODO
(Notes taken while reading Matlab's documentation)

(1) Fuzzy Inference Systems

The idea behind Fuzzy Inference Systems is to replace boolean values by
continuously varying values -- think "probabilities".

%G
curve(ifelse(x < .3, 0, ifelse(x > .7, 0, 1)),
      lwd = 3, col = "blue", xlim = c(0,1),
      main = "A classical membership function",
      xlab="", ylab="")
abline(h = 0, lty = 3)
%--

%G
curve( dnorm( x - .5, sd = .1 ) / dnorm(0, sd=.1),
       lwd = 3, col = "blue", xlim = c(0, 1),
       main = "A fuzzy membership function",
       xlab="", ylab="" )
abline(h=0, lty=3)
%--

Boolean operators and deduction rules can then be fuzzified:

  AND     becomes MIN,
  OR      becomes MAX,
  NOT(x)  becomes 1-x,
  THEN    becomes MIN.

Note that there might be incoherences in the rules -- it is not a
problem. You can tune the fuzzy inference system by changing the
membership functions (or even the boolean operators). As with bayesian
methods, the result is a probability distribution, not a single number.
So far, it looks like bayesian networks, with a very simple, fixed
network structure, and no learning capabilities.

(2) Adaptive Neuro-Fuzzy Inference Systems (ANFIS)

Actually, you do not have to provide the set of rules: you can simply
take your data, cluster it, and ask the computer to write a rule for
each cluster. This looks very similar to nearest-neighbour methods --
and, if you impose some parsimony, it can be seen as a machine-learning
analogue of Generalized Additive Models (GAM, "mgcv" package in R).

(3) Fuzzy C-means clustering

This is a generalization of the k-means clustering algorithm in which
cluster membership is fuzzy -- i.e., is a probability. (In R, it is
implemented in the "fanny" (cluster package) and "cmeans" (e1071
package) functions.)
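A minimal sketch (my own, assuming the recommended "cluster" package is
installed) of fuzzy clustering with "fanny", on the Faithful data used
earlier:

%G
library(cluster)
data(faithful)
r <- fanny(scale(faithful), k=2)
head(r$membership)     # degrees of membership ("probabilities") per cluster
plot(faithful, col=max.col(r$membership),
     main="Fuzzy (fanny) clustering of the Faithful data")
%--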
+ Influence curve

Let us consider an estimator U of a parameter u of a distribution F. The
problem is that we do not know F exactly (more realistically: we do know
F, but the population does not exactly follow this distribution; it is
merely an estimation of the distribution of the data).

To assess the robustness of the estimator, we can check how it changes
when we modify F at a single point. More precisely, we compute

                 U( (1-h) F + h 1_x ) - U(F)
  w(x) = lim    -----------------------------
         h->0                 h

where 1_x is the Dirac measure at x.

TODO: example with the mean
TODO: example with the variance
TODO: example with the correlation (in dimension 2)
      ?persp
TODO: empirical estimation of the influence function with the bootstrap.
      library(boot)
      ?empinf

TODO: TO SORT?

+ Method of Moments Estimation (MME)

This is another recipe to get an estimator. If there is a single
parameter, we choose it so that the first moment (i.e., the mean) of the
variable coincides with the sample mean. If there are k parameters, we
choose them so that the first k moments coincide with the sample moments
(I recall that the moments of a real random variable X are the
expectations of X^k). As with the Maximum Likelihood method, we do not
know anything about the quality of the resulting estimator -- it is
often biased.

Let us use this method to study the Faithful geyser eruption durations.
Let X0 be a Bernoulli random variable (i.e., a random variable with two
values, 0 and 1, or "heads" and "tails") with parameter p, and X1, X2
gaussian random variables with means m1, m2 and standard deviations s1,
s2. We try to write our data Y in the form

  Y = X0 X1 + (1-X0) X2.

Noting that X0^2 = X0, (1-X0)^2 = 1-X0 and X0 (1-X0) = 0, one can show
that

  Y^2 = X0 X1^2 + (1-X0) X2^2
  Y^3 = X0 X1^3 + (1-X0) X2^3
  Y^4 = X0 X1^4 + (1-X0) X2^4
  Y^5 = X0 X1^5 + (1-X0) X2^5.

As X0, X1, X2 are independent,

  E(Y)   = E(X0) E(X1)   + E(1-X0) E(X2)
  E(Y^2) = E(X0) E(X1^2) + E(1-X0) E(X2^2)
  E(Y^3) = E(X0) E(X1^3) + E(1-X0) E(X2^3)
  E(Y^4) = E(X0) E(X1^4) + E(1-X0) E(X2^4)
  E(Y^5) = E(X0) E(X1^5) + E(1-X0) E(X2^5)

But as

  E(X0)   = p
  E(X1)   = m
  E(X1^2) = m^2 + s^2
  E(X1^3) = 3 m s^2 + m^3
  E(X1^4) = 3 s^4 + 6 s^2 m^2 + m^4
  E(X1^5) = 15 s^4 m + 10 s^2 m^3 + m^5

(with m = m1, s = s1, and similarly for X2), we get

  E(Y)   = p m1 + (1-p) m2
  E(Y^2) = p (m1^2 + s1^2) + (1-p) (m2^2 + s2^2)
  E(Y^3) = p (3 m1 s1^2 + m1^3) + (1-p) (3 m2 s2^2 + m2^3)
  E(Y^4) = p (3 s1^4 + 6 s1^2 m1^2 + m1^4)
           + (1-p) (3 s2^4 + 6 s2^2 m2^2 + m2^4)
  E(Y^5) = p (15 s1^4 m1 + 10 s1^2 m1^3 + m1^5)
           + (1-p) (15 s2^4 m2 + 10 s2^2 m2^3 + m2^5)

We can then ask R to find, numerically, the values of those five
parameters. (Exercise left to the reader.)

TODO
There are functions to find the minimum of a function of several
variables (optimize, optim, nlm, nls), and functions to find the zeros
of a function of a single variable (uniroot, polyroot), but what about
the zeros of a function of several variables, i.e., how do we
numerically solve a non-linear system? (One can do that with the Newton
algorithm, as in dimension 1 -- the only difference being that the
derivative is a matrix; a small sketch follows.)
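Here is such a sketch (my own illustration; if the "rootSolve" package
is installed, its "multiroot" function does the same job): a naive
Newton iteration with a numerically estimated Jacobian, applied to the
two moment equations of a plain gaussian sample -- a case where, of
course, the solution is also known in closed form.

  newton <- function (f, x, eps=1e-10, h=1e-6, maxit=100) {
    for (it in 1:maxit) {
      fx <- f(x)
      if (max(abs(fx)) < eps) break
      J <- matrix(NA, nrow=length(fx), ncol=length(x))
      for (j in 1:length(x)) {
        e <- rep(0, length(x))
        e[j] <- h
        J[,j] <- (f(x + e) - fx) / h    # numerical Jacobian
      }
      x <- x - solve(J, fx)             # Newton step
    }
    x
  }
  # Moment equations of a gaussian sample:
  #   mean(x) = mu,  mean(x^2) = mu^2 + sigma^2
  set.seed(1)
  obs <- rnorm(100, mean=2, sd=3)
  g <- function (theta) c( theta[1]                - mean(obs),
                           theta[1]^2 + theta[2]^2 - mean(obs^2) )
  newton(g, c(0,1))    # should be close to c(2, 3)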
+ L-moments, TL-moments

Since the higher moments have a high variance, are very sensitive to
outliers -- or even, in some cases, are not defined for the distribution
at hand -- one can replace the moments by the L-moments or the trimmed
L-moments (TL-moments).

+ GMM (Generalized Method of Moments)

Least squares and maximum likelihood provide recipes to build new
estimators. The Generalized Method of Moments is another recipe, which
actually generalizes both least squares and maximum likelihood.

Consider a (real-valued) random variable X, known to be gaussian, with
unknown mean mu and standard deviation sigma. We want to find those
parameters, mu and sigma, from a sample x1, x2, ..., xn drawn from the
distribution of X. Remarking that

  E[ X ]   = mu
  E[ X^2 ] = mu^2 + sigma^2

we can simply try to solve

   1
  --- Sum( x_i )   = mu
   n

   1
  --- Sum( x_i^2 ) = mu^2 + sigma^2
   n

There is no reason it should be a "good" estimator, but it will be an
estimator -- then, it is up to us to study how good it is and to improve
it if needed.

This idea is more general: if the distribution of the random variable X
has a known form, with unknown parameters theta_1, theta_2, ...,
theta_n, we can compute the first moments of X as functions of those
parameters:

  E[ X ]   = m_1( theta_1, ..., theta_n )
  E[ X^2 ] = m_2( theta_1, ..., theta_n )
  E[ X^3 ] = m_3( theta_1, ..., theta_n )
  ...

Then, given the values of X on a sample, we can equate the sample means
of X, X^2, etc. to their theoretical values and solve the resulting
equations. This is called the Method of Moments Estimator (MME).

But this can be generalized further: instead of taking the moments E[X],
E[X^2], E[X^3], etc., we can consider the expectations of simpler
quantities, that have a more direct interpretation or that can be more
easily computed. Those quantities are called generalized moments.

The recipe for a GMM estimator is: find easy-to-compute and
easy-to-interpret quantities (e.g., X, X^2, X^3, etc.); compute their
expectations in the population, as functions of the unknown parameters;
compute their averages in the sample; equate expectations and averages
and solve: the law of large numbers tells you that the averages converge
to the expectations. Furthermore, the central limit theorem (often)
tells you that the resulting estimators are asymptotically gaussian.

For instance, if you have two random variables X and Y, linked by the
relation Y = a + b X + epsilon, where epsilon is a random variable with
zero mean and a and b are unknown parameters, then you can consider the
generalized moments

  E[  Y - ( a + b X )    ] = 0
  E[ (Y - ( a + b X )) X ] = 0

The corresponding GMM estimator is the least squares estimator: the
equations state that the derivatives of the sum of squares appearing in
the definition of the least squares estimator, with respect to a and b,
are zero.

The classical moments can be written as

  E[ X   - m1(theta) ] = 0
  E[ X^2 - m2(theta) ] = 0
  E[ X^3 - m3(theta) ] = 0
  ...

We shall write the generalized moments as

  E[ h1(X, theta) ] = 0
  E[ h2(X, theta) ] = 0
  E[ h3(X, theta) ] = 0
  ...

Quite often, you have more generalized moments than parameters: in this
case, you cannot hope that the moments will all be zero, and you will
instead try to minimize a "weighted sum" of squares of generalized
moments. More precisely, if h = (h1,...,hn) is the vector of generalized
moments, you will minimize

  h' W h

where W is a weight matrix.
If you are completely clueless about the properties of your moments, you
can set W to the identity matrix; otherwise, the (limit, when the sample
size becomes large, of the) inverse of the variance-covariance matrix of
the moment conditions (i.e., of the "residuals" h(x_i, theta)) is a good
choice:

  W = Var[ (1/N) Sum( h(x_i, theta) ) ] ^ -1

The actual algorithm is usually:

  W <- I
  Until convergence:
    Find theta that minimizes
      [ (1/N) Sum h(x_i, theta) ]'  W  [ (1/N) Sum h(x_i, theta) ]
    Set W = Var[ residuals ] ^ -1

TODO: Orthogonality conditions, Instrumental Variables (IV)

When you select your model, say Y = f(X, theta) + epsilon, you want its
residuals to be uncorrelated with all the predictive variables:

  epsilon = y - f(x, theta)  uncorrelated with x
                             (variables included in the model)
  epsilon = y - f(x, theta)  uncorrelated with z
                             (variables not included in the model)

The first condition will be satisfied, but the second need not be: we
have to impose it. This can be written as

  E[ (y - f(x, theta)) \tens z ] = 0

where \tens is the tensor product (also called the "Kronecker product").
TODO: recall what it is.
The variables in z (the columns of z) are called Instrumental Variables
(IV).

Tests, in the context of the Generalized Method of Moments, can look as
follows:

  H0: E[ h ]  = 0, i.e., the model is right
  H1: E[ h ] != 0, i.e., the model is wrong

You can use the following statistic:

  J = N [ (1/N) Sum h(x_i) ]'  W  [ (1/N) Sum h(x_i) ]

where N is the sample size.

You should remark that, with the GMM, we never really specify the model:
we just specify its moments. Thus, in this context, we cannot
distinguish between models with the same moments. GMM estimators are
used in econometrics, because one can choose moments with an economic
interpretation -- furthermore, as there is no real model behind the
moments, once the moments are meaningful, the "model" makes sense.

TODO: give more details

+ Cramer-Rao inequality

The Cramer-Rao inequality states that one cannot find arbitrarily good
estimators: more precisely, it gives a lower bound on the variance of an
unbiased estimator.

+ Sufficient statistic

The first step towards finding the "best" possible estimator is to
replace the data X1, X2, ..., Xn by one (or several) numbers that
contain as much information: such a number is called a "sufficient
statistic". More precisely, if

  P( (X1,X2,...,Xn) \in U | t(X1,...,Xn) = c )

only depends on U and c but not on the parameter we want to estimate,
then t is a sufficient statistic. There is a simple criterion (the
Neyman factorization theorem) to recognize a sufficient statistic.

Usually, we try to find a "minimal" sufficient statistic, i.e., one
whose "level sets" are as large as possible.

+ BUE (Best Unbiased Estimators, aka UMVUE, Uniformly Minimum Variance
  Unbiased Estimators)

Unbiased estimators whose variance attains the lower bound in the
Cramer-Rao inequality.

* TO SORT

+ Information

TODO

The information of a realization of a random variable is its degree of
surprise. For instance, if we flip a coin (i.e., if we consider a
Bernoulli variable with probability 1/2) and get heads, there is no real
surprise: the information is rather low. This notion becomes more
interesting for more asymmetric random variables. Let us consider the
occurrence of an earthquake, given that earthquakes are rare. Knowing
that no earthquake occurred is not informative at all: the information
is low (even lower than in the coin-flip experiment). On the contrary,
if an earthquake did occur, this is surprising, and the information is
high.
For a Bernoulli variable P(X=1) = p, the information of the event X=1 is

  - log2(p)

(in bits; with a natural logarithm, it is measured in nats). The rarer
the event, the larger the information.

+ Entropy

TODO

The Kolmogorov-Chaitin entropy of a sequence of characters is the
minimal length of a program that produces this sequence. It cannot
really be computed but, if your sequence is sufficiently long, you can
estimate its entropy by compressing the sequence (with programs such as
gzip or bzip2) and looking at the length of the resulting file.

http://arxiv.org/abs/cond-mat/0108530
Language trees and zipping

http://arxiv.org/abs/cond-mat/0202383
Extended comments on "Language trees and zipping"

http://arXiv.org/abs/cond-mat/0203275

+ Relative entropy (Kullback-Leibler distance)

TODO

+ Maximum Likelihood Methods and Statistical Tests

TODO: write this part...

Likelihood-Ratio test:

  LR = -2 ln ( L at H0 / L at the MLE )
  LR ~ Chi^2,  df = number of parameters fixed by H0 (e.g., 1).

Other tests for MLE: Wald test, score test.

To compare two models fitted by maximum likelihood, compare their

  AIC = -2 log L + 2 p

where p is the number of parameters. This is only an approximate
criterion. There are other, equally questionable, criteria, e.g., the
BIC (Bayesian Information Criterion).

  r <- lm(...)
  AIC(r)
  library(nlme)
  BIC(r)
  r <- gls(...)
  summary(r)

Adding a parameter, even a useless one, always increases the log
likelihood (typically by about half a unit when the parameter is
useless). The AIC only deems the extra parameter worthwhile if the log
likelihood increases by more than 1 unit.
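A minimal illustration (my own, on simulated data): a genuinely useful
term should lower the AIC, while a useless one should usually raise it.

  # Compare nested linear models with AIC (lower is better)
  set.seed(1)
  x <- rnorm(100)
  y <- 1 + 2*x + rnorm(100)
  r1 <- lm(y ~ x)
  r2 <- lm(y ~ x + I(x^2))   # adds a useless quadratic term
  r0 <- lm(y ~ 1)            # drops the (useful) slope
  AIC(r0, r1, r2)            # r1 should usually have the smallest AIC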