Estimators and Statistical Tests

TODO: give the structure of this chapter
  motivation of statistical tests
  notion of estimator, bias, MSE ...
  MLE (should be in a section of its own)
  Bayesian methods (in a section of their own) (TODO)

TODO: list the more important tests (Student and Chi^2)

Student T test:
  compare a mean with a given number
  compare the means of two samples
There are generalizations for more than two samples (analysis of
variance) and for non-gaussian samples (Wilcoxon). One can devise
similar tests to compare the variance of a sample with a given number,
or to compare the variances of two samples.

Chi^2 test:
  compare the distribution of a qualitative variable with
  predetermined values,
  compare the distribution of a qualitative variable in two samples.
One can also use it to check whether two qualitative variables are
independent. However, it is only an approximation, valid for large
samples (more than 100 observations, more than 10 observations per
class).

TODO: check if we can do without the Chi^2 test:
  - binary variable: binom.test
  - multinomial test: ???
  - independence Chi^2: fisher.test
  - two variables: fisher.test

* Introduction to statistical tests:

TODO: REWRITE THIS SECTION

We want to answer questions of the kind "Does tobacco increase the
risk of cancer?", "Does the proximity of a nuclear waste reprocessing
plant increase the risk of leukemia?", "Is the mean of the population
from which this sample was drawn zero, given that the sample mean is
0.02?"

Let us detail the problem "Do those two samples have the same mean?"
(a simplification of the problem "Do those two samples come from the
same population?"). Consider a first population, on which a
statistical variable (with a gaussian distribution) is defined, from
which we get a sample. We do the same for a second population, with
the same population mean. We can then consider the statistical
variable

  sample mean in the first sample - sample mean in the second sample

and find its distribution. If we measure a certain value of this
difference, we can compute the probability of obtaining a difference
at least as large. If P( difference > observed difference ) < alpha
(for a given value of alpha, say 0.05), we reject the hypothesis "the
two means are equal", with a risk equal to alpha.

But beware, this result is not certain at all. There can be two kinds
of error: either wrongly claim that the means are different (this
happens with a probability alpha), or wrongly claim that they are
equal. Beware again, those tests are only valid under certain
conditions (gaussian variables, same variance, etc.).

If we really wish to be rigorous, we do not consider a single
hypothesis but two: for instance "the means are equal" and "the means
are different"; or "the means are equal" and "the first mean is larger
than the second". We would use the second formulation if we can, a
priori, rule out the possibility that the first mean be lower than the
second -- but this has to come from information independent of the
samples at hand.

Statistical tests will never tell us "the hypothesis is true": they
merely reject, or fail to reject, the hypothesis stating "there is
nothing significant". (This is very similar to the development of
science as explained by K. Popper: we never prove that something is
true, we merely keep trying to prove it wrong and keep failing to do
so.)
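Here is a minimal sketch of this idea on simulated data (the sample
sizes, the shift of 0.3 and the use of a standard gaussian are
arbitrary choices for the illustration, not part of the problem): we
simulate the distribution of the difference of the sample means under
the null hypothesis and see how often it is at least as large as the
difference actually observed.

# Two "observed" samples (here, simulated)
observed.x <- rnorm(50, mean = 0)
observed.y <- rnorm(50, mean = 0.3)
observed.difference <- mean(observed.x) - mean(observed.y)
# Distribution of the difference of the sample means when
# H0 is true, i.e., when both samples come from the same population
N <- 10000
difference <- numeric(N)
for (i in 1:N) {
  x <- rnorm(50)
  y <- rnorm(50)
  difference[i] <- mean(x) - mean(y)
}
# Two-sided "p-value": probability of a difference at least as large
mean( abs(difference) >= abs(observed.difference) )

In practice, of course, we would not simulate anything but directly
call t.test(observed.x, observed.y), as we shall see later in this
chapter.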
+ H0 (null hypothesis) and H1 (alternative hypothesis)

Let us consider two hypotheses: the null hypothesis H0, "there is no
noticeable effect" (for instance, "tobacco does not increase the risk
of cancer", "the proximity of a waste reprocessing plant does not
increase the risk of leukemia") and the alternative hypothesis H1,
"there is a noticeable effect" (e.g., "tobacco increases the risk of
cancer").

The alternative hypothesis can be symmetric ("tobacco increases or
decreases the risk of cancer") or not ("tobacco increases the risk of
cancer"). Choosing an asymmetric alternative hypothesis means
rejecting, a priori, half of the possibilities: it can be a prejudice,
so think carefully before choosing an asymmetric alternative
hypothesis.

H0 is sometimes called the "conservative hypothesis", because it is
the hypothesis we keep if the results of the test are not conclusive.

+ Type I error

Wrongly rejecting the null hypothesis (i.e., wrongly concluding "there
is an effect" or "there is a noticeable difference").

For instance, if the variable X follows a gaussian distribution, we
expect to get values "in the middle" of the bell-shaped curve. If we
get extreme values, we shall reject, sometimes wrongly, the null
hypothesis (that the mean is actually zero). The type I error
corresponds to the red part in the following plot.

%G
colorie <- function (x, y1, y2, N=1000, ...) {
  for (t in (0:N)/N) {
    lines(x, t*y1+(1-t)*y2, ...)
  }
}
# No, there is already a function to do this
colorie <- function (x, y1, y2, ...) {
  polygon( c(x, x[length(x):1]), c(y1, y2[length(y2):1]), ... )
}
x <- seq(-6,6, length=100)
y <- dnorm(x)
plot(y~x, type='l')
i <- x < qnorm(.025) | x > qnorm(.975)
colorie(x[i], y[i], rep(0,sum(i)), col='red')
lines(y~x)
title(main="Type I error")
%--

+ p-value

The probability, if the null hypothesis is true, of getting a result
at least as extreme as the one observed. If we reject the null
hypothesis on the basis of this result, the p-value is the risk we
take of committing a type I error.

+ Type II error

Wrongly accepting the null hypothesis (i.e., wrongly concluding "there
is no statistically significant effect" or "there is no difference").
In all rigour, it is not an error, because one never says "H0 is true"
but "we do not reject H0 (yet)". It is not an error, but a missed
opportunity.

The type II error is the area of the red part in the following plot.
The middle bell-shaped curve is the distribution predicted by the null
hypothesis, the other one is the actual distribution of the
population.

%G
x <- seq(-6,6, length=1000)
y <- dnorm(x)
plot(y~x, type='l')
y2 <- dnorm(x-.5)
lines(y2~x)
i <- x > qnorm(.025) & x < qnorm(.975)
colorie(x[i], y2[i], rep(0,sum(i)), col='red')
lines(y~x)
lines(y2~x)
title(main="Type II error")
%--

+ Power

The power of a test is the probability of rejecting the null
hypothesis when it is actually false, i.e., 1 minus the probability of
a type II error. It depends on the actual distribution of the
population (in particular, on how far it is from the null hypothesis)
and on the sample size.

Before carrying out an experiment, one can use the power to choose the
sample size: for instance, we may require that the power of the test
of H0: << mu = mu0 >> against H1: << abs(mu - mu0) > epsilon >> with a
confidence level alpha=0.05 be at least 0.80 ("tradition" suggests a
power equal to 0.80, and a confidence level 0.05). The "power.t.test"
function performs those computations (for Student's T test). For
instance, with epsilon = 0.1 and a standard deviation of 1:

> power.t.test(delta=.1, sd=1, sig.level=.05, power=.80,
               type='one.sample')
     One-sample t test power calculation
              n = 786.8109
          delta = 0.1
             sd = 1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

You can ask the question in any direction. For instance, the
experiment has already been carried out, with a sample of size n: what
is the minimal difference in means we can expect to spot if we want
the power of the test to be 0.80?

> power.t.test(n=100, sd=1, sig.level=.05, power=.80,
               type='one.sample')
     One-sample t test power calculation
              n = 100
          delta = 0.2829125
             sd = 1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

Here is a plot of this minimum noticeable difference against the
sample size.
%G
N <- seq(10,200, by=5)
delta <- NULL
for (n in N) {
  delta <- append(delta,
    power.t.test(n=n, sd=1, sig.level=.05, power=.80,
                 type='one.sample')$delta)
}
plot(delta~N, type='l', xlab='sample size')
delta <- NULL
for (n in N) {
  delta <- append(delta,
    power.t.test(n=n, sd=1, sig.level=.01, power=.80,
                 type='one.sample')$delta)
}
lines(delta~N, col='red')
delta <- NULL
for (n in N) {
  delta <- append(delta,
    power.t.test(n=n, sd=1, sig.level=.001, power=.80,
                 type='one.sample')$delta)
}
lines(delta~N, col='blue')
legend(par('usr')[2], par('usr')[4], xjust=1,
       c('significance level=.05',
         'significance level=.01',
         'significance level=.001'),
       col=c(par('fg'), 'red', 'blue'),
       lwd=1, lty=1)
title(main='Sample size and difference detected for tests of power 0.80')
%--

You might want to read:
http://www.stat.uiowa.edu/~rlenth/Power/2badHabits.pdf
http://www.stat.uiowa.edu/techrep/tr303.pdf

+ Simple hypothesis

A hypothesis is said to be simple if it entails a complete knowledge
of the distribution of the random variables. For instance, if we are
looking for the mean of a gaussian variable of variance 1 (we already
know the variance), the hypothesis H0: "the mean is zero" is simple.
On the contrary, if we are looking for the mean of a gaussian variable
whose variance we do not know (we know the variable is gaussian, but
we are not interested in its variance), this hypothesis is not simple.

Here is another example: we have a sample that comes either from a
population 1 (described by a random variable of mean 3 and variance 2)
or from a population 2 (described by a random variable of mean 1 and
variance 1). In this case, both hypotheses H0: "the sample comes from
population 1" and H1: "the sample comes from population 2" are simple.

+ Composite hypothesis

A non-simple hypothesis. Quite often, the alternative hypothesis (H1)
is composite: if H1 is true, we know that the distribution has a
certain form, with a parameter that does not take the value specified
by H0 -- but we do not know which value it takes.

+ Confidence interval

Let us consider a random variable whose distribution is not completely
known: for instance, we know it is a gaussian distribution of variance
1, but the mean is unknown. If we estimate this mean as a single
number, we are sure to be wrong: the actual mean might be close to our
proposal, but there is no reason it should be exactly this one, up to
the umpteenth decimal. Instead, we can give a confidence interval --
note the use of an indefinite article: there are many such intervals
(you can shift it to the left or to the right -- and doing so will
change its width).

TODO: check that I give an example with several intervals.

Here are two interpretations of this notion of "confidence interval".
(1) It is an interval in which we have a 95% probability of finding
the sample mean. (2) More naively, it is an interval that has a 95%
probability of containing the population mean (i.e., the actual mean).
Actually, these two interpretations are equivalent. As I was very
dubious, let us check it on a simulation (here, I draw the population
mean at random, and this prior distribution does not play any role).

TODO: say that it is bayesian...

# Sample size
n <- 100
# Number of simulations
N <- 1000
v <- vector()
for (i in 1:N) {
  m <- runif(1)
  x <- m+rnorm(n)
  int <- t.test(x)$conf.int
  v <- append(v, int[1] < m & m < int[2])
}
sum(v)/N

We do find the population mean inside the confidence interval in about
95% of the simulations.

Beware: the p-value is NOT the probability that the null hypothesis be
true. For instance:

> n <- 10                  # Sample size
> m <- rnorm(1)            # Unknown mean
> x <- m + rnorm(n)        # Sample
> t.test(x)$p.value        # p-value
[1] 0.2574325

But what is the probability that the mean be exactly zero?
In our simulation, this probability is zero: the mean is almost surely
non-zero.

> m
[1] 0.7263967

+ UMP (Uniformly Most Powerful) tests

We would like to set the power of a test, in the same way we choose
the level of confidence. But there is a problem: if H1 is not a simple
hypothesis (i.e., if it is not a single hypothesis, such as "mu=0",
but a set of hypotheses, such as "mu!=0", "mu>0" or "mu>1"), the power
is not a number but a function, which depends on the simple hypotheses
contained in H1.

A test is UMP (Uniformly Most Powerful) for a given confidence level
alpha if, for any simple hypothesis in H1, it has the largest power
among the tests with confidence level alpha.

+ Non-parametric tests

Most of the time, statistical tests assume that the random variables
studied are gaussian (and even, when there are several, that they have
the same variance). Non-parametric tests do not make such assumptions
-- we say that they are more "robust" -- but, as a counterpart, they
are less powerful.

Here are a few examples of parametric and non-parametric tests.

TODO: is it the right place for this list?

  Aim                      Parametric test       Non-parametric test
  -------------------------------------------------------------------
  Compare two means        Student's T test      Wilcoxon's U test
  Compare more than        Anova (analysis       Kruskal--Wallis test
  two means                of variance)
  Compare two variances    Fisher's F test       Ansari--Bradley or
                                                 Mood test
  Compare more than        Bartlett test         Fligner test
  two variances

+ Robustness

A (parametric) test is robust if its results are still valid when its
assumptions are no longer satisfied (especially if the random
variables studied are no longer gaussian).

+ Resistance

A statistic (mean, median, variance, trimmed mean, etc.) is resistant
if it does not depend much on extreme values. For instance, the mean
is not resistant, while the median is: a single extreme value can
drastically change the mean, while it will not change the median.

> x <- rnorm(10)
> mean(x)
[1] -0.08964153
> mean(c(x,10^10))      # We add an extreme value
[1] 909090909
> median(x)
[1] -0.2234618
> median(c(x,10^10))    # We add an extreme value
[1] -0.1949206

+ Pearson residuals

One can sometimes spot outliers with the Pearson residuals:

  sample density / density according to the model - 1

TODO: Example with circular data.

+ Outlier detection

TODO
- Should you want them, there are statistical tests for the presence
  of outliers
- Outlier detection benchmark: use classical examples, such as
    library(robustbase)
    ?hkb

One should not try to automatically detect and remove the outliers.
The first problem is that this is a multi-stage procedure (first the
outlier detection and removal, then the rest of the analysis), whose
overall performance should be assessed (the rest of the analysis will
not perform as well as you think it should, because the data has been
tampered with). The second problem is that outlier removal is a
black-and-white decision: either we keep the observation, or we remove
it completely -- if we are unsure whether the observation is an actual
outlier, we are sure to make mistakes. Methods that remain efficient
even in the presence of outliers ("robust methods") do exist: use
them!

+ Breaking point

The breaking point (also called the breakdown point) is the proportion
of observations you can tamper with without being able to make the
estimator arbitrarily "large". It is a measure of the robustness of an
estimator. For instance, the breaking point of the mean is zero: by
changing a single observation, you can make the mean arbitrarily
large.
On the other hand, the breaking point of the median is 50%: if you
change a single observation, the median will change, but its value
will be bounded by the rest of the cloud of points -- to make the
median arbitrarily large, you would have to move (say) the top half of
the data points. The trimmed mean has a breaking point somewhere in
between.

+ TODO: A few robust estimators

Location (mean): trimmed mean, median
Dispersion (standard deviation): IQR, MAD, Sn, L-moments, MCD
rlm, cov.rob
library(rrcov)
library(robustbase)   (includes roblm)

+ Three means of performing statistical tests

1. We can look for a parameter of the distribution describing the
population. If possible, we shall take an unbiased estimator, with the
lowest variance possible. (However, we shall see later that certain
biased estimators may be more interesting, because their mean square
error is lower -- this is the case for ridge regression.)

2. Instead of a single value of this parameter, we can look for an
interval containing it: we are then building a confidence interval.

3. We can also want to know whether this parameter has (or not) a
predefined value: we then perform a test.

+ Criticism of statistical tests

A p-value close to zero can mean two very different things: either the
null hypothesis is very wrong, or it is just slightly wrong but our
sample is large enough to spot it. In the latter case, the difference
between reality and the hypothesis is statistically significant but,
for all practical purposes, negligible.

TODO: this problem with p-values is a starting point for the
development of bayesian methods.
http://www.stat.washington.edu/www/research/online/1994/bic.ps

+ Decision Theory

Among all the possible tests, we want one for which the risks of type
I and type II errors are as low as possible. Among the tests that
cannot be improved (i.e., we cannot modify them to get one with the
same type I error risk and a lower type II error risk, or conversely),
there is no means of choosing THE best one. Indeed, we can plot those
tests in the (type I error risk) x (type II error risk) plane: we get
a curve. However, decision theory allows everyone to choose, among
those tests, the one that best suits one's own taste for risk. One
crude way of proceeding is to choose an upper bound on the type I
error risk (alpha) and minimize the type II error risk.

For more details, read Simon French's book, "Decision Theory: an
introduction to the mathematics of rationality", Ellis Horwood series
in mathematics and its applications, Halsted Press, 1988.

* The Zoo of statistical tests: Parametric Tests

+ Statistical tests under R

Most R functions that perform (classical) tests are in the "stats"
package (it is already loaded).

library(help="stats")

The list of functions is huge: look for those containing the "test"
string.

> apropos(".test")
 [1] "ansari.test"          "bartlett.test"        "binom.test"
 [4] "Box.test"             "chisq.test"           "cor.test"
 [7] "fisher.test"          "fligner.test"         "friedman.test"
[10] "kruskal.test"         "ks.test"              "mantelhaen.test"
[13] "mcnemar.test"         "mood.test"            "oneway.test"
[16] "pairwise.prop.test"   "pairwise.t.test"      "pairwise.wilcox.test"
[19] "power.anova.test"     "power.prop.test"      "power.t.test"
[22] "PP.test"              "prop.test"            "prop.trend.test"
[25] "quade.test"           "shapiro.test"         "t.test"
[28] "var.test"             "wilcox.test"

+ Reading a test result

One would expect those functions to yield a result such as "Null
hypothesis rejected" or "Null hypothesis not rejected" -- bad luck.
The user has to know how to interpret the results, with a critical
eye.
The result is mainly a number, the p-value: the probability, if the
null hypothesis is true, of getting a result at least as extreme as
the one observed. If it is close to one, we do not reject the
hypothesis, i.e., the test did not find anything statistically
significant; if it is close to zero, we can reject the null
hypothesis.

More precisely, before performing the test, we choose a confidence
level alpha (often 0.05; for human health, you will be more
conservative and choose 0.01 or even less; if you want results from
little data and do not mind that those results are less reliable, you
can take 0.10): if p < alpha, we reject the null hypothesis; if
p > alpha, we do not reject it. Of course, you have to choose the
confidence level alpha BEFORE performing the test (otherwise, you
would choose it so that it produces the results you want)...

For instance,

> x <- rnorm(200)
> t.test(x)
        One Sample t-test
data:  x
t = 3.1091, df = 199, p-value = 0.002152
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.07855896 0.35102887
sample estimates:
mean of x
0.2147939

If we reject the null hypothesis (here: "the mean is zero"), we will
be wrong with a probability 0.002152, i.e., in 2 cases out of 1,000.

We can check that with simulations: if the null hypothesis H0 is true,
in 95% of the cases, we have p > 0.05. More precisely, if H0 is true,
the p-value is uniformly distributed in [0,1].

%G
p <- c()
for (i in 1:1000) {
  x <- rnorm(200)
  p <- append(p, t.test(x)$p.value)
}
hist(p, col='light blue')
%--

%G
p <- sort(p)
p[950]
p[50]
x <- 1:1000
plot(p ~ x, main="p-value of a Student T test when H0 is true")
%--

That was the type I error risk (wrongly rejecting the null
hypothesis). Let us now focus on the type II error risk. This time,
the population mean is non-zero. The risk will depend on the actual
value of this mean -- but we do not know this value. If the mean is
close to zero, the risk is high (we need a lot of data to spot small
differences); if it is farther, the risk is lower.

%G
# Sample size
n <- 10
# Number of simulations
# (sufficiently large to get a good approximation
# of the probability)
m <- 1000
# Number of points to draw the curve
k <- 50
# Maximum value for the actual mean
M <- 2
r <- vector()
for (j in M*(0:k)/k) {
  res <- 0
  for (i in 1:m) {
    x <- rnorm(n, mean=j)
    if( t.test(x)$p.value > .05 ){
      res <- res + 1
    }
  }
  r <- append(r, res/m)
}
rr <- M*(0:k)/k
plot(r~rr, type="l",
     xlab='difference in mean',
     ylab="type II error probability")

# Comparison with the curve produced by "power.t.test"
x <- seq(0,2,length=200)
y <- NULL
for (m in x) {
  y <- append(y, 1-power.t.test(delta=m, sd=1, n=10,
                                sig.level=.05,
                                type='one.sample')$power)
}
lines(x,y,col='red')

# Theoretical curve
# (This is a Z test, not too different)
r2 <- function (p,q,conf,x) {
  p(q(1-conf/2)-x) - p(q(conf/2)-x)
}
f <- function(x) {
  p <- function (t) { pnorm(t, sd=1/sqrt(10)) }
  q <- function (t) { qnorm(t, sd=1/sqrt(10)) }
  r2(p,q,.05,x)
}
curve( f(x), add=T, col="blue" )

# Theoretical curve (T test)
f <- function(x) {
  p <- function (t) { pt(sqrt(10)*t, 10) }   # Is this correct?
  q <- function (t) { qt(t, 10)/sqrt(10) }
  r2(p,q,.05,x)
}
curve( f(x), add=T, col="green" )

legend(par('usr')[2], par('usr')[4], xjust=1,
       c('simulation', 'power.t.test',
         '"exact" value, Z test', '"exact" value, T test'),
       col=c(par('fg'),'red','blue','green'),
       lwd=1, lty=1)
title(main="Type II error risk in a Student T test")
%--

We saw earlier that, if the null hypothesis is true, the p-values are
uniformly distributed in [0,1]. Here is their distribution if H0 is
false, for different values of the population mean.
%G N <- 10000 x <- 100*(1:N)/N plot( x~I(x/100), type='n', ylab="cumulated percents", xlab="p-value" ) do.it <- function (m, col) { p <- c() for (i in 1:N) { x <- m+rnorm(200) p <- append(p, t.test(x)$p.value) } p <- sort(p) x <- 100*(1:N)/N lines(x ~ p, type='l', col=col) } do.it(0, par('fg')) do.it(.05, 'red') do.it(.1, 'green') do.it(.15, 'blue') do.it(.2, 'orange') abline(v=.05) title(main='p-value distribution') legend(par('usr')[2],par('usr')[3],xjust=1,yjust=1, c('m=0', 'm=0.05', 'm=.01', 'm=.015', 'm=.02'), col=c(par('fg'), 'red', 'green', 'blue', 'orange'), lty=1,lwd=1) %-- The manual gives this striking example, that explains why R does not give a clear result "Null hypothesis rejected" or "Null hypothesis not rejected". If we look at the data (with the eyes, on a plot -- always plot the data) we can state with little risk of error that the means are very different. However, the p-value is very high and suggests not to reject the null hypothesis that the means are equal. One problem is the size of the confidence interval: it is extremely large. Thus you should carefully look at the data. (On this example, there was one more problem: to perform a Student T test, the data have to be gaussian, to have the same variance, to be independant -- this is far from being the case). TODO: somewhere in this document, state how to correctly perform a test. 0. Choose a confidence level, a test 1. Plot the data 2. Check the assumptions. Change the test if needed. 3. Perform the test. Check the confidence interval. TODO: where is my rant against T-values? %G 600 200 x <- 1:10 y <- c(7:20, 200) boxplot(x,y, horizontal=T) %-- %G 600 200 boxplot(x,y, log="x", horizontal=T) %-- > t.test(x, y) Welch Two Sample t-test data: x and y t = -1.6329, df = 14.165, p-value = 0.1245 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -47.242900 6.376233 sample estimates: mean of x mean of y 5.50000 25.93333 Of course, if we remove the outlier, 200, the result conforms with our intuition. > t.test(1:10,y=c(7:20)) Welch Two Sample t-test data: 1:10 and c(7:20) t = -5.4349, df = 21.982, p-value = 1.855e-05 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -11.052802 -4.947198 sample estimates: mean of x mean of y 5.5 13.5 Similarly, if we use a non parametric test -- what we should have done from the very beginning. > wilcox.test(x,y) Wilcoxon rank sum test with continuity correction data: x and y W = 8, p-value = 0.0002229 alternative hypothesis: true mu is not equal to 0 Warning message: Cannot compute exact p-value with ties in: wilcox.test.default(x, y) + The zoo of parametric tests + WARNING Most of these tests are only valid for gaussian variables. Furthermore, if there are several samples, they often ask them to be independant and have the same variance. + Student's T test Here, we want to find the mean of a random variable or, more precisely, to compare it with a predefined value. ?t.test Assumptions: The variable is gaussian (if it is not, you can use Wilcoxon's U test). We already gave an example above. Here is the theory. The null hypothesis is H0: "the mean is m", the alternative hypothesis is H1: "the mean is not m". 
We compute sample mean - m T = --------------------------------------- sample standard deviation / sqrt(n) (beware: there are two formulas for the standard deviation: the population standard deviation, with "n" in the denominator, and the sample standard deviation, with "n-1" -- R only uses the latter) (here, n is the sample size) and we reject H0 if abs( T ) > t_{n-1} ^{-1} ( 1 - alpha/2 ) where T_{n-1} is the student T distribution with n-1 degrees of freedom. (With gaussian independant identically distributed random variables, T indeed follows this distribution -- you can actually define the Student T distribution that way.) Here are a few Student T probability distribution functions. The larger n, the closer it is from a gaussian distribution. %G curve(dnorm(x), from=-5, to=5, add=F, col="orange", lwd=3, lty=2) curve(dt(x,100), from=-5, to=5, add=T, col=par('fg')) curve(dt(x,5), from=-5, to=5, add=T, col="red") curve(dt(x,2), from=-5, to=5, add=T, col="green") curve(dt(x,1), from=-5, to=5, add=T, col="blue") legend(par('usr')[2], par('usr')[4], xjust=1, c('gaussian', 'df=100', 'df=5', 'df=2', 'df=1'), col=c('orange', par('fg'), 'red', 'green', 'blue'), lwd=c(3,1,1,1,1), lty=c(2,1,1,1,1)) title(main="Student's T probability distribution function") %-- Let us now see how to use this, by hand. We start with a sample from a gaussian population, whose mean we want to estimate. > x <- rnorm(200) > m <- mean(x) > m [1] 0.05875323 We now try to find an interval, centered on this sample mean, in which we ahve a 95% probability of finding the actual (population) mean (it is zero). x <- rnorm(100) n <- length(x) m <- mean(x) m alpha <- .05 m + sd(x)/sqrt(n)*qt(alpha/2, df=n-1, lower.tail=T) m + sd(x)/sqrt(n)*qt(alpha/2, df=n-1, lower.tail=F) We get [m - 0.19, m + 0.19]. The "t.test" function provides a similar result. > t.test(x, mu=0, conf.level=0.95) One Sample t-test data: x t = 0.214, df = 99, p-value = 0.831 alternative hypothesis: true mean is not equal to 0 95 percent confidence interval: -0.1987368 0.2467923 sample estimates: mean of x 0.02402775 You can also experimentally check this with, with a simulation. > m = c() > for (i in 1:10000) { + m <- append(m, mean(rnorm(100)) ) + } > m <- sort(m) > m[250] [1] -0.1982188 > m[9750] [1] 0.1999646 The interval is [-0.20, +0.20]: we get approximately the same result as the theoretical computations (this will always be the case with Monte-Carlo methods: the approximation is coarse, the convergence is slow, we need a lot of iterations to get a reliable result). We can also plot the result of those simulations. We see that the actual mean is sometimes outside the confidence interval. %G N <- 50 n <- 5 v <- matrix(c(0,0),nrow=2) for (i in 1:N) { x <- rnorm(n) v <- cbind(v, t.test(x)$conf.int) } v <- v[,2:(N+1)] plot(apply(v,2,mean), ylim=c(min(v),max(v)), ylab='Confidence interval', xlab='') abline(0,0) c <- apply(v,2,min)>0 | apply(v,2,max)<0 segments(1:N,v[1,],1:N,v[2,], col=c(par('fg'),'red')[c+1], lwd=3) title(main="The population mean need not be in the confidence interval") %-- + Student's T test: robustness Let us perform a simulation to see what happens with non normal data -- a situation in which one should be using the Wilcoxon U test. Let us first check with a distribution less dispersed that the gaussian one: the uniform distribution. > N <- 1000 > n <- 3 > v <- vector() > for (i in 1:N) { x <- runif(n, min=-1, max=1) v <- append(v, t.test(x)$p.value) } > sum(v>.05)/N [1] 0.932 Now with gaussian data. 
> N <- 1000 > n <- 3 > v <- vector() > for (i in 1:N) { x <- rnorm(n, sd=1/sqrt(3)) v <- append(v, t.test(x)$p.value) } > sum(v>.05)/N [1] 0.952 Thus, the p-value is wrong when the variable is no longer gaussian: when we think we are rejecting the null hypothesis with a 5% risk of type I error, this risk is actually (for a uniform distribution, which is very far from pathological) 7%. Let us seewhat happens to the confidence interval: we should be in it in 95% of the cases. For a uniform distribution: > N <- 1000 > n <- 3 > v <- vector() > for (i in 1:N) { x <- runif(n, min=-1, max=1) r <- t.test(x)$conf.int v <- append(v, r[1]<0 & r[2]>0) } > sum(v)/N [1] 0.919 For a normal distribution: > N <- 1000 > n <- 100 > v <- vector() > for (i in 1:N) { x <- rnorm(n, sd=1/sqrt(3)) r <- t.test(x)$conf.int v <- append(v, r[1]<0 & r[2]>0) } > sum(v)/N [1] 0.947 For the uniform distribution, we are in the confidence interval in 92% of the cases instead of the expected 95%. Actually, when the distribution is not gaussian, the results can be "wrong" in two ways: either the distribution is less dispersed than the gaussian, and the p-value is higher than stated by the test and the confidence interval provided is too small; or the distribution is more dispersed than the gaussian, the p-value will always be high, the confidence interval will be huge, and the test will no longer test anything (it loses its power, it will always state: there is nothing noticeable -- you can say it is blinded by the outliers). TODO: I will say this later... However, if the sample is much larger, the error becomes negligible. > N <- 1000 > n <- 100 > v <- vector() > for (i in 1:N) { x <- runif(n, min=-1, max=1) v <- append(v, t.test(x)$p.value) } > sum(v>.05)/N [1] 0.947 > N <- 1000 > n <- 100 > v <- vector() > for (i in 1:N) { x <- runif(n, min=-1, max=1) r <- t.test(x)$conf.int v <- append(v, r[1]<0 & r[2]>0) } > sum(v)/N [1] 0.945 Let us now check the power of the test. %G N <- 1000 n <- 10 delta <- .8 v <- vector() w <- vector() for (i in 1:N) { x <- delta+runif(n, min=-1, max=1) v <- append(v, t.test(x)$p.value) w <- append(w, wilcox.test(x)$p.value) } plot(sort(v), type='l', lwd=3, lty=2, ylab="p-valeur") lines(sort(w), col='red') legend(par('usr')[1], par('usr')[4], xjust=0, c('Student', 'Wilcoxon'), lwd=c(2,1), lty=c(2,1), col=c(par("fg"), 'red')) %-- %G N <- 1000 n <- 100 delta <- .1 v <- vector() w <- vector() for (i in 1:N) { x <- delta+runif(n, min=-1, max=1) v <- append(v, t.test(x)$p.value) w <- append(w, wilcox.test(x)$p.value) } plot(sort(v), type='l', lwd=3, lty=2) lines(sort(w), col='red') legend(par('usr')[1], par('usr')[4], xjust=0, c('Student', 'Wilcoxon'), lwd=c(2,1), lty=c(2,1), col=c(par("fg"), 'red')) %-- %G N <- 1000 n <- 100 delta <- .8 v <- vector() w <- vector() for (i in 1:N) { x <- delta+runif(n, min=-1, max=1) v <- append(v, t.test(x)$p.value) w <- append(w, wilcox.test(x)$p.value) } plot(sort(v), type='l', lwd=3, lty=2) lines(sort(w), col='red') legend(par('usr')[1], par('usr')[4], xjust=0, c('Student', 'Wilcoxon'), lwd=c(2,1), lty=c(2,1), col=c(par("fg"), 'red')) %-- From this, Student's T test seems robust: if the data are less dispersed than with a gaussian distribution, the p-value and the confidence interval are underestimated, but not too much. This effect disappeard if the sample is large enough. But wait! Let us now see waht happens with a distribution more dispersed than the gaussian -- e.g., the Cauchy distribution. 
> N <- 1000 > n <- 3 > v <- vector() > for (i in 1:N) { + x <- rcauchy(n) + v <- append(v, t.test(x)$p.value) + } > sum(v>.05)/N [1] 0.974 > N <- 1000 > n <- 3 > v <- vector() > for (i in 1:N) { + x <- rcauchy(n) + r <- t.test(x)$conf.int + v <- append(v, r[1]<0 & r[2]>0) + } > sum(v)/N [1] 0.988 The given confidence interval is too large: we are not in it in 95% of the cases, but much more. Things do not improve with an even larger sample. > N <- 1000 > n <- 100 > v <- vector() > for (i in 1:N) { + x <- rcauchy(n) + v <- append(v, t.test(x)$p.value) + } > sum(v>.05)/N [1] 0.982 > N <- 1000 > n <- 100 > v <- vector() > for (i in 1:N) { + x <- rcauchy(n) + r <- t.test(x)$conf.int + v <- append(v, r[1]<0 & r[2]>0) + } > sum(v)/N [1] 0.986 Let us now check the test power: this is disastrous... %G N <- 1000 n <- 10 delta <- 1 v <- vector() w <- vector() for (i in 1:N) { x <- delta+rcauchy(n) v <- append(v, t.test(x)$p.value) w <- append(w, wilcox.test(x)$p.value) } plot(sort(v), type='l', lwd=3, lty=2) lines(sort(w), col='red') %-- %G N <- 1000 n <- 100 delta <- 1 v <- vector() w <- vector() for (i in 1:N) { x <- delta+rcauchy(n) v <- append(v, t.test(x)$p.value) w <- append(w, wilcox.test(x)$p.value) } plot(sort(v), type='l', lwd=3, lty=2) lines(sort(w), col='red') %-- From this, we conclude that if the data are more dispersed than with a gaussian distribution, the power of the test decreases and nothing remains. Increasing th sample size does not improve things. In these situations, one should use non-parametric tests (or a parametric test adapted to the distribution at hand, or transform the data so that it looks more normal -- be beware, using the same data to choose the distribution or the trans formation and to perform the test will yield biased p-values). > N <- 1000 > n <- 100 > v <- vector() > w <- vector() > for (i in 1:N) { + x <- rcauchy(n) + v <- append(v, t.test(x)$p.value) + w <- append(w, wilcox.test(x)$p.value) + } > sum(v>.05)/N [1] 0.976 > sum(w>.05)/N [1] 0.948 > N <- 1000 > n <- 100 > v <- vector() > w <- vector() > for (i in 1:N) { + x <- 1+rcauchy(n) + v <- append(v, t.test(x)$p.value) + w <- append(w, wilcox.test(x)$p.value) + } > sum(v<.05)/N [1] 0.22 > sum(w<.05)/N [1] 0.992 + Z test It is similar to the Student T test, but this time, we know the exact value of the variance: in this cas, the distribution of the test statistic is nolonger a Student T distribution but a gaussian distribution (often denoted Z). In the real world, when we do not know the mean, we do not know the variance either -- thus, this test has little practical utility. For large samples, Student's T distribution is well approximated by a gaussian distribution, so we can perform a Z test instead of a T test (that would be relevant if we we computing everything ourselves, but as the computer is there...) + Student T test: comparing the mean of two samples. We now consider two samples and we would like to know if they come from the same population. More simply, we would like to know if they have the same mean. Here, we assume that the data are normal (do a qqplot or a Shapiro test), have the same variance (use an F test). If those conditions are not satisfied, you can use a Wilcoxon test. We also assume that the variables are independant (this is not always the case: for instance, you might want to compare the length of the left and right humerus on a bunch of human skeletons; to get rid of the dependance problem, you can consider the length difference for each individual and compare it with zero. 
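Here is a minimal sketch of the paired comparison just described, on
simulated data (the sample size and the simulated "left"/"right"
lengths are arbitrary choices for the illustration): testing the
per-individual differences against zero is exactly what a paired
Student T test does.

# Simulated paired measurements: the right-hand values are strongly
# correlated with the left-hand ones
n <- 30
left  <- rnorm(n, mean=30, sd=2)
right <- left + rnorm(n, mean=0.1, sd=0.5)
# One-sample T test on the per-individual differences
t.test(right - left)
# Equivalent paired T test
t.test(right, left, paired=TRUE)

Both calls yield exactly the same statistic, degrees of freedom and
p-value.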
For two samples of size n (it also works for samples of different
sizes, but the formula is more complicated), one can show that the
statistic

           difference between the means
  t = --------------------------------------
       sqrt( (sum of the variances) / n )

follows a Student distribution with 2n-2 degrees of freedom. We shall
reject the null hypothesis "the means are equal" if

  abs(t) > abs( t(alpha, 2n-2) )

We could perform the test by hand, as above, but there is already a
function to do so.

?t.test

> x <- rnorm(100)
> y <- rnorm(100)
> t.test(x,y)
        Welch Two Sample t-test
data:  x and y
t = -1.3393, df = 197.725, p-value = 0.182
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.46324980  0.08851724
sample estimates:
  mean of x   mean of y
-0.03608115  0.15128514

> t.test(x, y, alternative="greater")
        Welch Two Sample t-test
data:  x and y
t = -1.3393, df = 197.725, p-value = 0.909
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -0.4185611        Inf
sample estimates:
  mean of x   mean of y
-0.03608115  0.15128514

Here is the example from the manual: the study of the efficiency of a
soporific drug.

> ?sleep
> data(sleep)
> plot(extra ~ group, data = sleep)
> t.test(extra ~ group, data = sleep)
        Welch Two Sample t-test
data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.0794
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.3654832  0.2054832
sample estimates:
mean in group 1 mean in group 2
           0.75            2.33

Here, as the p-value is greater than 5%, we would conclude that the
drug's effects are not noticeable in this sample. However, if you work
for the drug's manufacturer, you would prefer a different conclusion.
You can get it if you change the alternative hypothesis, which becomes
H1: "the drug increases the time of sleep". (If you are honest, you
choose the null and alternative hypotheses before performing the
experiment, and this choice must be backed by prior data or knowledge:
remarking that in this very experiment the sample mean is greater is
not sufficient.)

> t.test(extra ~ group, data = sleep, alternative="less")
        Welch Two Sample t-test
data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.0397
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
       -Inf -0.1066185
sample estimates:
mean in group 1 mean in group 2
           0.75            2.33

TODO: There should be a section "How to lie with statistics"

1. Reusing the same data several times to increase the statistical
   significance of an effect. (Examples: choose H1 after looking at
   the data, use the data to choose the transformation to apply, etc.)

2. Using a statistical test whose assumptions are not satisfied (it
   can either yield more significant results or less significant
   results: e.g., if the test assumes that ******************** (deleted?)

3. Performing the same experiment several times and only publishing
   the results that go in the direction you want. (Adverts by tobacco
   companies claiming that passive smoking is harmless are done that
   way.)

4. Misleading plots
   - Adding a lot of unnecessary details
   - 3D effects: the most striking is the 3-dimensional pie chart,
     with the part of interest at the forefront, so that it is
     deformed and looks larger.
   - You can also plot a quantity as a sphere (or any 3-dimensional
     object) at different scales: for a quantity twice as large, you
     can multiply all the dimensions by 2, thereby getting a volume
     eight times as large.
If you have more than two samples, you can perform an ANalysis Of VAriance (Anova). If you have two non-gaussain samples, you can perform a Wilcoxon U test. If you have more that two non-gaussian samples, you can turn to non-parametric analysis of variance with the Kruskal--Wallis test. + Robustness of Student's 2-sample T test To compare means with a Student T test, we assume that: the samples are gaussian, independant and have the same variance. Let us check what happens with gaussian non-equivariant samples. TODO: check that there is no confusion between the Student T test and Welch's test (var.equal=T for the first, var.equal=F (the default) for the second). %G N <- 1000 n <- 10 v <- 100 a <- NULL b <- NULL for (i in 1:N) { x <- rnorm(n) y <- rnorm(n, 0, v) a <- append(a, t.test(x,y)$p.value) b <- append(b, t.test(x/var(x), y/var(y))$p.value) } plot(sort(a), type='l', col="green") points(sort(b), type="l", col="red") abline(0,1/N) %-- And for the power: %G N <- 1000 n <- 10 v <- 100 a <- NULL b <- NULL c <- NULL d <- NULL for (i in 1:N) { x <- rnorm(n) y <- rnorm(n, 100, v) a <- append(a, t.test(x,y)$p.value) b <- append(b, t.test(x/var(x), y/var(y))$p.value) c <- append(c, t.test(x, y/10000)$p.value) d <- append(d, wilcox.test(x, y)$p.value) } plot(sort(a), type='l', col="green") points(sort(b), type="l", col="red") points(sort(c), type="l", col="blue") points(sort(d), type="l", col="orange") abline(0,1/N) legend(par('usr')[1], par('usr')[4], c('Student Test', 'Renormalized Student Test', 'Student Test renormalized with the sample variances', "(Non-Parametric) Wilcoxson's U Test"), col=c('green', 'blue', 'red', 'orange'), lwd=1,lty=1) title(main="Student's T test on non-equivariant samples") %-- With non-equivariant samples, the p-value of the Student test remains correct, but the power dramatically decreases. If the data look gaussian but have different variances, you had better normalize them andto perform a Student T test than perform a non-parametric test. TODO: What about the Welch test??? + Several means of comparing means There are several ways of comparing two means. %G 600 300 data(sleep) boxplot(extra ~ group, data=sleep, horizontal=T, xlab='extra', ylab='group') %-- With a Student T test, if the data are gaussian (or a Welch test, if they are gaussian but have different variances). > t.test(extra ~ group, data=sleep) Welch Two Sample t-test data: extra by group t = -1.8608, df = 17.776, p-value = 0.0794 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.3654832 0.2054832 sample estimates: mean in group 1 mean in group 2 0.75 2.33 > t.test(extra ~ group, data=sleep, var.equal=T) Two Sample t-test data: extra by group t = -1.8608, df = 18, p-value = 0.07919 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -3.3638740 0.2038740 sample estimates: mean in group 1 mean in group 2 0.75 2.33 With a Wilcoxon test, if we do not know that the data are gaussian, or if we suspect they are not (from a Shapiro--Wilk test, for instance). > wilcox.test(extra ~ group, data=sleep) Wilcoxon rank sum test with continuity correction data: extra by group W = 25.5, p-value = 0.06933 alternative hypothesis: true mu is not equal to 0 Warning message: Cannot compute exact p-value with ties in: wilcox.test.default(x = c(0.7, -1.6, With an analysis of variance (we will present this later -- with only two samples, it yields exactly the same result as the Student T test). 
> anova(lm(extra ~ group, data=sleep))
Analysis of Variance Table
Response: extra
          Df Sum Sq Mean Sq F value  Pr(>F)
group      1 12.482  12.482  3.4626 0.07919 .
Residuals 18 64.886   3.605
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

With a non-parametric analysis of variance, i.e., a Kruskal--Wallis
test.

> kruskal.test(extra ~ group, data=sleep)
        Kruskal-Wallis rank sum test
data:  extra by group
Kruskal-Wallis chi-squared = 3.4378, df = 1, p-value = 0.06372

Here are the results:

  Method                  p-value
  -------------------------------
  Welch test              0.0794
  Student T test          0.07919
  Wilcoxon                0.06933
  Analysis of Variance    0.07919
  Kruskal--Wallis Test    0.06372

+ The Chi^2 and variance computations

We are looking for the variance of a sample (whose mean is unknown).
Here is the theory. The null hypothesis is H0: "the (population)
variance is v", the alternative hypothesis is H1: "the variance is not
v". We compute the statistic

  Chi2 = (n-1) * (sample variance) / v

and we reject the null hypothesis H0 if

  Chi2 > Chi2_{n-1} ^{-1} ( 1 - alpha/2 )
or
  Chi2 < Chi2_{n-1} ^{-1} ( alpha/2 )

where Chi2_{n-1} is the Chi^2 distribution with n-1 degrees of
freedom.

Here is the Chi^2 probability distribution function, with various
degrees of freedom.

%G
curve(dchisq(x,2), from=0, to=5, add=F, col="red",
      ylab="dchisq(x,i)")
n <- 10
col <- rainbow(n)
for (i in 1:n) {
  curve(dchisq(x,i), from=0, to=5, add=T, col=col[i])
}
legend(par('usr')[2], par('usr')[4], xjust=1,
       paste('df =',1:n),
       lwd=1, lty=1, col=col)
title(main="Chi^2 Probability Distribution Function")
%--

We can get a confidence interval for the standard deviation, by hand,
as follows.

alpha <- .05
x <- rnorm(200)
n <- length(x)
v <- var(x)
sd(x)
sqrt( (n-1)*v / qchisq(alpha/2, df=n-1, lower.tail=F) )
sqrt( (n-1)*v / qchisq(alpha/2, df=n-1, lower.tail=T) )

We get [0.91, 1.11]. We can check this with a simulation:

v <- c(0)
for (i in 1:10000) {
  v <- append(v, var(rnorm(200)) )
}
v <- sort(v)
sqrt(v[250])
sqrt(v[9750])

We get [0.90, 1.10]. With smaller samples, the estimation is less
reliable: the confidence interval is larger, [0.68, 1.32].

v <- c(0)
for (i in 1:10000) {
  v <- append(v, var(rnorm(20)) )
}
v <- sort(v)
sqrt(v[250])
sqrt(v[9750])

I have not found an R function that performs those computations (the
"var.test" function works with two samples), so we can write our own.

%G
chisq.var.test <- function (x, var=1, conf.level=.95,
                            alternative='two.sided') {
  result <- list()
  alpha <- 1-conf.level
  n <- length(x)
  v <- var(x)
  result$var <- v
  result$sd <- sd(x)
  chi2 <- (n-1)*v/var
  result$chi2 <- chi2
  p <- pchisq(chi2,n-1)
  if( alternative == 'less' ) {
    stop("Not implemented yet")
  } else if (alternative == 'greater') {
    stop("Not implemented yet")
  } else if (alternative == 'two.sided') {
    if(p>.5) p <- 1-p
    p <- 2*p
    result$p.value <- p
    result$conf.int.var <- c(
      (n-1)*v / qchisq(alpha/2, df=n-1, lower.tail=F),
      (n-1)*v / qchisq(alpha/2, df=n-1, lower.tail=T)
    )
  }
  result$conf.int.sd <- sqrt( result$conf.int.var )
  result
}
x <- rnorm(100)
chisq.var.test(x)

# We can check that the results are correct by looking at
# the distribution of the p-values: it should be uniform
# in [0,1].
v <- NULL
for (i in 1:1000) {
  v <- append(v, chisq.var.test(rnorm(100))$p.value)
}
plot(sort(v))
%--

We can also compare the results with those of the "var.test" function,
which works with two samples.
Either graphically,

%G
p1 <- NULL
p2 <- NULL
for (i in 1:100) {
  x <- rnorm(10)
  p1 <- append(p1, chisq.var.test(x)$p.value)
  p2 <- append(p2, var.test(x, rnorm(10000))$p.value)
}
plot( p1 ~ p2 )
abline(0,1,col='red')
%--

or with a computation (we shall see later what it means: this is a
test on a regression, telling us that p1=p2, with a p-value equal to
0.325).

> summary(lm(p1-p2~0+p2))
Call:
lm(formula = p1 - p2 ~ 0 + p2)
Residuals:
      Min        1Q    Median        3Q       Max
-0.043113 -0.007930  0.001312  0.009386  0.048491
Coefficients:
    Estimate Std. Error t value Pr(>|t|)
p2 -0.002609   0.002638  -0.989    0.325
Residual standard error: 0.01552 on 99 degrees of freedom
Multiple R-Squared: 0.009787,   Adjusted R-squared: -0.0002151
F-statistic: 0.9785 on 1 and 99 DF,  p-value: 0.325

+ Fisher distribution (F test) and comparison of the variances of two
  samples

Here, we want to know whether two samples come from populations with
the same variance (we are not interested in the mean). We proceed as
for the comparison of means but, instead of considering the difference
of the means, we consider the quotient of the variances.

One will use such a test before a Student T test (to compare the means
of two samples), to check that the equal-variance assumption is valid.
Here is the example with which we had illustrated the Student T test:
indeed, the variances do not seem too different.

?var.test

> var.test( sleep[,1] [sleep[,2]==1], sleep[,1] [sleep[,2]==2] )
        F test to compare two variances
data:  sleep[, 1][sleep[, 2] == 1] and sleep[, 1][sleep[, 2] == 2]
F = 0.7983, num df = 9, denom df = 9, p-value = 0.7427
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.198297 3.214123
sample estimates:
ratio of variances
         0.7983426

Here is the theory behind this test. The null hypothesis is H0: "the
two populations have the same variance", the alternative hypothesis is
H1: "the two populations do not have the same variance". We compute
the statistic

       variance of the first sample
  F = -------------------------------
       variance of the second sample

(here, I assume that the two samples have the same size, otherwise the
formula becomes more complicated) and we reject H0 if

  F < F _{n1-1, n2-1} ^{-1} ( alpha/2 )
or
  F > F _{n1-1, n2-1} ^{-1} ( 1 - alpha/2 )

where F is Fisher's distribution and n1 and n2 are the sample sizes.
Practically, we compute the quotient of variances with the largest
variance in the numerator and we reject the null hypothesis that the
variances are equal if

  F > F _{n1-1, n2-1} ^{-1} ( 1 - alpha/2 ).

Here is an example, where the computations are performed by hand (when
the two samples have the same size, this upper critical value is the
reciprocal of the lower one).

> x <- rnorm(100, 0, 1)
> y <- rnorm(100, 0, 2)
> f <- var(y)/var(x)     # Largest variance in the numerator
> f
[1] 5.232247
> alpha <- 0.05
> qf(alpha/2, 99, 99)
[1] 0.6728417
> f > 1 / qf(alpha/2, 99, 99)
[1] TRUE

If we have more than two samples, we can use Bartlett's test. If we
have two non-gaussian samples, we can use Ansari's or Mood's
non-parametric test. If there are more than two non-gaussian samples,
we can use Fligner's test.

?bartlett.test
?ansari.test
?mood.test
?fligner.test

* The Zoo of Statistical Tests: discrete variables and the Chi^2 test

+ Binomial test

In a sample of 100 butterflies, we found 45 males and 55 females. Can
we conclude that, in the whole population, one sex is more frequent
than the other? The number of female butterflies in a sample of 100
animals follows a binomial distribution B(100,p) and we want to test
the null hypothesis H0: "p=0.5" against the alternative hypothesis H1:
"p different from 0.5".
> binom.test(55, 100, .5) Exact binomial test data: 55 and 100 number of successes = 55, number of trials = 100, p-value = 0.3682 alternative hypothesis: true probability of success is not equal to 0.5 95 percent confidence interval: 0.4472802 0.6496798 sample estimates: probability of success 0.55 In this example, the difference is not statistically significant. + Mock binomial test (not important) If we were doing the computations by hand, we would not use the binomial test, but an approximation, with the "prop.test". But as the computer performscarries out the computations for us, we need not use it. %G p <- .3 col.values <- c(par('fg'),'red', 'blue', 'green', 'orange') n.values <- c(5,10,20,50,100) plot(0, type='n', xlim=c(0,1), ylim=c(0,1), xlab='exact', ylab='approximate') for (i in 1:length(n.values)) { n <- n.values[i] x <- NULL y <- NULL for (a in 0:n) { x <- append(x, binom.test(a,n,p)$p.value) y <- append(y, prop.test(a,n,p)$p.value) } o <- order(x) lines(x[o],y[o], col=col.values[i]) } legend(par('usr')[1],par('usr')[4], as.character(n.values), col=col.values, lwd=1,lty=1) title(main="Comparing the binomial test and its approximation") %-- We can also compare the distribution of p-values of these two tests. %G p <- .3 n <- 5 N <- 1000 e <- rbinom(N, n, p) x <- y <- NULL for (a in e) { x <- append(x, binom.test(a,n,p)$p.value) y <- append(y, prop.test(a,n,p)$p.value) } x <- sort(x) y <- sort(y) plot(x, type='l', lwd=3, ylab='p-value') lines(y, col='red') legend(par('usr')[2], par('usr')[3], xjust=1, yjust=0, c('exact', 'approximation'), lwd=c(3,1), lty=1, col=c(par("fg"),'red')) title(main="Binomial test (H0 true)") %-- %G p1 <- .3 p2 <- .5 n <- 5 N <- 1000 e <- rbinom(N, n, p1) x <- y <- NULL for (a in e) { x <- append(x, binom.test(a,n,p2)$p.value) y <- append(y, prop.test(a,n,p2)$p.value) } x <- sort(x) y <- sort(y) plot(x, type='l', lwd=3, ylab='p-value') lines(y, col='red') legend(par('usr')[2], par('usr')[3], xjust=1, yjust=0, c('exact', 'approximation'), lwd=c(3,1), lty=1, col=c(par("fg"),'red')) title(main="Binomial test (H0 false)") %-- We remark that if H0 is true, the p-value is over-estimated (the test is too conservative, it is less powerful, it does not see anything significant while there is), but the wronger H0, the better the approximation. + Another binomial test TODO: alternative to the binomial test: glm(y~x, family=binomial) # Logistic regression (well, for p=0.5) Question: and for p != 0.5? You can do the same for multilogistic regression + Chi^2 test (very important) The binomial test is fine, but it does not generalize: it allows you to study a binary variable, nothing more. But sometimes we need a comparable test to study qualitative variables with more than 2 values or to study several qualitative variables. There is no such "exact multinomial test" (you can devise one, but you would have to implement it...): instead, one uses the approximate Chi2 test. The Chi2 test is a non parametric , non-rigorous (it is an approximation) test to compare distributions of qualitative variables. In spite of that, it is the most important discrete test. One can show that if (X1, X2, ..., Xr) is a multinomial random variable, then ( X_1 - n p_1 )^2 ( X_r - n p_r )^2 Chi^2 = ------------------- + ... + -------------------- n p_1 n p_r asymptotically follows a Chi^2 distribution with r-1 degrees of freedom. 
This tis is just an asymptotic result, that is sufficiently true if n >= 100 (the sample is large enough) n p_i >= 10 (the theoretical frequencies (counts) are not too small) In particular, we get another mock binomial test. %G p <- .3 col.values <- c(par('fg'),'red', 'blue', 'green', 'orange') n.values <- c(5,10,20,50,100) plot(0, type='n', xlim=c(0,1), ylim=c(0,1), xlab='exact', ylab='approximate') for (i in 1:length(n.values)) { n <- n.values[i] x <- NULL y <- NULL z <- NULL for (a in 0:n) { x <- append(x, binom.test(a,n,p)$p.value) y <- append(y, chisq.test(c(a,n-a),p=c(p,1-p))$p.value) z <- append(z, prop.test(a,n,p)$p.value) } o <- order(x) lines(x[o],y[o], col=col.values[i]) lines(x[o],z[o], col=col.values[i], lty=3) } legend(par('usr')[1],par('usr')[4], as.character(c(n.values, "prop.test", "chisq.test")), col=c(col.values, par('fg'), par('fg')), lwd=1, lty=c(rep(1,length(n.values)), 1,3) ) title(main="The binomial test and its approximations") %-- Or a mock multinomial test: let us check, with a simulation, that the p-values are close. %G # A Monte Carlo multinomial test multinom.test <- function (x, p, N=1000) { n <- sum(x) m <- length(x) chi2 <- sum( (x-n*p)^2/(n*p) ) v <- NULL for (i in 1:N) { x <- table(factor(sample(1:m, n, replace=T, prob=p), levels=1:m)) v <- append(v, sum( (x-n*p)^2/(n*p) )) } sum(v>=chi2)/N } multinom.test( c(25,40,25,25), p=c(.25,.25,.25,.25) ) # 0.13 chisq.test( c(25,40,25,25), p=c(.25,.25,.25,.25) ) # 0.12 N <- 100 m <- 4 n <- 10 p <- c(.25,.25,.1,.4) x <- NULL y <- NULL for (i in 1:N) { a <- table( factor(sample(1:m, n, replace=T, prob=p), levels=1:m) ) x <- append(x, multinom.test(a,p)) y <- append(y, chisq.test(a,p=p)$p.value) } plot(y~x) abline(0,1,col='red') title("Monte Carlo Multinomial Test and Chi^2 Test") %-- Here is the distribution of the p-values of a Chi^2 test. %G # We sample 10 subjects in a 4-class population. # We repeat the experiment 100 times. N <- 1000 m <- 4 n <- 10 p <- c(.24,.26,.1,.4) p.valeur.chi2 <- rep(NA,N) for (i in 1:N) { echantillon <- table(factor(sample(1:m, replace=T, prob=p), levels=1:m)) p.valeur.chi2[i] <- chisq.test(echantillon,p=p)$p.value } plot( sort(p.valeur.chi2), type='l', lwd=3 ) abline(0, 1/N, lty=3, col='red', lwd=3) title(main="p-values in a Chi^2 test") %-- + Independance Chi^2 Let us consider the following situation: we measure two qualitative variables each with two values on a sample. In the whole population, the four classes occur with proportions 10%, 20%, 60%, 10%. A B total C 10 20 30 D 60 10 70 total 70 30 100 We can get a sample as follows. TODO: this code is ugly... %G foo <- function (N) { population1 <- c(rep('A',10), rep('B',20), rep('A',60), rep('B',10)) population1 <- factor(population1, levels=c('A','B')) population2 <- c(rep('C',10), rep('C',20), rep('D',60), rep('D',10)) population2 <- factor(population2, levels=c('C','D')) o <- sample(1:100, N, replace=T) table( population2[o], population1[o] ) } a <- foo(1000) op <- par(mfcol=c(1,2)) plot( a, shade=T ) plot( t(a), shade=T ) par(op) %-- We would like to know wether these variables are independant or not. To do so, we take a sample (as always, we do not know the population direclty, we only know it through samples -- so we are not supposed to know the proportions I mentionned above); we compute the marginal proportions (i.e., the "total" row and column); the product of this row and column, so as to get the proportions one would observe if the variables were independant; we can then compare those proportions with the observed ones. 
> n <- 100 > a <- foo(n) > a/n A B C 0.09 0.15 D 0.65 0.11 > a1 <- apply(a/n,2,sum) # The "total" row > a1 A B 0.74 0.26 > a2 <- apply(a/n,1,sum) # The "total" column > a2 C D 0.24 0.76 > b <- a1 %*% t(a2) > b C D [1,] 0.1776 0.5624 [2,] 0.0624 0.1976 > chisq.test(as.vector(a),p=as.vector(b)) Chi-squared test for given probabilities data: as.vector(a) X-squared = 591.7683, df = 3, p-value = < 2.2e-16 TODO I think the syntax is different chisq.test(rbind(as.vector(a), as.vector(b))) Is the result the same? TODO Remark One can also use the Chi^2 for a homogeneity test, i.e., to check if two samples come from the same population. It IS an independance Chi^2 (independance between the variable and the sample number): the syntax is the same. chisq.test(rbind(as.vector(a), as.vector(b))) + Fisher test: independance of two qualitative variables Here, we want to check if two variables, given by a contingency table, are independant. It sounds like the Chi^2 test, but this time, it is an exact test, not an approximation. Let us take the example we had examined above with the Chi^2 test. > a A B C 9 15 D 65 11 > fisher.test(a) Fisher's Exact Test for Count Data data: a p-value = 1.178e-05 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.03127954 0.32486526 sample estimates: odds ratio 0.1048422 With a smaller sample, it is less clear. > fisher.test( foo(10) ) Fisher's Exact Test for Count Data data: foo(10) p-value = 1 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.003660373 39.142615141 sample estimates: odds ratio 0.3779644 %G n1 <- 10 n2 <- 100 N <- 1000 x1 <- rep(NA,N) x2 <- rep(NA,N) for (i in 1:N) { x1[i] <- fisher.test(foo(n1))$p.value x2[i] <- fisher.test(foo(n2))$p.value } plot( sort(x1), type='l', lwd=3, ylab='p-valeur') lines( sort(x2), col='blue', lwd=3 ) abline(0,1/N,col='red',lwd=3,lty=3) abline(h=c(0,.05),lty=3) abline(v=c(0,N*.05),lty=3) title(main="p-value of a Fisher test, H0 false") legend(par('usr')[1],par('usr')[4], c("n=10", "n=100"), col=c(par('fg'), 'blue'), lwd=3, lty=1) %-- %G foo <- function (N) { population1 <- c(rep('A',2), rep('B',8), rep('A',18), rep('B',72)) population1 <- factor(population1, levels=c('A','B')) population2 <- c(rep('C',2), rep('C',8), rep('D',18), rep('D',72)) population2 <- factor(population2, levels=c('C','D')) o <- sample(1:100, N, replace=T) table( population2[o], population1[o] ) } n1 <- 10 n2 <- 100 N <- 1000 x1 <- rep(NA,N) x2 <- rep(NA,N) for (i in 1:N) { x1[i] <- fisher.test(foo(n1))$p.value x2[i] <- fisher.test(foo(n2))$p.value } plot( sort(x1), type='l', lwd=3, ylab='p-valeur', ylim=c(0,1)) lines( sort(x2), col='blue', lwd=3 ) abline(0,1/N,col='red',lwd=3,lty=3) abline(h=c(0,.05),lty=3) abline(v=c(0,N*.05),lty=3) title(main="p-valueof a Fisher test, H0 true") legend(par('usr')[2], .2, xjust=1, yjust=0, c("n=10", "n=100"), col=c(par('fg'), 'blue'), lwd=3, lty=1) %-- * The Zoo of Statistical Tests: non-parametric tests + Sign test It is a nom-parametric test on the median of a random variable -- with no assumption on it. The idea is simple: we count the number of values that are above the proposed median -- we know that of ot is the actual median, this number follows a binomial distribution, because each value has exactly one chance out of two to be above it. I have not found a function to perform this test, so I wrote my own. 
* The Zoo of Statistical Tests: non-parametric tests

+ Sign test

It is a non-parametric test on the median of a random variable -- with
no assumption on its distribution. The idea is simple: we count the
number of values that are above the proposed median -- we know that, if
it is the actual median, this number follows a binomial distribution,
because each value has exactly one chance out of two of being above it.

I have not found a function to perform this test, so I wrote my own.

  sign.test <- function (x, mu=0) {
    # Does not handle NAs, nor values equal to mu
    n <- length(x)
    y <- sum(x > mu)          # number of values above the proposed median
    binom.test(y, n, p=0.5)   # under H0, y ~ Binomial(n, 1/2)
  }

A quick simulation (code omitted) compares, for various sample sizes n,
the empirical probability of wrongly rejecting H0 at the 5% level with
its theoretical value:

  > res
        Empirical Value Theoretical Value   n
   [1,]            0.06              0.05  96
   [2,]            0.04              0.05 138
   [3,]            0.05              0.05   6
   [4,]            0.06              0.05 135
   [5,]            0.03              0.05 138
   [6,]            0.04              0.05  83
   [7,]            0.03              0.05 150
   [8,]            0.05              0.05  91
   [9,]            0.05              0.05 144
  [10,]            0.04              0.05 177

+ Wilcoxon's U test: comparing two "means"

TODO: read and correct (if needed) this part.

It is a non-parametric test: we do not know (or assume) anything about
the distribution of the variables; in particular, we think it is not
gaussian (to check this, look at a quantile-quantile plot or perform a
Shapiro--Wilk test) -- otherwise, we would use Student's T test, which
is more powerful. However, there IS an assumption: the distribution is
symmetric (if it is not, consider the sign test). For this reason, we
can speak of the mean or of the median interchangeably -- for the whole
distribution they coincide (but for a sample, which may be asymmetric,
they differ).

TODO: the following description is not that of the Wilcoxon test I know,
which assumes the variable is symmetric, considers the variables
Xij = (Xi + Xj)/2 and counts the number of those variables that are
above the proposed mean.

The recipe is as follows. Take the two samples, concatenate them and
sort them. Then, look whether the two samples are "well shuffled" or
whether the elements of one sample are rather at the beginning while
those of the other are rather at the end. The null hypothesis is
H0: "P(X1_i > X2_j) = 0.5".

We first sort each sample (separately) and compute

  U1 = number of pairs (i,j) such that X1_i > X2_j
       + (1/2) * number of pairs (i,j) such that X1_i = X2_j
  U2 = number of pairs (i,j) such that X2_j > X1_i
       + (1/2) * number of pairs (i,j) such that X1_i = X2_j
  U  = min(U1, U2)

Here is another way of computing this: concatenate the two samples, rank
them, and set

  R1 = sum of the ranks of the first sample
  R2 = sum of the ranks of the second sample
  U2 = n1*n2 + n1(n1+1)/2 - R1
  U1 = n1*n2 + n2(n2+1)/2 - R2
  U  = min(U1, U2)

Here is an example from the manual page (here, we imagine that prior
information suggests that x > y, so we choose an asymmetric alternative
hypothesis).

  help.search("wilcoxon")
  ?wilcox.test

> x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30)
> y <- c(0.878, 0.647, 0.598, 2.05, 1.06, 1.29, 1.06, 3.14, 1.29)
> wilcox.test(x, y, paired = TRUE, alternative = "greater",
+             conf.level=.95, conf.int=T)

        Wilcoxon signed rank test

data:  x and y
V = 40, p-value = 0.01953
alternative hypothesis: true mu is greater than 0
95 percent confidence interval:
 0.175   Inf
sample estimates:
(pseudo)median
          0.46

Let us check, on an example, that the test remains valid for
non-gaussian distributions.

  N <- 1000
  n <- 4
  v <- vector()
  w <- vector()
  for (i in 1:N) {
    x <- runif(n, min=-10, max=-9) + runif(n, min=9, max=10)
    v <- append(v, wilcox.test(x)$p.value)
    w <- append(w, t.test(x)$p.value)
  }
  sum(v>.05)/N
  sum(w>.05)/N

We get 1 and 0.93: we make fewer type I errors with Wilcoxon's test, but
its power is lower, i.e., we miss many opportunities of rejecting the
null hypothesis:

  # Probability of rejecting H0 when H0 is false (power)
  N <- 1000
  n <- 5
  v <- vector()
  w <- vector()
  for (i in 1:N) {
    x <- runif(n, min=0, max=1)
    v <- append(v, wilcox.test(x)$p.value)
    w <- append(w, t.test(x)$p.value)
  }
  sum(v<.05)/N
  sum(w<.05)/N

We get 0 (the power of the test is zero: it never rejects the null
hypothesis -- if the sample is very small and we know nothing about the
distribution, we cannot say much) against 0.84.

Let us consider a more ungaussian distribution.
  # Probability of rejecting H0 (here, H0 is actually true: the
  # distribution is centred at zero, so these are type I error rates)
  N <- 1000
  n <- 5
  v <- vector()
  w <- vector()
  for (i in 1:N) {
    x <- runif(n, min=-10, max=-9) + runif(n, min=9, max=10)
    v <- append(v, wilcox.test(x)$p.value)
    w <- append(w, t.test(x)$p.value)
  }
  sum(v<.05)/N
  sum(w<.05)/N

We get 0 and 0.05. For n=10, we would get 0.05 in both cases.

Let us now check the confidence interval.

  N <- 1000
  n <- 3
  v <- vector()
  w <- vector()
  for (i in 1:N) {
    x <- runif(n, min=-10, max=-9) + runif(n, min=9, max=10)
    r <- wilcox.test(x, conf.int=T)$conf.int
    v <- append(v, r[1]<0 & r[2]>0)
    r <- t.test(x)$conf.int
    w <- append(w, r[1]<0 & r[2]>0)
  }
  sum(v)/N
  sum(w)/N

We get 0.75 and 0.93.

TODO: I do not understand. As the U test is non-parametric, it should
give larger confidence intervals and make fewer mistakes. Yet the
confidence interval of the Wilcoxon test covers the true value much less
often than that of Student's T test. ???
However, these simulations show that Student's T test is robust.

TODO: read what I just wrote and check the power.

The following example tests whether the variable is symmetric around its
mean.

> x <- rnorm(100)^2
> x <- x - mean(x)
> wilcox.test(x)

        Wilcoxon signed rank test with continuity correction

data:  x
V = 1723, p-value = 0.005855
alternative hypothesis: true mu is not equal to 0

Idem for the median.

> x <- x - median(x)
> wilcox.test(x)

        Wilcoxon signed rank test with continuity correction

data:  x
V = 3360.5, p-value = 0.004092
alternative hypothesis: true mu is not equal to 0

If there are more than two samples, you can use the Kruskal--Wallis
test.

  ?kruskal.test

+ Kolmogorov--Smirnov test (comparing two distributions)

We want to check whether two quantitative random variables follow the
same distribution.

> ks.test( rnorm(100), 1+rnorm(100) )

        Two-sample Kolmogorov-Smirnov test

data:  rnorm(100) and 1 + rnorm(100)
D = 0.43, p-value = 1.866e-08
alternative hypothesis: two.sided

> ks.test( rnorm(100), rnorm(100) )

        Two-sample Kolmogorov-Smirnov test

data:  rnorm(100) and rnorm(100)
D = 0.11, p-value = 0.5806
alternative hypothesis: two.sided

> ks.test( rnorm(100), 2*rnorm(100) )

        Two-sample Kolmogorov-Smirnov test

data:  rnorm(100) and 2 * rnorm(100)
D = 0.19, p-value = 0.0541
alternative hypothesis: two.sided

The idea: we consider the empirical cumulative distribution functions of
the two samples; the test statistic D is the maximum (vertical) distance
between them.

+ Shapiro--Wilk test

This test checks whether a random variable is gaussian. It might seem to
be a special case of the Kolmogorov--Smirnov test but, actually, we
compare a random variable with the whole family of gaussian
distributions, without specifying the mean and variance, while the K-S
test requires a completely specified distribution.

> shapiro.test(rnorm(10))$p.value
[1] 0.09510165
> shapiro.test(rnorm(100))$p.value
[1] 0.8575329
> shapiro.test(rnorm(1000))$p.value
[1] 0.1853919
> shapiro.test(runif(10))$p.value
[1] 0.5911485
> shapiro.test(runif(100))$p.value
[1] 0.0002096377
> shapiro.test(runif(1000))$p.value
[1] 2.385633e-17

It is a good idea to look at the quantile-quantile plot to see what
happens, because the data might be non-gaussian in a benign way: either,
as here, because the data are less dispersed than gaussian data, or
because the deviation from a gaussian distribution is statistically
significant but practically negligible -- quite common if the sample is
very large.

+ Other non-parametric tests

There is a wealth of other tests; we shall not detail them and merely
refer to the manual.
  ?kruskal.test
  ?ansari.test
  ?mood.test
  ?fligner.test
  library(help=ctest)
  help.search('test')

* Estimators

TODO: State the structure of this part -- and reorder it...
  1. Generalities about estimators
  2. Least Squares estimators. Example: NLS.
  3. Maximum Likelihood Estimators (also: REML, Penalized Likelihood)
  4. GMM Estimators

Quite often, the data we are studying (the "statistical series") is not
the whole population but just a sample of it -- e.g., when you study
birds on an island, you will not measure all of them, you will simply
catch a few dozen specimens and work on them. You can easily compute the
statistical parameters of this sample (mean, standard deviation, etc.),
but they are only approximations of those parameters for the whole
population: how can we measure the precision of those approximations?

+ Examples

Here are a few concrete examples of this situation.

An industrialist must choose a variety of maize. It will be used as
(part of) an animal feed and we want the variety that contains the most
protein: we want to know the average protein content of each variety --
not that of the sample at hand, but that of the variety as a whole. We
might find that the sample of one variety has a higher protein content
than the sample of another: but were the samples sufficiently large, was
the difference sufficiently significant to extrapolate the result to the
whole population?

An industrialist must choose a variety of maize. It will be sold for
human consumption and, to ease its conditioning and packaging, we want
all the ears to have the same size, i.e., we want the standard deviation
to be as small as possible. If we find that the standard deviation of
the sample of one variety is smaller than that of another variety, can
we extrapolate the result to the varieties as a whole? Were the samples
sufficiently large? Was the difference between the standard deviations
significant?

+ Modelling a statistical experiment

In simple terms: we draw, at random, subjects from a population and we
measure, on the resulting sample, the statistical variables of interest
(size, protein content, etc.).

In algorithmic terms (a small implementation of this recipe is sketched
at the end of this section):

  Repeat a large number of times:
    Take a sample from the population
    Estimate, on this sample, the quantity of interest
      (this quantity is called a "statistic")
  Compare those estimations (you can plot them as a histogram)
  with the value on the whole population.

In more mathematical terms: let X1, X2, ..., Xn be independent,
identically distributed random variables (iidrv). From the values those
variables take at a single point of the universe, we try to get some
information about their common distribution. For instance, we expect
(X1+...+Xn)/n to be an estimation of the mean of this distribution: is
it really the case? In what sense is it an "estimation"? What is the law
of this sample mean? How can we measure the quality of this estimation?

+ Some vocabulary

+ Estimator

Let us consider a statistical experiment, i.e., iidrv X1, X2, ..., Xn;
we assume we know the family their distribution belongs to, but not its
parameters (for instance, we know it is a gaussian distribution, but we
do not know the mean, or the variance -- or even both). An estimator is
a function of X1, X2, ..., Xn that gives an estimation of such a
parameter. (If we see the random variables X1, X2, ..., Xn as maps from
the universe Omega to the real line R, an estimator is the random
variable obtained by composing (X1,...,Xn) with a map R^n ---> R.)

For instance, (X1+...+Xn)/n is an estimator of the mean.
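As announced above, here is a small sketch (my own illustration) of the
sampling-experiment recipe, applied to this estimator of the mean:

%G
# Sampling distribution of the sample mean (X1+...+Xn)/n, by simulation.
# The "population" is gaussian with mean 2 and standard deviation 3.
n <- 20        # sample size
N <- 10000     # number of repetitions of the experiment
estimations <- rep(NA, N)
for (i in 1:N) {
  s <- rnorm(n, mean=2, sd=3)   # take a sample from the population
  estimations[i] <- mean(s)     # estimate the quantity of interest
}
hist(estimations, breaks=50, probability=T, col='light blue',
     main="Sampling distribution of the sample mean")
abline(v=2, col='red', lwd=3)   # the population mean
%--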
(Remark: we said that it was an estimator of the mean, but nowhere in
the definition does the mean appear. Indeed, it is also an estimator of
the variance, of the standard deviation, or of any quantity we could
think of -- but we will see shortly that it is a "good" estimator of the
mean and a bad estimator of all those other quantities.)

Quite often, estimators come in families; for instance, the formula
(X1+...+Xn)/n is a family of estimators of the mean: we have one for
each n.

+ Unbiased estimator

An unbiased estimator is an estimator whose expectation is indeed the
value of the parameter. For instance, the sample mean (X1+...+Xn)/n is
an unbiased estimator of the population mean. On the contrary, the
variance of the sample is a biased estimator of the population variance
(but you can easily transform it into an unbiased estimator: just
replace the "n" in the denominator by "n-1").

We can check this with a small simulation, as follows. Let us draw 5
numbers at random, following a gaussian distribution of mean 0 and
variance 1. If we repeat this experiment 10000 times, we get (an
estimation of) the expectation of the empirical mean: it is roughly
equal to the population mean (here, 0).

  x <- vector()
  for (i in 1:10000) {
    x <- append(x, mean(rnorm(5)))
  }
  mean(x)
  [1] 9.98948e-05

If we proceed likewise with the variance computed with an "n"
denominator (the "population variance" formula applied to the sample),
we see that its expectation is noticeably different from the population
variance (we get 0.8 instead of 1): this estimator is biased.

  n <- 5
  x <- vector()
  for (i in 1:10000) {
    t <- rnorm(n)
    t <- t - mean(t)
    t <- t*t
    x <- append(x, sum(t)/n)
  }
  mean(x)
  [1] 0.806307

To get an unbiased estimator of the variance, it suffices to replace the
"n" in the definition of the variance by "n-1". These two notions of
variance are called "population variance" and "sample variance".

                          1
  Population variance = --- * Sum ( X_i - mean(X_j) )^2
                          n    i

                            1
  Sample variance     = ----- * Sum ( X_i - mean(X_j) )^2
                          n-1    i

Let us check, with another simulation, that this is indeed unbiased.

  n <- 5
  y <- vector()
  for (i in 1:10000) {
    t <- rnorm(n)
    t <- t - mean(t)
    t <- t*t
    y <- append(y, sum(t)/(n-1))
  }
  mean(y)
  [1] 1.000129

But this is not the only unbiased estimator of the variance: if we use
the first formula with the actual mean (that of the population, not that
of the sample), the estimator is also unbiased (but this is unrealistic:
there is no reason why we should know the population mean).

  n <- 5
  z <- vector()
  for (i in 1:10000) {
    t <- rnorm(n)
    t <- t - 0
    t <- t*t
    z <- append(z, sum(t)/n)
  }
  mean(z)
  [1] 1.001210

We can graphically compare those three estimators of the variance.

%G
n <- 5
x <- y <- z <- vector()
for (i in 1:10000) {
  t <- rnorm(n)
  z <- append(z, sum(t*t)/n)
  t <- t - mean(t)
  t <- t*t
  x <- append(x, sum(t)/n)
  y <- append(y, sum(t)/(n-1))
}
boxplot(x, y, z)
%--

%G
plot(density(x))
points(density(y), type="l", col="red")
points(density(z), type="l", col="blue")
%--

%G
op <- par( mfrow = c(3,1) )
hist(x, xlim=c(0,5), breaks=20)
hist(y, xlim=c(0,5), breaks=20)
hist(z, xlim=c(0,5), breaks=20)
par(op)
%--

Their square roots (three estimators of the standard deviation) are more
symmetric.
%G
x <- sqrt(x)
y <- sqrt(y)
z <- sqrt(z)
boxplot(x, y, z)
%--

%G
plot(density(x))
points(density(y), type="l", col="red")
points(density(z), type="l", col="blue")
%--

%G
op <- par( mfrow = c(3,1) )
hist(x, xlim=c(0,5), breaks=20)
hist(y, xlim=c(0,5), breaks=20)
hist(z, xlim=c(0,5), breaks=20)
par(op)
%--

Question: are these estimators of the standard deviation biased? Check
your answer with a few simulations. Do you understand why? (Hint: they
are all biased. The expectation is an integral and, in general, the
integral of a square root is not the square root of the integral; more
precisely, since the square root is concave, Jensen's inequality tells
us that E[sqrt(V)] <= sqrt(E[V]), so the square root of an unbiased
estimator of the variance underestimates the standard deviation.)

There are many other notions, besides bias, that measure the quality of
an estimator or of a sequence of estimators. For instance, an estimator
is "consistent" if, when the sample size grows, it converges (in
probability) to the desired value. Thus, the weak law of large numbers
states that the sample mean is a consistent estimator of the population
mean.

+ Maximum Likelihood Estimators (MLE)

But where do we find the estimators in the first place? For simple
quantities, such as the mean or the variance, we have a formula for the
whole population and we hope it will work for a sample. For the mean, it
works; for the variance, it yields a biased estimator, but we can tweak
it into an unbiased one. But if we want to find a parameter of a more
complicated law, for which we have no "population formula" (imagine, for
instance, that you want to find the parameter(s) of an exponential,
Poisson, Gamma, Beta, Weibull, Binomial, etc. distribution), how do we
proceed?

The Maximum Likelihood method is a recipe to find such a candidate
estimator. We do not know whether it will have "good" properties -- in
particular, it will often be biased -- but it is a good start. The idea
is to find the value of the parameter for which the probability of
observing the results actually observed in the sample is maximal -- this
quantity is called the "likelihood": we maximize it.

For instance, let us use this method to find an estimator of the mean of
a gaussian random variable of variance 1 from a 5-element sample
(actually, in this case, we get the sample mean -- the method is useful
in more complex situations).

%G
# The population mean
m <- runif(1, min=-1, max=1)

# The n-element sample
n <- 5
v <- rnorm(n, mean=m)

# Likelihood
N <- 1000
l <- seq(-2, 2, length=N)
y <- vector()
for (i in l) {
  y <- append(y, prod(dnorm(v, mean=i)))
}
plot(y ~ l, type='l')

# Population mean
points(m, prod(dnorm(v, mean=m)), lwd=3)

# Sample mean
points(mean(v), prod(dnorm(v, mean=mean(v))), col='red', lwd=3)
%--

The "optimize" function finds such a maximum, in dimension 1 (here, we
simply maximize the standard gaussian density itself, whose maximum is
at 0).

> optimize(dnorm, c(-1,2), maximum=T)
$maximum
[1] 1.307243e-05

$objective
[1] 0.3989423

In higher dimensions, you can use the "optim" function (it looks for a
minimum):

  f <- function (arg) {
    x <- arg[1]
    y <- arg[2]
    (x-1)^2 + (y-3.14)^2
  }
  optim(c(0,0), f)

We get:

  $par
  [1] 0.9990887 3.1408265

  $value
  [1] 1.513471e-06

  $counts
  function gradient
        59       NA

  $convergence
  [1] 0

  $message
  NULL

We can ask for a specific optimization method, for instance, simulated
annealing -- actually, in this example, we get an imprecise result.

  optim(c(0,0), f, method="SANN")

One can also use the "nlm" function; a minimal example follows.
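(This short sketch is my own addition, on the same toy function f as
above; "nlm" performs a Newton-type minimization and returns the
minimizer in its $estimate component.)

  # Minimize f with nlm(), starting from the origin
  f <- function (arg) {
    x <- arg[1]
    y <- arg[2]
    (x-1)^2 + (y-3.14)^2
  }
  nlm(f, c(0,0))   # $estimate should be close to c(1, 3.14)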
Let us consider the eruption lengths of the Faithful geyser and let us
try to approximate them as a mixture of two gaussian variables.

%G
f <- function (x, p, m1, s1, m2, s2) {
  p*dnorm(x, mean=m1, sd=s1) + (1-p)*dnorm(x, mean=m2, sd=s2)
}
data(faithful)
fn <- function(arg) {
  prod(f(faithful$eruptions, arg[1], arg[2], arg[3], arg[4], arg[5]))
}
start <- c(.5,
           min(faithful$eruptions), var(faithful$eruptions),
           max(faithful$eruptions), var(faithful$eruptions))
p <- optim(start, function(a){ -fn(a)/fn(start) },
           control=list(trace=1))$par
hist(faithful$eruptions, breaks=20, probability=T, col='light blue')
lines(density(faithful$eruptions, bw=.15), col='blue', lwd=3)
curve(f(x, p[1], p[2], p[3], p[4], p[5]), add=T, col='red', lwd=3)
#curve(dnorm(x, mean=p[2], sd=p[3]), add=T, col='red', lwd=3, lty=2)
#curve(dnorm(x, mean=p[4], sd=p[5]), add=T, col='red', lwd=3, lty=2)
legend(par('usr')[2], par('usr')[4], xjust=1,
       c('sample density', 'theoretical density'),
       lwd=3, lty=1,
       col=c('blue', 'red'))
%--

We have just seen, in the previous examples, that the likelihood is
often defined as a product, with as many factors as observations. As
such large products are numerically unwieldy, we usually prefer to work
with their logarithm.

TODO: the "fitdistr" function, in the MASS package, computes univariate
maximum likelihood estimators.

TODO: The "mle" function in the "stats4" package is a wrapper around
"optim" that provides functions to compute the information matrix, the
covariance matrix, confidence intervals, etc.

+ log-likelihood

Here, by "log-likelihood" we shall often mean not the logarithm of the
likelihood itself but -2 log(likelihood): it is the quantity we try to
minimize in the maximum likelihood method. You might wonder where the
"-2" coefficient comes from: for gaussian variables, -2 log(likelihood)
is, up to an additive constant, the sum of squares of the residuals
divided by the variance.

+ Fisher information

If L is the likelihood and p the parameter to estimate, then the "score"
is defined as

        d log L
  U = -----------
          d p

and the Fisher information as

               (  d^2 log L  )
  I(p) = - E  ( ------------- ).
               (    d p^2     )

The Fisher information measures the sharpness of the peak of the
log-likelihood at its maximum.

+ Likelihood Ratio (LR) test

TODO: I have NOT said anything about "H0" and "H1"...

We remark that

                   L under H0
  LR = - 2 log  ---------------------------
                   L with the MLE parameters

                   L(b0)
     = - 2 log  -----------      (if b is the parameter to estimate
                   L(b)           and b0 its value under H0)

where L is the likelihood, approximately follows a Chi^2 distribution
with m degrees of freedom, where m is the number of parameters fixed by
the null hypothesis.

TODO: explain why, in the gaussian case, it IS a Chi^2 distribution.
TODO: explain how we estimate the variance of the estimator.
TODO: an example (take one gaussian distribution and model it as a
mixture of two gaussians; take a mixture of two gaussian distributions
and model it as a mixture of two gaussians; in both cases, test whether
the means of the two gaussian components are the same). A minimal sketch
for the simplest case -- testing the mean of a single gaussian sample --
is given at the end of this section.

The "Wald test" and the "score test" are approximations of the
Likelihood Ratio (LR) test. More precisely, if we denote by b the
parameter to estimate, the "score" (it is a vector) is

                              ( d log L(b) )
  U(b) = Nabla log L(b)  =   ( ------------ )
                              (   d b_i     ) i

and the "information" (it is a matrix) is

            (        d^2                )
  I(b)  =   (  - ------------- log L(b) )
            (    d b_i d b_j            ) i,j

The "Wald statistic" is an order-2 Taylor expansion of the Likelihood
Ratio (LR),

  W = (b - b0)' * I(b) * (b - b0),

and the "Score statistic" (it does not depend on b: it is very
imprecise, but very fast to compute) is

  S = U(b0)' * I(b0)^-1 * U(b0).
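Here is a minimal sketch of a likelihood ratio test in the simplest
setting (my own toy example, not the mixture example the TODO above asks
for): we test H0: mu = 0 for a gaussian sample, estimating the variance
by maximum likelihood under each hypothesis.

  # Likelihood ratio test of H0: mu = 0 for a gaussian sample
  set.seed(1)
  x <- rnorm(50, mean=0.3)
  loglik <- function (mu, sigma) sum(dnorm(x, mean=mu, sd=sigma, log=TRUE))
  # Log-likelihood at the MLE under H1 (mu free) and under H0 (mu = 0);
  # the MLE of the variance uses a 1/n denominator
  l1 <- loglik(mean(x), sqrt(mean((x - mean(x))^2)))
  l0 <- loglik(0,       sqrt(mean(x^2)))
  LR <- -2 * (l0 - l1)
  pchisq(LR, df=1, lower.tail=FALSE)   # approximate p-value
  t.test(x)$p.value                    # for comparison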
+ MLE: mode

We can see the likelihood as the probability density function of the
parameter, conditional on the sample (strictly speaking, this is only
true in a bayesian setting, with a flat prior and after normalization).

TODO: plot

The idea of the Maximum Likelihood method is to take the mode of this
function. But there are a few problems. First, we completely forget the
distribution of this estimator: we often look at its variance but, if it
is not normal, this is far from sufficient. (We can use this density to
measure the quality of the estimator or to do simulations (parametric
bootstrap).) Besides, the mode is not always a good choice. Why not take
the mean (this is called bagging -- more about this later) or the
median? What do we do if the mode is not unique? Or, more realistically,
if the likelihood has several local maxima? Or a high narrow peak (so
narrow that its probability is low) and a wider, smaller peak (so that
its probability is high)?

%G
op <- par(mfrow=c(2,2))
curve(dnorm(x-5) + dnorm(x+5), xlim=c(-10,10))
curve(dnorm(x-5) + .4*dnorm(x+5), xlim=c(-10,10))
curve(dnorm(5*(x-5)) + .5*dnorm(x+5), xlim=c(-10,10))
par(op)
%--

However, I have not managed to devise a situation in which we really
observe such a phenomenon (if you have such an example, tell me).
Actually, such a situation might be a sign that the model is wrong.

Ah, I finally got such an example. I simply took a statistic that does
not extrapolate well from a sample to a population: the maximum. (The
statistics for which we might want to find an estimator may, or may not,
have pleasant properties: the maximum is one of the worst statistics you
can imagine.)

%G
get.sample <- function (n=10, p=1/100) {
  ifelse( runif(n)>p, runif(n), 2 )
}
N <- 1000
d <- rep(NA, N)
for (i in 1:N) {
  d[i] <- max(get.sample())
}
hist(d, probability=T, ylim=c(0, max(density(d)$y)), col='light blue')
lines(density(d), type='l', col='red', lwd=3)
%--

%G
get.sample <- function (n=100, p=1/100) {
  ifelse( runif(n)>p, runif(n), 2 )
}
N <- 1000
d <- rep(NA, N)
for (i in 1:N) {
  d[i] <- max(get.sample())
}
#hist(d, breaks=seq(0,3,by=.02),
#     probability=T, ylim=c(0, max(density(d)$y)), col='light blue')
#lines(density(d), type='l', col='red', lwd=3)
plot(density(d), type='l', col='red', lwd=3)
%--

%G
get.sample <- function (n=100, p=1/100) {
  ifelse( runif(n)>p, runif(n), 2+2*runif(n) )
}
N <- 1000
d <- rep(NA, N)
for (i in 1:N) {
  d[i] <- max(get.sample())
}
#hist(d, breaks=seq(0,4,by=.05),
#     probability=T, ylim=c(0, max(density(d)$y)), col='light blue')
#lines(density(d), type='l', col='red', lwd=3)
plot(density(d), type='l', col='red', lwd=3)
%--

However, some methods replace the estimator (a single value) by a whole
probability distribution: on the computational side, all the methods
relying on the bootstrap (e.g., bagging); on the theoretical side, the
"bayesian methods".

+ Bayesian methods

We have just seen that it could be interesting to have, not an estimator
(a single number, possibly with its variance), but a whole probability
distribution. Bayesian methods push this remark a little further. We
assume that the parameters of the model were chosen at random according
to a given distribution -- the prior distribution. Bayesian methods then
try to find the posterior distribution, i.e., the distribution of those
parameters given the actually observed sample.

TODO: write this up.

  Z: data
  t: parameters
  The model gives us P(Z|t).
  We assume we know P(t) (the prior distribution).
  We then compute P(t|Z) (the posterior distribution) from P(Z|t) and
  P(t), with the Bayes formula -- hence the name:

    P(t|Z) = P(Z|t) P(t) / P(Z),
    where P(Z) = Integral of P(Z|t) P(t) dt.
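As a concrete toy illustration of this update (my own example, not part
of the original notes), here is the conjugate Beta prior / binomial
likelihood case, where the posterior is available in closed form:

%G
# Bayesian update of a binomial proportion with a Beta prior.
# With a Beta(a,b) prior and k successes out of n trials, the posterior
# is Beta(a+k, b+n-k) (conjugacy) -- no numerical integration needed.
a <- 1; b <- 1                  # flat prior
k <- 7; n <- 10                 # hypothetical data
curve(dbeta(x, a, b), 0, 1, lty=3, ylim=c(0, 3.5),
      xlab="p", ylab="density")
curve(dbeta(x, a+k, b+n-k), add=TRUE, lwd=3, col="blue")
legend("topleft", c("prior", "posterior"),
       lty=c(3,1), lwd=c(1,3), col=c(par('fg'), "blue"))
title(main="Prior and posterior distributions")
%--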
TODO: MCMC (Markov Chain Monte Carlo) to sample from the posterior
distribution

  parametric bootstrap = computational implementation of the maximum
                         likelihood method
  MLE              = mode
  bagged estimator = mean

TODO: EM algorithm (to find the MLE in a 2-component mixture model)
  1. Find initial estimates (m1, m2: random; s1, s2: sd of the whole
     sample).
  2. Compute the responsibilities:
       gamma(i)   = probability that observation i comes from the
                    second component
       1-gamma(i) = probability that observation i comes from the
                    first component
  3. New estimates = weighted means and variances (use gamma(i),
     resp. 1-gamma(i), as weights).
  4. Iterate.
TODO: you can generalize this algorithm to other models.
TODO: this is a special case of REML
TODO: put those more complicated examples in the "bootstrap"
chapter/section?

TODO: remark that bayesian methods can be applied iteratively.
  Initial data --> prior
  New data     --> posterior
  Update prior
  New data     --> posterior
  Update prior
  ...

TODO: The theory behind the Kalman Filter (State Space Models) is
bayesian.

+ Bayesian statistics and quantum mechanics

Bayesian statistics are very similar to quantum mechanics: in quantum
mechanics, reality is a superposition of many possible realizations,
each with its probability; similarly, the result of a bayesian analysis
is a whole probability distribution, giving all the possible answers,
each with its probability.

+ Bayesian Networks

TODO

This is very similar to neural networks, but we can interpret the
resulting model and add (prior) information to it.

+ Fuzzy logic

TODO
(Notes taken while reading Matlab's documentation)

(1) Fuzzy Inference Systems

The idea behind Fuzzy Inference Systems is to replace boolean values by
continuously varying values -- think "probabilities".

%G
curve(ifelse(x < .3, 0, ifelse(x > .7, 0, 1)),
      lwd = 3, col = "blue", xlim = c(0,1),
      main = "A classical membership function",
      xlab="", ylab="")
abline(h = 0, lty = 3)
%--

%G
curve( dnorm( x - .5, sd = .1 ) / dnorm(0, sd=.1),
       lwd = 3, col = "blue", xlim = c(0, 1),
       main = "A fuzzy membership function",
       xlab="", ylab="" )
abline(h=0, lty=3)
%--

Boolean operators and deduction rules can then be fuzzified:

  AND     becomes MIN,
  OR      becomes MAX,
  NOT(x)  becomes 1-x,
  THEN    becomes MIN.

Note that there might be incoherences in the rules -- it is not a
problem. You can tune the fuzzy inference system by changing the
membership functions (or even the boolean operators). As with bayesian
methods, the result is a probability distribution, not a single number.
So far, it looks like bayesian networks, with a very simple, fixed
network structure, and no learning capabilities.

(2) Adaptive Neuro-Fuzzy Inference Systems (ANFIS)

Actually, you do not have to provide the set of rules: you can simply
take your data, cluster it, and ask the computer to write a rule for
each cluster. This looks very similar to nearest-neighbour methods --
and, if you impose some parsimony, it can be seen as a machine-learning
analogue of Generalized Additive Models (GAM, "mgcv" package in R).

(3) Fuzzy C-means clustering

This is a generalization of the k-means clustering algorithm in which
cluster membership is fuzzy -- i.e., is a probability. (In R, it is
implemented in the "fanny" (cluster package) and "cmeans" (e1071
package) functions.)
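A minimal sketch (my own, assuming the recommended "cluster" package is
installed) of fuzzy clustering with "fanny", on the Faithful data used
earlier:

%G
library(cluster)
data(faithful)
r <- fanny(scale(faithful), k=2)
head(r$membership)     # degrees of membership ("probabilities") per cluster
plot(faithful, col=max.col(r$membership),
     main="Fuzzy (fanny) clustering of the Faithful data")
%--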
+ Influence curve

Let us consider an estimator U of a parameter u of a distribution F. The
problem is that we do not know F exactly (more realistically: we do know
F, but the population does not exactly follow this distribution; it is
merely an estimation of the distribution of the data).

To assess the robustness of the estimator, we can check how it changes
when we modify F at a single point. More precisely, we compute

                 U( (1-h) F + h 1_x ) - U(F)
  w(x) = lim    -----------------------------
         h->0                 h

where 1_x is the Dirac measure at x.

TODO: example with the mean
TODO: example with the variance
TODO: example with the correlation (in dimension 2)
      ?persp
TODO: empirical estimation of the influence function with the bootstrap.
      library(boot)
      ?empinf

TODO: TO SORT?

+ Method of Moments Estimation (MME)

This is another recipe to get an estimator. If there is a single
parameter, we choose it so that the first moment (i.e., the mean) of the
variable coincides with the sample mean. If there are k parameters, we
choose them so that the first k moments coincide with the sample moments
(I recall that the moments of a real random variable X are the
expectations of X^k). As with the Maximum Likelihood method, we do not
know anything about the quality of the resulting estimator -- it is
often biased.

Let us use this method to study the Faithful geyser eruption durations.
Let X0 be a Bernoulli random variable (i.e., a random variable with two
values, 0 and 1, or "heads" and "tails") with parameter p, and X1, X2
gaussian random variables with means m1, m2 and standard deviations s1,
s2. We try to write our data Y in the form

  Y = X0 X1 + (1-X0) X2.

Noting that X0^2 = X0, (1-X0)^2 = 1-X0 and X0 (1-X0) = 0, one can show
that

  Y^2 = X0 X1^2 + (1-X0) X2^2
  Y^3 = X0 X1^3 + (1-X0) X2^3
  Y^4 = X0 X1^4 + (1-X0) X2^4
  Y^5 = X0 X1^5 + (1-X0) X2^5.

As X0, X1, X2 are independent,

  E(Y)   = E(X0) E(X1)   + E(1-X0) E(X2)
  E(Y^2) = E(X0) E(X1^2) + E(1-X0) E(X2^2)
  E(Y^3) = E(X0) E(X1^3) + E(1-X0) E(X2^3)
  E(Y^4) = E(X0) E(X1^4) + E(1-X0) E(X2^4)
  E(Y^5) = E(X0) E(X1^5) + E(1-X0) E(X2^5)

But as

  E(X0)   = p
  E(X1)   = m
  E(X1^2) = m^2 + s^2
  E(X1^3) = 3 m s^2 + m^3
  E(X1^4) = 3 s^4 + 6 s^2 m^2 + m^4
  E(X1^5) = 15 s^4 m + 10 s^2 m^3 + m^5

(with m = m1, s = s1, and similarly for X2), we get

  E(Y)   = p m1 + (1-p) m2
  E(Y^2) = p (m1^2 + s1^2) + (1-p) (m2^2 + s2^2)
  E(Y^3) = p (3 m1 s1^2 + m1^3) + (1-p) (3 m2 s2^2 + m2^3)
  E(Y^4) = p (3 s1^4 + 6 s1^2 m1^2 + m1^4)
           + (1-p) (3 s2^4 + 6 s2^2 m2^2 + m2^4)
  E(Y^5) = p (15 s1^4 m1 + 10 s1^2 m1^3 + m1^5)
           + (1-p) (15 s2^4 m2 + 10 s2^2 m2^3 + m2^5)

We can then ask R to find, numerically, the values of those five
parameters. (Exercise left to the reader.)

TODO
There are functions to find the minimum of a function of several
variables (optimize, optim, nlm, nls), and functions to find the zeros
of a function of a single variable (uniroot, polyroot), but what about
the zeros of a function of several variables, i.e., how do we
numerically solve a non-linear system? (One can do that with the Newton
algorithm, as in dimension 1 -- the only difference being that the
derivative is a matrix; a small sketch follows.)
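Here is such a sketch (my own illustration; if the "rootSolve" package
is installed, its "multiroot" function does the same job): a naive
Newton iteration with a numerically estimated Jacobian, applied to the
two moment equations of a plain gaussian sample -- a case where, of
course, the solution is also known in closed form.

  newton <- function (f, x, eps=1e-10, h=1e-6, maxit=100) {
    for (it in 1:maxit) {
      fx <- f(x)
      if (max(abs(fx)) < eps) break
      J <- matrix(NA, nrow=length(fx), ncol=length(x))
      for (j in 1:length(x)) {
        e <- rep(0, length(x))
        e[j] <- h
        J[,j] <- (f(x + e) - fx) / h    # numerical Jacobian
      }
      x <- x - solve(J, fx)             # Newton step
    }
    x
  }
  # Moment equations of a gaussian sample:
  #   mean(x) = mu,  mean(x^2) = mu^2 + sigma^2
  set.seed(1)
  obs <- rnorm(100, mean=2, sd=3)
  g <- function (theta) c( theta[1]                - mean(obs),
                           theta[1]^2 + theta[2]^2 - mean(obs^2) )
  newton(g, c(0,1))    # should be close to c(2, 3)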
+ L-moments, TL-moments

Since the higher moments have a high variance, are very sensitive to
outliers -- or even, in some cases, are not defined for the distribution
at hand -- one can replace the moments by the L-moments or the trimmed
L-moments (TL-moments).

+ GMM (Generalized Method of Moments)

Least squares and maximum likelihood provide recipes to build new
estimators. The Generalized Method of Moments is another recipe, which
actually generalizes both least squares and maximum likelihood.

Consider a (real-valued) random variable X, known to be gaussian, with
unknown mean mu and standard deviation sigma. We want to find those
parameters, mu and sigma, from a sample x1, x2, ..., xn drawn from the
distribution of X. Remarking that

  E[ X ]   = mu
  E[ X^2 ] = mu^2 + sigma^2

we can simply try to solve

   1
  --- Sum( x_i )   = mu
   n

   1
  --- Sum( x_i^2 ) = mu^2 + sigma^2
   n

There is no reason it should be a "good" estimator, but it will be an
estimator -- then, it is up to us to study how good it is and to improve
it if needed.

This idea is more general: if the distribution of the random variable X
has a known form, with unknown parameters theta_1, theta_2, ...,
theta_n, we can compute the first moments of X as functions of those
parameters:

  E[ X ]   = m_1( theta_1, ..., theta_n )
  E[ X^2 ] = m_2( theta_1, ..., theta_n )
  E[ X^3 ] = m_3( theta_1, ..., theta_n )
  ...

Then, given the values of X on a sample, we can equate the sample means
of X, X^2, etc. to their theoretical values and solve the resulting
equations. This is called the Method of Moments Estimator (MME).

But this can be generalized further: instead of taking the moments E[X],
E[X^2], E[X^3], etc., we can consider the expectations of simpler
quantities, that have a more direct interpretation or that can be more
easily computed. Those quantities are called generalized moments.

The recipe for a GMM estimator is: find easy-to-compute and
easy-to-interpret quantities (e.g., X, X^2, X^3, etc.); compute their
expectations in the population, as functions of the unknown parameters;
compute their averages in the sample; equate expectations and averages
and solve: the law of large numbers tells you that the averages converge
to the expectations. Furthermore, the central limit theorem (often)
tells you that the resulting estimators are asymptotically gaussian.

For instance, if you have two random variables X and Y, linked by the
relation Y = a + b X + epsilon, where epsilon is a random variable with
zero mean and a and b are unknown parameters, then you can consider the
generalized moments

  E[  Y - ( a + b X )    ] = 0
  E[ (Y - ( a + b X )) X ] = 0

The corresponding GMM estimator is the least squares estimator: the
equations state that the derivatives of the sum of squares appearing in
the definition of the least squares estimator, with respect to a and b,
are zero.

The classical moments can be written as

  E[ X   - m1(theta) ] = 0
  E[ X^2 - m2(theta) ] = 0
  E[ X^3 - m3(theta) ] = 0
  ...

We shall write the generalized moments as

  E[ h1(X, theta) ] = 0
  E[ h2(X, theta) ] = 0
  E[ h3(X, theta) ] = 0
  ...

Quite often, you have more generalized moments than parameters: in this
case, you cannot hope that the moments will all be zero, and you will
instead try to minimize a "weighted sum" of squares of generalized
moments. More precisely, if h = (h1,...,hn) is the vector of generalized
moments, you will minimize

  h' W h

where W is a weight matrix.
If you are completely clueless about the properties of your moments, you
can set W to the identity matrix; otherwise, the (limit, when the sample
size becomes large, of the) inverse of the variance-covariance matrix of
the moment conditions (i.e., of the "residuals" h(x_i, theta)) is a good
choice:

  W = Var[ (1/N) Sum( h(x_i, theta) ) ] ^ -1

The actual algorithm is usually:

  W <- I
  Until convergence:
    Find theta that minimizes
      [ (1/N) Sum h(x_i, theta) ]'  W  [ (1/N) Sum h(x_i, theta) ]
    Set W = Var[ residuals ] ^ -1

TODO: Orthogonality conditions, Instrumental Variables (IV)

When you select your model, say Y = f(X, theta) + epsilon, you want its
residuals to be uncorrelated with all the predictive variables:

  epsilon = y - f(x, theta)  uncorrelated with x
                             (variables included in the model)
  epsilon = y - f(x, theta)  uncorrelated with z
                             (variables not included in the model)

The first condition will be satisfied, but the second need not be: we
have to impose it. This can be written as

  E[ (y - f(x, theta)) \tens z ] = 0

where \tens is the tensor product (also called the "Kronecker product").
TODO: recall what it is.
The variables in z (the columns of z) are called Instrumental Variables
(IV).

Tests, in the context of the Generalized Method of Moments, can look as
follows:

  H0: E[ h ]  = 0, i.e., the model is right
  H1: E[ h ] != 0, i.e., the model is wrong

You can use the following statistic:

  J = N [ (1/N) Sum h(x_i) ]'  W  [ (1/N) Sum h(x_i) ]

where N is the sample size.

You should remark that, with the GMM, we never really specify the model:
we just specify its moments. Thus, in this context, we cannot
distinguish between models with the same moments. GMM estimators are
used in econometrics, because one can choose moments with an economic
interpretation -- furthermore, as there is no real model behind the
moments, once the moments are meaningful, the "model" makes sense.

TODO: give more details

+ Cramer-Rao inequality

The Cramer-Rao inequality states that one cannot find arbitrarily good
estimators: more precisely, it gives a lower bound on the variance of an
unbiased estimator.

+ Sufficient statistic

The first step towards finding the "best" possible estimator is to
replace the data X1, X2, ..., Xn by one (or several) numbers that
contain as much information: such a number is called a "sufficient
statistic". More precisely, if

  P( (X1,X2,...,Xn) \in U | t(X1,...,Xn) = c )

only depends on U and c but not on the parameter we want to estimate,
then t is a sufficient statistic. There is a simple criterion (the
Neyman factorization theorem) to recognize a sufficient statistic.

Usually, we try to find a "minimal" sufficient statistic, i.e., one
whose "level sets" are as large as possible.

+ BUE (Best Unbiased Estimators, aka UMVUE, Uniformly Minimum Variance
  Unbiased Estimators)

Unbiased estimators whose variance attains the lower bound in the
Cramer-Rao inequality.

* TO SORT

+ Information

TODO

The information of a realization of a random variable is its degree of
surprise. For instance, if we flip a coin (i.e., if we consider a
Bernoulli variable with probability 1/2) and get heads, there is no real
surprise: the information is rather low. This notion becomes more
interesting for more asymmetric random variables. Let us consider the
occurrence of an earthquake, given that earthquakes are rare. Knowing
that no earthquake occurred is not informative at all: the information
is low (even lower than in the coin-flip experiment). On the contrary,
if an earthquake did occur, this is surprising, and the information is
high.
For a Bernoulli variable P(X=1) = p, the information of the event X=1 is

  - log2(p)

(in bits; with a natural logarithm, it is measured in nats). The rarer
the event, the larger the information.

+ Entropy

TODO

The Kolmogorov-Chaitin entropy of a sequence of characters is the
minimal length of a program that produces this sequence. It cannot
really be computed but, if your sequence is sufficiently long, you can
estimate its entropy by compressing the sequence (with programs such as
gzip or bzip2) and looking at the length of the resulting file.

http://arxiv.org/abs/cond-mat/0108530
Language trees and zipping

http://arxiv.org/abs/cond-mat/0202383
Extended comments on "Language trees and zipping"

http://arXiv.org/abs/cond-mat/0203275

+ Relative entropy (Kullback-Leibler distance)

TODO

+ Maximum Likelihood Methods and Statistical Tests

TODO: write this part...

Likelihood-Ratio test:

  LR = -2 ln ( L at H0 / L at the MLE )
  LR ~ Chi^2,  df = number of parameters fixed by H0 (e.g., 1).

Other tests for MLE: Wald test, score test.

To compare two models fitted by maximum likelihood, compare their

  AIC = -2 log L + 2 p

where p is the number of parameters. This is only an approximate
criterion. There are other, equally questionable, criteria, e.g., the
BIC (Bayesian Information Criterion).

  r <- lm(...)
  AIC(r)
  library(nlme)
  BIC(r)
  r <- gls(...)
  summary(r)

Adding a parameter, even a useless one, always increases the log
likelihood (typically by about half a unit when the parameter is
useless). The AIC only deems the extra parameter worthwhile if the log
likelihood increases by more than 1 unit.
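A minimal illustration (my own, on simulated data): a genuinely useful
term should lower the AIC, while a useless one should usually raise it.

  # Compare nested linear models with AIC (lower is better)
  set.seed(1)
  x <- rnorm(100)
  y <- 1 + 2*x + rnorm(100)
  r1 <- lm(y ~ x)
  r2 <- lm(y ~ x + I(x^2))   # adds a useless quadratic term
  r0 <- lm(y ~ 1)            # drops the (useful) slope
  AIC(r0, r1, r2)            # r1 should usually have the smallest AIC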