Analysis of Variance (Anova)

Analysis of variance is merely regression when the
predictive variables are qualitative -- more precisely, it
refers to the tests one performs in that context.

Covariance analysis is regression with some qualitative
predictive variables and some quantitative predictive
variables (the latter are then called "covariates" or
"covariables").

The easiest way to understand analysis of variance is as a
generalization of Student's test: it tells you if the mean
of a quantitative variable is the same in several groups
(Student's test is limited to the case of two groups) or, in
other words, if a quantitative variables depends on a
qualitative variable.

A more general way of understanding analysis of variance
(and this is the point of view chosen by R's "anova"
function) is as a test comparing two models: checking if a
quantitative variable y depends on a qualitative variable x
is equivalent to comparing the models y ~ x and y ~ 1 (if
the two models are significantly different, the more complex
one, y ~ x, brings more information, i.e., the quantitative
variable y depends on the qualitative variable x, i.e., the
mean of y is not the same in the groups defined by x).

We shall present analysis of variance in this order:
regression with qualitative predictive variables,
generalization of Student's test, model comparison.

In a later chapter, we shall present, more quickly, a few
generalizations of analysis of variance: mixed models,
hierarchical models, panel models, graphical models,
bayesian networks.

  TODO: 
  The end of this chapter is not completely written yet and
  overlaps with the next chapter. 

* Regression with qualitative predictive variables

Before tackling anova itself (i.e., the tests related to a
regression with qualitative predictive variables), let us
focus on the regression with qualitative variables
itself. Actually, there is not a big difference with
classical regression.

+ Binary variable

That is simple: you code it with 0 or 1 and you perform a
regression as usual.

  %G 600 300
  n <- 200
  x <- sample(0:1, n, replace=T)
  y <- 2*(x==0) +rnorm(n)
  plot( y ~ factor(x), 
        horizontal = TRUE, 
        xlab = 'y', 
        ylab = 'x',
        col = "pink" )
  %--

  %G
  plot(density(y[x==0]), 
       lwd = 3, 
       xlim = c(-3,5), 
       ylim = c(0,.5), 
       col = 'blue',
       main = "Density in each group",
       xlab = "x")
  lines(density(y[x==1]), 
        lwd = 3, 
        col = 'red')
  %--

  %G
  plot(y ~ x,
       main = "lm(y~x) when x is qualitative")
  abline(lm(y ~ x), 
         col = 'red')
  %--

As there are only two possible values for the predictive
variable, there are only two forecasts: the means of the
corresponding groups.

  > predict(lm(y~x), data.frame(x=c(0,1)))
           1          2
   2.0785496 -0.1654518
  > tapply(y,x,mean)
         0          1
   2.0785496 -0.1654518

You can also wonder whether the means in these two groups
are significantly different, with a Student test

  > t.test(y[x==0], y[x==1])
        Welch Two Sample t-test
  data:  y[x == 0] and y[x == 1]
  t = 15.1062, df = 197.472, p-value = < 2.2e-16

or with an anova (the anova is the last line of the result
("summary") of a regression).

  > summary(lm(y~x))
  Call:
  lm(formula = y ~ x)
  Residuals:
      Min      1Q  Median      3Q     Max
  -3.7520 -0.5786  0.1071  0.6504  2.7427
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   2.1283     0.1068   19.93   <2e-16 ***
  x1           -2.1218     0.1440  -14.73   <2e-16 ***
  Residual standard error: 1.013 on 198 degrees of freedom
  Multiple R-Squared: 0.523,      Adjusted R-squared: 0.5206
  F-statistic: 217.1 on 1 and 198 DF,  p-value: < 2.2e-16

You can also explicitely ask for an analysis of variance, i.e.,
explicitely compare the y ~ x model with the trivial one y ~ 1.

  > anova( lm(y~x), lm(y~1) )
  Analysis of Variance Table
  Model 1: y ~ x
  Model 2: y ~ 1
    Res.Df     RSS  Df Sum of Sq      F    Pr(>F)
  1    198  203.27
  2    199  426.11  -1   -222.84 217.07 < 2.2e-16 ***

I never remerber in which order to list the models to be
compared: here, the number of degrees of freedom is
negative, the sum of squares is negative, so the order is
wrong -- but the p-value is correct. I should have put the
first model first. In any case, the result is only
meaningful if the models are embedded in one another, i.e.,
if one (the first) is a simplification of the other.

This might seem trivial: we do not need to know about
regression in order to compare means -- Student's test will
do. Things become less simple when you have both qualitative
and quantitative variables.

+ Two predictive variables, one qualitative (binary), another quantitative

We now try to predict a quantitative variable, y, from a
qualitative variable, x1, and another quantitative variable,
x2.

  %G
  n <- 200
  x1 <- sample(0:1, n, replace=T)
  x2 <- rnorm(n)
  y <- 1 - 2*(x1==0) + x2 +rnorm(n)
  plot( y ~ x2, 
        col = c(par('fg'), 'red')[1+x1],
        main = "y ~ (1 | x1) + x2")
  rc <- lm( y ~ x1+x2 )$coef
  abline(rc[1], rc[3])
  abline(rc[1]+rc[2], rc[3], 
         col = 'red')
  legend(par('usr')[2], par('usr')[3], # Bottom right
         xjust = 1, yjust = 0,
         c("x1 = 0", "x1 = 1"),
         col = c("red", par("fg")),
         lty = 1,
         lwd = 3)
  %--

  %G
  x1 <- ifelse(x1==1, "x1 = 1", "x1 = 0")
  library(lattice)
  xyplot( y ~ x2 | x1,
          panel = function (x, y, ...) {
            panel.xyplot(x, y, ...)
            panel.lmline(x, y)
          },
          main = "y ~ (1 | x1) + x2"
         )
  %--

The preceding model, with no interaction term, yields parallel
lines. If we add an interaction term, the lines are no longer
parallel.

  %G
  n <- 200
  x1 <- sample(0:1, n, replace=T)
  x2 <- rnorm(n)
  y <- 1 + x2 - (x1==1)*(2+3*x2) + rnorm(n)
  plot( y ~ x2, 
        col = c(par('fg'), 'red')[1+x1],
        main = "y ~ x2 + (x2 | x1)" )
  # With no interaction term (dotted lines)
  rc <- lm( y ~ x1 + x2 )$coef
  abline(rc[1], rc[3], 
         lty = 3)
  abline(rc[1]+rc[2], rc[3], 
         col = 'red', lty = 3)
  # with
  rc <- lm( y ~ x1 + x2 + x1:x2 )$coef
  abline(rc[1], rc[3])
  abline(rc[1]+rc[2], rc[3]+rc[4], 
         col='red')
  %--

  %G
  x1 <- ifelse(x1==1, "x1 = 1", "x1 = 0")
  xyplot( y ~ x2 | x1,
          panel = function (x, y, ...) {
            panel.xyplot(x, y, ...)
            panel.lmline(x, y)
          },
          main = "y ~ x2 + (x2 | x1)"
         )
  %--

  %G
  # One might also want to compare the intercepts
  xyplot( y ~ x2 | x1,
          panel = function (x, y, ...) {
            panel.xyplot(x, y, ...)
            panel.lmline(x, y)
            panel.abline(v=0, h=0, lty=3)
          },
          main = "y ~ x2 + (x2 | x1)"
         )
  %--

Actually, this is merely a linear regression of each value of the
qualitative variable.

  TODO: interaction plot (matrix plot)
  if the lines are parallel, there is no interaction.

+ Two qualitative (binary) predictive variables

  TODO

+ Two qualitative (binary) predictive variables: interactions

When you have several quanlitative variables, interactions
can start playing an important role. For instance, it is
possible that the mean of Y does not depend on X1 nor on X2,
but on the pair (X1,X2). For instance

  E[ Y | X=(0,0) ] > E[ Y | X=(0,1) ]
  E[ Y | X=(1,0) ] < E[ Y | X=(1,1) ]

Here is a more graphical example.

  %G 600 300
  n <- 1000
  l <- 0:1
  v <- .5
  x1 <- factor( sample(l, n, replace=T), levels=l )
  x2 <- factor( sample(l, n, replace=T), levels=l )
  y <- ifelse( x1==0, 
         ifelse( x2==0, 1+v*rnorm(n),   v*rnorm(n) ),
         ifelse( x2==0, v*rnorm(n),   1+v*rnorm(n) )
       )
  boxplot( y ~ x1, 
           horizontal = TRUE,
           col = "pink",
           main = "y does not seem to depend on x1",
           xlab = "y", 
           ylab = "x1" )  
  %--

  %G 600 300
  boxplot( y ~ x2,
           horizontal = TRUE,
           col = "pink",
           main = "y does not seem to depend on x2",
           xlab = "y", 
           ylab = "x2" )  
  %--

  %G
  plot( as.numeric(x1) - 1 + .2*rnorm(n), 
        y, 
        col = c("red", "blue")[as.numeric(x2)], 
        xlab = "x1 (jittered)",
        main = "Interactions between x1 and x2")
  legend(par('usr')[2], par('usr')[3], # Bottom right
         xjust = 1, yjust = 0,
         c("x2 = 0", "x2 = 1"),
         col = c("red", "blue"),
         lty = 1,
         lwd = 3)
  %--

TODO: Just for fun, a 3-dimensional boxplot, with PoVRay or
Rgl.

+ Qualitative variable with more that two values

Let us consider a qualitative variable with four values: 0,
1, 2, 3.
 
  %G 600 400
  n <- 200
  x <- sample(0:3, n, replace=T)
  y <- 2*(x==0) + 5*(x==2) -2*(x==3) + rnorm(n)
  plot( y ~ factor(x), 
        horizontal = TRUE, 
        col = "pink",
        xlab = 'y', 
        ylab = 'x',
        main = "x now has four values" )
  %--

We cannot code it with those four numbers: this would imply
that these values are ordered, that their effects are
monotonic, that the effect of "4" is exactly twice that of
"2", which is itself exactly twice that of "1". Instead, you
can use several boolean variables (one also speaks of
"dummy" variables, boolean variables or 0-1 variables), that
assume only two values.

  x1 <- as.numeric(x==1)
  x2 <- as.numeric(x==2)
  x3 <- as.numeric(x==3)

There are other ways of coding our qualitative variable:
they will yield the same forecasts, the same statistics (F
statistic, p-value, etc.), but the coefficients will have a
different interpretation.

Then, we perform a regression, as usual. 

  > lm(y ~ x1+x2+x3)

  Call:
  lm(formula = y ~ x1 + x2 + x3)

  Coefficients:
  (Intercept)           x1           x2           x3
        2.178       -2.293        2.844       -4.110

Actually, R does this transformation itself when one of the
predictive variables is a factor (beware: it HAS to be a
factor, it it is merely a variable with integral values, it
will be considered as a quantitative variable).

  > lm(y ~ factor(x))

  Call:
  lm(formula = y ~ factor(x))

  Coefficients:
  (Intercept)   factor(x)1   factor(x)2   factor(x)3
        2.178       -2.293        2.844       -4.110

As our variable only has four values, there are only four
values to predict: the means of the four groups. (You can
try to interpret the coefficients direclty -- we will
explain how to do that shortly -- but it is much easier,
more symetric, to think in terms of predicted values.)

  > predict(lm(y~factor(x)), data.frame(x=c(0,1,2,3)))
           1          2          3          4
   2.1778596 -0.1148520  5.0220615 -1.9325181
  > tapply(y,x,mean)
           0          1          2          3
   2.1778596 -0.1148520  5.0220615 -1.9325181

As above, we can add in quantitative predictive variables
(and here again, this is equivalent to performing a
regression for each value of the qualitative variable).

  %G
  n <- 1000
  x1 <- sample(1:4, n, replace=T)
  x2 <- runif(n, min=-1, max=1)
  y <- (x1==1) * (2 - 2*x2) +
       (x1==2) * (1 + x2) +
       (x1==3) * (-.5 + .5*x2) +
       (x1==4) * (-1) +
       rnorm(n)
  cols <- c(par('fg'), 'red', 'blue', 'orange')
  plot( y ~ x2, 
        col = cols[x1],
        main = "y ~ x2 + (x2 | x1)" )
  x1 <- factor(x1)
  r <- lm( y ~ x1*x2 )
  xx <- seq( min(x2), max(x2), length = 100 )
  for (i in levels(x1)) {
    yy <- predict(r, data.frame(
      x1 = rep(i, length(xx)), 
      x2 = xx
    ))
    lines(xx, yy, 
          col = cols[as.numeric(i)], 
          lwd = 3)
  }
  %--

+ Contrasts

If the variables are not binary, R can create the dummy variables
himself. However, let us see how this can be done. 

Example, with two variables: 

  Two qualitative variables U and V with 3 values (a, b and c).
  Replace U by two variables X1 and X2.
  If U=a, take X1=1 and X2=0.
  If U=b, take X1=0 and X2=1.
  If U=c, take X1=X2=0.
  Similarly replace V by two variables X3 and X4.

The model with no interaction is

    Y ~ X1 + X2 + X3 + X4

But it is not the only way of creating those dummy variables.
Here is another coding.

  Replace U by two variables X1 and X1
  If U=a, take V1=1, V2=0.
  If U=b, take V1=0, V2=1.
  If U=c, take V1=V2=-1.
  Similarly replace V by X3 and X4.

If you want interactions, things can become trickier... 

  Replace U by two variables X1 and X1
  If U=a, take V1=1, V2=0.
  If U=b, take V1=0, V2=1.
  If U=c, take V1=V2=-1.
  Similarly replace V by X3 and X4.
    Y ~ V1 + V2 + V3 + V4 + V1V3 + V1V4 + V2V3 + V2V4

You can ask R to use a coding or another.

  > options()$contrasts
          unordered           ordered
  "contr.treatment"      "contr.poly"

  > apropos("contr\\.")
  [1] "contr.helmert"   "contr.poly"      "contr.sum"       "contr.treatment"

By default, SPlus uses Helmert contrasts (but many people
think it is a bad idea because they are difficult to
interpret -- I personnally think they are all difficult to
interpret and suggest not to interpret them directly but use
them to compute quantities easier to interpret).

  n <- 100
  l <- 1:4
  x <- factor( sample(l, n, replace=T), levels=l )
  y <- ifelse(x==1, rnorm(n), 
       ifelse(x==2, -4+rnorm(n), 
       ifelse(x==3, -2+rnorm(n), 4+rnorm(n))))

The "model.matrix" function will tell you what matrix is actually used
in the regression.

  > cbind(x,y)[1:5,]
       x          y
  [1,] 3 -2.1962202
  [2,] 4  3.9745616
  [3,] 1  0.3440235
  [4,] 4  2.7521184
  [5,] 2  4.3539651

  > model.matrix(y~x)[1:5,]
    (Intercept) x2 x3 x4
  1           1  0  1  0
  2           1  0  0  1
  3           1  0  0  0
  4           1  0  0  1
  5           1  1  0  0

  > model.matrix(y~x, contrasts=list(x=contr.helmert))[1:5,]
    (Intercept) x1 x2 x3
  1           1  0  2 -1
  2           1  0  0  3
  3           1 -1 -1 -1
  4           1  0  0  3
  5           1  1 -1 -1

  > model.matrix(y~x, contrasts=list(x=contr.poly))[1:5,]
    (Intercept)        x.L  x.Q        x.C
  1           1  0.2236068 -0.5 -0.6708204
  2           1  0.6708204  0.5  0.2236068
  3           1 -0.6708204  0.5 -0.2236068
  4           1  0.6708204  0.5  0.2236068
  5           1 -0.2236068 -0.5  0.6708204

  > model.matrix(y~x, contrasts=list(x=contr.sum))[1:5,]
    (Intercept) x1 x2 x3
  1           1  0  0  1
  2           1 -1 -1 -1
  3           1  1  0  0
  4           1 -1 -1 -1
  5           1  0  1  0

The default coding, "contr.treatment", is not symetric: one of the
values is considered as central and the others are compared to it.
You can specify which one with the "relevel" function.

Here are the results with the (default) "contr.treatment" contrasts.

  > lm(y~x)

  Call:
  lm(formula = y ~ x)

  Coefficients:
  (Intercept)           x2           x3           x4
      -0.1659      -3.7780      -2.0133       3.8462

Idem with "contr.helmert".

  > a <- options()$contrasts
  > a[1] <- "contr.helmert"
  > options(contrasts=a)
  > lm(y~x)

  Call:
  lm(formula = y ~ x)

  Coefficients:
  (Intercept)           x1           x2           x3
     -0.65218     -1.88902     -0.04143      1.44416

With "contr.sum"

  > a <- options()$contrasts
  > a[1] <- "contr.sum"
  > options(contrasts=a)
  > lm(y~x)

  Call:
  lm(formula = y ~ x)

  Coefficients:
  (Intercept)           x1           x2           x3
      -0.6522       0.4863      -3.2918      -1.5270

And finally with "contr.poly".

  > a <- options()$contrasts
  > a[1] <- "contr.poly"
  > options(contrasts=a)
  > lm(y~x)

  Call:
  lm(formula = y ~ x)

  Coefficients:
  (Intercept)          x.L          x.Q          x.C
      -0.6522       2.9747       4.8188      -0.3238

Let us sum up the results.

  contr.treatment   -0.1659      -3.7780      -2.0133       3.8462
  contr.helmert     -0.65218     -1.88902     -0.04143      1.44416
  contr.sum         -0.6522       0.4863      -3.2918      -1.5270
  contr.poly        -0.6522       2.9747       4.8188      -0.3238

Let us see what happens if the effects are monotonic.

  n <- 100
  v <- .3
  l <- 1:4
  x <- factor( sample(l, n, replace=T), levels=l )
  y <- ifelse(x==1, -3+v*rnorm(n), 
       ifelse(x==2, -1+v*rnorm(n), 
       ifelse(x==3, 1+v*rnorm(n), 3+v*rnorm(n))))
  a <- options()$contrasts
  for (i in c("contr.treatment", "contr.helmert", "contr.sum", "contr.poly")) {
    a[1] <- i
    options(contrasts=a)
    print(lm(y~x)$coef)
  }

We get:

  contr.treatment   -2.911686    1.887725    3.938457    5.914572
  contr.helmert    0.02350254  0.94386256  0.99819831  0.99312780
  contr.sum        0.02350254 -2.93518867 -1.04746355  1.00326881
  contr.poly       0.02350254  4.42617327  0.04419473 -0.05313457

TODO: explain how to interpret the coefficients.
(I am sure I have already written this somewhere...)

  TODO: One could be tempted to choose another coding, easier to
  interpret:
  Replace U by 3 variables
  If U=a, take X1=1, X2=0, X3=0
  If U=b, take X1=0, X2=1, X3=0
  If U=c, take X1=0, X2=0, X3=1
  Explain why it does not work.
  Explain how to get the corresponding coefficients.

* ANalysis Of VAriance (Anova)

This is another name for regression with qualitative
predictive variables or, more precisely, for the tests
performed in this setup.

+ Anova: comparing more than two means

We have seen that regression with a single qualitative
predictive variable was nothing more than a computation of
means. In this setup, anova is merely a test to compare
means: a generalization of Student's T test, with an
arbitrary number of samples (and not just two).

The assumptions are the same as with Student's T test: the
samples are to be gaussian, to have the same variance, the
observations are to be independant.

You can rephrase this in terms of regressions: We have two
variables, X and Y, variable Y is quantitative and we are
interested in its mean, variable X is qualitative (it gives
the sample number, or the class of the observation -- e.g.,
the species if you are comparing different animal
species). We want to know if Y depends in X. i.e., we
perform a regression of Y against X. This can be written as

  Y = a0 + a1*(X==1) + a2*(X==2) + a3*(X==3) + a4*(X==4)

and we compare this model with the trivial one

  Y = b0.

Here is a simulation:

  %G
  N <- 10
  a <- rnorm(N)
  b <- rnorm(N)
  c <- rnorm(N)
  d <- rnorm(N)
  df <- data.frame( 
    y = c(a,b,c,d), 
    x = factor(c(rep(1,N), rep(2,N), rep(3,N), rep(4,N))) 
  )
  plot( y ~ x, 
        data = df,
        main = "Our 4 samples" )
  %--

  %G
  plot(density(df$y,bw=1), 
       xlim = c(-6,6), 
       ylim = c(0,.5), 
       type = 'l',
       main = "our 4 samples",
       xlab = "y")
  points(density(a,bw=1), type='l', col='red')
  points(density(b,bw=1), type='l', col='green')
  points(density(c,bw=1), type='l', col='blue')
  points(density(d,bw=1), type='l', col='orange')
  %--

The plot below represents the value of a statistical
variable on three samples. We want to know if those three
samples come from the same population. To this end, we start
to compare the means of those three samples. The three means
will be different, of course, but will they be significantly
different? Here, the difference between the means is rather
high compared to the "width" of the bell-shaped curves, that
have almost no overlap. For this reason, in this example, we
would reject the hypothesis "the means are equal".

  %G
  curve( dnorm(x-2), from=-6, to=6, col='red',
         xlab = "", ylab = "" )
  curve( dnorm(x+2),    add = T, col = 'green')
  curve( dnorm(x+2+.2), add = T, col = 'blue')
  curve( dnorm(x+2-.3), add = T, col = 'orange')
  x <- c(2, -2, -2.2, -1.7)
  segments(x, c(0,0,0,0), 
           x, rep(dnorm(0),4), 
           col = c('red', 'green', 'blue', 'orange') )
  title("The means are significantly different")
  %--

This is very different in the following plot. Even though the means
are the same as above, the differences between the means are small
compared to the width of the bell-shaped curves, that overlap a
lot. For this reason, we shall not reject the hypothesis "the means
are equal".

  %G
  s <- 3
  curve( dnorm(x-2, sd=s), from=-6, to=6, col='red',
         xlab = "", ylab = "" )
  curve( dnorm(x+2, sd=s),    add=T, col='green')
  curve( dnorm(x+2+.2, sd=s), add=T, col='blue')
  curve( dnorm(x+2-.3, sd=s), add=T, col='orange')
  x <- c(2, -2, -2.2, -1.7)
  segments( x, c(0,0,0,0), x, rep(dnorm(0, sd=s),4), 
            col=c('red', 'green', 'blue', 'orange') )
  title("The means are not significantly different")
  %--

In order to quantify those qualitative and graphical
reasonnings, we need a way of measuring the width of those
"bell-shaped curves": we could take the variance of each
sample but that would give us four numbers, one for each as
we prefer a single number; so we take the mean of those
variances -- if the three samples actually come from the
same population, the mean of the variances is indeed an
estimator of the variance of the whole population. We also
need a measure of the dispersion of the means: we could take
the variance of the means, but in order to compare it with
the mean of the variances, we multiply it by the size of the
samples -- we get another estimator of the variance of the
whole population.

Finally, we consider the ratio (here, n is the size of each
sample -- to simplify things, we assume they have the same
size)

       n * (variance of the means)
  F = -----------------------------
          mean of the variances

TODO: check this formula.

If F >> 1, we reject the hypothesis that the means are
equal; if F is small, we do not reject it. As all tests, it
is not symetric. One can show that if the means are equal,
the the expected value of F is 1; if the means are
different, the expectation is larger than 1. Of course, it
happens that F<1: in that case, we do not reject the
hypothesis.

This quotient, F, follows an F distribution with ********
and ****** degrees of freedom.

  %G
  N <- 5000  # Iterations
  n <- 5   # Sample size
  k <- 3     # Number of groups
  r <- rep(NA, N)
  for (i in 1:N) {
    l <- matrix(rnorm(n*k), nc=k)
    r[i] <- n * var(apply(l,2,mean)) / 
            mean(apply(l,2,var))
  }
  plot(sort(r), qf(ppoints(N),k-1,(k-1)*n),
    main = "When I know the distribution but not its parameters")
  abline(0,1,col="red")
  %--

  %G
  N <- 5000  # Iterations
  n <- 5     # Sample size
  k <- 3     # Number of groups
  r <- rep(NA, N)
  for (i in 1:N) {
    l <- matrix(rnorm(n*k), nc=k)
    r[i] <- n * var(apply(l,2,mean)) / 
            mean(apply(l,2,var))
  }
  plot(sort(r), qf(ppoints(N),k-1,k*(n-1)),
    main = "When I know the distribution but not its parameters")
  abline(0,1,col="red")
  %--

  %G
  curve( df(x, 2, 2), from=0, to=4, ylim=c(0,1),
         xlab = "", ylab = "", main = "" )
  curve( df(x, 2, 10), add=T, col='red' )
  curve( df(x, 4, 2),  add=T, col='green' )
  curve( df(x, 4, 6),  add=T, col='green' )
  curve( df(x, 4, 10), add=T, col='green' )
  curve( df(x, 4, 20), add=T, col='green' )
  curve( df(x, 6, 2),  add=T, col='blue' )
  curve( df(x, 6, 6),  add=T, col='blue' )
  curve( df(x, 6, 10), add=T, col='blue' )
  curve( df(x, 6, 20), add=T, col='blue' )
  %--

If we literally follow the definition,

  n <- 10
  a <- rnorm(n)
  b <- rnorm(n)
  c <- rnorm(n)
  d <- 1+rnorm(n)
  sx2 <- var(c( mean(a), mean(b), mean(c), mean(d) ))
  sp2 <- mean(c( var(a), var(b), var(c), var(d) ))
  MSbetween <- n * sx2
  dlbetween <- 4-1
  MSwithin <- sp2
  dlwithin <- 4*(n-1)
  F <- MSbetween / MSwithin

we find

  >   F
  [1] 5.137807

But usually, we perform the computations in a slightly different way,
by remembering that varance may be computed from sums of squares. 

  Var(X) = E[ (X-EX)^2 ]
         = E[X^2] - (EX)^2

The computations then go as follows.

  # Sums
  sa <- sum(a)
  sb <- sum(b)
  sc <- sum(c)
  sd <- sum(d)
  s <- sa + sb + sc + sd
  # Sums of squares
  ssa <- sum(a^2)
  ssb <- sum(b^2)
  ssc <- sum(c^2)
  ssd <- sum(d^2)
  ss <- ssa + ssb + ssc + ssd
  # The values we are interested in 
  # SS = Sum of Squares
  # MS = Mean Square
  # effect = between = explained variance
  SSeffect <- (sa^2/n + sb^2/n + sc^2/n + sd^2/n) - s^2/(4*n)
  dleffect <- 4-1
  MSeffect <- SSeffect/dleffect
  SStotal <- ss - s^2/(4*n) # ????
  # error = within = residuals = residual variance
  SSerror = SStotal - SSeffect
  dlerror <- 4*(n-1)
  MSerror <- SSerror/dlerror
  F <- MSeffect / MSerror
 
  res <- matrix( nrow=3, ncol=5 )
  rownames(res) <- c("Effect", "Error", "Total")
  colnames(res) <- c("SS", "df", "MS", "F", "p-value")
  res[,1] <- c(SSeffect, SSerror, SStotal)
  res[,2] <- c(dleffect, dlerror, dleffect+dlerror)
  res[,3] <- c(MSeffect, MSerror, NA)
  res[,4] <- c(F, NA, NA)
  res[,5] <- c(1-pf(F,dleffect,dlerror), NA, NA)
  print(res, na.print="")

This yields:

               SS df        MS        F    p-value
  Effect 15.18513  3 5.0617083 5.137807 0.00463538
  Error  35.46678 36 0.9851884
  Total  50.65191 39

The correlation ratio is defined as 

  >   SSeffect/SStotal
  [1] 0.2997937

We then say that "30% of the variations are explaines by the
explainatory variable" (here, the "explainatory variable" is
the variable telling in which group each observation is).

We now understand a little more the result produced by R:

  > df <- data.frame(
  +   y = c(a,b,c,d), 
  +   x = factor(c(rep(1,n),rep(2,n),rep(3,n),rep(4,n)))
  + )
  > anova(lm(y~x, data=df))
  Analysis of Variance Table

  Response: y
            Df Sum Sq Mean Sq F value   Pr(>F)
  x          3 15.185   5.062  5.1378 0.004635 **
  Residuals 36 35.467   0.985
  ---
  Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

The only number we are interested in is the p-value: here,
we can reject the hypothesis "the four means are equal" with
a risk of error inferior to 1%.

+ R^2, adjusted R^2

  TODO

+ Model

The model underlying analysis of variance is the following.
We want to predict a quantitative variable Y with a qualitative
variable X:

  Y = a0 + a1*(X==1) + ... + an*(X==n) + noise

which may also be written as 

  Y = a_0 + a_X + noise

or (if we write i (previously X) the group of observation number j)

  Y_ij = a + b_i + c_ij

The interpretation of this F test is the same as for a linear
regression between quantitative variables: we measure which part of
the variance of Y is explained by X.

The model may also contain several predictive variables

  y  = something that depends on the value of u
       + something that depends on the value of v
       + noise

i.e., 

  y = f(u) + g(v) + noise

i.e., if u and v are qualitative, 

  Y_ijk = a + b_i + c_j + d_ijk

We can also add some interaction terms

  y  = something that depends on the value of u
       + something that depends on the value of v
       + something that depends on the value of (u,v)
       + noise

i.e., 

  y = f(u) + g(v) + h(u,v)

i.e.,

  Y_ijk = a + b_i + c_j + d_ij + e_ijk.

When some predictive variables are quatitative and others qualitative, 
we call this "covariance analysis" (and the quantitative predictive
variables are called "covariates") -- but it is only a vocabulary
change, the ideas, the computations are the same. 

+ Example

  TODO: give a (real) example with two predictive variables.

+ Comparing models

In R, the "anova" command is more general: it is used to
compare regression models. Analysis of variance, as we have
just presented it, is actually a comparison of the model
studied (say, y ~ x) with the trivial (empty) model 
(here, y ~ 1).

  n <- 200
  k <- 5
  m <- sample(0:1, k, replace=T,  # The means (perhaps equal, 
              p = c(.9,.1))       # perhaps not)
  x <- factor(sample(1:k,n,replace=T))
  y <- rnorm(n,m[x])
  anova( lm(y~x), lm(y~1) )

This yields:

  Analysis of Variance Table

  Model 1: y ~ x
  Model 2: y ~ 1
    Res.Df     RSS  Df Sum of Sq      F    Pr(>F)
  1    195 184.366
  2    199 226.197  -4   -41.831 11.061 4.177e-08 ***

And indeed:

  > m
  [1] 0 0 0 1 0

Actually, you can use the "anova" function to compare any
two embedded models: for instance, two non-trivial models,
or even more that two models. But the models have to be
nested (to compare non-nested models, use the BIC (Bayesian
Information Criterion) -- more about this somewhere else in
this document).

  TODO: give a real example.

  anova( lm(y~x1+x2), lm(y~x1), lm(y~1) )
  anova( lm(y~x1+x2), lm(y~x2), lm(y~1) )

Of course, you will be careful not to perform too many
tests: each time, you have a 5% chance of claiming to see
something that is not there -- in the end, you would be sure
to spot some artefactual phenomenon.

+ Likelihood ratio test

  TODO

* ANOVA vocabulary

One of the most disturbing things when one tries to
understand analysis of variance is the vocabulary used: many
complicated words, never (or rarely) explained, for very
simple things. Here is a compendium of some of these words
(only those I have inderstood -- and I am not even sure I
have understood them properly)...

In the following examples, I shall be mainly using the "aov"
function: however, I advise you not to use it unless you are
absolutely confident in your understanding of its "Error"
argument -- I for one, do not understand it: expect more
mistakes in this part than in the rest of this document.

+ Simple anova, one-factor anova, one-way anova

It is the analysis of variance with a simple predictive
variable (i, in the following model).

  Y_ij = a + b_i + e_ij

In R, you will write

  y ~ x

where y is the variable to predict and x the predictive
(qualitative) variable.

Here is an example:

  %G
  n <- 30
  x <- sample(LETTERS[1:3],n,replace=T, p=c(3,2,1)/6)
  x <- factor(x)
  y <- rnorm(n)  
  plot(y ~ x, 
       col = 'pink',
       xlab = "", ylab = "",
       main = "Simple anova: y ~ x")
  %--

Here, the analysis of variance tells us that the three
samples might have the same mean.

  > summary(aov(y~x))
              Df Sum Sq Mean Sq F value Pr(>F)
  x            2  2.271   1.135  1.1193 0.3307
  Residuals   97 98.391   1.014

+ Double anova, two-factor anova, two-way anova

It is the analysis of variance with two predictive variables
(i and j in the following model).

  Y_ijk = a + b_i + c_j + e_ijk

In R, you would write 

  y ~ x1 + x2

In this example, the predictive variables do not interact.
If you have enough data, you can look if the interact, i.e.,
you can compare with the model

  Y_ijk = a + b_i + c_j + d_ij + e_ijk

In R, the model with an interaction term would be written

  y ~ x + y + x:y

or

  y ~ x * y

Here is another simulation.

  %G
  n <- 100
  x1 <- sample(LETTERS[1:3],n,replace=T,p=c(3,2,1)/6)
  x1 <- factor(x1)
  x2 <- sample(LETTERS[1:2],n,replace=T,p=c(3,1)/4)
  x2 <- factor(x2)
  y <- rnorm(n)
  i <- which(x1=='A' & x2=='B')
  y[i] <- rnorm(length(i),.5)
  library(lattice)
  bwplot( ~ y | x1 * x2, 
         layout = c(1,6),
         main = "Double anova: y ~ x1 + x2")
  %--

Here is the corresponding anova:

  > summary(aov(y~x1*x2))
              Df Sum Sq Mean Sq F value Pr(>F)
  x1           2  2.855   1.427  1.4716 0.2348
  x2           1  0.487   0.487  0.5021 0.4803
  x1:x2        2  0.241   0.121  0.1243 0.8833
  Residuals   94 91.175   0.970

When the groups do not have the same size, the result will
depend on the order in which you list the predictive
variables...

  > summary(aov(y~x2*x1))
              Df Sum Sq Mean Sq F value Pr(>F)
  x2           1  0.798   0.798  0.8230 0.3666
  x1           2  2.544   1.272  1.3112 0.2744
  x2:x1        2  0.241   0.121  0.1243 0.8833
  Residuals   94 91.175   0.970

In this situation you might want to use the "anova"
function to compare three or four (nested) models.

  anova( lm(y~1), lm(y~x1), lm(y~x1+x2), lm(y~x1*x2) )

+ Interaction plot

This plot can help you see if a model with two predictive
variables need an interaction term.

When comparing y ~ x1 + x2 with y ~ x1 * x2, we can plot y
as a function of x1, for each value of x2: we get a curve
for each value of x2. You can also interchange the role of
x1 and x2.

Parallel lines indicate there is no interaction.

  %G
  x1 <- gl(2,2,4,c(0,1))
  x2 <- gl(2,1,4,c(0,1))
  n <- function (x) {
    as.numeric(as.vector(x))
  }
  y1 <- n(x1) + n(x2)
  interaction.plot(
    x1, x2, y1, 
    main = "Interaction plot: No interaction"
  )
  %--

  %G
  y2 <- n(x1) + n(x2) - 2*n(x1)*n(x2)
  interaction.plot(
    x1, x2, y2, 
    main = "Interaction plot: Interaction"
  )
  %--

+ Repeated measure anova

There are several predictive variables, say X1 and X2, and
several observations for each value of (X1,X2).

This is a particular case of two-way anova -- a case in
which we can, if needed, consider an interaction term: the
model would be y~x1+x2 (no interaction) of y~x1*x2
(interaction).

+ Cross-factor anova

There are several predictiva variables, say X1 and X2, and a
single observation for each value of (X1,X2); in this case
we do not have enough data to study the interaction between
X1 and X2.

This is a particular case of two-way anova.

+ Hierarchical anova

This is still a double anova, i.e., a model of the form 

  Y_ijk = a + b_i + c_j + e_ijk

but now, the value of i determines that of j, i.e., i and j
both define groups in the data, and the groups defined by i
are contained in those defined by j. For instance

  i = Number of the leaf (from 1 to 100)
  j = Number of the tree (from 1 to 10: leaves 1 to 10 are in tree 1,
                          leaves 11 to 20 on tree 2, etc.)

Other example: 

  i = pupil
  j = classroom

Other example:

  i = country
  j = region (a region is a set of countries)

Here is a simulation of such data.

  %G
  n <- 2000    # Number of experiments
  k <- 20      # Number of subjects
  l <- 4       # Number of groups
  kl <- sample(1:l, k, replace=T)  # Group of each subject
  x1 <- sample(1:k, n, replace=T)  
  x2 <- kl[x1]
  A <- rnorm(1,sd=4)
  B <- rnorm(k,sd=4)
  C <- rnorm(l,sd=4)
  y <- A + B[x1] + C[x2] + rnorm(n)
  x1 <- factor(x1)
  x2 <- factor(x2)
  op <- par(mfrow=c(1,2))
  plot(y~x1, col='pink')
  plot(y~x2, col='pink')
  par(op)
  mtext("Hierarchical anova", line=1.5, font=2, cex=1.2)
  %--

You can put this in a single plot, by using a different
color for each group

  %G
  # If the data were real, we wouldn't know kl.
  # One may recover it that way.
  kl <- tapply(x2, 
               x1, 
               function (x) { 
                 a <- table(x)
                 names(a)[which(a==max(a))[1]] 
               })
  kl <- factor(kl, levels=levels(x2))
  plot( y ~ x1, col = rainbow(l)[kl],
        main = "Hierarchical anova")
  %--

and reordering the subjects.

  %G
  x1 <- factor(x1, levels=order(kl))
  kl <- sort(kl)
  plot( y ~ x1, col = rainbow(l)[kl],
        main = "Hierarchical anova")
  legend(par('usr')[2], par('usr')[3], # Bottom right
         xjust = 1, yjust = 0,
         1:l,   # Is this right?
         col = rainbow(l),
         lty = 1,
         lwd = 3)
  %--

If we try to naively model this, the fact that the variables
are nestes is not taken into account and prevents us from
estimating the effects of x2: once we know the effects of
x1, there is nothing left.

  > a <- x1
  > b <- x2
  > lm(y~a+b)
  Call:
  lm(formula = y ~ a + b)
  Coefficients:
  (Intercept)          a11          a18          a19           a1           a2
      10.0156      -2.7380      -2.6942      -0.7015     -15.9302     -11.5106
           a8          a13          a14          a15           a3           a4
     -13.3365      -3.0425     -10.6694     -10.7422     -15.2508     -20.9595
           a9          a17          a20           a5           a7          a10
     -21.8916     -19.5307     -19.9899     -11.1213      -7.5232     -10.9528
          a12          a16           b2           b3           b4
      -5.8704      -6.5261           NA           NA           NA

Instead, we can use the "lme" function from the "nlme"
package, or the "lmer" function from the newer "lme4"
package.

  TODO

+ Within subject

For each subject.

For instance, we might want to compute the variance for all
the observations on a given subject: this would be a "within
subject" variance.

+ Across subjets

Mean (or anything else) accross all the subjects.

For instance, we can compute the variance of all the
observations, regardless of the subject: this would be
called the "across-subject variance".

+ Experience design

In some cases, you can choose the values of the predictive
variables, because they just describe the conditions in
which the experiment was conducted.

This contrasts with the usual setup where the predictive
variables are random variables. 

  TODO: understand what follows (actually, there are entire
  books on the subject). 
  CRD: Completely Randomized Design
  RCBD: Randomized Block Design
  Latin Square: Randomized Block Design with two series of blocks
  BIB: Balanced incomplete Block Design
  Factorial Design

+ Between-subject design

In such a design, we assume there is no difference between the
subjects.

In other words, this is classical regression. 

+ In-subject design 

This time, we consider there are differences between the
subjects. This is often the case in psychological
experiments: the subjects undergo various experiments, but
each subject undergoes several experiments.

You can consider the "subject" variable as another
qualitative variable, whose effects should be taken into
account -- but we are not _interested_ in the differences
between the subjects, we do not want to know that subject 1
is faster than subject 2, but we want to account for those
differences.

This is actually one motivation for the notion of mixed
models. 

From a practical point of view, the "anova table" in the
result will have one more row, for the subjects.  This is
called "anova within".

+ Split-splot design

We have two qualitative predictive variables, X1 and X2. We
first divide the subjects into "plots", one for each value
of X1. Then, in each plot, we assign each subject to a given
value of X2 (i.e., we split the plot). The values of X2 are
nested in those of X1. This was initially used in farming:
the plots are fields (or groups of fields, close together),
the subplots are sections of those fields.

+ Hierarchical design

We have two qualitative variables, X1 and X2, and X2 is
nested within X1, i.e., X1 partitions the subjects into
groups and X2 partitions each group into smaller groups. For
instance:

  X1: continent
  X2: country

  X1: (number of the) tree
  X2: (number of the) leaf on the tree

+ Between-subject factor

A qualitative variable whose value is completely determined
once we know the others. For instance, the group the subject
belongs to, when each subject belongs to only one group.

+ Avova within

See above, "in-subject design". 

+ Carry-over effects 

In an in-subject design, the tests performed on the subjects
must be independant (for instance, if he is asked to perform
the same task in different conditions, the second time, the
differences might be due to the differing conditions or to
the memory of the first experiment).

+ Effects

This is the name given to the parameters of the regression
(i.e., the values we want to estimate).

+ Fixed effects

The effects, i.e., the parameters of the regression, are
numbers.  This is what you are used to.

+ Random effects

The effects, i.e., the parameters in the regression, are not
numbers, but random variables, that depend on the subject.

This can be written

  y = alpha + beta x + noise
  alpha ~ group   # Random effect (it depends on the group)
  beta ~ 1        # Fixed effect (it is the same for all the
                  # observations, regardless of their group)

In this example, the intercept depends on the group, an we
assume that the distribution of these intercepts is normal,
with a mean and a variance to be determined, while the slope
is the same for all groups. We say that the intercept is a
random effect and the slope a fixed effect.

This model can also be written (this is closer to the syntax
actually used by R to describe mixed models in the "lme4"
package).

  y ~  1 + x + ( 1 | group ) 

where the terms can be read as: 

  1:           fixed effect (average of alpha over 
               all the groups) 
  x:           fixed effect of x (beta, in the model above)
  (1 | group): random effect (alpha, in the model above, 
               has been decomposed as a sum of a
               group-independant value (fixed effect) and a
               group-dependant value (random effect) by
               requesting that the mean of the
               group-dependant value be zero).

+ Random effects

The model is

  Y_ij = a_i + b x_ij + c_ij

where 

  i is the subject number
  j is the experiment number
  x is a predictive quantitative variable

This is very similar to a classical linear regression, but
the intercept depends on the subject -- the slope does not.

  %G
  n <- 1000
  k <- 9
  subject <- factor(sample(1:k,n,replace=T), levels=1:k)
  a <- rnorm(k,sd=4)
  b <- rnorm(1)
  x <- rnorm(n)
  y <- a[subject] + b * x + rnorm(n)
  plot(y~x, main="Fixed effects")
  abline(lm(y~x))
  %--

We do not see much... There is a line, indeed, but the
points are very far away from this line...

Actually, there are several lines (one for each subject).

  %G
  col <- rainbow(k)
  plot(y~x, main="Random effects")
  for (i in 1:k) {
    points(y[subject==i] ~ x[subject==i], col=col[i])
    abline( lm(y~x, subset=subject==i), col=col[i] )
  }
  %--

This situation is very simple: the lines are actually
parallel.  In real-world examples, they need not be
parallel, they need not even be lines. With all the subjects
on the same plot, you do not see anything -- so we use
several plots.

  %G
  library(lattice)
  xyplot(y~x|subject, main = "Random effects")
  %--

+ Within effects, within-subject effects

Other name for "random effects".

+ Summary

The following (incomplete) table recalls the main models we
have seen.

  Model                                  Name
  -----------------------------------------------------------------------------------
  Y_ij = a + b_i + c_ij                  1-factor anova
                                         i: predictive qualitative variable

  Y_ijk = a + b_i + c_j + d_ijk          2-factor anova, with no interaction
                                         i: qualitative predictive variable
                                         j: other qualitative predictive variable
                                         i and j are independant

  Y_ijk = a + b_i + c_j + d_ij + e_ijk   2-factor anova, with interactions

  Y_ijk = a + b_i + c_j + d_ijk          hierarchical anova
                                         i and j: predictive qualitative variables
                                         i: subject number
                                         j: group number (each subject
                                            is in only one group)
                                         j is completely determined by i

+ Repeated measures

We measure the same quantity (i.e., we perform the same
experiment) on each subject several times.

  %G
  n <- 30
  k <- 10
  subject <- gl(k,n/k,n)
  x <- gl(n/k,1,n)
  y <- rnorm(k)
  y <- rnorm(n, y[subject])
  y[ x==2 ] <- y[ x==2 ] + 1

  # I do not see anything
  plot(y ~ x, 
       col = 'pink',
       main = "Repeated measures: y ~ x | subject")
  %--

  %G
  interaction.plot(
    x, subject, y,
    main = "Repeated measures: y ~ x | subject"
  )
  %--

Here is the analysis of variance:

  > summary(aov( y ~ x + Error(subject/x) ))

  Error: subject
            Df Sum Sq Mean Sq F value Pr(>F)
  Residuals  9 40.000   4.444

  Error: subject:x
            Df Sum Sq Mean Sq F value   Pr(>F)
  x          2 8.6491  4.3245  7.8876 0.003468 **
  Residuals 18 9.8689  0.5483

Forgetting the error term is equivalent to considering the
box-and-whiskers boxes above: we do not see anything. 

  > summary(aov(y~x))
              Df Sum Sq Mean Sq F value Pr(>F)
  x            2  8.649   4.325  2.3414 0.1154
  Residuals   27 49.869   1.847

We could also have taken the subject number a predictive variable: 
but then, we have too many parameters to estimate, there is no
information (people often say "degree of freedom")
left to perform a test:

  > summary(aov(y~x*subject))
              Df Sum Sq Mean Sq
  x            2  8.649   4.325
  subject        9 40.000   4.444
  x:subject     18  9.869   0.548

We shall see later that this is a "mixed model" -- no need
to understand the "aov" function and its "Error" argument. 

+ Split-splot

The subjects are separated into several groups and we
perform several measures for each subject.

  %G
  n <- 50
  k <- 10
  subjet <- gl(k,n/k,n)
  group <- sample(1:3, k, replace=T)[subjet]
  group <- factor(group)
  x <- gl(n/k,1,n)
  y <- rnorm(n)
  y[x==1] <- y[x==1] + 1
  plot(y~x, col='pink',
       main = "y ~ x | subject/group")
  %--

Here is the analysis of variance (x is the only
within-subject factor, because each subject belongs to only
one group):

  > summary(aov( y ~ x * group + Error(subject/x) ))

  Error: subject
            Df  Sum Sq Mean Sq F value Pr(>F)
  group     2  1.7068  0.8534  0.5958 0.5768
  Residuals  7 10.0260  1.4323

  Error: subject:x
            Df Sum Sq Mean Sq F value  Pr(>F)
  x          4 25.149   6.287  4.5296 0.00601 **
  x:group   8  1.994   0.249  0.1795 0.99191
  Residuals 28 38.865   1.388

+ Hierarchical design

This is the same as the split-splot design, but we have two
levels of grouping: each subject is in a group, each group
is in a class.

  n <- 64
  group <- gl(16,n/16,n)
  supgroup <- gl(4,n/4,n)
  subjet <- gl(64,n/64,n)
  y <- rnorm(n)

  TODO: plot
  y ~ group, col=supgroup

Here, each subject appears exactly once.

The analysis of variance requires two steps: first, we check if the
larger groups ("classes") are meaningful:

  > summary(aov( y ~ supgroup + Error(group) ))

  Error: group
            Df  Sum Sq Mean Sq F value Pr(>F)
  supgroup   3  0.8376  0.2792  0.2675 0.8475
  Residuals 12 12.5227  1.0436

  Error: Within
            Df Sum Sq Mean Sq F value Pr(>F)
  Residuals 48 50.131   1.044

then, if there are dissimilarities within the larger groups, because
of the smaller groups. 

  > summary(aov( y ~ supgroup:group ))
                   Df Sum Sq Mean Sq F value Pr(>F)
  supgroup:group   15 13.360   0.891  0.8528 0.6173
  Residuals        48 50.131   1.044

+ Hierarchical design (2)

TODO

Let us consider the same example, but this time, each subject appears
several times.

  n <- 256
  group <- gl(16,n/16,n)
  supgroup <- gl(4,n/4,n)
  subjet <- gl(64,n/64,n)
  y <- rnorm(n)

We first check if the larger groups are meaningful:

  summary(aov( y ~ supgroup + Error(supgroup/group*subjet) ))

  TODO: I have a few doubts
  Furthermore, R takes a very long time to do the computations...

+ The "Error" argument in the "aov" function

  y ~ predictive variables + Error( 
        Subjet / (sum or product of the predictive variables whose
                  effect is likely to change from one subject to another)
      )

This allows you to perform tests:

  Are there significant variations from one subject to the next?
  How does the effect of the predictive variables depend on the subject?

The variables whose effects are likely to depend on the subject are
sometimes called "within effects" or "within-subject factors".

The syntax 

  subject / v

is equivalent to 

  subject  +  v %in% subject


+ TODO

  One-way analysis of variance: Y_ij = a + b_i + c_ij
  Unbalanced anova: The groups do not have the same size
    (the computations are trickier -- but the computer does them for us)
  Nested design: Y_ijk = a + b_i + c_ij + d_ijk
    No c_j term
    Example: i = tree number, j = leaf number
    Example: i = father (sire) number, j = mother number
      (in farming, the same male will be used several times, but the
       mother only once: you have to wait until she gives birth)
  Nested unbalanced anova

+ Anova within and error terms

Here is another example: we investigate the duration of a cold in
subjects that have received a drug or a placebo. If we perform the
experiment with different subjects, the differences between the
subjects are larger that those induced by the drug. 

  n <- 10
  subjet <- gl(n,2)
  treatment <- gl(2,1,2*n, 0:1)
  duration <- ifelse( treatment==1, rpois(2*n,2), rpois(2*n,3) )
  summary(lm( duration ~ treatment )) # Not significant

There is no significant effect

  > summary(lm( duration ~ treatment ))

  Call:
  lm(formula = duration ~ treatment)

  Residuals:
     Min     1Q Median     3Q    Max
   -2.20  -1.20  -0.10   0.45   3.80

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   3.2000     0.5033   6.358 5.46e-06 ***
  treatment1   -1.2000     0.7118  -1.686    0.109
  ---
  Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

  Residual standard error: 1.592 on 18 degrees of freedom
  Multiple R-Squared: 0.1364,     Adjusted R-squared: 0.08838
  F-statistic: 2.842 on 1 and 18 DF,  p-value: 0.1091

However, if we perform two experiments with each subject
(they have to catch two colds each), one with the placebo,
one with the drug, we see that the cold is shortened by the
drug.

  %G
  n <- 10
  subject <- gl(n,2)
  traitement <- gl(2,1,2*n, 0:1)
  duree <- ifelse( traitement==1, rpois(2*n,2), rpois(2*n,3) )
  sans <- duree[2*(1:n)-1]
  avec <- duree[2*(1:n)]
  plot(sans+rnorm(n), avec+rnorm(n))
  abline(0,1)
  %--

Proportion of subjects for which the drug is efficient:

  > sum(sans>avec)/n
  [1] 0.8

From the Wilcoxon test, it is significant

  > wilcox.test(avec,sans, paired=T)

        Wilcoxon signed rank test with continuity correction

  data:  avec and sans
  V = 3.5, p-value = 0.02355
  alternative hypothesis: true mu is not equal to 0

We can compare this with a regression.

  > summary( lm(I(sans-avec) ~ 1) )

  Call:
  lm(formula = I(sans - avec) ~ 1)

  Residuals:
     Min     1Q Median     3Q    Max
   -2.20  -0.20  -0.20   0.55   1.80

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   1.2000     0.3887   3.087    0.013 *
  ---
  Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

  Residual standard error: 1.229 on 9 degrees of freedom

Actually, this is simply a Student T test.

  > t.test(sans-avec)

        One Sample t-test

  data:  sans - avec
  t = 3.087, df = 9, p-value = 0.01299
  alternative hypothesis: true mean is not equal to 0
  95 percent confidence interval:
   0.3206314 2.0793686
  sample estimates:
  mean of x
        1.2

You can also formulate this as a regression (it is more
general: some subjects may have undergone more that two
experiments).

  > summary(lm( duree ~ traitement + subject )) # significant

  Call:
  lm(formula = duree ~ traitement + subject)

  Residuals:
         Min         1Q     Median         3Q        Max
  -1.100e+00 -1.750e-01  9.714e-17  1.750e-01  1.100e+00

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   2.6000     0.6446   4.033  0.00296 **
  traitement1  -1.2000     0.3887  -3.087  0.01299 *
  subject2      1.5000     0.8692   1.726  0.11849
  subject3     -0.5000     0.8692  -0.575  0.57923
  subject4     -1.0000     0.8692  -1.150  0.27961
  subject5      2.5000     0.8692   2.876  0.01829 *
  subject6      0.5000     0.8692   0.575  0.57923
  subject7     -0.5000     0.8692  -0.575  0.57923
  subject8      3.5000     0.8692   4.027  0.00299 **
  subject9     -0.5000     0.8692  -0.575  0.57923
  subject10     0.5000     0.8692   0.575  0.57923
  ---
  Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

  Residual standard error: 0.8692 on 9 degrees of freedom
  Multiple R-Squared: 0.8712,     Adjusted R-squared: 0.7281
  F-statistic: 6.088 on 10 and 9 DF,  p-value: 0.006023

We get the same value

Here is yet another way of presenting the computations:

  > summary(aov(duree ~ traitement + Error(subject/traitement)))

  Error: subject
            Df Sum Sq Mean Sq F value Pr(>F)
  Residuals  9 38.800   4.311

  Error: subject:traitement
             Df Sum Sq Mean Sq F value  Pr(>F)
  traitement  1 7.2000  7.2000  9.5294 0.01299 *
  Residuals   9 6.8000  0.7556
  ---
  Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

However, be VERY careful at how you write those "error terms". 
To understand, you might want to consult

  http://cran.r-project.org/doc/contrib/rpsych.html
  
More generally: 

  summary(aov( resultat ~ a1 * a2 * a3 * b1 * b2 + Error(subj/(a1+a2+a3)) ))

where a1, a2, a3 are within-subject factors and b1, b2 are
between-subject factors.

   Formula              |    Model
  ------------------------------------------------------
   y ~ j                |    y = a_j + Error
   y ~ j + Error(i)     |    y = a_j + b_i + Error
   y ~ j + Error(i/j)   |    y = a_j + c_{i,j} + Error

Exercise: 
In the example above, should we writre the error term as 
Error(subject) or Error(subject/traitement)?

The "lme" function, in the "nlme" package, does comparable things 
(the data are different: otherwise, we would get a closer p-value).

  > summary( lme(duree ~ traitement, random = ~ 1 | subject) )
  Linear mixed-effects model fit by REML
   Data: NULL
         AIC      BIC    logLik
    67.44822 71.00971 -29.72411

  Random effects:
   Formula: ~1 | subject
          (Intercept)  Residual
  StdDev:   0.7601167 0.8850613

  Fixed effects: duree ~ traitement
              Value Std.Error DF   t-value p-value
  (Intercept)   3.2 0.3689324  9  8.673676  <.0001
  traitement1  -1.3 0.3958114  9 -3.284392  0.0095
   Correlation:
              (Intr)
  traitement1 -0.536

  Standardized Within-Group Residuals:
          Min          Q1         Med          Q3         Max
  -0.98547570 -0.73238917  0.03706879  0.48334911  1.85051895

  Number of Observations: 20
  Number of Groups: 10

The "nlme" function is a non-linear generalization of "lme" -- a
mixture of "lme" and "nls".

+ Mixed models

Sometimes, we are not interested in all the coefficients of
the model. For instance, we might want to study the effect
of a drug on the results to a given test, following the
model

  Y_ij = a + b_i + c_j + e_ij

where i is the "treatment" variable (drug or placebo) and j
is the subject number. We are interested in b_i (the effect
of the drug), not in c_j (the differences between the
subjects) -- however, we want to account for differences
between subjects when evaluating the effects of the drug.

That is a mixed model: a regression model, when we are not
interested in the actual values of some coefficients.

(There is one more difference: we shall assume that those
coefficients we are interested are random variables, that
follow a gaussian distribution of unknown mean and
variance).

In terms of matrices, a linear model is 

  Y = X b + e

(X and Y anr known, we want to explain Y from X, e is
random, we are looking for b) while a mixed model is

  Y = X b + Z u + e

(X, Y and Z are known, u and e are random, we are looking
for b).  We say that b are the fixed effects and u the
random effects.

+ Fixed effects

These are the predictive variables we are interested in in a
mixed model (more precisely, the "effects" are the
corresponding coefficients).

+ Random effects

These are the predictive variables we are not interested
in. They are usually qualitative, taking a very large number
of values, and we only have a few of those values (e.g., in
an experiment on humans, the "subject" variable tells which
human is being considered, among the 5 to 10 billions at
hand -- but the experiment was only conducted on a few
dozens of humans).

+ Mixed model and linear model

You might be tempted to see a mixed model as a particular
case of a linear model (put X and Z in the same matrix, b
and u in the same vector, estimate b and u, only report the
estimated values of b). There are however a few differences.

In a mixed model, we have fewer parameters to estimate, so
we can get results with fewer data.

There are also a few differences in the tests that can be
performed (to use a vocabulary I do not really like: we have
more "degrees of freedom" with a mixed model).

The mixed model assumes that the random effects (i.e., the
coefficients we are not interested in) follow a gaussian
distribution -- in a linear model, there is no such
assumption.

+ Mixed models and Generalized Least Squares (GLS)

Let us recall that in the linear model (Ordinary Least
Squares, OLS), the variance of the noise term is a scalar
matrix (i.e., a multiple of the identity matrix, i.e., the
noise terms are independant and have the same
variance). With the Generalized Least Squares, this matrix
no longer need be diagonal.

This is exactly what we have in a moxed model.

  Y = X b + Z u + e.

If we write R and G the variance-covariance of e and u, the
variance of Y is

  Var(Y) = ZGZ'+R.

So, even if e and u are scalar matrices, the variance of Y
can be more complex: the moxed model is not a particular
case of the linear model where we discard the coefficients
we are not interested in, but rather a particular cas of
generalized least squares.

* Mixed models

The following terms are more or less equivalent:

  Mixed models
  Random Coefficient Model 

  HLM (Hierarchical Level Model)
  Multilevel Analysis 
  Grouped Data

  Panel Data
  Longitudinal Data

+ Hierarchical Models

In classical models, we have a universe (for instance, the
British population) from which we sample subjects, at
random, in which we perform experiments or measurements
(income, sex, age, weight, etc.).

In hierarchical model, this sampling is performed in (at
least) two steps. We have a first universe (say, that of
British cities), from which we sample. For each of those
elements, we measure a few variables of interest (latitude,
longitude, size, taxes, average income, etc.). For each of
those elements (cities), we have another universe (its
inhabitants), from which we now sample elements (people) on
which we measure the variables we are interested in (income,
age, etc.)

If we out the collected data in a table, we get two tables.

  City          Area           Inhabitants
  ----------------------------------------
  A                1                 28000    
  B                7                 50000
  C                3                 15000

  Subject  City       Income  Age
  -------------------------------
  Artaud      A           30   32
  Balzac      A           50   38
  Camus       B           10   21
  Descartes   C           20   49
  Etiemble    C           27   36
  Fenelon     C           23   29
  Gide        B           15   56

It will be easier to put everything in a single table -- but some data
will be repeated.

  Subject  City            Area         Inhabitants   Income  Age
  ---------------------------------------------------------------
  Artaud      A               1               28000       30   32
  Balzac      A               1               28000       50   38
  Camus       B               7               50000       10   21
  Descartes   C               3               15000       20   49
  Etiemble    C               3               15000       27   36
  Fenelon     C               3               15000       23   29
  Gide        B               7               50000       15   56

In this example, we have two sampling levels (city and subject)
and we have variables defined at each level (area and number of
inhabitants are defined for the cities, i.e., are the same for all the
inhabitants of a given city; income and age are subject-level
variables).

Now that we have given a numeric example, let us perform a
few simulations and look at a few plots.

  %G
  # There are 12 cities
  n.cities <- 12

  # The area of those cities (more reasonably, the logarithm
  # of their areas) are gaussian, independant variables.
  area.moyenne <- 5
  area.sd <- 1
  area <- rnorm(n.cities, area.moyenne, area.sd)

  a <- rnorm(n.cities)
  b <- rnorm(n.cities)

  # 200 inhabitants sampled in each city
  n.inhabitants <- 20
  city <- rep(1:n.cities, each=n.inhabitants)

  # The age are independant gaussian variables, mean=40, sd=10
  # We could have chosen a different distribution for each city. 
  # (either randomly, or depending on their area or population).

  age <- rnorm(n.cities*n.inhabitants, 40, 10)

  # The income (the variable we try to explain) is a function of the
  # age, but the coefficients depend on the city
  # Here, the coefficients are taken at random, but they could 
  # depend on the city area or population.
  # Here, the coefficients are independant -- this is rarely the case
  a <- rnorm(n.cities, 20000, sd=2000)
  b <- rnorm(n.cities, sd=20)
  income <- 200*area[city] + a[city] + b[city]*age + 
            rnorm(n.cities*n.inhabitants, sd=200)

  plot(income ~ age, col=rainbow(n.cities)[city], pch=16)
  %--

We do not see much on this plot. One idea is to consider a
different plot for each city: this is what the "lattice"
package provides (recall that the general idea of "lattice"
(or "treillis") plots: slicing a cloud of points). Remember
the syntax,

  variable to predict  ~  predictive variables  |  group

we shall use it again when estimating the coefficients of
mixed models (i.e., estimating the mean and variance of a
and b, the covariance of (a,b), performing tests to know if
a and b really depend on the city).

  %G
  library(lattice)
  xyplot(income ~ age | city)
  %--

In this example, both a and b depended on the city. But
perhaps only one of them depends n the city -- or even,
none.

  %G
  a <- rnorm(n.cities, 20000, sd=2000)
  b <- rep(rnorm(1, sd=20), n.cities)
  income <- 200*area[city] + a[city] + b[city]*age + 
            rnorm(n.cities*n.inhabitants, sd=200)
  xyplot(income ~ age | city)
  %--

  %G
  a <- rep( rnorm(1, 20000, sd=2000), n.cities )
  b <- rnorm(n.cities, sd=20)
  income <- 200*area[city] + a[city] + b[city]*age + 
            rnorm(n.cities*n.inhabitants, sd=200)
  xyplot(income ~ age | city)
  %--

  %G
  a <- rep( rnorm(1, 20000, sd=2000), n.cities )
  b <- rep(rnorm(1, sd=20), n.cities)
  income <- 200*area[city] + a[city] + b[city]*age + 
            rnorm(n.cities*n.inhabitants, sd=200)
  xyplot(income ~ age | city)
  %--

+ Other motivation

Mixed models appear when the observations are no longer
independant: the subjects belong to groups (we know the
groups, but there are too many of them, we cannot take a
subject from all the groups) and the variables measured on
the subjects of a given group have similar values.

  TODO: more details
  TODO: give an example?

  Example: The results at an exam depend on the class.
  Example: cluster sampling

+ more complex models

You can have more that two levels:

  student/class/teacher/school/city/country

  TODO: other examples

You can move the variables from one level to another: for
instance, you can consider the income of a subject, or the
average income in a city.

TODO: give the R syntax to describe those more complicated
models.

The various random variables at hand need not be
independant: the model can give the shape of this
variance-covariance matrix.

  TODO: details
  TODO: more precise examples

+ The ecological fallacy

A commom mistake is to perform an analysis at a given level
and draw conclusitons at another.

Example: if there is a correlation between the rate of
illiteracy and the presence of a certain ethnic minority,
this does not mean that you will have the same correlation
at the subject level.

But rest reassured, you can of course perform multi-level
analyses, for instance, by trying to predict a subject-level
variable from subject-level and group-level variables.

+ Levels are not predictive variables

For instance, the name of the city should not be used to
predict anything -- simply because we have sampled among the
cities, we do not have all the cities.

+ Regressions galore

The first idea (and often, the first step), when analysing
grouped data, is to perform a regression in each group (or,
at least, look at each group separately -- e.g., with a
lattice plot). Let us go back to one of our simulated
examples above.

  %G
  n.cities <- 12
  area <- rnorm(n.cities, 5, 1)
  a <- rnorm(n.cities)
  b <- rnorm(n.cities)
  n.inhabitants <- 20
  city <- rep(1:n.cities, each=n.inhabitants)
  age <- rnorm(n.cities*n.inhabitants, 40, 10)
  a <- rnorm(n.cities, 20000, sd=2000)
  b <- rep(rnorm(1, sd=20), n.cities)
  income <- 200*area[city] + a[city] + b[city]*age +
  rnorm(n.cities*n.inhabitants, sd=200)
  xyplot(income ~ age | city)
  %--

The "lmList" function in the "nlme" package performs a
regression in each group.

  library(nlme)
  d <- data.frame(income, age, city, area=area[city])
  r <- lmList(income ~ age | city, data=d)

this yields:

  > summary(r)
  Call:
    Model: income ~ age | city
     Data: d

  Coefficients:

     (Intercept)
     Estimate Std. Error   t value Pr(>|t|)
  1  21479.14   162.7878 131.94565        0
  2  24026.24   274.6432  87.48163        0
  3  19633.32   169.5408 115.80292        0
  4  17354.47   205.3334  84.51853        0
  5  22675.63   207.5127 109.27345        0
  6  22732.46   235.9244  96.35486        0
  7  21292.20   178.5410 119.25666        0
  8  20044.50   185.1989 108.23230        0
  9  18924.25   136.4418 138.69835        0
  10 20630.17   147.8031 139.57873        0
  11 20572.80   170.0011 121.01566        0
  12 18025.29   180.2910  99.97887        0

     age
       Estimate Std. Error     t value   Pr(>|t|)
  1  -6.0190190   4.343643 -1.38570746 0.16726531
  2   4.3642466   6.469611  0.67457638 0.50066630
  3   5.1117531   4.252501  1.20205817 0.23065717
  4   0.2409800   5.088937  0.04735370 0.96227508
  5   0.2023943   5.150521  0.03929589 0.96869078
  6   8.0885149   5.755782  1.40528515 0.16137312
  7   7.9947740   4.245020  1.88333011 0.06099939
  8   4.2547386   4.528523  0.93954232 0.34850184
  9   2.9008258   3.713258  0.78120777 0.43553568
  10  3.4318408   3.813179  0.89999457 0.36912542
  11 -3.1825003   3.956591 -0.80435410 0.42207690
  12 -1.3094108   4.573603 -0.28629742 0.77492474

  Residual standard error: 195.6667 on 216 degrees of freedom

Here, we see that the intercept depends on the city while
the slope does not. We say that the intercept is a "random
coefficient" (or a random effect) and that the slope is a
fixed coefficient (or a fixed effect).

  %G
  library(nlme)
  d <- data.frame(income, age, city, area=area[city])
  r <- lmList(income ~ age | city, data=d)
  plot(intervals(r))
  %--

Exercise: Take the area into account.

+ TODO

  The rest...

+ TODO: TO SORT