Regression Problems -- and their Solutions

In this chapter, we list some of the problems that may occur in a regression and explain how to spot them -- graphically. Often, you can solve the problem by transforming the variables (so that the outliers and influential observations disappear, so that the residuals look normal, so that the residuals have the same variance -- quite often, you can do all this at the same time), by altering the model (switching to a simpler or to a more complex one) or by using another regression (GLS to account for heteroskedasticity and correlated residuals, robust regression to account for remaining influential observations).

Overfit: choose a simpler model
Underfit: curvilinear regression, non-linear regression, local regression
Influential points: transform the data, robust regression, weighted least squares, remove the points
Influential clusters: transform the data, mixtures
Non-gaussian residuals: transform the data, robust regression, normalp
Heteroskedasticity: gls
Correlated residuals: gls
Unidentifiability: shrinkage methods
Missing values: discard the observations???
The curse of dimension (GAM,...)
Combining regressions (BMA,...)

* Tests and confidence intervals

After this bird's eye view of several regression techniques, let us come back to linear regression.

+ Tests

The "summary" function gave us the results of a Student T test on the regression coefficients -- it answered the question "is this coefficient significantly different from zero?".

> x <- rnorm(100)
> y <- 1 + 2*x + .3*rnorm(100)
> summary(lm(y~x))
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.08440    0.03224   33.64   <2e-16 ***
x            2.04051    0.03027   67.42   <2e-16 ***

In the same example, if we have a prior idea of the value of a coefficient, we can test this value: here, let us test whether the intercept is 1. (When you write a formula to describe a model, some operators are interpreted in a different way (especially * and ^): to be sure that they will be understood as arithmetic operations on the variables, surround them with I(...). Here, it is not needed.)

> x <- rnorm(100)
> y <- 1 + 2*x + .3*rnorm(100)
> summary(lm(I(y-1)~x))

Call:
lm(formula = I(y - 1) ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-0.84692 -0.24891  0.02781  0.20486  0.60522

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01294    0.02856  -0.453    0.651
x            1.96405    0.02851  68.898   <2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.2855 on 98 degrees of freedom
Multiple R-Squared: 0.9798,     Adjusted R-squared: 0.9796
F-statistic: 4747 on 1 and 98 DF,  p-value: < 2.2e-16

Under the hypothesis that this coefficient is 1, let us check whether the other is zero.

> summary(lm(I(y-1)~0+x))

Call:
lm(formula = I(y - 1) ~ 0 + x)

Residuals:
     Min       1Q   Median       3Q      Max
-0.85962 -0.26189  0.01480  0.19184  0.59227

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x  1.96378    0.02839   69.18   <2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.2844 on 99 degrees of freedom
Multiple R-Squared: 0.9797,     Adjusted R-squared: 0.9795
F-statistic: 4786 on 1 and 99 DF,  p-value: < 2.2e-16

Other method:

> x <- rnorm(100)
> y <- 1 + 2*x + .3*rnorm(100)
> a <- rep(1,length(x))
> summary(lm(y~offset(a)-1+x))

Call:
lm(formula = y ~ offset(a) - 1 + x)

Residuals:
     Min       1Q   Median       3Q      Max
-0.92812 -0.09901  0.09515  0.28893  0.99363

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x  2.04219    0.03114   65.58   <2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.3317 on 99 degrees of freedom
Multiple R-Squared: 0.9816,     Adjusted R-squared: 0.9815
F-statistic: 5293 on 1 and 99 DF,  p-value: < 2.2e-16

Let us now check whether the slope is equal to 2:

> summary(lm(I(y-1-2*x)~0+x))

Call:
lm(formula = I(y - 1 - 2 * x) ~ 0 + x)

Residuals:
     Min       1Q   Median       3Q      Max
-0.85962 -0.26189  0.01480  0.19184  0.59227

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x -0.03622    0.02839  -1.276    0.205

Residual standard error: 0.2844 on 99 degrees of freedom
Multiple R-Squared: 0.01618,    Adjusted R-squared: 0.006244
F-statistic: 1.628 on 1 and 99 DF,  p-value: 0.2049

Other method:

> summary(lm(y~offset(1+2*x)+0+x))

Call:
lm(formula = y ~ offset(1 + 2 * x) + 0 + x)

Residuals:
     Min       1Q   Median       3Q      Max
-0.92812 -0.09901  0.09515  0.28893  0.99363

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x  0.04219    0.03114   1.355    0.179

Residual standard error: 0.3317 on 99 degrees of freedom
Multiple R-Squared: 0.9816,     Adjusted R-squared: 0.9815
F-statistic: 5293 on 1 and 99 DF,  p-value: < 2.2e-16

More generally, you can use the "offset" function in a linear regression when you know the exact value of one (or several) of the coefficients.

Other method:

x <- rnorm(100)
y <- 1 + 2*x + .3*rnorm(100)
library(car)
linear.hypothesis(
  lm(y~x),
  matrix(c(1,0,0,1), 2, 2),
  c(1,2)
)

This checks whether

  [ 1 0 ]   [ first coefficient  ]   [ 1 ]
  [     ] * [                    ] = [   ]
  [ 0 1 ]   [ second coefficient ]   [ 2 ]

This yields:

F-Test
SS = 0.04165324
SSE = 9.724817
F = 0.2098763
Df = 2 and 98
p = 0.8110479

+ Confidence intervals and prediction intervals

You can compute confidence intervals on the parameters.

> library(MASS)
> n <- 100
> x <- rnorm(n)
> y <- 1 - 2*x + rnorm(n)
> r <- lm(y~x)
> r$coefficients
(Intercept)           x
  0.9569173  -2.1296830
> confint(r)
                 2.5 %    97.5 %
(Intercept)  0.7622321  1.151603
x           -2.3023449 -1.957021

You can also look for confidence intervals on the predicted values. It can be a confidence interval for the value of the regression line E[Y|X=x] = a + bx (confidence band) or -- this is different -- a prediction interval for a new observation of Y drawn at X=x (prediction band).

%G
x <- runif(20)
y <- 1-2*x+.1*rnorm(20)
res <- lm(y~x)
plot(y~x)
new <- data.frame( x=seq(0,1,length=21) )
p <- predict(res, new)
points( p ~ new$x, type='l' )
p <- predict(res, new, interval='confidence')
points( p[,2] ~ new$x, type='l', col="green" )
points( p[,3] ~ new$x, type='l', col="green" )
p <- predict(res, new, interval='prediction')
points( p[,2] ~ new$x, type='l', col="red" )
points( p[,3] ~ new$x, type='l', col="red" )
title(main="Confidence and prediction bands")
legend( par("usr")[1], par("usr")[3], yjust=0,
        c("Confidence band", "Prediction band"),
        lwd=1, lty=1, col=c("green", "red") )
%--

TODO: stress the difference between the two...

Away from the values of the sample, the intervals grow.
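To stress the difference between the two before looking at how both bands widen: the confidence band is about the position of the regression line E[Y|X=x] (it shrinks as the sample grows), while the prediction band is about a new observation of Y drawn at X=x (it cannot shrink below the noise level). Here is a small check, as a sketch of ours (the names x0 and new.y are ours):

set.seed(1)
n <- 20
x <- runif(n)
y <- 1 - 2*x + .1*rnorm(n)
r <- lm(y~x)
x0 <- data.frame(x=.5)
predict(r, x0, interval="confidence")   # narrow: where the line is
p <- predict(r, x0, interval="prediction")
p                                       # wider: where a new observation may fall
# Roughly 95% of new observations drawn at x=0.5 should fall in the
# prediction interval (p[2] and p[3] are its lower and upper bounds).
new.y <- 1 - 2*.5 + .1*rnorm(10000)
mean( new.y > p[2] & new.y < p[3] )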
%G plot(y~x, xlim=c(-1,2), ylim=c(-3,3)) new <- data.frame( x=seq(-2,3,length=200) ) p <- predict(res, new) points( p ~ new$x, type='l' ) p <- predict(res, new, interval='confidence') points( p[,2] ~ new$x, type='l', col="green" ) points( p[,3] ~ new$x, type='l', col="green" ) p <- predict(res, new, interval='prediction') points( p[,2] ~ new$x, type='l', col="red" ) points( p[,3] ~ new$x, type='l', col="red" ) title(main="Confidence and prediction bands") legend( par("usr")[1], par("usr")[3], yjust=0, c("Confidence band", "Prediction band"), lwd=1, lty=1, col=c("green", "red") ) %-- %G plot(y~x, xlim=c(-5,6), ylim=c(-11,11)) new <- data.frame( x=seq(-5,6,length=200) ) p <- predict(res, new) points( p ~ new$x, type='l' ) p <- predict(res, new, interval='confidence') points( p[,2] ~ new$x, type='l', col="green" ) points( p[,3] ~ new$x, type='l', col="green" ) p <- predict(res, new, interval='prediction') points( p[,2] ~ new$x, type='l', col="red" ) points( p[,3] ~ new$x, type='l', col="red" ) title(main="Confidence and prediction bands") legend( par("usr")[1], par("usr")[3], yjust=0, c("Confidence band", "Prediction band"), lwd=1, lty=1, col=c("green", "red") ) %-- Here are other ways of representing the confidence and prediction intervals. %G N <- 100 n <- 20 x <- runif(N, min=-1, max=1) y <- 1 - 2*x + rnorm(N, sd=abs(x)) res <- lm(y~x) plot(y~x) x0 <- seq(-1,1,length=n) new <- data.frame( x=x0 ) p <- predict(res, new) points( p ~ x0, type='l' ) p <- predict(res, new, interval='prediction') segments( x0, p[,2], x0, p[,3], col='red') p <- predict(res, new, interval='confidence') segments( x0, p[,2], x0, p[,3], col='green', lwd=3 ) %-- %G mySegments <- function(a,b,c,d,...) { u <- par('usr') e <- (u[2]-u[1])/100 segments(a,b,c,d,...) segments(a+e,b,a-e,b,...) segments(c+e,d,c-e,d,...) } plot(y~x) p <- predict(res, new) points( p ~ x0, type='l' ) p <- predict(res, new, interval='prediction') mySegments( x0, p[,2], x0, p[,3], col='red') p <- predict(res, new, interval='confidence') mySegments( x0, p[,2], x0, p[,3], col='green', lwd=3 ) %-- + Test on a pair of variables (ellipses) You can want a confidence interval for a single, isolated, parameter -- you get an interval -- or for several parameters at a time -- you get an ellipsoid, called the confidence region. It brings more information than the 1-variable confidence intervals (you cannot combine those intervals). Thus you might want to plot these ellipses. The ellipse you get is skewed, because it comes from the correlation matrix of the two coefficients: simply diagonalize it in an orthonormal basis and the eigen vectors will give you the axes. %G library(ellipse) my.confidence.region <- function (g, a=2, b=3) { e <- ellipse(g,c(a,b)) plot(e, type="l", xlim=c( min(c(0,e[,1])), max(c(0,e[,1])) ), ylim=c( min(c(0,e[,2])), max(c(0,e[,2])) ), ) x <- g$coef[a] y <- g$coef[b] points(x,y,pch=18) cf <- summary(g)$coefficients ia <- cf[a,2]*qt(.975,g$df.residual) ib <- cf[b,2]*qt(.975,g$df.residual) abline(v=c(x+ia,x-ia),lty=2) abline(h=c(y+ib,y-ib),lty=2) points(0,0) abline(v=0,lty="F848") abline(h=0,lty="F848") } n <- 20 x1 <- rnorm(n) x2 <- rnorm(n) x3 <- rnorm(n) y <- x1+x2+x3+rnorm(n) g <- lm(y~x1+x2+x3) my.confidence.region(g) %-- %G n <- 20 x <- rnorm(n) x1 <- x+.2*rnorm(n) x2 <- x+.2*rnorm(n) y <- x1+x2+rnorm(n) g <- lm(y~x1+x2) my.confidence.region(g) %-- In the three following plots, the probability that the actual values of the parameters be in the pink area is the same. 
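As an aside, before those three plots: the ellipse is the set of parameter values that an F test on both coefficients at once does not reject. Here is that computation, as a sketch of ours (the function in.confidence.region and its arguments are ours; it uses the estimated covariance matrix returned by vcov):

in.confidence.region <- function (g, b0, i=c(2,3), level=.95) {
  # Is the candidate value b0 (a vector of length 2) inside the
  # confidence region for the coefficients i of the model g?
  d <- coef(g)[i] - b0
  f <- t(d) %*% solve(vcov(g)[i,i]) %*% d / length(i)
  as.vector(f) < qf(level, length(i), g$df.residual)
}
n <- 20
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- x1+x2+x3+rnorm(n)
g <- lm(y~x1+x2+x3)
in.confidence.region(g, coef(g)[2:3])   # TRUE: the centre of the ellipse
in.confidence.region(g, c(10,10))       # almost surely FALSE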
%G
my.confidence.region <- function (g, a=2, b=3, which=0, col='pink') {
  e <- ellipse(g,c(a,b))
  x <- g$coef[a]
  y <- g$coef[b]
  cf <- summary(g)$coefficients
  ia <- cf[a,2]*qt(.975,g$df.residual)
  ib <- cf[b,2]*qt(.975,g$df.residual)
  xmin <- min(c(0,e[,1]))
  xmax <- max(c(0,e[,1]))
  ymin <- min(c(0,e[,2]))
  ymax <- max(c(0,e[,2]))
  plot(e, type="l", xlim=c(xmin,xmax), ylim=c(ymin,ymax), )
  if(which==1){
    polygon(e,col=col)
  } else if(which==2){
    rect(x-ia,par('usr')[3],x+ia,par('usr')[4],col=col,border=col)
  } else if(which==3){
    rect(par('usr')[1],y-ib,par('usr')[2],y+ib,col=col,border=col)
  }
  lines(e)
  points(x,y,pch=18)
  abline(v=c(x+ia,x-ia),lty=2)
  abline(h=c(y+ib,y-ib),lty=2)
  points(0,0)
  abline(v=0,lty="F848")
  abline(h=0,lty="F848")
}
my.confidence.region(g, which=1)
%--

%G
my.confidence.region(g, which=2)
%--

%G
my.confidence.region(g, which=3)
%--

+ The dangers of multiple tests

When you perform a multiple regression, you try to retain as few predictive variables as possible, while retaining all those that are relevant. To choose or discard variables, you might be tempted to perform a lot of statistical tests. This is a bad idea. Indeed, for each test, you have a certain risk of making a mistake -- and those risks pile up. However, this is usually what we do -- we rarely have the choice. You can either start with a model with no variable at all, then add the "best" predictive variable (say, the one with the highest correlation) and progressively add other variables (say, the ones that provide the biggest increase in R^2), stopping when you reach a certain criterion (say, when the F statistic reaches a certain value); or start with a saturated model, containing all the variables, and successively remove the variables that provide the smallest decrease in R^2, stopping when the F statistic reaches a value fixed in advance.

TODO: Tukey, etc. (in the Anova chapter?)

+ Regression and sums of squares

When you read regression or anova (analysis of variance) results, you often face a table "full of sums of squares".

RSS (Residual Sum of Squares): this is the quantity you try to minimize in a regression. More precisely, let X be the predictive variable, Y the variable to predict and hat Yi the predicted value: we set hat Yi = b0 + b1 Xi and we try to find the values of b0 and b1 that minimize
  RSS = Sum( (Yi - hat Yi)^2 ).

TSS (Total Sum of Squares): this is the sum of squares minimized when you look for the mean of Y,
  TSS = Sum( (Yi - bar Y)^2 ).

ESS (Explained Sum of Squares): this is the difference between the preceding two sums of squares. It can also be written as a sum of squares,
  ESS = Sum( ( hat Yi - bar Y )^2 ).

R-square: "determination coefficient" or "percentage of variance of Y explained by X". The closer to one, the better the regression explains the variations of Y. R^2 = ESS/TSS.

You can give a graphical interpretation of the determination coefficient. The data are clustered in a band of height RSS around the regression line, while the height of the plot is TSS. We then have R^2 = 1 - RSS/TSS.

%G
n <- 20000
x <- runif(n)
y <- 4 - 8*x + rnorm(n)
plot(y~x, pch='.')
abline(lm(y~x), col='red')
arrows( .1, -6, .1, 6, code=3, lwd=3, col='blue' )
arrows( .9, -3.2-2, .9, -3.2+2, code=3, lwd=3, col='blue' )
text( .1, 6, "TSS", adj=c(0,0), cex=2, col='blue' )
text( .9, -3.2+2, "RSS", adj=c(1,0), cex=2, col='blue' )
%--

+ Reading the results of a regression

We have seen three ways of printing the results of a regression: with the "print", "summary" and "anova" functions.
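These sums of squares are easy to recompute by hand; here is a small sketch of ours, checking that TSS = RSS + ESS and that R^2 is indeed ESS/TSS, before we look at how "summary" and "anova" report them.

set.seed(1)
x <- rnorm(100)
y <- 1 + 2*x + .3*rnorm(100)
r <- lm(y~x)
TSS <- sum( (y - mean(y))^2 )
RSS <- sum( residuals(r)^2 )
ESS <- sum( (fitted(r) - mean(y))^2 )
all.equal( TSS, RSS + ESS )           # TRUE (the decomposition holds with an intercept)
c( ESS/TSS, summary(r)$r.squared )    # the same number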
The last line of the "anova" function compares our model with the null model (i.e., with the model with no explanatory variables at all, y ~ 1). > x <- rnorm(100) > y <- 1 + 2*x + .3*rnorm(100) > summary(lm(y~x)) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -0.84692 -0.24891 0.02781 0.20486 0.60522 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.98706 0.02856 34.56 <2e-16 *** x 1.96405 0.02851 68.90 <2e-16 *** --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 0.2855 on 98 degrees of freedom Multiple R-Squared: 0.9798, Adjusted R-squared: 0.9796 F-statistic: 4747 on 1 and 98 DF, p-value: < 2.2e-16 The result of the "anova" function explains where these fugures come from: you have the sum of squares, their "mean" (just divide by the "number of degrees of freedom"), their quotient (F-value) and the probability that this quotient be as high if the slope of the line is zero (i.e., the p-value of the test of H0: "the slope is zero" against H1: "the slope is not zero"). > anova(lm(y~x)) Analysis of Variance Table Response: y Df Sum Sq Mean Sq F value Pr(>F) x 1 386.97 386.97 4747 < 2.2e-16 *** Residuals 98 7.99 0.08 --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 More generally, the "anova" function performs a test that compares embedded models (here, a model with an intercept and a slope, and a model with an intercept and no slope). + Comparing two models In a multiple regression, you strive to retain as few variables as possible. In this process, you want to compare models: e.g., compare a model with a lot of variable and a model with fewer variables. The "anova" function performs that kind of comparison (it does not answer the question "is this model better?" but "are these models significantly different?" -- if they are not significantly different, you will reject the more complicated one). data(trees) r1 <- lm(Volume ~ Girth, data=trees) r2 <- lm(Volume ~ Girth + Height, data=trees) anova(r1,r2) The result Analysis of Variance Table Model 1: Volume ~ Girth Model 2: Volume ~ Girth + Height Res.Df RSS Df Sum of Sq F Pr(>F) 1 29 524.30 2 28 421.92 1 102.38 6.7943 0.01449 * --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 tells us that the two models are significantly different with a risk of error under 2%. Here are a few other examples. x1 <- rnorm(100) x2 <- rnorm(100) x3 <- rnorm(100) b <- .1* rnorm(100) y <- 1 + x1 + x2 + x3 + b r1 <- lm( y ~ x1 ) r2 <- lm( y ~ x1 + x2 + x3 ) anova(r1,r2) # p-value = 2e-16 y <- 1 + x1 r1 <- lm( y ~ x1 ) r2 <- lm( y ~ x1 + x2 + x3 ) anova(r1,r2) # p-value = 0.25 y <- 1 + x1 + .02*x2 - .02*x3 + b r1 <- lm( y ~ x1 ) r2 <- lm( y ~ x1 + x2 + x3 ) anova(r1,r2) # p-value = 0.10 You can compare more that two model (but always nested models: each model is included in the next). y <- 1 + x1 + x2 + x3 + b r1 <- lm( y ~ x1 ) r2 <- lm( y ~ x1 + x2 ) r3 <- lm( y ~ x1 + x2 + x3 ) anova(r1,r2,r3) # p-values = 2e-16 (both) If, in the comparison of two models, you get a very high p-value, i.e., if the two models are not significantly different, you will reject the more complex and retain the simplest. + Anova and regression You can present the computations performed in a regression as an anova table. Furthermore, the idea behind the computations is the same: express the variance of Y as the sum of a variance of a variable affine in X and a residual variance, and minimize this residual variance. 
x <- runif(10)
y <- 1 + x + .2*rnorm(10)
anova(lm(y~x))

Here, the anova table tells us that, indeed, Y depends on X, with a risk of error under 1%.

Analysis of Variance Table
Response: y
          Df  Sum Sq Mean Sq F value   Pr(>F)
x          1 0.85633 0.85633  20.258 0.002000 **
Residuals  8 0.33817 0.04227

It still works with several predictive variables.

x <- runif(10)
y <- runif(10)
z <- 1 + x - y + .2*rnorm(10)
anova(lm(z~x+y))

The analysis of variance table tells us that z depends on x and y, with a risk of error under 1%.

Analysis of Variance Table
Response: z
          Df  Sum Sq Mean Sq F value    Pr(>F)
x          1 2.33719 2.33719  45.294 0.0002699 ***
y          1 0.73721 0.73721  14.287 0.0068940 **
Residuals  7 0.36120 0.05160

Counterintuitive and frightening as it may be, you might notice that the result depends on the order of the predictive variables...

> anova(lm(z~y+x))
Analysis of Variance Table
Response: z
          Df  Sum Sq Mean Sq F value   Pr(>F)
y          1 2.42444 2.42444  46.985 0.000241 ***
x          1 0.64996 0.64996  12.596 0.009355 **
Residuals  7 0.36120 0.05160

In some cases, you can even get contradictory results: depending on the order of the predictive variables, you can find that z sometimes depends on x, sometimes not.

> x <- runif(10)
> y <- runif(10)
> z <- 1 + x + 5*y + .2*rnorm(10)
> anova(lm(z~x+y))
Analysis of Variance Table
Response: z
          Df Sum Sq Mean Sq  F value    Pr(>F)
x          1 0.0104  0.0104   0.1402    0.7192
y          1 7.5763  7.5763 102.1495 1.994e-05 ***
Residuals  7 0.5192  0.0742

> anova(lm(z~y+x))
Analysis of Variance Table
Response: z
          Df Sum Sq Mean Sq F value    Pr(>F)
y          1 7.1666  7.1666  96.626 2.395e-05 ***
x          1 0.4201  0.4201   5.664   0.04889 *
Residuals  7 0.5192  0.0742

* Partial residual plots, added variable plots

* Some plots to explore a regression

+ Residuals

The residuals are the differences between the observed values and the predicted values.

+ Residuals and noise

The noise is the difference between the observed values and the values given by the true model; it appears in the model, e.g.,
  Y = a + b X + noise
Residuals and noise are two different things. Even from a statistical point of view, they look different. For instance, their variance is not the same (neither is the shape of their distribution, by the way). The following simulation estimates the variance of the residuals: we get 0.008893307 while the noise variance was 0.01.

a <- 1
b <- -2
s <- .1
n <- 10
N <- 1e6
v <- NULL
for (i in 1:N) {
  x <- rnorm(n)
  y <- 1-2*x+s*rnorm(n)
  v <- append(v, var(lm(y~x)$res))
}
mean(v)

You can show that the variance of the residual of observation i is
  sigma^2 * ( 1 - h_i )
where sigma^2 is the noise variance and h_i is the "leverage" of observation i (the i-th diagonal term of the hat matrix H = X (X'X)^-1 X').

+ Studentized (or standardized) residuals

These are the normalized residuals. Their variance is estimated from the whole sample.

?rstandard

+ Jackknife (or studentized) residuals

These are still the "normalized" residuals, but this time, we estimate the variance from the sample without the current observation.

?rstudent

+ Plotting the residuals

Let us consider the following example.

%G
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1^2+rnorm(n)
x4 <- 1/(1+x2^2)+.2*rnorm(n)
y <- 1+x1-x2+x3-x4+.1*rnorm(n)
pairs(cbind(x1,x2,x3,x4,y))
%--

The residuals are full-fledged statistical variables: you can look at them through box-and-whisker plots, histograms, qqplots, etc. You can also plot them as a function of the predicted values, as a function of the various predictive variables or as a function of the observation number.
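Before the plots, here is a quick sketch of ours relating these residuals to the leverages: "rstandard" divides each residual by s*sqrt(1-h_i), and "rstudent" does the same with the variance re-estimated without observation i.

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 - 2*x + rnorm(n)
r <- lm(y~x)
h <- hatvalues(r)        # leverages h_i
s <- summary(r)$sigma    # estimated standard deviation of the noise
all.equal( rstandard(r), residuals(r) / (s * sqrt(1 - h)) )   # TRUE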
%G 600 200
r <- lm(y~x1+x2+x3+x4)
boxplot(r$res, horizontal=T)
%--

%G
hist(r$res)
%--

%G
plot(r$res, main='Residuals')
%--

%G
plot(rstandard(r), main='Standardized residuals')
%--

%G
plot(rstudent(r), main="Studentized residuals")
%--

%G
plot(r$res ~ r$fitted.values, main="Residuals and predicted values")
abline(h=0, lty=3)
%--

%G
op <- par(mfrow=c(2,2))
plot(r$res ~ x1)
abline(h=0, lty=3)
plot(r$res ~ x2)
abline(h=0, lty=3)
plot(r$res ~ x3)
abline(h=0, lty=3)
plot(r$res ~ x4)
abline(h=0, lty=3)
par(op)
%--

%G
n <- 100
x1 <- rnorm(n)
x2 <- 1:n
y <- rnorm(1)
for (i in 2:n) {
  y <- c(y, y[i-1] + rnorm(1))
}
y <- x1 + y
r <- lm(y~x1+x2)   # Or simply: lm(y~x1)
op <- par(mfrow=c(2,1))
plot( r$res ~ x1 )
plot( r$res ~ x2 )
par(op)
%--

It is usually a bad idea to plot the residuals as a function of the observed values, because the "noise term" in the model appears on both axes, and you (almost) end up plotting this noise as a function of itself...

%G
n <- 100
x <- rnorm(n)
y <- 1-x+rnorm(n)
r <- lm(y~x)
plot(r$res ~ y)
abline(h=0, lty=3)
abline(lm(r$res~y),col='red')
title(main='Not a good idea...')
%--

+ Partial regression plot (or added variable plot)

Let us consider a regression situation with two predictive variables X1 and X2 and one variable to predict Y. You can study the effect of X1 on Y after removing the (linear) effect of X2 on Y: simply regress Y against X2, then X1 against X2, and plot the residuals of the former against those of the latter. Those plots may help you spot influential observations.

%G
partial.regression.plot <- function (y, x, n, ...) {
  m <- as.matrix(x[,-n])
  y1 <- lm(y ~ m)$res
  x1 <- lm(x[,n] ~ m)$res
  plot( y1 ~ x1, ... )
  abline(lm(y1~x1), col='red')
}
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1+x2+rnorm(n)
x <- cbind(x1,x2,x3)
y <- x1+x2+x3+rnorm(n)
op <- par(mfrow=c(2,2))
partial.regression.plot(y, x, 1)
partial.regression.plot(y, x, 2)
partial.regression.plot(y, x, 3)
par(op)
%--

There is already an "av.plots" function in the "car" package for this.

%G
library(car)
av.plots(lm(y~x1+x2+x3),ask=F)
%--

The "leverage.plots" function, still in the "car" package, generalizes this idea.

?leverage.plots

+ Partial residual plots

It is very similar to partial regression plots: this time, you plot Y, from which you have removed the effects of X2, as a function of X1. It is more efficient than partial regression to spot non-linearities -- but partial regression is superior when it comes to spotting influential or outlying observations.

%G
my.partial.residual.plot <- function (y, x, i, ...) {
  r <- lm(y~x)
  xi <- x[,i]
  # Y, minus the linear effects of X_j
  yi <- r$residuals + r$coefficients[i] * x[,i]
  plot( yi ~ xi, ... )
}
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1+x2+rnorm(n)
x <- cbind(x1,x2,x3)
y <- x1+x2+x3+rnorm(n)
op <- par(mfrow=c(2,2))
my.partial.residual.plot(y, x, 1)
my.partial.residual.plot(y, x, 2)
my.partial.residual.plot(y, x, 3)
par(op)
%--

The "car" or "Design" packages provide functions to plot the partial residuals.

library(car)
?cr.plots
?ceres.plots

library(Design)
?plot.lrm.partial
?lrm

* Overfit

+ Overfit

The regression might be "too close" to the data, to the point that it becomes unrealistic, that it performs poorly on "out-of-sample" data. The situation is not always as striking and obvious as in the following example. However, if you want to choose, say, a non-linear model (or anything complex), you must be able to justify it. In particular, compare the number of parameters to estimate with the number of observations...
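To make "performs poorly on out-of-sample data" concrete before the plot that follows, you can compare the in-sample error with the error on new data drawn from the same model -- a sketch of ours:

set.seed(1)
n <- 10
x <- seq(0, 1, length=n)
y <- 1 - 2*x + .3*rnorm(n)
r1 <- lm(y ~ x)                   # reasonable model
r8 <- lm(y ~ poly(x, 8))          # almost interpolates the data
new.y <- 1 - 2*x + .3*rnorm(n)    # new observations at the same x values
c( mean(residuals(r1)^2), mean((new.y - fitted(r1))^2) )   # comparable
c( mean(residuals(r8)^2), mean((new.y - fitted(r8))^2) )   # much worse out of sample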
%G n <- 10 x <- seq(0,1,length=n) y <- 1-2*x+.3*rnorm(n) plot(spline(x, y, n = 10*n), col = 'red', type='l', lwd=3) points(y~x, pch=16, lwd=3, cex=2) abline(lm(y~x)) title(main='Overfit') %-- This is mainly common-sense. In the case of a linear regression, you can compare the determination coefficient (in case of overfit, it is close to 1) and the adjusted determination coefficient (that accounts for overfitting problems). > summary(lm(y~poly(x,n-1))) Call: lm(formula = y ~ poly(x, n - 1)) Residuals: ALL 10 residuals are 0: no residual degrees of freedom! Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.01196 poly(x, n - 1)1 -1.94091 poly(x, n - 1)2 -0.02303 poly(x, n - 1)3 -0.08663 poly(x, n - 1)4 -0.06938 poly(x, n - 1)5 -0.34501 poly(x, n - 1)6 -0.51048 poly(x, n - 1)7 -0.28479 poly(x, n - 1)8 -0.22273 poly(x, n - 1)9 0.39983 Residual standard error: NaN on 0 degrees of freedom Multiple R-Squared: 1, Adjusted R-squared: NaN F-statistic: NaN on 9 and 0 DF, p-value: NA > summary(lm(y~poly(x,n-2))) Call: lm(formula = y ~ poly(x, n - 2)) Residuals: 1 2 3 4 5 6 7 8 9 10 -0.001813 0.016320 -0.065278 0.152316 -0.228473 0.228473 -0.152316 0.065278 -0.016320 0.001813 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.01196 0.12644 0.095 0.940 poly(x, n - 2)1 -1.94091 0.39983 -4.854 0.129 poly(x, n - 2)2 -0.02303 0.39983 -0.058 0.963 poly(x, n - 2)3 -0.08663 0.39983 -0.217 0.864 poly(x, n - 2)4 -0.06938 0.39983 -0.174 0.891 poly(x, n - 2)5 -0.34501 0.39983 -0.863 0.547 poly(x, n - 2)6 -0.51048 0.39983 -1.277 0.423 poly(x, n - 2)7 -0.28479 0.39983 -0.712 0.606 poly(x, n - 2)8 -0.22273 0.39983 -0.557 0.676 Residual standard error: 0.3998 on 1 degrees of freedom Multiple R-Squared: 0.9641, Adjusted R-squared: 0.6767 F-statistic: 3.355 on 8 and 1 DF, p-value: 0.4 If you are reasonable, the determination coefficient and its adjusted version are very close. > x <- seq(0,1,length=n) > y <- 1-2*x+.3*rnorm(n) > summary(lm(y~poly(x,10))) Call: lm(formula = y ~ poly(x, 10)) Residuals: Min 1Q Median 3Q Max -0.727537 -0.206951 -0.002332 0.177562 0.902353 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.01312 0.02994 -0.438 0.662 poly(x, 10)1 -6.11784 0.29943 -20.431 <2e-16 *** poly(x, 10)2 -0.11099 0.29943 -0.371 0.712 poly(x, 10)3 -0.04936 0.29943 -0.165 0.869 poly(x, 10)4 -0.28863 0.29943 -0.964 0.338 poly(x, 10)5 -0.15348 0.29943 -0.513 0.610 poly(x, 10)6 0.12146 0.29943 0.406 0.686 poly(x, 10)7 0.05066 0.29943 0.169 0.866 poly(x, 10)8 0.09707 0.29943 0.324 0.747 poly(x, 10)9 0.07554 0.29943 0.252 0.801 poly(x, 10)10 0.42494 0.29943 1.419 0.159 --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 0.2994 on 89 degrees of freedom Multiple R-Squared: 0.8256, Adjusted R-squared: 0.8059 F-statistic: 42.12 on 10 and 89 DF, p-value: < 2.2e-16 > summary(lm(y~poly(x,1))) ... Multiple R-Squared: 0.8182, Adjusted R-squared: 0.8164 ... + Sample too small If the sample is too small, you will not be able to estimate much, In such situationm you have to restrict yourself to simple (simplistic) models, such as linear models, becaus the overfitting risk is too high. + Too many variables TODO Explain what you can do if there are more variables than observations. The naive approach will not work: n <- 100 k <- 500 x <- matrix(rnorm(n*k), nr=n, nc=k) y <- apply(x, 1, sum) lm(y~x) svm * Underfit + Underfit (curvilinearity) Sometimes, the model is too simplistic. 
%G
x <- runif(100, -1, 1)
y <- 1-x+x^2+.3*rnorm(100)
plot(y~x)
abline(lm(y~x), col='red')
%--

On the preceding plot, the problem is not obvious, but you can spot it by checking whether a polynomial model would not do better,

> summary(lm(y~poly(x,10)))
...
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.29896    0.02841  45.725  < 2e-16 ***
poly(x, 10)1  -4.98079    0.28408 -17.533  < 2e-16 ***
poly(x, 10)2   2.53642    0.28408   8.928 5.28e-14 ***
poly(x, 10)3  -0.06738    0.28408  -0.237    0.813
poly(x, 10)4  -0.15583    0.28408  -0.549    0.585
poly(x, 10)5   0.15112    0.28408   0.532    0.596
poly(x, 10)6   0.04512    0.28408   0.159    0.874
poly(x, 10)7  -0.29056    0.28408  -1.023    0.309
poly(x, 10)8  -0.39384    0.28408  -1.386    0.169
poly(x, 10)9  -0.25763    0.28408  -0.907    0.367
poly(x, 10)10 -0.09940    0.28408  -0.350    0.727

or by using splines (or any other regularization method) and by looking at the result,

%G
plot(y~x)
lines(smooth.spline(x,y), col='red', lwd=2)
title(main="Splines can help you spot non-linear relations")
%--

%G
plot(y~x)
lines(lowess(x,y), col='red', lwd=2)
title(main='Non-linear relations and "lowess"')
%--

%G
plot(y~x)
xx <- seq(min(x),max(x),length=100)
yy <- predict( loess(y~x), data.frame(x=xx) )
lines(xx,yy, col='red', lwd=3)
title(main='Non-linear relation and "loess"')
%--

or by looking at the residuals, smoothed with any regularization method (plot the residuals as a function of the predicted values or as a function of the predictive variables).

%G
r <- lm(y~x)
plot(r$residuals ~ r$fitted.values,
     xlab='predicted values', ylab='residuals',
     main='Residuals and predicted values')
lines(lowess(r$fitted.values, r$residuals), col='red', lwd=2)
abline(h=0, lty=3)
%--

%G
plot(r$residuals ~ x,
     xlab='x', ylab='residuals',
     main='Residuals and the predictive variable')
lines(lowess(x, r$residuals), col='red', lwd=2)
abline(h=0, lty=3)
%--

In some (rare) cases, you have several observations for the same value of the predictive variables: you can then perform the following test.

x <- rep(runif(10, -1, 1), 10)
y <- 1-x+x^2+.3*rnorm(100)
r1 <- lm(y ~ x)
r2 <- lm(y ~ factor(x))
anova(r1,r2)

Both models should give the same predictions: here, it is not the case.

Analysis of Variance Table
Model 1: y ~ x
Model 2: y ~ factor(x)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1     98 19.5259
2     90  6.9845  8   12.5414 20.201 < 2.2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Let us try with a linear relation.

x <- rep(runif(10, -1, 1), 10)
y <- 1-x+.3*rnorm(100)
r1 <- lm(y ~ x)
r2 <- lm(y ~ factor(x))
anova(r1,r2)

The models are not significantly different.

Analysis of Variance Table
Model 1: y ~ x
Model 2: y ~ factor(x)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     98 9.6533
2     90 8.9801  8    0.6733 0.8435 0.5671

+ Structural changes: TODO

# Non-linearity
library(lmtest)
?harvtest
?raintest
?reset

The "strucchange" package detects structural changes (very often in time series, e.g., in econometrics). There is a structural change when (for instance) the linear model is correct, but its coefficients change from time to time. If you know where the change occurs, you just split your sample into several chunks and perform a regression on each (to make sure that a change occurred, you can test the equality of the coefficients across the chunks). But usually, you do not know where the changes occur. You can try with a moving window to find the most probable date for the structural change (you can take a window with a constant width, or one with a constant number of observations).
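For instance (a sketch of ours, using functions from the "strucchange" package whose skeleton calls are listed just below): a regression whose intercept and slope change half-way through the sample.

library(strucchange)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- c( 1 + 2*x[1:100], 4 - 2*x[101:200] ) + .3*rnorm(n)
ocus <- efp(y ~ x, type="OLS-CUSUM")
plot(ocus)         # the empirical fluctuation process crosses its boundary
sctest(ocus)       # small p-value: there is a structural change
fs <- Fstats(y ~ x)
plot(fs)           # the F statistics peak near the true break
breakpoints(fs)    # estimated break point, near observation 100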
TODO: an example

# Structural change
library(strucchange)
efp(..., type="Rec-CUSUM")
efp(..., type="OLS-MOSUM")
plot(efp(...))
sctest(efp(...))

TODO: an example

Fstats(...)
plot(Fstats(...))
sctest(Fstats(...))

* Influential points

+ Influential observations

Some points might bear an abnormally high influence on the regression results. Sometimes, they come from mistakes (these should be identified and corrected); sometimes, they are perfectly normal but extreme. The leverage effect can yield incorrect results.

%G
n <- 20
done.outer <- F
while (!done.outer) {
  done <- F
  while(!done) {
    x <- rnorm(n)
    done <- max(x)>4.5
  }
  y <- 1 - 2*x + x*rnorm(n)
  r <- lm(y~x)
  done.outer <- max(cooks.distance(r))>5
}
plot(y~x)
abline(1,-2,lty=2)
abline(lm(y~x),col='red',lwd=3)
lm(y~x)$coef
%--

The first thing to do, even before starting the regression, is to look at the variables one at a time.

%G 600 200
boxplot(x, horizontal=T)
%--

%G 600 200
stripchart(x, method='jitter')
%--

%G
hist(x, col='light blue', probability=T)
lines(density(x), col='red', lwd=3)
%--

Same for y.

%G 600 200
boxplot(y, horizontal=T)
%--

%G 600 200
stripchart(y, method='jitter')
%--

%G
hist(y, col='light blue', probability=T)
lines(density(y), col='red', lwd=3)
%--

Here, there is an extreme point. As there is a single one, we might be tempted to remove it -- if there were several, we would rather try to transform the data.

There is a measure of the "extremeness" of a point -- its leverage: the diagonal elements of the hat matrix
  H = X (X' X)^-1 X'
It is called the "hat matrix" because \hat Y = H Y. Those values tell us how an error on an observed value of Y propagates to the predicted values. The leverages are between 1/n and 1. Under 0.2, it is fine. You will note that this uses the predictive variable X but not the variable to predict Y.

%G
plot(hat(x), type='h', lwd=5)
%--

You can also measure the effect of each observation on the regression: remove the point, compute the regression and the predicted values, and compare them with the values predicted from the whole sample:

%G
plot(dffits(r),type='h',lwd=3)
%--

You can also compare the coefficients:

%G
plot(dfbetas(r)[,1],type='h',lwd=3)
%--

%G
plot(dfbetas(r)[,2],type='h',lwd=3)
%--

In higher dimensions, you can plot the variation of a coefficient as a function of the variation of the other coefficients.

%G
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
yy <- x1 - x2 + rnorm(n)
yy[1] <- 10
r <- lm(yy~x1+x2)
pairs(dfbetas(r))
%--

Cook's distance measures the effect of an observation on the regression as a whole. You should start to be cautious when D > 4/n.

%G
cd <- cooks.distance(r)
plot(cd,type='h',lwd=3)
%--

You can also have a look at the box-and-whiskers plot, the scatterplot, the histogram, the density of the Cook's distances (for a moment, we put aside our pathological example).

%G
n <- 100
xx <- rnorm(n)
yy <- 1 - 2 * xx + rnorm(n)
rr <- lm(yy~xx)
cd <- cooks.distance(rr)
plot(cd,type='h',lwd=3)
%--

%G 600 200
boxplot(cd, horizontal=T)
%--

%G 600 200
stripchart(cd, method='jitter')
%--

%G
hist(cd, probability=T, breaks=20, col='light blue')
%--

%G
plot(density(cd), type='l', col='red', lwd=3)
%--

%G
qqnorm(cd)
qqline(cd, col='red')
%--

Some suggest comparing the distribution of the Cook's distances with a half-gaussian distribution. People often do that for variables whose values are all positive -- but it does not look like a half-gaussian!
%G half.qqnorm <- function (x) { n <- length(x) qqplot(qnorm((1+ppoints(n))/2), x) } half.qqnorm(cd) %-- You can use those values to spot the most important points on the scatterplot pr on a residual plot. %G m <- max(cooks.distance(r)) plot(y~x, cex=1+5*cooks.distance(r)/m) %-- %G cd <- cooks.distance(r) # rescaled Cook's distance rcd <- (99/4) * cd*(cd+1)^2 rcd[rcd>100] <- 100 plot(r$res~r$fitted.values, cex=1+.05*rcd) abline(h=0,lty=3) %-- You can also use colors. %G m <- max(cd) plot(r$res, cex=1+5*cd/m, col=heat.colors(100)[ceiling(70*cd/m)], pch=16, ) points(r$res, cex=1+5*cd/m) abline(h=0,lty=3) %-- %G plot(r$res, cex=1+.05*rcd, col=heat.colors(100)[ceiling(rcd)], pch=16, ) points(r$res, cex=1+.05*rcd) abline(h=0,lty=3) %-- The following example should be more colorful. %G n <- 100 x <- rnorm(n) y <- 1 - 2*x + rnorm(n) r <- lm(y~x) cd <- cooks.distance(r) m <- max(cd) plot(r$res ~ r$fitted.values, cex=1+5*cd/m, col=heat.colors(100)[ceiling(70*cd/m)], pch=16, ) points(r$res ~ r$fitted.values, cex=1+5*cd/m) abline(h=0,lty=3) %-- It might be prettier on a black background. %G op <- par(fg='white', bg='black', col='white', col.axis='white', col.lab='white', col.main='white', col.sub='white') plot(r$res ~ r$fitted.values, cex=1+5*cd/m, col=heat.colors(100)[ceiling(100*cd/m)], pch=16, ) abline(h=0,lty=3) par(op) %-- You can also have a look at the "lm.influence", "influence.measures", "ls.diag" functions. TODO: delete the following plot? %G # With Cook's distance x <- rnorm(20) y <- 1 + x + rnorm(20) x <- c(x,10) y <- c(y,1) r <- lm(y~x) d <- cooks.distance(r) d <- (99/4)*d*(d+1)^2 + 1 d[d>100] <- 100 d[d<20] <- 20 d <- d/20 plot( y~x, cex=d ) abline(r) abline(coef(line(x,y)), col='red') abline(lm(y[1:20]~x[1:20]),col='blue') %-- * Influential clusters + Clusters of outliers When there is not a single extreme value but several, it is trickier to spot. You can try with a resistant regression, such as a trimmed regression (lts). Usually, those multiple extreme values do not appear at random (as would an isolated outlier), but have a real meaning -- as such, they should be dealt with. You can try to spot them with unsupervised learning algorithms. ?hclust ?kmeans TODO: give an example A cluster of extreme values can also be a sign that the model is not appropriate. TODO: write up (and model) the following example (mixture) %G n <- 200 s <- .2 x <- runif(n) y1 <- 1 - 2 * x + s*rnorm(n) y2 <- 2 * x - 1 + s*rnorm(n) y <- ifelse( sample(c(T,F),n,replace=T,prob=c(.25,.75)), y1, y2 ) plot(y~x) abline(1,-2,lty=3) abline(-1,2,lty=3) %-- * Non gaussian residuals + Non-gaussian residuals If the residuals are not gaussian, the least squares estimators are not optimal (some robust estimators are better, even if they are biased) and, even worse, all the tests, variance computations, confidence interval computations are wrong. However, if the residuals are less dispersed that with a gaussian distribution or, to a lesser extent, if the sample is very large, you can forget those problems. You can spot non-gaussian residuals with histograms, box-and-whiskers plots (boxplots) or quantile-quantile plots. %G 600 200 x <- runif(100) y <- 1 - 2*x + .3*exp(rnorm(100)-1) r <- lm(y~x) boxplot(r$residuals, horizontal=T) %-- %G hist(r$residuals, breaks=20, probability=T, col='light blue') lines(density(r$residuals), col='red', lwd=3) f <- function(x) { dnorm(x, mean=mean(r$residuals), sd=sd(r$residuals), ) } curve(f, add=T, col="red", lwd=3, lty=2) %-- Do not forget the quantile-quantile plots. 
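Besides these plots (the quantile-quantile plot follows right below), a formal normality test can be applied to the residuals -- keeping in mind that, on large samples, such tests reject for departures too small to matter. A sketch of ours, on the same kind of data:

set.seed(1)
x <- runif(100)
y <- 1 - 2*x + .3*exp(rnorm(100)-1)   # the same skewed noise as above
r <- lm(y~x)
shapiro.test(residuals(r))            # small p-value: the residuals do not look gaussian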
%G
qqnorm(r$residuals)
qqline(r$residuals, col='red')
%--

Let us look at what happens with non-gaussian residuals. We shall consider a rather extreme situation: a Cauchy variable with a hole in the middle and a rather small sample.

%G
rcauchy.with.hole <- function (n) {
  x <- rcauchy(n)
  x[x>0] <- 10+x[x>0]
  x[x<0] <- -10+x[x<0]
  x
}
n <- 20
x <- rcauchy(n)
y <- 1 - 2*x + .5*rcauchy.with.hole(n)
plot(y~x)
abline(1,-2)
r <- lm(y~x)
abline(r, col='red')
%--

%G
op <- par(mfrow=c(2,2))
hist(r$residuals, breaks=20, probability=T, col='light blue')
lines(density(r$residuals), col='red', lwd=3)
f <- function(x) {
  dnorm(x, mean=mean(r$residuals), sd=sd(r$residuals), )
}
curve(f, add=T, col="red", lwd=3, lty=2)
qqnorm(r$residuals)
qqline(r$residuals, col='red')
plot(r$residuals ~ r$fitted.values)
plot(r$residuals ~ x)
par(op)
%--

Let us compute some forecasts.

n <- 10000
xp <- runif(n,-50,50)
yp <- predict(r, data.frame(x=xp), interval="prediction")
yr <- 1 - 2*xp + .5*rcauchy.with.hole(n)
sum( yr < yp[,3] & yr > yp[,2] )/n

We get 0.9546, i.e., we are in the prediction interval in (more than) 95% of the cases -- but the prediction interval is huge: it tells us that we cannot predict much.

Let us try with a smaller sample.

n <- 5
x <- rcauchy(n)
y <- 1 - 2*x + .5*rcauchy.with.hole(n)
r <- lm(y~x)
n <- 10000
xp <- sort(runif(n,-50,50))
yp <- predict(r, data.frame(x=xp), interval="prediction")
yr <- 1 - 2*xp + .5*rcauchy.with.hole(n)
sum( yr < yp[,3] & yr > yp[,2] )/n

Even worse: 0.9975. To see what happens, let us plot some of these points.

%G
done <- F
while(!done) {
  # A situation where the prediction interval is not too
  # large, so that it appears on the plot.
  n <- 5
  x <- rcauchy(n)
  y <- 1 - 2*x + .5*rcauchy.with.hole(n)
  r <- lm(y~x)
  n <- 100000
  xp <- sort(runif(n,-50,50))
  yp <- predict(r, data.frame(x=xp), interval="prediction")
  done <- ( yp[round(n/2),2] > -75 & yp[round(n/2),3] < 75 )
}
yr <- 1 - 2*xp + .5*rcauchy.with.hole(n)
plot(yp[,1]~xp, type='l', xlim=c(-50,50), ylim=c(-100,100))
points(yr~xp, pch='.')
lines(xp, yp[,2], col='blue')
lines(xp, yp[,3], col='blue')
abline(r, col='red')
points(y~x, col='orange', pch=16, cex=1.5)
points(y~x, cex=1.5)
%--

%G
done <- F
while(!done) {
  # Even worse: the sign of the slope is incorrect
  n <- 5
  x <- rcauchy(n)
  y <- 1 - 2*x + .5*rcauchy.with.hole(n)
  r <- lm(y~x)
  n <- 100000
  xp <- sort(runif(n,-50,50))
  yp <- predict(r, data.frame(x=xp), interval="prediction")
  print(r$coef[2])
  done <- ( yp[round(n/2),2] > -75 & yp[round(n/2),3] < 75 & r$coef[2]>0 )
}
yr <- 1 - 2*xp + .5*rcauchy.with.hole(n)
plot(yp[,1]~xp, type='l', xlim=c(-50,50), ylim=c(-100,100))
points(yr~xp, pch='.')
lines(xp, yp[,2], col='blue')
lines(xp, yp[,3], col='blue')
abline(r, col='red')
points(y~x, col='orange', pch=16, cex=1.5)
points(y~x, cex=1.5)
%--

We see that the prediction interval is huge if there are several outliers. Let us try with smaller values of x.

n <- 10000
xp <- sort(runif(n,-.1,.1))
yp <- predict(r, data.frame(x=xp), interval="prediction")
yr <- 1 - 2*xp + .5*rcauchy.with.hole(n)
sum( yr < yp[,3] & yr > yp[,2] )/n

We get 0.9932...
%G
done <- F
while (!done) {
  n <- 5
  x <- rcauchy(n)
  y <- 1 - 2*x + .5*rcauchy.with.hole(n)
  r <- lm(y~x)
  done <- T
}
n <- 10000
xp <- sort(runif(n,-2,2))
yp <- predict(r, data.frame(x=xp), interval="prediction")
yr <- 1 - 2*xp + .5*rcauchy.with.hole(n)
plot(c(xp,x), c(yp[,1],y), pch='.', xlim=c(-2,2), ylim=c(-50,50) )
lines(yp[,1]~xp)
abline(r, col='red')
lines(xp, yp[,2], col='blue')
lines(xp, yp[,3], col='blue')
points(yr~xp, pch='.')
points(y~x, col='orange', pch=16)
points(y~x)
%--

%G
done <- F
essais <- 0
while (!done) {
  n <- 5
  x <- rcauchy(n)
  y <- 1 - 2*x + .5*rcauchy.with.hole(n)
  r <- lm(y~x)
  yp <- predict(r, data.frame(x=2), interval='prediction')
  done <- yp[3]<0
  essais <- essais+1
}
print(essais)   # Around 20 or 30
n <- 10000
xp <- sort(runif(n,-2,2))
yp <- predict(r, data.frame(x=xp), interval="prediction")
yr <- 1 - 2*xp + .5*rcauchy.with.hole(n)
plot(c(xp,x), c(yp[,1],y), pch='.', xlim=c(-2,2), ylim=c(-50,50) )
lines(yp[,1]~xp)
points(yr~xp, pch='.')
abline(r, col='red')
lines(xp, yp[,2], col='blue')
lines(xp, yp[,3], col='blue')
points(y~x, col='orange', pch=16)
points(y~x)
%--

%G
done <- F
e <- NULL
for (i in 1:100) {
  essais <- 0
  done <- F
  while (!done) {
    n <- 5
    x <- rcauchy(n)
    y <- 1 - 2*x + .5*rcauchy.with.hole(n)
    r <- lm(y~x)
    yp <- predict(r, data.frame(x=2), interval='prediction')
    done <- yp[3]<0
    essais <- essais+1
  }
  e <- append(e,essais)
}
hist(e, probability=T, col='light blue')
lines(density(e), col='red', lwd=3)
abline(v=median(e), lty=2, col='red', lwd=3)
%--

> mean(e)
[1] 25.8
> median(e)
[1] 19

In short, to get the most incorrect prediction intervals, take large values of x, but not too large (close to 0, the predictions are correct; away from 0, the prediction intervals are huge).

I wanted to show here, on an example, that non-gaussian residuals produce confidence intervals that are too small and thus incorrect results. I was wrong: the confidence intervals are correct but very large, to the point that the forecasts are useless.

Exercise: do the same with other distributions (Cauchy, uniform, etc.), either for the noise or for the variables.

* Heteroskedasticity

+ Heteroscedasticity

For the least squares estimators to be optimal and for the test results to be correct, we had to assume (among other hypotheses) that the variance of the noise was constant. If it is not, the noise is said to be heteroscedastic.

%G
x <- runif(100)
y <- 1 - 2*x + .3*x*rnorm(100)
plot(y~x)
r <- lm(y~x)
abline(r, col='red')
title(main="Heteroscedasticity")
%--

You can spot the problem on the residuals.

%G
plot(r$residuals ~ r$fitted.values)
%--

Or, more precisely, on their absolute value, on which you can perform a non-linear regression.

%G
plot(abs(r$residuals) ~ r$fitted.values)
lines(lowess(r$fitted.values, abs(r$residuals)), col='red')
%--

%G
plot(abs(r$residuals) ~ x)
lines(lowess(x, abs(r$residuals)), col='red')
%--

Here is a concrete example.

%G
data(crabs)
plot(FL~RW, data=crabs)
%--

%G
r <- lm(FL~RW, data=crabs)
plot(r, which=1)
%--

%G
plot(r, which=3, panel = panel.smooth)
%--

The "spread.level.plot" function from the "car" package has the same aim: it plots the absolute value of the residuals as a function of the predicted values, on logarithmic scales, and suggests a transformation to get rid of the heteroscedasticity.

%G
library(car)
spread.level.plot(r)
%--

You can also see the problem in a more computational way, by splitting the sample into two parts and performing a test to see if the two parts have the same variance.
n <- length(crabs$RW) m <- ceiling(n/2) o <- order(crabs$RW) r <- lm(FL~RW, data=crabs) x <- r$residuals[o[1:m]] y <- r$residuals[o[(m+1):n]] var.test(y,x) # p-value = 1e-4 Let us see, on an example, what the effects of heteroscedasticity are. x <- runif(100) y <- 1 - 2*x + .3*x*rnorm(100) r <- lm(y~x) xp <- runif(10000,0,.1) yp <- predict(r, data.frame(x=xp), interval="prediction") yr <- 1 - 2*xp + .3*xp*rnorm(100) sum( yr < yp[,3] & yr > yp[,2] )/n We get 1: where the variance is small, the confidence intervals are too small. x <- runif(100) y <- 1 - 2*x + .3*x*rnorm(100) r <- lm(y~x) xp <- runif(10000,.9,1) yp <- predict(r, data.frame(x=xp), interval="prediction") yr <- 1 - 2*xp + .3*xp*rnorm(100) sum( yr < yp[,3] & yr > yp[,2] )/n We get 0.67: where the variance is higher, the confidence intervals are too small. We can see this graphically. %G x <- runif(100) y <- 1 - 2*x + .3*x*rnorm(100) r <- lm(y~x) n <- 10000 xp <- sort(runif(n,)) yp <- predict(r, data.frame(x=xp), interval="prediction") yr <- 1 - 2*xp + .3*xp*rnorm(n) plot(c(xp,x), c(yp[,1],y), pch='.') lines(yp[,1]~xp) abline(r, col='red') lines(xp, yp[,2], col='blue') lines(xp, yp[,3], col='blue') points(yr~xp, pch='.') points(y~x, col='orange', pch=16) points(y~x) title(main="Consequences of heteroscedasticity on prediction intervals") %-- The simplest way to get rid of heteroscedasticity is (when it works) to transform the data. If it is possible, find a transformation of the data that will both have it look gaussian and get rid of heteroscedasticity. Generalized least squaes allow you to perform a regression with heteroscedastic data, but you have to know how the variance varies. TODO: put this example later? %G n <- 100 x <- runif(n) y <- 1 - 2*x + x*rnorm(n) plot(y~x) r <- lm(y~x) abline(r, col='red') title(main="Classical linear regression") %-- %G plot(abs(r$res) ~ x) r2 <- lm( abs(r$res) ~ x ) abline(r2, col="red") title(main="Heteroscedasticity of the residuals") %-- The idea of weighted least squares is to give a lesser weight (i.e., a lesser importance) to observations whose variance is high. %G # We assume the the standard deviation of the residuals # is of the form a*x a <- lm( I(r$res^2) ~ I(x^2) - 1 )$coefficients w <- (a*x)^-2 r3 <- lm( y ~ x, weights=w ) plot(y~x) abline(1,-2, lty=3) abline(lm(y~x), lty=3, lwd=3) abline(lm(y~x, weights=w), col='red') legend( par("usr")[1], par("usr")[3], yjust=0, c("acutal model", "least squares", "weighted least squares"), lwd=c(1,3,1), lty=c(3,3,1), col=c(par("fg"), par("fg"), 'red') ) title("Weighted least squares and heteroscedasticity") %-- On the contrary, the prediction intervals are not very convincing... TODO: check what follows %G # Prediction intervals N <- 10000 xx <- runif(N,min=0,max=2) yy <- 1 - 2*xx + xx*rnorm(N) plot(y~x, xlim=c(0,2), ylim=c(-3,2)) points(yy~xx, pch='.') abline(1,-2, col='red') xp <- seq(0,3,length=100) yp1 <- predict(r, new=data.frame(x=xp), interval='prediction') lines( xp, yp1[,2], col='red', lwd=3 ) lines( xp, yp1[,3], col='red', lwd=3 ) yp3 <- predict(r3, new=data.frame(x=xp), interval='prediction') lines( xp, yp3[,2], col='blue', lwd=3 ) lines( xp, yp3[,3], col='blue', lwd=3 ) legend( par("usr")[1], par("usr")[3], yjust=0, c("least squares", "weighted least squares"), lwd=3, lty=1, col=c('red', 'blue') ) title(main="Prediction band") %-- You can also do that with the "gls" function ?gls ?varConstPower ?varPower r4 <- gls(y~x, weights=varPower(1, form= ~x)) ??? 
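One way to make that call run (a sketch of ours, with the "nlme" package; varPower() models the standard deviation of the noise as |x|^delta, with delta estimated from the data):

library(nlme)
set.seed(1)
n <- 100
x <- runif(n)
y <- 1 - 2*x + x*rnorm(n)                        # standard deviation proportional to x
r4 <- gls(y ~ x, weights = varPower(form = ~ x))
summary(r4)      # the estimated power should be close to 1
intervals(r4)    # confidence intervals, including one for the power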
+ lmtest library(help=lmtest) library(help=strucchange) # Heteroscedasticity library(lmtest) ?dwtest ?bgtest ?bptest ?gqtest ?hmctest * Correlated errors + Correlated errors In the case of time series, of geographical data (or more generally, data for which you ave a notion of "proximity" between the observations), the errors of two consecutive observations may be correlated. In the case of time series, you can see the problem in an autocorrelogram. %G my.acf.plot <- function (x, n=10, ...) { y <- rep(NA,n) l <- length(x) for (i in 1:n) { y[i] <- cor( x[1:(l-i)], x[(i+1):l] ) } plot(y, type='h', ylim=c(-1,1),...) } n <- 100 x <- runif(n) b <- .1*rnorm(n+1) y <- 1-2*x+b[1:n] my.acf.plot(lm(y~x)$res, lwd=10) abline(h=0, lty=2) %-- %G z <- 1-2*x+.5*(b[1:n]+b[1+1:n]) my.acf.plot(lm(z~x)$res, lwd=10) abline(h=0, lty=2) %-- Here is a very autocorrelated example. %G n <- 500 x <- runif(n) b <- rep(NA,n) b[1] <- 0 for (i in 2:n) { b[i] <- b[i-1] + .1*rnorm(1) } y <- 1-2*x+b[1:n] my.acf.plot(lm(y~x)$res, n=100) abline(h=0, lty=2) title(main='Very autocorrelated example') %-- We do not see anything on the plot of the residuals as a function of the predicted values. %G r <- lm(y~x) plot(r$res ~ r$fitted.values) title(main="Residuals of the very correlated example") %-- On the contrary, if you plot the residuals as a function of time, it is clearer. %G r <- lm(y~x) plot(r$res) title(main="Residuals of the very correlated example") %-- Another means of spotting the problem is to check if the correlation between x[i] and x[i-1] is significantly non zero. n <- 100 x <- runif(n) b <- rep(NA,n) b[1] <- 0 for (i in 2:n) { b[i] <- b[i-1] + .1*rnorm(1) } y <- 1-2*x+b[1:n] r <- lm(y~x)$res cor.test(r[1:(n-1)], r[2:n]) # p-value under 1e-15 n <- 100 x <- runif(n) b <- .1*rnorm(n+1) y <- 1-2*x+b[1:n] r <- lm(y~x)$res cor.test(r[1:(n-1)], r[2:n]) # p-value = 0.3 y <- 1-2*x+.5*(b[1:n]+b[1+1:n]) cor.test(r[1:(n-1)], r[2:n]) # p-value = 0.3 (again) See also the Durbin--Watson test: library(car) ?durbin.watson library(lmtest) ?dwtest and the chapter on time series. Yet another means of spotting the problem is to plot the consecutive residuals in 2 or 3 dimensions. %G n <- 500 x <- runif(n) b <- rep(NA,n) b[1] <- 0 for (i in 2:n) { b[i] <- b[i-1] + .1*rnorm(1) } y <- 1-2*x+b[1:n] r <- lm(y~x)$res plot( r[1:(n-1)], r[2:n], xlab='i-th residual', ylab='(i+1)-th residual' ) %-- In the following example, we do not see anything with two consecutive terms (well, it looks like a Rorschach test, it is suspicious): we need three. %G n <- 500 x <- runif(n) b <- rep(NA,n) b[1] <- 0 b[2] <- 0 for (i in 3:n) { b[i] <- b[i-2] + .1*rnorm(1) } y <- 1-2*x+b[1:n] r <- lm(y~x)$res plot(data.frame(x=r[3:n-2], y=r[3:n-1], z=r[3:n])) %-- %G plot(r) %-- It is exaclty like that we can see the problems of some old random number generators. In three dimensions, front view, there is nothing visible, %G data(randu) plot(randu) # Nothing visible %-- but if we rotate the figure... library(xgobi) xgobi(randu) = xgobi_01.png You can also turn the picture directly in R, by taking a "random" rotation matrix (exercice: write a function to produce such a matrix -- hint: there is one somewhere inthis document). %G m <- matrix( c(0.0491788982891203, -0.998585856299176, 0.0201921658647648, 0.983046639705112, 0.0448184901961194, -0.177793720645666, -0.176637312387723, -0.028593540105802, -0.983860594462783), nr=3, nc=3) plot( t( m %*% t(randu) )[,1:2] ) %-- Here is a real example. 
%G data(EuStockMarkets) plot(EuStockMarkets) %-- %G x <- EuStockMarkets[,1] y <- EuStockMarkets[,2] r <- lm(y~x) plot(y~x) abline(r, col='red', lwd=3) %-- %G plot(r, which=1) %-- %G plot(r, which=3) %-- %G plot(r, which=4) %-- %G r <- r$res hist(r, probability=T, col='light blue') lines(density(r), col='red', lwd=3) %-- %G plot(r) %-- %G acf(r) %-- %G pacf(r) %-- %G r <- as.vector(r) x <- r[1:(length(r)-1)] y <- r[2:length(r)] plot(x,y, xlab='x[i]', ylab='x[i+1]') %-- In such a situation, you can use generalized least squares. The AR1 model assumes that two successive errors are correlated: e_{i+1} = r * e_i + f_i Where r is the "AR1 coefficient" and the f_i are independant. %G n <- 100 x <- rnorm(n) e <- vector() e <- append(e, rnorm(1)) for (i in 2:n) { e <- append(e, .6 * e[i-1] + rnorm(1) ) } y <- 1 - 2*x + e i <- 1:n plot(y~x) %-- %G r <- lm(y~x)$residuals plot(r) %-- The "gls" function (generalized least squares) is in the "nlme" package. library(nlme) g <- gls(y~x, correlation = corAR1(form= ~i)) Here is the result. > summary(g) Generalized least squares fit by REML Model: y ~ x Data: NULL AIC BIC logLik 298.4369 308.7767 -145.2184 Correlation Structure: AR(1) Formula: ~i Parameter estimate(s): Phi 0.3459834 Coefficients: Value Std.Error t-value p-value (Intercept) 1.234593 0.15510022 7.959971 <.0001 x -1.892171 0.09440561 -20.042992 <.0001 Correlation: (Intr) x 0.04 Standardized residuals: Min Q1 Med Q3 Max -2.14818684 -0.75053384 0.02200128 0.57222518 2.45362824 Residual standard error: 1.085987 Degrees of freedom: 100 total; 98 residual We can look at the confidence interval on the autocorrelation coefficient. > intervals(g) Approximate 95% confidence intervals Coefficients: lower est. upper (Intercept) 0.926802 1.234593 1.542385 x -2.079516 -1.892171 -1.704826 Correlation structure: lower est. upper Phi 0.1477999 0.3459834 0.5174543 Residual standard error: lower est. upper 0.926446 1.085987 1.273003 Let us compare with a naive regression. %G library(nlme) plot(y~x) abline(lm(y~x)) abline(gls(y~x, correlation = corAR1(form= ~i)), col='red') %-- In this example, there is no remarkable effect. In the following, the situation is more drastic. %G n <- 1000 x <- rnorm(n) e <- vector() e <- append(e, rnorm(1)) for (i in 2:n) { e <- append(e, 1 * e[i-1] + rnorm(1) ) } y <- 1 - 2*x + e i <- 1:n plot(lm(y~x)$residuals) %-- %G plot(y~x) abline(lm(y~x)) abline(gls(y~x, correlation = corAR1(form= ~i)), col='red') abline(1,-2, lty=2) %-- We shall come back on this when we tackle time series. For spatial data, it is more complicated. TODO: a reference??? * Unidentifiability + Multicolinearity (unidentifiability) The correlation coefficient between two variables tells you if they are correlated. But you can also have relations between more than two variables, such as X3 = X1 + X2. To detect those, you can perform a regression of Xk agains the other Xi's and check the R^2: if it is high (>1e-1), Xk can be expressed from the other Xi. n <- 100 x <- rnorm(n) x1 <- x+rnorm(n) x2 <- x+rnorm(n) x3 <- rnorm(n) y <- x+x3 summary(lm(x1~x2+x3))$r.squared # 1e-1 summary(lm(x2~x1+x3))$r.squared # 1e-1 summary(lm(x3~x1+x2))$r.squared # 1e-3 Other example, with three dependant variables. 
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1+x2+rnorm(n)
x4 <- rnorm(n)
y <- x1+x2+x3+x4+rnorm(n)
summary(lm(x1~x2+x3+x4))$r.squared   # 0.5
summary(lm(x2~x1+x3+x4))$r.squared   # 0.4
summary(lm(x3~x1+x2+x4))$r.squared   # 0.7
summary(lm(x4~x1+x2+x3))$r.squared   # 3e-3

Real example (with the adjusted determination coefficient):

%G
check.multicolinearity <- function (M) {
  a <- NULL
  n <- dim(M)[2]
  for (i in 1:n) {
    m <- as.matrix(M[, 1:n!=i])
    y <- M[,i]
    a <- append(a, summary(lm(y~m))$adj.r.squared)
  }
  names(a) <- names(M)
  print(round(a,digits=2))
  invisible(a)
}
data(freeny)
names(freeny) <- paste( names(freeny), " (",
  round(check.multicolinearity(freeny), digits=2), ")", sep='')
pairs(freeny, upper.panel=panel.smooth, lower.panel=panel.smooth)
%--

In such a situation, whether a given coefficient is statistically different from zero depends on the presence of the other variables. With all the variables, the first variable does not play a significant role, but the second does.

> summary(lm(freeny.y ~ freeny.x))
...
Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                   -10.4726     6.0217  -1.739   0.0911 .
freeny.xlag quarterly revenue   0.1239     0.1424   0.870   0.3904
freeny.xprice index            -0.7542     0.1607  -4.693 4.28e-05 ***
freeny.xincome level            0.7675     0.1339   5.730 1.93e-06 ***
freeny.xmarket potential        1.3306     0.5093   2.613   0.0133 *
...
Multiple R-Squared: 0.9981,     Adjusted R-squared: 0.9978

On the contrary, if you only retain the first two variables, it is the opposite.

> summary(lm(freeny.y ~ freeny.x[,1:2]))
...
                                     Estimate Std. Error t value Pr(>|t|)
(Intercept)                           2.18577    1.47236   1.485    0.146
freeny.x[, 1:2]lag quarterly revenue  0.89122    0.07412  12.024 3.63e-14 ***
freeny.x[, 1:2]price index           -0.25592    0.17534  -1.460    0.153
...
Multiple R-Squared: 0.9958,     Adjusted R-squared: 0.9956

Furthermore, the estimation of the coefficients and of their standard deviations is worrying: in a multicollinearity situation, you cannot be sure of the sign of the coefficients.

n <- 100
v <- .1
x <- rnorm(n)
x1 <- x + v*rnorm(n)
x2 <- x + v*rnorm(n)
x3 <- x + v*rnorm(n)
y <- x1+x2-x3 + rnorm(n)

Let us check that the variables are linearly dependent.

> summary(lm(x1~x2+x3))$r.squared
[1] 0.986512
> summary(lm(x2~x1+x3))$r.squared
[1] 0.98811
> summary(lm(x3~x1+x2))$r.squared
[1] 0.9862133

Let us look at the most relevant ones.

> summary(lm(y~x1+x2+x3))

Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
    Min      1Q  Median      3Q     Max
-3.0902 -0.7658  0.0793  0.6995  2.6456

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06361    0.11368   0.560   0.5771
x1           1.47317    0.94653   1.556   0.1229
x2           1.18874    0.98481   1.207   0.2304
x3          -1.70372    0.94366  -1.805   0.0741 .
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.135 on 96 degrees of freedom
Multiple R-Squared: 0.4757,     Adjusted R-squared: 0.4593
F-statistic: 29.03 on 3 and 96 DF,  p-value: 1.912e-13

It is the third.

> lm(y~x3)$coef
(Intercept)          x3
 0.06970513  0.98313878

Its coefficient was negative, but if we remove the other variables, it becomes positive.

Instead of looking at the determination coefficient (percentage of explained variance) R^2, you can look at the "Variance Inflation Factors" (VIF),

            1
  V_j = -----------
         1 - R_j^2

> vif(lm(y~x1+x2+x3))
      x1       x2       x3
48.31913 41.13990 52.10746

Instead of looking at the R^2, you can also look at the correlation matrix of the estimated coefficients.
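(Before turning to that correlation matrix, a quick check -- a sketch of ours -- that the value reported by the "vif" function of the "car" package is indeed 1/(1-R_j^2).)

library(car)
set.seed(1)
n <- 100
v <- .1
x <- rnorm(n)
x1 <- x + v*rnorm(n)
x2 <- x + v*rnorm(n)
x3 <- x + v*rnorm(n)
y <- x1 + x2 - x3 + rnorm(n)
vif(lm(y~x1+x2+x3))
1 / (1 - summary(lm(x1~x2+x3))$r.squared)   # same as the first VIF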
n <- 100 v <- .1 x <- rnorm(n) x1 <- x + v*rnorm(n) x2 <- rnorm(n) x3 <- x + v*rnorm(n) y <- x1+x2-x3 + rnorm(n) summary(lm(y~x1+x2+x3), correlation=T)$correlation We get (Intercept) x1 x2 x3 (Intercept) 1.0000000000 -0.02036269 0.0001812560 0.02264558 x1 -0.0203626936 1.00000000 -0.1582002900 -0.98992751 x2 0.0001812560 -0.15820029 1.0000000000 0.14729488 x3 0.0226455841 -0.98992751 0.1472948846 1.00000000 We can see that X1 and X3 are dependant. We can also see it graphically. %G n <- 100 v <- .1 x <- rnorm(n) x1 <- x + v*rnorm(n) x2 <- rnorm(n) x3 <- x + v*rnorm(n) y <- x1+x2-x3 + rnorm(n) m <- summary(lm(y~x1+x2+x3), correlation=T)$correlation plot(col(m), row(m), cex=10*abs(m), xlim=c(0, dim(m)[2]+1), ylim=c(0, dim(m)[1]+1), main="Correlation matrix of the coefficients of a regression") %-- TODO: A graphical representation of this correlation matrix (transform it into a distance matrix, perform an MDS, plot the points, add their MST -- or simply plot the MST, without the MDS). Here is yet another way of spotting the problem: compute the ratio of the largest eigen value and the smallest. Under 100, it is fine, over 1000, it is worrying. This is called the "contitionning index". m <- model.matrix(y~x1+x2+x3) d <- eigen( t(m) %*% m, symmetric=T )$values d[1]/d[4] # 230 To solve the problem, you can remove the "superfluous" variables -- but you might run into interpretation problems. You can also ask for more data (multicolinearity are more frequent when you have many variables and few observations). You can also use regression techniques adapted to multicolinearity, such as "ridge regression" or SVM (see somewhere below). We have already run into this problem with polynomial regression: to get rid of multicolinearity, we had orthonormalized the predictive variables. > y <- cars$dist > x <- cars$speed > m <- cbind(x, x^2, x^3, x^4, x^5) > cor(m) x x 1.0000000 0.9794765 0.9389237 0.8943823 0.8515996 0.9794765 1.0000000 0.9884061 0.9635754 0.9341101 0.9389237 0.9884061 1.0000000 0.9927622 0.9764132 0.8943823 0.9635754 0.9927622 1.0000000 0.9951765 0.8515996 0.9341101 0.9764132 0.9951765 1.0000000 > m <- poly(x,5) > cor(m) 1 2 3 4 5 1 1.000000e+00 6.409668e-17 -1.242089e-17 -3.333244e-17 7.935005e-18 2 6.409668e-17 1.000000e+00 -4.468268e-17 -2.024748e-17 2.172470e-17 3 -1.242089e-17 -4.468268e-17 1.000000e+00 -6.583818e-17 -1.897354e-18 4 -3.333244e-17 -2.024748e-17 -6.583818e-17 1.000000e+00 -4.903304e-17 5 7.935005e-18 2.172470e-17 -1.897354e-18 -4.903304e-17 1.000000e+00 TODO: Give the example of mixed models Adding a subject-dependant intercept is equivalent to imposing a certain correlation structure. * Missing values Important variables may be missing, that can change the results of the regression and their interpretation. In the following example, we have three variables. %G n <- 100 x <- runif(n) z <- ifelse(x>.5,1,0) y <- 2*z -x + .1*rnorm(n) plot( y~x, col=c('red','blue')[1+z] ) %-- If we take into account the three variables, there is a negative correlation between x and y. > summary(lm( y~x+z )) Call: lm(formula = y ~ x + z) Residuals: Min 1Q Median 3Q Max -0.271243 -0.065745 0.002929 0.068085 0.215251 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.01876 0.02404 0.78 0.437 x -1.05823 0.07126 -14.85 <2e-16 *** z 2.05321 0.03853 53.28 <2e-16 *** --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 
0.1 ` ' 1 Residual standard error: 0.1008 on 97 degrees of freedom Multiple R-Squared: 0.9847, Adjusted R-squared: 0.9844 F-statistic: 3125 on 2 and 97 DF, p-value: < 2.2e-16 On the contrary, if we do not have z, the correlation becomes positive. > summary(lm( y~x )) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -1.05952 -0.38289 -0.01774 0.50598 1.05198 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.5689 0.1169 -4.865 4.37e-06 *** x 2.1774 0.2041 10.669 < 2e-16 *** --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 0.5517 on 98 degrees of freedom Multiple R-Squared: 0.5374, Adjusted R-squared: 0.5327 F-statistic: 113.8 on 1 and 98 DF, p-value: < 2.2e-16 To avoid this problem, you should include all the variables that are likely to be important (you select them from a prior knowledge of the domain studied, i.e., with non-statistical methods). To confirm a given effect, you can also try other models (if they all confirm the effect, and its direction, it is a good omen) or gather more data, either from the same experiment, or from a comparable but slightly different one. * Extrapolation You often want to extrapolate from your data, i.e., infer what happens at larger scale (say, you have data with X in [0,1] and you would like conclusions for X in [0,10]). Several problems occur. The first is that the prediction intervals increase when you get away from the sample values. %G n <- 20 x <- rnorm(n) y <- 1 - 2*x - .1*x^2 + rnorm(n) #summary(lm(y~poly(x,10))) plot(y~x, xlim=c(-20,20), ylim=c(-30,30)) r <- lm(y~x) abline(r, col='red') xx <- seq(-20,20,length=100) p <- predict(r, data.frame(x=xx), interval='prediction') lines(xx,p[,2],col='blue') lines(xx,p[,3],col='blue') title(main="Widening of the prediction band") %-- Furthermore, if the relation looks linear on a small scale, it might be completely different on a larger scale. %G plot(y~x, xlim=c(-20,20), ylim=c(-30,30)) r <- lm(y~x) abline(r, col='red') xx <- seq(-20,20,length=100) yy <- 1 - 2*xx - .1*xx^2 + rnorm(n) p <- predict(r, data.frame(x=xx), interval='prediction') points(yy~xx) lines(xx,p[,2],col='blue') lines(xx,p[,3],col='blue') title(main="Extrapolation problem: it is not linear...") %-- %G data(cars) y <- cars$dist x <- cars$speed o <- x0.05) %G n <- 100 k <- 5 x <- matrix(rnorm(n*k),nc=k) y <- x[,1] + x[,2] - sqrt(abs(x[,3]*x[,4])) y <- y-median(y) y <- factor(y>0) pairs(x, col=as.numeric(y)+1) %-- for (i in 1:k) { f <- summary(lm(x[,i]~y))$fstatistic f <- pf(f[1],f[2],f[3], lower.tail=F) print(paste(i, "; p =", round(f, digits=3))) } 2. With the retained variables, compute the correlations and their p-values (the p-value of a correlation if the p-value of the test of H0: "correlation=0". TODO: gove an example with colinear variables. cor(x) round(cor(x),digits=3) m <- cor(x) for (i in 1:k) { for (j in 1:k) { m[i,j] <- cor.test(x[,i],x[,j])$p.value } } m round(m,digits=3) m<.05 Exercise: write a "print" method for correlation matrices that adds stars besides the correlations significantly different from zero. + Variables selection and regression When there are too many perdictive variables in a regression (with respect to the number of observations), the first thing that comes to the mind, is to remode the "spurious" or "redundant" variables. Here are several ways of doing so. 
+ General idea

We can start with a model containing all the variables, discard the
variable that brings the least to the regression (the one whose p-value
is the largest) and go on, until we have removed all the variables
whose p-value is over a pre-specified threshold. We remove the
variables one at a time, because each time we remove one, the p-values
of the others change. (We have already mentioned this phenomenon when
we presented polynomial regression: the predictive variables need not
be orthogonal.)

We can also do the opposite: start with an empty model and add the
variables one after the other, starting with the one with the smallest
p-value.

Finally, we can combine both methods: first add variables, then try to
remove some, then try to add some more, etc. (this may happen: we might
decide to add variable A, then variable B, then variable C, then remove
variable B, then add variable D, etc. -- as the predictive variables
are not orthogonal, whether a variable is present in the model depends
on the other variables). We stop when the criteria tell us to stop, or
when we get tired.

In the preceding discussion, we have used the p-value to decide whether
to keep or discard a variable: you can choose another criterion, say,
the R^2 or a penalized log-likelihood, such as the AIC (Akaike
Information Criterion),

  AIC = -2 log(likelihood) + 2 p,

or the BIC (Bayesian Information Criterion),

  BIC = -2 log(likelihood) + p ln(n).

Let us consider an example.

library(nlme)   # For the "BIC" function (there may be another one elsewhere)
n <- 20
m <- 15
d <- as.data.frame(matrix(rnorm(n*m),nr=n,nc=m))
i <- sample(1:m, 3)
d <- data.frame(y=apply(d[,i],1,sum)+rnorm(n), d)
r <- lm(y~., data=d)
AIC(r)
BIC(r)
summary(r)

It yields:

Call:
lm(formula = y ~ ., data = d)

Residuals:
        1         2         3         4         5         6         7         8
-0.715513  0.063803  0.233524  1.063999 -0.001007 -0.421912  0.712749 -1.188755
        9        10        11        12        13        14        15        16
-1.686568 -0.907378  0.293071 -0.506539  0.644674  2.046780  0.236374 -0.110205
       17        18        19        20
 0.256414  0.397595  0.052581 -0.463687

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03322    0.86408  -0.038    0.971
V1           0.76079    1.23426   0.616    0.571
V2           0.60744    0.52034   1.167    0.308
V3          -0.18183    1.09441  -0.166    0.876
V4           0.49537    0.68360   0.725    0.509
V5           0.54538    1.72066   0.317    0.767
V6          -0.16841    0.89624  -0.188    0.860
V7           0.51331    1.25093   0.410    0.703
V8           0.25457    2.05536   0.124    0.907
V9           0.34990    0.82277   0.425    0.693
V10          0.72410    1.26269   0.573    0.597
V11          0.69057    1.84400   0.374    0.727
V12          0.64329    1.15298   0.558    0.607
V13          0.07364    0.79430   0.093    0.931
V14         -0.06518    0.53887  -0.121    0.910
V15          0.92515    1.18697   0.779    0.479

Residual standard error: 1.798 on 4 degrees of freedom
Multiple R-Squared: 0.7795,     Adjusted R-squared: -0.04715
F-statistic: 0.943 on 15 and 4 DF,  p-value: 0.5902

The "gls" function gives you directly the AIC and the BIC.

> r <- gls(y~., data=d)
> summary(r)
Generalized least squares fit by REML
  Model: y ~ .
  Data: d
       AIC      BIC    logLik
  86.43615 76.00316 -26.21808
...

These quantities are important when you compare models with a different
number of parameters: the log-likelihood will always increase if you
add more variables, falling into the "overfit" trap. On the contrary,
the AIC and the BIC have a corrective term to avoid this trap.
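To make the formulas above concrete, here is a minimal sanity check on
a tiny simulated data set (one assumption: for a linear model, "logLik"
counts the residual variance as one more estimated parameter, so p
below is the number of coefficients plus one).

library(nlme)   # for BIC(), as above (recent versions of R also have it in "stats")
n <- 20
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2*d$x + rnorm(n)
r  <- lm(y ~ x, data=d)
ll <- logLik(r)
p  <- attr(ll, "df")                     # number of estimated parameters (here, 3)
c( AIC(r), -2*as.numeric(ll) + 2*p )           # should coincide
c( BIC(r), -2*as.numeric(ll) + p*log(n) )      # should coincide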
%G library(nlme) n <- 20 m <- 15 d <- as.data.frame(matrix(rnorm(n*m),nr=n,nc=m)) # i <- sample(1:m, 3) i <- 1:3 d <- data.frame(y=apply(d[,i],1,sum)+rnorm(n), d) k <- m res <- matrix(nr=k, nc=5) for (j in 1:k) { r <- lm(d$y ~ as.matrix(d[,2:(j+1)])) res[j,] <- c( logLik(r), AIC(r), BIC(r), summary(r)$r.squared, summary(r)$adj.r.squared ) } colnames(res) <- c('logLik', 'AIC', 'BIC', "R squared", "adjusted R squared") res <- t( t(res) - apply(res,2,mean) ) res <- t( t(res) / apply(res,2,sd) ) matplot(0:(k-1), res, type = 'l', col = c(par('fg'),'blue','green', 'orange', 'red'), lty = 1, xlab = "Number of variables") legend(par('usr')[2], par('usr')[3], xjust = 1, yjust = 0, c('log-vraissemblance', 'AIC', 'BIC', "R^2", "adjusted R^2" ), lwd = 1, lty = 1, col = c(par('fg'), 'blue', 'green', "orange", "red") ) abline(v=3, lty=3) %-- (In some cases, in the previous simulation, the AIC and the BIC have a local minimum for three variables and a global minimim for a dozen variables.) There are other criteria, such as the adjusted R-squared or Mallow's Cp. > library(wle) > r <- mle.cp(y~., data=d) > summary(r) Call: mle.cp(formula = y ~ ., data = d) Mallows Cp: (Intercept) V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 cp [1,] 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 -5.237 [2,] 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1 0 -4.911 [3,] 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 0 -4.514 [4,] 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 -4.481 [5,] 0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 0 -4.078 [6,] 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1 0 -3.854 [7,] 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 -3.829 [8,] 0 1 0 0 0 0 1 1 0 0 1 0 0 0 1 0 -3.826 [9,] 1 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 -3.361 [10,] 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 0 -3.335 [11,] 0 0 0 0 1 0 0 1 0 0 1 0 1 0 1 0 -3.287 [12,] 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 -3.272 [13,] 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 -3.261 [14,] 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 -3.241 [15,] 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1 0 -3.240 [16,] 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 0 -3.240 [17,] 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 -3.240 [18,] 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 -3.240 [19,] 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 -3.237 [20,] 0 1 0 0 0 0 0 1 0 0 1 0 1 0 1 0 -3.216 Printed the first 20 best models > i [1] 7 10 14 + Example, by hand get.sample <- function () { # Number of observations n <- 20 # Number of variables m <- 10 # Number of the variables that actually play a role k <- sample(1:m, 5) print(k) # Coefficients b <- rnorm(m); b <- round(sign(b)+b); b[-k] <- 0 x <- matrix(nr=n, nc=m, rnorm(n*m)) y <- x %*% b + rnorm(n) data.frame(y=y, x) } Let us select the variables, starting with an empty model, and progressively adding the variable whose p-value is the smallest. 
my.variable.selection <- function (y,x, p=.05) { nvar <- dim(x)[2] nobs <- dim(x)[1] v <- rep(FALSE, nvar) done <- FALSE while (!done) { print(paste("Iteration", sum(v))) done <- TRUE # Ceck if one of the p-values is less than p pmax <- 1 imax <- NA for (i in 1:nvar) { if(!v[i]){ # Compute the p-value m <- cbind(x[,v], x[,i]) m <- as.matrix(m) pv <- 1 try( pv <- summary(lm(y~m))$coefficients[ dim(m)[2]+1, 4 ] ) if( is.nan(pv) ) pv <- 1 if (pv d <- get.sample() [1] 9 4 7 8 2 > y <- d$y > x <- d[,-1] > k.exp <- my.variable.selection(y,x) [1] "Iteration 0" [1] "Adding variable 8 with p-value 0.00326788125893668" [1] "Iteration 1" [1] "Adding variable 3 with p-value 0.0131774876254023" [1] "Iteration 2" [1] "Adding variable 9 with p-value 0.0309234855260663" [1] "Iteration 3" [1] "Adding variable 4 with p-value 0.00370166323917281" [1] "Iteration 4" Let us comparer the theoretical model with the empirical one. > x <- as.matrix(x) > summary(lm(y~x[,c(9,4,7,8,2)])) ... Estimate Std. Error t value Pr(>|t|) (Intercept) 0.1471 0.3260 0.451 0.65870 x[, c(9, 4, 7, 8, 2)]X9 1.2269 0.3441 3.565 0.00311 ** x[, c(9, 4, 7, 8, 2)]X4 1.9826 0.3179 6.238 2.17e-05 *** x[, c(9, 4, 7, 8, 2)]X7 1.2958 0.4149 3.123 0.00748 ** x[, c(9, 4, 7, 8, 2)]X8 2.6270 0.4089 6.425 1.59e-05 *** x[, c(9, 4, 7, 8, 2)]X2 -0.9715 0.3086 -3.148 0.00712 ** Residual standard error: 1.287 on 14 degrees of freedom Multiple R-Squared: 0.8859, Adjusted R-squared: 0.8451 F-statistic: 21.73 on 5 and 14 DF, p-value: 3.837e-06 > summary(lm(y~x[,k.exp])) ... Estimate Std. Error t value Pr(>|t|) (Intercept) -0.005379 0.360940 -0.015 0.98831 x[, k.exp]X3 -1.070913 0.443122 -2.417 0.02886 * x[, k.exp]X4 1.292099 0.376419 3.433 0.00370 ** x[, k.exp]X8 2.863028 0.469379 6.100 2.03e-05 *** x[, k.exp]X9 1.541648 0.388408 3.969 0.00123 ** Residual standard error: 1.537 on 15 degrees of freedom Multiple R-Squared: 0.8254, Adjusted R-squared: 0.7788 F-statistic: 17.73 on 4 and 15 DF, p-value: 1.486e-05 The theoretical model looks better... To assess the relevance of a model, as always, we plot the data -- here, the residuals as a function of each variable included (in black, before adding it, in red, after). %G get.sample <- function () { # Number of observations n <- 20 # Number of variables m <- 10 # Number of the variables that actually appear in the model k <- sample(1:m, 5) print(k) # Coefficients b <- rnorm(m); b <- round(sign(b)+b); b[-k] <- 0 x <- matrix(nr=n, nc=m, rnorm(n*m)) y <- x %*% b + rnorm(n) data.frame(y=y, x) } my.variable.selection <- function (y,x, p=.05) { nvar <- dim(x)[2] nobs <- dim(x)[1] v <- rep(FALSE, nvar) p.values <- matrix(NA, nr=nvar, nc=nvar) res1 <- list() res2 <- list() done <- FALSE while (!done) { print(paste("Iteration", sum(v))) done <- TRUE # Is there a p-value lower that pmax <- 1 imax <- NA for (i in 1:nvar) { if(!v[i]){ # Compute the p-value m <- cbind(x[,v], x[,i]) m <- as.matrix(m) pv <- 1 try( pv <- summary(lm(y~m))$coefficients[ dim(m)[2]+1, 4 ] ) if( is.nan(pv) ) pv <- 1 if (pv r$anova Stepwise Model Path Analysis of Deviance Table Initial Model: y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 Final Model: y ~ X1 + X3 + X4 + X5 + X7 + X10 Step Df Deviance Resid. Df Resid. Dev AIC 1 NA NA 9 18.62116 20.57133 2 - X9 1 0.007112768 10 18.62828 18.57897 3 - X6 1 0.037505223 11 18.66578 16.61919 4 - X2 1 0.017183580 12 18.68296 14.63760 5 - X8 1 0.098808619 13 18.78177 12.74309 > k [1] 5 10 3 7 4 Let us compare with the theoretical model. > summary(lm(y~x[,k])) ... Estimate Std. 
Error t value Pr(>|t|) (Intercept) 0.2473 0.3023 0.818 0.426930 x[, k]1 1.7478 0.3330 5.249 0.000123 *** x[, k]2 1.4787 0.2647 5.587 6.7e-05 *** x[, k]3 -1.5362 0.4903 -3.133 0.007334 ** x[, k]4 -1.1025 0.2795 -3.944 0.001470 ** x[, k]5 1.6863 0.4050 4.164 0.000956 *** Residual standard error: 1.27 on 14 degrees of freedom Multiple R-Squared: 0.8317, Adjusted R-squared: 0.7716 F-statistic: 13.84 on 5 and 14 DF, p-value: 5.356e-05 > AIC(lm(y~x[,k])) [1] 73.1798 > b [1] 0 0 -2 2 2 0 -1 0 0 2 > summary(lm(y~x[,c(1,3,4,5,7,10)])) ... Estimate Std. Error t value Pr(>|t|) (Intercept) 0.1862 0.2886 0.645 0.52992 x[, c(1, 3, 4, 5, 7, 10)]1 -0.4408 0.2720 -1.620 0.12915 x[, c(1, 3, 4, 5, 7, 10)]2 -1.6742 0.4719 -3.548 0.00357 ** x[, c(1, 3, 4, 5, 7, 10)]3 1.5300 0.3953 3.870 0.00193 ** x[, c(1, 3, 4, 5, 7, 10)]4 1.7813 0.3159 5.639 8.07e-05 *** x[, c(1, 3, 4, 5, 7, 10)]5 -1.0521 0.2664 -3.949 0.00167 ** x[, c(1, 3, 4, 5, 7, 10)]6 1.4903 0.2506 5.946 4.85e-05 *** Residual standard error: 1.202 on 13 degrees of freedom Multiple R-Squared: 0.86, Adjusted R-squared: 0.7954 F-statistic: 13.31 on 6 and 13 DF, p-value: 6.933e-05 > AIC(lm(y~x[,c(1,3,4,5,7,10)])) [1] 71.50063 If we look at the p-values, we would like to remove X1, but if we look at the AIC, we would like to keep it... + Stepwise regression is BAD TODO + Stepwise regression and Bayesian Model Averaging (BMA) The main problem of stepwise regression is that we are very likely to choose a bad model. An alternative is to this is select not one but several models. Then, we can compute the forecasts for each of those models and combine them, giving them a weight proportionnal to the likelihood of the model. # We select a "good" regression model, using the BIC as a # criterion, by starting from a model and adding or # removing variables at random, if this improves the BIC. # The second version also accepts models that are slightly # worse, with a certain probability (high if the model is # only slightly worse, low if it is really worse). We end # up with a Markov chain that wanders in the space of all # models, staying longer at models that are more # probable. You can use this to average predictions over # those models. This is called MCMC (Markov Chain Monte # Carlo) or MCMCMC (Markov Chain Monte Carlo Model # Combination). # You can also change the "temperature", i.e., the # probability that a worse model will be accepted, when # the algorithm proceeds. If you end with a temperature # equal to zero, you get a single solution, as with the # steepest descent, but you are less likely to be stuck in # a local minimum (this is called simulated annealing); if # you decrease the temperature until 1, you get another # MCMC, with a different burn-up period. # For more details: # http://www.stat.washington.edu/raftery/Research/PDF/volinsky1997.pdf # http://www.research.att.com/~volinsky/bma.html library(stats4) # for BIC bma.fit.model <- function (y, x, df) { # df: data.frame containing the data # y: name of the variable to predict # x: name of the predictive variables (vector of strings) if (length(x)==0) { x <- "1" } s <- paste("lm(", y, "~", paste(x, collapse="+"), ", data=df)") #cat(">", s, "\n") r <- eval(parse(text=s)) BIC(logLik(r)) } bma.neighbour <- function (model, variables) { # model: vector containing the variables in the current model # variable: vector containing the names of all the variables among # which we choose. model <- variables %in% model n <- length(model) i <- sample(1:n, 1) model[i] <- ! 
model[i] variables[ model ] } bma.steepest.descent <- function (y, x, df, N=1000) { # df: data.frame containing the data # y: name of the variable to predict # x: name of the predictive variables among which we # shall choose # N: Number of iterations current.model <- character(0) current.bic <- bma.fit.model(y,current.model,df) for (i in 1:N) { new.model <- bma.neighbour(current.model, x) new.bic <- bma.fit.model(y, new.model, df) if (new.bic < current.bic) { current.bic <- new.bic current.model <- new.model cat("(", i, ") BIC=", current.bic, " ", paste(current.model,collapse=" "), "\n", sep="") } else { cat("(",i,") BIC=", new.bic, "\r", sep="") } } current.model } bma.mcmc.descent <- function (y, x, df, N=1000, temperature=1) { if (length(temperature)==1) { temperature <- rep(temperature,N) } res <- matrix(NA, nr=N, nc=length(x)) colnames(res) <- x current.model <- character(0) current.bic <- bma.fit.model(y,current.model,df) for (i in 1:N) { new.model <- bma.neighbour(current.model, x) new.bic <- bma.fit.model(y, new.model, df) res[i,] <- x %in% new.model if ( current.bic - new.bic > temperature[i] * log(runif(1)) ) { current.bic <- new.bic current.model <- new.model cat("(", i, ") BIC=", current.bic, " ", paste(current.model,collapse=" "), "\n", sep="") } else { cat("(",i,") BIC=", new.bic, "\r", sep="") } } res } N <- 100 df <- data.frame( y = rnorm(N), x1 = rnorm(N), x2 = rnorm(N), x3 = rnorm(N), x4 = rnorm(N), x5 = rnorm(N) ) df$y <- df$y + .1 * df$x1 + .5 * df$x3 bma.steepest.descent("y", setdiff(names(df),"y"), df) r <- bma.mcmc.descent("y", setdiff(names(df),"y"), df) apply(r[-(1:500),], 2, sum) bma.mcmc.descent("y", setdiff(names(df),"y"), df, N=1000, temperature=c(seq(20,1,length=500), # Simulated annealing seq(1,0,length=250), rep(0,250))) TODO: Explain what I am doing... TODO: A prior (for the number of variables...) TODO: Explain the danger of MCMC: we have to check that it actually converges. Typically: run several chains with different starting points, they should give the same results. TODO: A few plots??? barplot( apply(r[-(1:500),], 2, sum) ) predictions with the "best" (wrong) model and predictions with the average model (but what do we plot? the error histograms? error2~error1?) TODO: An example where it works... True model: y ~ x1 + x2 + x3 but x1, x2 and x3 are not observed, we only have x11 = x1 + noise1, x12 = x1 + noise2, etc. The MCMC should alternate between x11 and x12, x21 and x22, etc. You might also want to have a look at the BMA and ensembleBMA packages. Other approaches try to combine models, not predictions -- for this, the models have to be expressed or transformed into a common framework. + Model selection and the Vapnik-Chervonenkis dimension Imagine you want to predict a quantitative variable y from other variables x1, ..., xn; you have tried several algorithms (say, logistic regression, stepwise logistic regression, decision trees, decision forests, support vector machines (SVM), neural networks) and you would like to compare them. You can easily measure how they fare on the learning sample, bit what will happen with new data? Indeed, if the model performs well on the training data, it might simply have overfitted the data and be completely useless -- there is a subtle tradeoff between the performance on the training sample and the ability to generalize. TODO: a plot depicting this tradeoff. 
performance in-sample ~ model complexity performance out-of-sample ~ model complexity (on the same plot) %G set.seed(1) n <- 20 x <- runif(n, -1, 1) y <- 1 - x^2 + .2*rnorm(n) X <- runif(10000, -1, 1) Y <- 1 - X^2 + .2*rnorm(1000) N <- n res <- matrix(NA, nc=N, nr=2) dimnames(res) <- list( c("In-sample error", "Out-of-sample error"), "Model complexity" = as.character(1:N) ) r <- lm(y~x) res[1,1] <- mean(abs(residuals(r))) res[2,1] <- mean(abs(predict(r, data.frame(x=X)) - Y)) for (i in 2:N) { r <- lm(y ~ poly(x,i-1)) res[1,i] <- mean(abs(residuals(r))) res[2,i] <- mean(abs(predict(r, data.frame(x=X)) - Y)) } op <- par(mar=c(5,4,4,4)) ylim <- c(0, 1.5*max(res[1,])) plot(res[1,], col="blue", type="l", lwd=3, ylim=ylim, axes=F, xlab="Model complexity", ylab="", main="In- and out-of-sample error") axis(1) axis(2, col="blue") par(new=TRUE) plot(res[2,], col="red", type="b", lwd=3, ylim=ylim, axes=F, xlab="", ylab="", main="") axis(4, col="red") mtext("In-sample error", 2, line=2, col="blue", cex=1.2) mtext("Out-of-sample error", 4, line=2, col="red", cex=1.2) par(op) %-- A simple way of estimating the ability of a model to generalize is to cut the sample into two parts, use the first as a training set and the second as a test sample, to assess the model. To get a better idea, you can repeat this several times, with different partitions of the initial sample. TODO: Example (take an example from "mlbench" and keep it until the end of this section. Also choose a couple of algorithms, say: logistic regression, stepwise logictic regression, SVM and stick to them.) for (i in data(package="mlbench")$results[,"Item"]) { do.call("data", list(i)) cat(i, "\n") str(get(i)) } # If we restrict ourselves to predicting a binary variable str(Sonar) str(PimaIndiansDiabetes) str(Ionosphere) library(mlbench) data(Sonar) glm( Class ~ ., data=Sonar, family="binomial") Another way of estimating the quality of a model is the log-likelihood, i.e., the (conditionnal) probability of observing the data we have actually observed given the model -- however, if the model is sufficiently complex, i.e., if it has sufficiently many parameters, it can perfectly fit the learning sample. This prevents us from comparing models with a different number of parameters -- which is exactly what we want to do... One way out of this problem is to compensate for an excessive number of parameters by adding a "penalty" to the log-likelihood, depending on the number of parameters. The AIC (Akaike Information Criterion) and the BIC (Bayesian Information Criterion) are examples of such penalized log-likelihoods. TODO: Example You might have a few qualms about the "number of parameters" used in the definition of the AIC or the BIC. Indeed, if you have two parameters, you can cheat and code them as a single number (just choose a bijection between R and R^2). But even without cheating, this poses a problem: how do we count the "number of variable" in a regression on the subset of variables? We have to somehow combine the number of initial variables (n) and the number of selected variables (k): do we consider that there are n parameters (some of which are zero)? do we use a boolean parameter for each variable telling us if it is retained or not (n+k parameters)? do we only count the retained variables (k parameters)? do we use a single discrete-valued variable to code the subset of retained variables (k+1 parameters)? 
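Coming back to the simple train/test idea mentioned above, here is a
minimal sketch on the Sonar data from mlbench (assumptions: a plain
logistic regression as the model, a 2/3--1/3 split, 20 random splits;
with 60 predictors and so few observations, glm() may warn about
perfectly separated fits -- that is part of the overfitting problem we
are discussing).

library(mlbench)
data(Sonar)
set.seed(1)
error.rates <- replicate(20, {
  n <- nrow(Sonar)
  i <- sample(1:n, round(2*n/3))                     # training observations
  r <- glm(Class ~ ., data=Sonar[i,], family="binomial")
  p <- predict(r, Sonar[-i,], type="response") > .5  # predicted class, test set
  observed <- Sonar$Class[-i] == levels(Sonar$Class)[2]
  mean(p != observed)                                # misclassification rate
})
mean(error.rates)    # estimated out-of-sample error rate

The same loop, with the model-fitting line changed, gives comparable
numbers for the other algorithms mentioned above.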
The notion of "number of variables" is fine for classical regression,
but the problems we have just mentioned call for another, more general
notion, better suited to "machine learning algorithms", i.e., to
algorithmic and not only statistical methods. Enter the VC
(Vapnik-Chervonenkis) dimension.

The situation is as follows: we have a qualitative variable y and
quantitative variables x1, ..., xn measured on a sample; we refer to
(x1(i),...,xn(i)) (without y) as an observation; we also have a
classification algorithm

  f : (x,a) |---> f(x,a)

that tries to predict y from x; here, "a" stands for the parameters of
the classification algorithm, to be determined (e.g., the coefficients
of a regression, the weights of a neural network).

The classification algorithm f is said to shatter the observations
o1,...,om if, for any training set (o1,y1),...,(om,ym), there exists a
value of a such that f(.,a) makes no mistakes on the training set. If
you prefer formulas:

  y \in Y   (Y is a finite set)
  x \in X = R^n
  a \in A   (the set of possible parameter values)
  f : X * A --> Y

  f shatters (x1,...,xm) \in (R^n)^m
  iff
  \forall y1,...,ym \in Y  \exists a \in A  \forall i \in [1,m]   f(xi,a) = yi

The VC dimension of the classification algorithm f is the largest
number of points shattered by f. If you prefer formulas:

  VC(f) = Max { m : \exists x1,...,xm \in R^n  such that f shatters x1,...,xm }
        = Max { m : \exists x1,...,xm \in R^n  \forall y1,...,ym \in Y
                    \exists a \in A  \forall i \in [1,m]  f(xi,a) = yi }

Examples:

  VC(linear classifier, in dimension n) = n + 1
  VC(SVM) = ???
  TODO

The raison d'etre of the VC dimension is the following theorem:

  Out-of-sample error rate <= In-sample error rate
                              + sqrt( (VC*(1 + log(2N/VC)) - log(p/4)) / N )

with probability 1 - p, where N is the size of the training sample. One
can use this formula for "Structural Risk Minimization" (SRM) and
choose the model with the lowest VC bound on the out-of-sample error
rate -- this bound plays the same role as the AIC or the BIC.

TODO: plot
  in-sample error ~ model
  VC bound on the out-of-sample error ~ model

There is however a problem with this bound: it is extremely
conservative -- your classification algorithm could well perform 100
times better...

As a conclusion, if you have to select a model among several, stick to
the BIC (if you know how to compute it for your algorithm and if your
data are well-behaved) or cross-validation.

For more details, check A. Moore's tutorials:
http://www.autonlab.org/tutorials/vcdim08.pdf
http://www.autonlab.org/tutorials/

+ Genetic algorithms and non-linear model selection

If you want to choose a model from a reasonable set (say, a few dozen
models), you can fit them all, compute some measure of model quality
(AIC, BIC, cross-validation error, VC bound, etc.) and select the best
one.

If the number of models is not reasonable (for instance, if you have
100 variables and want a model using a subset of them -- there are
2^100 of them), you can try various optimization algorithms that wander
through the space of all models (e.g., try to add or remove a variable
to the current model and keep the new one if it is better, again and
again, until you can no longer improve it -- this is a descent
algorithm, but you could also use simulated annealing to avoid local
extrema).

But sometimes, the number of models is not reasonable and the models
are rather unwieldy: there can be no simple and obvious notion of a
"nearby" model that would allow you to easily sample the space of all
models.
This is the case, for instance, if you want a non-linear model: your space of models could be "all the formulas one can obtain from the predictive variables, the basic arithmetic operations (+ - * /) and a few selected functions (sin, exp, log)". Such a formula can be represented by a tree. For instance, sqrt( 1 - x1 * x3 ) 1.17 * ----------------------- exp( sin( x1 / x2 ) ) can be represented as * / \ / \ / \ 1.17 / / \ / \ / \ sqrt exp | | | | | | - sin | / \ | / \ | / \ 1 * / / \ / \ / \ / \ x1 x3 x1 x2 The Lisp programmers amoung you would represent this as (* 1.17 (/ (sqrt (- 1 (* x1 x3))) (exp (sin (/ x1 x3))))) Contrary to waht it seems, this complex structure does not rule out classical local search algorithms: all we need is a notion of a "nearby" formula or "nearby" tree. Indeed, one can get a tree near a given one by applying one of the following "mutations": replace a node (an operator) by another, of the same arity; replace a leaf; select a subtree and replace it by a random tree; insert a node (if its arity if not one, complete it with random trees); delete a node; splice the root. One can go one step further and use genetic algorithms: contrary to local search methods, where we have a single current candidate model, we have a pool (or "population") of models (one or two hundreds) and we go from one generation to the next by mutating and pairing the models. Besides the mutations, we also need a way to combine two parent models into a child model; this is called cross-over and can be done as follows: select a subtree in both parents and interchange them. TODO: Implementation? (I am not sure R is the language of choice if you want to play with trees). See also: http://www.olsen.ch/research/workingpapers/gpForVolatility.pdf http://www.olsen.ch/research/307_ga_pase95.pdf Generalizations: - Respect the type of the operators (in programming language parlance, the signature of the functions), i.e., take into account wether the functions take as arguments and return arbitrary reals, reals in the interval [0,1], positive numbers, boolean values, etc. - Do not use genetic algorithms for the constants, but use classical optimization algorithms; - Take into account the possibility of non-relevant peaks in the BIC landscape, e.g., by penalizing the BIC according to the number of individuals around or by clustering the individuals, thereby creating subpopulations; - Be creative and use other kinds of operators, for instance, exponential moving averages (in finance, people feed very long irregular time series to genetic algorithms) or boolean operators (< & | ifelse, e.g., to create trading rules). + Dimension reduction The "dr" package provides several dimension reduction algorithms. library(help=dr) xpdf /usr/lib/R/library/dr/doc/drdoc.pdf TODO: (understand and) explain those algorithms: sir, save, phd. (the PDF file above is not clear). sir Sliced Inverse Regression save Sliced Average Variance Estimation phd Principal Hessian Direction TODO: an example, with a few plots... (take another example, not that from the manual...) 
# From the manual
library(dr)
data(ais)

# The data
op <- par(mfrow=c(3,3))
for (i in names(ais)) {
  hist(ais[,i], col='light blue', probability=T, main=i)
  lines(density(ais[,i]),col='red',lwd=3)
}
par(op)

# Their logarithm
op <- par(mfrow=c(3,3))
for (i in names(ais)) {
  x <- NA
  try( x <- log(ais[,i]) )
  if( is.na(x) | any(abs(x)==Inf) ){
    plot.new()
  } else {
    hist(x, col='light blue', probability=T, main=i)
    lines(density(x),col='red',lwd=3)
  }
}
par(op)

# Dimension reduction
r <- dr(LBM ~ Ht + Wt + log(RCC) + WCC, data=ais, method="sir")
plot(r)
r
summary(r)

# TODO: mention the tests for the dimension

+ SVM (Support Vector Machines)

SVMs appear, for instance, in the following situation. Genetics can be
used to assess a risk in a patient. If we know the mechanisms behind a
given pathology, if we know what genes are involved (i.e., what genes
are over- or under-expressed in that pathology, or what mutations
trigger the disease), we can perform the tests. But often, we do not
know the mechanisms of the pathology -- yet. However, we can still hope
to come to a conclusion by performing tests "at random". We take a few
dozen patients (or a few hundred: those tests are very expensive),
whose condition is known (for instance, by invasive tests, such as
post-mortem examinations) and we look for the presence/absence, or for
the expression, of tens of thousands of genes (this can be done on a
small piece of glass called a "micro-chip" or a "micro-array"). And we
want to predict the patient's condition (whether he has the disease)
from the genes.

You have noticed the problem: there are too many variables. (As a rule
of thumb, you should have (at least) ten times more observations than
parameters to estimate -- here, we can have 100 times fewer...)

First idea: restrict yourself to 20 variables, chosen on the basis of
prior knowledge.

Second idea: restrict yourself to 20 variables, the 20 best. This is
variable selection, which we have already presented.

Third idea: if the variable to predict is qualitative (as in our
example), the problem is to find a hyperplane separating two clouds of
points. Instead of looking for a hyperplane, i.e., an affine function f
such that f>0 iff the patient is affected (if there are really many
variables, there is a wealth of such hyperplanes), we can look for a
thick hyperplane, i.e., an affine function f such that f > a iff the
patient is affected and f < -a iff the patient is not affected, with a
as large as possible.

Actually, the SVM method is more general: we first increase the
dimension, to allow for non-linear effects (for instance, we could map
a 2-dimensional vector space to a 5-dimensional vector space by
(x,y) --> (x,y,x^2,y^2,x*y)) and only then look for a thick separating
hyperplane.

The algorithm is as follows.

1. Find a separating thick hyperplane, as thick as possible. This is a
   simple Lagrange multipliers problem -- it can also be seen as a
   quadratic programming problem. We have a Lagrange multiplier for
   each point: if the multiplier is zero, the point does not play any
   role (it is in the midst of the cloud of points, far away from the
   separating hyperplane); if the multiplier is non-zero, we say the
   point is a "support vector". Those points will "touch" the thick
   hyperplane, they will limit its thickness -- actually, you can even
   forget the other points.

2. If it does not work, increase the dimension.
You could try various embeddings, such as (x,y) |---> (x,x^2,x*y,y^2), but actually, we can simply change the "hyperplane" equation (well, it will no longer be a hyperplane) by replacing the scalar product used to define it by a kernel, such as K(x,y) = ( + a )^b or K(x,y) = exp( -a Norm(x-y)^2 ). If you use algorithms that do not use coordinates but just scalar products, this trick (the "kernel trick") allows you to increase the dimension without having to compute the actual coordinates. 3. You can also accept that certain points end on the "wrong" side of the hyperplane, with a penality. %G n <- 200 x <- rnorm(n) y <- rnorm(n) u <- sqrt(x^2+y^2) u <- ifelse( u<.5*mean(u), 1, 2) plot(y~x, col=u, pch=15) %-- help.search("svm") help.search("support vector machine") library(e1071) library(help=e1071) ?svm %G library(e1071) u <- factor(u) r <- svm(u~x+y) { # The "plot.svm" calls "browser()": Why??? # (I use it when I debug my code, it is probably the # same for them.) # And then it crashes... # (e1071 version 1.3-16, 1.3-16) browser <- function () {} try( plot(r, data.frame(x,y,u), y~x) ) } %-- %G n <- 200 x <- runif(n, -1,1) y <- runif(n, -1,1) u <- abs(x-y) u <- ifelse( u<.5*mean(u), 1, 2) plot(y~x, col=u, pch=15) %-- %G u <- factor(u) r <- svm(u~x+y) { browser <- function () {} try( plot(r, data.frame(x,y,u), y~x) ) } %-- If you want to seriously use SVMs, you will try various values for the "cost" and "gamma" parameters, and you will select the best with, for instance, a cross-validation. SVM can be generalized: To distinguish between more than two classes, you can consider the classes two at a time and then use a majority rule to forecast the class of a new observation. Alternatively, you can also try to distinguish a class against the union of all the other classes; do that for each class; then use a majority rule for prediction. One can also use SVM to "distinguish between a single class". It sounds odd, but it allows you to spot outliers. you can also use SVMs for regression. In linear regression, you are looking for a hyperplane "near" most of the points; with SVMs, you will be looking for a thick hyperplane, as thin as possible, that contains all the observations. It is the same Lagrange multiplier problem as above, with all the inequalities reversed. There is a dual way of interpreting these methods: take the median hyperplane of the segment joining the two nearest points in the convex hull of both clouds of points. If the convex hull intersect, replace them with "reduced convex hulls" (still defined as the set of barycenters of the points, but with a restriction on the coefficients: for instance, we ask that they be all under 0.5, so as not to give too much importance to an isolated point). For more details, more examples and a comparison of SVMs and regression trees, see /usr/lib/R/library/e1071/doc/svmdoc.pdf For other details: http://bioconductor.org/workshops/Heidelberg02/moldiag-svm.pdf http://www.kernel-machines.org/ http://www.csie.ntu.edu.tw/~cjlin/papers/ijcnn.ps.gz http://www.acm.org/sigs/sigkdd/explorations/issue2-2/bennett.pdf http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.ps.gz + TODO Exercise: use an SVM in the situation described in the introduction (microarrays). TODO: explain (understand?) what follows. Is it finished??? 
library(help=sma)
library(help=pamr)

library(e1071)
library(sma)
data(MouseArray)
# Mice number 1, 2 and 3: control
# Number 4, 5, 6: treatment
m <- t(mouse.lratio$M)
m <- data.frame( mouse=factor(c(1,1,1,2,2,2)), m)
# r <- svm(mouse~., data=m, na.action=na.omit, kernel="linear")
# Out of memory...
m <- m[,1:500]
r <- svm(mouse~., data=m, na.action=na.omit, kernel="linear")

TODO: finish this.
In particular: how do we read the result?

Forecast:

m <- t(mouse.lratio$M)
m <- m[,!is.na(apply(m,2,sum))]
m <- data.frame( mouse=factor(c(1,1,1,2,2,2)), m)
m <- m[,1:500]
m6 <- m[6,]
m <- m[1:5,]
r <- svm(mouse~., data=m, na.action=na.omit, kernel="linear")
predict(r,m6)    # 1 instead of 2...
r <- svm(mouse~., data=m, na.action=na.omit)
predict(r,m6)    # 1 instead of 2...

# One could expect that kind of error:
library(cluster)
m <- t(mouse.lratio$M)
m <- m[,!is.na(apply(m,2,sum))]
plot(hclust(dist(m)))

rm(m,m6,r)

+ GAM (Generalized Additive Model)

Generalized Additive Models (GAMs) appear when you try to describe a
non-linear situation with a large number of variables. If the model
were linear, the number of variables would be reasonable, but in a
non-linear situation, it explodes...

For a linear model, we would write

  Y = b0 + b1 x1 + b2 x2 + ... + bn xn,

where the bi are numbers. There are n+1 parameters to estimate. A
non-linear model would be

  Y = f(x1,x2,...,xn)

where f is any function. Even if you restrict the shape of the function
f, say, to a polynomial of degree 2, the dimension of the space of such
functions grows too fast (here, 1+n+n(n+1)/2).

Side note: at school, one studies polynomials of one variable, but not
polynomials of several variables. One of the reasons is precisely this
explosion of the dimension -- actually, you can give a geometric
interpretation of the computations you do with polynomials: with one
variable, you are on a straight line, and the geometry of the straight
line is, well, straightforward; with two variables, you are in the
plane, you can study curves, which can be more intricate; the more
variables you have, the more complicated the geometry. This is called
algebraic geometry. End of the side note.

A generalized additive model is

  Y = a + f1(x1) + f2(x2) + ... + fn(xn)

where the fi are arbitrary functions (you will estimate them as you
want: splines, kernel, local regression, etc.). The important point is
that they are functions of one variable: they are not too complex. Of
course, not all the functions R^n --> R can be written like that: but
we have a lot of them and most of the time this will be sufficient.

But a big problem lurks behind this simplification: we completely
forget potential interactions between the variables. Non-linearity or
interactions: you will have to choose. (Well, actually, if you think
there is an interaction between two of your variables, you can include
this interaction -- but just this one: the model then becomes
y = a + f12(x1,x2) + f3(x3) + ... + fn(xn).)

The algorithm used to find those functions is iterative. If we forget
the constant, it would be:

1. Take a first estimation of the fi, say, as constants, from a linear
   regression.
2. For all k, define fk as the local regression of

     Y - Sum_{j != k} fj(xj)   against   xk.

3. Iterate until convergence.

If you do not forget the constant:

1. alpha <- mean(y)
   f[k] <- 0   for all k
2. f[k] <- Smoother( Y - alpha - Sum_{j != k} fj(xj) )   against xk
3. f[k] <- f[k] - (1/N) * Sum_i fk(xik)
4. Go to 2 until the f[k] no longer change.

You can generalize this algorithm by adding interaction terms, when
needed: fij(xi,xj), etc.
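Here is a minimal backfitting sketch that follows the algorithm above,
with two predictive variables and loess() as the smoother (the data are
simulated for the purpose of the example; a real implementation would
also monitor convergence instead of using a fixed number of
iterations).

set.seed(1)
n  <- 200
x1 <- runif(n, -3, 3)
x2 <- runif(n, -3, 3)
y  <- sin(x1) + cos(x2) + .3*rnorm(n)

alpha <- mean(y)
f1 <- rep(0, n)
f2 <- rep(0, n)
for (iteration in 1:20) {
  # Smooth the partial residuals against each variable in turn
  r1 <- y - alpha - f2
  f1 <- fitted( loess(r1 ~ x1) )
  f1 <- f1 - mean(f1)            # keep each term centered (step 3 above)
  r2 <- y - alpha - f1
  f2 <- fitted( loess(r2 ~ x2) )
  f2 <- f2 - mean(f2)
}
op <- par(mfrow=c(1,2))
plot(f1 ~ x1, main="Estimated f1 (should look like sin)")
plot(f2 ~ x2, main="Estimated f2 (should look like cos, up to a constant)")
par(op)

The "gam" function used below takes care of all this (not necessarily
by backfitting) and of choosing the amount of smoothing.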
You can also ask that some of those functions have a predetermined form, e.g., that they be linear. One can show that this algorithm actually minimizes a penalized sum of squares. Variant: You can generalize this method to logistic or Poisson regression (it is then interpreted with a penalized log-likelihood), but the algorithm is a bit different (a mixture of backfitting and IRLS). For a multilogistic regression, it is even more complicated. You can also interpret the Generalized Additive Model as the quest for the best transformation of the predictive variables so that the linear model be valid. Remark: as always, we assume that the variable to predict is gaussian -- it is significantly non gaussian, we shall transform it. For the predictive variables, it may not be that important, but (to avoid numeric instability) they should have the same order of magnitude (I think). Let us now see how to do all this with R. TODO library(mgcv) ?gam %G n <- 200 x1 <- runif(n,-3,3) x2 <- runif(n,-3,3) x3 <- runif(n,-3,3) f1 <- sin; f2 <- cos; f3 <- abs; y <- f1(x1) + f2(x2) + f3(x3) + rnorm(n) pairs(cbind(y,x1,x2,x3)) # Nothing really visible... %-- %G library(mgcv) r <- gam(y~s(x1)+s(x2)+s(x3)) x <- seq(-3,3,length=200) z <- rep(0,200) m.theoretical <- cbind(f1(x),f2(x),f3(x)) m.experimental <- cbind( predict(r, data.frame(x1=x,x2=z,x3=z)), predict(r, data.frame(x1=z,x2=x,x3=z)), predict(r, data.frame(x1=z,x2=z,x3=x)) ) matplot(m.theoretical, type='l', lty=1) matplot(m.experimental, type='l', lty=2, add=T) %-- It is difficult to compare, because the curves are shifted... %G zero.mean <- function (m) { t(t(m)-apply(m,2,mean)) } matplot(zero.mean(m.theoretical), type='l', lty=1) matplot(zero.mean(m.experimental), type='l', lty=2, add=T) title(main="GAM") legend(par('usr')[2], par('usr')[3], xjust=1, yjust=0, c('theoretical curves', 'experimental curves'), lty=c(1,2)) %-- %G op <- par(mfrow=c(2,2)) for (i in 1:3) { plot(r, select=i) lines(zero.mean(m.theoretical)[,i] ~ x, lwd=3, lty=3, col='red') } par(op) %-- %G res <- residuals(r) op <- par(mfrow=c(2,2)) plot(res) plot(res ~ predict(r)) hist(res, col='light blue', probability=T) lines(density(res), col='red', lwd=3) rug(res) qqnorm(res) qqline(res) par(op) %-- %G op <- par(mfrow=c(2,2)) plot(res ~ x1) plot(res ~ x2) plot(res ~ x3) par(op) %-- TODO: refer to other documents, URLs... TODO: "thin-plate regression spline" ??? TODO: Other functions to fit GAMs: library(mda) ?bruto # Gaussian GAM library(help=gss) ?glm # ??? TODO: What criterion is used? GCV ??? + Classification and Regression Trees (CART (TM)) It is a means of predicting a qualitative binary variable with many (quantitative or qualitative) variables, with no linearity assumption whatsoever. The algorithm is the following. Choose one of the variables and a cutpoint so as to maximize some statistical criterion; iterate until there are only a few (20 to 50) observations in each class. The result can be seen as a tree. prune that tree (for instance, compare with other (bootstrap) samples; or find the number of nodes so that the tree built from 90% of the data give the best possible results on the remaining 10%). %G library(rpart) data(kyphosis) r <- rpart(Kyphosis ~ ., data=kyphosis) plot(r) text(r) %-- The first application is obtaining a decision tree, that can predict a result with few variables, i.e., with few tests (for instance, in emergency medicine). Another application is the study of missing values. 
library(Hmisc) ?transcan It is VERY important to prune the tree, otherwise, you get a tree that describes the sample and not the population. TODO: plot The method can be generalized to predict a "counting variable" (Poisson regression). This method is not very stable: a similar sample can give a completely different tree. However, you can reduce the variance of the result with the bagging method. See also MART. + PRIM (Patient Rule Induction Method, aka Bump Hunting) We want to predict a binary variable Y from quantitative variables, by looking for boxes in the space of predictive variables in which we have often Y=1. The algorithm is the following. 1. Take a box (i.e., a part of the space delimited by hyperplanes parallel to the axes), containing all the data points. 2. Reduce it in one dimension so as to increase the proportion of points with Y=1. 3. Go back to 2, so that the box contains sufficiently few misclassified points. 4. Try to enlarge the box. 5. Remove the points from this box. 6. Start again, until there are no more points. 7. As always, prune the result, so as to avoid overfit. We end up with a bunch of boxes, each of which can be seen as a rule. This is actually a variant of Regression Trees (CART): with regression trees, each node of the tree is a rule with exactly one variable and one equality; with PRIM, we still have a tree, its nodes are rules, but rule is made of several inequalities of the form X_i < a or X_i > a. Apparently, this algorithm is not implemented in R. help.search('PRIM') help.search('bump') There are out-of-memory implementations (i.e., implementations that do not put all the data in memory): TurboPRIM. + Bagging (bootstrap aggregation) One idea, to increase the quality on an estimator (a non-linear and unstable one, e.g., a regression tree) is simply to compute its "mean" on several bootstrap samples. For instancem the result of forecast by a regression tree is usually a class (1, 2, ..., or k): we replace it by a vector (0,0,...,0,1,0,...,0) (put "1" for the predicted class, 0 for the others) and we take the average of those vectors -- we do not get a single class, but a probability distribution (bayesian readers will prefer the words "posterior distribution"). Remark: the structure of the estimator is lost -- if it was a tree, you get a bunch of trees (a forest), whose predictions are more reliable, but which is harder to interpret. Remark: This idea has no effect whatsoever with linear estimators: they commute with the mean. Remark: in some cases, the auelity of the estimator can worsen. In the following example, the forecasts are correct 99% of the time. 
library(ipred) do.it <- function (formula, data, test) { r <- bagging(formula, data=data, coob=T, control=rpart.control(xval=10)) p2 <- test$Class p1 <- predict(r,test) p1 <- factor(p1, levels=levels(p2)) res1 <- sum(p1==p2) / length(p1) # Only with the trees r <- bagging(formula, data=data, nbagg=1, control=rpart.control(xval=10)) p2 <- test$Class p1 <- predict(r,test) p1 <- factor(p1, levels=levels(p2)) res2 <- sum(p1==p2) / length(p1) c(res1, res2, res1/res2) } # An example from "mlbench" (it is a collection of such # examples) library(mlbench) data(Shuttle) n <- dim(Shuttle)[1] k <- sample(1:n,200) do.it( Class ~ ., data = Shuttle[k,], test = Shuttle[-k,] ) # Idem, 99.5% rm(Shuttle) data(Vowel) n <- dim(Vowel)[1] k <- sample(1:n,200) do.it( Class ~ ., data = Vowel[k,], test = Vowel[-k,] ) # Better: 47% instead of 36% rm(Vowel) You can interpret the bagging method as an alternative to the maximum likelihood method: to estimate a parameter, the Maximum Likelihood Method looks at the distribution of this parameter and chooses its mode; Bagging, on the other hand, uses simulations to estimate this distribution (it is a randomized algorithm) and selects its mean. See also: /usr/lib/R/library/ipred/doc/ipred-examples.pdf + Boosting This is very similar to bagging (we compute an estimator on several bootstrap samples), but the bootstrap samples are not taken at random: they are selected according to the performance of the previous estimators. Usually, we simply assign a weight (a probability of appearing in the next bootstrap sample) to each observations according to the preceding results. Usually, it works quite well, better that the bagging -- but in some cases, the estimator can worsen... TODO: with R??? + Ensemble methods The methods presented above, bagging and boostring, replace an estimator by a set of estimators. We then use these families of estimators. All the estimators of a given family are of the same kind (say, all are trees, or all are neural networks). But nothing prevents you from mixing completely different estimators. TODO: examples (tree + linear regression?) You can also use those methods for non-supervised classification (i.e., when you try to find classes in your sample, without any "training set" to hint at the rules to form those classes). TODO: examples + Random Forest Again, this is very similar to the bagging: we build regression trees on bootstrap samples, nut at each node of each tree, we do not use all the variables, but just a small (random) subset of variables. It can be seen as a "noisy" variant of bagging. Here is an example (three or four minutes of computation -- there are 500 trees...). library(randomForest) do.it <- function (formula, data, test) { r <- randomForest(formula, data=data) p2 <- test$Class p1 <- predict(r,test) p1 <- factor(p1, levels=levels(p2)) res1 <- sum(p1==p2) / length(p1) # Only with the trees r <- bagging(formula, data=data, nbagg=1, control=rpart.control(xval=10)) p2 <- test$Class p1 <- predict(r,test) p1 <- factor(p1, levels=levels(p2)) res2 <- sum(p1==p2) / length(p1) c(res1, res2, res1/res2) } library(mlbench) data(Vowel) n <- dim(Vowel)[1] k <- sample(1:n,200) do.it( Class ~ ., data = Vowel[k,], test = Vowel[-k,] ) # From 42% to 67% rm(Vowel) It is better than bagging (with the same sample, we go from 42% to 45% of good answers). Actually, the "randomForest" function is a little more verbose. 
> r <- randomForest(Class~., data=Vowel) > r Call: randomForest.formula(x = Class ~ ., data = Vowel[k, ]) Type of random forest: classification Number of trees: 500 No. of variables tried at each split: 3 OOB estimate of error rate: 26% Confusion matrix: hid hId hEd hAd hYd had hOd hod hUd hud hed class.error hid 11 3 0 0 0 0 0 0 0 3 0 0.3529412 hId 2 11 1 0 0 0 0 0 1 0 0 0.2666667 hEd 0 2 6 2 0 0 0 0 0 0 2 0.5000000 hAd 0 0 0 20 0 1 0 0 0 0 1 0.0909091 hYd 0 0 0 0 23 2 0 0 0 0 1 0.1153846 had 0 0 0 2 5 7 0 0 0 0 1 0.5333333 hOd 0 0 0 0 2 0 13 1 1 0 0 0.2352941 hod 0 0 0 0 0 0 2 11 1 0 0 0.2142857 hUd 0 0 0 0 0 0 2 2 16 2 0 0.2727273 hud 1 0 0 0 0 0 0 1 2 15 0 0.2105263 hed 0 0 0 0 2 2 0 0 2 0 15 0.2857143 First, we remark that the error rate is optimistic (26% instead of 33% -- and the oob bootstrap is supposed to be pessimistic...). We also get the confusion matrix, that gives the "distances" between the classes: we can use it to plot the classes in the plane, with a Distance Analysis algorithm (MDS (MultiDimensional Scaling), etc. -- we have already mentionned this). %G library(randomForest) library(mlbench) library(mva) data(Vowel) r <- randomForest( Class ~ ., data = Vowel, importance = TRUE ) # Trois minutes... m <- r$confusion # We have a confusion matrix instead of a distance matrix. # We try to tweak the matrix to get something that looks # like a distance matrix -- this might not be the best way # to proceed. m <- m[,-dim(m)[2]] m <- m+t(m)-diag(diag(m)) n <- dim(m)[1] m <- m/( matrix(diag(m),nr=n,nc=n) + matrix(diag(m),nr=n,nc=n, byrow=T) ) m <- 1-m diag(m) <- 0 mds <- cmdscale(m,2) plot(mds, type='n') text(mds, colnames(m)) rm(Vowel) %-- We also have the importance of each of the predictive variables. > r$importance Measure 1 Measure 2 Measure 3 Measure 4 V1 178.94737 16.674731 0.9383838 0.4960153 V2 563.15789 38.323021 0.9222222 1.0000000 V3 100.00000 13.625267 0.8343434 0.4212575 V4 200.00000 23.420278 0.8909091 0.5169644 V5 321.05263 26.491365 0.8959596 0.5819186 V6 189.47368 20.299312 0.8828283 0.5022913 V7 84.21053 10.596949 0.7777778 0.3544800 V8 110.52632 16.071039 0.8454545 0.4107499 V9 47.36842 10.219346 0.8424242 0.3301638 V10 63.15789 8.857154 0.8292929 0.3274473 TODO: explain/understand. Graphically, this suggests that we only consider V2, or V2 and V5, or even V2, V5 and V1. %G op <- par(mfrow = c(2,2)) m <- r$importance n <- dim(m)[1] for (i in 1:4) { plot(m[,i], type = "h", lwd=3, col='blue', axes=F, ylab='', xlab='Variables', main = colnames(r$importance)[i] ) axis(2) axis(1, at=1:n, labels=rownames(m)) # The two highest values in red a <- order(m[,i], decreasing=T) m2 <- m[,i] m2[ -a[1:2] ] <- NA lines(m2, type='h', lwd=5, col='red') } par(op) %-- In the examples, we have stressed classification trees, but it also works with regression trees. You can also use random forests to detect outliers. TODO: finish reading the article in Rnews_2002-3.pdf http://cran.r-project.org/doc/Rnews/Rnews_2002-3.pdf (at the end: ROC) See: Rnews_2002-3.pdf http://cran.r-project.org/doc/Rnews/Rnews_2002-3.pdf TODO: I often speak of "regression trees", but I have only defined "classification trees". What is a "regression tree"? MARS? + Outlier detection TODO: move this section to a better location. TODO: find real data with one (or two) group(s) of outliers. + Non supervised learning Here, we simply ask the computer to try and find classes in the data. If he puts most of the data in a class and the rest in a few isolated classes, you should worry. 
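In the absence of a real data set (see the TODO above), here is a
minimal simulated sketch of the idea: most observations form one big
cloud, a handful of outliers form their own small cluster, and a tiny
cluster in the k-means output is the warning sign.

set.seed(1)
x <- rbind( matrix(rnorm(2*200), nc=2),          # the bulk of the data
            matrix(rnorm(2*5, mean=6), nc=2) )   # a small group of outliers
r <- kmeans(x, centers=3)
table(r$cluster)            # one suspiciously small cluster
plot(x, col=r$cluster, pch=15,
     main="A tiny cluster should make you worry")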
(Of course, if you have several variables, you first look at them one
at a time, and only then all together.)

TODO: Idem with hierarchical classification: should we get a balanced
tree?

+ Supervised learning

Surprisingly, you can also use supervised learning algorithms. The idea
is to put all the data in the same class and add random (uniformly
distributed) data, in a second class. Then, we ask the computer to try
to predict the class from the data. Observations from the first class
that are misclassified as elements of the second are potential
outliers.

The "uniform distribution" is arbitrary and may not be a good choice.
If you have already considered 1-dimensional problems and focus on
outliers that are only revealed when you look at the data in higher
dimensions, you can replace this uniform distribution by sampling
independently in each variable.

TODO: an example...

+ Neural networks

TODO: recall what a neural net is...

You can use neural nets for regression (i.e., to forecast a
quantitative variable) or for supervised classification (i.e., to
forecast a qualitative variable).

If we take up our vowel recognition example again, we get around 60% of
good results (between 55% and 65%).

library(nnet)
do.it <- function (formula, data, test, size=20) {
  r <- nnet(formula, data=data, size=size)
  p2 <- test$Class
  p1 <- predict(r,test)       # a matrix of class membership "probabilities"
  nam <- colnames(p1)
  # For each observation, keep the most likely of the 11 classes
  i <- apply(p1, 1, function (x) { order(x)[11] })
  p1 <- nam[i]
  p1 <- factor(p1, levels=levels(p2))
  res1 <- sum(p1==p2) / length(p1)
  res1
}
library(mlbench)
data(Vowel)
n <- dim(Vowel)[1]
k <- sample(1:n,200)
do.it(Class~., data=Vowel[k,], test=Vowel[-k,])

+ Bayesian Networks

TODO

library(deal)
demo(rats)

TODO: look in /usr/lib/R/library/deal/demo/
hiscorelist <- heuristic(newrat, rats.df, rats.prior,
                         restart = 10, degree = 7, trace = TRUE)

+ MARS (Multivariate Adaptive Regression Splines)

We are again in a regression setup: we want to predict a quantitative
variable Y from quantitative variables X. To this end, we try to write
Y as a linear combination of functions of the form (Xj-t)_+ or
(t-Xj)_+, where Xj is one of the variables and t=xij is one of the
observed values.

This is very similar to regression trees, but the forecasts are
piecewise linear instead of piecewise constant.

The algorithm is the following (it is a regression, where the
predictive variables are of the form (Xj-t)_+ and (t-Xj)_+ -- we do not
take all these variables (there would be too many), but just a subset
of them).

1. Start with a single, constant, function.
2. Compute the coefficients of the model. For the first iteration, it
   is simply Y ~ 1.
3. Add the pair of functions (Xj-t)_+, (t-Xj)_+ to the set of
   predictive variables, choosing Xj and t=xij so as to reduce the
   error as much as possible.
4. Go to 2.
5. Do not forget to prune the resulting tree! (as for regression trees
   or classification trees)

You can generalize this idea to logistic regression, multiple logistic
regression, etc. (in those cases, you no longer use least squares to
find the coefficients, but maximum likelihood).

In R, you can use the "mars" function from the "mda" package. Let us
consider the following example.

%G
library(mda)
library(mlbench)
data(BostonHousing)
x <- BostonHousing
x[,4] <- as.numeric(x[,4])
pairs(x)
%--

The variables seem very well suited to the method: some have two
"humps" -- and the very idea of the MARS algorithm is to separate such
humps -- we should end up with gaussian-looking variables.
Back to the Boston data: let us look at the distribution of each
variable.

%G
op <- par(mfrow=c(4,4))
for (i in 1:14) {
  hist(x[,i], probability=T, col='light blue',
       main=paste(i, names(x)[i]))
  lines(density(x[,i]), col='red', lwd=3)
  rug(jitter(x[,i]))
}
hist(log(x[,1]), probability=T, col='light blue', main="log(x1)")
lines(density(log(x[,1])), col='red', lwd=3)
rug(jitter(log(x[,1])))
par(op)
%--

%G
op <- par(mfrow=c(4,4))
for (i in 1:14) {
  qqnorm(x[,i], main=paste(i, names(x)[i]))
  qqline(x[,i], col='red')
}
qqnorm(log(x[,1]), main="log(x1)")
qqline(log(x[,1]), col='red')
par(op)
%--

We take the logarithm of the variable to predict, because it is really
far from Gaussian (we leave the others alone: MARS will separate their
humps and they should, afterwards, look normal). (At first, I had not
transformed it -- the results were extremely poor: the forecast error
was as large as the standard deviation of the variable...)

%G
x[,1] <- log(x[,1])
n <- dim(x)[1]
k <- sample(1:n, 100)
d1 <- x[k,]
d2 <- x[-k,]
r <- mars(d1[,-1], d1[,1])
p <- predict(r, d2[,-1])
res <- d2[,1] - p
op <- par(mfrow=c(4,4))
plot(res)
plot(res ~ p)
for (i in 2:14) {
  plot(res ~ d2[,i])
}
par(op)
%--

%G
op <- par(mfrow=c(2,2))
qqnorm(r$fitted.values, main="fitted values")
qqline(r$fitted.values)
hist(r$fitted.values, probability=T, col='light blue',
     main='fitted values')
lines(density(r$fitted.values), col='red', lwd=3)
rug(jitter(r$fitted.values))
qqnorm(res, main="residuals")
qqline(res)
hist(res, probability=T, col='light blue', main='residuals')
lines(density(res), col='red', lwd=3)
rug(jitter(res))
par(op)
%--

If we compare with a naive regression, MARS is slightly better.

> mean(res^2)
[1] 0.6970905

> r2 <- lm(crim~., data=d1)
> res2 <- d2[,1] - predict(r2,d2)
> mean(res2^2)
[1] 0.7982593

A naive regression with only the "hunched" variables (a bad idea: the
other variables bring some important information):

> kk <- apply(r$cuts!=0, 2, any)
> kk <- (1:length(kk))[kk]
> kk <- c(1, 1+kk)
> r3 <- lm(crim~., data=d1[,kk])
> res3 <- d2[,1] - predict(r3,d2)
> mean(res3^2)
[1] 0.7930023

If we try to select the variables:

> library(MASS)
> r4 <- stepAIC(lm(crim~., data=d1))
> res4 <- d2[,1] - predict(r4,d2)
> mean(res4^2)
[1] 0.8195915

(You can try many other methods...)

+ HME (Hierarchical Mixture of Experts)

This is similar to CART, with the following differences. The nodes of
the tree are probabilistic. Each node depends on a linear combination
of the predictive variables, not on a single variable. We perform a
linear regression at each leaf (with CART, we assign a constant to
each leaf). There is no algorithm to find the structure of the tree:
you have to choose it yourself.

TODO: with R?

+ MART

TODO

+ TODO: TO SORT

heuristic(deal)     Heuristic greedy search with random restart
                    Title: Learning Bayesian Networks with Mixed Variables
tune.rpart(e1071)   Convenience tuning functions

+ Stacking ???

TODO

+ Bumping ???

TODO

* Wide problems

TODO: put this somewhere else -- it should be with ridge regression,
partial least squares, etc.

+ Supervised Principal Component Analysis (SPCA)

Principal Component Analysis is a simple and efficient means of
reducing the dimensionality of a data set, i.e., of reducing the
number of variables one will have to look at, but, in the context of
regression, it misses an important point: in principal component
regression, where one tries to predict or explain a variable y from
many variables x1, ..., xn, one computes the PCA of the predictive
variables x1, ..., xn and regresses y against the first components --
but one does not take the variable to predict into account!
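For comparison, this is what plain principal component regression
looks like (only a rough sketch on simulated data; the number of
retained components is arbitrary).

# A sketch of principal component regression: the components are
# computed from the predictors alone -- yy plays no role in their
# construction, which is precisely the problem mentioned above.
set.seed(1)
n0 <- 100
xx <- matrix(rnorm(10*n0), nc=10)
yy <- xx[,1] - xx[,2] + .5*rnorm(n0)
pc <- prcomp(xx, scale.=TRUE)
r0 <- lm(yy ~ pc$x[,1:3])       # regress on the first three components only
summary(r0)$r.squared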
In other words, principal component regression selects non-redundant
variables that account for the shape of the cloud of points in the x
variables, instead of selecting non-redundant variables with some
power to predict y. This is the same problem that leads to Partial
Least Squares (PLS). Supervised principal components are much easier
to understand, though.

The algorithm goes as follows:

1. Compute the principal components of those of the predictive
   variables that are most correlated with the variable to predict y
   (using some threshold, chosen by cross-validation).

2. Regress y against the first of these principal components.

Well, the actual algorithm is slightly more complicated: one does not
directly use the correlation to select the variables. For more
details, see:

http://www-stat.stanford.edu/~tibs/ftp/spca.pdf
http://www-stat.stanford.edu/~tibs/superpc/tutorial.html

In R, this method is available in the "superpc" package and can
accommodate classical or survival regression.

%G
library(superpc)

# Some simulated data
n <- 50     # Number of observations (e.g., patients)
m <- 500    # Number of test observations (values to predict)
p <- 1000   # Number of variables or "features" (e.g., genes)
q <- 20     # Number of useful variables
x <- matrix( rnorm((n+m)*p), nr=n+m )
z <- svd(x[,1:q])$u
y <- 1 - 2 * z[,1]
d <- list( x = t(x[1:n,]),
           y = y[1:n],
           featurenames = paste("V", as.character(1:p), sep="") )
new.x <- list( x = t(x[-(1:n),]),
               y = y[-(1:n)],
               featurenames = paste("V", as.character(1:p), sep="") )

# Compute the correlation (more precisely, a score) of
# each variable with the outcome.
train.obj <- superpc.train(d, type="regression")
hist(train.obj$feature.score, col = "light blue",
     main = "SPCA (Supervised Principal Component Analysis)")
%--

%G
## PROBLEM: This should not look like that...
# Compute the threshold to be used on the scores to select
# the variables to retain.
cv.obj <- superpc.cv(train.obj, d)
superpc.plotcv(cv.obj)
title("SPCA")
%--

%G
fit.cts <- superpc.predict(
  train.obj, d, new.x,
  threshold = 0.7,
  n.components = 3,
  prediction.type = "continuous" )
r <- superpc.fit.to.outcome(train.obj, new.x, fit.cts$v.pred)
plot(r$results$fitted.values, new.x$y,
     xlab = "fitted values", ylab = "actual values",
     main = "SPCA forecasts")
abline( lm(new.x$y ~ r$results$fitted.values),
        col = "red", lwd = 3 )
abline(0, 1, lty = 2, lwd = 2)
%--
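To make the algorithm more concrete, here is the same idea implemented
by hand with base R only (a rough sketch, not the superpc
implementation): keep the predictors most correlated with y, compute
their principal components, and regress y on the first few of them.
For simplicity, the sketch keeps a fixed number of top-scoring
variables instead of using a cross-validated threshold; the numbers of
variables and components are arbitrary.

# A sketch of supervised principal components, by hand.
spca.by.hand <- function (x, y, n.variables=20, n.components=3) {
  scores <- abs( cor(x, y) )              # score each predictor against y
  keep <- order(scores, decreasing=TRUE)[1:n.variables]
  pc <- prcomp(x[,keep], scale.=TRUE)     # PCA of the selected predictors only
  lm( y ~ pc$x[, 1:n.components] )        # regress y on the first components
}
# With the simulated data above (the variables are in the rows of d$x):
r.hand <- spca.by.hand(t(d$x), d$y)
summary(r.hand)$r.squared

The superpc package differs mainly in the score it uses and in the
cross-validated choice of the threshold, as described above.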