Factorial methods: Around Principal Component Analysis (PCA)

There are two kinds of methods in multidimensional statistics: Factorial Methods, in which we project the data onto a vector space, trying to lose as little information as possible; and Classification Methods, which try to cluster the points.

There are three main techniques among the Factorial Methods: Principal Component Analysis (PCA, with several quantitative variables), Correspondence Analysis (CA, two qualitative variables, represented by a contingency table) and Multiple Correspondence Analysis (MCA, more than two variables, all qualitative). The following table summarizes this:

  Method    Quantitative variables    Qualitative variables
  ----------------------------------------------------------
  PCA       several                   none
  MDS       several                   none
  CA        none                      two
  MCA       none                      several

There are other, non-symmetric, methods: one variable plays a special role and is given a different emphasis than the others (usually, we try to predict or explain one variable from the others).

  Method                  Quantitative v.   Qualitative v.   Variable to predict
  ----------------------------------------------------------------------------------
  regression              several           several          quantitative
  anova                   none              one              quantitative
  logistic regression     several           several          binary
  Poisson regression      several           several          count
  logistic regression     several           several          ordered (qualitative)
  discriminant analysis   several           none             one (qualitative)
  CART                    several           several          binary
  ...

* Principal Component Analysis (PCA)

+ Introducing PCA

Here is a first presentation of PCA. We have a cloud of points in a high-dimensional space, too large for us to see anything in it. The PCA will give us a subspace of reasonable dimension so that the projection onto this subspace retains "as much information as possible", i.e., so that the projected cloud of points is as "dispersed" as possible. In other words, it reduces the dimension of the cloud of points; it helps choose a point of view from which to look at the cloud of points.

The algorithm is the following. We first translate the data so that its center of gravity is at the origin (always a good thing when you plan to use linear algebra). Then, we rotate it so that the standard deviation of the first coordinate is as large as possible (and then the second, the third, etc.): this is equivalent to diagonalizing the variance-covariance matrix (it is a positive real symmetric matrix, hence it is diagonalizable in an orthonormal basis), starting with the eigenvectors with the largest eigenvalues. The first axis of this new coordinate system corresponds to an approximation of the cloud of points by a 1-dimensional space; if you want an approximation by a k-dimensional subspace, just take the first k eigenvectors. To choose the dimension of that subspace, we can look (graphically) at the eigenvalues and stop when they start dropping (if they decrease slowly, we are in a bad situation: with our linear goggles, the data is inherently high-dimensional).

One may also see PCA as an analogue of the least squares method, used to find a line that goes as "near" the points as possible -- to simplify, let us assume there are just two dimensions. But while the least squares method is asymmetric (the two variables play different roles: they are not interchangeable, we try to predict one from the other, we measure the distance parallel to one coordinate axis), PCA is symmetric (the distance is measured orthogonally to the line we are looking for).

Here is yet another way of presenting PCA.
We model the data as X = Y + E, where X is the noisy data we have (in an n-dimensional space), Y is the real data (without the noise), which lives in a smaller, k-dimensional space, and E is an error term (the noise). We want to find the k-dimensional space where Y lives.

You can also see PCA as a game on a table of numbers (it is not as childish as it seems: it is actually a description of correspondence analysis). Thus, we can switch the roles of rows and columns: on the one hand, we try to find similarities or differences between subjects, on the other, similarities or differences between variables. (There are two matrices to diagonalize, A*t(A) and t(A)*A, but it suffices to diagonalize one: they have the same non-zero eigenvalues and the eigenvectors of one can easily be obtained from those of the other.) We usually ask that the variables be centered (and, often, normalized). We can plot the variables on the same graph as the subjects: the variables will lie on the unit sphere; their projection on the subspace spanned by the first two eigenvectors is the "correlation circle".

After performing the PCA, we can add new subjects to the plot (test (out-of-sample) subjects; fictitious subjects, representative of certain classes of subjects): just use the "predict" function. Dually, we can add variables (for explanatory purposes); if those variables are qualitative, we actually add an "average subject" for each value of the variable.

This is often called a "biplot", because both subjects (in black) and variables (in red) appear on the plot.

%G
data(USArrests)
p <- prcomp(USArrests)
biplot(p)
%--

TODO: find a situation where you really want to add variables and/or subjects.

n <- 100     # Number of subjects
nn <- 10     # Number of out-of-sample subjects
k <- 5       # Number of variables
kk <- 3      # Number of new variables
x <- matrix(rnorm(n*k), nr=n, nc=k)
x <- t(rnorm(k) + t(x))                  # Initial data
y <- matrix(rnorm(n*kk), nr=n, nc=kk)    # New variables
z <- matrix(rnorm(nn*k), nr=nn, nc=k)    # New subjects
r <- prcomp(x)

# Check the change-of-basis formula: the scores are the centered
# data multiplied by the rotation matrix
all(abs( t(t(x) - r$center) %*% r$rotation - r$x ) < 1e-8)

# Subject coordinates
t(t(x) - r$center) %*% r$rotation

# Out-of-sample subject coordinates
t(t(z) - r$center) %*% r$rotation

# For each new variable, we add an "average" subject
# (more precisely: x %*% t(yy[1,]) = orthogonal projection of
# y[,1] on the subspace spanned by the columns of x),
# to which we apply the same change of basis.
yy <- t(lm(y~x-1)$coef)
t(t(yy) - r$center) %*% r$rotation

Here are the eigenvalues. They are plummeting: this means that the PCA is meaningful and that we can retain only the eigenvectors with the highest eigenvalues (here, the first one: we could even have done a 1-dimensional plot (a histogram)).

%G
plot(p)
%--

There is also a "princomp" function, that does the same computations, with a few more limitations (namely, you should have more observations than variables): do not use it.

The vocabulary used with principal component analysis may surprise you: people speak of "loadings" where you would expect "rotation matrix" and of "scores" where you would think "new coordinates".

In the preceding plot, the old basis vectors are in red. This is the correlation circle -- those vectors lie on the unit sphere, which we project onto the first two eigenvectors (it should not be called a circle but a disc -- the projection of a sphere is a disc, not a circle).
%G a <- seq(0,2*pi,length=100) plot( cos(a), sin(a), type = 'l', lty = 3, xlab = 'comp 1', ylab = 'comp 2', main = "Correlation circle") v <- t(p$rotation)[1:2,] arrows(0,0, v[1,], v[2,], col='red') text(v[1,], v[2,],colnames(v)) %-- Here is a situation where PCA is not relevant (my pupils's marks, when I was a high school teacher). %G # Copy-pasted with the help of the "deparse" command: # cat( deparse(x), file='foobar') notes <- matrix( c(15, NA, 7, 15, 11, 7, 7, 8, 11, 11, 13, 6, 14, 19, 9, 8, 6, NA, 7, 14, 11, 13, 16, 10, 18, 7, 7, NA, 11, NA, NA, 6, 15, 5, 11, 7, 3, NA, 3, 1, 10, 1, 1, 18, 13, 2, 2, 0, 7, 9, 13, NA, 19, 0, 17, 8, 2, 9, 2, 5, 12, 0, 8, 12, 8, 4, 8, 0, 5, 5.5, 1, 12, 4, 13, 5, 11, 6, 0, 7, 8, 11, 9, 9, 9, 14, 8, 5, 8, 5, 5, 12, 6, 16.5, 13.5, 15, 3, 10.5, 1.5, 10.5, 9, 15, 7.5, 12, 13.5, 4.5, 13.5, 13.5, 6, 12, 7.5, 9, 6, 13.5, 13.5, 15, 13.5, 6, NA, 13.5, 4.5, 14, NA, 14, 14, 14, 8, 16, NA, 6, 6, 12, NA, 7, 15, 13, 17, 18, 5, 14, 17, 17, 13, NA, NA, 16, 14, 18, 13, 17, 17, 8, 4, 16, 16, 16, 10, 15, 8, 10, 13, 12, 14, 8, 19, 7, 7, 9, 8, 15, 16, 8, 7, 12, 5, 11, 17, 13, 13, 7, 12, 15, 8, 17, 16, 16, 6, 7, 11, 15, 15, 19, 12, 15, 16, 13, 19, 14, 4, 13, 13, 19, 11, 15, 7, 20, 16, 10, 12, 16, 14, 0, 0, 11, 9, 4, 10, 0, 0, 5, 11, 12, 7, 12, 17, NA, 6, 6, 9, 7, 0, 7, NA, 15, 3, 20, 11, 10, 13, 0, 0, 6, 1, 5, 6, 5, 4, 2, 0, 8, 9, NA, 0, 11, 11, 0, 7, 0, NA, NA, 7, 0, NA, NA, 6, 9, 6, 4, 5, 4, 3 ), nrow=30) notes <- data.frame(notes) # These are not the real names row.names(notes) <- c("Anouilh", "Balzac", "Camus", "Dostoievski", "Eschyle", "Fenelon", "Giraudoux", "Homer", "Ionesco", "Jarry", "Kant", "La Fontaine", "Marivaux", "Nerval", "Ossian", "Platon", "Quevedo", "Racine", "Shakespeare", "Terence", "Updike", "Voltaire", "Whitman", "X", "Yourcenar", "Zola", "27", "28", "29", "30") attr(notes, "names") <- c("C1", "DM1", "C2", "DS1", "DM2", "C3", "DM3", "DM4", "DS2") notes <- as.matrix(notes) notes <- t(t(notes) - apply(notes, 2, mean, na.rm=T)) # Get rid of NAs notes[ is.na(notes) ] <- 0 # plots plot(princomp(notes)) %-- %G biplot(princomp(notes)) %-- Here, we gave the same weight to each mark and each subject; we could have give more weight to some marks, to reflect their nature (test or homework) or the weight of the subjects for the final exam (e.g., A-level): it is the same algorithm, only the scalar product on the space changes. The "biplot" command anly gives the first two dimensions: we can sometimes see more with the "pairs" command. %G pairs(princomp(notes)$scores, gap=0) %-- %G pairs(princomp(notes)$scores[,1:3]) %-- %G p <- princomp(notes) pairs( rbind(p$scores, p$loadings)[,1:3], col=c(rep(1,p$n.obs),rep(2, length(p$center))), pch=c(rep(1,p$n.obs),rep(3, length(p$center))), ) %-- We leave it to the reader to add red arrows (instead of red points) for the variables -- actually, the "pairs" function is not very configurable: the different panels take as arguments the coordinates of the points to draw, while one could want to plot, in the same panel, very different things -- but we do not even have the row and column numbers... The corresponding function in the "lattice" library seems scarcely more configurable. %G library(lattice) splom(as.data.frame( princomp(notes)$scores[,1:3] )) %-- Here is another situation where one could want to use PCA: the classification of texts in a corpus. The rows of the table (the subjects) are the texts, the columns are the words (the vocabulary), the values are the number of occurrences of the words in the text. 
As the dimension of the space (the number of columns) is rather high (several thousand), it is not very easy to compute the covariance matrix, let alone diagonalize it. Yet, one can perform the PCA, in an approximate way, with neural networks.

http://www.loria.fr/projets/TALN/actes/TALN/articles/TALN02.pdf

Here is another means of tackling the same problem, without PCA but still with geometry in high-dimensional spaces:

http://www.perl.com/lpt/a/2003/02/19/engine.html

+ Principal Component Analysis: details

To understand what PCA really is and how it works, let us implement it ourselves. Here is one possible implementation.

%G
my.acp <- function (x) {
  n <- dim(x)[1]
  p <- dim(x)[2]
  # Translation, to use linear algebra
  centre <- apply(x, 2, mean)
  x <- x - matrix(centre, nr=n, nc=p, byrow=T)
  # Diagonalizations, base changes
  e1 <- eigen( t(x) %*% x, symmetric=T )
  e2 <- eigen( x %*% t(x), symmetric=T )
  variables <- t(e2$vectors) %*% x
  subjects <- t(e1$vectors) %*% t(x)
  # The vectors we want are the columns of the
  # above matrices. To draw them, with the "pairs"
  # function, we have to transpose them.
  variables <- t(variables)
  subjects <- t(subjects)
  eigen.values <- e1$values
  # Plot
  plot( subjects[,1:2],
        xlim=c( min(c(subjects[,1],-subjects[,1])),
                max(c(subjects[,1],-subjects[,1])) ),
        ylim=c( min(c(subjects[,2],-subjects[,2])),
                max(c(subjects[,2],-subjects[,2])) ),
        xlab='', ylab='', frame.plot=F )
  par(new=T)
  plot( variables[,1:2], col='red',
        xlim=c( min(c(variables[,1],-variables[,1])),
                max(c(variables[,1],-variables[,1])) ),
        ylim=c( min(c(variables[,2],-variables[,2])),
                max(c(variables[,2],-variables[,2])) ),
        axes=F, xlab='', ylab='', pch='.')
  axis(3, col='red')
  axis(4, col='red')
  arrows(0,0,variables[,1],variables[,2],col='red')
  # Return the data
  invisible(list(data=x, centre=centre, subjects=subjects,
                 variables=variables, eigen.values=eigen.values))
}
n <- 20
p <- 5
x <- matrix( rnorm(p*n), nr=n, nc=p )
my.acp(x)
title(main="ACP by hand")
%--

To check we did not make any error in implementing the algorithm, we can compare the result with that of the "princomp" command.

%G
biplot(princomp(x))
%--

Exercise: Add the subject names; the variable names; plot the first three dimensions of the PCA (not just the first two), as with the "pairs" command; add new variables to the plot (if a variable is quantitative, apply the same base change; if it is qualitative, compute an average subject for each value of this variable and perform the base change).

Exercise: Modify the preceding code by replacing the "eigen" function by the "svd" command, which performs a Singular Value Decomposition.

TODO

+ Normalized and non-normalized PCA

There are several kinds of PCA: centered or not, normalized (based on the correlation matrix) or not (based on the variance-covariance matrix).

prcomp(x, center=TRUE, scale.=FALSE)   # default

Let us first consider the centered, non-normalized principal component analysis, i.e., the one based on the variance-covariance matrix.

TODO: This was written for the "princomp" function, not the "prcomp" one -- actually, the problems disappear.
d <- USArrests[,1:3] # Data dd <- t(t(d)-apply(d, 2, mean)) # Centered data m <- cov(d) # Covariances matrix e <- eigen(m) # Eigen values and eigen vectors p <- princomp( ~ Murder + Assault + UrbanPop, data = USArrests) The change-of-base matrix is the same in both cases: > e$vectors [,1] [,2] [,3] [1,] -0.04181042 -0.04791741 0.99797586 [2,] -0.99806069 -0.04410079 -0.04393145 [3,] -0.04611661 0.99787727 0.04598061 > unclass(p$loadings) Comp.1 Comp.2 Comp.3 Murder -0.04181042 -0.04791741 0.99797586 Assault -0.99806069 -0.04410079 -0.04393145 UrbanPop -0.04611661 0.99787727 0.04598061 and so are the coordinates: > p$scores[1:3,] Comp.1 Comp.2 Comp.3 Alabama -64.99204 -10.660459 2.188264 Alaska -91.34472 -21.676617 -2.651214 Arizona -123.68089 8.979374 -4.437864 > (dd %*% p$loadings) [1:3,] Comp.1 Comp.2 Comp.3 Alabama -64.99204 -10.660459 2.188264 Alaska -91.34472 -21.676617 -2.651214 Arizona -123.68089 8.979374 -4.437864 Let us now look at the normalized principal component analysis: we do not simply center each column of the matrix, we normalize them. d <- USArrests[,1:3] dd <- apply(d, 2, function (x) { (x-mean(x))/sd(x) }) The variance-covariance matrix of this new matrix is the correlation matrix of the initial matrix. > cov(dd) Murder Assault UrbanPop Murder 1.00000000 0.8018733 0.06957262 Assault 0.80187331 1.0000000 0.25887170 UrbanPop 0.06957262 0.2588717 1.00000000 > cor(d) Murder Assault UrbanPop Murder 1.00000000 0.8018733 0.06957262 Assault 0.80187331 1.0000000 0.25887170 UrbanPop 0.06957262 0.2588717 1.00000000 Let us go on with the computations: m <- cov(dd) e <- eigen(m) p <- princomp( ~ Murder + Assault + UrbanPop, data = USArrests, cor=T) We got the right change-of-base matrix: > e$vectors [,1] [,2] [,3] [1,] 0.6672955 0.30345520 0.6801703 [2,] 0.6970818 0.06713997 -0.7138411 [3,] 0.2622854 -0.95047734 0.1667309 > unclass(p$loadings) Comp.1 Comp.2 Comp.3 Murder 0.6672955 -0.30345520 0.6801703 Assault 0.6970818 -0.06713997 -0.7138411 UrbanPop 0.2622854 0.95047734 0.1667309 But the coordinates are not the same... > p$scores[1:3,] Comp.1 Comp.2 Comp.3 Alabama 1.2508055 -0.9341207 0.2015063 Alaska 0.8006592 -1.3941923 -0.6532667 Arizona 1.3542765 0.8368948 -0.8488785 > (dd %*% e$vectors) [1:3,] [,1] [,2] [,3] Alabama 1.2382343 0.9247323 0.1994810 Alaska 0.7926121 1.3801799 -0.6467010 Arizona 1.3406654 -0.8284836 -0.8403468 However, the error is always the same (up to the sign, which is meaningless): > p$scores[1:3,] / (dd %*% e$vectors) [1:3,] Comp.1 Comp.2 Comp.3 Alabama 1.010153 -1.010153 1.010153 Alaska 1.010153 -1.010153 1.010153 Arizona 1.010153 -1.010153 1.010153 The difference comes from the fact that there are two definitions of covariance, one in which you divide ny n, another in which you divide by n-1. > dim(d) [1] 50 3 > sqrt(50/49) [1] 1.010153 + Example TODO: this is not the right place. Later, with LDA. 
for i in `ls *.txt | cat_rand | head -20` do perl -n -e 'BEGIN { foreach ("a".."z") { $a{$_}=0 } }; y/A-Z/a-z/; s/[^a-z]//g; foreach (split("")) { $a{$_}++ } END { foreach("a".."z"){print "$a{$_} "} }' <$i echo E done > ling.txt # Then the same thing in a directory containing French texts %G b <- read.table('ling.txt') names(b) <- c(letters[1:26], 'language') a <- b[,1:26] a <- a/apply(a,1,sum) biplot(princomp(a)) %-- %G plot(hclust(dist(a))) %-- %G kmeans.plot <- function (data, n=2, iter.max=10) { k <- kmeans(data,n,iter.max) p <- princomp(data) u <- p$loadings # The observations x <- (t(u) %*% t(data))[1:2,] x <- t(x) # The centers plot(x, col=k$cluster, pch=3, lwd=3) c <- (t(u) %*% t(k$center))[1:2,] c <- t(c) points(c, col = 1:n, pch=7, lwd=3) # A segment joining each observation to its group center for (i in 1:n) { for (j in (1:length(data[,1]))[k$cluster==i]) { segments( x[j,1], x[j,2], c[i,1], c[i,2], col=i ) } } text( x[,1], x[,2], attr(x, "dimnames")[[1]] ) } kmeans.plot(a,2) %-- # plot(lda(a,b[,27])) # Bug? # plot(lda(as.matrix(a),b[,27])) # Newer bug? + PCA and linear algebra: Singular Value Decomposition (SVD) Actually, computers do not perform Principal Component Analysis as we have just seen, by computing the variance-covariance matrix and diagonalizing it. TODO + Rank PCA Principal component analysis assumes that the data is gaussian. If it is not, we can replace the values by their rank. But then, the variables follow a uniform distribution; We can transform those uniform data to get a gaussian distribution. %G n <- 100 v <- .1 a <- rcauchy(n) b <- rcauchy(n) c <- rcauchy(n) d <- data.frame( x1 = a+b+c+v*rcauchy(n), x2 = a+b-c+v*rcauchy(n), x3 = a-b+c+v*rcauchy(n), x4 = -a+b+c+v*rcauchy(n), x5 = a-b-c+v*rcauchy(n), x6 = -a+b-c+v*rcauchy(n) ) biplot(princomp(d)) %-- %G rank.and.normalize.vector <- function (x) { x <- (rank(x)-.5)/length(x) x <- qnorm(x) } rank.and.normalize <- function (x) { if( is.vector(x) ) return( rank.and.normalize.vector(x) ) if( is.data.frame(x) ) { d <- NULL for (v in x) { if( is.null(d) ) d <- data.frame( rank.and.normalize(v) ) else d <- data.frame(d, rank.and.normalize(v)) } names(d) <- names(x) return(d) } stop("Data type not handled") } biplot(princomp(apply(d,2,rank.and.normalize))) %-- Let us check on the other components of the PCA. %G pairs( princomp(d)$scores ) %-- %G pairs( princomp(apply(d,2,rank.and.normalize))$scores ) %-- + Non linear PCA For non-linear problems, we can first embed our space in a larger one, with an application such as (x,y) |--> (x^2, x*y, y^2). In fact, as the PCA only involves scalar products, we do not really need to compute those new coordinates: it suffices to replace all the occurrences of the scalar product by that of the new space (this function, that expresses the scalar product in the new space from the coordinates in the old space, is called a kernel -- and you can take any kernel). TODO: an example We shall reuse that idea later (to use a kernel to dive into a higher-dimensional space to linearize a non-linear problem) when we speak of SVM (Support Vector Machines). http://www.kernel-machines.org/papers/talk-dagm99_4.ps.gz TODO: library(kernlab) ?kpca + Projection Pursuit Principal Component Analysis tries to find a plane (more generally, a subspace of reasonable dimension) in which (the orthogonal projection of) the data is as dispersed as possible, the dispersion being measured with the variance matrix. 
Projection pursuit is a generalization of PCA in which one looks for a subspace that maximizes some "interestingness" criterion -- and everyone can define their own criterion. For instance, it could be a measure of the dispersion of the data (based on the variance matrix or on more robust dispersion estimators) or a measure of the non-gaussianity of the data (kurtosis, skewness, etc.). Those measures of interestingness are called "indices".

The most common projection pursuits are PCA, ICA (Independent Component Analysis) and robust PCA (replace the 1-dimensional variance you are trying to maximize by a robust equivalent).

http://www.r-project.org/useR-2006/Slides/Fritz.pdf

library(pcaPP)

There is also a function called "ppr" (projection pursuit regression) but it seems to do something completely different: variable selection for linear regression, with several variables to predict.

+ Robust PCA

TODO: Give an example

+ Spherical PCA

TODO

+ Grand tour, guided tour

The grand tour is an animation, a continuous family of projections of the cloud of points onto a 2-dimensional space, obtained by interpolation between random projections: you are supposed to look at it until you see something (clusters, artefacts, alignments, etc.). GGobi can display grand tours.

http://www.ggobi.org/

The guided tour is a variant of the grand tour, where one interpolates between local maxima of a projection pursuit function instead of random projections.

http://citeseer.ist.psu.edu/cook95grand.html

+ Independent Component Analysis (ICA)

At first sight, it looks very similar to Principal Component Analysis -- but it is very different. The general idea is that the variables we are looking at are linear combinations of independent random variables, and we want to recover those random variables. The only assumption is that the random variables we are looking for are independent (well, actually, there is another assumption: those random variables have to be non-gaussian).

ICA has been mainly used in signal processing, the initial example being the cocktail party problem: you have two microphones (or two ears) and you are hearing two conversations at the same time; you want to separate them. Similarly, ICA has also been used to analyze EEG (electroencephalograms).

TODO: 2-dimensional applications, image compression, feature extraction.

Let us see on an example how it differs from Principal Component Analysis: let us take two random variables X1 and X2, uniformly distributed

%G
N <- 1000
X <- matrix(runif(2*N, -1, 1), nc=2)
plot(X)
%--

and transform them

%G
M <- matrix(rnorm(4), nc=2)
Y <- X %*% M
plot(Y)
%--

We get a parallelogram. PCA would yield something close to its largest diagonal (red, in the following plot); ICA would yield the image of the axes (blue), i.e., new axes parallel to the sides of the parallelogram.

%G
plot(Y)
p <- prcomp(Y)$rotation
abline(0, p[2,1] / p[1,1], col="red", lwd=3)
abline(0, -p[1,1] / p[2,1], col="red", lwd=3)
abline(0, M[1,2]/M[1,1], col="blue", lwd=3)
abline(0, M[2,2]/M[2,1], col="blue", lwd=3)
%--

Other examples:

%G
op <- par(mfrow=c(2,2), mar=c(1,1,1,1))
for (i in 1:4) {
  N <- 1000
  X <- matrix(runif(2*N, -1, 1), nc=2)
  M <- matrix(rnorm(4), nc=2)
  Y <- X %*% M
  plot(Y, xlab="", ylab="")
  p <- prcomp(Y)$rotation
  abline(0, p[2,1] / p[1,1], col="red", lwd=3)
  abline(0, -p[1,1] / p[2,1], col="red", lwd=3)
  abline(0, M[1,2]/M[1,1], col="blue", lwd=3)
  abline(0, M[2,2]/M[2,1], col="blue", lwd=3)
}
par(op)
%--

The ICA yields the variables that were indeed used.
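As a minimal, untested sketch (the "fastICA" package from CRAN is an assumption here -- it is not used elsewhere in this document), one can redo the parallelogram example and check that the estimated mixing matrix recovers the true one, up to order, sign and scaling:

library(fastICA)
set.seed(1)
N <- 1000
X <- matrix(runif(2*N, -1, 1), nc=2)   # The independent sources
M <- matrix(rnorm(4), nc=2)            # The (unknown) mixing matrix
Y <- X %*% M                           # The observed, mixed data
r <- fastICA(Y, n.comp=2)
plot(Y, xlab="", ylab="")
# True mixed axes (blue, slopes given by the rows of M) and their
# ICA estimates (green, rows of the estimated mixing matrix r$A)
abline(0, M[1,2]/M[1,1],     col="blue",  lwd=3)
abline(0, M[2,2]/M[2,1],     col="blue",  lwd=3)
abline(0, r$A[1,2]/r$A[1,1], col="green", lwd=2, lty=2)
abline(0, r$A[2,2]/r$A[2,1], col="green", lwd=2, lty=2)
# The recovered sources, r$S, should look like the original square:
# plot(r$S)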
The main idea behind the algorithm is the fact that (from the central limit theorem) a linear combination of non-gaussian random variables is "more gaussian". %G op <- par(mfrow=c(2,2)) hist(X[,1], col="light blue") hist(X[,2], col="light blue") hist(Y[,1], col="light blue") hist(Y[,2], col="light blue") par(op) %-- Therefore the algorithm goes as follows: 1. Normalize the data, so that they have a variance equal to 1 and that the be uncorrelated. 2. Find (with the usual numeric optimization algorithms) the linear transformation that maximizes the non-gaussianity. To this end, one can use several measures of non-gaussianity: the kurtosis (i.e., the fourth moment), the entropy (integral of -f * log(f), where f is the pdf), mutual information, etc. There are two implementations of this algorithm in R. The first one, "ica" in the "e1071" package, does not give the expected results. %G library(e1071) r <- ica(Y,.1) plot(r$projection) %-- The second one, "mlica" in the "mlica" package, gives the expected result. %G library(mlica) ICA <- function (x,...) { prPCA <- PriorNormPCA(x); prNCP <- proposeNCP(prPCA,0.1); mlica(prNCP,...) } set.seed(1) # It sometimes crashes... N <- 1000 X <- matrix(runif(2*N, -1, 1), nc=2) M <- matrix(rnorm(4), nc=2) Y <- X %*% M r <- ICA(Y) plot(r$S) %-- TODO: Give more interesting results. See also: http://www.cis.hut.fi/projects/ica/icademo/ http://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf + Factor analysis Principal Component Analysis is about approximating a variance matrix. The simplest approximation is a scalar matrix (a diagonal matrix with always the same element on the diagonal), the next simplest is a diagonal matrix; then comes the PCA, which writes the variance matrix as V = B B' for some rectangular matrix B, with as many columns as you want components. The matrix B can be characterized as the matrix that minimizes the norm of V - B B' (actually, it is not unique, but (a.s.) unique up to multiplication by an orthogonal matrix). The next step is the factor model: we write V as B B' + Delta, where Delta is a diagonal matrix. In R, this is obtained with the factanal() function. r <- factanal(covmat = V, factors = k) # The correlation matrix v <- r$loadings %*% t(r$loadings) + diag(r$uniquenesses) # The corresponding variance matrix v <- diag(sqrt(diag(V))) %*% v %*% diag(sqrt(diag(V))) This is used, for instance, in finance: we model stock returns as a multidimensional gaussian distribution. A portfolio can be seen as a linear combination of stocks; the coefficients are called the weights of the portfolio and are usually assumed to sum up to 1 (you might also want them to be positive). If w is a portfolio and V the variance matrix of the distribution of returns, then the returns of the portfolio are gaussian with variance w' V w; the square root of this variance is called the risk of the portfolio. Finding the portfolio with the minimum risk possible sounds simple enough (it is an optimization problem), but you have to estimate the variance matrix in the first place. If your universe contains thousands of stocks, you actually want to estimate a 1000*1000 matrix: you will never have enough data to do that! You need a way to reduce the number of quantities to estimate, you need a way to more parsimoniously parametrize variance matrices. Principal Component Analysis is the classical way of doing this, but it assumes that all the stocks respond to the same few sources of randomness: this is not a reasonable assumption. 
Instead, we assume that the stocks respond to a few "risk factors" (they could be interpreted as, e.g., oil prices, consumption indexes, interest rates, etc.) and an idiosyncratic component, specific to each stock: this is a factor model. Such an estimation of the variance matrix of stock returns is called a risk model.

TODO: An example with actual computations (portfolio optimization?)

+ Factor models

TODO: Merge this section with the previous
TODO: Decide whether to put this section here or together with the
      variance matrix (there is already a note about RMT somewhere)...

Principal Component Analysis can also be seen as a decomposition of a variance matrix; this leads to a simplification of variance matrices (by discarding the eigenvectors with low eigenvalues) and to a parametrization of variance matrices with few parameters. Since large variance matrices require a huge amount of data to be reliably estimated, PCA can allow you to get a precise (but biased) estimation with little data. Those parametrizations are often called "factor models".

They are used, for instance, in finance: when you want to invest in stocks (or any other asset, actually), you do not buy the stocks for which you have a positive view in equal quantities: some of those stocks can be moving together. Consider for instance three stocks, A, B and C, on which you have a positive view, where A and B tend to move together and C moves independently from A and B: it might not be a good idea to invest a third of your wealth in each of them -- it might be wiser to invest half in C, and a quarter each in A and B.

This reasoning can be formalized (and justified) if one assumes that the returns of those three assets follow a gaussian distribution with mean (mu, mu, mu) (where mu is the expected return of those assets -- here we assume they are equal) and correlation matrix

  1    1/2   0
  1/2  1     0
  0    0     1

We want to find the portfolio (i.e., the combination of assets) with the lowest risk, i.e., the lowest variance. But where do we get this variance matrix in the first place? We have to estimate it from historical data: for instance, from the monthly returns of the stocks over the past three years. But if you have 1000 stocks (that is a small number: in real life, it is usually between 1000 and 10,000), that is 36,000 data points to estimate around 500,000 parameters -- this is not reasonable.

The idea is to assume that the variance matrix has some simple form, i.e., that it can be parametrized with a reasonable number of parameters. For instance, one can assume that the correlations between the returns are due to their being correlated with a small number of "factors": then, we just have to give the "exposure" of each stock to each factor, B, and the variance matrix of the factors, v. The variance matrix of the stock returns is then

  V = B v B'

If one assumes that the factors are independent and of variance 1, i.e., that their variance matrix v is the identity matrix, then V = B B'. But this is very similar to the decomposition of a variance matrix provided by PCA: if, furthermore, we do not know the factors beforehand, we can simply select them as the first eigenvectors. This is called a noiseless statistical factor model.

TODO: Noiseless factor model
TODO: Scalar factor model
TODO: Approximate factor model
TODO: The double Procrustes problem?
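Here is a small sketch of such a statistical factor model, on simulated returns (the numbers of stocks, observations and factors below are arbitrary): keep the first k eigenvectors of the sample variance matrix as exposures and put the remaining variance back on the diagonal.

# Sketch: statistical factor model on simulated returns
set.seed(1)
n.obs    <- 36      # e.g., three years of monthly returns
n.stocks <- 50
k        <- 3       # Number of statistical factors to keep
B.true <- matrix(rnorm(n.stocks*k), nc=k)    # True exposures
f      <- matrix(rnorm(n.obs*k),    nc=k)    # Factor returns
x      <- f %*% t(B.true) + .5 * matrix(rnorm(n.obs*n.stocks), nc=n.stocks)

V.sample <- var(x)              # Very noisy: ~ n.stocks^2/2 parameters
e <- eigen(V.sample, symmetric=TRUE)
B <- e$vectors[,1:k] %*% diag(sqrt(e$values[1:k]))   # Estimated exposures
V.factor <- B %*% t(B)
diag(V.factor) <- diag(V.sample)   # Idiosyncratic variances on the diagonal
# Far fewer parameters to estimate: n.stocks*k exposures + n.stocks variances
c( n.stocks*k + n.stocks, n.stocks*(n.stocks+1)/2 )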
+ Functional Principal Component Analysis (fPCA) When the data you are modeling are functions (for instance, the evolution over time of some quantity), you might realize that the principal components are noisy and try to smooth them. Actually, this is not the optimal thing to do -- by smoothing the principal components, you run the risk of smoothing away the information they were supposed to contain. Instead, you can put the smoothing inside the PCA algorithm. The first step is usually to express the functions to be studied in some basis -- by doing so, you already smooth them, but you should be careful not to discard the features of interest. Using functions instead of series of points allows you to have a different number of points, a different sampling frequency for each curve -- or even, irregular sampling. Using a basis of functions brings the problem back in the finite-dimensional realm and reduces it to a linear algebra problem. In the second step, one computes the mean of all those curves, usually using the same basis. TODO: Do we add a smoothing penalty for the mean? TODO: Formula... In the third step, one actually performs the principal component analysis. TODO: Explain how to introduce the smoothing penalties TODO: Formulas TODO: Implementation? A few problems might occur. For instance, the functions to estimate might have some very important properties: for instance, we might know that they have to be increasing, in spite of error measurements. We can enforce such a requirement by parametrizing the function via a differential equation. For instance, to get an increasing function H (say, height, when studying child growth), one just has to find a function w (without any restriction) such that H'' = w H -- w can be interpreted as the "relative acceleration". Another problem is that the features in the various curves to study are not aligned: they occur in the same order, but with different amplitudes and locations. In that situation, the mean curve can fail to exhibit the very features we want to study -- they are averaged out. To palliate this problem, one can "align" the curves, by reparametrizing the curves (i.e., by considering an alternate time: clock time versus physiological time in biology, clock time versus market time or transaction time in finance, etc.). This is called "registration" or "(dynamic) time warping". TODO: More details on time warping. TODO: 1. Write the functions to study in some basis 2. Smoothed mean 3. Smoothed PCA For more details, check: Applied Functional Data Analysis, Methods and Case Studies J.O. Ramsay and B.W. Silverman, Springer Verlag (2002) http://www.stats.ox.ac.uk/~silverma/fdacasebook/ TODO: An online reference? + Varimax rotation The components you get as the result of a principal component analysis are not always directly interpretable: even if all the relevant information is in the first two principal components, they are not as "pure" as you would like. But by rotating them, they can be easier to interpret. Since you use the amplitude of the loadings of the PCA to interpret the principal components, you can try to simplify them, as follows: find a rotation (not a rotation of the whole space, but only a rotation of the subspace spanned by the first components) that maximizes the variance of the loadings (contrary to PCA, you do not consider the components one at a time, but all at the same time). This will actually try to make the loadings either very large or very small, easing the interpretation of the components. 
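A minimal sketch with the varimax() function from base R (simulated data; scaling the rotation matrix by the standard deviations to obtain "loadings" is one common convention, not the only one):

set.seed(1)
x <- matrix(rnorm(100*6), nc=6)
x[,1:3] <- x[,1:3] + rnorm(100)     # Two blocks of correlated variables
x[,4:6] <- x[,4:6] + rnorm(100)
p <- prcomp(x, scale.=TRUE)
L <- p$rotation[,1:2] %*% diag(p$sdev[1:2])   # Loadings of the first two components
varimax(L)$loadings   # Rotated loadings, pushed towards 0 or +-1
varimax(L)$rotmat     # The rotation applied to the two components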
?varimax ?promax + Tensor approximation When you apply Principal Component Analysis (PCA) to the pixels of a series of images, you forget the 2-dimensional structure and hope that the PCA will somehow recover it, by itself. Instead, one can try to retain that information by representing the data as matrices. This can be generalized to higher-dimensional data with higher dimensional arrays (learned people may call those arrays "tensors"). Tensor algebra, motivations: - Taylor expansion f(x+h) = f(x) + f'(x).h + f''(x).h.h + ... where f(x) is a number f'(x) is a 1-form: it transforms a vector h into a number f'(x).h f''(x) is a symetric 2-form: it takes a two vectors h and k and turns them into a number f''(x).h.k - Higher order statistics (HOS) (e.g., the autocorrelation of x^2 is a fourth moment). - [For category-theorists] The tensor algebra TV of a vector space V is obtained from the adjoint of the forgetful functor Alg --> Vect from the category of R-algebras to that of R-vector spaces. Similarly, one can define the symetric tensor algebra SV from the category of commutative R-algebras. The alternating algebra AV is often defined as AV=TV/SV. - In mathematical physics, a tensor is a section of some vector bundle built from the tangent buldle of a manifold (if you do not know about manifolds, read "The large scale structure of space-time" by S. Hawking et al.). For instance: - a section of the tangent bundle TX --> X is a field of tangent vectors; - a section of the cotangent bundle (the dual of the tangent space) TX' --> X is a (field of) 1-forms; - a section of the maximum alternating power of the cotangent bundle is an n-form, that you can use to integrate a (real-valued) function on the manifold. (You can also consider sections of other vector bundles, e.g., the Levi-Civita connection -- some people call those "holors" -- see "Theory of Holors : A Generalization of Tensors", P. Moon and E. Spencer) Tentative implementation: # From "Out-of-code tensor approximation of # multi-dimensional matrices of visual data" # H. Wang et al., Siggraph 2005 # http://vision.ai.uiuc.edu/~wanghc/research/siggraph05.html # http://vision.ai.uiuc.edu/~wanghc/papers/siggraph05_tensor_hongcheng.pdf mode.n.vector <- function (A, n) { # aka "unfolding" stopifnot( n == floor(n), n >= 1, n <= length(dim(A)) ) res <- apply(A, n, as.vector) if (is.vector(res)) { res <- matrix(res, nr=dim(A)[n], nc=prod(dim(A))/dim(A)[n] ) } else { res <- t(res) } stopifnot( dim(res) == c( dim(A)[n], prod(dim(A))/dim(A)[n] ) ) res } n.rank <- function (A, n) { sum( svd( mode.n.vector(A, n) )$d >= 1000 * .Machine$double.eps ) } n.ranks <- function (A) { n <- length(dim(A)) res <- integer(n) for (i in 1:n) { res[i] <- n.rank(A, i) } res } # The norm is the same as usual: sqrt(sum(A^2)) n.mode.svd <- function (A) { res <- list() for (n in 1:length(dim(A))) { res[[n]] <- svd( mode.n.vector(A, n) ) } res } product <- function (A, B, i) { # Multiply the ith dimension of the tensor A with the # columns of the matrix B stopifnot( is.array(A), i == floor(i), length(i) == 1, i >= 1, i <= length(dim(A)), is.matrix(B), dim(A)[i] == dim(B)[1] ) res <- array( apply(A, i, as.vector) %*% B, dim = c(dim(A)[-i], dim(B)[2]) ) N <- length(dim(A)) ind <- 1:N ind[i] <- N ind[ 1:N > i ] <- ind[ 1:N > i ] - 1 res <- aperm(res, ind) stopifnot( dim(res)[-i] == dim(A)[-i], dim(res)[i] == dim(B)[2] ) res } # A few tests... 
A <- array(1:8, dim=c(2,2,2)) mode.n.vector(A, 1) mode.n.vector(A, 2) mode.n.vector(A, 3) n.rank(A, 1) n.rank(A, 2) n.rank(A, 3) n.ranks(A) B1 <- matrix(round(10*rnorm(6)), nr=2, nc=3) B2 <- matrix(round(10*rnorm(12)), nr=3, nc=4) all( product(B1, B2, 2) == B1 %*% B2 ) # The main function... tensor.approximation <- function (A, R) { stopifnot( is.array(A), length(dim(A)) == length(R), is.vector(R), R == floor(R), R >= 0, R <= dim(A), R <= n.ranks(A), R <= prod(R)/R # If all the elements of R # are equal, this is fine. ) N <- length(R) I <- dim(A) U <- list() for (i in 1:N) { U[[i]] <- matrix( rnorm(I[i] * R[i]), nr = I[i], nc = R[i] ) } finished <- FALSE BB <- NULL while (!finished) { cat("Iteration\n") Utilde <- list() for (i in 1:N) { Utilde[[i]] <- A for (j in 1:N) { if (i != j) { Utilde[[i]] <- product( Utilde[[i]], U[[j]], j) } cat(" Utilde[[", i, "]]: ", paste(dim(Utilde[[i]]),collapse=","), "\n", sep="") } stopifnot( length(dim(Utilde[[i]])) == N, dim(Utilde[[i]])[i] == dim(A)[i], dim(Utilde[[i]])[-i] == R[-i] ) Utilde[[i]] <- mode.n.vector( Utilde[[i]], i ) cat(" Utilde[[", i, "]]: ", paste(dim(Utilde[[i]]),collapse=","), "\n", sep="") } for (i in 1:N) { U[[i]] <- svd( Utilde[[i]] ) $ u [ , 1: R[i], drop=FALSE ] #U[[i]] <- svd( Utilde[[i]] ) $ v stopifnot( dim(U[[i]]) == c(I[i], R[i]) ) } B <- A for (j in 1:N) { B <- product( B, U[[j]], j ) } res <- B for (j in 1:N) { res <- product( res, t(U[[j]]), j ) } cat("Approximating A:", sum((res - A)^2), "\n") if (!is.null(BB)) { eps <- sum( (B - BB)^2 ) cat("Improvement on B:", eps, "\n") finished <- eps <= 1e-6 } BB <- B } res <- B for (j in 1:N) { res <- product( res, t(U[[j]]), j ) } attr(res, "B") <- B attr(res, "U") <- U attr(res, "squared.error") <- sum( (A-res)^2 ) res } A <- array(1:24, dim=c(4,3,2)) x <- tensor.approximation(A, c(1,1,1)) x <- tensor.approximation(A, c(2,2,2)) * Distance-based methods The problem is now the following: given a set of points in an n-dimensional space, find their coordinates kwnowing only their distances. Usually, you do not know the space dimension either. + PCO (Distance Analysis) This is very similar to Principal Component Analysis (PCA), with a few differences, though: PCA looks for relations between the variables while PCO focuses on similarities between the observations; you can interpret the PCA axes with the variables while you cannot with PCO (indeed, there are no variables). As PCA, PCO is linear and the computations are the same: in PCA, we start with the variance-covariance matrix, i.e., a matrix whose entries are sums of squares; in distance analysis, we consider a matrix of distances or a distance of squares of matrices, i.e., from the pythagorean theorem, a matrix of sums of squares. In other words, PCO is just PCA without its first step: the computation of the variance-covariance (or "squared distances") matrix. ?cmdscale library(MASS) + MDS (Multi-Dimensional Scaling) MDS is a non-linear equivalent of Principal Coordinate Analysis (PCO). We have a set of distances between points and we try to assign coordinates to those points so as to minimize a "stress function" -- for instance, the sum of squares between the distances we have and the distances obtained from the coordinates. There are several R functions to perform those computations. 
?cmdscale Principal coordinate analysis (metric, linear) library(MASS) ?sammon (metric, non-linear) ?isoMDS (non-metric) There was a comparison of them in RNews: http://cran.r-project.org/doc/Rnews/Rnews_2003-3.pdf + xgvis library(xgobi) ?xgvis TODO: speak a little more of xgvis xgvis slides and tutorial: http://industry.ebi.ac.uk/%257Ealan/VisWorkshop99/XGvis_Talk/sld001.htm http://industry.ebi.ac.uk/%257Ealan/VisWorkshop99/XGvis_Tutorial/index.html (restart the MDS optimization with a different starting point to avoid local minima ("scramble" button)) (one may change the parameters (euclidian/L1/Linfty distance, removal of outliers, weights, etc.) (one may ask for a 3D or 4D layout and the rotate it in xgobi) (restart the MDS optimization with only a subset of the data -- interactivity) What is a Shepard diagram ??? (change the power transformation) MDS to get a picture of protein similarity MDS warning: The global shape of MDS configurations is determined by the large dissimilarities; consequently, small distances should be in terpreted with caution: they may not reflect small dissimilarities. + Applications The applications are manifold, for instance: 1. Dimension reduction: we start with a set of points in a high-dimensional space, compute their pairwise distances, and try to put the points in a space of smaller dimension while retaining as much as possible the information ("proximity") present in the distance matrix. Thus, we can see MDS as a non-linear analogue of Principal Component Analysis (PCA). %G n <- 100 v <- .1 # (Almost) planar data x <- rnorm(n) y <- x*x + v*rnorm(n) z <- v*rnorm(n) d <- cbind(x,y,z) # A rotation random.rotation.matrix <- function (n=3) { m <- NULL for (i in 1:n) { x <- rnorm(n) x <- x / sqrt(sum(x*x)) y <- rep(0,n) if (i>1) for (j in 1:(i-1)) { y <- y + sum( x * m[,j] ) * m[,j] } x <- x - y x <- x / sqrt(sum(x*x)) m <- cbind(m, x) } m } m <- random.rotation.matrix(3) d <- t( m %*% t(d) ) pairs(d) title(main="The points lie in a plane") %-- %G pairs(cmdscale(dist(d),3)) title(main="MDS") %-- %G pairs(princomp(d)$scores) title(main="Principal Component Analysis") %-- 2. Study of data with no coordinates, for instance protein similarity: we can tell when two proteins are similar, we can quantify this similarity -- but we have no coordinates, we cannot think of a simple vector space whose points could be interpreted as proteins. MDS can help build such a space and graphically represent the proximity relations among a set of proteins. The same process also appears in psychology: we can ask test subjects to identify objects (faces, Morse-coded letters, etc.) and count the number of confusions between two objects. This measures the similarity of these objects (a subjective similarity, that stems from their representation in the human mind); MDS can assign coordinates to those objects and represent those similarities in a graphical way. Those misclassifications also appear in statistics: we can use MDS to study forecast errors of a given statistical algorithm (logistic regression, neural nets, etc.) when trying to predict a qualitative variable from quantitative variables. TODO: speak about MDS when I tackle those algorithms... 3. MDS can also graphically represent variables (and not observations), after estimating the "distances" between variables from their correlations. TODO: such a plot when I speak about variable selection... 4. 
We can also use MDS to plot graphs (the combinatorial objects we study in graph theory -- see the "graph" package), described by theiir vertices and edges. Alternatively, you can use GraphViz, http://www.graphviz.org/ outside R or from R -- see the "sem" and "Rgraphviz" libraries. TODO 5. Reconstruct a molecule, especially its shape, from the distance between its atoms. For more details, see http://www.research.att.com/areas/stat/xgobi/papers/xgvis.ps.gz + Minimum Spanning Tree (MST) 1. Put the points into 4 or 5 classes (e.g., by hierarchical classification, or with the k-means algorithm). 2. Take the median point of those classes. 3. Is the Minimum Spanning Tree (MST) of those points ramified? On our examples: %G # Data n <- 200 # Number of patients, number of columns k <- 10 # Dimension of the ambient space nb.points <- 5 p <- matrix( 5*rnorm(nb.points*k), nr=k ) barycentre <- function (x, n) { # Add number between the values of x in order to get a length n vector i <- seq(1,length(x)-.001,length=n) j <- floor(i) l <- i-j (1-l)*x[j] + l*x[j+1] } m <- apply(p, 1, barycentre, n) data.broken.line <- t(m) data.noisy.broken.line <- data.broken.line + rnorm(length(data.broken.line)) library(splines) barycentre2 <- function (y,n) { m <- length(y) x <- 1:m r <- interpSpline(x,y) #r <- lm( y ~ bs(x, knots=m) ) predict(r, seq(1,m,length=n))$y } data.curve <- apply(p, 1, barycentre2, n) data.curve <- t(data.curve) data.noisy.curve <- data.curve + rnorm(length(data.curve)) data.real <- read.table("Tla_z.txt", sep=",") r <- prcomp(t(data.real)) data.real.3d <- r$x[,1:3] library(cluster) library(ape) mst.of.classification <- function (x, k=6, ...) { x <- t(x) x <- t( t(x) - apply(x,2,mean) ) r <- prcomp(x) y <- r$x u <- r$rotation r <- kmeans(y,k) z <- r$centers m <- mst(dist(z)) plot(y[,1:2], ...) 
points(z[,1:2], col='red', pch=15) w <- which(m!=0) i <- as.vector(row(m))[w] j <- as.vector(col(m))[w] segments( z[i,1], z[i,2], z[j,1], z[j,2], col='red' ) } set.seed(1) mst.of.classification(data.curve, 6) %-- %G mst.of.classification(data.curve, 6) %-- %G op <- par(mfrow=c(2,2),mar=.1+c(0,0,0,0)) for (k in c(4,6,10,15)) { mst.of.classification(data.curve, k, axes=F) box() } par(op) %-- %G op <- par(mfrow=c(2,2),mar=.1+c(0,0,0,0)) for (k in c(4,6,10,15)) { mst.of.classification(data.noisy.curve, k, axes=F) box() } par(op) %-- %G op <- par(mfrow=c(2,2),mar=.1+c(0,0,0,0)) for (k in c(4,6,10,15)) { mst.of.classification(data.broken.line, k, axes=F) box() } par(op) %-- %G op <- par(mfrow=c(2,2),mar=.1+c(0,0,0,0)) for (k in c(4,6,10,15)) { mst.of.classification(data.noisy.broken.line, k, axes=F) box() } par(op) %-- With real data, it does not work that well: %G op <- par(mfrow=c(2,2),mar=.1+c(0,0,0,0)) for (k in c(4,6,10,15)) { mst.of.classification(data.real, k, axes=F) box() } par(op) %-- Details: %G op <- par(mfrow=c(3,3),mar=.1+c(0,0,0,0)) for (k in c(4:6)) { for (i in 1:3) { mst.of.classification(data.real, k, axes=F) box() } } par(op) %-- %G op <- par(mfrow=c(3,3),mar=.1+c(0,0,0,0)) for (k in c(7:9)) { for (i in 1:3) { mst.of.classification(data.real, k, axes=F) box() } } par(op) %-- %G op <- par(mfrow=c(3,3),mar=.1+c(0,0,0,0)) for (k in c(10:12)) { for (i in 1:3) { mst.of.classification(data.real, k, axes=F) box() } } par(op) %-- %G op <- par(mfrow=c(3,5),mar=.1+c(0,0,0,0)) for (k in c(13:15)) { for (i in 1:3) { mst.of.classification(data.real, k, axes=F) box() } } par(op) %-- + Minimum Spanning Tree (MST) %G library(ape) my.plot.mst <- function (d) { r <- mst(dist(t(d))) d <- prcomp(t(d))$x[,1:2] plot(d) n <- dim(r)[1] w <- which(r!=0) i <- as.vector(row(r))[w] j <- as.vector(col(r))[w] segments( d[i,1], d[i,2], d[j,1], d[j,2], col='red' ) } my.plot.mst(data.broken.line) %-- %G my.plot.mst(data.noisy.broken.line) %-- %G my.plot.mst(data.curve) %-- %G my.plot.mst(data.noisy.curve) %-- %G my.plot.mst(data.real) %-- %G my.plot.mst(t(data.real.3d)) %-- Pruned MST %G # Gives the list of the oaths from the branching nodes to the leaves chemins.vers.les.feuilles <- function (G) { nodes <- which(apply(G,2,sum)>2) leaves <- which(apply(G,2,sum)==1) res <- list() for (a in nodes) { for (b in which(G[a,]>0)) { if (! 
b %in% nodes) { res <- append(res,list(c(a,b))) } } } chemins.vers.les.feuilles.suite(G, nodes, leaves, res) } # Last coordinate of a vector end1 <- function (x) { n <- length(x) x[n] } # Last two coordinates of a vector end2 <- function (x) { n <- length(x) x[c(n-1,n)] } chemins.vers.les.feuilles.suite <- function (G, nodes, leaves, res) { new <- list() done <- T for (ch in res) { if( end1(ch) %in% nodes ) { # Pass } else if ( end1(ch) %in% leaves ) { new <- append(new, list(ch)) } else { done <- F a <- end2(ch)[1] b <- end2(ch)[2] for (x in which(G[b,]>0)) { if( x != a ){ new <- append(new, list(c( ch, x ))) } } } } if(done) { return(new) } else { return(chemins.vers.les.feuilles.suite(G,nodes,leaves,new)) } } G <- matrix(c(0,1,0,0, 1,0,1,1, 0,1,0,0, 0,1,0,0), nr=4) chemins.vers.les.feuilles(G) # Compute the length of a path longueur.chemin <- function (chemin, d) { d <- as.matrix(d) n <- length(chemin) i <- chemin[ 1:(n-1) ] j <- chemin[ 2:n ] if (n==2) { d[i,j] } else { sum(diag(d[i,][,j])) } } G <- matrix(c(0,1,0,0, 1,0,1,1, 0,1,0,0, 0,1,0,0), nr=4) d <- matrix(c(0,2,4,3, 2,0,2,1, 4,2,0,3, 3,1,3,0), nr=4) chemins <- chemins.vers.les.feuilles(G) chemins l <- sapply(chemins, longueur.chemin, d) l stopifnot( l == c(2,2,1) ) elague <- function (G0, d0) { d0 <- as.matrix(d0) G <- G0 d <- d0 res <- 1:dim(d)[1] chemins <- chemins.vers.les.feuilles(G) while (length(chemins)>0) { longueurs <- sapply(chemins, longueur.chemin, d) # Number of the shortest path i <- which( longueurs == min(longueurs) )[1] cat(paste("Removing", paste(res[chemins[[i]]],collapse=' '), "length =", longueurs[i],"\n")) # Nodes to remove j <- chemins[[i]] [-1] res <- res[-j] G <- G[-j,][,-j] d <- d[-j,][,-j] cat(paste("Removing", paste(j), "\n" )) cat(paste("", paste(res, collapse=' '), "\n")) chemins <- chemins.vers.les.feuilles(G) } res } library(ape) my.plot.mst <- function (x0) { cat("Plotting the points\n") x <- prcomp(t(x0))$x[,1:2] plot(x) cat("Computing the distance matrix\n") d <- dist(t(x0)) cat("Computing the MST (Minimum Spanning Tree)\n") G <- mst(d) cat("Plotting the MST\n") n <- dim(G)[1] w <- which(G!=0) i <- as.vector(row(G))[w] j <- as.vector(col(G))[w] segments( x[i,1], x[i,2], x[j,1], x[j,2], col='red' ) cat("Pruning the tree\n") k <- elague(G,d) cat("Plotting the pruned tree\n") x <- x[k,] G <- G[k,][,k] n <- dim(G)[1] w <- which(G!=0) i <- as.vector(row(G))[w] j <- as.vector(col(G))[w] segments( x[i,1], x[i,2], x[j,1], x[j,2], col='red', lwd=3 ) } my.plot.mst(data.noisy.broken.line) %-- %G my.plot.mst(data.noisy.curve) %-- %G my.plot.mst(data.real) %-- %G my.plot.mst(t(data.real.3d)) %-- Remark: in image analysis, we sometimes us a simplification of this algorithm (still called "pruning" to get rid of the ramifications in the skeleton of an image: we just gnaw two of three segments from each leaf. TODO: a reference + Isomap TODO: reference TODO: explain Let us first construct the graph. %G # k: each point is linked to its k nearest neighbors # eps: each point is linked to all its neighbors within a radius eps isomap.incidence.matrix <- function (d, eps=NA, k=NA) { stopifnot(xor( is.na(eps), is.na(k) )) d <- as.matrix(d) if(!is.na(eps)) { im <- d <= eps } else { im <- apply(d,1,rank) <= k+1 diag(im) <- F } im | t(im) } plot.graph <- function (im,x,y=NULL, ...) { if(is.null(y)) { y <- x[,2] x <- x[,1] } plot(x,y, ...) 
  k <- which( as.vector(im) )
  i <- as.vector(col(im))[ k ]
  j <- as.vector(row(im))[ k ]
  segments( x[i], y[i], x[j], y[j], col='red' )
}
d <- dist(t(data.noisy.curve))
r <- princomp(t(data.noisy.curve))
x <- r$scores[,1]
y <- r$scores[,2]
plot.graph(isomap.incidence.matrix(d, k=5), x, y)
%--

So far, there is a problem: the resulting graph need not be connected -- this is a problem if we want to compute distances...

%G
plot.graph(isomap.incidence.matrix(d, eps=quantile(as.vector(d), .05)), x, y)
%--

The graph is connected as soon as it contains the Minimum Spanning Tree (MST): thus, we shall just add the edges of this MST that are not already there.

%G
isomap.incidence.matrix <- function (d, eps=NA, k=NA) {
  stopifnot(xor( is.na(eps), is.na(k) ))
  d <- as.matrix(d)
  if(!is.na(eps)) {
    im <- d <= eps
  } else {
    im <- apply(d,1,rank) <= k+1
    diag(im) <- F
  }
  im | t(im) | mst(d)
}
plot.graph(isomap.incidence.matrix(d, eps=quantile(as.vector(d), .05)), x, y)
%--

TODO: distance in this graph. It is a classical problem: shortest path in a graph.

%G
r <- NULL
while(is.null(r)) {
  x2 <- my.random.data.2()
  try(r <- my.acm(x2))
}
%--

Let us compare our result with that of the "mca" command.

%G
library(MASS)
plot(mca(x1))
%--

%G
plot(mca(x2))
%--

At first, I had forgotten to remove the 1 eigenvalue, and I got this.

%G
my.acm(x1, garder.un=T)
%--

%G
my.acm(x2, garder.un=T)
%--

+ Mixing quantitative and qualitative variables (it does not work)

(What follows are a few mindless tests -- none actually worked.)

One could try to analyse a set of qualitative and quantitative variables by mixing Principal Component Analysis (PCA) and Multiple Correspondence Analysis (MCA): we just put the two tables side by side and concatenate the weight vectors of the columns. Let us look at the results.

%G
tableau.disjonctif.vecteur <- function (x) {
  if( is.factor(x) ){
    y <- matrix(0, nr=length(x), nc=length(levels(x)))
    for (i in 1:length(x)) {
      y[i, as.numeric(x[i])] <- 1
    }
    return(y)
  } else
    return(x)
}
tableau.disjonctif <- function (x) {
  if( is.vector(x) )
    y <- tableau.disjonctif.vecteur(x)
  else {
    y <- NULL
    y.names <- NULL
    for (i in 1:length(x)) {
      y <- cbind(y, tableau.disjonctif.vecteur(x[,i]))
      if( is.factor(x[,i]) )
        y.names <- c( y.names, paste(names(x)[i], levels(x[,i]), sep='') )
      else
        y.names <- c(y.names, names(x)[i])
    }
  }
  colnames(y) <- y.names
  y
}
my.am <- function (x) {
  y <- tableau.disjonctif(x)
  # Number of observations
  n <- dim(y)[1]
  # Number of variables
  s <- length(x)
  # Number of columns of the disjunctive table
  p <- dim(y)[2]
  # The matrix and the weights
  # Each variable weight is 1, but for the qualitative variables
  # that are "split" into several "subvariables", it is the sum of
  # those "subvariable" weights that is 1.
  Dp <- diag(t(y)%*%y) / n
  j <- 1
  for(i in 1:s){
    if( is.factor(x[,i]) )
      j <- j + length(levels(x[,i]))
    else {
      Dp[j] <- 1
      j <- j+1
    }
  }
  Dn <- rep(1,n)
  # It is not a good idea to center the variables...
  f <- ne.pas.centrer(y)
  # We perform the analysis...
  e1 <- eigen( t(f) %*% diag(1/Dn) %*% f %*% diag(1/Dp) )
  e2 <- eigen( f %*% diag(1/Dp) %*% t(f) %*% diag(1/Dn) )
  variables <- t(e2$vectors) %*% f
  individus <- t(e1$vectors) %*% t(f)
  variables <- t(variables)
  individus <- t(individus)
  valeurs.propres <- e1$values
  # Sometimes, because of rounding errors, the matrix becomes
  # non-diagonalizable in R. We then end up with complex eigenvalues
  # and (worse) complex eigenvectors.
That is why it is a better # idea to use the SVD, that always yields real results. if( any(Im(variables)!=0) | any(Im(individus)!=0) | any(Im(valeurs.propres)!=0) ){ warning("Matrix not diagonalizable on R!!!") variables <- Re(variables) individus <- Re(individus) valeurs.propres <- Re(valeurs.propres) } plot( jitter(individus[,2], factor=5) ~ jitter(individus[,1], factor=5), xlab='', ylab='', frame.plot=F ) par(new=T) plot( variables[,1:2], col='red', axes=FALSE, xlab='', ylab='', type='n' ) text( variables[,1:2], colnames(y), col='red') axis(3, col='red') axis(4, col='red') col = rainbow(s) j <- 1 for (i in 1:s) { if( is.factor(x[,i]) ){ jj <- j + length(levels(x[,i])) - 1 print( paste(j,jj) ) lines( variables[j:jj,1], variables[j:jj,2], col=col[i] ) j <- jj+1 } else { arrows(0,0,variables[j,1],variables[j,2],col=col[i]) j <- j+1 } } } ne.pas.centrer <- function (y) { y } n <- 500 m <- 3 p <- 2 l <- c(3,2,5) x <- NULL for (i in 1:m) { v <- factor( sample(1:l[i], n, replace=T), levels=1:l[i] ) if( is.null(x) ) x <- data.frame(v) else x <- data.frame(x, v) } x <- cbind( x, matrix( rnorm(n*p), nr=n, nc=p ) ) names(x) <- LETTERS[1:(m+p)] x1 <- x my.am(x1) %-- If the variables are dependant, we can see that the qualitative variables are linked (as with multiple correspondance analysis), but there is nothing obvious with the quantitative variable... %G n <- 500 m <- 3 p <- 2 l <- c(3,2,5) x <- NULL for (i in 1:m) { v <- factor( sample(1:l[i], n, replace=T), levels=1:l[i] ) if( is.null(x) ) x <- data.frame(v) else x <- data.frame(x, v) } x <- cbind( x, matrix( rnorm(n*p), nr=n, nc=p ) ) names(x) <- LETTERS[1:(m+p)] x[,3] <- factor( ifelse( runif(n)>.8, x[,1], x[,3] ), levels=1:l[1]) x[,5] <- scale( ifelse( runif(n)>.9, as.numeric(x[,1]), as.numeric(x[,2]) )) + .1*rnorm(n) x2 <- x my.am(x2) %-- If we center the matrix, the results are meaningless (however, we can center the columns for the quantitative variables). %G ne.pas.centrer <- function (y) { centre <- apply(y, 2, mean) y - matrix(centre, nr=n, nc=dim(y)[2], byrow=T) } my.am(x1) %-- %G my.am(x2) %-- + Mixing qualitative and quantitative variables (a solution) To plot, on the same graph, qualitative and quantitative variables, a simple solution is to turn the quantitative variables into qualitative variables. People will tell you that this is always a bad idea: you lose "some" information. And indeed, there is not much to be seen: this is rather unhelpful... %G to.factor.numeric.vector <- function (x, number) { resultat <- NULL intervalles <- co.intervals(x,number,overlap=0) for (i in 1:number) { if( i==1 ) intervalles[i,1] = min(x) else intervalles[i,1] <- intervalles[i-1,2] if( i==number ) intervalles[i,2] <- max(x) } for (valeur in x) { r <- NA for (i in 1:number) { if( valeur >= intervalles[i,1] & valeur <= intervalles[i,2] ) r <- i } resultat <- append(resultat, r) } factor(resultat, levels=1:number) } to.factor.vector <- function (x, number) { if( is.factor(x) ) return(x) else return(to.factor.numeric.vector(x,number)) } to.factor <- function (x, number=4 ) { y <- NULL for (a in x) { aa <- to.factor.vector(a, number) if( is.null(y) ) y <- data.frame(aa) else y <- data.frame(y,aa) } names(y) <- names(x) y } my.am(to.factor(x1)) %-- Let us cut the quantitative variables into three pieces instead of four. %G my.am(to.factor(x1, number=3)) %-- With dependant variables. %G my.am(to.factor(x2)) %-- %G my.am(to.factor(x2, number=3)) %-- + TODO Check the "ade4" package. 
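An untested sketch: the "ade4" package has a dudi.mix() function designed for data frames mixing quantitative variables and factors (the argument names below follow the usual dudi.* conventions and are assumptions on my part), which could be tried on the x2 data frame built above.

library(ade4)
r <- dudi.mix(x2, scannf=FALSE, nf=2)   # x2: the mixed data frame above
scatter(r)                              # Rows and columns on the same plot
r$eig                                   # Eigenvalues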
* Log-linear model (Poisson Regression)

The log-linear model has a role similar to that of multiple correspondance analysis -- with the added advantage that, as with any regression, we can perform tests. We start with a contingency table t_{i,j} and we try to write

  log t_{i,j} = a_0 + a_1(i) + a_2(j) + a_3(i,j).

This can be immediately generalized to higher dimensions (but the notations become unreadable); usually we only consider interactions of two or three variables, not more, otherwise we would have too many parameters for the number of observations at hand. With R, it is simply the "glm" function, with the "family=poisson" argument.

m <- data.frame(table(random.data.1()))
y <- m[,4]
x1 <- m[,1]
x2 <- m[,2]
x3 <- m[,3]
summary(glm( y ~ x1 + x2 + x3, family=poisson ))

It yields:

> summary(glm( y ~ x1 + x2 + x3, family=poisson ))
Call:
glm(formula = y ~ x1 + x2 + x3, family = poisson)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.6060  -0.8723  -0.2724   0.4264   3.2634
Coefficients:
             Estimate Std. Error  z value Pr(>|z|)
(Intercept) 9.981e-01  2.965e-01    3.366 0.000762 ***
x12         3.314e-01  2.476e-01    1.338 0.180799
x13         1.643e-01  2.568e-01    0.640 0.522302
x22        -4.000e-02  1.999e-01   -0.200 0.841418
x32         5.129e-02  3.203e-01    0.160 0.872784
x33         5.057e-09  3.244e-01 1.56e-08 1.000000
x34         1.001e-01  3.163e-01    0.316 0.751674
x35         1.001e-01  3.166e-01    0.316 0.751912
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 38.794  on 29  degrees of freedom
Residual deviance: 36.735  on 22  degrees of freedom
AIC: 139.11
Number of Fisher Scoring iterations: 4

TODO: understand this result
TODO: explain how to add/remove some interaction terms

> r1 <- glm( y ~ x1 + x2 + x3 + x1*x2 + x2*x3 + x1*x3, family=poisson )
> r2 <- glm( y ~ x1 + x2 + x3, family=poisson )
> anova(r1,r2)
Analysis of Deviance Table
Model 1: y ~ x1 + x2 + x3 + x1 * x2 + x2 * x3 + x1 * x3
Model 2: y ~ x1 + x2 + x3
  Resid. Df Resid. Dev  Df Deviance
1         8     6.9906
2        22    13.1344 -14  -6.1438

* Discriminant Analysis

TODO: Introduction
TODO: This is supervised learning, it should make up a separate chapter.

+ Discriminant Analysis

We try to predict a qualitative variable, such as "has the patient just had a heart attack?", with quantitative variables, corresponding to potential risk factors (blood pressure, daily calorie intake, workout, weight, etc.). We try to determine which factors are the most prevalent, i.e., which variables best determine the qualitative one. We can also use discriminant factor analysis in a predictive way, i.e., to answer the question "will the patient have a heart attack?".

TODO: explain how we do this.

More precisely, the variable to predict clusters the subjects into several classes and we are looking for a linear combination of the predictive variables on which the separation between the clusters is most visible.

TODO: check what follows.
We try to maximize the ratio of the between-group variance of this linear combination to its within-group variance (as in anova).

For a binary variable:
  Normalize the predictive variables.
  Compute their variance-covariance matrix A.
  Compute the vector d of the differences of the group means.
  One can show (?) that the weights w we are looking for satisfy A w = d, i.e., w = A^-1 d.

(It is a good idea to plot the observations -- in dimension higher than two, you can use ggobi to rotate the figure.)
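To make this recipe concrete, here is a minimal sketch on made-up data (the names g, A, d, w are mine): we compute the weights by hand and check that they give the same direction, up to sign and scale, as the one found by the "lda" function.

library(MASS)
set.seed(1)
n <- 100
g <- rep(0:1, each = n)                        # The binary variable to predict
x1 <- rnorm(2*n, mean = 2*g)                   # Two correlated predictors
x2 <- .5*x1 + rnorm(2*n, mean = g)
X <- scale(cbind(x1, x2))                      # Normalize the predictive variables
A <- var(X)                                    # Their variance-covariance matrix
d <- colMeans(X[g==1,]) - colMeans(X[g==0,])   # Differences of the group means
w <- solve(A, d)                               # The weights: solve A w = d
w / sqrt(sum(w^2))
# Same direction (up to sign and scale) as the one found by lda():
r <- lda(X, grouping = factor(g))
r$scaling / sqrt(sum(r$scaling^2))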
Diagnostics:

  Confusion matrix:
    rows:    real values
    columns: predicted values
    entries: number of cases

From the confusion matrix, we can build more than a dozen different measures of the relevance of those results.

http://149.170.199.144/multivar/da4.htm

We can first try to estimate the dimension, for instance with a principal component analysis. One can also perform a stepwise discriminant analysis.

%G

library(MASS)
n <- 100
k <- 5
x1 <- runif(k,-5,5) + rnorm(k*n)
x2 <- runif(k,-5,5) + rnorm(k*n)
x3 <- runif(k,-5,5) + rnorm(k*n)
x4 <- runif(k,-5,5) + rnorm(k*n)
x5 <- runif(k,-5,5) + rnorm(k*n)
y <- factor(rep(1:5,n))
plot(lda(y~x1+x2+x3+x4+x5))

%--

We can choose to retain only the first two dimensions.

%G

plot(lda(y~x1+x2+x3+x4+x5), dimen=2)

%--

Or even, just the first.

%G

op <- par(mar=c(4,4,0,2)+.1)
plot(lda(y~x1))
par(op)

%--

As for Principal Component Analysis, we choose the number of dimensions to retain from the eigen values.

%G

plot(lda(y~x1+x2+x3+x4+x5)$svd, type='h')

%--

Linear discriminant analysis (LDA) can be sensitive to heteroskedasticity problems: if the two groups you are trying to separate have different variances (sizes), the boundary will be shifted towards the more compact one.

+ Quadratic discriminant analysis

The "qda" and "predict.qda" functions play the same role, but in a non-linear (quadratic) setting.

?qda

Also check the "mda" package.

+ Bayesian Discriminant Factor Analysis

It is the same, but we know the prior probabilities P( Y == i ), i.e., the size of the various clusters. (We still use the "lda" function, with one more argument, "prior".)

TODO: Example

* Canonical analysis

+ Canonical Analysis

We consider quantitative variables X1, X2, ..., Xp, Y1, Y2, ..., Yq and we look for a linear combination Xa of the Xi and a linear combination Yb of the Yj so that Xa and Yb be as correlated as possible.

library(ade4)
?cca

See for example part 4 of
http://www.cg.ensmp.fr/%7Evert/talks/021205inria/inria.pdf

+ Co-inertia analysis

Instead of using the correlation, one can use the covariance.

TODO

+ Partial Least Squares

TODO: Explain the relations/differences between PLS and CA.

* Kernel methods

+ Principal Component Analysis and kernel methods

TODO: rewrite the introduction

Methods such as Principal Component Analysis (PCA) can reduce the dimension of the cloud of points but, as they are linear, they will not enable us to "untangle" the data: it will remain curved, it will not become a straight segment. One idea to straighten the curve is to embed our space (which can already be of unreasonably high dimension) into a higher-dimensional space, in a non-linear way. By applying linear methods in this higher-dimensional space, we can hope to see phenomena that were non-linear in the first space. Here is an example of such an embedding:

  (x,y) ---> (x^2, x*y, y^2).

It might sound unreasonable to embed a space whose dimension is already problematic into an even larger one. It would indeed be the case if we needed to compute (and store) the coordinates of the points in this new space -- but actually, we only need to be able to compute scalar products.
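As a small sanity check of that last claim (this example is mine, not part of the original code): for a degree-2 embedding, written with a sqrt(2) factor on the cross term so that the identity is exact, the scalar product of two embedded points equals the square of their ordinary scalar product. It can therefore be computed without ever materializing the embedded coordinates.

# Degree-2 embedding (x,y) -> (x^2, sqrt(2)*x*y, y^2): the sqrt(2) on the
# cross term makes the identity exact.  (Illustration only; the code that
# follows in the text uses its own kernels.)
phi <- function (u) c( u[1]^2, sqrt(2)*u[1]*u[2], u[2]^2 )
u <- c(1, 2)
v <- c(3, -1)
sum( phi(u) * phi(v) )   # Scalar product in the embedded space
sum( u * v )^2           # The same value, computed in the original space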
More precisely, principal component analysis (and other similar techniques) play with the variance-covariance matrix t(m) %*% m In our new space, this becomes: %G # Data n <- 200 # Number of patients, number of columns k <- 10 # Dimension of the ambient space nb.points <- 5 p <- matrix( 5*rnorm(nb.points*k), nr=k ) library(splines) barycentre2 <- function (y,n) { m <- length(y) x <- 1:m r <- interpSpline(x,y) #r <- lm( y ~ bs(x, knots=m) ) predict(r, seq(1,m,length=n))$y } data.curve <- apply(p, 1, barycentre2, n) data.curve <- t(data.curve) pairs(t(data.curve)) %-- %G m <- data.curve mm <- apply(m, 2, function (x) { x %o% x } ) r <- princomp(t(mm)) plot(r) %-- %G pairs(r$scores[,1:5]) %-- Nothing exciting... Let us try with a higher degree. %G # Degree 1 kernel ("noyau" is the French word for "kernel") noyau1 <- function (x,y) { sum(x*y) } m <- data.curve m <- t(t(m) - apply(m,2,mean)) k <- dim(m)[1] wrapper <- function(x, y, my.fun, ...) { sapply(seq(along=x), FUN = function(i) my.fun(x[i], y[i], ...)) } mm <- outer(1:k, 1:k, wrapper, function (i,j) { noyau1(m[,i],m[,j]) }) # Degree 2 kernel noyau2 <- function (x,y) { a <- x*y n <- length(a) i <- gl(n,1,n^2) j <- gl(n,n,n^2) i <- as.numeric(i) j <- as.numeric(j) w <- which( i <= j & j <= k ) i <- i[w] j <- j[w] sum(a[i]*a[j]) } stopifnot( noyau2(1:2,2:1) == 12 ) # Degree 3 kernel noyau3 <- function (x,y) { a <- x*y n <- length(a) i <- gl(n,1,n^3) j <- gl(n,n,n^3) k <- gl(n,n^2,n^3) i <- as.numeric(i) j <- as.numeric(j) k <- as.numeric(k) w <- which( i <= j & j <= k ) i <- i[w] j <- j[w] k <- k[w] sum(a[i]*a[j]*a[k]) } stopifnot( noyau3(1:2,2:1) == 32 ) wrapper <- function(x, y, my.fun, ...) { sapply(seq(along=x), FUN = function(i) my.fun(x[i], y[i], ...)) } k <- dim(m)[1] mm <- outer(1:k, 1:k, FUN=wrapper, my.fun = function (i,j) { noyau3(m[,i],m[,j]) }) r <- princomp(covmat=mm) plot(r) %-- %G #pairs(r$scores[,1:5]) plot((t(m) %*% r$loadings) [,1:2]) %-- %G data.noisy.curve <- data.curve + rnorm(length(data.curve)) k <- dim(data.noisy.curve)[1] mm <- outer(1:k, 1:k, FUN=wrapper, my.fun = function (i,j) { noyau3(data.noisy.curve[,i],data.noisy.curve[,j]) }) r <- princomp(t(data.noisy.curve), covmat=mm) plot(r) %-- %G pairs((t(m) %*% r$loadings) [,1:5]) %-- %G plot((t(m) %*% r$loadings) [,1:2]) lines((t(m) %*% r$loadings) [,1:2], col='red') %-- %G x <- 1:dim(data.noisy.curve)[2] y1 <- (t(m) %*% r$loadings) [,1] y2 <- (t(m) %*% r$loadings) [,2] r1 <- loess(y1~x) r2 <- loess(y2~x) plot(y1,y2) lines(y1,y2, col='red') lines(predict(r1), predict(r2), col='blue', lwd=3) %-- Let us try to combine those kernels. %G noyau <- function (x,y) { noyau1(x,y) + noyau2(x,y) + noyau3(x,y) } mm <- outer(1:k, 1:k, FUN=wrapper, my.fun = function (i,j) { noyau(data.curve[,i],data.curve[,j]) }) r <- princomp(covmat=mm) plot(r) %-- %G pairs((t(m) %*% r$loadings) [,1:5]) %-- Well, in the end it is not very conclusive... + kernlab Actually, there is already a package to perform those computations. TODO: Examples * Neural networks + Introduction TODO... + Perceptron A perceptron is a neural network with no hidden layer and binary output(s). 
%G

perceptron_learn <- function (input, output, max_iterations = 100) {
  stopifnot( is.matrix(input), is.matrix(output),
             is.numeric(input), is.logical(output) )
  stopifnot( dim(input)[2] == dim(output)[2] )
  N <- dim(input)[2]                 # Number of observations
  input <- rbind(input, rep(1, N))   # Biases
  dim_input <- dim(input)[1]
  dim_output <- dim(output)[1]
  W <- matrix(rnorm(dim_input * dim_output),
              nc = dim_input, nr = dim_output)
  hardlim <- function (x) ifelse( x > 0, 1, 0 )
  finished <- FALSE
  remaining_iterations <- max_iterations
  while (! finished) {
    W_previous <- W
    for (i in 1:N) {
      forecast <- hardlim( W %*% input[,i] )
      error <- output[,i] - forecast
      W <- W + error %*% input[,i]
    }
    remaining_iterations <- remaining_iterations - 1
    finished <- remaining_iterations <= 0 |
                sum(abs(W - W_previous)) < 1e-6
  }
  attr(W, "converged") <- all(W == W_previous)
  dimnames(W) <- list(output=NULL, input=NULL)
  class(W) <- "perceptron"
  W
}
perceptron_predict <- function (W, input) {
  input <- as.matrix(input)
  N <- dim(input)[2]
  input <- rbind(input, rep(1, N))   # Biases
  stopifnot( dim(W)[2] == dim(input)[1] )
  W %*% input
}

# Test
k <- 2   # BUG: It does not work with other values...
N <- 50
set.seed(1)
centers <- matrix(rnorm(2*k), nc=2)
output <- sample(1:k, N, replace=T)
input <- t(matrix( rnorm(k*N, mean=centers[output,], sd=.1), nc=2))
output <- t(output == 1)
w <- perceptron_learn(input, output)
plot(t(input), col=1+output, xlab="", ylab="",
     main="Perceptron learning")
abline(-w[3]/w[2], -w[1]/w[2], lwd=3, lty=2)

%--

%%G
# It is not designed to work if the clusters overlap:
# It will fail to converge...
N <- 50
set.seed(2)
centers <- matrix(rnorm(2*k), nc=2)
output <- sample(1:k, N, replace=T)
input <- t(matrix( rnorm(k*N, mean=centers[output,], sd=1), nc=2))
output <- t(output == 1)
w <- perceptron_learn(input, output)
plot(t(input), col = 1 + output, xlab = "", ylab = "",
     main = "Perceptron failing to converge")
abline(-w[3]/w[2], -w[1]/w[2], lwd=3, lty=2)
%--

%G

library(MASS)
z <- output[1,]
x <- input[1,]
y <- input[2,]
r <- lda( z ~ x + y )
n <- 200
x <- rep(seq(-2,2,length=n), each=n)
y <- rep(seq(-2,2,length=n), n)
z <- predict(r, data.frame(x, y))
z <- c(rgb(.7,.7,.7), rgb(1,.7,.7))[ as.numeric(z$class) ]
plot(x, y, col=z, pch=15, xlab="", ylab="",
     main="Non-converging perceptron and LDA")
points(t(input), col = 1 + output)
abline(-w[3]/w[2], -w[1]/w[2], lwd=3, lty=2)

%--

%G

library(e1071)
y <- as.factor(output[1,])
x <- t(input)
r <- svm(x, y)
n <- 200
x <- cbind( rep(seq(-2,2,length=n), each=n),
            rep(seq(-2,2,length=n), n) )
z <- predict(r, x)
z <- c(rgb(.7,.7,.7), rgb(1,.7,.7))[ as.numeric(z) ]
plot(x, col=z, pch=15, xlab="", ylab="",
     main="Non-converging perceptron and SVM")
points(t(input), col = 1 + output)
abline(-w[3]/w[2], -w[1]/w[2], lwd=3, lty=2)

%--

Perceptrons should not be used if the cloud of points is not separable: in this case, prefer logistic regression, linear discriminant analysis (LDA) or support vector machines (SVM).

+ Linear filter neuron

TODO

In the statistical community, this is known as "(least squares) linear regression".

+ Neural networks: TO SORT

Widrow-Hoff learning rule: gradient descent for linear filter neurons -- i.e., for least squares linear regression.

Back-propagation: gradient descent algorithm for multi-layer networks. The tricky point is to compute the gradient (there are too many variables to do this by applying a general method): this is what the back-propagation algorithm provides.
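As an aside, the Widrow-Hoff rule mentioned above is easy to write down explicitly. Here is a minimal sketch on simulated data (w_true, eta and the other names are made up for the example): a single pass of per-observation gradient steps on the squared error already lands close to the least squares coefficients.

set.seed(1)
n <- 1000
x <- cbind(1, rnorm(n), rnorm(n))         # Inputs, with a constant column for the bias
w_true <- c(1, 2, -3)
y <- drop(x %*% w_true) + .1 * rnorm(n)   # Target values
w <- rep(0, 3)                            # Initial weights
eta <- .01                                # Learning rate
for (i in 1:n) {                          # One observation at a time (on-line learning)
  e <- y[i] - sum(w * x[i,])              # Prediction error of the linear neuron
  w <- w + eta * e * x[i,]                # Widrow-Hoff update: gradient step on the squared error
}
w
lm(y ~ x - 1)$coef                        # The least squares solution, for comparison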
Feed-forward neural network: this is the classical 3-layer (i.e., with one hidden layer) neural net, fitted by back-propagation.

Batch training: for each update of the weights, you use all the data.

Incremental training, on-line learning: you only use one observation to update the weights.

Variants of gradient descent:

  Gradient descent with momentum (this increases the convergence speed; it is not too dissimilar to tabu search).

  Variable learning rate: if the previous move did not decrease the error, i.e., if we went too far in the direction of the gradient, decrease the learning rate by 30%; if not, increase it by 5%.

  Resilient back-propagation (Rprop): since sigmoid functions tend to squash their inputs, even very wrong weights can induce a small change in the result -- this leads to a very slow convergence towards the optimal weights. Resilient back-propagation only uses the sign of the derivative. The amplitude of the change is controlled by another parameter, which increases if there have been two successive changes in the same direction and decreases if there have been two successive changes in opposite directions.

  Conjugate gradient: Look in the direction of the (opposite of the) gradient and perform a line search in this direction. The next direction is not the opposite of the gradient but a linear combination of the opposite of the gradient and the previous search direction:

    new direction = - gradient + beta * old direction

  where (Fletcher-Reeves update)

    beta = norm(current gradient)^2 / norm(previous gradient)^2

  or (Polak-Ribiere update)

    beta = < current gradient - previous gradient, current gradient > / norm(previous gradient)^2.

  From time to time (e.g., after as many iterations as there are parameters, or when the angle between two successive gradients becomes too high (Powell-Beale restarts)), reset the direction to the opposite of the gradient.

  Quasi-Newton: The Newton method is very fast, but it requires second derivatives (hessian matrices), which are too expensive to compute for neural networks. Quasi-Newton algorithms try to approximate the hessian without the second-derivative computational overhead.

  BFGS quasi-Newton algorithm: an approximation of the hessian matrix is stored and updated at each iteration, using the gradient. (details?)

  One-step secant quasi-Newton algorithm: idem, but assumes that the previous estimate of the hessian was the identity matrix -- hence, there is no need to store it.

  Levenberg-Marquardt: This is yet another quasi-Newton method:

    x[k+1] = x[k] - (J' J + mu I)^-1 J' error

  where

    x: parameters to be estimated
    J = d error / d weights
    gradient = J' error
    hessian  = J' J (approximately)
    mu = 0:  this is Newton's method, with an approximate hessian
    mu >> 0: this is a gradient method with a small step size

  Reduced-memory Levenberg-Marquardt: the matrix J is huge (its number of rows is the number of training cases). If it does not fit in memory, you can cut it into blocks.

Line search routines (for conjugate gradient methods):

  Quadratic interpolation
  Golden section search
  Brent's search (hybrid of quadratic interpolation and the golden section search)
  Hybrid bisection-cubic search
  Charalambous search
  Backtracking search

Avoiding overfitting:

  Stop the algorithm when the performance on a validation sample starts to decrease (it may sound reasonable, but this is approximately a Bayesian fit with the initial (random) weights as a prior...). And you must be careful not to choose too fast an algorithm...
Choose a network small enough Add a penalty (the sum of the squares of the weights) to the objective function, to shrink the weights towards zero (try to have inputs, outputs and weights of the same order of magnitude, say in [-1,1]). Objective = lambda * Mean Square Error + (1 - lambda) * Mean Square Weight + Neural networks and dimension reduction You can reduce the dimension of a cloud of points with a 5-layer neural network: an input layer (with the data), a (non-linear) compression layer, a (linear) layer representing the data in the low-dimensional space, a (non-linear) uncompression layer, and an output layer (on which we expect to get the initial data back). I do not think that such networks are already implemented in R, so we shall do it ourselves -- we shall be EXTREMELY careful with convergence problems. As I do not really know neural networks (I was superficially interested in the subject 10 years ago), I browse the net to understand them http://www.willamette.edu/%7Egorr/classes/cs449/backprop.html and I write the following code. TODO: start with a smaller example, on which the computations will be faster and for which we know waht kind of result we should expect. For instance, a 2-dimensional cloud in a 3-dimensional space. with only linear neurons and only two neurons in the middle layer. Problem: in this case, the intermediary layers are useless and may unstabilize the algorithm. # x: matrix whose columns are the vectors to learn # n: number of neurons in the middle layer # m: number of neurons in the two other hidden layers # (you should have m>n) drnn <- function (M, n, m, e=.1, N=100) { # Activation fucntions id <- function (t) { t } f1 <- tanh f2 <- id f3 <- tanh f4 <- id # Number of neurons in each layer n0 <- dim(M)[1] n1 <- m n2 <- n n3 <- m n4 <- n0 # The weights -- initialized by random values w1 <- matrix( rnorm(n0*n1), nc=n0, nr=n1 ) w2 <- matrix( rnorm(n1*n2), nc=n1, nr=n2 ) w3 <- matrix( rnorm(n2*n3), nc=n2, nr=n3 ) w4 <- matrix( rnorm(n3*n4), nc=n3, nr=n4 ) # Les biais b1 <- rnorm(n1) b2 <- rnorm(n2) b3 <- rnorm(n3) b4 <- rnorm(n4) # The algorithm (backpropagation) r <- list() err <- rep(0, N) for (i in 1:N) { cat(paste("Iteration", i, "\n")) res <- matrix(NA, nr=n, nc=dim(M)[2]) for (j in 1:dim(M)[2]) { x <- M[,j] # Computing the valu of the nodes y0 <- x y1 <- f1( w1 %*% y0 + b1 ) y2 <- f2( w2 %*% y1 + b2 ) y3 <- f3( w3 %*% y2 + b3 ) y4 <- f4( w4 %*% y3 + b4 ) # Computing the errors d4 <- x - y4 d3 <- (t(w4) %*% d4) * (1 - y3^2) d2 <- (t(w3) %*% d3) * 1 d1 <- (t(w2) %*% d2) * (1 - y1^2) # Updating the weights and the biases dw4 <- d4 %*% t(y3) dw3 <- d3 %*% t(y2) dw2 <- d2 %*% t(y1) dw1 <- d1 %*% t(y0) w4 <- w4 + e * dw4 w3 <- w3 + e * dw3 w2 <- w2 + e * dw2 w1 <- w1 + e * dw1 b4 <- b4 + e * d4 b3 <- b3 + e * d3 b2 <- b2 + e * d2 b1 <- b1 + e * d1 res[,j] <- y2 err[i] <- err[i] + sum(d4^2) } r[[i]] <- res } list(scores=res, errors=err, r=r) } r <- drnn(data.curve, 1, 4, N=30, e=.02) plot(r$err, ylim=range(c(0,r$err)), type='l') %-- i <- order(drop(r$scores)) plot(t(data.curve)[,1:2]) lines(t(data.curve)[,1:2][i,], col='red') %-- Well, it is not exactly what we were expecting. But it is still better than nothing... op <- par(mfrow=c(2,2)) for (k in c(1, 3, 7, 30)) { i <- order(drop(r$r[[k]])) plot(t(data.curve)[,1:2], main=paste("after",k,"iteration(s)")) lines(t(data.curve)[,1:2][i,], col='red') } par(op) %-- Changing the order in which we give the points to the network has no significant effect. 
op <- par(mfrow=c(2,4)) for (i in 1:4) { m <- data.curve[,sample(1:dim(data.curve)[2])] r <- drnn(m, 1, 4, N=30, e=.02) plot(r$err, ylim=range(c(0,r$err)), type='l') i <- order(drop(r$scores)) plot(t(m)[,1:2]) lines(t(m)[,1:2][i,], col='red') } par(op) %-- The plot of the error is VERY worrying: the error is supposed to decrease at each iteration -- it sometimes increases. Do not use this code unless you really understand it and think it is correct. Let us try with two neurons in the middle layer. i <- sample(1:dim(data.curve)[2]) m <- data.curve[,i] r <- drnn(m, 2, 8, N=30, e=.02) plot(r$err, ylim=range(c(0,r$err)), type='l') plot(t(r$scores)[i,], type='l') %-- (Yes, it is supposed to be a curve...) TODO: look more precisely what happens when we change one of the parameters. Each time, display at least 4 plots: the error (that should be decreasing), the first path, the last path. TODO: Check that my implementation is correct. The evolution of the error term suggest there IS a problem. + Neural networks and dimension reduction: examples Neural networks can help increase the coverage and the precision of OS fingerprinting (this could apply to any rules-based classifier): http://actes.sstic.org/SSTIC06/Fingerprinting_par_reseaux_neuronaux/ * Dimension reduction: TODO: Rewrite/Remove this section 0. Data 1. Kernels 2. Neural networks 3. k-means + MST 4. Isomap 5. Optimisation 6. pruned MST 7. TSP 8. SVM + Problem Let us consider a cloud of (40) points in a space of reasonable dimension (10) or of less reasonable dimension (7000). The points are more or less aligned on a curve we are trying to describe. The real situation was the following. With DNA array, we examined the expression of 7000 genes in 40 patients, representing different stages of a single disease. We can see this pattern by a Principal Component Analysis (PCA) (or via other dimension reduction techniques): the points give the impression of forming a curve. In dimension two, the curve may seem singular, it seem to cross itself, but in higher dimension, it looks smooth. With the expression of those genes (all of them or, rather, just a subset of them), we would like to define an "index" indicating the position on the curve and, thus, the current stage of the disease. We shall first describe the data (both simulated data and real data), apply various classical dimension reduction techniques (PCA, PCA after non-linear transformations, neural networks), some less classical ones (based on optimization problems such as the traveling salesman problem) and we shall compare all those methods. + Data Simulated data: broken line. %G n <- 200 # Number of patients, number of columns k <- 10 # Dimension of the ambient space nb.points <- 5 p <- matrix( 5*rnorm(nb.points*k), nr=k ) barycentre <- function (x, n) { # Add number between the values of x in order to get a length n vector i <- seq(1,length(x)-.001,length=n) j <- floor(i) l <- i-j (1-l)*x[j] + l*x[j+1] } m <- apply(p, 1, barycentre, n) data.broken.line <- t(m) pairs(t(data.broken.line)) %-- %G plot(princomp(t(data.broken.line))) %-- %G pairs(princomp(t(data.broken.line))$scores[,1:5]) %-- Simulated data: noisy broken line %G data.noisy.broken.line <- data.broken.line + rnorm(length(data.broken.line)) pairs(princomp(t(data.noisy.broken.line))$scores[,1:5]) %-- Simulated data: curve -- we just change our "barycentre" function. 
%G library(splines) barycentre2 <- function (y,n) { m <- length(y) x <- 1:m r <- interpSpline(x,y) #r <- lm( y ~ bs(x, knots=m) ) predict(r, seq(1,m,length=n))$y } k <- 5 y <- sample(1:100,k) x <- seq(1,k,length=100) plot(barycentre2(y,100) ~ x) lines(y, col='red', lwd=3, lty=2) %-- %G data.curve <- apply(p, 1, barycentre2, n) data.curve <- t(data.curve) pairs(t(data.curve)) %-- %G plot(princomp(t(data.curve))) %-- %G pairs(princomp(t(data.curve))$scores[,1:5]) %-- Some more noise: %G data.noisy.curve <- data.curve + rnorm(length(data.curve)) pairs(t(data.noisy.curve)) %-- %G plot(princomp(t(data.noisy.curve))) %-- %G pairs(princomp(t(data.noisy.curve))$scores[,1:5]) %-- Let us think about the shape of the curve we would expect. If the disease is lethal (or if, for all the patients of the study, it evolves towards death), we can expect a segment, curved, one end representing healthy people, the other dead patients. If the disease is benign, we would expect the points to form a (deformed) circle, representing the successives stages of the disease, until a once-again healthy patient. If the disease is sometimes lethal, sometimes not, we could expect a mixture of both: the reunion of a circle and a segment. %G random.rotation.matrix <- function (n=3) { m <- NULL for (i in 1:n) { x <- rnorm(n) x <- x / sqrt(sum(x*x)) y <- rep(0,n) if (i>1) for (j in 1:(i-1)) { y <- y + sum( x * m[,j] ) * m[,j] } x <- x - y x <- x / sqrt(sum(x*x)) m <- cbind(m, x) } m } n <- 200 k <- 10 x <- seq(0,2*pi,length=n) data.circle <- matrix(0, nr=n, nc=k) data.circle[,1] <- cos(x) data.circle[,2] <- sin(x) data.circle <- data.circle %*% random.rotation.matrix(k) data.circle <- t( t(data.circle) + rnorm(k) ) pairs(data.circle[,1:3]) %-- %G data.circle <- data.circle + .1*rnorm(n*k) pairs(data.circle[,1:3]) %-- %G pairs(princomp(data.circle)$scores[,1:3]) %-- TODO: The other example Real data: %G data.real <- read.table("Tla_z.txt", sep=",") data.real.group <- factor(substr(names(data.real),0,1)) r <- prcomp(t(data.real)) plot(r$sdev, type='h') %-- %G data.real.3d <- r$x[,1:3] pairs(data.real.3d, pch=16, col=as.numeric(data.real.group)) %-- To see more clearly what happens, let us add an ellipse for each group of points. %G draw.ellipse <- function ( x, y=NULL, N=100, method=lines, ... ) { if (is.null(y)) { y <- x[,2] x <- x[,1] } centre <- c(mean(x), mean(y)) m <- matrix(c(var(x),cov(x,y), cov(x,y),var(y)), nr=2,nc=2) e <- eigen(m) r <- sqrt(e$values) v <- e$vectors theta <- seq(0,2*pi, length=N) x <- centre[1] + r[1]*v[1,1]*cos(theta) + r[2]*v[1,2]*sin(theta) y <- centre[2] + r[1]*v[2,1]*cos(theta) + r[2]*v[2,2]*sin(theta) method(x,y,...) } draw.star <- function (x, y=NULL, ...) { if (is.null(y)) { y <- x[,2] x <- x[,1] } d <- cbind(x,y) m <- apply(d, 2, mean) segments(m[1],m[2],x,y,...) } my.plot <- function ( d, f=rep(1,dim(d)[1]), col=rainbow(length(levels(f))), variables=NULL, legend=T, legend.position=1, draw=draw.ellipse, ...) { xlim <- range(d[,1]) ylim <- range(d[,2]) if(!is.null(variables)){ xlim <- range(xlim, variables[,1]) ylim <- range(ylim, variables[,2]) } plot(d, col=col[as.numeric(f)], pch=16, xlim=xlim, ylim=ylim, ...) 
for (i in 1:length(levels(f))) { try( draw(d[ as.numeric(f)==i, ], col=col[i]) ) } if(!is.null(variables)){ arrows(0,0,variables[,1],variables[,2]) text(1.05*variables,rownames(variables)) } abline(h=0,lty=3) abline(v=0,lty=3) if(legend) { if(legend.position==1) { l=c( par('usr')[1],par('usr')[4], 0, 1 ) } else if (legend.position==2) { l=c( par('usr')[2],par('usr')[4], 1, 1 ) } else if (legend.position==3) { l=c( par('usr')[1],par('usr')[3], 0, 0 ) } else if (legend.position==4) { l=c( par('usr')[2],par('usr')[3], 1, 0 ) } else { l=c( mean(par('usr')[1:2]), mean(par('usr')[1:2]), .5, .5 ) } legend(l[1], l[2], xjust=l[3], yjust=l[4], levels(f), col=col, lty=1,lwd=3) } } my.plot(data.real.3d[,1:2], data.real.group) %-- %G op <- par(mfrow=c(3,3), mar=.1+c(0,0,0,0)) plot.new(); my.plot(data.real.3d[,c(2,1)], data.real.group, xlab='',ylab='',axes=F,legend=F) box() my.plot(data.real.3d[,c(3,1)], data.real.group, xlab='',ylab='',axes=F,legend=F) box() my.plot(data.real.3d[,c(1,2)], data.real.group, draw=draw.star, xlab='',ylab='',axes=F,legend=F) box() plot.new() my.plot(data.real.3d[,c(3,2)], data.real.group, xlab='',ylab='',axes=F,legend=F) box() my.plot(data.real.3d[,c(1,2)], data.real.group, draw=draw.star, xlab='',ylab='',axes=F,legend=F) box() my.plot(data.real.3d[,c(2,3)], data.real.group, draw=draw.star, xlab='',ylab='',axes=F,legend=F) box() plot.new() par(op) %-- You can also turn the plot to see it from various angles. %G set.seed(66327) random.rotation.matrix <- function (n=3) { m <- NULL for (i in 1:n) { x <- rnorm(n) x <- x / sqrt(sum(x*x)) y <- rep(0,n) if (i>1) for (j in 1:(i-1)) { y <- y + sum( x * m[,j] ) * m[,j] } x <- x - y x <- x / sqrt(sum(x*x)) m <- cbind(m, x) } m } op <- par(mfrow=c(3,3), mar=.1+c(0,0,0,0)) for (i in 1:9) { #plot( (data.real.3d %*% random.rotation.matrix(3))[,1:2], # pch=16, col=as.numeric(data.real.group), # xlab='', ylab='', axes=F ) my.plot((data.real.3d %*% random.rotation.matrix(3))[,1:2], data.real.group, draw=draw.ellipse, xlab='',ylab='',axes=F,legend=F) box() } par(op) %-- + Other idea + TODO Some time ago, I have seen an article about dimension reduction (in Rnews, or a vignette). Read it. Implement Isomap (easy: use MST) Other articles google: dimensionality reduction + Other idea (bad) We can formulate our problem as an optimization problem: we are looking for the coordinates of 5 points A1, A2, A3, A4, A5 in order to minimize the sum of the (squares of the) distances of the points in the cloud to the broken line A1-A2-A3-A4-A5. TODO: To check that this algorithm, first give it: - a single segment to find, in dimension 1 - a single segment to find, in dimension 2 - a curve (a flat parabola) with 1 segment - a curve (parabola) with two segments - only then our data. distance.broken.line <- function (point, ligne) { # point: vector containing the coordinates of the point (from the cloud) whose distance to the broken line is to be computed. # ligne: coordinates, in columns, of the points defining the broken line. 
# Distance to the vertices of the broken line d.points <- apply( (ligne-point)^2, 2, sum ) # Position of the projection of the point on each of the segments # of the broken line v <- t(apply(ligne,1,diff)) # Vectors A(i)A(i+1) w <- -ligne + point # Vectors A(i)M w <- w[, -dim(w)[2]] # Remove the last lambda <- apply(v*w, 2, sum) / apply(v^2,2,sum) lambda[ is.nan(lambda) ] <- 0 # Distance to the segments of the broken line d.segments <- apply( (w - t(lambda*t(v)))^2, 2, sum ) d.segments <- d.segments[ lambda>0 & lambda<1 ] min(c( d.points, d.segments )) } # Tests: the following distances ar close to zero stopifnot( distance.broken.line(p[,1], p) < 1e-6 ) stopifnot( distance.broken.line(.5*p[,1]+.5*p[,2], p) < 1e-6 ) # There is a BIG speed problem: in the following example, we need # two seconds to compute the distance... sum(apply(data.noisy.broken.line, 2, distance.broken.line, p)) broken.line <- function ( d, # data np=4, # Number of segments in the broken line deb = as.vector(d[,sample(1:dim(d)[2], np)]) # starting values: broken line through np poins of the cloud, at random. ) { f <- function (x) { sum(apply(d, 2, distance.broken.line, matrix(x, nc=np) )) } # Very, very long... r <- optim(deb, f, control=list(trace=1, maxit=50)) # Default: 500 iterations list(par=r$par, method="optim", value=r$value, deb=deb, np=np, r=r) } r <- broken.line(data.noisy.broken.line) pc <- princomp(t(data.noisy.broken.line)) plot(pc$scores[,1:2]) lines( (t(matrix(r$deb,nc=r$np) - pc$center) %*% pc$loadings)[,1:2], col='red',lwd=3,lty=3) lines((t(matrix(r$par,nc=r$np) - pc$center) %*% pc$loadings)[,1:2], col='blue',lwd=3,lty=2) %-- # Example: a segment, in dimension 1 l <- c(-5, 3, 2, -1) deb <- c(1,2,3,4) d <- rbind( seq(l[1],l[2],length=10), seq(l[3],l[4],length=10) ) r <- broken.line(d, np=2, deb=deb) plot(t(d), xlim=c(-10,10), ylim=c(-10,10)) # Be careful about the order of the parameters to estimate... for (i in 1:100) { lines(matrix(sample(r$par), nr=2) ) } %-- # Starting values: some points of the cloud, at random op <- par(mfrow=c(3,3), mar=.1+c(0,0,0,0)) for (i in 1:9) { l <- runif(4, -8,8) d <- rbind( seq(l[1],l[2],length=10), seq(l[3],l[4],length=10) ) r <- broken.line(d, np=2) plot(t(d), xlim=c(-10,10), ylim=c(-10,10), xlab='', ylab='', axes=F) box() lines( r$par[c(1,3)], r$par[c(2,4)], col='red' ) } par(op) %-- # Starting values: random op <- par(mfrow=c(3,3), mar=.1+c(0,0,0,0)) for (i in 1:9) { l <- runif(4, -8,8) deb <- runif(4, -8,8) d <- rbind( seq(l[1],l[2],length=10), seq(l[3],l[4],length=10) ) r <- broken.line(d, np=2, deb=deb) plot(t(d), xlim=c(-10,10), ylim=c(-10,10), xlab='', ylab='', axes=F) box() lines( r$par[c(1,3)], r$par[c(2,4)], col='red' ) lines( deb[c(1,3)], deb[c(2,4)], col='blue', lty=2 ) } par(op) %-- Aaaaaaahhhhhhhhhhhh... It does not converge -- not at all. Let us try again, by running the algorithm several times. op <- par(mfrow=c(3,3), mar=.1+c(0,0,0,0)) for (i in 1:9) { l <- runif(4, -8,8) deb <- runif(4, -8,8) d <- rbind( seq(l[1],l[2],length=10), seq(l[3],l[4],length=10) ) plot(t(d), xlim=c(-10,10), ylim=c(-10,10), xlab='', ylab='', axes=F) box() lines( deb[c(1,3)], deb[c(2,4)], col='blue', lty=2 ) for (j in 1:10) { r <- broken.line(d, np=2, deb=deb) lines( r$par[c(1,3)], r$par[c(2,4)], col='red' ) deb <- r$par } } par(op) %-- No, it really fails to converge. TODO: look at the value of the distance at the end of the algorithm... I get very small values while the segment id rather far away from the points: why? (this bug could account for the convergence problems...) 
There is another problem: the solution we are looking for is not unique. If a segment is a solution of our problem, any larger segment is also a solution.

It converges too slowly to get a reliable result. Let us try with the "nlm" function ("optim" should work fine with "naughty" functions, "nlm" on "nice" functions -- our function is piecewise quadratic, with a unique local minimum).

# Even longer...
r <- nlm(f, deb, iterlim=20)   # By default, iterlim=100...
pc <- princomp(t(data.noisy.broken.line))
plot(pc$scores[,1:2])
lines( (t(matrix(deb, nc=np) - pc$center) %*% pc$loadings)[,1:2],
       col='red', lwd=3, lty=3 )
lines((t(matrix(r$estimate,nc=np) - pc$center) %*% pc$loadings)[,1:2],
      col='blue', lwd=3, lty=2)

In the previous example (had it worked), the curve was made of 4 segments and we were looking for a curve made of 3 segments. If we had asked for 5 segments, we would have had a segment of length 0, or an unused segment at the end of the curve. To avoid this, we can add a penalty if the segments do not have the same length.

# I add the sum of the inverses of the squares of the lengths of the
# segments.
# TODO: This is a BAD idea: it favours long, useless segments.
f <- function (x) {
  sum(apply(data.noisy.broken.line, 2, distance.broken.line, matrix(x, nc=np) )) +
  sum( 1/apply(t(apply(matrix(x,nc=np), 1,diff))^2, 2, sum) )
}

TODO: computation with 6 or 7 segments to find a broken line with only 5.
TODO: Similarly, we could add a penalty for the total length of the curve.
TODO: We could do the same thing with splines.

+ Traveling Salesman Problem (TSP)

The Traveling Salesman Problem is the following: we have a cloud of points (in a metric space: think "cities on a map") and we are looking for a path, as short as possible, that visits each point. In other words, we want to find an order on the points that minimizes a certain function. Presented like that, the problem already has a lot of applications (for instance, the path of a delivery car, or the path of a soldering iron on an electronic appliance assembly line), but other seemingly unrelated problems can be formulated as a TSP (for instance, some problems in gene mapping -- see Setubal's book -- I confess I have never read it).

We can apply a TSP-solving algorithm to our cloud of points: we get an order on those points, that can be seen as a broken line. We then smooth this broken line: we get a curve, parametrized by [0,1] -- this parameter is the index we were looking for. Finally, given a new point, we just have to project it onto the curve to get the value of the parameter.

Unfortunately, no TSP-solving algorithm is implemented in R: we shall have to program one ourselves.

+ TSP: descent

Here is an implementation using descent: we start with a random path (i.e., an order on the points) and we try to transform it (by interchanging the rank of two points) to get a shorter path.

%G

longueur <- function (d,o) {
  n <- length(o)
  sum(diag( d [o[1:(n-1)],] [,o[2:n]] ))
}
tsp.descent <- function (d, N=1000) {
  # d: distance matrix
  d <- as.matrix(d)
  n <- dim(d)[1]
  o <- sample(1:n)
  v <- longueur(d,o)
  k <- 0
  res <- list()
  k.res <- c()
  l.res <- c(v)
  while (k