Factorial methods: Around Principal Component Analysis (PCA)

There are two kinds of methods in multidimensional statistics: Factorial Methods, in which we project the data onto a vector space, trying to lose as little information as possible; and Classification Methods, which try to cluster the points.

There are three main techniques among the Factorial Methods: Principal Component Analysis (PCA, with several quantitative variables), Correspondence Analysis (CA, two qualitative variables, represented by a contingency table) and Multiple Correspondence Analysis (MCA, more than two variables, all qualitative). The following table summarizes this:

  Method    Quantitative variables    Qualitative variables
  ----------------------------------------------------------
  PCA       several                   none
  MDS       several                   none
  CA        none                      two
  MCA       none                      several

There are other, non-symmetric, methods: one variable plays a special role and is given a different emphasis than the others (usually, we try to predict or explain one variable from the others).

  Method                  Quantitative v.   Qualitative v.   Variable to predict
  ----------------------------------------------------------------------------------
  regression              several           several          quantitative
  anova                   none              one              quantitative
  logistic regression     several           several          binary
  Poisson regression      several           several          count
  logistic regression     several           several          ordered (qualitative)
  discriminant analysis   several           none             one (qualitative)
  CART                    several           several          binary
  ...

* Principal Component Analysis (PCA)

+ Introducing PCA

Here is a first presentation of PCA. We have a cloud of points in a high-dimensional space, too large for us to see anything in it. The PCA will give us a subspace of reasonable dimension so that the projection onto this subspace retains "as much information as possible", i.e., so that the projected cloud of points is as "dispersed" as possible. In other words, it reduces the dimension of the cloud of points; it helps choose a point of view from which to look at the cloud of points.

The algorithm is the following. We first translate the data so that its center of gravity is at the origin (always a good thing when you plan to use linear algebra). Then, we rotate it so that the standard deviation of the first coordinate is as large as possible (and then the second, the third, etc.): this is equivalent to diagonalizing the variance-covariance matrix (it is a positive real symmetric matrix, hence it is diagonalizable in an orthonormal basis), starting with the eigenvectors with the largest eigenvalues. The first axis of this new coordinate system corresponds to an approximation of the cloud of points by a 1-dimensional space; if you want an approximation by a k-dimensional subspace, just take the first k eigenvectors. To choose the dimension of that subspace, we can look (graphically) at the eigenvalues and stop when they start dropping (if they decrease slowly, we are in a bad situation: with our linear goggles, the data is inherently high-dimensional).

One may also see PCA as an analogue of the least squares method, used to find a line that goes as "near" the points as possible -- to simplify, let us assume there are just two dimensions. But while the least squares method is asymmetric (the two variables play different roles: they are not interchangeable, we try to predict one from the other, we measure the distance parallel to one coordinate axis), PCA is symmetric (the distance is measured orthogonally to the line we are looking for).

Here is yet another way of presenting PCA.
We model the data as X = Y + E, where X is the noisy data we have (in an n-dimensional space), Y is the real data (without the noise), which lives in a smaller, k-dimensional space, and E is an error term (the noise). We want to find the k-dimensional space where Y lives.

You can also see PCA as a game on a table of numbers (it is not as childish as it seems: it is actually a description of correspondence analysis). Thus, we can switch the roles of rows and columns: on the one hand, we try to find similarities or differences between subjects, on the other, similarities or differences between variables. (There are two matrices to diagonalize, A*t(A) and t(A)*A, but it suffices to diagonalize one: they have the same non-zero eigenvalues and the eigenvectors of one can easily be obtained from those of the other.) We usually ask that the variables be centered (and, often, normalized). We can plot the variables on the same graph as the subjects: the variables will lie on the unit sphere; their projection on the subspace spanned by the first two eigenvectors is the "correlation circle".

After performing the PCA, we can add new subjects to the plot (test (out-of-sample) subjects; fictitious subjects, representative of certain classes of subjects): just use the "predict" function. Dually, we can add variables (for explanatory purposes); if those variables are qualitative, we actually add an "average subject" for each value of the variable.

This is often called a "biplot", because both subjects (in black) and variables (in red) appear on the plot.

%G
data(USArrests)
p <- prcomp(USArrests)
biplot(p)
%--

TODO: find a situation where you really want to add variables and/or subjects.

n <- 100     # Number of subjects
nn <- 10     # Number of out-of-sample subjects
k <- 5       # Number of variables
kk <- 3      # Number of new variables
x <- matrix(rnorm(n*k), nr=n, nc=k)
x <- t(rnorm(k) + t(x))                  # Initial data
y <- matrix(rnorm(n*kk), nr=n, nc=kk)    # New variables
z <- matrix(rnorm(nn*k), nr=nn, nc=k)    # New subjects
r <- prcomp(x)

# Check the change-of-basis formula: the scores are the centered
# data multiplied by the rotation matrix
all(abs( t(t(x) - r$center) %*% r$rotation - r$x ) < 1e-8)

# Subject coordinates
t(t(x) - r$center) %*% r$rotation

# Out-of-sample subject coordinates
t(t(z) - r$center) %*% r$rotation

# For each new variable, we add an "average" subject
# (more precisely: x %*% t(yy[1,]) = orthogonal projection of
# y[,1] on the subspace spanned by the columns of x),
# to which we apply the same change of basis.
yy <- t(lm(y~x-1)$coef)
t(t(yy) - r$center) %*% r$rotation

Here are the eigenvalues. They are plummeting: this means that the PCA is meaningful and that we can retain only the eigenvectors with the highest eigenvalues (here, the first one: we could even have done a 1-dimensional plot (a histogram)).

%G
plot(p)
%--

There is also a "princomp" function, that does the same computations, with a few more limitations (namely, you should have more observations than variables): do not use it.

The vocabulary used with principal component analysis may surprise you: people speak of "loadings" where you would expect "rotation matrix" and of "scores" where you would think "new coordinates".

In the preceding plot, the old basis vectors are in red. This is the correlation circle -- those vectors lie on the unit sphere, which we project onto the first two eigenvectors (it should not be called a circle but a disc -- the projection of a sphere is a disc, not a circle).
%G a <- seq(0,2*pi,length=100) plot( cos(a), sin(a), type = 'l', lty = 3, xlab = 'comp 1', ylab = 'comp 2', main = "Correlation circle") v <- t(p$rotation)[1:2,] arrows(0,0, v[1,], v[2,], col='red') text(v[1,], v[2,],colnames(v)) %-- Here is a situation where PCA is not relevant (my pupils's marks, when I was a high school teacher). %G # Copy-pasted with the help of the "deparse" command: # cat( deparse(x), file='foobar') notes <- matrix( c(15, NA, 7, 15, 11, 7, 7, 8, 11, 11, 13, 6, 14, 19, 9, 8, 6, NA, 7, 14, 11, 13, 16, 10, 18, 7, 7, NA, 11, NA, NA, 6, 15, 5, 11, 7, 3, NA, 3, 1, 10, 1, 1, 18, 13, 2, 2, 0, 7, 9, 13, NA, 19, 0, 17, 8, 2, 9, 2, 5, 12, 0, 8, 12, 8, 4, 8, 0, 5, 5.5, 1, 12, 4, 13, 5, 11, 6, 0, 7, 8, 11, 9, 9, 9, 14, 8, 5, 8, 5, 5, 12, 6, 16.5, 13.5, 15, 3, 10.5, 1.5, 10.5, 9, 15, 7.5, 12, 13.5, 4.5, 13.5, 13.5, 6, 12, 7.5, 9, 6, 13.5, 13.5, 15, 13.5, 6, NA, 13.5, 4.5, 14, NA, 14, 14, 14, 8, 16, NA, 6, 6, 12, NA, 7, 15, 13, 17, 18, 5, 14, 17, 17, 13, NA, NA, 16, 14, 18, 13, 17, 17, 8, 4, 16, 16, 16, 10, 15, 8, 10, 13, 12, 14, 8, 19, 7, 7, 9, 8, 15, 16, 8, 7, 12, 5, 11, 17, 13, 13, 7, 12, 15, 8, 17, 16, 16, 6, 7, 11, 15, 15, 19, 12, 15, 16, 13, 19, 14, 4, 13, 13, 19, 11, 15, 7, 20, 16, 10, 12, 16, 14, 0, 0, 11, 9, 4, 10, 0, 0, 5, 11, 12, 7, 12, 17, NA, 6, 6, 9, 7, 0, 7, NA, 15, 3, 20, 11, 10, 13, 0, 0, 6, 1, 5, 6, 5, 4, 2, 0, 8, 9, NA, 0, 11, 11, 0, 7, 0, NA, NA, 7, 0, NA, NA, 6, 9, 6, 4, 5, 4, 3 ), nrow=30) notes <- data.frame(notes) # These are not the real names row.names(notes) <- c("Anouilh", "Balzac", "Camus", "Dostoievski", "Eschyle", "Fenelon", "Giraudoux", "Homer", "Ionesco", "Jarry", "Kant", "La Fontaine", "Marivaux", "Nerval", "Ossian", "Platon", "Quevedo", "Racine", "Shakespeare", "Terence", "Updike", "Voltaire", "Whitman", "X", "Yourcenar", "Zola", "27", "28", "29", "30") attr(notes, "names") <- c("C1", "DM1", "C2", "DS1", "DM2", "C3", "DM3", "DM4", "DS2") notes <- as.matrix(notes) notes <- t(t(notes) - apply(notes, 2, mean, na.rm=T)) # Get rid of NAs notes[ is.na(notes) ] <- 0 # plots plot(princomp(notes)) %-- %G biplot(princomp(notes)) %-- Here, we gave the same weight to each mark and each subject; we could have give more weight to some marks, to reflect their nature (test or homework) or the weight of the subjects for the final exam (e.g., A-level): it is the same algorithm, only the scalar product on the space changes. The "biplot" command anly gives the first two dimensions: we can sometimes see more with the "pairs" command. %G pairs(princomp(notes)$scores, gap=0) %-- %G pairs(princomp(notes)$scores[,1:3]) %-- %G p <- princomp(notes) pairs( rbind(p$scores, p$loadings)[,1:3], col=c(rep(1,p$n.obs),rep(2, length(p$center))), pch=c(rep(1,p$n.obs),rep(3, length(p$center))), ) %-- We leave it to the reader to add red arrows (instead of red points) for the variables -- actually, the "pairs" function is not very configurable: the different panels take as arguments the coordinates of the points to draw, while one could want to plot, in the same panel, very different things -- but we do not even have the row and column numbers... The corresponding function in the "lattice" library seems scarcely more configurable. %G library(lattice) splom(as.data.frame( princomp(notes)$scores[,1:3] )) %-- Here is another situation where one could want to use PCA: the classification of texts in a corpus. The rows of the table (the subjects) are the texts, the columns are the words (the vocabulary), the values are the number of occurrences of the words in the text. 
As the dimension of the space (the number of columns) is rather high (several thousand), it is not very easy to compute the covariance matrix, let alone diagonalize it. Yet, one can perform the PCA, in an approximate way, with neural networks.

http://www.loria.fr/projets/TALN/actes/TALN/articles/TALN02.pdf

Here is another means of tackling the same problem, without PCA but still with geometry in high-dimensional spaces:

http://www.perl.com/lpt/a/2003/02/19/engine.html

+ Principal Component Analysis: details

To understand what PCA really is and how it works, let us implement it ourselves. Here is one possible implementation.

%G
my.acp <- function (x) {
  n <- dim(x)[1]
  p <- dim(x)[2]
  # Translation, to use linear algebra
  centre <- apply(x, 2, mean)
  x <- x - matrix(centre, nr=n, nc=p, byrow=T)
  # Diagonalizations, base changes
  e1 <- eigen( t(x) %*% x, symmetric=T )
  e2 <- eigen( x %*% t(x), symmetric=T )
  variables <- t(e2$vectors) %*% x
  subjects <- t(e1$vectors) %*% t(x)
  # The vectors we want are the columns of the
  # above matrices. To draw them, with the "pairs"
  # function, we have to transpose them.
  variables <- t(variables)
  subjects <- t(subjects)
  eigen.values <- e1$values
  # Plot
  plot( subjects[,1:2],
        xlim=c( min(c(subjects[,1],-subjects[,1])),
                max(c(subjects[,1],-subjects[,1])) ),
        ylim=c( min(c(subjects[,2],-subjects[,2])),
                max(c(subjects[,2],-subjects[,2])) ),
        xlab='', ylab='', frame.plot=F )
  par(new=T)
  plot( variables[,1:2], col='red',
        xlim=c( min(c(variables[,1],-variables[,1])),
                max(c(variables[,1],-variables[,1])) ),
        ylim=c( min(c(variables[,2],-variables[,2])),
                max(c(variables[,2],-variables[,2])) ),
        axes=F, xlab='', ylab='', pch='.')
  axis(3, col='red')
  axis(4, col='red')
  arrows(0,0,variables[,1],variables[,2],col='red')
  # Return the data
  invisible(list(data=x, centre=centre, subjects=subjects,
                 variables=variables, eigen.values=eigen.values))
}
n <- 20
p <- 5
x <- matrix( rnorm(p*n), nr=n, nc=p )
my.acp(x)
title(main="ACP by hand")
%--

To check we did not make any error in implementing the algorithm, we can compare the result with that of the "princomp" command.

%G
biplot(princomp(x))
%--

Exercise: Add the subject names; the variable names; plot the first three dimensions of the PCA (not just the first two), as with the "pairs" command; add new variables to the plot (if a variable is quantitative, apply the same base change; if it is qualitative, compute an average subject for each value of this variable and perform the base change).

Exercise: Modify the preceding code by replacing the "eigen" function by the "svd" command, which performs a Singular Value Decomposition.

TODO

+ Normalized and non-normalized PCA

There are several kinds of PCA: centered or not, normalized (based on the correlation matrix) or not (based on the variance-covariance matrix).

prcomp(x, center=TRUE, scale.=FALSE)   # default

Let us first consider the centered, non-normalized principal component analysis, i.e., the one based on the variance-covariance matrix.

TODO: This was written for the "princomp" function, not the "prcomp" one -- actually, the problems disappear.
d <- USArrests[,1:3] # Data dd <- t(t(d)-apply(d, 2, mean)) # Centered data m <- cov(d) # Covariances matrix e <- eigen(m) # Eigen values and eigen vectors p <- princomp( ~ Murder + Assault + UrbanPop, data = USArrests) The change-of-base matrix is the same in both cases: > e$vectors [,1] [,2] [,3] [1,] -0.04181042 -0.04791741 0.99797586 [2,] -0.99806069 -0.04410079 -0.04393145 [3,] -0.04611661 0.99787727 0.04598061 > unclass(p$loadings) Comp.1 Comp.2 Comp.3 Murder -0.04181042 -0.04791741 0.99797586 Assault -0.99806069 -0.04410079 -0.04393145 UrbanPop -0.04611661 0.99787727 0.04598061 and so are the coordinates: > p$scores[1:3,] Comp.1 Comp.2 Comp.3 Alabama -64.99204 -10.660459 2.188264 Alaska -91.34472 -21.676617 -2.651214 Arizona -123.68089 8.979374 -4.437864 > (dd %*% p$loadings) [1:3,] Comp.1 Comp.2 Comp.3 Alabama -64.99204 -10.660459 2.188264 Alaska -91.34472 -21.676617 -2.651214 Arizona -123.68089 8.979374 -4.437864 Let us now look at the normalized principal component analysis: we do not simply center each column of the matrix, we normalize them. d <- USArrests[,1:3] dd <- apply(d, 2, function (x) { (x-mean(x))/sd(x) }) The variance-covariance matrix of this new matrix is the correlation matrix of the initial matrix. > cov(dd) Murder Assault UrbanPop Murder 1.00000000 0.8018733 0.06957262 Assault 0.80187331 1.0000000 0.25887170 UrbanPop 0.06957262 0.2588717 1.00000000 > cor(d) Murder Assault UrbanPop Murder 1.00000000 0.8018733 0.06957262 Assault 0.80187331 1.0000000 0.25887170 UrbanPop 0.06957262 0.2588717 1.00000000 Let us go on with the computations: m <- cov(dd) e <- eigen(m) p <- princomp( ~ Murder + Assault + UrbanPop, data = USArrests, cor=T) We got the right change-of-base matrix: > e$vectors [,1] [,2] [,3] [1,] 0.6672955 0.30345520 0.6801703 [2,] 0.6970818 0.06713997 -0.7138411 [3,] 0.2622854 -0.95047734 0.1667309 > unclass(p$loadings) Comp.1 Comp.2 Comp.3 Murder 0.6672955 -0.30345520 0.6801703 Assault 0.6970818 -0.06713997 -0.7138411 UrbanPop 0.2622854 0.95047734 0.1667309 But the coordinates are not the same... > p$scores[1:3,] Comp.1 Comp.2 Comp.3 Alabama 1.2508055 -0.9341207 0.2015063 Alaska 0.8006592 -1.3941923 -0.6532667 Arizona 1.3542765 0.8368948 -0.8488785 > (dd %*% e$vectors) [1:3,] [,1] [,2] [,3] Alabama 1.2382343 0.9247323 0.1994810 Alaska 0.7926121 1.3801799 -0.6467010 Arizona 1.3406654 -0.8284836 -0.8403468 However, the error is always the same (up to the sign, which is meaningless): > p$scores[1:3,] / (dd %*% e$vectors) [1:3,] Comp.1 Comp.2 Comp.3 Alabama 1.010153 -1.010153 1.010153 Alaska 1.010153 -1.010153 1.010153 Arizona 1.010153 -1.010153 1.010153 The difference comes from the fact that there are two definitions of covariance, one in which you divide ny n, another in which you divide by n-1. > dim(d) [1] 50 3 > sqrt(50/49) [1] 1.010153 + Example TODO: this is not the right place. Later, with LDA. 
for i in `ls *.txt | cat_rand | head -20` do perl -n -e 'BEGIN { foreach ("a".."z") { $a{$_}=0 } }; y/A-Z/a-z/; s/[^a-z]//g; foreach (split("")) { $a{$_}++ } END { foreach("a".."z"){print "$a{$_} "} }' <$i echo E done > ling.txt # Then the same thing in a directory containing French texts %G b <- read.table('ling.txt') names(b) <- c(letters[1:26], 'language') a <- b[,1:26] a <- a/apply(a,1,sum) biplot(princomp(a)) %-- %G plot(hclust(dist(a))) %-- %G kmeans.plot <- function (data, n=2, iter.max=10) { k <- kmeans(data,n,iter.max) p <- princomp(data) u <- p$loadings # The observations x <- (t(u) %*% t(data))[1:2,] x <- t(x) # The centers plot(x, col=k$cluster, pch=3, lwd=3) c <- (t(u) %*% t(k$center))[1:2,] c <- t(c) points(c, col = 1:n, pch=7, lwd=3) # A segment joining each observation to its group center for (i in 1:n) { for (j in (1:length(data[,1]))[k$cluster==i]) { segments( x[j,1], x[j,2], c[i,1], c[i,2], col=i ) } } text( x[,1], x[,2], attr(x, "dimnames")[[1]] ) } kmeans.plot(a,2) %-- # plot(lda(a,b[,27])) # Bug? # plot(lda(as.matrix(a),b[,27])) # Newer bug? + PCA and linear algebra: Singular Value Decomposition (SVD) Actually, computers do not perform Principal Component Analysis as we have just seen, by computing the variance-covariance matrix and diagonalizing it. TODO + Rank PCA Principal component analysis assumes that the data is gaussian. If it is not, we can replace the values by their rank. But then, the variables follow a uniform distribution; We can transform those uniform data to get a gaussian distribution. %G n <- 100 v <- .1 a <- rcauchy(n) b <- rcauchy(n) c <- rcauchy(n) d <- data.frame( x1 = a+b+c+v*rcauchy(n), x2 = a+b-c+v*rcauchy(n), x3 = a-b+c+v*rcauchy(n), x4 = -a+b+c+v*rcauchy(n), x5 = a-b-c+v*rcauchy(n), x6 = -a+b-c+v*rcauchy(n) ) biplot(princomp(d)) %-- %G rank.and.normalize.vector <- function (x) { x <- (rank(x)-.5)/length(x) x <- qnorm(x) } rank.and.normalize <- function (x) { if( is.vector(x) ) return( rank.and.normalize.vector(x) ) if( is.data.frame(x) ) { d <- NULL for (v in x) { if( is.null(d) ) d <- data.frame( rank.and.normalize(v) ) else d <- data.frame(d, rank.and.normalize(v)) } names(d) <- names(x) return(d) } stop("Data type not handled") } biplot(princomp(apply(d,2,rank.and.normalize))) %-- Let us check on the other components of the PCA. %G pairs( princomp(d)$scores ) %-- %G pairs( princomp(apply(d,2,rank.and.normalize))$scores ) %-- + Non linear PCA For non-linear problems, we can first embed our space in a larger one, with an application such as (x,y) |--> (x^2, x*y, y^2). In fact, as the PCA only involves scalar products, we do not really need to compute those new coordinates: it suffices to replace all the occurrences of the scalar product by that of the new space (this function, that expresses the scalar product in the new space from the coordinates in the old space, is called a kernel -- and you can take any kernel). TODO: an example We shall reuse that idea later (to use a kernel to dive into a higher-dimensional space to linearize a non-linear problem) when we speak of SVM (Support Vector Machines). http://www.kernel-machines.org/papers/talk-dagm99_4.ps.gz TODO: library(kernlab) ?kpca + Projection Pursuit Principal Component Analysis tries to find a plane (more generally, a subspace of reasonable dimension) in which (the orthogonal projection of) the data is as dispersed as possible, the dispersion being measured with the variance matrix. 
Projection pursuit is a generalization of PCA in which one looks for a subspace that maximizes some "interestingness" criterion -- and everyone can define their own criterion. For instance, it could be a measure of the dispersion of the data (based on the variance matrix or on more robust dispersion estimators) or a measure of the non-gaussianity of the data (kurtosis, skewness, etc.). Those measures of interestingness are called "indices".

The most common projection pursuits are PCA, ICA (Independent Component Analysis) and robust PCA (replace the 1-dimensional variance you are trying to maximize by a robust equivalent).

http://www.r-project.org/useR-2006/Slides/Fritz.pdf

library(pcaPP)

There is also a function called "ppr" (projection pursuit regression) but it seems to do something completely different: variable selection for linear regression, with several variables to predict.

+ Robust PCA

TODO: Give an example

+ Spherical PCA

TODO

+ Grand tour, guided tour

The grand tour is an animation, a continuous family of projections of the cloud of points onto a 2-dimensional space, obtained by interpolation between random projections: you are supposed to look at it until you see something (clusters, artefacts, alignments, etc.). GGobi can display grand tours.

http://www.ggobi.org/

The guided tour is a variant of the grand tour, where one interpolates between local maxima of a projection pursuit function instead of random projections.

http://citeseer.ist.psu.edu/cook95grand.html

+ Independent Component Analysis (ICA)

At first sight, it looks very similar to Principal Component Analysis -- but it is very different. The general idea is that the variables we are looking at are linear combinations of independent random variables, and we want to recover those random variables. The only assumption is that the random variables we are looking for are independent (well, actually, there is another assumption: those random variables have to be non-gaussian).

ICA has been mainly used in signal processing, the initial example being the cocktail party problem: you have two microphones (or two ears) and you are hearing two conversations at the same time; you want to separate them. Similarly, ICA has also been used to analyze EEG (electroencephalograms).

TODO: 2-dimensional applications, image compression, feature extraction.

Let us see on an example how it differs from Principal Component Analysis: let us take two random variables X1 and X2, uniformly distributed

%G
N <- 1000
X <- matrix(runif(2*N, -1, 1), nc=2)
plot(X)
%--

and transform them

%G
M <- matrix(rnorm(4), nc=2)
Y <- X %*% M
plot(Y)
%--

We get a parallelogram. PCA would yield something close to its largest diagonal (red, in the following plot); ICA would yield the image of the axes (blue), i.e., new axes parallel to the sides of the parallelogram.

%G
plot(Y)
p <- prcomp(Y)$rotation
abline(0, p[2,1] / p[1,1], col="red", lwd=3)
abline(0, -p[1,1] / p[2,1], col="red", lwd=3)
abline(0, M[1,2]/M[1,1], col="blue", lwd=3)
abline(0, M[2,2]/M[2,1], col="blue", lwd=3)
%--

Other examples:

%G
op <- par(mfrow=c(2,2), mar=c(1,1,1,1))
for (i in 1:4) {
  N <- 1000
  X <- matrix(runif(2*N, -1, 1), nc=2)
  M <- matrix(rnorm(4), nc=2)
  Y <- X %*% M
  plot(Y, xlab="", ylab="")
  p <- prcomp(Y)$rotation
  abline(0, p[2,1] / p[1,1], col="red", lwd=3)
  abline(0, -p[1,1] / p[2,1], col="red", lwd=3)
  abline(0, M[1,2]/M[1,1], col="blue", lwd=3)
  abline(0, M[2,2]/M[2,1], col="blue", lwd=3)
}
par(op)
%--

The ICA yields the variables that were indeed used.
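As a minimal, untested sketch (the "fastICA" package from CRAN is an assumption here -- it is not used elsewhere in this document), one can redo the parallelogram example and check that the estimated mixing matrix recovers the true one, up to order, sign and scaling:

library(fastICA)
set.seed(1)
N <- 1000
X <- matrix(runif(2*N, -1, 1), nc=2)   # The independent sources
M <- matrix(rnorm(4), nc=2)            # The (unknown) mixing matrix
Y <- X %*% M                           # The observed, mixed data
r <- fastICA(Y, n.comp=2)
plot(Y, xlab="", ylab="")
# True mixed axes (blue, slopes given by the rows of M) and their
# ICA estimates (green, rows of the estimated mixing matrix r$A)
abline(0, M[1,2]/M[1,1],     col="blue",  lwd=3)
abline(0, M[2,2]/M[2,1],     col="blue",  lwd=3)
abline(0, r$A[1,2]/r$A[1,1], col="green", lwd=2, lty=2)
abline(0, r$A[2,2]/r$A[2,1], col="green", lwd=2, lty=2)
# The recovered sources, r$S, should look like the original square:
# plot(r$S)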
The main idea behind the algorithm is the fact that (from the central limit theorem) a linear combination of non-gaussian random variables is "more gaussian". %G op <- par(mfrow=c(2,2)) hist(X[,1], col="light blue") hist(X[,2], col="light blue") hist(Y[,1], col="light blue") hist(Y[,2], col="light blue") par(op) %-- Therefore the algorithm goes as follows: 1. Normalize the data, so that they have a variance equal to 1 and that the be uncorrelated. 2. Find (with the usual numeric optimization algorithms) the linear transformation that maximizes the non-gaussianity. To this end, one can use several measures of non-gaussianity: the kurtosis (i.e., the fourth moment), the entropy (integral of -f * log(f), where f is the pdf), mutual information, etc. There are two implementations of this algorithm in R. The first one, "ica" in the "e1071" package, does not give the expected results. %G library(e1071) r <- ica(Y,.1) plot(r$projection) %-- The second one, "mlica" in the "mlica" package, gives the expected result. %G library(mlica) ICA <- function (x,...) { prPCA <- PriorNormPCA(x); prNCP <- proposeNCP(prPCA,0.1); mlica(prNCP,...) } set.seed(1) # It sometimes crashes... N <- 1000 X <- matrix(runif(2*N, -1, 1), nc=2) M <- matrix(rnorm(4), nc=2) Y <- X %*% M r <- ICA(Y) plot(r$S) %-- TODO: Give more interesting results. See also: http://www.cis.hut.fi/projects/ica/icademo/ http://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf + Factor analysis Principal Component Analysis is about approximating a variance matrix. The simplest approximation is a scalar matrix (a diagonal matrix with always the same element on the diagonal), the next simplest is a diagonal matrix; then comes the PCA, which writes the variance matrix as V = B B' for some rectangular matrix B, with as many columns as you want components. The matrix B can be characterized as the matrix that minimizes the norm of V - B B' (actually, it is not unique, but (a.s.) unique up to multiplication by an orthogonal matrix). The next step is the factor model: we write V as B B' + Delta, where Delta is a diagonal matrix. In R, this is obtained with the factanal() function. r <- factanal(covmat = V, factors = k) # The correlation matrix v <- r$loadings %*% t(r$loadings) + diag(r$uniquenesses) # The corresponding variance matrix v <- diag(sqrt(diag(V))) %*% v %*% diag(sqrt(diag(V))) This is used, for instance, in finance: we model stock returns as a multidimensional gaussian distribution. A portfolio can be seen as a linear combination of stocks; the coefficients are called the weights of the portfolio and are usually assumed to sum up to 1 (you might also want them to be positive). If w is a portfolio and V the variance matrix of the distribution of returns, then the returns of the portfolio are gaussian with variance w' V w; the square root of this variance is called the risk of the portfolio. Finding the portfolio with the minimum risk possible sounds simple enough (it is an optimization problem), but you have to estimate the variance matrix in the first place. If your universe contains thousands of stocks, you actually want to estimate a 1000*1000 matrix: you will never have enough data to do that! You need a way to reduce the number of quantities to estimate, you need a way to more parsimoniously parametrize variance matrices. Principal Component Analysis is the classical way of doing this, but it assumes that all the stocks respond to the same few sources of randomness: this is not a reasonable assumption. 
Instead, we assume that the stocks respond to a few "risk factors" (they could be interpreted as, e.g., oil prices, consumption indexes, interest rates, etc.) and an idiosyncratic component, specific to each stock: this is a factor model. Such an estimation of the variance matrix of stock returns is called a risk model.

TODO: An example with actual computations (portfolio optimization?)

+ Factor models

TODO: Merge this section with the previous
TODO: Decide whether to put this section here or together with the
      variance matrix (there is already a note about RMT somewhere)...

Principal Component Analysis can also be seen as a decomposition of a variance matrix; this leads to a simplification of variance matrices (by discarding the eigenvectors with low eigenvalues) and to a parametrization of variance matrices with few parameters. Since large variance matrices require a huge amount of data to be reliably estimated, PCA can allow you to get a precise (but biased) estimation with little data. Those parametrizations are often called "factor models".

They are used, for instance, in finance: when you want to invest in stocks (or any other asset, actually), you do not buy the stocks for which you have a positive view in equal quantities: some of those stocks can be moving together. Consider for instance three stocks, A, B and C, on which you have a positive view, where A and B tend to move together and C moves independently from A and B: it might not be a good idea to invest a third of your wealth in each of them -- it might be wiser to invest half in C, and a quarter each in A and B.

This reasoning can be formalized (and justified) if one assumes that the returns of those three assets follow a gaussian distribution with mean (mu, mu, mu) (where mu is the expected return of those assets -- here we assume they are equal) and correlation matrix

  1    1/2   0
  1/2  1     0
  0    0     1

We want to find the portfolio (i.e., the combination of assets) with the lowest risk, i.e., the lowest variance. But where do we get this variance matrix in the first place? We have to estimate it from historical data: for instance, from the monthly returns of the stocks over the past three years. But if you have 1000 stocks (that is a small number: in real life, it is usually between 1000 and 10,000), that is 36,000 data points to estimate around 500,000 parameters -- this is not reasonable.

The idea is to assume that the variance matrix has some simple form, i.e., that it can be parametrized with a reasonable number of parameters. For instance, one can assume that the correlations between the returns are due to their being correlated with a small number of "factors": then, we just have to give the "exposure" of each stock to each factor, B, and the variance matrix of the factors, v. The variance matrix of the stock returns is then

  V = B v B'

If one assumes that the factors are independent and of variance 1, i.e., that their variance matrix v is the identity matrix, then V = B B'. But this is very similar to the decomposition of a variance matrix provided by PCA: if, furthermore, we do not know the factors beforehand, we can simply select them as the first eigenvectors. This is called a noiseless statistical factor model.

TODO: Noiseless factor model
TODO: Scalar factor model
TODO: Approximate factor model
TODO: The double Procrustes problem?
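Here is a small sketch of such a statistical factor model, on simulated returns (the numbers of stocks, observations and factors below are arbitrary): keep the first k eigenvectors of the sample variance matrix as exposures and put the remaining variance back on the diagonal.

# Sketch: statistical factor model on simulated returns
set.seed(1)
n.obs    <- 36      # e.g., three years of monthly returns
n.stocks <- 50
k        <- 3       # Number of statistical factors to keep
B.true <- matrix(rnorm(n.stocks*k), nc=k)    # True exposures
f      <- matrix(rnorm(n.obs*k),    nc=k)    # Factor returns
x      <- f %*% t(B.true) + .5 * matrix(rnorm(n.obs*n.stocks), nc=n.stocks)

V.sample <- var(x)              # Very noisy: ~ n.stocks^2/2 parameters
e <- eigen(V.sample, symmetric=TRUE)
B <- e$vectors[,1:k] %*% diag(sqrt(e$values[1:k]))   # Estimated exposures
V.factor <- B %*% t(B)
diag(V.factor) <- diag(V.sample)   # Idiosyncratic variances on the diagonal
# Far fewer parameters to estimate: n.stocks*k exposures + n.stocks variances
c( n.stocks*k + n.stocks, n.stocks*(n.stocks+1)/2 )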
+ Functional Principal Component Analysis (fPCA) When the data you are modeling are functions (for instance, the evolution over time of some quantity), you might realize that the principal components are noisy and try to smooth them. Actually, this is not the optimal thing to do -- by smoothing the principal components, you run the risk of smoothing away the information they were supposed to contain. Instead, you can put the smoothing inside the PCA algorithm. The first step is usually to express the functions to be studied in some basis -- by doing so, you already smooth them, but you should be careful not to discard the features of interest. Using functions instead of series of points allows you to have a different number of points, a different sampling frequency for each curve -- or even, irregular sampling. Using a basis of functions brings the problem back in the finite-dimensional realm and reduces it to a linear algebra problem. In the second step, one computes the mean of all those curves, usually using the same basis. TODO: Do we add a smoothing penalty for the mean? TODO: Formula... In the third step, one actually performs the principal component analysis. TODO: Explain how to introduce the smoothing penalties TODO: Formulas TODO: Implementation? A few problems might occur. For instance, the functions to estimate might have some very important properties: for instance, we might know that they have to be increasing, in spite of error measurements. We can enforce such a requirement by parametrizing the function via a differential equation. For instance, to get an increasing function H (say, height, when studying child growth), one just has to find a function w (without any restriction) such that H'' = w H -- w can be interpreted as the "relative acceleration". Another problem is that the features in the various curves to study are not aligned: they occur in the same order, but with different amplitudes and locations. In that situation, the mean curve can fail to exhibit the very features we want to study -- they are averaged out. To palliate this problem, one can "align" the curves, by reparametrizing the curves (i.e., by considering an alternate time: clock time versus physiological time in biology, clock time versus market time or transaction time in finance, etc.). This is called "registration" or "(dynamic) time warping". TODO: More details on time warping. TODO: 1. Write the functions to study in some basis 2. Smoothed mean 3. Smoothed PCA For more details, check: Applied Functional Data Analysis, Methods and Case Studies J.O. Ramsay and B.W. Silverman, Springer Verlag (2002) http://www.stats.ox.ac.uk/~silverma/fdacasebook/ TODO: An online reference? + Varimax rotation The components you get as the result of a principal component analysis are not always directly interpretable: even if all the relevant information is in the first two principal components, they are not as "pure" as you would like. But by rotating them, they can be easier to interpret. Since you use the amplitude of the loadings of the PCA to interpret the principal components, you can try to simplify them, as follows: find a rotation (not a rotation of the whole space, but only a rotation of the subspace spanned by the first components) that maximizes the variance of the loadings (contrary to PCA, you do not consider the components one at a time, but all at the same time). This will actually try to make the loadings either very large or very small, easing the interpretation of the components. 
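A minimal sketch with the varimax() function from base R (simulated data; scaling the rotation matrix by the standard deviations to obtain "loadings" is one common convention, not the only one):

set.seed(1)
x <- matrix(rnorm(100*6), nc=6)
x[,1:3] <- x[,1:3] + rnorm(100)     # Two blocks of correlated variables
x[,4:6] <- x[,4:6] + rnorm(100)
p <- prcomp(x, scale.=TRUE)
L <- p$rotation[,1:2] %*% diag(p$sdev[1:2])   # Loadings of the first two components
varimax(L)$loadings   # Rotated loadings, pushed towards 0 or +-1
varimax(L)$rotmat     # The rotation applied to the two components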
?varimax ?promax + Tensor approximation When you apply Principal Component Analysis (PCA) to the pixels of a series of images, you forget the 2-dimensional structure and hope that the PCA will somehow recover it, by itself. Instead, one can try to retain that information by representing the data as matrices. This can be generalized to higher-dimensional data with higher dimensional arrays (learned people may call those arrays "tensors"). Tensor algebra, motivations: - Taylor expansion f(x+h) = f(x) + f'(x).h + f''(x).h.h + ... where f(x) is a number f'(x) is a 1-form: it transforms a vector h into a number f'(x).h f''(x) is a symetric 2-form: it takes a two vectors h and k and turns them into a number f''(x).h.k - Higher order statistics (HOS) (e.g., the autocorrelation of x^2 is a fourth moment). - [For category-theorists] The tensor algebra TV of a vector space V is obtained from the adjoint of the forgetful functor Alg --> Vect from the category of R-algebras to that of R-vector spaces. Similarly, one can define the symetric tensor algebra SV from the category of commutative R-algebras. The alternating algebra AV is often defined as AV=TV/SV. - In mathematical physics, a tensor is a section of some vector bundle built from the tangent buldle of a manifold (if you do not know about manifolds, read "The large scale structure of space-time" by S. Hawking et al.). For instance: - a section of the tangent bundle TX --> X is a field of tangent vectors; - a section of the cotangent bundle (the dual of the tangent space) TX' --> X is a (field of) 1-forms; - a section of the maximum alternating power of the cotangent bundle is an n-form, that you can use to integrate a (real-valued) function on the manifold. (You can also consider sections of other vector bundles, e.g., the Levi-Civita connection -- some people call those "holors" -- see "Theory of Holors : A Generalization of Tensors", P. Moon and E. Spencer) Tentative implementation: # From "Out-of-code tensor approximation of # multi-dimensional matrices of visual data" # H. Wang et al., Siggraph 2005 # http://vision.ai.uiuc.edu/~wanghc/research/siggraph05.html # http://vision.ai.uiuc.edu/~wanghc/papers/siggraph05_tensor_hongcheng.pdf mode.n.vector <- function (A, n) { # aka "unfolding" stopifnot( n == floor(n), n >= 1, n <= length(dim(A)) ) res <- apply(A, n, as.vector) if (is.vector(res)) { res <- matrix(res, nr=dim(A)[n], nc=prod(dim(A))/dim(A)[n] ) } else { res <- t(res) } stopifnot( dim(res) == c( dim(A)[n], prod(dim(A))/dim(A)[n] ) ) res } n.rank <- function (A, n) { sum( svd( mode.n.vector(A, n) )$d >= 1000 * .Machine$double.eps ) } n.ranks <- function (A) { n <- length(dim(A)) res <- integer(n) for (i in 1:n) { res[i] <- n.rank(A, i) } res } # The norm is the same as usual: sqrt(sum(A^2)) n.mode.svd <- function (A) { res <- list() for (n in 1:length(dim(A))) { res[[n]] <- svd( mode.n.vector(A, n) ) } res } product <- function (A, B, i) { # Multiply the ith dimension of the tensor A with the # columns of the matrix B stopifnot( is.array(A), i == floor(i), length(i) == 1, i >= 1, i <= length(dim(A)), is.matrix(B), dim(A)[i] == dim(B)[1] ) res <- array( apply(A, i, as.vector) %*% B, dim = c(dim(A)[-i], dim(B)[2]) ) N <- length(dim(A)) ind <- 1:N ind[i] <- N ind[ 1:N > i ] <- ind[ 1:N > i ] - 1 res <- aperm(res, ind) stopifnot( dim(res)[-i] == dim(A)[-i], dim(res)[i] == dim(B)[2] ) res } # A few tests... 
A <- array(1:8, dim=c(2,2,2)) mode.n.vector(A, 1) mode.n.vector(A, 2) mode.n.vector(A, 3) n.rank(A, 1) n.rank(A, 2) n.rank(A, 3) n.ranks(A) B1 <- matrix(round(10*rnorm(6)), nr=2, nc=3) B2 <- matrix(round(10*rnorm(12)), nr=3, nc=4) all( product(B1, B2, 2) == B1 %*% B2 ) # The main function... tensor.approximation <- function (A, R) { stopifnot( is.array(A), length(dim(A)) == length(R), is.vector(R), R == floor(R), R >= 0, R <= dim(A), R <= n.ranks(A), R <= prod(R)/R # If all the elements of R # are equal, this is fine. ) N <- length(R) I <- dim(A) U <- list() for (i in 1:N) { U[[i]] <- matrix( rnorm(I[i] * R[i]), nr = I[i], nc = R[i] ) } finished <- FALSE BB <- NULL while (!finished) { cat("Iteration\n") Utilde <- list() for (i in 1:N) { Utilde[[i]] <- A for (j in 1:N) { if (i != j) { Utilde[[i]] <- product( Utilde[[i]], U[[j]], j) } cat(" Utilde[[", i, "]]: ", paste(dim(Utilde[[i]]),collapse=","), "\n", sep="") } stopifnot( length(dim(Utilde[[i]])) == N, dim(Utilde[[i]])[i] == dim(A)[i], dim(Utilde[[i]])[-i] == R[-i] ) Utilde[[i]] <- mode.n.vector( Utilde[[i]], i ) cat(" Utilde[[", i, "]]: ", paste(dim(Utilde[[i]]),collapse=","), "\n", sep="") } for (i in 1:N) { U[[i]] <- svd( Utilde[[i]] ) $ u [ , 1: R[i], drop=FALSE ] #U[[i]] <- svd( Utilde[[i]] ) $ v stopifnot( dim(U[[i]]) == c(I[i], R[i]) ) } B <- A for (j in 1:N) { B <- product( B, U[[j]], j ) } res <- B for (j in 1:N) { res <- product( res, t(U[[j]]), j ) } cat("Approximating A:", sum((res - A)^2), "\n") if (!is.null(BB)) { eps <- sum( (B - BB)^2 ) cat("Improvement on B:", eps, "\n") finished <- eps <= 1e-6 } BB <- B } res <- B for (j in 1:N) { res <- product( res, t(U[[j]]), j ) } attr(res, "B") <- B attr(res, "U") <- U attr(res, "squared.error") <- sum( (A-res)^2 ) res } A <- array(1:24, dim=c(4,3,2)) x <- tensor.approximation(A, c(1,1,1)) x <- tensor.approximation(A, c(2,2,2)) * Distance-based methods The problem is now the following: given a set of points in an n-dimensional space, find their coordinates kwnowing only their distances. Usually, you do not know the space dimension either. + PCO (Distance Analysis) This is very similar to Principal Component Analysis (PCA), with a few differences, though: PCA looks for relations between the variables while PCO focuses on similarities between the observations; you can interpret the PCA axes with the variables while you cannot with PCO (indeed, there are no variables). As PCA, PCO is linear and the computations are the same: in PCA, we start with the variance-covariance matrix, i.e., a matrix whose entries are sums of squares; in distance analysis, we consider a matrix of distances or a distance of squares of matrices, i.e., from the pythagorean theorem, a matrix of sums of squares. In other words, PCO is just PCA without its first step: the computation of the variance-covariance (or "squared distances") matrix. ?cmdscale library(MASS) + MDS (Multi-Dimensional Scaling) MDS is a non-linear equivalent of Principal Coordinate Analysis (PCO). We have a set of distances between points and we try to assign coordinates to those points so as to minimize a "stress function" -- for instance, the sum of squares between the distances we have and the distances obtained from the coordinates. There are several R functions to perform those computations. 
?cmdscale Principal coordinate analysis (metric, linear) library(MASS) ?sammon (metric, non-linear) ?isoMDS (non-metric) There was a comparison of them in RNews: http://cran.r-project.org/doc/Rnews/Rnews_2003-3.pdf + xgvis library(xgobi) ?xgvis TODO: speak a little more of xgvis xgvis slides and tutorial: http://industry.ebi.ac.uk/%257Ealan/VisWorkshop99/XGvis_Talk/sld001.htm http://industry.ebi.ac.uk/%257Ealan/VisWorkshop99/XGvis_Tutorial/index.html (restart the MDS optimization with a different starting point to avoid local minima ("scramble" button)) (one may change the parameters (euclidian/L1/Linfty distance, removal of outliers, weights, etc.) (one may ask for a 3D or 4D layout and the rotate it in xgobi) (restart the MDS optimization with only a subset of the data -- interactivity) What is a Shepard diagram ??? (change the power transformation) MDS to get a picture of protein similarity MDS warning: The global shape of MDS configurations is determined by the large dissimilarities; consequently, small distances should be in terpreted with caution: they may not reflect small dissimilarities. + Applications The applications are manifold, for instance: 1. Dimension reduction: we start with a set of points in a high-dimensional space, compute their pairwise distances, and try to put the points in a space of smaller dimension while retaining as much as possible the information ("proximity") present in the distance matrix. Thus, we can see MDS as a non-linear analogue of Principal Component Analysis (PCA). %G n <- 100 v <- .1 # (Almost) planar data x <- rnorm(n) y <- x*x + v*rnorm(n) z <- v*rnorm(n) d <- cbind(x,y,z) # A rotation random.rotation.matrix <- function (n=3) { m <- NULL for (i in 1:n) { x <- rnorm(n) x <- x / sqrt(sum(x*x)) y <- rep(0,n) if (i>1) for (j in 1:(i-1)) { y <- y + sum( x * m[,j] ) * m[,j] } x <- x - y x <- x / sqrt(sum(x*x)) m <- cbind(m, x) } m } m <- random.rotation.matrix(3) d <- t( m %*% t(d) ) pairs(d) title(main="The points lie in a plane") %-- %G pairs(cmdscale(dist(d),3)) title(main="MDS") %-- %G pairs(princomp(d)$scores) title(main="Principal Component Analysis") %-- 2. Study of data with no coordinates, for instance protein similarity: we can tell when two proteins are similar, we can quantify this similarity -- but we have no coordinates, we cannot think of a simple vector space whose points could be interpreted as proteins. MDS can help build such a space and graphically represent the proximity relations among a set of proteins. The same process also appears in psychology: we can ask test subjects to identify objects (faces, Morse-coded letters, etc.) and count the number of confusions between two objects. This measures the similarity of these objects (a subjective similarity, that stems from their representation in the human mind); MDS can assign coordinates to those objects and represent those similarities in a graphical way. Those misclassifications also appear in statistics: we can use MDS to study forecast errors of a given statistical algorithm (logistic regression, neural nets, etc.) when trying to predict a qualitative variable from quantitative variables. TODO: speak about MDS when I tackle those algorithms... 3. MDS can also graphically represent variables (and not observations), after estimating the "distances" between variables from their correlations. TODO: such a plot when I speak about variable selection... 4. 
We can also use MDS to plot graphs (the combinatorial objects we study in graph theory -- see the "graph" package), described by theiir vertices and edges. Alternatively, you can use GraphViz, http://www.graphviz.org/ outside R or from R -- see the "sem" and "Rgraphviz" libraries. TODO 5. Reconstruct a molecule, especially its shape, from the distance between its atoms. For more details, see http://www.research.att.com/areas/stat/xgobi/papers/xgvis.ps.gz + Minimum Spanning Tree (MST) 1. Put the points into 4 or 5 classes (e.g., by hierarchical classification, or with the k-means algorithm). 2. Take the median point of those classes. 3. Is the Minimum Spanning Tree (MST) of those points ramified? On our examples: %G # Data n <- 200 # Number of patients, number of columns k <- 10 # Dimension of the ambient space nb.points <- 5 p <- matrix( 5*rnorm(nb.points*k), nr=k ) barycentre <- function (x, n) { # Add number between the values of x in order to get a length n vector i <- seq(1,length(x)-.001,length=n) j <- floor(i) l <- i-j (1-l)*x[j] + l*x[j+1] } m <- apply(p, 1, barycentre, n) data.broken.line <- t(m) data.noisy.broken.line <- data.broken.line + rnorm(length(data.broken.line)) library(splines) barycentre2 <- function (y,n) { m <- length(y) x <- 1:m r <- interpSpline(x,y) #r <- lm( y ~ bs(x, knots=m) ) predict(r, seq(1,m,length=n))$y } data.curve <- apply(p, 1, barycentre2, n) data.curve <- t(data.curve) data.noisy.curve <- data.curve + rnorm(length(data.curve)) data.real <- read.table("Tla_z.txt", sep=",") r <- prcomp(t(data.real)) data.real.3d <- r$x[,1:3] library(cluster) library(ape) mst.of.classification <- function (x, k=6, ...) { x <- t(x) x <- t( t(x) - apply(x,2,mean) ) r <- prcomp(x) y <- r$x u <- r$rotation r <- kmeans(y,k) z <- r$centers m <- mst(dist(z)) plot(y[,1:2], ...) 
points(z[,1:2], col='red', pch=15) w <- which(m!=0) i <- as.vector(row(m))[w] j <- as.vector(col(m))[w] segments( z[i,1], z[i,2], z[j,1], z[j,2], col='red' ) } set.seed(1) mst.of.classification(data.curve, 6) %-- %G mst.of.classification(data.curve, 6) %-- %G op <- par(mfrow=c(2,2),mar=.1+c(0,0,0,0)) for (k in c(4,6,10,15)) { mst.of.classification(data.curve, k, axes=F) box() } par(op) %-- %G op <- par(mfrow=c(2,2),mar=.1+c(0,0,0,0)) for (k in c(4,6,10,15)) { mst.of.classification(data.noisy.curve, k, axes=F) box() } par(op) %-- %G op <- par(mfrow=c(2,2),mar=.1+c(0,0,0,0)) for (k in c(4,6,10,15)) { mst.of.classification(data.broken.line, k, axes=F) box() } par(op) %-- %G op <- par(mfrow=c(2,2),mar=.1+c(0,0,0,0)) for (k in c(4,6,10,15)) { mst.of.classification(data.noisy.broken.line, k, axes=F) box() } par(op) %-- With real data, it does not work that well: %G op <- par(mfrow=c(2,2),mar=.1+c(0,0,0,0)) for (k in c(4,6,10,15)) { mst.of.classification(data.real, k, axes=F) box() } par(op) %-- Details: %G op <- par(mfrow=c(3,3),mar=.1+c(0,0,0,0)) for (k in c(4:6)) { for (i in 1:3) { mst.of.classification(data.real, k, axes=F) box() } } par(op) %-- %G op <- par(mfrow=c(3,3),mar=.1+c(0,0,0,0)) for (k in c(7:9)) { for (i in 1:3) { mst.of.classification(data.real, k, axes=F) box() } } par(op) %-- %G op <- par(mfrow=c(3,3),mar=.1+c(0,0,0,0)) for (k in c(10:12)) { for (i in 1:3) { mst.of.classification(data.real, k, axes=F) box() } } par(op) %-- %G op <- par(mfrow=c(3,5),mar=.1+c(0,0,0,0)) for (k in c(13:15)) { for (i in 1:3) { mst.of.classification(data.real, k, axes=F) box() } } par(op) %-- + Minimum Spanning Tree (MST) %G library(ape) my.plot.mst <- function (d) { r <- mst(dist(t(d))) d <- prcomp(t(d))$x[,1:2] plot(d) n <- dim(r)[1] w <- which(r!=0) i <- as.vector(row(r))[w] j <- as.vector(col(r))[w] segments( d[i,1], d[i,2], d[j,1], d[j,2], col='red' ) } my.plot.mst(data.broken.line) %-- %G my.plot.mst(data.noisy.broken.line) %-- %G my.plot.mst(data.curve) %-- %G my.plot.mst(data.noisy.curve) %-- %G my.plot.mst(data.real) %-- %G my.plot.mst(t(data.real.3d)) %-- Pruned MST %G # Gives the list of the oaths from the branching nodes to the leaves chemins.vers.les.feuilles <- function (G) { nodes <- which(apply(G,2,sum)>2) leaves <- which(apply(G,2,sum)==1) res <- list() for (a in nodes) { for (b in which(G[a,]>0)) { if (! 
b %in% nodes) { res <- append(res,list(c(a,b))) } } } chemins.vers.les.feuilles.suite(G, nodes, leaves, res) } # Last coordinate of a vector end1 <- function (x) { n <- length(x) x[n] } # Last two coordinates of a vector end2 <- function (x) { n <- length(x) x[c(n-1,n)] } chemins.vers.les.feuilles.suite <- function (G, nodes, leaves, res) { new <- list() done <- T for (ch in res) { if( end1(ch) %in% nodes ) { # Pass } else if ( end1(ch) %in% leaves ) { new <- append(new, list(ch)) } else { done <- F a <- end2(ch)[1] b <- end2(ch)[2] for (x in which(G[b,]>0)) { if( x != a ){ new <- append(new, list(c( ch, x ))) } } } } if(done) { return(new) } else { return(chemins.vers.les.feuilles.suite(G,nodes,leaves,new)) } } G <- matrix(c(0,1,0,0, 1,0,1,1, 0,1,0,0, 0,1,0,0), nr=4) chemins.vers.les.feuilles(G) # Compute the length of a path longueur.chemin <- function (chemin, d) { d <- as.matrix(d) n <- length(chemin) i <- chemin[ 1:(n-1) ] j <- chemin[ 2:n ] if (n==2) { d[i,j] } else { sum(diag(d[i,][,j])) } } G <- matrix(c(0,1,0,0, 1,0,1,1, 0,1,0,0, 0,1,0,0), nr=4) d <- matrix(c(0,2,4,3, 2,0,2,1, 4,2,0,3, 3,1,3,0), nr=4) chemins <- chemins.vers.les.feuilles(G) chemins l <- sapply(chemins, longueur.chemin, d) l stopifnot( l == c(2,2,1) ) elague <- function (G0, d0) { d0 <- as.matrix(d0) G <- G0 d <- d0 res <- 1:dim(d)[1] chemins <- chemins.vers.les.feuilles(G) while (length(chemins)>0) { longueurs <- sapply(chemins, longueur.chemin, d) # Number of the shortest path i <- which( longueurs == min(longueurs) )[1] cat(paste("Removing", paste(res[chemins[[i]]],collapse=' '), "length =", longueurs[i],"\n")) # Nodes to remove j <- chemins[[i]] [-1] res <- res[-j] G <- G[-j,][,-j] d <- d[-j,][,-j] cat(paste("Removing", paste(j), "\n" )) cat(paste("", paste(res, collapse=' '), "\n")) chemins <- chemins.vers.les.feuilles(G) } res } library(ape) my.plot.mst <- function (x0) { cat("Plotting the points\n") x <- prcomp(t(x0))$x[,1:2] plot(x) cat("Computing the distance matrix\n") d <- dist(t(x0)) cat("Computing the MST (Minimum Spanning Tree)\n") G <- mst(d) cat("Plotting the MST\n") n <- dim(G)[1] w <- which(G!=0) i <- as.vector(row(G))[w] j <- as.vector(col(G))[w] segments( x[i,1], x[i,2], x[j,1], x[j,2], col='red' ) cat("Pruning the tree\n") k <- elague(G,d) cat("Plotting the pruned tree\n") x <- x[k,] G <- G[k,][,k] n <- dim(G)[1] w <- which(G!=0) i <- as.vector(row(G))[w] j <- as.vector(col(G))[w] segments( x[i,1], x[i,2], x[j,1], x[j,2], col='red', lwd=3 ) } my.plot.mst(data.noisy.broken.line) %-- %G my.plot.mst(data.noisy.curve) %-- %G my.plot.mst(data.real) %-- %G my.plot.mst(t(data.real.3d)) %-- Remark: in image analysis, we sometimes us a simplification of this algorithm (still called "pruning" to get rid of the ramifications in the skeleton of an image: we just gnaw two of three segments from each leaf. TODO: a reference + Isomap TODO: reference TODO: explain Let us first construct the graph. %G # k: each point is linked to its k nearest neighbors # eps: each point is linked to all its neighbors within a radius eps isomap.incidence.matrix <- function (d, eps=NA, k=NA) { stopifnot(xor( is.na(eps), is.na(k) )) d <- as.matrix(d) if(!is.na(eps)) { im <- d <= eps } else { im <- apply(d,1,rank) <= k+1 diag(im) <- F } im | t(im) } plot.graph <- function (im,x,y=NULL, ...) { if(is.null(y)) { y <- x[,2] x <- x[,1] } plot(x,y, ...) 
  k <- which( as.vector(im) )
  i <- as.vector(col(im))[ k ]
  j <- as.vector(row(im))[ k ]
  segments( x[i], y[i], x[j], y[j], col='red' )
}
d <- dist(t(data.noisy.curve))
r <- princomp(t(data.noisy.curve))
x <- r$scores[,1]
y <- r$scores[,2]
plot.graph(isomap.incidence.matrix(d, k=5), x, y)
%--

So far, there is a problem: the resulting graph need not be connected -- this is a problem if we want to compute distances...

%G
plot.graph(isomap.incidence.matrix(d, eps=quantile(as.vector(d), .05)), x, y)
%--

The graph is connected as soon as it contains the Minimum Spanning Tree (MST): thus, we shall just add the edges of this MST that are not already there.

%G
isomap.incidence.matrix <- function (d, eps=NA, k=NA) {
  stopifnot(xor( is.na(eps), is.na(k) ))
  d <- as.matrix(d)
  if(!is.na(eps)) {
    im <- d <= eps
  } else {
    im <- apply(d,1,rank) <= k+1
    diag(im) <- F
  }
  im | t(im) | mst(d)
}
plot.graph(isomap.incidence.matrix(d, eps=quantile(as.vector(d), .05)), x, y)
%--

TODO: distance in this graph. It is a classical problem: shortest path in a graph.

%G
r <- NULL
while(is.null(r)) {
  x2 <- my.random.data.2()
  try(r <- my.acm(x2))
}
%--

Let us compare our result with that of the "mca" command.

%G
library(MASS)
plot(mca(x1))
%--

%G
plot(mca(x2))
%--

At first, I had forgotten to remove the 1 eigenvalue, and I got this.

%G
my.acm(x1, garder.un=T)
%--

%G
my.acm(x2, garder.un=T)
%--

+ Mixing quantitative and qualitative variables (it does not work)

(What follows are a few mindless tests -- none actually worked.)

One could try to analyse a set of qualitative and quantitative variables by mixing Principal Component Analysis (PCA) and Multiple Correspondence Analysis (MCA): we just put the two tables side by side and concatenate the weight vectors of the columns. Let us look at the results.

%G
tableau.disjonctif.vecteur <- function (x) {
  if( is.factor(x) ){
    y <- matrix(0, nr=length(x), nc=length(levels(x)))
    for (i in 1:length(x)) {
      y[i, as.numeric(x[i])] <- 1
    }
    return(y)
  } else
    return(x)
}
tableau.disjonctif <- function (x) {
  if( is.vector(x) )
    y <- tableau.disjonctif.vecteur(x)
  else {
    y <- NULL
    y.names <- NULL
    for (i in 1:length(x)) {
      y <- cbind(y, tableau.disjonctif.vecteur(x[,i]))
      if( is.factor(x[,i]) )
        y.names <- c( y.names, paste(names(x)[i], levels(x[,i]), sep='') )
      else
        y.names <- c(y.names, names(x)[i])
    }
  }
  colnames(y) <- y.names
  y
}
my.am <- function (x) {
  y <- tableau.disjonctif(x)
  # Number of observations
  n <- dim(y)[1]
  # Number of variables
  s <- length(x)
  # Number of columns of the disjunctive table
  p <- dim(y)[2]
  # The matrix and the weights
  # Each variable weight is 1, but for the qualitative variables
  # that are "split" into several "subvariables", it is the sum of
  # those "subvariable" weights that is 1.
  Dp <- diag(t(y)%*%y) / n
  j <- 1
  for(i in 1:s){
    if( is.factor(x[,i]) )
      j <- j + length(levels(x[,i]))
    else {
      Dp[j] <- 1
      j <- j+1
    }
  }
  Dn <- rep(1,n)
  # It is not a good idea to center the variables...
  f <- ne.pas.centrer(y)
  # We perform the analysis...
  e1 <- eigen( t(f) %*% diag(1/Dn) %*% f %*% diag(1/Dp) )
  e2 <- eigen( f %*% diag(1/Dp) %*% t(f) %*% diag(1/Dn) )
  variables <- t(e2$vectors) %*% f
  individus <- t(e1$vectors) %*% t(f)
  variables <- t(variables)
  individus <- t(individus)
  valeurs.propres <- e1$values
  # Sometimes, because of rounding errors, the matrix becomes
  # non-diagonalizable in R. We then end up with complex eigenvalues
  # and (worse) complex eigenvectors.
That is why it is a better # idea to use the SVD, that always yields real results. if( any(Im(variables)!=0) | any(Im(individus)!=0) | any(Im(valeurs.propres)!=0) ){ warning("Matrix not diagonalizable on R!!!") variables <- Re(variables) individus <- Re(individus) valeurs.propres <- Re(valeurs.propres) } plot( jitter(individus[,2], factor=5) ~ jitter(individus[,1], factor=5), xlab='', ylab='', frame.plot=F ) par(new=T) plot( variables[,1:2], col='red', axes=FALSE, xlab='', ylab='', type='n' ) text( variables[,1:2], colnames(y), col='red') axis(3, col='red') axis(4, col='red') col = rainbow(s) j <- 1 for (i in 1:s) { if( is.factor(x[,i]) ){ jj <- j + length(levels(x[,i])) - 1 print( paste(j,jj) ) lines( variables[j:jj,1], variables[j:jj,2], col=col[i] ) j <- jj+1 } else { arrows(0,0,variables[j,1],variables[j,2],col=col[i]) j <- j+1 } } } ne.pas.centrer <- function (y) { y } n <- 500 m <- 3 p <- 2 l <- c(3,2,5) x <- NULL for (i in 1:m) { v <- factor( sample(1:l[i], n, replace=T), levels=1:l[i] ) if( is.null(x) ) x <- data.frame(v) else x <- data.frame(x, v) } x <- cbind( x, matrix( rnorm(n*p), nr=n, nc=p ) ) names(x) <- LETTERS[1:(m+p)] x1 <- x my.am(x1) %-- If the variables are dependant, we can see that the qualitative variables are linked (as with multiple correspondance analysis), but there is nothing obvious with the quantitative variable... %G n <- 500 m <- 3 p <- 2 l <- c(3,2,5) x <- NULL for (i in 1:m) { v <- factor( sample(1:l[i], n, replace=T), levels=1:l[i] ) if( is.null(x) ) x <- data.frame(v) else x <- data.frame(x, v) } x <- cbind( x, matrix( rnorm(n*p), nr=n, nc=p ) ) names(x) <- LETTERS[1:(m+p)] x[,3] <- factor( ifelse( runif(n)>.8, x[,1], x[,3] ), levels=1:l[1]) x[,5] <- scale( ifelse( runif(n)>.9, as.numeric(x[,1]), as.numeric(x[,2]) )) + .1*rnorm(n) x2 <- x my.am(x2) %-- If we center the matrix, the results are meaningless (however, we can center the columns for the quantitative variables). %G ne.pas.centrer <- function (y) { centre <- apply(y, 2, mean) y - matrix(centre, nr=n, nc=dim(y)[2], byrow=T) } my.am(x1) %-- %G my.am(x2) %-- + Mixing qualitative and quantitative variables (a solution) To plot, on the same graph, qualitative and quantitative variables, a simple solution is to turn the quantitative variables into qualitative variables. People will tell you that this is always a bad idea: you lose "some" information. And indeed, there is not much to be seen: this is rather unhelpful... %G to.factor.numeric.vector <- function (x, number) { resultat <- NULL intervalles <- co.intervals(x,number,overlap=0) for (i in 1:number) { if( i==1 ) intervalles[i,1] = min(x) else intervalles[i,1] <- intervalles[i-1,2] if( i==number ) intervalles[i,2] <- max(x) } for (valeur in x) { r <- NA for (i in 1:number) { if( valeur >= intervalles[i,1] & valeur <= intervalles[i,2] ) r <- i } resultat <- append(resultat, r) } factor(resultat, levels=1:number) } to.factor.vector <- function (x, number) { if( is.factor(x) ) return(x) else return(to.factor.numeric.vector(x,number)) } to.factor <- function (x, number=4 ) { y <- NULL for (a in x) { aa <- to.factor.vector(a, number) if( is.null(y) ) y <- data.frame(aa) else y <- data.frame(y,aa) } names(y) <- names(x) y } my.am(to.factor(x1)) %-- Let us cut the quantitative variables into three pieces instead of four. %G my.am(to.factor(x1, number=3)) %-- With dependant variables. %G my.am(to.factor(x2)) %-- %G my.am(to.factor(x2, number=3)) %-- + TODO Check the "ade4" package. 
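An untested sketch: the "ade4" package has a dudi.mix() function designed for data frames mixing quantitative variables and factors (the argument names below follow the usual dudi.* conventions and are assumptions on my part), which could be tried on the x2 data frame built above.

library(ade4)
r <- dudi.mix(x2, scannf=FALSE, nf=2)   # x2: the mixed data frame above
scatter(r)                              # Rows and columns on the same plot
r$eig                                   # Eigenvalues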
* Log-linear model (Poisson Regression)

The log-linear model has a role similar to that of multiple correspondance analysis -- with the added advantage that, as with any regression, we can perform tests. We start with a contingency table t_{i,j} and we try to write

  log t_{i,j} = a_0 + a_1(i) + a_2(j) + a_3(i,j).

This can be immediately generalized to higher dimensions (but the notations become unreadable); usually we only consider interactions of two or three variables, not more, otherwise we would have too many parameters for the number of observations at hand. With R, it is simply the "glm" function, with the "family=poisson" argument.

m <- data.frame(table(random.data.1()))
y <- m[,4]
x1 <- m[,1]
x2 <- m[,2]
x3 <- m[,3]
summary(glm( y ~ x1 + x2 + x3, family=poisson ))

It yields:

> summary(glm( y ~ x1 + x2 + x3, family=poisson ))
Call:
glm(formula = y ~ x1 + x2 + x3, family = poisson)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.6060  -0.8723  -0.2724   0.4264   3.2634
Coefficients:
             Estimate Std. Error  z value Pr(>|z|)
(Intercept) 9.981e-01  2.965e-01    3.366 0.000762 ***
x12         3.314e-01  2.476e-01    1.338 0.180799
x13         1.643e-01  2.568e-01    0.640 0.522302
x22        -4.000e-02  1.999e-01   -0.200 0.841418
x32         5.129e-02  3.203e-01    0.160 0.872784
x33         5.057e-09  3.244e-01 1.56e-08 1.000000
x34         1.001e-01  3.163e-01    0.316 0.751674
x35         1.001e-01  3.166e-01    0.316 0.751912
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 38.794  on 29  degrees of freedom
Residual deviance: 36.735  on 22  degrees of freedom
AIC: 139.11
Number of Fisher Scoring iterations: 4

TODO: understand this result
TODO: explain how to add/remove some interaction terms

> r1 <- glm( y ~ x1 + x2 + x3 + x1*x2 + x2*x3 + x1*x3, family=poisson )
> r2 <- glm( y ~ x1 + x2 + x3, family=poisson )
> anova(r1,r2)
Analysis of Deviance Table
Model 1: y ~ x1 + x2 + x3 + x1 * x2 + x2 * x3 + x1 * x3
Model 2: y ~ x1 + x2 + x3
  Resid. Df Resid. Dev  Df Deviance
1         8     6.9906
2        22    13.1344 -14  -6.1438

* Discriminant Analysis

TODO: Introduction
TODO: This is supervised learning, it should make up a separate chapter.

+ Discriminant Analysis

We try to predict a qualitative variable, such as "has the patient just had a heart attack?", with quantitative variables, corresponding to potential risk factors (blood pressure, daily calorie intake, workout, weight, etc.). We try to determine which factors are the most prevalent, i.e., which variables best determine the qualitative one. We can also use discriminant factor analysis in a predictive way, i.e., to answer the question "will the patient have a heart attack?".

TODO: explain how we do this.

More precisely, the variable to predict clusters the subjects into several classes and we are looking for a linear combination of the predictive variables on which the separation between the clusters is most visible.

TODO: check what follows.
We try to maximize the ratio of the between-group variance of this linear combination to its within-group variance (as in anova).

For a binary variable:
  Normalize the predictive variables.
  Compute their variance-covariance matrix A.
  Compute the vector d of the differences of the group means.
  One can show (?) that the weights w we are looking for satisfy A w = d, i.e., w = A^-1 d.

(It is a good idea to plot the observations -- in dimension higher than two, you can use ggobi to rotate the figure.)
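To make this recipe concrete, here is a minimal sketch on made-up data (the names g, A, d, w are mine): we compute the weights by hand and check that they give the same direction, up to sign and scale, as the one found by the "lda" function.

library(MASS)
set.seed(1)
n <- 100
g <- rep(0:1, each = n)                        # The binary variable to predict
x1 <- rnorm(2*n, mean = 2*g)                   # Two correlated predictors
x2 <- .5*x1 + rnorm(2*n, mean = g)
X <- scale(cbind(x1, x2))                      # Normalize the predictive variables
A <- var(X)                                    # Their variance-covariance matrix
d <- colMeans(X[g==1,]) - colMeans(X[g==0,])   # Differences of the group means
w <- solve(A, d)                               # The weights: solve A w = d
w / sqrt(sum(w^2))
# Same direction (up to sign and scale) as the one found by lda():
r <- lda(X, grouping = factor(g))
r$scaling / sqrt(sum(r$scaling^2))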
Diagnostics:

  Confusion matrix:
    rows:    real values
    columns: predicted values
    entries: number of cases

From the confusion matrix, we can build more than a dozen different measures of the relevance of those results.

http://149.170.199.144/multivar/da4.htm

We can first try to estimate the dimension, for instance with a principal component analysis. One can also perform a stepwise discriminant analysis.

%G

library(MASS)
n <- 100
k <- 5
x1 <- runif(k,-5,5) + rnorm(k*n)
x2 <- runif(k,-5,5) + rnorm(k*n)
x3 <- runif(k,-5,5) + rnorm(k*n)
x4 <- runif(k,-5,5) + rnorm(k*n)
x5 <- runif(k,-5,5) + rnorm(k*n)
y <- factor(rep(1:5,n))
plot(lda(y~x1+x2+x3+x4+x5))

%--

We can choose to retain only the first two dimensions.

%G

plot(lda(y~x1+x2+x3+x4+x5), dimen=2)

%--

Or even, just the first.

%G

op <- par(mar=c(4,4,0,2)+.1)
plot(lda(y~x1))
par(op)

%--

As for Principal Component Analysis, we choose the number of dimensions to retain from the eigen values.

%G

plot(lda(y~x1+x2+x3+x4+x5)$svd, type='h')

%--

Linear discriminant analysis (LDA) can be sensitive to heteroskedasticity problems: if the two groups you are trying to separate have different variances (sizes), the boundary will be shifted towards the more compact one.

+ Quadratic discriminant analysis

The "qda" and "predict.qda" functions play the same role, but in a non-linear (quadratic) setting.

?qda

Also check the "mda" package.

+ Bayesian Discriminant Factor Analysis

It is the same, but we know the prior probabilities P( Y == i ), i.e., the size of the various clusters. (We still use the "lda" function, with one more argument, "prior".)

TODO: Example

* Canonical analysis

+ Canonical Analysis

We consider quantitative variables X1, X2, ..., Xp, Y1, Y2, ..., Yq and we look for a linear combination Xa of the Xi and a linear combination Yb of the Yj so that Xa and Yb be as correlated as possible.

library(ade4)
?cca

See for example part 4 of
http://www.cg.ensmp.fr/%7Evert/talks/021205inria/inria.pdf

+ Co-inertia analysis

Instead of using the correlation, one can use the covariance.

TODO

+ Partial Least Squares

TODO: Explain the relations/differences between PLS and CA.

* Kernel methods

+ Principal Component Analysis and kernel methods

TODO: rewrite the introduction

Methods such as Principal Component Analysis (PCA) can reduce the dimension of the cloud of points but, as they are linear, they will not enable us to "untangle" the data: it will remain curved, it will not become a straight segment. One idea to straighten the curve is to embed our space (which can already be of unreasonably high dimension) into a higher-dimensional space, in a non-linear way. By applying linear methods in this higher-dimensional space, we can hope to see phenomena that were non-linear in the first space. Here is an example of such an embedding:

  (x,y) ---> (x^2, x*y, y^2).

It might sound unreasonable to embed a space whose dimension is already problematic into an even larger one. It would indeed be the case if we needed to compute (and store) the coordinates of the points in this new space -- but actually, we only need to be able to compute scalar products.
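As a small sanity check of that last claim (this example is mine, not part of the original code): for a degree-2 embedding, written with a sqrt(2) factor on the cross term so that the identity is exact, the scalar product of two embedded points equals the square of their ordinary scalar product. It can therefore be computed without ever materializing the embedded coordinates.

# Degree-2 embedding (x,y) -> (x^2, sqrt(2)*x*y, y^2): the sqrt(2) on the
# cross term makes the identity exact.  (Illustration only; the code that
# follows in the text uses its own kernels.)
phi <- function (u) c( u[1]^2, sqrt(2)*u[1]*u[2], u[2]^2 )
u <- c(1, 2)
v <- c(3, -1)
sum( phi(u) * phi(v) )   # Scalar product in the embedded space
sum( u * v )^2           # The same value, computed in the original space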
More precisely, principal component analysis (and other similar techniques) play with the variance-covariance matrix t(m) %*% m In our new space, this becomes: %G # Data n <- 200 # Number of patients, number of columns k <- 10 # Dimension of the ambient space nb.points <- 5 p <- matrix( 5*rnorm(nb.points*k), nr=k ) library(splines) barycentre2 <- function (y,n) { m <- length(y) x <- 1:m r <- interpSpline(x,y) #r <- lm( y ~ bs(x, knots=m) ) predict(r, seq(1,m,length=n))$y } data.curve <- apply(p, 1, barycentre2, n) data.curve <- t(data.curve) pairs(t(data.curve)) %-- %G m <- data.curve mm <- apply(m, 2, function (x) { x %o% x } ) r <- princomp(t(mm)) plot(r) %-- %G pairs(r$scores[,1:5]) %-- Nothing exciting... Let us try with a higher degree. %G # Degree 1 kernel ("noyau" is the French word for "kernel") noyau1 <- function (x,y) { sum(x*y) } m <- data.curve m <- t(t(m) - apply(m,2,mean)) k <- dim(m)[1] wrapper <- function(x, y, my.fun, ...) { sapply(seq(along=x), FUN = function(i) my.fun(x[i], y[i], ...)) } mm <- outer(1:k, 1:k, wrapper, function (i,j) { noyau1(m[,i],m[,j]) }) # Degree 2 kernel noyau2 <- function (x,y) { a <- x*y n <- length(a) i <- gl(n,1,n^2) j <- gl(n,n,n^2) i <- as.numeric(i) j <- as.numeric(j) w <- which( i <= j & j <= k ) i <- i[w] j <- j[w] sum(a[i]*a[j]) } stopifnot( noyau2(1:2,2:1) == 12 ) # Degree 3 kernel noyau3 <- function (x,y) { a <- x*y n <- length(a) i <- gl(n,1,n^3) j <- gl(n,n,n^3) k <- gl(n,n^2,n^3) i <- as.numeric(i) j <- as.numeric(j) k <- as.numeric(k) w <- which( i <= j & j <= k ) i <- i[w] j <- j[w] k <- k[w] sum(a[i]*a[j]*a[k]) } stopifnot( noyau3(1:2,2:1) == 32 ) wrapper <- function(x, y, my.fun, ...) { sapply(seq(along=x), FUN = function(i) my.fun(x[i], y[i], ...)) } k <- dim(m)[1] mm <- outer(1:k, 1:k, FUN=wrapper, my.fun = function (i,j) { noyau3(m[,i],m[,j]) }) r <- princomp(covmat=mm) plot(r) %-- %G #pairs(r$scores[,1:5]) plot((t(m) %*% r$loadings) [,1:2]) %-- %G data.noisy.curve <- data.curve + rnorm(length(data.curve)) k <- dim(data.noisy.curve)[1] mm <- outer(1:k, 1:k, FUN=wrapper, my.fun = function (i,j) { noyau3(data.noisy.curve[,i],data.noisy.curve[,j]) }) r <- princomp(t(data.noisy.curve), covmat=mm) plot(r) %-- %G pairs((t(m) %*% r$loadings) [,1:5]) %-- %G plot((t(m) %*% r$loadings) [,1:2]) lines((t(m) %*% r$loadings) [,1:2], col='red') %-- %G x <- 1:dim(data.noisy.curve)[2] y1 <- (t(m) %*% r$loadings) [,1] y2 <- (t(m) %*% r$loadings) [,2] r1 <- loess(y1~x) r2 <- loess(y2~x) plot(y1,y2) lines(y1,y2, col='red') lines(predict(r1), predict(r2), col='blue', lwd=3) %-- Let us try to combine those kernels. %G noyau <- function (x,y) { noyau1(x,y) + noyau2(x,y) + noyau3(x,y) } mm <- outer(1:k, 1:k, FUN=wrapper, my.fun = function (i,j) { noyau(data.curve[,i],data.curve[,j]) }) r <- princomp(covmat=mm) plot(r) %-- %G pairs((t(m) %*% r$loadings) [,1:5]) %-- Well, in the end it is not very conclusive... + kernlab Actually, there is already a package to perform those computations. TODO: Examples * Neural networks + Introduction TODO... + Perceptron A perceptron is a neural network with no hidden layer and binary output(s). 
%G

perceptron_learn <- function (input, output, max_iterations = 100) {
  stopifnot( is.matrix(input), is.matrix(output),
             is.numeric(input), is.logical(output) )
  stopifnot( dim(input)[2] == dim(output)[2] )
  N <- dim(input)[2]                 # Number of observations
  input <- rbind(input, rep(1, N))   # Biases
  dim_input <- dim(input)[1]
  dim_output <- dim(output)[1]
  W <- matrix(rnorm(dim_input * dim_output),
              nc = dim_input, nr = dim_output)
  hardlim <- function (x) ifelse( x > 0, 1, 0 )
  finished <- FALSE
  remaining_iterations <- max_iterations
  while (! finished) {
    W_previous <- W
    for (i in 1:N) {
      forecast <- hardlim( W %*% input[,i] )
      error <- output[,i] - forecast
      W <- W + error %*% input[,i]
    }
    remaining_iterations <- remaining_iterations - 1
    finished <- remaining_iterations <= 0 |
                sum(abs(W - W_previous)) < 1e-6
  }
  attr(W, "converged") <- all(W == W_previous)
  dimnames(W) <- list(output=NULL, input=NULL)
  class(W) <- "perceptron"
  W
}
perceptron_predict <- function (W, input) {
  input <- as.matrix(input)
  N <- dim(input)[2]
  input <- rbind(input, rep(1, N))   # Biases
  stopifnot( dim(W)[2] == dim(input)[1] )
  W %*% input
}

# Test
k <- 2   # BUG: It does not work with other values...
N <- 50
set.seed(1)
centers <- matrix(rnorm(2*k), nc=2)
output <- sample(1:k, N, replace=T)
input <- t(matrix( rnorm(k*N, mean=centers[output,], sd=.1), nc=2))
output <- t(output == 1)
w <- perceptron_learn(input, output)
plot(t(input), col=1+output, xlab="", ylab="",
     main="Perceptron learning")
abline(-w[3]/w[2], -w[1]/w[2], lwd=3, lty=2)

%--

%%G
# It is not designed to work if the clusters overlap:
# It will fail to converge...
N <- 50
set.seed(2)
centers <- matrix(rnorm(2*k), nc=2)
output <- sample(1:k, N, replace=T)
input <- t(matrix( rnorm(k*N, mean=centers[output,], sd=1), nc=2))
output <- t(output == 1)
w <- perceptron_learn(input, output)
plot(t(input), col = 1 + output, xlab = "", ylab = "",
     main = "Perceptron failing to converge")
abline(-w[3]/w[2], -w[1]/w[2], lwd=3, lty=2)
%--

%G

library(MASS)
z <- output[1,]
x <- input[1,]
y <- input[2,]
r <- lda( z ~ x + y )
n <- 200
x <- rep(seq(-2,2,length=n), each=n)
y <- rep(seq(-2,2,length=n), n)
z <- predict(r, data.frame(x, y))
z <- c(rgb(.7,.7,.7), rgb(1,.7,.7))[ as.numeric(z$class) ]
plot(x, y, col=z, pch=15, xlab="", ylab="",
     main="Non-converging perceptron and LDA")
points(t(input), col = 1 + output)
abline(-w[3]/w[2], -w[1]/w[2], lwd=3, lty=2)

%--

%G

library(e1071)
y <- as.factor(output[1,])
x <- t(input)
r <- svm(x, y)
n <- 200
x <- cbind( rep(seq(-2,2,length=n), each=n),
            rep(seq(-2,2,length=n), n) )
z <- predict(r, x)
z <- c(rgb(.7,.7,.7), rgb(1,.7,.7))[ as.numeric(z) ]
plot(x, col=z, pch=15, xlab="", ylab="",
     main="Non-converging perceptron and SVM")
points(t(input), col = 1 + output)
abline(-w[3]/w[2], -w[1]/w[2], lwd=3, lty=2)

%--

Perceptrons should not be used if the cloud of points is not separable: in this case, prefer logistic regression, linear discriminant analysis (LDA) or support vector machines (SVM).

+ Linear filter neuron

TODO

In the statistical community, this is known as "(least squares) linear regression".

+ Neural networks: TO SORT

Widrow-Hoff learning rule: gradient descent for linear filter neurons -- i.e., for least squares linear regression.

Back-propagation: gradient descent algorithm for multi-layer networks. The tricky point is to compute the gradient (there are too many variables to do this by applying a general method): this is what the back-propagation algorithm provides.
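As an aside, the Widrow-Hoff rule mentioned above is easy to write down explicitly. Here is a minimal sketch on simulated data (w_true, eta and the other names are made up for the example): a single pass of per-observation gradient steps on the squared error already lands close to the least squares coefficients.

set.seed(1)
n <- 1000
x <- cbind(1, rnorm(n), rnorm(n))         # Inputs, with a constant column for the bias
w_true <- c(1, 2, -3)
y <- drop(x %*% w_true) + .1 * rnorm(n)   # Target values
w <- rep(0, 3)                            # Initial weights
eta <- .01                                # Learning rate
for (i in 1:n) {                          # One observation at a time (on-line learning)
  e <- y[i] - sum(w * x[i,])              # Prediction error of the linear neuron
  w <- w + eta * e * x[i,]                # Widrow-Hoff update: gradient step on the squared error
}
w
lm(y ~ x - 1)$coef                        # The least squares solution, for comparison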
Feed-forward neural network: this is the classical 3-layer (i.e., with one hidden layer) neural net, fitted by back-propagation.

Batch training: for each update of the weights, you use all the data.

Incremental training, on-line learning: you only use one observation to update the weights.

Variants of gradient descent:

  Gradient descent with momentum (this increases the convergence speed; it is not too dissimilar to tabu search).

  Variable learning rate: if the previous move did not decrease the error, i.e., if we went too far in the direction of the gradient, decrease the learning rate by 30%; if not, increase it by 5%.

  Resilient back-propagation (Rprop): since sigmoid functions tend to squash their inputs, even very wrong weights can induce a small change in the result -- this leads to a very slow convergence towards the optimal weights. Resilient back-propagation only uses the sign of the derivative. The amplitude of the change is controlled by another parameter, which increases if there have been two successive changes in the same direction and decreases if there have been two successive changes in opposite directions.

  Conjugate gradient: Look in the direction of the (opposite of the) gradient and perform a line search in this direction. The next direction is not the opposite of the gradient but a linear combination of the opposite of the gradient and the previous search direction:

    new direction = - gradient + beta * old direction

  where (Fletcher-Reeves update)

    beta = norm(current gradient)^2 / norm(previous gradient)^2

  or (Polak-Ribiere update)

    beta = < current gradient - previous gradient, current gradient > / norm(previous gradient)^2.

  From time to time (e.g., after as many iterations as there are parameters, or when the angle between two successive gradients becomes too high (Powell-Beale restarts)), reset the direction to the opposite of the gradient.

  Quasi-Newton: The Newton method is very fast, but it requires second derivatives (hessian matrices), which are too expensive to compute for neural networks. Quasi-Newton algorithms try to approximate the hessian without the second-derivative computational overhead.

  BFGS quasi-Newton algorithm: an approximation of the hessian matrix is stored and updated at each iteration, using the gradient. (details?)

  One-step secant quasi-Newton algorithm: idem, but assumes that the previous estimate of the hessian was the identity matrix -- hence, there is no need to store it.

  Levenberg-Marquardt: This is yet another quasi-Newton method:

    x[k+1] = x[k] - (J' J + mu I)^-1 J' error

  where

    x: parameters to be estimated
    J = d error / d weights
    gradient = J' error
    hessian  = J' J (approximately)
    mu = 0:  this is Newton's method, with an approximate hessian
    mu >> 0: this is a gradient method with a small step size

  Reduced-memory Levenberg-Marquardt: the matrix J is huge (its number of rows is the number of training cases). If it does not fit in memory, you can cut it into blocks.

Line search routines (for conjugate gradient methods):

  Quadratic interpolation
  Golden section search
  Brent's search (hybrid of quadratic interpolation and the golden section search)
  Hybrid bisection-cubic search
  Charalambous search
  Backtracking search

Avoiding overfitting:

  Stop the algorithm when the performance on a validation sample starts to decrease (it may sound reasonable, but this is approximately a Bayesian fit with the initial (random) weights as a prior...). And you must be careful not to choose too fast an algorithm...
Choose a network small enough Add a penalty (the sum of the squares of the weights) to the objective function, to shrink the weights towards zero (try to have inputs, outputs and weights of the same order of magnitude, say in [-1,1]). Objective = lambda * Mean Square Error + (1 - lambda) * Mean Square Weight + Neural networks and dimension reduction You can reduce the dimension of a cloud of points with a 5-layer neural network: an input layer (with the data), a (non-linear) compression layer, a (linear) layer representing the data in the low-dimensional space, a (non-linear) uncompression layer, and an output layer (on which we expect to get the initial data back). I do not think that such networks are already implemented in R, so we shall do it ourselves -- we shall be EXTREMELY careful with convergence problems. As I do not really know neural networks (I was superficially interested in the subject 10 years ago), I browse the net to understand them http://www.willamette.edu/%7Egorr/classes/cs449/backprop.html and I write the following code. TODO: start with a smaller example, on which the computations will be faster and for which we know waht kind of result we should expect. For instance, a 2-dimensional cloud in a 3-dimensional space. with only linear neurons and only two neurons in the middle layer. Problem: in this case, the intermediary layers are useless and may unstabilize the algorithm. # x: matrix whose columns are the vectors to learn # n: number of neurons in the middle layer # m: number of neurons in the two other hidden layers # (you should have m>n) drnn <- function (M, n, m, e=.1, N=100) { # Activation fucntions id <- function (t) { t } f1 <- tanh f2 <- id f3 <- tanh f4 <- id # Number of neurons in each layer n0 <- dim(M)[1] n1 <- m n2 <- n n3 <- m n4 <- n0 # The weights -- initialized by random values w1 <- matrix( rnorm(n0*n1), nc=n0, nr=n1 ) w2 <- matrix( rnorm(n1*n2), nc=n1, nr=n2 ) w3 <- matrix( rnorm(n2*n3), nc=n2, nr=n3 ) w4 <- matrix( rnorm(n3*n4), nc=n3, nr=n4 ) # Les biais b1 <- rnorm(n1) b2 <- rnorm(n2) b3 <- rnorm(n3) b4 <- rnorm(n4) # The algorithm (backpropagation) r <- list() err <- rep(0, N) for (i in 1:N) { cat(paste("Iteration", i, "\n")) res <- matrix(NA, nr=n, nc=dim(M)[2]) for (j in 1:dim(M)[2]) { x <- M[,j] # Computing the valu of the nodes y0 <- x y1 <- f1( w1 %*% y0 + b1 ) y2 <- f2( w2 %*% y1 + b2 ) y3 <- f3( w3 %*% y2 + b3 ) y4 <- f4( w4 %*% y3 + b4 ) # Computing the errors d4 <- x - y4 d3 <- (t(w4) %*% d4) * (1 - y3^2) d2 <- (t(w3) %*% d3) * 1 d1 <- (t(w2) %*% d2) * (1 - y1^2) # Updating the weights and the biases dw4 <- d4 %*% t(y3) dw3 <- d3 %*% t(y2) dw2 <- d2 %*% t(y1) dw1 <- d1 %*% t(y0) w4 <- w4 + e * dw4 w3 <- w3 + e * dw3 w2 <- w2 + e * dw2 w1 <- w1 + e * dw1 b4 <- b4 + e * d4 b3 <- b3 + e * d3 b2 <- b2 + e * d2 b1 <- b1 + e * d1 res[,j] <- y2 err[i] <- err[i] + sum(d4^2) } r[[i]] <- res } list(scores=res, errors=err, r=r) } r <- drnn(data.curve, 1, 4, N=30, e=.02) plot(r$err, ylim=range(c(0,r$err)), type='l') %-- i <- order(drop(r$scores)) plot(t(data.curve)[,1:2]) lines(t(data.curve)[,1:2][i,], col='red') %-- Well, it is not exactly what we were expecting. But it is still better than nothing... op <- par(mfrow=c(2,2)) for (k in c(1, 3, 7, 30)) { i <- order(drop(r$r[[k]])) plot(t(data.curve)[,1:2], main=paste("after",k,"iteration(s)")) lines(t(data.curve)[,1:2][i,], col='red') } par(op) %-- Changing the order in which we give the points to the network has no significant effect. 
op <- par(mfrow=c(2,4)) for (i in 1:4) { m <- data.curve[,sample(1:dim(data.curve)[2])] r <- drnn(m, 1, 4, N=30, e=.02) plot(r$err, ylim=range(c(0,r$err)), type='l') i <- order(drop(r$scores)) plot(t(m)[,1:2]) lines(t(m)[,1:2][i,], col='red') } par(op) %-- The plot of the error is VERY worrying: the error is supposed to decrease at each iteration -- it sometimes increases. Do not use this code unless you really understand it and think it is correct. Let us try with two neurons in the middle layer. i <- sample(1:dim(data.curve)[2]) m <- data.curve[,i] r <- drnn(m, 2, 8, N=30, e=.02) plot(r$err, ylim=range(c(0,r$err)), type='l') plot(t(r$scores)[i,], type='l') %-- (Yes, it is supposed to be a curve...) TODO: look more precisely what happens when we change one of the parameters. Each time, display at least 4 plots: the error (that should be decreasing), the first path, the last path. TODO: Check that my implementation is correct. The evolution of the error term suggest there IS a problem. + Neural networks and dimension reduction: examples Neural networks can help increase the coverage and the precision of OS fingerprinting (this could apply to any rules-based classifier): http://actes.sstic.org/SSTIC06/Fingerprinting_par_reseaux_neuronaux/ * Dimension reduction: TODO: Rewrite/Remove this section 0. Data 1. Kernels 2. Neural networks 3. k-means + MST 4. Isomap 5. Optimisation 6. pruned MST 7. TSP 8. SVM + Problem Let us consider a cloud of (40) points in a space of reasonable dimension (10) or of less reasonable dimension (7000). The points are more or less aligned on a curve we are trying to describe. The real situation was the following. With DNA array, we examined the expression of 7000 genes in 40 patients, representing different stages of a single disease. We can see this pattern by a Principal Component Analysis (PCA) (or via other dimension reduction techniques): the points give the impression of forming a curve. In dimension two, the curve may seem singular, it seem to cross itself, but in higher dimension, it looks smooth. With the expression of those genes (all of them or, rather, just a subset of them), we would like to define an "index" indicating the position on the curve and, thus, the current stage of the disease. We shall first describe the data (both simulated data and real data), apply various classical dimension reduction techniques (PCA, PCA after non-linear transformations, neural networks), some less classical ones (based on optimization problems such as the traveling salesman problem) and we shall compare all those methods. + Data Simulated data: broken line. %G n <- 200 # Number of patients, number of columns k <- 10 # Dimension of the ambient space nb.points <- 5 p <- matrix( 5*rnorm(nb.points*k), nr=k ) barycentre <- function (x, n) { # Add number between the values of x in order to get a length n vector i <- seq(1,length(x)-.001,length=n) j <- floor(i) l <- i-j (1-l)*x[j] + l*x[j+1] } m <- apply(p, 1, barycentre, n) data.broken.line <- t(m) pairs(t(data.broken.line)) %-- %G plot(princomp(t(data.broken.line))) %-- %G pairs(princomp(t(data.broken.line))$scores[,1:5]) %-- Simulated data: noisy broken line %G data.noisy.broken.line <- data.broken.line + rnorm(length(data.broken.line)) pairs(princomp(t(data.noisy.broken.line))$scores[,1:5]) %-- Simulated data: curve -- we just change our "barycentre" function. 
%G library(splines) barycentre2 <- function (y,n) { m <- length(y) x <- 1:m r <- interpSpline(x,y) #r <- lm( y ~ bs(x, knots=m) ) predict(r, seq(1,m,length=n))$y } k <- 5 y <- sample(1:100,k) x <- seq(1,k,length=100) plot(barycentre2(y,100) ~ x) lines(y, col='red', lwd=3, lty=2) %-- %G data.curve <- apply(p, 1, barycentre2, n) data.curve <- t(data.curve) pairs(t(data.curve)) %-- %G plot(princomp(t(data.curve))) %-- %G pairs(princomp(t(data.curve))$scores[,1:5]) %-- Some more noise: %G data.noisy.curve <- data.curve + rnorm(length(data.curve)) pairs(t(data.noisy.curve)) %-- %G plot(princomp(t(data.noisy.curve))) %-- %G pairs(princomp(t(data.noisy.curve))$scores[,1:5]) %-- Let us think about the shape of the curve we would expect. If the disease is lethal (or if, for all the patients of the study, it evolves towards death), we can expect a segment, curved, one end representing healthy people, the other dead patients. If the disease is benign, we would expect the points to form a (deformed) circle, representing the successives stages of the disease, until a once-again healthy patient. If the disease is sometimes lethal, sometimes not, we could expect a mixture of both: the reunion of a circle and a segment. %G random.rotation.matrix <- function (n=3) { m <- NULL for (i in 1:n) { x <- rnorm(n) x <- x / sqrt(sum(x*x)) y <- rep(0,n) if (i>1) for (j in 1:(i-1)) { y <- y + sum( x * m[,j] ) * m[,j] } x <- x - y x <- x / sqrt(sum(x*x)) m <- cbind(m, x) } m } n <- 200 k <- 10 x <- seq(0,2*pi,length=n) data.circle <- matrix(0, nr=n, nc=k) data.circle[,1] <- cos(x) data.circle[,2] <- sin(x) data.circle <- data.circle %*% random.rotation.matrix(k) data.circle <- t( t(data.circle) + rnorm(k) ) pairs(data.circle[,1:3]) %-- %G data.circle <- data.circle + .1*rnorm(n*k) pairs(data.circle[,1:3]) %-- %G pairs(princomp(data.circle)$scores[,1:3]) %-- TODO: The other example Real data: %G data.real <- read.table("Tla_z.txt", sep=",") data.real.group <- factor(substr(names(data.real),0,1)) r <- prcomp(t(data.real)) plot(r$sdev, type='h') %-- %G data.real.3d <- r$x[,1:3] pairs(data.real.3d, pch=16, col=as.numeric(data.real.group)) %-- To see more clearly what happens, let us add an ellipse for each group of points. %G draw.ellipse <- function ( x, y=NULL, N=100, method=lines, ... ) { if (is.null(y)) { y <- x[,2] x <- x[,1] } centre <- c(mean(x), mean(y)) m <- matrix(c(var(x),cov(x,y), cov(x,y),var(y)), nr=2,nc=2) e <- eigen(m) r <- sqrt(e$values) v <- e$vectors theta <- seq(0,2*pi, length=N) x <- centre[1] + r[1]*v[1,1]*cos(theta) + r[2]*v[1,2]*sin(theta) y <- centre[2] + r[1]*v[2,1]*cos(theta) + r[2]*v[2,2]*sin(theta) method(x,y,...) } draw.star <- function (x, y=NULL, ...) { if (is.null(y)) { y <- x[,2] x <- x[,1] } d <- cbind(x,y) m <- apply(d, 2, mean) segments(m[1],m[2],x,y,...) } my.plot <- function ( d, f=rep(1,dim(d)[1]), col=rainbow(length(levels(f))), variables=NULL, legend=T, legend.position=1, draw=draw.ellipse, ...) { xlim <- range(d[,1]) ylim <- range(d[,2]) if(!is.null(variables)){ xlim <- range(xlim, variables[,1]) ylim <- range(ylim, variables[,2]) } plot(d, col=col[as.numeric(f)], pch=16, xlim=xlim, ylim=ylim, ...) 
for (i in 1:length(levels(f))) { try( draw(d[ as.numeric(f)==i, ], col=col[i]) ) } if(!is.null(variables)){ arrows(0,0,variables[,1],variables[,2]) text(1.05*variables,rownames(variables)) } abline(h=0,lty=3) abline(v=0,lty=3) if(legend) { if(legend.position==1) { l=c( par('usr')[1],par('usr')[4], 0, 1 ) } else if (legend.position==2) { l=c( par('usr')[2],par('usr')[4], 1, 1 ) } else if (legend.position==3) { l=c( par('usr')[1],par('usr')[3], 0, 0 ) } else if (legend.position==4) { l=c( par('usr')[2],par('usr')[3], 1, 0 ) } else { l=c( mean(par('usr')[1:2]), mean(par('usr')[1:2]), .5, .5 ) } legend(l[1], l[2], xjust=l[3], yjust=l[4], levels(f), col=col, lty=1,lwd=3) } } my.plot(data.real.3d[,1:2], data.real.group) %-- %G op <- par(mfrow=c(3,3), mar=.1+c(0,0,0,0)) plot.new(); my.plot(data.real.3d[,c(2,1)], data.real.group, xlab='',ylab='',axes=F,legend=F) box() my.plot(data.real.3d[,c(3,1)], data.real.group, xlab='',ylab='',axes=F,legend=F) box() my.plot(data.real.3d[,c(1,2)], data.real.group, draw=draw.star, xlab='',ylab='',axes=F,legend=F) box() plot.new() my.plot(data.real.3d[,c(3,2)], data.real.group, xlab='',ylab='',axes=F,legend=F) box() my.plot(data.real.3d[,c(1,2)], data.real.group, draw=draw.star, xlab='',ylab='',axes=F,legend=F) box() my.plot(data.real.3d[,c(2,3)], data.real.group, draw=draw.star, xlab='',ylab='',axes=F,legend=F) box() plot.new() par(op) %-- You can also turn the plot to see it from various angles. %G set.seed(66327) random.rotation.matrix <- function (n=3) { m <- NULL for (i in 1:n) { x <- rnorm(n) x <- x / sqrt(sum(x*x)) y <- rep(0,n) if (i>1) for (j in 1:(i-1)) { y <- y + sum( x * m[,j] ) * m[,j] } x <- x - y x <- x / sqrt(sum(x*x)) m <- cbind(m, x) } m } op <- par(mfrow=c(3,3), mar=.1+c(0,0,0,0)) for (i in 1:9) { #plot( (data.real.3d %*% random.rotation.matrix(3))[,1:2], # pch=16, col=as.numeric(data.real.group), # xlab='', ylab='', axes=F ) my.plot((data.real.3d %*% random.rotation.matrix(3))[,1:2], data.real.group, draw=draw.ellipse, xlab='',ylab='',axes=F,legend=F) box() } par(op) %-- + Other idea + TODO Some time ago, I have seen an article about dimension reduction (in Rnews, or a vignette). Read it. Implement Isomap (easy: use MST) Other articles google: dimensionality reduction + Other idea (bad) We can formulate our problem as an optimization problem: we are looking for the coordinates of 5 points A1, A2, A3, A4, A5 in order to minimize the sum of the (squares of the) distances of the points in the cloud to the broken line A1-A2-A3-A4-A5. TODO: To check that this algorithm, first give it: - a single segment to find, in dimension 1 - a single segment to find, in dimension 2 - a curve (a flat parabola) with 1 segment - a curve (parabola) with two segments - only then our data. distance.broken.line <- function (point, ligne) { # point: vector containing the coordinates of the point (from the cloud) whose distance to the broken line is to be computed. # ligne: coordinates, in columns, of the points defining the broken line. 
# Distance to the vertices of the broken line d.points <- apply( (ligne-point)^2, 2, sum ) # Position of the projection of the point on each of the segments # of the broken line v <- t(apply(ligne,1,diff)) # Vectors A(i)A(i+1) w <- -ligne + point # Vectors A(i)M w <- w[, -dim(w)[2]] # Remove the last lambda <- apply(v*w, 2, sum) / apply(v^2,2,sum) lambda[ is.nan(lambda) ] <- 0 # Distance to the segments of the broken line d.segments <- apply( (w - t(lambda*t(v)))^2, 2, sum ) d.segments <- d.segments[ lambda>0 & lambda<1 ] min(c( d.points, d.segments )) } # Tests: the following distances ar close to zero stopifnot( distance.broken.line(p[,1], p) < 1e-6 ) stopifnot( distance.broken.line(.5*p[,1]+.5*p[,2], p) < 1e-6 ) # There is a BIG speed problem: in the following example, we need # two seconds to compute the distance... sum(apply(data.noisy.broken.line, 2, distance.broken.line, p)) broken.line <- function ( d, # data np=4, # Number of segments in the broken line deb = as.vector(d[,sample(1:dim(d)[2], np)]) # starting values: broken line through np poins of the cloud, at random. ) { f <- function (x) { sum(apply(d, 2, distance.broken.line, matrix(x, nc=np) )) } # Very, very long... r <- optim(deb, f, control=list(trace=1, maxit=50)) # Default: 500 iterations list(par=r$par, method="optim", value=r$value, deb=deb, np=np, r=r) } r <- broken.line(data.noisy.broken.line) pc <- princomp(t(data.noisy.broken.line)) plot(pc$scores[,1:2]) lines( (t(matrix(r$deb,nc=r$np) - pc$center) %*% pc$loadings)[,1:2], col='red',lwd=3,lty=3) lines((t(matrix(r$par,nc=r$np) - pc$center) %*% pc$loadings)[,1:2], col='blue',lwd=3,lty=2) %-- # Example: a segment, in dimension 1 l <- c(-5, 3, 2, -1) deb <- c(1,2,3,4) d <- rbind( seq(l[1],l[2],length=10), seq(l[3],l[4],length=10) ) r <- broken.line(d, np=2, deb=deb) plot(t(d), xlim=c(-10,10), ylim=c(-10,10)) # Be careful about the order of the parameters to estimate... for (i in 1:100) { lines(matrix(sample(r$par), nr=2) ) } %-- # Starting values: some points of the cloud, at random op <- par(mfrow=c(3,3), mar=.1+c(0,0,0,0)) for (i in 1:9) { l <- runif(4, -8,8) d <- rbind( seq(l[1],l[2],length=10), seq(l[3],l[4],length=10) ) r <- broken.line(d, np=2) plot(t(d), xlim=c(-10,10), ylim=c(-10,10), xlab='', ylab='', axes=F) box() lines( r$par[c(1,3)], r$par[c(2,4)], col='red' ) } par(op) %-- # Starting values: random op <- par(mfrow=c(3,3), mar=.1+c(0,0,0,0)) for (i in 1:9) { l <- runif(4, -8,8) deb <- runif(4, -8,8) d <- rbind( seq(l[1],l[2],length=10), seq(l[3],l[4],length=10) ) r <- broken.line(d, np=2, deb=deb) plot(t(d), xlim=c(-10,10), ylim=c(-10,10), xlab='', ylab='', axes=F) box() lines( r$par[c(1,3)], r$par[c(2,4)], col='red' ) lines( deb[c(1,3)], deb[c(2,4)], col='blue', lty=2 ) } par(op) %-- Aaaaaaahhhhhhhhhhhh... It does not converge -- not at all. Let us try again, by running the algorithm several times. op <- par(mfrow=c(3,3), mar=.1+c(0,0,0,0)) for (i in 1:9) { l <- runif(4, -8,8) deb <- runif(4, -8,8) d <- rbind( seq(l[1],l[2],length=10), seq(l[3],l[4],length=10) ) plot(t(d), xlim=c(-10,10), ylim=c(-10,10), xlab='', ylab='', axes=F) box() lines( deb[c(1,3)], deb[c(2,4)], col='blue', lty=2 ) for (j in 1:10) { r <- broken.line(d, np=2, deb=deb) lines( r$par[c(1,3)], r$par[c(2,4)], col='red' ) deb <- r$par } } par(op) %-- No, it really fails to converge. TODO: look at the value of the distance at the end of the algorithm... I get very small values while the segment id rather far away from the points: why? (this bug could account for the convergence problems...) 
There is another problem: the solution we are looking for is not unique. If a segment is a solution of our problem, any larger segment is also a solution.

It converges too slowly to get a reliable result. Let us try with the "nlm" function ("optim" should work fine with "naughty" functions, "nlm" on "nice" functions -- our function is piecewise quadratic, with a unique local minimum).

# Even longer...
r <- nlm(f, deb, iterlim=20)   # By default, iterlim=100...
pc <- princomp(t(data.noisy.broken.line))
plot(pc$scores[,1:2])
lines( (t(matrix(deb, nc=np) - pc$center) %*% pc$loadings)[,1:2],
       col='red', lwd=3, lty=3 )
lines((t(matrix(r$estimate,nc=np) - pc$center) %*% pc$loadings)[,1:2],
      col='blue', lwd=3, lty=2)

In the previous example (had it worked), the curve was made of 4 segments and we were looking for a curve made of 3 segments. If we had asked for 5 segments, we would have had a segment of length 0, or an unused segment at the end of the curve. To avoid this, we can add a penalty if the segments do not have the same length.

# I add the sum of the inverses of the squares of the lengths of the
# segments.
# TODO: This is a BAD idea: it favours long, useless segments.
f <- function (x) {
  sum(apply(data.noisy.broken.line, 2, distance.broken.line, matrix(x, nc=np) )) +
  sum( 1/apply(t(apply(matrix(x,nc=np), 1,diff))^2, 2, sum) )
}

TODO: computation with 6 or 7 segments to find a broken line with only 5.
TODO: Similarly, we could add a penalty for the total length of the curve.
TODO: We could do the same thing with splines.

+ Traveling Salesman Problem (TSP)

The Traveling Salesman Problem is the following: we have a cloud of points (in a metric space: think "cities on a map") and we are looking for a path, as short as possible, that visits each point. In other words, we want to find an order on the points that minimizes a certain function. Presented like that, the problem already has a lot of applications (for instance, the path of a delivery car, or the path of a soldering iron on an electronic appliance assembly line), but other seemingly unrelated problems can be formulated as a TSP (for instance, some problems in gene mapping -- see Setubal's book -- I confess I have never read it).

We can apply a TSP-solving algorithm to our cloud of points: we get an order on those points, that can be seen as a broken line. We then smooth this broken line: we get a curve, parametrized by [0,1] -- this parameter is the index we were looking for. Finally, given a new point, we just have to project it onto the curve to get the value of the parameter.

Unfortunately, no TSP-solving algorithm is implemented in R: we shall have to program one ourselves.

+ TSP: descent

Here is an implementation using descent: we start with a random path (i.e., an order on the points) and we try to transform it (by interchanging the rank of two points) to get a shorter path.

%G

longueur <- function (d,o) {
  n <- length(o)
  sum(diag( d [o[1:(n-1)],] [,o[2:n]] ))
}
tsp.descent <- function (d, N=1000) {
  # d: distance matrix
  d <- as.matrix(d)
  n <- dim(d)[1]
  o <- sample(1:n)
  v <- longueur(d,o)
  k <- 0
  res <- list()
  k.res <- c()
  l.res <- c(v)
  while (k