Clustering

We now consider the problem of classifying the observations (or subjects)
into several groups or classes.  For instance, we have weighed and measured
some animals and we would like to know if there are several species among
them: do the data form a single cluster or several?  This is called
"unsupervised" learning, because we know neither the classes nor even their
number.  When we know the classes and try to find criteria to assign new
observations to one of them, we call that "supervised learning" -- it is akin
to a regression, with a qualitative variable to predict.

* Non-hierarchical clustering (k-means)

We choose k points, the "centers", and we cluster the data into k classes,
assigning each point to its nearest center.  We then replace the centers by
the centers of gravity of the points in their class.  And we iterate.  This
method has several problems:

- the result depends on the initial centers chosen (we can start again with
  new centers and compare the results -- or even adapt the algorithm with
  genetic algorithms);
- it is not well-adapted to noisy data;
- we have to know the number of clusters;
- the algorithm implicitly assumes that the clusters have the same size
  (while one application of clustering is "outlier detection", with two
  classes: a large one, containing most of the data, and a smaller one,
  containing the outliers);
- it assigns each point to a class, even if there is some ambiguity (we
  would like to know whether the classification of a new point is
  ambiguous).

The k-means algorithm proper is very similar.  TODO: describe it in more
detail.  One description of "the nonhierarchical k-means or iterative
relocation algorithm": each case is initially placed in one of k clusters;
cases are then moved between clusters if doing so reduces the differences
between cases within a cluster.

%G
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 2, 20)
plot(x, col = cl$cluster, pch = 3, lwd = 1)
points(cl$centers, col = 1:2, pch = 7, lwd = 3)
segments( x[cl$cluster==1,][,1], x[cl$cluster==1,][,2],
          cl$centers[1,1], cl$centers[1,2] )
segments( x[cl$cluster==2,][,1], x[cl$cluster==2,][,2],
          cl$centers[2,1], cl$centers[2,2], col = 2 )
%--

Let us try this with my pupils' marks (I spent one year as a high school
teacher -- a dreadful experience that I would not recommend to anyone).

%G
# This is not what we want...
load(file="data_notes.Rda")   # reads in the "notes" matrix, defined
                              # in a preceding chapter
cl <- kmeans(notes, 6, 20)
plot(notes, col = cl$cluster, pch = 3, lwd = 3)
points(cl$centers, col = 1:6, pch = 7, lwd = 3)
%--

This is not what we want: we got the plot in the first two coordinates of
the data set instead of its first two principal components.

%G
n <- 6
cl <- kmeans(notes, n, 20)
p <- princomp(notes)
u <- p$loadings
x <- (t(u) %*% t(notes))[1:2,]
x <- t(x)
plot(x, col = cl$cluster, pch = 3, lwd = 3)
c <- (t(u) %*% t(cl$centers))[1:2,]
c <- t(c)
points(c, col = 1:n, pch = 7, lwd = 3)
for (i in 1:n) {
  print(paste("Cluster", i))
  for (j in (1:length(notes[,1]))[cl$cluster==i]) {
    print(paste("Point", j))
    segments( x[j,1], x[j,2], c[i,1], c[i,2], col = i )
  }
}
text( x[,1], x[,2], attr(x, "dimnames")[[1]] )
%--
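One of the problems listed above is choosing the number of clusters.  A
common informal heuristic (a sketch only, not part of the original text) is
to run kmeans for several values of k and look for an "elbow" in the total
within-cluster sum of squares; on the marks data:

%G
# Total within-cluster sum of squares as a function of the number of
# clusters; the point where the curve stops dropping sharply suggests a
# plausible k.  (kmeans is random: set.seed makes the picture reproducible.)
set.seed(1)
wss <- sapply(1:10, function (k) sum(kmeans(notes, k, 20)$withinss))
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
%--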
In fact, the "clusplot" function already produces this kind of plot (the
clusters, drawn on the first two principal components).  We give it the
result of one of the following functions, which are variants of the k-means
algorithm.

  pam     partitioning around medoids (more robust)
  clara   another partitioning method, well-suited to large data sets
          (the idea is simply to work on a sample from the large data set)
  daisy   dissimilarity matrix: qualitative or quantitative variables
  dist    dissimilarity matrix: quantitative variables only
  fanny   fuzzy clustering

TODO: explain (here or elsewhere) the assumed properties of the clusters for
these different methods: same size or not, same shape (variance) or not, etc.

Here are a few sample results.

%G
library(cluster)
clusplot( notes, pam(notes, 6)$clustering )
%--

%G
clusplot( notes, pam(notes, 2)$clustering )
%--

%G
clusplot( daisy(notes), pam(notes,2)$clustering, diss=T )
%--

%G
clusplot( daisy(notes), pam(notes,6)$clustering, diss=T )
%--

%G
clusplot(fanny(notes,2))
%--

%G
clusplot(fanny(notes,10))
%--

* Hierarchical Classification (dendrogram)

We start with a cloud of n points and we put each point in a class of its
own (so we have n classes, each containing a single point).  Then, we look
for the two closest classes (for a given distance -- for instance, the
distance between their centers of gravity; other distances will give other
results, perhaps more appropriate for some data sets) and we join them into
a new class.  We now have n-1 classes, all but one containing a single
element.  Then we iterate.  The result can be presented as a dendrogram: the
leaves are the initial 1-element classes and the various "cuts" of the
dendrogram (by a horizontal line in the following plots) are the various
clusterings of the data, into a decreasing number of classes.

%G
data(USArrests)
hc <- hclust(dist(USArrests), "ave")
plot(hc)
%--

%G
plot(hc, hang = -1)
%--

Here is the result with my pupils' marks: I try to plot the tree onto the
first two components of the PCA.

%G
plot(hclust(dist(notes)))
%--

%G
# Draw the dendrogram on the first two principal components: y will contain
# the centers of gravity of the already-merged classes, n their sizes; the
# last six merges are numbered.
p <- princomp(notes)
u <- p$loadings
x <- (t(u) %*% t(notes))[1:2,]
x <- t(x)
plot(x, col="red")
c <- hclust(dist(notes))$merge
y <- NULL
n <- NULL
for (i in (1:(length(notes[,1])-1))) {
  print(paste("Step", i))
  if( c[i,1]>0 ){
    a <- y[ c[i,1], ]
    a.n <- n[ c[i,1] ]
  } else {
    a <- x[ -c[i,1], ]
    a.n <- 1
  }
  if( c[i,2]>0 ){
    b <- y[ c[i,2], ]
    b.n <- n[ c[i,2] ]
  } else {
    b <- x[ -c[i,2], ]
    b.n <- 1
  }
  n <- append(n, a.n+b.n)
  m <- ( a.n * a + b.n * b )/( a.n + b.n )
  y <- rbind(y, m)
  segments( m[1], m[2], a[1], a[2], col="red" )
  segments( m[1], m[2], b[1], b[2], col="red" )
  if( i > length(notes[,1])-1-6 ) {
    op <- par(ps=30)
    text( m[1], m[2], paste(length(notes[,1])-i), col="red" )
    par(op)
  }
}
text( x[,1], x[,2], attr(x, "dimnames")[[1]] )
%--

Conclusion: either choose the representation as a tree (and forget about the
PCA, at least for a moment), or only plot the first two or three levels of
the tree -- as we did above.

Also see the "mclust" package (model-based clustering), which models the
data as a mixture of gaussians (i.e., as a superposition of clusters).
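A minimal sketch of what this looks like (not in the original text; it
assumes the mclust package is installed -- Mclust chooses the number of
gaussian components with the BIC criterion), again plotted on the first two
principal components:

%G
library(mclust)
m <- Mclust(notes)     # model-based clustering, number of clusters chosen by BIC
summary(m)             # chosen number of clusters and covariance model
p <- princomp(notes)
plot(p$scores[,1:2], col = m$classification, pch = 3, lwd = 3,
     main = "Model-based clustering (mclust)")
%--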
library(help=mclust)   # overview of the functions in the package

+ Various shapes of trees

TODO:
- a tree with no clusters
- a tree with a lot of outliers (i.e., a lot of "horizontal branches")
- a tree with "biological" (balanced) clusters
- a tree with "financial" (unbalanced) clusters

%G
library(MASS)      # for sammon()
library(cluster)   # for silhouette()
do.it <- function (x, k=3, main="") {
  if (inherits(x, "dist")) {
    d <- x
    x <- sammon(d)$points
  } else {
    d <- dist(x)
    if (!is.vector(x)) {
      x <- sammon(d)$points
    }
  }
  op <- par(mfrow = c(2,2), mar = c(3,2,4,2)+.1, oma = c(0,0,0,0))
  if (is.vector(x)) {
    hist(x, col="light blue", xlab="", ylab="", main="Data")
  } else {
    plot(x, main="Data")
  }
  hist(as.vector(d), col="light blue", xlab="", ylab="", main="Distances")
  plot(hclust(d), labels=FALSE, main = "Hierarchical clustering")
  plot(silhouette(cutree(hclust(d),k), d), main = "Silhouette")
  par(op)
  mtext(main, line=3, font=2, cex=1.2)
}
do.it(runif(100), k=2, main="Random uniform data")
%--

%G
set.seed(1)
x <- rt(100,2)
do.it(x, main="Random data with fat tails")
%--

%G
x <- matrix(rnorm(10000), nr=100, nc=100)
do.it(x, main="High-dimensional random gaussian data")
%--

%G
set.seed(1)
x <- matrix(rt(10000, df=4), nr=100, nc=100)
do.it(x, k=5, main="High-dimensional, fat tails")
%--

%G
N <- 100
x <- rnorm(N, sample(c(-2,2), N, replace=T))
do.it(x, k=2, main="Two real clusters")
%--

%G
N <- 100   # Number of observations
n <- 10    # Dimension
k <- 3     # Number of clusters
i <- sample( 1:k, N, replace=T )
mu <- matrix(runif(n*k, -2, 2), nr=k, nc=n)
x <- matrix(rnorm(N*n), nc=n) + mu[i,]   # One subject per row
do.it(x, k=3, main="Three clusters, higher dimension")
%--

TODO: Find some financial data
TODO: Find some biological data (multiple alignment)
TODO: Find some bad data (with a lot of horizontal segments)

+ Distance between points

Hierarchical classification requires you to choose a notion of distance
between points and a notion of distance between clusters.

There may be several reasonable choices of distance between points: for
instance, to perform a hierarchical analysis of protein sequences, we can,
after aligning them, count the number of differences.  We can also weight
those differences: if the physico-chemical properties of the amino acids are
similar, the distance will be small; if they are very different, it will be
larger.

(For further analyses, we also have another notion of distance, only
available after the hierarchical classification: the distance measured along
the dendrogram.)

TODO: give an alignment example
TODO: other distance examples

Cosine distance: the scalar product of two unit vectors, i.e. (for centered
variables) their correlation.  It is actually a similarity measure rather
than a distance.  TODO: define "similarity measure".

Count data:
  Chi2: the chi-square statistic of the test for equality.
  Phi2: the chi-square measure normalized by the square root of the combined
        frequency.

Binary (or qualitative) data: there are more than a dozen notions of
distance or similarity (TODO: a reference).  We already mentioned them when
dealing with supervised classification, to assess the quality of the
results: we were comparing the real classes and the predicted classes.
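For mixed data, the daisy function mentioned above computes Gower's
dissimilarity coefficient; a small sketch (the tiny data frame is made up,
for illustration only):

%G
library(cluster)
# With mixed variable types (here, one quantitative and one qualitative
# variable), daisy automatically uses Gower's dissimilarity coefficient.
x <- data.frame(weight     = c(60, 62, 95, 98),
                vegetarian = factor(c("yes", "yes", "no", "yes")))
d <- daisy(x)
round(as.matrix(d), 2)
plot(hclust(d), main = "Dendrogram on Gower dissimilarities")
%--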
+ Mahalanobis distance

TODO

Consider a cloud of points in the plane.  To compute the distance between
two of those points, one could use the euclidean distance.

%G
N <- 1000
d <- matrix(rnorm(2*N), nc=2)
par(mar=c(2,2,4,2))
plot(d, xlim=c(-2,2), ylim=c(-2,2),
     axes = FALSE, xlab="", ylab="",
     main="Contour plot of the euclidean distance to the origin")
box()
abline(h=0, v=0, lty=3)
x <- seq(min(d[,1]), max(d[,1]), length=100)
y <- seq(min(d[,2]), max(d[,2]), length=100)
z <- outer(x, y, function (x,y) sqrt(x^2 + y^2))
contour(x, y, z, add = TRUE, col = "blue", lwd = 3)
%--

But if the cloud of points is more elliptical than spherical, this might not
be the best choice: in the following plot, points 1 and 2 are at the same
euclidean distance from the origin, but point 1 is very far away from the
bulk of the cloud.

%G
N <- 5000
library(MASS)
d <- mvrnorm(N, mu=c(0,0), Sigma=matrix(c(1,.8,.8,1),2))
par(mar=c(2,2,4,2))
plot(d, xlim=c(-2,2), ylim=c(-2,2),
     axes = FALSE, xlab="", ylab="",
     main="Contour plot of the euclidean distance to the origin")
box()
abline(h=0, v=0, lty=3)
x <- seq(min(d[,1]), max(d[,1]), length=100)
y <- seq(min(d[,2]), max(d[,2]), length=100)
z <- outer(x, y, function (x,y) sqrt(x^2 + y^2))
contour(x, y, z, add = TRUE, col = "blue", lwd = 1)
points(sqrt(2)*c(1,-1), sqrt(2)*c(1,1), lwd = 10, cex = 5, pch = 4, col = "blue")
text(sqrt(2)*c(1,-1), sqrt(2)*c(1,1) + .1, c("2", "1"),
     pos = 3, adj = .5, font = 2, cex = 3, col = "blue")
%--

The Mahalanobis distance accounts for this.  It is defined by

  d(x1, x2)^2 = (x1 - x2)' V^-1 (x1 - x2)

where V is the variance matrix of the cloud of points (either the variance
of the gaussian random variable from which the points were sampled, if you
know it, or an estimate of it).

%G
N <- 1000
V <- matrix(c(1,.8,.8,1),2)
d <- mvrnorm(N, mu=c(0,0), Sigma=V)
par(mar=c(2,2,4,2))
plot(d, xlim=c(-2,2), ylim=c(-2,2),
     axes = FALSE, xlab="", ylab="",
     main="Contour plot of the Mahalanobis distance to the origin")
box()
abline(h=0, v=0, lty=3)
x <- seq(min(d[,1]), max(d[,1]), length=100)
y <- seq(min(d[,2]), max(d[,2]), length=100)
z <- outer(x, y,
           function (x, y) {
             sqrt(apply(rbind(x,y) * solve(V, rbind(x,y)), 2, sum))
           })
# BUG: MEMORY PROBLEM...
contour(x, y, z, add = TRUE, col = "blue", lwd = 3)
%--

+ Tracking Error distance

In some situations, some people also advocate the use of the "tracking error
distance",

  d(x1, x2)^2 = (x1 - x2)' V (x1 - x2).

The rationale could be the following: we think the data are inherently
unidimensional and we want to lower the importance of the other, noisy
dimensions.

%G
N <- 1000
V <- matrix(c(1,.8,.8,1),2)
d <- mvrnorm(N, mu=c(0,0), Sigma=V)
par(mar=c(2,2,4,2))
plot(d, xlim=c(-2,2), ylim=c(-2,2),
     axes = FALSE, xlab="", ylab="",
     main="Contour plot of the tracking error distance to the origin")
box()
abline(h=0, v=0, lty=3)
x <- seq(min(d[,1]), max(d[,1]), length=100)
y <- seq(min(d[,2]), max(d[,2]), length=100)
z <- outer(x, y,
           function (x, y) {
             sqrt(apply(rbind(x,y) * (V %*% rbind(x,y)), 2, sum))
           })
contour(x, y, z, add = TRUE, col = "blue", lwd = 3)
%--

+ Distance between clusters

The notion of distance between clusters is slightly trickier.  Here are a
few examples of such a distance:

  d(A,B) = Max { d(a,b) ; a \in A and b \in B }        (complete linkage)
  d(A,B) = Min { d(a,b) ; a \in A and b \in B }        (single linkage)
  d(A,B) = d(center of gravity of A, center of gravity of B)
                                                       (centroid method)
  d(A,B) = mean of the d(a,b) for a \in A and b \in B
                                                       (average linkage, or
                                                       UPGMA: Unweighted Pair
                                                       Group Method with
                                                       Arithmetic mean)

TODO: plots illustrating those notions.
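These definitions correspond to the method argument of hclust ("complete",
"single", "centroid" and "average", respectively).  A quick sketch (not in
the original text) comparing them on the USArrests data used earlier -- as
the remark below stresses, the resulting dendrograms can look quite
different:

%G
# The same distance matrix, four different cluster-to-cluster distances
d <- dist(USArrests)
op <- par(mfrow = c(2,2), mar = c(2,4,3,1))
for (m in c("complete", "single", "average", "centroid")) {
  plot(hclust(d, method = m), labels = FALSE, main = m, xlab = "", sub = "")
}
par(op)
%--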
TODO: Ward's method.  One description: cluster membership is assessed by
computing the total sum of squared deviations from the mean of a cluster;
the criterion for fusion is that it should produce the smallest possible
increase in the error sum of squares.

Remark: the same data, with different algorithms and/or distances, can
produce completely unrelated dendrograms.

+ A few applications

TODO: Detecting outliers (they form one or several small clusters, isolated
from the rest of the data).

TODO: Unsupervised learning, when you do not know the number of classes.

* Comparing those two methods

The k-means algorithm and its variants are very fast and well-suited to
large data sets, but non-deterministic -- hierarchical clustering is just
the opposite.  That is why we sometimes mix the two approaches (hybrid
clustering): we start with the k-means algorithm to get a few tens or
hundreds of classes; we then run a hierarchical clustering on those classes
(not on the initial data, which would be too large) to find the number of
classes; finally, we can refine the result with the k-means algorithm on the
newly obtained classes.

TODO: Example: do this, following this example (from the hclust manual).

%G
## Do the same with centroid clustering and squared Euclidean distance,
## cut the tree into ten clusters and reconstruct the upper part of the
## tree from the cluster centers.
hc <- hclust(dist(USArrests)^2, "cen")
memb <- cutree(hc, k = 10)
cent <- NULL
for(k in 1:10){
  cent <- rbind(cent, colMeans(USArrests[memb == k, , drop = FALSE]))
}
hc1 <- hclust(dist(cent)^2, method = "cen", members = table(memb))
opar <- par(mfrow = c(1, 2))
plot(hc, labels = FALSE, hang = -1, main = "Original Tree")
%--

%G
plot(hc1, labels = FALSE, hang = -1, main = "Re-start from 10 clusters")
%--

* Density estimation

The preceding methods assumed that the clusters were convex, or even
spherical.  But this is not always the case.  In the following example, we
would like the computer to find two clusters...

%G
n <- 1000
x <- runif(n,-1,1)
y <- ifelse(runif(n)>.5,-.1,.1) + .02*rnorm(n)
d <- data.frame(x=x, y=y)
plot(d, type='p', xlim=c(-1,1), ylim=c(-1,1))
%--

With k-means:

%G
test.kmeans <- function (d, ...) {
  cl <- kmeans(d,2)
  plot(d, col=cl$cluster, main="kmeans", ...)
  points(cl$centers, col=1:2, pch=7, lwd=3)
}
test.kmeans(d, xlim=c(-1,1), ylim=c(-1,1))
%--

With hierarchical clustering:

%G
test.hclust <- function (d, ...) {
  hc <- hclust(dist(d))
  # remplir ("fill"): recursively collect the observations under node i
  # of the merge tree
  remplir <- function (m, i, res=NULL) {
    if( i<0 ) {
      return( c(res, -i) )
    } else {
      return( c(res,
                remplir(m, m[i,1], NULL),
                remplir(m, m[i,2], NULL)) )
    }
  }
  # The two subtrees of the root, i.e., a 2-cluster classification
  # (n is the number of observations, taken from the global environment)
  a <- remplir(hc$merge, hc$merge[n-1,1])
  b <- remplir(hc$merge, hc$merge[n-1,2])
  co <- rep(1,n)
  co[b] <- 2
  plot(d, col=co, main="hclust", ...)
}
test.hclust(d, xlim=c(-1,1), ylim=c(-1,1))
%--

We can also imagine other pathologies, with non-convex clusters.

%G
get.sample <- function (n=1000, p=.7) {
  x1 <- rnorm(n)
  y1 <- rnorm(n)
  r2 <- 7+rnorm(n)
  t2 <- runif(n,0,2*pi)
  x2 <- r2*cos(t2)
  y2 <- r2*sin(t2)
  r <- runif(n)>p
  x <- ifelse(r,x1,x2)
  y <- ifelse(r,y1,y2)
  data.frame(x=x, y=y)
}
d <- get.sample()
plot(d, type='p', xlim=c(-10,10), ylim=c(-10,10))
%--

%G
test.kmeans(d, xlim=c(-10,10), ylim=c(-10,10))
%--

%G
test.hclust(d, xlim=c(-10,10), ylim=c(-10,10))
%--

%G
x <- d$x
y <- abs(d$y)
d <- data.frame(x=x, y=y)
plot(d, type='p', xlim=c(-10,10), ylim=c(0,10))
%--

%G
test.kmeans(d)
%--

%G
test.hclust(d)
%--
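An aside (not in the original text): single-linkage hierarchical clustering,
which merges clusters through their closest points, can sometimes follow
such non-convex shapes where the default, complete linkage, fails; the
result depends on the sample and on the size of the gap between the two
groups.

%G
# Single linkage chains nearby points together, so it may separate the
# central blob from the surrounding ring when the gap between them is
# large enough.
d <- get.sample()
hc <- hclust(dist(d), method = "single")
plot(d, col = cutree(hc, k = 2),
     xlim = c(-10,10), ylim = c(-10,10),
     main = "hclust, single linkage")
%--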
You can try to add variables (this is the same idea that leads to polynomial
regression -- or to kernel methods).

%G
d <- get.sample()
x <- d$x
y <- d$y
d <- data.frame(x=x, y=y, xx=x*x, yy=y*y, xy=x*y)
test.kmeans(d)
%--

%G
test.hclust(d)
%--

But it does not work, unless we add the "right" variables.

%G
d <- data.frame(x=x, y=y, xx=x*x, yy=y*y, xy=x*y, xpy=x*x+y*y)
test.kmeans(d)
%--

%G
test.hclust(d)
%--

In that kind of situation, if there are only two dimensions, we can try to
estimate the corresponding probability density and hope to see two peaks.

help.search("density")
  # Suggests:
  # KernSur(GenKern)       Bivariate kernel density estimation
  # bkde2D(KernSmooth)     Compute a 2D Binned Kernel Density Estimate
  # kde2d(MASS)            Two-Dimensional Kernel Density Estimation
  # density(mclust)        Kernel Density Estimation
  # sm.density(sm)         Nonparametric density estimation in 1, 2 or 3
  #                        dimensions

%G
library(KernSmooth)
# Kernel density estimate, on the two original coordinates only
r <- bkde2D(cbind(d$x, d$y), bandwidth=c(.5,.5))
persp(r$fhat)
%--

%G
# Grid points where the estimated density exceeds a threshold
n <- length(r$x1)
plot( matrix(r$x1, nr=n, nc=n, byrow=F),
      matrix(r$x2, nr=n, nc=n, byrow=T),
      col = r$fhat > .001 )
%--

TODO: a similar plot with hexbin?

(We leave it as an exercise to the reader to finish this example, by telling
the computer how to find the connected components of the preceding plot and
how to use them for forecasting.)

You can use the same idea in a more algorithmic way, as follows: for each
observation x, we count the number of observations at a distance of at most
d.lim from x (0.5 in the code below -- you can change this value); if there
are at least n.lim of them (5 below), we say that the observation x is
"dense", i.e., inside a cluster.  We then say that two dense observations
are in the same cluster if their distance is below d.lim (more precisely, we
take the transitive closure of this relation).

http://www.genopole-lille.fr/fr/formation/cib/doc/datamining/DM-Bio.pps
(page 66)

You can implement this idea as follows.

%G
density.classification.plot <- function (x, y, d.lim=.5, n.lim=5) {
  n <- length(x)
  # Distance computations (a contains the squared distances)
  a <- matrix(x, nr=n, nc=n, byrow=F) - matrix(x, nr=n, nc=n, byrow=T)
  b <- matrix(y, nr=n, nc=n, byrow=F) - matrix(y, nr=n, nc=n, byrow=T)
  a <- a*a + b*b
  # Which observations are dense (i.e., in a cluster)?
  b <- apply(a < d.lim^2, 1, sum) >= n.lim
  plot(x, y, col = b+1)      # dense observations in red
  points(x, y, pch='.')
}
density.classification.plot(d$x, d$y)
%--

We then have to identify the various clusters.

%G
# A naively recursive implementation quickly hits R's limit on the size of
# the expression stack (it can be raised with options(expressions=...));
# we therefore use explicit loops instead.
density.classification.plot <- function (x, y, d.lim=.5, n.lim=5, ...) {
  n <- length(x)
  # Distance computations (a contains the squared distances)
  a <- matrix(x, nr=n, nc=n, byrow=F) - matrix(x, nr=n, nc=n, byrow=T)
  b <- matrix(y, nr=n, nc=n, byrow=F) - matrix(y, nr=n, nc=n, byrow=T)
  a <- a*a + b*b
  # Which observations are dense (i.e., in a cluster)?
  b <- apply(a < d.lim^2, 1, sum) >= n.lim
  # We number the clusters
  cl <- rep(0, n)
  m <- 0
  for (i in 1:n) {
    if (b[i] & cl[i] == 0) {   # dense observation, not yet assigned to a cluster
      m <- m + 1
      co <- m
      cl[i] <- co
      print(paste("Processing cluster", co))
      done <- F
      while (!done) {
        done <- T
        for (j in (1:n)[cl == co]) {
          # Unassigned dense observations close to the current cluster
          l <- (1:n)[ b & cl == 0 & a[j,] < d.lim^2 ]
          if( length(l) > 0 ) {
            done <- F
            cl[l] <- co
          }
        }
      }
    }
  }
  plot(x, y, col = cl+1, ...)   # observations outside any cluster in black
  points(x, y, pch='.')
}
density.classification.plot(d$x, d$y)
%--
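The same idea (density-based clustering) is also available pre-packaged: it
is essentially the DBSCAN algorithm.  A hedged sketch, assuming the fpc
package and its dbscan function (eps and MinPts play the roles of d.lim and
n.lim):

%G
library(fpc)
# DBSCAN on the two original coordinates; observations assigned to no
# cluster are labelled 0 ("noise") and are drawn in black here.
r <- dbscan(cbind(d$x, d$y), eps = .5, MinPts = 5)
plot(d$x, d$y, col = r$cluster + 1, main = "dbscan (fpc package)")
%--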
Let us play with the parameters.

%G
op <- par(mfrow=c(3,3))
for (d.lim in c(.2,1,2)) {
  for (n.lim in c(3,5,10)) {
    density.classification.plot(
      d$x, d$y, d.lim, n.lim,
      main=paste("d.lim = ", d.lim, ", n.lim = ", n.lim, sep='')
    )
  }
}
par(op)
%--

* Other packages

amap: parallelized hclust, robust PCA

party: mob (model-based recursive partitioning), cforest (a random forest
ensemble algorithm on conditional inference trees)

cba: Clustering for Business Analytics
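A quick look at the first of these (a sketch; it assumes the amap package is
installed and that its hcluster function mirrors hclust's interface, with
"method" naming the distance and "link" the linkage):

%G
library(amap)
# hcluster combines the distance computation and the agglomeration in a
# single (parallelizable) call and returns an ordinary "hclust" object.
plot(hcluster(USArrests, method = "euclidean", link = "average"),
     labels = FALSE, main = "amap::hcluster")
%--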