Generalized Linear Models: logistic regression, Poisson regression, etc.

Let us now tackle regression when the variable to predict is qualitative.

TODO: In this chapter, I mention Discriminant Analysis, already tackled in the chapter about PCA. Choose where to detail this...

TODO: Include LDA, naive Bayes classifier

* Example: a classification problem

For the moment, in the regressions we have seen, the variable to predict was always a (continuous) quantitative variable. However, this already enables us to investigate some situations in which qualitative variables appear. Here is an example.

+ Using regression in a classification problem

We have three variables, two of them quantitative, X1 and X2, and one qualitative, Y, with two values. We want to predict Y from X1 and X2.

%G
n <- 10
N <- 10
s <- .3
m1 <- rnorm(n, c(0,0))
a <- rnorm(2*N*n, m1, sd=s)
m2 <- rnorm(n, c(1,1))
b <- rnorm(2*N*n, m2, sd=s)
x1 <- c( a[2*(1:(N*n))], b[2*(1:(N*n))] )
x2 <- c( a[2*(1:(N*n))-1], b[2*(1:(N*n))-1] )
y <- c(rep(0,N*n), rep(1,N*n))
plot( x1, x2, col=c('red','blue')[1+y] )
%--

We can consider the qualitative variable as a quantitative variable taking two values, 0 and 1, and perform a linear regression against the others. We get an expression of the form Y = b0 + b1 X1 + b2 X2. We can then split the (X1,X2) plane in two, along the "b0 + b1 X1 + b2 X2 = 1/2" line.

%G
plot( x1, x2, col=c('red','blue')[1+y] )
r <- lm(y~x1+x2)
abline( (.5-coef(r)[1])/coef(r)[3], -coef(r)[2]/coef(r)[3] )
%--

+ Curvilinear regression and classification

The situation is the same as above, but this time, we regress Y against X1, X2, X1*X2, X1^2, X2^2.

%G
# I need a function to draw conics...
conic.plot <- function (a,b,c,d,e,f, xlim=c(-2,2), ylim=c(-2,2), n=20, ...) {
  x0 <- seq(xlim[1], xlim[2], length=n)
  y0 <- seq(ylim[1], ylim[2], length=n)
  x <- matrix( x0, nr=n, nc=n )
  y <- matrix( y0, nr=n, nc=n, byrow=T )
  z <- a*x^2 + b*x*y + c*y^2 + d*x + e*y + f
  contour(x0,y0,z, nlevels=1, levels=0, drawlabels=F, ...)
}
r <- lm(y~x1+x2+I(x1^2)+I(x1*x2)+I(x2^2))$coef
plot( x1, x2, col=c('red','blue')[1+y] )
conic.plot(r[4], r[5], r[6], r[2], r[3], r[1]-.5,
           xlim=par('usr')[1:2], ylim=par('usr')[3:4], add=T)
%--

+ Nearest neighbours

The situation is still the same. This time, in order to find the value of Y from those of X1 and X2, we take the 10 points of the sample nearest to (X1,X2), we average the corresponding Y values and round to 0 or 1.

%G
M <- 100
d <- function (a,b, N=10) {
  mean( y[ order( (x1-a)^2 + (x2-b)^2 )[1:N] ] )
}
myOuter <- function (x,y,f) {
  r <- matrix(nrow=length(x), ncol=length(y))
  for (i in 1:length(x)) {
    for (j in 1:length(y)) {
      r[i,j] <- f(x[i],y[j])
    }
  }
  r
}
cx1 <- seq(from=min(x1), to=max(x1), length=M)
cx2 <- seq(from=min(x2), to=max(x2), length=M)
plot( x1, x2, col=c('red','blue')[1+y] )
contour(cx1, cx2, myOuter(cx1,cx2,d), levels=.5, add=T)
%--

The problem with the "outer" function is mentioned in the FAQ, where they provide another solution, more "vectorized" than mine...

  wrapper <- function(x, y, my.fun, ...) {
    sapply(seq(along=x), FUN = function(i) my.fun(x[i], y[i], ...))
  }
  outer(cx1, cx2, FUN = wrapper, my.fun = d)

Here is another situation in which the nearest neighbours method proves more relevant than the preceding ones.

%G
n <- 20
N <- 10
s <- .1
m1 <- rnorm(n, c(0,0))
a <- rnorm(2*N*n, m1, sd=s)
m2 <- rnorm(n, c(0,0))
b <- rnorm(2*N*n, m2, sd=s)
x1 <- c( a[2*(1:(N*n))], b[2*(1:(N*n))] )
x2 <- c( a[2*(1:(N*n))-1], b[2*(1:(N*n))-1] )
y <- c(rep(0,N*n), rep(1,N*n))
plot( x1, x2, col=c('red','blue')[1+y] )
M <- 100
cx1 <- seq(from=min(x1), to=max(x1), length=M)
cx2 <- seq(from=min(x2), to=max(x2), length=M)
#text(outer(cx1,rep(1,length(cx2))),
#     outer(rep(1,length(cx1)), cx2),
#     as.character(myOuter(cx1,cx2,d)))
contour(cx1, cx2, myOuter(cx1,cx2,d), levels=.5, add=T)
# Color the various areas
points(matrix(cx1, nr=M, nc=M),
       matrix(cx2, nr=M, nc=M, byrow=T),
       pch='.',
       col=c("red", "blue")[ as.numeric( myOuter(cx1,cx2,d) > .5 )+1])
%--

%G
pastel <- .9
plot( x1, x2, col=c('red','blue')[1+y] )
points(matrix(cx1, nr=M, nc=M),
       matrix(cx2, nr=M, nc=M, byrow=T),
       pch=16,
       col=c(rgb(1,pastel,pastel), rgb(pastel,pastel,1))
            [ as.numeric( myOuter(cx1,cx2,d) > .5 )+1])
points(x1, x2, col=c('red','blue')[1+y] )
contour(cx1, cx2, myOuter(cx1,cx2,d), levels=.5, add=T)
%--

Let us vary the number of neighbours.

%G
plot( x1, x2, col=c('red','blue')[1+y] )
v <- c(3,10,50)
for (i in 1:length(v)) {
  contour(cx1, cx2, myOuter(cx1,cx2, function(a,b){d(a,b,v[i])}),
          levels=.5, add=T, drawlabels=F, col=i+1)
}
legend(min(x1), max(x2), as.character(v), col=1+(1:length(v)), lty=1)
%--

We remark that we should not take too many neighbours.

If we do not take the 10 nearest points but just the nearest point, we get a Voronoi diagram.

http://www.perlmonks.org/index.pl?node_id=189941 = perl_voronoi3.png

Here is a funny application of Voronoi tessellations: sea ice modeling in computer-generated images (you can reuse the idea for giraffe skin):
http://images.povcomp.com/entries/images/210_detail2.jpg
http://www.povcomp.com/entries/210_detail2.php
http://www.povcomp.com/entries/210.php
http://trific.ath.cx/software/gimp-plugins/voronoi/

They are also used to create mosaics.
http://nis-ei.eng.hokudai.ac.jp/~doba/papers/egshort02_mosaic.pdf

We shall see another application of Voronoi diagrams to spot outliers.
TODO

Other applications:
http://www.voronoi.com/cgi-bin/voronoi_applications.php

TODO: give more examples...

There are several variants of this method: instead of taking the mean over the 10 nearest neighbours, we can take all the points but weigh them according to their distance to our point (such a weighting scheme is called a "kernel"); we can replace the Euclidean distance by another one, more relevant to the data; etc.

+ Technical note: kD trees, nearest neighbours and local regression

Given a set of points x1, x2, ..., xN in R^n, and a new point y, we want to find the x_i nearest to y. The naive method consists in examining all the points in turn, computing their distance to y, and retaining the nearest. This may be fine if there is a single new point y, but it is not very efficient if we plan to use it on many new points. Instead, we can preprocess the points x1, ..., xN so as to quickly find the one nearest to a new given point.

In dimension one, this can be done as follows: instead of representing the points as an unordered cloud, represent them as a tree: find the median, cut the cloud of points in two along the median, and proceed recursively with those two new clouds.

TODO: give an example.
(a numeric example; plot the tree, give a 1-dimensional plot, as we shall shortly be doing in 2 dimensions)

This is used to speed up data access in a database (when you create an "index" on a numeric variable); it works well for requests of the form

  Find the records for which x = 17
  Find the records for which x <= 9
  Find the records for which 9 <= x and x <= 17

This can be generalized to higher dimensions, under the name "kD tree" (for "k-dimensional tree"): we split the cloud of points along the median of one of the coordinates, and we change the coordinate used at each step.

TODO: plot in 2D

There are several ways of choosing which coordinate to use for the split, e.g., "take each coordinate in turn, cyclically"; or "the median of the most spread-out dimension"; or (instead of the median) "the point closest to the center of the widest dimension" (this yields squarer cells).

TODO: plots, for those three algorithms, in two examples:
  uniform random points in a square
  points along a circle (this tends to be pathological).

As in dimension 1, this is used in DBMS (DataBase Management Systems) to speed up requests of the form

  Find the records with x = 17, y = 3 and z = 5
  Find the records with x = 17
  Find the records with x <= 17

For the last two examples, we explore the tree, but when it forks according to y or z, we have to explore both branches.

TODO: a plot of such a tree?

kD trees can also be used to find the nearest neighbour of a point: the idea is to proceed as with the naive algorithm, by examining all the points, but instead of taking them in an arbitrary order, we follow the tree ("depth first") and we remember the distance to the closest point found so far: this allows us to prune whole branches, if their points are farther than our best guess so far (a small sketch follows the list of applications below).

The main applications of kD trees (and their generalization, BSP (Binary Space Partitioning) trees, whose cutting hyperplanes need not be orthogonal to the axes) are:

  indexing geographical data (in spatial databases or GIS, Geographical Information Systems);

  indexing multidimensional data in databases, even if the data are not geographical;

  culling objects from 3-dimensional scenes, i.e., quickly deciding what is visible and should be rendered and what is not currently visible and can thus be safely discarded;

  description of FPS (First Person Shooter) video game levels (e.g., Quake);

  local regression, i.e., regression on the k nearest neighbours of each point of our sample (contrary to the naive implementation, which precomputes the regression at each sample point, kD trees allow us to perform the computations on the fly and therefore to change the parameters (kernel width, degree) without a retraining phase -- the training phase basically consists in storing the data);

  reducing the number of colours in an image (represent the colours as points in the 3-dimensional red-green-blue space, build a kD tree on those points, stop when you have the desired number of leaves, replace each leaf by its median);

  simulating the n-body problem (build the kD tree of the n objects and, for each node (starting with the deepest nodes), compute the forces between its children (considered as points)).
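To make the pruning idea concrete, here is a minimal sketch in R, in dimension 1 (a median-split tree and the corresponding nearest-neighbour search). The functions build.tree and nearest are made up for this example and are not meant to be an efficient implementation.

  # A 1-dimensional median-split tree...
  build.tree <- function (x) {
    if (length(x) <= 1) return( list(leaf=TRUE, points=x) )
    m <- median(x)
    left  <- x[ x <= m ]
    right <- x[ x >  m ]
    if (length(left) == 0 || length(right) == 0)   # all the points are equal
      return( list(leaf=TRUE, points=x) )
    list( leaf=FALSE, split=m,
          left=build.tree(left), right=build.tree(right) )
  }
  # ...and the distance to the nearest neighbour of y, with pruning:
  # a branch is only explored if it can contain a point closer than
  # the best distance found so far.
  nearest <- function (tree, y, best=Inf) {
    if (tree$leaf) return( min(best, abs(tree$points - y)) )
    if (y <= tree$split) {
      best <- nearest(tree$left, y, best)
      if (abs(y - tree$split) < best) best <- nearest(tree$right, y, best)
    } else {
      best <- nearest(tree$right, y, best)
      if (abs(y - tree$split) < best) best <- nearest(tree$left, y, best)
    }
    best
  }
  x <- rnorm(100)
  nearest(build.tree(x), 0)   # distance to the nearest point
  min(abs(x - 0))             # same result, naive computation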
http://www.autonlab.org/autonweb/documents/papers/moore-tutorial.pdf
http://en.wikipedia.org/wiki/Binary_space_partitioning
http://en.wikipedia.org/wiki/Doom_engine
http://groups.csail.mit.edu/graphics/classes/6.838/S98/meetings/m13/kd.html
http://www.cs.cmu.edu/~kdeng/thesis/kdtree.pdf
http://www.liacc.up.pt/~ltorgo/PhD/th5.pdf
http://www.flipcode.com/articles/article_raytrace07.shtml

+ Comparing those methods

Regression is much more stable and works fine with few observations, but its results are imprecise; furthermore, it assumes that the data have a simple structure. On the contrary, the nearest neighbour method is more precise and does not assume anything about the structure of the data, but it is not very stable (i.e., a different sample can yield completely different results) and works best with many observations.

Furthermore, the nearest neighbour method suffers from the curse of dimensionality: if the dimension is high, the "nearest" neighbours are far away, and most points are approximately at the same distance from our point...

TODO: take two samples, a small one (100 or 200 observations) to do the computations, and a larger one (1e4) to validate them. Plot the error rate as a function of the number of neighbours.

TODO: check that it works...

%G
get.model <- function (n=10, m=2, s=.1) {
  list( n=n, m=m, x=runif(n), y=runif(n), z=sample(1:m,n,replace=T), s=s )
}
get.sample <- function (model, n=200) {
  i <- sample( 1:model$n, n, replace=T )
  data.frame( x=rnorm(n, model$x[i], model$s),
              y=rnorm(n, model$y[i], model$s),
              z=model$z[i] )
}
nearest.neighbour.predict <- function (x, y, d, k=10) {
  o <- order( (d$x-x)^2 + (d$y-y)^2 )[1:k]
  s <- apply(outer(d[o,]$z, 1:max(d$z), '=='), 2, sum)
  order(s, decreasing=T)[1]
}
m <- get.model()
d <- get.sample(m)
N <- 1000
d.test <- get.sample(m,N)
n <- 50
r <- rep(0, n)
# Very slow
for (k in 1:n) {
  for (i in 1:N) {
    r[k] <- r[k] + ( nearest.neighbour.predict(d.test$x[i], d.test$y[i], d, k)
                     != d.test$z[i] )
  }
}
plot(r/N, ylim=c(0,1), type='l',
     xlab="Number of neighbours", ylab="Error rate")
abline(h=c(0,.5,1), lty=3)
rm(d.test)
%--

With a smaller sample:

%G
m <- get.model()
d <- get.sample(m, 20)
N <- 1000
d.test <- get.sample(m,N)
n <- 50
r <- rep(0, n)
# Very slow
for (k in 1:n) {
  for (i in 1:N) {
    r[k] <- r[k] + ( nearest.neighbour.predict(d.test$x[i], d.test$y[i], d, k)
                     != d.test$z[i] )
  }
}
plot(r/N, ylim=c(0,1), type='l',
     xlab="Number of neighbours", ylab="Error rate")
abline(h=c(0,.5,1), lty=3)
rm(d.test)
%--

+ Other examples

The nearest neighbour method remains valid with more than two classes.

%G
nearest.neighbour.plot <- function (d, k=10, model=NULL) {
  col <- rainbow(max(d$z))
  plot( d$x, d$y, col=col[d$z] )
  cx <- seq(min(d$x), max(d$x), length=100)
  cy <- seq(min(d$y), max(d$y), length=100)
  pastel <- .8
  colp <- round(pastel*255 + (1-pastel)*col2rgb(col))
  colp <- rgb(colp[1,], colp[2,], colp[3,], max=255)
  points(matrix(cx,nr=100,nc=100),
         matrix(cy,nr=100,nc=100,byrow=T),
         col = colp[ myOuter(cx,cy,
                       function(a,b){ nearest.neighbour.predict(a,b,d,k) }) ],
         pch=16 )
  points( d$x, d$y, col=col[d$z] )
  if(!is.null(model)){
    points(model$x, model$y, pch='+', cex=3, lwd=3, col=col[model$z])
  }
}
m <- get.model(n=10, m=4)
d <- get.sample(m)
nearest.neighbour.plot(d, model=m)
%--

TODO: plot the theoretical boundary.

TODO: there is already a package for this...
library(help=knnTree)

Exercise: use a kernel instead (see the sketch below).
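Here is a possible sketch for this exercise, with a gaussian kernel: every observation of the sample votes for its class, with a weight that decreases with its distance to the point at hand. The function name kernel.predict and the bandwidth h are made up for this example; it reuses the get.model, get.sample and nearest.neighbour.predict functions defined above.

  kernel.predict <- function (x, y, d, h=.2) {
    # Gaussian weights, decreasing with the distance to (x,y)
    w <- exp( - ( (d$x-x)^2 + (d$y-y)^2 ) / (2*h^2) )
    # Total weight received by each class
    s <- tapply(w, d$z, sum)
    as.numeric(names(s)[ which.max(s) ])
  }
  m <- get.model(n=10, m=4)
  d <- get.sample(m)
  # Compare with the 10-nearest-neighbour prediction at one point
  kernel.predict(.5, .5, d)
  nearest.neighbour.predict(.5, .5, d, k=10)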
+ TODO: lda, qda

TODO: Read chapter 4 of "The Elements of Statistical Learning"

+ Mixture discriminant analysis

TODO
library(mda)
?mda

+ Flexible discriminant analysis

TODO
library(mda)
?fda

* Naive Bayes classifier

+ Toy example

We consider two binary random variables X and Y and we try to predict Y from X. For instance, we try to predict the name of a fruit from its shape.

  X = "round" or "long"
  Y = "apple" or "non-apple" (e.g., banana, orange)

We have a set of fruits from which to learn to recognise them: we know their shape and we know whether they are apples or not. For instance:

        | apple   | not an apple
  ------+---------+-----------------
  round |         |
  ------+---------+-----------------
  long  |         |

TODO: choose numeric values

When we see a new fruit, we can try to guess whether it is an apple in the following way: we compute the conditional probabilities

                               P( Y = apple and X = round )
  P( Y = apple | X = round ) = -----------------------------
                                       P( X = round )

                                   P( Y = non-apple and X = round )
  P( Y = non-apple | X = round ) = ---------------------------------
                                            P( X = round )

and we compare them. If the first is larger, we shall say it is an apple; if it is the second, we shall say it is not.

With our values: TODO

+ The Bayes formula

All the presentations of the naive Bayes classifier rely on the Bayes formula. Indeed, the formula we have used,

                               P( Y = apple and X = round )
  P( Y = apple | X = round ) = -----------------------------
                                       P( X = round )

can also be written

                               P( X = round | Y = apple ) * P( Y = apple )
  P( Y = apple | X = round ) = --------------------------------------------
                                               P( X = round )

which is called "the Bayes formula". When we only have one predictive variable, it is not that useful, but we shall see that with several, it becomes handy.

+ Several predictive variables, non-binary variables

This idea can easily be generalized to several predictive variables (shape of the fruit, colour, pips or stone, etc.).

                            P(X1=b1 and ... and Xn=bn and Y=a)
  P(Y=a|X=(b1,b2,...,bn)) = -----------------------------------
                                P(X1=b1 and ... and Xn=bn)

But, if there are many predictive variables, some problems may occur: in some cases, we have never observed b1, b2, ..., and bn simultaneously. What can we do? It is here that conditional probabilities really become useful.

                            P(X=(b1,...,bn)|Y=a) * P(Y=a)
  P(Y=a|X=(b1,b2,...,bn)) = -------------------------------
                                   P(X=(b1,...,bn))

If the Xi are independent (it is a reasonable hypothesis, even if it is false: otherwise, we would have to investigate the dependency relations, the interactions, between the Xi -- this would require much more data), this can be written

    P(X1=b1|Y=a) * ... * P(Xn=bn|Y=a) * P(Y=a)
  = --------------------------------------------
                 P(X=(b1,...,bn))

We are interested in the value of a that maximizes this quantity. As a does not appear in the denominator, we can get rid of it and find the value of a that maximizes the numerator

  P(X1=b1|Y=a) * ... * P(Xn=bn|Y=a) * P(Y=a).

A naive Bayes classifier will thus be used in two steps: first, compute the conditional probabilities P(Xi=bj|Y=a) for all the values of i, j and a, and the a priori probabilities P(Y=a), from the learning data set; then, in a second step, given new values for the Xi, find the value of a that maximizes

  P(X1=b1|Y=a) * ... * P(Xn=bn|Y=a) * P(Y=a).

+ Hypotheses

We had to assume that the predictive variables were independent (this is the meaning of the word "naive" in the expression "naive Bayes classifier"). This hypothesis is usually wrong -- however, the forecasts are nonetheless rather reliable.
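A small simulation (a sketch) can illustrate this remark: in the example below, the two predictive variables are strongly dependent (x2 is almost a copy of x1), yet the forecasts remain reasonable. It uses the naiveBayes function from the e1071 package, presented in the next section; the data and variable names are made up for the example.

  library(e1071)
  n <- 1000
  y  <- factor(sample(c("A","B"), n, replace=TRUE))
  # x1 is informative about y...
  x1 <- factor(ifelse( runif(n) < ifelse(y=="A", .8, .3), "yes", "no" ))
  # ...and x2 is almost a copy of x1: the independence assumption
  # is clearly violated.
  x2 <- factor(ifelse( runif(n) < .9, as.character(x1),
                       sample(c("yes","no"), n, replace=TRUE) ))
  d <- data.frame(y, x1, x2)
  r <- naiveBayes(y ~ x1 + x2, data=d)
  mean( predict(r, d) == y )   # proportion of correct forecasts
                               # (compare with .5 for random guessing)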
http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz

TODO: read this document...

+ Implementation in R

As the idea is relatively simple, we can easily implement such a predictor ourselves.

my.bayes <- function (y, x, xp) {
  # TODO:
  # we should check that y, the columns of x and xp are factors,
  # that xp and x have the same number of columns,
  # that the levels of the columns of xp and x are the same
  # (or at least that the values in the columns of xp are among the
  # levels of x)
  na <- names(x)
  m <- dim(x)[2]
  # Compute the conditional probabilities
  pc <- list()
  for (i in 1:m) {
    a <- table(data.frame(y,x=x[,i]))
    a <- a / apply(a,1,sum)
    pc[[ na[i] ]] <- a
  }
  # Compute the a priori probabilities
  pap <- table(y)
  pap <- pap/sum(pap)
  # To avoid numerical instability, we consider sums of logarithms
  # of probabilities instead of products of probabilities
  pc <- lapply(pc,log)
  pap <- log(pap)
  # Compute the P(X1=b1|Y=a) * ... * P(Xn=bn|Y=a) * P(Y=a)
  papo <- matrix(0, nr=dim(xp)[1], nc=length(pap))
  colnames(papo) <- names(pap)
  # I am sure we could do this with no loop...
  for (i in 1:dim(xp)[1]) {
    for (j in 1:length(pap)) {
      papo[i,j] <- pap[j]
      for (k in 1:dim(x)[2]) {
        papo[i,j] <- papo[i,j] + pc[[ na[k] ]][j,xp[i,k]]
      }
    }
  }
  papo <- papo - apply(papo,1,mean)
  papo <- exp(papo)
  levels(y)[ apply(papo, 1, function (a) { which(a==max(a))[1] }) ]
}

# Data
library(e1071)
data(HouseVotes84)
y <- HouseVotes84[,1]
x <- HouseVotes84[,-1]

# Err...
# I forgot to account for missing values
i <- which(!apply( apply(x,1,is.na), 2, any ))
x <- x[i,]
y <- y[i]

# Compute the forecasts: correct in 92% of the cases.
yp <- my.bayes(y,x,x)
mean(y==yp)

+ Implementation in R

Actually, it is already there -- and the function, contrary to ours, accounts for missing values.

library(e1071)
?naiveBayes

For sample data, have a look at

library(mlbench)
data(package=mlbench)

+ Precision and recall rates

The precision is the proportion of the predicted positives that are actually positive; the recall is the proportion of the actual positives that are correctly predicted.

library(mlbench)
data(DNA)
d <- DNA
d$Class <- factor(ifelse(d$Class=='n','n','i'))
library(e1071)
r <- naiveBayes(Class~.,d)
i <- sample(1:dim(d)[1],100)
p <- predict(r, d[,-length(d)][i,])   # Long...
A <- sum(p == 'i' & d$Class[i] == 'i')
B <- sum(p == 'i' & d$Class[i] != 'i')
C <- sum(p != 'i' & d$Class[i] == 'i')
# Precision rate
100 * A/(A+B)
# Recall rate
100 * A/(A+C)

+ ROC Curve

TODO

library(mlbench)
data(DNA)
d <- DNA
d$Class <- factor(ifelse(d$Class=='n','n','i'))
library(e1071)
r <- naiveBayes(Class~.,d)
p <- predict(r, d[,-length(d)], type='raw')

+ Practical example: spam filtering

For several years, Bayesian classifiers have been used in mail filters -- the interesting point is that they learn the kind of mail _you_ receive and the kind of spam that floods the web -- everyone ends up with a different filter.

http://www.paulgraham.com/spam.html
http://www.xml.com/pub/a/2003/11/19/udell.html

K. Williams, An Introduction to Machine Learning with Perl
http://conferences.oreillynet.com/presentations/bio2003/williams_ken.ppt

The variable to predict, Y, has two values: "spam" and "non-spam". There is a predictive variable for each word (yes, that means thousands of predictive variables -- actually, we shall restrict ourselves to the most common words and end up with hardly more than one thousand words), with two values: "present" or "absent". The learning step is carried out from messages whose nature is known -- for instance, the messages you received last week.
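As a toy illustration of such a filter (the messages and the word list below are, of course, made up for the example), one can build a word-presence table and feed it to the naiveBayes function:

  library(e1071)
  messages <- c("cheap viagra buy now",
                "meeting tomorrow at noon",
                "buy cheap watches now",
                "lunch tomorrow with the team",
                "cheap loans buy now",
                "project meeting moved to noon")
  type <- factor(c("spam", "non-spam", "spam", "non-spam", "spam", "non-spam"))
  words <- c("cheap", "buy", "now", "meeting", "tomorrow", "noon")
  # One (factor) column per word: is the word present in the message?
  word.presence <- function (msgs)
    data.frame(lapply(setNames(words, words),
                      function (w) factor(grepl(w, msgs), levels=c(FALSE,TRUE))))
  r <- naiveBayes(word.presence(messages), type)
  predict(r, word.presence("buy cheap stuff now"))   # classified as spam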
More generally, text classification is one of the main applications of the naive Bayes classifier.

TODO: give other examples (URL)

+ Other naive Bayes classifier

There is another notion of a "naive Bayes classifier", in which the predictive variables are quantitative: they are counting variables -- in the spam example or in the text classification example, we not only look at the presence/absence of a word, but also at how many times it occurs. Should you wish to distinguish the two classifiers, you can call the binary one the "Binomial Naive Bayes Classifier" and the counting-variable one the "Multinomial Naive Bayes Classifier".

A. McCallum, K. Nigam
A Comparison of Event Models for Naive Bayes Text Classification
http://www-2.cs.cmu.edu/%7Eknigam/papers/multinomial-aaaiws98.pdf

+ Other Bayes classifiers

There are variants of the naive Bayes classifier that tackle its overly naive hypothesis (the independence of the predictive variables). TAN (Tree-Augmented Naive Bayes) models the dependence relations between the predictive variables by a tree. Bayesian networks (very trendy for a couple of years, as neural networks were in the 1990s) model the dependence relations between the predictive variables by a graph.

http://www.vision.ethz.ch/gm/bnclassifiers.pdf

The Holmes and Watson examples (they are actually taken from Finn Jensen's book: An Introduction to Bayesian Networks):
http://www.ai.mit.edu/courses/6.825/fall02/pdf/6.825-lecture-15.pdf

+ Other (unsorted) examples

More generally, Bayesian classifiers may be used with any classification problem.

1. We want to know if a family name comes from southern Asia or not. For this, we look at the presence/absence of 4-letter strings in the name.
http://dimax.rutgers.edu/%7Esguharay/newfinalpresentation.pdf

2. Handwritten character recognition, on a PDA.

3. Voice recognition, on a mobile phone.

4. We have a bunch of biology research article abstracts and we are looking for sentences that describe gene interactions. To this end, we process a few hundred sentences by hand: we look at the words each contains (our predictive variables) and we state whether it describes a gene interaction (our variable to predict). From this learning corpus, we compute the conditional and a priori probabilities. Given a new sentence, the Bayes classifier gives us the probability that it describes an interaction.
http://zoonek2.free.fr/bioinfo/TM/rapport.pdf

* Discriminant Analysis

We have already mentioned it above: we want to predict a qualitative variable from several quantitative variables. More geometrically, we are looking for a plane (or a subspace of smaller dimension) in which the various values of the qualitative variable correspond to groups of points as far apart as possible.

%G
library(MASS)
n <- 100
k <- 5
x1 <- runif(k,-5,5) + rnorm(k*n)
x2 <- runif(k,-5,5) + rnorm(k*n)
x3 <- runif(k,-5,5) + rnorm(k*n)
x4 <- runif(k,-5,5) + rnorm(k*n)
x5 <- runif(k,-5,5) + rnorm(k*n)
y <- factor(rep(1:5,n))
plot(lda(y~x1+x2+x3+x4+x5))
%--

* Logistic Regression

+ Generalized Linear Model

Regression with quantitative variables can be written

  E(Y) = a0 + a1 x1 + ... + an xn.

But if Y only takes a finite number of values, this is no longer relevant: we can use numbers for the values of Y, but Y will be bounded and will only assume a fixed number of values, known in advance, while the right-hand side of the expression is not bounded and can take an infinite number of values -- this may not look very serious, but it becomes troublesome when you want to predict Y.

%G
n <- 100
x <- c(rnorm(n), 1+rnorm(n))
y <- c(rep(0,n), rep(1,n))
plot(y~x)
abline(lm(y~x), col='red')
%--

The idea of the Generalized Linear Model (GLM) is to replace E(Y) by something else. For logistic regression, the variable to predict, Y, has only two values, 0 and 1, and we are interested in the probability of each:

  P[ Y == 1 ] = a0 + a1 x1 + ... + an xn.

Hardly better: actually, it is the same formula (because E(Y) = P(Y==1)) and we have the same problem: the left-hand side is bounded, the right-hand side is not. To get rid of the problem, we can apply some transformation to the left-hand side, a bijection between the interval (0,1) and the real line -- in GLM parlance, such a function is called a "link". One often uses the logistic function

                    p
  logit(p) = log -------
                  1 - p

We can write

                          P(win)
  logit( P(win) ) = log --------.
                          P(lose)

The quotient P(win)/P(lose) is called the "odds" (bookmakers often use this expression: "the odds are 3 to 1", etc.)

Here is the plot of this function.

%G
x <- seq(0,1, length=100)
x <- x[2:(length(x)-1)]
logit <- function (t) { log( t / (1-t) ) }
plot(logit(x) ~ x, type='l')
%--

The "probit" function, the inverse of the cumulative distribution function of the gaussian distribution, is another link.

%G
curve(logit(x), col='blue', add=F)
curve(qnorm(x), col='red', add=T)
a <- par("usr")
legend( a[1], a[4], c("logit","probit"), col=c("blue","red"), lty=1)
%--

The "log-log" function is yet another link.

%G
curve(logit(x), col='blue', add=F)
curve(qnorm(x), col='red', add=T)
curve(log(-log(1-x)), col='green', add=T)
abline(h=0, lty=3)
abline(v=0, lty=3)
a <- par("usr")
legend( a[1], a[4], c("logit","probit", "log-log"),
        col=c("blue","red","green"), lty=1)
%--

You might also think about transforming the data with the link function, performing a linear regression, and applying the inverse link to the result. It would not work, because the initial data are 0 and 1 and would give infinite values -- we are no longer trying to model the data values themselves, but their probabilities.

%G
ilogit <- function (l) { exp(l) / ( 1 + exp(l) ) }
fakelogit <- function (l) { ifelse(l>.5, 1e6, -1e6) }
n <- 100
x <- c(rnorm(n), 1+rnorm(n))
y <- c(rep(0,n), rep(1,n))
yy <- fakelogit(y)
xp <- seq(min(x),max(x),length=200)
yp <- ilogit(predict(lm(yy~x), data.frame(x=xp)))
yp[is.na(yp)] <- 1
plot(y~x)
lines(xp,yp, col='red', lwd=3)
%--

So we give up Least Squares Estimation and compute a Maximum Likelihood Estimator.

%G
n <- 100
x <- c(rnorm(n), 1+rnorm(n))
y <- c(rep(0,n), rep(1,n))
f <- function (a) {
  -sum(log(ilogit(a[1]+a[2]*x[y==1]))) -
    sum(log(1-ilogit(a[1]+a[2]*x[y==0])))
}
r <- optim(c(0,1),f)
a <- r$par[1]
b <- r$par[2]
plot(y~x)
curve( dnorm(x,1,1)*.5/(dnorm(x,1,1)*.5+dnorm(x,0,1)*(1-.5)),
       add=T, col='red')
curve( ilogit(a+b*x), add=T )
legend( .95*par('usr')[1]+.05*par('usr')[2], .9,
        c('theoretical curve', 'MLE'),
        col=c('red', par('fg')), lty=1, lwd=1)
title(main="Logistic regression, by hand")
%--

This can be done directly, with the "glm" function.

%G
#
# BEWARE:
# Do not forget the "family" argument -- otherwise, it would be a
# linear regression -- the very thing we are trying to avoid.
#
r <- glm(y~x, family=binomial)
plot(y~x)
abline(lm(y~x), col='red', lty=2)
xx <- seq(min(x), max(x), length=100)
yy <- predict(r, data.frame(x=xx), type='response')
lines(xx, yy, col='blue', lwd=5, lty=2)
lines(xx, ilogit(r$coef[1]+xx*r$coef[2]))
legend( .95*par('usr')[1]+.05*par('usr')[2], .9,
        c('linear regression',
          'prediction with "predict"',
          "prediction with the coefficients"),
        col=c('red', 'blue', par('fg')),
        lty=c(2,2,1), lwd=c(1,5,1))
title(main='Logistic regression with the "glm" function')
%--

In particular, the predicted values, which are probabilities, are indeed between 0 and 1 -- with linear regression, they were not bounded.

We can compare the various kinds of regression: logistic regression yields the results closest to the theoretical curve.

%G
n <- 100
x <- c(rnorm(n), 1+rnorm(n))
y <- c(rep(0,n), rep(1,n))
plot(y~x)
# Brutal prediction
m1 <- mean(x[y==0])
m2 <- mean(x[y==1])
m <- mean(c(m1,m2))
if (m1 < m2)  a <- 0
if (m1 > m2)  a <- 1
if (m1 == m2) a <- .5
lines( c(min(x),m,m,max(x)), c(a,a,1-a,1-a), col='blue')
# Linear regression
abline(lm(y~x), col='red')
# Logistic regression
xp <- seq(min(x),max(x),length=200)
r <- glm(y~x, family=binomial)
yp <- predict(r, data.frame(x=xp), type='response')
lines(xp,yp, col='orange')
# Theoretical curve
curve( dnorm(x,1,1)*.5/(dnorm(x,1,1)*.5+dnorm(x,0,1)*(1-.5)),
       add=T, lty=3, lwd=3)
legend( .95*par('usr')[1]+.05*par('usr')[2], .9,  #par('usr')[4],
        c('Brutal prediction',
          "Linear regression",
          "Logistic regression",
          "Theoretical curve"),
        col=c('blue','red','orange', par('fg')),
        lty=c(1,1,1,3), lwd=c(1,1,1,3))
title(main="Comparing linear and logistic regression")
%--

Do not forget to have a look at the examples from the "glm" manual:

  demo("lm.glm")

TODO: Compare the various links.

+ Reading the result

As for linear regression, we have three means of displaying the result.

The "print" function:

> r

Call:  glm(formula = y ~ x, family = binomial)

Coefficients:
(Intercept)            x
    -0.4864       0.9039

Degrees of Freedom: 199 Total (i.e. Null);  198 Residual
Null Deviance:      277.3
Residual Deviance: 239.7        AIC: 243.7

The "summary" function:

> summary(r)

Call:
glm(formula = y ~ x, family = binomial)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.94200  -0.99371   0.05564   0.97949   1.87198

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.4864     0.1804  -2.696  0.00702 **
x             0.9039     0.1656   5.459 4.79e-08 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 277.26  on 199  degrees of freedom
Residual deviance: 239.69  on 198  degrees of freedom
AIC: 243.69

Number of Fisher Scoring iterations: 3

The "anova" function:

> anova(r)
Analysis of Deviance Table

Model: binomial, link: logit

Response: y

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev
NULL                   199    277.259
x     1   37.566       198    239.693

There is also the "aov" function.

> aov(r)
Call:
   aov(formula = r)

Terms:
                       x Residuals
Sum of Squares   9.13147  40.86853
Deg. of Freedom        1       198

Residual standard error: 0.45432
Estimated effects may be unbalanced

Let us also mention the "manova" function.

  ?summary.manova

+ Residuals and plots

As usual, we have a wealth of plots to assess the quality of the regression results.

The Pearson residuals are defined as

         y_i - \hat y_i
  r_i = ----------------
              s_i

where s_i is the estimation of the standard deviation of the noise.

There is also a normalized version (this is the same as with linear regression: the standard deviation of the noise and the standard deviation of the residuals are different) and a standardized version (we divide by the estimation of the standard deviation obtained by removing observation "i").

You will notice that y_i is 0 or 1, while \hat y_i is a probability (between 0 and 1), which explains the shape of the plot: we see two curves, one above the axis, the other under it, corresponding to the two possible values of y_i. One of those curves is decreasing and tends to 0 at +infinity; for the other, it is the opposite.

%G
n <- 100
x <- c(rnorm(n), 1+rnorm(n))
y <- c(rep(0,n), rep(1,n))
r <- glm(y~x, family=binomial)
plot(r, which=1)
%--

Actually, this simulation does not faithfully reflect the model underlying the logistic regression. It would rather be like this:

%G
n <- 1000
a <- -2
b <- 1
x <- runif(n, -4, 5)
y <- exp(a*x+b + rnorm(n))
y <- y/(1+y)
y <- rbinom(n,1,y)
plot(y~x)
%--

%G
boxplot(x~y, horizontal=T)
%--

%G
op <- par(mfrow=c(2,1))
hist(x[y==1], probability=T, col='light blue')
lines(density(x[y==1]), col='red', lwd=3)
hist(x[y==0], probability=T, col='light blue')
lines(density(x[y==0]), col='red', lwd=3)
par(op)
%--

%G
rt <- glm(y~x, family=binomial)
plot(rt, which=1)
%--

TODO: give concrete examples of pathological situations.

We are also suggested to look at the gaussian quantile-quantile plot of the residuals -- but I do not understand why, because the residuals are not supposed to be gaussian.

%G
plot(rt, which=2)
%--

%G
hist(rt$residuals,
     breaks=seq(min(rt$residuals), max(rt$residuals)+1, by=.5),
     xlim=c(-10,10), probability=T, col='light blue')
points(density(rt$residuals, bw=.5), type='l', lwd=3, col='red')
%--

However, one can transform those residuals to get a gaussian distribution: these are the Anscombe residuals -- the transformation is rather complex and R does not seem to know it...

Exercise: implement this transformation, in an approximate manner, with a simulation.

We can also look at the absolute value of the residuals

%G
plot(rt, which=3)
%--

and at the Cook distances.

%G
plot(rt, which=4)
%--

+ Deviance, residuals, AIC

The deviance is defined by

  D = -2 (L - Lsat)

where L is the log-likelihood of our model and Lsat the log-likelihood of the saturated model (with as many variables as observations).

TODO: understand. (Where do those new variables come from?)

We are happy if D/df < 1.

> rt$deviance
[1] 362.0677

We also have the deviance of the null model (the model with no variables, corresponding to the hypothesis "the predictive variables have no effect", i.e., the probability of Y=1 is constant and does not depend on the predictive variables).

> rt$null.deviance
[1] 1383.377

The AIC (Akaike Information Criterion) is preferable to the deviance to compare models, because it accounts for the number of parameters in those models.

> rt$aic
[1] 366.0677

It is defined by

  - 2 * log-likelihood + 2 * number of parameters.

There are a few variants, such as the BIC,

  - 2 * log-likelihood + ln(number of observations) * number of parameters.

We have already seen the various residuals.

Pearson residuals (aka raw residuals, standardized residuals, studentized residuals):

   y_i - \hat y_i
  ----------------
        s_i

Deviance residuals: contribution of each observation to the deviance.

Anscombe residuals: we transform the variable so that the residuals follow a gaussian distribution (the "t" function is complicated):

   t(y_i) - t(\hat y_i)
  ----------------------
       t'(y_i) s_i

As with linear regression, we can measure the leverage, with the "hat matrix" (but it is not exactly the same matrix).

TODO: with R

%G
# It is supposed not to be the same as in the linear situation.
# Here, it seems to be the same...
plot(hat(x), type='h')
%--

We also have the Cook distance, as for linear regression.

%G
plot(rt, which=4)
%--

+ Model comparisons

Likelihood Ratio test: to compare two nested models with q1 and q2 parameters, we look at D1 - D2, which asymptotically follows a Chi^2 distribution (if we start with a binomial or Poisson distribution) or a Fisher distribution (if we start with a gaussian distribution) with q2 - q1 degrees of freedom.

TODO: understand

Wald test: it is a quadratic approximation of the Likelihood Ratio test.

  library(car)
  ?linear.hypothesis

+ TODO: to sort, to rewrite

(Answer to a question I was asked by email)

Let us consider the data

  library(mlbench)
  data(BostonHousing)
  y <- BostonHousing[,1]
  y <- factor(y>median(y))
  x1 <- BostonHousing[,2]
  x2 <- BostonHousing[,3]

We can compute the logistic regression with the "glm" function,

  glm(y~x1+x2, family=binomial)

or with the "lrm" function in the "Design" package, which provides much more information.

  library(Design)
  lrm(y~x1+x2)

This yields:

> lrm(y~x1+x2)

Logistic Regression Model

lrm(formula = y ~ x1 + x2)

Frequencies of Responses
FALSE TRUE
  253  253

       Obs  Max Deriv Model L.R.       d.f.          P          C        Dxy
       506      3e-06     234.13          2          0      0.862      0.724
     Gamma      Tau-a         R2      Brier
     0.725      0.363      0.494      0.145

          Coef     S.E.    Wald Z P
Intercept -1.66990 0.27245 -6.13  0e+00
x1        -0.04954 0.01280 -3.87  1e-04
x2         0.17985 0.02084  8.63  0e+00

Question:

> 1. In a logistic regression, how do we know if the model is
> "good"? I.e., do we have an equivalent of the R^2?

There are several equivalents of the R^2 for logistic regression. The simplest, the "pseudo-R^2" or "McFadden's R^2", is

  (deviance of the null model) - (deviance of the model)
  ------------------------------------------------------
              (deviance of the null model)

I recall that the deviance is -2 log(L), where L is the likelihood, and that the "null model" is "y ~ 1" (the model with no predictive variables, the model that assumes that the quantity we are modelling (the probability of Y) is constant).

It can be interpreted as the "proportion of the deviance explained by the model" -- similarly, the R^2 was the "proportion of variance explained by the model".

But it does not tell you if the model is good: if the R^2 is low, it can mean that there is a lot of noise or that the model is incomplete.

The idea is the same as for linear models: plots. To learn how to read them, let us start with a few simulations.

Random data:

%G
n <- 1000
y <- factor(sample(0:1, n, replace=T))
x <- rnorm(n)
r <- glm(y~x, family=binomial)
op <- par(mfrow=c(2,2))
plot(r, ask=F)
par(op)
%--

%G
library(Design)
r <- lrm(y~x, y=T, x=T)
P <- resid(r,"gof")['P']
resid(r, "partial", pl=T)
title(signif(P))
%--

Data coming from the model, with a few mislabelled observations (in red):

%G
n <- 1000
x <- rnorm(n)
a <- 1
b <- -2
p <- exp(a+b*x)/(1+exp(a+b*x))
y <- ifelse( runif(n) < p, 1, 0 )
i <- abs(x) > 1      # these observations get their label flipped
y <- ifelse(i, 1-y, y)
y <- factor(y, levels=0:1)
col=c(par('fg'),'red')[1+as.numeric(i)]
r <- glm(y~x, family=binomial)
op <- par(mfrow=c(2,2))
plot(r, ask=F, col=col)
par(op)
%--

%G
r <- lrm(y~x, y=T, x=T)
P <- resid(r,"gof")['P']
resid(r, "partial", pl=T)
title(signif(P))
%--

One variable too many:

%G
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
a <- 1
b <- -2
p <- exp(a+b*x1)/(1+exp(a+b*x1))
y <- factor(ifelse( runif(n) < p, 1, 0 ))
r <- glm(y~x1+x2, family=binomial)
op <- par(mfrow=c(2,2))
plot(r, ask=F)
par(op)
%--

> 2. What about the Wald, Hosmer and Lemeshow, Somers' D tests?
> What do they test? How do we compute them? How do we interpret
> them?

The Wald test I know (there might be another one with the same name) is an approximation of the likelihood ratio test: prefer the genuine one.

The likelihood ratio test compares two nested models whose parameters have been estimated by the maximum likelihood method. In R, it is computed by the "anova" function. (The "F test" comparing a linear model to the trivial model is a particular case of it.)

We would expect to be able to perform this test as with linear models:

  r <- glm(y~x1+x2, family=binomial())
  anova(r)

but no... You can perform the test by hand: the difference of the deviances of two nested models (here, the model we are interested in and the null model) follows, asymptotically, a Chi^2 distribution with p degrees of freedom, where p is the number of parameters that are fixed in the smaller model.

  r <- glm(y~x1+x2, family=binomial())
  pchisq(r$null.deviance - r$deviance, df=2, lower.tail=F)

You can use this test to compare two non-trivial models -- but they have to be nested.

  r0 <- glm(y~x1+x2, family=binomial)
  r1 <- glm(y~x1, family=binomial)
  r2 <- glm(y~x2, family=binomial)
  lr.test <- function (r.petit, r.gros) {
    pchisq(r.petit$deviance - r.gros$deviance, df=1, lower.tail=F)
  }
  lr.test(r1,r0)
  lr.test(r2,r0)

Here, we find that the model y~x1+x2 is significantly better than y~x1, with a risk of error below 1e-21, and that the model y~x1+x2 is significantly better than y~x2, with a risk of error under 1e-6.

Actually, the "lrm" function (in the Design package) already performs this test.

  lrm(y~x1+x2)

In our example,

       Obs  Max Deriv Model L.R.       d.f.          P          C        Dxy
       506      3e-06     234.13          2          0      0.862      0.724
     Gamma      Tau-a         R2      Brier
     0.725      0.363      0.494      0.145

the p-value is zero, so the model is significantly better than the trivial one.

I think that "Wald's test" also refers to the test checking whether a regression coefficient is zero (under the null hypothesis that this coefficient is zero, the estimated coefficient, divided by its standard deviation, follows a gaussian distribution; if you square it, it follows a Chi^2 distribution with one degree of freedom). Thus, if we look at the last column of the result of the previous function,

          Coef     S.E.    Wald Z P
Intercept -1.66990 0.27245 -6.13  0e+00
x1        -0.04954 0.01280 -3.87  1e-04
x2         0.17985 0.02084  8.63  0e+00

you can tell that the intercept and the x2 coefficient are non-zero with a negligible risk of error and that the x1 coefficient is non-zero with a risk of error of about 0.01%.

Actually, we already had those results with the "glm" function:

> summary(r0)

Call:
glm(formula = y ~ x1 + x2, family = binomial)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5902  -0.7571   0.1312   0.6106   2.0397

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.66990    0.27245  -6.129 8.84e-10 ***
x1          -0.04954    0.01280  -3.870 0.000109 ***
x2           0.17985    0.02084   8.632  < 2e-16 ***

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 701.46  on 505  degrees of freedom
Residual deviance: 467.33  on 503  degrees of freedom
AIC: 473.33

Number of Fisher Scoring iterations: 6

I have never heard of the Hosmer and Lemeshow test, but Google refers me to

http://www.biostat.wustl.edu/archives/html/s-news/1999-04/msg00147.html
http://maths.newcastle.edu.au/~rking/R/help/03b/0735.html
http://www.learnlink.mcmaster.ca/OpenForums/00031830-80000001/00040559-80000001/0045CA16-00977198-005B1B21

which claim that this test is obsolete. The "residuals.lrm" function in the Design package has a test to replace it -- but I do not know what is behind it, nor how to interpret it.

  library(Design)
  ?residuals.lrm

For our example:

> resid( lrm(y~x1+x2, y=T, x=T), "gof" )
Sum of squared errors     Expected value|H0                    SD
         7.328945e+01          7.632820e+01          5.739924e-01
                    Z                     P
        -5.294060e+00          1.196300e-07

I do not know Somers' D, but some simulations can help interpret it.

> library(Design)
> somers2(x1, as.numeric(y)-1)
          C         Dxy           n     Missing
  0.2846475  -0.4307051 506.0000000   0.0000000
> somers2(-x1, as.numeric(y)-1)
          C         Dxy           n     Missing
  0.7153525   0.4307051 506.0000000   0.0000000
> somers2(x2, as.numeric(y)-1)
          C         Dxy           n     Missing
  0.8551532   0.7103064 506.0000000   0.0000000
> somers2(-x2, as.numeric(y)-1)
          C         Dxy           n     Missing
  0.1448468  -0.7103064 506.0000000   0.0000000
> somers2(rnorm(506), as.numeric(y)-1)
           C          Dxy            n      Missing
  0.50564764   0.01129529 506.00000000   0.00000000

It seems to be interpretable as a correlation, when one of the variables only has two values, 0 and 1. So we have the same problems as with correlation: it is only relevant for linear relations. Thus the following computations give (approximately) 1 and 0.

  n <- 1000
  a <- rnorm(n)
  b <- ifelse(a<0, 0, 1)
  somers2(a,b)
  b <- ifelse(abs(a)<.5, 0, 1)
  somers2(a,b)

* TODO

  # Age
  age <- c(25.0, 32.5, 37.5, 42.5, 47.5, 52.5, 57.5, 65.0)
  # Number of trials
  n <- c(100, 150, 120, 150, 130, 80, 170, 100)
  # Number of successes (the variable to predict)
  Y <- c(10, 20, 30, 50, 60, 50, 130, 80)

  f <- Y/n
  g <- log(f/(1-f))                     # Transforming the data
  w <- n*f*(1-f)                        # Weights
  r <- predict(lm(g~age, weights=w))    # Weighted regression
  p <- exp(r)/(1+exp(r))                # Inverse transform
  plot(age, f, ylim=c(0,1))
  lines(age, p)
  symbols(age, f, circles=w, add=T)

  # Iterate...
  w <- n*p*(1-p)
  gu <- r + (f-p)/p/(1-p)
  r <- predict(lm(gu~age, weights=w))

  # Once more
  p <- exp(r)/(1+exp(r))
  w <- n*p*(1-p)
  gu <- r + (f-p)/p/(1-p)
  r <- predict(lm(gu~age, weights=w))

Actually, we get the same results as

  glm(cbind(Y,n-Y)~age, family=binomial)

* Variants of logistic regression

+ Log-linear model (Poisson regression)

The variable to predict is no longer a binary variable, but a counting variable. To perform this Poisson regression, we still use the "glm" function, with the "family=poisson" argument. I do not give any example here but just refer to those from the manual.

  library(MASS)
  ?epil
  ?housing

TODO: Give an example (at least a simulation).

See also the "loglm", "multinom", "gam" functions.

  ?loglm
  library(nnet)
  ?multinom
  library(mgcv)
  ?gam
  library(survival)
  ?survexp

+ If the qualitative variable to predict has more than two values (first attempt)

Actually, I do not really know how to proceed.

%G
# NO!
n <- 100
x <- c( rnorm(n), 1+rnorm(n), 2.5+rnorm(n) )
y <- factor(c( rep(0,n), rep(1,n), rep(2,n) ))
r <- glm(y~x, family=binomial)
plot(as.numeric(y)-1~x)
xp <- seq(-5,5,length=200)
yp <- predict(r, data.frame(x=xp), type='response')
lines(xp,yp)
%--

Let us try to encode the variable to predict with binary variables, by hand. First attempt: brutally, thoughtlessly.

%G
n <- 100
x <- c( rnorm(n), 10+rnorm(n), 25+rnorm(n), -7+rnorm(n) )
y  <- factor(c( rep(0,n), rep(1,n), rep(2,n), rep(3,n) ))
y1 <- factor(c( rep(0,n), rep(0,n), rep(1,n), rep(1,n) ))
y2 <- factor(c( rep(0,n), rep(1,n), rep(0,n), rep(1,n) ))
r1 <- glm(y1~x, family=binomial)
r2 <- glm(y2~x, family=binomial)
xp <- seq(-50,50,length=500)
y1p <- predict(r1, data.frame(x=xp), type='response')
y2p <- predict(r2, data.frame(x=xp), type='response')
plot(as.numeric(y)-1~x)
lines(xp, y1p+2*y2p)
lines(xp, y1p, col='red')
lines(xp, y2p, col='blue')
%--

It does not work, because Y1 and Y2 do not come from a logistic model:

%G
plot(as.numeric(y1)-1~x)
lines(xp, y1p, col='red')
%--

Second attempt.

%G
n <- 100
x <- c( rnorm(n), 10+rnorm(n), 25+rnorm(n), -7+rnorm(n) )
y  <- factor(c( rep(0,n), rep(1,n), rep(2,n), rep(3,n) ))
y1 <- factor(c( rep(0,n), rep(1,n), rep(1,n), rep(1,n) ))
y2 <- factor(c( rep(0,n), rep(0,n), rep(1,n), rep(1,n) ))
y3 <- factor(c( rep(0,n), rep(0,n), rep(0,n), rep(1,n) ))
r1 <- glm(y1~x, family=binomial)
r2 <- glm(y2~x, family=binomial)
r3 <- glm(y3~x, family=binomial)
xp <- seq(-50,50,length=500)
y1p <- predict(r1, data.frame(x=xp), type='response')
y2p <- predict(r2, data.frame(x=xp), type='response')
y3p <- predict(r3, data.frame(x=xp), type='response')
plot(as.numeric(y)-1~x)
lines(xp, y1p+y2p+y3p)
%--

Same problem...

%G
plot(as.numeric(y1)-1~x)
lines(xp, y1p, col='red')
%--

Third attempt: with ordinal variables.

%G
n <- 100
x <- c( -7+rnorm(n), rnorm(n), 10+rnorm(n), 25+rnorm(n) )
y  <- factor(c( rep(0,n), rep(1,n), rep(2,n), rep(3,n) ))
y1 <- factor(c( rep(0,n), rep(1,n), rep(1,n), rep(1,n) ))
y2 <- factor(c( rep(0,n), rep(0,n), rep(1,n), rep(1,n) ))
y3 <- factor(c( rep(0,n), rep(0,n), rep(0,n), rep(1,n) ))
r1 <- glm(y1~x, family=binomial)
r2 <- glm(y2~x, family=binomial)
r3 <- glm(y3~x, family=binomial)
xp <- seq(-50,50,length=500)
y1p <- predict(r1, data.frame(x=xp), type='response')
y2p <- predict(r2, data.frame(x=xp), type='response')
y3p <- predict(r3, data.frame(x=xp), type='response')
plot(as.numeric(y)-1~x)
lines(xp, y1p+y2p+y3p)
%--

%G
n <- 100
x <- c( -.7+rnorm(n), rnorm(n), 1+rnorm(n), 2.5+rnorm(n) )
y  <- factor(c( rep(0,n), rep(1,n), rep(2,n), rep(3,n) ))
y1 <- factor(c( rep(0,n), rep(1,n), rep(1,n), rep(1,n) ))
y2 <- factor(c( rep(0,n), rep(0,n), rep(1,n), rep(1,n) ))
y3 <- factor(c( rep(0,n), rep(0,n), rep(0,n), rep(1,n) ))
r1 <- glm(y1~x, family=binomial)
r2 <- glm(y2~x, family=binomial)
r3 <- glm(y3~x, family=binomial)
xp <- seq(-5,5,length=500)
y1p <- predict(r1, data.frame(x=xp), type='response')
y2p <- predict(r2, data.frame(x=xp), type='response')
y3p <- predict(r3, data.frame(x=xp), type='response')
plot(as.numeric(y)-1~x)
lines(xp, y1p+y2p+y3p)
lines(xp, round(y1p+y2p+y3p, digits=0), col='red')
%--

+ Ordinal Logistic Regression

You can also use logistic regression to forecast an ordinal qualitative variable (i.e., a qualitative variable whose values are ordered -- e.g., its values could be "low", "medium", "high"). Here are some of the models you might want to use.

Proportional odds model (the effects are monotonic with respect to the quantitative variable):

                               1
  P[ Y >= j | X ] = -----------------------
                         - ( a_j + X b )
                    1 + e

Continuation ratio model:

                                       1
  P[ Y = j | Y >= j, X ] = -----------------------
                                - ( a_j + X b )
                           1 + e

Actually, our attempts in the preceding paragraph have led us to the first model.

%G
n <- 100
x <- c( -.7+rnorm(n), rnorm(n), 1+rnorm(n), 2.5+rnorm(n) )
y <- factor(c( rep(0,n), rep(1,n), rep(2,n), rep(3,n) ))
ordinal.regression.one <- function (y,x) {
  xp <- seq(min(x), max(x), length=100)
  yi <- matrix(nc=length(levels(y)), nr=length(y))
  ri <- list()
  ypi <- matrix(nc=length(levels(y)), nr=100)
  for (i in 1:length(levels(y))) {
    yi[,i] <- as.numeric(y) >= i
    ri[[i]] <- glm(yi[,i] ~ x, family=binomial)
    ypi[,i] <- predict(ri[[i]], data.frame(x=xp), type='response')
  }
  plot(as.numeric(y) ~ x)
  lines(xp, apply(ypi,1,sum), col='red', lwd=3)
}
ordinal.regression.one(y,x)
%--

%G
n <- 100
v <- .2
x <- c( -.7+v*rnorm(n), v*rnorm(n), 1+v*rnorm(n), 2.5+v*rnorm(n) )
y <- factor(c( rep(0,n), rep(1,n), rep(2,n), rep(3,n) ))
ordinal.regression.one(y,x)
%--

Here is the second.

%G
n <- 100
x <- c( -.7+rnorm(n), rnorm(n), 1+rnorm(n), 2.5+rnorm(n) )
y <- factor(c( rep(0,n), rep(1,n), rep(2,n), rep(3,n) ))
ordinal.regression.two <- function (y,x) {
  xp <- seq(min(x), max(x), length=100)
  yi <- list()
  ri <- list()
  ypi <- matrix(nc=length(levels(y)), nr=100)
  for (i in 1:length(levels(y))) {
    ya <- as.numeric(y)
    o <- ya >= i
    ya <- ya[o]
    xa <- x[o]
    yi[[i]] <- ya == i
    ri[[i]] <- glm(yi[[i]] ~ xa, family=binomial)
    ypi[,i] <- predict(ri[[i]], data.frame(xa=xp), type='response')
  }
  # The plot is trickier to draw than earlier
  plot(as.numeric(y) ~ x)
  p <- matrix(0, nc=length(levels(y)), nr=100)
  for (i in 1:length(levels(y))) {
    p[,i] = ypi[,i] * (1 - apply(p,1,sum))
  }
  for (i in 1:length(levels(y))) {
    p[,i] = p[,i]*i
  }
  lines(xp, apply(p,1,sum), col='red', lwd=3)
}
ordinal.regression.two(y,x)
%--

%G
n <- 100
v <- .1
x <- c( -.7+v*rnorm(n), v*rnorm(n), 1+v*rnorm(n), 2.5+v*rnorm(n) )
y <- factor(c( rep(0,n), rep(1,n), rep(2,n), rep(3,n) ))
ordinal.regression.two(y,x)
%--

Actually, it is already implemented:

  library(MASS)
  ?polr

+ Multilogistic regression (aka multinomial regression)

We want to predict a qualitative variable with more than two values.

TODO

Model:

       P( Y=1 | X=x )
  log ---------------- = b_{1,0} + b_1 x
       P( Y=k | X=x )

       P( Y=2 | X=x )
  log ---------------- = b_{2,0} + b_2 x
       P( Y=k | X=x )

  ...

       P( Y=k-1 | X=x )
  log ------------------ = b_{k-1,0} + b_{k-1} x
       P( Y=k | X=x )

(fit with MLE)

  library(nnet)
  ?multinom

If the variable is binary, we get the same results as with a classical logistic regression -- TODO: check this.
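A quick simulation (a sketch) to check this last claim: with a two-level variable to predict, multinom and glm should give essentially the same coefficients.

  library(nnet)
  n <- 200
  x <- rnorm(n)
  p <- 1 / (1 + exp(-(1 - 2*x)))          # true intercept 1, true slope -2
  y <- factor(ifelse(runif(n) < p, 1, 0))
  coef( glm(y ~ x, family=binomial) )     # logistic regression
  coef( multinom(y ~ x) )                 # multinomial regression, 2 classes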