Programming in R

In this part, after quickly listing the main characteristics
of the language, we present the basic data types, how to
create them, how to explore them, how to extract pieces of
them, how to modify them.

We then jump to more advanced subjects (most of which can --
should?  -- be omitted by first-time readers): debugging,
profiling, namespaces, objects, interface with other
programs, with data bases, with other languages.

* The R language

+ Control structures

Actually, R is a programming language: as such, we have the
usual control structures (loops, conditionnals, recursion,
etc.)

Conditionnal statements:

  if(...) {
    ...
  } else {
    ...
  }

Conditionnals may be used inside other constructions.

  x <- if(...) 3.14 else 2.71

You can also construct vectors from conditionnal
expressions, with the "ifelse" function.

  x <- rnorm(100)
  y <- ifelse(x>0, 1, -1)
  z <- ifelse(x>0, 1, ifelse(x<0, -1, 0))

Switch (I do not like this command -- this is probably the
last time you see it in this document):

  x <- letters[floor(1+runif(1,0,4))]
  y <- switch(x,
              a='Bonjour',
              b='Gutten Tag',
              c='Hello',
              d='Konnichi wa',
             )

For loop (we loop over the elements of a vector or list):

  for (i in 1:10) {
    ...
    if(...) { next }
    ...
    if(...) { break }
    ...
  }

While loop:

  while(...) {
    ...
  }

Repeat loop:

  repeat {
    ...
    if(...) { break }
    ...
  }

+ Functions

R belongs to the family of functionnal languages (Lisp,
OCaML, but also Python): the notion of function is
central. In particular, if you need it, you can write
functions that take other functions as argument -- and in
case you wonder, yes, you need it.

A function is defined as follows.

  f <- function(x) {
    x^2 + x + 1
  }

The return value is the last value computed -- but you can
also use the "return" function.

  f <- function(x) {
    return( x^2 + x + 1 )
  }

Arguments can have default values. 

  f <- function(x, y=3) { ... }

When you call a function you can use the argument names,
without any regard to their order (this is very useful for
functions that expect many arguments -- in particular
arguments with default values).

  f(y=4, x=3.14)

After the arguments, in the definition of a function, you
can put three dots represented the arguments that have not
been specified and that can passed through another function
(very often, the "plot" function).

  f <- function(x, ...) {
    plot(x, ...)
  }

But you can also use this to write functions that take an
arbitrary number of arguments:

  f <- function (...) {
    query <- paste(...)  # Concatenate all the arguments to form a string
    con <- dbConnect(dDriver("SQLite"))
    dbGetQuery(con, query) 
    dbDisconnect(con)
  }

  f <- function (...) {
    l <- list(...) # Put the arguments in a (named) list
    for (i in seq(along=l)) {
      cat("Argument name:", names(l)[i], "Value:", l[[i]], "\n")
    }
  }  

Functions have NO SIDE EFFECTS: all the modifications are
local. In particular, you cannot write a function that
modifies a global variable. (Well, if you really want, you
can: see the "Dirty Tricks" part -- but you should not).

+ How to get the code of a function?

To get the code of a function, you can just type its name -- wit no
brackets. 

  > IQR
  function (x, na.rm = FALSE)
  diff(quantile(as.numeric(x), c(0.25, 0.75), na.rm = na.rm, names = FALSE))
  <environment: namespace:stats>

But sometimes, it does not work that well: if we want to
peer inside the "predict" function that we use for
predictions of linear models, we get.

  > predict
  function (object, ...)
  UseMethod("predict")
  <environment: namespace:stats>

This is a generic function: we can use the same function on
different objects (lm for linear regression, glm for Poisson
or logistic regression, lme for mixed models, etc.). The
actual function called is "predict.Foo" where "Foo" is the
class of the object given as a first argument.

  > methods("predict")
   [1] predict.ar*                predict.Arima*
   [3] predict.arima0*            predict.glm
   [5] predict.HoltWinters*       predict.lm
   [7] predict.loess*             predict.mlm
   [9] predict.nls*               predict.poly
  [11] predict.ppr*               predict.prcomp*
  [13] predict.princomp*          predict.smooth.spline*
  [15] predict.smooth.spline.fit* predict.StructTS*

     Non-visible functions are asterisked

As we wanted the one for the "lm" object, we just type (I do
not include all the code, it would take several pages):

  > predict.lm
  function (object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
      interval = c("none", "confidence", "prediction"), level = 0.95,
      type = c("response", "terms"), terms = NULL, na.action = na.pass,
      ...)
  {
      tt <- terms(object)
      if (missing(newdata) || is.null(newdata)) {
          mm <- X <- model.matrix(object)
          mmDone <- TRUE
          offset <- object$offset
  (...)
      else if (se.fit)
          list(fit = predictor, se.fit = se, df = df, residual.scale = sqrt(res.var))
      else predictor
  }
  <environment: namespace:stats>

But if we wanted the "predict.prcomp" function (to add new
observations to a principal component analysis), it does not
work:

  > predict.prcomp
  Error: Object "predict.prcomp" not found

The problem is that the function is in a given namespace (R
functions are stored in "packages" and each function is
hidden in a namespace; the functions that a normal user is
likely to use directly are exported and visible -- but the
others, that are not supposed to be invoked directly by the
user are hidden, invisible). We can get it with the
"getAnywhere" function (here again, I do not include all the
resulting code).

  > getAnywhere("predict.prcomp")
  A single object matching "predict.prcomp" was found
  It was found in the following places
    registered S3 method for predict from namespace stats
    namespace:stats
  with value

  function (object, newdata, ...)
  {
      if (missing(newdata)) {
  (...)
  }
  <environment: namespace:stats>

Alternatively, we can use the getS3Method function. 

  > getS3method("predict", "prcomp")
  function (object, newdata, ...)
  {
  (...)

Alternatively, if we know in which package a function (or
any object, actually is), we can access it with the "::"
operator if it is exported (it can be exported but hidden by
another object with the same name) or the ":::" operator if
it is not.

  > stats::predict.prcomp
  Error: 'predict.prcomp' is not an exported object from 'namespace:stats'

  > stats:::predict.prcomp
  function (object, newdata, ...)
  {
  (...)

  > lm <- 1
  > lm
  [1] 1
  > stats::lm
  function (formula, data, subset, weights, na.action, method = "qr",
      model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
      contrasts = NULL, offset, ...)
  (...)

Things can get even more complicated. The most common reason
you want to peer into the code of a function is to extract
some information that gets printed when it is run
(typically, a p-value when performing a
regression). Actually, quite often, this information is not
printed when the function is run: the function performs some
computations and returns an object, with a certain class
(with our example, this would be the "lm" function and the
"lm" class) which is then printed, with the "print"
function.

  > print
  function (x, ...)
  UseMethod("print")
  <environment: namespace:base>

As the object belong to the "lm" class: 

  > print.lm
  function (x, digits = max(3, getOption("digits") - 3), ...)
  {
      cat("\nCall:\n", deparse(x$call), "\n\n", sep = "")
      if (length(coef(x))) {
          cat("Coefficients:\n")
          print.default(format(coef(x), digits = digits), print.gap = 2,
              quote = FALSE)
      }
      else cat("No coefficients\n")
        cat("\n")
      invisible(x)
  }
  <environment: namespace:stats>

Same for the "summary" function: it takes the result of a
function (say, the result of the "lm" function), builds
another object (here, of class "summary.lm") on which the
"print" function is called.

  > class(r)
  [1] "lm"

  > s <- summary(r)

  > class(s)
  [1] "summary.lm"

  > summary
  function (object, ...)
  UseMethod("summary")
  <environment: namespace:base>

  > summary.lm
  function (object, correlation = FALSE, symbolic.cor = FALSE, ...)
  {
    z <- object
    p <- 
  (...)

  > print.summary.lm
  Error: Object "print.summary.lm" not found

  > getAnywhere("print.summary.lm")
  A single object matching "print.summary.lm" was found
  It was found in the following places
    registered S3 method for print from namespace stats
    namespace:stats
  with value

  function (x, digits = max(3, getOption("digits") - 3), symbolic.cor = x$symbolic.cor,
      signif.stars = getOption("show.signif.stars"), ...)
  {
      cat("\nCall:\n")
      cat(paste(deparse(x$cal
  (...)

But it does not always work... There are two object-oriented
programming paradigms in R: what we have explained works for
the first (old, simple, understandandable) one. Here is an
example for the other.

  > class(r)
  [1] "lmer"
  attr(,"package")
  [1] "lme4"

  > print.lmer
  Error: Object "print.lmer" not found

  > getAnywhere("print.lmer")
  no object named "print.lmer" was found

The function is no longer called "print" but "show"...

  > getMethod("show", "lmer")
  Method Definition:

  function (object)
  show(new("summary.lmer", object, useScale = TRUE, showCorrelation = FALSE))
  <environment: namespace:lme4>

  Signatures:
          object
  target  "lmer"
  defined "lmer"

In this case, it simply calls the "summary" function (with
arguments that are not the default arguments) and the "show"
on the result.

  > getMethod("summary", "lmer")
  Method Definition:

  function (object, ...)
  new("summary.lmer", object, useScale = TRUE, showCorrelation = TRUE)
  <environment: namespace:lme4>

  Signatures:
          object
  target  "lmer"
  defined "lmer"
  > getMethod("show", "summary.lmer")
  Method Definition:

  function (object)
  {
      fcoef <- fixef(object)
      useScale <- object@useScale
  (...)
      invisible(object)
  }
  <environment: namespace:lme4>

  Signatures:
          object
  target  "summary.lmer"
  defined "summary.lmer"

+ Functions with side effects

Plotting functions are used for their side effect (the plot that appears
on the screen), but they can also return a value.

That value can be the result of the computations that lead to the plot.
Usually, you do not want the result to be printed, because most users will
to see the plot and nothing else, and those who actually want the data,
want it for further processing and will store it in a variable. To
this end, you can return the value as invisible(): it will not be printed.

  f <- function (x, y, N=10, FUN=median, ...) {
    x <- cut(x, quantile(x, seq(0,1,length=N)), include.lowest=TRUE)
    y <- tapply(y, x, FUN, na.rm=TRUE)
    x <- levels(x)
    plot(1:length(x), y, ...)
    result <- cbind(x=x, y=y)
    invisible(result)
  }
  f(rnorm(100), rnorm(100),
    type="o", pch=15, xlab="x fractiles", ylab="y median", las=1)

  res <- f(rnorm(100), rnorm(100),
           type="o", pch=15, xlab="x fractiles", ylab="y median", las=1)
  str(res)
  res # Now its gets printed

Some plotting functions return a "plotting object", that can be stored,
modified and later plotted, with the print() function.

  r <- xyplot(rnorm(10) ~ rnorm(10)) # Does not plot anything
  print(r)  # Plots the data
  r         # Plots the data: print() is implicitely called
  str(r)    # An object of class "treillis", so that print(r)
            # actually calls 
  r$panel.args.common$pch <- 15  # Modify the plot
  r                              # Replot it

+ Operators

The following operators mean what you thing they mean -- but they tend
to be applied to vectors.

  + * / - ^ 
  < <= > >= == != 

The boolean operators are !, & et | (but you can write && or ||
instead of | and &: the result is then a scalar, even if the arguments
are vectors).

The : (colon) operator creates vectors.

  > -5:7
   [1] -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7

The [ operator retrieves one or several elements of a vector, matrix,
data frame or arrow. 

  > x <- floor(10*runif(10))
  > x
   [1] 3 6 5 1 0 6 7 8 5 8
  > x[3]
  [1] 5
  > x[1:3]
  [1] 3 6 5
  > x[c(1,2,5)]
  [1] 3 6 0

The $ operator retrieves an element in a list, with no need to put its
name between quotes, contrary to the [[ operator. 
The interest of the [[ operator is that is argument can be a
variable. 

  > op <- par()
  > op$col
  [1] "black"
  > op[["col"]]
  [1] "black"
  > a <- "col"
  > op[[a]]
  [1] "black"

Assignment is written "<-".
Some people use "=" instead: this will work most of the time, but not
always (for instance, in "try" statements) -- it is easier, safer and
more readable to use "<-". 

  x <- 1.17
  y <- c(1, 2, 3, 4)

The matrix product is %*%, tensor product (aka Kronecker product) is %x%.

  > A <- matrix(c(1,2,3,4), nr=2, nc=2)
  > J <- matrix(c(1,0,2,1), nr=2, nc=2)
  > A
       [,1] [,2]
  [1,]    1    3
  [2,]    2    4
  > J
       [,1] [,2]
  [1,]    1    2
  [2,]    0    1
  > J %x% A
       [,1] [,2] [,3] [,4]
  [1,]    1    3    2    6
  [2,]    2    4    4    8
  [3,]    0    0    1    3
  [4,]    0    0    2    4

The %o% operator builds multiplication tables (it calls the "outer"
function with the multiplication).

  > A <- 1:5
  > B <- 11:15
  > names(A) <- A
  > names(B) <- B

  > A %o% B
    11 12 13 14 15
  1 11 12 13 14 15
  2 22 24 26 28 30
  3 33 36 39 42 45
  4 44 48 52 56 60
  5 55 60 65 70 75

  > outer(A,B, '*')
    11 12 13 14 15
  1 11 12 13 14 15
  2 22 24 26 28 30
  3 33 36 39 42 45
  4 44 48 52 56 60
  5 55 60 65 70 75

Euclidian division is written %/%, its remainder %%.

  > 1234 %% 3
  [1] 1
  > 1234 %/% 3
  [1] 411
  > 411*3 + 1
  [1] 1234

"Set" membership is written %in%.

  > 17 %in% 1:100
  [1] TRUE
  > 17.1 %in% 1:100
  [1] FALSE

The ~ and | operators are used to describe statistical model: more about them
later.

For more details (and for the operators I have not mentionned, such as
<<- -> ->> @ :: ::: _ =), read the manual.

  ?"+"
  ?"<"
  ?"<-"
  ?"!"
  ?"["
  ?Syntax
  ?kronecker
  ?match
  library(methods)
  ?slot

  TODO: mention <<- (and the reverse ->, ->>)

You can also define your own operators: these are just 2-arguments
functions whose name starts and ends with "%". The following example
comes from the manual. 

  > "%w/o%" <- function(x,y) x[!x %in% y]
  > (1:10) %w/o% c(3,7,12)
  [1]  1  2  4  5  6  8  9 10

Other example, to turn a 2-argument function into an operator, that
can be easily used for more than two arguments:

  "%i%" <- intersect
  intersect(x,y)   # Only two arguments
  intersect( intersect(x,y), z )
  x %i% y %i% z

+ Global variables

  TODO: See below ("dirty tricks") for actual global variables -- avoid them
  TODO: options(), par()

+ Object Oriented Programming

This is a tricky bit. Object Orientation was added to R as an
afterthought -- even worse, it has been added twice.

The first flavour, S3 classes, is rather simple: you add a "class"
attribute to a normal object (list, vector, etc.); you then define
a "generic" (C++ programmers would say "virtual" function), e.g., "plot",
that looks at the class of its first argument and dispatches the call
to the right function (e.g., for an object of class "Foo", the plot.Foo()
function would be called).

The second flavour, S4 classes, is more intricate: it tries to copy
the paradigm used in most object-oriented programming languages. For
large projects, it might be a good idea, but think carefully!

More recently, several packages suggested other ways of programming
with objects within R: R.oo and proto

* Data structures

As all Matlab-like software (remember that "Matlab" stands
for "Matrix Laboratory" -- it has noting to do with
Mathematics), R handles tables of numbers. Yet, there are
different kinds of tables: vectors (tables of dimension 1),
matrices (tables of dimension 2), arrays (tables of any
dimension), "Data Frames" (tables of dimension 2, in which
each column may have a different type -- for instance, a
table containing the results of an experiment, with one row
per subject and one column per variable). We shall now
present in more detail each of these, explain how to build
them, to manipulate them, to transform them, to convert them
-- in the next chapter, we shall plot them.

+ Vectors

Here are several ways to define them (here, "c" stands for
"concatenate"). 

  > c(1,2,3,4,5)
  [1] 1 2 3 4 5

  > 1:5
  [1] 1 2 3 4 5

  > seq(1, 5, by=1)
  [1] 1 2 3 4 5

  > seq(1, 5, lenth=5)
  [1] 1 2 3 4 5

Here are several ways to select a part of a vector.

  > x <- seq(-1, 1, by=.1)
  > x
   [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1  0.0  0.1  0.2  0.3  0.4
  [16]  0.5  0.6  0.7  0.8  0.9  1.0

  > x[5:10]
  [1] -0.6 -0.5 -0.4 -0.3 -0.2 -0.1

  > x[c(5,7:10)]
  [1] -0.6 -0.4 -0.3 -0.2 -0.1

  > x[-(5:10)]   # We remove the elements whose index lies between 5 and 10
   [1] -1.0 -0.9 -0.8 -0.7  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0

  > x>0
   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
  [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
  > x[ x>0 ]
   [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
  
We can name the coordinates of a vector -- and then access
its elements by their names.

  > names(x)
  NULL
  > names(x) <- letters[1:length(x)]    # "letters" is a vector of strings,
                                        # containing 26 lower case letters.
                                        # There is also LETTERS for upper
                                        # case letters.  
  > x
     a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p
  -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1  0.0  0.1  0.2  0.3  0.4  0.5
     q    r    s    t    u
   0.6  0.7  0.8  0.9  1.0
  > x["r"]
    r
  0.7

One can also define those names while creating the vector.

  > c(a=1, b=5, c=10, d=7)
   a  b  c  d
   1  5 10  7

A few operations on vectors:

  > x <- rnorm(10)
  > sort(x)
   [1] -1.4159893 -1.1159279 -1.0598020 -0.2314716  0.3117607  0.5376470
   [7]  0.6922798  0.9316789  0.9761509  1.1022298
  > rev(sort(x))
   [1]  1.1022298  0.9761509  0.9316789  0.6922798  0.5376470  0.3117607
   [7] -0.2314716 -1.0598020 -1.1159279 -1.4159893
  > o <- order(x)
  > o
   [1]  3  1  9  6  4  7  8 10  2  5
  > x[ o[1:3] ]
  [1] -1.415989 -1.115928 -1.059802

  > x <- sample(1:5, 10, replace=T)
  > x
   [1] 1 4 5 3 1 3 4 5 3 1
  > sort(x)
   [1] 1 1 1 3 3 3 4 4 5 5
  > unique(x)    # We need not sort the data before (this contrasts 
                 # with Unix's "uniq" command)
  [1] 1 4 5 3

Here are still other ways of creating vectors. The "seq"
command generates arithmetic sequences.

  > seq(0,10, length=11)
   [1]  0  1  2  3  4  5  6  7  8  9 10
  > seq(0,10, by=1)
   [1]  0  1  2  3  4  5  6  7  8  9 10

The "rep" command repeats a number or a vector.

  > rep(1,10)
   [1] 1 1 1 1 1 1 1 1 1 1
  > rep(1:5,3)
   [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

It can also repeat each element several times.

  > rep(1:5,each=3)
   [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5

We can mix the two previous operations.

  > rep(1:5,2,each=3)
   [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5

The "gl" command serves a comparable purpose, mainly to
create factors -- more about this in a few pages.

+ Factors

A factor is a vector coding for a qualitatitative variable
(a qualitative variable is a non-numeric variable, such as
gender, color, species, etc. -- or, at least, a variable
whose actual numeric values are meaningless, for example,
zip codes). We can create them with the "factor" command.

  > x <- factor( sample(c("Yes", "No", "Perhaps"), 5, replace=T) )
  > x
  [1] Perhaps Perhaps Perhaps Perhaps No
  Levels:  No Perhaps

We can specify the list of acceptable values, or "levels" of
this factor.

  > l <- c("Yes", "No", "Perhaps")
  > x <- factor( sample(l, 5, replace=T), levels=l )
  > x
  [1] No        Perhaps   No        Yes       Yes
  Levels:  Yes No Perhaps   
  > levels(x)
  [1] "Yes"       "No"        "Perhaps"  

One can summarize a factor with a contingency table.

  > table(x)
  x
        Yes        No   Perhaps
          2         2         1

We can create a factor that follows a certain pattern with
the "gl" command.

  > gl(1,4)
  [1] 1 1 1 1
  Levels:  1
  > gl(2,4)
  [1] 1 1 1 1 2 2 2 2
  Levels:  1 2
  > gl(2,4, labels=c(T,F))
  [1] TRUE  TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE
  Levels:  TRUE FALSE
  > gl(2,1,8)
  [1] 1 2 1 2 1 2 1 2
  Levels:  1 2
  > gl(2,1,8, labels=c(T,F))
  [1] TRUE  FALSE TRUE  FALSE TRUE  FALSE TRUE  FALSE
  Levels:  TRUE FALSE

The "interaction" command builds a new factor by
concatenating the levels of two factors.

  > x <- gl(2,4)
  > x
  [1] 1 1 1 1 2 2 2 2
  Levels:  1 2
  > y <- gl(2,1,8)
  > y
  [1] 1 2 1 2 1 2 1 2
  Levels:  1 2
  > interaction(x,y)
  [1] 1.1 1.2 1.1 1.2 2.1 2.2 2.1 2.2
  Levels:  1.1 2.1 1.2 2.2
  > data.frame(x,y, int=interaction(x,y))
    x y int
  1 1 1 1.1
  2 1 2 1.2
  3 1 1 1.1
  4 1 2 1.2
  5 2 1 2.1
  6 2 2 2.2
  7 2 1 2.1
  8 2 2 2.2

The "expand.grid" computes a cartesian product (and yields a
data.frame).

  > x <- c("A", "B", "C")
  > y <- 1:2
  > z <- c("a", "b")
  > expand.grid(x,y,z)
     Var1 Var2 Var3
  1     A    1    a
  2     B    1    a
  3     C    1    a
  4     A    2    a
  5     B    2    a
  6     C    2    a
  7     A    1    b
  8     B    1    b
  9     C    1    b
  10    A    2    b
  11    B    2    b
  12    C    2    b

When playing with factors, people sometimes want to turn
them into numbers. This can be ambiguous and/or dangerous.

  > x <- factor(c(3,4,5,1))
  > as.numeric(x) # No NOT do that
  [1] 2 3 4 1
  > x
  [1] 3 4 5 1
  Levels: 1 3 4 5

What you get is the numbers internally used to code the
various levels of the factor -- and it depends on the order
of the factors...

Instead, try one of the following: 

  > x
  [1] 3 4 5 1
  Levels: 1 3 4 5

  > levels(x)[ x ]
  [1] "3" "4" "5" "1"

  > as.numeric( levels(x)[ x ] )
  [1] 3 4 5 1

  > as.numeric( as.character(x) )   # probably slower
  [1] 3 4 5 1

+ Ordered factors

  TODO

+ Missing values

The missing values are coded as "NA" (it stands for "Not
Available").

  > x <- c(1,5,9,NA,2)
  > x
  [1]  1  5  9 NA  2

The default behaviour of many functions is to reject data
containing missing values -- this is natural when the result
would depend on the missing value, were it not missing.

  > mean(x)
  [1] NA

But of course, you can ask R to first remove the missing values. 

  > mean(x, na.rm=T)
  [1] 4.25

You can do that yourself with the "na.omit" function. 

  > x
  [1]  1  5  9 NA  2

  > na.omit(x)
  [1] 1 5 9 2
  attr(,"na.action")
  [1] 4
  attr(,"class")
  [1] "omit"

This also works with data.frames -- it discards the rows
containing at least one missing value.

  > d <- data.frame(x, y=rev(x))

  > d
     x  y
  1  1  2
  2  5 NA
  3  9  9
  4 NA  5
  5  2  1

  > na.omit(d)
    x y
  1 1 2
  3 9 9
  5 2 1

You should NOT use missing values in boolean tests: if you
test wether two numbers are equal, and if one (or both) is
(are) missing, then you cannot conclude: the result will be
NA.

  > x
  [1]  1  5  9 NA  2

  > x == 5
  [1] FALSE  TRUE FALSE    NA FALSE

  > x == NA              # If we compare with something unknown, the
                         # result is unknown. 
  [1] NA NA NA NA NA

To test if a value is missing, use the "is.na" function. 

  > is.na(x)
  [1] FALSE FALSE FALSE  TRUE FALSE

This is not the only way of getting non-numeric values in a
numeric vector: you can also get +Inf, -Inf (positive and
negative infinites), and NaN (Not a Number).

  > x <- c(-1, 0,1,2,NA)
  > cbind(X=x, LogX=log(x))
        X      LogX
  [1,] -1       NaN
  [2,]  0      -Inf
  [3,]  1 0.0000000
  [4,]  2 0.6931472
  [5,] NA        NA
  Warning message:
  NaNs produced in: log(x)

You can check wether a numeric value is actually numeric
with the "is.finite" function.

  > is.finite(log(x))
  [1] FALSE FALSE  TRUE  TRUE FALSE

+ Data Frames

A data frame may be seen as a list of vectors, each with the
same length. Usually, the table has one row for each subject
in the experiment, and one column for each variable measured
in the experiement -- as the different variables measure
different things, they maight have different types: some
will be quantitative (numbers; each column may contain a
measurement in a different unit), others will be qualitative
(i.e., factors).

  > n <- 10
  > df <- data.frame( x=rnorm(n), y=sample(c(T,F),n,replace=T) )

The "str" command prints out the structure of an object (any
object) and display a part of the data it contains.

  > str(df)
  `data.frame':   10 obs. of  2 variables:
   $ x: num   0.515 -1.174 -0.523 -0.146  0.410 ...
   $ y: logi  FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

The "summary" command print concise information about an
object (here, a data.frame, but it could be anything).

  > summary(df)
         x                 y
   Min.   :-1.17351   Length:10
   1st Qu.:-0.42901   Mode  :logical
   Median : 0.13737
   Mean   : 0.09217
   3rd Qu.: 0.48867
   Max.   : 1.34213

  > df
               x     y
  1   0.51481130 FALSE
  2  -1.17350867  TRUE
  3  -0.52338041 FALSE
  4  -0.14589347 FALSE
  5   0.41022626 FALSE
  6   1.34213009  TRUE
  7   0.77715729 FALSE
  8  -0.55460889 FALSE
  9  -0.03843468 FALSE
  10  0.31318467 FALSE

Different ways to access the columns of a data.frame.

  > df$x
   [1]  0.51481130 -1.17350867 -0.52338041 -0.14589347  0.41022626  1.34213009
   [7]  0.77715729 -0.55460889 -0.03843468  0.31318467
  > df[,1]
   [1]  0.51481130 -1.17350867 -0.52338041 -0.14589347  0.41022626  1.34213009
   [7]  0.77715729 -0.55460889 -0.03843468  0.31318467
  > df[["x"]]
   [1]  0.51481130 -1.17350867 -0.52338041 -0.14589347  0.41022626  1.34213009
   [7]  0.77715729 -0.55460889 -0.03843468  0.31318467

  > dim(df)
  [1] 10  2
  > names(df)
  [1] "x" "y"
  > row.names(df)
   [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

One may change the colomn/row names.

  > names(df) <- c("a", "b")
  > row.names(df) <- LETTERS[1:10]
  > names(df)
  [1] "a" "b"
  > row.names(df)
   [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
  > str(df)
  `data.frame':   10 obs. of  2 variables:
   $ a: num   0.515 -1.174 -0.523 -0.146  0.410 ...
   $ b: logi  FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

We can turn the columns the data.frame into actual variables
with the "attach" command (it is the same principle as
namespaces in C++). Do not forget to "detach" the data.frame
after use.

  > data(faithful)
  > str(faithful)
  `data.frame':   272 obs. of  2 variables:
   $ eruptions: num  3.60 1.80 3.33 2.28 4.53 ...
   $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...
  > attach(faithful)
  > str(eruptions)
   num [1:272] 3.60 1.80 3.33 2.28 4.53 ...
  > detach()

The "merge" command joins two data frames -- it is the same
JOIN as in Databases. More precisely you have two data
frames a (with columns x, y, z) and b (with columns x1, x2,
y,z) and certain observations (rows) of a correspond to
certain observations of b: the command merges them to yield
a data frame with columns x, x1, x2, y, z. In this example,
the command

  merge(a,b)

is equivalent to the SQL command 

  SELECT * FROM a,b WHERE a.y = b.y AND a.z = b.z

In SQL, this is called an inner join can also be written as

  SELECT * FROM a INNER JOIN b ON a.y = b.y AND a.z = b.z

There are several types of SQL JOINs: in an INNER JOIN, we only get
the rows that are present in both tables; in a LEFT JOIN, we get all
the elements of the first table and the corresponding elements of the
second (if any); a RIGHT JOIN is the opposite; an OUTER JOIN is the
union of the LEFT and RIGHT JOINs. In R, you can get the other types
of JOIN with the "all", "all.x" and "all.y" arguments. 

  merge(x, y, all.x = TRUE)   # LEFT JOIN
  merge(x, y, all.y = TRUE)   # RIGHT JOIN
  merge(x, y, all   = TRUE)   # OUTER JOIN

By default, the join is over the columns common present in both data
frames, but you can restrict it to a subset of them, with the
"by" argument.

  merge(a, b, by=c("y", "z"))

Data frames are often used to store data to be analyzed.  We shall
detail those examples later -- do not be frightened if you have never
heard of "regression", we shall shortly demystify this notion. 

  # Regression
  data(cars)    # load the "cars" data frame
  lm( dist ~ speed, data=cars)

  # Polynomial regression
  lm( dist ~ poly(speed,3), data=cars)

  # Regression with splines
  library(Design)
  lm( y ~ rcs(x) )                                # TODO: Find some data

  # Logistic regression                    
  glm(y ~ x1 + x2, family=binomial, data=...)     # TODO: Find some data
  library(Design)
  lrm(death ~ blood.presure + age)                # TODO: Find some data

  # Non linear regression
  nls( y ~ a + b * exp(c * x), start = c(a=1, b=1, c=-1) )  # TODO: Find some data
  ?selfStart

  # Principal Component Analysis
  data(USArrest)
  princomp( ~ Murder + Assault + UrbanPop, data=USArrest)

  # Treillis graphics
  xyplot( x ~ y | group )                         # TODO: Find some data

We shall see in a separate section how to transform data
frames, because there are several ways of putting the result
of an experiment in a table -- but usually, we shall prefer
the one with the most rows and the fewer columns.

Some people may advise you to use the "subset" command to
extract subsets of a data.frame. Actually, you can do the
same thing with the basic subsetting syntax -- which is more
general: the "subset" function is but a convenience wrapper
around it.

  d[ d$subject == "laika", ]
  d[ d$day %in% c(1, 3, 9, 10, 11), ]
  d[ d$value < .1 | d$value > .9, ]
  d[ d$x < d$y, ]

TODO

  d <- data.frame(...)
  as.matrix(d)
  data.matrix(d)

+ Lists

Vectors only contain simple types (numbers, booleans or
strings); lists, on the contrary, may contain anything, for
instance sata frames or other lists. They can be used to
store complex data, for instamce, trees. They can also be
used, simply, as hash tables.

  > h <- list()
  > h[["foo"]] <- 1
  > h[["bar"]] <- c("a", "b", "c")
  > str(h)
  List of 2
   $ foo: num 1
   $ bar: chr [1:3] "a" "b" "c"

You can access one element with the "[[" operator, you can
access several elements with the "[" operator.

  > h[["bar"]]
  [1] "a" "b" "c"

  > h[[2]]
  [1] "a" "b" "c"

  > h[1:2]
  $foo
  [1] 1
  $bar
  [1] "a" "b" "c"

  > h[2]     # Beware, the result is not the second element, but a
             # list containing this second element. 
  $bar
  [1] "a" "b" "c"

  > str(h[2])
  List of 1
   $ bar: chr [1:3] "a" "b" "c"

For instance, the graphic parameters are stored in a list, used as a
hash table.

  > str( par() )
  List of 68
   $ adj      : num 0.5
   $ ann      : logi TRUE
   $ ask      : logi FALSE
   $ bg       : chr "transparent"
   $ bty      : chr "o"
   $ cex      : num 1
   $ cex.axis : num 1
   $ cex.lab  : num 1
   $ cex.main : num 1.2
   $ cex.sub  : num 1
   $ cin      : num [1:2] 0.147 0.200
   ...
   $ xpd      : logi FALSE
   $ yaxp     : num [1:3] 0 1 5
   $ yaxs     : chr "r"
   $ yaxt     : chr "s"
   $ ylog     : logi FALSE

The results of most statistical analyses is not a simple number or
array, but a list containing all the relevant values. 

  > n <- 100
  > x <- rnorm(n)
  > y <- 1 - 2 * x + rnorm(n)
  > r <- lm(y~x)

  > str(r)
  List of 12
   $ coefficients : Named num [1:2]  0.887 -2.128
    ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
   $ residuals    : Named num [1:100]  0.000503  0.472182 -1.079153 -2.423841  0.168424 ...
    ..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
   $ effects      : Named num [1:100]  -9.5845 -19.5361  -1.0983  -2.5001   0.0866 ...
    ..- attr(*, "names")= chr [1:100] "(Intercept)" "x" "" "" ...
   $ rank         : int 2
   $ fitted.values: Named num [1:100] 0.67 1.65 1.75 4.20 4.44 ...
    ..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
   $ assign       : int [1:2] 0 1
   $ qr           :List of 5
    ..$ qr   : num [1:100, 1:2] -10 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
    .. ..- attr(*, "dimnames")=List of 2
    .. .. ..$ : chr [1:100] "1" "2" "3" "4" ...
    .. .. ..$ : chr [1:2] "(Intercept)" "x"
    .. ..- attr(*, "assign")= int [1:2] 0 1
    ..$ qraux: num [1:2] 1.10 1.04
    ..$ pivot: int [1:2] 1 2
    ..$ tol  : num 1e-07
    ..$ rank : int 2
    ..- attr(*, "class")= chr "qr"
  ...

  > str( summary(r) )
  List of 11
   $ call         : language lm(formula = y ~ x)
   $ terms        :Classes 'terms', 'formula' length 3 y ~ x
    .. ..- attr(*, "variables")= language list(y, x)
    .. ..- attr(*, "factors")= int [1:2, 1] 0 1
    .. .. ..- attr(*, "dimnames")=List of 2
    .. .. .. ..$ : chr [1:2] "y" "x"
    .. .. .. ..$ : chr "x"
    .. ..- attr(*, "term.labels")= chr "x"
    .. ..- attr(*, "order")= int 1
    .. ..- attr(*, "intercept")= int 1
    .. ..- attr(*, "response")= int 1
    .. ..- attr(*, ".Environment")=length 6 <environment>
    .. ..- attr(*, "predvars")= language list(y, x)
    .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
    .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
   $ residuals    : Named num [1:100]  0.000503  0.472182 -1.079153 -2.423841  0.168424 ...
    ..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
  ...

To delete an element from a list:

  > h[["bar"]] <- NULL
  > str(h)
  List of 1
   $ foo: num 1

+ Matrices

Matrices are 2-dimensional tables, but contrary to data
frames (whose type may vary from one column to the next),
their elements all have the same type.

A matrix:

  > m <- matrix( c(1,2,3,4), nrow=2 )
  > m
       [,1] [,2]
  [1,]    1    3
  [2,]    2    4

Caution: by default, the elements of a matrix are given vertically,
column after column.

  > matrix( 1:3, nrow=3, ncol=3 )
       [,1] [,2] [,3]
  [1,]    1    1    1
  [2,]    2    2    2
  [3,]    3    3    3

  > matrix( 1:3, nrow=3, ncol=3, byrow=T )
       [,1] [,2] [,3]
  [1,]    1    2    3
  [2,]    1    2    3
  [3,]    1    2    3

  > t(matrix( 1:3, nrow=3, ncol=3 ))
       [,1] [,2] [,3]
  [1,]    1    2    3
  [2,]    1    2    3
  [3,]    1    2    3

Matrix product (beware: A * B is an element-by-element product):

  > x <- matrix( c(6,7), nrow=2 )
  > x
       [,1]
  [1,]    6
  [2,]    7
  > m %*% x
       [,1]
  [1,]   27
  [2,]   40

Determinant:

  > det(m)
  [1] -2

Transpose:

  > t(m)
       [,1] [,2]
  [1,]    1    2
  [2,]    3    4

A diagonal matrix:

  > diag(c(1,2))
       [,1] [,2]
  [1,]    1    0
  [2,]    0    2

Identity matrix (or, more generally, a scalar matrix, i.e.,
the matrix of a homothety):

  > diag(1,2)
       [,1] [,2]
  [1,]    1    0
  [2,]    0    1

  > diag(rep(1,2))
       [,1] [,2]
  [1,]    1    0
  [2,]    0    1

  > diag(2)
       [,1] [,2]
  [1,]    1    0
  [2,]    0    1

There is also a "Matrix" package, in case you prefer a full
object-oriented framework and/or you need other operations on
matrices. 

  library(help=Matrix)

We have already seen the "cbind" and "rbind" functions that
put data frames side by side or on top of each other: they
also work with matrices.

  > cbind( c(1,2), c(3,4) )
       [,1] [,2]
  [1,]    1    3
  [2,]    2    4

  > rbind( c(1,3), c(2,4) )
       [,1] [,2]
  [1,]    1    3
  [2,]    2    4

The trace of a matrix:

  > sum(diag(m))
  [1] 5

The inverse of a matrix:

  > solve(m)
       [,1] [,2]
  [1,]   -2  1.5
  [2,]    1 -0.5

Actually, one rarely need the inverse of a matrix -- we
usually just want to multiply a given vector by this
inverse: this operation is simpler, faster and numerically
more stable.

  > solve(m, x)
       [,1]
  [1,] -1.5
  [2,]  2.5

  > solve(m) %*% x
       [,1]
  [1,] -1.5
  [2,]  2.5

Eigenvalues:

  > eigen(m)$values
  [1]  5.3722813 -0.3722813

Eigenvectors:

  > eigen(m)$vectors
             [,1]       [,2]
  [1,] -0.5742757 -0.9093767
  [2,] -0.8369650  0.4159736

Let us check that the matrix has actually been diagonalized:

  > p <- eigen(m)$vectors
  > d <- diag(eigen(m)$values)
  > p %*% d %*% solve(p)
       [,1] [,2]
  [1,]    1    3
  [2,]    2    4

It might be the good moment to recall the main matrix
decompositions.

The LU decomposition, or more precisely, the PA = LDU
decomposition (P: permutation matrix; L, U: lower (or upper)
triangular matrix, with 1's on the diagonal) expresses the
result of Gauss's Pivot Algorithm (L contains the operations
on the lines; D contains the pivots).

We do not really need it, because the Pivot Algorithm is
already implemented in the "solve" command.

The Choleski decomposition is a particular case of the LU
decomposition: if A is real symetric definite positive
matrix, it can be written as B * B' where B is upper
triangular. It is used to solve linear systems AX=Y where A
is symetric positive -- this is the case for the equations
defining least squares estimators.

We shall see later another application to the simulation of non
independant normal variables, with a given variance-covariance matrix.

When you look at them, matrices are rather complicated
(there are a lot of coefficients). However, if you look at
the way they act on vectors, it looks rather simple: they
often seem to extend or shrink the vectors, depending on
their direction. A matrix M of size n is said to be
diagonalizable if there exists a basis e_1,...,e_n of R^n so
that M e_i = lambda_i e_i for all i, for some (real or
sometimes complex) numbers. Geometrically, it means that, in
the direction of each e_i, the matrix acts like a
homothety. The e_i are said to be eigen vectors of the
matrix M, the lambda_i are said to be its eigen values. in
matrix terms, this means that there exists an invertible
matrix P (whose columns will be the eigen vectors) and a
diagonal matrix D (whose elements will be the corresponding
eigen values) such that

  M = P D P^-1.

Diagonalizable matrices sound good, but there may still be a
few problems. First, the eigen values (and the eigen
vectors) can be complex -- if you want to interpret them as
a real-world quantity, it is a bad start. However, the
matrices you will want to diagonalize are often symetric
real matrices: they are diagonalizable with real eigen
values (and eigen vectors). Second, not all matrices are
diagonalizable. For instance,

  1 1 
  0 1

is not diagonalizable. However, the set of
non-diagonalizable matrices has zero measure: in particular,
if you take a matrix at random, in some "reasonable" way
("reasonable" means "along a probability measure absolutely
continuous with respect to the Lebesgue measure on the set
of square matrices of size n), the probability that it be
diagonalizable (over the complex numbers) is 1 -- we say
that matrices are almost surely diagonalizable.

Should you be interested in the rare cases when the matrices
are not diagonalizable (for instance, if you are interested
in matrices with integer, bounded coefficients), you can
look into the Jordan decomposition, that generalizes the
diagonalization and works with any matrix.

There are also many decompositions based on the matrix t(A)*A.

The A=QR decomposition (R: upper triangular, Q: unitary)
expresses the Gram-Schmidt orthonormalization of the columns
of A -- we can compute it from the LU decomposition of
t(A)*A.

  ?qr

It may be used to do a regression:

  Model: Y = b X + noise
  X = QR
  \hat Y = Q Q' Y
  b = R^-1 Q' Y

The Singular Value Decomposition (SVD) A=Q1*S*Q2 (Q1, Q2:
matrices containing the eigenvectors of A*t(A) and t(A)*A;
S: diagonal matrix containing the square roots of the
eigenvalues of A*t(A) or t(A)*A (they are the same)) which
yields, when A is symetrix, its diagonalization in an
orthonormal basis; it also used in the computation of the
pseudo inverse. This decomposition may also be seen as a sum
of matrices of rank 1, such that the first matrices in this
sum approximate "best" the initial matrix.

  ?svd

The polar decomposition, A=QR (Q: orthogonal, R: symetric
positive definite) is an analogue of the polar decomposition
of a complex number: it decomposes the correspondiong linear
transformation into rotation and "stretching". We can meet
this decomposition in Least Squares Estimates: when we try
to minimize the absolute value of Ax-b, this amounts to
solve

  t(A) A  x = t(A) b

(Usually, t(A)*A is invertible, otherwise we would use
pseudo-inverses.)

TODO: speak a bit more about pseudo-inverses.

+ Matrices and arrays

There is also an "array" type, that generalizes matrices in higher
dimensions.

  > d <- array(rnorm(3*3*2), dim=c(3,3,2))
  > d
  , , 1

             [,1]       [,2]       [,3]
  [1,] 0.97323599 -0.7319138 -0.7355852
  [2,] 0.06624588 -0.5732781 -0.4133584
  [3,] 1.65808464 -1.3011671 -0.4556735

  , , 2

             [,1]       [,2]       [,3]
  [1,]  0.6314685  0.6263645  1.2429024
  [2,] -0.2562622 -1.5338054  0.9634999
  [3,]  0.1652014 -0.9791350 -0.2040375

  > str(d)
   num [1:3, 1:3, 1:2]  0.9732  0.0662  1.6581 -0.7319 -0.5733 ...

Contigency tables are arrays (computed with the "table" function),
when there are more than two variables.

  > data(HairEyeColor)
  > HairEyeColor
  , , Sex = Male

         Eye
  Hair    Brown Blue Hazel Green
    Black 32    11   10     3
    Brown 38    50   25    15
    Red   10    10    7     7
    Blond  3    30    5     8

  , , Sex = Female

         Eye
  Hair    Brown Blue Hazel Green
    Black 36     9    5     2
    Brown 81    34   29    14
    Red   16     7    7     7
    Blond  4    64    5     8

  > str(HairEyeColor)
   table [1:4, 1:4, 1:2] 32 38 10 3 11 50 10 30 10 25 ...
   - attr(*, "dimnames")=List of 3
    ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
    ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
    ..$ Sex : chr [1:2] "Male" "Female"
   - attr(*, "class")= chr "table"

It says "table", but a "table" is an "array":

  > is.array(HairEyeColor)
  [1] TRUE

+ Attributes

One may attach meta-data to an object: these are called "attributes". 
For instance, names of the elements of a list are in an attribute. 

  > l <- list(a=1, b=2, c=3)
  > str(l)
  List of 3
   $ a: num 1
   $ b: num 2
   $ c: num 3
  > attributes(l)
  $names
  [1] "a" "b" "c"

Similarly, the row and columns names of a data.frame

  > a <- data.frame(a=1:2, b=3:4)

  > str(a)
  `data.frame':   2 obs. of  2 variables:
   $ a: int  1 2
   $ b: int  3 4

  > attributes(a)
  $names
  [1] "a" "b"
  $row.names
  [1] "1" "2"
  $class
  [1] "data.frame"


  > a <- matrix(1:4, nr=2)
  > rownames(a) <- letters[1:2]
  > colnames(a) <- LETTERS[1:2]

  > str(a)
   int [1:2, 1:2] 1 2 3 4
   - attr(*, "dimnames")=List of 2
    ..$ : chr [1:2] "a" "b"
    ..$ : chr [1:2] "A" "B"

  > attributes(a)
  $dim
  [1] 2 2
  $dimnames
  $dimnames[[1]]
  [1] "a" "b"
  $dimnames[[2]]
  [1] "A" "B"

  > data(HairEyeColor)
  > str(HairEyeColor)
   table [1:4, 1:4, 1:2] 32 38 10 3 11 50 10 30 10 25 ...
   - attr(*, "dimnames")=List of 3
    ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
    ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
    ..$ Sex : chr [1:2] "Male" "Female"
   - attr(*, "class")= chr "table"

It is also used to hold the code of a function if you want to keep the
comments. 

  > f <- function (x) {
  +   # Useless function
  +   x + 1
  + }

  > f
  function (x) {
    # Useless function
    x + 1
  }

  > str(f)
  function (x)
   - attr(*, "source")= chr [1:4] "function (x) {" ...

  > attr(f, "source") <- NULL

  > str(f)
  function (x)

  > f
  function (x)
  {
      x + 1
  }

Some people even suggest to use this to "hide" code -- but choosing an
interpreted language is a very bad idea if you want to hide your
code. 

  > attr(f, "source") <- "Forbidden"

  > f
  Forbidden

  > attr(f, "source") <- "Remember to use brackets to call a function, e.g., f()"

  > f
  Remember to use brackets to call a function, e.g., f()

Typically, when the data has a complex structure, you use a list; but
when the bulk of the data has a very simple, table-like structure, you
store it in an array or data frame and put the rest in the
attributes. For instance, here is a chunk of an "lm" object (the
result of a regression): 

  > str(r$model)
  `data.frame':   100 obs. of  2 variables:
   $ y: num   5.087 -1.587 -0.637  2.023  2.207 ...
   $ x: num  -1.359  0.993  0.587 -0.627 -0.853 ...
   - attr(*, "terms")=Classes 'terms', 'formula' length 3 y ~ x
    .. ..- attr(*, "variables")= language list(y, x)
    .. ..- attr(*, "factors")= int [1:2, 1] 0 1
    .. .. ..- attr(*, "dimnames")=List of 2
    .. .. .. ..$ : chr [1:2] "y" "x"
    .. .. .. ..$ : chr "x"
    .. ..- attr(*, "term.labels")= chr "x"
    .. ..- attr(*, "order")= int 1
    .. ..- attr(*, "intercept")= int 1
    .. ..- attr(*, "response")= int 1
    .. ..- attr(*, ".Environment")=length 149 <environment>
    .. ..- attr(*, "predvars")= language list(y, x)
    .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
    .. .. ..- attr(*, "names")= chr [1:2] "y" "x"

We shall soon see another application of attributes: the notion of
class -- the class of an object is just the value of its "class"
attribute, if any. 

+ The contents of those complex objects: str, unclass, deparse

If you want to use a complex object, obtained as the result of a
certain command, by extracting some of its elements, or if you want to
browse through it, the printing command is not enough: we need other
means to peer inside an object.

The "unclass" command removes the class of an object: only remains the
underlying type (usually, "list"). As a result, it is printed by the
"print.default" function that displays its actual contents.

  > data(USArrests)
  > r <- princomp(USArrests)$loadings

  > r
  Loadings:
           Comp.1 Comp.2 Comp.3 Comp.4
  Murder                         0.995
  Assault  -0.995
  UrbanPop        -0.977 -0.201
  Rape            -0.201  0.974

                 Comp.1 Comp.2 Comp.3 Comp.4
  SS loadings      1.00   1.00   1.00   1.00
  Proportion Var   0.25   0.25   0.25   0.25
  Cumulative Var   0.25   0.50   0.75   1.00

  > class(r)
  [1] "loadings"

  > unclass(r)
                Comp.1      Comp.2      Comp.3      Comp.4
  Murder   -0.04170432  0.04482166  0.07989066  0.99492173
  Assault  -0.99522128  0.05876003 -0.06756974 -0.03893830
  UrbanPop -0.04633575 -0.97685748 -0.20054629  0.05816914
  Rape     -0.07515550 -0.20071807  0.97408059 -0.07232502

You cound also directly use the "print.default" function.

  > print.default(r)
                Comp.1      Comp.2      Comp.3      Comp.4
  Murder   -0.04170432  0.04482166  0.07989066  0.99492173
  Assault  -0.99522128  0.05876003 -0.06756974 -0.03893830
  UrbanPop -0.04633575 -0.97685748 -0.20054629  0.05816914
  Rape     -0.07515550 -0.20071807  0.97408059 -0.07232502
  attr(,"class")
  [1] "loadings"

The "str" function prints the contents of an objects and truncates all
the vectors it encounters: thus, you can peer into large objects.

  > str(r)
   loadings [1:4, 1:4] -0.0417 -0.9952 -0.0463 -0.0752  0.0448 ...
   - attr(*, "dimnames")=List of 2
    ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
    ..$ : chr [1:4] "Comp.1" "Comp.2" "Comp.3" "Comp.4"
   - attr(*, "class")= chr "loadings"

  > str(USArrests)
  `data.frame':   50 obs. of  4 variables:
   $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
   $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
   $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
   $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

Finally, to get an idea of what you can do with an object, you can
always look the code of its "print" or "summary" methods.

  > print.lm
  function (x, digits = max(3, getOption("digits") - 3), ...)
  {
    cat("\nCall:\n", deparse(x$call), "\n\n", sep = "")
    if (length(coef(x))) {
      cat("Coefficients:\n")
      print.default(format(coef(x), digits = digits), print.gap = 2,
                    quote = FALSE)
    }
    else cat("No coefficients\n")
    cat("\n")
    invisible(x)
  }
  <environment: namespace:base>

  > summary.lm
  function (object, correlation = FALSE, symbolic.cor = FALSE,
      ...)
  {
    z <- object
    p <- z$rank
    if (p == 0) {
      r <- z$residuals
      n <- length(r)
  etc.

  > print.summary.lm
  function (x, digits = max(3, getOption("digits") - 3), symbolic.cor = x$symbolic.cor,
            signif.stars = getOption("show.signif.stars"), ...)
  {
    cat("\nCall:\n")
    cat(paste(deparse(x$call), sep = "\n", collapse = "\n"),
              "\n\n", sep = "")
    resid <- x$residuals
    df <- x$df
    rdf <- df[2]
    cat(if (!is.null(x$w) && diff(range(x$w)))
        "Weighted ", "Residuals:\n", sep = "")
    if (rdf > 5) {
  etc.

The "deparse" command produces a character string whose evaluation
will yield the initial object (the resulting syntax is a bit strange:
if you were to build such an object from scratch, you would not
proceed that way).

  > deparse(r)
  [1] "structure(c(-0.0417043206282872, -0.995221281426497, -0.0463357461197109, "
  [2] "-0.075155500585547, 0.0448216562696701, 0.058760027857223, -0.97685747990989, "
  [3] "-0.200718066450337, 0.0798906594208107, -0.0675697350838044, "
  [4] "-0.200546287353865, 0.974080592182492, 0.994921731246978, -0.0389382976351601, "
  [5] "0.0581691430589318, -0.0723250196376096), .Dim = c(4, 4), .Dimnames = list("
  [6] "    c(\"Murder\", \"Assault\", \"UrbanPop\", \"Rape\"), c(\"Comp.1\", \"Comp.2\", "
  [7] "    \"Comp.3\", \"Comp.4\")), class = \"loadings\")"

  > cat(deparse(r)); cat("\n")
  structure(c(-0.0417043206282872, -0.995221281426497,
  -0.0463357461197109, -0.075155500585547, 0.0448216562696701,
  0.058760027857223, -0.97685747990989, -0.200718066450337,
  0.0798906594208107, -0.0675697350838044, -0.200546287353865,
  0.974080592182492, 0.994921731246978, -0.0389382976351601,
  0.0581691430589318, -0.0723250196376096), .Dim = c(4, 4), 
  .Dimnames = list( c("Murder", "Assault", "UrbanPop", "Rape"),
  c("Comp.1", "Comp.2", "Comp.3", "Comp.4")), class = "loadings")

+ Operations on vectors or arrays

  TODO: Check that I mention apply, sapply, lapply,
        rapply (recursive apply), rollapply (zoo)
  TODO: Mention the "reshape" package

There are several ways to code the results of an experiment.

First example: we have measured several qualitative variables on several
(hundred) subjects. The data may be written down as a table, one line
per subject, one column per variable. We can also use a contingency
table (it is only a good idea of there are few variables, otherwise
the array would mainly contain zeroes; if there are k variables the
array would hane k dimensions). 

How can we switch from one formulation to the next. 

In one direction, the "table" function computes a contingency table. 

  n <- 1000
  x1 <- factor( sample(1:3, n, replace=T), levels=1:3 )
  x2 <- factor( sample(LETTERS[1:5], n, replace=T), levels=LETTERS[1:5] )
  x3 <- factor( sample(c(F,T),n,replace=T), levels=c(F,T) )
  d <- data.frame(x1,x2,x3)
  r <- table(d)

This yields:

  > r
  , , x3 = FALSE

     x2
  x1  A  B  C  D  E
    1 27 45 31 38 25
    2 41 33 30 35 33
    3 33 30 28 35 39

  , , x3 = TRUE

     x2
  x1  A  B  C  D  E
    1 26 30 28 42 29
    2 35 33 22 37 40
    3 42 31 31 36 35

The "ftable" command presents the result in a slightly different way
(more readable if there are more variables).

  > ftable(d)
        x3 FALSE TRUE
  x1 x2
  1  A        27   26
     B        45   30
     C        31   28
     D        38   42
     E        25   29
  2  A        41   35
     B        33   33
     C        30   22
     D        35   37
     E        33   40
  3  A        33   42
     B        30   31
     C        28   31
     D        35   36
     E        39   35

Let us now see how to turn a contingency table into a data.frame. 

Case 1: 1-dimensional table

  n <- 100
  k <- 10
  x <- factor( sample(LETTERS[1:k], n, replace=T), levels=LETTERS[1:k] )
  d <- table(x)
  factor( rep(names(d),d), levels=names(d) )

Case 2: 2-dimensional table

  n <- 100
  k <- 4
  x1 <- factor( sample(LETTERS[1:k], n, replace=T), levels=LETTERS[1:k] )
  x2 <- factor( sample(c('x','y','z'),n,replace=T), levels=c('x','y','z') )
  d <- data.frame(x1,x2)
  d <- table(d)
  y2 <- rep(colnames(d)[col(d)], d)
  y1 <- rep(rownames(d)[row(d)], d)
  dd <- data.frame(y1,y2)

General case:

  n <- 1000
  x1 <- factor( sample(1:3, n, replace=T), levels=1:3 )
  x2 <- factor( sample(LETTERS[1:5], n, replace=T), levels=LETTERS[1:5] )
  x3 <- factor( sample(c(F,T),n,replace=T), levels=c(F,T) )
  d <- data.frame(x1,x2,x3)
  r <- table(d)
  
  # A function generalizing "row" and "col" in higher dimensions
  foo <- function (r,i) {
    d <- dim(r)
    rep(
      rep(1:d[i], each=prod(d[0:(i-1)])),
      prod(d[(i+1):(length(d)+1)], na.rm=T)
    )
  }
  k <- length(dimnames(r))
  y <- list()
  for (i in 1:k) {
    y[[i]] <- rep( dimnames(r)[[i]][foo(r,i)], r )
  }
  d <- data.frame(y)
  colnames(d) <- LETTERS[1:k]

  # Test
  r - table(d)

+ Operations on vectors and arrays (continued)

Other example: we made the same experiment, with the same subject,
three times. We can represent the data with one row per subject, with
several results for each 

    subject, result1, result2, result3

We can also use one row per experiment, with the number of the
subject, the number of the experiment (1, 2 or 3) and the result. 

    subject, retry, result

Exercice: Write function to turn one representation into the other.
(Hint: you may use the "split" command that separates data along a
factor). 

Other example: Same situation, but this time, the numner or
experiments per subject is not constant. The first representation can
no longer be a data frame: it can be a list of vectors (one vector for
each subject). The second representation is unchanged. 

  n <- 100
  k <- 10
  subject <- factor( sample(1:k,n,replace=T), levels=1:k )
  x <- rnorm(n)
  d1 <- data.frame(subject, x)

  # Data.frame to vector list
  d2 <- split(d1$x, d1$subject)

  # vector list to data.frame
  rep(names(d2), sapply(d2, length))

+ Aggregate functions: by, aggregate

(I never use those functions: fell free to skip to the next section
that present more general and powerful alternatives.)

In SQL (this is the language spoken by databases -- to simplify
things, you can consider that a database is a (set of)
data.frame(s)), we often want to apply a function (sum, mean, sd,
etc.) to groups of records ("record" is the database word for "line in
a data.frame). For instance, if you store you personnal accounting in
a database, giving, for each expense, the amount and the nature (rent,
food, transortation, taxes, books, cinema, etc.), 

  amount     nature
  ------------------
  390        rent
  4.90       cinema
  6.61       food
  10.67      food
  6.40       books
  14.07      food
  73.12      books
  4.90       cinema

you might want to compute the total expenses for each type of
expense. In SQL, you would say:

  SELECT nature, SUM(amount)
  FROM expenses
  GROUP BY nature;

You can do the same in R:

  nature <- c("rent", "cinema", "books", "food")
  p <- length(nature):1
  p <- sum(p)/p
  n <- 10
  d <- data.frame(
    nature = sample( nature, n, replace=T, prob=p ),
    amount = 10*round(rlnorm(n),2)
  )
  by(d$amount, d$nature, sum)

This yields:

  > d
     nature amount
  1   books   59.9
  2    rent    3.0
  3   books    6.7
  4  cinema    4.7
  5    food    7.3
  6   books   11.3
  7    rent   12.2
  8  cinema    6.5
  9    food    3.2
  10   food    4.7
 
  > by(d$amount, d$nature, sum)
  INDICES: books
  [1] 77.9
  ------------------------------------------------------------
  INDICES: cinema
  [1] 11.2
  ------------------------------------------------------------
  INDICES: food
  [1] 15.2
  ------------------------------------------------------------
  INDICES: rent
  [1] 15.2

The "by" function assumes that you have a vector, that you want to cut
into pieces and on whose pieces you want to apply a function.
Sometimes, it is not a vector, but several: all the columns in a
data.frame. You can then replace the "by" function by "aggregate". 

  > N <- 50
  > k1 <- 4
  > g1 <- sample(1:k1, N, replace=TRUE)
  > k2 <- 3
  > g2 <- sample(1:k2, N, replace=TRUE)
  > d <- data.frame(x=rnorm(N), y=rnorm(N), z=rnorm(N))
  > aggregate(d, list(g1, g2), mean)
     Group.1 Group.2       x        y        z
  1        1       1 -0.5765  0.07474 -0.01558
  2        2       1  0.4246  0.12450 -0.05569
  3        3       1 -0.3418  0.30908 -0.32289
  4        4       1  0.7405 -0.79703  0.18489
  5        1       2 -0.5855 -0.07166 -0.16581
  6        2       2 -0.4230 -0.15215  0.24693
  7        3       2  0.4329  0.32154 -0.82883
  8        4       2 -1.0167 -0.18424  0.12709
  9        1       3  0.3961 -0.86940  0.68552
  10       2       3 -0.8808  0.62404  0.79728
  11       3       3 -0.4884 -0.67295  0.03346
  12       4       3  0.1605 -0.68522 -0.35144

  TODO: Replace this example by real data...

These two functions, "by" and "aggregate", are actually special cases
of the apply/tapply/lapply/sapply/mapply functions, that are more
general and that we shall now present.

  > by.data.frame
  function (data, INDICES, FUN, ...) {
    (...)
    ans <- eval(substitute(tapply(1:nd, IND, FUNx)), data)
    (...)
  }

  > aggregate.data.frame
  function (x, by, FUN, ...) {
    (...)
    y <- lapply(x, tapply, by, FUN, ..., simplify = FALSE)
    (...)
  }                                                

+ Operations of vectors and arrays: useful commands

The "apply" function applies a function (mean, quartile, etc.) to each
column or row of a data.frame, matrix or array.

  > options(digits=4)
  > df <- data.frame(x=rnorm(20),y=rnorm(20),z=rnorm(20))
  > apply(df,2,mean)
         x        y        z
   0.04937 -0.11279 -0.02171
  > apply(df,2,range)
            x      y      z
  [1,] -1.564 -1.985 -1.721
  [2,]  1.496  1.846  1.107

It also works in higher dimensions. The second argument indicates the
indices along which the program should loop, i.e., the dimensions used
to slice the data, i.e., the dimensions that will remain after the
computation.

  > options(digits=2)
  > m <- array(rnorm(10^3), dim=c(10,10,10))
  > a <- apply(m, 1, mean)
  > a
   [1]  0.060 -0.027  0.037  0.160  0.054  0.012 -0.039 -0.064 -0.013  0.061
  > b <- apply(m, c(1,2), mean)
  > b
          [,1]    [,2]   [,3]   [,4]   [,5]   [,6]    [,7]   [,8]   [,9]   [,10]
   [1,] -0.083 -0.7297  0.547  0.283  0.182 -0.409 -0.0029  0.170 -0.131  0.7699
   [2,] -0.044  0.3618 -0.206 -0.095  0.062 -0.568 -0.4841  0.334  0.362  0.0056
   [3,]  0.255  0.2359 -0.331  0.040  0.213 -0.547 -0.1852  0.492 -0.257  0.4525
   [4,] -0.028  0.7422  0.417 -0.088  0.205 -0.521 -0.1981  0.042  0.604  0.4244
   [5,] -0.085  0.3461  0.047  0.683 -0.018 -0.173  0.1825 -0.826 -0.037  0.4153
   [6,] -0.139 -0.4761  0.276  0.174  0.145  0.232 -0.1194 -0.010  0.176 -0.1414
   [7,] -0.139  0.0054 -0.328 -0.264  0.078  0.496  0.2812 -0.336  0.124 -0.3110
   [8,] -0.060  0.1291  0.313 -0.199 -0.325  0.338 -0.2703  0.166 -0.133 -0.5998
   [9,]  0.091  0.2250  0.155 -0.277  0.075 -0.044 -0.4169  0.050  0.200 -0.1849
  [10,] -0.157 -0.3316 -0.103  0.373 -0.034  0.116  0.0660  0.249 -0.040  0.4689
  > apply(b, 1, mean)
   [1]  0.060 -0.027  0.037  0.160  0.054  0.012 -0.039 -0.064 -0.013  0.061

The "tapply" function groups the observations along the value of one
(or several) factors and applies a function (mean, etc.) to the
resulting groups. The "by" command is similar. 

  > tapply(1:20, gl(2,10,20), sum)
    1   2
   55 155

  > by(1:20, gl(2,10,20), sum)
  INDICES: 1
  [1] 55
  ------------------------------------------------------------
  INDICES: 2
  [1] 155

The "sapply" function applies a function to each element of a list (or
vector, etc.) and returns, if possible, a vector. The "lapply"
function is similar but returns a list. 

  > x <- list(a=rnorm(10), b=runif(100), c=rgamma(50,1))
  > lapply(x,sd)
  $a
  [1] 1.041

  $b
  [1] 0.294

  $c
  [1] 1.462
  > sapply(x,sd)
      a     b     c
  1.041 0.294 1.462

In particular, the "sapply" function can apply a function to each
column of a data.frame without specifying the dimension numbers
required by the "apply" command (at the beginning, you never know if
it sould be 1 or 2 and you end up trying both to retain the one whose
result has the expected dimension). 

The "split" command cuts the data, as the "tapply" function, but does
not apply any function afterwards. 

  > str(InsectSprays)
  `data.frame':   72 obs. of  2 variables:
   $ count: num  10 7 20 14 14 12 10 23 17 20 ...
   $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

  > str(  split(InsectSprays$count, InsectSprays$spray)  )
  List of 6
   $ A: num [1:12] 10 7 20 14 14 12 10 23 17 20 ...
   $ B: num [1:12] 11 17 21 11 16 14 17 17 19 21 ...
   $ C: num [1:12] 0 1 7 2 3 1 2 1 3 0 ...
   $ D: num [1:12] 3 5 12 6 4 3 5 5 5 5 ...
   $ E: num [1:12] 3 5 3 5 3 6 1 1 3 2 ...
   $ F: num [1:12] 11 9 15 22 15 16 13 10 26 26 ...

  > sapply(  split(InsectSprays$count, InsectSprays$spray),  mean  )
       A      B      C      D      E      F
  14.500 15.333  2.083  4.917  3.500 16.667

  > tapply( InsectSprays$count, InsectSprays$spray,  mean )
       A      B      C      D      E      F
  14.500 15.333  2.083  4.917  3.500 16.667

+ "Unlooping" exercises

TODO: This is a VERY important section.

At the begining of this document, list the most important sections,
list what the reader is expected to be able to do after reading this
document.

In R, many commands handle vectors or tables, allowing an (almost)
loop-less programming style -- parallel programming. Thus, the
computations are faster than with an explicitely written loop (because
R is an interpreted language). The resulting programming style is very
different from what you may be used to: here are a few exercises to
warm you up. We shall need the table functions we have just
introduced, in particular "apply".

Many people consider the "apply" function as a loop: in the current
implementation of R, it might be implemented as a loop, if if you run
R on a parallel machine, it could be different -- all the operations
could be run at once. This really is parallelization. 

Exercice: Let x be a table. Compute the sum of its rows, the sum of
each of its columns. If x is the contingency table of two qualitative
variables, compute the theoretical contingency table under the
hypothesis that the two variables are independant. If you already know
what it is, computhe the corresponding Chi^2.

  # To avoid any row/column confusion, I choose a non-square table
  n <- 4
  m <- 5
  x <- matrix( rpois(n*m,10), nr=n, nc=m )
  rownames(x) <- 1:n
  colnames(x) <- LETTERS[1:m]
  x
  apply(x,1,sum)  # Actually, there is already a "rowSums" function
  apply(x,2,sum)  # Actually, there is already a "colSums" function
  # Theoretical contingency table
  y <- matrix(apply(x,1,sum),nr=n,nc=m) * matrix(apply(x,2,sum),nr=n,nc=m,byrow=T) / sum(x)
  # Theoretical contingency table
  y <- apply(x,1,sum) %*% t(apply(x,2,sum)) / sum(x)
  # Computing the Chi^2 by hand
  sum((x-y)^2/y)
  # Let us check...
  chisq.test(x)$statistic

Exercice: Let x be a boolean vector. Count the number of sequences
("runs") of zeros (for instance, in 00101001010110, there are 6 runs:
00 0 00 0 0 0). Count the number of sequences of 1. Counth the total
number of sequences. Same question for a factor with more tham two
levels.

  n <- 50
  x <- sample(0:1, n, replace=T, p=c(.2,.8))
  # Number of runs
  sum(abs(diff(x)))+1
  # Number of runs of 1's.
  f <- function (x, v=1) { # If someone has a simpler idea...
    x <- diff(x==v)
    x <- x[x!=0]
    if(x[1]==1)
      sum(x==1)
    else
      1+sum(x==1)
  }
  f(x,1)
  # Number of runs of 0's.
  f(x,0)

  n <- 50
  k <- 4
  x <- sample(1:k, n, replace=T)
  # With a loop
  s <- 0
  for (i in 1:4) {
    s <- s + f(x,i)
  }
  s
  # With no loop (less readable)
  a <- apply(matrix(1:k,nr=1,nc=k), 2, function (i) { f(x,i) } )
  a 
  sum(a)

In a binary vector of length n, find the position of the runs of 1's
of length greater than k.

  n <- 100
  k <- 10
  M <- sample(0:1, n, replace=T, p=c(.2,.8))
  x <- c(0,M,0)
  # Start of the runs of 1's
  deb <- which( diff(x) == 1 )
  # End of the runs of 1's
  fin <- which( diff(x) == -1 ) -1
  # Length of those runs
  long <- fin - deb
  # Location of those whose lengths exceed k
  cbind(deb,fin)[ fin-deb > k, ]

Exercise: same question, but we are looking for runs of 1's of length
at least k in an n*m matrix. Present the result as a table.

  foo <- function (M,k) {
    x <- c(0,M,0)
    deb <- which( diff(x) == 1 )
    fin <- which( diff(x) == -1 ) -1
    cbind(deb,fin)[ fin-deb >= k, ]
  }
  n <- 50
  m <- 50
  M <- matrix( sample(0:1, n*m, replace=T, prob=c(.2,.8)), nr=n, nc=m )
  res <- apply(M, 1, foo, k=10)

  # Add the line number (not very pretty -- if someone has a better idea)
  i <- 0
  res <- lapply(res, function (x) { 
    x <- matrix(x, nc=2)
    i <<- i+1
    #if (length(x)) {
      cbind(ligne=rep(i,length(x)/2), deb=x[,1], fin=x[,2])
    #} else {
    #  x
    #}
  })
  # Present the result as a table
  do.call('rbind', res)   # The line numbers are still missing

TODO: check that I mention the "do.call" function somewhere in this
document...

Let r be the return of a financial asset. The clustered return is the
accumulated return for a sequence of returns of the same sign. The
trend number is the number of steps in such a sequence. The average
return is their ratio. Compute these quantities.

  %G
  data(EuStockMarkets)
  x <- EuStockMarkets
  # We aren't interested in the spot prices, but in the returns
  # return[i] = ( price[i] - price[i-1] ) / price[i-1]
  y <- apply(x, 2, function (x) { diff(x)/x[-length(x)] })
  # We normalize the data
  z <- apply(y, 2, function (x) { (x-mean(x))/sd(x) })
  # A single time series
  r <- z[,1]
  # The runs
  f <- factor(cumsum(abs(diff(sign(r))))/2)
  r <- r[-1]
  accumulated.return <- tapply(r, f, sum)
  trend.number <- table(f)
  boxplot(abs(accumulated.return) ~ trend.number, col='pink',
          main="Accumulated return")
  %--
  %G
  boxplot(abs(accumulated.return)/trend.number ~ trend.number,
          col='pink', main="Average return")
  %--

  %G
  op <- par(mfrow=c(2,2))
  for (i in 1:4) {
    r <- z[,i]
    f <- factor(cumsum(abs(diff(sign(r))))/2)
    r <- r[-1]
    accumulated.return <- tapply(r, f, sum)
    trend.number <- table(f)
    boxplot(abs(accumulated.return) ~ trend.number, col='pink')
  }
  par(op)    
  %--

  %G
  op <- par(mfrow=c(2,2))
  for (i in 1:4) {
    r <- z[,i]
    f <- factor(cumsum(abs(diff(sign(r))))/2)
    r <- r[-1]
    accumulated.return <- tapply(r, f, sum)
    trend.number <- table(f)
    boxplot(abs(accumulated.return)/trend.number ~ trend.number, col='pink')
  }
  par(op)    
  %--

Let M be an n*m matrix (representing a grayscale image); compute the
mean value of each quadripixel.

  %G
  data(volcano)
  M <- volcano
  n <- dim(M)[1]
  m <- dim(M)[2]
  M1 <- M [1:(n-1),] [,1:(m-1)]
  M2 <- M [2:n,] [,1:(m-1)]
  M3 <- M [1:(n-1),] [,2:m]
  M4 <- M [2:n,] [,2:m]
  # Overlapping quadripixels
  M0 <- (M1+M2+M3+M4)/4

  # Non-overlapping quadripixels
  nn <- floor((n-1)/2)
  mm <- floor((m-1)/2)
  M00 <- M0 [2*(1:nn),] [,2*(1:mm)]

  op <- par(mfrow=c(2,2))
  image(M, main="Initial image")
  image(M0, main="Overlapping Quadripixels")
  image(M00, main="Non Overlapping Quadripixels")
  par(op)
  %--  

Construct a Van der Monde matrix.

  outer(x, 0:n, '^')

Draw a graph from its indicence matrix.

  %G
  n <- 100
  m <- matrix(runif(2*n),nc=2)
  library(ape)
  r <- mst(dist(m)) # The incidence matrix (of the minimum spanning
                    # tree of the points)
  plot(m)
  n <- dim(r)[1]
  w <- which(r!=0)
  i <- as.vector(row(r))[w]
  j <- as.vector(col(r))[w]
  segments( m[i,1], m[i,2], m[j,1], m[j,2], col='red' )
  %--

TODO: Find other exercises. 

+ Strings

R is not the best tool to process strings, but you sometimes have to
do it. 

Strings are delimited by double or single quotes. 

  > "Hello" == 'Hello'
  [1] TRUE

You do not print a string with the "print" function but with the "cat"
function. The "print" function only gives you the representation of
the string. 

  > print("Hello\n")
  [1] "Hello\n"

  > cat("Hello\n")
  Hello

  > s <- "C:\\Program Files\\"   # At work, I am compelled to use Windows...

  > print(s)
  [1] "C:\\Program Files\\"

  > cat(s, "\n")
  C:\Program Files\

You can concatenate strings with the "paste" function. To get the
desired result, you may have to play with the "sep" argument. 

  > paste("Hello", "World", "!")
  [1] "Hello World !"
  > paste("Hello", "World", "!", sep="")
  [1] "HelloWorld!"
  > paste("Hello", " World", "!", sep="")
  [1] "Hello World!"

  > x <- 5
  > paste("x=", x)
  [1] "x= 5"
  > paste("x=", x, paste="")
  [1] "x= 5 "

The "cat" function also accepts a "sep" argument.

  > cat("x=", x, "\n")
  x= 5
  > cat("x=", x, "\n", sep="")
  x=5

Sometimes, you do not want to concatenate strings stored in different
variables, but the elements of a vector of strings. If you want the
result to be a single string, and not a vector of strings, you must
add a "collapse" argument. 
 
  > s <- c("Hello", " ", "World", "!")
  > paste(s)
  [1] "Hello" " "     "World" "!"
  > paste(s, sep="")
  [1] "Hello" " "     "World" "!"
  > paste(s, collapse="")
  [1] "Hello World!"

In some circumstances, you can even need both (the "cat" function does
not accept this "collapse" argument).

  > s <- c("Hello", "World!")

  > paste(1:3, "Hello World!")
  [1] "1 Hello World!" "2 Hello World!" "3 Hello World!"

  > paste(1:3, "Hello World!", sep=":")
  [1] "1:Hello World!" "2:Hello World!" "3:Hello World!"

  > paste(1:3, "Hello World!", sep=":", collapse="\n")
  [1] "1:Hello World!\n2:Hello World!\n3:Hello World!"

  > cat(paste(1:3, "Hello World!", sep=":", collapse="\n"), "\n")
  1:Hello World!
  2:Hello World!
  3:Hello World!

The "nchar" function gives the length of a string (I am often looking
for a "strlen" function: there it is). 

  > nchar("Hello World!")
  [1] 12

The "substring" function extract part of a string (the second argument
is the starting position, the third argument is 1 + the end
position). 

  > s <- "Hello World"
  > substring(s, 4, 6)
  [1] "lo "

The "strsplit" function splits a string into chunks, at each
occurrence of a given "string". 

  > s <- "foo, bar, baz"
  > strsplit(s, ", ")
  [[1]]
  [1] "foo" "bar" "baz"

  > s <- "foo-->bar-->baz"
  > strsplit(s, "-->")
  [[1]]
  [1] "foo" "bar" "baz"

Actually, it is not a string, but a regular expression. 

  > s <- "foo, bar, baz"
  > strsplit(s, ", *")
  [[1]]
  [1] "foo" "bar" "baz"

You can also use it to get the individual characters of a string. 

  > strsplit(s, "")
  [[1]]
  [1] "f" "o" "o" "," " " "b" "a" "r" "," " " "b" "a" "z"

  > str(strsplit(s, ""))
  List of 1
   $ : chr [1:13] "f" "o" "o" "," ...

The grep function looks for a "string" in a vector of strings. 

  > s <- apply(matrix(LETTERS[1:24], nr=4), 2, paste, collapse="")
  > s
  [1] "ABCD" "EFGH" "IJKL" "MNOP" "QRST" "UVWX"
  > grep("O", s)
  [1] 4
  > grep("O", s, value=T)
  [1] "MNOP"

Actually, it does not look for a string, but for a regular expression.

If Perl is installed on your machine, you can simply type (to the
shell)

  man perlretut

and read its Regular Expression TUTorial. 

(It may seem out of place to speak of regular expressions in a
document about statistics: it is not. We shall see (well, not in the
current version of this document, but soon -- I hope) that
stochastic regular expressions are a generalization of Hidden Markov
Models (HMM), which are the analogue of State Space Models for
qualitative time series. If you understood the last sentence, you
probably should not be reading this.)

The "regexpr" performs the same task as the "grep" function, but gives
a different result: the position and length of the first match (or -1
if there is none)

  > regexpr("o", "Hello")
  [1] 5
  attr(,"match.length")
  [1] 1
  > regexpr("o", c("Hello", "World!"))
  [1] 5 2
  attr(,"match.length")
  [1] 1 1

  > s <- c("Hello", "World!")
  > i <- regexpr("o", s)
  > i
  [1] 5 2
  attr(,"match.length")
  [1] 1 1
  > attr(i, "match.length")
  [1] 1 1

Sometimes, you want an "approximate" matches, not exact matches,
accounting for potential spelling or typing mistakes: the "agrep"
function provides suc a "fuzzy" matching. It is used by the
"help.search" function. 

  > grep ("abc", c("abbc", "jdfja", "cba"))
  numeric(0)
  > agrep ("abc", c("abbc", "jdfja", "cba"))
  [1] 1

The "gsub" function replaces each occurrence of a string (a regular
expression, actually) by a strin.

  > s <- "foo bar baz"
  > gsub(" ", "", s)   # Remove all the spaces
  [1] "foobarbaz"

  > s <- "foo  bar   baz"
  > gsub(" ", "", s)
  [1] "foobarbaz"
  > gsub("  ", " ", s)
  [1] "foo bar  baz"
  > gsub(" +", "", s)
  [1] "foobarbaz"
  > gsub(" +", " ", s)  # Remove multiple spaces and replace them by single spaces
  [1] "foo bar baz"

The "sub" is similar to "gsub" but only replaces the first
occurrence. 

  > s <- "foo bar baz"
  > sub(" ", "", s)
  [1] "foobar baz"

+ Time and Date

When you read data from various sources, you often run into date
format problem: different people, different software use different
formats, different conventions. For instance, 01/02/03 can mean the
first of february 2003 for some and the second of january 2003 for
others -- and perhaps even the third of february 2001 for some. The
only unambiguous, universal format is the ISO 8601 one, not really
used by people but rather by programmers: dates are coded as 

  2005-15-05

The main rationale for this format is that when you write a numeric
quantity you start with the largest units and end with the smallest;
e.g., when you write "123", everyone understands "a hundred and twenty
three": you start with the hundreds, proceed with the tens, and end
with the units. Why should it be different for dates? We should start
with the largest unit, the years, procedd with the next largest the
months, and end with the smallest, the days.

This format has an advantage: if you want to sort data according to
the date, your program just has to be able to sort strings, it need
not be aware of dates.

You can extend the format with a time, but it becomes ambiguous: 

  2005-05-15 21:34:10.03

It does not look ambiguous (hours, minutes, seconds, hundredths of
seconds -- for some applications, you may even need thousandths of
seconds), but the time zone is missing. Most of the problems you have
with times comes from those time zones.

To convert a string into a Date object: 

  > as.Date("2005-05-15")
  [1] "2005-05-15"

If you convert from an ambiguous format, you must specify the format: 

  > as.Date("15/05/2005", format="%d/%m/%Y")
  [1] "2005-05-15"
  > as.Date("15/05/05", format="%d/%m/%y")
  [1] "2005-05-15"

  > as.Date("01/02/03", format="%y/%m/%d")
  [1] "2001-02-03"
  > as.Date("01/02/03", format="%y/%d/%m")
  [1] "2001-03-02"

You can compute the difference between two dates -- it is a number of
days. 

  > a <- as.Date("01/02/03", format="%y/%m/%d")
  > b <- as.Date("01/02/03", format="%y/%d/%m")
  > a - b
  Time difference of -27 days

Today's date: 

  > Sys.Date()
  [1] "2005-05-16"

You can add a Date and a number (a number of days).

  > Sys.Date() + 21
  [1] "2005-06-06"

You can format the date to produce one of those ambiguous formats your
clients like. 

  > format(Sys.Date(), format="%d%m%y")
  [1] "160505"

  > format(Sys.Date(), format="%A, %d %B %Y")
  [1] "Monday, 16 May 2005"
  
The format is described in the manpage of the "strftime" function. 

If you want to extract part of a date, you can use the format"
function. For instance, if I want to aggregate my data by month, I can
use 

  d$month <- format( d$date, format="%Y-%m" )

For looping purposes, you might need series of dates: you may want to
use the "seq" function.

  ?seq.Date

  > seq(as.Date("2005-01-01"), as.Date("2005-07-01"), by="month")
  [1] "2005-01-01" "2005-02-01" "2005-03-01" "2005-04-01" "2005-05-01"
  [6] "2005-06-01" "2005-07-01"

  # A month is not always 31 days...
  > seq(as.Date("2005-01-01"), as.Date("2005-07-01"), by=31)
  [1] "2005-01-01" "2005-02-01" "2005-03-04" "2005-04-04" "2005-05-05"
  [6] "2005-06-05"

  > seq(as.Date("2005-01-01"), as.Date("2005-03-01"), by="2 weeks")
  [1] "2005-01-01" "2005-01-15" "2005-01-29" "2005-02-12" "2005-02-26"

However, you should be aware that loops tend to turn Dates into
numbers.

  > a <- seq(as.Date("2005-01-01"), as.Date("2005-03-01"), by="2 weeks")
  > str(a)
  Class 'Date'  num [1:5] 12784 12798 12812 12826 12840
  > for (i in a) {
  +   str(i)
  + }
   num 12784
   num 12798
   num 12812
   num 12826
   num 12840

Inside the loop, you may want to add

  for (i in dates) {
    i <- as.Date(i)
    ...
  }

There is another caveat about the use of dates as indices to arrays:
as a date is actually a number, if you use it as an index, R will
understand the number used to code the date (say 12784 for 2005-01-01)
as a row or column number, nor a row or column name. When using dates
as indices, always convert them into strings. 

  a <- matrix(NA, nr=10, nc=12)
  rownames(a) <- LETTERS[1:10]
  dates <- seq(as.Date("2004-01-01"), as.Date("2004-12-01"), 
               by="month"))
  colnames(a) <- as.character( dates )
  for (i in dates) {
    i <- as.Date(i)
    a[, as.character(i)] <- 1
  }

There are other methods: 

  > methods(class="Date")
   [1] as.character.Date  as.data.frame.Date as.POSIXct.Date    c.Date
   [5] cut.Date           -.Date             [<-.Date           [.Date
   [9] [[.Date            +.Date             diff.Date          format.Date
  [13] hist.Date*         julian.Date        Math.Date          mean.Date
  [17] months.Date        Ops.Date           plot.Date*         print.Date
  [21] quarters.Date      rep.Date           round.Date         seq.Date
  [25] summary.Date       Summary.Date       trunc.Date         weekdays.Date

For the time (up to the second, only): 

  > as.POSIXct("2005-05-15 21:45:17")
  [1] "2005-05-15 21:45:17 BST"

  > as.POSIXlt("2005-05-15 21:45:17")
  [1] "2005-05-15 21:45:17"

The two classes are interchangeable, only the internal representation
changes (use the first, more compact, one in data.frames). 

  > unclass(as.POSIXct("2005-05-15 21:45:17"))
  [1] 1116189917
  attr(,"tzone")
  [1] ""

  > unclass(as.POSIXlt("2005-05-15 21:45:17"))
  $sec
  [1] 17
  $min
  [1] 45
  $hour
  [1] 21
  $mday
  [1] 15
  $mon
  [1] 4
  $year
  [1] 105
  $wday
  [1] 0
  $yday
  [1] 134
  $isdst
  [1] 1

You can also perform a few computations 

  > as.POSIXlt("2005-05-15 21:45:17") - Sys.time()
  Time difference of -1.007523 days

This is actually a call to the "difftime" function (the unit is
automatically chose so that the result be readable).

  > difftime(as.POSIXlt("2005-05-15 21:45:17"), Sys.time(), units="secs")
  Time difference of -87246 secs

Should you be unhappy with those date and time classes, there is host
of packages that provide replacements for them. 

  date    (only dates, not times; rather limited, probably old,
           ignores ISO 8601) 
  chron   (no timezones or daylight saving times: this is a
           limitation, but as many problems come from timezones, it
           may be an advantage)
  zoo     (Important)

When reading a data.frame containing dates in a column, from a file, 
you can either read the column as strings and convert it afterwards,

  d <- read.table("foo.txt")
  d$Date <- as.Date( as.character( d$Date ) )

or explicitely state it is a Date

  read.table("foo.txt", colClasses=c("Date", "character", rep(10, "numeric")))

If the format is not the international one, it may be trickier. One
solution is to create your own class, that inherits from Date, but
with a different method to convert from strings. 

  setClass("Date")
  setClass("USDate", contains="Date")
  setAs("character", "USDate",
        function (from) { as.Date(from, format="%m/%d/%Y") })

  read.table("foo.txt", colClasses=c("USDate", "character", rep(10, "numeric")))

TODO: and if we need hundredths or thousandths of seconds?

There is an R-News article about date anhd time handling in R: 

  http://cran.r-project.org/doc/Rnews/Rnews_2004-1.pdf

Some of the intricacies of time and date handling are well known (some
months are 30-day long, others 31-day long, one 28-day long -- or 29,
every fourth year), others are not. But actually, every hundredth
year, the year that should be leap is not: 1700, 1800, 1900 were not
leap years. And this exception has exceptions: every fourth century,
the would-be leap year that should not be leap actually is -- 2000 was
a leap year, 2400 will be one. 

But this was just for dates: there are similar problems with time. We
have, from time to time, to add a second to the day. This has already
happened 18 times. 

  > .leap.seconds
   [1] "1972-07-01 01:00:00 BST" "1973-01-01 00:00:00 GMT"
   [3] "1974-01-01 00:00:00 GMT" "1975-01-01 00:00:00 GMT"
   [5] "1976-01-01 00:00:00 GMT" "1977-01-01 00:00:00 GMT"
   [7] "1978-01-01 00:00:00 GMT" "1979-01-01 00:00:00 GMT"
   [9] "1980-01-01 00:00:00 GMT" "1981-07-01 01:00:00 BST"
  [11] "1983-07-01 01:00:00 BST" "1985-07-01 01:00:00 BST"
  [13] "1986-07-01 01:00:00 BST" "1988-01-01 00:00:00 GMT"
  [15] "1990-01-01 00:00:00 GMT" "1991-01-01 00:00:00 GMT"
  [17] "1992-07-01 01:00:00 BST" "1993-07-01 01:00:00 BST"
  [19] "1994-07-01 01:00:00 BST" "1996-01-01 00:00:00 GMT"
  [21] "1997-07-01 01:00:00 BST" "1999-01-01 00:00:00 GMT"

The next one will be in december 2005.

Leap years are due to the fact that there is not a whole number of
days in a year; similarly, leap secons are due to the fact that
there is not a whole number of seconds in a day.

  http://en.wikipedia.org/wiki/Leap_second
  http://hpiers.obspm.fr/eop-pc/earthor/utc/leapsecond.html
  http://www.ucolick.org/~sla/leapsecs/onlinebib.html

+ Miscellanies: match

Actually, I have never used this function: let me just list a few
uncommented examples. 

  ?match
  # Get the 2's and 4's
  x[as.logical( match(x, c(2,4), nomatch=0)  )]

There are a few function written with "match": 

  > setdiff
  function (x, y)
  unique(if (length(x) || length(y)) x[match(x, y, 0) == 0] else x)
  <environment: namespace:base>

  > match.fun("%in%")
  function (x, table)
  match(x, table, nomatch = 0) > 0
  <environment: namespace:base>

  > intersect
  function (x, y)
  unique(y[match(x, y, 0)])
  <environment: namespace:base>

  > is.element
  function (el, set)
  match(el, set, 0) > 0
  <environment: namespace:base>

  > setequal
  function (x, y)
  all(c(match(x, y, 0) > 0, match(y, x, 0) > 0))
  <environment: namespace:base>

Exercice: How would we find ALL the functions whose definition uses
"match"?

  TODO: simplify the following code and state its limitations (limited
  to visible loaded functions)

  a <- lapply(search(), ls)
  names(a) <- search()
  a <- unlist(a)
  names(a) <- a
  a <- a[ sapply(a, 
                 function (x) { 
                 try( x <- match.fun(x) )
                 is.function(x)
                }) ]
  a <- lapply(a, match.fun)
  b <- lapply(a, deparse)
  b <- lapply(b, length)
  b <- order(unlist(b))
  a <- a[b]
  i <- lapply(a, function(x) { length(grep("match\\(", deparse(x)))>0 })
  i <- unlist(i)
  a[i]

* Debugging

+ Warnings

If you know (or even simply if you suspect) problems in your code, you 
can ask the R interpreter to be more rigorous, by saying 

  options(warn=1)

which prints the warning messages when they appear (and not at the end
on the execution, as usual), or even

  options(warn=2)

which turns the warning messages into real errors, that stop the
execution. 

+ The printf way: print, cat, str

One of the simplest ways to find the location of a bug in a program
(once we have witnessed an abnormal behaviour) is to add "print"
statement at the problematic locations, to see if the  code breaks
before that point, or to have a look at the data the functions are
playing with (quite often, you have a number or NULL while you would
expect a vector, or you have a vector instead of a matrix, or you have
a vector of strings instead of a vector of numbers, or complex numbers
have appeared, unnoticed, at some way in you code).

TODO: detail the functions

  print
  cat
  str
  unclass

TODO: log4R (no, it does not exist).

+ Step-by-step execution: debug

The "debug" command tags a function so that, when run, it be executed
step by step. 

  > debug(f)
  > f(3)
  debugging in: f(3)
  debug: {
      x^2 + x + 1
  }
  Browse[1]>
  debug: x^2 + x + 1
  Browse[1]>
  exiting from: f(3)
  [1] 13
  > undebug(f)

+ Breakpoints

The "browser" function adds a breakpoint in the code. 

For instance, if we run the following function, 

  f <- function () {
    x <- rnorm(10)
    y <- rnorm(11)
    browser()
    x + y
  }

R will stop when it encounters the "browser()" call.

   > f()
  Called from: f()
  Browse[1]>

You can then type in expressions, functions, to examine the
environment where it stopped. 

  > f()
  Called from: f()

  Browse[1]> x
   [1] -1.6684445 -1.4662686 -1.3792824  0.1103995  0.7431116 -1.9117947
   [7]  0.5333812 -0.6695517 -1.2382940 -0.3560036

  Browse[1]> str(x)
   num [1:10] -1.668 -1.466 -1.379  0.110  0.743 ...

  Browse[1]> str(y)
   num [1:11]  0.247 -0.505  0.197 -0.468  1.446 ...

  Browse[1]> x + y
   [1] -1.42137599 -1.97117326 -1.18208672 -0.35809903  2.18871467 -2.16168749
   [7]  0.88886591 -2.85428126 -0.85448640  0.37425241 -0.02050070
  Warning message:
  longer object length
          is not a multiple of shorter object length in: x + y

You can type "n" to execute the next instruction (and stop again) or
"c" to resume the execution, until the next stop. 

+ The calling stack: traceback, dump.frames, sys.call

The "traceback" command prints the callstack, i.e., the list of
functions that were called when the latest error occurred.

  ?traceback

The "dump.frames" command yields the equivalent of a "core" file (the
state of the interpretor at a given moment, typically, just after an
error), we can then examine with the "debugger" command. 

  ?dump.frames

The "sys.calls" function gives the list of the functions that have been
called, with all their arguments. 

  f <- function ()    { g(1) }
  g <- function (...) { h(17^2) }
  h <- function (x)   { 
    print(  sqrt(x)  )
    sys.calls()    
  }
  
This yields: 

  > str(  f()  )
  [1] 17
  Dotted pair list of 4
   $ : language str(f())
   $ : language f()
   $ : language g(1)
   $ : language h(17^2)

Here is an application of this "sys.calls" function: when you write a
new function, especially a function that will be implicitely called
(typically, a function of the form "[.foo", which is the overloaded
"[" operator for an S3 class "foo"), you might want to know from where
it was called. To this end, you can call the following
"function.print" at the start of your function. 

  function.print <- function () {
    l <- sys.calls()
    s <- lapply(l, function (x) { as.character(x[[1]]) })
    s <- unlist(s)
    s <- s[-length(s)] 
    cat("Stack: ", paste(s, collapse="/"), "(", sep="")
    cat(paste(as.character(l[[ length(l)-1 ]][-1]), collapse=","))
    cat(")\n")
  }

  f <- function (...)    { g(17) }
  g <- function (...) { function.print() }

This yields: 

  > f(2,11) 
  Stack: f/g(17)

+ Assertions

To check that your functions behave as expected (one could say, "to
check that they respect their contracts"), people sometimes add
comments saying "this should be so and so". This is a bad practice,
because the computer does not read the comments. Instead, you can
actually check that "this is so and so". This is called an assertion. 

Typically, assertions check thigs that should always be true: if they
are broken, they reveal there is problem in the code, that should be
fixed. And the program stops, often violently. 

As R is an interpreted environment, one often uses assertions to check
both the internal consistency of the code (the "things that should
always be true") and how the code is used (if the arguments you give
to a function are not those expected, the function should not return
anything, and the computations should be halted until the problem is
corrected).

The "assert" function is not called "assert", but "stopifnot".

  TODO: An example

If you want to be less violent when you check the arguments given to a
function, you can decide to return NULL or NA (as appropriate) a give
a warning. For instance: 

  > mean.default
  function (x, trim = 0, na.rm = FALSE, ...)
  {
    if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
      warning("argument is not numeric or logical: returning NA")
      return(as.numeric(NA))
    }
  ...

I am not very happy with the "assert" function: it tells me something
is wrong, it tells me where, but neither does it tell me how we got
there, nor offers me to examine what happenned. This suggests to use
the "sys.calls" and "browser" functions, to print the calling stack
and to insert a break point where the problem occurred. 

  assert <- function (condition, ...) {
    mc <- match.call()
    if (!is.logical(eval(condition)) || ! all(condition)) {
      cat("Assertion failled:", 
          deparse(mc[[2]]), "is not TRUE\n")
      ll <- list(...)
      for (i in seq(along=ll)) {
        cat("  ", deparse(mc[[i+2]]), ": ", ll[[i]], "\n", sep="")
      }
      ca <- sys.calls()
      cat(paste(length(ca):1, ": ", rev(ca), sep="", collapse="\n"), "\n")
      cat("BROWSER (type 'c' to quit):\n")
      browser()
      stop(paste(deparse(mc[[2]]), "is not TRUE"), call.=FALSE)
    }
  }

TODO: test this function
TODO: is "browser" called in the right environment?

+ Test-Driven Development (TDD): RUnit

  TODO


+ Profiling

"Profiling" means "finding where the program we have just written
spends most of its time, in order to rethink, rewrite or rewrite in C
those time-consuming parts. It is very useful when R is used for
prototyping (i.e., to test algorithms, to see if they actually work,
before using them in real applications).

The "system.time" tells you how much time was spent inside a command.

  %G
  several.times <- function (n, f, ...) {
    for (i in 1:n) {
      f(...)
    }
  }
  matrix.multiplication <- function (s) {
    A <- matrix(1:(s*s), nr=s, nc=s) 
    B <- matrix(1:(s*s), nr=s, nc=s) 
    C <- A %*% B
  }
  v <- NULL
  for (i in 2:10) {
    v <- append(
      v, 
      system.time( 
        several.times( 
          10000, 
          matrix.multiplication, 
          i 
        )
      ) [1]
    )
  }
  plot(v, type = 'b', pch = 15, 
       main = "Matrix product computation time")
  %--

But this is too coarse: we can compare the spped of two functions, but
given a slow function, we still need to find the parts of the
functions responsible for this. Here comes the "Rprof" command. 

  ?Rprof

Example:

  Rprof()
  n <- 200
  m <- matrix(rnorm(n*n), nr=n, nc=n)
  eigen(m)$vectors[,c(1,2)]
  Rprof(NULL)

We then look at the result (we are no longer under R; we call R from
the shell, with other options):

  % R CMD Rprof Rprof.out

  Each sample represents 0.02 seconds.
  Total run time: 0.9 seconds.

  Total seconds: time spent in function and callees.
   Self seconds: time spent in function alone.

     %       total       %       self
   total    seconds     self    seconds    name
   95.56      0.86      2.22      0.02     "eigen"
   82.22      0.74     82.22      0.74     ".Fortran"
   11.11      0.10      4.44      0.04     "all.equal.numeric"
    4.44      0.04      0.00      0.00     "matrix"
    4.44      0.04      0.00      0.00     "as.vector"
    4.44      0.04      4.44      0.04     "rnorm"
    2.22      0.02      2.22      0.02     "<Anonymous>"
    2.22      0.02      2.22      0.02     "|"
    2.22      0.02      2.22      0.02     "t.default"
    2.22      0.02      0.00      0.00     "mean"
    2.22      0.02      0.00      0.00     "t"

     %       self        %       total
   self     seconds    total    seconds    name
   82.22      0.74     82.22      0.74     ".Fortran"
    4.44      0.04     11.11      0.10     "all.equal.numeric"
    4.44      0.04      4.44      0.04     "rnorm"
    2.22      0.02      2.22      0.02     "<Anonymous>"
    2.22      0.02      2.22      0.02     "|"
    2.22      0.02      2.22      0.02     "t.default"
    2.22      0.02     95.56      0.86     "eigen"

* Object Oriented Programming: S3 Classes

The easiest (not the cleanest) way of defining classes in R is simply
to attach a "class" attribute to an object and to define functions that
look up this attribute and act accordingly.

  TODO: rewrite this section, stressing the difference between those
  two paradigms. 

  Introduction: the "print" method
  Other common methods: print, str, summary, predict, plot, 
  List of all the methods of all classes
  Writing your own classes and methods
  More complex examples: Overloading [, [<-, etc. (a simple "panel data" class?)


+ Introduction: more complex types

When we print certain objects, they do not look like the simple types
we have described (vector, array, data.frame). This is the case for
the results of a regression.

  n <- 200
  x <- rnorm(n)
  y <- 1 - 2 * x + rnorm(n)
  r1 <- lm(y~x)
  r2 <- summary(r1)

This yields

  > r1

  Call:
  lm(formula = y ~ x)

  Coefficients:
  (Intercept)            x
        0.924       -2.042

  > r2

  Call:
  lm(formula = y ~ x)

  Residuals:
       Min       1Q   Median       3Q      Max
  -2.85364 -0.66754 -0.04169  0.61238  2.78004

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)  0.92395    0.07345   12.58   <2e-16 ***
  x           -2.04152    0.07613  -26.82   <2e-16 ***
  ---
  Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

  Residual standard error: 1.038 on 198 degrees of freedom
  Multiple R-Squared: 0.7841,     Adjusted R-squared: 0.783
  F-statistic: 719.1 on 1 and 198 DF,  p-value: < 2.2e-16

Yet, the "str" command tells us this is truly one of the simple types
we have already seen, often, a list -- here, I removed a part of it --
it was too huge.

  > str(r2)
  List of 11
   $ call         : language lm(formula = y ~ x)
   $ terms        :Classes 'terms', 'formula' length 3 y ~ x
    .. ..- attr(*, "variables")= language list(y, x)
    .. ..- attr(*, "factors")= int [1:2, 1] 0 1
    .. .. ..- attr(*, "dimnames")=List of 2
    .. .. .. ..$ : chr [1:2] "y" "x"
    .. .. .. ..$ : chr "x"
  (...)
   - attr(*, "class")= chr "summary.lm"

There is one difference between the lists we created earlier: the
"class" attribute. The "r1" and "r2" objects we have just created
belong the the "lm" and "summary.lm" classes. As a result, certain
"generic" functions we apply to these objects are changed: this is the
case for the "print", "summary" and "plot" functions -- let us focus
on "print".

  > print
  function (x, ...)
  UseMethod("print")
  <environment: namespace:base>

  > print.lm
  function (x, digits = max(3, getOption("digits") - 3), ...)
  {
      cat("\nCall:\n", deparse(x$call), "\n\n", sep = "")
      if (length(coef(x))) {
          cat("Coefficients:\n")
          print.default(format(coef(x), digits = digits), print.gap = 2,
              quote = FALSE)
      }
      else cat("No coefficients\n")
      cat("\n")
      invisible(x)
  }
  <environment: namespace:base>

The "print" function, in this case, is called a "generic method": the
function actually called will depend on the class of the object we
apply it to.

The "methods" command gives the list of all the implementations of
this method.

  > methods(plot)
   [1] plot.acf*              plot.ACF*              plot.augPred*
   [4] plot.compareFits*      plot.data.frame        plot.decomposed.ts*
   [7] plot.default           plot.dendrogram*       plot.density
  [10] plot.factor            plot.formula           plot.function
  [13] plot.gls*              plot.hclust*           plot.histogram
  [16] plot.HoltWinters*      plot.intervals.lmList* plot.isoreg*
  [19] plot.lm                plot.lme               plot.lme1*
  [22] plot.lmList*           plot.mlm               plot.nffGroupedData*
  [25] plot.nfnGroupedData*   plot.nls*              plot.nmGroupedData*
  [28] plot.pdMat*            plot.POSIXct           plot.POSIXlt
  [31] plot.ppr*              plot.prcomp*           plot.princomp*
  [34] plot.profile.nls*      plot.ranef.lme*        plot.ranef.lmList*
  [37] plot.shingle*          plot.simulate.lme*     plot.spec
  [40] plot.spec1*            plot.spec.coherency    plot.spec.phase
  [43] plot.stl*              plot.table             plot.ts
  [46] plot.tskernel*         plot.TukeyHSD          plot.Variogram*
      Non-visible functions are asterisked

If you want to see the code of one of the non-visible functions, you
can use the "getAnywhere" of the "getS3method" function. 

  > plot.Date
  Error: Object "plot.Date" not found

  > getAnywhere("plot.Date")
  A single object matching `plot.Date' was found
  It was found in the following places
    registered S3 method for plot from namespace graphics
    namespace:graphics
  with value

  function (x, y, xlab = "", axes = TRUE, frame.plot = axes, xaxt = par("xaxt"),
      ...)
  {
      axisInt <- function(x, main, sub, xlab, ylab, col, lty, lwd,
          xlim, ylim, bg, pch, log, asp, ...) axis.Date(1, x, ...)
      plot.default(x, y, xaxt = "n", xlab = xlab, axes = axes,
          frame.plot = frame.plot, ...)
      if (axes && xaxt != "n")
          axisInt(x, ...)
  }
  <environment: namespace:graphics>

  > getS3method("plot", "Date")
  function (x, y, xlab = "", axes = TRUE, frame.plot = axes, xaxt = par("xaxt"),
      ...)
  {
      axisInt <- function(x, main, sub, xlab, ylab, col, lty, lwd,
          xlim, ylim, bg, pch, log, asp, ...) axis.Date(1, x, ...)
      plot.default(x, y, xaxt = "n", xlab = xlab, axes = axes,
          frame.plot = frame.plot, ...)
      if (axes && xaxt != "n")
          axisInt(x, ...)
  }
  <environment: namespace:graphics>

Let us remark that an object may have several types: the "class"
attribute may contain a string or a vector of strings -- when we call
the function, the computer tries all the types specified in this
vector, until it finds a method. If it finds none, it uses the
"default" class. In object-oriented parlance, this is called
"inheritance".

For instance, the result of the "aov" command 

  n <- 500
  x <- rnorm(n)
  y <- 1 - x + rnorm(n)
  r <- aov(y~x)

belongs to the classes "aov" and "lm" (in object-oriented programming,
the "aov" class inherits from "lm", i.e., an "aov" object is an "lm"
object -- the relation is sometimes called the "is a" (or "ISA")
relation).

  > class(r)
  [1] "aov" "lm"
  > attr(r,"class")
  [1] "aov" "lm"

+ Introduction: the "plot" method

There are many things one can want to do with an object that, from the
user's point of view should work for any object -- but that, from the
programmer's point of view, require completely different
implementations: for instance, displaying all the contents of an
object (with the "print" function -- this function is automatically
called when R prints a result -- for the Java programmers among you,
this is the analogue of the toString method), plotting an object (with
the "plot") function or displaying its structure (with the "str"
function).

All this seems to be done from a single function. 

But actually, the sole role of that function is to check the type of
its argument and call the (type-dependant) function that actually does
the job.

  > print
  function (x, ...)
  UseMethod("print")
  <environment: namespace:base>

  > methods("print")
    [1] print.Arima*                       print.AsIs
    [3] print.Bibtex*                      print.DLLInfo
    [5] print.DLLInfoList                  print.DLLRegisteredRoutines
    [7] print.Date                         print.HoltWinters*
    [9] print.Latex*                       print.MethodsFunction*
   [11] print.MethodsList*                 print.NativeRoutineList
   [13] print.POSIXct                      print.POSIXlt
   [15] print.RGBcolorConverter*           print.StructTS*
   [17] print.TukeyHSD*                    print.acf*
   [19] print.anova                        print.anova.gam
   [21] print.aov*                         print.aovlist*
   [23] print.ar*                          print.arima0*
   [25] print.by                           print.checkDocFiles*
   [27] print.checkDocStyle*               print.checkFF*
   [29] print.checkReplaceFuns*            print.checkS3methods*
   [31] print.checkTnF*                    print.checkVignettes*
   [33] print.check_Rd_files_in_Rd_db*     print.check_Rd_xrefs*
   [35] print.check_code_usage_in_package* print.check_demo_index*
   [37] print.check_make_vars*             print.check_package_depends*
   [39] print.check_package_description*   print.check_vignette_index*
   [41] print.citation*                    print.citationList*
   [43] print.classRepresentation*         print.codoc*
   [45] print.codocClasses*                print.codocData*
   [47] print.colorConverter*              print.condition
   [49] print.connection                   print.data.frame
   [51] print.default                      print.dendrogram*
   [53] print.density                      print.difftime
   [55] print.dist*                        print.dummy_coef*
   [57] print.dummy_coef_list*             print.ecdf*
   [59] print.factanal*                    print.factor
   [61] print.family                       print.formula
   [63] print.ftable                       print.gam
   [65] print.getAnywhere*                 print.glm
   [67] print.hclust*                      print.help_files_with_topic*
   [69] print.hexmode                      print.hsearch*
   [71] print.htest*                       print.infl
   [73] print.integrate                    print.isoreg*
   [75] print.kmeans*                      print.libraryIQR
   [77] print.listof                       print.lm
   [79] print.loadings*                    print.loess*
   [81] print.logLik                       print.ls_str*
   [83] print.medpolish*                   print.mtable*
   [85] print.nls*                         print.noquote
   [87] print.octmode                      print.packageDescription*
   [89] print.packageIQR*                  print.packageInfo
   [91] print.packageStatus*               print.package_version
   [93] print.pairwise.htest*              print.power.htest*
   [95] print.ppr*                         print.prcomp*
   [97] print.princomp*                    print.recordedplot*
   [99] print.restart                      print.rle
  [101] print.sessionInfo*                 print.simple.list
  [103] print.smooth.spline*               print.socket*
  [105] print.stepfun*                     print.stl*
  [107] print.subdir_tests*                print.summary.aov*
  [109] print.summary.aovlist*             print.summary.gam
  [111] print.summary.glm*                 print.summary.lm*
  [113] print.summary.loess*               print.summary.manova*
  [115] print.summary.nls*                 print.summary.ppr*
  [117] print.summary.prcomp*              print.summary.princomp*
  [119] print.summary.table                print.table
  [121] print.tables_aov*                  print.terms
  [123] print.ts                           print.tskernel*
  [125] print.tukeyline*                   print.tukeysmooth*
  [127] print.undoc*                       print.vignette*
  [129] print.xgettext*                    print.xngettext*
  [131] print.xtabs*

     Non-visible functions are asterisked

(The exact list you get will depend on the packages you have loaded:
you can have much more than that.)

To get the actual code, just type the name of the function (the
method, dot, the type): 

  > print.lm
  function (x, digits = max(3, getOption("digits") - 3), ...)
  {
    cat("\nCall:\n", deparse(x$call), "\n\n", sep = "")
    if (length(coef(x))) {
      cat("Coefficients:\n")
      print.default(format(coef(x), digits = digits), print.gap = 2,
        quote = FALSE)
    }
    else cat("No coefficients\n")
    cat("\n")
    invisible(x)
  }
  <environment: namespace:stats>

For hidden objects, you can obtain the code with the getAnywhere function. 

  > print.acf
  Error: object "print.acf" not found

  > getAnywhere("print.acf")
  A single object matching 'print.acf' was found
  It was found in the following places
    registered S3 method for print from namespace stats
    namespace:stats
  with value

  function (x, digits = 3, ...)
  {
    type <- match(x$type, c("correlation", "covariance", "partial"))
    msg <- c("Autocorrelations", "Autocovariances", "Partial autocorrelations")
    cat("\n", msg[type], " of series ", sQuote(x$series), ", by lag\n\n",
        sep = "")
    nser <- ncol(x$lag)
    if (type != 2)
      x$acf <- round(x$acf, digits)
    if (nser == 1) {
      acfs <- drop(x$acf)
      names(acfs) <- format(drop(x$lag), digits = 3)
      print(acfs, digits = digits, ...)
    }
    else {
      acfs <- format(x$acf, ...)
      lags <- format(x$lag, digits = 3)
      acfs <- array(paste(acfs, " (", lags, ")", sep = ""),
                    dim = dim(x$acf))
      dimnames(acfs) <- list(rep("", nrow(x$lag)), x$snames,
                             x$snames)
      print(acfs, quote = FALSE, ...)
    }
    invisible(x)
  }
  <environment: namespace:stats>

If you know the namespace of your function, you can also obtain it
with the ::: operator (twice for non-hidden functions, thrice for
hidden ones).

  > stats:::print.acf
  function (x, digits = 3, ...)
  {
    type <- match(x$type, c("correlation", "covariance", "partial"))
  ...

+ Class of an object

The class of an object is just a string attribute, attached to it --
if you want to add information to an object, typically metadata, just
put it in the attributes.

  > class(x)
  [1] "numeric"
  > mode(x)     # The class the object would have were there no attribute
  [1] "numeric"

  > r <- lm(y ~ x)
  > class(r)
  [1] "lm"
  > mode(r)
  [1] "list"

  > attributes(r)
  $names
   [1] "coefficients"  "residuals"     "effects"       "rank"
   [5] "fitted.values" "assign"        "qr"            "df.residual"
   [9] "xlevels"       "call"          "terms"         "model"
  $class
  [1] "lm"

  > attr(r, "class")
  [1] "lm"

+ Other methods

Here are a few examples of generic functions, that can be used with a
large number of classes: 

  print
  str
  plot
  predict
  seq
  anova

If you want more: 

  > methods(class="default")
    [1] AIC.default*             Axis.default*            add1.default*
    [4] aggregate.default        all.equal.default        ansari.test.default*
    [7] ar.burg.default*         ar.yw.default*           as.Date.default
   [10] as.POSIXct.default       as.character.default     as.complex.default
   [13] as.data.frame.default    as.dist.default*         as.double.default
   [16] as.expression.default    as.function.default      as.hclust.default*
   [19] as.integer.default       as.list.default          as.logical.default
   [22] as.matrix.default        as.null.default          as.person.default*
   [25] as.personList.default*   as.single.default        as.stepfun.default*
   [28] as.table.default         as.ts.default*           barplot.default
   [31] bartlett.test.default*   biplot.default*          boxplot.default
   [34] by.default               case.names.default*      cdplot.default*
   [37] coef.default*            confint.default          contour.default
   [40] cophenetic.default*      cor.test.default*        cut.default
   [43] cycle.default*           deltat.default*          density.default
   [46] deriv.default            deriv3.default           deviance.default*
   [49] df.residual.default*     diff.default             diffinv.default*
   [52] drop1.default*           duplicated.default       edit.default*
   [55] end.default*             fitted.default*          fligner.test.default*
   [58] format.default           formula.default*         frequency.default*
   [61] friedman.test.default*   ftable.default*          getInitial.default*
   [64] head.default*            hist.default             identify.default*
   [67] image.default            is.na<-.default          kappa.default
   [70] kernapply.default*       kruskal.test.default*    labels.default
   [73] lag.default*             levels<-.default         lines.default
   [76] makepredictcall.default* mean.default             merge.default
   [79] model.frame.default      model.matrix.default     monthplot.default*
   [82] mood.test.default*       mosaicplot.default*      na.action.default*
   [85] na.contiguous.default*   na.exclude.default*      na.fail.default*
   [88] na.omit.default*         names.default            names<-.default
   [91] napredict.default*       naprint.default*         naresid.default*
   [94] pacf.default*            pairs.default            persp.default*
   [97] plot.default             points.default           ppr.default*
  [100] prcomp.default*          princomp.default*        print.default
  [103] proj.default*            prompt.default*          qqnorm.default
  [106] quade.test.default*      quantile.default         range.default
  [109] relevel.default*         rep.default              residuals.default
  [112] rev.default              row.names.default        row.names<-.default
  [115] rowsum.default           scale.default            selfStart.default*
  [118] seq.default              solve.default            sortedXyData.default*
  [121] spineplot.default*       split.default            split<-.default
  [124] stack.default            start.default*           str.default*
  [127] subset.default           summary.default          t.default
  [130] t.test.default*          tail.default*            terms.default
  [133] text.default             time.default*            toString.default
  [136] transform.default        unique.default           unstack.default
  [139] update.default           var.test.default*        variable.names.default*
  [142] weights.default*         wilcox.test.default*     window.default*
  [145] with.default

  > methods(class="lm")
   [1] add1.lm*           alias.lm*          anova.lm           case.names.lm*
   [5] confint.lm*        cooks.distance.lm* deviance.lm*       dfbeta.lm*
   [9] dfbetas.lm*        drop1.lm*          dummy.coef.lm*     effects.lm*
  [13] extractAIC.lm*     family.lm*         formula.lm*        hatvalues.lm
  [17] influence.lm*      kappa.lm           labels.lm*         logLik.lm*
  [21] model.frame.lm     model.matrix.lm    plot.lm            predict.lm
  [25] print.lm           proj.lm*           residuals.lm       rstandard.lm
  [29] rstudent.lm        simulate.lm*       summary.lm         variable.names.lm*
  [33] vcov.lm*

Here is another way of getting all the methods (the functions in whose code
the string "UseMethod" appears, in any loaded namespace, whether they are visible
or not).

  res <- character(0)
  env <- append(
    lapply(search(), function (x) as.environment(x)),
    lapply(loadedNamespaces(), function (x) asNamespace(x))
  )
  for (e in env) {
    n <- ls(envir=e)
    l <- lapply(n, function (x) {
      x <- get(x, envir=e)
      x <- deparse(x)
      x <- paste(x, collapse="")
      length(grep("UseMethod", x)) > 0
    })
    l <- unlist(l)
    res <- c(res, n[l])
  }
  res <- unique(res)
  res

Here are the most often used.

  names(res) <- res
  res <- lapply(res, function (x) try(length(methods(x))))
  res <- res[unlist(lapply(res, is.numeric))]
  res <- unlist(res)
  head(sort(res, dec=TRUE), 40)

This yields:

  > length(res)
  [1] 211
  > head(sort(res, dec=TRUE), 40)
           print             plot          summary          predict
             168               53               50               32
   as.data.frame            resid        residuals           format
              26               13               13               12
            coef     coefficients             vcov           fitted
              11               11               10                9
   fitted.values            anova          formula        all.equal
               9                8                8                8
      extractAIC             head             tail          as.Date
               7                7                7                7
  Predict.matrix smooth.construct           logLik            image
               6                6                6                6
           lines       as.POSIXct        as.matrix             mean
               6                6                6                6
            add1          confint         deviance            drop1
               5                5                5                5
     model.frame              str             diff       duplicated
               5                5                5                5
          labels           unique         kurtosis         skewness
               5                5                4                4

+ Creating your own classes with the usual methods

You can define your own "classes" and overload the "plot", "print" and
"summary" functions.

  > print.foo <- function (x,...) {
    print.default(x)
    print.default(min(x))
    print.default(median(x))
    print.default(max(x))
  }
  > x <- matrix( rnorm(20), nrow=4 )
  > print(x)
             [,1]       [,2]        [,3]        [,4]       [,5]
  [1,] 0.05858332 -0.3082483  1.08259617 -0.10539949 -0.3734017
  [2,] 0.23264808 -0.4763760 -0.01989608 -0.07837898  2.3640196
  [3,] 0.05239833 -0.6764430 -0.76649216  0.76078938  0.2715206
  [4,] 0.27780672 -0.5458009 -0.96929622  0.90089157  1.7325063
  > class(x)
  [1] "matrix"
  > class(x) <- c("foo", class(x))
  > print(x)
             [,1]       [,2]        [,3]        [,4]       [,5]
  [1,] 0.05858332 -0.3082483  1.08259617 -0.10539949 -0.3734017
  [2,] 0.23264808 -0.4763760 -0.01989608 -0.07837898  2.3640196
  [3,] 0.05239833 -0.6764430 -0.76649216  0.76078938  0.2715206
  [4,] 0.27780672 -0.5458009 -0.96929622  0.90089157  1.7325063
  attr(,"class")
  [1] "foo"
  [1] -0.9692962
  [1] 0.01625113
  [1] 2.364020

Actually, you might want to use cat() instead of print().

  print.foo <- function (x,...) {
    cat("foo: ", length(x), " values between ",
        min(x),  " and ", max(x), "\n", sep="")
    cat("     ", "mean: ", mean(x),
        " median: ", median(x), "\n", sep="")
    print(x)
  }

+ Overloadable function

You can define your own overloadable functions as:

  print <- function (x, ...) UseMethod("print")

+ All the methods

The "getS3method" starts with 

  > getS3method
  function (f, class, optional = FALSE)
  {
    knownGenerics <- c(tools:::.get_internal_S3_generics(), names(.knownS3Generics))
  ...

So we can get all the gemeric methods as follows (these are only the
methods currently loaded): 

  > tools:::.get_internal_S3_generics(); names(.knownS3Generics)
   [1] "["              "[["             "$"              "[<-"
   [5] "[[<-"           "$<-"            "length"         "dimnames<-"
   [9] "dimnames"       "dim<-"          "dim"            "c"
  [13] "unlist"         "as.character"   "as.vector"      "is.array"
  [17] "is.atomic"      "is.call"        "is.character"   "is.complex"
  [21] "is.double"      "is.environment" "is.function"    "is.integer"
  [25] "is.language"    "is.logical"     "is.list"        "is.matrix"
  [29] "is.na"          "is.nan"         "is.null"        "is.numeric"
  [33] "is.object"      "is.pairlist"    "is.recursive"   "is.single"
  [37] "is.symbol"      "abs"            "sign"           "sqrt"
  [41] "floor"          "ceiling"        "trunc"          "round"
  [45] "signif"         "exp"            "log"            "cos"
  [49] "sin"            "tan"            "acos"           "asin"
  [53] "atan"           "cosh"           "sinh"           "tanh"
  [57] "acosh"          "asinh"          "atanh"          "lgamma"
  [61] "gamma"          "gammaCody"      "digamma"        "trigamma"
  [65] "tetragamma"     "pentagamma"     "cumsum"         "cumprod"
  [69] "cummax"         "cummin"         "+"              "-"
  [73] "*"              "/"              "^"              "%%"
  [77] "%/%"            "&"              "|"              "!"
  [81] "=="             "!="             "<"              "<="
  [85] ">="             ">"              "all"            "any"
  [89] "sum"            "prod"           "max"            "min"
  [93] "range"          "Arg"            "Conj"           "Im"
  [97] "Mod"            "Re"
   [1] "Math"          "Ops"           "Summary"       "Complex"
   [5] "as.character"  "as.data.frame" "as.matrix"     "as.vector"
   [9] "labels"        "print"         "solve"         "summary"
  [13] "t"             "edit"          "str"           "contour"
  [17] "hist"          "identify"      "image"         "lines"
  [21] "pairs"         "plot"          "points"        "text"
  [25] "add1"          "AIC"           "anova"         "biplot"
  [29] "coef"          "confint"       "deviance"      "df.residual"
  [33] "drop1"         "extractAIC"    "fitted"        "formula"
  [37] "logLik"        "model.frame"   "model.matrix"  "predict"
  [41] "profile"       "qqnorm"        "residuals"     "se.contrast"
  [45] "terms"         "update"        "vcov"

+ All the classes

You can get all the classes (only those currently loaded, actually),
as follows. 

  > g <- c(tools:::.get_internal_S3_generics(), names(.knownS3Generics)) 
  > r <- sapply(g[1:3], methods)
  > r[[ which.max(sapply(r, length)) ]] <- NULL    # This was not a class
  > r <- sort(unique(gsub("^[^\\.]+\\.", "", unlist(r))))
  > r
    [1] "acf"                        "anova"                     
    [3] "aov"                        "aovlist"                   
    [5] "ar"                         "Arima"                     
    [7] "arima0"                     "AsIs"                      
    [9] "Bibtex"                     "by"                        
   [11] "character.condition"        "character.Date"            
   [13] "character.default"          "character.error"           
   [15] "character.factor"           "character.octmode"         
   [17] "character.package_version"  "character.person"          
   [19] "character.personList"       "character.POSIXt"          
   [21] "check_demo_index"           "checkDocFiles"             
   [23] "checkDocStyle"              "checkFF"                   
   [25] "check_make_vars"            "check_package_depends"     
   [27] "check_package_description"  "check_Rd_files_in_Rd_db"   
   [29] "checkReplaceFuns"           "checkS3methods"            
   [31] "checkTnF"                   "check_vignette_index"      
   [33] "checkVignettes"             "citation"                  
   [35] "citationList"               "classRepresentation"       
   [37] "codoc"                      "codocClasses"              
   [39] "codocData"                  "colorConverter"            
   [41] "condition"                  "connection"                
   [43] "contrast.aov"               "contrast.aovlist"          
   [45] "coxph"                      "data.frame"                
   [47] "data.frame.array"           "data.frame.AsIs"           
   [49] "data.frame.character"       "data.frame.complex"        
   [51] "data.frame.data.frame"      "data.frame.Date"           
   [53] "data.frame.default"         "data.frame.factor"         
   [55] "data.frame.integer"         "data.frame.list"           
   [57] "data.frame.logical"         "data.frame.logLik"         
   [59] "data.frame.matrix"          "data.frame.model.matrix"   
   [61] "data.frame.numeric"         "data.frame.ordered"        
   [63] "data.frame.package_version" "data.frame.POSIXct"        
   [65] "data.frame.POSIXlt"         "data.frame.raw"            
   [67] "data.frame.table"           "data.frame.ts"             
   [69] "data.frame.vector"          "Date"                      
   [71] "decomposed.ts"              "default"                   
   [73] "dendrogram"                 "density"                   
   [75] "difftime"                   "dist"                      
   [77] "DLLInfo"                    "DLLInfoList"               
   [79] "DLLRegisteredRoutines"      "dummy_coef"                
   [81] "dummy_coef_list"            "ecdf"                      
   [83] "factanal"                   "factor"                    
   [85] "family"                     "formula"                   
   [87] "frame.aovlist"              "frame.default"             
   [89] "frame.glm"                  "frame.lm"                  
   [91] "ftable"                     "getAnywhere"               
   [93] "glm"                        "glmlist"                   
   [95] "gls"                        "hclust"                    
   [97] "help_files_with_topic"      "histogram"                 
   [99] "HoltWinters"                "hsearch"                   
  [101] "htest"                      "infl"                      
  [103] "integer.factor"             "integrate"                 
  [105] "isoreg"                     "kmeans"                    
  [107] "Latex"                      "libraryIQR"                
  [109] "listof"                     "lm"                        
  [111] "lme"                        "loadings"                  
  [113] "loess"                      "logLik"                    
  [115] "ls_str"                     "manova"                    
  [117] "matrix"                     "matrix.data.frame"         
  [119] "matrix.default"             "matrix.dist"               
  [121] "matrix.lm"                  "matrix.noquote"            
  [123] "matrix.POSIXlt"             "medpolish"                 
  [125] "MethodsFunction"            "MethodsList"               
  [127] "mlm"                        "mtable"                    
  [129] "na.data.frame"              "na.POSIXlt"                
  [131] "NativeRoutineList"          "negbin"                    
  [133] "nls"                        "noquote"                   
  [135] "numeric.factor"             "octmode"                   
  [137] "ordered"                    "packageDescription"        
  [139] "packageInfo"                "packageIQR"                
  [141] "packageStatus"              "package_version"           
  [143] "pairwise.htest"             "poly"                      
  [145] "POSIXct"                    "POSIXlt"                   
  [147] "POSIXt"                     "power.htest"               
  [149] "ppr"                        "prcomp"                    
  [151] "princomp"                   "profile.nls"               
  [153] "qr"                         "recordedplot"              
  [155] "residual.default"           "residual.nls"              
  [157] "restart"                    "RGBcolorConverter"         
  [159] "rle"                        "sessionInfo"               
  [161] "simple.list"                "smooth.spline"             
  [163] "smooth.spline.fit"          "socket"                    
  [165] "spec"                       "spec.coherency"            
  [167] "spec.phase"                 "stepfun"                   
  [169] "stl"                        "StructTS"                  
  [171] "summary.aov"                "summary.aovlist"           
  [173] "summary.glm"                "summary.lm"                
  [175] "summary.loess"              "summary.manova"            
  [177] "summary.nls"                "summary.ppr"               
  [179] "summary.prcomp"             "summary.princomp"          
  [181] "summary.table"              "survreg"                   
  [183] "table"                      "tables_aov"                
  [185] "terms"                      "ts"                        
  [187] "tskernel"                   "TukeyHSD"                  
  [189] "tukeyline"                  "tukeysmooth"               
  [191] "undoc"                      "vector.factor"             
  [193] "vignette"                   "xgettext"                  
  [195] "xngettext"                  "xtabs"                     

+ Writing your own classes: toy example

One may write one's own classes: it suffices to add a "class" attribute"
to an object and to define the corresponding method. 

  x <- pi
  attr(x, 'class') <- "number"
  print.number <- function (x) {
    cat("(number) ")
    cat(signif(x))
    cat("\n")s
  }

This gives:

  > x
  (number) 3.14159

We can also define our own methods.

  affiche <- function (x,...) {
    UseMethod("affiche")
  }
  affiche.default <- print
  affiche.number <- function (x) {
    cat("(number) ")
    cat(signif(x))
    cat("\n")
  }

This gives:

  > affiche(x)
  (number) 3.14159
  > affiche(pi)
  [1] 3.141593

+ More complex example

Let us write a class to store panel data, i.e., something like a
data.frame, but in which ecah variable is an array, one row per
subject, one column per data, instead of a vector.

  TODO...
  
  is.panel.data <- function (x) {
    # A panel data object should be a (non-empty) list, all of whose
    # elements are matrices of the same size and with the same row and
    # column names.
    # The attributes should be as follows: 
    #   class: contains "panel.data"
    #   rownames (subjects)
    #   colnames (dates)
    #   names (variables)
    if( is.null(x) ) return(FALSE)
    if( ! is.list(x) ) return(FALSE)
    if( ! inherits(x, "panel.data") ) return(FALSE)
    # Here, the object claims to be a "panel.data" object.
    # If one of the following conditions is not satisfied, it is
    # corrupted -- this is a bug. 
    x <- unclass(x)
    stopifnot(!is.null(attr(x, "names")))
    stopifnot(!is.null(attr(x, "rownames")))
    stopifnot(!is.null(attr(x, "colnames")))
    d1 <- attr(x, "rownames")
    d2 <- attr(x, "colnames")
    for (k in 1:length(x)) {
      stopifnot( is.array(x[[k]]) )
      stopifnot( length(dim(x[[k]])) == 2 )
      stopifnot( dimnames(x[[k]])[[1]] == d1 )
      stopifnot( dimnames(x[[k]])[[2]] == d2 )
    }
    return(TRUE)
  }

  panel.data <- function(...) {
    r <- list(...)  ############ TODO: flatten this list and remove the NULL elements
    if (is.null(r)) return(NULL)
    cat("Checking elements\n")
    for (i in seq(along=r)) {
      stopifnot( is.matrix(r[[i]]) )
      stopifnot( dim(r[[1]]) == dim(r[[i]]) )
      stopifnot( dimnames(r[[1]])[[1]] == dimnames(r[[i]])[[1]] )
      stopifnot( dimnames(r[[1]])[[2]] == dimnames(r[[i]])[[2]] )
    }
    cat("Checking attributes\n")
    stopifnot(!is.null(dimnames(r[[1]])))
    stopifnot(!is.null(names(r)))
    cat("Setting attributes\n")
    attr(r, "rownames") <- dimnames(r[[1]])[[1]]
    attr(r, "colnames") <- dimnames(r[[1]])[[2]]
    attr(r, "class") <- "panel.data"
    r
  }

  dim.panel.data    <- function (x) {
    c( length(attr(r,"rownames")), 
       length(attr(r,"colnames")),
       length(attr(r,"names")) )
  }

  n1 <- 2
  n2 <- 3
  n3 <- 4
  x <- matrix(rnorm(n1*n2), nr=n1, nc=n2)
  rownames(x) <- paste("Subject", 1:n1, sep="")
  colnames(x) <- paste("Date", 1:n2, sep="")
  r <- list(a=x, b=x, c=x, d=x)
  is.panel.data(x)
  is.panel.data(r)
  r <- panel.data(a=x, b=1+x, c=2+x, d=3+x)
  is.panel.data(r)
  dim(r)

  "[.panel.data"    <- function (x, i=1:dim(x)[1], j=1:dim(x)[2], k=1:dim(x)[3], drop=T) {
    if (length(i) == 0 | length(j) == 0 | length(k) == 0) return(NULL)
    a <- attributes(x)
    x <- unclass(x) # It is now a list
    x <- lapply(x, function (y) { y[i,j, drop=F] }) # The first two indices
    if (is.logical(k)) k <- which(k)
    if (is.numeric(k)) k <- a$names[k]
    r <- NULL
    for (ind in k) {
      r[[ ind ]] <- x[[ ind ]]
    }
    if (drop) { 
      if (is.list(r) & length(r) == 1) r <- r[[1]]
      r <- drop(r)      
    }
    r
  }

  "[<-.panel.data"  <- function(x, 
                                i=rownames(x), j=colnames(x), k=names(x), 
                                value) {
    # Make sure that the arguments contain the names of the rows, columns, etc.
    if (is.logical(i)) { stopifnot(length(i)==length(rownames(x))); i <- rownames(x)[i] }
    if (is.logical(j)) { stopifnot(length(j)==length(colnames(x))); j <- colnames(x)[i] }
    if (is.logical(k)) { stopifnot(length(k)==length(names(x)));    k <- names(x)[i] }
    if (is.numeric(i)) { i <- rownames(x)[i] }
    if (is.numeric(j)) { i <- colnames(x)[i] }
    if (is.numeric(k)) { i <- names(x)[i] }
    if (!is.panel.data(value)) {
      stop("Not implemented: non-panel.data argument")        ################## TODO
    }
    for (a in 1:length(k)) {
      x[[ k[a] ]] <- value[[a]]
    }
  }

  # "$.panel.data"    <- ... # Unchanged

  "$<-.panel.data"  <- function (x, key, value) {
    x[[key]] <- value
  }

  # "[[.panel.data"   <- ... # Unchanged

  "[[<-.panel.data" <- function (x, key, value) {
    cl <- class(x)
    d <- dim(x)
    x <- unclass(x)
    stopifnot(is.character(key), length(key) == 1)
    stopifnot(is.array(value))
    stopifnot(dim(value) == d[1:2])
    if (!is.null(rownames(value))) {
      stopifnot( rownames(value) == attr(x, "rownames") )
    } else {
      rownames(value) <- attr(x, "rownames") )    
    }
    if (!is.null(colnames(value))) { 
      stopifnot( colnames(value) == attr(x, "colnames") )    
    } else {
      colnames(value) <- attr(x, "colnames") )    
    }
    x[[key]] <- value
    class(x) <- cl
    x
  }

  "dimnames.panel.data" <- function (x) {
    list(subject   = attr(x, "rownames"), 
         dates     = attr(x, "colnames"), 
         variables = attr(x, "names") )
  }

  "dimnames<-.panel.data" <- function (x, l) {
    stopifnot( is.list(l) )
    stopifnot( length(l) == 3 )
    stopifnot( is.character(l[[1]]) )
    stopifnot( length(l[[1]]) == dim(x)[1] )
    stopifnot( length(l[[2]]) == dim(x)[2] )
    stopifnot( length(l[[3]]) == dim(x)[3] )
    attr(x, "rownames") <- l[[1]]
    attr(x, "colnames") <- l[[2]]
    attr(x, "names") <- l[[3]]
    x
  }

  TODO

+ Problems with S4 classes

The central notion is that of method, not that of object...

There is no encapsulation...

But they are very easy to use...

* Object Oriented Programming: S3 Classes

  TODO: this section should be (re)written

+ Methods

Here is the new way of defining objects and function:

  library(help=methods)
  http://www.omegahat.org/RSMethods/Intro.ps  

TODO: comment the following example (from BioConductor).

  library('methods')
  setClass('microarray',           ## the class definition
     representation(               ## its slots
         qua = 'matrix',
         samples = 'character',
         probes = 'vector'),
     prototype = list(             ## and default values
         qua = matrix(nrow=0, ncol=0),
         samples = character(0),
         probes = character(0)))
  dat = read.delim('../data/alizadeh/lc7b017rex.DAT')
  z = cbind(dat$CH1I, dat$CH2I)
  setMethod('plot',                ## overload generic function `plot'
   signature(x='microarray'),      ## for this new class
   function(x, ...)
   plot(x@qua, xlab=x@samples[1], ylab=x@samples[2], pch='.', log='xy'))
  ma = new('microarray',           ## instantiate (construct)
         qua = z,
         samples = c('brain','foot'))
  plot(ma)

To understand object oriented programming in R, the easiest is
probably to look at the libraries that use it, such as "pixmap".

  less /usr/lib/R/library/pixmap/R/pixmap.R

Other examples (in 2003):

  MASS/scripts/ch03.R
  DBI
  gpclib
  pixmap
  SparseM

Two years later (2005), I update this list:

  arules
  boolean
  CoCo
  coin
  colorspace
  DBI
  deal
  distr
  dynamicGraph
  fBasics
  flexmix
  fSeries
  gpclib
  gRbase
  its
  kernlab
  kinship
  limma
  lme4
  matlab
  Matrix
  orientlib
  pamr
  pixmap
  R2HTML
  rgdal
  rmetasim
  RMySQL
  ROCR
  R.oo
  ROracle
  RSQLite
  rstream
  SciViews
  SparseM
  tuneR
  urca
  XML

For a larger example, check BioConductor or Rmetrics.

  http://www.bioconductor.org/
  http://www.itp.phys.ethz.ch/econophysics/R/

* Data storage, Data import, Data export

For more details abour how to import or export data
to and from sensible or less sensible formats,
check the R data import export manual:

  http://cran.r-project.org/doc/manuals/R-data.pdf

+ read.table and co.

To import data from readable formats, you can use one of the following
commands: 

  d <- read.table("foo.txt", header=T, sep=",")
  d <- read.csv("txt.csv")
  d <- read.csv2("txt.csv")  # semicolon-separated file, with a
                             # comma instead of the decimal point.
  d <- read.delim("foo.txt") # Tab-delimited file
  d <- read.fwf("txt.fwf")   # Fixed width fields

In case your file comes from Excel, this may be trickier: the missing
values often appear as "#N/A!" and are mistaken for the start of a
comment... You can try

  d <- read.table("foo.csv", header = TRUE, sep = ",", 
                  na.strings = c("#N/A!", "NA", "@NA"), 
                  quote = '"',
                  comment.char = "")


+ Importing data

For simple and short examples, you can type in the data by hand. In
this document, we shall use a lot of simulated data: they are larger,
but a couple of lines suffice to produce them.

On the contrary, in real situations, the data are large and stored in
files or data bases: how to import them into R?

Personnaly, I often use the "source" command, even though it was not
designed for that purpose: it reads in code, not data -- you have to
process the data via external tools. In one situation, the data I had
to process had a rather non standard format (multiple alignment of DNA
sequences): thus, I wrote a small Perl program to convert this format
into R code (not "R data", but actual code).

More precisely, the data looked like

  CLUSTAL W (1.83) multiple sequence alignment

  AB020905        ATGACCAACATCCGAAAAACCCACCCATTAGCTAAAATCATCAACAACTCATTTATTGAC
  AB020906        ATGACCAACATCCGAAAAACCCACCCATTAGCTAAAATCATCAACAACTCATTTATTGAC
  AB020907        ATGACCAACATCCGAAAAACCCACCCATTAGCTAAAATCATCAACAACTCACTTATTGAC
  AB020908        ATGACCAACATCCGAAAAACCCACCCATTAGCTAAAATCATCAACAACTCATTTATTGAC
  AB020909        ATGACCAACATCCGAAAAACCCACCCATTAGCTAAAATCATCAACAACTCATTTATTGAC
                  *************************************************** ********

  AB020905        CTTCCAACACCATCAAACATCTCGGCATGATGAAACTTTGGATCCCTCCTTGGAGTATGT
  AB020906        CTTCCAACACCATCAAACATCTCAGCATGATGAAACTTTGGATCCCTCCTTGGAGTATGT
  AB020907        CTTCCAACACCATCAAACATCTCAGCATGATGAAACTTTGGATCCCTCCTTGGAGTATGT
  AB020908        CTTCCAACACCATCAAACATCTCAGCATGATGAAACTTTGGATCCCTCCTCGGAGTATGT
  AB020909        CTTCCAACACCATCAAACATCTCAGCATGATGAAACTTTGGATCCCTCCTCGGAGTATGT
                  *********************** ************************** *********

the program was 

  #! perl -w
  use strict;

  my @seq;
  my @names;
  my $i=0;

  # Go just after the first empty line
  while (<>) {
    chomp;
    print STDERR "Skipping $. ($_)\n";
    last if m/^\s*$/;
  }

  while (<>) {
    chomp;
    if( m/^\s*$/ ){
      $i=0;
      print STDERR "Skipping $. ($_)\n";
      next;
    }
    print STDERR "Reading $. ($i) ($_)\n";
    if (m/^([^\s]+?)\s+(.*)/) {
      print STDERR "Remembering $.\n";
      $names[$i] = $1;
      $seq[$i] .= $2;
    }
    $i++;
  }

  # foreach my $s (@seq) { print "$s\n"; }
  print "d <- matrix( c(\n";
  foreach my $s (@seq) {
    print '"'. join('", "', split('', $s)) .'",' ."\n";
  }
  print "), nr=". (scalar @seq) .", byrow=T)\n";
  print "rownames(d) <- c('". join("', '", @names) ."')\n"

and the result looked like 

  d <- matrix( c( "A", "T", "G", "A", "C", "C", "A", "A", "C", "A",
    "T", "C", "C", "G", "A", "A", "A", "A", "A", "C", "C", "C", "A",
    ...
  ), nr=5, byrow=T)
  rownames(d) <- c('AB020905', 'AB020906', 'AB020907', 'AB020908', 'AB020909')

The problem, with this method (yes, it is a bad method), is that these
are not data but code. If you just want to use it with R and ignore
all other software, that is fine, but otherwise, a more portable
format would be welcome. 

The "read.table" can read data frames, i.e., (rectangular) tables, 
whose columns may have different types (but the type of the data does
not change inside a column -- and all the columns have the same
length). With the preceeding example, the file could look like 

  AB020905 "T" "T" "A" "A" "A" "G" "T" "G" ...
  AB020906 "T" "T" "A" "A" "A" "G" "T" "G" ...
  AB020907 "T" "T" "A" "A" "A" "G" "T" "G" ...
  AB020908 "T" "T" "A" "A" "A" "G" "T" "G" ...
  AB020909 "T" "T" "A" "A" "T" "G" "T" "G" ...

(with very, very long lines).

Often, the "read.table" command works fine, but sometimes, problems
occur (usually because one has not read the manual of the "read.table"
function).

Let us consider first the simple case of a file containing only
numeric data, with no row or column name.
It could look like

  2 7 3 9 2 
  8 7 3 2 2
  6 2 8 8 1

We try:

  > d <- read.table('A.txt')
  > d
    V1 V2 V3 V4 V5
  1  2  7  3  9  2
  2  8  7  3  2  2
  3  6  2  8  8  1

R has give names to the columns. If we do not like them, we may change
them.

  > names(d)
  [1] "V1" "V2" "V3" "V4" "V5"
  > length(d)
  [1] 5
  > names(d) <- 1:length(d)
  > d
    1 2 3 4 5
  1 2 7 3 9 2
  2 8 7 3 2 2
  3 6 2 8 8 1
  > names(d) <- LETTERS[1:length(d)]
  > d
    A B C D E
  1 2 7 3 9 2
  2 8 7 3 2 2
  3 6 2 8 8 1

The file could be more complex and contain row names.

  x1  2 7 3 9 2
  x2  8 7 3 2 2
  x3  6 2 8 8 1

Here, it does not work as well, because the computer has no way of
knowing that the first column contains the name of the rows
and not a qualitative variable.

  > read.table('A.txt')
    V1 V2 V3 V4 V5 V6
  1 x1  2  7  3  9  2
  2 x2  8  7  3  2  2
  3 x3  6  2  8  8  1

We can ask it to remove the first column an use it as row names (this
is a good exercise: try to do it yourself).

  > d <- read.table('A.txt')
  > row.names(d) <- d[,1]
  > d <- d[,-1]
  > d
     V2 V3 V4 V5 V6
  x1  2  7  3  9  2
  x2  8  7  3  2  2
  x3  6  2  8  8  1
  > names(d) <- LETTERS[1:length(d)]
  > d
     A B C D E
  x1 2 7 3 9 2
  x2 8 7 3 2 2
  x3 6 2 8 8 1

Other situation: we have both column and row names. The file looks like:

     A B C D E
  x1 2 7 3 9 2
  x2 8 7 3 2 2
  x3 6 2 8 8 1

Now, R understands that the first row contains the variable names and
that the first column contains the observation names, because the
first line in the file has one fewer element that the others.

  > read.table('A')
     A B C D E
  x1 2 7 3 9 2
  x2 8 7 3 2 2
  x3 6 2 8 8 1

Last situation: The columns have names, but not the rows. The file
looks like

  A B C D E
  2 7 3 9 2
  8 7 3 2 2
  6 2 8 8 1

If we try, naively:

  > d <- read.table('A.txt')
  > d
    V1 V2 V3 V4 V5
  1  A  B  C  D  E
  2  2  7  3  9  2
  3  8  7  3  2  2
  4  6  2  8  8  1
  > str(d)
  `data.frame':   4 obs. of  5 variables:
   $ V1: Factor w/ 4 levels "2","6","8","A": 4 1 3 2
   $ V2: Factor w/ 3 levels "2","7","B": 3 2 2 1
   $ V3: Factor w/ 3 levels "3","8","C": 3 1 1 2
   $ V4: Factor w/ 4 levels "2","8","9","D": 4 3 1 2
   $ V5: Factor w/ 3 levels "1","2","E": 3 2 2 1

First, the computer had no way of guessing that the first line
contained the column names, Second, it thought that each column
containes character strings (before of the first element)... We can
avoid the problem by adding an argument to the "read.table" command.

  > read.table('A.txt', header=T)
    A B C D E
  1 2 7 3 9 2
  2 8 7 3 2 2
  3 6 2 8 8 1

Have weexhausted the problems one may encounter while using the
"read.table" command? Well, not quite. Let us look again at our first
example, the file containing our nucleis sequences.

  > read.table('A.txt')
          V1   V2   V3 V4 V5 V6 V7   V8 V9
  1 AB020905 TRUE TRUE  A  A  A  G TRUE  G
  2 AB020906 TRUE TRUE  A  A  A  G TRUE  G
  3 AB020907 TRUE TRUE  A  A  A  G TRUE  G
  4 AB020908 TRUE TRUE  A  A  A  G TRUE  G
  5 AB020909 TRUE TRUE  A  A  T  G TRUE  G

Each column contained characters (the four letters A, C, G, T), but
the computer misunderstood the "T" as a boolean value. To avoid the
problem, it suffices to state the column types (here, it is always the
same, we just give it once).

  > read.table('A', colClasses=c('character'))
          V1 V2 V3 V4 V5 V6 V7 V8 V9
  1 AB020905  T  T  A  A  A  G  T  G
  2 AB020906  T  T  A  A  A  G  T  G
  3 AB020907  T  T  A  A  A  G  T  G
  4 AB020908  T  T  A  A  A  G  T  G
  5 AB020909  T  T  A  A  T  G  T  G

TODO: What if the file is not on the local disc but given by its URL?

TODO: What if the file comes from a well-known spreadsheet program?

TODO: ?scan

To know more about all this, you might want to reaf the man page of
the "read.table" command and the "Data Import-Export manual".

  http://cran.r-project.org/doc/manuals/R-data.pdf

+ Large CSV files

For large files, it might be faster to explicitely give the type of
each column: otherwise, R would have to read the whole file to check
that the numeric columns are indeed numeric -- the begining of the
file could contain numbers and later rows strings...

  # All the columns contain strings
  read.table("foo.txt", colClasses = "character")

  # the fist column is numeric, the others contain strings
  read.table("foo.txt", 
             colClasses = c("numeric", rep("character", 10)))

If your file only contains number, or only strings, it is wiser to
store it in a matrix, not a data.frame. This is what the "scan"
function does. 

  # A numeric matrix
  x <- scan("foo.txt", sep=",")  # Gives a numeric vector
  n <- scan("foo.txt", sep=",", nlines=1)
  x <- matrix(x, nc=n)

  # A vector of strings
  x <- scan("foo.txt", what=character(0))

If you file is really large, you should consider storing
your data in a database (MySQL, PostgreSQL, or even simply SQLite, that
requires no configuration whatsoever), as explained in a few
pages.

+ Excel files

This is a big problem: only Microsoft knows what is inside those files
-- all we can do is try to guess from the outside, what they
contain. The easiest solution is to ask the person providing you with
the files to save them (with Excel) as "text files"
or as "CSV files" (Comma-Separated Values).

If this is impossible, you can try to convert the files yourself,
either with Excel (if you have it) or with any software that tries to
recognize this format, e.g., Open Office. 

  http://www.openoffice.org/

If this is also impossible, for instance if you want to automate this
process, you can turn to the Spreadsheet::ParseExcel Perl module.

  http://www-106.ibm.com/developerworks/linux/library/l-pexcel/

Actually, you do not have to know about Perl (it is an eclectic
language, designed to process text, used by computer hackers in the
early 1990s for its network capabilities and its tight
interaction with the operating system, in the mid-1990s for the first
web-based applications, later as a scripting language for games 

  http://www.frozen-bubble.org/

scientific computations 

  http://pdl.perl.org/index_en.html

more ambitious web-based applications thanks to its tight interaction
with the Apache web-server

  http://perl.apache.org/
  http://www.modperl.com/
  http://modperlbook.org/
  http://www.perl.com/pub/a/2002/02/26/whatismodperl.html
  http://www.perl.com/pub/a/2002/03/22/modperl.html
  ...

etc.): there there is an R function to call this Perl module to convert
an Excel file to a CSV file and read it into R. 

  library(gdata)
  ?read.xls

  (there is a bug in the current version: they use "dquote" instead of
  "shQuote", which has a disastrous effect if your string contains
  symbols such as $ or " -- it also crashes in an UTF8 locale). 

+ Diverting the output

The sink() function diverts the output to a file: all the messages
that would normally end up on the screen are instead written to the
file. To have them back on the screen, call sink() once again, with no
arguments.

The caputure.output() function does the same thing, but takes the code
whole result is to be retrieved as an argument. Instead of a file, it
can return a string -- for instance, you may want to escape some
characters that would otherwise be interpreted further down in the
pipeline, or you can add formatting information (colours, fonts,
etc.).

+ .Data

At the end of each session, R asks if you want to save the 
the environment to continue to work with the same data and functions
next time: he saves functions and variables in a file in the current
directory; if you work on several R projects at the same time, simply
use several directories. 

It might be a good idea to clean the variables of the current
directory from time to time, with the "ls" and "rm" commands. 

  ls()
  rm(x, y, z)

+ source

You can also store code in a file (especially if the code is rather
long: you will prefer typing it with a decent text editor, such as
Emacs) and call it back with the "source" command. 

  source("MyCode.R")

Sometimes, you also want to see the code being executed (inparticular
if some parts of it are time-consuming): 

  source("MyCode.R", echo=TRUE)

+ Out-of-memory computations

One of the differences between R and other statistical systems, such
as SAS, is that R stores all the data in memory: this prevents it from
dealing with datasets that do not fit in memory. This is less ans less
true. One way around that problem is to check wheter you need the
whole dataset or if you can split your computations into chunks that
each deal with a slice of it (for instance, just a couple of variables
at a time insteal of hundreds of them) and store and retrieve the data
in a database (see below). Depending on the algorithms you are using,
this may not be straightforward: your problem may require a new
implementation of the algorithm that does not take memory allocation
for granted and that sparinggly, explicitely uses the disk -- these
are called out-of-memory algorithms. 

Should you want it, there is already an out-of-memory linear
regression function, in the biglm package.

+ Databases

When dealing with large amounts of data, you do not really need
all the data at once in memory: quite often, your computations only
require one chunk of it at a time. It makes sens to store the data
in a database and only extract what you need.

R can talk to most databases (SQLite, MySQL, PostgreSQL, Oracle),
either through a generic API, such as ODBC, or through
database-specific interfaces).

And it works both ways: you can fetch data in a database from R, but
you can also use R as a language for stored procedures in some of them
(e.g., PostgreSQL).

  http://linuxfr.org/2003/02/20/11415.html
  http://archives.postgresql.org/pgsql-general/2003-02/msg00989.php
  http://www.joeconway.com/plr/

+ SQLite

Installing a DataBase Management System (DBMS) is often daunting: you
must start the server, create a new user and create a new database. If
you plan to use this database in a networked environment, from a
different machine, if you plan to access the same data from different
applications or machines at the same time, if you want to prevent
inconsistencies when two different people try to modify the same data
at the same time, a real DBMS is worth the trouble. 

But if you just want to play with the data, from a single application,
from a single machine, this is a overkill. 

Instead, you can use SQLite: you do not have to install anything, you
do not have to configure anything. It is just a library that stores
data in a file, and allows you to access and modify it with SQL
commands. It is just an elaborated binary file format. 

  http://www.sqlite.org/

It is becoming more and more popular: when you write an application
that has to store some data, in some structured way (for instance, the
configuration, the logs, etc.), when you want to be able to search
through those data -- SQLite is a light and efficient choice. 

  http://www.linuxjournal.com/print.php?sid=7803
  http://linuxgazette.net/109/chirico1.html
  http://conferences.oreillynet.com/cs/os2004/view/e_sess/5701

Back to R. As I said, there is nothing to install. More precisely, if
you tried to install all the packages from CRAN, it is already there. 

  TODO: a better example...

  library(RSQLite)

  # First, connect to to the database.
  con <- dbConnect(dbDriver("SQLite"), dbname="tmp.dbms")

  # As we currently have no data, we create a new table (poetically
  # named "foo") and put a data.frame in it. 
  r <- data.frame(...)
  dbWriteTable(con, "Foo", r)

  # We can retrieve the whole data frame 
  r2 <- dbReadTable(con, "foo")

  # We can also perform a few queries on the table
  x <- dbGetQuery(con, "SELECT ...")

  # When you are finished, you MUST close the connection.
  dbDisconnect(con)
  
Other tasks you might want to do:

  # List all the tables in this database
  dbListTables(con)

  # List the fields of a table
  dbListFields(con, "Foo")

  # Delete a table
  dbSendQuery("DROP TABLE Foo")

Remember to close your connections once you no longer need them. 
Otherwise: 

  > for (i in 1:200) {
  +   con <- dbConnect(dbDriver("SQLite"), dbname="tmp.dbms")
  + }
  Error in sqliteNewConnection(drv, ...) : RS-DBI driver: (1cannot
  allocate a new connection -- maximum of 16 connections already
  opened)

Let us check how fast it is: 

  TODO...

Actually, we can speed this up. Up to now, we have just used SQLite as
a binary file -- but it is actually a real data base, so we can define
the tables in the usual SQL way, in particular stating which columns
should be UNIQUE, we can use indices, we can use transactions, etc. 

  http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html

  TODO

Caveat: If you are still using version 2, the data are not typed:
everything is stored as strings. But version 3 is out. 

Caveat: All the data are stored in the same file (if your
file system has a 2Gb limit, beware)...

Caveat: I am very suspicious of the efficiency when the data becomes
very large -- but when I think "very large", I am probably not very
reasonnable.

+ SQLite -- TODO: merge this section with the previous...

When you handle a lot of data, you do not really need all the data all
the time: quote often, each step in your computations only requires a
slice of the data. In those situations in can be helpful to only keep
in memory the data you need: to rest being stored in a database for
later use.

You might object that doing so requires a DataBase Management System
(DBMS), which is very cumbersome to install and administer. This is
not the case: if your needs are reasonable (a few GB of data, a single
user and process accessing the data), SQLite might be a good solution:
it is not a client-server DBMS, but merely a library to access, with
SQL commands a file containing the data.

It is as if you were retrieving data from a CSV file with SQL
commands -- only faster.

  library(RSQLite)
  con <- dbConnect(dbDriver("SQLite"), "myData.dbms")
  x <- dbGetQuery(con, "SELECT * FROM Foo WHERE date > '2005-01-01';")
  ...
  dbDisconnect(con)

Actually, I use it as follows.

  # Parameters: the name of the database (this is actually
  # the name of the file containing the data) and the name
  # of the database driver (here, SQLite, but the same code
  # would work with other, more robust DBMS). 
  global_dbDriver <<- "SQLite"
  global_dbname   <<- "myData.dbms"

  # Connect to the database
  try(
    library(RSQLite)
  )
  if (exists("global_SQL_con")) {
    try( dbDisconnect(global_SQL_con) )
  }
  global_SQL_con <- dbConnect(
    dbDriver(global_dbDriver),
    dbname = global_dbname
  )

  # The function I use to retrieve the data

  # I use a single database connection, so I do not want to
  # give the connection argument each time.
  # Furthermore, when the result has a single column, I want
  # a vector, not a data.frame.
  sql <- function (s) {
    res <- dbGetQuery(global_SQL_con, s)
    if (!is.null(res)) {
      if (is.data.frame(res) & ncol(res) == 1) {
        res <- res[,1]
      }
    }
    drop(res)
  }

  # Function to quote strings
  as.sql.character <- function (x) {
    x <- as.character(x)
    x <- gsub("'", " ", x) # DANGER: you probably do not
                           # want to do that!
    x <- ifelse(is.na(x), "NULL", paste("'", x, "'", sep=""))
    x
  }

  # This should speed things up a bit
  sql("PRAGMA cache_size = 500000;")
  TODO: sync?

  cat(sprintf("Price table
      %d rows
      %d stocks
      %d dates from %s to %s
    ",
    sql("SELECT COUNT(*) FROM Price;"),
    sql("SELECT COUNT(DISTINCT sedol) FROM Price;"),
    sql("SELECT COUNT(DISTINCT date)  FROM Price;"),
    sql("SELECT MIN(date) FROM Price;"),
    sql("SELECT MAX(date) FROM Price;")
  ))

There are a few problems, though.

First, the SQL understood by SQLite is a bit limited; for
instance, you have LEFT OUTER JOINs but no FULL OUTER JOINs
-- SQLite does understand the syntax of the latter, but
replaces it by the former, which can lead to surprising
and incorrect results.

Second, the query optimizer is also very limited. You do
have indices, but all the joins are nested loop joins (TODO:
EXPLAIN). 

Third, it is a 1-user, 1-process DBMS. It may look fine at
first, but you might end up willing to have a process write
the data and another read it, or you might want to give read
access to the database to your colleagues -- this is not
possible.

Fourth, it is unreliable.
I sometimes end up with a database that contains duplicated
data, in spite of the UNIQUE constraints I added.
I routinely end up with a corrupted and/or locked database,
when I violently kill SQLite. 

Worse, when dealing with large amounts of data with SQLite,
R crashes: the cause of the problem is unclear (it does not
necessarily crash inside the SQLite functions: I first
suspected R, then some of the packages I was using, but the
problems only stopped when I removed SQLite -- the problem
seems to be due to large numbers of large INSERTs in a large
(5GB) database: if you only read from the database, if you
only write data once or twice per session, if your database
is small (say, under 2GB), you might be fine).

As a conclusion, SQLite is a very good replacement for CSV
files, especially if you have many CSV files, but if you
start using it, you will get more ambitious and will end up
needing a full-fledge DBMS, such as PostgreSQL, MySQL or
Firebird (there are also commercial alternatives, such as
Oracle or Microsoft SQL Server -- but even commercial
software proponents do not view MS SQL Server as a serious
product and suggest MySQL as a better alternative).

+ SQL

If you are new to SQL, here are a few sample queries. The
syntax is often that of SQLite, but not always: if in doubt,
check the documentation of your DBMS.

Extractting information from a table:

  -- Get all the contents of a table
  SELECT * FROM JPE3_ret;

  -- Only the first 10 rows of a table
  SELECT * FROM JPE3_ret LIMIT 10;

  -- Only a few columns
  SELECT date, jcode, return_
  FROM JPE3_ret
  LIMIT 10;

  -- Create a new column
  SELECT date, sed_cus, loc_price / loc_capt * usd_capt
  FROM GEMF_rsk
  LIMIT 10;

  -- Create a new column, with a new name
  SELECT date, sed_cus, loc_price / loc_capt * usd_capt AS usd_price
  FROM GEMF_rsk
  LIMIT 10;

Extracting information from a table:

  -- Select a given row (or set of rows)
  SELECT * FROM JPE3_ret
  WHERE name = "TOYOTA MOTOR" AND date = "2005-12";

  -- Order the results
  SELECT * FROM JPE3_ret
  WHERE name = "TOYOTA MOTOR" AND date >= "2000-01"
  ORDER BY date;

  -- Idem, descending order
  SELECT * FROM JPE3_ret
  WHERE name = "TOYOTA MOTOR"
  ORDER BY date DESC;

  -- Rows for which a given column is NULL
  SELECT * FROM GEMF_rsk WHERE sed_cus ISNULL;

  -- Remove duplicates
  SELECT DISTINCT isocurr FROM GEMF_rsk;

  -- The values of a columns are in a given set
  SELECT DISTINCT barrid, sed_cus, isocurr, name
  FROM GEMF_rsk
  WHERE isocurr IN ("JPY", "THB","HKD", "SGD", "KRW")
  ORDER BY isocurr, name;

Extracting information from several tables
("joining" several tables):

  -- Inner join
  SELECT * FROM JPE3_ret, JPE3_rsk
  WHERE JPE3_ret.date = JPE3_rsk.date
    AND JPE3_ret.barrid = JPE3_rsk.barrid
  LIMIT 10;

  -- Inner join (query equivalent to the previous one)
  SELECT * FROM JPE3_ret JOIN JPE3_rsk USING (date, barrid)
  LIMIT 10;

  -- Inner join (query equivalent to the previous ones)
  SELECT * FROM JPE3_ret JOIN JPE3_rsk 
             ON JPE3_ret.date = JPE3_rsk.date
            AND JPE3_ret.barrid = JPE3_rsk.barrid
  LIMIT 10;

  -- Outer join (if there are rows in the first table with no
  -- corresponding row in the second, they are discarded in an inner
  -- join; with an outer join, they are preserved and paired with
  -- empty ("NULL") rows)
  SELECT * FROM JPE3_ret OUTER JOIN JPE3_rsk USING (date, barrid)
  LIMIT 10;

Aggregate operations:

  -- Count the number of rows in a table
  SELECT COUNT(*) FROM JPE3_ret;

  -- Partition the rows of a table and count the number of rows in
  -- each group
  SELECT date, COUNT(*) FROM JPE3_ret GROUP BY date;

  -- Minimum
  SELECT MIN(return_) FROM JPE3_ret;

  -- Minumum, maximum, mean, median, etc.
  SELECT date,
         COUNT(*)       AS number,
         MIN(return_)   AS minimum,
         AVG(return_)   AS mean,
         MAX(return_)   AS maximum
  FROM JPE3_ret
  GROUP BY date
  ORDER BY date;

  -- Add a condition to be evaluated after the groups are formed
  SELECT COUNT(*) AS number, isocurr
  FROM GEMF_rsk
  WHERE date = "2006-01"
  GROUP BY isocurr
  HAVING number < 100
  ORDER BY number;

  -- Embedded queries
  -- (You can probably reformulate this one with DISTINCT and
  -- GROUP BY, but combining those two usually leads to hard-to-find
  -- bugs.)
  SELECT date, COUNT(*) AS number_of_currencies
  FROM (SELECT date, isocurr, COUNT(*) AS number
        FROM GEMF_rsk
        GROUP BY date, isocurr
        )
  GROUP BY date
  ORDER BY date;

For performance reasons, it is pivotal to index the columns (or
groups of columns) you will use in your queries -- otherwise, the
DBMS would have to scan the whole table for the rows you want.

  CREATE INDEX idx_foo_id_date ON Foo (id, date);
  CREATE INDEX idx_foo_id_date ON Foo (id);
  CREATE INDEX idx_foo_id_date ON Foo (date);

For performance reasons (and data integrity), you should group related
changes to the database into a single transaction (e.g., updating
several tables at the same time; inserting interconnected data at the
same time and make sure the database is never in an incoherent state;
or inserting a lot of data at once).

  BEGIN TRANSACTION;
  INSERT INTO Foo (id, date, value) VALUES (1, "2006-01-02", 1.4);
  INSERT INTO Foo (id, date, value) VALUES (2, "2006-01-05", 1.1);
  INSERT INTO Foo (id, date, value) VALUES (3, "2006-01-09", 0.7);
  ...
  END TRANSACTION;

Creating your own tables

  -- Remember two data types: NUMERIC and VARCHAR(255)
  DROP TABLE Foo;
  CREATE TABLE Foo (
    date  VARCHAR(255),
    id    NUMERIC,
    value NUMERIC
  );

  -- When values are supposed to be non missing (typically,
  -- identifiers), explicitely state them.
  DROP TABLE Foo;
  CREATE TABLE Foo (
    date  VARCHAR(255)  NOT NULL,
    id    NUMERIC       NOT NULL,
    value NUMERIC
  );

  -- When values or tuples are supposed to be unique,
  -- explicitely state that constraint
  DROP TABLE Foo;
  CREATE TABLE Foo (
    date  VARCHAR(255)  NOT NULL,
    id    NUMERIC       NOT NULL,
    value NUMERIC,
    UNIQUE (date, id)
  );

  -- Some DBMS allow you to specify a polocy to follow when someone
  -- tries to breach this constraint.
  DROP TABLE Foo;
  CREATE TABLE Foo (
    date  VARCHAR(255)  NOT NULL,
    id    NUMERIC       NOT NULL,
    value NUMERIC,
    UNIQUE (date, id) ON CONFLICT REPLACE
  );

  -- When a column of a table references a key of another table,
  -- explicitely state it. This is called a "foreign key constraint".
  -- SQLite does not enfore foreign key constraints
  DROP TABLE Foo;
  DROP TABLE Bar;
  CREATE TABLE Bar (
    id    NUMERIC NOT NULL,
    name  VARCHAR(255),
    UNIQUE (id) ON CONFLICT REPLACE
  );
  CREATE TABLE Foo (
    date  VARCHAR(255)  NOT NULL,
    id    NUMERIC       NOT NULL    REFERENCES Bar(id),
    value NUMERIC,
    UNIQUE (date, id) ON CONFLICT REPLACE
  );

  -- It is possible to add or remove columns to an already-created
  -- table.
  ALTER TABLE Foo ADD COLUMN other_value NUMERIC;

Populating your tables:

  -- Inserting a row
  INSERT INTO Foo (id, date, value)
  VALUES (17, "2006-04-05", 3.14);

  -- Changing one (or several) row(s)
  UPDATE Foo
  WHERE id=17 AND date="2006-04-05"
  SET value = 6.28;

  -- Deleting one (or several) rows
  DELETE FROM Foo
  WHERE id=17 AND date="2006-04-05";

NULL:

  -- Missing values are coded as NULL in SQL
  INSERT INTO Foo (id, date, value) VALUES (19, "2006-12-25", NULL);

  -- Beware: in some contexts, NULL has other meanings...
  -- This is due to the fact that most DBMS (PostgreSQL is a notable
  -- exception) do not allow for user-defined types.

Unsorted code samples:

  -- Set operations
  SELECT DISTINCT sedol FROM Price WHERE date="2004-01-19"
  EXCEPT
  SELECT DISTINCT sedol FROM Price WHERE date="2004-01-20";

  -- Merging two tables and putting the result in one of them
  UPDATE Alpha_Europe
    SET forward_returns_1W_7 = (
      SELECT forward_returns_1W_7 
      FROM Foo
      WHERE Foo.sedol = Alpha_Europe.sedol
      AND Foo.date = Alpha_Europe.date7d
    ) WHERE EXISTS (
      -- I do not like that syntax: we have to repeat
      -- the same query twice...
      SELECT forward_returns_1W_7 
      FROM Foo
      WHERE Foo.sedol = Alpha_Europe.sedol
      AND Foo.date = Alpha_Europe.date7d
    );

  -- Merging two tables
  CREATE TABLE Result AS
  SELECT * FROM A LEFT OUTER JOIN B USING (id, date);

  -- Merging two tables
  INSERT INTO Result
  SELECT * FROM A LEFT OUTER JOIN B USING (id, date);

  -- Negations and double negations
  -- select the elements that are not in category 1, 2 or 3 
  -- (each element can be in several categories)
  SELECT DISTINCT id FROM A WHERE id NOT IN (
    SELECT id FROM A WHERE category IN (1, 2, 3)
  ); -- not tested
  -- Other solution
  SELECT DISTINCT id FROM A
  EXCEPT
  SELECT id FROM A WHERE category IN (1, 2, 3); -- not tested

TODO: 

  --
  -- More specialized SQL notions
  --   Embedded SELECTs, EXISTS(SELECT...)
  --   LIKE, REGEXP
  --   String operations
  --   Functions
  --   Triggers
  --   Views
  --   Temporary tables
  --   Optimization (EXPLAIN)
  --
  -- A few words on database design and normalization could be useful.
  --
  -- Other SQL details
  --   ANALYZE (computes statistics on tables and indices, used to
  --     find the "best" way of running a query)
  --   VACUUM (defragmentation)
  --   COPY (to read data from a file)
  --   DEFAULT
  --   AUTOINCREMENT
  --   PRIMARY KEY (UNIQUE and NOT NULL)
  --   CHECK (for more complicated constraints)
  --   COLLATE (and other locale problems)
  --   UNION, UNION ALL, INTERSECT, EXCEPT
  --

For more information, check the manual of the DBMS you chose.

  http://www.sqlite.org/lang.html
  http://dev.mysql.com/doc/
  http://www.postgresql.org/docs/
  http://otn.oracle.com/pls/db10g/portal.portal_demo3?selected=1

+ ETL (Extraction, Transformation, Loading)

There is one more pivotal detail: getting the data in the database in
the first place. This is called ETL (Extraction, Transformation,
Loading) and most DBMS provide at least a crude form if it. 

With PostgreSQL, this would be the COPY command or psql's \copy (psql
is the command-line client to PostgreSQL and its non-SQL command all
start with a backslash).

I once needed something along those lines, but a litte more
complicated: I wanted the table to be created if it did not exist, I
wanted the column types to be automatically inferred (either VARCHAR
or NUMERIC), I wanted new columns to be added if they were missing, I
wanted column types to be converted if they were wrong. Here is what I
was using -- use at your own risk, but bug reports are welcome.

  #! perl -w
  
  ##
  ## (c) 2006 Vincent Zoonekynd <zoonek@gmail.com>
  ##
  ## Load a CSV file into a database, creating or extening the
  ## table schema if needed.
  ## You should be aware that the result will not be very
  ## clean, in particular, the database will usually not be in
  ## third normal form and referencial integrity will not be
  ## enforced.
  ##
  ## It should not replace a database with a schema upon which
  ## you would have pondered for a long time, but it allows
  ## the schema to be altered automatically when new columns
  ## are added -- and the data will be more accessible than in
  ## a bunch of CSV files.
  ##
  ## (Incomplete) bug list
  ## - Missing trailing commas are not handled
  ## - problem while inferring the type of bps.csv
  ##
  
  ############################################################
  ##
  ## Modules
  ##
  ############################################################
  
  use strict;
  use warnings;
  use Getopt::Simple qw/$switch/;
  use Text::CSV_XS;
  use Data::Dumper;
  use IO::File;       # Needed to use Text::CSV_XS
  use POSIX;          # For strtod(), to infer the type of the columns
  
  use constant TRUE  => 0==0;
  use constant FALSE => 0==1;
  
  ############################################################
  ##
  ## A few functions
  ##
  ############################################################
  
  ##
  ## In case two columns have the same name (this can be due
  ## to them genuinely having the same name, or having the
  ## same name up to capitalization, or the same name up to
  ## non-alphanumeric characters), we change the name of the
  ## second by adding "_X1", "_X2", etc. to it.
  ## This function creates those new names.
  ##
  
  sub alter_duplicate_column_names {
    my %a = ();
    my @result = ();
    foreach (@_) {
      if (exists $a{$_}) {
        my $i=0;
        $i++ while exists $a{ $_ . "_X" . $i };
        $_ .= "_X" . $i;
      }
      $a{$_} = 1;
      push @result, $_;
    }
    @result;
  }
  
  ##
  ## The values to be inserted into the database have to be
  ## slightly modified:
  ##  - They are quoted
  ##  - Missing values (as described by the --NA command line
  ##    option) are replaced by NULL
  ##  - To avoid other problems, dangerous characters (double
  ##    quote (") and backslash (\)) are replaced by a space.
  ##
  ## There are two versions of this function: one that
  ## produces unquoted results, useful to infer the type of
  ## the columns, and a quoted one, for the actual generation
  ## of SQL code.
  ##
  
  sub process_values_unquoted {
    my @a = map { $a = $_;   # Do not modify $_: it would
                             # change the elements of
                             # @extra_values...
                  $a =~ s/\'/_/g;
                  $a = "" if $a =~ m/$$switch{"NA"}/o;
                  $a;
                } @_;
    return @a;
  }
  
  sub process_values_quoted {
    my @a = map { $a = $_;   # Do not modify $_: it would
                             # change the elements of
                             # @extra_values...
                  $a =~ s/\'/_/g;
                  if ($a =~ m/$$switch{"NA"}/o) {
                    $a = "NULL";
                  } else {
                    $a = "\'$a\'";
                  }
                  $a;
                } @_;
    return @a;
  }
  
  ##
  ## Some command line options expect comma-seperated lists of
  ## column names or numbers: this function transforms them
  ## into lists of column names.
  ##
  
  sub get_column_names ($@) {
    my ($col, @column_names) = @_;
    my @col = split(",", $col);
    # Convert the column numbers to column names
    for (my $i=0; $i<=$#col; $i++) {
      if ( (POSIX::strtod($col[$i])) [1] == 0 ){
        $col[$i] = $column_names[ $col[$i] - 1 ];
      }
    }
    return @col;
  }
  
  ############################################################
  ##
  ## Parameters: how does the CSV file(s) look like?
  ##
  ############################################################
  
  my $option = Getopt::Simple -> new();
  $option -> getOptions({
    quote_char => { type    => "=s",
                    default => q/"/,  # Usually NOT ', because it
                                      # appears in some French names...
                    verbose => "quote character" },
    sep_char   => { type    => "=s",
                    default => q/,/,  # Could also be | or \t
                    verbose => "field separator" },
    header     => { type    => "=i",
                    default => 1,
                    verbose => "number of the line containing the headers" },
    data       => { type    => "=i",
                    default => 2,
                    verbose => "number of the first line after the headers" },
    table_name => { type    => "=s",
                    default => "Foo",
                    verbose => "Name of the SQL table to create and populate" },
    NA         => { type    => "=s",
                    default => '^\s*(|\.|NA|NULL|Null|Default|-999(.0*)?|[#@]?N/?A\!?)\s*$',
                    verbose => 'Regular expression to match missing values, e.g., ^NA$' },
    "add-column" => { type  => "=s@",
                    default => [],
                    verbose => "Columns missing in the CSV file, usually because they are constant and can be inferred from the file name; e.g., date=2006-03-27"
                  },
    "index"    => { type    => "=s@",
                    default => [],
                    verbose => "Columns on which to create an INDEX, e.g. '1,2,3' or 'id,date'"
                  },
    "unique"   => { type    => "=s@",
                    default => [],
                    verbose => "UNIQUE constraints to impose"
                  },
    "not-null" => { type    => "=s@",
                    default => [],
                    verbose => "NOT NULL constraints to impose"
                  },
    "no-column-type-check" => { type => "",
                                default => "",
                                verbose => "Should we try to guess the type of all the columns or set them all to VARCHAR(255)?"
                              },
    "wide"     => { type    => "=i",
                    default => -1,
                    verbose => "If the file contains wide data, number of the column where these data start; e.g., if the columns are factor1,factor2,2000,2001,2002,etc., this would be 3"
                  },
    "wide-name" => { type   => "=s",
                    default => "Wide_column_name",
                    verbose => "If the file contains wide data, name of the (SQL) column that will identify those data, e.g., if the file header is factor1,factor2,2000,2001,2002,etc., this could be 'date'"
                   },
    "wide-value" => { type  => "=s",
                    default => "Wide_value",
                    verbose => "If the file contains wide data, name of the (SQL) column that will contain those data, e.g., if the file header is factor1,factor2,2000,2001,2002,etc., this could be 'date'"
                   }
  }, "usage: $0 [options] file.csv");
  
  my @extra_headers = map { $a = $_; $a =~ s/=.*//; $a; }
                          @{ $$switch{"add-column"} };
  my @extra_values  = map { $a = $_; $a =~ s/^.*?=//; $a; }
                          @{ $$switch{"add-column"} };
  
  if ($$switch{"data"} <= $$switch{"header"}) {
    $$switch{"data"} = $$switch{"header"} + 1;
  }
  
  #print STDERR "Options:\n";
  #print STDERR Dumper($switch);
  
  if (@extra_headers) {
    print STDERR "Extra headers: " . join(", ", @extra_headers) . "\n";
    print STDERR "       values: " . join(", ", @extra_values) . "\n";
  }
  
  if ($$switch{"sep_char"} eq "TAB") {
    $$switch{"sep_char"} = "\t";
  }
  my $csv = new Text::CSV_XS({ 
    quote_char => $$switch{"quote_char"},
    sep_char   => $$switch{"sep_char"},
    binary     => TRUE
  });
  
  my $file = shift @ARGV or die "usage: $0 file.csv";
  my $fh = new IO::File;
  
  ############################################################
  ##
  ## Trying to infer the type of the columns
  ##
  ############################################################
  
  my @types;
  {
    print STDERR "Reading file $file to get the number of columns and their types\n";
    open($fh, "<", $file) || die "Cannot open $file for reading: $!";
    my $line = 0;
    while(1) {
      my $fields = $csv->getline($fh);
      last unless $fields;
      last unless @$fields;
      next if $#$fields == 0 and $$fields[0] =~ m/^\s*$/;   # Skip blank lines
      $line++;
      if ($line == $$switch{"header"}) {
        if ($$switch{"no-column-type-check"}) {
          @types = map { FALSE } (@extra_headers, @$fields);
          last;
        } else {
          @types = map { TRUE } (@extra_headers, @$fields);
        }
      } elsif ($line >= $$switch{"data"}) {
        #print STDERR $line . " " . join(", ", @$fields) . "\n";
        my @values = process_values_unquoted(@extra_values, @$fields);
        #print STDERR $line . " " . join(", ", @values) . "\n";
        #print STDERR $line . " " . join(", ", map { (POSIX::strtod($_))[1] > 0 ? "VARCHAR(255)" : "NUMERIC" } @values) . "\n";
        @values = map { (POSIX::strtod($_))[1] == 0 } @values;
        for (my $i=0; $i<=$#values; $i++) {
          $types[$i] &&= $values[$i];
        }
        #print STDERR $line . " " . join(", ", map { $_ ? "NUMERIC" : "VARCHAR(255)" } @types) . "\n";
      }
    }
    close($fh);
    if ($$switch{"wide"} > 0) {
      for (my $i = $#extra_headers + 1 + $$switch{"wide"}; $i <= $#types; $i++) {
        $types[ $#extra_headers + 1 + $$switch{"wide"} ] = 
          $types[ $#extra_headers + 1 + $$switch{"wide"} ] && $types[$i];
      }
      $types[ $#extra_headers + 1 + $$switch{"wide"} - 1 ] = FALSE;
      @types = @types[0..($#extra_headers + 1 + $$switch{"wide"})];
    }
    @types = map { $_ ? "NUMERIC" : "VARCHAR(255)" } @types;
    print STDERR "Column types: ";
    print STDERR join(", ", @types) . "\n";
  }
  
  ############################################################
  
  print STDERR "Reading file $file\n";
  open($fh, "<", $file) || die "Cannot open $file for reading: $!";
  my @column_names;
  my @wide_values;
  my $line = 0;
  while(1) {
    my $fields = $csv->getline($fh);
    last unless $fields;
    last unless @$fields;
    next if $#$fields == 0 and $$fields[0] =~ m/^\s*$/;   # Skip blank lines
    $line++;
    if ($line == $$switch{"header"}) {
      @column_names = @$fields;
      if ($$switch{"wide"} > 0) {
        @wide_values = @column_names[ ($$switch{"wide"}-1) .. ($#column_names) ];
        map { s/^\s+//; s/\s+$//; } @wide_values;
        @column_names = @column_names[ 0 .. ($$switch{"wide"}-1) ];
        $column_names[ $$switch{"wide"} - 1 ] = $$switch{"wide-name"};
        $column_names[ $$switch{"wide"}     ] = $$switch{"wide-value"};
      }
      @column_names = (@extra_headers, @column_names);
      @column_names = map { y/A-Z/a-z/;       # Only lower case
                            s/\s+$//;         # No trailing spaces
                            s/^\s+//;         # No leading spaces
                            s/[^a-z0-9]/_/g;  # Only alphanumeric characters
                            s/^([0-9])/x$1/;  # First character is a letter
                            s/^$/nameless_column/; # At least one character
                            $_; } @column_names;
      @column_names = alter_duplicate_column_names(@column_names);
      @column_names = map { "\"$_\"" } @column_names;
      my %not_null = ();
      foreach my $i (@{$$switch{"not-null"}}) {
        foreach my $j (get_column_names($i, @column_names)) {
          $not_null{$j} = 1;
        }
      }
      print "-- Table schema\n";
      print "CREATE TABLE " . $$switch{"table_name"} . " (\n";
      for (my $i=0; $i <= $#column_names; $i++) {
        print "  " . $column_names[$i] . " " . $types[$i];
        print " NOT NULL" if exists $not_null{ $column_names[$i] };
        print "," if $i < $#column_names or @{$$switch{"unique"}};
        print "\n";
      }
      foreach (my $j=0; $j <= $#{ $$switch{"unique"} }; $j++) {
        my $col = ${ $$switch{"unique"} }[$j];
        my @col = get_column_names($col, @column_names);
        print "  UNIQUE (" . join(", ", @col) . ")";
        #     ." ON CONFLICT REPLACE";
        print "," unless $j == $#{ $$switch{"unique"} };
        print "\n";
      }
      print ");\n";
      print "-- In case the table already exists, we make sure it has enough columns...\n";
      for (my $i=0; $i<=$#column_names; $i++) {
        print "ALTER TABLE " . $$switch{"table_name"} .
              " ADD COLUMN " . $column_names[$i] . " " .
              $types[$i] . ";\n";
      }
      if (@{ $$switch{"index"} }) {
        print "-- Indices\n";
        foreach my $col (@{ $$switch{"index"} }) {
          my @col = get_column_names($col, @column_names);
          print "CREATE INDEX " .
                "idx_" . $$switch{"table_name"} . "_" .
                join("_", @col) .
                " ON " .
                $$switch{"table_name"} .
                " (".
                join(", ", @col) .
                ");\n";
        }
      }
      print "-- The data from $file\n";
      print "BEGIN TRANSACTION;\n";
      print "PRAGMA cache_size = 500000;\n";
    } elsif ($line >= $$switch{"data"}) {
      map { s/^\s+//; s/\s+$//; } @$fields;
      if ($$switch{"wide"} > 0) {
        for (my $i=0; $i <= $#wide_values; $i++) {
          print "INSERT INTO " . $$switch{"table_name"} . " (" .
                join(", ", @column_names) .
                ")\n";
          print "  VALUES (" .
                join(",", process_values_quoted(
                            @extra_values,
  		          @$fields[ 0 .. ($$switch{"wide"}-2) ],
                            $wide_values[$i],
                            $$fields[$$switch{"wide"} - 1 + $i]) ) .
                ");\n";
        }
      } else {
        print "INSERT INTO " . $$switch{"table_name"} . " (" .
              join(", ", @column_names) .
              ")\n";
        print "  VALUES (" .
              join(",", process_values_quoted(@extra_values, @$fields)) .
              ");\n";
      }
    }
  }
  print "COMMIT TRANSACTION;\n";
  close($fh);

+ Uses of databases: OLTP, OLAP, Data Warehouse (DW)

Database specialists often distinguish between two uses of databases.

In OLTP (On-Line Transaction Processes) applications, very small
amounts of data (usually one record at a time) are read and written to
the database: this is the case for most web applications (e-commerce,
forums, etc.).

At the other end of the spectrum, OLAP (On-Line Analytical Processing)
applications or Decision Support Systems (DSS) or Business
Intelligence (BI) applications, only use the database as a data
repository, as a data warehouse (DW) or as an operational datastore (ODS)
(to be read from, not written to),
extract large amounts of data (several gigabytes) and try to summarize
them, in as interactive a way as possible.

The main example is sales data: you know the value of each
transaction, what item it was, which customer it was, which sales
clerk it was. You can summarize all the transactions in a large
3-dimensional cube: one dimension for the customers, one for the
items, one for the sales clerk. But that cube is too large to be
presented to the end user. To get a more amenable data cube, the
elements in each dimension can be grouped: customers grouped by city,
state, age, value of past purchases, number of past purchases, gender,
etc.; items by category, price, etc.; sales clerks by gender,
religion, shop, city, state, proximity of public transportations, etc.
Typically, OLAP applications are interactive: they first present the
user with the coarsest grouping (all the customers, all the items, all
the sales clerks -- the corresponding cube has a single element) and
then allow to "drill-down", i.e., to chose finer and finer groupings.
But three dimensions may be too much: you may prefer to select a
2-dimensional slice of the data (e.g., "only a given sales clerk") or
project the data cube onto one dimension (i.e., consider "all the
sales clerks").

Building a Data Warehouse (DW) is often a prerequisite to
OLAP operations: it refers to the nightmare of combining
several databases into one -- the problem being that the
databases may contain incoherencies, may use different
identifiers, may use different naming schemes, may lack
some of the data needed.

If you need a free OLAP tool, have a look at Mondrian.

  http://mondrian.pentaho.org/

+ Types of database management systems

Most Database Management Systems (DBMS) are relational DBMS: they
store relations, usually represented as SQL tables.

Some databases are not relational but simply store associations or
hash tables, i.e., key-value pairs: they are mere presistence engines,
that provide a random access to the data (in this context, the word
"random" loses its statistical meaning: it means that to access a data
item, we do not have to scan the whole database, we just go directly
where it is supposed to be). 
They are often used to store directories and are optimized for fast
read access (e.g., the information associated to an employee (name,
social security number, phone, address, login, password, email
address, etc.; the information associated to computers in a network
(name, IP address, etc. -- those directories are called DNS)).
BerkelyDB is such a DBMS. 

Some databases are targeted at large applications and implement a
client-server architecture (the DBMS is a program (the server), the
applications using the database (the clients) are other programs, that
run at the same time, on the same machine or on a different machine,
and they all talk to each other). Most databases fall into this
category: MySQL, PostgreSQL, Interbase, Oracle, DB2, Microsoft SQL
Server, etc.

At the other end of the spectrum, some databases are
designed for small ("embedded") applications: they require
very few ressources, but lack some features, such as
concurrency. You can find some of them, for instance, in
mobile phones. SQLite is designed for embedded
applications. Some client-server DBMS have a light,
embeddable version -- most notably MySQL -- I think it is
their main source of revenue.

Some databases advertise their ability to store complex
datatypes (e.g., time series) as easily as basic, "atomic"
types and to provide enhanced performance when accessing
them. Those products are called "multi-value databases" or
"post-relational databases" or "non-1NF DBMS" and are
usually commercial: Vision, Cache (formerly known as Mumps
or M), KSQL.
Unkess you really know what you are doing, you should stay
away from those: from my experience in the domain, it is
very hard to find expertise with these products, the schema
of the database is rarely documented, there are no
established best practices (such as "put your data in third
normalized form" with relational databases -- here, the
followed rule is to ignore those best practives), the
promised performance is rarely there (unless you manage to
find an expert on this product), the syntaxe is arcane and
does not allow end users to actually use the product.

+ Hot topics in databases

The algorithms used in conventional databases do not always scale
well; furthermore, the principles underlying conventional databases
often fail for Very Large DataBases (VLDB). For instance, one cannot
be sure that large databases contain "the truth" -- large databases do
contain mistakes -- the algorithms used must be "robust", in some
sense, to those mistakes. Furthermore, computing the exact result to a
query can be very time-consuming while an approximate result would be
as useful: approximate joining algorithms are starting to
emerge. Another desirable feature of VLDB systems is to provide "the
best result so far", and to update that result as time passes -- when
the user is bored, she can stop the computations.
  
  http://www.vldb.org/
  http://www.acm.org/sigs/sigkdd/explorations/issue.php?issue=current

  TODO: URLs 
  approximate matching

Real-time databases (RTDB), flow-programming langages, streaming data
are closely-related subjects.

  TODO: URL?

Databases often contain personal information (e.g., medical
information about patients in a hospital, bank details, etc.) and
mining those databases, let alone combining them, poses
confidentiality problems. To this end, privacy-enhanced data-mining
techniques are starting to emerge. 

  http://www.wired.com/news/wireservice/0,71184-0.html
  Mentionned in Cryptogram 2006-07:
  http://www.schneier.com/crypto-gram-0607.html

TODO: 

  Temporal databases

+ TODO

  Other types of databases:
    relational (SQL: MySQL, PostgreSQL, Interbase, Oracle, DB2, MSSQL, Derby, HSQL)
    persistence (berkeleyDB)
    embedded (SQLite, MySQL, berkeleyDB)
    ORM
    post-relational = non-1NF = multi-value 
  Vocabulary: 
    DDL (Data Definition Language)
    DML (Data Manipulation Language)
    MDX: a Microsoft language for OLAP
    ACID
  bitemporality

TODO: 

  OLTP; OLAP, Data cube, OLAP system, drill-down
  Other types: 
    Temporal data and the relational model (C. Date et al.)
    Time series databases (Vision, Cache (Mumps))

  VLDB, Approximate querying
  JOIN, approximate join and statistical matching

+ MySQL

Last time I used MySQL from R, I proceeded as follows (after
installing what was needed):

  library(RMySQL)
  con <- dbConnect(dbDriver("MySQL"), dbname = "MySQL_Test_1")
  dbListTables(con)
  d <- dbGetQuery(con, "SELECT * FROM Foo")
  d
  dbDisconnect(con)

+ ODBC

TODO: Some explanations

  # Not tested
  library(RODBC)
  ?ODBC
  DSN <- "foobar"
  channel <- odbcConnect(DSN, "zoonek", "azerty", believeNRows=FALSE)
  sqlQuery(channel, "SELECT foo, bar, baz FROM FooBar WHERE foo > bar")
  close(channel)

+ PostgreSQL

First, install PostgreSQL and configure it. If you are using
Gentoo/Linux, just type (the first command actually asks you
to type the second).

  emerge postgresql
  emerge --config =postgresql-8.0.4

If not, you can install it by hand and type 

  # Choose a directory to put the data
  DB=$HOME/Data
  initdb -D $DB

  # Launch the server
  postmaster -D $DB >logfile 2>&1 &

  # Create an empty "database" (you can think of a
  # "database" as a "namespace": it will be a set of tables,
  # isolated from the rest of the data, so as to avoid name
  # clashes).
  createdb test

  # Use it! (This is the command-line interface, you might
  # prefer a more graphical application.)
  psql test
  
But we want to use it from R.

  Install Rdbi and RdbiPgSQL, from Bioconductor.
  source("http://www.bioconductor.org/biocLite.R")
  biocLite("RdbiPgSQL")
  biocLite()
  biocLite(c("graph", "Rgraphviz"))
  I had to alter the .FirstLib function in RdbiPgSQL
  and remove the autoloading of the chron.

  library(Rdbi)
  # We connect via a UNIX socket (in the default PostgreSQL
  # installation, there are no INET sockets) and there is 
  # no password.
  pcon <- dbConnect(PgSQL(), dbname="zoonek", user="zoonek")
  res <- dbGetQuery(pcon, "SELECT * FROM TickData LIMIT 10")
  dbDisconnect(pcon)

* Packages

+ More Packages

Look on CRAN (Complete R Archive Network) 

  http://cran.r-project.org/

You can install them as:

  R CMD INSTALL vcd_0.1-3.tar.gz

It you realize they do not work (it could happen a couple of years
ago, but it should now be exceptionnal), you can remove them with:

  R CMD REMOVE vcd

+ Writing your own packages

You have written some nifty functions and would like to share them
with your colleagues, with the world, you would like to see them on
CRAN. For this, you have to put all your functions in a "package".

+ Vocabulary: package, library, bundle

There is often a confusion between "package" and "library". A
"library" is a directory, usually containing one or several
packages. A "package" is a set of functions, data sets and manual
pages, contained in a directory ("library") or a *.tar.gz file (for
Windows users, it can also be a *.zip file, but you must find the one
corresponding to your version of R). A "bundle" is a set of packages
contained in the same "*.tar.gz" file. To increase the confusion, the
function to load a package is called "library"...

+ Example: package.skeleton()

Let us assume you have written three functions, "foo", "bar" and
"baz", one data.frame, "fbz", and that you have put their definitions
in a file "foobar.R". You can create a package as follows.

  source("foobar.R")
  package.skeleton("foobar", c("foo", "bar", "baz", "fbz"))

This creates the files

  foobar
  foobar/man
  foobar/man/README
  foobar/man/foo.Rd
  foobar/man/bar.Rd
  foobar/man/baz.Rd
  foobar/src
  foobar/src/README
  foobar/R
  foobar/R/foo.R
  foobar/R/bar.R
  foobar/R/baz.R
  foobar/data
  foobar/data/fbz.rda
  foobar/DESCRIPTION
  foobar/README

We can leave the *.R files untouched -- they contain the code of our
functions. We need to modify the DESCRIPTION file. It currently
contains

  Package: foobarType: Package
  Title: What the package does (short line)
  Version: 1.0
  Date: 2005-05-04
  Author: Who wrote it
  Maintainer: Who to complain to <yourfault@somewhere.net>
  Description: More about what it does (maybe more than one line)
  License: What license is it under?

We change this to 

  Package: foobar
  Type: Package
  Title: Almost empty package
  Version: 1.0
  Date: 2005-05-04
  Author: Vincent Zoonekynd <zoonek@example.com>
  Maintainer: Vincent Zoonekynd <zoonek@example.com>
  Description: Example package, containing silly functions, 
    such as addition or multiplication.
  License: GPL

We also have to read and alter the documentation
files *.Rd. They look like this:

  \name{foo}
  \alias{foo}
  %- Also NEED an '\alias' for EACH other topic documented here.
  \title{ ~~function to do ... ~~ }
  \description{
    ~~ A concise (1-5 lines) description of what the function does. ~~
  }
  \usage{
  foo(x)
  }
  %- maybe also 'usage' for other objects documented here.
  \arguments{
    \item{x}{ ~~Describe \code{x} here~~ }
  }
  \details{
    ~~ If necessary, more details than the __description__  above ~~
  }
  \value{
    ~Describe the value returned
    If it is a LIST, use
    \item{comp1 }{Description of 'comp1'}
    \item{comp2 }{Description of 'comp2'}
    ...
  }
  \references{ ~put references to the literature/web site here ~ }
  \author{ ~~who you are~~ }
  \note{ ~~further notes~~ }
  
   ~Make other sections like Warning with \section{Warning }{....} ~
  
  \seealso{ ~~objects to See Also as \code{\link{~~fun~~}}, ~~~ }
  \examples{
  ##---- Should be DIRECTLY executable !! ----
  ##-- ==>  Define data, use random,
  ##--	or do  help(data=index)  for the standard data sets.
  
  ## The function is currently defined as
  function (x) { x + 1 }
  }
  \keyword{ ~kwd1 }% at least one, from doc/KEYWORDS
  \keyword{ ~kwd2 }% __ONLY ONE__ keyword per line

It is not really a LaTeX file. Let us read this file line by line. First
is the name of the manual page -- usually the name of the function
currently described. 

  \name{foo}

Then, all the functions that will be described in this manual page. 
If you have several functions that perform a similar task, it is wise
to "refactor" them into a single function -- the user will only have
to remember a single function name, you will have a single manual page
to write, a single set of tests. If it is not possible, you can still
document the functions together, in the same manual page. Here, let us
document the three functions in the same page.

  \alias{foo}
  \alias{bar}
  \alias{baz}

Then, a short (less than one line) description of the function. 

  \title{Arithmetic operations}

Then, a longer (but still short) description.

  \description{
    Very simple functions that perform elementary arithmetic
    operations such as adding one, multiplying by two or squaring.
  }

Then the "usage" of the function, i.e., how we should call it, with all
the arguments, with their default values, if any. 

  \usage{
  foo(x)
  }
  \usage{
  bar(x)
  }
  \usage{
  baz(x)
  }

Then, a description of the arguments, one by one (there can be many
arguments and the description can, of course, be much longer than
that). 

  \arguments{
    \item{x}{A number}
  }

Then, a more detailed description of the functions, of the algorithms
involved, of the common mistakes, etc.

  \details{
    The \code{foo} function returns its argument incremented by 1.

    The \code{bar} function returns its argument multiplied by 2.

    The \code{baz} function returns its argument squared. 
  }

Then, the description of the value of the functions, especially if the
value is a list. 

  \value{
    Result of the operation.
  }

A reference to articles, books, web sites that present the algorithms or
ideas behind the package. 

  \references{The four operations for dummies}

Your name and email address. 

  \author{Vincent Zoonekynd <zoonek@math.jussieu.fr>}

Links to other manual pages of interest: functions the user may want
to use in conjunction with your code -- or instead of your code.
  
  \seealso{\code{\link{sum}}, \code{\link{prod}}}

The most important part: the examples. 
The code must run without any problem, and must not take too long: it
will be run to check that everything went fine when downloading and
installing the package -- it will also serve as regression tests: do
use "stopifnot". 

  \examples{
  foo(1) + bar(2) + baz(3)
  (1+1) + (2*2) + (3*3)

  stopifnot( foo(5) == 6  }
  stopifnot( bar(5) == 10 }
  stopifnot( baz(5) == 25 }
  }

You should also include keywords (one at a time).

  \keyword{arith}% at least one, from doc/KEYWORDS

  \keyword{increment}
  \keyword{double}
  \keyword{square}

That is all for this first manual page. 
As we have documented the three functions, we can remove the (empty)
manual pages of "bar" and "baz".

  rm foobar/man/bar.Rd
  rm foobar/man/baz.Rd

I let you document the data set, man/fbz.Rd.

We can now build the package (we are still in the shell): 

  R CMD build foobar

If there are no errors (in particular in the DESCRIPTION file), this
yields

  * checking for file 'foobar/DESCRIPTION' ... OK
  * preparing 'foobar':
  * checking DESCRIPTION meta-information ... OK
  * cleaning src
  * removing junk files
  * checking for LF line-endings in source files
  * checking for empty directories
  * building 'foobar_1.0.tar.gz'

We can check it (this runs the examples of the manual pages),

  R CMD check foobar

This yields

  * checking for working latex ... OK
  * using log directory '/tmp/foobar.Rcheck'
  * using R version 2.1.0, 2005-04-18
  * checking for file 'foobar/DESCRIPTION' ... OK
  * checking extension type ... Package
  * this is package 'foobar' version '1.0'
  * checking if this is a source package ... OK
  
  * Installing *source* package 'foobar' ...
  ** libs
  WARNING: no source files found
  chmod: cannot access `/tmp/foobar.Rcheck/foobar/libs/*': No such file or directory
  ** R
  ** data
  ** help
   >>> Building/Updating help pages for package 'foobar'
       Formats: text html latex example
    fbz                               text    html    latex   example
    foo                               text    html    latex   example
       missing link(s):  ~~fun~~
  ** building package indices ...
  * DONE (foobar)
  
  * checking package directory ... OK
  * checking for portable file names ... OK
  * checking for sufficient/correct file permissions ... OK
  * checking DESCRIPTION meta-information ... OK
  * checking package dependencies ... OK
  * checking index information ... OK
  * checking package subdirectories ... WARNING
  Subdirectory 'src' contains no source files.
  * checking R files for syntax errors ... OK
  * checking R files for library.dynam ... OK
  * checking S3 generic/method consistency ... OK
  * checking replacement functions ... OK
  * checking foreign function calls ... OK
  * checking Rd files ... WARNING
  Rd files with likely Rd problems:
  Unaccounted top-level text in file "/tmp/foobar/man/foo.Rd":
  Following section "note":
  "\n\n ~Make other sections like Warning with \\section{Warning }{....} ~\n\n"
  
  Rd files with duplicate "usage":
    /tmp/foobar/man/foo.Rd
  These entries must be unique in an Rd file.
  
  Rd files with non-standard keywords:
    /tmp/foobar/man/foo.Rd: ~kwd1 ~kwd2
  Each "\keyword" entry should specify one of the standard keywords (as
  listed in file "KEYWORDS.db" in the "doc" subdirectory of the R home
  directory).
  
  See chapter 'Writing R documentation files' in manual 'Writing R
  Extensions'.
  * checking for missing documentation entries ... WARNING
  Undocumented code objects:
    baz
  All user-level objects in a package should have documentation entries.
  See chapter 'Writing R documentation files' in manual 'Writing R
  Extensions'.
  * checking for code/documentation mismatches ... OK
  * checking Rd \usage sections ... WARNING
  Objects in \usage without \alias in documentation object "foo":
    baz
  
  Functions with \usage entries need to have the appropriate \alias entries,
  and all their arguments documented.
  See chapter 'Writing R documentation files' in manual 'Writing R
  Extensions'.
  * checking for CRLF line endings in C/C++/Fortran sources/headers ... OK
  * creating foobar-Ex.R ... OK
  * checking examples ... OK
  * creating foobar-manual.tex ... OK
  * checking foobar-manual.tex ... ERROR
  LaTeX errors when creating DVI version.
  This typically indicates Rd problems.

Then, you can start the bug squashing: typically, you have forgotten
to document a function, to delete a file, to delete an unwanted line
in a manual page (the LaTeX errors above are due to this) or you have
deleted too much, etc..

Then, when there are no bugs left, you can install the package and
distribute the *.tar.gz file to the world.

  R CMD build foobar
  R CMD check foobar
  R CMD install foobar

Windows people are usually very inhappy with source packages (they
lack the appropriate tools to install everything -- some people even
say it is easier and faster to install Linux that to install all the
missing software) so you can provide them with a "binary" package --
this will not work if your package contains C code or if they have a
different version of R -- and I have not tested it either -- I do not
have Windows. 

  cd /usr/local/R/library/
  zip -r /tmp/foobar foobar/

TODO: 

  Understand the problems with the keywords.
  Read the documentation for all the bells and whistles I am not aware of...

+ example, demo

  TODO

+ Namespaces

  TODO
  Namespaces (there can be a "NAMESPACE" file in the root directory of
  the package)

+ R, Sweave and Lyx

Lyx is a user-friendly WYSIWYM (What You See Is What You
Mean) interface to LaTeX.

  http://www.troubleshooters.com/lpm/200210/200210.htm

It can be made Sweave-aware. 

  http://www.mail-archive.com/r-help%40stat.math.ethz.ch/msg46946.html
  http://www.ci.tuwien.ac.at/~leisch/Sweave/LyX

+ Vignettes

  TODO

* Other languages

+ C

For efficientcy reasons, all the code in a package will not
be written in R: the parts of the program that require the
most time or the most memory will be written in a faster
language (often C, but I think there are still some people
using Fortran -- in this 21st century...).

The procedure is detailes in "Writing R
extensions". I simply reproduce their example:

  In a "foobar.c" file:

    void convolve(double *a, int *na, double *b, int *nb, double *ab)
    {
      int i, j, nab = *na + *nb - 1;
      for(i = 0; i < nab; i++)
        ab[i] = 0.0;
      for(i = 0; i < *na; i++)
        for(j = 0; j < *nb; j++)
          ab[i + j] += a[i] * b[j];
    }

  Build a shared library (if it is for a package, it is automatic):

    R CMD SHLIBS foobar.c

  Load the shared library: for a package, you would use 

    .First.lib <- function(lib, pkg) {
      library.dynam("foobar",pkg,lib)
      cat("...")
    }

  but for an isolated use:

    dyn.load("foobar")

  You can then use it as:

    conv <- function(a, b)
     .C("convolve",
        as.double(a),
        as.integer(length(a)),
        as.double(b),
        as.integer(length(b)),
        ab = double(length(a) + length(b) - 1))$ab

Let us also mention the ".Call" function that allows you to
use more complex data types (the code would use R.h and
Rinternals.h or Rdefines.h; the C function expects SEXP
objects as arguments and returns a SEXP object) and the
".External" function (with a simgle argument, that contains
the list of arguments -- useful if the function has a
variable number of arguments).

I do not give any more details: do read "Writing R
extensions", do check how libraries are implemented.

+ C

I almost forgot: you may also call R code from C.

+ Perl

Here, it also works both ways: you can call Perl from R (for
instance, to use the network or regular expressions) or call
R from Perl.

TODO: give an example where one calls R from Perl.

  R::initR("--silent", "--vanilla");
  my @x = 1..100;
  R::callWithNames("plot", { x => \@x, ylab => 'foo bar' });
  R::eval("plot(1:10)");

But it might be easier, instead of calling Perl from R, to
write a small perl script you would call with the "system"
command.

  > library(gdata)

  > read.xls
  function (xls, sheet = 1, verbose = FALSE, ...)
  {
    package.dir <- .path.package("gregmisc")
    perl.dir <- file.path(package.dir, "perl")
    xls <- shQuote(xls)
    xls2csv <- file.path(perl.dir, "xls2csv.pl")
    csv <- paste(tempfile(), "csv", sep = ".")
    cmd <- paste("perl", xls2csv, xls, dQuote(csv), sheet, sep = " ")
    if (verbose)
        cat("Executing ", cmd, "... \n")
    results <- system(cmd, intern = !verbose)
    if (verbose)
        cat("done.\n")
    out <- read.csv(csv, ...)
    file.remove(csv)
    return(out)
  }

  
+ PostgreSQL

When you play with very large data (R becomes less efficient
when the data sizes grows unwieldy), you can store them in a
database: then, you just have to fetch the data you need at
a given moment (typically, a small part of the whole data).

  TODO: give a more complete example...

+ Java

  TODO
  See also iPlots:
  http://rosuda.org/iPlots/

* (Graphical) User Interface

+ Tcl/Tk

I first heard about Tcl (pronounce "tickle") in the early
1990s: it was used to build GUI on Unix
systems: at the time, it was quite a feat, and as Tcl was an
interpreted language, as easy to learn as a shell, all it
required was a couple of lines of code. Later, Tk, the
library used to build those GUI, has been incorporated into
other scripting languages, such as Perl or Python -- and
now, R.

  TODO: URLs
  http://www.math.jussieu.fr/~zoonek/UNIX/10_ptk/1.html

Tcl is still used now:

  http://www.macdevcenter.com/pub/a/mac/2005/08/12/tcl.html
  http://www.macdevcenter.com/pub/a/mac/2005/01/28/tcl.html
  http://www.macdevcenter.com/pub/a/mac/2004/11/09/weblog.html
  http://www.macdevcenter.com/pub/a/mac/2004/08/27/blitting.html
  http://www.vhayu.com/faq.html

+ Why do we need a GUI?

As R is a real programming language, it is already very
powerful as is, but there are two situations where you will
need a GUI.

The first one is when you want some interactive graphics,
while exploring the data. When you want to see what happens
when you change a parameter, you can write a loop and
display a plot for each iteration, but it is a bit clumsy:
you would probably prefer

The second situation is when you want other people to use
the software, without knowing R (yet): either scientists,
statistics users, or students, learning statistics. They
will prefer a menu-driven application.

+ Documentation

One may easily build graphical interfaces under R with
Tcl/Tk.

  library(tcltk)
  library(help=tcltk)

The widgets are not documented -- but they are the standard
Tk widgets, that have been used from Tcl, Perl, Python for
ages.

  http://www.math.jussieu.fr/~zoonek/UNIX/10_ptk/1.html

+ Idiosyncrasies of the link between R and Tcl

  TODO: The problem with variables...

+ Examples

A small calculator. 

  tkdestroy(wtop)
  wtop <- tktoplevel()
  w.titre <- tklabel(wtop, text="Additions")
  w.un <- tkentry(wtop)
  w.deux <- tkentry(wtop)
  w.resultat <- tklabel(wtop, text=0)
  tkpack(w.titre, w.un, w.deux, w.resultat)
  on.key.press <- function () {
    # How complicated it os!
    a <- tclvalue(tkget(w.un))
    a <- eval(parse(text=a))
    if(!is.numeric(a)) a <- 0
    b <- eval(parse(text=b))
    b <- tclvalue(tkget(w.deux))
    if(!is.numeric(b)) b <- 0
    tkconfigure(w.resultat, text=a+b)
  }
  tkbind(wtop, "<KeyPress>", on.key.press)

= tcltk_1.png

Here is an example from the manual.

  tkdestroy(tt)
  tt <- tktoplevel()
  tkpack(txt.w <- tktext(tt))
  tkinsert(txt.w, "0.0", "plot(1:10)")
  eval.txt <- function()
     eval(parse(text=tclvalue(tkget(txt.w, "0.0", "end"))))
  tkpack(but.w <- tkbutton(tt,text="Submit", command=eval.txt))

The interested reader will especially look into the
"tkrplot" package, to include an R graphic inside a
widget. The example from the manual shows a curve, depending
on a parameter that you can fix with a slider.

= tkrplot.png

Here is a more general function whose arguments are: a
function that draws a picture depending on a real parameter;
the limits of this parameter.

  library(tkrplot)
  animate <- function (plot.function, limits) {
    bb <- mean(limits)
    tt <- tktoplevel()
    img <-tkrplot(tt, function () { plot.function(bb) } )
    f <- function (...) {
      b <- as.numeric(tclvalue("bb"))
      if (b != bb) {
        bb <<- b
        tkrreplot(img)
      }
    }
    s <- tkscale(tt, command=f, from=limits[1], to=limits[2], 
                 variable="bb", showvalue=TRUE, 
                 resolution=diff(range(limits))/100, orient="horiz")
    tkpack(img,s)
  }

  animate( function (a) { hist(abs(rnorm(200))^a) }, c(.1,2) )

Example: find the transformation to apply to a variable so
that it looks normal.

  n <- 200
  k <- runif(1, 0,2)
  x <- (5+rnorm(n))^k
  animate( function (a) { x <- x^(1/a); qqnorm(x); qqline(x,col='red') },
           c(.01,2) )

Example: watch the effects of bin width in a histogram.

  n <- 200
  x <- rnorm(n)
  animate( function (a) { 
             a <- ceiling(a)
             print(a)
             hist(x, breaks=a, col='light blue', probability=T);
             lines(density(x), col='red', lwd=3)
           },
           c(2,102) )

Example: the central limit theorem, presented with an
interactive animation.

  N <- 1000
  n <- 102
  m <- .5
  s <- 1/sqrt(12)
  x <- matrix(runif(n*N), nc=n)
  animate( function (a) {
      x <- (apply(x[,1:a],1,sum) - a*m)/(sqrt(a)*s)
      hist(x, col='light blue', probability=T, 
           main=paste("n =",a), ylim=c(0,.4),
           xlim=c(-4,4))
      lines(density(x), col='red', lwd=3)
      curve(dnorm(x), col='blue', lwd=3, lty=3, add=T)
      if( N>100 ) {
        rug(sample(x,100))
      } else {
        rug(x)
      }
    },
    c(2,102)
  )

  # Idem, with a bimodal distribution
  N <- 1000
  n <- 101
  m <- 0
  s <- sqrt(10)
  x <- rnorm(n*N, sample(c(-3,3),n*N,replace=T))
  x <- matrix(x, nc=n)
  animate( function (a) {
      x <- (apply(x[,1:a],1,sum) - a*m)/(sqrt(a)*s)
      hist(x, col='light blue', probability=T, 
           main=paste("n =",a), ylim=c(0,.4),
           xlim=c(-4,4))
      lines(density(x), col='red', lwd=3)
      curve(dnorm(x), col='blue', lwd=3, lty=3, add=T)
      if( N>100 ) {
        rug(sample(x,100))
      } else {
        rug(x)
      }
    },
    c(1,101)
  )

  # Idem, with an asymetric distribution
  N <- 1000
  n <- 102
  m <- 1
  s <- 1
  x <- rexp(n*N)
  x <- matrix(x, nc=n)
  animate( function (a) {
      x <- (apply(x[,1:a],1,sum) - a*m)/(sqrt(a)*s)
      hist(x, col='light blue', probability=T, 
           main=paste("n =",a), ylim=c(0,.4),
           xlim=c(-4,4))
      lines(density(x), col='red', lwd=3)
      curve(dnorm(x), col='blue', lwd=3, lty=3, add=T)
      if( N>100 ) {
        rug(sample(x,100))
      } else {
        rug(x)
      }
    },
    c(1,101)
  )
  
Exercice: draw the density function of a distribution
depending on a parameter.

TODO: other examples (to be written...)

  Write a generic function to modify a graphic depending on
  one or several parameters.

  Application:
  - density of a distribution depending on several parameters.
  - qqnorm and variable transformations
  - histogram: interactively change "bw" and "offset"

  Do the same with several graphics that simultaneously
  change.  Example: density + repartition function + qqnorm
  + boxplot for a probability distribution depending on
  several parameters.  or a sample to which you apply a
  transformation depending on one parameter.

+ RCommander: a complete R GUI in Tk

TODO: Screenshots

+ Other examples

  library(fBasics)
  symstbSlider()
  TODO: screenshot...

+ Rgl: interactive 3D graphics

TODO: screenshots, examples

+ Other widget libraries: RGtk2

* Web interface: Rpad

+ Installation

First, install it (check the latest version number): 

  wget http://www.rpad.org/downloads/Rpad_0.9.2.tar.gz
  R CMD INSTALL Rpad_0.9.2.tar.gz

+ Use

If the pages are already written, if a server is already
running, it works as follows: you fill in the forms, you
click on the "Calculate" button

=Rpad1.png

and you get the result, that contains both numeric results
and plots.

=Rpad2.png

+ Overview

So we have to see several things: how do we create the
pages? How do we run the server? There are two ways of
creating the pages: either with the "wysiwyg" editor (some
Javascript can turn a web browser into an HTML editor) or
with a normal text editor that will not hide the HTML. There
are also two ways of running the server: either a tiny HTTP
server (in Tcl) run from within R, or a full-fledged HTTP
server (typically, Apache with php and mod_perl).

For the moment, let us simply run the tiny HTTP server 

  library(Rpad)
  Rpad()

= Rpad_01.png

(this should launch your default web browser -- it works
with firefox but not with Konqueror) and have a look at the
contents of the pages.  They are normal XHTML pages, with
some R code in the middle.

Let us quickly list some of the idiosyncrasies of this HTML.

The calculate button, that runs the computations:

  <span contentEditable="false">
    <input onClick="javascript:top.Rpad_calculate()"
           value="Calculate"
           type="button"
    />
  </span>

You can also run the computations when the user clicks on
link:

  href=javascript:R_run_commands('source("foo.bar")', 'Rpad_calculate()')

Some computations (yes, you have to add "<br/>" to indicate
the end of the lines and to replace the "<" by "&gt;":

  <div class="Rpad_input"
       rpad_type="R">
    source("myMacros.R") <br/>
    x &gt;- foo.bar() <br/>
  </div>

Sometimes, you want the computations to be performed when
the page is loaded:

  <div class="Rpad_input" rpad_type="R" Rpad_run="init">
    ...
  </div>

A plot:

  plot(...)
  ...
  HTMLon()
  showgraph()

Another plot: 

  graphoptions(width=4, height=4)
  newgraph()
  plot(...)
  ...
  HTMLon()
  showgraph()

Displaying a data frame in HTML: 

  HTMLon()
  d <- data.frame(...)
  HTML(d)

Forms to fill in: 

  <span contentEditable="false">
    <input class="Rpad_input"
           name="foo"
           rpad_type="Rvariable"
           value="2*pi"
    />
  </span>

The "Rvariable" type corresponds to numbers or any R
expression.  If you want a string (and do not want the user
to enter the quotes), use "Rstring" instead.

For larger data (say, copy-pasted from another application),
you can use a <textarea>.

  <textarea class="Rpad_input"
            rpad_type="file"
            name="data.csv"
            rows=10 cols="60">
  </textarea>
  ...
  <div class="Rpad_input" rpad_type="R">
    x = read.csv("data.csv")
    ...
  </div>

For choices from a dropdown menu: 

  HTMLon()
  HTMLselect("foo", myList)
  ...
  cat("You have chosen ", foo, ".\n", sep="")
  plot(get(foo))
  ...

You can ask the browser to colour the R code: 

  <style>
    .Rpad_input {
      behavior: url(js/R_highlight.htc)
      -moz-bindings: url(js/moz-behaviors.xml#R-highlight.htc)
    }
  </style>

Sometimes, you want to hide parts of the code (uninteresting
code, such as loading your functions, your data, or creating
the HTML buttons or menus):

  (not tested)
  <div contentEditable="false">
    <span class="Rpad_input" rpad_type="R" 
         style="display:none">
      ...
    </span>
  </div>

+ Conlusions

I have really tried hard to use Rpad, but the results were
unsatisfactory. There are speed and stability problems, and
we never know where they come from: is it a browser-related
problem (if the server seems dead and if you are using
Internet Explorer, just restart IE -- the problem does not
come from the server), is it a network problem, a server
problem, an R problem, an HTML problem (the documentation
examples contain incorrect HTML (<div> inside of <span>) and
suggests to use browsers known not to respect the standards
(IE)), a javascript problem? There are simply too many
software components -- and they interact in too many ways.

If you just want a "speciallized calculator", with which you
would perform a single simple task, Rpad may be a good
choice. If you want anything more complex, in particular if
you want several pages linked together, it is not a good
solution.

TODO: 

  The Mozille extension
  How to use a "true" web server
  Give an example.

* Web programming: RZope

  TODO

* Web services

A web service is like a web page with a form on it, for the
user to fill in, except that it is not designed to be used
by a human user, but by another machine. As a result, the
various fields to fill in and the results are not hidden in
the HTML but presented in a more directly accessible way --
if you want to extract the information from the result, you
do not have to trim down the HTML response.

Here are a few examples. You can access Google as a web
service: you send the search terms and you get an XML file
that contains the first 10 results with, for each of them,
the URL, the tile, the date and an extract containing the
serach terms.

  TODO: an example of the actual XML received

Amazon.com also provides a web service: you send an ISBN
(this is the reference number, on the back of every book)
and you get all the book details (author, title, publisher,
date, price, availability, etc.) in an XML file.

  TODO: other examples
  Amazon
  
Some data centers also provide API: for instrance, in
Finance, you can imagine a web service to which you send a
list of company names or identifiers, a list of dates, a
list of items of interest about the company (say, price,
volume, book value, earnings, sales, cash flow, etc. --
well, numbers hopefully describing the company) and that
gives you the corresponding items, for the corresponding
companies and the corresponding dates. It would be as simple
as retrieving a URL

  http://.../getData.pl?ids=IBM,RHAT&dates=2005-06-01,2005-07-01&items=price,volume

With more spaces: 

  http://.../getData.pl ? ids   = IBM, RHAT
                        & dates = 2005-06-01, 2005-07-01
                        & items = price, volume

The result could be, say, a CSV file such as 

  TODO: give an example.

The technologies used range from the utterly obvious (a URL
that ends in foo.cgi?search=bar) to the very complex (SOAP:
the request is an XML file, send via HTTP POST, the answer,
in XML, is contained in an XML "enveloppe").

  TODO

  http://www.google.com/apis/

* Clusters, parallel programming

Actually, I have never used a cluster, so the following
lines might not be the most reliable source of information.

+ Vocabulary 

SIMD (Simple Instruction Multiple Data): this is an
architecture in which we have several processors, that
process different data, according to the same program. This
is the ideal architecture if you want to parallelize the
operations that, in R, already appear parallelized, such as
vector addition or multiplication.

MIMD (Multiple Instruction Multiple Data): this is another
kind of architecture, in which different data are processed
in different ways. You can think of it as several machines
(or several processors in the same machine) that perform
different tasks, independantly.

A problem is said to be "embarassingly parallel" if it is
easy to solve, provided we have a lot of machines: we can
cut it into many small pieces, write a program to solve a
single piece of the problem, send those pieces to our
different machines, and finally gather and combine the
results. In particular, if the problem is data-intensive, in
order to solve a part of the problem, you only need a part
of the data.

There is often a distinction between parallel programming
(everything is done on the same processor, that can do
several things at the same time, in a SIMD or MIMD fashion
-- but the different tasks are perfectly timed) and
distributed programming. With distributed programming, the
processors are often on different machines, communication
between those machines takes time, the machines do not run
at the same speed -- so that we do not know if all the
computations will end at the same time, nor even which one
will end first --, there can be network problems (one or
several machines can become inaccessible), the machines can
break (so that if we do not receive the results from a
machine, we have to send the data again, to another machine)
and even the main machine (the "master"), that controls the
others, may break (so that the other machines have to
choose, themselves, someone to replace it).

A cluster is a set of machines used for parallel
programming. Typically, the operating system provides the
distributed programming layer, so that the programmer (or
user) sees the cluster as a parallel programming (or
computing) environment.

A node is a machine (or a processor) in a cluster.

The use of clusters is sometimes refered to as High
Performance Computing (HPC) or, nore recently, Grid
Computing.

+ BLAS (Basic Linear Algebra System)

A lot of parallel (or non-parallel, actually) computations
require vector or matrix computations: processor vendors
(Intel, IBM, Sun, etc.) can provide libraries that take
advantage of the peculiarities of their processors to speed
up those operations. For the most trivial operations, such
as adding or multiplying two vectors, the libraries ask the
processor to do several operations at the same time (with
very old processors, it was not possible, but recent
processors can do that, to a certain extent); for more
complicated operations, such as multiplying two matrices,
the libraries can, on top of that, use non-trivial
algorithms that have a lower complexity.

If you do not have a BLAS library (usually, you have to pay
extra for it), you can use ATLAS (Automatically Tuned Linear
Algebra Software).

  http://math-atlas.sourceforge.net/

When you search the internet about BLAS, you will run into
LinPack (an old linear algebra library, whose function names
are as readable as the 2-letter basic Unix commands -- but
with five letters), LaPack (a more recent replacement for
LinPack -- BLAS is a part of LaPack) and the LinPack
benchmark (a series of computations used to gauge the speed
of a computer for numeric computations). ScaLaPack is a
parallel implementation of LaPack.

  http://en.wikipedia.org/wiki/LAPACK

+ PVM (Parallel Virtual Machine), MPI (Message Passing Interface)

These are protocols for parallel programming. You will use
them if you want to program yourself the parallelization of
your application, if this paralelization is not trivial. It
is used, internally, by the user-level alternatives such as
OpenMosix.

LAM/MPI and MPICH are different implementations of MPI. 

+ Beowulf

This is a first kind of cluster, that uses MPI or PVM. The
programs running on a Beowulf cluster have to be
specifically written (or modified) for this architecture:
the programmer has to know about PVM and/or MPI.

+ SSI (Single System Image)

This is another type of cluster that turns a bunch of
machines into a single SMP machine (i.e., a multi-processor
machine).

You can use normal programs on an SSI cluster: each program
will run on a single machine (but you do not know which one:
the cluster will choose an idle machine and, if the load
increases, it may even move the process to another
machine). You can take advantage of such an architecture by
either forking your program (if you write it yourself) or by
running several programs at the same time -- in Perl, I
would use the Parallel::ForkManager module.

  #!perl -w
  use strict;
  my $MAX_PROCESSES = shift || 10;  # Number of processes to run simultaneously
  use Parallel::ForkManager;
  my $pm = new Parallel::ForkManager($MAX_PROCESSES);
  while(<>){  # The processes (shell commands) are read from stdin
    my $pid = $pm->start and next;
    system($_);
    $pm->finish; # Terminates the child process
  }

+ OpenMosix

This is a free (GPL) SSI implementation. 

+ pR

An R package for parallel computations. It can be used as follows.

  library(pR)
  StartPE(2)   # Either 2 processors on the same machine 
               # or 2 nodes in an MPI cluster
  PE( a <- some.function(some.argument) )
  PE( b <- some.function(some.other.argument) )
  PE( y <- f(a, b) )

+ RScaLaPack

An R package to use ScaLaPack, a parallel implementation of
LaPack -- if you have really large matrices: the
computations must take longer than the time required to send
and receive the data.

+ Snow

An R package that provides a parallelized version of the
"lapply" function -- but, if I understand correctly, there
must be at least as many machines as elements in the list.

+ nws

A new package, that provides seamless parallelism.

  TODO

+ Rmpi, Rpvm

R packages if you want to use MPI or PVM yourself, i.e., if you want
to control very finely how your algorithm is parallelized. 

+ RSprng (Scalable Parallel Random Number Generator)

R package that provides a parallelized random number
generator -- otherwise, random numbers generated on
different machines may fail to look independant (indeed,
they are generated by the same algorithm...).

+ Aspect

  http://www.aspect-sdm.org/

* Miscellaneous

+ Functions: numerical integration and derivation

TODO

  ?D
  Explain where those derivatives could be used (optimization
  algorithms)

  ?integrate
  library(help=adpat)
  library(help=odesolve)

+ Formulas

Some functions (lm, prcomp, plot, xyplot, lme) accept a formula as
argument. You might want to do the same with your own functions. 

  TODO

  # To turn a formula into a data.frame
  model.frame(y ~ x1 + x2)

  # If you have several variables on the left hand-side of the ~
  # operator, you get a data.frame whose first component is not a
  # vector but a matrix -- yes, this is possible. 
  model.frame( cbind(y1, y2) ~ x1 + x2 )

  # For formulas that contain the | operator
  TODO

Exercise: Take the code of a function using the
"model.frame" function and understand it.

  getAnywhere("plot.formula")

+ Sparse matrices

When computing with large matrices, you tend to keep the
whole matrices in memory. But this is not always needed: in
particular, if the matrix contains a lot of zeros, it can be
a waste of memory -- and a waste of time when you multiply
with it. A sparse matrix is such a matrix, with a lot of
zeros -- and we can ask the computer to efficiently deal
with it.

TODO

  library(help=Matrix)
  library(help=SparseM)

+ IEEE arithmetics

Sometimes, to check if a new computation software works well, people
give it a few simple computations, such as 

  > 1 + 1
  [1] 2

or 

  > .3 - .2 - .1
  [1] -2.775558e-17

  > 0.3 / 0.1
  [1] 3
  > 0.3 / 0.1 == 3
  [1] FALSE
  > 0.3 / 0.1 - 3
  [1] -4.440892e-16

Er...

This might not be the result you were expecting, but it
should not be surprising. Most computers and numerical
software use the so-called "floating point numbers". Broadly
speaking, it means that numbers like 121 are not stored as
"121", but rather as "1.21 * 10 ^2". The first number,
"1.21", is called the mantissa (or significand), the second,
"2", the exponent.

Well, you easily guess, from the result above, that this is
not the whole story: computers do not like decimal
arithmetics -- they prefer binary arithmetics. As a result,
the mantissa is written as a binary number, and we do not
use a power of 10 but a power of 2.

  1.21 = 1 * 2^0 + 
         0 * 2^-1 +
         0 * 2^-2 +
         1 * 2^-3 +
         1 * 2^-4 +
         0 * 2^-5 +
         1 * 2^-6 +
         0 * 2^-7 +
         1 * 2^-8 +
         1 * 2^-9 +
         1 * 2^-10 +
         0 * 2^-11 +
         0 * 2^-12 +
         0 * 2^-13 +
         ...

In other words, 

  1.21 (decimal) = 1.00110101110000101... (binary)

You can compute this as follows: 

  x <- 1.21
  for (i in 1:20) { 
    cat(floor(x))
    x <- x - floor(x)
    x <- x * 2 
  };  cat("\n")

The problem is that this decimal number is not a round
number in binary arithmetics: it has to be rounded, it
cannot be represented exactly in base 2.

This was the case with the numbers at the begining of this
note (I add spaces to underline the periodicity of the
binary expansion):

  .1 (decimal) = 0.0 0011 0011 0011 0011... (binary)
  .2 (decimal) = 0.0011 0011 0011 0011 0011... (binary)
  .3 (decimal) = 0.01 0011 0011 0011 0011... (binary)

(You may notice that the binary expansion of .2 is the same
as that of .1, shifted by one digit -- it simply means that
.2 is the double of .1.)

As none of those numbers are whole binary numbers, they have
to be rounded. When you compute, some of those rounding
errors cancel, but others accumulate.

To know more about IEEE 754 arithmetics: 

  http://babbage.cs.qc.edu/courses/cs341/IEEE-754references.html
  http://grouper.ieee.org/groups/754/
  http://stevehollasch.com/cgindex/coding/ieeefloat.html

If you want those unequal numbers to be equal (one usually
needs this to write tests for one's code), use the
"all.equal" function.

  > all.equal(0.3 / 0.1, 3)
  [1] TRUE

  > all.equal(0.3 - 0.2 - 0.1, 0)
  [1] TRUE

In actual regression tests, it should be

  > stopifnot( all.equal(0.3 / 0.1, 3) )
  > stopifnot( all.equal(0.3 - 0.2 - 0.1, 0) )

* Numerical optimization

Many numerical statistical problems can be phrased as "find
the values of the parameters that minimize some quantity --
some error term". We now present a few algorithms to
numerically solve such problems.

+ Newton-Raphson

  TODO

+ Nelder-Mead

  TODO

+ Linear programming: Simplex

Some of those optimization problems can be stated as
"minimize some linear function of x1, x2, ..., xn, subject
to a few linear constraints". For historical reasons, such
an optimization problem is called a "program" -- this dates
back to the pre-computer era. As the constraints are linear,
they define a "simplex", i.e., a higher-dimensional analogue
of a (convex) polygon or polyhedron. It can be shown that,
as the function to be minimized is linear, the solution is
on a vertex of this simplex.

The simplex algorithm starts at one vertex and hops to a
nearby vertex where the function to be minimized is lower --
when we have nowhere to hop, we are done.

  TODO: give more details about the algorithm

TODO: in R?

  library(help=lpSolve)
  library(help=linprog)
  library(help=glpk)

+ MIP (Mixed Integer Programming) and Branch and Bound

Sometimes, the constraints state that some of the variables
have to be integral or binary.

TODO: explain the algorithm in the case of binary constraints.

TODO: give an example (the Travelling Salesman Problem)

TODO: In R?

+ Interior Point (IP) Methods

  TODO

+ Quadratic programming

One sometimes stumbles upon a generalization of a linear
program: the constraints are still linear, but the function
to optimize is a degree-2 polynomial. This is called a
quadratic program.

TODO: Give an example (portfolio optimization)

TODO: In R 

  library(help=quadprog)

+ Sequential Quadratic Programming (SQP)

In the same way as Newton's method solves f(x) = 0 by
approximating the function f with its first-order Taylor
expansion, one can minimize a function f, iteratively, by
apploximating it by its second-order Taylor expansion.

One can even add linear constraints (equalities and
inequalities): the problem to solve at each iteration is
then a Quadratic Program (QP). In high dimensions, that way
of solving non-linear (and even non-convex) optimization
problems is actually faster than other methods (e.g.,
interior-point (IP) methods).

SQP also applies to non-linear constraints: replace the
constraints by their first-order Taylor expansion and use
the second-order derivative of those constraints as a
penalty to the function to optimize.

+ Interior Point (IP) methods 

TODO: understand

  Replace the inequalities in 
    Min f(x)
    g(x) = 0
    h(x) >= 0
  by a penalty
    Min f(x) - mu sum log(s_i)
    g(x) = 0
    h(x) = s

+ EM (Expectation Maximization)

  TODO

+ Optimizing noisy functions

Tricky, but check the genalg and DEoptim packages. 

+ Dynamic programming

  TODO (I am not sure it really belongs here)
  Examples: 
    - matrix multiplication
    - order of the JOINs
    - portfolio optimization (?)

* Miscellaneous

+ Memory woes

The most common memory problem strikes Windows users with a
2Gb machine: their operating software does not allow them to
use more than 1Gb of memory. If you are in this situation,
the R windows FAQ details all the new problems that appear
if you are stuck on this platform. In particular, it
explains how to increase this memory limit -- but Windows
specialists tell me that this operating system becomes
unstable when a process uses more than 1Gb: you might reach
1.5Gb without any problem, but do not hope to access those
2Gb...

  http://cran.r-project.org/bin/windows/base/rw-FAQ.html

But you should really consider installing Linux instead: you
can use those 2Gb (problems may appear when a process uses
more that 2Gb if the machine only has 2Gb, though).

  > library(fortunes)
  > fortune("install")

  Benjamin Lloyd-Hughes: Has anyone had any joy getting the
    rgdal package to compile under windows?
  Roger Bivand: The closest anyone has got so far is Hisaji
    Ono, who used MSYS (http://www.mingw.org/) to build
    PROJ.4 and GDAL (GDAL depends on PROJ.4, PROJ.4 needs a
    PATH to metadata files for projection and
    transformation), and then hand-pasted the paths to the
    GDAL headers and library into src/Makevars, running Rcmd
    INSTALL rgdal at the Windows command prompt as
    usual. All of this can be repeated, but is not portable,
    and does not suit the very valuable standard binary
    package build system for Windows. Roughly: [points 1 to
    5 etc omitted]
  Barry Rowlingson: At some point the complexity of
    installing things like this for Windows will cross the
    complexity of installing Linux... (PS excepting
    live-Linux installs like Knoppix)
   -- Benjamin Lloyd-Hughes, Roger Bivand and Barry Rowlingson
      R-help (August 2004)

Actually, even on Linux, you can run into memory problems:
when R needs more memory, it asks the operating system for
more and, from time to time, redeems the memory it no longer
needs to the operating system (this is called "garbage
collection"). The problem is that (when you get near the
maximum physically available memory) the memory can become
fragmented. A simple solution, if it only happens once (not
in the middle of a huge loop), is to save the current,
fragmented session (it is fragmented in memory, but the
fragmentation will be lost when it is written to disc), quit
R, launch it again and read in the previous session.

  ?save.image

If it happens in the middle of a loop, you can try to explicitely
delete the large objects (with the "rm" function) when you no longer
need them and explicitely call the garbage collector (with the "gc"
function). 

  for (...) {
    r1 <- lmer(...)
    ...
    rm(r1)
    gc()
    ...
    rm(r2)
    gc()
  }

If you do not know where the memory is used, the "object.size"
function is your friend.

  all.object.sizes <- function () {
    res <- unlist(lapply( ls(1), function (x) { object.size(get(x)) } ))
    names(res) <- ls(1)
    sort(res)
  }

  all.object.sizes.everywhere <- function () {
    res <- NULL
    for (a in search()) {
      r <- unlist(lapply( ls(a), function (x) { object.size(get(x)) } ))
      if (!is.null(r)) {
        names(r) <- paste(a, ls(a), sep="::")
        res <- c(res, r)
      }
    }
    sort(res)
  }    

But these are just workarounds: a better solution is to look
if you really need all this data at once. Usually, you do
not: if this is your case, you can store your data in a
database and only retrieve the chunks you need. The word
"database" may sound daunting, but for simple data, SQLite
(discussed somewhere in this document) requires no server,
no configuration, no installation -- it stores the data in a
conventionnal file and provides an SQL interface to it.

If you really need more memory, you should know that: 32-bit
machines are limited to 4Gb per processor; the Windows
operating system cannot reliably grant more that 1Gb to a
single process; on Linux, problems may appear if a simgle
process requires more than 3Gb (if you have that much
memory). You can turn to 64-bit machines, but then again,
you should know that: on these machines, the Windows
operating system runs in 32-bit emulation mode, so you are
still limited to 4Gb per processor; most Unix-like operating
systems (MacOS X.4, Linux, FreeBSD, etc.) are available for
64-bit machines and can take advantage of them.

+ Displaying numbers

It is usually meaningless to express the result of a
statistical computation with a lot of significant
digits. For instance, if you compute the average height of
10 people, you need not be as precise as "1.718937283m":
"1.72m" will do.

You can round numbers with the "floor" (round below),
"ceiling" (round above), "round" (round to the nearest),
"signif" (round to the nearest, not with a prespecified
number of decimal places, as with the previous function, but
with a prespecified number of significant digits).

  > f <- function (x) {
  +   c(x=x, floor=floor(x), ceiling=ceiling(x), round=round(x,2), signif=signif(x,2))
  + }
  > t(apply(t(rt(10,4)),2,f))

                 x floor ceiling round signif
   [1,]  0.3209408     0       1  0.32   0.32
   [2,] -3.2453803    -4      -3 -3.25  -3.20
   [3,] -0.8474375    -1       0 -0.85  -0.85
   [4,]  1.7481940     1       2  1.75   1.70
   [5,] -1.1009298    -2      -1 -1.10  -1.10
   [6,]  0.5767945     0       1  0.58   0.58
   [7,]  0.9479906     0       1  0.95   0.95
   [8,]  0.6373905     0       1  0.64   0.64
   [9,] -2.1388324    -3      -2 -2.14  -2.10
  [10,] -0.5720559    -1       0 -0.57  -0.57

Sometimes, you do not want to round the numbers but merely
control the way they are printed. The "digits" option
specifies the number of digits to be displayed (but, in
memory, for the computations, the numbers retain all their
digits).

  > pi
  [1] 3.141593

  > options()$digits
  [1] 7

  > options(digits=4)
  > pi
  [1] 3.142

  > options(digits=7)

Sometimes, the numbers get printed in scientific notation, while you
would prefer a more classical notation.

  > x <- as.data.frame(t(t(rcauchy(10)^2)))
  > x
               V1
  1  1.869630e+00
  2  5.909726e-01
  3  6.114153e-01
  4  5.320118e-01
  5  5.699883e+00
  6  2.616534e+04
  7  2.019110e-02
  8  1.910365e-01
  9  2.384527e-03
  10 7.097835e-02

The computer tries to choose between the standard notation
and the scientific one by comparing the length of the
various numbers. In this example, we want at least 7
significant digits (the "digits" option): for this, because
one value is around 2e-3, we need 10 decimal places. But as
we also 2e+4, we end up using as much as 16 characters to
display some of the numbers. On the contrary, in scientific
notation, we only use 12 characters: by parsimony, the
computer chooses the scientific notation.

But we can alter this by adding a penalty to the scientific
notation, with the "scipen" option: here, we add a
5-character penalty to the scientific notation; as 
16 <= 12 + 5, we keep the standard notation.

  > options(scipen=5)
  > x
                  V1
  1      1.869630292
  2      0.590972584
  3      0.611415260
  4      0.532011822
  5      5.699882601
  6  26165.335966629
  7      0.020191104
  8      0.191036531
  9      0.002384527
  10     0.070978346

Sometimes, when we want a greater control on the way the
numbers are printed (typically when you are writing the
"print" method of an object you have just defined), you can
resort to the lower-level "formatC" function, that
transforms numbers into strings, allows you to choose
between integer, standard or scientific notation, that
allows you to add marks between thousands, millions, etc.,
that allows you to change the symbol used as decimal point,
that allows you to align the numbers on the left or on the
right.

  > formatC(pi, digits=2, width=8, format="f")
  [1] "    3.14"

  > formatC(pi, digits=4, width=8, format="f")
  [1] "  3.1416"

  > formatC(pi, digits=4, width=8, format="f", flag="-")   # Flush left
  [1] "3.1416  "

  > formatC(1e6, digits=4, width=20, format="f", big.mark=",")
  [1] "        1,000,000.0000"

  > formatC(1e6, digits=0, width=20, format="f", big.mark=",")
  [1] "             1,000,000"

  > formatC(pi * 1e6, digits=9, width=20, format="f", big.mark=",", small.mark=" ")
  [1] "   3,141,592.65358 9793"

  > formatC(pi * 1e6, digits=9, width=20, format="f", big.mark=" ",
  +         small.mark=" ", small.interval=3, decimal.mark=",") # in France...
  [1] "   3 141 592,653 589 793 "

There is also a "format" function (slightly less powerful),
a "prettyNum" function (a variant of "format"), a
"format.pval" function (for p-values).


* Dirty Tricks

+ Catching errors

There is a "try" command, to run functions that might crash.

  try(...)

We shall use it later in this document, usually on the form 

  x <- NA
  try( x <- ... )
  if( is.na(x) ) { 
    ... 
  } else { 
    ... 
  }

or even, sometimes, 

  done <- FALSE
  while (!done) {
    r <- try( ... )
    done <- !inherits(r, "try-error")
  }

Actually, the manual tells us that we are not supposed to use "try" (I
use it a lot) but "tryCatch" (which I have never used).

  ?tryCatch

+ Exceptions

TODO

  ?on.exit
  ?reg.finalizer   # When I speak about memory and garbage collection
  ?setHook         # AOP...

+ Global variables

As functions have NO side effect, we should not be able to change
global variables from within a function. However, 

  x <<- 1+2+3+4+5

modifies the "x" variable where it lives or, if it does not exist yet,
defines it as a global variable. 

Other way of doing it:

  set.global <- function (x, value) {
    x <- deparse(substitute(x))
    assign(x, value, pos=.GlobalEnv)
  }

And it works:

  > set.global(a,3)
  > a
  [1] 3
  > set.global(a,1:10)
  > a
   [1]  1  2  3  4  5  6  7  8  9 10

This can be generalized to write methods that modify the object to
which they are atteched. Instead of using 
pos=.GlobalEnv, you can try -2:
a variable local to the environment that called the function (I think).

+ Environments

You can explicitely create "environments" and use them as hash
tables. 

  > x <- new.env(hash=T)
  > assign('foo', 3, env=x)
  > assign('bar', list('a'=3, 'b'=list('c'=1, d='foo')), env=x)
  > assign('baz', data.frame(rnorm(10),rnorm(10)), env=x)
  > x
  <environment: 0x8bb97bc>
  > ls(env=x)
  [1] "bar" "baz" "foo"
  > get('foo', env=x)
  [1] 3

+ Launching R

We can ask R to perform certain actions when it is launched (for
instance, load a certain library) or when we close it (for instance,
save some of the data under a certain format) by redefining the
".First" and ".Last" functions. 

  > ?Startup
  > .First
  function() {
      require("ctest", quietly = TRUE)
  }

+ #!

  TODO

+ Other external programs

The "system" command can launch any Unix command. 

  system("top")

If you want to launch the command in the background, you just add an
"&" at the end, as usual. If the command expects arguments (in
particular the name of a file containing the data to process), it
might be useful to print the command that will be used (R is not
good/intuitive at string processing). For instance, in xgobi's code
(xgobi or ggobi is an external program to visualize data in dimension
3 of higher), we find

    ...
    args <- paste("-title", paste("'", title, "'", sep = ""),
      args)
    command <- paste("xgobi", args, dfile, "&")
    cat(command, "\n")
    s <- system(command, FALSE)
    invisible(s)
  }

This is mainly used to transfer the data we are working with to
another program, often to visualize them. We shall see how to call
xgobi/ggobi (to visualize high-dimensional data, dynamically and
interactively) from R. We can also use it to produce "pretty"
pictures, for instance to add a background to a graphic (with
ImageMagick) or to ask PoVRay to produce a picture (such as the one
used as title of this document) or an animation. 

  TODO: URL of the Linux Journal article about data visualization with
  PoVRay. 

You can also use it to automatically launch LaTeX (them xdvi, dvips,
lpr) on a file, previously created (with SWeave, that allows one to
add R code, results of computations made with R, graphics made with R,
in a LaTeX document).

+ deparse() and substitute()

The "deparse" command shows the contents of an object, as a string
(you can then play with the string -- you would want to do that, for
instance, if you wanted to create a LaTeX file by hand, from R) -- if
you just want to see it, just give the object name to the interpreter
or use the "print.defaut" function. 

  > print
  function (x, ...)
  UseMethod("print")

  > deparse(print)
  [1] "function (x, ...) "   "UseMethod(\"print\")"

The "substitute" command performs substitutions in an expression (the
result is an unevaluated expression). 

  > substitute(x+1, list(x=3))
  3 + 1

We have also seen that the "substitute" command (often together with
the "deparse" command) could allow a function to see where its
arguments came from. 

  my.arg <- function (x, ...) {
    cat("My first argument was: ")
    cat(deparse(substitute(x)))
    cat("\n")
  }

  > my.arg(3)
  My first argument was: 3

  > my.arg(x)
  My first argument was: x

  > my.arg(x+1)
  My first argument was: x + 1

+ get()

The "get" command turns a string containing the name of a variable
into the contents of the variable. 

  > get("plot")
  function (x, ...)
  {
    if (is.null(attr(x, "class")) && is.function(x)) {
      if ("ylab" %in% names(list(...)))
        plot.function(x, ...)
      else plot.function(x, ylab = paste(deparse(substitute(x)),
        "(x)"), ...)
    }
    else UseMethod("plot")
  }

+ Environments: ls(), rm(), search()

The "ls" command gives the list of the variables currently defined.
You can specify in which environment to look (as a result, you can
browse through the global variables, even if they are masked by the
local variables). You can also search a variable name from a regural
expression.

  > ls()
  [1] "a"          "my.arg"     "set.global" "x"          "y"

  > ls(pos=.GlobalEnv)
  [1] "a"          "my.arg"     "set.global" "x"          "y"

In a Unix fashion, variables whose name start with a dot are "hidden":
you must explicitely ask to see them. 

  > ls(all=T)
  [1] "a"            "my.arg"       ".Random.seed" "set.global"
  [5] ".Traceback"   "x"            "y"

The "rm" command deletes variables.

The "search" command shows the "path" (i.e., the list of environments)
in which R looks for the variables. 

  > library(MASS)
  > search()
  [1] ".GlobalEnv"    "package:MASS"  "package:ctest" "Autoloads"
  [5] "package:base"
  > ls(pos="package:MASS")
    [1] "addterm"                "addterm.default"        "addterm.glm"
    [4] "addterm.lm"             "addterm.mlm"            "addterm.negbin"
    [7] "addterm.survreg"        "anova.loglm"            "anova.negbin"
    ...
  [175] "update.loglm"           "vcov"                   "vcov.glm"
  [178] "vcov.lm"                "vcov.nls"               "vcov.polr"
  [181] "width.SJ"               "write.matrix"

The "apropos" (or "find") command helps us find a variable (or
function) name.

  > apropos('exp')
   [1] "negexp.SSival"         "as.expression"         "as.expression.default"
   [4] "char.expand"           "dexp"                  "exp"
   [7] "expand.grid"           "expand.model.frame"    "expm1"
  [10] "expression"            "is.expression"         "path.expand"
  [13] "pexp"                  "qexp"                  "regexpr"
  [16] "rexp"

  > apropos('.')
     [1] "x"                          "boxcox"
     [3] "boxcox.default"             "boxcox.formula"
     [5] "boxcox.lm"                  "corresp.matrix"
     ...
  [1767] "xyinch"                     "xyz.coords"
  [1769] "yinch"                      "zapsmall"
  [1771] "zip.file.extract"

+ stop(), warning()

The "stop" function stops a function when a problem is spotted (for
example, when there is a type problem with its arguments).

  do.it <- function (x) {
    if( !is.numeric(x) )
      stop("Expecting a NUMERIC vector!")
    if( !is.vector(x) )
      stop("Expecting a numeric VECTOR!")
    if( length(x)<2 )
      stop("Expecting a numeric vector of length at least 2")
    return("Well done.")
  }
 
  > do.it("abc")
  Error in do.it("abc") : Expecting a NUMERIC vector!

  > do.it(3)
  Error in do.it(3) : Expecting a numeric vector of length at least 2

  > do.it(data.frame(a=1:3,b=3:1))
  Error in do.it(data.frame(a = 1:3, b = 3:1)) :
          Expecting a NUMERIC vector!

  > do.it(matrix(1:4,nc=2,nr=2))
  Error in do.it(matrix(1:4, nc = 2, nr = 2)) :
          Expecting a numeric VECTOR!

  > do.it(1:26)
  [1] "Well done."

The "stopifnot" function is similar -- it is an assertion mechanism,
as the "assert" mechanism in C -- is is a VERY good idea to use a lot
of assertions in your code: it helps you spot bugs  before they
appear. 

The warning() function is similar but emits a (usually non-fatal) warning.

  TODO: Example

You can control whether a warning is fatal or not with

  options(warn=-1)  # Do not print warnings
  options(warn=0)   # Print warnings at the end of the function, not when
                    # they are emitted; if there are too many of them,
                    # just say there are too many
  options(warn=1)   # Print the warnings when they occur
  options(warn=2)   # Make the warnings fatal

+ parse(), expression()

The "parse" command transforms a string into an unevaluated
expression. You can then evaluate the expression with the "eval"
command. 

  > parse(text="0==1")
  expression(0 == 1)

  > eval(parse(text="0==1"))
  [1] FALSE

Expressions may also be used as labels, in graphics. 

  %G
  x <- seq(0,4, length=100)
  y <- sqrt(x)
  plot(y~x, type='l', lwd=3, main=expression(y == sqrt(x)) )
  %--

There are more details in the manual:

  ?plotmath