This is a tutorial that Ryosuke and I originally wrote for the supplementary information of our dinosaur paper back in 2011. The most up to date code for phylo FDA is now on Github, along with some example data and tree. I hope you will find this tutorial useful. Please give it a try, use it with your own data and let me know how it works. Any feedback is much appreciated!

Before you start the tutorial, please make sure that “phylo.fda.v0.2.R”, “measurements.csv”, and “example.tree.tre” (all available on GitHub) are saved in your current working directory.

Let’s first start with some preliminaries, such as loading libraries and reading in the source code.

Loading libraries and source code

library(ape)
library(class)
library(geiger)
library(lattice)
library(mda)
## Loaded mda 0.4-4
library(nnet)

# Please adjust the file path to your settings.
source("C:/Users/lschmitz/Documents/GitHub/phylo.fda.v0.2/phylo.fda.v0.2.R")

Reading in the tree and data files

# Getting the tree-file into R. Please adjust the file path to your settings.
treA <- read.tree("C:/Users/lschmitz/Documents/GitHub/phylo.fda.v0.2/example.tree.tre") if(!is.binary.tree(treA)) treA <- multi2di(treA, random = TRUE) # Randomly resolving polytomies if there are any. # Reading in the data and assigning rownames. Please adjust the file path to your settings. ddA <- read.csv("C:/Users/lschmitz/Documents/GitHub/phylo.fda.v0.2/measurements.csv") rownames(ddA) <- ddA$taxon # Ordering data to match tip labels.Note that for this tutorial the data and phylogeny already match. Normally that's not the case. You can use 'geiger's treedata() function for fix that problem. Always make sure your data are in the correct order! ddA <- ddA[match(treA$tip.label,rownames(ddA)),]

Preparing data

Then we’ll have to prepare our data. We need to assign training (extant) and test (fossil) data, make a tree that only contains the living species, and should also log-transform the data if applicable.

# Defining groups and taxa.
gA <- ddA$groups # The 'groups' column contains data on diel activity pattern.
taxaA <- ddA$taxon # Species names.
rownames(ddA) <- taxaA # Assigning species names to rows.     

# Tree and data preparation
# All taxa (test and training, e.g., fossil and living) or training only (living).
XA <- ddA[,2:4] # We are selecting 3 variables from the dataset.

# Log10 transformation and some rounding.
XA <- log10(XA) 
XA <- signif(XA, 4)

# Creating a dataframe that only contains taxa with known group affiliation.
testtaxa <- rownames(ddA[gA=="unknown",]) # Specifying taxa with unknown group, e.g. fossils (unknown).
testtaxan <- row(ddA)[gA=="unknown",1]
trainingtaxa <- rownames(ddA[-testtaxan,]) # Kick out the fossils.
X <- XA[-testtaxan,]
dd <- ddA[-testtaxan,]
g <- gA[-testtaxan]
tre <- drop.tip(treA, testtaxa) # Dropping the fossils from the tree.
is.ultrametric(tre) # Tree should be ultrametric after removing fossils.
## [1] TRUE

Identifying optimal lambda

The optLambda() function rescales the phylogeny with Pagel’s lambda and tests the correlation between the morphological traits and group affiliation. The aim is to find the rescaled tree that maximizes the correlation between morphology and ecology, assuming a Brownian process of trait evolution. Traits that are highly controlled by phylogeny are expected to show high optimal lambda values, whereas traist that are more independent from phylogeny are expected to ahve lower optimal lambda values.

filename_stem <- "ecology" 
ol1 <- optLambda(X,g,tre,idc=filename_stem ) # A plot will appear and a *.pdf is saved in your working directory.

plot of chunk unnamed-chunk-4

ol1$optlambda # Displaying the optimal lambda value.
##       RSS
## [1,] 0.08

Performing the discriminant analysis

This is almost the shortest part of the tutorial. We only have to set the prior probabilities (if available) and then run the phylo.fda.pred() function.

pri  <- c(0.1427,0.5864,0.2709) # The prior probabilities will vary with your datasets. Replace by equal priors if unknown.
    
optl <- 0.08 # Replace with the optimal lambda value from above.
    
pfda <- phylo.fda.pred(XA,gA,taxaA,treA,testtaxan,val=optl,priin=pri) 
## Warning: the condition has length > 1 and only the first element will be
## used
# Warning message re: priors can be ignored.

Summarizing results

It is possible to extract very useful statistics, similar to the fda-package on which phylo.fda is based.

# Let's compile a cross-classification matrix to see how pFDA performed.
pfda$confusion # Misclassified poroportion listed as "error".
##             true
## predicted    cathemeral diurnal nocturnal
##   cathemeral          3       0         1
##   diurnal            21     105         3
##   nocturnal           6       3        22
## attr(,"error")
## [1] 0.2073
# Now, let's extract predictions for the fossils.
test.class <- as.character(predict(pfda, pfda$DATAtest, type="class"))

# Discriminant scores scores for the fossils are retreived as follows.
test.variates <- predict(pfda, pfda$DATAtest, type="variates")

# Posterior probabilities of group affiliations are available, too.
test.prob <- predict(pfda, pfda$DATAtest, type="posterior")

# And now let's put everything together (here shown for the fossils):
test.results <- cbind(test.class, test.prob, test.variates)
colnames(test.results) <- c("predicted class", "P(c)", "P(d)", "P(n)", "DA1", "DA2")
rownames(test.results) <- testtaxa

head(test.results) # Let's take a look what we got. Well, at least the first few rows.
##        predicted class P(c)                  P(d)                  
## tip165 "diurnal"       "0.182283141991708"   "0.806086181497671"   
## tip166 "diurnal"       "0.297452682578856"   "0.583005083034815"   
## tip167 "diurnal"       "0.365422310288433"   "0.633574032462211"   
## tip168 "cathemeral"    "0.667943494984201"   "0.321262951627568"   
## tip169 "diurnal"       "0.0698059277335058"  "0.908432828230657"   
## tip170 "nocturnal"     "0.00079427020512959" "0.000201546426958272"
##        P(n)                 DA1                   DA2                
## tip165 "0.0116306765106211" "1.54569669440266"    "-1.13718418578159"
## tip166 "0.119542234386329"  "-0.0577190902046224" "-1.15757001676524"
## tip167 "0.0010036572493554" "2.76026304975073"    "-3.27432448301127"
## tip168 "0.010793553388231"  "0.889208977345653"   "-3.75944341124825"
## tip169 "0.0217612440358368" "1.3313294723398"     "0.507327887901768"
## tip170 "0.999004183367912"  "-6.07722551216375"   "0.166677076308217"
# The same can be done for the training data (living species).
training.class <- as.character(predict(pfda, pfda$DATA, type="class"))
training.variates <- predict(pfda, pfda$DATA, type="variates")
training.prob <- predict(pfda, pfda$DATA, type="posterior")
training.results <- cbind(as.character(g), training.class, training.prob, training.variates)
colnames(training.results) <- c("true class", "predicted class", "P(c)", "P(d)", "P(n)", "DA1", "DA2")
rownames(training.results) <- trainingtaxa

Plotting discriminant scores

Visualization of results is very important. The following lines explain how the discriminant scores can be plotted with different colors for your groups.

# First some data organization of the living species
training <- as.data.frame(cbind(training.variates, as.character(dd$groups)))
colnames(training) <- c("DA1", "DA2", "groups")
rownames(training) <- dd$taxon

# Followed by organization of fossils
unknown <- as.character(rep("unknown", times=length(testtaxan)))
test <- as.data.frame(cbind(test.variates, unknown))
colnames(test) <- c("DA1", "DA2", "groups")
rownames(test) <- testtaxa

# All put together in one object, before broken into groups
scatter <- rbind(training, test)
scatter[, 1:2] <- lapply(scatter[,1:2], as.character)
scatter[, 1:2] <- lapply(scatter[,1:2], as.numeric)

c <- scatter[scatter$groups=="cathemeral",] 
d <- scatter[scatter$groups=="diurnal",] 
n <- scatter[scatter$groups=="nocturnal",]
f <- scatter[scatter$groups=="unknown",] 

And finally we can plot the discriminant scores. Please note that this is no longer a classical morphospace. In addition to shape and size, phylogenetic covariance influences position of species.

plot(c$DA1, c$DA2, 
      asp=1, 
      xlim=range(scatter[,1]),
      ylim=range(scatter[,2]), 
      pch=21, col="black", bg="gray", 
      cex=1.25, cex.lab=1.5, 
      xlab="DA 1", ylab="DA 2");
 
points(d$DA1, d$DA2, pch=21, cex=1.25, col="black", bg="gold");
points(n$DA1, n$DA2, pch=19, cex=1.25, col="black");
points(f$DA1, f$DA2, pch=24, cex=1.25, col="black", bg="green");

box(lwd=2); axis(1, lwd=2, lwd.ticks=2); axis(2, lwd=2, lwd.ticks=2)

plot of chunk unnamed-chunk-8

I would also recommended exploring how choice of optimal lambda influences the classification of fossils. Please let me know if you have any questions.

Key references

Hastie, T., Tibshirani, R., Buja, A. (1994) Flexible disriminant analysis by optimal scoring. J Americ Stat Assoc, 89, 1255-1270.

Motani, R., Schmitz, L. (2011) Phylogenetic versus functional signals in the evolution of form-function relationships in terrestrial vision. Evolution, 65, 2245-2257.

Schmitz, L., Motani, R. (2011) Nocturnality in dinosaurs inferred from scleral ring and orbit morphology. Science, 332, 705-708.