Reconstructing ancestral states of discrete characters

Many biologists desire to infer the evolutionary pattern of a trait, which can provide quite interesting information about when and where a trait originally appeared, if it was possibly lost later on, or if the trait evolved multiple times independently.

Ancestral state reconstructions can be fairly easily obtained with a variety of methods, but all carry a large amount of uncertainty, especially in the absence of fossil data. Please always keep that in mind when dealing with such inferences.

All we need for ancestral state reconstructions are a time-calibrated phylogeny and a trait. The trait can be either continuous or discrete, but we’ll stick to the latter for the purposes of this tutorial. Let’s use the grunt data set — you will note that the first part of the script is identical to a previous visualization tutorial. Specfically, we will apply the ace() function of the ape package and the stochastic character mapping procedures implemented in phytools.

#load required packages
library(ape)
library(geiger)
library(phytools)

After loading libraries, let’s get our data.

#getting tree
haemulidTrees <- read.nexus(url("http://schmitzlab.info/Trees4dryad.nex"))
haemulidTrees
## 500 phylogenetic trees
#getting data
haemulids <- read.csv(url("http://schmitzlab.info/haemulids.csv"), header=TRUE)
head(haemulids)
##                     Taxon Standard_length Raker_Length Suction_Index
## 1     Anisotremus_caesius          158.00         2.35          0.43
## 2  Anisotremus_davidsonii          240.00         5.10          0.41
## 3       Anisotremus_dovii          175.67         3.13          0.50
## 4 Anisotremus_interruptus          172.67         3.47          0.37
## 5   Anisotremus_moricandi          127.00         3.10          0.40
## 6    Anisotremus_pacifici          139.00         2.50          0.82
##    Habitat
## 1     reef
## 2     reef
## 3 non-reef
## 4     reef
## 5     reef
## 6 non-reef

The tree-file contains a set of 500 time-calibrated trees, the associated data include standard length, gill raker length, suction index, and information about habitat (reef vs. non-reef).

As said before, whenever working with comparative data and phylogenies, one must absolutely make sure that the data exactly match the phylogeny, for both analytical and plotting purposes. Let’s take a look at the haemulid tree.

#Let's begin by plotting the first tree in the list
plot(haemulidTrees[[1]], cex=0.4)
add.scale.bar()

#OK, that can and should look much cleaner...

#Let's ladderize all trees in the distribution with lapply() and plot the first tree again
haemulidTreesLadderized <- lapply(haemulidTrees, ladderize)
plot(haemulidTreesLadderized[[1]], cex=0.4) #Plot the first ladderized tree
add.scale.bar()

#Looks better, doesn't it? 

#However, does the order of tips in the plot match the order of tiplabels in the tree object?

#Please compare
haemulidTreesLadderized[[1]]$tip.label
##  [1] "Haemulon_striatum"               "Haemulon_steindachneri_Pacific" 
##  [3] "Haemulon_macrsotomum"            "Anisotremus_caesius"            
##  [5] "Haemulon_carbonarium"            "Anisotremus_davidsonii"         
##  [7] "Anisotremus_dovii"               "Anisotremus_interruptus"        
##  [9] "Haemulon_chrysargyreum"          "Xenichthys_xantii"              
## [11] "Anisotremus_surinamensis"        "Haemulopsis_leuciscus"          
## [13] "Isacia_conceptionis"             "Haemulon_album"                 
## [15] "Haemulon_plumieri"               "Anisotremus_moricandi"          
## [17] "Haemulopsis_axillaris"           "Orthopristis_reddingi"          
## [19] "Haemulon_boschmae"               "Haemulon_spB"                   
## [21] "Xenistius_californiensis"        "Microlepidotus_brevipinnis"     
## [23] "Pomadasys_crocro"                "Conodon_serrifer"               
## [25] "Haemulon_vittata"                "Anisotremus_scapularis"         
## [27] "Haemulon_flavolineatum"          "Haemulopsis_elongatus"          
## [29] "Haemulon_sciurus"                "Anisotremus_pacifici"           
## [31] "Haemulon_maculicauda"            "Microlepidotus_inornatus"       
## [33] "Anisotremus_virginicus"          "Haemulon_steindachneri_Atlantic"
## [35] "Pomadasys_branickii"             "Pomadasys_panamensis"           
## [37] "Orthopristis_chrysoptera"        "Orthopristis_ruber"             
## [39] "Haemulopsis_nitidus"             "Haemulon_melanurum"             
## [41] "Haemulon_sexfasciatum"           "Pomadasys_corvinaeformis"       
## [43] "Haemulon_spA"                    "Conodon_nobilis"                
## [45] "Orthopristis_chalceus"           "Pomadasys_macracanthus"         
## [47] "Haemulon_aurolineatum"           "Haemulon_flaviguttatum"         
## [49] "Haemulon_scudderi"               "Haemulon_parra"

As you will see, by ladderizing the tree, the plotting order and the tiplabel order are no longer aligned. That’s a problem when plotting phylogenetic and phenotypic data side by side!

There is an easy fix for that, though. First we will compare the tree to the data (a necessary step we learned about previously), then save the ladderized tree and finally re-load it.

#Grab the tree from above
base.tree <- haemulidTreesLadderized[[1]]

#assign row names to the dataframe
rownames(haemulids) <- haemulids$Taxon

#Compare tree with data
compare <- treedata(base.tree, haemulids, sort=TRUE) #R will inform you of any inconsistencies and how they were solved!
## Warning in treedata(base.tree, haemulids, sort = TRUE): The following tips were not found in 'phy' and were dropped from 'data':
##  Emellichthyops_atlanticus
##  Haemulon_bonariense
#Extract vetted data and tree
haemulid.data <- as.data.frame(compare$data)
haemulid.data[,2:4] <- lapply(haemulid.data[,2:4], as.character)
haemulid.data[,2:4] <- lapply(haemulid.data[,2:4], as.numeric)

tree <- compare$phy

#Save the new tree and bring it back
write.tree(tree, "haemulid.tre")
tree <- read.tree("haemulid.tre")

Let’s now plot the tree along with habitat info.

#Retrieve info on habitat and adding names
z <- as.factor(haemulid.data$Habitat)
names(z) <- tree$tip.label

#Create a vector to be filled with colors for plotting purposes, matching the diel activity pattern
mycol<-character(length(z))
mycol[haemulid.data$Habitat=="reef"]<-"blue"
mycol[haemulid.data$Habitat=="non-reef"]<-"red"

#make very narrow margins
par(mar=c(1,0,0,1))

#And finally we can plot with tiplabels :)
plot(tree, label.offset=1.5, cex=0.5)
tiplabels(pch=21, col="black", adj=1, bg=mycol, cex=1.3)

Now we are ready to carry out ancestral state reconstructions of the habitat of grunts. Did grunts originate on reefs or away from reefs?

#calculating marginal likelihoods for ancestral states with an all-rates-equal model (ER)
fit.ER <- ace(z, tree, model="ER", type="discrete")
fit.ER
## 
##     Ancestral Character Estimation
## 
## Call: ace(x = z, phy = tree, type = "discrete", model = "ER")
## 
##     Log-likelihood: -33.96421 
## 
## Rate index matrix:
##          non-reef reef
## non-reef        .    1
## reef            1    .
## 
## Parameter estimates:
##  rate index estimate  std-err
##           1   4.6268 2744.465
## 
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
## non-reef     reef 
##      0.5      0.5
fit.ER$lik.anc
##       non-reef reef
##  [1,]      0.5  0.5
##  [2,]      0.5  0.5
##  [3,]      0.5  0.5
##  [4,]      0.5  0.5
##  [5,]      0.5  0.5
##  [6,]      0.5  0.5
##  [7,]      0.5  0.5
##  [8,]      0.5  0.5
##  [9,]      0.5  0.5
## [10,]      0.5  0.5
## [11,]      0.5  0.5
## [12,]      0.5  0.5
## [13,]      0.5  0.5
## [14,]      0.5  0.5
## [15,]      0.5  0.5
## [16,]      0.5  0.5
## [17,]      0.5  0.5
## [18,]      0.5  0.5
## [19,]      0.5  0.5
## [20,]      0.5  0.5
## [21,]      0.5  0.5
## [22,]      0.5  0.5
## [23,]      0.5  0.5
## [24,]      0.5  0.5
## [25,]      0.5  0.5
## [26,]      0.5  0.5
## [27,]      0.5  0.5
## [28,]      0.5  0.5
## [29,]      0.5  0.5
## [30,]      0.5  0.5
## [31,]      0.5  0.5
## [32,]      0.5  0.5
## [33,]      0.5  0.5
## [34,]      0.5  0.5
## [35,]      0.5  0.5
## [36,]      0.5  0.5
## [37,]      0.5  0.5
## [38,]      0.5  0.5
## [39,]      0.5  0.5
## [40,]      0.5  0.5
## [41,]      0.5  0.5
## [42,]      0.5  0.5
## [43,]      0.5  0.5
## [44,]      0.5  0.5
## [45,]      0.5  0.5
## [46,]      0.5  0.5
## [47,]      0.5  0.5
## [48,]      0.5  0.5
## [49,]      0.5  0.5

Does that make sense to you? Well, something doesn’t seem quite right. All likelihood estimates are the same, which suggests that the likelihood optimization got stuck along the way. Let’s try to troubleshoot. I suspect the branch length may be an issue.

#divide all branches by 100
scaled.tree <- tree
scaled.tree$edge.length <- scaled.tree$edge.length/100

#try again...
#calculating marginal likelihoods for ancestral states with an all-rates-equal model (ER)
fit.ER <- ace(z, scaled.tree, model="ER", type="discrete")
fit.ER
## 
##     Ancestral Character Estimation
## 
## Call: ace(x = z, phy = scaled.tree, type = "discrete", model = "ER")
## 
##     Log-likelihood: -35.18024 
## 
## Rate index matrix:
##          non-reef reef
## non-reef        .    1
## reef            1    .
## 
## Parameter estimates:
##  rate index estimate std-err
##           1   7.0706  4.6382
## 
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
##  non-reef      reef 
## 0.4907739 0.5092261
fit.ER$lik.anc
##         non-reef      reef
##  [1,] 0.49077390 0.5092261
##  [2,] 0.48714480 0.5128552
##  [3,] 0.48465321 0.5153468
##  [4,] 0.48326972 0.5167303
##  [5,] 0.43679506 0.5632049
##  [6,] 0.42843565 0.5715643
##  [7,] 0.29581685 0.7041832
##  [8,] 0.14488920 0.8551108
##  [9,] 0.13155733 0.8684427
## [10,] 0.10511980 0.8948802
## [11,] 0.03751548 0.9624845
## [12,] 0.02149666 0.9785033
## [13,] 0.02028046 0.9797195
## [14,] 0.04001634 0.9599837
## [15,] 0.20202722 0.7979728
## [16,] 0.20337234 0.7966277
## [17,] 0.29801568 0.7019843
## [18,] 0.42578701 0.5742130
## [19,] 0.15405122 0.8459488
## [20,] 0.21684559 0.7831544
## [21,] 0.24661172 0.7533883
## [22,] 0.07319943 0.9268006
## [23,] 0.37911982 0.6208802
## [24,] 0.76922333 0.2307767
## [25,] 0.07419659 0.9258034
## [26,] 0.48709907 0.5129009
## [27,] 0.48477008 0.5152299
## [28,] 0.39560179 0.6043982
## [29,] 0.20367228 0.7963277
## [30,] 0.16684059 0.8331594
## [31,] 0.03254724 0.9674528
## [32,] 0.31998508 0.6800149
## [33,] 0.49557697 0.5044230
## [34,] 0.49903847 0.5009615
## [35,] 0.50156253 0.4984375
## [36,] 0.50761105 0.4923889
## [37,] 0.43738700 0.5626130
## [38,] 0.41359134 0.5864087
## [39,] 0.32134520 0.6786548
## [40,] 0.20701444 0.7929856
## [41,] 0.72504158 0.2749584
## [42,] 0.44739902 0.5526010
## [43,] 0.52116275 0.4788372
## [44,] 0.73618225 0.2638178
## [45,] 0.76410158 0.2358984
## [46,] 0.66462928 0.3353707
## [47,] 0.44826685 0.5517331
## [48,] 0.49145995 0.5085400
## [49,] 0.49219150 0.5078085

OK that looks more realistic. So, what can you say about the origin of grunts? On reefs or not? The likelihoods for either are pretty much the same so we cannot tell with any amount of certainty. Some other marginal likelihoods seem to much more strongly facor one habitat over the other though, especially for shallower nodes. Let’s take a look at that.

#make very narrow margins
par(mar=c(1,0,0,1))

#plot tree
plot(tree, label.offset=1.5, cex=0.5); add.scale.bar()

#add tiplabels
tiplabels(pch=21, col="black", adj=1, bg=mycol, cex=1.3)

#add nodelabels (pie charts representing likelihoods)
nodelabels(node=1:tree$Nnode+Ntip(tree), pie=fit.ER$lik.anc, cex=0.35, piecol=c("red", "blue"))

#Note that there is a more elegant way to ensure the colors are in the correct order but we are skipping that for sake of simplicity.

Please describe the evolutionary history of habitat occupation in grunts in one paragraph.

Next, please explore the ace-function in more detail. What rate transition models other than “ER” are available? Do you get different results when choosing other models?

I would like to briefly introduce an alternate approach for reconstructing discrete character states — stochastic character mapping, oiginally introduced by Huelsenbeck et al. (2003). Stochastic character mapping is great as it provides many samples of more complete histories of discrete traits over all branches of the tree, not just the nodes. Evolutionary transitions will also occur along the branches, and sometimes there may even be more than one change on the same branch. One often performs stochastic character mapping over entire tree sets, creating thousands of samples, but let’s stick to a single tree.

sim.hab <- make.simmap(scaled.tree, z, model="ER")
## make.simmap is sampling character histories conditioned on the transition matrix
## 
## Q =
##           non-reef      reef
## non-reef -7.070648  7.070648
## reef      7.070648 -7.070648
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## non-reef     reef 
##      0.5      0.5
## Done.
cols <- setNames(c("blue","red"),c("reef", "non-reef"))
plotSimmap(sim.hab, cols, fsize=0.7, ftype="i")