A brief intro to phylogenetic trees in R

I wrote this little tutorial as an introductory chapter for a NESCent Academy on Macroevolution back in July 2014. It’s meant to provide a brief overview of the basic structure of tree objects in R and illustrate some of the tree manipulation and visualization options. There is a multitude of other options out there, but this is a good start. Once you have mastered this, explore many more options in Emmanuel Paradis’ book (Analysis of Phylogenetics and Evolution with R) and Liam Revell’s blog.

Let’s first load some packages (and install if necessary) that we need for this exercise. Packages are libraries of functions that have been made available for specific tasks and applications.

if(!require(ape)){
    install.packages("ape") #'ape' will be installed if needed
    library(ape)
}
## Loading required package: ape
if(!require(geiger)){
    install.packages("geiger") #same for 'geiger'
    library(geiger)
}
## Loading required package: geiger

Trees as “phylo” objects

Let’s begin by simulating a tree. There are many options for doing this. We’ll use the rtree() function of the ape package. The tree is stored as an object called “phy”. We are simulating a tree with 30 tips (terminal edges) and plot it below.

phy <- rtree(n=30) #n specifies the number of tips we want.
plot(phy)

So, now we wonder how the phylogenetic information is encoded. First let’s find out what the class of the object “phy” is.

class(phy) #OK, it's a 'phylo' object.
## [1] "phylo"

By typing ‘phy’ in the command line one can retrieve some basic information about the object.

phy #a tree with n tips and (n-1) nodes, tip labels, the tree is rooted, and there are branch lengths.
## 
## Phylogenetic tree with 30 tips and 29 internal nodes.
## 
## Tip labels:
##  t13, t28, t8, t22, t19, t16, ...
## 
## Rooted; includes branch lengths.

But how is this information organized within the phylo object? We can find out with the str() function, which displays the structure of an R object.

str(phy)
## List of 4
##  $ edge       : int [1:58, 1:2] 31 32 33 34 35 36 37 37 38 38 ...
##  $ tip.label  : chr [1:30] "t13" "t28" "t8" "t22" ...
##  $ edge.length: num [1:58] 0.2235 0.4492 0.0535 0.9588 0.2777 ...
##  $ Nnode      : int 29
##  - attr(*, "class")= chr "phylo"
##  - attr(*, "order")= chr "cladewise"

The output tells us that a phylo object is a list of four components (+ two attributes): \(edge\), \(tip.label\), \(edge.length\), and \(Nnode\). We can display these items separately.

head(phy$edge) #displaying the first 5 rows, very useful when doing a quick check of larger objects
##      [,1] [,2]
## [1,]   31   32
## [2,]   32   33
## [3,]   33   34
## [4,]   34   35
## [5,]   35   36
## [6,]   36   37
phy$tip.label
##  [1] "t13" "t28" "t8"  "t22" "t19" "t16" "t7"  "t11" "t4"  "t24" "t17"
## [12] "t26" "t15" "t1"  "t12" "t30" "t27" "t6"  "t29" "t3"  "t20" "t25"
## [23] "t10" "t21" "t23" "t14" "t18" "t9"  "t5"  "t2"
phy$edge.length
##  [1] 0.223515275 0.449214761 0.053517260 0.958796252 0.277726818
##  [6] 0.133038171 0.985922320 0.192603189 0.704616801 0.701704861
## [11] 0.001569578 0.584744789 0.042474759 0.123435829 0.864143272
## [16] 0.845269604 0.716470833 0.096034075 0.716695571 0.915777217
## [21] 0.720790768 0.811228354 0.426972809 0.172463380 0.976097200
## [26] 0.378397654 0.498555075 0.377029418 0.632655868 0.517029825
## [31] 0.405517858 0.796377597 0.498934285 0.734478435 0.233717124
## [36] 0.977225052 0.455053014 0.640186676 0.018236970 0.134822021
## [41] 0.442633845 0.533804570 0.415390895 0.397663777 0.198261676
## [46] 0.339903616 0.173343718 0.993939807 0.176320243 0.087404924
## [51] 0.113624431 0.108095456 0.150655877 0.317285047 0.572281965
## [56] 0.019839419 0.529848145 0.998013315
phy$Nnode
## [1] 29

And one can also store these components as new objects:

branches <- phy$edge
species <- phy$tip.label
brlength <- phy$edge.length
nodes <- phy$Nnode

But what does this all mean? You may also wonder why we are spending so much time on this, but you will see throughout the semester that understanding the basic structure of phylo objects in R is absolutely important.

Let’s further explore this with a hand-written tree. Assume you have a tree with 6 species (A through F). The phylogenetic relationships of species A:F can be described in bracket form (“parenthetic format”, or Newick). Before you continue, let’s pause for a second to explain the bracket form of trees. [Group exercise]

mini.phy <- read.tree(text = "((((A,B), C), (D,E)),F);") #defining a tree in bracket form
plot(mini.phy, cex=2)

The structure of the object contains only three components this time because we didn’t provide branch length:

str(mini.phy)
## List of 3
##  $ edge     : int [1:10, 1:2] 7 8 9 10 10 9 8 11 11 7 ...
##  $ tip.label: chr [1:6] "A" "B" "C" "D" ...
##  $ Nnode    : int 5
##  - attr(*, "class")= chr "phylo"
##  - attr(*, "order")= chr "cladewise"

But how are the edges defined? When we type…

mini.phy$edge
##       [,1] [,2]
##  [1,]    7    8
##  [2,]    8    9
##  [3,]    9   10
##  [4,]   10    1
##  [5,]   10    2
##  [6,]    9    3
##  [7,]    8   11
##  [8,]   11    4
##  [9,]   11    5
## [10,]    7    6

… we see a matrix of 10 rows and 2 columns. This matrix represents unique combinations of node- and tip numbers, defining each branch segment of the tree.

plot(mini.phy, label.offset=0.2)  #the label.offset argument moves the species names a bit to the right (try different values!)
nodelabels() #add node numbers
tiplabels() #add tip numbers

For example, the branch (or edge) leading to species “C” is identified by row 6: 9, 3. Both nodelabel() and tiplabel() function are quite flexible and afford the opportunity to visually anhance your tree plot. Let’s explore this further!

?tiplabels

And here’s an example for our ‘mini.phy’ object. We begin by creating an vector of different colors, with the same length as number of species in the tree.

mycol<-c("blue", "blue", "blue", "red", "red", "red") #feel free to pick other colors...

Now we plot the tree, moving the taxon names a bit to the right, and add the tiplabels without text, using symbols instead.

plot(mini.phy, adj=0, label.offset=0.75) #adjust if necessary/desired!
tiplabels(pch=21, col="black", adj=1, bg=mycol, cex=2) #ditto! try different values!

Swapping sisterclades, identifying clades/tips, dropping tips

It sometimes may be quite useful to rotate the tree about a specific node, i.e. swap sister clades. This can be carried out with the rotate() function. Let’s continue to work with mini.phy:

plot(mini.phy, label.offset=0.2)  #same as before
nodelabels() #add node numbers
tiplabels() #add tip numbers

How about we swap clades (D, E) and (A, B, C)? Their most recent common ancestor is found at node 8.

rot.phy <- rotate(mini.phy, node=8)

And now let’s see what happenend:

plot(rot.phy, label.offset=0.2)   #still the same as before
nodelabels()          
tiplabels()  

It will also be very helpful to select all tips in a given clade. This is implemented in the geiger package; the tips() function finds all descendants of a node.

cladeABC <- tips(rot.phy, node=9) # node 9 defines the clade composed of (A, B, C)
cladeABC
## [1] "A" "B" "C"

Another helpful command allows for tree pruning, i.e. cutting of tips or entire clades. For example, one can delete the entire cladeABC group:

pruned.phy <- drop.tip(rot.phy, cladeABC)
plot(pruned.phy, label.offset=0.2)
nodelabels()
tiplabels() #did it work?

Or we can drop tips (one or multiple) randomly. Liam Revell explained how to do this nicely on his blog

To prune tips, say m=2 random tips, enter:

m <- 2
pruned.phy2 <- drop.tip(rot.phy, sample(rot.phy$tip.label)[1:m]) #m=1 drops a single tip, of course!
plot(pruned.phy2, label.offset=0.2)   

It may also be useful to select all branches in a specific group of tips. This is implemented in the ape package; the which.edge() function finds all edges in a specified group. For example, let’s identify all branches of the cladeABC group as defined above.

cladeABCbranches <- which.edge(rot.phy, cladeABC) #cladeABC was defined earlier, using the tips() function
cladeABCbranches #this should be a numerical vector containing 6, 7, 8, 9
## [1] 6 7 8 9

And as we can see, rows 6-9 of the \(edge.length\) matrix represent the expected branches. Let’s first plot the tree, and then look at the \(edge\) matrix for cross-checking.

plot(rot.phy, label.offset=0.2) #we are sticking to the same plot settings
nodelabels()
tiplabels()

rot.phy$edge
##       [,1] [,2]
##  [1,]    7    8
##  [2,]    8   11
##  [3,]   11    4
##  [4,]   11    5
##  [5,]    8    9
##  [6,]    9   10
##  [7,]   10    1
##  [8,]   10    2
##  [9,]    9    3
## [10,]    7    6

Here would be a good opportunity to show you how to assign different branch colors. For example, how can one emphasize the branches of the clade formed by A, B, and C? We first create a color vector, only consisting of grey colors. Then we’ll assign black to all branches of clade ABC.

clcolr <- rep("darkgrey", dim(rot.phy$edge)[1]) #defining a color vector; look up rep() and dim() functions!
clcolr[cladeABCbranches] <- "red" #specifying color of branches in clade (A, B, C)
plot(rot.phy, edge.color=clcolr, edge.width=2) #also making the branches a little thicker

Let’s conclude this section with one last exercise: combining trees. Assume you have two different phylogenies, with two different sets of taxa (no overlap). Another assumption is that you have knowledge how the trees may fit together. Then the bind.tree() function of ape package can help. The function takes in two phylo objects. The position of where the trees are bound is defined by tip- or node number within the first tree. Note that you can also specify the “root” as binding position.

#the first tree
tree1 <- rtree(n=10)
tree1$tip.label <- rep("apples", 10) #renaming labels to enhance readability
plot(tree1, label.offset=0.1); nodelabels(); tiplabels()