
A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Phylogenetic Tree Construction

Graduate · Depth 237 in the knowledge graph

31 topics build on this

1,452 prerequisites beneath it



Core Idea

Phylogenetic trees depict evolutionary relationships among sequences or species, inferred from aligned molecular data. Distance-based methods (e.g., neighbor-joining) cluster sequences by pairwise distances. Character-based methods (maximum parsimony, maximum likelihood, Bayesian inference) evaluate alternative tree topologies against the alignment data. Maximum likelihood finds the tree that makes the observed data most probable given a model of sequence evolution. Bootstrap values and Bayesian posterior probabilities assess statistical support for each branch. Tree construction requires choosing an appropriate substitution model and rooting strategy.

How It's Best Learned

Build a neighbor-joining tree and a maximum likelihood tree from the same MSA of 10-15 orthologous sequences. Compare the topologies and bootstrap support values. Experiment with different substitution models (JC69 vs. GTR) and observe how model choice affects branch lengths and topology.

Common Misconceptions

A phylogenetic tree does not show which species evolved from which — it shows patterns of shared ancestry and relative divergence.
High bootstrap support (e.g., 95%) does not mean the branch is certainly correct; it means the data consistently support that grouping when resampled.

Explainer

Phylogenetic trees are the primary tool for representing evolutionary relationships, and molecular sequence data has become the dominant source of information for building them. Given a multiple sequence alignment, the question is: what tree topology (branching pattern) and branch lengths best explain the observed pattern of similarities and differences? Different methods answer this question in fundamentally different ways.

Distance-based methods convert the MSA into a matrix of pairwise evolutionary distances (corrected for multiple substitutions at the same site), then build a tree that approximates those distances. Neighbor-joining (NJ) is the most widely used distance method: it iteratively joins the pair of nodes (sequences or previously joined clusters) that minimizes the total branch length of the tree, adjusting for each node's average distance to all other nodes. NJ is fast (O(n³) for n sequences) and produces reasonable trees, making it useful for quick exploratory analyses and very large datasets. But it reduces the full alignment to pairwise distances, losing information about which specific sites support which groupings.
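The two steps described above (distance correction, then iterative joining) can be sketched in a few lines of Python. This is a minimal illustration, not how any particular package implements it: the function names `jc69_distance` and `neighbor_joining`, the `frozenset`-keyed distance dictionary, and the `node0`, `node1`, ... labels for internal nodes are all assumptions made for this example. The correction shown is Jukes-Cantor (JC69), and gaps are simply skipped.

```python
import math
from itertools import combinations

def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance between two aligned DNA sequences."""
    # Ignore columns where either sequence has a gap (a simplifying assumption).
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    p = sum(a != b for a, b in pairs) / len(pairs)  # observed fraction of differences
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)

def neighbor_joining(names, dist):
    """Build an unrooted NJ tree.

    dist maps frozenset({a, b}) -> distance.
    Returns a list of (node, neighbor, branch_length) edges.
    """
    nodes = list(names)
    d = dict(dist)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all others.
        totals = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}

        def q(pair):
            # NJ selection criterion: small Q means joining this pair
            # gives the smallest total branch length.
            i, j = pair
            return (n - 2) * d[frozenset(pair)] - totals[i] - totals[j]

        i, j = min(combinations(nodes, 2), key=q)
        dij = d[frozenset((i, j))]
        # Branch lengths from the joined pair to their new parent node.
        li = 0.5 * dij + (totals[i] - totals[j]) / (2 * (n - 2))
        lj = dij - li
        new = f"node{next_id}"
        next_id += 1
        edges += [(i, new, li), (j, new, lj)]
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))  # connect the last two nodes
    return edges
```

The triple loop structure (n rounds, each scanning all pairs) is where the O(n³) cost comes from. Note that the function only ever sees the distance dictionary, never the alignment columns, which is the information loss mentioned above.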

Maximum likelihood (ML) takes a fundamentally different approach. For a candidate tree topology with given branch lengths, it calculates the probability of each observed alignment column under a specified model of sequence evolution and, treating columns as independent, multiplies these probabilities across all columns (in practice, sums their logarithms) to get the likelihood of the entire dataset given that tree. The tree with the highest likelihood among those evaluated is selected. This approach uses all the information in the alignment and explicitly models the evolutionary process, but an exhaustive search is infeasible because the number of possible topologies grows super-exponentially with the number of sequences. Software like RAxML and IQ-TREE therefore uses heuristic search strategies to navigate this space efficiently.
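To make the per-column calculation concrete, here is a small sketch of the likelihood of one fixed tree under the JC69 model, using the standard recursive ("pruning") computation from the tips toward the root. The tree encoding (a leaf is a name string, an internal node is a list of `(child, branch_length)` pairs), the function names, and the restriction to ungapped, unambiguous DNA are assumptions of this example. It scores a single tree; it does not search topologies or optimize branch lengths the way RAxML or IQ-TREE do.

```python
import math

BASES = "ACGT"

def jc69_prob(x, y, t):
    """P(base y at the end of a branch | base x at its start) under JC69.

    t is the branch length in expected substitutions per site.
    """
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def partials(node, column):
    """Conditional likelihoods: P(tips observed below node | node has base x)."""
    if isinstance(node, str):  # leaf: probability 1 for the observed base
        return {x: 1.0 if column[node] == x else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:
        child_l = partials(child, column)
        for x in BASES:
            # Sum over the unknown base y at the child end of the branch.
            result[x] *= sum(jc69_prob(x, y, t) * child_l[y] for y in BASES)
    return result

def log_likelihood(tree, alignment):
    """Sum of per-column log-likelihoods (columns treated as independent)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for s in range(n_sites):
        column = {name: seq[s] for name, seq in alignment.items()}
        root_l = partials(tree, column)
        # JC69 has equal base frequencies of 0.25 at the root.
        total += math.log(sum(0.25 * root_l[x] for x in BASES))
    return total

def num_unrooted_topologies(n):
    """Number of unrooted bifurcating topologies for n >= 3 tips: (2n-5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count
```

For example, `num_unrooted_topologies(10)` returns 2027025, so even the 10 to 15 sequences suggested above already rule out scoring every topology by hand, and the count keeps multiplying by a larger odd number with each added sequence.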

Bayesian inference (implemented in MrBayes and BEAST) extends ML by incorporating prior probabilities on tree topologies, branch lengths, and model parameters, using Markov chain Monte Carlo (MCMC) sampling to explore the posterior distribution. Rather than returning a single best tree, Bayesian methods return a distribution of trees weighted by their posterior probability, naturally providing measures of uncertainty. Bayesian posterior probabilities on branches tend to be higher than bootstrap values for the same data, and interpreting them correctly requires understanding MCMC convergence diagnostics.
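The idea that a branch's posterior probability comes from a distribution of trees can be illustrated with a short sketch: given trees sampled by MCMC, the support for a grouping is roughly the fraction of retained samples that contain it. Everything here is a simplifying assumption for illustration: trees are rooted nested tuples of tip names, groupings are treated as rooted clades rather than unrooted bipartitions, and the `burn_in_fraction` default of 0.25 is an arbitrary choice, not a recommendation. It does not run MCMC and is not how MrBayes or BEAST summarize their output internally.

```python
from collections import Counter

def clades(tree):
    """Return the set of clades (frozensets of tip names) in a nested-tuple tree."""
    found = set()

    def tips(node):
        if isinstance(node, str):  # a tip contributes only its own name
            return frozenset([node])
        below = frozenset().union(*(tips(child) for child in node))
        found.add(below)  # every internal node defines one clade
        return below

    tips(tree)
    return found

def clade_posteriors(sampled_trees, burn_in_fraction=0.25):
    """Estimate clade posterior probabilities as frequencies among sampled trees.

    The first burn_in_fraction of the samples is discarded, on the assumption
    that the chain had not yet converged when they were drawn.
    """
    kept = sampled_trees[int(len(sampled_trees) * burn_in_fraction):]
    counts = Counter(c for tree in kept for c in clades(tree))
    return {c: n / len(kept) for c, n in counts.items()}
```

A usage example with a hypothetical sample of eight trees: if the first two are discarded as burn-in and four of the remaining six contain the grouping (A, B), that grouping gets an estimated posterior of 4/6. The estimate is only meaningful if the chain has converged, which is why convergence diagnostics matter.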

Regardless of method, the resulting tree must be evaluated critically. Bootstrap analysis for ML/NJ and posterior probabilities for Bayesian trees indicate how strongly the data support each branch. An unrooted tree shows relative relationships but not the direction of evolution; rooting (typically with an outgroup) is needed to infer ancestor-descendant relationships. And the tree reflects the history of the sequences analyzed, which may not match the species tree if gene duplication, horizontal transfer, or incomplete lineage sorting has occurred.
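Bootstrap support can be sketched as follows: resample alignment columns with replacement, rebuild the tree from each pseudo-alignment, and report how often each grouping from the original tree reappears. This is a minimal illustration under stated assumptions. The tree-building step is left to a caller-supplied function, here given the hypothetical name `build_clades`, which must take an alignment dictionary and return a set of groupings (for example `frozenset`s of sequence names); the replicate count and seed defaults are arbitrary.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are used for every sequence, so columns stay intact.
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def bootstrap_support(alignment, build_clades, n_replicates=100, seed=0):
    """Percentage of bootstrap replicates that recover each original grouping.

    build_clades is a hypothetical caller-supplied function:
    alignment dict -> set of groupings found in the inferred tree.
    """
    rng = random.Random(seed)
    reference = build_clades(alignment)  # groupings in the tree from the real data
    counts = {c: 0 for c in reference}
    for _ in range(n_replicates):
        replicate = build_clades(bootstrap_alignment(alignment, rng))
        for c in reference:
            if c in replicate:
                counts[c] += 1
    return {c: 100.0 * k / n_replicates for c, k in counts.items()}
```

Because every replicate is drawn from the same alignment, a high percentage shows that the signal for a grouping is spread consistently across sites. It cannot reveal a bias shared by all sites, such as a poorly fitting substitution model.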
