Why does AlphaFold rely on multiple sequence alignments (MSAs), and what happens when the MSA is shallow (few homologs)?
Think about your answer, then reveal below.
Model answer: MSAs encode evolutionary information: residues that co-evolve (mutate in a correlated manner across homologs) are often in spatial proximity in the 3D structure. These co-evolutionary signals are the primary source of information that AlphaFold uses to infer residue-residue contacts and, ultimately, 3D structure. When the MSA is shallow (the target protein has few homologs in sequence databases, as for orphan proteins or recently evolved sequences), co-evolutionary signals are weak or absent, and AlphaFold's accuracy drops significantly. Single-sequence methods (like ESMFold, which uses protein language model embeddings instead of MSAs) partially address this but are generally less accurate than MSA-based methods when a deep MSA is available. The dependence on evolutionary information means AlphaFold is least reliable where it is most needed: for structurally novel proteins.
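To make "co-evolutionary signal" concrete, here is a minimal sketch that scores correlated mutation between two alignment columns with mutual information. This is a classical, simplified proxy and not what AlphaFold computes (AlphaFold learns from the raw MSA with a neural network). The function name `column_mutual_information` and the toy alignment are hypothetical, invented for illustration.

```python
from collections import Counter
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    `msa` is a list of equal-length aligned sequences (illustrative input;
    real pipelines also handle gaps, sequence weighting, and corrections
    for phylogenetic bias, all of which are omitted here).
    """
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)

    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        p_a = col_i[a] / n
        p_b = col_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment (made up): column 1 (K/R) and column 3 (E/D) always change
# together, column 2 (L/V) varies independently, column 0 is fully conserved.
toy_msa = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
    "AKVE",
    "ARVD",
]

coupled = column_mutual_information(toy_msa, 1, 3)      # strongly coupled pair
uncoupled = column_mutual_information(toy_msa, 1, 2)    # independent pair
shallow = column_mutual_information(toy_msa[:1], 1, 3)  # single-sequence "MSA"
```

In this toy case the coupled pair scores 1 bit, the independent pair scores approximately 0, and the single-sequence alignment scores exactly 0 for every column pair. The last case mirrors the shallow-MSA problem: with one sequence there is no variation, so no covariation can be measured however the structure is arranged.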
AlphaFold3 extends prediction to protein-nucleic acid complexes, protein-ligand interactions, and post-translational modifications, addressing some limitations of AlphaFold2. However, the fundamental dependence on evolutionary information remains, and accuracy for novel targets and interactions is still lower than for well-represented protein families.