A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Protein Structure Prediction Basics

Graduate level, depth 237 in the knowledge graph

7 topics build on this

1,452 prerequisites beneath it

Prerequisites: Amino Acid Structure and Properties; Translation: RNA to Protein; and 2 more
Leads to: Machine Learning in Genomics; Proteomics Data Analysis

Core Idea

Protein structure prediction aims to determine a protein's three-dimensional structure from its amino acid sequence. Homology modeling builds a structure by mapping the target sequence onto an experimentally determined template structure from a related protein. Threading (fold recognition) matches sequences to known structural folds even without clear sequence homology. Ab initio methods predict structure from physical principles or learned patterns without templates. AlphaFold2 revolutionized the field by applying deep learning, trained on known structures, to multiple sequence alignments, predicting structures with near-experimental accuracy for many proteins.

How It's Best Learned

Look up a protein in the AlphaFold Protein Structure Database (for example, by its UniProt accession) and examine the predicted structure, paying attention to the per-residue confidence score (pLDDT). Compare regions of high and low confidence to known structural features (ordered domains vs. disordered loops). Then build a homology model of the same protein with SWISS-MODEL and compare the two models, noting where they agree and where they diverge.
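As a starting point for the pLDDT inspection described above, here is a minimal sketch in Python. It relies on the fact that AlphaFold model files store pLDDT in the B-factor column of each atom record, and it uses the confidence bands shown on AlphaFold database pages (above 90, 70 to 90, 50 to 70, below 50). The helper `atom_line` and the four residues with their scores are synthetic, made up purely so the example runs without downloading a file.

```python
from collections import Counter


def plddt_per_residue(pdb_text):
    """Return {residue_number: pLDDT} read from the B-factor column of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns (1-indexed): atom name 13-16,
        # residue number 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def confidence_band(plddt):
    """Map a pLDDT value to the bands used on AlphaFold database pages."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


def atom_line(serial, res_name, res_seq, plddt):
    """Hypothetical helper: build one synthetic CA record with placeholder coordinates."""
    return (f"ATOM  {serial:5d}  CA  {res_name:3s} A{res_seq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")


# Synthetic data for illustration only; a real file would come from the database.
example = "\n".join([
    atom_line(1, "MET", 1, 42.10),
    atom_line(2, "LYS", 2, 68.55),
    atom_line(3, "LEU", 3, 93.20),
    atom_line(4, "VAL", 4, 88.00),
])

scores = plddt_per_residue(example)
bands = Counter(confidence_band(s) for s in scores.values())
for residue, score in scores.items():
    print(residue, score, confidence_band(score))
```

With a real model file, reading its text into `plddt_per_residue` would give the same kind of per-residue table, which can then be compared against annotated domains and disordered regions.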

Common Misconceptions

AlphaFold does not simulate the physical process of protein folding — it predicts the final folded structure using learned patterns from known structures and evolutionary co-variation.
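Predicted structures are not experimental data; confidence scores (pLDDT for per-residue confidence, PAE, the predicted aligned error, for the relative placement of residue pairs) must be checked, and low-confidence regions may not be reliable for detailed functional interpretation.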

Explainer

The amino acid sequence of a protein determines its three-dimensional structure, which in turn determines its function. But going from sequence to structure computationally — the "protein folding problem" — was one of the grand challenges of biology for fifty years. Understanding the approaches, even at a high level, is essential because structural information increasingly drives functional annotation, drug design, and interpretation of genetic variants.

Homology modeling is the oldest and most intuitive approach. If a protein's sequence is similar to that of a protein whose structure has been experimentally determined (by X-ray crystallography, cryo-EM, or NMR), you can use that known structure as a template. The steps are: (1) find a template using BLAST or HMM searches against the Protein Data Bank (PDB); (2) align the target sequence to the template; (3) build a model by copying the template's backbone coordinates and adjusting for insertions, deletions, and substitutions; (4) refine the model. Accuracy depends on sequence identity to the template: above 50%, models are generally reliable; below 30%, the alignment becomes uncertain and the model unreliable.
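The identity thresholds above can be checked with a few lines of Python. This is a sketch under stated assumptions: identity is computed over aligned columns where neither sequence has a gap (other conventions for the denominator exist), the two aligned sequences are invented, and the wording for the 30 to 50% range is an illustrative label, not a rule from the text.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over columns where both sequences have a residue."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # skip insertions and deletions
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def template_outlook(identity):
    """Apply the 50% and 30% rules of thumb quoted in the text."""
    if identity > 50:
        return "generally reliable"
    if identity < 30:
        return "alignment uncertain, model unreliable"
    # Assumption: the text does not characterize this range.
    return "intermediate: inspect the alignment carefully"


# Made-up aligned sequences ('-' marks a gap), for illustration only.
target = "MKT-AYIAKQR"
template = "MKTLAYLA-QR"

identity = percent_identity(target, template)
print(f"{identity:.1f}% identity: {template_outlook(identity)}")
```

Here 8 of the 9 gap-free columns match, so the toy pair lands well above the 50% mark.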

Threading (fold recognition) extends this idea to cases where sequence similarity is undetectable but structural similarity exists — proteins can adopt similar folds despite having diverged beyond sequence recognition. Threading methods fit the target sequence into each fold in a library of known structures and evaluate the compatibility using energy functions. This approach bridges the gap between homology modeling and truly ab initio prediction, recognizing that the universe of protein folds is much smaller than the universe of protein sequences.
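To make the idea of scoring a sequence against each fold in a library concrete, here is a deliberately tiny Python sketch. Everything in it is a simplification chosen for illustration: each fold is reduced to a string of buried (`B`) and exposed (`E`) positions, the "energy function" is a plus or minus one burial match score, the hydrophobic residue set is a rough grouping, and the sequence is placed without gaps. Real threading programs use richer potentials and gapped alignments. The fold names and the sequence are hypothetical.

```python
# Rough grouping of hydrophobic residues; an assumption for this toy score.
HYDROPHOBIC = set("AVILMFWC")


def thread_score(sequence, environments):
    """Score a gapless placement of a sequence onto a fold's burial pattern.

    +1 when a hydrophobic residue sits at a buried position or a polar
    residue sits at an exposed one, -1 otherwise.
    """
    if len(sequence) != len(environments):
        raise ValueError("this toy version only handles equal lengths")
    score = 0
    for residue, env in zip(sequence, environments):
        buried = env == "B"
        hydrophobic = residue in HYDROPHOBIC
        score += 1 if buried == hydrophobic else -1
    return score


def best_fold(sequence, fold_library):
    """Return the best-scoring fold name and all scores."""
    scores = {name: thread_score(sequence, envs)
              for name, envs in fold_library.items()}
    return max(scores, key=scores.get), scores


# Hypothetical fold library: B = buried position, E = exposed position.
fold_library = {
    "fold_alternating": "BEBEBEBE",
    "fold_blocky": "BBBBEEEE",
}

winner, scores = best_fold("LKVEIDAS", fold_library)
print(winner, scores)
```

The alternating hydrophobic and polar pattern of the toy sequence fits the alternating burial pattern perfectly, which is the kind of compatibility signal threading exploits even when no sequence similarity to the fold's original protein is detectable.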

AlphaFold2 (presented at CASP14 in 2020) transformed the field by achieving near-experimental accuracy for most protein domains. Its key insight is that evolutionary co-variation in multiple sequence alignments (MSAs) encodes structural contact information: if two positions consistently co-vary across homologous sequences, the corresponding residues likely interact in 3D. AlphaFold2's neural network architecture (particularly the Evoformer module) processes MSA features and pairwise residue relationships through iterative attention mechanisms, producing 3D coordinates along with confidence estimates. The AlphaFold Protein Structure Database now contains predicted structures for over 200 million proteins, making structural information available for nearly every protein sequence catalogued in UniProt. However, AlphaFold predictions still carry uncertainty (reflected in pLDDT and PAE scores), and the method struggles with proteins that have few homologs (and therefore shallow MSAs), with intrinsically disordered regions, and with complexes whose interaction partners are not specified.
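The co-variation signal mentioned above can be illustrated with mutual information between alignment columns, a classical statistic that predates AlphaFold2. The Python sketch below is not how AlphaFold2 works internally (the network learns from the MSA directly instead of computing this score), and real co-variation pipelines also correct for phylogenetic bias and small sample sizes. The four-sequence alignment is invented so that columns 0 and 3 co-vary.

```python
from collections import Counter
from itertools import combinations
from math import log2


def column(msa, i):
    """Residues found at alignment position i, one per sequence."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    counts_i = Counter(column(msa, i))
    counts_j = Counter(column(msa, j))
    counts_ij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


def top_covarying_pair(msa):
    """Column pair with the highest mutual information."""
    length = len(msa[0])
    return max(combinations(range(length), 2),
               key=lambda pair: mutual_information(msa, *pair))


# Toy alignment: column 0 (A/D) changes together with column 3 (E/K),
# column 1 is conserved, column 2 varies independently.
msa = ["AKLE", "AKIE", "DKLK", "DKIK"]

print(top_covarying_pair(msa))
print(mutual_information(msa, 0, 3), mutual_information(msa, 0, 2))
```

Columns 0 and 3 share one full bit of information, while the independently varying column 2 shares none with column 0. A residue pair with a strong signal of this kind is a candidate 3D contact.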
