Pairwise AlignmentGlobal & local alignmentAnders Gorm Pedersen
Molecular Evolution Group
Center for Biological Sequence Analysis
Pairwise AlignmentGlobal & local alignment
Anders Gorm Pedersen
Molecular Evolution Group
Center for Biological Sequence Analysis
Sequences are related Darwin: all organisms are related through descent with modification
=> Sequences are related through descent with modification
=> Similar molecules have similar functions in different organisms Phylogenetic tree based on ribosomal RNA:
three domains of life
Why compare sequences?
Determination of evolutionary relationships
Prediction of protein function and structure (database searches). Protein 1: binds oxygen Sequence similarity Protein 2: binds oxygen ?
Dotplots: visual sequence comparison
Place two sequences along axes of plot
Place dot at grid points where two sequences have identical residues
Diagonals correspond to conserved regions
Simple scoring scheme (too simple in fact…):
Matching amino acids: 5
Mismatch: 0
Scoring example: K A W S A D V
: : : : :
K D W S A E V
5+0+5+5+5+0+5 = 25
Serine (S) and Threonine (T) have similar physicochemical properties Aspartic acid (D) and Glutamic acid (E) have similar properties Substitution of S/T or E/D occurs relatively often during evolution => Substitution of S/T or E/D should result in scores that are only moderately lower than identities =>
Protein substitution matrices
A 5
R -2 7
N -1 -1 7
D -2 -2 2 8
C -1 -4 -2 -4 13
Q -1 1 0 0 -3 7
E -1 0 0 2 -3 2 6
G 0 -3 0 -1 -3 -2 -3 8
H -2 0 1 -1 -3 1 0 -2 10
I -1 -4 -3 -4 -2 -3 -4 -4 -4 5
L -2 -3 -4 -4 -2 -2 -3 -4 -3 2 5
K -1 3 0 -1 -3 2 1 -2 0 -3 -3 6
M -1 -2 -2 -4 -2 0 -2 -3 -1 2 3 -2 7
F -3 -3 -4 -5 -2 -4 -3 -4 -1 0 1 -4 0 8
P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10
S 1 -1 1 0 -1 0 -1 0 -1 -3 -3 0 -2 -3 -1 5
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 2 5
W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1 1 -4 -4 -3 15
Y -2 -1 -2 -3 -3 -1 -2 -3 2 -1 -1 -2 0 4 -3 -2 -2 2 8
V 0 -3 -3 -4 -1 -3 -3 -4 -4 4 1 -3 1 -1 -3 -2 0 -3 -1 5
A R N D C Q E G H I L K M F P S T W Y V BLOSUM50 matrix:
Positive scores on diagonal (identities)
Similar residues get higher (positive) scores
Dissimilar residues get smaller (negative) scores
K L A A S V I L S D A L
K L A A - - - - S D A L -10 + 3 x (-1)=-13 Affine gap penalties:
Multiple insertions/deletions may be one evolutionary event =>
Separate penalties for gap opening and gap elongation
Protein substitution matrices
A 5
R -2 7
N -1 -1 7
D -2 -2 2 8
C -1 -4 -2 -4 13
Q -1 1 0 0 -3 7
E -1 0 0 2 -3 2 6
G 0 -3 0 -1 -3 -2 -3 8
H -2 0 1 -1 -3 1 0 -2 10
I -1 -4 -3 -4 -2 -3 -4 -4 -4 5
L -2 -3 -4 -4 -2 -2 -3 -4 -3 2 5
K -1 3 0 -1 -3 2 1 -2 0 -3 -3 6
M -1 -2 -2 -4 -2 0 -2 -3 -1 2 3 -2 7
F -3 -3 -4 -5 -2 -4 -3 -4 -1 0 1 -4 0 8
P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10
S 1 -1 1 0 -1 0 -1 0 -1 -3 -3 0 -2 -3 -1 5
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 2 5
W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1 1 -4 -4 -3 15
Y -2 -1 -2 -3 -3 -1 -2 -3 2 -1 -1 -2 0 4 -3 -2 -2 2 8
V 0 -3 -3 -4 -1 -3 -3 -4 -4 4 1 -3 1 -1 -3 -2 0 -3 -1 5
A R N D C Q E G H I L K M F P S T W Y V BLOSUM50 matrix:
Positive scores on diagonal (identities)
Similar residues get higher (positive) scores
Dissimilar residues get smaller (negative) scores
Pairwise alignment
Optimal alignment:
alignment having the highest possible score given a substitution matrix and a set of gap penalties
Pairwise alignment: the problem
The number of possible pairwise alignments increases explosively with the length of the sequences:
Two protein sequences of length 100 amino acids can be aligned in approximately 1060 different ways
Time needed to test all possibilities is same order of magnitude as the entire lifetime of the universe.
Pairwise alignment: the solution
”Dynamic programming”
(the Needleman-Wunsch algorithm)
Alignment depicted as path in matrix
T C G C A
T
C
C
A T C G C A
T
C
C
A TCGCA
TC-CA TCGCA
T-CCA
Alignment depicted as path in matrix
T C G C A
T
C
C
A x Meaning of point in matrix: all residues up to this point have been aligned (but there are many different possible paths). Position labeled “x”: TC aligned with TC
--TC -TC TC
TC-- T-C TC
Dynamic programming: computation of scores
T C G C A
T
C
C
A x Any given point in matrix can only be reached from three possible previous positions (you cannot “align backwards”).
=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
Comments