HOMEWORK:
A) Write programs to align two sequences using
dynamic programming, weighted edit distance and
affine gap penalties for:
1) global alignment
2) end-space free variant
3) local alignment
4) global-local
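As a rough guide to how the four variants differ, the sketch below keeps the same recurrence and changes only the border initialization and the cell from which the score is read. It is deliberately simplified: it uses a linear gap cost instead of affine gaps, a toy match/mismatch score instead of the Gonnet matrix, and it reads "global-local" as "all of the first sequence against a substring of the second", which is an assumption. The function name `variant_score` and the mode strings are hypothetical.

```python
def variant_score(a, b, mode, score=None, gap=1.0):
    """Best alignment score of a and b under one of four variants.

    Simplified sketch: every gap character costs `gap` (linear, not affine).
    """
    if score is None:
        # Toy stand-in for a Gonnet lookup (assumption).
        score = lambda x, y: 1.0 if x == y else -1.0
    m, n = len(a), len(b)
    # Which borders are penalised is what separates the variants.
    pay_col0 = mode in ("global", "global-local")  # prefix of a against gaps
    pay_row0 = mode == "global"                    # prefix of b against gaps
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = -i * gap if pay_col0 else 0.0
    for j in range(1, n + 1):
        D[0][j] = -j * gap if pay_row0 else 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = max(D[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                       D[i - 1][j] - gap,
                       D[i][j - 1] - gap)
            # Local alignment may restart anywhere, so scores never drop below 0.
            D[i][j] = max(best, 0.0) if mode == "local" else best
    if mode == "global":
        return D[m][n]                       # both sequences fully aligned
    if mode == "end-space-free":
        return max(max(D[m]), max(row[n] for row in D))  # last row or column
    if mode == "local":
        return max(max(row) for row in D)    # best cell anywhere
    if mode == "global-local":
        return max(D[m])                     # all of a, any end point in b
    raise ValueError(mode)
```

With the toy score, aligning "AAACGT" with "CGTTTT" gives -3 for global, 3 for end-space free and local, and 0 for global-local, which shows how much the choice of variant matters.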
The weights are given by the 20 by 20 symmetric
matrix of Gonnet.
Gonnet Matrix
Run your programs first on short example strings
of 10-20 letters, and then run them on every ordered pair of the
7 sequences below (49 runs in total for each alignment type).
The gap initiation weight is 2.7 and the
gap extension weight is 0.15.
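One possible way to handle affine gaps in the global case is the usual three-table recurrence, sketched below. Several things here are assumptions and not part of the assignment: the matrix entries are treated as similarity scores to be maximized, a gap of length k is charged gap_open + k * gap_extend (the other common convention charges gap_open + (k - 1) * gap_extend), and `score` is a hypothetical function that would look up the Gonnet entry for two letters.

```python
NEG = float("-inf")

def global_affine_score(a, b, score, gap_open=2.7, gap_extend=0.15):
    """Global alignment score with affine gaps (sketch).

    M[i][j]: best score with a[i-1] aligned to b[j-1]
    X[i][j]: best score with a[i-1] aligned to a gap
    Y[i][j]: best score with b[j-1] aligned to a gap
    """
    m, n = len(a), len(b)
    M = [[NEG] * (n + 1) for _ in range(m + 1)]
    X = [[NEG] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):      # a prefix of a against one long gap
        X[i][0] = -(gap_open + i * gap_extend)
    for j in range(1, n + 1):      # a prefix of b against one long gap
        Y[0][j] = -(gap_open + j * gap_extend)
    start = gap_open + gap_extend  # cost of the first character of a new gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            prev = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            M[i][j] = prev + score(a[i - 1], b[j - 1])
            X[i][j] = max(M[i - 1][j] - start,
                          X[i - 1][j] - gap_extend,   # extend an open gap
                          Y[i - 1][j] - start)
            Y[i][j] = max(M[i][j - 1] - start,
                          Y[i][j - 1] - gap_extend,   # extend an open gap
                          X[i][j - 1] - start)
    return max(M[m][n], X[m][n], Y[m][n])
```

For example, with a toy score of +1 for a match and -1 for a mismatch, gap_open=2.0 and gap_extend=0.5, aligning "ACGT" with "AT" gives -1.0: two matches minus one gap of length 2 costing 3.0. Recovering the alignment itself requires storing back-pointers in the three tables, which is omitted here.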
The programs should output:
- the names of the sequences followed by their alignment score
- the alignment of the two sequences
- the number of identical letters matched.
For example, for 4hhba and 4hhbb the output should look like:
4hhba 4hhbb 73.58
V.LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF...
VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDL
...DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRV
STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHV
DPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
DPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
Identities: 62 (45%)
(The percentage in parentheses is the number of identities divided by the length of the longer sequence.)
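The identity count can be computed directly from the two aligned rows. The sketch below assumes the gap character is "." as in the example output; the function names are hypothetical.

```python
def count_identities(row_a, row_b, gap="."):
    """Number of columns where both rows hold the same letter (gaps excluded)."""
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)

def identity_percent(identities, len_a, len_b):
    """Identities as a percentage of the longer of the two original sequences."""
    return 100.0 * identities / max(len_a, len_b)
```

Here len_a and len_b are the lengths of the unaligned sequences, not of the aligned rows.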
A nicer output would show where the identities are:
V.LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF...
| | |  |  | | ||||     | | ||| |     | |   |  |
VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDL
...DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRV
   |   |   || |||||  |     || |        || ||  || |
STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHV
DPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
|| || ||   |   || |   |||| | |   |  | |   |  ||
DPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
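A marker line of this kind can be produced column by column and the alignment printed in fixed-width blocks. The sketch below assumes "." is the gap character and 50 columns per block, as in the example; the function names are hypothetical.

```python
def identity_markers(row_a, row_b, gap="."):
    """'|' under every column with two identical letters, ' ' elsewhere."""
    return "".join("|" if x == y and x != gap else " "
                   for x, y in zip(row_a, row_b))

def format_alignment(row_a, row_b, width=50, gap="."):
    """Split two aligned rows into blocks of `width` columns with markers."""
    lines = []
    for start in range(0, len(row_a), width):
        a = row_a[start:start + width]
        b = row_b[start:start + width]
        lines.extend([a, identity_markers(a, b, gap), b])
    return "\n".join(lines)
```

Both rows are assumed to have the same length, which is always true for the two rows of one alignment.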
To present the results of your 4 x 7 x 7 runs,
report four 7x7 tables, one for each type of alignment.
The entry i,j in each table will contain the
alignment score of sequence i with sequence j.
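A table of this form could be built and printed as sketched below. The names `score_table` and `format_table` are hypothetical, and `align_score` stands for whichever of your alignment functions is being tabulated.

```python
def score_table(names, seqs, align_score):
    """table[i][j] = score of sequence names[i] aligned with names[j]."""
    return [[align_score(seqs[a], seqs[b]) for b in names] for a in names]

def format_table(names, table):
    """Plain-text table with sequence names as row and column labels."""
    width = max(max(len(name) for name in names), 8) + 2
    rows = [" " * width + "".join(name.rjust(width) for name in names)]
    for name, row in zip(names, table):
        cells = "".join(f"{value:.2f}".rjust(width) for value in row)
        rows.append(name.ljust(width) + cells)
    return "\n".join(rows)
```

Row i is the first sequence and column j the second, which matters for the variants that are not symmetric.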
Include in your output the alignment of:
4hhba and 4hhbb with global
4hhba and 1dxtb900 with global
4hhba and 1dxtb900 with local
4hhba and 1dxtb900 with global-local
1dxtb900 and 4hhba with local
1dxtb900 and 4hhbb with global-local
In addition, report the 5 alignments with the largest
scores using global-local (excluding the self-alignments).
For parts B and D, each pair of students will choose the
type of alignment algorithm on which they will
work. Please let me know the names of the students in each pair,
and I will assign to each pair the type of alignment
on which they will work.
B) To measure the significance of the alignment score,
one can generate a number of alignments using random
sequences, compute the mean M and standard deviation S of
the distribution of scores, and report the number N:
N = (SCORE - M) / S, where SCORE is the alignment score of
the original pair.
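Given the score of the original pair and the list of scores obtained with random sequences, N can be computed as in the sketch below. The text does not say whether S is the population or the sample standard deviation; this sketch assumes the population form (`statistics.pstdev`), and the function name is hypothetical.

```python
import statistics

def significance(score, random_scores):
    """N = (SCORE - M) / S for one pair, given the scores of the random runs."""
    m = statistics.mean(random_scores)
    s = statistics.pstdev(random_scores)  # assumption: population deviation
    if s == 0:
        raise ValueError("all random scores are equal; N is undefined")
    return (score - m) / s
```

For instance, with random scores 2, 4, 4, 4, 5, 5, 7, 9 (mean 5, standard deviation 2) and an original score of 10, N is 2.5.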
To generate the distribution:
- The first sequence is kept as is.
- Generate 500 random sequences of the same length and composition
as the second sequence.
- Compute the alignment score of each random sequence with the
first sequence.
To generate the random sequences you can generate random permutations
of the second sequence.
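One standard way to obtain a uniformly random permutation is the Fisher-Yates shuffle, sketched below together with the loop that scores the first sequence against each shuffled copy of the second. This is only one possible approach; the function names, the fixed seed and the `align_score` parameter are assumptions made for illustration.

```python
import random

def shuffled(seq, rng):
    """Random permutation of seq: same length and same letter composition."""
    letters = list(seq)
    for i in range(len(letters) - 1, 0, -1):
        j = rng.randint(0, i)  # uniform over positions 0..i, both inclusive
        letters[i], letters[j] = letters[j], letters[i]
    return "".join(letters)

def random_scores(first, second, align_score, n=500, seed=0):
    """Scores of `first` against n random permutations of `second`."""
    rng = random.Random(seed)  # fixed seed so that results can be reproduced
    return [align_score(first, shuffled(second, rng)) for _ in range(n)]
```

The standard library call `random.shuffle` on a list of letters gives the same kind of permutation and could be used instead of the explicit loop.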
1.- Explain how you generate the random permutations.
2.- Write a program to report the number N for a pair of aligned
sequences (using the type of alignment you chose).
3.- Run it on the 7x7 pairs and report the results in a 7 by 7
table. Each entry will contain the value of N for the corresponding
pair.
4.- Excluding the diagonal (self-comparisons), which are the 5 pairs with
the highest values of N?
5.- Find the 5 pairs with the highest alignment scores from part A. Are these
the same as those identified in part B4?
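Picking the top pairs from a 7 by 7 table and comparing two such lists can be done as sketched below. The function names are hypothetical, and pairs are treated as ordered (i, j), so (i, j) and (j, i) count as different entries.

```python
def top_pairs(names, table, k=5):
    """The k off-diagonal pairs with the largest table entries."""
    size = len(names)
    ranked = sorted(
        ((table[i][j], names[i], names[j])
         for i in range(size) for j in range(size) if i != j),
        key=lambda entry: entry[0], reverse=True)
    return [(a, b) for _, a, b in ranked[:k]]

def shared_pairs(pairs_a, pairs_b):
    """Pairs that appear in both lists."""
    return sorted(set(pairs_a) & set(pairs_b))
```

Calling `top_pairs` once on the score table and once on the table of N values, then passing both results to `shared_pairs`, shows which pairs are ranked highly by both criteria.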
C) For each type of alignment, explain in 2-3 lines its advantages/disadvantages.
D) Use various values of the gap initiation and gap extension weights. For example:
Gap initiation weight    Gap extension weight
10                       5
0                        0
2.7                      1
0                        0.5
For each of the alignment types, explain in 5-10 lines the effect
of the different gap weights.
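A quick way to get a feeling for these settings is to compare what one long gap costs against many short gaps of the same total length. The sketch below assumes a gap of length k costs initiation + k * extension (other conventions exist), and the names are hypothetical.

```python
def gap_cost(length, gap_open, gap_extend):
    """Cost of one gap of the given length (assumed: open + length * extend)."""
    return gap_open + length * gap_extend

# (gap initiation weight, gap extension weight) pairs from the table above
SETTINGS = [(10, 5), (0, 0), (2.7, 1), (0, 0.5)]

for gap_open, gap_extend in SETTINGS:
    one_long = gap_cost(10, gap_open, gap_extend)       # one gap of length 10
    ten_short = 10 * gap_cost(1, gap_open, gap_extend)  # ten gaps of length 1
    print(gap_open, gap_extend, one_long, ten_short)
```

When the initiation weight is 0 the two costs are equal, so nothing favours grouping gap characters together; a large initiation weight makes scattered gaps much more expensive than a single long one.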
Besides the printed output required, you have to provide a directory
which will contain both the source code and the executable.
I should be able to run your programs easily to reproduce your results.
Use simple command-line parameters for each option and for the input
sequences.
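As an illustration of what such a command line might look like, here is a sketch using Python's argparse module. Every option name below (--mode, --gap-open, --gap-extend, --sequences) is hypothetical; choose whatever names you find clearest.

```python
import argparse

def build_parser():
    """Command-line interface for one alignment run (option names are made up)."""
    parser = argparse.ArgumentParser(description="Align two named sequences.")
    parser.add_argument("--mode", default="global",
                        choices=["global", "end-space-free", "local",
                                 "global-local"],
                        help="type of alignment")
    parser.add_argument("--gap-open", type=float, default=2.7,
                        help="gap initiation weight")
    parser.add_argument("--gap-extend", type=float, default=0.15,
                        help="gap extension weight")
    parser.add_argument("--sequences", required=True,
                        help="file containing the named sequences")
    parser.add_argument("first", help="name of the first sequence")
    parser.add_argument("second", help="name of the second sequence")
    return parser
```

A run would then look like: python align.py --mode local --sequences seqs.txt 4hhba 1dxtb900 (the script and file names are placeholders).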
Here are the 7 sequences. The first line is the name
of the sequence (it is not part of the sequence).
The following lines correspond to the sequence
until the first blank line.
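A file in this format can be read with a small parser such as the sketch below. The function name is hypothetical; as a defensive assumption, whitespace inside a sequence line is dropped.

```python
def parse_sequences(text):
    """Return {name: sequence} from name / sequence-lines / blank-line records."""
    sequences = {}
    name, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # a blank line ends the current record
            if name is not None:
                sequences[name] = "".join(parts)
            name, parts = None, []
        elif name is None:                # first line of a record is the name
            name = line
        else:                             # remaining lines are sequence data
            parts.append("".join(line.split()))
    if name is not None:                  # last record may lack a blank line
        sequences[name] = "".join(parts)
    return sequences
```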
4hhba
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

4hhbb
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

1ten
LDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLT
EDENQYSIGNLKPDTEYEVSLISRRGDMSSNPAKETFTT

3hhrb
EPKFTKCRSPERETFSCHWTDEVHGPIQLFYTRRNEWTQEWKECPDYVSA
GENSCYFNSSFTSIWIPYCIKLTSNGGTVDEKCFSVDEIVQPDPPIALNW
TLLNVSLTGIHADIQVRWEAPRNADIQKGWMVLEYELQYKEVNETKWKMM
DPILTTSVPVYSLKVDKEYEVRVRSKQRNSGNYGEFSEVLYVTLP

2rhe
ESVLTQPPSASGTPGQRVTISCTGSATDIGSNSVIWYQQVPGKAPKLLIY
YNDLLPSGVSDRFSASKSGTSASLAISGLESEDEADYYCAAWNDSLDEPG
FGGGTKLTVLGQPK

2fb4h
EVQLVQSGGGVVQPGRSLRLSCSSSGFIFSSYAMYWVRQAPGKGLEWVAI
IWDDGSDQHYADSVKGRFTISRNDSKNTLFLQMDSLRPEDTGVYFCARDG
GSSAPDYWGQGTPVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDY
FPQPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYI
CNVNHKPSNTKVDKRVEPKSC

1dxtb900
TMKAHIVNGIWMRVGWVAASLRKLAKLAHSGEEAVRDQSVKLSSHLKETT
QMSLTPIKAKSAAGGPGIVNTEQCTENPSDKLGFIIMESLNLVKDVFGLG
FKHIVKKTLAAAQQATLAMIGPTDEAVGLDGPGANQAPGENLPQFGKAGS
SWGYGLGIQGTALQKPATFMACLSARKLGSTKRKFQGYAQDKISAVGSAA
REGAEDAAQLDWSSVLNIATKSVHIREEKFLDETKNTGGGESYQSGASQQ
KQNTCVGLSQQRQMDTYIPVQGYEQSVNSIRLVGFDQTWGAAIDAYTGLL
TGLGTTMGIAAGTAVDKVGSTIANLGSGESKDDAGVFLIVFSAAGVDVAY
AHYGSNAQYKKVKHLPTSELSSLAAFMVHLTPEEKSAVTALWGKVNVDEV
GGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSD
GLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFT
PPVQAAYQKVVAGVANALAHKYHQCKSWEVGSLAVEQPKQPEGVEIDQEN
KKFGNTEFVAIYQGQTGELEIGSMGMYSLAQNKAAAVGTGNHLLKKKIER
FAITLGAMIYSDGHNYSARACVLDWESAYSAASFTALAKSSCEVTKQEIF
TPDAKRPLLAKAGQMGVSGGLSGHILNGAPSTVSTDGQSGWNMLVGSAQK
QQVKVDELGTYGSTKINTTYDPAKKVAAQHVKIALQTFHQVLGSGDSSAT
FKEGANKLEQVTKGNADLVFSVKDRDLGNDWLKAKIGLEIGPMSGAKIGL
SADSTYAATSEQATHVALNAGFQRVTKYQAVGAPAGLLSGIGSIHGLAVT
ISRSRAGASALMRDDQWGIGLATVKATSLTRPFLDPQVIAGFTQGALGM