HOMEWORK:


A) Write programs to align two sequences using
   dynamic programming, weighted edit distance and
   affine gap penalties for:

1) global alignment
2) end-space free variant
3) local alignment
4) global-local

The weights are given by the 20 by 20 symmetric
matrix of Gonnet.

Gonnet Matrix



 Run your programs first on short example strings
of 10-20 letters, and then run them on every ordered pair of the
7 sequences below (49 runs in total for each alignment type).





 The gap initiation weight is 2.7 and the
gap extension weight is 0.15.
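
As a rough starting point, the sketch below computes only the alignment
score (no traceback) for the four variants with a three-matrix affine-gap
recurrence. Several things in it are assumptions, not part of the
assignment: scores are maximized, a gap of length k costs
gap_open + k * gap_ext (the other common convention charges
gap_open + (k - 1) * gap_ext), "global-local" is taken to mean that the
first sequence is aligned in full against any substring of the second,
and the mode names and the toy_score function (a stand-in for a Gonnet
matrix lookup) are hypothetical.

```python
NEG = float("-inf")

def toy_score(x, y):
    # Stand-in for a Gonnet matrix lookup (assumption, for illustration only).
    return 1.0 if x == y else -1.0

def align_score(a, b, score, gap_open=2.7, gap_ext=0.15, mode="global"):
    """Best affine-gap alignment score of a and b (score only, no traceback)."""
    n, m = len(a), len(b)
    skip_b_prefix = mode in ("endfree", "local", "glocal")  # leading part of b is free
    skip_a_prefix = mode in ("endfree", "local")            # leading part of a is free
    # M: a[i-1] paired with b[j-1]; X: a[i-1] against a gap; Y: b[j-1] against a gap.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        if skip_a_prefix:
            M[i][0] = 0.0
        else:
            X[i][0] = -(gap_open + i * gap_ext)
    for j in range(1, m + 1):
        if skip_b_prefix:
            M[0][j] = 0.0
        else:
            Y[0][j] = -(gap_open + j * gap_ext)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1])
            if mode == "local":
                diag = max(diag, 0.0)  # a local alignment may start at any cell
            M[i][j] = diag + score(a[i-1], b[j-1])
            # Opening a gap pays gap_open + gap_ext; extending one pays gap_ext.
            X[i][j] = max(max(M[i-1][j], Y[i-1][j]) - gap_open, X[i-1][j]) - gap_ext
            Y[i][j] = max(max(M[i][j-1], X[i][j-1]) - gap_open, Y[i][j-1]) - gap_ext

    def cell(i, j):
        return max(M[i][j], X[i][j], Y[i][j])

    if mode == "global":
        return cell(n, m)
    if mode == "glocal":   # all of a against any substring of b
        return max(cell(n, j) for j in range(m + 1))
    if mode == "endfree":  # trailing part of either sequence is free
        return max(max(cell(n, j) for j in range(m + 1)),
                   max(cell(i, m) for i in range(n + 1)))
    if mode == "local":
        return max(0.0, max(max(row) for row in M))
    raise ValueError(f"unknown mode: {mode}")
```

The four variants differ only in which boundary cells are free and in
where the final score is read, which is why one function can cover them.
To print the alignment itself you would also need to record, for each
cell, which case produced the maximum.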


The programs should output:
 - the names of the sequences followed by their alignment score 
 - the alignment of the two sequences
 - the number of identical letters matched.

For example, for 4hhba and 4hhbb the output should look like:

4hhba 4hhbb 73.58

V.LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF...
VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDL


...DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRV
STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHV


DPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
DPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

Identities: 62 (45%)
 (the value in parentheses is the number of identities as a percentage of the length of the longer sequence).
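
One possible way to produce the "Identities" line from the two aligned
rows is sketched below. It assumes that "." is the gap character, as in
the example above, and that the lengths of the original (ungapped)
sequences are passed in separately; the function names are hypothetical.

```python
def count_identities(row_a, row_b, gap="."):
    # An identity is a column where both rows hold the same letter (not a gap).
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)

def identity_line(row_a, row_b, len_a, len_b, gap="."):
    # len_a and len_b are the lengths of the original sequences.
    n = count_identities(row_a, row_b, gap)
    percent = 100.0 * n / max(len_a, len_b)
    return f"Identities: {n} ({percent:.0f}%)"

print(identity_line("V.LSPADK", "VHLTPEEK", 7, 8))
```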


A nicer output would show where the identities are:

V.LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF...
| | |  |  | | ||||     | | ||| |     | |   |  |   
VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDL


...DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRV
   |   |   || |||||  |     || |        || ||  || |
STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHV


DPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
|| || ||   |   || |   |||| | |   |  | |   |  || 
DPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
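
A minimal sketch of how the marker line could be generated, assuming
"." is the gap character and that the aligned rows are printed in
blocks of 50 columns as in the example (both function names are
hypothetical):

```python
def identity_markers(row_a, row_b, gap="."):
    # "|" under every column where the two rows hold the same letter.
    return "".join("|" if x == y and x != gap else " "
                   for x, y in zip(row_a, row_b))

def format_alignment(row_a, row_b, width=50, gap="."):
    # Split the aligned rows into blocks of `width` columns.
    marks = identity_markers(row_a, row_b, gap)
    blocks = []
    for start in range(0, len(row_a), width):
        end = start + width
        blocks.append("\n".join([row_a[start:end], marks[start:end], row_b[start:end]]))
    return "\n\n\n".join(blocks)

print(format_alignment("V.LSPADKTNVKAAWGKV", "VHLTPEEKSAVTALWGKV"))
```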



To present the results of your 4 x 7 x 7 runs,
report four 7x7 tables, one for each type of alignment.
The entry i,j in each table will contain the
alignment score of sequence i with sequence j.
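
A table of this kind could be laid out as in the sketch below. The
function name, the column width and the two-decimal format are
assumptions chosen for illustration; the numbers in the usage example
are made up and are not alignment results.

```python
def format_table(names, scores, width=10):
    # scores[i][j] is the alignment score of sequence i with sequence j.
    header = " " * width + "".join(f"{name:>{width}}" for name in names)
    lines = [header]
    for name, row in zip(names, scores):
        lines.append(f"{name:<{width}}" + "".join(f"{v:>{width}.2f}" for v in row))
    return "\n".join(lines)

# Made-up numbers, only to show the layout.
print(format_table(["seqA", "seqB"], [[1.0, 2.5], [2.5, 3.0]]))
```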

Include in your output the alignment of:
  4hhba and 4hhbb   with  global
  4hhba and 1dxtb900 with global 
  4hhba and 1dxtb900 with local
  4hhba and 1dxtb900 with global-local
  1dxtb900 and 4hhba with local
  1dxtb900 and 4hhbb with global-local

 In addition, report the 5 alignments with the largest
  scores using global-local (excluding the self-alignments).

For parts B and D, each pair of students will choose the
type of alignment algorithm on which they will
work. Please let me know the names of the students in each pair,
and I will assign to each pair the type of alignment
on which it will work.


B)  To measure the significance of the alignment score,
one can generate a number of alignments using random
sequences, compute the mean M and standard deviation S of
the distribution of scores, and report the number N: 

  N = (SCORE - M) / S, where SCORE is the alignment score of
the original pair.

  To generate the distribution:

- The first sequence is kept as is.
- Generate 500 random sequences of the same length and composition
  as the second sequence.
- Compute the alignment score of each random sequence with the
  first sequence.

 To generate the random sequences you can generate random permutations
of the second sequence.
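
The procedure above might be sketched as follows. The permutation is
produced with a Fisher-Yates shuffle, which is one common choice and not
necessarily the one expected here. The align argument is a hypothetical
function returning the alignment score of two strings, and using the
sample standard deviation (rather than the population one) is also an
assumption.

```python
import random
import statistics

def shuffled(seq, rng):
    # Fisher-Yates shuffle: every permutation of the letters is equally likely,
    # and length and composition are preserved.
    letters = list(seq)
    for i in range(len(letters) - 1, 0, -1):
        j = rng.randint(0, i)  # randint includes both end points
        letters[i], letters[j] = letters[j], letters[i]
    return "".join(letters)

def significance(a, b, align, trials=500, seed=0):
    # N = (SCORE - M) / S; the first sequence is kept as is.
    rng = random.Random(seed)
    original = align(a, b)
    scores = [align(a, shuffled(b, rng)) for _ in range(trials)]
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    if sd == 0:
        raise ValueError("all random scores are equal; N is undefined")
    return (original - mean) / sd
```

Fixing the seed makes a run reproducible, which helps when the results
have to be regenerated later.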

  1.- Explain how you generate the random permutations.
  2.- Write a program to report the number N for a pair of aligned
      sequences (using the type of alignment you chose).
  3.- Run it on the 7x7 pairs and report the results in a 7 by 7 
      table. Each entry will contain the value of N for the corresponding
      pair.
  4.- Excluding the diagonal (self-comparisons), which are the 5 pairs with
      the highest value of N?
  5.- Find the 5 pairs with the highest alignment score from part A. Are these
      the same as those identified in part B4? 


C) For each type of alignment, explain in 2-3 lines its advantages/disadvantages. 
 

D) 
 Use various values of the gap initiation and gap extension weights. For example:

 Gap initiation weight      Gap extension weight 
     10                           5
     0                            0  
     2.7                          1
     0                            .5

   Explain in 5-10 lines, for each of the alignment types, the effect
   of the different gap weights.
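
To get a feeling for these weights before running the alignments, it
may help to tabulate what a single gap costs under each pair. The sketch
below assumes that a gap of length k costs initiation + k * extension;
if your programs charge initiation + (k - 1) * extension instead, adjust
gap_cost accordingly. The function name is hypothetical.

```python
def gap_cost(length, gap_open, gap_ext):
    # Assumed convention: one initiation charge plus one extension charge per gap position.
    return gap_open + length * gap_ext

weights = [(10, 5), (0, 0), (2.7, 1), (0, 0.5)]
for gap_open, gap_ext in weights:
    costs = [gap_cost(k, gap_open, gap_ext) for k in (1, 5, 20)]
    print(gap_open, gap_ext, costs)
```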
 

 Besides the printed output required, you have to provide a directory
containing both the source code and the executable.
I should be able to run your programs easily to reproduce your results.
Use simple command-line parameters for each option and for the input
sequences. 
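
One possible command-line layout is sketched below with Python's
argparse module. All option names, the mode names and the positional
arguments are hypothetical; any equally simple scheme would do.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Pairwise sequence alignment")
    parser.add_argument("sequences", help="file holding the named sequences")
    parser.add_argument("first", help="name of the first sequence")
    parser.add_argument("second", help="name of the second sequence")
    parser.add_argument("--mode", default="global",
                        choices=["global", "endfree", "local", "glocal"],
                        help="type of alignment")
    parser.add_argument("--gap-open", type=float, default=2.7,
                        help="gap initiation weight")
    parser.add_argument("--gap-ext", type=float, default=0.15,
                        help="gap extension weight")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.mode, args.gap_open, args.gap_ext, args.first, args.second)
```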


Here are the 7 sequences. The first line is the name
of the sequence (it is not part of the sequence).
The following lines correspond to the sequence
until the first blank line.


4hhba
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

4hhbb
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

1ten
LDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLT
EDENQYSIGNLKPDTEYEVSLISRRGDMSSNPAKETFTT


3hhrb
EPKFTKCRSPERETFSCHWTDEVHGPIQLFYTRRNEWTQEWKECPDYVSA
GENSCYFNSSFTSIWIPYCIKLTSNGGTVDEKCFSVDEIVQPDPPIALNW
TLLNVSLTGIHADIQVRWEAPRNADIQKGWMVLEYELQYKEVNETKWKMM
DPILTTSVPVYSLKVDKEYEVRVRSKQRNSGNYGEFSEVLYVTLP

2rhe
ESVLTQPPSASGTPGQRVTISCTGSATDIGSNSVIWYQQVPGKAPKLLIY
YNDLLPSGVSDRFSASKSGTSASLAISGLESEDEADYYCAAWNDSLDEPG
FGGGTKLTVLGQPK


2fb4h
EVQLVQSGGGVVQPGRSLRLSCSSSGFIFSSYAMYWVRQAPGKGLEWVAI
IWDDGSDQHYADSVKGRFTISRNDSKNTLFLQMDSLRPEDTGVYFCARDG
GSSAPDYWGQGTPVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDY
FPQPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYI
CNVNHKPSNTKVDKRVEPKSC

1dxtb900
TMKAHIVNGIWMRVGWVAASLRKLAKLAHSGEEAVRDQSVKLSSHLKETT
QMSLTPIKAKSAAGGPGIVNTEQCTENPSDKLGFIIMESLNLVKDVFGLG 
FKHIVKKTLAAAQQATLAMIGPTDEAVGLDGPGANQAPGENLPQFGKAGS 
SWGYGLGIQGTALQKPATFMACLSARKLGSTKRKFQGYAQDKISAVGSAA 
REGAEDAAQLDWSSVLNIATKSVHIREEKFLDETKNTGGGESYQSGASQQ 
KQNTCVGLSQQRQMDTYIPVQGYEQSVNSIRLVGFDQTWGAAIDAYTGLL 
TGLGTTMGIAAGTAVDKVGSTIANLGSGESKDDAGVFLIVFSAAGVDVAY 
AHYGSNAQYKKVKHLPTSELSSLAAFMVHLTPEEKSAVTALWGKVNVDEV 
GGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSD 
GLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFT 
PPVQAAYQKVVAGVANALAHKYHQCKSWEVGSLAVEQPKQPEGVEIDQEN 
KKFGNTEFVAIYQGQTGELEIGSMGMYSLAQNKAAAVGTGNHLLKKKIER 
FAITLGAMIYSDGHNYSARACVLDWESAYSAASFTALAKSSCEVTKQEIF 
TPDAKRPLLAKAGQMGVSGGLSGHILNGAPSTVSTDGQSGWNMLVGSAQK
QQVKVDELGTYGSTKINTTYDPAKKVAAQHVKIALQTFHQVLGSGDSSAT 
FKEGANKLEQVTKGNADLVFSVKDRDLGNDWLKAKIGLEIGPMSGAKIGL 
SADSTYAATSEQATHVALNAGFQRVTKYQAVGAPAGLLSGIGSIHGLAVT 
ISRSRAGASALMRDDQWGIGLATVKATSLTRPFLDPQVIAGFTQGALGM
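
The layout described above (a name line, then sequence lines up to the
first blank line) could be read with a small parser such as the sketch
below. The function name is hypothetical, and dropping any whitespace
found inside a sequence line is an assumption about how stray spaces
should be treated.

```python
def parse_sequences(text):
    # Returns a dict mapping each sequence name to its letters.
    sequences = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            name = None          # a blank line ends the current sequence
            continue
        if name is None:
            name = line          # first non-blank line of a record is the name
            sequences[name] = ""
        else:
            sequences[name] += "".join(line.split())
    return sequences

example = "seq1\nACDE\nFGH\n\n\nseq2\nKLM N \n"
print(parse_sequences(example))
```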