HOMEWORK:

A) Write programs to align two sequences using dynamic programming, weighted edit distance and affine gap penalties for:
1) global alignment
2) end-space free variant
3) local alignment
4) global-local

The weights are given by the 20 by 20 symmetric matrix of Gonnet (see: Gonnet Matrix).
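As a possible starting point for the global case, the affine-gap recurrence is commonly written with three tables: one for columns that align two letters and one for each gap direction. The sketch below returns only the score. It assumes that a gap of length k costs gap_open + k * gap_ext (the assignment does not state the convention, so this may need adjusting), and `toy_score` is a hypothetical stand-in for a lookup in the Gonnet matrix. The other three variants generally differ only in how the borders are initialized and where the final score is read.

```python
NEG = float("-inf")

def toy_score(x, y):
    # Hypothetical stand-in for a Gonnet matrix lookup (not the real weights).
    return 1.0 if x == y else -1.0

def global_affine_score(a, b, score=toy_score, gap_open=2.7, gap_ext=0.15):
    """Score of the best global alignment of a and b with affine gaps.

    Assumption: a gap of length k costs gap_open + k * gap_ext.
    """
    n, m = len(a), len(b)
    # M: a[i-1] aligned with b[j-1]
    # X: a[i-1] aligned with a gap; Y: b[j-1] aligned with a gap
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + i * gap_ext)   # leading gap of length i
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + j * gap_ext)   # leading gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = (max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1])
                       + score(a[i-1], b[j-1]))
            # Either open a new gap (pay open + one extension) or extend one.
            X[i][j] = max(max(M[i-1][j], Y[i-1][j]) - (gap_open + gap_ext),
                          X[i-1][j] - gap_ext)
            Y[i][j] = max(max(M[i][j-1], X[i][j-1]) - (gap_open + gap_ext),
                          Y[i][j-1] - gap_ext)
    return max(M[n][m], X[n][m], Y[n][m])
```

Recovering the alignment itself requires an additional traceback through the three tables, which is omitted here.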

Run your programs first on short examples of strings with 10-20 letters and then run them using each of the 7 sequences below (49 runs in total for each alignment type). The gap initiation weight is 2.7 and the gap extension weight is 0.15.

The programs should output:
- the names of the sequences followed by their alignment score
- the alignment of the two sequences
- the number of identical letters matched.

For example, for 4hhba and 4hhbb the output should look like:

    4hhba 4hhbb 73.58

    V.LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF...
    VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDL

    ...DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRV
    STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHV

    DPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
    DPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

    Identities: 62 (45%)

(in parentheses is the % of identities divided by the length of the larger sequence).

A nicer output would show where the identities are:

    V.LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF...
    | | |  |  | | ||||     | | ||| |     | |   |  |
    VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDL

    ...DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRV
       |   |   || |||||  |     || |        || ||  || |
    STPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHV

    DPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
    || || ||   |   || |   |||| | |   |  | |   |  ||
    DPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

To present the results of your 4 x 7 x 7 runs, report four 7x7 tables, one for each type of alignment. The entry i,j in each table will contain the alignment score of sequence i with sequence j.

Include in your output the alignment of:
- 4hhba and 4hhbb with global
- 4hhba and 1dxtb900 with global
- 4hhba and 1dxtb900 with local
- 4hhba and 1dxtb900 with global-local
- 1dxtb900 and 4hhba with local
- 1dxtb900 and 4hhbb with global-local

In addition, report the 5 alignments with the largest scores using global-local (excluding the self-alignments).
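The "nicer output" with identity marks can be derived directly from the two aligned rows. The following is a minimal sketch, assuming both rows have the same length, that '.' is the gap character (as in the example above), and that 50 columns per block is the desired width. The function names are illustrative only and are not prescribed by the assignment.

```python
def identity_line(top, bottom, gap="."):
    """Return a string with '|' under every column holding the same letter in both rows."""
    return "".join("|" if x == y and x != gap else " "
                   for x, y in zip(top, bottom))

def format_alignment(top, bottom, width=50, gap="."):
    """Return (text, identities): the alignment in blocks of `width` columns."""
    bars = identity_line(top, bottom, gap)
    blocks = []
    for k in range(0, len(top), width):
        blocks.append("\n".join([top[k:k + width],
                                 bars[k:k + width].rstrip(),
                                 bottom[k:k + width]]))
    # Blocks are separated by an empty line; identities are the '|' marks.
    return "\n\n".join(blocks), bars.count("|")

# First block of the 4hhba / 4hhbb example from the assignment text.
top = "V.LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF..."
bottom = "VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDL"
text, identities = format_alignment(top, bottom)
print(text)
print("Identities:", identities)
```

Counting identities from the same bar line that is printed keeps the reported number and the displayed marks consistent with each other.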
For parts B and D, each pair of students will choose the type of alignment algorithm on which they will work. Please let me know the names of the students in each pair, and I will assign to each pair the type of alignment on which they will work.

B) To measure the significance of the alignment score, one can generate a number of alignments using random sequences, compute the mean M and standard deviation S of the distribution of scores, and report the number N:

    N = (SCORE - M) / S

where SCORE is the alignment score of the original pair.

To generate the distribution:
- The first sequence is kept as is.
- Generate 500 random sequences of the same length and composition as the second sequence.
- Compute the alignment score of each random sequence with the first sequence.

To generate the random sequences you can generate random permutations of the second sequence.

1.- Explain how you generate the random permutations.
2.- Write a program to report the number N for a pair of aligned sequences (using the type of alignment you chose).
3.- Run it on the 7x7 pairs and report the results in a 7 by 7 table. Each entry will contain the value of N for the corresponding pair.
4.- Besides the diagonal (self-comparisons), which are the 5 pairs with the highest value of N?
5.- Find the 5 pairs with the highest alignment score from part A. Are these the same as those identified in part B4?

C) For each type of alignment, explain in 2-3 lines its advantages/disadvantages.

D) Use various values of the gap initiation and gap extension weights. For example:

    Gap initiation weight    Gap extension weight
    10                       5
    0                        0
    2.7                      1
    0                        .5

Explain in 5-10 lines, for each of the alignment types, the effect of the different gap weights.

Besides the printed output required, you have to provide a directory which will contain both the source code and the executable. I should be able to run your programs easily to reproduce your results. Use simple command-line parameters for each option and for the input sequences.

Here are the 7 sequences.
The first line is the name of the sequence (it is not part of the sequence). The following lines correspond to the sequence until the first blank line.

4hhba
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKL
LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

4hhbb
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

1ten
LDAPSQIEVKDVTDTTALITWFKPLAEIDGIELTYGIKDVPGDRTTIDLT
EDENQYSIGNLKPDTEYEVSLISRRGDMSSNPAKETFTT

3hhrb
EPKFTKCRSPERETFSCHWTDEVHGPIQLFYTRRNEWTQEWKECPDYVSA
GENSCYFNSSFTSIWIPYCIKLTSNGGTVDEKCFSVDEIVQPDPPIALNW
TLLNVSLTGIHADIQVRWEAPRNADIQKGWMVLEYELQYKEVNETKWKMM
DPILTTSVPVYSLKVDKEYEVRVRSKQRNSGNYGEFSEVLYVTLP

2rhe
ESVLTQPPSASGTPGQRVTISCTGSATDIGSNSVIWYQQVPGKAPKLLIY
YNDLLPSGVSDRFSASKSGTSASLAISGLESEDEADYYCAAWNDSLDEPG
FGGGTKLTVLGQPK

2fb4h
EVQLVQSGGGVVQPGRSLRLSCSSSGFIFSSYAMYWVRQAPGKGLEWVAI
IWDDGSDQHYADSVKGRFTISRNDSKNTLFLQMDSLRPEDTGVYFCARDG
GSSAPDYWGQGTPVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDY
FPQPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYI
CNVNHKPSNTKVDKRVEPKSC

1dxtb900
TMKAHIVNGIWMRVGWVAASLRKLAKLAHSGEEAVRDQSVKLSSHLKETT
QMSLTPIKAKSAAGGPGIVNTEQCTENPSDKLGFIIMESLNLVKDVFGLG
FKHIVKKTLAAAQQATLAMIGPTDEAVGLDGPGANQAPGENLPQFGKAGS
SWGYGLGIQGTALQKPATFMACLSARKLGSTKRKFQGYAQDKISAVGSAA
REGAEDAAQLDWSSVLNIATKSVHIREEKFLDETKNTGGGESYQSGASQQ
KQNTCVGLSQQRQMDTYIPVQGYEQSVNSIRLVGFDQTWGAAIDAYTGLL
TGLGTTMGIAAGTAVDKVGSTIANLGSGESKDDAGVFLIVFSAAGVDVAY
AHYGSNAQYKKVKHLPTSELSSLAAFMVHLTPEEKSAVTALWGKVNVDEV
GGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSD
GLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFT
PPVQAAYQKVVAGVANALAHKYHQCKSWEVGSLAVEQPKQPEGVEIDQEN
KKFGNTEFVAIYQGQTGELEIGSMGMYSLAQNKAAAVGTGNHLLKKKIER
FAITLGAMIYSDGHNYSARACVLDWESAYSAASFTALAKSSCEVTKQEIF
TPDAKRPLLAKAGQMGVSGGLSGHILNGAP STVSTDGQSGWNMLVGSAQK
QQVKVDELGTYGSTKINTTYDPAKKVAAQHVKIALQTFHQVLGSGDSSAT
FKEGANKLEQVTKGNADLVFSVKDRDLGNDWLKAKIGLEIGPMSGAKIGL
SADSTYAATSEQATHVALNAGFQRVTKYQAVGAPAGLLSGIGSIHGLAVT
ISRSRAGASALMRDDQWGIGLATVKATSLTRPFLDPQVIAGFTQGALGM
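
For part B, the sequence format described above (a name line, then sequence lines up to the first blank line) and the number N = (SCORE - M) / S can be sketched as follows. This is only one possible approach under stated assumptions: `score_fn` is a hypothetical parameter standing for whichever alignment scorer a pair has written, spaces inside a sequence line are treated as not being residues, and S is taken to be the sample standard deviation (the assignment does not say which form to use).

```python
import random
import statistics

def read_sequences(text):
    """Parse records: a name line, then sequence lines up to the first blank line."""
    records = {}
    name, chunks = None, []
    for line in text.splitlines() + [""]:    # the extra "" flushes the last record
        line = line.strip()
        if not line:
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = None, []
        elif name is None:
            name = line                      # first line of a record is its name
        else:
            # Assumption: stray spaces inside a sequence line are not residues.
            chunks.append(line.replace(" ", ""))
    return records

def significance(first, second, score_fn, trials=500, seed=0):
    """Return N = (SCORE - M) / S using random permutations of `second`."""
    rng = random.Random(seed)                # fixed seed so runs are reproducible
    original = score_fn(first, second)
    letters = list(second)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)                 # permutation keeps length and composition
        scores.append(score_fn(first, "".join(letters)))
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)            # sample standard deviation (assumption)
    return (original - mean) / sd            # fails if all random scores are equal
```

Calling `significance(seqs["4hhba"], seqs["4hhbb"], score_fn)` for every ordered pair in `seqs = read_sequences(text)` would fill the 7 by 7 table of N values. Note that the first sequence is kept as is and only the second one is permuted, as the assignment specifies.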