Stringology & Structural Biology
Day 2009
Ben-Gurion University


Ben-Gurion University
Stringology & Structural Biology Day 2009

Abstracts and Bios
Towards a Theory of Patches
Amihood Amir, Bar-Ilan University & Johns Hopkins University

Abstract:
Many applications have a need for indexing unstructured data. It turns out that a similar ad-hoc method is being used in many of them - that of considering small particles of the data. In this talk we formalize this concept as a tiling problem and consider the efficiency of dealing with this model. We present an efficient algorithm for the one dimension tiling problem, and prove the two dimension problem is hard. We then develop an approximation algorithm for the number of tile matches with an approximation ratio converging to $0.5$. We show that the "one-and-a-half" dimensional version of the problem is also hard.

Bio:
Amihood Amir received his Ph.D. in 1983 at Bar-Ilan University. He did his post-doc at MIT, and was on the faculty at Tufts, University of Maryland at College Park, and Georgia Tech, before joining Bar-Ilan University in 1994. Today he is a professor of Computer Science at Bar-Ilan University and a Research Professor at Johns Hopkins University. Amir had been head of the Bar-Ilan Responsa Project, and Chairman of the Department of Computer Science. Today he is dean of the College of Exact Sciences at Bar-Ilan University.Amihood Amir's Ph.D. thesis was in logic of programs, particularly temporal logics. He later did some work in Complexity theory - sub-recursive classes of functions and the concept of "cheatability" in hard sets. Since the late 1980's Amir's research has been in Algorithms design and analysis, in particular Pattern Matching Algorithms. In the later area he has been instrumental in developing the multidimensional pattern matching area, compressed matching, and, recently, asynchronous matching. In this context he has also done work in algorithms for Computational Biology. Amir has also worked in the past in Scheduling algorithms, VLSI algorithms, and Data Mining algorithms. Amihood Amir is the author of over a hundred research papers, and recipient of grants from the NSF, ISF, BSF, the Israel Ministry of Science, and the Israel Ministry of Industry and Commerce. In addition, he was a co-PI on bi-national grants with the UK, Italy, France and Finland. Amir serves on the editorial board of Information and Computation, and served on the program committee of dozens of major Computer Science conferences. Amir has advised 19 Ph.D. students and 16 M.Sc. students, and sponsored 3 post-doctoral researchers. He has won a departmental teaching excellence award at the University of Maryland, and research awards at UMCP and Georgia Tech. He has chaired the Computer Science Advisory Committee of the Israel Ministry of Education, and was a panel member on various panels of the NSF, Israel Ministry of Science, and the Israel Institute for Higher Education.


Structure-Based Identification of Catalytic Residues
Chen Keasar, Ben-Gurion University

Abstract:
Catalytic residues tend to destabilize enzyme structures and to be spatially clustered and occluded. We used these observations to develop a structure-based approach for the identification of catalytic residues. The new method is motivated by the difficulty of evolution-based methods to annotate "orphan" Structural Genomics targets, which have very few or no homologs in the databases. In addition, we believe that structure-based techniques are an essential complement to conservation-based methods, which may be misled by conserved non-catalytic residues. We introduce novel structural features that suggest a catalytic role for residues. These features, were used to train a support vector machine that discriminated catalytic from non-catalytic residues. Special attention was paid to the class imbalance problem that often spoils the performance of classifiers.

Bio:
TBA


On the Usefulness of Fibonacci Compression Codes
Tomi Klein, Bar-Ilan University

Abstract:
Recent publications advocate the use of various variable length codes for which each codeword consists of an integral number of bytes in compression applications using large alphabets. We show that another tradeoff with similar properties can be obtained by Fibonacci codes. These are fixed codeword sets, using binary representations of integers based on Fibonacci numbers of order m>= 2. Fibonacci codes have been used before, and we extend previous work presenting several novel features. In particular, the compression efficiency is analyzed and compared to that of dense codes, and various table-driven decoding routines are suggested.

Bio:
Shmuel Tomi Klein is an associate Professor at the Computer Science Department of Bar Ilan University in Israel. His research focuses on lossless data compression and on information retrieval, and in particular on the intersection of these two areas. He has been working with two of the largest full text information retrieval systems: the Tr?sor de la Langue Fran?aise (in French) in Chicago, and the Responsa Retrieval Project (in Hebrew) in Israel. He was also involved with several software companies and is the author of the DoubleDisk product implemented in Microsoft's MS-DOS 6.


FragBag: A 'Bag-of-Words' Representation of Protein Structure Retrieves Structural Near-Neighbors from the Complete PDB Quickly and Accurately
Rachel Kolodny, Haifa University

Abstract:
Fast identification of protein structures that are similar to a specified query structure in the complete Protein Data Bank (PDB) is fundamental in protein structure and function prediction. Here, we present FragBag: an ultra-fast and accurate method for comparing protein structures based on a new and succinct representation of proteins as ‘bags-of-fragments’. We describe any protein structure by the collection of its overlapping short contiguous backbone segments and discretize this set using a library of fragments. Then, we represent the protein by a vector counting the number of occurrences of each fragment, and measure the similarity of two protein structures by the similarity of their vectors. Our representation has two additional benefits: (1) it can be used to construct an inverted index when implementing a fast structural search engine of the complete PDB. (2) One can specify a structure as a collection of substructures, without combining them to a single structure. This is valuable for structure prediction, when there are only reliable predictions of parts of the protein. We use ROC curve analysis to quantify the success of FragBag in identifying candidate sets in a dataset of over 2,900 protein structures. The gold standard is the set of near structural neighbors found by six state-of-the-art structural aligners. Our best FragBag library finds more accurate candidate sets than three other filter methods: SGM, PRIDE, and a method by Zotenko et al. More interestingly, FragBag performs similarly to the computationally expensive yet highly trusted structural aligners STRUCTAL and CE.

Bio:
TBA


Some old and new results on efficient seeding for biosequence search
Gregory Kucherov, CNRS Lille, France & Laboratoire J.V.Poncelet, Moscow

Abstract: Since about 2002, spaced seeds and their different generalizations became an efficient tool for improving the performance of DNA sequence search. After presenting main ideas behind this technique, we briefly survey several extensions of spaced seeds proposed by different authors and allowing to further enhance the performance of the search. Then we present two new applications we developed recently: one for protein search and the other for mapping reads issued from a high-throughput sequencing technology to a reference genome.

Bio: Gregory Kucherov received the PhD degree in computer science from the USSR Academy of Sciences in 1988 and the habilitation degree from Henri Poincar? University in Nancy in 2000. He is a CNRS research director in the Laboratory for Computer Science (LIFL) in Lille, France. He is currently on leave at the French- Russian J.-V. Poncelet Lab in Moscow, Russia. Since the beginning of the 1990s, he has been doing research on word combinatorics, text algorithms, and combinatorial algorithms for bioinformatics and computational biology.


A Black Box for Online Approximate Pattern Matching
Ely Porat, Bar-Ilan University

Abstract:
We present a deterministic black box solution for online approximate matching. Given a pattern of length $m$ and a streaming text of length $n$ that arrives one character at a time, the task is to report the distance between the pattern and a sliding window of the text as soon as the new character arrives. Our solution requires $O(\Sum_{j=1}^{\log_2 m} T(n,2^{j-1})/n)$ time for each input character, where $T(n,m)$ is the total running time of the best offline algorithm. The types of approximation that are supported include exact matching with wildcards, matching under the Hamming norm, approximating the Hamming norm, k-mismatch and numerical measures such as the $L_2$ and $L_1$ norms. For these examples, the resulting online algorithms take $O(\log^2 m)$, $O(\sqrt{m\log m}, $O(\log^2 m/?^2)$, $O(\sqrt{k\log } \log m)$, $O(\log^2 m)$ and $O(\sqrt{m\log m}) time per character respectively. The space overhead is $O(m)$ which we show is optimal.

Bio:
TBA


Atomic Resolution Modeling of Large Macromolecular Assemblies
Haim Wolfson, Tel-Aviv University

Abstract: Modelling of multimolecular assemblies is crucial for the understanding of cellular function. Nevertheless, most of the structures in the PDB are either monomers or dimers. The yeast cell, for example, contains approximately 800 distinct core complexes of 5 proteins, on the average, most of which have not yet been structurally characterized. In parallel, the various Structural Genomics efforts and the improvement in homology modeling techniques provide a wealth of single protein atomic resolution structures. In addition, recent developments in experimental techniques, such as cryo-EM, FRET, SAXS provide low resolution information on large macromolecular complexes and distance constraints between the interacting units. Development of efficient algorithms, which integrate the high and low resolution data, in order to model macromolecular complexes at atomic resolution is a key Structural Bioinformatics task. At the talk we shall discuss and present several methods, which tackle the Multimolecular Assembly task by integrating experimental protein data at different resolutions, in particular, the integration of cryo-EM and X-ray crystallography data.

Bio:
Haim J. Wolfson earned his PhD in Mathematics at Tel Aviv University in 1985. He was a Senior Research Scientist at the NYU Robotics Lab (1985-1989) specializing in Object Recognition in Computer Vision. In 1989 he joined the Computer Science School of Tel Aviv University and has been a full professor there since 2000. Since May 2004 he is the incumbent of the George and Maritza Pionkowski chair in Computer Aided Drug Design. Haim served one term (2002-2004) as Head of the Computer Science School and is currently the Dean of the Raymond and Beverly Sackler Faculty of Exact Sciences (since Sep 1, 2006). He is a co-developer of the “Geometric Hashing” paradigm for Model based Object Recognition in Computer Vision, which is one of the leading geometric pattern discovery paradigms up-to-date. The International Conference on Computer Vision, which is the leading conference in this field, has awarded in 2009 the "Test-of-Time" award to Wolfson and Lamdan for their "Geometric Hashing" paper presented by Haim at this conference in 1988. In the early 1990’s Haim pioneered the introduction of Geometric Hashing and other Computer Vision based methodologies into Structural Bioinformatics, and is since conducting intensive interdisciplinary research in the field. He is a co-founder (together with Ron Shamir) of the TAU Bioinformatics study program. HJW has more than 130 publications in scientific journals, books and refereed conference proceedings. He was the chairman of the scientific program committee of the 5’th European Conference on Computational Biology, which was held in Jan 2007 in Eilat, Israel.