Algorithms for functional genomic data: making sense from noisy and heterogeneous big data collections in biology

Olga Troyanskaya
An immense molecular complexity forms the foundation of human disease. Our goal is to develop novel bioinformatics algorithms that can interpret and distill this complexity through integrative analysis of massive collections of diverse high-throughput biological data, creating accurate models of molecular networks and pathways, particularly those in which malfunction promotes the emergence of complex human disease.
Specifically, I will discuss novel machine learning approaches we developed to tackle the challenges of data heterogeneity, high noise levels, and very limited training standards in genomics.
Using these approaches, we are able to make accurate predictions of protein function and interactions, and understand how biological networks change across tissues, including elucidating molecular underpinnings of disease.

Algorithms for Jumbled Indexing, Jumbled Border and Jumbled Square on Run-Length Encoded Strings

Liat Rozenberg
Jumbled Indexing, the problem of indexing a text for histogram queries, has been of much interest lately.
In this paper we consider jumbled indexing for run-length encoded texts. We refute a former conjecture and show an algorithm for general sized alphabets. We also consider Jumbled Borders, the extension of borders to jumbled strings. Borders are the basis for various algorithms.
Finally, we consider Jumbled Squares, strings which are of the form x x', where x' is a jumbling of x.
We show efficient algorithms for these problems.

(Almost) Reliable regions in (relatively) unreliable protein multiple alignments

Mikhail Roytberg
Multiple alignment of weakly homologous protein sequences (%ID < 35%) is a difficult task. Alignments produced by best existing programs, e.g. ProbAlign, have accuracy and confidence in average ~ 85%-87%.
We propose a method allowing one, given a ProbAlgn multiple alignment, to pick up ~50% of aligned position pairs so that ~98% of selected pairs belong to the corresponding golden standard multiple alignment based on superposition of 3D protein structures. The idea of the approach is to construct several alternative candidate multiple alignments and consider as reliable those amino acid correspondences that belong to all candidate alignments. The method can be applied to alignments produced by arbitrary algorithm.

Approximate String Matching using a Bidirectional Index

Dekel Tsur
We study strategies of approximate pattern matching that exploit bidirectional text indexes, extending and generalizing ideas of Lam et al. We introduce a formalism, called search schemes, to specify search strategies of this type, then develop a probabilistic measure for the efficiency of a search scheme, prove several combinatorial results on efficient search schemes, and finally, provide experimental computations supporting the superiority of our strategies.

Automated Controversy Detection on the Web

Shiri Dori-Hacohen
Alerting users about controversial search results can encourage critical literacy, promote healthy civic discourse and counteract the “filter bubble” effect, and therefore would be a useful feature in a search engine or browser extension. In order to implement such a feature, however, the binary classification task of determining which topics or webpages are controversial must be solved. Earlier work described a proof of concept using a supervised nearest neighbor classifier with access to an oracle of manually annotated Wikipedia articles. This paper generalizes and extends that concept by taking the human out of the loop, leveraging the rich metadata available in Wikipedia articles in a weakly-supervised classification approach. The new technique we present allows the nearest neighbor approach to be extended on a much larger scale and to other datasets. The results improve substantially over naive baselines and are nearly identical to the oracle-reliant approach by standard measures of F1, F0.5, and accuracy.
If time allows, we will discuss implications of solving this problem as part of a broader subject of interest to the IR community, and our current and future research directions.

Bit Parallel Sequence Alignment Algorithms

Gary Benson
Next-generation sequencing and other processor-intensive sequence comparison applications have motivated a continued search for high-efficiency sequence alignment algorithms.
One successful approach, termed bit-parallel alignment, exploits inherent parallelism in computer logic calculations. It represents individual cells in an alignment scoring matrix as bits in computer words and uses a series of bit operations comprised of AND, OR, XOR, complement, shift, and addition to emulate the calculation of scores.
Bit-parallelism has been successfully applied to the Longest Common Subsequence (LCS) and edit-distance problems, producing the fastest single processor solutions to date. The bit-parallel LCS and edit-distance algorithms are closely tied to special properties of the problems, which limited efforts to extend bit-parallelism to more general scoring schemes.
In this talk, I will describe our recent bit-parallel solution for general, integer-scoring global alignment. Integer-scoring schemes, which are widely used, assign integer weights for match, mismatch, and gap operations. The method depends on structural properties of the relationship between adjacent scores in the scoring array. We utilize these properties to construct a class of efficient algorithms, each designed for a particular set of weights.
The talk will conclude with extensions and generalizations of the approach.

Cascading Bloom filters for bioinformatics applications

Gregory Kucherov
CNRS/LIGM Marné-la-Vallée
Next Generation Sequencing (NGS) technologies have now become ubiquitous in genomics studies, generating a massive flow of sequence data. Modern genomic projects ordinarily deal with data sets of hundreds of gigas of nucleotides.
In this talk, we will be concerned with efficient data structures for storing NGS data. We first turn to the problem of genome assembly and, in particular, consider de Bruijn graphs which are one of the main tools for this task. We introduce "cascading Bloom filters" and show how they can reduce the memory required for storing de Bruijn graphs. We then briefly present how this data structure can be applied to other problems, namely to compression of NGS data and to metagenomics.
Joint work with Kamil Salikhov, Gustavo Sacomoto

Combinatorics to Score Gene Clusters

Laxmi Parida
Consider the task of scoring or associating a quantitative (significance) value to common gene clusters observed across multiple species.
One of the crippling weaknesses has been in the use of individual probabilities of the component genes: the value (estimate) of this probability is questionable and almost no consensus exists about this in literature. In this talk, I will present a model that does not use any such probability estimates; instead it exploits the observed structure (in terms of the layout along each chromosome) to give a meaningful score value for the cluster.
How computable is this model? I will address this question in the talk.

Element Distinctness, Frequency Moments, and Sliding Windows

Raphael Clifford
I will present new time-space tradeoff lower bounds and algorithms for exactly computing statistics of input data, including frequency moments, element distinctness, and order statistics. These problems are simple to solve for sorted data. Determining the complexity of these problems where space is limited, on the other hand, requires considerably more effort.
I will first discuss a new randomised algorithm for the element distinctness problem whose time and space complexity is roughly O(n3/2/S1/2), smaller than previous lower bounds for comparison-based algorithms. I will then show how to extend this method to a sliding window version with only log factor overheads.
If there is time, I will then show tight (to within log factors) time-space tradeoff bounds for computing the number of distinct elements, F_0, over sliding windows. The same lower bound holds for computing the low-order bit of F_0 and computing any frequency moment F_k for k \ne 1. This shows that frequency moments F_k \ne 1 and even the decision problem F_0 \bmod 2 are strictly harder than element distinctness.
The paper is available online at
Joint work with Paul Beame and Widad Machmouchi

Fewer runs than word length

Maxime Crochemore
The study of repetitions in texts constitutes a fundamental area of combinatorics on words due to major applications to string algorithms, data compression, music analysis, and biological sequences analysis, etc.
The talk discusses the recent proof by Bannai et al. of the conjecture stated by Kolpakov and Kucherov in 1999 on the maximal number of runs occurring in a word.

Finding Feasible Polytope in Interval Disaggregation

Kunsoo Park
Business planning as well as analytics on top of large-scale database systems is valuable to decision makers, but planning operations known and implemented so far are very basic.
In this paper we propose a new planning operation called interval disaggregate, which goes as follows. Suppose that the planner, typically the management of a company, plans sales revenues of its products in the current year. An interval of the expected revenue for each product in the current year is computed from historical data in the database as the prediction interval of linear regression on the data. A total target revenue for the current year is given by the planner. The goal of the interval disaggregate operation is to find an appropriate disaggregation of the target revenue within the boundaries of the intervals.
We formulate the problem of interval disaggregation more precisely and give solutions for the problem. Multidimensional geometry plays a crucial role in the problem formulation and the solutions.
We implemented interval disaggregation into the planning engine of SAP HANA and did experiments on real-world data. Our experiments show that interval disaggregation gives more appropriate solutions with respect to historical data than the known basic disaggregation called referential disaggregation.

Georgia Tech's New Online MOOC Based Master Program

Zvi Galil
In May 2013, Georgia Tech together with its partners -- Udacity and AT&T -- announced a new Online Master degree in Computer Science based on massively open online courses (MOOCs).
A complete degree will cost less than $7000 compared to over $40,000 in public universities and over $70,000 in private universities. The program was launched in January of 2014. It is the first of its kind.
The reaction to this program has been unprecedented -- there have been more than 600 distinct news articles about it.
It was described to potentially be a "game changer" and "the first real step in the transformation of higher education in the US".
Two examples are an article from the front page of NYTimes (also the International Herald Tribune) in August 2013, and one in Inside Higher Education in June 2014,
We intentionally started with small enrollment of about 400. We plan to gradually scale up the program. This fall we had over 1200 students and next month more than 2000. We plan to scale the program gradually.
The talk will describe the new program, how it came about, it’s first three semesters and what we've learned from them, and its potential in the US and internationally.

Histogram Indexing via Additive Combinatorics

Moshe Lewenstein
Histogram indexing is the problem where one preprocesses a text to answer histogram queries, that is one receives a histogram and needs to answer whether there is a substring with that histogram. Despite much research on the problem in the last few years, the best algorithms, even for the binary case, are more-or-less the same as the naive bounds.
We present a new method to solve this problem using additive combinatorics. Our solution gives an O(n^{1.859}) preprocessing time algorithm for constant sized alphabet (instead of the naive O(n^2)) which answers queries in O(1) time.
Joint work with Timothy Chan

How to Encode a Data Structure Succinctly

Rajeev Raman
In recent years, there has been a rapid increase research on data structures that answer complex queries rapidly on data that is stored in compressed formats. This has been driven by the increasing need to analyze and search for complex patterns in very large data sets. Developing data structures that operate on compressed data allows the entire data structure to fit into the main memory of a computer, which in turn avoids expensive or inconvenient computation on hard disks.
This talk will focus on a new direction: encoding the ``data structure'' itself. Many data structuring problems are such that the queries they answer do not allow the original data to be reconstructed purely through the queries. This presents opportunities to obtain space savings even when the data is incompressible, by pre-processing the data, extracting only the information needed to answer the queries, and then deleting the data. By storing only the data needed to answer the queries, the data structures could also be said to operate on a ``need to know'' basis, which could be important from a data security perspective.

Longest Common Extensions in Trees

Philip Bille
The longest common extension (LCE) of two indices in a string is the length of the longest identical substrings starting at these two indices. The LCE problem asks to preprocess a string into a compact data structure that supports fast LCE queries.
In this paper we generalize the LCE problem to trees. Given a labeled and rooted tree T, the goal is the preprocess T into a compact data structure that support the following LCE queries between subpaths and subtrees in T. Let v1, v2, w1, and w2 be nodes of T such that w1 and w2 are descendants of v1 and v2 respectively.
LCEPP(v1,w1,v2,w2): (path-path LCE) return the longest common prefix of the paths v1 -> w1 and v2 -> w2.
LCEPT (v1 , w1 , v2 ): (path-tree LCE) return maximal path-path LCE of the path v1 -> w1 and any path from v2 to a descendant leaf.
LCETT (v1 , v2 ): (tree-tree LCE) return a maximal path-path LCE of any pair of paths from v1 and v2 to descendant leaves.
We present the first non-trivial bounds for supporting these queries. Let n be the size of T.
For LCEPP andLCEPT queries,we present a linear-space solution with O(loglogn) and O(logn) query time respectively.
For LCETT queries, we show a reduction to the the set intersection problem implying that a fast linear space solution is not likely to exist.
We complement this with a time-space trade-off, that given any parameter t leads to an O(nt) space and O(n/t) query-time solution.
Finally, we suggest a few applications of LCE in trees to tries and XML databases.

Optimizing the Design of Coding Sequences

Steven Skiena
Tremendous advances have been made in reducing the cost of DNA synthesis. We are entering an age of synthetic biology, where we can design and synthesize new life forms for scientific and medical applications. But what sequence properties optimize translation and gene expression?
Our gene design algorithms optimize the DNA sequence of a gene for particular desired properties while coding for a specified protein. For vaccine design, we optimized the codon-pair bias of a sequence to modulate expression. We have also developed sequence design algorithms to optimize the presence of RNA secondary structure and the degree of sequence autocorrelation, which effects the frequency of tRNA reuse and recycling.
In this talk, I will discuss our experiences with designing and synthesizing these coding sequences, and their resulting phenotypes.
I will also present some recent work analyzing ribosome profiling data to understand the impact of codon-bias on translation efficiency. (joint work with Justin Gardin, Ruhksana Yeasmin, Alisa Yurovsky, and Bruce Futcher)

Period Recovery over the Hamming and Edit Distance

Mika Amit
A string S is called periodic in $P$ if it can be written as $S=P^r$, where $r > 1$ is a real number.
The smallest such substring, $P$, is denoted as the period of $S$, and its length is denoted as $p$.
In this paper we investigate the period recovery over the edit distance problem: for a given string $S$ of length $n$, find the smallest period $P$ such that the edit distance between $S$ and $P^{n/p}$ is smaller than $\frac{n}{(4+\epsilon)p}$, for $0 < \epsilon < 1$. We give an $O(n^{\frac{4}{3}}\log n)$ time algorithm for this problem.
In addition, we solve the period recovery over the Hamming distance problem, which for the same input finds the smallest period $P$ such that the Hamming distance between $S$ and $P^{n/p}$ is smaller than $\frac{n}{(2+\epsilon)p}$, for $0 < \epsilon < 1$.
For this problem, we describe an $O(n\log n)$ time algorithm.
This is a joint work with Amihood Amir, Gad M. Landau and Dina Sokol.

Research Challenges in Multivariate Algorithmics for NP-hard String Problems

Christian Komusiewicz
String problems arise in various applications ranging from text mining to biological sequence analysis. Many string problems are NP-hard. This motivates the search for (fixed-parameter) tractable special cases of these problems. We survey parameterized and multivariate algorithmics results for NP-hard string problems and identify challenges for future research.

Sampling and Large Flow Detection in SDN

Shir Landau Feibish
Techniques for sampling and detecting large flows in software defined networks (SDN) with OpenFlow are presented.
We start by presenting methods for randomly sampling packets with probability p, both independent and dependent on packet size. Sampling techniques that are immune to various cyber attacks are considered. Following the sampling techniques we develop various efficient methods to detect large flows. Based on time and other parameters we differentiate between heavy flows, elephant flows and bulky flows and present efficient algorithms for the detection of the different types. In all cases, the techniques are both flow-table size and switch-controller communication efficient. The presented techniques are based on OpenFlow 1.3 capabilities.
Joint work with: Yehuda Afek, Anat Bremler-Barr and Liron Schiff

Shortest double-stranded DNA sequences covering all k-mers

Ron Shamir
Novel technologies can generate large sets of short double-stranded DNA sequences that can be used to measure their regulatory effects.
By using sequence probes that cover all k-mers, a comprehensive picture of the effect of all possible short sequences on gene regulation is obtained.
The value of k that can be used in practice is, however, severely limited by cost and space considerations.
A key challenge is, therefore, to cover all k-mers with a minimal number of probes.
The standard way to do this uses the de Bruijn sequence of length 4k. However, as probes are double stranded, when a k-mer is included in a probe, its reverse complement k-mer is accounted for as well.
We show how to efficiently create a shortest possible sequence with the property that it contains each k-mer or its reverse complement, but not necessarily both. The length of the resulting sequence approaches half that of the de Bruijn sequence as k increases.
Yaron Orenstein and Ron Shamir

Simpler RNA Folding in Sub-Cubic Time

Dan Gusfield
Cubic-time worst-case dynamic programming solutions to the problem of minimizing free energy in folded RNA were introduced about forty years ago.
Several years ago we showed that the worst-case time can be reduced by a factor of log n, where n is the length of the RNA sequence. Since then, the bound has been reduced further, and the techniques (exploiting the Four-Russian's idea) have been extended to other RNA folding problems.
Additionally, a simpler way was developed to explain and implement the first result (i.e., speedup by a log n factor). In this talk, I will discuss the history and the simpler approach.
Work is with Y. Frid and B. Venkatachalam

String Graph Construction Using Incremental Hashing

Ilan Ben-Bassat
New sequencing technologies generate larger amount of short reads data at decreasing cost.
De novo sequence assembly is the problem of combining these reads back to the original genome sequence, without relying on a reference genome. This presents algorithmic and computational challenges, especially for long and repetitive genome sequences.
Most existing approaches to the assembly problem operate in the framework of de Bruijn graphs. Yet, a number of recent works employ the paradigm of {\em string graph}, using a variety of methods for storing and processing suffixes and prefixes, like suffix arrays, the Burrows--Wheeler transform, or the FM index.
Our work is motivated by a search for new approaches to constructing the string graph, using alternative yet simple data structures and algorithmic concepts.
We introduce a novel, hash based method for constructing the string graph. We employ incremental hashing, and specifically a modification of the Karp--Rabin fingerprint, and Bloom filters. Using these probabilistic methods might create false positive and false negative edges during the algorithm's execution, but these are all detected and corrected.
The advantages of the proposed approach over existing methods are its simplicity, and in the incorporation of established probabilistic techniques in the context of de novo genome sequencing. Our preliminary implementation is favorably comparable to the first string graph construction of Simpson and Durbin (2010) (but not to subsequent improvements).
Further research and optimizations will hopefully enable the algorithm to be incorporated, with noticeable performance improvement, in state of the art string graph based assemblers.
joint work with Benny Chor

The Cancer Genome: Copy number variation and somatic mutations

Jeanette Schmidt
Copy number analysis in tumors is rapidly gaining importance in cancer therapy as a tool for differential diagnosis with impact on potential treatment [1].
New evidence argues that tumors can be classified in those driven by either somatic mutations (M class) or copy number aberrations (C class).
Formalin fixed paraffin embedded (FFPE) blocks are the most common sources of material for both research and clinical diagnostics, and DNA degradation in FFPE presents a major challenge for accurate measurement of copy number.
We address this challenge using molecular inversion probes, which capture the alleles of over 220,000 Single Nucleotide Polymorphisms (SNPs) at carefully selected genomic locations, evenly distributed across the genome and with increased density within ~900 cancer-related genes.
Another challenge of tumor samples is the presence of normal cells in most biopsy samples, which affects copy number estimates and the characterization of the genome of the tumor. To address this problem of variable tumor burden, we developed TuScan™, an algorithm inspired by ASCAT [2], to estimate the integer copy number in the tumor at each locus.
When a major clone is responsible for the majority of copy number changes, the algorithm estimates the tumor burden in the sample and reports the integer copy number in the cancer portion only, effectively subtracting the normal component, thereby enabling a comparison between samples with different tumor burden. For highly heterogeneous samples or very low tumor burden, the algorithm reports the fractional (average) copy number of all cells within the sample.
[1] G. Ciriello et al, Nat. Gen. 45, 1127–1133, 2013.
[2] Peter Van Loo et al, PNAS 107 (39) 16910-16915, 2010.

The Euclidean k-Supplier Problem

Baruch M. Schieber
In the k-supplier problem, we are given a set of clients C and set of facilities F located in a metric space along with a bound k.
The goal is to open a subset of k facilities so as to minimize the maximum distance of a client to an open facility.
We present a 1+√3<2.74 approximation algorithm for the k-supplier problem in Euclidean metrics.
This improves the previously known 3-approximation algorithm which also holds for general metrics (where it is known to be tight).
It has been shown that it is NP-hard to approximate Euclidean k-supplier to better than a factor of √7 which is approximately 2.65$, even in dimension two.
Our algorithm is based on a relation to the edge cover problem.
We also present a nearly linear O(n log n) time algorithm for Euclidean k-supplier in constant dimensions that achieves an approximation ratio of 2.965.
This is joint work with Viswanath Nagarajan and Hadas Shachnai

The Parameterized Critical Node Cut Problem

Barak Navon
In this talk we consider the following natural graph cut problem which we call Critical Node Cut (CNC): Given a graph G on n vertices, and two positive integers k and x, determine whether G has a set of k vertices whose removal leaves G with at most x connected pairs of vertices.
We analyzed this problem in the framework of parameterized complexity, considering four natural parameters and their aggregations.
In the talk we will give an overview of our results, as well as discuss future work on the problem.

The Resurrection of de novo DNA Sequencing

Gene Myers
With the advent of long read sequencers such as the PacBio RS II, the goal of near-perfect de novo reconstructions of unknown genomes is once again “on the table”. We will explain why, and further give a hypothesis as to why assemblers have improved only marginally since the era of the Human Genome Project circa 2000, namely that it is not about the assembly, but about the artifacts in the reads and the resolution of repeat families, topics that have not received sufficient attention.
Therefore we are developing algorithms that carefully analyze the input reads with new concepts such as “intrinsic quality values” and “overlap piles” and using these ideas to remove either a portion or all of a read when there is evidence it is not a segment of the underlying target genome.
We are also developing algorithms that discover and model repeat families before assembly in order to further facilitate repeat resolution during assembly. This long-standing topic remains a treasure-trove of combinatorial pattern problems in my view.

Tight tradeoffs for approximating palindromes in streams

Pawel Gawrychowski
We consider the question of finding the longest palindrome in a text of length n in the streaming model, where the characters arrive one-by-one, and we cannot go back and retrieve a previously seen character.
While computing the answer exactly using sublinear memory is not possible in such a setting, one can still hope for a good approximation guarantee.
We focus on the two most natural variants, where we aim for either an additive or a multiplicative approximation of the length of the longest palindrome.
We first show, that there is no point in considering either deterministic or Las Vegas algorithms in such a setting, as they cannot achieve sublinear space complexity.
For Monte Carlo algorithms, we provide a lowerbound of Ω(n/E) bits for approximating the answer with an additive error E, and Ω(logn/ε) bits for approximating the answer within a multiplicative factor of (1+ε). Then we construct a generic Monte Carlo algorithm, which by choosing the parameters appropriately achieves space complexity matching up to a logarithmic factor for both variants. This substantially improves the previous results [Berenbrink et al., STACS 2014] and essentially settles the space complexity of the problem.

What is the distance between motifs in RNA?

Rolf Backofen
Large RNA molecules are often composed of multiple functional domains whose spatial arrangement strongly influences their function.
Pre-mRNA splicing, for instance, relies on the spatial proximity of the splice junctions that can be separated by very long introns.
Similar effects appear in the processing of RNA virus genomes. These spatial distance can be approximated by the graph-distance in RNA secondary structure.
We show here that the equilibrium distribution of graph-distances between a fixed pair of nucleotides can be computed in polynomial time by means of dynamic programming. While a naïve implementation would yield recursions with a very high time complexity
of O(n6D5) for sequence length n and D distinct distance values, it is possible to reduce this to O(n4) for practical applications in which
predominantly small distances are of interest.