**Olga Troyanskaya**

An immense molecular complexity forms the foundation of human disease.
Our goal is to develop novel bioinformatics algorithms that can interpret and distill this complexity through integrative
analysis of massive collections of diverse high-throughput biological data, creating accurate models of molecular networks
and pathways, particularly those in which malfunction promotes the emergence of complex human disease.

Specifically, I will discuss novel machine learning approaches we developed to tackle the challenges of data heterogeneity, high noise levels,
and very limited training standards in genomics.

Using these approaches, we are able to make accurate predictions of protein function and interactions, and understand how biological networks
change across tissues, including elucidating molecular underpinnings of disease.

**Liat Rozenberg**

Jumbled Indexing, the problem of indexing a text for histogram queries, has been of much interest lately.

In this paper we consider jumbled indexing for run-length encoded texts. We refute a former conjecture and give an algorithm
for general-sized alphabets. We also consider Jumbled Borders, the extension of borders to jumbled strings; borders are the basis
for various string algorithms.

Finally, we consider Jumbled Squares, strings which are of the form x x', where x' is a jumbling of x.

We show efficient algorithms for these problems.

**Mikhail Roytberg**

Multiple alignment of weakly homologous protein sequences (sequence identity < 35%) is a difficult task.
Alignments produced by the best existing programs, e.g. ProbAlign, have an average accuracy and confidence of ~85-87%.

We propose a method allowing one, given a ProbAlign multiple alignment, to select ~50% of the aligned position pairs so that ~98% of the selected pairs belong to the corresponding gold-standard multiple alignment based on superposition of 3D protein structures.
The idea of the approach is to construct several alternative candidate multiple alignments and consider as reliable those amino acid correspondences that belong to all candidate alignments.
The method can be applied to alignments produced by an arbitrary algorithm.

**Dekel Tsur**

We study strategies of approximate pattern matching that exploit
bidirectional text indexes, extending and generalizing ideas of Lam et al.
We introduce a formalism, called search schemes, to specify search strategies of
this type, then develop a probabilistic measure for the efficiency of
a search scheme, prove several combinatorial results on efficient
search schemes, and finally, provide experimental computations
supporting the superiority of our strategies.

**Shiri Dori-Hacohen**

Alerting users about controversial search results can encourage
critical literacy, promote healthy civic discourse and counteract the
“filter bubble” effect, and therefore would be a useful feature in a search
engine or browser extension. In order to implement such a feature, however,
the binary classification task of determining which topics or webpages
are controversial must be solved. Earlier work described a proof of
concept using a supervised nearest neighbor classifier with access to an
oracle of manually annotated Wikipedia articles. This paper generalizes
and extends that concept by taking the human out of the loop, leveraging
the rich metadata available in Wikipedia articles in a weakly-supervised
classification approach. The new technique we present allows the nearest
neighbor approach to be extended to a much larger scale and to other
datasets. The results improve substantially over naive baselines and are
nearly identical to the oracle-reliant approach by standard measures of
F1, F0.5, and accuracy.

If time allows, we will discuss implications of solving this
problem as part of a broader subject of interest to the IR community, and
our current and future research directions.

**Gary Benson**

Next-generation sequencing and other processor-intensive sequence comparison applications have motivated a continued search for high-efficiency sequence alignment algorithms.

One successful approach, termed bit-parallel alignment, exploits the inherent parallelism of computer logic calculations. It represents individual cells of an alignment scoring matrix as bits in computer words and uses a series of bit operations (AND, OR, XOR, complement, shift, and addition) to emulate the calculation of scores.

Bit-parallelism has been successfully applied to the Longest Common Subsequence (LCS) and edit-distance problems, producing the fastest single-processor solutions to date. The bit-parallel LCS and edit-distance algorithms are closely tied to special properties of those problems, which has limited efforts to extend bit-parallelism to more general scoring schemes.
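
For illustration, a sketch of the classical bit-vector LCS-length computation (in the spirit of the algorithm of Crochemore, Iliopoulos, Pinzon and Reid): each bit of a word tracks one row of the DP matrix, and a whole column is updated with a handful of logic operations and one addition. This is background for the talk, not the integer-scoring method it presents.

```python
def lcs_length(a: str, b: str) -> int:
    """Bit-parallel LCS length.

    V holds one bit per position of a; after processing b[0..j],
    the number of zero bits in V equals LCS(a, b[0..j+1]).
    """
    m = len(a)
    mask = (1 << m) - 1
    # match[c]: bit i is set iff a[i] == c
    match = {}
    for i, c in enumerate(a):
        match[c] = match.get(c, 0) | (1 << i)
    v = mask  # all ones
    for c in b:
        y = match.get(c, 0)
        v = ((v + (v & y)) | (v & ~y)) & mask
    return m - bin(v).count("1")

assert lcs_length("abcbdab", "bdcaba") == 4  # classic example, LCS "bcba"
```

Each text character costs one addition plus a few bitwise operations, so strings up to the machine word length are processed in O(1) word operations per character.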

In this talk, I will describe our recent bit-parallel solution for general, integer-scoring global alignment. Integer-scoring schemes, which are widely used, assign integer weights for match, mismatch, and gap operations. The method depends on structural properties of the relationship between adjacent scores in the scoring array. We utilize these properties to construct a class of efficient algorithms, each designed for a particular set of weights.

The talk will conclude with extensions and generalizations of the approach.

**Gregory Kucherov**

CNRS/LIGM, Marne-la-Vallée

Next Generation Sequencing (NGS) technologies have now become ubiquitous in genomics studies, generating a massive flow of sequence data. Modern genomic projects ordinarily deal with data sets of hundreds of gigabases of nucleotides.

In this talk, we will be concerned with efficient data structures for storing NGS data. We first turn to the problem of genome assembly and, in particular, consider de Bruijn graphs which are one of the main tools for this task. We introduce "cascading Bloom filters" and show how they can reduce the memory required for storing de Bruijn graphs. We then briefly present how this data structure can be applied to other problems, namely to compression of NGS data and to metagenomics.
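
As background, here is a minimal single-level Bloom filter over k-mers; the cascading construction of the talk layers several such filters, with each level storing the false positives of the previous one. The hash construction and sizes below are illustrative assumptions, not the paper's parameters.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, h hash positions per item derived
    from salted blake2b. False positives are possible; false negatives
    are not."""

    def __init__(self, m_bits: int, n_hashes: int):
        self.m, self.h = m_bits, n_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str):
        for salt in range(self.h):
            d = hashlib.blake2b(item.encode(),
                                salt=salt.to_bytes(8, "little")).digest()
            yield int.from_bytes(d[:8], "little") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(item))

# store the 3-mers of a toy read set
kmers = {"ACG", "CGT", "GTT", "TTA"}
bf = BloomFilter(m_bits=1024, n_hashes=3)
for km in kmers:
    bf.add(km)
assert all(km in bf for km in kmers)  # no false negatives, ever
```

Storing the nodes of a de Bruijn graph this way costs only a few bits per node; the price is a small false-positive rate, which the cascading levels are designed to control.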

Joint work with Kamil Salikhov, Gustavo Sacomoto

**Laxmi Parida**

Consider the task of scoring or associating a quantitative (significance) value to common gene clusters observed across multiple species.

One of the crippling weaknesses has been the use of individual probabilities of the component genes: the value (estimate) of this probability
is questionable, and almost no consensus about it exists in the literature. In this talk, I will present a model that does not use any such probability
estimates; instead it exploits the observed structure (in terms of the layout along each chromosome) to give a meaningful score value for the cluster.

How computable is this model? I will address this question in the talk.

**Raphael Clifford**

I will present new time-space tradeoff lower bounds and
algorithms for exactly computing statistics of input data, including
frequency moments, element distinctness, and order statistics. These
problems are simple to solve for sorted data. Determining the
complexity of these problems where space is limited, on the other
hand, requires considerably more effort.

I will first discuss a new randomised algorithm for the element
distinctness problem that, given space S, runs in time roughly
O(*n*^{3/2}/*S*^{1/2}), beating previous time-space lower bounds for
comparison-based algorithms. I will then show how to extend this
method to a sliding-window version with only logarithmic-factor overheads.
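
The small-space approach builds on Pollard-rho-style cycle finding over pseudorandom hash functions. As a minimal illustration of cycle finding in O(1) space, here is the textbook special case of locating a repeated value among n+1 integers from {1,...,n}; this conveys the flavor only and is not the paper's algorithm.

```python
def find_duplicate(a):
    """Floyd's tortoise and hare in O(1) extra space.

    View i -> a[i] as a function on {0,...,n}. With n+1 values drawn
    from {1,...,n}, the trajectory from a[0] enters a cycle whose
    entry point is a duplicated value.
    """
    tortoise = hare = a[0]
    while True:
        tortoise = a[tortoise]   # one step
        hare = a[a[hare]]        # two steps
        if tortoise == hare:
            break
    tortoise = a[0]              # phase 2: locate the cycle entry
    while tortoise != hare:
        tortoise = a[tortoise]
        hare = a[hare]
    return tortoise

assert find_duplicate([1, 3, 4, 2, 2]) == 2
assert find_duplicate([3, 1, 3, 4, 2]) == 3
```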

If there is time, I will then show tight (to within log factors)
time-space tradeoff bounds for computing the number of distinct
elements, F_0, over sliding windows. The same lower bound holds for
computing the low-order bit of F_0 and computing any frequency moment
F_k for k \ne 1. This shows that the frequency moments F_k, k \ne 1, and even
the decision problem F_0 \bmod 2 are strictly harder than element
distinctness.

The paper is available online at http://arxiv.org/abs/1309.3690

Joint work with Paul Beame and Widad Machmouchi

**Maxime Crochemore**

The study of repetitions in texts constitutes a fundamental area of
combinatorics on words due to major applications to string algorithms,
data compression, music analysis, and biological sequence analysis.

The talk discusses the recent proof by Bannai et al. of the conjecture
stated by Kolpakov and Kucherov in 1999 on the maximal number of runs
occurring in a word.
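
The statement itself is easy to experiment with: a run is a maximal factor whose smallest period p is at most half its length, and the theorem says a word of length n has fewer than n runs. A brute-force counter for small words (this illustrates the bound, not the proof technique):

```python
def smallest_period(s: str) -> int:
    """Smallest period via the KMP prefix function: |s| - longest border."""
    fail, k = [0] * len(s), 0
    for i in range(1, len(s)):
        while k and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return len(s) - fail[-1]

def count_runs(s: str) -> int:
    """Count maximal periodicities (runs), brute force."""
    n, runs = len(s), 0
    for i in range(n):
        for j in range(i + 1, n):
            p = smallest_period(s[i:j + 1])
            if j - i + 1 < 2 * p:
                continue  # not periodic enough to be a run
            left_maximal = i == 0 or s[i - 1] != s[i - 1 + p]
            right_maximal = j == n - 1 or s[j + 1] != s[j + 1 - p]
            if left_maximal and right_maximal:
                runs += 1
    return runs

assert count_runs("aabb") == 2     # "aa" and "bb"
assert all(count_runs(w) < len(w)  # the runs-theorem bound
           for w in ("aa", "abab", "mississippi"))
```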

**Kunsoo Park**

Business planning as well as analytics on top of large-scale database systems is valuable to decision makers, but planning operations known and implemented so far are very basic.

In this paper we propose a new planning operation called interval disaggregate, which goes as follows. Suppose that the planner, typically the management of a company, plans
sales revenues of its products in the current year. An interval of the expected revenue for each product in the current year is computed from historical data in the database as
the prediction interval of linear regression on the data. A total target revenue for the current year is given by the planner. The goal of the interval disaggregate operation is
to find an appropriate disaggregation of the target revenue within the boundaries of the intervals.

We formulate the problem of interval disaggregation more precisely and give solutions for the problem. Multidimensional geometry plays a crucial role in the problem formulation and the solutions.

We implemented interval disaggregation into the planning engine of SAP HANA and did experiments on real-world data. Our experiments show that interval disaggregation gives more appropriate solutions
with respect to historical data than the known basic disaggregation called referential disaggregation.

**Zvi Galil**

In May 2013, Georgia Tech together with its partners -- Udacity and AT&T -- announced a new online Master's degree in Computer Science based on massive open online courses (MOOCs).

A complete degree will cost less than $7000 compared to over $40,000 in public universities and over $70,000 in private universities. The program was launched in January of 2014. It is the first of its kind.

The reaction to this program has been unprecedented -- there have been more than 600 distinct news articles about it.

It has been described as potentially a "game changer" and "the first real step in the transformation of higher education in the US".

Two examples are an article from the front page of NYTimes (also the International Herald Tribune) in August 2013, http://nyti.ms/1s3tz4G and one in Inside Higher Education in June 2014, http://bit.ly/1AQDERe.

We intentionally started with a small enrollment of about 400 students and plan to scale the program up gradually: this fall we had over 1200 students, and next month more than 2000.

The talk will describe the new program, how it came about, its first three semesters and what we've learned from them, and its potential in the US and internationally.

**Moshe Lewenstein**

Histogram indexing is the problem where one preprocesses a text to answer histogram queries: given a histogram, decide
whether the text has a substring with that histogram. Despite much research on the problem in the last few years, the best algorithms, even for the binary case,
are more-or-less the same as the naive bounds.

We present a new method to solve this problem using additive combinatorics. Our solution gives an O(n^{1.859}) preprocessing-time algorithm for constant-sized alphabets (instead of the naive O(n^2)) that answers queries in O(1) time.
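
For intuition on the naive baseline in the binary case: a query histogram is a pair (number of zeros, number of ones), and since sliding a window one position changes its ones-count by at most 1, the achievable counts for each window length form a contiguous interval. Precomputing these intervals takes O(n^2) time and gives O(1) queries; a sketch:

```python
def build_index(s: str):
    """For each window length l, the ones-counts over all windows of
    length l form an interval [lo[l], hi[l]] (sliding one step changes
    the count by at most 1). O(n^2) preprocessing."""
    n = len(s)
    ones = [0] * (n + 1)
    for i, c in enumerate(s):
        ones[i + 1] = ones[i] + (c == "1")
    lo, hi = [0] * (n + 1), [0] * (n + 1)
    for l in range(1, n + 1):
        counts = [ones[i + l] - ones[i] for i in range(n - l + 1)]
        lo[l], hi[l] = min(counts), max(counts)
    return lo, hi

def has_histogram(lo, hi, zeros: int, k_ones: int) -> bool:
    """O(1) query: does some substring have exactly this histogram?"""
    l = zeros + k_ones
    return 1 <= l < len(lo) and lo[l] <= k_ones <= hi[l]

lo, hi = build_index("01101001")
assert has_histogram(lo, hi, 1, 2)      # e.g. the substring "011"
assert not has_histogram(lo, hi, 0, 4)  # no substring "1111"
```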

Joint work with Timothy Chan

**Rajeev Raman**

In recent years, there has been a rapid increase in research on data
structures that answer complex queries rapidly on data stored in
compressed formats. This has been driven by the increasing need to
analyze and search for complex patterns in very large data sets.
Developing data structures that operate on compressed data allows the
entire data structure to fit into the main memory of a computer, which
in turn avoids expensive or inconvenient computation on hard disks.

This talk will focus on a new direction: encoding the ``data structure''
itself. Many data structuring problems are such that the queries they
answer do not allow the original data to be reconstructed purely through
the queries. This presents opportunities to obtain space savings even
when the data is incompressible, by pre-processing the data, extracting
only the information needed to answer the queries, and then deleting the
data. By storing only the data needed to answer the queries, the data
structures could also be said to operate on a ``need to know'' basis,
which could be important from a data security perspective.

**Philip Bille**

The longest common extension (LCE) of two indices in a string is the length of the longest identical substrings starting at these two indices. The LCE problem asks to preprocess a string into a compact data structure that supports fast LCE queries.
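
For concreteness, here is the naive per-query scan that an LCE data structure replaces with a constant-time lookup:

```python
def lce_naive(s: str, i: int, j: int) -> int:
    """Longest common extension of positions i and j in s:
    length of the longest common prefix of s[i:] and s[j:].
    O(n) per query; an LCE index answers this in O(1)."""
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k

s = "abcabcax"
assert lce_naive(s, 0, 3) == 4  # "abcabcax" vs "abcax" share "abca"
assert lce_naive(s, 1, 4) == 3  # "bcabcax" vs "bcax" share "bca"
```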

In this paper we generalize the LCE problem to trees. Given a labeled and rooted tree T, the goal is to preprocess T into a compact data structure that supports the following LCE queries between subpaths and subtrees in T. Let v1, v2, w1, and w2 be nodes of T such that w1 and w2 are descendants of v1 and v2, respectively.

LCEPP(v1, w1, v2, w2): (path-path LCE) return the longest common prefix of the paths v1 -> w1 and v2 -> w2.

LCEPT(v1, w1, v2): (path-tree LCE) return the maximal path-path LCE of the path v1 -> w1 and any path from v2 to a descendant leaf.

LCETT(v1, v2): (tree-tree LCE) return the maximal path-path LCE over all pairs of paths from v1 and v2 to descendant leaves.

We present the first non-trivial bounds for supporting these queries. Let n be the size of T.

For LCEPP and LCEPT queries, we present linear-space solutions with O(log log *n*) and O(log *n*) query time, respectively.

For LCETT queries, we show a reduction to the set intersection problem, implying that a fast linear-space solution is unlikely to exist.

We complement this with a time-space trade-off: for any parameter t, we obtain an O(nt)-space, O(n/t) query-time solution.

Finally, we suggest a few applications of LCE in trees to tries and XML databases.

**Steven Skiena**

Tremendous advances have been made in reducing the cost of
DNA synthesis. We are entering an age of synthetic biology, where we
can design and synthesize new life forms for scientific and medical
applications. But what sequence properties optimize translation
and gene expression?

Our gene design algorithms optimize the DNA sequence of a gene for
particular desired properties while coding for a specified protein.
For vaccine design, we optimized the codon-pair bias of a sequence to
modulate expression. We have also developed sequence design
algorithms to optimize the presence of RNA secondary structure and
the degree of sequence autocorrelation, which affects the frequency
of tRNA reuse and recycling.

In this talk, I will discuss our experiences with designing and
synthesizing these coding sequences, and their resulting phenotypes.

I will also present some recent work analyzing ribosome profiling data
to understand the impact of codon-bias on translation efficiency.
(joint work with Justin Gardin, Ruhksana Yeasmin, Alisa Yurovsky,
and Bruce Futcher)

**Mika Amit**

A string S is called periodic in $P$ if it can be written as $S=P^r$, where $r > 1$ is a real number (for non-integer $r$, the last copy of $P$ is truncated).

The shortest such string, $P$, is called the period of $S$, and its length is denoted by $p$.

In this paper we investigate the period recovery over the edit distance problem:
for a given string $S$ of length $n$, find the smallest period $P$ such that the edit distance between $S$ and $P^{n/p}$ is smaller than $\frac{n}{(4+\epsilon)p}$, for $0 < \epsilon < 1$. We give an $O(n^{\frac{4}{3}}\log n)$ time algorithm for this problem.

In addition, we solve the period recovery over the Hamming distance problem, which for the same input finds the smallest period $P$ such that the Hamming distance between $S$ and $P^{n/p}$ is smaller than $\frac{n}{(2+\epsilon)p}$, for $0 < \epsilon < 1$.

For this problem, we describe an $O(n\log n)$ time algorithm.
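
To make the objective concrete, checking one candidate period under Hamming distance is straightforward. The sketch below scans prefix candidates only, which is a deliberate simplification for illustration; the paper's O(n log n) algorithm does not enumerate candidates this way.

```python
def hamming_to_periodic(s: str, p: str) -> int:
    """Hamming distance between s and p repeated (and truncated)
    to length |s|."""
    return sum(c != p[i % len(p)] for i, c in enumerate(s))

def recover_prefix_period(s: str, eps: float) -> str:
    """Illustrative brute force: return the shortest prefix of s whose
    periodic extension is within the error bound n / ((2 + eps) * p)."""
    n = len(s)
    for p_len in range(1, n + 1):
        if hamming_to_periodic(s, s[:p_len]) < n / ((2 + eps) * p_len):
            return s[:p_len]
    return s

assert hamming_to_periodic("abcabcabdabc", "abc") == 1  # one corrupted char
assert recover_prefix_period("abcabcabdabc", 0.5) == "abc"
```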

This is a joint work with Amihood Amir, Gad M. Landau and Dina Sokol.

**Christian Komusiewicz**

String problems arise in various applications ranging from text mining
to biological sequence analysis. Many string problems are NP-hard. This
motivates the search for (fixed-parameter) tractable special cases of these
problems. We survey parameterized and multivariate algorithmics results for
NP-hard string problems and identify challenges for future research.

**Shir Landau Feibish**

Techniques for sampling and detecting large flows in software
defined networks (SDN) with OpenFlow are presented.

We start by presenting methods for randomly sampling packets with
probability p, both independently of and depending on packet size.
Sampling techniques that are immune to various cyber attacks
are also considered. Building on these sampling techniques, we develop
several efficient methods to detect large flows. Based on time and
other parameters, we differentiate between heavy flows, elephant
flows and bulky flows, and present efficient algorithms for
detecting each type. In all cases, the techniques are efficient
in both flow-table size and switch-controller communication.
The presented techniques are based on OpenFlow 1.3 capabilities.
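
As a toy illustration of the two sampling modes (uniform probability p, and probability depending on packet size), ignoring all OpenFlow and attack-resistance specifics; the packet representation and threshold parameter are assumptions of this sketch.

```python
import random

def sample_uniform(packets, p, rng=random.random):
    """Keep each packet independently with probability p."""
    return [pkt for pkt in packets if rng() < p]

def sample_by_size(packets, bytes_per_sample, rng=random.random):
    """Size-dependent sampling: a packet of b bytes is kept with
    probability min(1, b / bytes_per_sample), so heavy flows are
    proportionally more likely to be observed."""
    return [pkt for pkt in packets
            if rng() < min(1.0, pkt["size"] / bytes_per_sample)]

random.seed(0)
pkts = [{"size": 100}] * 10000
assert 4500 <= len(sample_uniform(pkts, 0.5)) <= 5500
assert sample_by_size([{"size": 2000}], 1000) == [{"size": 2000}]  # prob capped at 1
```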

Joint work with: Yehuda Afek, Anat Bremler-Barr and Liron Schiff

**Ron Shamir**

Novel technologies can generate large sets of short double-stranded DNA sequences and measure the regulatory effects of these sequences.

By using sequence probes that cover all k-mers, a comprehensive picture of the effect of all possible short sequences on gene regulation is obtained.

The value of k that can be used in practice is, however, severely limited by cost and space considerations.

A key challenge is, therefore, to cover all k-mers with a minimal number of probes.

The standard way to do this uses a de Bruijn sequence of order k, which has length 4^k. However, as probes are double stranded, when a k-mer is included in a probe,
its reverse-complement k-mer is accounted for as well.

We show how to efficiently create a shortest possible sequence with the property that it contains each k-mer or its reverse complement, but not necessarily both.
The length of the resulting sequence approaches half that of the de Bruijn sequence as k increases.
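
The covering property is easy to state in code. Here is a brute-force checker (the paper's contribution is constructing a shortest such sequence, not verifying one):

```python
from itertools import product

def revcomp(kmer: str) -> str:
    comp = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(comp[c] for c in reversed(kmer))

def covers_all_kmers_or_rc(seq: str, k: int) -> bool:
    """True iff every DNA k-mer occurs in seq, or its reverse
    complement does (a double-stranded probe presents both)."""
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return all("".join(km) in present or revcomp("".join(km)) in present
               for km in product("ACGT", repeat=k))

assert covers_all_kmers_or_rc("AC", 1)             # A, C cover T, G via RC
assert not covers_all_kmers_or_rc("AA", 1)         # the C/G pair is missing
assert covers_all_kmers_or_rc("AACAGCCGGATAT", 2)  # all 10 RC-classes of 2-mers
```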

Yaron Orenstein and Ron Shamir

**Dan Gusfield**

Cubic-time worst-case dynamic programming solutions to the problem
of minimizing free energy in folded RNA were introduced about forty years
ago.

Several years ago we showed that the worst-case time can be reduced by
a factor of log n, where n is the length of the RNA sequence. Since then, the bound
has been reduced further, and the techniques (exploiting the Four-Russians idea)
have been extended to other RNA folding problems.

Additionally, a simpler way was developed to explain and implement the first result (i.e., the speedup by a log n factor). In this talk, I will discuss the history and the simpler approach.

This is joint work with Y. Frid and B. Venkatachalam.

**Ilan Ben-Bassat**

New sequencing technologies generate ever larger amounts of short-read data at decreasing cost.

De novo sequence assembly is the problem of combining these reads back into the original genome sequence, without relying on a reference genome.
This presents algorithmic and computational challenges, especially for long and repetitive genome sequences.

Most existing approaches to the assembly problem operate in the framework of de Bruijn graphs.
Yet a number of recent works employ the *string graph* paradigm, using a variety of methods for
storing and processing suffixes and prefixes, such as suffix arrays, the Burrows--Wheeler transform, or the FM-index.

Our work is motivated by a search for new approaches to constructing the string graph, using alternative yet simple data structures and algorithmic concepts.

We introduce a novel, hash-based method for constructing the string graph. We employ incremental hashing, specifically a modification of the Karp--Rabin fingerprint,
together with Bloom filters. These probabilistic methods may create false positive and false negative edges during the algorithm's execution, but all such errors are detected and corrected.
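
The incremental-hashing building block is the classical Karp--Rabin fingerprint, which can be updated in constant time as a window slides one position; the base and modulus below are illustrative choices.

```python
class RollingHash:
    """Karp--Rabin fingerprint of a fixed-length window:
    h = sum(s[i] * B^(k-1-i)) mod M, updated in O(1) per slide."""

    def __init__(self, window: str, base: int = 256, mod: int = (1 << 61) - 1):
        self.b, self.m, self.k = base, mod, len(window)
        self.top = pow(base, self.k - 1, mod)  # weight of the outgoing char
        self.h = 0
        for c in window:
            self.h = (self.h * base + ord(c)) % mod

    def roll(self, out_c: str, in_c: str) -> int:
        """Slide the window: drop out_c on the left, add in_c on the right."""
        self.h = ((self.h - ord(out_c) * self.top) * self.b + ord(in_c)) % self.m
        return self.h

s, k = "GATTACA", 3
h = RollingHash(s[:k])
hashes = [h.h] + [h.roll(s[i], s[i + k]) for i in range(len(s) - k)]
# rolling fingerprints agree with freshly computed ones
assert hashes == [RollingHash(s[i:i + k]).h for i in range(len(s) - k + 1)]
```

Equal windows always receive equal fingerprints; distinct windows collide only with small probability, which is why the construction must detect and correct false edges.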

The advantages of the proposed approach over existing methods
are its simplicity and its incorporation of established probabilistic techniques
in the context of de novo genome sequencing. Our preliminary implementation
compares favorably to the first string graph construction of Simpson and Durbin (2010) (but not to subsequent improvements).

Further research and optimizations will hopefully enable the algorithm to be incorporated, with noticeable performance improvement, in state of the art string graph based assemblers.

Joint work with Benny Chor

**Jeanette Schmidt**

Copy number analysis in tumors is rapidly gaining importance in cancer therapy as a tool for
differential diagnosis with impact on potential treatment [1].

New evidence argues that tumors can be classified into those driven by somatic mutations (M class) and those driven by copy number aberrations (C class).

Formalin fixed paraffin embedded (FFPE) blocks are the most common sources of material for both research and clinical diagnostics, and DNA
degradation in FFPE presents a major challenge for accurate measurement of copy number.

We address this challenge using molecular inversion probes, which capture the alleles of over 220,000 Single Nucleotide Polymorphisms (SNPs) at
carefully selected genomic locations, evenly distributed across the genome and with increased density
within ~900 cancer-related genes.

Another challenge of tumor samples is the presence of normal cells in most biopsy samples, which affects copy number
estimates and the characterization of the genome of the tumor. To address this problem of variable tumor burden, we
developed TuScan™, an algorithm inspired by ASCAT [2], to estimate the integer copy number in the tumor at each locus.

When a major clone is responsible for the majority of copy number changes, the algorithm estimates the tumor burden in
the sample and reports the integer copy number in the cancer portion only, effectively subtracting the normal component,
thereby enabling a comparison between samples with different tumor burden. For highly heterogeneous samples
or very low tumor burden, the algorithm reports the fractional (average) copy number of all cells within the sample.

[1] G. Ciriello et al., Nat. Genet. 45, 1127–1133, 2013.

[2] P. Van Loo et al., PNAS 107(39), 16910–16915, 2010.

**Baruch M. Schieber**

In the k-supplier problem, we are given a set of clients C and set of facilities F located in a metric space along with a bound k.

The goal is to open a subset of k facilities so as to minimize the maximum distance of a client to an open facility.

We present a (1+√3) < 2.74 approximation algorithm for the k-supplier problem in Euclidean metrics.

This improves the previously known 3-approximation algorithm which also holds for general metrics (where it is known to be tight).

It has been shown that it is NP-hard to approximate Euclidean k-supplier to better than a factor of √7, which is approximately 2.65, even in dimension two.

Our algorithm is based on a relation to the edge cover problem.

We also present a nearly linear, O(*n* log *n*) time algorithm for Euclidean k-supplier in constant dimensions that achieves an approximation ratio of 2.965.

This is joint work with Viswanath Nagarajan and Hadas Shachnai

**Barak Navon**

In this talk we consider the following natural graph cut problem which we call Critical Node Cut (CNC):
Given a graph G on n vertices, and two positive integers k and x, determine whether G has a set of k vertices whose removal leaves G
with at most x connected pairs of vertices.

We analyzed this problem in the framework of parameterized complexity, considering four natural parameters and their aggregations.

In the talk we will give an overview of our results, as well as discuss future work on the problem.

**Gene Myers**

With the advent of long read sequencers such as the PacBio RS II, the goal of near-perfect de novo reconstructions of unknown genomes is once again “on the table”.
We will explain why, and further give a hypothesis as to why assemblers have improved only marginally since the era of the Human Genome Project circa 2000:
it is not about the assembly, but about artifacts in the reads and the resolution of repeat families, topics that have not received sufficient attention.

Therefore, we are developing algorithms that carefully analyze the input reads using new concepts such as “intrinsic quality values” and “overlap piles”, and that use
these ideas to remove part or all of a read when there is evidence that it is not a segment of the underlying target genome.

We are also developing algorithms that discover and model repeat families before assembly in order to further facilitate repeat resolution during assembly.
This long-standing topic remains a treasure-trove of combinatorial pattern problems in my view.

**Pawel Gawrychowski**

We consider the question of finding the longest palindrome in a text of length n in the streaming model, where the characters arrive one-by-one, and we cannot go back and retrieve a previously seen character.

While computing the answer exactly using sublinear memory is not possible in such a setting, one can still hope for a good approximation guarantee.

We focus on the two most natural variants, where we aim for either an additive or a multiplicative approximation of the length of the longest palindrome.

We first show that there is no point in considering deterministic or Las Vegas algorithms in this setting, as they cannot achieve sublinear space complexity.

For Monte Carlo algorithms, we prove a lower bound of Ω(n/E) bits for approximating the answer with additive error E, and Ω((log n)/ε) bits for approximating the answer within a multiplicative factor of (1+ε).
We then construct a generic Monte Carlo algorithm whose space complexity, with parameters chosen appropriately, matches these lower bounds up to a logarithmic factor in both variants.
This substantially improves the previous results [Berenbrink et al., STACS 2014] and essentially settles the space complexity of the problem.

**Rolf Backofen**

Large RNA molecules are often composed of multiple functional domains whose spatial arrangement strongly influences their function.

Pre-mRNA splicing, for instance, relies on the spatial proximity of the splice junctions that can be separated by very long introns.

Similar effects appear in the processing of RNA virus genomes. These spatial distances can be approximated by graph distances in the RNA
secondary structure.

We show here that the equilibrium distribution of graph distances between a fixed pair of nucleotides can be computed
in polynomial time by means of dynamic programming. While a naïve implementation would yield recursions with a very high time complexity
of O(*n*^{6}D^{5}) for sequence length n and D distinct distance values, it is possible to reduce this to O(*n*^{4}) for practical applications in which
predominantly small distances are of interest.