The following projects are offered by *Prof. Ehud Gudes*

**Investigating and Implementing XML indexes**

**For: 1-2 students with background in Databases and an affinity for Algorithms**

This project is composed of two parts. The first part is reading papers and surveying methods for constructing various types of indexes for XML.

The second part is implementing several types of such indexes and comparing their performance on several benchmark files.
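
One of the simplest index types the survey is likely to cover is a structural path summary in the spirit of a DataGuide. The sketch below is a minimal illustration, not one of the project's required indexes: it scans a document with StAX and counts element occurrences per root-to-element path; the input is assumed to be any well-formed XML file (e.g., an XMark benchmark document).

```java
import java.io.FileInputStream;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Structural path summary: maps each root-to-element path
// (e.g. site/people/person) to its number of occurrences.
public class PathIndex {

    public static Map<String, Integer> build(String xmlFile) throws Exception {
        Map<String, Integer> index = new HashMap<>();
        Deque<String> path = new ArrayDeque<>();   // current element stack, root first
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(xmlFile));
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                path.addLast(r.getLocalName());
                index.merge(String.join("/", path), 1, Integer::sum);
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                path.removeLast();
            }
        }
        return index;
    }

    public static void main(String[] args) throws Exception {
        build(args[0]).forEach((p, n) -> System.out.println("/" + p + " -> " + n));
    }
}
```

A path query such as /site/people/person can then be answered against this summary without touching the full document, which is the basic trade-off the performance comparison would measure.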

A report comparing the index methods will be required.

**Evaluating Reputation-based Clustering Algorithms**

**For: 1-2 students with background in Databases and/or Data security**

The problem of identifying groups of trust (knots) in a trust network is modeled as a graph clustering problem, where vertices correspond to individuals and edges describe relationships. Under this interpretation, a community is represented by a directed graph in which vertices represent members and edges represent the trust relations between the members at their end-point vertices.

A path between two vertices that are not connected by an edge represents
the transitive property of trust (e.g. Alice trusts Bob + Bob trusts Clair
=> Alice trusts Clair).

Correlation clustering is a powerful technique for discovering groups of trust in graphs. It operates on the pair-wise relationships between vertices, partitioning the graph to maximize the number of related pairs that are clustered together plus the number of unrelated pairs that are separated. We investigate heuristic algorithms for correlation clustering with restricted cluster diameter (to avoid trust based on long paths of transitive trust).
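
As a concrete starting point, here is a minimal sketch of a pivot-style heuristic (in the spirit of CC-Pivot, Ailon et al. 2005), not the project's developed heuristics: it treats trust edges as symmetric for simplicity, and its star-shaped clusters have diameter at most 2, which is one crude way to respect the diameter restriction. The adjacency-map representation is an assumption.

```java
import java.util.*;

// Pivot-style correlation clustering heuristic: repeatedly pick a random
// unassigned vertex and cluster it with its unassigned trust neighbors.
// Each cluster is a star around its pivot, so its diameter is at most 2,
// limiting reliance on long transitive-trust paths.
// Assumes every vertex appears as a key (isolated vertices map to an empty set).
public class PivotClustering {

    public static List<Set<Integer>> cluster(Map<Integer, Set<Integer>> trustEdges, long seed) {
        List<Integer> order = new ArrayList<>(trustEdges.keySet());
        Collections.shuffle(order, new Random(seed));   // random pivot order
        Set<Integer> assigned = new HashSet<>();
        List<Set<Integer>> clusters = new ArrayList<>();
        for (int pivot : order) {
            if (assigned.contains(pivot)) continue;
            Set<Integer> c = new HashSet<>();
            c.add(pivot);
            for (int v : trustEdges.getOrDefault(pivot, Set.of())) {
                if (!assigned.contains(v)) c.add(v);    // absorb free neighbors
            }
            assigned.addAll(c);
            clusters.add(c);
        }
        return clusters;
    }
}
```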

The goal of the project is to implement the developed heuristics and evaluate the optimality/performance tradeoff, mainly on clustering maintenance algorithms.

**Co-advisor: Nurit Gal-Oz**


**Project Title: Implement and Test the Notos Domain Reputation Model**

**Number of students: 2**

**Advisors:** Ehud Gudes and Igor Mishky <igormishsky@gmail.com>

__Project description:__

Estimating the reputation of domains is a very important problem in trying to identify domains suspected of spreading malware or of serving as command-and-control for botnets.

The Notos model was the first model suggested for computing this reputation, based on DNS request logs and statistical properties of the domain/IP network.

The goal will be to study this model and apply it (approximately) to the DNS logs we now have in the Cyber Kabarnit research project.
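
As a rough illustration of the kind of statistics involved, the sketch below aggregates a few per-domain network features (distinct resolved IPs, distinct /24 prefixes, mean TTL) from parsed DNS records. The DnsRecord shape is hypothetical, standing in for whatever format the Kabarnit logs actually use, and these are only a small subset of the feature families Notos defines.

```java
import java.util.*;
import java.util.stream.Collectors;

// Toy per-domain feature extraction in the spirit of Notos'
// network-based features; record layout is a hypothetical stand-in.
public class DomainFeatures {
    record DnsRecord(String domain, String ip, int ttl) {}
    record Features(long distinctIps, long distinctPrefixes, double meanTtl) {}

    // Crude /24 prefix of a dotted-quad IPv4 address.
    static String slash24(String ip) {
        return ip.substring(0, ip.lastIndexOf('.'));
    }

    public static Map<String, Features> extract(List<DnsRecord> log) {
        return log.stream()
            .collect(Collectors.groupingBy(DnsRecord::domain,
                Collectors.collectingAndThen(Collectors.toList(), rs -> new Features(
                    rs.stream().map(DnsRecord::ip).distinct().count(),
                    rs.stream().map(r -> slash24(r.ip())).distinct().count(),
                    rs.stream().mapToInt(DnsRecord::ttl).average().orElse(0)))));
    }
}
```

Domains resolving to many IPs spread over many prefixes with short TTLs are the classic fast-flux signature that such features are meant to surface.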

Reference:

[1] Manos Antonakakis, Roberto Perdisci, David Dagon, Wenke Lee, Nick Feamster: Building a Dynamic Reputation System for DNS. USENIX Security Symposium 2010: 273-290

**Project Title: Authorization for Hadoop-based Systems**

**Number of students: 2**

The project goal is to validate and come up with guidance/best practices for leveraging HBase and Hive for fine-grained access control.

The students will need to read the literature, investigate, implement, and produce a best-practices paper with samples. Customers can use these best practices to leverage HBase and Hive for fine-grained access control.

The students will have to provide both an analysis of the access control features in the above databases and an implementation of an example application which will demonstrate some of the available (and lacking) features.
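
As one example of the features to be demonstrated, the sketch below exercises HBase's authorization API, assuming a cluster with the AccessController coprocessor enabled (hbase.security.authorization=true). Table, column, and user names are illustrative, and the client API varies somewhat across HBase versions (this follows the 1.x-era API).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.security.access.AccessControlClient;
import org.apache.hadoop.hbase.security.access.Permission;
import org.apache.hadoop.hbase.util.Bytes;

// Fine-grained authorization in HBase: a column-family-level grant via
// AccessControlClient, and a per-cell ACL attached to a Put.
public class HBaseAclDemo {
    public static void main(String[] args) throws Throwable {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            TableName patients = TableName.valueOf("patients");

            // Column-family-level grant: user "alice" may read family "clinical".
            AccessControlClient.grant(conn, patients, "alice",
                    Bytes.toBytes("clinical"), null, Permission.Action.READ);

            // Cell-level ACL: only "alice" may read this particular cell.
            try (Table t = conn.getTable(patients)) {
                Put p = new Put(Bytes.toBytes("row1"));
                p.addColumn(Bytes.toBytes("clinical"), Bytes.toBytes("dx"),
                        Bytes.toBytes("ICD-10 E11"));
                p.setACL("alice", new Permission(Permission.Action.READ));
                t.put(p);
            }
        }
    }
}
```

Hive, by contrast, enforces fine-grained access mostly at the query layer (column-level grants and, in newer versions, row filtering), so the analysis can contrast the two enforcement points.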

Advisors: Ehud Gudes and Boris Rozenberg (IBM)

**Project Title: Discovering Unique Patterns in Large Datasets**

**Number of students: 2**

**Co-Advisor:** Boris Rozenberg, borisr@il.ibm.com

__Project description:__

The analysis of personal data is fundamental to government policy-making and academic research. This information is often collected under guarantees of confidentiality. In order for maximum benefit to be gained, the data needs to be made available to as wide a group of researchers as possible. However, this poses a risk of disclosure, one that has worsened as the potential for linking independent datasets has risen.

The problem of preventing statistical disclosure is approached by estimating the risk of a certain individual being identified (statistical disclosure assessment) and then by applying statistical disclosure control, variously recoding, masking and perturbing the data in order to reduce the statistical disclosure risk.
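
For intuition, here is a toy sketch of the control operations named above (recoding and perturbation) applied to a hypothetical microdata record; the field layout and noise level are illustrative assumptions, not a prescribed method.

```java
import java.util.Random;

// Toy statistical disclosure control: recode quasi-identifiers into
// coarser categories and perturb a numeric attribute with noise.
public class SdcDemo {
    record Person(int age, String zip, double income) {}
    record Masked(String ageBand, String zip3, double noisyIncome) {}

    static Masked mask(Person p, Random rnd) {
        int decade = p.age() / 10 * 10;
        String band = decade + "-" + (decade + 9);            // recode age to a decade band
        String zip3 = p.zip().substring(0, 3) + "**";          // coarsen ZIP to its 3-digit prefix
        double noisy = p.income()
                + rnd.nextGaussian() * 0.05 * p.income();      // perturb income with ~5% noise
        return new Masked(band, zip3, noisy);
    }
}
```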

Of particular concern are records whose contents, or attributes, are unique and therefore have the potential to be matched directly with details (including names and addresses) from another dataset.

Another area where unique patterns (or itemsets) are of interest is Anomaly (or Outlier) Detection. Anomaly Detection techniques are widely used in various domains, especially for fraud detection. Records containing unique patterns are suspicious and should be considered by any Anomaly Detection system.

A special algorithm, the Special Uniques Detection Algorithm (SUDA), and its later version SUDA2 [1], have been designed to efficiently discover all minimal unique itemsets, i.e., minimal itemsets of frequency one.

The project goals are therefore: (1) implement the SUDA2 algorithm (in Matlab or Java); (2) adapt the search strategies used in SUDA2, which extend easily to mining rare minimal itemsets (i.e., of frequency below a threshold), and provide a suitable implementation.
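
For reference, here is a naive level-wise baseline, deliberately not SUDA2 (whose recursive search is the point of goal (1)): it finds minimal unique itemsets by exhaustive support counting, relying on the anti-monotonicity of support so that checking only immediate subsets suffices for minimality. It is exponential and only suitable as a correctness oracle on toy data.

```java
import java.util.*;

// Naive minimal-unique-itemset finder: enumerate all itemsets up to maxSize,
// count support, and keep those of frequency one whose immediate subsets
// all have frequency > 1.
public class MinimalUniques {

    public static Set<Set<String>> find(List<Set<String>> records, int maxSize) {
        Map<Set<String>, Integer> support = new HashMap<>();
        for (Set<String> rec : records) {
            for (Set<String> sub : subsets(rec, maxSize)) {
                support.merge(sub, 1, Integer::sum);
            }
        }
        Set<Set<String>> result = new HashSet<>();
        for (Map.Entry<Set<String>, Integer> e : support.entrySet()) {
            if (e.getValue() != 1) continue;             // must be unique
            boolean minimal = true;
            for (String item : e.getKey()) {             // every immediate subset...
                Set<String> sub = new HashSet<>(e.getKey());
                sub.remove(item);
                if (!sub.isEmpty() && support.getOrDefault(sub, 0) == 1) {
                    minimal = false;                     // ...must be non-unique
                    break;
                }
            }
            if (minimal) result.add(e.getKey());
        }
        return result;
    }

    // All non-empty subsets of rec with size <= maxSize (exponential; toy data only).
    private static List<Set<String>> subsets(Set<String> rec, int maxSize) {
        List<Set<String>> out = new ArrayList<>();
        List<String> items = new ArrayList<>(rec);
        for (int mask = 1; mask < (1 << items.size()); mask++) {
            if (Integer.bitCount(mask) > maxSize) continue;
            Set<String> s = new HashSet<>();
            for (int i = 0; i < items.size(); i++) {
                if ((mask & (1 << i)) != 0) s.add(items.get(i));
            }
            out.add(s);
        }
        return out;
    }
}
```

Swapping the `e.getValue() != 1` test for `e.getValue() >= threshold` gives the rare-itemset variant of goal (2).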

Reference:

[1] A. M. Manning, D. J. Haglin, and J. A. Keane. A recursive search algorithm for statistical disclosure assessment. Data Mining and Knowledge Discovery (2008), 16:165-196.

**Implementing Parallel Data Mining Algorithms Using Map/Reduce**

**For: 2 students with background in Databases and/or Data security**

Map/Reduce is a recent technique for implementing parallel, data-intensive algorithms; the Hadoop project is its best-known open-source implementation. The goal of this project is to implement several data mining algorithms (e.g., FSG, gSpan, SPADE, SUBSEA) in this framework and compare them to their sequential versions.
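
These algorithms share a candidate-generation/support-counting loop, and support counting is the naturally parallel step. The Hadoop skeleton below sketches that step only; the mapper's candidate enumeration is stubbed out, since that is the algorithm-specific part (FSG/gSpan enumerate subgraphs, SPADE enumerates sequences). Class and job names are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Parallel support counting: mappers emit (candidate pattern, 1) per input
// record; reducers sum the counts into global support.
public class SupportCount {

    public static class PatternMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text record, Context ctx)
                throws IOException, InterruptedException {
            // Stub: a real FSG/gSpan mapper would enumerate the candidate
            // subgraphs contained in this input record instead of tokens.
            for (String pattern : record.toString().split("\\s+")) {
                ctx.write(new Text(pattern), ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text pattern, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(pattern, new IntWritable(sum));   // global support of the pattern
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pattern-support");
        job.setJarByClass(SupportCount.class);
        job.setMapperClass(PatternMapper.class);
        job.setCombinerClass(SumReducer.class);         // local pre-aggregation
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The comparison against the sequential versions then amounts to running this job per candidate-generation round and measuring speedup against a single-machine run.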

**Co-advisor: Yaron Gonen**

**Generating Fraud Scenario Patterns in Simulated Utility Consumption Data**

**For: 2 students with background in Databases and/or Data mining**

The project will include the generation of simulated energy consumption data for a large population of prosumers (consumers that may also be producers), and the generation of fraud scenario patterns, e.g., a scenario in which one neighbor steals electricity from another neighbor. The volume of the data can be quite large: 1 million prosumers generate between 8 and 32 billion meter readings per year. The preferred environment for the project is Hadoop. The project will require creation and manipulation of data in this environment, based on simple statistical models, and the development of a basic GUI for controlling the process and the model (the preferred GUI environment is HTML5).
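
A minimal sketch of the neighbor-theft scenario described above is shown below: baseline hourly readings per prosumer, with part of the thief's consumption shifted onto the victim's meter. The diurnal baseline, the 50% theft rate, and the noise distribution are all illustrative assumptions to be replaced by the project's statistical models.

```java
import java.util.Random;

// Toy simulated-consumption generator with an injected electricity-theft
// pattern: the stolen load vanishes from the thief's meter and appears
// on the victim's.
public class FraudScenarioGen {

    public static double[][] generate(int prosumers, int hours,
                                      int thief, int victim, long seed) {
        Random rnd = new Random(seed);
        double[][] readings = new double[prosumers][hours];
        for (int p = 0; p < prosumers; p++) {
            for (int h = 0; h < hours; h++) {
                // Simple diurnal baseline: ~1 kWh plus an evening peak and noise.
                double base = 1.0 + (h % 24 >= 18 ? 0.8 : 0.0)
                        + 0.2 * rnd.nextGaussian();
                readings[p][h] = Math.max(0, base);
            }
        }
        for (int h = 0; h < hours; h++) {
            double stolen = 0.5 * readings[thief][h];   // half the thief's usage...
            readings[thief][h] -= stolen;               // ...vanishes from his meter
            readings[victim][h] += stolen;              // ...and shows up on the victim's
        }
        return readings;
    }
}
```

At the target scale this generation loop would be rewritten as a Hadoop job partitioned by prosumer, with the HTML5 GUI setting the model parameters.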

**Industrial advisor: Gadi Solotorevsky, Cvidya**