Prof. Ehud Gudes

The following projects are offered by Prof. Ehud Gudes


  • Investigating and Implementing XML indexes

For: 1-2 students with a background in Databases and an interest in Algorithms
This project is composed of two parts. The first part is reading papers and surveying methods for constructing various types of indexes for XML.
The second part is implementing several types of such indexes and comparing their performance on several benchmark files.

A report comparing the index methods will be required.
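
As an illustration of what an XML index can look like, the following minimal sketch builds a simple path index that maps each root-to-element label path to the text of the elements found on it. It is only one of many possible index types and is not taken from any specific paper; the function name `build_path_index` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_path_index(xml_text):
    """Map each root-to-element label path to the text of elements on that path."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)

    def visit(elem, prefix):
        path = prefix + "/" + elem.tag
        index[path].append((elem.text or "").strip())
        for child in elem:
            visit(child, path)

    visit(root, "")
    return dict(index)

# Made-up sample document, for illustration only.
doc = "<lib><book><title>A</title></book><book><title>B</title></book></lib>"
index = build_path_index(doc)
titles = index["/lib/book/title"]
```

With such an index, a simple path query like `/lib/book/title` becomes a dictionary lookup instead of a traversal of the whole document.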

  • Evaluating Reputation based clustering algorithms

For: 1-2 students with a background in Databases and/or Data security


The problem of identifying groups of trust (knots) in a trust network is modeled as a graph clustering problem, where vertices correspond to individual items and edges describe relationships. Under this interpretation, a community is represented by a directed graph in which vertices represent members and edges represent the trust relations between the members represented by their end-point vertices.

A path between two vertices that are not connected by an edge represents the transitive property of trust (e.g. Alice trusts Bob + Bob trusts Clair => Alice trusts Clair).
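
A minimal sketch of transitive trust over a directed graph, assuming trust is accepted only along paths of at most `max_hops` edges. The function name `trusts_within` and the bound are illustrative assumptions, not part of the project's algorithms.

```python
from collections import deque

def trusts_within(graph, source, target, max_hops):
    """Return True if target is reachable from source along at most max_hops trust edges."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not extend paths beyond the allowed length
        for nxt in graph.get(node, ()):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return False

# The example from the text: Alice trusts Bob, Bob trusts Clair.
trust = {"Alice": ["Bob"], "Bob": ["Clair"]}
two_hops = trusts_within(trust, "Alice", "Clair", 2)
one_hop = trusts_within(trust, "Alice", "Clair", 1)
reverse = trusts_within(trust, "Clair", "Alice", 5)
```

Because the graph is directed, trust from Alice to Clair does not imply trust from Clair to Alice.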

Correlation clustering is a powerful technique for discovering groups of trust in graphs. It operates on the pair-wise relationships between vertices, partitioning the graph to maximize the number of related pairs that are clustered together, plus the number of unrelated pairs that are separated. We investigate heuristic algorithms for correlation clustering with a restricted cluster diameter (to avoid trust based on long paths of transitive trust).
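
The objective described above can be sketched as a function that scores a given partition. This is a simplified illustration: it assumes relations are treated as unordered pairs, it does not enforce any diameter restriction, and the names `agreement_score`, `related` and `clustering` are hypothetical.

```python
from itertools import combinations

def agreement_score(vertices, related, clustering):
    """Count related pairs placed together plus unrelated pairs placed apart."""
    score = 0
    for u, v in combinations(vertices, 2):
        together = clustering[u] == clustering[v]
        is_related = frozenset((u, v)) in related
        if together == is_related:
            score += 1
    return score

# Toy chain a-b-c-d, for illustration only.
vertices = ["a", "b", "c", "d"]
related = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("c", "d")]}
split = {"a": 0, "b": 0, "c": 1, "d": 1}    # two clusters
single = {"a": 0, "b": 0, "c": 0, "d": 0}   # one big cluster
split_score = agreement_score(vertices, related, split)
single_score = agreement_score(vertices, related, single)
```

On this toy chain the two-cluster partition scores higher than the single cluster, because the single cluster groups together pairs that are related only through longer paths.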

The goal of the project is to implement the developed heuristics and evaluate the trade-off between optimality and performance, mainly for clustering maintenance algorithms.


Co-advisor: Nurit Gal-Oz

  • Project Title: Implement and test the Notos Domain reputation model

Number of students: 2

Advisors: Ehud Gudes and Igor Mishky


Project description:

Estimating the reputation of domains is an important problem when trying to identify domains suspected of spreading malware or of serving as command and control for botnets.

The Notos model [1] was the first model suggested for computing this reputation based on DNS request logs and statistical properties of the Domain/IP network.

The goal will be to study this model and apply it (approximately) to DNS logs we now have in the Cyber Kabarnit research project.



[1] Manos Antonakakis, Roberto Perdisci, David Dagon, Wenke Lee, Nick Feamster: Building a Dynamic Reputation System for DNS. USENIX Security Symposium 2010: 273-290

  • Project title - Authorization for Hadoop-based systems

Number of students: 2

The project goal is to validate and come up with guidance/best practices for leveraging HBase and Hive for fine-grained access control.
The students will need to read the literature, investigate, implement, and produce a best-practices paper with samples. Customers can then use these best practices to leverage HBase and Hive for fine-grained access control.
The students will have to provide both an analysis of the access control features in the above databases and an implementation of an example application that demonstrates some of the available (and lacking) features.

Advisors: Ehud Gudes and Boris Rozenberg (IBM)

  • Project Title: Discovering Unique Patterns in Large Datasets

Number of students: 2

Co-advisor: Boris Rozenberg


Project description:

The analysis of personal data is fundamental to government policy-making and academic research. This information is often collected under guarantees of confidentiality. For maximum benefit to be gained, the data needs to be made available to as wide a group of researchers as possible. However, this poses a risk of disclosure, which has worsened as the potential for linking independent datasets has risen.

The problem of preventing statistical disclosure is approached by estimating the risk of a certain individual being identified (statistical disclosure assessment) and then by applying statistical disclosure control, variously recoding, masking and perturbing the data in order to reduce the statistical disclosure risk.

Of particular concern are records whose contents, or attributes, are unique and therefore have the potential to be matched directly with details (including names and addresses) from another dataset. For example, a study of the 2000 US census revealed that 63% of US citizens can be uniquely identified by just three pieces of information: their gender, zip code, and full date of birth.

Another area where unique patterns (or itemsets) are of interest is Anomaly (or Outlier) Detection. Anomaly Detection techniques are widely used in various domains, especially for fraud detection. Records containing unique patterns are suspicious and should be considered by any Anomaly Detection system.

A dedicated algorithm, the Special Uniques Detection Algorithm (SUDA), and its later version SUDA2 [1], have been designed to efficiently discover all minimal unique itemsets, i.e., minimal itemsets of frequency one.

The project goals are therefore: (1) to implement the SUDA2 algorithm (in Matlab or Java); (2) to adapt the search strategies used in SUDA2 to mining rare minimal itemsets (i.e., of frequency < threshold), which is expected to be straightforward, and to provide a suitable implementation.
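
To make the notion of a minimal unique (or rare) itemset concrete, here is a brute-force sketch. It is not SUDA2 and uses none of its search strategies; it enumerates column subsets in order of increasing size, which is only feasible for tiny tables. It is written in Python for brevity, although the project itself asks for Matlab or Java. The function name, the `threshold` parameter and the toy records are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def minimal_rare_itemsets(records, threshold=2):
    """Brute-force search for minimal itemsets with frequency < threshold.

    With threshold=2 these are the minimal unique itemsets (frequency one).
    An itemset is a frozenset of (column_index, value) pairs.
    """
    n_cols = len(records[0])
    found = []
    for size in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            counts = Counter(tuple(r[c] for c in cols) for r in records)
            for values, count in counts.items():
                if count >= threshold:
                    continue
                itemset = frozenset(zip(cols, values))
                # Minimal only if no smaller rare itemset is contained in it.
                if not any(prev < itemset for prev in found):
                    found.append(itemset)
    return found

# Made-up records: (gender, zip code, year of birth).
records = [
    ("F", "84105", 1980),
    ("F", "84105", 1975),
    ("M", "84105", 1980),
    ("M", "84120", 1980),
]
uniques = set(minimal_rare_itemsets(records))
```

In this toy table the zip code "84120" and the birth year 1975 are unique on their own, while the other two minimal unique itemsets each combine two attributes.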



[1] A. M. Manning, D. J. Haglin, and J. A. Keane. A recursive search algorithm for statistical disclosure assessment. Data Min Knowl Disc (2008) 16:165-196.

  • Implementing parallel data mining algorithms using Map/Reduce

For: 2 students with a background in Databases and/or Data security


Map/Reduce is a recent technique for implementing parallel data-intensive algorithms, used for example in the Hadoop project. The goal of this project is to implement several data mining algorithms (e.g. FSG, GSPAN, SPADE, SUBSEA) and compare them to their sequential versions.
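
The Map/Reduce style can be sketched in a single process as three steps: map, shuffle and reduce. The example below counts item support over a list of transactions, a basic building block in frequent pattern mining. It does not use the Hadoop API and is not one of the algorithms listed above; all function names are hypothetical.

```python
from collections import defaultdict

def map_phase(transaction):
    """Emit (item, 1) for each distinct item in one transaction."""
    return [(item, 1) for item in set(transaction)]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts emitted for one item."""
    return key, sum(values)

def item_support(transactions):
    emitted = [pair for t in transactions for pair in map_phase(t)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

# Toy transactions, for illustration only.
transactions = [["a", "b"], ["a", "c"], ["a", "b", "b"]]
support = item_support(transactions)
```

In a real deployment the map and reduce calls would run on different machines; the comparison with the sequential version asked for in the project is about that distribution cost versus the gain in parallelism.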

Co-advisor: Yaron Gonen



  • Generating Fraud Scenario patterns in Simulated Utility Consumption data

For: 2 students with a background in Databases and/or Data mining


The project will include the generation of simulated energy consumption data for a large population of prosumers (consumers that might also be producers), and the generation of fraud scenario patterns, e.g., a scenario in which one neighbor steals electricity from another. The volume of the data can be quite large: 1 million prosumers generate between 8 and 32 billion meter readings per year. The preferred environment for the project is Hadoop. The project will require creation and manipulation of data in this environment, based on simple statistical models, and the development of a basic GUI for controlling the process and the model (the preferred GUI environment is HTML5).
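
A small single-machine sketch of the idea, without Hadoop: readings are drawn from a simple Gaussian model per prosumer, and a theft scenario moves part of one prosumer's metered consumption onto a neighbor's meter. At the stated volumes each prosumer would contribute roughly 8,000 to 32,000 readings per year; the sketch uses far fewer. The model, its parameters and the function names are assumptions made for illustration only.

```python
import random

def simulate_readings(n_prosumers, n_readings, seed=0):
    """Draw one consumption series (kWh per reading) per prosumer from a simple Gaussian model."""
    rng = random.Random(seed)
    data = {}
    for pid in range(n_prosumers):
        base = rng.uniform(0.2, 1.5)  # hypothetical mean load per reading
        data[pid] = [max(0.0, rng.gauss(base, 0.1 * base)) for _ in range(n_readings)]
    return data

def inject_theft(data, thief, victim, fraction, start):
    """From reading `start` on, shift a fraction of the thief's metered use onto the victim's meter."""
    for t in range(start, len(data[thief])):
        stolen = fraction * data[thief][t]
        data[thief][t] -= stolen
        data[victim][t] += stolen
    return data

data = simulate_readings(n_prosumers=3, n_readings=24)
total_before = sum(sum(series) for series in data.values())
thief_before = list(data[0])
inject_theft(data, thief=0, victim=1, fraction=0.5, start=12)
total_after = sum(sum(series) for series in data.values())
```

The total consumption is unchanged by the injected scenario; only its attribution between the two meters shifts, which is the pattern a fraud detection method would have to find.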


Industrial advisor: Gadi Solotorevsky, Cvidya