Unofficial intermediate report on the "Data Mining" project


Knowledge Discovery with Bayesian Knowledge Bases

In the research framework of the project, our paper "Cost-Sharing
in Bayesian Knowledge Bases" has been accepted for publication in the
proceedings of the Conference on Uncertainty in AI, 1997. To
perform the experiments described in the paper, we had to modify our
existing BKB software to handle cycles in the graph differently, and
thus gained practical experience with implementing this type of construct.

We are now implementing initial association-rule programs, which are
modifications of the first of Agrawal's methods, as a preliminary to learning
BKBs from data. This part is written in C++, and will interface to the
rest of the project via data files defining the database and special
data files defining Bayesian Knowledge Bases. We have also looked into
finding associations that include negated attributes, and believe we can
do it with reasonably low extra cost. This last effort is still in its
preliminary stages.
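
As an illustration of the idea only, the following is a minimal sketch of a
level-wise (Apriori-style) search for frequent itemsets in which absent
attributes are made explicit as negated items. It is written in Python for
brevity and is not the project's C++ code. The "~" prefix, the function names
and the sample attributes are assumptions made for this example. The naive
encoding used here doubles the number of items and is not necessarily the
low-cost approach mentioned above.

```python
from itertools import combinations

def with_negations(transaction, attributes):
    """Add an explicit '~a' item for every attribute absent from the transaction."""
    present = frozenset(transaction)
    return present | {"~" + a for a in attributes if a not in present}

def frequent_itemsets(transactions, min_support):
    """Level-wise search; returns {itemset: support count}."""
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join frequent (k-1)-itemsets; keep a candidate only if all of
        # its (k-1)-subsets are themselves frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in current for s in combinations(union, k - 1)
            ):
                candidates.add(union)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

def rules(frequent, min_confidence):
    """Derive rules X -> y with a single-item consequent."""
    found = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for y in itemset:
            x = itemset - {y}
            confidence = support / frequent[x]
            if confidence >= min_confidence:
                found.append((x, y, confidence))
    return found

# Hypothetical data: each transaction lists the courses a student passed.
ATTRIBUTES = ["math", "physics", "cs"]
RAW = [{"math", "physics"}, {"math", "physics", "cs"}, {"cs"}, {"math", "physics"}]
TRANSACTIONS = [with_negations(t, ATTRIBUTES) for t in RAW]
FREQUENT = frequent_itemsets(TRANSACTIONS, min_support=2)
RULES = rules(FREQUENT, min_confidence=1.0)
```

In this toy data the rule "~cs -> math" appears with confidence 1.0, which a
search over positive attributes alone could not express.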


Abstraction

We found various kinds of abstraction processes to be useful in data
mining for this type of numerical database (the student grades database).
We started programming processes for:
        Data abstraction - grades{100,0} --> grades{A,B,C,D}
        Aggregation - all personal data --> i.d.
        Generalization - all students with grades > 75 --> good students
We plan to address the following issues:
        To find the useful abstractions via queries.
        To use abstractions as a guide to derive association rules.
        To examine the relationship between various abstractions of the same kind.
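
The three abstraction processes listed above can be illustrated with a short
sketch. It is in Python for brevity and is not the project's code. The letter
cut-offs, the field names and the reading of "grades > 75" as an average over
a student's grades are all assumptions made for this example.

```python
# Hypothetical names of the personal fields that aggregation hides.
PERSONAL_FIELDS = ("name", "address", "birth_year")

def to_letter(grade):
    """Data abstraction: numeric grade (0..100) -> letter. Cut-offs are assumed."""
    if grade >= 85:
        return "A"
    if grade >= 75:
        return "B"
    if grade >= 65:
        return "C"
    return "D"

def aggregate(record):
    """Aggregation: drop the personal fields so only the i.d. stands for the person."""
    return {k: v for k, v in record.items() if k not in PERSONAL_FIELDS}

def is_good_student(grades, threshold=75):
    """Generalization: average grade above the threshold (the averaging is assumed)."""
    return sum(grades) / len(grades) > threshold

# Hypothetical student record.
student = {"id": 1234, "name": "N.N.", "address": "unknown", "birth_year": 1975,
           "grades": [90, 80, 70]}
abstracted = aggregate(student)
abstracted["letters"] = [to_letter(g) for g in student["grades"]]
abstracted["good"] = is_good_student(student["grades"])
```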


Status of Databases

We have made considerable progress in the database area. We obtained a
raw-data database containing student grades in our faculty from the
University administration.
We converted that database to Prolog, and established a standard for
representing and maintaining both schema and instance information in Prolog.
We are working on the security aspects of this database; it is now
accessible only to project members with special group rights.
We are also working on standard access routines to this database
(written in Prolog, callable from C++ or Java).

We have started to work on the medical database from Soroka Hospital.
It is now in the process of being converted to Prolog.

The relevant attributes of the database containing information about students
and courses have been defined. The database has been translated to Prolog
predicates, which group the attributes according to their meaning.

WWW Interface Status

We now have a "Data Mining" web page that allows a user to see the
definitions of the above predicates and a sample of the database. The real
database is accessible from the home page with a password, which is known
only to the group of users participating in the project.

We are currently working on a visual representation of database attributes
and on visualization of the primitive information available in the database.


Metaqueries

This project is still in the design phase. We are now working on the
interface to Prolog.

Learning Decision Trees

This sub-project attempts to learn rules using decision-tree algorithms
(ID3, ID5). Current status: all code for the ID3 algorithm has been written
and compiles. An initial design and implementation of an interface in Java
exists, but it still has many bugs.
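
For readers unfamiliar with ID3, the following is a minimal sketch of the
algorithm: at each node, split on the attribute with the highest information
gain, and recurse. It is written in Python for brevity and is not the
project's code. The attribute names and sample rows are hypothetical, and the
sketch omits handling of attribute values unseen during training.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy reduction obtained by splitting rows on attr."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

def id3(rows, attrs, target):
    """Return a class label (leaf) or a nested dict {attr: {value: subtree}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]                       # pure node
    if not attrs:
        return Counter(labels).most_common(1)[0][0]   # majority vote
    best = max(attrs, key=lambda a: information_gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([r for r in rows if r[best] == v], rest, target)
                   for v in {r[best] for r in rows}}}

def classify(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree

# Hypothetical training rows.
ROWS = [
    {"math": "A", "attendance": "high", "good": "yes"},
    {"math": "A", "attendance": "low", "good": "yes"},
    {"math": "C", "attendance": "high", "good": "no"},
    {"math": "C", "attendance": "low", "good": "no"},
]
TREE = id3(ROWS, ["attendance", "math"], "good")
```

Here "math" separates the classes perfectly while "attendance" carries no
information, so the tree consists of a single split on "math".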

We are now working on converting our program to use ID5.
We do not yet have the Prolog section of our project (the data
retrieval), but it is only a function that creates a Prolog predicate
according to field values, so it should not be too hard.
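
A function of this kind might look like the sketch below, shown in Python for
brevity. The function names and the predicate "grade" are hypothetical, and
the actual form of the project's predicates is not specified here. Integers
are written bare; every other value is written as a quoted Prolog atom, with
embedded quotes doubled and backslashes escaped.

```python
def prolog_term(value):
    """Render one field value as a Prolog term (integer or quoted atom)."""
    if isinstance(value, int) and not isinstance(value, bool):
        return str(value)
    text = str(value).replace("\\", "\\\\").replace("'", "''")
    return "'" + text + "'"

def prolog_fact(predicate, values):
    """Build the text of a Prolog fact: predicate(v1, v2, ...)."""
    return predicate + "(" + ", ".join(prolog_term(v) for v in values) + ")."

# Hypothetical predicate and field values.
fact = prolog_fact("grade", [1234, "Data Mining", 88])
```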

  Things left to do:
        1. Debug the ID3 code.
        2. Write the ID5 algorithm and debug it.
        3. Finish the Java interface.
        4. Write the Prolog section.