Unofficial intermediate report on the "Data Mining" project

Knowledge Discovery with Bayesian Knowledge Bases

Within the research framework of the project, our paper "Cost-Sharing in Bayesian Knowledge Bases" has been accepted for publication in the proceedings of the conference on Uncertainty in AI, 1997. To perform the experiments described in the paper, we had to modify our existing BKB software to handle cycles in the graph differently, and in doing so gained a good understanding of how to implement this type of construct.

We are now implementing initial association-rule programs, which are modifications of the first of Agrawal's methods, as a preliminary to learning BKBs from data. This part is written in C++ and will interface to the rest of the project via data files defining the database and special data files defining Bayesian Knowledge Bases.

We have also looked into finding associations that include negated attributes, and believe we can do so at reasonably low extra cost. This work is in its preliminary stages.

Abstraction

We found various kinds of abstraction processes to be useful in data mining for this type of numerical database (the student grades database). We have started programming processes for:

Data abstraction - grades{100,0} --> grades{A,B,C,D}
Aggregation - all personal data --> i.d.
Generalization - all students with grades > 75 --> good students

We plan to address the following issues:

To find the useful abstractions via queries.
To use abstractions as a guide to derive association rules.
To examine the relationship between various abstractions of the same kind.

Status of Databases

We have made considerable progress in the database area. We obtained from the University administration a raw-data database containing student grades in our faculty, and converted it to Prolog. We established a standard for representing and maintaining both schema and instance information in Prolog. We are working on the security aspects of this database.
The database is now accessible only to project members with special group rights. We are working on standard access routines (in Prolog, called from C++ or Java) to this database.

We have started to work on the medical database from Soroka Hospital. It is now in the process of being converted to Prolog.

The relevant attributes of the database with information about students and courses have been defined. The database has been translated to Prolog predicates, which group the attributes according to their meaning.

WWW Interface Status

We now have a "Data Mining" web page that allows a user to see the definitions of the above predicates and an example of the database. The real database is accessible from the home page with a password, which is known only to the group of users who participate in the project. We are currently working on a visual representation of database attributes and on visualization of the primitive information available in the database.

Metaqueries

This project is still in the design phase. We are now working on the interface to Prolog.

Learning Decision Trees

This is an attempt to learn rules using decision tree algorithms (ID3, ID5). Current status: all code for the ID3 algorithm has been written and compiles. An initial design and implementation of an interface in Java exists, but it still has many bugs. We are now working on converting our program to work with ID5. We do not yet have the Prolog section of our project (the data retrieval), but it is only a function that creates a Prolog predicate according to field values, so it should not be too hard.

Things left to do:

1. Debug the ID3 code.
2. Write the ID5 algorithm and debug it.
3. Finish the Java interface.
4. Write the Prolog section.
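
For readers unfamiliar with ID3, the attribute-selection step at its core can be sketched as follows. This is a minimal illustration in Python, not the project's own code. The record layout, the attribute names ("attendance", "year", "result"), the toy data, and the helper names `entropy`, `information_gain` and `best_attribute` are all assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(records, attribute, target):
    """Entropy reduction obtained by splitting the records on one attribute."""
    base = entropy([r[target] for r in records])
    # Group the target labels by the value each record has for the attribute.
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[target])
    # Weighted average entropy of the groups after the split.
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return base - remainder


def best_attribute(records, attributes, target):
    """ID3 splits on the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(records, a, target))


# Hypothetical toy records, not taken from the real grades database.
records = [
    {"attendance": "high", "year": "1", "result": "good"},
    {"attendance": "high", "year": "2", "result": "good"},
    {"attendance": "low", "year": "1", "result": "poor"},
    {"attendance": "low", "year": "2", "result": "poor"},
]

chosen = best_attribute(records, ["attendance", "year"], "result")
print(chosen)
```

In this toy data "attendance" separates the two classes completely while "year" carries no information, so the split is made on "attendance". A full ID3 implementation would apply the same selection recursively to each resulting group until the groups are pure or no attributes remain.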