Evolutionary Computation and Artificial Life (202-1-5171), Semester A, 2006-2007
Exercise 4
The Wisconsin Breast Cancer Diagnosis (WBCD) problem involves classifying a presented case as either benign or malignant. It involves a relatively high number of variables and consequently a large search space. The WBCD database consists of nine visually assessed characteristics obtained from fine-needle aspirates of breast masses, each of which is ultimately represented as an integer value between 1 and 10.
In this exercise you will use an artificial neural network, trained by backpropagation, to classify cases as being benign or malignant.
The WBCD database is available here.
An explanation is given here.
As can be seen, each database line consists of 11 values:
 #  Attribute                    Domain
--  ---------------------------  ------------------------------------------
 1. Sample code number           id number (IGNORE THIS, it is NOT an input to the network)
 2. Clump Thickness              1 - 10
 3. Uniformity of Cell Size      1 - 10
 4. Uniformity of Cell Shape     1 - 10
 5. Marginal Adhesion            1 - 10
 6. Single Epithelial Cell Size  1 - 10
 7. Bare Nuclei                  1 - 10
 8. Bland Chromatin              1 - 10
 9. Normal Nucleoli              1 - 10
10. Mitoses                      1 - 10
11. Class                        2 for benign, 4 for malignant
Attribute 1 is to be ignored, attributes 2-10 are the nine inputs, and attribute 11 is the desired output.
The database contains close to 700 cases. You will use only the first 100: the first 50 as a training set, used to train the network, and the next 50 as a test set, used to test the network AFTER training is done.
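The preprocessing described above could be sketched as follows. This is only one possible approach, not a required one. It assumes the 11 values on each line are comma-separated, that a line with a non-numeric field (for example a placeholder for a missing value) is simply skipped, and that inputs are scaled to the range 0.1 to 1.0. The names parse_case and split_cases are hypothetical helpers, not part of the exercise.

```python
def parse_case(line):
    """Parse one database line into (inputs, target), or None if unusable.

    Assumption: the 11 values on a line are separated by commas.
    """
    fields = line.strip().split(",")
    if len(fields) != 11:
        return None
    try:
        # Skip attribute 1 (the sample code number); it is not an input.
        values = [int(f) for f in fields[1:]]
    except ValueError:
        # Assumption: a non-numeric field marks a missing value; skip the case.
        return None
    # Attributes 2-10 are the nine inputs; scaling by 10 is an optional choice.
    inputs = [v / 10.0 for v in values[:9]]
    # Attribute 11 is the class: 4 = malignant -> 1, 2 = benign -> 0.
    target = 1 if values[9] == 4 else 0
    return inputs, target


def split_cases(lines, n_train=50, n_test=50):
    """Return (training set, test set) taken from the first lines of the file."""
    train_lines = lines[:n_train]
    test_lines = lines[n_train:n_train + n_test]
    train = [c for c in map(parse_case, train_lines) if c is not None]
    test = [c for c in map(parse_case, test_lines) if c is not None]
    return train, test
```

Because unusable lines are dropped after the split by position, a set may end up slightly smaller than requested. Replacing a missing value instead of dropping the case is an equally reasonable policy.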
Network architecture:
Type of hidden neurons: Sigmoid.
Type of output neurons: Threshold (hard-limiting).
Training algorithm: Backpropagation, with learning rate = 0.1.
The goal of the learning process is for the output neuron to provide a correct classification.
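A minimal sketch of such a network is given here as one way to start, not as the expected solution. It assumes a single hidden layer of N sigmoid units and one output unit. Because a hard-limiting unit has no useful derivative for backpropagation, the sketch trains with a sigmoid output and applies the threshold (at 0.5, an assumed value) only when classifying. The weight range, the per-case (online) update, and all function names are likewise assumptions.

```python
import math
import random


def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_inputs, n_hidden, rng):
    """Small random weights; the last entry of each weight list is the bias."""
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(net, x):
    """Return (hidden activations, output activation) for input vector x."""
    hidden, output = net
    h = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in hidden]
    y = sigmoid(sum(w * v for w, v in zip(output, h)) + output[-1])
    return h, y


def train_epoch(net, cases, lr=0.1):
    """One run through all training cases, updating weights after each case."""
    hidden, output = net
    for x, t in cases:
        h, y = forward(net, x)
        delta_out = (t - y) * y * (1.0 - y)
        # Hidden deltas are computed from the output weights before the update.
        delta_hid = [delta_out * output[j] * h[j] * (1.0 - h[j])
                     for j in range(len(hidden))]
        for j in range(len(hidden)):
            output[j] += lr * delta_out * h[j]
        output[-1] += lr * delta_out
        for j, row in enumerate(hidden):
            for i in range(len(x)):
                row[i] += lr * delta_hid[j] * x[i]
            row[-1] += lr * delta_hid[j]


def classify(net, x):
    """Hard-limit the output: 1 (malignant) if activation >= 0.5, else 0."""
    return 1 if forward(net, x)[1] >= 0.5 else 0
```

A typical use would be net = init_network(9, N, random.Random(0)), followed by repeated calls to train_epoch(net, training_set) and then classify on each test case.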
Vary the sizes of the training and test sets, e.g., 60/40 or 70/30. Do the results improve?
Submit:
1) Program.
2) Histogram of error per N (number of hidden units), where error = percentage of incorrectly classified cases. Provide two histograms: one for the training set, one for the test set.
3) For the best N: plot of error versus time. Time is measured in training epochs, where an epoch is one run through all 50 training cases.
5-point bonus: Use two hidden layers.
3-point bonus: Use the ENTIRE data set.
You may change the network topology or other parameters if you wish.
You may use sigmoid output units.
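The error measure used in the histograms (percentage of incorrectly classified cases) could be computed as sketched below. The text rendering of the histogram is only a stand-in for a proper plot, and the function names are hypothetical. Any numbers passed to text_histogram in your own runs should come from actual experiments.

```python
def error_percent(predicted, desired):
    """Percentage of cases whose predicted class differs from the desired one."""
    if len(predicted) != len(desired) or not desired:
        raise ValueError("need two non-empty lists of equal length")
    wrong = sum(1 for p, d in zip(predicted, desired) if p != d)
    return 100.0 * wrong / len(desired)


def text_histogram(error_by_n, width=40):
    """Render {N: error percent} as one text bar per hidden-layer size N."""
    rows = []
    for n in sorted(error_by_n):
        bar = "#" * int(round(error_by_n[n] * width / 100.0))
        rows.append("N=%2d %5.1f%% %s" % (n, error_by_n[n], bar))
    return "\n".join(rows)
```

Calling error_percent once per epoch on the training set yields the error-versus-time curve for a given N; calling it once after training for each N, on both the training set and the test set, yields the data for the two histograms.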