One nearest neighbor with compression

Written by Yevgeni Korsunsky and Aryeh Kontorovich

  1. Introduction

  2. The condensing problem

  3. Constructing a γ-net

  4. Code

  5. Results

  6. References

1. Introduction

The nearest neighbor algorithm (1-NN): the learner observes a sample S of labeled points (X,Y) = (Xi,Yi), i ∈ [n], where each Xi is a point in some metric space X and Yi ∈ Y is its label, for some finite label set Y. Being a metric space, X is equipped with a distance function ρ : X × X → R. Given a new unlabeled point x ∈ X to be classified, x is assigned the same label as its nearest neighbor in S, namely the label Yi* where i* = argmin_{i ∈ [n]} ρ(x, Xi). The 1-NN rule is the special case k = 1 of the k-nearest-neighbors (k-NN) algorithm.
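
For concreteness, here is a minimal Python sketch of the 1-NN rule (the list-based layout and the passed-in distance function are illustrative assumptions, not the project's API):

def one_nn_classify(sample_X, sample_y, x, dist):
    """Assign x the label of its nearest neighbor in the labeled sample.
    sample_X: list of points, sample_y: their labels, dist: the metric rho."""
    nearest = min(range(len(sample_X)), key=lambda i: dist(x, sample_X[i]))
    return sample_y[nearest]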

Under mild regularity assumptions, the 1-NN classifier's expected error is asymptotically bounded by twice the Bayes error as the sample size tends to infinity (a Bayes-consistent modification of the 1-NN classifier has also been proposed more recently). Of course, such asymptotic analysis is not sufficient in practice, since one needs to know the error to expect given a data set of a certain limited size. One finite-sample analysis defines the conditional probability η(x) = P[y = 1|x] and shows that if η is c-Lipschitz, then E_{S~D^m}[L_D(h_1-NN(S))] ≤ 2 L_D(h*) + 4c√d · m^(-1/(d+1)) for X = [0,1]^d. Note that the assumption that the conditional probability is c-Lipschitz can fail even for basic problems: imagine, for example, X = [-1,1] with P[y = 1|x] = 1 for x ≥ 0 and P[y = 1|x] = 0 for x < 0. More recently, strong margin-dependent generalization bounds were obtained in [7], where the margin is the minimum distance between oppositely labeled points in S. Shortcomings of the NN classifier led Hart [25] to pose the problem of sample compression. Indeed, significant compression of the sample has the potential to simultaneously address the issues of memory usage, NN search time, and overfitting.

The nearest neighbor classifier is perhaps the most intuitive learning algorithm for non-parametric classification. It is apparently also the earliest, having been introduced by Fix and Hodges in 1951. The simplicity of the algorithm, its non-parametric approach, its minimal assumptions about the setting of the problem, and its immediate extension to multi-class problems make it very appealing. For example, in [27] we can see that the 2nd-place winner of the 2015 Otto Group Product Classification Challenge used k-NN to extract meta-features that were later used as input to an artificial neural network (another type of machine learning model). Despite this example, the nearest neighbor approach is not very popular, as other methods usually achieve better results, and the NN approach suffers from the "curse of dimensionality": the size of the training set should increase exponentially with the dimension of X. In comparison, a popular approach such as SVM can work well even in a high-dimensional space, since its performance relies on the margin of the data, which is not restricted by the dimension. Another issue is that exact NN evaluation requires Θ(|S|) time in high-dimensional metric spaces (and possibly in Euclidean space as well), a phenomenon known as the algorithmic curse of dimensionality. Linear SVM, in comparison, requires a single inner product to evaluate a new point, and its cost does not depend on the sample size. Lastly, the NN classifier has infinite VC-dimension, implying that it tends to overfit the data. This last problem can be mitigated by taking the majority vote among k > 1 nearest neighbors, or by deleting some sample points so as to attain a larger margin. To see why 1-NN has infinite VC-dimension, note that for a realizable sample of any size and any labeling, the 1-NN classifier built on that sample is consistent with it, so any set of distinct points can be shattered. k-NN is a smoother function of the data than 1-NN (more regularized) and thus tends to overfit less.

2. The nearest neighbor condensing problem

Hart considered the Minimum Consistent Subset problem, elsewhere called the Nearest Neighbor Condensing problem, which seeks to identify a minimal subset S* ⊂ S that is consistent with S, in the sense that the nearest neighbor in S* of every x ∈ S possesses the same label as x. This problem is known to be NP-hard [15],[16], and Hart provided a heuristic with runtime O(n³). In [1] the authors gave a solution to the Nearest Neighbor Condensing problem with an improved runtime, and also showed a hardness-of-approximation result that closes the gap between the lower and upper bounds on the achievable compression rate.
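
For reference, here is a minimal Python sketch of Hart's condensing heuristic (a standard rendering of the heuristic with illustrative names; this is not code from the project):

def hart_condense(X, y, dist):
    """Hart's condensed nearest neighbor heuristic: repeatedly add every
    point that the currently kept subset misclassifies, until a full pass
    over the sample adds nothing. The kept subset is consistent with (X, y)."""
    keep = [0]                       # start with an arbitrary point
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            nearest = min(keep, key=lambda k: dist(X[i], X[k]))
            if y[nearest] != y[i]:   # misclassified by the kept subset
                keep.append(i)
                changed = True
    return keep                      # indices of the condensed subset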

We say that a subset S' ⊆ S is ε-consistent with S if the nearest neighbor in S' of every x ∈ S, except for at most an ε-fraction of the points, possesses the same label as x. For ε = 0 we simply say the subset is consistent [10]. Roughly speaking, if a learning algorithm can express its output hypothesis using a small subset of the training set, then the error of the hypothesis on the rest of the examples estimates its true error. In other words, an algorithm that can "compress" its output is a good learner. The generalizing power of sample compression was independently discovered by [28],[29] and later elaborated upon by [30].

3. Constructing a γ-net

Theorem ([1]): For any distribution, any sample size n, and any δ > 0, with probability at least 1 − δ over the random sample S, the following holds:
(i) if Sγ is consistent with S, then the 1-NN classifier defined by Sγ satisfies the compression-based generalization bound of [1];
(ii) if Sγ is only ε-consistent with S, then the corresponding bound of [1] holds with an additional term depending on ε.

It now remains to show how we can build such a γ-net:


Greedy γ-net algorithm: for every point p ∈ S, we add p to Sγ if the distance from p to all points currently in Sγ is γ or greater, i.e., d(p, Sγ) ≥ γ.

In the straightforward implementation the greedy algorithm runs in O(n·|Sγ|) = O(n²) time, since each point is compared against every point already in the net.
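
A minimal Python sketch of this greedy construction (illustrative, not the project's implementation):

def greedy_gamma_net(points, gamma, dist):
    """Greedily build a gamma-net: keep a point only if it lies at
    distance >= gamma from every point already kept."""
    net = []
    for p in points:
        if all(dist(p, q) >= gamma for q in net):
            net.append(p)
    return net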

Hierarchy γ-net algorithm:

The hierarchy-based net construction is more complicated, but potentially runs faster; its runtime bound, which depends on the doubling dimension of the data, is given in [1].

[Figure: the hierarchy net construction algorithm]

Theorem ([1]): Given a set S of labeled points with scaled margin γ, it is NP-hard to approximate the solution to the Nearest Neighbor Condensing problem on S to within a factor that depends on γ (the precise factor is given in [1]). Note that the matching upper bound is an absolute size guarantee, which is stronger than an approximation estimate.
This theorem narrows the gap between the previous algorithm's performance and the best asymptotically achievable algorithm.

Pruning algorithm:

A pruning heuristic which, given a γ-net, produces a consistent subsample by pruning some of the points.

Although the paper [1] gave no theoretical guarantees for this heuristic, in practice I found that it works very well for further reducing the size of the subsamples. Note that its runtime can be considerably higher in the worst case.

[Figure: the consistent pruning heuristic]
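
As a rough illustration of the general idea of consistent pruning, here is a greedy Python sketch (not necessarily the heuristic used in the project): drop net points one at a time, as long as the remaining subset still classifies the full sample correctly.

def consistent_pruning_sketch(net_X, net_y, X, y, dist):
    """Greedily drop net points while the remaining subset stays
    consistent with the full labeled sample (X, y)."""
    kept_X, kept_y = list(net_X), list(net_y)

    def consistent(cx, cy):
        # the subset (cx, cy) is consistent if 1-NN over it labels every
        # sample point with its own label
        for xi, yi in zip(X, y):
            j = min(range(len(cx)), key=lambda k: dist(xi, cx[k]))
            if cy[j] != yi:
                return False
        return True

    i = 0
    while i < len(kept_X):
        trial_X = kept_X[:i] + kept_X[i+1:]
        trial_y = kept_y[:i] + kept_y[i+1:]
        if trial_X and consistent(trial_X, trial_y):
            kept_X, kept_y = trial_X, trial_y   # point i was redundant
        else:
            i += 1
    return kept_X, kept_y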

4. The code

The source code is maintained at https://bitbucket.org/yKorsunsky/1-nn-condensing/overview
The instructions for using the code follow below (they are also available on the repository's overview page).

1-NN condensing / How to use the 1-NN condensing project

There are two ways to run the project. The first is to use the installation, which adds a command called "nn_condensing" to your shell. This option is less flexible and is not recommended, since it was not maintained properly. The second option is to define your own metric for your data. Reading the data is done outside this project and is under your control; the only thing this project needs is for your data objects to work with your metric function.

To install the project, download it from the repository and open a shell in the project's folder. The install command is: python setup.py install

To use the project after installation, run a shell command that looks like this:

nn_condensing data_path=E:\nn_condensing_output\Skin_NonSkin.txt \
              starting_train_data_size=125 test_data_size=200 train_data_upper_limit=1000 \
              metric=L2 train_data_size_inc_func=geometric train_data_size_inc_arg=2 \
              CSV_path=E:\nn_condensing_output\CSV_All_Data.CSV

Here is a full list of parameters for this command, including their default values and an explanation of each:

data_path='E:\\1-NN repo\Data\Skin_NonSkin.txt'
    Full path of the data file.
starting_train_data_size=1000
    Initial size of the training data.
test_data_size=200
    Size of the test data.
train_data_upper_limit=3000
    Upper limit on the training data size.
metric="L2"
    Metric to be used.
train_data_size_inc_func="geometric"
    How the training data size grows between runs: "geometric" or "linear".
train_data_size_inc_arg=2
    The argument for the data-size increase function.
compress_greedy=True
    Test compression with the greedy algorithm?
compress_heirarchy=True
    Test compression with the fast (hierarchy) net algorithm?
test_memory=True
    Test memory consumption (currently only works with the fast net compression).
data_pruning_greedy=True
    Prune the data after the greedy compression.
data_pruning_hierarchy=True
    Prune the data after the hierarchy compression.
test_hierarchy_compression=True
    Run 1-NN on the full training data using the fast compression result.
test_1_nn_full_data=True
    Run 1-NN with the full training data (no compression).
test_1_nn_greedy=True
    Run 1-NN with the compressed data (greedy compression).
test_1_nn_hierarchy=True
    Run 1-NN with the compressed data (fast compression).
test_1_nn_greedy_pruning=True
    Run 1-NN with the pruned data (greedy).
test_1_nn_hierarchy_pruning=True
    Run 1-NN with the pruned data (hierarchy).
generate_CSV=True
    Write a CSV file with the results.
CSV_path=''
    Location of the output CSV file.
debug_info=True
    Print status messages while the experiment is running.
print_tables=True
    Finish the experiment by printing the output into tables.

As an example of how to run it on the skin database, we need to implement a metric and a reader function:
[Code figures: the metric and the line-parser implementations for the skin dataset]
Once these two functions are written, you can run the run_experiment function. Note that run_experiment uses "skin_line_parser" directly, so you need to replace it in the code itself (this is an implementation mistake and should be changed).
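
For illustration, here is a hypothetical sketch of what such a pair of functions could look like for Skin_NonSkin.txt (which stores whitespace-separated B, G, R values and a class label per line); the names and the exact format handling are assumptions, not the project's actual code:

import math

def skin_line_parser(line):
    """Parse one line of Skin_NonSkin.txt into (features, label).
    Assumes whitespace-separated values: B G R label."""
    values = line.split()
    features = tuple(float(v) for v in values[:3])
    label = int(values[3])
    return features, label

def l2_metric(a, b):
    """Euclidean (L2) distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))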

The preferred way to work with the algorithm is to provide it with a matrix of data items together with a precalculated Gram (pairwise distance) matrix.
For example, the following line calculates an epsilon net using the hierarchy technique. "gram_matrix" is a precalculated distance matrix of the "data_sample" data. "markov_distance" is the distance to use between data items; if a Gram matrix were not provided, markov_distance would be used to calculate it.

nn.epsilon_net_hierarchy(data_sample,markov_distance,gram_matrix=gram_matrix)

This line is an example of how to use the output of the previous line (hierarchy_net) to produce an even smaller net using the consistent pruning algorithm.

nn.consistent_pruning(hierarchy_net,markov_distance,gram_matrix=gram_matrix)

Finally, this line executes the nearest neighbor algorithm, using the pruned net as the known (labeled) sample to classify the unknown data set.

nn.execute(pruned_net,unknown_data,markov_distance)
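
Putting the three calls together, a minimal end-to-end sketch might look as follows; the import name nn, the data loading, and markov_distance are assumptions standing in for your own code, while the three nn.* calls are exactly the ones shown above:

import nn   # assumed import name for the project's module

# data_sample (labeled data objects), unknown_data (points to classify) and
# markov_distance (the metric) are placeholders for your own code
gram_matrix = [[markov_distance(a, b) for b in data_sample] for a in data_sample]

hierarchy_net = nn.epsilon_net_hierarchy(data_sample, markov_distance, gram_matrix=gram_matrix)
pruned_net = nn.consistent_pruning(hierarchy_net, markov_distance, gram_matrix=gram_matrix)
predictions = nn.execute(pruned_net, unknown_data, markov_distance)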

5. Experimental results

The central experimental results obtained with the above code are presented in the table below. The data was a collection of files totalling 27GB of system-call traces from 26 different processes. The files were split between the different processes and arranged in chronological order of appearance within each process; for each process there were usually numerous files describing different runs of it.
The files were first translated into bi-gram matrices, and the distance measured between them was:

[Figure: the distance used between the bi-gram matrices, and the table of results comparing the 1-NN variants with SVM]
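
For context, a bi-gram matrix of a system-call trace counts how often each ordered pair of consecutive system calls occurs; here is a hypothetical Python sketch of such a feature extraction (the actual preprocessing and distance used in the thesis may differ):

from collections import Counter

def bigram_matrix(trace, vocab):
    """Count occurrences of each consecutive pair of system calls.
    trace: a list of system-call names; vocab: maps a call name to an index."""
    counts = Counter(zip(trace, trace[1:]))
    n = len(vocab)
    matrix = [[0] * n for _ in range(n)]
    for (a, b), c in counts.items():
        matrix[vocab[a]][vocab[b]] = c
    return matrix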

We can see that all three 1-NN-based methods make about half the error that SVM makes, which is a significant improvement. The price is paid in runtime, which is high for both the preprocessing step and the classification step. For the naive 1-NN the runtime is spent entirely on classification, since there is no preprocessing step. The greedy γ-net construction is very slow, which makes it irrelevant here, since it does not improve the results. The hierarchy net construction is fast, and together with the pruning and the classification it gives a runtime similar to the naive 1-NN, except that most of the work is done in the preprocessing phase, since the resulting subsample is about 30% of the size of the training sample. If we were to enlarge the test set, the time gap would grow in favour of the compression scheme, since the preprocessing time would shrink relative to the classification time.

6. References

This webpage is mostly based on the Master's Thesis "Classifying processes by their system call traces using 1-nearest-neighbor with compression" by Yevgeni Korsunsky, and on the open-source project that was written for it (see the link in the Code section).

[1] Lee-Ad Gottlieb, Aryeh Kontorovich, Pinhas Nisnevitch. Near-optimal sample compression for nearest neighbors. NIPS, 2014.
[7] Ulrike von Luxburg and Olivier Bousquet. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5:669-695, 2004.
[15] Gordon Wilfong. Nearest neighbor problems. In Proceedings of the Seventh Annual Symposium on Computational Geometry, SCG '91, pages 224-233, 1991.
[16] A. V. Zukhba. NP-completeness of the problem of prototype selection in the nearest neighbor method. Pattern Recognition and Image Analysis, 20(4):484-494, December 2010.
[25] Wikipedia. Support vector machine.
[27] Alexander Guschin. Otto Product Classification Winner's Interview: 2nd place. http://blog.kaggle.com/2015/06/09/otto-product-classication-winners-interview-2nd-place-alexander-guschin/