Classification

What is Classification

Classification is the task of assigning objects to one of a discrete set of categories. Learning a classifier consists of learning a function that maps objects to categories. (If the range of the function is continuous rather than a discrete set of categories, the task is called regression.) In general, learning functions from complex objects to discrete values is difficult, because there are very many candidate functions and the structure of the objects in the domain of the function may be arbitrarily complex.

If, however, we assume a priori that the function to be learned belongs to a restricted family of functions, characterized by a small number of parameters, then the task becomes tractable. For example, we may assume that the classifier function to be learned is a linear function that accepts vectors of real numbers. That is, given an input vector v = (v_1, ..., v_n), the classifier function would be classifier(v) = Σ_{i=1}^{n} v_i · w_i + c. Under this assumption, the task of learning a classifier would be that of learning good parameters (the weights w_i and the constant c) from the data. Such a simplifying assumption is warranted only to the extent that:

We can map the objects to be classified to a simple domain such as vectors of real numbers.
A member of the simple function family is powerful enough to properly classify the observed objects across categories.
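
As a minimal sketch of the linear form above, the following Python code computes Σ v_i · w_i + c for a feature vector. The text does not say how the real-valued result is turned into a category, so the sign threshold in `classify` is an assumption made for illustration, as are the function names and the example weights.

```python
from typing import Sequence


def linear_score(v: Sequence[float], w: Sequence[float], c: float) -> float:
    """Return sum_i v_i * w_i + c for input vector v, weights w and constant c."""
    if len(v) != len(w):
        raise ValueError("v and w must have the same length")
    return sum(v_i * w_i for v_i, w_i in zip(v, w)) + c


def classify(v: Sequence[float], w: Sequence[float], c: float) -> int:
    # Assumption (not stated in the text): a positive score means class 1,
    # any other score means class 0.
    return 1 if linear_score(v, w, c) > 0 else 0


# Hypothetical parameters; in practice they would be learned from data.
w = [2.0, -1.0]
c = -0.5

print(classify([1.0, 0.5], w, c))  # score is 2.0 - 0.5 - 0.5 = 1.0
print(classify([0.0, 1.0], w, c))  # score is -1.0 - 0.5 = -1.5
```

Learning the classifier then amounts to choosing `w` and `c`, which is only n + 1 numbers, rather than searching over arbitrary functions.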

In this lecture, we adopt a method inspired by these observations, and study classifiers that are:

Feature-based: we represent the objects to be classified as vectors of numerical features. These feature vectors are obtained by a feature extraction stage, which abstracts from the objects to a feature space. Features capture the aspects of the objects that affect the classification task.
Supervised: we assume a simple functional form (such as linear or log-linear) and train a probabilistic model to learn a good approximation of the classifier from examples given as (feature vector, class) pairs.
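
The two properties above can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not the method used later in the course: the vocabulary, the four training sentences and the learning settings are invented, and binary logistic regression trained by stochastic gradient descent is used as one common example of a log-linear model.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical feature set: one count feature per vocabulary word.
VOCABULARY = ["election", "vote", "goal", "match"]


def extract_features(text: str) -> List[float]:
    """Feature extraction: map a text to a vector of word counts."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCABULARY]


def predict_proba(x: Sequence[float], w: Sequence[float], c: float) -> float:
    """Probability of class 1: the logistic function of a linear score."""
    z = sum(x_i * w_i for x_i, w_i in zip(x, w)) + c
    return 1.0 / (1.0 + math.exp(-z))


def train(examples: List[Tuple[List[float], int]],
          epochs: int = 200, lr: float = 0.5) -> Tuple[List[float], float]:
    """Supervised training from (feature vector, class) pairs."""
    w = [0.0] * len(examples[0][0])
    c = 0.0
    for _ in range(epochs):
        for x, y in examples:
            # Gradient of the log loss with respect to the linear score.
            error = predict_proba(x, w, c) - y
            w = [w_i - lr * error * x_i for w_i, x_i in zip(w, x)]
            c -= lr * error
    return w, c


# Invented training data: class 1 = "politics", class 0 = "sports".
TRAINING_TEXTS = [
    ("the election vote was close", 1),
    ("voters vote in the election", 1),
    ("a late goal won the match", 0),
    ("the match ended without a goal", 0),
]
examples = [(extract_features(text), label) for text, label in TRAINING_TEXTS]
w, c = train(examples)

p_politics = predict_proba(extract_features("who won the election"), w, c)
p_sports = predict_proba(extract_features("what a goal"), w, c)
print(p_politics > 0.5, p_sports < 0.5)
```

Note that the classifier never sees the raw text: everything it learns is expressed in terms of the extracted features, so words outside the vocabulary have no influence on the decision.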

Classifying in NLP

Classification is used in many NLP applications. For example, one can classify documents as belonging to categories such as "politics", "economy" or "sports" based on their content. One can also classify words according to their part of speech (as we did in previous lectures). Another task is to classify words according to their word sense, in order to resolve semantic ambiguity.