Classification
What is Classification?
Classification is the task of assigning objects to a discrete set of categories.
Learning a classifier consists of learning a function that maps objects to categories.
(If the range of the function is continuous rather than a discrete set of categories, the
task is called regression.)
In general, learning functions from complex objects to discrete values
is a hard problem, because the space of candidate functions is very large, and
the structure of the objects in the domain of the function may be
arbitrarily complex. If, however, we assume a priori that the
function to be learned belongs to a restricted family of functions,
characterized by a small number of parameters, then the task becomes
tractable.
For example, we may assume that the classifier function to be learned
is a linear function that accepts vectors of real numbers. That is,
given an input vector $v = (v_1, \ldots, v_n)$, the
classifier function would be $\mathrm{classifier}(v) = \sum_{i=1}^{n} v_i w_i + c$. Under this
assumption, the task of learning a classifier would be that of
learning good parameters (the weights $w_i$ and the constant $c$) given the data.
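To make the linear form above concrete, here is a minimal Python sketch. The names `linear_score` and `classify` are hypothetical, and the decision rule (class 1 when the score is positive, class 0 otherwise) is an assumption added for illustration; the text only specifies the weighted sum itself.

```python
from typing import Sequence


def linear_score(v: Sequence[float], w: Sequence[float], c: float) -> float:
    """Return sum_i v_i * w_i + c for an input vector v and weights w."""
    if len(v) != len(w):
        raise ValueError("v and w must have the same length")
    return sum(vi * wi for vi, wi in zip(v, w)) + c


def classify(v: Sequence[float], w: Sequence[float], c: float) -> int:
    """Assumed decision rule: class 1 if the score is positive, else class 0."""
    return 1 if linear_score(v, w, c) > 0 else 0
```

With weights `w = [0.5, -1.0]` and constant `c = 0.25`, the vector `[1.0, 2.0]` scores 0.5 - 2.0 + 0.25 = -1.25, so this sketch assigns it to class 0.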
Such a simplifying assumption is warranted only to the extent that:
- We can map the objects to be classified to a simple domain such as vectors.
- A member of the simple function family is powerful enough to properly classify the observed objects across categories.
In this lecture, we adopt a method inspired by these observations and study classifiers that are:
- Feature-based: we represent the objects to be classified as vectors of numerical features. These feature vectors are obtained by
applying a feature extraction stage that abstracts from the objects to a feature space. Features capture the aspects
of the objects that matter for the classification task.
- Supervised: we assume a simple functional form (such as linear or log-linear) and train a probabilistic model to
learn a good approximation of the classifier from example pairs (feature vector, class).
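The two properties above can be sketched together in a few lines of Python. Everything here is illustrative: the feature set in `extract_features`, the toy words and labels in `TOY_WORDS` (1 for verb-like, 0 otherwise), and the training procedure (stochastic gradient ascent on the log-likelihood of a binary log-linear model) are assumptions chosen for the example, not a method prescribed by the text.

```python
import math
from typing import List, Sequence, Tuple


def extract_features(word: str) -> List[float]:
    """Hypothetical feature extraction: map a word to three binary features."""
    return [
        1.0 if word.endswith("ing") else 0.0,  # has the suffix -ing
        1.0 if word.endswith("ed") else 0.0,   # has the suffix -ed
        1.0 if word[:1].isupper() else 0.0,    # starts with a capital letter
    ]


def predict_proba(x: Sequence[float], w: Sequence[float], c: float) -> float:
    """Probability of class 1 under a binary log-linear (logistic) model."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + c
    return 1.0 / (1.0 + math.exp(-z))


def train(
    examples: Sequence[Tuple[Sequence[float], int]],
    epochs: int = 200,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit weights w and constant c from (feature vector, class) pairs."""
    w = [0.0] * len(examples[0][0])
    c = 0.0
    for _ in range(epochs):
        for x, y in examples:
            # Gradient of the log-likelihood for one example is (y - p) * x.
            error = y - predict_proba(x, w, c)
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            c += lr * error
    return w, c


# Made-up toy training set: (word, class), with class 1 = verb-like.
TOY_WORDS = [
    ("running", 1), ("jumped", 1), ("walking", 1), ("played", 1),
    ("London", 0), ("table", 0), ("Paris", 0), ("dog", 0),
]
TOY_DATA = [(extract_features(word), label) for word, label in TOY_WORDS]
```

After `w, c = train(TOY_DATA)`, calling `predict_proba(extract_features(word), w, c)` gives the model's estimate that an unseen word belongs to class 1, which can be compared against a threshold such as 0.5.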
Classifying in NLP
Classification is used in many NLP applications.
For example, one can classify documents as belonging to categories
such as "politics", "economy" or "sports" based on their content. One
can also classify words according to their part of speech (as we did
in previous lectures). Another classification task would be to
classify words according to their word sense (to resolve semantic
ambiguity).
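As an illustration of how a document might be turned into a feature vector for the document classification task mentioned above, here is a minimal bag-of-words sketch in Python. The function name `bag_of_words`, the whitespace tokenization, and the three-word `VOCABULARY` are assumptions made for the example; a real system would use a much larger vocabulary and a proper tokenizer.

```python
from collections import Counter
from typing import List

# Hypothetical vocabulary: one feature per term, in this fixed order.
VOCABULARY = ["election", "market", "goal"]


def bag_of_words(text: str, vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary term occurs in the lowercased text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]
```

For instance, `bag_of_words("The market fell as the market reacted to the election", VOCABULARY)` counts one occurrence of "election", two of "market", and none of "goal", giving the vector `[1, 2, 0]`.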