Classification

Method

Consider a supervised learning problem: we want to learn a function f : X → Y on the basis of example pairs (X,Y). A probabilistic perspective on this problem consists of characterizing the conditional distribution p(Y|X). When we learn classifiers, Y is a discrete variable (in the simplest case, it is a boolean variable). If Y is a continuous numerical value, we say the problem is a case of regression.

To develop classifiers, we first map the objects to be classified into simple feature vectors. This step is known as feature extraction. The features capture the aspects of the objects that are relevant to the problem. We think of X as a vector of n variables (x1...xn) which characterize the input; for most classifiers, the features are encoded as numerical values. The output is called the label, and takes values over a discrete range (1..m).

The methodology to build a supervised classifier is the following:

  1. Gather a dataset {(Xi, Yi)} i in (1..N) -- where Xi is an object to classify (for NLP applications, objects can be words or sentences or documents) and Yi is a label.
  2. Define a feature extractor - which maps objects Xi to feature vectors (Xij) j in (1..d). Generally, feature values are numerical values.
  3. Split the dataset into training and test sets. Split the training set into train and development sets.
  4. Train a classifier model on the training set (a minimal code sketch of steps 2 to 6 follows this list). Examples of classifier models include Naive Bayes, Logistic Regression (also known as Maximum Entropy), Support Vector Machines (SVM), the Perceptron and more.
  5. Measure the performance of the classifier on the development set, using metrics such as accuracy, and precision and recall per label value.
  6. Perform error analysis: list errors, observe confusion matrix.
  7. Hypothesize reasons for most common error types: missing data in dataset, missing features, too many features. Improve the training set and/or the feature extractor. Perform feature selection.
  8. Iterate over the train / error analysis / improve steps.
  9. Eventually, test the classifier on the test set - and report its performance metrics.
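
A minimal sketch of steps 2 to 6 of this workflow, assuming scikit-learn is available; the toy documents, labels and the extract_features function are purely illustrative:

```python
# Minimal sketch of the train / evaluate workflow with scikit-learn.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy dataset (step 1): parallel lists of objects and labels.
documents = ["the team won the game", "stocks fell sharply today",
             "the coach praised the players", "markets rallied after the report"] * 10
labels = ["sports", "finance", "sports", "finance"] * 10

def extract_features(doc):
    # Step 2: map an object to a dict of feature name -> value.
    tokens = doc.split()
    return {"first_word": tokens[0], "length": len(tokens)}

X = [extract_features(doc) for doc in documents]
X_train, X_dev, y_train, y_dev = train_test_split(      # step 3: train / dev split
    X, labels, test_size=0.25, random_state=0)

model = Pipeline([("vec", DictVectorizer()),             # encode feature dicts as vectors
                  ("clf", LogisticRegression())])        # step 4: train a classifier
model.fit(X_train, y_train)

predictions = model.predict(X_dev)                       # step 5: dev-set metrics
print(classification_report(y_dev, predictions))         # precision / recall per label
print(confusion_matrix(y_dev, predictions))              # step 6: confusion matrix
```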

Typical Features

The following features work well for many NLP tasks and provide a useful baseline.

Word-level Features

When the objects to classify are words, the following features are useful (a sketch of a word-level feature extractor follows the list):
  1. The word form itself.
  2. The lemma of the word (removing possible inflections on the word - so that for example, "books", "booking", "booked" and "book" are all mapped to the feature "book").
  3. The word's part of speech (if a POS tagger is available, its output is used as a feature for other classifiers). If the tagger returns probabilities, one can use the distribution over tags as features as well.
  4. Prefixes and suffixes up to a certain length (1 to 3 letters are typical).
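
A sketch of such a word-level feature extractor, assuming NLTK's WordNet lemmatizer is available (the feature names and affix length are illustrative; the POS feature is omitted and would come from a separate tagger):

```python
# Sketch: word form, lemma, and prefix/suffix features for a single word.
# Requires the WordNet data: nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def word_features(word, max_affix=3):
    features = {
        "form": word.lower(),
        "lemma": lemmatizer.lemmatize(word.lower(), pos="v"),  # verb lemma, e.g. "booked" -> "book"
    }
    for k in range(1, max_affix + 1):
        features["prefix_%d" % k] = word[:k].lower()
        features["suffix_%d" % k] = word[-k:].lower()
    return features

print(word_features("booked"))   # contains lemma "book" and suffix_2 "ed"
```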

Sentence-level Features

When the object to classify is a sentence, the following features are useful:
  1. The word forms of each word in the sentence.
  2. Word lemmas in the sentence.
  3. Word bigrams: pairs (wi-1, wi) for i = 1..length(sentence), with the convention that w0 = "START" (see the sketch after this list).
  4. Word POS in the sentence.
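
A small sketch of the bigram feature with the START convention (the function name and feature keys are illustrative):

```python
# Sketch: extract word-bigram features from a tokenized sentence,
# prepending the "START" symbol as w0.
def bigram_features(tokens):
    padded = ["START"] + tokens
    return {"bigram_%s_%s" % (w1, w2): True
            for w1, w2 in zip(padded, padded[1:])}

print(bigram_features(["the", "cat", "sleeps"]))
# yields keys bigram_START_the, bigram_the_cat, bigram_cat_sleeps
```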

Document-level Features

To model documents, one must map complex objects made of a variable number of paragraphs, each made of a variable number of sentences, each with a variable number of words. The order of paragraphs, sentences and words within sentences is obviously critical to the meaning of the document.

Baseline feature encodings, however, are very simple. The simplest one is called the "bag of words" model. It consists of a vector of dimension d = |V|, where V is the vocabulary observed in the set of documents (the set of distinct words), and each component records how many times the corresponding word occurs in the document; word order is ignored.
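
A minimal bag-of-words sketch over tokenized documents (the names and toy data are illustrative):

```python
# Sketch: bag-of-words vectors of dimension d = |V| built from raw counts.
from collections import Counter

def build_vocabulary(tokenized_docs):
    # V = set of distinct words observed in the document collection.
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    return {w: i for i, w in enumerate(vocab)}

def bag_of_words(tokens, vocab):
    vector = [0] * len(vocab)
    for word, count in Counter(tokens).items():
        if word in vocab:
            vector[vocab[word]] = count   # number of occurrences in the document
    return vector

docs = [["the", "cat", "sleeps"], ["the", "dog", "barks", "at", "the", "cat"]]
vocab = build_vocabulary(docs)
print(bag_of_words(docs[1], vocab))   # "the" gets count 2, word order is lost
```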

Since V can be very large, and we expect word frequencies to follow a long-tail distribution, we can reduce the dimensionality of the feature space by:

  1. Considering only lemmas. There are fewer lemmas than inflected forms.
  2. Considering only the top-K most frequent words of the vocabulary. Because word frequencies follow a long-tail distribution, the top-K words cover a large share of the word occurrences even for relatively small values of K (see the sketch after this list).
  3. Filtering the words by part-of-speech, keeping only Nouns, Adjectives and Verbs for example.
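
With scikit-learn, the top-K restriction of item 2 can be expressed directly; a sketch, where K = 5 is an arbitrary illustrative value:

```python
# Sketch: keep only the top-K most frequent words of the vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

raw_documents = ["the cat sleeps", "the dog barks at the cat", "the cat chases the dog"]
vectorizer = CountVectorizer(max_features=5)     # top-K vocabulary restriction
X = vectorizer.fit_transform(raw_documents)      # document-term count matrix
print(vectorizer.vocabulary_)                    # the K retained words and their indices
```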

Another intuition consists of giving a heavier weight to words that occur frequently within a document than to words that occur only once. There are two problems in turning this intuition into a feature extraction method:
  1. We want to normalize the word counts, so that long documents do not get widely different feature vectors from those of short documents.
  2. We want to avoid giving heavy weights to words that occur frequently across all documents; for example, function words such as "the", "and", "then" are very frequent in all documents and do not provide discriminative information about a document.
The solution to these two problems is a feature extraction method known as tf-idf (term frequency - inverse document frequency): the weight of a word in a document is its (normalized) frequency within the document, multiplied by the logarithm of the inverse of the fraction of documents in which the word appears.

Another element of the solution consists of filtering out very common words of the language, commonly referred to as stop words.
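
A sketch combining tf-idf weighting and stop-word filtering with scikit-learn's TfidfVectorizer (the built-in English stop-word list is one possible choice; the toy data is illustrative):

```python
# Sketch: tf-idf weighted document vectors with stop words removed.
from sklearn.feature_extraction.text import TfidfVectorizer

raw_documents = ["the cat sleeps", "the dog barks at the cat", "the cat chases the dog"]
vectorizer = TfidfVectorizer(stop_words="english",   # drop very common function words
                             max_features=5000)      # optional top-K restriction
X = vectorizer.fit_transform(raw_documents)          # rows are tf-idf weighted vectors
```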

Classifier Algorithms

We focus on supervised classifiers. One can choose from a range of algorithms, each with different properties suited to different datasets. The Scikit Learn Tutorial on supervised learning provides a concise mathematical introduction to the overall principles underlying most algorithms. Practically, the classifier models mentioned in step 4 above (Naive Bayes, Logistic Regression / Maximum Entropy, SVM and the Perceptron) are the ones most relevant to the NLP classification tasks we often face.

The NLTK interface to Scikit-Learn provides convenient ways to prepare features for text processing, using the abstraction of pipelines. Any of these algorithms can be used with the NLTK Scikit-learn wrappers, with similar feature extraction and selection methods.

The example on document classification in Scikit-learn illustrates how such pipelines can be used.
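
A sketch of the NLTK wrapper around a scikit-learn estimator (the feature dictionaries and labels are illustrative; a full scikit-learn Pipeline can be passed in place of the estimator):

```python
# Sketch: wrap a scikit-learn estimator as an NLTK classifier.
# The wrapper converts NLTK-style feature dictionaries into vectors internally.
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import LinearSVC

# NLTK-style training data: a list of (feature_dict, label) pairs.
train_set = [({"first_word": "the", "contains_goal": True}, "sports"),
             ({"first_word": "stocks", "contains_goal": False}, "finance")]

classifier = SklearnClassifier(LinearSVC())
classifier.train(train_set)
print(classifier.classify({"first_word": "the", "contains_goal": True}))
```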

The NLTK classify module includes multiple implementations of the classify interface - some of them implemented natively in Python, some of them wrappers around code from other well-known packages (written in C or Java).

Sequence Classification

Sequence classification consists of classifying items that appear in a sequence, where we expect dependencies between consecutive items. The classical example of sequence classification is part-of-speech tagging. There exist specialized algorithms for classifying sequences, which attempt to optimize the global problem of assigning a label to each item within the sequence. The two most important algorithms are: