Classification

Method

Consider a supervised learning problem: we want to learn a function f : X → Y on the basis of example pairs (X,Y). A probabilistic perspective on this problem consists of characterizing the conditional distribution p(Y|X). When we learn classifiers, Y is a discrete variable (in the simplest case, it is a boolean variable). If Y is a continuous numerical value, we say the problem is a case of regression.

To develop classifiers, we first map the objects to be classified into simple feature vectors. This step is known as feature extraction. The features capture the aspects of the objects that are relevant to the problem. We think of X as a vector of n variables (x1...xn) which characterize the input; for most classifiers, the features are encoded as numerical values. The output is called the label, and takes values over a discrete range (1..m).

The methodology to build a supervised classifier is the following:

  1. Gather a dataset {(Xi, Yi)} i in (1..N) -- where Xi is an object to classify (for NLP applications, objects can be words or sentences or documents) and Yi is a label.
  2. Define a feature extractor - which maps objects Xi to feature vectors (Xij) j in (1..d). Generally, feature values are numerical values.
  3. Split the dataset into training and test sets. Split the training set into train and development sets.
  4. Train a classifier model on the training set (a minimal code sketch of steps 2 to 6 follows this list). Examples of classifier models include Naive Bayes, Logistic Regression (also known as Maximum Entropy), Support Vector Machines (SVM), the Perceptron and more.
  5. Measure the performance of the classifier on the development set, using metrics such as accuracy, and precision and recall per label value.
  6. Perform error analysis: list errors, observe confusion matrix.
  7. Hypothesize reasons for most common error types: missing data in dataset, missing features, too many features. Improve the training set and/or the feature extractor. Perform feature selection.
  8. Iterate over the train / error analysis / improve steps.
  9. Eventually, test the classifier on the test set - and report its performance metrics.
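
A minimal sketch of steps 2 to 6 of this workflow, assuming scikit-learn is available; the toy documents, labels and the extract_features function are purely illustrative:

```python
# Minimal sketch of the train / evaluate workflow with scikit-learn.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy dataset (step 1): parallel lists of objects and labels.
documents = ["the team won the game", "stocks fell sharply today",
             "the coach praised the players", "markets rallied after the report"] * 10
labels = ["sports", "finance", "sports", "finance"] * 10

def extract_features(doc):
    # Step 2: map an object to a dict of feature name -> value.
    tokens = doc.split()
    return {"first_word": tokens[0], "length": len(tokens)}

X = [extract_features(doc) for doc in documents]
X_train, X_dev, y_train, y_dev = train_test_split(      # step 3: train / dev split
    X, labels, test_size=0.25, random_state=0)

model = Pipeline([("vec", DictVectorizer()),             # encode feature dicts as vectors
                  ("clf", LogisticRegression())])        # step 4: train a classifier
model.fit(X_train, y_train)

predictions = model.predict(X_dev)                       # step 5: dev-set metrics
print(classification_report(y_dev, predictions))         # precision / recall per label
print(confusion_matrix(y_dev, predictions))              # step 6: confusion matrix
```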

Typical Features

The following features work well for many NLP tasks and provide a useful baseline.

Word-level Features

When the objects to classify are words, the following features are useful (a sketch of a word-level feature extractor follows the list):
  1. The word form itself.
  2. The lemma of the word (removing possible inflections on the word - so that for example, "books", "booking", "booked" and "book" are all mapped to the feature "book").
  3. The word's part of speech (if a POS tagger is available, its output is used as a feature for other classifiers). If the tagger returns probabilities, one can use the distribution over tags as features as well.
  4. Prefixes and suffixes up to a certain length (1 to 3 letters are typical).
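
A sketch of such a word-level feature extractor, assuming NLTK's WordNet lemmatizer is available (the feature names and affix length are illustrative; the POS feature is omitted and would come from a separate tagger):

```python
# Sketch: word form, lemma, and prefix/suffix features for a single word.
# Requires the WordNet data: nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def word_features(word, max_affix=3):
    features = {
        "form": word.lower(),
        "lemma": lemmatizer.lemmatize(word.lower(), pos="v"),  # verb lemma, e.g. "booked" -> "book"
    }
    for k in range(1, max_affix + 1):
        features["prefix_%d" % k] = word[:k].lower()
        features["suffix_%d" % k] = word[-k:].lower()
    return features

print(word_features("booked"))   # contains lemma "book" and suffix_2 "ed"
```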

Sentence-level Features

When the object to classify is a sentence, the following features are useful:
  1. The word forms of each word in the sentence.
  2. Word lemmas in the sentence.
  3. Word bigrams: pairs (wi-1, wi) for i = 1..length(sentence), with the convention that w0 = "START" (see the sketch after this list).
  4. Word POS in the sentence.
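
A small sketch of the bigram feature with the START convention (the function name and feature keys are illustrative):

```python
# Sketch: extract word-bigram features from a tokenized sentence,
# prepending the "START" symbol as w0.
def bigram_features(tokens):
    padded = ["START"] + tokens
    return {"bigram_%s_%s" % (w1, w2): True
            for w1, w2 in zip(padded, padded[1:])}

print(bigram_features(["the", "cat", "sleeps"]))
# yields keys bigram_START_the, bigram_the_cat, bigram_cat_sleeps
```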

Document-level Features

To model documents, one must map complex objects made of a variable number of paragraphs, each made of a variable number of sentences, each with a variable number of words. The order of paragraphs, sentences and words within sentences is obviously critical to the meaning of the document.

Baseline feature encodings, however, are very simple. The simplest one is called the "bag of words" model. It consists of a vector of dimension d = |V|, where V is the vocabulary observed in the set of documents (the set of distinct words), and each component records how many times the corresponding word occurs in the document; word order is ignored.
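
A minimal bag-of-words sketch over tokenized documents (the names and toy data are illustrative):

```python
# Sketch: bag-of-words vectors of dimension d = |V| built from raw counts.
from collections import Counter

def build_vocabulary(tokenized_docs):
    # V = set of distinct words observed in the document collection.
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    return {w: i for i, w in enumerate(vocab)}

def bag_of_words(tokens, vocab):
    vector = [0] * len(vocab)
    for word, count in Counter(tokens).items():
        if word in vocab:
            vector[vocab[word]] = count   # number of occurrences in the document
    return vector

docs = [["the", "cat", "sleeps"], ["the", "dog", "barks", "at", "the", "cat"]]
vocab = build_vocabulary(docs)
print(bag_of_words(docs[1], vocab))   # "the" gets count 2, word order is lost
```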

Since V can be very large, and we expect word frequencies to follow a long-tail distribution, we can reduce the dimensionality of the feature space by:

  1. Considering only lemmas. There are fewer lemmas than inflected forms.
  2. Considering only the top-K most frequent words of the vocabulary. Because word frequencies follow a long-tail distribution, the top-K words cover a large share of the word occurrences even for relatively small values of K (see the sketch after this list).
  3. Filtering the words by part-of-speech, keeping only Nouns, Adjectives and Verbs for example.
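
With scikit-learn, the top-K restriction of item 2 can be expressed directly; a sketch, where K = 5 is an arbitrary illustrative value:

```python
# Sketch: keep only the top-K most frequent words of the vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

raw_documents = ["the cat sleeps", "the dog barks at the cat", "the cat chases the dog"]
vectorizer = CountVectorizer(max_features=5)     # top-K vocabulary restriction
X = vectorizer.fit_transform(raw_documents)      # document-term count matrix
print(vectorizer.vocabulary_)                    # the K retained words and their indices
```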

Another intuition consists of giving a heavier weight to words that occur frequently within a document than to words that occur only once. There are two problems in turning this intuition into a feature extraction method:
  1. We want to normalize the word counts, so that long documents do not get widely different feature vectors from those of short documents.
  2. We want to avoid giving heavy weights to words that occur frequently across all documents; for example, function words such as "the", "and", "then" are very frequent in all documents and do not provide discriminative information about a document.
The solution to these two problems is a feature extraction method known as tf-idf (term frequency - inverse document frequency): the weight of a word in a document is its (normalized) frequency within the document, multiplied by the logarithm of the inverse of the fraction of documents in which the word appears.

Another element of the solution consists of filtering out very common words of the language, commonly referred to as stop words.
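
A sketch combining tf-idf weighting and stop-word filtering with scikit-learn's TfidfVectorizer (the built-in English stop-word list is one possible choice; the toy data is illustrative):

```python
# Sketch: tf-idf weighted document vectors with stop words removed.
from sklearn.feature_extraction.text import TfidfVectorizer

raw_documents = ["the cat sleeps", "the dog barks at the cat", "the cat chases the dog"]
vectorizer = TfidfVectorizer(stop_words="english",   # drop very common function words
                             max_features=5000)      # optional top-K restriction
X = vectorizer.fit_transform(raw_documents)          # rows are tf-idf weighted vectors
```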

Classifier Algorithms

We focus on supervised classifiers. One can choose from a range of algorithms, each with different properties suited to different datasets. The Scikit Learn Tutorial on supervised learning provides a concise mathematical introduction to the overall principles underlying most algorithms. Practically, the classifier models mentioned in step 4 above (Naive Bayes, Logistic Regression / Maximum Entropy, SVM and the Perceptron) are the ones most relevant to the NLP classification tasks we often face.

The NLTK interface to Scikit-Learn provides convenient ways to prepare features for text processing, using the abstraction of pipelines. Any of these algorithms can be used with the NLTK Scikit-learn wrappers, with similar feature extraction and selection methods.

The example on document classification in Scikit-learn illustrates how such pipelines can be used.
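
A sketch of the NLTK wrapper around a scikit-learn estimator (the feature dictionaries and labels are illustrative; a full scikit-learn Pipeline can be passed in place of the estimator):

```python
# Sketch: wrap a scikit-learn estimator as an NLTK classifier.
# The wrapper converts NLTK-style feature dictionaries into vectors internally.
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import LinearSVC

# NLTK-style training data: a list of (feature_dict, label) pairs.
train_set = [({"first_word": "the", "contains_goal": True}, "sports"),
             ({"first_word": "stocks", "contains_goal": False}, "finance")]

classifier = SklearnClassifier(LinearSVC())
classifier.train(train_set)
print(classifier.classify({"first_word": "the", "contains_goal": True}))
```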

The NLTK classify module includes multiple implementations of the classify interface - some of them implemented natively in Python, some of them wrappers around code from other well-known packages (written in C or Java).

Sequence Classification

Sequence classification consists of classifying items that appear in a sequence, where we expect dependencies between consecutive items. The classical example of sequence classification is part-of-speech tagging. There exist specialized algorithms for classifying sequences, which attempt to optimize the global problem of assigning a label to each item within the sequence. The two most important algorithms are: