Classification
Method
Consider a supervised learning problem: we want to learn a function f
: X → Y on the basis of example pairs (X,Y). A probabilistic
perspective on this problem consists of characterizing the conditional
distribution p(Y|X). When we learn classifiers, Y is a
discrete variable (in the simplest case, it is a boolean variable).
If Y is a continuous numerical value, we say the problem is a case of
regression.
To develop classifiers, we first map the objects to be classified into
simple feature vectors. This step is known as feature
extraction. The features capture the aspects of the objects that
are relevant to the problem. We think of X as a vector of n variables
(x1...xn) that characterize the input; for most classifiers, we encode
the feature values as numbers. The output is called the label, and
takes values over a discrete range (1..m).
The methodology to build a supervised classifier is the following (a code sketch follows the list):
- Gather a dataset {(Xi, Yi)}, i in (1..N), where Xi is an object to classify
(in NLP applications, objects can be words, sentences, or documents) and
Yi is a label.
- Define a feature extractor, which maps each object Xi to a
feature vector (Xij), j in (1..d). Feature values are generally
numerical.
- Split the dataset into training and test sets, and further split the training set into a train set and a development set.
- Train a classifier model on the training set. Examples of
classifier models include Naive Bayes, Logistic Regression (also known as Maximum Entropy in NLP),
Support Vector Machines (SVM), the Perceptron, and more.
- Measure the performance of the classifier on the development set using metrics
such as accuracy, and precision and recall per label value.
- Perform error analysis: list errors, observe confusion matrix.
- Hypothesize reasons for the most common error types: missing data in the
dataset, missing features, too many features. Improve the training set and/or the
feature extractor. Perform feature selection.
- Iterate over the train / error analysis / improve steps.
- Finally, test the classifier on the test set and report its performance metrics.
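To make these steps concrete, here is a minimal sketch of this workflow with Scikit-learn; the toy documents and labels, and the choice of CountVectorizer and Logistic Regression, are placeholders to be adapted to the task at hand.

# Minimal sketch of the supervised classification workflow (toy data as placeholders).
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

docs = ["good movie", "great plot", "wonderful acting", "enjoyable film",
        "bad movie", "terrible acting", "boring plot", "awful film"]      # objects Xi
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]         # labels Yi

# Split into training and test sets, then split training into train / development.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=0, stratify=labels)
X_tr, X_dev, y_tr, y_dev = train_test_split(
    X_train, y_train, test_size=1/3, random_state=0, stratify=y_train)

# Feature extractor: map each object to a numerical feature vector.
vectorizer = CountVectorizer()
X_tr_vec = vectorizer.fit_transform(X_tr)
X_dev_vec = vectorizer.transform(X_dev)

# Train a classifier model on the train set.
model = LogisticRegression()
model.fit(X_tr_vec, y_tr)

# Measure performance on the development set; inspect the confusion matrix for error analysis.
y_pred = model.predict(X_dev_vec)
print(classification_report(y_dev, y_pred))   # accuracy, precision / recall per label
print(confusion_matrix(y_dev, y_pred))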
Typical Features
The following features work well for many NLP tasks and provide a useful baseline.
Word-level Features
When the objects to classify are words, the following features are useful (a feature-extractor sketch follows the list):
- The word form itself.
- The lemma of the word (the word form with inflections removed, so
that, for example, "books", "booking", "booked" and "book" are all
mapped to the feature "book").
- The word's part of speech (if a POS tagger is available, the output
of the tagger is used as a feature for other classifiers). If the tagger returns
probabilities, the distribution over tags can also be used as features.
- Prefixes and suffixes up to a certain length (1 to 3 letters are typical).
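As a sketch, a word-level feature extractor along these lines could look as follows; the choice of the WordNet lemmatizer and of nltk.pos_tag, as well as the feature names, are illustrative assumptions rather than a prescribed setup.

# Sketch of a word-level feature extractor (feature names are illustrative).
# Requires the NLTK resources: nltk.download("wordnet"), nltk.download("averaged_perceptron_tagger")
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def word_features(word):
    features = {
        "form": word,                                # the word form itself
        "lemma": lemmatizer.lemmatize(word.lower()), # lemma (inflections removed)
        "pos": nltk.pos_tag([word])[0][1],           # POS tag; in practice, tag the full sentence
    }
    # Prefixes and suffixes of length 1 to 3.
    for k in range(1, 4):
        features["prefix_%d" % k] = word[:k]
        features["suffix_%d" % k] = word[-k:]
    return features

print(word_features("books"))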
Sentence-level Features
When the object to classify is a sentence, the following features are useful (see the sketch after this list):
- The word forms of each word in the sentence.
- Word lemmas in the sentence.
- Word bigrams: pairs (wi-1, wi) for i = 1..length(sentence), with the convention that w0 = "START".
- Word POS in the sentence.
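A minimal sketch of a sentence-level feature extractor built on these ideas (the START convention and the feature naming are illustrative):

# Sketch of sentence-level features: word forms and word bigrams with a START marker.
def sentence_features(words):
    features = {}
    for w in words:
        features["word=%s" % w] = True       # word forms occurring in the sentence
    padded = ["START"] + list(words)
    for prev, cur in zip(padded, words):     # bigrams (wi-1, wi) with w0 = "START"
        features["bigram=%s_%s" % (prev, cur)] = True
    return features

print(sentence_features(["the", "cat", "sat"]))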
Document-level Features
To model documents, one must map complex objects (built from a
variable number of paragraphs, each made of a variable number of
sentences, each with a variable number of words) into feature vectors. The order of paragraphs,
sentences and words within sentences is obviously critical to the
meaning of the document.
Baseline feature encodings, however, are very simple. The easiest one
is called the "bag of words" model. It consists of a vector of
dimension d = |V|, where V is the vocabulary observed in the set of documents
(so d is the number of distinct words), and each coordinate counts the occurrences of the corresponding word in the document.
Since V can be very large, and we expect word frequencies to follow a long-tail
distribution, we can reduce the dimensionality of the feature space by (a code sketch follows the list):
- Considering only lemmas. There are fewer lemmas than inflected forms, and we expect that the occurrence of a lemma in any inflected form
still triggers the same meaning.
- Considering only the top-K most frequent words of the vocabulary.
Because the words are distributed in a long-tail form, the top-K words
cover a large share of the word occurrences even for relatively small
values of K. We also expect that more frequent words convey most of the
meaning of the document.
- Filtering the words by part of speech, keeping for example only nouns, adjectives, and verbs.
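As a sketch, the bag-of-words encoding with a top-K vocabulary can be obtained with Scikit-learn's CountVectorizer; the corpus and the value of K below are placeholders.

# Bag-of-words with top-K vocabulary truncation (corpus and K are placeholders).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs play on the mat",
]

# max_features=K keeps only the K most frequent words of the observed vocabulary V.
vectorizer = CountVectorizer(max_features=5)
X = vectorizer.fit_transform(corpus)      # one row per document, one column per kept word
print(vectorizer.get_feature_names_out())
print(X.toarray())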
Another intuition consists of attributing a heavier weight to words
that occur more frequently within a document than to words that occur
only once. There are two problems in turning this intuition into a
feature extraction method:
- We want to normalize the word counts, so that long documents do not get wildly different feature vectors
from short documents.
- We want to avoid attributing heavy weights to words that occur frequently across all documents: for example,
function words such as "the", "and", "then" are very frequent in every document and do not provide
discriminative information about a specific document.
The solution to these two problems consists of using a feature
weighting method known as tf-idf (term frequency - inverse document frequency).
Another element of the solution consists of filtering out very common words in the language, commonly referred to
as stop words.
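In its most common form, the tf-idf weight of a term t in a document d is tf-idf(t, d) = tf(t, d) * log(N / df(t)), where tf(t, d) is the (possibly length-normalized) count of t in d, N is the number of documents, and df(t) is the number of documents containing t. A minimal sketch combining tf-idf weighting with stop-word filtering in Scikit-learn (the corpus is a placeholder; the exact weighting used by TfidfVectorizer adds smoothing and normalization details):

# tf-idf features with stop-word filtering (the corpus is a placeholder).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs play on the mat",
]

# stop_words="english" filters out very common function words;
# the idf factor down-weights words that occur in many documents.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))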
Classifier Algorithms
We focus on supervised classifiers. One can choose from a range of
algorithms, each with different properties suited to different
datasets. The Scikit-Learn Tutorial on supervised learning provides a concise
mathematical introduction to the overall principles underlying most
algorithms.
In practice, the following algorithms are relevant to the NLP classification tasks we often face:
- Decision Trees
A popular algorithm in the decision tree family is
C4.5.
NLTK has an implementation of decision trees
based on the Java Weka library.
Scikit-learn has excellent documentation and
implementation of decision trees.
- Naive Bayes:
a very simple method for classification based on the assumption that the features are conditionally
independent of each other given the target variable. Scikit-learn has very good implementations of Naive Bayes variants
(multinomial, Bernoulli and Gaussian).
NLTK also includes a Naive Bayes
implementation.
- Support Vector Machines (SVM): an efficient
classification method that works well in the presence of a very large number of features.
The Scikit-learn implementation is based on the LibSVM engine.
- Logistic Regression: a discriminative learning method (fits the
conditional probability p(y|x) directly instead of learning the joint distribution p(x,y)).
The NLTK interface to Scikit-Learn provides
convenient ways to prepare features for text processing, using the abstraction of pipelines. Any of the four algorithms
listed above can be used with the NLTK Scikit-learn wrappers, with similar feature extraction and selection methods.
The example on document classification
in Scikit-learn illustrates how such pipelines can be used; a minimal pipeline sketch follows.
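The following sketch is in the spirit of that example, assuming a small placeholder training set and a multinomial Naive Bayes classifier:

# Sketch of a Scikit-learn text classification pipeline (training data is a placeholder).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["good movie", "great plot", "bad movie", "terrible acting"]
train_labels = ["pos", "pos", "neg", "neg"]

# The pipeline chains feature extraction and the classifier into one estimator.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
clf.fit(train_docs, train_labels)
print(clf.predict(["great acting", "bad plot"]))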
The NLTK classify module includes multiple implementations of
the classify interface - some of them implemented natively in Python, some of them wrappers for code written in other
well-known libraries (in C or Java).
Sequence Classification
Sequence classification consists of classifying items that appear in a sequence, where we
expect dependencies between consecutive items. The classical example of sequence classification
is part-of-speech tagging.
There exist specialized algorithms to classify sequences, which attempt
to optimize the global problem of assigning a label to each item
within the sequence.
The two most important algorithms are:
- Hidden Markov Models: HMM -- The NLTK HMM
implementation can be used to explore HMMs (a usage sketch follows this list). One of the key advantages of HMMs is that they can
be trained in an unsupervised manner using an instance of the general method called Expectation Maximization (EM).
HMMs are generative models: they model the joint probability P(O, L),
where O is an observed sequence of tokens and L is the corresponding
sequence of labels.
HMMs are efficiently implemented using dynamic programming methods:
the Viterbi algorithm and
the Forward-Backward algorithm
are key to their efficiency.
- Conditional Random Fields (CRF)
-- The NLTK CRF implementation
is based on the Java Mallet toolkit.
CRFs are discriminative models -- they model the conditional distribution P(L|O). CRFs are explicitly
feature-based and can deal with large numbers of features.
CRFs are best presented in Michael Collins's notes.
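As a usage sketch of the NLTK HMM implementation for supervised sequence labeling (the two training sentences are placeholders; in practice one would train on a tagged corpus such as nltk.corpus.treebank):

# Sketch: training a supervised HMM tagger with NLTK (training data is a placeholder).
from nltk.probability import LidstoneProbDist
from nltk.tag.hmm import HiddenMarkovModelTrainer

# Labeled sequences: each sentence is a list of (observation, label) pairs.
train_data = [
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
]

trainer = HiddenMarkovModelTrainer()
# Lidstone smoothing avoids zero probabilities for unseen transitions / emissions.
tagger = trainer.train_supervised(
    train_data, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))

# Decoding with tag() uses the Viterbi algorithm to find the best label sequence.
print(tagger.tag(["the", "dog", "sleeps"]))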
Last modified 25 Nov, 2018