Classification

We apply classification tools to solve NLP tasks.

We start with a very simple task that looks at words in isolation and classifies them into one of two labels: gender identification. The task consists of guessing whether a given first name is masculine or feminine.

The classification method takes an observation as input, turns it into a feature vector, and then predicts a label for that feature vector by applying a trained classifier model.

To prepare for this procedure, we must first train a classifier. In supervised learning, a classifier is learned by generalizing from a set of observed pairs (observation_i, label_i), i = 1..N.

Document Classification

We now turn our attention to classifying full documents as opposed to single words in isolation.

The task seems more challenging, but simple methods can achieve surprisingly good results when the task is well defined. Consider the task of predicting whether a movie review is positive or negative. This task is called sentiment analysis, and it has become practically important in the era of user-generated content (UGC) on the Web.

A good dataset, the movie_reviews corpus, is available in NLTK to experiment with this task.

Back to Part of Speech Tagging

We can consider the task of POS tagging as a classification task and use the classifier methodology described here. Let us revisit the POS tagging task discussed in the first lecture using the new tools we have developed.

Testing Different Classifiers

NLTK provides a common interface to different classifier algorithms. This is illustrated in the following examples.

Naive Bayes Classifier
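A minimal, self-contained illustration of the interface on a made-up toy dataset (the weather features and play/stay labels are invented for illustration): train() builds the model, classify() returns the most likely label, and prob_classify() returns a full probability distribution over labels.

```python
import nltk

# Toy training pairs: (feature dict, label).
train = [
    ({"outlook": "sunny", "wind": "weak"}, "play"),
    ({"outlook": "sunny", "wind": "strong"}, "stay"),
    ({"outlook": "rainy", "wind": "strong"}, "stay"),
    ({"outlook": "rainy", "wind": "weak"}, "play"),
    ({"outlook": "sunny", "wind": "weak"}, "play"),
]

nb = nltk.NaiveBayesClassifier.train(train)

test_fs = {"outlook": "sunny", "wind": "weak"}
label = nb.classify(test_fs)      # the single most likely label
dist = nb.prob_classify(test_fs)  # a distribution over all labels
probs = {l: dist.prob(l) for l in nb.labels()}
```

The same classify()/prob_classify() interface is shared by the other NLTK classifiers, which is what makes swapping algorithms easy.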

Decision Tree Classifier

There is no prob_classify() method for decision tree classifiers, as they do not provide a probabilistic interpretation of their decisions.
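On the same kind of toy data (again invented for illustration), a decision tree trains through the identical interface; classify() works, but asking for a probability distribution raises NotImplementedError:

```python
import nltk

# Toy training pairs: (feature dict, label).
train = [
    ({"outlook": "sunny", "wind": "weak"}, "play"),
    ({"outlook": "sunny", "wind": "strong"}, "stay"),
    ({"outlook": "rainy", "wind": "strong"}, "stay"),
    ({"outlook": "rainy", "wind": "weak"}, "play"),
    ({"outlook": "sunny", "wind": "weak"}, "play"),
]

dt = nltk.DecisionTreeClassifier.train(train)
label = dt.classify({"outlook": "sunny", "wind": "weak"})

# Decision trees assign a hard label per leaf, so there is no
# probability distribution over labels to return.
try:
    dt.prob_classify({"outlook": "sunny", "wind": "weak"})
    has_prob_dist = True
except NotImplementedError:
    has_prob_dist = False
```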

Scikit-Learn Classifiers

NLTK provides a wrapper, SklearnClassifier, around the scikit-learn (sklearn) classifiers, including maximum entropy (logistic regression) and SVM.
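A small sketch of the wrapper, again on an invented toy dataset: nltk.classify.SklearnClassifier adapts NLTK-style (feature dict, label) pairs to any scikit-learn estimator, here a logistic regression (a maximum entropy model) and a linear-kernel SVM.

```python
from nltk.classify import SklearnClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Toy training pairs: (feature dict, label).
train = [
    ({"outlook": "sunny", "wind": "weak"}, "play"),
    ({"outlook": "sunny", "wind": "strong"}, "stay"),
    ({"outlook": "rainy", "wind": "strong"}, "stay"),
    ({"outlook": "rainy", "wind": "weak"}, "play"),
    ({"outlook": "sunny", "wind": "weak"}, "play"),
]

# Maximum entropy (logistic regression) through the NLTK interface.
maxent = SklearnClassifier(LogisticRegression()).train(train)

# Support vector machine through the same interface.
svm = SklearnClassifier(SVC(kernel="linear", C=1.0)).train(train)

label = svm.classify({"outlook": "sunny", "wind": "weak"})
```

Internally the wrapper vectorizes the feature dicts, so the same featuresets used with NLTK's own classifiers work unchanged.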

The key hyperparameter to tune for a given SVM kernel is the regularization parameter C. Here is example code that shows how to optimize C on a held-out development set.
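A hedged sketch of such a search (the synthetic data from make_classification and the candidate grid of C values stand in for real extracted features; both are illustrative assumptions): train one model per candidate C, score each on the development set, and keep the best.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic feature vectors standing in for real extracted features.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Try each candidate C, score on the development set, keep the best.
best_C, best_acc = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    acc = clf.score(X_dev, y_dev)
    if acc > best_acc:
        best_C, best_acc = C, acc
```

In practice sklearn's GridSearchCV automates this loop, using cross-validation instead of a single development split.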