Assignment 2

Due: Mon 13 Feb 2017 Midnight

Natural Language Processing - Fall 2017 - Michael Elhadad

This assignment covers the topics of sequence classification, word embeddings and RNNs. The tasks on which we experiment are Named Entity Recognition (NER) and document classification. The objectives are:

  1. Learn the HMM model and the Viterbi algorithm.
  2. Experiment and evaluate classifiers for the tasks of named entity recognition and document classification.
  3. Use pre-trained word embeddings and measure whether they help for the task of NER and Document Classification.
  4. Experiment with LSTM document encoding using the Keras library with pre-trained word embeddings.

Make sure you have installed scikit-learn and pandas to work on this assignment (both come with Anaconda). You should also install Tensorflow - recent releases install easily on Linux, MacOS and Windows. To install tensorflow and tflearn (a higher-level package on top of Tensorflow that works in a way similar to Scikit-learn), do:

pip install tensorflow

# Under Windows, you need to install a simulation of the curses module:
# Download the appropriate version from http://www.lfd.uci.edu/~gohlke/pythonlibs/#curses 
# then run (for the python 3.5 / windows 64 bit example):
pip install ./curses-2.2-cp35-none-win_amd64.whl

pip install tflearn
We will use one dataset provided in the Keras toolkit - the required file is reuters.py. If you want to install Keras, do:
conda install mingw libpython

pip install keras

# The default backend of Keras is Theano - which does not work easily under Windows and 
# may be slower under Linux and MacOS.  To change it to use Tensorflow, edit the file:
# ~/.keras/keras.json
{"epsilon": 1e-07, "backend": "tensorflow", "floatx": "float32"}
Verify that Keras is properly installed:
$ python
> import keras
Using TensorFlow backend

Submit your solution by email in the form of an iPython notebook (ipynb) file.


Q1. Document Classification

Q1.1. Reuters Dataset Exploration

Execute the Scikit-Learn notebook tutorial on text classification: out of core classification. Our objective is to become familiar with Pandas for manipulating tabular data and with document vectorization in Scikit-Learn. Your task:
  1. Turn the code of the Sklearn tutorial above into a notebook (there is a link to an ipynb file on the page, but you can make a better one).
  2. Explore how many documents are in the dataset, how many categories, and how many documents per category; report the mean, standard deviation, minimum and maximum. (Use the pandas library to explore the dataset; the dataframe.describe() method is convenient.)
  3. Explore how many characters and words are present in the documents of the dataset.
  4. Explain informally which classifiers support the "partial_fit" method used in the code.
  5. Explain what the hashing vectorizer used in this tutorial is. Why is it important to use this vectorizer to achieve "streaming classification"?
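For items 4 and 5, the following minimal sketch shows how a hashing vectorizer and partial_fit combine for streaming classification; the stream_of_batches generator is a hypothetical stand-in for the Reuters document stream used in the tutorial, and the toy texts are only illustrative:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless: it hashes tokens into a fixed number of
# columns, so no vocabulary has to be built (or held in memory) up front.
vectorizer = HashingVectorizer(n_features=2**18)

# SGDClassifier is one of the estimators exposing partial_fit, i.e. it can
# be updated incrementally from mini-batches.
clf = SGDClassifier(loss="log")
all_classes = np.array([0, 1])   # must be declared on the first partial_fit call

def stream_of_batches():
    """Placeholder for the tutorial's document stream: yields (texts, labels)."""
    yield (["sterling rises against the dollar", "wheat harvest disappoints"], [0, 1])

for texts, labels in stream_of_batches():
    X = vectorizer.transform(texts)   # no fit needed: hashing is stateless
    clf.partial_fit(X, labels, classes=all_classes)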

Q2. Sequence Labelling for Named Entity Recognition

Named Entity Recognition

The task of Named Entity Recognition (NER) involves the recognition of names of persons, locations, organizations and dates in free text. For example, the following sentence is tagged with sub-sequences indicating PER (for persons), LOC (for locations) and ORG (for organizations):
Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.

[PER Wolff ] , currently a journalist in [LOC Argentina ] , played with [PER Del Bosque ] in the final years of the seventies in [ORG Real Madrid ] .
NER involves 2 sub-tasks: identifying the boundaries of such expressions (the open and close brackets) and labelling the expressions (with tags such as PER, LOC or ORG). This sequence labelling task is mapped to a per-word classification task using the BIO encoding of the data:
        Wolff B-PER
            , O
    currently O
            a O
   journalist O
           in O
    Argentina B-LOC
            , O
       played O
         with O
          Del B-PER
       Bosque I-PER
           in O
          the O
        final O
        years O
           of O
          the O
    seventies O
           in O
         Real B-ORG
       Madrid I-ORG
            . O

Dataset

The dataset we will use for this question is derived from the CoNLL 2002 shared task - which is about NER in Spanish and Dutch. The dataset is included in the NLTK distribution. Explanations on the dataset are provided in the CoNLL 2002 page.

To access the data in Python, do:

from nltk.corpus import conll2002

etr = conll2002.chunked_sents('esp.train') # In Spanish
eta = conll2002.chunked_sents('esp.testa') # In Spanish
etb = conll2002.chunked_sents('esp.testb') # In Spanish

dtr = conll2002.chunked_sents('ned.train') # In Dutch
dta = conll2002.chunked_sents('ned.testa') # In Dutch
dtb = conll2002.chunked_sents('ned.testb') # In Dutch
The data consists of three files per language (Spanish and Dutch): one training file and two test files testa and testb. The first test file is to be used in the development phase for finding good parameters for the learning system. The second test file will be used for the final evaluation.
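For feature encoding it is often more convenient to work with flat (word, POS, IOB-tag) triples than with chunk trees; NLTK also exposes the data in that form. A small sketch:

from nltk.corpus import conll2002

# Each sentence is a list of (word, pos, iob_tag) triples,
# e.g. ('Argentina', 'NP', 'B-LOC').
train_sents = conll2002.iob_sents('esp.train')
dev_sents = conll2002.iob_sents('esp.testa')
test_sents = conll2002.iob_sents('esp.testb')

print(train_sents[0][:3])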

Q2.1 Features

Your task consists of:
  1. Choose good features for encoding the problem.
  2. Encode your training dataset.
  3. Run a classifier over the training dataset.
  4. Train and test the model.
  5. Perform error analysis and fine tune model parameters on the testa part of the datasets.
  6. Perform evaluation over the testb part of the dataset, reporting accuracy, per-label precision, recall and F-measure, and the confusion matrix.
Here is a list of features that have been found appropriate for NER in previous work:
  1. The word form (the string as it appears in the sentence)
  2. The POS of the word (which is provided in the dataset)
  3. ORT - a feature that captures the orthographic (letter) structure of the word. It can have any of the following values: number, contains-digit, contains-hyphen, capitalized, all-capitals, URL, punctuation, regular.
  4. prefix1: first letter of the word
  5. prefix2: first two letters of the word
  6. prefix3: first three letters of the word
  7. suffix1: last letter of the word
  8. suffix2: last two letters of the word
  9. suffix3: last three letters of the word

For example, given the following toy training data, the encoding of the features would be:

        Wolff NP  B-PER
            , ,   O
    currently RB  O
            a AT  O
   journalist NN  O
           in IN  O
    Argentina NP  B-LOC
            , ,   O
       played VBD O
         with IN  O
          Del NP  B-PER
       Bosque NP  I-PER
           in IN  O
          the AT  O
        final JJ  O
        years NNS O
           of IN  O
          the AT  O
    seventies NNS O
           in IN  O
         Real NP  B-ORG
       Madrid NP  I-ORG
            . .   O

Classes
1 B-PER
2 I-PER
3 B-LOC
4 I-LOC
5 B-ORG
6 I-ORG
7 O

Feature WORD-FORM:
1 Wolff
2 ,
3 currently
4 a
5 journalist
6 in
7 Argentina
8 played
9 with
10 Del
11 Bosque
12 the
13 final
14 years
15 of
16 seventies
17 Real
18 Madrid
19 .

Feature POS
20 NP
21 ,
22 RB
23 AT
24 NN
25 VBD
26 JJ
27 NNS
28 .

Feature ORT
29 number
30 contains-digit
31 contains-hyphen
32 capitalized
33 all-capitals
34 URL
35 punctuation
36 regular

Feature Prefix1
37 W
38 ,
39 c
40 a
41 j
42 i
43 A
44 p
45 w
46 D
47 B
48 t
49 f
50 y
51 o
52 s
53 .
Given this encoding, we can compute the vector representing the first word "Wolff NP B-PER" as:
# Class: B-PER=1
# Word-form: Wolff=1
# POS: NP=20
# ORT: Capitalized=32
# prefix1: W=37
1 1:1 20:1 32:1 37:1
When you encode the test dataset, some of the word-forms will be unknown (not seen in the training dataset). You should therefore plan for a special "unknown" value for each feature where this can happen.

Instead of writing the code as explained above, use the Scikit-learn vectorizer and pipeline library. Learn how to use the DictVectorizer for this specific task. Hashing vs. DictVectorizer also provides useful background.
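As a minimal sketch, DictVectorizer turns per-word feature dictionaries like the ones above into the sparse one-hot encoding described earlier; note how (feature, value) pairs unseen during fit are silently dropped at transform time, which covers the "unknown" case (feature names and values below are only illustrative):

from sklearn.feature_extraction import DictVectorizer

train_features = [
    {'word-form': 'Wolff', 'pos': 'NP', 'ort': 'capitalized', 'prefix1': 'W'},
    {'word-form': ',', 'pos': ',', 'ort': 'punctuation', 'prefix1': ','},
]

vec = DictVectorizer(sparse=True)
X_train = vec.fit_transform(train_features)   # one column per (name, value) pair

# An unseen value at test time ('Elhadad') simply produces no column.
X_test = vec.transform([{'word-form': 'Elhadad', 'pos': 'NP',
                         'ort': 'capitalized', 'prefix1': 'E'}])
print(vec.get_feature_names())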

You can start from the following example notebook CoNLL 2002 Classification with CRF. You do not need to install Python-CRFSuite - just take this notebook as a starting point to explore the dataset and ways to encode features.

We implement here a version of NER based on "greedy tagging" (that is, without optimizing over the sequence of tags as an HMM or CRF model would). Train the model using a logistic regression classifier and experiment with better features - for example, the tag predicted for the previous word, and the word forms of the previous and following words (add padding words in the vectorizer). The following notebooks provide good starting points: Document Classification by Zac Stewart and Data Science in Python by Radim Rehurek.
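One possible shape for such a greedy tagger is sketched below, assuming the (word, pos, iob) triples returned by iob_sents as above; the feature set, padding symbols and helper names are only illustrative, and adding the previously predicted tag as a feature would additionally require greedy left-to-right decoding at test time:

from nltk.corpus import conll2002
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_sents = conll2002.iob_sents('esp.train')

def word2features(sent, i):
    """Feature dictionary for token i of sent (a list of (word, pos, iob) triples).
    '<S>' / '</S>' stand in for out-of-sentence positions."""
    word, pos = sent[i][0], sent[i][1]
    prev_word = sent[i - 1][0] if i > 0 else '<S>'
    next_word = sent[i + 1][0] if i < len(sent) - 1 else '</S>'
    return {
        'word': word,
        'pos': pos,
        'prefix3': word[:3],
        'suffix3': word[-3:],
        'is-capitalized': word[:1].isupper(),
        'prev-word': prev_word,
        'next-word': next_word,
    }

def sents2examples(sents):
    X = [word2features(s, i) for s in sents for i in range(len(s))]
    y = [s[i][2] for s in sents for i in range(len(s))]
    return X, y

X_train, y_train = sents2examples(train_sents)

pipeline = Pipeline([
    ('vectorizer', DictVectorizer(sparse=True)),
    ('classifier', LogisticRegression()),
])
pipeline.fit(X_train, y_train)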

Q2.2 HMM and Viterbi

2.2.1 Implement in Python the HMM algorithm discussed in class: Algorithm 2.5 (Viterbi with backpointers) from Tagging Problems, and Hidden Markov Models.
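A minimal sketch of Viterbi with backpointers follows, assuming log-probability functions log_q(prev_tag, tag) and log_e(word, tag) are supplied; these names and the '<S>' start symbol are my own choices, and the final STOP transition from the notes is omitted for brevity:

def viterbi(sentence, tags, log_q, log_e):
    """Most likely tag sequence for `sentence` under a bigram HMM.

    log_q(prev_tag, tag) -> log transition probability
    log_e(word, tag)     -> log emission probability
    '<S>' is the pseudo-tag that starts every sentence.
    """
    n = len(sentence)
    pi = [{} for _ in range(n)]   # pi[k][tag] = best log-score of a path ending in tag at position k
    bp = [{} for _ in range(n)]   # backpointers: bp[k][tag] = best previous tag

    # Initialization (k = 0): transition from the start symbol.
    for tag in tags:
        pi[0][tag] = log_q('<S>', tag) + log_e(sentence[0], tag)
        bp[0][tag] = '<S>'

    # Recursion: extend the best path ending in each previous tag.
    for k in range(1, n):
        for tag in tags:
            best_prev, best_score = None, float('-inf')
            for prev in tags:
                score = pi[k - 1][prev] + log_q(prev, tag) + log_e(sentence[k], tag)
                if score > best_score:
                    best_prev, best_score = prev, score
            pi[k][tag] = best_score
            bp[k][tag] = best_prev

    # Backtrack from the best final tag (a STOP transition could be added here).
    last = max(tags, key=lambda t: pi[n - 1][t])
    path = [last]
    for k in range(n - 1, 0, -1):
        path.append(bp[k][path[-1]])
    return list(reversed(path))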

2.2.2 Test your HMM/Viterbi implementation on the CoNLL 2002 NER tagging dataset: estimate the tag transition parameters q with MLE, and estimate the emission parameters e(x|tag) with a discounted language model for each tag (discounting is available in NLTK as the Lidstone estimator).
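One possible way to estimate these parameters with NLTK is sketched below, producing the log_q and log_e functions expected by the Viterbi sketch above; gamma and bins are tunable and the values here are only illustrative (bins should be at least the vocabulary size):

import math
from nltk.corpus import conll2002
from nltk.probability import ConditionalFreqDist, LidstoneProbDist

train_sents = conll2002.iob_sents('esp.train')

trans_cfd = ConditionalFreqDist()   # counts for q(tag | prev_tag)
emis_cfd = ConditionalFreqDist()    # counts for e(word | tag)

for sent in train_sents:
    prev = '<S>'
    for word, pos, tag in sent:
        trans_cfd[prev][tag] += 1
        emis_cfd[tag][word.lower()] += 1
        prev = tag

tagset = list(emis_cfd.conditions())

# One Lidstone-discounted distribution per tag.
emis_pd = {t: LidstoneProbDist(emis_cfd[t], gamma=0.1, bins=50000) for t in tagset}

def log_q(prev_tag, tag):
    # MLE relative frequency, floored to avoid log(0) for unseen transitions.
    return math.log(trans_cfd[prev_tag].freq(tag) or 1e-12)

def log_e(word, tag):
    return math.log(emis_pd[tag].prob(word.lower()))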

Q2.3 Using Word Embeddings

One possible way to improve a greedy tagger for NER is to use word embeddings as features. The gensim library by Radim Rehurek provides convenient tools for manipulating Word2Vec word embeddings. To install it, use:
# conda install gensim
You must also download a pre-trained word2vec model from the Word2Vec site. The largest model is GoogleNews-vectors-negative300.bin (a 1.5GB compressed file). To load it in Python, use the following code (this requires about 8GB of RAM on your machine to work properly):
from gensim.models import word2vec

model_path = "GoogleNews-vectors-negative300.bin"
# In gensim versions after 1.0, use KeyedVectors.load_word2vec_format instead.
model = word2vec.Word2Vec.load_word2vec_format(model_path, binary=True)

# A dense vector of 300 dimensions representing the word 'queen'
print(model["queen"])

stringA = 'woman'
stringB = 'king'
stringC = 'man'
print(model.most_similar(positive=[stringA, stringB], negative=[stringC], topn=10))
Your task:
  1. Add the word2vec embeddings as dense vectors to the features of your NER classifier for each word feature (current word, previous word, next word); one way to combine them with the sparse features is sketched after this list.
  2. Retrain the model and report on performance.
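A hedged sketch of combining the sparse one-hot features with dense word2vec vectors by stacking them column-wise; `model` is the word2vec model loaded above, while the context triples and X_sparse (the DictVectorizer output for the same tokens, in the same row order) are assumptions about your own pipeline:

import numpy as np
from scipy.sparse import csr_matrix, hstack

DIM = 300
ZERO = np.zeros(DIM, dtype='float32')

def embed(word):
    # Fall back to a zero vector for out-of-vocabulary words.
    return model[word] if word in model else ZERO

def embedding_block(context_triples):
    """context_triples: one (word, prev_word, next_word) triple per token.
    Returns a matrix of concatenated embeddings, one row per token."""
    rows = [np.concatenate([embed(w), embed(p), embed(n)])
            for w, p, n in context_triples]
    return csr_matrix(np.vstack(rows))

# Toy example; in practice build one triple per token, in the same order as
# the rows produced by the DictVectorizer.
context_triples = [('Wolff', '<S>', ','), ('Argentina', 'in', ',')]
X_dense = embedding_block(context_triples)
# X_combined = hstack([X_sparse, X_dense])   # X_sparse: DictVectorizer output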

Q3. RNN (LSTM) for Document Classification

We investigate a type of RNN called LSTM in this question to encode a document and pass it to a logistic regression layer. We use the TFLearn library on top of Tensorflow for this work, starting from one of its tutorial examples: lstm.py, which performs document classification on the sentiment classification task over the IMDB dataset (movie reviews labelled as positive / negative).

The dataset contains 50,000 reviews of about 300 words each on average, split evenly between positive and negative.

3.1 Run the example

Prepare a notebook to run the example (it takes about 1 hour to train on a CPU machine). Make sure you have installed Tensorflow and TFLearn (see instructions above). Explore the way tensorboard is used: while the code runs, it writes logs into a folder (by default /tmp/tflearn_logs). Launch tensorboard as:
# tensorboard --logdir /tmp/tflearn_logs
Then open your browser on the URL http://localhost:6006. Save images of the accuracy, loss and graph diagrams generated by tensorboard while the LSTM code is running.

3.2 MLP Baseline

Consider a multi-layer perceptron baseline. An example of such an architecture for document classification is available in Keras: Reuters document classification in Keras. (Keras is a library very similar in structure to TFLearn.)

Port the Keras code to TFLearn and adapt it to the IMDB dataset. Compare the results you obtain with those of the LSTM model.
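A minimal TFLearn MLP skeleton over bag-of-words input vectors, as a starting point for the port; the layer sizes, the 10,000-dimensional input (following the Keras Reuters example) and the random toy data are only illustrative - for the assignment, build the matrices from the IMDB word-id sequences:

import numpy as np
import tflearn
from tflearn.data_utils import to_categorical

# Toy stand-ins for the real data: binary bag-of-words rows and one-hot labels.
trainX = np.random.randint(0, 2, size=(200, 10000)).astype(np.float32)
trainY = to_categorical(np.random.randint(0, 2, size=200), 2)

# MLP: one hidden ReLU layer with dropout, softmax output over 2 classes.
net = tflearn.input_data(shape=[None, 10000])
net = tflearn.fully_connected(net, 512, activation='relu')
net = tflearn.dropout(net, 0.5)                 # keep probability 0.5
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net, optimizer='adam', loss='categorical_crossentropy')

model = tflearn.DNN(net, tensorboard_verbose=0)
model.fit(trainX, trainY, n_epoch=1, batch_size=32, show_metric=True)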

3.3 Reuters Classification in TFLearn

Adapt the TFLearn code of lstm.py to work over the Reuters dataset packaged in the Keras library. You can use the reuters.py utility from Keras to load a pre-formatted version of the Reuters dataset that returns the data as prepared vectors.

Note that the IMDB dataset loader provided by TFLearn does not provide the word index dictionary - so that it is not possible to map new text to vectors and vectors to text. In contrast, the Reuters loader gives access to this dictionary - see function get_word_index() in reuters.py.

3.3.1 Write a function encode_doc(string) which, given a document, returns a vector encoding it with the Reuters word index (use the NLTK tokenizer). Also write the inverse function decode_doc(ids) which, given a list of word IDs, returns the list of words that appear in the document.
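A possible sketch of the two functions; the reserved id for unknown words is an assumption here - check how reuters.load_data shifts indices (start_char, oov_char, index_from) to stay consistent with the training vectors:

from nltk.tokenize import word_tokenize
from keras.datasets import reuters

word_index = reuters.get_word_index()                # word -> id
index_word = {i: w for w, i in word_index.items()}   # id -> word
UNKNOWN = 2   # assumed id for out-of-vocabulary words

def encode_doc(string):
    """Encode a raw document as a list of word ids from the Reuters index."""
    tokens = word_tokenize(string.lower())
    return [word_index.get(tok, UNKNOWN) for tok in tokens]

def decode_doc(ids):
    """Inverse mapping: word ids back to words (unknown ids become '<UNK>')."""
    return [index_word.get(i, '<UNK>') for i in ids]

print(decode_doc(encode_doc("Oil prices rose sharply on Monday.")))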

3.3.2 Write the TFLearn version to run over the Reuters dataset. Train the model. Report the Accuracy / Loss graphs from Tensorboard.

3.3.3 Write a function classify_doc(string) which, given a document, applies the LSTM model to it and returns the predicted class.
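A sketch of classify_doc, assuming `model` is the trained TFLearn DNN from 3.3.2 (in scope), encode_doc is the function from 3.3.1, and the sequence length matches the one used at training time (100, as in lstm.py):

import numpy as np
from tflearn.data_utils import pad_sequences

MAXLEN = 100   # must match the sequence length used when training the LSTM

def classify_doc(string):
    """Apply the trained LSTM model to a raw document and return the
    index of the predicted class."""
    ids = encode_doc(string)
    X = pad_sequences([ids], maxlen=MAXLEN, value=0.)
    probs = model.predict(X)          # one row of class probabilities
    return int(np.argmax(probs[0]))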

3.3.4 Explore the Reuters dataset: report the number of classes, the number of documents per class, and the size of documents (number of words per document) - use Pandas.

3.3.5 Use pre-trained word2vec embeddings: in the models above, the word embedding layer is trained over the dataset. In this question, we propose to replace this trained layer with a pre-trained embedding layer loaded from Word2vec.

See TFLearn Issue 35 for the general approach to using word2vec vectors in TFLearn; more details in TFLearn Issue 89.
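The general recipe from those issues is to name the embedding layer, build the model, and then overwrite the layer's weight matrix with the pre-trained vectors. A hedged sketch follows; the sizes are only illustrative, and the random embedding_matrix stands in for a matrix you would build by looking up each word of the Reuters index in the word2vec model:

import numpy as np
import tflearn
from tflearn.variables import get_layer_variables_by_name

vocab_size, emb_dim, n_classes, maxlen = 30000, 300, 46, 100   # illustrative sizes

net = tflearn.input_data([None, maxlen])
net = tflearn.embedding(net, input_dim=vocab_size, output_dim=emb_dim,
                        name='EmbeddingLayer')
net = tflearn.lstm(net, 128, dropout=0.8)
net = tflearn.fully_connected(net, n_classes, activation='softmax')
net = tflearn.regression(net, optimizer='adam', loss='categorical_crossentropy')
model = tflearn.DNN(net)

# Placeholder for the (vocab_size x emb_dim) matrix built from word2vec
# (rows for words missing from word2vec can stay randomly initialized).
embedding_matrix = np.random.uniform(-0.05, 0.05, (vocab_size, emb_dim))

# Overwrite the randomly initialized embedding weights with the pre-trained ones.
emb_vars = get_layer_variables_by_name('EmbeddingLayer')
model.set_weights(emb_vars[0], embedding_matrix)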

Train the model with the new architecture and report on accuracy/loss (from Tensorboard). The following articles can help you:

  1. Text classification with Gensim and Keras.
  2. Sequence classification with LSTM in Keras.



Last modified 15 Jan, 2017