Assignment 2

Due: Tue 03 Jan 2018 Midnight

Natural Language Processing - Fall 2018 Michael Elhadad

This assignment covers document classification, word embeddings and named entity recognition. The objectives are:

Experiment with and evaluate classifiers for the tasks of named entity recognition and question classification.
Use pre-trained word embeddings and measure whether they help for the task of NER and Question Classification.

Make sure you have installed scikit-learn to work on this assignment (it comes with Anaconda).

Submit your solution by email in the form of an iPython ipynb file.

Q1. Question Classification

Consider the dataset on Question Classification available here.

Q1.1. Describe the dataset qualitatively

Read the article introducing this dataset: Xin Li, Dan Roth, Learning Question Classifiers. COLING'02.

Write a half-page summary of the paper, focusing on the dataset description (more than on the description of the classifier introduced in the paper).

Q1.2. Dataset Exploration

The labels used to classify the questions are organized in two levels:

6 coarse classes (ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION and NUMERIC VALUE)
50 fine classes (see Table 1 of the article above)

The definition of the question labels is provided here.

Provide a quantitative description of the dataset:

Distribution of the question labels (number / percentage) - both for coarse and fine labels.
Distribution of the number of tokens per question - overall and per label.
Vocabulary size and number of tokens overall and per label.
Top 10 most frequent words overall and per label
Number of words occurring 1,2,3,4 and 5 times

For this type of exploration, the pandas library is extremely convenient. In particular, explore the function dataframe.describe().
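As a starting point, here is a minimal sketch of this kind of exploration. It assumes each line of the data file has the form `COARSE:fine question text`; the four sample lines and the column names are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample lines in the assumed "COARSE:fine question text" format.
lines = [
    "HUM:ind Who wrote Hamlet ?",
    "LOC:city What is the capital of France ?",
    "NUM:date When did the war end ?",
    "HUM:ind Who discovered penicillin ?",
]

rows = []
for line in lines:
    label, question = line.split(" ", 1)
    coarse, fine = label.split(":")
    rows.append({"coarse": coarse, "fine": fine, "question": question})

df = pd.DataFrame(rows)
# Whitespace tokenization is an assumption; a real tokenizer may differ.
df["n_tokens"] = df["question"].str.split().str.len()

# Label distribution: counts and percentages.
coarse_counts = df["coarse"].value_counts()
coarse_pct = df["coarse"].value_counts(normalize=True) * 100

# Token-count statistics overall and per coarse label.
overall_stats = df["n_tokens"].describe()
per_label_stats = df.groupby("coarse")["n_tokens"].describe()

# Word frequencies over the whole sample (lower-cased tokens).
word_counts = df["question"].str.lower().str.split().explode().value_counts()

print(coarse_counts)
print(per_label_stats)
print(word_counts.head(10))
```

The same `groupby` pattern applies to the fine labels, and `word_counts.value_counts()` gives the number of words occurring once, twice, and so on.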

Q1.3 Baseline classifier

Implement a baseline classifier for the 6 coarse labels using the heuristics described in the paper (of the form – If a query starts with Who or Whom: type Human).

Report on the accuracy, precision, recall, and F1 measure for all the coarse labels, and provide the confusion matrix for the 6 coarse labels.

Analyze the errors by listing types of errors (false positives and false negatives for each of the 6 labels).
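The shape of such a baseline can be sketched as follows. Only the Who/Whom rule comes from the text above; every other rule, the default label and the toy questions are hypothetical placeholders that you should replace with the heuristics from the paper.

```python
from collections import Counter

# Ordered rules: the first matching prefix wins.
# Only the who/whom rule is from the assignment text; the rest are assumptions.
RULES = [
    (("who", "whom"), "HUMAN"),
    (("where",), "LOCATION"),
    (("when", "how many", "how much"), "NUMERIC VALUE"),
    (("why", "how"), "DESCRIPTION"),
]
DEFAULT_LABEL = "ENTITY"  # hypothetical fallback when no rule fires


def baseline_predict(question):
    """Return the label of the first rule whose prefix starts the question."""
    text = question.lower()
    for prefixes, label in RULES:
        if any(text == p or text.startswith(p + " ") for p in prefixes):
            return label
    return DEFAULT_LABEL


def confusion_counts(gold, predicted):
    """Count (gold label, predicted label) pairs."""
    return Counter(zip(gold, predicted))


questions = [
    "Who wrote Hamlet ?",
    "Where is Paris ?",
    "How many moons does Mars have ?",
    "How do birds fly ?",
]
gold = ["HUMAN", "LOCATION", "NUMERIC VALUE", "DESCRIPTION"]
predicted = [baseline_predict(q) for q in questions]
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(predicted, accuracy)
print(confusion_counts(gold, predicted))
```

For the full report, `sklearn.metrics.classification_report` and `sklearn.metrics.confusion_matrix` compute the per-label measures directly from the `gold` and `predicted` lists.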

Q1.4 Features-based classifier

Implement a feature-based classifier for the 6 coarse labels using the types of features described in the paper Section 3.2: words, POS tags, NER tags.

You could use the POS tagger we discussed in class with the Universal tagset.

Alternatively, I recommend you use the spacy 2.0 library to perform pre-processing of the questions - including POS tagging and Named Entity Recognition and Noun Chunks detection. Spacy comes with excellent pre-trained models for English and other languages. Installing Spacy requires the following steps (see spacy documentation):

// This installs the Spacy library (13MB)
% pip install spacy
// This downloads pre-trained models for POS tagging / NER / Noun chunks in English (34MB)
% python -m spacy download en_core_web_sm
% python
> import spacy
> nlp = spacy.load('en_core_web_sm')
> doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
> doc.ents
(Apple, U.K., $1 billion)
> doc.ents[0].label_
'ORG'

Noun chunks are groups of words which correspond to a single nominal phrase. They are detected by Spacy and accessible as doc.noun_chunks. For each chunk, the property chunk.root returns the root word of the chunk, which is in general the central noun in the chunk. For example:

> doc = nlp('Apple announced a new model yesterday.')
> chunks = list(doc.noun_chunks)
> chunks
[Apple, a new model]
> chunks[1].root.text
'model'

You will note that the paper does not explicitly indicate how to encode the features it lists and is not precise about the features named `related words` (words which are usually associated with a specific type of questions). For example:

Word features can be encoded in different ways: noise words filtered or not, with or without lemmatization, with or without case normalization (all lower-case).
POS features can be encoded in different ways: as a bag of POS-tags, or associated with the word in a bag-of-tagged words such as 'Apple/PROPN'
Chunks can be encoded as a bag of chunk-roots
Related words can be "learned" from the training dataset by detecting words which have a high chi-square value with each class. Read the documentation of sklearn.feature_selection.chi2 and the Classification of text documents using sparse features example for a discussion of how such words can be computed efficiently with scikit-learn.
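One possible way to compute such related words is sketched below on a made-up four-question sample. Since chi-square measures dependence in either direction, the sketch also filters out words that occur less often in the class than expected; this filter and the helper name `related_words` are my own assumptions, not something prescribed by the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Toy data for illustration only.
questions = [
    "who wrote hamlet",
    "who painted guernica",
    "where is paris",
    "where is the nile",
]
labels = ["HUMAN", "HUMAN", "LOCATION", "LOCATION"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(questions)
# Vocabulary ordered by column index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


def related_words(target_label, top_k=2):
    """Rank words by chi-square score for one label against all the others."""
    y = np.array([int(label == target_label) for label in labels])
    scores, _ = chi2(X, y)
    scores = np.nan_to_num(scores)
    # Keep only words seen more often in the class than expected by chance.
    in_class = np.asarray(X[np.flatnonzero(y)].sum(axis=0)).ravel()
    total = np.asarray(X.sum(axis=0)).ravel()
    scores = np.where(in_class > total * y.mean(), scores, 0.0)
    top = np.argsort(scores)[::-1][:top_k]
    return [vocab[i] for i in top]


print(related_words("HUMAN", top_k=1))
print(related_words("LOCATION", top_k=2))
```

The selected words can then be added as an extra bag-of-related-words feature next to the plain word features.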

1.4.1 Discuss a priori what are good ways to encode these features (lemma, POS, NER, chunk, related words) - provide examples that explain your intuition.

1.4.2 Train a scikit-learn based classifier for:

Coarse labels
All labels as a flat classifier
A hierarchical classifier which predicts the fine-grained labels given the coarse label as proposed in the paper. Implement this as a two-step procedure - run the coarse-label classifier, then a second level classifier which takes the prediction of the first classifier as input.
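One possible reading of the two-step procedure is sketched below: the second classifier receives the coarse label as an additional feature. The toy questions, the label strings and the helper names are hypothetical, and training the second step on gold coarse labels (rather than predicted ones) is a design choice you may want to revisit.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def word_features(question):
    """Bag of lower-cased words as a feature dictionary."""
    return {"w=" + tok.lower(): 1 for tok in question.split()}


def fine_features(question, coarse_label):
    """Word features plus the coarse label as one extra feature."""
    feats = word_features(question)
    feats["coarse=" + coarse_label] = 1
    return feats


# Toy training data: (question, coarse label, fine label).
train = [
    ("Who wrote Hamlet ?", "HUM", "HUM:ind"),
    ("Which company makes the iPhone ?", "HUM", "HUM:gr"),
    ("Where is Paris ?", "LOC", "LOC:other"),
    ("Which city hosts the Louvre ?", "LOC", "LOC:city"),
]
questions = [q for q, _, _ in train]
coarse = [c for _, c, _ in train]
fine = [f for _, _, f in train]

# Step 1: coarse classifier on word features only.
coarse_clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
coarse_clf.fit([word_features(q) for q in questions], coarse)

# Step 2: fine classifier, trained here with the gold coarse label as input.
fine_clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
fine_clf.fit([fine_features(q, c) for q, c in zip(questions, coarse)], fine)


def predict_fine(question):
    """Run the coarse classifier, then feed its prediction to the fine one."""
    coarse_label = coarse_clf.predict([word_features(question)])[0]
    return fine_clf.predict([fine_features(question, coarse_label)])[0]


print(predict_fine("Who wrote Hamlet ?"))
```

An alternative design trains one fine-grained classifier per coarse class and dispatches on the coarse prediction; comparing the two is a reasonable experiment.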

For each of the three classifiers, report:

Accuracy, Precision, Recall, F-measure per label and confusion matrix.
Provide examples of prediction errors (positive and negative).
Discuss the most ambiguous label pairs (identified in the confusion matrix) and discuss whether the features you have used provide sufficient information to disambiguate the cases.

You should experiment with different classifiers from those illustrated in the Classification of text documents using sparse features example.

1.4.3 Optionally, analyze which of the features are most helpful for this task among lemma, POS, NER, Chunks and Related Words.

1.4.4 Note that the dataset is quite small (5,500 questions in the training dataset for 50 labels). How would you determine whether your model overfits on this data?

Q1.5 Deep Learning classifier (Optional)

Consider the following model to train a classifier on this dataset using PyTorch and a Convolutional Network model. This follows the model introduced in Convolutional Neural Networks for Sentence Classification, Yoon Kim, EMNLP 2014.

1.5.1 Read and summarize the paper in about 500 words.

1.5.2 Explain why Convolutional Networks are an appropriate technique for text classification.

Q2. Sequence Labelling for Named Entity Recognition

Named Entity Recognition

The task of Named Entity Recognition (NER) involves the recognition of names of persons, locations, organizations, dates in free text. For example, the following sentence is tagged with sub-sequences indicating PER (for persons), LOC (for location) and ORG (for organization):

Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.

[PER Wolff ] , currently a journalist in [LOC Argentina ] , played with [PER Del Bosque ]
in the final years of the seventies in [ORG Real Madrid ] .

NER involves 2 sub-tasks: identifying the boundaries of such expressions (the open and close brackets) and labelling the expressions (with tags such as PER, LOC or ORG). This sequence labelling task is mapped to a per-token classification task, using the BIO encoding of the data:

        Wolff B-PER
            , O
    currently O
            a O
   journalist O
           in O
    Argentina B-LOC
            , O
       played O
         with O
          Del B-PER
       Bosque I-PER
           in O
          the O
        final O
        years O
           of O
          the O
    seventies O
           in O
         Real B-ORG
       Madrid I-ORG
            . O
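The mapping between bracketed expressions and BIO tags can be sketched as follows. The span representation (start, end, label) with an exclusive end index is an assumption made for this illustration, and the decoder treats any I- tag following an open span as its continuation, which is a simplification.

```python
def spans_to_bio(tokens, spans):
    """Convert labelled spans to BIO tags.

    spans holds (start, end, label) triples with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def bio_to_spans(tags):
    """Recover (start, end, label) spans from a list of BIO tags."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = ["Del", "Bosque", "in", "Real", "Madrid", "."]
spans = [(0, 2, "PER"), (3, 5, "ORG")]
tags = spans_to_bio(tokens, spans)
print(list(zip(tokens, tags)))
print(bio_to_spans(tags))
```

Decoding the predicted tags back to spans is useful when you want to evaluate whole entities rather than individual tokens.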

Dataset

The dataset we will use for this question is derived from the CoNLL 2002 shared task - which is about NER in Spanish and Dutch. The dataset is included in the NLTK distribution. Explanations on the dataset are provided in the CoNLL 2002 page. In addition, the English version of the dataset is available here.

To access the data in Python, do:

from nltk.corpus import conll2002

etr = conll2002.chunked_sents('esp.train') # In Spanish
eta = conll2002.chunked_sents('esp.testa') # In Spanish
etb = conll2002.chunked_sents('esp.testb') # In Spanish

dtr = conll2002.chunked_sents('ned.train') # In Dutch
dta = conll2002.chunked_sents('ned.testa') # In Dutch
dtb = conll2002.chunked_sents('ned.testb') # In Dutch

The data consists of three files per language (Spanish and Dutch): one training file and two test files testa and testb. The first test file is to be used in the development phase for finding good parameters for the learning system. The second test file will be used for the final evaluation.

Q2.1 Features

Your task consists of:

Choose good features for encoding the problem.
Encode your training dataset.
Run a classifier over the training dataset.
Train and test the model.
Perform error analysis and fine tune model parameters on the testa part of the datasets.
Perform evaluation over the testb part of the dataset, reporting on accuracy, per label precision, per label recall and per label F-measure, and confusion matrix.

Here is a list of features that have been found appropriate for NER in previous work:

The word form (the string as it appears in the sentence)
The POS of the word (which is provided in the dataset)
ORT - a feature that captures the orthographic (letter) structure of the word. It can have any of the following values: number, contains-digit, contains-hyphen, capitalized, all-capitals, URL, punctuation, regular.
prefix1: first letter of the word
prefix2: first two letters of the word
prefix3: first three letters of the word
suffix1: last letter of the word
suffix2: last two letters of the word
suffix3: last three letters of the word

For example, given the following toy training data, the encoding of the features would be:

        Wolff NP  B-PER
            , ,   O
    currently RB  O
            a AT  O
   journalist NN  O
           in IN  O
    Argentina NP  B-LOC
            , ,   O
       played VBD O
         with IN  O
          Del NP  B-PER
       Bosque NP  I-PER
           in IN  O
          the AT  O
        final JJ  O
        years NNS O
           of IN  O
          the AT  O
    seventies NNS O
           in IN  O
         Real NP  B-ORG
       Madrid NP  I-ORG
            . .   O

Classes
1 B-PER
2 I-PER
3 B-LOC
4 I-LOC
5 B-ORG
6 I-ORG
7 O

Feature WORD-FORM:
1 Wolff
2 ,
3 currently
4 a
5 journalist
6 in
7 Argentina
8 played
9 with
10 Del
11 Bosque
12 the
13 final
14 years
15 of
16 seventies
17 Real
18 Madrid
19 .

Feature POS
20 NP
21 ,
22 RB
23 AT
24 NN
25 IN
26 VBD
27 JJ
28 NNS
29 .

Feature ORT
30 number
31 contains-digit
32 contains-hyphen
33 capitalized
34 all-capitals
35 URL
36 punctuation
37 regular

Feature Prefix1
38 W
39 ,
40 c
41 a
42 j
43 i
44 A
45 p
46 w
47 D
48 B
49 t
50 f
51 y
52 o
53 s
54 R
55 M
56 .

Given this encoding, we can compute the vector representing the first word "Wolff NP B-PER" as:

# Class: B-PER=1
# Word-form: Wolff=1
# POS: NP=20
# ORT: Capitalized=33
# prefix1: W=38
1 1:1 20:1 33:1 38:1

When you encode the test dataset, some of the word-forms will be unknown (not seen in the training dataset). You should, therefore, plan for a special "unknown" value for each feature where unseen values can occur.

Instead of writing the code as explained above, use the Scikit-learn vectorizer and pipeline library. Learn how to use the DictVectorizer for this specific task. Hashing vs. DictVectorizer also provides useful background. You can also refer to the example on POS tagging we reviewed in class.
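A minimal sketch of the dictionary-based encoding is shown below. The priority order among the ORT values and the regular expressions are assumptions (the list above does not say which value wins when several apply), and the helper names are hypothetical.

```python
import re

from sklearn.feature_extraction import DictVectorizer


def ort(word):
    """Orthographic class of a word, checked in an assumed priority order."""
    if re.fullmatch(r"\d+([.,]\d+)*", word):
        return "number"
    if re.match(r"(https?://|www\.)", word):
        return "URL"
    if any(ch.isdigit() for ch in word):
        return "contains-digit"
    if "-" in word and len(word) > 1:
        return "contains-hyphen"
    if not any(ch.isalnum() for ch in word):
        return "punctuation"
    if word.isupper():
        return "all-capitals"
    if word[0].isupper():
        return "capitalized"
    return "regular"


def token_features(word, pos):
    """Features of a single token as a dictionary of string values."""
    return {
        "word": word,
        "pos": pos,
        "ort": ort(word),
        "prefix1": word[:1], "prefix2": word[:2], "prefix3": word[:3],
        "suffix1": word[-1:], "suffix2": word[-2:], "suffix3": word[-3:],
    }


train = [("Wolff", "NP"), (",", ","), ("currently", "RB")]
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform([token_features(w, p) for w, p in train])

# Feature values never seen during fit are silently ignored by transform().
X_test = vectorizer.transform([token_features("Madrid", "NP")])
print(X_train.shape, X_test.nnz)
```

With this encoding, an unseen word-form simply produces no active "word=..." column, so only the features shared with the training data (here the POS and ORT values) remain active for "Madrid".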

You can start from the following example notebook CoNLL 2002 Classification with CRF. You do not need to install Python-CRFSuite - just take this notebook as a starting point to explore the dataset and ways to encode features.

We implement here a version of NER which is based on "greedy tagging" (that is, without optimizing the sequence of tags as we would obtain by training an HMM or CRF model).

Train the model using a logistic regression classifier and experiment with better features - looking at the tags of the previous word, the previous word and the following word (add padding words in the vectorizer).
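The context features and the greedy left-to-right tagging loop can be sketched as follows. The padding token, the `<START>` tag and the `predict_one` callable are hypothetical names; in your solution `predict_one` would wrap the trained pipeline, for example `lambda feats: pipeline.predict([feats])[0]`.

```python
PAD = ("<PAD>", "<PAD>")  # hypothetical padding word and POS


def window_features(sentence, i, previous_tag):
    """Features for token i of a sentence given as (word, pos) pairs."""
    padded = [PAD] + list(sentence) + [PAD]
    j = i + 1  # position of token i inside the padded list
    (prev_w, prev_p), (word, pos), (next_w, next_p) = padded[j - 1 : j + 2]
    return {
        "word": word, "pos": pos,
        "prev_word": prev_w, "prev_pos": prev_p,
        "next_word": next_w, "next_pos": next_p,
        "prev_tag": previous_tag,
    }


def greedy_tag(sentence, predict_one):
    """Tag left to right, feeding each predicted tag into the next decision."""
    tags, previous_tag = [], "<START>"
    for i in range(len(sentence)):
        previous_tag = predict_one(window_features(sentence, i, previous_tag))
        tags.append(previous_tag)
    return tags


def toy_predict(feats):
    """Stand-in for a trained classifier, used only to exercise the loop."""
    if feats["pos"] != "NP":
        return "O"
    return "I-PER" if feats["prev_tag"] in ("B-PER", "I-PER") else "B-PER"


sentence = [("Del", "NP"), ("Bosque", "NP"), ("played", "VBD")]
print(greedy_tag(sentence, toy_predict))
```

At training time the `prev_tag` feature is taken from the gold tags; at test time it comes from the model's own previous prediction, so early mistakes can propagate.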

The following notebooks all provide very good starting points - choose the one you prefer:

Sungdong Kim Window Classifier for NER, a PyTorch implementation of NER following the Stanford course CS224n, Fall 2017. Uses the same NLTK CoNLL 2002 dataset and includes data preparation.
Zac Stewart Document Classification
Radim Rehurek Data Science in Python

Q2.2 Using Pre-trained Word Embeddings (Optional)

One possible way to improve a greedy tagger for NER is to use Word Embeddings as features. A convenient package to manipulate Word2Vec word embeddings is provided in the gensim package by Radim Rehurek. To install it, use:

# conda install gensim

You must also download a pre-trained word2vec word embedding model from the Word2Vec site. The largest model is the GoogleNews-vectors-negative300.bin (1.5GB compressed file). To load it in Python use the following code (this requires about 8GB of RAM on your machine to work properly):

from gensim.models import KeyedVectors
model_path = "GoogleNews-vectors-negative300.bin"
model = KeyedVectors.load_word2vec_format(model_path, binary=True)

# A dense vector of 300 dimensions representing the word 'queen'
print(model["queen"])

stringA = 'woman'
stringB = 'king'
stringC = 'man'
print(model.most_similar(positive=[stringA, stringB], negative=[stringC], topn=10))

Your task:

Add the word2vec embeddings as dense vectors to the features of your NER classifier for each word feature (current word, previous word, next word).
Retrain the English model and report on performance.
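One way to combine dense vectors with the sparse features is sketched below. The dictionary `toy_embeddings` is a hypothetical stand-in for the loaded word2vec model (with 4 dimensions instead of 300), and mapping out-of-vocabulary words to a zero vector is an assumption you may want to replace.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction import DictVectorizer

DIM = 4  # toy dimension; the GoogleNews vectors have 300
# Hypothetical stand-in for a loaded embedding model (word -> vector).
toy_embeddings = {
    "madrid": np.array([0.1, 0.2, 0.3, 0.4]),
    "in": np.array([0.0, -0.1, 0.0, 0.1]),
}


def embed(word):
    """Vector for a word, or zeros when the word is out of vocabulary."""
    return toy_embeddings.get(word.lower(), np.zeros(DIM))


def dense_block(windows):
    """Concatenate previous, current and next word vectors for each window."""
    return np.array([np.concatenate([embed(w) for w in window]) for window in windows])


# Each window is (previous word, current word, next word).
windows = [("in", "Madrid", "."), ("<PAD>", "in", "Madrid")]
sparse_feats = DictVectorizer().fit_transform([{"word": w[1]} for w in windows])

# Stack the sparse one-hot columns and the dense embedding columns side by side.
X = hstack([sparse_feats, csr_matrix(dense_block(windows))]).tocsr()
print(X.shape)
```

The resulting matrix can be passed to the same logistic regression classifier as before, which makes it easy to compare performance with and without the embedding columns.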

Q2.3 Comparing Models

2.3.1 Execute the model in Window Classifier for NER in PyTorch (it takes about 15 minutes to train on a CPU).

2.3.2 Compare the model you designed in 2.1 with the PyTorch model of 2.3.1. Identify examples correctly tagged by one model but not the other. Identify a way to measure agreement between the two models. Formulate hypotheses explaining the differences you observe.
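As one possible agreement measure, the sketch below computes Cohen's kappa over token-level tags, together with the positions where exactly one model is right. The tag sequences are invented for illustration, and kappa is only one candidate among several measures you could argue for.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Chance-corrected agreement between two taggers on the same tokens.

    Undefined (division by zero) when both taggers always emit one same tag.
    """
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    expected = sum(count_a[t] * count_b[t] for t in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def only_first_correct(tags_a, tags_b, gold):
    """Indices where the first tagger matches gold and the second does not."""
    return [i for i, (a, b, g) in enumerate(zip(tags_a, tags_b, gold)) if a == g != b]


# Invented outputs of two models and the gold tags for six tokens.
tags_a = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
tags_b = ["B-PER", "O", "O", "O", "B-LOC", "O"]
gold = ["B-PER", "I-PER", "O", "O", "B-ORG", "O"]

print(cohen_kappa(tags_a, tags_b))
print(only_first_correct(tags_a, tags_b, gold))
print(only_first_correct(tags_b, tags_a, gold))
```

scikit-learn provides the same statistic as `sklearn.metrics.cohen_kappa_score`. Because most tokens are tagged O, raw agreement is inflated, which is the motivation for a chance-corrected measure here.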

Last modified 17 Dec, 2017