{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1>Assignment 2</h1>\n",
    "\n",
    "<h2>Due: Mon 04 Jan 2021 Midnight</h2>\n",
    "<a href='http://www.cs.bgu.ac.il/~elhadad/nlp21.html'>Natural Language Processing - Fall 2021 Michael Elhadad</a>\n",
    "<p/>\n",
    "This assignment covers the topic of document classification, sequence classification, named entity recognition and word embeddings.\n",
    "The objective is:\n",
    "<ul>\n",
    "<li>Apply feature-based supervised machine learning methods for document and token classification.</li>\n",
    "<li>Investigate algorithms for sequence classification: HMMs and CRFs.</li>\n",
    "<li>Learn the specific tasks of word classification, named entity recognition and document classification.\n",
    "<li>Use pre-trained word embeddings and measure whether they help for the task of NER.\n",
    "</ul>\n",
    "\n",
    "<p/>\n",
    "Submit your solution by email in the form of an iPython ipynb file.\n",
    "<p/>\n",
    "Do <b>not</b> attach the data in your submission.  Your notebook should refer to the data folder as \"../data\".\n",
    "\n",
    "<p>\n",
    "</p><hr>\n",
    "<h2>Content</h2>\n",
    "<ul>\n",
    "    <li><a href=\"#questions\">Q1. Questions Classification</a>\n",
    "      <ul>\n",
    "      <li><a href=\"#q1.1\">Q1.1. Describe the dataset qualitatively</a></li>\n",
    "      <li><a href=\"#q1.2\">Q1.2. Dataset Reader</a></li>\n",
    "      <li><a href=\"#q1.3\">Q1.3. Dataset Exploration</a></li>\n",
    "      <li><a href=\"#q1.4\">Q1.4. Classifier Interface, Evaluation Metrics, Confusion Matrix</a></li>\n",
    "      <li><a href=\"#q1.5\">Q1.5 Baseline Classifier</a></li>\n",
    "      <li><a href=\"#q1.6\">Q1.6 Features-based Classifier</a></li>\n",
    "      <li><a href=\"#q1.7\">Q1.7. Optional</a></li>\n",
    "      </ul>\n",
    "    </li>\n",
    "    <li><a href=\"#q2\">Q2. Document Classification</a>\n",
    "      <ul>\n",
    "        <li><a href=\"#q2.1\">Q2.1 Reuters Dataset</a>\n",
    "          <ul>\n",
    "          <li><a href=\"#q2.1.1\">Q2.1.1 Descriptive Statistics</a></li>\n",
    "          <li><a href=\"#q2.1.2\">Q2.1.2 Partial-fit classifiers</a></li>\n",
    "          <li><a href=\"#q2.1.3\">Q2.1.3 Hashing Vectorizer</a></li>\n",
    "          </ul>\n",
    "        </li>\n",
    "        <li><a href=\"#q2.2\">Q2.2. BBC News Dataset</a>\n",
    "          <ul>\n",
    "            <li><a href=\"#q2.2.1\">Q2.2.1 Descriptive Statistics</a></li>\n",
    "            <li><a href=\"#q2.2.2\">Q2.2.2 Feature Extraction</a></li>\n",
    "            <li><a href=\"#q2.2.3\">Q2.2.3 Model Training and Evaluation</a></li>\n",
    "            </ul>\n",
    "        </li>\n",
    "      </ul>\n",
    "    </li>\n",
    "    <li><a href=\"#q3\">Q3. Named Entity Recognition</a>\n",
    "      <ul>\n",
    "        <li><a href=\"#q3.1\">Q3.1 Features</a>\n",
    "          <ul>\n",
    "            <li><a href=\"#q3.1.1\">Q3.1.1 Feature Extraction</a></li>\n",
    "            <li><a href=\"#q3.1.2\">Q3.1.2 Model Training</a></li>\n",
    "            <li><a href=\"#q3.1.3\">Q3.1.3 Greedy Tagging vs. Sequence Tagging</a></li>\n",
    "          </ul>\n",
    "        </li>\n",
    "        <li><a href=\"#q3.2\">Q3.2 Using Word Embeddings</a></li>\n",
    "      </ul>\n",
    "</ul>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr/>\n",
    "<a name=\"questions\"></a>\n",
    "<h2>Q1. Questions Classification</h2>\n",
    "\n",
    "Consider the dataset on Question Classification available <a href=\"https://cogcomp.seas.upenn.edu/Data/QA/QC/\">here</a>.\n",
    "\n",
    "<a name=\"q1.1\"></a>\n",
    "<h3>Q1.1. Describe the dataset qualitatively</h3>\n",
    "\n",
    "Read the article introducing this dataset: \n",
    "<a href=\"http://www.aclweb.org/anthology/C02-1150\">in Li, Dan Roth, Learning Question Classifiers. COLING'02</a>.\n",
    "<p/>\n",
    "Write a half to one-page summary of the paper, focusing on the dataset description (more than on the description of the classifier introduced in the paper).\n",
    "Describe the exact task, the labels used, and provide the motivation for this task.\n",
    "Provide examples for the 6 main categories.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q1.2\"></a>\n",
    "<h3>Q1.2. Dataset Reader</h3>\n",
    "\n",
    "Implement a reader to parse the dataset into a data structure that will be easily used for scikit-learn processing.\n",
    "\n",
    "Adapt the code we used in HW1:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import codecs\n",
    "import math\n",
    "import random\n",
    "import string\n",
    "import time\n",
    "import numpy as np\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "'''\n",
    "Define different constants for the task of question classification \n",
    "based on the definition of the task.\n",
    "In the question classification case, there are 2 labels per question: coarse and fine.\n",
    "'''\n",
    "coarse_categories = [\"ABBREVIATION\", \"ENTITY\", \"DESCRIPTION\", \"HUMAN\", \"LOCATION\", \"NUMERIC VALUE\"]\n",
    "fine_categories = {}\n",
    "fine_categories[\"ABBREVIATION\"] = [\"abb\", \"exp\"]\n",
    "# @Todo more here...\n",
    "\n",
    "# Build the category_lines dictionary, a list of names per language\n",
    "coarse_category_lines = {}\n",
    "all_categories = []\n",
    "\n",
    "# @Todo: Define the way the lines should be parsed\n",
    "def parseLine(line):\n",
    "  return line\n",
    "\n",
    "# @Todo: Read a file and split into lines - create the appropriate data structure\n",
    "def readLines(filename):\n",
    "  lines = codecs.open(filename, \"r\",encoding='utf-8', errors='ignore').read().strip().split('\\n')\n",
    "  return [parseLine(line) for line in lines]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q1.3\"></a>\n",
    "<h3>Q1.3. Dataset Exploration</h3>\n",
    "\n",
    "The labels used to classify the questions are organized in two levels:\n",
    "<ul>\n",
    "<li>6 coarse classes (ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION and NUMERIC VALUE)\n",
    "<li>50 fine classes (see Table 1 of the article above)\n",
    "</ul>\n",
    "\n",
    "The definition of the question labels is provided <a href=\"https://cogcomp.seas.upenn.edu/Data/QA/QC/definition.html\">here</a>.\n",
    "<p/>\n",
    "\n",
    "Provide a quantitative description of the dataset:\n",
    "<ol>\n",
    "<li>Distribution of the question labels (number / percentage) - separately for coarse and fine labels.\n",
    "<li>Distribution of the number of tokens per question - overall and per label.\n",
    "<li>Vocabulary size and number of tokens overall and per label.\n",
    "<li>Top 20 more frequent words overall and per label\n",
    "<li>Number of words occurring 1,2,3,4 and 5 times\n",
    "</ol>\n",
    "\n",
    "For this type of exploration, the <a href=\"https://pandas.pydata.org/\">pandas</a> library is extremely convenient.\n",
    "In particular, explore the function dataframe.describe().  You can use other code if you prefer.\n"
   ]
  },
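  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last two statistics in the list above can be sketched with <code>collections.Counter</code>. This is a minimal illustration and not a reference solution: the list <code>toy_questions</code> and the lower-cased whitespace tokenization are assumptions made for the example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "# Hypothetical toy data standing in for the parsed questions\n",
    "toy_questions = [\n",
    "    'Who wrote Hamlet ?',\n",
    "    'Who discovered penicillin ?',\n",
    "    'Where is Madrid ?',\n",
    "]\n",
    "\n",
    "# Assumption: lower-cased whitespace tokenization\n",
    "tokens = [tok.lower() for q in toy_questions for tok in q.split()]\n",
    "word_counts = Counter(tokens)\n",
    "\n",
    "# Most frequent words (use 20 instead of 3 on the real dataset)\n",
    "print(word_counts.most_common(3))\n",
    "\n",
    "# Frequency of frequencies: how many distinct words occur exactly k times\n",
    "freq_of_freq = Counter(word_counts.values())\n",
    "for k in range(1, 6):\n",
    "    print(k, freq_of_freq.get(k, 0))"
   ]
  },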
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q1.4\"></a>\n",
    "<h3>Q1.4. Classifier Interface, Evaluation Metrics, Confusion Matrix</h3>\n",
    "\n",
    "Define the Python interface (functions or class according to your preference) of a question classifier so that the \n",
    "function accuracy_score and classification_report from the sklearn.metrics module can be used.\n",
    "\n",
    "Define a function evaluate_classifier that takes a trained classifier and reports classification results for \n",
    "coarse and fine categories.\n",
    "\n",
    "Define a function confusion_matrix(model) which prints a confusion matrix for the coarse level categories in the same\n",
    "way as in HW1 Question 3.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q1.5\"></a>\n",
    "<h3>Q1.5 Baseline Classifier</h3>\n",
    "\n",
    "Implement a baseline classifier for the 6 coarse labels using the heuristics described in the paper in Section 2.1\n",
    "(of the form – If a query starts with Who or Whom: type Human).\n",
    "<p/>\n",
    "\n",
    "Report on the accuracy, precision, recall, and F1 measure for all the coarse labels, and provide\n",
    "the confusion matrix for the 6 coarse labels.\n",
    "<p/>\n",
    "\n",
    "Analyze the errors by listing types of errors (false positives and false negatives for each of the 6 labels).\n"
   ]
  },
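  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hedged starting point, a rule-based baseline of this kind can be sketched as a function from a question string to a coarse label. Only the Who/Whom rule comes from the text above; the function name <code>baseline_coarse_label</code>, the extra <code>where</code> rule and the fallback label are hypothetical and should be replaced by the heuristics of the paper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def baseline_coarse_label(question):\n",
    "    # Look only at the first token, lower-cased\n",
    "    words = question.strip().split()\n",
    "    first = words[0].lower() if words else ''\n",
    "    # Rule quoted in the assignment text\n",
    "    if first in ('who', 'whom'):\n",
    "        return 'HUMAN'\n",
    "    # Hypothetical extra rule, for illustration only\n",
    "    if first == 'where':\n",
    "        return 'LOCATION'\n",
    "    # Hypothetical fallback label when no rule fires\n",
    "    return 'ENTITY'\n",
    "\n",
    "for q in ['Who wrote Hamlet ?', 'Where is Madrid ?', 'What is a fuel cell ?']:\n",
    "    print(q, '->', baseline_coarse_label(q))"
   ]
  },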
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q1.6\"></a>\n",
    "<h3>Q1.6 Features-based Classifier</h3>\n",
    "\n",
    "Implement a feature-based classifier for the 6 coarse labels using the types of features described in the paper Section 3.2: words, POS tags, NER tags.  \n",
    "<p/>\n",
    "\n",
    "Use the <a href=\"https://spacy.io/usage/spacy-101\">spacy</a> library to perform pre-processing of the questions - including POS tagging and Named Entity Recognition and Noun Chunks detection.  Spacy comes with excellent pre-trained models for English and other languages.\n",
    "Installing Spacy requires the following steps (see <a href=\"https://spacy.io/usage/spacy-101#annotations-ner\">spacy documentation</a>):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: spacy in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (2.3.4)\n",
      "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy) (4.49.0)\n",
      "Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy) (1.0.0)\n",
      "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy) (1.0.5)\n",
      "Requirement already satisfied: thinc<7.5.0,>=7.4.1 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy) (7.4.4)\n",
      "Requirement already satisfied: setuptools in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy) (49.6.0.post20201009)\n",
      "Requirement already satisfied: srsly<1.1.0,>=1.0.2 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy) (1.0.5)\n",
      "Requirement already satisfied: plac<1.2.0,>=0.9.6 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy) (0.9.6)\n",
      "Requirement already satisfied: blis<0.8.0,>=0.4.0; python_version >= \"3.6\" in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy) (0.7.4)\n",
      "Requirement already satisfied: requests<3.0.0,>=2.13.0 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy) (2.24.0)\n",
      "Requirement already satisfied: numpy>=1.15.0 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy) (1.19.4)\n",
      "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy) (2.0.5)\n",
      "Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy) (0.8.0)\n",
      "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy) (3.0.5)\n",
      "Requirement already satisfied: chardet<4,>=3.0.2 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy) (3.0.4)\n",
      "Requirement already satisfied: idna<3,>=2.5 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy) (2.10)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy) (2020.12.5)\n",
      "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy) (1.25.11)\n",
      "Requirement already satisfied: en_core_web_sm==2.3.1 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz#egg=en_core_web_sm==2.3.1 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (2.3.1)\n",
      "Requirement already satisfied: spacy<2.4.0,>=2.3.0 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from en_core_web_sm==2.3.1) (2.3.4)\n",
      "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (2.0.5)\n",
      "Requirement already satisfied: thinc<7.5.0,>=7.4.1 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (7.4.4)\n",
      "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (4.49.0)\n",
      "Requirement already satisfied: plac<1.2.0,>=0.9.6 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (0.9.6)\n",
      "Requirement already satisfied: blis<0.8.0,>=0.4.0; python_version >= \"3.6\" in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (0.7.4)\n",
      "Requirement already satisfied: numpy>=1.15.0 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (1.19.4)\n",
      "Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (0.8.0)\n",
      "Requirement already satisfied: requests<3.0.0,>=2.13.0 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (2.24.0)\n",
      "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (1.0.5)\n",
      "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (3.0.5)\n",
      "Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (1.0.0)\n",
      "Requirement already satisfied: setuptools in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (49.6.0.post20201009)\n",
      "Requirement already satisfied: srsly<1.1.0,>=1.0.2 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (1.0.5)\n",
      "Requirement already satisfied: chardet<4,>=3.0.2 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (3.0.4)\n",
      "Requirement already satisfied: idna<3,>=2.5 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (2.10)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (2020.12.5)\n",
      "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.1) (1.25.11)\n",
      "[+] Download and installation successful\n",
      "You can now load the model via spacy.load('en_core_web_sm')\n"
     ]
    }
   ],
   "source": [
    "# This installs the Spacy library (13MB)\n",
    "!pip install spacy\n",
    "# This downloads pre-trained models for POS tagging / NER / Noun chunks in English (34MB)\n",
    "!python -m spacy download en_core_web_sm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(Apple, U.K., $1 billion)\n",
      "ORG\n"
     ]
    }
   ],
   "source": [
    "import spacy\n",
    "nlp = spacy.load('en_core_web_sm')\n",
    "doc = nlp('Apple is looking at buying U.K. startup for $1 billion')\n",
    "print(doc.ents)\n",
    "print(doc.ents[0].label_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Invoking the 'nlp()' function of spacy performs a set of analyses on the text, including: sentence separation, tokenization, lemmatization, parts of speech tagging, \n",
    "Noun-phrase chunking, named entity recognition and syntactic parsing.  Information about these analyses is retrieved using the spacy document properties.\n",
    "\n",
    "As indicated in the paper, we want to extract the following information as features for the task of question classification:\n",
    "<ul>\n",
    "  <li>Tokens</li>\n",
    "  <li>Lemmas</li>\n",
    "  <li>Parts of speech tags</li>\n",
    "  <li>Noun phrase chunks</li>\n",
    "  <li>Named entities</li>\n",
    "</ul>\n",
    "\n",
    "Here are starting points to learn how to extract this information from the nlp analysis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Token: Apple - Lemma: Apple - POS: PROPN\n",
      "Token: is - Lemma: be - POS: AUX\n",
      "Token: looking - Lemma: look - POS: VERB\n",
      "Token: at - Lemma: at - POS: ADP\n",
      "Token: buying - Lemma: buy - POS: VERB\n",
      "Token: U.K. - Lemma: U.K. - POS: PROPN\n",
      "Token: startup - Lemma: startup - POS: NOUN\n",
      "Token: for - Lemma: for - POS: ADP\n",
      "Token: $ - Lemma: $ - POS: SYM\n",
      "Token: 1 - Lemma: 1 - POS: NUM\n",
      "Token: billion - Lemma: billion - POS: NUM\n"
     ]
    }
   ],
   "source": [
    "import spacy\n",
    "nlp = spacy.load('en_core_web_sm')\n",
    "doc = nlp('Apple is looking at buying U.K. startup for $1 billion')\n",
    "  \n",
    "# Token level features retrieved by Spacy: token, lemma, POS\n",
    "for x in doc:   # Each x is a Token\n",
    "    print(f\"Token: {x} - Lemma: {x.lemma_} - POS: {x.pos_}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(Apple, U.K., $1 billion)\n",
      "Apple - 0 - 1 - ORG\n",
      "U.K. - 5 - 6 - GPE\n",
      "$1 billion - 8 - 11 - MONEY\n"
     ]
    }
   ],
   "source": [
    "# Span level features retrieved by Spacy: named entities, start (0-based index), end (index just after the span), category\n",
    "print(doc.ents)\n",
    "for e in doc.ents: \n",
    "    print(f\"{e} - {e.start} - {e.end} - {e.label_}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Apple, U.K. startup]\n",
      "0 - 1 - Apple\n",
      "5 - 7 - startup\n"
     ]
    }
   ],
   "source": [
    "# Span level features retrieved by Spacy: noun chunks\n",
    "print(list(doc.noun_chunks))\n",
    "for c in doc.noun_chunks: \n",
    "    print(f\"{c.start} - {c.end} - {c.root}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The paper does not explicitly indicate how to encode the features it lists and is not precise about the features named `related words` (words which are usually associated with a specific type of questions).  For example:\n",
    "<ol>\n",
    "<li>Word features can be encoded in different ways: noise words filtered or not, with or without lemmatization, with or without case normalization (all lower-case).  \n",
    "<li>POS features can be encoded in different ways: as a bag of POS-tags, or associated with the word in a bag-of-tagged words such as 'Apple/PROPN' \n",
    "<li>Chunks can be encoded as a bag of chunk-roots \n",
    "<li>Examples of \"related words\" per category are provided for a few categories: <a href=\"https://cogcomp.seas.upenn.edu/Data/QA/QC/lists/prof\">profession</a>,\n",
    "    <a href=\"https://cogcomp.seas.upenn.edu/Data/QA/QC/lists/mount\">mountains</a> and <a href=\"https://cogcomp.seas.upenn.edu/Data/QA/QC/lists/food\">food</a>.\n",
    "  You should learn the related words list from the training dataset by detecting words which have a high chi-square\n",
    "value with each category.  Read in <a href=\"http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html\">sklearn.feature_selection.chi2</a> for a discussion of how such words can be efficiently computed using scikit-learn.\n",
    "</li>\n",
    "</ol>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q1.6.1\"></a>\n",
    "<h4>Q1.6.1 Feature Extraction</h4>\n",
    "\n",
    "Discuss a priori what are good ways to encode these features (lemma, POS, NER, chunk, related words) - provide examples that explain your intuition.\n",
    "<p/>\n",
    "Implement a feature extraction function that turns a question into a feature vector appropriate for the scikit-learn classifiers.\n",
    "Adopt the example shown in the scikit-learn documentation: \n",
    "<a href=\"https://scikit-learn.org/stable/modules/feature_extraction.html#loading-features-from-dicts\">loading features from dicts</a>."
   ]
  },
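  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the dict-based approach referenced above, using scikit-learn's <code>DictVectorizer</code>. The feature names (<code>first_word</code>, <code>n_tokens</code>, <code>ent_label</code>) are hypothetical and only illustrate the mechanics: string values are one-hot encoded, numeric values are kept as they are, and features not seen during fitting are ignored at transform time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction import DictVectorizer\n",
    "\n",
    "# Hypothetical feature dicts, one per question\n",
    "features = [\n",
    "    {'first_word': 'who', 'n_tokens': 4},\n",
    "    {'first_word': 'where', 'n_tokens': 4, 'ent_label': 'GPE'},\n",
    "]\n",
    "\n",
    "vec = DictVectorizer()\n",
    "X = vec.fit_transform(features)   # sparse matrix, one row per question\n",
    "print(X.shape)\n",
    "print(sorted(vec.vocabulary_))    # names of the generated columns\n",
    "\n",
    "# A feature value never seen during fit produces an all-zero row\n",
    "X_new = vec.transform([{'first_word': 'when'}])\n",
    "print(X_new.sum())"
   ]
  },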
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q1.6.2\"></a>\n",
    "<h4>Q1.6.2 Train Models</h4>\n",
    "\n",
    "Train scikit-learn based classifiers for:\n",
    "<ol>\n",
    "<li>Coarse labels\n",
    "<li>All labels as a flat classifier\n",
    "<li>A hierarchical classifier which predicts the fine-grained labels given the coarse label as proposed in the paper.  Implement this as a two-step procedure - run the coarse-label classifier, then a second level classifier which takes the prediction of the first classifier as input (one finer classifier per coarse category).\n",
    "</ol>\n",
    "\n",
    "For each of the three classifiers, report:\n",
    "<ol>\n",
    "<li>Accuracy, Precision, Recall, F-measure per label and confusion matrix.\n",
    "<li>Provide examples of prediction errors (positive and negative).\n",
    "<li>Discuss the most ambiguous label pairs (identified in the confusion matrix) and discuss whether the features you have used provide sufficient information to disambiguate the cases.\n",
    "</ol>\n",
    "\n",
    "You should experiment with different classifiers from those illustrated in the\n",
    "<a href=\"https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html\">\n",
    "Classification of text documents using sparse features</a> example.\n",
    "<p/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q1.7\"></a>\n",
    "<h3>Q1.7 Optional</h3>\n",
    "\n",
    "1.7.1 Analyze which of the features are most helpful for this task among lemma, POS, NER, Chunks and Related Words.\n",
    "(This analysis is called <a href=\"https://en.wikipedia.org/wiki/Ablation_(artificial_intelligence)#:~:text=In%20artificial%20intelligence%20(AI)%2C,component%20to%20the%20overall%20system.\">ablation analysis</a>).\n",
    "<p/>\n",
    "\n",
    "1.7.2 The dataset is quite small (5,500 questions in the training dataset for 50 labels).\n",
    "How would you determine whether your model overfits on this data? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr/>\n",
    "<a name=\"q2\"></a>\n",
    "<h2>Q2. Document Classification</h2>\n",
    "\n",
    "<a name=\"q2.1\"></a>\n",
    "<h3>Q2.1. Reuters Dataset</h3>\n",
    "\n",
    "Execute the notebook tutorial of Scikit-Learn on text classification: \n",
    "<a href=\"http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html#example-applications-plot-out-of-core-classification-py\">out of core classification</a>.\n",
    "<p/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.1.1\"></a>\n",
    "<h4>Q2.1.1 Descriptive Statistics</h4> \n",
    "\n",
    "Explore how many documents are in the dataset, how many categories, how many documents per categories, provide mean and standard deviation, min and max. (use the pandas library to explore the dataset, use the dataframe.describe() method.)\n",
    "\n",
    "Explore how many characters and words are present in the documents of the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.1.2\"></a>\n",
    "<h4>Q2.1.2 Partial-fit classifiers</h4> \n",
    "\n",
    "Explain informally what are the classifiers that support the \"partial-fit\" method discussed in the code."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.1.3\"></a>\n",
    "<h4>Q2.1.3 Hashing Vectorizer</h4> \n",
    "\n",
    "Explain what is the hashing vectorizer used in this tutorial.  Why is it important to use this vectorizer to achieve \"streaming classification\"?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.2\"></a>\n",
    "<h3>Q2.2. BBC News Dataset</h3>\n",
    "\n",
    "The <a href=\"https://www.kaggle.com/c/learn-ai-bbc/data\">Kaggle BBC News</a> dataset is a document dataset to test document classification.\n",
    "It contains 1,500 training documents (news stories from the BBC News) and 700 test documents. Documents are classified into 5 categories: sports, tech, business, \n",
    "entertainment, politics.  Text is encoded in the following format: all lower case, quotes are removed and separated, non period punctuations are removed.\n",
    "For example:\n",
    "<pre>\n",
    "lifestyle  governs mobile choice  faster  better or funkier hardware alone is not going to help phone firms sell more handsets\n",
    "research suggests.  instead  phone firms keen to get more out of their customers should not just be pushing the technology \n",
    "for its own sake. consumers are far more interested in how handsets fit in with their lifestyle than they are in screen size  \n",
    "onboard memory or the chip inside  shows an in-depth study by handset maker ericsson.  \n",
    "historically in the industry there has been too much focus on using technology   \n",
    "said dr michael bjorn  senior advisor on mobile media at ericsson s consumer and enterprise lab.\n",
    "</pre>\n",
    "\n",
    "Download the data <a href=\"bbcnews.zip\">bbcnews.zip</a> and place it in ../data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.2.1\"></a>\n",
    "<h4>Q2.2.1 Dataset Exploration</h4>\n",
    "\n",
    "Explore how many documents are in the dataset, how many categories, how many documents per categories, provide mean and standard deviation, min and max. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.2.2\"></a>\n",
    "<h4>Q2.2.2 Features Extraction</h4>\n",
    "\n",
    "Select appropriate features for document classification and implement a scikit-learn vectorizer for this dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.2.3\"></a>\n",
    "<h4>Q2.2.3 Model Training and Evaluation</h4>\n",
    "\n",
    "Implement a classifier for this dataset.  \n",
    "<p/>\n",
    "Report performance, confusion matrix and analyze errors.\n",
    "<p/>\n",
    "\n",
    "In order to run the test data, you will need to register to Kaggle and use their submission system.\n",
    "To avoid the complexity of using the Kaggle submission system, split the train data into 80% training / 20% test.  \n",
    "<p/>\n",
    "\n",
    "You can see examples solving this task with good usage of scikit-learn APIs in <a href=\"https://www.kaggle.com/c/learn-ai-bbc/leaderboard\">the Kaggle leaderboard</a>.\n",
    "In particular, <a href=\"https://www.kaggle.com/aryankaul31/aryan-bbc-news-classification\">aryan-bbc-news-classification</a> demonstrates data exploration for \n",
    "classification using pandas, tf-idf features, TSNE visualization for feature vectors, and chi-square correlation between features and labels. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr/>\n",
    "<a name=\"q3\"></a>\n",
    "<h2>Q3. Named Entity Recognition</h2>\n",
    "\n",
    "<h3>Named Entity Recognition</h3>\n",
    "\n",
    "The task of Named Entity Recognition (NER) involves the recognition of names of persons, locations, organizations, dates in free text.\n",
    "As we have seen above, Spacy includes a very good NER model as part of its library.  In this question, we will study how to implement \n",
    "such a model.\n",
    "\n",
    "The following sentence is tagged with sub-sequences indicating PER (for persons), LOC (for location) and ORG (for organization):\n",
    "<pre>\n",
    "Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.\n",
    "\n",
    "[PER Wolff ] , currently a journalist in [LOC Argentina ] , played with [PER Del Bosque ] in the final years of the seventies in \n",
    "[ORG Real Madrid ] .\n",
    "</pre>\n",
    "\n",
    "NER involves 2 sub-tasks: identifying the boundaries of such expressions (the open and close brackets) and labelling the expressions\n",
    "(with tags such as PER, LOC or ORG).  This sequence labelling task is reduced into a classification task, using the BIO encoding of the data:\n",
    "<pre>\n",
    "        Wolff B-PER\n",
    "            , O\n",
    "    currently O\n",
    "            a O\n",
    "   journalist O\n",
    "           in O\n",
    "    Argentina B-LOC\n",
    "            , O\n",
    "       played O\n",
    "         with O\n",
    "          Del B-PER\n",
    "       Bosque I-PER\n",
    "           in O\n",
    "          the O\n",
    "        final O\n",
    "        years O\n",
    "           of O\n",
    "          the O\n",
    "    seventies O\n",
    "           in O\n",
    "         Real B-ORG\n",
    "       Madrid I-ORG\n",
    "            . O\n",
    "</pre>\n",
    "\n",
    "\n",
    "<h3>Dataset</h3>\n",
    "\n",
    "The dataset we will use for this question is derived from the CoNLL 2002 shared task - which is about NER in Spanish and Dutch.\n",
    "The dataset is included in the NLTK distribution.  Explanations on the dataset are provided in the \n",
    "<a href='https://www.clips.uantwerpen.be/conll2002/ner/'>CoNLL 2002</a> page.\n",
    "<p/>\n",
    "\n",
    "To access the data in Python, do:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.corpus import conll2002\n",
    "\n",
    "etr = conll2002.chunked_sents('esp.train') # In Spanish\n",
    "eta = conll2002.chunked_sents('esp.testa') # In Spanish\n",
    "etb = conll2002.chunked_sents('esp.testb') # In Spanish\n",
    "\n",
    "dtr = conll2002.chunked_sents('ned.train') # In Dutch\n",
    "dta = conll2002.chunked_sents('ned.testa') # In Dutch\n",
    "dtb = conll2002.chunked_sents('ned.testb') # In Dutch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data consists of three files per language (Spanish and Dutch): one training file and two test files testa and testb. \n",
    "The first test file is to be used in the development phase for finding good parameters for the learning system. \n",
    "The second test file will be used for the final evaluation."
   ]
  },
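  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The BIO encoding described above can be sketched in a few lines. This is only an illustration under stated assumptions:\n",
    "the function name `spans_to_bio` and the `(start, end, label)` span format (token offsets, end exclusive) are made up for this example and are not part of NLTK.\n",
    "\n",
    "```python\n",
    "def spans_to_bio(tokens, spans):\n",
    "    # Start with every token outside any entity.\n",
    "    tags = ['O'] * len(tokens)\n",
    "    for start, end, label in spans:\n",
    "        # The first token of an entity gets B-, the remaining tokens get I-.\n",
    "        tags[start] = 'B-' + label\n",
    "        for i in range(start + 1, end):\n",
    "            tags[i] = 'I-' + label\n",
    "    return tags\n",
    "\n",
    "tokens = 'Wolff , currently a journalist in Argentina , played with Del Bosque'.split()\n",
    "spans = [(0, 1, 'PER'), (6, 7, 'LOC'), (10, 12, 'PER')]  # hypothetical span annotations\n",
    "bio_tags = spans_to_bio(tokens, spans)\n",
    "for token, tag in zip(tokens, bio_tags):\n",
    "    print(token, tag)\n",
    "```\n",
    "\n",
    "In the NLTK corpus itself, `conll2002.iob_sents('esp.train')` should give the sentences directly as (word, POS, BIO tag) triples, so this conversion is not needed for the assignment data."
   ]
  },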
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q3.1\"></a>\n",
    "<h3>Q3.1 Features</h3>\n",
    "\n",
    "Your task consists of:\n",
    "<ol>\n",
    "<li>Choosing good features for encoding the problem.\n",
    "<li>Encode your training dataset.\n",
    "<li>Run a classifier over the training dataset.\n",
    "<li>Train and test the model.\n",
    "<li>Perform error analysis and fine tune model parameters on the testa part of the datasets.\n",
    "<li>Perform evaluation over the testb part of the dataset, reporting on accuracy, per label precision, per label recall and per label F-measure, and confusion matrix.\n",
    "</ol>\n",
    "<p/>\n",
    "\n",
    "Here is a list of features that have been found appropriate for NER in previous work:\n",
    "<ol>\n",
    "<li>The word form (the string as it appears in the sentence)\n",
    "<li>The POS of the word (which is provided in the dataset)\n",
    "<li>ORT - a feature that captures the orthographic (letter) structure of the word.  It can have any of the following values: \n",
    "    number, contains-digit, contains-hyphen, capitalized, all-capitals, URL, punctuation, regular. \n",
    "<li>prefix1: first letter of the word\n",
    "<li>prefix2: first two letters of the word\n",
    "<li>prefix3: first three letters of the word\n",
    "<li>suffix1: last letter of the word\n",
    "<li>suffix2: last two letters of the word\n",
    "<li>suffix3: last three letters of the word\n",
    "</ol>\n",
    "\n",
    "<p/>\n",
    "\n",
    "For example, given the following toy training data, the encoding of the features would be:\n",
    "<pre>\n",
    "        Wolff NP  B-PER\n",
    "            , ,   O\n",
    "    currently RB  O\n",
    "            a AT  O\n",
    "   journalist NN  O\n",
    "           in IN  O\n",
    "    Argentina NP  B-LOC\n",
    "            , ,   O\n",
    "       played VBD O\n",
    "         with IN  O\n",
    "          Del NP  B-PER\n",
    "       Bosque NP  I-PER\n",
    "           in IN  O\n",
    "          the AT  O\n",
    "        final JJ  O\n",
    "        years NNS O\n",
    "           of IN  O\n",
    "          the AT  O\n",
    "    seventies NNS O\n",
    "           in IN  O\n",
    "         Real NP  B-ORG\n",
    "       Madrid NP  I-ORG\n",
    "            . .   O\n",
    "\n",
    "<u>Classes</u>\n",
    "1 B-PER\n",
    "2 I-PER\n",
    "3 B-LOC\n",
    "4 I-LOC\n",
    "5 B-ORG\n",
    "6 I-ORG\n",
    "7 O\n",
    "\n",
    "<u>Feature WORD-FORM:</u>\n",
    "1 Wolff\n",
    "2 ,\n",
    "3 currently\n",
    "4 a\n",
    "5 journalist\n",
    "6 in\n",
    "7 Argentina\n",
    "8 played\n",
    "9 with\n",
    "10 Del\n",
    "11 Bosque\n",
    "12 the\n",
    "13 final\n",
    "14 years\n",
    "15 of\n",
    "16 seventies\n",
    "17 Real\n",
    "18 Madrid\n",
    "19 .\n",
    "\n",
    "<u>Feature POS</u>\n",
    "20 NP\n",
    "21 ,\n",
    "22 RB\n",
    "23 AT\n",
    "24 NN\n",
    "25 VBD\n",
    "26 JJ\n",
    "27 NNS\n",
    "28 .\n",
    "\n",
    "<u>Feature ORT</u>\n",
    "29 number\n",
    "30 contains-digit\n",
    "31 contains-hyphen\n",
    "32 capitalized\n",
    "33 all-capitals\n",
    "34 URL\n",
    "35 punctuation\n",
    "36 regular\n",
    "\n",
    "<u>Feature Prefix1</u>\n",
    "37 W\n",
    "38 ,\n",
    "39 c\n",
    "40 a\n",
    "41 j\n",
    "42 i\n",
    "43 A\n",
    "44 p\n",
    "45 w\n",
    "46 D\n",
    "47 B\n",
    "48 t\n",
    "49 f\n",
    "50 y\n",
    "51 o\n",
    "52 s\n",
    "53 .\n",
    "</pre>\n",
    "\n",
    "Given this encoding, we can compute the vector representing the first word \"Wolff NP B-PER\" as:\n",
    "<pre>\n",
    "# Class: B-PER=1\n",
    "# Word-form: Wolff=1\n",
    "# POS: NP=20\n",
    "# ORT: Capitalized=32\n",
    "# prefix1: W=37\n",
    "1 1:1 20:1 32:1 37:1\n",
    "</pre>\n",
    "\n",
    "When you encode the test dataset, some of the word-forms will be unknown (not seen in the training dataset).\n",
    "You should, therefore, plan for a special value for each feature of type \"unknown\" when this is expected.\n",
    "<p/>\n",
    "\n",
    "Instead of writing the code as explained above, use the Scikit-learn vectorizer and pipeline library.\n",
    "General information on feature extraction for text data in Scikit-Learn is \n",
    "in <a href=\"https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction\">the Scikit-Learn documentation</a>.\n",
    "Refer to the <a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html\">DictVectorizer</a> for this specific task.\n",
    "<a href=\"https://scikit-learn.org/stable/auto_examples/text/plot_hashing_vs_dict_vectorizer.html\">Hashing vs. DictVectorizer</a>\n",
    "also provides useful background."
   ]
  },
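  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The feature list above can be sketched as a function that maps each token to a dictionary of feature values.\n",
    "This is a minimal sketch, not a reference solution: the names `ort` and `token_features`, the dictionary keys, and the order in which the ORT rules are tested are all assumptions made for this example.\n",
    "\n",
    "```python\n",
    "import string\n",
    "\n",
    "def ort(word):\n",
    "    # Orthographic class of a token; the first matching rule wins.\n",
    "    # The rule order is an assumption of this sketch.\n",
    "    if word.isdigit():\n",
    "        return 'number'\n",
    "    if word.startswith(('http://', 'https://', 'www.')):\n",
    "        return 'URL'\n",
    "    if all(ch in string.punctuation for ch in word):\n",
    "        return 'punctuation'\n",
    "    if any(ch.isdigit() for ch in word):\n",
    "        return 'contains-digit'\n",
    "    if '-' in word:\n",
    "        return 'contains-hyphen'\n",
    "    if word.isupper():\n",
    "        return 'all-capitals'\n",
    "    if word[0].isupper():\n",
    "        return 'capitalized'\n",
    "    return 'regular'\n",
    "\n",
    "def token_features(word, pos):\n",
    "    # One dict per token: feature name -> string value.\n",
    "    return {\n",
    "        'word': word,\n",
    "        'pos': pos,\n",
    "        'ort': ort(word),\n",
    "        'prefix1': word[:1], 'prefix2': word[:2], 'prefix3': word[:3],\n",
    "        'suffix1': word[-1:], 'suffix2': word[-2:], 'suffix3': word[-3:],\n",
    "    }\n",
    "\n",
    "features = token_features('Wolff', 'NP')\n",
    "print(features)\n",
    "```\n",
    "\n",
    "A list of such dictionaries is the input format that DictVectorizer expects: it assigns one column to each (feature, value) pair seen during fitting,\n",
    "and a value that was never seen during fitting simply produces no active column at transform time."
   ]
  },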
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q3.1.1\"></a>\n",
    "<h4>Q3.1.1 Feature Extraction</h4>\n",
    "\n",
    "Start from the following example notebook\n",
    "<a href=\"http://nbviewer.ipython.org/github/tpeng/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb\">CoNLL 2002 Classification with CRF</a>.\n",
    "You do not need to install Python-CRFSuite - just take this notebook as a starting point to explore the dataset and ways to encode features.\n",
    "(This notebook also gives you an indication of the level of result you can expect to obtain.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q3.1.2\"></a>\n",
    "<h4>Q3.1.2 Model Training</h4>\n",
    "\n",
    "Train the model using a logistic regression classifier and experiment with better features - \n",
    "looking at the tags of the previous word, the previous word and the following word (add padding words in the vectorizer)."
   ]
  },
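  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible shape for this step is sketched below on two toy sentences. It assumes scikit-learn is installed;\n",
    "the padding token `PAD`, the function `context_features`, the feature names and the toy data are hypothetical and chosen only for illustration.\n",
    "\n",
    "```python\n",
    "from sklearn.feature_extraction import DictVectorizer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "PAD = ('<PAD>', '<PAD>', 'O')  # hypothetical padding token (word, pos, tag)\n",
    "\n",
    "def context_features(sent, i):\n",
    "    # sent is a list of (word, pos, tag) triples; pad both ends so that\n",
    "    # the previous and next positions always exist.\n",
    "    padded = [PAD] + list(sent) + [PAD]\n",
    "    prev_tok, cur, nxt = padded[i], padded[i + 1], padded[i + 2]\n",
    "    return {\n",
    "        'word': cur[0], 'pos': cur[1],\n",
    "        'prev_word': prev_tok[0], 'prev_tag': prev_tok[2],\n",
    "        'next_word': nxt[0],\n",
    "    }\n",
    "\n",
    "# Toy training data, not taken from the CoNLL 2002 corpus.\n",
    "train = [\n",
    "    [('Wolff', 'NP', 'B-PER'), ('played', 'VBD', 'O'), ('with', 'IN', 'O'),\n",
    "     ('Del', 'NP', 'B-PER'), ('Bosque', 'NP', 'I-PER'), ('.', '.', 'O')],\n",
    "    [('Real', 'NP', 'B-ORG'), ('Madrid', 'NP', 'I-ORG'), ('played', 'VBD', 'O'), ('.', '.', 'O')],\n",
    "]\n",
    "X = [context_features(s, i) for s in train for i in range(len(s))]\n",
    "y = [tag for s in train for (_, _, tag) in s]\n",
    "\n",
    "# The vectorizer one-hot encodes the dicts; the classifier is trained on the result.\n",
    "model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))\n",
    "model.fit(X, y)\n",
    "predicted = model.predict(X)\n",
    "print(list(predicted))\n",
    "```\n",
    "\n",
    "Note that this sketch reads the previous tag from the gold annotation. At test time the gold tag is not available,\n",
    "so a greedy tagger has to proceed left to right and feed its own previous prediction into the `prev_tag` feature."
   ]
  },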
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q3.1.3\"></a>\n",
    "<h4>Q3.1.3 Greedy Tagging vs. Sequence Tagging</h4>\n",
    "\n",
    "We implemented above a version of NER which is based on <i>greedy tagging</i>: that is, without optimizing the sequence of tags \n",
    "as we would obtain by training an HMM or CRF model.  \n",
    "In particular, we did not check that the BIO tags produced by the tagger is a legal sequence.\n",
    "Write code to identify sequences of BIO tags which are illegal and report on the frequency of this problem for each type\n",
    "of illegal tags transition (O-IX, IX-IY, BX-IY).  Comment on your observations."
   ]
  },
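  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The three illegal transition types can be detected by comparing each tag with the one before it.\n",
    "The sketch below is one way to do this; the function name `illegal_transitions` is hypothetical,\n",
    "and treating a sentence-initial I-X as an O-IX error is an assumption of this example.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "def illegal_transitions(tags):\n",
    "    # Count illegal BIO transitions in the tag sequence of one sentence.\n",
    "    counts = Counter()\n",
    "    prev = 'O'  # assumption: the position before the sentence behaves like O\n",
    "    for tag in tags:\n",
    "        if tag.startswith('I-'):\n",
    "            if prev == 'O':\n",
    "                counts['O-IX'] += 1\n",
    "            elif prev[2:] != tag[2:]:\n",
    "                # prev is B-X or I-X with a different entity type\n",
    "                counts[prev[0] + 'X-IY'] += 1\n",
    "        prev = tag\n",
    "    return counts\n",
    "\n",
    "tags = ['O', 'I-PER', 'B-LOC', 'I-ORG', 'I-PER', 'B-PER', 'I-PER']\n",
    "counts = illegal_transitions(tags)\n",
    "print(dict(counts))\n",
    "```\n",
    "\n",
    "Summing the counters over all predicted sentences gives the frequency of each error type; only I- tags can be illegal, since B-X and O may follow any tag."
   ]
  },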
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q3.2\"></a>\n",
    "<h3>Q3.2 Using Word Embeddings</h3>\n",
    "\n",
    "One way to improve a greedy tagger for NER is to use Word Embeddings as features.\n",
    "A convenient package to manipulate Word2Vec word embeddings is provided in the <a href=\"https://radimrehurek.com/gensim/\">gensim</a> package by Radim Rehurek.\n",
    "To install it, use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: gensim in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (3.8.3)\n",
      "Requirement already satisfied: scipy>=0.18.1 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from gensim) (1.5.3)\n",
      "Requirement already satisfied: six>=1.5.0 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from gensim) (1.15.0)\n",
      "Requirement already satisfied: numpy>=1.11.3 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from gensim) (1.19.4)\n",
      "Requirement already satisfied: smart-open>=1.8.1 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from gensim) (3.0.0)\n",
      "Requirement already satisfied: requests in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from smart-open>=1.8.1->gensim) (2.24.0)\n",
      "Requirement already satisfied: chardet<4,>=3.0.2 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from requests->smart-open>=1.8.1->gensim) (3.0.4)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from requests->smart-open>=1.8.1->gensim) (2020.12.5)\n",
      "Requirement already satisfied: idna<3,>=2.5 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from requests->smart-open>=1.8.1->gensim) (2.10)\n",
      "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\\users\\michael\\.conda\\envs\\nlp21\\lib\\site-packages (from requests->smart-open>=1.8.1->gensim) (1.25.11)\n"
     ]
    }
   ],
   "source": [
    "!pip install gensim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You must also download a pre-trained Word2Vec or fastText word embedding model.\n",
    "The models must naturally be in Spanish or Dutch. (Only test word embeddings for one language.)\n",
    "You can find pre-trained word embedding models in different formats:\n",
    "<ol>\n",
    "<li><a href=\"https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md\">fastText pretrained models</a> (includes models for 294 languages)\n",
    "<li><a href=\"https://github.com/uchile-nlp/spanish-word-embeddings\">Spanish Word2vec models</a> \n",
    "</ol>\n",
    "\n",
    "Specific information on manipulating word vectors with Gensim is provided in Gensim with the <a href=\"https://radimrehurek.com/gensim/models/keyedvectors.html\">KeyedVector</a>.\n",
    "Practical examples are available for Spanish in this <a href=\"https://github.com/uchile-nlp/spanish-word-embeddings/blob/master/examples/Ejemplo_WordVectors.ipynb\">notebook</a>.\n",
    "(Pay attention that word embeddings large are pretty big files - about 3GB when uncompressed.)\n",
    "\n",
    "Your task:\n",
    "<ol>\n",
    "<li>Add word embeddings as dense vectors to the features of your NER classifier for each word feature (current word, previous word, next word) - either in Spanish or in Dutch.\n",
    "<li>Retrain the model and report on performance.  Comment.\n",
    "</ol>"
   ]
  }
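  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first task above (dense vectors for the current, previous and next word) can be sketched as follows.\n",
    "To stay self-contained, the example uses a hypothetical toy dictionary `toy_vectors` in place of a real pre-trained model;\n",
    "the function names, the zero vector for unknown or padding words, and the lowercasing of words are assumptions that may not match the model you download.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "DIM = 4  # toy dimension; pre-trained models usually have a few hundred\n",
    "# Hypothetical toy embeddings standing in for a pre-trained model.\n",
    "toy_vectors = {\n",
    "    'real': np.ones(DIM),\n",
    "    'madrid': np.full(DIM, 2.0),\n",
    "}\n",
    "\n",
    "def embed(word, vectors, dim):\n",
    "    # Unknown and padding words map to the zero vector.\n",
    "    return vectors.get(word.lower(), np.zeros(dim))\n",
    "\n",
    "def embedding_features(words, i, vectors, dim):\n",
    "    # Concatenate the vectors of the previous, current and next word.\n",
    "    prev_word = words[i - 1] if i > 0 else '<PAD>'\n",
    "    next_word = words[i + 1] if i + 1 < len(words) else '<PAD>'\n",
    "    window = (prev_word, words[i], next_word)\n",
    "    return np.concatenate([embed(w, vectors, dim) for w in window])\n",
    "\n",
    "words = ['Real', 'Madrid', 'played']\n",
    "dense = np.vstack([embedding_features(words, i, toy_vectors, DIM) for i in range(len(words))])\n",
    "print(dense.shape)\n",
    "```\n",
    "\n",
    "With Gensim, a model in word2vec format can be loaded with `KeyedVectors.load_word2vec_format` and queried by word in place of the toy dictionary.\n",
    "The resulting dense matrix can then be joined with the sparse output of the DictVectorizer, for example with `scipy.sparse.hstack`, before training the classifier."
   ]
  }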
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}