{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1>Assignment 3</h1>\n",
    "\n",
    "<h2>Due: Tue 28 Jan 2020 Midnight</h2>\n",
    "<a href='http://www.cs.bgu.ac.il/~elhadad/nlp21.html'>Natural Language Processing - Fall 2021 Michael Elhadad</a>\n",
    "<p/>\n",
    "This assignment covers the topic of syntactic parsing.\n",
    "Submit your solution in the form of a Jupyter notebook (ipynb) file.\n",
    "\n",
    "<h2>Content</h2>\n",
    "<ul>\n",
    "<li><a href=\"#q1\">Q1: Designing CFGs for NLP</a></li>\n",
    "    <ul>\n",
    "        <li><a href=\"#q1.1\">Q1.1: Extend a CFG to support Number agreement, Pronouns and Dative Constructions</a>\n",
    "            <ul>\n",
    "                <li><a href=\"#q1.1.1\">Q1.1.1 Determiners and Count/Mass Nouns</a></li>\n",
    "                <li><a href=\"#q1.1.2\">Q1.1.2 Pronouns</a></li>\n",
    "                <li><a href=\"#q1.1.3\">Q1.1.3 Subject/Verb Agreement</a></li>\n",
    "                <li><a href=\"#q1.1.4\">Q1.1.4 Over-generation</a></li>\n",
    "            </ul>            \n",
    "        </li>\n",
    "        <li><a href=\"#q1.2\">Q1.2: Extend a CFG to support Coordination and Prepositional Phrases</a>\n",
    "            <ul>\n",
    "                <li><a href=\"#q1.2.1\">Q1.2.1 Support Prepositional Phrases</a></li>\n",
    "                <li><a href=\"#q1.2.2\">Q1.2.2 Support Coordination</a></li>\n",
    "                <li><a href=\"#q1.2.3\">Q1.2.3 Over-generation</a></li>\n",
    "            </ul>            \n",
    "        </li>\n",
    "    </ul>\n",
    "</li>\n",
    "<li><a href=\"#q2\">Q2: Learning a PCFG from a Treebank</a>\n",
    "    <ul>\n",
    "        <li><a href=\"#q2.1\">Q2.1 Random PCFG Generation</a>\n",
    "            <ul>\n",
    "                <li><a href=\"#q2.1.1\">Q2.1.1 PCFG_Generate</a></li>\n",
    "                <li><a href=\"#q2.1.2\">Q2.1.2 Generate 1,000 Trees</a></li>\n",
    "                <li><a href=\"#q2.1.3\">Q2.1.3 Gather Rules Frequency for all Non-Terminals</a></li>\n",
    "                <li><a href=\"#q2.1.4\">Q2.1.4 Compute KL-Divergence between a priori and generated parameters</a></li>\n",
    "            </ul>\n",
    "        </li>\n",
    "        <li><a href=\"#q2.2\">Q2.2 Learn a PCFG from a Treebank</a>\n",
    "            <ul>\n",
    "                <li><a href=\"#q2.2.1\">Q2.2.1 Induce_PCFG</a></li>\n",
    "                <li><a href=\"#q2.2.2\">Q2.2.2 Data Exploration</a></li>\n",
    "            </ul>\n",
    "        </li>\n",
    "        <li><a href=\"#q2.3\">Q2.3 Induce a PCFG in Chomsky Normal Form</a>\n",
    "            <ul>\n",
    "                <li><a href=\"#q2.3.1\">Q2.3.1 PCFG_CNF_Learn</a></li>\n",
    "                <li><a href=\"#q2.3.2\">Q2.3.2 Data Exploration</a></li>\n",
    "            </ul>\n",
    "        </li>\n",
    "        <li><a href=\"#q2.4\">Q2.4 Test CFG Independence Assumptions</a></li>\n",
    "    </ul>\n",
    "</li>\n",
    "<li><a href=\"#q3\">Q3: Building and Evaluating a Simple PCFG Parser</a>\n",
    "\t<ul>\n",
    "        <li><a href=\"#q3.1\">Q3.1 Build a Parser</a>\n",
    "            <ul>\n",
    "                <li><a href=\"#q3.1.1\">Q3.1.1 Dataset Split</a></li>\n",
    "                <li><a href=\"#q3.1.2\">Q3.1.2 Learn a PCFG over the Chomsky Normal Form</a></li>\n",
    "                <li><a href=\"#q3.1.3\">Q3.1.3 Viterbi Parser</a></li>\n",
    "            </ul>            \n",
    "        </li>\n",
    "        <li><a href=\"#q3.2\">Q3.2 Evaluate the Parser</a></li>\n",
    "        <li><a href=\"#q3.3\">Q3.3 Accuracy per Distance</a></li>\n",
    "        <li><a href=\"#q3.4\">Q3.4 Accuracy per Label</a></li>\n",
    "    </ul>\n",
    "</li>\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr/>\n",
    "<a name=\"q1\"></a>\n",
    "<h2>Question 1: Designing CFGs for NLP</h2>\n",
    "\n",
    "NLTK provides library support to read CFGs from string representation, and parse sentences given a CFG using different \n",
    "parsing algorithms (either top-down or bottom-up).  In this question, we manually develop a grammar to support \n",
    "increasingly complex phenomena in English syntax. The following code provides the starting point:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Parsing 'John left'\n",
      "    [ * John left]\n",
      "  S [ 'John' * left]\n",
      "  R [ NP * left]\n",
      "  S [ NP 'left' * ]\n",
      "  R [ NP IV * ]\n",
      "  R [ NP VP * ]\n",
      "  R [ S * ]\n",
      "(S (NP John) (VP (IV left)))\n",
      "Parsing 'John eats bread'\n",
      "    [ * John eats bread]\n",
      "  S [ 'John' * eats bread]\n",
      "  R [ NP * eats bread]\n",
      "  S [ NP 'eats' * bread]\n",
      "  R [ NP TV * bread]\n",
      "  S [ NP TV 'bread' * ]\n",
      "  R [ NP TV NP * ]\n",
      "  R [ NP VP * ]\n",
      "  R [ S * ]\n",
      "(S (NP John) (VP (TV eats) (NP bread)))\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "from nltk import CFG\n",
    "\n",
    "sg = \"\"\"\n",
    "S -> NP VP\n",
    "VP -> IV | TV NP\n",
    "NP -> 'John' | \"bread\"\n",
    "IV -> 'left'\n",
    "TV -> 'eats'\n",
    "\"\"\"\n",
    "g = CFG.fromstring(sg)\n",
    "\n",
    "# Bottom-up  parser\n",
    "sr_parser = nltk.ShiftReduceParser(g, trace=2)\n",
    "\n",
    "# Parse sentences and observe the behavior of the parser\n",
    "def parse_sentence(sent):\n",
    "    tokens = sent.split()\n",
    "    trees = sr_parser.parse(tokens)\n",
    "    for tree in trees:\n",
    "        print(tree)\n",
    "\n",
    "parse_sentence(\"John left\")\n",
    "parse_sentence(\"John eats bread\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this toy grammar, the following non-terminals are used:\n",
    "<ul>\n",
    "<li>S: sentence\n",
    "<li>VP: verb phrase\n",
    "<li>NP: noun phrase\n",
    "<li>IV: intransitive verb (a verb which does not expect a complement)\n",
    "<li>TV: transitive verb (a verb which expects an obligatory complement)\n",
    "</ul>\n",
    "\n",
    "<a name=\"q1.1\"></a>\n",
    "<h3>Question 1.1: Extend a CFG to support Determiners, Number agreement, Countable/Mass nouns, Pronouns and Dative Constructions</h3>\n",
    "\n",
    "Extend the CFG so that the following sentences can be parsed:\n",
    "<pre>\n",
    "    John left\n",
    "    John loves Mary\n",
    "    They love Mary\n",
    "    They love her\n",
    "    She loves them\n",
    "    Everybody loves John\n",
    "    A boy loves Mary\n",
    "\tThe boy loves Mary\n",
    "\tSome boys love Mary\n",
    "    John gave Mary a heavy book\n",
    "    John gave it to Mary\n",
    "\tJohn likes butter\n",
    "\tJohn moves a chair\n",
    "</pre>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q1.1.1\"></a>\n",
    "<h4>1.1.1 Determiners and Count/Mass Nouns</h4>\n",
    "\n",
    "<b>Determiners</b> in English come before common nouns.\n",
    "Proper nouns cannot have a determiner. (We ignore cases such as \"The Petersons came by tonight for a visit.\")\n",
    "<p/>\n",
    "Some determiners can only be used for singular nouns (\"a\"), others only for plural nouns (\"many\") and others with either plural or singular (\"the\").\n",
    "<p/>\n",
    "Some common nouns are called \"uncountable\" or <a href=\"https://en.wikipedia.org/wiki/Mass_noun\">mass nouns</a> (e.g., \"butter\").\n",
    "Mass nouns cannot appear in plural and cannot have counting determiners.\n",
    "Other nouns are <a href=https://en.wikipedia.org/wiki/Count_noun\">countable nouns</a>. They can be used in the plural and can be modified by counting determiners.\n",
    "<p/>\n",
    "Extend the grammar to support countable and uncountable nouns with the corresponding restrictions on determiners and plural.\n",
    "<p/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q1.1.2\"></a>\n",
    "<h4>1.1.2 Pronouns</h4>\n",
    "\n",
    "<b>Pronouns</b> in English are characterized by the following morphological attributes:\n",
    "<ul>\n",
    "    <li>Number: singular / plural (e.g., he/they)</li>\n",
    "    <li>Gender: masculine / feminine / neutral (e.g., he/she/it)</li>\n",
    "    <li>Case: nominative / accusative (e.g., he/him)</li>\n",
    "</ul>\n",
    "Make sure the appropriate form of pronouns are used in each of the possible contexts.\n",
    "Do you need to encode gender in the grammar? Explain why.\n",
    "<p/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q1.1.3\"></a>\n",
    "<h4>1.1.3 Subject/Verb Agreement</h4>\n",
    "<p/>\n",
    "Make sure subject and verb agree in number.\n",
    "<p/>\n",
    "\n",
    "<a name=\"q1.1.4\"></a>\n",
    "<h4>1.1.4 Over-generation</h4>\n",
    "    \n",
    "Check whether your grammar overgenerates by giving an example of ungrammatical sentence that it recognizes as grammatical.\n",
    "<p/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q1.2\"></a>\n",
    "<h3>Question 1.2: Extend a CFG to support Coordination and Prepositional Phrases</h3>\n",
    "\n",
    "<a name=\"q1.2.1\"></a>\n",
    "<h4>1.2.1 Support Prepositional Phrases</h4>\n",
    "\n",
    "Extend the grammar so that it can parse the following sentences:\n",
    "\n",
    "<pre>\n",
    "John saw a man with a telescope\n",
    "John saw a man on the hill with a telescope\n",
    "\n",
    "Mary knows men and women\n",
    "Mary knows men, children and women\n",
    "\n",
    "John and Mary eat bread\n",
    "John and Mary eat bread with cheese\n",
    "</pre>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q1.2.2\"></a>\n",
    "<h4>1.2.2 Support Coordination</h4>\n",
    "    \n",
    "What is the number of a coordination such as \"John and Mary\"?\n",
    "<p/>\n",
    "What is the number of a coordination such as \"John or Mary\"?\n",
    "<p/>\n",
    "\"John or the children\"?\n",
    "<p/>\n",
    "Propose (without implementing) ways to support such variations.\n",
    "<p/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q1.2.3\"></a>\n",
    "<h4>1.2.3 Overgeneration</h4>\n",
    "\n",
    "Demonstrate ways in which your grammar over-generates.  Explain your observations.\n",
    "<p/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr/>\n",
    "<a name=\"q2\"></a>\n",
    "<h2>Question 2: Learning a PCFG from a Treebank</h2>\n",
    "\n",
    "<a name=\"q2.1\"></a>\n",
    "<h3>Question 2.1: Random PCFG Generation</h3>\n",
    "\n",
    "Consider the PCFG datatype implementation provided in NLTK.\n",
    "<p/>\n",
    "Consider the module <a href=\"http://www.nltk.org/_modules/nltk/parse/generate.html\">nltk.parse.generate</a>.\n",
    "It contains a function generate(grammar) which given a CFG grammar produces all the sentences the grammar can produce.\n",
    "Naturally, this may produce an infinite number of sentences if the grammar is recursive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  1. the man slept\n",
      "  2. the man saw the man\n",
      "  3. the man saw the park\n",
      "  4. the man saw the dog\n",
      "  5. the man saw a man\n",
      "  6. the man saw a park\n",
      "  7. the man saw a dog\n",
      "  8. the man walked in the man\n",
      "  9. the man walked in the park\n",
      " 10. the man walked in the dog\n"
     ]
    }
   ],
   "source": [
    "import itertools\n",
    "from nltk.grammar import CFG\n",
    "from nltk.parse import generate\n",
    "\n",
    "demo_grammar = \"\"\"\n",
    "  S -> NP VP\n",
    "  NP -> Det N\n",
    "  PP -> P NP\n",
    "  VP -> 'slept' | 'saw' NP | 'walked' PP\n",
    "  Det -> 'the' | 'a'\n",
    "  N -> 'man' | 'park' | 'dog'\n",
    "  P -> 'in' | 'with'\n",
    "\"\"\"\n",
    "grammar = CFG.fromstring(demo_grammar)\n",
    "\n",
    "for n, sent in enumerate(generate.generate(grammar, n=10), 1):\n",
    "        print('%3d. %s' % (n, ' '.join(sent)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.1.1\"></a>\n",
    "<h4>2.1.1 PCFG_Generate</h4>\n",
    "\n",
    "Write a new function <code>pcfg_generate(grammar)</code> which will\n",
    "generate a random tree from a PCFG according to its generative\n",
    "probability model.  In other words, your task is to write a <code>sample</code>\n",
    "method from the PCFG distribution.\n",
    "<p/>\n",
    "\n",
    "Generation works top down: pick the start symbol of\n",
    "the grammar, then randomly choose a production starting from the start\n",
    "symbol, and add the RHS as children of the root node, then expand each\n",
    "node until you obtain terminals.  The result should be a tree and not\n",
    "a sentence.\n",
    "<p/>\n",
    "\n",
    "The generated parse trees should exhibit the distribution of non-terminal\n",
    "distributions captured in the PCFG (that is, the generation process should\n",
    "sample rules according to the weights of each non-terminal in the grammar).\n",
    "<p/>\n",
    "\n",
    "Note: you must generate according to the PCFG distribution - not just randomly.\n",
    "To this end, you should use the method <code>generate()</code> from the\n",
    "NTLK <a href=\"http://www.nltk.org/_modules/nltk/probability.html\">ProbDistI</a> interface.\n",
    "Specifically, you should use the\n",
    "<a href=\"http://www.nltk.org/api/nltk.html#nltk.probability.DictionaryProbDist\">nltk.probability.DictionaryProbDist</a>\n",
    "class to represent the probability distribution of mapping an LHS non-terminal to one of the possible RHS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(Name,)\n",
      "(Det, N)\n",
      "(Name,)\n"
     ]
    }
   ],
   "source": [
    "from nltk.grammar import Nonterminal\n",
    "from nltk.grammar import toy_pcfg2\n",
    "from nltk.probability import DictionaryProbDist\n",
    "\n",
    "productions = toy_pcfg2.productions()\n",
    "\n",
    "# Get all productions with LHS=NP\n",
    "np_productions = toy_pcfg2.productions(Nonterminal('NP'))\n",
    "\n",
    "dict = {}\n",
    "for pr in np_productions: dict[pr.rhs()] = pr.prob()\n",
    "np_probDist = DictionaryProbDist(dict)\n",
    "\n",
    "# Each time you call, you get a random sample\n",
    "print(np_probDist.generate())\n",
    "\n",
    "print(np_probDist.generate())\n",
    "\n",
    "print(np_probDist.generate())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<pre class=\"prettyprint\">\n",
    "Function Specification\n",
    "\n",
    "pcfg_generate(grammar) -- return a tree sampled from the language described by the PCFG grammar\n",
    "</pre>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.1.2\"></a>\n",
    "<h4>2.1.2 Generate 1,000 Trees</h4>\n",
    "\n",
    "Generate 1,000 random trees using nltk.grammar.toy_pcfg2 - store the resulting trees in a file \"toy_pcfg2.gen\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.1.3\"></a>\n",
    "<h4>2.1.3 Gather Rules Frequency for all Non-Terminals</h4>\n",
    "    \n",
    "Compute the rules frequency distribution of each non-terminal and pre-terminal in the generated corpus.\n",
    "You can look at the code of <a href=\"http://www.nltk.org/_modules/nltk/grammar.html#induce_pcfg\">nltk.induce_pcfg(root, productions)</a> to see how this can be done.\n",
    "You should construct one distribution per non-terminal."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.1.4\"></a>\n",
    "<h4>2.1.4 Compute KL-Divergence between a priori and generated parameters</h4>\n",
    "    \n",
    "For each distribution, compute the KL-divergence between the MLE estimation of the probability\n",
    "distribution constructed on your test corpus and toy_pcfg2.\n",
    "The MLE estimation is obtained by applying the <a href=\"http://www.nltk.org/_modules/nltk/probability.html#MLEProbDist\">MLEProbDist</a> estimator to the observed empirical frequency distribution.\n",
    "KL-divergence (<a href=\"http://en.wikipedia.org/wiki/Kullback-Leibler_divergence\">Kullback-Leibler divergence</a>\n",
    "also known as relative entropy) estimates the difference between two distributions.\n",
    "<p/>\n",
    "Read some precisions on handling <a href='KL.html'>diverging KL computations</a> when the support of the 2 distributions p and q are not identical.\n",
    "<p/>\n",
    "\n",
    "Explain your observations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.2\"></a>\n",
    "<h3>Question 2.2: Learn a PCFG from a Treebank</h3>\n",
    "\n",
    "In this question, we will learn a PCFG from a treebank with different\n",
    "types of tree annotations.  We start from a subset of 200 trees from\n",
    "the Penn Treebank (which is distributed by <a href=\"http://www.nltk.org\">NLTK</a>\n",
    "in the corpus named \"treebank\" - the NLTK version contains about 4,000 trees).\n",
    "<p/>\n",
    "\n",
    "In this question, we want to induce a PCFG from the treebank and\n",
    "investigate its properties.  For reference, NLTK provides a function\n",
    "called <a href=\"http://www.nltk.org/_modules/nltk/grammar.html#induce_pcfg\">induce_pcfg</a>\n",
    "in the nltk.grammar module.  It starts from a set of productions and constructs a weighted grammar with the MLE\n",
    "estimation of the production distributions.\n",
    "\n",
    "\n",
    "<a name=\"q2.2.1\"></a>\n",
    "<h4>2.2.1 Induce_PCFG</h4>\n",
    "\n",
    "At this stage, we learn the PCFG \"as is\" -- without Chomsky Form Normalization -- but with \"simplified\" tags. \n",
    "That is, we must use a function to simplify all tags in the tree (non-terminal and terminal): "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Tree('S', [Tree('NP-SBJ', [Tree('NP', [Tree('NNP', ['Pierre']), Tree('NNP', ['Vinken'])]), Tree(',', [',']), Tree('ADJP', [Tree('NP', [Tree('CD', ['61']), Tree('NNS', ['years'])]), Tree('JJ', ['old'])]), Tree(',', [','])]), Tree('VP', [Tree('MD', ['will']), Tree('VP', [Tree('VB', ['join']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['board'])]), Tree('PP-CLR', [Tree('IN', ['as']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['nonexecutive']), Tree('NN', ['director'])])]), Tree('NP-TMP', [Tree('NNP', ['Nov.']), Tree('CD', ['29'])])])]), Tree('.', ['.'])])]\n",
      "(S\n",
      "  (NP-SBJ\n",
      "    (NP (NNP Pierre) (NNP Vinken))\n",
      "    (, ,)\n",
      "    (ADJP (NP (CD 61) (NNS years)) (JJ old))\n",
      "    (, ,))\n",
      "  (VP\n",
      "    (MD will)\n",
      "    (VP\n",
      "      (VB join)\n",
      "      (NP (DT the) (NN board))\n",
      "      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))\n",
      "      (NP-TMP (NNP Nov.) (CD 29))))\n",
      "  (. .))\n"
     ]
    }
   ],
   "source": [
    "from nltk.corpus import LazyCorpusLoader, BracketParseCorpusReader\n",
    "\n",
    "def simplify_functional_tag(tag):\n",
    "    if '-' in tag:\n",
    "        tag = tag.split('-')[0]\n",
    "    return tag\n",
    "\n",
    "treebank = LazyCorpusLoader('treebank/combined', BracketParseCorpusReader, r'wsj_.*\\.mrg')\n",
    "\n",
    "# Raw form\n",
    "print(treebank.parsed_sents()[:1])\n",
    "\n",
    "# Pretty print\n",
    "print(treebank.parsed_sents()[0])\n",
    "\n",
    "# we need to transform the tree to remove NONE tags and simplify tags."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want all tags of the form \"NP-SBJ-2\" to be simplified into \"NP\".  We also want to filter out \"NONE\" elements from the treebank.  \n",
    "You must find the best way to perform NONE filtering on the trees. (NONE elements appear in the original Penn Treebank for example in the case of relative clauses.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<pre class=\"prettyprint\">\n",
    "pcfg_learn(treebank, n)\n",
    "-- treebank is the nltk.corpus.treebank lazy corpus reader\n",
    "-- n indicates the number of trees to read\n",
    "-- return an nltk.PCFG instance\n",
    "</pre>\n",
    "\n",
    "Use the following code as a starting point:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.grammar import Production\n",
    "from nltk import Tree, Nonterminal\n",
    "\n",
    "def get_tag(tree):\n",
    "    if isinstance(tree, Tree):\n",
    "        return Nonterminal(simplify_functional_tag(tree.label()))\n",
    "    else:\n",
    "        return tree\n",
    "\n",
    "def tree_to_production(tree):\n",
    "    return Production(get_tag(tree), [get_tag(child) for child in tree])\n",
    "\n",
    "def tree_to_productions(tree):\n",
    "    yield tree_to_production(tree)\n",
    "    for child in tree:\n",
    "        if isinstance(child, Tree):\n",
    "            for prod in tree_to_productions(child):\n",
    "                yield prod"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.2.2\"></a>\n",
    "<h4>2.2.2 Data Exploration and Validation</h4>\n",
    "\n",
    "Explore the following properties of the learned PCFG:\n",
    "<ul>\n",
    "<li>How many productions are learned from the trees? How many interior nodes were in the treebank?\n",
    "<li>Draw a plot of the distribution of productions according to their frequency (number of rules with frequency 1, 2, ...).  What do you observe?\n",
    "<li>Induce a PCFG from 200 trees, and another one from 400 trees - compare the distribution of the rules you learn.\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.3\"></a>\n",
    "<h3>Question 2.3: Induce a PCFG in Chomsky Normal Form</h3>\n",
    "\n",
    "<a name=\"q2.3.1\"></a>\n",
    "<h4>2.3.1 PCFG_CNF_Learn</h4>\n",
    "\n",
    "Implement <code>pcfg_cnf_learn</code>\n",
    "\n",
    "We now want to learn a PCFG in Chomsky Normal Form from the treebank, with simplified tags, and with filtered NONE elements.\n",
    "The strategy is to convert the trees in the treebank into CNF, then to induce the PCFG from the transformed trees.\n",
    "Use the function <code>chomsky_normal_form</code> in <a href=\"http://www.nltk.org/_modules/nltk/treetransforms.html\">nltk.treetransforms</a> to convert the trees.  \n",
    "Pay attention to NONE filtering.  Use horizontal Markov annotation and parent annotation:\n",
    "\n",
    "<code>\n",
    "chomsky_normal_form(tree, factor='right', horzMarkov=1, vertMarkov=1, childChar='|', parentChar='^')\n",
    "</code>\n",
    "\n",
    "<pre class=\"prettyprint\">\n",
    "pcfg_cnf_learn(treebank, n)\n",
    "-- treebank is the nltk.corpus.treebank lazy corpus reader (simplified tags)\n",
    "-- n indicates the number of trees to read\n",
    "-- return an nltk.PCFG in CNF\n",
    "</pre>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.3.2\"></a>\n",
    "<h4>2.3.2 Data Exploration</h4>\n",
    "\n",
    "How many productions are learned from the CNF trees? How many interior nodes were in the original treebank, and in the CNF treebank?\n",
    "\n",
    "Compare the CNF and non-CNF figures (ratios).  What do you conclude?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q2.4\"></a>\n",
    "<h3>Question 2.4: Test CFG Independence Assumptions</h3>\n",
    "\n",
    "We want to test the CFG hypothesis that a node expansion is independent from its location within a tree.\n",
    "<p/>\n",
    "\n",
    "For the treebank before CNF transformation, report the following statistics, for the expansions of the NP category:\n",
    "<ul>\n",
    "<li>Compute the distribution of the RHS given the NP LHS for all NPs in the treebank.  Draw a histogram plot.\n",
    "<li>Compute the same distribution only for NPs that appear directly below a S node (any variant of S in the Penn treebank tagset). Draw a histogram plot.\n",
    "<li>Compute the same distribution only for NPs that appear directly below a VP node (any variant of VP in the Penn Treebank tagset). Draw a histogram plot. \n",
    "<li>Compare the distributions above.  Use KL-divergence.\n",
    "<li>Conclude: does the data in the treebank confirm the CFG hypothesis for each configuration (original trees, annotated CNF trees)? \n",
    "    (How do you calibrate the values of the KL-divergence to decide that 2 values are similar?)\n",
    "</ul>\n",
    "<p/>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr/>\n",
    "<a name=\"q3\"></a>\n",
    "<h2>Question 3: Building and Evaluating a Simple PCFG Parser</h2>\n",
    "\n",
    "In this question, we will construct a Viterbi parser for the PCFG induced in Question 2 and perform evaluation of this statistical parser. \n",
    "<p/>\n",
    "\n",
    "<a name=\"q3.1\"></a>\n",
    "<h3>Question 3.1: Build a Parser</h3>\n",
    "\n",
    "<a name=\"q3.1.1\"></a>\n",
    "<h4>3.1.1 Dataset Split</h4>\n",
    "    \n",
    "Split the NLTK treebank corpus into 80% training (about 3,200 trees) and 20% (about 800 trees) testing sets.\n",
    "<p/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q3.1.2\"></a>\n",
    "<h4>3.1.2 Learn a PCFG over the Chomsky Normal Form version of this treebank</h4>\n",
    "<p/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q3.1.3\"></a>\n",
    "<h4>3.1.3 Viterbi Parser</h4>\n",
    "    \n",
    "Construct a <code><a href=\"https://www.nltk.org/_modules/nltk/parse/viterbi.html\">ViterbiParser</a></code> using this PCFG grammar.\n",
    "Test the parser on a few sentences."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q3.2\"></a>\n",
    "<h3>Question 3.2: Evaluate the Parser</h3>\n",
    "\n",
    "Your task is to compute the ParsEval metric for the constituent-based parse trees on the test dataset of our treebank.  \n",
    "Report the following metrics:\n",
    "<ol>\n",
    "<li>Precision\n",
    "<li>Recall\n",
    "<li>F-measure\n",
    "<li>Labelled Precision\n",
    "<li>Labelled Recall\n",
    "<li>Labelled F-Measure\n",
    "</ol>\n",
    "\n",
    "Where precision and recall are computed over \"constituents\" between the parsed trees and the treebank trees: a constituent is a triplet \n",
    "(interior_node, first_index, last_index) - where first_index is the index of the first word in the sentence covered by the interior node,\n",
    "and last_index that of the last word (that is, [first_index, last_index] is the yield of the interior node).\n",
    "<p/>\n",
    "\n",
    "In labelled precision and recall we count that 2 constituents match if they have the same label for the interior node.  \n",
    "For unlabelled metrics, we just compare the first and last indexes.  \n",
    "Make sure you compare the trees from the treebank after they have been simplified: for example, S-TPC-1 must be compared with S and ADVP-TMP with ADVP, and -NONE- nodes have been eliminated.\n",
    "<p/>\n",
    "\n",
    "You can read more on this metric in the following references:\n",
    "<ol>\n",
    "<li><a href=\"http://courses.washington.edu/ling571/ling571_fall_2010/slides/evalb_improved_pcky.pdf\">An assignment on parsing evaluation by Scott Farrar</a> includes detailed explanation and examples on how the metric is computed.\n",
    "<li><a href=\"http://nlp.cs.nyu.edu/evalb/\">Evalb</a> software: this is the C program used in all standard evaluations of constituent parsers (by Satoshi Sekine and Michael  Collins).\n",
    "<li><a href=\"http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/parser/metrics/Evalb.html\">Evalb in Java</a> for those who prefer reading Java over C - is part of the Stanford CoreNLP package.\n",
    "</ol>\n",
    "\n",
    "The specific steps to follow are:\n",
    "<ul>\n",
    "<li>Generate a list of labelled constituents of the form <code>[label first_index last_index]</code> given a parse tree.</li>\n",
    "<li>Compute unlabelled Precision, Recall and F-measure for your parser on the test dataset</li>\n",
    "<li>Compute labelled Precision, Recall and F-measure for your parser on the test dataset</li>\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q3.3\"></a>\n",
    "<h3>Question 3.3: Accuracy per Distance</h3>\n",
    "\n",
    "In general, parsers do a much better job on short constituents than long ones.  \n",
    "Draw a plot of the accuracy of constituents per constituent length.  \n",
    "The length of a constituent (node, first, last) is last-first+1.  \n",
    "For a given constituent length X, accuracy is the number of constituents \n",
    "of length X in the parsed tree that are accurate divided by the total number of constituents of length X."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"q3.4\"></a>\n",
    "<h3>Question 3.4: Accuracy per Label</h3>\n",
    "\n",
    "Report accuracy of constituents per label type (S, SBAR, NP, VP, PP etc).\n",
    "For each node type, report number of occurrences and accuracy. \n",
    "Note: report accuracy per node type WITHOUT Chomsky Normal Form modification.\n",
    "For example, if the CNF of the tree generated a non-terminal NP^S, we should count this as NP for this question."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "END"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}