{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " John left\n", " John loves Mary\n", " They love Mary\n", " They love her\n", " She loves them\n", " Everybody loves John\n", " A boy loves Mary\n", "\tThe boy loves Mary\n", "\tSome boys love Mary\n", " John gave Mary a heavy book\n", " John gave it to Mary\n", "\tJohn likes butter\n", "\tJohn moves a chair\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "John saw a man with a telescope\n", "John saw a man on the hill with a telescope\n", "\n", "Mary knows men and women\n", "Mary knows men, children and women\n", "\n", "John and Mary eat bread\n", "John and Mary eat bread with cheese\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
pcfg_generate(grammar)
which will\n",
"generate a random tree from a PCFG according to its generative\n",
"probability model. In other words, your task is to write a sample
\n",
"method from the PCFG distribution.\n",
"\n",
"\n",
"Generation works top down: pick the start symbol of\n",
"the grammar, then randomly choose a production starting from the start\n",
"symbol, and add the RHS as children of the root node, then expand each\n",
"node until you obtain terminals. The result should be a tree and not\n",
"a sentence.\n",
"\n",
"\n",
"The generated parse trees should exhibit the distribution of non-terminal\n",
"distributions captured in the PCFG (that is, the generation process should\n",
"sample rules according to the weights of each non-terminal in the grammar).\n",
"\n",
"\n",
"Note: you must generate according to the PCFG distribution - not just randomly.\n",
"To this end, you should use the method generate()
from the\n",
"NTLK ProbDistI interface.\n",
"Specifically, you should use the\n",
"nltk.probability.DictionaryProbDist\n",
"class to represent the probability distribution of mapping an LHS non-terminal to one of the possible RHS."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(Name,)\n",
"(Det, N)\n",
"(Name,)\n"
]
}
],
"source": [
"from nltk.grammar import Nonterminal\n",
"from nltk.grammar import toy_pcfg2\n",
"from nltk.probability import DictionaryProbDist\n",
"\n",
"productions = toy_pcfg2.productions()\n",
"\n",
"# Get all productions with LHS=NP\n",
"np_productions = toy_pcfg2.productions(Nonterminal('NP'))\n",
"\n",
"dict = {}\n",
"for pr in np_productions: dict[pr.rhs()] = pr.prob()\n",
"np_probDist = DictionaryProbDist(dict)\n",
"\n",
"# Each time you call, you get a random sample\n",
"print(np_probDist.generate())\n",
"\n",
"print(np_probDist.generate())\n",
"\n",
"print(np_probDist.generate())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n", "Function Specification\n", "\n", "pcfg_generate(grammar) -- return a tree sampled from the language described by the PCFG grammar\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", "pcfg_learn(treebank, n)\n", "-- treebank is the nltk.corpus.treebank lazy corpus reader\n", "-- n indicates the number of trees to read\n", "-- return an nltk.PCFG instance\n", "\n", "\n", "Use the following code as a starting point:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from nltk.grammar import Production\n", "from nltk import Tree, Nonterminal\n", "\n", "def get_tag(tree):\n", " if isinstance(tree, Tree):\n", " return Nonterminal(simplify_functional_tag(tree.label()))\n", " else:\n", " return tree\n", "\n", "def tree_to_production(tree):\n", " return Production(get_tag(tree), [get_tag(child) for child in tree])\n", "\n", "def tree_to_productions(tree):\n", " yield tree_to_production(tree)\n", " for child in tree:\n", " if isinstance(child, Tree):\n", " for prod in tree_to_productions(child):\n", " yield prod" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
pcfg_cnf_learn
\n",
"\n",
"We now want to learn a PCFG in Chomsky Normal Form from the treebank, with simplified tags, and with filtered NONE elements.\n",
"The strategy is to convert the trees in the treebank into CNF, then to induce the PCFG from the transformed trees.\n",
"Use the function chomsky_normal_form
in nltk.treetransforms to convert the trees. \n",
"Pay attention to NONE filtering. Use horizontal Markov annotation and parent annotation:\n",
"\n",
"\n",
"chomsky_normal_form(tree, factor='right', horzMarkov=1, vertMarkov=1, childChar='|', parentChar='^')\n",
"
\n",
"\n",
"\n", "pcfg_cnf_learn(treebank, n)\n", "-- treebank is the nltk.corpus.treebank lazy corpus reader (simplified tags)\n", "-- n indicates the number of trees to read\n", "-- return an nltk.PCFG in CNF\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
ViterbiParser
using this PCFG grammar.\n",
"Test the parser on a few sentences."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"[label first_index last_index]
given a parse tree.