Topics in Natural Language Processing
Assignment 2 - Spring 2007 - Michael Elhadad

Natural Language Processing - Assignment 2

The goal of this assignment is to develop a question-answering system based on full semantic analysis of the statements and questions.
  1. Domain and Knowledge Acquisition
  2. DCG Parsing with Semantic Interpretation
  3. Managing under-specified semantic representations
  4. Dataset
  5. Putting Things Together: Question/Answering

Domain and Knowledge Acquisition

The system manages facts about geographic relations. The semantic predicates we are interested in covering are:
  1. Types of individuals: country, region, city, continent
  2. Properties of individuals: population
  3. Relations between individuals: located-in, capital-of, has-border-with
Examples of basic facts we store in the model are:
country(israel)
country(jordan)
region(negev)
city(beersheba)

population(beersheba, 200000)
population(israel, 7000000)
capital-of(negev, beersheba)
capital-of(israel, jerusalem)
has-border-with(israel, jordan)
located-in(negev, israel)
located-in(beersheba, negev)
located-in(yeruham, negev)
In addition, we can store rules in the model, such as:
located-in(?city, ?region) and located-in(?region, ?country) => located-in(?city, ?country)
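
The effect of such a rule can be sketched outside the Horn knowledge base. The Python below is only an illustration (the assignment's knowledge base lives in horn.scm): it stores facts as tuples, follows the argument order of the rule above (contained place first), and applies a generalized, untyped version of the rule by forward chaining until nothing new is derived. The function name apply_transitivity is hypothetical.

```python
# Illustration only: the real knowledge base is the Horn-kb in horn.scm.
# Facts are (predicate, contained-place, containing-place) tuples.
facts = {
    ("located-in", "beersheba", "negev"),
    ("located-in", "yeruham", "negev"),
    ("located-in", "negev", "israel"),
}

def apply_transitivity(facts, pred="located-in"):
    """Apply pred(a, b) and pred(b, c) => pred(a, c) until no new fact appears."""
    derived = set(facts)
    while True:
        pairs = [(a, b) for (p, a, b) in derived if p == pred]
        new = {(pred, a, c) for (a, b) in pairs for (b2, c) in pairs if b == b2}
        if new <= derived:          # fixed point reached
            return derived
        derived |= new

closure = apply_transitivity(facts)
print(("located-in", "beersheba", "israel") in closure)  # True
```

The loop terminates because the set of possible facts over a finite vocabulary is finite.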
Write a program that parses the Israel, Jordan, and Egypt pages of the CIA World Factbook and populates your model with as much data as possible.

You may also want to use the data in the TIPSTER Gazetteer described below.

Define the vocabulary of the model:

  1. Individuals (it is convenient to list individuals by type)
  2. Predicates of arity 1 and 2.
Define a set of at least 10 inference rules over these properties to capture generalizations such as the example of located-in given above. Be careful to avoid "circular" rules such as:
has-border-with(?x, ?y) => has-border-with(?y, ?x)
Such rules can send the Horn knowledge base into an infinite loop.
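
One way to see the problem, and one common way around it, is sketched below in Python. This is not the inference procedure of horn.scm, whose strategy is not described here. The prover prove, the rule functions, and the auxiliary predicate border-fact are all hypothetical names introduced for the illustration. The idea is to store each border once under border-fact and to define has-border-with by two non-recursive rules, so that no rule ever calls itself.

```python
def prove(goal, facts, rules, depth=0, max_depth=50):
    """Naive depth-first backward chaining over ground (pred, x, y) goals.
    Each rule maps a goal to a list of subgoals, or None if it does not apply."""
    if depth > max_depth:
        raise RecursionError("rule expansion did not terminate")
    if goal in facts:
        return True
    for rule in rules:
        subgoals = rule(goal)
        if subgoals is not None and all(
            prove(g, facts, rules, depth + 1, max_depth) for g in subgoals
        ):
            return True
    return False

def circular_rule(goal):
    # has-border-with(?x, ?y) => has-border-with(?y, ?x): calls itself forever
    pred, x, y = goal
    return [(pred, y, x)] if pred == "has-border-with" else None

def direct_rule(goal):
    # border-fact(?x, ?y) => has-border-with(?x, ?y)
    pred, x, y = goal
    return [("border-fact", x, y)] if pred == "has-border-with" else None

def flipped_rule(goal):
    # border-fact(?y, ?x) => has-border-with(?x, ?y)
    pred, x, y = goal
    return [("border-fact", y, x)] if pred == "has-border-with" else None

circular_facts = {("has-border-with", "israel", "jordan")}
safe_facts = {("border-fact", "israel", "jordan")}
safe_rules = [direct_rule, flipped_rule]

print(prove(("has-border-with", "jordan", "israel"), safe_facts, safe_rules))     # True
print(prove(("has-border-with", "israel", "beersheba"), safe_facts, safe_rules))  # False
try:
    prove(("has-border-with", "israel", "beersheba"), circular_facts, [circular_rule])
except RecursionError as err:
    print("circular rule:", err)
```

With the circular rule, a query that has no supporting fact keeps swapping its arguments and never fails. The two non-recursive rules give the same symmetric answers and always terminate.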

Your knowledge base is to be entered into an instance of the Horn-kb knowledge base provided in horn.scm.

The easiest way to parse the CIA World Factbook pages is to use regular expressions to pick out fields of information that appear in a regular format inside the HTML. The SCSH-Regexp package provides regular expressions for Scheme in SISC. The documentation is provided here. Start with the tutorial.

The simple way to load the package in SISC is:

  1. cd into the directory containing the scsh-regexp folder and start SISC.
  2. Type: (require-extension (lib scsh-regexp/scsh-regexp))

Alternatively, you can write a Java program to do the regular-expression information extraction and save the extracted data to a file in Scheme Horn-clause format.

Examples of semantic information to extract from the CIA pages include:

  • population
  • area
  • border countries (this is a list)
  • coastline
  • capital
  • GDP
  • GDP per capita
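
The extraction step can be sketched as follows. This is a minimal Python illustration, not the SCSH-Regexp or Java solution the assignment asks for. The HTML fragment, its FieldLabel class, and the helper names extract_fields and to_facts are assumptions: the real Factbook markup is not shown here and has to be inspected first. Facts are printed in the notation used earlier in this assignment, not in the actual horn.scm input syntax.

```python
import re

# Hypothetical HTML fragment; the real page layout must be checked by hand.
SAMPLE_HTML = """
<td class="FieldLabel">Population:</td><td>7,000,000 (est.)</td>
<td class="FieldLabel">Capital:</td><td>Jerusalem</td>
<td class="FieldLabel">Border countries:</td><td>Egypt, Jordan</td>
"""

# One label cell followed by one value cell.
FIELD_RE = re.compile(r'<td class="FieldLabel">([^<:]+):</td>\s*<td>([^<]*)</td>')

def extract_fields(html):
    """Map each lower-cased field label to its raw text value."""
    return {label.strip().lower(): value.strip()
            for label, value in FIELD_RE.findall(html)}

def to_facts(country, fields):
    """Turn the raw field values into fact strings."""
    facts = []
    if "population" in fields:
        digits = re.match(r"[\d,]+", fields["population"])
        if digits:
            facts.append(f"population({country}, {digits.group().replace(',', '')})")
    if "capital" in fields:
        facts.append(f"capital-of({country}, {fields['capital'].lower()})")
    # Border countries are a list: emit one fact per neighbor.
    for neighbor in re.findall(r"[A-Za-z][A-Za-z ]*", fields.get("border countries", "")):
        facts.append(f"has-border-with({country}, {neighbor.strip().lower()})")
    return facts

for fact in to_facts("israel", extract_fields(SAMPLE_HTML)):
    print(fact)
# population(israel, 7000000)
# capital-of(israel, jerusalem)
# has-border-with(israel, egypt)
# has-border-with(israel, jordan)
```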

    DCG Parsing with Semantic Interpretation

    Prepare a DCG grammar to parse assertions and questions in the domain defined above. Your grammar must cover at least the constructs handled by the grammars discussed in class:
    1. Transitive and intransitive verbs
    2. Relative clauses with unbounded dependencies (only "that" as a relative marker)
    3. Quantifiers: a, every, the
    In addition, you must introduce:
    1. There-is constructs: There are 124 cities in Israel
    2. Y/N questions: Is Israel located in Europe?
    3. Wh-questions: Where is Israel located?, What is the population of Israel?
    4. Wh-determiner questions: Which cities are located in Israel?
    5. How-many questions: How many regions are there in Israel?
    6. Prepositional phrases: Jerusalem is the capital of Israel
    7. Adjectives: Tel Aviv is a large city in Israel
    The DCG must produce, for each sentence, a semantic interpretation in which quantifier scope is left under-specified.

    Information on the DCG and examples are provided in the lecture notes.
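
To make "under-specified" concrete, here is a small Python sketch. It is a hand-written recursive-descent parser standing in for DCG rules, not the DCG formalism from the lecture notes. The qterm tuple format, the lexicon, and the function names are assumptions chosen for the illustration. Each quantified noun phrase stays in argument position as a qterm, so no scope order is chosen at parse time.

```python
from itertools import count

# Tiny hypothetical lexicon.
DETERMINERS = {"a": "exists", "every": "forall", "the": "the"}
NOUNS = {"country", "region", "city", "continent"}
TRANSITIVE_VERBS = {"borders": "has-border-with"}

def parse_np(tokens, pos, fresh):
    """NP -> Det N. Returns (semantics, next position), or None on failure."""
    if pos + 1 < len(tokens) and tokens[pos] in DETERMINERS and tokens[pos + 1] in NOUNS:
        var = f"?x{next(fresh)}"
        sem = ("qterm", DETERMINERS[tokens[pos]], var, (tokens[pos + 1], var))
        return sem, pos + 2
    return None

def parse_sentence(text):
    """S -> NP V NP. Quantified NPs are left in place as qterms."""
    tokens = text.lower().split()
    fresh = count(1)                       # generator of fresh variable indices
    subj = parse_np(tokens, 0, fresh)
    if subj is None:
        return None
    subj_sem, pos = subj
    if pos >= len(tokens) or tokens[pos] not in TRANSITIVE_VERBS:
        return None
    pred = TRANSITIVE_VERBS[tokens[pos]]
    obj = parse_np(tokens, pos + 1, fresh)
    if obj is None or obj[1] != len(tokens):   # the whole input must be consumed
        return None
    return (pred, subj_sem, obj[0])

print(parse_sentence("Every country borders a country"))
```

The result has the shape has-border-with(qterm(forall, ?x1, country(?x1)), qterm(exists, ?x2, country(?x2))), which leaves open whether "every" or "a" takes wide scope.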


    Managing under-specified semantic representations

    The output of the semantic interpretation is not a valid Horn formula, since we decided to leave the quantifier scoping under-specified.

    Implement the specification method to decide on a scoping and convert the output of your semantic interpretation parser into valid Horn clauses.
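
A minimal sketch of the scoping step, in Python and under stated assumptions: the input uses a hypothetical qterm tuple format (quantifier, variable, restriction), and the function scopings simply enumerates every quantifier order. The method presented in the lecture notes may differ, and the final conversion of a chosen scoping into Horn clauses is not shown.

```python
from itertools import permutations

# Hypothetical under-specified form for "Every country borders a country".
UNDERSPECIFIED = (
    "has-border-with",
    ("qterm", "forall", "?x1", ("country", "?x1")),
    ("qterm", "exists", "?x2", ("country", "?x2")),
)

def scopings(form):
    """Return one scoped formula per quantifier order, surface order first.
    A scoped formula is (quantifier, variable, restriction, body)."""
    pred, *args = form
    qterms = [a for a in args if isinstance(a, tuple) and a[0] == "qterm"]
    # Replace each qterm by its variable to obtain the quantifier-free body.
    body = (pred, *[a[2] if a in qterms else a for a in args])
    results = []
    for order in permutations(qterms):       # order lists the outermost quantifier first
        scoped = body
        for _, quant, var, restriction in reversed(order):
            scoped = (quant, var, restriction, scoped)
        results.append(scoped)
    return results

for reading in scopings(UNDERSPECIFIED):
    print(reading)
```

The first reading keeps the surface order (forall outside exists); the second reverses it. A system that needs a single answer could take the first reading by default.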


    Dataset

    The TIPSTER Gazetteer data set contains about 240,000 basic facts about world geography. For example:
    Beersheba (CITY 3) Southern (PROVINCE) Israel (COUNTRY)
    Beersheba Springs (CITY) Tennessee (PROVINCE 1) United States (COUNTRY)
    
    The format of this data set is explained in this document. The data set itself is a 13MB file. In addition, use the information in the CIA World Factbook:
    1. Israel page
    2. Jordan page
    3. Egypt page
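
The two sample gazetteer lines shown above can be turned into facts along the following lines. This Python sketch relies only on what those two lines reveal: a sequence of name (TYPE optional-number) entries. The meaning of the number is left uninterpreted, the mapping of PROVINCE to region is an assumption, and so is the reading that each entry is located in the entry that follows it. The format document should be checked before relying on any of this.

```python
import re

# A name, then a parenthesized upper-case type with an optional number.
ENTRY_RE = re.compile(r"([^()]+?)\s*\(([A-Z]+)(?:\s+(\d+))?\)")

# Assumed mapping from gazetteer types to the predicates of this assignment.
TYPE_MAP = {"CITY": "city", "PROVINCE": "region", "COUNTRY": "country"}

def atom(name):
    """Normalize a place name into a lower-case, hyphenated constant."""
    return name.strip().lower().replace(" ", "-")

def parse_line(line):
    """Return a list of (constant, TYPE, number-or-empty-string) entries."""
    return [(atom(name), kind, code) for name, kind, code in ENTRY_RE.findall(line)]

def line_to_facts(line):
    entries = parse_line(line)
    facts = [f"{TYPE_MAP[kind]}({name})" for name, kind, _ in entries if kind in TYPE_MAP]
    # Assumption: entries run from the smallest place to the largest.
    facts += [f"located-in({inner[0]}, {outer[0]})"
              for inner, outer in zip(entries, entries[1:])]
    return facts

for fact in line_to_facts("Beersheba (CITY 3) Southern (PROVINCE) Israel (COUNTRY)"):
    print(fact)
# city(beersheba)
# region(southern)
# country(israel)
# located-in(beersheba, southern)
# located-in(southern, israel)
```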

    Putting Things Together: Question/Answering

    Implement the question/answering system so that a user can ask questions about your knowledge base and assert new facts when desired.
    Q> How many cities are there in Israel?
    125
    Q> What are the cities in Israel?
    ...
    Q> The population of Beersheba is 252000
    OK
    Q> How many people live in Beersheba?
    252000
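
The interaction above amounts to a loop that decides whether each input is an assertion or a question and then updates or queries the knowledge base. The Python sketch below shows only that dispatch. The class TinyQA is hypothetical, two regular expressions stand in for the DCG parser, a dictionary stands in for the Horn knowledge base, and the replies "unknown" and "cannot parse" are assumptions of this sketch.

```python
import re

class TinyQA:
    """Toy dispatcher: a dict plays the role of the Horn knowledge base."""

    def __init__(self):
        self.population = {}

    def handle(self, line):
        text = line.strip().rstrip("?.")
        # Assertion: "The population of X is N"
        m = re.fullmatch(r"the population of (\w+) is (\d+)", text, re.IGNORECASE)
        if m:
            self.population[m.group(1).lower()] = int(m.group(2))
            return "OK"
        # Question: "How many people live in X?"
        m = re.fullmatch(r"how many people live in (\w+)", text, re.IGNORECASE)
        if m:
            place = m.group(1).lower()
            return str(self.population[place]) if place in self.population else "unknown"
        return "cannot parse"

qa = TinyQA()
print(qa.handle("The population of Beersheba is 252000"))  # OK
print(qa.handle("How many people live in Beersheba?"))     # 252000
```

In the full system, the assertion branch would add the parsed Horn clause to the knowledge base and the question branch would run a query against it.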
    



    Last modified June 14th, 2007