General Intro to NLP

This lecture introduces the field of NLP and presents basic descriptive linguistic concepts: levels of description, word-level description, syntactic description of sentences.

What is NLP about

Different names - refer to different perspectives on language:
In general, it is important to distinguish the main motivation to study the field:
Different computational/descriptive methods have been used to model language:
All of the various perspectives must pay attention to key aspects of language:

Ambiguity in Language

Key observation: a language has infinitely many possible sentences (language = set of possible sentences), yet a cognitive system with limited resources can generate and process sentences of the language it has never seen before. One way to explain this cognitive capability is that language relies on structure (and recursion). Text is structured, but the structure is not manifest. This leads to problems of ambiguity: the intended structure must be determined.
There are different sources of ambiguity in language:
Local vs global ambiguity:

Variability in Language

Key observation: there are many different ways to express the same meaning in language. Variability makes it difficult to assess whether two distinct linguistic forms are actually similar in meaning. It makes it difficult to find text based on meaning, and it makes it difficult to decide how to generate text from a given meaning representation (or from similar text).

Levels of Linguistic Description

Language use can be described at various levels - corresponding to different internal structures in language:
It can also be described from various perspectives - corresponding to different functions language fulfils:
These distinctions are subsumed by different sub-disciplines in linguistics:

Words: Parts of Speech and Morphology

Syntactic analysis starts with the following questions:
We first focus on the word level. There are various definitions of what is a word:
Words have an internal structure - they are composed of smaller parts that have a grammatical function (they mark number, gender, person, tense etc.). The parts of words that have meaning are called morphemes.
Linguists define groups of words which "behave in a similar manner" in syntactic contexts. The basic test is substitution:
 A {large, small, blue, enormous, ...} box is in the room.  
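The substitution test can be sketched in a few lines of Python. This is only an illustration: the `ACCEPTABLE` set is a hypothetical stand-in for a speaker's judgments, and `fill`, `substitution_class` and `VOCABULARY` are names introduced here, not part of the lecture. The article is adjusted to "An" before a vowel so that the filled frame stays well formed.

```python
def fill(word):
    """Put `word` in the slot of the frame 'A ___ box is in the room.'"""
    article = "An" if word[0].lower() in "aeiou" else "A"
    return f"{article} {word} box is in the room."

# Hypothetical stand-in for speaker judgments: sentences treated as acceptable.
ACCEPTABLE = {
    "A large box is in the room.",
    "A small box is in the room.",
    "A blue box is in the room.",
    "An enormous box is in the room.",
}

def substitution_class(vocabulary, acceptable):
    """Return the words that can fill the slot and yield an acceptable sentence."""
    return [word for word in vocabulary if fill(word) in acceptable]

VOCABULARY = ["large", "quickly", "small", "eats", "blue", "the", "enormous"]
print(substitution_class(VOCABULARY, ACCEPTABLE))
# prints ['large', 'small', 'blue', 'enormous']
```

The words that survive the test form one class (here, adjectives); "quickly", "eats" and "the" are rejected because they do not fit the same syntactic context.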
Basic classes are: verb, noun, adjective. These are called parts of speech.
For the basic parts of speech, we can provide a semantic interpretation:
It is important to distinguish open and closed classes, because they have very different properties:
Morphological features:
There exist different types of morphological processes that can build new word forms from basic word forms:
The mechanisms of morphological transformation include:
Inflections mark morphological features in the word. The set of relevant features depends on the word class:
For more information on morphology, refer to this excellent Introduction to morphology.

The Penn Treebank Annotation Scheme

The Penn Treebank is an important initiative started in the 1980s. It consists of a set of English sentences with manual annotations describing the part of speech of every word and the syntactic structure of each sentence. The set of tags used for the parts of speech is described in this table (taken from this site):

| Tag | Description | Examples |
|---|---|---|
| $ | dollar | $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$ |
| `` | opening quotation mark | ` `` |
| '' | closing quotation mark | ' '' |
| ( | opening parenthesis | ( [ { |
| ) | closing parenthesis | ) ] } |
| , | comma | , |
| -- | dash | -- |
| . | sentence terminator | . ! ? |
| : | colon or ellipsis | : ; ... |
| CC | conjunction, coordinating | & 'n and both but either et for less minus neither nor or plus so therefore times v. versus vs. whether yet |
| CD | numeral, cardinal | mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025 fifteen 271,124 dozen quintillion DM2,000 ... |
| DT | determiner | all an another any both del each either every half la many much nary neither no some such that the them these this those |
| EX | existential there | there |
| FW | foreign word | gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte terram fiche oui corporis ... |
| IN | preposition or conjunction, subordinating | astride among uppon whether out inside pro despite on by throughout below within for towards near behind atop around if like until below next into if beside ... |
| JJ | adjective or numeral, ordinal | third ill-mannered pre-war regrettable oiled calamitous first separable ectoplasmic battery-powered participatory fourth still-to-be-named multilingual multi-disciplinary ... |
| JJR | adjective, comparative | bleaker braver breezier briefer brighter brisker broader bumper busier calmer cheaper choosier cleaner clearer closer colder commoner costlier cozier creamier crunchier cuter ... |
| JJS | adjective, superlative | calmest cheapest choicest classiest cleanest clearest closest commonest corniest costliest crassest creepiest crudest cutest darkest deadliest dearest deepest densest dinkiest ... |
| LS | list item marker | A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005 SP-44007 Second Third Three Two \* a b c d first five four one six three two |
| MD | modal auxiliary | can cannot could couldn't dare may might must need ought shall should shouldn't will would |
| NN | noun, common, singular or mass | common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist ... |
| NNP | noun, proper, singular | Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA Shannon A.K.C. Meltex Liverpool ... |
| NNPS | noun, proper, plural | Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques Apache Apaches Apocrypha ... |
| NNS | noun, common, plural | undergraduates scotches bric-a-brac products bodyguards facets coasts divestitures storehouses designs clubs fragrances averages subjectivists apprehensions muses factory-jobs ... |
| PDT | pre-determiner | all both half many quite such sure this |
| POS | genitive marker | ' 's |
| PRP | pronoun, personal | hers herself him himself hisself it itself me myself one oneself ours ourselves ownself self she thee theirs them themselves they thou thy us |
| PRP$ | pronoun, possessive | her his mine my our ours their thy your |
| RB | adverb | occasionally unabatingly maddeningly adventurously professedly stirringly prominently technologically magisterially predominately swiftly fiscally pitilessly ... |
| RBR | adverb, comparative | further gloomier grander graver greater grimmer harder harsher healthier heavier higher however larger later leaner lengthier less-perfectly lesser lonelier longer louder lower more ... |
| RBS | adverb, superlative | best biggest bluntest earliest farthest first furthest hardest heartiest highest largest least less most nearest second tightest worst |
| RP | particle | aboard about across along apart around aside at away back before behind by crop down ever fast for forth from go high i.e. in into just later low more off on open out over per pie raising start teeth that through under unto up up-pp upon whole with you |
| SYM | symbol | % & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R \* \*\* \*\*\* |
| TO | "to" as preposition or infinitive marker | to |
| UH | interjection | Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly man baby diddle hush sonuvabitch ... |
| VB | verb, base form | ask assemble assess assign assume atone attention avoid bake balkanize bank begin behold believe bend benefit bevel beware bless boil bomb boost brace break bring broil brush build ... |
| VBD | verb, past tense | dipped pleaded swiped regummed soaked tidied convened halted registered cushioned exacted snubbed strode aimed adopted belied figgered speculated wore appreciated contemplated ... |
| VBG | verb, present participle or gerund | telegraphing stirring focusing angering judging stalling lactating hankerin' alleging veering capping approaching traveling besieging encrypting interrupting erasing wincing ... |
| VBN | verb, past participle | multihulled dilapidated aerosolized chaired languished panelized used experimented flourished imitated reunifed factored condensed sheared unsettled primed dubbed desired ... |
| VBP | verb, present tense, not 3rd person singular | predominate wrap resort sue twist spill cure lengthen brush terminate appear tend stray glisten obtain comprise detest tease attract emphasize mold postpone sever return wag ... |
| VBZ | verb, present tense, 3rd person singular | bases reconstructs marks mixes displeases seals carps weaves snatches slumps stretches authorizes smolders pictures emerges stockpiles seduces fizzes uses bolsters slaps speaks pleads ... |
| WDT | WH-determiner | that what whatever which whichever |
| WP | WH-pronoun | that what whatever whatsoever which who whom whosoever |
| WP$ | WH-pronoun, possessive | whose |
| WRB | WH-adverb | how however whence whenever where whereby whereever wherein whereof why |
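
Many of the tags above refine a basic part of speech (for example NN, NNS, NNP and NNPS are all nouns). As a rough sketch, the tags can be collapsed back to the basic classes by looking at their prefix. The prefix grouping and the function name `coarse_class` are conveniences assumed here, not part of the Penn Treebank scheme itself, and the tags on the example sentence were assigned by hand from the table.

```python
def coarse_class(tag):
    """Collapse a Penn Treebank tag to a basic part of speech by its prefix."""
    if tag.startswith("NN"):
        return "noun"
    if tag.startswith("VB"):
        return "verb"
    if tag.startswith("JJ"):
        return "adjective"
    if tag.startswith("RB"):
        return "adverb"
    return "other"  # closed classes, punctuation, symbols, ...

# Hand-tagged example (tags chosen from the table above, not produced by a tagger).
tagged = [("John", "NNP"), ("occasionally", "RB"), ("eats", "VBZ"),
          ("the", "DT"), ("cheapest", "JJS"), ("cabbage", "NN"), (".", ".")]

for word, tag in tagged:
    print(word, tag, coarse_class(tag))
```

Note that the open classes (noun, verb, adjective, adverb) each spread over several tags, while closed-class words such as determiners get a single tag.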


Sentence: Syntax

We move next to the sentence level. There are various definitions of what is a sentence:
Sentences have an internal structure - they are composed of smaller parts that have grammatical function. There are different ways to describe a sentence structure. The structure description captures the dependencies among the words within the sentence. These dependencies reflect constraints on operations such as:
The two major description methods are:

Phrases are groups of words that depend on a "central word" that we call the phrase head. Common phrases include, depending on the part of speech of the head:
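
The idea that a phrase is named after the part of speech of its head can be sketched as follows. The `Phrase` class and the `HEAD_TO_PHRASE` mapping are hypothetical names introduced for this illustration; the phrase labels (NP, VP, ADJP, ADVP, PP) are the usual Penn Treebank bracket labels, and the mapping from tag prefixes to labels is a simplification assumed here.

```python
from dataclasses import dataclass

# Assumed simplification: the head's tag prefix decides the phrase label.
HEAD_TO_PHRASE = [("NN", "NP"), ("VB", "VP"), ("JJ", "ADJP"),
                  ("RB", "ADVP"), ("IN", "PP")]

@dataclass
class Phrase:
    words: list  # (word, Penn Treebank tag) pairs
    head: int    # index of the head word within `words`

    def label(self):
        """Name the phrase after the part of speech of its head."""
        head_tag = self.words[self.head][1]
        for prefix, phrase_label in HEAD_TO_PHRASE:
            if head_tag.startswith(prefix):
                return phrase_label
        return "X"  # head category not covered by this sketch

noun_phrase = Phrase([("the", "DT"), ("large", "JJ"), ("box", "NN")], head=2)
prep_phrase = Phrase([("in", "IN"), ("the", "DT"), ("room", "NN")], head=0)
print(noun_phrase.label(), prep_phrase.label())  # prints NP PP
```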

One computational problem we will study is that of parsing a linear sequence of words into a constituent tree. The questions to address to model this problem are:
Why is syntactic structure important:
For example, consider the following problem: you are given two strings:
S1: John eats
S2: John sleeps
You are asked to build a correct English sentence that combines the two events into one sentence expressing that S2 occurred after S1 in the past. A possible solution is:
S3: After eating, John slept
Variations of the same task include: given an input string S1, generate the string N1 which is the negation of S1, or Q1, which is the yes-no question form of S1, or P1, which is S1 expressed in the past tense, etc.
Working directly on strings, such transformations are very difficult to achieve. One needs to know the syntactic structure of the small sentences in order to be able to combine them appropriately.
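
To make the point concrete, here is a minimal sketch in which each small sentence is stored as a structure (subject plus verb lemma) rather than as a string. Everything in it is an assumption made for illustration: the `Clause` class, the tiny hand-written `FORMS` table of verb forms, and the restriction to third person singular subjects. Once the structure is available, each transformation becomes a short function.

```python
from dataclasses import dataclass

# Tiny hand-written table of verb forms (illustrative, covers two verbs only).
FORMS = {
    "eat": {"base": "eat", "3sg": "eats", "past": "ate", "gerund": "eating"},
    "sleep": {"base": "sleep", "3sg": "sleeps", "past": "slept", "gerund": "sleeping"},
}

@dataclass
class Clause:
    subject: str  # assumed to be third person singular
    verb: str     # verb lemma, a key of FORMS

def present(c):
    return f"{c.subject} {FORMS[c.verb]['3sg']}"

def past(c):
    return f"{c.subject} {FORMS[c.verb]['past']}"

def negation(c):
    return f"{c.subject} does not {FORMS[c.verb]['base']}"

def yes_no(c):
    return f"Does {c.subject} {FORMS[c.verb]['base']}?"

def after(first, second):
    """Combine two past events, `second` occurring after `first`."""
    if first.subject != second.subject:
        raise ValueError("this sketch only handles a shared subject")
    return f"After {FORMS[first.verb]['gerund']}, {second.subject} {FORMS[second.verb]['past']}"

s1, s2 = Clause("John", "eat"), Clause("John", "sleep")
print(present(s1))    # John eats
print(after(s1, s2))  # After eating, John slept
print(negation(s1))   # John does not eat
print(yes_no(s1))     # Does John eat?
print(past(s1))       # John ate
```

None of these functions could be written reliably over the raw strings "John eats" and "John sleeps" without first recovering which word is the subject and which is the verb.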

Descriptive Syntax of the Clause

The syntax for the clause category is the most complex. It can be described along four perspectives (note: we use here descriptive terminology from systemic linguistics introduced by M. Halliday - see for example Introduction to Functional Grammar, Halliday, 1985):
  1. Transitivity: determines the type of the main process and its participants. (This is often called the "argument structure" of the clause.)
  2. Circumstantials: determines the structure of modifiers to the predicate and to the clause as a whole.
  3. Mood: determines whether the clause is finite (declarative, interrogative or relative) or non-finite (imperative, infinitive, participial). For finite moods, tense and person are specified. For non-finite, they are not.
  4. Voice: active, passive, causative etc.
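As a rough sketch, the four perspectives can be collected in one record describing a clause. The class name `ClauseDescription`, its field names and the example values are assumptions made for this illustration; the only rule it enforces is the one stated under Mood above: tense and person are specified for finite moods and not for non-finite ones.

```python
from dataclasses import dataclass, field
from typing import Optional

FINITE_MOODS = {"declarative", "interrogative", "relative"}
NON_FINITE_MOODS = {"imperative", "infinitive", "participial"}

@dataclass
class ClauseDescription:
    process: str          # transitivity: the main process
    participants: dict    # transitivity: role name -> participant
    mood: str
    voice: str = "active"
    circumstantials: list = field(default_factory=list)
    tense: Optional[str] = None
    person: Optional[str] = None

    def __post_init__(self):
        if self.mood in FINITE_MOODS:
            if self.tense is None or self.person is None:
                raise ValueError("finite moods require tense and person")
        elif self.mood in NON_FINITE_MOODS:
            if self.tense is not None or self.person is not None:
                raise ValueError("non-finite moods do not specify tense or person")
        else:
            raise ValueError(f"unknown mood: {self.mood}")

    @property
    def is_finite(self):
        return self.mood in FINITE_MOODS

# "John ate quickly": a finite declarative clause.
clause = ClauseDescription(process="eat", participants={"agent": "John"},
                           mood="declarative", circumstantials=["quickly"],
                           tense="past", person="third")
print(clause.is_finite)  # True
```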
The transitivity system determines what participants contribute to the meaning of the clause - when the clause is viewed as a description of an event or relation in the world. Semantically, a clause is a description of a process - a generic term that can refer to either an event or a relation. Participants are the constituents within the clause that satisfy the following linguistic criteria:
NOTE: Each one of these criteria taken alone is not sufficient to characterize participants, but taken together they have proven quite reliable.
Semantically, participants correspond to the nuclear roles of the process. Additional terms can then be added compositionally to modify the meaning of the process or of the predicate (these correspond to sentence and predicate adjuncts, as explained next).

Argument Structure: Lexical properties of the verb

The main verb of a clause determines which participants (arguments) are required in the clause. Since many verbs share the same argument structure, grammars have traditionally given names to the common ones. Each such structure has a name (for example: material process) and defines a name for each of the mandatory arguments it expects (for example: agent). In systemic grammar, the following figure describes the most common argument structures in English:

While systemic linguists have developed complex transitivity systems to account for the structure of the clause, lexicalist approaches to grammar have developed the notion of subcategorization. The idea is that in any phrase, the lexical head determines the structure and order of the subconstituents of the phrase (or of its dependents, in a dependency-based grammar like Meaning Text Theory (Mel’cuk and Pertsov, 1987)). To capture this dependency, HPSG grammars describe each lexical item capable of governing arguments with a feature called the subcategorization list (short name: subcat). The subcat of a verb is a list of the constituents the verb expects to form a complete phrase. So for example, the verb to deal as in "AI deals with logic" is represented in the lexicon with a subcat

deal[Subcat( [ NP ] [ PP : with ])]
which indicates that to produce a clause headed by this verb, one must find an NP and a PP headed by the preposition with.
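
A subcat list can be sketched as a small lexicon entry plus a check that the constituents found around the verb match what the verb expects. The representation below (pairs of category and required head word, with `None` meaning "any head") and the names `LEXICON` and `satisfies_subcat` are assumptions made for this illustration; they are not HPSG notation.

```python
# Each subcat entry is (category, required head word or None for "any head").
LEXICON = {
    "deal": [("NP", None), ("PP", "with")],
}

def satisfies_subcat(verb, constituents):
    """Check (category, head word) pairs against the verb's subcat list, in order."""
    expected = LEXICON[verb]
    if len(expected) != len(constituents):
        return False
    for (category, required_head), (found_category, found_head) in zip(expected, constituents):
        if category != found_category:
            return False
        if required_head is not None and required_head != found_head:
            return False
    return True

# "AI deals with logic": an NP headed by "AI" and a PP headed by "with".
print(satisfies_subcat("deal", [("NP", "AI"), ("PP", "with")]))   # True
print(satisfies_subcat("deal", [("NP", "AI"), ("PP", "about")]))  # False
print(satisfies_subcat("deal", [("NP", "AI")]))                   # False
```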

Adjuncts and Disjuncts

Constituents in the clause which are not participants are classified in terms of their "distance" from the verb. The three levels are called predicate adjuncts, sentence adjuncts and disjuncts. These are the three types of circumstantials. The following tables give examples of each of these circumstantials:

The Mood of Clauses

The various options for the mood of clauses in English are summarized in this table:


Noun Phrase

Noun phrases in English have a strict syntactic structure: they are composed of a sequence of slots, and each slot can be filled by constituents of specific categories.



Last modified Mar 23rd, 2012