General Intro to NLP

This lecture introduces the field of NLP and presents basic descriptive linguistic concepts: levels of description, word-level description, syntactic description of sentences.

What is NLP about

The field goes by different names, which refer to different perspectives on language. In general, it is important to distinguish the main motivation for studying the field. Different computational and descriptive methods have been used to model language.

Natural languages are instances of a more general category of phenomena called "signs", studied in the field of semiotics. A central concern of semiotics is the relation between "signs" and the "meaning" that people attribute to them. The basic distinction between signifier and signified was introduced by De Saussure (Cours de Linguistique Generale, 1916). These two aspects are also called expression and content. See Semiotics for Beginners by Daniel Chandler for an introduction to the key concepts of semiotics (signifier, signified, sign, arbitrariness, structuralism, contrastive relation, syntagm and paradigm, type/token distinction).

The various perspectives on the field (NLP, CL, HLT) must pay attention to key aspects of language:

Ambiguity in Language

Key observation: Infinite occurrences (language = set of possible sentences) / finite processing (a cognitive system with limited resources can generate and process unseen sentences in the language). One way to explain this cognitive capability is that language relies on structure (and recursion). Text is structured, but the structure is not manifest. This gives rise to problems of ambiguity: determining the intended structure. There are different sources of ambiguity in language; ambiguity can be local or global, and ambiguities at multiple layers can combine.

Variability in Language

Key observation: there are many different ways to express the same meaning in language. Variability makes it difficult to assess whether two distinct linguistic forms are actually similar in meaning. It makes it difficult to find text based on meaning, and to decide how to generate text from a given meaning representation (or from similar text). Altogether, the combination of different kinds of variability makes "paraphrases" possible: many different ways to express the same meaning in different sentences. Determining whether two sentences are paraphrases is a non-categorical judgment (it is gradable: sentences may be more or less similar).

Levels of Linguistic Description

Language use can be described at various levels, corresponding to different internal structures in language. It can also be described from various perspectives, corresponding to different functions language fulfils. These distinctions are covered by different sub-disciplines in linguistics.

Words: Parts of Speech and Morphology

Syntactic analysis starts from a few basic questions. We first focus on the word level. There are various definitions of what a word is. Words have an internal structure: they are composed of smaller parts that have a grammatical function (marking number, gender, person, tense etc.). The parts of words that have meaning are called morphemes. Linguists define groups of words which "behave in a similar manner" in syntactic contexts. The basic test is substitution. A set of signs in which one unit is substituted by another while the rest is kept constant forms a paradigm. For example:
abstract-ly
concrete-ly
definite-ly
is a morphological paradigm (illustrating English adverb derivation). Similarly:
I sing
You sing
He sing-s
We sing
You sing
They sing
is also a paradigm (illustrating the person property for English verbs). The comparison of paradigms allows structuralist linguists to identify morphemes and word formation processes. Basic classes are: verb, noun, adjective. These are called parts of speech. For the basic parts of speech we can provide a semantic interpretation. It is important to distinguish open and closed classes, because they have very different properties. There exist different types of morphological processes, and different mechanisms of morphological transformation, that form new word forms from basic word forms. Inflections mark morphological features in the word. The set of relevant features depends on the word class. For a short intro to morphology, refer to Introduction to morphology.
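As a rough illustration of how comparing the forms of a paradigm can isolate a shared part and the parts that vary, here is a minimal Python sketch. It is only a string heuristic, not a morphological analyzer: the function names `split_on_shared_stem` and `shared_suffix` are hypothetical, and the sketch assumes that the shared material is a plain prefix or suffix of every form.

```python
import os


def split_on_shared_stem(forms):
    """Return the longest prefix shared by all forms, and what remains of each form."""
    stem = os.path.commonprefix(forms)
    return stem, [form[len(stem):] for form in forms]


def shared_suffix(forms):
    """Return the longest suffix shared by all forms."""
    reversed_forms = [form[::-1] for form in forms]
    return os.path.commonprefix(reversed_forms)[::-1]


# Person paradigm of "sing": only the third person singular carries an ending.
stem, endings = split_on_shared_stem(["sing", "sing", "sings", "sing", "sing", "sing"])
print(stem, endings)

# Adverb derivation paradigm: the constant part is the suffix.
print(shared_suffix(["abstractly", "concretely", "definitely"]))
```

On the two paradigms from the text, the first call separates the stem sing from the ending s, and the second isolates the suffix ly. Real morphology needs more than this (stem changes, irregular forms), which is why the sketch is limited to the simplest case.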

Non Categorical Phenomena in Language

Many of the tools used to describe language are defined as discrete categorical judgments. For example, we find it useful to distinguish between nouns and verbs. It is, however, often useful to treat these descriptive categories as non-categorical, gradual classifications. A good example is discussed in this video presentation by Chris Manning at ACL 2015 (seek to position 24:00 for this specific discussion). In the discussion of verb vs. noun categories, consider the English gerund V-ing form, such as climbing. Classically, such forms are described as ambiguous alternations between nominal and verbal forms. V-ing can take any of the four core categories defined by Chomsky (1970), the classical four-way POS distinction:
+N / +V    an unassuming man           [Adj]
+N / -V    the opening of the store    [N]
-N / +V    she is eating dinner        [V]
-N / -V    concerning your point       [Prep]
But it seems the membership of V-ing forms in N or V is more than "ambiguity": it is more appropriate to describe it as "non-categorical membership in the V/N classes". Consider criteria (tests) that determine whether a specific word form belongs to N or V. One can find occurrences of V-ing that satisfy both criteria at once:
The not observing this rule is damaging to our reputation.
It appears that V-ing forms are not simply ambiguous (sometimes V and sometimes N, depending on context) but instead hold an intermediate status, different from true nominalizations. V-ing forms have an in-between status that plain N or V forms do not have:
Tom's winning the election was a big upset
? This teasing John all the time has got to stop
? There is no marking exams on Fridays
* The cessation hostilities was unexpected
We refer to such judgments as "non-categorical facts" that can be modeled mathematically as fuzzy membership in the categories. In computational terms, instead of representing membership in the part-of-speech category N or V as categorical data for a given word form {N:1, V:0}, we can represent the judgment with a probability distribution that reflects the relative properties of the word form {N:0.7, V:0.3}. Naturally, we will need to work out criteria to explain where these exact numbers come from and what depends on them.
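The contrast between the two representations can be sketched in a few lines of Python. This is only an illustration under a stated assumption: the graded values are obtained here by normalizing hypothetical counts of noun-like and verb-like evidence for a word form. The function name `graded_membership` and the counts 7 and 3 are invented for the example and are not a proposal for where the numbers should come from.

```python
def graded_membership(evidence):
    """Normalize non-negative evidence counts into a distribution over categories."""
    total = sum(evidence.values())
    if total == 0:
        raise ValueError("no evidence for any category")
    return {category: count / total for category, count in evidence.items()}


# Categorical judgment: the word form is simply a noun.
categorical = {"N": 1.0, "V": 0.0}

# Graded judgment: hypothetical evidence, 7 noun-like and 3 verb-like observations.
graded = graded_membership({"N": 7, "V": 3})
print(categorical, graded)
```

A categorical judgment is then just the special case in which all the mass falls on one category.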

The Penn Treebank Annotation Scheme

The Penn Treebank is an important initiative started in the 1980s. It describes a set of English sentences with manual annotations of the parts of speech of all the words and of their syntactic structure. It is an important dataset practically (because many other works rely on it) but also methodologically, because it demonstrates a completely empirical (as opposed to theory-driven) method to describe syntax. In this empirical method, a syntactic description is considered adequate when multiple trained human annotators who follow detailed guidelines agree on the annotations they provide for a sentence. The set of tags used for the parts of speech in the Penn Treebank is described in the following table (taken from this site):

Tag   Description                                   Examples
$     dollar                                        $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
``    opening quotation mark                        ` ``
''    closing quotation mark                        ' ''
(     opening parenthesis                           ( [ {
)     closing parenthesis                           ) ] }
,     comma                                         ,
--    dash                                          --
.     sentence terminator                           . ! ?
:     colon or ellipsis                             : ; ...
CC    conjunction, coordinating                     & 'n and both but either et for less minus neither nor or plus so therefore times v. versus vs. whether yet
CD    numeral, cardinal                             mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025 fifteen 271,124 dozen quintillion DM2,000 ...
DT    determiner                                    all an another any both del each either every half la many much nary neither no some such that the them these this those
EX    existential there                             there
FW    foreign word                                  gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte terram fiche oui corporis ...
IN    preposition or conjunction, subordinating     astride among uppon whether out inside pro despite on by throughout below within for towards near behind atop around if like until below next into if beside ...
JJ    adjective or numeral, ordinal                 third ill-mannered pre-war regrettable oiled calamitous first separable ectoplasmic battery-powered participatory fourth still-to-be-named multilingual multi-disciplinary ...
JJR   adjective, comparative                        bleaker braver breezier briefer brighter brisker broader bumper busier calmer cheaper choosier cleaner clearer closer colder commoner costlier cozier creamier crunchier cuter ...
JJS   adjective, superlative                        calmest cheapest choicest classiest cleanest clearest closest commonest corniest costliest crassest creepiest crudest cutest darkest deadliest dearest deepest densest dinkiest ...
LS    list item marker                              A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005 SP-44007 Second Third Three Two * a b c d first five four one six three two
MD    modal auxiliary                               can cannot could couldn't dare may might must need ought shall should shouldn't will would
NN    noun, common, singular or mass                common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist ...
NNP   noun, proper, singular                        Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA Shannon A.K.C. Meltex Liverpool ...
NNPS  noun, proper, plural                          Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques Apache Apaches Apocrypha ...
NNS   noun, common, plural                          undergraduates scotches bric-a-brac products bodyguards facets coasts divestitures storehouses designs clubs fragrances averages subjectivists apprehensions muses factory-jobs ...
PDT   pre-determiner                                all both half many quite such sure this
POS   genitive marker                               ' 's
PRP   pronoun, personal                             hers herself him himself hisself it itself me myself one oneself ours ourselves ownself self she thee theirs them themselves they thou thy us
PRP$  pronoun, possessive                           her his mine my our ours their thy your
RB    adverb                                        occasionally unabatingly maddeningly adventurously professedly stirringly prominently technologically magisterially predominately swiftly fiscally pitilessly ...
RBR   adverb, comparative                           further gloomier grander graver greater grimmer harder harsher healthier heavier higher however larger later leaner lengthier less-perfectly lesser lonelier longer louder lower more ...
RBS   adverb, superlative                           best biggest bluntest earliest farthest first furthest hardest heartiest highest largest least less most nearest second tightest worst
RP    particle                                      aboard about across along apart around aside at away back before behind by crop down ever fast for forth from go high i.e. in into just later low more off on open out over per pie raising start teeth that through under unto up up-pp upon whole with you
SYM   symbol                                        % & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R * ** ***
TO    "to" as preposition or infinitive marker      to
UH    interjection                                  Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly man baby diddle hush sonuvabitch ...
VB    verb, base form                               ask assemble assess assign assume atone attention avoid bake balkanize bank begin behold believe bend benefit bevel beware bless boil bomb boost brace break bring broil brush build ...
VBD   verb, past tense                              dipped pleaded swiped regummed soaked tidied convened halted registered cushioned exacted snubbed strode aimed adopted belied figgered speculated wore appreciated contemplated ...
VBG   verb, present participle or gerund            telegraphing stirring focusing angering judging stalling lactating hankerin' alleging veering capping approaching traveling besieging encrypting interrupting erasing wincing ...
VBN   verb, past participle                         multihulled dilapidated aerosolized chaired languished panelized used experimented flourished imitated reunifed factored condensed sheared unsettled primed dubbed desired ...
VBP   verb, present tense, not 3rd person singular  predominate wrap resort sue twist spill cure lengthen brush terminate appear tend stray glisten obtain comprise detest tease attract emphasize mold postpone sever return wag ...
VBZ   verb, present tense, 3rd person singular      bases reconstructs marks mixes displeases seals carps weaves snatches slumps stretches authorizes smolders pictures emerges stockpiles seduces fizzes uses bolsters slaps speaks pleads ...
WDT   WH-determiner                                 that what whatever which whichever
WP    WH-pronoun                                    that what whatever whatsoever which who whom whosoever
WP$   WH-pronoun, possessive                        whose
WRB   WH-adverb                                     how however whence whenever where whereby whereever wherein whereof why
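The tag names in the table are systematic: all noun tags start with NN, all verb tags with VB, adjective tags with JJ and adverb tags with RB. The Python sketch below uses this regularity to map fine-grained tags to coarse word classes. The grouping, the function name `coarse_class` and the hand-tagged example sentence are assumptions made for illustration and are not part of the Penn Treebank itself.

```python
# Tag prefixes and the coarse class they are assumed to indicate.
COARSE_PREFIXES = [
    ("NN", "noun"),       # NN, NNP, NNPS, NNS
    ("VB", "verb"),       # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),  # JJ, JJR, JJS
    ("RB", "adverb"),     # RB, RBR, RBS
]


def coarse_class(tag):
    """Map a Penn Treebank tag to a coarse word class, or 'other'."""
    for prefix, name in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return name
    return "other"


# A hand-tagged example sentence (illustrative, not taken from the corpus).
tagged = [("John", "NNP"), ("eats", "VBZ"), ("quickly", "RB"), (".", ".")]
print([(word, coarse_class(tag)) for word, tag in tagged])
```

Tags such as WRB or PRP do not start with any of these prefixes, so they fall into the "other" class in this sketch.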


Sentence: Syntax

We move next to the sentence level. There are various definitions of what a sentence is. Sentences have an internal structure: they are composed of smaller parts that have a grammatical function. There are different ways to describe sentence structure. The structure description captures the dependencies among the words within the sentence. They reflect constraints on operations such as: The two major description methods are:

Phrases are groups of words that depend on a "central word" that we call the phrase head. Common Phrases include - depending on the part of speech of the head:

One computational problem we will study is that of parsing a linear sequence of words into a constituent tree, together with the questions that must be addressed to model this problem. Why is syntactic structure important? For example, consider the following problem: you are given two strings:
S1: John eats
S2: John sleeps
You are asked to build a correct English sentence that combines the two events into one sentence expressing that S2 occurred after S1 in the past. A possible solution is:
S3: After eating, John slept
Variations of the same task include: given an input string S1, generate the string N1, which is the negation of S1; or Q1, which is the yes-no question form of S1; or P1, which is S1 expressed in the past tense; etc.
Working directly on strings, such transformations are very difficult to achieve. One needs to know the syntactic structure of the small sentences in order to be able to combine them appropriately.
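To make the point concrete, here is a minimal Python sketch in which each small sentence is represented as a structure (subject plus verb lemma) rather than as a string. Everything in it is an assumption for illustration: the `Clause` class, the tiny hand-written `VERB_FORMS` lexicon, and the restriction to third person singular subjects and to the two verbs of the example.

```python
from dataclasses import dataclass

# Hand-written verb forms for the two verbs of the example (illustrative lexicon).
VERB_FORMS = {
    "eat": {"3sg": "eats", "past": "ate", "ing": "eating"},
    "sleep": {"3sg": "sleeps", "past": "slept", "ing": "sleeping"},
}


@dataclass
class Clause:
    subject: str  # assumed to be third person singular
    verb: str     # verb lemma, must be a key of VERB_FORMS


def present(clause):
    return f"{clause.subject} {VERB_FORMS[clause.verb]['3sg']}"


def past(clause):
    return f"{clause.subject} {VERB_FORMS[clause.verb]['past']}"


def negation(clause):
    return f"{clause.subject} does not {clause.verb}"


def yes_no_question(clause):
    return f"Does {clause.subject} {clause.verb}?"


def after(first, second):
    """Express that `second` happened after `first`, in the past, for a shared subject."""
    if first.subject != second.subject:
        raise ValueError("this sketch only handles clauses with the same subject")
    return f"After {VERB_FORMS[first.verb]['ing']}, {past(second)}"


s1, s2 = Clause("John", "eat"), Clause("John", "sleep")
print(present(s1))          # S1
print(after(s1, s2))        # S3
print(negation(s1))         # N1
print(yes_no_question(s1))  # Q1
print(past(s1))             # P1
```

Each transformation is a short function once the subject and the verb are known separately. The same transformations written directly over the strings "John eats" and "John sleeps" would first have to recover that structure.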

Descriptive Syntax of the Clause

The syntax for the clause category is the most complex. It can be described along four perspectives (note: we use here descriptive terminology from systemic linguistics introduced by M. Halliday - see for example Introduction to Functional Grammar, Halliday, 1985):
  1. Transitivity: determines the type of the main process and its participants. (This is often called the "argument structure" of the clause.)
  2. Circumstantials: determines the structure of modifiers to the predicate and to the clause as a whole.
  3. Mood: determines whether the clause is finite (declarative, interrogative or relative) or non-finite (imperative, infinitive, participial). For finite moods, tense and person are specified. For non-finite, they are not.
  4. Voice: active, passive, causative etc.
The transitivity system determines what participants contribute to the meaning of the clause - when the clause is viewed as a description of an event or relation in the world. Semantically, a clause is a description of a process - a generic term that can refer to either an event or a relation. Participants are the constituents within the clause that satisfy the following linguistic criteria: NOTE: Each one of these criteria taken alone is not sufficient to characterize participants, but taken together they have proven quite reliable.
Semantically, participants correspond to the nuclear roles of the process. Additional terms can then be added compositionally to modify the meaning of the process or of the predicate (these correspond to sentence and predicate adjuncts, as explained next).

Argument Structure: Lexical properties of the verb

The main verb of a clause determines which participants (arguments) are required in the clause. Since many verbs share the same argument structures, grammars have traditionally given names to common argument structures (for example: material process) and to each of the mandatory arguments they expect (for example: agent). In systemic grammar, the following figure describes the most common argument structures in English:

While systemic linguists have developed complex transitivity systems to account for the structure of the clause, lexicalist approaches to grammar have developed the notion of subcategorization. The idea is that in any phrase, the lexical head determines the structure and order of the subconstituents of the phrase (or of the dependents, in a dependency-based grammar like Meaning Text Theory (Melcuk and Pertsov, 1987)). To capture this dependency, HPSG grammars describe each lexical item capable of governing arguments with a feature called the subcategorization list (short name: subcat). The subcat of a verb is a list of the constituents the verb expects in order to form a complete phrase. So, for example, the verb to deal as in "AI deals with logic" is represented in the lexicon with a subcat:

deal[Subcat( [ NP ] [ PP : with ])]
which indicates that to produce a clause headed by this verb, one must find an NP and a PP headed by the preposition with.
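A subcat entry like the one for deal can be sketched as a small data structure with a check against candidate constituents. This is a simplified illustration, not HPSG: the `LEXICON` dictionary, the function `satisfies_subcat` and the encoding of each constituent as a (category, head word) pair are all assumptions made for the example.

```python
# Each subcat item is (category, required head word or None if any head is allowed).
LEXICON = {
    "deal": [("NP", None), ("PP", "with")],
}


def satisfies_subcat(verb, constituents):
    """Check that constituents, given as (category, head) pairs, match the verb's subcat list in order."""
    expected = LEXICON[verb]
    if len(constituents) != len(expected):
        return False
    for (category, head), (exp_category, exp_head) in zip(constituents, expected):
        if category != exp_category:
            return False
        if exp_head is not None and head != exp_head:
            return False
    return True


# "AI deals with logic": an NP headed by "AI" and a PP headed by "with".
print(satisfies_subcat("deal", [("NP", "AI"), ("PP", "with")]))
# A PP headed by another preposition does not satisfy the subcat.
print(satisfies_subcat("deal", [("NP", "AI"), ("PP", "on")]))
```

Keeping the expectation in the lexical entry of the verb means that adding a new verb only requires a new entry, not a new grammar rule.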

Adjuncts and Disjuncts

Constituents in the clause which are not participants are classified in terms of their "distance" from the verb. The three levels are called predicate adjuncts, sentence adjuncts and disjuncts. These are the three types of circumstantials. The following tables give examples of each of these circumstantials:

The Mood of Clauses

The various options for the mood of clauses in English are summarized in this table:


Noun Phrase

Noun phrases in English have a strict syntactic structure: they are composed of a sequence of slots, where each slot can be filled by constituents of specific categories.



Last modified 01 Nov 2015