Summarization

Task Definition

Our objective is: Given a text as input, generate a shorter version that contains "most of the relevant information".

Let us try to better understand this task by observing desirable variations of the input and output, and the different contexts in which we may be interested in it.

In recent years, two variants of the generic summarization task have been researched:
  1. Update summarization: a first document set A is read by the user; then a second document set B, on the same topic, is received. The task consists of generating a summary of document set B while avoiding repeating information already covered by document set A.
  2. Query Focused summarization: an input query Q is given together with a set of documents that include an answer to the query (but also more information which may not be relevant to the query). The task is to provide a summary of the document set that is relevant to the query Q.
These two variants are interesting because they help us distinguish, in an empirical manner, between the notion of information importance / centrality / relevance and the risk of information redundancy (repeating information that is already known).

General Architecture of a Text Summarizer

As a general approach, we will adopt the following architecture to model the task of a summarizer:
  1. Text representation: capture the significant aspects of the input text in an abstract text representation.
  2. Ranking: sort the elements in the text representation in terms of information significance.
  3. Extraction: extract from the sorted representation the elements that will be rendered in the summary. Coherence and cohesion constraints may force us to expand the selection to more items than dictated by significance alone. In addition, the requirement to avoid redundancy (selecting the same content more than once) may also force us not to select items based on significance alone.
  4. Realization: render the selected content into fluent text.
This general architecture can be instantiated in many different ways. The methods most frequently used rely on extractive summarization: the summary is constructed by cut-and-paste operations, in which units from the source documents are selected and pasted together into the target summary.

Evaluating Summaries

One of the most difficult issues in summarization is to evaluate how well a summary fulfills its function. Several methods have been proposed to evaluate the quality of a summary:
  1. Human evaluation
  2. Compare with gold summary
  3. Task-based evaluation

DUC Shared Tasks

Since 2000, a group of researchers has worked on the definition of a shared task - a set of benchmark datasets and evaluation methods that are shared by different research teams, so that results can be compared and reproduced in an objective manner. A series of conferences called SUMMAC, DUC and TAC has taken place every year, in the form of competitions among participating laboratories.

This shared effort has generated rich datasets and common experience among researchers. We will use in our experiments a specific instance: DUC 2004 Task 2.

DUC Sample Document Clusters

Document clusters in DUC shared tasks are typically gathered from news agencies. Each cluster includes about 50 documents that deal with the same news topic. For example, two documents from the same cluster are available here:
  1. document1
  2. document2

DUC 2004 Task 2: Manual Evaluation

DUC 2004 consisted of five tasks. We describe here only Task 2. Task 2 was to produce summaries with a maximum length of 665 characters for 50 clusters of 10 documents each. These summaries are to be general, and not focused on a specific query.

The documents come from the AQUAINT Corpus of English News Text containing documents from Associated Press Newswire, New York Times Newswire, and Xinhua News Agency. The texts were tagged to identify document source information and the textual elements to be processed.

Human assessors hand-generated four summaries for each of the tasks. For Task 2, a single summary is selected as the model against which participating systems are judged. Each of the 665-character model summaries is analyzed into meaning units, usually corresponding to sentence clauses conveying short nuggets of information.

Each of the non-selected hand summaries is then scored by the assessors to provide an indication of the variability among human summarizers.

Overall summary quality is also judged using seven quality questions, each judged on a scale from 1 to 5, measuring the linguistic quality of the summary. The following questions are used as part of the DUC/TAC/SUMMAC shared-task evaluations. They aim at capturing in a systematic way the linguistic quality of the text produced by the summarizer.

  1. Does the summary build from sentence to sentence to a coherent body of information about the topic?
    1. Very coherently
    2. Somewhat coherently
    3. Neutral as to coherence
    4. Not so coherently
    5. Incoherent
  2. If you were editing the summary to make it more concise and to the point, how much useless, confusing or repetitive text would you remove from the existing summary?
    1. None
    2. A little
    3. Some
    4. A lot
    5. Most of the text
  3. To what degree does the summary say the same thing over again?
    1. None; the summary has no repeated information
    2. Minor repetitions
    3. Some repetition
    4. More than half of the text is repetitive
    5. Quite a lot; most sentences are repetitive
  4. How much trouble did you have identifying the referents of noun phrases in this summary? Are there nouns, pronouns or personal names that are not well-specified? For example, a person is mentioned and it is not clear what his role in the story is, or any other entity is referenced but its identity and relation to the story remain unclear.
    1. No problems; it is clear who/what is being referred to throughout.
    2. Slight problems, mostly cosmetic/stylistic
    3. Somewhat problematic; some minor events/things/people/places are unclear, or a very few major ones, but overall the who and what are clear.
    4. Rather problematic; enough events/things/people/places are unclear that parts of the summary are hard to understand
    5. Severe problems; main events, characters or places are not well-specified and/or it's difficult to say how they relate to the topic
  5. To what degree do you think the entities (person/thing/event/place/...) were re-mentioned in an overly explicit way, so that readability was impaired? For example, a pronoun could have been used instead of a lengthy description, or a shorter description would have been more appropriate.
    1. None: references to entities were acceptably explicit
    2. A little: once or twice, an entity was over-described
    3. Somewhat: to a noticeable but not annoying degree, some entities were over-described
    4. Rather problematic: to a degree that became distracting, entities were over-described
    5. A lot: reintroduction of characters and entities made reading difficult/caused comprehension problems
  6. Are there any obviously ungrammatical sentences, e.g., missing components, unrelated fragments or any other grammar-related problem that makes the text difficult to read?
    1. No noticeable grammatical problems
    2. Minor grammar problems
    3. Some problems, but overall acceptable
    4. A fair amount of grammatical errors
    5. Too many problems, the summary is impossible to read
  7. Are there any datelines, system-internal formatting or capitalization errors that can make the reading of the summary difficult?
    1. No noticeable formatting problems
    2. Minor formatting problems
    3. Some, but they do not create any major difficulties
    4. A fair amount of formatting problems
    5. Many, to an extent that reading is difficult

ROUGE Evaluation

Summaries can also be measured by an automatic tool, ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE compares a submitted summary with a manual summary, after stemming each word in the summaries, by counting the proportion of words in the manual summary that also appear in the evaluated text. ROUGE-L and ROUGE-W measure the longest common subsequence between the two summaries, and a weighted form of the longest common subsequence, respectively.

ROUGE evaluation is an automatic (and therefore cost-effective) evaluation method. It compares a summary with a gold-standard summary (one generated manually). There are different variants of the ROUGE computation, which are presented in the papers describing the metric.

ROUGE evaluation has been shown to be somewhat correlated with manual evaluation using the "basic-elements" evaluation method.
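
To make the word-overlap computation concrete, here is a minimal sketch of a ROUGE-1-style recall score (stemmed unigram overlap). It is only an illustration of the principle, not the official ROUGE toolkit, and it assumes NLTK's tokenizer and Porter stemmer:

from collections import Counter
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def rouge1_recall(candidate, reference):
    """Proportion of (stemmed) reference unigrams that also appear in the candidate summary."""
    stemmer = PorterStemmer()
    cand = Counter(stemmer.stem(w.lower()) for w in word_tokenize(candidate))
    ref = Counter(stemmer.stem(w.lower()) for w in word_tokenize(reference))
    overlap = sum(min(count, cand[w]) for w, count in ref.items())
    return overlap / sum(ref.values())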

Task-Based Evaluation

Task-based evaluation is an extrinsic method: instead of judging the quality of the summary by looking at its internal properties, we evaluate how the summary helps people perform an external task using the summary instead of using a different summary or using the source documents.

The following paper: Do Summaries Help? A Task-Based Evaluation of Multi-Document Summarization shows an example of a task-based evaluation method.

Ultimately, task-based evaluations are the most valuable indicator of the quality of summaries. But they are extremely expensive to produce.

SumBasic: Impact of Frequency on Sentence Selection

We will first review a simple instance of the general architecture for multi-document summarization described above. The method is presented in this paper: The Impact of Frequency on Summarization by A. Nenkova and L. Vanderwende, 2005 -- available here in pdf format.

The idea of the method is to investigate how much the frequency of words in the cluster of input documents influences their selection in the summary.

We first must define what we mean by word frequency; the following parameters affect this notion:

  1. How are words tokenized: we will assume a word tokenization module is provided to us, similar to the one used in NLTK. For example, look at the Punkt sentence tokenizer and the Punkt word tokenizer. The tokenizer most often used is the one that follows the Penn Treebank conventions: Treebank Word Tokenizer. Contractions, such as "can't", are split into two tokens, for example: can't is tokenized as ca n't. The treebank tokenizer assumes that the text has already been segmented into sentences. Any periods -- apart from those at the end of a string -- are assumed to be part of the word they are attached to (e.g., for abbreviations), and are not separately tokenized (for example, "I.B.M." is tokenized as a single word).
  2. How are morphological inflections taken into account: we will assume that variants of the same word (for example, book/books) will be counted as 2 instances of the same word (book). To this end, we will use the Stemming module in NLTK - in particular Porter's algorithm (NLTK's version and Porter's version available in many programming languages).
  3. Are all words taken into account, or only content words? We will assume that all stop words are first filtered from the text before we compute word frequencies. There are various lists of stop words available; we can use this list for our experiments. (A preprocessing sketch along these lines follows this list.)
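
Here is a minimal preprocessing sketch along these lines, using NLTK. The specific stop-word list (NLTK's English list) is an assumption; the notes leave the choice open.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TreebankWordTokenizer

# One-time setup: nltk.download('punkt'); nltk.download('stopwords')
_tokenizer = TreebankWordTokenizer()
_stemmer = PorterStemmer()
_stopwords = set(stopwords.words('english'))   # stand-in for whichever stop-word list is chosen

def content_word_stems(text):
    """Sentence-split, tokenize, lowercase, drop stop words and punctuation, and stem."""
    stems = []
    for sentence in nltk.sent_tokenize(text):        # Punkt sentence tokenizer
        for token in _tokenizer.tokenize(sentence):  # Treebank word tokenizer
            token = token.lower()
            if token.isalpha() and token not in _stopwords:
                stems.append(_stemmer.stem(token))
    return stems
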
Under these conditions, we can now map the SumBasic algorithm to the general architecture defined above (a code sketch of the full algorithm follows this list):
  1. Text representation: we represent the input text as a bag of words - that is, the documents are represented as vectors of stemmed content words with their frequency. The units of representation are the sentences. This means the method will rank full sentences from the input documents and include them in the target summary when they are selected.
  2. Ranking: We compute a significance score for input sentences based on the frequency of the words they contain. The method computes the probability distribution over the words wi appearing in the input, p(wi) for every i;
        p(wi) = ni/N
        
    where ni is the number of times the word wi appeared in the input, and N is the total number of content word tokens in the input.

    We then use the estimation of the word probabilities to compute the score of the sentences: For each sentence Sj in the input, compute a weight equal to the average probability of the words in the sentence:

        weight(Sj) = ( Σwi ∈ Sj p(wi) ) / |{wi | wi ∈ Sj}|
        
  3. Extraction: extract from the sorted representation the elements that will be rendered in the summary. The way this is implemented in SumBasic is:
        Pick the sentence S* with the highest score 
        and append it to the output summary.
        
        For all words wk in S* do:
            pnew(wk) = pold(wk) * pold(wk)
    
        Re-compute the scores of the remaining sentences 
        ({Sj}\S*) using the re-estimated p(w)
        and iterate until the summary has reached the 
        maximum allowed length.
        
    The intuition of this re-estimation of pnew is that once words have been selected into the output summary, they become less important for the rest of the summary. This avoids redundancy.
  4. Realization: output the sentences in the order they have been selected.
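
The following is a minimal sketch of SumBasic as described above. The content_word_stems preprocessing function from the earlier sketch, the character-based length budget (665 characters, as in DUC 2004), and the handling of over-long sentences are assumptions made to keep the example concrete.

from collections import Counter

def sumbasic(sentences, max_chars=665, preprocess=lambda s: s.lower().split()):
    """Greedy SumBasic: repeatedly pick the highest-weight sentence, then discount its words.
    Pass preprocess=content_word_stems to use stemmed content words."""
    # 1. Text representation: each sentence as a bag of (stemmed content) words.
    bags = [preprocess(s) for s in sentences]
    counts = Counter(w for bag in bags for w in bag)
    total = sum(counts.values())
    p = {w: n / total for w, n in counts.items()}      # p(wi) = ni / N

    summary, length = [], 0
    remaining = set(range(len(sentences)))
    while remaining:
        # 2. Ranking: weight(Sj) = average probability of the distinct words in Sj.
        def weight(j):
            words = set(bags[j])
            return sum(p[w] for w in words) / len(words) if words else 0.0
        best = max(remaining, key=weight)
        remaining.remove(best)
        if length + len(sentences[best]) > max_chars:
            continue                                   # sentence does not fit; try the next ones
        # 3. Extraction: append the sentence, then discount its words: pnew(w) = pold(w)^2.
        summary.append(sentences[best])
        length += len(sentences[best])
        for w in set(bags[best]):
            p[w] = p[w] * p[w]
    # 4. Realization: sentences are output in the order they were selected.
    return ' '.join(summary)

Sentences that do not fit in the remaining character budget are simply skipped in this sketch; the paper's full version of the algorithm also restricts each selection to sentences containing the currently most probable word.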

KL-Sum: Minimization of the Summary Vocabulary Divergence from the Input Vocabulary

KL-Sum is introduced in Exploring Content Models for Multi-Document Summarization (Haghighi and Vanderwende, 2009). This is also a sentence selection algorithm, where a target length for the summary is fixed (L words). The sentence selection criterion is:
S* = argmin{S : words(S) ≤ L} KL(PD || PS)

where PS is the unigram distribution of the candidate summary S,
and PD is the unigram distribution of the document collection D.

KL(P || Q) is computed as explained here.
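
For reference, the Kullback-Leibler divergence between two distributions P and Q over the same vocabulary is:

KL(P || Q) = Σw P(w) * log( P(w) / Q(w) )

In practice Q must be smoothed so that no word with non-zero probability under P has zero probability under Q.
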
According to this criterion, the objective of the summarizer is to find a set of sentences whose length is less than L words and whose unigram distribution is as similar as possible to the source document set. The global optimization of the criterion is exponential in the number of sentences in the document set D. As an approximation, KL-Sum uses a greedy optimization strategy:
  1. Set S = {} and d = ∞
  2. While |S| ≤ L do:
  3.   For i in [1..ND], compute di = KL(PD || PS+Si)
  4.   If there is no i such that di < d, stop
  5.   Otherwise, set S = S + Si for the sentence Si with minimum di, and set d = di
The last issue to take into account is how to order the selected sentences Si in the target summary. The method adopted is to follow the order in the source documents, that is: for each selected sentence Si extracted from document Dj, compute a position index pi (in [0..1]) which reflects the position of Si within Dj. The sentences in the target summary are ordered according to the value of pi.

Note that this algorithm takes no explicit step to avoid redundancy. The reason is that adding redundant sentences would affect the unigram distribution of words and increase divergence from the source document set.
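
A minimal sketch of this greedy strategy follows. The add-one smoothing of the summary distribution, the word-count budget computed on the preprocessed tokens, and the final ordering by sentence index (instead of the normalized position index pi) are simplifications introduced for this sketch.

import math
from collections import Counter

def kl_divergence(p_counts, q_counts, vocab):
    """KL(P || Q), with add-one smoothing of Q so it never assigns zero probability."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values()) + len(vocab)
    kl = 0.0
    for w in vocab:
        p = p_counts[w] / p_total
        if p > 0.0:
            q = (q_counts[w] + 1) / q_total
            kl += p * math.log(p / q)
    return kl

def kl_sum(sentences, max_words=100, preprocess=lambda s: s.lower().split()):
    """Greedily add the sentence that keeps KL(PD || PS) smallest."""
    bags = [Counter(preprocess(s)) for s in sentences]
    doc_counts = sum(bags, Counter())                  # PD (as raw counts)
    vocab = set(doc_counts)
    selected, summary_counts = [], Counter()
    best_d = float('inf')
    candidates = set(range(len(sentences)))
    while candidates and sum(summary_counts.values()) <= max_words:
        scored = [(kl_divergence(doc_counts, summary_counts + bags[i], vocab), i)
                  for i in candidates]
        d, i = min(scored)
        if d >= best_d:                                # no candidate improves: stop
            break
        best_d = d
        selected.append(i)
        summary_counts += bags[i]
        candidates.remove(i)
    # Order the selected sentences (here simply by their index in the input).
    return [sentences[i] for i in sorted(selected)]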

Graph-Based Extraction Methods

A family of sentence extraction algorithms relies on a graph-based representation of the input document collection. This is illustrated by the work of Erkan and Radev 2004: LexRank: Graph-based Centrality as Salience in Text Summarization.

Centroid-based Summarization

A first step to understand this method is to consider the simpler centroid-based method: Radev, Jing, Budzikowska, 2000, Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In this approach, the input consists of a set of documents, all related to the same topic. The documents Dj are split into sentences Si. We construct a graph where the nodes correspond to the sentences in the input cluster and the edges indicate sentence similarity. One way to assess sentence similarity is to detect whether the information conveyed by a sentence Si is also conveyed by another sentence Sk -- this relation would mean that Sk subsumes Si.

To assess sentence similarity, we consider a term-vector representation of the sentences: let (w1 ... wV) be all the words that appear in the sentences of the input document set. Each sentence is represented as a vector X = (x1...xV) where the coordinate xi is 1 if the word wi appears in the sentence, and 0 otherwise.

Given this vector representation, one can compute the dot product of two vectors X and Y representing two sentences as:

 X · Y = Σk=1..V xk * yk
When we work with boolean vectors (the coordinate values are 0 or 1), the dot product computes the number of words shared by the two sentences.

One can then compute the cosine of the two vectors X and Y as:

cosine(X, Y) = X · Y / ( |X| * |Y| )
where
|X| = sqrt( Σk=1..V xk^2 )
Note that cosine(X, X) = 1 (that is, two identical sentences have a cosine measure of 1) whereas two sentences with no shared words have cosine(X, Y) = 0. We can in general regard cosine(X, Y) as a measure of the word similarity between X and Y.
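
A small sketch of this boolean term-vector cosine (with naive whitespace tokenization and no stemming or stop-word filtering yet):

import math

def boolean_cosine(sentence_a, sentence_b):
    """Cosine between boolean term vectors: |shared words| / sqrt(|A| * |B|)."""
    a = set(sentence_a.lower().split())
    b = set(sentence_b.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# boolean_cosine("the cat sat on the mat", "the cat is on the mat")  ->  0.8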

Note that this word similarity measure does not take into account any distance between the words themselves (morphological derivation, synonymy etc). It also does not take into account the relative importance of words within the sentence.

To improve this boolean term-vector representation, one can introduce three modifications:

  1. Filter out stop-words from the input vocabulary so that one keeps only content words
  2. Stem all words in the input vocabulary so that similar words are mapped to the same stem
  3. Replace the boolean coordinate in the vectors with a measure of the importance of a word in a document (or sentence). This measure of importance of a word is often computed as the tf*idf measure:
        N = total number of documents in the document set.
        ni = number of documents containing the word wi
        ci,j = number of occurrences of wi in document Dj
    
        nci,j = ci,j / maxk (ck,j)  [Normalized frequency of wi in dj]
    
        idfi = log(N/ni)  [Inverse document frequency]
    
        tf-idfi,j = nci,j * idfi
        
    The key property of the tf-idf measure of relevance is that it includes information from the global distribution of words across documents together with specific document properties.
When using tf-idf weights and a filtered, stemmed vocabulary, cosine can still be used as a similarity measure among vectors representing sentences or documents.
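
A sketch of this tf-idf weighting, following the formulas above (computed at the document level, but the same scheme applies to sentences; the preprocessing function is an assumption):

import math
from collections import Counter

def tf_idf_vectors(documents, preprocess=lambda d: d.lower().split()):
    """Return one {word: tf-idf weight} dictionary per document."""
    bags = [Counter(preprocess(d)) for d in documents]
    N = len(documents)
    n = Counter(w for bag in bags for w in set(bag))   # ni = number of documents containing wi
    vectors = []
    for bag in bags:
        max_count = max(bag.values()) if bag else 1
        vec = {}
        for w, c in bag.items():
            nc = c / max_count                         # normalized frequency of wi in Dj
            idf = math.log(N / n[w])                   # inverse document frequency
            vec[w] = nc * idf
        vectors.append(vec)
    return vectors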

Given this sentence similarity measure, one can compute the centroid of a cluster of sentences: the centroid is a fictional sentence, encoded by the vector equal to the "average" of the vectors representing the sentences in the cluster. This average vector represents the "center" of the sentences. We can then compute the "centrality" of an existing sentence by measuring its distance to the centroid vector. This centrality measure is then used as the selection criterion in a sentence extraction method - with the usual adaptations of recomputing the cluster and the centroid after a sentence has been selected (to avoid redundancy in the created summary) and adopting a heuristic to order the selected sentences in the generated summary (based on the position of the extracted sentences in the original documents).
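
A minimal sketch of centroid-based ranking over such vectors (redundancy handling and sentence ordering are omitted; the sparse-dictionary representation is an assumption):

import math

def sparse_cosine(u, v):
    """Cosine similarity between two {word: weight} vectors."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_centroid(sentence_vectors):
    """Rank sentence indices by cosine similarity to the centroid (average) vector."""
    centroid = {}
    for vec in sentence_vectors:
        for w, weight in vec.items():
            centroid[w] = centroid.get(w, 0.0) + weight / len(sentence_vectors)
    scores = [sparse_cosine(vec, centroid) for vec in sentence_vectors]
    return sorted(range(len(sentence_vectors)), key=lambda i: scores[i], reverse=True)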

LexRank-based Summarization

LexRank: Graph-based Centrality as Salience in Text Summarization by Erkan and Radev, 2004 - introduces a refinement of the centroid-based method. They define a more precise criterion to decide on the centrality of a sentence within a cluster. The idea is to adjust the importance of a sentence within a graph of sentences by taking into account the "prestige" of the sentences pointing to the sentence.

The input to the task is a cluster of topically related sentences (coming from a set of documents). The first step of the model is to compute a sentence similarity matrix: for each pair of sentences in the cluster, we compute the cosine similarity di,j. Note that di,i = 1 (the diagonal of the matrix is filled with 1s). We can then apply a threshold on similarity -- setting d'i,j = 0 if di,j < threshold.

Given the filtered similarity matrix for a given threshold, one way of assessing sentence centrality is to count the number of similar sentences for each sentence. We define degree centrality of a sentence as the degree of the corresponding node in the similarity graph.

This definition of degree centrality considers all neighbors of a sentence as equally significant. The idea of LexRank is to rank the neighbors in terms of prestige: a sentence is more central if it is close to many high-prestige sentences. Recursively, the prestige of a sentence depends on the prestige of its neighbors.

This idea is modeled as follows: let p(s) be the prestige of a sentence. Every sentence has a centrality value and distributes this centrality to its neighbors. This formulation is captured by the following formula:

p(u) = Σv ∈ adj(u) p(v) / deg(v)

where:
p(u) is the centrality of node u
adj(u) is the set of nodes adjacent to u
deg(v) is the degree of the node v
The same equation can be rewritten in matrix notation as:
p = B^T p

or

p^T B = p^T

where B is the matrix derived from the adjacency matrix A using:

B(i,j) = A(i,j) / Σk A(i,k)
In order to derive the vector p which is the solution of this equation, one must ensure the matrix B satisfies mathematical properties derived in the paper (pp.465-466). To make sure this is the case, the equation is replaced with the variation:
p(u) = d/N  + (1-d)Σv ∈ adj(u) p(v) / deg(v)

where N is the total number of nodes in the graph and d is a damping factor typically in the range [0.1..0.2]
One interpretation of this equation is in terms of a random walk on the sentence graph: start at a sentence and jump to a neighbor sentence, selecting the neighbor according to the probability B(i,j). This is a Markovian process with transition kernel [dU + (1-d)B], where U is the matrix whose elements are all 1/N. With this transition matrix, one chooses a neighbor node with probability (1-d) and jumps anywhere else in the graph with probability d. This probability of escaping anywhere in the graph is what ensures the process is aperiodic (we cannot run forever in the same cycle) and irreducible (there cannot be a disconnected component in the graph from which we cannot escape). The Markov chain encoded by this process is therefore irreducible and aperiodic, and limn→∞ [dU + (1-d)B]^n converges to a well-defined limit.

The algorithm to compute the LexRank centrality of a set of sentences is then defined as follows:

Input:
    S: array[n] of sentences
    t: cosine threshold
    d: damping factor in the range [0.1..0.2]
Output:
    L: array[n] of Lexrank scores for the n sentences

array cosineMatrix[n,n]
array degree[n] = {0}
array L[n]
for i = 1 to n do:
  for j = 1 to n do:
    cosineMatrix[i,j] = idf-modified-cosine(S[i], S[j])
    if cosineMatrix[i,j] > t then:
      cosineMatrix[i,j] = 1
      degree[i]++
    else:
      cosineMatrix[i,j] = 0
for i = 1 to n do:
  for j = 1 to n do:
    cosineMatrix[i,j] = d/n + (1-d) * cosineMatrix[i,j]/degree[i]
L = powerMethod(cosineMatrix, n, ε)
return L
where the powerMethod procedure computes the stationary distribution of the matrix by iterating until convergence, as shown here:
Input: 
    M: a stochastic, irreducible and aperiodic matrix[N,N]
    ε: error tolerance
Output:
    eigenvector p
p0 = [1/N]  (All elements are initialized to 1/N)
t = 0
repeat:
  t++
  p_t = M^T * p_(t-1)
  δ = || p_t - p_(t-1) ||
until δ < ε
return p_t
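
For concreteness, here is a compact sketch of the same computation in Python with numpy. It takes a precomputed cosine similarity matrix as input (building that matrix with the idf-modified cosine is left to the preprocessing sketched earlier):

import numpy as np

def lexrank(sim_matrix, threshold=0.1, d=0.15, eps=1e-6):
    """Compute LexRank scores from an n x n sentence similarity matrix."""
    sim = np.asarray(sim_matrix, dtype=float)
    n = sim.shape[0]
    # Threshold the similarity graph: keep an edge only when similarity exceeds the threshold.
    adj = (sim > threshold).astype(float)
    degree = adj.sum(axis=1)
    degree[degree == 0] = 1.0                      # guard against isolated nodes
    # Damped, row-stochastic transition matrix: M[i,j] = d/n + (1-d) * adj[i,j] / degree[i]
    M = d / n + (1 - d) * adj / degree[:, None]
    # Power method: iterate p <- M^T p until convergence.
    p = np.full(n, 1.0 / n)
    while True:
        p_next = M.T @ p
        if np.linalg.norm(p_next - p) < eps:
            return p_next
        p = p_next

The highest-scoring sentences are then extracted into the summary, with the same redundancy and ordering heuristics used by the centroid method.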

Last modified Jan 10, 2016