
# Extraction of Lexico-syntactic Similarities

## Abstract

In this assignment you will work with a research paper, re-design its algorithm for the map-reduce pattern, run experiments on a large-scale input, and write a scientific report. In addition, you are required to apply the method presented in the paper to some other similarity measures as well.

1. Read the paper by Dekang Lin and Patrick Pantel: DIRT – Discovery of Inference Rules from Text.
2. Read the paper again, and make sure you understand it.
3. Examine the various aspects of the system.
4. Design a parallel algorithm, based on the Map-Reduce pattern, with the extensions described below.
5. Implement the system.
6. Run the experiments described below.
7. Analyze and report your results, as defined below.

Your work will be checked frontally by me (Meni) on Sunday and Monday, Aug 16-17 - there will be no checking session afterward. The frontal check will cover the various aspects of the assignment and any material taught in class during the semester (see the grading policy below). If you want to be checked earlier (which is a wise and welcome decision), coordinate the date with me.

## Notes on the implementation of the article

• Regarding the second constraint in Section 3.2 (the second bullet on page 3): in contrast to MiniPar (the parser used in the paper), the Stanford parser (the parser of your input) does not omit prepositions from the tree nodes, so your dependency path should include dependency relations that connect prepositions (IN and TO, besides the content words: noun, verb, adjective, adverb).
• As mentioned in the paper, you should only refer to dependency paths in which the head is a verb and the X/Y slots are nouns.
• As mentioned in the paper, you should only refer to heads which are verbs. It is highly recommended to filter out auxiliary verbs - e.g., is, are.
• Here you can find the list of the parts of speech that appear in the syntactic ngrams.
• As should be clear from the paper, there might be several dependency paths in a given syntactic ngram.
• You should implement the extension described below.
• Your implementation should support large test sets (bigger than the one provided for this project).
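The filtering constraints in the notes above can be expressed as simple part-of-speech tests. A minimal Python sketch, assuming Penn Treebank-style tags; the exact tag prefixes and the auxiliary-verb list are illustrative assumptions, not part of the assignment:

```python
# Illustrative auxiliary-verb list (an assumption; extend as needed).
AUXILIARY_VERBS = {"is", "are", "was", "were", "be", "been", "being",
                   "do", "does", "did", "have", "has", "had"}

def is_verb(pos):
    # Penn Treebank verb tags: VB, VBD, VBG, VBN, VBP, VBZ
    return pos.startswith("VB")

def is_noun(pos):
    # Penn Treebank noun tags: NN, NNS, NNP, NNPS
    return pos.startswith("NN")

def is_path_word(pos):
    # Content words plus prepositions (IN, TO), as required by the notes,
    # since the Stanford parser keeps prepositions in the tree.
    return (is_verb(pos) or is_noun(pos)
            or pos.startswith("JJ") or pos.startswith("RB")
            or pos in ("IN", "TO"))

def valid_path(head_word, head_pos, x_pos, y_pos):
    """Keep only dependency paths whose head is a non-auxiliary verb
    and whose X/Y slot fillers are nouns."""
    return (is_verb(head_pos)
            and head_word.lower() not in AUXILIARY_VERBS
            and is_noun(x_pos) and is_noun(y_pos))
```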

## The System

### Extension

The paper suggests PMI values as the features of the vectors, and the Lin method for measuring the similarity between two vectors (Equation 2 on page 4).

In addition to the PMI-LIN measure, you should also implement the TFIDF-COS measure, where the feature values are TFIDF instead of PMI and the similarity between two vectors is measured by Cosine instead of Lin, and the DICE-COVER measure, where the feature values are DICE and the similarity is measured by Cover.

The TFIDF value for a slot and one of its arguments is defined by: $$tfidf(p,Slot,w) = \frac{c(p,Slot,w)}{1 + log(c(*,Slot,w))}$$

Cosine similarity between two feature vectors is defined by: $$cosine(slot_1,slot_2) = \frac{\sum\nolimits_{w \in T(p_1,s) \cap T(p_2,s)} tfidf(p_1,s,w) \cdot tfidf(p_2,s,w)} {\sqrt{\sum \nolimits_{w \in T(p_1,s)} tfidf(p_1,s,w)^2} \cdot \sqrt{\sum \nolimits_{w \in T(p_2,s)} tfidf(p_2,s,w)^2}}$$

DICE value is defined by: $$dice(p,Slot,w) = \frac{2*c(p,Slot,w)}{c(p,Slot,*)+c(*,Slot,w)}$$

Cover similarity between two feature vectors is defined by: $$cover(slot_1,slot_2) = \frac{\sum\nolimits_{w \in T(p_1,s) \cap T(p_2,s)} dice(p_1,s,w)} {\sum \nolimits_{w \in T(p_1,s)} dice(p_1,s,w)}$$
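The four definitions above can be computed directly from triple counts c(p, Slot, w) and their marginals. A minimal Python sketch, assuming the counts are held in plain dictionaries (the dictionary names are illustrative, not part of the assignment):

```python
import math

# Illustrative data structures (an assumption, not mandated by the assignment):
#   c[(p, slot, w)]  -- count of word w in a slot of path p: c(p,Slot,w)
#   c_sw[(slot, w)]  -- marginal count c(*,Slot,w)
#   c_ps[(p, slot)]  -- marginal count c(p,Slot,*)

def tfidf(c, c_sw, p, slot, w):
    # tfidf(p,Slot,w) = c(p,Slot,w) / (1 + log c(*,Slot,w))
    return c[(p, slot, w)] / (1.0 + math.log(c_sw[(slot, w)]))

def dice(c, c_ps, c_sw, p, slot, w):
    # dice(p,Slot,w) = 2*c(p,Slot,w) / (c(p,Slot,*) + c(*,Slot,w))
    return 2.0 * c[(p, slot, w)] / (c_ps[(p, slot)] + c_sw[(slot, w)])

def cosine(vec1, vec2):
    # vec: {w: tfidf value} for one slot of one path.
    shared = set(vec1) & set(vec2)
    num = sum(vec1[w] * vec2[w] for w in shared)
    n1 = math.sqrt(sum(v * v for v in vec1.values()))
    n2 = math.sqrt(sum(v * v for v in vec2.values()))
    return num / (n1 * n2) if n1 and n2 else 0.0

def cover(vec1, vec2):
    # vec: {w: dice value}. Note that cover is asymmetric: it is
    # normalized only by vec1's total mass.
    shared = set(vec1) & set(vec2)
    den = sum(vec1.values())
    return sum(vec1[w] for w in shared) / den if den else 0.0
```

Note that cover, unlike cosine, is directional, which matches its use for entailment between templates.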

### Input

The input of the system is the Biarcs dataset of the syntactic ngrams resource, recently released by Google (generated by Yoav Goldberg - a former student of the course).

The format of the dataset is described in the README file; follow it when extracting dependency paths and their arguments.

For your convenience, we uploaded the dataset to S3. The following bucket contains the 99 files of the Biarcs dataset, where xx denotes the number of the file (00 to 98):

https://s3.amazonaws.com/dsp152/syntactic-ngram/biarcs/biarcs.xx-of-99

The mappers of your program can read these files with the standard FileInputFormat.
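Inside the mapper, each input line then has to be split into its fields. A hedged sketch of that step, assuming the line format described in the dataset's README (head_word, the ngram itself, a total count, and per-year counts, tab-separated, with each ngram token shaped like word/pos-tag/dep-label/head-index):

```python
def parse_biarc_line(line):
    """Parse one line of a syntactic-ngrams biarcs file.

    Assumed format (per the dataset README):
        head_word<TAB>syntactic-ngram<TAB>total_count<TAB>counts_by_year...
    where each ngram token is word/pos-tag/dep-label/head-index
    (head-index 0 marks the root).
    """
    head_word, ngram, total, _years = line.rstrip("\n").split("\t", 3)
    tokens = []
    for tok in ngram.split(" "):
        # A word may itself contain '/', so split from the right.
        word, pos, dep, head = tok.rsplit("/", 3)
        tokens.append((word, pos, dep, int(head)))
    return head_word, tokens, int(total)
```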

### Output

The output of the system should be composed of:

• The feature vectors of ALL extracted dependency paths found in the corpus (according to the input parameters), for PMI, TFIDF, and DICE values.
• The similarity measure between each pair of the given test set, for the PMI-LIN, TFIDF-COS, and DICE-COVER measures.

### Test Set

Omer Levy kindly provided us with a set of positive predicate rules and negative predicate rules. The files are tab-separated, where the proposition template on the left entails/does not entail the template on the right.

### Input Parameters

• DPMinCount The minimal count of the dependency paths. You should not build a feature vector for dependency paths that appear fewer times than this minimal count.
• MinFeatureNum The minimal number of (non-empty) features per dependency path. You should discard dependency paths that do not have this minimal number of features at slot X or at slot Y.
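These two parameters reduce to a simple predicate applied before emitting a feature vector. A minimal sketch, under one reading of the MinFeatureNum requirement (both the X and the Y slot must meet the minimum; the function and parameter names are illustrative):

```python
def keep_path(path_count, features_x, features_y, dp_min_count, min_feature_num):
    """Apply the two input parameters: discard dependency paths that are
    too rare, or whose X or Y slot has too few non-empty features."""
    if path_count < dp_min_count:
        return False
    return (len(features_x) >= min_feature_num
            and len(features_y) >= min_feature_num)
```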

## Experiments

You should run the system with reasonable parameters, on two types of inputs:

• Small 20% of the input files, i.e., 20 files.
• Large 100% of the input files, i.e., the whole set of 99 files.

## Reports

Your work should be followed by two reports:

• Design You should provide a document which describes the design of your system: the various components of the system and their functionality in terms of input and output. In case a component is based on Map-Reduce, you should describe the characteristics of the keys and the values of the mappers and the reducers, including the number of key-value pairs and their sizes, and an estimate of the memory usage.

• Analysis

• Calculate F1-measure for the given test-set, for each of the two input sizes defined above (small, large), and for the PMI-LIN, TFIDF-COS and DICE-COVER measures (choose a threshold for the True/False decision, based on the similarity score, according to the data).

• Draw a Precision-Recall Curve for the given test-set, for each of the two input sizes defined above (small, large), and for the PMI-LIN, TFIDF-COS and DICE-COVER measures.

• Error Analysis
• Make a list of 5 examples for each of the true-positive, false-positive, true-negative, and false-negative categories, for each of the PMI-LIN, TFIDF-COS, and DICE-COVER measures.
• Compare the similarity measures for these examples for the 20% and 100% training sets.
• Compare the similarity measures for these examples for the PMI-LIN, TFIDF-COS and DICE-COVER measures.
• Try to identify common errors and behaviors.
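Both the F1 computation and the precision-recall curve follow directly from scored test-set pairs. A minimal sketch, assuming each pair is represented as (similarity_score, gold_label) with gold_label True when the left template entails the right one (the representation is an assumption, not mandated):

```python
def f1_at_threshold(pairs, threshold):
    """Predict True when score >= threshold; return (precision, recall, f1)."""
    tp = sum(1 for s, g in pairs if s >= threshold and g)
    fp = sum(1 for s, g in pairs if s >= threshold and not g)
    fn = sum(1 for s, g in pairs if s < threshold and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def precision_recall_curve(pairs):
    """Sweep every observed score as a threshold to obtain the
    (precision, recall) points of a precision-recall curve."""
    thresholds = sorted({s for s, _ in pairs}, reverse=True)
    return [f1_at_threshold(pairs, t)[:2] for t in thresholds]
```

Running the sweep per measure (PMI-LIN, TFIDF-COS, DICE-COVER) and per input size gives both the curves and a principled way to pick the True/False threshold.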

## Submission

You should submit the design document, the code, and the analysis report.

For the frontal check, you should come with the output of your system (feature vectors, similarity measures).

During the frontal check you will be required to run the system on a very small corpus.

• Questions on the course material - 20% of the final grade

• Project - 40% of the final grade

• Article understanding, with references to issues taught in class - 20%
• System design - 20%
• System implementation - 20%
• Running of the system according to the instructions - 20%
• Experiments and analysis report - 20%

Good Luck!