Reviews

This page contains the reviews for the paper (now tech-report) "Precision-biased Parsing and High-Quality Parse Selection".
============================================================================
EMNLP-CoNLL 2012  Reviews for Submission #477
============================================================================

Title: Precision-biased Parsing and High-Quality Parse Selection

Authors: Yoav Goldberg and Michael Elhadad
============================================================================
                           REVIEWER #1
============================================================================


---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

                        Appropriateness: 5
                                Clarity: 4
           Originality / Innovativeness: 3
                Soundness / Correctness: 3
     References / Meaningful Comparison: 4
             Impact of Ideas or Results: 3
                    Impact of Resources: 1
                         Recommendation: 2.5


---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------

This paper describes techniques for identifying high-risk decisions in
dependency parsers in order to avoid making them, thus increasing precision at
the expense of recall. While the idea is interesting, I thought the paper to
be lacking in both results and analysis, and think the endeavor has the feel of
early work more appropriate to a workshop or short paper track than EMNLP's
main proceedings.

To begin, the precision / recall tradeoffs achieved seem small to me.  As the
authors note, there is not a lot of prior work on this task (that I am aware
of) to compare against, but I expected much more clearly separated lines in the
figures at the end of the paper.  ~25% coverage at ~95% accuracy just seems too
low to me.  Now, this may just be my own mis-calibrated gut, but the paper also
lacks any other means of empirically demonstrating the validity of these
approaches.  I agree that there are situations in which it would be useful to
have *only* high-precision parse trees or parse edges, and the authors list a
number of them in the Introduction.  However, the real evaluation for your
approaches is to see whether the proposed techniques actually improve some
downstream task.  Your techniques seem to identify risky or difficult
attachment decisions, but how do we know (apart from evaluation) that those are
not also the most important attachment decisions?  Skipping hard sentences is
useful only if you can justify it.
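
As an aside on how I read the coverage/accuracy numbers: points like ~25%
coverage at ~95% accuracy presumably come from sweeping a threshold over
per-edge confidence scores. A minimal sketch of such a sweep, with made-up
numbers (my own illustration, not the authors' code):

    # Coverage/precision sweep over per-edge confidence scores (toy data).
    # scored_edges: (confidence, is_correct) pairs for a set of edges.
    def coverage_precision(scored_edges, threshold):
        kept = [ok for conf, ok in scored_edges if conf >= threshold]
        coverage = len(kept) / len(scored_edges)
        precision = sum(kept) / len(kept) if kept else 1.0
        return coverage, precision

    edges = [(0.95, True), (0.90, True), (0.80, False), (0.70, True),
             (0.60, True), (0.50, False), (0.30, True), (0.20, False)]
    for th in (0.9, 0.6, 0.0):
        cov, prec = coverage_precision(edges, th)
        print(f"threshold={th}: coverage={cov:.2f}, precision={prec:.2f}")
    # threshold=0.9: coverage=0.25, precision=1.00
    # threshold=0.6: coverage=0.62, precision=0.80
    # threshold=0.0: coverage=1.00, precision=0.62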

Another point of weakness was the analysis.  In lieu of results, an extensive
analysis that catalogued the types and frequencies of errors could have been
useful.  The difficult decisions for parsers are well understood to be
phenomena like PP attachment and NP coordination, and many of these difficult
decisions cannot be resolved without semantic or world knowledge.  It is a good
sign that your approach identifies such edges (PP attachment, at least) as
among the high-risk ones, but there really was no error analysis, and I am
confused as to how to interpret the 41% precision (961 / 2314) in identifying
PP attachment mistakes (§8).

Other issues
• §7.2 Define ROC
• Citations that perform a grammatical role in the sentence should be written
with only the year parenthesized, e.g., "…was shown in Goldberg and Elhadad
(2010)", not "…was shown in (Goldberg and Elhadad, 2010)".  In LaTeX, this is
accomplished by using \newcite instead of \cite.
• Some effort went into distinguishing the riskiness of parser *decisions*
versus the riskiness of parser *edges*.  I thought the descriptive examples to
be helpful, but found the presentation confusing, and had some difficulty
relating the distinction to the Experiments and Results section.
• You might want to cite Roark and Hollingshead (COLING 2008), who depend on
high-precision identification of constituent boundaries to achieve quadratic
parsing time (empirical) with CKY.

Really picky stuff
• The fonts in Figures 1 to 4 should be changed to Times so as to match the
text of the paper.
• Use subsections instead of bolded paragraph headings (§7.2).

============================================================================
                           REVIEWER #2
============================================================================


---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

                        Appropriateness: 5
                                Clarity: 4
           Originality / Innovativeness: 2
                Soundness / Correctness: 2
     References / Meaningful Comparison: 3
             Impact of Ideas or Results: 2
                    Impact of Resources: 1
                         Recommendation: 2.5


---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------

This paper describes two ways to generate confidence scores for high-precision
parse selection. One is to use an ensemble of several parsers. The second is
to train a separate classifier to measure edge confidence within a
transition-based parser.
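
For concreteness, here is how I understand the ensemble scoring (a minimal
sketch under my own assumptions, where an edge's confidence is the fraction of
parsers agreeing on a token's head; the paper's details may differ):

    # Ensemble-based edge confidence: fraction of parsers that agree on
    # each token's head (illustrative sketch, not the authors' code).
    def edge_confidences(predicted_heads):
        # predicted_heads: one list of head indices per parser.
        n_parsers = len(predicted_heads)
        scores = []
        for heads in zip(*predicted_heads):
            majority = max(set(heads), key=heads.count)
            scores.append((majority, heads.count(majority) / n_parsers))
        return scores

    # Three parsers, four tokens; they disagree on token 2's head.
    print(edge_confidences([[2, 0, 2, 3], [2, 0, 1, 3], [2, 0, 2, 3]]))
    # -> [(2, 1.0), (0, 1.0), (2, 0.666...), (3, 1.0)]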

The second approach is somewhat confusing to me. The underlying parser uses a
similar discriminative learning method to score edges. Why not use the new
features directly, instead of introducing a second classifier? This is an
obvious missing baseline in this work.

I like the ROC curve analysis that the authors provided. However, there is
nothing surprising in it; such analysis is quite standard practice across
various fields of NLP.
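
For reference, the ROC analysis itself takes only a few lines with standard
tooling; a sketch assuming binary per-edge correctness labels and confidence
scores, with made-up values:

    # ROC curve over per-edge confidence scores (toy data).
    from sklearn.metrics import roc_auc_score, roc_curve

    y_true  = [1, 1, 0, 1, 0, 1, 0, 1]            # 1 = edge correct in gold tree
    y_score = [.9, .8, .75, .7, .6, .55, .4, .3]  # confidence per edge

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print("AUC:", roc_auc_score(y_true, y_score))
    for f, t, th in zip(fpr, tpr, thresholds):
        print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")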

============================================================================
                           REVIEWER #3
============================================================================


---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

                        Appropriateness: 5
                                Clarity: 5
           Originality / Innovativeness: 3
                Soundness / Correctness: 4
     References / Meaningful Comparison: 4
             Impact of Ideas or Results: 4
                    Impact of Resources: 1
                         Recommendation: 4


---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------

This paper describes a novel problem of selecting high-precision arcs from a
dependency tree, and its application to selecting high-precision trees. Three
solution families are presented: a simple ensemble-based solution, and two
(also simple) classifier solutions, one that classifies parser actions as
risky, and another that directly classifies arcs as risky. The paper closes
with an interesting discussion of precision and prepositional attachment.

This is a very well written paper tackling a novel problem with simple
techniques. The results are presented very clearly, and in such a way that I am
eager to try their approach on my problem. The maximum entropy classifier
solution is fairly simple, and if the paper has a clarity flaw, it is that the
title and some of the front matter imply that the precision modification will
affect the parser itself, while as far as I can tell, the end result is
something that assesses a complete parser derivation (either viewed as a
sequence of actions, or as a tree of edges) and then prunes out edges post-hoc.
As such, this is more of a risky-edge-detection task than a high-precision
parsing task.
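
To make my reading concrete, the end result behaves roughly like the following
post-hoc filter (my own sketch; the names and cutoff are assumptions, not
taken from the paper):

    # Post-hoc edge pruning: parse fully, then drop low-confidence edges.
    def prune_low_confidence_edges(heads, confidences, cutoff=0.9):
        # Returns a partial parse: head index, or None where pruned.
        return [h if c >= cutoff else None
                for h, c in zip(heads, confidences)]

    # Token 2's attachment is deemed risky and removed from the output.
    print(prune_low_confidence_edges([2, 0, 1, 3], [0.97, 0.99, 0.62, 0.95]))
    # -> [2, 0, None, 3]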

One other problem is the lack of an extrinsic evaluation, which would tell us
just how large an impact 84% coverage and 96% precision (or 70% and 96.5% in
the case of the Maxent solution) has on some downstream tasks.

Typos and small complaints:

Last paragraph of 3.1: “Their method identify”

Last paragraph of 5: “Such ensemble was”

Footnote 6: Footnote within a footnote (presumably to point to the location of
MegaM) fails silently, leaving us a dangling footnote 7.

Table 2: Bottom right-hand-corner mislabeled as FP.

Update: Thanks for your response.