Reviews

This page contains the reviews for the short paper (now a tech report) "Learning Sparser Perceptron Models", as well as my responses.
============================================================================
ACL-HLT 2011 Reviews for Submission #1284
============================================================================

Title: Learning Sparser Perceptron Models

Authors: Yoav Goldberg and Michael Elhadad
============================================================================
                           REVIEWER #1
============================================================================


---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

                        APPROPRIATENESS: 5
                                CLARITY: 3
             ORIGINALITY/INNOVATIVENESS: 3
                  SOUNDNESS/CORRECTNESS: 3
                  MEANINGFUL COMPARISON: 4
                              SUBSTANCE: 3
             IMPACT OF IDEAS OR RESULTS: 4
                          REPLICABILITY: 4
        IMPACT OF ACCOMPANYING SOFTWARE: 1
         IMPACT OF ACCOMPANYING DATASET: 1
                    REVIEWER CONFIDENCE: 4


---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------

This paper is about learning sparse perceptron models. The algorithm presented
in the paper is straightforward: during standard perceptron learning, it maintains
an extra vector that remembers the number of times each feature weight
participated in an update. The scoring function of the perceptron ignores
features that participated in fewer than a given minimum number of updates.

The results presented in the paper are strong: there are hardly any
differences from dense models, but the model size decreases by a great deal for
all three tasks on which the authors present results.

The problems that I think the paper has are the following.

1) The modification to the standard perceptron approach could have been
presented as an algorithm, instead of being described in a hand-wavy way through
text. I think the presentation would have been clearer if the standard
perceptron and the modified version were presented side by side, so that the
reader gets a succinct picture.

2) The convergence proof presented is also very vague. If the authors claim
that the presented method converges after a finite number of iterations, then
they should provide an analysis of the order of time it would take for the
algorithm to converge. They should also state the assumptions under which their
claims hold -- such as whether the data is separable or not.

3) Footnote 1 contains a citation that is not present in the bibliography.

============================================================================
                           REVIEWER #2
============================================================================


---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

                        APPROPRIATENESS: 5
                                CLARITY: 3
             ORIGINALITY/INNOVATIVENESS: 3
                  SOUNDNESS/CORRECTNESS: 3
                  MEANINGFUL COMPARISON: 4
                              SUBSTANCE: 2
             IMPACT OF IDEAS OR RESULTS: 3
                          REPLICABILITY: 4
        IMPACT OF ACCOMPANYING SOFTWARE: 3
         IMPACT OF ACCOMPANYING DATASET: 3
                    REVIEWER CONFIDENCE: 5


---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------

I would lean towards acceptance of the paper, although it is a tricky case.

On the positive side, the paper describes a simple method, which gives
good experimental results. I think it's a useful algorithm.

There are however a couple of negatives:

* most importantly, I'm not convinced at all by the convergence
 "proof". It is very informal, and I am not convinced that it is clear or
 correct.  There are good reasons not to publish papers with dubious
 proofs...

* The description of the algorithm itself is also sketchy. The paper
 would be much improved if pseudocode were given (it shouldn't be
 difficult to do this).

============================================================================
                           REVIEWER #3
============================================================================


---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

                        APPROPRIATENESS: 5
                                CLARITY: 4
             ORIGINALITY/INNOVATIVENESS: 3
                  SOUNDNESS/CORRECTNESS: 3
                  MEANINGFUL COMPARISON: 4
                              SUBSTANCE: 3
             IMPACT OF IDEAS OR RESULTS: 3
                          REPLICABILITY: 4
        IMPACT OF ACCOMPANYING SOFTWARE: 1
         IMPACT OF ACCOMPANYING DATASET: 1
                    REVIEWER CONFIDENCE: 4


---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------

This paper presents a variant of the perceptron algorithm which promotes sparse
models. Experiments are carried out on dependency parsing (graph-based and
transition-based) and on text chunking. Models which are 4-5 times smaller are
produced.

"The averaged perceptron is not well understood theoretically and is hard to
analyze" - this is actually not true. The perceptron is effectively minimizing
a loss function L(w) = sum_i max_y' w'*(f(x_i,y')-f(x_i,y_i)) using an online
subgradient method (see e.g. the Martins et al. 2010 EMNLP paper), and the
averaging operation can be seen as an online-to-batch conversion. This comment
also applies to the paragraph "comparison to L1-regularization models."

I didn't follow your convergence proof. By convergence, do you mean that the
algorithm finds a separating hyperplane if it exists? First, you need to assume
that the data is separable. Then, if the order of examples changes, the output
of the algorithm may also be different, isn't that right? So changing the order
may at least change the sparsity of the model that is obtained. It seems to me
that you guarantee that a separating hyperplane is found for a particular
ordering which renders the final model non-sparse. This does not seem very
helpful.

It seems to me that the algorithm is a bit like a "hack" with the goal of
obtaining sparsity, and I don't think it is supported here by theoretical
arguments. The stopping time also seems to be very important. For example, if the
data is not separable, and the algorithm is not stopped early, then all
features will eventually fire more than K times, and hence be included in the model.


It seems to me that text chunking is better described as a structured
prediction problem (sequence labeling) than as a multi-class problem. Usually the
latter implies "few" labels and the former an exponentially large number of
labelings, which is the case here.

The difference in model size is not as striking as in other L1 approaches
previously published (e.g. Tsuruoka 2009), but it is significant. However, the
standard measure for text chunking is F1, not accuracy! Since most
labels are "O", a system can have high accuracy and very poor recall. This
makes the prediction results of that experiment unreliable.

Minor comments:
- Why distinguish between the parameterizations of multi-class and structured
prediction? In both cases, scores can be written as w'*f(x,y).
- Fix the bib: "Mark Dredze and Pereira, 2008" should be "Crammer et al. 2008"
in footnote 1.

      

Response

Most of the reviewers' comments are correct.  Indeed, the method is essentially a hack, 
and the presentation is a bit informal.  This was done on purpose.  The idea is simple 
enough to understand (all of the reviewers clearly understood it), and the actual implementation 
details do not really matter (there is no "secret ingredient" here; you will probably get very 
similar results even with an implementation different from my own).  And then there is the page limit.
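For the curious, here is a rough sketch of the idea in code.  This is hypothetical
illustration code, not my actual implementation (function and variable names are made up),
but any implementation along these lines should behave similarly:

```python
import numpy as np

def train_count_threshold_perceptron(X, Y, n_classes, epochs=5, min_updates=1):
    """Multiclass perceptron that remembers, per feature, how many updates
    the feature participated in; scoring ignores features whose count is
    still below `min_updates`."""
    n_feats = X.shape[1]
    w = np.zeros((n_classes, n_feats))
    counts = np.zeros(n_feats, dtype=int)

    def predict(x):
        # score using only the features that took part in enough updates
        mask = counts >= min_updates
        return int(np.argmax(w[:, mask] @ x[mask]))

    for _ in range(epochs):
        for x, y in zip(X, Y):
            y_hat = predict(x)
            if y_hat != y:
                # standard perceptron update ...
                w[y] += x
                w[y_hat] -= x
                # ... plus the extra bookkeeping: these features
                # participated in one more update
                counts[x != 0] += 1
    return w, counts, predict
```

Whether the count threshold is applied during training-time scoring too (as it is
here) or only when pruning the final model is exactly the kind of detail that, in my
experience, does not change the picture much.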

The proof is also informal, and definitely not tight.  In particular, it does not attempt to claim 
anything about the sparsity of the final model.  All it says is that, in the limit, this hack will 
reduce to the "regular" perceptron, so if you trust the perceptron, there is no reason not to trust 
this version.  Empirically, it works well AND produces sparser models.

Some specific responses to the third review:

1/ I still do not find the averaged perceptron to be well understood (my preferred interpretation of it 
is in the context of Bayes-point machines, but at the end of the day it is still essentially a hack that
works extremely well). I am not sure which EMNLP 2010 paper I am supposed to look at.
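(For reference, my reconstruction of the loss the reviewer has in mind, written in the
reviewer's own notation and to be taken as my guess at the intended claim:

  L(w) = sum_i ( max_{y'} w'*f(x_i,y') - w'*f(x_i,y_i) )

A mistake on example i with prediction y^ contributes the subgradient f(x_i,y^) - f(x_i,y_i),
so the perceptron update w <- w + f(x_i,y_i) - f(x_i,y^) is a unit-step online subgradient
step on this loss.  This accounts nicely for the un-averaged perceptron; it is the *averaged*
variant that I still find less well understood.)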

2/ Chunking can be done as structured prediction. But many people solve it as a multiclass problem 
and get equally good results.  I needed a representative example of multi-class classification, and I think
this task fits the bill very well.

3/ Re chunk-level F1 vs. token-level accuracy: certainly true if evaluating a chunking system for the sake 
of chunking.  But here I treat it as an abstract multi-class problem, and I think the token-level numbers
provide a more direct representation of the results.  Also, while in NP-chunking most tokens are tagged 'O', 
this is certainly NOT the case for the general chunking task (very few tokens are tagged 'O', and there
is no predominant class to default to).