============================================================================
ACL-HLT 2011 Reviews for Submission #1284
============================================================================

Title: Learning Sparser Perceptron Models

Authors: Yoav Goldberg and Michael Elhadad

============================================================================
REVIEWER #1
============================================================================

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

APPROPRIATENESS: 5
CLARITY: 3
ORIGINALITY/INNOVATIVENESS: 3
SOUNDNESS/CORRECTNESS: 3
MEANINGFUL COMPARISON: 4
SUBSTANCE: 3
IMPACT OF IDEAS OR RESULTS: 4
REPLICABILITY: 4
IMPACT OF ACCOMPANYING SOFTWARE: 1
IMPACT OF ACCOMPANYING DATASET: 1
REVIEWER CONFIDENCE: 4

---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------

This paper is about learning sparse perceptron models. The algorithm presented in the paper is straightforward: during standard perceptron learning, it maintains an extra vector that records the number of times each feature weight participated in an update. The scoring function of the perceptron then ignores features that participated in fewer than a given minimum number of updates. The results presented in the paper are strong: there are hardly any differences with dense models, yet the model size decreases by a great deal on all three tasks the authors report results for.

The problems I think the paper has are the following.

1) The modification to the standard perceptron could have been presented as an algorithm, instead of being described in a hand-wavy way through text.
I think the presentation would have been clearer if the standard perceptron and the modified version were presented side by side, so that the reader gets a succinct picture.

2) The convergence proof presented is also very vague. If the authors claim that the presented method converges after a finite number of iterations, then they should provide an analysis of the order of time the algorithm would take to converge. They should also state the assumptions under which their claims hold -- such as whether the data is separable or not.

3) Footnote 1 contains a citation that is not present in the bibliography.

============================================================================
REVIEWER #2
============================================================================

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

APPROPRIATENESS: 5
CLARITY: 3
ORIGINALITY/INNOVATIVENESS: 3
SOUNDNESS/CORRECTNESS: 3
MEANINGFUL COMPARISON: 4
SUBSTANCE: 2
IMPACT OF IDEAS OR RESULTS: 3
REPLICABILITY: 4
IMPACT OF ACCOMPANYING SOFTWARE: 3
IMPACT OF ACCOMPANYING DATASET: 3
REVIEWER CONFIDENCE: 5

---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------

I would lean towards acceptance of the paper, although it is a tricky case. On the positive side, the paper describes a simple method which gives good experimental results. I think it's a useful algorithm. There are, however, a couple of negatives:

* Most importantly, I'm not convinced at all by the convergence "proof". It is very informal, and I am not convinced that it is clear or correct. There are good reasons not to publish papers with dubious proofs...

* The description of the algorithm itself is also sketchy. The paper would be much improved if pseudocode were given (it shouldn't be difficult to do this).
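Both reviews ask for the method as explicit pseudocode. A minimal sketch of the update-counting perceptron as the reviews describe it -- a binary version with invented names (`train`, `score`, threshold `k`), not the paper's actual implementation:

```python
from collections import defaultdict

def score(weights, counts, feats, k):
    # Features that took part in fewer than k updates are ignored
    # at scoring time -- this is the thresholding the reviews describe.
    return sum(weights[f] for f in feats if counts[f] >= k)

def train(examples, epochs=5, k=2):
    """examples: list of (feature_set, label) pairs, label in {+1, -1}.
    A standard binary perceptron, plus a per-feature update counter."""
    weights = defaultdict(float)
    counts = defaultdict(int)  # times each feature participated in an update
    for _ in range(epochs):
        for feats, y in examples:
            y_hat = 1 if score(weights, counts, feats, k) >= 0 else -1
            if y_hat != y:              # mistake: usual perceptron update ...
                for f in feats:
                    weights[f] += y
                    counts[f] += 1      # ... and record that the feature fired
    # The final model keeps only frequently-updated features, hence sparser.
    return {f: w for f, w in weights.items() if counts[f] >= k}
```

Rarely-updated (typically noisy) features never cross the threshold, so they contribute neither to scoring during training nor to the pruned final model.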
============================================================================
REVIEWER #3
============================================================================

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------

APPROPRIATENESS: 5
CLARITY: 4
ORIGINALITY/INNOVATIVENESS: 3
SOUNDNESS/CORRECTNESS: 3
MEANINGFUL COMPARISON: 4
SUBSTANCE: 3
IMPACT OF IDEAS OR RESULTS: 3
REPLICABILITY: 4
IMPACT OF ACCOMPANYING SOFTWARE: 1
IMPACT OF ACCOMPANYING DATASET: 1
REVIEWER CONFIDENCE: 4

---------------------------------------------------------------------------
Comments
---------------------------------------------------------------------------

This paper presents a variant of the perceptron algorithm which promotes sparse models. Experiments are conducted in dependency parsing (graph-based and transition-based) and in text chunking. Models which are 4-5 times smaller are produced.

"The averaged perceptron is not well understood theoretically and is hard to analyze" -- this is actually not true. The perceptron is effectively minimizing the loss function L(w) = sum_i max_{y'} w'*(f(x_i, y') - f(x_i, y_i)) using an online subgradient method (see, e.g., the Martins et al. 2010 EMNLP paper), and the averaging operation can be seen as an online-to-batch conversion. This comment also applies to the paragraph "comparison to L1-regularization models."

I didn't follow your convergence proof. By convergence, do you mean that the algorithm finds a separating hyperplane if one exists? First, you need to assume that the data is separable. Then, if the order of the examples changes, the output of the algorithm may also be different, isn't that right? So changing the order may at least change the sparsity of the model that is obtained. It seems to me that you guarantee that a separating hyperplane is found for a particular ordering which renders the final model non-sparse. This does not seem very helpful.
It seems to me that the algorithm is a bit of a "hack" with the goal of obtaining sparsity, and I don't think it is supported here by theoretical arguments. The stopping time also seems to be very important. For example, if the data is not separable and the algorithm is not stopped early, then all features will eventually fire more than K times and hence be included in the model.

It seems to me that text chunking is better described as a structured prediction problem (sequence labeling) than as a multi-class problem. Usually the latter implies "few" labels and the former an exponentially large number of labelings, which is the case here.

The difference in model size is not as striking as in other L1 approaches previously published (e.g. Tsuruoka 2009) but is significant. However, the standard measure for text chunking is the F1 measure, not accuracy! Since most labels are "O", a system can have high accuracy and very poor recall. This makes the prediction results of that experiment unreliable.

Minor comments:

- Why distinguish the parameterizations of multi-class and structured prediction? In both cases, scores can be written as w'*f(x,y).

- Fix the bibliography: "Mark Dredze and Pereira, 2008" should be "Crammer et al., 2008" in footnote 1.
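The accuracy-vs-F1 concern can be made concrete with a toy token-level example (the counts are made up for illustration; chunking F1 is properly computed over spans with a script such as conlleval, but the effect of label skew is the same):

```python
def accuracy(gold, pred):
    # Fraction of tokens labeled correctly.
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1(gold, pred, positive="NP"):
    # Token-level F1 for one class of interest.
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(p == positive and g != positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Suppose 90% of tokens are "O": predicting "O" everywhere scores
# 0.9 accuracy while recalling no chunk tokens at all (F1 = 0).
gold = ["O"] * 18 + ["NP"] * 2
pred = ["O"] * 20
```

With this skew, a degenerate majority-class predictor looks strong under token accuracy but is useless under F1, which is the reviewer's point.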

============================================================================
AUTHOR RESPONSE
============================================================================

Most of the reviewers' comments are correct. Indeed, the method is essentially a hack, and the presentation is a bit informal. This was done on purpose. The idea is simple enough to understand (all of the reviewers clearly understood it), and the actual implementation details do not really matter (there is no "secret ingredient" here; you will probably get very similar results even with a different implementation than my own). And then there is the page limit.

The proof is also informal, and definitely not tight. In particular, it does not attempt to claim anything about the sparsity of the final model. All it says is that, in the limit, this hack reduces to the "regular" perceptron, so if you trust the perceptron, there is no reason not to trust this version. Empirically, it works well AND produces sparser models.

Some specific responses to the third review:

1) I still do not find the averaged perceptron to be well understood (my preferred interpretation of it is in the context of Bayes-point machines, but at the end of the day it is still essentially a hack that works extremely well). I am not sure which EMNLP 2010 paper I am supposed to look at.

2) Chunking can be done as structured prediction, but many people solve it as a multi-class problem and get equally good results. I needed a representative example of multi-class classification, and I think this task fits the bill very well.

3) Re chunk-level F1 vs. token-level accuracy: this is certainly true when evaluating a chunking system for the sake of chunking. But here I treat it as an abstract multi-class problem, and I think the token-level numbers give a more direct representation of the results. Also, while in NP-chunking most tokens are tagged 'O', this is certainly NOT the case for the general chunking task (very few tokens are tagged 'O', and there is no predominant default class to fall back on).