29 Apr 12: Google.py: an updated Python script that retrieves Google search results.
The xgoogle package referenced in the assignment stopped working with recent versions of Google's site.
Tal Baumel contributed this function, which seems to work reliably for now.
10 May 12: Please sign up for a slot for in-person checking of HW1: slots.
The course is an introduction to Natural Language Processing.
The main objective of the course is to learn how to develop practical computer systems capable of performing intelligent tasks on natural language: analyzing, understanding and generating written text. This requires material from several fields: linguistics, machine learning and statistical analysis, and core natural language processing techniques.
Acquire a basic understanding of linguistic concepts and of natural language complexity: variability (the same meaning can be expressed in many different ways) and ambiguity (a single expression can refer to different meanings in different contexts); levels of linguistic description (word, sentence, text; morphology, syntax, semantics, pragmatics); schools of linguistic analysis (functional, distributional, Chomskyan); empirical methods in linguistics; lexical semantics; syntactic description; issues in natural language semantics.
Acquire a basic understanding of machine learning techniques as applied to language: supervised vs. unsupervised methods; training vs. testing; classification; regression; distributions and KL-divergence; Bayesian methods; support vector machines; the perceptron.
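Among the methods in this list, the perceptron is simple enough to sketch in a few lines. The following is a minimal illustrative version; the toy data, the {-1, +1} label encoding, and the fixed epoch count are assumptions made for this example, not part of the course material:

```python
# Minimal perceptron sketch: learn a linear separator from labeled vectors.
# Labels are encoded as -1/+1; data below is an invented, linearly separable toy set.

def perceptron_train(examples, epochs=10):
    """Train a perceptron on (feature_vector, label) pairs, label in {-1, +1}."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or on the boundary): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def perceptron_predict(w, b, x):
    """Classify a feature vector with the learned weights."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy data: points where the first coordinate exceeds the second are labeled +1.
data = [([2.0, 1.0], 1), ([3.0, 0.5], 1), ([1.0, 2.0], -1), ([0.5, 3.0], -1)]
w, b = perceptron_train(data)
```

On separable data like this toy set the update rule converges to weights that classify every training point correctly.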
Learn core natural language processing techniques: word and sentence tokenization; part-of-speech tagging; lemmatization and morphological analysis; chunking; named entity recognition; n-gram language models; probabilistic context-free grammars; probabilistic dependency grammars; parsing accuracy metrics; treebank analysis; text simplification; paraphrase detection; summarization; text generation.
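As an illustration of the n-gram language models in this list, here is a hedged sketch of a bigram model with maximum-likelihood estimates and no smoothing; the two-sentence corpus is invented for the example:

```python
# Bigram language model sketch: count unigrams and bigrams over a tiny corpus,
# then estimate P(word | previous word) by maximum likelihood (no smoothing).

from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]       # sentence boundary markers
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram(corpus)
# "the" is followed once by "cat" and once by "dog", so P(cat | the) = 1/2
```

A real model would add smoothing (e.g. add-one or interpolation) so that unseen bigrams do not get probability zero.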
Peter Norvig: How to Write a Spelling Corrector (2007) - a toy spelling corrector illustrating the statistical NLP methodology (probability theory, working with large collections of text, learning language models, evaluation methods).
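The core of Norvig's approach can be sketched roughly as follows. This condensed version only considers candidates at edit distance 1, and the tiny WORDS counter is invented here (Norvig trains it from a large text file):

```python
# Sketch of a Norvig-style spelling corrector: generate all candidates within
# edit distance 1 of the input and pick the most frequent known word.
# The WORDS frequency table below is a toy stand-in for counts from a big corpus.

from collections import Counter

WORDS = Counter({"the": 100, "spelling": 5, "corrector": 3, "correct": 8})

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    """Most probable correction: prefer the word itself if known, then known
    words at edit distance 1, then give up and return the word unchanged."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=WORDS.get)
```

For example, correction("teh") finds "the" via a transposition; Norvig's full article also handles edit distance 2 and discusses a proper noisy-channel error model.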
SISC Scheme Interpreter: we use Scheme examples to demonstrate algorithms in parsing, generation and some semantic analysis. This interpreter is very small and convenient to use on any platform supporting Java (the full version with full documentation is 2.4MB; the jar alone is 300KB).
The LingPipe Java library suite - by Bob Carpenter. Information extraction and data mining tools.
Text Analysis with LingPipe, Bob Carpenter, 2011. A practical book on using LingPipe. Covers strings, streams, regular expressions, corpus readers, tokenization, language models, classifiers and Latent Dirichlet Allocation.
The JusText Python package does a very good job of cleaning up HTML pages by removing boilerplate markup around the "interesting text".
Decruft is a Python implementation of the Readability algorithm - easy to use, works well.
Readability browser bookmarklet
Readability is a browser bookmarklet that wipes out all that junk so you can have a more enjoyable reading experience.
It works with all the latest browsers and its success rate is pretty respectable (we'd guess over 90% of web sites are handled properly).
It is implemented in JavaScript and relies on comparing the link density of HTML elements: elements containing clean text have lower link density than junk.
CleanEval homepage:
CLEANEVAL is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus, for linguistic and language technology research and development. (2007)
Python code to clean HTML pages, based on the idea that elements in HTML containing "clean" text have lower link density than "useless" elements. (2008)
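The link-density idea behind these cleaning tools can be sketched with the standard library alone. The parser class, the example fragments, and the decision to treat text-free elements as junk are illustrative assumptions for this sketch, not any tool's actual code:

```python
# Link-density heuristic sketch: measure what fraction of an HTML fragment's
# text sits inside <a> tags. Content blocks tend to have low link density;
# navigation menus and other boilerplate tend to have high density.

from html.parser import HTMLParser

class LinkDensityParser(HTMLParser):
    """Accumulate total text length and text length appearing inside <a> tags."""
    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self.in_link = 0          # nesting depth of open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.in_link:
            self.link_chars += n

def link_density(html_fragment):
    """Fraction of the fragment's text that is link text (1.0 = all links)."""
    p = LinkDensityParser()
    p.feed(html_fragment)
    if p.total_chars == 0:
        return 1.0                # no text at all: treat as junk here
    return p.link_chars / p.total_chars

nav = '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
article = '<p>Long paragraph of real content with one <a href="#">link</a> inside.</p>'
```

A cleaner would then keep elements whose density falls below some threshold; real tools like Readability and jusText combine this signal with others (text length, stopword density, markup structure).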