Decruft is a Python implementation of the Readability algorithm - easy to use, works well.
Readability browser bookmarklet
Readability is a browser bookmarklet that strips the clutter surrounding an article so you can have a more enjoyable reading experience.
It works with all the latest browsers and its success rate is pretty respectable (we'd guess over 90% of web sites are handled properly).
It is implemented in JavaScript and relies on comparing the link density of HTML elements: elements holding clean text have a lower link density than junk.
CleanEval homepage:
CLEANEVAL is a shared task and competitive evaluation on the topic of
cleaning arbitrary web pages, with the goal of preparing web data for
use as a corpus, for linguistic and language technology research and
development. (2007)
Python code
to clean HTML pages -- based on the idea that elements in HTML with "clean" text have less HTML link density than "useless" elements. (2008)
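The link-density idea behind the tools above can be illustrated with a minimal sketch. This is not the code of Readability, Decruft, or the Python cleaner linked here; it is a simplified illustration using only the standard library. The names `LinkDensityParser`, `clean_text`, `BLOCK_TAGS` and the `max_density` threshold of 0.5 are assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit the text blocks we score.
BLOCK_TAGS = {"p", "div", "td", "li"}


class LinkDensityParser(HTMLParser):
    """Collect (text, link_density) for each run of text between block tags.

    link_density = non-space characters inside <a> / all non-space characters.
    Limitation of this sketch: <script> and <style> content is not skipped.
    """

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_density)
        self._chunks = []       # raw text pieces of the current block
        self._link_chars = 0    # non-space characters seen inside <a>
        self._total_chars = 0   # all non-space characters in the block
        self._link_depth = 0    # > 0 while inside an <a> element

    def _flush(self):
        # Close the current block and record its score if it has any text.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.blocks.append((text, self._link_chars / self._total_chars))
        self._chunks, self._link_chars, self._total_chars = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        n = sum(1 for c in data if not c.isspace())
        self._chunks.append(data)
        self._total_chars += n
        if self._link_depth:
            self._link_chars += n

    def close(self):
        super().close()
        self._flush()


def clean_text(html, max_density=0.5):
    """Return the text blocks whose link density is at most max_density."""
    parser = LinkDensityParser()
    parser.feed(html)
    parser.close()
    return [text for text, density in parser.blocks if density <= max_density]


page = (
    '<div><a href="/">Home</a> | <a href="/about">About</a> | '
    '<a href="/c">Contact</a></div>'
    '<p>Readability strips navigation and keeps the article text, '
    'with one <a href="/x">link</a>.</p>'
)
print(clean_text(page))
```

In this example the navigation bar consists almost entirely of link text and is dropped, while the paragraph with a single short link is kept. Real implementations combine link density with other signals, so the threshold alone should be treated as a rough heuristic.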
SISC Scheme Interpreter: we use
Scheme examples to demonstrate algorithms in parsing, generation and
some semantic analysis. This interpreter is very small and convenient
to use on any platform supporting Java (full version is 2.4MB with
full doc - jar is 300KB).
The LingPipe Java library suite - by Bob Carpenter. Information extraction and data mining tools.
Text Analysis with LingPipe, Bob Carpenter, 2011. A practical book on using LingPipe. Covers strings, streams, regular expressions, corpora readers, tokenization, language models, classifiers and Latent Dirichlet Allocation.
NLTK (Natural Language Toolkit) - Python toolkit with a thorough tutorial covering a wide range of natural language processing topics.