Topics in Natural Language Processing (202-2-5381) Fall 2018

Meets: Sun 12-14 Bdg 34 Room 003

News:

22 Oct 17: Welcome to NLP 18
29 Oct 17: Quizz 01 and Language Modeling
30 Oct 17: There will be no lecture on Nov 5th.
12 Nov 17: Quizz 02 and Deep Learning Intro
19 Nov 17: Quizz 03 and more on Deep Learning Intro + DyNet Intro
20 Nov 17: HW1 is available.
26 Nov 17: Quizz 04 and Parts of Speech Tagging
03 Dec 17: Quizz 05 and Document Classification
10 Dec 17: Quizz 06 and Word Embeddings
17 Dec 17: Registration for grading HW1 is online. Please send me email with your requested slot - first-come first-served.
17 Dec 17: HW2 is available.
24 Dec 17: Quizz 07 and Sequence Classification
24 Dec 17: Excellent intro to NumPy - worth reading.
29 Dec 17: Feature union in Sklearn can help in addressing some of HW2 questions.
31 Dec 17: Quizz 08 and Syntax and Parsing Part 1 of 3.
07 Jan 18: Quizz 09 and Syntax and Parsing Part 2 of 3.
14 Jan 18: Quizz 10 and Syntax and Parsing Part 3 of 3.

General Intro to NLP - Linguistic Concepts
Language Modeling
Deep Learning Intro
Parts of Speech Tagging
Basic Statistic Concepts
Classification
Word Embeddings
Sequence Classification
Deep Learning for NLP
Syntax and Parsing
Summarization
Topic Modeling

Objectives

The course is an introduction to Natural Language Processing. The main objective of the course is to learn how to develop practical computer systems capable of performing intelligent tasks on natural language: analyze, understand and generate written text. This task requires learning material from several fields: linguistics, machine learning and statistical analysis, and core natural language techniques.

Acquire basic understanding of linguistic concepts and natural language complexity: variability (the possibility to express the same meaning in many different ways) and ambiguity (the fact that a single expression can refer to many different meanings in different contexts); levels of linguistic description (word, sentence, text; morphology, syntax, semantics, pragmatics). Schools of linguistic analysis (functional, distributional, Chomskyan); Empirical methods in Linguistics; Lexical semantics; Syntactic description; Natural language semantics issues.
Acquire basic understanding of machine learning techniques as applied to text: supervised vs. unsupervised methods; training vs. testing; classification; regression; distributions, KL-divergence; Bayesian methods; Support Vector Machines; Perceptron; Deep Learning methods in NLP; RNNs and LSTMs.
Natural language processing techniques: word and sentence tokenization; parts of speech tagging; lemmatization and morphological analysis; chunking; named entity recognition; language models; probabilistic context free grammars; probabilistic dependency grammars; parsing accuracy metrics; treebank analysis; text simplification; paraphrase detection; summarization; text generation.

Topics covered in class include:

Descriptive linguistic models
Language Models -- Statistical Models of Unseen Data (n-gram, smoothing, recurrent neural networks language models)
Language Models and deep learning -- word embeddings, continuous representations, neural networks
Parts of speech tagging, morphology, non categorical phenomena in tagging
Information Extraction / Named Entity Recognition
Using Machine Learning Tools: Classification, Sequence Labeling / Supervised Methods / SVM. CRF, Perceptron, Logistic Regression
Bayesian Statistics, generative models, topic models, LDA
Syntactic descriptions: Parsing sentence, why, how, PCFGs, Dependency Parsing
Text Summarization

Lecture Notes

22 Oct 17: General Intro to NLP - Linguistic Concepts
Things to do:
1. Find a way to estimate how many words exist in English. In Hebrew. What method did you use? What definition of word did you use? (Think derivation vs. inflection)
2. Experiment with Google Translate: find ways to make Google Translate "fail dramatically" (generate very wrong translations). Explain your method and collect your observations. Document attempts you made that did NOT make Google Translate fail. (Think variability and ambiguity; Think syntactic complexity; think lexical sparsity, unknown words).
3. Think of reasons why natural languages have evolved to become ambiguous (Think: what is the communicative function of language; who pays the cost for linguistic complexity and who benefits from it; is ambiguity created willingly or unconsciously?)

29 Oct 17: Language Modeling
1. Language Modeling with Ngrams Chapter 4 from Speech and Language Processing (SPL3), Jurafsky and Martin, 3rd Ed, 2016. (Focus on 4.1-4.4)
2. Peter Norvig: How to Write a Spelling Corrector (2007).
  This is a toy spelling corrector illustrating the statistical NLP method (probability theory, dealing with large collections of text, learning language models, evaluation methods). Read an extended version of the material with more applications (word segmentation, n-grams, smoothing, more on bag of words, secret code decipher): How to Do Things with Words (Use this local copy adapted to Python 3 / html version and the support files: big.txt, count_1w.txt, count_2w.txt).
3. Spelling Correction and the Noisy Channel, Chapter 5 from SPL3, 5.1-5.2 - This covers similar material as Norvig's piece above in a more formal manner.
4. Things to do:
  1. Read about Probability axioms and in more details in the notes from Fredrik Engstrom:
  2. Read about Edit Distance and in more details, a review of minimum edit distance algorithms using dynamic programming from Dan Jurafsky.
  3. Install Python: I recommend installing the Anaconda distribution (choose the Python 3.6 version). (Note: Many of the code samples you will see are written in Python 2 - which is not exactly compatible with Python 3 - the main annoying difference is that in Python 2 you can write: print x -- in Python 3 it must be print(x). Python3 comes with a utility 2to3 which converts most of the differences between Python 2 and Python 3.]
    The Anaconda distribution includes a large set of Python packages ready to use that we will find useful. (391MB download, 2GB disk space needed.) In particular, Anaconda includes the nltk, pandas, numpy, scipy and scikit-learn packages.
  4. Execute Norvig's spell checker for English (you will neeed the Python code from the article and the large file of text used for training big.txt).
  5. How many tokens are there in big.txt? How many distinct tokens? What are the 10 most frequent words in big.txt?
  6. In a very large corpus (discussed in the ngram piece quoted below), the following data is reported:
    The 10 most common types cover almost 1/3 of the tokens, the top 1,000 cover just over 2/3.
    What do you observe on the much smaller big.txt corpus?
  7. Read more from Norvig's piece on ngrams.
  8. Execute the word segmentation example from Norvig's ngram chapter (code in this notebook).
    Note the very useful definition of the @memo decorator in this example, which is an excellent method to implement dynamic programming algorithms in Python. From Python Syntax and Semantics:
    A Python decorator is any callable Python object that is used to modify a function, method or class definition. A decorator is passed the original object being defined and returns a modified object, which is then bound to the name in the definition. Python decorators were inspired in part by Java annotations, and have a similar syntax; the decorator syntax is pure syntactic sugar, using @ as the keyword:
```
@viking_chorus
def menu_item():
    print("spam")
	
```
    is equivalent to:
```
def menu_item():
    print("spam")
menu_item = viking_chorus(menu_item)
	
```
  9. This corpus includes a list of about 40,000 pairs of words (error, correction). It is too small to train a direct spell checker that would map word to word. Propose a way to learn a useful error model (better than the one used in Norvig's code) using this corpus. Hint: look at the model of weighted edit distance presented in Jurafsky's lecture cited above.

12 Nov 2017: Deep Learning Intro
Things to do:
1. Learn Python if you don't know it (About 4 hours)
  - Python Tutorial (Use Python 3.6)
  - Google intro to Python (this uses Python 2)
2. Install a good Python environment (About 3 hours) The default environment is pyCharm (the free Community Edition is good for our needs 127MB download). I also find Spyder which comes bundled with Anaconda very convenient.
3. Install Dynet and Pytorch in your environment:
```
Install Anaconda for Python 3.6
For Dynet: pip install dynet
For PyTorch: 
	Under Windows: conda install -c peterjc123 pytorch 
	Under Linux / MacOS: see instructions on pytorch.org using conda / Python 3.6 and depending on your hardware 
	with or without Cuda.
		
```
4. Read the Python DyNet tutorial (2 hours)
5. Execute the xor-dynet.py example.
6. Read Graham Neubig's intro to DyNet and the code examples 01-intro.
7. Read the 5 parts of the series "Machine Learning is Fun" from Adam Geitgey (about 15 min each part):

26 Nov 2017: Parts of Speech Tagging (ipynb)
Things to do:
1. Read about the Universal Parts of Speech Tagset (About 2 hours)
2. Install NLTK: if you have installed Anaconda, it is already installed. Make sure to download the corpora included with nltk.
  Q: How do you find out where your package is installed after you use easy_install?
  A: in the Python shell, type: import nltk; then type: nltk. You will get an answer like:
```
>>> import nltk
>>> nltk
<module 'nltk' from 'C:\Anaconda\lib\site-packages\nltk\__init__.py'>
>>>
   
```
3. Explore the Brown corpus of parts-of-speech tagged English text using NLTK's corpus reader and FreqDist object: Use the Universal tagset for all work (About 1 hour)
  - What are the 10 most common words in the corpus?
  - What are the 5 most common tags in the corpus?
4. Read Chapter 5 of the NLTK book (About 3 hours)
5. Advanced topics in POS tagging: we will get back to the task of POS tagging with different methods in the following chapters, for more advanced sequenced labeling methods (HMM), Deep Learning based methods using Recurrent Neural Networks, feature-based classifier methods for tagging (CRF), and as a test case for unsupervised EM techniques and Bayesian techniques. You can look at the source code of the nltk.tag module for a feeling of how the tag.hmm, tag.crf and tag.tnt methods are implemented.
  The following papers give a good feeling of the current state of the art in POS tagging:
  - Learning Character-level Representations for Part-of-Speech Tagging, by Dos Santos and Zadrozny, ICML 2014: uses a character-level Convolution Network to perform POS tagging; reaches accuracy of 97.32% and remarkably about 90% on unknown words (words never seen during training).
  - Understanding Convolutional Neural Networks for NLP, this is a blog article with high quality Python code and notebooks explaining and implementing Dos Santos and Zadrozny's model of POS tagging using character-level CNN.
  - A Universal Part-of-Speech Tagset by Slav Petrov, Dipanjan Das and Ryan McDonald, LREC, 2012.
6. Read A good POS tagger in 200 lines of Python, an Averaged Perceptron implementation with good features, fast, reaches 97% accuracy (by Matthew Honnibal).
7. Execute pos-tagging-skl.py, which implements a POS tagger using the Scikit-Learn model, with similar good features, fast, reaches 97% accuracy.

03 Dec 17 Classification
1. Read Chapter 6: Learning to Classify Text of the NLTK Book (About 3 hours).
2. Read Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression, Tom Mitchell, 2015. (About 3 hours)
3. Watch (ML 8.1) Naive Bayes Classification a 15mn video on Naive Bayes Classification by Mathematical Monk and the following chapter (ML 8.3) about Bayesian Naive Bayes (20 minutes).
4. Read and execute the tutorial on Using PyTorch for Logistic Regression
Practical work:
1. Explore the documentation of the nltk.classify module.
2. Read the code of the NLTK Naive Bayes classifier and run nltk.classify.naivebayes.demo()
3. Read the code of the NLTK classifier demos: names_demo and wsd_demo.
4. Read the documentation on feature extraction in Scikit-learn.
5. Run the example on document classification in Scikit-learn: Notebook (ipynb source).
6. Experiment with the example of classifications in this iPython notebook (code) which shows how to run NLTK classifiers in a variety of ways.
7. Experiment with the Reuters Dataset notebook (code) illustrating document classification with bag of words features and TF-IDF transformation.
8. The PyTorch tutorial on Logistic Regression is applied to a vision task (MNIST hand-written digit recognition). Apply the PyTorch classes to the task of Text Classification using the same dataset as Scikit-learn tutorial on text classification on the 20 newsgroup dataset.
  A step towards the solution is given in Logistic Regression Bag-of-Words classifier in the PyTorch tutorial.

10 Dec 17 Word Embeddings
1. Dense word representations in deep learning:
  - The Logistic regression bag of words model.
  - Continuous bag of words model.
2. Read CS 224D: Deep Learning for NLP1 1 - Lecture Notes: Part I by Richard Socher / the corresponding slides - Pre-trained word embeddings.
Further Reading::
1. View Chris Manning's Lecture on Word2Vec April 2017 (1h10)
2. Read Efficient Estimation of Word Representations in Vector Space from Mikolov et al (2013) and the Word2vec site.
3. Read TensorFlow's tutorial on Word Embeddings.
4. Read Word2Vec Explained by Yoav Goldberg and Omer Levy, 2014.
5. Read Neural Word Embedding as Implicit Matrix Factorization by Levy and Goldberg, 2014.
6. Read Linguistic Regularities in Continuous Space Word Representations by Mikolov et al, 2013.
7. Read king - man + woman is queen; but why? by Piotr Migdal, Jan 2017. Good summary of word embeddings with interactive visualization tools, including word2viz word analogies explorer.
Things to do:
1. Install Gensim in your environment (run "conda install gensim") and run the Gensim Word2vec tutorial.
2. Experiment with Sense2vec with spaCy and Gensim, Source code a tool to compute word embeddings taking into account multi-word expressions and POS tags.
3. Register to the Udacity Deep Learning course (free) by Vincent Vanhoucke, and study the chapter "Deep Models for Text and Sequences", then do Assignment 5 (ipynb notebook) "Train a Word2Vec skip-gram model over Text8 data".
4. Continue with Assignment 6 (a ipynb notebook) "Train a LSTM character model over Text8 data".

24 Dec 17 Sequence Classification
1. Read Michael Collin's nodes on Tagging Problems, and Hidden Markov Models: POS tagging and Named Entity Recognition as tagging problems (with BIO tag encoding), generative and noisy channel models, generative tagging models, trigram HMM, conditional independence assumptions in HMMs, estimating the parameters of an HMM, decoding HMMs with the Viterbi algorithm.
2. Sequence Models and LSTM - PyTorch Tutorial.
3. Introduction to RNNs (slides from Graham Neubig, Fall 2017).
Things to do:
1. Explain why the problem of decoding (see 2.5.4 in Tagging Problems, and Hidden Markov Models) requires a dynamic programming algorithm (Viterbi) while we did not need such a decoding step when we discussed classification using Logistic Regression and Naive Bayes?
2. Implement Algorithm 2.5 (Viterbi with backpointers) from Tagging Problems, and Hidden Markov Models in Python. Test it on the Brown POS tagging dataset using MLE for tag transitions estimation (parameters q) and a discounting language model for each tag in the Universal taget for parameters e(x|tag) for each tag.
3. Do the Assignment 3 from Richard Johannson's course on Machine Learning for NLP, 2014. Read the material assignment 3 material and Lecture 6: predicting structured objects.
  Start from the excellent Python implementation of the structured perceptron algorithm.

31 Dec 17 Syntax and Parsing
1. Context Free Grammars Parsing
2. CKY parsing interactive demo in Javascript
3. Probabilistic Context Free Grammars Parsing
4. Michael Collins's lecture on CFGs and CKY
5. Michael Collins's lecture on Lexicalized PCFGs:
  1. Why CFGs are not adequate for describing treebanks: lack of sensitivity to lexical items + lack of sensitivity to structural preferences.
  2. How to lexicalize CFGs with Head propagation.
  3. How to parse a lexicalized PCFG.
6. NLTK tools for PCFG parsing
7. Notes on computing KL-divergence
8. Dependency Parsing:
  1. Dependency Parsing by Graham Neubig. Graham's teaching page with github page for exercises.
  2. Dependency Parsing: Past, Present, and Future, Chen, Li and Zhang, 2014 (Coling 2014 tutorial)
  3. NLTK Dependency Parsing Howto
  4. Parsing English with 500 lines of Python, an implementation by Matthew Honnibal of Training Deterministic Parsers with Non-Deterministic Oracles, Yoav Goldberg and Joakim Nivre, TACL 2013. (Complete Python code)
  5. Neural Network Dependency Parser, Chen and Manning 2014. A Java implementation of a Neural Network Dependency Parser with Unlabelled accuracy of 92% and Labelled accuracy of 90%.

20 Nov 2016 - 11 Dec 2016: Basic Statistic Concepts / Supervised Machine Learning
Things to do:
1. Bayesian concept learning from Tenenbaum 1999 - reported in Murphy 2012 Chapter 3.
2. Read Deep Learning by Goodfellow, Bengio and Courville, 2016 Chapters 3 and 5. (About 5 hours)
3. Watch the 15mn video (ML 7.1) Bayesian inference - A simple example by Mathematical Monk.
4. Make sure you have installed numpy and scipy in your Python environment. Easiest way is to use the Anaconda distribution.
5. Read Introduction to statistical data analysis in Python - frequentist and Bayesian methods from Cyril Rossant, and execute the associated Jupyter notebooks. (About 4 hours)
6. Learn how to use Scipy and Numpy - Chapter 1 in Scipy Lectures (About 5 hours)
7. Write Python code using numpy, scipy and matplotlib.pyplot to draw the graphs of the Beta distribution that appear in the lecture notes (About 1 hour)
8. Given a dataset for a Bernouilli distribution (that is, a list of N bits), generate a sequence of N graphs illustrating the sequential update process, starting from a uniform prior until the Nth posterior distribution. Each graph indicates the distribution over μ, the parameter of the Bernouilli distribution (which takes value in the [0..1] range). (About 2 hours)
9. Learn how to draw Dirichlet samples using numpy.random.mtrand.dirichlet. A sample from a Dirichlet distribution is a multinomial distribution. Understand the example from the Wikipedia article on Dirichlet distributions about string cutting:
```
   import numpy as np
   import matplotlib.pyplot as plt
   s = np.random.dirichlet((10, 5, 3), 20).transpose()
   plt.barh(range(20), s[0])
   plt.barh(range(20), s[1], left=s[0], color='g')
   plt.barh(range(20), s[2], left=s[0]+s[1], color='r')
   plt.title("Lengths of Strings")
   plt.show()
   
```
  (About 2 hours)
10. Compute the MLE estimator μ_MLE of a binomial distribution Bin(m|N, μ).
11. Mixture Priors: assume we contemplate two possible modes for the value of our Beta-Binomial model parameter μ. A flexible method to encode this belief is to consider that our prior over the value of μ has the form:
```
   μ ~ k₁Beta(a, b) + k₂Beta(c, d)
   where k₁ + k₂ = 1
   m ~ Bin(μ N)
   
```
  A prior over μ of this form is called a mixture prior - as it is a linear combination of simple priors.
  1. Prove that the mixture prior is a proper probabilistic distribution.
  2. Compute the posterior density over μ for a dataset where (N = 10, m=8, N-m=2) where k₁=0.8 and k₂=0.2 and the prior distributions are Beta(1,10) and Beta(10,1). Write Python code to draw the prior density of μ and its posterior density. (About 2 hours)
  3. Experiment with a very simple form of Stochastic Gradient Descent (SGD) with a custom loss function by running this notebook. (Notebook source here). More examples available on the Autograd project homepage.
08 Jan 17 Deep Learning for NLP
1. Read A Primer on Neural Network Models for Natural Language by Yoav Goldberg, Oct 2015 up to Page 35.
2. Natural Language Understanding with Distributed Representation by Kyunghyun Cho, Nov 2015 - Chapters 1 to 3.
3. Deep Learning by Goodfellow, Bengio and Courville, 2016 Chapters 6, 7, 8 and 10.
4. The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy, May 2015 and the analysis of the same data The unreasonable effectiveness of Character-level Language Models (and why RNNs are still cool) by Yoav Goldberg, June 2015
5. Calculus on Computational Graphs: Backpropagation, by Chris Olah, Aug 2015.
6. Understanding LSTM Networks, by Chris Olah, Aug 2015.
7. WildML articles by Denny Britz - Sep 2015 - Jan 2016
  These include tutorials and Python notebooks of incremental complexity covering topics in Deep learning for NLP.
8. Heavy Metal and Natural Language Processing - Part 2 Iain Barr, Sept 2016, experiments with Language Models - ngrams and RNNs - to generate Deep Metal lyrics. Demo on deepmetal.io. Good intro material on language models, examples with char-models and word-models - starts with n-grams and smoothing, then RNN using Keras. Implementation - including notebook and pre-trained models..

Later - Summarization
1. Automatic Text Summarization
2. A Survey of Text Summarization Techniques, Ani Nenkova and Kathleen McKeown, Mining Text Data, 2012 - Springer

Later - Topic Modeling and Latent Dirichlet Allocation
1. Probabilistic Topic Models, Mark Steyvers and Tom Griffiths, In: In T. Landauer, D McNamara, S. Dennis, and W. Kintsch (eds), Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum
2. David Blei's lecture on LDA (about 1 hour - Sep 2009)
3. A more recent video from David Blei video
4. A bit shorter well done lecture as well video
5. Background on Dirichlet Mathematical Monk
6. This is the standard SKlearn implementation: Reference manual, example notebook
7. pyLDAvis: cool interactive visualization tool of topic models

Software

NLTK: Nltk is a Python based toolkit with wide coverage of NLP techniques - both statistical and knowledge-based.
Dynet - a Python / C++ library for Deep Learning.
PyTorch - a deep learning framework in Python.
TensorFlow - a Python library for Deep Learning.
Keras - a high-level Python library on top of Tensorflow or Theano for Deep Learning.
Scikit-learn - a Python library for Machine Learning. Presents a uniform interface for many ML tasks (fit, transform). Good text processing example ( Working with text documents).

Resources

Notebooks
- Practical Data Science in Python by Radim Rehurek. This is an iPython notebook demonstrating how to write classifiers in Python (using ScikitLearn. The concrete example is a spam detector on SMS messages.
- Out-of-core classification of text documents, scikit-learn example showing how to perform document classification on the Reuters-21578 database.
- Sample pipeline for text feature extraction and evaluation, performs feature selection and compares performance on document classification on the 20 newsgroup dataset.
- The Travelling Salesperson Problem, this notebook demonstrates a sequence of exact and approximate algorithms to solve the Travelling Salesperson Problem. It is a great introduction to Python programming.
- Cooking with Pandas by Julia Evans (2013), introduction to the Pandas Python library to manipulate data with aggregations and queries. The updated Git repository is Panda Cookbook.
- Analyzing a Twitter Dataset with Pandas, by Gregory Saxton (2015).
- SciPy 2015 SkLearn Tutorial by Andreas Muller. Comprehensive tutorial to Machine Learning using ScikitLearn with relevant examples on Text analysis.
- Language Model GRU with Python and Theano, Part 4 of the WildML RNN Tutorial
Online Courses
- Natural Language Processing Course, Dan Jurafsky and Chris Manning, Stanford University, 2012
- From Languages to Information, Dan Jurafsky, Fall 2017
- Noah Smith Natural Language Processing Course, University of Washington, Spring 2017, video lectures
- Natural Language Processing, by Jason Eisner, John Hopkins University, Fall 2017
- Statistical Methods for NLP, by Joakim Nivre, Uppsala, 2012
Tutorials
- A Statistical MT Tutorial Workbook, Kevin Knight, 1999
- Bayesian Inference with Tears, Kevin Knight, 2009
- Structured Prediction for Natural Language Processing, Noah Smith, ICML 2009
- Classification for NLP, Dan Klein, ACL 2007
- Structured Bayesian Nonparametric Models with Variational Inference, Dan Klein and Percy Liang, ACL 2007
- Gibbs Sampling for the Uninitiated, Philip Resnik and Eric Hardisty, June 2010
Books:
- Speech and Language Processing (3rd ed. draft), by Dan Jurafsky and James H. Martin, Aug 2017
- Introduction to Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, 2009, distributed on the NLTK site.
- Neural Network Methods for Natural Language Processing, Yoav Goldberg, April 2017
- Pattern Recognition and Machine Learning, Chris Bishop, 2007.
- Information Theory, Inference, and Learning Algorithms, David J.C. MacKay, 2003.
Python
1. Google's intro to Python
2. Python Ecosystem: notes on installing Python in your environment
3. Introduction to Python Generators
4. Scikit Learn: Machine Learning in Python
Extracting text from HTML:
- The JusText Python package does a very good job of cleaning up HTML by removing boilerplate HTML code around "interesting text".
- Decruft is a Python implementation of the Readability algorithm - easy to use, works well.
- Readability browser bookmarklet Readability is a browser bookmarklet that wipes out all that junk so you can have a more enjoyable reading experience. It works with all the latest browsers and its success rate is pretty respectable (we'd guess over 90% of web sites are handled properly). It is implemented in Jscript and relies on comparing the HTML link density of HTML elements: clean text have less density than junk.
- CleanEval homepage: CLEANEVAL is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus, for linguistic and language technology research and development. (2007)
- Python code to clean HTML pages -- based on the idea that elements in HTML with "clean" text have less HTML link density than "useless" elements. (2008)

Last modified 07 Jan 2018 Michael Elhadad

Topics in Natural Language Processing (202-2-5381) Fall 2018

News:

Contents

Objectives