# Fetching and Cleaning HTML Text

### NLP 2016 - HW1

November 2015 - [NLP16](http://www.cs.bgu.ac.il/~elhadad/nlp16.html)

We will compare two different methods of turning raw HTML into clean text.
HTML pages contain many "non-textual" elements: HTML tags, JavaScript code, advertisements, and in general repetitive content which we will refer to as "boilerplate" content (menus, navigation, etc.).

We are interested in extracting the non-boilerplate textual content from an arbitrary HTML page.

We will compare two libraries that achieve this.

First, let us get raw HTML from a URL:

In [1]:
import requests

url = "http://www.bbc.com/news/technology-26415021"
html = requests.get(url).text

Let us inspect the resulting raw HTML string we obtained:

In [3]:
print(html[:200])

<!DOCTYPE html>
<html lang="en" id="responsive-news" prefix="og: http://ogp.me/ns#">
<head >
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <title>An


There are too many whitespace characters and empty lines. Let us clean it up a bit:

In [5]:
import re

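html = re.sub("[\r\n]+", "\n", html)  # collapse runs of CR/LF into one newline
html = re.sub("[\n]+", "\n", html)    # already done by the previous line, so this changes nothing
html = re.sub("[\t, ]+"," ", html)    # collapse runs of tabs, commas and spaces into one space
# Note: the comma inside [...] is matched literally, so commas are replaced too
# (this is why they are missing from the outputs below); use "[\t ]+" to keep them.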
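
As a quick check of what these substitutions do, the sketch below applies them to a short made-up sample string. The helper `normalize_whitespace` and its `keep_commas` flag are hypothetical names introduced only for this illustration; the sample is not taken from the BBC page.

```python
import re

def normalize_whitespace(text, keep_commas=True):
    """Collapse blank lines and runs of tabs/spaces (hypothetical helper)."""
    # Any run of carriage returns / line feeds becomes a single newline.
    text = re.sub(r"[\r\n]+", "\n", text)
    # With keep_commas=False we mimic the "[\t, ]+" pattern, which also eats commas.
    pattern = r"[\t ]+" if keep_commas else r"[\t, ]+"
    return re.sub(pattern, " ", text)

sample = '<head >\r\n\r\n    <meta content="IE=edge,chrome=1">\n\n\t<title>An'

print(normalize_whitespace(sample))
print(normalize_whitespace(sample, keep_commas=False))
```

The second call turns `IE=edge,chrome=1` into `IE=edge chrome=1`, which is harmless for markup but also strips commas from the article text itself.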


In [6]:
html

'<!DOCTYPE html>\n<html lang="en" id="responsive-news" prefix="og: http://ogp.me/ns#">\n<head >\n <meta charset="utf-8">\n <meta http-equiv="X-UA-Compatible" content="IE=edge chrome=1">\n <title>An hour to catch the coding bug - BBC News</title>\n <meta name="description" content="Is it possible to get children interested in computer programming in just 60 minutes? The Hour of Code has been designed to do just that.">\n <link rel="dns-prefetch" href="https://ssl.bbc.co.uk/">\n <link rel="dns-prefetch" href="http://sa.bbc.co.uk/">\n <link rel="dns-prefetch" href="http://ichef-1.bbci.co.uk/">\n <link rel="dns-prefetch" href="http://ichef.bbci.co.uk/">\n <meta name="x-country" content="il">\n <meta name="x-audience" content="International">\n <meta name="CPS_AUDIENCE" content="International">\n <meta name="CPS_CHANGEQUEUEID" content="">\n <link rel="canonical" href="http://www.bbc.com/news/technology-26415021">\n <link rel="alternate" hreflang="en-gb" href="http://www.bbc.co.uk/news/techn

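This is becoming a mess: the notebook displays the string with escaped "\n" characters. Let us get line breaks back:

In [7]:
html = html.split("\n") # html is now a list of lines
html = "\n".join(html)    # we turn it back into a single string
print(html)               # print() renders each "\n" as a real line break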

<!DOCTYPE html>
<html lang="en" id="responsive-news" prefix="og: http://ogp.me/ns#">
<head >
 <meta charset="utf-8">
 <meta http-equiv="X-UA-Compatible" content="IE=edge chrome=1">
 <title>An hour to catch the coding bug - BBC News</title>
 <meta name="description" content="Is it possible to get children interested in computer programming in just 60 minutes? The Hour of Code has been designed to do just that.">
 <link rel="dns-prefetch" href="https://ssl.bbc.co.uk/">
 <link rel="dns-prefetch" href="http://sa.bbc.co.uk/">
 <link rel="dns-prefetch" href="http://ichef-1.bbci.co.uk/">
 <link rel="dns-prefetch" href="http://ichef.bbci.co.uk/">
 <meta name="x-country" content="il">
 <meta name="x-audience" content="International">
 <meta name="CPS_AUDIENCE" content="International">
 <meta name="CPS_CHANGEQUEUEID" content="">
 <link rel="canonical" href="http://www.bbc.com/news/technology-26415021">
 <link rel="alternate" hreflang="en-gb" href="http://www.bbc.co.uk/news/technology-26415021">


We give up -- too much noise in this page! How can we get just the text out of this?

Let us use existing libraries.

The first one we try is called BeautifulSoup.  It is a library for parsing "noisy" HTML in general.  
Once parsed, the HTML string can be navigated in a convenient manner.
Make sure you install beautifulsoup4 by running:

% pip install beautifulsoup4

We can then run this:

In [8]:
from bs4 import BeautifulSoup

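def clean_html1(html):
    # Name the parser explicitly to avoid the bs4 "no parser specified" warning.
    # "lxml" requires the lxml package; "html.parser" ships with Python.
    soup = BeautifulSoup(html, "lxml")
    return soup.get_text()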

Let us try this version of clean_html:

In [10]:
print(clean_html1(html))





An hour to catch the coding bug - BBC News





























 {
 "@context": "http://schema.org"
 "@type": "Article"
 
 "url": "http://www.bbc.com/news/technology-26415021"
 "publisher": {
 "@type": "Organization" 
 "name": "BBC News" 
 "logo": "http://www.bbc.co.uk/news/special/2015/newsspec_10857/bbc_news_logo.png?cb=1"
 }
 
 "headline": "An hour to catch the coding bug"
 "author": "Mark Ward"
 "mainEntityOfPage": "http://www.bbc.com/news/technology-26415021"
 "articleBody": "Is it possible to get children interested in computer programming in just 60 minutes? The Hour of Code has been designed to do just that."
 
 "image": {
 "@list": [
 "http://ichef-1.bbci.co.uk/news/560/media/images/73325000/jpg/_73325163_olly009.jpg"
 "http://ichef.bbci.co.uk/news/560/media/images/73325000/jpg/_73325167_hourofcode.jpg"
 "http://ichef.bbci.co.uk/news/560/media/images/73325000/jpg/_73325169_appinventor.jpg"
 ]
 }
 "datePublished": "2014-03-03T10:22:55+00:00"
 }
 













var _s





We got rid of the HTML tags, but we still have lots of "non-text" content, mainly JavaScript code.
We can still use BeautifulSoup to navigate the HTML DOM structure and filter out the script blocks; this would be effective in this case.
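
A minimal sketch of that filtering step, assuming we simply drop every `script`, `style` and `noscript` element before extracting text. The function name `clean_html_no_scripts` and the tiny sample page are made up for illustration, and `"html.parser"` is used only so that the example needs nothing beyond beautifulsoup4.

```python
from bs4 import BeautifulSoup

def clean_html_no_scripts(html):
    """Hypothetical variant of clean_html1 that drops script/style blocks."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove the elements (and everything inside them) from the parse tree.
    for tag in soup.find_all(["script", "style", "noscript"]):
        tag.decompose()
    # Strip each line of the remaining text and drop the empty ones.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)

sample = """<html><head><title>Demo</title>
<script>var _s = 1;</script></head>
<body><p>Hello world.</p></body></html>"""

print(clean_html_no_scripts(sample))
```

This removes code and embedded data blocks, but menus and navigation links are ordinary HTML elements and would still come through as text.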

But we can go faster and use the justext library, which was developed specifically to remove boilerplate from HTML strings.
It classifies each paragraph with a neat heuristic algorithm (based on features such as paragraph length, link density and stopword density), which is documented on its homepage.
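
To give a feel for the kind of signal a boilerplate detector can use, here is a toy sketch that flags a paragraph as boilerplate when most of its characters sit inside links, when it is very short, or when it contains few common function words (stopwords). This is a simplified illustration and not justext's implementation: the function names, the tiny stopword set and all thresholds are assumptions invented for the example.

```python
# Tiny illustrative stopword set; a real stoplist is much larger.
STOPWORDS = {"the", "a", "of", "to", "in", "and", "is", "it", "that"}

def stopword_density(text):
    """Fraction of whitespace-separated words that are stopwords."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in STOPWORDS for word in words) / len(words)

def looks_like_boilerplate(text, link_chars=0, max_link_density=0.2,
                           min_length=40, min_stopword_density=0.3):
    """Toy rule: link-heavy, very short, or stopword-poor text is boilerplate."""
    if not text:
        return True
    if link_chars / len(text) > max_link_density:
        return True
    if len(text) < min_length:
        return True
    return stopword_density(text) < min_stopword_density

menu = "Home News Sport Weather"
article = "The campaign begun in the US has landed in the UK and it is open to all."

print(looks_like_boilerplate(menu, link_chars=len(menu)))
print(looks_like_boilerplate(article))
```

Running prose tends to be long and rich in stopwords, while menus and navigation bars are short and made almost entirely of links, which is why such simple features already separate the two cases reasonably well.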

Make sure you install the library first:

% pip install justext

Now let's try it:

In [11]:
import justext

def clean_html2(html):
    paragraphs = justext.justext(html, justext.get_stoplist('English'))
    return "\n".join([p.text for p in paragraphs if not p.is_boilerplate])



Let us try this second version of clean_html:

In [12]:
print(clean_html2(html))

I'm putting my twin 10-year-old boys Toby and Callum through the Hour of Code - a campaign that seeks to ignite an interest in programming - the part we're doing using specially created web-based exercises.
The campaign begun in the US has landed in the UK where it also coincides with government calls for as many children as possible to get coding.
Programming is being pushed because in an ever more technological world it can only be a good thing to give people a peep into what goes on behind the touch screen cash point and website.
The Hour of Code is supposed to be the start of that journey and I like many other parents feel it's one my children should be embarking on. I do feel like a clock somewhere is ticking and unless they get started with this essential skill they'll be left behind.
"In the future kids are going to be doing programming " said Callum when I asked him why it was worth learning how to code. "We need to learn so we can do stuff with the computer otherwise it will b

## Tokenizing

Now that we have clean text, let us tokenize it into a list of words.

We will use the nltk tokenizer to do this.  Make sure you have nltk installed, and that you invoke nltk.download() to download the datasets that come with it (the tokenizer needs them).

Then let us try the tokenizer.

In [13]:
import nltk

tokens = nltk.word_tokenize(clean_html2(html))

print("We found %s tokens" % (len(tokens)))

We found 1091 tokens


The output of justext is organized in paragraphs.  We converted each paragraph into one line of text, so we can split the text back into paragraphs and tokenize them one at a time.

In [14]:
paragraphs = clean_html2(html).split("\n")
p0 = paragraphs[0]
print(nltk.word_tokenize(p0))

['I', "'m", 'putting', 'my', 'twin', '10-year-old', 'boys', 'Toby', 'and', 'Callum', 'through', 'the', 'Hour', 'of', 'Code', '-', 'a', 'campaign', 'that', 'seeks', 'to', 'ignite', 'an', 'interest', 'in', 'programming', '-', 'the', 'part', 'we', "'re", 'doing', 'using', 'specially', 'created', 'web-based', 'exercises', '.']


Note how contractions in English are tokenized ("I'm" becomes ["I", "'m"]).
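
The same splitting can be observed on a few more contractions. The sketch below uses nltk's `TreebankWordTokenizer` directly, on the assumption that it follows the same Treebank-style convention as the output above; it is rule-based, so it should not need any downloaded data. The sample phrases are made up, apart from fragments of the article text.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

samples = ["I'm putting", "We don't know", "they'll be left behind"]
tokenized = [tokenizer.tokenize(text) for text in samples]

for tokens in tokenized:
    print(tokens)
```

Each contraction is split into the base word and a clitic token ("'m", "n't", "'ll"), which makes it easier to treat "do" and "n't" as separate words in later processing.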