SPAM Classifier using Scikit-Learn

This notebook explains how to perform document classification using the Scikit-Learn and Pandas libraries. Make sure to install the latest versions of these in your Anaconda environment with commands:

# conda install scikit-learn

# conda install pandas

We use a variety of vectorizers to turn text documents into feature vectors and compare different classifier algorithms on these features.
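To make "feature vectors" concrete, here is a minimal plain-Python sketch of what a count vectorizer over unigrams and bigrams computes. The helper names `ngram_counts` and `vectorize` are hypothetical, and splitting on whitespace is a simplifying assumption; scikit-learn's `CountVectorizer` uses its own regex-based tokenizer and returns a sparse matrix.

```python
from collections import Counter

def ngram_counts(text, ngram_range=(1, 2)):
    """Count word n-grams in `text` for every n in the inclusive range."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts

def vectorize(texts, ngram_range=(1, 2)):
    """Return (vocabulary, rows): a sorted vocabulary and one count row per text."""
    per_text = [ngram_counts(t, ngram_range) for t in texts]
    vocabulary = sorted(set().union(*per_text))
    rows = [[counts[term] for term in vocabulary] for counts in per_text]
    return vocabulary, rows

vocabulary, rows = vectorize(["free money now", "meeting now"])
print(vocabulary)
print(rows)
```

Each document becomes one row of counts, with one column per n-gram seen anywhere in the collection, so most entries are zero for a realistic corpus.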

The code is derived from notebooks published by Zac Stewart and Radim Rehurek.


We will work on two datasets: one of email messages classified as spam or ham (ham = good messages, i.e. not spam), and one of SMS messages with the same two labels.

The email spam messages are collected from:

To make the work simpler, the two datasets are put into a single zip file here (107MB, contains about 60K files).

The SMS dataset is from:

Loading data

In [1]:
%matplotlib inline
import os
import sys
import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
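# sklearn.cross_validation was removed in scikit-learn 0.20; newer releases
# provide KFold in sklearn.model_selection, with a different signature.
from sklearn.cross_validation import KFold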
from sklearn.metrics import confusion_matrix, f1_score

def progress(i, end_val, bar_length=50):
    '''
    Print a progress bar of the form: Percent: [#####      ]
    i is the current progress value expected in a range [0..end_val]
    bar_length is the width of the progress bar on the screen.
    '''
    percent = float(i) / end_val
    hashes = '#' * int(round(percent * bar_length))
    spaces = ' ' * (bar_length - len(hashes))
    sys.stdout.write("\rPercent: [{0}] {1}%".format(hashes + spaces, int(round(percent * 100))))
    sys.stdout.flush()

NEWLINE = '\n'

The email files are organized in folders each containing only ham or spam files. The following code loads the whole dataset into a Pandas dataframe.

You should learn about Pandas by running the following notebooks:

In [2]:
HAM = 'ham'
SPAM = 'spam'

# Each entry pairs a folder with the class of every message it contains.
SOURCES = [
    ('data/spam',        SPAM),
    ('data/easy_ham',    HAM),
    ('data/hard_ham',    HAM),
    ('data/beck-s',      HAM),
    ('data/farmer-d',    HAM),
    ('data/kaminski-v',  HAM),
    ('data/kitchen-l',   HAM),
    ('data/lokay-m',     HAM),
    ('data/williams-w3', HAM),
    ('data/BG',          SPAM),
    ('data/GP',          SPAM),
    ('data/SH',          SPAM)
]

SKIP_FILES = {'cmds'}

def read_files(path):
    '''
    Generator of pairs (filename, filecontent)
    for all files below path whose name is not in SKIP_FILES.
    The content of each file is expected to be of the form:
        header lines, then an empty line, then the body.
    This skips the headers and returns body only.
    '''
    # os.walk already descends into subdirectories, so no explicit recursion is needed.
    for root, dir_names, file_names in os.walk(path):
        for file_name in file_names:
            if file_name not in SKIP_FILES:
                file_path = os.path.join(root, file_name)
                if os.path.isfile(file_path):
                    past_header, lines = False, []
                    with open(file_path, encoding="latin-1") as f:
                        for line in f:
                            if past_header:
                                lines.append(line)
                            elif line == NEWLINE:
                                # The first empty line marks the end of the headers.
                                past_header = True
                    content = NEWLINE.join(lines)
                    yield file_path, content

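def build_data_frame(l, path, classification):
    # l is the number of rows loaded so far; it only drives the progress bar.
    rows = []
    index = []
    for i, (file_name, text) in enumerate(read_files(path)):
        if ((i+l) % 100 == 0):
            progress(i+l, 58910, 50)  # 58910 = total number of files in the dataset
        rows.append({'text': text, 'class': classification})
        index.append(file_name)  # index each row by the path of its file
    data_frame = DataFrame(rows, index=index)
    return data_frame, len(rows)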

def load_data():
    data = DataFrame({'text': [], 'class': []})
    l = 0
    for path, classification in SOURCES:
        data_frame, nrows = build_data_frame(l, path, classification)
        # DataFrame.append was removed in pandas 2.0; there, use pandas.concat([data, data_frame]).
        data = data.append(data_frame)
        l += nrows
    data = data.reindex(numpy.random.permutation(data.index))
    return data
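The header-skipping rule used when reading the files can be tried on an in-memory message. This is a minimal sketch: `body_after_headers` is a hypothetical helper and the sample message is invented for illustration. It joins the body lines with the empty string, because each line read from a file already ends with its own newline; joining with NEWLINE, as the loader does, puts an extra newline between lines.

```python
import io

def body_after_headers(stream):
    """Return the text that follows the first blank line of `stream`."""
    past_header = False
    body_lines = []
    for line in stream:
        if past_header:
            body_lines.append(line)
        elif line == "\n":
            # The first empty line separates the headers from the body.
            past_header = True
    return "".join(body_lines)

raw = "From: a@example.com\nSubject: hello\n\nFirst line\n\nSecond line\n"
body = body_after_headers(io.StringIO(raw))
no_body = body_after_headers(io.StringIO("Subject: headers only\n"))
print(repr(body))
```

Only the first empty line is treated as the separator; later empty lines belong to the body and are kept.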
In [3]:
# This should take about 2 minutes
data = load_data()
Percent: [##################################################] 100%
In [4]:
In [5]:
data.describe()
        class                                               text
count   58910                                              58910
unique      2                                              52936
top      spam  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//E...
freq    35371                                                 93

Building a Vectorizer and Classifier SkLearn Pipeline

In [6]:
def build_pipeline():
    # Count unigrams and bigrams, then classify with Multinomial Naive Bayes.
    pipeline = Pipeline([
        ('count_vectorizer',   CountVectorizer(ngram_range=(1, 2))),
        ('classifier',         MultinomialNB())
    ])
    return pipeline

def train(data=None, n_folds=6):
    if data is None:
        print("Loading data...")
        data = load_data()
        print("Data loaded")
    k_fold = KFold(n=len(data), n_folds=n_folds)
    pipeline = build_pipeline()
    scores = []
    # Confusion matrix accumulated over all folds.
    confusion = numpy.array([[0, 0], [0, 0]])
    print("Training with %d folds" % n_folds)
    for i, (train_indices, test_indices) in enumerate(k_fold):
        train_text = data.iloc[train_indices]['text'].values
        train_y = data.iloc[train_indices]['class'].values.astype(str)

        test_text = data.iloc[test_indices]['text'].values
        test_y = data.iloc[test_indices]['class'].values.astype(str)

        print("Training for fold %d" % i)
        pipeline.fit(train_text, train_y)
        print("Testing for fold %d" % i)
        predictions = pipeline.predict(test_text)

        confusion += confusion_matrix(test_y, predictions)
        score = f1_score(test_y, predictions, pos_label=SPAM)
        scores.append(score)
        print("Score for %d: %2.2f" % (i, score))
        print("Confusion matrix for %d: " % i)
        print(confusion)

    print('Total emails classified:', len(data))
    print('Score:', sum(scores)/len(scores))
    print('Confusion matrix:')
    print(confusion)
    return pipeline
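The cross-validation loop relies on KFold cutting the rows into contiguous, unshuffled folds (the rows were already shuffled when the data was loaded). The following is a minimal plain-Python sketch of that splitting rule, assuming the behaviour scikit-learn documents, where the first `n % k` folds receive one extra sample. `k_fold_indices` is a hypothetical helper, not a scikit-learn function. In scikit-learn 0.20 and later the class is imported from `sklearn.model_selection`, is constructed as `KFold(n_splits=...)`, and yields the index pairs from its `split(X)` method.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one more sample than the others.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train_indices, test_indices in folds:
    print(test_indices)

# Test-fold sizes for the 58910 messages with 4 folds.
sizes = [len(test) for _, test in k_fold_indices(58910, 4)]
print(sizes)
```

Every row lands in exactly one test fold, so each message is classified once by a model that did not see it during training.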
In [7]:
from sklearn.linear_model import LogisticRegression

def build_pipeline2():
    # Same features as build_pipeline, classified with Logistic Regression.
    pipeline = Pipeline([
        ('count_vectorizer',   CountVectorizer(ngram_range=(1, 2))),
        ('classifier',         LogisticRegression())
    ])
    return pipeline

def train2(data=None, n_folds=4):
    if data is None:
        print("Loading data...")
        data = load_data()
        print("Data loaded")
    k_fold = KFold(n=len(data), n_folds=n_folds)
    pipeline = build_pipeline2()
    scores = []
    # Confusion matrix accumulated over all folds.
    confusion = numpy.array([[0, 0], [0, 0]])
    print("Training with %d folds" % n_folds)
    for i, (train_indices, test_indices) in enumerate(k_fold):
        train_text = data.iloc[train_indices]['text'].values
        train_y = data.iloc[train_indices]['class'].values.astype(str)
        test_text = data.iloc[test_indices]['text'].values
        test_y = data.iloc[test_indices]['class'].values.astype(str)

        print("Training for fold %d" % i)
        pipeline.fit(train_text, train_y)
        print("Testing for fold %d" % i)
        predictions = pipeline.predict(test_text)

        confusion += confusion_matrix(test_y, predictions)
        score = f1_score(test_y, predictions, pos_label=SPAM)
        scores.append(score)
        print("Score for %d: %2.2f" % (i, score))
        print("Confusion matrix for %d: " % i)
        print(confusion)

    print('Total emails classified:', len(data))
    print('Score:', sum(scores)/len(scores))
    print('Confusion matrix:')
    print(confusion)
    return pipeline
In [8]:
# This trains the pipeline on our data (about 60K email messages)
# using count vectors over unigrams and bigrams and k-fold cross-validation
# (4 folds by default in train2, 6 folds by default in train).
# The training takes about 5 minutes for Multinomial Naive Bayes and about 30 minutes for Logistic Regression.
pipeline = train2(data)
Training for fold 0
Testing for fold 0
Score for 0: 0.99
Confusion matrix for 0: 
[[5907   45]
 [  79 8697]]
Training for fold 1
Testing for fold 1
Score for 1: 0.99
Confusion matrix for 1: 
[[11640    81]
 [  212 17523]]
Training for fold 2
Testing for fold 2
Score for 2: 0.99
Confusion matrix for 2: 
[[17590   112]
 [  325 26156]]
Training for fold 3
Testing for fold 3
Score for 3: 0.99
Confusion matrix for 3: 
[[23381   158]
 [  500 34871]]
Total emails classified: 58910
Score: 0.990661681691
Confusion matrix:
[[23381   158]
 [  500 34871]]
In [9]:
pipeline_nb = train(data)
Training with 6 folds
Training for fold 0
Testing for fold 0
Score for 0: 0.98
Confusion matrix for 0: 
[[3956   16]
 [ 255 5592]]
Training for fold 1
Testing for fold 1
Score for 1: 0.98
Confusion matrix for 1: 
[[ 7831    33]
 [  514 11260]]
Training for fold 2
Testing for fold 2
Score for 2: 0.98
Confusion matrix for 2: 
[[11675    46]
 [  760 16975]]
Training for fold 3
Testing for fold 3
Score for 3: 0.98
Confusion matrix for 3: 
[[15650    66]
 [  992 22566]]
Training for fold 4
Testing for fold 4
Score for 4: 0.98
Confusion matrix for 4: 
[[19521    73]
 [ 1211 28287]]
Training for fold 5
Testing for fold 5
Score for 5: 0.98
Confusion matrix for 5: 
[[23450    89]
 [ 1418 33953]]
Total emails classified: 58910
Score: 0.978284726388
Confusion matrix:
[[23450    89]
 [ 1418 33953]]
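The two final confusion matrices can be turned back into an F1 score by hand. This sketch assumes the layout scikit-learn's `confusion_matrix` uses: rows are the true classes and columns the predicted classes, in sorted label order (ham, then spam). The row sums, 23539 and 35371, match the ham and spam counts of the dataset, which is consistent with that layout. `f1_from_confusion` is a hypothetical helper.

```python
def f1_from_confusion(matrix):
    """F1 for the second class of a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)  # share of predicted spam that really is spam
    recall = tp / (tp + fn)     # share of real spam that was caught
    return 2 * precision * recall / (precision + recall)

# Accumulated matrices printed at the end of the two runs above.
logistic_regression = [[23381, 158], [500, 34871]]
naive_bayes = [[23450, 89], [1418, 33953]]

f1_lr = f1_from_confusion(logistic_regression)
f1_nb = f1_from_confusion(naive_bayes)
print("Logistic Regression: %.4f" % f1_lr)
print("Naive Bayes:         %.4f" % f1_nb)
```

The pooled values come out at about 0.9907 and 0.9783, very close to the reported scores, which are averages of the per-fold F1 values rather than a single F1 over all predictions. The matrices also show where the classifiers differ: Naive Bayes marks fewer ham messages as spam (89 against 158) but lets more spam through (1418 against 500).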
In [10]:
data.describe()
        class                                               text
count   58910                                              58910
unique      2                                              52936
top      spam  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//E...
freq    35371                                                 93
In [11]:
from pandas import value_counts
value_counts(data['class'])
spam    35371
ham     23539
Name: class, dtype: int64

Explore distribution of text length

We add a new column to our dataframe to represent the length of the messages.

In [12]:
data['length'] = data['text'].map(lambda text: len(text))

Let us explore the distribution of the message lengths:

In [13]:
data.length[data.length < 10000].plot(bins=100, kind='hist')
dsl = data.length[(data['class'] == 'spam') & (data.length < 10000)]
dhl = data.length[(data['class'] == 'ham') & (data.length < 10000)]

dsl.plot(bins=100, kind='hist')
dhl.plot(bins=100, kind='hist')
In [14]:
data.groupby('class').describe()
ham   count     23539.000000
      mean       2136.225498
      std        7559.825825
      min           5.000000
      25%         342.000000
      50%         829.000000
      75%        1743.500000
      max      303302.000000
spam  count     35371.000000
      mean       4042.914959
      std        8417.286850
      min           0.000000
      25%         922.000000
      50%        1922.000000
      75%        4071.000000
      max      751895.000000
In [15]:
# All empty messages are marked as spam.
value_counts(data[data.length == 0]['class'])
spam    61
Name: class, dtype: int64
In [16]:
data[(data.length > 20) & (data.length < 10000)].groupby('class').describe()
ham   count     22842.000000
      mean       1270.940504
      std        1425.039593
      min          21.000000
      25%         333.000000
      50%         799.000000
      75%        1624.750000
      max        9960.000000
spam  count     32072.000000
      mean       2380.702420
      std        2064.117531
      min          24.000000
      25%         865.000000
      50%        1707.000000
      75%        3276.000000
      max        9999.000000
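The filter-then-describe step above can be mirrored in plain Python to see what each statistic means. This is a minimal sketch with invented toy records; `summarize_lengths` is a hypothetical helper, and the quartiles use the standard library's "inclusive" method, which interpolates linearly between sorted values. Each class needs at least two surviving messages for the quartiles to be defined.

```python
from statistics import mean, quantiles

def summarize_lengths(records, low=20, high=10000):
    """Group text lengths by class, keeping only low < length < high."""
    by_class = {}
    for label, text in records:
        if low < len(text) < high:
            by_class.setdefault(label, []).append(len(text))
    summary = {}
    for label, lengths in by_class.items():
        q1, median, q3 = quantiles(lengths, n=4, method="inclusive")
        summary[label] = {"count": len(lengths), "mean": mean(lengths),
                          "min": min(lengths), "25%": q1, "50%": median,
                          "75%": q3, "max": max(lengths)}
    return summary

# Toy records (label, text); the lengths are chosen only for illustration.
records = [
    ("ham", "x" * 30), ("ham", "x" * 50), ("ham", "x" * 70),
    ("spam", "x" * 10),      # dropped: not longer than 20 characters
    ("spam", "x" * 100), ("spam", "x" * 300),
    ("spam", "x" * 20000),   # dropped: not shorter than 10000 characters
]
summary = summarize_lengths(records)
for label, stats in summary.items():
    print(label, stats)
```

Dropping the very short and very long messages removes the outliers that dominate the standard deviation in the unfiltered table, while the ordering between the classes stays the same: spam messages are longer than ham messages at every quartile.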