SPAM Filtering with Basic Machine Learning Techniques¶

Introduction¶

  • We are going to build a simple SMS spam classifier using machine learning.
  • Given our time constraints, we'll focus only on the SMS content for this model.
  • A more advanced model, for example one built for email, could also incorporate header information and other metadata for better accuracy.

Gentle Introduction to Machine Learning¶

Before we start, let's briefly talk about machine learning.

  • What is it?
  • How does it work?
  • What are the types of machine learning? etc.

What is Machine Learning?¶

In simple terms:

Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed.

Types of Machine Learning¶

There are three main types of machine learning algorithms:

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning

Today, we are only going to use supervised learning, so let's look at it in more detail.

Supervised Learning¶

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It builds a function from labeled training data consisting of a set of training examples.

  • Most common type of machine learning
  • Easy to understand
  • Easy to implement
  • In simple terms, it's similar to teaching a child with flash cards.

Supervised Learning (Example)¶

Since we are going to classify messages into spam and non-spam categories, let's see how we can use supervised learning to solve this problem.

  1. First, we provide the data and the known response (for example, a message and its label: spam or non-spam).
  2. Then the model learns the relationship between the data and the response.
  3. Finally, when we input a new message, the model predicts whether it is spam or not.
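
The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the four messages and their labels are invented, and the sketch uses word counts (CountVectorizer) with a Naive Bayes model, whereas the tutorial itself trains on numeric features that are built later.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: data with known responses (toy messages invented for illustration)
messages = [
    "win a free prize now",
    "free entry claim your prize",
    "are we meeting for lunch today",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

# Step 2: the model learns the relationship between the words and the labels
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(messages)
model = MultinomialNB()
model.fit(features, labels)

# Step 3: predict the label of a message the model has not seen before
new_message = ["claim your free prize"]
prediction = model.predict(vectorizer.transform(new_message))
print(prediction[0])
```

The new message shares words only with the spam examples, so the toy model labels it as spam.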

Hopefully you are now familiar with the concept of supervised learning.

Now, let's move on to our main objective: creating a spam classifier.

  • First, we are going to do some basic analysis.
  • Then we will do some data cleaning and feature engineering.
  • After that, we will create a model and evaluate it.

Important Note 1:

  • We are going to copy and paste code from this notebook into Google Colab.
  • So please make sure to run the code cells in order.

Important Note 2:

  • If you run into any problem while running the code, please contact me or Hirano san immediately.
  • Then we can fix it as soon as possible. Otherwise, you will not be able to run the code and follow the tutorial.

Google Colab¶

  • Step 1: Open Google Colab
    • First, you need to open Google Colab by going to https://colab.research.google.com/ and sign in with your Google account.



  • Step 2: Create a New Notebook
    • Once you are signed in, click on the “New Notebook” button to create a new notebook.

1. Installing Libraries¶

First things first, we need to install the libraries that we are going to use in this project.

  • We can use the pip command to install libraries. (pip is a package management system used to install and manage software packages written in Python.)
  • Note that %pip is a notebook "magic" command that installs packages into the environment of the running kernel. Prefixing a line with ! runs it as a shell command, so !pip install also works in Colab.
In [1]:
%pip install pandas scikit-learn numpy matplotlib seaborn nltk wordcloud xgboost lightgbm
Requirement already satisfied: pandas in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (2.1.1)
Requirement already satisfied: scikit-learn in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (1.3.1)
Requirement already satisfied: numpy in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (1.25.1)
Requirement already satisfied: matplotlib in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (3.7.2)
Requirement already satisfied: seaborn in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (0.13.0)
Requirement already satisfied: nltk in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (3.8.1)
Requirement already satisfied: wordcloud in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (1.9.2)
Requirement already satisfied: xgboost in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (2.0.0)
Requirement already satisfied: lightgbm in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (4.1.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from pandas) (2023.3)
Requirement already satisfied: scipy>=1.5.0 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from scikit-learn) (1.11.3)
Requirement already satisfied: joblib>=1.1.1 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from scikit-learn) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from scikit-learn) (3.2.0)
Requirement already satisfied: contourpy>=1.0.1 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from matplotlib) (1.1.0)
Requirement already satisfied: cycler>=0.10 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from matplotlib) (4.41.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from matplotlib) (1.4.4)
Requirement already satisfied: packaging>=20.0 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from matplotlib) (23.1)
Requirement already satisfied: pillow>=6.2.0 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from matplotlib) (10.0.0)
Requirement already satisfied: pyparsing<3.1,>=2.3.1 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from matplotlib) (3.0.9)
Requirement already satisfied: click in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from nltk) (8.1.7)
Requirement already satisfied: regex>=2021.8.3 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from nltk) (2023.10.3)
Requirement already satisfied: tqdm in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from nltk) (4.65.0)
Requirement already satisfied: six>=1.5 in /mnt/ssd/mamba/envs/aaa/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Note: you may need to restart the kernel to use updated packages.

2. Import Libraries¶

Next, we import all the libraries that we are going to use in this project.

Here is the list of libraries, each with a brief description.

  • numpy : NumPy adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
  • pandas : Data manipulation and analysis.
  • matplotlib : Plotting library for the Python programming language and its numerical mathematics extension NumPy.
  • seaborn : Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
  • sklearn : Scikit-learn is a machine learning library for the Python programming language.
  • nltk : NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
In [2]:
# For data manipulation and analysis
import pandas as pd
import numpy as np
import string
from collections import Counter

# For plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# For data splitting, label encoding, and feature extraction
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# For text preprocessing and tokenization
import nltk
from nltk import bigrams
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

# Machine learning models
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from xgboost import XGBClassifier
from lightgbm.sklearn import LGBMClassifier

# Machine learning evaluation metrics
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, roc_auc_score, f1_score, recall_score

from wordcloud import WordCloud

import warnings
warnings.filterwarnings("ignore")
nltk.download('punkt')
nltk.download('stopwords')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[2]:
True

3. Data Loading and overview

After importing the libraries, the next step is to load the dataset.

3.1. Data Loading¶

  • We're going to use Pandas to pull in our data and create a Dataframe.
  • Think of a Dataframe like a spreadsheet or a SQL table.
  • A Dataframe has rows and columns, and each column can have its own type of data.
  • One cool thing about Pandas is that it can get data directly from a web link. We'll use this to get our data from a link on GitHub.
In [3]:
# DATA_SET = "./data/spam_dataset.csv"
DATA_SET = "https://raw.githubusercontent.com/qualitia-cdev/hands-on-data/main/spam_dataset.csv"

df = pd.read_csv(DATA_SET, encoding='latin-1')
df.head()
Out[3]:
   v1    v2                                                 Unnamed: 2  Unnamed: 3  Unnamed: 4
0  ham   Go until jurong point, crazy.. Available only ...  NaN         NaN         NaN
1  ham   Ok lar... Joking wif u oni...                      NaN         NaN         NaN
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...  NaN         NaN         NaN
3  ham   U dun say so early hor... U c already then say...  NaN         NaN         NaN
4  ham   Nah I don't think he goes to usf, he lives aro...  NaN         NaN         NaN

3.2. Overview of the dataset¶

The dataset seems to contain multiple columns:

  • v1: This appears to be the label, indicating whether a message is "spam" or "ham" (not spam).
  • v2: This seems to be the content of the SMS message.
  • Unnamed: 2, Unnamed: 3, and Unnamed: 4: These columns seem to contain mostly NaN (missing) values based on the initial view.

4. EDA (Exploratory Data Analysis)

4.1. What is EDA¶

  • EDA is the process of visually and analytically exploring a dataset to understand its main characteristics and uncover patterns.
  • It's a crucial step before formal modeling in the data analysis process.

To proceed with the exploratory data analysis (EDA), we will:

  • Check the number of missing values in each column.
  • Get a summary of the dataset, such as the number of spam vs. ham messages.
  • Visualize the distribution of message lengths for both spam and ham messages.

The next cell calls df.info(), which prints a summary of the dataset.

In [4]:
# basic info about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB
  • We can see that we have 5572 entries in our dataset, which means we have 5572 SMS messages.
  • We can also see that the Unnamed: 2, Unnamed: 3 and Unnamed: 4 columns are almost entirely empty: they hold only 50, 12 and 6 non-null values respectively.
  • So those columns are not useful for our analysis, which means we can drop them.
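
The missing values can also be counted directly instead of being read off the df.info() output. A minimal sketch on a toy frame (the values are invented; only the column names mirror the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout of the SMS dataset (values are made up)
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["ok", "free prize", "see you", "call me"],
    "Unnamed: 2": [np.nan, "extra", np.nan, np.nan],
})

missing_counts = toy.isna().sum()         # number of missing values per column
missing_share = toy.isna().mean() * 100   # the same, as a percentage of rows

print(missing_counts)
print(missing_share.round(1))
```

Applied to the real dataset, the same two calls would report 5522 missing values (about 99.1% of rows) for Unnamed: 2, which follows from the 50 non-null entries shown by df.info().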

We can use the next code cell to drop these columns.

In [5]:
# last 3 columns are unnecessary. So we are going to drop them.
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

Then we are going to rename the columns of the dataset for better understanding.

  • We can see that v1 column includes the labels and v2 column includes the text content of the message.
  • So let's rename the columns as label and message.
  • Then check the dataset again.
In [6]:
df.rename(columns={'v1': 'label', 'v2': 'message'}, inplace=True)
df.head(10)
Out[6]:
   label  message
0  ham    Go until jurong point, crazy.. Available only ...
1  ham    Ok lar... Joking wif u oni...
2  spam   Free entry in 2 a wkly comp to win FA Cup fina...
3  ham    U dun say so early hor... U c already then say...
4  ham    Nah I don't think he goes to usf, he lives aro...
5  spam   FreeMsg Hey there darling it's been 3 week's n...
6  ham    Even my brother is not like to speak with me. ...
7  ham    As per your request 'Melle Melle (Oru Minnamin...
8  spam   WINNER!! As a valued network customer you have...
9  spam   Had your mobile 11 months or more? U R entitle...
  • The output shows that we now have only two columns.
  • We can also see that the column names have changed to label and message.

4.2. Distribution of Target Variables¶

  • Now, we are going to see how the values of the label column are distributed in the dataset.
  • The main reason for this is to see whether the dataset is balanced or not.
  • If the dataset is not balanced, it can cause bias in our model.

What is a Target Variable?¶

  • In machine learning and statistics, the target variable (usually denoted by y) is the variable whose values are to be predicted from the other features of the dataset.

For this dataset, the target variable is the label column.

Let's create a pie chart to see the distribution of the target variable.

In [7]:
target_cat = df.label.value_counts()

# create the pie chart with matplotlib, using colors from a seaborn palette
color1 = sns.color_palette("viridis")[0]
color2 = sns.color_palette("viridis")[3]

plt.pie(target_cat, labels=target_cat.index, colors=[color1, color2], autopct='%1.1f%%', shadow=True, startangle=90, explode=(0, 0.1))
plt.title('Target Variable Distribution (Spam and Ham)')
plt.show()

print(target_cat)
label
ham     4825
spam     747
Name: count, dtype: int64

Output shows:

  • We have 4825 'ham' messages but only 747 'spam' ones (about 86.6% vs. 13.4%).
  • This tells us that we have a highly imbalanced dataset.

When the data is skewed like this, machine learning models might not perform at their best.

The uneven distribution can cause issues when we divide the data for training and testing: one of the splits may end up with very few spam examples. As a knock-on effect, our model can become biased towards the majority class.
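
To illustrate the splitting issue, the sketch below keeps the class ratio the same in both parts by passing stratify to train_test_split. This is not what the tutorial does; the toy labels are invented and only mimic the roughly 13% spam share seen above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with roughly the same imbalance as the dataset (13% spam)
toy_labels = pd.Series(["ham"] * 87 + ["spam"] * 13)
toy_features = pd.DataFrame({"row_id": range(len(toy_labels))})

# stratify=toy_labels keeps the ham/spam ratio nearly identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_labels, test_size=0.3, random_state=42, stratify=toy_labels
)

print(y_tr.value_counts(normalize=True).round(2))
print(y_te.value_counts(normalize=True).round(2))
```

Without stratify, the share of spam in the smaller part depends on the random shuffle and can drift away from the overall ratio.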

There are a few ways to handle this. To keep things simple, we'll just randomly remove 'ham' messages until both classes have the same number of messages (this is known as undersampling).

After doing this, we'll look at the target variable distribution once more.

In [8]:
num_spam = target_cat['spam']

# Randomly sampling the majority class to match the count of the minority class
df = pd.concat(
    [
        df[df['label'] == 'ham'].sample(n=num_spam, random_state=42), 
        df[df['label'] == 'spam']
    ]
)

#check the target variable distribution again
target_cat = df.label.value_counts()

#set the colors
color1 = sns.color_palette("viridis")[0]
color2 = sns.color_palette("viridis")[3]

#plot the pie chart
plt.pie(target_cat, labels=target_cat.index, colors=[color1, color2], autopct='%1.1f%%', shadow=True, startangle=90, explode=(0, 0.1))
plt.title('Target Variable Distribution after dropping ham messages')
plt.show()

print(target_cat)
label
ham     747
spam    747
Name: count, dtype: int64

Now we can see that the counts and percentages of ham and spam messages are identical, so we can proceed to the next step.

4.3. Most Common Words in Spam and Ham Messages (Unigrams)¶

  • By looking at the most common unigrams in spam and ham messages, we can get a better idea of which words are more frequent in spam messages vs. ham messages.
  • This will also help us understand the data better.

What are Unigrams?

  • Unigrams are the individual words in a sentence.
  • For example, the sentence "I love Python" has three unigrams: "I", "love", and "Python".
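
As a small sketch of the idea, splitting a sentence on whitespace yields its unigrams, and collections.Counter counts how often each one occurs. The second sentence is invented for illustration.

```python
from collections import Counter

# Unigrams are simply the individual words of a sentence
sentence = "I love Python"
unigrams = sentence.split()
print(unigrams)

# Counting unigrams: lowercase first so that "Free" and "free" are the same word
counts = Counter(word.lower() for word in "Free call free now".split())
print(counts.most_common(1))
```

The tutorial's own code uses NLTK's word_tokenize instead of split(), which also separates punctuation from words.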

The next code cell gets the 10 most common unigrams in spam and ham messages. We will also display the unigram counts for spam messages to see what they look like.

In [9]:
# Fallback: a predefined list of stopwords, in case the NLTK download is not available (uncomment to use)
# stop_words = set([
#     "ourselves", "hers", "between", "yourself", "but",
#     "again", "there", "about", "once", "during", "out",
#     "very", "having", "with", "they", "own", "an", "be",
#     "some", "for", "do", "its", "yours", "such", "into",
#     "of", "most", "itself", "other", "off", "is", "s", "am",
#     "or", "who", "as", "from", "him", "each", "the", "themselves",
#     "until", "below", "are", "we", "these", "your", "his", "through",
#     "don", "nor", "me", "were", "her", "more", "himself", "this",
#     "down", "should", "our", "their", "while", "above", "both", "up",
#     "to", "ours", "had", "she", "all", "no", "when", "at", "any", "before",
#     "them", "same", "and", "been", "have", "in", "will", "on", "does",
#     "yourselves", "then", "that", "because", "what", "over", "why", "so",
#     "can", "did", "not", "now", "under", "he", "you", "herself", "has",
#     "just", "where", "too", "only", "myself", "which", "those", "i",
#     "after", "few", "whom", "t", "being", "if", "theirs", "my", "against",
#     "a", "by", "doing", "it", "how", "further", "was", "here", "than"
# ])

#get the stopwords from nltk
stop_words = set(stopwords.words('english'))

def get_most_common_words(texts, n=10):
    words = []
    for text in texts:
        tokens = word_tokenize(text)
        words.extend([word.lower() for word in tokens if word.isalpha() and word.lower() not in stop_words])
    # Count the frequency of each word
    word_freq = Counter(words)
    
    return word_freq.most_common(n)

# Get the 10 most common words for spam and ham messages
common_words_ham = get_most_common_words(df[df['label'] == 'ham']['message'], 10)
common_words_spam = get_most_common_words(df[df['label'] == 'spam']['message'], 10)


# create tables for common_words_ham and common_words_spam
common_words_ham_df = pd.DataFrame(common_words_ham, columns=['word', 'count'])
common_words_spam_df = pd.DataFrame(common_words_spam, columns=['word', 'count'])

common_words_spam_df.head(10)
Out[9]:
   word    count
0  call    346
1  free    219
2  txt     156
3  u       144
4  ur      144
5  mobile  123
6  text    121
7  stop    114
8  claim   113
9  reply   104
  • We can see that the most common unigram in spam messages is 'call', followed by words such as 'free', 'txt' and 'claim'.
  • A better way to get an overview of the data is to create a word cloud.
  • A word cloud is a data visualization technique for text data in which the size of each word indicates its frequency or importance.

So, let's create word clouds for spam and ham messages.

In [10]:
def generate_wordcloud(texts_ham, texts_spam):
    
    # Join all the text messages together 
    all_text_ham = ' '.join(texts_ham)
    all_text_spam = ' '.join(texts_spam)
    
    # Create a word cloud for ham and spam messages
    wordcloud_ham = WordCloud(stopwords=stop_words, background_color="white", width=800, height=400).generate(all_text_ham)
    wordcloud_spam = WordCloud(stopwords=stop_words, background_color="white", width=800, height=400).generate(all_text_spam)
    
    # Plot the word clouds
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))
    ax1.imshow(wordcloud_ham, interpolation='bilinear')
    ax1.axis('off')
    ax1.set_title("Ham Messages Word Cloud", fontsize=25)
    
    ax2.imshow(wordcloud_spam, interpolation='bilinear')
    ax2.axis('off')
    ax2.set_title("Spam Messages Word Cloud", fontsize=25)
    
    # Adjust layout
    plt.tight_layout()
    plt.show()

# Generate word clouds for ham and spam messages side by side
generate_wordcloud(df[df['label'] == 'ham']['message'].tolist(), df[df['label'] == 'spam']['message'].tolist())
  • Ham Messages Word Cloud:
    • Words that stand out in non-spam messages include "I'm", "u", "know", "get", "Ok", "love", and "go".
    • These words are indicative of casual or informal conversations.



  • Spam Messages Word Cloud:
    • In the spam messages, words like "free", "call", "text", "mobile", "reply", "urgent", and "claim" are prominently visible.
    • These words are often associated with promotional or deceptive content.

4.4. Most Common Bigrams in Spam and Ham Messages¶

Previously we looked at the most common unigrams in spam and ham messages. Now we are going to look at the most common bigrams.

What are Bigrams?¶

Bigrams, in the context of text analysis, are pairs of consecutive words from a given text.

Example: For the sentence "I love ice cream", the bigrams would be:

  • I love
  • love ice
  • ice cream
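
The same example can be reproduced in plain Python by pairing each word with the one that follows it. This is a minimal sketch; the tutorial's code relies on nltk.bigrams, which produces the same pairs.

```python
# Pair every token with its successor to form bigrams
tokens = "I love ice cream".split()
bigram_list = list(zip(tokens, tokens[1:]))
print(bigram_list)
```

A sentence with n words always has n - 1 bigrams, so the four words here give three pairs.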

Let's take a look at the most common bigrams in spam and ham messages.

  • The next code cell first extracts the 10 most common bigrams from the spam and ham messages.
  • Then it plots the bigrams in two bar charts.
In [11]:
def remove_punctuation(message):
    return ''.join([char for char in message if char not in string.punctuation])


def tokenize(message):
    return [word.lower() for word in message.split() if word.lower() not in stop_words]


def generate_bigrams(tokens):
    return list(bigrams(tokens))


def preprocess_message(message):
    message = remove_punctuation(message)
    tokens = tokenize(message)
    return generate_bigrams(tokens)


def get_top_bigrams(series, n=10):
    # Flatten the list of bigrams and count occurrences
    all_bigrams = [bigram for sublist in series for bigram in sublist]
    bigram_counts = Counter(all_bigrams)
    return bigram_counts.most_common(n)

def plot_top_bigrams(df, column, title, ax, palette="viridis"):
    sns.barplot(data=df, y=column, x='Count', ax=ax, palette=palette)
    ax.set_title(title)
    ax.set_xlabel('Count')
    ax.set_ylabel(column)

# Compute the bigrams of each message and store them in a new column
df["bigrams"] = df["message"].apply(preprocess_message)

# Extract top bigrams
top_spam_bigrams = get_top_bigrams(df[df['label'] == 'spam']['bigrams'])
top_ham_bigrams = get_top_bigrams(df[df['label'] == 'ham']['bigrams'])

# Create dataframes for visualization
spam_df = pd.DataFrame(top_spam_bigrams, columns=['Bigrams', 'Count'])
ham_df = pd.DataFrame(top_ham_bigrams, columns=['Bigrams', 'Count'])

# Adjust bigram representation for readability
spam_df['Bigrams'] = spam_df['Bigrams'].apply(lambda x: ' '.join(x))
ham_df['Bigrams'] = ham_df['Bigrams'].apply(lambda x: ' '.join(x))

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(16, 8), sharex=True)
plot_top_bigrams(spam_df, 'Bigrams', 'Top Bigrams in Spam Messages', axes[0])
plot_top_bigrams(ham_df, 'Bigrams', 'Top Bigrams in Ham Messages', axes[1])

plt.tight_layout()
plt.show()

From the graphs above, we can observe that:

  • Spam messages frequently include prompts for the recipient to take action, such as "please call", "contact u", and "await collection".

  • On the other hand, non-spam messages seem to revolve more around everyday conversations, with phrases like "call later", "let know", and "good morning".


5. Feature Engineering¶

What is Feature Engineering?¶

Feature engineering means using what you know about the data to create new features, helping machine learning models perform better.

5.1. Message Length¶

  • Next, let's visualize the distribution of message lengths for both spam and ham messages.
  • This will give us an idea of whether there's any noticeable difference in the length of spam versus non-spam messages.
In [12]:
# create a new column for message length and calculate the length of each message
df["msg_len"] = df["message"].apply(len)

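# Plot the distribution of message lengths
# (histplot with kde=True is used because distplot has been deprecated since seaborn 0.11)
sns.histplot(df[df.label == 'ham'].msg_len, label='Ham', color=sns.color_palette("viridis")[0], kde=True, stat='density')
sns.histplot(df[df.label == 'spam'].msg_len, label='Spam', color=sns.color_palette("viridis")[3], kde=True, stat='density')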
plt.title('Distribution Plot for Length of Messages')
plt.xlabel('Length of Messages')
plt.legend()
plt.show()

print('Average message length for ham messages: ', df[df['label'] == 'ham']['message'].str.len().mean())
print('Average message length for spam messages: ', df[df['label'] == 'spam']['message'].str.len().mean())
Average message length for ham messages:  69.11780455153949
Average message length for spam messages:  138.8661311914324

The graph above shows that:

  • Ham (non-spam) Messages:
    • The majority of ham messages are shorter in length.
    • Mostly below 100 characters.
    • There's a peak near the beginning, indicating a significant number of very short ham messages.



  • Spam Messages:
    • Spam messages tend to be longer on average compared to ham messages.
    • There's a noticeable peak around 150 characters.

5.2 Other Features¶

Due to limited time, we are going to create all the remaining features at once, without explaining each one.
In real-world projects, however, we need to analyze each feature and decide whether to use it or not.

By following the same approach as before, we can create other features such as:

  • Number of punctuation marks
  • Number of exclamation marks
  • Number of uppercase letters
  • Number of numeric digits
  • Number of sentences
  • Number of words

The next code cell creates all the features mentioned above.

In [13]:
# Create new column for number of punctuations, and calculate the number of punctuations in each message
df['num_punctuations'] = df['message'].apply(lambda x: sum([1 for char in x if char in string.punctuation]))

# Create new column for number of exclamation marks, and calculate the number of exclamation marks in each message
df['num_exclamation_marks'] = df['message'].apply(lambda x: sum([1 for char in x if char == '!']))

# Calculate the  number of uppercase letters in each message and create a new column for it
df['num_upper_case'] = df['message'].apply(lambda x: sum([1 for char in x if char.isupper()]))

# Count the number of numeric characters in each message and create a new column for it
df['num_numeric'] = df['message'].apply(lambda x: sum([1 for char in x if char.isdigit()]))

# Tokenize messages and compute the number of sentences in each message and create a new column for it
df['num_sentences'] = df['message'].apply(lambda x: len(sent_tokenize(x)))

# Create a new column for number of words in each message and calculate the number of words in each message
df['num_words'] = df['message'].apply(lambda x: len(word_tokenize(x)))

6. Model Training¶

6.1. Feature Selection¶

  • Building an effective machine learning model for spam detection requires carefully selected features.
  • In real-world projects, we need to analyze each feature and decide whether to use it or not.
  • Due to limited time, however, we are going to use all the features that we created in the previous steps.

Let's check the current columns in the dataframe.

In [14]:
df.columns
Out[14]:
Index(['label', 'message', 'bigrams', 'msg_len', 'num_punctuations',
       'num_exclamation_marks', 'num_upper_case', 'num_numeric',
       'num_sentences', 'num_words'],
      dtype='object')
  • Looking at the output above, we can see that we have already created the features we are going to use.
  • The next step is to assign the features to a variable (X).

Note

  • X: This is the matrix of independent variables.
    • It includes the columns from the dataframe that represent the computed features of the messages,
    • such as message length (msg_len), number of exclamation marks (num_exclamation_marks), and so on.
In [15]:
X = df[['msg_len', 'num_exclamation_marks', 'num_punctuations', 'num_upper_case', 'num_numeric', 'num_words', 'num_sentences']]

6.2. Label Encoding¶

  • Many machine learning algorithms need categorical labels (or features) to be converted to numerical values,
  • because the algorithms are designed to work with numbers.
  • This process is known as "encoding."

For this dataset, we need to encode the label column.

For example:

  • spam: 1
  • ham: 0
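
The mapping above can be tried on a toy list of labels (invented for illustration). LabelEncoder numbers the classes in sorted order, which is why 'ham' becomes 0 and 'spam' becomes 1, and inverse_transform converts the numbers back to the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, not taken from the dataset
toy_labels = ["spam", "ham", "ham", "spam"]

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(toy_labels)

print(list(encoded))               # numeric version of the labels
print(list(toy_encoder.classes_))  # classes in the order they were numbered

# Convert the numbers back to the original text labels
decoded = toy_encoder.inverse_transform(encoded)
print(list(decoded))
```
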
Next code cell will encode the label column.
In [16]:
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(df['label'])
df['encoded_label'] = encoded_labels
df.tail()
Out[16]:
label message bigrams msg_len num_punctuations num_exclamation_marks num_upper_case num_numeric num_sentences num_words encoded_label
5537 spam Want explicit SEX in 30 secs? Ring 02073162414... [(want, explicit), (explicit, sex), (sex, 30),... 90 3 1 17 21 3 18 1
5540 spam ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ... [(asked, 3mobile), (3mobile, 0870), (0870, cha... 160 5 0 104 14 6 38 1
5547 spam Had your contract mobile 11 Mnths? Latest Moto... [(contract, mobile), (mobile, 11), (11, mnths)... 160 8 1 20 2 5 35 1
5566 spam REMINDER FROM O2: To get 2.50 pounds free call... [(reminder, o2), (o2, get), (get, 250), (250, ... 147 3 0 14 5 1 30 1
5567 spam This is the 2nd time we have tried 2 contact u... [(2nd, time), (time, tried), (tried, 2), (2, c... 161 8 1 9 21 4 35 1
  • We can see that new column named encoded_label is added to the dataframe.
  • This column contains the encoded labels.

Let's check how our target variable is encoded.

In [17]:
# Display classes and their encoded values
for encoded, original in enumerate(encoder.classes_):
    print(f"'{original}' is encoded as {encoded}")
'ham' is encoded as 0
'spam' is encoded as 1
  • We can see that spam is encoded as 1 and ham is encoded as 0.

Now we can assign the encoded labels to variable. (y)

Note

  • y: This is the response or dependent variable, which in this case is encoded_label.
    • This represents whether a message is spam or not in an encoded form (e.g., 0 for 'ham' and 1 for 'spam').
In [18]:
y = df["encoded_label"]

6.3. Data Splitting¶

For machine learning, we need to split the data into a training set and a test set.

What are Training and Test Sets?¶

  • The training set is the part of the data used to train a model.
  • The test set is the part of the data held back from training. It is used to measure how well the model predicts on data it has not seen before.
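
A minimal sketch of how the split sizes work out, using a toy array of 10 rows (invented for illustration) and the same test_size=0.3 and random_state=42 as the tutorial:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples with 2 features each, and alternating 0/1 labels
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1] * 5)

# test_size=0.3 holds back 30% of the rows for testing
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42
)

print(toy_X_train.shape, toy_X_test.shape)
```

With 10 rows, 3 go to the test set and the remaining 7 to the training set. The rows are shuffled before splitting, and fixing random_state makes the shuffle reproducible.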

The next code cell splits the data into training and test sets, keeping 30% of the messages for testing (test_size=0.3).

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

6.4. Machine Learning Models¶

  • There are many Machine Learning algorithms that can be used for classification.
    • Random Forest
    • Logistic Regression
    • Support Vector Machine
    • etc
  • Most of the time, we cannot tell in advance which algorithm will perform best on our dataset.
  • So, we need to try different algorithms and compare their performance.

Let's create a dictionary of machine learning algorithms so that we can train and compare them.

I'm not going to explain each algorithm in detail, but here is a brief description of each one.

  • Random Forest: A versatile ensemble method that creates multiple decision trees during training and outputs the mode of the classes for classification tasks. Here, it's creating 1000 trees.
  • Decision Tree: A flowchart-like structure where each internal node represents a feature(or attribute), each branch represents a decision rule, and each leaf node represents the outcome.
  • Logistic Regression: A linear model that estimates the probability of a class by applying the logistic (sigmoid) function to a weighted sum of the input features.
  • SVC (Support Vector Classifier): Part of the Support Vector Machine (SVM) family, used for classification. Here, it's using a linear kernel.
  • KNN (K-Nearest Neighbors): A non-parametric algorithm used for classification and regression. By default, it considers 5 neighbors.
  • XGB (XGBoost): An optimized gradient boosting library. The objective here is set to 'binary:hinge', which is useful for binary classification problems.
  • Multinomial NB (Multinomial Naive Bayes): A variant of Naive Bayes classifier, particularly suited for classification with discrete features.
  • LGBM (Light Gradient Boosted Machine): A gradient boosting framework that uses tree-based learning algorithms. Here, it's set for binary classification.
In [20]:
num_feat_models = {
    "Random Forest": RandomForestClassifier(n_estimators=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(random_state=42),
    "SVC": SVC(kernel="linear", random_state=42),
    "KNN": KNeighborsClassifier(),
    "XGB": XGBClassifier(objective="binary:hinge", random_state=42),
    "Multinomial NB": MultinomialNB(),
    "LGBM": LGBMClassifier(boosting_type="gbdt", objective="binary", max_depth=4, random_state=42, verbose=-1),
}

6.5. Model Training¶

We have reached the final step of the modeling workflow: training the models.

  • The next code cell trains each model on the training data and evaluates it on the test data.
In [21]:
def train_clf(model, X_train, y_train, X_test, y_test):
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the test set; all metrics below are computed from these hard 0/1 predictions
    y_pred = model.predict(X_test)
    
    results = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1 Score': f1_score(y_test, y_pred),
        'Roc Auc Score': roc_auc_score(y_test, y_pred),
        'Confusion Matrix': confusion_matrix(y_test, y_pred)
    }
    
    return results

# Store scores for each model
model_scores = {}

for name, model in num_feat_models.items():
    model_scores[name] = train_clf(model, X_train, y_train, X_test, y_test)

# Convert scores dictionary to a DataFrame
scores_df = pd.DataFrame(model_scores).T
scores_df.reset_index(inplace=True)
scores_df.rename(columns={'index': 'Algorithm'}, inplace=True)
scores_df = scores_df.sort_values(by="F1 Score", ascending=False)
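
One detail worth knowing: ROC AUC is usually computed from continuous scores (such as predicted probabilities) rather than hard 0/1 predictions. With hard predictions, as in train_clf above, it reduces to the average of the recall on each class (balanced accuracy). The sketch below illustrates this on synthetic data; X_demo, y_demo and clf are made-up names for illustration and are not part of our pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for our 7 numerical features (hypothetical data)
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard labels: equivalent to balanced accuracy
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))

# AUC from probabilities of the positive class: uses the full ranking of scores
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy:      {bal_acc:.4f}")
print(f"AUC from probabilities: {auc_from_scores:.4f}")
```

Note that not every model exposes predict_proba (for example, SVC needs probability=True), so this variant would need a per-model check before it could replace the call in train_clf.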

6.6. Model Evaluation¶

Now let's check the performance of the models.

In [22]:
scores_df
Out[22]:
   Algorithm            Accuracy  Precision  Recall    F1 Score  Roc Auc Score  Confusion Matrix
0  Random Forest        0.973274  0.986726   0.961207  0.973799  0.973691       [[214, 3], [9, 223]]
7  LGBM                 0.966592  0.977974   0.956897  0.96732   0.966928       [[212, 5], [10, 222]]
5  XGB                  0.962138  0.969432   0.956897  0.963124  0.962319       [[210, 7], [10, 222]]
1  Decision Tree        0.957684  0.965066   0.952586  0.958785  0.95786        [[209, 8], [11, 221]]
4  KNN                  0.955457  0.969027   0.943966  0.956332  0.955854       [[210, 7], [13, 219]]
2  Logistic Regression  0.953229  0.968889   0.939655  0.954048  0.953699       [[210, 7], [14, 218]]
3  SVC                  0.951002  0.972973   0.931034  0.951542  0.951692       [[211, 6], [16, 216]]
6  Multinomial NB       0.919822  0.962264   0.87931   0.918919  0.921222       [[209, 8], [28, 204]]
  • There are several metrics for evaluating classification models.
  • We are going to focus on the confusion matrix.

Due to limited time, we are not going to explain each metric in depth. What we need to know is:

  • The higher the true positives and true negatives, and the lower the false positives and false negatives, the better the model.
  • The higher the F1 score, the better the model.

From the above output, we can see that:

  • Random Forest has the highest F1 score.
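
Keep in mind that the gap between the top models is only a handful of messages on a single 449-message test set, so the ranking could change with a different split. One way to check this is k-fold cross-validation, which averages the F1 score over several splits. The sketch below is only an illustration on synthetic data from make_classification; X_demo, y_demo and demo_models are made-up names, and only two of our algorithms are included to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the SMS features (hypothetical data)
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# 5-fold cross-validation: each model is trained and scored on 5 different splits
cv_f1 = {}
for name, model in demo_models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1")
    cv_f1[name] = scores.mean()
    print(f"{name}: mean F1 = {scores.mean():.3f} (std = {scores.std():.3f})")
```

If two models differ by less than the spread across folds, the difference is probably not meaningful.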

Scores

  1. True Positive (TP):

    • What it means: The number of instances that were predicted as positive and are actually positive.

  2. True Negative (TN):

    • What it means: The number of instances that were predicted as negative and are actually negative.

  3. False Positive (FP) - also known as Type I error:

    • What it means: The number of instances that were predicted as positive but are actually negative.
    • Commonly referred to as a "False Alarm."

  4. False Negative (FN) - also known as Type II error:

    • What it means: The number of instances that were predicted as negative but are actually positive.
    • This can be considered a "Missed Detection."

These four measures are often displayed together in what is known as a Confusion Matrix. Understanding these terms is crucial since they form the basis for many other performance metrics in classification, like accuracy, precision, recall, and the F1 score.
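
To see how these four counts map onto scikit-learn's output, here is a small sketch with made-up labels (y_true and y_pred are toy values, assuming 0 = ham and 1 = spam). confusion_matrix puts the true labels on the rows and the predicted labels on the columns, so for binary labels the flattened order is TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 0 = ham, 1 = spam
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=1, FN=1, TP=3
```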

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Precision = TP / (TP + FP)
  • Recall (or Sensitivity or True Positive Rate) = TP / (TP + FN)
  • F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
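
As a worked example, we can plug the Random Forest confusion matrix from the table above, [[214, 3], [9, 223]], into these formulas. This sketch assumes scikit-learn's layout [[TN, FP], [FN, TP]] with spam as the positive class; the results match the Random Forest row of the table.

```python
# Counts read from the Random Forest row: [[TN, FP], [FN, TP]] = [[214, 3], [9, 223]]
tn, fp, fn, tp = 214, 3, 9, 223

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"Accuracy:  {accuracy:.6f}")   # 0.973274
print(f"Precision: {precision:.6f}")  # 0.986726
print(f"Recall:    {recall:.6f}")     # 0.961207
print(f"F1 Score:  {f1:.6f}")         # 0.973799
```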

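The main takeaway is that higher accuracy means a better model. That reading is reasonable here because the test set is roughly balanced (217 ham and 232 spam messages); on a heavily imbalanced dataset, precision, recall and F1 are more informative.

The next code cell plots the confusion matrix for each model.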

In [23]:
# Plot
classes = ['ham', 'spam']

fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(20, 10))
for idx, ax in enumerate(axes.flatten()):
    algo = scores_df.iloc[idx]['Algorithm']
    cm = scores_df.iloc[idx]['Confusion Matrix']
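    # Prepare the labels for each cell. scikit-learn puts true labels on rows and
    # predicted labels on columns, so with ham=0 and spam=1 the layout is [[TN, FP], [FN, TP]]
    labels = np.array([["TN: " + str(cm[0, 0]), "FP: " + str(cm[0, 1])],
                       ["FN: " + str(cm[1, 0]), "TP: " + str(cm[1, 1])]])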

    sns.heatmap(cm, annot=labels, fmt="", cmap="Blues", ax=ax, cbar=False)

    ax.set_title(algo)

    ax.set_xticklabels(classes)
    ax.set_yticklabels(classes, rotation=0) 
    
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')

plt.tight_layout()
plt.show()
  • We can see that the Random Forest model has the most true positives and true negatives, and the fewest false positives and false negatives.
  • This means that the Random Forest model has the highest accuracy among all the models.

7.6. Predictions on New Data¶

Now let's try predicting new messages using the models trained on numerical features.

Flow of the prediction¶

  • First, we create the same numerical features for the new message.
  • Then we pass those features to each model to get a prediction.
  • We will use a couple of example messages to test this.

In summary, the following code creates features for each new message and predicts with all of our models, so we can compare their predictions.

In [24]:
def extract_numerical_features(message):
    features = {
        'msg_len': len(message),
        'num_exclamation_marks': message.count('!'),
        'num_punctuations': sum(1 for char in message if char in string.punctuation), 
        'num_upper_case': sum(1 if char.isupper() else 0 for char in message),
        'num_numeric': sum(1 if char.isdigit() else 0 for char in message),
        'num_words': len(word_tokenize(message)),
        'num_sentences': len(sent_tokenize(message))
    }
    return features

def predict_new_data(message, num_feat_models, encoder):
    # Extract features
    features = extract_numerical_features(message)
    
    # Convert to dataframe
    df = pd.DataFrame(features, index=[0])

    # Dictionary to store the results
    predictions = {}

    for model_name, model in num_feat_models.items():
        pred = model.predict(df)
        # Assuming encoder.inverse_transform(pred) returns a list of class labels
        predicted_class = encoder.inverse_transform(pred)[0]
        predictions[model_name] = predicted_class

    return predictions

def compare_predictions(text_samples, num_feat_models, encoder):
    for text in text_samples:
        print(f"Predicting for: {text}")
        predictions = predict_new_data(text, num_feat_models, encoder)
        
        # Print individual model predictions
        for model, pred in predictions.items():
            print(f"{model} predicted: {pred}")

        # Check if all model predictions are the same; if not, compare the results
        unique_predictions = set(predictions.values())

        if len(unique_predictions) == 1:
            print("All models agree on the prediction.")
        else:
            print("Models made different predictions.")

        print("----------")

# Usage:
text_samples = [
    "Congratulations! You have won a $1000 Walmart gift card. Click here to claim: https://bit.ly/3kceqyh",
    "Hey, how are you doing? Let's meet up soon!"
]

# Assuming 'num_feat_models' is a dictionary of your models and 'encoder' is your label encoder
print("Comparing predictions on new data using numeric features model")
compare_predictions(text_samples, num_feat_models, encoder)
Comparing predictions on new data using numeric features model
Predicting for: Congratulations! You have won a $1000 Walmart gift card. Click here to claim: https://bit.ly/3kceqyh
Random Forest predicted: spam
Decision Tree predicted: spam
Logistic Regression predicted: spam
SVC predicted: spam
KNN predicted: ham
XGB predicted: spam
Multinomial NB predicted: spam
LGBM predicted: spam
Models made different predictions.
----------
Predicting for: Hey, how are you doing? Let's meet up soon!
Random Forest predicted: ham
Decision Tree predicted: ham
Logistic Regression predicted: ham
SVC predicted: ham
KNN predicted: ham
XGB predicted: ham
Multinomial NB predicted: ham
LGBM predicted: ham
All models agree on the prediction.
----------
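
One thing to watch when predicting on new data: the feature columns must have the same names and order as the data the models were trained on, otherwise scikit-learn may raise an error or silently mix up features. The sketch below shows one defensive pattern using reindex. It is an illustration only: train_columns is a hypothetical list (a subset of our features), and simple_features is a simplified stand-in for extract_numerical_features that avoids the NLTK tokenizers.

```python
import pandas as pd

# Hypothetical: the column order the models were fitted with (subset of our features)
train_columns = ["msg_len", "num_exclamation_marks", "num_upper_case", "num_numeric"]

def simple_features(message):
    # Simplified stand-in for extract_numerical_features;
    # keys are deliberately in a different order than train_columns
    return {
        "num_numeric": sum(1 for char in message if char.isdigit()),
        "num_upper_case": sum(1 for char in message if char.isupper()),
        "msg_len": len(message),
        "num_exclamation_marks": message.count("!"),
    }

msg = "WIN a FREE prize!!! Call 0800 now"

# Build a one-row DataFrame, then force the training column order
row = pd.DataFrame([simple_features(msg)]).reindex(columns=train_columns)

# A missing feature would show up as NaN after reindex, so check for that
assert not row.isna().any().any(), "A training feature is missing"
print(row)
```

In our notebook, the equivalent would be reindexing with the columns of X_train before calling model.predict.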

8. How to improve the model performance¶

  • Use more data to train the model.
  • Engineer more features.
  • Combine multiple algorithms (ensemble learning).
  • Combine numerical features and text features.
  • Use more advanced algorithms, such as deep learning.
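
As an illustration of the ensemble idea, here is a minimal sketch using scikit-learn's VotingClassifier, which lets several models vote on each prediction. It runs on synthetic data (X_demo and y_demo are made up), and the choice of three models is arbitrary; whether it actually beats Random Forest on the SMS data would need to be tested.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the SMS features (hypothetical data)
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Hard voting: each model predicts a class and the majority wins
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000, random_state=42)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",
)
ensemble.fit(X_tr, y_tr)

y_hat = ensemble.predict(X_te)
ensemble_f1 = f1_score(y_te, y_hat)
print(f"Ensemble F1 on the toy test set: {ensemble_f1:.3f}")
```

Using an odd number of models avoids ties in a binary hard vote.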

9. Conclusion¶

  • In this project, we built machine learning models to classify messages as spam or ham.
  • We got familiar with:
    • Machine Learning
    • Supervised Learning
    • EDA (Exploratory Data Analysis)
    • Feature Engineering
    • Model Training
    • Model Evaluation
  • We used two different methods to train the model.
    • First we used numerical features to train the model.
    • Then we used text features to train the model.
  • We used several machine learning algorithms to train the model.
  • Random Forest model had the best performance among all the models for this dataset and task.