Generate Email Subject With AI¶

1. Announcement¶

Sorry to disappoint everybody: today we cannot build an email subject generation AI, because we are unable to share the employee email data that was used to train that model.

However, there's a silver lining. We're in the process of developing a similar model using a different dataset.

In the email dataset, we use the email body to generate the email subject:

  • Email subject generation : Email body -> Email subject

Here, we will instead use the news content to generate the news title:

  • News title generation : News content -> News title

The concept remains quite similar. If you're keen, you can utilize email data to train a model in the future using the same codebase, with only minor tweaks required. If you need assistance, please don't hesitate to reach out at nuwan@qualitia.com

Important Note 1:

  • We are going to copy and paste code from this notebook into Google Colab, so please make sure to run the code cells in order.

Important Note 2:

  • If you run into any problem while running the code, please contact me or Hirano-san immediately so we can fix it as soon as possible. Otherwise you will not be able to run the code and follow the tutorial.

1.2. How to Activate GPU Computing in Google Colab?¶

  • Step 1: Open Google Colab
    • First, you need to open Google Colab by going to https://colab.research.google.com/ and signing in with your Google account.



  • Step 2: Create a New Notebook
    • Once you are signed in, click on the “New Notebook” button to create a new notebook.



  • Step 3: Select GPU as the Hardware Accelerator
    • Next, select “Runtime” from the top menu and then click on “Change runtime type.”
    • In the "Runtime type" default is "Python 3" and keep it as it is.
    • In the “Hardware accelerator” dropdown, select T4 GPU and then click “Save.”
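Once the runtime is saved, you can optionally confirm that a GPU is attached by running the nvidia-smi command in a cell (a quick sanity check; the exact GPU model and driver version shown will vary by runtime):

# Optional check: list the GPU attached to the Colab runtime.
!nvidia-smi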

1.3. Additional Information Notice¶


In the explanations I have added Additional Information: sections between dotted lines. You can basically ignore them; they are just for your reference. If you are interested, read them at home.


2. Installing necessary libraries¶

As our first step, let's install the necessary libraries.

We can use the pip command to install them.

We will be using the following libraries for our project. When using system commands in Colab, prefix the command with a ! or %.

List of libraries we will be using:

  • transformers : Hugging Face's transformers library is used for training and inference of our model.
  • datasets : Hugging Face's datasets library is used for loading our dataset.
  • transformers[ja] : Installs Japanese-specific dependencies such as MeCab, fugashi and ipadic, which are used for tokenizing our text data.
  • sentencepiece : SentencePiece is used for tokenizing our text data.
  • torch : PyTorch is used as the deep learning framework.
  • bert_score : bert_score is used for evaluating our model.
  • absl-py : Abseil is used for logging (we do not use it directly, but other libraries depend on it).
  • evaluate : Evaluate is used for evaluating our model.
  • matplotlib : Matplotlib is used for plotting graphs.
  • seaborn : Seaborn is used for plotting graphs.
In [1]:
%pip install transformers[torch,ja]==4.33.3 datasets==2.14.5 sentencepiece matplotlib seaborn evaluate absl-py bert_score pandas tokenizers==0.13.3
Collecting transformers[ja,torch]==4.33.3
  Downloading transformers-4.33.3-py3-none-any.whl (7.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 16.5 MB/s eta 0:00:00
Collecting datasets==2.14.5
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 519.6/519.6 kB 35.0 MB/s eta 0:00:00
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 47.9 MB/s eta 0:00:00
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (3.7.1)
Requirement already satisfied: seaborn in /usr/local/lib/python3.10/dist-packages (0.12.2)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.1/84.1 kB 11.9 MB/s eta 0:00:00
Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (1.4.0)
Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.1/61.1 kB 6.6 MB/s eta 0:00:00
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (1.5.3)
Collecting tokenizers==0.13.3
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 55.8 MB/s eta 0:00:00
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers[ja,torch]==4.33.3) (3.12.4)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers[ja,torch]==4.33.3)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.0/302.0 kB 32.4 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers[ja,torch]==4.33.3) (1.23.5)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers[ja,torch]==4.33.3) (23.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers[ja,torch]==4.33.3) (6.0.1)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers[ja,torch]==4.33.3) (2023.6.3)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers[ja,torch]==4.33.3) (2.31.0)
Collecting safetensors>=0.3.1 (from transformers[ja,torch]==4.33.3)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 62.2 MB/s eta 0:00:00
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers[ja,torch]==4.33.3) (4.66.1)
Requirement already satisfied: torch!=1.12.0,>=1.10 in /usr/local/lib/python3.10/dist-packages (from transformers[ja,torch]==4.33.3) (2.1.0+cu118)
Collecting accelerate>=0.20.3 (from transformers[ja,torch]==4.33.3)
  Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 261.4/261.4 kB 24.9 MB/s eta 0:00:00
Collecting fugashi>=1.0 (from transformers[ja,torch]==4.33.3)
  Downloading fugashi-1.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (600 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 600.9/600.9 kB 44.2 MB/s eta 0:00:00
Collecting ipadic<2.0,>=1.0.0 (from transformers[ja,torch]==4.33.3)
  Downloading ipadic-1.0.0.tar.gz (13.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.4/13.4 MB 56.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting unidic-lite>=1.0.7 (from transformers[ja,torch]==4.33.3)
  Downloading unidic-lite-1.0.8.tar.gz (47.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47.4/47.4 MB 10.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting unidic>=1.0.2 (from transformers[ja,torch]==4.33.3)
  Downloading unidic-1.1.0.tar.gz (7.7 kB)
  Preparing metadata (setup.py) ... done
Collecting sudachipy>=0.6.6 (from transformers[ja,torch]==4.33.3)
  Downloading SudachiPy-0.6.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 104.4 MB/s eta 0:00:00
Collecting sudachidict-core>=20220729 (from transformers[ja,torch]==4.33.3)
  Downloading SudachiDict_core-20230927-py3-none-any.whl (71.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.7/71.7 MB 9.9 MB/s eta 0:00:00
Collecting rhoknp<1.3.1,>=1.1.0 (from transformers[ja,torch]==4.33.3)
  Downloading rhoknp-1.3.0-py3-none-any.whl (86 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86.8/86.8 kB 12.9 MB/s eta 0:00:00
Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets==2.14.5) (9.0.0)
Collecting dill<0.3.8,>=0.3.0 (from datasets==2.14.5)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 115.3/115.3 kB 12.7 MB/s eta 0:00:00
Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets==2.14.5) (3.4.1)
Collecting multiprocess (from datasets==2.14.5)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.8/134.8 kB 6.9 MB/s eta 0:00:00
Requirement already satisfied: fsspec[http]<2023.9.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets==2.14.5) (2023.6.0)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets==2.14.5) (3.8.6)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.1.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (4.43.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.4.5)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (2.8.2)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2023.3.post1)
Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.20.3->transformers[ja,torch]==4.33.3) (5.9.5)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets==2.14.5) (23.1.0)
Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets==2.14.5) (3.3.1)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets==2.14.5) (6.0.4)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets==2.14.5) (4.0.3)
Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets==2.14.5) (1.9.2)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets==2.14.5) (1.4.0)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets==2.14.5) (1.3.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.15.1->transformers[ja,torch]==4.33.3) (4.5.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers[ja,torch]==4.33.3) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers[ja,torch]==4.33.3) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers[ja,torch]==4.33.3) (2023.7.22)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch!=1.12.0,>=1.10->transformers[ja,torch]==4.33.3) (1.12)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch!=1.12.0,>=1.10->transformers[ja,torch]==4.33.3) (3.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch!=1.12.0,>=1.10->transformers[ja,torch]==4.33.3) (3.1.2)
Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch!=1.12.0,>=1.10->transformers[ja,torch]==4.33.3) (2.1.0)
Collecting wasabi<1.0.0,>=0.6.0 (from unidic>=1.0.2->transformers[ja,torch]==4.33.3)
  Downloading wasabi-0.10.1-py3-none-any.whl (26 kB)
Collecting plac<2.0.0,>=1.1.3 (from unidic>=1.0.2->transformers[ja,torch]==4.33.3)
  Downloading plac-1.4.1-py2.py3-none-any.whl (22 kB)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch!=1.12.0,>=1.10->transformers[ja,torch]==4.33.3) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch!=1.12.0,>=1.10->transformers[ja,torch]==4.33.3) (1.3.0)
Building wheels for collected packages: ipadic, unidic, unidic-lite
  Building wheel for ipadic (setup.py) ... done
  Created wheel for ipadic: filename=ipadic-1.0.0-py3-none-any.whl size=13556703 sha256=61e3ef66a9ff1f63cd9068271237e9b094082227b329157c54f38738924273a9
  Stored in directory: /root/.cache/pip/wheels/5b/ea/e3/2f6e0860a327daba3b030853fce4483ed37468bbf1101c59c3
  Building wheel for unidic (setup.py) ... done
  Created wheel for unidic: filename=unidic-1.1.0-py3-none-any.whl size=7406 sha256=15945d7a10ef6a93f0fa8658e40694e0ef737c75d573bf4a80b2218fc637b845
  Stored in directory: /root/.cache/pip/wheels/7a/72/72/1f3d654c345ea69d5d51b531c90daf7ba14cc555eaf2c64ab0
  Building wheel for unidic-lite (setup.py) ... done
  Created wheel for unidic-lite: filename=unidic_lite-1.0.8-py3-none-any.whl size=47658816 sha256=21d0f7abc554f83903997d222c43881d3dc0afbf447eaa990859dfc13d21c0db
  Stored in directory: /root/.cache/pip/wheels/89/e8/68/f9ac36b8cc6c8b3c96888cd57434abed96595d444f42243853
Successfully built ipadic unidic unidic-lite
Installing collected packages: wasabi, unidic-lite, tokenizers, sudachipy, sentencepiece, plac, ipadic, sudachidict-core, safetensors, rhoknp, fugashi, dill, unidic, responses, multiprocess, huggingface-hub, transformers, accelerate, datasets, bert_score, evaluate
  Attempting uninstall: wasabi
    Found existing installation: wasabi 1.1.2
    Uninstalling wasabi-1.1.2:
      Successfully uninstalled wasabi-1.1.2
Successfully installed accelerate-0.24.1 bert_score-0.3.13 datasets-2.14.5 dill-0.3.7 evaluate-0.4.1 fugashi-1.3.0 huggingface-hub-0.18.0 ipadic-1.0.0 multiprocess-0.70.15 plac-1.4.1 responses-0.18.0 rhoknp-1.3.0 safetensors-0.4.0 sentencepiece-0.1.99 sudachidict-core-20230927 sudachipy-0.6.7 tokenizers-0.13.3 transformers-4.33.3 unidic-1.1.0 unidic-lite-1.0.8 wasabi-0.10.1

3. Check GPU Availability¶

Now, before moving forward, let's check whether a GPU is available. We can use the following code to check GPU availability.

In [2]:
import torch

if torch.cuda.is_available():
    status = "GPU is enabled."
    device_count = torch.cuda.device_count()
    current_device = torch.cuda.current_device()
    print(f"{status}\ndevice count: {device_count}, current device: {current_device}")
else:
    print("GPU is disabled.")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")
GPU is enabled.
device count: 1, current device: 0
device: cuda

Important !!!:

  • Please contact me or Hirano-san right away if you get the following output, because we need to change the runtime type to GPU. Without a GPU, you cannot run this notebook.
GPU is disabled.
device: cpu

4. Set Seed (Controlling Randomness)¶

When it comes to machine learning and deep learning, setting a seed is important to ensure that the results are reproducible. We will set a seed of 42 for our project.

Keep in mind that even with these seeds set, achieving reproducibility across different platforms or GPU architectures can still be challenging.

In [3]:
import random
import numpy as np
import torch

import warnings
warnings.filterwarnings('ignore')
def seed_everything(seed_value):
    random.seed(seed_value) # Python
    np.random.seed(seed_value) # Numpy
    torch.manual_seed(seed_value) # CPU
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value) # GPU if available
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

seed_value = 42
seed_everything(seed_value)

5. Data¶

As I mentioned earlier, we are going to use a news dataset to train our model.

  • The dataset is the livedoor news corpus.
  • It is a Japanese dataset.

5.1 Data Loading¶

We can use the load_dataset function to load datasets from the Hugging Face Hub. It provides a straightforward way to access many datasets, ensures standardized data processing, and allows easy splits into training, validation, and testing sets.

In summary, the following code loads the "livedoor-news-corpus" dataset and splits it into training (80%), validation (10%), and testing (10%) sets without shuffling, using a fixed random seed.

Additional Information:

  • "shunk031/livedoor-news-corpus": Specifies the dataset identifier or location.

  • train_ratio=0.8: 80% of the dataset will be allocated for training.

  • validation_ratio=0.1: 10% of the dataset will be allocated for validation.

  • seed=42: Ensures reproducibility by using a fixed seed for random number generation.

  • shuffle=False: Does not shuffle the data before splitting it into train, validation, and test sets.


In [4]:
from datasets import load_dataset

dataset = load_dataset(
    "llm-book/livedoor-news-corpus",
    train_ratio = 0.8,
    validation_ratio = 0.1,
    seed=42,
    shuffle=False,

)
Downloading builder script:   0%|          | 0.00/3.88k [00:00<?, ?B/s]
Downloading readme:   0%|          | 0.00/823 [00:00<?, ?B/s]
Downloading data:   0%|          | 0.00/8.86M [00:00<?, ?B/s]
Generating train split: 0 examples [00:00, ? examples/s]
Generating validation split: 0 examples [00:00, ? examples/s]
Generating test split: 0 examples [00:00, ? examples/s]
Let's display dataset and check what it looks like.
In [5]:
dataset
Out[5]:
DatasetDict({
    train: Dataset({
        features: ['url', 'date', 'title', 'content', 'category'],
        num_rows: 5893
    })
    validation: Dataset({
        features: ['url', 'date', 'title', 'content', 'category'],
        num_rows: 736
    })
    test: Dataset({
        features: ['url', 'date', 'title', 'content', 'category'],
        num_rows: 738
    })
})
This is a custom dictionary object from the datasets library. It groups together different splits of a dataset (e.g., train, validation, test).

Each split in the dataset consists of the following features:

  • url: The web URL from where the news article was sourced.
  • date: The date on which the news article was published.
  • title: The title or headline of the news article.
  • content: The main content or body of the news article.
  • category: The category or topic under which the news article falls.
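Each split can be indexed like a list, and each example comes back as a plain Python dictionary with these features. A short sketch of how to access them:

# Access one training example and list the available columns.
example = dataset["train"][0]
print(example["title"])               # the news title (our generation target)
print(example["content"][:100])       # first 100 characters of the article body
print(dataset["train"].column_names)  # ['url', 'date', 'title', 'content', 'category']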

5.2 Simple Data Exploration¶

Let's load the test data into a pandas DataFrame and, out of curiosity, check what it looks like.

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet.

In summary, the code imports the pandas library, converts the 'test' portion of the dataset into a pandas DataFrame, and then displays the first five rows of that DataFrame.


Additional Information:

  1. import pandas as pd: This line imports the pandas library and gives it the alias pd. Pandas is a widely-used Python library for data analysis and manipulation, especially with tabular data.

  2. test_df = pd.DataFrame(dataset['test']): This line does two main things:

    • dataset['test']: It extracts the 'test' portion of the dataset. If the dataset is a dictionary-like structure, this fetches the data associated with the key 'test'.
    • pd.DataFrame(...): This converts the extracted 'test' data into a pandas DataFrame. A DataFrame is a 2-dimensional labeled data structure, similar to a table in a database, an Excel spreadsheet, or a data frame in R. In essence, it's a table with rows and columns.


  3. test_df.head(): This line displays the first five rows of the test_df DataFrame. The head() method in pandas is used to quickly inspect the initial rows of a DataFrame.

In [6]:
import pandas as pd
test_df = pd.DataFrame(dataset['test'])
test_df.head()
Out[6]:
url date title content category
0 http://news.livedoor.com/article/detail/5936102/ 2011-10-14T09:11:00+0900 ゼンショー「事実無根」と反論 10月13日の夜、ゼンショーの広報室長がTwitterで読売新聞の報道に「事実無根」と反論し... topic-news
1 http://news.livedoor.com/article/detail/5936557/ 2011-10-14T11:16:00+0900 「報ステ」OP曲演奏のジャズミュージシャンに“売名行為”と批判相次ぐ 先日、福島県が行っている新米の放射性物質本検査が全て終了した。規制値を超える放射性セシウムは... topic-news
2 http://news.livedoor.com/article/detail/5936721/ 2011-10-14T11:46:00+0900 「何のための“予約”なんですか」孫社長に批判殺到 ソフトバンクは、今朝から発売が始まった“iPhone4S”をはじめとする、ソフトバンク全ての... topic-news
3 http://news.livedoor.com/article/detail/5937177/ 2011-10-15T10:00:00+0900 あまりにも多すぎる「会いたくて」への皮肉か!? 「西野カナゲーム」が流行 いま巷で「西野カナゲーム」なるものが流行してるという。 簡単に説明すると、1人目、2人目は... topic-news
4 http://news.livedoor.com/article/detail/5937649/ 2011-10-14T16:03:00+0900 憶測呼ぶ紳助さんの“天敵”引退 10月13日発売の東スポに「紳助の天敵が引退」との見出しが躍った。その天敵とは、警察庁の安藤... topic-news
Now we can see the data in a DataFrame. This view is clearer and gives us a good idea of what our dataset looks like.


6. Model and Tokenizer¶

For this task we are going to use Google's mt5-small model.

What is mt5?¶

mT5 is a text-to-text transformer model released by Google. It is multilingual and can be used for translation, summarization, classification, and many other tasks.

What is text to text transformer?¶

Text-to-text transformers are a class of transformer models that can be used for a variety of tasks by providing a text input and a text output.

  • For example, the input could be a question and the output could be the answer.
  • Or the input could be a sentence in English and the output could be the same sentence in French.
  • Or the input could be a news article and the output could be a headline.

In summary, we can use the model to generate text from text.

Here we are using mt5-small, but there are other variants available, for example mt5-base. mt5-base is larger than mt5-small: it has more parameters and takes more time to train, but it also has more capacity to learn.
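If you later want to experiment with the larger variant, the only change needed is the model identifier defined below; everything else in the notebook stays the same, though training time and memory usage will increase. A sketch:

# To try the larger variant, swap the identifier used in the next section (rest of the code is unchanged).
# MODEL_NAME = "google/mt5-base"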

6.1. Set Tokenizer¶

We can use AutoTokenizer to load the tokenizer for the model. The following code loads the tokenizer for the mt5-small model.

What is tokenizer?¶

A tokenizer is a function that splits a string of text into tokens. For example, the string "Hello world!" could be tokenized into the list of tokens ['Hello', 'world', '!'].

The code imports the AutoTokenizer class from the transformers library, specifies the "google/mt5-small" model, and then loads its associated tokenizer into the mt5_tokenizer variable. This tokenizer prepares text inputs for the mT5 small model.


Additional Information:

Here's a brief breakdown of the code:

  1. from transformers import AutoTokenizer: Imports the AutoTokenizer class from the transformers library. This class is designed to automatically retrieve the appropriate tokenizer for a given model.

  2. MODEL_NAME = "google/mt5-small": This line defines a constant MODEL_NAME that specifies the identifier for the mT5 (multilingual T5) model, which is a small version provided by Google. T5 (Text-to-Text Transfer Transformer) is a popular transformer-based model designed for various NLP tasks, and the multilingual version (mT5) has been trained on multiple languages.

  3. mt5_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME): This line initializes the tokenizer associated with the mT5 model. The from_pretrained method fetches the tokenizer for the specified model (google/mt5-small) from Hugging Face's model hub and loads it into the mt5_tokenizer variable.


In [7]:
from transformers import AutoTokenizer

MODEL_NAME = "google/mt5-small"

mt5_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
mt5_tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True
(…)small/resolve/main/tokenizer_config.json:   0%|          | 0.00/82.0 [00:00<?, ?B/s]
(…)oogle/mt5-small/resolve/main/config.json:   0%|          | 0.00/553 [00:00<?, ?B/s]
(…)ogle/mt5-small/resolve/main/spiece.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]
(…)all/resolve/main/special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565

The tokenizer in a bit more detail¶

The tokenizer is a crucial component in the NLP pipeline, especially when dealing with transformer-based models like MT5. It handles the conversion between human-readable text and the format the model expects.

Usage of Tokenizer:¶

  • Encoding: Convert text (in this case, Japanese) into a format that the model can understand. This involves tokenizing the text into subwords or tokens, then mapping these tokens to their respective IDs in the model's vocabulary.

  • Decoding: Convert the model's output (token IDs) back into human-readable text.

Example:

Let's say we have the following Japanese sentence: 本日はAIトレーニングセッションへようこそ!

In [8]:
text = "本日はAIトレーニングセッションへようこそ!"

encoded_text = mt5_tokenizer(text)
print("Encoded text: ", encoded_text)

tokenized_text = mt5_tokenizer.tokenize(text)
print("Tokenized text: ", tokenized_text)

decoded_text = mt5_tokenizer.decode(encoded_text["input_ids"], skip_special_tokens=True)
print("Decoded Text: ", decoded_text)
Encoded text:  {'input_ids': [259, 212152, 15428, 96992, 191286, 6031, 15578, 68875, 309, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Tokenized text:  ['▁', '本日は', 'AI', 'トレーニング', 'セッション', 'へ', 'よう', 'こそ', '!']
Decoded Text:  本日はAIトレーニングセッションへようこそ!
So we can see that the sentence is split into small pieces. These pieces are called tokens, and each token has a unique numeric ID.

The MT5 tokenizer breaks down the input text "本日はAIトレーニングセッションへようこそ!" into subwords and punctuation that it recognizes from its vocabulary. Each token is then assigned a unique ID. The decoding process reverses this, taking the token IDs and converting them back into the original text. The tokenization and decoding processes are designed to be reversible, ensuring that the original text can be perfectly reconstructed.


Additional Information:

Encoded Text:

  • input_ids: These are the IDs assigned by the tokenizer to each token. The sequence [259, 212152, 15428, 96992, 191286, 6031, 15578, 68875, 309, 1] represents the tokens in the tokenized text.

  • attention_mask: The mask [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] indicates that the model should attend to all tokens in the input_ids.

Tokenized Text:

  • The tokenized text is ['▁', '本日は', 'AI', 'トレーニング', 'セッション', 'へ', 'よう', 'こそ', '!']. This is how the MT5 tokenizer has broken down the input text:

    • '▁': This represents a space. The leading '▁' (underscore) indicates a space in tokenizers like SentencePiece (which MT5 uses). In languages without spaces (like Japanese), this helps differentiate between a token that starts a new word and a token that's part of the same word.
    • 本日は: This represents the word "本日は", which means "today".
    • AI: This represents the word "AI", which is an abbreviation for "artificial intelligence".
    • トレーニング: This represents the word "トレーニング", which means "training".
    • セッション: This represents the word "セッション", which means "session".
    • へ: This represents the word "へ", which means "to".
    • よう and こそ: together these two subword tokens form ようこそ, which means "welcome"; the tokenizer has split the word into two pieces.
    • !: This is the exclamation mark.

Decoded Text:

  • The decoded text, "本日はAIトレーニングセッションへようこそ!", is reconstructed from the tokenized text by joining the tokens and removing special characters like '▁'.
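If you want to see the ID-to-token correspondence directly, you can convert the IDs back to token strings (a small sketch; the extra ID 1 at the end is the special end-of-sequence token </s> that the tokenizer appends):

# Map each input ID back to its token string to inspect the correspondence explicitly.
print(mt5_tokenizer.convert_ids_to_tokens(encoded_text["input_ids"]))
# Expected to show the tokens above plus a trailing '</s>' end-of-sequence token.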

6.2 Tokenized Content and Title Distribution.¶

For the next step (preparing the text for the model), we need to determine maximum lengths for the tokenized content and title. We don't know these yet, so we need to find them out. Longer sequences can produce better results, but they take more time to train, so we need to find a balance between length and training time.

In [9]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
def dist_info(text, max_bin_size=1024):
    token_counts = [len(mt5_tokenizer.tokenize(content)) for content in text]

    sns.histplot(token_counts, bins=100, binrange=(0, max_bin_size))
    plt.title("Tokenized Text Length Distribution")
    plt.xlabel("Tokenized Text Length")
    plt.ylabel("Count")
    plt.show()

    percentiles = [25, 50, 75, 90, 95, 99]
    for p in percentiles:
        print(f"{p}% of the dataset has token count below: {np.percentile(token_counts, p)}")

6.2.1 Tokenized Content Distribution¶

Let's use the above function to plot the distribution of the tokenized content.

In [10]:
dist_info(dataset["train"]["content"])
25% of the dataset has token count below: 398.0
50% of the dataset has token count below: 589.0
75% of the dataset has token count below: 815.0
90% of the dataset has token count below: 1066.0
95% of the dataset has token count below: 1302.0
99% of the dataset has token count below: 1969.08
Let's consider maximum content lengths of 512 and 1024 tokens and analyze each option separately.
  1. Max Length of 512:
    • Pros:
      • Faster training and evaluation.
      • Lower memory requirements.
    • Cons:
      • Truncates or excludes about 50% of our dataset, since the median content length is 589 tokens.

  2. Max Length of 1024:
    • Pros:
      • Covers 90% of our dataset without truncation.
      • Still relatively efficient in terms of memory and computation.
    • Cons:
      • Training might be slower compared to 512 due to longer sequences.
      • Higher memory usage, which might require reducing the batch size.

  3. Max Length above 1024:
    • Pros:
      • Can cover even more of the dataset without truncation.
    • Cons:
      • Significantly increased computational requirements.
      • Much higher memory usage.
      • Not all model variants can handle sequences longer than 1024.

Given the above, here's a recommendation:

  • If you have computational constraints (e.g., limited GPU memory), consider using a max_length of 1024. This will cover 90% of your data without truncation, which is a significant portion. You might need to adjust the batch size to prevent out-of-memory errors.

  • If you don't have computational constraints and want to ensure that you're leveraging as much of your data as possible without truncation, you might consider a length above 1024 (e.g., 1280 to cover up to the 95th percentile). However, note that sequences longer than 1024 might not be supported by all T5 model variants. You would need to ensure your specific model can handle longer sequences.

Due to limited time, I will use 512 as the max length. Otherwise, 1024 would be the better option.

6.2.2 Tokenized Title Distribution¶

Let's use the same function to plot the distribution of the tokenized title.

I'm assuming here that titles are probably shorter than 128 tokens, since even the 90th percentile of the content is below 1024 tokens. With that in mind, I'm passing 128 as the max bin range.

In [11]:
dist_info(dataset["train"]["title"], max_bin_size=128)
25% of the dataset has token count below: 16.0
50% of the dataset has token count below: 21.0
75% of the dataset has token count below: 25.0
90% of the dataset has token count below: 29.800000000000182
95% of the dataset has token count below: 32.0
99% of the dataset has token count below: 40.0

According to the above results, here is the analysis.

  1. Max Length of 32:
    • Pros:
      • Covers 90% of your title dataset without truncation.
      • Efficient in terms of memory and computation.
    • Cons:
      • 10% of your titles will be truncated, but given that it's just a few tokens above 32 for most of these, the truncation might be minimal.



  2. Max Length of 48 or 50:
    • Pros:
      • Covers 99% of your title dataset without truncation.
      • Still relatively efficient in terms of memory and computation.
    • Cons:
      • Slightly more padding for titles that are significantly shorter, but given the overall low token counts, this is a minor concern.

Considering the title token counts:

  • Using a max_length of 32 for titles would be a good choice for efficiency, and it covers 90% of your titles without truncation.

  • If you want to ensure almost all titles are not truncated, you could go with a max_length of 48 or 50, which will cover up to the 99th percentile.

I am going to use 64 as the max length for the titles. This will cover 99% (or probably 99.99%) of the titles without truncation, and it's still relatively efficient in terms of memory and computation.

In [12]:
SOURCE_MAX_LEN = 512
TARGET_MAX_LEN = 64

7. Preprocessing and Normalization¶

Raw data can contain noise – irrelevant information, HTML tags, special characters, etc. Removing or cleaning such noise can help the model focus on the essential parts of the data.

Certain characters or sequences might have special meanings in the tokenization or modeling process. These need to be appropriately handled or escaped during preprocessing.

The next function performs very simple preprocessing. When building a more sophisticated model you might need more advanced preprocessing, but for this model simple cleaning is enough.

The text_clean_preprocess function takes a text input and performs several cleaning operations on it: it removes newline characters, replaces full-width spaces (\u3000) with regular spaces, removes tabs and carriage returns, and finally converts the text to lowercase before returning it.

In [13]:
NEWLINE_CHAR = "\n"
SPACE_CHAR = "\u3000"
TAB_CHAR = "\t"
CARRIAGE_RETURN_CHAR = "\r"

def text_clean_preprocess(text, newline_char=NEWLINE_CHAR, space_char=SPACE_CHAR, tab_char=TAB_CHAR, carriage_return_char=CARRIAGE_RETURN_CHAR):
    text = text.replace(newline_char, "")
    text = text.replace(space_char, " ")
    text = text.replace(tab_char, "")
    text = text.replace(carriage_return_char, "")
    text = text.lower()

    return text
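A quick check of what the function does to a small made-up string (the example input below is purely for illustration):

# Illustration: newlines, tabs and carriage returns are removed, full-width spaces become
# regular spaces, and the text is lowercased.
print(text_clean_preprocess("Hello\u3000World\nTEST\t!\r"))  # -> "hello worldtest!"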
Now it's time to prepare our data in the format the model expects.

Models like mT5 expect input in a specific format, i.e., sequences of token IDs.

The next function tokenizes raw text into such sequences, making it compatible with the model's expectations.

It prepares and structures the data for training, where the content of the article is used to generate its title.

Additional Information:

  • Consistent Data Length: Neural models typically require fixed-length input. By padding or truncating sequences to SOURCE_MAX_LEN and TARGET_MAX_LEN, you ensure consistent input and target lengths.

  • Tokenization: The prefixed content (input) and the titles from data["title"] (target) are converted into token IDs using the mt5_tokenizer.

  • Attention Masking: The attention mask distinguishes real tokens from padding tokens, allowing the model to focus on meaningful parts of the input.


In [14]:
def tokenize_data(data):

    input_text = [text_clean_preprocess(content) for content in data["content"]]

    target_text = data["title"]

    inputs = mt5_tokenizer(input_text, max_length=SOURCE_MAX_LEN, truncation=True, padding="max_length")
    targets = mt5_tokenizer(target_text, max_length=TARGET_MAX_LEN, truncation=True, padding="max_length")
    label_attention_mask = [1] * len(targets["input_ids"])

    return {
        "input_ids": inputs.input_ids,
        "attention_mask": inputs.attention_mask,
        "decoder_input_ids": targets["input_ids"],
        "labels": targets.input_ids,
        "label_attention_mask": label_attention_mask

    }
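As a quick sanity check (a small sketch), you can run the function on the first couple of training examples and confirm the padded lengths match our constants before mapping it over the whole dataset:

# Sanity check: tokenize two examples and inspect the padded sequence lengths.
sample_batch = tokenize_data(dataset["train"][:2])
print(len(sample_batch["input_ids"][0]))   # 512 (SOURCE_MAX_LEN)
print(len(sample_batch["labels"][0]))      # 64 (TARGET_MAX_LEN)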

In the previous step we created a function to transform our data. Now we can apply it to our dataset using the map function.

In summary, the next code tokenizes the dataset using the previously defined tokenize_data function and then removes the original columns, producing a cleaned, tokenized version of the dataset suitable for training.

Additional Information:

  • The map method batch-processes the dataset using tokenize_data, enhancing efficiency, especially for large datasets.
  • The remove_columns parameter specifies columns to be deleted post-mapping, keeping only tokenized data and discarding extraneous columns.
  • The tokenized_dataset is a processed version of the original data, containing tokenized content and excluding columns listed in remove_columns.

In [15]:
tokenized_dataset = dataset.map(
    tokenize_data,
    batched=True,
    remove_columns=["content", "title", "url", "date", "category"]
)
Map:   0%|          | 0/5893 [00:00<?, ? examples/s]
Map:   0%|          | 0/736 [00:00<?, ? examples/s]
Map:   0%|          | 0/738 [00:00<?, ? examples/s]

8. Model Loading¶

Now it's time to load the pretrained model. As mentioned above, we are using the mt5-small model. The following code loads the "google/mt5-small" model. Once loaded, the model can be used for various tasks like translation, summarization, text generation, etc.

Note that if you're running this in an environment where you haven't previously downloaded the model, the from_pretrained method will download the model weights from the Hugging Face model hub, so make sure you have an active internet connection.


Additional Information:

  • AutoModelForSeq2SeqLM: This is a class designed to automatically infer and load the appropriate sequence-to-sequence model architecture (like T5, BART, or mT5) based on the provided model name or path. It's handy if you don't know the specific architecture of a model but have its name or path.

  • from_pretrained: This is a class method that loads a model based on the provided name or path. It downloads the model weights and configuration and returns an instance of the appropriate model class.


In [16]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
)
pytorch_model.bin:   0%|          | 0.00/1.20G [00:00<?, ?B/s]
(…)mall/resolve/main/generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]
Next, we should prepare the Hugging Face data collator.

Since we are training a task that transforms one text sequence (the content) into another text sequence (the title), I am going to use DataCollatorForSeq2Seq. A data collator takes a list of samples from a Dataset and collates them into a batch, represented as a dictionary of tensors.

The job of a data collator is to:

  • Pad or truncate the inputs to the same length, if necessary.
  • Convert the inputs to the correct data type and format for the model.
  • Batch the inputs together.
  • Handle any special keys in the inputs, such as labels.

It's a good practice to use DataCollatorForSeq2Seq when fine-tuning seq2seq models, so incorporating it into your training workflow is a recommended step.

In summary, the following code sets up a tool (data_collator) that prepares and organizes batches of text data in the format the model expects.

In [17]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
    tokenizer=mt5_tokenizer,
    model=model,
)
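To see what the collator produces, you can feed it a couple of tokenized examples (a small sketch; I keep only the keys the collator expects here):

# Sketch: collate two tokenized examples into a padded batch of tensors.
features = [
    {k: tokenized_dataset["train"][i][k] for k in ("input_ids", "attention_mask", "labels")}
    for i in range(2)
]
batch = data_collator(features)
print({k: tuple(v.shape) for k, v in batch.items()})  # each value is a tensor of shape (2, sequence_length)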

9. Evaluation Metric¶

When training a model, it's important to evaluate its performance on a validation dataset.

This helps you understand how well the model is learning and whether it's overfitting or underfitting. It also helps you compare different models and select the best one. There are many evaluation metrics, for example BLEU, ROUGE, and METEOR.
In this project we are going to use BERTScore.

9.1 Bert Score¶

BERTScore is a metric that measures the similarity between two pieces of text.

It's based on the BERT model, which is a popular transformer-based model. BERTScore is a good metric for evaluating text generation tasks like summarization, translation, etc.

Before moving forward, let's check how to use BERTScore.

The following function calculates the average BERT scores for a set of predicted and reference sentences using the BERTScore metric.

In [18]:
import evaluate

def compute_bert_score_demo(preds, labels):
    bert_score_metric = evaluate.load("bertscore")
    bert_score_metric.add_batch(predictions=preds, references=labels)
    result = bert_score_metric.compute(lang="ja")
    avg_scores = {k: sum(v) / len(v) for k, v in result.items() if k != "hashcode"}

    return avg_scores

original = "リンゴが好きです。"
candidate_1 ="リンゴが大好きです。"
candidate_2 = "赤い車を買うつもりです。"

bs_results = {
    "candidate_1": compute_bert_score_demo([candidate_1], [original]),
    "candidate_2": compute_bert_score_demo([candidate_2], [original]),
}

bs_df = pd.DataFrame(bs_results).T
bs_df
Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]
(…)cased/resolve/main/tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]
(…)tilingual-cased/resolve/main/config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]
(…)ultilingual-cased/resolve/main/vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]
model.safetensors:   0%|          | 0.00/714M [00:00<?, ?B/s]
Out[18]:
precision recall f1
candidate_1 0.966469 0.986506 0.976384
candidate_2 0.680357 0.723368 0.701203
We can see that candidate 1 has a higher BERTScore than candidate 2, indicating that it is more similar to the reference sentence. This is expected, as candidate 1 is a close match to the reference sentence while candidate 2 is a completely different sentence.
Now let's refactor compute_bert_score_demo into a function that can be reused for evaluating our model once it's trained.

The compute_bert_score function takes eval_preds, which contains predictions and labels. It decodes the predictions and labels using mt5_tokenizer.batch_decode and then computes the BERTScore metric using bert_score_metric.compute.

In [19]:
def compute_bert_score(eval_preds):
    bert_score_metric = evaluate.load("bertscore")
    predictions, labels = eval_preds

    decoded_preds = mt5_tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = mt5_tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = bert_score_metric.compute(
        predictions=decoded_preds, references=decoded_labels, lang="ja"
    )

    return {
        "bertscore_precision": sum(result["precision"]) / len(result["precision"]),
        "bertscore_recall": sum(result["recall"]) / len(result["recall"]),
        "bertscore_f1": sum(result["f1"]) / len(result["f1"])
    }

10. Training¶

Perfect, now we have a function that can be used to evaluate our model.

  • Next thing is to define the training arguments.

10.1 Training Arguments¶

Now we need to define the training arguments. We can use Seq2SeqTrainingArguments to define the training arguments.

Seq2SeqTrainingArguments

This is a class provided by the transformers library that allows users to define training-related arguments or hyperparameters for sequence-to-sequence models (models that transform one sequence into another, like summarization models). It's a subclass of TrainingArguments, which is a general-purpose class for defining training-related arguments.

We need to balance training time and performance. There are many other hyperparameters that could improve performance, and the best values depend on the dataset, model, and task; there are also dedicated hyperparameter tuning methods for finding them. For this project, we are going to use the following hyperparameters.

In summary, this code is setting up the training configurations for the model.

Additional Information:

  • per_device_train_batch_size and per_device_eval_batch_size: Batch size for training and evaluation, respectively. Batch size is the number of data samples processed before the model updates its weights.

  • learning_rate: This is the initial learning rate for training.

  • lr_scheduler_type and warmup_ratio: These define the learning rate schedule. "linear" means the learning rate decreases linearly over training steps. The warmup_ratio defines the fraction of total training steps during which the learning rate increases linearly before the decay starts.

  • num_train_epochs: Number of epochs to train the model. An epoch is one complete forward and backward pass of all training samples.

  • evaluation_strategy, save_strategy, and logging_strategy: These define when to evaluate, save, and log respectively. "epoch" means these actions will be performed at the end of each epoch.

  • logging_steps: How often to log training metrics (every 100 steps).

  • logging_dir: Directory where logs will be stored.

  • do_train and do_eval: Flags to determine if training and evaluation should be performed, respectively.

  • output_dir: Directory where the training outputs, such as the model checkpoints, will be saved.

  • save_total_limit: Maximum number of model checkpoints to be saved. Older checkpoints will be deleted.

  • load_best_model_at_end: If True, the best model according to the evaluation metric will be loaded at the end of training.

  • push_to_hub: If True, the model will be pushed to the Hugging Face Model Hub.

  • predict_with_generate: Indicates that the predict method should use the generate method for generating sequences.

  • The commented-out parameters (optim, gradient_accumulation_steps, weight_decay, fp16) are additional optional arguments that can be used to further customize the training. For example, fp16=True would enable half-precision (16-bit) floating-point training for potentially faster performance.


In [20]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

NUM_EPOCHS = 5
LEARNING_RATE = 5e-4
WARMUP_RATIO = 0.1
PER_DEVICE_TRAIN_BATCH_SIZE = 8
PER_DEVICE_EVAL_BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 4
WEIGHT_DECAY = 0.01
LOGGING_DIR = "./logs"
OUTPUT_DIR = "./results"

training_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH_SIZE,

    learning_rate=LEARNING_RATE,
    # lr_scheduler_type="linear", #comment on colab
    # warmup_ratio=0.1, #comment on colab

    num_train_epochs=NUM_EPOCHS,

    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    logging_steps=100,

    logging_dir=LOGGING_DIR,
    do_train=True,
    do_eval=True,
    output_dir=OUTPUT_DIR,

    save_total_limit=2,
    load_best_model_at_end=True,

    push_to_hub=False,
    predict_with_generate=True,

    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
)

10.2 Trainer¶

The following code initializes a Seq2SeqTrainer instance using the previously defined model, training_args, data_collator, datasets, tokenizer, and metric function. The trainer is responsible for training and evaluating the model.


Additional Information:

  • Trainer: This is a class provided by the transformers library that handles the training and evaluation of a model. It's a high-level API that abstracts away many of the low-level details of training, such as batching, logging, saving checkpoints, etc.
  • model: This is the model to be trained.
  • args: This is an instance of Seq2SeqTrainingArguments that contains the training-related arguments.
  • data_collator: This is an instance of DataCollatorForSeq2Seq that collates data samples into batches.
  • compute_metrics: This is a function that computes the evaluation metrics. In this case, it's the previously defined compute_bert_score function.
  • train_dataset: This is the training dataset.
  • eval_dataset: This is the evaluation dataset.
  • tokenizer: This is the tokenizer associated with the model.

The Trainer class abstracts away many of the low-level details of training, such as batching, logging, saving checkpoints, etc. This makes it easy to train a model with just a few lines of code.

In [21]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=mt5_tokenizer,
    compute_metrics=compute_bert_score,
)

10.3 Model Training¶

We have set everything up for model training. Now it's time to train the model using the Hugging Face library. We can use the following code to train the model.

Even though it looks very simple, trainer.train() abstracts away many of the intricacies of the training process, providing a unified interface for training various models on diverse tasks. This makes it easier for users to train models without getting bogged down by the details of the training loop, gradient computation, etc. However, for those who need custom behavior, the Trainer and its components are highly configurable.


Additional Information:

Here's a breakdown of what happens under the hood:

  1. Initialization:
    • The method checks if the training is being resumed from a checkpoint and sets the starting epoch and step accordingly.
    • If it's a fresh start, gradients are initialized to zero.



  2. Training Loop:

The model goes through multiple epochs of training. One epoch is a complete forward and backward pass of all the training examples. For each epoch:

  • Data Loading: Training data is loaded in batches using data loaders.

  • Forward Pass: For each batch:

    • The batch is sent to the device (CPU or GPU).
    • The model computes its predictions for the batch.
    • The loss is computed based on the model's predictions and the actual labels.



  • Backward Pass: The gradients of the loss with respect to the model parameters are computed.

  • Optimization Step: The model's optimizer updates the model's parameters to minimize the loss. Common optimizers include Adam, SGD, etc.

  • Gradient Clipping: If specified, gradients are clipped to prevent exploding gradients which can destabilize training.

  • Logging: Metrics like training loss, learning rate, etc., are logged. If a logging directory is provided, these can be visualized using TensorBoard or other tools.



  3. Evaluation: If a validation dataset is provided:

    • After each epoch (or at specified intervals), the model is evaluated on the validation dataset to compute metrics like validation loss, accuracy, etc.
    • This helps monitor overfitting and understand how well the model is generalizing to unseen data.



  4. Model Saving:

    • At specified intervals or after training completion, the model and its configuration are saved to disk.



  5. Learning Rate Scheduling:

  • Depending on the learning rate scheduler used, the learning rate might be adjusted as training progresses. For instance, the learning rate might be reduced if the validation loss plateaus to help the model converge.



  6. Early Stopping:

    • If enabled and the model doesn't improve on the validation dataset for a specified number of checks, training can be halted early to prevent overfitting.



  7. Callbacks:

    • The Trainer supports the use of callbacks, which are custom functions that can be called at specified points in the training process, allowing for custom behavior.



  8. Clean-up:

    • Once training is finished (or interrupted), the trainer ensures that all resources like data loaders, TensorBoard writers, etc., are properly closed.

In [22]:
trainer.train()
[920/920 36:38, Epoch 4/5]
Epoch Training Loss Validation Loss Bertscore Precision Bertscore Recall Bertscore F1
0 7.068200 1.034466 0.706013 0.664744 0.684008
1 1.268600 0.894248 0.712798 0.681310 0.696195
2 1.079000 0.864152 0.712160 0.684890 0.697850
4 0.996100 0.854636 0.713781 0.689665 0.701086
4 0.956100 0.847545 0.715506 0.692384 0.703341

Out[22]:
TrainOutput(global_step=920, training_loss=2.273634487649669, metrics={'train_runtime': 2200.8701, 'train_samples_per_second': 13.388, 'train_steps_per_second': 0.418, 'total_flos': 1.556004590321664e+16, 'train_loss': 2.273634487649669, 'epoch': 4.99})

The above output is what we got after training the model.

Let's analyze the results step by step:

  1. Training and Validation Loss:

    • The training loss decreases significantly from the first to the second epoch and then slightly from the second to the third epoch. This indicates that the model is learning.
    • The validation loss also decreases with each epoch, which is a good sign. It suggests that the model is generalizing well to the validation data.

  2. BERTScore Metrics:

    • BERTScore Precision, Recall, and F1 are all increasing with each epoch. This shows that the model's titles are becoming more similar to the reference titles in terms of the embeddings of their tokens.
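If you prefer a visual view of these numbers, the trainer keeps a log of the metrics it reported during training. Here is a small optional sketch that plots the loss curves (the exact keys present in log_history can vary slightly between transformers versions):

# Optional sketch: plot the logged training and validation loss curves.
history = trainer.state.log_history
train_logs = [(h["epoch"], h["loss"]) for h in history if "loss" in h]
eval_logs = [(h["epoch"], h["eval_loss"]) for h in history if "eval_loss" in h]

plt.plot(*zip(*train_logs), marker="o", label="training loss")
plt.plot(*zip(*eval_logs), marker="o", label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()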

11. Model Evaluation¶

After the model is trained, the next step is to evaluate it. We can use the following code to do so.

The Hugging Face Trainer will evaluate the model on the validation dataset (or the test dataset, if specified) and return the computed metrics. This method is used to understand the performance of the model on unseen data.


Additional Information:

  1. Data Loading: The validation (or test) dataset is loaded batch by batch.

  2. Evaluation Loop: For each batch:

    • The batch is sent to the device (CPU or GPU).
    • The model computes its predictions for the batch.
    • Metrics (like loss, accuracy, etc.) are computed based on the model's predictions and the actual labels.

  3. Aggregation: After all batches are processed, the metrics are averaged (or otherwise aggregated) to provide a single value per metric for the entire dataset.

  4. Logging: The computed metrics are logged. If a logging directory is provided, they can be visualized using tools like TensorBoard.

  5. Return Value: The method returns a dictionary containing the computed metrics.


In [23]:
results = trainer.evaluate()

results_df = pd.DataFrame([results])
results_df
[92/92 00:50]
Out[23]:
eval_loss eval_bertscore_precision eval_bertscore_recall eval_bertscore_f1 eval_runtime eval_samples_per_second eval_steps_per_second epoch
0 0.847545 0.715506 0.692384 0.703341 57.0718 12.896 1.612 4.99
According to the above results, we can say that our model is performing reasonably well.
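If you also want the same metrics on the held-out test split, you can pass it to evaluate explicitly (an optional sketch):

# Optional: compute the same metrics on the test split instead of the validation split.
test_results = trainer.evaluate(eval_dataset=tokenized_dataset["test"])
pd.DataFrame([test_results])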

12. Model Testing¶

We have trained and evaluated the model. Now it's time to test it.

12.1. Generate Predictions¶

For the next step, we need to generate predictions using the model.

There are many methods to generate predictions.

  • Greedy Search
  • Beam Search
  • Top-K Sampling
  • Top-P Sampling

and many more.

Here we are going to use beam search. Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes within a limited set. It is an optimization of best-first search that reduces its memory requirements.


Additional Information:

I'm not going to explain beam search here, but if you are interested you can read more about it and the other methods, which are nicely explained in the following blog post: https://huggingface.co/blog/how-to-generate


Different parameters will give different results. So you can try different parameters and see how it affects the results. There is no right or wrong answer here. It depends on the task and the dataset.
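To make the difference between decoding strategies concrete, here is a minimal sketch (with assumed parameter values) that generates a title for one test article using greedy search and using top-k / top-p sampling; the full beam-search version we actually use is defined next.

# Minimal sketch: compare greedy decoding with top-k / top-p sampling on one test article.
sample = mt5_tokenizer("summarize: " + text_clean_preprocess(dataset["test"]["content"][0]),
                       max_length=512, truncation=True, return_tensors="pt")

model.eval()
greedy_ids = model.generate(input_ids=sample["input_ids"].cuda(),
                            attention_mask=sample["attention_mask"].cuda(),
                            max_length=64)                      # greedy search (the default)
sampled_ids = model.generate(input_ids=sample["input_ids"].cuda(),
                             attention_mask=sample["attention_mask"].cuda(),
                             max_length=64, do_sample=True,
                             top_k=50, top_p=0.95)              # top-k / top-p sampling

print("Greedy :", mt5_tokenizer.decode(greedy_ids[0], skip_special_tokens=True))
print("Sampled:", mt5_tokenizer.decode(sampled_ids[0], skip_special_tokens=True))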

The generate_title function preprocesses a given text (the article content) and uses our model to generate a list of possible titles for that content.

Here's a concise summary:

The function:

  • Preprocesses the provided content.
  • Tokenizes the cleaned content using the mt5_tokenizer.
  • Feeds the tokenized input to a model to generate potential titles, with constraints and preferences such as maximum length, beam search settings, and temperature for randomness.
  • Decodes the generated token outputs back to human-readable text titles.
  • Returns a list of these generated titles.
In [24]:
def generate_title(content):

    # inputs = [text_clean_preprocess(content)]
    inputs = ["summarize: " + text_clean_preprocess(content)]  # prepend the "summarize: " prefix to the cleaned content

    batch = mt5_tokenizer.batch_encode_plus(
        inputs, max_length=512, truncation=True,
        padding="longest", return_tensors="pt")

    input_ids = batch['input_ids']
    input_mask = batch['attention_mask']


    model.eval()

    outputs = model.generate(
        input_ids=input_ids.cuda(),
        attention_mask=input_mask.cuda(),
        max_length=64,            # maximum length of a generated title
        temperature=1.1,          # only affects sampling (no effect unless do_sample=True)
        num_beams=6,              # beam search width (alternative value: 24)
        diversity_penalty=3.0,    # penalize similar beams across groups (alternative value: 1.8)
        num_beam_groups=3,        # diverse beam search with 3 groups
        num_return_sequences=3,   # return 3 candidate titles
        repetition_penalty=9.0,   # discourage repeated tokens
        # early_stopping=True,
        # max_new_tokens=64,
        # do_sample=True
    )

    generated_titles = [mt5_tokenizer.decode(ids, skip_special_tokens=True,
                            clean_up_tokenization_spaces=False)
                        for ids in outputs]

    return generated_titles
Now let's select a few articles from the test set and generate titles for them.
In [25]:
print("Total titles in dataset:", len(dataset["test"]["title"]))

selected_index = [75, 140, 286]

for index in selected_index:
    print("original: ", dataset["test"]["title"][index])
    titles = generate_title(dataset["test"]["content"][index])
    for i, title in enumerate(titles):
        print(f"Generated title {i+1}: {title}")
    print()
Total titles in dataset: 738
original:  いいとも!で紹介された「ヒドすぎる」名前が話題に
Generated title 1: 『ザザザの斬新!赤ちゃんネーム』で紹介された「キラキラネーム」がありえない
Generated title 2: 『ザザザの斬新!赤ちゃんネーム』が「ありえない」とネットニュースで話題
Generated title 3: ネットスラングで「キラキラネーム」がありえないとネットで話題【話題】

original:  日本の引きこもりに海外から相次ぐ心配の声
Generated title 1: 日本の引きこもりが海外で話題に
Generated title 2: 日本の引きこもりが海外で話題
Generated title 3: 「日本=出る釘は打たれる」など、日本の引きこもりが海外で話題に

original:  甲子園出場する石巻工「約5000万円が必要」呼びかけに物議
Generated title 1: 【Sports Watch】石巻工業高校が総額約5000万円の協賛金を募っている
Generated title 2: 【Sports Watch】石巻工業高校が総額約5000万円の協賛金を募っている理由
Generated title 3: 石巻工業高校が、総額約5000万円の協賛金を募っている【話題】

We can see that the generated titles are quite similar to the actual titles. This is a good sign, indicating that the model has learned to generate titles that resemble the references.

Also note that the generated titles are not identical to the actual titles. This is expected: the model is not trained to reproduce the exact title, but to generate a title that is similar to it.

As before, different generation parameters will give different results, so feel free to experiment.

12.2. Generate Predictions for Yahoo News¶

Our model is trained on the livedoor news corpus, and so far we have only tested it with livedoor articles. I would like to see how it responds to news from another source, so I pasted an article I got from Yahoo! News. Let's see what our model outputs.

In [26]:
# https://news.yahoo.co.jp/pickup/6476740

yahoo_news_1_original_title = """【速報】女川原発2号機「再稼働目標 3か月延期へ」来年5月に<東北電力>"""

yahoo_news_1 = """東北電力は来年2月を目標としていた「女川原発2号機の再稼働」について、来年5月に延期することを明らかにした。今年11月としていた安全対策工事の完了時期が来年2月に延びるため。

東北電力によると、工事が3か月延びるのは、発電所内の設備などにつながるケーブルが火災などで損傷しないようにする「火災防護対策」を追加したことが主な要因。この対策を巡っては、他の電力会社が原子力規制員会から指摘を受けた事例を踏まえ、東北電力では去年10月から追加で工事をすることを準備していた。
その工程を精査した結果、3か月ほど完了時期が延びることが判明したもので、それに伴って女川原発2号機の再稼働目標も3か月延期し、来年5月頃となった。
"""
In [27]:
print("Original yahoo title: ", yahoo_news_1_original_title)
for i, title in enumerate(generate_title(yahoo_news_1)):
    print(f"Generated title {i+1}: {title}")
Original yahoo title:  【速報】女川原発2号機「再稼働目標 3か月延期へ」来年5月に<東北電力>
Generated title 1: 東北電力、来年2月を目標としていた「女川原発2号機再稼働」を発表
Generated title 2: 東北電力、来年2月を目標としていた「女川原発2号機再稼働」について3か月延期
Generated title 3: 【ニュース】東北電力、来年2月を目標としていた「女川原発2号機再稼働」について3か月延期
We can see that even for news that is not from livedoor, our model generates a reasonable title. This is a good sign that the model is working well.

12.3. Generate Predictions for email (Just for fun)¶

If you are interested, you can copy and paste one of your own emails and see how the model responds.

In [28]:
sample_email = """おはようございます。
アナハイム・エレクトロニクス事務局です。

本日は、まもなく締め切りとなります、10/18(水)開催の対面イベント「モルゲンレーテ社製品無料体験会&相談会 @アナハイム・エレクトロニクスカリフォルニア本社」のご案内です!

実際に触れてみないとわからないとお困りの方、ぜひこの機会に体験してみてください。

モルゲンレーテ社製品無料体験会は特に以下のような方にお勧めです。

・電子・電気機器の新規導入でご検討中の方
 ※小規模から対応できます!規模は問いません。
・電子機器の更改でクラウド移行を検討している方
・顧客管理システムも含めセキュアに電子機器を構築したい方。

モルゲンレーテ社製品を検討はしているけど、実際に触れてみないとわからない、導入を検討しているけど、何から始めればよいか分からないといった課題、お困りごとのある方は、ぜひこの機会に体験してみてください。

また、体験会の後は、弊社エンジニアの個別相談会も予定しております。

皆さまのお悩みなどお気軽にお話いただければと思います。

以下、無料体験会の詳細、申し込み方法をご確認のうえ、ぜひお気軽にご参加ください!

イベントの参加は無料! 参加お申込みは10月16日(月)となります。
みなさまのご参加お待ちしています!
"""
In [29]:
for i, title in enumerate(generate_title(sample_email)):
    print(f"Generated title {i+1}: {title}")
Generated title 1: アナハイム・エレクトロニクスカリフォルニア本社にて開催の対面イベント「モルゲンレーテ社製品無料体験会&相談会」を開催
Generated title 2: アナハイム・エレクトロニクスカリフォルニア本社にて開催の対面イベント「モルゲンレーテ社製品無料体験会&相談会】
Generated title 3: 「モルゲンレーテ社製品無料体験会&相談会 アナハイム・エレクトロニクスカリフォルニア本社」
  • Since the model was trained on news titles, even for an email it generates a sentence in the style of a news title.
  • This means the model can reproduce the style of whatever corpus it is trained on: if we use an email corpus, it can generate email subjects (a rough sketch of the required change follows below).
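As a rough sketch of those "minor tweaks", the snippet below shows how a hypothetical email dataset (with body and subject fields, which do not exist in this notebook) could be renamed to match the content/title columns used here, so the rest of the pipeline can be reused as-is.

from datasets import Dataset

# Hypothetical email records; in practice you would load your own email data here.
email_records = [
    {"body": "おはようございます。無料体験会のご案内です。...",
     "subject": "モルゲンレーテ社製品無料体験会のご案内"},
]

# Rename the columns so the existing content -> title pipeline becomes body -> subject.
email_dataset = Dataset.from_list(email_records)
email_dataset = email_dataset.rename_columns({"body": "content", "subject": "title"})
print(email_dataset)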

13. How to Improve Our Model¶

There are many ways to improve our model. Here are some of them.

13.1. Data¶

  • Data Size: We used a relatively small dataset. The more data we have, the better the model will be.
  • Cleaning: We did not do much cleaning. Adding a proper cleaning step can improve the model.
  • Normalization: We only did very simple things, such as lowercasing and normalizing Japanese spaces, but there is more we can do (see the sketch after this list).
    • For instance, Japanese uses different types of quotation marks (「」 or 『』). You might want to standardize them to a single type.
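For example, a simple normalization step could look like the sketch below; the exact rules are illustrative and are not the ones used in this notebook.

import re

# Illustrative normalization rules (not the ones used in this notebook).
def normalize_text(text: str) -> str:
    text = text.replace("『", "「").replace("』", "」")  # standardize quotation marks
    text = text.replace("\u3000", " ")                   # full-width space -> half-width space
    text = re.sub(r"\s+", " ", text).strip()             # collapse repeated whitespace
    return text.lower()                                  # lowercase Latin characters

print(normalize_text("『サンプル』　ニュース  TITLE"))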

13.2. Model and Parameters¶

  • Model Size: Here we used the smallest mT5 variant. Consider a larger version for better performance, provided you have the computational resources.
  • Hyperparameters: Experiment with different learning rates, batch sizes, and optimization algorithms.
  • Model Architecture: You can try different model architectures, for instance BART, T5, or Pegasus.
  • Language Coverage: mT5 is multilingual; a Japanese-specific T5 model may perform better (see the sketch below).
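Switching to a different checkpoint usually only requires changing the name passed to from_pretrained, as sketched below. google/mt5-base is a larger mT5 variant, and sonoisa/t5-base-japanese is one Japanese T5 checkpoint on the Hugging Face Hub (verify its availability and license before relying on it); neither is guaranteed to fit in Colab's free GPU memory.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Larger multilingual checkpoint (needs more memory and training time):
# checkpoint = "google/mt5-base"

# Japanese-specific T5 checkpoint (verify availability on the Hugging Face Hub):
checkpoint = "sonoisa/t5-base-japanese"

# Use new names so we do not overwrite the model and tokenizer trained above.
alt_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
alt_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)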

13.3. Training Strategy¶

  • Transfer Learning: Before fine-tuning on the title generation task, you can first fine-tune the model on a related task (e.g., summarization) to make it more familiar with the domain (see the sketch after this list).

    • Fine-tune on [Article -> Summary]
    • Save the model.
    • Use the saved model to fine-tune again on [Article -> Title]
  • Regularization: Techniques like dropout, layer normalization, or weight decay can help prevent overfitting.
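A minimal sketch of the two-stage idea, reusing the model and mt5_tokenizer from this notebook; the directory name and the commented training calls are placeholders, since the full Trainer setup for the summarization stage is not shown here.

from transformers import AutoModelForSeq2SeqLM

# Stage 1: fine-tune on [Article -> Summary] (reuse the Trainer setup from this
# notebook with a summarization dataset), then save the resulting checkpoint.
# trainer.train()
model.save_pretrained("./mt5_stage1_summary")
mt5_tokenizer.save_pretrained("./mt5_stage1_summary")

# Stage 2: reload the stage-1 checkpoint and fine-tune again on [Article -> Title].
model = AutoModelForSeq2SeqLM.from_pretrained("./mt5_stage1_summary")
# trainer = Trainer(model=model, ...)  # same setup as before, with the title dataset
# trainer.train()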

13.4. Curriculum Learning¶

  • Dynamic Sampling: Rank training samples based on difficulty using BERTScore. Begin training on "easier" samples and gradually increase the difficulty.
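A rough sketch of how such a ranking could be computed with the bertscore metric from the evaluate library, assuming the raw dataset also has a train split with the same content and title columns as the test split used above; only a small subset is scored here to keep it cheap.

import evaluate

bertscore = evaluate.load("bertscore")

# Score a small subset of training examples: generate a title for each article
# and measure how close it is to the reference title.
sample_contents = dataset["train"]["content"][:20]
sample_titles = dataset["train"]["title"][:20]
baseline_titles = [generate_title(content)[0] for content in sample_contents]

scores = bertscore.compute(predictions=baseline_titles,
                           references=sample_titles, lang="ja")["f1"]

# Higher F1 means the model is already close to the reference, i.e. an "easier" example.
ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print("Easiest training examples first:", ranked_indices[:5])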

13.5. Post-processing & rules¶

  • Length Control: You can introduce rules to ensure that titles are within a certain length.
  • Terminology: If there are specific terms or phrases that should (or shouldn't) appear in titles, you can enforce these rules post-generation.
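A minimal sketch of such rules; the length limit and the banned phrase below are arbitrary examples, not values used elsewhere in this notebook.

MAX_TITLE_CHARS = 40            # arbitrary length limit for generated titles
BANNED_PHRASES = ["【話題】"]   # arbitrary example of a phrase we do not want in titles

def postprocess_title(title: str) -> str:
    for phrase in BANNED_PHRASES:
        title = title.replace(phrase, "")       # enforce the terminology rule
    title = title.strip()
    if len(title) > MAX_TITLE_CHARS:
        title = title[:MAX_TITLE_CHARS]         # enforce the length rule
    return title

for title in generate_title(dataset["test"]["content"][75]):
    print(postprocess_title(title))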

13.6. Experiment with Decoding Strategies¶

  • Beam Search: Instead of just taking the most likely next word at each step, beam search considers multiple sequences and can produce better results.
  • Temperature Sampling: By adjusting the temperature parameter, you can control the randomness of the output. A higher value makes the output more random, while a lower value makes it more deterministic.

14. Conclusion¶

  • Learned how to use Google Colab.
  • Learned how to use the Hugging Face transformers and datasets libraries.
  • Discussed how to use evaluation metrics.
  • Discussed BERTScore.
  • Became familiar with tokenization and model training.
  • Learned how to use the Hugging Face Trainer.
  • Gained experience fine-tuning a deep learning model for a text generation task.
  • Finally, learned how to generate predictions with the trained model.