Sorry to disappoint everybody, but today we cannot build an email subject generation AI. This is because we are unable to share the employee email data that was used to train that model.
However, there's a silver lining. We're in the process of developing a similar model using a different dataset.
In the email dataset, we used the email body to generate the email subject.
Here, we use the news content to generate the news title.
The concept remains quite similar. If you're keen, you can utilize email data to train a model in the future using the same codebase, with only minor tweaks required. If you need assistance, please don't hesitate to reach out at nuwan@qualitia.com
Important Note 1:
Important Note 2:
In the explanations I added Additional Information: sections between dotted lines. You can basically ignore them; they are just for your reference. If you are interested, read them at home.
For our first step, let's install the necessary libraries.
We can use the pip command to install them.
We will be using the following libraries for our project. When running system commands in Colab, prefix the command with a ! or %.
List of libraries we will be using:
%pip install transformers[torch,ja]==4.33.3 datasets==2.14.5 sentencepiece matplotlib seaborn evaluate absl-py bert_score pandas tokenizers==0.13.3
Successfully installed accelerate-0.24.1 bert_score-0.3.13 datasets-2.14.5 dill-0.3.7 evaluate-0.4.1 fugashi-1.3.0 huggingface-hub-0.18.0 ipadic-1.0.0 multiprocess-0.70.15 plac-1.4.1 responses-0.18.0 rhoknp-1.3.0 safetensors-0.4.0 sentencepiece-0.1.99 sudachidict-core-20230927 sudachipy-0.6.7 tokenizers-0.13.3 transformers-4.33.3 unidic-1.1.0 unidic-lite-1.0.8 wasabi-0.10.1
Now, before moving forward, let's check the availability of a GPU. We can use the following code to check GPU availability.
import torch

if torch.cuda.is_available():
    status = "GPU is enabled."
    device_count = torch.cuda.device_count()
    current_device = torch.cuda.current_device()
    print(f"{status}\ndevice count: {device_count}, current device: {current_device}")
else:
    print("GPU is disabled.")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")
GPU is enabled. device count: 1, current device: 0 device: cuda
Important !!!: If you see the following output instead, the GPU runtime is not enabled (in Colab, select Runtime > Change runtime type and choose a GPU):
GPU is disabled.
device: cpu
When it comes to machine learning and deep learning, setting a seed is important to ensure that the results are reproducible. We will be setting a seed of 42 for our project.
Keep in mind that even with these seeds set, achieving reproducibility across different platforms or GPU architectures can still be challenging.
import random
import numpy as np
import torch
import warnings
warnings.filterwarnings('ignore')
def seed_everything(seed_value):
    random.seed(seed_value)               # Python
    np.random.seed(seed_value)            # NumPy
    torch.manual_seed(seed_value)         # CPU
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)  # GPU, if available
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
seed_value = 42
seed_everything(seed_value)
As I mentioned earlier, we are going to use a news dataset to train our model.
To load datasets from the Hugging Face Hub we can use the load_dataset function. It provides a straightforward way to access many datasets, ensures standardized data processing, and allows easy splits into training, validation, and test sets.
Additional Information:
"shunk031/livedoor-news-corpus": Specifies the dataset identifier or location.
train_ratio=0.8: 80% of the dataset will be allocated for training.
validation_ratio=0.1: 10% of the dataset will be allocated for validation.
seed=42: Ensures reproducibility by using a fixed seed for random number generation.
shuffle=False: The data is not shuffled before being split into train, validation, and test sets.
from datasets import load_dataset
dataset = load_dataset(
"llm-book/livedoor-news-corpus",
train_ratio = 0.8,
validation_ratio = 0.1,
seed=42,
shuffle=False,
)
dataset
DatasetDict({ train: Dataset({ features: ['url', 'date', 'title', 'content', 'category'], num_rows: 5893 }) validation: Dataset({ features: ['url', 'date', 'title', 'content', 'category'], num_rows: 736 }) test: Dataset({ features: ['url', 'date', 'title', 'content', 'category'], num_rows: 738 }) })
Each split in the dataset consists of the following features: url, date, title, content, and category.
Out of curiosity, let's load the test data into a pandas DataFrame and check what it looks like.
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet.
In summary, the code imports the pandas library, converts the 'test' portion of the dataset into a pandas DataFrame, and then displays the first five rows of that DataFrame.
Additional Information:
import pandas as pd: This line imports the pandas library and gives it the alias pd. Pandas is a widely-used Python library for data analysis and manipulation, especially with tabular data.
test_df = pd.DataFrame(dataset['test']): This line does two main things: it selects the 'test' split of the dataset and converts it into a pandas DataFrame.
import pandas as pd
test_df = pd.DataFrame(dataset['test'])
test_df.head()
url | date | title | content | category | |
---|---|---|---|---|---|
0 | http://news.livedoor.com/article/detail/5936102/ | 2011-10-14T09:11:00+0900 | ゼンショー「事実無根」と反論 | 10月13日の夜、ゼンショーの広報室長がTwitterで読売新聞の報道に「事実無根」と反論し... | topic-news |
1 | http://news.livedoor.com/article/detail/5936557/ | 2011-10-14T11:16:00+0900 | 「報ステ」OP曲演奏のジャズミュージシャンに“売名行為”と批判相次ぐ | 先日、福島県が行っている新米の放射性物質本検査が全て終了した。規制値を超える放射性セシウムは... | topic-news |
2 | http://news.livedoor.com/article/detail/5936721/ | 2011-10-14T11:46:00+0900 | 「何のための“予約”なんですか」孫社長に批判殺到 | ソフトバンクは、今朝から発売が始まった“iPhone4S”をはじめとする、ソフトバンク全ての... | topic-news |
3 | http://news.livedoor.com/article/detail/5937177/ | 2011-10-15T10:00:00+0900 | あまりにも多すぎる「会いたくて」への皮肉か!? 「西野カナゲーム」が流行 | いま巷で「西野カナゲーム」なるものが流行してるという。 簡単に説明すると、1人目、2人目は... | topic-news |
4 | http://news.livedoor.com/article/detail/5937649/ | 2011-10-14T16:03:00+0900 | 憶測呼ぶ紳助さんの“天敵”引退 | 10月13日発売の東スポに「紳助の天敵が引退」との見出しが躍った。その天敵とは、警察庁の安藤... | topic-news |
For this task we are going to use Google's mT5-small model.
mT5 is a text-to-text transformer model released by Google. It is multilingual and can be used for translation, summarization, classification, and many other tasks.
Text-to-text transformers are a class of transformer models that can be used for a variety of tasks by providing a text input and producing a text output.
In summary, we can use the model to generate text from text.
Here we are using mt5-small, but there are other variants available, for example mt5-base. mt5-base is larger than mt5-small: it has more parameters and takes more time to train, but it has more capacity to learn.
We can use AutoTokenizer to load the tokenizer for the model. The following code loads the tokenizer for the mT5-small model.
A tokenizer is a function that splits a string of text into tokens. For example, the string "Hello world!" could be tokenized into the list of tokens ['Hello', 'world', '!'].
The code imports the AutoTokenizer class from the transformers library, specifies the "google/mt5-small" model, and then loads its associated tokenizer into the mt5_tokenizer variable. This tokenizer prepares text inputs for the mT5 small model.
Additional Information:
Here's a brief breakdown of the code:
from transformers import AutoTokenizer: Imports the AutoTokenizer class from the transformers library. This class is designed to automatically retrieve the appropriate tokenizer for a given model.
MODEL_NAME = "google/mt5-small": This line defines a constant MODEL_NAME that specifies the identifier for the mT5 (multilingual T5) model, which is a small version provided by Google. T5 (Text-to-Text Transfer Transformer) is a popular transformer-based model designed for various NLP tasks, and the multilingual version (mT5) has been trained on multiple languages.
mt5_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME): This line initializes the tokenizer associated with the mT5 model. The from_pretrained method fetches the tokenizer for the specified model (google/mt5-small) from Hugging Face's model hub and loads it into the mt5_tokenizer variable.
from transformers import AutoTokenizer
MODEL_NAME = "google/mt5-small"
mt5_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
mt5_tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
The tokenizer is a crucial component in the NLP pipeline, especially when dealing with transformer-based models like MT5. It handles the conversion between human-readable text and the format the model expects.
Encoding: Convert text (in this case, Japanese) into a format that the model can understand. This involves tokenizing the text into subwords or tokens, then mapping these tokens to their respective IDs in the model's vocabulary.
Decoding: Convert the model's output (token IDs) back into human-readable text.
Let's say we have the following Japanese sentence: 本日はAIトレーニングセッションへようこそ! ("Welcome to today's AI training session!")
text = "本日はAIトレーニングセッションへようこそ!"
encoded_text = mt5_tokenizer(text)
print("Encoded text: ", encoded_text)
tokenized_text = mt5_tokenizer.tokenize(text)
print("Tokenized text: ", tokenized_text)
decoded_text = mt5_tokenizer.decode(encoded_text["input_ids"], skip_special_tokens=True)
print("Decoded Text: ", decoded_text)
Encoded text: {'input_ids': [259, 212152, 15428, 96992, 191286, 6031, 15578, 68875, 309, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} Tokenized text: ['▁', '本日は', 'AI', 'トレーニング', 'セッション', 'へ', 'よう', 'こそ', '!'] Decoded Text: 本日はAIトレーニングセッションへようこそ!
The MT5 tokenizer breaks down the input text "本日はAIトレーニングセッションへようこそ!" into subwords and punctuation that it recognizes from its vocabulary. Each token is then assigned a unique ID. The decoding process reverses this, taking the token IDs and converting them back into the original text. The tokenization and decoding processes are designed to be reversible, ensuring that the original text can be perfectly reconstructed.
Additional Information:
Encoded Text:
input_ids: These are the IDs assigned by the tokenizer to each token. The sequence [259, 212152, 15428, 96992, 191286, 6031, 15578, 68875, 309, 1] represents the tokens in the tokenized text.
attention_mask: The mask [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] indicates that the model should attend to all tokens in the input_ids.
Tokenized Text:
The tokenized text is ['▁', '本日は', 'AI', 'トレーニング', 'セッション', 'へ', 'よう', 'こそ', '!']. This is how the MT5 tokenizer has broken down the input text into subword pieces from its vocabulary (a quick check below makes the token-to-ID mapping explicit).
Decoded Text: The decoded text matches the original sentence exactly, confirming that encoding and decoding are lossless (the end-of-sequence special token is skipped via skip_special_tokens=True).
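To make the token-to-ID mapping concrete, here is a quick optional inspection you can run after the cell above. It only uses the encoded_text variable we just created; the trailing token with ID 1 is the end-of-sequence marker </s> that the tokenizer appends.
# Line up each token with its vocabulary ID, including the appended </s> special token.
for token, token_id in zip(
        mt5_tokenizer.convert_ids_to_tokens(encoded_text["input_ids"]),
        encoded_text["input_ids"]):
    print(f"{token!r} -> {token_id}")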
For the next step (preparing the text for the model), we need to determine the maximum length of the tokenized content and title. Right now we don't know these maximum lengths, so we need to find them out. A larger length can produce better results, but it takes more time to train, so we need to find a balance between length and training time.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
def dist_info(text, max_bin_size=1024):
    token_counts = [len(mt5_tokenizer.tokenize(content)) for content in text]
    sns.histplot(token_counts, bins=100, binrange=(0, max_bin_size))
    plt.title("Tokenized Text Length Distribution")
    plt.xlabel("Tokenized Text Length")
    plt.ylabel("Count")
    plt.show()
    percentiles = [25, 50, 75, 90, 95, 99]
    for p in percentiles:
        print(f"{p}% of the dataset has token count below: {np.percentile(token_counts, p)}")
Let's use the above function to plot the distribution of the tokenized content.
dist_info(dataset["train"]["content"])
25% of the dataset has token count below: 398.0 50% of the dataset has token count below: 589.0 75% of the dataset has token count below: 815.0 90% of the dataset has token count below: 1066.0 95% of the dataset has token count below: 1302.0 99% of the dataset has token count below: 1969.08
Given the above, here's a recommendation:
If you have computational constraints (e.g., limited GPU memory), consider using a max_length of 1024. This covers roughly 90% of the data without truncation, which is a significant portion. You might need to adjust the batch size to prevent out-of-memory errors.
If you don't have computational constraints and want to leverage as much of your data as possible without truncation, you might consider a length above 1024 (e.g., around 1300 to cover up to the 95th percentile). However, note that sequences longer than 1024 might not be supported by all T5 model variants, so you would need to ensure your specific model can handle longer sequences.
Due to limited time, I will use 512 as the max length; otherwise, 1024 would be the better option.
Let's use the same function to plot the distribution of the tokenized title.
I'm assuming here that titles are probably shorter than 128 tokens, since even 90% of the content is under roughly 1,024 tokens. With that in mind, I'm passing 128 as the max bin range.
dist_info(dataset["train"]["title"], max_bin_size=128)
25% of the dataset has token count below: 16.0 50% of the dataset has token count below: 21.0 75% of the dataset has token count below: 25.0 90% of the dataset has token count below: 29.800000000000182 95% of the dataset has token count below: 32.0 99% of the dataset has token count below: 40.0
According to the above results, here is the analysis.
1. Max Length of 32:
- Pros:
- Covers roughly 95% of your title dataset without truncation.
- Very efficient in terms of memory and computation.
- Cons:
- The longest ~5% of titles will be truncated.
2. Max Length of 48 or 50:
- Pros:
- Covers 99% of your title dataset without truncation.
- Still relatively efficient in terms of memory and computation.
- Cons:
- Slightly more padding for titles that are significantly shorter, but given the overall low token counts, this is a minor concern.
Considering the title token counts:
Using a max_length of 32 for titles would be a good choice for efficiency, and it covers 90% of your titles without truncation.
If you want to ensure almost all titles are not truncated, you could go with a max_length of 48 or 50, which will cover up to the 99th percentile.
I am going to use 64 as the max length for the titles. This will cover 99% (or probably 99.99%) of the titles without truncation, and it's still relatively efficient in terms of memory and computation.
SOURCE_MAX_LEN = 512
TARGET_MAX_LEN = 64
Raw data can contain noise – irrelevant information, HTML tags, special characters, etc. Removing or cleaning such noise can help the model focus on the essential parts of the data.
Certain characters or sequences might have special meanings in the tokenization or modeling process. These need to be appropriately handled or escaped during preprocessing.
The text_clean_preprocess function takes a text input and performs several cleaning operations on it. It removes newline, tab, and carriage-return characters, replaces full-width spaces (\u3000) with a regular space, converts the text to lowercase, and returns the cleaned text.
NEWLINE_CHAR = "\n"
SPACE_CHAR = "\u3000"
TAB_CHAR = "\t"
CARRIAGE_RETURN_CHAR = "\r"
def text_clean_preprocess(text, newline_char=NEWLINE_CHAR, space_char=SPACE_CHAR, tab_char=TAB_CHAR, carriage_return_char=CARRIAGE_RETURN_CHAR):
    text = text.replace(newline_char, "")
    text = text.replace(space_char, " ")
    text = text.replace(tab_char, "")
    text = text.replace(carriage_return_char, "")
    text = text.lower()
    return text
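As a quick sanity check, here is what the cleaning function does to a small made-up string (the example text is illustrative only):
# Newlines, tabs, and carriage returns are removed; the full-width space becomes a regular space.
sample = "今日は\u3000良い天気です。\nそれでは\tまた。\r"
print(text_clean_preprocess(sample))  # -> 今日は 良い天気です。それではまた。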
Models like mT5 expect input in a specific format, i.e., sequences of token IDs.
This function tokenizes the raw text into such sequences, making it compatible with the model's expectations.
Additional Information:
Consistent Data Length: Neural models typically require fixed-length input. By padding or truncating sequences to SOURCE_MAX_LEN and TARGET_MAX_LEN, you ensure consistent input and target lengths.
Tokenization: The prefixed content (input) and the titles from data["title"] (target) are converted into token IDs using the mt5_tokenizer.
Attention Masking: The attention mask distinguishes real tokens from padding tokens, allowing the model to focus on meaningful parts of the input.
def tokenize_data(data):
    # clean the article bodies and use the titles as targets
    input_text = [text_clean_preprocess(content) for content in data["content"]]
    target_text = data["title"]
    # tokenize with fixed lengths so every example has the same shape
    inputs = mt5_tokenizer(input_text, max_length=SOURCE_MAX_LEN, truncation=True, padding="max_length")
    targets = mt5_tokenizer(target_text, max_length=TARGET_MAX_LEN, truncation=True, padding="max_length")
    # one flag per example (not a token-level mask)
    label_attention_mask = [1] * len(targets["input_ids"])
    return {
        "input_ids": inputs.input_ids,
        "attention_mask": inputs.attention_mask,
        "decoder_input_ids": targets["input_ids"],
        "labels": targets.input_ids,
        "label_attention_mask": label_attention_mask,
    }
In the previous step we created a function to transform our data. Now we can apply it to the whole dataset using the map function.
Additional Information: batched=True processes examples in batches (which is faster), and remove_columns drops the raw columns that the model does not need.
tokenized_dataset = dataset.map(
tokenize_data,
batched=True,
remove_columns=["content", "title", "url", "date", "category"]
)
Now it's time to load the pretrained model. As mentioned above, we are using the mT5-small model. The following code loads the "google/mt5-small" model. Once loaded, the model can be used for various tasks like translation, summarization, text generation, etc.
Note that if you're running this in an environment where you haven't previously downloaded the model, the from_pretrained method will download the model weights from the Hugging Face model hub. Make sure you have an active internet connection in that case.
Additional Information:
AutoModelForSeq2SeqLM: This class automatically infers and loads the appropriate sequence-to-sequence model architecture (like T5, mT5, BART, etc.) based on the provided model name or path. It's handy if you don't know the specific architecture of a model but have its name or path.
from_pretrained: This is a class method that loads a model based on the provided name or path. It downloads the model weights and configuration and returns an instance of the appropriate model class.
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(
MODEL_NAME,
trust_remote_code=True,
)
Since we are training a task that transforms one text sequence (content) into another text sequence (title), I am going to use DataCollatorForSeq2Seq. A data collator takes a list of samples from a Dataset and collates them into a batch, as a dictionary of tensors.
The job of a data collator is to pad the inputs and labels of each batch to a common length, build the attention masks, and convert everything into tensors the model can consume.
It's a good practice to use DataCollatorForSeq2Seq when fine-tuning seq2seq models, so incorporating it into your training workflow is a recommended step.
In summary, the following code sets up a tool (data_collator) that prepares and organizes batches of text data so the model can consume them during training.
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(
tokenizer=mt5_tokenizer,
model=model,
)
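To see the collator in action, here is a small optional check (a sketch; the example sentences are made up). It tokenizes two inputs of different lengths; the collator pads them into a single batch, builds the attention mask, pads the labels with -100, and derives decoder_input_ids from the labels because we passed the model.
sample_features = [
    {"input_ids": mt5_tokenizer("短いテキスト").input_ids,
     "labels": mt5_tokenizer("短いタイトル").input_ids},
    {"input_ids": mt5_tokenizer("こちらはもう少し長いテキストの例です。").input_ids,
     "labels": mt5_tokenizer("長めのタイトル").input_ids},
]
demo_batch = data_collator(sample_features)
print({k: v.shape for k, v in demo_batch.items()})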
When training a model, it's important to evaluate its performance on a validation dataset. For that we will use BERTScore.
BERTScore is based on the BERT model, a popular transformer-based model, and it is a good metric for evaluating text generation tasks like summarization, translation, etc.
Before moving forward, let's check how to use BERTScore.
The following function calculates the average BERT scores for a set of predicted and reference sentences using the BERTScore metric.
import evaluate
def compute_bert_score_demo(preds, labels):
    bert_score_metric = evaluate.load("bertscore")
    bert_score_metric.add_batch(predictions=preds, references=labels)
    result = bert_score_metric.compute(lang="ja")
    avg_scores = {k: sum(v) / len(v) for k, v in result.items() if k != "hashcode"}
    return avg_scores
original = "リンゴが好きです。"
candidate_1 ="リンゴが大好きです。"
candidate_2 = "赤い車を買うつもりです。"
bs_results = {
"candidate_1": compute_bert_score_demo([candidate_1], [original]),
"candidate_2": compute_bert_score_demo([candidate_2], [original]),
}
bs_df = pd.DataFrame(bs_results).T
bs_df
precision | recall | f1 | |
---|---|---|---|
candidate_1 | 0.966469 | 0.986506 | 0.976384 |
candidate_2 | 0.680357 | 0.723368 | 0.701203 |
The compute_bert_score function takes in eval_preds, which contains the model's predictions and the reference labels. It decodes both using mt5_tokenizer.batch_decode and then computes the BERTScore metric with bert_score_metric.compute, returning the average precision, recall, and F1.
def compute_bert_score(eval_preds):
    bert_score_metric = evaluate.load("bertscore")
    predictions, labels = eval_preds
    decoded_preds = mt5_tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = mt5_tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = bert_score_metric.compute(
        predictions=decoded_preds, references=decoded_labels, lang="ja"
    )
    return {
        "bertscore_precision": sum(result["precision"]) / len(result["precision"]),
        "bertscore_recall": sum(result["recall"]) / len(result["recall"]),
        "bertscore_f1": sum(result["f1"]) / len(result["f1"])
    }
Perfect, now we have a function that can be used to evaluate our model.
Now we need to define the training arguments. We can use Seq2SeqTrainingArguments to define the training arguments.
Seq2SeqTrainingArguments
This is a class provided by the transformers library that allows users to define training-related arguments or hyperparameters for sequence-to-sequence models (models that transform one sequence into another, like summarization models). It's a subclass of TrainingArguments, which is a general-purpose class for defining training-related arguments.
We need to balance training time and performance. There are many other hyperparameters that could improve performance, and the best values depend on the dataset, the model, and the task; there are also dedicated hyperparameter tuning methods to find them. For this project we are going to use the following hyperparameters.
Additional Information:
per_device_train_batch_size and per_device_eval_batch_size: Batch size for training and evaluation, respectively. Batch size is the number of data samples processed before the model updates its weights (see the schedule check after this list for how this interacts with gradient accumulation).
learning_rate: This is the initial learning rate for training.
lr_scheduler_type and warmup_ratio: These define the learning rate schedule. "linear" means the learning rate will decrease linearly with epochs. The warmup_ratio defines the fraction of total training steps for which the learning rate will be increased linearly before starting the decay.
num_train_epochs: Number of epochs to train the model. An epoch is one complete forward and backward pass of all training samples.
evaluation_strategy, save_strategy, and logging_strategy: These define when to evaluate, save, and log respectively. "epoch" means these actions will be performed at the end of each epoch.
logging_steps: How often to log training metrics (every 100 steps).
logging_dir: Directory where logs will be stored.
do_train and do_eval: Flags to determine if training and evaluation should be performed, respectively.
output_dir: Directory where the training outputs, such as the model checkpoints, will be saved.
save_total_limit: Maximum number of model checkpoints to be saved. Older checkpoints will be deleted.
load_best_model_at_end: If True, the best model according to the evaluation metric will be loaded at the end of training.
push_to_hub: If True, the model will be pushed to the Hugging Face Model Hub.
predict_with_generate: Indicates that the predict method should use the generate method for generating sequences.
The commented-out parameters (lr_scheduler_type, warmup_ratio) and other optional ones such as weight_decay or fp16 can be used to further customize the training. For example, fp16=True would enable half-precision (16-bit) floating-point training for potentially faster performance.
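As a rough sanity check on how these settings combine, here is a back-of-the-envelope sketch, assuming the 5,893-example train split shown earlier and the constants defined in the next cell:
import math

num_train_examples = 5893                                # size of the train split
batches_per_epoch = math.ceil(num_train_examples / 8)    # per-device batch size 8 -> 737 batches
optimizer_steps_per_epoch = batches_per_epoch // 4       # gradient accumulation 4 -> 184 updates
total_optimizer_steps = optimizer_steps_per_epoch * 5    # 5 epochs -> roughly 920 updates in total
print(batches_per_epoch, optimizer_steps_per_epoch, total_optimizer_steps)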
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
NUM_EPOCHS = 5
LEARNING_RATE = 5e-4
WARMUP_RATIO = 0.1
PER_DEVICE_TRAIN_BATCH_SIZE = 8
PER_DEVICE_EVAL_BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 4
WEIGHT_DECAY = 0.01
LOGGING_DIR = "./logs"
OUTPUT_DIR = "./results"
training_args = Seq2SeqTrainingArguments(
per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH_SIZE,
per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH_SIZE,
learning_rate=LEARNING_RATE,
# lr_scheduler_type="linear", #comment on colab
# warmup_ratio=0.1, #comment on colab
num_train_epochs=NUM_EPOCHS,
evaluation_strategy="epoch",
save_strategy="epoch",
logging_strategy="epoch",
logging_steps=100,
logging_dir=LOGGING_DIR,
do_train=True,
do_eval=True,
output_dir=OUTPUT_DIR,
save_total_limit=2,
load_best_model_at_end=True,
push_to_hub=False,
predict_with_generate=True,
gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
)
The following code initializes a Seq2SeqTrainer instance using the previously defined model, training_args, and data_collator. The trainer is responsible for training and evaluating the model.
Additional Information:
The Trainer class abstracts away many of the low-level details of training, such as batching, logging, saving checkpoints, etc. This makes it easy to train a model with just a few lines of code.
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["validation"],
data_collator=data_collator,
tokenizer=mt5_tokenizer,
compute_metrics=compute_bert_score,
)
We have set up everything for model training. Now it's time to train the model using the Hugging Face library. We can use the following code to train the model.
Even though it looks very simple, trainer.train() abstracts away many of the intricacies of the training process, providing a unified interface for training various models on diverse tasks. This makes it easier to train models without getting bogged down in the details of the training loop, gradient computation, etc. However, for those who need custom behavior, the Trainer and its components are highly configurable.
Additional Information:
Here's a breakdown of what happens under the hood:
1. Training Loop: The model goes through multiple epochs of training. One epoch is a complete forward and backward pass over all the training examples. For each epoch:
- Data Loading: Training data is loaded in batches using data loaders.
- Forward Pass: For each batch, the model produces predictions and the loss is computed against the labels.
- Backward Pass: The gradients of the loss with respect to the model parameters are computed.
- Optimization Step: The model's optimizer updates the model's parameters to minimize the loss. Common optimizers include Adam, SGD, etc.
- Gradient Clipping: If specified, gradients are clipped to prevent exploding gradients, which can destabilize training.
- Logging: Metrics like training loss, learning rate, etc., are logged. If a logging directory is provided, these can be visualized using TensorBoard or other tools.
2. Evaluation: If a validation dataset is provided, the model is evaluated on it according to evaluation_strategy (here, at the end of every epoch) and the metrics from compute_metrics are reported.
3. Model Saving: Checkpoints are saved according to save_strategy and save_total_limit; with load_best_model_at_end=True, the best checkpoint is reloaded when training finishes.
4. Learning Rate Scheduling: The learning rate is adjusted over the course of training according to the configured scheduler.
5. Early Stopping and Callbacks: Optional callbacks (such as early stopping) are invoked at defined points in the loop.
6. Clean-up: At the end of training, temporary resources are released and the final training statistics are returned.
trainer.train()
Epoch | Training Loss | Validation Loss | Bertscore Precision | Bertscore Recall | Bertscore F1 |
---|---|---|---|---|---|
0 | 7.068200 | 1.034466 | 0.706013 | 0.664744 | 0.684008 |
1 | 1.268600 | 0.894248 | 0.712798 | 0.681310 | 0.696195 |
2 | 1.079000 | 0.864152 | 0.712160 | 0.684890 | 0.697850 |
4 | 0.996100 | 0.854636 | 0.713781 | 0.689665 | 0.701086 |
4 | 0.956100 | 0.847545 | 0.715506 | 0.692384 | 0.703341 |
TrainOutput(global_step=920, training_loss=2.273634487649669, metrics={'train_runtime': 2200.8701, 'train_samples_per_second': 13.388, 'train_steps_per_second': 0.418, 'total_flos': 1.556004590321664e+16, 'train_loss': 2.273634487649669, 'epoch': 4.99})
The above output is what we got after training the model.
Let's analyze the results step by step:
Training and Validation Loss: The training loss drops sharply from about 7.07 in the first epoch to about 0.96 in the last, and the validation loss decreases steadily from about 1.03 to 0.85, so the model keeps improving on unseen data without obvious signs of overfitting.
BERTScore Metrics: Precision, recall, and F1 all improve gradually across epochs, with F1 rising from about 0.68 to 0.70 by the final epoch.
After the model is trained, the next step is to evaluate it. We can easily do that with the following code.
The Hugging Face Trainer will evaluate the model on the validation dataset (or the test dataset, if specified; see the sketch after the list below) and return the computed metrics. This method is used to understand the performance of the model on unseen data.
Additional Information:
Data Loading: The validation (or test) dataset is loaded batch by batch.
Evaluation Loop: For each batch, the model generates predictions and the metrics are computed against the reference labels.
Aggregation: After all batches are processed, the metrics are averaged (or otherwise aggregated) to provide a single value per metric for the entire dataset.
Logging: The computed metrics are logged. If a logging directory is provided, they can be visualized using tools like TensorBoard.
Return Value: The method returns a dictionary containing the computed metrics.
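If you also want the metrics on the held-out test split, you can pass it explicitly (a sketch; by default evaluate() uses the validation set we gave the trainer):
# Optional: evaluate on the test split instead of the validation split.
test_results = trainer.evaluate(eval_dataset=tokenized_dataset["test"], metric_key_prefix="test")
print(test_results)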
results = trainer.evaluate()
results_df = pd.DataFrame([results])
results_df
eval_loss | eval_bertscore_precision | eval_bertscore_recall | eval_bertscore_f1 | eval_runtime | eval_samples_per_second | eval_steps_per_second | epoch | |
---|---|---|---|---|---|---|---|---|
0 | 0.847545 | 0.715506 | 0.692384 | 0.703341 | 57.0718 | 12.896 | 1.612 | 4.99 |
We have trained and evaluated the model. Now it's time to test it.
For the next step, we need to generate predictions using the model.
There are many decoding methods for generating predictions, for example greedy search, beam search, top-k sampling, top-p (nucleus) sampling,
and many more.
Here we are going to use beam search. Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes within a limited set. It is an optimization of best-first search that reduces its memory requirements.
Additional Information:
I'm not going to explain beam search here, but if you are interested you can read more about it and the other decoding methods, which are nicely explained in the following blog post: https://huggingface.co/blog/how-to-generate
Different parameters will give different results, so you can try different parameters and see how they affect the output. There is no right or wrong answer here; it depends on the task and the dataset.
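For a quick comparison before we look at the full function, here is a minimal sketch that decodes the same test article with plain greedy search and with standard beam search (the article index is arbitrary; the generate_title function below adds diverse beam groups and other knobs on top of this):
sample_ids = mt5_tokenizer(
    "summarize: " + text_clean_preprocess(dataset["test"]["content"][0]),
    max_length=512, truncation=True, return_tensors="pt",
).input_ids.to(model.device)
model.eval()
greedy_ids = model.generate(sample_ids, max_length=64)             # num_beams=1 -> greedy search
beam_ids = model.generate(sample_ids, max_length=64, num_beams=4)  # keep 4 hypotheses per step
print("greedy:", mt5_tokenizer.decode(greedy_ids[0], skip_special_tokens=True))
print("beam  :", mt5_tokenizer.decode(beam_ids[0], skip_special_tokens=True))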
Here's a concise summary of the generate_title function below: it cleans the article content, prefixes it with "summarize: ", tokenizes it, and then uses diverse beam search (three beam groups, three returned sequences) to generate candidate titles.
def generate_title(content):
    # inputs = [text_clean_preprocess(content)]
    inputs = ["summarize: " + text_clean_preprocess(content)]
    batch = mt5_tokenizer.batch_encode_plus(
        inputs, max_length=512, truncation=True,
        padding="longest", return_tensors="pt")
    input_ids = batch["input_ids"]
    input_mask = batch["attention_mask"]
    model.eval()
    outputs = model.generate(
        input_ids=input_ids.cuda(),
        attention_mask=input_mask.cuda(),
        max_length=64,
        temperature=1.1,
        num_beams=6,            # 24
        diversity_penalty=3.0,  # 1.8
        num_beam_groups=3,
        num_return_sequences=3,
        repetition_penalty=9.0,
        # early_stopping=True,  # false
        # max_new_tokens=64,
        # do_sample=True
    )
    generated_titles = [mt5_tokenizer.decode(ids, skip_special_tokens=True,
                                             clean_up_tokenization_spaces=False)
                        for ids in outputs]
    return generated_titles
print("Total titles in dataset:", len(dataset["test"]["title"]))
selected_index = [75, 140, 286]
for index in selected_index:
print("original: ", dataset["test"]["title"][index])
titles = generate_title(dataset["test"]["content"][index])
for i, title in enumerate(titles):
print(f"Generated title {i+1}: {title}")
print()
Total titles in dataset: 738 original: いいとも!で紹介された「ヒドすぎる」名前が話題に Generated title 1: 『ザザザの斬新!赤ちゃんネーム』で紹介された「キラキラネーム」がありえない Generated title 2: 『ザザザの斬新!赤ちゃんネーム』が「ありえない」とネットニュースで話題 Generated title 3: ネットスラングで「キラキラネーム」がありえないとネットで話題【話題】 original: 日本の引きこもりに海外から相次ぐ心配の声 Generated title 1: 日本の引きこもりが海外で話題に Generated title 2: 日本の引きこもりが海外で話題 Generated title 3: 「日本=出る釘は打たれる」など、日本の引きこもりが海外で話題に original: 甲子園出場する石巻工「約5000万円が必要」呼びかけに物議 Generated title 1: 【Sports Watch】石巻工業高校が総額約5000万円の協賛金を募っている Generated title 2: 【Sports Watch】石巻工業高校が総額約5000万円の協賛金を募っている理由 Generated title 3: 石巻工業高校が、総額約5000万円の協賛金を募っている【話題】
We can see that the generated titles are quite similar to the actual title. This is a good sign, indicating that the model is learning to generate titles that resemble the real ones.
Also note that the generated titles are not identical to the actual title. This is expected: the model is not trained to reproduce the exact title, but to generate a title that is similar to it.
Again, different parameters will give different results, so you can try different parameters and see how they affect the output.
Our model is trained on the livedoor news corpus, and we tested it with livedoor news articles. But I would like to see how it responds to news from another source, so I pasted a news article I got from Yahoo! News. Let's see what our model outputs.
# https://news.yahoo.co.jp/pickup/6476740
yahoo_news_1_original_title = """【速報】女川原発2号機「再稼働目標 3か月延期へ」来年5月に<東北電力>"""
yahoo_news_1 = """東北電力は来年2月を目標としていた「女川原発2号機の再稼働」について、来年5月に延期することを明らかにした。今年11月としていた安全対策工事の完了時期が来年2月に延びるため。
東北電力によると、工事が3か月延びるのは、発電所内の設備などにつながるケーブルが火災などで損傷しないようにする「火災防護対策」を追加したことが主な要因。この対策を巡っては、他の電力会社が原子力規制員会から指摘を受けた事例を踏まえ、東北電力では去年10月から追加で工事をすることを準備していた。
その工程を精査した結果、3か月ほど完了時期が延びることが判明したもので、それに伴って女川原発2号機の再稼働目標も3か月延期し、来年5月頃となった。
"""
print("Original yahoo title: ", yahoo_news_1_original_title)
for i, title in enumerate(generate_title(yahoo_news_1)):
    print(f"Generated title {i+1}: {title}")
Original yahoo title: 【速報】女川原発2号機「再稼働目標 3か月延期へ」来年5月に<東北電力> Generated title 1: 東北電力、来年2月を目標としていた「女川原発2号機再稼働」を発表 Generated title 2: 東北電力、来年2月を目標としていた「女川原発2号機再稼働」について3か月延期 Generated title 3: 【ニュース】東北電力、来年2月を目標としていた「女川原発2号機再稼働」について3か月延期
If you are interested, you can copy and paste one of your own emails and see how the model responds.
sample_email = """おはようございます。
アナハイム・エレクトロニクス事務局です。
本日は、まもなく締め切りとなります、10/18(水)開催の対面イベント「モルゲンレーテ社製品無料体験会&相談会 @アナハイム・エレクトロニクスカリフォルニア本社」のご案内です!
実際に触れてみないとわからないとお困りの方、ぜひこの機会に体験してみてください。
モルゲンレーテ社製品無料体験会は特に以下のような方にお勧めです。
・電子・電気機器の新規導入でご検討中の方
※小規模から対応できます!規模は問いません。
・電子機器の更改でクラウド移行を検討している方
・顧客管理システムも含めセキュアに電子機器を構築したい方。
モルゲンレーテ社製品を検討はしているけど、実際に触れてみないとわからない、導入を検討しているけど、何から始めればよいか分からないといった課題、お困りごとのある方は、ぜひこの機会に体験してみてください。
また、体験会の後は、弊社エンジニアの個別相談会も予定しております。
皆さまのお悩みなどお気軽にお話いただければと思います。
以下、無料体験会の詳細、申し込み方法をご確認のうえ、ぜひお気軽にご参加ください!
イベントの参加は無料! 参加お申込みは10月16日(月)となります。
みなさまのご参加お待ちしています!
"""
for i, title in enumerate(generate_title(sample_email)):
    print(f"Generated title {i+1}: {title}")
Generated title 1: アナハイム・エレクトロニクスカリフォルニア本社にて開催の対面イベント「モルゲンレーテ社製品無料体験会&相談会」を開催 Generated title 2: アナハイム・エレクトロニクスカリフォルニア本社にて開催の対面イベント「モルゲンレーテ社製品無料体験会&相談会】 Generated title 3: 「モルゲンレーテ社製品無料体験会&相談会 アナハイム・エレクトロニクスカリフォルニア本社」
There are many ways to improve our model. Here are some of them (see the sketch below for how such options plug into the training arguments):
Transfer Learning: Before fine-tuning on the title generation task, you can first fine-tune the model on a related task (e.g., summarization) to make it more familiar with the domain.
Regularization: Techniques like dropout, layer normalization, or weight decay can help prevent overfitting.
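As an illustration of the regularization point, here is how weight decay (and optionally fp16) could be passed to the training arguments. This is a hedged sketch reusing the constants defined earlier, not a configuration we ran above:
regularized_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    num_train_epochs=NUM_EPOCHS,
    weight_decay=WEIGHT_DECAY,   # penalizes large weights to reduce overfitting
    # fp16=True,                 # optional: mixed precision for faster GPU training
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    predict_with_generate=True,
)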