Learning NLP - 8 - ML Over the Horizon

Introduction

This post investigates and tests text generation with Transformers.
It covers two models, GPT-2 and CTRL.
The experiments use the transformers library and run on Google Colaboratory.

GPT-2

GPT-2 is a Transformer-based language generation model proposed by OpenAI.

GPT-2 model architecture

Preprocessing
- Byte Pair Encoding¹
Embedding
- token embedding (vocabulary size 50257)
- positional embedding (maximum length 1024)
Transformer (12 layers, hidden size 768)
- Layer Normalization
- Attention
- Layer Normalization
- MLP
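
As a rough sanity check on the architecture listed above, the parameter count can be tallied from these sizes. The sketch below is a back-of-the-envelope calculation, not code from the transformers library. It assumes an MLP hidden size of 4 x 768 = 3072, bias terms on every linear layer, one final layer normalization, and an output layer that shares its weights with the token embedding; none of these are stated in the list above.

```python
# Sizes taken from the architecture list above.
vocab_size = 50257
max_positions = 1024
n_layers = 12
d_model = 768
# Assumption (not in the list above): the MLP hidden size is 4 * d_model.
d_mlp = 4 * d_model


def linear_params(n_in, n_out):
    """Weights plus bias of one fully connected layer."""
    return n_in * n_out + n_out


def layer_norm_params(dim):
    """Scale and shift vectors of one layer normalization."""
    return 2 * dim


def block_params(dim, mlp_dim):
    """Parameters of one Transformer block: LN, attention, LN, MLP."""
    # Attention: one projection to queries/keys/values, one output projection.
    attention = linear_params(dim, 3 * dim) + linear_params(dim, dim)
    mlp = linear_params(dim, mlp_dim) + linear_params(mlp_dim, dim)
    return 2 * layer_norm_params(dim) + attention + mlp


embedding_params = vocab_size * d_model + max_positions * d_model
# Assumption: one final layer normalization, and the output layer reuses
# the token embedding matrix, so it adds no parameters of its own.
total_params = (
    embedding_params
    + n_layers * block_params(d_model, d_mlp)
    + layer_norm_params(d_model)
)

print(f"embeddings: {embedding_params:,}")
print(f"per block:  {block_params(d_model, d_mlp):,}")
print(f"total:      {total_params:,}")
```

Under these assumptions the total comes to roughly 124 million parameters, with the two embedding tables accounting for a little under a third of it.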

Code - loading the model

!pip install transformers
import torch
from transformers import AutoModelWithLMHead, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2").to(device)

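Text generation process with GPT-2

GPT-2 generates text through the following steps.

1. Provide the beginning of the text to be generated.
2. The GPT-2 language model returns a prediction for the next token, together with each layer's attention information (cached keys and values) for the text generated so far.
3. Append the predicted token to the text and feed it back in as the next input, along with the attention information from the earlier text.
4. Repeat until the text reaches the target length.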

Sample code

# Encode the opening text
input_ids = tokenizer.encode("When I was young, I played tennis", add_special_tokens=False, return_tensors="pt")
# Attention mask (all ones: attend to every token)
attention_mask = input_ids.new_ones(input_ids.shape)
# No past state at the start (encoder-decoder models use it from the first step)
past = None

# Generate text by applying the language model repeatedly.
for n in range(10):
  model_inputs = model.prepare_inputs_for_generation(input_ids, past=past, attention_mask=attention_mask)
  # For the GPT-2 LM model, the outputs are the per-token next-token predictions and the attention state
  outputs = model(**model_inputs)

  # Prepare the inputs for the next token prediction
  past = outputs[1]
  next_token_logits = outputs[0][:, -1, :]
  # Greedy choice: take the most likely token
  next_token = torch.argmax(next_token_logits, dim=-1)
  input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)
  attention_mask = torch.cat(
                      [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
                  )
print(tokenizer.decode(input_ids[0]))

Output

When I was young, I played tennis. I was a big fan of the game.

Text generation example with GPT-2

Reference: https://github.com/huggingface/transformers/blob/master/examples/run_generation.py

The prompt is the opening of the Kyoto article on English Wikipedia.
https://en.wikipedia.org/wiki/Kyoto

Kyoto is the capital city of Kyoto Prefecture in Japan. Located in the Kansai region on the island of Honshu, Kyoto forms a part of the Keihanshin metropolitan area along with Osaka and Kobe. As of 2018, the city had a population of 1.47 million.

Text generation

max_length = 100  # maximum length in tokens
# Prompt
prompt_text = "Kyoto is the capital city of Kyoto Prefecture in Japan. Located in the Kansai region on the island of Honshu, Kyoto forms a part of the Keihanshin metropolitan area along with Osaka and Kobe. As of 2018, the city had a population of 1.47 million."

# Generate text from the prompt
encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt").to(device)
generated = model.generate(encoded_prompt, max_length=max_length, do_sample=True)
generated_text = tokenizer.decode(generated[0], clean_up_tokenization_spaces=True)

# Strip the prompt
print(generated_text[len(prompt_text):])

Generated text

While Kyoto has an economy of over 30 million people, cities in Japan currently are divided into five parts:

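The generated text is related to Kyoto and Japan, and at first glance it looks grammatically correct.
(I am not fluent in English, so I cannot say whether it is strictly correct.)
However, the connection in meaning between the first and second halves is unclear, and it reads somewhat unnaturally as human-written text.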
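
The sample code earlier picks the most likely token with torch.argmax, whereas the generate call here passes do_sample=True, which draws the next token at random according to the predicted probabilities. The following is a minimal sketch of that difference on a made-up logit vector. The function names and the numbers are illustrative only and are not part of transformers.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Always return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, rng):
    """Draw an index at random, weighted by the softmax probabilities."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical next-token logits over a four-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)

greedy_choices = [greedy_pick(toy_logits) for _ in range(5)]
sampled_choices = [sample_pick(toy_logits, rng) for _ in range(5)]

print("greedy: ", greedy_choices)   # the same index every time
print("sampled:", sampled_choices)  # can vary from draw to draw
```

Sampling is why rerunning the generate call can give a different continuation each time, while greedy decoding always returns the same one for a given prompt.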

CTRL

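CTRL is a model that generates text under a specified condition.
Note: CTRL is a heavy model whose model file is about 6.5 GB, and loading it on Colaboratory crashes the session with an out-of-memory error. It can therefore only be run in a high-memory session (memory usage is about 13 GB).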

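CTRL model architecture

Preprocessing
- Byte Pair Encoding
Embedding
- word embedding (vocabulary size 246534)
- positional embedding (sinusoidal, maximum length 256, not learned)
Transformer (48 layers, hidden size 1280)

CTRL is trained on multiple languages, so its vocabulary is large.
The overall architecture is almost the same as BERT and GPT-2, but it has far more layers (48) and a larger hidden size of 1280 (BERT and GPT-2 use 768; BERT-large uses 1024).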
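
The sinusoidal positional embedding listed above is computed from a fixed formula rather than learned. The sketch below follows the standard formulation from the original Transformer, with sines in the even dimensions and cosines in the odd ones. The exact layout inside CTRL may differ (for example, it may concatenate the sine and cosine halves instead of interleaving them), so treat this as an illustration of the idea rather than CTRL's implementation.

```python
import math


def sinusoidal_positions(max_len, dim):
    """Return a max_len x dim table of fixed sinusoidal position vectors."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(dim):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / dim).
            angle = pos / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Sizes from the CTRL list above: 256 positions, 1280 dimensions.
position_table = sinusoidal_positions(256, 1280)

print(len(position_table), len(position_table[0]))
print(position_table[0][:4])  # position 0: sin(0) = 0 and cos(0) = 1
```

Because the table depends only on the position and the dimension index, it needs no training and can be built once up front.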

Loading the model

import torch
from transformers import AutoModelWithLMHead, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("ctrl")
model = AutoModelWithLMHead.from_pretrained("ctrl").to(device)

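Conditional text generation

The distinguishing feature of CTRL is that prepending a control code to the input text lets you specify the genre of the generated text. During training, the source of each piece of data (Wikipedia: English Wikipedia, Reviews: Amazon Reviews data) is prepended to the text before the language model is trained on it; at inference time, the control code for the desired genre is likewise prepended to the text.
Internally, the model does not treat the control code specially (at inference time); it is processed through the same embedding as ordinary tokens.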
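
To make the point about control codes concrete, here is a toy sketch in which the control code is looked up in the same embedding table as every other token. The vocabulary, the whitespace tokenizer, and the 4-dimensional random embeddings are all made up for illustration; CTRL itself uses BPE tokens and a learned embedding.

```python
import random

# Hypothetical toy vocabulary: control codes sit next to ordinary words.
vocab = {"Reviews": 0, "Wikipedia": 1, "Kyoto": 2, "is": 3}
embedding_dim = 4
rng = random.Random(0)
# One shared table of random vectors stands in for the learned embedding.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)] for _ in vocab
]


def encode(control_code, text):
    """Prepend the control code, split on whitespace, and map to token ids."""
    tokens = [control_code] + text.split()
    return [vocab[token] for token in tokens]


def embed(token_ids):
    """Look up every id, control code included, in the same table."""
    return [embedding_table[token_id] for token_id in token_ids]


review_ids = encode("Reviews", "Kyoto is")
wiki_ids = encode("Wikipedia", "Kyoto is")
review_vectors = embed(review_ids)

print(review_ids)
print(wiki_ids)
```

The two prompts differ only in their first id, so any difference in the generated text comes from how the model conditions on that one extra token.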

Text generation with CTRL

Using the Reviews category as input

max_length = 30  # maximum length in tokens
# Prompt
control_code = "Reviews "
prompt_text = control_code + "Kyoto is"

# Generate text from the prompt
encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=True, return_tensors="pt").to(device)
generated = model.generate(encoded_prompt, max_length=max_length)
generated_text = tokenizer.decode(generated[0], clean_up_tokenization_spaces=True)

# Strip the control code
print(generated_text[len(control_code):])

Generated text

Kyoto is a great place to visit. Rating: 4.0 I have been to Kyoto and this book is a great guide to the city. It is

The generated text does read like a review.
Rating appears to be on a five-point scale, so 4.0 is a fairly high score.
The grammar looks correct.

Generated text with control_code = "Science"

Kyoto is a city of about 1 million people in the northern part of Japan. It is the capital of the prefecture of Kyoto and the largest city in

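Describing population and geography is not out of place for a science text.
However, the claim that Kyoto lies in the northern part of Japan is not correct (I believe); the Wikipedia text quoted earlier places it in the Kansai region of Honshu.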

Generated text with control_code = "Politics"

Kyoto is a good example of how the US can be a world leader in reducing carbon emissions while still being a world leader in economic growth. Sco@@

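With the Politics control code, the model appears to generate text about CO2 emissions, presumably in connection with the Kyoto Protocol. (I cannot tell whether it is semantically correct, but it looks grammatical.)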

Generation results for several other control codes are summarized below. (Newline characters have been removed.)

Control code	Generated text
Fitness	Kyoto is a city of about 1 million people in the prefecture of Kyoto in Japan. It is the capital of the Kyoto Prefecture and the largest city in
Human	Kyoto is Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto Kyoto
Gaming	Kyoto is a bit of a misnomer as the term is used in the US and Canada to refer to the same thing. The term is
Movies	Kyoto is a great example of how to make a movie that is both visually stunning and emotionally moving. Score: 6 Title: What are some
Technologies	Kyoto is a good example of how the world is moving away from fossil fuels and towards renewable energy. Score: 6 Title: The Internet is
Books	Kyoto is the most important of the three cities of the empire, and is the seat of the Imperial Court. It is a large city, with a

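As these results show, for some genres the generated text has little to do with the genre.
Perhaps when the genre itself has no direct connection to Kyoto, the model falls back to generic text about Kyoto?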

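Summary and next steps

This post tried text generation with two Transformer models, GPT-2 and CTRL.

Both models generate English that looks grammatically plausible.
With CTRL, changing only the first word of the input changes the generated text substantially, which I found interesting.

Next, I would like to explore what can be done with Transformers on Japanese datasets.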

References

¹ https://ja.wikipedia.org/wiki/%E3%83%90%E3%82%A4%E3%83%88%E5%AF%BE%E7%AC%A6%E5%8F%B7%E5%8C%96