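GiNZA(5) - Sentence Boundary Detection

October 15, 2021

Sentence boundary detection splits a text into individual sentences.

You might think it is enough to detect 「。」, since that marks the end of a sentence, but sentences can also end with 「！」 or 「？」.

Text can also contain quoted dialogue, as in 「彼は『本当ですか。』とつぶやいた。」 (He muttered, "Is that true?"), so sentence boundary detection is harder than it looks.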
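To make the difficulty concrete, here is a minimal sketch of a naive splitter that cuts after every 「。」, 「！」 or 「？」. The function name naive_split is made up for this illustration and is not part of GiNZA or spaCy.

```python
import re
from typing import List

def naive_split(text: str) -> List[str]:
    """Cut after every 。, ！ or ？ with no awareness of quotation marks."""
    return [s for s in re.split(r'(?<=[。！？])', text) if s]

# Plain sentences are split where a reader would expect.
print(naive_split('新橋でランチをご一緒しましょう。次の火曜日はどうですか。'))

# Quoted dialogue is cut in the middle of the quotation.
print(naive_split('彼は『本当ですか。』とつぶやいた。'))
```

The second call returns two fragments, with the cut falling inside the quotation, which is the kind of case a rule based only on punctuation cannot handle.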

Sentence boundary detection

The code for sentence boundary detection looks like this.

[Google Colaboratory]

import spacy
nlp = spacy.load('ja_ginza')
doc = nlp('新橋でランチをご一緒しましょう。次の火曜日はどうですか。')

# Sentence boundary detection
for span in doc.sents:
    print(span)

The output is as follows.

[Output]

新橋でランチをご一緒しましょう。
次の火曜日はどうですか。

The text has been split into sentences.

Looping over each span with a further for loop yields the tokens that make up each sentence.

[Google Colaboratory]

import spacy
nlp = spacy.load('ja_ginza')
doc = nlp('新橋でランチをご一緒しましょう。次の火曜日はどうですか。')

# Sentence boundary detection + tokenization
for span in doc.sents:
    for token in span:
        print(token)

The output is as follows.

[Output]

新橋
で
ランチ
を
ご
一緒
し
ましょう
。
次
の
火曜日
は
どう
です
か
。

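Next time, we will split text into bunsetsu (phrase units).

GiNZA(4) - Morphological Analysis (Lemmatization)

October 14, 2021

Lemmatization converts a token to its dictionary headword (lemma).

Normalizing tokens to their headwords makes it possible to recognize that differently written or inflected forms are the same word.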
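As a small illustration of why this matters, the sketch below counts words first by surface form and then by lemma. The (surface, lemma) pairs are written by hand as stand-ins for token.text and token.lemma_; they are illustrative assumptions, not output produced by GiNZA.

```python
from collections import Counter

# Hand-written (surface form, lemma) pairs for three inflected forms of 行く.
# These are illustrative assumptions, not analyzer output.
pairs = [('行き', '行く'), ('行っ', '行く'), ('行か', '行く')]

surface_counts = Counter(text for text, _ in pairs)
lemma_counts = Counter(lemma for _, lemma in pairs)

print(surface_counts)  # three different surface forms, one occurrence each
print(lemma_counts)    # a single lemma counted three times
```

Counting by lemma groups the three inflected forms under one entry, which is what makes word-frequency or keyword statistics meaningful for inflected languages.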

Lemmatization

The code for lemmatization is as follows.

[Google Colaboratory]

import spacy
nlp = spacy.load('ja_ginza')
doc = nlp('新橋に行きます。')

for token in doc:
    print(
        token.text+', '+ # Text
        token.lemma_) # Lemma

The output is as follows.

[Output]

新橋, 新橋
に, に
行き, 行く
ます, ます
。, 。

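The token 行き has been converted to its headword 行く.

In the form 行く, the word can be looked up in a dictionary.

Next time, we will perform sentence boundary detection.

GiNZA(3) - Morphological Analysis (POS Tagging)

October 13, 2021

POS tagging determines the part of speech of each token.

POS tagging

The code below checks the part of speech of each token.

It prints two kinds of POS tags.

SudachiPy POS tag
Universal Dependencies POS tag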

[Google Colaboratory]

import spacy
nlp = spacy.load('ja_ginza')
doc = nlp('新橋に行きます。')

for token in doc:
    print(
        token.text+', '+ # Token
        token.tag_+', '+ # SudachiPy POS tag
        token.pos_)      # Universal Dependencies POS tag

The output is as follows.

[Output]

新橋, 名詞-普通名詞-一般, NOUN
に, 助詞-格助詞, ADP
行き, 動詞-非自立可能, VERB
ます, 助動詞, AUX
。, 補助記号-句点, PUNCT
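One typical use of these tags is to keep only content words. The sketch below works on rows copied from the output above rather than calling GiNZA, and the set of tags treated as content words is an assumption chosen for this example.

```python
# (text, tag_, pos_) rows copied from the output above.
rows = [
    ('新橋', '名詞-普通名詞-一般', 'NOUN'),
    ('に', '助詞-格助詞', 'ADP'),
    ('行き', '動詞-非自立可能', 'VERB'),
    ('ます', '助動詞', 'AUX'),
    ('。', '補助記号-句点', 'PUNCT'),
]

# Which Universal Dependencies tags count as content words is a choice
# made for this sketch, not something defined by GiNZA.
CONTENT_POS = {'NOUN', 'PROPN', 'VERB', 'ADJ'}

content_words = [text for text, _tag, pos in rows if pos in CONTENT_POS]
print(content_words)
```

With a live pipeline, the same filter could be applied to token.pos_ while iterating over doc.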

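Next time, we will perform lemmatization.

GiNZA(2) - Morphological Analysis (Tokenization)

October 12, 2021

Morphological analysis

Morphological analysis splits a text into morphemes, the smallest meaningful units of language, and determines the part of speech and headword of each morpheme.

Specifically, it can be broken down into the following steps.

Tokenization
Splits a text into the smallest units of language
POS tagging
Determines the part of speech of each token
Lemmatization
Converts each token to its dictionary headword

Token split modes

When tokenizing, GiNZA lets you switch between three split modes.

Use set_split_mode() to specify the split mode.

First, tokenize with split mode A.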

[Google Colaboratory]

import spacy
import ginza
nlp = spacy.load('ja_ginza')
ginza.set_split_mode(nlp, 'A') # Split mode A
doc = nlp('あの女性は国家公務員です')

for token in doc:
    print(token)

[Output]

あの
女性
は
国家
公務
員
です

Next, tokenize with split mode B.

[Google Colaboratory]

import spacy
import ginza
nlp = spacy.load('ja_ginza')
ginza.set_split_mode(nlp, 'B') # Split mode B
doc = nlp('あの女性は国家公務員です')

for token in doc:
    print(token)

[Output]

あの
女性
は
国家
公務員
です

Finally, tokenize with split mode C.

[Google Colaboratory]

import spacy
import ginza
nlp = spacy.load('ja_ginza')
ginza.set_split_mode(nlp, 'C') # Split mode C
doc = nlp('あの女性は国家公務員です')

for token in doc:
    print(token)

[Output]

あの
女性
は
国家公務員
です

The word 国家公務員 (national civil servant) is split differently depending on the split mode.

Split mode A
国家／公務／員
Split mode B
国家／公務員
Split mode C
国家公務員
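The three token lists printed above can also be compared programmatically. The sketch below uses the lists exactly as they appear in the outputs, without calling GiNZA, and checks that every mode covers the same text with only the cut points differing.

```python
sentence = 'あの女性は国家公務員です'

# Token lists copied from the three outputs above.
tokens_by_mode = {
    'A': ['あの', '女性', 'は', '国家', '公務', '員', 'です'],
    'B': ['あの', '女性', 'は', '国家', '公務員', 'です'],
    'C': ['あの', '女性', 'は', '国家公務員', 'です'],
}

for mode, tokens in tokens_by_mode.items():
    # Joining the tokens gives back the original sentence in every mode.
    assert ''.join(tokens) == sentence
    print(mode, len(tokens), '／'.join(tokens))
```

Mode A gives the most tokens and mode C the fewest, so a shorter unit may suit search indexing while a longer one may suit tasks that want compound nouns kept whole.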

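Next time, we will perform POS tagging.

GiNZA(1) - Installation

October 11, 2021

Let's try GiNZA, a library for natural language processing.

Overview of GiNZA

GiNZA is a Japanese natural language processing library used for purposes such as the following.

Information extraction
Extracts information that matches specific conditions from large volumes of natural-language text.
Natural language understanding
Infers from an utterance what task the speaker is requesting.
Preprocessing for deep learning
Converts natural-language text into input data for deep learning.
It is also used at inference time with deep-learning NLP models.

Installation

Run the following command to install GiNZA.

[Google Colaboratory]

# Install GiNZA
!pip install ginza==4.0.5

If a log like the following is printed, the installation succeeded.

[Output]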

Collecting ginza==4.0.5
  Downloading ginza-4.0.5.tar.gz (20 kB)
Collecting spacy<3.0.0,>=2.3.2
  Downloading spacy-2.3.7-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.4 MB)
     |████████████████████████████████| 10.4 MB 5.5 MB/s 
Collecting ja_ginza<4.1.0,>=4.0.0
  Downloading ja_ginza-4.0.0.tar.gz (51.5 MB)
     |████████████████████████████████| 51.5 MB 16 kB/s 
Collecting SudachiPy>=0.4.9
  Downloading SudachiPy-0.5.4.tar.gz (86 kB)
     |████████████████████████████████| 86 kB 5.0 MB/s 
Collecting SudachiDict-core>=20200330
  Downloading SudachiDict-core-20210802.tar.gz (9.1 kB)
Collecting thinc<7.5.0,>=7.4.1
  Downloading thinc-7.4.5-cp37-cp37m-manylinux2014_x86_64.whl (1.0 MB)
     |████████████████████████████████| 1.0 MB 43.1 MB/s 
Requirement already satisfied: blis<0.8.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.0.0,>=2.3.2->ginza==4.0.5) (0.4.1)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.0.0,>=2.3.2->ginza==4.0.5) (3.0.5)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.0.0,>=2.3.2->ginza==4.0.5) (2.0.5)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.0.0,>=2.3.2->ginza==4.0.5) (0.8.2)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.0.0,>=2.3.2->ginza==4.0.5) (1.0.5)
Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.0.0,>=2.3.2->ginza==4.0.5) (1.19.5)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.7/dist-packages (from spacy<3.0.0,>=2.3.2->ginza==4.0.5) (1.1.3)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.0.0,>=2.3.2->ginza==4.0.5) (2.23.0)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.0.0,>=2.3.2->ginza==4.0.5) (1.0.5)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from spacy<3.0.0,>=2.3.2->ginza==4.0.5) (57.4.0)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.7/dist-packages (from spacy<3.0.0,>=2.3.2->ginza==4.0.5) (1.0.0)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.0.0,>=2.3.2->ginza==4.0.5) (4.62.3)
Requirement already satisfied: importlib-metadata>=0.20 in /usr/local/lib/python3.7/dist-packages (from catalogue<1.1.0,>=0.0.7->spacy<3.0.0,>=2.3.2->ginza==4.0.5) (4.8.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20->catalogue<1.1.0,>=0.0.7->spacy<3.0.0,>=2.3.2->ginza==4.0.5) (3.6.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20->catalogue<1.1.0,>=0.0.7->spacy<3.0.0,>=2.3.2->ginza==4.0.5) (3.7.4.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.0.0,>=2.3.2->ginza==4.0.5) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.0.0,>=2.3.2->ginza==4.0.5) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.0.0,>=2.3.2->ginza==4.0.5) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.0.0,>=2.3.2->ginza==4.0.5) (2021.5.30)
Collecting sortedcontainers~=2.1.0
  Downloading sortedcontainers-2.1.0-py2.py3-none-any.whl (28 kB)
Collecting dartsclone~=0.9.0
  Downloading dartsclone-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (473 kB)
     |████████████████████████████████| 473 kB 34.1 MB/s 
Requirement already satisfied: Cython in /usr/local/lib/python3.7/dist-packages (from dartsclone~=0.9.0->SudachiPy>=0.4.9->ginza==4.0.5) (0.29.24)
Building wheels for collected packages: ginza, ja-ginza, SudachiDict-core, SudachiPy
  Building wheel for ginza (setup.py) ... done
  Created wheel for ginza: filename=ginza-4.0.5-py3-none-any.whl size=15895 sha256=05809609d54412282a3aca677bf262183e269cff4ea2eaa92b35fe485ed4104c
  Stored in directory: /root/.cache/pip/wheels/ba/a9/a2/c1165c004f6dcb415b7a7d145aa4511b5024b5fb1f2eb0c0ea
  Building wheel for ja-ginza (setup.py) ... done
  Created wheel for ja-ginza: filename=ja_ginza-4.0.0-py3-none-any.whl size=51530813 sha256=f6d2a89bfa1850b356ef103dca344535c6472c5615cf8c7d3e0bf2f0bb964e42
  Stored in directory: /root/.cache/pip/wheels/a8/f5/4a/5d4877342f912e0b7209d8a65e7ce39fe2c1a3c2511d59acfb
  Building wheel for SudachiDict-core (setup.py) ... done
  Created wheel for SudachiDict-core: filename=SudachiDict_core-20210802-py3-none-any.whl size=71418512 sha256=5d1f8362b682fef55a4ca03c8d4a590ed74a910a220d1d51947aac52c6d9a1c6
  Stored in directory: /root/.cache/pip/wheels/91/e8/21/e80d212743835d87bb5e7eca81b6abef6d8cb67a294007a837
  Building wheel for SudachiPy (setup.py) ... done
  Created wheel for SudachiPy: filename=SudachiPy-0.5.4-cp37-cp37m-linux_x86_64.whl size=872116 sha256=de7e07689323107cc31919c53b64a7bad008113ae8e094daabcab14dbf5a8957
  Stored in directory: /root/.cache/pip/wheels/6b/5b/8b/ce1f543c9e9af590fdc62e8344fda5a3950c60c0d21c83174e
Successfully built ginza ja-ginza SudachiDict-core SudachiPy
Installing collected packages: thinc, sortedcontainers, dartsclone, SudachiPy, spacy, SudachiDict-core, ja-ginza, ginza
  Attempting uninstall: thinc
    Found existing installation: thinc 7.4.0
    Uninstalling thinc-7.4.0:
      Successfully uninstalled thinc-7.4.0
  Attempting uninstall: sortedcontainers
    Found existing installation: sortedcontainers 2.4.0
    Uninstalling sortedcontainers-2.4.0:
      Successfully uninstalled sortedcontainers-2.4.0
  Attempting uninstall: spacy
    Found existing installation: spacy 2.2.4
    Uninstalling spacy-2.2.4:
      Successfully uninstalled spacy-2.2.4
Successfully installed SudachiDict-core-20210802 SudachiPy-0.5.4 dartsclone-0.9.0 ginza-4.0.5 ja-ginza-4.0.0 sortedcontainers-2.1.0 spacy-2.3.7 thinc-7.4.5

From the menu, select "Runtime → Restart runtime" to restart Google Colaboratory.

Tokenization

To check that the installation worked, let's try tokenizing some text.

[Google Colaboratory]

import spacy

nlp = spacy.load('ja_ginza')
doc = nlp('リモートワークできない職場なのでフリーランスになりました。')

for token in doc:
    print(token)

The output is as follows.

[Output]

リモートワーク
でき
ない
職場
な
の
で
フリーランス
に
なり
まし
た
。

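Tokenization worked as expected.

Starting next time, we will use GiNZA for morphological analysis.

Transformers(16) - Summarization (4): Running the Summarization

October 10, 2021

We use the trained model to generate a summary.

Summarization

The code for summarization is as follows.

Line 5 loads the tokenizer of the pretrained Japanese T5 model.

Line 6 loads the summarization model fine-tuned last time from the output folder.

Line 9 sets the text to be summarized.

[Google Colaboratory]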

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Prepare the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('sonoisa/t5-base-japanese')
model = AutoModelForSeq2SeqLM.from_pretrained('output/')

# Text
text = "ぴちぴちのおねえさんが川でせんたくをしていると、ドンブラコ、ドンブラコと、大きな桃が流れてきました。おねえさんは大きな桃をひろいあげて、家に持ち帰りました。そして、ギャル男とおねえさんが桃を食べようと桃を切ってみると、なんと中から元気 の良いドランゴンの赤ちゃんが飛び出してきました。"

# Convert the text to a tensor (named input_ids to avoid shadowing the built-in input)
input_ids = tokenizer.encode(text, return_tensors='pt', max_length=512, truncation=True)

# Inference
model.eval()
with torch.no_grad():
    summary_ids = model.generate(input_ids)
    print(tokenizer.decode(summary_ids[0]))

The summary is as follows.

[Output]

/usr/local/lib/python3.7/dist-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
<pad><extra_id_0>川でせんたくをしていると、ドンブラコ、ドンブラコと大きな桃が流れてきました。</s>

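Well, it is a summary of sorts, but the subject has been dropped entirely and everything from the middle of the text onward is omitted, which makes for a rather bold summary.

Some kind of improvement is probably needed.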
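One small cosmetic improvement is to drop the marker tokens that appear in the decoded string. The sketch below does this with a regular expression on the string copied from the output above; the pattern only covers the three marker shapes seen there and is an assumption, not the tokenizer's full list of special tokens. With Hugging Face tokenizers, passing skip_special_tokens=True to decode is the usual way to get a similar effect, although that variant was not run for this post.

```python
import re

# Decoded string copied from the output above.
decoded = '<pad><extra_id_0>川でせんたくをしていると、ドンブラコ、ドンブラコと大きな桃が流れてきました。</s>'

# Marker shapes seen in that output (assumed, not exhaustive).
SPECIAL = re.compile(r'<pad>|</s>|<extra_id_\d+>')

summary = SPECIAL.sub('', decoded).strip()
print(summary)
```

This only cleans up the presentation. The missing subject and the omitted second half are properties of the model's output and would need changes to the training data or generation settings.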

Transformers(15) - Summarization (3): Fine-Tuning

October 9, 2021

We fine-tune the model using the training and validation data.

Fine-tuning

Run the following command to fine-tune the model for summarization.

[Google Colaboratory]

%%time

# Run fine-tuning
!python ./transformers/examples/seq2seq/run_summarization.py \
    --model_name_or_path=sonoisa/t5-base-japanese \
    --do_train \
    --do_eval \
    --train_file=train.csv \
    --validation_file=dev.csv \
    --num_train_epochs=10 \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=2 \
    --save_steps=5000 \
    --save_total_limit=3 \
    --output_dir=output/ \
    --predict_with_generate \
    --use_fast_tokenizer=False \
    --logging_steps=100

The output is as follows.

[Output]

10/06/2021 11:45:03 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
10/06/2021 11:45:03 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output/', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=2, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=10.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Oct06_11-45-03_2d9fb230d52a', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=100, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=5000, save_total_limit=3, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', fp16_backend='auto', fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=100, dataloader_num_workers=0, past_index=-1, run_name='output/', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, sortish_sampler=False, predict_with_generate=True)
Using custom data configuration default
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-2359a64c962f9aac/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2)
loading configuration file https://huggingface.co/sonoisa/t5-base-japanese/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fb3cd498b86ea48d9c5290222d7a1b96e0acb4daa6d35a7fcb00168d59c356ee.82b38ae98529e44bc2af82ee82527f78e94856e2fb65b0ec6faecbb49d8ab639
Model config T5Config {
  "_name_or_path": "/content/drive/MyDrive/T5_models/oscar_cc100_wikipedia_ja",
  "architectures": [
    "T5Model"
  ],
  "bos_token_id": 0,
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "eos_token_ids": [
    1
  ],
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "max_length": 512,
  "model_type": "t5",
  "n_positions": 512,
  "num_beams": 4,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "transformers_version": "4.4.2",
  "use_cache": true,
  "vocab_size": 32128
}

loading configuration file https://huggingface.co/sonoisa/t5-base-japanese/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/fb3cd498b86ea48d9c5290222d7a1b96e0acb4daa6d35a7fcb00168d59c356ee.82b38ae98529e44bc2af82ee82527f78e94856e2fb65b0ec6faecbb49d8ab639
Model config T5Config {
  "_name_or_path": "/content/drive/MyDrive/T5_models/oscar_cc100_wikipedia_ja",
  "architectures": [
    "T5Model"
  ],
  "bos_token_id": 0,
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "eos_token_ids": [
    1
  ],
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "max_length": 512,
  "model_type": "t5",
  "n_positions": 512,
  "num_beams": 4,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "transformers_version": "4.4.2",
  "use_cache": true,
  "vocab_size": 32128
}

loading file https://huggingface.co/sonoisa/t5-base-japanese/resolve/main/spiece.model from cache at /root/.cache/huggingface/transformers/a455eff173d5e851553673177dcb6876d5fcd0d39a4bdd7e9a75c50dfb2ab158.82c0f9a9b4ec152c1ca13afe226abd56b618ec9f9d395dc56fd3ad6ca14b4dcc
loading file https://huggingface.co/sonoisa/t5-base-japanese/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/sonoisa/t5-base-japanese/resolve/main/special_tokens_map.json from cache at /root/.cache/huggingface/transformers/559eb952d008bbb60787ea4b89849e5a377f35e163651805b072f2fb1f4b28b9.c94798918c92ded6aeef2d2f0e666d2cc4145eca1aa6e1336fde07f2e13e2f46
loading file https://huggingface.co/sonoisa/t5-base-japanese/resolve/main/tokenizer_config.json from cache at /root/.cache/huggingface/transformers/175dd55a5be280f74e003b3b5efa1de2080efdc795da6dc013dd001d661fcb50.6de37cb3d7dbffde3a51667c5706471d3c6b2a3ff968f108c1429163c5860a5d
loading file https://huggingface.co/sonoisa/t5-base-japanese/resolve/main/tokenizer.json from cache at None
loading weights file https://huggingface.co/sonoisa/t5-base-japanese/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/e6cfe18cc2661a1900624126a7fd3af005a942d9079fd828c4333b0f46617f58.f3773e9948a37df8ff8b19de7bdc64bbcc9fd2225fcdc41ed5a4a2be8a86f383
All model checkpoint weights were used when initializing T5ForConditionalGeneration.

All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at sonoisa/t5-base-japanese.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-2359a64c962f9aac/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2/cache-1b9264874e882057.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-2359a64c962f9aac/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2/cache-31c3668d81aead05.arrow
***** Running training *****
  Num examples = 477
  Num Epochs = 10
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2
  Gradient Accumulation steps = 1
  Total optimization steps = 2390
{'loss': 3.7109, 'learning_rate': 4.7907949790794984e-05, 'epoch': 0.42}
{'loss': 3.3028, 'learning_rate': 4.581589958158996e-05, 'epoch': 0.84}
{'loss': 2.9266, 'learning_rate': 4.372384937238494e-05, 'epoch': 1.26}
{'loss': 2.6457, 'learning_rate': 4.1631799163179915e-05, 'epoch': 1.67}
{'loss': 2.4916, 'learning_rate': 3.95397489539749e-05, 'epoch': 2.09}
{'loss': 2.2776, 'learning_rate': 3.744769874476988e-05, 'epoch': 2.51}
{'loss': 2.3356, 'learning_rate': 3.5355648535564854e-05, 'epoch': 2.93}
{'loss': 2.1054, 'learning_rate': 3.3263598326359835e-05, 'epoch': 3.35}
{'loss': 2.0186, 'learning_rate': 3.117154811715482e-05, 'epoch': 3.77}
{'loss': 1.9114, 'learning_rate': 2.9079497907949792e-05, 'epoch': 4.18}
{'loss': 1.8481, 'learning_rate': 2.6987447698744773e-05, 'epoch': 4.6}
{'loss': 1.7216, 'learning_rate': 2.489539748953975e-05, 'epoch': 5.02}
{'loss': 1.6217, 'learning_rate': 2.280334728033473e-05, 'epoch': 5.44}
{'loss': 1.6558, 'learning_rate': 2.0711297071129708e-05, 'epoch': 5.86}
{'loss': 1.5691, 'learning_rate': 1.8619246861924686e-05, 'epoch': 6.28}
{'loss': 1.5122, 'learning_rate': 1.6527196652719665e-05, 'epoch': 6.69}
{'loss': 1.3568, 'learning_rate': 1.4435146443514645e-05, 'epoch': 7.11}
{'loss': 1.4267, 'learning_rate': 1.2343096234309625e-05, 'epoch': 7.53}
{'loss': 1.3995, 'learning_rate': 1.0251046025104603e-05, 'epoch': 7.95}
{'loss': 1.3224, 'learning_rate': 8.158995815899583e-06, 'epoch': 8.37}
{'loss': 1.3677, 'learning_rate': 6.066945606694561e-06, 'epoch': 8.79}
{'loss': 1.3022, 'learning_rate': 3.97489539748954e-06, 'epoch': 9.21}
{'loss': 1.2621, 'learning_rate': 1.882845188284519e-06, 'epoch': 9.62}
100% 2390/2390 [33:01<00:00,  1.25it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 1981.4145, 'train_samples_per_second': 1.206, 'epoch': 10.0}
100% 2390/2390 [33:01<00:00,  1.21it/s]
Saving model checkpoint to output/
Configuration saved in output/config.json
Model weights saved in output/pytorch_model.bin
tokenizer config file saved in output/tokenizer_config.json
Special tokens file saved in output/special_tokens_map.json
Copy vocab file to output/spiece.model
***** train metrics *****
  epoch                      =      10.0
  init_mem_cpu_alloc_delta   =       1MB
  init_mem_cpu_peaked_delta  =       0MB
  init_mem_gpu_alloc_delta   =     850MB
  init_mem_gpu_peaked_delta  =       0MB
  train_mem_cpu_alloc_delta  =       0MB
  train_mem_cpu_peaked_delta =       0MB
  train_mem_gpu_alloc_delta  =    2575MB
  train_mem_gpu_peaked_delta =    5616MB
  train_runtime              = 1981.4145
  train_samples              =       477
  train_samples_per_second   =     1.206
10/06/2021 12:18:20 - INFO - __main__ -   *** Evaluate ***
***** Running Evaluation *****
  Num examples = 120
  Batch size = 2
/usr/local/lib/python3.7/dist-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
100% 60/60 [01:37<00:00,  1.62s/it]
***** eval metrics *****
  epoch                     =    10.0
  eval_gen_len              = 16.7333
  eval_loss                 =  2.6638
  eval_mem_cpu_alloc_delta  =     2MB
  eval_mem_cpu_peaked_delta =     0MB
  eval_mem_gpu_alloc_delta  =     0MB
  eval_mem_gpu_peaked_delta =  1227MB
  eval_rouge1               = 10.1667
  eval_rouge2               =  6.3889
  eval_rougeL               = 10.2778
  eval_rougeLsum            = 10.5278
  eval_runtime              = 98.9901
  eval_samples              =     120
  eval_samples_per_second   =   1.212
CPU times: user 18 s, sys: 2.62 s, total: 20.7 s
Wall time: 35min 1s

It took about 35 minutes.

The output folder now contains the files that make up the trained model.
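The step count and learning rates in the log above can be reproduced by hand, which is a useful sanity check when changing the batch size or the number of epochs. The sketch below uses only numbers copied from the log; the linear-decay formula is an assumption about how the scheduler behaves, and it happens to agree with the logged values.

```python
import math

# Values copied from the training log above.
num_examples = 477
batch_size = 2
num_epochs = 10
base_lr = 5e-05  # learning_rate in the logged training arguments

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)

def linear_lr(step):
    """Linear decay from base_lr to 0 with no warmup (warmup_steps=0 in the log)."""
    return base_lr * (total_steps - step) / total_steps

# Compare with the learning_rate values logged every 100 steps.
print(linear_lr(100))
print(linear_lr(200))
```

The first printed value matches "Total optimization steps" in the log, and the two learning rates match the first two logged entries up to floating-point rounding.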

Checking the training results

We check the training results with TensorBoard.

[Google Colaboratory]

# Check training progress
%load_ext tensorboard
%tensorboard --logdir runs

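[Output]

The loss has not converged to 0, but it decreases gradually and ends up below 1.4, so the model is clearly learning to some extent.

Next time, we will use the trained model to generate summaries.

Transformers(14) - Summarization (2): Creating the Training and Validation Data

October 8, 2021

We create training and validation data from the news articles downloaded last time.

Creating the training and validation data

The news articles retrieved last time were written to output.tsv.

From this file, we create training data (80%) and validation data (20%).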

[Google Colaboratory]

import pandas as pd

# Build the data frame
# (each row is expected to hold summary lines in fields 0-2 and the body text in field 3)
rows = []
with open('output.tsv', encoding='utf-8') as f:
    for line in f:
        strs = line.rstrip('\n').split('\t')
        rows.append({'text': strs[3], 'summary': strs[0]})
# DataFrame.append was removed in pandas 2.0, so the frame is built once from a list
df = pd.DataFrame(rows, columns=['text', 'summary'])

# Shuffle
df = df.sample(frac=1)

# Save as CSV files
num = len(df)
df[:int(num*0.8)].to_csv('train.csv', sep=',', index=False)
df[int(num*0.8):].to_csv('dev.csv', sep=',', index=False)

If the processing succeeds, the following files are written.

train.csv
Training data
dev.csv
Validation data
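The 80/20 cut can be checked without pandas. The sketch below is a standard-library stand-in for the shuffle-and-slice logic; the helper name split_80_20 and the fixed seed are additions for this illustration. The row count of 597 is inferred from the 477 training and 120 evaluation examples reported in the fine-tuning log of the following post, so treat it as an assumption about this particular run.

```python
import random

def split_80_20(items, seed=42):
    """Shuffle a copy of items and cut it at int(len * 0.8)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed added for repeatability
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

# 597 rows is an inferred total (477 + 120), not a number stated in this post.
train, dev = split_80_20(range(597))
print(len(train), len(dev))
```

Because int() truncates, the training part gets 477 rows and the validation part the remaining 120. Passing random_state to df.sample would make the pandas version repeatable in the same way.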

Installing the required libraries

In preparation for the fine-tuning next time, we install the required libraries.

[Google Colaboratory]

# Install Huggingface Transformers from source
!git clone https://github.com/huggingface/transformers -b v4.4.2
!pip install -e transformers
!pip install fugashi[unidic-lite]
!pip install ipadic

From the menu, select "Runtime → Restart runtime" to restart Google Colaboratory.

Then install the following additional libraries.

[Google Colaboratory]

# Install Huggingface Datasets
!pip install datasets==1.2.1

# Install dependency packages
!pip install rouge_score==0.0.4
!pip install sentencepiece==0.1.91

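That completes the library installation.

Next time, we will fine-tune the model using the prepared training and validation data.

Transformers(13) - Summarization (1): Preparing the Dataset

October 7, 2021

We cover summarization in four parts.

Summarization converts a body text (a long text) into a summary (a short text).

We use the livedoor News three-line summary dataset (ThreeLineSummaryDataset) to walk through the summarization process.

Getting the list of news articles

First, download the CSV file that lists the publication year and month, category, and article ID of each news article.

After downloading, rename the file.

[Google Colaboratory]

!wget https://raw.githubusercontent.com/KodairaTomonori/ThreeLineSummaryDataset/master/data/train.csv
!mv train.csv downloaded_train.csv

[Output]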

--2021-10-06 08:20:12--  https://raw.githubusercontent.com/KodairaTomonori/ThreeLineSummaryDataset/master/data/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3836862 (3.7M) [text/plain]
Saving to: ‘train.csv’

train.csv           100%[===================>]   3.66M  --.-KB/s    in 0.07s   

2021-10-06 08:20:12 (52.4 MB/s) - ‘train.csv’ saved [3836862/3836862]

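Fetching the news articles

Based on the list of news articles (the CSV file), we download the articles themselves.

We use Beautiful Soup, a scraping library, to fetch the articles.

(To avoid putting load on the server, the code fetches one article every 10 seconds.)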

[Google Colaboratory]

from urllib.request import urlopen
from bs4 import BeautifulSoup
from bs4.element import NavigableString
import time

# Index range of the news articles to collect
start_index = 0 # start index
end_index = 1000 # end index (exclusive)

# Fetch the content of one article
def get_content(article_id):
    # Sleep for 10 seconds to avoid putting load on the server
    time.sleep(10)

    # URL
    URL = 'https://news.livedoor.com/article/detail/'+article_id+'/'
    print(URL)
    try:
        with urlopen(URL) as res:
            # Extract the body text
            output1 = ''
            html = res.read().decode('euc_jp', 'ignore')
            soup = BeautifulSoup(html, 'html.parser')
            lineList = soup.select('.articleBody p')
            for line in lineList:
                if len(line.contents) > 0 and type(line.contents[0]) == NavigableString:
                    output1 += line.contents[0].strip()
            if output1 == '': # no article body
                return
            output1 += '\n'

            # Extract the summary
            output0 = ''
            summaryList = soup.select('.summaryList li')
            for summary in summaryList:
                output0 += summary.contents[0].strip()+'\t'
            if output0 == '': # no summary
                return

            # Output
            print(output0+output1)
            with open('output.tsv', mode='a') as f:
                f.write(output0+output1)
    except Exception as e:
        # Report the failure and move on to the next article
        print('Exception:', e)

# Build the list of article IDs
idList = []
# with open('ThreeLineSummaryDataset/data/train.csv', mode='r') as f:
with open('downloaded_train.csv', mode='r') as f:
    lines = f.readlines()
    for line in lines:
        article_id = line.strip().split(',')[3].split('.')[0]
        idList.append(article_id)

# Fetch the content
for i in range(start_index, end_index):
    print('index:', i)
    get_content(idList[i])

[Output]

index: 0
https://news.livedoor.com/article/detail/11097202/
岡むら屋から、期間限定の新メニュー「じゃが肉めし」が登場する	男爵いもなどは味噌ベースで煮こまれ、しっかり味が染み込んでいるとのこと	「岡むら屋特製肉じゃが」と言うべき一品に、仕上がっているという	新橋と秋葉原に店を構える「岡むら屋」。味噌ベースの独自の味つけで牛バラ肉を煮込んだ具材がたっぷり乗ったオリジナル丼『肉めし』で知られる店です。自分も新橋店で『肉めし』を食べたことがありますが、濃いめの味付けでトロットロに煮こまれた牛肉とホカホカご飯は最高の相性！ 味の染み込んだ豆腐も絶品で、腹ぺこボーイズ&ガールズたちの胃袋を満たし続けている丼なのです。昨年開催された『第2回全国丼グランプリ』では、肉丼部門で金賞受賞を獲得しています。そんな「岡むら屋」に期間限定新メニューが登場。その名も『じゃが肉めし』！肉と芋――このまま名作文学のタイトルにもなりそうな、重厚かつ甘美な響き。これって食いしんぼうたちにとっては定番かつ夢の組み合わせではないでしょうか。この組み合わせではすでに「肉じゃが」という殿堂入りメニューがありますが、今回の『じゃが肉めし』は、“岡むら屋特製肉じゃが”と言うべき一品に仕上がっているそうです。北海道産の男爵いもとしらたきを合わせ、味噌ベースの大鍋で他の具材と一気に煮込むことで、しっかり味が染み込んでいるとのこと。商品写真が公開されていますが、ビジュアルを見ただけで、その染みこみ具合いは一目瞭然！丼というステージで繰り広げられる、肉×芋×白米というスーパースターの競演。肉好き必食の1杯と言えそうです。あ、定食も同時販売されるので、気分で選べるのもうれしいですね。詳しい販売期間などは、店舗にお問い合わせください。肉めし「岡むら屋」

index: 1
https://news.livedoor.com/article/detail/11104959/
東京駅周辺の安くて美味しい「蕎麦ランチ」の名店を紹介している	「越後そば 東京店」では、ミニかき揚げ丼セットがおすすめと筆者	その他には、「手打ちそば 石月」「酢重正之 楽」「鎌倉 一茶庵 丸山」など	名店がひしめく「丸の内・日本橋」エリアで＜うまい蕎麦ランチ＞が食べられるお店を厳選してご紹介。ぴあMOOK『うまい蕎麦の店 首都圏版』が選んだ、とっておきの7店がこちら！ミニかき揚げ丼セット（温冷） 660円関東風の濃いめのつゆと、ふのりを練り込んだ喉ごしの良い自家製のそばが相性抜群。新鮮な油を使ったかき揚げは、サクサク。千代田区丸の内1-9-1 東京駅一番街B1F蕎麦も天ぷらも正統派の味お蕎麦と天丼（二八） 1600円喉ごしや歯応えを楽しむ二八蕎麦のほか、蕎麦の風味を存分に味わえる十割蕎麦と天丼のセットも。天ぷらは鮮度抜群で美味。千代田区丸の内1-6-4 丸の内オアゾ5F信州の郷土料理をアレンジ“信州フランス鴨”セリかも南蛮（温） 1680円和食とフレンチが融合したスタイルの人気店。このメニューは、本格信州そば、信州のフランス鴨、契約農家が育てた野菜など、素材にこだわっている。千代田区丸の内2-7-2 JPタワーKITTE5F納豆と卵白でふんわり食感なっとうそば 1100円取り寄せた蕎麦の実を、職人が毎日石臼で挽き、丹念に手打ちする。たっぷりの納豆と卵白がのって、なめらかな口当たり。千代田区丸の内1-5-1 新丸の内ビルディング5Fやわらかな厚切り鴨肉を堪能鴨せいろ 1945円、さつま揚げ 650円合鴨・ネギ・しめじなどを合わせた濃厚なつけ汁に、程良いコシの蕎麦がよく合う。自家製さつま揚げもぜひ味わいたい一品。千代田区丸の内2-4-1 丸の内ビルディング6Fコシの強い独特な田舎蕎麦が美味とり辛そば（温） 980円信州軽井沢の味噌・醤油屋「酢重正之商店」が手掛ける蕎麦屋の人気メニュー。辛口のつけ汁と太い田舎蕎麦がよく合うと評判。千代田区丸の内1-5-1 新丸の内ビルディングB1F風味豊かな生わさびが決め手ざるそば（生わさび） 710円信州の民家を思わせる店内で打つ自家製麺は歯応え抜群。生わさびを自分でおろして味わう。甘めのつゆがわさびにぴったり。千代田区丸の内1-6-1 丸の内センタービル B1Fうまい蕎麦の店　首都圏版日本人ならいつでも蕎麦が食べたい！

(... output omitted ...)

index: 998
https://news.livedoor.com/article/detail/11024867/
厚生労働省は1日、2015年の人口動態統計の年間推計を公表した	2015年の推計出生数は100万8000人で前年と比べ4000人増加した	死亡数は戦後最多の130万2000人となっている	○主な死因1位はガンなどの悪性新生物2015年の出生数は100万8,000人で、前年と比べて4,000人増加した。死産数は2万3,000胎(前年比1,000胎減)となった。死亡数は130万2,000人で戦後最多。主な死因の推計死亡数は、ガンなどの悪性新生物が37万人、心筋梗塞などの心疾患が19万9,000人、肺炎が12万3,000人、脳卒中などの脳血管疾患が11万3,000人だった。婚姻件数は63万5,000組(前年比9,000組減)、離婚件数は22万5,000組(前年比3,000組増)だった。人口動態総覧を日本・韓国・シンガポール・アメリカ・フランス・ドイツ・イタリア・スウェーデン・イギリスの9カ国で比較したところ、1人の女性が一生に産む子供の平均数(合計特殊出生率)が最も多いのはフランス(1.99人/2013年)、次いでスウェーデン(1.89人/2013年)、アメリカ(1.86人/2014年推定値)だった。一方シンガポール(1.19人/2013年)、韓国(1.21人/2014年)、イタリア(1.39人/2013年)の3カ国はいずれも1.4人を下回った。なお日本は1.42人(2014年)だった。

index: 999
https://news.livedoor.com/article/detail/11060391/
16年、年始から下落した中国株式市場を香港メディアが検証している	中国株式市場は個人投資家が主体の市場であり、価格変動が大きいと指摘	変動制限のため導入したサーキットブレーカー制度も一因だと論じている	２０１６年の株式市場は世界的に波乱の幕開けだった。中国株式市場では４日と７日にそれぞれ上海総合指数が７％も下落し、株価の急変を防ぐために導入されたサーキットブレーカーが発動。中国政府は導入されたばかりのサーキットブレーカー制度の一時停止に追い込まれた。中国株の急落は世界の株式市場に波及し、中国発の世界同時株安が起きたが、そもそも上海総合指数はなぜ年初から急落したのだろうか。香港メディアの鳳凰網は１０日、中国の株式市場は閉鎖的であり、外国人投資家の関与が制限されていることを指摘したうえで、投資家に株式を投げ売りさせた要因について考察した。記事はまず、さらに、中国株式市場は個人投資家が主体の市場であることを指摘したうえで、だからこそ価格変動が大きいと指摘。さらに２０１５年夏にも上海総合指数が急落したことを指摘し、「現在、市場で取引をしているのは前回の急落を脳裏に焼き付けつつ、今なお怯える個人投資家たちだ」と伝え、だからこそ中国証券監督管理委員会は値幅変動を制限するためにサーキットブレーカーを導入したのだと論じた。一方で記事は、上海総合指数はサーキットブレーカーの導入によって７％下落すると取引が停止されることになっていたため、株価が下落すると個人投資家たちが取引停止を恐れて我先にと売り始めたと指摘。サーキットブレーカーの存在がむしろ価格変動を大きくしてしまったと論じた。

That completes the download of the news articles.

The news articles have been written to a file named output.tsv.
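For reference, each row that the scraper above writes has a simple shape: every summary line is followed by a tab, then comes the body text and a newline. The sketch below parses one such row; the field values are shortened placeholders, not real article text.

```python
# One record shaped like the rows written to output.tsv.
# The values are placeholders invented for this sketch.
record = 'summary 1\tsummary 2\tsummary 3\tbody text\n'

fields = record.rstrip('\n').split('\t')
summaries, body = fields[:-1], fields[-1]

print(summaries)
print(body)
```

An article with exactly three summary lines therefore ends up with the body in field 3, which is worth checking before relying on fixed field positions.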

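Next time, we will split the downloaded news articles into training and validation data.

Transformers(12) - Question Answering (3): Answering Questions with the Trained Model

October 6, 2021

Last time, we performed word segmentation (wakati-gaki) and fine-tuning.

This time, we use the trained model for question answering.

Question answering with the trained model

Let's try question answering with the trained model.

The from_pretrained function on line 8 loads the trained model.

Line 11 sets the context, and line 12 specifies the question.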

[Google Colaboratory]

import torch
from transformers import BertJapaneseTokenizer, AutoModelForQuestionAnswering
import MeCab
wakati = MeCab.Tagger("-Owakati")

# Prepare the tokenizer and model
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
model = AutoModelForQuestionAnswering.from_pretrained('output/')

# Context and question
context = wakati.parse('土曜日に友達と表参道に遊びに行きました。').strip()
question = wakati.parse('どこに遊びに行ったの？').strip()
print(wakati.parse('土曜日に友達と表参道に遊びに行きました。'), wakati.parse('どこに遊びに行ったの？'))

# Convert the text to tensors
inputs = tokenizer.encode_plus(question, context, return_tensors='pt')

# Get the list of input token IDs
input_ids = inputs['input_ids'].tolist()[0]

# Inference
model.eval()
with torch.no_grad():
    output = model(**inputs)
    answer_start = torch.argmax(output.start_logits)  
    answer_end = torch.argmax(output.end_logits) + 1 
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(answer)

The answer is as follows.

[Output]

土曜 日 に 友達 と 表 参道 に 遊び に 行き まし た 。 
 どこ に 遊び に 行っ た の ？ 

表 参道

The answer is 表参道 (Omotesando), which is exactly right.
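The inference step picks the answer span by taking the position with the highest start score and the position with the highest end score. The sketch below replays that logic on plain Python lists. The scores are made up for illustration, and the token list is the word-segmented context only, whereas the real input also contains the question and special tokens.

```python
# Tokens of the word-segmented context from the output above.
tokens = ['土曜', '日', 'に', '友達', 'と', '表', '参道', 'に', '遊び', 'に', '行き', 'まし', 'た', '。']

# Made-up scores for illustration; a real model returns one start logit
# and one end logit per input token.
start_logits = [0.1, 0.0, 0.0, 0.2, 0.0, 5.0, 0.3, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0]
end_logits = [0.0, 0.2, 0.0, 0.1, 0.0, 0.4, 6.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

answer_start = argmax(start_logits)
answer_end = argmax(end_logits) + 1  # +1 because Python slices exclude the end index
answer = ' '.join(tokens[answer_start:answer_end])

print(answer)                   # tokens joined with spaces, as in the output above
print(answer.replace(' ', ''))  # spaces removed for display
```

Taking the two argmax values independently is simple but does not guarantee that the end comes after the start, so a more careful implementation might score start and end pairs jointly.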

Next, we keep the context the same and change the question.

[Google Colaboratory]

import torch
from transformers import BertJapaneseTokenizer, AutoModelForQuestionAnswering
import MeCab
wakati = MeCab.Tagger("-Owakati")

# Prepare the tokenizer and model
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
model = AutoModelForQuestionAnswering.from_pretrained('output/')

# Context and question
context = wakati.parse('土曜日に友達と表参道に遊びに行きました。').strip()
question = wakati.parse('いつ遊びに行ったの？').strip()
print(wakati.parse('土曜日に友達と表参道に遊びに行きました。'), wakati.parse('いつ遊びに行ったの？'))

# Convert the text to tensors
inputs = tokenizer.encode_plus(question, context, return_tensors='pt')

# Get the list of input token IDs
input_ids = inputs['input_ids'].tolist()[0]

# Inference
model.eval()
with torch.no_grad():
    output = model(**inputs)
    answer_start = torch.argmax(output.start_logits)  
    answer_end = torch.argmax(output.end_logits) + 1 
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(answer)

The answer is as follows.

[Output]

土曜 日 に 友達 と 表 参道 に 遊び に 行き まし た 。 
 いつ 遊び に 行っ た の ？ 

土曜 日

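The answer is 土曜日 (Saturday), which is again exactly right.

The fine-tuning last time took more than two hours, which is a long time, but once training is complete the model gives accurate answers to questions, so it felt quite practical.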