Transformers(16) - Summarization ④: Running the Summarizer

October 10, 2021

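In this post we use the trained model to summarize a piece of text.

Summarization

The code for running the summarization is shown below.

Line 5 loads the tokenizer of the pretrained Japanese T5 model.

Line 6 loads the summarization model that was fine-tuned in the previous post from the output folder.

Line 9 sets the text to be summarized.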

[Google Colaboratory]

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Prepare the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained('sonoisa/t5-base-japanese')
model = AutoModelForSeq2SeqLM.from_pretrained('output/')

# Text to summarize
text = "ぴちぴちのおねえさんが川でせんたくをしていると、ドンブラコ、ドンブラコと、大きな桃が流れてきました。おねえさんは大きな桃をひろいあげて、家に持ち帰りました。そして、ギャル男とおねえさんが桃を食べようと桃を切ってみると、なんと中から元気 の良いドランゴンの赤ちゃんが飛び出してきました。"

# Convert the text to a tensor of token IDs
input_ids = tokenizer.encode(text, return_tensors='pt', max_length=512, truncation=True)

# Inference
model.eval()
with torch.no_grad():
    summary_ids = model.generate(input_ids)
    print(tokenizer.decode(summary_ids[0]))

The summarization result is as follows.

[Output]

/usr/local/lib/python3.7/dist-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
<pad><extra_id_0>川でせんたくをしていると、ドンブラコ、ドンブラコと大きな桃が流れてきました。</s>

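... It is a summary of sorts, but every subject has been dropped and everything from the middle of the text onward is omitted, which makes for a rather bold summary.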
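The printed summary also still carries the markers `<pad>`, `<extra_id_0>`, and `</s>`. In T5, `<pad>` is the token the decoder starts from, `</s>` marks the end of the sequence, and `<extra_id_0>` is one of the sentinel tokens used in pretraining. Within transformers, the usual way to drop them is to pass `skip_special_tokens=True` to `tokenizer.decode`. As a rough illustration of what that cleanup amounts to, here is a pure-Python sketch; the helper name `strip_special_tokens` and the regular expression are assumptions made for this example, not part of the library.

```python
import re

# Markers seen in the raw output above: <pad>, </s>, and T5 sentinels
# such as <extra_id_0>. This pattern is an assumption for illustration.
_SPECIAL_TOKEN_PATTERN = re.compile(r"<pad>|</s>|<extra_id_\d+>")


def strip_special_tokens(decoded: str) -> str:
    """Remove T5-style special-token markers from a decoded string (hypothetical helper)."""
    return _SPECIAL_TOKEN_PATTERN.sub("", decoded).strip()


raw = "<pad><extra_id_0>川でせんたくをしていると、ドンブラコ、ドンブラコと大きな桃が流れてきました。</s>"
print(strip_special_tokens(raw))
```

This only tidies the text that is displayed; it does not change what the model generates.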

Some kind of improvement may be needed.
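One thing that might be worth trying is to set the decoding options explicitly instead of calling `generate` with its defaults. As far as I know, when neither the call nor the model configuration specifies `max_length`, `generate` has defaulted to a fairly short output, which could be one reason the summary stops so early. The sketch below wraps the steps in a hypothetical helper, `summarize`; the helper name and every default value in it are assumptions chosen for illustration, not tuned settings, and whether they actually help for this model has not been verified.

```python
def summarize(model, tokenizer, text, max_input_length=512, **generate_kwargs):
    """Generate a summary with explicit decoding settings (hypothetical helper)."""
    # Assumed starting values for illustration only; they are not tuned.
    settings = {
        "max_length": 64,           # upper bound on generated tokens
        "min_length": 10,           # discourage extremely short summaries
        "num_beams": 4,             # beam search instead of greedy decoding
        "no_repeat_ngram_size": 3,  # avoid repeating the same 3-gram
        "early_stopping": True,     # stop once all beams have finished
    }
    # Caller-supplied options override the assumed defaults.
    settings.update(generate_kwargs)

    input_ids = tokenizer.encode(
        text, return_tensors="pt", max_length=max_input_length, truncation=True
    )
    model.eval()
    # generate() disables gradient tracking internally,
    # so no torch.no_grad() block is used here.
    summary_ids = model.generate(input_ids, **settings)
    # Drop markers such as <pad> and </s> from the decoded text.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

With `model`, `tokenizer`, and `text` prepared as in the code earlier in this post, `print(summarize(model, tokenizer, text, max_length=128))` would run it with a longer output limit.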