Transformers (10) - Question Answering, Part 1: Preparing the Dataset and Installing the Libraries

Question answering is covered in three parts.

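Question answering is the task of taking a context (a self-contained passage of text) and a question, and extracting the answer that is contained in the context.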
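To make this definition concrete, here is a minimal sketch of a single extractive question answering example. The sample text, the field names (`context`, `question`, `answer_text`, `answer_start`) and the helper `extract_answer` are all made up for illustration and are not taken from the dataset used in this article.

```python
# A hypothetical extractive QA example: the answer is a span of the context,
# identified by its starting character offset and its text.
example = {
    "context": "The car stopped at the red light near the station.",
    "question": "Where did the car stop?",
    "answer_text": "at the red light",
}
# Character offset at which the answer begins inside the context.
example["answer_start"] = example["context"].find(example["answer_text"])


def extract_answer(context: str, start: int, length: int) -> str:
    """Return the span of `context` that starts at `start` and spans `length` characters."""
    return context[start:start + length]


answer = extract_answer(
    example["context"], example["answer_start"], len(example["answer_text"])
)
print(answer)  # at the red light
```

A model trained for this task predicts where the span starts and ends, so the answer is always a substring of the context rather than freely generated text.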

This part covers preparing the dataset and installing the required libraries.

Preparing the dataset

The dataset used here is the Driving Domain QA dataset.

Open the URL below and download the DDQA-1.0.tar.gz file from the download link on that page.

Driving Domain QA datasets - https://nlp.ist.i.kyoto-u.ac.jp/?Driving+domain+QA+datasets


Next, upload the downloaded file to Google Colaboratory and extract it with the command below.

[Google Colaboratory]

!tar xzvf DDQA-1.0.tar.gz

[Output]

DDQA-1.0/
DDQA-1.0/RC-QA/
DDQA-1.0/PAS-QA-NOM/
DDQA-1.0/PAS-QA-ACC/
DDQA-1.0/PAS-QA-DAT/
DDQA-1.0/README_en.txt
DDQA-1.0/README_ja.txt
DDQA-1.0/PAS-QA-DAT/DDQA-1.0_PAS-QA-DAT_train.json
DDQA-1.0/PAS-QA-DAT/DDQA-1.0_PAS-QA-DAT_dev.json
DDQA-1.0/PAS-QA-DAT/DDQA-1.0_PAS-QA-DAT_test.json
DDQA-1.0/PAS-QA-ACC/DDQA-1.0_PAS-QA-ACC_train.json
DDQA-1.0/PAS-QA-ACC/DDQA-1.0_PAS-QA-ACC_dev.json
DDQA-1.0/PAS-QA-ACC/DDQA-1.0_PAS-QA-ACC_test.json
DDQA-1.0/PAS-QA-NOM/DDQA-1.0_PAS-QA-NOM_dev.json
DDQA-1.0/PAS-QA-NOM/DDQA-1.0_PAS-QA-NOM_test.json
DDQA-1.0/PAS-QA-NOM/DDQA-1.0_PAS-QA-NOM_train.json
DDQA-1.0/RC-QA/DDQA-1.0_RC-QA_dev.json
DDQA-1.0/RC-QA/DDQA-1.0_RC-QA_test.json
DDQA-1.0/RC-QA/DDQA-1.0_RC-QA_train.json

The dataset is now ready.
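Before moving on, it can help to look inside one of the extracted files. The sketch below uses only the standard `json` module. Because this article does not describe the file schema, it assumes nothing beyond the file being valid JSON and reports only the top-level structure. The helper name `describe_json` is hypothetical; the path in the usage comment is one of the files listed in the extraction output above.

```python
import json
from pathlib import Path


def describe_json(path):
    """Load a JSON file and return (type name, sorted top-level keys or list length)."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        return type(data).__name__, sorted(data.keys())
    if isinstance(data, list):
        return type(data).__name__, len(data)
    # Scalars (string, number, bool, null) have no further structure to report.
    return type(data).__name__, None


# Usage in Colab, after extracting the archive:
# print(describe_json("DDQA-1.0/RC-QA/DDQA-1.0_RC-QA_train.json"))
```

README_en.txt and README_ja.txt in the archive are the place to check for the authoritative description of the format.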

Installing Hugging Face Transformers

Hugging Face Transformers is installed from source.

[Google Colaboratory]

# Install Hugging Face Transformers from source
!git clone https://github.com/huggingface/transformers -b v4.4.2
!pip install -e transformers
!pip install fugashi[unidic-lite]
!pip install ipadic

If the output looks like the following, Hugging Face Transformers was installed successfully.

[Output]

Cloning into 'transformers'...
remote: Enumerating objects: 85569, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 85569 (delta 8), reused 17 (delta 3), pack-reused 85541
Receiving objects: 100% (85569/85569), 68.44 MiB | 24.02 MiB/s, done.
Resolving deltas: 100% (61496/61496), done.
Note: checking out '9f43a425fe89cfc0e9b9aa7abd7dd44bcaccd79a'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

git checkout -b <new-branch-name>

Obtaining file:///content/transformers
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
Collecting tokenizers<0.11,>=0.10.1
Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
|████████████████████████████████| 3.3 MB 5.1 MB/s
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from transformers==4.4.2) (3.0.12)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from transformers==4.4.2) (2.23.0)
Collecting sacremoses
Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
|████████████████████████████████| 895 kB 50.1 MB/s
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from transformers==4.4.2) (1.19.5)
Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from transformers==4.4.2) (21.0)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers==4.4.2) (2019.12.20)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.7/dist-packages (from transformers==4.4.2) (4.62.3)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from transformers==4.4.2) (4.8.1)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->transformers==4.4.2) (3.7.4.3)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->transformers==4.4.2) (3.5.0)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging->transformers==4.4.2) (2.4.7)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->transformers==4.4.2) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->transformers==4.4.2) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->transformers==4.4.2) (2021.5.30)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->transformers==4.4.2) (1.24.3)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers==4.4.2) (7.1.2)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers==4.4.2) (1.15.0)
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers==4.4.2) (1.0.1)
Installing collected packages: tokenizers, sacremoses, transformers
Running setup.py develop for transformers
Successfully installed sacremoses-0.0.46 tokenizers-0.10.3 transformers-4.4.2
Collecting fugashi[unidic-lite]
Downloading fugashi-1.1.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (490 kB)
|████████████████████████████████| 490 kB 5.2 MB/s
Collecting unidic-lite
Downloading unidic-lite-1.0.8.tar.gz (47.4 MB)
|████████████████████████████████| 47.4 MB 92 kB/s
Building wheels for collected packages: unidic-lite
Building wheel for unidic-lite (setup.py) ... done
Created wheel for unidic-lite: filename=unidic_lite-1.0.8-py3-none-any.whl size=47658836 sha256=b5301b21eb0c5ec36e22cf4f0e7f6dfb63d8c968cf59353e0ba252853e96d055
Stored in directory: /root/.cache/pip/wheels/de/69/b1/112140b599f2b13f609d485a99e357ba68df194d2079c5b1a2
Successfully built unidic-lite
Installing collected packages: unidic-lite, fugashi
Successfully installed fugashi-1.1.1 unidic-lite-1.0.8
Collecting ipadic
Downloading ipadic-1.0.0.tar.gz (13.4 MB)
|████████████████████████████████| 13.4 MB 210 kB/s
Building wheels for collected packages: ipadic
Building wheel for ipadic (setup.py) ... done
Created wheel for ipadic: filename=ipadic-1.0.0-py3-none-any.whl size=13556723 sha256=07367203dbaaef7bd0d3d9a3c355d8927072637e6e2398a3e2ab710c877d6e7a
Stored in directory: /root/.cache/pip/wheels/33/8b/99/cf0d27191876637cd3639a560f93aa982d7855ce826c94348b
Successfully built ipadic
Installing collected packages: ipadic
Successfully installed ipadic-1.0.0

At this point, restart the runtime.

From the menu, select Runtime → Restart runtime.
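After the restart, one way to confirm that the packages are visible to the new runtime is to query their installed versions. The following is a minimal sketch, not part of the original procedure. The helper `installed_version` is hypothetical. It relies on the standard `importlib.metadata` module, which requires Python 3.8 or later; the output above is from Python 3.7, where the `importlib_metadata` backport offers the same `version` function.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Packages installed so far in this article.
for name in ("transformers", "tokenizers", "fugashi", "ipadic"):
    print(name, installed_version(name))
```

If `transformers` reports 4.4.2, matching the install log above, the editable install was picked up correctly.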

Installing Hugging Face Datasets

Next, install Hugging Face Datasets.

[Google Colaboratory]

# Install Hugging Face Datasets
!pip install datasets==1.2.1

[Output]

Collecting datasets==1.2.1
Downloading datasets-1.2.1-py3-none-any.whl (159 kB)
|████████████████████████████████| 159 kB 5.4 MB/s
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from datasets==1.2.1) (1.19.5)
Requirement already satisfied: pyarrow>=0.17.1 in /usr/local/lib/python3.7/dist-packages (from datasets==1.2.1) (3.0.0)
Collecting tqdm<4.50.0,>=4.27
Downloading tqdm-4.49.0-py2.py3-none-any.whl (69 kB)
|████████████████████████████████| 69 kB 6.6 MB/s
Requirement already satisfied: multiprocess in /usr/local/lib/python3.7/dist-packages (from datasets==1.2.1) (0.70.12.2)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from datasets==1.2.1) (4.8.1)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from datasets==1.2.1) (1.1.5)
Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.7/dist-packages (from datasets==1.2.1) (2.23.0)
Collecting xxhash
Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
|████████████████████████████████| 243 kB 19.6 MB/s
Requirement already satisfied: dill in /usr/local/lib/python3.7/dist-packages (from datasets==1.2.1) (0.3.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests>=2.19.0->datasets==1.2.1) (1.24.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests>=2.19.0->datasets==1.2.1) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests>=2.19.0->datasets==1.2.1) (2021.5.30)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests>=2.19.0->datasets==1.2.1) (2.10)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->datasets==1.2.1) (3.7.4.3)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->datasets==1.2.1) (3.5.0)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->datasets==1.2.1) (2.8.2)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->datasets==1.2.1) (2018.9)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->datasets==1.2.1) (1.15.0)
Installing collected packages: xxhash, tqdm, datasets
Attempting uninstall: tqdm
Found existing installation: tqdm 4.62.3
Uninstalling tqdm-4.62.3:
Successfully uninstalled tqdm-4.62.3
Successfully installed datasets-1.2.1 tqdm-4.49.0 xxhash-2.0.2

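Installing MeCab

Finally, install MeCab.

MeCab is a morphological analyzer for Japanese. It can be used for simple text analysis and for word segmentation (wakachi-gaki), that is, splitting text into space-separated words.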

[Google Colaboratory]

# Install MeCab
!pip install mecab-python3

[Output]

Collecting mecab-python3
Downloading mecab_python3-1.0.4-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (488 kB)
|████████████████████████████████| 488 kB 5.3 MB/s
Installing collected packages: mecab-python3
Successfully installed mecab-python3-1.0.4
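As a quick check that MeCab works, the sketch below segments a short sentence into words. It is an illustration, not part of the original procedure. The function name `wakati` and the sample sentence are made up, and pairing MeCab with the `ipadic` dictionary installed earlier (through its `MECAB_ARGS` constant) is an assumption about how the pieces fit together. The function returns `None` instead of raising when MeCab or the dictionary is unavailable.

```python
def wakati(text):
    """Split Japanese text into words with MeCab; return None if MeCab is unusable."""
    try:
        import MeCab
        import ipadic  # installed earlier; MECAB_ARGS points MeCab at its dictionary

        # -Owakati asks MeCab for space-separated output.
        tagger = MeCab.Tagger(ipadic.MECAB_ARGS + " -Owakati")
    except (ImportError, RuntimeError):
        return None
    return tagger.parse(text).split()


print(wakati("運転中は前方に注意します。"))
```

The exact word boundaries depend on the dictionary, so the output may differ between ipadic and unidic-lite.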

This completes the installation of the required libraries.

The next part covers word segmentation and fine-tuning.