GiNZA(10) - Training on a Named Entity Recognition Dataset Built from Wikipedia

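A Japanese named entity recognition dataset built from Wikipedia is available as training data for named entity recognition.

In this article, we use that dataset to train a model and extract named entities.

Downloading the dataset

First, download ner.json, the Japanese named entity recognition dataset built from Wikipedia, from the site below.

Japanese named entity recognition dataset built from Wikipedia - https://github.com/stockmarkteam/ner-wikipedia-dataset

Upload the downloaded ner.json to Google Colaboratory.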
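
Before training, it can help to see what a single record looks like. The training code in this article reads each record's `text` and `entities` fields, and each entity's `span` and `type`. The following is a minimal sketch of that shape using a hand-built record: the sentence and spans are taken from one of the warning messages in the training output further down, the Japanese type names are inferred from the label mapping used in the training code, and the helper name `entity_surfaces` is hypothetical.

```python
# A hand-built record in the shape the training code expects.
# Sentence and spans come from a warning in the training output;
# the type names are inferred from the article's label mapping.
record = {
    'text': '選抜高等学校野球大会における鳥取県勢の成績について記す。',
    'entities': [
        {'span': [0, 10], 'type': 'イベント名'},
        {'span': [14, 17], 'type': '地名'},
    ],
}


def entity_surfaces(record):
    """Return (surface string, type) pairs by slicing the text with each span."""
    text = record['text']
    pairs = []
    for entity in record['entities']:
        start, end = entity['span']  # character offsets, end exclusive
        pairs.append((text[start:end], entity['type']))
    return pairs


surfaces = entity_surfaces(record)
for surface, entity_type in surfaces:
    print(surface, entity_type)
```

Slicing the text with each span is a quick way to confirm that the offsets are character positions with an exclusive end, which is the convention the training code passes on to spaCy.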

Training the named entity recognition model

The source code for training the named entity recognition model is shown below.

The downloaded ner.json is read with json.load.

[Google Colaboratory]

# Prepare the training data
import json
labels = {
    '人名': 'Person',
    '法人名': 'Juridical_Person',
    '政治的組織名': 'Political_Organization',
    'その他の組織名': 'Organization_Other',
    '地名': 'Location',
    '施設名': 'Facility',
    '製品名': 'Product',
    'イベント名': 'Event',
}

with open('ner.json', 'r', encoding='utf-8') as f:
    json_data = json.load(f)

# Convert each record to the (text, {'entities': [(start, end, label)]}) format
train_data = []
for data in json_data:
    text = data['text']
    entities = data['entities']
    value = []
    for entity in entities:
        span = entity['span']
        label = labels[entity['type']]
        value.append((span[0], span[1], label))
    train_data.append((text, {'entities': value}))

# Train the named entity recognition model
# (train_ner is not defined in this listing; it is assumed to be defined in an earlier cell)
nlp = train_ner(train_data, 50)

# Save the named entity recognition model
nlp.to_disk('ner_model')

The output is as follows.

[Output]

/usr/local/lib/python3.7/dist-packages/spacy/language.py:639: UserWarning: [W033] Training a new parser or NER using a model with an empty lexeme normalization table. This may degrade the performance to some degree. If this is intentional or this language doesn't have a normalization table, please ignore this warning.
**kwargs
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "1942年初頭、アメリカ海兵隊ジェームズ・・バリンジャー大佐は、太平洋艦隊司令長官ニミッツ提督とアメ..." with entities "[(8, 15, 'Political_Organization'), (15, 28, 'Pers...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "天正16年には、中津の築城工事や特権放棄などの命令に従わずにいた大村城主・山田常陸介を中津城に呼び出..." with entities "[(8, 10, 'Location'), (32, 35, 'Facility'), (37, 4...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "また、インド洋地域を重視し、独伊の作戦と呼応し、機を見てインド・西亜打通作戦を完遂し、戦争終末促進に..." with entities "[(3, 7, 'Location'), (14, 15, 'Location'), (15, 16...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "なお日本国外ではWhatsApp・Facebook Messenger・Skype・テンセントQQ・..." with entities "[(2, 6, 'Location'), (8, 16, 'Product'), (17, 35, ...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "しかし、同年7月、資金不足に陥ったレラティビティ・メディアは、本作の全米公開日を2016年2月26日..." with entities "[(17, 29, 'Juridical_Person'), (35, 36, 'Location'...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "この後、ロシア社会主義連邦ソビエト共和国とトルコ大国民議会とのカルス条約でトルコ共和国領になる。" with entities "[(4, 20, 'Location'), (21, 29, 'Political_Organiza...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "元々この地は亀穴という名で、江戸時代は岡崎藩領、天領及び林瑞寺除地であった。" with entities "[(6, 8, 'Location'), (19, 22, 'Location'), (28, 31...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "速水けんたろう、茂森あゆみは「第50回NHK紅白歌合戦」にも出場し「だんご3兄弟」を歌った。" with entities "[(0, 6, 'Person'), (8, 13, 'Person'), (15, 27, 'Ev...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "渡米の際に品質管理の専門家W・エドワーズ・デミングと会い、佐島が両者を取り結んだようである。" with entities "[(1, 2, 'Location'), (13, 25, 'Person'), (29, 31, ...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "1996年4月、関東学院大学軽音楽部内で結成された。" with entities "[(8, 14, 'Juridical_Person'), (14, 18, 'Organizati...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "ビートルズのリンゴ・スターもカントリー好きを公言し、バック・オーデエンスの「アクト・ナチュラリー」を..." with entities "[(0, 5, 'Organization_Other'), (6, 13, 'Person'), ...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "1942年初め、第7戦車軍団長に任命され、6月末、第7戦車軍団は第5戦車軍に配属された。" with entities "[(8, 13, 'Political_Organization'), (25, 31, 'Poli...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "ゲスト声優は新日本プロレスの棚橋弘至と真壁刀義、お笑い芸人の小島よしおの3人。" with entities "[(6, 13, 'Organization_Other'), (14, 18, 'Person')...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "それに追い討ちをかけるように1938年7月5日に発生した阪神大水害で神戸市内は壊滅的な被害を受けてし..." with entities "[(28, 33, 'Event'), (34, 37, 'Location')]". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "1986年、マグロウヒル社は、当時全米最大の教育教材出版社であった競合会社のThe Economy ..." with entities "[(6, 13, 'Juridical_Person'), (18, 19, 'Location')...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "1775年3月、土地投機家でノースカロライナの判事でもあったリチャード・ヘンダーソンがシカモア・ショ..." with entities "[(14, 22, 'Location'), (30, 42, 'Person'), (43, 53...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "コロラドカレッジは、USニューズ&ワールド・レポートの大学ランキングでは、全米のリベラル・アーツ・カ..." with entities "[(0, 8, 'Juridical_Person'), (10, 26, 'Juridical_P...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "これは1999年5月のコソボ紛争でNATO軍の一員として武力制裁に参加していたアメリカ軍機がユーゴス..." with entities "[(11, 16, 'Event'), (17, 22, 'Political_Organizati...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "ダンスアーティストケント・モリのプロデュースで開催された伊勢志摩ダンスサミットを記録したPVの撮影に..." with entities "[(9, 15, 'Person'), (28, 39, 'Event')]". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "社長はアメリカ政府の支援を得ようと国務省に働きかけたが、スタンダード・オイル寄りで対ソ交渉ではコーカ..." with entities "[(3, 9, 'Political_Organization'), (17, 20, 'Polit...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "2017年1月には訪日クルーズ客の増加を見込み、国土交通省が官民連携により施設整備を行う「国際クルー..." with entities "[(10, 11, 'Location'), (24, 29, 'Political_Organiz...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "孝謙上皇と淳仁天皇の時代の天平宝字3年11月16日、藤原氏と縁が深く仲麻呂も近江国守であったことから..." with entities "[(0, 2, 'Person'), (5, 7, 'Person'), (26, 28, 'Per...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "列強諸国の介入もあり、清朝政府代表の唐紹儀と各省代表の伍廷芳は上海イギリス租界で交渉を開始、その結果..." with entities "[(11, 15, 'Political_Organization'), (18, 21, 'Per...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "また、在福の民放テレビ局のうち、九州朝日放送・RKB毎日放送・福岡放送・TVQ九州放送は、法律上の放..." with entities "[(16, 22, 'Juridical_Person'), (23, 30, 'Juridical...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "小林かおりは、日本の女優。" with entities "[(0, 4, 'Person'), (7, 9, 'Location')]". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "最終階級はドイツ国防軍装甲兵大将、ドイツ連邦軍中将。" with entities "[(5, 11, 'Political_Organization'), (17, 23, 'Poli...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "1927年の五所平之助監督の同名の映画とは無関係。" with entities "[(6, 10, 'Person')]". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "四ツ小屋校地は後身の秋田大学学芸学部に引き継がれ、1990年3月まで四ツ小屋農場として使用された。" with entities "[(0, 5, 'Facility'), (10, 14, 'Juridical_Person'),...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "神田高等女学校を中退し、継母らの勧めで1906年に15歳で高木陳平と結婚し渡米、11月末ニューヨーク..." with entities "[(0, 7, 'Facility'), (29, 33, 'Person'), (38, 39, ...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "復活号の機体番号は1号機ではなく1007号機とされたが、これは1950年9月1日に撃墜され戦死した韓..." with entities "[(49, 53, 'Political_Organization'), (59, 63, 'Per...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "全国高等学校野球選手権大会における静岡県勢の成績について記す。" with entities "[(0, 13, 'Event'), (17, 20, 'Location')]". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "父は南種子町長を2007年まで4期16年務めた柳田長谷男。" with entities "[(2, 5, 'Location'), (23, 28, 'Person')]". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "一方で新渡戸稲造は在米中の1932年8月20日、CBSラジオでスティムソンドクトリンに反論する形で「..." with entities "[(3, 8, 'Person'), (10, 11, 'Location'), (24, 30, ...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "上流部にはしのがやと公園、下流には水車公園や緑道には小川の流れる公園、桜並木が整備されている。" with entities "[(5, 12, 'Facility'), (17, 21, 'Facility')]". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "1947年日農内の社会、共産両党の路線対立が激化したため、副委員長であった野溝らは「日農主流体制確立..." with entities "[(5, 7, 'Juridical_Person'), (9, 16, 'Political_Or...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "ただし、李登輝は政界入り直後に日本の政治家と会談し、副総統時代にも査証発給の問題なしに訪日を実現して..." with entities "[(4, 7, 'Person'), (15, 17, 'Location'), (44, 45, ...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "寛永4年2月23日に、幕府老中土井利勝・酒井忠勝・井上正就・永井尚政が、伊達政宗に仙台屋敷構を許可し..." with entities "[(11, 13, 'Political_Organization'), (15, 19, 'Per...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "蔡長庚は在日華僑の実業家。" with entities "[(0, 3, 'Person'), (5, 6, 'Location')]". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "2020年9月24日、ウィンミン大統領がソー・ハン外務次官を次期駐日大使に指名すると共に、ミン・トゥ..." with entities "[(9, 10, 'Location'), (11, 16, 'Person'), (20, 25,...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "その他、東京高等工芸学校講師、立教大学講師、武蔵高等工業学校講師、聖徳学園保姆養成所所長、日米文化り..." with entities "[(4, 12, 'Facility'), (15, 19, 'Juridical_Person')...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "2018年1月4日にはに選出され、王貞治に次ぐ2人目の日台野球殿堂入りとなった。" with entities "[(17, 20, 'Person'), (27, 28, 'Location'), (28, 29...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "選抜高等学校野球大会における鳥取県勢の成績について記す。" with entities "[(0, 10, 'Event'), (14, 17, 'Location')]". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "また、草戸稲荷神社前には遊女町を造ったといわれる。" with entities "[(3, 9, 'Facility')]". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "1960年代〜1990年代の日米間で、通産省および国産メーカーと、米国の間で一連の対立が発生した。" with entities "[(14, 15, 'Location'), (15, 16, 'Location'), (19, ...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "昭和11年4月8日、小山晃吉と美秀子の長男として、兵庫県芦屋で生誕。" with entities "[(10, 14, 'Person'), (15, 17, 'Person'), (25, 30, ...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "第二次世界大戦終結後の1946年5月、公職追放により主婦之友社、日本出版配給の両社を退社、主婦之友社..." with entities "[(0, 8, 'Event'), (26, 31, 'Juridical_Person'), (3...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "1947年の春までに、バチカンは米英両国にウスタシャの戦争犯罪人をユーゴスラビアに引き渡さない様にと..." with entities "[(11, 15, 'Location'), (16, 17, 'Location'), (17, ...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "佐藤俊久処長の下にあって山崎桂一技佐と共に大哈爾浜都市計画の立案を行なうと共に、水道科長として同市の..." with entities "[(0, 4, 'Person'), (12, 16, 'Person'), (21, 29, 'P...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "そのまま日本体育大学に進学し、その際、2012年ロンドンオリンピックアーチェリー競技、女子団体にて、..." with entities "[(4, 10, 'Juridical_Person'), (19, 34, 'Event'), (...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "NGT48加入前は、新潟県内在住の中高生がモデルとして登場するウェブサイト「制服ステーションMAP」..." with entities "[(0, 5, 'Organization_Other'), (10, 13, 'Location'...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "同年8月15日からは、母体番組の「ちちんぷいぷい」で、水曜日パネラーの未知やすえと共に「ぷいぷいデパ..." with entities "[(17, 24, 'Product'), (35, 40, 'Person'), (44, 53,...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "一方、鳥羽・伏見の戦いで旧幕府軍の総督であった大多喜藩主の松平正質は徳川慶喜から江戸城登城を禁止され..." with entities "[(3, 11, 'Event'), (12, 16, 'Political_Organizatio...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "担当試合でレッドカードを出す割合が多いといわれており、2018年6月16日に行われたJ2第19節・松..." with entities "[(49, 55, 'Organization_Other'), (57, 64, 'Organiz...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
/usr/local/lib/python3.7/dist-packages/spacy/language.py:482: UserWarning: [W030] Some entities could not be aligned in the text "統治機構の近代化により王朝を立て直すことに失敗、加えて義和団の乱後をめぐる清朝の醜態も加わり、191..." with entities "[(27, 32, 'Event'), (37, 39, 'Political_Organizati...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities (with BILUO tag '-') will be ignored during training.
gold = GoldParse(doc, **gold)
iteration0: 34617.39430559
iteration1: 26710.82919405
iteration2: 22441.80742917
iteration3: 20216.27078154
iteration4: 17570.75377652
iteration5: 15934.52385993
iteration6: 14505.29976718
iteration7: 13454.64992241
iteration8: 12410.60749704
iteration9: 11966.42870759
iteration10: 10877.21895532
iteration11: 10146.99085701
iteration12: 9996.58426121
iteration13: 9365.26938313
iteration14: 8816.20219649
iteration15: 8602.49586675
iteration16: 8006.44870532
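
The W030 warnings in the output say that some entity spans could not be aligned with token boundaries, and that such entities are ignored during training. The following is a minimal sketch of what "aligned" means, in plain Python. The sentence and spans are taken from one of the warnings, but the tokenization is a hypothetical one chosen for illustration (the real tokenizer may split the text differently), and the function names are not part of spaCy. For a real check, the warning itself points to `spacy.gold.biluo_tags_from_offsets`.

```python
def token_boundaries(tokens):
    """Return the sets of character offsets at which tokens start and end.

    Assumes the tokens concatenate to the original text with no whitespace
    between them, which holds for the Japanese example below.
    """
    starts, ends = set(), set()
    offset = 0
    for token in tokens:
        starts.add(offset)
        offset += len(token)
        ends.add(offset)
    return starts, ends


def misaligned_entities(tokens, entities):
    """Return the entities whose start or end falls inside a token."""
    starts, ends = token_boundaries(tokens)
    return [(start, end, label) for start, end, label in entities
            if start not in starts or end not in ends]


# Hypothetical tokenization of '蔡長庚は在日華僑の実業家。'
tokens = ['蔡', '長庚', 'は', '在日', '華僑', 'の', '実業', '家', '。']
entities = [(0, 3, 'Person'), (5, 6, 'Location')]

bad = misaligned_entities(tokens, entities)
print(bad)
```

Under this assumed tokenization, the Location span covers only the second character of the token 在日, so it starts in the middle of a token and cannot be expressed as a token-level tag. That would be consistent with the warning, although the actual cause depends on the tokenizer in use.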

The training script converts ner.json into training data (train_data), trains the model on it, and saves the result as ner_model.
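
One way to sanity-check the converted data before training is to count how many entities carry each label. The sketch below assumes `train_data` has the `(text, {'entities': [(start, end, label), ...]})` shape built by the training script. The two sample items are taken from warning messages in the output above, and `count_labels` is a hypothetical helper name.

```python
from collections import Counter


def count_labels(train_data):
    """Count entities per label in (text, {'entities': [...]}) training items."""
    counts = Counter()
    for _text, annotations in train_data:
        for _start, _end, label in annotations['entities']:
            counts[label] += 1
    return counts


# Two items in the converted format, taken from the warning messages
sample_train_data = [
    ('蔡長庚は在日華僑の実業家。',
     {'entities': [(0, 3, 'Person'), (5, 6, 'Location')]}),
    ('選抜高等学校野球大会における鳥取県勢の成績について記す。',
     {'entities': [(0, 10, 'Event'), (14, 17, 'Location')]}),
]

label_counts = count_labels(sample_train_data)
for label, count in label_counts.most_common():
    print(label, count)
```

Running the same function on the full `train_data` would show whether any of the eight labels is rare enough to be hard to learn.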

Extracting named entities with the trained model

Sample code for extracting named entities with the trained model is shown below.

The trained model is loaded with spacy.load.

[Google Colaboratory]

import spacy

# Load the named entity recognition model
nlp = spacy.load('ner_model')

# Extract named entities
doc = nlp('*****Enter the text to extract named entities from*****')
for ent in doc.ents:
    print(
        ent.text + ',' +              # text
        ent.label_ + ',' +            # label
        str(ent.start_char) + ',' +   # start position
        str(ent.end_char)             # end position
    )
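
The listing above joins the fields with commas, which becomes ambiguous if an entity's text itself contains a comma. As a possible alternative, the sketch below writes the same four fields with Python's `csv` module, which quotes such values. So that it runs without a trained model, it uses `SimpleNamespace` objects as stand-ins for the entries of `doc.ents`. These stand-ins and the helper name `entities_to_csv` are assumptions for illustration only.

```python
import csv
import io
from types import SimpleNamespace


def entities_to_csv(ents):
    """Format entities as CSV text with a header row.

    Each item needs text, label_, start_char and end_char attributes,
    the same attributes the listing above reads from doc.ents.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['text', 'label', 'start', 'end'])
    for ent in ents:
        writer.writerow([ent.text, ent.label_, ent.start_char, ent.end_char])
    return buffer.getvalue()


# Stand-in objects; with a real model, pass doc.ents instead
sample_ents = [
    SimpleNamespace(text='鳥取県', label_='Location', start_char=14, end_char=17),
    SimpleNamespace(text='A,B', label_='Product', start_char=0, end_char=3),
]

csv_text = entities_to_csv(sample_ents)
print(csv_text)
```

With a loaded model, `entities_to_csv(doc.ents)` should produce one row per extracted entity.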