Kaggle - 災害ツイートについての自然言語処理(4) - データクレンジング

April 4, 2021

Natural Language Processing with Disaster Tweetsrに関する４回目の記事です。

Natural Language Processing with Disaster Tweets

今回はデータクレンジングを行います。

学習データと検証データの結合

まずは一括でデータクレンジングするために、学習データと検証データを結合します。

[ソース]

1 2	df = pd.concat([tweet,test]) df.shape

URLの排除

URLの排除を行います。正規表現を使ってURLパターンに合致したものを排除します。

[ソース]

example = "New competition launched :https://www.kaggle.com/c/nlp-getting-started"

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

remove_URL(example)
df['text'] = df['text'].apply(lambda x : remove_URL(x))

[結果]

HTMLタグの排除

HTMLタグも正規表現を使って排除します。

[ソース]

example = """<div>
<h1>Real or Fake</h1>
<p>Kaggle </p>
<a href="https://www.kaggle.com/c/nlp-getting-started">getting started</a>
</div>"""

def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'',text)
print(remove_html(example))

df['text'] = df['text'].apply(lambda x : remove_html(x))

[結果]

絵文字の排除

絵文字も正規表現を使って排除します。

複数パターンある場合は、リスト型を使ってまとめて指定することができます。

[ソース]

# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

remove_emoji("Omg another Earthquake 😔😔")

df['text'] = df['text'].apply(lambda x: remove_emoji(x))

[結果]

句読点の排除

句読点を排除します。下記の方法で文字の変換を行っています。

str.maketrans()でstr.translate()に使える変換テーブルを作成する。
str.translate()で文字列内の文字を変換する。

[ソース]

def remove_punct(text):
    table = str.maketrans('','', string.punctuation)
    return text.translate(table)

example = "I am a #king"
print(remove_punct(example))

df['text'] = df['text'].apply(lambda x : remove_punct(x))

[結果]

スペル修正

最後にスペルの修正を行います。

pyspellcheckerというライブラリを使うので、まずこれをインストールしておきます。

[コマンド]

1 2	!pip install pyspellchecker

[結果]

スペルチャックは一旦単語ごとに分解し、スペルミスがあれば正しいスペルに修正し、最後に修正した単語を含めて結合し直しています。

[ソース]

from spellchecker import SpellChecker

spell = SpellChecker()
def correct_spellings(text):
    corrected_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            corrected_text.append(spell.correction(word))
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)
        
text = "corect me plese"
correct_spellings(text)

[結果]

今回は、英語文字列のクレンジングを行いました。

次回は、単語ベクター化モデルの一つであるGloVeを試してみます。