Python - Preparing for Machine Learning

April 8, 2020

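Three representative machine learning tasks are the following.

Classification
Predicts a category (class) from the given data.
Regression
Predicts a numeric value from the given data.
Clustering
Groups the data into clusters according to its properties.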
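
To make the three tasks concrete, here is a minimal NumPy-only sketch on toy data. All values, names, and the class centers are made up for illustration, and the clustering step is a crude threshold at the mean standing in for a real algorithm such as k-means.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the toy example is repeatable

# Regression: predict a number.
# Fit a straight line to noisy samples of y = 2x + 1 (assumed toy relation).
x = np.linspace(0, 1, 20)
y = 2 * x + 1 + rng.normal(scale=0.1, size=x.size)
slope, intercept = np.polyfit(x, y, deg=1)
predicted_value = slope * 0.5 + intercept  # numeric prediction at x = 0.5

# Classification: predict a class.
# Assign the label whose (hand-picked) class center is closest to the sample.
class_means = {"small": 0.0, "large": 10.0}
sample = 7.2
predicted_class = min(class_means, key=lambda name: abs(sample - class_means[name]))

# Clustering: no labels are given.
# Split 1-D points into two groups around the overall mean (crude stand-in).
points = np.array([0.1, 0.3, 0.2, 9.8, 10.1, 9.9])
cluster_ids = (points > points.mean()).astype(int)

print(predicted_value, predicted_class, cluster_ids)
```

Classification and regression both need known answers to learn from, which is why the data prepared below is split into training and test sets. Clustering works on the data alone.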

In this post, as preparation for machine learning, we generate a dataset and plot it.

Preparing and plotting the data

[Code]

# Import matplotlib and numpy
import matplotlib.pyplot as plt
import numpy as np

# Range of the x axis
x_max = 1
x_min = -1

# Range of the y axis
y_max = 2
y_min = -1

# Scale (number of points per unit of x)
SCALE = 50

# Fraction of test data (30% of all data)
TEST_RATE = 0.3

# Generate the data as a column vector
x = np.arange(x_min, x_max, 1 / float(SCALE)).reshape(-1, 1)
# y is x squared
y = x ** 2
y_noise = y + np.random.randn(len(y), 1) * 0.5  # add Gaussian noise (standard deviation 0.5)

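# Split into training and test data (used for classification and regression problems)
def split_train_test(array):
    length = len(array)
    n = int(length * (1 - TEST_RATE))  # number of training samples

    # Shuffle the positions, then take the first n for training and the rest for testing
    indices = list(range(length))
    np.random.shuffle(indices)
    idx_train = indices[:n]
    idx_test = indices[n:]

    # Sort each part so the selected points stay in ascending x order
    return sorted(array[idx_train]), sorted(array[idx_test])

# Split the list of indices
indices = np.arange(len(x))     # list of index values
idx_train, idx_test = split_train_test(indices)

# Training data
x_train = x[idx_train]
y_train = y_noise[idx_train]

# Test data
x_test = x[idx_test]
y_test = y_noise[idx_test]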

### Plotting

# Scatter plot of the points to be analyzed
plt.scatter(x, y_noise, label='target')

# Draw the original curve with a dotted line
plt.plot(x, y, linestyle=':', label='non noise curve')

# Set the x-axis and y-axis ranges
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

# Place the legend outside the axes, to the upper right of the plot
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)

# Show the graph
plt.show()

[Result]

The blue dots are the data that will be used for machine learning.

The dotted line shows the original curve before the noise was added.
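
The split in the code above relies on NumPy's global random state, so the training and test sets change on every run. A minimal sketch of a reproducible variant follows, assuming a seeded generator from np.random.default_rng. The helper name split_indices and the seed value are hypothetical and not part of the original code.

```python
import numpy as np

TEST_RATE = 0.3  # same test fraction as in the code above


def split_indices(length, test_rate=TEST_RATE, seed=0):
    """Return sorted training and test index arrays (hypothetical helper)."""
    rng = np.random.default_rng(seed)       # seeded generator: same split on every run
    shuffled = rng.permutation(length)      # shuffled copy of 0 .. length-1
    n_train = int(length * (1 - test_rate))
    return np.sort(shuffled[:n_train]), np.sort(shuffled[n_train:])


idx_train, idx_test = split_indices(100)
print(len(idx_train), len(idx_test))
```

The returned arrays can be used to index x and y_noise exactly like idx_train and idx_test in the original code. Fixing the seed makes later experiments comparable, and changing it produces a different split.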