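Evaluating Classification Models, Part 7 (Summary)

January 30, 2022

In this post, we evaluate all of the classification models we have run so far in one pass.

Wrapping the PR curve visualization in a function

First, we wrap the PR curve visualization from the previous post in a function.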

[Google Colaboratory]

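import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, auc

def plot_pr_curve(y_true, proba):
    # Class 0 is treated as positive; proba[:, 0] is assumed to hold its probability
    precision, recall, thresholds = precision_recall_curve(y_true, proba[:,0], pos_label=0)
    # Area under the PR curve
    auc_score = auc(recall, precision)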

    plt.figure(figsize=(12, 4))
    plt.subplot(1,2,1)

    plt.plot(recall, precision,label=f"PR Curve (AUC = {round(auc_score,3)})")
    plt.plot([0,1], [1,1], linestyle="--", color="red", label="Ideal Line")

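    # Mark the points on the curve closest to a few representative thresholds
    tg_thres = [0.3,0.5,0.8]
    for thres in tg_thres:
        # Index of the threshold nearest to thres
        tg_index = np.argmin(np.abs(thresholds - thres))
        plt.plot(recall[tg_index], precision[tg_index], marker = "o",markersize=10, label=f"Threshold = {thres}")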

    plt.legend()
    plt.title("PR curve")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.grid()

    plt.subplot(1,2,2)

    # precision and recall have one more element than thresholds, so pad thresholds with 1
    plt.plot(np.append(thresholds, 1), recall, label = "Recall")
    plt.plot(np.append(thresholds, 1), precision, label = "Precision")
    plt.xlabel("Thresholds")
    plt.ylabel("Score")
    plt.grid()
    plt.legend()

    plt.show()
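
The two indexing steps inside plot_pr_curve (padding thresholds with np.append and finding the nearest threshold with np.argmin) are easier to follow on a tiny example. The following is only a minimal sketch with made-up labels and scores, not data from this series; toy_y and toy_score are hypothetical names.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

# Hypothetical toy labels: class 0 is the positive class, as in plot_pr_curve
toy_y = np.array([0, 0, 1, 0])
# Made-up predicted probabilities of class 0
toy_score = np.array([0.9, 0.7, 0.45, 0.2])

precision, recall, thresholds = precision_recall_curve(toy_y, toy_score, pos_label=0)

# precision and recall carry one extra final point (precision=1, recall=0)
# that has no threshold, which is why the plot pads thresholds before plotting
print(len(precision), len(recall), len(thresholds))

# Index of the threshold closest to 0.5, as done for the markers on the PR curve
idx = np.argmin(np.abs(thresholds - 0.5))
print(thresholds[idx], precision[idx], recall[idx])

# Area under the PR curve, computed with the trapezoidal rule
pr_auc = auc(recall, precision)
print(round(pr_auc, 3))
```

Note that the marker is placed at the nearest available threshold, so when no sample has a score close to the requested value, the plotted point can sit some distance from it.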

Scaling

We scale the data for the algorithms that are not tree-based; decision trees themselves do not depend on feature scale.

[Google Colaboratory]

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
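
The scaler is fitted on the training data only and then reused on the test data, so the test data is scaled with the training mean and standard deviation. A minimal sketch with made-up numbers (the toy_ names are hypothetical and not part of this series) shows what that looks like:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two features on very different scales
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

toy_scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data
toy_train_scaled = toy_scaler.fit_transform(toy_train)
# transform reuses those training statistics without refitting
toy_test_scaled = toy_scaler.transform(toy_test)

# Per-feature mean learned from the training data
print(toy_scaler.mean_)
# Scaled training data has mean 0 and standard deviation 1 per feature
print(toy_train_scaled.mean(axis=0), toy_train_scaled.std(axis=0))
# Test values are expressed relative to the training statistics
print(toy_test_scaled)
```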

Defining the models

We define the models in a dictionary.

[Google Colaboratory]

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

models = {"Logistic Regression":LogisticRegression(), 
          "Linear SVM":SVC(kernel="linear",probability=True,random_state=0),
          "Kernel SVM":SVC(kernel="rbf",probability=True,random_state=0),
          "K Neighbors":KNeighborsClassifier(),
          "Decision Tree":DecisionTreeClassifier(max_depth=3,random_state=0),
          "Random Forest":RandomForestClassifier(max_depth=3,random_state=0)}

To use predict_proba with the SVC class, we set probability to True.
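
The effect of the probability setting can be checked on toy data. This is a minimal sketch on synthetic data generated with make_classification (the toy_ and _svc names are hypothetical), not the dataset used in this series:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical toy data, used only for illustration
toy_X, toy_y = make_classification(n_samples=100, n_features=4, random_state=0)

# With the default probability=False, predict_proba is not available
plain_svc = SVC(kernel="linear").fit(toy_X, toy_y)
print(hasattr(plain_svc, "predict_proba"))

# With probability=True, SVC additionally fits a probability calibration during fit
proba_svc = SVC(kernel="linear", probability=True, random_state=0).fit(toy_X, toy_y)
toy_proba = proba_svc.predict_proba(toy_X)

# One column per class, in the order given by classes_
print(proba_svc.classes_)
print(toy_proba.shape)
# Each row sums to 1
print(np.allclose(toy_proba.sum(axis=1), 1.0))
```

The column order matters for plot_pr_curve, which reads proba[:, 0] as the probability of class 0. The extra calibration also makes training slower than with the default setting.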

[Google Colaboratory]

# Bundle the train and test splits so the evaluation loop can iterate over both
data_set = {"Train":[X_train_scaled,y_train],"Test":[X_test_scaled,y_test]}

That completes the preparation.

Evaluating each model

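For each model, we perform the following steps.

Build and train the model (the fit call)
Predict (the predict call)
Output a classification report for each dataset (from classification_report through display)
Setting output_dict to True returns the report as a dictionary, which we then convert to a DataFrame.
Visualize the PR curve for the test data (the plot_pr_curve call)

[Google Colaboratory]

import pandas as pd
from sklearn.metrics import classification_report

for model_name in models.keys():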
    print()
    print(f"{model_name} Score Report")
    model = models[model_name].fit(X_train_scaled, y_train)

    for data_set_name in data_set.keys():

        X_data = data_set[data_set_name][0]
        y_true = data_set[data_set_name][1]

        y_pred = model.predict(X_data)

        score_df = pd.DataFrame(classification_report(y_true, y_pred, output_dict=True))
        score_df["model"] = model_name
        score_df["type"] = data_set_name
        display(score_df)

        if data_set_name == "Test":
            proba = model.predict_proba(X_data)
            plot_pr_curve(y_true, proba)
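
The report-to-DataFrame step used in the loop above can be sketched in isolation as follows. The labels and predictions are made up for illustration, and toy_true, toy_pred, and toy_score_df are hypothetical names:

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0]

# output_dict=True returns a nested dictionary instead of a formatted string
report = classification_report(toy_true, toy_pred, output_dict=True)
print(list(report.keys()))

# Each top-level key becomes a column; the metrics become the rows
toy_score_df = pd.DataFrame(report)
# Tag the table so results from several models and datasets can be told apart
toy_score_df["model"] = "Toy Model"
toy_score_df["type"] = "Test"
print(toy_score_df)
```

The accuracy entry is a single number rather than a per-metric dictionary, so it is repeated in every row of its column.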