XGBoost in Python

Here’s a basic example of how to use XGBoost in Python for a classification task, using the popular Iris dataset.

Install XGBoost

If you don’t have XGBoost installed, you can install it using pip:

pip install xgboost
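
To verify the installation, you can print the installed version from the command line (any reasonably recent release should run the example below):

python -c "import xgboost; print(xgboost.__version__)"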

Sample Code

import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert the dataset into DMatrix, which is XGBoost's internal data structure
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set parameters for XGBoost
params = {
    'objective': 'multi:softmax',  # Specify the loss function
    'num_class': 3,                # Number of classes in the dataset
    'max_depth': 3,                # Maximum depth of a tree
    'eta': 0.3,                    # Step size shrinkage (the learning rate)
    'eval_metric': 'mlogloss'      # Evaluation metric (multiclass log loss)
}

# Train the model
num_rounds = 50 # Number of boosting rounds
bst = xgb.train(params, dtrain, num_rounds)

# Make predictions on the test set
# (with multi:softmax, predict returns the class labels directly, as floats)
y_pred = bst.predict(dtest)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

Explanation

  1. Loading the Dataset: We use the Iris dataset, which is a common dataset for classification tasks.
    It contains three classes of flowers.

  2. Splitting the Data: The dataset is split into training and testing sets using train_test_split.
    For classification you can also pass stratify=y to keep the class proportions the same in both splits.

  3. DMatrix: XGBoost uses its own data structure called DMatrix for training.
    It is more efficient and optimized for XGBoost operations, and it accepts pandas DataFrames
    as well as NumPy arrays (a sketch of this appears after this list).

  4. Setting Parameters:

    • objective: Defines the learning task and the corresponding objective function.
      In this case, multi:softmax is used for multiclass classification; the related multi:softprob
      objective returns class probabilities instead (see the second sketch after this list).
    • num_class: Specifies the number of classes.
    • max_depth: The maximum depth of the trees.
    • eta: The learning rate.
    • eval_metric: The metric reported during training; mlogloss is the multiclass log loss.
  5. Training: The model is trained using the train function with the specified parameters and number of boosting rounds.

  6. Prediction: The trained model makes predictions on the test set, and the accuracy is calculated using accuracy_score.
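
As mentioned in point 3, DMatrix is not limited to NumPy arrays. Here is a minimal sketch, assuming pandas is installed: load_iris(as_frame=True) returns the features as a DataFrame, and XGBoost picks up the column names as feature names automatically.

import xgboost as xgb
from sklearn.datasets import load_iris

# Load Iris as a pandas DataFrame; column names become feature names
iris = load_iris(as_frame=True)
dmat = xgb.DMatrix(iris.data, label=iris.target)
print(dmat.feature_names)              # ['sepal length (cm)', ...]
print(dmat.num_row(), dmat.num_col())  # 150 4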
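
And as mentioned in point 4, switching the objective to multi:softprob yields per-class probabilities rather than hard labels. A short sketch reusing params, dtrain, dtest, and num_rounds from the sample code above (note: recent XGBoost versions return a 2D array from predict here; some very old versions returned a flat array that had to be reshaped to (n_samples, num_class)):

import numpy as np

# Same setup, but ask for class probabilities instead of hard labels
params_prob = dict(params, objective='multi:softprob')
bst_prob = xgb.train(params_prob, dtrain, num_rounds)
proba = bst_prob.predict(dtest)    # one row of class probabilities per sample
y_pred = np.argmax(proba, axis=1)  # pick the most likely class per row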

This is a basic example, but XGBoost offers a wide range of parameters and options that can be fine-tuned for different types of data and tasks.
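
One commonly used option, for example, is early stopping: pass an evals watchlist to xgb.train, and training halts once the evaluation metric stops improving. A sketch under the same setup as above (in practice, the set used for early stopping should be separate from the final test set):

# Monitor mlogloss on a held-out set; stop if it fails to improve for 10 rounds
evals = [(dtrain, 'train'), (dtest, 'eval')]
bst = xgb.train(params, dtrain, num_boost_round=200,
                evals=evals, early_stopping_rounds=10)
print(bst.best_iteration)  # the round that scored best on the 'eval' set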

Output

Accuracy: 100.00%

Iris is a small, easily separable dataset, so perfect accuracy on the 30-sample test split is not unusual; expect lower numbers on harder, real-world data.