Monday, July 31, 2017

XGBoost

Basics:

XGBoost stands for eXtreme Gradient Boosting. It is an implementation of gradient boosted decision trees designed for speed and performance.

Three main forms of gradient boosting are supported (a parameter sketch follows this list):
  1. Gradient Boosting, also called the gradient boosting machine, including the learning rate.
  2. Stochastic Gradient Boosting, with sub-sampling at the row, column, and column-per-split levels.
  3. Regularized Gradient Boosting, with both L1 and L2 regularization.
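
As a minimal sketch, these three forms map onto constructor parameters of the library's scikit-learn wrapper, XGBClassifier; the values shown here are illustrative assumptions, not tuned settings:

from xgboost import XGBClassifier

classifier = XGBClassifier(
    # 1. Gradient boosting machine: the learning rate shrinks each tree's contribution
    learning_rate = 0.1,
    # 2. Stochastic gradient boosting: random sub-sampling
    subsample = 0.8,            # fraction of rows per tree
    colsample_bytree = 0.8,     # fraction of columns per tree
    colsample_bylevel = 0.8,    # fraction of columns per split level
    # 3. Regularized gradient boosting: L1 and L2 penalties
    reg_alpha = 0.0,            # L1 regularization term
    reg_lambda = 1.0            # L2 regularization term
)
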
XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data.

XGBoost is a software library that you can download and install on your machine, then access from a variety of interfaces. Specifically, XGBoost supports the following main interfaces:
  • Command Line Interface (CLI).
  • C++ (the language in which the library is written).
  • Python interface, as well as a model in scikit-learn (a short Python sketch follows below).
  • R interface, as well as a model in the caret package.
  • Julia.
  • Java and JVM languages like Scala and platforms like Hadoop.
It belongs to a broader collection of tools under the umbrella of the Distributed Machine Learning Community (DMLC), which is also the creator of the popular MXNet deep learning library.
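
To make the Python entry points concrete, here is a minimal sketch of both the native API and the scikit-learn wrapper; the toy data is a placeholder assumption:

import numpy as np
import xgboost as xgb
from xgboost import XGBClassifier

# Placeholder toy data, only to illustrate the two entry points
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size = 100)

# 1. Native Python interface: data is wrapped in a DMatrix
dtrain = xgb.DMatrix(X, label = y)
booster = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round = 10)

# 2. scikit-learn wrapper: fits like any other sklearn estimator
model = XGBClassifier(n_estimators = 10)
model.fit(X, y)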

Installation Guide - http://xgboost.readthedocs.io/en/latest/build.html#
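
For a typical Python setup, installation usually reduces to a single pip command (platform-specific build steps are covered in the guide above):

pip install xgboost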

Business Problem: (Same as used in ANN)
You have bank customers (credit score, country, age, gender, tenure, balance, credit card, loan, exited, etc.). Given this data, you need to find out which customers are at high risk of leaving the bank.
In summary, it is a classification problem.

Code: XGBoost

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values   # customer attribute columns (the first three columns are identifiers)
y = dataset.iloc[:, 13].values     # target: whether the customer exited

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# The country (column 1) and gender (column 2) features are strings; encode them as integers
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
# One-hot encode the country column (note: categorical_features was removed in
# scikit-learn 0.22; newer versions use ColumnTransformer instead)
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
# Drop one dummy column to avoid the dummy variable trap
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Fitting XGBoost to the Training set
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
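# A quick accuracy figure for this single train/test split can be read off
# the confusion matrix (correct predictions sit on the diagonal)
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
print(accuracy)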

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print(accuracies.mean())   # average accuracy across the 10 folds
print(accuracies.std())    # spread of accuracy across the folds
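
As a possible next step (not part of the run above), the classifier's hyper-parameters can be tuned with scikit-learn's GridSearchCV; the grid below is an illustrative sketch, not a recommended setting:

# Applying Grid Search to tune the XGBoost hyper-parameters (illustrative grid)
from sklearn.model_selection import GridSearchCV
parameters = {'learning_rate': [0.05, 0.1, 0.3],
              'max_depth': [3, 5, 7],
              'n_estimators': [100, 200]}
grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print(best_accuracy, best_parameters)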


Hope this helps!!

Arun Manglick
