Ensemble learning is a strategy in which a group of models are used to find a solution to a challenging problem, by using a strategy and combining diverse machine learning models into one single predictive model.
In general, ensemble methods are mainly used for improving the overall performance accuracy of a model and combine several different models, also known as the base learners, to predict the results, instead of using a single model.
In one of the articles related to ensemble learning, we have already discussed about the popular ensemble method, Bootstrap Aggregation. Bagging tries to implement similar learners on small sample populations and then takes a mean of all the predictions. It combines Bootstrapping and Aggregation to form one ensemble model. It basically reduces the variance error and helps to avoid overfitting. In this article we will look into the limitations of bagging and how a boosting algorithm can be used to overcome those limitations. We will also learn about various types of boosting algorithms and implement one of them in Python. Let’s get started.
Let us recall the concept of bagging and consider a binary classification problem. We are either classifying an observation as 0 or as 1.
In bagging, T bootstrap samples are selected, a classifier is fitted on each of these samples, and the models are trained in parallel. In a Random Forest, decision trees are trained in parallel. Then the results of all classifiers are averaged into a bagging classifier:
Let us consider 3 classifiers and the result for the classification can either be right or wrong. If we plot the results of the 3 classifiers, there are regions in which the classifiers will be wrong. These regions are represented in red in the figure below.
The above example works pretty well as when one classifier is wrong, the two others are correct. By voting classifier, you can achieve a better accuracy. However, there are cases where Bagging does not work properly, when all classifiers are mistaken to be in the same region.
Due to this reason, the intuition behind the discovery of Boosting was the following :
With this intuition, Boosting algorithm was introduced. Let us understand what Boosting is all about.
Boosting is an ensemble modeling technique which attempts to build a strong classifier from the number of weak classifiers. It is done by building a model using weak models in series. First, a model is built from the training data. Then the second model is built which tries to correct the errors present in the first model. This procedure is continued and models are added until either the complete training data set is predicted correctly or the maximum number of models are added.
Boosting being a sequential process, each subsequent model attempts to correct the errors of the previous model. It is focused on reducing the bias unlike bagging. It makes the boosting algorithms prone to overfitting. To avoid overfitting, parameter tuning plays an important role in boosting algorithms, which will be discussed in the later part of this article. Some examples of boosting are XGBoost, GBM, ADABOOST etc..
To find weak learners, we apply base learning (ML) algorithms with a different distribution. As each time base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.
Step 1: The base learner takes all the distributions and assigns equal weight or attention to each observation.
Step 2: If there is any prediction error caused by first base learning algorithm, then we pay higher attention to observations having prediction error. Then, we apply the next base learning algorithm.
Step 3: Iterate Step 2 till the limit of base learning algorithm is reached or higher accuracy is achieved.
Finally, it combines the outputs from weak learner and creates a strong learner which eventually improves the prediction power of the model. Boosting gives higher focus to examples which are mis-classiﬁed or have higher errors by preceding weak rules.
Our initial approach would be to identify ‘SPAM’ and ‘NOT SPAM’ emails using the following criteria. If:
Individually, these rules are not powerful enough to classify an email into ‘SPAM’ or ‘NOT SPAM’. Therefore, these rules are called as weak learner.
To convert weak learner to strong learner, we’ll combine the prediction of each weak learner using methods like:
Example: Above, we have defined 5 weak learners. Out of these 5, 3 are voted as ‘SPAM’ and 2 are voted as ‘Not a SPAM’. In this case, by default, we’ll consider an email as SPAM because we have higher(3) vote for ‘SPAM’
Boosting helps in training a series of low performing algorithms, called weak learners, simply by adjusting the error metric over time. Weak learners are considered to be those algorithms whose error rate is slightly under 50% as illustrated below:
Let us consider data points on a 2D plot. Some of the data points will be well classified, others won’t. The weight attributed to each error when computing the error rate is 1/n where n is the number of data points to classify.
Now if we apply some weight to the errors :
You might now notice that we give more weight to the data points that are not well classified. An illustration of the weighting process is mentioned below:
In the end, we want to build a strong classifier that may look like the figure mentioned below:
There might be a question in your mind about how many classifiers should one implement in order to ensure it works well. And how is each classifier chosen at each step?
Well, Tree stumps defines a 1-level decision tree. At each step, we need to find the best stump, i.e the best data split, which will minimize the overall error. You can see a stump as a test, in which the assumption is that everything that lies on one side belongs to class 1, and everything that lies on the other side belongs to class 0.
Many such combinations are possible for a tree stump. Let us look into an example to understand how many combinations we face.
Well there are 12 possible combinations. Let us check how.
There are 12 possible “tests” we could make. The “2” on the side of each separating line simply represents the fact that all points on one side could be points that belong to class 0, or to class 1. Therefore, there are 2 tests embedded in it.
At each iteration t, we will choose ht the weak classifier that splits best the data, by reducing the overall error rate the most. Recall that the error rate is a modified error rate version that takes into account what has been introduced before.
The best split is found by identifying at each iteration t, the best weak classifier ht, generally a decision tree with 1 node and 2 leaves (a stump). Let us consider an example of credit defaulter, i.e whether a person who borrowed money will return or not.
In this case, the best split at time t is to stump on the Payment history, since the weighted error resulting from this split is minimum.
Simply note that decision tree classifiers like these ones can in practice be deeper than a simple stump. This will be considered as a hyper-parameter.
In the next step we combine the classifiers into a Sign classifier, and depending on which side of the frontier a point will stand, it is classified as 0 or 1. It can be achieved by:
You can improve the classifier by adding weights on each classifier, to avoid giving the same importance to the different classifiers.
Pseudo-codeThe key elements to keep in mind are:
This algorithm is called AdaBoost or Adaptive Boosting. This is one of the most important algorithms among all boosting methods.
Boosting algorithms are generally fast to train, although we consider every stump possible and compute exponentials recursively.
Well, if we choose αt and Z properly, the weights that are supposed to change at each step simplify to:
Underlying engine used for boosting algorithms can be anything. It can be decision stamp, margin-maximizing classification algorithm etc. There are many boosting algorithms which use other types of engines such as:
In this article, we will focus on AdaBoost and Gradient Boosting followed by their respective Python codes and a little bit about XGBoost.
Boosted algorithms are mainly used when there is plenty of data to make a prediction and high predictive power is expected. It is used to reduce bias and variance in supervised learning. It combines multiple weak predictors to build strong predictor.
The underlying engine used for boosting algorithms can be anything. For instance, AdaBoost is a boosting done on Decision stump. There are many other boosting algorithms which use other types of engine such as:
Adaptive Boosting, or most commonly known AdaBoost, is a Boosting algorithm. This algorithm uses the method to correct its predecessor. It pays more attention to under fitted training instances by the previous model. Thus, at every new predictor the focus is more on the complicated cases more than the others.
It fits a sequence of weak learners on different weighted training data. It starts by predicting the original data set and gives equal weight to each observation. If prediction is incorrect using the first learner, then it gives higher weight to observation which have been predicted incorrectly. Being an iterative process, it continues to add learner(s) until a limit is reached in the number of models or accuracy.
Mostly, AdaBoost uses decision stamps. But, we can use any machine learning algorithm as base learner if it accepts weight on training data set. We can use AdaBoost algorithms for both classification and regression problems.
Let us consider the example of the image mentioned above. In order to build an AdaBoost classifier, consider that as a first base classifier a Decision Tree algorithm is trained to make predictions on our training data. Applying the following methodology of AdaBoost, the weight of the misclassified training instances is increased. Then the second classifier is trained and the updated weights are acknowledged. It repeats the procedure over and over again.
At the end of every model prediction we end up boosting the weights of the misclassified instances so that the next model does a better job on them, and so on.
This sequential learning technique might sound similar to Gradient Descent, except that instead of tweaking a single predictor’s parameter to minimize the cost function, AdaBoost adds predictors to the ensemble, gradually making it better.
One disadvantage of this algorithm is that the model cannot be parallelized since each predictor can only be trained after the previous one has been trained and evaluated.
Below are the steps for performing the AdaBoost algorithm:
base_estimators: specify the base type estimator, i.e. the algorithm to be used as base learner.
n_estimators: It defines the number of base estimators, where the default is 10 but you can increase it in order to obtain a better performance.
learning_rate: same impact as in gradient descent algorithm
max_depth: Maximum depth of the individual estimator
n_jobs: indicates to the system how many processors it is allowed to use. Value of ‘-1’ means there is no limit;
random_state: makes the model’s output replicable. It will always produce the same results when you give it a fixed value as well as the same parameters and training data.
Now, let us take a quick look at how to use AdaBoost in Python using a simple example on handwritten digit recognition.
import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.ensemble import AdaBoostClassifier from sklearn.tree import DecisionTreeClassifier from sklearn.metrics import accuracy_score from sklearn.model_selection import cross_val_score from sklearn.model_selection import cross_val_predict from sklearn.model_selection import train_test_split from sklearn.model_selection import learning_curve from sklearn.datasets import load_digits
Let us load the data :
dataset = load_digits() X = dataset['data'] y = dataset['target']
X contains arrays of length 64 which are simply flattened 8x8 images. The aim of this dataset is to recognize handwritten digits. Let’s take a look at a given handwritten digit:
If we stick to a Decision Tree Classifier of depth 1 (a stump), here’s how to implement AdaBoost classifier:
reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1)) scores_ada = cross_val_score(reg_ada, X, y, cv=6) scores_ada.mean()
And it should head a result of around 26%, which can largely be improved. One of the key parameters is the depth of the sequential decision tree classifiers. How does accuracy improve with depth of the decision trees?
score =  for depth in [1,2,10] : reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth)) scores_ada = cross_val_score(reg_ada, X, y, cv=6) score.append(scores_ada.mean()) score
[0.2636257855582272, 0.5902852679072207, 0.9527524912410157]
And the maximal score is reached for a depth of 10 in this simple example, with an accuracy of 95.3%.
This is another very popular Boosting algorithm which works pretty similar to what we’ve seen for AdaBoost. Gradient Boosting works by sequentially adding the previous predictors underfitted predictions to the ensemble, ensuring the errors made previously are corrected.
The difference lies in what it does with the underfitted values of its predecessor. Contrary to AdaBoost, which tweaks the instance weights at every interaction, this method tries to fit the new predictor to the residual errors made by the previous predictor.
So that you can understand Gradient Boosting it is important to understand Gradient Descent first.
Below are the steps for performing the Gradient Boosting algorithm:
n_estimators: It controls the number of weak learners.
Learning_rate: Controls the contribution of weak learners in the final combination. There is a trade-off between learning_rate and n_estimators.
min_samples_split: Minimum number of observation which is required in a node to be considered for splitting. It is used to control overfitting.
min_samples_leaf: Minimum samples required in a terminal or leaf node. Lower values should be chosen for imbalanced class problems since the regions in which the minority class will be in the majority will be very small.
min_weight_fraction_leaf: similar to the previous but defines a fraction of the total number of observations instead of an integer.
max_depth : maximum depth of a tree. Used to control overfitting.
max_lead_nodes : maximum number of terminal leaves in a tree. If this is defined max_depth is ignored.
max_features : number of features it should consider while searching for the best split.
You can tune loss function for better performance.
You can find Gradient Boosting function in Scikit-Learn’s library.
# for regression from sklearn.ensemble import GradientBoostingRegressor model = GradientBoostingRegressor(n_estimators=3,learning_rate=1) model.fit(X,Y) # for classification from sklearn.ensemble import GradientBoostingClassifier model = GradientBoostingClassifier() model.fit(X,Y)
XG Boost or Extreme Gradient Boosting is an advanced implementation of the Gradient Boosting. This algorithm has high predictive power and is ten times faster than any other gradient boosting techniques. Moreover, it includes a variety of regularization which reduces overfitting and improves overall performance.
Boosting algorithms represent a different machine learning perspective which is turning a weak model to a stronger one to fix its weaknesses. I hope this article helped you understand how boosting works.
We have covered most of the topics related to algorithms in our series of machine learning blogs, click here. If you are inspired by the opportunities provided by machine learning, enroll in our Data Science and Machine Learning Courses for more lucrative career options in this landscape.
Your email address will not be published. Required fields are marked *
Introduction to Machine Learning and its typesMach... Read More