Ensemble learning is a strategy in which a group of models is used to find a solution to a challenging problem, by using a strategy and combining diverse machine learning models into one single predictive model. Click here to know more about linear discriminant analysis in machine learning.

In general, ensemble methods are mainly used for improving the overall performance accuracy of a model and combine several different models, also known as the base learners, to predict the results, instead of using a single model. Learn more about different prediction models in our machine learning with Python training program.

In one of the articles related to ensemble learning, we have already discussed the popular ensemble method, **B**ootstrap **Agg**regation. **Bagging** tries to implement similar learners on small sample populations and then takes a mean of all the predictions. It combines Bootstrapping and Aggregation to form one ensemble model. It basically reduces the variance error and helps to avoid overfitting. In this article, we will look into the limitations of bagging and how a boosting algorithm can be used to overcome those limitations. We will also learn about various types of boosting algorithms and implement one of them in Python. Let’s get started.

## What are the limitations of Bagging?

Let us recall the concept of bagging and consider a binary classification problem. We are either classifying an observation as 0 or as 1.

In bagging, T bootstrap samples are selected, a classifier is fitted on each of these samples, and the models are trained in parallel. In a Random Forest, decision trees are trained in parallel. Then the results of all classifiers are averaged into a bagging classifier:

Formula for a Bagging ClassifierLet us consider 3 classifiers and the result for the classification can either be right or wrong. If we plot the results of the 3 classifiers, there are regions in which the classifiers will be wrong. These regions are represented in red in the figure below.

Example case in which Bagging works wellThe above example works pretty well as when one classifier is wrong, the two others are correct. By voting classifier, you can achieve a better accuracy. However, there are cases where Bagging does not work properly, when all classifiers are mistaken to be in the same region.

Due to this reason, the intuition behind the discovery of Boosting was the following :

- instead of training parallel models, one should train models sequentially
- each model should focus on where the performance of the previous classifier was poor

With this intuition, Boosting algorithm was introduced. With this intuition, boosting algorithm was introduced. Certain machine learning algorithms are getting more popular than others making data science one of the most in-demand skills. As a result, the data science course fees are on a rise as well.

Let us understand what Boosting is all about.

## What is Boosting?

Boosting is an ensemble modeling technique which attempts to build a strong classifier from the number of weak classifiers. It is done by building a model using weak models in series. First, a model is built from the training data. Then the second model is built which tries to correct the errors present in the first model. This procedure is continued and models are added until either the complete training data set is predicted correctly or the maximum number of models are added.

Boosting being a sequential process, each subsequent model attempts to correct the errors of the previous model. It is focused on reducing the bias unlike bagging. It makes the boosting algorithms prone to overfitting. To avoid overfitting, parameter tuning plays an important role in boosting algorithms, which will be discussed in the later part of this article. Some examples of boosting are XGBoost, GBM, ADABOOST etc..

## How can Boosting identify weak learners?

To find weak learners, we apply base learning (ML) algorithms with a different distribution. As each time base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.

## How do we choose a different distribution for each round?

**Step 1:** The base learner takes all the distributions and assigns equal weight or attention to each observation.

**Step 2: **If there is any prediction error caused by first base learning algorithm, then we pay higher attention to observations having prediction error. Then, we apply the next base learning algorithm.

**Step 3: **Iterate Step 2 till the limit of base learning algorithm is reached or higher accuracy is achieved.

Finally, it combines the outputs from weak learner and creates a strong learner which eventually improves the prediction power of the model. Boosting gives higher focus to examples which are mis-classiﬁed or have higher errors by preceding weak rules.

## How would you classify an email as SPAM or not?

Our initial approach would be to identify ‘SPAM’ and ‘NOT SPAM’ emails using the following criteria. If:

- Email has only one image file (promotional image), It’s a SPAM.
- Email has only link(s), It’s a SPAM.
- Email body consists of sentences like “You won a prize money of $ xxxxxx”, It’s a SPAM.
- Email from our official domain “www.knowledgehut.com” , Not a SPAM.
- Email from known source, Not a SPAM.

Individually, these rules are not powerful enough to classify an email into ‘SPAM’ or ‘NOT SPAM’. Therefore, these rules are called as **weak learner**.

To convert weak learner to strong learner, we’ll combine the prediction of each weak learner using methods like:

- Using average/ weighted average
- Considering prediction has higher vote

**E****xample:** Above, we have defined 5 weak learners. Out of these 5, 3 are voted as ‘SPAM’ and 2 are voted as ‘Not a SPAM’. In this case, by default, we’ll consider an email as SPAM because we have higher(3) vote for ‘SPAM’

Boosting helps in training a series of low performing algorithms, called weak learners, simply by adjusting the error metric over time. Weak learners are considered to be those algorithms whose error rate is slightly under 50% as illustrated below:

### Weighted errors

Let us consider data points on a 2D plot. Some of the data points will be *well classified*, others won’t. The weight attributed to each error when computing the error rate is 1/n where n is the number of data points to classify.

Now if we apply some weight to the errors :

You might now notice that we give more weight to the data points that are not well classified. An illustration of the weighting process is mentioned below:

Example of weighting process

In the end, we want to build a strong classifier that may look like the figure mentioned below:

Strong Classifier

Tree stumps

There might be a question in your mind about how many classifiers should one implement in order to ensure it works well. And how is each classifier chosen at each step?

Well, Tree stumps defines a 1-level decision tree. At each step, we need to find the best stump, i.e the best data split, which will minimize the overall error. You can see a stump as a test, in which the assumption is that everything that lies on one side belongs to class 1, and everything that lies on the other side belongs to class 0.

Many such combinations are possible for a tree stump. Let us look into an example to understand how many combinations we face.

3 data points to split

Well there are 12 possible combinations. Let us check how.

12 Stumps

There are 12 possible “tests” we could make. The “2” on the side of each separating line simply represents the fact that all points on one side could be points that belong to class 0, or to class 1. Therefore, there are 2 tests embedded in it.

At each iteration t, we will choose ht the weak classifier that splits best the data, by reducing the overall error rate the most. Recall that the error rate is a modified error rate version that takes into account what has been introduced before.

### Finding the best split

The best split is found by identifying at each iteration t, the best weak classifier ht, generally a decision tree with 1 node and 2 leaves (a stump). Let us consider an example of credit defaulter, i.e whether a person who borrowed money will return or not.

Identifying the best split

In this case, the best split at time t is to stump on the Payment history, since the weighted error resulting from this split is minimum.

Simply note that decision tree classifiers like these ones can in practice be deeper than a simple stump. This will be considered as a hyper-parameter.

### Combining classifiers

In the next step we combine the classifiers into a Sign classifier, and depending on which side of the frontier a point will stand, it is classified as 0 or 1. It can be achieved by:

Combining classifiers

You can improve the classifier by adding weights on each classifier, to avoid giving the same importance to the different classifiers.

AdaBoost

Pseudo-code

Pseudo-code

The key elements to keep in mind are:

- Z is a constant whose role is to normalize the weights so that they add up to 1
- αt is a weight that we apply to each classifier

This algorithm is called **AdaBoost or Adaptive Boosting**. This is one of the most important algorithms among all boosting methods.

### Computation

Boosting algorithms are generally fast to train, although we consider every stump possible and compute exponentials recursively.

Well, if we choose αt and Z properly, the weights that are supposed to change at each step simplify to:

Weights after choice of α and Z

## Types of Boosting Algorithms

Underlying engine used for boosting algorithms can be anything. It can be decision stamp, margin-maximizing classification algorithm etc. There are many boosting algorithms which use other types of engines such as:

- AdaBoost (
**Ada**ptive **Boost**ing) - Gradient Tree Boosting
- XGBoost

In this article, we will focus on AdaBoost and Gradient Boosting followed by their respective Python codes and a little bit about XGBoost.

### Where are Boosted algorithms required?

Boosted algorithms are mainly used when there is plenty of data to make a prediction and high predictive power is expected. It is used to reduce bias and variance in supervised learning. It combines multiple weak predictors to build strong predictor.

The underlying engine used for boosting algorithms can be anything. For instance, AdaBoost is a boosting done on Decision stump. There are many other boosting algorithms which use other types of engine such as:

- GentleBoost
- Gradient Boosting
- LPBoost
- BrownBoost

## Adaptive Boosting - AdaBoost Algorithm in Machine Learning

Adaptive Boosting, or most commonly known AdaBoost, is a Boosting algorithm. This algorithm uses the method to correct its predecessor. It pays more attention to under fitted training instances by the previous model. Thus, at every new predictor the focus is more on the complicated cases more than the others.

It fits a sequence of weak learners on different weighted training data. It starts by predicting the original data set and gives equal weight to each observation. If prediction is incorrect using the first learner, then it gives higher weight to observation which have been predicted incorrectly. Being an iterative process, it continues to add learner(s) until a limit is reached in the number of models or accuracy.

Mostly, AdaBoost uses decision stamps. But, we can use any machine learning algorithm as base learner if it accepts weight on training data set. We can use AdaBoost algorithm in machine learning for both classification and regression problems.

Let us consider the example of the image mentioned above. In order to build an AdaBoost classifier, consider that as a first base classifier a Decision Tree algorithm is trained to make predictions on our training data. Applying the following methodology of AdaBoost, the weight of the misclassified training instances is increased. Then the second classifier is trained and the updated weights are acknowledged. It repeats the procedure over and over again.

At the end of every model prediction we end up boosting the weights of the misclassified instances so that the next model does a better job on them, and so on.

This sequential learning technique might sound similar to Gradient Descent, except that instead of tweaking a single predictor’s parameter to minimize the cost function, AdaBoost adds predictors to the ensemble, gradually making it better.

One disadvantage of this algorithm is that the model cannot be parallelized since each predictor can only be trained after the previous one has been trained and evaluated.

Below are the steps for performing the AdaBoost algorithm:

- Initially, all observations are given equal weights.
- A model is built on a subset of data.
- Using this model, predictions are made on the whole dataset.
- Errors are calculated by comparing the predictions and actual values.
- While creating the next model, higher weights are given to the data points which were predicted incorrectly.
- Weights can be determined using the error value. For instance,the higher the error the more is the weight assigned to the observation.
- This process is repeated until the error function does not change, or the maximum limit of the number of estimators is reached.

**Hyperparameters**

**base_estimators**: specify the base type estimator, i.e. the algorithm to be used as base learner.

**n_estimators: **It defines the number of base estimators, where the default is 10 but you can increase it in order to obtain a better performance.

**learning_rate: **same impact as in gradient descent algorithm

**max_depth: **Maximum depth of the individual estimator

**n_jobs****: **indicates to the system how many processors it is allowed to use. Value of ‘-1’ means there is no limit;

**random_state: **makes the model’s output replicable. It will always produce the same results when you give it a fixed value as well as the same parameters and training data.

Now, let us take a quick look at how to use AdaBoost in Python using a simple example on handwritten digit recognition.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve
from sklearn.datasets import load_digits

Let us load the data :

dataset = load_digits()
X = dataset['data']
y = dataset['target']

X contains arrays of length 64 which are simply flattened 8x8 images. The aim of this dataset is to recognize handwritten digits. Let’s take a look at a given handwritten digit:

plt.imshow(X[4].reshape(8,8))

If we stick to a Decision Tree Classifier of depth 1 (a stump), here’s how to implement AdaBoost classifier:

reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1))
scores_ada = cross_val_score(reg_ada, X, y, cv=6)
scores_ada.mean()

0.2636257855582272

And it should head a result of around 26%, which can largely be improved. One of the key parameters is the depth of the sequential decision tree classifiers. How does accuracy improve with depth of the decision trees?

score = []
for depth in [1,2,10] :
reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth))
scores_ada = cross_val_score(reg_ada, X, y, cv=6)
score.append(scores_ada.mean())
score

[0.2636257855582272, 0.5902852679072207, 0.9527524912410157]

And the maximal score is reached for a depth of 10 in this simple example, with an accuracy of 95.3%.

## Gradient Boosting

This is another very popular Boosting algorithm which works pretty similar to what we’ve seen for AdaBoost. Gradient Boosting works by sequentially adding the previous predictors underfitted predictions to the ensemble, ensuring the errors made previously are corrected.

The difference lies in what it does with the underfitted values of its predecessor. Contrary to AdaBoost, which tweaks the instance weights at every interaction, this method **tries to fit the new predictor to the ***residual errors*** made by the previous predictor.**

So that you can understand Gradient Boosting it is important to understand Gradient Descent first.

Below are the steps for performing the Gradient Boosting algorithm:

- A model is built on a subset of data.
- Using this model, predictions are made on the whole dataset.
- Errors are calculated by comparing the predictions and actual values.
- A new model is created using the errors calculated as target variable. Our objective is to find the best split to minimize the error.
- The predictions made by this new model are combined with the predictions of the previous.
- New errors are calculated using this predicted value and actual value.
- This process is repeated until the error function does not change, or the maximum limit of the number of estimators is reached.

**Hyperparameters**

**n_estimators: **It controls the number of weak learners.

**Learning_rate: **Controls the contribution of weak learners in the final combination. There is a trade-off between learning_rate and n_estimators.

**min_samples_split: **Minimum number of observation which is required in a node to be considered for splitting. It is used to control overfitting.

**min_samples_leaf: **Minimum samples required in a terminal or leaf node. Lower values should be chosen for imbalanced class problems since the regions in which the minority class will be in the majority will be very small.

**min_weight_fraction_leaf: **similar to the previous but defines a fraction of the total number of observations instead of an integer.

**max_depth : **maximum depth of a tree. Used to control overfitting.

**max_lead_nodes :** maximum number of terminal leaves in a tree. If this is defined max_depth is ignored.

**max_features : **number of features it should consider while searching for the best split.

You can tune loss function for better performance.

### Implementation in Python

You can find Gradient Boosting function in Scikit-Learn’s library.

# for regression
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(n_estimators=3,learning_rate=1)
model.fit(X,Y)
# for classification
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(X,Y)

## XGBoost

XG Boost or Extreme Gradient Boosting is an advanced implementation of the Gradient Boosting. This algorithm has high predictive power and is ten times faster than any other gradient boosting techniques. Moreover, it includes a variety of data regularization which reduces overfitting and improves overall performance.

**Advantages**

- It implements regularization which helps in reducing overfit (Gradient Boosting does not have);
- It implements parallel processing which is much faster than Gradient Boosting;
- Allows users to define custom optimization objectives and evaluation criteria adding a whole new dimension to the model;
- XGBoost has an in-built routine to handle missing values;
- XGBoost makes splits up to the max_depth specified and then starts pruning the tree backward and removes splits beyond which there is no positive gain;
- XGBoost allows a user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run.

Boosting algorithms represent a different machine learning perspective which is turning a weak model into a stronger one to fix its weaknesses. I hope this article helped you understand how boosting works.

We have covered most of the topics related to algorithms in our series of machine learning tutorials. If you are inspired by the opportunities provided by machine learning, enroll in Knowledgehut machine learning with Python training for more lucrative career options in this landscape.

*Your one-stop-shop for **Machine Learning** is just a click away. Access our live online training and find easy solutions to all your queries here.*