Machine learning has become one of the most widely adopted technologies of recent years, and it is being used to solve problems that are difficult or impractical for humans to solve by hand. It has spread into diverse industries such as automobiles, manufacturing, IT services, healthcare and robotics. The main reasons for using it are that it can provide more accurate solutions, simplify tasks and ease work processes. Its applications automate work in ways that are helpful for many organizations and for the well-being of people. Machine learning uses input data to develop a model, which is then used to predict outcomes; comparing those predictions with known results tells us how well the model performs.
Generally, we develop machine learning models to solve a problem using the given input data. When we work with a single algorithm, we cannot judge how good its performance is for that particular problem, as there is nothing to compare it against. So we feed the same input data to other machine learning algorithms and compare the results to find which algorithm best suits the given problem. Every algorithm has its own mathematical basis and its own strengths for a specific kind of problem.
Why do we combine models?
When we tackle a specific problem with a machine learning algorithm, we sometimes fail because the model performs poorly. The algorithm we chose may be well suited to the problem, yet we still do not get good outcomes. In this situation, several questions come to mind. How can we get better results from the model? What steps should be taken next in model development? What techniques can help us develop a more efficient model?
To overcome this situation there is a procedure called “Combining Models”, where we combine two or more weaker machine learning models to solve a problem and get better outcomes. In machine learning, models are combined using two approaches, namely “Ensemble Models” and “Hybrid Models”.
Ensemble models use multiple machine learning algorithms to produce better predictive results than a single algorithm could. There are different ensemble approaches for different tasks. A hybrid model is a more flexible way of combining models and leaves more room for novel designs than a standard ensemble. When combining models, we need to check how strong or weak each machine learning model is on the problem at hand.
What are Ensemble Methods?
An ensemble is a group of things that work together on a particular task. An ensemble method combines several models to produce better predictive results than a single model. The objective of an ensemble method is to decrease variance and bias and thereby improve the predictions of the final model. In practice, this often helps to reduce overfitting.
The models that contribute to an Ensemble are referred to as the Ensemble Members, which may be of the same type or different types, and may or may not be trained on the same training data.
In the late 2000s, adoption of ensembles picked up, due in part to their huge success in machine learning competitions such as the Netflix Prize and, later, competitions on Kaggle.
Ensemble methods do, however, increase the computational cost and complexity of the solution. This increase comes from the expertise and time required to train and maintain multiple models rather than a single model.
Ensemble models are preferred for two main reasons: performance and robustness. Ensemble methods focus mainly on improving accuracy by reducing the variance component of the prediction error, sometimes at the cost of adding a small amount of bias.
Performance means that an ensemble can make better predictions than any single contributing model. Robustness means that an ensemble reduces the spread or dispersion of the predictions and of model performance.
The goal of a supervised machine learning algorithm is to have “low bias and low variance”.
Bias and variance usually trade off against each other: making a model more flexible tends to lower its bias but raise its variance, while making it simpler tends to raise its bias but lower its variance.
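The relationship between bias and variance can be made precise with the standard decomposition of expected squared error. The notation here is introduced only for illustration and is not from the text above: we assume the data follow y = f(x) + ε with zero-mean noise of variance σ², and we write \hat{f} for the model learned from a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, which is why ensemble methods aim at lowering bias, variance, or both.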
We use ensemble methods explicitly to seek better predictive performance, such as lower error on regression or higher accuracy on classification. They are also used in computer vision and are popular in academic competitions.
The decision tree is a type of algorithm commonly used in decision analysis and operations research, and it is one of the most widely used algorithms in machine learning. It is also a common base learner in the ensemble methods described below.
The decision tree algorithm can work with both small and large amounts of input data, and it is mostly used for decision-making problems.
A decision tree is a tree-like structure made up of nodes. The top of the tree is the root node, which represents the main problem and the first question asked about the data. Below it are sub-nodes (internal nodes), each of which tests a feature of the data and splits it further. The leaf nodes form the last layer of the tree and hold the outcomes: class labels for classification or values for regression.
The tree is grown by adding nodes until the leaves are sufficiently pure or a stopping rule such as a maximum depth is reached. Decision trees are used for both classification and regression problems. They are widely used in machine learning, and a main advantage is that a tree can lay out two or more possible outcomes, from which the most suitable one is selected for the given input.
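To make the root, sub-node and leaf vocabulary concrete, here is a minimal sketch of a hand-built tree and the walk from root to leaf. The tree, its features ("outlook", "humidity", "windy") and its outcomes are made up for illustration; a real tree would be learned from data.

```python
# A hand-built toy tree; features and outcomes are hypothetical.
toy_tree = {
    "feature": "outlook",            # root node: first question asked
    "branches": {
        "sunny": {
            "feature": "humidity",   # sub node (internal node)
            "branches": {"high": "stay in", "normal": "play"},  # leaf nodes
        },
        "overcast": "play",          # leaf node directly under the root
        "rainy": {
            "feature": "windy",      # sub node (internal node)
            "branches": {"yes": "stay in", "no": "play"},
        },
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following the branch that matches the sample."""
    while isinstance(node, dict):          # dicts are decision nodes, strings are leaves
        value = sample[node["feature"]]
        node = node["branches"][value]
    return node


sample = {"outlook": "sunny", "humidity": "high", "windy": "no"}
print(predict(toy_tree, sample))
```

Each dictionary plays the role of a decision node and each plain string plays the role of a leaf.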
Decision trees can operate on both small and large amounts of data, and decisions made with them are often fast and accurate. In machine learning, the different types of decision tree algorithms include:
- Classification and Regression Tree (CART)
- Decision stump
- Chi-squared automatic interaction detection (CHAID)
Types of Ensemble Methods
Ensemble methods are used to improve the accuracy of a model by reducing bias and variance. They are widely used for classification and regression problems. In an ensemble method, several models are combined into one more reliable model, which improves accuracy.
Ensemble methods are broadly classified into four categories, namely “Sequential methods”, “Parallel methods”, “Homogeneous Ensemble” and “Heterogeneous Ensemble”. The first two describe how the base learners are trained, and the last two describe what kind of base learners are combined. These categories help us understand how each method affects the performance and accuracy of models on a problem.
Sequential methods generate base learners one after another, so each learner depends on the ones before it. The data that was mislabeled by the previous learners is given higher weights, so that the next learner concentrates on it and accuracy improves. This is the approach taken in “BOOSTING”, for example in Adaptive Boosting (AdaBoost).
Parallel methods generate base learners in parallel, independently of one another. Because the base learners are independent, averaging their predictions significantly reduces the error. This is the approach taken in “BAGGING”, for example in Random Forest.
A homogeneous ensemble is a combination of classifiers of the same type. Each classifier is trained on a different sample or weighting of the dataset, and the ensemble combines them into a model that best suits the problem. This type of technique is computationally expensive and is suitable for large datasets. “BAGGING” and “BOOSTING” are the popular homogeneous ensemble methods.
A heterogeneous ensemble is a combination of different types of classifiers, each trained on the same data. This method works well on small datasets. “STACKING” falls into this category.
Bagging is short for Bootstrap Aggregating. It is used for both classification and regression problems. It improves the accuracy of the model by reducing variance, which helps to prevent overfitting. Bagging can be applied with any type of machine learning method, but it is generally implemented using decision trees.
Bagging is an ensemble technique in which several models are combined into one more reliable model. We fit several models independently, each on its own bootstrap sample, and average their predictions to obtain a model with lower variance and higher accuracy.
Bootstrapping is a sampling technique in which samples are drawn from the whole dataset with replacement, so the same observation can appear more than once in a sample. Sampling with replacement randomizes the selection, so each base learner sees a slightly different version of the data. The base learning algorithm is then run on each sample.
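Sampling with replacement is easy to try out with the standard library. The sketch below is only an illustration: the ten-item `population` and the fixed seed are assumptions made so the run is repeatable.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))


rng = random.Random(0)          # fixed seed so the run is repeatable
population = list(range(10))    # hypothetical toy dataset
sample = bootstrap_sample(population, rng)

# Items that were never drawn are often called "out-of-bag" items.
out_of_bag = sorted(set(population) - set(sample))

print("bootstrap sample:", sample)
print("never drawn (out-of-bag):", out_of_bag)
```

Because items are drawn with replacement, some appear several times in the sample and others not at all.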
Aggregation is the step in bagging that combines the predictions of all the base learners into a single final prediction. Without aggregation, the predictions would be less accurate, because not all of the individual outcomes would be taken into consideration. The aggregation is done either on the basis of the predicted probabilities from the bootstrapped models or on the basis of all the predicted outcomes, for example by averaging or by majority vote.
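Putting bootstrapping and aggregation together, bagging can be sketched in a few lines of plain Python. This is a simplified illustration, not a production implementation: the base learner is a one-threshold classifier on a single feature, and the toy data and function names are assumptions.

```python
import random
from collections import Counter


def fit_stump(points):
    """Pick the threshold t that minimises training errors for 'predict 1 if x > t'."""
    best_t, best_errors = None, None
    for t in sorted({x for x, _ in points}):
        errors = sum((x > t) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def bagging_fit(points, n_models, rng):
    """Bootstrapping: train one stump per sample drawn with replacement."""
    return [fit_stump(rng.choices(points, k=len(points))) for _ in range(n_models)]


def bagging_predict(thresholds, x):
    """Aggregation: majority vote over the individual stump predictions."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


rng = random.Random(42)
train = [(x, int(x >= 5)) for x in range(10)]   # hypothetical 1-D data: label 1 when x >= 5
thresholds = bagging_fit(train, n_models=25, rng=rng)

print("learned thresholds:", thresholds)
print("prediction for x=2:", bagging_predict(thresholds, 2))
print("prediction for x=7:", bagging_predict(thresholds, 7))
```

For a regression problem the vote in `bagging_predict` would be replaced by the mean of the individual predictions.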
Bagging is advantageous because it combines weak base learners into a single strong learner that is more stable. It reduces variance, which increases accuracy and helps prevent overfitting. Its limitation is that it is computationally expensive. Bagging also does little to reduce bias, so bias should not be ignored: if the base learners are biased, the bagged model will still fail to give good results.
Random Forest Models
Random forest is a supervised machine learning algorithm that is flexible and widely used because of its simplicity and versatility. It often produces good results even without much hyperparameter tuning.
In the term “Random Forest”, the “Forest” refers to a group (an ensemble) of decision trees, usually trained with the “Bagging” method. Bagging, as we know, combines several learning models to improve the overall result.
Random forest is used for classification and regression problems. It builds many decision trees and combines them together to get a more accurate and stable prediction at the end of the model.
Random forest adds extra randomness to the model while growing the trees. Instead of searching for the best feature among all features when splitting a node, it searches for the best feature among a random subset of features. Thus, only a random subset of features is considered by the algorithm for each node split.
Random forest can also measure the relative importance of each feature for the prediction. The Python library scikit-learn (“Sklearn”) provides an implementation that measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity, averaged across all the trees in the forest.
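A minimal sketch of this with scikit-learn follows. The Iris dataset and the parameter values are choices made here for illustration only; `max_features` controls the size of the random feature subset tried at each split, and `feature_importances_` holds the impurity-based scores.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # size of the random feature subset tried at each split
    random_state=0,        # fixed seed so the run is repeatable
)
forest.fit(iris.data, iris.target)

# Impurity-based importances, one score per feature, normalised to sum to 1.
importances = forest.feature_importances_
ranked = sorted(zip(iris.feature_names, importances), key=lambda pair: -pair[1])
for name, score in ranked:
    print(f"{name:20s} {score:.3f}")
```

The scores are relative: a higher value means that splits on that feature reduced impurity more across the forest.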
The benefits of using random forest include the following:
- Training is relatively fast compared to many other algorithms.
- It runs efficiently on large datasets and predicts output with high accuracy.
- It can maintain accuracy even when a large proportion of the data is missing.
- It is flexible to apply, and results are easy to obtain.
Boosting is an ensemble technique that converts weak machine learning models into a strong model. Its main goal is to reduce the bias, and often also the variance, of a model to improve accuracy. Each new predictor learns from the mistakes of the previous predictors, so the performance of the ensemble improves step by step.
It can be pictured as a stack-like structure in which the weak learners are at the bottom and the stronger learners are at the top: each learner in an upper layer learns from the ones below it and applies a correction to their predictions.
Boosting exists in many forms, including Adaptive Boosting (AdaBoost), Gradient Boosting and XGBoost (Extreme Gradient Boosting).
AdaBoost uses weak learners in the form of decision trees with a single split, commonly known as decision stumps. The first decision stump in AdaBoost is trained with all observations carrying equal weights; after each round, the weights of incorrectly classified observations are increased so that the next stump focuses on them.
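The re-weighting idea can be illustrated with one round of the classic AdaBoost update for labels in {-1, +1}. This is a sketch under stated assumptions: the labels and the stump's predictions are made up, and the function name `adaboost_round` is hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost re-weighting step for labels in {-1, +1}.

    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Say of this learner in the final vote (larger when err is small).
    alpha = 0.5 * math.log((1 - err) / err)
    # Up-weight mistakes (t * p == -1), down-weight correct predictions (t * p == +1).
    new = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new)
    return alpha, [w / total for w in new]   # renormalise so the weights sum to 1


y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, +1, -1, -1, -1]                # hypothetical stump output: last sample is wrong
start = [1 / len(y_true)] * len(y_true)      # all observations start with equal weights

alpha, updated = adaboost_round(start, y_true, y_pred)
print("alpha:", round(alpha, 3))
print("updated weights:", [round(w, 3) for w in updated])
```

After the update, the single misclassified observation carries a larger weight than each of the correctly classified ones, so the next stump is pushed to get it right.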
Gradient Boosting adds predictors to an ensemble sequentially, each one correcting its predecessors. Instead of changing the weights of incorrectly classified observations as AdaBoost does, gradient boosting fits each new predictor to the residual errors made by the previous predictors.
XGBoost stands for Extreme Gradient Boosting. It is an implementation of gradient boosted decision trees designed for speed and model performance. Ordinary boosting implementations can be slow because the models are trained sequentially, so XGBoost is widely used for its computational speed and good model performance.
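Fitting each new predictor to the residuals can be sketched in plain Python for squared-error regression on a single feature. This is a simplified illustration: the regression stump, the learning rate of 0.5 and the toy data are all assumptions, and real libraries add many refinements.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump: one threshold, one constant on each side."""
    best = None
    for t in sorted(set(xs))[:-1]:                       # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)                             # start from the mean prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # errors made so far
        stump = fit_stump(xs, residuals)                 # new predictor fits the residuals
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]            # hypothetical toy targets

mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)             # predicting the mean everywhere
boosted = gradient_boost(xs, ys)
boosted_mse = mse(boosted, xs, ys)

print("baseline MSE:", round(baseline_mse, 3))
print("boosted MSE:", round(boosted_mse, 3))
```

Each round only has to explain what the earlier rounds got wrong, which is why the training error shrinks as predictors are added.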
Simple Averaging / Weighted Method
Averaging is a technique to improve the accuracy of the model, mainly used for regression problems. In its weighted form, each model's prediction for an instance is multiplied by that model's weight. The method produces consistent, reliable results and gives a better understanding of the outcomes of the given problem.
In the simple averaging method, the predictions of all the models are averaged for every instance of the test dataset. This can reduce overfitting and is mainly suitable for regression problems, where the predictions are numerical. It produces a smoother regression model. Simple averaging is just calculating the mean of the given values.
The weighted averaging method is a slight modification of simple averaging: each prediction is multiplied by its model's weight, the products are summed for every instance, and the sum is divided by the total weight. We assume here that the predicted values are in the range of 0 to 1.
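Both averaging schemes amount to a few lines of arithmetic. In the sketch below the three models' predictions and the weights are made-up numbers for illustration, and the function names are hypothetical.

```python
def simple_average(predictions):
    """predictions: one list of predicted values per model, aligned by instance."""
    n_models = len(predictions)
    return [sum(values) / n_models for values in zip(*predictions)]


def weighted_average(predictions, weights):
    """Multiply each model's prediction by its weight, sum, and divide by the total weight."""
    total = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, values)) / total
        for values in zip(*predictions)
    ]


# Hypothetical predictions of three models for three test instances.
model_preds = [
    [0.2, 0.8, 0.5],   # model 1
    [0.4, 0.6, 0.5],   # model 2
    [0.6, 0.4, 0.8],   # model 3
]
weights = [0.5, 0.3, 0.2]   # assumed weights: model 1 is trusted most

simple = simple_average(model_preds)
weighted = weighted_average(model_preds, weights)
print("simple:  ", [round(v, 2) for v in simple])
print("weighted:", [round(v, 2) for v in weighted])
```

With equal weights the weighted average reduces to the simple average, so the simple method is a special case of the weighted one.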
Stacking combines multiple regression or classification models through a meta-regressor or meta-classifier. Stacking is different from bagging and boosting. Bagging and boosting work mainly with homogeneous weak learners, whereas stacking works mainly with heterogeneous weak learners, that is, with different algorithms altogether.
Bagging and boosting combine weak learners using deterministic rules such as averaging or weighted voting, whereas stacking combines the weak base learners with the help of a meta-model.
In stacking, we train several weak base learners and then train a meta-model that takes their predictions as input and produces the final prediction.
Stacking results in a pile-like structure, in which the output of a lower level is used as the input to the next level. The error rate is highest at the bottom of the stack and lowest at the top, so the top layer has better prediction accuracy than the lower levels. The aim of stacking is to produce a low-bias model that gives accurate results for a given problem.
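As a rough illustration, scikit-learn provides `StackingClassifier`, which trains the base learners and a meta-model in one object. The dataset, the choice of base learners (a decision tree and k-nearest neighbours) and the logistic regression meta-model are assumptions made for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

stack = StackingClassifier(
    estimators=[                                   # heterogeneous base learners (lower level)
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),   # meta-model (top level)
    cv=5,   # base-learner predictions for the meta-model come from cross-validation
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print("stacked accuracy:", round(stack_accuracy, 3))
```

The meta-model never sees the raw features here by default; it learns only how to combine the base learners' outputs.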
Blending is a technique similar to stacking, but it uses only a validation set held out from the training set to produce the predictions that train the final model. The validation set is also called a holdout set.
With the help of the holdout set and the base models' predictions on it, a new model is built, which is then run on the test set.
The process of blending is explained below:
- The train dataset is divided into training and validation sets
- The base models are fitted on the training set
- Predictions are made on the validation set and the test set
- The validation set and its predictions are used as features to build a new model
- This new model is used to make the final predictions on the test set, using the test set and its meta-features (the base models' test predictions) as input.
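The steps above can be sketched with scikit-learn as follows. The dataset, the two base models, the split sizes and the logistic regression used as the new model are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_full, X_test, y_full, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Step 1: divide the train dataset into training and validation (holdout) sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_full, y_full, test_size=0.3, random_state=0, stratify=y_full
)

# Step 2: fit the base models on the training set only.
base_models = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]
for model in base_models:
    model.fit(X_train, y_train)

# Step 3: make predictions on the validation set and the test set.
val_meta = np.hstack([m.predict_proba(X_val) for m in base_models])
test_meta = np.hstack([m.predict_proba(X_test) for m in base_models])

# Step 4: use the validation set and its predictions as features for a new model.
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(np.hstack([X_val, val_meta]), y_val)

# Step 5: final predictions on the test set and its meta-features.
blend_accuracy = meta_model.score(np.hstack([X_test, test_meta]), y_test)
print("blended accuracy:", round(blend_accuracy, 3))
```

The base models never see the holdout set during training, so their predictions on it are honest inputs for the new model.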
The stacking and blending techniques are useful to improve the performance of the machine learning models. They are used to minimize the errors to get good accuracy for the given problem.
Voting is the simplest ensemble method in machine learning. It is mainly used for classification. The first step is to create multiple classification models using a training dataset; each model then makes a prediction, and the predictions are combined by a vote. When voting is applied to regression problems, the prediction is the average of the predictions of multiple regression models.
In the case of classification there are two types of voting: hard voting and soft voting.
A hard voting ensemble sums the votes for crisp class labels from the contributing models and predicts the class with the most votes. A soft voting ensemble sums the predicted probabilities for each class label and predicts the class label with the largest summed probability.
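The difference between the two is easiest to see on a case where they disagree. The probabilities below are made up for illustration, and the function names are hypothetical.

```python
from collections import Counter


def hard_vote(probas):
    """Each model votes for its most likely class; the class with the most votes wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probas):
    """Sum the predicted probabilities per class; the largest total wins."""
    totals = [sum(column) for column in zip(*probas)]
    return max(range(len(totals)), key=totals.__getitem__)


# Hypothetical class probabilities [P(class 0), P(class 1)] from three models.
probas = [
    [0.90, 0.10],   # model A is very confident in class 0
    [0.40, 0.60],   # model B leans towards class 1
    [0.45, 0.55],   # model C leans towards class 1
]

print("hard voting predicts class", hard_vote(probas))
print("soft voting predicts class", soft_vote(probas))
```

Hard voting follows the two models that lean towards class 1, while soft voting lets the one very confident model tip the summed probabilities towards class 0.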
In short, for the Regression voting ensemble the predictions are the averages of contributing models, whereas for Classification voting ensemble, the predictions are the majority vote of contributing models.
Voting is also described in terms of “Majority Voting” and “Weighted Voting”. In majority voting, the final prediction is based on the number of votes each class gets: the class with the highest count of votes is chosen. In some articles this method is also called “Plurality Voting”.
Unlike majority voting, weighted voting uses weights to increase the importance of one or more models. In effect, we count the predictions of the better models multiple times.
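Weighted voting can be sketched by adding up a weight per vote instead of counting one vote per model. The labels and weights below are made up for illustration, and `weighted_vote` is a hypothetical helper.

```python
from collections import defaultdict


def weighted_vote(labels, weights):
    """Add each model's weight to the class it predicts; the heaviest class wins."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


labels = ["spam", "ham", "ham"]      # hypothetical predictions of three models
equal_weights = [1.0, 1.0, 1.0]      # plain majority voting
trust_first = [3.0, 1.0, 1.0]        # the first model counts three times

print("majority voting:", weighted_vote(labels, equal_weights))
print("weighted voting:", weighted_vote(labels, trust_first))
```

With equal weights this is ordinary majority voting; giving the first model a weight of 3 is the same as counting its prediction three times.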
Ensembling is a technique for improving the accuracy of weak machine learning models by combining them. It comprises different techniques, such as bagging, boosting, averaging, stacking, blending and voting, which are helpful for solving different types of regression and classification problems.