Machine Learning is a multidisciplinary field of study, which gives computers the ability to solve complex problems, which otherwise would be nearly impossible to be hand-coded by a human being. Machine Learning is a scientific field of study which involves the use of algorithms and statistics to perform a given task by relying on inference from data instead of explicit instructions.
The process of Machine Learning can be broken down into several parts, most of which is based around “Data”. The following steps show the Machine Learning Process.
1. Gathering Data from various sources: Since Machine Learning is basically the inference drawn from data before any algorithm can be used, data needs to be collected from some source. Data collected can be of any form, viz. Video data, Image data, Audio data, Text data, Statistical data, etc.
2. Cleaning data to have homogeneity: The data that is collected from various sources does not always come in the desired form. More importantly, data contains various irregularities like Missing data and Outliers.These irregularities may cause the Machine Learning Model(s) to perform poorly. Hence, the removal or processing of irregularities is necessary to promote data homogeneity. This step is also known as data pre-processing.
3. Model Building & Selecting the right Machine Learning Model: After the data has been correctly pre-processed, various Machine Learning Algorithms (or Models) are applied on the data to train the model to predict on unseen data, as well as to extract various insights from the data. After various models are “trained” to the data, the best performing model(s) that suit the application and the performance criteria are selected.
4. Getting Insights from the model’s results: Once the model is selected, further data is used to validate the performance and accuracy of the model and get insights as to how the model performs under various conditions.
5. Data Visualization: This is the final step, where the model is used to predict unseen and real-world data. However, these predictions are not directly understandable to the user, and hence, data Visualization or converting the results into understandable visual graphs is necessary. At this stage, the model can be deployed to solve real-world problems.
To get the similarities out of the way, both, Machine Learning and Curve Fitting rely on data to infer a model which, ideally, fits the data perfectly.
The difference comes in the availability of the data.
Let’s initiate the idea of Bias and Variance with a case study. Let’s assume a simple dataset of predicting the price of a house based on its carpet area. Here, the x-axis represents the carpet area of the house, and the y-axis represents the price of the property. The plotted data (in a 2D graph) is shown in the graph below:
The goal is to build a model to predict the price of the house, given the carpet area of the property. This is a rather easy problem to solve and can easily be achieved by fitting a curve to the given data points. But, for the time being, let’s concentrate on solving the same using Machine Learning.
In order to keep this example simple and concentrate on Bias and Variance, a few assumptions are made:
With the above assumptions, the data is processed to train the model using the following steps:
1. Shuffling the data: Since the y-axis data-points are independent of the order of the sequence of the x-axis data-points, the dataset is shuffled in a pseudo-random manner. This is done to avoid unnecessary patterns from being learned by the model. During the shuffling, it is imperative to keep each x-y pair data point constant. Mixing them up will change the dataset itself and the model will learn inaccurate patterns.
2. Data Splitting: The dataset is split into three categories: Training Set (60%), Validation Set (20%), and Testing Set (20%). These three sets are used for different purposes:
3. Model Selection: Several Machine Learning Models are applied to the Training Set and their Training and Validation Losses are determined, which then helps determine the most appropriate model for the given dataset.
During this step, we assume that a polynomial equation fits the data correctly. The general equation is given below:
The process of “Training” mathematically is nothing more than figuring out the appropriate values for the parameters: a0, a1, ... ,an, which is done automatically by the model using the Training Set.
The developer does have control over how high the degree of the polynomial can be. These parameters that can be tuned by the developer are called Hyperparameters. These hyperparameters play a key role in deciding how well would the model learn and how generalized will the learned parameters be.
Given below are two graphs representing the prediction of the trained model on training data. The graph on the left represents a linear model with an error of 3.6, and the graph on the right represents a polynomial model with an error of 1.7.
By looking at the errors, it can be concluded that the polynomial model performs significantly better when compared to the linear model (Lower the error, better is the performance of the model).
However, when we use the same trained models on the Testing Set, the models perform very differently. The graph on the left represents the same linear model’s prediction on the Testing Set, and the graph on the right side represents the Polynomial model’s prediction on the Testing Set. It is clearly visible that the Polynomial model inaccurately predicts the outputs when compared to the Linear model.
In terms of error, the total error for the Linear model is 3.6 and for the Polynomial model is a whopping 929.12.
Such a big difference in errors between the Training and Testing Set clearly signifies that something is wrong with the Polynomial model. This drastic change in error is due to a phenomenon called Bias-Variance Tradeoff.
Error in Machine Learning is the difference in the expected output and the predicted output of the model. It is a measure of how well the model performs over a given set of data.
There are several methods to calculate error in Machine Learning. One of the most commonly used terminologies to represent the error is called the Loss/Cost Function. It is also known as the Mean Squared Error (or MSE) and is given by the following equation:
The necessity of minimization of Errors: As it is obvious from the previously shown graphs, the higher the error, the worse the model performs. Hence, the error of the prediction of a model can be considered as a performance measure: Lower the error of a model, the better it performs.
In addition to that, a model judges its own performance and trains itself based on the error created between its own output and the expected output. The primary target of the model is to minimize the error so as to get the best parameters that would fit the data perfectly.
Total Error: The error mentioned above is the Total Error and consists of three types of errors: Bias + Variance + Irreducible Error.
Total Error = Bias + Variance + Irreducible Error
Even for an ideal model, it is impossible to get rid of all the types of errors. The “irreducible” error rate is caused by the presence of noise in the data and hence is not removable. However, the Bias and Variance errors can be reduced to a minimum and hence, the total error can also be reduced significantly.
Ideally, the complete dataset is not used to train the model. The dataset is split into three sets: Training, Validation and Testing Sets. Each of these serves a specific role in the development of a model which performs well under most conditions.
Training Set (60-80%): The largest portion of the dataset is used for training the Machine Learning Model. The model extracts the features and learns to recognize the patterns in the dataset. The quality and quantity of the training set determines how well the model is going to perform.
Testing Set (15-25%): The main goal of every Machine Learning Engineer is to develop a model which would generalize the best over a given dataset. This is achieved by training the model(s) on a portion of the dataset and testing its performance by applying the trained model on another portion of the same/similar dataset that has not been used during training (Testing Set). This is important since the model might perform too well on the training set, but perform poorly on unseen data, as was the case with the example given above. Testing set is primarily used for model performance evaluation.
Validation Set (15-25%): In addition to the above, because of the presence of more than one Machine Learning Algorithm (model), it is often not recommended to test the performance of multiple models on the same dataset and then choose the best one. This process is called Model Selection, and for this, a separate part of the training set is used, which is also known as Validation Set. A validation set behaves similar to a testing set but is primarily used in model selection and not in performance evaluation.
Bias is used to allow the Machine Learning Model to learn in a simplified manner. Ideally, the simplest model that is able to learn the entire dataset and predict correctly on it is the best model. Hence, bias is introduced into the model in the view of achieving the simplest model possible.
Parameter based learning algorithms usually have high bias and hence are faster to train and easier to understand. However, too much bias causes the model to be oversimplified and hence underfits the data. Hence these models are less flexible and often fail when they are applied on complex problems.
Mathematically, it is the difference between the model’s average prediction and the expected value.
Variance in data is the variability of the model in a case where different Training Data is used. This would significantly change the estimation of the target function. Statistically, for a given random variable, Variance is the expectation of squared deviation from its mean.
In other words, the higher the variance of the model, the more complex the model is and it is able to learn more complex functions. However, if the model is too complex for the given dataset, where a simpler solution is possible, a model with high Variance causes the model to overfit.
When the model performs well on the Training Set and fails to perform on the Testing Set, the model is said to have Variance.
A biased model will have the following characteristics:
A model with high Variance will have the following characteristics:
From the understanding of bias and variance individually thus far, it can be concluded that the two are complementary to each other. In other words, if the bias of a model is decreased, the variance of the model automatically increases. The vice-versa is also true, that is if the variance of a model decreases, bias starts to increase.
Hence, it can be concluded that it is nearly impossible to have a model with no bias or no variance since decreasing one increases the other. This phenomenon is known as the Bias-Variance Trade
In order to get a clear idea about the Bias-Variance Tradeoff, let us consider the bulls-eye diagram. Here, the central red portion of the target can be considered the location where the model correctly predicts the values. As we move away from the central red circle, the error in the prediction starts to increase.
Each of the several hits on the target is achieved by repetition of the model building process. Each hit represents the individual realization of the model. As can be seen in the diagram below, the bias and the variance together influence the predictions of the model under different circumstances.
Another way of looking at the Bias-Variance Tradeoff graphically is to plot the graphical representation for error, bias, and variance versus the complexity of the model. In the graph shown below, the green dotted line represents variance, the blue dotted line represents bias and the red solid line represents the error in the prediction of the concerned model.
The expected values is a vector represented by y. The predicted output of the model is denoted by the vector y for input vector x. The relationship between the predicted values and the inputs can be taken as y = f(x) + e, where e is the normally distributed error given by:
The third term in the above equation, irreducible_error represents the noise term and cannot be fundamentally reduced by any given model. If hypothetically, infinite data is available, it is possible to tune the model to reduce the bias and variance terms to zero but is not possible to do so practically. Hence, there is always a tradeoff between the minimization of bias and variance.
Detection of Bias and Variance of a model
In model building, it is imperative to have the knowledge to detect if the model is suffering from high bias or high variance. The methods to detect high bias and variance is given below:
A graphical method to Detect a model suffering from High Bias and Variance is shown below:
The graph shows the change in error rate with respect to model complexity for training and validation error.
A systematic approach to solve a Bias-Variance Problem by Dr. Andrew Ng:
Dr. Andrew Ng proposed a very simple-to-follow step by step architecture to detect and solve a High Bias and High Variance errors in a model. The block diagram is shown below:
Detection and Solution to High Bias problem - if the training error is high:
Detection and Solution to High Variance problem - if a validation error is high:
To summarize, Bias and Variance play a major role in the training process of a model. It is necessary to reduce each of these parameters individually to the minimum possible value. However, it should be kept in mind that an effort to decrease one of these parameters beyond a certain limit increases the probability of the other getting increased. This phenomenon is called as the Bias-Variance Tradeoff and is a parameter to consider during model building.