- by Animikh Aich
- 25th Jul, 2019
- Last updated on 11th Mar, 2021
- 24 mins read

Machine Learning is a multidisciplinary field of study that gives computers the ability to solve complex problems that would otherwise be nearly impossible for a human being to hand-code. It uses algorithms and statistics to perform a given task by relying on inference from data instead of explicit instructions.

The process of Machine Learning can be broken down into several parts, most of which revolve around data. The following steps show the Machine Learning process.

**1. Gathering Data from various sources:** Since Machine Learning is essentially inference drawn from data, data needs to be collected from some source before any algorithm can be used. The data collected can be of any form, viz. video, image, audio, text, statistical data, etc.

**2. Cleaning data to have homogeneity:** The data that is collected from various sources does not always come in the desired form. More importantly, data contains various irregularities like missing data and outliers. These irregularities may cause the Machine Learning Model(s) to perform poorly. Hence, the removal or processing of irregularities is necessary to promote data homogeneity. This step is also known as data pre-processing.

**3. Model Building & Selecting the right Machine Learning Model:** After the data has been correctly pre-processed, various Machine Learning Algorithms (or Models) are trained on the data to predict on unseen data, as well as to extract various insights from the data. After various models are “trained” on the data, the best-performing model(s) that suit the application and the performance criteria are selected.

**4. Getting Insights from the model’s results:** Once the model is selected, further data is used to validate the performance and accuracy of the model and to get insights into how the model performs under various conditions.

**5. Data Visualization:** This is the final step, where the model is used to predict on unseen, real-world data. However, these predictions are not directly understandable to the user, and hence data visualization, that is, converting the results into understandable visual graphs, is necessary. At this stage, the model can be deployed to solve real-world problems.

To get the similarities out of the way: both Machine Learning and Curve Fitting rely on data to infer a model which, ideally, fits the data perfectly.

The difference comes in the availability of the data.

- Curve Fitting is carried out with data, all of which is already available to the user. Hence, there is no question of the model encountering unseen data.
- In Machine Learning, however, only a part of the data is available to the user at the time of training (fitting) the model, and the model then has to perform equally well on data that it has never encountered before. In other words, the model must generalize beyond the data it was trained on, so that it predicts correctly once it is deployed.

Let’s introduce the idea of Bias and Variance with a case study. Assume a simple dataset for predicting the price of a house based on its carpet area. Here, the x-axis represents the carpet area of the house, and the y-axis represents the price of the property. The plotted data (in a 2D graph) is shown in the graph below:

The goal is to build a model to predict the price of the house, given the carpet area of the property. This is a rather easy problem to solve and can easily be achieved by fitting a curve to the given data points. But, for the time being, let’s concentrate on solving the same using Machine Learning.

In order to keep this example simple and concentrate on Bias and Variance, a few assumptions are made:

- Adequate data is present in order to come up with a working model capable of making relatively accurate predictions.
- The data is homogeneous in nature and hence no major pre-processing steps are involved.
- There are no missing values or outliers, and hence they do not interfere with the outcome in any way.
- The y-axis data-points are independent of the order of the sequence of the x-axis data-points.

With the above assumptions, the data is processed to train the model using the following steps:

**1. Shuffling the data:** Since the y-axis data-points are independent of the order of the sequence of the x-axis data-points, the dataset is shuffled in a pseudo-random manner. This is done to prevent the model from learning unnecessary patterns tied to the ordering of the data. During the shuffling, it is imperative to keep each x-y pair intact. Mixing them up will change the dataset itself and the model will learn inaccurate patterns.

**2. Data Splitting:** The dataset is split into three categories: Training Set (60%), Validation Set (20%), and Testing Set (20%). These three sets are used for different purposes:

- **Training Set -** This part of the dataset is used to train the model. It is also known as the Development Set.
- **Validation Set -** This is separate from the Training Set and is only used for model selection. The model does not train or learn from this part of the dataset.
- **Testing Set -** This part of the dataset is used for performance evaluation and is completely independent of the Training or Validation Sets. Similar to the Validation Set, the model does not train on this part of the dataset.
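The shuffling and splitting steps above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for the article's dataset: the function names `shuffle_pairs` and `split_dataset`, the fixed seed, and the sample areas and prices are all assumptions made for the example.

```python
import random

def shuffle_pairs(xs, ys, seed=0):
    """Shuffle the dataset while keeping every (x, y) pair intact."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    pairs = list(zip(xs, ys))           # bind each x to its own y first
    random.Random(seed).shuffle(pairs)  # seeded so the run is repeatable
    return pairs

def split_dataset(pairs, train_frac=0.6, val_frac=0.2):
    """Split shuffled pairs into training, validation and testing sets."""
    n_train = round(len(pairs) * train_frac)
    n_val = round(len(pairs) * val_frac)
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]      # the remainder (about 20% here)
    return train, val, test

# Made-up carpet areas and prices, purely for illustration.
areas = [600, 750, 900, 1050, 1200, 1350, 1500, 1650, 1800, 1950]
prices = [62, 74, 93, 101, 125, 131, 152, 160, 184, 190]

pairs = shuffle_pairs(areas, prices)
train_set, val_set, test_set = split_dataset(pairs)
```

Zipping before shuffling is what keeps each area attached to its own price; shuffling the two lists independently would produce exactly the corrupted dataset the text warns about.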

**3. Model Selection:** Several Machine Learning Models are applied to the Training Set and their Training and Validation Losses are determined, which then helps determine the most appropriate model for the given dataset.

During this step, we assume that a polynomial equation fits the data correctly. The general form of a polynomial of degree $n$ is:

$$y = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n$$

Mathematically, the process of “Training” is nothing more than figuring out appropriate values for the parameters $a_0, a_1, \dots, a_n$, which the model does automatically using the Training Set.

The developer, however, controls how high the degree of the polynomial can be. Settings like this, which are tuned by the developer rather than learned from the data, are called Hyperparameters. Hyperparameters play a key role in deciding how well the model learns and how well the learned parameters generalize.

Given below are two graphs representing the prediction of the trained model on training data. The graph on the left represents a linear model with an error of 3.6, and the graph on the right represents a polynomial model with an error of 1.7.

By looking at the errors, it can be concluded that the polynomial model performs significantly better on the training data than the linear model (the lower the error, the better the model performs).

However, when we use the same trained models on the Testing Set, the models perform very differently. The graph on the left represents the same linear model’s prediction on the Testing Set, and the graph on the right side represents the Polynomial model’s prediction on the Testing Set. It is clearly visible that the Polynomial model inaccurately predicts the outputs when compared to the Linear model.

In terms of error, the total error for the Linear model is 3.6 and for the Polynomial model is a whopping 929.12.

Such a big difference in errors between the Training and Testing Sets clearly signifies that something is wrong with the Polynomial model. This drastic change in error is a symptom of overfitting, and it is explained by a phenomenon called the Bias-Variance Tradeoff.
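The gap between training and testing error can be reproduced on synthetic data. The sketch below assumes NumPy is available and uses a made-up noisy linear dataset as a stand-in for the house-price data, so the numbers it produces will not match the errors quoted above. The helper name `fit_and_score` and the choice of degree 9 for the polynomial model are likewise assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the house-price data: a linear trend plus noise.
x = rng.uniform(0.0, 1.0, size=30)
y = 3.0 * x + 1.0 + rng.normal(0.0, 0.3, size=30)

x_train, y_train = x[:18], y[:18]  # first 60% for training
x_test, y_test = x[24:], y[24:]    # last 20% for testing

def fit_and_score(degree):
    """Fit a polynomial on the training set; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse

linear_train, linear_test = fit_and_score(1)
poly_train, poly_test = fit_and_score(9)

print(f"linear:   train={linear_train:.3f}  test={linear_test:.3f}")
print(f"degree 9: train={poly_train:.3f}  test={poly_test:.3f}")
```

The higher-degree fit can never do worse than the line on the training set, because a line is a special case of a degree-9 polynomial. What matters is how far its testing error drifts from its training error.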

Error in Machine Learning is the difference between the expected output and the predicted output of the model. It is a measure of how well the model performs over a given set of data.

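There are several methods to calculate error in Machine Learning. The function that quantifies the error is called the Loss (or Cost) Function. One of the most commonly used loss functions is the Mean Squared Error (MSE), which is given by the following equation:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2$$

Here, $N$ is the number of data points, $y_i$ is the expected output and $\hat{y}_i$ is the predicted output for the $i$-th data point.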

**The necessity of minimization of Errors:** As is evident from the previously shown graphs, the higher the error, the worse the model performs. Hence, the error of the prediction of a model can be considered a performance measure: the lower the error of a model, the better it performs.

In addition to that, a model judges its own performance and trains itself based on the error created between its own output and the expected output. The primary target of the model is to minimize the error so as to get the best parameters that would fit the data perfectly.
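As a small illustration of the error measure discussed above, the Mean Squared Error can be computed directly from its definition: average the squared differences between expected and predicted outputs. The function name `mean_squared_error` and the sample values are assumptions for this sketch.

```python
def mean_squared_error(expected, predicted):
    """Average of the squared differences between expected and predicted outputs."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("at least one data point is required")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(expected, predicted)]
    return sum(squared_errors) / len(squared_errors)

# Squared errors are 1, 0 and 4, so the mean is 5 / 3.
mse = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```

Squaring keeps positive and negative deviations from cancelling out and penalizes large misses more heavily than small ones.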

**Total Error:** The error mentioned above is the Total Error and consists of three components: squared Bias, Variance, and Irreducible Error.

**Total Error = Bias² + Variance + Irreducible Error**

Even for an ideal model, it is impossible to get rid of all the types of errors. The “irreducible” error rate is caused by the presence of noise in the data and hence is not removable. However, the Bias and Variance errors can be reduced to a minimum and hence, the total error can also be reduced significantly.

Ideally, the complete dataset is not used to train the model. The dataset is split into three sets: Training, Validation and Testing Sets. Each of these serves a specific role in the development of a model which performs well under most conditions.

**Training Set (60-80%):** The largest portion of the dataset is used for training the Machine Learning Model. The model extracts the features and learns to recognize the patterns in the dataset. The quality and quantity of the training set determine how well the model is going to perform.

**Testing Set (15-25%):** The main goal of every Machine Learning Engineer is to develop a model which *generalizes* the best over a given dataset. This is achieved by training the model(s) on a portion of the dataset and testing its performance by applying the trained model to another portion of the same or a similar dataset that has not been used during training (the Testing Set). This is important since the model might perform very well on the training set but poorly on unseen data, as was the case with the example given above. The Testing Set is primarily used for model performance evaluation.

**Validation Set (15-25%):** In addition to the above, when more than one Machine Learning Algorithm (model) is being considered, it is not recommended to compare the models on the Testing Set and then choose the best one, because the Testing Set would then no longer give an unbiased estimate of performance. Choosing among candidate models is called Model Selection, and for this a separate portion of the data, known as the Validation Set, is held out. A validation set behaves similarly to a testing set but is primarily used in model selection and not in performance evaluation.

Bias refers to the simplifying assumptions that allow a Machine Learning Model to learn in a simplified manner. Ideally, the simplest model that is able to learn the entire dataset and predict correctly on it is the best model. Hence, bias is introduced into the model with a view to achieving the simplest model possible.

Parametric learning algorithms usually have high bias and hence are faster to train and easier to understand. However, too much bias oversimplifies the model, which then underfits the data. Such models are less flexible and often fail when they are applied to complex problems.

Mathematically, bias is the difference between the model’s average prediction (averaged over different training sets) and the true value it is trying to predict.

Variance is the variability of the model’s predictions when different Training Data is used. A model with high variance changes its estimate of the target function significantly from one training set to another. Statistically, for a given random variable, Variance is the expectation of the squared deviation from its mean.

In other words, the higher the variance of the model, the more complex the model is and the more complex the functions it is able to learn. However, if the model is too complex for the given dataset, where a simpler solution is possible, the high Variance causes the model to overfit.

When the model performs well on the Training Set but fails to perform on the Testing Set, the model is said to have high Variance.
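The definitions of bias and variance above can be estimated empirically by training the same kind of model many times on different noisy training sets. The sketch below assumes NumPy, a made-up sine-shaped target function, fixed inputs with freshly drawn noise for each training set, and polynomial models of degree 1 and 9. All of these are illustrative choices and are not taken from the article's experiment.

```python
import numpy as np

def true_function(x):
    # Hypothetical ground-truth relationship, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def estimate_bias_variance(degree, n_trials=200, n_points=25, noise_std=0.3, seed=0):
    """Refit a polynomial on many noisy training sets; return (bias^2, variance)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)
    target = true_function(x)
    predictions = np.empty((n_trials, n_points))
    for trial in range(n_trials):
        # Each trial is a different training set: same inputs, new noise.
        y_noisy = target + rng.normal(0.0, noise_std, n_points)
        coeffs = np.polyfit(x, y_noisy, degree)
        predictions[trial] = np.polyval(coeffs, x)
    average_prediction = predictions.mean(axis=0)
    # Bias: how far the average prediction is from the true value.
    bias_squared = float(np.mean((average_prediction - target) ** 2))
    # Variance: how much the predictions move between training sets.
    variance = float(np.mean(predictions.var(axis=0)))
    return bias_squared, variance

simple_bias2, simple_var = estimate_bias_variance(degree=1)
complex_bias2, complex_var = estimate_bias_variance(degree=9)

print(f"degree 1: bias^2={simple_bias2:.4f}  variance={simple_var:.4f}")
print(f"degree 9: bias^2={complex_bias2:.4f}  variance={complex_var:.4f}")
```

Under these assumptions the straight line shows the larger squared bias, since it cannot follow the curve, while the degree-9 polynomial shows the larger variance, since it chases the noise in each training set.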

A biased model will have the following characteristics:

- **Underfitting:** A model with high bias is simpler than it should be and hence tends to underfit the data. In other words, the model fails to learn and acquire the intricate patterns of the dataset.
- **Low Training Accuracy:** A biased model will not fit the Training Dataset properly and hence will have low training accuracy (or high training loss).
- **Inability to solve complex problems:** A biased model is too simple and hence is often incapable of learning complex features and solving relatively complex problems.

A model with high Variance will have the following characteristics:

- **Overfitting:** A model with high Variance will have a tendency to be overly complex. This causes the overfitting of the model.
- **Low Testing Accuracy:** A model with high Variance will have very high training accuracy (or very low training loss), but it will have a low testing accuracy (or a high testing loss).
- **Overcomplicating simpler problems:** A model with high variance tends to be overly complex and ends up fitting a much more complex curve to relatively simple data. The model is thus capable of solving complex problems but incapable of solving simple problems efficiently.

From the understanding of bias and variance individually thus far, it can be concluded that the two pull in opposite directions. In other words, if the bias of a model is decreased, the variance of the model tends to increase. The converse is also true: if the variance of a model decreases, its bias starts to increase.

Hence, it can be concluded that it is nearly impossible to have a model with no bias or no variance, since decreasing one increases the other. This phenomenon is known as the Bias-Variance Tradeoff.

In order to get a clear idea about the Bias-Variance Tradeoff, let us consider the bulls-eye diagram. Here, the central red portion of the target can be considered the location where the model correctly predicts the values. As we move away from the central red circle, the error in the prediction starts to increase.

Each of the several hits on the target is achieved by repetition of the model building process. Each hit represents the individual realization of the model. As can be seen in the diagram below, the bias and the variance together influence the predictions of the model under different circumstances.

Another way of looking at the Bias-Variance Tradeoff graphically is to plot error, bias, and variance against the complexity of the model. In the graph shown below, the green dotted line represents variance, the blue dotted line represents bias and the red solid line represents the error in the prediction of the concerned model.

- Since bias is high for a simpler model and decreases with an increase in model complexity, the line representing bias falls steeply as the model complexity increases.
- Similarly, Variance is low for simpler models and high for more complex ones. Hence, the line representing variance rises steeply as the model complexity increases.
- Finally, it can be seen that at either extreme, the generalization error is quite high. Both high bias and high variance lead to a higher error rate.
- The optimal model complexity lies in the middle region, near where the bias and variance lines cross, which is where the total error is at its minimum. This is the preferred part of the graph.
- Also, as discussed earlier, the model underfits for high-bias situations and overfits for high-variance situations.

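The expected values form a vector denoted by $y$, and the predicted output of the model for the input vector $x$ is denoted by $\hat{y} = \hat{f}(x)$. The relationship between the expected values and the inputs can be taken as $y = f(x) + e$, where $e$ is the normally distributed error given by:

$$e \sim \mathcal{N}(0, \sigma_e^2)$$

The expected squared error of the model at a point $x$ can then be decomposed as:

$$\mathrm{Err}(x) = \underbrace{\left(E[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right]}_{\text{variance}} + \underbrace{\sigma_e^2}_{\text{irreducible\_error}}$$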

The third term in the above equation, *irreducible_error*, represents the noise term and cannot fundamentally be reduced by any given model. If, hypothetically, infinite data were available, it would be possible to tune the model to reduce the bias and variance terms to *zero*, but it is not possible to do so in practice. Hence, there is always a tradeoff between the minimization of bias and variance.

Detection of Bias and Variance of a model

In model building, it is imperative to be able to detect whether the model is suffering from high bias or high variance. The methods to detect high bias and high variance are given below:

- Detection of High Bias:
  - The model suffers from a very High Training Error.
  - The Validation error is similar in magnitude to the training error.
  - The model is underfitting.

- Detection of High Variance:
  - The model suffers from a very Low Training Error.
  - The Validation error is very high when compared to the training error.
  - The model is overfitting.
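The detection rules above can be turned into a simple rule-of-thumb check. The sketch below is only an illustration: the function name `diagnose`, the `acceptable_error` level, and the `gap_ratio` threshold are hypothetical values that would have to be chosen for each problem. The errors passed in at the end are the article's training and testing errors, used here as stand-ins for training and validation errors.

```python
def diagnose(train_error, val_error, acceptable_error, gap_ratio=2.0):
    """Classify a model from its training and validation errors.

    acceptable_error and gap_ratio are hypothetical, problem-specific thresholds.
    """
    if train_error > acceptable_error:
        # High training error: the model is underfitting.
        return "high bias"
    if val_error > gap_ratio * train_error:
        # Low training error but much higher validation error: overfitting.
        return "high variance"
    return "balanced"

linear_diagnosis = diagnose(3.6, 3.6, acceptable_error=5.0)
polynomial_diagnosis = diagnose(1.7, 929.12, acceptable_error=5.0)
```

The training error is checked first because a model that cannot even fit its training data needs that problem addressed before the gap to the validation error is meaningful.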

A graphical method to detect whether a model is suffering from High Bias or High Variance is shown below:

The graph shows the change in error rate with respect to model complexity for training and validation error.

- The left portion of the graph suffers from High Bias. This can be seen as the training error is quite high along with the validation error. In addition to that, model complexity is quite low.
- The right portion of the graph suffers from High Variance. This can be seen as the training error is very low, yet the validation error is very high and starts increasing with increasing model complexity.
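The curves described above can be approximated numerically by sweeping the model complexity and recording both errors at each setting. The sketch below assumes NumPy, a made-up noisy sine dataset, and polynomial degree as the measure of complexity; the data and the degree range are illustrative assumptions and do not reproduce the article's graph.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up dataset: a sine-shaped relationship plus noise.
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 60)
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

degrees = list(range(1, 11))
train_errors, val_errors = [], []
for degree in degrees:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_errors.append(mse(coeffs, x_train, y_train))
    val_errors.append(mse(coeffs, x_val, y_val))

# The preferred complexity is the one with the lowest validation error.
best_degree = degrees[int(np.argmin(val_errors))]

for degree, tr, va in zip(degrees, train_errors, val_errors):
    print(f"degree {degree:2d}: train={tr:.3f}  val={va:.3f}")
print("best degree by validation error:", best_degree)
```

Low degrees where both errors are high correspond to the high-bias side of the graph. Degrees where the training error keeps shrinking while the validation error stops improving or grows correspond to the high-variance side.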

A systematic approach to solve a Bias-Variance Problem by Dr. Andrew Ng:

Dr. Andrew Ng proposed a simple-to-follow, step-by-step procedure to detect and solve High Bias and High Variance errors in a model. The block diagram is shown below:

Detection and Solution to High Bias problem - if the training error is high:

- **Train longer:** High bias usually indicates a less complex model, which may need more training iterations to learn the relevant patterns. Hence, longer training sometimes resolves the error.
- **Train a more complex model:** As mentioned above, high bias is a result of less than optimal complexity in the model. Hence, to avoid high bias, the existing model can be swapped out for a more complex model.
- **Obtain more features:** It is often possible that the existing dataset lacks the essential features required for effective pattern recognition. To remedy this problem:
  - More features can be collected for the existing data.
  - Feature Engineering can be performed on existing features to extract more non-linear features.
- **Decrease regularization:** Regularization is a process that decreases model complexity by constraining the model at different stages of training, which promotes generalization and prevents overfitting. Decreasing regularization allows the model to learn the training dataset better.
- **New model architecture:** If all of the above-mentioned methods fail to deliver satisfactory results, then it is suggested to try out other model architectures.

Detection and Solution to High Variance problem - if the validation error is high:

- **Obtain more data:** High variance is often caused by a lack of training data. The model complexity and the quantity of training data need to be balanced: a model of higher complexity requires a larger quantity of training data. Hence, if the model is suffering from high variance, more data can reduce the variance.
- **Decrease number of features:** If the dataset consists of too many features for each data-point, the model often starts to suffer from high variance and starts to overfit. Hence, decreasing the number of features is recommended.
- **Increase Regularization:** As mentioned above, regularization is a process to decrease model complexity. Hence, if the model is suffering from high variance (which is caused by a complex model), then an increase in regularization can decrease the complexity and help the model generalize better.
- **New model architecture:** Similar to the solution for a model suffering from high bias, if all of the above-mentioned methods fail to deliver satisfactory results, then it is suggested to try out other model architectures.
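To make the effect of regularization concrete, the sketch below uses ridge (L2) regularization on a polynomial model, which is one common form of regularization and is not necessarily the one the article has in mind. It assumes NumPy and a made-up noisy dataset. The function name `ridge_polyfit` and the two regularization strengths are illustrative, and for simplicity the penalty is applied to every coefficient, including the intercept.

```python
import numpy as np

def ridge_polyfit(x, y, degree, lam):
    """Polynomial least squares with an L2 penalty of strength lam on the coefficients."""
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    penalty = lam * np.eye(degree + 1)
    # Closed-form ridge solution: (X^T X + lam * I) w = X^T y
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 15)

weak = ridge_polyfit(x, y, degree=9, lam=1e-6)   # almost no regularization
strong = ridge_polyfit(x, y, degree=9, lam=1.0)  # heavy regularization

weak_norm = float(np.linalg.norm(weak))
strong_norm = float(np.linalg.norm(strong))
print(f"coefficient norm with weak penalty:   {weak_norm:.2f}")
print(f"coefficient norm with strong penalty: {strong_norm:.2f}")
```

A larger penalty shrinks the coefficients, which restricts how sharply the fitted curve can bend. That lowers variance at the cost of some added bias, so the regularization strength is itself a hyperparameter to tune on the Validation Set.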

To summarize, Bias and Variance play a major role in the training process of a model. It is necessary to reduce each of these errors to the minimum possible value. However, it should be kept in mind that an effort to decrease one of them beyond a certain limit increases the likelihood of the other increasing. This phenomenon is called the Bias-Variance Tradeoff and is a key consideration during model building.
