## Search

Machine learning Filter

# Overfitting and Underfitting With Algorithms in Machine Learning

5976
• by Animikh Aich
• 05th Aug, 2019
• Last updated on 11th Mar, 2021 Curve fitting is the process of determining the best fit mathematical function for a given set of data points. It examines the relationship between multiple independent variables (predictors) and a dependent variable (response) in order to determine the “best fit” line. In the figure shown, the red line represents the curve that is the best fit for the given purple data points. It can also be seen that curve fitting does not necessarily mean that the curve should pass over each and every data point. Instead, it is the most appropriate curve that represents all the data points adequately.

## Curve Fitting vs. Machine Learning

As discussed, curve fitting refers to finding the “best fit” curve or line for a given set of data points. Even though this is also what a part of Machine Learning or Data Science does, the applications of Machine Learning or Data Science far outweigh that of Curve Fitting.

The major difference is that during Curve Fitting, the entire data is available to the developer. However, when it comes to Machine Learning, the amount of data available to the developer is only a part of the real-world data on which the Fitted Model will be applied.

Even then, Machine Learning is a vast interdisciplinary field and it consists of a lot more than just “Curve Fitting”. Machine Learning can be broadly classified into Supervised, Unsupervised and Reinforcement Learning. Considering the fact that most of the real-world problems are solved by Supervised Learning, this article concentrates on Supervised Learning itself.

Supervised learning can be further classified into Classification and Regression. In this case, the work done by Regression is similar to what Curve Fitting achieves.

To get a broader idea, let’s look at the difference between Classification and Regression:

ClassificationRegression
It is the process of separating/classifying two or more types of data into separate categories or classes based on their characteristics.It is the process of determining the “Best Fit” curve for the given data such that, on unseen data, the data points lying on the curve accurately represent the desired result.
The output values are discrete in nature (eg. 0, 1, 2, 3, etc) and are known as “Classes”.The output values are continuous in nature (eg. 0.1, 1.78, 9.54, etc). Here, the two classes (red and blue colored points) are clearly separated by the line(s) in the middle. This is an example of classification. Here, the curve represented by the magenta line is the “Best Fit” line for all the data points as shown. This is an example of Regression.

## Noise in Data

The data that is obtained from the real world is not ideal or noise-free. It contains a lot of noise, which needs to be filtered out before applying the Machine Learning Algorithms. As shown in the above image, the few extra data points in the top of the left graph represent unnecessary noise, which in technical terms is known as “Outliers”. As shown in the difference between the left and the right graphs, the presence of outliers makes a considerable amount of difference when it comes to the determination of the “Best Fit” line. Hence, it is of immense importance to apply preprocessing techniques in order to remove outliers from the data.

### Let us look at two of the most common types of noise in Data:

Outliers: As already discussed, outliers are data points which do not belong to the original set of data. These data points are either too high or too low in value, such that they do not belong to the general distribution of the rest of the dataset. They are usually due to misrepresentation or an accidental entry of wrong data. There are several statistical algorithms which are used to detect and remove such outliers.

Missing Data: In sharp contrast to outliers, missing data is another major challenge when it comes to the dataset. The occurrence is quite common in tabular datasets (eg. CSV files) and is a challenge if the number of missing data points exceeds 10% of the total size of the dataset. Most Machine Learning algorithms fail to perform on such datasets. However, certain algorithms such as Decision Trees are quite resilient when it comes to data with missing data and are able to provide accurate results even when supplied with such noisy datasets. Similar to Outliers, there are statistical methods to handle missing data or “NaN” (Not a Number) values. The most common of them is to remove or “drop” the row containing the missing data.

## Training of Data

“Training” is terminology associated with Machine Learning and it basically means the “Fitting” of data or “Learning” from data. This is the step where the Model starts to learn from the given data in order to be able to predict on similar but unseen data. This step is crucial since the final output (or Prediction) of the model will be based on how well the model was able to acquire the patterns of the training data.

Training in Machine Learning: Depending on the type of data, the training methodology varies. Hence, here we assume simple tabular (eg. CSV) text data. Before the model can be fitted on the data, there are a few steps that have to be followed:

• Data Cleaning/Preprocessing: The raw data that is thus obtained from the real-world is likely to contain a good amount of noise in it. In addition to that, the data might not be homogenous, which means, the values of different “features” might belong to different ranges. Hence, after the removal of noise, the data needs to be normalized or scaled in order to make it homogeneous.
• Feature Engineering: In a tabular dataset, all the columns that describe the data are called “Features”. These features are necessary to correctly predict the target value. However, data often contains columns which are irrelevant to the output of the model. Hence, these columns need to be removed or statistically processed to make sure that they do not interfere with the training of the model on features that are relevant. In addition to the removal of irrelevant features, it is often required to create new relevant features from the existing features. This allows the model to learn better and this process is also called “Feature Extraction”.
• Train, Validation and Test Split: After the data has been preprocessed and is ready for training, the data is split into Training Data, Validation Data and Testing Data in the ratio of 60:20:20 (usually). This ratio varies depending on the availability of data and on the application. This is done to ensure that the model does not unnecessarily “Overfit” or “Underfit”, and performs equally well when deployed in the real world.
• Training: Finally, as the last step,  the Training Data is fed into the model to train upon. Multiple models can be trained simultaneously and their performance can be measured against each other with the help of the Validation Set, based on which the best model is selected. This is called “Model Selection”. Finally, the selected model is used to predict on the Test Set to get a final test score, which more or less accurately defines the performance of the model on the given dataset.

Training in Deep Learning: Deep Learning is a part of machine learning, but instead of relying on statistical methods, Deep Learning Techniques largely depend on calculus and aims to mimic the Neural structure of the biological brain, and hence, are often referred to as Neural Networks.

The training process for Deep Learning is quite similar to that of Machine Learning except that there is no need for “Feature Engineering”. Since deep learning models largely rely on weights to specify the importance of given input (feature), the model automatically tends to learn which features are relevant and which feature is not. Hence, it assigns a “high” weight to the features that are relevant and assigns a “low” weight to the features that are not relevant. This removes the need for a separate Feature Engineering.

This difference is correctly portrayed in the following figure: Improper Training of Data: As discussed above, the training of data is the most crucial step of any Machine Learning Algorithm. Improper training can lead to drastic performance degradation of the model on deployment. On a high level, there are two main types of outcomes of Improper Training: Underfitting and Overfitting.

## Underfitting

When the complexity of the model is too less for it to learn the data that is given as input, the model is said to “Underfit”. In other words, the excessively simple model fails to “Learn” the intricate patterns and underlying trends of the given dataset. Underfitting occurs for a model with Low Variance and High Bias.

Underfitting data Visualization: With the initial idea out of the way, visualization of an underfitting model is important. This helps in determining if the model is underfitting the given data during training. As already discussed, supervised learning is of two types: Classification and Regression. The following graphs show underfitting for both of these cases:

• Classification: As shown in the figure below, the model is trained to classify between the circles and crosses. However, it is unable to do so properly due to the straight line, which fails to properly classify either of the two classes. • Regression: As shown in the figure below, the data points are laid out in a given pattern, but the model is unable to “Fit” properly to the given data due to low model complexity. Detection of underfitting model: The model may underfit the data, but it is necessary to know when it does so. The following steps are the checks that are used to determine if the model is underfitting or not.

1. Training and Validation Loss: During training and validation, it is important to check the loss that is generated by the model. If the model is underfitting, the loss for both training and validation will be significantly high. In terms of Deep Learning, the loss will not decrease at the rate that it is supposed to if the model has reached saturation or is underfitting.
2. Over Simplistic Prediction Graph: If a graph is plotted showing the data points and the fitted curve, and the curve is over-simplistic (as shown in the image above), then the model is suffering from underfitting. A more complex model is to be tried out.
1. Classification: A lot of classes will be misclassified in the training set as well as the validation set. On data visualization, the graph would indicate that if there was a more complex model, more classes would have been correctly classified.
2. Regression: The final “Best Fit” line will fail to fit the data points in an effective manner. On visualization, it would clearly seem that a more complex curve can fit the data better.

Fix for an underfitting model: If the model is underfitting, the developer can take the following steps to recover from the underfitting state:

1. Train Longer: Since underfitting means less model complexity, training longer can help in learning more complex patterns. This is especially true in terms of Deep Learning.
2. Train a more complex model: The main reason behind the model to underfit is using a model of lesser complexity than required for the data. Hence, the most obvious fix is to use a more complex model. In terms of Deep Learning, a deeper network can be used.
3. Obtain more features: If the data set lacks enough features to get a clear inference, then Feature Engineering or collecting more features will help fit the data better.
4. Decrease Regularization: Regularization is the process that helps Generalize the model by avoiding overfitting. However, if the model is learning less or underfitting, then it is better to decrease or completely remove Regularization techniques so that the model can learn better.
5. New Model Architecture: Finally, if none of the above approaches work, then a new model can be used, which may provide better results.

## Overfitting

When the complexity of the model is too high as compared to the data that it is trying to learn from, the model is said to “Overfit”. In other words, with increasing model complexity, the model tends to fit the Noise present in data (eg. Outliers). The model learns the data too well and hence fails to Generalize. Overfitting occurs for a model with High Variance and Low Bias.

Overfitting data Visualization: With the initial idea out of the way, visualization of an overfitting model is important. Similar to underfitting, overfitting can also be showcased in two forms of supervised learning: Classification and Regression. The following graphs show overfitting for both of these cases:

• Classification: As shown in the figure below, the model is trained to classify between the circles and crosses, and unlike last time, this time the model learns too well. It even tends to classify the noise in the data by creating an excessively complex model (right). • Regression: As shown in the figure below, the data points are laid out in a given pattern, and instead of determining the least complex model that fits the data properly, the model on the right has fitted the data points too well when compared to the appropriate fitting (left). Detection of overfitting model: The parameters to look out for to determine if the model is overfitting or not is similar to those of underfitting ones. These are listed below:

1. Training and Validation Loss: As already mentioned, it is important to measure the loss of the model during training and validation. A very low training loss but a high validation loss would signify that the model is overfitting. Additionally, in Deep Learning, if the training loss keeps on decreasing but the validation loss remains stagnant or starts to increase, it also signifies that the model is overfitting.
2. Too Complex Prediction Graph: If a graph is plotted showing the data points and the fitted curve, and the curve is too complex to be the simplest solution which fits the data points appropriately, then the model is overfitting.
1. Classification: If every single class is properly classified on the training set by forming a very complex decision boundary, then there is a good chance that the model is overfitting.
2. Regression: If the final “Best Fit” line crosses over every single data point by forming an unnecessarily complex curve, then the model is likely overfitting.

Fix for an overfitting model: If the model is overfitting, the developer can take the following steps to recover from the overfitting state:

1. Early Stopping during Training: This is especially prevalent in Deep Learning. Allowing the model to train for a high number of epochs (iterations) may lead to overfitting. Hence it is necessary to stop the model from training when the model has started to overfit. This is done by monitoring the validation loss and stopping the model when the loss stops decreasing over a given number of epochs (or iterations).
2. Train with more data: Often, the data available for training is less when compared to the model complexity. Hence, in order to get the model to fit appropriately, it is often advisable to increase the training dataset size.
3. Train a less complex model: As mentioned earlier, the main reason behind overfitting is excessive model complexity for a relatively less complex dataset. Hence it is advisable to reduce the model complexity in order to avoid overfitting. For Deep Learning, the model complexity can be reduced by reducing the number of layers and neurons.
4. Remove features: As a contrast to the steps to avoid underfitting, if the number of features is too many, then the model tends to overfit. Hence, reducing the number of unnecessary or irrelevant features often leads to a better and more generalized model. Deep Learning models are usually not affected by this.
5. Regularization: Regularization is the process of simplification of the model artificially, without losing the flexibility that it gains from having a higher complexity. With the increase in regularization, the effective model complexity decreases and hence prevents overfitting.
6. Ensembling: Ensembling is a Machine Learning method which is used to combine the predictions from multiple separate models. It reduces the model complexity and reduces the errors of each model by taking the strengths of multiple models. Out of multiple ensembling methods, two of the most commonly used are Bagging and Boosting.

## Generalization

The term “Generalization” in Machine Learning refers to the ability of a model to train on a given data and be able to predict with a respectable accuracy on similar but completely new or unseen data. Model generalization can also be considered as the prevention of overfitting of data by making sure that the model learns adequately.

Generalization and its effect on an Underfitting Model: If a model is underfitting a given dataset, then all efforts to generalize that model should be avoided. Generalization should only be the goal if the model has learned the patterns of the dataset properly and needs to generalize on top of that. Any attempt to generalize an already underfitting model will lead to further underfitting since it tends to reduce model complexity.

Generalization and its effect on Overfitting Model: If a model is overfitting, then it is the ideal candidate to apply generalization techniques upon. This is primarily because an overfitting model has already learned the intricate details and patterns of the dataset. Applying generalization techniques on this kind of a model will lead to a reduction of model complexity and hence prevent overfitting. In addition to that, the model will be able to predict more accurately on unseen, but similar data.

Generalization Techniques: There are no separate Generalization techniques as such, but it can easily be achieved if a model performs equally well in both training and validation data. Hence, it can be said that if we apply the techniques to prevent overfitting (eg. Regularization, Ensembling, etc.) on a model that has properly acquired the complex patterns, then a successful generalization of some degree can be achieved.

## Relationship between Overfitting and Underfitting with Bias-Variance Tradeoff

Bias-Variance Tradeoff: Bias denotes the simplicity of the model. A high biased model will have a simpler architecture than that of a model with a lower bias. Similarly, complementing Bias, Variance denotes how complex the model is and how well it can fit the data with a high degree of diversity.

An ideal model should have Low Bias and Low Variance. However, when it comes to practical datasets and models, it is nearly impossible to achieve a “zero” Bias and Variance. These two are complementary of each other, if one decreases beyond a certain limit, then the other starts increasing. This is known as the Bias-Variance Tradeoff. Under such circumstances, there is a “sweet spot” as shown in the figure, where both bias and variance are at their optimal values. Bias-Variance and Generalization: As it is clear from the above graph, the Bias and Variance are linked to Underfitting and Overfitting.  A model with high Bias means the model is Underfitting the given data and a model with High Variance means the model is Overfitting the given data.

Hence, as it can be seen, at the optimal region of the Bias-Variance tradeoff, the model is neither underfitting nor overfitting. Hence, since there is neither underfitting nor overfitting, it can also be said that the model is most Generalized, as under these conditions the model is expected to perform equally well on Training and Validation Data. Thus, the graph depicts that the Generalization Error is minimum at the optimal value of the degree of Bias and Variance.

Conclusion

To summarize, the learning capabilities of a model depend on both, model complexity and data diversity. Hence, it is necessary to keep a balance between both such that the Machine Learning Models thus trained can perform equally well when deployed in the real world.

In most cases, Overfitting and Underfitting can be taken care of in order to determine the most appropriate model for the given dataset. However, even though there are certain rule-based steps that can be followed to improve a model, the insight to achieve a properly Generalized model comes with experience. ### Animikh Aich

Computer Vision Engineer

Animikh Aich is a Deep Learning enthusiast, currently working as a Computer Vision Engineer. His work includes three International Conference publications and several projects based on Computer Vision and Machine Learning.

## Join the Discussion

praveen sharma 06 Aug 2019

Thank you very much for your article, very informative and explained easy ways. This information is very helpful for my further career!

pravallika 16 Aug 2019

Nice written useful and good understanding thanks.

Krunal 20 Aug 2019

Awesome Post :) Thanks for this.

## Top Data Analytics Certifications

5631
Top Data Analytics Certifications

What is data analytics?In the world of IT, every s... Read More