When a model fits the input dataset properly, the machine learning application performs well and predicts relevant output with good accuracy. Yet many machine learning applications do not perform well, and there are two main causes of this: underfitting and overfitting. We will look at both situations in detail in this post.
Let us understand overfitting from a supervised machine learning algorithm’s perspective. A supervised algorithm’s sole purpose is to generalize well to never-before-seen data, i.e. the model’s ability to produce relevant output for inputs it has not encountered during training.
Consider the below set of points which would be required to fit a Linear Regression model:
The aim of Linear Regression is to find a straight line that fits/captures all or most of the data points present in the dataset.
It looks like the model has captured all the data points and learnt well. But now consider a new point being exposed to this model. Since the model has learnt the training data too well, it wouldn’t be able to generalize to this new data point.
With respect to a Linear Regression algorithm, when it is fed the input dataset, the general idea is to reduce the overall cost (the distance between the generated straight line and the input data points). The cost decreases as the number of training iterations increases. But if the number of iterations is too high, the model learns too well: it memorizes the noise present in the dataset (which in reality should be ignored) and therefore cannot generalize.
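As an illustration (a minimal sketch with made-up data and variable names, not the exact setup of the figures in this post), letting a model learn "too well" can be simulated by fitting a high-degree polynomial to a handful of noisy points: the training error drops towards zero, but a prediction outside the training range becomes unreliable compared to a simple straight-line fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy points scattered around the straight line y = 2x + 1
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + 1 + rng.normal(0, 0.2, size=10)

# A never-before-seen point just outside the training range
x_new, y_new = 1.1, 2 * 1.1 + 1

simple = np.poly1d(np.polyfit(x_train, y_train, deg=1))    # straight line
wiggly = np.poly1d(np.polyfit(x_train, y_train, deg=9))    # passes through every point

# Training error: the degree-9 fit is near-perfect on the data it has seen
train_err_simple = np.mean((simple(x_train) - y_train) ** 2)
train_err_wiggly = np.mean((wiggly(x_train) - y_train) ** 2)

# Error on the new point: the memorized noise makes the wiggly fit unreliable
new_err_simple = abs(simple(x_new) - y_new)
new_err_wiggly = abs(wiggly(x_new) - y_new)

print(f"train MSE  -> line: {train_err_simple:.4f}, degree-9: {train_err_wiggly:.6f}")
print(f"new point error -> line: {new_err_simple:.4f}, degree-9: {new_err_wiggly:.4f}")
```

The degree-9 polynomial "wins" on the training data only because it has also fitted the noise, which is exactly the overfitting behaviour described above.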
Note: Model training can be stopped at a certain point in time, depending on certain conditions being met (a technique known as early stopping).
This phenomenon is known as ‘overfitting’. The model overfits the data and hence doesn’t generalize well to newly encountered data.
Underfitting is the opposite of overfitting. The aim of the machine learning algorithm is to generalize well, but not to learn too much; equally, the model shouldn’t learn too little, or it would fail to capture the essential patterns in the data and wouldn’t be able to produce meaningful output for new data points.
Note: If model training is stopped prematurely, it could lead to underfitting, i.e. the model not being trained sufficiently, due to which it wouldn’t be able to capture the vital patterns in the data. Such a model would not produce satisfactory results.
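The stopping conditions mentioned in the notes above are commonly implemented as early stopping on a validation set: keep training while the validation error improves, and stop once it stalls. Below is a minimal sketch using gradient descent on a straight-line fit (the data, learning rate, and patience value are illustrative assumptions, not values from this post):

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy linear data, split into training and validation sets
x = rng.uniform(0, 1, 60)
y = 3 * x + 2 + rng.normal(0, 0.3, size=60)
x_tr, y_tr, x_val, y_val = x[:40], y[:40], x[40:], y[40:]

w, b, lr = 0.0, 0.0, 0.1
best_val, best_params = float("inf"), (w, b)
patience, bad_epochs = 20, 0   # how many non-improving epochs to tolerate

for epoch in range(5000):
    # Gradient step on the training-set mean squared error
    err = w * x_tr + b - y_tr
    w -= lr * 2 * np.mean(err * x_tr)
    b -= lr * 2 * np.mean(err)

    # Monitor the error on data the model is NOT trained on
    val_loss = np.mean((w * x_val + b - y_val) ** 2)
    if val_loss < best_val:
        best_val, best_params, bad_epochs = val_loss, (w, b), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # validation loss stopped improving
            break

w, b = best_params
print(f"stopped near epoch {epoch}: w={w:.2f}, b={b:.2f}, val MSE={best_val:.4f}")
```

Stopping too early (a tiny epoch budget or zero patience) would leave `w` and `b` far from the true slope and intercept, which is the underfitting scenario; stopping on validation error avoids both extremes.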
Consider the below image which shows how underfitting looks visually:
The dashed blue line is a model that underfits the data. The black parabola is the curve that fits the data points well.
The consequence of underfitting is a model that performs poorly even on the training data, and therefore cannot generalize to newly seen data, leading to unreliable predictions.
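To make this concrete, here is a small sketch with made-up data mimicking the parabola above: a straight line fitted to quadratic data leaves a large residual error, while a degree-2 fit captures the pattern.

```python
import numpy as np

rng = np.random.default_rng(2)

# Data generated from a parabola; a straight line is too simple for it
x = np.linspace(-1, 1, 30)
y = x ** 2 + rng.normal(0, 0.05, size=30)

line = np.poly1d(np.polyfit(x, y, deg=1))      # underfits: misses the curvature
parabola = np.poly1d(np.polyfit(x, y, deg=2))  # matches the data-generating shape

mse_line = np.mean((line(x) - y) ** 2)
mse_parabola = np.mean((parabola(x) - y) ** 2)
print(f"line MSE: {mse_line:.4f}, parabola MSE: {mse_parabola:.4f}")
```

The line’s error stays large no matter how it is positioned, because the model family itself cannot represent the pattern in the data.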
Underfitting and overfitting are equally bad; the model needs to fit the data just right.
In this post, we covered the concepts of overfitting and underfitting.