Many articles have been written about overfitting and underfitting in machine learning, but virtually all of them are merely lists of tools: "Top 10 tools for dealing with overfitting and underfitting," "best strategies: how to avoid overfitting in machine learning," or "best strategies: how to avoid underfitting in machine learning." It's like being shown nails but not being told how to hammer them. Underfitting and overfitting can be highly perplexing for people trying to figure out how they work. Furthermore, most of these articles ignore underfitting as if it did not exist.
In this article, I'd like to outline the fundamental principles for improving the quality of your model and, as a result, avoiding underfitting and overfitting, using specific examples. It is challenging to discuss this problem precisely because it is highly generic and can affect any method or model. But I want to explain why underfitting and overfitting happen and why a certain approach should be employed.
Before we get into what overfitting and underfitting in machine learning are, let's define some terms that will help us understand this topic better:
- Signal: It's the actual underlying pattern of the data that enables the machine learning model to derive knowledge from the data.
- Noise: Irrelevant or random variation in the data that obscures the signal and lowers the model's performance.
- Bias: The prediction error introduced when a model oversimplifies the problem. It shows up as a systematic discrepancy between predicted and actual values.
- Variance: The sensitivity of a model to the particular training set it was given. High variance shows up when a model performs well on the training dataset but poorly on the test dataset.
What is Overfitting in Machine Learning?
Overfitting occurs when a statistical model fits its training data too closely. When this happens, the algorithm cannot perform accurately on unseen data, which defeats its purpose. Generalizing a model to new data is what allows us to use machine learning algorithms to make predictions and classify data every day.
To train the model, machine learning algorithms use sample datasets. The model, however, may begin to learn the "noise" or irrelevant information within the dataset if it trains on sample data for an excessively long time or if the model is overly complex. The model is "overfitted" when it memorizes the noise and fits the training data set too closely, thus preventing it from generalizing well to new data. A model won't be able to accomplish the classification or prediction tasks for which it was designed if it is not capable of making good generalizations to new data.
A low training error combined with high variance indicates overfitting. To detect it, a portion of the training dataset is typically set aside as the "test set." If the model has a low error rate on the training data and a high error rate on the test data, it is overfitting.
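The hold-out check can be sketched in a few lines. This is a minimal illustration, assuming a synthetic noisy sine dataset and a deliberately over-complex degree-12 polynomial; the dataset, the degree, and the helper names `make_data` and `mse` are choices made for this demo, not part of any particular workflow.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data: a sine-wave signal plus Gaussian noise (an assumption for this demo)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y

def mse(model, x, y):
    """Mean squared error of a fitted model on the given points."""
    return float(np.mean((model(x) - y) ** 2))

# Hold out data that the model never sees during fitting.
x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

# A degree-12 polynomial has almost as many parameters as there are training points.
model = Polynomial.fit(x_train, y_train, deg=12)

train_error = mse(model, x_train, y_train)
test_error = mse(model, x_test, y_test)
print(f"train MSE = {train_error:.4f}, test MSE = {test_error:.4f}")
```

A training error far below the test error is the signature of overfitting described above; the exact numbers depend on the random seed.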
1. Overfitting Example
Assume you are performing fraud detection on credit card applications. Tens of thousands of examples are available to you, most of them from Jharkhand. You have only seven examples from Gujarat, however: five are in the training set and two are in the validation set. All seven are labeled as fraudulent. As a result, your algorithm will most likely learn that all Gujarat residents are fraudsters, and the two cases in the validation set will appear to confirm that hypothesis. Consequently, no one from Gujarat will be approved for a credit card. That is a problem. Your algorithm may perform admirably on average, which is what produces profit. It is not overfitting in general, but it is overfitting for some groups, like Gujarat residents, who will now always be denied a credit card. This should be treated as a significant issue, and in practice it is frequently more subtle than this.
2. Reasons for Overfitting
Let us see what causes overfitting in machine learning:
- High variance and low bias.
- The model is too complex.
- The training dataset is too small.
What is Underfitting in Machine Learning?
Underfitting is a data science scenario in which a data model cannot effectively represent the connection between the input and output variables, resulting in a high error rate on both the training set and unseen data.
It happens when a model is overly simplistic, which might mean that the model needs more training time, more input features, or less regularization.
When a model is under-fitted, it cannot identify the dominant trend in the data, resulting in training errors and poor model performance. Furthermore, a model that does not generalize effectively to new data cannot be used for classification or prediction tasks. Generalizing a model to new data is what allows us to use machine learning algorithms to make predictions and categorize data every day.
Indicators of underfitting include high bias and low variance. Since this behavior is visible on the training dataset itself, under-fitted models are typically easier to spot than overfitted ones.
1. Underfitting Example
It is like giving a student too little study material. The student is not adequately prepared and will not perform well in exams. The solution is simple: train the student well.
2. Reasons for Underfitting
- High bias and low variance.
- The training dataset is too small.
- The model is too simple.
- The training data has not been cleaned and contains noise.
What is a Good Fit in Machine Learning?
A good fit model is a well-balanced model that is free of underfitting and overfitting. This excellent model provides a high accuracy score during training and performs well during testing.
To find the best-fit model, track the performance of a machine learning model on the training and test data over time. As the algorithm learns, the model's error on the training data decreases, as does the error on the test dataset. However, if you train the model for too long, it may pick up extraneous detail and noise in the training set, leading to overfitting, at which point the test error starts to rise. To attain a good fit, stop training just before the test error begins to rise.
Detecting Overfitting or Underfitting
There are a few ways we can understand how to "diagnose" underfitting and overfitting.
Underfitting occurs when your model produces inaccurate predictions even on the data it was trained on. In this scenario, the training error is large, as is the validation/test error.
Overfitting occurs when your model fails to generate correct predictions on new data. In this case the training error is very small, but the validation/test error is large.
When you have a good model, the training error is small (albeit somewhat larger than in the case of overfitting), and the validation/test error is also small.
As a general intuition, remember that underfitting arises when your model is too simple for your data. Conversely, overfitting happens when your model is too complex for your data.
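These diagnostic rules can be written down as a small helper. This is only a sketch: the function name `diagnose_fit` and the two thresholds `acceptable_error` and `max_gap` are hypothetical, and sensible threshold values depend entirely on your problem and error metric.

```python
def diagnose_fit(train_error, val_error, acceptable_error, max_gap):
    """Classify a model from its training and validation errors.

    acceptable_error and max_gap are problem-specific thresholds that you must
    choose yourself; they are assumptions of this sketch, not standard values.
    """
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on.
        return "underfitting"
    if val_error - train_error > max_gap:
        # Good on training data, much worse on unseen data.
        return "overfitting"
    return "good fit"

print(diagnose_fit(0.40, 0.42, acceptable_error=0.10, max_gap=0.05))  # underfitting
print(diagnose_fit(0.01, 0.35, acceptable_error=0.10, max_gap=0.05))  # overfitting
print(diagnose_fit(0.05, 0.07, acceptable_error=0.10, max_gap=0.05))  # good fit
```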
How to Prevent Overfitting and Underfitting in Models
While detecting overfitting and underfitting is beneficial, it does not address the problem. Fortunately, you have various alternatives to consider. These are some of the most common remedies.
Underfitting can often be remedied by moving on and experimenting with different, more expressive machine learning techniques. Overfitting, by contrast, calls for a wider range of countermeasures.
There are several methods for preventing overfitting. First, let us see how to avoid overfitting in machine learning:
1. Cross-validation
- Cross-validation is an effective preventive approach against overfitting.
- Make many small train-test splits from your initial training data. Fine-tune your model using these splits.
- In typical k-fold cross-validation, we divide the data into k subsets called folds. The model is then trained iteratively on k-1 folds, with the remaining fold serving as the test set (called the "holdout fold").
- Through cross-validation, you can tune hyperparameters using only your original training dataset. This lets you keep your test dataset as truly unseen data for choosing your final model.
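The k-fold procedure can be sketched by hand with NumPy. The example below is illustrative only: it assumes synthetic sine data, uses the polynomial degree as the hyperparameter being tuned, and the helper `k_fold_mse` is a name invented for this demo.

```python
import numpy as np
from numpy.polynomial import Polynomial

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Return the holdout MSE of each of the k folds for a polynomial of the given degree."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        holdout = folds[i]
        # Train on the other k-1 folds, evaluate on the holdout fold.
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = Polynomial.fit(x[train], y[train], deg=degree)
        scores.append(float(np.mean((model(x[holdout]) - y[holdout]) ** 2)))
    return scores

# Synthetic training data (an assumption for this demo); the test set is not touched here.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 60)

# Tune the hyperparameter (polynomial degree) using only cross-validation scores.
candidate_degrees = [1, 3, 5, 9]
cv_error = {d: float(np.mean(k_fold_mse(x, y, d))) for d in candidate_degrees}
best_degree = min(cv_error, key=cv_error.get)
print(cv_error, "-> best degree:", best_degree)
```

Because the degree is selected from cross-validation scores alone, a separate test set can still serve as unseen data for the final evaluation.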
2. More data for training
- It won't always work, but training with additional data can help algorithms detect the signal more accurately.
- As additional training data is fed into the model, it will be unable to overfit all of the samples and will be forced to generalize to provide results.
- Users should continue to collect data to improve the model's accuracy.
- However, because this approach is costly, users should ensure that the data is valuable and clean.
- This is not always true, of course. For example, the strategy will not work if the additional data is noisy, so always make sure that your data is clean and relevant.
3. Data augmentation
- Data augmentation, which is less expensive than collecting extra training data, is an alternative to the previous approach.
- If you are unable to acquire new data continuously, you can make the existing data sets appear more varied.
- Data augmentation slightly changes a data sample every time the model processes it. This makes each sample look new to the model and prevents the model from memorizing the specifics of individual samples.
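For image data, augmentation might look like the sketch below. The helper `augment`, the two transformations (a random horizontal flip and small Gaussian noise), and the `noise_std` default are illustrative assumptions; real pipelines choose transformations that keep the label valid for the task at hand.

```python
import numpy as np

def augment(image, rng, noise_std=0.02):
    """Return a randomly altered copy of an image with values in [0, 1].

    The two transformations (horizontal flip, small Gaussian noise) and the
    noise_std default are illustrative choices, not a recommended recipe.
    """
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]  # mirror left-to-right
    out = out + rng.normal(0.0, noise_std, out.shape)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # stand-in for a real grayscale training image

# Each epoch the model would see a slightly different version of the same sample.
epoch_views = [augment(image, rng) for _ in range(3)]
print([view.shape for view in epoch_views])
```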
4. Reduce Complexity or Simplify Data
- Overfitting can arise as a result of a model's complexity, such that even with vast amounts of data, the model manages to overfit the training dataset.
- The data simplification approach reduces overfitting by lowering the model's complexity until it is simple enough that it no longer overfits.
- Pruning a decision tree, lowering the number of parameters in a neural network, and utilizing dropout on a neural network are some operations that may be executed.
- Simplifying the model can also make it lighter and faster to run.
5. Regularization
- Regularization refers to various strategies for pushing your model to be simpler.
- The approach you choose will be determined by the learner you are using. You could, for example, prune a decision tree, perform dropout on a neural network, or add a penalty parameter to a regression cost function.
- The regularization strength is frequently a hyperparameter, which means it can be tuned via cross-validation.
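As one concrete case, adding an L2 penalty to a least-squares cost function gives ridge regression. The sketch below assumes synthetic data and uses the closed-form solution; the function name `ridge_fit` and the values of `alpha` are chosen for illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Minimize ||X w - y||^2 + alpha * ||w||^2 using the closed-form solution."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data (an assumption for this demo): 30 samples, 10 features,
# of which only the first two actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.5, 30)

# A larger penalty shrinks the weights toward zero, i.e. toward a simpler model.
for alpha in [0.01, 1.0, 100.0]:
    w = ridge_fit(X, y, alpha)
    print(f"alpha={alpha:>6}: ||w|| = {np.linalg.norm(w):.3f}")
```

The penalty strength `alpha` plays the role of the hyperparameter mentioned above: too small and the model may still overfit, too large and it starts to underfit.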
6. Ensembling
- Ensembles are machine learning methods that combine predictions from several separate models. There are several ways to build an ensemble, but the two most prevalent are boosting and bagging.
- Boosting works by increasing the collective complexity of simple base models. It trains many weak learners in sequence, with each learner learning from the mistakes of the one before it.
- Boosting aims to increase the predictive flexibility of simple models.
- Boosting brings together the weak learners in the sequence to produce one strong learner.
- Bagging works by training a large number of strong learners in a parallel pattern and then merging them to improve their predictions.
- Bagging seeks to limit the likelihood of complicated models overfitting.
- Bagging then aggregates all strong learners to "smooth out" their predictions.
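Bagging can be sketched with bootstrap resampling and averaging. In this illustration the "strong learners" are degree-6 polynomials fitted to synthetic sine data; the function `bagged_predict` and all of its defaults are assumptions made for the demo.

```python
import numpy as np
from numpy.polynomial import Polynomial

def bagged_predict(x_train, y_train, x_new, n_models=25, degree=6, seed=0):
    """Average the predictions of polynomial models fitted on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(x_train)
    # Fix the fitting domain so every model scales its inputs the same way.
    domain = [x_train.min(), x_train.max()]
    predictions = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # sample n points with replacement
        model = Polynomial.fit(x_train[idx], y_train[idx], deg=degree, domain=domain)
        predictions.append(model(x_new))
    predictions = np.array(predictions)
    return predictions.mean(axis=0), predictions

rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 1.0, 40)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, 40)
x_new = np.linspace(0.1, 0.9, 5)

bagged, individual = bagged_predict(x_train, y_train, x_new)
print("spread of single models:", individual.std(axis=0).round(3))
print("bagged prediction:     ", bagged.round(3))
```

The individual models disagree with one another because each saw a different resample; averaging them is what smooths out their predictions.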
7. Early Termination
- When training a learning algorithm iteratively, you may assess how well each model iteration performs.
- Up to a certain number of iterations, new iterations improve the model. Beyond that point, however, the model begins to overfit the training data and its ability to generalize can deteriorate.
- Early stopping refers to halting the training process before the learner passes that point.
- This approach is now primarily employed in deep learning, while other techniques (such as regularization) are favored for conventional machine learning.
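A minimal sketch of early stopping, assuming a linear model trained by gradient descent on synthetic data. The function name and the `lr`, `max_epochs`, and `patience` values are hypothetical defaults for illustration; deep learning frameworks ship their own early-stopping utilities.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val, lr=0.01, max_epochs=2000, patience=20):
    """Gradient descent on a linear model, stopped once validation loss stops improving.

    lr, max_epochs and patience are illustrative defaults, not tuned values.
    """
    w = np.zeros(X_tr.shape[1])
    best_w, best_val, best_epoch = w.copy(), np.inf, 0
    for epoch in range(1, max_epochs + 1):
        grad = 2.0 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w = w - lr * grad
        val_loss = float(np.mean((X_val @ w - y_val) ** 2))
        if val_loss < best_val:
            # Remember the best weights seen so far.
            best_w, best_val, best_epoch = w.copy(), val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_w, best_val, best_epoch, epoch

# Synthetic data (an assumption for this demo): few samples, many features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = X[:, 0] + rng.normal(0.0, 1.0, 60)
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

best_w, best_val, best_epoch, last_epoch = train_with_early_stopping(X_tr, y_tr, X_val, y_val)
print(f"best validation MSE {best_val:.3f} at epoch {best_epoch}; stopped at epoch {last_epoch}")
```

Returning the weights from the best epoch, rather than the last one, is a common design choice: the final iterations before stopping are by definition no better on the validation data.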
Regularization can be applied to linear and SVM models.
The maximum depth of decision tree models can be reduced.
A dropout layer can be used to minimize overfitting in neural networks.
Let us see some techniques on how to prevent underfitting:
- Increase model complexity and increase the number of features by performing feature engineering.
- Add more parameters (degrees of freedom) to the model to make it more complex. Sometimes this means switching directly to a more sophisticated model, one that can capture more intricate relationships from the start (for example, an SVM with different kernels instead of logistic regression). If the method is already fairly sophisticated (e.g., a neural network or an ensemble model), add extra capacity to it, such as increasing the number of models in boosting. For neural networks, this includes adding more layers, more neurons in each layer, more connections between layers, more filters for a CNN, and so on.
- Remove noise from the data.
- Increase the number of epochs or increase the duration of training to get better results.
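The first remedy, adding features through feature engineering, can be illustrated on synthetic data with a quadratic signal. Everything here (the data, the helper names, the single engineered feature) is an assumption made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data with a quadratic signal (an assumption for this demo)."""
    x = rng.uniform(-3.0, 3.0, n)
    y = x ** 2 + rng.normal(0.0, 0.5, n)
    return x, y

def fit_and_score(features, x_train, y_train, x_test, y_test):
    """Least-squares fit on the given feature map; returns (train MSE, test MSE)."""
    w, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)
    train_mse = float(np.mean((features(x_train) @ w - y_train) ** 2))
    test_mse = float(np.mean((features(x_test) @ w - y_test) ** 2))
    return train_mse, test_mse

def linear(x):
    return np.column_stack([np.ones_like(x), x])

def quadratic(x):
    return np.column_stack([np.ones_like(x), x, x ** 2])  # one engineered feature

x_train, y_train = make_data(100)
x_test, y_test = make_data(100)

print("straight line  (train, test):", fit_and_score(linear, x_train, y_train, x_test, y_test))
print("with x^2 added (train, test):", fit_and_score(quadratic, x_train, y_train, x_test, y_test))
```

The straight line has a large error on both the training and the test data, which is the underfitting pattern; adding the squared feature lets the same least-squares learner capture the trend.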
Model Fit: Underfitting vs Overfitting
Let us see and understand the difference between overfitting and underfitting in machine learning with examples:
Overfitting, the opposite of underfitting, happens when a model has been over-trained or is overly complex, leading to high error rates on test data. Overfitting is more prevalent than underfitting, and underfitting often arises as a side effect of trying to avoid overfitting through a procedure known as "early stopping."
If undertraining or a lack of complexity leads to underfitting, a plausible preventative method would be to extend training time or incorporate more relevant inputs. However, if you overtrain the model or add too many features, it may overfit, resulting in low bias but significant variance (i.e., the bias-variance tradeoff). In this case, the statistical model fits too closely to its training data, preventing it from generalizing successfully to additional data points. It is crucial to remember that some models, such as decision trees or KNN, are more prone to overfitting than others.
If overtraining or model complexity causes overfitting, a sensible preventative approach would be to either stop the training process sooner, often known as "early stopping," or to reduce model complexity by removing less relevant inputs. However, if you stop too soon or eliminate too many important features, you may run into the opposite problem and underfit your model. Underfitting happens when the model has not been trained for a sufficient time or when the input variables are not informative enough to reveal a meaningful link between the input and output variables.
In both cases, the model cannot identify the prevailing trend in the training dataset. As a result, an under-fitted model generalizes poorly to unseen data. In contrast to overfitted models, under-fitted models have high bias and low variance in their predictions. The bias-variance tradeoff describes what happens as an under-fitted model moves toward an overfitted state: as the model learns, its bias decreases, but its variance increases as it becomes overfitted. When fitting a model, the objective is to locate the "sweet spot" between underfitting and overfitting so that the dominant trend can be established and applied to new datasets.
Overfitting: Key Takeaways
- Overfitting is a modeling issue in which the model generalizes poorly because it is fitted too closely to its training data set.
- Overfitting limits the model's relevance to its data set and renders it irrelevant to other data sets.
- Ensembling, data augmentation, data simplification, and cross-validation are some of the strategies used to prevent overfitting.
Underfitting and Overfitting and Bias/Variance Trade-off
I won't go into detail regarding the bias/variance tradeoff, but here are some key points you need to know:
- Low bias, low variance: a good, just-right outcome.
- Low bias, high variance: overfitting. The algorithm produces widely differing predictions for the same data point when trained on different samples.
- High bias, low variance: underfitting. The algorithm produces consistent predictions for similar data, but the predictions are inaccurate.
- High bias, high variance: a poor algorithm. You will rarely see this in practice.
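Bias and variance can be estimated empirically by refitting the same kind of model on many freshly sampled training sets. The sketch below assumes a known sine signal with Gaussian noise and uses polynomial degree as the complexity knob; the function `bias_variance` and its defaults are illustrative.

```python
import numpy as np
from numpy.polynomial import Polynomial

def bias_variance(degree, n_repeats=200, noise_std=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit by refitting on many noisy datasets."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, 30)          # fixed inputs
    signal = np.sin(2 * np.pi * x)         # true function (an assumption for this demo)
    predictions = np.empty((n_repeats, len(x)))
    for r in range(n_repeats):
        y = signal + rng.normal(0.0, noise_std, len(x))  # a fresh noisy training set
        predictions[r] = Polynomial.fit(x, y, deg=degree)(x)
    # Bias: how far the average prediction is from the truth.
    bias_sq = float(np.mean((predictions.mean(axis=0) - signal) ** 2))
    # Variance: how much predictions change from one training set to the next.
    variance = float(np.mean(predictions.var(axis=0)))
    return bias_sq, variance

for degree in (1, 4, 12):
    bias_sq, variance = bias_variance(degree)
    print(f"degree {degree:>2}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

The straight line (degree 1) lands in the high-bias, low-variance corner, while the degree-12 polynomial lands in the low-bias, high-variance corner.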
The term “generalization” in machine learning refers to the ability of a model to train on given data and then predict with respectable accuracy on similar but completely new or unseen data. Model generalization can also be thought of as preventing overfitting while making sure that the model still learns adequately.
1. Generalization and its effect on an Underfitting Model
If a model is underfitting a given dataset, techniques aimed at improving generalization (such as regularization) should be avoided. Generalization should only become the goal once the model has learned the patterns of the dataset properly. Applying such techniques to an already underfitting model leads to further underfitting, since they tend to reduce model complexity.
2. Generalization and its effect on Overfitting Model
If a model is overfitting, then it is the ideal candidate to apply generalization techniques upon. This is primarily because an overfitting model has already learned the intricate details and patterns of the dataset. Applying generalization techniques on this kind of a model will lead to a reduction of model complexity and hence prevent overfitting. In addition to that, the model will be able to predict more accurately on unseen, but similar data.
3. Generalization Techniques
There are no separate generalization techniques as such; a model generalizes well if it performs equally well on both training and validation data. Hence, if we apply the techniques used to prevent overfitting (e.g., regularization, ensembling, etc.) to a model that has properly learned the complex patterns, some degree of successful generalization can be achieved.
Analyzing the Goodness of Fit
scikit-learn provides three distinct APIs for evaluating the quality of a model's predictions:
- Estimator score method: Estimators have a score method that provides a default evaluation criterion for the problem they are designed to solve. This is covered in each estimator's documentation.
- Scoring parameter: Model-evaluation tools that use cross-validation, such as model_selection.cross_val_score and model_selection.GridSearchCV, rely on an internal scoring strategy. The scikit-learn documentation section "The scoring parameter: defining model evaluation rules" discusses this.
- Metric functions: The sklearn.metrics module implements functions that assess prediction error for specific purposes. The documentation sections on classification metrics, multilabel ranking metrics, regression metrics, and clustering metrics provide more information on these measures.
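The three APIs can be used side by side as follows. This is a small sketch on synthetic regression data; the choice of `Ridge`, the `alpha` value, and the dataset are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data (an assumption for this demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(0.0, 0.5, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)

# 1. Estimator score method: for regressors the default is R^2.
test_r2 = model.score(X_test, y_test)

# 2. Scoring parameter: choose the metric used inside cross-validation.
cv_mse = -cross_val_score(Ridge(alpha=1.0), X_train, y_train, cv=5,
                          scoring="neg_mean_squared_error")

# 3. Metric functions from sklearn.metrics, applied to predictions directly.
y_pred = model.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
same_r2 = r2_score(y_test, y_pred)  # same value as model.score above

print(f"test R^2 = {test_r2:.3f}, test MSE = {test_mse:.3f}, CV MSE = {cv_mse.mean():.3f}")
print("train R^2 =", round(model.score(X_train, y_train), 3))  # compare with test R^2 to spot a gap
```

Comparing the training score with the test or cross-validated score is what turns these metrics into a goodness-of-fit check: a large gap suggests overfitting, while poor scores on both suggest underfitting.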
Frequently Asked Questions (FAQs)
1. What is meant by overfitting and underfitting data with examples?
Overfitting and underfitting are two significant issues in machine learning that degrade the performance of models. Every machine learning model's primary goal is to generalize well. In this context, generalization refers to a model's capacity to produce appropriate output for previously unseen inputs; in other words, after training on a dataset, the model can give dependable and accurate results on new data. Underfitting and overfitting are therefore the two conditions to check when assessing model performance and whether the model is generalizing correctly.
2. What are the methods to avoid overfitting and underfitting in machine learning?
Methods for removing overfitting:
- Training with more data
- Removing features
- Early termination of training
Methods for removing underfitting:
- By increasing the training time of the model.
- By increasing the number of features.
3. How are bias and variance related to underfitting and overfitting in machine learning?
- Low bias, low variance: A good, just-right outcome.
- Low bias, high variance: Overfitting. The algorithm produces widely differing predictions for the same data point when trained on different samples.
- High bias, low variance: Underfitting. The algorithm produces consistent predictions for similar data, but the predictions are inaccurate.
- High bias, high variance: A poor algorithm. You will rarely see this in practice.