Curve fitting is the process of determining the best fit mathematical function for a given set of data points. It examines the relationship between multiple independent variables (predictors) and a dependent variable (response) in order to determine the “best fit” line.
In the figure shown, the red line represents the curve that is the best fit for the given purple data points. It can also be seen that curve fitting does not necessarily mean that the curve should pass over each and every data point. Instead, it is the most appropriate curve that represents all the data points adequately.
As discussed, curve fitting refers to finding the “best fit” curve or line for a given set of data points. Even though this is also what a part of Machine Learning or Data Science does, the applications of Machine Learning or Data Science far outweigh that of Curve Fitting.
The major difference is that during Curve Fitting, the entire data is available to the developer. However, when it comes to Machine Learning, the amount of data available to the developer is only a part of the real-world data on which the Fitted Model will be applied.
Even then, Machine Learning is a vast interdisciplinary field and it consists of a lot more than just “Curve Fitting”. Machine Learning can be broadly classified into Supervised, Unsupervised and Reinforcement Learning. Considering the fact that most of the real-world problems are solved by Supervised Learning, this article concentrates on Supervised Learning itself.
Supervised learning can be further classified into Classification and Regression. In this case, the work done by Regression is similar to what Curve Fitting achieves.
To get a broader idea, let’s look at the difference between Classification and Regression:
|It is the process of separating/classifying two or more types of data into separate categories or classes based on their characteristics.||It is the process of determining the “Best Fit” curve for the given data such that, on unseen data, the data points lying on the curve accurately represent the desired result.|
|The output values are discrete in nature (eg. 0, 1, 2, 3, etc) and are known as “Classes”.||The output values are continuous in nature (eg. 0.1, 1.78, 9.54, etc).|
|Here, the two classes (red and blue colored points) are clearly separated by the line(s) in the middle. This is an example of classification.|
Here, the curve represented by the magenta line is the “Best Fit” line for all the data points as shown. This is an example of Regression.
The data that is obtained from the real world is not ideal or noise-free. It contains a lot of noise, which needs to be filtered out before applying the Machine Learning Algorithms.
As shown in the above image, the few extra data points in the top of the left graph represent unnecessary noise, which in technical terms is known as “Outliers”. As shown in the difference between the left and the right graphs, the presence of outliers makes a considerable amount of difference when it comes to the determination of the “Best Fit” line. Hence, it is of immense importance to apply preprocessing techniques in order to remove outliers from the data.
Outliers: As already discussed, outliers are data points which do not belong to the original set of data. These data points are either too high or too low in value, such that they do not belong to the general distribution of the rest of the dataset. They are usually due to misrepresentation or an accidental entry of wrong data. There are several statistical algorithms which are used to detect and remove such outliers.
Missing Data: In sharp contrast to outliers, missing data is another major challenge when it comes to the dataset. The occurrence is quite common in tabular datasets (eg. CSV files) and is a challenge if the number of missing data points exceeds 10% of the total size of the dataset. Most Machine Learning algorithms fail to perform on such datasets. However, certain algorithms such as Decision Trees are quite resilient when it comes to data with missing data and are able to provide accurate results even when supplied with such noisy datasets. Similar to Outliers, there are statistical methods to handle missing data or “NaN” (Not a Number) values. The most common of them is to remove or “drop” the row containing the missing data.
“Training” is terminology associated with Machine Learning and it basically means the “Fitting” of data or “Learning” from data. This is the step where the Model starts to learn from the given data in order to be able to predict on similar but unseen data. This step is crucial since the final output (or Prediction) of the model will be based on how well the model was able to acquire the patterns of the training data.
Training in Machine Learning: Depending on the type of data, the training methodology varies. Hence, here we assume simple tabular (eg. CSV) text data. Before the model can be fitted on the data, there are a few steps that have to be followed:
Training in Deep Learning: Deep Learning is a part of machine learning, but instead of relying on statistical methods, Deep Learning Techniques largely depend on calculus and aims to mimic the Neural structure of the biological brain, and hence, are often referred to as Neural Networks.
The training process for Deep Learning is quite similar to that of Machine Learning except that there is no need for “Feature Engineering”. Since deep learning models largely rely on weights to specify the importance of given input (feature), the model automatically tends to learn which features are relevant and which feature is not. Hence, it assigns a “high” weight to the features that are relevant and assigns a “low” weight to the features that are not relevant. This removes the need for a separate Feature Engineering.
This difference is correctly portrayed in the following figure:
Improper Training of Data: As discussed above, the training of data is the most crucial step of any Machine Learning Algorithm. Improper training can lead to drastic performance degradation of the model on deployment. On a high level, there are two main types of outcomes of Improper Training: Underfitting and Overfitting.
When the complexity of the model is too less for it to learn the data that is given as input, the model is said to “Underfit”. In other words, the excessively simple model fails to “Learn” the intricate patterns and underlying trends of the given dataset. Underfitting occurs for a model with Low Variance and High Bias.
Underfitting data Visualization: With the initial idea out of the way, visualization of an underfitting model is important. This helps in determining if the model is underfitting the given data during training. As already discussed, supervised learning is of two types: Classification and Regression. The following graphs show underfitting for both of these cases:
Detection of underfitting model: The model may underfit the data, but it is necessary to know when it does so. The following steps are the checks that are used to determine if the model is underfitting or not.
Fix for an underfitting model: If the model is underfitting, the developer can take the following steps to recover from the underfitting state:
When the complexity of the model is too high as compared to the data that it is trying to learn from, the model is said to “Overfit”. In other words, with increasing model complexity, the model tends to fit the Noise present in data (eg. Outliers). The model learns the data too well and hence fails to Generalize. Overfitting occurs for a model with High Variance and Low Bias.
Overfitting data Visualization: With the initial idea out of the way, visualization of an overfitting model is important. Similar to underfitting, overfitting can also be showcased in two forms of supervised learning: Classification and Regression. The following graphs show overfitting for both of these cases:
Detection of overfitting model: The parameters to look out for to determine if the model is overfitting or not is similar to those of underfitting ones. These are listed below:
Fix for an overfitting model: If the model is overfitting, the developer can take the following steps to recover from the overfitting state:
The term “Generalization” in Machine Learning refers to the ability of a model to train on a given data and be able to predict with a respectable accuracy on similar but completely new or unseen data. Model generalization can also be considered as the prevention of overfitting of data by making sure that the model learns adequately.
Generalization and its effect on an Underfitting Model: If a model is underfitting a given dataset, then all efforts to generalize that model should be avoided. Generalization should only be the goal if the model has learned the patterns of the dataset properly and needs to generalize on top of that. Any attempt to generalize an already underfitting model will lead to further underfitting since it tends to reduce model complexity.
Generalization and its effect on Overfitting Model: If a model is overfitting, then it is the ideal candidate to apply generalization techniques upon. This is primarily because an overfitting model has already learned the intricate details and patterns of the dataset. Applying generalization techniques on this kind of a model will lead to a reduction of model complexity and hence prevent overfitting. In addition to that, the model will be able to predict more accurately on unseen, but similar data.
Generalization Techniques: There are no separate Generalization techniques as such, but it can easily be achieved if a model performs equally well in both training and validation data. Hence, it can be said that if we apply the techniques to prevent overfitting (eg. Regularization, Ensembling, etc.) on a model that has properly acquired the complex patterns, then a successful generalization of some degree can be achieved.
Bias-Variance Tradeoff: Bias denotes the simplicity of the model. A high biased model will have a simpler architecture than that of a model with a lower bias. Similarly, complementing Bias, Variance denotes how complex the model is and how well it can fit the data with a high degree of diversity.
An ideal model should have Low Bias and Low Variance. However, when it comes to practical datasets and models, it is nearly impossible to achieve a “zero” Bias and Variance. These two are complementary of each other, if one decreases beyond a certain limit, then the other starts increasing. This is known as the Bias-Variance Tradeoff. Under such circumstances, there is a “sweet spot” as shown in the figure, where both bias and variance are at their optimal values.
Bias-Variance and Generalization: As it is clear from the above graph, the Bias and Variance are linked to Underfitting and Overfitting. A model with high Bias means the model is Underfitting the given data and a model with High Variance means the model is Overfitting the given data.
Hence, as it can be seen, at the optimal region of the Bias-Variance tradeoff, the model is neither underfitting nor overfitting. Hence, since there is neither underfitting nor overfitting, it can also be said that the model is most Generalized, as under these conditions the model is expected to perform equally well on Training and Validation Data. Thus, the graph depicts that the Generalization Error is minimum at the optimal value of the degree of Bias and Variance.
To summarize, the learning capabilities of a model depend on both, model complexity and data diversity. Hence, it is necessary to keep a balance between both such that the Machine Learning Models thus trained can perform equally well when deployed in the real world.
In most cases, Overfitting and Underfitting can be taken care of in order to determine the most appropriate model for the given dataset. However, even though there are certain rule-based steps that can be followed to improve a model, the insight to achieve a properly Generalized model comes with experience.
Your email address will not be published. Required fields are marked *
Statistics is a science concerned with collection,... Read More