Data pre-processing is considered one of the most important steps in any machine learning task.
Data pre-processing simply refers to bringing all the data (collected from various sources) into a single format, or into uniform sets of data (based on the type of data), so that the learning algorithm can learn from it more easily and predict results with high accuracy.
Real-world data is never ideal: it will have missing data cells, errors, outliers, discrepancies in names, and much more.
Data pre-processing isn’t a single task but a series of tasks that need to be performed step by step. The output of one step becomes the input of the next, and so on.
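The idea that each step feeds the next can be sketched as a small pipeline. This is a minimal illustration, not a prescribed implementation: the step functions `strip_whitespace` and `drop_empty_rows`, the helper `run_pipeline`, and the sample rows are all hypothetical names and data chosen for the example.

```python
from functools import reduce

def strip_whitespace(rows):
    """Trim surrounding whitespace from every string field."""
    return [
        {key: value.strip() if isinstance(value, str) else value
         for key, value in row.items()}
        for row in rows
    ]

def drop_empty_rows(rows):
    """Remove rows in which every field is empty or None."""
    return [
        row for row in rows
        if any(value not in ("", None) for value in row.values())
    ]

def run_pipeline(rows, steps):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda data, step: step(data), steps, rows)

# Hypothetical raw input: one usable row and one completely empty row.
raw = [{"name": "  Ada ", "age": 36}, {"name": "", "age": None}]
cleaned = run_pipeline(raw, [strip_whitespace, drop_empty_rows])
print(cleaned)
```

Keeping every step as a function that takes rows and returns rows means steps can be reordered or added without touching the others.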
The steps are listed below:
Once the redundancy has been removed from the data, the relationships between the remaining records are analyzed and matched so that they can be represented in one format.
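Removing redundant records and bringing them into one format might look like the sketch below. The field names (`name`, `email`) and the normalisation rules are assumptions made for illustration. Note that this sketch normalises each record before comparing, so that records differing only in spacing or capitalisation are recognised as duplicates.

```python
def normalise(record):
    """Bring a record into one canonical format (hypothetical rules)."""
    return {
        # Collapse repeated spaces and use title case for names.
        "name": " ".join(record["name"].split()).title(),
        # Email addresses are compared in lower case, without padding.
        "email": record["email"].strip().lower(),
    }

def deduplicate(records):
    """Keep the first occurrence of each normalised record."""
    seen = set()
    unique = []
    for record in map(normalise, records):
        key = (record["name"], record["email"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Hypothetical input: the first two rows describe the same person.
records = [
    {"name": "ada  lovelace", "email": "Ada@Example.com "},
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "alan turing", "email": "alan@example.com"},
]
unique = deduplicate(records)
print(unique)
```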
When data has been collected from multiple sources (or even a single source), it is never ideal if it is real-world data. It will have some missing values, irrelevant data, or unidentified characters as well.
This happens when people collect data improperly or label it incorrectly. These missing and irrelevant parts of the data need to be either corrected or removed completely. Otherwise, the learning algorithm will treat the irrelevant and unidentified data (which is considered noise) as if it were relevant, and its predictions on new data will be less accurate.
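The two options mentioned above, removing or correcting missing values, can be sketched as follows. Filling with the mean is only one possible correction, used here as an assumption; the function names and the `age` field are hypothetical.

```python
from statistics import mean

def drop_incomplete(rows, required):
    """Remove rows where any required field is missing (None)."""
    return [
        row for row in rows
        if all(row.get(field) is not None for field in required)
    ]

def fill_with_mean(rows, field):
    """Replace missing values of a numeric field with the mean of the known ones."""
    known = [row[field] for row in rows if row.get(field) is not None]
    if not known:
        # Nothing to compute a mean from; return the rows unchanged.
        return [dict(row) for row in rows]
    fill_value = mean(known)
    return [
        {**row, field: fill_value if row.get(field) is None else row[field]}
        for row in rows
    ]

# Hypothetical input with one missing value.
rows = [{"age": 20}, {"age": None}, {"age": 40}]
dropped = drop_incomplete(rows, ["age"])
filled = fill_with_mean(rows, "age")
print(dropped)
print(filled)
```

Dropping rows is the simpler choice but discards data; filling keeps every row at the cost of introducing estimated values.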
Noisy data can be handled in many different ways:
In this post, we looked at why pre-processing data matters and at a few of the methods involved.