Data pre-processing is one of the most important steps in any machine learning task.
Pre-processing simply means getting all the data (collected from various sources) into a single, uniform format (based on the type of data), so that the learning algorithm can learn from it and predict results with high accuracy.
Real-world data is never ideal: it will have missing cells, errors, outliers, discrepancies in names, and much more.
Data pre-processing isn’t a single task but several tasks that need to be performed step by step; the output of one step is the input of the next.
The steps are listed below:
- Data cleaning
- Data transformation
- Data reduction
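To make the step-by-step idea concrete, the overall pipeline can be sketched in pandas. This is a minimal sketch under assumptions of my own: the stage functions, column names, and data are illustrative and not from the post.

```python
import pandas as pd

# Hypothetical stage functions; names, columns, and data are illustrative.
def clean(df):
    # data cleaning: drop duplicate rows and rows that are entirely empty
    return df.drop_duplicates().dropna(how="all")

def transform(df):
    # data transformation: min-max scale numeric columns into 0.0-1.0
    num = df.select_dtypes("number")
    return df.assign(**{c: (num[c] - num[c].min()) / (num[c].max() - num[c].min())
                        for c in num.columns})

def reduce(df):
    # data reduction: keep only the columns assumed relevant to the model
    return df[["age", "income"]]

def preprocess(df):
    # the output of each step becomes the input of the next
    for step in (clean, transform, reduce):
        df = step(df)
    return df

raw = pd.DataFrame({"age": [25.0, 25.0, 40.0, None],
                    "income": [30000.0, 30000.0, 60000.0, None],
                    "note": ["a", "a", "b", None]})
processed = preprocess(raw)
```

Chaining the stages this way keeps each step small and testable, and makes the order of operations explicit.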
Data cleaning itself has multiple steps: parsing, data correction, standardization, data matching, consolidation, and data staging.
- Parsing refers to identifying data elements in the source files, and separating these elements into specific files.
- Data correction refers to correcting each element of the parsed files with the help of correction algorithms.
- Standardization of data refers to converting these data elements into a preferred format that is consistent, following certain protocols.
- Data matching: in order to remove redundant data (duplicates), data elements are searched for and matched against the original data.
- Consolidation: once the redundancy has been removed, the relationships between the remaining records are analyzed and matched so that they can be represented in one format.
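A minimal sketch of the standardization and matching steps using pandas; the customer records and column names below are made up for illustration.

```python
import pandas as pd

# Toy records with naming discrepancies and one hidden duplicate
# (data and column names are illustrative, not from the post).
records = pd.DataFrame({
    "name": ["Alice Smith", "alice smith ", "Bob Jones"],
    "city": ["NYC", " nyc", "LA"],
})

# Standardization: bring every element into one consistent format
# (here: trimmed whitespace, lower case).
standardized = records.apply(lambda col: col.str.strip().str.lower())

# Matching: after standardization, the duplicate lines up exactly with
# the original record and can be eliminated.
deduped = standardized.drop_duplicates()
```

Note that matching only works reliably *after* standardization: "Alice Smith" and "alice smith " are distinct strings until both are normalized.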
Data collected from multiple sources (or even a single source) is never ideal if it is real-world data: it will have missing values, irrelevant data, and unidentified characters.
This occurs when humans collect data improperly or label it incorrectly. These missing and irrelevant parts of the data need to be either corrected or removed completely. Failing to do so will hurt the accuracy of the learning algorithm's predictions on new data, because the irrelevant and unidentified data (which is considered noise) will be treated as relevant by the learning algorithm.
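The two options for handling missing values, correcting versus removing, can be sketched with pandas. The toy dataset and column names below are illustrative assumptions.

```python
import pandas as pd
import numpy as np

# Toy dataset with a missing feature value and a missing label
# (data and column names are made up for illustration).
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 35.0],
                   "label": ["yes", "no", None, "yes"]})

# Correct: fill a missing numeric value with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Remove: drop rows whose label is missing, since an unlabeled row
# cannot be used to train a supervised learning algorithm.
df = df.dropna(subset=["label"])
```

Which option to use is a judgment call: imputing keeps the row's other information, while dropping avoids introducing made-up values.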
Noisy data can be handled in a few different ways:
- Binning: the data is first sorted (ordered) and then smoothed out; the sorted data is divided into segments and every segment is smoothed separately.
- Regression: the data is fit to a regression function, which smooths it out and thereby removes the noise.
- Clustering: similar data points are grouped into clusters; any noise (outliers or unusual data) falls outside these clusters and can later be eliminated or disregarded.
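The first of these techniques, smoothing sorted data by segments (often called binning), can be sketched with NumPy. The series below is a toy example of my own choosing.

```python
import numpy as np

# A toy series, already sorted as the technique requires.
values = np.array([4.0, 8.0, 9.0, 15.0, 21.0, 21.0, 24.0, 25.0, 26.0])

# Divide the sorted data into equal-size segments (here: 3 bins of 3)...
bins = values.reshape(3, 3)

# ...and smooth each segment separately by replacing its values
# with the segment mean.
smoothed = np.repeat(bins.mean(axis=1), 3)
```

Each bin collapses to its mean (7, 19, and 25 here), so small fluctuations inside a segment disappear while the overall trend is preserved.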
Data transformation is performed to get the dataset into a format that is easy for the learning algorithm to work with. There are many data transformation methods, and some of them are discussed below.
- Normalization: all the data is scaled to a specific range of values, for example -1.0 to 1.0 or 0.0 to 1.0.
- Feature construction: new features are built by combining two or more existing features, which can improve the prediction process.
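Both transformation methods above can be sketched in a few lines of pandas. The column names and values below are illustrative assumptions, not from the post.

```python
import pandas as pd

# Illustrative feature columns; names and values are made up.
df = pd.DataFrame({"width": [2.0, 4.0, 10.0],
                   "height": [1.0, 3.0, 5.0]})

# Scaling: min-max normalize each column into the 0.0-1.0 range.
scaled = (df - df.min()) / (df.max() - df.min())

# The same values shifted and stretched into the -1.0 to 1.0 range.
scaled_pm = 2 * scaled - 1

# Feature construction: combine two existing features into a new one
# ("area" is a hypothetical derived feature).
df["area"] = df["width"] * df["height"]
```

After min-max scaling, every column's minimum maps to 0.0 and its maximum to 1.0, so features measured on very different scales contribute comparably to the model.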
In this post, we covered the significance of pre-processing data and a few of the methods involved.