Data pre-processing is considered one of the most important steps in implementing a machine learning algorithm. In this post, we will look at the important steps in pre-processing data.
The following steps need to be followed before using the data as input to learning algorithms:
For a program to run successfully, the libraries it depends on need to be imported. A library is a collection of modules, and each module contains functions that can be accessed with the dot operator. Scientific libraries are usually imported under a short alias so that they are easier to reference. Below is an example that imports the library ‘numpy’ under the alias ‘np’.
import numpy as np
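As a small illustration of the alias and the dot operator, the sketch below calls two NumPy functions through ‘np’. The array values are made up for the example.

```python
import numpy as np

# Build a small array; the values are arbitrary sample data.
values = np.array([2.0, 4.0, 6.0])

# Functions and submodules are reached with the dot operator on the alias.
mean_value = np.mean(values)         # function in the top-level numpy namespace
norm_value = np.linalg.norm(values)  # function inside the numpy.linalg submodule

print(mean_value)
print(norm_value)
```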
Importing a ready-made dataset is useful when the user does not wish to build their own by collecting data and cleaning it. Many datasets are readily available in the form of CSV files. The pandas function ‘read_csv’ reads a CSV file and returns its contents as a data frame, which can then be worked with. Below is an example showing how a CSV file can be read into a data frame.
import pandas as pd
data_set = pd.read_csv("path to csv file")
Based on the column values, a vector of the dependent variable (the output that the model is meant to predict) can be created.
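A minimal sketch of reading a CSV and separating the features from the dependent variable might look like the following. The CSV content, the column names, and the choice of the last column as the dependent variable are all assumptions made for illustration; an in-memory string stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""

data_set = pd.read_csv(io.StringIO(csv_text))

# Assumption: the last column is the dependent variable, the rest are features.
X = data_set.iloc[:, :-1].values  # independent variables (feature matrix)
Y = data_set.iloc[:, -1].values   # dependent variable vector

print(X.shape)
print(Y)
```

With a real file, the ‘io.StringIO(csv_text)’ argument would simply be replaced by the path to the CSV file.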
There is usually no ideal dataset. Discrepancies such as missing or redundant data are usually present, and they need to be handled in different ways depending on the data. Sometimes the entire row can be removed, irrelevant columns can be eliminated, or a missing or irrelevant value can be replaced with a substitute. Consider the example below:
Older scikit-learn releases provided a class named Imputer in sklearn.preprocessing for handling missing data. It was removed in version 0.22, and the SimpleImputer class in sklearn.impute takes its place. An instance of the class is created so that its methods can be accessed and used. SimpleImputer has parameters like ‘missing_values’ and ‘strategy’; the ‘axis’ parameter of the old Imputer is gone, because SimpleImputer always works column by column. The imputer object is first fit to the selected columns of the dataset, which computes the mean of each column. Next, the missing values are replaced by the mean of their column using the method named ‘transform’.
import numpy as np
from sklearn.impute import SimpleImputer
# Replace NaN entries with the mean of their column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(y[:, 1:4])
y[:, 1:4] = imputer.transform(y[:, 1:4])
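To make the effect of mean imputation concrete, here is a small self-contained sketch. It assumes a scikit-learn release that provides SimpleImputer in sklearn.impute (0.20 or later), and the array is made-up sample data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric data with two missing entries marked as np.nan.
data = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# strategy="mean" replaces each missing entry with the mean of its column,
# computed from the values that are present.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(data)

print(filled)
```

The missing value in the first column becomes 2.0 (the mean of 1.0 and 3.0) and the one in the second column becomes 15.0 (the mean of 10.0 and 20.0).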
When data is in the form of text, it needs to be converted to numbers, because most learning algorithms operate on numeric input and cannot process strings directly. This can be done using the LabelEncoder class, which is present in the scikit-learn library. An object of the LabelEncoder class is created, and its ‘fit_transform’ method is called with the column to be encoded. This way, each distinct text value is replaced by an integer. This works well when there are 2 categories; with more, the integer codes can suggest an ordering between categories that does not really exist.
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
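The sketch below shows what the integer codes look like for a made-up column with three categories. It also shows one-hot encoding through the pandas function ‘get_dummies’, which is one common alternative for columns with more than two categories; that choice is an assumption here, not something the steps above prescribe.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up text column with three categories.
countries = np.array(["France", "Spain", "Germany", "Spain"])

# LabelEncoder assigns integer codes following the sorted order of the class names.
encoder = LabelEncoder()
codes = encoder.fit_transform(countries)
print(encoder.classes_)
print(codes)

# One-hot encoding gives each category its own 0/1 column,
# so no ordering between the categories is implied.
one_hot = pd.get_dummies(pd.Series(countries), dtype=int)
print(one_hot)
```

The fitted encoder keeps the mapping in ‘classes_’, so ‘encoder.inverse_transform(codes)’ recovers the original strings.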
The dataset needs to be split into training and test datasets, so that once training is completed on the training dataset, the performance of the learning model can be tested on data it has not seen. A common choice is to use 80 percent of the data for training and 20 percent for testing. This can be achieved using the scikit-learn library, which has a function named train_test_split. The ‘test_size’ parameter sets the fraction of the data that is assigned to the test dataset.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
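As a quick check of how the split behaves, the following sketch applies train_test_split to ten made-up samples. The ‘random_state’ argument is an addition for illustration: it fixes the shuffle so that the same rows land in each set on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, plus a binary target.
X = np.arange(20).reshape(10, 2)
Y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)
```

With ten rows and ‘test_size=0.2’, eight rows go to the training set and two to the test set.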
Feature scaling is one of the most important steps in data pre-processing. It refers to standardizing the range of the independent variables, or features, present within the dataset. When all the variables are transformed to the same scale, many learning algorithms behave better, since no feature dominates merely because of its larger numeric range. This can be achieved using the ‘StandardScaler’ class that is present in the scikit-learn library. The scaler is first fit on the training dataset, which is then transformed. The test dataset, on the other hand, is only transformed, using the mean and standard deviation learned from the training data.
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
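The small sketch below, on made-up numbers, illustrates why the scaler is fit only on the training data: the test row is scaled with the training mean and standard deviation rather than its own. The variable names with the ‘_scaled’ suffix are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

sc_X = StandardScaler()
X_train_scaled = sc_X.fit_transform(X_train)  # learn mean/std from training data, then scale it
X_test_scaled = sc_X.transform(X_test)        # reuse the training mean/std on the test data

print(sc_X.mean_)                   # per-feature means learned from the training set
print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_train_scaled.std(axis=0))   # approximately 1 for each feature
print(X_test_scaled)
```

After scaling, each training feature has mean 0 and standard deviation 1. The test row is expressed on that same scale, so its values need not fall in the same range as the training values.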
In this post, we saw how data can be pre-processed by following a list of steps.