
Machine Learning Tutorial

Pre-processing data

Data pre-processing is considered to be one of the most important steps in implementing a machine learning algorithm. In this tutorial, we will look at the important steps in pre-processing data. 

The following steps need to be followed before using the data as input to learning algorithms: 

  • Importing libraries 
  • Importing required dataset 
  • Handling missing data in the dataset 
  • Handling categorical data 
  • Splitting the dataset into training and test datasets 
  • Feature scaling

Importing libraries 

For a program to run successfully, certain libraries need to be imported; some of them are then referenced using the dot operator. A library is a collection of modules, and each module contains functions that can be accessed and used. Usually, scientific libraries are imported under an alias so that they are easier to reference. Below is an example importing the 'numpy' library under the alias 'np'. 

import numpy as np

Importing required dataset 

This step applies when the user does not want to collect and clean a dataset of their own. Many datasets are readily available in the form of CSV files. They can be read using pandas, converted to a data frame, and worked with. The function 'read_csv' reads a CSV file into a data frame. Below is an example showing how this is done. 

import pandas as pd 
data_set = pd.read_csv("path to csv file") 

Based on the column values, a dependent vector can be created which can be used to predict outputs. 
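For example, with a small hypothetical data frame (the column names and values below are invented for illustration), the feature matrix and the dependent vector can be separated with 'iloc':

```python
import pandas as pd

# Hypothetical dataset standing in for a CSV read with pd.read_csv.
data_set = pd.DataFrame({
    "age": [25, 30, 35],
    "salary": [40000, 50000, 60000],
    "purchased": ["no", "yes", "yes"],
})

# Feature matrix: every column except the last.
X = data_set.iloc[:, :-1].values

# Dependent vector: the last column, which the model will learn to predict.
y = data_set.iloc[:, -1].values
```

Here X has one row per sample and y holds the value to be predicted for each sample.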

Handling missing data in the dataset 

There is usually no ideal dataset. Discrepancies are usually present in the form of missing or redundant data, and they need to be handled in different ways depending on the data. Sometimes the entire row can be removed, irrelevant columns can be eliminated, or a replacement value can be substituted for the missing or irrelevant data. Consider the below example: 

Missing values can be filled in using an imputer class from the scikit-learn library. Older versions provided a class named Imputer, with parameters like 'missing_values', 'strategy', and 'axis'; since version 0.22 it has been removed in favour of SimpleImputer in the sklearn.impute module. An instance of the imputer is created so that its methods can be accessed and used. The imputer is fitted to the selected columns of the dataset (training the imputer on the data), and the missing values are then replaced by the mean of their column using the 'transform' method. 

# SimpleImputer replaced the deprecated Imputer class (removed in scikit-learn 0.22) 
import numpy as np 
from sklearn.impute import SimpleImputer 
imputer = SimpleImputer(missing_values=np.nan, strategy="mean") 
imputer = imputer.fit(y[:, 1:4]) 
y[:, 1:4] = imputer.transform(y[:, 1:4]) 
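In current scikit-learn versions the Imputer class has been replaced by SimpleImputer from sklearn.impute. A minimal runnable sketch with a made-up toy array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array with one missing value in the second column (invented numbers).
data = np.array([[1.0, 2.0],
                 [3.0, np.nan],
                 [5.0, 6.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(data)
# The NaN is replaced by the mean of the remaining values in its column:
# (2.0 + 6.0) / 2 = 4.0
```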

Handling categorical data 

When data is in the form of text, it needs to be converted to a form that machines can process, since learning algorithms work on numbers rather than strings. This can be done using the LabelEncoder class from the scikit-learn library. An object of the LabelEncoder class is created, and the column to be encoded is passed to its 'fit_transform' method, which replaces each text label with a number. This works well when there are only 2 categories; with more categories, the integer codes impose an artificial ordering, so one-hot encoding is often used instead. 

from sklearn.preprocessing import LabelEncoder 
labelencoder_X = LabelEncoder() 
X[:, 0] = labelencoder_X.fit_transform(X[:, 0]) 
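As a quick sketch with made-up city labels, LabelEncoder assigns integer codes following the alphabetical order of the distinct classes:

```python
from sklearn.preprocessing import LabelEncoder

cities = ["paris", "tokyo", "paris", "amsterdam"]
encoder = LabelEncoder()
codes = encoder.fit_transform(cities)
# encoder.classes_ is sorted, so amsterdam -> 0, paris -> 1, tokyo -> 2
```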

Splitting dataset into training and test datasets 

The dataset needs to be split into training and test datasets, so that once training is completed on the training dataset, the performance of the learning model can be evaluated on the test dataset. Usually, 80 percent of the data is used for training and 20 percent is reserved for testing. This can be achieved using the scikit-learn library, which provides a function named train_test_split. The 'test_size' parameter controls the fraction of the data assigned to the test set. 

from sklearn.model_selection import train_test_split 
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2) 
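A self-contained sketch with synthetic data (the array sizes are arbitrary) shows how the split divides the samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples with 2 features each
Y = np.arange(10)                 # one label per sample

# test_size=0.2 reserves 2 of the 10 samples for the test set;
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)
```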

Feature scaling 

This is one of the most important steps in data pre-processing. It refers to standardizing the range of the independent variables, or features, present within the dataset. When all the variables are transformed to the same scale, many machine learning algorithms converge faster and weigh the features more fairly. This can be achieved using the 'StandardScaler' class from the scikit-learn library. The scaler is first fitted on the training dataset and then used to transform it; the test dataset is only transformed, reusing the statistics learned from the training data. 

from sklearn.preprocessing import StandardScaler 
sc_X = StandardScaler() 
X_train = sc_X.fit_transform(X_train) 
X_test = sc_X.transform(X_test) 
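The fit-then-transform pattern can be checked on a tiny made-up example: after fitting, the training column has zero mean, and the test data is scaled with the training statistics rather than its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])  # toy training column
X_test = np.array([[2.0]])                 # toy test value

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learn mean and std from training data
X_test_scaled = sc.transform(X_test)        # reuse those statistics

# The scaled training column has mean 0; the test value 2.0 equals the
# training mean, so it is mapped to 0 as well.
```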

Conclusion 

In this post, we saw how data can be pre-processed by following a list of steps. 
