Machine Learning Tutorial

By KnowledgeHut



Pre-processing data

Data pre-processing is one of the most important steps in implementing a machine learning algorithm. This tutorial walks through the main steps involved in pre-processing data.

The following steps need to be followed before using the data as input to learning algorithms:

Importing libraries
Importing required dataset
Handling missing data in the dataset
Handling categorical data
Splitting the dataset into training and test datasets
Feature scaling

Importing libraries

For a program to run successfully, the libraries it depends on need to be imported first. A library is a collection of modules, and each module provides functions that can be called once the library is imported. These functions are referenced through the dot operator, for example library.function. Scientific libraries are usually imported under a short alias so that they are easier to reference. Below is an example that imports the library named ‘numpy’ under the alias ‘np’.

import numpy as np
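
As a small illustration of the alias and the dot operator described above, the sketch below calls one function from the top level of numpy and one from its linalg submodule. The array values and variable names are made up for the example.

```python
import numpy as np  # "np" is the conventional alias for numpy

# Hypothetical sample values
values = np.array([1.0, 2.0, 3.0, 4.0])

# Functions are reached through the alias with the dot operator
mean_value = np.mean(values)          # function in the top-level numpy namespace
norm_value = np.linalg.norm(values)   # function in the numpy.linalg submodule

print(mean_value)
print(norm_value)
```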

Importing required dataset

This step applies when the user doesn’t wish to build their own dataset by collecting data and cleaning it. Many ready-made datasets are available as CSV files. The pandas function ‘read_csv’ reads a CSV file and returns its contents as a data frame, which can then be worked with. Below is an example showing how a CSV file can be read into a data frame.

import pandas as pd
data_set = pd.read_csv("path to csv file")

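Based on the column values, the data frame can then be separated into the independent variables (the features) and a dependent variable vector, which holds the values that the model will learn to predict.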
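
The following is a minimal sketch of reading a CSV and separating the features from the dependent variable. The column names and values are invented, and the CSV is read from an in-memory string so that the example runs without a file; it also assumes that the last column holds the value to predict.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice a file path is passed to pd.read_csv
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""

data_set = pd.read_csv(io.StringIO(csv_text))

# Independent variables: every column except the last, as a NumPy array
X = data_set.iloc[:, :-1].values
# Dependent variable vector: the last column
y = data_set.iloc[:, -1].values

print(X.shape)
print(y)
```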

Handling missing data in the dataset

There is usually no ideal dataset. Real datasets tend to contain discrepancies such as missing or redundant data, and how these are handled depends on the data. Sometimes the entire row can be removed, irrelevant columns can be eliminated, or a missing or irrelevant value can be replaced with a substitute value.
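
The three options mentioned above (removing rows, removing columns, and substituting values) can be sketched with pandas as follows. The data frame and its column names are hypothetical, and which option is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "salary": [50000.0, 60000.0, np.nan, 80000.0],
    "notes": [np.nan, np.nan, np.nan, np.nan],  # column with no usable data
})

# Option 1: eliminate a column that carries no information
df = df.drop(columns=["notes"])

# Option 2: remove every row that still has a missing value
rows_dropped = df.dropna()

# Option 3: replace each missing value with the mean of its column
filled = df.fillna(df.mean())

print(rows_dropped)
print(filled)
```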

Older versions of the scikit-learn library provided a class named Imputer for handling missing data. It was removed in scikit-learn 0.22, and its replacement is the SimpleImputer class in the sklearn.impute module. An instance of SimpleImputer is created so that its methods can be used. The class has parameters such as ‘missing_values’ and ‘strategy’; the old ‘axis’ parameter no longer exists, because SimpleImputer always works column by column. The imputer object is first fitted to the selected columns of the dataset, which computes the mean of each column. Next, the missing values are replaced by the mean of their column using the method named ‘transform’. The example assumes that X is a NumPy array and that columns 1 to 3 are numeric.

import numpy as np
from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(X[:, 1:4])
X[:, 1:4] = imputer.transform(X[:, 1:4])
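
As a small worked example of mean imputation, the sketch below uses SimpleImputer from sklearn.impute (the class available in recent scikit-learn versions) on a made-up array, so that the learned column means and the filled values can be inspected. The array contents are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix; np.nan marks the missing entries
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X)  # fit and transform in one call

print(imputer.statistics_)  # per-column means learned during fit
print(X_imputed)
```

Here the first column has mean 2.0 and the second has mean 15.0 (missing entries are ignored when the means are computed), so those are the values written into the gaps.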

Handling categorical data

When data is in the form of text, it needs to be converted to numbers, because learning algorithms operate on numeric values and cannot work with strings directly. This can be done using the LabelEncoder class, which is present in the scikit-learn library. An object of the LabelEncoder class is created, and its ‘fit_transform’ method is called with the column to be encoded. This way, each distinct text value is replaced by an integer. This works well when there are 2 categories. With more than two categories, the integer codes suggest an ordering that may not exist in the data, so one-hot encoding is usually preferred in that case. Note also that the scikit-learn documentation describes LabelEncoder as intended for target labels; for input features, it provides OrdinalEncoder and OneHotEncoder.

from sklearn.preprocessing import LabelEncoder 
labelencoder_X = LabelEncoder() 
X[:,0] = labelencoder_X.fit_transform(X[:,0])
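
To make the difference between integer codes and one-hot encoding concrete, here is a minimal sketch on a made-up column with three categories. The use of pandas get_dummies for the one-hot step is a choice made for this example and is not part of the original walkthrough.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column
countries = np.array(["France", "Spain", "Germany", "Spain"])

# Label encoding: classes are sorted, then numbered from 0
encoder = LabelEncoder()
codes = encoder.fit_transform(countries)
print(encoder.classes_)
print(codes)  # [0 2 1 2]

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(pd.Series(countries, name="country"))
print(one_hot)
```

With label encoding, Spain (2) appears "larger" than Germany (1), which has no meaning for country names; the one-hot columns avoid that.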

Splitting dataset into training and test datasets

The dataset needs to be split into training and test datasets so that, once training is completed on the training dataset, the performance of the learning model can be measured on data it has not seen. A common choice is to use 80 percent of the data for training and 20 percent for testing. This can be achieved using the train_test_split function from the scikit-learn library. Its ‘test_size’ parameter sets the fraction of the dataset assigned to the test dataset.

from sklearn.model_selection import train_test_split 
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2)
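
A short sketch of the 80/20 split on made-up data is shown below, so that the resulting sizes can be checked. The arrays are hypothetical, and the ‘random_state’ argument is added here only to make the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and one label per sample
X = np.arange(20).reshape(10, 2)
Y = np.arange(10)

# random_state fixes the shuffle so the split is the same on every run
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

The rows of X and the entries of Y are shuffled together, so each sample keeps its own label after the split.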

Feature scaling

This is one of the most important steps in data pre-processing. It refers to standardizing the range of the independent variables, or features, present within the dataset. Many learning algorithms behave better when all the variables are on the same scale, because features with large numeric ranges no longer dominate the others. This can be achieved using the ‘StandardScaler’ class that is present in the scikit-learn library, which rescales each feature to zero mean and unit variance. The scaler is first fitted on the training dataset, which is then transformed. The test dataset, on the other hand, is only transformed, using the mean and standard deviation learned from the training dataset, so that no information from the test data leaks into training.

from sklearn.preprocessing import StandardScaler 
sc_X = StandardScaler() 
X_train = sc_X.fit_transform(X_train) 
X_test = sc_X.transform(X_test)
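
The sketch below applies the same fit-then-transform pattern to made-up numbers, to show that each value is rescaled as (value - training mean) / training standard deviation. The arrays and the ‘_scaled’ variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

sc_X = StandardScaler()
X_train_scaled = sc_X.fit_transform(X_train)  # learns mean and std from the training data
X_test_scaled = sc_X.transform(X_test)        # reuses the training mean and std

print(sc_X.mean_)                    # per-feature means of the training data
print(X_train_scaled.mean(axis=0))   # approximately 0 for every feature
print(X_train_scaled.std(axis=0))    # approximately 1 for every feature
print(X_test_scaled)
```

Note that the scaled test values are not guaranteed to have zero mean or unit variance, since they are expressed relative to the training statistics.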

Conclusion

In this post, we saw how data can be pre-processed by following a list of steps.


