Machine Learning Tutorial

By KnowledgeHut .

Why Data Pre-processing?

Data pre-processing is one of the most important steps in any machine learning task.

Data pre-processing refers to bringing all the data collected from various sources into a single format, or into uniform sets of data (based on the type of data), so that the learning algorithm can learn from it more easily and predict results more accurately.

Real-world data is rarely ideal: it has missing cells, errors, outliers, inconsistent names, and much more.

Data pre-processing isn’t a single task but a series of tasks performed step by step, where the output of one step is the input to the next.

The steps are listed below:

Data cleaning
Data transformation
Data reduction

Data cleaning

Data cleaning itself consists of multiple steps: parsing, data correction, data standardization, data matching, data consolidation, and data staging.

Data parsing

Parsing refers to identifying individual data elements in the source files and separating these elements into specific files.

Data correction

Data correction refers to correcting each element of the parsed file with the help of high-level algorithms.

Data standardization

Data standardization refers to converting these data elements into a preferred format that is consistent, following agreed protocols.
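As a rough illustration of standardization, the sketch below brings differently typed versions of the same name into one format. The function name standardize_name, the sample values, and the chosen convention (trimmed, single-spaced, title case) are assumptions made for this example; a real project would follow whatever convention its own data requires.

```python
import re


def standardize_name(raw: str) -> str:
    """Trim the value, collapse repeated whitespace, and apply title case."""
    collapsed = re.sub(r"\s+", " ", raw.strip())
    return collapsed.title()


# Hypothetical raw values that all refer to the same city.
names = ["  new   york", "NEW YORK ", "New York"]
standardized = [standardize_name(n) for n in names]
print(standardized)
```

After this step every variant has the same spelling, which makes the later matching step much simpler.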

Data matching

To eliminate redundant data (duplicates), data elements are searched for and matched against the original data.
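One simple way to match and remove duplicates is to compare records on a normalized key and keep only the first occurrence. The helper drop_duplicates, the record fields, and the matching rule (case-insensitive, whitespace-trimmed) are assumptions for illustration only.

```python
def drop_duplicates(records, key):
    """Keep the first record for each key value and discard later matches."""
    seen = set()
    unique = []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            unique.append(record)
    return unique


def match_key(record):
    # Two records match when name and city agree, ignoring case and padding.
    return (record["name"].strip().lower(), record["city"].strip().lower())


# Hypothetical records; the second is a duplicate of the first.
records = [
    {"name": "Alice", "city": "Pune"},
    {"name": "alice ", "city": "pune"},
    {"name": "Bob", "city": "Delhi"},
]
unique = drop_duplicates(records, key=match_key)
print(unique)
```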

Data consolidation

Once redundancy has been removed from the data, the relationships between the records are analyzed and matched so that they can be represented in one format.

Whether data is collected from multiple sources or from a single one, real-world data is never ideal. It will have missing values, irrelevant data, or unidentified characters.

This happens because data is collected improperly or labelled incorrectly. The missing and irrelevant parts of the data need to be either corrected or removed completely. Otherwise, the machine learning algorithm's predictions on new data will be less accurate, because it treats the irrelevant and unidentified data (the noise) as if it were relevant.
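The two options mentioned above, removing missing values or correcting them, can be sketched as follows. The column ages and the choice of the mean as the replacement value are assumptions for this example; other fill strategies (median, most frequent value) are equally common.

```python
from statistics import mean

# Hypothetical column where None marks a missing cell.
ages = [25, None, 31, 40, None]

# Option 1: remove the missing entries completely.
cleaned = [a for a in ages if a is not None]

# Option 2: correct them by filling in the mean of the observed values.
fill_value = mean(cleaned)
imputed = [a if a is not None else fill_value for a in ages]

print(cleaned)
print(imputed)
```

Removing rows is simpler but discards data, while filling keeps the dataset size at the cost of introducing estimated values.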

Noisy data can be handled in many different ways:

Binning method

The data is first sorted and then divided into segments (bins). Each segment is then smoothed separately, for example by replacing its values with the segment mean.
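A minimal sketch of binning, assuming equal-sized bins and smoothing by the bin mean (one of several possible smoothing rules). The function name smooth_by_bin_means and the sample values are made up for this example.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins, and replace each value
    with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        segment = ordered[start:start + bin_size]
        segment_mean = sum(segment) / len(segment)
        smoothed.extend([segment_mean] * len(segment))
    return smoothed


# Hypothetical price values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
print(smoothed)
```

Each bin of three values is handled on its own, so a single unusual value only affects the mean of its own bin.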

Regression

The data is fitted to a regression model, and the fitted values are used to smooth the data, thereby reducing the noise.
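As an illustration, the sketch below fits a straight line by ordinary least squares and uses the fitted values as the smoothed data. A simple linear model is assumed here purely for the example; the helper fit_line and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy measurements that roughly follow a line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
# The fitted values replace the noisy observations.
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept)
print(smoothed)
```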

Clustering

Similar data points are grouped into clusters; noise (outliers or unusual data) falls outside these clusters and can then be eliminated or disregarded.
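The idea can be sketched with a very simple one-dimensional clustering rule: sorted values that lie close together form a cluster, and values that end up in a cluster of their own are treated as noise. This gap-based rule, the function cluster_1d, and the thresholds are assumptions chosen to keep the example short; in practice a clustering algorithm such as k-means or DBSCAN would typically be used.

```python
def cluster_1d(values, max_gap):
    """Group sorted values into clusters; a new cluster starts whenever the
    gap to the previous value exceeds max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for value in ordered[1:]:
        if value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


# Hypothetical sensor readings with one unusual value.
readings = [10, 11, 12, 13, 50, 51, 52, 200]
clusters = cluster_1d(readings, max_gap=5)

# Values in clusters with fewer than two members are treated as noise.
kept = [v for c in clusters if len(c) >= 2 for v in c]
noise = [v for c in clusters if len(c) < 2 for v in c]
print(kept)
print(noise)
```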

Data transformation

This step gets the dataset into a format that is easy for the learning algorithm to work with. There are many data transformation methods; some of them are discussed below.

Normalization

This step brings all the data values onto a common scale, for example by scaling them to lie in the range -1.0 to 1.0 or 0.0 to 1.0.
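One common way to do this is min-max scaling, sketched below for both target ranges mentioned above. The function min_max_scale and the sample heights are assumptions for illustration; min-max scaling is only one of several normalization techniques.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the
    largest maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Hypothetical heights in centimetres.
heights = [150, 160, 170, 180, 190]

unit_range = min_max_scale(heights)                 # scaled to 0.0 .. 1.0
signed_range = min_max_scale(heights, -1.0, 1.0)    # scaled to -1.0 .. 1.0
print(unit_range)
print(signed_range)
```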

Feature construction

New features (columns) are constructed by combining two or more existing features so that they aid the process of prediction.
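Constructing a new feature from existing ones can be sketched as follows. The columns weight_kg and height_m and the derived bmi column are hypothetical, chosen only because the combination is easy to follow.

```python
# Hypothetical rows with two existing features.
rows = [
    {"weight_kg": 70.0, "height_m": 1.75},
    {"weight_kg": 90.0, "height_m": 1.80},
]

# Combine the two features into a new one: body mass index.
for row in rows:
    row["bmi"] = row["weight_kg"] / row["height_m"] ** 2

print(rows)
```

The new column carries information that neither original column expresses on its own, which is the point of combining features.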

Conclusion

In this post, we saw why pre-processing data matters and looked at a few of the methods involved.

