
Machine Learning Tutorial

Normalization

Data fed to a learning algorithm should be consistent and structured, with all input features on a single scale. In the real world, however, data is often unstructured, and most of the time its features are not on the same scale. This is where normalization comes into play. 

What is normalization? 

It is one of the most important data-preparation steps: it rescales the values of the numerical columns of the input dataset to a common scale, while making sure that the relative range of values is not distorted. 

Note: Not all machine learning input datasets need to be normalized. Normalization is required only when different features in a dataset have entirely different ranges of values. 

Consider this example: a person’s weight and their height. Heights and weights are not necessarily proportional. When predicting weight from height, normalized data lets the model learn the underlying patterns and produce predictions based on them. Without normalization, unusual heights and their corresponding weights can skew the predictions, making them less accurate. 

There are different kinds of normalization and some of them have been listed below: 

  • Min-max normalization 
  • Z Normalization 
  • Unit vector normalization

Min-max Normalization: It rescales the data to fall within the range of 0 to 1. Most of the time, it is applied to a specific feature or a set of features. 
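As a quick sketch, min-max normalization maps each value x to (x - min) / (max - min). The feature values below are made up purely for illustration; they are not from the tutorial’s dataset:

```python
import numpy as np

# Hypothetical feature values (illustrative only)
x = np.array([120.0, 250.0, 80.0, 400.0])

# Min-max normalization: rescale into the range [0, 1]
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)  # the smallest value maps to 0.0, the largest to 1.0
```

The same result can be obtained with scikit-learn’s MinMaxScaler, which applies this formula column by column.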

Z Normalization: It is also known as standardization, and it does not change the shape of the dataset’s distribution. It rescales the data so that the mean of the dataset becomes 0 and the standard deviation becomes 1. It can be applied to a single feature or a set of features. 
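A minimal sketch of z normalization, again with made-up values: subtract the mean and divide by the standard deviation.

```python
import numpy as np

# Hypothetical feature values; any numeric column works the same way
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Z normalization (standardization): subtract the mean, divide by the std
x_standardized = (x - x.mean()) / x.std()

print(x_standardized.mean())  # approximately 0
print(x_standardized.std())   # approximately 1
```

scikit-learn’s StandardScaler performs this same transformation per column.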

Unit vector Normalization: When data is scaled, it either shrinks or expands. Every row of data can be visualized as an n-dimensional vector, and unit vector normalization rescales each such vector to length 1. When normalization is applied to the entire dataset, the transformed data can be visualized as a set of unit vectors pointing in different directions. 
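As a sketch of the idea (the rows below are invented for illustration), each row is divided by its Euclidean length, which is what scikit-learn’s preprocessing.normalize does with its default L2 norm:

```python
import numpy as np

# Two hypothetical rows, each treated as a 3-dimensional vector
rows = np.array([[3.0, 4.0, 0.0],
                 [1.0, 2.0, 2.0]])

# Divide each row by its Euclidean (L2) length
norms = np.linalg.norm(rows, axis=1, keepdims=True)
unit_rows = rows / norms

print(unit_rows[0])  # [0.6, 0.8, 0.0] -- same direction, length 1
print(np.linalg.norm(unit_rows, axis=1))  # every row now has length 1
```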

Let us look at an example of normalizing data: 

from sklearn import preprocessing 
import numpy as np 
import pandas as pd 
# Obtain the dataset 
df = pd.read_csv("C:\\Users\\Vishal\\Desktop\\train.csv", sep=",") 
# Normalize the column 'total_bedrooms' 
x_array = np.array(df['total_bedrooms']) 
normalized_X = preprocessing.normalize([x_array]) 

Compare the ‘train.csv’ file and the normalized_X array to see how the ‘total_bedrooms’ column’s data was normalized. (Because the column is passed to preprocessing.normalize as a single row, the entire column is scaled so that its values form one unit vector.) 

Why should data be normalized? 

  • Normalized data makes the model less sensitive to each feature’s scale during training, which means the values of the coefficients can be found efficiently and reliably. 
  • When evaluating which machine learning model would yield good results, normalized data makes the analysis of the candidate models much more efficient. 
  • Optimization becomes more feasible, since gradient-based methods converge faster when all features share a similar scale. 

Conclusion 

In this post, we understood why normalization is important, how it affects the input dataset, and how it can be applied to simple CSV files. 
