Data fed to a learning algorithm as input should be consistent and structured, and all features of the input data should be on a single scale. Real-world data, however, is often unstructured, and its features are frequently on different scales. This is where normalization comes into play.
Normalization is one of the most important data-preparation steps. It changes the values of the numerical columns of the input dataset so that they share a common scale, without distorting the relative differences between the values within a column.
Note: Not all machine learning input datasets need to be normalized. Normalization is required only when different features in a dataset have entirely different ranges of values.
Consider this example: a person’s weight and height. Heights and weights are not necessarily proportional. When predicting weight from height, normalized data lets the model learn the underlying patterns and base its predictions on them. If the data is not normalized, unusual heights and their corresponding weights can influence the predictions disproportionately, which may make them less accurate.
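One way to see why differing ranges matter is to compare distances between data points before and after scaling. The sketch below is only an illustration: the three people, their values, and the helper names distance and min_max_columns are made up for this example, and heights are assumed to be in metres and weights in kilograms.

```python
import math

# Hypothetical people as (height in metres, weight in kilograms).
a = (1.60, 60.0)
b = (1.90, 61.0)  # much taller than a, almost the same weight
c = (1.61, 70.0)  # almost the same height as a, noticeably heavier


def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def min_max_columns(rows):
    """Min-max scale each column (feature) of a list of rows to [0, 1].

    Assumes every column contains at least two distinct values.
    """
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        scaled_columns.append([(v - lo) / (hi - lo) for v in col])
    return [tuple(row) for row in zip(*scaled_columns)]


# On raw values the weight column dominates, because its range is far larger.
raw_ab, raw_ac = distance(a, b), distance(a, c)

# After scaling, a height difference and a weight difference count comparably.
sa, sb, sc = min_max_columns([a, b, c])
scaled_ab, scaled_ac = distance(sa, sb), distance(sa, sc)

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```

On the raw values, the person who is 30 cm taller looks far "closer" to a than the person who is 10 kg heavier, purely because kilograms span a larger numeric range than metres. After scaling, the two distances are nearly equal.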
There are different kinds of normalization; some of them are listed below:
Min-max Normalization: It rescales the data to fall in the range 0 to 1. Most of the time, it is applied to a specific feature or a set of features.
Z Normalization: Also known as standardization, it does not change the type of distribution of the dataset. It shifts and scales the data so that the mean becomes 0 and the standard deviation becomes 1. It can be applied to a single feature or a set of features.
Unit vector Normalization: Every row of data can be visualized as an n-dimensional vector. Scaling a vector shrinks or expands it, and unit vector normalization scales each row so that its length becomes 1. When it is applied to the entire dataset, the transformed data can be visualized as a set of unit-length vectors that point in different directions.
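The three kinds of normalization listed above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names and sample values are made up, the population standard deviation is assumed for Z normalization, and constant or all-zero inputs are mapped to zeros by assumption.

```python
import math
import statistics


def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature has no spread, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_normalize(values):
    """Shift to mean 0 and scale to (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # Assumption: a constant feature is mapped to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def unit_vector_normalize(row):
    """Divide a row by its Euclidean (L2) length so the result has length 1."""
    length = math.sqrt(sum(v * v for v in row))
    if length == 0:
        # Assumption: an all-zero row is left as zeros.
        return [0.0 for _ in row]
    return [v / length for v in row]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # made-up sample values

print(min_max_normalize(heights_cm))      # [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_normalize(heights_cm))            # mean 0, standard deviation 1
print(unit_vector_normalize([3.0, 4.0]))  # [0.6, 0.8]
```

Note that the first two functions work on one feature (a column) at a time, while the third works on one row at a time.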
Let us look at an example of normalizing data:
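    from sklearn import preprocessing
    import numpy as np
    import pandas as pd

    # Obtain the dataset
    df = pd.read_csv("C:\\Users\\Vishal\\Desktop\\train.csv", sep=",")

    # Normalize the column - total_bedrooms (selected here as column '0')
    x_array = np.array(df['0'])
    # normalize() expects a 2-D array, so the column is wrapped in a list and
    # treated as a single row; by default that row is scaled to unit (L2) length
    normalized_X = preprocessing.normalize([x_array])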
Compare the ‘train.csv’ file with the normalized_X array to see how the data in column ‘0’ was normalized.
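The example above calls preprocessing.normalize, which with its default settings scales each row to unit length, so it performs unit vector normalization. For the other two kinds listed earlier, scikit-learn provides MinMaxScaler and StandardScaler. A minimal sketch, assuming a small made-up array in place of the CSV column:

```python
import numpy as np
from sklearn import preprocessing

# Made-up values standing in for one numeric column of a dataset.
x_array = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# The scalers expect a 2-D array of shape (n_samples, n_features),
# so the single column is reshaped to 5 rows x 1 column.
column = x_array.reshape(-1, 1)

# Min-max normalization: smallest value -> 0, largest value -> 1.
min_max = preprocessing.MinMaxScaler().fit_transform(column)

# Z normalization (standardization): mean 0, standard deviation 1.
z_scores = preprocessing.StandardScaler().fit_transform(column)

# For comparison: normalize() treats [x_array] as one row and, with its
# default L2 norm, divides that row by its Euclidean length.
unit = preprocessing.normalize([x_array])

print(min_max.ravel())
print(z_scores.ravel())
print(unit.ravel())
```

Which of the three to use depends on the data and the learning algorithm; the sketch only shows how the calls differ in the shape of input they expect.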
In this post, we saw why normalization is important, how it affects the input dataset, and how it can be applied to simple CSV files.