Data Preparation for Machine Learning Projects


Last updated on
17th Mar, 2021
28th Oct, 2020

The data we collect for machine learning must be pre-processed before it can be used to fit a model. Data preparation is essentially the task of transforming raw data into a form that can be used for modelling, mostly by adding, deleting or otherwise transforming data.

We need to pre-process the data before feeding it into any algorithm, mainly for the following reasons:

  1. Messy data: Real-world data is messy, with missing values, redundant values, out-of-range values, errors and noise.
  2. Most machine learning algorithms need numeric input.
  3. More often than not, algorithms have requirements on the input data. For example, some algorithms assume a certain probability distribution of the data, while others might perform worse if the predictor variables are highly correlated.

Data preparation tasks depend mostly on the dataset we are working with, and to some extent on the choice of model. Which tasks are needed becomes more evident after initial analysis of the data and exploratory data analysis (EDA). For example, the summary statistics show whether predictors need to be scaled, the correlation matrix shows whether there are highly correlated predictors, and plots such as box plots show whether outliers need to be dealt with.

Even though every dataset is different, we can define a few common steps which can guide us in preparing the data to feed into our learning algorithms. 

Some common tasks that contribute to data pre-processing are: 

  1. Data Cleaning 
  2. Feature Selection 
  3. Data Transformation 
  4. Feature Engineering 
  5. Dimensionality Reduction 

Note: Throughout this article, we will refer to Python libraries and syntax.

Data Cleaning: It can be summed up as the process of correcting the errors in the data. Errors could be in the form of missing values, redundant rows or columns, variables with zero or near-zero variance and so on. Thus, data cleaning involves a few or all of the below sub-tasks:

  • Redundant samples or duplicate rows should be identified and dropped from the dataset. In Python, Pandas functions such as duplicated() can be used to identify such samples and drop_duplicates() can be used to drop such rows.
  • Redundant Features: If the dataset has features which are highly correlated, it may lead to multicollinearity (unstable regression coefficient estimates). Such columns can be identified using the correlation matrix, and one feature from each highly correlated pair should be dropped. Similarly, zero-variance features, which have the same value for all the samples, and near-zero-variance features, which have almost the same value for all the samples, contribute little to the variance in the data. Such columns should be identified and dropped from the dataset.

  • Outlier Detection: Outliers are extreme values which fall far away from other observations. Outliers can skew the descriptive statistics of the data, hence mislead data interpretations and negatively impact model performance. So, it is important that the outliers are detected and dealt with. Outliers can be detected through data visualization techniques like box-plots and scatter plots.  

[Figure: example of outliers being detected using box plots]

Outliers can also be detected by computing z-scores or the inter-quartile range (IQR). When using z-scores, a data point which is more than 3 standard deviations away from the mean is normally considered an outlier. However, this threshold may vary based on the size of the dataset. When using the inter-quartile range, a point which is below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is considered an outlier, where Q1 is the first quartile, Q3 is the third quartile and IQR = Q3 - Q1.
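Both rules can be sketched in a few lines of pandas. This is a minimal illustration on synthetic data, not a prescribed implementation; the series `values` and the injected outlier of 120 are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 50 well-behaved values plus one obvious outlier (120).
rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(loc=50, scale=5, size=50), 120.0))

# z-score rule: flag points more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 3]

# IQR rule: flag points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers)
print(iqr_outliers)
```

The IQR rule may flag a few more points than the z-score rule, because an extreme value inflates the standard deviation but barely moves the quartiles.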

[Figure: outliers lying more than 3 standard deviations from the mean]

If there are only a few outliers, you may choose to drop the samples that contain them. If there are many, they can be modelled separately. We may also cap the outlier values at the 95th percentile or floor them at the 5th percentile. The appropriate replacement value can be chosen by analyzing the deciles of the data.
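Capping and flooring can be sketched with pandas `quantile()` and `clip()`. The data below is hypothetical, and the 5th/95th percentile bounds simply follow the values mentioned above.

```python
import pandas as pd

# Hypothetical data with one very large and one very small value.
values = pd.Series([12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 200.0, 13.0, 1.0])

# Floor at the 5th percentile and cap at the 95th percentile.
lower = values.quantile(0.05)
upper = values.quantile(0.95)
capped = values.clip(lower=lower, upper=upper)

print(capped)
```

Unlike dropping rows, this keeps the sample count unchanged, which matters when the other columns of those rows are still informative.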

  • Missing Values: Most algorithms cannot work with missing values; hence any missing values should be identified and cleaned. If a predictor (column) or a sample (row) is mostly empty, we may choose to drop the entire column or row. Otherwise, we may impute the missing values with the mean or median. Missing values in categorical variables can be replaced with the most frequent class.
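A minimal sketch of identifying and imputing missing values with pandas follows. The DataFrame and its column names (`age`, `city`) are hypothetical, and median imputation is just one of the options mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing numeric and categorical values.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

# Count missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# Numeric column: impute with the median of the observed values.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: impute with the most frequent class.
df["city"] = df["city"].fillna(df["city"].mode()[0])
```

In a real project the median and the most frequent class would be computed on the training split only and then reused for validation and new data.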

Points to remember: 

  • Use z-scores for outlier detection if the data follows a Gaussian distribution; otherwise use the inter-quartile range.
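The duplicate-row and redundant-feature checks described under data cleaning can be sketched as follows. The synthetic DataFrame, its column names and the 0.95 correlation threshold are all assumptions chosen for illustration; a suitable threshold depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: a near-copy of "x", an unrelated column, a constant column.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({
    "x": x,
    "x_copy": 2 * x + rng.normal(scale=0.01, size=100),  # almost perfectly correlated with x
    "noise": rng.normal(size=100),
    "constant": 1.0,                                      # zero-variance column
})
df = pd.concat([df, df.iloc[:5]], ignore_index=True)      # append 5 duplicate rows

# 1. Identify and drop duplicate rows.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates()

# 2. Drop columns that hold a single value for every sample.
constant_cols = [col for col in df.columns if df[col].nunique() <= 1]
df = df.drop(columns=constant_cols)

# 3. Drop one feature from each highly correlated pair.
#    Only the upper triangle is inspected so that each pair is seen once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)

print(n_duplicates, constant_cols, to_drop, list(df.columns))
```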

Feature Selection: Sometimes datasets have hundreds of input variables, not all of which are good predictors of the target; some may only contribute noise. Feature selection techniques are used to find the input variables that most effectively predict the target variable, in order to reduce the number of input variables.

Feature selection techniques can be further classified as supervised and unsupervised. As the name suggests, unsupervised selection techniques do not consider the target variable while eliminating input variables. This includes techniques like using correlation to eliminate highly correlated predictors or eliminating low-variance predictors. Supervised feature selection techniques consider the target variable when selecting the features to be eliminated. These can be further divided into three groups, namely Intrinsic, Filter and Wrapper techniques.

  • Intrinsic: The feature selection process is embedded in the model building process itself, e.g. tree-based algorithms, which pick the best predictor for each split. Similarly, regularization techniques like lasso shrink the coefficients of the predictors such that some coefficients can be shrunk to zero, and those predictors are hence excluded from the model. Multivariate adaptive regression spline (MARS) models also fall under this category. A major advantage of such methods is that since the feature selection is a part of the model building process, it is relatively fast. However, model dependence can also prove to be disadvantageous, e.g. some tree-based algorithms are greedy and hence may select predictors that lead to a sub-optimal fit.

  • Filter: Filter-based selection techniques use a statistical method to score each predictor separately against the target variable and choose the predictors with the highest scores. This is mostly univariate analysis, i.e., each predictor is evaluated in isolation. It does not consider the correlation of the independent variables amongst themselves.

Based on the type of the input variable (numerical or categorical) and the type of the output variable, an appropriate statistical measure can be used to evaluate predictors for feature selection, for example Pearson’s correlation coefficient, Spearman’s correlation coefficient, ANOVA or Chi-square.
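As one possible illustration of a filter technique, the sketch below scores numerical predictors against a categorical target with the ANOVA F-test, using scikit-learn's `SelectKBest` and `f_classif`. The synthetic dataset and the choice of `k=3` are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data: 10 predictors, of which 3 are informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Score each predictor in isolation and keep the 3 with the highest scores.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

selected_idx = selector.get_support(indices=True)  # column indices that were kept
print(selected_idx, selector.scores_.round(2))
```

For other variable-type combinations a different `score_func` would be passed, for example `chi2` for non-negative categorical counts against a categorical target.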

  • Wrapper: Wrapper feature selection iteratively builds models using various subsets of predictors and evaluates each model, until it finds a subset of features which best predicts the target. These methods are agnostic to the type of variables. However, they are computationally more taxing. Recursive Feature Elimination (RFE) is a commonly used wrapper-based feature selection method.

Recursive Feature Elimination is a greedy backward elimination technique, which starts with the complete set of predictors and systematically eliminates the less useful ones, until it is left with the specified number of predictors that best predict the target variable. Two important hyperparameters of the RFE algorithm in scikit-learn are the number of predictors to keep (n_features_to_select) and the algorithm of choice (estimator).
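A minimal RFE sketch with scikit-learn follows. The synthetic dataset, the choice of logistic regression as the estimator and the value of `n_features_to_select` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification data: 10 predictors, of which 3 are informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Start from all predictors and recursively drop the weakest one
# (judged by the estimator's coefficients) until 3 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

X_reduced = rfe.transform(X)
print(rfe.support_)   # boolean mask of the selected predictors
print(rfe.ranking_)   # rank 1 marks a selected predictor
```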

Points to remember: 

  • Feature selection techniques reduce the number of features by excluding or eliminating the existing features from the dataset, whereas dimensionality reduction techniques create a projection of the data in lower dimensional feature space, which does not have a one-to-one mapping with the existing features. However, both have a similar goal of reducing the number of independent variables. 

Data Transformations: We may need to transform data to change its data type, scale or distribution. 

Type: We need to analyze the input variables at the very beginning to understand whether the predictors are represented with the appropriate data type, and do the required conversions before progressing with the EDA and modelling. For example, Boolean values are sometimes encoded as true and false, and we may transform them to take the values 0 and 1. Similarly, we may come across integer variables which are more appropriately treated as categorical variables. For example, when working on a dataset to predict car prices, it would be more appropriate to treat the variable ‘Number of doors’, which takes the values {2,4}, as a categorical variable.

Categorical variables should be converted to numeric before they can be used for modelling. There are many categorical variable encoding techniques, such as N-1 dummy encoding, one-hot encoding, label encoding and frequency encoding. Ordinal encoding can be used when we want to specify and maintain the order of an ordinal variable.
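Some of these encodings can be sketched with pandas alone. The DataFrame, its column names (`fuel`, `size`) and the category order are hypothetical.

```python
import pandas as pd

# Hypothetical car dataset with one nominal and one ordinal variable.
df = pd.DataFrame({
    "fuel": ["petrol", "diesel", "petrol", "cng"],
    "size": ["small", "large", "medium", "small"],
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["fuel"], prefix="fuel")

# N-1 dummy encoding: drop the first category to avoid a redundant column.
dummy = pd.get_dummies(df["fuel"], prefix="fuel", drop_first=True)

# Ordinal encoding: state the order explicitly so that small < medium < large.
size_order = ["small", "medium", "large"]
df["size_encoded"] = pd.Categorical(df["size"], categories=size_order, ordered=True).codes

# Frequency encoding: replace each category by its relative frequency.
df["fuel_freq"] = df["fuel"].map(df["fuel"].value_counts(normalize=True))

print(one_hot.columns.tolist(), dummy.columns.tolist())
print(df)
```

scikit-learn offers equivalent transformers (`OneHotEncoder`, `OrdinalEncoder`), which are convenient when the encoding has to be learned on the training set and replayed on new data.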

Scale: Predictor variables may have different units (km, $, years etc.) and hence different scales. For example, we might have input variables like age and salary in a dataset. The values of salary will typically be much larger than those of age, so the two variables may contribute unequally to the model and create a bias. Hence, we transform the predictors to bring them to a common scale. Normalization and standardization are the most widely used scaling techniques.

  • Normalization: Scales the data such that all values lie in the range 0 to 1. The scikit-learn implementation even allows one to specify a different preferred range.

[Figure: data shown before and after normalization]

  • Standardization: We standardize the data by centering it around the mean and then scaling it by the standard deviation. In other words, the mean of the variable is subtracted from each value of the input variable and the difference is divided by the standard deviation of the variable. The resulting data will have zero mean and a standard deviation of 1. Standardization works best when the data approximately follows a Gaussian distribution. The scikit-learn library in Python can be used for normalization (MinMaxScaler()) and standardization (StandardScaler()).

[Figure: data shown before and after standardization]
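Both scaling techniques can be sketched with the scikit-learn classes named above. Normalization computes x' = (x - min) / (max - min) per column, and standardization computes z = (x - mean) / std per column. The age and salary values below are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical predictors on very different scales: [age, salary].
X = np.array([
    [25.0, 30000.0],
    [35.0, 52000.0],
    [45.0, 81000.0],
    [55.0, 120000.0],
])

# Normalization: rescale each column to the range [0, 1].
# feature_range can be changed if another range is preferred.
X_norm = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Standardization: zero mean and unit standard deviation per column.
X_std = StandardScaler().fit_transform(X)

print(X_norm)
print(X_std)
```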

  • Distribution: Many algorithms assume a Gaussian distribution for the underlying data. If the data is not Gaussian, or is only Gaussian-like, we can transform it to reduce the skewness. The Box-Cox transform or the Yeo-Johnson transform can be used to perform power transformations on the data. The Box-Cox transform applies a different transformation to the data based on the value of lambda. For example, for lambda = -1 it does an inverse (reciprocal) transformation, for lambda = 0 a log transformation, for lambda = 0.5 a square root transformation, and for lambda = -0.5 a reciprocal square root transformation.

The PowerTransformer() class in the Python scikit-learn library can be used for making these power transformations.

[Figure: data shown before and after log transformation]
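A minimal sketch of a power transformation with `PowerTransformer` follows. The log-normal sample is synthetic and chosen only because it is strongly right-skewed. Note that Box-Cox requires strictly positive inputs, whereas Yeo-Johnson also accepts zero and negative values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed, strictly positive data.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

# Box-Cox estimates lambda from the data; the output is also
# standardized because standardize=True is the default.
pt = PowerTransformer(method="box-cox")
X_bc = pt.fit_transform(X)

skew_before = pd.Series(X.ravel()).skew()
skew_after = pd.Series(X_bc.ravel()).skew()
print(pt.lambdas_, skew_before, skew_after)
```

For data that contains zeros or negative values, `method="yeo-johnson"` (the default) would be used instead.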

Points to remember: 

  • Data transformations should be fitted on the training dataset, so that the statistics required for the transformation are estimated from the training set only, and then applied to the validation set.
  • Decision trees and tree-based ensembles like random forests and boosting algorithms are not impacted by differences in the scale of the input variables. Hence scaling may not be required.
  • Linear regression and neural networks, which use a weighted sum of the input variables, and K-nearest neighbors or SVM, which compute distances or dot products between samples, are impacted by the scale of the predictors; hence input variables should be scaled for these models.
  • Between normalization and standardization, one should standardize when the data follows a Gaussian distribution, else normalize.
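The first point, estimating the transformation on the training set only, can be sketched as below. The synthetic dataset, the split ratio and the logistic regression model are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Estimate the mean and standard deviation from the training set only...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# ...and reuse those same statistics on the validation set.
X_val_scaled = scaler.transform(X_val)

# A pipeline bundles the two steps so the scaler is always fitted
# on whatever data the model is fitted on.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)
print(val_accuracy)
```

Fitting the scaler on the full dataset before splitting would leak information about the validation set into training.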

Feature Engineering: This is the part of data pre-processing where we derive new features using one or more existing features. For example, when working on a taxi fare prediction problem, we may derive a new feature, the distance travelled in the ride, from the latitude and longitude co-ordinates of the start and end points of the ride. Or, when predicting sales or footfall for a retail business, we may need to add a new feature to factor in the impact of holidays, weekends and festivals on the target variable. Hence, we may need to engineer these new predictors and feed them into our model to identify the underlying patterns effectively.
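The two examples above can be sketched as follows. The `rides` DataFrame, its column names and the helper `haversine_km` are hypothetical; the great-circle (haversine) distance is used here as one reasonable stand-in for "distance travelled", which in reality depends on the route taken.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # 6371 km: mean Earth radius

# Hypothetical taxi rides.
rides = pd.DataFrame({
    "pickup_lat": [40.7580, 40.6413],
    "pickup_lon": [-73.9855, -73.7781],
    "dropoff_lat": [40.7484, 40.7580],
    "dropoff_lon": [-73.9857, -73.9855],
    "pickup_time": pd.to_datetime(["2021-03-13 18:30", "2021-03-15 09:00"]),
})

# New feature 1: straight-line distance of the ride.
rides["distance_km"] = haversine_km(
    rides["pickup_lat"], rides["pickup_lon"],
    rides["dropoff_lat"], rides["dropoff_lon"],
)

# New feature 2: weekend flag (Monday is 0, so Saturday and Sunday are 5 and 6).
rides["is_weekend"] = rides["pickup_time"].dt.dayofweek >= 5

print(rides[["distance_km", "is_weekend"]])
```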

Polynomial term: We may also add new features by raising the existing input variables to a higher degree polynomial. Polynomial terms help the model learn the non-linear patterns. When polynomial terms of existing features are added to the linear regression model, it is termed as polynomial regression. Usually, we stick to a smaller degree of 2 or 3. 

Interaction term: We may add new features that represent interaction between existing features by adding a product of two features. For e.g. if we are working on a problem to help businesses allocate their marketing budget between various marketing mediums like radio, TV and newspaper, we need to model how effective each medium is. We may like to factor in the interaction term of radio and newspaper campaign, to understand the effectiveness of marketing if both the radio and newspaper campaigns were run together at the same time. 

Similarly, when predicting a crop yield, we may engineer a new interaction term for fertilizer and water together to factor in how the yield varies when water and fertilizer are provided together. 

Points to remember: 

  • When using polynomial terms in the model, it is good practice to restrict the degree of the polynomial to 3 or at most 4. Firstly, this controls the number of input variables. Secondly, a larger degree produces very large feature values, which can lead to large weights (parameters) and make the model overly sensitive to small changes in the input.
  • Domain knowledge or the advice of the SME may come in handy to identify effective interaction terms. 

Dimensionality Reduction: Sometimes data might have hundreds or even thousands of features. Models built on high-dimensional data can be more complicated, with many more parameters to train and a very complex structure. In higher dimensions, the volume of the space is huge and the data points become sparse, which could negatively impact the performance of the machine learning algorithm. This is sometimes referred to as the curse of dimensionality.

Dimensionality Reduction techniques are used to reduce the number of predictor variables in the dataset. Some techniques for dimensionality reduction are: 

  1. PCA or Principal Component Analysis uses linear algebra (eigenvalues and eigenvectors) to achieve dimensionality reduction. For the given data points, PCA finds an orthogonal set of directions along which the variance is maximum. By rotating the reference frame, it identifies the directions (the ones which correspond to the smallest eigenvalues) which can be neglected.

[Figure: Principal Component Analysis applied to a dataset]
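A minimal PCA sketch with scikit-learn follows. The Iris dataset and the choice of two components are assumptions for illustration. The scaler and the PCA are fitted on the training split and the same fitted objects are then applied to the validation split.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)   # 150 samples, 4 features
X_train, X_val = train_test_split(X, test_size=0.25, random_state=0)

# Scale first, then find the two directions of maximum variance.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))

# Apply the same fitted scaler and projection to both splits.
X_train_2d = pca.transform(scaler.transform(X_train))
X_val_2d = pca.transform(scaler.transform(X_val))

explained = pca.explained_variance_ratio_   # share of variance kept per component
print(explained, explained.sum())
```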

  2. Manifold learning is a non-linear dimensionality reduction technique which uses the geometric properties of the data to create low-dimensional projections of high-dimensional data while preserving its structure and relationships. It is also used to visualize high-dimensional data, which is otherwise difficult. The Self-Organizing Map (SOM), also called the Kohonen map, and t-SNE are examples of manifold learning techniques.

t-distributed stochastic neighbor embedding (t-SNE) computes the probability that pairs of datapoints (in high dimension) are related and maps them in low dimension, such that data has a similar distribution. 
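A minimal t-SNE sketch with scikit-learn follows. The digits dataset, the 300-sample subset and the `perplexity` value are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 8x8 digit images flattened to 64 features; a subset keeps the run short.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

# Embed the 64-dimensional points into 2 dimensions for visualization.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)
```

scikit-learn's `TSNE` exposes `fit_transform()` but no separate `transform()`, so it is typically used to visualize a fixed dataset rather than to project new samples.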

  3. Autoencoders are deep learning neural networks that learn a low-dimensional representation of a given dataset in an unsupervised manner. The hidden layer is limited to fewer neurons than the input, so the network learns to map a high-dimensional input vector to a low-dimensional vector while still preserving the underlying structure and relationships in the data. Autoencoders have two parts: an encoder, which learns to map a high-dimensional vector to a low-dimensional space, and a decoder, which maps the data from the low-dimensional space back to the high-dimensional one. The reduced-dimension output of the encoder can be fed into any other model for supervised learning.

Points to remember:  

  • Dimensionality reduction is mostly performed after data cleaning and data scaling.
  • The dimensionality reduction fitted on the training dataset must also be applied to the validation set and to any new data on which the model will predict.


Data preparation is an important and integral step of machine learning projects. There are multiple techniques for various data cleaning tasks. However, there are no best or worst data cleaning techniques. Every machine learning problem is unique and so is the underlying data. We need to apply different techniques and see what works best based on the data and the problem at hand.  


Suchita Singh


With 16+ years of experience, including a decade at organisations like IBM, Suchita is currently a data scientist at Algoritmo Lab, where she works hands-on with various tools and technologies and helps lead a team of junior data scientists.