# Data Preparation for Machine Learning Projects

The data we collect for machine learning must be pre-processed before it can be used to fit a model. Data preparation is essentially the task of turning raw data into a form that can be used for modelling, mostly by adding, deleting or otherwise transforming data.

We need to pre-process the data before feeding it into any algorithm mainly for the following reasons:

1. Messy data: Real-world data is messy, with missing values, redundant values, out-of-range values, errors and noise.
2. Most machine learning algorithms need numeric data.
3. Algorithms often place requirements on the input data. For example, some algorithms assume a certain probability distribution of the data, while others may perform worse if the predictor variables are highly correlated.

Data preparation tasks depend mostly on the dataset we are working with, and to some extent on the choice of model. The tasks that are needed usually become evident after an initial analysis of the data and exploratory data analysis (EDA). For example, summary statistics show whether predictors need to be scaled, a correlation matrix reveals highly correlated predictors, and plots such as box plots show whether outliers need to be dealt with.

Even though every dataset is different, we can define a few common steps which guide us in preparing the data to feed into our learning algorithms. Some common data pre-processing tasks are:

1. Data Cleaning
2. Feature Selection
3. Data Transformation
4. Feature Engineering
5. Dimensionality Reduction

Note: Throughout this article, we will refer to Python libraries and syntax.

Data Cleaning: It can be summed up as the process of correcting the errors in the data.
Errors could take the form of missing values, redundant rows or columns, variables with zero or near-zero variance, and so on. Thus, data cleaning involves some or all of the sub-tasks below:

• Redundant samples or duplicate rows: These should be identified and dropped from the dataset. In Python, Pandas functions such as duplicated() can be used to identify such samples and drop_duplicates() can be used to drop them.
• Redundant features: If the dataset has features which are highly correlated, it may lead to multicollinearity (unstable regression coefficient estimates). Such columns can be identified using the correlation matrix, and one feature from each highly correlated pair should be dropped. Similarly, zero or near-zero variance features, which have the same (or almost the same) value for all samples, contribute little to the variance in the data. Such columns should be identified and dropped from the dataset.
• Outlier detection: Outliers are extreme values which fall far away from other observations. Outliers can skew the descriptive statistics of the data, and hence mislead data interpretation and negatively impact model performance. So it is important that outliers are detected and dealt with. Outliers can be detected through data visualization techniques like box plots and scatter plots.

Figure: Example of outliers being detected using box plots.

Outliers can also be detected by computing z-scores or the inter-quartile range (IQR). When using the z-score, a data point which is more than 3 standard deviations away from the mean is normally considered an outlier, although this threshold may vary with the size of the dataset. When using the inter-quartile range, a point which is below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is considered an outlier, where Q1 is the first quartile and Q3 is the third quartile.

Figure: Outliers which are more than 3 standard deviations from the mean.

If there are only a few outliers, you may choose to drop the samples that contain them.
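The duplicate, zero-variance and outlier checks described above can be sketched roughly as follows. This is a minimal illustration rather than a definitive recipe: the column names, the toy values and the synthetic data with one injected extreme value are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration.
df = pd.DataFrame({
    "age":     [25, 32, 47, 36, 36],
    "salary":  [40, 52, 61, 50, 50],
    "country": ["IN"] * 5,          # same value in every row (zero variance)
})

# Duplicate rows: count rows identical to an earlier row, then drop them.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates()

# Zero-variance columns: a single unique value carries no information.
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=constant_cols)

# Outlier detection on synthetic, roughly Gaussian data
# with one injected extreme value (120.0).
rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(loc=50, scale=5, size=200), 120.0))

# z-score rule: more than 3 standard deviations away from the mean.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 3]

# IQR rule: below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print("duplicates dropped:", n_duplicates, "| constant columns:", constant_cols)
print("z-score outliers:", len(z_outliers), "| IQR outliers:", len(iqr_outliers))
```

Note that on very small samples the z-score rule can fail to flag even an obvious extreme value, because the outlier itself inflates the standard deviation. This is one reason the threshold may need adjusting for the size of the dataset.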
If there are many outliers, they can be modelled separately. We may also cap or floor the outlier values at the 95th or 5th percentile value respectively, although a more appropriate replacement value can be chosen by analyzing the deciles of the data.

• Missing values: Most algorithms cannot be fitted on data with missing values, hence any missing values should be identified and cleaned. If a predictor (column) or sample (row) consists mostly of missing values, we may choose to drop the entire column or row. Otherwise, we may impute the missing values with the mean or median. Missing values in categorical variables can be replaced with the most frequent class.

Points to remember:

• Use the z-score for outlier detection if the data follows a Gaussian distribution; otherwise use the inter-quartile range.

Feature Selection: Sometimes datasets have hundreds of input variables, not all of which are good predictors of the target, and some of which may only add noise to the data. Feature selection techniques are used to find the input variables that most efficiently predict the target variable, in order to reduce the number of input variables.

Feature selection techniques can be classified as supervised or unsupervised. As the name suggests, unsupervised selection techniques do not consider the target variable while eliminating input variables. These include using correlation to eliminate highly correlated predictors, or eliminating low-variance predictors. Supervised feature selection techniques consider the target variable when selecting the features to eliminate. These can be further divided into three groups: intrinsic, filter and wrapper techniques.

• Intrinsic: The feature selection process is embedded in the model building process itself, e.g. tree-based algorithms, which pick the best predictor for each split.
Similarly, regularization techniques like lasso shrink the coefficients of the predictors, and the coefficients of some predictors can be shrunk all the way to zero, which excludes those predictors from the model. Multivariate adaptive regression spline (MARS) models also fall under this category. A major advantage of such methods is that, since feature selection is part of the model building process, it is relatively fast. However, model dependence can also be a disadvantage: for example, some tree-based algorithms are greedy and hence may select predictors that lead to a sub-optimal fit.

• Filter: Filter-based selection techniques use a statistical measure to score each predictor separately against the target variable and choose the predictors with the highest scores. This is mostly univariate analysis, i.e. each predictor is evaluated in isolation, without considering the correlation of the independent variables amongst themselves. Depending on the type of the input variable (numerical or categorical) and the type of the output variable, an appropriate statistical measure can be chosen, for example Pearson's correlation coefficient, Spearman's correlation coefficient, ANOVA or Chi-square.

• Wrapper: Wrapper feature selection iteratively builds and evaluates models using various subsets of predictors until it finds the subset of features that best predicts the target. These methods are agnostic to the type of variables, but they are computationally more taxing. Recursive Feature Elimination (RFE) is a commonly used wrapper-based feature selection method. It is a greedy backward elimination technique which starts with the complete set of predictors and systematically eliminates the less useful ones until the specified number of predictors remains.
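As a rough illustration of the filter and wrapper approaches described above, the sketch below scores predictors with an ANOVA F-test and then runs RFE on the same data. The synthetic dataset, the choice of logistic regression as the estimator and the target of 3 features are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 10 predictors, 3 informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)

# Filter: score each predictor against the target in isolation (ANOVA F-test)
# and keep the 3 highest-scoring predictors.
filter_selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
filter_mask = filter_selector.get_support()

# Wrapper: RFE repeatedly fits the estimator and drops the weakest predictor
# until only n_features_to_select predictors remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=3).fit(X, y)
rfe_mask = rfe.support_

print("Filter keeps columns:", filter_mask.nonzero()[0])
print("RFE keeps columns:   ", rfe_mask.nonzero()[0])
```

The two methods need not agree: the filter looks at one predictor at a time, while RFE judges predictors by their contribution to a fitted model.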
Two important hyperparameters for the RFE algorithm in scikit-learn are the number of predictors to keep (n_features_to_select) and the algorithm of choice (estimator).

Points to remember:

• Feature selection techniques reduce the number of features by excluding or eliminating existing features from the dataset, whereas dimensionality reduction techniques create a projection of the data in a lower-dimensional feature space, which does not have a one-to-one mapping with the existing features. However, both share the goal of reducing the number of independent variables.

Data Transformations: We may need to transform data to change its data type, scale or distribution.

Type: We need to analyze the input variables at the very beginning to understand whether the predictors are represented with the appropriate data type, and do the required conversions before progressing with EDA and modelling. For example, Boolean values are sometimes encoded as true and false, and we may transform them to take the values 0 and 1. Similarly, we may come across integer variables which are more appropriately treated as categorical. For example, when working on a dataset to predict car prices, it would be more appropriate to treat the variable 'Number of doors', which takes the values {2, 4}, as a categorical variable.

Categorical variables should be converted to numeric before they can be used for modelling. There are many categorical variable encoding techniques, such as N-1 dummy encoding, one-hot encoding, label encoding and frequency encoding. Ordinal encoding can be used when we want to specify and maintain the order of an ordinal variable.

Scale: Predictor variables may have different units (km, $, years etc.) and hence different scales. For example, we might have input variables like age and salary in a dataset. The scale of the variable salary will typically be much larger than that of age, so the two may contribute unequally to the model and create a bias.
Hence, we transform the predictors to bring them to a common scale. Normalization and standardization are the most widely used scaling techniques.

• Normalization: Rescales the data so that all values lie in the range 0 to 1. The scikit-learn method also allows you to specify a different preferred range.

Figure: Data before and after normalization.

• Standardization: We standardize the data by centering it around the mean and then scaling it by the standard deviation. In other words, the mean of the variable is subtracted from each value of the input variable and the difference is divided by the standard deviation of the variable. The resulting data has zero mean and a standard deviation of 1. Standardization is most appropriate when the data roughly follows a Gaussian distribution. The scikit-learn library in Python can be used for normalization (MinMaxScaler()) and standardization (StandardScaler()).

Figure: Data before and after standardization.
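To make the two scaling techniques concrete, here is a small sketch using scikit-learn's MinMaxScaler and StandardScaler. The age and salary values are hypothetical and chosen only so the results are easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: column 0 is age in years, column 1 is salary.
X = np.array([[25.0, 30000.0],
              [35.0, 50000.0],
              [45.0, 70000.0]])

# Normalization: (x - min) / (max - min), so each column ends up in [0, 1].
normalized = MinMaxScaler().fit_transform(X)

# The target range can be changed through the feature_range parameter.
normalized_custom = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Standardization: (x - mean) / std, so each column has mean 0 and std 1.
standardized = StandardScaler().fit_transform(X)

print(normalized)
print(normalized_custom)
print(standardized)
```

After either transformation, age and salary are on comparable scales even though their raw values differ by three orders of magnitude.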

Distribution: Many algorithms assume a Gaussian distribution for the underlying data. If the data is not Gaussian, or is only Gaussian-like, we can transform it to reduce the skewness. The Box-Cox transform or the Yeo-Johnson transform can be used to perform power transformations on the data. The Box-Cox transform applies a different transformation to the data depending on the value of lambda. For example, for lambda = -1 it is the inverse (reciprocal) transformation, for lambda = 0 the log transformation, for lambda = 0.5 the square root transformation, and for lambda = -0.5 the reciprocal square root transformation.

The PowerTransformer() class in Python's scikit-learn library can be used to apply these power transformations.

Figure: Data before and after log transformation.
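The power transformations above can be tried out with a short sketch like the one below. The log-normal sample is synthetic and only stands in for a right-skewed variable; scipy's skew function is used here just to measure the effect. Box-Cox requires strictly positive data, while Yeo-Johnson also accepts zero and negative values.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed, strictly positive data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

# Box-Cox needs strictly positive inputs; Yeo-Johnson has no such restriction.
box_cox = PowerTransformer(method="box-cox")
yeo_johnson = PowerTransformer(method="yeo-johnson")

# Lambda is estimated from the data during fitting.
X_box_cox = box_cox.fit_transform(X)
X_yeo_johnson = yeo_johnson.fit_transform(X)

print("Estimated Box-Cox lambda:", box_cox.lambdas_[0])
print("Skewness before:", skew(X[:, 0]))
print("Skewness after Box-Cox:", skew(X_box_cox[:, 0]))
print("Skewness after Yeo-Johnson:", skew(X_yeo_johnson[:, 0]))
```

Note that PowerTransformer estimates lambda itself rather than taking it as an input, and by default it also standardizes the transformed output to zero mean and unit variance.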

Points to remember:

• Data transformations should be fitted on the training dataset only, so that the statistics required for the transformation (for example the mean and standard deviation) are estimated from the training set and then applied to the validation set.
• Decision trees and tree-based ensembles like random forests and boosting algorithms are not affected by the scale of the input variables, so scaling may not be required for them.
• Linear regression and neural networks, which use a weighted sum of the input variables, and K-nearest neighbors or SVM, which compute distances or dot products between samples, are affected by the scale of the predictors, so input variables should be scaled for these models.
• As a rule of thumb, standardize when the data follows a Gaussian distribution; otherwise normalize.
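The first point above, estimating transformation statistics from the training set only, can be sketched as follows. The random data and the 75/25 split are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 200 samples, 3 predictors.
rng = np.random.default_rng(42)
X = rng.normal(loc=100.0, scale=20.0, size=(200, 3))

X_train, X_val = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()

# fit_transform estimates the mean and standard deviation from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)

# transform reuses those training statistics; nothing is re-estimated here.
X_val_scaled = scaler.transform(X_val)

print("Training means after scaling:  ", X_train_scaled.mean(axis=0))
print("Validation means after scaling:", X_val_scaled.mean(axis=0))
```

The validation means are close to zero but not exactly zero, which is expected: the validation set was scaled with statistics it did not contribute to, just as unseen data would be.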

Feature Engineering: This is the part of data pre-processing where we derive new features from one or more existing features. For example, when working on a taxi fare prediction problem, we may derive a new feature, the distance travelled in the ride, from the latitude and longitude coordinates of the start and end points of the ride. Or, when predicting sales or footfall for a retail business, we may need to add new features that capture the impact of holidays, weekends and festivals on the target variable. We engineer these new predictors and feed them into our model to help it identify the underlying patterns effectively.
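A minimal sketch of both ideas is shown below: a straight-line (haversine) distance derived from coordinates, and a weekend flag derived from a timestamp. The column names, coordinates and timestamps are hypothetical, and the haversine formula is only one possible distance proxy; it ignores the actual road route.

```python
import numpy as np
import pandas as pd

# Hypothetical ride data; column names and values are made up for illustration.
rides = pd.DataFrame({
    "pickup_lat":  [40.7580, 40.6413],
    "pickup_lon":  [-73.9855, -73.7781],
    "dropoff_lat": [40.7484, 40.7580],
    "dropoff_lon": [-73.9857, -73.9855],
    "pickup_time": pd.to_datetime(["2023-07-01 10:00", "2023-07-03 18:30"]),
})


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    earth_radius_km = 6371.0
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))


# New feature 1: distance travelled, derived from start and end coordinates.
rides["distance_km"] = haversine_km(rides["pickup_lat"], rides["pickup_lon"],
                                    rides["dropoff_lat"], rides["dropoff_lon"])

# New feature 2: weekend flag (Monday is 0, so Saturday and Sunday are 5 and 6).
rides["is_weekend"] = rides["pickup_time"].dt.dayofweek >= 5

print(rides[["distance_km", "is_weekend"]])
```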

Polynomial term: We may also add new features by raising existing input variables to a higher power. Polynomial terms help the model learn non-linear patterns. When polynomial terms of existing features are added to a linear regression model, it is termed polynomial regression. Usually, we stick to a small degree of 2 or 3.

Interaction term: We may add new features that represent the interaction between existing features by adding the product of two features. For example, if we are working on a problem to help businesses allocate their marketing budget between media such as radio, TV and newspaper, we need to model how effective each medium is. We may want to include an interaction term for the radio and newspaper campaigns, to understand the effectiveness of marketing when both campaigns are run at the same time.

Similarly, when predicting crop yield, we may engineer an interaction term for fertilizer and water to capture how the yield varies when both are provided together.
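Polynomial and interaction terms like the ones described above can be generated with scikit-learn's PolynomialFeatures. The sketch below uses two hypothetical columns, radio and newspaper spend, with made-up values.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: column 0 is radio spend, column 1 is newspaper spend.
X = np.array([[2.0, 3.0],
              [1.0, 5.0]])

# Degree-2 polynomial terms.
# Output columns: radio, newspaper, radio^2, radio * newspaper, newspaper^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Interaction terms only (no squares).
# Output columns: radio, newspaper, radio * newspaper
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False)
X_inter = interactions.fit_transform(X)

print(X_poly)
print(X_inter)
```

Two input columns already become five at degree 2, which shows how quickly the number of input variables grows with the degree.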

Points to remember:

• When using polynomial terms in the model, it is good practice to restrict the degree of the polynomial to 3 or at most 4. Firstly, this controls the number of input variables. Secondly, a larger degree produces very large feature values, which can destabilise the weights (parameters) and make the model overly sensitive to small changes in the input.
• Domain knowledge or the advice of a subject-matter expert (SME) may come in handy when identifying effective interaction terms.

Dimensionality Reduction: Sometimes data might have hundreds or even thousands of features. High-dimensional data can be more complicated to work with, leading to many more parameters to train and a very complicated model structure. In higher dimensions, the volume of the space is huge and the data points become sparse, which can negatively impact the performance of machine learning algorithms. This is sometimes referred to as the curse of dimensionality.

Dimensionality Reduction techniques are used to reduce the number of predictor variables in the dataset. Some techniques for dimensionality reduction are:

1. PCA, or Principal Component Analysis, uses linear algebra (the eigenvalues and eigenvectors of the data's covariance matrix) to achieve dimensionality reduction. For a given set of data points, PCA finds an orthogonal set of directions along which the variance is maximal. After rotating the reference frame onto these directions, those with the smallest eigenvalues carry the least variance and can be neglected.

Principal Component Analysis applied to a dataset is shown below:

2. Manifold learning is a family of non-linear dimensionality reduction techniques that use geometric properties of the data to create low-dimensional projections of high-dimensional data while preserving its structure and relationships. It is often used to visualize high-dimensional data, which is otherwise difficult. The Self-Organizing Map (SOM), also called the Kohonen map, and t-SNE are examples of manifold learning techniques.

t-distributed stochastic neighbor embedding (t-SNE) converts the similarity between pairs of data points in the high-dimensional space into probabilities, and then places the points in a low-dimensional space so that the pairwise probabilities there follow a similar distribution.

3. Autoencoders are neural networks that learn a low-dimensional representation of a given dataset in an unsupervised manner. The hidden (bottleneck) layer is limited to fewer neurons than the input, so the network learns to map a high-dimensional input vector to a low-dimensional vector while still preserving the underlying structure and relationships in the data. An autoencoder has two parts: an encoder, which learns to map the high-dimensional vector to a low-dimensional space, and a decoder, which maps the data from the low-dimensional space back to the original dimension. The reduced-dimension output of the encoder can be fed into another model for supervised learning.
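Returning to PCA from the list above, the eigenvalue-based procedure can be sketched in a few lines of NumPy. This is an illustrative implementation on synthetic data, not a replacement for a library routine such as scikit-learn's `PCA`; the function name `pca` and the toy data are assumptions made for the example.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via eigendecomposition."""
    # Center the data so the covariance matrix describes spread around the mean.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]  # largest variance first
    components = eigenvectors[:, order[:n_components]]
    explained = eigenvalues[order[:n_components]] / eigenvalues.sum()
    return X_centered @ components, explained

# Toy 3-D data that lies close to a single line, plus a little noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t, -t]) + 0.05 * rng.normal(size=(200, 3))

X_reduced, explained = pca(X, n_components=1)
print(X_reduced.shape, explained)
```

Because the toy data is almost one-dimensional, a single component captures nearly all of the variance, and the two directions with small eigenvalues can be dropped.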

Points to remember:

• Dimensionality reduction is mostly performed after data cleaning and data scaling.
• The dimensionality reduction fitted on the training data set must also be applied, unchanged, to the validation data and to any new data on which the model will predict.
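The two points above can be sketched as follows: the scaling statistics and the projection directions are learned from the training rows only, and the same values are then reused on the validation rows. The data here is random and the 80/20 split is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
X_train, X_valid = X[:80], X[80:]

# "Fit" step: learn scaling statistics from the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_scaled = (X_train - mean) / std

# Learn the top two principal directions from the scaled training data.
_, _, Vt = np.linalg.svd(X_train_scaled, full_matrices=False)
components = Vt[:2].T  # shape (5, 2)

# "Transform" step: reuse the training mean, std and components everywhere.
Z_train = X_train_scaled @ components
Z_valid = ((X_valid - mean) / std) @ components
print(Z_train.shape, Z_valid.shape)
```

Recomputing the mean, standard deviation or components on the validation set would give it a different coordinate system from the one the model was trained in. In scikit-learn the same pattern is expressed as calling `fit` on the training set and `transform` on everything else.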

Conclusion:

Data preparation is an important and integral step of machine learning projects. There are multiple techniques for various data cleaning tasks. However, there are no best or worst data cleaning techniques. Every machine learning problem is unique and so is the underlying data. We need to apply different techniques and see what works best based on the data and the problem at hand.

### Suchita Singh

Author

With 16+ years of experience, having served organisations like IBM for a decade, Suchita is currently playing the role of a data scientist at Algoritmo Lab with core hands-on with various tools and technologies and is helping lead a team of junior data scientists.
