Activation Functions for Deep Neural Networks


The Universal Approximation Theorem 

Any predictive model is a mathematical function, y = f(x), that maps the features (x) to the target variable (y). The function f(x) can be a linear function or a fairly complex nonlinear one, and how accurately it predicts depends on the distribution of the data; in the case of neural networks, it also depends on the type of network architecture employed. The Universal Approximation Theorem states that a feed-forward neural network with even a single hidden layer and a suitable non-linear activation function can approximate any continuous function f(x) on a bounded input domain to arbitrary accuracy, given enough neurons. In order to build a proper neural network architecture, let us take a look at activation functions. 

What are Activation Functions? 

Simply put, activation functions define the output of a neuron given a certain set of inputs. They are mathematical functions added to neural network models to enable the models to learn complex patterns. An activation function takes the output from the previous layer and passes it through a mathematical function (usually a non-linear one) to convert it into a form that can serve as the input to the next computation layer. Activation functions influence the final accuracy of a network model while also contributing to the computational efficiency of building the model. 

Why do we need Activation Functions? 

In a neural network, if each hidden layer simply computed a weighted sum of its inputs, the whole network would reduce to a linear function, which is equivalent to a linear regression model. 

Image source: Neural Network Architecture

In the above diagram, each hidden-layer unit is simply a weighted sum of the inputs from the input layer. For example, b1 = b·w1 + a1·w2 + a2·w3, which is nothing but a linear function.

A linear combination of linear functions is itself a linear function. So no matter how many linear functions we add, or how many hidden linear layers we stack, the output remains linear, as the short sketch below demonstrates.
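
To make this concrete, here is a minimal NumPy sketch (the layer sizes and random weights are made up for illustration) showing that two stacked linear layers collapse into a single linear layer:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 3))                 # a batch of 4 samples with 3 features

    # Two purely linear "hidden layers" (no activation function)
    W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
    W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)
    two_layer_out = (x @ W1 + b1) @ W2 + b2

    # The same mapping collapses into one linear layer: W = W1 @ W2, b = b1 @ W2 + b2
    one_layer_out = x @ (W1 @ W2) + (b1 @ W2 + b2)

    print(np.allclose(two_layer_out, one_layer_out))   # True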

However, in the real world we more often than not need to model data that is non-linear and far more complex. Adding non-linear functions allows such non-linear decision boundaries to be built into the model.

Multi-layer neural network models can classify classes that are not linearly separable. However, to do so, the network needs to compute a nonlinear function of its inputs. For this nonlinear transformation to happen, we pass the weighted sum of the inputs through an activation function. These activation functions are nonlinear functions applied at the hidden layers. Each hidden layer can have a different activation function, though typically all neurons within a layer share the same one.

Additionally, a non-linear activation function applied to a neuron can act as a gate, selectively switching the neuron on or off.

Types of Activation Functions 

In this section we discuss the following: 

  • Linear Function 
  • Threshold Activation Function 
  • Bipolar Activation Function 
  • Logistic Sigmoid Function 
  • Bipolar Sigmoid Function 
  • Hyperbolic Tangent Function 
  • Rectified Linear Unit Function 
  • Swish Function (proposed by Google Brain, a deep learning artificial intelligence research team at Google) 

Linear Function: g(x) = x
A linear function has the form of a straight line, y = mx. Irrespective of the number of hidden layers, if all the layers are linear in nature, the final output is simply a linear function of the input values. Hence we take a look at the other activation functions, which are non-linear in nature and can help the network learn complex patterns. 

Note: A linear (identity) activation is useful at the output layer when we want the regression network to produce outputs over a wide, unbounded range.

Threshold Activation Function: g(x) = (sign(x) + 1)/2

In this case, the neuron is activated if the input is above a certain value. Note that this function outputs either a 1 or a 0. In effect, the step function divides the input space into two halves, so that one side of the hyperplane represents class 0 and the other side represents class 1. However, if we need to classify inputs into more than two categories, a threshold activation function is not suitable. Because of its binary output, this function is also known as the binary-step activation function.

Threshold Activation Function

Drawbacks:

  1. It can be used for binary classification only; it is not suited for multi-class classification problems.
  2. It does not support gradient-based learning, i.e., when you fine-tune the network you cannot tell whether changing the weights slightly has reduced the loss at all, because the function's gradient is zero almost everywhere.

Bipolar Activation Function: This is similar to the threshold function explained above; however, this activation function returns an output of either -1 or +1 based on a threshold (a short sketch of both step functions follows the figure below).

Bipolar Activation Function
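
As an illustration, here is a minimal NumPy sketch of the two step functions, assuming the common convention that an input of exactly zero maps to the "off" value:

    import numpy as np

    def threshold(x):
        # binary step: 1 if the input is positive, else 0
        return np.where(x > 0, 1, 0)

    def bipolar_threshold(x):
        # bipolar step: +1 if the input is positive, else -1
        return np.where(x > 0, 1, -1)

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(threshold(x))           # [0 0 0 1 1]
    print(bipolar_threshold(x))   # [-1 -1 -1  1  1]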

Logistic Sigmoid Function: One of the most frequently used activation functions is the Logistic Sigmoid Function. Its output ranges between 0 and 1 and is plotted as an ‘S’ shaped graph.

Logistic Sigmoid Function

This is a nonlinear function in which a small change in x around the midpoint leads to a large change in y. It is generally used for binary classification where the expected output is 0 or 1: the function produces an output between 0 and 1, and a default threshold of 0.5 is used to convert this continuous output into class 0 or 1 for classifying the observations.

Another variation of the logistic sigmoid function is the bipolar sigmoid function. This activation function is a rescaled version of the logistic sigmoid (2σ(x) − 1) which provides an output in the range of -1 to +1. A short sketch of both sigmoid variants follows the figure below.

Bipolar Logistic Function
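
As a quick illustration, here is a minimal NumPy sketch of the logistic sigmoid and its bipolar (rescaled) variant:

    import numpy as np

    def sigmoid(x):
        # logistic sigmoid: output lies in (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def bipolar_sigmoid(x):
        # rescaled sigmoid 2*sigmoid(x) - 1: output lies in (-1, 1)
        return 2.0 * sigmoid(x) - 1.0

    x = np.array([-5.0, 0.0, 5.0])
    print(sigmoid(x))           # approximately [0.007 0.5   0.993]
    print(bipolar_sigmoid(x))   # approximately [-0.987  0.     0.987]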

Drawbacks: 

  1. Slow convergence - gradients enable learning only in the active region. When neurons fire in the saturation region (the top and bottom parts of the S curve), the gradients are very small or close to zero, so training becomes slow and convergence suffers.
  2. Vanishing gradient problem - when neurons fire in the saturation region, i.e., when the output of the previous layer lies in the saturation region, the gradients get close to zero and do not enable learning: even large changes in the parameters (weights) lead to very small changes in the output.

Hyperbolic Tangent Function: This activation function is quite similar to the sigmoid function, but its output ranges between -1 and +1. The output is therefore zero-centred, which makes weight initialization and optimization easier.

Hyperbolic Tangent Function
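
A minimal NumPy sketch of tanh and its derivative; note how the gradient collapses towards zero in the saturation regions, which is the drawback discussed next:

    import numpy as np

    def tanh(x):
        # hyperbolic tangent: zero-centred output in (-1, 1)
        return np.tanh(x)

    def tanh_grad(x):
        # derivative 1 - tanh(x)^2: close to zero in the saturation regions
        return 1.0 - np.tanh(x) ** 2

    x = np.array([-5.0, -2.0, 0.0, 2.0, 5.0])
    print(tanh(x))        # approximately [-1.    -0.96   0.     0.96   1.  ]
    print(tanh_grad(x))   # approximately [0.0002 0.07   1.     0.07   0.0002]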

Drawbacks:

  1. It too suffers from the vanishing gradient problem.
  2. It is slightly more expensive to compute.

Rectified Linear Activation Function: This activation function, also known as ReLU, outputs the input if it is positive and returns zero otherwise. That is to say, if the input is zero or less, the function returns 0; if it is greater than zero, it returns the input itself. Because it behaves like a linear function for positive inputs, it is computationally very simple.

This activation function has become quite popular and is often used because of its computational efficiency compared to the sigmoid and the hyperbolic tangent functions, which helps the model converge faster.

ReLU converges better than the sigmoid and tanh(x) functions, as ReLU has no saturation region for positive inputs: if the input from the previous layer is positive, it is passed through as-is, and if it is negative, it is simply clipped to zero.

Another critical point to note is that while the sigmoid and the hyperbolic tangent functions can only approach a zero value, the rectified linear activation function can return a true zero.

Rectified Linear Units Activation Function

One disadvantage of ReLU is that when the inputs are zero or negative, the gradient of the function becomes zero. This causes a problem during back-propagation, and the affected parts of the model cannot learn. If the data is such that the input to a particular neuron is always negative, the gradient flowing through that neuron during backward propagation will always be zero; since the gradient is zero, the weights of that neuron will never be updated and there will be no learning. If the weights are never updated, the neuron keeps producing the same negative pre-activations and therefore zero outputs, so no matter what, the neuron is effectively dead. This is commonly termed the “Dying ReLU” problem. Hence, when using ReLU, one should keep track of the fraction of dead neurons, as sketched below.
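
As a rough sketch of how one might monitor this (the helper name and the batch of pre-activations below are hypothetical), a neuron can be flagged as dead if its ReLU output is zero for every sample in a batch:

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def dead_fraction(activations, eps=1e-8):
        # a unit is counted as dead if its output is (numerically) zero for every sample
        dead_units = np.all(activations <= eps, axis=0)
        return dead_units.mean()

    rng = np.random.default_rng(1)
    pre_activations = rng.normal(loc=-3.0, scale=0.5, size=(256, 128))   # almost always negative
    print(dead_fraction(relu(pre_activations)))   # close to 1.0: nearly every unit is dead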

There are a few variations of the ReLU activation function, such as Noisy ReLU, Leaky ReLU, Parametric ReLU and Exponential Linear Units (ELU).

Leaky ReLU, a modified version of ReLU, helps solve the “Dying” ReLU problem. It allows back-propagation to proceed even when the inputs are negative. Unlike ReLU, Leaky ReLU defines a small linear component of x (a small slope) when x is negative. With this change, the gradient can take a non-zero value instead of zero, thus avoiding dead neurons. However, Leaky ReLU can still be less reliable when it comes to predicting negative values.

Exponential Linear Unit (ELU) is another variant of ReLU which, unlike ReLU and Leaky ReLU, uses an exponential curve instead of a straight line to define the output for negative values. A sketch of ReLU and these two variants is shown below.
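
A minimal NumPy sketch of ReLU and the two variants just described (the slope of 0.01 for Leaky ReLU and α = 1 for ELU are common defaults, assumed here for illustration):

    import numpy as np

    def relu(x):
        # passes positive inputs through unchanged, clips negatives to zero
        return np.maximum(0.0, x)

    def leaky_relu(x, alpha=0.01):
        # small linear slope alpha for negative inputs keeps the gradient non-zero
        return np.where(x > 0, x, alpha * x)

    def elu(x, alpha=1.0):
        # exponential curve alpha * (exp(x) - 1) for negative inputs
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

    x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
    print(relu(x))         # [0. 0. 0. 1. 3.]
    print(leaky_relu(x))   # [-0.03 -0.01  0.    1.    3.  ]
    print(elu(x))          # approximately [-0.95 -0.63  0.    1.    3.  ]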

Swish Activation Function: Swish is an activation function proposed by Google Brain. While ReLU returns zero for negative values, Swish does not return exact zeros for negative inputs. Swish is a self-gated function: whereas normal gating requires multiple scalar inputs, self-gating requires only a single input. Swish has some notable properties: unlike ReLU, it is a smooth and non-monotonic function, which can make it preferable to ReLU, and it is unbounded above and bounded below. Swish is defined as x · σ(βx), where σ(z) = 1 / (1 + exp(−z)) is the sigmoid function and β is a constant or a trainable parameter. A minimal sketch follows.
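
A minimal sketch of Swish, with β treated as a simple constant for illustration:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def swish(x, beta=1.0):
        # x * sigmoid(beta * x); beta can be a fixed constant or a trainable parameter
        return x * sigmoid(beta * x)

    x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
    print(swish(x))   # negative inputs give small negative outputs rather than exact zeros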

Activation functions in deep learning and the vanishing gradient problem 

Gradient-based methods are used by many algorithms to train models; neural networks in particular are trained with stochastic gradient descent. A neural network algorithm randomly initialises the weights of the layers and, once the output is predicted, calculates the prediction errors. It uses these errors to estimate a gradient that can be used to update the weights in the network, so as to reduce the prediction errors. The error gradient is propagated backward from the output layer to the input layer.  

It is often preferable to build a neural network model with a larger number of hidden layers; with more hidden layers, the model gains the capacity to perform more accurately.  

One problem with too many layers is that the gradient diminishes rapidly as it moves from the output layer to the input layer: during back-propagation, in order to obtain the update for the weights, we multiply many gradients and Jacobians together. If the largest singular value of these matrices is less than one, multiplying these less-than-one numbers yields a very small number, and the gradients diminish. When we update the weights with such a gradient, the update is tiny. By the time the gradient reaches the earliest layers, it may well be too small to have any effect on improving model performance. In short, this is a situation in which training a neural network with gradient-based methods becomes difficult, as the toy calculation below illustrates.  
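
As a toy illustration of this effect: the sigmoid's derivative is at most 0.25, so multiplying one such factor per layer shrinks the gradient geometrically with depth:

    sigmoid_grad_max = 0.25   # maximum of the sigmoid derivative sigma(x) * (1 - sigma(x))
    for depth in (5, 10, 20, 50):
        print(depth, sigmoid_grad_max ** depth)
    # 5  ~9.8e-04
    # 10 ~9.5e-07
    # 20 ~9.1e-13
    # 50 ~7.9e-31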

This is known as the vanishing gradient problem. Gradient-based methods are prone to this issue when certain activation functions are used in the network.  

Various activation functions are used in deep neural networks. However, when training deep models, the vanishing gradient problem can make training unstable or cause it to stall.  

Various workarounds have been proposed to solve this problem. The most commonly used activation function is ReLU, which has proven to perform far better than previously dominant activation functions such as the sigmoid or hyperbolic tangent. 

As mentioned above, Swish improves upon ReLU by being a smooth and non-monotonic function. However, although the vanishing gradient problem is much less severe with Swish, Swish does not completely avoid it. 

To tackle this problem, a new activation function has been proposed. 

“The activation function in the neural network is one of the important aspects which facilitates the deep training by introducing the nonlinearity into the learning process. However, because of zero-hard rectification, some of the existing activation functions such as ReLU and Swish miss to utilize the large negative input values and may suffer from the dying gradient problem. Thus, it is important to look for a better activation function which is free from such problems.... The proposed LiSHT activation function is an attempt to scale the non-linear Hyperbolic Tangent (Tanh) function by a linear function and tackle the dying gradient problem… A very promising performance improvement is observed on three different types of neural networks including Multi-layer Perceptron (MLP), Convolutional Neural Network (CNN) and Recurrent Neural Network like Long-short term memory (LSTM).” - Swalpa Kumar Roy, Suvojit Manna, et al., Jan 2019 

In a paper published here, Swalpa Kumar Roy, Suvojit Manna, et al. propose a new non-parametric activation function, the Linearly Scaled Hyperbolic Tangent (LiSHT), for neural networks that attempts to tackle the vanishing gradient problem. A brief sketch of the function follows. 
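
Based on the paper's description of scaling the non-linear tanh by a linear function, LiSHT can be sketched as x · tanh(x); the snippet below is an illustrative implementation under that assumption:

    import numpy as np

    def lisht(x):
        # Linearly Scaled Hyperbolic Tangent: scale tanh(x) by the (linear) input x
        return x * np.tanh(x)

    x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
    print(lisht(x))   # approximately [2.99 0.76 0.   0.76 2.99]
    # large negative inputs are mapped to large positive values instead of being zeroed out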

Suchita Singh

Author

With 16+ years of experience, including a decade at IBM, Suchita is currently a data scientist at Algoritmo Lab, where she works hands-on with various tools and technologies and helps lead a team of junior data scientists.


Suggested Blogs

The Role of Mathematics in Machine Learning

IntroductionAutomation and machine learning have changed our lives. From the most technologically savvy person working in leading digital platform companies like Google or Facebook to someone who is just a smartphone user, there are very few who have not been impacted by artificial intelligence or machine learning in some form or the other;  through social media, smart banking, healthcare or even Uber.  From self – driving Cars, robots, image recognition, diagnostic assessments, recommendation engines, Photo Tagging, fraud detection and more, the future for machine learning and AI is bright and full of untapped possibilities.With the promise of so much innovation and path-breaking ideas, any person remotely interested in futuristic technology may aspire to make a career in machine learning. But how can you, as a beginner, learn about the latest technologies and the various diverse fields that contribute to it? You may have heard of many cool sounding job profiles like Data Scientist, Data Analyst, Data Engineer, Machine Learning Engineer etc., that are not just rewarding monetarily but also allow one to grow as a developer and creator and work at some of the most prolific technology companies of our times. But how do you get started if you want to embark on a career in machine learning? What education background should you pursue and what are the skills you need to learn? Machine learning is a field that encompasses probability, statistics, computer science and algorithms that are used to create intelligent applications. These applications have the capability to glean useful and insightful information from data that is useful to arrive business insights. Since machine learning is all about the study and use of algorithms, it is important that you have a base in mathematics.Why do I need to Learn Math?Math has become part of our day-to-day life. From the time we wake up to the time we go to bed, we use math in every aspect of our life. But you may wonder about the importance of math in Machine learning and whether and how it can be used to solve any real-world business problems.Whatever your goal is, whether it’s to be a Data Scientist, Data Analyst, or Machine Learning Engineer, your primary area of focus should be on “Mathematics”.  Math is the basic building block to solve all the Business and Data driven applications in the real-world scenario. From analyzing company transactions to understanding how to grow in the day-to-day market, making future stock predictions of the company to predicting future sales, Math is used in almost every area of business. The applications of math are used in many Industries like Retail, Manufacturing, IT to bring out the company overview in terms of sales, production, goods intake, wage paid, prediction of their level in the present market and much more.Pillars of Machine LearningTo get a head start and familiarize ourselves with the latest technologies like Machine learning, Data Science, and Artificial Intelligence, we have to understand the basic concepts of Math, write our own Algorithms and implement  existing Algorithms to solve many real-world problems.There are four pillars of Machine Learning, in which most of our real-world business problems are solved. Many algorithms in Machine Learning are also written using these pillars. They areStatisticsProbabilityCalculusLinear AlgebraMachine learning is all about dealing with data. 
We collect the data from organizations or from any repositories like Kaggle, UCI etc., and perform various operations on the dataset like cleaning and processing the data, visualizing and predicting the output of the data. For all the operations we perform on data, there is one common foundation that helps us achieve all of this through computation-- and that is Math.STATISTICSIt is used in drawing conclusions from data. It deals with the statistical methods of collecting, presenting, analyzing and interpreting the Numerical data. Statistics plays an important role in the field of Machine Learning as it deals with large amounts of data and is a key factor behind growth and development of an organization.Collection of data is possible from Census, Samples, Primary or Secondary data sources and more. This stage helps us to identify our goals in order to work on further steps.The data that is collected contains noise, improper data, null values, outliers etc. We need to clean the data and transform it into a meaningful observations.The data should be represented in a suitable and concise manner. It is one of the most crucial steps as it helps to understand the insights and is used as the foundation for further analysis of data.Analysis of data includes Condensation, Summarization, Conclusion etc., through the means of central tendencies, dispersion, skewness, Kurtosis, co-relation, regression and other methods.The Interpretation step includes drawing conclusions from the data collected as the figures don’t speak for themselves.Statistics used in Machine Learning is broadly divided into two categories, based on the type of analyses they perform on the data. They are Descriptive Statistics and Inferential Statistics.a) Descriptive StatisticsConcerned with describing and summarizing the target populationIt works on a small dataset.The end results are shown in the form of pictorial representations.The tools used in Descriptive Statistics are – Mean, Median, Mode which are the measures of Central and Range, Standard Deviation, variance etc., which are the measures of Variability.b) Inferential StatisticsMethods of making decisions or predictions about a population based on the sample information.It works on a large dataset.Compares, tests and predicts the future outcomes.The end results are shown in the probability scores.The specialty of the inferential statistics is that, it makes conclusions about the population beyond the data available.Hypothesis tests, Sampling Distributions, Analysis of Variance (ANOVA) etc., are the tools used in Inferential Statistics.Statistics plays a crucial role in Machine Learning Algorithms. The role of a Data Analyst in the Industry is to draw conclusions from the data, and for this he/she requires Statistics and is dependent on it.PROBABILITYThe word probability denotes the happening of a certain event, and the likelihood of the occurrence of that event, based on old experiences. In the field of Machine Learning, it is used in predicting the likelihood of future events.  Probability of an event is calculated asP(Event) = Favorable Outcomes / Total Number of Possible OutcomesIn the field of Probability, an event is a set of outcomes of an experiment. The P(E) represents the probability of an event occurring, and E is called an Event. The probability of any event lies in between 0 to 1. A situation in which the event E might occur or not is called a Trail.Some of the basic concepts required in probability are as followsJoint Probability: P(A ∩ B) = P(A). 
P(B), this type of probability is possible only when the events A and B are Independent of each other.Conditional Probability: It is the probability of the happening of event A, when it is known that another event B has already happened and is denoted by P (A|B)i.e., P(A|B) = P(A ∩ B)/ P(B)Bayes theorem: It is referred to as the applications of the results of probability theory that involve estimating unknown probabilities and making decisions on the basis of new sample information. It is useful in solving business problems in the presence of additional information. The reason behind the popularity of this theorem is because of its usefulness in revising a set of old probabilities (Prior Probability) with some additional information and to derive a set of new probabilities (Posterior Probability).From the above equation it is inferred that “Bayes theorem explains the relationship between the Conditional Probabilities of events.” This theorem works mainly on uncertainty samples of data and is helpful in determining the ‘Specificity’ and ‘Sensitivity’ of data. This theorem plays an important role in drawing the CONFUSION MATRIX.Confusion matrix is a table-like structure that measures the performance of Machine Learning Models or Algorithms that we develop. This is helpful in determining the True Positive rates, True Negative Rates, False Positive Rates, False Negative Rates, Precision, Recall, F1-score, Accuracy, and Specificity in drawing the ROC Curve from the given data.We need to further focus on Probability distributions which are classified as Discrete and Continuous, Likelihood Estimation Functions etc. In Machine Learning, the Naive Bayes Algorithm works on the probabilistic way, with the assumption that input features are independent.Probability is an important area in most business applications as it helps in predicting the future outcomes from the data and takes further steps. Data Scientists, Data Analysts, and Machine Learning Engineers use this probability concept very often as their job is to take inputs and predict the possible outcomes.CALCULUS:This is a branch of Mathematics, that helps in studying rates of change of quantities. It deals with optimizing the performance of machine learning models or Algorithms. Without understanding this concept of calculus, it is difficult to compute probabilities on the data and we cannot draw the possible outcomes from the data we take. Calculus is mainly focused on integrals, limits, derivatives, and functions. It is divided into two types called Differential Statistics and Inferential Statistics. It is used in back propagation algorithms to train deep Neural Networks.Differential Calculus splits the given data into small pieces to know how it changes.Inferential Calculus combines (joins) the small pieces to find how much there is.Calculus is mainly used in optimizing Machine Learning and Deep Learning Algorithms. It is used to develop fast and efficient solutions. The concept of calculus is used in Algorithms like Gradient Descent and Stochastic Gradient Descent (SGD) algorithms and in Optimizers like Adam, Rms Drop, Adadelta etc.Data Scientists mainly use calculus in building many Deep Learning and Machine Learning Models. They are involved in optimizing the data and bringing out better outputs of data, by drawing intelligent insights hidden in them.Linear Algebra:Linear Algebra focuses more on computation. It plays a crucial role in understanding the background theory behind Machine learning and is also used for Deep Learning. 
It gives us better insights into how the algorithms really work in day-to-day life, and enables us to take better decisions. It mostly deals with Vectors and Matrices.A scalar is a single number.A vector is an array of numbers represented in a row or column, and it has only a single index for accessing it (i.e., either Rows or Columns)A matrix is a 2D array of numbers and can be accessed with the help of both the indices (i.e., by both rows and columns)A tensor is an array of numbers, placed in a grid in a particular order with a variable number of axes.The package named Numpy in the Python library is used in computation of all these numerical operations on the data. The Numpy library carries out the basic operations like addition, subtraction, Multiplication, division etc., of vectors and matrices and results in a meaningful value at the end. Numpy is represented in the form of N-d array.Machine learning models cannot be developed, complex data structures cannot be manipulated, and operations on matrices would not have been performed without the presence of Linear Algebra. All the results of the models are displayed using Linear Algebra as a platform.Some of the Machine Learning algorithms like Linear, Logistic regression, SVM and Decision trees use Linear Algebra in building the algorithms. And with the help of Linear Algebra we can build our own ML algorithms. Data Scientists and Machine Learning Engineers work with Linear Algebra in building their own algorithms when working with data.How do Python functions correlate to Mathematical Functions?So far, we have seen the importance of Mathematics in Machine Learning. But how do Mathematical functions corelate to Python functions when building a machine learning algorithm? The answer is quite simple. In Python, we take the data from our dataset and apply many functions to it. The data can be of different forms like characters, strings, numerical, float values, double values, Boolean values, special characters, Garbage values etc., in the data set that we take to solve a particular machine learning problem. But we commonly know that the computer understands only “zeroes & ones”. Whatever we take as input to our machine learning model from the dataset, the computer is going to understand it as binary “Zeroes & ones” only.Here the Python functions like “Numpy, Scipy, Pandas etc.,” mostly use pre-defined functions or libraries. These help us in applying the Mathematical functions to get better insights of the data from the dataset that we take. They help us to work on different types of data for processing and extracting information from them. Those functions further help us in cleaning the garbage values in data, the noise present in data and the null values present in data and finally help to make the dataset free from all the unwanted matter present in it. Once the data is preprocessed with the Python functions, we can apply our algorithms on the dataset to know which model works better for the data and we can find the accuracies of different algorithms applied on our dataset. The mathematical functions help us in visualizing the content present in the dataset, and helps to get better understanding on the data that we take and the problem we are addressing using a machine learning algorithm.Every algorithm that we use to build a machine learning model has math functions hidden in it, in the form of Python code. 
The algorithm that we develop can be used to solve a variety of things like a Boolean problem or a matrix problem like identifying an image in a crowd of people and much more. The final stage is to find the best algorithm that suits the model. This is where the mathematical functions in the Python language help us. It helps to analyze which algorithm is best through comparison functions like correlation, F1 score, Accuracy, Specificity, sensitivity etc. Mathematical functions also help us in finding out if the selected model is overfitting or underfitting to the data that we take.To conclude, we cannot apply the mathematical functions directly in building machine learning models, so we need a language to implement the mathematical strategies in the algorithm. This is why we use Python to implement our math models and draw better insights from the data. Python is a suitable language for implementations of this type. It is considered to be the best language among the other languages for solving real-world problems and implementing new techniques and strategies in the field of ML & Data Science.Conclusion:For machine learning enthusiasts and aspirants, mathematics is a crucial aspect to focus on, and it is important to build a strong foundation in Math. Each and every concept you learn in Machine Learning, every small algorithm you write or implement in solving a problem directly or indirectly has a relation to Mathematics.The concepts of math that are implemented in machine learning are built upon the basic math that we learn in 11th and 12th grades. It is the theoretical knowledge that we gain at that stage, but in the area of Machine Learning we experience the practical use cases of math that we have studied earlier.The best way to get familiar with the concepts of Mathematics is to take a Machine Learning Algorithm, find a use case, and solve and understand the math behind it.An understanding of math is paramount to enable us to come up with machine learning solutions to real world problems. A thorough knowledge of math concepts also helps us enhance our problem-solving skills.
3342
The Role of Mathematics in Machine Learning

IntroductionAutomation and machine learning have c... Read More

What Is Data Splitting in Learn and Test Data?

Data is the fuel of every machine learning algorithm, on which statistical inferences are made and predictions are done. Consequently, it is important to collect the data, clean it and use it with maximum efficacy. A decent data sampling can guarantee accurate predictions and drive the whole ML project forward whereas a bad data sampling can lead to incorrect predictions. Before diving into the sampling techniques, let us understand what the population is and how does it differ from a sample? The population is the assortment or the collection of the components which shares a few of the other characteristics for all intents and purposes. The total number of observations is said to be the size of the populationImage SourceThe sample is a subset of the population. The process of  choosing a sample from a given set of the population is known as sampling. The number of components in the example is the sample size. Data sampling refers to statistical approaches for picking observations from the domain to estimate a population parameter. Whereas data resampling refers to the drawing of repeated samples from the main or original source of data. It is the non-parametric procedure of statistical extrapolation. It produces unique sample distributions based on the original data and is used to improve the accuracy and intuitively measure the uncertainty of the population parameter. Sampling methods can be divided into two parts: Probability sampling procedure  Non-probability sampling procedure  The distinction between the two is that the example of determination depends on randomization. With randomization, each component persuades equivalent opportunity and is important for test for study. Probability Sampling – It is a method in which each element of a given population has an equivalent chance of being selected. Simple random sampling –For instance, a classroom has 100 students and each student has an equal chance of getting selected as the class representative Systematic sampling- It is a sampling technique in which the first element is selected at random and others get selected based on a fixed sampling interval. For instance, consider a population size of 20 (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19.20) Suppose we want the element with the number 3 and a sample size of 5. The next selection will be made at an interval of 20/5 i.e. 4 so 3 + 4 = 7 so 3,7,11 and so on.  Stratified sampling – In this sampling process, the total group is subdivided into smaller groups, known as the strata, to obtain a sampling process. Assume that we need to identify the average number of votes in three different cities to elect a representative. City x has 1 million citizens, city y has 2 million citizens, city z has 3 million citizens. We can randomly choose a sample size of 60 for the entire population. But if you notice, the random samples are not balanced with respect to the different cities. Hence there could be an estimation error. To overcome this, we may choose a random sample of 10,20,30 from city x, y, z respectively. We can therefore minimize the total estimated error. Reservoir sampling is a randomized algorithm. It is used to select k out of n samples. The n is generally very large or unknown. For instance, reservoir sampling can be used to obtain k out of the number of fish in a lake. Cluster sampling - samples are taken as subgroup /clusters of the population. These subgroups are selected at random. 
Image SourceNon-probability sampling – In a non-probability sampling method, each instance of a population does not have an equivalent chance of being selected. There is an element of risk of ending up with a non-representative sample which might not bring out a comprehensive outcome. Convenience sampling - This sampling technique includes people or samples that are easy to reach. Though it is the easiest methodology to collect a sample it runs a high risk of not being representative of a population. For instance, consider a population size of 20 (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19.20) The surveyor wants the person 4,7,11,18 to participate, hence it can create selection bias. Quota sampling – In Quota sampling methods the sample or the instances are chosen based on their traits or characteristics which matches with the population For instance, consider a population size of 20 (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19.20) Consider a quota in multiple of 4 - (4,8,12,16,20) Judgement sampling - Also known as selective sampling. Here individuals are asked to participate.  Snowball sampling - In this sampling technique, an individual element/person can nominate further elements/people known to them. It is only applicable when the sampling frame is difficult to identify.  A nominates P, P nominates G, G nominates M A > P > G > M The non-probability sampling technique may lead to selection bias and population misrepresentation.  Image SourceWe often come across the case of an imbalanced dataset.  Resampling is a technique used to overcome or to deal with  imbalanced datasets It includes removing samples/elements from the majority class i.e. undersampling  Adding more instances from the minority class i.e. Oversampling  There is a dedicated library to tackle imbalanced datasets in Python - known as imblearn. Imblearn has multiple methods to handle undersampling and oversampling    Image SourceTomek Links for under-sampling - pairs of examples from opposite classes in close instancesMajority elements are eliminated from the Tomek Links which intuitively provides a better understanding and decision boundary for ML classifier  SMOTE for oversampling - Synthetic Minority Oversampling Technique - works by increasing new examples from the minority cases. It is a statistical technique of increasing or generating the number of instances in the dataset in a more balanced manner.  Image SourcePick a minority class as the input vector  Discover its k closest neighbors (k_neighbors is indicated as a contention in the SMOTE()) Pick one of these neighbors and spot a synthetic point anyplace on the line joining the point viable and its picked neighbor  Rehash the above steps until it is adjusted or balanced Other must-read sampling methods - Near miss, cluster centroids for under sampling, ADASYN and bSMOTE for oversampling  Train-Test split  Python is bundled with overpowered ML library. The train_test_Split() module from Scikit-Learn library is one of the major python modules that provides a function to split the datasets into multiple subsets in different ways or let us say randomly into training and validation datasets. The parameter train_size takes a fraction between zero and one for specifying the training size. The remaining samples in the original data set are for testing purposes. The record which is selected for training and test sets are randomly sampled. The simplest method train_test_split() or the split_train_test() are more or less the same. 
train set – the subset of the dataset to train a model test set - the subset of the dataset to test the trained model The train-test method is used to measure the performance of ML algorithms  It is appropriate to use this procedure when the dataset is very large For any supervised Machine learning algorithms, train-test split can be implemented.  Involves taking the data set as a whole and further subdividing it into two subsets The training dataset is used to fit the model  The test dataset serves as an input to the model The model predictions are made on the test data  The output (prediction) is compared to the expected values  The ultimate objective is to evaluate the performance of the said ML model against the new or unseen data. A visual representation of training or test data:  Image SourceIt is important to note that the test data adheres to the following conditions:   Be large enough to fetch statistically significant results Is a representation of the whole dataset. One must not pick the test set with different traits/characteristics of the training set. Never train on test data - don’t get fooled by good results and high accuracy. It might be the case that one has accidentally trained the model on the test data. The train_test_split() is coupled with additional features: a random seed generator as random_state parameter – this ensures which samples go to training and which go to the test set It takes multiple data sets with the matching number of rows and splits them on similar indices. The train_test_split returns four variables  train_X  - which covers X features of the training set. train_y – which contains the value of a response variable from the training set test_X – which includes X features of the test set test_y – which consists of values of the response variable for the test set. There is no exact rule to split the data by 80:20 or 70:30; it depends on the data and the target variable. Some of the data scientists use a range of 60% to 80% for training and the rest for testing the model. To find the length or the number of records we use len function of python > len(X_train), len (X_test) The model is built by using the training set and is tested using the test set X_train and y_train contain the independent features or variables and response variable values for training datasets respectively. On the other hand, X_test and y_test include the independent features and response variables values for the test dataset respectively. Conclusion: Sampling is an ongoing process of accumulating the information or the observations on an estimate of the population variable. We learnt about sampling types - probability sampling procedure and non-probability sampling procedure. Resampling is a repeated process to draw samples from the main data source. And finally, we learnt about training, testing and splitting the data which are used to measure the performance of the model. The training and testing of the model are done to understand the data discrepancies and develop a better understanding of the machine learning model. 
5430
What Is Data Splitting in Learn and Test Data?

Data is the fuel of every machine learning algo... Read More

Data Preparation for Machine Learning Projects

The data we collect for machine-learning must be pre-processed before it can be used to fit a model. Data preparation is essentially, the task of modifying raw data into a form that can be used for modelling, mostly by data addition, deletion or other data transformation techniques.  We need to pre-process the data before feeding into any algorithm mainly due to the following reasons: Messy data – Real world data is messy, with missing values, redundant values, out-of-range values, errors and noise. Machine learning algorithms need numeric data. More often than not, algorithms have requirements on the input data, for example some algorithms assume a certain probability distribution of the data, others might perform worse if the predictor variables are highly correlated etc. Data preparation tasks are mostly dependent on the dataset we are working with, and to some extent on the choice of model. However, it becomes more evident after initial analysis of the data and EDA. For e.g. looking at the summary statistics, we know if predictors need to be scaled. Looking at correlation matrix you can find out if there are highly correlated predictors. Looking at various plots, e.g. boxplot, you can find, if outliers need to be dealt with, so on and so forth. Even though every dataset is different, we can define a few common steps which can guide us in preparing the data to feed into our learning algorithms. Some common tasks that contribute to data pre-processing are: Data Cleaning Feature Selection Data Transformation Feature Engineering Dimensionality Reduction Note: Throughout this article, we will refer to Python libraries and syntaxes. Data Cleaning: It can be summed up as the process of correcting the errors in the data. Errors could be in the form of missing values, redundant rows or columns, variables with zero or near zero variance and so on. Thus, data cleaning involves a few or all of the below sub-tasks: Redundant samples or duplicate rows: should be identified and dropped from the dataset. In Python,  functions in Pandas such as duplicated() can be used to identify such samples and drop_duplicates() can be used to drop such rows. Redundant Features: If the dataset has features which are highly correlated, it may lead to multi-collinearity (irregular regression coefficient estimates). Such columns can be identified using the correlation matrix and one of the pairs of the highly correlated feature should be dropped. Similarly, near zero variance features, which have the same value for all the samples do not contribute to the variance in data. Such columns should be identified and dropped from the dataset.  Outlier Detection: Outliers are extreme values which fall far away from other observations. Outliers can skew the descriptive statistics of the data, hence mislead data interpretations and negatively impact model performance. So, it is important that the outliers are detected and dealt with. Outliers can be detected through data visualization techniques like box-plots and scatter plots.  Example of outliers being detected using box plots:  Image Source Outliers can also be detected by computing the z-scores or the Inter-Quartile range. When using z-score, a data point which is more than 3 standard deviations away from the mean is normally considered as an outlier.  However, this may vary based on the size of the dataset. 
When using the inter-quartile range, a point which is below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is considered an outlier, where Q1 is the first quartile, Q3 is the third quartile and IQR = Q3 − Q1. The diagram below shows outliers which are more than 3 standard deviations from the mean: Image Source
If there are only a few outliers, you may choose to drop the samples containing them; if there are too many, they can be modelled separately. We may also choose to cap or floor outlier values at the 95th or 5th percentile value, although the appropriate replacement value should be chosen by analysing the deciles of the data.
Missing values: data with missing values cannot be used for modelling, hence any missing values should be identified and cleaned. If the data in a predictor or sample is sparse, we may choose to drop the entire column or row; otherwise we may impute the missing value with the mean or median. Missing values in categorical variables can be replaced with the most frequent class.
Points to remember: Use the z-score for outlier detection if the data follows a Gaussian distribution; otherwise use the inter-quartile range.

Feature Selection: Sometimes datasets have hundreds of input variables, not all of which are good predictors of the target, and many may only contribute noise. Feature selection techniques are used to find the input variables that most efficiently predict the target variable, in order to reduce the number of input variables. Feature selection techniques can be classified into supervised and unsupervised selection techniques. As the name suggests, unsupervised selection techniques do not consider the target variable while eliminating input variables; this includes techniques like using correlation to eliminate highly correlated predictors or eliminating low-variance predictors. Supervised feature selection techniques consider the target variable when selecting the features to be eliminated. They can be further divided into three groups: intrinsic, filter and wrapper techniques.
Intrinsic – the feature selection process is embedded in the model building process itself, e.g. tree-based algorithms, which pick the best predictor for each split. Similarly, regularization techniques like the lasso shrink the coefficients of the predictors such that some coefficients are shrunk to zero, and the corresponding predictors are excluded from the model. Multivariate adaptive regression spline (MARS) models also fall under this category. A major advantage of such methods is that, since feature selection is part of the model building process, it is relatively fast. However, model dependence can also be a disadvantage; for example, some tree-based algorithms are greedy and may select predictors that lead to a sub-optimal fit.
Filter – filter-based selection techniques use a statistical measure to score each predictor separately against the target variable and choose the predictors with the highest scores. This is mostly univariate analysis, i.e. each predictor is evaluated in isolation, and it does not consider the correlation of the independent variables amongst themselves. Based on the type of the input variable (numerical or categorical) and the type of output variable, an appropriate statistical measure can be used to evaluate predictors for feature selection, for example Pearson's correlation coefficient, Spearman's correlation coefficient, ANOVA or Chi-square.
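As a small example of the filter approach just described, the sketch below scores each predictor against a categorical target with the ANOVA F-test and keeps the ten highest-scoring features. The breast-cancer dataset and k=10 are arbitrary choices for illustration.

```python
# A minimal filter-based feature selection sketch: score each predictor independently,
# keep the k best. Dataset and k are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# ANOVA F-test between each numeric predictor and the categorical target
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)      # (569, 30) -> (569, 10)
print(selector.get_support(indices=True))   # column indices of the retained predictors
```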
Wrapper – wrapper feature selection builds models iteratively using various subsets of predictors and evaluates each model until it finds the subset of features which best predicts the target. These methods are agnostic to the type of variables, but they are computationally more taxing. Recursive Feature Elimination (RFE) is a commonly used wrapper-based method: it is a greedy backward-elimination technique which starts with the complete set of predictors and systematically eliminates the less useful ones until it reaches the specified number of predictors that best predict the target variable. Two important hyperparameters for the RFE algorithm in scikit-learn are the number of predictors to keep (n_features_to_select) and the algorithm of choice (estimator).
Points to remember: Feature selection techniques reduce the number of features by excluding or eliminating existing features from the dataset, whereas dimensionality reduction techniques create a projection of the data into a lower-dimensional feature space which does not have a one-to-one mapping with the existing features. Both, however, share the goal of reducing the number of independent variables.

Data Transformations: We may need to transform data to change its data type, scale or distribution.
Type: We need to analyse the input variables at the very beginning to understand whether the predictors are represented with the appropriate data type, and do the required conversions before progressing with EDA and modelling. For example, sometimes Boolean values are encoded as true and false, and we may transform them to take the values 0 and 1. Similarly, we may come across integer variables where it is more appropriate to treat them as categorical; for example, in a dataset for predicting car prices, it would be more appropriate to treat the variable 'Number of doors', which takes the values {2, 4}, as a categorical variable. Categorical variables must be converted to numeric before they can be used for modelling. There are many categorical encoding techniques, such as dummy (N−1) encoding, one-hot encoding, label encoding and frequency encoding. Ordinal encoding can be used when we want to specify and maintain the order of an ordinal variable.
Scale: Predictor variables may have different units (km, $, years, etc.) and hence different scales. For example, a dataset may contain input variables like age and salary; the scale of salary will always be much higher than that of age, so it may contribute unequally to the model and create a bias. Hence, we transform the predictors to bring them to a common scale. Normalization and standardization are the most widely used scaling techniques.
Normalization helps scale the data such that all values lie in the range 0 to 1 (the scikit-learn implementation even allows one to specify a preferred range). Data shown before and after normalization: Image Source
Standardisation: we standardize the data by centering it around the mean and then scaling it by the standard deviation; in other words, the mean of the variable is subtracted from each value and the difference is divided by the standard deviation of the variable. The resulting data has zero mean and a standard deviation of 1. Standardisation assumes that the data follows a Gaussian distribution. The scikit-learn library in Python can be used for normalization (MinMaxScaler()) and standardization (StandardScaler()), as in the sketch below.
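A minimal scaling sketch follows, assuming two toy numeric features (an age-like and a salary-like column, both invented for illustration). Note that the scalers are fitted on the training portion only and then applied to the test portion.

```python
# A minimal normalization/standardization sketch on toy data (values are illustrative only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(35, 8, 200),            # age-like feature
                     rng.normal(60_000, 15_000, 200)])  # salary-like feature
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Normalization: rescale each feature to the [0, 1] range (feature_range can be changed)
minmax = MinMaxScaler(feature_range=(0, 1))
X_train_norm = minmax.fit_transform(X_train)
X_test_norm = minmax.transform(X_test)       # reuse the statistics learnt from the training set

# Standardization: zero mean and unit standard deviation per feature
standard = StandardScaler()
X_train_std = standard.fit_transform(X_train)
X_test_std = standard.transform(X_test)

print(X_train_std.mean(axis=0).round(2), X_train_std.std(axis=0).round(2))
```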
Data shown before and after standardization: Image Source
Distribution: Many algorithms assume a Gaussian distribution for the underlying data. If the data is not Gaussian, or is only approximately Gaussian, we can transform it to reduce the skewness and make it more Gaussian-like. The Box-Cox transform or the Yeo-Johnson transform can be used to perform power transformations on the data. The Box-Cox transform applies a different transformation depending on the value of lambda: for lambda = −1 it applies the inverse transformation, for lambda = 0 the log transformation, for lambda = 0.5 the square-root transformation, and for lambda = −0.5 the reciprocal square-root transformation. The PowerTransformer() class in scikit-learn can be used for these power transformations. Data shown before and after log transformation: Image Source
Points to remember: Data transformations should be fitted on the training dataset only, so that the statistics required for the transformation are estimated from the training set and then applied to the validation and test sets. Decision trees and other tree-based ensembles, like random forests and boosting algorithms, are not affected by differing scales of the input variables, so scaling may not be required for them. Linear regression and neural networks, which use a weighted sum of the input variables, and K-nearest neighbours or SVMs, which compute distances or dot products between predictors, are affected by the scale of the predictors, hence the input variables should be scaled for these models. Between normalization and standardization, standardize when the data follows a Gaussian distribution, otherwise normalize.

Feature Engineering is the part of data pre-processing where we derive new features from one or more existing features. For example, when working on a taxi fare prediction problem, we may derive a new feature, distance travelled in the ride, from the latitude and longitude co-ordinates of the start and end points of the ride. Or, when predicting sales or footfall for a retail business, we may need to add a new feature to factor in the impact of holidays, weekends and festivals on the target variable. Hence, we may need to engineer these new predictors and feed them into our model to help it identify the underlying patterns effectively.
Polynomial terms: we may add new features by raising existing input variables to a higher-degree polynomial. Polynomial terms help the model learn non-linear patterns; when polynomial terms of existing features are added to a linear regression model, it is termed polynomial regression. Usually, we stick to a small degree of 2 or 3.
Interaction terms: we may add new features that represent the interaction between existing features by adding the product of two features. For example, if we are helping a business allocate its marketing budget between media like radio, TV and newspapers, we need to model how effective each medium is; we may factor in an interaction term for the radio and newspaper campaigns, to capture the effectiveness of marketing when both campaigns run at the same time. Similarly, when predicting crop yield, we may engineer an interaction term for fertilizer and water to factor in how the yield varies when water and fertilizer are provided together.
Points to remember: When using polynomial terms in the model, it is good practice to restrict the degree of the polynomial to 3 or at most 4, for the reasons given after the sketch below.
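As a quick illustration of polynomial and interaction terms, the sketch below uses scikit-learn's PolynomialFeatures on two toy predictors; the radio/newspaper names and values are purely illustrative.

```python
# A minimal sketch of adding polynomial and interaction terms (toy predictors, illustrative only).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[10.0, 3.0],
              [20.0, 5.0],
              [30.0, 8.0]])   # columns: radio spend, newspaper spend (toy values)

# Degree-2 expansion: adds radio^2, newspaper^2 and the interaction term radio*newspaper
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out(["radio", "newspaper"]))
# ['radio' 'newspaper' 'radio^2' 'radio newspaper' 'newspaper^2']

# Interaction terms only, without the squared terms
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = interactions.fit_transform(X)
print(interactions.get_feature_names_out(["radio", "newspaper"]))
# ['radio' 'newspaper' 'radio newspaper']
```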
The degree is restricted firstly to control the number of input variables, and secondly because a larger polynomial degree produces very large feature values, which may in turn push the weights (parameters) to large values and make the model harder to train reliably. Domain knowledge, or the advice of an SME, can come in handy for identifying effective interaction terms.

Dimensionality Reduction: Sometimes data has hundreds or even thousands of features. High-dimensional data is more complicated, with many more parameters to train and a more complex model structure. In higher dimensions the volume of the space is huge and the data points become sparse, which can negatively impact the performance of a machine learning algorithm; this is often referred to as the curse of dimensionality. Dimensionality reduction techniques are used to reduce the number of predictor variables in the dataset. Some techniques for dimensionality reduction are:
PCA, or Principal Component Analysis, uses linear algebra (an eigen-decomposition) to achieve dimensionality reduction. For the given data points, PCA finds an orthogonal set of directions that capture the maximum variance; by rotating the reference frame onto these directions, the directions corresponding to the smallest eigenvalues can be identified and neglected. Principal Component Analysis applied to a dataset is shown below; a minimal code sketch also follows the conclusion at the end of this article.
Manifold learning is a non-linear dimensionality reduction technique which uses the geometric properties of the data to create low-dimensional projections of high-dimensional data while preserving its structure and relationships, and to visualize high-dimensional data, which is otherwise difficult. Self-Organizing Maps (SOM), also called Kohonen maps, and t-SNE are examples of manifold learning techniques. t-distributed stochastic neighbour embedding (t-SNE) computes the probability that pairs of data points (in the high-dimensional space) are related and maps them into a low-dimensional space such that the data has a similar distribution there.
Autoencoders are deep learning neural networks that learn a low-dimensional representation of a given dataset in an unsupervised manner. The hidden (bottleneck) layer is limited to fewer neurons than the input, so the network learns to map a high-dimensional input vector into a low-dimensional vector while still preserving the underlying structure and relationships in the data. Autoencoders have two parts: an encoder, which learns to map the high-dimensional vector to the low-dimensional space, and a decoder, which maps the data from the low dimension back to the high dimension. The reduced-dimension output from the encoder can be fed into any other model for supervised learning.
Points to remember: Dimensionality reduction is usually performed after data cleaning and data scaling. It is imperative that the dimensionality reduction fitted on the training dataset is also applied to the validation data and to any new data on which the model will predict.

Conclusion: Data preparation is an important and integral step of machine learning projects. There are multiple techniques for the various data cleaning tasks, but there are no universally best or worst ones: every machine learning problem is unique, and so is the underlying data. We need to apply different techniques and see what works best based on the data and the problem at hand.
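Finally, as referenced in the dimensionality-reduction section above, here is a minimal PCA sketch. The breast-cancer dataset and the 95% explained-variance target are illustrative assumptions.

```python
# A minimal PCA sketch: standardize the features, then keep enough principal
# components to explain 95% of the variance (dataset and threshold are illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Directions with the smallest eigenvalues are dropped automatically
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)          # 30 original features -> far fewer components
print(pca.explained_variance_ratio_.round(3))  # variance explained by each retained component
```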