The Universal Approximation Theorem
Any predictive model is a mathematical function, y = f(x), that maps the features (x) to the target variable (y). The function f(x) can be linear, or it can be a fairly complex nonlinear function. How accurately a model can approximate f(x) depends on the distribution of the data and, in the case of neural networks, on the type of network architecture that is employed. The Universal Approximation Theorem says that, for essentially any continuous f(x) on a bounded input range, a neural network with enough hidden units and a suitable nonlinear activation function can approximate f(x) as closely as desired. In order to build a proper neural network architecture, let us take a look at the activation functions.
What are Activation Functions?
Simply put, activation functions define the output of neurons given a certain set of inputs. They are mathematical functions added to neural network models to enable the models to learn complex patterns. An activation function takes the output from the previous layer and passes it through a mathematical function (usually a non-linear one) to convert it into a form that can serve as input for the next computation layer. The choice of activation function influences the final accuracy of a network model as well as the computational efficiency of training it.
Why do we need Activation Functions?
In a neural network, if each hidden layer simply computes the weighted sum of its inputs, the whole network reduces to a linear function, which is equivalent to a linear regression model.
Image source: Neural Network Architecture
In the above diagram, we see the hidden layer is simply the weighted sum of the inputs from the input layer. For example, b1 = bw1 + a1w1 + a2w3 which is nothing but a linear function.
A linear combination of linear functions is itself a linear function. So no matter how many linear functions we add, or how many linear hidden layers we stack, the output is still a linear function of the input.
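To make the collapse of stacked linear layers concrete, here is a minimal sketch in plain Python. The weight values, the layer sizes and the helper names matvec and matmul are made up for illustration, and biases are left out to keep the arithmetic short.

```python
# Two stacked linear layers (no activation) collapse into one linear layer.
# All weights below are illustrative values, not taken from the article.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]  # hidden layer: 2 inputs -> 3 units
W2 = [[1.0, -2.0, 0.5]]                     # output layer: 3 units -> 1 output

x = [2.0, 3.0]

two_layer_output = matvec(W2, matvec(W1, x))  # pass through both layers
W_combined = matmul(W2, W1)                   # one equivalent 1x2 weight matrix
one_layer_output = matvec(W_combined, x)      # pass through the single layer

print(two_layer_output, one_layer_output)
```

Both paths give the same number, because the two weight matrices can always be multiplied into a single one. Adding more linear layers only adds more matrices to that product.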
However, in the real world, more often than not, we need to model data that is non-linear and far more complex. Adding non-linear functions allows non-linear decision boundaries to be built into the model.
Multi-layer neural network models can classify linearly inseparable classes. However, in order to do so, we need the network to be transformed to a nonlinear function. For this nonlinear transformation to happen, we would pass the weighted sum of the inputs through an activation function. These activation functions are nonlinear functions which are applied at the hidden layers. Each hidden layer can have different activation functions, though mostly all neurons in each layer will have the same activation function.
Additionally, a non-linear activation function applied to a neuron can act as a gate that selectively switches the neuron on or off.
Types of Activation Functions
In this section we discuss the following:
- Linear Function
- Threshold Activation Function
- Bipolar Activation Function
- Logistic Sigmoid Function
- Bipolar Sigmoid Function
- Hyperbolic Tangent Function
- Rectified Linear Unit Function
- Swish Function (proposed by Google Brain, a deep learning artificial intelligence research team at Google)
Linear Function: g(x) = x
A linear function is similar to a straight line, y=mx. Irrespective of the number of hidden layers, if all the layers are linear in nature, then the final output is also simply a linear function of the input values. Hence we take a look at the other activation functions which are non-linear in nature and can help learn complex patterns.
Note: This function is useful when we want to model a wide range in the regression network output.
Threshold Activation Function: (sign(x) +1)/2
In this case, if the input is above a certain value, the neuron is activated. Note that this function outputs either a 1 or a 0. In effect, the step function divides the input space into two halves, such that one side of the hyperplane represents class 0 and the other side represents class 1. However, if we need to classify inputs into more than two categories, a threshold activation function is not suitable. Because of its binary output, this function is also known as the binary-step activation function.
Threshold Activation Function
- Can be used for binary classification only. It is not suited for multi class classification problems.
- This function does not support gradient-based learning: its gradient is zero everywhere except at the threshold, so when you fine-tune the network you cannot tell whether a slight change in the weights has reduced the loss or changed it at all.
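The two limitations above can be seen in a small sketch of the threshold function, written from the formula (sign(x) + 1)/2. The helper names and the finite-difference gradient check are illustrative assumptions, not part of any particular library.

```python
def sign(x):
    """Return -1, 0 or +1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def threshold(x):
    """Binary-step activation: (sign(x) + 1) / 2."""
    return (sign(x) + 1) / 2

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

print(threshold(2.0), threshold(-2.0))      # one value per side of the threshold
print(numerical_gradient(threshold, 2.0))   # flat region: no gradient signal
print(numerical_gradient(threshold, -2.0))  # flat region: no gradient signal
```

Away from the threshold the function is flat, so nudging the weights slightly does not change the output, which is why gradient-based training receives no useful signal. Note that this literal formula gives 0.5 at exactly x = 0; practical implementations usually assign that point to one of the two classes.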
Bipolar Activation Function: This is similar to the threshold function that was explained above. However, this activation function will return an output of either -1 or +1 based on a threshold.
Bipolar Activation Function
Logistic Sigmoid Function: One of the most frequently used activation functions is the Logistic Sigmoid Function. Its output ranges between 0 and 1 and is plotted as an ‘S’ shaped graph.
Logistic Sigmoid Function
This is a nonlinear function. Around x = 0 the curve is steep, so a small change in x leads to a comparatively large change in y. This activation function is generally used for binary classification, where the expected output is 0 or 1. Since it provides an output between 0 and 1, a default threshold of 0.5 is used to convert the continuous output to 0 or 1 when classifying observations.
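As a minimal sketch of the idea above, the logistic sigmoid and the 0.5 cutoff can be written in a few lines of plain Python. The function names and the cutoff parameter are illustrative assumptions; the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x)), computed in a numerically safe way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x is negative here, so exp(x) cannot overflow
    return z / (1.0 + z)

def classify(x, cutoff=0.5):
    """Turn the continuous sigmoid output into a 0/1 class label."""
    return 1 if sigmoid(x) >= cutoff else 0

for x in (-4.0, -0.5, 0.0, 0.5, 4.0):
    print(x, round(sigmoid(x), 4), classify(x))
```

The printed values trace the S shape: outputs close to 0 for strongly negative inputs, 0.5 at zero, and close to 1 for strongly positive inputs.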
Another variation of the Logistic Sigmoid function is the Bipolar Sigmoid Function. This activation function is a rescaled version of the Logistic Sigmoid Function, commonly written as 2 · sigmoid(x) - 1, which provides an output in the range of -1 to +1.
Bipolar Logistic Function
- Slow convergence - Learning relies on gradients, which are significant only in the active region. When the neurons fire in the saturation region (the top and bottom parts of the S curve), the gradients are very small or close to zero. Hence training becomes slow and convergence is slow.
- Vanishing gradient problem - When the neurons fire in the saturation region, i.e., if the output of the previous layer falls in the saturation region, the gradients get close to zero and do not enable learning: even large changes in the parameters (weights) lead to very small changes in the output.
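The saturation behaviour described in these two points can be checked numerically. The sketch below uses the standard derivative of the logistic sigmoid, sigmoid(x) · (1 - sigmoid(x)); the function names are illustrative.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), where s = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and shrinks quickly in the saturation regions.
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid_grad(x))
```

The gradient is at most 0.25 (at x = 0) and falls by several orders of magnitude once the input moves into the flat parts of the S curve, which is what slows learning.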
Hyperbolic Tangent Function: This activation function is quite similar to the sigmoid function, but its output ranges from -1 to +1. Because the output is zero centred, weight initialization and training tend to be easier.
Hyperbolic Tangent Function
- This too suffers from the vanishing gradient problem.
- Slightly more expensive to compute
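A short sketch can show both how close tanh is to the sigmoid and why it shares the saturation problem. It relies on the identity tanh(x) = 2 · sigmoid(2x) - 1 and on the derivative 1 - tanh(x)^2; the helper names are illustrative.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def tanh_from_sigmoid(x):
    """tanh expressed as a rescaled, shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

print(math.tanh(0.7), tanh_from_sigmoid(0.7))  # the two agree
print(tanh_grad(0.0), tanh_grad(5.0))          # large at 0, tiny when saturated
```

The gradient is 1 at the origin, which is larger than the sigmoid's peak of 0.25, but it still collapses towards zero for large positive or negative inputs.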
Rectified Linear Activation Function: This activation function, also known as ReLU, outputs the input if it is positive and returns zero otherwise. That is to say, if the input is zero or less, the function returns 0; otherwise it returns the input itself. Because it behaves like a linear function for positive inputs, it is computationally simple.
This activation function has become quite popular, largely because it is cheaper to compute than the sigmoid and hyperbolic tangent functions and often helps the model converge faster.
ReLU tends to converge better than the sigmoid and tanh(x) functions because it has no saturation region for positive inputs. If the input coming from the previous layer is positive, ReLU simply passes it through as is, and if the input is negative, it clips it to zero.
Another critical point to note is that while the sigmoid and hyperbolic tangent functions can only approach a zero value, the Rectified Linear Activation Function can return a true zero.
Rectified Linear Units Activation Function
One disadvantage of ReLU is that the gradient of the function is zero whenever the input is negative (and, by the usual convention, at exactly zero). This causes a problem during back-propagation and can prevent the model from converging. If the input to a particular neuron is negative for every example in the dataset, then during backward propagation its gradient will always be zero. Since the gradient is zero, the weights feeding that neuron will never be updated and there will be no learning. Because the weights are not updated, the neuron keeps receiving the same negative inputs. Such neurons are effectively dead, no matter what data they see. This is commonly termed the “Dying” ReLU problem. Hence, when using ReLU, one should keep track of the fraction of dead neurons.
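The following sketch shows ReLU, its gradient, and one simple way to estimate the fraction of dead neurons on a batch. The dead_fraction helper and the sample pre-activation values are hypothetical, written only to illustrate the idea; real frameworks expose this information in their own ways.

```python
def relu(x):
    """ReLU: pass positive inputs through, clip everything else to zero."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU; the value at exactly x == 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0

def dead_fraction(pre_activations):
    """Fraction of neurons whose input is <= 0 for every sample in the batch.

    pre_activations holds one list per neuron, with one value per sample.
    """
    dead = sum(1 for values in pre_activations if all(v <= 0 for v in values))
    return dead / len(pre_activations)

# Made-up pre-activation values for 4 neurons over a batch of 3 samples.
batch = [
    [-1.0, -0.5, -2.0],  # always negative: dead on this batch
    [0.3, -0.1, 1.2],
    [-3.0, -0.2, -0.7],  # always negative: dead on this batch
    [2.0, 1.0, 0.5],
]

print(relu(-2.0), relu(3.0))
print(relu_grad(-2.0), relu_grad(3.0))
print(dead_fraction(batch))
```

A neuron counted as dead here receives zero gradient for every sample in the batch, so none of its incoming weights would be updated by that batch.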
There are a few variations of the ReLU activation function, such as Noisy ReLU, Leaky ReLU, Parametric ReLU and Exponential Linear Units (ELU).
Leaky ReLU, a modified version of ReLU, helps solve the “Dying” ReLU problem by allowing back-propagation to work even when the inputs are negative. Unlike ReLU, Leaky ReLU gives x a small linear slope when x is negative. With this change, the gradient is non-zero for negative inputs, thus avoiding dead neurons. However, this might also bring in a challenge with Leaky ReLU when it comes to predicting negative values.
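A minimal sketch of Leaky ReLU follows. The slope value alpha = 0.01 is a commonly used default chosen here as an assumption; the article does not specify one.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x for negative inputs."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU: 1 on the positive side, alpha otherwise."""
    return 1.0 if x > 0 else alpha

print(leaky_relu(3.0), leaky_relu(-2.0))
print(leaky_relu_grad(3.0), leaky_relu_grad(-2.0))
```

Because the gradient on the negative side is alpha rather than zero, a neuron that receives only negative inputs still gets small weight updates.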
Exponential Linear Unit (ELU) is another variant of ReLU which, unlike ReLU and Leaky ReLU, uses an exponential curve instead of a straight line for negative inputs.
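For comparison, here is a sketch of ELU using its usual definition: x for positive inputs and alpha · (exp(x) - 1) otherwise. The default alpha = 1.0 is an assumption made for illustration.

```python
import math

def elu(x, alpha=1.0):
    """ELU: identity for positive x, alpha * (exp(x) - 1) for x <= 0."""
    return x if x > 0 else alpha * math.expm1(x)

def elu_grad(x, alpha=1.0):
    """Gradient of ELU: 1 for positive x, alpha * exp(x) for x <= 0."""
    return 1.0 if x > 0 else alpha * math.exp(x)

for x in (-50.0, -2.0, -0.5, 0.0, 1.5):
    print(x, elu(x), elu_grad(x))
```

For negative inputs the output curves smoothly towards -alpha instead of following a straight line, and the gradient stays positive, although it becomes very small for strongly negative inputs.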
Swish Activation Function: Swish is an activation function proposed by Google Brain. While ReLU returns zero for negative values, Swish returns small non-zero values for negative inputs. Swish is a self-gating technique: while normal gates require multiple scalar inputs, a self-gating technique requires only a single input. Swish has certain properties. Unlike ReLU, Swish is a smooth and non-monotonic function, which its authors present as an advantage over ReLU. Swish is unbounded above and bounded below. Swish is represented as x · σ(βx), where σ(z) = 1 / (1 + exp(−z)) is the sigmoid function and β is a constant or a trainable parameter.
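The formula x · σ(βx) translates directly into code. In this sketch β is fixed at 1.0, which is an assumption made for illustration; the function names are illustrative too.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x). The input gates itself."""
    return x * sigmoid(beta * x)

for x in (-5.0, -1.0, 0.0, 1.0, 10.0):
    print(x, swish(x))
```

The printed values show the non-monotonic dip on the negative side: the output at x = -1 is lower than at both x = -5 and x = 0. For large positive inputs Swish is close to the identity, like ReLU.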
Activation functions in deep learning and the vanishing gradient problem
Many algorithms use gradient-based methods to train their models. Neural networks are typically trained with stochastic gradient descent. The training algorithm assigns random initial weights to the layers and, once the output is predicted, calculates the prediction errors. It uses these errors to estimate a gradient that is then used to update the weights in the network so as to reduce the prediction errors. The error gradient is propagated backward from the output layer to the input layer.
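The training loop described above can be sketched on the smallest possible model: a single weight w fitted so that w · x matches the target y = 2x. The data, learning rate, epoch count and function name are all assumptions chosen for illustration. A real network applies the same update to every weight, using gradients obtained by back-propagation.

```python
import random

def train_single_weight(seed=0, learning_rate=0.05, epochs=200):
    """Fit y = w * x to toy data with stochastic gradient descent."""
    rng = random.Random(seed)
    data = [(x, 2.0 * x) for x in (-2.0, -1.0, 0.5, 1.0, 3.0)]  # toy targets
    w = rng.uniform(-1.0, 1.0)           # random initial weight
    for _ in range(epochs):
        rng.shuffle(data)                # visit samples in random order
        for x, y in data:
            prediction = w * x
            error = prediction - y       # prediction error for this sample
            gradient = 2.0 * error * x   # d(error^2)/dw
            w -= learning_rate * gradient  # step against the gradient
    return w

print(train_single_weight())
```

Each update moves w a small step in the direction that reduces the squared error on the current sample, and the weight settles close to 2.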
Neural network models with a larger number of hidden layers are often preferred, because more hidden layers can give the model the capacity to learn more complex patterns and predict more accurately.
One problem with too many layers is that the gradient diminishes quickly as it moves from the output layer to the input layer. During back-propagation, in order to get the update for the weights, we multiply together many gradients and Jacobians. If the largest singular value of these matrices is less than one, the product becomes a very small number, and the gradients diminish. When we update the weights with such a gradient, the update is very small. By the time the error signal reaches the layers near the input, it may be too small to have any effect on model performance. Basically, this is a situation where training a neural network model with gradient-based methods becomes difficult.
This is known as the vanishing gradient problem. Gradient-based methods might face this challenge when certain activation functions are used in the network.
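The shrinking product can be illustrated with a deliberately simplified model: a chain of single sigmoid neurons, one per layer, each with the same weight. The chain structure, the weight value of 1.0 and the function names are assumptions made to keep the arithmetic visible; real layers multiply Jacobian matrices rather than scalars.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def chain_gradient(x, depth, w=1.0):
    """Gradient of the final output with respect to the input x.

    The network is a chain of `depth` sigmoid neurons, each with weight w.
    By the chain rule the gradient is the product of the per-layer factors
    w * s * (1 - s), where s is that layer's sigmoid output.
    """
    gradient = 1.0
    activation = x
    for _ in range(depth):
        activation = sigmoid(w * activation)
        gradient *= w * activation * (1.0 - activation)
    return gradient

for depth in (1, 2, 5, 10, 20):
    print(depth, chain_gradient(0.5, depth))
```

Since each factor is at most 0.25 when w = 1, the gradient after n layers is bounded by 0.25 to the power n, so it shrinks geometrically with depth.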
In deep neural networks, various activation functions are used. However, when training deep neural network models, the vanishing gradient problem can make training slow or unstable.
Various workarounds have been proposed to address this problem. The most commonly used is the ReLU activation function, which has generally proven to perform better than earlier activation functions such as sigmoid or hyperbolic tangent.
As mentioned above, Swish improves upon ReLU by being a smooth and non-monotonic function. However, although the vanishing gradient problem is much less severe with Swish, Swish does not completely avoid it.
To tackle this problem, a new activation function has been proposed.
“The activation function in the neural network is one of the important aspects which facilitates the deep training by introducing the nonlinearity into the learning process. However, because of zero-hard rectification, some of the existing activation functions such as ReLU and Swish miss to utilize the large negative input values and may suffer from the dying gradient problem. Thus, it is important to look for a better activation function which is free from such problems.... The proposed LiSHT activation function is an attempt to scale the non-linear Hyperbolic Tangent (Tanh) function by a linear function and tackle the dying gradient problem… A very promising performance improvement is observed on three different types of neural networks including Multi-layer Perceptron (MLP), Convolutional Neural Network (CNN) and Recurrent Neural Network like Long-short term memory (LSTM).” - Swalpa Kumar Roy, Suvojit Manna, et al., Jan 2019
In their paper, Swalpa Kumar Roy, Suvojit Manna, et al. propose a new non-parametric activation function, the Linearly Scaled Hyperbolic Tangent (LiSHT), for neural networks that attempts to tackle the vanishing gradient problem.
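Reading "scaling Tanh by a linear function" as x · tanh(x), LiSHT can be sketched as below. The gradient expression follows from the product rule; the function names are illustrative, and this is a sketch of the formula rather than the authors' reference implementation.

```python
import math

def lisht(x):
    """LiSHT: the input scaled by its own hyperbolic tangent, x * tanh(x)."""
    return x * math.tanh(x)

def lisht_grad(x):
    """Gradient via the product rule: tanh(x) + x * (1 - tanh(x)^2)."""
    t = math.tanh(x)
    return t + x * (1.0 - t * t)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, lisht(x), lisht_grad(x))
```

Unlike ReLU, large negative inputs produce a large (positive) output and a gradient close to -1 rather than 0, so those inputs still contribute to learning. The output is never negative, and the gradient is zero only at x = 0.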