
Deep Learning Interview Questions and Answers

Deep learning is used to automatically learn hierarchical representations of data, allowing it to identify patterns and features that may be difficult or impossible to recognize with traditional machine learning techniques. Despite challenges, deep learning significantly impacts many fields and continues to be an active area of research and development. We have listed the top deep learning interview questions and answers for data science professionals with beginner, intermediate and expert proficiencies. These Deep Learning interview questions and answers are based on real-time projects and will help you competently answer questions on popular topics like neural networks, advanced pattern recognition, machine learning algorithms and more. Prepare with the top Deep Learning interview questions listed here. Convert your next Deep Learning interview into a sure job offer in the field, as these questions have been curated by experts and will be your best guide to surviving the trickiest Deep Learning interviews.



Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level, hierarchical representations in data using a deep graph with multiple processing layers, each composed of linear and non-linear transformations.

In Machine Learning (ML), the basic process flow is from “input” to “hand-designed features” to “mapping from features” to “output”. In Representation Learning (RL), the flow is from “input” to “features” to “mapping from features” to “output”. In Deep Learning (DL), the flow is from “input” to “simple features” to “more layers of abstract features” to “mapping from features” to “output”. The table below provides a quick reference.

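Topic / Area | Basic Process Flow
Machine Learning | Input → hand-designed features → mapping from features → output
Representation Learning | Input → features → mapping from features → output
Deep Learning | Input → simple features → more layers of abstract features → mapping from features → output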

A neural network’s primary function is to receive a set of inputs, perform progressively complex computations, and then use the output to solve the problem. Neural networks are used for many different applications; one example is classification. Many classifiers are available today, such as logistic regression, support vector machines, decision trees, random forests and, of course, neural networks.

For example, say we need to predict whether a person is healthy or sick, and all we have is some input information such as the height, weight and body temperature of each person. Predicting whether a person is sick or healthy is a classification problem, and it can be solved using approaches such as neural networks. The classifier receives the data about the patient, processes it and gives a confidence score. A high score indicates high confidence that the patient is sick, and a low score suggests they are healthy. The score could be a probability value between 0 and 1.

A neural network is highly structured and comes in layers. The first layer is the input layer, the last layer is the output layer, and all layers in between are referred to as hidden layers. Hence a neural network can be viewed as the result of spinning classifiers together in a layered web.

What is a neural network

This is one of the most frequently asked deep learning interview questions for freshers in recent times.

The key is that deep neural nets are able to break complex patterns down into a series of simpler patterns. For example, say the task is to determine whether or not an image contains a human face. A deep neural net would first use edges to detect different parts of the face (the nose, lips, ears, eyes etc.) and would then combine the results to form the whole face. This ability to use simpler patterns as building blocks for detecting complex patterns is what gives deep neural nets their strength.

There is one key downside to all this: deep neural nets take much longer to train. However, high-performance GPUs are now available that can finish training a complex net in considerably less time than CPUs.

There are different categories of deep nets to handle both scenarios: where labelled data exists and where there is no labelled data. Different techniques and approaches can be used to handle such problems.

Below is the correct mapping from Side A to Side B:

Side A | Side B
Unlabelled Data | Restricted Boltzmann Machine (RBM), Autoencoders
Text Processing | Recurrent Net (RNTN)
Unsupervised Learning | Restricted Boltzmann Machine (RBM), Autoencoders
Image Recognition | Deep Belief Nets (DBN), Convolutional Neural Nets (CNN)
Object Recognition | Recurrent Net (RNTN), Convolutional Neural Nets (CNN)
Speech Recognition | Recurrent Net (RNTN)
MLP/RELU, Deep Belief Nets (DBN)

*RNTN – Recursive Neural Tensor Network, *MLP – Multi Layer Perceptron, *RELU – Rectified Linear Unit

Option d (all of the above) is the correct option.

The explanation for the second part of the question is as follows:

When training with backpropagation, we can run into a problem called the vanishing gradient or, sometimes, the exploding gradient. When that happens, training takes too long and accuracy really suffers.

For example, when we are training a neural net, we are constantly calculating a cost value. The cost is typically the difference between the net’s predicted output and the actual output from a set of labelled training data. The cost is then lowered by making slight adjustments to the weights and biases over and over throughout the training process, until the lowest possible value is obtained. The training process utilizes a “gradient”, which measures the rate at which the cost will change with respect to a change in a weight or a bias.

The early layers of a network are the slowest to train, because the gradient reaching an early layer is a product of the gradients of all the later layers and shrinks rapidly when those factors are smaller than 1. Early layers are also responsible for detecting the simplest features and building blocks. In face detection, for instance, the early layers are important for figuring out edges; they pass these details on to later layers, where the face's features are captured and consolidated to provide the final output.
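The shrinking gradient can be illustrated numerically. This is only a rough sketch under stated assumptions: it assumes sigmoid activations (the text does not name one), ignores the weights entirely, and uses a hypothetical helper `gradient_scale` that multiplies one sigmoid derivative per layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid; its maximum value is 0.25 at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def gradient_scale(num_layers, z=0.0):
    """Product of sigmoid derivatives across num_layers layers (weights ignored)."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= sigmoid_grad(z)
    return scale

# The factor reaching the earliest layer shrinks geometrically with depth.
for depth in (1, 5, 10):
    print(depth, gradient_scale(depth))
```

Even in the best case for the sigmoid (a derivative of 0.25 per layer), ten layers scale the gradient by less than one millionth, which is why the earliest layers learn so slowly.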

A convolutional neural network (CNN) is a type of artificial neural network used in image recognition and processing that is specifically designed to process pixel data. In deep learning, a CNN is a class of deep neural nets, most commonly applied to analysing visual imagery. CNNs use a variation of multilayer perceptrons designed to require minimal pre-processing. Convolution is the process of filtering through the image for a specific pattern.

A CNN typically has the following layers in addition to the input and output layers:

  • Convolutional Layer (CONV)
  • Rectified Linear Unit Layer (RELU)
  • Pooling Layer (POOLING)

There is also a fully connected layer (FC) at the end, just before the output layer, which equips the net with the ability to classify data samples.

A fundamental CNN architecture comprising all of these layers is shown in the image below. This is an illustrative structure; the layers can be arranged differently to solve a specific problem, depending on the context.

What is a Convolutional Neural Network (CNN)

Yes, a CNN does perform dimensionality reduction. The pooling layer is used for this.

This is one of the most frequently asked deep learning coding interview questions and answers for freshers in recent times.

Image classification and tagging, face detection and video recognition are use cases of machine vision, so options “i”, “iii” and “iv” are the correct answers.

Image search systems use deep learning for image classification and automatic tagging, which allows images to be accessible through a standard search query. For example: companies such as Facebook use deep nets to scan pictures for faces at many different angles, and then label the face with the proper name.

Face detection is a computer technology being used in a variety of applications that identifies human faces in digital images. Deep nets are also used to recognize objects within images which allow for images to be searchable based on the objects within them.

Video recognition systems are important tools for driverless cars, remote robots, theft detection etc. A video is typically an ordered set of frames of the same resolution. It comes in two forms: video stream and video sequence. A video stream is an ongoing video for online, real-time processing. A video sequence is a video of fixed length.

A deep learning platform provides a set of tools and an interface for building custom deep nets. Typically, a platform offers a selection of deep nets to choose from, along with the ability to integrate data from different sources, manipulate data, and manage models through a user interface. Some platforms also help with performance if a net needs to be trained on a large dataset. The platform is typically an out-of-the-box application that lets us configure a deep net’s hyper-parameters through an intuitive UI. With a platform, we don’t need to know anything about coding in order to use the tools. The downside is that we are constrained by the platform’s selection of deep nets as well as its configuration options. However, for somebody who needs to deploy a deep net quickly, a platform could be the best way to go. Deep learning platforms come in two different forms: software platform and full platform. Examples are H2O.ai, Data Graph etc.

A library is a set of functions and modules that we can call from our own programs in order to perform certain tasks. Deep net libraries give us a lot of flexibility with net selection and hyper-parameter configuration. For example, there are not many platforms that let us build a Recursive Neural Tensor Net (RNTN), but we can code our own with the appropriate deep net library. The obvious downside is the coding experience required to use them. However, if we need flexibility, these are great options. Examples are TensorFlow, Theano, Caffe, deeplearning4j, Torch etc.


Artificial Neural Networks (ANN) can be of two main categories: Feed-forward neural networks and recursive neural networks. They are flexible to many possible variations based on – the learning rule, the inference rule and the architecture.

A feed-forward neural network can be described as shown in the diagram below:

What is Artificial Neural Network (ANN) and what is a perceptron algorithm

The perceptron is the fundamental computational unit at the heart of a deep learning model. The perceptron algorithm is a supervised learning algorithm for binary classification.

There are many packages available for ANNs. In R, some of the packages are nnet, neuralnet, RSNNS, deepnet, darch, caret, RNN, Autoencoder, RcppDL, MXNetR and others for more specific tasks.

Answer: [5]

The program creates a graph from source operations. These source operations pass their information to other operations, which execute the computations.

To create two source operations that output numbers, two constants are defined as a and b. The function tf.add() then adds the two elements and stores the result in c.

The with block opens a session and prints the result.

When the with block finishes, the session closes automatically.

Answer: 0, 1, 2

The result will print 0 to 2.

The variable is first defined using tf.Variable() and initialized with 0. To count up from the initial value, the tf.assign() function is used. It takes two arguments: the reference_variable as the first argument and value_to_update as the second.

Variables must be initialized by running an initialization operation after having launched the graph. We first have to add the initialization operation to the graph.

We then start a session to run the graph, first initialize the variables, then print the initial value of the state variable, and then run the operation of updating the state variable and printing the result after each update.

Answer: 7.0

tf.float32 specifies a 32-bit floating-point type. The first line creates a placeholder, which is then multiplied by 2 and stored in a second variable, b.

Now we need to define and run the session, but since we created a "hole" in the model to pass the data, when we initialize the session we are obligated to pass an argument with the data, otherwise we would get an error.

To pass the data into the model we call the session with an extra argument feed_dict in which we should pass a dictionary with each placeholder name followed by its respective data, like the way it was shown in above code snippet.


Operations are nodes that represent mathematical operations over the tensors in a graph. These operations can be any kind of function, such as adding or subtracting tensors, or an activation function.

All four of these (tf.constant, tf.matmul, tf.add and tf.nn.sigmoid) are operations in TensorFlow. They are like functions in Python but operate directly on tensors, and each one does a specific thing.

While the others are straightforward, tf.nn.sigmoid is an activation function and is a little more complicated: it squashes each input value into the range 0 to 1, which helps a learning model weigh how strongly a piece of information should be passed on.

The output will be as follows:

c =: 8
d =: 4

Operations are nodes that represent mathematical operations over the tensors in a graph. These operations can be any kind of function, such as the tensor addition and subtraction used here. The tf.constant function assigns the values to a and b.

Then c and d perform the add and subtract operations respectively on the given values when the session runs, and the results are printed.

Option c would be the correct answer.

The size of the convolved matrix is given by C = ((I-F+2P)/S)+1, where

  • C is the size of the convolved matrix.
  • I is the size of the input matrix.
  • F is the size of the filter matrix.
  • P is the padding applied to the input matrix.
  • S is the stride applied.

Here I = 12, F = 3, P = 0, S = 1

Therefore C = ((12-3+2*0)/1)+1 = 10, so the answer is a 10 X 10 matrix.
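The formula can be checked with a few lines of Python. This is a minimal sketch; the function name `conv_output_size` is made up for illustration, and the floor division for strides that do not divide evenly is an assumption that matches common practice rather than something stated above.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output size along one dimension: C = ((I - F + 2P) / S) + 1."""
    # Floor division is an assumption for strides that do not divide evenly.
    return (input_size - filter_size + 2 * padding) // stride + 1

# The case from the answer: I = 12, F = 3, P = 0, S = 1
print(conv_output_size(12, 3, padding=0, stride=1))  # 10
```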

Option c is the correct answer.

The softmax function has the form softmax(x_i) = exp(x_i) / sum_j exp(x_j), so the resulting probabilities over all i sum to 1.

If we take an input of [1, 2, 3, 4, 1, 2, 3], the softmax of that is [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]. The output has most of its weight where the '4' was in the original input. This is what the function is normally used for: to highlight the largest values and suppress values which are significantly below the maximum value. Note, however, that softmax is not scale invariant, so if the input were [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3] (which sums to 1.6) the softmax would be [0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153]. This shows that for values between 0 and 1 softmax, in fact, de-emphasizes the maximum value (note that 0.169 is not only less than 0.475, it is also less than the initial proportion of 0.4/1.6=0.25).

Executing the code snippet below in Python 3.x produces the output for the [1, 2, 3, 4, 1, 2, 3] example mentioned above.

Code for deep learning
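The original snippet is not reproduced here, so the following is a minimal sketch of a softmax in plain Python rather than the original code. Subtracting the maximum before exponentiating is a common numerical-stability choice and an addition of this sketch; it does not change the result.

```python
import math

def softmax(values):
    # Shift by the maximum for numerical stability; the result is unchanged.
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 3) for p in softmax([1, 2, 3, 4, 1, 2, 3])])
# [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]

print([round(p, 3) for p in softmax([0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3])])
# [0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153]
```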

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. If, instead, one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.

The steps to use the gradient descent algorithm are as follows:

  1. Initialize random weights and biases
  2. Pass an input through the network and get values from the output layer
  3. Calculate the error between the actual value and the predicted value
  4. Go to each neuron that contributes to the error and change its values to reduce the error
  5. Reiterate until we find the best weights of the network
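The steps above can be sketched on the smallest possible "network": a single linear unit y = w*x + b trained with mean squared error. The data, learning rate, epoch count and the name `train_linear` are all assumptions chosen for illustration.

```python
import random

def train_linear(xs, ys, lr=0.05, epochs=2000, seed=0):
    rng = random.Random(seed)
    # Initialize a random weight and bias.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)
    for _ in range(epochs):  # reiterate
        # Forward pass: compute predictions for every input.
        preds = [w * x + b for x in xs]
        # Error between predicted and actual values.
        errors = [p - y for p, y in zip(preds, ys)]
        # Gradient of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Step in the direction of the negative gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = train_linear(xs, ys)
print(round(w, 3), round(b, 3))  # approaches 2.0 and 1.0
```

A real network repeats the same loop, with backpropagation supplying the gradient for every weight and bias.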

Below is an illustrative diagram of gradient descent on a series of level sets.

Gradient descent algorithm

Dropout is a regularization technique patented by Google for reducing overfitting in neural networks by preventing complex co-adaptations on training data. It is a very efficient way of performing model averaging with neural networks. The term "dropout" refers to dropping out units (both hidden and visible) in a neural network.

Dropout can be seen as an extreme form of bagging in which each model is trained on a single case and each parameter of the model is very strongly regularized by sharing it with the corresponding parameter in all the other models.

Key points are as follows: Large weights in a neural network are a sign of a more complex network that has overfit the training data. Probabilistically dropping out nodes in the network is a simple and effective regularization method. A large network with more training and the use of a weight constraint are suggested when using dropout.

  • Bias – When a machine learning model has “high bias”, it does not take the variation in the data into account and underfits the data.
  • Variance – When a machine learning model learns all the extraneous information in the data (“high variance”), it tends to overfit the data and does not generalize well.

A good fit strikes a balance between bias and variance.

What is bias and variance

Yes, overfitting can occur in a neural network. There are various ways to handle overfitting in a neural network which are as follows:

  1. Dropout
  2. Regularization
  3. Batch normalization

Dropout is a regularization method that approximates training a large number of neural networks with different architectures in parallel. Dropout has the effect of making the training process noisy, forcing nodes within a layer to probabilistically take on more or less responsibility for the inputs.

Some of the key aspects for using dropout regularization are:

  • Use with all network types - It can be used with most, perhaps all, types of neural network models, not least the most common network types of Multilayer Perceptrons, Convolutional Neural Networks, and Long Short-Term Memory Recurrent Neural Networks. In the case of LSTMs, it may be desirable to use different dropout rates for the input and recurrent connections.
  • Dropout rate - Here the dropout hyperparameter is interpreted as the probability of retaining (training) a given node in a layer, where 1.0 means no dropout and 0.0 means no outputs from the layer. Under this interpretation, a good value for a hidden layer is between 0.5 and 0.8, and input layers use a larger value, such as 0.8. Note that some libraries, such as Keras, define the rate the other way round, as the fraction of units to drop.
  • Grid search parameters – Instead of guessing at a suitable dropout rate for your network, test different rates systematically. For example, test values between 1.0 and 0.1 in increments of 0.1. This will both help discover what works best for our specific model and dataset, as well as how sensitive the model is to the dropout rate. A more sensitive model may be unstable and could benefit from an increase in size.
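To make the retain-probability interpretation concrete, here is a minimal sketch of "inverted" dropout in plain Python. The function name `dropout`, the `keep_prob` parameter and the rescaling by 1/keep_prob are assumptions of this sketch (the rescaling is a common convention, not something the text above specifies).

```python
import random

def dropout(activations, keep_prob, rng):
    """Keep each unit with probability keep_prob and zero the rest."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            # Rescale kept units so the expected activation is unchanged.
            out.append(a / keep_prob)
        else:
            out.append(0.0)
    return out

rng = random.Random(42)
layer = [1.0] * 10000
dropped = dropout(layer, keep_prob=0.8, rng=rng)
print(sum(1 for a in dropped if a == 0.0) / len(dropped))  # close to 0.2
print(sum(dropped) / len(dropped))                         # close to 1.0
```

At prediction time dropout is switched off, which in this sketch corresponds to keep_prob = 1.0.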

Batch normalisation is a technique for improving the performance and stability of neural networks. The idea is to normalise the inputs of each layer in such a way that they have a mean output activation of 0 and standard deviation of 1. This is analogous to how the inputs to networks are standardised.
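The normalisation step itself is simple arithmetic. Below is a minimal sketch for a single feature across one batch; the scale `gamma`, shift `beta` and the small constant `eps` are the usual extra parameters of batch normalisation and are assumptions here, since the text does not mention them.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a batch to roughly zero mean, unit std."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # eps avoids division by zero when the batch has no variance.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]
normed = batch_norm(activations)
mean = sum(normed) / len(normed)
std = math.sqrt(sum((x - mean) ** 2 for x in normed) / len(normed))
print(normed)
print(mean, std)  # approximately 0 and 1
```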

In Keras, it is implemented using the following code snippet. Note how the BatchNormalization call occurs after each fully-connected layer, but before the activation function and dropout.

Code for deep learning

Expect to come across this, one of the most important Python deep learning interview questions for experienced professionals in deep learning, in your next interviews.

Yes, a CNN has a pooling layer.

  • When pooling is added to a CNN, it induces translation invariance.
  • This is one of the advantages of max-pooling: it provides a form of translation invariance.

Invariance means that we can recognize an object as an object even when its appearance varies in some way. It preserves the object’s identity, category etc. across changes in the specifics of the visual input, such as the relative positions of the viewer and the object.

Translation has a specific meaning in vision, borrowed from geometry. It does not refer to any kind of conversion, such as a translation from French to English or between file formats. Instead, it means that each point/pixel in the image has been moved by the same amount in the same direction. Alternatively, you can think of the origin as having been shifted by an equal amount in the opposite direction.

So typically, CNN + max-pooling ~ Translation invariance


A must-know for anyone looking for top deep learning interview questions, this is one of the frequently asked deep learning behavioral interview questions.

Correct answer option is D.

The best method would be to train only the last layer, as all the previous layers work as feature extractors. In a similar scenario, the initial layers will already have extracted the key features.

Since the data similarity is very high, we do not need to retrain the model. All we need to do is customize and modify the output layers according to our problem statement, using the pretrained model as a feature extractor. For example, say we decide to use models trained on ImageNet to identify whether a new set of images contains cats or dogs. The images we need to identify are similar to ImageNet, but we need only two output categories: cats or dogs. In this case, all we do is modify the dense layers and the final softmax layer to output 2 categories instead of 1,000. Training these types of neural nets from scratch also takes a long time, so reusing the pretrained layers saves a significant amount of time. Retraining the last layer adapts the model to the new dataset while leveraging the features that have already been learned.

There are potentially four scenarios, which are illustrated in the diagram below.

Data similarity

The primary reason overfitting happens is because the model learns even the tiniest details present in the data. So after learning all the possible patterns it can find, the model tends to perform extremely well on the training set but fails to produce good results on the test sets. It falls apart when faced with previously unseen data. And this is critical from an accuracy standpoint.

One way to prevent overfitting is to reduce the complexity of the model. This is exactly what regularization does. If we set the regularization parameter to a large value, the decay in the weights during gradient descent update will be more. Hence, the weights of most of the hidden units will be close to zero.

Since the weights are negligible, the model will not learn much from these units. This will end up making the network simpler and thus reduce overfitting.

Let us take another example. Assume we are using a tanh activation function.

regularization reduce overfitting in neural network

Now if we set the regularization parameter to a large value, the weights of the units will be smaller. To calculate z[l], we can use the following:

z[l] = w[l] m[l-1] + n[l]

Here m[l-1] is the input to layer l (the previous layer's activations) and n[l] is the bias. With small weights, z[l] will also be small. If we use the tanh activation function, these small values of z[l] will lie near the origin.

regularization reduce overfitting in neural network

The key aspect of this change is that we are only using the linear region of the tanh function. This makes every layer in the network mostly linear, i.e. we get close to linear boundaries separating the data, which prevents overfitting.
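A quick numerical check shows how close tanh is to a straight line near the origin. This is only an illustration; the sample z values and the helper name `linearity_gap` are arbitrary choices.

```python
import math

def linearity_gap(z):
    """Absolute difference between tanh(z) and the straight line y = z."""
    return abs(math.tanh(z) - z)

# Small z: tanh(z) is almost exactly z. Large z: tanh saturates towards 1.
for z in (0.01, 0.1, 0.5, 2.0):
    print(z, math.tanh(z), linearity_gap(z))
```

For small z the gap is negligible, so a layer whose z values stay small behaves almost like a linear map.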

This, along with other Python interview questions on deep learning, is a regular feature in deep learning interviews, be ready to tackle it with the approach mentioned below.

Correct answer option is B.

ReLU, or Rectified Linear Unit, gives a continuous output in the range 0 to infinity. However, in the output layer we would require a finite range of values.

A unit employing the rectifier is called a Rectified Linear Unit or ReLU.

Its range is 0 to infinity.

Difference between ReLU and Leaky ReLU

In the case of Leaky ReLU, f(y) is ay for negative y rather than zero. The leak helps to extend the range of the ReLU function. The value of a is usually 0.01 or thereabouts.

  • When a is sampled randomly during training instead of being fixed, the unit is called a Randomized ReLU.
  • Hence the range of Leaky ReLU is -infinity to infinity.
  • Leaky ReLUs allow a small, positive gradient when the unit is not active.
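The difference between the two functions fits in a few lines. A minimal sketch, using the conventional a = 0.01 mentioned above as the default:

```python
def relu(y):
    # Output is 0 for negative inputs and y otherwise.
    return max(0.0, y)

def leaky_relu(y, a=0.01):
    # Negative inputs are scaled by a instead of being clamped to zero.
    return y if y > 0 else a * y

for y in (-2.0, 0.0, 3.0):
    print(y, relu(y), leaky_relu(y))
```

For y = -2.0, ReLU returns 0.0 while Leaky ReLU returns -0.02, so the unit still passes a small signal (and gradient) when it is not active.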

Correct answer is B.

The output can be calculated as 3 (2*4 + 3*5 + 4*6) = 3 (8 + 15 + 24) = 3 * 47 = 141.

An MLP, or Multi Layer Perceptron, is a class of feed-forward artificial neural net. It comprises at least 3 layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function.

This is a common yet one of the most important deep learning interview questions and answers for experienced professionals, don't miss this one.

In the theory of artificial neural networks, the universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of Rⁿ, under mild assumptions on the activation function. It does state that simple neural networks can represent a wide variety of interesting functions when given appropriate parameters; however, it does not touch upon the algorithmic learnability of those parameters.

Examples of universal approximators are –

  1. Kernel SVM
  2. Neural Networks
  3. Boosted Decision Trees

Under suitable conditions, all of these methods can approximate any continuous function to arbitrary accuracy.

Max pooling takes a 3 X 3 window and outputs the maximum value in that window. Sliding it over the entire input matrix with a stride of 2 gives the matrix below as the result.
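Since the input matrix from the question is not reproduced here, the sketch below applies the same 3 X 3, stride-2 max pooling to a made-up 5 X 5 matrix. The matrix values and the function name `max_pool2d` are assumptions for illustration only.

```python
def max_pool2d(matrix, size=3, stride=2):
    """Slide a size x size window over the matrix and keep each window's maximum."""
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [matrix[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

# Hypothetical 5 x 5 input, not the matrix from the original question.
image = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25],
]
print(max_pool2d(image))  # [[13, 15], [23, 25]]
```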


Yes, dropout can be applied, as shown below. We have added a new dropout layer between the input and the first hidden layer. The dropout rate is set to 20%, meaning one in five inputs will be randomly excluded from each update cycle.

code for deep learning

We cannot provide an answer to the above, as sufficient information is not available. We need to know the weights and biases of a neural net to be able to predict its output in a scenario like the one above.

Don't be surprised if this question pops up as one of the top interview questions for deep learning in your next interview.

The above code will throw an error.

For “matmul”, the inner dimensions must be equal: the number of columns of the first matrix must match the number of rows of the second. Here they are 3 and 2, which do not match.

Changing “Matrix_one” and “Matrix_two” so that both are 2 X 2 matrices makes the code work.

Weight sharing occurs in Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). It does not occur in case of Fully Connected neural nets.

For example, the idea of “shared weights” is simple: use the same weight vector to do the “convolution” (which is essentially the inner product of two vectors). Take the scenario below.

weight sharing occur

Input layer is x = [x1 x2 x3 x4 x5 x6 x7]

Hidden layer is h = [h1 h2 h3]

Weight vector is w = [w1 w2 w3] = [1 0 -1], which is shared by all hidden units:

h1 = w * x[1:3]

h2 = w * x[3:5]

h3 = w * x[5:7]
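The three inner products above can be computed directly. In this sketch the input values for x are made up, the function name `shared_weight_layer` is hypothetical, and Python's 0-based slices replace the 1-based ranges x[1:3], x[3:5] and x[5:7] used in the text.

```python
def shared_weight_layer(x, w, stride=2):
    """Apply the same weight vector w to successive windows of x."""
    k = len(w)
    return [
        sum(wi * xi for wi, xi in zip(w, x[start:start + k]))
        for start in range(0, len(x) - k + 1, stride)
    ]

x = [3, 1, 4, 1, 5, 9, 2]  # hypothetical values for x1..x7
w = [1, 0, -1]             # the shared weight vector from the text
print(shared_weight_layer(x, w))  # [-1, -1, 3]
```

Only three weights are learned for the whole layer, whereas a fully connected layer from 7 inputs to 3 hidden units would need 21.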

Reference from : http://cs231n.github.io/assets/conv-demo/index.html - cs231n guide from Stanford University- amazing animation on convolution describing the shared weights.

weight sharing occur

The idea behind this is that a filter which detects, say, a horizontal edge may match the left corner of an image but may equally match the bottom right corner. Sharing the filter across positions avoids the fully connected feed-forward structure, which is more complicated.

The Rectified Linear Unit or ReLU is represented below with a diagram. It computes the function f(x)=max(0,x). In other words, the activation is simply thresholded at zero.

What are pros and cons of a ReLU

The pros and cons of ReLU are as follows.

Pros:

  • It does not saturate (in its positive region)
  • Computationally it is very efficient
  • Models with ReLU neurons generally converge much faster than models using other activation functions

Cons:

ReLU units can die, producing so-called dead ReLUs. If the activation of a ReLU neuron becomes zero, its gradient is also zero during back-propagation, so its weights stop updating. This can be avoided by being careful with weight initialization and by tuning the learning rate.

For example, a large gradient flowing through a ReLU neuron can cause the weights to update in such a way that the neuron will never activate on any data point again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, we may find that as much as 40% of our network can be “dead” (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.

One of the most frequently posed advanced deep learning interview questions and answers, be ready for this conceptual question.

  • Yes, we can model such a function in a neural network.
  • An activation function can be a reciprocal function, so the above can be accomplished using an activation function.

A staple in senior data scientist interview questions with answers, be prepared to answer this one using your hands-on experience. This is also one of the most difficult deep learning questions to ask a data scientist.

No, CNNs cannot.

Data pre-processing steps such as scaling and rotation are necessary before we model the data and provide it as input to a neural network such as a CNN, because CNNs cannot do this by themselves.

Bias = -1.5, w1 = 1, w2 = 1

By trial and error we can confirm that these values compute the AND gate function. The four input combinations give:

f(-1.5*1 + 1*0 + 1*0) = f(-1.5) = 0

f(-1.5*1 + 1*0 + 1*1) = f(-0.5) = 0

f(-1.5*1 + 1*1 + 1*0) = f(-0.5) = 0

f(-1.5*1 + 1*1 + 1*1) = f(0.5) = 1

All of these comply with the AND gate.
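The four cases can be verified mechanically. The sketch below assumes that f is a step function that outputs 1 for positive inputs and 0 otherwise; that threshold is an assumption consistent with the values listed above, since the definition of f is not reproduced here.

```python
def step(z):
    # Assumed activation: 1 for positive input, 0 otherwise.
    return 1 if z > 0 else 0

def and_gate(x1, x2, bias=-1.5, w1=1.0, w2=1.0):
    # The bias is multiplied by a constant input of 1, as in the lines above.
    return step(bias * 1 + w1 * x1 + w2 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, and_gate(x1, x2))
```

Only the input (1, 1) pushes the weighted sum above zero, which is exactly the AND truth table.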

Various factors could be affecting this scenario. Some of them are as follows:

  • The learning rate could be too low, which is why the loss does not decrease in the first few epochs.
  • The regularization parameter could be too high.
  • Training could be stuck at a local minimum.

Some of the key aspects that we should monitor during training of a neural network are –

  • Loss function
  • Validation / training accuracy
  • Ratio of weights : updates
  • Activation / gradient distributions per layer
  • First layer visualizations

The first quantity that is useful to track during training is the loss, as it is evaluated on the individual batches during the forward pass. Below is a diagram showing the loss over time, and especially what the shape might tell us about the learning rate.

Provide key aspects while training neural network

The second important quantity to track while training a classifier is the validation/training accuracy. This plot can give us valuable insights into the amount of overfitting in our model:

Provide key aspects while training neural network

The last quantity we might want to track is the ratio of the update magnitudes to the value magnitudes. Note that this refers to the updates, not the raw gradients (e.g. in vanilla SGD the update is the gradient multiplied by the learning rate). We might want to evaluate and track this ratio for every set of parameters independently. A rough heuristic is that this ratio should be somewhere around 1e-3. If it is lower than this, the learning rate might be too low. If it is higher, the learning rate is likely too high.
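The update-to-weight ratio described above can be computed as in the following sketch. It assumes vanilla SGD, uses the L2 norm as the measure of magnitude (one common choice, not the only one), and works on a small made-up parameter vector; `l2_norm` and `update_ratio` are hypothetical helper names.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))


def update_ratio(weights, grads, learning_rate):
    """Ratio of update magnitude to weight magnitude for vanilla SGD."""
    # In vanilla SGD the update is the gradient scaled by the learning rate.
    updates = [-learning_rate * g for g in grads]
    return l2_norm(updates) / l2_norm(weights)


# Illustrative values only; real weights and gradients come from the model.
weights = [0.5, -0.3, 0.8]
grads = [0.1, -0.2, 0.05]

ratio = update_ratio(weights, grads, learning_rate=1e-2)
print(f"update/weight ratio: {ratio:.2e}")  # compare against the ~1e-3 heuristic
```

In practice this would be evaluated separately for each layer's parameters and logged every few iterations, so that a ratio drifting far from 1e-3 is easy to spot.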


Nesterov Momentum is a different version of the momentum update. It enjoys stronger theoretical convergence guarantees for convex functions, and in practice it also consistently works slightly better than standard momentum.

The core idea behind Nesterov momentum is that when the current parameter vector is at some position x, then looking at the momentum update above, we know that the momentum term alone (i.e. ignoring the second term with the gradient) is about to nudge the parameter vector by mu * v. Therefore, if we are about to compute the gradient, we can treat the future approximate position x + mu * v as a “lookahead” - this is a point in the vicinity of where we are soon going to end up. Hence, it makes sense to compute the gradient at x + mu * v instead of at the “old/stale” position x.

[Figure: Nesterov momentum]

Nesterov momentum: Instead of evaluating gradient at the current position (red circle), we know that our momentum is about to carry us to the tip of the green arrow. With Nesterov momentum we therefore instead evaluate the gradient at this "looked-ahead" position.
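The lookahead idea can be sketched on a one-dimensional toy problem. This is an illustration under stated assumptions, not a production optimizer: the objective f(x) = x**2, the starting point, and the values of `learning_rate` and `mu` are all chosen for the example, and `nesterov_step` is a hypothetical helper name.

```python
def grad(x):
    """Gradient of the toy objective f(x) = x**2."""
    return 2.0 * x


def nesterov_step(x, v, grad_fn, learning_rate, mu):
    """One Nesterov momentum update for a scalar parameter x with velocity v."""
    x_ahead = x + mu * v                            # where momentum alone would carry us
    v = mu * v - learning_rate * grad_fn(x_ahead)   # gradient taken at the lookahead point
    x = x + v                                       # apply the new velocity
    return x, v


x, v = 5.0, 0.0
for _ in range(100):
    x, v = nesterov_step(x, v, grad, learning_rate=0.1, mu=0.9)

print(f"x after 100 steps: {x:.2e}")  # should be very close to the minimum at 0
```

The only difference from standard momentum is the point at which the gradient is evaluated: `x_ahead` rather than `x`.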

In training deep networks, it is usually helpful to anneal the learning rate over time. A good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrower parts of the loss function.

Knowing when to decay the learning rate can be tricky: decay it slowly and you'll be wasting computation bouncing around chaotically with little improvement for a long time. But decay it too aggressively and the system will cool too quickly, unable to reach the best position it can. There are three common ways to implement learning rate decay.

  • Step decay - reduces the learning rate by some factor every few epochs. Typical values might be halving the learning rate every 5 epochs, or multiplying it by 0.1 every 20 epochs. These numbers depend heavily on the type of problem and the model.
  • Exponential decay - has the mathematical form alpha = alpha0 * exp(-k * t), where alpha0 and k are hyperparameters, and t is the iteration number.
  • 1/t decay - has the mathematical form alpha = alpha0 / (1 + k * t), where alpha0 and k are hyperparameters, and t is the iteration number.
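The three schedules above can be written as small functions. This is a minimal sketch: the function names, the default `drop` and `epochs_per_drop` values, and the sample hyperparameters are assumptions made for illustration and would need tuning for a real problem.

```python
import math


def step_decay(alpha0, epoch, drop=0.5, epochs_per_drop=5):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return alpha0 * drop ** (epoch // epochs_per_drop)


def exponential_decay(alpha0, k, t):
    """alpha = alpha0 * exp(-k * t)"""
    return alpha0 * math.exp(-k * t)


def inverse_time_decay(alpha0, k, t):
    """alpha = alpha0 / (1 + k * t), the 1/t schedule."""
    return alpha0 / (1.0 + k * t)


# Compare the schedules at a few points, starting from alpha0 = 0.1.
for t in (0, 5, 10, 20):
    print(
        t,
        round(step_decay(0.1, t), 5),
        round(exponential_decay(0.1, 0.1, t), 5),
        round(inverse_time_decay(0.1, 0.1, t), 5),
    )
```

All three start at alpha0 when t is 0. Step decay changes the rate in discrete jumps, while the other two shrink it smoothly on every iteration.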


Deep Learning is a subfield of machine learning that is based on learning data representations. The learning process can be supervised, semi-supervised or unsupervised. Professionals can opt for positions like Machine Learning Engineer, Senior Machine Learning Engineer, Data Scientist, etc. once they go through a Deep Learning course and appear for an interview.

According to payscale.com, the salary for a Machine Learning Engineer ranges from $76,000 to $153,000 per year, with an average base salary of approximately $111,453. Companies from around the world use Machine Learning in different yet amazing ways. A few of the companies that use Machine Learning are Yelp, Pinterest, Facebook, Twitter, etc.

There has been an increase in demand for Data Scientists and Machine Learning Engineers in the past few years. Yes, interviews for Deep Learning can be scary, but preparing with these Deep Learning interview questions will help you in pursuing your dream career. It's important to be prepared to respond effectively to the questions that employers typically ask in an interview. Since these deep learning engineer interview questions are very common, your prospective recruiters will expect you to be able to answer them. These current deep learning interview questions will give you the confidence and motivation you need to ace the interview. You can also opt for a data scientist certification and benefit from the interview prep session in it.

Going through these interview questions for deep learning will help you land your dream job and will definitely prepare you to answer the toughest of questions in the best way possible. These deep learning interview questions and answers are suggested by experts and have proven to have great value.

Not only the job aspirants but also the recruiters can refer to these deep learning technical interview questions to know the right set of questions to assess a candidate.
