**Abstract**

Today, Deep Learning is applied in many fields, such as computer vision, audio, and natural language processing and generation. Although we keep pushing the limits of its performance, we should be aware that Deep Learning has its limitations. In 2015, Anh Nguyen et al. published a paper at CVPR [1] that identified one such limit in computer vision: a deep neural network (DNN) can be fooled by changing an image in a way that’s imperceptible to humans but causes the DNN to label the image as something else entirely.

The paper reviewed here describes four families of problems on which commonly used algorithms fail or suffer significant difficulty.

**Introduction**

This paper discusses four families of simple problems for which commonly used methods do not show the expected performance. The first section discusses gradients that carry negligible information about the target function. The second compares two common approaches to learning and optimization problems: end-to-end and decomposition. In the third, the authors show that even when two architectures have the same expressive power, there may be a tremendous difference in how well they can be optimized. In the last section, the authors use a simple case to question deep learning’s reliance on “vanilla” gradient information in the optimization process.

**Experiment**

**Failure due to Non-Informative Gradients**

This experiment shows that if there is little information in the gradient, using it for learning will not lead to success. The authors begin with the simple problem of learning random parities: first choose some *v*\* ∈ {0, 1}^*d*; then *y* indicates whether the number of 1’s in a certain subset of coordinates of *x* (indicated by *v*\*) is odd or even. (**Configuration:** hinge loss, one fully connected layer with ReLU activation, and a fully connected output layer with linear activation and a single unit.)
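
To make the setup concrete, here is a minimal sketch of how such parity data could be generated. It is an illustration only, not the authors' code: the function names, the uniform sampling of *x*, and the sign convention (+1 for an even count, -1 for an odd count) are assumptions.

```python
import random

def parity_label(x, v_star):
    """Return +1 if x has an even number of 1's on the coordinates
    selected by v_star, else -1 (sign convention is an assumption)."""
    ones = sum(xi for xi, vi in zip(x, v_star) if vi == 1)
    return 1 if ones % 2 == 0 else -1

def sample_parity_data(d, n, seed=0):
    """Draw a random v_star in {0,1}^d and n uniformly random inputs with labels."""
    rng = random.Random(seed)
    v_star = [rng.randint(0, 1) for _ in range(d)]
    xs = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]
    ys = [parity_label(x, v_star) for x in xs]
    return v_star, xs, ys

# Example: a small problem with d = 8 input bits and 5 samples.
v_star, xs, ys = sample_parity_data(d=8, n=5)
```

Note that flipping a single relevant bit of *x* flips the label, which gives a feel for why the target is hard to pick up from local information.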

From Figure 1, we can see that as the dimension *d* increases, so does the difficulty of learning, to the point that around *d* = 30 no advance beyond random performance is observed after a reasonable time.

Through a detailed analysis based on two theorems, the authors conclude that gradient-based methods indeed cannot learn random parities and linear-periodic functions. Moreover, these results hold regardless of which class of predictors we use; the difficulty lies in using a gradient-based method to train them.

In conclusion, by directly examining the variance of the gradient with respect to the target function, the authors connect gradient-based methods to parities, and propose that gradient-based methods are unlikely to solve learning problems that are known to be hard in the statistical queries framework, in particular parities.
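
For intuition, hardness arguments for parities in the statistical queries framework typically rest on the fact that distinct parity functions are uncorrelated under the uniform distribution on {0, 1}^d. The sketch below checks that fact by brute force for a small *d*. It is not the authors' proof, and the helper names are illustrative.

```python
from itertools import product

def parity(x, v):
    """Parity function (-1)^<x, v> on binary vectors."""
    return -1 if sum(xi * vi for xi, vi in zip(x, v)) % 2 else 1

def correlation(v1, v2):
    """Exact E_x[parity(x, v1) * parity(x, v2)] for x uniform on {0,1}^d."""
    d = len(v1)
    total = sum(parity(x, v1) * parity(x, v2)
                for x in product((0, 1), repeat=d))
    return total / 2 ** d

d = 4  # small enough to enumerate all 2^d inputs and all 2^d parities
subsets = list(product((0, 1), repeat=d))
max_off_diagonal = max(abs(correlation(a, b))
                       for a in subsets for b in subsets if a != b)
```

Every parity has correlation 1 with itself and 0 with every other parity, so any fixed quantity computed from the data can correlate strongly with only a few of the 2^d candidate targets.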

**Decomposition vs. End-to-end**

For learning and optimization, there are two main approaches to choose from: end-to-end or decomposition. In this section, the authors try to figure out whether the end-to-end approach is worth it. They focus on the optimization aspect, showing that the “end-to-end” approach may suffer from a low signal-to-noise ratio, which may affect training time. There are two experiments in this section.

**Experiment 1**

This experiment compares the two approaches in a computer vision setting. The authors define a family of problems parameterized by *k* ∈ ℕ, and show that as *k* grows, the gap between the “End-to-end” approach and the “Decomposition” approach grows.

The experiment lets X1 denote the space of 28 × 28 binary images, and uses some functions to define the distribution over it.

I chose to omit the detailed configuration of this experiment and dive straight into the results. When compared on the “primary” objective, “End-to-end” is significantly inferior to “Decomposition”, which quickly arrives at a good solution.

**Experiment 2**

Consider the problem of training a predictor that, given a “positive media reference” *x* to a certain stock option, distributes our assets among the *k* = 500 stocks in the S&P 500 index in some manner. Again, there are the two approaches mentioned above:

• An “End-to-end” approach: train a deep network N_w that, when given *x*, outputs a distribution over the *k* stocks. The objective for training is maximizing the gain obtained by allocating our money according to this distribution.

• A “Decomposition” approach: train a deep network N_w that, when given *x*, outputs a single stock, *y* ∈ [*k*], whose future gains are the most positively correlated to *x*. Of course, we may need to gather extra labeling for training N_w based on this criterion.

The authors run the comparison under unrealistic assumptions. For the detailed configuration, please refer to the paper.

Figure 4 clearly shows that using the “End-to-end” estimator for optimization is inferior to working with the “Decomposition” one.

The authors analyze this by examining the signal-to-noise ratio (SNR) of the two stochastic gradient estimators, showing that the “End-to-end” one is significantly inferior.
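
As a rough illustration of what an SNR of a stochastic gradient estimator measures, the sketch below uses one common definition: the squared norm of the mean gradient divided by the average squared deviation from that mean. This definition and the function name `snr` are assumptions for illustration; the paper's exact definition may differ in detail.

```python
def snr(gradient_samples):
    """Signal-to-noise ratio of a list of gradient vectors (lists of floats).

    signal = squared norm of the sample mean
    noise  = mean squared distance of the samples from that mean
    (an assumed, commonly used definition)
    """
    n = len(gradient_samples)
    dim = len(gradient_samples[0])
    mean = [sum(g[i] for g in gradient_samples) / n for i in range(dim)]
    signal = sum(m * m for m in mean)
    noise = sum(
        sum((g[i] - mean[i]) ** 2 for i in range(dim))
        for g in gradient_samples
    ) / n
    return signal / noise if noise > 0 else float("inf")

# Samples that agree on a direction have high SNR ...
consistent = snr([[1.0, 0.0], [3.0, 0.0]])
# ... while samples that cancel out on average have SNR 0.
cancelling = snr([[1.0, 0.0], [-1.0, 0.0]])
```

A low SNR means that individual stochastic gradients mostly reflect noise, so many more samples are needed before the updates make consistent progress.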

In Figure 5, it is easy to see that the “End-to-end” approach suffers from a significantly lower SNR. More importantly, its SNR depends on *k* and quickly falls below machine precision, whereas the “Decomposition” approach’s SNR is constant.

Thus the authors get further evidence for the benefits of direct supervision, whenever it is applicable to a problem.

**Architecture and Conditioning**

Network architecture selection usually involves two considerations: one is improving the network’s expressiveness without dramatically increasing sample complexity, and the other is reducing the computational complexity of training. The choice of architecture usually affects training time.

The authors experiment with various deep learning solutions for encoding the structure of one-dimensional, continuous, piece-wise linear (PWL) curves. They first consider a convex formulation of the problem with a large condition number, and then improve the condition number through a convolutional architecture. Through theoretical analysis, the authors note that applying a convolutional architecture is crucial for the efficiency of the conditioning, and that the combined use of a better architecture and of conditioning is what yields the dramatic improvement.
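
To see why a large condition number hurts gradient-based training, consider a toy example that is not the paper's PWL problem: plain gradient descent on the quadratic f(w) = 0.5 · Σ λ_i · w_i², whose condition number is max λ / min λ. The function names and the 1/λ_max step size below are assumptions chosen for illustration.

```python
def condition_number(eigenvalues):
    """Ratio of the largest to the smallest curvature of the quadratic."""
    return max(eigenvalues) / min(eigenvalues)

def gd_steps_to_converge(eigenvalues, tol=1e-6, max_steps=1_000_000):
    """Run gradient descent on f(w) = 0.5 * sum(lam_i * w_i**2), starting
    from w = (1, ..., 1), and count the steps until every |w_i| < tol."""
    step = 1.0 / max(eigenvalues)  # standard 1/L step size
    w = [1.0] * len(eigenvalues)
    for t in range(max_steps):
        if max(abs(wi) for wi in w) < tol:
            return t
        # The gradient of f with respect to w_i is lam_i * w_i.
        w = [wi - step * lam * wi for wi, lam in zip(w, eigenvalues)]
    return max_steps

well_conditioned = gd_steps_to_converge([1.0, 1.0])    # condition number 1
ill_conditioned = gd_steps_to_converge([1.0, 100.0])   # condition number 100
```

With condition number 1 the iterate reaches the minimum in a single step, while with condition number 100 the low-curvature coordinate shrinks by only 1% per step and needs over a thousand steps. Conditioning aims to reduce this gap.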

The authors also experiment with a deeper architecture for encoding, built from two networks, *N* and *M*. Each of the two networks has three layers with ReLU activations, except for the output layer of *M*, which has a linear activation. The layer dimensions are 500, 100, 2*k* for *N*, and 100, 100, *n* for *M*.

From Figure 6, the authors conclude that additional expressive power, when it is unnecessary, does not solve inherent optimization problems.

**Flat Activations**

Flatness of the loss surface caused by saturation of the activation functions leads to vanishing gradients and slows down training, which makes optimization difficult. The authors consider a family of activation functions that are piece-wise flat, which amplifies the “vanishing gradient due to saturated activation” problem.

There are mainly four experiments:

In the Non-Flat approximation experiment, the authors try to approximate *u* using a non-flat function. Although the objective is not completely flat, it still suffers from flatness, and training with this objective is much slower. Sometimes it even fails. The authors also show that it is sensitive to the initialization of the bias term.

In the End-to-end experiment, the approach can find a reasonable solution, but as the authors note, it brings the inaccuracies in capturing the non-continuity points to the forefront, and it costs more because it uses a much larger network.

In the multi-class experiment, the authors approach the problem as a general multi-class classification problem, with each value of the image of *u* treated as a separate class. The problem is the inaccuracies at the boundaries between classes. Additionally, ignoring the ordering of the classes results in higher sample complexity.

The “Forward-Only” experiment can be interpreted as replacing the back-propagation message for the activation function with an identity message. This method achieves the best results, for which the authors provide a proof in the appendix.
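
The idea of swapping the backward message for an identity can be sketched on a one-dimensional toy problem. This is an illustrative sketch under stated assumptions, not the authors' experiment: the staircase activation `u` (a floor function), the scalar model, the squared loss, and all names and hyperparameters are hypothetical choices.

```python
import math
import random

def u(z):
    """Piece-wise flat activation (a staircase, assumed for illustration);
    its derivative is 0 almost everywhere."""
    return math.floor(z)

def train(w_true, steps=2000, lr=0.1, forward_only=True, seed=0):
    """Fit a scalar w so that u(w * x) matches u(w_true * x)."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        # The forward pass always uses the real, flat activation.
        err = u(w * x) - u(w_true * x)
        # Backward message through the activation: identity (forward-only)
        # versus the true derivative u'(z), which is 0 almost everywhere.
        backward_msg = 1.0 if forward_only else 0.0
        w -= lr * err * backward_msg * x
    return w

w_forward_only = train(5.0)                   # moves toward w_true = 5
w_vanilla = train(5.0, forward_only=False)    # never leaves the initial 0.0
```

Because `u` is non-decreasing, the forward-only update always points toward `w_true` whenever the prediction is wrong, whereas the exact gradient carries no signal at all.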

**Conclusion:**

This paper presents common failure scenarios in four settings. It also provides theoretical insights that explain their source and how they might be remedied. The only deficiency is that it doesn’t cover very deep networks, although, to be honest, very deep networks are complex enough that the conditions studied in this paper might no longer hold for them.

**Reference:**

[1] Nguyen, Anh, Jason Yosinski, and Jeff Clune. “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

**Author**: *Shixin Gu* | **Editor**: *Zhen Gao*