# Hodas "Does the Impossible" at NCPW15

As deep neural networks grow in size, from thousands to millions to billions of weights, the performance of those networks becomes limited by our ability to train them accurately. A naive question commonly arises: if we have a system with billions of degrees of freedom, don't we also need billions of samples to train it? Of course, the success of deep learning indicates that reliable models can be learned with reasonable amounts of data. Similar questions arise in protein folding, spin glasses, and biological neural networks. With effectively infinite potential folding/spin/wiring configurations, how does the system find the precise arrangement that leads to useful and robust results? Simply sampling possible configurations until an optimal one is reached is not viable, even over the age of the universe. Rather, there appears to be a mechanism in the above phenomena that forces them to reach configurations that live on a low-dimensional manifold, avoiding the curse of dimensionality. In the context of deep neural nets, the restriction to a low-dimensional manifold is facilitated by the contractive properties of popular activation functions and regularization techniques. But this is not enough to explain why deep neural nets work well and, more importantly, how to train them efficiently. History has shown that, until very recently, adding excess depth impeded effective training, regardless of the number of training epochs.

We will show that deep nets work precisely because they learn features of the data gradually, i.e., in succession from simple to more complicated ones. It is known that convolutional neural nets learn features of increasing semantic complexity at each layer; more precisely, the net finds the correct low-dimensional manifold on which to build the representation of the desired function of the data. The features of the early layers constrain the space of possible features in the deeper layers.
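The contractive-activation point can be made concrete with a small numpy sketch (our illustration, not part of the abstract; the layer sizes and the spectral-norm scaling are arbitrary choices): because tanh is 1-Lipschitz, composing it with a linear map of spectral norm below one yields layers that strictly shrink distances, squeezing a point cloud toward a low-dimensional set.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_pairwise_dist(X):
    """Average Euclidean distance between all pairs of points in X."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return float(d.mean())

X0 = rng.standard_normal((100, 50))   # a random point cloud in 50 dimensions
W = rng.standard_normal((50, 50))
W /= 1.1 * np.linalg.norm(W, 2)       # spectral norm < 1: the linear map contracts

X = X0
for _ in range(10):
    X = np.tanh(X @ W)                # tanh is 1-Lipschitz: it never expands distances

before, after = mean_pairwise_dist(X0), mean_pairwise_dist(X)
print(f"mean pairwise distance: {before:.3f} -> {after:.3f}")
```

After ten such layers the cloud's mean pairwise distance has collapsed, a toy version of the contraction onto a low-dimensional manifold described above.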
The need for gradual feature learning suggests, in mathematical terms, that the successive layers of a deep net should be highly correlated, and that highly nonlinear activation functions that destroy this correlation will impede training of large networks. We show how this concept connects to a number of emerging training techniques, such as batch normalization and resnets (it is also related to the recently noted connection between the Variational Renormalization Group and Restricted Boltzmann Machines). We compare the layer-by-layer feature learning of nets in which correlation between layers is enforced with that of nets without it. Lastly, we discuss how these ideas form promising design principles for efficiently training high-complexity neural nets.
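As a toy illustration of the inter-layer-correlation idea (our sketch, not the authors' experiment; the depth, width, and weight scale are arbitrary assumptions), one can compare the Pearson correlation between successive layer activations in a random plain tanh net and in one with resnet-style identity shortcuts. The shortcut keeps each layer's activations highly correlated with the previous layer's, whereas a plain random layer largely destroys that correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_layer_corr(residual, depth=20, width=100, n=500):
    """Mean Pearson correlation between successive layers' activations."""
    x = rng.standard_normal((n, width))
    corrs = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        h = np.tanh(x @ W)
        y = x + h if residual else h   # identity shortcut, as in resnets
        corrs.append(np.corrcoef(x.ravel(), y.ravel())[0, 1])
        x = y
    return float(np.mean(corrs))

plain_corr = mean_layer_corr(residual=False)
res_corr = mean_layer_corr(residual=True)
print(f"plain: {plain_corr:.2f}  residual: {res_corr:.2f}")
```

The residual net's successive layers stay strongly correlated while the plain net's are nearly uncorrelated, consistent with the abstract's suggestion that enforcing correlation between layers eases training.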

Hodas NO, P Stinis, and NA Baker. 2016. "Doing the impossible: Why neural nets can be trained at all." Abstract submitted to the 15th Neural Computation and Psychology Workshop (NCPW15), Philadelphia, PA. PNNL-SA-117943.