NTK: unlocking the mysteries of deep learning
Professor Clément Hongler and his team from the Chair of Statistical Field Theory found a new way of understanding Artificial Neural Networks. Their 2018 paper has gathered a lot of attention from the scientific community and is now frequently used by researchers when they want to understand the training of large neural networks.
Artificial Neural Networks (ANNs) are at the heart of the Deep Learning revolution in Artificial Intelligence, which has seen spectacular progress in the last decade: ANNs have yielded breakthroughs in numerous applications, from computer vision and sound processing to playing board and video games, driving vehicles, translating and generating text, and solving problems in biology and chemistry.
While ANNs have fundamentally transformed the way we process data from the world around us, they remain theoretically complex and their functioning is still somewhat mysterious: developing a mathematical framework to understand ANNs is widely considered a central problem in the field.
Inspired by biological neural networks, an ANN processes input signals by applying a sequence of transformations through several layers of neurons: this is much like the way the visual cortex processes information from the optic nerves. Each connection between neurons has a strength parameter: modern ANNs can have billions of such parameters, which are delicately adjusted to allow them to perform various tasks.
Unlike classical computer programs which are explicitly designed to solve a given task, ANNs are trained to accomplish it: for instance, to train a network to correctly identify a cat in an image, one gives it a dataset of images with labels indicating where cats appear. The way the training is performed is by optimization: a score function measures how good the ANN is at its task, and the parameters of the ANN are progressively adjusted to improve on it. While this optimization procedure gives impressive results in practice, understanding such a procedure over billions of parameters appears to be a daunting task (as the Analysis I & II students know, it can be very hard to optimize a function of even one or two parameters).
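To give a feel for what "training by optimization" means, here is a minimal sketch with a single parameter instead of billions. Everything in it is an illustrative assumption rather than something taken from the paper: the model y = w * x, the made-up data, and the function names score, score_gradient and train. Note that this score measures an error, so lower is better.

```python
# Toy illustration (an assumption, not from the article): fit y = w * x.

def score(w, data):
    """Mean squared error of the one-parameter model y = w * x (lower is better)."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def score_gradient(w, data):
    """Derivative of the score with respect to the parameter w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


def train(data, w=0.0, learning_rate=0.05, steps=200):
    """Repeatedly nudge w in the direction that lowers the score."""
    for _ in range(steps):
        w -= learning_rate * score_gradient(w, data)
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up examples following y = 2 * x
w_trained = train(data)
print(w_trained, score(w_trained, data))
```

With one parameter the procedure is easy to follow by hand; the difficulty described above comes from running the same kind of loop over billions of parameters at once.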
In 2018, Arthur Jacot, Franck Gabriel and Prof. Clément Hongler from the Chair of Statistical Field Theory found a new connection between Neural Networks and another field of Machine Learning, called Kernel Methods. More precisely, they found that an ANN can be described using its so-called Neural Tangent Kernel (NTK). This gives a new mathematical understanding of the optimization of ANNs with many parameters. In particular, this discovery yields insights into two fundamental questions: the convergence and the generalization of ANNs.
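Concretely, the NTK compares two inputs by asking how similarly the network's output at each of them reacts to small changes of the parameters: it is the inner product of the two parameter gradients. The sketch below computes this quantity for a tiny one-hidden-layer network with a tanh activation. The architecture, the 1/sqrt(width) scaling and the names init_params, network, param_gradient and empirical_ntk are assumptions chosen for illustration; they are not the code of the paper or of any library.

```python
import math
import random


def init_params(width, seed=0):
    """Random hidden weights w and output weights a for a scalar-input network."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(width)]
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    return w, a


def network(params, x):
    """f(x) = (1 / sqrt(width)) * sum_i a_i * tanh(w_i * x)."""
    w, a = params
    return sum(ai * math.tanh(wi * x) for wi, ai in zip(w, a)) / math.sqrt(len(w))


def param_gradient(params, x):
    """Gradient of f(x) with respect to all parameters (w first, then a)."""
    w, a = params
    scale = 1.0 / math.sqrt(len(w))
    grad_w = [scale * ai * (1.0 - math.tanh(wi * x) ** 2) * x for wi, ai in zip(w, a)]
    grad_a = [scale * math.tanh(wi * x) for wi in w]
    return grad_w + grad_a


def empirical_ntk(params, x1, x2):
    """Inner product of the parameter gradients at the inputs x1 and x2."""
    g1 = param_gradient(params, x1)
    g2 = param_gradient(params, x2)
    return sum(u * v for u, v in zip(g1, g2))


params = init_params(width=1000)
print(empirical_ntk(params, 0.5, 0.5), empirical_ntk(params, 0.5, -1.0))
```

The kernel value depends on the randomly drawn parameters; the result of the paper concerns what happens to this quantity as the width of the network grows very large.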
The convergence problem is the question of whether the training of ANNs will reach the optimum for their score or get stuck somewhere before. This question had remained unsolved and highly debated until then: some people suggested that training was tantamount to moving in a mountainous landscape with lots of valleys and saddles, while others suggested the landscape was much smoother. Using the NTK, Jacot, Gabriel and Hongler were able to show that for very large ANNs the question becomes surprisingly easier, and that such networks typically reach their global optimum: the smooth landscape model applies here. Moreover, the speed of convergence and the nature of the optimum reached by the ANN can be understood in terms of its NTK.
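The link between the kernel and the speed of convergence can be illustrated with a small simulation. It assumes an idealized setting in which the kernel stays fixed during training and the score is a squared error, so that each step moves the outputs on the training points against the error, weighted by the kernel. The Gaussian kernel below is only a stand-in for an actual NTK, and the data, the step size and the names kernel, step and error are made up for the example.

```python
import math


def kernel(x1, x2):
    """Stand-in kernel (Gaussian); an actual NTK depends on the architecture."""
    return math.exp(-0.5 * (x1 - x2) ** 2)


train_x = [-1.0, 0.0, 1.0]  # made-up training inputs
train_y = [1.0, 0.0, 1.0]   # made-up training targets
gram = [[kernel(a, b) for b in train_x] for a in train_x]


def step(outputs, learning_rate=0.5):
    """One kernel gradient descent step on the outputs at the training points."""
    residual = [f - y for f, y in zip(outputs, train_y)]
    return [
        f - learning_rate * sum(k * r for k, r in zip(row, residual))
        for f, row in zip(outputs, gram)
    ]


def error(outputs):
    """Sum of squared differences between outputs and targets."""
    return sum((f - y) ** 2 for f, y in zip(outputs, train_y))


outputs = [0.0, 0.0, 0.0]  # assume the network outputs zero before training
errors = [error(outputs)]
for _ in range(100):
    outputs = step(outputs)
    errors.append(error(outputs))

print(errors[0], errors[10], errors[100])
```

In this simplified picture there are no valleys to get stuck in: the error shrinks at every step, at a rate governed by the eigenvalues of the kernel matrix on the training points.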
The generalization question is how an ANN will deal with data it has not been trained on. Does it merely learn by heart the examples it has been given, or is it able to generalize and give correct predictions beyond those? Using the NTK, Jacot, Gabriel and Hongler have given a new formula explaining how an ANN makes predictions and extrapolations from the data it has been trained on. Thanks to this formula, the generalization of ANNs can be related to that of Kernel Methods, for which theoretical tools are more readily available.
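To show what a Kernel Method prediction looks like, here is a minimal sketch of kernel regression: the prediction at a new point is a weighted sum of kernel values between that point and the training points, with weights chosen so that the training data are fitted exactly. It is a simplification that assumes the network outputs zero before training and that the score is a squared error. The Gaussian kernel is again a stand-in for an actual NTK, and the data and the names kernel, solve, fit and predict are made up for the example.

```python
import math


def kernel(x1, x2):
    """Stand-in kernel (Gaussian); an actual NTK depends on the architecture."""
    return math.exp(-0.5 * (x1 - x2) ** 2)


def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def fit(train_x, train_y):
    """Weights such that the kernel predictor matches the training targets."""
    gram = [[kernel(a, b) for b in train_x] for a in train_x]
    return solve(gram, train_y)


def predict(x, train_x, weights):
    """Prediction at x: weighted sum of kernel values with the training inputs."""
    return sum(w * kernel(x, xi) for w, xi in zip(weights, train_x))


train_x = [-1.0, 0.0, 1.0]  # made-up training inputs
train_y = [1.0, 0.0, 1.0]   # made-up training targets
weights = fit(train_x, train_y)
print(predict(0.5, train_x, weights))  # prediction at an input not seen in training
```

How the predictor behaves between and beyond the training points is entirely determined by the kernel, which is why knowing the NTK says something about generalization.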
Since 2018, the NTK has gathered a lot of attention from the scientific community. The original article has already garnered more than 500 citations on Google Scholar. Researchers now frequently use the NTK to understand the training of large neural networks. A new software library, called Neural Tangents, has been created by Google to compute the NTK (you can type pip install neural-tangents in your terminal to install it). The NTK has found its way into numerous contexts, from Generative Adversarial Networks (how computers learn to make realistic pictures, for example) to Reinforcement Learning (how computers play games against each other to improve) and Transformers (which deal with long sequences, like texts).
A central question for understanding how Deep Learning works is now the precise nature and evolution of the NTK during training: in particular, medium-sized networks appear to be surprisingly more challenging to understand. Perhaps a new object will need to be introduced in order to understand the NTK?
To conclude, we have decided to ask a Transformer ANN its opinion about the question. Here is what it came up with.
>What do you think of the Neural Tangent Kernel?
>You may be interested in the transcript of our talk at Deep Learning Summit. We’re very pleased to show you our demo on how we’re integrating Bluetooth technology into the Neural Tangent Kernel: Google researchers creating a neural network with auto-encoder and decoder layers, by teaching it to detect objects and specify their size. The neural network itself is trained using the good old tuning parameters (in this case three parameters).
We're not there yet, but maybe one day, ANNs can just tell us how they work!
For more information: https://en.wikipedia.org/wiki/Neural_tangent_kernel