Sunday, March 15, 2015

Hessian Free Optimization and friends.

Frankly, I have always had my doubts about the *general* applicability of *any* algorithm based on the differential calculus to nets with looped connections, because of chaotic dynamics and the other reprobate dragons that infest the mathematical highways. Chaos, complexity, sensitivity to initial conditions: all of these, I would think, are the unavoidable byproducts of iteration.

Most neural net specialists would, it seems, tend to disagree with me. 

Please remember that on some days this blog just consists of personal notes on my reading - and I'm not very good at understanding what I read ...

Recurrent networks (RNNs) can store state, and the power of the net model stems partly from the fact that it is the learning process which determines what the system state variables represent. But we also know that statefulness is the key to powerful computation, and the origin of complex dynamical phenomena. 

However, even though state is supremely useful, its presence appears at first analysis to impede the use of the classical backprop training method. But Werbos et al. demonstrated that one can propagate the gradient "backwards through time", and therefore it should be possible to reuse backprop for training RNNs.

The gradients of the RNN are easy to compute via backpropagation through time (Rumelhart et al., 1986; Werbos, 1990), so it may seem that RNNs are easy to train with gradient descent. In reality, the relationship between the parameters and the dynamics of the RNN is highly unstable, which makes gradient descent ineffective. This intuition was formalized by Hochreiter (1991) and Bengio et al. (1994), who proved that the gradient decays (or, less frequently, blows up) exponentially as it is backpropagated through time, and used this result to argue that RNNs cannot learn long-range temporal dependencies when gradient descent is used for training. In addition, the occasional tendency of the backpropagated gradient to blow up exponentially greatly increases the variance of the gradients and makes learning very unstable. As gradient descent was the main algorithm used for training neural networks at the time, these theoretical results and the empirical difficulty of training RNNs led to the near abandonment of RNN research.
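To make that exponential decay concrete, here is a toy sketch of my own (not from the paper): in a scalar linear RNN h_t = w·h_{t-1}, the gradient of h_T with respect to h_0 is exactly w^T, so repeated multiplication through time does all the damage.

```python
# Toy sketch (my own, not from the paper): in a scalar linear RNN
# h_t = w * h_{t-1}, the gradient dh_T/dh_0 is exactly w**T, so it
# shrinks or grows exponentially with the number of time steps T.
def grad_through_time(w, T):
    g = 1.0
    for _ in range(T):
        g *= w  # one factor of w per unrolled time step
    return g

vanishing = grad_through_time(0.9, 50)   # about 0.005: the gradient dies
exploding = grad_through_time(1.1, 50)   # about 117: the gradient blows up
print(vanishing, exploding)
```

In a real nonlinear RNN the factor is the Jacobian of one time step rather than a scalar, but the same product-of-T-factors structure holds, and with it the same vanishing-or-exploding dilemma.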
One way to deal with the inability of gradient descent to learn long-range temporal structure in a standard RNN is to modify the model to include “memory” units that are specially designed to store information over long time periods. This approach is known as “Long Short-Term Memory” (Hochreiter & Schmidhuber, 1997) and has been successfully applied to complex real-world sequence modeling tasks (e.g., Graves & Schmidhuber, 2009). Long Short-Term Memory makes it possible to handle datasets which require long-term memorization and recall, but even on these datasets it is outperformed by a standard RNN trained with the HF optimizer (Martens & Sutskever, 2011).
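As far as I understand it, the memory trick amounts to gating an additive cell state. Here is a minimal numpy sketch of one step of a standard LSTM cell, my own illustration of the idea (the sizes and weights are made up, and this is the modern gated formulation rather than the exact 1997 one):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, params):
    """One step of a standard LSTM cell (illustrative sketch only)."""
    Wi, Wf, Wo, Wg, bi, bf, bo, bg = params
    z = np.concatenate([x, h])
    i = sigmoid(Wi @ z + bi)   # input gate: what to write to memory
    f = sigmoid(Wf @ z + bf)   # forget gate: what to keep in memory
    o = sigmoid(Wo @ z + bo)   # output gate: what to reveal
    g = np.tanh(Wg @ z + bg)   # candidate memory content
    c = f * c + i * g          # additive cell update: gradients can flow
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
nx, nh = 3, 4                  # made-up input and hidden sizes
params = [rng.normal(size=(nh, nx + nh)) for _ in range(4)] + \
         [np.zeros(nh) for _ in range(4)]
h, c = np.zeros(nh), np.zeros(nh)
for t in range(5):
    h, c = lstm_step(rng.normal(size=nx), h, c, params)
print(h.shape, c.shape)
```

The point of the additive update `c = f * c + i * g` is that, when the forget gate stays near one, the cell state (and its gradient) can survive many time steps instead of being squashed at every step.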
Another way to avoid the problems associated with backpropagation through time is the Echo State Network (Jaeger & Haas, 2004), which forgoes learning the recurrent connections altogether and only trains the non-recurrent output weights. This is a much easier learning task and it works surprisingly well provided the recurrent connections are carefully initialized so that the intrinsic dynamics of the network exhibits a rich reservoir of temporal behaviours that can be selectively coupled to the output.
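The reservoir idea is simple enough that even I can sketch it. Below is a minimal Echo State Network of my own devising (sizes, scalings and the sine-prediction task are all made up for illustration): a fixed random recurrent matrix rescaled to spectral radius below one, with only the linear readout trained, by ridge regression, to predict the next sample of the input.

```python
import numpy as np

# Minimal Echo State Network sketch: the recurrent reservoir W is random
# and *never trained*; only the linear readout w_out is fit.
rng = np.random.default_rng(1)
N = 200                                     # reservoir size (arbitrary)
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1
W_in = rng.normal(size=N)

u = np.sin(0.2 * np.arange(600))            # input: a sine wave
target = u[1:]                              # task: predict the next sample

x = np.zeros(N)
states = []
for t in range(599):
    x = np.tanh(W @ x + W_in * u[t])        # reservoir update (fixed weights)
    states.append(x.copy())
X = np.array(states[100:])                  # discard the initial transient
y = target[100:]

# Train only the readout, by ridge regression
w_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(N), X.T @ y)
mse = np.mean((X @ w_out - y) ** 2)
print(mse)
```

The spectral-radius rescaling is what gives the "echo" property: old inputs fade out of the state instead of reverberating forever, so the reservoir holds a fading memory of the recent input history that the linear readout can tap.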

In fact, in much of my own doctoral work I employed genetic reinforcement to sidestep the problems of gradient-based search. Now comes Hessian-free optimisation, which is heralded as the breakthrough that allows the power of recurrent networks to be harnessed within a context quite similar to backprop or backprop through time.

And Sutskever, Martens and Hinton seem to have had big success so far with Hessian-free methods and multiplicative recurrent nets. So in case you, gentle reader, wish to comprehend the maths involved, here is a link to Tartu's Andrew Gibiansky, a gentleman who seems willing to explain ....

Friday, March 13, 2015

Geoffrey Hinton's AMA is a must-read.

The Machine Learning Reddit seems to be where all the smart kids are these days. And here is a link to Geoffrey Hinton's bravura performance when confronted with some of the most painstakingly detailed questioning I've ever witnessed a scientist endure. The man has the knowledge of an encyclopedia and the patience of a saint. This is too much fun to miss! Also, the level of the questions makes one realize that there are a lot of very smart geeks out there, lurking and watching neural net and machine learning tech as we finally start to realize some of the 70's promises of AI.

It's the privilege of a blogger to extract one morsel from a huge text such as Hinton's. Here is a discussion of RNNs and their possible impact on automatic translation. This is a fragment that I "almost" understand. Babelfish, here we come!

[–]twainus 4 points 
Recently in the 'Behind the Mic' video, you said: "IF the computers could understand what we're saying... We need a far more sophisticated language understanding model that understands what the sentence means. And we're still a very long way from having that."
Can you share more about some of the types of language understanding models which offer the most hope? Also, to what extent can "lost in translation" be reduced if those language understanding models were less English-centric in syntactic structure?
Thanks for your insights.
[–]geoffhinton[S] 25 points 
Currently, I think that recurrent neural nets, specifically LSTMs, offer a lot more hope than when I made that comment. Work done by Ilya Sutskever, Oriol Vinyals and Quoc Le that will be reported at NIPS and similar work that has been going on in Yoshua Bengio's lab in Montreal for a while shows that it's possible to translate sentences from one language to another in a surprisingly simple way. You should read their papers for the details, but the basic idea is simple: You feed the sequence of words in an English sentence to the English encoder LSTM. The final hidden state of the encoder is the neural network's representation of the "thought" that the sentence expresses. You then make that thought be the initial state of the decoder LSTM for French. The decoder then outputs a probability distribution over French words that might start the sentence. If you pick from this distribution and make the word you picked be the next input to the decoder, it will then produce a probability distribution for the second word. You keep on picking words and feeding them back in until you pick a full stop.
The process I just described defines a probability distribution across all French strings of words that end in a full stop. The log probability of a French string is just the sum of the log probabilities of the individual picks. To raise the log probability of a particular translation you just have to backpropagate the derivatives of the log probabilities of the individual picks through the combination of encoder and decoder. The amazing thing is that when an encoder and decoder net are trained on a fairly big set of translated pairs (WMT'14), the quality of the translations beats the former state-of-the-art for systems trained with the same amount of data. This whole system took less than a person year to develop at Google (if you ignore the enormous infrastructure it makes use of). Yoshua Bengio's group separately developed a different system that works in a very similar way. Given what happened in 2009 when acoustic models that used deep neural nets matched the state-of-the-art acoustic models that used Gaussian mixtures, I think the writing is clearly on the wall for phrase-based translation.
With more data and more research I'm pretty confident that the encoder-decoder pairs will take over in the next few years. There will be one encoder for each language and one decoder for each language and they will be trained so that all pairings work. One nice aspect of this approach is that it should learn to represent thoughts in a language-independent way and it will be able to translate between pairs of foreign languages without having to go via English. Another nice aspect is that it can take advantage of multiple translations. If a Dutch sentence is translated into Turkish and Polish and 23 other languages, we can backpropagate through all 25 decoders to get gradients for the Dutch encoder. This is like 25-way stereo on the thought. If 25 encoders and one decoder would fit on a chip, maybe it could go in your ear :-)
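The picking-and-feeding-back loop Hinton describes can be sketched in a few lines. Everything below is made up by me (a four-word "French" vocabulary, random weights standing in for a trained LSTM decoder, a plain tanh recurrence instead of an LSTM); the one real point it illustrates is that the log probability of the whole string is just the sum of the log probabilities of the individual picks.

```python
import numpy as np

# Toy sketch of the "keep picking words and feeding them back" loop.
rng = np.random.default_rng(0)
vocab = ["le", "chat", "dort", "."]        # "." plays the role of full stop
V, H = len(vocab), 8

W_hh = rng.normal(scale=0.3, size=(H, H))  # stand-in decoder recurrence
W_emb = rng.normal(scale=0.3, size=(V, H)) # stand-in word embeddings
W_out = rng.normal(scale=0.3, size=(V, H)) # stand-in output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = rng.normal(size=H)     # the "thought" vector from the (omitted) encoder
word, log_prob, sentence = 0, 0.0, []      # word 0 doubles as a start token
for _ in range(50):                        # cap the length for the sketch
    h = np.tanh(W_hh @ h + W_emb[word])    # feed the last pick back in
    p = softmax(W_out @ h)                 # distribution over the next word
    word = rng.choice(V, p=p)              # pick from the distribution
    log_prob += np.log(p[word])            # sum of log-probs of the picks
    sentence.append(vocab[word])
    if vocab[word] == ".":                 # stop when a full stop is picked
        break
print(" ".join(sentence), log_prob)
```

Training then amounts to backpropagating the derivative of that summed log probability through the decoder and on into the encoder, exactly as Hinton says.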


PS. Another clip for my notebook
Hello Dr. Hinton! Thanks for the AMA. I want to ask: what is the most interesting / important paper for the Natural Language Processing field, in your opinion, with a neural network or deep learning element involved? Thank you very much :D
[–]geoffhinton[S] 14 points
The forthcoming NIPS paper by Sutskever, Vinyals and Le (2014) and the papers from Yoshua Bengio's lab on machine translation using recurrent nets.

Thursday, March 12, 2015

Ilya Sutskever reveals net training secrets on Yisong Yue's blog!

A big part of the net is linking - pointing people at something interesting. And so I need to tell you that machine-learning Professor Yisong Yue of Caltech has done something clever: he got Ilya Sutskever to do a guest post about Deep Learning on his blog.

One of Ilya's many claims to fame is training recurrent networks to generate text fragments that resemble Wikipedia,
The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger. In the show’s agreement unanimously resurfaced. The wild pasteured with consistent street forests were incorporated by the 15th century BE. In 1996 the primary rapford undergoes an effort that the reserve conditioning, written into Jewish cities, sleepers to incorporate the .St Eurasia that activates the population. María Nationale, Kelli, Zedlat-Dukastoe, Florendon, Ptu’s thought is. To adapt in most parts of North America, the dynamic fairy Dan please believes, the free speech are much related to the
or machine learning papers, which I guess is the geek equivalent of drinking your own urine to survive. 
Recurrent network with the Stiefel information for logistic regression methods Along with either of the algorithms previously (two or more skewprecision) is more similar to the model with the same average mismatched graph. Though this task is to be studied under the reward transform, such as (c) and (C) from the training set, based on target activities for articles a ? 2(6) and (4.3).

The method relies on Martens and Sutskever's innovative mathematical technique of Hessian-free optimisation, which makes recurrent net training tractable.
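As far as I can make out, the heart of the trick is this: Newton-style steps need the Hessian only through matrix-vector products H·v, and those can be computed without ever forming H. Here is a toy sketch of my own (a small quadratic objective, finite-difference Hessian-vector products, plain conjugate gradient), emphatically not Martens and Sutskever's actual implementation, which uses the Gauss-Newton matrix, damping and many other refinements:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(20, 20))
A = M @ M.T + 20 * np.eye(20)        # a positive-definite stand-in "Hessian"
b = rng.normal(size=20)

def grad(theta):
    # gradient of the toy objective f(theta) = 0.5 theta'A theta - b'theta
    return A @ theta - b

def hess_vec(theta, v, eps=1e-6):
    # H v without ever forming H: a finite difference of the gradient
    return (grad(theta + eps * v) - grad(theta)) / eps

def cg_solve(theta, g, iters=100):
    """Conjugate gradient for H d = -g, touching H only via hess_vec."""
    d = np.zeros_like(g)
    r = -g.copy()                    # residual of H d = -g at d = 0
    p = r.copy()
    for _ in range(iters):
        Hp = hess_vec(theta, p)
        alpha = (r @ r) / (p @ Hp)
        d += alpha * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) < 1e-10:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d

theta = np.zeros(20)
theta += cg_solve(theta, grad(theta))  # one HF step solves the quadratic
residual = np.linalg.norm(grad(theta))
print(residual)
```

On a real net one would get exact Hessian-vector products from Pearlmutter's R-operator instead of finite differences, and substitute the Gauss-Newton matrix to keep the curvature positive definite; the memory cost stays at a few gradient-sized vectors, which is what makes the whole business tractable.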

Now, many of us would like to know what the difference between the old neural net tech and the new deep-neural net tech is, what can reasonably be expected, etc. Well, you will find all of this in Ilya's guest post, together with some interesting comments about SVMs yadda yadda by Yoshua Bengio. But above all you will find invaluable heuristic advice about practical stuff like setting learning rates, initialization, minibatches, dropout, and other details of training methods. In other words, a discussion of cooking methods rather than just a dry recipe.

Stop reading my dumb post and just go there already!