Frankly I have always had my doubts about the *general* applicability of *any* algorithms based on the differential calculus to nets with looped connections, because of chaotic dynamics and other reprobate dragons that infest the mathematical highways. Chaos, complexity, and sensitivity to initial conditions are all, I would think, the unavoidable byproducts of iteration.
Most neural net specialists would, it seems, tend to disagree with me.
Please remember that on some days this blog just consists of personal notes on my reading - and I'm not very good at understanding what I read ...
Recurrent networks (RNNs) can store state, and the power of the model stems partly from the fact that it is the learning process itself which determines what the system's state variables come to represent. But we also know that statefulness is the key to powerful computation, and the origin of complex dynamical phenomena.
However, useful as state is, its presence appears at first sight to impede the use of the classical backprop training method. But Werbos and others demonstrated that one can propagate the gradient "backwards through time", and therefore it should be possible to reuse backprop learning to train RNNs.
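To remind myself what "backwards through time" actually means mechanically, here is a minimal sketch of BPTT for a plain tanh RNN with a squared-error loss. The names and shapes are my own invention, a toy and not anyone's production code: the forward pass unrolls the net and keeps every hidden state, and the backward pass is just the ordinary backprop chain rule run over that unrolled graph.

```python
import numpy as np

def bptt(xs, ys, Wxh, Whh, Why, h0):
    """Backprop through time for a vanilla tanh RNN with squared-error loss.

    xs, ys        : lists of input / target column vectors
    Wxh, Whh, Why : input-to-hidden, hidden-to-hidden, hidden-to-output weights
    h0            : initial hidden state (column vector)
    """
    # forward pass: unroll the net over the sequence, keeping every hidden state
    hs, preds, loss = {-1: h0}, {}, 0.0
    for t, (x, y) in enumerate(zip(xs, ys)):
        hs[t] = np.tanh(Wxh @ x + Whh @ hs[t - 1])
        preds[t] = Why @ hs[t]
        loss += 0.5 * np.sum((preds[t] - y) ** 2)

    # backward pass: ordinary backprop applied to the unrolled graph
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dh_next = np.zeros_like(h0)            # gradient arriving from the future
    for t in reversed(range(len(xs))):
        dy = preds[t] - ys[t]
        dWhy += dy @ hs[t].T
        dh = Why.T @ dy + dh_next          # from the output AND from step t+1
        dpre = (1.0 - hs[t] ** 2) * dh     # back through the tanh
        dWxh += dpre @ xs[t].T
        dWhh += dpre @ hs[t - 1].T
        dh_next = Whh.T @ dpre             # hand the gradient one more step back in time
    return loss, dWxh, dWhh, dWhy
```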
The gradients of the RNN are easy to compute via backpropagation through time (Rumelhart et al., 1986; Werbos, 1990), so it may seem that RNNs are easy to train with gradient descent. In reality, the relationship between the parameters and the dynamics of the RNN is highly unstable, which makes gradient descent ineffective. This intuition was formalized by Hochreiter (1991) and Bengio et al. (1994), who proved that the gradient decays (or, less frequently, blows up) exponentially as it is backpropagated through time, and used this result to argue that RNNs cannot learn long-range temporal dependencies when gradient descent is used for training. In addition, the occasional tendency of the backpropagated gradient to exponentially blow up greatly increases the variance of the gradients and makes learning very unstable. As gradient descent was the main algorithm used for training neural networks at the time, these theoretical results and the empirical difficulty of training RNNs led to the near abandonment of RNN research.
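That exponential decay or blow-up is easy to see numerically: on each backward step the gradient gets multiplied by the transpose of the recurrent Jacobian, so its norm scales roughly like the dominant singular value raised to the number of time steps. A little experiment of my own (sizes and scales are arbitrary), ignoring the nonlinearity, which would typically only make the shrinking worse:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 50, 100                        # hidden size and sequence length, chosen arbitrarily

for scale in (0.8, 1.0, 1.2):         # rough spectral radius of the recurrent weights
    W = scale * rng.standard_normal((n, n)) / np.sqrt(n)
    g = rng.standard_normal(n)        # stand-in for the gradient at the final time step
    for _ in range(T):
        g = W.T @ g                   # one linearized step of backprop through time
    print(f"scale {scale}: gradient norm after {T} steps = {np.linalg.norm(g):.3e}")
```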
One way to deal with the inability of gradient descent to learn long-range temporal structure in a standard RNN is to modify the model to include “memory” units that are specially designed to store information over long time periods. This approach is known as “Long Short-Term Memory” (Hochreiter & Schmidhuber, 1997) and has been successfully applied to complex real-world sequence modeling tasks (e.g., Graves & Schmidhuber, 2009). Long Short-Term Memory makes it possible to handle datasets which require long-term memorization and recall, but even on these datasets it is outperformed by a standard RNN trained with the HF optimizer (Martens & Sutskever, 2011).
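For my own notes, here is what those "memory" units look like in the now-standard formulation (which includes the forget gate that was added after the 1997 paper). The point is the additive cell-state update: the gradient can flow along the cell without being squashed at every step. A sketch with invented names, not the authors' code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of an LSTM cell.  W maps the concatenation [h_prev; x] to the
    four stacked gate pre-activations; b is the matching bias vector."""
    z = W @ np.concatenate([h_prev, x]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                 # candidate update

    c = f * c_prev + i * g        # the memory cell: an *additive* update, so the
                                  # gradient along c need not decay at every step
    h = o * np.tanh(c)            # gated exposure of the cell to the rest of the net
    return h, c
```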
Another way to avoid the problems associated with backpropagation through time is the Echo State Network (Jaeger & Haas, 2004), which forgoes learning the recurrent connections altogether and only trains the non-recurrent output weights. This is a much easier learning task and it works surprisingly well, provided the recurrent connections are carefully initialized so that the intrinsic dynamics of the network exhibits a rich reservoir of temporal behaviours that can be selectively coupled to the output.
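Again for my own benefit, the Echo State idea in a dozen lines or so: a big random recurrent "reservoir" whose weights are rescaled (never trained) so its spectral radius sits below one, plus a linear readout fitted by ridge regression on the collected states. Everything here, the sizes, the delay-by-five toy task, the ridge constant, is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_res, T = 1, 200, 1000           # arbitrary sizes for the sketch

# fixed random reservoir: only its global scaling is chosen, never learned
W_in  = rng.uniform(-0.5, 0.5, (n_res, n_in))
W_res = rng.standard_normal((n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # spectral radius ~ 0.9

u = rng.standard_normal((T, n_in))      # toy input sequence
y = np.roll(u[:, 0], 5)                 # toy target: the input delayed by five steps

# drive the reservoir with the input and record its states
x, X = np.zeros(n_res), np.zeros((T, n_res))
for t in range(T):
    x = np.tanh(W_in @ u[t] + W_res @ x)
    X[t] = x

# the only training step: ridge regression from reservoir states to targets
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)
print("training MSE:", np.mean((X @ W_out - y) ** 2))
```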
In fact, in much of my own doctoral work I employed genetic reinforcement to sidestep the problems of gradient-based search. Now comes Hessian-free optimisation, heralded as the breakthrough that allows the power of recurrent networks to be harnessed within a framework quite similar to backprop, or backprop through time.
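As far as I can make out, the core trick of Hessian-free optimisation is exactly what the name says: never build the curvature matrix, only compute curvature-vector products, and solve for the parameter update with conjugate gradient. The sketch below uses a crude finite-difference Hessian-vector product and a fixed damping constant of my own choosing; Martens' actual method uses exact Gauss-Newton products, adaptive damping and a pile of other refinements, so treat this as the bare skeleton only.

```python
import numpy as np

def hvp(grad_fn, theta, v, eps=1e-6):
    """Hessian-vector product by finite differences of the gradient:
    H v ~ (grad(theta + eps*v) - grad(theta)) / eps."""
    return (grad_fn(theta + eps * v) - grad_fn(theta)) / eps

def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    """Solve A x = b using only matrix-vector products A @ v; A is never formed."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def hessian_free_step(grad_fn, theta, damping=1.0):
    """One heavily simplified HF step: solve (H + damping*I) d = -grad with CG."""
    g = grad_fn(theta)
    delta = conjugate_gradient(lambda v: hvp(grad_fn, theta, v) + damping * v, -g)
    return theta + delta
```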
And Sutskever, Martens and Hinton seem to have had big success so far with Hessian-free methods and multiplicative recurrent nets. So in case you, gentle reader, wish to comprehend the maths involved, here is a link to Tartu's Andrew Gibiansky, a gentleman who seems willing to explain ....