More on Backpropagation

Gradient descent over entire network weight vector

Easily generalized to arbitrary directed graphs

Will find a local, not necessarily global error minimum

Often includes a weight momentum term $\alpha$

$$\Delta w_{i,j}(n) = \eta \delta_{j} x_{i,j} + \alpha \Delta w_{i,j}(n-1)$$
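
The momentum rule above can be sketched for a single weight in a few lines of Python. This is a minimal illustration, not a full training loop: the function names, the default values $\eta = 0.1$ and $\alpha = 0.9$, and the constant gradient term $\delta_j x_{i,j} = 1$ are all assumptions chosen for the demonstration.

```python
def momentum_update(delta_j, x_ij, prev_dw, eta=0.1, alpha=0.9):
    """Return Delta w(n) = eta * delta_j * x_ij + alpha * Delta w(n-1)."""
    return eta * delta_j * x_ij + alpha * prev_dw


def run_constant_gradient(steps, delta_j=1.0, x_ij=1.0, eta=0.1, alpha=0.9):
    """Apply the update repeatedly while the gradient term stays constant.

    Hypothetical setup: a real network would recompute delta_j and x_ij
    from the training examples at every iteration.
    """
    w, dw = 0.0, 0.0
    history = []
    for _ in range(steps):
        dw = momentum_update(delta_j, x_ij, dw, eta, alpha)
        w += dw
        history.append(dw)
    return w, history


if __name__ == "__main__":
    _, history = run_constant_gradient(200)
    # The first step equals the plain gradient step; later steps grow
    # toward eta * delta_j * x_ij / (1 - alpha).
    print(history[0], history[1], history[-1])
```

When the gradient term keeps the same sign, the step size grows toward $\eta \delta_j x_{i,j} / (1 - \alpha)$, so with $\alpha = 0.9$ the effective step is up to ten times the plain gradient step. This is why momentum speeds progress along flat or consistently sloped regions of the error surface.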

Minimizes error over training examples

Training can take thousands of iterations $\rightarrow$ slow!

Using network after training is very fast
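
To see why using a trained network is fast, note that a prediction needs only one forward pass: one weighted sum and one squashing function per unit. The sketch below assumes sigmoid units and fully connected layers, and its weights are hypothetical placeholders rather than trained values.

```python
import math


def sigmoid(z):
    """Logistic squashing function (assumed unit type)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Compute the outputs of one fully connected layer of sigmoid units."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def predict(inputs, layers):
    """Run one forward pass; `layers` is a list of (weights, biases) pairs."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


if __name__ == "__main__":
    # Hypothetical 2-2-1 network with made-up weights.
    layers = [
        ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.1]),
        ([[1.0, -1.0]], [0.0]),
    ]
    print(predict([1.0, 2.0], layers))
```

The cost of a prediction is proportional to the number of weights, with no iteration over training examples, whereas training repeats a forward and a backward pass over the data for many iterations.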