To understand, consider simpler linear unit, where
Let's learn wi's that minimize the squared error
Where D is set of training examples