The weights are adjusted according to:
![]()
(This differs from the formula in the text. The text's formula is a workable variant and is, in fact, what the following derivation yields when the activation function is the identity function.)
If the activation function is a continuous, differentiable function (such as the sigmoid), the training rule essentially implements gradient descent on the squared error.
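
The gradient-descent claim can be checked numerically. The sketch below is illustrative only and rests on several assumptions that do not come from these notes: a single sigmoid unit, the error defined as E = ½(t − o)², and hypothetical names (`eta` for the learning rate, `t` for the target, `o` for the output, `x` for the inputs). Under those assumptions, one step of the update w_i ← w_i + η (t − o) g′(in) x_i should match one step of w ← w − η ∇E, with the gradient estimated by finite differences.

```python
import math


def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def squared_error(w, x, t):
    """E = 1/2 * (t - o)^2 for a single sigmoid unit (assumed error definition)."""
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return 0.5 * (t - o) ** 2


def delta_rule_step(w, x, t, eta):
    """One update w_i <- w_i + eta * (t - o) * g'(in) * x_i."""
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    g_prime = o * (1.0 - o)  # derivative of the sigmoid, written in terms of its output
    return [wi + eta * (t - o) * g_prime * xi for wi, xi in zip(w, x)]


def numerical_gradient(w, x, t, h=1e-6):
    """Central-difference estimate of dE/dw_i, used only as an independent check."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((squared_error(w_plus, x, t) - squared_error(w_minus, x, t)) / (2 * h))
    return grad


# Hypothetical example values, chosen only for illustration.
w = [0.1, -0.2, 0.05]
x = [1.0, 0.5, -1.5]
t = 1.0
eta = 0.5

w_rule = delta_rule_step(w, x, t, eta)
w_gd = [wi - eta * gi for wi, gi in zip(w, numerical_gradient(w, x, t))]

print("training rule:    ", w_rule)
print("gradient descent: ", w_gd)
```

The two printed weight vectors should agree up to finite-difference error. Replacing `g_prime` with 1 (an identity activation) reduces the update to the simpler form without the derivative factor.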