Learning Rate

\(a_j := a_j - \alpha\frac{\partial}{\partial a_j}J(a)\)
Debugging gradient descent: how to make sure gradient descent is working correctly
The job of gradient descent is to minimize \(J(a)\) as a function of the parameter vector \(a\) (or \(\theta\)). So if we plot \(J(a)\) against the number of iterations, it should decrease after every iteration.
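As an illustration, here is a minimal sketch (not from the notes) of batch gradient descent that records \(J(a)\) at every iteration so it can be plotted afterwards. It assumes the squared-error cost of linear regression and a design matrix `X` with targets `y`:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=400):
    """Batch gradient descent on the (assumed) squared-error cost of linear regression."""
    m, n = X.shape
    a = np.zeros(n)                                   # parameter vector a (theta)
    J_history = []
    for _ in range(num_iters):
        error = X @ a - y                             # h(x) - y for every training example
        J_history.append((error @ error) / (2 * m))   # J(a) at the current parameters
        a = a - alpha * (X.T @ error) / m             # simultaneous update of every a_j
    return a, J_history
```

Plotting `J_history` against the iteration index gives the debugging plot described above: with a well-chosen \(\alpha\) the curve decreases on every iteration and flattens out as gradient descent converges.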

Automatic convergence test: declare convergence if \(J(a)\) decreases by less than \(10^{-3}\) in one iteration. However, in practice it is difficult to choose this threshold, so this test is not recommended.
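A small helper illustrating that test (the function name is illustrative; the default threshold is the \(10^{-3}\) quoted above):

```python
def has_converged(J_history, epsilon=1e-3):
    """Automatic convergence test: True once J(a) drops by less than epsilon in one iteration."""
    return len(J_history) > 1 and J_history[-2] - J_history[-1] < epsilon
```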
How to choose learning rate \(\alpha\):
  • If \(J(a)\) is increasing or oscillating (repeatedly going up and down), use a smaller learning rate \(\alpha\).
  • For sufficiently small \(\alpha\), \(J(a)\) should decrease on every iteration.
  • If \(\alpha\) is too small, gradient descent can be slow to converge.
  • To choose \(\alpha\), try values scaled by roughly 3x each time: ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...; then pick the value just below the largest \(\alpha\) that still works (i.e., for which \(J(a)\) decreases on every iteration), as in the sketch below.
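A hedged sketch of that search, reusing the `gradient_descent` function from the earlier example (`X` and `y` are assumed to already exist; the candidate list mirrors the sequence above):

```python
# Each candidate learning rate is roughly 3x the previous one.
candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]

working = []
for alpha in candidates:
    _, J_hist = gradient_descent(X, y, alpha=alpha, num_iters=100)
    # "Works" = J(a) decreases on every iteration for this alpha
    # (a diverging run fails this check, including when J becomes NaN).
    if all(later < earlier for earlier, later in zip(J_hist, J_hist[1:])):
        working.append(alpha)

# Pick the value just below the largest learning rate that still worked.
best_alpha = working[-2] if len(working) > 1 else (working[-1] if working else None)
```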


Resources:
