Learning Rate

a_j := a_j - α · ∂J(a)/∂a_j   (update every j simultaneously)
Debugging gradient descent: how to make sure it is working correctly
The job of gradient descent is to minimize J(a) as a function of the parameter vector a (or θ). So if we plot J(a) against the number of iterations, the curve should decrease after every iteration.
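
To make this concrete, here is a minimal sketch (not from the original notes) of batch gradient descent for linear regression that records J(a) at every iteration; the data, cost function, and helper names are assumptions for illustration.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=500):
    """Batch gradient descent for linear regression with squared-error cost
    J(a) = (1 / (2m)) * sum((X @ a - y) ** 2); records J(a) every iteration."""
    m, n = X.shape
    a = np.zeros(n)                                # parameter vector (theta)
    J_history = []
    for _ in range(num_iters):
        error = X @ a - y                          # prediction error on all m examples
        J_history.append(error @ error / (2 * m))  # cost before this update
        a = a - alpha * (X.T @ error) / m          # a_j := a_j - alpha * dJ(a)/da_j for all j
    return a, J_history
```

Plotting J_history against the iteration index should then give a curve that falls after every iteration when α is well chosen.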

Automatic convergence test: declare convergence if J(a) decreases by less than 10⁻³ in one iteration. However, in practice it's difficult to choose this threshold well, so this test is not recommended.
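
As a sketch of what that test might look like in code (the helper name and the handling of an increase are my own additions):

```python
def has_converged(J_history, epsilon=1e-3):
    """Automatic convergence test: True once J(a) decreased by less than
    epsilon in the latest iteration. A negative decrease means J(a) went up,
    which points to a learning-rate problem rather than convergence."""
    if len(J_history) < 2:
        return False
    decrease = J_history[-2] - J_history[-1]
    return 0 <= decrease < epsilon
```
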
How to choose learning rate α:
  • If J(a) is increasing, or repeatedly going up and down, use a smaller learning rate α.
  • For sufficiently small α, J(a) should decrease on every iteration.
  • If α is too small, gradient descent can be slow to converge.
  • To choose α, try values roughly 3× apart: ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ..., then pick the value just below the largest one that still works (see the sketch after this list).
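
One possible way to automate this sweep is sketched below. It reuses the hypothetical gradient_descent helper from above; the candidate list and the "decreases on every iteration" criterion are assumptions, not a prescribed recipe.

```python
import numpy as np

def pick_alpha(X, y, candidates=(0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0),
               num_iters=100):
    """Run a short trial per candidate alpha (spaced roughly 3x apart) and
    return the value just below the largest one for which J(a) fell on
    every iteration."""
    working = []
    for alpha in candidates:
        _, J = gradient_descent(X, y, alpha=alpha, num_iters=num_iters)
        decreasing = all(J[i + 1] <= J[i] for i in range(len(J) - 1))
        if decreasing and np.isfinite(J[-1]):
            working.append(alpha)
    if not working:
        return None            # nothing converged: try candidates below 0.001
    return working[-2] if len(working) > 1 else working[-1]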

