Logistic Regression for Binary Classification

\(\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}\) where \(z = w^T x + b\), with \(w \in \mathbb{R}^{n_x}\) and \(b \in \mathbb{R}\)
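A minimal NumPy sketch of this forward pass (the weight and feature values below are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    """Single-example forward pass: y_hat = sigma(w^T x + b)."""
    z = np.dot(w, x) + b      # w and x are 1-D arrays of length n_x
    return sigmoid(z)

# toy example with n_x = 3 features
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 0.5, -0.2])
print(predict(w, b, x))       # a number in (0, 1), read as P(y = 1 | x)
```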

Goal: given a set of m training examples, find w and b so that \(\hat{y}^{(i)} \approx y^{(i)}\), i.e. the estimate is close to the ground truth. In other words, minimize the cost function defined below.

Loss (error) function, defined on a single training example: \(L(\hat{y}, y) = -\left(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\right)\)
Note: squared error does not work well with logistic regression because it makes the optimization problem non-convex, so gradient descent can get stuck in local optima.
Intuition: if \(\hat{y}\) is close to \(y\) (the labels are either 0 or 1), L is close to 0, which is what we want: when \(y = 1\), \(L = -\log(\hat{y})\), so minimizing L pushes \(\hat{y}\) toward 1; when \(y = 0\), \(L = -\log(1 - \hat{y})\), so minimizing L pushes \(\hat{y}\) toward 0.
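A quick numeric check (taking log to be the natural log): with \(y = 1\), \(\hat{y} = 0.99\) gives \(L = -\log(0.99) \approx 0.01\), while \(\hat{y} = 0.01\) gives \(L = -\log(0.01) \approx 4.6\), so confident wrong predictions are penalized heavily.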

Cost function (averaged over the entire training set): \(J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})\)
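A sketch of J(w, b) in NumPy, vectorized over all m examples; the convention that X has shape \((n_x, m)\) with one example per column and Y has shape \((1, m)\) is an assumption made for this example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    """J(w, b): average cross-entropy loss over the m training examples.
    X: (n_x, m), one training example per column (layout assumed here)
    Y: (1, m), labels in {0, 1}
    w: (n_x, 1), b: scalar
    """
    m = X.shape[1]
    Y_hat = sigmoid(np.dot(w.T, X) + b)                      # (1, m) predictions
    L = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))   # per-example loss
    return np.sum(L) / m
```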

Gradient Descent: find w and b that minimize J(w, b). Algorithm:
Repeat the following updates until convergence: {
 \( \quad w := w - \alpha \frac{\partial J(w, b)}{\partial w} \)
 \( \quad b := b - \alpha \frac{\partial J(w, b)}{\partial b} \)
}
where \(\alpha\) is the learning rate, which controls how big a step we take in each iteration
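A NumPy sketch of this loop for logistic regression, using the same \((n_x, m)\) data layout as above; the expressions for dw and db are the standard closed-form gradients of the cross-entropy cost, stated here without derivation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, Y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for logistic regression.
    X: (n_x, m), one example per column (layout assumed here)
    Y: (1, m), labels in {0, 1}
    alpha: learning rate
    """
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iters):
        Y_hat = sigmoid(np.dot(w.T, X) + b)   # predictions, shape (1, m)
        dZ = Y_hat - Y                        # shape (1, m)
        dw = np.dot(X, dZ.T) / m              # dJ/dw, shape (n_x, 1)
        db = np.sum(dZ) / m                   # dJ/db, scalar
        w = w - alpha * dw                    # w := w - alpha * dJ/dw
        b = b - alpha * db                    # b := b - alpha * dJ/db
    return w, b
```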

Derivatives Intuition:
\(\text{slope} = \frac{\text{height}}{\text{width}}\)
Example: \(f(a) = 3a\)
Let \(a = 2\); then \(f(2) = 6\). If we nudge \(a\) a little bit to 2.001, then \(f(2.001) = 6.003\),
so the slope (derivative) of \(f(a)\) at \(a = 2\) is \(\frac{df(a)}{da} = \frac{\Delta f(a)}{\Delta a} = \frac{0.003}{0.001} = 3\)
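The same nudge can be checked numerically with a small finite-difference computation (a sketch; the step size 0.001 matches the example above):

```python
def f(a):
    return 3 * a

a, eps = 2.0, 0.001
slope = (f(a + eps) - f(a)) / eps   # (6.003 - 6) / 0.001
print(slope)                        # approximately 3.0, matching df/da = 3
```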
