A neural network can be seen as stacking many logistic-regression-like units, both within a layer and across additional layers, with the outputs of the previous layer used as the inputs to the current layer.
Single hidden layer neural network (a 2-layer neural network, since the input layer is not counted):
- Input layer (\(a^{[0]}\))
- Hidden layer: its values are not observed in the training set (\(a^{[1]}\))
- Output layer (\(a^{[2]}\))
Vertically, the entries of \(a^{[l]}\) correspond to the nodes (units) in that layer.
Forward pass:
Non-vectorized implementation:
\(z^{[1](1)} = W^{[1]} x^{(1)} + b^{[1]}\), \(z^{[1](2)} = W^{[1]} x^{(2)} + b^{[1]}\), \(z^{[1](3)} = W^{[1]} x^{(3)} + b^{[1]}\)
\(z^{[2](1)} = W^{[2]} a^{[1](1)} + b^{[2]}\), \(z^{[2](2)} = W^{[2]} a^{[1](2)} + b^{[2]}\), \(z^{[2](3)} = W^{[2]} a^{[1](3)} + b^{[2]}\), where \(a^{[1](i)} = g^{[1]}(z^{[1](i)})\)
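A rough, non-vectorized sketch of the equations above, assuming numpy (the sizes and variable names here are illustrative, not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: n_x input features, 4 hidden units, 1 output unit, m examples.
n_x, n_h, n_y, m = 3, 4, 1, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((n_x, m))      # columns are the training examples x^{(i)}
W1, b1 = rng.standard_normal((n_h, n_x)), np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((n_y, n_h)), np.zeros((n_y, 1))

# Non-vectorized forward pass: loop over the m examples one column at a time.
outputs = []
for i in range(m):
    x_i = X[:, i:i + 1]                # shape (n_x, 1)
    z1 = W1 @ x_i + b1                 # z^{[1](i)}
    a1 = sigmoid(z1)                   # a^{[1](i)}
    z2 = W2 @ a1 + b2                  # z^{[2](i)} takes a^{[1](i)} as input
    a2 = sigmoid(z2)                   # a^{[2](i)}
    outputs.append(a2)
```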
Vectorized implementation by stacking all training examples horizontally:
\(Z^{[1]} = W^{[1]} X + b^{[1]}\), \(Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}\), where \(X = A^{[0]} \in R^{n_x \times m}\) and column \(i\) of \(Z^{[l]}\) (and of \(A^{[l]} = g^{[l]}(Z^{[l]})\)) corresponds to training example \(i\)
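A minimal vectorized sketch of the same computation, assuming numpy (names and sizes are illustrative); the bias vectors broadcast across the m columns:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, n_h, n_y, m = 3, 4, 1, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((n_x, m))      # X in R^{n_x x m}: examples stacked horizontally
W1, b1 = rng.standard_normal((n_h, n_x)), np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((n_y, n_h)), np.zeros((n_y, 1))

# Vectorized forward pass: one matrix product per layer; b broadcasts over the m columns.
Z1 = W1 @ X + b1        # shape (n_h, m); column i is z^{[1](i)}
A1 = sigmoid(Z1)        # shape (n_h, m); column i is a^{[1](i)}
Z2 = W2 @ A1 + b2       # shape (n_y, m); the input is A1, not X
A2 = sigmoid(Z2)        # column i is the prediction for example i
```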
Activation functions (non-linear):
So far we have been using the sigmoid as the activation function, which is usually not the best choice.
- sigmoid
- \(g'(z) = g(z) \times ( 1 - g(z)) = a(1-a)\)
- tanh(z): hyperbolic tangent function (a shifted and rescaled sigmoid) with output \(a \in (-1, 1)\)
- \(g'(z) = 1 - g(z)^2 = 1 - a^2\)
- ReLU (rectified linear unit): \(a = \max(0, z)\) (commonly used)
- \(g'(z) = 0\) if \(z < 0\), \(1\) if \(z \geq 0\)
- Leaky ReLU: \(a = \max(0.01z, z)\)
- \(g'(z) = 0.01\) if \(z < 0\), \(1\) if \(z \geq 0\)
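A small sketch of the activations above and their derivatives, assuming numpy (the function names are my own):

```python
import numpy as np

# Derivatives are written in terms of the activation value a = g(z) where that
# is convenient, matching the formulas in the list above.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(a):
    return a * (1.0 - a)            # g'(z) = a(1 - a)

def tanh_grad(a):
    return 1.0 - a ** 2             # g'(z) = 1 - a^2 (use np.tanh for g itself)

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z >= 0).astype(float)   # 0 for z < 0, 1 for z >= 0

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

def leaky_relu_grad(z):
    return np.where(z < 0, 0.01, 1.0)
```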
Exception: for the output layer, if the output is binary (0 or 1), using sigmoid as the activation function makes sense.
If we only use linear activation functions, the network just computes the output as a linear function of the inputs, no matter how many hidden layers it has.
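As a quick algebraic check of this point, composing two purely linear layers collapses into a single linear map:
\(a^{[2]} = W^{[2]} \big( W^{[1]} x + b^{[1]} \big) + b^{[2]} = \big( W^{[2]} W^{[1]} \big) x + \big( W^{[2]} b^{[1]} + b^{[2]} \big) = W' x + b'\)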
How to determine the dimensions of W and b: in general \(W^{[l]} \in R^{n^{[l]} \times n^{[l-1]}}\) and \(b^{[l]} \in R^{n^{[l]} \times 1}\), where \(n^{[l]}\) is the number of units in layer \(l\).
Suppose \(X \in R^{3 \times m}\), the hidden layer has 4 units, and the output layer has 2 units,
then \( W^{[1]} \in R^{4 \times 3}, b^{[1]} \in R^{4 \times 1}, W^{[2]} \in R^{2 \times 4}, b^{[2]} \in R^{2 \times 1} \)
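A minimal initialization sketch that produces these shapes, assuming numpy (the helper name `init_params` and the 0.01 scaling are illustrative choices):

```python
import numpy as np

def init_params(n_x, n_h, n_y, seed=0):
    """Initialize parameters for a single-hidden-layer network.

    W^{[l]} has shape (units in layer l, units in layer l-1);
    b^{[l]} has shape (units in layer l, 1).
    """
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((n_h, n_x)) * 0.01   # small random values
    b1 = np.zeros((n_h, 1))
    W2 = rng.standard_normal((n_y, n_h)) * 0.01
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2

# The example from the notes: 3 input features, 4 hidden units, 2 output units.
W1, b1, W2, b2 = init_params(n_x=3, n_h=4, n_y=2)
assert W1.shape == (4, 3) and b1.shape == (4, 1)
assert W2.shape == (2, 4) and b2.shape == (2, 1)
```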
Backward pass: Gradient descent
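A minimal sketch of one gradient-descent step for this 2-layer network, assuming numpy, a tanh hidden layer, a sigmoid output, and the binary cross-entropy loss (so \(dZ^{[2]} = A^{[2]} - Y\)); the function name and learning rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(X, Y, W1, b1, W2, b2, lr=0.01):
    """One gradient-descent step: forward pass, backward pass, parameter update."""
    m = X.shape[1]

    # Forward pass
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)

    # Backward pass (gradients of the cross-entropy cost, averaged over m examples)
    dZ2 = A2 - Y                                   # (n_y, m)
    dW2 = (dZ2 @ A1.T) / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)           # tanh'(z) = 1 - a^2
    dW1 = (dZ1 @ X.T) / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m

    # Gradient-descent update
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2
    return W1, b1, W2, b2

# Usage: repeat W1, b1, W2, b2 = train_step(X, Y, W1, b1, W2, b2) for many iterations.
```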