The Problem of Overfitting

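Three possible scenarios:
  • Under-fitting: high bias. The model does not fit the training data well, for example because it carries a strong preconception (bias) that the data lie on a straight line.
  • Just right: fits the training data less closely than an over-fit model, but well enough, and it generalizes to new data.
  • Over-fitting: high variance. The model tries too hard to fit every training example and so fails to generalize to new data. This typically happens when there are many features relative to the amount of training data.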
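The three scenarios can be reproduced on a toy problem. The sketch below is only an illustration, not part of the original notes: it assumes a quadratic ground truth, six training points with hand-picked alternating noise, and noise-free held-out points, and it fits polynomials of degree 1, 2 and 5 with a small hypothetical helper `fit_polynomial` (plain least squares via the normal equations).

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (A^T A) c = A^T y."""
    n = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y] for the basis 1, x, x^2, ...
    m = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (m[i][n] - known) / m[i][i]
    return coeffs


def predict(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs))


def mse(coeffs, xs, ys):
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Assumed toy data: y = x^2 plus small alternating "noise" on the training set.
train_x = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]
noise = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
train_y = [x ** 2 + e for x, e in zip(train_x, noise)]

# Held-out points taken from the noise-free curve.
test_x = [-0.8, -0.4, 0.0, 0.4, 0.8]
test_y = [x ** 2 for x in test_x]

train_mse = {}
test_mse = {}
for degree in (1, 2, 5):
    coeffs = fit_polynomial(train_x, train_y, degree)
    train_mse[degree] = mse(coeffs, train_x, train_y)
    test_mse[degree] = mse(coeffs, test_x, test_y)
    print(f"degree {degree}: train MSE {train_mse[degree]:.5f}, "
          f"test MSE {test_mse[degree]:.5f}")
```

On this data the training error keeps falling as the degree grows (the degree-5 curve passes through all six training points), while the error on the held-out points is lowest for degree 2: the straight line under-fits and the degree-5 polynomial over-fits.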

How to fix overfitting:

The options below apply to both linear regression and logistic regression (classification).
Reduce the number of features by
  • Manually selecting which features to keep, which is neither easy nor efficient
  • Using a model selection algorithm, which automatically decides which features to keep and which to throw out (covered later)
Regularization:
  • Keep all the features, but reduce the magnitude of the parameters \(\theta_j\)
  • Works well when there are many features, each of which contributes a little to predicting y
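To make "reduce the magnitude of the parameters" concrete, here is a minimal sketch of linear regression trained by batch gradient descent with an L2 penalty. The penalty form, the name `lam` for its strength, the choice to leave the intercept \(\theta_0\) unpenalized, the learning rate, and the toy data are all assumptions made for illustration; the notes above do not specify them.

```python
def fit_regularized(rows, ys, lam, alpha=0.1, steps=5000):
    """Batch gradient descent for linear regression with an L2 penalty.

    theta[0] is the intercept and is not penalized; lam is the (assumed)
    regularization strength applied to theta[1:], scaled by 1/m.
    """
    m = len(rows)
    n = len(rows[0])
    theta = [0.0] * (n + 1)
    for _ in range(steps):
        # Prediction error h(x) - y for every training example.
        errors = [theta[0] + sum(t * x for t, x in zip(theta[1:], row)) - y
                  for row, y in zip(rows, ys)]
        grad = [sum(errors) / m]
        for j in range(n):
            data_term = sum(e * row[j] for e, row in zip(errors, rows)) / m
            grad.append(data_term + lam / m * theta[j + 1])
        # Simultaneous update of all parameters.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


def squared_size(theta):
    """Sum of squared non-intercept parameters."""
    return sum(t * t for t in theta[1:])


# Assumed toy data: features x and x^2, targets roughly equal to x^2.
xs = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]
rows = [[x, x ** 2] for x in xs]
ys = [1.1, 0.26, 0.14, -0.06, 0.46, 0.9]

theta_plain = fit_regularized(rows, ys, lam=0.0)
theta_penalized = fit_regularized(rows, ys, lam=3.0)

print("no penalty:  ", [round(t, 4) for t in theta_plain])
print("with penalty:", [round(t, 4) for t in theta_penalized])
print("squared size:", squared_size(theta_plain), "vs", squared_size(theta_penalized))
```

Both runs keep every feature; the penalized run simply ends up with smaller \(\theta_j\) values. How large `lam` should be is a separate tuning question that this sketch does not address.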
