[Machine Learning] #10 Solving Problem of Overfitting
[Write in Front]:
In the previous Machine Learning topics, we talked about Linear Regression and Logistic Regression, but we did not cover a very important topic: Overfitting.
************************************************************************
************************************************************************
[The Problem of Overfitting]:
What is overfitting? In simple terms, overfitting means the function we learned fits the training set extremely well but does not generalize to new data. Typically, overfitting means the error on the training data is very low while the error on new data is high.
Do not fit the data that precisely, because real data is not that precise, right?
So, the problem is: how can we avoid this?
1) Reduce the number of features:
- Manually select which features to keep.
- Use a model selection algorithm (we will discuss this later).
2) Regularization:
- Keep all the features, but reduce the magnitude of parameters \({\theta _j}\)
- Regularization works well when we have a lot of slightly useful features.
************************************************************************
************************************************************************
[Cost Function]:
Now, we are going to find some ideas to avoid overfitting. As the picture above shows, we cannot fit the training set that precisely, because real data is not that precise. It seems that a quadratic function would be a good choice. So what should we do?
For example, we want to make the following function more quadratic:
$${\theta _0} + {\theta _1}x + {\theta _2}{x^2} + {\theta _3}{x^3} + {\theta _4}{x^4}$$
What should we do? Penalize \({x^3}\) and \({x^4}\): we do not want their terms to be large, which means we have to keep \({\theta _3}\) and \({\theta _4}\) small.
So, we modify our Cost Function as follows:
$$J(\theta ) = {1 \over {2m}}\sum\limits_{i = 1}^m {{{({H_\theta }({x^{(i)}}) - {y^{(i)}})}^2}} + 1000 \cdot \theta _3^2 + 1000 \cdot \theta _4^2$$
How does this function help us?
Ans: it keeps the fitted function from being so "aggressive"; the curve only bends slightly up and down.
But how do we know in advance which parameters should be "punished"? In general, we do not, so instead we shrink all of the parameters by adding a single regularization term, controlled by the regularization parameter \(\lambda\):
$$J(\theta ) = {1 \over {2m}}\left[ {\sum\limits_{i = 1}^m {{{({H_\theta }({x^{(i)}}) - {y^{(i)}})}^2}} + \lambda \sum\limits_{j = 1}^n {\theta _j^2} } \right]$$
After this, we can apply regularization to both linear regression and logistic regression.
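To make the regularized cost concrete, here is a minimal NumPy sketch. The function name `regularized_cost`, the parameter `lam`, and the assumption that `X` carries a leading column of ones are my own illustration, not something from the lecture notes.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized squared-error cost J(theta) (illustrative sketch).

    X : (m, n+1) design matrix whose first column is all ones.
    y : (m,) target vector.
    lam : regularization parameter lambda.
    theta[0] (the bias term) is not penalized.
    """
    m = len(y)
    errors = X @ theta - y                               # H_theta(x^(i)) - y^(i)
    fit_term = (errors @ errors) / (2 * m)               # (1/2m) * sum of squared errors
    reg_term = (lam / (2 * m)) * np.sum(theta[1:] ** 2)  # skip theta_0
    return fit_term + reg_term
```

A large `lam` pushes every \({\theta _j}\) (except \({\theta _0}\)) toward zero, which is exactly the "punishment" described above.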
************************************************************************
************************************************************************
[Regularized Linear Regression]:
Now let's do linear regression first. There are two methods: Gradient Descent and the Normal Equation.
Gradient Descent:
We will modify our gradient descent updates to separate \({\theta _0}\) from the other parameters, because we do not want to penalize \({\theta _0}\). We repeat the following updates until convergence:
$${\theta _0} = {\theta _0} - \alpha {1 \over m}\sum\limits_{i = 1}^m {({H_\theta }({x^i}) - {y^i})x_0^i} $$
$${\theta _j} = {\theta _j} - \alpha \left[ {{1 \over m}\left( {\sum\limits_{i = 1}^m {({H_\theta }({x^i}) - {y^i})x_j^i} } \right) + {\lambda \over m}{\theta _j}} \right]$$
The term \({{\lambda \over m}{\theta _j}}\) performs the regularization. With some algebraic manipulation we get:
$${\theta _j} = {\theta _j}(1 - \alpha {\lambda \over m}) - \alpha {1 \over m}\sum\limits_{i = 1}^m {({H_\theta }({x^i}) - {y^i})x_j^i} $$
Please note that \(1 - \alpha {\lambda \over m}\) is always less than 1 (for example, with \(\alpha = 0.01\), \(\lambda = 1\), and \(m = 100\) it equals \(0.9999\)), so each update shrinks \({\theta _j}\) slightly before performing the usual gradient step.
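Here is a minimal sketch of one such update in NumPy. The function name `gradient_descent_step` and the argument names (`alpha`, `lam`) are illustrative assumptions; as above, `X` is assumed to have a leading column of ones.

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent update for linear regression.

    Implements
        theta_0 := theta_0 - alpha * (1/m) * sum((H(x) - y) * x_0)
        theta_j := theta_j * (1 - alpha*lam/m) - alpha * (1/m) * sum((H(x) - y) * x_j)
    where X has a leading column of ones, so theta_0 is not regularized.
    """
    m = len(y)
    errors = X @ theta - y                            # H_theta(x^(i)) - y^(i), shape (m,)
    grad = (X.T @ errors) / m                         # unregularized gradient, shape (n+1,)
    new_theta = theta - alpha * grad                  # plain gradient step for all parameters
    new_theta[1:] -= alpha * (lam / m) * theta[1:]    # extra shrinkage for theta_j, j >= 1
    return new_theta
```

Calling this repeatedly on the same `theta` until the change is negligible corresponds to "repeat until convergence" above.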
Normal Equation:
Now let's do the Normal Equation. To add the regularization part, the equation is the same as our original one, except that we add another term inside the inverse:
$$\theta = {({X^T}X + \lambda \cdot L)^{ - 1}}{X^T}y$$
where \(L\) is the \((n + 1) \times (n + 1)\) identity matrix with its top-left entry set to 0, so that \({\theta _0}\) is not regularized:
$$L = \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$$
As a bonus, when \(\lambda > 0\) the matrix \({X^T}X + \lambda \cdot L\) is always invertible, even if \({X^T}X\) itself is not.
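A minimal NumPy sketch of this closed-form solution, assuming as before that `X` has a leading column of ones; the function name and `lam` are my own illustration.

```python
import numpy as np

def normal_equation_regularized(X, y, lam):
    """Closed-form regularized solution: theta = (X^T X + lam*L)^(-1) X^T y.

    L is the (n+1)x(n+1) identity matrix with its top-left entry zeroed,
    so theta_0 is not regularized.
    """
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)
    L[0, 0] = 0.0
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```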
************************************************************************
************************************************************************
[Regularized Logistic Regression]:
Recall that the cost function for logistic regression was:
$$J(\theta ) = - {1 \over m}\sum\limits_{i = 1}^m {[{y^i}\log ({H_\theta }({x^i})) + (1 - {y^i})\log (1 - {H_\theta }({x^i}))]} $$
Now, we regularize it by adding a term at the end:
$$J(\theta ) = - {1 \over m}\sum\limits_{i = 1}^m {[{y^i}\log ({H_\theta }({x^i})) + (1 - {y^i})\log (1 - {H_\theta }({x^i}))]} + {\lambda \over {2m}}\sum\limits_{j = 1}^n {\theta _j^2} $$
Gradient Descent:
The update rules have exactly the same form as the regularized linear regression updates above; the only difference is that the hypothesis is now the sigmoid, \({H_\theta }(x) = {1 \over {1 + {e^{ - {\theta ^T}x}}}}\). As before, \({\theta _0}\) is updated without the \({{\lambda \over m}{\theta _j}}\) term.
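A minimal sketch of the regularized logistic cost and its gradient in NumPy; the function names (`sigmoid`, `logistic_cost_and_grad`) and the assumption of a leading column of ones in `X` are illustrative, not from the notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function, 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_and_grad(theta, X, y, lam):
    """Regularized logistic-regression cost and gradient.

    Cost: -(1/m) * sum[y*log(h) + (1-y)*log(1-h)] + (lam/2m) * sum(theta_j^2)
    Gradient: the usual term plus (lam/m) * theta_j for j >= 1.
    """
    m = len(y)
    h = sigmoid(X @ theta)                              # H_theta(x^(i))
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    cost += (lam / (2 * m)) * np.sum(theta[1:] ** 2)    # regularization, skipping theta_0
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]                   # do not regularize theta_0
    return cost, grad
```

This cost/gradient pair can be plugged into a gradient-descent loop like the one sketched for linear regression, or into an off-the-shelf optimizer.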
************************************************************************