Linear Regression
Linear regression was the first algorithm I met when I started to learn machine learning, and I had already come across it in statistics before that. In my memory, linear regression is very simple, and the most important ingredient is Least Mean Square (hereinafter LMS). But why use LMS to fit a linear regression model? That is what we will explore here.
Construct the Problem
Given samples \(X \in R^{m \times n}\) (we use column vectors to represent sample data, so each column is one sample) and targets \(Y \in R^{1 \times n}\), suppose \(x\) and \(y\) have a linear relationship. We need to find the model's parameters \(W \in R^{m \times 1}\). In addition, we include a \(bias\) term, so the model is: \[ Y = W^{T} X + bias \] For simplicity, we can extend \(X\) to \(R^{(m+1) \times n}\) by appending a row of all 1s, and extend \(W\) to \(R^{(m+1) \times 1}\), so the \(bias\) is merged into \(W\). Now the model becomes: \[ Y = W^{T} X \] And the problem is to compute the unknown item \(W\).
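To make the construction concrete, here is a minimal NumPy sketch (the toy data, random seed, and variable names such as `X_ext` and `true_W` are my own choices for illustration, not part of the derivation):

```python
import numpy as np

# Toy data: m = 2 features, n = 5 samples; each column of X is one sample.
m, n = 2, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))                    # X in R^{m x n}
true_W = np.array([[2.0], [-1.0], [0.5]])      # last entry plays the role of the bias
Y = true_W[:m].T @ X + true_W[m]               # Y in R^{1 x n}

# Extend X with a row of ones so the bias is merged into W.
X_ext = np.vstack([X, np.ones((1, n))])        # X_ext in R^{(m+1) x n}
W = np.zeros((m + 1, 1))                       # W in R^{(m+1) x 1}, to be learned

# The model is now simply Y' = W^T X_ext.
Y_pred = W.T @ X_ext                           # shape (1, n)
```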
Theory Analysis
Above we have constructed the problem. Since there are usually more samples than parameters, in general no \(W\) fits all the samples exactly. But we can use optimization to approximate a solution: \[ Y^{\prime} = W^{T} X \] where \[ Y^{\prime} \approx Y \] This is where LMS comes in: define \(L = \frac{1}{2} (Y^{\prime} - Y)^2\) (summed over the samples), and minimizing \(L\) makes \(Y^{\prime}\) approximate \(Y\). The optimization problem is \(\min(L)\).
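As a small sketch, this objective can be written as a one-line NumPy function (assuming the column-per-sample shapes from the snippet above; `least_squares_loss` is just an illustrative name):

```python
import numpy as np

def least_squares_loss(W, X_ext, Y):
    """L = 1/2 * sum over samples of (y' - y)^2, where Y' = W^T X_ext."""
    residual = W.T @ X_ext - Y          # shape (1, n)
    return 0.5 * np.sum(residual ** 2)
```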
Statistical Explanation
From a statistical point of view, suppose \(y = y^{\prime} + \epsilon\), where \(\epsilon\) is the error. In statistics, the error is commonly modeled as Gaussian noise with zero mean: \(\epsilon \sim N(0,\delta^2)\). So, \[ p(\epsilon|W,x) = \frac{1}{\sqrt{2\pi}\delta} \exp\left(-\frac{\epsilon^2}{2\delta^2}\right) = \frac{1}{\sqrt{2\pi}\delta} \exp\left(-\frac{(y-W^Tx)^2}{2\delta^2}\right) \] For point \(i\), \(p(\epsilon^{(i)}|W,x^{(i)}) = p(y^{(i)}|W,x^{(i)})\), so the likelihood function is: \[ L(W) = \prod_{i=1}^{n} p(y^{(i)} | W, x^{(i)}) \]
\[ \log{L(W)} = \sum_{i=1}^{n} \log p(y^{(i)} | x^{(i)},W) = -n\log({\sqrt{2\pi}\delta}) - \sum_{i=1}^{n} \frac{(y^{(i)} - W^Tx^{(i)})^2}{2\delta^2} \]
According to maximum likelihood estimation, the objective is \[ obj = \max(\log L(W)) \] which, since the first term and the factor \(\frac{1}{2\delta^2}\) do not depend on \(W\), is equivalent to \[ obj = \min\left(\sum_{i=1}^{n} (y^{(i)} - W^Tx^{(i)})^2\right) \] This is exactly the LMS objective.
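As a quick numerical sanity check of this equivalence (a sketch only; the noise level `delta` and the function name are my own choices), the Gaussian log-likelihood differs from the sum of squared errors only by a constant and a negative scale factor, so both criteria rank candidate parameters identically:

```python
import numpy as np

def log_likelihood(W, X_ext, Y, delta=1.0):
    """Gaussian log-likelihood of the residuals, with standard deviation delta."""
    n = Y.shape[1]
    sse = np.sum((Y - W.T @ X_ext) ** 2)
    return -n * np.log(np.sqrt(2 * np.pi) * delta) - sse / (2 * delta ** 2)

# For any two candidate parameter vectors W1 and W2:
#   log_likelihood(W1, ...) > log_likelihood(W2, ...)  iff  SSE(W1) < SSE(W2),
# so maximizing the likelihood picks the same W as minimizing the squared error.
```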
Solve the Optimization Problem
In optimization, gradient descent is the most commonly used algorithm. It is like a person looking for the fastest way down a hillside: at each spot, the person checks every direction around them, picks the one that descends most steeply at the current location, takes a step, and repeats until reaching the destination.
In math, the gradient is the direction of fastest increase, so the negative gradient is the direction of fastest decrease. At each step, we compute the gradient of the objective function, then update the unknown parameters as: \[ W = W - \eta \cdot \nabla_W(obj) \] In detail, for the j-th parameter \(W_j\) of \(W\) and a single sample \(i\), the update rule is: \[ W_j = W_j - \eta \cdot \frac{\partial(obj)}{\partial W_j} \]
\[ W_j = W_j - \eta (W^Tx^{(i)} - y^{(i)})x^{(i)}_j \]
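In code, this per-sample update is a one-liner; the sketch below assumes the extended column-vector shapes from the earlier snippet and a hand-picked learning rate `eta`:

```python
import numpy as np

eta = 0.05  # learning rate (an assumed value, chosen by hand)

def sgd_step(W, x_i, y_i, eta):
    """One update of W from a single sample x_i (shape (m+1, 1)) and target y_i (scalar)."""
    error = (W.T @ x_i).item() - y_i      # W^T x^(i) - y^(i)
    return W - eta * error * x_i          # W_j -= eta * error * x_j^(i), for every j at once
```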
For all the sample data (batch gradient descent), the update sums the gradients over every sample: \[ W_j = W_j - \eta \sum_{i=1}^{n}(W^Tx^{(i)} - y^{(i)})x^{(i)}_j \]
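Putting everything together, here is a minimal batch gradient descent loop under the same assumptions (the learning rate, iteration count, and function name are arbitrary choices for the sketch, not prescribed by the derivation):

```python
import numpy as np

def batch_gradient_descent(X_ext, Y, eta=0.05, n_iters=2000):
    """Fit W by repeatedly stepping along the negative of the summed gradient."""
    W = np.zeros((X_ext.shape[0], 1))
    for _ in range(n_iters):
        residual = W.T @ X_ext - Y        # shape (1, n): W^T x^(i) - y^(i) for every sample
        grad = X_ext @ residual.T         # shape (m+1, 1): sum_i (W^T x^(i) - y^(i)) x^(i)
        W = W - eta * grad
    return W

# Usage with X_ext and Y from the first snippet:
#   W_fit = batch_gradient_descent(X_ext, Y)
#   W_fit.ravel() should then be close to the true parameters [2.0, -1.0, 0.5].
```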