Linear regression (or Linear Models for Regression) is probably the simplest model in machine learning. Yet it can still be powerful enough for some industrial-level applications. And it usually provides good intuition for models whose nonlinearity is not very strong.
2D line fitting
Line fitting is its simplest form. Given a training data set[^1] $\left\{x^{(i)}, y^{(i)}\right\}_{i=1}^N$, where $x^{(i)}$ and $y^{(i)}$ are all real numbers, line fitting requires us to build a model $$ y \approx w_0 + w_1 x = \begin{bmatrix} 1 & x \end{bmatrix} \begin{bmatrix} w_0\\ w_1 \end{bmatrix} \equiv \boldsymbol{x}^\top \boldsymbol{w}, $$ where we need to find the unknown parameters $\boldsymbol{w} = \begin{bmatrix} w_0 & w_1 \end{bmatrix}^\top$. To introduce some nomenclature:
| Symbol | Common names |
| --- | --- |
| $\boldsymbol{x}$ | input, feature, independent variables (IV), factors, treatment variables, predictors, … |
| $\boldsymbol{y}$ | output, target, outcome, response, dependent variables, … |
| $\boldsymbol{w}$ | weights, parameters, … |
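To make the line-fitting setting concrete, here is a minimal sketch in NumPy; the data is synthetic and the "true" weights $w_0 = 1$, $w_1 = 2$ are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: y = 1 + 2x + noise (true weights chosen arbitrarily)
N = 50
x = rng.uniform(-3, 3, size=N)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=N)

# Stack [1, x] per data point and solve the least-squares problem
X = np.column_stack([np.ones(N), x])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w)  # close to [1.0, 2.0], i.e. [w0, w1]
```

Here `np.linalg.lstsq` solves exactly the least-squares problem that the steps below spell out.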
$p$-D hyperplane fitting
Here, the input is one-dimensional, i.e. $x^{(i)}\in\mathbb R^1$. A direct generalization is to allow the input to be $p$-dimensional. That is to say, suppose we have a training data set $\left\{\boldsymbol{x}^{(i)}, y^{(i)}\right\}_{i=1}^N$, where $\boldsymbol{x}^{(i)} \in \mathbb R^{p}$. We want to find a model such that $$ \begin{aligned} y &\approx 1\cdot w_0 + x_1 w_1 + x_2 w_2 + \cdots + x_p w_p\\ &= \begin{bmatrix} 1 & x_1 & x_2 & \cdots & x_p \end{bmatrix} \begin{bmatrix} w_0\\ w_1\\ w_2 \\ \vdots \\ w_p\end{bmatrix} = \boldsymbol{x}^\top \boldsymbol{w}. \end{aligned} $$ Notice that we arrive at the same form (of course…) and our goal is to solve for the weight vector $\boldsymbol w \in \mathbb R^{p+1}$. (Here the $\text{``}+1\text{''}$ is due to the existence of $w_0$.)
In the discussion below, we will continue to use this vector (matrix) form, since it doesn't require us to specify the dimension of the input.
How to solve for weight vectors?
Step 1: Getting a test set
Once we have a mathematical form in mind, the model is built. Next, we need to train our model to find the "best" one. To define "best", we need a metric. One intuition is that we can define success by tests (just as students take exams to show how well they've mastered certain skills). There are two options:
1. Collect (request) more data to construct a test set.
   - Con: getting new data could be expensive.
2. Conceptually split the original training set into a training set and a test set.
   - Con: less data is left for training.
   - Con: we need a justification for how to split.
   - Pro: we don't need new data.
Once we have a test set, it should be of the same format as the training data. Denote its data points by $\{\boldsymbol{x}^{(j)}, y^{(j)}\}_{j=1}^M$.
One obvious assumption here is that we cannot use the test set to train the model. (Otherwise, it's cheating!)
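If we go with option 2, a random split is the simplest realization. Below is a minimal sketch; the helper name `train_test_split` and the 80/20 ratio are arbitrary choices for illustration:

```python
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=0):
    """Randomly split (X, y) into training and test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(len(y) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```

(Libraries such as scikit-learn ship a ready-made `train_test_split` in `sklearn.model_selection` for the same purpose.)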
Step 2: Define a criterion
An intuitively reasonable criterion is to choose the parameters that minimize the "prediction error" on the test set. A handy choice is the sum-of-squares error: $$ \sum_{j=1}^M (y^{(j)} - {\boldsymbol{x}^{(j)}}^\top \boldsymbol{w})^2. $$ This error has a simpler presentation in matrix form. In fact, if we denote $$ \begin{aligned} \mathbf y &= \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(M)} \end{bmatrix}_{M\times 1} \quad \mathbf X = \begin{bmatrix} {\boldsymbol{x}^{(1)}}^\top \\ {\boldsymbol{x}^{(2)}}^\top \\ \vdots \\ {\boldsymbol{x}^{(M)}}^\top \end{bmatrix}_{M\times (p+1)} \quad \boldsymbol{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_p\end{bmatrix}_{(p+1) \times 1} \end{aligned} $$ then the sum-of-squares error (SSE) can be expressed as $$ \text{SSE} = \| \mathbf{y} - \mathbf X \boldsymbol{w} \|_2^2. $$
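The matrix form translates directly into code. A minimal NumPy sketch, where `X`, `y`, `w` follow the definitions above:

```python
import numpy as np

def sse(X, y, w):
    """Sum-of-squares error ||y - X w||_2^2 for X of shape (M, p+1)."""
    residual = y - X @ w
    return float(residual @ residual)
```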
Comments
Some more names
- In the language of machine learning, this criterion is called the loss function, while in the optimization context, it is called the objective function.
- Minimizing the SSE is called least squares (LS).
- In some books, the SSE is also called the residual sum-of-squares (RSS); accordingly, $y^{(j)} - {\boldsymbol{x}^{(j)}}^\top\boldsymbol{w}$ is called a residual.
Scaling factor?
- Sometimes, to handle this objective function more nicely, we multiply it by $\frac12$ to make the derivative with respect to $\boldsymbol{w}$ look more elegant[^2]. In other cases, we also multiply by $\frac1M$ to represent the average error per data point in the (test) set.
Does it have to be a test set?
- The loss function is independent of the choice of the underlying set. It can be applied to any set, e.g. the training set, the test set, or some other set in the wild.
Bayesian interpretation
- Assume the "model error" (a.k.a. residual) $\varepsilon^{(j)} = y^{(j)} - {\boldsymbol{x}^{(j)}}^\top\boldsymbol{w}$ is a Gaussian variable centered at 0, with an additional unknown parameter $\beta$, the precision of the Gaussian distribution; that is, $\varepsilon^{(j)} \sim \mathcal N(0, \beta^{-1})$. Then we are able to write down the likelihood of each sample in the set, and multiplying over all the samples gives us the likelihood function. We want the likelihood to be as large as possible; this is called $\textbf{M}$aximum $\textbf{L}$ikelihood $\textbf{E}$stimation (MLE). It turns out that MLE under the Gaussian assumption on the error is equivalent to least squares.
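- A quick sketch of this equivalence: under the Gaussian assumption, the log-likelihood of the whole set is $$ \ln \prod_{j=1}^M \mathcal N\!\left(y^{(j)} \,\middle|\, {\boldsymbol{x}^{(j)}}^\top \boldsymbol{w},\, \beta^{-1}\right) = \frac{M}{2}\ln\beta - \frac{M}{2}\ln(2\pi) - \frac{\beta}{2}\sum_{j=1}^M \left(y^{(j)} - {\boldsymbol{x}^{(j)}}^\top \boldsymbol{w}\right)^2, $$ and maximizing it over $\boldsymbol{w}$ only involves the last term, i.e. it is exactly minimizing the SSE.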
Step 3: Solve the optimization problem on the training set
If the dataset is not too big, we can solve the problem directly, as there is actually an explicit (closed-form) solution to the problem.
Otherwise, we can rely on iterative methods ([steepest|conjugate] gradient descent, (quasi-)Newton), or even stochastic optimization methods.
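A minimal sketch of both routes: the explicit solution via the normal equations, and plain gradient descent. The step size and iteration count here are arbitrary and would need tuning for real data:

```python
import numpy as np

def fit_closed_form(X, y):
    """Explicit solution via the normal equations X^T X w = X^T y.
    Assumes X^T X is invertible; otherwise fall back to np.linalg.lstsq."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_gradient_descent(X, y, lr=0.1, n_iters=5000):
    """Steepest descent on the mean loss (1/(2M)) * ||y - X w||_2^2."""
    M = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / M  # see footnote 2 for the gradient
        w -= lr * grad
    return w
```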
Step 4: Evaluate the performance (on test set)
One obvious choice is to report the (mean) SSE on the test set. Or, we can report the (mean) SSE on another test set.
Other Extensions
Multiple Outputs
It is pretty straightforward to generalize a single output ($y^{(i)} \in \mathbb R^1$) to multiple outputs ($\boldsymbol{y}^{(i)} \in \mathbb R^q$). Recall the single-output model takes the form $$ y \approx \boldsymbol{x}^\top \boldsymbol{w}. $$ For multiple outputs, the parameter vector $\boldsymbol{w}$ of length $(p+1)$ grows into a matrix $\boldsymbol W$ of shape $(p+1)\times q$, and the model becomes $$ \boldsymbol{y}^\top \approx \boldsymbol{x}^\top \boldsymbol{W}. $$ Since each column of $\boldsymbol W$ only affects one output dimension, one can view this multi-output problem as $q$ single-output sub-problems that are independent of each other.
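A minimal sketch of this column-wise independence, assuming a data matrix `X` of shape (M, p+1) and targets `Y` of shape (M, q):

```python
import numpy as np

def fit_multi_output(X, Y):
    """All q outputs at once: W has shape (p+1, q)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def fit_per_output(X, Y):
    """Equivalent: q independent single-output least-squares problems."""
    cols = [np.linalg.lstsq(X, Y[:, k], rcond=None)[0] for k in range(Y.shape[1])]
    return np.column_stack(cols)
```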
Change of basis
Can we build a model like the one below? $$ y = w_0 + w_1 \sin(x_1) + w_2 \tan(x_2) + w_3 \exp(x_3) + \cdots $$ Of course. This is not different from the simplest form at all. All we need to do is to change the data matrix $\mathbf X$ into a design matrix $\boldsymbol{\Phi} = \boldsymbol{\Phi}(\mathbf X)$. This increases the ability to model nonlinear behavior, but the model is still linear in its parameters.
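A minimal sketch of building such a design matrix $\boldsymbol{\Phi}$; the particular basis functions simply mirror the toy model above:

```python
import numpy as np

def design_matrix(X):
    """Map raw inputs of shape (M, 3) to Phi = [1, sin(x1), tan(x2), exp(x3)]."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, np.sin(X[:, [0]]), np.tan(X[:, [1]]), np.exp(X[:, [2]])])

# The model is still linear in w: y ≈ Phi @ w, so everything above applies
# with X replaced by Phi = design_matrix(X).
```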
Regularization
We can add restrictions on the parameters, asking them not to be too large. Why would we do that? Well, if such a restriction helps us identify the important features affecting the target, that would be great!
Other than that, if the coefficients are too large, then the model could be very sensitive to the input, hence the "variance" is high. This is also an indication of overfitting.
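One common way to encode "not too large" is an L2 penalty on the weights, i.e. ridge regression. A minimal sketch, where `lam` is a regularization strength one has to choose:

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Minimize ||y - X w||^2 + lam * ||w||^2 via its closed-form solution."""
    p1 = X.shape[1]
    # Note: in practice the intercept w0 is often excluded from the penalty.
    return np.linalg.solve(X.T @ X + lam * np.eye(p1), X.T @ y)
```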
What if the parameters are random variables themselves?
This is where Bayesian Linear Regression comes into the story.
[^1]: Or training set, for short. Strictly speaking, it is not a set, as it allows duplication. It's an ordered list.
[^2]: Which in this case is $\nabla_{\boldsymbol{w}} (\frac12 \|\boldsymbol{\varepsilon}\|_2^2) = \mathbf X^\top (\mathbf X \boldsymbol{w} - \mathbf y) = -\mathbf X^\top \boldsymbol{\varepsilon}$, where $\boldsymbol{\varepsilon} = \mathbf y - \mathbf X \boldsymbol{w}$.