Least squares

Consider $$y=Ax$$ where $$A \in \mathbb{R}^{m\times n}$$ is (strictly) skinny (i.e., $$m>n$$); that is, $$y=Ax$$ is an overdetermined set of linear equations.

For most $$y$$, we cannot solve for $$x$$. This makes sense: even if $$A$$ is full rank (i.e., $$\text{rank}(A)=n$$)its rank will still be smaller compared to the dimensionality of $$y$$, i.e., $$m$$.

The least squares approach is about approximately solving $$y=Ax$$: We define the residual $$r = y-Ax$$ and find the least-squares (approximate) solution, $$x=x_{\text{ls}}$$, that minimizes $$||r||$$.

Taking the partial derivative of $$||r||{}^2 = x^T A^T A x - 2y^T Ax + y^T y$$ w.r.t. $$x$$ and equating it to zero, we get:

$$\nabla_x ||r||{}^2 = 2A^T Ax - 2A^T y = 0 \implies A^T A x = A^T y$$. The latter is called the normal equations and it has a unique solution (because A^T A is square and full rank when $$A$$ is skinny and full rank), famous as the least square (approximate) solution:

$$x_ls = (A^T A)^{-1} A^T y := A^\dagger x$$.

The matrix above, $$A^\dagger$$, is called the pseudo-inverse of $$A$$ and it's a generalized inverse. It is also a left inverse.

Least-squares via QR factorization
One way to obtain the least-squares approximate solution is to use QR Factorization, because:

$$(A^T A)^{-1}A^T = (R^T Q^T Q QR)^{-1} R^T Q^T = R^{-1} Q^T$$,

so $$x_{\text{ls}} = R^{-1}Q^T y$$.

In this case, the projection matrix can be obtained as:

$$A(A^TA)^{-1}A^T = AR^{-1}Q^T = QQ^T$$.

Geometric Interpretation
The vector $$A x_ls$$ is the projection of $$y$$ on $$\mathcal R(A)$$. In this case, the residual $$r=Ax-y$$ is orthogonal to $$\mathcal R(A)$$.

Recursive Least Squares
Another way to write the least square solution is

$$x_{ls} = \left( \sum_{i=1}^m a_i a_i^T\right)^{-1} \sum_{i=1}^m y_i a_i$$

where $$a_i$$ are the columns of $$A$$. This formulation is particularly useful for on-line applications of least squares, i.e., when measurements keep coming and we want do update our estimation incrementally. An algorithm to do this would be :
 * 1) Initialize $$P(0) = 0 \in \mathbb{R}^{n\times n}, q(0) = 0 \in \mathbb{R}^n$$
 * 2) For $$m = 0,1,\dots$$: $$P(m+1) = P(m) + a_{m+1} a_{m+1}^T,\,\,\,\, q(m+1) = q(m) + y_{m+1} a_{m+1}$$
 * 3) if $$P(m)$$ is invertible, we have $$x_{ls}(m) = P(m)^{-1}q(m)$$

As can be seen above, all we need to keep is the most recent versions of $$P(m), q(m)$$ and give an estimation whenever we are asked for one with one matrix inversion and matrix-vector calculation, $$x_{ls}(m) = P(m)^{-1}q(m)$$. In fact, the matrix inversion can be done quite efficiently via the rank one update formula because it adheres to the form $$P(m+1)^{-1} = \left(P(m) + a_{m+1}\right)^{-1}$$.

Multi-objective Least Squares
When we have two (generally) competing objectives $$J_1, J_2$$, what we do is minimizing $$J_1 + \mu J_2$$ for various values of $$\mu$$, which controls how much we care about each objective. Let $$J_1, J_2$$ be defined as:

$$J_1 = ||Ax-y||$$

$$J_2 = ||Fx-g||$$

We can write this as a minimization of a single objective $$||Ax-y||{}^2 + \mu ||Fx-g||{}^2$$. This can actually be written as a single-objective minimization $$||\tilde{A}x-\tilde{b}||$$ where $$\tilde{A} = \begin{bmatrix}A\\ \sqrt{\mu} F\end{bmatrix}$$ and $$\tilde{b} = \begin{bmatrix}y \\ \sqrt{mu} g\end{bmatrix}$$. We can now solve this (assuming $$\tilde{A}$$ is full rank) using the least squares solution listed above. However, we need to solve for various $$\mu$$ values (possibly a logarithmic range). Of course, we must define what a good value of $$\mu $$ would be. This will depend on how much we care about each of the two objectives $$J_1$$ and $$J_2$$. In some cases there may be a point that satisfies both with relatively little compromise from each. The clue that we are actually close to a good trade-off point is that both $$J_1$$ and $$J_2$$ vary little with $$\mu$$. (16:30+ of Lecture 08 ). In other words, when our objectives becomes insensitive to $$\mu$$.

Both least-norm and least-squares are special cases of general norm minimization with equality constraints.

BLUE property
Least-squares is the best linear unbiased estimator (BLUE). The BLUE property is understood better when we assume that there is some noise in our measurement, i.e., $$y=Ax+v$$. The least squares approx. solution is nothing but multiplying $$y$$ with a left inverse, as seen above. Least squares involves a specific left inverse; the matrix $$A$$, however, may have more (in fact, infinite) left inverses. In this case, all of them would yield a solution. Indeed, assume that $$B$$ is a left inverse of $$A$$; in the noiseless case, if $$y=Ax$$, we have that $$By = BAx = x$$.

In the case with noise (which is much more realistic), we have that $$ \hat{x} = By = BAx + Bv = x + Bv \implies x - \hat{x} = -Bv$$. Since we don't know the noise and as far as we are concerned it can be random, we cannot expect to fully recover $$x$$. However, we can aim to have a left inverse $$B$$ that minimizes the effect of the noise, that is, $$-Bv$$. In this case, we'd like $$B$$ to be 'small'. The pseudo-inverse $$A^\dagger$$ is the smallest left inverse of $$A$$. This property is called makes least-squares the best linear unbiased estimator (BLUE).

Smallest uncertainty ellipse
Least-squares is also optimal in that it gives the smallest uncertainty ellipsoid. This property applies when we have a norm-bound model of noise; that is, we assume that $$||v||\le \alpha$$ but otherwise know nothing about $$v$$.

Consider any estimator $$B$$ of $$x$$, i.e., $$\hat{x} = By$$, where $$B$$ is unbiased (i.e., $$BA  = I$$). The estimation error is $$\tilde x = \hat x - x = Bv$$. Due to norm-bound noise model, we have that

$$\tilde x \in \mathcal{E}_{\text{unc}} = \{ Bv | ||v|| \le \alpha\}$$/

A good estimator should have a small uncertainty ellipsoid $$\mathcal E_{\text{unc}}$$. It can be shown that least-squares has the smallest possible uncertainty ellipsoid -- smallest in the sense of matrix comparison; that is, if $$B_{\text{ls}}$$ is the least-squares estimator and $$\mathcal E_{\text{ls}}$$ is its uncertainty ellipsoid, for any estimator $$B$$ with ellipsoid $$\mathcal E$$ we have that

$$B \ge B_{\text{ls}}$$ and (see Ellipsoid) $$\mathcal E_{ls} \subseteq \mathcal E$$.