Least squares

Consider $$y=Ax$$ where $$A \in \mathbb{R}^{m\times n}$$ is (strictly) skinny (i.e., $$m>n$$); that is, $$y=Ax$$ is an overdetermined set of linear equations.

For most $$y$$, we cannot solve for $$x$$. This makes sense: even if $$A$$ is full rank (i.e., $$\text{rank}(A)=n$$), its rank is still smaller than $$m$$, the dimension of $$y$$, so the columns of $$A$$ cannot span all of $$\mathbb{R}^m$$.
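As a concrete illustration with $$m=2$$, $$n=1$$: if $$A = \begin{bmatrix}1\\ 1\end{bmatrix}$$ and $$y = \begin{bmatrix}1\\ 2\end{bmatrix}$$, then $$y=Ax$$ requires $$x=1$$ and $$x=2$$ simultaneously, so no exact solution exists.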

The least squares approach is about approximately solving $$y=Ax$$: we define the residual $$r = y-Ax$$ and find the least-squares (approximate) solution, $$x=x_{\text{ls}}$$, that minimizes $$||r||$$.

Since minimizing $$||r||$$ is equivalent to minimizing $$||r||^2$$, we expand $$||r||^2 = x^T A^T A x - 2y^T Ax + y^T y$$, take the gradient w.r.t. $$x$$, and set it to zero:

$$\nabla_x ||r||^2 = 2A^T Ax - 2A^T y = 0 \implies A^T A x = A^T y$$. The latter system is called the normal equations, and it has a unique solution (because $$A^T A$$ is square and invertible when $$A$$ is skinny and full rank), known as the least-squares (approximate) solution:

$$x_{\text{ls}} = (A^T A)^{-1} A^T y = A^\dagger y$$, where $$A^\dagger := (A^T A)^{-1} A^T$$.
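A minimal numerical sketch of this computation, assuming NumPy ($$A$$ and $$y$$ are arbitrary made-up data): it solves the normal equations directly and cross-checks against the library's least-squares routine.

```python
import numpy as np

# Made-up example data: A is skinny (m=5 > n=2) and full rank
# with probability 1 for a random Gaussian matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
y = rng.standard_normal(5)

# Solve the normal equations A^T A x = A^T y directly.
x_ls = np.linalg.solve(A.T @ A, A.T @ y)

# Cross-check against NumPy's built-in least-squares solver.
x_ref, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(x_ls, x_ref)
```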

The matrix $$A^\dagger$$ defined above is called the pseudo-inverse of $$A$$, and it is a generalized inverse; in particular, it is a left inverse of $$A$$, since $$A^\dagger A = (A^T A)^{-1} A^T A = I$$.
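A quick check, under the same assumptions (NumPy, made-up data), that the explicit formula agrees with NumPy's `np.linalg.pinv` and that $$A^\dagger$$ is a left inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))

# Pseudo-inverse formed explicitly from the formula above.
A_pinv = np.linalg.inv(A.T @ A) @ A.T

# Matches NumPy's pinv (computed via the SVD) for full-rank skinny A.
assert np.allclose(A_pinv, np.linalg.pinv(A))

# Left inverse: A^† A = I. Note A A^† is a projection, not I, since m > n.
assert np.allclose(A_pinv @ A, np.eye(2))
```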

Geometric interpretation

The vector $$A x_{\text{ls}}$$ is the projection of $$y$$ onto $$\mathcal R(A)$$, the range of $$A$$. Consequently, the residual $$r = y - A x_{\text{ls}}$$ is orthogonal to $$\mathcal R(A)$$: the normal equations say exactly that $$A^T (y - A x_{\text{ls}}) = 0$$.
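A numerical sketch of this geometric picture, again assuming NumPy with made-up data: the residual is orthogonal to every column of $$A$$, and $$A x_{\text{ls}}$$ coincides with the orthogonal projection of $$y$$ onto $$\mathcal R(A)$$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
y = rng.standard_normal(6)

# Least-squares solution and its residual.
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
r = y - A @ x_ls

# r is orthogonal to every column of A, hence to all of R(A).
assert np.allclose(A.T @ r, 0)

# Equivalently, A @ x_ls is the orthogonal projection of y onto R(A).
P = A @ np.linalg.inv(A.T @ A) @ A.T  # projection matrix onto R(A)
assert np.allclose(P @ y, A @ x_ls)
```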