Singular Value Decomposition

Singular Value Decomposition (SVD) is a ubiquitous matrix decomposition that applies to any matrix $$A \in \mathbb{R}^{m\times n}$$ and takes the form

$$A = U \Sigma V^T$$.

Let $$\text{rank}(A) = r$$. Then, the columns of $$U \in \mathbb{R}^{m\times r}$$ form an orthonormal set of eigenvectors of $$A A^T$$; the columns of $$V \in \mathbb{R}^{n\times r}$$ form an orthonormal set of eigenvectors of $$A^T A$$; and $$\Sigma = \text{diag}(\sigma_1, \sigma_2, \dots, \sigma_r)$$ contains the square roots of the non-zero eigenvalues of $$A A^T$$ (equivalently, of $$A^T A$$). The values $$\sigma_i$$ are listed from largest to smallest.

Now note that any real matrix of the form $$A A^T$$ or $$A^T A$$ is symmetric and nonnegative definite (see Theorem 14.3.7), so its eigenvalues are real and nonnegative. Therefore, the entries of $$\Sigma$$ are real. We can more succinctly represent the properties of the matrices in the SVD as:


 * $$U = [u_1, \dots, u_r] \in \mathbb{R}^{m\times r}, U^T U = I$$; $$u_i$$ is an eigenvector of $$A A^T$$ corresponding to eigenvalue $$\sigma_i^2$$
 * $$V = [v_1, \dots, v_r] \in \mathbb{R}^{n\times r}, V^T V = I$$; $$v_i$$ is an eigenvector of $$A^T A$$ corresponding to eigenvalue $$\sigma_i^2$$
 * $$\Sigma = \text{diag}(\sigma_1, \sigma_2, \dots, \sigma_r)$$, where $$\sigma_1 \ge \dots \ge \sigma_r > 0$$

Now we can write out the SVD as:

$$A = U \Sigma V^T = \sum_{i=1}^r \sigma_i u_i v_i^T$$

Some terminology:
 * $$\sigma_i$$ are the (nonzero) singular values of $$A$$
 * $$v_i$$ are the right or input singular vectors of $$A$$
 * $$u_i$$ are the left or output singular vectors of $$A$$

It's easy to show that $$U$$ and $$V$$ contain eigenvectors of $$A A^T$$ and $$A^T A$$, respectively. For example, for $$V$$:

$$A^T A = (U\Sigma V^T)^T (U \Sigma V^T) = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T$$

so the columns of $$V$$ are eigenvectors of $$A^T A$$ with eigenvalues $$\sigma_i^2$$.
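
As a quick numerical sanity check, here is a minimal NumPy sketch (the random matrix and seed are purely illustrative) that builds the compact SVD, reconstructs $$A$$ from the rank-one sum above, and verifies that the columns of $$V$$ are eigenvectors of $$A^T A$$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))  # 5x4 matrix whose rank is (at most) 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # "thin" SVD: U is 5x4, s has 4 entries
r = int(np.sum(s > 1e-10 * s[0]))                  # numerical rank: number of significant singular values
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]              # keep only the first r columns/values -> compact SVD

# A equals the sum of the rank-one terms sigma_i * u_i * v_i^T
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))
print(np.allclose(A, A_sum))                       # True

# The columns of V are eigenvectors of A^T A with eigenvalues sigma_i^2
print(np.allclose(A.T @ A @ Vt.T, Vt.T * s**2))    # True
```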

Some properties of SVD

 * $$u_1, \dots, u_r$$ form an orthonormal basis for $$\text{range}(A)$$
 * $$v_1, \dots, v_r$$ form an orthonormal basis for $$\mathcal{N}(A)^\perp$$
 * $$\sigma_1 = ||A||$$, where $$||\cdot||$$ is the Matrix norm

The last property is obvious when one considers the definition of the Matrix norm. To understand the first property, we can inspect the product $$Ax$$:

$$Ax = U \Sigma V^T x = U (\Sigma V^T x) = Uz$$.

Now, since $$V^T \in \mathbb{R}^{r\times n}$$ has rank $$r$$, the product $$V^T x$$ can be any vector in $$\mathbb{R}^r$$ as $$x$$ ranges over $$\mathbb{R}^n$$; in other words, $$z$$ above can be any vector in $$\mathbb{R}^r$$. And since $$U$$ also has rank $$r$$, the set of vectors $$Uz$$ is exactly the column span of $$U$$, which therefore equals $$\text{range}(A)$$.

The second property above is another way of saying that $$v_1, \dots, v_r$$ span the part of the input space that is not mapped to zero. In other words, if $$x$$ is orthogonal to the span of $$v_1, \dots, v_r$$, then $$x \in \mathcal{N}(A)$$.
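
All three properties can be checked numerically; a small NumPy sketch (the rank-2 matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # rank-2, 5x4

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

# sigma_1 equals the operator (spectral) norm of A
print(np.isclose(s[0], np.linalg.norm(A, 2)))                   # True

# Every vector A x lies in span(u_1, ..., u_r): projecting A x onto that span changes nothing
x = rng.standard_normal(4)
Ax = A @ x
print(np.allclose(U @ (U.T @ Ax), Ax))                          # True

# A vector orthogonal to v_1, ..., v_r lies in the null space of A
w = rng.standard_normal(4)
w -= Vt.T @ (Vt @ w)                                            # remove the component in span(v_1, ..., v_r)
print(np.allclose(A @ w, 0))                                    # True
```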

Full SVD
In some cases it may be preferable to have a version of the SVD that includes the zero singular values as well, in which case we augment the $$\Sigma$$ matrix with the (possibly) zero singular values on the diagonal, making it $$m \times n$$. We also need to augment $$U$$ and $$V$$, which is done by adding enough orthonormal columns to complete each of them into a basis ($$U$$ becomes $$m \times m$$ and $$V$$ becomes $$n \times n$$).

To distinguish the (regular) SVD from the full SVD, the former is also sometimes called the compact SVD.
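
In NumPy, the `full_matrices` flag of `np.linalg.svd` switches between the two versions; a short sketch (the shapes are for an illustrative $$5 \times 3$$ matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))

# full_matrices=True gives the full SVD, full_matrices=False the thin/compact one
U_full, s, Vt_full = np.linalg.svd(A, full_matrices=True)    # U_full: 5x5, Vt_full: 3x3, s: 3 values
U_thin, _, _ = np.linalg.svd(A, full_matrices=False)         # U_thin: 5x3

print(U_full.shape, U_thin.shape)                  # (5, 5) (5, 3)
print(np.allclose(U_full.T @ U_full, np.eye(5)))   # True: U_full is a full orthogonal matrix

# With the full SVD, Sigma must be augmented to a rectangular 5x3 matrix
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)
print(np.allclose(A, U_full @ Sigma @ Vt_full))    # True
```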

Interpretation: Understanding SVD
In the SVD, $$v_1$$ is the most sensitive (highest-gain) input direction and $$u_1$$ is the most sensitive output direction. The reason becomes clear upon inspecting the block diagram of the SVD, in which the input passes through the three factors in turn:

$$x \rightarrow V^T \rightarrow \Sigma \rightarrow U \rightarrow Ax$$

Note that the rightmost block, $$U$$, does not change the norm of its input, so the gain of the matrix is determined by the first two blocks. The second block, $$\Sigma$$, is simple: it scales each entry of its input by a positive number. Since the first entry, $$\sigma_1$$, is the largest one, the output of the second block has maximal norm when its input is concentrated entirely on the first element, i.e., (assuming $$x$$ is a unit vector) when $$V^T x = e_1$$, which happens when $$x = v_1$$. This justifies calling $$v_1$$ the most sensitive input direction.

The qualification of $$u_1$$ as the highest-gain output direction is also justified because:

$$A v_1 = \sigma_1 u_1$$.

The SVD acts much like the eigendecomposition of a symmetric matrix in that it first rotates the input vector and then scales each entry; the difference is in the third step: the eigendecomposition undoes the first rotation (applies it in the reverse direction), whereas the SVD applies a different rotation.
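
A small NumPy sketch (the random matrix and sample size are illustrative) of the claim that $$A v_1 = \sigma_1 u_1$$ and that no unit input achieves a larger gain than $$v_1$$:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
v1, u1, sigma1 = Vt[0], U[:, 0], s[0]

print(np.allclose(A @ v1, sigma1 * u1))            # True: A v_1 = sigma_1 u_1

# No unit-norm input achieves a larger gain than v_1 does
xs = rng.standard_normal((1000, 3))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)    # 1000 random unit vectors
gains = np.linalg.norm(xs @ A.T, axis=1)           # ||A x|| for each of them
print(bool(gains.max() <= sigma1 + 1e-12))         # True
```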

Image of unit ball under full SVD transformation
The full SVD transformation gives a clear picture of how the SVD operates. The figure below shows what the full SVD does to the (2D) unit ball at each block.

(Figure: the unit circle is rotated by $$V^T$$, stretched into an ellipse by $$\Sigma$$, and rotated again by $$U$$.)

Since we assume the full SVD, both $$U$$ and $$V$$ are orthogonal matrices (and so are their transposes). Therefore, the first operation is a rotation by $$V^T$$, which does not change the image of the ball, although each vector on it is rotated. The second step amplifies each dimension according to the corresponding singular value, which turns the ball into an ellipse. The last step is another rotation, the one induced by $$U$$. Note that the highest-gain output direction, $$u_1$$, is aligned with the longest semi-axis of the ellipse.
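
A sketch of this picture in NumPy, with an illustrative $$2 \times 2$$ matrix: the image of the unit circle under $$A$$ is an ellipse whose longest and shortest semi-axes have lengths $$\sigma_1$$ and $$\sigma_2$$.

```python
import numpy as np

theta = np.linspace(0, 2 * np.pi, 400)
circle = np.vstack([np.cos(theta), np.sin(theta)])      # 2 x 400 points on the unit circle

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
U, s, Vt = np.linalg.svd(A)

# Applying the three blocks in turn: rotate, scale, rotate
ellipse = U @ (np.diag(s) @ (Vt @ circle))              # same as A @ circle

dists = np.linalg.norm(ellipse, axis=0)                 # distance of each image point from the origin
print(dists.max(), s[0])    # approximately equal: the longest semi-axis has length sigma_1
print(dists.min(), s[1])    # approximately equal: the shortest semi-axis has length sigma_2
```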

Condition number
An important quantity that comes out of the SVD is the condition number of a matrix $$A$$, denoted by $$\kappa(A)$$ and computed as

$$\kappa(A) = \sigma_1 / \sigma_r$$

Clearly, the condition number is nothing but the ratio between the largest and smallest singular values of $$A$$. The importance of $$\kappa(A)$$ is that it places an upper bound on the sensitivity to data error, as described below.
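
NumPy computes this directly via `np.linalg.cond`; a one-line check against the singular-value ratio (the random matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))

s = np.linalg.svd(A, compute_uv=False)                   # singular values in decreasing order
print(np.isclose(s[0] / s[-1], np.linalg.cond(A, 2)))    # True
```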

What's the point of SVD?
SVD has many uses; this section covers only a small sample of them.

Approximate rank
A matrix that is rank-deficient can be made full-rank by an arbitrarily small perturbation. In this sense, rank is an extremely sensitive concept. The SVD can be used to develop a robust alternative to it: the approximate rank. The approximate rank is the number of singular values of a matrix (i.e., the $$\sigma_i$$) that are not very close to zero. What counts as "very close to zero" is application dependent, but in some cases it is pretty clear. For example, if the singular values of a $$5 \times 5$$ matrix are as follows

$$(10, 5, 4, 0.1, 0.01)$$

we can say that the approximate rank is 3, as the last two values are comparatively very small. In fact, in an estimation problem where we try to find $$x$$ from noisy measurements

$$y = Ax + v$$

and if we have an idea about the order of magnitude of the noise (e.g., its standard deviation if modelled as a random variable), then we can use it as a threshold for identifying the significant singular values.
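
A minimal sketch of this thresholding, using the singular values from the example above and an assumed (illustrative) noise level:

```python
import numpy as np

s = np.array([10.0, 5.0, 4.0, 0.1, 0.01])   # singular values from the example above
noise_level = 1.0                            # illustrative assumption about the noise scale
approx_rank = int(np.sum(s > noise_level))
print(approx_rank)                           # 3
```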

Highest/lowest gain directions of a matrix
Each singular value is associated with an input and an output direction. As described and illustrated in the Interpretation section above, $$v_1$$ and $$u_1$$ give the highest-gain input and output directions, respectively, of a matrix $$A$$ (and, similarly, $$v_r$$ and $$u_r$$ give the lowest-gain ones among the directions not in the null space).

Analysing sensitivity to data error
Suppose that $$y = Ax$$, where $$A$$ is square and invertible, and we are interested in finding $$x$$, which clearly is

$$x = A^{-1}y$$.

Suppose that there is an error in $$y$$, i.e., $$y$$ becomes $$y + \delta y$$. Then the estimated $$x$$ becomes $$x + \delta x$$. Clearly, we would like $$\delta x$$ to be small. Using a basic inequality (see Matrix norm), we can place an upper bound on $$||\delta x||$$ as:

$$||\delta x|| = ||A^{-1} \delta y|| \le ||A^{-1}|| \, ||\delta y||$$.

This inequality says that a large $$||A^{-1}||$$ can lead to large errors in $$x$$ (although the inequality can be far from tight). The relative error

$$\frac{||\delta x||}{||x||} \le ||A|| ||A^{-1}||\frac{||\delta y||}{||y||}$$

is obtained by combining the above with a similar inequality, $$||y|| \le ||A|| \, ||x||$$.

The number $$||A|| \, ||A^{-1}||$$ is nothing but the condition number $$\kappa(A)$$, since $$1/\sigma_{\text{min}}(A)$$ is the largest singular value of $$A^{-1}$$ and hence equals $$||A^{-1}||$$. (To see this, note that for invertible $$A$$ we have $$A^{-1} = V \Sigma^{-1} U^T$$.)

To summarize, a small condition number means small amplification of data error (although a large condition number does not necessarily imply a large amplification of data error).
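
A small NumPy sketch (the random system and perturbation are illustrative) of the bound above:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)
y = A @ x

delta_y = 1e-6 * rng.standard_normal(4)            # small perturbation of the data
delta_x = np.linalg.solve(A, y + delta_y) - x      # resulting perturbation of the estimate

rel_err_x = np.linalg.norm(delta_x) / np.linalg.norm(x)
bound = np.linalg.cond(A, 2) * np.linalg.norm(delta_y) / np.linalg.norm(y)
print(bool(rel_err_x <= bound))                    # True (the bound is usually far from tight)
```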

Matrix approximation
Let $$\tilde A$$ be the rank-$$p$$ approximation of $$A$$, that is, the approximation obtained by keeping only the first $$p$$ terms of the rank-one sum: $$\tilde A = \sum_{i=1}^p \sigma_i u_i v_i^T$$. Then, $$\tilde A$$ is optimal in the sense that the approximation error $$||\tilde A - A||$$ is minimal over all matrices of rank at most $$p$$ (for the Spectral norm $$||\cdot||$$).

In fact, it's also easy to compute the approximation error of this rank-$$p$$ approximation: it equals $$\sigma_{p+1}$$.
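
A short NumPy check of both claims (the matrix and $$p$$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((6, 5))
p = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_p = U[:, :p] @ np.diag(s[:p]) @ Vt[:p, :]        # keep only the first p singular triples

err = np.linalg.norm(A - A_p, 2)                   # spectral norm of the approximation error
print(np.isclose(err, s[p]))                       # True: the error equals sigma_{p+1}
```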

Pseudo-inverse
SVD gives the pseudo-inverse of a matrix in the most general case (i.e., even when it is not full rank) -- see Pseudo-inverse.
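
A sketch of this construction in NumPy, assuming the compact SVD and inverting only the nonzero singular values; it should match `np.linalg.pinv`, which is also SVD-based:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # rank-deficient 5x4 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))                               # numerical rank

# A^+ = V Sigma^{-1} U^T, inverting only the nonzero singular values
A_pinv = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

print(np.allclose(A_pinv, np.linalg.pinv(A)))                   # True
```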