# initialize the environment
%pylab inline
import sympy as sp
import fmt
import pandas as pd
sp.init_printing(use_latex = True)
from IPython.display import display
lecture = 3
regression: noun, a return to a former or less developed state
Given an unknown scalar function $\renewcommand{bs}{\boldsymbol} y = y(\boldsymbol x)$, where the argument $\bs x$ is an $n$-dimensional vector.
We can express this problem as a least square minimization:
$$ \min_{\bs {\beta}} \Vert X \bs{\beta - y} \Vert_2$$ which can be easily solved using matrix calculus:
$$\begin{array}{l} \Vert X \bs{\beta - y} \Vert_2^2 &= (X \bs{\beta - y})^T (X \bs{\beta - y}) \\ &= \bs{\beta}^T X^T X \bs{\beta} - 2 \bs{\beta}^TX^T\bs y + \bs y^T\bs y\\ \frac{\partial}{\partial \bs \beta} \Vert X \bs{\beta - y} \Vert_2^2 &= 2 \bs \beta^T X^TX - 2\bs y^T X = \bs 0^T \\ X^TX \bs \beta &= X^T \bs y \\ \bs \beta &= (X^TX)^{-1}X^T\bs y \end{array}$$
We already learned that an invertible matrix $A$'s condition number is
$$k(A) = \Vert A \Vert \Vert A^{-1} \Vert$$
We can extend this to non-square matrices using the pseudo-inverse, following the same proof.
$$k(A) = \Vert A \Vert \Vert A^{+} \Vert$$
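As a quick numerical check (a minimal sketch, not part of the original notes; it assumes the %pylab environment above and uses an arbitrary non-square matrix), numpy's built-in 2-norm condition number agrees with the product of the norms of the matrix and its pseudo-inverse:
m_rect = np.array([[1., 2.], [3., 4.], [5., 6.]])    # arbitrary non-square example
k_builtin = np.linalg.cond(m_rect, 2)                # 2-norm condition number
k_pinv = np.linalg.norm(m_rect, 2)*np.linalg.norm(np.linalg.pinv(m_rect), 2)
print(k_builtin, k_pinv)                             # the two values agree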
The $\bs \beta$ that solves the least square problem may have large magnitude:
Ridge regression adds a magnitude penalty to the objective function of least square:
$$\begin{array}{l} l(\bs \beta) &= \Vert X \bs{\beta - y} \Vert_2^2 + \lambda \Vert W \bs \beta \Vert_2^2 \\ &= \bs{\beta}^T X^T X \bs{\beta} - 2 \bs{\beta}^TX^T\bs y + \bs y^T \bs y + \lambda \bs \beta^T W^TW \bs \beta \\ \frac{\partial l}{\partial \bs \beta} &= 2\bs \beta^T X^TX - 2\bs y^T X + 2 \lambda \bs \beta^T W^T W = \bs 0^T \\ \bs \beta &= (X^TX + \lambda W^TW)^{-1}X^T\bs y \end{array}$$
We draw many pairs of $x, y, z$ from the following linear model:
$$ y = 2x + 0.1z + 5 + \epsilon $$
We regress the vector $\bs y$ against $X = [\bs x, \bs x + .0001 \bs z, \bs 1]$:
import pandas as pd
n = 5000
# simulate from the model y = 2x + 0.1z + 5 + noise
x = np.random.normal(size=n)
z = np.random.normal(size=n)
y = 2*x + 5 + 0.1*z + np.random.normal(size=n)
# nearly collinear design matrix X = [x, x + .0001 z, 1]
xs = np.array([x, x + 0.0001*z, np.ones(len(x))]).T
q = np.eye(len(xs.T))   # penalty matrix W = I
lbd = .1                # ridge penalty lambda
# ordinary least square: beta = (X^T X)^{-1} X^T y
b_ols = np.linalg.inv(xs.T.dot(xs)).dot(xs.T).dot(y)
e_ols = np.linalg.norm(y - xs.dot(b_ols), 2)
# ridge regression: beta = (X^T X + lambda W^T W)^{-1} X^T y
b_rr = np.linalg.inv(xs.T.dot(xs) + lbd*q.T.dot(q)).dot(xs.T).dot(y)
e_rr = np.linalg.norm(y - xs.dot(b_rr), 2)
df = pd.DataFrame(np.array([b_ols, b_rr]), index=['Least square', 'Ridge regression $\\lambda=%2g$' % lbd],
columns=['$\\bs x$', '$\\bs x+.0001\\bs z$', '$\\bs 1$'])
df['$\Vert X\\bs \\beta - \\bs y \\Vert_2$'] = [e_ols, e_rr]
fmt.displayDF(df, "4g")
| | $\bs x$ | $\bs x+.0001\bs z$ | $\bs 1$ | $\Vert X\bs \beta - \bs y \Vert_2$ |
|---|---|---|---|---|
| Least square | -996 | 998 | 4.988 | 71.02 |
| Ridge regression $\lambda=0.1$ | 0.7599 | 1.257 | 4.991 | 71.37 |
eigen-: characteristic, origin: German
For a square matrix $A$ of size $n \times n$, if there exists a vector $\bs {v \ne 0}$ and a scalar $\lambda$ so that:
$$ A \bs v = \lambda \bs v $$
Eigen vector/value naming conventions:
Rewrite the equation as:
$$ (A - \lambda I) \bs {v = 0}$$
It has a non-zero solution if and only if:
$$ \det(A - \lambda I) = 0$$
The characteristic equation is a polynomial of degree $n$:
It is difficult to solve the characteristic equation directly for $\lambda_i$, especially for large $n$.
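For a small matrix the characteristic equation can still be handled symbolically; a minimal sketch with sympy (the $2 \times 2$ matrix here is just an illustration, not from the notes):
lam = sp.symbols('lambda')
m2 = sp.Matrix([[2, 1], [1, 2]])           # small illustrative symmetric matrix
char_poly = (m2 - lam*sp.eye(2)).det()     # det(A - lambda I)
display(sp.factor(char_poly))              # (lambda - 1)*(lambda - 3)
display(sp.solve(char_poly, lam))          # eigenvalues [1, 3]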
Eigen vectors of distinct eigen values are linearly independent. Suppose not, and let $\bs{v_k}$ be the first eigen vector that is a linear combination of the previous (linearly independent) ones:
$$\begin{array} \\ \bs{v_k} &= \sum_{i=1}^{k-1}c_i\bs{v_i} \\ \lambda_k \bs{v_k} &= \sum_{i=1}^{k-1}c_i\lambda_i\bs{v_i} = \lambda_k \sum_{i=1}^{k-1}c_i\bs{v_i} \\ \bs 0 &= \sum_{i=1}^{k-1}c_i(\lambda_i - \lambda_k) \bs{v_i} \end{array} $$
Since the $\lambda_i$ are distinct and $\bs{v_1}, \dots, \bs{v_{k-1}}$ are linearly independent, all $c_i$ must be zero, which makes $\bs{v_k} = \bs 0$, a contradiction.
If $A$ has $n$ distinct eigen values, we can write:
$$ A R = R \Lambda$$
$R$ is invertible because eigen vectors are all independent:
$$\begin{array} \\ R^{-1} A &= \Lambda R^{-1} \\ A^*(R^{-1})^* &= (R^{-1})^* \Lambda^* \\ \end{array}$$
If $A$ is real and symmetric: $A^* = A$:
The conclusion holds even when there are duplicated eigen values, with some care.
If $A$ is real and symmetric:
Apply Lagrange multiplier:
$$\begin{array} \\ l &= \bs u^T A \bs u - \gamma (\bs u^T \bs u - 1) \\ \frac{\partial l}{\partial \bs u} &= 2 \bs u^T A^T - 2\gamma \bs u^T = \bs 0^T \iff A \bs u = \gamma \bs u \end{array}$$
This process can be repeated to find all eigenvalues and vectors:
Square matrix $B$ and $A$ are similar if $B = P^{-1}AP$:
$$ AR = PBP^{-1} R = R \Lambda \iff B (P^{-1}R) = (P^{-1} R) \Lambda $$
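A quick numerical check of this fact (a minimal sketch, not in the original notes; $P$ is a random matrix, invertible with probability one, and the symmetric matrix is the same $4 \times 4$ example used in the QR section below):
a_sim = np.array([[5, 4, -3, 2], [4, 4, 2, -1], [-3, 2, 3, 0], [2, -1, 0, -2]])
p = np.random.normal(size=(4, 4))                 # random P
b_sim = np.linalg.inv(p).dot(a_sim).dot(p)        # B = P^{-1} A P
print(np.sort(np.linalg.eigvals(a_sim).real))
print(np.sort(np.linalg.eigvals(b_sim).real))     # same eigenvalues up to round-off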
The basic idea of computing eigen values for $A$:
orthogonal: adjective, at right angles
Any real matrix $A$ can be decomposed into $A = QR$:
QR decomposition is numerically stable
What if $A$ is not full rank?
The QR algorithm is one of the most important numerical methods of the 20th century link
Start with $A_0 = A$, then iterate:
This algorithm is unconditionally stable because only orthogonal transformations are used.
Find the eigenvalues of the following matrix:
def qr_next(a) :
    # one QR iteration: factor A_k = Q_k R_k, return A_{k+1} = R_k Q_k
    q, r = np.linalg.qr(a)
    return r.dot(q)
a = np.array([[5, 4, -3, 2], [4, 4, 2, -1], [-3, 2, 3, 0], [2, -1, 0, -2]])
A = sp.MatrixSymbol('A', 4, 4)
A1 = sp.MatrixSymbol('A_1', 4, 4)
Q = sp.MatrixSymbol('Q_0', 4, 4)
R = sp.MatrixSymbol('R_0', 4, 4)
fmt.displayMath(fmt.joinMath('=', A, sp.Matrix(a)), pre="\\scriptsize ")
QR decomposition of $A_0 = A$:
q, r = np.linalg.qr(a)
fmt.displayMath(fmt.joinMath('=', Q, sp.Matrix(q).evalf(3)), fmt.joinMath('=', R, sp.Matrix(r).evalf(3)), pre="\\scriptsize ")
$A$ at the start of the next iteration, and after 20 iterations:
ii = 20
d = np.copy(a)
for i in range(ii) :
    d = qr_next(d)
An = sp.MatrixSymbol('A_%d' % ii, 4, 4)
fmt.displayMath(fmt.joinMath('=', A1, sp.Matrix(r.dot(q)).evalf(3)), fmt.joinMath('=', An, sp.Matrix(np.round(d, 3))), pre="\\scriptsize ")
$$ H = I - \frac{2\bs u \bs u^T}{\bs u^T \bs u}$$
We show how to use the Householder transformation to perform the QR decomposition of the matrix $A$ from the previous example.
def householder(x0, e=0) :
    # Householder reflection that zeros out all but the first element of x0[e:],
    # leaving the first e rows/columns untouched
    n = len(x0)
    e1 = np.zeros(n-e)
    x = x0[e:]
    e1[0] = np.linalg.norm(x, 2)               # e1 = ||x||_2 (1, 0, ..., 0)^T
    u = x - e1
    v = np.matrix(u/np.linalg.norm(u, 2))      # unit vector u/||u||_2
    hs = np.eye(n-e) - 2*v.T*v                 # H = I - 2 u u^T / (u^T u)
    h = np.eye(n)
    h[e:,e:] = hs
    return h
x, u, e1, Q, R = sp.symbols("x, u, e_1, Q, R")
xn = sp.symbols("\Vert{x}\Vert_2")
b = a[:, 0]
c = np.zeros(len(b))
c[0] = np.linalg.norm(b, 2)
fmt.displayMath(fmt.joinMath('=', A, sp.Matrix(np.round(a))), fmt.joinMath('=', x, sp.Matrix(a[:,0])),
fmt.joinMath('=', xn, np.round(np.linalg.norm(b, 2), 3)),
fmt.joinMath('=', e1, sp.Matrix([1, 0, 0, 0])),
fmt.joinMath('=', u, sp.Matrix(a[:,0]-c).evalf(3)), pre="\\scriptsize ")
fmt.displayMath("\;")
h1 = householder(a[:, 0], 0)
H1 = sp.MatrixSymbol('H_1', 4, 4)
a1 = h1.dot(a)
fmt.displayMath(fmt.joinMath('=', H1, sp.Matrix(np.round(h1, 3))),
fmt.joinMath('=', H1*A, sp.Matrix(np.round(a1, 3))), pre="\\scriptsize ")
Continue to zero out the lower triangle:
h2 = householder(a1[:, 1], 1)
H2 = sp.MatrixSymbol('H_2', 4, 4)
a2 = h2.dot(a1)
fmt.displayMath(fmt.joinMath('=', H2, sp.Matrix(np.round(h2, 3))),
fmt.joinMath('=', H2*H1*A, sp.Matrix(np.round(a2, 3))), pre="\\scriptsize ")
fmt.displayMath("\;")
h3 = householder(a2[:, 2], 2)
H3 = sp.MatrixSymbol('H_3', 4, 4)
a3 = h3.dot(a2)
fmt.displayMath(fmt.joinMath('=', H3, sp.Matrix(np.round(h3, 3))),
fmt.joinMath('=', H3*H2*H1*A, sp.Matrix(np.round(a3, 3))), pre="\\scriptsize ")
The final results are therefore $Q = (H_3H_2H_1)^T$ and $R = Q^T A$:
q = (h3.dot(h2).dot(h1)).T
r = q.T.dot(a)
np.round(q.dot(r), 4)
fmt.displayMath(fmt.joinMath('=', Q, sp.Matrix(np.round(q, 3))), fmt.joinMath('=', R, sp.Matrix(np.round(r, 3))), pre="\\scriptsize ")
$$ \min_{\bs {\beta}} \Vert X \bs{\beta - y} \Vert_2$$
given the QR decomposition of $X = QR$:
$$ \min_{\bs {\beta}} \Vert X \bs{\beta - y} \Vert_2 = \min_{\bs {\beta}} \Vert Q^T X \bs \beta - Q^T \bs y \Vert_2 = \min_{\bs {\beta}} \Vert R \bs \beta - \bs y'\Vert_2$$
Note that $R$ is upper triangular; the vector whose norm is to be minimized looks like:
$$\scriptsize \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ 0 & r_{22} & \cdots & r_{2n} \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots\ & 0 & r_{nn} \\ \hline 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n \end{pmatrix} - \begin{pmatrix} y'_1 \\ y'_2 \\ \vdots \\ y'_n \\ \hline y'_{n+1} \\ \vdots \\ y'_m \end{pmatrix} $$
Therefore, the solution is the $\bs \beta$ that zeros out the first $n$ elements of the vector.
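A minimal sketch of this QR-based least square solve (not in the original notes), reusing the xs and y arrays from the ridge regression example above, which are assumed to still be in scope:
q_x, r_x = np.linalg.qr(xs)                 # reduced QR: r_x is n x n upper triangular
y_prime = q_x.T.dot(y)                      # the first n elements of Q^T y
beta_qr = np.linalg.solve(r_x, y_prime)     # back substitution on R beta = y'
print(np.round(beta_qr, 4))                 # matches b_ols from the normal equations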
By SIAM:
Why single out the 20th century?
Slide rule:
Abacus:
Mahatma Gandhi: action expresses priorities.
The principal component (PC):
Suppose $\bs r$ is a random vector, with expectation $\bar{\bs r} = \mathbb{E}[\bs r]$ and covariance matrix $ V = \mathbb{E}[(\bs r - \bar{\bs r})(\bs r - \bar{\bs r})^T] $
PC is defined to be the direction $\hat{\bs u}$ onto which the projection $\bs r^T \hat{\bs u}$ has the maximum variance,
es = np.random.normal(size=[3, 300])
x = (1.5*es[0,:] + .25*es[1,:])*.3 + 1
y = es[0,:]*.4 + es[2,:]*.2
figure(figsize=[6, 4])
plot(x, y, '.')
xlim(-2, 4)
ylim(-2, 2);
xlabel('x')
ylabel('y');
cov = np.cov([x, y])
ev, evec = np.linalg.eig(cov)
ux = mean(x)
uy = mean(y)
arrow(ux, uy, -3*sqrt(ev[1])*evec[0, 1], -3*sqrt(ev[1])*evec[1, 1], width=.01, color='r')
arrow(ux, uy, 3*sqrt(ev[0])*evec[0, 0], 3*sqrt(ev[0])*evec[1, 0], width=.01, color='r');
title('Principal Components of 2-D Data');
The projection of the random vector in the direction of $\hat{\bs u}$ is a random scalar $\hat{\bs u}^T \bs r$
If we limit ourselves to all $\bs u$ that are perpendicular to $\bs v_1$, then:
The total variance is the sum of the variances of all $n$ elements of the random vector, which equals:
Therefore, the percentage of variance explained by the first $k$ eigenvectors is $\frac{\sum_i^k \lambda_i}{\sum_i^n \lambda_i}$.
PCA has some similarity to the least square problem: both can be viewed as minimizing the L2 norm of residual errors.
xs = np.array([x, np.ones(len(x))]).T
beta = np.linalg.inv(xs.T.dot(xs)).dot(xs.T).dot(y)
s_x, s_y = sp.symbols('x, y')
br = np.round(beta, 4)
figure(figsize=[13, 4])
subplot(1, 2, 1)
plot(x, y, '.')
xlim(-2, 4)
ylim(-2, 2);
xlabel('x')
ylabel('y');
dx = 4*sqrt(ev[0])*evec[0, 0]
dy = 4*sqrt(ev[0])*evec[1, 0]
arrow(ux-dx, uy-dy, 2*dx, 2*dy, width=.01, color='k');
ex = 3*sqrt(ev[1])*evec[0, 1]
ey = 3*sqrt(ev[1])*evec[1, 1]
plot([ux, ux+ex], [uy, uy+ey], 'r-o', lw=3)
legend(['data', 'residual error'], loc='best')
title('PCA');
subplot(1, 2, 2)
plot(x, y, '.')
xlim(-2, 4)
ylim(-2, 2);
xlabel('x')
ylabel('y');
arrow(-1, -1*beta[0]+beta[1], 4, 4*beta[0], width=.01, color='k');
y0 = (ux+ex)*beta[0] + beta[1]
plot([ux+ex, ux+ex], [y0, uy+ey], 'r-o', lw=3)
legend(['data', 'residual error'], loc='best')
title('Least Square');
The main differences between PCA and least square (regression):
| | PCA | Least Square/Regression |
|---|---|---|
| Scale Invariant | No | Yes |
| Symmetry in Dimension | Yes | No |
To choose between the two:
CMT rates are constant maturity Treasury yields, published daily by the U.S. Treasury.
cmt_rates = pd.read_csv("data/cmt.csv", parse_dates=[0], index_col=[0])
cmt_rates.plot(legend=False, title='Historical CMT yield');
tenors = cmt_rates.columns.map(float)
c = cmt_rates.cov()
fmt.displayDF(cmt_rates.corr(), "4g")
| | 0.25 | 0.5 | 1 | 2 | 3 | 5 | 7 | 10 | 20 |
|---|---|---|---|---|---|---|---|---|---|
0.25 | 1 | 0.9981 | 0.9923 | 0.9765 | 0.9612 | 0.9284 | 0.9017 | 0.8696 | 0.8146 |
0.5 | 0.9981 | 1 | 0.9971 | 0.9841 | 0.97 | 0.9388 | 0.9126 | 0.8822 | 0.8272 |
1 | 0.9923 | 0.9971 | 1 | 0.9938 | 0.9838 | 0.9582 | 0.9351 | 0.9086 | 0.8577 |
2 | 0.9765 | 0.9841 | 0.9938 | 1 | 0.9972 | 0.9816 | 0.9648 | 0.943 | 0.8995 |
3 | 0.9612 | 0.97 | 0.9838 | 0.9972 | 1 | 0.9927 | 0.9807 | 0.963 | 0.9254 |
5 | 0.9284 | 0.9388 | 0.9582 | 0.9816 | 0.9927 | 1 | 0.9966 | 0.9871 | 0.9604 |
7 | 0.9017 | 0.9126 | 0.9351 | 0.9648 | 0.9807 | 0.9966 | 1 | 0.9957 | 0.9779 |
10 | 0.8696 | 0.8822 | 0.9086 | 0.943 | 0.963 | 0.9871 | 0.9957 | 1 | 0.9903 |
20 | 0.8146 | 0.8272 | 0.8577 | 0.8995 | 0.9254 | 0.9604 | 0.9779 | 0.9903 | 1 |
The eigenvalues, percentage of variance explained, and leading eigenvectors:
x, v = np.linalg.eig(c)
pct_v = np.cumsum(x)/sum(x)*100
import pandas as pd
pd.set_option('display.precision', 3)
fmt.displayDF(pd.DataFrame({'P/C':range(1, len(x)+1), 'Eigenvalues':x, 'Cumulative Var(%)': pct_v}).set_index(['P/C']).T, "2f")
P/C | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
Cumulative Var(%) | 96.27 | 99.74 | 99.92 | 99.96 | 99.99 | 99.99 | 99.99 | 100.00 | 100.00 |
Eigenvalues | 34.33 | 1.24 | 0.06 | 0.02 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 |
flab = ['factor %d' % i for i in range(1, 4)]
fig = figure(figsize=[12, 4])
ax1 = fig.add_subplot(121)
plot(tenors, v[:, :3], '.-');
xlabel('Tenors(Y)')
legend(flab, loc='best')
title('First 3 principal components');
fs = cmt_rates.dot(v).iloc[:, :3]
ax2 = fig.add_subplot(122)
fs.plot(ax=ax2, title='Projections to the first 3 P/Cs')
legend(flab, loc='best');
PCA is a powerful technique, however, beware of its limitations:
PCA is usually applied to different data points of the same type
It is important to choose the right variable to apply PCA. The main considerations are:
Values or changes
Spot or forward
Since PCA is commonly applied to a correlation/covariance matrix, the question is equivalent to asking which variables' correlation/covariance matrix we should model.
Spurious correlation: variables can appear to be significantly correlated, but it is just an illusion:
f3 = pd.read_csv('data/f3.csv', parse_dates=[0]).set_index('Date').sort_index()
r = np.log(f3).diff()
fmt.displayDFs(f3.corr(), r.corr(), headers=['Price Correlation', 'Return Correlation'], fmt="4g")
(Output: two correlation tables, "Price Correlation" and "Return Correlation"; the table contents are not preserved here.)
The correlation between 9Y and 10Y spot rates and forward rates:
es = np.random.normal(size=[2, 500])
r0 = es[0,:]*.015 + .05
f1 = es[1,:]*.015 + .05
r1 = -np.log(np.exp(-9*r0-f1))/10
rc = np.corrcoef(np.array([r0, r1]))[0, 1]
fc = np.corrcoef(np.array([r0, f1]))[0, 1]
figure(figsize=[12, 4])
subplot(1, 2, 1)
plot(r0, r1, '.')
xlabel('9Y rate')
ylabel('10Y spot')
title('Corr(9Y Spot, 10Y Spot) = %.4f'% rc)
subplot(1, 2, 2)
plot(r0, f1, '.')
xlabel('9Y rate')
ylabel('9-10Y forward')
title('Corr(9Y Spot, 9-10Y Forward) = %.4f'% fc);
$$\exp(-9 r_9)\exp(-(10-9) f_9^{10} ) = \exp(-10 r_{10}) \iff r_{10} = 0.9r_9 + 0.1f_9^{10} $$
rotate, stretch and rotate, then you get back to square one, that's pretty much the life in a bank.
For any real matrix $M$, $\sigma > 0$ is a singular value if there exist two vectors $\bs {u, v}$ such that:
$\bs {u, v}$ are called the left and right singular vectors of $M$.
By convention,
Consider $\bs u^T M \bs v$ for real matrix $M$:
Apply the Lagrange multiplier, with constraints $\bs u^T \bs u = \bs v^T \bs v = 1$:
$$\small \begin{array} \\l &= \bs u^T M \bs v - \lambda_1 (\bs u^T \bs u - 1) - \lambda_2 (\bs v^T \bs v - 1) \\ \frac{\partial l}{\partial{\bs u}} &= \bs v^T M^T - 2 \lambda_1 \bs u^T = \bs 0^T \iff \bs u^T M \bs v - 2 \lambda_1 = 0\\ \frac{\partial l}{\partial{\bs v}} &= \bs u^T M - 2\lambda_2 \bs v^T = \bs 0^T \iff \bs u^T M \bs v- 2\lambda_2 = 0 \end{array}$$
Therefore, $$\small \bs u^T M \bs v = 2\lambda_1 = 2\lambda_2 \iff M \bs v = \sigma \bs u \;,\; M^T\bs u = \sigma \bs v $$
Given $\bs{u_1, v_1}$ are the first singular vectors of $M$:
because:
$$ (M \bs x)^T \bs u_1 = \bs x^T M^T \bs u_1 = \bs x^T \sigma_1 \bs v_1 = 0 $$
Therefore, among those unit $\bs u, \bs v$ that are orthogonal to $\bs{u_1, v_1}$ :
We can write all singular vectors and singular values in matrix format, which is the singular value decomposition (SVD):
$$ U^T M V = \Sigma \iff M = U \Sigma V^T $$
| | $M$ | $U$ | $\Sigma$ | $V$ |
|---|---|---|---|---|
Name | Original Matrix | Left singular vector | Singular value | Right singular vector |
Type | Real | Orthogonal, real | Diagonal, positive | Orthogonal, real |
Size | $m \times n$ | $m\times m$ | $m\times n$ | $n \times n$ |
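A minimal numerical illustration (not in the original notes), applying numpy's SVD to the $4 \times 4$ matrix `a` from the QR example, assumed to still be in scope:
u_m, s_m, vt_m = np.linalg.svd(a)                        # singular values in descending order
print(np.round(s_m, 3))
print(np.allclose(u_m.dot(np.diag(s_m)).dot(vt_m), a))   # U Sigma V^T reconstructs A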
Illustration from Wikipedia:
$$ M = U \Sigma V^T $$
SVD and eigenvalue decomposition (EVD) are closely related to each other:
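One way to see the connection numerically (a sketch, not in the original notes; it again reuses the matrix `a` from the QR example): the eigenvalues of $A^TA$ are the squared singular values of $A$:
s_a = np.linalg.svd(a, compute_uv=False)
w_ata = np.sort(np.linalg.eigvalsh(a.T.dot(a)))[::-1]    # eigenvalues of A^T A, descending
print(np.round(s_a**2, 3))
print(np.round(w_ata, 3))                                # equal: sigma_i^2 = lambda_i(A^T A)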
Given the SVD decomposition $X = U\Sigma V^T$, the SVD of $X$'s pseudo inverse is:
$$X^+ = (X^TX)^{-1}X^T = V \Sigma^+ U^T$$
where $\Sigma^+$ is the pseudo-inverse of $\Sigma$, obtained by taking the reciprocals of its non-zero diagonal elements (and transposing its shape).
Proof: the last equation is obviously true, and every step is reversible:
$$\begin{array}&(X^TX)^{-1}X^T &= V\Sigma^+ U^T \\ (V \Sigma^T \Sigma V^T)^{-1} V\Sigma^TU^T &= V\Sigma^+ U^T \\ (V \Sigma^T \Sigma V^T)^{-1} V\Sigma^T\Sigma &= V \Sigma^+\Sigma \\ (V \Sigma^T \Sigma V^T)^{-1} V\Sigma^T\Sigma &= V \\ (V \Sigma^T \Sigma V^T)^{-1} (V\Sigma^T\Sigma V^T) &= I \end{array}$$
SVD is another way to solve the least square problem.
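For example (a minimal sketch, not in the original notes, reusing the xs, y, and beta arrays from the PCA vs. least square comparison above, assumed to still be in scope), the pseudo-inverse computed via SVD reproduces the regression coefficients:
beta_svd = np.linalg.pinv(xs).dot(y)               # X^+ y = V Sigma^+ U^T y, via SVD
print(np.round(beta_svd, 4), np.round(beta, 4))    # matches the normal-equation solution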
Consider a real matrix $A$; its L2 norm is defined as the maximum of $\frac{\Vert A \bs x \Vert_2}{\Vert \bs x \Vert_2}$:
$$ \Vert A \Vert_2^2 = \max_{\bs x} \frac{\Vert A \bs x \Vert_2^2}{\Vert \bs x \Vert_2^2} = \max_{\bs x} \frac{\bs x^T A^T A \bs x}{\bs x^T \bs x} = \max_{\hat{\bs x}} \hat{\bs x}^T A^T A \hat{\bs x} = \sigma_1(A)^2$$
where $\hat{\bs x} = \frac{\bs x}{\Vert \bs x \Vert_2}$ is a unit vector, and $\sigma_1(A)$ is the largest singular value of $A$.
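A one-line numerical check (not in the original notes; it reuses the matrix `a` from the QR example, assumed to still be in scope):
print(np.linalg.norm(a, 2), np.linalg.svd(a, compute_uv=False)[0])   # both give sigma_1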
For a generic non-square matrix $A$:
SVD offers clear intuition for understanding the L2 condition number of a matrix $A$.
For a linear system $\bs y = A \bs x$, consider a perturbation $\delta \bs x$ to the input $\bs x$:
When working with a near-singular matrix, SVD can identify the ill-conditioned directions:
What is the change in $\lambda_i$ if we perturb the symmetric matrix $A$?
$$ A \bs v_i = \lambda_i \bs v_i$$
Apply perturbation analysis: write $\delta A = \dot{A} \epsilon$, $\delta \lambda_i = \dot{\lambda_i} \epsilon$, $\delta \bs v_i = \dot{\bs v_i} \epsilon$, and collect the first order terms in $\epsilon$:
$$\begin{array} \\ (A + \dot{A}\epsilon )(\bs v_i + \dot{\bs v_i}\epsilon) &= (\lambda_i + \dot{\lambda_i}\epsilon) (\bs v_i + \dot{\bs v_i} \epsilon) \\ \dot{A}\bs v_i + A \dot{\bs v_i} &= \lambda_i \dot{\bs v_i} + \dot{\lambda_i}\bs v_i \\ \bs v_i^T\dot{A}\bs v_i + \bs v_i^T A \dot{\bs v_i} &= \bs v_i^T \lambda_i \dot{\bs v_i} + \bs v_i^T \dot{\lambda_i}\bs v_i \\ \bs v_i^T\dot{A}\bs v_i &= \dot{\lambda_i} \end{array}$$
Therefore: $$\begin{array} \\ \frac{ | \delta \lambda_i |}{| \lambda_i |} &= \frac{\Vert \bs v_i^T\delta A \bs v_i\Vert_2}{|\lambda_i|} \le \frac{\Vert \delta A \Vert_2}{|\lambda_i|} = \frac{\Vert A\Vert_2}{|\lambda_i|} \frac{\Vert \delta A \Vert_2}{\Vert A \Vert_2} = \left\vert \frac{\lambda_1}{\lambda_i}\right\vert \frac{\Vert \delta A \Vert_2}{\Vert A \Vert_2} \end{array}$$
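A numerical check of this bound (a minimal sketch, not in the original notes; it perturbs the symmetric $4 \times 4$ matrix `a` from the QR example, assumed to still be in scope, by a small random symmetric matrix):
da = np.random.normal(size=a.shape, scale=1e-6)
da = (da + da.T)/2                                   # keep the perturbation symmetric
w0 = np.sort(np.linalg.eigvalsh(a))
w1 = np.sort(np.linalg.eigvalsh(a + da))
rel_change = np.abs(w1 - w0)/np.abs(w0)              # |delta lambda_i| / |lambda_i|
bound = np.abs(w0).max()/np.abs(w0)*np.linalg.norm(da, 2)/np.linalg.norm(a, 2)
print(np.all(rel_change <= bound))                   # True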
For real matrix $A$:
Decomposition | Applicability | Formula | Results | Key applications |
---|---|---|---|---|
LU | square, full rank | $A = LU$ | lower, upper triangular | matrix inversion |
Cholesky | symmetric positive definite | $A = LL^T$ | lower triangular | simulate correlated factors |
QR | any | $A = Q R$ | orthogonal, upper triangular | solve eigen values, least square |
EVD | symmetric | $A = R\Lambda R^T$ | orthogonal, real diagonal | PCA |
SVD | any | $A = U\Sigma V^T$ | orthogonal, positive diagonal, orthogonal | error analysis, rank reduction |
Required reading:
Recommended reading:
Homework: