Lecture 2: Linear Algebra

Topics

  • Review of Linear Algebra
  • Gaussian Elimination and LU Factorization
  • Cholesky decomposition
  • Matrix calculus and Mean/Var optimization
  • Norm and condition

Review of Linear Algebra

Vector

  • An ordered collection of elements, which can be real or complex:

$$ \renewcommand{bs}{\boldsymbol} \begin{matrix} \bs u = \left( \begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix} \right) \hspace{2cm} \bs v = \left( \begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix} \right) \end{matrix} $$

Basic vector operations

  • Vector addition: $\bs{w = u + v}$
  • Scalar multiplication: $\bs w = a \bs u$, where $a$ is a scalar

Notation

  • column vector: $\bs {u, v, x, \beta}$
  • row vector: $\bs u^T, \bs v^T, \bs x^T, \bs \beta^T$
  • scalar: $a, b, \alpha, \beta$
  • matrix: $A, B, P$

Vector addition

  • Associativity: $\bs{u + (v + w) = (u + v) + w}$
  • Commutativity: $\bs{u + v = v + u}$
  • Identity: $\bs{v + 0 = v}$ for all $\bs{v}$
  • Inverse: for every $\bs{v}$, there exists a $-\bs{v}$ such that $\bs{v + (-v) = 0}$
  • Distributivity: $a \bs{(u + v)} = a \bs{u} + a \bs v, (a+b) \bs v = a \bs v + b \bs v$

Scalar multiplication

  • Associativity: $a(b \bs v) = (ab)\bs v$
  • Identity: $1 \bs v = \bs v$ for all $\bs v$

Vector space $\Omega$

  • a collection of vectors that can be added and multiplied by scalars.

Vector subspace

  • a subset of the vector space $\Omega' \subset \Omega$ that is closed under vector addition and scalar multiplication

Linear combination

  • $\bs v = a_1 \bs v_1 + a_2 \bs v_2 + ... + a_n \bs v_n$
  • Linear combinations of vectors form a subspace $\Omega_v = \text{span}(\bs{v_1, v_2, ... v_n}) \subset \Omega$
  • Linear independence: the $\bs v_k$ are linearly independent if $\bs v = \bs 0 \iff a_k = 0$ for all $k$
  • Basis: any set of linearly independent $\bs v_i$ that spans $\Omega_v$
  • Dimension of $\Omega_v$ is the number of vectors in (any of) its basis

Inner product

  • $\langle \bs u, a \bs v_1 + b \bs v_2 \rangle = a \langle \bs{u, v_1} \rangle + b \langle \bs{u, v_2} \rangle$
  • $\langle \bs{u, v} \rangle = \langle \bs{v, u} \rangle^c$
  • $\langle \bs{u, u} \rangle \ge 0$
  • $\langle \bs{u, u} \rangle = 0 \iff \bs u = \bs 0$
  • $\bs u, \bs v$ are orthogonal if $\langle \bs{u, v} \rangle = 0$

Dot product

  • the standard inner product: $\bs{u \cdot v} = \sum_{k=1}^n u_k^c v_k$
  • the magnitude of a vector $\bs b$ is $\vert \bs b \vert = \sqrt{\bs b \cdot \bs b}$
  • the projection of vector $\bs a$ onto the direction of vector $\bs b$ is: $ a_1 = \frac{\bs a \cdot \bs b}{\vert \bs b \vert} = \vert \bs a \vert \cos(\theta) $, where $\theta$ is the angle between $\bs a$ and $\bs b$
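As a quick numerical illustration of these definitions (the vectors here are made up for the example), the dot product, magnitude and projection can be computed directly with numpy:

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

mag_b = np.sqrt(b.dot(b))      # |b| = 1
proj = a.dot(b) / mag_b        # projection of a onto the direction of b = 3
print(mag_b, proj)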

Matrix

  • Represents a linear response to multiple input factors: each column corresponds to an input factor and each row to an output

$$ \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} $$

  • Matrix addition and scalar multiplication are element wise, similar to those for vectors

Matrix multiplication

$$\begin{aligned} \bs u = A \bs v &\iff u_i = \sum_{j=1}^{n} a_{ij}v_j \\ C = AB &\iff c_{ij} = \sum_{k=1}^{n} a_{ik}b_{kj} = \bs a_i \cdot \bs b_j \end{aligned}$$

where $\bs a_i$ is the $i$-th row of $A$ and $\bs b_j$ is the $j$-th column of $B$.

Matrix represents linear transformation

Linear function on vectors:

  • $L(\bs{u + v}) = L(\bs u) + L(\bs v)$
  • $L(a \bs v) = a L(\bs v)$

Any linear transformation between finite-dimensional vector spaces can be represented by a matrix multiplication; therefore we can write $L \bs u$ instead of $L(\bs u)$.

Properties of linear transformation

  • Associativity: $A(BC) = (AB)C$
  • Distributivity:
    • $A(B+C) = AB + AC$
    • $(B+C)A = BA + CA$
    • $\alpha (A+B) = \alpha A + \alpha B$
  • But not commutative in general: $AB \ne BA$

Matrix definitions

  • Identity matrix $I$: $I A = A I = A$
  • $A^T$ is the transpose of $A$: $a^T_{ij} = a_{ji}$
  • Symmetric matrix: $A = A^T$
  • $A^*$ is the adjoint of $A$: $a^*_{ij} = a_{ji}^c$
    • real matrix: $A^T = A^*$
    • self-adjoint (Hermitian) matrix: $A = A^*$
  • Inverse matrix: $AA^{-1} = A^{-1}A = I$
  • Orthogonal matrix: $A^T = A^{-1} \iff AA^T = I$
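As a small sanity check of the last definition (using a 2x2 rotation matrix as an illustrative example), an orthogonal matrix satisfies $Q Q^T = I$ and $Q^T = Q^{-1}$:

import numpy as np

theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation matrix is orthogonal

print(np.allclose(Q.dot(Q.T), np.eye(2)))         # True
print(np.allclose(Q.T, np.linalg.inv(Q)))         # True: Q^T is the inverse of Q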

Gaussian Elimination and LU Factorization

Linear system

  • In matrix form, a linear system is $A \bs {x = y}$
  • It has a unique solution if $A$ is a full rank square matrix
In [2]:
%pylab inline
lecture = 2

import fmt
import sympy as sp
from IPython.display import display

sp.init_printing(use_latex = True)
Populating the interactive namespace from numpy and matplotlib

In [3]:
a = sp.Matrix([[2, 1, -1], [-6, -2, 4], [-2, 1, 2]])
y = sp.Matrix([8, -22, -3])
X = sp.MatrixSymbol('x', 3, 1)
x1, x2, x3 = sp.symbols('x_1, x_2, x_3')
x = sp.Matrix([x1, x2, x3])
fmt.displayMath(a, fmt.joinMath('=', x, y), sep="", pre="\\small ")
$$\small \left(\begin{matrix}2 & 1 & -1\\-6 & -2 & 4\\-2 & 1 & 2\end{matrix}\right)\left(\begin{matrix}x_{1}\\x_{2}\\x_{3}\end{matrix}\right)=\left(\begin{matrix}8\\-22\\-3\end{matrix}\right)$$

Gaussian elimination

Eliminate the $x_1$ terms using the first row; this operation is a linear transformation:

In [4]:
A = sp.MatrixSymbol('A', 3, 3)
L1 = sp.MatrixSymbol('L_1', 3, 3)
L2 = sp.MatrixSymbol('L_2', 3, 3)
l1 = sp.eye(3)
l1[1, 0] = -a[1, 0]/a[0, 0]
l1[2, 0] = -a[2, 0]/a[0, 0]

fmt.displayMath(L1*a, fmt.joinMath('=', x, L1*y), "\;,\;\;", l1*a, fmt.joinMath('=', x, l1*y), sep="", pre="\\small ")
$$\small L_{1} \left(\begin{matrix}2 & 1 & -1\\-6 & -2 & 4\\-2 & 1 & 2\end{matrix}\right)\left(\begin{matrix}x_{1}\\x_{2}\\x_{3}\end{matrix}\right)=L_{1} \left(\begin{matrix}8\\-22\\-3\end{matrix}\right)\;,\;\;\left(\begin{matrix}2 & 1 & -1\\0 & 1 & 1\\0 & 2 & 1\end{matrix}\right)\left(\begin{matrix}x_{1}\\x_{2}\\x_{3}\end{matrix}\right)=\left(\begin{matrix}8\\2\\5\end{matrix}\right)$$

Use the 2nd equation (row) to eliminate the $x_2$ terms:

In [5]:
l2 = sp.eye(3)
a2 = l1*a
y2 = l1*y
l2[2, 1] = -a2[2, 1]/a2[1, 1]
u = l2*a2
fmt.displayMath(L2*a2, fmt.joinMath('=', x, L2*y2), "\;,\;", u, fmt.joinMath('=', x, l2*y2), sep="", pre="\\small ")
$$\small L_{2} \left(\begin{matrix}2 & 1 & -1\\0 & 1 & 1\\0 & 2 & 1\end{matrix}\right)\left(\begin{matrix}x_{1}\\x_{2}\\x_{3}\end{matrix}\right)=L_{2} \left(\begin{matrix}8\\2\\5\end{matrix}\right)\;,\;\left(\begin{matrix}2 & 1 & -1\\0 & 1 & 1\\0 & 0 & -1\end{matrix}\right)\left(\begin{matrix}x_{1}\\x_{2}\\x_{3}\end{matrix}\right)=\left(\begin{matrix}8\\2\\1\end{matrix}\right)$$

The $L_1$ and $L_2$ are both lower triangular matrices:

In [6]:
Ui = sp.MatrixSymbol('U^{-1}', 3, 3)
U = sp.MatrixSymbol('U', 3, 3)
L = sp.MatrixSymbol('L', 3, 3)
fmt.displayMath(fmt.joinMath('=', L1, l1), fmt.joinMath('=', L2, l2), fmt.joinMath('=', U, u), pre="\\small ")
$$\small L_{1}=\left(\begin{matrix}1 & 0 & 0\\3 & 1 & 0\\1 & 0 & 1\end{matrix}\right)\;,\;\;\;L_{2}=\left(\begin{matrix}1 & 0 & 0\\0 & 1 & 0\\0 & -2 & 1\end{matrix}\right)\;,\;\;\;U=\left(\begin{matrix}2 & 1 & -1\\0 & 1 & 1\\0 & 0 & -1\end{matrix}\right)$$

The resulting matrix $U = L_2L_1A$ is upper triangular

LU factorization

A triangular matrix is easy to invert by successive substitution of variables (back substitution)

In [7]:
y3 = l2*y2
a3 = l2*a2
ui = a3.inv()
fmt.displayMath(fmt.joinMath('=', Ui, ui), "\;,\;", fmt.joinMath('=', x, Ui), fmt.joinMath('=', l2*y2, ui*y3), sep="\;", pre="\\small ")
$$\small U^{{-1}}=\left(\begin{matrix}\frac{1}{2} & - \frac{1}{2} & -1\\0 & 1 & 1\\0 & 0 & -1\end{matrix}\right)\;\;,\;\;\left(\begin{matrix}x_{1}\\x_{2}\\x_{3}\end{matrix}\right)=U^{{-1}}\;\left(\begin{matrix}8\\2\\1\end{matrix}\right)=\left(\begin{matrix}2\\3\\-1\end{matrix}\right)$$

Now we can group $L = L_1^{-1}L_2^{-1}$ and obtain the LU factorization

$$L_2 L_1 A = U \iff A = L_1^{-1} L_2^{-1} U \iff A = LU $$

  • $U$ is an upper triangular matrix.
  • There are infinitely many valid LU pairs; the convention is to set the diagonal elements of $L$ to 1.
In [8]:
l =  l1.inv()*l2.inv()
fmt.displayMath(fmt.joinMath('=', L, l), fmt.joinMath('=', U, a3), fmt.joinMath('=', L*U, l*a3), pre="\\small ")
$$\small L=\left(\begin{matrix}1 & 0 & 0\\-3 & 1 & 0\\-1 & 2 & 1\end{matrix}\right)\;,\;\;\;U=\left(\begin{matrix}2 & 1 & -1\\0 & 1 & 1\\0 & 0 & -1\end{matrix}\right)\;,\;\;\;L U=\left(\begin{matrix}2 & 1 & -1\\-6 & -2 & 4\\-2 & 1 & 2\end{matrix}\right)$$
  • The LU factorization is the matrix representation of Gaussian elimination
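In practice, this factor-then-solve pattern is available directly in scipy; a minimal sketch (assuming scipy is installed) that solves the example system above:

import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[2., 1., -1.], [-6., -2., 4.], [-2., 1., 2.]])
y = np.array([8., -22., -3.])

lu_piv = lu_factor(A)        # LU factorization with partial pivoting
x = lu_solve(lu_piv, y)      # forward then backward substitution
print(x)                     # [ 2.  3. -1.]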

Pivoting

  • Gaussian elimination does not work if a 0 appears on the diagonal (in a pivot position) of the matrix.
  • The rows of the matrix can be permuted first, so that the diagonal (pivot) elements have the greatest magnitude.

$$ A = P \cdot L \cdot U $$

where the $P$ matrix represents the row permutation.

  • The permutation (pivoting) also improves numerical stability
In [9]:
from scipy.linalg import lu

def displayMultiple(fs) :
    tl=map(lambda tc: '$' + sp.latex(tc) + '$',fs)
    r = '''
  <table border="0"><tr>'''
    for v in tl :
        r += "<td>" + v + "</td>"
    r += "</tr></table>"
    return r

a = sp.Matrix([[0, 3, 1, 2], [4, 0, -3, 1], [-3, 1, 0, 2], [9, 2, 5, 0]])
p, l, u = map(lambda x: sp.Matrix(x), lu(a))
Pi = sp.MatrixSymbol('P^{-1}', 4, 4)
A = sp.MatrixSymbol('A', 4, 4)
P = sp.MatrixSymbol('P', 4, 4)
fmt.displayMath(fmt.joinMath('=', A, a), fmt.joinMath('=', P, p), pre="\\small ")
$$\small A=\left(\begin{matrix}0 & 3 & 1 & 2\\4 & 0 & -3 & 1\\-3 & 1 & 0 & 2\\9 & 2 & 5 & 0\end{matrix}\right)\;,\;\;\;P=\left(\begin{matrix}0.0 & 1.0 & 0.0 & 0.0\\0.0 & 0.0 & 1.0 & 0.0\\0.0 & 0.0 & 0.0 & 1.0\\1.0 & 0.0 & 0.0 & 0.0\end{matrix}\right)$$
In [10]:
fmt.displayMath(sp.Eq (Pi*A, p.inv()*a), pre="\\small ")
$$\small P^{{-1}} A = \left(\begin{matrix}9.0 & 2.0 & 5.0 & 0\\0 & 3.0 & 1.0 & 2.0\\4.0 & 0 & -3.0 & 1.0\\-3.0 & 1.0 & 0 & 2.0\end{matrix}\right)$$

Complexity of numerical algorithm

The complexity of a numerical algorithm is stated as an order of magnitude, often in big-O notation:

  • binary search $O(\log(n))$
  • best sorting algorithm: $O(n \log(n))$

Most numerical linear algebra algorithms have complexity $O(n^3)$

  • e.g.: the LU decomposition (why?)
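A rough way to see the cubic scaling empirically is to time the factorization for a few matrix sizes; this is only a sketch, and the actual timings depend on the BLAS/LAPACK build and the hardware:

import time
import numpy as np
from scipy.linalg import lu_factor

for n in (500, 1000, 2000):
    a = np.random.rand(n, n)
    t0 = time.perf_counter()
    lu_factor(a)
    # for an O(n^3) algorithm, the time should grow roughly 8x per doubling of n
    print(n, time.perf_counter() - t0)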

Cholesky Decomposition

Symmetric positive definite Matrix

  • $V$ is positive definite if $\bs x^T V \bs x > 0$ for all $\bs{x \ne 0}$
  • $V$ is positive semi-definite if $\bs x^T V \bs x \ge 0$ for all $\bs x$
  • Positive semi-definite does not mean every element in the matrix is positive

Covariance matrix

The most important and ubiquitous matrices in quantitative finance,

  • capturing the linear dependencies of random factors $\bs r = [r_1, ..., r_n]^T$
  • we use $\bar{\bs r} = \mathbb{E}[\bs r]$ to denote their expectations

$$\begin{aligned} V &= \mathbb{E}[(\bs r - \bar{\bs r})(\bs r - \bar{\bs r})^T] = \mathbb{E}[\bs r \bs r^T] - \bar{\bs r}\bar{\bs r}^T \\ &= (\text{cov}(r_i, r_j)) = (\rho_{ij} \sigma_i \sigma_j) \end{aligned}$$

Covariance of linear combinations:

$$\begin{aligned} \text{cov}(\bs x^T \bs r, \bs y^T \bs r) &= \mathbb{E}[(\bs x^T \bs r)(\bs r^T \bs y)] - \mathbb{E}[\bs x^T \bs r]\mathbb{E}[\bs r^T \bs y]\\ &= \bs x^T \mathbb{E}[\bs r \bs r^T] \bs y - \bs x^T \bar{\bs r}\bar{\bs r}^T \bs y = \bs x^T V \bs y \end{aligned}$$

Correlation matrix

  • $C = (\rho_{ij})$ is the covariance matrix of the normalized factors $\bs s = [\frac{r_1}{\sigma_1}, ..., \frac{r_n}{\sigma_n}]^T$
  • $V = \bs \Sigma C \bs \Sigma $, where $\Sigma$ is a diagonal matrix of $\sigma_i$

Both covariance and correlation matrices are symmetric positive semi-definite (SPD):

  • $\bs x^T V \bs x = \text{cov}[\bs x^T \bs r, \bs x^T \bs r] = \text{var}[\bs x^T \bs r] \ge 0$
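A quick numerical check of this identity, using simulated factor returns (the data here are synthetic, purely for illustration):

import numpy as np

np.random.seed(0)
r = np.random.normal(size=(10000, 3))          # synthetic factor returns
V = np.cov(r, rowvar=False)                    # sample covariance matrix
x = np.array([0.5, 0.3, 0.2])                  # portfolio weights

# the variance of the portfolio P&L x^T r equals x^T V x
print(np.allclose(x.dot(V).dot(x), np.var(r.dot(x), ddof=1)))   # True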

Example: weekly prices and returns

In [11]:
import pandas as pd
f3 = pd.read_csv('data/f3.csv', parse_dates=[0]).set_index('Date').sort_index()

fig = figure(figsize=[12, 4])
ax1 = fig.add_subplot(121)
f3.plot(title='Historical Prices', ax=ax1);

ax2 = fig.add_subplot(122)
r = np.log(f3).diff()
r.plot(title='Historical Returns', ax=ax2);

Correlation and covariance matrix of weekly returns:

In [12]:
cm = r.corr()
cv = r.cov()
    
fmt.displayDFs(cm, cv*1e4, headers=['Correlation', 'Covariance'], fmt="4f")
Correlation:
        SPY     GLD     OIL
SPY  1.0000  0.0321  0.4191
GLD  0.0321  1.0000  0.3011
OIL  0.4191  0.3011  1.0000

Covariance ($\times 10^{-4}$):
        SPY     GLD      OIL
SPY  8.0432  0.2461   5.6109
GLD  0.2461  7.2876   3.8375
OIL  5.6109  3.8375  22.2823
  • Why don't we compute the correlation and covariance of the price levels?

Cholesky decomposition

If $A$ is a symmetric positive semi-definite (SPD) matrix:

  • $A$ can be decomposed as $A = LL^T$, where $L$ is lower triangular
  • In other words, $U = L^T$ in $A$'s LU decomposition
  • $L$ can be viewed as the "square root" of $A$

The Cholesky decomposition is unique:

  • $L$ has the same rank as $A$

The Cholesky decomposition of the previous correlation matrix:

In [13]:
lcm = np.linalg.cholesky(cm)
lcv = np.linalg.cholesky(cv)
fmt.displayMath("\\small ", sp.Matrix(lcm).evalf(4), fmt.joinMath('=', sp.Matrix(lcm.T).evalf(4), sp.Matrix(lcm.dot(lcm.T)).evalf(4)), 
                sep="")
$$ \small \left(\begin{matrix}1.0 & 0 & 0\\0.03215 & 0.9995 & 0\\0.4191 & 0.2878 & 0.8611\end{matrix}\right)\left(\begin{matrix}1.0 & 0.03215 & 0.4191\\0 & 0.9995 & 0.2878\\0 & 0 & 0.8611\end{matrix}\right)=\left(\begin{matrix}1.0 & 0.03215 & 0.4191\\0.03215 & 1.0 & 0.3011\\0.4191 & 0.3011 & 1.0\end{matrix}\right)$$

Recursive algorithm

An SPD matrix $A$ and its Cholesky decomposition $L$ can be partitioned as:

In [14]:
s_A, a11, A12, A21, A22 = sp.symbols("A, a_{11} A_{21}^T A_{21} A_{22}")
s_L, s_LT, l11, L12, L12T, L22, L22T = sp.symbols("L L^T l_{11} L_{21} L_{21}^T L_{22} L_{22}^T")

A = sp.Matrix([[a11, A12], [A21, A22]])
L = sp.Matrix([[l11, 0], [L12, L22]])
LT = sp.Matrix([[l11, L12T], [0, L22T]])

fmt.displayMath(fmt.joinMath('=', s_A, A), fmt.joinMath('=', s_L, L), fmt.joinMath('=', s_LT, LT))
$$ A=\left(\begin{matrix}a_{{11}} & A^{T}_{{21}}\\A_{{21}} & A_{{22}}\end{matrix}\right)\;,\;\;\;L=\left(\begin{matrix}l_{{11}} & 0\\L_{{21}} & L_{{22}}\end{matrix}\right)\;,\;\;\;L^{T}=\left(\begin{matrix}l_{{11}} & L^{T}_{{21}}\\0 & L^{T}_{{22}}\end{matrix}\right)$$

From $A = LL^T$:

In [15]:
fmt.displayMath(fmt.joinMath('=', A,  L*LT))
$$ \left(\begin{matrix}a_{{11}} & A^{T}_{{21}}\\A_{{21}} & A_{{22}}\end{matrix}\right)=\left(\begin{matrix}l_{{11}}^{2} & L^{T}_{{21}} l_{{11}}\\L_{{21}} l_{{11}} & L_{{21}} L^{T}_{{21}} + L_{{22}} L^{T}_{{22}}\end{matrix}\right)$$

We immediately have:

In [16]:
fmt.displayMath(fmt.joinMath('=', l11, sp.sqrt(a11)), fmt.joinMath('=', L12,  A21/l11), 
                fmt.joinMath('=', A22 - L12*L12T, L22*L22T))
$$ l_{{11}}=\sqrt{a_{{11}}}\;,\;\;\;L_{{21}}=\frac{A_{{21}}}{l_{{11}}}\;,\;\;\;A_{{22}} - L_{{21}} L^{T}_{{21}}=L_{{22}} L^{T}_{{22}}$$

Note that $L_{22}$ is the Cholesky decomposition of the smaller matrix $A_{22} - \frac{1}{a_{11}}A_{21}A_{21}^T$, so the factorization can be computed by applying the same step recursively.
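A direct translation of this recursion into Python might look like the following; it is a teaching sketch rather than a replacement for np.linalg.cholesky:

import numpy as np

def chol(A):
    # recursive Cholesky factorization of an SPD matrix: returns lower-triangular L with A = L L^T
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros_like(A)
    L[0, 0] = np.sqrt(A[0, 0])                       # l_11 = sqrt(a_11)
    if n == 1:
        return L
    L[1:, 0] = A[1:, 0] / L[0, 0]                    # L_21 = A_21 / l_11
    S = A[1:, 1:] - np.outer(L[1:, 0], L[1:, 0])     # Schur complement A_22 - L_21 L_21^T
    L[1:, 1:] = chol(S)                              # recurse on the smaller matrix
    return L

# quick check against numpy on a small SPD matrix (illustrative values)
A = np.array([[4., 2., 2.], [2., 3., 1.], [2., 1., 3.]])
print(np.allclose(chol(A), np.linalg.cholesky(A)))   # True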

Correlated Brownian motion

Ubiquitous in quantitative finance:

  • The common underlying process for asset prices and other risk factors
  • The vast majority of quant finance models are based on Brownian motions.

Steps for drawing correlated $n$-dimensional geometric Brownian motion paths:

$$ \frac{dx^k_t}{x^k_t} = u^k dt + \sigma^k dw^k_t, \;\;\;\;dw_t^j\cdot dw_t^k = \rho_{jk} dt $$

  1. compute the Cholesky decomposition of the correlation matrix: $C = LL^T$
  2. draw an $n$-dimensional vector of independent standard normal random numbers $\bs \epsilon = (\epsilon_1, \epsilon_2, ..., \epsilon_n)^T$; these are the normalized risk factors
  3. $L\bs \epsilon$ is the correlated normal random vector, and the correlated Brownian increments are $dw_t = L\bs \epsilon\sqrt{dt}$

Simulated paths

In [17]:
e = np.random.normal(size=[3, 20000])

figure(figsize=[12, 4])

subplot(1, 2, 1)
dw = (lcm.dot(e)).T
dws = dw*np.diag(cv)
wcm = np.exp(np.cumsum(dws, 0))
plot(wcm)
legend(cm.index, loc='best');
title('$L \epsilon$')

subplot(1, 2, 2)
dw2 = (lcm.T.dot(e)).T
dws2 = dw2*np.diag(cv)
wcm2 = np.exp(np.cumsum(dws2, 0))
plot(wcm2)
legend(cm.index, loc='best');
title('$L^T \epsilon$');
  • Identical results can be obtained by using the Cholesky decomposition of the covariance matrix.
  • $L \bs \epsilon$ and $L^T \bs \epsilon$ are different; which one gives back the intended correlation?
In [18]:
df1 = pd.DataFrame(np.corrcoef(dws.T), columns=cm.index, index=cm.index)
df2 = pd.DataFrame(np.corrcoef(dws2.T), columns=cm.index, index=cm.index)

fmt.displayDFs(df1, df2, headers=['$L \epsilon$', '$L^T \epsilon$'])
Sample correlation of $L \epsilon$:
         SPY      GLD     OIL
SPY  1        0.03515  0.4232
GLD  0.03515  1        0.2931
OIL  0.4232   0.2931   1

Sample correlation of $L^T \epsilon$:
         SPY      GLD     OIL
SPY  1        0.1354   0.3878
GLD  0.1354   1        0.2696
OIL  0.3878   0.2696   1

The resulting covariance (correlation) is: $$\mathbb{E}[L\bs \epsilon (L\bs \epsilon)^T] = \mathbb{E}[L\bs{\epsilon \epsilon}^T L^T] = LL^T = C$$

Cautions on correlation/covariance matrices

The covariance/correlation matrices are extremely useful, but they become difficult to deal with as their size grows:

  • In practice, we often deal with thousands or tens of thousands of risk factors
  • The correlation matrix is often easier to work with than the covariance matrix, because it is "normalized"
  • It is very difficult to keep a large correlation matrix positive semi-definite (SPD)
    • A correlation matrix of a few thousand factors is the practical limit
    • Small changes in isolated values can invalidate the whole correlation matrix
  • Cholesky decomposition often fails for large matrices due to non-SPD inputs or numerical noise (see the diagnostic sketch below)
  • Adding new entries to a large correlation matrix can be extremely hard
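One common diagnostic is to inspect the eigenvalues of the correlation matrix: Cholesky fails exactly when some eigenvalue is negative. The sketch below also shows a crude repair heuristic (clipping negative eigenvalues and rescaling the diagonal); it is only one of several possible fixes, not a production method:

import numpy as np

def is_psd(C, tol=0.0):
    # a correlation matrix admits a Cholesky factor only if all eigenvalues are >= 0
    return np.linalg.eigvalsh(C).min() >= tol

def clip_to_psd(C, floor=1e-10):
    # crude repair: clip eigenvalues at a small positive floor, then restore the unit diagonal
    w, v = np.linalg.eigh(C)
    C2 = v.dot(np.diag(np.clip(w, floor, None))).dot(v.T)
    d = np.sqrt(np.diag(C2))
    return C2 / np.outer(d, d)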

Matrix Calculus and Mean/Var Optimization

Scalar function

$$f(\bs x) = f(x_1, ..., x_n)$$

  • Derivative with respect to a vector (gradient): $\frac{\partial f}{\partial \bs x} = \nabla f = [\frac{\partial f}{\partial x_1}, ..., \frac{\partial f}{\partial x_n}]$
  • Note that $\frac{\partial f}{\partial \bs x}$ is a row vector; whenever a vector or matrix appears in the denominator of a differentiation, the result is transposed.

Vector function

$$\bs{ y(x) } = [y_1(\bs x), ..., y_n(\bs x)]^T$$

  • Derivative with respect to a vector (Jacobian matrix): $\frac{\partial \bs y}{\partial \bs x} = \left(\frac{\partial{y_i}}{\partial x_j}\right)$
    • the rows of $\frac{\partial \bs y}{\partial \bs x}$ are indexed by the elements of $\bs y$ and the columns by the elements of $\bs x$
    • $\frac{\partial \bs y}{\partial \bs x}\frac{\partial \bs x}{\partial \bs y} = I$, even when $\bs x$ and $\bs y$ have different dimensions
  • Derivative with respect to a scalar: $\frac{\partial \bs y}{\partial z} = [\frac{\partial y_1}{\partial z}, ..., \frac{\partial y_n}{\partial z}]^T$
    • remains a column vector

Vector differentiation cheatsheet

  • $A, a, b, \bs c$ are constants (i.e., not functions of $\bs x$)
  • $\bs u = \bs u(\bs x), \bs v = \bs v(\bs x), y = y(\bs x)$ are functions of $\bs x$

Expression | Result | Special cases
$\frac{\partial(a \bs u + b \bs v)}{\partial \bs x}$ | $a \frac{\partial{\bs u}}{\partial \bs x} + b\frac{\partial{\bs v}}{\partial \bs x}$ | $\frac{\partial{\bs c}}{\partial \bs x} = 0, \; \frac{\partial{\bs x}}{\partial \bs x} = I$
$\frac{\partial{(A \bs u)}}{\partial \bs x}$ | $A\frac{\partial{\bs u}}{\partial \bs x}$ | $\frac{\partial{(A \bs x)}}{\partial \bs x} = A, \; \frac{\partial{(\bs x^T A)}}{\partial \bs x} = A^T$
$\frac{\partial{(y \bs u)}}{\partial \bs x}$ | $y \frac{\partial{\bs u}}{\partial \bs x} + \bs u \frac{\partial{y}}{\partial \bs x}$ |
$\frac{\partial (\bs{u}^T A \bs v)}{\partial \bs x}$ | $\bs u^T A \frac{\partial{\bs v}}{\partial \bs x} + \bs v^T A^T \frac{\partial{\bs u}}{\partial \bs x}$ | $\frac{\partial (\bs{x}^T A \bs x)}{\partial \bs x} = \bs x^T(\bs{A + A}^T), \; \frac{\partial (\bs u^T \bs v)}{\partial \bs x} = \bs{u}^T\frac{\partial \bs{v}}{\partial \bs x} + \bs{v}^T\frac{\partial \bs{u}}{\partial \bs x}$
$\frac{\partial{\bs g(\bs u})}{\partial \bs x}$ | $\frac{\partial{\bs g}}{\partial \bs u} \frac{\partial{\bs u}}{\partial \bs x}$ | chain rule over multiple steps, $\frac{\partial \bs y}{\partial \bs x}\frac{\partial \bs x}{\partial \bs y} = I$
  • Very similar expressions to univariate calculus.
  • Replace $A^T$ by $A^*$ for complex matrices.
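The quadratic-form identity in the table can be checked numerically with a central finite difference (random matrix and point, purely for illustration):

import numpy as np

np.random.seed(42)
A = np.random.rand(4, 4)
x = np.random.rand(4)
f = lambda z: z.dot(A).dot(z)                    # f(x) = x^T A x

eps = 1e-6
grad_fd = np.array([(f(x + eps*e) - f(x - eps*e)) / (2*eps) for e in np.eye(4)])
grad_analytic = x.dot(A + A.T)                   # x^T (A + A^T), as a row vector

print(np.allclose(grad_fd, grad_analytic, atol=1e-5))   # True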

Portfolio optimization

  • Powerful mean/variance portfolio theory can be expressed succinctly using linear algebra and matrix calculus.

Suppose there are $n$ risky assets on the market, with random return vector $\tilde{\bs r}$ whose covariance matrix is $V$,

  • $\bs w$: a portfolio, its elements are dollar amounts invested in each asset
    • $\bs w^T \tilde{\bs r}$ is the portfolio's P&L in dollar amount
    • $\sigma^2 = \bs w^TV\bs w$: the variance of the portfolio P&L
  • $\bs w_b$: a benchmark portfolio, e.g. a sector index
    • $\bs \beta_b = \frac{V\bs w_b}{\sigma^2_b} $: the beta vector of individual assets to the benchmark portfolio
    • $\bs w_b^T \bs \beta_b = \frac{\bs w_b^T V\bs w_b}{\sigma^2_b} = 1$, the benchmark portfolio itself has a beta of 1
  • $\bs w_m$: the market portfolio, as defined in CAPM
    • $\bs \beta_m = \frac{V\bs w_m}{\sigma^2_m}$: the beta vector to the market portfolio $\bs w_m$

Excess return forecast

  • $\bs f = \mathbb{E}[\tilde{\bs r}] - r_0$ is a vector of excess return forecasts for all assets
    • $r_0$ is the risk free rate
  • $\bs f$ is a view, which can come from:
    • Fundamental research: earnings forecasts, revenue growth, etc.
    • Technical and quantitative analysis
    • Your secret trading signal

Optimal portfolio and Sharpe ratio

Given the view $\bs f$, the optimal portfolio $\bs w$ to express the view:

  • minimizes the variance of the portfolio P&L (risk): $\bs w^TV\bs w$
  • while preserving a unit dollar of excess P&L: $\bs w^T \bs f = 1$

This yields the portfolio with the largest Sharpe ratio under the view $\bs f$.

  • the optimal portfolio can then be scaled to match the dollar amount to be invested
  • the scaling does not change the portfolio's Sharpe ratio or its optimality

Why do we take the covariance matrix $V$ as a constant, but treat the expected return $\bs f$ as a variable?

Characteristic portfolio

The optimal portfolio can be solved analytically using a Lagrange multiplier and matrix calculus:

$$\begin{aligned} l &= \bs w^TV \bs w - 2 \lambda (\bs f^T \bs w - 1) \\ \frac{\partial l}{\partial \bs w} &= 2 \bs w^T V - 2 \lambda \bs f^T = \bs 0^T \end{aligned}$$ $$ V \bs w - \lambda \bs f = 0 \iff \bs w = \lambda V^{-1} \bs f $$

Plugging this into $\bs f^T \bs w = 1$:

$$ \bs f^T(\lambda V^{-1} \bs f) = 1 \iff \lambda = \frac{1}{\bs f^T V^{-1} \bs f}$$ $$\bs w(\bs f) = \frac{V^{-1}\bs f}{\bs f^T V^{-1} \bs f}$$

$\bs w (\bs f)$ is also known as the characteristic portfolio for the forecast $\bs f$.
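The closed-form solution maps directly to code; a minimal sketch of a helper (using np.linalg.solve instead of an explicit matrix inverse, which is preferable numerically):

import numpy as np

def characteristic_portfolio(V, f):
    # w(f) = V^{-1} f / (f^T V^{-1} f): minimum variance subject to f^T w = 1
    Vinv_f = np.linalg.solve(V, f)       # computes V^{-1} f without forming V^{-1}
    return Vinv_f / f.dot(Vinv_f)

# example usage with made-up numbers:
# w = characteristic_portfolio(np.diag([0.04, 0.09]), np.array([0.05, 0.02]))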

Relativity of return forecast

The optimal portfolio is essentially unchanged if $\bs f$ is scaled by a scalar $a$:

$$\bs w(a\bs f) = \frac{1}{a} \bs w(\bs f)$$

  • the portfolios $\bs w$ and $\frac{1}{a} \bs w$ have no material difference
  • they are identical after scaling to match the dollar amount to be invested

Therefore, the return forecast $\bs f$ is only meaningful in a relative sense:

  • it is only important to forecast the relative size of excess returns
  • e.g. excess return forecasts of [1%, 2%] and [10%, 20%] for two assets result in the same optimal portfolio

The return forecast $\bs f$ should only be interpreted in a relative sense; we use $\propto$ to make this explicit.

Important characteristic portfolios

Identical returns: $\bs f \propto \bs e = [1, 1, ...., 1]^T$:

  • A naive view that all assets have the same excess returns (zero information)
  • $\bs e^T\bs w = 1$ means it is a fully invested portfolio of \$1
  • $\bs w_e = \frac{V^{-1}\bs e}{\bs e^TV^{-1}\bs e}$ has the minimum variance amongst all fully invested portfolios

Beta to a benchmark portfolio: $\bs f \propto \bs \beta_b = \frac{V\bs w_b}{\sigma_b^2}$:

  • $\bs \beta_b^T \bs w = 1$ means the portfolio has a beta of 1 to the benchmark portfolio $\bs w_b$
  • $\bs w_{\beta} = \frac{V^{-1}\bs \beta_b}{\bs \beta_b^T V^{-1}\bs \beta_b} = \frac{\bs w_b}{\bs{\beta_b^T w_b}} = \bs w_b$, which is the benchmark portfolio itself
  • the benchmark portfolio itself is the optimal portfolio amongst those with unit beta.

Implied view of a portfolio

From any portfolio, we can back out its implied excess return forecast $\bs f$:

  • It is nothing but the vector of betas to the portfolio itself, following the previous slides
  • Only meaningful in a relative sense as well
  • Normalization required to produce absolute excess return forecast

Consider the market portfolio: the market collectively believes that $\bs f \propto \bs \beta_m$:

  • the absolute implied excess returns are obtained by normalizing to the market excess return:

$$\mathbb{E}[\tilde{\bs r}] - r_0 = \bs \beta_m \left(\mathbb{E}[\tilde r_m] - r_0\right)$$

  • where $\tilde r_m$ is the market portfolio's return; this is exactly the CAPM.

The implied excess return is a much better return estimate than historical data.

  • How many years of historical data do we need to reliably estimate expected returns?

~500 years of data

Portfolio optimization example

The covariance matrix estimated from historical weekly returns is:

In [19]:
fmt.displayDF(cv*1e4, "4g")
SPY GLD OIL
SPY 8.043 0.2461 5.611
GLD 0.2461 7.288 3.838
OIL 5.611 3.838 22.28

We consider three risky assets, with the following excess return forecast:

In [20]:
er = np.array([.05, .02, .01]).T
df_er = pd.DataFrame(np.array([er]).T*100, columns=["Excess Return Forecast (%)"], index = f3.columns).T
fmt.displayDF(df_er, "2f")
SPY GLD OIL
Excess Return Forecast (%) 5.00 2.00 1.00

The optimal portfolio for the given forecast is:

In [21]:
cvi = np.linalg.inv(cv)
w = cvi.dot(er.T)/er.T.dot(cvi).dot(er)
df_er.loc['Optimal Portfolio', :] = w

fmt.displayDF(df_er[-1:], "3g")
SPY GLD OIL
Optimal Portfolio 17.6 8.35 -4.83

Implied views

Suppose we are given the following portfolio:

In [22]:
w = np.array([10, 5, 5])
vb = w.dot(cv).dot(w)
ir = cv.dot(w)/vb

df = pd.DataFrame(np.array([w, ir])*100, index=["$ Position", "Implied Return %"], 
                  columns = ["SPY", "GLD", "OIL"])
fmt.displayDF(df[:1], "4g")
SPY GLD OIL
$ Position 1,000 500 500

We can compute its implied return forecast as:

In [23]:
fmt.displayDF(df[1:], "2f")
SPY GLD OIL
Implied Return % 4.73 2.50 8.04
  • note that these forecasts are only meaningful in a relative sense

Does the investor really have so much confidence in OIL?

Norm and Condition

Ill-conditioned covariance matrix

Mean/variance optimization is very powerful, but there are potential pitfalls in practice:

  • Suppose we have the following covariance matrix and excess return forecast, then we can compute the optimal portfolio.
In [24]:
nt = 1000
es = np.random.normal(size=[2, nt])

rho = .999999
e4 = rho/np.sqrt(2)*es[0, :] + rho/np.sqrt(2)*es[1,:] + np.sqrt(1-rho*rho)*np.random.normal(size=[1, nt])
es = np.vstack([es, e4])

cor = np.corrcoef(es)
cor1 = np.copy(cor)
cor1[0, 1] = cor1[1, 0] = cor[0, 1] + .00002
 
sd = np.eye(3)
np.fill_diagonal(sd, np.std(r))
cov = sd.dot(cor).dot(sd.T)
cov1 = sd.dot(cor1).dot(sd.T)

e, v = np.linalg.eig(np.linalg.inv(cov))
In [25]:
er = v[:, 2]/10

df_cov = pd.DataFrame(cov*1e4, index=f3.columns, columns=f3.columns)
df_cov1 = pd.DataFrame(cov1*1e4, index=f3.columns, columns=f3.columns)

pf = pd.DataFrame(np.array([er]), columns=f3.columns, index=['Expected Return'])
covi = np.linalg.inv(cov)
pf.loc['Optimal Portfolio', :] = covi.dot(er.T)/er.T.dot(covi).dot(er)

fmt.displayDFs(df_cov, pf, headers=["Covariance", "Optimized Portfolio"], fmt="4f")
Covariance ($\times 10^{-4}$):
         SPY      GLD      OIL
SPY   8.0234  -0.1578   9.7193
GLD  -0.1578   7.2697   8.5246
OIL   9.7193   8.5246  22.2275

Optimized Portfolio:
                      SPY     GLD     OIL
Expected Return    0.0384  0.0325  0.0864
Optimal Portfolio  3.8417  3.2496  8.6419
  • A few days later, there is a tiny change in the covariance matrix, but ...
In [26]:
pf1 = pd.DataFrame(np.array([er]), columns=f3.columns, index=['Expected Return'])
covi1 = np.linalg.inv(cov1)
pf1.loc['Optimal Portfolio', :] = covi1.dot(er.T)/er.T.dot(covi1).dot(er)

fmt.displayDFs(df_cov1, pf1, headers=["Covariance", "Optimized Portfolio"], fmt="4f")
Covariance ($\times 10^{-4}$):
         SPY      GLD      OIL
SPY   8.0234  -0.1576   9.7193
GLD  -0.1576   7.2697   8.5246
OIL   9.7193   8.5246  22.2275

Optimized Portfolio:
                      SPY     GLD      OIL
Expected Return    0.0384  0.0325   0.0864
Optimal Portfolio  0.5848  0.0864  11.2791

Ill-conditioned linear system

Consider the following linear system $A\bs x = \bs y$, and its solution:

In [27]:
a = np.array([[1, 2], [2, 3.999]])
x = sp.MatrixSymbol('x', 2, 1)
y = np.array([4, 7.999])
fmt.displayMath(fmt.joinMath('=', sp.Matrix(a)*x, sp.Matrix(y)), 
                fmt.joinMath('=', x, sp.Matrix(np.round(np.linalg.solve(a, y), 4))))
$$ \left(\begin{matrix}1.0 & 2.0\\2.0 & 3.999\end{matrix}\right) x=\left(\begin{matrix}4.0\\7.999\end{matrix}\right)\;,\;\;\;x=\left(\begin{matrix}2.0\\1.0\end{matrix}\right)$$

A small perturbation on vector $\bs y$:

In [28]:
z = np.copy(y)
z[1] += .002

fmt.displayMath(fmt.joinMath('=', sp.Matrix(a)*x, sp.Matrix(z)), 
                fmt.joinMath('=', x, sp.Matrix(np.round(np.linalg.solve(a, z), 4))))
$$ \left(\begin{matrix}1.0 & 2.0\\2.0 & 3.999\end{matrix}\right) x=\left(\begin{matrix}4.0\\8.001\end{matrix}\right)\;,\;\;\;x=\left(\begin{matrix}6.0\\-1.0\end{matrix}\right)$$

A small perturbation on matrix $A$:

In [29]:
b = np.copy(a)
b[1, 1] += .003

fmt.displayMath(fmt.joinMath('=', sp.Matrix(b)*x, sp.Matrix(y)), 
                fmt.joinMath('=', x, sp.Matrix(np.round(np.linalg.solve(b, y), 4))))
$$ \left(\begin{matrix}1.0 & 2.0\\2.0 & 4.002\end{matrix}\right) x=\left(\begin{matrix}4.0\\7.999\end{matrix}\right)\;,\;\;\;x=\left(\begin{matrix}5.0\\-0.5\end{matrix}\right)$$
  • How do we identify ill-conditioned linear systems in practice?

Vector norms

A norm is a measure of the magnitude of a vector:

  • Positive: $\Vert \bs u\Vert \ge 0$, $\Vert \bs u\Vert = 0 \iff \bs{u = 0}$
  • Homogeneous: $\Vert a \bs u \Vert = |a| \Vert \bs u\Vert $
  • Triangle inequality: $\Vert \bs u + \bs v\Vert \le \Vert \bs u\Vert + \Vert \bs v\Vert $

Common vector norms

  • L1: $\Vert \bs u\Vert _1 = \sum_i | u_i |$
  • L2 (Euclidean): $\Vert \bs u\Vert _2 = (\sum_i u_i^2)^{\frac{1}{2}} = (\bs u^T \bs u)^\frac{1}{2}$
  • Lp: $\Vert \bs u\Vert _p = (\sum_i |u_i|^p)^{\frac{1}{p}}$
  • L${\infty}$: $\Vert \bs u\Vert _\infty = \max_i |u_i|$
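These norms are available directly via np.linalg.norm (illustrative vector):

import numpy as np

u = np.array([3.0, -4.0, 1.0])
print(np.linalg.norm(u, 1))        # L1: |3| + |-4| + |1| = 8
print(np.linalg.norm(u, 2))        # L2: sqrt(9 + 16 + 1) ~ 5.099
print(np.linalg.norm(u, np.inf))   # L-infinity: max |u_i| = 4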

Vector norms comparison

Vectors with unit norms:

  • Unit L2 norm forms a perfect circle (sphere in high dimension)
  • Unit L1 and L${\infty}$ norms are square boxes
  • The differences between the norms are not significant

Matrix norms

Defined to be the largest amount the linear transformation can stretch a vector:

$$\Vert A\Vert = \max_{\bs u \ne 0}\frac{\Vert A\bs u\Vert }{\Vert \bs u\Vert }$$

The matrix norm depends on the choice of vector norm. Only the L1 and L$\infty$ matrix norms have simple analytical formulas:

  • L1: $\Vert A\Vert _1 = \max_{j} \sum_i |a_{ij}|$
  • L2: $\Vert A \Vert_2 =$ the largest singular value of $A$
  • L${\infty}$: $\Vert A\Vert _\infty = \max_i \sum_j |a_{ij}|$

Inequalities:

  • $\Vert A \bs u \Vert \le \Vert A\Vert \Vert \bs u\Vert $
  • $\Vert b A\Vert = |b| \Vert A\Vert $
  • $\Vert A + B\Vert \le \Vert A\Vert + \Vert B\Vert $
  • $\Vert AB\Vert \le \Vert A\Vert \Vert B\Vert$

Matrix condition

The propagation of errors in a linear system $\bs y = A\bs x$ with invertible $A$:

  • Consider a perturbation to $\bs{ x' = x} + d\bs x$, and corresponding $d \bs y = A d\bs x$:

$$\begin{aligned} \Vert d \bs y\Vert &= \Vert A\, d \bs x\Vert \le \Vert A\Vert\Vert d \bs x\Vert = \Vert A\Vert \Vert \bs x\Vert \frac{\Vert d \bs x\Vert }{\Vert \bs x\Vert } \\ &= \Vert A\Vert \Vert A^{-1} \bs y\Vert \frac{\Vert d \bs x\Vert }{\Vert \bs x\Vert } \le \Vert A\Vert \Vert A^{-1} \Vert \Vert \bs y\Vert \frac{\Vert d \bs x\Vert }{\Vert \bs x\Vert } \end{aligned}$$

$$\frac{\Vert d \bs y\Vert }{\Vert \bs y\Vert } \le \Vert A\Vert\Vert A^{-1}\Vert\frac{\Vert d \bs x\Vert }{\Vert \bs x\Vert}$$

  • $k(A) = \Vert A\Vert\Vert A^{-1}\Vert$ is the condition number for the linear system $\bs y = A\bs x$, which defines the maximum possible magnification of the relative error.
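The definition can be checked directly against numpy's built-in np.linalg.cond, here with the ill-conditioned 2x2 matrix from the earlier example:

import numpy as np

A = np.array([[1., 2.], [2., 3.999]])
k1 = np.linalg.norm(A, 1) * np.linalg.norm(np.linalg.inv(A), 1)
print(k1)                      # ~3.6e4
print(np.linalg.cond(A, 1))    # same value, computed by numpy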

Matrix perturbation

What if we change the matrix itself? i.e. given $AB = C$, how would $B$ change under a small change in $A$ while holding $C$ constant?

  • we can no longer directly compute it via matrix calculus
  • perturbation is a powerful technique for solving this type of problem

We can write any $\delta A = \dot{A} \epsilon$ and the resulting $\delta B = \dot{B} \epsilon$:

  • $\dot{A}, \dot{B}$ are matrices representing the direction of the perturbation
  • $\epsilon$ is a first order small scalar

$$\begin{aligned} (A + \delta A) (B + \delta B) &= (A + \dot{A} \epsilon ) (B + \dot{B} \epsilon) = C \\ AB + (\dot{A}B + A\dot{B})\epsilon + \dot{A}\dot{B}\epsilon^2 &= C \\ (\dot{A}B + A\dot{B})\epsilon + \dot{A}\dot{B}\epsilon^2 &= 0 \end{aligned}$$

Now we collect the first order terms of $\epsilon$, $\dot{A}B + A\dot{B} = 0$:

$$\begin{aligned} \dot{B} &= -A^{-1}\dot{A}B \\ \delta B &= -A^{-1}\delta A\, B \\ \Vert \delta B \Vert &= \Vert A^{-1}\delta A\, B \Vert \le \Vert A^{-1} \Vert \Vert \delta A \Vert \Vert B \Vert \\ \frac{\Vert \delta B \Vert}{\Vert B \Vert} &\le \Vert A^{-1} \Vert \Vert \delta A \Vert = \Vert A^{-1} \Vert \Vert A \Vert \frac{\Vert \delta A \Vert}{\Vert A \Vert} \end{aligned}$$

We reach the same conclusion of $k(A) = \Vert A^{-1} \Vert \Vert A \Vert$ for a small change in $A$ under the linear system $AB = C$.
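A quick numerical check of the first-order result $\delta B \approx -A^{-1}\,\delta A\, B$, using random matrices (illustrative only):

import numpy as np

np.random.seed(1)
A = np.random.rand(3, 3)
B = np.random.rand(3, 3)
C = A.dot(B)

dA = 1e-6 * np.random.rand(3, 3)           # a small perturbation of A
B_new = np.linalg.solve(A + dA, C)         # exact B satisfying (A + dA) B_new = C
dB_exact = B_new - B
dB_first_order = -np.linalg.solve(A, dA).dot(B)

# the relative difference is second order in the perturbation, hence tiny
print(np.linalg.norm(dB_exact - dB_first_order) / np.linalg.norm(dB_exact))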

  • we will cover the condition number for non-square matrices in the next class.

Numerical example

Consider the ill-conditioned matrices from previous examples:

In [30]:
V = sp.MatrixSymbol('V', 3, 3)
Vi = sp.MatrixSymbol('V^{-1}', 3, 3)
fmt.displayMath(fmt.joinMath('=', V, sp.Matrix(cov*1e4).evalf(4)), fmt.joinMath('=', Vi, sp.Matrix(cov*1e4).inv().evalf(5)))
$$ V=\left(\begin{matrix}8.023 & -0.1578 & 9.719\\-0.1578 & 7.27 & 8.525\\9.719 & 8.525 & 22.23\end{matrix}\right)\;,\;\;\;V^{{-1}}=\left(\begin{matrix}32585.0 & 31647.0 & -26385.0\\31647.0 & 30737.0 & -25626.0\\-26385.0 & -25626.0 & 21365.0\end{matrix}\right)$$
In [31]:
A = sp.MatrixSymbol('A', 2, 2)
Ai = sp.MatrixSymbol('A^{-1}', 2, 2)
fmt.displayMath(fmt.joinMath('=', A, sp.Matrix(a)), fmt.joinMath('=', Ai, sp.Matrix(a).inv().evalf(4)))
$$ A=\left(\begin{matrix}1.0 & 2.0\\2.0 & 3.999\end{matrix}\right)\;,\;\;\;A^{{-1}}=\left(\begin{matrix}-3999.0 & 2000.0\\2000.0 & -1000.0\end{matrix}\right)$$

Their condition numbers are large because of the large elements in the inverse:

In [32]:
fmt.displayDF(pd.DataFrame([[np.linalg.cond(x, n) for n in (1, 2, inf)] for x in [a, cov*1e4]],
             columns = ["L-1", "L-2", "L-$\infty$"], index=['Condition number $A$', 'Condition number $V$']), "4g")
L-1 L-2 L-$\infty$
Condition number $A$ 3.599e+04 2.499e+04 3.599e+04
Condition number $V$ 3.667e+06 2.52e+06 3.667e+06

Orthogonal transformation

Orthogonal transformation is unconditionally stable:

$$\Vert Q \bs u\Vert_2^2 = (Q \bs u)^T(Q \bs u) = \bs u^T Q^TQ \bs u = \bs u^T \bs u = \Vert \bs u \Vert_2^2$$

  • therefore by definition: $\Vert Q \Vert_2 = \Vert Q^{-1} \Vert_2 = 1$
  • $k(Q) = \Vert Q \Vert_2 \Vert Q^{-1} \Vert_2 = 1$
  • the relative error does not grow under orthogonal transformation.
  • Orthogonal transformation is extremely important in numerical linear algebra.
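A quick numerical confirmation, using a Q taken from the QR factorization of a random matrix (any such Q is orthogonal):

import numpy as np

np.random.seed(0)
Q, _ = np.linalg.qr(np.random.rand(5, 5))
print(np.allclose(Q.T.dot(Q), np.eye(5)))   # True: Q is orthogonal
print(np.linalg.cond(Q, 2))                 # ~1.0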

Assignments

Required reading:

  • Bindel and Goodman: Chapter 4, 5.1-5.4

Homework:

  • Bindel and Goodman: Problem 4.4, 4.7, 4.8
  • Complete homework set 2
  • Problem 2, 4, 5 in the homework set and 4.4, 4.7 from the book are optional.