Lecture 7: Variance Reduction

Topics:

  • Overview
  • Control variate
  • Importance sampling
  • Low discrepancy sequence
  • Other techniques

Overview

Estimator variance

Let's consider the simple average estimator for $\mathbb{E}[\tilde x]$:

$$\renewcommand{var}{\text{var}}\renewcommand{std}{\text{std}}\renewcommand{cov}{\text{cov}}\renewcommand{bs}{\boldsymbol} \hat{u} = \frac{1}{n}\sum_i x_i$$
  • $x_i$ are samples of the random variable $\tilde x$.
  • $\hat{u}$ is an unbiased estimator: $\mathbb{E}[\hat{u}] = \mathbb{E}[\tilde x]$
  • the standard error of the estimator is $\std[\hat{u}] = \frac{\sigma}{\sqrt{n}}$, where $\sigma^2$ is the population variance of $\tilde x$.

Total MC cost

The minimum number of samples required for a given accuracy $\epsilon$ is:

$$\frac{\sigma}{\sqrt{n}} = \epsilon \iff n = \frac{\sigma^2}{\epsilon^2}$$

The total computation cost is therefore roughly $c(x_i)\frac{\sigma^2}{\epsilon^2}$:

  • $c(x_i)$ is the cost to generate one random sample of $x_i$
  • out of the three factors, the variance $\sigma^2$ has the greatest potential for reduction (see the worked example below)
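For instance, with illustrative numbers $\sigma = 5$ and target accuracy $\epsilon = 0.01$, the simple estimator needs $n = \frac{\sigma^2}{\epsilon^2} = 250{,}000$ paths; halving $\sigma$ through variance reduction cuts this to $62{,}500$ paths, a 4x saving at the same accuracy and per-sample cost.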

Variance reduction is extremely important in practice

  • MC is often applied to exotic, complex and large scale problems, where speed is the critical bottleneck
  • advanced variance reduction can lead to many-fold speedups

Variance reduction strategies

Sample an alternative random variable $\tilde y$ that has identical mean but smaller variance than $\tilde x$:

  • Antithetic variate
  • Control variate

Use more efficient estimator than the mean of independent random samples:

  • Importance sampling
  • Low-discrepancy sequence (aka quasi-random sequence)
  • Moment matching
  • Stratified sampling

Antithetic variates

Suppose $\tilde x(\bs w(t))$ is driven by a multi-d Brownian motion $\bs w(t)$,

$$\tilde y = \frac{1}{2} \left(\tilde x\left(\bs w\left(t\right)\right) + \tilde x\left(-\bs w\left(t\right)\right)\right) $$
  • $\tilde y$ is the average of two $\tilde x$ from mirroring Brownian paths
  • $\mathbb{E}[\tilde y] = \mathbb{E}[\tilde x]$, since the mirror of a Brownian path is also a Brownian path
  • $\var[\tilde y] = \frac{\sigma^2}{2}\left(1 + \rho\left(\tilde x\left(\bs w\right),\tilde x\left(-\bs w\right)\right)\right)$

Effectiveness depends on the payoff (a minimal numerical sketch follows the list):

  • the cost of every sample is doubled, i.e., $c(\tilde y) = 2c(\tilde x)$
  • to ensure an overall saving in computational cost: $$ \var[\tilde y] < \frac{\sigma^2}{2} \iff \rho\left(\tilde x\left(\bs w\right),\tilde x\left(-\bs w\right)\right)< 0 $$
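A minimal numerical sketch of this trade-off, assuming a plain terminal GBM call payoff and illustrative parameters (the names st and payoff below are not from the lecture code):

import numpy as np

np.random.seed(0)
s0, k, vol, r, tm = 100., 100., .25, .02, 1.
nn = 100000

w = np.random.normal(size=nn)*np.sqrt(tm)                  # w(tm), Brownian at maturity
st = lambda w: s0*np.exp((r - .5*vol*vol)*tm + vol*w)      # GBM terminal stock price
payoff = lambda w: np.maximum(st(w) - k, 0.)*np.exp(-r*tm)

x = payoff(w)                      # plain samples of the discounted payoff
y = .5*(payoff(w) + payoff(-w))    # antithetic pairs from mirrored paths

# each y costs two payoff evaluations, so antithetic pays off only if
# var[y] < var[x]/2, i.e., the mirrored payoffs are negatively correlated
print(np.var(x)/2, np.var(y), np.corrcoef(payoff(w), payoff(-w))[0, 1])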

Effectiveness of antithetic variates

It works best when $\tilde x$ is a linearish function of the underlying $\bs w(t)$

In [2]:
x = np.arange(0, 11, .1) -5

figure(figsize=[14, 4])
subplot(1, 3, 1)
plot(x, (x+5)**1.5);
title('Very effective', fontsize=18)
xlabel('$w(t)$', fontsize=16)

subplot(1, 3, 2)
plot(x, np.maximum((x+7)**1.5 - 10, 0))
title('Somewhat effective', fontsize=18)
xlabel('$w(t)$', fontsize=16)

subplot(1, 3, 3)
plot(x, np.maximum(5-.5*x**2, 0))
ylim(0, 6.)
title('Ineffective', fontsize=18);
xlabel('$w(t)$', fontsize=16);

Antithetic variate is often abused in practice

  • very easy to implement, so it is often turned on without a cost/benefit analysis
  • it can conflict with other, more advanced variance reduction techniques

Control Variate

it is better to be precisely wrong than roughly right

Motivation

Exotic trades are often 90% vanilla, but with a small exotic feature

  • the vanilla trade is often a good proxy to the exotic trade
  • e.g. Asian option vs. vanilla European option
    • Asian payoff is computed from average prices

We therefore price the (exotic - vanilla) instead of the full exotic instrument:

  • the difference has much smaller variance
    • e.g., for two paths: Asian option [10, 0], European option [9, 0], the diff is [1, 0]
  • we can recover the exotic trade's price by adding back the price of vanilla trade

Control Variate

Express the idea of diffs formally:

$$ \tilde y = \tilde x + \beta (v - \tilde v) $$
  • $\tilde v$ is the control variate, which represents the similar vanilla instrument
  • $v = \mathbb{E}[\tilde v]$ is the true value of the vanilla instrument, which is known to very high precision (e.g. analytical or PDE)
  • $\beta$ is a scalar constant, a free parameter
  • $\mathbb{E}[\tilde y] = \mathbb{E}[\tilde x]$ by construction

Optimal $\beta$

$\beta$ is determined by minimizing $\tilde y$'s variance:

$$\begin{array}{l} \var[\tilde y] &= \cov(\tilde y, \tilde y) = \cov(\tilde x -\beta \tilde v, \tilde x -\beta \tilde v) \\ &= \sigma_x^2 - 2 \beta \cov(\tilde x, \tilde v) + \beta^2 \sigma_v^2 \\ \frac{\partial \var[\tilde y]}{\partial \beta} &= -2 \cov(\tilde x, \tilde v) + 2\beta \sigma_v^2 = 0 \\ \beta^* &= \frac{\cov(\tilde x, \tilde v)}{\sigma_v^2} = \frac{\rho(\tilde x, \tilde v)\sigma_x}{\sigma_v} \end{array}$$

The minimum variance is therefore: $\var[\tilde y] = \sigma_x^2\left(1-\rho^2(\tilde x, \tilde v)\right)$

  • the higher the correlation between $\tilde x, \tilde v$, the more variance reduction
  • significant performance gain with the right control variate
  • $\tilde x, \tilde v$ must be driven by the same Brownian paths to ensure high correlation

Asian option

Consider the following Asian call option:

  • $s(0) = 100, k=100, \sigma = .25$, maturity $t_m = 1$Y
  • constant continuous risk free rate of $r=0.02$
  • payoff at maturity is based on monthly average of stock prices: $$\left(\frac{1}{n}\sum_{i=1}^{n} s(i\delta) - k\right)^+$$ with $\delta = \frac{1}{12}, n=\frac{t_m}{\delta} = 12$

A straight Monte Carlo simulation of 100,000 paths yields:

In [3]:
import proc, time

s0, k, vol, tm, dt, r = 100., 100., .25, 1., 1./12, 0.02
n = int(tm/dt + 1e-6) # why 1e-6?
gbm = proc.GBM(r, vol)
disc = np.exp(-r*tm)

def mc0(p, es) :
    ss = gbm.draw(es, s0, dt)
    x = np.maximum(np.mean(ss, 1) - k, 0)*disc
    return np.mean(x), np.sqrt(np.var(x)/p)

p0 = 100000
tic = time.perf_counter()
es = np.random.normal(size=[p0, n])
u0, e0 = mc0(p0, es)
t0 = time.perf_counter() - tic

df = pd.DataFrame([p0, u0, e0], index=['Paths', 'Value', 'MC error'], columns=['Asian Call'])
fmt.displayDF(df.T, "3g", 4)
Paths Value MC error
Asian Call 1e+05 6.63 0.0324

European option as control

We use the European option with the same strike and maturity as control

  • the European option's value (Black-Scholes): $v = \mathbb{E}[\tilde v] = 10.871$
  • this simple control reduces the computational cost by a factor of about four
In [4]:
import inst

euro_call_v = inst.BlackScholes.callPrice(r, s0, k, tm, vol)
euro_call_f = lambda ss : np.maximum(ss[:,-1]-k, 0.)*disc
asian_call_f =  lambda ss: np.maximum(np.mean(ss, 1) - k, 0)*disc

def mc_cv(p, xf, vf, ve) :
    es = np.random.normal(size=[p, n])
    ss = gbm.draw(es, s0, dt)
    x = xf(ss)
    v = vf(ss)
    
    rho = np.corrcoef(x, v)[0, 1]
    beta = rho*np.std(x)/np.std(v)
    
    y = x + beta*(ve - v)
    return np.mean(y), np.sqrt(np.var(y)/p), x, v, rho, beta

tic = time.perf_counter()
cv1 = mc_cv(p0, asian_call_f, euro_call_f, euro_call_v)
t1 = time.perf_counter() - tic
In [5]:
def chart_cv(cv, actual, t, p, control) :
    u, e, x, v, rho, beta = cv
    fig = figure(figsize=[12, 4])
    subplot(1, 2, 1)
    plot(x[:1000], v[:1000], '.')

    xlabel('Asian option payoff', fontsize=16)
    ylabel(control, fontsize=16)
    title('$\\tilde x$ vs. $\\tilde v$', fontsize=18);

    subplot(1, 2, 2)
    ax = fig.add_subplot(122)
    ax.text(0, .75, "$\\rho=%.4f, \; \\beta^*=%.4f$" % (rho, beta), size="xx-large")
    ax.text(0, .63, "Speedup limit: $\\frac{1}{1-\\rho^2} =$ %.2f, actual %.2f" % \
            (1./(1.-rho*rho), actual), size="xx-large")
    ax.text(0, .50, "Controlled MC valuation: %.3f $\\pm$ %.3f" % (u, e), size="xx-large")
    ax.text(0, .40, "%s paths took %.3g sec" % ("{:,}".format(p), t), size="xx-large")

    ax.set_axis_off()
    
e1 = cv1[1]
chart_cv(cv1, t0/t1*e0*e0/e1/e1, t1, p0, "European option payoff")

Can we do better than a 4x speed up? You bet!

Geometric Asian option

Consider the geometric mean of the stock prices:

$$ g = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \log\left(s\left(i\delta\right)\right)\right) $$
  • $g$ follows a lognormal distribution with known mean and variance
  • $g$ is a close approximation to the arithmetic mean

Consider a call option with the following payoff:

$$\left(g - k\right)^+$$
  • can be priced exactly by a Black-Scholes type formula (or by another MC simulation); a sketch of this pricing step follows the list
  • not a real tradable instrument, but it does not matter
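The cell defining this geometric control is not shown above; the following is an assumed reconstruction (a sketch, not the original code). Since $\log g$ is normal under GBM, the geometric-average call has a Black-Scholes style closed form; the names g_call_f and cvc are chosen to match what the next cell consumes, with cvc[0] assumed to hold the control's value.

from scipy.stats import norm

ti = dt*np.arange(1, n + 1)                            # monthly observation times
mu_g = np.log(s0) + (r - .5*vol*vol)*np.mean(ti)       # mean of log geometric average
var_g = vol*vol*np.sum(np.minimum.outer(ti, ti))/n**2  # variance of log geometric average
sig_g = np.sqrt(var_g)

d2 = (mu_g - np.log(k))/sig_g
d1 = d2 + sig_g
g_call_v = disc*(np.exp(mu_g + .5*var_g)*norm.cdf(d1) - k*norm.cdf(d2))
cvc = (g_call_v,)                                      # assumed interface: cvc[0] is the control value

# geometric-average call payoff computed from the simulated price paths ss
g_call_f = lambda ss: np.maximum(np.exp(np.mean(np.log(ss), 1)) - k, 0.)*disc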

A stellar control

The results using the geometric Asian option control:

In [7]:
tic = time.perf_counter()
cv2 = mc_cv(p0, asian_call_f, g_call_f, cvc[0])
t2 = time.perf_counter() - tic

chart_cv(cv2, t0/t2*(e0/cv2[1])**2, t2, p0, "Geometric average option")

Disclaimer: such dramatic reduction in variance is not typical:

  • a 10x - 20x reduction is an excellent result in practice

Antithetic variable revisited

AV is a poor man's control variate:

$$\begin{array}{l} \tilde y &= \frac{1}{2} \left(\tilde x\left(\bs w\left(t\right)\right) + \tilde x\left(-\bs w\left(t\right)\right)\right) \\ &= \tilde x\left(\bs w\left(t\right)\right) + \frac{1}{2} \left(\tilde x\left(-\bs w\left(t\right)\right) - \tilde x\left(\bs w\left(t\right)\right) \right) \\ \end{array}$$

The control variate is $\tilde v = \frac{1}{2} \left(\tilde x\left(\bs w\left(t\right)\right) - \tilde x\left(-\bs w\left(t\right)\right) \right)$, with $v = \mathbb E[\tilde v] = 0 $ and $\beta$ fixed at 1 rather than optimized

  • control variate is much more flexible than AV

Control variate summary

Control variate is the most potent variance reduction technique,

  • could lead to orders of magnitude of speedup
  • the choice of control variate is critical and problem specific
  • its implementation requires creativity and highly specialized skills

It can also be used to compute deltas with Monte Carlo simulation

Importance Sampling

You have a 99% chance of winning the lottery!

Extreme events

We often need to simulate rare or extreme events:

  • CDS pricing: default event of high grade issuers
  • CDO tranches: multiple defaults in a portfolio
  • Extreme quantile measures:
    • VaR: 99% 10 day
    • IRC/CRM: 99.9% 1Y
    • PFE: 98%

Simulating rare events is very inefficient,

  • most simulated paths do not hit the area of interest.

Bernoulli trial

Consider buying a lottery ticket that pays $g = \$1$ billion, with the chance of winning being $p$:

The variance of the payoff is:

$$\var[\tilde x] = \mathbb{E}[\tilde x^2] - \mathbb{E}[\tilde x]^2 = g^2(p - p^2)$$

Suppose we want to price the payoff to a relative accuracy of 1% using Monte Carlo, then the minimum number of sample paths $n$ is:

$$ g \sqrt{\frac{p-p^2}{n}} < .01 g\cdot p \iff n > 10^4\frac{1-p}{p}$$

The relationship between $p$ and the computational costs:

In [10]:
import fmt

p = np.array([1e-8, 1e-4, .01, .1, .5, .9, .99, 1-1e-4])
n = 1e4*(1-p)/p
df = pd.DataFrame(np.array([p, n]), index=["$p$", "Min Paths"])
df = df.T.set_index("$p$")
fmt.displayDF(df.T, "4g", 4)
$p$ 1e-08 0.0001 0.01 0.1 0.5 0.9 0.99 0.9999
Min Paths 1e+12 9.999e+07 9.9e+05 9e+04 1e+04 1,111 101 1

Pitfalls with sampling extreme events

Consider the following scenario of winning odds $p=10^{-8}$:

  • We draw 10,000 MC sample paths and none of them hit the positive payoff
  • The average of the 10,000 paths gives an incorrect value of 0.
  • but the MC error estimated from the samples is $\frac{\hat s}{\sqrt{n}} = 0$, falsely suggesting a perfect result.

Why and how did the MC error estimation fail us?

Because the MC error estimate itself carries a large error of $\sqrt{\frac{\beta-1}{4n}}\frac{\sigma}{\sqrt{n}}$; in this case the kurtosis $\beta$ is very large, $\beta \gg n = 10{,}000$.
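To put numbers on it, using the Bernoulli lottery as the illustration: the kurtosis of a Bernoulli payoff is $\beta = \frac{1 - 3p(1-p)}{p(1-p)} \approx \frac{1}{p} = 10^8$ for $p = 10^{-8}$, so the relative error of the MC error estimate is roughly $\sqrt{\frac{\beta-1}{4n}} \approx \sqrt{\frac{10^8}{4\times10^4}} = 50$, i.e., the reported MC error is itself uncertain by a factor of about 50 and cannot be trusted.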

Intuitions of importance sampling

The expected payoff is the same between the two lotteries:

  1. $p=10^{-8}$ chance of winning \$1 billion
  2. $p=.99$ chance of winning \$10.10

Let's pretend that we are playing lottery #2 in the MC.

Everybody wins!

  • sample mean is unchanged
  • lottery #2 has a much smaller variance, so its MC converges much faster

What we are doing is really a change of measure

Measure change

$$ \mathbb{E}^\mathbb{Q}\left[\tilde x\right] = \mathbb{E}^{\mathbb{P}}\left[\tilde x \frac{d \mathbb Q}{d \mathbb P}\right]$$
  • $\mathbb{P}$ and $\mathbb{Q}$ must be equivalent (i.e., they have identical set of events with non-zero probability)
  • $\frac{d \mathbb Q}{d \mathbb P}$ is the Radon-Nikodym derivative, $\mathbb{E}^{\mathbb P}\left[\frac{d \mathbb Q}{d \mathbb P}\right] = \mathbb{E}^\mathbb{Q}\left[1\right] = 1$
  • The expression like $\mathbb E^{\mathbb Q}[\cdots \frac{d \mathbb Q}{d \mathbb P}]$ is an error, almost surely

One-dimensional measure change

An intuitive special case:

$$\begin{array}{l} \mathbb{E}^\mathbb{Q}\left[\tilde x\right] = \int x q(x) dx = \int x \frac{q(x)}{p(x)} p(x) dx = \mathbb{E}^\mathbb{P}\left[\tilde x \frac{d \mathbb Q}{d \mathbb P}\right] \end{array}$$
  • $p(x), q(x)$ are the probability densities in the respective measures
  • the distributions of $\tilde x$ under the $\mathbb P$ and $\mathbb Q$ measures are different

Importance sampling

Given the following equality,

$$ \mathbb{E}^\mathbb{Q}\left[\tilde x\right] = \mathbb{E}^{\mathbb{P}}\left[\tilde x \frac{d \mathbb Q}{d \mathbb P}\right]$$

We can compute $\mathbb{E}^{\mathbb Q}[\tilde x]$ by one of the two ways:

  1. sample $\tilde x$ from measure $\mathbb Q$ and compute $\frac{1}{n} \sum_i x_i$
  2. sample $\tilde x$ from measure $\mathbb P$ and compute $\frac{1}{n} \sum_i (x_i \frac{d \mathbb Q}{d \mathbb P})$

Note that the $\tilde x$ samples from different measures have different distributions.
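A toy comparison of the two routes, using the lottery from the earlier slide (the variable names below are illustrative):

import numpy as np

np.random.seed(0)
g, p_q, p_p, nn = 1e9, 1e-8, .99, 100000      # payoff, Q and P win probabilities, paths

# 1. sample directly in Q: almost never hits the payoff, huge variance
win_q = np.random.uniform(size=nn) < p_q
est_q = np.mean(g*win_q)

# 2. sample in P (win prob .99) and weight by dQ/dP: p_q/p_p on a win,
#    (1-p_q)/(1-p_p) on a loss
win_p = np.random.uniform(size=nn) < p_p
rn = np.where(win_p, p_q/p_p, (1. - p_q)/(1. - p_p))
est_p = np.mean(g*win_p*rn)

print(est_q, est_p)    # est_p is close to the true value 10, est_q is almost surely 0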

Unequal probability weights

The importance sampling estimator is therefore:

$$\small \mathbb{E}^{\mathbb Q}[\tilde x] = \mathbb{E}^{\mathbb{P}}\left[\tilde x \frac{d \mathbb Q}{d \mathbb P}\right] \approx \frac{1}{n} \sum_i (x_i \frac{d \mathbb Q}{d \mathbb P}) =\sum_i \frac{1}{n} \frac{d \mathbb Q}{d \mathbb P} x_i = \sum_i q_i x_i$$

where $q_i = \frac{1}{n} \frac{d \mathbb Q}{d \mathbb P}$ (evaluated at $x_i$) are unequal probability weights associated with each $x_i$.

  • this view is convenient for computing probability quantiles (a small helper sketch follows the list)
  • on the RHS, the $x_i$ have to be drawn from the $\mathbb P$ measure
  • $\sum_i q_i = 1$ holds only in expectation, not sample by sample, therefore it helps to normalize the $q_i$ so that $\sum_i q_i = 1$
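A small helper sketch of the quantile view (illustrative names; the IRC example later uses the same sort / cumulative-weight / searchsorted logic):

import numpy as np

def weighted_quantile(x, q, level):
    # quantile of samples x carrying probability weights q (need not sum to 1)
    xs, qs = zip(*sorted(zip(x, q)))       # sort samples, carry their weights along
    cdf = np.cumsum(qs)/np.sum(qs)         # normalized empirical CDF
    return xs[np.searchsorted(cdf, level)]

# with equal weights this reproduces the ordinary empirical quantile
x = np.random.normal(size=10000)
print(weighted_quantile(x, np.ones(len(x)), .999), np.percentile(x, 99.9))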

Normal I/S example

In [11]:
from scipy.stats import norm

x = np.arange(-7, 4, .1)
p = norm.pdf(x)

u = -3 
z = np.exp(-u*x + .5*u*u)

df = pd.DataFrame(np.transpose([norm.pdf(x), norm.pdf(x-u), z]), index=x, 
                  columns=['$\\mathbb{Q}$', '$\\mathbb{P}$', '$\\frac{d \\mathbb{Q}}{d \\mathbb {P}}$'])

fig = figure(figsize=[12, 4])
ax1 = fig.add_subplot(121)

df.plot(ax=ax1, secondary_y=['$\\frac{d \\mathbb{Q}}{d \\mathbb {P}}$']);
ax1.text(-5, .05, "area of interests", size="x-large");
title('Probability Densities', fontsize=16)

subplot(1, 2, 2)

ns = 500
xs_q = np.random.normal(size=[ns]) # Q sample
xs_p = xs_q + u # P sample
qs = 1./ns*np.ones(ns)
zs = np.exp(-u*xs_p + .5*u*u) # R-N derivative
ps = qs*zs

ps = ps/sum(ps)  # normalize

nn = 100
semilogy(xs_q[:nn], qs[:nn], '.b')
semilogy(xs_p[:nn], ps[:nn], '.g')

ylabel('Probability Weights')
legend(['Original Samples', 'Importance Samples'], loc='best');
title('100 Discrete Samples', fontsize=16);

text(-4.8, 1e-2, "area of interests", size="x-large");

Suppose $\tilde x \sim N(0, 1)$ in $\mathbb Q$ measure, but we want to sample more in the area of interest around $d = -3$: $$ \renewcommand{intf}{\int_{-\infty}^{\infty}}$$ $$ \small \mathbb E^{\mathbb Q} [f(\tilde x)] = \intf f(x) \phi(x) dx = \intf f(x) \frac{\phi(x)}{\phi(x-d)}\phi(x-d) dx = \mathbb E^{\mathbb P}[f(\tilde x) \frac{d \mathbb{Q}}{d \mathbb {P}}] $$

$$ \small \frac{d \mathbb{Q}}{d \mathbb {P}} = \frac{\phi(x)}{\phi(x-d)} = \exp \left(-\frac{1}{2}\left(x^2 - (x-d)^2\right)\right) = \exp(-xd + \frac{1}{2}d^2) $$

One-factor Gaussian copula

The one factor Gaussian copula model is by far the most infamous model in quant finance,

$$\tilde x_i = \sqrt{\rho} \tilde m + \sqrt{1-\rho} \tilde \epsilon_i$$
  • it is a one-factor correlation model, where $\tilde x_i$ represents the $i$-th asset return
  • $\tilde m, \tilde \epsilon_i$ are all independent standard normal random variables
  • $\tilde x_i, \tilde x_j$ are correlated only through the common market factor $\tilde m$: $\cov(\tilde x_i, \tilde x_j) = \rho\;\cov(\tilde m, \tilde m) = \rho$ for $i \neq j$
  • it is silly to blame the financial crisis on such an innocent model ...

Correlated defaults

$$\renewcommand{ind}{1{\hskip -2.5 pt}\hbox{l}}$$

The one-factor Gaussian copula is often used to model correlated defaults,

  1. We first draw the normal market factor $m$ and correlated normals $x_i$ for each issuer $$\tilde x_i = \sqrt{\rho} \tilde m + \sqrt{1-\rho} \tilde \epsilon_i$$
  2. If an issuer's survival probability is $p_i(t)$ at time $t$, then we can draw the default time $\tilde \tau_i$ as: $$ p_i(\tilde \tau_i) = \Phi(\tilde x_i) \iff \tilde \tau_i = p_i^{-1}(\Phi(\tilde x_i))$$
  3. the corresponding default indicator at time $t$ is: $$\ind(\tau_i < t) = \ind(\Phi(x_i) > p_i(t))$$

The resulting default time $\tau_i$ and default indicators $\ind_i$ are correlated through $m$
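A minimal sketch of these three steps, assuming a flat hazard rate $h$ so that $p(t) = e^{-ht}$ (the parameters and names below are illustrative):

import numpy as np
from scipy.stats import norm

np.random.seed(0)
rho, h, n_names, n_paths = .7, .02, 10, 5

m = np.random.normal(size=n_paths)                   # common market factor
eps = np.random.normal(size=[n_paths, n_names])      # idiosyncratic factors
x = np.sqrt(rho)*m[:, None] + np.sqrt(1 - rho)*eps   # step 1: correlated normals

# step 2: p(tau) = Phi(x) with p(t) = exp(-h t)  =>  tau = -log(Phi(x))/h
tau = -np.log(norm.cdf(x))/h

# step 3: default indicator before t = 5Y, equivalently Phi(x) > p(5)
defaulted_by_5y = norm.cdf(x) > np.exp(-h*5.)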

Draw default times

In [12]:
figure(figsize=[10, 6])
ts = np.arange(0, 30, .01)
h = .1
p = np.exp(-h*ts)
plot(ts, p)
xlabel('Time (Y)', fontsize=16)
ylabel('Survival Prob', fontsize=16)

u = .1
t = -np.log(u)/h
plot([0, t], [u, u], 'r')
plot([t, t], [0, u], 'r')
plot(t, u, 'o');

u = .8
t = -np.log(u)/h
plot([0, t], [u, u], 'k')
plot([t, t], [0, u], 'k')
plot(t, u, 'o');
legend(['Survival Prob $p(t)$']);
  • given a uniform random variable $\tilde u$, the default time is $p^{-1}(\tilde u)$.

Incremental risk charge model

IRC is part of the Basel 2.5 capital requirement for credit risks:

  • it is defined to be the 1Y 99.9% quantile of the portfolio loss, assuming the portfolio position is held constant
  • only captures the default and rating migration, but not spread volatility

A stylized IRC example:

  • only consider the default risk, ignore the trade aging
  • portfolio: 100 long CDS positions with \$10MM notional each
  • 5 names have a 1Y survival probability of 0.95, the remaining 95 have 0.99
  • recovery given default is uniform between [10%, 50%]
  • the correlation of default is governed by the 1F Gaussian copula with 70% correlation

Straight Monte Carlo for IRC

The IRC can be computed using straight MC.

  • we generate a large number of PnL scenarios, and read out the 99.9% quantile
  • to estimate the MC error, we have to apply batching (why?)

The straight MC converges slowly:

  • only a small percentage of the scenarios hit the 99.9% tail.
In [13]:
nn, rho = 100, .7
nsim = 40000
p = .99*np.ones(nn)
p[:5] = .95
b = 20
lgdl = 5e6
lgdh = 9e6

def irc0(nsim) :
    es = np.random.normal(size=[nn+1, nsim])
    xs = np.array([np.sqrt(rho)*es[0, :] + np.sqrt(1-rho)*e for e in es[1:, :]]).T
    lgd = np.random.uniform(lgdl, lgdh, size=nsim)

    pnls = np.sum(np.greater(norm.cdf(xs), p), 1)*lgd
    return np.percentile(pnls, 99.9), pnls

tic = time.perf_counter()
ircs0 = [(irc0(nsim))[0] for i in range(b)]
t0 = time.perf_counter() - tic
In [14]:
df = pd.DataFrame([nsim, np.mean(ircs0), np.std(ircs0)/np.mean(ircs0), t0/b], index=['paths', 'value', 'rel error', 'run time(s)'], 
                  columns=['IRC'])
fmt.displayDF(df.T, "3g", 4)    
paths value rel error run time(s)
IRC 4e+04 5.03e+08 0.0429 0.392

Shift the common factor

A natural variable to shift is the market factor $\tilde m$; this effectively steers the sampling toward the scenarios with a lot of defaults:

In [15]:
def irc1(nsim, d) :
    es = np.random.normal(size=[nn+1, nsim])
    xs = np.array([np.sqrt(rho)*(es[0, :] - d) + np.sqrt(1-rho)*e for e in es[1:, :]]).T
    lgd = np.random.uniform(lgdl, lgdh, size=nsim)

    pnls = np.sum(np.greater(norm.cdf(xs), p), 1)*lgd
    ws = (np.exp(d*es[0, :]))/nsim
    
    s_pnls, s_ws = zip(*sorted(zip(pnls, ws)))
    cws = np.cumsum(s_ws/np.sum(s_ws))
    return s_pnls[np.searchsorted(cws, .999)-1], cws, s_pnls

var1 = []
mean1 = []
ds = np.arange(-2.5, 0.1, .25)
for d in ds :
    ircs = [(irc1(nsim, d))[0] for i in range(b)]
    var1.append(np.var(ircs))
    mean1.append(np.mean(ircs))
In [16]:
uo = -1.5

tic = time.perf_counter()
ircs1 = [irc1(nsim, uo)[0] for i in range(b)]
t1 = time.perf_counter() - tic
In [17]:
fig = figure(figsize=[12, 4])
subplot(1, 2, 1)
plot(ds, var1, '.-')
xlabel('drift amount', fontsize=14)
title('Variance vs drift', fontsize=16);
ax = fig.add_subplot(122)
ax.set_axis_off()
ax.text(0, .5, "With drift = -1.5:", size="xx-large")
ax.text(.1, .4, "Variance reduction ratio %.2f times" % (var1[-1]/min(var1)), size="xx-large");
ax.text(.1, .3, "IRC = %.4g, rel err= %.3g, for %d samples" % (np.mean(ircs1), np.std(ircs1)/np.mean(ircs1), nsim), size="xx-large")
ax.text(.1, .2, "Actual time saving is %.2f times" % (t0/t1*np.var(ircs0)/np.var(ircs1)), size="xx-large");
  • a drift of around $-1.5$ on $\tilde m$ gives the most variance reduction
  • here we followed the sign convention that a negative drift leads to more defaults (the code uses $\tilde m - d$ as the market factor)

How is the variance reduced?

In [18]:
uo = -1.5

nsim = 200
_, pnl1 = irc0(nsim)
_, w2, pnl2 = irc1(nsim, uo)

pnl1 = sorted(pnl1)
w1 = np.arange(1, nsim + 1)*1./nsim

figure(figsize=[12, 4])
subplot(1, 2, 1)
plot(np.transpose([w1, w2]), '.-');
title('CDF by sample', fontsize=18)
xlabel('Sample index', fontsize=16)
legend(['Straight MC', 'Importance Sampling'], loc='best')
ylabel('CDF');

subplot(1, 2, 2)
semilogy(pnl1, 1-w1, 'o-')
semilogy(pnl2, 1-w2, '.-')
axhline(1e-3, color='r')
title('100 Samples')
xlabel('PnL', fontsize=16)
ylabel('1-CDF')
title('1-CDF from 200 samples', fontsize=18)
legend(['Straight MC', 'Importance Sampling'], loc='best')
ylim(1e-5, 1);

The Gaussian copula is a simple static model,

  • it only gives the $\mathcal{F}_0$ distribution of defaults.

What if our model involves a stochastic process?

Equivalent views

Recall the change of measure for a standard normal RV $\tilde x$ in measure $\mathbb Q$:

$$ \small \mathbb E^{\mathbb Q} [f(\tilde x)] = \intf f(x) \phi(x) dx = \intf f(x) \frac{\phi(x)}{\phi(x-d)}\phi(x-d) dx = \mathbb E^{\mathbb P}[f(\tilde x) \frac{d \mathbb{Q}}{d \mathbb {P}}] $$$$ \small \frac{d \mathbb{Q}}{d \mathbb {P}} = \frac{\phi(x)}{\phi(x-d)} = \exp \left(-\frac{1}{2}\left(x^2 - (x-d)^2\right)\right) = \exp(-xd + \frac{1}{2}d^2) $$

We can also write $\tilde x = \tilde w + d$, where $\tilde w$ is standard normal in $\mathbb P$:

$$\small \mathbb E^{\mathbb Q} [f(\tilde x)] = \mathbb E^{\mathbb P}[f(\tilde x) \frac{d \mathbb{Q}}{d \mathbb {P}}] = \mathbb E^{\mathbb P}[f(\tilde w + d) \frac{d \mathbb{Q}}{d \mathbb {P}}] $$$$ \small \frac{d \mathbb{Q}}{d \mathbb {P}} = \frac{\phi(w+d)}{\phi(w)} = \exp(-wd - \frac{1}{2}d^2) $$

These two expressions are equivalent ways of representing importance sampling around $d$; this leads to something very important.

Girsanov theorem

Given a 1-D Brownian $w^{\mathbb P}(t)$ in the $\mathbb P$ measure, and a drifted process $x(t)$:

$$ d x(t) = \theta(t) dt + d w^{\mathbb P} (t) $$

If the process $$ m(t) = \exp\left(- \int_0^t \theta(s) d w^{\mathbb P}(s) - \frac{1}{2}\int_0^t \theta^2(s) \; ds \right)$$ is a martingale in $\mathbb P$, i.e., $\mathbb E^{\mathbb P}[m(t) | \mathcal{F}_s] = m(s)$, then:

  • $x(s)$ for $s \le t$ is a standard Brownian motion, $w^{\mathbb Q}(s)$, under the measure $\mathbb{Q}$ defined by the Radon-Nikodym derivative $\frac{d \mathbb Q}{d \mathbb P} = m(t)$.

when $\theta(t)$ is a constant: $m(t) = \frac{d \mathbb Q}{d \mathbb P} = \exp\left(-\theta w^{\mathbb P}(t) - \frac{1}{2} \theta^2t \right) $.
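A quick numerical check of the constant-$\theta$ case (a toy sketch, not the lecture's code): reweighting the drifted samples by $m(t)$ recovers a $\mathbb Q$-measure tail probability that straight sampling would rarely hit.

import numpy as np
from scipy.stats import norm

np.random.seed(0)
theta, t, nn = -3., 1., 100000

wp = np.random.normal(size=nn)*np.sqrt(t)   # w^P(t), Brownian under P
x = wp + theta*t                            # drifted process x(t)
m = np.exp(-theta*wp - .5*theta*theta*t)    # Radon-Nikodym derivative m(t)

# about half the P-samples land in the tail x < -3; the weights fix the probability
est = np.mean((x < -3.)*m)                  # E^P[ 1{x(t) < -3} m(t) ]
print(est, norm.cdf(-3.))                   # both approximately 1.35e-3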

Girsanov in one chart

In [19]:
#<img src=http://upload.wikimedia.org/wikipedia/commons/b/b3/Girsanov.png width=600 height=400>
def giranov_w(w, theta, t) :
    return np.exp(-theta*w - .5*theta*theta*t)

ns = 5000

dt = .005
t = 1
ts = np.arange(dt, t+1e-6, dt)

d = -1.5
u = d*ts
e = np.random.normal(size=[ns, len(ts)])*sqrt(dt)
w = np.cumsum(e, 1)
wd = w + u

# the radon-nikodym derivative
mt = giranov_w(w[:,-1], d, t)
mt = mt/sum(mt)*ns  # normalize so the R-N weights average to 1, which is more accurate
thr = -2

# use log scaled color scheme to bring out low prob
wgs = np.log(1./ns/mt) # inverse, darker color for bigger prob weights
wgs = wgs - np.min(wgs)
wgs = wgs/np.max(wgs)
wgs = wgs # adjust the dynamic range

bc = np.array([1, 1, .5])

fig = figure(figsize=[14, 4])

ni = np.array(list(range(20)) + [mt.argmin(), mt.argmax()])
lw = 1.5

ax1 = fig.add_subplot(121)

plot(ts, w[ni].T, c= bc*np.mean(wgs), label=None);
plot(ts, np.mean(w, 0), lw=lw, c='g', label='mean in Q')
plot(ts, np.var(w, 0), lw=lw, c='b', label='var in Q')
ax1.axhline(thr, c='r', lw=lw, label='barrier')
legend(loc='best')
ylim(-4, 2)

xlabel('Time (t)', fontsize=14)
title('$w^Q(t)$ in $Q$', fontsize=16)


ax2 = fig.add_subplot(122)

for w1, gs in zip(wd[ni], wgs[ni]) :
    plot(ts, w1, c=bc*gs, label=None);

plot(ts, mt.dot(wd)/ns, lw=lw, c='g', label='mean in Q')
plot(ts, mt.dot(wd*wd)/ns, lw=lw, c='b', label='var in Q')
ax2.axhline(thr, c='r', lw=lw, label='barrier');
legend(loc='best')
ylim(-4, 3)

xlabel('Time (t)', fontsize=14)
title(r'$x(t) = w^P(t) - \theta t$ in $P$', fontsize=16);
  • the color in $\mathbb P$ represents the $\frac{d \mathbb Q}{d \mathbb P} = m(t=1Y)$
  • the $\frac{d \mathbb Q}{d \mathbb P} = m(t)$ are per path, convenient for MC
  • sampling $x(t)$ in $\mathbb P$ is equivalent to sampling a $w^{\mathbb Q}(t)$ in $\mathbb Q$.
  • we can steer $x(t)$ to any arbitrary area of interest in $\mathbb P$
  • effective variance reduction for deeply OTM barriers and options

Understand Girsanov

Given a $\mathbb P$ Brownian motion $w^{\mathbb P}(s)$, we try to find a $\mathbb Q$ under which:

  • $\frac{x(s)}{\sqrt s} = \frac{w^{\mathbb P}(s) + \theta s}{\sqrt s} = \frac{w^{\mathbb P}(s)}{\sqrt s} + \theta \sqrt s$ is standard normal

We can compute the $\frac{d \mathbb Q}{d \mathbb P}$ directly from normal PDFs as:

$$ \small \frac{d \mathbb Q}{d \mathbb P} = \phi(\frac{w^{\mathbb P}(s)}{\sqrt s} + \theta \sqrt s)/\phi(\frac{w^{\mathbb P}(s)}{\sqrt s}) = \exp(-\theta w^{\mathbb P}(s) - \frac{1}{2} \theta^2 s) = m(s) $$

Using the fact that $m(s)$ is a $\mathbb P$ martingale:

$$ \small \mathbb E^{\mathbb Q}[x(s)] = \mathbb E^{\mathbb P}[x(s) m(s)] =\mathbb E^{\mathbb P}[x(s) \mathbb E^{\mathbb P}[m(t) | \mathcal{F}_s]] = \mathbb E^{\mathbb P}[x(s) m(t)] $$

therefore under $\mathbb Q$ defined by $\frac{d \mathbb Q}{d \mathbb P} = m(t)$:

  • $\frac{x(s)}{\sqrt s} = \frac{1}{\sqrt s}(w^{\mathbb P}(s) + \theta s)$ is standard normal for any $s < t$

Girsanov in multiple dimensions

The Girsanov theorem extends easily to independent multi-D Brownian motions:

$$\begin{array}{l} \small & \int f(x_1, \cdots, x_n) p(x_1, \cdots, x_n) d x_1 \cdots d x_n \\ &= \int f(x_1, \cdots, x_n) p_1(x_1) \cdots p_n(x_n) d x_1 \cdots d x_n \\ &= \int \left(\int f\left(x_1, \cdots, x_n\right) p_1(x_1) dx_1\right) p_2(x_2) dx_2 \cdots p_n(x_n) d x_n \end{array}$$
  • where $p(\cdot)$ are probability density functions
  • expectation of independent RVs reduces to 1-D expectations

Girsanov with correlated Brownians

Transform to independent Brownians first:

$$d \bs w(t) = M d \bs z(t)$$
  • correlation matrix $C dt = d \bs w \; d \bs w^T$
  • $d \bs z(t)$ is uncorrelated multi-d Brownian motions

Cholesky decomposition: $M = L$

  • $C = LL^T$: $d \bs w(t) \, d \bs w(t)^T = L \, d \bs z(t) \, d \bs z(t)^T L^T = L L^T dt = C dt$

PCA analysis (EVD): $M = RE$

  • $C = R \Lambda R^T = R E E^TR^T$, where $R$ is orthogonal and $E = \Lambda^{\frac{1}{2}}$ is diagonal
  • $d \bs w(t) d \bs w(t) ^T = RE d \bs z(t) d \bs z(t)^T E^T R^T = R \Lambda R^T dt = C dt$
  • often it is good enough to just shift the first principal component (a sketch follows the list)
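A hedged sketch of the two decompositions and of shifting only the first principal component (the 3x3 correlation matrix and the names below are illustrative, not from the lecture):

import numpy as np

np.random.seed(0)
C = np.array([[1., .8, .5],
              [.8, 1., .6],
              [.5, .6, 1.]])                 # illustrative correlation matrix
nn, d = 10000, -1.5                          # samples, drift applied to the first PC

z = np.random.normal(size=[nn, 3])           # independent standard normals

L = np.linalg.cholesky(C)                    # Cholesky: C = L L^T
w_chol = z.dot(L.T)                          # correlated normals, cov = C

lam, R = np.linalg.eigh(C)                   # EVD: C = R diag(lam) R^T
E = np.sqrt(lam)                             # E = Lambda^(1/2), diagonal
pc1 = np.argmax(lam)                         # first (largest) principal component
z_s = z.copy()
z_s[:, pc1] += d                             # shift only that component
w_pca = z_s.dot((R*E).T)                     # correlated, drifted along PC1; cov = C

rn = np.exp(-d*z[:, pc1] - .5*d*d)           # dQ/dP weight that undoes the shift,
                                             # as in the 1-D normal example earlier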

Low discrepancy sequence

Stephen Sondheim: art, in itself, is an attempt to bring order out of chaos.

In [20]:
# need to do "pip install sobol_seq"
from sobol_seq import i4_sobol_generate
def sobol(dim, n, skip) :
    s = i4_sobol_generate(dim, n+skip)
    return s[skip:, :].T

Coverage of pseudo random samples

Random sampling:

  • it features obvious holes in the sample coverage
  • leading to rather slow convergence of $\propto \frac{1}{\sqrt{n}}$
In [21]:
es = np.random.uniform(size=[2, 2000])
figure(figsize=[11, 4])

subplot(1, 2, 1)
n = 200
title('%d random samples' % n, fontsize=16)
plot(es[0,:n], es[1, :n], '.')

subplot(1, 2, 2)
n = 2000
title('%d random samples' % n, fontsize=16)
plot(es[0,:n], es[1, :n], '.');

Low discrepancy sequence

Low discrepancy sequences (LDS) place samples in a methodical and deterministic fashion:

  • the whole sampling space is evenly covered
  • successive addition of sample points leads to uniformly better coverage
  • also known as quasi-random sequences (QRS)

There are many types of low discrepancy sequences:

  • e.g. Sobol sequence, Halton sequence

MC with low discrepency sequence:

  • easy to implement: just replace the uniform random number generator
  • the variance has to be estimated using the batching method

LDS vs PRS

In [22]:
#from sobol_lib import i4_sobol_generate as sobol
x = sobol(2, 5000, 0)

figure(figsize=[11, 8])
subplot(2, 2, 1)
m = 256
plot(es[0, :m], es[1, :m], '.')
title('%d random samples' % m)
subplot(2, 2, 2)
plot(x[0, :m], x[1, :m], '.')
title('%d Sobol sequence' % m)

subplot(2, 2, 3)
m = 2048
plot(es[0, :m], es[1, :m], '.')
title('%d random samples' % m)
subplot(2, 2, 4)
plot(x[0, :m], x[1, :m], '.')
title('%d Sobol sequence' % m);

LDS and dimensionality

LDS gives excellent convergence for a low to medium number of dimensions:

  • popular LDS converge at $\propto \frac{\log^d(n)}{n}$, where $d$ is the dimensionality and $n$ is the number of samples.
  • much faster than the convergence of random samples of $\propto \frac{1}{\sqrt{n}}$

LDS does not work well for very high dimensionalities:

  • The coverage deteriorates when dimensionality > 20
In [23]:
x = sobol(40, 500, 0)

figure(figsize=[11, 4])
subplot(1, 2, 1)
plot(x[8, :], x[9, :], '.')
xlabel('Dimension 9')
ylabel('Dimension 10')
title('Sobol sequence: dimension 9 vs 10')

subplot(1, 2, 2)
plot(x[5,:], x[37, :], '.')
xlabel('Dimension 5')
ylabel('Dimension 37')
title('Sobol sequence: dimension 5 vs 37');

Common mistakes in using LDS

LDS is dimension specific,

  • we have to produce and consume the LDS in the same number of dimensions
  • we cannot regroup the LDS into a different number of dimensions

If we draw 512 numbers from a 1-D Sobol sequence and then split them into two groups to cover a two-dimensional grid, the coverage is poor:

In [24]:
x = sobol(1, 512, 0)[0]
figure(figsize=[11, 4])
subplot(1, 2, 1)
plot(x[:256], x[256:], '.')
xlabel('First half')
ylabel('Second half')
title('First half vs second half')

subplot(1, 2, 2)
plot(x[::2], x[1::2], '.')
xlabel('Odd samples')
ylabel('Even samples')
title('Odd vs Even');

LDS and Brownian motion

Given an LDS sequence $\hat {\bs d}$, $\Phi^{-1}(\hat {\bs d})$ can be used as the normal increments to construct Brownian motions:

  • works well for problems with low dimensionalities, i.e., <20
  • may be problematic due to poor coverage for higher dimensionalities

In the Asian option example,

  • there are 12 dimensions for 12 monthly observations
  • the variance reduction is about 20-60 times using the Sobol sequence, but it varies with the number of paths
In [25]:
p = 2**15
n = 12
b = 20

es = np.random.normal(size=[p, n])
tic = time.perf_counter()
u0, e0 = mc0(p0, es)
t0 = time.perf_counter() - tic

uss = []
u_lds = sobol(n, b*p, 0)
tic = time.perf_counter()
for i in range(b) :
    lds = norm.ppf(u_lds[:, i*p:(i+1)*p]).T
    u1, _ = mc0(p, lds)
    uss.append(u1)
t1 = time.perf_counter() - tic

vf = e0*e0/np.var(uss)
In [26]:
df = pd.DataFrame([[p, u0, e0, t0], [p, np.mean(uss), np.std(uss), t1/b]], 
                  index=['PRS', 'LDS'], columns=['Paths', 'Price', 'Error (Std)', 'Time(s)'])

fmt.displayDF(df, "4g", 4)
Paths Price Error (Std) Time(s)
PRS 32768 1.639 0.007318 0.009796
LDS 32768 1.637 0.0009927 0.05028

LDS with Brownian Bridge

Often in practice we run into problems with more than 30 dimensions,

  • in these situations, LDS could be dangerous.

A useful technique is to use LDS at only a few key tenors, then use a Brownian bridge to fill in the gaps.

  • we use a 1-D LDS at the option maturity; the rest of the path is built with a Brownian bridge (BB)
  • a modest variance reduction is achieved in the Asian option example
  • more reduction is expected if we sample multiple time steps with LDS
In [28]:
ubs = []
# 1-D Sobol draws for the terminal Brownian value; +1e-10 avoids norm.ppf(0) = -inf
# on the first Sobol point, and sqrt(n) scales the terminal to the sum of n unit normals
lds_bb = norm.ppf(np.array(sobol(1, b*p, 0)[0])+1e-10)*sqrt(n)

tic = time.perf_counter()
for i in range(b) :
    es = np.random.normal(size=[p, n])
    es_u = np.sum(es, 1)
    # shift every increment equally so the increments sum to the LDS terminal draw,
    # i.e., condition each path on its endpoint (a Brownian bridge in increment form)
    d = (lds_bb[i*p:(i+1)*p] - es_u)/n
    es_bb = (es.T + d).T

    ubb, ebb = mc0(p, es_bb)
    ubs.append(ubb)
t2 = time.perf_counter() - tic

vf = e0*e0/np.var(ubs)
In [29]:
df = pd.DataFrame([[p, u0, e0, t0], [p, np.mean(ubs), np.std(ubs), t2/b]], 
                  index=['PRS', 'LDS with BB'], columns=['Paths', 'Price', 'Error (Std)', 'Time(s)'])

fmt.displayDF(df, "4g", 4)
Paths Price Error (Std) Time(s)
PRS 32768 1.639 0.007318 0.009796
LDS with BB 32768 1.636 0.005965 0.02702
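The cell above conditions all twelve increments on the LDS terminal draw in one shot; a more literal, sequential Brownian bridge fill, which extends to several LDS-pinned tenors by repeating the fill segment by segment, might look like the following sketch (illustrative names, not the lecture's code):

import numpy as np

def bridge_fill(ts, wT, rng):
    # fill w at times ts (ending at T), starting from w(0) = 0, given the pinned
    # terminal w(T) = wT, using the bridge law
    #   w(t) | w(s)=a, w(T)=b ~ N(a + (t-s)/(T-s)*(b-a), (t-s)*(T-t)/(T-s))
    w = np.empty(len(ts))
    w[-1] = wT
    a, s, T = 0., 0., ts[-1]
    for i, t in enumerate(ts[:-1]):
        mean = a + (t - s)/(T - s)*(wT - a)
        var = (t - s)*(T - t)/(T - s)
        w[i] = mean + np.sqrt(var)*rng.standard_normal()
        a, s = w[i], t
    return w

rng = np.random.default_rng(0)
ts = np.arange(1., 13.)                       # 12 steps with unit variance per step
wT = rng.standard_normal()*np.sqrt(ts[-1])    # terminal draw (an LDS draw in practice)
path = bridge_fill(ts, wT, rng)
incr = np.diff(np.concatenate([[0.], path]))  # unit-scale increments for one path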

Other Techniques

Moment matching

Shift and scale the random samples so that they have the desired moments.

  • often applied to normal random numbers
  • has limited variance reduction effect
In [30]:
x = np.random.normal(size = 1000)
u = np.mean(x)
vol = np.std(x)
y = (x-u)/vol

figure(figsize=[12, 4])
subplot(1, 2, 1)
hist(x, 50);

text(-3.5, 50, '$\\mu_x=%.4g$' % np.mean(x), size='x-large');
text(-3.5, 45, '$\\sigma_x=%.4g$' % np.std(x), size='x-large');
title('Random normal samples $\\hat x$', fontsize=16)

subplot(1, 2, 2)
hist(y, 50)
text(-3.5, 50, '$\\mu_y=0$', size='x-large');
text(-3.5, 45, '$\\sigma_y=1$', size='x-large');
title('Moment matched samples $\\hat y = \\frac{\\hat x - \\mu}{\\sigma_x}$', fontsize=16);

Stratification

Stratification is essentially bucketing, in order to better capture the true distribution.

Consider drawing uniform random numbers in 1-D:

  • divide the range [0, 1] into $n$ buckets (or strata), each with length $\frac{1}{n}$
  • map the uniform draws to the buckets in a round-robin manner:
$$\hat v_i = \frac{(i \mod n) + \hat u_i}{n}$$
  • $\hat v_i$ are stratified samples, and it can be converted to other distributions using inverse CDF

Stratification is similar to LDS in spirit, but:

  • more difficult to extend to higher dimensions: the number of strata explodes with the dimension
  • less popular than LDS in practice
In [31]:
def stratify(u, bs, shuffle) :
    b = len(bs)
    r = len(u)//b + 1
    sb = []
    
    for i in range(r) :
        if shuffle :
            np.random.shuffle(bs)
        sb = sb + bs.tolist()
            
    return [1.*(i + x)/b for x, i in zip(u, sb)]
In [32]:
bs = np.arange(1000)
n = 2000
u = np.random.uniform(size=n)
v = stratify(u, bs, False)

figure(figsize=[11, 4])

subplot(1, 2, 1)
title('%d random samples' % n, fontsize=16)
x = norm.ppf(u)
hist(x, 100);
text(-3.5, 50, '$\\mu=%.4f$' % np.mean(x), size='x-large');
text(-3.5, 45, '$\\sigma=%.4f$' % np.std(x), size='x-large');

subplot(1, 2, 2)
title('%d stratified samples' % n, fontsize=16)
y = norm.ppf(v)
hist(y, 100);
text(-3.5, 50, '$\\mu=%.4f$' % np.mean(y), size='x-large');
text(-3.5, 45, '$\\sigma=%.4f$' % np.std(y), size='x-large');

Practical matters in variance reduction

Important: Different methods do not mix well.

  • combining two methods does not necessarily yield more variance reduction,
  • it often messes things up

Do not abuse methods that are easy to implement, such as antithetic variate and LDS:

  • Control variate and importance sampling deserve more attention

Variance reduction summary

Methods                  | Effectiveness | Generality | Batching? | Implementation | Best for
Antithetic Variate       | low           | high       | no        | easy           | linearish payoffs
Control Variate          | very high     | low        | no        | hard           | has vanilla proxy
Importance Sampling      | high          | low        | yes       | hard           | rare events
Low Discrepancy Sequence | varies        | high       | yes       | easy           | low dimensionality
Moment Matching          | low           | high       | yes       | easy           | ??
Stratified Sampling      | varies        | low        | yes       | easy           | low dimensionality

Assignment

Recommended reading:

  • Andersen and Piterbarg: Chapter 3.3-3.4

Homework: