II. Unconstrained Optimization
Optimization exists in every field, particularly in finance: most activities in finance revolve around some goals that needed to be optimized, e.g. maximizing profits, minimizing risk, maximizing utility, and minimizing costs etc.
An optimization problem usually involves three elements:
Mathematically,
$$ \renewcommand{bx}{\boldsymbol x} \Large{ \min_{\bx\in \Psi, \; \Psi\subset \mathbb{R}^n}} f(\bx) $$
where $ f(\bx): \mathbb{R}^n \to \mathbb{R} $ is a scalar objective function, $\Psi$ is a subset of $\mathbb{R}^n$ and called the feasible region.
A point $\bx^*$ is called a local minimum, if $\exists \epsilon > 0$ $$ f(\bx^*) \leq f(\bx), \forall \bx\in\Psi, s.t. \| \bx - \bx^* \| \lt \epsilon. $$
A point $\bx^*$ is a global minimum, if
$$
f(\bx^*) \leq f(\bx), \forall \bx\in\Psi.
$$
Finding the global minimum is considerably harder than a local minimum --- imaging the difference in the difficulties of getting to the top of your neighborhood hill top vs getting to the top of Mt. Everest.
With one exception: for convex problems, any local minimum is also globally optimal (homework).
$$ \renewcommand{bs}{\boldsymbol}{g} (\bx^*) = {\bs\nabla} _\bx f(\bx^*) = 0 $$
Clearly, finding the critical points is equivalent to root searching problems you have encountered earlier.
On the other hand, observing that solving $f(\bx) = \bs{0}$ is equivalent to
$$ \min_{\bx}\left[ f(\bx)^T f(\bx)\right], $$
(although this is normally ill-advised --- doing anything through square or higher power when you can avoid it is generally a bad idea --- it makes the problem harder to solve, think ill-conditioning as a starter).
$$ {\bs H} (\bx^*) = {\bs\nabla^2} _\bx f(\bx^*) \succ 0 $$
$$ f(\bx^* + \bs\delta x) = f(\bx^*) + {\bs\nabla} f(\bx^*)^T {\bs\delta x} + \frac{1}{2} {\bs\delta x}^T{\bs\nabla^2} f(\bx^*) {\bs\delta x} + {\bs O}(||{\bs\delta x}||^3) $$
$$ \begin{array} \\ \min_{\bx} & & \frac{1}{2} \lambda\; \bx^T \Sigma \bx - \mu^T \bx \\ s.t. & & \Sigma x_i = 1 \end{array} $$ where $\lambda$ is the risk-aversion coefficient, $\mu$ is the expected asset return vector and $\Sigma$ is the covariance matrix. This is a quadratic programming problem.
$$ \begin{array} \\ \min_{\bx} & & \Sigma_j x_j P_j \\ s.t. & & \Sigma_j x_j C_j(t) \geq L(t) \;\; \forall t \\ & & x_j \geq 0 \;\; \forall j \end{array} $$ where $x_j, P_j, C_j(t)$ are the amount, price and cashflow at time $t$ of asset $j$. $L(t)$ is the liability payment at time $t$. This is a linear programming problem.
$$ \min_{\sigma(S,t)} {\Large\Sigma_j^n} (C(\sigma(S,t),K_j,T_j) - C_j)^2 $$ where $\sigma(S,t) > 0$ is the volatility value at the surface point $(S,t)$, $C(\sigma(S,t),K_j,T_j)$ is the standard Black-Scholes formule for European call options, $C_j$ is the market quoted price of the option. This a non-linear optimization problem.
$$ \min_{\bx\in \mathbb{R}^n} f(\bx) $$
$$ {\bs g} (\bx^*) = {\bs\nabla} _\bx f(\bx^*) = 0 $$
$$ {\bs H} (\bx^*) = {\bs\nabla^2} _\bx f(\bx^*) \succ 0 $$
$$ \bx^0, \bx^1, \bx^2, \cdots, \bx^n, \cdots $$ with $$ f(\bx^{k+1}) < f(\bx^k). $$
The algorithm typically stops at $||{\bs\nabla} f(\bx) || < \epsilon $ for some $\epsilon$ small.
No guarantee for finding the global minimum.
Similar in spirit to the Bisection method in one dimemsion. Requires only function evaluations.
Quoting M. Wright: "A direct search method does not 'in its heart' develop an approximate gradient".
Representative: Nelder-Mead Method (or Simplex Search method)
Scipy example:
import numpy as np
from scipy.optimize import minimize
def rosen(x):
return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)
x0 = np.array([1.3, 0.7, 0.8, 2.2, 1.2, 2.1])
res = minimize(rosen, x0, method='nelder-mead',
options={'xtol': 1e-8, 'disp': True})
print("The solution from Nelder-Mead:", res.x)
Optimization terminated successfully. Current function value: 0.000000 Iterations: 650 Function evaluations: 1031 The solution from Nelder-Mead: [ 1. 1. 1. 1. 1. 1.]
Advantages: simple, only function evaulation needed.
Deficiencies: slow, may fail to converge in higher dimensions
Suffers from the "curse of dimensionality"
Algorithm for general descent method
Repeat
Until stopping criterion is satisfied
The algorithm alternates between two main decisions: determine a descent direction $\delta \bx$ and choose a step size $t$.
Different ways of choosing the descent direction giving rise to different descent method and convergence rate
The line search method falls into two categories: exact line search and backtracking line search.
$$ f(\bx^k + t \bs\delta x) \approx f(\bx^k) + t [ {\bs\nabla} f(\bx^k)^T {\bs\delta x} ] $$
$$ \bs\delta x = - {\bs\nabla} f(\bx^k) $$
will lead to the steepest descent at points sufficiently close to $\bx^k$.
Exact line search (choosing the minimizing $t$ above) leads to zig-zag path towards the minimum: which means slow convergence $$ \phi'(t) = 0 = [{\bs\nabla} f(\bx^k + t \bs\delta x)]^T \bs\delta x , $$ (notice the two consecutive search directions will be perpendicular to each other, we've met this problem before, what's the strategy?).
Convergence: the steepest descent method converges linearly, and it will behave badly if the condition number of the Hessian (the second order derivative matrix) is large.
$$ f(\bx^k + \bs\delta x) \approx f(\bx^k) + {\bs\nabla} f(\bx^k)^T {\bs\delta x} + \frac{1}{2} {\bs\delta x}^T{\bs\nabla^2} f(\bx^k) {\bs\delta x} $$
$$ \bs\delta x = - [{\bs\nabla^2} f(\bx^k)]^{-1} {\bs\nabla} f(\bx^k). $$
Convergence of the Newton's method is rapid in general, and quadratic once entering into the pure Newton phase.
Disadvantages of Newton's method:
There are various ways to compute an approximation of the Hessian to substantially reduce the cost of computing the Newton step. This leads to a family of algorithms called Quasi-Newton methods.
Now add constraints.
Constrainted problems are much harder: even a seemingly simple Integer Programming problem is NP-Complete (i.e. can't be solved in polynomial time).
$$ \begin{array} \\ \min_{\bx\in \mathbb{R}^n } & f(\bx) \\ s.t. &\bs{h}(\bx) = \bs{0} \\ &\bs{g}(\bx) \leq \bs{0} \end{array} $$
Will be focusing on smooth functions (typically at least twice differentable).
The goal is to find a local minimum satisfying the constriants.
And we will denote the ovarall domain of definition as $\mathcal{D} = DOM(f)\cap DOM(h) \cap DOM(g)$.
Define the Lagrangian as, $$ \renewcommand{ml}{\mathbb{\mathcal L}} \renewcommand{bmu}{\boldsymbol{ \mu}} \renewcommand{bld}{\boldsymbol{ \lambda}} \ml(\bx, \bmu, \bld) = f(\bx) + \bmu^T \bs{h}(\bx) + \bld^T \bs{g}(\bx). $$
Here the vectors $\bmu, \bld$ are called the dual variables or Lagrange Multipliers.
$$ \renewcommand{mD}{\mathbb{\mathcal D}} \renewcommand{df}{\hat{f}} \df(\bmu, \bld) = \inf_{\bx\in \mD}\; \ml(\bx, \bmu, \bld) =\inf_{\bx\in\mD} \left( f(\bx) + \bmu^T \bs{h}(\bx) + \bld^T \bs{g}(\bx) \right). $$
$$ \df(\bmu, \bld) \leq f(\bx^*). $$
$$ \begin{array} \\ \max_{(\bmu, \bld)} & \df(\bmu, \bld) \\ s.t. &\bld \succeq 0 \end{array} $$
$$ \df(\bmu^*, \bld^*) \leq f(\bx^*). $$
However, if the Primal problem satisfies certain constraint qualifications (such as, convexity or Slater's condition), then the strong duality holds, $$ \df(\bmu^*, \bld^*) = f(\bx^*), $$
which implies the primal and the dual problems are equivalent.
From the duality principle,
For the unconstrained case, the conditions 2, 3 and 4 drop out, what's left are:
$$ \ml(\bx, \bmu) = f(\bx) + \bmu^T \bs{h}(\bx) $$
If we simply consider this as an unconstrained problem with $(\bx, \bmu)$ as the new unknown vector and apply the two conditions in previous slide
Scenario 1: $\bs{g}(\bx^*) < \bs{0}$, in this case the point $\bx^*$ is an interior point of the feasible domain and the constraints is called inactive and the case simply reduces to the unconstrained case:
Scenario 2: $\bs{g}(\bx^*) = \bs{0}$, in this case the point $\bx^*$ is right on the boundary of the feasible domain and the constraints is called active and the case reduces to the equality constrained case:
$$ f(\bx^*) = (x_1 -2)^2 + (x_2 + 2)^2, \;\;\; g(\bx^*) = x_1^2 + x_2^2 - 1$$
$$ \begin{array} \\ \min_{\bx\in \mathbb{R}^n} & {\bs\nabla} f(\bx^k)^T {(\bx - \bx^k)} + \frac{1}{2} {(\bx - \bx^k)}^T{\bs\nabla^2} f(\bx^k) {(\bx - \bx^k)} \\ s.t. & {\bs\nabla} \bs{h}(\bx^k)^T {(\bx - \bx^k)} + \bs{h}(\bx^k) = \bs{0} \\ & {\bs\nabla} \bs{g}(\bx^k)^T {(\bx - \bx^k)} + \bs{g}(\bx^k) \leq \bs{0} \end{array} $$
S. Boyd and L. Vandenberghe (2004), Convex Optimization, Cambridge University Press.