Lecture 8: Optimization

Topics

Introduction
- Formulation
- Types of Optimization Problems
- Basic Concepts
- Optimization Examples in Finance

Unconstrained Optimization
- Optimality Characterization
- Solution Methods
  - Direct search method
  - Descent Methods

Constrained Optimization
- General framework
- Duality
- Optimality conditions
  - Why KKT conditions
- Solution Methods

Introduction

Optimization exists in every field, particularly in finance: most activities in finance revolve around some goals that needed to be optimized, e.g. maximizing profits, minimizing risk, maximizing utility, and minimizing costs etc.

Formulation

An optimization problem usually involves three elements:

Objective: to be optimized: profit, loss, risk, etc

Variables: numbers of shares in each stock, amount of capital to be invested in each sector, etc

Constraints: some choices are restricted, total risk cannot exceed certain criteria, total amount of capital is limited, etc

Mathematically,

$$ \renewcommand{bx}{\boldsymbol x} \Large{ \min_{\bx\in \Psi, \; \Psi\subset \mathbb{R}^n}} f(\bx) $$

where $ f(\bx): \mathbb{R}^n \to \mathbb{R} $ is a scalar objective function, $\Psi$ is a subset of $\mathbb{R}^n$ and called the feasible region.

The minimization and maximization problems are equivalent $$ \min_{\bx\in \Psi} f(\bx) = -\max_{\bx\in \Psi} (-f(\bx)) $$

Types of Optimization Problems

If $\Psi = \mathbb{R}^n$, the problem is unconstrained, otherwise, it is constrained

If $f(\bx)$ is linear and $\Psi$ is a polyhedron, then Linear Programming, otherwise, Nonlinear Programming

If $f(\bx)$ is quadratic and $\Psi$ is a polyhedron, then Quadratic Programming

If $f(\bx)$ and $\Psi$ are convex, then Convex Optimization (of which linear and quadratic programming are special cases)

If $\Psi$ contains discrete variables, then Discrete Optimization

If only integer variables are allowed, then Integer Programming, Mixed Integer Programming involves problems in which only some of the variables are constrained to be integers.

If the specifications of $f(\bx)$ and $\Psi$ are NOT deterministic, then Stochastic Programming

Another name you often hear is Dynamic Programming, this does not refer to a particular type of optimization problem, rather it is a method for solving an optimization problem by breaking it down to a collection of simpler subproblems, using Bellman's Principle of Optimality.

Basic Concepts

A point $\bx^*$ is called a local minimum, if $\exists \epsilon > 0$ $$ f(\bx^*) \leq f(\bx), \forall \bx\in\Psi, s.t. \| \bx - \bx^* \| \lt \epsilon. $$
A point $\bx^*$ is a global minimum, if
$$ f(\bx^*) \leq f(\bx), \forall \bx\in\Psi. $$

Finding a global minimum is considerably harder than a local minimum --- imaging the difference in the difficulties of getting to the top of your neighborhood hill top vs getting to the top of Mt. Everest.
With one exception: for convex problems, the local minimum is the also global solution.

If $f(\bx)$ is differentiable (i.e. first derivatives exist), then a necessary condition for a local minimizer is that $\bx^*$ is a critical point

$$ \renewcommand{bs}{\boldsymbol}{g} (\bx^*) = {\bs\nabla} _\bx f(\bx^*) = 0 $$

Clearly, finding the critical points is equivalent to root searching problems you have encountered earlier.
On the other hand, observing that solving $f(\bx) = \bs{0}$ is equivalent to

$$ \min_{\bx}\left[ f(\bx)^T f(\bx)\right], $$

(although this is rarely advised --- doing anything through square or higher power is generally a bad idea --- it makes the problem harder to solve).

But this shows the root searching problem and the optimization problem are closely related.

Critical point is only a necessary condition, it's not sufficient,

If $f(\bx)$ is twice-differentiable (i.e. second derivatives exist), then a sufficent condition for a local minimizer is, in addition to being a critical point, the Hessian at $\bx^*$ is positive definite

$$ {\bs H} (\bx^*) = {\bs\nabla^2} _\bx f(\bx^*) \succ 0 $$

These are all natural conclusions from what you are already very familiar with (hopefully!): properties of a quadratic function and the Taylor series for a general (twice-differentiable) function $$ f(\bx^* + \bs\delta x) = f(\bx^*) + {\bs\nabla} f(\bx^*)^T {\bs\delta x} + \frac{1}{2} {\bs\delta x}^T{\bs\nabla^2} f(\bx^*) {\bs\delta x} + {\bs O}(||{\bs\delta x}||^3) $$

Optimization Examples in Finance

1. Portfolio Optimization

$$ \begin{array} \\ \min_{\bx} & & \frac{1}{2} \lambda\; \bx^T \Sigma \bx - \mu^T \bx \\ s.t. & & \Sigma x_i = 1 \end{array} $$ where $\lambda$ is the risk-aversion coefficient, $\mu$ is the expected asset return vector and $\Sigma$ is the covariance matrix. This is a quadratic programming problem.

2. (Static) Asset-Liability Management

$$ \begin{array} \\ \min_{\bx} & & \Sigma_j x_j P_j \\ s.t. & & \Sigma_j x_j C_j(t) \geq L(t) \;\; \forall t \\ & & x_j \geq 0 \;\; \forall j \end{array} $$ where $x_j, P_j, C_j(t)$ are the amount, price and cashflow at time $t$ of asset $j$. $L(t)$ is the liability payment at time $t$. This is a linear programming problem.

3. Volatility Surface fitting

$$ \min_{\sigma(S,t)} {\Large\Sigma_j^n} (C(\sigma(S,t),K_j,T_j) - C_j)^2 $$ where $\sigma(S,t) > 0$ is the volatility value at the surface point $(S,t)$, $C(\sigma(S,t),K_j,T_j)$ is the standard Black-Scholes formule for European call options, $C_j$ is the market quoted price of the option. This a non-linear optimization problem.

Unconstrained Optimization

Unconstrained means:

$$ \min_{\bx\in \mathbb{R}^n} f(\bx) $$

Will be focusing on smooth functions (typically at least twice differentable). The main purpose here is to show the various types of algorithms for solving the unconstrained optimization problems.

Optimality Characterization

The necessary condition: the solution must be a critical point:

$$ {\bs g} (\bx^*) = {\bs\nabla} _\bx f(\bx^*) = 0 $$

The sufficient condition: the Hessian at the optimal point must be positive definite,

$$ {\bs H} (\bx^*) = {\bs\nabla^2} _\bx f(\bx^*) \succ 0 $$

Solution Methods

In practice, optimization problems are often solved using an iterative algorithm, which searches for a sequence of points,

$$ \bx^0, \bx^1, \bx^2, \cdots, \bx^n, \cdots $$ with $$ f(\bx^{k+1}) < f(\bx^k). $$

The algorithm typically stops at $||{\bs\nabla} f(\bx) || < \epsilon $ for some $\epsilon$ small.
No guarantee for finding the global minimum.

1. Direct Search Method

Similar in spirit to the Bisection method in one dimemsion. Requires only function evaluations.
Quoting M. Wright: "A direct serach method does not 'in its heart' develop an approximate gradient".

Representative: Nelder-Mead Method (or Simplex Search method)
- Searches through the simplex vertices (polytope of N+1 vertices in N dimensions)
- Techniques: reflection - expansion - contraction - reduction
Scipy example:

In [12]:

import numpy as np
from scipy.optimize import minimize

def rosen(x):
     return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)
    
x0 = np.array([1.3, 0.7, 0.8, 2.2, 1.2, 2.1])
res = minimize(rosen, x0, method='nelder-mead',
                options={'xtol': 1e-8, 'disp': True})

print "The solution from Nelder-Mead:"
print  (res.x)

Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 650
         Function evaluations: 1031
The solution from Nelder-Mead:
[ 1.  1.  1.  1.  1.  1.]

Advantages: simple, only function evaulation needed.
Deficiencies: slow, may fail to converge in higher dimensions
Suffers from the "curse of dimensionality"

2. Descent Methods

Algorithm for general descent method

Given a starting point $ \bx^0$
Repeat
1. Determine a descent direction $\delta \bx$;
2. Line search. Choose a step size $t > 0$;
3. Update. $\bx^{k+1} = \bx^k + t \delta \bx $
Until stopping criterion is satisfied

The algorithm alternates between two main decisions: determine a descent direction $\delta \bx$ and choose a step size $t$.

Different ways of choosing the descent direction giving rise to different descent method and convergence rate
The line search method falls into two categories: exact line search and backtracking line search.

2.1 Steepest Descent Method

If the objective function is differentiable, we have $$ f(\bx^k + t \bs\delta x) \approx f(\bx^k) +t [ {\bs\nabla} f(\bx^k)^T {\bs\delta x} ] $$
This means choosing gradient direction $$ \bs\delta x = - {\bs\nabla} f(\bx^k) $$ will lead to the steepest descent at points sufficiently close to $\bx^k$.

The line search step size can be done in one-dimensional minimization: $$ t^k = \min_t f(\bx^k + t \bs\delta x) \triangleq \min_t \phi(t). $$

Exact line search (choosing the minimizing $t$ above) leads to zig-zag path towards the minimum: which means slow convergence $$ \phi'(t) = 0 = [{\bs\nabla} f(\bx^k + t \bs\delta x)]^T \bs\delta x , $$ (notice the two consecutive search directions will be perpendicular to each other, we've met this problem before, what's the strategy?).
Convergence: the steepest descent method converges linearly, and it will behave badly if the condition number of the Hessian (the second order derivative matrix) is large.

2.2 Newton's Method

If the objective function is twice differentiable, we have with more accuracy $$ f(\bx^k + \bs\delta x) \approx f(\bx^k) + {\bs\nabla} f(\bx^k)^T {\bs\delta x} + \frac{1}{2} {\bs\delta x}^T{\bs\nabla^2} f(\bx^k) {\bs\delta x} $$
RHS is a quadratic function in ${\bs\delta x}$, so the minimum is achieved at $$ \bs\delta x = - [{\bs\nabla^2} f(\bx^k)]^{-1} {\bs\nabla} f(\bx^k). $$

Convergence of the Newton's method is rapid in general, and quadratic once entering into the pure Newton phase.
Disadvantages of Newton's method:
- The cost of computing and storing the Hessian can be very high, if not outright prohibitive
- The cost of solving the set of linear equation at the Newton step
There are various ways to compute an approximation of the Hessian to substantially reduce the cost of computing the Newton step. This leads to a family of algorithms called Quasi-Newton methods.

Constrained Optimization

Now add constraints.
Constrainted problems are much harder: even a seemingly simple Integer Programming problem is NP-Complete (i.e. can't be solved in polynomial time).

General Framework

Constrained Optimization Problem

$$ \begin{array} \\ \min_{\bx\in \mathbb{R}^n } & f(\bx) \\ s.t. &\bs{h}(\bx) = \bs{0} \\ &\bs{g}(\bx) \leq \bs{0} \end{array} $$

Will be focusing on smooth functions (typically at least twice differentable).
The goal is to find a local minimum satisfying the constriants.
And we will denote the feasible region as domain $\mathcal{D} = \{\bx \in \mathbb{R}^n | \; \bs{h}(\bx) = \bs{0}, \; \bs{g}(\bx) \leq \bs{0}\}$.

Duality

Define the Lagrangian as, $$ \renewcommand{ml}{\mathbb{\mathcal L}} \renewcommand{bmu}{\boldsymbol{ \mu}} \renewcommand{bld}{\boldsymbol{ \lambda}} \ml(\bx, \bmu, \bld) = f(\bx) + \bmu^T \bs{h}(\bx) + \bld^T \bs{g}(\bx). $$

Here the vectors $\bmu, \bld$ are called the dual variables or Lagrange Multipliers.

Further, define the Lagrangian dual function as,

$$ \renewcommand{mD}{\mathbb{\mathcal D}} \renewcommand{df}{\hat{f}} \df(\bmu, \bld) = \inf_{\bx\in \mD}\; \ml(\bx, \bmu, \bld) =\inf_{\bx\in\mD} \left( f(\bx) + \bmu^T \bs{h}(\bx) + \bld^T \bs{g}(\bx) \right). $$

The dual function takes the pointwise infimum of a family of affine functions of $(\bmu, \bld)$, it is a concave function.

If $\bx^*$ is a solution to the original optimization problem (Primal Problem), then for $\forall \bld \succeq 0$ and any $\bmu$,

$$ \df(\bmu, \bld) \leq f(\bx^*). $$

(HW: prove the above statement).

Which leads to the optimization problem (Dual Problem)

$$ \begin{array} \\ \max_{(\bmu, \bld)} & \df(\bmu, \bld) \\ s.t. &\bld \succeq 0 \end{array} $$

If $(\bmu^*, \bld^*)$ is a solution to the Dual Problem, it's clear that the weak duality holds,

$$ \df(\bmu^*, \bld^*) \leq f(\bx^*). $$

However, if the Primal problem satisfies certain constraint qualifications (such as, convexity or Slater's condition), then the strong duality holds, $$ \df(\bmu^*, \bld^*) = f(\bx^*), $$

which implies the primal and the dual problems are equivalent.

In the Maximum Entropy Method lecture later, this will be explored in greater details.

Optimality Conditions

From the duality principle,
1. $ \bs{\nabla}_{\bx}\ml(\bx^*, \bmu^*, \bld^*) = \bs{0} $, stationality
2. $ \bs{h}(\bx^*) = \bs{0}; \bs{g}(\bx^*) \leq \bs{0} $, feasibility
3. $ \bld^* \succeq \bs{0} $, dual feasibility (component-wise)
4. $ \bld^* \circ \bs{g}(\bx^*) = \bs{0} $, complementary slackness (component-wise)
5. $ \bs{\nabla}^2\ml(\bx^*, \bmu^*, \bld^*) \succ 0 $, Hessian positive definite constraints

These conditions are called KKT conditions (Karush-Kuhn-Tucker) --- the necessary and sufficient conditions for $\bx^*$ to be a local minimizer.

Exploring the KKT conditions further

For the unconstrained case, the conditions 2, 3 and 4 drop out, what's left are:
1. $ \bs{\nabla}_{\bx}f(\bx^*) = \bs{0} $, stationality
2. $ \bs{\nabla}^2f(\bx^*) \succ 0 $, Hessian positive definite constraints

which we are familiar with---the necessary and sufficient condition for $\bx^*$ to be a (local) minimizer.

For the equality constrained case, let's take a look at the Lagrange function

$$ \ml(\bx, \bmu) = f(\bx) + \bmu^T \bs{h}(\bx) $$

If we simply consider this as an unconstrained problem with $(\bx, \bmu)$ as the new unknown vector and apply the two conditions in previous slide
1. $ \bs{\nabla}_{\bx}\ml(\bx^*, \bmu^*) = \bs{0} $, stationality
2. $ \bs{\nabla}_{\bmu}\ml(\bx^*, \bmu^*) = \bs{0} $, stationality
3. $ \bs{\nabla}^2_{\bx\bx}\ml(\bx^*, \bmu^*) \succ 0 $, Hessian positive definite constraints

Notice the second condition above is simply $\bs{h}(\bx^*) = \bs{0}$.

For the inequality(only) constrained problems, the answer can be analyzed in two scenarios based on how the local minimizer $\bx^*$ is situated in the feasible region.

Scenario 1: $\bs{g}(\bx^*) < \bs{0}$, in this case the point $\bx^*$ is an interior point of the feasible domain and the constraints is called inactive and the case simply reduces to the unconstrained case:
1. $ \bs{\nabla}_{\bx}f(\bx^*) = \bs{0} $, stationality
2. $ \bs{\nabla}^2_{\bx\bx}f(\bx^*) \succ 0 $, Hessian positive definite constraints
3. and of course $ \bs{g}(\bx^*) < \bs{0} $,

Scenario 2: $\bs{g}(\bx^*) = \bs{0}$, in this case the point $\bx^*$ is right on the boundary of the feasible domain and the constraints is called active and the case reduces to the equality constrained case:
1. $ \bs{\nabla}_{\bx}f(\bx^*) + \bld^T \bs{\nabla}_{\bx}\bs{g}(\bx^*) = \bs{0} $, stationality
2. $ \bs{\nabla}^2_{\bx\bx}f(\bx^*) \succ 0 $, Hessian positive definite constraints
3. $ \bs{g}(\bx^*) = \bs{0} $
4. and $ \bld > 0 $ component wise

Note the last condition ensures that the decreasing direction of $f(\bx)$ is strictly pointing outwards of the feasible region.

Best way to visualize the conditions 1 and 4 above is through the followimng example, where

$$ f(\bx^*) = (x_1 -2)^2 + (x_2 -2)^2, \;\;\; g(\bx^*) = x_1^2 + x_2^2 - 1$$

Solution Methods

The solution method for the constrained optimization problems are generally an extension of those for unconstrained version. We are not going into great details, but will use the Sequential Quadratic Programming (SQP) method as an example.

The SQP is a natural extension of the Newton's Descent method introduced earlier for the unconstrained case: recall that the Newton's method does a quadratic optimization in each iteration step. For SQP, at each step, the original optimization problem is approximated with

$$ \begin{array} \\ \min_{\bx\in \mathbb{R}^n} & {\bs\nabla} f(\bx^k)^T {(\bx - \bx^k)} + \frac{1}{2} {(\bx - \bx^k)}^T{\bs\nabla^2} f(\bx^k) {(\bx - \bx^k)} \\ s.t. & {\bs\nabla} \bs{h}(\bx^k)^T {(\bx - \bx^k)} + \bs{h}(\bx^k) = \bs{0} \\ & {\bs\nabla} \bs{g}(\bx^k)^T {(\bx - \bx^k)} + \bs{g}(\bx^k) \leq \bs{0} \end{array} $$

This is a constrained quadratic programming problem, which is "slightly" easier to deal with than the original problem.

Homework:

HW on the slide for Duality
For the portfolio optimization example given earlier, derive the dual problem.
Due on Apr 8th, 2015.