Lecture 8: Optimization

Topics

  • I. Introduction

    • Formulation
    • Types of Optimization Problems
    • Basic Concepts
    • Optimization Examples in Finance
  • II. Unconstrained Optimization

    • Optimality Characterization
    • Solution Methods
      • Direct search method
      • Descent Methods
  • III. Constrained Optimization
    • General framework
    • Duality
    • Optimality conditions
      • Why KKT conditions
    • Solution Methods

I. Introduction

Optimization arises in every field, and particularly in finance: most activities in finance revolve around goals that need to be optimized, e.g. maximizing profit, minimizing risk, maximizing utility, or minimizing cost.

Formulation

An optimization problem usually involves three elements:

  • Objective: the quantity to be optimized: profit, loss, risk, etc.
  • Variables: the numbers of shares in each stock, the amount of capital to be invested in each sector, etc.
  • Constraints: some choices are restricted, e.g. total risk cannot exceed a certain level, the total amount of capital is limited, etc.

Mathematically,

$$ \renewcommand{\bx}{\boldsymbol x} \Large{ \min_{\bx\in \Psi, \; \Psi\subset \mathbb{R}^n}} f(\bx) $$

where $ f(\bx): \mathbb{R}^n \to \mathbb{R} $ is a scalar objective function and $\Psi$ is a subset of $\mathbb{R}^n$ called the feasible region.

  • The minimization and maximization problems are equivalent $$ \min_{\bx\in \Psi} f(\bx) = -\max_{\bx\in \Psi} (-f(\bx)) $$

Types of Optimization Problems

  • If $\Psi = \mathbb{R}^n$, the problem is unconstrained, otherwise, it is constrained
  • If $f(\bx)$ is linear and $\Psi$ is a polyhedron, then Linear Programming; otherwise, Nonlinear Programming. (A polyhedron is the intersection of a finite number of halfspaces and hyperplanes.)
  • If $f(\bx)$ is quadratic and $\Psi$ is a polyhedron, then Quadratic Programming
  • If $f(\bx)$ and $\Psi$ are convex, then Convex Optimization (of which linear and quadratic programming are special cases)
  • If the variables take values in a discrete set, then Discrete Optimization
  • If only integer variables are allowed, then Integer Programming; Mixed Integer Programming refers to problems in which only some of the variables are constrained to be integers
  • If the specifications of $f(\bx)$ and $\Psi$ are NOT deterministic, then Stochastic Programming
  • Another name you often hear is Dynamic Programming. This does not refer to a particular type of optimization problem; rather, it is a method for solving an optimization problem by breaking it down into a collection of simpler subproblems, using Bellman's Principle of Optimality.

Basic Concepts

  • A point $\bx^*$ is called a local minimum if $\exists \epsilon > 0$ such that $$ f(\bx^*) \leq f(\bx), \;\; \forall \bx\in\Psi \;\; s.t. \;\; \| \bx - \bx^* \| < \epsilon. $$

  • A point $\bx^*$ is a global minimum, if
    $$ f(\bx^*) \leq f(\bx), \forall \bx\in\Psi. $$

  • Finding the global minimum is considerably harder than finding a local one: imagine the difference in difficulty between getting to the top of your neighborhood hill and getting to the top of Mt. Everest.

  • One important exception: for convex problems, any local minimum is also a global minimum (homework).

  • If $f(\bx)$ is differentiable (i.e. first derivatives exist), then a necessary condition for a local minimizer is that $\bx^*$ is a critical point

$$ \renewcommand{\bs}{\boldsymbol} \bs{g}(\bx^*) = \bs{\nabla}_\bx f(\bx^*) = \bs{0} $$

  • Clearly, finding the critical points is equivalent to root searching problems you have encountered earlier.

  • Conversely, solving a root-searching problem $\bs{F}(\bx) = \bs{0}$ for a vector-valued function $\bs{F}$ is equivalent to

$$ \min_{\bx}\left[ \bs{F}(\bx)^T \bs{F}(\bx)\right], $$

(although this reformulation is normally ill-advised: going through squares or higher powers when you can avoid it is generally a bad idea, as it makes the problem harder to solve; think ill-conditioning as a starter).

  • This shows that the root-searching problem and the optimization problem are closely related.
  • Being a critical point is only a necessary condition; it is not sufficient.
  • If $f(\bx)$ is twice-differentiable (i.e. second derivatives exist), then a sufficient condition for a local minimizer is, in addition to being a critical point, that the Hessian at $\bx^*$ is positive definite

$$ {\bs H} (\bx^*) = {\bs\nabla^2} _\bx f(\bx^*) \succ 0 $$

  • These are all natural conclusions from what you are (hopefully!) already very familiar with: the properties of a quadratic function and the Taylor series of a general (twice-differentiable) function; a quick numerical check follows the expansion below

$$ f(\bx^* + \bs\delta x) = f(\bx^*) + {\bs\nabla} f(\bx^*)^T {\bs\delta x} + \frac{1}{2} {\bs\delta x}^T{\bs\nabla^2} f(\bx^*) {\bs\delta x} + {\bs O}(||{\bs\delta x}||^3) $$
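As a quick sanity check of these conditions (an illustrative sketch, not from the original lecture): for the quadratic $f(\bx) = \frac{1}{2}\bx^T A \bx - b^T \bx$, the gradient is $A\bx - b$ and the Hessian is $A$, so the critical point solves $A\bx = b$ and is a minimum whenever $A \succ 0$.

In [ ]:
import numpy as np

# f(x) = 1/2 x^T A x - b^T x, with gradient A x - b and Hessian A
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])        # symmetric positive definite (assumed data)
b = np.array([1.0, 1.0])

x_star = np.linalg.solve(A, b)    # critical point: A x - b = 0

print("gradient at x*:", A @ x_star - b)              # ~ [0, 0]
print("Hessian eigenvalues:", np.linalg.eigvalsh(A))  # all > 0 => minimum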

Optimization Examples in Finance

1. Portfolio Optimization

$$ \begin{array}{ll} \min_{\bx} & \frac{1}{2} \lambda\; \bx^T \Sigma \bx - \mu^T \bx \\ s.t. & \sum_i x_i = 1 \end{array} $$ where $\lambda$ is the risk-aversion coefficient, $\mu$ is the expected asset return vector and $\Sigma$ is the covariance matrix. This is a quadratic programming problem.
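For illustration (a sketch with made-up data, not from the lecture), the problem can be handed to scipy's general-purpose solver; the values of $\mu$, $\Sigma$ and $\lambda$ below are assumptions for demonstration only.

In [ ]:
import numpy as np
from scipy.optimize import minimize

mu = np.array([0.08, 0.10, 0.12])          # expected returns (hypothetical)
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])     # covariance matrix (hypothetical)
lam = 3.0                                  # risk-aversion coefficient

def objective(x):
    return 0.5 * lam * x @ Sigma @ x - mu @ x

cons = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1.0},)  # sum_i x_i = 1
res = minimize(objective, np.ones(3) / 3, method='SLSQP', constraints=cons)
print("optimal weights:", res.x)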

2. (Static) Asset-Liability Management

$$ \begin{array}{ll} \min_{\bx} & \sum_j x_j P_j \\ s.t. & \sum_j x_j C_j(t) \geq L(t) \;\; \forall t \\ & x_j \geq 0 \;\; \forall j \end{array} $$ where $x_j, P_j, C_j(t)$ are the amount, price and time-$t$ cashflow of asset $j$, and $L(t)$ is the liability payment at time $t$. This is a linear programming problem.
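A minimal sketch with scipy.optimize.linprog (the two-bond cashflow data below are hypothetical); the constraint $\sum_j x_j C_j(t) \geq L(t)$ is rewritten as $-C\bx \leq -L$, since linprog expects upper bounds.

In [ ]:
import numpy as np
from scipy.optimize import linprog

P = np.array([99.0, 97.0])            # prices P_j (hypothetical)
C = np.array([[105.0,   4.0],         # cashflows C_j(t): rows are times t,
              [  0.0, 104.0]])        # columns are assets j (hypothetical)
L = np.array([100.0, 100.0])          # liability payments L(t)

res = linprog(c=P, A_ub=-C, b_ub=-L, bounds=(0, None))  # min P^T x
print("amounts:", res.x, " cost:", res.fun)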

3. Volatility Surface fitting

$$ \min_{\sigma(S,t)} \sum_{j=1}^n \left( C(\sigma(S,t),K_j,T_j) - C_j \right)^2 $$ where $\sigma(S,t) > 0$ is the volatility value at the surface point $(S,t)$, $C(\sigma(S,t),K_j,T_j)$ is the standard Black-Scholes formula for a European call option, and $C_j$ is the market-quoted price of the $j$-th option. This is a nonlinear optimization problem.
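To give a flavor of the computation, here is a deliberately simplified sketch: instead of a full surface $\sigma(S,t)$, a single flat volatility is fitted to a few quotes (all market data below are hypothetical).

In [ ]:
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def bs_call(S, K, T, r, sigma):
    # standard Black-Scholes price of a European call
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

S, r = 100.0, 0.02
K = np.array([90.0, 100.0, 110.0])     # strikes K_j (hypothetical)
T = np.array([0.5, 1.0, 1.5])          # maturities T_j (hypothetical)
C_mkt = np.array([12.5, 8.0, 5.9])     # market quotes C_j (hypothetical)

sse = lambda sigma: np.sum((bs_call(S, K, T, r, sigma) - C_mkt) ** 2)
res = minimize_scalar(sse, bounds=(0.01, 1.0), method='bounded')
print("fitted flat vol:", res.x)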

II. Unconstrained Optimization

  • Unconstrained means:

$$ \min_{\bx\in \mathbb{R}^n} f(\bx) $$

  • We will be focusing on smooth functions (typically at least twice differentiable). The main purpose here is to show the various types of algorithms for solving unconstrained optimization problems.

Optimality Characterization

  • The necessary condition: the solution must be a critical point:

$$ {\bs g} (\bx^*) = {\bs\nabla} _\bx f(\bx^*) = 0 $$

  • The sufficient condition: in addition to stationarity, the Hessian at the critical point must be positive definite,

$$ {\bs H} (\bx^*) = {\bs\nabla^2} _\bx f(\bx^*) \succ 0 $$

Solution Methods

  • In practice, optimization problems are often solved using an iterative algorithm, which generates a sequence of points,

$$ \bx^0, \bx^1, \bx^2, \cdots, \bx^n, \cdots $$ with $$ f(\bx^{k+1}) < f(\bx^k). $$

  • The algorithm typically stops when $||{\bs\nabla} f(\bx) || < \epsilon $ for some small $\epsilon$.

  • No guarantee for finding the global minimum.

1. Direct Search Method

  • Similar in spirit to the Bisection method in one dimension. Requires only function evaluations.

  • Quoting M. Wright: "A direct search method does not 'in its heart' develop an approximate gradient".

  • Representative: Nelder-Mead Method (or Simplex Search method)

    • Searches by moving the vertices of a simplex (a polytope of N+1 vertices in N dimensions)
    • Techniques: reflection - expansion - contraction - reduction
  • Scipy example:

In [2]:
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    # the Rosenbrock function, a standard test problem for optimizers
    return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)

x0 = np.array([1.3, 0.7, 0.8, 2.2, 1.2, 2.1])
res = minimize(rosen, x0, method='nelder-mead',
               options={'xatol': 1e-8, 'disp': True})

print("The solution from Nelder-Mead:", res.x)
Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 650
         Function evaluations: 1031
The solution from Nelder-Mead: [ 1.  1.  1.  1.  1.  1.]
  • Advantages: simple, only function evaluations needed.

  • Deficiencies: slow, may fail to converge in higher dimensions

  • Suffers from the "curse of dimensionality"

2. Descent Methods

Algorithm for general descent method

  1. Given a starting point $ \bx^0$
  2. Repeat

    1. Determine a descent direction $\delta \bx$;
    2. Line search. Choose a step size $t > 0$;
    3. Update. $\bx^{k+1} = \bx^k + t \delta \bx $
  3. Until stopping criterion is satisfied

The algorithm alternates between two main decisions: determine a descent direction $\delta \bx$ and choose a step size $t$.

  • Different ways of choosing the descent direction give rise to different descent methods and convergence rates

  • Line search methods fall into two categories: exact line search and backtracking line search.

2.1 Steepest Descent Method

  • If the objective function is differentiable, we have

$$ f(\bx^k + t \bs\delta x) \approx f(\bx^k) + t [ {\bs\nabla} f(\bx^k)^T {\bs\delta x} ] $$

  • This means that choosing the negative gradient direction

$$ \bs\delta x = - {\bs\nabla} f(\bx^k) $$

will lead to the steepest descent at points sufficiently close to $\bx^k$.

  • The line search step size can be determined by a one-dimensional minimization: $$ t^k = \arg\min_t f(\bx^k + t \bs\delta x) \triangleq \arg\min_t \phi(t). $$
  • Exact line search (choosing the minimizing $t$ above) leads to a zig-zag path towards the minimum, which means slow convergence: setting $$ \phi'(t) = [{\bs\nabla} f(\bx^k + t \bs\delta x)]^T \bs\delta x = 0 $$ shows that two consecutive search directions are perpendicular to each other (we've met this problem before; what's the strategy?).

  • Convergence: the steepest descent method converges linearly, and it will behave badly if the condition number of the Hessian (the second order derivative matrix) is large.
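A minimal steepest descent implementation with backtracking line search (an illustrative sketch; the Armijo parameters alpha and beta below are conventional choices, not from the lecture). Running it on an ill-conditioned quadratic exhibits the slow, zig-zagging linear convergence just described.

In [ ]:
import numpy as np

def steepest_descent(f, grad_f, x0, alpha=0.3, beta=0.8, tol=1e-6, max_iter=10000):
    # descent direction: negative gradient; step size: backtracking line search
    x = x0.astype(float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:       # stopping criterion ||grad f|| < eps
            break
        t = 1.0
        while f(x - t * g) > f(x) - alpha * t * (g @ g):  # Armijo backtracking
            t *= beta
        x = x - t * g
    return x, k

# ill-conditioned quadratic: f(x) = 1/2 (x1^2 + 50 x2^2)
f = lambda x: 0.5 * (x[0]**2 + 50.0 * x[1]**2)
grad_f = lambda x: np.array([x[0], 50.0 * x[1]])

x_min, iters = steepest_descent(f, grad_f, np.array([5.0, 1.0]))
print("minimizer:", x_min, " iterations:", iters)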

2.2 Newton's Method

  • If the objective function is twice differentiable, we have the more accurate approximation

$$ f(\bx^k + \bs\delta x) \approx f(\bx^k) + {\bs\nabla} f(\bx^k)^T {\bs\delta x} + \frac{1}{2} {\bs\delta x}^T{\bs\nabla^2} f(\bx^k) {\bs\delta x} $$

  • The RHS is a quadratic function in ${\bs\delta x}$, so (assuming the Hessian is positive definite) its minimum is achieved at

$$ \bs\delta x = - [{\bs\nabla^2} f(\bx^k)]^{-1} {\bs\nabla} f(\bx^k). $$

  • Convergence of Newton's method is rapid in general, and quadratic once it enters the pure Newton phase.

  • Disadvantages of Newton's method:

    • The cost of computing and storing the Hessian can be very high, if not outright prohibitive
    • The cost of solving the set of linear equations at each Newton step
  • There are various ways to compute an approximation of the Hessian to substantially reduce the cost of computing the Newton step. This leads to a family of algorithms called Quasi-Newton methods.
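Both flavors are available in scipy (a sketch: Newton-CG is a truncated-Newton variant rather than the exact Newton step above, and BFGS is the classic quasi-Newton method); scipy.optimize conveniently ships the Rosenbrock test function together with its gradient and Hessian.

In [ ]:
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

x0 = np.array([1.3, 0.7, 0.8, 2.2, 1.2, 2.1])

# Newton-type: uses the Hessian at each step
res_newton = minimize(rosen, x0, method='Newton-CG', jac=rosen_der, hess=rosen_hess)

# quasi-Newton (BFGS): builds a Hessian approximation from gradients only
res_bfgs = minimize(rosen, x0, method='BFGS', jac=rosen_der)

print("Newton-CG:", res_newton.x, "in", res_newton.nit, "iterations")
print("BFGS     :", res_bfgs.x, "in", res_bfgs.nit, "iterations")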

III. Constrained Optimization

  • Now add constraints.

  • Constrained problems are much harder: even a seemingly simple Integer Programming problem is NP-hard (no polynomial-time algorithm is known).

General Framework

  • Constrained Optimization Problem

$$ \begin{array}{ll} \min_{\bx\in \mathbb{R}^n } & f(\bx) \\ s.t. & \bs{h}(\bx) = \bs{0} \\ & \bs{g}(\bx) \leq \bs{0} \end{array} $$

  • We will be focusing on smooth functions (typically at least twice differentiable).

  • The goal is to find a local minimum satisfying the constraints.

  • And we will denote the overall domain of definition as $\mathcal{D} = \mathrm{dom}(f)\cap \mathrm{dom}(\bs{h}) \cap \mathrm{dom}(\bs{g})$.

Duality

  • Define the Lagrangian as, $$ \renewcommand{\ml}{\mathcal{L}} \renewcommand{\bmu}{\boldsymbol{\mu}} \renewcommand{\bld}{\boldsymbol{\lambda}} \ml(\bx, \bmu, \bld) = f(\bx) + \bmu^T \bs{h}(\bx) + \bld^T \bs{g}(\bx). $$

    Here the vectors $\bmu, \bld$ are called the dual variables or Lagrange Multipliers.

  • Further, define the Lagrangian dual function as,

$$ \renewcommand{\mD}{\mathcal{D}} \renewcommand{\df}{\hat{f}} \df(\bmu, \bld) = \inf_{\bx\in \mD}\; \ml(\bx, \bmu, \bld) =\inf_{\bx\in\mD} \left( f(\bx) + \bmu^T \bs{h}(\bx) + \bld^T \bs{g}(\bx) \right). $$

  • Since the dual function is the pointwise infimum of a family of affine functions of $(\bmu, \bld)$, it is a concave function.
  • If $\bx^*$ is a solution to the original optimization problem (the Primal Problem), then for all $\bld \succeq 0$ and any $\bmu$,

$$ \df(\bmu, \bld) \leq f(\bx^*). $$

  • This leads to the following optimization problem (the Dual Problem)

$$ \begin{array}{ll} \max_{(\bmu, \bld)} & \df(\bmu, \bld) \\ s.t. & \bld \succeq 0 \end{array} $$

  • If $(\bmu^*, \bld^*)$ is a solution to the Dual Problem, it is straightforward to show (homework) that weak duality holds,

$$ \df(\bmu^*, \bld^*) \leq f(\bx^*). $$

  • However, if the Primal Problem satisfies certain constraint qualifications (e.g. convexity together with Slater's condition), then strong duality holds, $$ \df(\bmu^*, \bld^*) = f(\bx^*), $$

    which implies the primal and the dual problems are equivalent.
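A one-dimensional worked example (added for illustration): consider the convex problem

$$ \min_x \; x^2 \quad s.t. \quad 1 - x \leq 0. $$

The Lagrangian is $\ml(x, \lambda) = x^2 + \lambda (1 - x)$; minimizing over $x$ gives $x = \lambda/2$, so

$$ \df(\lambda) = \lambda - \frac{\lambda^2}{4}, $$

which is indeed concave. Maximizing over $\lambda \geq 0$ gives $\lambda^* = 2$ and $\df(\lambda^*) = 1 = f(x^*)$ with $x^* = 1$: strong duality holds, consistent with convexity and Slater's condition ($x = 2$ is strictly feasible).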

Optimality Conditions

  • From the duality principle,

    1. $ \bs{\nabla}_{\bx}\ml(\bx^*, \bmu^*, \bld^*) = \bs{0} $, stationarity
    2. $ \bs{h}(\bx^*) = \bs{0}; \bs{g}(\bx^*) \leq \bs{0} $, feasibility
    3. $ \bld^* \succeq \bs{0} $, dual feasibility (component-wise)
    4. $ \bld^* \circ \bs{g}(\bx^*) = \bs{0} $, complementary slackness (component-wise)
    5. $ \bs{\nabla}^2_{\bx\bx}\ml(\bx^*, \bmu^*, \bld^*) \succ 0 $, positive definite Hessian
  • Conditions 1-4 are called the KKT (Karush-Kuhn-Tucker) conditions; they are necessary (under suitable constraint qualifications), and together with the second-order condition 5 they are sufficient for $\bx^*$ to be a local minimizer.

Exploring the KKT conditions further

  • For the unconstrained case, conditions 2, 3 and 4 drop out; what is left is:

    1. $ \bs{\nabla}_{\bx}f(\bx^*) = \bs{0} $, stationarity
    2. $ \bs{\nabla}^2f(\bx^*) \succ 0 $, positive definite Hessian
  • which we are familiar with: the necessary and sufficient conditions for $\bx^*$ to be a (local) minimizer.
  • For the equality constrained case, let's take a look at the Lagrange function

$$ \ml(\bx, \bmu) = f(\bx) + \bmu^T \bs{h}(\bx) $$

  • If we simply consider this as an unconstrained problem with $(\bx, \bmu)$ as the new unknown vector and apply the optimality conditions above:

    1. $ \bs{\nabla}_{\bx}\ml(\bx^*, \bmu^*) = \bs{0} $, stationarity
    2. $ \bs{\nabla}_{\bmu}\ml(\bx^*, \bmu^*) = \bs{0} $, stationarity
    3. $ \bs{\nabla}^2_{\bx\bx}\ml(\bx^*, \bmu^*) \succ 0 $, positive definite Hessian
  • Notice the second condition above is simply $\bs{h}(\bx^*) = \bs{0}$.
  • For inequality-only constrained problems, the analysis splits into two scenarios, based on where the local minimizer $\bx^*$ sits in the feasible region.
  • Scenario 1: $\bs{g}(\bx^*) < \bs{0}$; in this case the point $\bx^*$ is an interior point of the feasible domain, the constraint is called inactive, and the case simply reduces to the unconstrained one:

    1. $ \bs{\nabla}_{\bx}f(\bx^*) = \bs{0} $, stationarity
    2. $ \bs{\nabla}^2_{\bx\bx}f(\bx^*) \succ 0 $, positive definite Hessian
    3. and of course $ \bs{g}(\bx^*) < \bs{0} $,
  • Scenario 2: $\bs{g}(\bx^*) = \bs{0}$; in this case the point $\bx^*$ lies on the boundary of the feasible domain, the constraint is called active, and the case reduces to the equality constrained one:

    1. $ \bs{\nabla}_{\bx}f(\bx^*) + \bs{\nabla}_{\bx}\bs{g}(\bx^*)^T \bld = \bs{0} $, stationarity
    2. $ \bs{\nabla}^2_{\bx\bx}f(\bx^*) \succ 0 $, Hessian positive definite constraints
    3. $ \bs{g}(\bx^*) = \bs{0} $
    4. and $ \bld > 0 $ component-wise
  • Note the last condition ensures that the descent direction of $f(\bx)$ points strictly outward from the feasible region, so no feasible descent direction exists.
  • The best way to visualize conditions 1 and 4 above is through the following example, worked out below, where

$$ f(\bx) = (x_1 -2)^2 + (x_2 + 2)^2, \;\;\; g(\bx) = x_1^2 + x_2^2 - 1$$

  • with $ \bld > 0 $.
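For this example the KKT system can be solved by hand: stationarity gives $x_1 = 2/(1+\lambda)$ and $x_2 = -2/(1+\lambda)$, and the active constraint $x_1^2 + x_2^2 = 1$ then yields $\lambda^* = 2\sqrt{2} - 1 > 0$ and $\bx^* = (1/\sqrt{2}, -1/\sqrt{2})$. A sketch verifying this numerically (note scipy's convention that inequality constraints are of the form fun(x) >= 0):

In [ ]:
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2)**2 + (x[1] + 2)**2
g = lambda x: x[0]**2 + x[1]**2 - 1            # feasible region: g(x) <= 0

cons = ({'type': 'ineq', 'fun': lambda x: -g(x)},)   # scipy wants fun(x) >= 0
res = minimize(f, np.array([0.0, 0.0]), method='SLSQP', constraints=cons)
x = res.x

grad_f = 2 * np.array([x[0] - 2, x[1] + 2])
grad_g = 2 * x
lam = -grad_f[0] / grad_g[0]                   # recover lambda from stationarity
print("x* =", x)                               # expect (1/sqrt(2), -1/sqrt(2))
print("g(x*) =", g(x), "(active), lambda =", lam)
print("stationarity residual:", grad_f + lam * grad_g)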

Solution Methods

  • There are two main competing approaches to solve the constrained optimization problems:
    • Sequential Quadratic Programming (SQP), typically with quasi-Newton (e.g. BFGS) Hessian updates: NLOPT, SNOPT
    • Interior Point Method (IPT): IPOPT, KNITRO, LOQO
  • We are not going into great detail, but will expand on the SQP method slightly (and will get into the IPT method later).
  • SQP is a natural extension of Newton's method introduced earlier for the unconstrained case: recall that Newton's method solves a quadratic minimization at each iteration. In SQP, at each step the original optimization problem is approximated by

$$ \begin{array}{ll} \min_{\bx\in \mathbb{R}^n} & {\bs\nabla} f(\bx^k)^T {(\bx - \bx^k)} + \frac{1}{2} {(\bx - \bx^k)}^T{\bs\nabla^2} f(\bx^k) {(\bx - \bx^k)} \\ s.t. & {\bs\nabla} \bs{h}(\bx^k)^T {(\bx - \bx^k)} + \bs{h}(\bx^k) = \bs{0} \\ & {\bs\nabla} \bs{g}(\bx^k)^T {(\bx - \bx^k)} + \bs{g}(\bx^k) \leq \bs{0} \end{array} $$

  • This is a constrained quadratic programming problem, which is "slightly" easier to deal with than the original problem.
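As a pointer (a sketch on a toy problem of my own choosing): scipy's SLSQP method is an SQP-type solver (sequential least squares programming) that accepts exactly the framework above, with both equality and inequality constraints.

In [ ]:
import numpy as np
from scipy.optimize import minimize

# toy problem: min x1^2 + x2^2  s.t.  x1 + x2 = 1,  x1 >= 0.2
f = lambda x: x[0]**2 + x[1]**2
cons = ({'type': 'eq',   'fun': lambda x: x[0] + x[1] - 1.0},  # h(x) = 0
        {'type': 'ineq', 'fun': lambda x: x[0] - 0.2})         # scipy: fun >= 0
res = minimize(f, np.array([0.0, 0.0]), method='SLSQP', constraints=cons)
print(res.x)   # expect (0.5, 0.5); the inequality is inactive at the optimum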

References:

S. Boyd and L. Vandenberghe (2004), Convex Optimization, Cambridge University Press.