2017-10-06 Convex optimisation¶

The goal of this lab is to get hands-on with some of the convex optimisation tools in scipy.optimize, which you will need to implement the machine learning algorithms we will see in the course.

Imports¶

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import Image, set_matplotlib_formats 
set_matplotlib_formats('pdf', 'svg') # toggle vector graphics for a crisp plot!

Introduction¶

Convex functions have the useful property that local minima are global minima.

In [2]:
Image('./img/convexity.png')
Out[2]:

To get a feel for scipy, we start with a simple example. We begin by defining a function to optimise, $$f_1(x) = (x - 1)^4 + x^2$$

In [3]:
# specify objective function
f1 = lambda x : (x - 1) ** 4 + x ** 2
In [4]:
from scipy.optimize import minimize_scalar

res = minimize_scalar(f1, method='brent')
print('xmin: %.02f, fval: %.02f, iter: %d' % (res.x, res.fun, res.nit))
xmin: 0.41, fval: 0.29, iter: 12

Question: Briefly describe the function output.

Answer: The output gives us the location of the minimum (xmin), the value of f1 (fval) and the number of iterations (iter) required to arrive at that point.

In [5]:
# plot curve
x = np.linspace(res.x - 0.5, res.x + 0.5, 100)
y = [f1(val) for val in x]
plt.plot(x, y, color='blue', label='f1')

# plot optima
plt.scatter(res.x, res.fun, color='orange', marker='x', label='opt')

plt.grid()
plt.legend(loc=1)
Out[5]:
<matplotlib.legend.Legend at 0x11294db10>

N.B. Optimisation, and particularly convex optimisation, plays a central role in machine learning. Although in this lab we focus on optimising simple functions, the same techniques will be used to optimise the high-dimensional loss functions of our learning machines.

Utility functions¶

In the following, we will be doing a lot of plotting, so we'll create a reusable function:

In [6]:
def plot_iterations(f, iterations, delta=0.1, box=1):
    xs = np.arange(-box, +box, delta)
    ys = np.arange(-box, +box, delta)
    X, Y = np.meshgrid(xs, ys)
    # create function mesh
    Z = np.array([[f(np.array([x, y])) for x in xs] for y in ys])
    # contour plot
    fig = plt.figure(figsize=(10, 5))
    ax = fig.add_subplot(1, 2, 1)
    cd = ax.contour(X, Y, Z)
    ax.clabel(cd, inline=1, fontsize=10)
    # overlay the iterates on the contour plot
    ax.plot(iterations[:, 0], iterations[:, 1], color='red', marker='x', label='opt')
    # error plot
    ax = fig.add_subplot(1, 2, 2)
    ax.plot(range(len(iterations)), [f(np.array(step)) for step in iterations], 'xb-')
    ax.grid(True)
    ax.set_xlabel('iteration')
    ax.set_xticks(range(len(iterations)))
    ax.set_ylabel('f(x)')

Question: What does plot_iterations do ?

Answer: It displays two plots for data collected during optimisation (see get_callback_function() below): a contour plot and a plot of the (decreasing) error.

We'll also need a callback function to record data between iterations; it will be called with the current iterate after each iteration.

In [7]:
# specify callback function
def get_callback_function(data_list):
    def callback_function(x):
        data_list.append(x)
    return callback_function

Smooth functions¶

When our cost function is smooth, we can use the tools of calculus, either to derive a closed-form solution or to drive an iterative procedure. Here we focus on the latter case, where the function is optimised incrementally, iterating in general until its gradient is close to zero. We again start by specifying a function to minimise,

\begin{align} f_2(\mathbf{x}) &= 2x_1^2 + 5x_2^2 \\ \notag &= \mathbf{x}^T\mathbf{D}\mathbf{x} \notag \end{align}

where $\mathbf{D} = \begin{bmatrix} 2 & 0 \\ 0 & 5 \end{bmatrix}$ and $\mathbf{x} \in \mathbb{R}^{2}$.

In [8]:
f2 = lambda x : x.T.dot(np.diag([2, 5])).dot(x)

First we'll try a trivial optimisation algorithm: brute force. A brute force search tries every point in a fixed grid and retains the minimising value.

In [9]:
from scipy.optimize import brute
print(brute(f2, ranges=((-2, 2), (-2, 2))))
[  1.51252956e-05  -1.18367293e-07]

Question: Comment briefly on the output.

Answer: The true optimum $(x_1 = 0, x_2 = 0)$ is missed by this crude grid search.
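
Note that, by default, brute polishes the grid result with a local search (finish=fmin), which is why the printed value is already close to zero; the raw grid minimiser is considerably cruder. The following optional sketch (not part of the lab) disables the polishing step and compares two grid resolutions via the Ns parameter:

# optional sketch: disable the default polishing step (finish=None) to see
# the raw grid minimiser, then refine the grid with a larger Ns
from scipy.optimize import brute
print(brute(f2, ranges=((-2, 2), (-2, 2)), finish=None))           # default Ns=20
print(brute(f2, ranges=((-2, 2), (-2, 2)), Ns=101, finish=None))   # finer grid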

First and second-order characterizations¶

First and second-order optimality conditions state that, at a minimiser of an unconstrained smooth function, the gradient vanishes, $\nabla f(\mathbf{x}) = 0$, and the Hessian matrix, $\nabla^2 f(\mathbf{x})$, is positive semi-definite. When $f(\mathbf{x})$ is convex and twice differentiable, the Hessian $\mathbf{x} \mapsto \nabla^2 f(\mathbf{x})$ is positive semi-definite everywhere, and a vanishing gradient is both necessary and sufficient for a global minimum. (For constrained problems, these conditions generalise to the Karush-Kuhn-Tucker (KKT) conditions.)

Question: Write a function df2() for the Jacobian (vector of partial derivatives) of f2() below:

In [10]:
def df2(x):
    # TODO: return vector of partial derivatives
    return (np.diag([2, 5]) + np.diag([2, 5]).T).dot(x)

Question: Write a function ddf2() for the Hessian (matrix of second partial derivatives) of f2() below:

In [11]:
def ddf2(x):
    # TODO: return Hessian matrix of second partial derivatives
    return np.diag([4, 10])

Question: Prove our Hessian matrix, $\mathbf{H} = \partial^2 f / \partial\mathbf{x}^2$, is positive-definite everywhere.

Answer: For any nonzero $\mathbf{x} = [x_1, x_2]^T$, $\mathbf{x}^T\mathbf{H}\mathbf{x} = 4x_1^2 + 10x_2^2 > 0$. Or: the eigenvalues of $\mathbf{H}$ (4 and 10) are strictly positive $\implies$ PD
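
As a quick numerical sanity check (an optional aside, not required by the lab), we can confirm this with numpy: the Hessian of f2 is constant, so we may evaluate it at any point and inspect its eigenvalues.

# optional check: the Hessian of f2 is constant, so evaluate it anywhere
# and confirm its eigenvalues are strictly positive (positive definite)
H = ddf2(np.zeros(2))
eigenvalues = np.linalg.eigvalsh(H)
print(eigenvalues)                # expect [ 4. 10.]
assert np.all(eigenvalues > 0)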

Gradient descent methods¶

The workhorse of machine learning is gradient descent. Its variants are used to optimise a panoply of models, from linear regression to deep neural nets. The gradient of a multivariate function at a point is the vector normal to the level curve through that point. Gradient descent works by taking steps in the direction of the negative gradient (a minimal sketch follows the figure below),

$$\mathbf{x}_{k+1} = \mathbf{x}_{k} -\alpha\nabla f(\mathbf{x_k})$$
In [12]:
Image('img/gd.png')
Out[12]:
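
Scipy does not provide a plain gradient descent routine, so as an illustration (an optional sketch, not one of the lab's required steps) here is the update rule applied to f2, reusing df2 and plot_iterations from above. The step size alpha is hand-picked for this problem; too large a value would diverge.

# minimal gradient descent sketch on f2; alpha is a hand-picked step size
alpha = 0.1
x = np.array([-1.0, 1.0])
gd_data = [x.copy()]
for k in range(15):
    x = x - alpha * df2(x)    # step opposite to the gradient
    gd_data.append(x.copy())
print(x, f2(x))
plot_iterations(f2, np.array(gd_data))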

Newton's Method¶

Newton's method is a second order method, minimising a quadratic Taylor approximation to the objective function at each point. It thus combines the gradient and Hessian matrix,

$$\mathbf{x}_{k+1} = \mathbf{x}_{k} -\alpha\mathbf{H}_k^{-1}\mathbf{g}_k$$

where $\mathbf{g}_k = \nabla f(\mathbf{x_k})$ and $\mathbf{H}_k = \nabla^2 f(\mathbf{x_k})$. This is a multi-dimensional generalisation of the Newton root-finding method (here we are finding the root of the gradient). Newton's method typically involves a line search to optimise the size of the descent step, $\alpha$.
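
As a side note (not part of the original lab): for a purely quadratic objective such as f2, the quadratic Taylor model is exact, so a single full Newton step ($\alpha = 1$) lands exactly on the minimiser. A minimal sketch:

# one full Newton step from an arbitrary starting point; for a quadratic
# objective this reaches the minimiser immediately
x = np.array([-1.0, 1.0])
x_new = x - np.linalg.solve(ddf2(x), df2(x))
print(x_new, f2(x_new))   # expect [ 0.  0.] and 0.0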

Question: Run Newton's method on our function. Note that in this implementation (Newton-CG), the Newton system is solved with the conjugate gradient method rather than by explicitly inverting the (possibly sparse) Hessian.

In [13]:
# TODO import function
from scipy.optimize import fmin_ncg
# specify initial point
x0 = [-1, 1]
# initialise callback data
ncg_data = [np.array(x0)]
res = fmin_ncg(f2, x0, fprime=df2, fhess=ddf2, callback=get_callback_function(ncg_data))
Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 7
         Function evaluations: 8
         Gradient evaluations: 14
         Hessian evaluations: 7
In [14]:
plot_iterations(f2, np.array(ncg_data))

Question: Newton's method uses a line search to optimise each descent step by choosing a step size $\alpha$ that minimises the objective along the descent direction, that is, $$\min_\alpha f(\mathbf{x_k} + \alpha \mathbf{d}_k),$$ where $\mathbf{d}_k = \nabla f({\mathbf{x}_k})$ is the descent direction. How does this show in the visualisation?

Answer: The descent steps are all at right angles. Minimising $f(\mathbf{x_k} + \alpha \mathbf{d}_k)$ is equivalent to finding $\alpha$ such that $\nabla_{\alpha} f({\mathbf{x}_k} + \alpha\mathbf{d}_k) = \mathbf{d}_k^T\nabla f({\mathbf{x}_k} + \alpha\mathbf{d}_k) = 0$. Hence, the next descent direction is orthogonal to the current one (or the gradient is 0).
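
We can check this claim numerically (an optional aside) by taking the inner products of consecutive steps recorded in ncg_data; if the steps really are at right angles, these should be close to zero.

# inner products of consecutive descent steps; values near zero confirm
# the right-angle pattern seen in the contour plot
steps = np.diff(np.array(ncg_data), axis=0)
for i in range(len(steps) - 1):
    print(np.dot(steps[i], steps[i + 1]))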

Quasi-Newton: BFGS¶

Broyden–Fletcher–Goldfarb–Shanno (BFGS) is a quasi-Newton method. Because forming and inverting the exact Hessian is expensive (the matrix alone is $\mathcal{O}(n^2)$, a large expense at each iteration), BFGS approximates the Hessian "on the fly" as a sum of rank-one updates. When space and runtime are a concern, limited-memory BFGS (l-BFGS) is a further option (see the sketch below).
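
As an optional aside (not required by the lab), scipy exposes the limited-memory variant as fmin_l_bfgs_b, which also supports box constraints; a minimal sketch on f2:

# optional sketch: L-BFGS-B on f2, reusing f2 and df2 from above
from scipy.optimize import fmin_l_bfgs_b
x_opt, f_opt, info = fmin_l_bfgs_b(f2, x0=[-1, 1], fprime=df2)
print(x_opt, f_opt, info['warnflag'])   # warnflag 0 indicates convergence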

Question: Run the scipy BFGS algorithm on our function.

In [15]:
# specify initial point
x0 = [-1, 1]
# initialise callback data
bfgs_data = [np.array(x0)]
# TODO: import BFGS method from scipy.optimize
from scipy.optimize import fmin_bfgs
# TODO: run optimisation algorithm
res = fmin_bfgs(f2, x0, gtol=1e-5, fprime=df2, callback=get_callback_function(bfgs_data))
Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 4
         Function evaluations: 6
         Gradient evaluations: 6
In [16]:
plot_iterations(f2, np.array(bfgs_data))

Question: Verify that the norm of the gradient at the final point is close to 0.

In [17]:
# TODO calculate the gradient norm
df_norm = np.sqrt(np.dot(df2(res), df2(res)))
print(df_norm)
4.33680868994e-19

Question: What do you observe about the steps?

Answer: The approximation to the Hessian leads to approximately right-angled descent steps.

Conjugate gradient method¶

The conjugate gradient method is an alternative to gradient descent. For a linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$, this algorithm finds a solution as a linear combination of a set of mutually conjugate vectors, $\mathbf{p}_d$, such that,

$$\mathbf{x}^* = \sum_{d=1}^D \alpha_d\mathbf{p}_d$$

for the $D$ dimensions of the problem. The vectors are determined one by one, in a process similar to the Gram-Schmidt process. Particularly for a sparse system, a good approximate solution can be determined without constructing the entire conjugate set.
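
As an illustration of the linear-system view (an optional aside; the small system below, A_sys and b_sys, is our own example, not from the lab), scipy also exposes conjugate gradient as a solver for symmetric positive-definite systems in scipy.sparse.linalg:

# optional sketch: conjugate gradient as a linear solver for a small
# symmetric positive-definite system A x = b
from scipy.sparse.linalg import cg as cg_solve
A_sys = np.array([[4., 1.],
                  [1., 3.]])
b_sys = np.array([1., 2.])
x_sol, info = cg_solve(A_sys, b_sys)
print(x_sol, info)                 # info == 0 means convergence
print(A_sys.dot(x_sol) - b_sys)    # residual should be near zero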

In [18]:
# specify initial point
x0 = [-1, 1]
# initialise callback data
cg_data = [np.array(x0)]
# TODO: import the conjugate gradient method from scipy.optimize
from scipy.optimize import fmin_cg
# TODO: run optimisation algorithm
res = fmin_cg(f2, x0, fprime=df2, callback=get_callback_function(cg_data))
Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 3
         Function evaluations: 6
         Gradient evaluations: 6
In [19]:
plot_iterations(f2, np.array(cg_data))

Question: Try changing the initial point. Does anything change? Why?

Answer: For our quadratic function, conjugate gradient always terminates once the conjugate basis has been fully constructed, i.e. after $D$ iterations (plus the initial step), regardless of the initial point.
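
We can also verify the conjugacy property numerically (an optional aside): the successive steps recorded in cg_data should be approximately conjugate with respect to the Hessian of f2.

# steps[i]^T H steps[j] should be small for i != j (up to line-search tolerance)
H = np.diag([4., 10.])
steps = np.diff(np.array(cg_data), axis=0)
for i in range(len(steps)):
    for j in range(i + 1, len(steps)):
        print(i, j, steps[i].dot(H).dot(steps[j]))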

Non-smooth functions¶

Convex functions remain convex when we add linear constraints. A linear constraint restricts the solution space with an intersecting hyperplane. However, the problem is no longer smooth at the boundary of the feasible region, so simply setting the gradient to zero is no longer enough. To conceptualise the effect, imagine taking a spherical fruit such as an apple to be our unconstrained convex function. Then, adding a linear constraint corresponds to slicing it at some position at a fixed angle. Intuitively, the fruit remains convex after the cut, despite losing its smoothness (roundness).

Linear programming¶

We begin by defining a linear program: a linear objective, $\mathbf{c}^T\mathbf{x}$, minimised subject to linear constraints $\mathbf{A}\mathbf{x} \leq \mathbf{b}$:

$$ \begin{array}{rl} \text{minimise} & z = -2x_1 - x_2 \\ \text{subject to} & x_1 + x_2 \leq 5 \\ & 5x_1 + 2x_2 \leq 10 \\ & x_1, x_2 \geq 0 \end{array}, $$

Question: Initialise the matrix $\mathbf{A}$ and the vectors $\mathbf{b}$ and $\mathbf{c}$.

In [20]:
# TODO Initialise A, b, and c as numpy arrays
A = np.array([[1, 1], [5, 2]])
b = np.array([5, 10])
c = np.array([-2, -1])

Linear programs can be solved efficiently with the Simplex algorithm. The constraints specify an N-dimensional convex solid, and the algorithm works by moving along the edges of this shape from vertex to vertex. As an optimal solution can always be found at a vertex, we can discard all points interior to the solid. This reduces the candidate solutions to a finite set. Furthermore, the convexity of the solid guarantees that as long as we repeatedly take improving steps along the surface edges, we will arrive at an optimum in a finite number of steps.

In [21]:
Image('img/lp.png')
Out[21]:

Question: Solve the linear program using scipy's implementation of Simplex.

In [22]:
# TODO: solve the linear program
from scipy.optimize import linprog
lp_callback_data = []
print(linprog(c=c, A_ub=A, b_ub=b))
     fun: -5.0
 message: 'Optimization terminated successfully.'
     nit: 2
   slack: array([ 0.,  0.])
  status: 0
 success: True
       x: array([ 0.,  5.])
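
As an optional check (not part of the lab), we can substitute the reported solution back into the constraints and the objective:

# verify the linprog solution: constraints hold and the objective matches fun
x_opt = np.array([0., 5.])      # solution reported above
print(A.dot(x_opt) <= b)        # inequality constraints satisfied
print(x_opt >= 0)               # non-negativity satisfied
print(c.dot(x_opt))             # objective value: -5.0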

Quadratic programming¶

A quadratic programming (QP) problem is an optimisation problem of a quadratic function $\frac{1}{2}\mathbf{x^TQx} + \mathbf{c^Tx}$ subject to linear constraints, $\mathbf{Ax} \leq \mathbf{b}$. QP optimisation algorithms can be applied to the training of support vector machines (SVMs). Let's optimise the following quadratic program:

$$ \begin{array}{rl} \text{minimise} & z = x_1^2 + 4x_2^2 - 32x_2 + 64 \\ \text{subject to} & x_1 + x_2 \leq 7 \\ & -x_1 + 2x_2 \leq 4 \\ & x_2 \leq 4 \\ & x_1, x_2 \geq 0 \end{array}, $$

Question: Specify the constraint matrices, $\mathbf{A}$ (coefficients), and $\mathbf{b}$ (constants).

In [23]:
# TODO: specify the matrix of coefficients, A
A = np.array([[ 1., 1.],
              [-1., 2.],
              [0.,  1.],
              [-1., 0.],
              [0., -1.]])

# TODO: specify the vector of constants, b
b = np.array([7., 4., 4., 0., 0.])

We create a dictionary of constraints for use in the optimisation function:

In [24]:
cons = {'type':'ineq',
        'fun':lambda x: b - np.dot(A,x),
        'jac':lambda x: -A}

Question: Next, specify the components of the objective function: the matrix of quadratic coefficients $\mathbf{Q}$, the vector of linear coefficients $\mathbf{c}$, and the constant term, $c_0$.

In [25]:
# TODO: specify the matrix of quadratic coefficients, Q
Q = np.array([[2., 0.],
              [0., 8.]])

# TODO: specify the vector of linear coefficients, c
c = np.array([0, -32])

# TODO: specify the constant, c0
c0 = 64

Question: Write the objective function (loss).

In [26]:
def loss(x):
    # TODO: return loss given x
    return 0.5 * np.dot(x.T, np.dot(Q, x)) + np.dot(c, x) + c0

Question: Write the Jacobian function.

In [27]:
def jac(x):
    # TODO: return vector of partial derivatives
    return np.dot(x.T, Q) + c

Question: Run scipy.optimize's minimize function on our QP problem.

In [28]:
from scipy.optimize import minimize

x0 = np.random.randn(2)
opt = {'disp':False}
res_cons = minimize(loss, x0, jac=jac, constraints=cons, method='SLSQP', options=opt)

print('\nConstrained:')
print(res_cons)

x1, x2 = res_cons['x']
f = res_cons['fun']
Constrained:
     fun: 7.9999999999997655
     jac: array([ 4., -8.])
 message: 'Optimization terminated successfully.'
    nfev: 4
     nit: 4
    njev: 4
  status: 0
 success: True
       x: array([ 2.,  3.])
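
As an optional check (not part of the lab), we can substitute the reported minimiser back into the loss and the constraints:

# verify the SLSQP solution: the loss matches fun above and all inequality
# constraints b - A x >= 0 hold (to numerical tolerance)
x_opt = res_cons['x']
print(loss(x_opt))                   # approximately 8.0
print(b - A.dot(x_opt) >= -1e-9)     # expect all True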

Finally, we will plot the solution in 3D axes (code courtesy of https://stackoverflow.com/questions/17009774/quadratic-program-qp-solver-that-only-depends-on-numpy-scipy/19770058#19770058). Feel free to change the code to adjust the view.

In [29]:
from mpl_toolkits.mplot3d.axes3d import Axes3D
# plotting
xgrid = np.mgrid[-2:4:0.1, 1.5:5.5:0.1]
xvec = xgrid.reshape(2, -1).T
F = np.vstack([loss(xi) for xi in xvec]).reshape(xgrid.shape[1:])

ax = plt.axes(projection='3d')
ax.plot_surface(xgrid[0], xgrid[1], F, rstride=1, cstride=1,
                cmap=plt.cm.jet, shade=True, alpha=0.9, linewidth=0)
ax.plot3D([x1], [x2], [f], 'og', mec='w', label='Constrained minimum')
ax.legend(fancybox=True, numpoints=1)
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_zlabel('F')
Out[29]:
<matplotlib.text.Text at 0x113070c10>

N.B. Scipy does not provide a dedicated QP solver. When we come to support vector machines, we will implement the solver ourselves.