The Machinery behind Machine Learning - Part 1 - codecentric AG Blog

:

Machine Learning is one of the hottest topics on the web. But how do machines learn? And do they learn at all? The answer in most cases is: Machines do not learn, they optimize.

This is the first blog post in a small series about the machinery that drives most of the commonly used Machine Learning algorithms. The overall idea is to illustrate the dependance of the performance of ML algorithms on the underlying optimization algorithms. In the end I want to show that the use of appropriate minimization algorithms can have a significant positive influence on speed and even accuracy – I intend to explicitely demonstrate this for a Collaborative Filtering Recommender System.

Most data scientists are somewhat familiar with basic Machine/Statistical Learning algorithms like Linear and Logistic Regression, lots do know something about or use more sophisticated algorithms like Support Vector Machines, Neural Networks, Collaborative Filtering Recommender Systems, maybe even Restricted Boltzmann Machines, Deep Learning tools and so on. But most of these algorithms – let’s restrict ourselves to Supervised Learning here – share some general properties and rely on the same more or less hidden machinery: the toolbox of Optimization Theory. In this first posting I want to introduce the general ideas of this theory and prepare the ground for applying them to real problems. Self-contained example code in R will be available at github, see Optimization Routines. The series starts with looking at the ingredients of a rather general iterative minimization algorithm, then continues with the well-known Gradient Descent method, proceeds with the Nonlinear Conjugate Gradient and finally discusses Quasi-Newton methods, in particular BFGS. The reason for this selection of algorithms is that all of them just use information in terms of function values and gradients, they compete on equal ground – a globalized Newton method would require the second derivate, i.e. the Hesse matrix as well.

Unconstrained optimization

This part (un-?)fortunately it involves a bit of mathematical notation but I will keep it as simple as possible. Given is a function that maps points from the $latex n$-dimensional Euclidean space to the set of real numbers:

$latex \displaystyle f:\mathbb{R}^n \rightarrow \mathbb{R}$

Let’s start with a $latex 2$-dimensional example. Consider points $latex p=(x,y)$ in the Euclidean plane $latex \mathbb{R}^2$ and the function

$latex \displaystyle f(p)=f((x,y))=\frac{1}{2}\left(x^2+y^2\right)$

that determines half of the squared length of vector $latex p$, e.g. for $latex p = (3, -1)$ the corresponding function value is $latex \displaystyle f(p)=f((3,-1)) = 5$.

Looking for local minima

The goal of unconstrained optimization is to find local minima of such functions. The problem that has to be solved is minimizing the function $latex f$ over the whole space without any constraints like e.g. nonnegativity of some variables or integrity conditions, formally written as

$latex \min\ f(p),\quad p\in \mathbb{R}^n.$

In Optimization theory we can restrict ourselves to minima because any maximum of $latex f$ is a minimum of $latex -f$, so in fact this is no restriction but simplifies things a lot. We do not always aim to find the global minimum in $latex \mathbb{R}^n$ but instead points that take smaller function values than all of their neighbors, this is called a local minimum.

Local minima often exist even if a global minimum does not, e.g. just look at the plot of the function

$latex \displaystyle g(x)=x+2\ {\sin(x)}$.

Finding the global minimum is sort of an ill-posed problem in its full generality because in a general setting we only have local information – function value and gradient at some points – that does not tell us anything about other points far away, thus we would have to check the whole space in the end in order to figure out whether a local minimum is global or not. In some important special cases we can derive this from known properties of the function, e.g. if $latex f$ is convex then any local minimum is a global minimum, and if there exists an isolated minimum of the convex function $latex f$ then it is the unique global minimum, as it is the case for our example function $latex f$.

In order to apply the standard methods from the optimization toolbox we need some basic technical requirements, the problem and the setting have to be well-defined. Obviously there should be one or more local minima and the function has to be continuously differentiable at least once. Since this is the case in almost all supervised learning applications we can safely assume that this is granted. The first derivative describes the slope of the function at the current point. Informally, all we have to ensure is that a well-defined notion of slope exists in all points – this implies that there are no sharp corners/cusps as e.g. in the absolute value function – and that neither the function nor the first derivative have immediate jumps if we slightly change the current point.

A general minimization algorithm

Imagine now that you are located somewhere at a hill in a dry desert (ok, approximately all deserts are dry, this seems to be somewhat constituting for a desert, but you know what I mean, at least it should not be a snow desert). You know that there is one valley or canyon, maybe even several, somewhere near and you want to find a reasonable path that leads you to the bottom of such a valley. At these spots you might find water as all of us have learned in our regular IT survival trainings, especially in north facing canyons. In such a situation it would be very helpful to not have to search more or less randomly or driven by our gut feelings but to follow some rules that guarantee success. You need a recipe for rescue and this is what the meta-algorithm below will provide. In order to increase your chances of finding water at the most probable place – a local minimum – most iterative algorithms consist of an initializing procedure, that chooses the starting point, and three core iteration steps:

Initialize:
• Choose a starting point – an initial guess that can either be determined by your situation – your current location in the desert – or that can be actively chosen.
Iterate until convergence:
• Check a termination criterion – are we close enough to the minimum?
• Find a descent direction – a direction in which the function value decreases near the current point.
• Determine a step size – the length of a step in the given direction that leads to a good decrease.

Concrete realisations of this prototype algorithm typically differ in how they implement the last two points of the iteration, the choice of the descent direction and the step size rule. Sometimes, one even determines the step size before fixing the descent direction – e.g. Trust Region methods – which seems to be counterintuitive at first glance. But most minimization methods form a simplified model of the true function around the current point, and the name Trust Region already indicates that the step size then tells us something about how far we can trust our local model. Whatsoever, Trust Region methods are beyond the scope for now, and we now want to look closer at the three iteration steps of the above meta-algorithm.

Checking for optimality

This refers to the first part of the iteration – the termination criterion – and it’s a bit technical. As you might have already guessed from the first plot a candidate $latex p^*$ for a minimum always has a certain property, the first derivative, i.e. the gradient has to be zero:

$latex \nabla f(p^*) = 0$

For our $latex 2$-dimensional example the gradient is a vector with two components, one for each coordinate axis, and the above equation translates into the vector equation

$latex \nabla f(x,y) = (x,y)^T = (0,0)^T$

i.e. each component of the gradient has to be $latex 0$. The unique solution in this simple case is $latex x = y = 0$, or $latex p^* = (0, 0)$. The formal notation $latex (…)^T$ – T stands for transposed – means that despite the row notation the vector is column vector, which is the standard convention but not really important here. Some textbooks even omit this detail.
A point $latex p^*$ with the property $latex \nabla f(p^*) = 0$ is called critical point. In general a critical point must not be a minimum but any minimum is a critical point. To better understand why the above equation is important let’s get an intuition what happens to function and gradient near an isolated local minimum $latex p^*$. When you approach $latex p^*$ from a specific direction – we can choose e.g. the directions along the coordinate axes – the function values have to decrease until you reach the minimum $latex p^*$ and then increase again. Obviously this has to be true regardless of the direction you are coming from, because $latex p^*$ is an isolated minimum, in a small neigborhood all points have stricter larger function values. The described behaviour directly implies that the gradient has to become zero exactly at the minimum – the slope is zero – and that the gradient changes the sign in each component, i.e. due to the decrease in the one direction and the increase in the opposite direction. To see that think of a simple univariate quadratic function like

$latex g(x) = x^2$.

The graph of this function is the standard parabola, the minimum is attained at $latex x^* = 0$ and the gradient is given by

$latex \nabla g(x) = 2x$,

thus taking negative values for all $latex x<0$ and positive values for all $latex x > 0$, both approaching zero when $latex x$ tends to zero.

Now we can come up with a first rule for deciding when to terminate the algorithm. A standard termination criterion simply checks whether the gradient is “sufficiently close” to zero as the gradient has to vanish at a minimum:

Termination criterion: Terminate if $latex \|\nabla f(p)\| < \varepsilon$ for a given tolerance $latex \varepsilon > 0$.

There is no need to specify this any further here, you will see a possible implementation later on as well as some other complementary criteria.

Finding a descent direction

A search direction $latex s$ is a descent direction if the function value decreases in that direction at least “at the beginning”. The search direction defines a ray, a unidrectional path, and if we make a tiny (infinitesimal) step along that path the function value has to decrease. Intuitively the term descent direction is rather self-explaining but how can we test whether a given direction is a descent direction or not? In practice we cannot make an infinitesimal step and see what happens. Fortunately there is a nice geometric intuition, and based on that that we derive a surprisingly simple test. Here are the contour lines of our example function:

Points on the same contour line (here: the circle segments in grey) share the same function value. The minimum is attained at the point $latex (x^*,y^*)=(0,0)$ and the function values increase with the radii of the circles. For one specific point there are gradient $latex \nabla f(x,y)$, negative gradient $latex -\nabla f(x,y)$ and a descent direction $latex d$ given in the plot. Furthermore, you can see the tangent to the contour line at that point, and one can immediately see that the gradient is perpendicular/orthogonal to this tangent (do not mix it up with the tangent to the function!). For the geometric interpretation the tangent to the contour line plays a crucial role. Obviously it splits the $latex (x,y)$-plane into two halves. We now focus on the angle $latex \alpha$ between the gradient and a possible search direction. The half-plane, where the gradient points to, corresponds to angles $latex \alpha < 90^{\circ}$ and $latex \alpha > 270^{\circ}$. The other one, where the negative gradient points to, corresponds to

$latex 90^{\circ} < \alpha < 270^{\circ}$.

From the plot is immediately plausible that any search direction $latex s$ with an enclosing angle $latex \alpha$ between $latex 90^{\circ}$ and $latex 270^{\circ}$, thus pointing into the negative gradient’s half-plane as e.g. the blue $latex d$, is a descent direction: There is at least a tiny path through the interior or the contour circle, where the function values decrease before they increase again later on.
So now we have figured out that the angle $latex \alpha$ between the gradient $latex \nabla f(x,y)$ and a search direction $latex s$ reliably tells us whether $latex s$ is a descent direction or not. Visually such a check is trivial in the given setting, but in higher dimensions and with less regular functions, i.e. more interesting shapes of the contour lines it is practically impossible.

Descent direction – a simple test

Fortunately there is an almost trivial computation that does the trick here. Lots of you do remember the geometric interpretation of the scalar or dot product of vectors. With $latex \|a \|$ being the Euclidean norm or the length of the vector $latex a$ it holds

$latex \langle a,b \rangle = a^Tb = \|a \|\|b \| \cos \alpha(a,b),$

$latex \alpha(a,b)$ is the angle enclosed by $latex a$ and $latex b$. We now focus on the sign of this expression. Since the Euclidean norm or length of any nontrivial vector is positive (zero if and only if the vector is $latex (0,…,0)$) the interesting part is the cosine. We switch to radian instead of degree because mathematically that is the natural domain of the trigonometric functions. An angle of $latex 90^{\circ}$ corresponds to $latex \frac{\pi}{2}$, similarly, $latex 180^{\circ}$ corresponds to $latex {\pi}$ and $latex 270^{\circ}$ corresponds to $latex \frac{3\pi}{2}$. The roots or zeros of the cosine function in the interval $latex [0^{\circ},360^{\circ}]=[0,2\pi]$ in radians are $latex \frac{\pi}{2}$ and $latex \frac{3\pi}{2}$. In the open interval $latex (\frac{\pi}{2},\frac{3\pi}{2})$ the cosine function is negative, thus the scalar product is negative. Conversely, for $latex \alpha \in (0,\frac{\pi}{2})\cup (\frac{\pi}{2},2\pi)$ the cosine is positive. We have found our simple test based on the sign of the above scalar product:

Descent direction: Find a search direction $latex s$ that satisfies $latex s^T\nabla f(p)<0$.

Determining the step size

There is a variety of methods available for this task. It is rather critical, e.g. in some settings there are very strict requirements necessary in order to make an algorithm work, but we will not go into detail here. We assume that we have found a reasonable descent direction $latex d$ and now we need to decide how long the step in that direction shall be. Since we are only searching along a single direction, this is referred to as line search. It is essentially a one-dimensional problem and a first idea would be to directly find the next minimum in the given descent direction, an exact line search. In some cases this is even necessary in order to make the whole minimization algorithm work or at least for proving good convergence results but it requires solving an one-dimensional optimization problem in every iteration.
For algorithms like Gradient Descent we can rely on one of the most basic procedures, the so-called Armijo rule, an inexact line search method. Historically, it has been the very first non-exact stepsize rule. The original idea needs a slight technical modification called widening but for simplicity we omit this here. Later we will see more advanced stepsize procedures anyway. The Armijo rule takes two parameters, $latex 0< \sigma < 1$, and $latex 0 < \rho < 0.5$. Now we test for $latex \ell = 0,1,2,...$ the Armijo condition:

$latex f(p+\sigma^{\ell} d) < f(p) +\rho \sigma^{\ell}\nabla f(p)^Td.$

The first $latex \ell$ that passes this test defines our stepsize $latex \sigma^{\ell}$ for the current iteration. We formulate our first stepsize rule

Stepsize: Choose $latex \sigma^{\ell}$ as stepsize according to the Armijo rule.

We aim at generating some progress in the descent direction $latex d$ and we already know that if we make a tiny step in this direction then the function has to decrease. But in practice we do not want to make tiny steps. We want to choose a step as large as possible (whatever that might mean). Therefore, we always try a full gradient step, we test the stepsize $latex 1=\sigma^0$. But depending on the current point this might be too ambitious and we have to accept a smaller step size, and sometimes it is not ambitious enough and we could use larger stepsizes. The latter can be realised by a technique called widening that leads to better theoretical and practical properties for Armijo’s rule (efficiency).
But what exactly is the Armijo condition going to tell us? Let’s answer this question graphically.

The Armijo condition, i.e. the above inequality, compares two quantities: the function value on the left side and the cryptical value on the right side. The left side is represented by the black curve, the true function $latex f$, and the concrete value $latex f(p+\sigma^{\ell}d)$ is the function value when we make a step in direction $latex d$ with stepsize $latex \sigma^{\ell}$. The right side is represented by the red line. The green line is a straight line with exactly the same slope as the function $latex f$ at the current point $latex p$. Red and green line are very similar when you look at the defining functions. The one important difference between them is that the slope of the red line is smaller in terms of absolute value: for both lines the slope is negative and the red line is less steep which is ensured by the parameter $latex 0<\rho<0.5$. But we know from Analysis that the function $latex f$ that we are interested in locally decreases as the green line near the current point, and this implies that the funtion has to lie below the red line at least for all stepsizes $latex t$ below some unknown threshold value $latex t_s>0$. There has to be a whole interval $latex (0,t_s)$ of admissable stepsizes and for some finite number $latex \ell$ the corresponding stepsize $latex \sigma^{\ell}$ is smaller than $latex t_s$ and larger than zero by definition, thus is contained in this admissable interval

$latex \sigma^{\ell}\in (0,t_s)$.

Therefore, after a finite number of steps the Armijo condition is fulfilled, we have computed an admissable stepsize that guarantees a reasonable decrease.

Summary

We now have everything at hand for building our very first algorithm for unconstrained minimization of multivariate functions. We know when to stop, how to figure out whether the function decreases in a given direction or not and how to choose the length of the next step if the search direction is a descent direction.
There are several approaches how to find a good descent or search direction $latex s$. We are going to start the next posting with the most obvious choice. Stay tuned…

Other articles in this series:
Machinery – Linear Regression
Machinery – Part 2