Gradient Search (gradsearch)
Keywords: gradsearch gradient descent search optimization
This module implements a gradient or "steepest-descent" search. Given a function \(f\) which operates on a vector \(\vec{x} = [x_0,x_1,\ldots,x_{N-1}]^T\) of \(N\) parameters, the gradient search method seeks to find the optimum \(\vec{x}\) which minimizes \(f(\vec{x})\).
Theory
The gradient search is an iterative method that adjusts \(\vec{x}\) in proportion to the negative of the gradient of \(f\) evaluated at the current location. The vector \(\vec{x}\) is adjusted by
$$ \Delta \vec{x}[n+1] = -\gamma[n] \nabla f(\vec{x}[n]) $$ where \(\gamma[n]\) is the step size and \(\nabla f(\vec{x}[n])\) is the gradient of \(f\) evaluated at \(\vec{x}[n]\), the estimate at the \(n^{th}\) iteration. The gradient is a vector field which points in the direction of the greatest rate of increase, and is computed at \(\vec{x}\) as
$$ \nabla f(\vec{x}) = \left( \frac{\partial f}{\partial x_0}, \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_{N-1}} \right) $$ In most non-linear optimization problems, \(\nabla f(\vec{x})\) is not known analytically and must be approximated for each value of \(\vec{x}[n]\) using the finite difference method. The partial derivative with respect to the \(k^{th}\) component is estimated by computing the slope of \(f\) when \(x_k\) is increased by a small amount \(\Delta\) while all other elements of \(\vec{x}\) are held constant. This process is repeated for all elements of \(\vec{x}\) to compute the gradient vector. Mathematically, the \(k^{th}\) component of the gradient is approximated by
$$ \frac{\partial f(\vec{x})}{\partial x_k} \approx \frac{f(x_0,\ldots,x_k+\Delta,\ldots,x_{N-1}) - f(\vec{x})}{\Delta} $$Once \(\nabla f(\vec{x}[n])\) is known, \(\Delta\vec{x}[n+1]\) is computed and the optimizing vector is updated via
$$ \vec{x}[n+1] = \vec{x}[n] + \Delta\vec{x}[n+1] $$
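The basic method thus consists of two steps: approximating the gradient by finite differences, and moving against it. The following standalone sketch illustrates this in C. It is not the liquid API; the sum-of-squares objective, starting point, and iteration count are assumptions chosen for illustration, with \(\Delta\) and \(\gamma\) set to the defaults listed in the interface section below.

#include <stdio.h>

#define N 4

// example objective: f(x) = sum(x_k^2), minimized at the origin
static double f(const double * x)
{
    double u = 0.0;
    unsigned int k;
    for (k=0; k<N; k++)
        u += x[k]*x[k];
    return u;
}

int main(void)
{
    double x[N]  = {1.0, -2.0, 0.5, 3.0}; // starting point
    double delta = 1e-6;                  // step size for gradient estimate
    double gamma = 0.002;                 // step size for the update
    unsigned int n, k;

    for (n=0; n<5000; n++) {
        double grad[N];
        double f0 = f(x);

        // approximate each partial derivative by a forward difference
        for (k=0; k<N; k++) {
            double xk = x[k];
            x[k] = xk + delta;
            grad[k] = (f(x) - f0) / delta;
            x[k] = xk;
        }

        // apply the update: x <- x - gamma * grad f(x)
        for (k=0; k<N; k++)
            x[k] -= gamma * grad[k];
    }
    printf("f(x) after descent: %12.4e\n", f(x));
    return 0;
}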
Momentum constant
When \(f(\vec{x})\) is flat (i.e., \(\nabla f(\vec{x})\approx \vec{0}\)), convergence will be slow. This effect can be mitigated by allowing the update to retain a small portion of the previous step vector. The update at time \(n+1\) becomes
$$ \vec{x}[n+1] = \vec{x}[n] + \hat{\Delta}\vec{x}[n+1], \qquad \hat{\Delta}\vec{x}[n+1] = \Delta\vec{x}[n+1] + \alpha\hat{\Delta}\vec{x}[n] $$ where \(\hat{\Delta}\vec{x}[0] = \Delta\vec{x}[0] = \vec{0}\). Unrolling the recursion, the effective update at time \(n+1\) is
$$ \hat{\Delta}\vec{x}[n+1] = \sum_{k=0}^{n+1}{\alpha^{k}\Delta\vec{x}[n+1-k]} $$ which is stable only for \(0 \le \alpha < 1\). Over a flat region the gradient, and therefore the step \(\Delta\vec{x}\), is approximately constant, so the effective update becomes a geometric series converging to \(\Delta\vec{x}/(1-\alpha)\). This accelerates the algorithm across relatively flat regions of \(f\). The momentum constant also adds some stability in regions where the gradient method tends to oscillate, such as steep valleys in \(f\).
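In code, momentum changes only the update: the search retains the effective step vector \(\hat{\Delta}\vec{x}\) and folds the fraction \(\alpha\) of it into each new step. Below is a minimal sketch of the modified update, which would replace the plain update loop in the sketch above; the helper name is illustrative.

#define N 4

// one momentum update: s <- alpha*s - gamma*grad, then x <- x + s
// (s is the retained effective step, initialized to all zeros)
static void update_with_momentum(double * x, double * s, const double * grad,
                                 double gamma, double alpha)
{
    unsigned int k;
    for (k=0; k<N; k++) {
        s[k] = alpha*s[k] - gamma*grad[k];
        x[k] += s[k];
    }
}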
Step size adjustment
In liquid, the gradient is normalized to unit length; that is, \(\|\nabla f(\vec{x}[n])\|=1\). Furthermore, \(\gamma\) is slightly reduced at each iteration by a multiplier \(\mu\):
$$ \gamma[n+1] = \mu \gamma[n] $$ This helps improve stability and convergence over regions where the algorithm might otherwise oscillate due to steep slopes in \(f\).
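Combining the unit-norm gradient with the decaying step size, a single update in this scheme might look like the sketch below; the helper name is illustrative, and the guard against a zero-length gradient is an added assumption rather than something the text above specifies.

#include <math.h>

#define N 4

// one update using a unit-norm gradient and a decaying step size
static void normalized_step(double * x, const double * grad,
                            double * gamma, double mu)
{
    unsigned int k;
    double norm = 0.0;

    // compute the gradient norm, guarding against division by zero
    for (k=0; k<N; k++)
        norm += grad[k]*grad[k];
    norm = sqrt(norm);
    if (norm < 1e-12)
        return;

    // move against the unit gradient, then shrink the step size
    for (k=0; k<N; k++)
        x[k] -= (*gamma) * grad[k] / norm;
    *gamma *= mu;   // gamma[n+1] = mu * gamma[n]
}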
Interface
Here is a summary of the parameters used in the gradient search algorithm and their default values:
- \(\Delta\) : step size in computing the gradient (default \(10^{-6}\) )
- \(\gamma\) : step size in updating \(\vec{x}[n]\) (default 0.002)
- \(\alpha\) : momentum constant (default 0.1)
- \(\mu\) : multiplier applied to the step size \(\gamma\) at each iteration (see above)
Here is the basic interface to the gradsearch object:
- gradsearch_create(*userdata,*v,n,utility,min/max) creates a gradient search object designed to optimize an \(n\) -point vector \(\vec{v}\) . The user-defined utility function and userdata structure define the search, as does the min/max flag, which can be either LIQUID_OPTIM_MINIMIZE or LIQUID_OPTIM_MAXIMIZE . When run, the gradsearch object updates in place the "optimal" value in the input vector \(\vec{v}\) specified during create() .
- gradsearch_destroy(q) destroys a gradsearch object, freeing all internally-allocated memory.
- gradsearch_print(q) prints the internal state of the gradient search object.
- gradsearch_reset(q) resets the internal state of the gradient search object.
- gradsearch_step(q) steps through a single iteration of the gradient search. The result is stored in the original input vector \(\vec{v}\) specified during the create() method.
- gradsearch_execute(q,n,target_utility) runs multiple iterations of the search algorithm, stopping after either \(n\) iterations or if the target utility is met.
Here is an example of how the gradsearch object is used:
#include <liquid/liquid.h>

// user-defined utility callback function
float myutility(void * _userdata, float * _v, unsigned int _n)
{
    // compute the sum of squares of the input vector
    float u = 0.0f;
    unsigned int i;
    for (i=0; i<_n; i++)
        u += _v[i] * _v[i];
    return u;
}

int main() {
    unsigned int num_parameters = 8;    // search dimensionality
    unsigned int num_iterations = 100;  // number of iterations to run
    float target_utility = 0.01f;       // target utility
    float v[num_parameters];            // optimum vector
    // ... initialize v ...

    // create gradsearch object
    gradsearch gs = gradsearch_create(NULL,
                                      v,
                                      num_parameters,
                                      &myutility,
                                      LIQUID_OPTIM_MINIMIZE);

    // execute batch search
    gradsearch_execute(gs, num_iterations, target_utility);

    // clean it up
    gradsearch_destroy(gs);
}
Notice that the utility function is a callback that is completely defined by the user.
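For finer control, the same search can be driven one iteration at a time with gradsearch_step(), for example to monitor the utility as the search converges. Below is a sketch reusing the utility above; the starting point, iteration count, and printout format are arbitrary choices.

#include <stdio.h>
#include <liquid/liquid.h>

// same sum-of-squares utility as in the example above
float myutility(void * _userdata, float * _v, unsigned int _n)
{
    float u = 0.0f;
    unsigned int i;
    for (i=0; i<_n; i++)
        u += _v[i] * _v[i];
    return u;
}

int main() {
    unsigned int num_parameters = 8;
    float v[num_parameters];
    unsigned int i;
    for (i=0; i<num_parameters; i++)
        v[i] = 1.0f;    // arbitrary starting point

    gradsearch gs = gradsearch_create(NULL, v, num_parameters,
                                      &myutility, LIQUID_OPTIM_MINIMIZE);

    // step through the search manually, monitoring the utility
    unsigned int n;
    for (n=0; n<100; n++) {
        gradsearch_step(gs);
        printf("%4u : %12.8f\n", n, myutility(NULL, v, num_parameters));
    }

    gradsearch_destroy(gs);
    return 0;
}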
Figure [fig-optim-gradsearch]. gradsearch performance for 2-parameter Rosenbrock function \(f(x,y) = (1-x)^2 + 100(y-x^2)^2\) with a starting point of \((x_0,y_0)=(-0.1,1.4)\) . The minimum is located at \((1,1)\) .
[ref:fig-optim-gradsearch] depicts the performance of the gradient search for the Rosenbrock function, defined as \(f(x,y) = (1-x)^2 + 100(y-x^2)^2\) for input parameters \(x\) and \(y\). The Rosenbrock function has a minimum at \((x,y)=(1,1)\); however, the minimum lies in a deep valley which can be difficult to navigate. From the figure it is apparent that finding the valley is trivial, but convergence to the minimum is slow.
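To reproduce an experiment like this one, the utility callback simply evaluates the Rosenbrock function. Below is a sketch using the starting point from the figure; the iteration count and target utility are arbitrary choices.

#include <liquid/liquid.h>

// utility callback evaluating the two-parameter Rosenbrock function
float rosenbrock(void * _userdata, float * _v, unsigned int _n)
{
    float x = _v[0];
    float y = _v[1];
    return (1.0f-x)*(1.0f-x) + 100.0f*(y - x*x)*(y - x*x);
}

int main() {
    float v[2] = {-0.1f, 1.4f};     // starting point (x0, y0)

    gradsearch gs = gradsearch_create(NULL, v, 2, &rosenbrock,
                                      LIQUID_OPTIM_MINIMIZE);

    // run many iterations, converging slowly toward (1,1)
    gradsearch_execute(gs, 10000, 1e-6f);

    gradsearch_destroy(gs);
    return 0;
}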