There are is a wealth of literature on optimization techniques outside of gradient-based methods, however, I think we just don’t often consider them as such.
For example, I was chatting with a guy once and he asked me if I know any black-box optimization techniques. I thought for a bit and naievly said no. In his expression he looked a bit perplexed b/c we just had a conversation about hyper-parameter turning which is essentially block-box optimization. I hadn’t made that connection: whether we are doing grid search, random search, or something more advanced like bayesian optimization or some bandits method we are practicing black-box optimization, meaning we don’t have any heuristics to exploit.
- Grid search
- Random Search
- Bayesian Optimization
- Multi-Armed Bandits
Combinatorial problem if you have more than a few hyper parameters.
This is much better than grid search b/c it won’t get stuck iterating parameters that potentially have no effect.
This paper by Bengio1 motivates random search over grid search.
Among the choices above that are not completely trivial and/or complex is the Nelder-Mead method which outdates machine learning. Nelder-Mead is a numerical technique so it uses iteration to solve the problem and is included in the Scipy package.
Define the function you want to optimize, in this case it’s in 2 dimensions and give the algorithm a starting point:
def func_1(t): x,y=t return (x*x) + (y*y) from scipy.optimize import fmin fmin(func_1, [1,1])
Optimization terminated successfully. Current function value: 0.000000 Iterations: 38 Function evaluations: 69 Out: array([ -2.10235293e-05, 2.54845649e-05])
As you can see I had it optimize a convex (parabolic) function in 2 dimensions and it found the minimum which is indeed at
Became interested in this topic after reading the AutoML2 paper, as I though the Bayesian Optimisation they did was more interesting than that main contribution of the paper. They describe a bayesian optimisation routine called SMAC which cut their hyper-parameter search space considerably. This is an exploration of previous work in the space.
- BayesOpt, NIPS workshop on Bayesian Optimization (BayesOpt).
- GPyOpt, Python open-source library for Bayesian Optimization based on GPy.
- Bayesopt, an efficient implementation in C/C++ with support for Python, Matlab and Octave.
- Spearmint, a Python implementation focused on parallel and cluster computing.
- Hyperopt, a Python implementation for hyperparamenter optimization.
- SMAC, a Java implementation of random-forest-based Bayesian optimization for general algorithm configuration.
- pybo, a Python implementation of modular Bayesian optimization.
- Bayesopt.m, a Matlab implementation of Bayesian optimization with or without constraints.
- MOE MOE is a Python/C++/CUDA implementation of Bayesian Global Optimization using Gaussian Processes.
- SigOpt SigOpt offers Bayesian Global Optimization as a SaaS service.
Gaussian Process Models
Instead of using random combinations or exhaustive search naively, lets look at what has done well in the past are repeat that. In other words predict regions of the hyper-parameter space in which we expect to get good results.
These models assume similar inputs give similar outputs.
- Keep track of the best settings so far
- after each experiment this might stay the same or it might improve if the latest result is the best
- pick a setting of the hyper-parameters such that the expected improvement in our best setting is big.
Utility or Activation Functions help us formalise what the next best-guess is.
Expected Improvement Activation Function: This function, sometimes called the Acquisition function, looks at the variance of the gaussians and chooses the one w/ the most potential upside. Re: Explore vs Exploit paradox. Like a reckless investor since we don’t have much to loose.
All of these bayesian models need to start off of something, so they often do a few random samples and go bayesian from there. This is the valuable contribution from the auto-ml paper2 in that instead of doing random sampling for initialisation you can use smart samples from the meta learners.