AI::MXNet::Optimizer - Common optimization algorithms with regularizations.
Common optimization algorithms with regularizations.
Creates an optimizer with the specified name.

Parameters
----------
$name: Str
    Name of the required optimizer. Should be the name of a subclass of
    Optimizer. Case insensitive.
:$rescale_grad : Num
    Rescaling factor applied to the gradient. Normally should be 1/batch_size.
%kwargs: Hash
    Parameters for the optimizer.

Returns
-------
opt : Optimizer
    The resulting optimizer.
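For example, an SGD optimizer can be created by name as follows. This is a
minimal sketch based on the signature above; the parameter values are
illustrative only:

    use AI::MXNet qw(mx);

    # Create an SGD optimizer by name, rescaling gradients by 1/batch_size.
    my $batch_size = 128;
    my $optimizer  = AI::MXNet::Optimizer->create(
        'sgd',
        rescale_grad  => 1/$batch_size,
        learning_rate => 0.01,
    );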
Sets individual learning rate multipliers for parameters.

Parameters
----------
args_lr_mult : hash ref of Str/Int to Num
    Sets the learning rate multiplier for a given name/index. Setting the
    multiplier by index is supported for backward compatibility, but we
    recommend using names and symbols.
Sets individual weight decay multipliers for parameters. By default, the wd
multiplier is 0 for all parameters whose name doesn't end with _weight, if
param_idx2name is provided.

Parameters
----------
args_wd_mult : hash ref of Str/Int to Num
    Sets the weight decay multiplier for a given name/index. Setting the
    multiplier by index is supported for backward compatibility, but we
    recommend using names and symbols.
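Continuing the sketch above, per-parameter multipliers might be set like this,
assuming the two methods are exposed as set_lr_mult and set_wd_mult; the
parameter names 'fc1_weight' and 'fc1_bias' are hypothetical examples:

    # Halve the learning rate for one weight and disable weight decay
    # for one bias, addressed by parameter name.
    $optimizer->set_lr_mult({ fc1_weight => 0.5 });
    $optimizer->set_wd_mult({ fc1_bias   => 0 });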
AI::MXNet::SGD - A very simple SGD optimizer with momentum and weight regularization.
A very simple SGD optimizer with momentum and weight regularization.

If the storage types of the weight and the gradient are both 'row_sparse' and
lazy_update is true, **lazy updates** are applied:

    for row in grad.indices:
        rescaled_grad[row] = lr * rescale_grad * clip(grad[row], clip_gradient) + wd * weight[row]
        state[row] = momentum[row] * state[row] + rescaled_grad[row]
        weight[row] = weight[row] - state[row]

The sparse update only updates the momentum for the weights whose row_sparse
gradient indices appear in the current batch, rather than updating it for all
indices. Compared with the original update, it can provide large improvements
in model training throughput for some applications. However, it provides
slightly different semantics than the original update and may lead to
different empirical results.

Otherwise, **standard updates** are applied (see the worked example after the
parameter list below):

    rescaled_grad = lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
    state = momentum * state + rescaled_grad
    weight = weight - state

Parameters
----------
learning_rate : Num, optional
    The learning rate of SGD.
momentum : Num, optional
    The momentum value.
wd : Num, optional
    L2 regularization coefficient added to all the weights.
rescale_grad : Num, optional
    Rescaling factor applied to the gradient. Normally should be 1/batch_size.
clip_gradient : Num, optional
    Clips the gradient to the range [-clip_gradient, clip_gradient].
param_idx2name : hash ref of Str/Int to Num, optional
    Special treatment of weight decay for parameters whose name ends with
    bias, gamma, or beta.
multi_precision : Bool, optional
    Flag to control the internal precision of the optimizer. False results in
    using the same precision as the weights (default); True makes an internal
    32-bit copy of the weights and applies gradients in 32-bit precision, even
    if the actual weights used in the model have lower precision. Turning this
    on can improve convergence and accuracy when training with float16.
lazy_update : Bool, optional, default true
    Whether to apply the lazy update described above when the storage types
    allow it.
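To make the standard update concrete, here is a plain-Perl walkthrough of one
step on a single scalar weight. The values are illustrative, and clipping is
omitted since clip_gradient is unset by default:

    # One standard SGD-with-momentum step, following the rule quoted above.
    my ($lr, $wd, $momentum, $rescale_grad) = (0.1, 1e-4, 0.9, 1/128);
    my ($weight, $grad, $state) = (0.5, 2.56, 0.0);

    my $rescaled_grad = $lr * $rescale_grad * $grad + $wd * $weight;
    $state  = $momentum * $state + $rescaled_grad;
    $weight = $weight - $state;
    printf "updated weight = %.6f\n", $weight;   # prints 0.497950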
AI::MXNet::Signum - The Signum optimizer that takes the sign of gradient or momentum.
The optimizer updates the weight by:

    rescaled_grad = rescale_grad * clip(grad, clip_gradient) + wd * weight
    state = momentum * state + (1 - momentum) * rescaled_grad
    weight = (1 - lr * wd_lh) * weight - lr * sign(state)

See the original paper at: https://jeremybernste.in/projects/amazon/signum.pdf

This optimizer accepts the following parameters in addition to those accepted
by AI::MXNet::Optimizer.

Parameters
----------
momentum : Num, optional
    The momentum value.
wd_lh : Num, optional
    The amount of decoupled weight decay regularization; see details in the
    original paper at https://arxiv.org/abs/1711.05101
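The following plain-Perl sketch traces one Signum step on a scalar weight,
directly implementing the rule above; the values are illustrative and clipping
is omitted:

    # One Signum step: only the sign of the momentum state drives the update.
    my ($lr, $wd, $wd_lh, $momentum, $rescale_grad) = (0.01, 0, 1e-4, 0.9, 1);
    my ($weight, $grad, $state) = (0.5, -3.2, 0);

    my $rescaled_grad = $rescale_grad * $grad + $wd * $weight;
    $state = $momentum * $state + (1 - $momentum) * $rescaled_grad;
    my $step = $state > 0 ? 1 : $state < 0 ? -1 : 0;    # sign(state)
    $weight  = (1 - $lr * $wd_lh) * $weight - $lr * $step;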
AI::MXNet::FTML - The FTML optimizer.
This class implements the optimizer described in *FTML - Follow the Moving
Leader in Deep Learning*, available at
http://proceedings.mlr.press/v70/zheng17a/zheng17a.pdf.

This optimizer accepts the following parameters in addition to those accepted
by AI::MXNet::Optimizer.

Parameters
----------
beta1 : Num, optional
    0 < beta1 < 1. Generally close to 0.5.
beta2 : Num, optional
    0 < beta2 < 1. Generally close to 1.
epsilon : Num, optional
    Small value to avoid division by 0.
AI::MXNet::LBSGD - The Large Batch SGD optimizer with momentum and weight decay.
The optimizer updates the weight by:

    state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
    weight = weight - state

Parameters
----------
momentum : Num, optional
    The momentum value.
multi_precision : Bool, optional
    Flag to control the internal precision of the optimizer. 0 results in
    using the same precision as the weights (default); 1 makes an internal
    32-bit copy of the weights and applies gradients in 32-bit precision, even
    if the actual weights used in the model have lower precision. Turning this
    on can improve convergence and accuracy when training with float16.
warmup_strategy : Str, optional
    One of 'linear', 'power2', 'sqrt', or 'lars'. Default: 'linear'.
warmup_epochs : unsigned, default: 5
batch_scale : unsigned, default: 1
    Same as batch_size * num_workers.
updates_per_epoch : unsigned, default: 32
    The default might not reflect the true number of batches per epoch;
    used for warmup.
begin_epoch : unsigned, default: 0
    The starting epoch.
AI::MXNet::DCASGD - DCASGD optimizer with momentum and weight regularization.
DCASGD optimizer with momentum and weight regularization. Implements the paper
"Asynchronous Stochastic Gradient Descent with Delay Compensation for
Distributed Deep Learning".

Parameters
----------
learning_rate : Num, optional
    The learning rate of SGD.
momentum : Num, optional
    The momentum value.
lamda : Num, optional
    Scale of the delay compensation (DC) term.
wd : Num, optional
    L2 regularization coefficient added to all the weights.
rescale_grad : Num, optional
    Rescaling factor applied to the gradient. Normally should be 1/batch_size.
clip_gradient : Num, optional
    Clips the gradient to the range [-clip_gradient, clip_gradient].
param_idx2name : hash ref of Str/Int to Num, optional
    Special treatment of weight decay for parameters that end with bias,
    gamma, and beta.
AI::MXNet::NAG - SGD with Nesterov weight handling.
It is implemented according to https://github.com/torch/optim/blob/master/sgd.lua
AI::MXNet::SGLD - Stochastic Gradient Riemannian Langevin Dynamics.
Stochastic Gradient Riemannian Langevin Dynamics. This class implements the
optimizer described in the paper *Stochastic Gradient Riemannian Langevin
Dynamics on the Probability Simplex*, available at
https://papers.nips.cc/paper/4883-stochastic-gradient-riemannian-langevin-dynamics-on-the-probability-simplex.pdf.

Parameters
----------
learning_rate : Num, optional
    The learning rate of SGD.
wd : Num, optional
    L2 regularization coefficient added to all the weights.
rescale_grad : Num, optional
    Rescaling factor applied to the gradient. Normally should be 1/batch_size.
clip_gradient : Num, optional
    Clips the gradient to the range [-clip_gradient, clip_gradient].
AI::MXNet::Adam - Adam optimizer as described in [King2014]_.
Adam optimizer as described in [King2014]_.

.. [King2014] Diederik Kingma, Jimmy Ba,
   *Adam: A Method for Stochastic Optimization*,
   http://arxiv.org/abs/1412.6980

Parameters
----------
learning_rate : Num, optional
    Step size. Default value is set to 0.001.
beta1 : Num, optional
    Exponential decay rate for the first moment estimates.
    Default value is set to 0.9.
beta2 : Num, optional
    Exponential decay rate for the second moment estimates.
    Default value is set to 0.999.
epsilon : Num, optional
    Default value is set to 1e-8.
wd : Num, optional
    L2 regularization coefficient added to all the weights.
rescale_grad : Num, optional
    Rescaling factor applied to the gradient. Normally should be 1/batch_size.
clip_gradient : Num, optional
    Clips the gradient to the range [-clip_gradient, clip_gradient].
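In everyday use the optimizer is usually selected by name when fitting a
module. The following is a hedged sketch, assuming a prepared
AI::MXNet::Module in $mod and a training data iterator in $train_iter;
optimizer_params forwards the constructor arguments listed above:

    # Train with Adam; optimizer_params are passed to the Adam constructor.
    $mod->fit(
        $train_iter,
        optimizer        => 'adam',
        optimizer_params => { learning_rate => 0.001, beta1 => 0.9, beta2 => 0.999 },
        num_epoch        => 10,
    );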
AI::MXNet::AdaGrad - AdaGrad optimizer of Duchi et al., 2011
AdaGrad optimizer of Duchi et al., 2011. This code follows the version in
http://arxiv.org/pdf/1212.5701v1.pdf Eq(5) by Matthew D. Zeiler, 2012. AdaGrad
helps the network converge faster in some cases.

Parameters
----------
learning_rate : Num, optional
    Step size. Default value is set to 0.05.
wd : Num, optional
    L2 regularization coefficient added to all the weights.
rescale_grad : Num, optional
    Rescaling factor applied to the gradient. Normally should be 1/batch_size.
eps : Num, optional
    A small float number to keep the update numerically stable.
    Default value is set to 1e-7.
clip_gradient : Num, optional
    Clips the gradient to the range [-clip_gradient, clip_gradient].
AI::MXNet::RMSProp - RMSProp optimizer of Tieleman & Hinton, 2012.
RMSProp optimizer of Tieleman & Hinton, 2012. For centered=False, the code
follows the version in
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf by
Tieleman & Hinton, 2012. For centered=True, the code follows the version in
http://arxiv.org/pdf/1308.0850v5.pdf Eq(38) - Eq(45) by Alex Graves, 2013.

Parameters
----------
learning_rate : Num, optional
    Step size. Default value is set to 0.001.
gamma1 : Num, optional
    Decay factor of the moving average of the squared gradient.
    Default value is set to 0.9.
gamma2 : Num, optional
    "Momentum" factor. Default value is set to 0.9.
    Only used if centered=True.
epsilon : Num, optional
    Default value is set to 1e-8.
centered : Bool, optional
    Whether to use Graves's or Tieleman & Hinton's version of RMSProp.
wd : Num, optional
    L2 regularization coefficient added to all the weights.
rescale_grad : Num, optional
    Rescaling factor applied to the gradient.
clip_gradient : Num, optional
    Clips the gradient to the range [-clip_gradient, clip_gradient].
clip_weights : Num, optional
    Clips the weights to the range [-clip_weights, clip_weights].
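For instance, the centered (Graves) variant could be requested like this,
reusing the create method documented above; the values are illustrative:

    # Centered RMSProp per Graves (2013); gamma2 only takes effect
    # because centered is enabled.
    my $rmsprop = AI::MXNet::Optimizer->create(
        'rmsprop',
        learning_rate => 0.001,
        gamma1        => 0.9,
        gamma2        => 0.9,
        centered      => 1,
    );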
AI::MXNet::AdaDelta - AdaDelta optimizer.
AdaDelta optimizer as described in Zeiler, M. D. (2012), *ADADELTA: An
adaptive learning rate method*, http://arxiv.org/abs/1212.5701

Parameters
----------
rho : Num
    Decay rate for both squared gradients and delta x.
epsilon : Num
    The constant as described in the paper.
wd : Num
    L2 regularization coefficient added to all the weights.
rescale_grad : Num, optional
    Rescaling factor applied to the gradient. Normally should be 1/batch_size.
clip_gradient : Num, optional
    Clips the gradient to the range [-clip_gradient, clip_gradient].
AI::MXNet::Ftrl - The FTRL optimizer.
Referenced from *Ad Click Prediction: a View from the Trenches*, available at
http://dl.acm.org/citation.cfm?id=2488200.

The optimizer updates the weight by:

    rescaled_grad = clip(grad * rescale_grad, clip_gradient)
    z += rescaled_grad - (sqrt(n + rescaled_grad**2) - sqrt(n)) * weight / learning_rate
    n += rescaled_grad**2
    w = (sign(z) * lamda1 - z) / ((beta + sqrt(n)) / learning_rate + wd) * (abs(z) > lamda1)

If the storage types of the weight, the state, and the gradient are all
row_sparse, **sparse updates** are applied:

    for row in grad.indices:
        rescaled_grad[row] = clip(grad[row] * rescale_grad, clip_gradient)
        z[row] += rescaled_grad[row] - (sqrt(n[row] + rescaled_grad[row]**2) - sqrt(n[row])) * weight[row] / learning_rate
        n[row] += rescaled_grad[row]**2
        w[row] = (sign(z[row]) * lamda1 - z[row]) / ((beta + sqrt(n[row])) / learning_rate + wd) * (abs(z[row]) > lamda1)

The sparse update only updates z and n for the weights whose row_sparse
gradient indices appear in the current batch, rather than updating them for
all indices. Compared with the original update, it can provide large
improvements in model training throughput for some applications. However, it
provides slightly different semantics than the original update and may lead
to different empirical results.

This optimizer accepts the following parameters in addition to those accepted
by AI::MXNet::Optimizer.

Parameters
----------
lamda1 : Num, optional
    L1 regularization coefficient.
learning_rate : Num, optional
    The initial learning rate.
beta : Num, optional
    Per-coordinate learning rate correlation parameter.
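A plain-Perl sketch of one dense FTRL step on a scalar weight makes the L1
thresholding explicit: the weight stays exactly zero until abs(z) exceeds
lamda1, which is what produces sparse solutions. The values are illustrative:

    # One FTRL step on a scalar weight, following the update rule above.
    my ($lr, $wd, $lamda1, $beta, $rescale_grad) = (0.1, 0, 0.01, 1, 1);
    my ($weight, $grad, $z, $n) = (0.0, 0.3, 0.0, 0.0);

    my $rescaled_grad = $grad * $rescale_grad;
    $z += $rescaled_grad - (sqrt($n + $rescaled_grad**2) - sqrt($n)) * $weight / $lr;
    $n += $rescaled_grad**2;
    my $sign = $z > 0 ? 1 : -1;
    $weight  = abs($z) > $lamda1
             ? ($sign * $lamda1 - $z) / (($beta + sqrt($n)) / $lr + $wd)
             : 0;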
AI::MXNet::Adamax - The AdaMax optimizer.
It is a variant of Adam based on the infinity norm, available at
http://arxiv.org/abs/1412.6980 Section 7.

This optimizer accepts the following parameters in addition to those accepted
by AI::MXNet::Optimizer.

Parameters
----------
beta1 : Num, optional
    Exponential decay rate for the first moment estimates.
beta2 : Num, optional
    Exponential decay rate for the second moment estimates.
AI::MXNet::Nadam - The Nesterov Adam optimizer.
The Nesterov Adam optimizer. Much like Adam is essentially RMSProp with
momentum, Nadam is Adam with Nesterov momentum, available at
http://cs229.stanford.edu/proj2015/054_report.pdf.

This optimizer accepts the following parameters in addition to those accepted
by AI::MXNet::Optimizer.

Parameters
----------
beta1 : Num, optional
    Exponential decay rate for the first moment estimates.
beta2 : Num, optional
    Exponential decay rate for the second moment estimates.
epsilon : Num, optional
    Small value to avoid division by 0.
schedule_decay : Num, optional
    Exponential decay rate for the momentum schedule.
AI::MXNet::Updater - Updater for kvstore
Sets updater states.
Gets updater states.

Parameters
----------
dump_optimizer : Bool, default False
    Whether to also save the optimizer itself. This would also save optimizer
    information such as the learning rate and weight decay schedules.
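A hedged sketch of checkpointing updater state around a training
interruption. Obtaining the updater via get_updater and the named
dump_optimizer argument are assumptions based on the parameters documented
above, not verbatim API:

    # Snapshot updater state so training can resume later; $optimizer is
    # an AI::MXNet::Optimizer created as shown earlier.
    my $updater = AI::MXNet::Optimizer->get_updater($optimizer);
    my $states  = $updater->get_states(dump_optimizer => 1);

    # ... later, restore the saved states into a fresh updater ...
    $updater->set_states($states);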
To install AI::MXNet, copy and paste the appropriate command into your terminal.
cpanm

    cpanm AI::MXNet

CPAN shell

    perl -MCPAN -e shell
    install AI::MXNet
For more information on module installation, please visit the detailed CPAN module installation guide.