NAME

    AI::MXNet::Optimizer - Common Optimization algorithms with regularizations.

DESCRIPTION

    Common Optimization algorithms with regularizations.

create_optimizer

        Create an optimizer with specified name.

        Parameters
        ----------
        $name: Str
            Name of required optimizer. Should be the name
            of a subclass of Optimizer. Case insensitive.

        :$rescale_grad : Num
            Rescaling factor on gradient. Normally should be 1/batch_size.

        %kwargs: Hash
            Parameters for optimizer

        Returns
        -------
        opt : Optimizer
            The resulting optimizer.

set_lr_mult

        Set individual learning rate multipliers for parameters.

        Parameters
        ----------
        args_lr_mult : hash ref of Str/Int to Num
            Sets the lr multiplier for each name/index to the given value.
            Setting multipliers by index is supported for backward compatibility,
            but we recommend using name and symbol.

set_wd_mult

        Set individual weight decay multipliers for parameters.
        By default, the wd multiplier is 0 for all params whose name doesn't
        end with _weight, if param_idx2name is provided.

        Parameters
        ----------
        args_wd_mult : hash ref of Str/Int to Num
            Sets the wd multiplier for each name/index to the given value.
            Setting multipliers by index is supported for backward compatibility,
            but we recommend using name and symbol.
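
        The default rule above can be sketched in plain Python, the notation
        used by the update formulas in this document. This is an illustration
        only, not the AI::MXNet API: the helper name default_wd_mult and the
        sample parameter names are hypothetical, and letting explicit entries
        override the default is an assumption.

```python
def default_wd_mult(param_idx2name, args_wd_mult=None):
    """Return a name -> weight decay multiplier map (hypothetical helper).

    Names that do not end with '_weight' default to 0. Explicit entries in
    args_wd_mult are then applied on top (assumed precedence).
    """
    wd_mult = {}
    for name in param_idx2name.values():
        if not name.endswith("_weight"):
            wd_mult[name] = 0.0
    wd_mult.update(args_wd_mult or {})
    return wd_mult


# Sample index -> name map; the names are made up for illustration.
mults = default_wd_mult(
    {0: "fc1_weight", 1: "fc1_bias", 2: "bn_gamma"},
    args_wd_mult={"fc1_weight": 0.5},
)
```

        Parameters that have no entry in the resulting map keep their normal
        weight decay.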

NAME

    AI::MXNet::SGD - A very simple SGD optimizer with momentum and weight regularization.

DESCRIPTION

    A very simple SGD optimizer with momentum and weight regularization.

    If the storage types of weight and grad are both 'row_sparse', and 'lazy_update' is True,
    **lazy updates** are applied by:

        for row in grad.indices:
            rescaled_grad[row] = lr * rescale_grad * clip(grad[row], clip_gradient) + wd * weight[row]
            state[row] = momentum * state[row] + rescaled_grad[row]
            weight[row] = weight[row] - state[row]

    The sparse update only updates the momentum for the weights whose row_sparse
    gradient indices appear in the current batch, rather than updating it for all
    indices. Compared with the original update, it can provide large
    improvements in model training throughput for some applications. However, it
    provides slightly different semantics than the original update, and
    may lead to different empirical results.

    Otherwise, **standard updates** are applied by:

        rescaled_grad = lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
        state = momentum * state + rescaled_grad
        weight = weight - state
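
    The two update rules can be made runnable as a minimal plain-Python sketch
    on lists of floats. It follows the formulas exactly as written in this
    section and is not the AI::MXNet implementation. The function names, the
    use of a row index to gradient row mapping for the sparse gradient, and
    treating momentum as a scalar are assumptions made for illustration.

```python
def clip(values, bound):
    """Clamp each value to [-bound, bound]; no-op when bound is None."""
    if bound is None:
        return list(values)
    return [max(-bound, min(bound, v)) for v in values]


def sgd_standard_update(weight, grad, state, lr, momentum=0.0, wd=0.0,
                        rescale_grad=1.0, clip_gradient=None):
    """Apply the standard update in place to lists of floats."""
    clipped = clip(grad, clip_gradient)
    for i in range(len(weight)):
        rescaled = lr * rescale_grad * clipped[i] + wd * weight[i]
        state[i] = momentum * state[i] + rescaled
        weight[i] -= state[i]


def sgd_lazy_update(weight, grad_rows, state, lr, momentum=0.0, wd=0.0,
                    rescale_grad=1.0, clip_gradient=None):
    """Apply the lazy update: only rows present in grad_rows are touched."""
    for row, grad in grad_rows.items():
        sgd_standard_update(weight[row], grad, state[row], lr, momentum, wd,
                            rescale_grad, clip_gradient)


# Dense example: the second gradient is clipped from 4.0 to 1.0.
weight = [1.0, -2.0]
state = [0.0, 0.0]
sgd_standard_update(weight, [0.5, 4.0], state, lr=0.1, momentum=0.9,
                    clip_gradient=1.0)

# Sparse example: only row 1 has a gradient, so row 0 is left untouched.
rows = [[1.0, 1.0], [2.0, 2.0]]
row_state = [[0.0, 0.0], [0.0, 0.0]]
sgd_lazy_update(rows, {1: [1.0, 1.0]}, row_state, lr=0.1)
```

    In the sparse example the momentum state of row 0 is not decayed, which is
    the difference in semantics described above.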

    Parameters
    ----------
    learning_rate : Num, optional
        Learning rate of SGD.

    momentum : Num, optional
        Momentum value.

    wd : Num, optional
        L2 regularization coefficient added to all the weights.

    rescale_grad : Num, optional
        Rescaling factor of the gradient. Normally should be 1/batch_size.

    clip_gradient : Num, optional
        Clip the gradient to the range [-clip_gradient, clip_gradient].

    param_idx2name : hash ref of Str/Int to Num, optional
        Special treatment of weight decay for parameters whose names end with bias, gamma, or beta.

    multi_precision: Bool, optional
        Flag to control the internal precision of the optimizer.
        False results in using the same precision as the weights (default),
        True makes internal 32-bit copy of the weights and applies gradients
        in 32-bit precision even if actual weights used in the model have lower precision.
        Turning this on can improve convergence and accuracy when training with float16.

    lazy_update: Bool, optional, default true
        Whether to apply lazy updates when the storage types of weight and grad
        are both 'row_sparse'.

NAME

    AI::MXNet::Signum - The Signum optimizer that takes the sign of gradient or momentum.

DESCRIPTION

    The optimizer updates the weight by:

        rescaled_grad = rescale_grad * clip(grad, clip_gradient) + wd * weight
        state = momentum * state + (1-momentum)*rescaled_grad
        weight = (1 - lr * wd_lh) * weight - lr * sign(state)

    See the original paper at: https://jeremybernste.in/projects/amazon/signum.pdf
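
    As a minimal sketch, the update rule above can be written in plain Python
    for lists of floats. It mirrors the three formulas as stated here and is
    not the AI::MXNet implementation. The function names and the sample values
    are illustrative assumptions.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def signum_update(weight, grad, state, lr, momentum=0.9, wd=0.0, wd_lh=0.0,
                  rescale_grad=1.0, clip_gradient=None):
    """Apply one Signum step in place to lists of floats."""
    for i in range(len(weight)):
        g = grad[i]
        if clip_gradient is not None:
            g = max(-clip_gradient, min(clip_gradient, g))
        rescaled = rescale_grad * g + wd * weight[i]
        state[i] = momentum * state[i] + (1 - momentum) * rescaled
        # Only the sign of the momentum state is used for the step.
        weight[i] = (1 - lr * wd_lh) * weight[i] - lr * sign(state[i])


weight = [1.0, -1.0]
state = [0.0, 0.0]
signum_update(weight, [0.3, -20.0], state, lr=0.01)
```

    Both weights move by exactly lr, regardless of the gradient magnitude,
    because only the sign of the state enters the step.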

    This optimizer accepts the following parameters in addition to those accepted
    by AI::MXNet::Optimizer.

    Parameters
    ----------
    momentum : Num, optional
       The momentum value.
    wd_lh : Num, optional
       The amount of decoupled weight decay regularization, see details in the original paper at:
       https://arxiv.org/abs/1711.05101

NAME

    AI::MXNet::FTML - The FTML optimizer.

DESCRIPTION

    This class implements the optimizer described in
    *FTML - Follow the Moving Leader in Deep Learning*,
    available at http://proceedings.mlr.press/v70/zheng17a/zheng17a.pdf.

    This optimizer accepts the following parameters in addition to those accepted
    by AI::MXNet::Optimizer.

    Parameters
    ----------
    beta1 : Num, optional
        0 < beta1 < 1. Generally close to 0.5.
    beta2 : Num, optional
        0 < beta2 < 1. Generally close to 1.
    epsilon : Num, optional
        Small value to avoid division by 0.

NAME

    AI::MXNet::LBSGD - The Large Batch SGD optimizer with momentum and weight decay.

DESCRIPTION

    The optimizer updates the weight by:

        state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
        weight = weight - state

    Parameters
    ----------
    momentum : Num, optional
        The momentum value.
    multi_precision : Bool, optional
        Flag to control the internal precision of the optimizer.
        0 results in using the same precision as the weights (default),
        1 makes internal 32-bit copy of the weights and applies gradients
        in 32-bit precision even if actual weights used in the model have lower precision.
        Turning this on can improve convergence and accuracy when training with float16.
    warmup_strategy : Str, default 'linear'
        One of 'linear', 'power2', 'sqrt', or 'lars'.
    warmup_epochs : unsigned, default 5
    batch_scale : unsigned, default 1
        Same as batch size * numworkers.
    updates_per_epoch : default 32
        Used for warmup. The default might not reflect the true number of batches per epoch.
    begin_epoch : unsigned, default 0
        Starting epoch.

NAME

    AI::MXNet::DCASGD - DCASGD optimizer with momentum and weight regularization.

DESCRIPTION

    DCASGD optimizer with momentum and weight regularization.

    Implements the paper "Asynchronous Stochastic Gradient Descent with
    Delay Compensation for Distributed Deep Learning".

    Parameters
    ----------
    learning_rate : Num, optional
        Learning rate of SGD.

    momentum : Num, optional
        Momentum value.

    lamda : Num, optional
        Scale of the delay compensation (DC) value.

    wd : Num, optional
        L2 regularization coefficient added to all the weights.

    rescale_grad : Num, optional
        Rescaling factor of the gradient. Normally should be 1/batch_size.

    clip_gradient : Num, optional
        Clip the gradient to the range [-clip_gradient, clip_gradient].

    param_idx2name : hash ref of Str/Int to Num, optional
        Special treatment of weight decay for parameters whose names end with bias, gamma, or beta.

NAME

    AI::MXNet::NAG - SGD with Nesterov weight handling.

DESCRIPTION

    It is implemented according to
    https://github.com/torch/optim/blob/master/sgd.lua

NAME

    AI::MXNet::SGLD - Stochastic Gradient Riemannian Langevin Dynamics.

DESCRIPTION

    Stochastic Gradient Riemannian Langevin Dynamics.

    This class implements the optimizer described in the paper *Stochastic Gradient
    Riemannian Langevin Dynamics on the Probability Simplex*, available at
    https://papers.nips.cc/paper/4883-stochastic-gradient-riemannian-langevin-dynamics-on-the-probability-simplex.pdf.

    Parameters
    ----------
    learning_rate : Num, optional
        Learning rate.

    wd : Num, optional
        L2 regularization coefficient added to all the weights.

    rescale_grad : Num, optional
        Rescaling factor of the gradient. Normally should be 1/batch_size.

    clip_gradient : Num, optional
        Clip the gradient to the range [-clip_gradient, clip_gradient].

NAME

    AI::MXNet::Adam - Adam optimizer as described in [King2014].

DESCRIPTION

    Adam optimizer as described in [King2014].

    [King2014] Diederik Kingma, Jimmy Ba,
       *Adam: A Method for Stochastic Optimization*,
       http://arxiv.org/abs/1412.6980

    Parameters
    ----------
    learning_rate : Num, optional
        Step size.
        Default value is set to 0.001.
    beta1 : Num, optional
        Exponential decay rate for the first moment estimates.
        Default value is set to 0.9.
    beta2 : Num, optional
        Exponential decay rate for the second moment estimates.
        Default value is set to 0.999.
    epsilon : Num, optional
        Small value to avoid division by 0.
        Default value is set to 1e-8.

    wd : Num, optional
        L2 regularization coefficient added to all the weights.

    rescale_grad : Num, optional
        Rescaling factor of the gradient. Normally should be 1/batch_size.

    clip_gradient : Num, optional
        Clip the gradient to the range [-clip_gradient, clip_gradient].
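
    For orientation, one Adam step as given in Algorithm 1 of the cited paper
    can be sketched in plain Python. This follows the paper rather than the
    AI::MXNet source, which may organize the computation differently. The
    function name and the step counter t are illustrative assumptions, and the
    wd, rescale_grad and clip_gradient handling is omitted for brevity.

```python
import math


def adam_update(weight, grad, mean, var, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one Adam step in place; t is the 1-based step count."""
    for i in range(len(weight)):
        # Exponential moving averages of the gradient and its square.
        mean[i] = beta1 * mean[i] + (1 - beta1) * grad[i]
        var[i] = beta2 * var[i] + (1 - beta2) * grad[i] ** 2
        # Bias correction for the zero-initialized moment estimates.
        mean_hat = mean[i] / (1 - beta1 ** t)
        var_hat = var[i] / (1 - beta2 ** t)
        weight[i] -= learning_rate * mean_hat / (math.sqrt(var_hat) + epsilon)


weight = [1.0]
mean = [0.0]
var = [0.0]
adam_update(weight, [5.0], mean, var, t=1)
```

    On the first step the bias-corrected estimates equal the gradient and its
    square, so the weight moves by roughly learning_rate whatever the gradient
    scale.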

NAME

    AI::MXNet::AdaGrad - AdaGrad optimizer of Duchi et al., 2011

DESCRIPTION

    AdaGrad optimizer of Duchi et al., 2011.

    This code follows the version in http://arxiv.org/pdf/1212.5701v1.pdf Eq(5)
    by Matthew D. Zeiler, 2012. AdaGrad will help the network to converge faster
    in some cases.

    Parameters
    ----------
    learning_rate : Num, optional
        Step size.
        Default value is set to 0.05.

    wd : Num, optional
        L2 regularization coefficient added to all the weights.

    rescale_grad : Num, optional
        Rescaling factor of the gradient. Normally should be 1/batch_size.

    eps: Num, optional
        A small floating-point value that keeps the update numerically stable.
        Default value is set to 1e-7.

    clip_gradient : Num, optional
        Clip the gradient to the range [-clip_gradient, clip_gradient].

NAME

    AI::MXNet::RMSProp - RMSProp optimizer of Tieleman & Hinton, 2012.

DESCRIPTION

    RMSProp optimizer of Tieleman & Hinton, 2012.

    For centered=False, the code follows the version in
    http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf by
    Tieleman & Hinton, 2012.

    For centered=True, the code follows the version in
    http://arxiv.org/pdf/1308.0850v5.pdf Eq(38) - Eq(45) by Alex Graves, 2013.

    Parameters
    ----------
    learning_rate : Num, optional
        Step size.
        Default value is set to 0.001.
    gamma1: Num, optional
        decay factor of moving average for gradient^2.
        Default value is set to 0.9.
    gamma2: Num, optional
        "momentum" factor.
        Default value is set to 0.9.
        Only used if centered=True.
    epsilon : Num, optional
        Default value is set to 1e-8.
    centered : Bool, optional
        If true, use Graves's version of RMSProp;
        otherwise use Tieleman & Hinton's version.
    wd : Num, optional
        L2 regularization coefficient added to all the weights.
    rescale_grad : Num, optional
        Rescaling factor of the gradient.
    clip_gradient : Num, optional
        Clip the gradient to the range [-clip_gradient, clip_gradient].
    clip_weights : Num, optional
        Clip the weights to the range [-clip_weights, clip_weights].
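
    A minimal plain-Python sketch of the non-centered (centered=False) variant,
    including the clip_weights step, might look as follows. It is not the
    AI::MXNet implementation. The function name is hypothetical, placing
    epsilon inside the square root is an assumption, and the wd, rescale_grad
    and clip_gradient handling is left out.

```python
import math


def rmsprop_update(weight, grad, n, learning_rate=0.001, gamma1=0.9,
                   epsilon=1e-8, clip_weights=None):
    """Apply one non-centered RMSProp step in place to lists of floats."""
    for i in range(len(weight)):
        # Moving average of the squared gradient.
        n[i] = gamma1 * n[i] + (1 - gamma1) * grad[i] ** 2
        # epsilon inside the square root is an assumption of this sketch.
        weight[i] -= learning_rate * grad[i] / math.sqrt(n[i] + epsilon)
        if clip_weights is not None:
            weight[i] = max(-clip_weights, min(clip_weights, weight[i]))


weight = [1.0]
n = [0.0]
rmsprop_update(weight, [2.0], n)

# Same step, but the result is clipped to [-0.5, 0.5].
clipped_weight = [1.0]
rmsprop_update(clipped_weight, [2.0], [0.0], clip_weights=0.5)
```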

NAME

    AI::MXNet::AdaDelta - AdaDelta optimizer.

DESCRIPTION

    AdaDelta optimizer as described in
    Zeiler, M. D. (2012).
    *ADADELTA: An adaptive learning rate method.*

    http://arxiv.org/abs/1212.5701

    Parameters
    ----------
    rho: Num
        Decay rate for both squared gradients and delta x.
    epsilon : Num
        The constant as described in the paper.
    wd : Num
        L2 regularization coefficient added to all the weights.
    rescale_grad : Num, optional
        Rescaling factor of the gradient. Normally should be 1/batch_size.
    clip_gradient : Num, optional
        Clip the gradient to the range [-clip_gradient, clip_gradient].

NAME

    AI::MXNet::Ftrl - The Ftrl optimizer.

DESCRIPTION

    Referenced from *Ad Click Prediction: a View from the Trenches*, available at
    http://dl.acm.org/citation.cfm?id=2488200.

    The optimizer updates the weight by:

        rescaled_grad = clip(grad * rescale_grad, clip_gradient)
        z += rescaled_grad - (sqrt(n + rescaled_grad**2) - sqrt(n)) * weight / learning_rate
        n += rescaled_grad**2
        w = (sign(z) * lamda1 - z) / ((beta + sqrt(n)) / learning_rate + wd) * (abs(z) > lamda1)

    If the storage types of weight, state and grad are all row_sparse,
    **sparse updates** are applied by:

        for row in grad.indices:
            rescaled_grad[row] = clip(grad[row] * rescale_grad, clip_gradient)
            z[row] += rescaled_grad[row] - (sqrt(n[row] + rescaled_grad[row]**2) - sqrt(n[row])) * weight[row] / learning_rate
            n[row] += rescaled_grad[row]**2
            w[row] = (sign(z[row]) * lamda1 - z[row]) / ((beta + sqrt(n[row])) / learning_rate + wd) * (abs(z[row]) > lamda1)

    The sparse update only updates the z and n for the weights whose row_sparse
    gradient indices appear in the current batch, rather than updating it for all
    indices. Compared with the original update, it can provide large
    improvements in model training throughput for some applications. However, it
    provides slightly different semantics than the original update, and
    may lead to different empirical results.
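
    The dense and sparse rules above can be sketched in plain Python on lists
    of floats. The sketch transcribes the formulas in this section and is not
    the AI::MXNet implementation. The function names, the row index to gradient
    row mapping, and the sample hyperparameter values are illustrative
    assumptions.

```python
import math


def ftrl_update(weight, grad, z, n, learning_rate, lamda1, beta, wd=0.0,
                rescale_grad=1.0, clip_gradient=None):
    """Apply one dense Ftrl step in place to lists of floats."""
    for i in range(len(weight)):
        g = grad[i] * rescale_grad
        if clip_gradient is not None:
            g = max(-clip_gradient, min(clip_gradient, g))
        z[i] += g - (math.sqrt(n[i] + g * g) - math.sqrt(n[i])) * weight[i] / learning_rate
        n[i] += g * g
        if abs(z[i]) > lamda1:
            s = 1.0 if z[i] > 0 else -1.0
            weight[i] = (s * lamda1 - z[i]) / ((beta + math.sqrt(n[i])) / learning_rate + wd)
        else:
            # The (abs(z) > lamda1) factor zeroes out small coordinates.
            weight[i] = 0.0


def ftrl_sparse_update(weight, grad_rows, z, n, **kwargs):
    """Apply the sparse update: only rows present in grad_rows are touched."""
    for row, grad in grad_rows.items():
        ftrl_update(weight[row], grad, z[row], n[row], **kwargs)


weight = [0.0, 0.0]
z = [0.0, 0.0]
n = [0.0, 0.0]
ftrl_update(weight, [1.0, 0.001], z, n, learning_rate=0.1, lamda1=0.01, beta=1.0)
```

    The second coordinate stays at exactly zero because its accumulated z is
    below lamda1, which is how the L1 term produces sparse weights.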

    This optimizer accepts the following parameters in addition to those accepted
    by AI::MXNet::Optimizer.

    Parameters
    ----------
    lamda1 : Num, optional
        L1 regularization coefficient.
    learning_rate : Num, optional
        The initial learning rate.
    beta : Num, optional
        Per-coordinate learning rate correlation parameter.

NAME

    AI::MXNet::Adamax - A variant of Adam based on the infinity norm.

DESCRIPTION

    A variant of Adam based on the infinity norm, described in
    Section 7 of http://arxiv.org/abs/1412.6980.

    This optimizer accepts the following parameters in addition to those accepted
    by AI::MXNet::Optimizer.

    Parameters
    ----------
    beta1 : Num, optional
        Exponential decay rate for the first moment estimates.
    beta2 : Num, optional
        Exponential decay rate for the second moment estimates.

NAME

    AI::MXNet::Nadam - The Nesterov Adam optimizer.

DESCRIPTION

    The Nesterov Adam optimizer.

    Much like Adam is essentially RMSprop with momentum,
    Nadam is essentially RMSprop with Nesterov momentum. It is described
    at http://cs229.stanford.edu/proj2015/054_report.pdf.

    This optimizer accepts the following parameters in addition to those accepted
    by AI::MXNet::Optimizer.

    Parameters
    ----------
    beta1 : Num, optional
        Exponential decay rate for the first moment estimates.
    beta2 : Num, optional
        Exponential decay rate for the second moment estimates.
    epsilon : Num, optional
        Small value to avoid division by 0.
    schedule_decay : Num, optional
        Exponential decay rate for the momentum schedule.

NAME

    AI::MXNet::Updater - Updater for kvstore

set_states

        Sets updater states.

get_states

        Gets updater states.

        Parameters
        ----------
        dump_optimizer : Bool, default false
            Whether to also save the optimizer itself. This would also save optimizer
            information such as learning rate and weight decay schedules.