Optimization
The module pyro.optim provides support for optimization in Pyro. In particular, it provides PyroOptim, which wraps PyTorch optimizers and manages optimizers for dynamically generated parameters (see the tutorial SVI Part I for a discussion). Any custom optimization algorithms are also found here.
Pyro Optimizers
class PyroOptim(optim_constructor, optim_args)
Bases: object
A wrapper for torch.optim.Optimizer objects that helps with managing dynamically generated parameters.
Parameters:
- optim_constructor – a torch.optim.Optimizer
- optim_args – a dictionary of learning arguments for the optimizer, or a callable that returns such dictionaries
__call__(params, *args, **kwargs)
Parameters: params (an iterable of strings) – a list of parameters
Do an optimization step for each param in params. If a given param has never been seen before, initialize an optimizer for it.
get_state()
Get the state associated with all the optimizers, as a dictionary of (parameter name, optimizer state dict) key-value pairs.
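The per-parameter bookkeeping described above can be pictured with a small pure-Python sketch. It is not Pyro's implementation: `ToySGD` and `LazyPerParamOptim` are hypothetical names, parameters are plain floats keyed by name, gradients are passed in explicitly, and the callable form of optim_args is assumed to take the parameter name. It illustrates only the two behaviors stated in this section: an optimizer is created the first time a parameter is seen, and optim_args may be a dictionary or a callable returning one.

```python
class ToySGD:
    """Hypothetical stand-in for a torch.optim.Optimizer acting on one scalar."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.steps = 0

    def step(self, value, grad):
        self.steps += 1
        return value - self.lr * grad

    def state_dict(self):
        return {"lr": self.lr, "steps": self.steps}


class LazyPerParamOptim:
    """Toy wrapper: one optimizer per parameter name, created on first use."""

    def __init__(self, optim_constructor, optim_args):
        self.optim_constructor = optim_constructor
        # A dict, or a callable mapping a parameter name to a dict (assumed signature).
        self.optim_args = optim_args
        self.optim_objs = {}

    def _args_for(self, name):
        return self.optim_args(name) if callable(self.optim_args) else self.optim_args

    def __call__(self, values, grads):
        # values and grads are dicts mapping parameter name -> float.
        for name in values:
            if name not in self.optim_objs:
                # Never seen before: build an optimizer for this parameter now.
                self.optim_objs[name] = self.optim_constructor(**self._args_for(name))
            values[name] = self.optim_objs[name].step(values[name], grads[name])

    def get_state(self):
        return {name: opt.state_dict() for name, opt in self.optim_objs.items()}


def per_param_args(name):
    # Give the hypothetical "scale" parameter a larger learning rate.
    return {"lr": 0.5} if name == "scale" else {"lr": 0.1}


optim = LazyPerParamOptim(ToySGD, per_param_args)
values = {"loc": 1.0}
optim(values, {"loc": 2.0})                 # first sight of "loc"
values["scale"] = 4.0                       # a parameter that appears later
optim(values, {"loc": 2.0, "scale": 2.0})   # "scale" gets its optimizer here
print(values)
print(optim.get_state())
```

Keeping one optimizer object per name is what lets parameters that only come into existence partway through training pick up their own state without disturbing the others.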
AdagradRMSProp(optim_args)
Wraps pyro.optim.adagrad_rmsprop.AdagradRMSProp with PyroOptim.
ClippedAdam(optim_args)
Wraps pyro.optim.clipped_adam.ClippedAdam with PyroOptim.
class PyroLRScheduler(scheduler_constructor, optim_args)
Bases: pyro.optim.optim.PyroOptim
A wrapper for lr_scheduler objects that adjusts learning rates for dynamically generated parameters.
Parameters:
- scheduler_constructor – an lr_scheduler
- optim_args – a dictionary of learning arguments for the optimizer, or a callable that returns such dictionaries. Must contain the key 'optimizer' with a PyTorch optimizer as its value.
Example:
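    import torch
    import pyro
    from pyro.infer import SVI, TraceGraph_ELBO
    from torch.utils.data import DataLoader

    # model, guide, dataset, batch_size, and epochs are assumed to be defined elsewhere.
    optimizer = torch.optim.SGD
    scheduler = pyro.optim.ExponentialLR({'optimizer': optimizer, 'optim_args': {'lr': 0.01}, 'gamma': 0.1})
    svi = SVI(model, guide, scheduler, loss=TraceGraph_ELBO())
    for i in range(epochs):
        for minibatch in DataLoader(dataset, batch_size):
            svi.step(minibatch)
        scheduler.step(epoch=i)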
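For intuition about what the scheduler in this example does to the learning rate, the sketch below assumes the usual exponential-decay rule, in which the rate after a given number of decay steps is the base rate times gamma raised to that number. The helper `exponential_lr` is a hypothetical name used only for illustration; it does not call Pyro or PyTorch.

```python
def exponential_lr(base_lr, gamma, epoch):
    """Learning rate after `epoch` decay steps: base_lr * gamma ** epoch."""
    return base_lr * gamma ** epoch


# Values taken from the example: lr = 0.01, gamma = 0.1.
lrs = [exponential_lr(0.01, 0.1, epoch) for epoch in range(4)]
for epoch, lr in enumerate(lrs):
    print(f"epoch {epoch}: lr = {lr:.0e}")
```

With gamma = 0.1 the rate shrinks tenfold per epoch, which is a steep schedule; values closer to 1 decay more gently.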
class AdagradRMSProp(params, eta=1.0, delta=1e-16, t=0.1)
Bases: torch.optim.optimizer.Optimizer
Implements a mash-up of the Adagrad algorithm and RMSProp. For the precise update equation see equations 10 and 11 in reference [1].
References:
[1] ‘Automatic Differentiation Variational Inference’, Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, David M. Blei. URL: https://arxiv.org/abs/1603.00788
[2] ‘Lecture 6.5 RmsProp: Divide the gradient by a running average of its recent magnitude’, Tieleman, T. and Hinton, G., COURSERA: Neural Networks for Machine Learning.
[3] ‘Adaptive subgradient methods for online learning and stochastic optimization’, Duchi, John, Hazan, E and Singer, Y.
Parameters:
- params – iterable of parameters to optimize or dicts defining parameter groups
- eta (float) – sets the step size scale (optional; default: 1.0)
- t (float) – momentum parameter (optional; default: 0.1)
- delta (float) – modulates the exponent that controls how the step size scales (optional; default: 1e-16)
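As a reading aid for these parameters, here is a scalar sketch of the step-size rule in equations 10 and 11 of reference [1]. Several details are assumptions made for the sketch and should be checked against the AdagradRMSProp source before relying on them: the mapping of eta, delta, and t onto the paper's η, ε, and α; the offset τ = 1 in the denominator; and initializing the running average with the first squared gradient. The function name `adagrad_rmsprop_step` is hypothetical.

```python
import math


def adagrad_rmsprop_step(x, grad, state, eta=1.0, delta=1e-16, t=0.1, tau=1.0):
    """One update of a scalar x; `state` carries the step count and running average."""
    state["step"] = state.get("step", 0) + 1
    if "s" not in state:
        state["s"] = grad ** 2  # assumed initialization
    else:
        # Eq. 11: exponential running average of squared gradients.
        state["s"] = t * grad ** 2 + (1.0 - t) * state["s"]
    # Eq. 10: step size decays roughly like step ** -0.5 and shrinks
    # when recent gradients have been large.
    rho = eta * state["step"] ** (-0.5 + delta) / (tau + math.sqrt(state["s"]))
    return x - rho * grad


# Minimize f(x) = (x - 3) ** 2, whose gradient is 2 * (x - 3), starting from x = 0.
x, state = 0.0, {}
for _ in range(200):
    x = adagrad_rmsprop_step(x, 2.0 * (x - 3.0), state)
print(x)
```

In this reading, eta scales every step uniformly, delta nudges the decay exponent away from -1/2, and t controls how quickly old squared gradients are forgotten.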
class ClippedAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, clip_norm=10.0, lrd=1.0)
Bases: torch.optim.optimizer.Optimizer
Parameters:
- params – iterable of parameters to optimize or dicts defining parameter groups
- lr – learning rate (default: 1e-3)
- betas (Tuple) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps – term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay – weight decay (L2 penalty) (default: 0)
- clip_norm – magnitude of norm to which gradients are clipped (default: 10.0)
- lrd – rate at which learning rate decays (default: 1.0)
A small modification of the Adam algorithm implemented in torch.optim.Adam that adds gradient clipping and learning rate decay.
Reference:
‘Adam: A Method for Stochastic Optimization’, Diederik P. Kingma, Jimmy Ba. URL: https://arxiv.org/abs/1412.6980
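The two additions to Adam can be sketched for a single scalar parameter, where clipping by norm and clamping to [-clip_norm, clip_norm] coincide. This is an illustration under stated assumptions, not the ClippedAdam source: the function name `clipped_adam_step` is hypothetical, the learning rate is assumed to be multiplied by lrd once per step, and weight decay is left out for brevity.

```python
import math


def clipped_adam_step(x, grad, state, lr=0.001, betas=(0.9, 0.999), eps=1e-8,
                      clip_norm=10.0, lrd=1.0):
    """One Adam update of a scalar x with gradient clipping and learning rate decay."""
    beta1, beta2 = betas
    state["step"] = state.get("step", 0) + 1
    # Learning rate decay: multiply the current rate by lrd on every step (assumed schedule).
    state["lr"] = state.get("lr", lr) * lrd
    # Gradient clipping: for a scalar, clipping by norm is the same as clamping.
    grad = max(-clip_norm, min(clip_norm, grad))
    # Standard Adam moment estimates with bias correction.
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["step"])
    v_hat = state["v"] / (1 - beta2 ** state["step"])
    return x - state["lr"] * m_hat / (math.sqrt(v_hat) + eps)


# A huge gradient is clipped to clip_norm before it reaches the Adam update.
state = {}
x = clipped_adam_step(0.0, 1e6, state, lr=0.1, clip_norm=10.0, lrd=0.5)
print(x, state["lr"])
```

With the default lrd = 1.0 the learning rate stays constant, so the decay only takes effect when lrd is set below 1.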
PyTorch Optimizers
Adadelta(optim_args)
Wraps torch.optim.Adadelta with PyroOptim.
Adagrad(optim_args)
Wraps torch.optim.Adagrad with PyroOptim.
Adam(optim_args)
Wraps torch.optim.Adam with PyroOptim.
AdamW(optim_args)
Wraps torch.optim.AdamW with PyroOptim.
SparseAdam(optim_args)
Wraps torch.optim.SparseAdam with PyroOptim.
Adamax(optim_args)
Wraps torch.optim.Adamax with PyroOptim.
ASGD(optim_args)
Wraps torch.optim.ASGD with PyroOptim.
SGD(optim_args)
Wraps torch.optim.SGD with PyroOptim.
Rprop(optim_args)
Wraps torch.optim.Rprop with PyroOptim.
RMSprop(optim_args)
Wraps torch.optim.RMSprop with PyroOptim.
LambdaLR(optim_args)
Wraps torch.optim.lr_scheduler.LambdaLR with PyroLRScheduler.
StepLR(optim_args)
Wraps torch.optim.lr_scheduler.StepLR with PyroLRScheduler.
MultiStepLR(optim_args)
Wraps torch.optim.lr_scheduler.MultiStepLR with PyroLRScheduler.
ExponentialLR(optim_args)
Wraps torch.optim.lr_scheduler.ExponentialLR with PyroLRScheduler.
CosineAnnealingLR(optim_args)
Wraps torch.optim.lr_scheduler.CosineAnnealingLR with PyroLRScheduler.
ReduceLROnPlateau(optim_args)
Wraps torch.optim.lr_scheduler.ReduceLROnPlateau with PyroLRScheduler.
CyclicLR(optim_args)
Wraps torch.optim.lr_scheduler.CyclicLR with PyroLRScheduler.
CosineAnnealingWarmRestarts(optim_args)
Wraps torch.optim.lr_scheduler.CosineAnnealingWarmRestarts with PyroLRScheduler.
Higher-Order Optimizers
class MultiOptimizer
Bases: object
Base class of optimizers that make use of higher-order derivatives.
Higher-order optimizers generally use torch.autograd.grad() rather than torch.Tensor.backward(), and therefore require a different interface from usual Pyro and PyTorch optimizers. In this interface, the step() method inputs a loss tensor to be differentiated, and backpropagation is triggered one or more times inside the optimizer.
Derived classes must implement step() to compute derivatives and update parameters in-place.
Example:
    tr = poutine.trace(model).get_trace(*args, **kwargs)
    loss = -tr.log_prob_sum()
    params = {name: site['value'].unconstrained()
              for name, site in tr.nodes.items()
              if site['type'] == 'param'}
    optim.step(loss, params)
step(loss, params)
Performs an in-place optimization step on parameters given a differentiable loss tensor.
Note that this detaches the updated tensors.
Parameters:
- loss (torch.Tensor) – A differentiable tensor to be minimized. Some optimizers require this to be differentiable multiple times.
- params (dict) – A dictionary mapping param name to unconstrained value as stored in the param store.
get_step(loss, params)
Computes an optimization step of parameters given a differentiable loss tensor, returning the updated values.
Note that this preserves derivatives on the updated tensors.
Parameters:
- loss (torch.Tensor) – A differentiable tensor to be minimized. Some optimizers require this to be differentiable multiple times.
- params (dict) – A dictionary mapping param name to unconstrained value as stored in the param store.
Returns: A dictionary mapping param name to updated unconstrained value.
Return type: dict
class
PyroMultiOptimizer(optim)[source]¶ Bases:
pyro.optim.multi.MultiOptimizerFacade to wrap
PyroOptimobjects in aMultiOptimizerinterface.
-
class
TorchMultiOptimizer(optim_constructor, optim_args)[source]¶ Bases:
pyro.optim.multi.PyroMultiOptimizerFacade to wrap
Optimizerobjects in aMultiOptimizerinterface.
class MixedMultiOptimizer(parts)
Bases: pyro.optim.multi.MultiOptimizer
Container class to combine different MultiOptimizer instances for different parameters.
Parameters: parts (list) – A list of (names, optim) pairs, where each names is a list of parameter names, and each optim is a MultiOptimizer or PyroOptim object to be used for the named parameters. Together the names should partition up all desired parameters to optimize.
Raises: ValueError – if any name is optimized by multiple optimizers.
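The partition requirement on parts can be illustrated with a small validation helper. This is a sketch of the rule as described, not Pyro's code: `check_partition` is a hypothetical name, and optimizers are represented by placeholder strings so the example runs without Pyro.

```python
def check_partition(parts):
    """Map each parameter name to the index of its part; reject duplicates."""
    owner = {}
    for index, (names, _optim) in enumerate(parts):
        for name in names:
            if name in owner:
                raise ValueError(
                    f"parameter {name!r} is optimized by multiple optimizers")
            owner[name] = index
    return owner


# A valid partition: every name belongs to exactly one (names, optim) pair.
owner = check_partition([(["loc", "scale"], "newton"), (["weights"], "adam")])
print(owner)

# An invalid one: "loc" appears in two parts, so a ValueError is raised.
try:
    check_partition([(["loc"], "newton"), (["loc", "weights"], "adam")])
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

A typical reason to split parameters this way is to apply a second-order method to a few low-dimensional parameters while a first-order optimizer handles the rest.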
class Newton(trust_radii={})
Bases: pyro.optim.multi.MultiOptimizer
Implementation of MultiOptimizer that performs a Newton update on batched low-dimensional variables, optionally regularizing via a per-parameter trust_radius. See newton_step() for details.
The result of get_step() will be differentiable; however, the updated values from step() will be detached.
Parameters: trust_radii (dict) – a dict mapping parameter name to radius of trust region. Missing names will use unregularized Newton update, equivalent to infinite trust radius.
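The role of the trust radius can be seen in a one-dimensional sketch. Pyro's newton_step() operates on batched multi-dimensional variables and may regularize differently; the hypothetical function `newton_step_1d` below assumes a scalar with positive curvature, where the trust-region solution is the Newton step capped at the radius. A radius of None plays the role of a missing name, giving the unregularized update.

```python
def newton_step_1d(x, grad, hess, trust_radius=None):
    """Newton update for a scalar with positive curvature, step capped at trust_radius."""
    if hess <= 0:
        raise ValueError("this sketch assumes positive curvature")
    step = -grad / hess
    if trust_radius is not None:
        # Keep the step inside the trust region [-trust_radius, trust_radius].
        step = max(-trust_radius, min(trust_radius, step))
    return x + step


# f(x) = (x - 5) ** 2 has gradient 2 * (x - 5) and constant Hessian 2.
x0 = 0.0
free = newton_step_1d(x0, 2 * (x0 - 5), 2.0)                       # unregularized
capped = newton_step_1d(x0, 2 * (x0 - 5), 2.0, trust_radius=1.0)   # bounded step
print(free, capped)
```

On a quadratic the unregularized step lands on the minimizer in one move, while the capped step travels only as far as the radius allows; the cap matters most when the local quadratic model is a poor fit far from the current point.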