
The FedOpt Family of Aggregation Strategies


Reading time: 4 min

Recall that modern deep learning optimizers like AdamW1 or AdaGrad2 use first- and second-order moment estimates of the stochastic gradients computed during iterative optimization to adaptively modify the model updates. At a high level, each algorithm aims to reinforce common update directions (i.e. those with momentum) and damp update elements corresponding to noisy directions (i.e. those with high batch-to-batch variance). The FedOpt family3 of algorithms considers modifying the traditional FedAvg aggregation algorithm to incorporate similar adaptations into server-side model updates in FL.

Mathematical motivation

In FedAvg, recall that, after a round of local training on each client, client model weights are combined into a single model representation via

$$w^{t+1} = \sum_{k \in C_t} \frac{n_k}{n_s} w_k^{t+1},$$

where $w_k^{t+1}$ denotes the model weights after local training on client $k$. For round $t$, each client starts local training from the same set of weights, $w^t$. Assume that each client has the same number of data points, such that $n_k = m$. With a bit of algebra, the update is rewritten as

$$\begin{aligned} w^{t+1} &= \sum_{k \in C_t} \frac{n_k}{n_s} w_k^{t+1} = w^t - \frac{1}{|C_t|} \sum_{k \in C_t} \left( w^t - w_k^{t+1} \right), \\ &= w^t + \frac{1}{|C_t|} \sum_{k \in C_t} \Delta_k^{t+1}, \\ &= w^t + \Delta^{t+1}. \end{aligned} \tag{1}$$

Here, $\Delta_k^{t+1} = w_k^{t+1} - w^t$ is just the vector pointing from the initial model weights to those after local training on client $k$, and $\Delta^{t+1}$ is simply the uniform average of these update vectors.
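For readers who prefer code, here is a minimal NumPy sketch of this aggregation step, treating each model as a flat parameter vector and assuming the equal client weighting above; the function name and signature are illustrative rather than any particular library's API.

```python
import numpy as np

def fedavg_update(server_weights: np.ndarray, client_weights: list[np.ndarray]) -> np.ndarray:
    """Aggregate one round of FedAvg as a server-side update, as in Equation (1).

    Assumes every client holds the same number of data points, so each client
    contributes equally to the average.
    """
    # Delta_k^{t+1}: vector from the initial server weights to each client's
    # locally trained weights.
    client_deltas = [w_k - server_weights for w_k in client_weights]

    # Delta^{t+1}: the uniform average of the client update vectors,
    # i.e. the "pseudo-gradient" used by the FedOpt family.
    pseudo_gradient = np.mean(client_deltas, axis=0)

    # w^{t+1} = w^t + Delta^{t+1}
    return server_weights + pseudo_gradient
```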

Recall that, if each client uses a fixed learning rate, $\eta$, and performs a single full-gradient update, FedAvg is equivalent to centralized large-batch SGD. Similarly, in this case, if each client performs one step of batch SGD with a learning rate of 1.0, then the update in Equation (1) is equivalent to a batch-SGD update with a learning rate of 1.0 on the server side. The "server-side" batch is the union of the batches used on each client.
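To make this concrete, write $\nabla F_k(w^t)$ for the gradient of client $k$'s local loss on its batch (notation introduced here only for illustration). A single SGD step with a learning rate of 1.0 gives

$$w_k^{t+1} = w^t - \nabla F_k(w^t) \quad \Longrightarrow \quad \Delta_k^{t+1} = -\nabla F_k(w^t),$$

so that

$$\Delta^{t+1} = -\frac{1}{|C_t|} \sum_{k \in C_t} \nabla F_k(w^t),$$

and, for equal batch sizes, the server update $w^{t+1} = w^t + \Delta^{t+1}$ is precisely a gradient-descent step with learning rate 1.0 on the combined batch.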

The observation that $\Delta^{t+1}$ is simply a stochastic gradient motivates treating these update directions like the stochastic gradients in standard adaptive optimizers. It's important to note that if the clients, for instance, apply multiple steps of local SGD or use different learning rates, the exact equivalence of $\Delta^{t+1}$ to a stochastic gradient is broken. However, it shares similarities with such a gradient and is, therefore, called a "pseudo-gradient."3

The algorithms: FedAdagrad, FedAdam, FedYogi

Drawing inspiration from three successful, traditional adaptive optimizers, the adaptive server-side aggregation strategies of FedAdagrad, FedAdam, and FedYogi have been proposed. See the algorithm below for details.

FedOpt Algorithms

Those familiar with the mathematical formulations of AdaGrad, Adam,4 and Yogi5 will recognize the general structure of these equations. Computation of $m_t$, based on the average of the update directions suggested by each client through local training ($\Delta^{t+1}$), serves to accumulate momentum associated with directions that are consistently and frequently part of these updates. On the other hand, $v_t$ estimates the variance associated with update directions across server rounds. Directions with higher variance are damped in favor of those with more consistency round over round.
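To make that structure concrete, below is a rough NumPy sketch of one round of the FedAdam-style server update, with comments noting how the FedAdagrad and FedYogi variants differ in their second-moment accumulation. The function name, default values, and flat-vector representation are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def fedadam_server_update(
    server_weights: np.ndarray,
    pseudo_gradient: np.ndarray,  # Delta^{t+1} from the FedAvg-style aggregation above
    m: np.ndarray,                # first-moment estimate, m_t
    v: np.ndarray,                # second-moment estimate, v_t
    eta: float = 0.1,             # server learning rate (illustrative default)
    beta_1: float = 0.9,
    beta_2: float = 0.99,
    tau: float = 1e-3,            # adaptivity / numerical-stability parameter
):
    """One round of an Adam-style server update driven by the pseudo-gradient.

    FedAdagrad and FedYogi share this structure and differ only in how the
    second moment v_t is accumulated.
    """
    # Momentum on directions that clients consistently agree on.
    m = beta_1 * m + (1 - beta_1) * pseudo_gradient

    # FedAdam: exponential moving average of the squared pseudo-gradient.
    # FedAdagrad would instead accumulate v + pseudo_gradient**2, and FedYogi
    # uses a sign-based variant that grows v more conservatively.
    v = beta_2 * v + (1 - beta_2) * pseudo_gradient**2

    # Damp high-variance directions via the denominator.
    new_weights = server_weights + eta * m / (np.sqrt(v) + tau)
    return new_weights, m, v
```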

As with the usual forms of these algorithms, there are a number of hyper-parameters that can be tuned, including $\tau$, $\beta_1$, and $\beta_2$. However, sensible defaults are suggested in the paper, such as $\beta_1 = 0.9$ and $\beta_2 = 0.99$. The authors also show that performance is generally robust to the choice of $\tau$.
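Tying the sketches above together, a toy driver loop using those suggested defaults might look like the following, where random perturbations stand in for actual client-side local training.

```python
import numpy as np

rng = np.random.default_rng(0)
num_clients, num_rounds, dim = 5, 10, 4
w = np.zeros(dim)                    # initial server weights, w^0
m, v = np.zeros(dim), np.zeros(dim)  # moment estimates

for _ in range(num_rounds):
    # Random perturbations stand in here for a real round of local training
    # started from the current server weights on each client.
    client_weights = [w + 0.1 * rng.standard_normal(dim) for _ in range(num_clients)]

    # Pseudo-gradient Delta^{t+1}, reusing fedavg_update from the sketch above.
    pseudo_gradient = fedavg_update(w, client_weights) - w

    # Server-side adaptive step with the suggested defaults.
    w, m, v = fedadam_server_update(w, pseudo_gradient, m, v,
                                    beta_1=0.9, beta_2=0.99, tau=1e-3)
```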

A number of experiments show that the proposed FedOpt family of algorithms can outperform FedAvg, especially in heterogeneous settings. Moreover, in the paper's experiments, these algorithms outperform SCAFFOLD,6 a variance-reduction method aimed at improving convergence in the presence of heterogeneity. A final advantage of the FedOpt family of algorithms is that they are accompanied by several convergence results showing that, as long as the variance of the local gradients is not too large, the algorithms converge.


Contributors: