FedProx
The FedProx algorithm1 is one of the earliest approaches specifically aimed at addressing the optimization challenges associated with data heterogeneity in FL. At its core, the FedProx algorithm is quite straightforward. However, prior to diving into the modifications proposed in the FedProx approach, we'll first consider the kind of phenomenon that FedProx, along with other methods, attempts to counteract.
To help illustrate the issue, we'll use some helpful visualizations from researchers who proposed the SCAFFOLD method.2,3 Consider a two-client FL setting. Each client has its own loss landscape based on its privately held data, denoted \(f_1\) and \(f_2\). If each client holds an equal amount of data, the global loss surface, that is, the loss function constructed from all data across both clients, is \((f_1 + f_2)/2\). When performing standard federated training, the objective is to find model weights corresponding to the minimum of this global loss function. See the figure below. Note that the minima associated with the client loss functions are distinct from the global minimum.
Recall that optimization with FedSGD is equivalent to centralized large-batch SGD. That is, rounds of FedSGD are equivalent to optimizing the global loss function, expressed here as \((f_1 + f_2)/2\). As such, with a properly tuned learning rate, FedSGD will converge to the global optimum, as illustrated in the figure below. Each averaged gradient step makes steady progress towards the global minimum.
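This equivalence can be verified numerically on a toy problem. The sketch below (a minimal illustration; the quadratic client losses and their minima are assumptions chosen for simplicity) shows that one FedSGD round, where the server averages client gradients, produces exactly the same step as centralized gradient descent on the global loss \((f_1 + f_2)/2\):

```python
import numpy as np

# Two toy clients with quadratic losses f_i(w) = 0.5 * ||w - c_i||^2,
# so grad f_i(w) = w - c_i and the global loss (f_1 + f_2)/2 has
# gradient w - (c_1 + c_2)/2.
c1, c2 = np.array([2.0, 0.0]), np.array([0.0, 2.0])

def grad(w, c):
    return w - c

w = np.array([5.0, 5.0])
lr = 0.1

# One FedSGD round: each client sends its gradient, the server averages.
fedsgd_step = w - lr * 0.5 * (grad(w, c1) + grad(w, c2))

# One centralized large-batch step on the global loss (f_1 + f_2)/2.
central_step = w - lr * grad(w, 0.5 * (c1 + c2))

assert np.allclose(fedsgd_step, central_step)
```

Because gradient averaging is linear, the two updates coincide exactly, which is why FedSGD with a properly tuned learning rate tracks the centralized trajectory.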
As detailed in the chapter on FedAvg,4 there is a substantial reduction in communication overhead if each client applies multiple steps of batch SGD, optimizing the local model based on the local loss. It was noted therein, however, that this breaks the equivalence enjoyed, for example, by FedSGD with centralized large-batch SGD. In settings with data heterogeneity and markedly different local loss landscapes, such as the one illustrated in the figures thus far, this can engender various issues. One such issue is often referred to as "client drift" and is illustrated in the figure below.
In the figure, each client applies three local steps of batch SGD before sending the resulting weights to the server for aggregation. The grey dots represent three rounds of FedSGD updates from the previous figure. The FedAvg update deviates quite a bit from this path, with a distinct drift towards the minimum of Client 2. Drifts of this kind can be induced by the shape of the local loss surfaces and cause issues for FedAvg, such as slowed convergence.
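Client drift can be reproduced numerically. In the sketch below (a toy construction; the quadratic losses with differing curvatures are assumptions, since drift does not appear when client losses are identical in shape), one FedAvg round with three local steps per client lands at a different point than three rounds of FedSGD:

```python
import numpy as np

# Client losses f_i(w) = 0.5 * (w - c_i)^T A_i (w - c_i), with different
# curvature matrices A_i so the local loss landscapes differ markedly.
c1, c2 = np.array([2.0, 0.0]), np.array([0.0, 2.0])
A1, A2 = np.diag([2.0, 0.2]), np.diag([0.2, 2.0])

def local_steps(w, A, c, lr, k):
    # k full-batch gradient steps on the local loss.
    for _ in range(k):
        w = w - lr * A @ (w - c)
    return w

w0 = np.array([5.0, 5.0])
lr = 0.3

# One FedAvg round: three local steps per client, then average the weights.
w_fedavg = 0.5 * (local_steps(w0, A1, c1, lr, 3)
                  + local_steps(w0, A2, c2, lr, 3))

# Three FedSGD rounds: one averaged-gradient step per round.
w_fedsgd = w0
for _ in range(3):
    w_fedsgd = w_fedsgd - lr * 0.5 * (A1 @ (w_fedsgd - c1)
                                      + A2 @ (w_fedsgd - c2))

# The FedAvg iterate has drifted off the FedSGD path.
drift = np.linalg.norm(w_fedavg - w_fedsgd)
assert drift > 0.1
```

The gap arises because averaging weights after several local steps is not the same as averaging gradients at each step: each client's trajectory bends toward its own minimum before the server ever aggregates.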
The Math
The general idea for FedProx is to limit models from drifting too far during local training. For a server round \(t\), consider the aggregated weights \(\mathbf{w}_t\). For a given client \(k\), let \(\ell_k(b; \mathbf{w})\) denote the local loss function for a batch, \(b\), of data, parameterized by model weights \(\mathbf{w}\). The primary modification of FedAvg in the FedProx algorithm is to augment \(\ell_k(b; \mathbf{w})\) with a penalty term such that, for \(\mu > 0\), the local loss becomes
$$ \begin{align*} \ell_k(b; \mathbf{w}) + \mu \Vert \mathbf{w} - \mathbf{w}_t \Vert^2. \end{align*} $$
The penalty term is referred to as the proximal loss. It penalizes significant deviation from the global model during local training such that loss optimization must trade off improvements in the standard loss with potential divergence from the original model weights. Revisiting the loss surfaces above, the FedProx penalty term alters the loss surface to make client drift less attractive, unless it leads to significant performance gains.
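The gradient of the penalty term is \(2\mu(\mathbf{w} - \mathbf{w}_t)\), so each local gradient step is pulled back toward the global weights. The sketch below (a minimal local-update loop on an assumed toy quadratic loss, not the paper's implementation) shows the effect: a larger \(\mu\) keeps the locally trained model closer to \(\mathbf{w}_t\):

```python
import numpy as np

def fedprox_local_update(w_global, grad_fn, mu, lr, steps):
    """Local SGD on loss(w) + mu * ||w - w_global||^2.

    The proximal term contributes the gradient 2 * mu * (w - w_global),
    which pulls each local step back toward the global weights.
    """
    w = w_global.copy()
    for _ in range(steps):
        w = w - lr * (grad_fn(w) + 2.0 * mu * (w - w_global))
    return w

# Toy client loss f(w) = 0.5 * ||w - c||^2 with minimum at c.
c = np.array([4.0, 0.0])
grad_fn = lambda w: w - c
w_global = np.array([0.0, 0.0])

w_free = fedprox_local_update(w_global, grad_fn, mu=0.0, lr=0.1, steps=50)
w_prox = fedprox_local_update(w_global, grad_fn, mu=1.0, lr=0.1, steps=50)

# The proximal term keeps the local model closer to the global weights.
assert np.linalg.norm(w_prox - w_global) < np.linalg.norm(w_free - w_global)
```

With \(\mu = 0\) the update reduces to plain local SGD and the weights run all the way to the client minimum; with \(\mu > 0\) the local optimum shifts to a compromise between the client minimum and \(\mathbf{w}_t\).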
The Algorithm
The FedProx algorithm is very similar to that of FedAvg, with the only modification coming in the local update calculations.
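A full round can be sketched as follows (a simplified toy, assuming full client participation and equal client weighting; the quadratic client losses are assumptions for illustration). The server loop is identical to FedAvg; only the local update carries the proximal term:

```python
import numpy as np

# Toy setup: two clients with quadratic losses f_i(w) = 0.5 * ||w - c_i||^2.
minima = [np.array([2.0, 0.0]), np.array([0.0, 2.0])]

def local_update(w_global, c, mu, lr=0.1, steps=5):
    # Local SGD on f(w) + mu * ||w - w_global||^2 (the FedProx local loss).
    w = w_global.copy()
    for _ in range(steps):
        w = w - lr * ((w - c) + 2.0 * mu * (w - w_global))
    return w

w = np.array([5.0, 5.0])
for _ in range(100):                     # server rounds
    client_weights = [local_update(w, c, mu=0.1) for c in minima]
    w = np.mean(client_weights, axis=0)  # FedAvg-style aggregation

# For this symmetric toy problem, the iterates approach the
# global minimum at (1, 1).
assert np.allclose(w, np.array([1.0, 1.0]), atol=1e-3)
```

Note that setting `mu=0.0` in the local update recovers plain FedAvg, which makes FedProx easy to drop into an existing FedAvg pipeline.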
Adapting \(\mu\)
For a well-tuned \(\mu\), FedProx has been shown to outperform FedAvg under heterogeneous data conditions. It is widely applied, both because it is a simple modification to the FedAvg framework and because it works fairly well across a number of tasks. However, in settings where the data is homogeneous, FedProx has been shown to underperform FedAvg when \(\mu > 0\). See the top left of the figure below.
Because of this, the authors of FedProx offer an alternative to extensive hyper-parameter tuning. Heuristically, the proximal weight may be adapted across server rounds. If the aggregated server-side training loss (the average final loss on each client) fails to decrease for a round, \(\mu\) is increased. If the loss improves for some number of rounds, \(\mu\) is decreased. This procedure corresponds to the fuchsia-colored line in the figures below.
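One possible shape for this heuristic is sketched below. The class name, the multiplicative `factor`, and the `patience` threshold are all assumptions for illustration, not the exact constants from the FedProx paper; only the increase-on-stall / decrease-on-improvement logic comes from the description above:

```python
class ProximalWeightScheduler:
    """Heuristic adaptation of the proximal weight mu across server
    rounds. The increase/decrease factor and patience are assumed
    values, not constants from the FedProx paper."""

    def __init__(self, mu=0.1, factor=1.5, patience=3):
        self.mu = mu
        self.factor = factor
        self.patience = patience
        self.best_loss = float("inf")
        self.improving_rounds = 0

    def step(self, avg_train_loss):
        if avg_train_loss >= self.best_loss:
            # Loss failed to decrease this round: tighten the constraint.
            self.mu *= self.factor
            self.improving_rounds = 0
        else:
            self.best_loss = avg_train_loss
            self.improving_rounds += 1
            if self.improving_rounds >= self.patience:
                # Loss improved for several rounds: relax the constraint.
                self.mu /= self.factor
                self.improving_rounds = 0
        return self.mu

sched = ProximalWeightScheduler(mu=0.1)
sched.step(1.0)   # improvement: mu unchanged
sched.step(1.2)   # stall: mu increased
```

The server would call `step` once per round with the aggregated training loss and pass the returned \(\mu\) to the next round's local updates.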

