Common Issues and Troubleshooting

Because FL relies on communication between distributed processes, even when they are simulated on the same machine, things can go a bit haywire if the communication orchestration gets off track. In this document, we list a few of the common issues you might run into when working with the library and running experiments.

Server and Clients Stuck and Doing Nothing

If this is happening, there are several common causes for the hanging processes, detailed below.

Not Enough Clients Have Started

A critical parameter in the configuration files is n_clients. See, for example, examples/basic_example/config.yaml. In many of our examples, this parameter is used to set min_fit_clients and min_evaluate_clients for the strategy objects. See, for example, examples/basic_example/server.py. This tells the server that it should wait for at least n_clients clients to connect before beginning federated learning.

If you have only started 3 clients but have set n_clients: 4, the server (and the existing clients) will wait until at least one more client has connected to the server before starting.
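
As a rough sketch of how this wiring typically looks, consider the snippet below. The config path mirrors the example above and the strategy arguments are standard Flower FedAvg parameters, but treat the details as illustrative rather than the exact contents of examples/basic_example/server.py.

import yaml
from flwr.server.strategy import FedAvg

# Read n_clients from the example config file.
with open("examples/basic_example/config.yaml") as f:
    config = yaml.safe_load(f)
n_clients = config["n_clients"]

# The server will not sample clients for fitting or evaluation until
# at least this many clients have connected and registered.
strategy = FedAvg(
    min_fit_clients=n_clients,
    min_evaluate_clients=n_clients,
    min_available_clients=n_clients,
)

If fewer than n_clients clients ever connect, the strategy never samples a round, which is exactly the hang described above.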

Ghost or Orphaned Processes Remain Running

The FL4Health library relies on the communication layer provided by Flower to orchestrate information exchange between the server and client processes. While this layer is generally robust, it can run aground in certain scenarios. If FL concludes cleanly, the server and client processes will be shut down automatically. We also have functionality that can be used to terminate such processes when the server receives one or more exceptions from participating clients and accept_failures=False is set for the server class.
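
For reference, Flower's built-in strategies expose this behaviour through an accept_failures flag, shown below on FedAvg. This is a minimal sketch; FL4Health's server classes take the analogous option described above.

from flwr.server.strategy import FedAvg

# With accept_failures=False, client failures are surfaced rather than
# silently tolerated, allowing the termination logic described above to run.
strategy = FedAvg(accept_failures=False)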

However, in certain scenarios, such as stopping a process with ctrl+c or a failure occurring before clients have registered with the server, processes may be left running in the background. This is especially likely if you're launching processes with nohup, as is done in the examples/utils/run_fl_local.sh script. Because these orphaned processes will still be listening on the local IP and a specified port, they can interfere with the communication of new processes that you start with the same IP and port specifications.

To alleviate this, you need to terminate these running processes before starting any new runs. The easiest way to do this is through top/htop in the terminal on Mac/Linux machines. An analogous process should be followed on Windows machines (e.g., via the Task Manager) to shut such processes down.
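
On Mac/Linux, the following terminal commands accomplish the same thing. The port here is an assumption (8080 is Flower's default); adjust it, and the process name patterns, to match your own runs.

# Show any processes still listening on the server port.
lsof -i :8080

# Terminate leftover server and client processes by script name.
pkill -f server.py
pkill -f client.py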

Scary Warnings On Startup

On starting up the server and client processes, various warnings that look a bit scary sometimes appear at the front of the log files. An example of one of these warnings, produced when running locally on CPU, appears below.

2024-11-29 08:54:04.123569: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized
to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler
flags.
/usr/local/anaconda3/envs/fl4health/lib/python3.10/site-packages/threadpoolctl.py:1214: RuntimeWarning:
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

While the above warnings might appear problematic, they are often harmless. They come from libraries leveraged under the hood, either to warn users of issues that might arise under certain conditions or to point out potential optimizations. For example, the first message says that CPU performance could be improved by rebuilding TensorFlow with the right compiler flags. However, this isn't necessary to run the code properly.
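
If the TensorFlow message clutters your logs, it can be silenced with a standard TensorFlow environment variable. This is general TensorFlow behaviour, not anything specific to FL4Health.

import os

# Must be set before TensorFlow is imported. "2" suppresses INFO and
# WARNING messages while keeping errors visible.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

import tensorflow as tf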