Bootcamp Repository Setup
These instructions assume you have already completed the basic Vector cluster account setup, including initial access, changing your password, and setting up multifactor authentication. These instructions were sent to you by the Vector Ops Team. Multifactor authentication is now required for all connections to the Vector cluster.
Overview
In this section, you will create ssh keys on the Vector cluster in order to connect to the Fl4Health GitHub repository. You will need to add these ssh keys to your GitHub profile in order to clone the repository and access code. A similar process may be followed on your local machine to establish keys to clone the repository locally.
Creating Your SSH Keys
First, log in to Vaughan (the Vector cluster) over ssh using your login credentials (replace username with your own Vector username). If you are using Windows, use Windows PowerShell to run local commands, including the following one; git-bash is an alternative on Windows. Otherwise, use Terminal.
ssh username@v.vectorinstitute.ai
Once logged into the Vaughan cluster, create ssh keys (replace your_email@example.com with your GitHub account email address). For additional reference, see information here.
ssh-keygen -t ed25519 -C "your_email@example.com"
When prompted to choose a file in which to save the key, just press Enter for the default. Additionally, when asked to enter a passphrase, press Enter to proceed without setting a passphrase. It is alright not to set one.
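If you prefer to script this step rather than answer the prompts interactively, the same choices can be supplied as flags. A minimal sketch (the /tmp path is only for illustration; on the cluster the interactive default of ~/.ssh/id_ed25519 is what you want):

```bash
# Non-interactive variant of the ssh-keygen step above:
# -f sets the output file, -N "" sets an empty passphrase, -q keeps it quiet.
# The /tmp path below is a stand-in; keep the default ~/.ssh location for real use.
ssh-keygen -t ed25519 -C "your_email@example.com" -f /tmp/demo_ed25519 -N "" -q

# Both the private and public key files should now exist.
ls /tmp/demo_ed25519 /tmp/demo_ed25519.pub
```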
Using the command below, display your public key in the terminal and copy it to the clipboard ($USER expands to your Vector cluster username automatically):
cat /h/$USER/.ssh/id_ed25519.pub
Add this ssh key to your GitHub profile by following the steps on this page: Add New SSH Key. You can verify the key is working by running ssh -T git@github.com, which should greet you by your GitHub username.
Cloning the Repository
Return to your terminal session and clone the fl4health repository into your home directory.
cd ~/
git clone git@github.com:VectorInstitute/fl4health.git
There should be a new folder in your home directory called fl4health.
Once you have successfully cloned the fl4health repository, please proceed to setting up your VS Code and Python environment. These steps are outlined in ide_and_environment_guide.md.
IDE and Environment Setup
Installing VS Code Locally and Cloning the Repository
For this bootcamp, we highly recommend using VS Code as your local IDE because it makes working on the cluster GPUs significantly easier. You can download VS Code here: https://code.visualstudio.com/
Once you have the application installed, you can clone and open a local version of the fl4health repository by following the same set of instructions that you followed to download it to Vector’s cluster but on your local machine.
See: Repo Setup Guide
Setting up your Python Environment
There are comprehensive instructions for setting up your IDE and environment in CONTRIBUTING.md. Reading and following these steps is optional, but it can be helpful if you run into issues.
You will need Python 3.10 installed and available on your local machine to correctly create the Python virtual environment for the library. If you don't already have it, there are multiple ways to obtain a copy and use it to create an environment with the specific version. A few examples are:
- Using miniconda, following the installation instructions (link) and the environment creation instructions here.
- Using Homebrew, via this link.
- Using pyenv, following the readme here: link. Note that pyenv can be somewhat involved to use.
Thereafter, you run the commands below (or variations, if your system python is not 3.10 or you're using an environment manager like conda).
cd path/to/fl4health
python -m venv ENV_PATH/env_name
source ENV_PATH/env_name/bin/activate
pip install --upgrade pip poetry
poetry install --with "dev, test, codestyle"
The environment creation step may differ depending on how Python 3.10 is installed on your system or whether you're using, for example, the conda steps to create the environment.
For example, if Python 3.10 is not designated as your local system's python, you may need to adjust the path in the command
python -m venv ENV_PATH/env_name
to the right python path, for example
path/to/python -m venv ENV_PATH/env_name
Here, ENV_PATH/env_name is whatever you want to call the environment to be created. Mine is simply called fl4health.
If you're using conda, then you can specify a python version to use as
conda create -n env_name python=3.10
where env_name is what you would like to call your environment. Thereafter, you would activate your environment using
conda activate env_name
and proceed with the remainder of the instructions unaltered.
Note that the above code must be run from the top level of the fl4health directory.
Any time you want to run code in the library, this environment must be active.
The command to activate the environment is
source ENV_PATH/env_name/bin/activate
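As a quick sanity check that the environment is active, `which python` should resolve to a path inside it. A sketch using a throwaway /tmp path in place of your actual ENV_PATH/env_name:

```bash
# Create and activate a venv at a stand-in path (substitute your own
# ENV_PATH/env_name), then confirm the active interpreter lives inside it.
python3 -m venv /tmp/fl4health_demo_env
source /tmp/fl4health_demo_env/bin/activate
which python          # should print a path under /tmp/fl4health_demo_env
python --version      # should report the version the venv was created from
```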
Many of the examples in the library can be run locally in a reasonable amount of time on a CPU. However, a few are much faster on a GPU. Moreover, larger models and datasets of interest may require a GPU for efficient training.
Python Environment Setup on the Cluster
For working with the library on Vector’s cluster, there are two options:
- Activate our pre-built environment and immediately start running the examples in the library and working with our code.
- Build your own version of the environment, which you can modify to add libraries beyond our installations.
Activating and Working with Our Pre-built Environment
First log onto the cluster with
ssh username@v.vectorinstitute.ai
going through the steps of two-factor authentication.
The shared environment is housed in the public folder:
/ssd003/projects/aieng/public/fl4health_bootcamp/
All that is necessary to start working with the library is to run
source /ssd003/projects/aieng/public/fl4health_bootcamp/bin/activate
This should prefix your terminal prompt with (fl4health_bootcamp).
Creating Your Own Environment on the Cluster
If you’re going this route, you’ll need to follow the steps below to create and set up a python environment of your own.
First log onto the cluster with
ssh username@v.vectorinstitute.ai
going through the steps of two-factor authentication.
The process is nearly the same as on your local machine. However, prior to creating the environment, you will need to activate Python 3.10 on the cluster, which makes the process one step longer:
module load python/3.10.12
cd path/to/fl4health
python -m venv ENV_PATH
source ENV_PATH/bin/activate
pip install --upgrade pip poetry
poetry install --with "dev, test, codestyle"
Accessing a Cluster GPU through your Local VS Code
You can also connect your local VS Code directly to a VS Code instance on a GPU or CPU on Vector’s cluster.
Installing VS Code Server on the Cluster
First log into the cluster with
ssh username@v.vectorinstitute.ai
going through the steps of two-factor authentication.
The commands below download and save the VS Code CLI in your home folder on the cluster. You only need to do this once:
cd ~/
curl -Lk 'https://update.code.visualstudio.com/1.98.2/cli-alpine-x64/stable' --output vscode_cli.tar.gz
tar -xf vscode_cli.tar.gz
rm vscode_cli.tar.gz
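The curl command above writes a gzipped tarball and tar -xf unpacks it. The same download-extract-clean-up pattern can be sketched offline with a small local archive (stand-in file names, no network needed):

```bash
# Build a small stand-in archive, then apply the same extract-and-clean-up
# steps as above (tar -xf auto-detects the gzip compression).
cd /tmp
mkdir -p demo_src && printf 'fake binary\n' > demo_src/code
tar -czf vscode_cli_demo.tar.gz -C demo_src code

tar -xf vscode_cli_demo.tar.gz   # unpacks ./code into the current directory
rm vscode_cli_demo.tar.gz        # the archive is no longer needed
ls -l code
```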
Setting up a Tunnel and Connecting Your Local VS Code
After logging into the cluster, run the following.
srun --gres=gpu:1 --qos=m --time=4:00:00 -c 8 --mem 16G -p t4v2 --pty bash
This will reserve a t4v2 GPU and provide you a terminal to run commands on that node. Note that -p t4v2 requests a t4v2 GPU. You can also access larger a40 and rtx6000 GPUs this way, but you may face longer wait times for reservations. The -c 8 flag requests 8 supporting CPUs and --mem 16G requests 16 GB of CPU memory (not GPU memory).
There may be a brief waiting period if the cluster is busy and many people are using the GPU resources.
Next, verify the beginning of the command prompt to make sure that you are running commands from a GPU node (e.g., user@gpu001) and not the login node (user@v[1,2,...]).
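A couple of quick checks once the srun shell starts can confirm where you landed. A sketch (nvidia-smi exists only on GPU nodes, so it is guarded here to avoid an error elsewhere):

```bash
# Confirm you are on a compute node, not the login node.
hostname                      # expect something like gpu001, not v1/v2

# On a GPU node, nvidia-smi lists the reserved GPU and its memory;
# the guard keeps the command from failing on machines without it.
command -v nvidia-smi >/dev/null && nvidia-smi || true
```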
After that, you can spin up a tunnel to the GPU node using the following command:
~/code tunnel
You will be prompted to authenticate via GitHub. On the first run, you might also need to review Microsoft's terms of service.
Thereafter, you will be prompted to name your tunnel. You can name it whatever you like or leave it blank and it will default to the name of the first GPU you have connected to.
After that, you can access the tunnel through your browser (not the best experience, but it works). If you've logged into GitHub in your VS Code desktop app, you can also connect from there by installing the extension:
ms-vscode.remote-server
Then, in your local VS Code press Shift-Command-P (Shift-Control-P), and locate
Remote-Tunnels: Connect to Tunnel.
After selecting this option and waiting for VS Code to find the GPU you have started the tunnel on (under whatever name you gave it, or the default of the first GPU you connected to), you should be able to select it. Now your VS Code is logged into the GPU and should be able to see the file system there.
Note that you will need to keep the SSH connection running in your terminal while using the tunnel. After you are done with the work, stop your session by pressing Control-C to release the GPU.
GPU reservations are time limited. The flags --qos=m --time=4:00:00 guarantee that you get the GPU for 4 hours uninterrupted. Thereafter, you may be preempted (kicked off) by other users hoping to use the resources. If you want more time, you can increase --time=X:00:00 to request a longer reservation. As the reservation time increases, so does the potential wait time to obtain the requested resources.
Running an Example (Locally or On the Cluster)
For your convenience, we have a basic utility script that takes care of launching server and client code in background processes, so you don’t need to worry about opening multiple terminal windows to run each client and server process separately. It is located at
examples/utils/run_fl_local.sh
Of course, you may still launch processes separately and manually if you would like to.
By default, it is set up to run our basic example with 2 clients and a server. However, you may modify this script to run other examples of your choosing. If you run (remembering to activate your environment)
bash examples/utils/run_fl_local.sh
this should kick off the federated learning processes, train a model for 2 clients using FedAvg, and place the logs in the folders specified in the script.
Cluster Datasets
For convenience, we have stored some useful datasets on the cluster. These include datasets that your team identified as potentially useful for the target use-cases you will be working on during the bootcamp.
These datasets are stored at /projects/federated_learning/
.
NOTE: The first / is important. Without it, the folder will not be visible to you. You can see its contents with the command
ls /projects/federated_learning/
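The note about the leading / is simply absolute versus relative path resolution. A small local sketch using /tmp stand-in directories (not the real cluster paths):

```bash
# Without the leading slash, the path is resolved relative to the current
# directory; with it, resolution starts at the filesystem root.
mkdir -p /tmp/path_demo/projects/federated_learning
cd /tmp/path_demo
ls -d projects/federated_learning        # relative: found under /tmp/path_demo
ls -d /projects/federated_learning 2>/dev/null \
  || echo "absolute path only exists on the cluster"
```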
In the /projects/federated_learning/public
folder, you will find all datasets used in the examples for the library
including MNIST, CIFAR, and others. The remainder of the folders should loosely correspond to your team names and are
populated with datasets relevant to your PoCs. You and your teammates should have access to these folders, but other
teams will not. If you cannot access your folder, please let your facilitator know and we will get it sorted out.
Repository Roadmap
In this document, we'll provide a brief overview of the library structure and broadly categorize the examples code by their fit with the four lectures given in the Lab/Learn phase of the bootcamp.
Repository Structure
docs/
This folder simply houses our automatically built Sphinx documentation. To access a nicely rendered version of these docs, please visit: https://vectorinstitute.github.io/FL4Health/.
The documentation remains a work-in-progress. However, if you're interested in reading rendered documentation for the various functions in the core library, they can be found at: https://vectorinstitute.github.io/FL4Health/reference/api/fl4health.html.
examples/
This is where you'll likely spend at least some time. The examples/ folder houses a number of demonstrations of implementing various kinds of federated learning (FL) workflows. There are a lot of examples here.
In Section Example Categorization, we roughly organize these examples to correspond to the various materials covered in the lectures. There are also some brief descriptions of the different examples in the Examples README.MD.
Another important folder to note is examples/utils/, which houses a small script called run_fl_local.sh. This is a nice helper script that automates the process of starting up servers and clients for the examples. At present, it is set up to run the examples/basic_example/ code with 2 clients. It can, however, be modified to run many of the examples and dump logs to the specified locations. To run this script, from the top level of the library one executes
bash examples/utils/run_fl_local.sh
fl4health/
The core components of the library are in the fl4health/ folder. This is where you will find nearly all of the code associated with the FL engine and implementations of various FL clients, servers, aggregation strategies, and other core pieces of the framework. If you need to make custom additions, such as adding a metric, implementing your own strategy, or including other custom functionality, it might fit properly here, but it can likely be folded into code that you're writing to support your experiments instead.
If you're interested in understanding what's happening under the hood or debugging certain failures, you'll likely be led into the various modules therein.
research/
Generally, this folder will not be a point of emphasis for the bootcamp. It houses some of the group's own research on new and emerging ideas in FL. It is mainly meant for experimentation and tinkering code that doesn't necessarily fit into the core library at present.
tests/
This folder houses our unit, integration, and smoke tests meant to ensure code correctness associated with our implementations. There may be some value in seeing how certain tests are run for different functions in understanding the mechanics of various implementations. However, this isn't an area of the repository that is likely to be of significant interest to participants.
Example Categorization
In this section, the examples will be roughly grouped by where they most fit within the structure of the lectures given during the Lab/Learn phase of the bootcamp. As a reminder, these categories are
- Introduction to FL
- Data Heterogeneity and Global Models
- Personal(ized) Federated Learning
- Beyond Better Optimization in Federated Learning
There is also an Other category covering the remaining examples, which fall beyond the scope of the material covered in the lectures.
Introduction to FL
examples/basic_example
examples/fedopt_example
examples/ensemble_example
examples/docker_basic_example
examples/nnunet_example
(Integration with nnUnet, quite tricky to work with)
Data Heterogeneity and Global Models
examples/fedprox_example
examples/scaffold_example
examples/moon_example
examples/feddg_ga_example
Personal(ized) Federated Learning
examples/fl_plus_local_ft_example
examples/fedper_example
examples/fedrep_example
examples/apfl_example
examples/fenda_example
examples/ditto_example
examples/mr_mtl_example
examples/fenda_ditto_example
examples/perfcl_example
examples/fedbn_example
examples/fedpm_example
examples/dynamic_layer_exchange_example
examples/sparse_tensor_partial_exchange_example
Beyond Better Optimization in Federated Learning
examples/fedpca_examples
examples/feature_alignment_example
examples/ae_examples/cvae_dim_example
examples/ae_examples/cvae_examples
examples/ae_examples/fedprox_vae_example
examples/dp_fed_examples/client_level_dp
examples/dp_fed_examples/client_level_dp_weighted
examples/dp_fed_examples/instance_level_dp
examples/dp_scaffold_example
Other
examples/model_merge_example
examples/warm_up_example/fedavg_warm_up
examples/warm_up_example/warmed_up_fedprox
examples/warm_up_example/warmed_up_fenda
examples/fedsimclr_example
examples/flash_example
Common Issues and Troubleshooting
Because FL relies on communication between distributed processes, even if they are simulated on the same machine, things can go a bit haywire if the communication orchestration gets off track. In this document, we'll try to list a few of the common issues one might run into when working with the library and running experiments.
Server and Clients Stuck and Doing Nothing
If this is happening there are several common causes for the hanging processes.
Not Enough Clients Have Started
A critical parameter in the configuration files is n_clients. See, for example, examples/basic_example/config.yaml. In many of our examples, this parameter is used to set the min_fit_clients and min_evaluate_clients for the strategy objects. See, for example, examples/basic_example/server.py. This tells the server that it should wait for at least n_clients before beginning federated learning.
If you have only started 3 clients, but n_clients: 4, the server (and the existing clients) will wait until at least one more client has reported into the server before starting.
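As a sketch, the relevant line in a config like examples/basic_example/config.yaml has this shape (other keys omitted; exact contents may differ):

```yaml
# Hypothetical excerpt: the server waits for this many clients to
# connect before federated learning begins.
n_clients: 4
```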
Ghost or Orphaned Processes Remain Running
The FL4Health library relies on the communication layer provided by Flower to orchestrate information exchange between the server and client processes. While this layer is generally robust, it can run aground in certain scenarios. If FL concludes cleanly, the server and client processes are shut down automatically. We also have functionality that can terminate such processes when the server receives one or more exceptions from participating clients, if accept_failures=False is set for the server class.
However, in certain scenarios, such as stopping a process with Ctrl+C or a failure before clients have registered with the server, processes may be left running in the background. This is especially true if you're launching processes with nohup, as is done in the examples/utils/run_fl_local.sh script. Because these orphaned processes will still be listening on the local IP and a specified port, they can interfere with the communication of new processes that you start with the same IP and port specifications.
To alleviate this, you need to terminate these running processes before starting any new runs. The easiest way to do this is through top/htop via the terminal on Mac/Linux machines. An analogous process should be followed on Windows machines to shut such processes down.
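If you know the script names, pgrep and pkill can be faster than hunting through top/htop on Mac/Linux. A sketch using a stand-in sleep process (with real runs, you would match on your script name instead; the pattern below is purely illustrative):

```bash
# Start a stand-in background process, like one nohup might leave behind.
sleep 300 &
ORPHAN_PID=$!

# pgrep -f lists PIDs whose full command line matches a pattern; with a
# real run you would match on the script name, e.g. "server.py".
pgrep -f "sleep 300"

# Terminate it. Here we kill by PID; pkill -f "pattern" does the same
# for every matching process at once.
kill "$ORPHAN_PID"
wait "$ORPHAN_PID" 2>/dev/null || true   # reap it so it is fully gone
```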
Scary Warnings On Startup
On starting up the server and client processes, various warnings that look a bit scary sometimes appear at the front of the log files. An example of one of these warnings, produced when running locally on CPU, appears below.
2024-11-29 08:54:04.123569: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized
to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler
flags.
/usr/local/anaconda3/envs/fl4health/lib/python3.10/site-packages/threadpoolctl.py:1214: RuntimeWarning:
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md
While the above warnings might appear problematic, they are often harmless. They come from various libraries leveraged under the hood, warning users of issues that might arise under certain conditions or pointing to optional software improvements. For example, the first message says that CPU performance could be improved by recompiling TensorFlow with the right flags; this isn't necessary to run the code properly.