
Bootcamp Repository Setup

Note

These instructions assume you have already followed the basic Vector cluster account setup, including initial access, changing your password, and setting up multifactor authentication. These instructions were sent to you by the Vector Ops Team. Multifactor authentication is now required for all connections to the Vector cluster.

Overview

In this section, you will create SSH keys on the Vector cluster in order to connect to the FL4Health GitHub repository. You will need to add these SSH keys to your GitHub profile in order to clone the repository and access the code. A similar process may be followed on your local machine to establish keys for cloning the repository locally.

Creating Your SSH Keys

First, log in to Vaughan (the Vector cluster) over SSH using your login credentials (replace username with your own Vector username). If you are using Windows, use Windows PowerShell to run local commands, including the following one; git-bash is an alternative on Windows. Otherwise, use your Terminal.

ssh username@v.vectorinstitute.ai

Once logged into the Vaughan cluster, create your SSH keys (replace your_email@example.com with your GitHub account email address). For additional reference, see GitHub's documentation on generating new SSH keys.

ssh-keygen -t ed25519 -C "your_email@example.com"

When prompted to choose a file in which to save the key, just press Enter to accept the default. When asked to enter a passphrase, press Enter again to proceed without setting one; it is fine not to set a passphrase.

Use the command below to display your public key in the terminal and copy it to your clipboard. Since the command uses the $USER environment variable, no username substitution is needed.

cat /h/$USER/.ssh/id_ed25519.pub

Add this SSH key to your GitHub profile by following the steps on this page: Add New SSH Key
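
Once the key has been added, you can optionally confirm that GitHub accepts it. The following command attempts an SSH connection to GitHub and should respond with a greeting containing your GitHub username:

```bash
# Test SSH authentication against GitHub from the cluster
ssh -T git@github.com
```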

Cloning the Repository

Return to your terminal session and clone the fl4health repository into your home directory.

cd ~/
git clone git@github.com:VectorInstitute/fl4health.git

There should now be a new folder in your home directory called fl4health.

Once you have successfully cloned the fl4health repository, please proceed to setting up your VS Code and Python environment. These steps are outlined in ide_and_environment_guide.md.

IDE and Environment Setup

Installing VS Code Locally and Cloning the Repository

For this bootcamp, we highly recommend using VS Code as your local IDE because it makes working on the cluster GPUs significantly easier. You can download VS Code here: https://code.visualstudio.com/

Once you have the application installed, you can clone and open a local copy of the fl4health repository by following the same set of instructions that you used to download it to Vector's cluster, but on your local machine.

See: Repo Setup Guide

Setting up your Python Environment

There are comprehensive instructions for setting up your IDE and environment in the CONTRIBUTING.MD. Reading and following these steps is optional, but it can be helpful if you run into issues.

You will need Python 3.10 installed and available on your local machine to correctly create the Python virtual environment used by the library. If you don't already have it, there are multiple ways to obtain a copy and use it to create an environment with that specific version. A few examples are:

  1. Using miniconda, following the installation instructions (link) and the environment creation instructions here
  2. Using Homebrew via this link.
  3. Using pyenv, following the readme here: link. Note that pyenv can be somewhat involved to use.

Thereafter, run the following commands (or variations of them if your system Python is not 3.10, or if you're using an environment manager like conda).

cd path/to/fl4health
python -m venv ENV_PATH/env_name
source ENV_PATH/env_name/bin/activate
pip install --upgrade pip poetry
poetry install --with "dev, test, codestyle"
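
As a quick sanity check, assuming the install completed without errors, you can confirm the interpreter version and that the library is importable from the activated environment:

```bash
# The active environment's interpreter should report Python 3.10.x
python --version

# poetry install should have made the fl4health package importable
python -c "import fl4health; print('fl4health imported successfully')"
```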

Note

The environment creation step may differ depending on how Python 3.10 is installed on your system, or whether you're using, for example, the conda steps to create the environment.

For example, if Python 3.10 is not designated as your local system's python, you may need to adjust the path in the command

python -m venv ENV_PATH/env_name

to point at the right Python binary, as in, for example,

path/to/python -m venv ENV_PATH/env_name
Here ENV_PATH/env_name is whatever you want to call the environment to be created. Mine is simply called fl4health.

If you're using conda, then you can specify a Python version to use as

conda create -n env_name python=3.10

where env_name is what you would like to call your environment. Thereafter, you would activate your environment using

conda activate env_name

and proceed with the remainder of the instructions unaltered.

Note that the above commands must be run from the top level of the fl4health directory.

Any time you want to run code in the library, this environment must be active.

The command to activate the environment is

source ENV_PATH/env_name/bin/activate

Many of the examples in the library can be run locally in a reasonable amount of time on a CPU. However, there are a few that run much faster on a GPU. Moreover, larger models and datasets of interest may require a GPU to perform efficient training.

Python Environment Setup on the Cluster

For working with the library on Vector’s cluster, there are two options:

  1. Activate our pre-built environment and immediately start running the examples in the library and working with our code.
  2. Build your own version of the environment, which you can modify to add libraries that you would like to work with above and beyond our installations.

Activating and Working with Our Pre-built Environment

First log onto the cluster with

ssh username@v.vectorinstitute.ai

going through the steps of two-factor authentication.

The shared environment is housed in the public folder: /ssd003/projects/aieng/public/fl4health_bootcamp/

All that is necessary to start working with the library is to run

source /ssd003/projects/aieng/public/fl4health_bootcamp/bin/activate

This should prefix your terminal prompt with (fl4health_bootcamp)
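
To double-check that the shared environment is active, you can ask the shell which interpreter is now first on your path; it should point into the shared environment's folder:

```bash
# Should print a path under /ssd003/projects/aieng/public/fl4health_bootcamp/
which python
```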

Creating Your Own Environment on the Cluster

If you’re going this route, you’ll need to follow the steps below to create and set up a python environment of your own.

First log onto the cluster with

ssh username@v.vectorinstitute.ai

going through the steps of two-factor authentication.

The process is nearly the same as on your local machine. However, prior to creating the environment, you will need to load Python 3.10 on the cluster, which makes the process one step longer:

module load python/3.10.12
cd path/to/fl4health
python -m venv ENV_PATH
source ENV_PATH/bin/activate
pip install --upgrade pip poetry
poetry install --with "dev, test, codestyle"
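
If the module load fails, or you simply want to see which Python versions the cluster provides, the standard environment-modules command below lists the available options:

```bash
# List the Python modules available on the cluster
module avail python
```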

Accessing a Cluster GPU through your Local VS Code

You can also connect your local VS Code directly to a VS Code instance on a GPU or CPU on Vector’s cluster.

Installing VS Code Server on the Cluster

First log into the cluster with

ssh username@v.vectorinstitute.ai

going through the steps of two-factor authentication.

The commands below download and save the VS Code CLI in your home folder on the cluster. You need only do this once:

cd ~/

curl -Lk 'https://update.code.visualstudio.com/1.98.2/cli-alpine-x64/stable' --output vscode_cli.tar.gz

tar -xf vscode_cli.tar.gz
rm vscode_cli.tar.gz

Setting up a Tunnel and Connecting Your Local VS Code

After logging into the cluster, run the following.

srun --gres=gpu:1 --qos=m --time=4:00:00 -c 8 --mem 16G -p t4v2 --pty bash

This will reserve a t4v2 GPU and provide you a terminal to run commands on that node. Note that -p t4v2 requests a t4v2 GPU. You can also access larger a40 and rtx6000 GPUs this way, but you may face longer wait times for reservations. The -c 8 requests 8 supporting CPUs and --mem 16G requests 16 GB of CPU memory (not GPU memory). There may be a brief waiting period if the cluster is busy and many people are using the GPU resources.

Next, verify the beginning of the command prompt to make sure that you are running commands from a GPU node (e.g., user@gpu001) and not the login node (user@v[1,2,..]).
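
If you want confirmation beyond the prompt, the commands below report the node's hostname and the GPU that has been allocated to your session:

```bash
# Print the node's name; it should not be a login node
hostname

# Show the allocated GPU; it should list the type you requested (e.g., a T4)
nvidia-smi
```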

After that, you can spin up a tunnel to the GPU node using the following command:

~/code tunnel

You will be prompted to authenticate via GitHub. On the first run, you might also need to review Microsoft's terms of service.

Thereafter, you will be prompted to name your tunnel. You can name it whatever you like or leave it blank and it will default to the name of the first GPU you have connected to.

After that, you can access the tunnel through your browser (not the best experience, but it works). If you've logged into GitHub in your VS Code desktop app, you can also connect from there by installing the extension:

ms-vscode.remote-server

Then, in your local VS Code press Shift-Command-P (Shift-Control-P), and locate

Remote-Tunnels: Connect to Tunnel.

After selecting this option and waiting for VS Code to find the GPU you have started the tunnel on (under whatever name you gave it, or the default of the first GPU you connected to), you should be able to select it. Now your VS Code is logged into the GPU and should be able to see the file system there.

Note that you will need to keep the SSH connection running in your terminal while using the tunnel. After you are done with the work, stop your session by pressing Control-C to release the GPU.

Note

GPU reservations are time limited. The flags --qos=m --time=4:00:00 guarantee that you get the GPU for 4 hours uninterrupted. Thereafter, you may be preempted (kicked off) by other users hoping to use the resources.

If you want to request more time, you can increase --time=X:00:00 to request a longer time reservation. As the reservation time increases, so does the potential wait time to obtain the requested resources.
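
For example, an 8-hour reservation of the same resources would look like the command below. This is a sketch under the assumption that the m QoS permits the longer window; a different --qos value may be needed otherwise:

```bash
# Same GPU, CPU, and memory request as before, but with an 8-hour window
srun --gres=gpu:1 --qos=m --time=8:00:00 -c 8 --mem 16G -p t4v2 --pty bash
```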

Running an Example (Locally or On the Cluster)

For your convenience, we have a basic utility script that takes care of launching server and client code in background processes, so you don’t need to worry about opening multiple terminal windows to run each client and server process separately. It is located at

examples/utils/run_fl_local.sh

Of course, you may still launch processes separately and manually if you would like to.

By default, it is set up to run our basic example with 2 clients and a server. However, you may modify this script to run other examples of your choosing. Run the following (remembering to activate your environment first):

bash examples/utils/run_fl_local.sh

This should kick off the federated learning processes, train a model for 2 clients using FedAvg, and place the logs in the folders specified in the script.
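
If you're curious how the script works, the sketch below shows the typical pattern of launching the server and clients as background processes with nohup and redirecting their output to log files. The module paths and flags here are illustrative assumptions; see examples/utils/run_fl_local.sh for the actual invocations.

```bash
#!/bin/bash
# Illustrative sketch: launch one server and two clients in the background.

# Start the server and capture its logs
nohup python -m examples.basic_example.server --config_path examples/basic_example/config.yaml > server.log 2>&1 &

# Give the server a moment to start listening before the clients connect
sleep 20

# Start two clients, each writing to its own log file
for i in 1 2; do
    nohup python -m examples.basic_example.client > "client_${i}.log" 2>&1 &
done
```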

Cluster Datasets

For convenience, we have stored some useful datasets on the cluster. These include datasets that your team identified as potentially useful for the target use-cases you will be working on during the bootcamp.

These datasets are stored at /projects/federated_learning/.

NOTE: This first / is important. Without it the folder will not be visible to you. You can see its contents with the command

ls /projects/federated_learning/

In the /projects/federated_learning/public folder, you will find all datasets used in the examples for the library including MNIST, CIFAR, and others. The remainder of the folders should loosely correspond to your team names and are populated with datasets relevant to your PoCs. You and your teammates should have access to these folders, but other teams will not. If you cannot access your folder, please let your facilitator know and we will get it sorted out.

Repository Roadmap

In this document, we'll provide a brief overview of the library structure and broadly categorize the example code by its fit with the four lectures given in the Lab/Learn phase of the bootcamp.

Repository Structure

docs/

This folder simply houses our automatically built Sphinx documentation. To access a nicely rendered version of these docs, please visit: https://vectorinstitute.github.io/FL4Health/.

The documentation remains a work-in-progress. However, if you're interested in reading rendered documentation for the various functions in the core library, they can be found at: https://vectorinstitute.github.io/FL4Health/reference/api/fl4health.html.

examples/

This is where you'll likely spend at least some time. The examples/ folder houses a number of demonstrations of implementing various kinds of federated learning (FL) workflows. There are a lot of examples here.

In Section Example Categorization, we roughly organize these examples to correspond to the various materials covered in the lectures. There are also some brief descriptions of the different examples in the Examples README.MD.

Another important folder to note is examples/utils/ which houses a small script called run_fl_local.sh. This is a nice helper script that automates the process of starting up servers and clients for the examples. At present, it is set up to run the examples/basic_example/ code with 2 clients. It can, however, be modified to run many of the examples and dump logs to the specified locations. To run this script, from the top level of the library one executes

bash examples/utils/run_fl_local.sh

fl4health/

The core components of the library are in the fl4health/ folder. This is where you will find nearly all of the code associated with the FL engine and implementations of various FL clients, servers, aggregation strategies, and other core components of FL. If you need to make custom additions, such as adding a metric, implementing your own strategy, or including other custom functionality, it might fit properly here, but it can likely be folded into code that you're writing to support your experiments instead.

If you're interested in understanding what's happening under the hood or debugging certain failures, you'll likely be led into the various modules therein.

research/

Generally, this folder will not be a point of emphasis for the bootcamp. This folder houses some of the group's own research on new and emerging ideas in FL. It is mainly meant to house experimentation and tinkering code that doesn't necessarily fit into the core library at present.

tests/

This folder houses our unit, integration, and smoke tests meant to ensure code correctness associated with our implementations. There may be some value in seeing how certain tests are run for different functions in understanding the mechanics of various implementations. However, this isn't an area of the repository that is likely to be of significant interest to participants.

Example Categorization

In this section, the examples will be roughly grouped by where they most fit within the structure of the lectures given during the Lab/Learn phase of the bootcamp. As a reminder, these categories are

  • Introduction to FL
  • Data Heterogeneity and Global Models
  • Personal(ized) Federated Learning
  • Beyond Better Optimization in Federated Learning

There is also an Other category for the remaining examples, which fall beyond the scope of the material that could be covered in the lectures given.

Introduction to FL

  • examples/basic_example
  • examples/fedopt_example
  • examples/ensemble_example
  • examples/docker_basic_example
  • examples/nnunet_example (Integration with nnUnet, quite tricky to work with)

Data Heterogeneity and Global Models

  • examples/fedprox_example
  • examples/scaffold_example
  • examples/moon_example
  • examples/feddg_ga_example

Personal(ized) Federated Learning

  • examples/fl_plus_local_ft_example
  • examples/fedper_example
  • examples/fedrep_example
  • examples/apfl_example
  • examples/fenda_example
  • examples/ditto_example
  • examples/mr_mtl_example
  • examples/fenda_ditto_example
  • examples/perfcl_example
  • examples/fedbn_example
  • examples/fedpm_example
  • examples/dynamic_layer_exchange_example
  • examples/sparse_tensor_partial_exchange_example

Beyond Better Optimization in Federated Learning

  • examples/fedpca_examples
  • examples/feature_alignment_example
  • examples/ae_examples/cvae_dim_example
  • examples/ae_examples/cvae_examples
  • examples/ae_examples/fedprox_vae_example
  • examples/dp_fed_examples/client_level_dp
  • examples/dp_fed_examples/client_level_dp_weighted
  • examples/dp_fed_examples/instance_level_dp
  • examples/dp_scaffold_example

Other

  • examples/model_merge_example
  • examples/warm_up_example/fedavg_warm_up
  • examples/warm_up_example/warmed_up_fedprox
  • examples/warm_up_example/warmed_up_fenda
  • examples/fedsimclr_example
  • examples/flash_example

Common Issues and Troubleshooting

Because FL relies on communication between distributed processes, even if they are simulated on the same machine, things can go a bit haywire if the communication orchestration gets off track. In this document, we'll try to list a few of the common issues one might run into when working with the library and running experiments.

Server and Clients Stuck and Doing Nothing

If this is happening, there are several common causes of the hanging processes.

Not Enough Clients Have Started

A critical parameter in the configuration files is n_clients. See, for example, examples/basic_example/config.yaml. In many of our examples, this parameter is used to set the min_fit_clients and min_evaluate_clients for the strategy objects. See, for example, examples/basic_example/server.py. This tells the server that it should wait for at least n_clients before beginning federated learning.

If you have only started 3 clients, but n_clients: 4, the server (and the existing clients) will wait until at least one more client has reported into the server before starting.
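
As a concrete illustration, with n_clients: 4 in the config you would launch four client processes, for example with a loop like the one below (the module path is an assumption; check the example's README for the actual client invocation):

```bash
# Start four clients in the background so the server's n_clients: 4 threshold is met
for i in 1 2 3 4; do
    nohup python -m examples.basic_example.client > "client_${i}.log" 2>&1 &
done
```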

Ghost or Orphaned Processes Remain Running

The FL4Health library relies on the communication layer provided by Flower to orchestrate information exchange between the server and client processes. While this layer is generally robust, it can run aground in certain scenarios. If FL concludes cleanly, the server and client processes are shut down automatically. We also have functionality that terminates such processes when the server receives an exception (or multiple exceptions) from participating clients, provided accept_failures=False is set for the server class.

However, in certain scenarios, such as stopping a process with Ctrl+C or a failure before clients have registered with the server, processes may be left running in the background. This is especially true if you're launching processes with nohup, as is done in the examples/utils/run_fl_local.sh script. Because these orphaned processes will still be listening on the local IP and a specified port, they can interfere with the communication of new processes that you start with the same IP and port specifications.

To alleviate this, you need to terminate these running processes before starting any new runs. The easiest way to do this is through top/htop via the terminal on Mac/Linux machines. An analogous process should be followed on Windows machines to shut such processes down.
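
If you prefer one-liners to an interactive process viewer, the standard commands below can find and terminate leftover processes; the match pattern is an assumption, so adjust it to whatever you used to launch the server and clients:

```bash
# List lingering processes launched from the examples
ps aux | grep "examples" | grep -v grep

# Terminate any process whose command line matches the pattern
pkill -f "examples.basic_example"
```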

Scary Warnings On Startup

On starting up the server and client processes, various warnings that look a bit scary sometimes appear at the front of the log files. An example of one of these warnings, seen when running locally on a CPU, appears below.

2024-11-29 08:54:04.123569: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized
to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler
flags.
/usr/local/anaconda3/envs/fl4health/lib/python3.10/site-packages/threadpoolctl.py:1214: RuntimeWarning:
Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

While the above warnings might appear problematic, they are often harmless. They pop out from various libraries leveraged under the hood to warn users of issues that might arise under certain conditions or to offer them a chance to install optional pieces of software. For example, the first output is saying that performance on the CPU could be improved with the right TensorFlow compilation. However, it isn't necessary to run the code properly.