fl4health.checkpointing.server_module module¶
- class AdaptiveConstraintServerCheckpointAndStateModule(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
PackingServerCheckpointAndAndStateModule
- __init__(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle FL flows with adaptive constraints, where the server and client communicate a loss weight parameter in addition to the model weights. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
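A minimal construction sketch (not confirmed by this page): since the packing exchanger needed for the extra loss-weight parameter appears to be configured by the module itself, only the model and checkpointers are passed. The checkpointer helpers, their import paths, and constructor arguments below are assumptions to be verified against your fl4health version.

```python
from pathlib import Path

import torch.nn as nn

from fl4health.checkpointing.server_module import AdaptiveConstraintServerCheckpointAndStateModule
# Assumed helper locations and signatures; verify against your fl4health version.
from fl4health.checkpointing.checkpointer import BestLossTorchModuleCheckpointer, PerRoundStateCheckpointer

adaptive_module = AdaptiveConstraintServerCheckpointAndStateModule(
    model=nn.Linear(10, 2),  # architecture the aggregated server parameters will be routed into
    model_checkpointers=BestLossTorchModuleCheckpointer("checkpoints/", "best_server_model.pkl"),
    state_checkpointer=PerRoundStateCheckpointer(Path("state/")),
)
```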
- class BaseServerCheckpointAndStateModule(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
object
- __init__(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle basic model and state checkpointing on the server-side of an FL process. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
parameter_exchanger (ExchangerType | None, optional) – This will facilitate routing the server parameters into the right components of the provided model architecture. Note that this exchanger and the model must match those used for training and parameter exchange to ensure parameters are routed to the right places. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
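A minimal wiring sketch for the base module. The parameter exchanger and checkpointer helpers, along with their import paths and constructor arguments, are assumptions to be checked against your fl4health version.

```python
from pathlib import Path

import torch.nn as nn

from fl4health.checkpointing.server_module import BaseServerCheckpointAndStateModule
# Assumed helper locations and signatures; verify against your fl4health version.
from fl4health.checkpointing.checkpointer import BestLossTorchModuleCheckpointer, PerRoundStateCheckpointer
from fl4health.parameter_exchange.full_exchanger import FullParameterExchanger

model = nn.Linear(10, 2)  # architecture that the raw server parameters will be routed into

checkpoint_module = BaseServerCheckpointAndStateModule(
    model=model,
    parameter_exchanger=FullParameterExchanger(),  # must match the exchanger used during training
    model_checkpointers=BestLossTorchModuleCheckpointer("checkpoints/", "best_server_model.pkl"),
    state_checkpointer=PerRoundStateCheckpointer(Path("state/")),
)
```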
- maybe_checkpoint(server_parameters, loss, metrics)[source]¶
If model checkpointers are defined in this class, we hydrate the model for checkpointing with the server parameters and call the maybe-checkpoint method on each of the checkpointers, each of which decides whether to checkpoint based on the model metrics or loss and its own definition.
- Parameters:
server_parameters (Parameters) – Parameters held by the server that should be injected into the model
loss (float) – The aggregated loss value obtained by the current aggregated server model. Potentially used by checkpointer to decide whether to checkpoint the model.
metrics (dict[str, Scalar]) – The aggregated metrics obtained by the aggregated server model. Potentially used by checkpointer to decide whether to checkpoint the model.
- Return type:
None
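Continuing the construction sketch above, a typical invocation after aggregation might look like the following; the loss and metric values are purely illustrative, and the aggregated parameters are built here from the model's own weights only for demonstration.

```python
from flwr.common import ndarrays_to_parameters

# Aggregated Flower Parameters held by the server (illustrative construction).
aggregated_parameters = ndarrays_to_parameters(
    [tensor.cpu().numpy() for tensor in model.state_dict().values()]
)
aggregated_loss = 0.42                       # aggregated validation loss from federated evaluation
aggregated_metrics = {"val_accuracy": 0.91}  # aggregated metrics (dict[str, Scalar])

# Each configured model checkpointer independently decides whether to save the hydrated model.
checkpoint_module.maybe_checkpoint(aggregated_parameters, aggregated_loss, aggregated_metrics)
```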
- maybe_load_state(state_checkpoint_name)[source]¶
This function facilitates loading of any pre-existing state (with the name state_checkpoint_name) in the directory of the state_checkpointer. If the state already exists at the proper path, the state is loaded and returned. If it doesn’t exist, we return None.
- Parameters:
state_checkpoint_name (str) – Name of the state checkpoint file. The checkpointer itself will have a directory from which state will be loaded (if it exists).
- Raises:
ValueError – Throws an error if this function is called, but no state checkpointer has been provided
- Returns:
- If the state checkpoint exists at the proper path and is loaded correctly, this dictionary carries that state. Otherwise, None is returned (or an exception is thrown).
- Return type:
dict[str, Any] | None
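A sketch of the resume-on-restart pattern this method supports. The "current_round" key is an illustrative assumption about what was previously stored via other_state in save_state, while the "model" key follows from the save_state documentation below.

```python
# On server startup, try to resume from a previous run (continuing the sketch above).
existing_state = checkpoint_module.maybe_load_state("server_state.pt")
if existing_state is not None:
    resumed_model = existing_state["model"]  # the model being trained is always preserved (see save_state)
    starting_round = existing_state.get("current_round", 0)  # assumed key saved earlier via other_state
else:
    starting_round = 0  # no prior state on disk; start FL from the beginning
```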
- save_state(state_checkpoint_name, server_parameters, other_state)[source]¶
This function is meant to facilitate saving the state required to restart an FL process on the server side. By default, this function will always at least preserve the model being trained. However, it may be desirable to save additional information, such as the current server round. The other_state dictionary may be provided to preserve this additional state.
NOTE: This function will throw an error if other_state contains a key named 'model', as the model being trained is automatically saved under that key.
- Parameters:
state_checkpoint_name (str) – Name of the state checkpoint file. The checkpointer itself will have a directory to which state will be saved.
server_parameters (Parameters) – As with model checkpointing, these are the aggregated Parameters stored by the server representing the model state. They are mapped to a torch model architecture via the _hydrate_model_for_checkpointing function.
other_state (dict[str, Any]) – Any additional state (such as current server round) to be checkpointed in order to allow FL to restart from where it left off.
- Raises:
ValueError – Throws an error if other_state already has a key called ‘model’
ValueError – Throws an error if this function is called, but no state checkpointer has been provided
- Return type:
None
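Continuing the same sketch, state might be saved at the end of each round as follows. Here, "current_round" is an illustrative piece of extra state; the model is stored automatically, so other_state must not use the 'model' key.

```python
current_round = 3  # illustrative: the round that just finished

checkpoint_module.save_state(
    state_checkpoint_name="server_state.pt",
    server_parameters=aggregated_parameters,       # aggregated Parameters held by the server
    other_state={"current_round": current_round},  # must NOT contain a "model" key
)
```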
- class ClippingBitServerCheckpointAndStateModule(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
PackingServerCheckpointAndAndStateModule
- __init__(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle FL flows with clipping bits being passed to the server along with the model weights. This is used for DP-FL with adaptive clipping. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
- class DpScaffoldServerCheckpointAndStateModule(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
ScaffoldServerCheckpointAndStateModule
- __init__(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle DP SCAFFOLD model and state checkpointing on the server-side of an FL process. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
- class LayerNamesServerCheckpointAndStateModule(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
PackingServerCheckpointAndAndStateModule
- __init__(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle FL flows with layer names being passed to the server along with the model weights. This is used for adaptive layer exchange FL. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
- class NnUnetServerCheckpointAndStateModule(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
BaseServerCheckpointAndStateModule
- __init__(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to be used with the NnUnetServer class to handle model and state checkpointing on the server-side of an FL process. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
This implementation differs from the base class in that the federated NnUnet server only initializes its model after an initial communication phase with the clients. As such, the model is not necessarily available upon initialization and may instead be set later.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None. NOTE: For NnUnet, this need not be set upon creation, as the model architecture may only be known later
parameter_exchanger (FullParameterExchangerWithPacking | None, optional) – This will facilitate routing the server parameters into the right components of the provided model architecture. Note that this exchanger and the model must match those used for training and parameter exchange to ensure parameters are routed to the right places. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
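Since the federated nnU-Net architecture is only known after the initial communication phase, a sketch of the deferred-model pattern might look as follows; the `model` attribute assignment is an assumption about the API, and `nnunet_model` is a hypothetical variable.

```python
from fl4health.checkpointing.server_module import NnUnetServerCheckpointAndStateModule

# Constructed without a model, since the architecture is not yet known.
nnunet_module = NnUnetServerCheckpointAndStateModule(model=None)

# Later, once the initial communication phase with the clients has determined the architecture:
# nnunet_module.model = nnunet_model  # assumed attribute name; nnunet_model is hypothetical
```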
- class OpacusServerCheckpointAndStateModule(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
BaseServerCheckpointAndStateModule
- __init__(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle FL flows with Opacus models where special treatment by the checkpointers is required. This module simply ensures the checkpointers are of the proper type before proceeding. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
parameter_exchanger (FullParameterExchangerWithPacking | None, optional) – This will facilitate routing the server parameters into the right components of the provided model architecture. Note that this exchanger and the model must match those used for training and parameter exchange to ensure parameters are routed to the right places. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
- class PackingServerCheckpointAndAndStateModule(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
BaseServerCheckpointAndStateModule
- __init__(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to be a base class for any server-side checkpointing module that relies on unpacking of parameters to hydrate models for checkpointing. The specifics of the unpacking are handled by the packer within the parameter exchanger, as specified by the child classes. NOTE: This class ASSUMES full parameter exchange unpacking. If more complex unpacking/parameter exchange is used, this is not the right parent class.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
parameter_exchanger (FullParameterExchangerWithPacking | None, optional) – This will facilitate routing the server parameters into the right components of the provided model architecture. In particular, it should also handle any necessary unpacking of the parameters. Note that this exchanger and the model must match those used for training and parameter exchange to ensure parameters are routed to the right places. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
- class ScaffoldServerCheckpointAndStateModule(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
PackingServerCheckpointAndAndStateModule
- __init__(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle SCAFFOLD model and state checkpointing on the server-side of an FL process. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
- class SparseCooServerCheckpointAndStateModule(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
PackingServerCheckpointAndAndStateModule
- __init__(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle FL flows with parameters encoded in a sparse COO format being passed to the server as the model weights. This is used for adaptive parameter-wise exchange (i.e. unstructured subsets of parameters). Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.