fl4health.checkpointing.server_module module¶
- class AdaptiveConstraintServerCheckpointAndStateModule(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
PackingServerCheckpointAndAndStateModule
- __init__(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle FL flows with adaptive constraints, where the server and client communicate a loss weight parameter in addition to the model weights. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
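A minimal construction sketch (not confirmed by this page): since the packing exchanger needed for the extra loss-weight parameter appears to be configured by the module itself, only the model and checkpointers are passed. The checkpointer helpers, their import paths, and constructor arguments below are assumptions to be verified against your fl4health version.

```python
from pathlib import Path

import torch.nn as nn

from fl4health.checkpointing.server_module import AdaptiveConstraintServerCheckpointAndStateModule
# Assumed helper locations and signatures; verify against your fl4health version.
from fl4health.checkpointing.checkpointer import BestLossTorchModuleCheckpointer, PerRoundStateCheckpointer

adaptive_module = AdaptiveConstraintServerCheckpointAndStateModule(
    model=nn.Linear(10, 2),  # architecture the aggregated server parameters will be routed into
    model_checkpointers=BestLossTorchModuleCheckpointer("checkpoints/", "best_server_model.pkl"),
    state_checkpointer=PerRoundStateCheckpointer(Path("state/")),
)
```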
- class BaseServerCheckpointAndStateModule(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
object
- __init__(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle basic model and state checkpointing on the server-side of an FL process. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
parameter_exchanger (ExchangerType | None, optional) – This will facilitate routing the server parameters into the right components of the provided model architecture. Note that this exchanger and the model must match those used for training and parameter exchange to ensure parameters are routed to the right places. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
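A minimal wiring sketch for the base module. The parameter exchanger and checkpointer helpers, along with their import paths and constructor arguments, are assumptions to be checked against your fl4health version.

```python
from pathlib import Path

import torch.nn as nn

from fl4health.checkpointing.server_module import BaseServerCheckpointAndStateModule
# Assumed helper locations and signatures; verify against your fl4health version.
from fl4health.checkpointing.checkpointer import BestLossTorchModuleCheckpointer, PerRoundStateCheckpointer
from fl4health.parameter_exchange.full_exchanger import FullParameterExchanger

model = nn.Linear(10, 2)  # architecture that the raw server parameters will be routed into

checkpoint_module = BaseServerCheckpointAndStateModule(
    model=model,
    parameter_exchanger=FullParameterExchanger(),  # must match the exchanger used during training
    model_checkpointers=BestLossTorchModuleCheckpointer("checkpoints/", "best_server_model.pkl"),
    state_checkpointer=PerRoundStateCheckpointer(Path("state/")),
)
```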
- maybe_checkpoint(server_parameters, loss, metrics)[source]¶
If model checkpointers are defined in this class, we hydrate the model for checkpointing with the server parameters and call the maybe-checkpoint method on each of the checkpointers, each of which decides whether to checkpoint based on the model metrics or loss and its own definition.
- Parameters:
server_parameters (Parameters) – Parameters held by the server that should be injected into the model
loss (float) – The aggregated loss value obtained by the current aggregated server model. Potentially used by checkpointer to decide whether to checkpoint the model.
metrics (dict[str, Scalar]) – The aggregated metrics obtained by the aggregated server model. Potentially used by checkpointer to decide whether to checkpoint the model.
- Return type:
None
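Continuing the construction sketch above, a typical invocation after aggregation might look like the following; the loss and metric values are purely illustrative, and the aggregated parameters are built here from the model's own weights only for demonstration.

```python
from flwr.common import ndarrays_to_parameters

# Aggregated Flower Parameters held by the server (illustrative construction).
aggregated_parameters = ndarrays_to_parameters(
    [tensor.cpu().numpy() for tensor in model.state_dict().values()]
)
aggregated_loss = 0.42                       # aggregated validation loss from federated evaluation
aggregated_metrics = {"val_accuracy": 0.91}  # aggregated metrics (dict[str, Scalar])

# Each configured model checkpointer independently decides whether to save the hydrated model.
checkpoint_module.maybe_checkpoint(aggregated_parameters, aggregated_loss, aggregated_metrics)
```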
- maybe_load_state(state_checkpoint_name)[source]¶
This function facilitates loading of any pre-existing state (with the name state_checkpoint_name) in the directory of the state_checkpointer. If the state already exists at the proper path, the state is loaded and returned. If it doesn’t exist, we return None.
- Parameters:
state_checkpoint_name (str) – Name of the state checkpoint file. The checkpointer itself will have a directory from which state will be loaded (if it exists).
- Raises:
ValueError – Throws an error if this function is called, but no state checkpointer has been provided
- Returns:
- If the state checkpoint exists at the proper path and is loaded correctly, this dictionary carries that state. Otherwise, None is returned (or an exception is thrown).
- Return type:
dict[str, Any] | None
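A sketch of the resume-on-restart pattern this method supports. The "current_round" key is an illustrative assumption about what was previously stored via other_state in save_state, while the "model" key follows from the save_state documentation below.

```python
# On server startup, try to resume from a previous run (continuing the sketch above).
existing_state = checkpoint_module.maybe_load_state("server_state.pt")
if existing_state is not None:
    resumed_model = existing_state["model"]  # the model being trained is always preserved (see save_state)
    starting_round = existing_state.get("current_round", 0)  # assumed key saved earlier via other_state
else:
    starting_round = 0  # no prior state on disk; start FL from the beginning
```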
- save_state(state_checkpoint_name, server_parameters, other_state)[source]¶
This function is meant to facilitate saving the state required to restart an FL process on the server side. By default, this function will always at least preserve the model being trained. However, it may be desirable to save additional information, such as the current server round. The other_state dictionary may be provided to preserve this additional state.
NOTE: This function will throw an error if other_state contains a key named 'model', as the model being trained is automatically saved under that key.
- Parameters:
state_checkpoint_name (str) – Name of the state checkpoint file. The checkpointer itself will have a directory to which state will be saved.
server_parameters (Parameters) – As with model checkpointing, these are the aggregated Parameters stored by the server representing the model state. They are mapped to a torch model architecture via the _hydrate_model_for_checkpointing function.
other_state (dict[str, Any]) – Any additional state (such as current server round) to be checkpointed in order to allow FL to restart from where it left off.
- Raises:
ValueError – Throws an error if other_state already has a key called ‘model’
ValueError – Throws an error if this function is called, but no state checkpointer has been provided
- Return type:
None
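Continuing the same sketch, state might be saved at the end of each round as follows. Here, "current_round" is an illustrative piece of extra state; the model is stored automatically, so other_state must not use the 'model' key.

```python
current_round = 3  # illustrative: the round that just finished

checkpoint_module.save_state(
    state_checkpoint_name="server_state.pt",
    server_parameters=aggregated_parameters,       # aggregated Parameters held by the server
    other_state={"current_round": current_round},  # must NOT contain a "model" key
)
```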
- class ClippingBitServerCheckpointAndStateModule(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
PackingServerCheckpointAndAndStateModule
- __init__(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle FL flows with clipping bits being passed to the server along with the model weights. This is used for DP-FL with adaptive clipping. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
- class DpScaffoldServerCheckpointAndStateModule(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
ScaffoldServerCheckpointAndStateModule
- __init__(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle DP SCAFFOLD model and state checkpointing on the server-side of an FL process. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
- class LayerNamesServerCheckpointAndStateModule(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
PackingServerCheckpointAndAndStateModule
- __init__(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle FL flows with layer names being passed to the server along with the model weights. This is used for adaptive layer exchange FL. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
- class NnUnetServerCheckpointAndStateModule(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
BaseServerCheckpointAndStateModule
- __init__(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to be used with the NnUnetServer class to handle model and state checkpointing on the server-side of an FL process. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
This implementation differs from the base class in that the federated NnUnet server only initializes its model after an initial communication phase with the clients. As such, the model is not necessarily available upon initialization and may instead be set later.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None. NOTE: For NnUnet, this need not be set upon creation, as the model architecture may only be known later
parameter_exchanger (FullParameterExchangerWithPacking | None, optional) – This will facilitate routing the server parameters into the right components of the provided model architecture. Note that this exchanger and the model must match those used for training and parameter exchange to ensure parameters are routed to the right places. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
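Since the federated nnU-Net architecture is only known after the initial communication phase, a sketch of the deferred-model pattern might look as follows; the `model` attribute assignment is an assumption about the API, and `nnunet_model` is a hypothetical variable.

```python
from fl4health.checkpointing.server_module import NnUnetServerCheckpointAndStateModule

# Constructed without a model, since the architecture is not yet known.
nnunet_module = NnUnetServerCheckpointAndStateModule(model=None)

# Later, once the initial communication phase with the clients has determined the architecture:
# nnunet_module.model = nnunet_model  # assumed attribute name; nnunet_model is hypothetical
```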
- class OpacusServerCheckpointAndStateModule(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
BaseServerCheckpointAndStateModule
- __init__(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle FL flows with Opacus models where special treatment by the checkpointers is required. This module simply ensures the checkpointers are of the proper type before proceeding. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
parameter_exchanger (FullParameterExchangerWithPacking | None, optional) – This will facilitate routing the server parameters into the right components of the provided model architecture. Note that this exchanger and the model must match those used for training and parameter exchange to ensure parameters are routed to the right places. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
- class PackingServerCheckpointAndAndStateModule(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
BaseServerCheckpointAndStateModule
- __init__(model=None, parameter_exchanger=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to be a base class for any server-side checkpointing module that relies on unpacking of parameters to hydrate models for checkpointing. The specifics of the unpacking are handled by the packer within the parameter exchanger, as specified by the child classes. NOTE: This class ASSUMES full parameter exchange unpacking. If more complex unpacking/parameter exchange is used, this is not the right parent class.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
parameter_exchanger (FullParameterExchangerWithPacking | None, optional) – This will facilitate routing the server parameters into the right components of the provided model architecture. In particular, it should also handle any necessary unpacking of the parameters. Note that this exchanger and the model must match those used for training and parameter exchange to ensure parameters are routed to the right places. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
- class ScaffoldServerCheckpointAndStateModule(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
PackingServerCheckpointAndAndStateModule
- __init__(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle SCAFFOLD model and state checkpointing on the server-side of an FL process. Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.
- class SparseCooServerCheckpointAndStateModule(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
Bases:
PackingServerCheckpointAndAndStateModule
- __init__(model=None, model_checkpointers=None, state_checkpointer=None)[source]¶
This module is meant to handle FL flows with parameters encoded in a sparse COO format being passed to the server as the model weights. This is used for adaptive parameter-wise exchange (i.e. unstructured subsets of parameters). Unlike the module on the client side, this module has no concept of pre- or post-aggregation checkpointing. It only considers checkpointing the global server model after aggregation, perhaps based on validation statistics retrieved on the client side by running a federated evaluation step. Multiple model checkpointers may be used. For state checkpointing, which saves the state of the entire server-side FL process to help with FL restarts, we allow only a single checkpointer responsible for saving the state after each fit and eval round of FL.
- Parameters:
model (nn.Module | None, optional) – Model architecture to be saved. The module will use this architecture to hold the server parameters and facilitate checkpointing with the help of the parameter exchanger. Recall that servers only have parameters rather than torch models. So we need to know where to route these parameters to allow for real models to be saved. Defaults to None.
model_checkpointers (ModelCheckpointers, optional) – If defined, this checkpointer (or sequence of checkpointers) is used to checkpoint models based on their defined scoring function. Defaults to None.
state_checkpointer (PerRoundStateCheckpointer | None, optional) – If defined, this checkpointer will be used to preserve FL training state to facilitate restarting training if interrupted. Generally, this checkpointer will save much more than just the model being trained. Defaults to None.