
save_state gets stuck when using the DeepSpeed backend while training train_text_to_image_lora #2606

@better629

Description


Describe the bug

When using the DeepSpeed backend, training runs fine but the process gets stuck in accelerator.save_state(save_path). With the MULTI_GPU backend, everything works.

The training command is:

accelerate launch train_text_to_image_lora.py \
    --pretrained_model_name_or_path="pretrain_models/stable-diffusion-v1-4/"  \
    --dataset_name="lambdalabs/pokemon-blip-captions"  \
    --output_dir="sd-pokemon-model-lora" \
    --resolution=512 \
    --gradient_accumulation_steps=1 \
    --checkpointing_steps=100 \
    --learning_rate=1e-4 \
    --lr_scheduler="constant" \
    --lr_warmup_steps=0 \
    --max_train_steps=500 \
    --validation_epochs=50 \
    --seed="0" \
    --checkpointing_steps 50 \
    --train_batch_size=1 \
    --use_8bit_adam \
    --enable_xformers_memory_efficient_attention
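
For reference, the hang happens at the checkpointing step inside the training loop. The sketch below is an approximation of train_text_to_image_lora.py, not the exact code; in particular, the is_main_process guard is my assumption about how the script reaches save_state.

import os

# Rough sketch of the checkpointing step inside the training loop
# (approximation of train_text_to_image_lora.py, not the exact code).
def maybe_save_checkpoint(accelerator, args, global_step, logger):
    if global_step % args.checkpointing_steps == 0:
        # Assumption: the script only calls save_state on the main process.
        if accelerator.is_main_process:
            save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
            # Works with MULTI_GPU; hangs here with the DeepSpeed backend.
            accelerator.save_state(save_path)
            logger.info(f"Saved state to {save_path}")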

Reproduction

MULTI_GPU backend config (xx/accelerate/default_config.yaml):

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: 1,2,3
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false

Logs:

03/08/2023 21:57:44 - INFO - __main__ - ***** Running training *****
03/08/2023 21:57:44 - INFO - __main__ -   Num examples = 833
03/08/2023 21:57:44 - INFO - __main__ -   Num Epochs = 2
03/08/2023 21:57:44 - INFO - __main__ -   Instantaneous batch size per device = 1
03/08/2023 21:57:44 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 3
03/08/2023 21:57:44 - INFO - __main__ -   Gradient Accumulation steps = 1
03/08/2023 21:57:44 - INFO - __main__ -   Total optimization steps = 500
Steps:  10%|████████▎                                                                          | 50/500 [00:11<01:31,  4.94it/s, lr=0.0001, step_loss=0.00245]03/08/2023 21:57:55 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-50
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Model weights saved in sd-pokemon-model-lora/checkpoint-50/pytorch_model.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora/checkpoint-50/optimizer.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora/checkpoint-50/scheduler.bin
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Gradient scaler state saved in sd-pokemon-model-lora/checkpoint-50/scaler.pt
03/08/2023 21:57:55 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora/checkpoint-50/random_states_0.pkl
03/08/2023 21:57:55 - INFO - __main__ - Saved state to sd-pokemon-model-lora/checkpoint-50
Steps:  20%|████████████████▌                                                                  | 100/500 [00:22<01:21,  4.92it/s, lr=0.0001, step_loss=0.0787]03/08/2023 21:58:06 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-100
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Model weights saved in sd-pokemon-model-lora/checkpoint-100/pytorch_model.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora/checkpoint-100/optimizer.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora/checkpoint-100/scheduler.bin
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Gradient scaler state saved in sd-pokemon-model-lora/checkpoint-100/scaler.pt
03/08/2023 21:58:06 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora/checkpoint-100/random_states_0.pkl
03/08/2023 21:58:06 - INFO - __main__ - Saved state to sd-pokemon-model-lora/checkpoint-100

DeepSpeed backend config (xx/accelerate/default_config.yaml):

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false

I had to comment out self._checkpoint_tag_validation(tag) in runtime/engine.py, otherwise it gets stuck at that call (see the note after the engine.py snippet below).
With that call commented out, the logs are:

03/08/2023 22:06:10 - INFO - __main__ - ***** Running training *****
03/08/2023 22:06:10 - INFO - __main__ -   Num examples = 833
03/08/2023 22:06:10 - INFO - __main__ -   Num Epochs = 2
03/08/2023 22:06:10 - INFO - __main__ -   Instantaneous batch size per device = 1
03/08/2023 22:06:10 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 3
03/08/2023 22:06:10 - INFO - __main__ -   Gradient Accumulation steps = 1
03/08/2023 22:06:10 - INFO - __main__ -   Total optimization steps = 500
Steps:  10%|████████▎                                                                          | 50/500 [00:11<01:36,  4.68it/s, lr=0.0001, step_loss=0.00255]03/08/2023 22:06:22 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora/checkpoint-50
03/08/2023 22:06:22 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[2023-03-08 22:06:22,219] [INFO] [logging.py:75:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is begin to save!
/home/deepwisdom/anaconda3/envs/wjl/lib/python3.10/site-packages/torch/nn/modules/module.py:1432: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-03-08 22:06:22,222] [INFO] [logging.py:75:log_dist] [Rank 0] Saving model checkpoint: sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt
[2023-03-08 22:06:22,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt...
[2023-03-08 22:06:22,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved sd-pokemon-model-lora/checkpoint-50/pytorch_model/mp_rank_00_model_states.pt.
...

It then gets stuck in deepspeed/runtime/engine.py:

# save_checkpoint
# https://github.com/microsoft/DeepSpeed/blob/v0.8.1/deepspeed/runtime/engine.py#LL3123C12-L3123C12

        if self.save_zero_checkpoint:
            self._create_zero_checkpoint_files(save_dir, tag)
            self._save_zero_checkpoint(save_dir, tag)
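
My reading of where this blocks (an assumption on my part, not confirmed): both _checkpoint_tag_validation and the ZeRO checkpoint path run collective operations, so DeepSpeed's save_checkpoint has to be entered by every rank. If accelerator.save_state is only called on the main process (which is fine for MULTI_GPU), rank 0 waits on the other ranks forever. A hedged workaround sketch, reusing the loop variables from the training script:

import os
from accelerate.utils import DistributedType

# Hypothetical workaround sketch (not an official fix): let every rank call
# save_state under DeepSpeed so its internal collectives can complete.
def save_checkpoint_all_ranks(accelerator, args, global_step, logger):
    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
    if accelerator.distributed_type == DistributedType.DEEPSPEED:
        # DeepSpeed's save_checkpoint is collective: all ranks must participate.
        accelerator.save_state(save_path)
    elif accelerator.is_main_process:
        accelerator.save_state(save_path)
    if accelerator.is_main_process:
        logger.info(f"Saved state to {save_path}")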

Logs

No response

System Info

Ubuntu 20.04
Nvidia RTX 3090
CUDA Version: 11.7
Torch: 1.13.1
Diffusers: 0.15.0.dev0
deepspeed: 0.8.1
xformers: 0.0.17.dev466
accelerate: 0.16.0

Labels: bug (Something isn't working), good first issue (Good for newcomers)
