When using accelerate FSDP, from_pretrained loads wrong weights for CLIPVisionModel on one process, and training produces NaN

I am trying to use accelerate FSDP on 2 A40 GPUs, but training runs with NaN loss and -inf weights.

By debugging the code, I found that the CLIPVisionModel weights are wrong (this model is loaded at the very start of __init__ and is not the model being trained).

On one process the CLIPVisionModel weights are correct:

tensor([0.3311, 0.0032, 0.1610, ..., 2.1922, 0.0050, 0.0039])

On the other process the CLIPVisionModel weights are wrong:

tensor([-1.9921e-04, 4.5673e-41, -1.9921e-04, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00])

The wrong CLIPVisionModel weights cause the hidden states to become inf or nan, and the training loss is nan.
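
To make the comparison reproducible, here is a minimal sketch of the per-rank check (weight_checksum is my own helper name, and model is assumed to be the instantiated SLlamaModel; neither is part of the original training code):

import torch
import torch.distributed as dist

def weight_checksum(module: torch.nn.Module) -> float:
    # Sum of absolute values over all parameters; a corrupt copy like the
    # one shown above produces a visibly different number on that rank.
    with torch.no_grad():
        return sum(p.float().abs().sum().item() for p in module.parameters())

rank = dist.get_rank() if dist.is_initialized() else 0
print(f"[rank {rank}] vision_tower checksum: {weight_checksum(model.vision_tower):.4f}")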

This is my loading code:

import torch.nn as nn
from transformers import CLIPVisionModel, LlamaConfig, LlamaModel

class SLlamaModel(LlamaModel):
    config_class = SConfig  # custom config class, defined elsewhere

    def __init__(self, config: LlamaConfig, mm_vision_tower=None, mm_hidden_size=None):
        super(SLlamaModel, self).__init__(config)

        if hasattr(config, "mm_vision_tower"):
            # HACK: for FSDP
            self.vision_tower = CLIPVisionModel.from_pretrained(config.mm_vision_tower)

        if hasattr(config, "use_mm_proj"):
            self.mm_projector = nn.Linear(config.mm_hidden_size, config.hidden_size)

Here config.mm_vision_tower is the local path to clip-vit-large-patch14, so the loading call is:

self.vision_tower = CLIPVisionModel.from_pretrained(config.mm_vision_tower)
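
To confirm that the divergence really comes from from_pretrained itself, here is a hedged sketch of a check that could be dropped in right after that call (assert_matches_rank0 is my own helper, not part of the original code; it broadcasts rank 0's copy of every parameter and compares):

import torch
import torch.distributed as dist

def assert_matches_rank0(module: torch.nn.Module, name: str = "vision_tower"):
    # No-op outside a distributed run.
    if not (dist.is_available() and dist.is_initialized()):
        return
    # NCCL can only broadcast CUDA tensors, so move each copy to the GPU first.
    device = (torch.device("cuda", torch.cuda.current_device())
              if torch.cuda.is_available() else torch.device("cpu"))
    for pname, p in module.named_parameters():
        ref = p.detach().to(device).clone()
        dist.broadcast(ref, src=0)  # overwrite ref with rank 0's copy
        if not torch.allclose(p.detach().to(device), ref):
            raise RuntimeError(f"rank {dist.get_rank()}: {name}.{pname} differs from rank 0")

# usage, right after the from_pretrained call:
# assert_matches_rank0(self.vision_tower)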

Here is my FSDP config:

  • compute_environment: LOCAL_MACHINE

  • distributed_type: FSDP

  • downcast_bf16: 'no'

  • fsdp_config:
      fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
      fsdp_backward_prefetch_policy: BACKWARD_PRE
      fsdp_forward_prefetch: true
      fsdp_offload_params: true
      fsdp_sharding_strategy: 1
      fsdp_state_dict_type: FULL_STATE_DICT
      fsdp_sync_module_states: true
      fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
      fsdp_use_orig_params: true

  • machine_rank: 0

  • main_training_function: main

  • mixed_precision: bf16

  • num_machines: 1

  • num_processes: 2

  • rdzv_backend: static

  • same_network: true

  • tpu_env: []

  • tpu_use_cluster: false

  • tpu_use_sudo: false

  • use_cpu: false

Two A40 GPUs (48 GB VRAM each). Environment:

  • Accelerate version: 0.21.0
  • Platform: Linux-5.4.0-90-generic-x86_64-with-glibc2.31
  • Python version: 3.10.12
  • Numpy version: 1.25.1
  • PyTorch version (GPU?): 2.0.1+cu117 (False)
  • PyTorch XPU available: False
  • PyTorch NPU available: False
  • System RAM: 755.74 GB

I debugged the whole code and found that the weights are already wrong at the moment from_pretrained loads them, so the error does not come from any later code. When the debugger reached the loading line, I also ran test = CLIPVisionModel.from_pretrained("model path") by hand and still got wrong weights.
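
A standalone script along these lines should isolate the loading step from the rest of the training code (a sketch; run with accelerate launch repro.py, and the model path is a placeholder):

# repro.py
from accelerate import Accelerator
from transformers import CLIPVisionModel

accelerator = Accelerator()
model = CLIPVisionModel.from_pretrained("model path")  # placeholder path
first = next(model.parameters()).detach().flatten()[:6]
print(f"[rank {accelerator.process_index}] first weights: {first}")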

1 Answer

Answered by dadaamin:

I think I ran into the same issue. Although I still don't understand how this happens, upgrading transformers to version 4.37.2 solved it for me.
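
If you want to make sure the fix stays in place, a small runtime guard might help (a sketch; the 4.37.2 floor comes from this answer):

# Fail early if an older transformers is installed.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.37.2"), \
    "upgrade transformers, e.g. pip install -U 'transformers>=4.37.2'"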