Training error with accelerator.backward(loss) #80

Description

@juxingyiwan

When I run the default training script for SD1.5, an error occurs at the line accelerator.backward(loss). The full traceback is as follows:

Traceback (most recent call last):
  File "/mnt/lustre/zhuguibo/dcy/BrushNet/examples/brushnet/train_brushnet.py", line 1403, in <module>
    main(args)
  File "/mnt/lustre/zhuguibo/dcy/BrushNet/examples/brushnet/train_brushnet.py", line 1303, in main
    accelerator.backward(loss)
  File "/mnt/lustre/zhuguibo/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/accelerator.py", line 2248, in backward
    loss.backward(**kwargs)
  File "/mnt/lustre/zhuguibo/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/mnt/lustre/zhuguibo/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: an illegal instruction was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Has anyone else encountered this issue?
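
Because CUDA errors are reported asynchronously, the backward call in the traceback above may not be the operation that actually failed. The following is a minimal diagnostic sketch, not part of train_brushnet.py: it sets the CUDA_LAUNCH_BLOCKING environment variable so that kernel launches run synchronously, and collects the version and device details that are usually needed to reproduce this kind of error. The helper name collect_env_report is hypothetical, and the sketch assumes it runs before anything else initializes CUDA.

```python
import os
import platform

# Assumption: this runs before CUDA is initialized. With synchronous kernel
# launches, the Python traceback points at the operation that actually failed.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def collect_env_report():
    """Return version and device details as a dict (hypothetical helper)."""
    report = {
        "python": platform.python_version(),
        "cuda_launch_blocking": os.environ.get("CUDA_LAUNCH_BLOCKING"),
    }
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment; report that and stop.
        report["torch"] = None
        return report

    report["torch"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # One entry per visible GPU: its name and compute capability.
        report["devices"] = [
            {
                "name": torch.cuda.get_device_name(i),
                "capability": torch.cuda.get_device_capability(i),
            }
            for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Running this in the same conda environment as the training script, then re-running training with CUDA_LAUNCH_BLOCKING=1 set, should give a traceback that is more likely to name the failing operation. Comparing the reported compute capability against the CUDA version the installed torch build targets is one thing worth checking, though it is not necessarily the cause here.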
