[Auto-Recovery] Add crash recovery script for unrecoverable CUDA errors #10855
Workflow: test.yml (on: pull_request)
load-matrix: 3s
test-notebooks-cu128-py3.12-pytorch-2.9-a10g: 5m 46s
Matrix: test
Annotations: 7 errors and 2 warnings

Errors (7):
test-cu130-py3.12-pytorch-nightly-cute-b200: Process completed with exit code 1.
test-tpu-py3.12-pytorch-nightly-pallas-tpu: Process completed with exit code 1.
test-rocm7.0-py3.12-pytorch-nightly-triton-mi325x: Process completed with exit code 1.
test-cu128-py3.10-pytorch-2.9-triton-a10g: Process completed with exit code 1.
test-cu128-py3.12-pytorch-2.9-triton-a10g-dtype-asserts: Process completed with exit code 1.
test-cu128-py3.12-pytorch-nightly-triton-a10g-ref-eager: Process completed with exit code 1.
test-cu128-py3.12-pytorch-nightly-triton-h100: Process completed with exit code 1.

Warnings (2):
test-cu130-py3.12-pytorch-2.9-triton-b200: Failed to save: Unable to reserve cache with key setup-uv-2-x86_64-unknown-linux-gnu-ubuntu-24.04-3.12-pruned-d6edf83b7493c8b7817ae65dc4829355e0bf43260f2a41b35cbcbe7cae93ccec, another job may be creating this cache.
test-cu128-py3.12-pytorch-nightly-triton-h100-distributed: Failed to save: Unable to reserve cache with key setup-uv-2-x86_64-unknown-linux-gnu-ubuntu-24.04-3.12-pruned-d6edf83b7493c8b7817ae65dc4829355e0bf43260f2a41b35cbcbe7cae93ccec, another job may be creating this cache.