WIP: [ci] move CI container images to GHCR, workflows to this repo #7109
Conversation
@jameslamb I can delete the lightgbm.azurecr.io registry once it's not needed anymore.
```sh
else # in manylinux image
    sudo yum update -y
    sudo yum install -y \
        clinfo \
```
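For context on why `clinfo` is installed here: it enumerates the OpenCL platforms and devices visible inside the container. As a hedged sketch (this helper is hypothetical, not part of this PR, and assumes clinfo's default `Number of devices` output format), a CI step could fail fast when the image exposes no OpenCL device:

```python
# Hypothetical CI sanity check, not part of this PR: run `clinfo` and fail
# fast if no OpenCL devices are visible inside the container.
import re
import subprocess


def assert_opencl_device_visible() -> None:
    out = subprocess.run(["clinfo"], capture_output=True, text=True, check=True).stdout
    # clinfo prints "Number of devices    N" per platform; require a non-zero count.
    counts = [int(n) for n in re.findall(r"Number of devices\s+(\d+)", out)]
    if not any(counts):
        raise RuntimeError("clinfo reported 0 OpenCL devices in this image")


assert_opencl_device_visible()
```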
Picking a somewhat-arbitrary place to start a thread.
Right now, the images are building successfully, but Python tests with `device="gpu"` are all failing.
The gpu source job (where LightGBM's default device is set to "gpu") has 238 failures like this:
```text
lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) at /__w/LightGBM/LightGBM/src/treelearner/serial_tree_learner.cpp, line 869 .
```

At a glance, these look like #3679.
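For local debugging, the failure pattern should be reproducible with plain training under `device="gpu"`. A minimal sketch, assuming a GPU-enabled LightGBM build and scikit-learn available (this mirrors what the failing tests do; it is not one of the actual test cases):

```python
# Minimal repro sketch for the gpu-source failures (assumes a GPU-enabled
# LightGBM build): ordinary binary-classification training on device="gpu".
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
params = {"objective": "binary", "num_leaves": 31, "verbosity": -1, "device": "gpu"}
# On the broken images this raises:
#   LightGBMError: Check failed: (best_split_info.right_count) > (0) ...
bst = lgb.train(params, lgb.Dataset(X, y), num_boost_round=10)
```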
The bdist_wheel jobs (which just run a single test checking that OpenCL support was compiled in successfully) on both x86_64 and aarch64 are failing like this:
```text
____________________________ test_cpu_and_gpu_work _____________________________

    @pytest.mark.skipif(
        os.environ.get("LIGHTGBM_TEST_DUAL_CPU_GPU", "0") != "1",
        reason="Set LIGHTGBM_TEST_DUAL_CPU_GPU=1 to test using CPU and GPU training from the same package.",
    )
    def test_cpu_and_gpu_work():
        # If compiled appropriately, the same installation will support both GPU and CPU.
        X, y = load_breast_cancer(return_X_y=True)
        data = lgb.Dataset(X, y)
        params_cpu = {"verbosity": -1, "num_leaves": 31, "objective": "binary", "device": "cpu"}
        cpu_bst = lgb.train(params_cpu, data, num_boost_round=10)
        cpu_score = log_loss(y, cpu_bst.predict(X))
        params_gpu = params_cpu.copy()
        params_gpu["device"] = "gpu"
        # Double-precision floats are only supported on x86_64 with PoCL
        params_gpu["gpu_use_dp"] = platform.machine() == "x86_64"
>       gpu_bst = lgb.train(params_gpu, data, num_boost_round=10)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/python_package_test/test_dual.py:32:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/root/miniforge/envs/test-env/lib/python3.13/site-packages/lightgbm/engine.py:297: in train
    booster = Booster(params=params, train_set=train_set)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/root/miniforge/envs/test-env/lib/python3.13/site-packages/lightgbm/basic.py:3615: in __init__
    _safe_call(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

ret = -1

    def _safe_call(ret: int) -> None:
        """Check the return value from C API call.

        Parameters
        ----------
        ret : int
            The return value from C API calls.
        """
        if ret != 0:
>           raise LightGBMError(_LIB.LGBM_GetLastError().decode("utf-8"))
E           lightgbm.basic.LightGBMError: No OpenCL device found

/root/miniforge/envs/test-env/lib/python3.13/site-packages/lightgbm/basic.py:310: LightGBMError
```
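One quick way to see what the OpenCL runtime itself reports, independently of LightGBM, is to enumerate platforms and devices directly. A diagnostic sketch assuming `pyopencl` is available in the image (it is not a LightGBM dependency):

```python
# Diagnostic sketch (assumes pyopencl is installed): list every OpenCL
# platform/device the ICD loader can see. Note that get_platforms() itself
# can raise if no ICD is registered at all.
import pyopencl as cl

for p in cl.get_platforms():
    print(p.name, p.version)
    for d in p.get_devices():
        print("  ", d.name, cl.device_type.to_string(d.type))
```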
Ideas I'm looking into:

- maybe we need to update to a newer Boost to work with a new PoCL?
- maybe we need to update the OpenCL kernels in https://github.com/microsoft/LightGBM/blob/master/src/treelearner/ocl to match more platforms
I'm going to focus on the gpu source builds first, because those don't rely on anything in https://github.com/microsoft/LightGBM/blob/master/cmake/IntegratedOpenCL.cmake and so should be a more minimal way to investigate this.
Noticing that the CI job running with an NVIDIA GPU is working: https://github.com/microsoft/LightGBM/actions/runs/20581695041/job/59110391374?pr=7109
So I guess it's just that these jobs are no longer successfully targeting the host CPUs on the GitHub runners? I'll look into that.
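If the runtime turns out to be fine and this is a selection problem, LightGBM's `gpu_platform_id` / `gpu_device_id` parameters can pin training to a specific OpenCL platform and device. A minimal sketch; the IDs below are placeholders that depend on the platform ordering `clinfo` reports on the runner:

```python
# Sketch: pin LightGBM to an explicit OpenCL platform/device instead of its
# default selection. The IDs are placeholders; pick them from clinfo output.
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
params = {
    "objective": "binary",
    "device": "gpu",
    "gpu_platform_id": 0,  # placeholder: e.g. the PoCL platform index
    "gpu_device_id": 0,    # placeholder: e.g. the host CPU device exposed by PoCL
}
bst = lgb.train(params, lgb.Dataset(X, y), num_boost_round=10)
```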
Fixes #5596
Fixes #7011
Moves Dockerfiles and CI pipelines for the container images used in Linux CI into this repo. New pipelines here will publish CI images to GitHub's container registry.
Other changes:

- builds `clang` from source 😁
- raises the glibc floor for `aarch64` Linux wheels up to GLIBC 2.28 (`manylinux_2_28`)

Notes for Reviewers
Post-merge cleanup
After this is merged and once we feel like it's working well, all of the following could be cleaned up: