
Fix GPU CI Test Failures: Migrating Tests, NCCL P2P Access Errors, and Test Fixture Issues#477

Merged
devpatelio merged 38 commits into main from
devpatel/skyrl-gpu_ci_updates
Oct 16, 2025

Conversation

@devpatelio
Collaborator

@devpatelio devpatelio commented Oct 14, 2025

Migrated 4 tests from tests/gpu/ to tests/gpu/gpu_ci/ and fixed critical environment configuration issues causing CI-only failures with NCCL peer-to-peer access errors.

Migrated Tests:

  • test_worker_offload.py
  • test_training_step.py
  • test_ppo_train.py
  • test_policy_local_engines_e2e.py

Primary Issue: NCCL P2P Access Errors in CI

Users reported that tests would pass locally on 4-GPU machines but fail in CI on 2-GPU machines with:

torch.distributed.DistBackendError: NCCL error
ncclUnhandledCudaError: Call to CUDA function failed.
Last error: Cuda failure 217 'peer access is not supported between these two devices'

Root Cause

When tests were moved to gpu_ci/, these tests did not explicitly use the ray_init_fixture parameter. This caused:

  1. Incorrect Fixture Inheritance: Tests inherited from parent tests/gpu/conftest.py instead of using tests/gpu/gpu_ci/conftest.py
  2. Wrong GPU Count Check: Parent fixture checked peer_access_supported(max_num_gpus_per_node=4) on 2-GPU CI machines
  3. Missing NCCL Flags: Due to the incorrect GPU count, P2P disable flags (NCCL_P2P_DISABLE=1, NCCL_SHM_DISABLE=1) weren't set
  4. Missing VLLM Environment Variables: The CI fixture was missing required VLLM env vars (VLLM_USE_V1, VLLM_ENABLE_V1_MULTIPROCESSING, VLLM_ALLOW_INSECURE_SERIALIZATION) because the env-var dict was being replaced rather than updated, overwriting them.
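A minimal sketch of why point 4 happens (the variable names and env-var values here are illustrative, not taken from the actual fixture): assigning a new dict to the name drops every key set earlier, while `.update()` merges into the existing dict.

```python
def build_env_vars(p2p_supported: bool, use_update: bool) -> dict:
    # Hypothetical stand-in for the fixture's runtime-env construction.
    env_vars = {
        "VLLM_USE_V1": "1",                        # values are placeholders
        "VLLM_ENABLE_V1_MULTIPROCESSING": "0",
        "VLLM_ALLOW_INSECURE_SERIALIZATION": "1",
    }
    nccl_flags = {"NCCL_P2P_DISABLE": "1", "NCCL_SHM_DISABLE": "1"}
    if not p2p_supported:
        if use_update:
            env_vars.update(nccl_flags)  # fixed: merge NCCL flags in
        else:
            env_vars = nccl_flags        # buggy: rebinding loses the VLLM vars
    return env_vars
```

With `use_update=False`, the returned dict contains only the two NCCL flags, which is exactly the "missing VLLM env vars" symptom described above.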

Locally (4 GPUs) vs CI Run Differences

Locally (4 GPUs):

  • 4-GPU dev environments typically support P2P between GPUs, or
  • The 4-GPU check correctly identified the lack of support and set the disable flags appropriately

In CI (2 GPUs):

  • 2-GPU CI machines don't support P2P
  • Checking for 4 GPUs on a 2-GPU machine gave incorrect results
  • NCCL therefore tried to use P2P without the disable flags

Changes Made

1. Fixed tests/gpu/gpu_ci/conftest.py

Corrected NCCL flag setting (using .update() instead of replacement):

if not peer_access_supported(max_num_gpus_per_node=2):  # Correct for 2-GPU CI
    env_vars.update({
        "NCCL_P2P_DISABLE": "1",
        "NCCL_SHM_DISABLE": "1",
    })

2. Updated All Moved Tests

Added explicit ray_init_fixture parameter to force usage of CI-specific fixture:

Before:

async def test_policy_training_step(cfg, packed, strategy):

After:

async def test_policy_training_step(ray_init_fixture, cfg, packed, strategy):

Applied to all test functions in:

  • test_training_step.py (2 tests)
  • test_ppo_train.py (3 tests)
  • test_worker_offload.py (4 tests)
  • test_policy_local_engines_e2e.py (1 test)

3. Additional Fixes

Registry State Leakage:

  • Removed registry reset functions that were causing state leakage between tests
  • Specifically affected PolicyLossRegistry and AdvantageEstimatorRegistry
  • test_policy_local_engines_e2e was resetting registries, breaking subsequent test_ppo_train tests
  • Created a safe repopulate_registries utility function as a backup
  • Set max_num_batched_tokens=32768 in the test_skyrl_gym_generator and test_engine_generation tests

When moving tests to gpu_ci/, always add ray_init_fixture as the first parameter so the test explicitly uses the CI fixture defined in the folder's conftest.py

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request focuses on migrating to a new GPU CI setup. The changes include adjusting script parameters for CI environments (e.g., reducing GPU count and epochs), adding a Modal integration for running commands in a cloud environment, and improving test fixtures and utility functions for better resource management and stability. My review highlights several areas for improvement, primarily focusing on security and robustness. I've pointed out critical security vulnerabilities related to the use of shell=True in subprocess calls with user-provided commands and suggested safer alternatives. I've also recommended more specific exception handling to avoid silently ignoring potentially important errors.

Comment on lines +145 to +153
process = subprocess.Popen(
    command,
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # Merge stderr into stdout
    text=True,
    bufsize=1,  # Line buffered
    universal_newlines=True,
)
Contributor


critical

Using shell=True with subprocess.Popen and a user-provided command string is a serious security vulnerability, as it can lead to command injection. An attacker could execute arbitrary commands on the system. To fix this, you should parse the command string into a list of arguments using shlex.split() and run it with shell=False. You will need to import shlex at the beginning of the run_script function.

    import shlex
    process = subprocess.Popen(
        shlex.split(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # Merge stderr into stdout
        text=True,
        bufsize=1,  # Line buffered
        universal_newlines=True,
    )
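As a quick illustration of the suggested fix (not code from the PR itself): `shlex.split` tokenizes a command string the way a POSIX shell would, so quoted arguments survive as single tokens, while shell metacharacters lose their special meaning once the command runs with `shell=False`.

```python
import shlex

cmd = 'python train.py --config "my config.yaml"'
tokens = shlex.split(cmd)
print(tokens)  # → ['python', 'train.py', '--config', 'my config.yaml']

# A would-be injection payload becomes an ordinary argument rather than a
# second command, since nothing interprets ';' when shell=False is used.
print(shlex.split('echo hi; rm -rf /'))  # → ['echo', 'hi;', 'rm', '-rf', '/']
```

One caveat: commands that rely on shell features (pipes, redirection, globbing) will no longer work after this change and would need to be restructured.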

Comment on lines +112 to +116
subprocess.run(
    f"cp -r {gym_src} {gym_dst}",
    shell=True,
    check=True,
)
Contributor


medium

Using shell=True with subprocess.run can be a security risk if any part of the command is derived from external input. While gym_src and gym_dst seem safe here, it's better to use safer alternatives like passing arguments as a list and setting shell=False.

            subprocess.run(
                ["cp", "-r", gym_src, gym_dst],
                check=True,
            )

Comment on lines +121 to +125
subprocess.run(
    "ray start --head",
    shell=True,
    check=True,
)
Contributor


medium

Using shell=True is generally discouraged for security and portability reasons, even for hardcoded commands. It's better to pass arguments as a list.

    subprocess.run(
        ["ray", "start", "--head"],
        check=True,
    )

Comment on lines +238 to +239
except Exception:
    pass
Contributor


medium

Catching a bare Exception and silently passing can hide important errors. It's better to catch more specific exceptions that you expect might occur during the sync_with_actor calls, such as ray.exceptions.RayActorError, and potentially log the exception for debugging purposes.

Comment on lines +39 to +40
except Exception:
    pass
Contributor


medium

Catching a bare Exception and passing silently can hide issues. It would be better to catch a more specific exception. For instance, dist.destroy_process_group() can raise a RuntimeError if called in an invalid state. Logging the exception would also be helpful for debugging.
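One way to act on this suggestion — a sketch only; the real call site, logger name, and error cases may differ. The distributed module is passed in as a parameter here so the sketch stays importable without torch:

```python
import logging

logger = logging.getLogger(__name__)

def safe_destroy_process_group(dist_module) -> None:
    """Tear down the process group, tolerating 'not initialized' states.

    dist_module is expected to look like torch.distributed; a RuntimeError
    is what destroy_process_group raises when no group is initialized.
    """
    try:
        dist_module.destroy_process_group()
    except RuntimeError as e:
        # Narrow except clause plus a log line, instead of a bare pass:
        # the failure is tolerated but no longer invisible.
        logger.warning("destroy_process_group skipped: %s", e)
```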

@devpatelio
Collaborator Author

/gemini review

@devpatelio devpatelio changed the title GPU CI Migration Migrate more tests to GPU CI Pipeline Oct 15, 2025
@devpatelio devpatelio changed the title Migrate more tests to GPU CI Pipeline Fix GPU CI Test Failures: Migrating Tests, NCCL P2P Access Errors, and Test Fixture Issues Oct 15, 2025
return advantages, returns


def repopulate_registries():
Member


Does this require all these try/except clauses? Why would there be ValueErrors on all the register() calls below? And why would set(PolicyLossRegistry.list_available()) throw an error?



@pytest.fixture
@pytest.fixture()
Member


I don't think that's correct. It should not have (), right?

return advantages, returns


def repopulate_registries():
Member


On second thought, a cleaner pattern would be for each Registry to expose a repopulate_registry() method that handles its own registered methods.

Then, this helper can be renamed repopulate_all_registries() and just calls repopulate_registry() on each registry.
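A sketch of that pattern — the registry class names come from this PR, but the base class, default entries, and registered callables are placeholders:

```python
class BaseRegistry:
    """Each registry owns its built-in defaults and can restore them."""

    _defaults: dict = {}  # subclasses fill in their built-in entries
    _items: dict = {}     # subclasses must shadow with their own dict

    @classmethod
    def register(cls, name, fn):
        cls._items[name] = fn

    @classmethod
    def list_available(cls):
        return sorted(cls._items)

    @classmethod
    def repopulate_registry(cls):
        # Idempotent: re-adds defaults without clobbering later additions.
        for name, fn in cls._defaults.items():
            cls._items.setdefault(name, fn)

class PolicyLossRegistry(BaseRegistry):
    _defaults = {"ppo": lambda: "ppo_loss"}  # placeholder entry
    _items = {}

class AdvantageEstimatorRegistry(BaseRegistry):
    _defaults = {"gae": lambda: "gae_estimator"}  # placeholder entry
    _items = {}

def repopulate_all_registries():
    # The helper just delegates; each registry knows its own defaults.
    for reg in (PolicyLossRegistry, AdvantageEstimatorRegistry):
        reg.repopulate_registry()
```

Because `setdefault` is used, calling `repopulate_all_registries()` at start-up is safe even if a test has already registered extra entries.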

raise ValueError(
    "`max_ckpts_to_keep` must be greater than 0 to keep the last N checkpoints or negative to keep all checkpoints"
)
repopulate_registries()
Member


Also, I don't actually think we should be calling this here. To get the tests fixed and unblocked, this is fine for now, but can you please leave a TODO here to move repopulate_registries into our codepath for initializing ray and syncing registries to ray? This repopulation should be seen as a "start up time" task, not a config validation side effect

devpatelio and others added 5 commits October 15, 2025 18:58
Replaced 'repopulate_registries' function with 'repopulate_all_registries' for better clarity.
@devpatelio devpatelio merged commit da55b98 into main Oct 16, 2025
3 checks passed
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025
Co-authored-by: Tyler Griggs <131809874+tyler-griggs@users.noreply.github.com>
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026