
Fix GPU CI Test Failures: Migrating Tests, NCCL P2P Access Errors, and Test Fixture Issues#477

Merged
devpatelio merged 38 commits into main from
devpatel/skyrl-gpu_ci_updates
Oct 16, 2025

Conversation

@devpatelio
Collaborator

@devpatelio devpatelio commented Oct 14, 2025

Migrated 4 tests from tests/gpu/ to tests/gpu/gpu_ci/ and fixed critical environment configuration issues causing CI-only failures with NCCL peer-to-peer access errors.

Migrated Tests:

  • test_worker_offload.py
  • test_training_step.py
  • test_ppo_train.py
  • test_policy_local_engines_e2e.py

Primary Issue: NCCL P2P Access Errors in CI

Users reported that tests would pass locally on 4-GPU machines but fail in CI on 2-GPU machines with:

torch.distributed.DistBackendError: NCCL error
ncclUnhandledCudaError: Call to CUDA function failed.
Last error: Cuda failure 217 'peer access is not supported between these two devices'

Root Cause

When tests were moved to gpu_ci/, these tests did not explicitly use the ray_init_fixture parameter. This caused:

  1. Incorrect Fixture Inheritance: Tests inherited from parent tests/gpu/conftest.py instead of using tests/gpu/gpu_ci/conftest.py
  2. Wrong GPU Count Check: Parent fixture checked peer_access_supported(max_num_gpus_per_node=4) on 2-GPU CI machines
  3. Missing NCCL Flags: Due to the incorrect GPU count, P2P disable flags (NCCL_P2P_DISABLE=1, NCCL_SHM_DISABLE=1) weren't set
  4. Missing VLLM Environment Variables: The CI fixture was missing required VLLM env vars (VLLM_USE_V1, VLLM_ENABLE_V1_MULTIPROCESSING, VLLM_ALLOW_INSECURE_SERIALIZATION) because the env-var dict was being replaced rather than updated, overwriting them.
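A minimal sketch of why point 4 happens (the variable names and env-var values here are illustrative, not taken from the actual fixture): assigning a new dict to the name drops every key set earlier, while `.update()` merges into the existing dict.

```python
def build_env_vars(p2p_supported: bool, use_update: bool) -> dict:
    # Hypothetical stand-in for the fixture's runtime-env construction.
    env_vars = {
        "VLLM_USE_V1": "1",                        # values are placeholders
        "VLLM_ENABLE_V1_MULTIPROCESSING": "0",
        "VLLM_ALLOW_INSECURE_SERIALIZATION": "1",
    }
    nccl_flags = {"NCCL_P2P_DISABLE": "1", "NCCL_SHM_DISABLE": "1"}
    if not p2p_supported:
        if use_update:
            env_vars.update(nccl_flags)  # fixed: merge NCCL flags in
        else:
            env_vars = nccl_flags        # buggy: rebinding loses the VLLM vars
    return env_vars
```

With `use_update=False`, the returned dict contains only the two NCCL flags, which is exactly the "missing VLLM env vars" symptom described above.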

Locally (4 GPUs) vs CI Run Differences

Locally (4 GPUs):

  • 4-GPU dev environments typically support P2P between GPUs, or
  • The 4-GPU check correctly identified the lack of support and set the disable flags appropriately

In CI (2 GPUs):

  • 2-GPU CI machines don't support P2P
  • Checking for 4 GPUs on a 2-GPU machine gave incorrect results
  • NCCL therefore tried to use P2P without the disable flags

Changes Made

1. Fixed tests/gpu/gpu_ci/conftest.py

Corrected NCCL flag setting (using .update() instead of replacement):

if not peer_access_supported(max_num_gpus_per_node=2):  # Correct for 2-GPU CI
    env_vars.update({
        "NCCL_P2P_DISABLE": "1",
        "NCCL_SHM_DISABLE": "1",
    })

2. Updated All Moved Tests

Added explicit ray_init_fixture parameter to force usage of CI-specific fixture:

Before:

async def test_policy_training_step(cfg, packed, strategy):

After:

async def test_policy_training_step(ray_init_fixture, cfg, packed, strategy):

Applied to all test functions in:

  • test_training_step.py (2 tests)
  • test_ppo_train.py (3 tests)
  • test_worker_offload.py (4 tests)
  • test_policy_local_engines_e2e.py (1 test)

3. Additional Fixes

Registry State Leakage:

  • Removed registry reset functions that were causing state leakage between tests
  • Specifically affected PolicyLossRegistry and AdvantageEstimatorRegistry
  • test_policy_local_engines_e2e was resetting registries, breaking subsequent test_ppo_train tests
  • Created a safe repopulate_registries utility function as a backup
  • Set max_num_batched_tokens=32768 in the test_skyrl_gym_generator and test_engine_generation tests

When moving tests to gpu_ci/, always add ray_init_fixture as the first parameter so the test explicitly uses the CI fixture defined in the folder's conftest.py

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request focuses on migrating to a new GPU CI setup. The changes include adjusting script parameters for CI environments (e.g., reducing GPU count and epochs), adding a Modal integration for running commands in a cloud environment, and improving test fixtures and utility functions for better resource management and stability. My review highlights several areas for improvement, primarily focusing on security and robustness. I've pointed out critical security vulnerabilities related to the use of shell=True in subprocess calls with user-provided commands and suggested safer alternatives. I've also recommended more specific exception handling to avoid silently ignoring potentially important errors.

Comment on lines +145 to +153
process = subprocess.Popen(
    command,
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # Merge stderr into stdout
    text=True,
    bufsize=1,  # Line buffered
    universal_newlines=True,
)
Contributor


critical

Using shell=True with subprocess.Popen and a user-provided command string is a serious security vulnerability, as it can lead to command injection. An attacker could execute arbitrary commands on the system. To fix this, you should parse the command string into a list of arguments using shlex.split() and run it with shell=False. You will need to import shlex at the beginning of the run_script function.

    import shlex
    process = subprocess.Popen(
        shlex.split(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # Merge stderr into stdout
        text=True,
        bufsize=1,  # Line buffered
        universal_newlines=True,
    )
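As a quick illustration of the suggested fix (not code from the PR itself): `shlex.split` tokenizes a command string the way a POSIX shell would, so quoted arguments survive as single tokens, while shell metacharacters lose their special meaning once the command runs with `shell=False`.

```python
import shlex

cmd = 'python train.py --config "my config.yaml"'
tokens = shlex.split(cmd)
print(tokens)  # → ['python', 'train.py', '--config', 'my config.yaml']

# A would-be injection payload becomes an ordinary argument rather than a
# second command, since nothing interprets ';' when shell=False is used.
print(shlex.split('echo hi; rm -rf /'))  # → ['echo', 'hi;', 'rm', '-rf', '/']
```

One caveat: commands that rely on shell features (pipes, redirection, globbing) will no longer work after this change and would need to be restructured.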

Comment on lines +112 to +116
subprocess.run(
    f"cp -r {gym_src} {gym_dst}",
    shell=True,
    check=True,
)
Contributor


medium

Using shell=True with subprocess.run can be a security risk if any part of the command is derived from external input. While gym_src and gym_dst seem safe here, it's better to use safer alternatives like passing arguments as a list and setting shell=False.

            subprocess.run(
                ["cp", "-r", gym_src, gym_dst],
                check=True,
            )

Comment on lines +121 to +125
subprocess.run(
    "ray start --head",
    shell=True,
    check=True,
)
Contributor


medium

Using shell=True is generally discouraged for security and portability reasons, even for hardcoded commands. It's better to pass arguments as a list.

    subprocess.run(
        ["ray", "start", "--head"],
        check=True,
    )

Comment on lines +238 to +239
except Exception:
    pass
Contributor


medium

Catching a bare Exception and silently passing can hide important errors. It's better to catch more specific exceptions that you expect might occur during the sync_with_actor calls, such as ray.exceptions.RayActorError, and potentially log the exception for debugging purposes.

Comment on lines +39 to +40
except Exception:
    pass
Contributor


medium

Catching a bare Exception and passing silently can hide issues. It would be better to catch a more specific exception. For instance, dist.destroy_process_group() can raise a RuntimeError if called in an invalid state. Logging the exception would also be helpful for debugging.
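One way to act on this suggestion — a sketch only; the real call site, logger name, and error cases may differ. The distributed module is passed in as a parameter here so the sketch stays importable without torch:

```python
import logging

logger = logging.getLogger(__name__)

def safe_destroy_process_group(dist_module) -> None:
    """Tear down the process group, tolerating 'not initialized' states.

    dist_module is expected to look like torch.distributed; a RuntimeError
    is what destroy_process_group raises when no group is initialized.
    """
    try:
        dist_module.destroy_process_group()
    except RuntimeError as e:
        # Narrow except clause plus a log line, instead of a bare pass:
        # the failure is tolerated but no longer invisible.
        logger.warning("destroy_process_group skipped: %s", e)
```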

@devpatelio
Collaborator Author

/gemini review

@devpatelio devpatelio changed the title GPU CI Migration Migrate more tests to GPU CI Pipeline Oct 15, 2025
@devpatelio devpatelio changed the title Migrate more tests to GPU CI Pipeline Fix GPU CI Test Failures: Migrating Tests, NCCL P2P Access Errors, and Test Fixture Issues Oct 15, 2025
return advantages, returns


def repopulate_registries():
Member


Does this require all these try/except clauses? Why would there be ValueErrors on all the register() calls below? And why would set(PolicyLossRegistry.list_available()) throw an error?



@pytest.fixture
@pytest.fixture()
Member


I don't think that's correct. It should not have (), right?

return advantages, returns


def repopulate_registries():
Member


On second thought, a cleaner pattern would be for each Registry to expose a repopulate_registry() method that handles its own registered methods.

Then, this helper can be renamed repopulate_all_registries() and just calls repopulate_registry() on each registry.
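A sketch of that pattern — the registry class names come from this PR, but the base class, default entries, and registered callables are placeholders:

```python
class BaseRegistry:
    """Each registry owns its built-in defaults and can restore them."""

    _defaults: dict = {}  # subclasses fill in their built-in entries
    _items: dict = {}     # subclasses must shadow with their own dict

    @classmethod
    def register(cls, name, fn):
        cls._items[name] = fn

    @classmethod
    def list_available(cls):
        return sorted(cls._items)

    @classmethod
    def repopulate_registry(cls):
        # Idempotent: re-adds defaults without clobbering later additions.
        for name, fn in cls._defaults.items():
            cls._items.setdefault(name, fn)

class PolicyLossRegistry(BaseRegistry):
    _defaults = {"ppo": lambda: "ppo_loss"}  # placeholder entry
    _items = {}

class AdvantageEstimatorRegistry(BaseRegistry):
    _defaults = {"gae": lambda: "gae_estimator"}  # placeholder entry
    _items = {}

def repopulate_all_registries():
    # The helper just delegates; each registry knows its own defaults.
    for reg in (PolicyLossRegistry, AdvantageEstimatorRegistry):
        reg.repopulate_registry()
```

Because `setdefault` is used, calling `repopulate_all_registries()` at start-up is safe even if a test has already registered extra entries.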

raise ValueError(
    "`max_ckpts_to_keep` must be greater than 0 to keep the last N checkpoints or negative to keep all checkpoints"
)
repopulate_registries()
Member


Also, I don't actually think we should be calling this here. To get the tests fixed and unblocked, this is fine for now, but can you please leave a TODO here to move repopulate_registries into our codepath for initializing ray and syncing registries to ray? This repopulation should be seen as a "start up time" task, not a config validation side effect

devpatelio and others added 5 commits October 15, 2025 18:58
Replaced 'repopulate_registries' function with 'repopulate_all_registries' for better clarity.
@devpatelio devpatelio merged commit da55b98 into main Oct 16, 2025
3 checks passed
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025
Co-authored-by: Tyler Griggs <131809874+tyler-griggs@users.noreply.github.com>
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026