Fix GPU CI Test Failures: Migrating Tests, NCCL P2P Access Errors, and Test Fixture Issues #477
Conversation
Code Review
This pull request focuses on migrating to a new GPU CI setup. The changes include adjusting script parameters for CI environments (e.g., reducing GPU count and epochs), adding a Modal integration for running commands in a cloud environment, and improving test fixtures and utility functions for better resource management and stability. My review highlights several areas for improvement, primarily focusing on security and robustness. I've pointed out critical security vulnerabilities related to the use of shell=True in subprocess calls with user-provided commands and suggested safer alternatives. I've also recommended more specific exception handling to avoid silently ignoring potentially important errors.
```python
process = subprocess.Popen(
    command,
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # Merge stderr into stdout
    text=True,
    bufsize=1,  # Line buffered
    universal_newlines=True,
)
```
Using shell=True with subprocess.Popen and a user-provided command string is a serious security vulnerability, as it can lead to command injection. An attacker could execute arbitrary commands on the system. To fix this, you should parse the command string into a list of arguments using shlex.split() and run it with shell=False. You will need to import shlex at the beginning of the run_script function.
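As a quick standalone illustration of the `shlex.split` behavior this comment relies on (the command string here is made up, not from the PR):

```python
import shlex

# shlex.split tokenizes like a POSIX shell, keeping quoted arguments together,
# so the result can be passed to subprocess with shell=False.
command = 'python train.py --config "my config.yaml" --epochs 2'
tokens = shlex.split(command)
print(tokens)  # ['python', 'train.py', '--config', 'my config.yaml', '--epochs', '2']
```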
```python
import shlex

process = subprocess.Popen(
    shlex.split(command),
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # Merge stderr into stdout
    text=True,
    bufsize=1,  # Line buffered
    universal_newlines=True,
)
```

```python
subprocess.run(
    f"cp -r {gym_src} {gym_dst}",
    shell=True,
    check=True,
)
```
Using shell=True with subprocess.run can be a security risk if any part of the command is derived from external input. While gym_src and gym_dst seem safe here, it's better to use safer alternatives like passing arguments as a list and setting shell=False.
```python
subprocess.run(
    ["cp", "-r", gym_src, gym_dst],
    check=True,
)
```

```python
subprocess.run(
    "ray start --head",
    shell=True,
    check=True,
)
```
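Alternatively, the recursive copy can be done without spawning any subprocess via the standard library. A sketch, with `gym_src`/`gym_dst` recreated here as temporary paths so the example is self-contained:

```python
import shutil
import tempfile
from pathlib import Path

# gym_src / gym_dst stand in for the paths used in the PR.
gym_src = Path(tempfile.mkdtemp()) / "gym_src"
gym_dst = Path(tempfile.mkdtemp()) / "gym_dst"
(gym_src / "envs").mkdir(parents=True)
(gym_src / "envs" / "env.py").write_text("# example file\n")

# Recursive copy with no shell involved; dirs_exist_ok requires Python 3.8+.
shutil.copytree(gym_src, gym_dst, dirs_exist_ok=True)
print((gym_dst / "envs" / "env.py").exists())
```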
```python
except Exception:
    pass
```
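On the bare `except Exception: pass` pattern flagged in the review summary, a narrower handler might look like this (illustrative only, using a deliberately failing command rather than the PR's actual calls):

```python
import subprocess
import sys

# Instead of swallowing every exception, catch the specific failure and report it.
try:
    subprocess.run(
        [sys.executable, "-c", "raise SystemExit(3)"],  # stand-in failing command
        check=True,
        capture_output=True,
        text=True,
    )
except subprocess.CalledProcessError as exc:
    message = f"command failed with exit code {exc.returncode}"
    print(message)
```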
This reverts commit 49eacff.
/gemini review
```python
return advantages, returns


def repopulate_registries():
```
Does this require all these try/except clauses? Why would there be ValueErrors on all the register() calls below? And why would set(PolicyLossRegistry.list_available()) throw an error?
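For context on the question: registries commonly raise `ValueError` on duplicate registration, which would explain guards around repeated `register()` calls. A toy sketch, not SkyRL's actual Registry:

```python
class Registry:
    """Toy registry; SkyRL's real registry internals are assumed, not copied."""

    _entries = {}

    @classmethod
    def register(cls, name, fn):
        # A duplicate name raises, so re-registering after a prior test run throws.
        if name in cls._entries:
            raise ValueError(f"{name!r} is already registered")
        cls._entries[name] = fn

    @classmethod
    def list_available(cls):
        return list(cls._entries)


Registry.register("gae", lambda: None)
try:
    Registry.register("gae", lambda: None)  # second registration raises
    caught = False
except ValueError:
    caught = True
print(caught)  # True
```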
```diff
-@pytest.fixture
+@pytest.fixture()
```
I don't think that's correct. It should not have (), right?
```python
return advantages, returns


def repopulate_registries():
```
On second thought, a cleaner pattern would be for each Registry to expose a repopulate_registry() method that handles its own registered methods.
Then, this helper can be renamed repopulate_all_registries() and just calls repopulate_registry() on each registry.
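A minimal sketch of that pattern, with a toy `Registry` (the real SkyRL registry internals are assumptions here):

```python
class Registry:
    """Toy registry that knows how to restore its own built-in entries."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._entries = dict(defaults)

    def register(self, name, fn):
        if name in self._entries:
            raise ValueError(f"{name!r} is already registered")
        self._entries[name] = fn

    def list_available(self):
        return list(self._entries)

    def repopulate_registry(self):
        # Each registry restores its own built-ins, idempotently.
        self._entries = dict(self._defaults)


def repopulate_all_registries(*registries):
    # The helper just delegates; it needs no knowledge of each registry's entries.
    for registry in registries:
        registry.repopulate_registry()


policy_loss = Registry({"ppo": object()})
policy_loss.register("custom_loss", object())
repopulate_all_registries(policy_loss)
print(policy_loss.list_available())  # ['ppo']
```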
```python
raise ValueError(
    "`max_ckpts_to_keep` must be greater than 0 to keep the last N checkpoints or negative to keep all checkpoints"
)
```

```python
repopulate_registries()
```
Also, I don't actually think we should be calling this here. To get the tests fixed and unblocked, this is fine for now, but can you please leave a TODO here to move repopulate_registries into our codepath for initializing ray and syncing registries to ray? This repopulation should be seen as a "start-up-time" task, not a config validation side effect.
Replaced 'repopulate_registries' function with 'repopulate_all_registries' for better clarity.
…d Test Fixture Issues (NovaSky-AI#477)

Migrated 4 tests from `tests/gpu/` to `tests/gpu/gpu_ci/` and fixed critical environment configuration issues causing CI-only failures with NCCL peer-to-peer access errors.

**Migrated Tests:**
- `test_worker_offload.py`
- `test_training_step.py`
- `test_ppo_train.py`
- `test_policy_local_engines_e2e.py`

### Primary Issue: NCCL P2P Access Errors in CI

Users reported that tests would **pass locally on 4-GPU machines** but **fail in CI on 2-GPU machines** with:

```
torch.distributed.DistBackendError: NCCL error
ncclUnhandledCudaError: Call to CUDA function failed.
Last error: Cuda failure 217 'peer access is not supported between these two devices'
```

### Root Cause

When tests were moved to `gpu_ci/`, these tests **did not explicitly use the `ray_init_fixture` parameter**. This caused:

1. **Incorrect Fixture Inheritance**: Tests inherited from the parent `tests/gpu/conftest.py` instead of using `tests/gpu/gpu_ci/conftest.py`
2. **Wrong GPU Count Check**: The parent fixture checked `peer_access_supported(max_num_gpus_per_node=4)` on 2-GPU CI machines
3. **Missing NCCL Flags**: Due to the incorrect GPU count, the P2P disable flags (`NCCL_P2P_DISABLE=1`, `NCCL_SHM_DISABLE=1`) weren't set
4. **Missing VLLM Environment Variables**: The CI fixture was missing required VLLM env vars (`VLLM_USE_V1`, `VLLM_ENABLE_V1_MULTIPROCESSING`, `VLLM_ALLOW_INSECURE_SERIALIZATION`) because they were being overwritten

### Locally (4 GPUs) vs CI Run Differences

- 4-GPU dev environments typically support P2P between GPUs, or
- The 4-GPU check correctly identified lack of support and set flags appropriately
- 2-GPU machines don't support P2P
- Checking for 4 GPUs on a 2-GPU machine gave incorrect results
- NCCL tried to use P2P without disable flags

## Changes Made

### 1. Fixed `tests/gpu/gpu_ci/conftest.py`

**Corrected NCCL flag setting (using `.update()` instead of replacement):**

```python
if not peer_access_supported(max_num_gpus_per_node=2):  # Correct for 2-GPU CI
    env_vars.update({
        "NCCL_P2P_DISABLE": "1",
        "NCCL_SHM_DISABLE": "1",
    })
```

### 2. Updated All Moved Tests

Added an explicit `ray_init_fixture` parameter to force usage of the CI-specific fixture:

**Before:**

```python
async def test_policy_training_step(cfg, packed, strategy):
```

**After:**

```python
async def test_policy_training_step(ray_init_fixture, cfg, packed, strategy):
```

Applied to all test functions in:
- `test_training_step.py` (2 tests)
- `test_ppo_train.py` (3 tests)
- `test_worker_offload.py` (4 tests)
- `test_policy_local_engines_e2e.py` (1 test)

### 3. Additional Fixes

**Registry State Leakage:**
- Removed registry reset functions that were causing state leakage between tests
- Specifically affected `PolicyLossRegistry` and `AdvantageEstimatorRegistry`
- `test_policy_local_engines_e2e` was resetting registries, breaking subsequent `test_ppo_train` tests
- Created a safe `repopulate_registries` util function as a backup
- Modified `max_num_batched_tokens=32768` in the `test_skyrl_gym_generator` and `test_engine_generation` tests

**When moving tests to `gpu_ci/`, always add `ray_init_fixture` as the first parameter to explicitly use the CI fixture defined in the folder root.**

---

Co-authored-by: Tyler Griggs <131809874+tyler-griggs@users.noreply.github.com>
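The `.update()`-instead-of-replacement fix can be demonstrated in isolation (the dict contents here are illustrative stand-ins for the fixture's env vars):

```python
# VLLM_* entries stand in for the env vars the CI fixture must keep.
env_vars = {"VLLM_USE_V1": "1", "VLLM_ALLOW_INSECURE_SERIALIZATION": "1"}

# Buggy variant: rebinding to a new dict drops the VLLM settings entirely.
overwritten = {"NCCL_P2P_DISABLE": "1", "NCCL_SHM_DISABLE": "1"}

# Fixed variant: .update() merges in the NCCL flags and keeps the VLLM settings.
env_vars.update({"NCCL_P2P_DISABLE": "1", "NCCL_SHM_DISABLE": "1"})

print("VLLM_USE_V1" in overwritten)  # False
print("VLLM_USE_V1" in env_vars)     # True
```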