[GPU CI 1/N] Init GPU CI on Anyscale (NovaSky-AI#102)

SumanthRH · web-flow · commit b5b70993612d · 2025-07-21T11:55:26.000-07:00
# What does this PR do?

First PR in enabling GPU CI.

This PR adds a `SkyRL-GPU` workflow that runs GPU CI on Anyscale. The
workflow will run basic unit and integration tests on a 4xL4 instance.
Currently, the CI runs post-merge on `main`. For pull requests, it
requires a manual trigger by a maintainer, so it does not run on every
commit, even for maintainers, since that can be wasteful.


In this PR, I've only ported over one of our existing GPU tests,
`test_models.py`, to run on CI. The plan is to gradually add more tests
once we've made them more efficient and compatible with a 4xL4 node.
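
The ported test depends on a `ray_init_fixture` in `conftest.py` that disables NCCL P2P and SHM when the `CI` environment variable is truthy. A minimal sketch of that selection logic in isolation is shown here. The helper name `nccl_env_vars_for_ci` is hypothetical and is not part of this PR; the fixture inlines the same check before calling `ray.init`.

```python
import os
from typing import Dict, Mapping, Optional


def nccl_env_vars_for_ci(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the NCCL overrides to pass to Ray workers.

    Hypothetical helper (not in the PR) mirroring the check in conftest.py:
    when CI is "1", "true", or "yes" (case-insensitive), disable NCCL P2P and
    SHM, since the PR notes that L4s don't support P2P access.
    """
    environ = os.environ if environ is None else environ
    val = environ.get("CI", "").lower()
    if val in ("1", "true", "yes"):
        return {"NCCL_P2P_DISABLE": "1", "NCCL_SHM_DISABLE": "1"}
    # CI unset or falsy: leave NCCL settings untouched.
    return {}
```

The returned dict could be passed as `ray.init(runtime_env={"env_vars": ...})`, as the fixture does. Taking the environment as a parameter is an illustrative choice that keeps the check testable without a GPU or a Ray cluster.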

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
diff --git a/.github/workflows/gpu_ci.yaml b/.github/workflows/gpu_ci.yaml
@@ -0,0 +1,44 @@
+name: SkyRL-GPU
+
+on: 
+  push: 
+    branches: 
+      - main 
+  workflow_dispatch:
+
+
+permissions:
+  checks: write   # for status checks to appear
+  contents: read
+
+jobs:
+  
+  skyrl_tests:
+    runs-on: ubuntu-latest
+    defaults:
+      run:
+        shell: bash
+        working-directory: ./skyrl-train
+
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        # This is the version of the action for setting up Python, not the Python version.
+        uses: actions/setup-python@v5
+        with:
+          # Semantic version range syntax or exact version of a Python version
+          python-version: '3.12'
+          cache: 'pip'
+      - name: Install the latest version of uv
+        uses: astral-sh/setup-uv@v6
+        with:
+          activate-environment: true
+      - name: Install basic dependencies
+        run: uv pip install anyscale==0.24.79 typer==0.9.0
+      # Run tests
+      - name: GPU tests
+        env:
+          ANYSCALE_CLI_TOKEN: ${{ secrets.ANYSCALE_CLI_TOKEN }}
+          ANYSCALE_HOST: https://console.anyscale.com
+        run: |
+          anyscale job submit -f ci/anyscale_gpu_ci.yaml --wait
diff --git a/skyrl-train/ci/anyscale_gpu_ci.yaml b/skyrl-train/ci/anyscale_gpu_ci.yaml
@@ -0,0 +1,10 @@
+name: skyrl-train-gpu-ci
+entrypoint: bash ci/gpu_ci_run.sh
+image_uri: sumanthrh/skyrl-train-ray-2.44.0-py3.12-cu12.8 # (Optional) Exclusive with `containerfile`.
+cloud: sky-anyscale-aws-us-east-1
+ray_version: "2.44.0"
+compute_config: l4_ci 
+working_dir: . # (Optional) Use current working directory "." as the working_dir. Can be any local path or remote .zip file in cloud storage.
+env_vars:
+  RAY_RUNTIME_ENV_HOOK: ray._private.runtime_env.uv_runtime_env_hook.hook
+max_retries: 0 # (Optional) Maximum number of times the job will be retried before being marked failed. Defaults to `1`.
diff --git a/skyrl-train/ci/gpu_ci_run.sh b/skyrl-train/ci/gpu_ci_run.sh
@@ -0,0 +1,3 @@
+#!/usr/bin/env bash
+export CI=true
+uv run --directory . --isolated --extra dev --extra vllm pytest -s tests/gpu/gpu_ci
diff --git a/skyrl-train/tests/gpu/gpu_ci/__init__.py b/skyrl-train/tests/gpu/gpu_ci/__init__.py
diff --git a/skyrl-train/tests/gpu/gpu_ci/conftest.py b/skyrl-train/tests/gpu/gpu_ci/conftest.py
@@ -0,0 +1,28 @@
+import pytest
+import ray
+import os
+from loguru import logger
+from functools import lru_cache
+
+
+@lru_cache(5)
+def log_once(msg):
+    logger.info(msg)
+    return None
+
+
+@pytest.fixture
+def ray_init_fixture():
+    if ray.is_initialized():
+        ray.shutdown()
+    # NOTE (sumanthrh): We disable NCCL P2P and SHM for the CI environment by default - L4s don't support P2P access.
+    # If `CI` is unset or false, no NCCL overrides are applied.
+    env_vars = {}
+    val = os.environ.get("CI", "").lower()
+    if val in ("1", "true", "yes"):
+        log_once("Disabling NCCL P2P for CI environment")
+        env_vars = {"NCCL_P2P_DISABLE": "1", "NCCL_SHM_DISABLE": "1"}
+    ray.init(runtime_env={"env_vars": env_vars})
+    yield
+    # call ray shutdown after a test regardless
+    ray.shutdown()
diff --git a/skyrl-train/tests/gpu/gpu_ci/test_models.py b/skyrl-train/tests/gpu/gpu_ci/test_models.py
@@ -126,7 +126,7 @@ def cleanup(self):
         dist.destroy_process_group()
 
 
-def test_actor_model_fwd_with_sequence_parallelism():
+def test_actor_model_fwd_with_sequence_parallelism(ray_init_fixture):
 
     # Create input sequence
     tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME, trust_remote_code=True)
