Commit b5b7099

[GPU CI 1/N] Init GPU CI on Anyscale (NovaSky-AI#102)
# What does this PR do?

First PR in enabling GPU CI.

This PR adds a `SkyRL-GPU` workflow that runs GPU CI on Anyscale. The workflow will run basic unit and integration tests on a 4xL4 instance. Currently, the CI runs post-merge on `main`. For pull requests, it requires a manual trigger by a maintainer, so it does not run on every commit, even for maintainers, since that can be wasteful.

In this PR, I've only ported over one of our existing GPU tests, `test_models.py`, to run on CI. The plan is to gradually add more tests after we've made them more efficient and compatible with a 4xL4 node.

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
1 parent 1742e6f commit b5b7099

6 files changed, +86 -1 lines changed

.github/workflows/gpu_ci.yaml

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+name: SkyRL-GPU
+
+on:
+  push:
+    branches:
+      - main
+  workflow_dispatch:
+
+
+permissions:
+  checks: write # for status checks to appear
+  contents: read
+
+jobs:
+
+  skyrl_tests:
+    runs-on: ubuntu-latest
+    defaults:
+      run:
+        shell: bash
+        working-directory: ./skyrl-train
+
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        # This is the version of the action for setting up Python, not the Python version.
+        uses: actions/setup-python@v5
+        with:
+          # Semantic version range syntax or exact version of a Python version
+          python-version: '3.12'
+          cache: 'pip'
+      - name: Install the latest version of uv
+        uses: astral-sh/setup-uv@v6
+        with:
+          activate-environment: true
+      - name: Install basic dependencies
+        run: uv pip install anyscale==0.24.79 typer==0.9.0
+      # Run tests
+      - name: GPU tests
+        env:
+          ANYSCALE_CLI_TOKEN: ${{ secrets.ANYSCALE_CLI_TOKEN }}
+          ANYSCALE_HOST: https://console.anyscale.com
+        run: |
+          anyscale job submit -f ci/anyscale_gpu_ci.yaml --wait
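
The last step of the workflow shells out to the Anyscale CLI from the `./skyrl-train` working directory. To reproduce a CI submission from a local checkout, the same invocation could be sketched in Python roughly as follows. The helper names `build_submit_command`, `build_env`, and `submit_gpu_ci` are hypothetical and not part of this commit; only the command line and the two environment variables are taken from the workflow above.

```python
import os
import subprocess


def build_submit_command(config_path="ci/anyscale_gpu_ci.yaml"):
    """Return the CLI invocation used by the `GPU tests` workflow step."""
    return ["anyscale", "job", "submit", "-f", config_path, "--wait"]


def build_env(token, base_env=None):
    """Copy an environment and add the variables the workflow step sets."""
    env = dict(os.environ if base_env is None else base_env)
    env["ANYSCALE_CLI_TOKEN"] = token
    env["ANYSCALE_HOST"] = "https://console.anyscale.com"
    return env


def submit_gpu_ci(token, workdir="skyrl-train"):
    """Submit the job from `workdir`; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(
        build_submit_command(),
        cwd=workdir,
        env=build_env(token),
        check=True,
    )
```

Calling `submit_gpu_ci(os.environ["ANYSCALE_CLI_TOKEN"])` from the repository root would mirror the CI step, assuming the `anyscale` CLI is installed and the token has access to the cloud and compute config named in the job file.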
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+name: skyrl-train-gpu-ci
+entrypoint: bash ci/gpu_ci_run.sh
+image_uri: sumanthrh/skyrl-train-ray-2.44.0-py3.12-cu12.8 # (Optional) Exclusive with `containerfile`.
+cloud: sky-anyscale-aws-us-east-1
+ray_version: "2.44.0"
+compute_config: l4_ci
+working_dir: . # (Optional) Use current working directory "." as the working_dir. Can be any local path or remote .zip file in cloud storage.
+env_vars:
+  RAY_RUNTIME_ENV_HOOK: ray._private.runtime_env.uv_runtime_env_hook.hook
+max_retries: 0 # (Optional) Maximum number of times the job will be retried before being marked failed. Defaults to `1`.

skyrl-train/ci/gpu_ci_run.sh

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+#!/usr/bin/env bash
+export CI=true
+uv run --directory . --isolated --extra dev --extra vllm pytest -s tests/gpu/gpu_ci

skyrl-train/tests/gpu/gpu_ci/__init__.py

Whitespace-only changes.
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+import pytest
+import ray
+import os
+from loguru import logger
+from functools import lru_cache
+
+
+@lru_cache(5)
+def log_once(msg):
+    logger.info(msg)
+    return None
+
+
+@pytest.fixture
+def ray_init_fixture():
+    if ray.is_initialized():
+        ray.shutdown()
+    # NOTE (sumanthrh): We disable SHM for CI environment by default - L4s don't support P2P access
+    # if `CI=false`, then this will be overridden.
+    env_vars = {}
+    val = os.environ.get("CI", "").lower()
+    if val in ("1", "true", "yes"):
+        log_once("Disabling NCCL P2P for CI environment")
+        env_vars = {"NCCL_P2P_DISABLE": "1", "NCCL_SHM_DISABLE": "1"}
+    ray.init(runtime_env={"env_vars": env_vars})
+    yield
+    # call ray shutdown after a test regardless
+    ray.shutdown()
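
The two pieces of logic in this fixture, the `CI` flag check and the deduplicated log message, can be exercised without Ray or a GPU. The sketch below isolates them under some assumptions: `nccl_env_for_ci` is a hypothetical helper name, and the `emitted` list stands in for `logger.info` so the behavior can be inspected directly.

```python
from functools import lru_cache

TRUTHY = ("1", "true", "yes")


def nccl_env_for_ci(environ):
    """Return the NCCL overrides the fixture applies when `CI` is truthy."""
    val = environ.get("CI", "").lower()
    if val in TRUTHY:
        return {"NCCL_P2P_DISABLE": "1", "NCCL_SHM_DISABLE": "1"}
    return {}


emitted = []  # stand-in for the loguru logger used in the fixture


@lru_cache(5)
def log_once(msg):
    # The cache remembers up to 5 distinct messages; a repeated call with a
    # cached message returns early without running this body again.
    emitted.append(msg)
    return None


log_once("Disabling NCCL P2P for CI environment")
log_once("Disabling NCCL P2P for CI environment")
```

Because the cache size is 5, a message can be logged again once five other distinct messages have evicted it, so "once" holds only for a small set of messages.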

skyrl-train/tests/gpu/test_models.py renamed to skyrl-train/tests/gpu/gpu_ci/test_models.py

Lines changed: 1 addition & 1 deletion
@@ -126,7 +126,7 @@ def cleanup(self):
         dist.destroy_process_group()
 
 
-def test_actor_model_fwd_with_sequence_parallelism():
+def test_actor_model_fwd_with_sequence_parallelism(ray_init_fixture):
 
     # Create input sequence
     tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME, trust_remote_code=True)
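
Adding `ray_init_fixture` as a parameter asks pytest to run the fixture's code up to its `yield` before the test body, and the code after the `yield` once the test has finished. The ordering can be illustrated with a simplified, pytest-free model. This is not how pytest is implemented internally; `fake_cluster_fixture`, `run_with_fixture`, and `sample_test` are hypothetical names used only for illustration.

```python
events = []


def fake_cluster_fixture():
    events.append("setup")       # stands in for ray.init(...)
    yield
    events.append("teardown")    # stands in for ray.shutdown()


def run_with_fixture(fixture, test_fn):
    gen = fixture()
    next(gen)                    # run the fixture up to its yield
    try:
        test_fn()
    finally:
        # Resume after the yield so teardown runs even if the test raised.
        try:
            next(gen)
        except StopIteration:
            pass


def sample_test():
    events.append("test body")


run_with_fixture(fake_cluster_fixture, sample_test)
```

After this runs, `events` holds the setup, test body, and teardown entries in that order, which is the sequence the renamed test now gets around its Ray usage.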
