[train][examples] Fix 8 broken example scripts from skyrl-train migration by CharlieFRuan · Pull Request #1230 · NovaSky-AI/SkyRL

CharlieFRuan · 2026-02-26T23:52:38Z

Summary

Fix 8 bugs in examples/train and examples/train_integrations discovered while systematically running all example scripts after the skyrl-train → skyrl/train migration.

Bugs fixed:

main_generate.py: Migrate from legacy @hydra.main YAML config loader to SkyRLTrainConfig.from_cli_overrides(), matching main_base.py. The old loader didn't understand the new nested generator.inference_engine.* config keys, causing Key 'inference_engine' is not in struct.
gspo/run_gspo_gsm8k.sh and sapo/run_sapo_gsm8k.sh: Fix stale path examples/gsm8k/run_gsm8k.sh → examples/train/gsm8k/run_gsm8k.sh. Also add "$@" passthrough so users can append CLI overrides.
lora/run_qwen2_5_0.5b_gsm8k_ppo_lora.sh: Add missing trainer.placement.critic_num_gpus_per_node — required for PPO (GAE) with colocated critic, otherwise hits assertion num_policy_gpus and num_critic_gpus must be the same.
openenv/run_openenv.sh: Fix package name openenv → openenv-core to match the upstream PyPI metadata in the OpenEnv repo.
harbor/run_codecontest.sh: Add missing "$@" passthrough so users can append CLI overrides (consistent with other example scripts).
on_policy_distillation/main_on_policy_distill.py: Remove duplicate @register_policy_loss("importance_sampling") registration — this loss type is now built-in in ppo_utils.py, causing ValueError: policy loss 'importance_sampling' already registered at startup.
remote_inference_engine/run_remote.sh: Add missing generator.sampling_params.logprobs=null — the default logprobs=1 is not supported in remote inference mode, causing NotImplementedError during validation.

Test plan

Ran 32 example scripts on 8×H100 with tiny datasets and verified at least one full training step completes for each. Full results:

Passed (29 examples/train + 3 examples/train_integrations):

gsm8k/run_gsm8k.sh, gsm8k/run_generation_gsm8k.sh (after fix)
ppo/run_ppo.sh
multiply/run_multiply.sh
sft/sft_trainer.py
All 10 algorithm variants: DAPO GSM8K, CISPO, Dr.GRPO, GSPO (after fix), SAPO GSM8K (after fix), REINFORCE++, RLOO, Clip-Cov, KL-Cov, Custom Advantage Estimator, Custom Policy Loss
DAPO AIME (Qwen3-1.7B-Base), SAPO AIME (Qwen3-4B-Base)
lora/run_qwen2_5_0.5b_gsm8k_grpo_lora.sh, lora/run_qwen2_5_0.5b_gsm8k_ppo_lora.sh (after fix)
training_backends/fsdp/run_fsdp.sh, training_backends/fsdp/run_fsdp2.sh, training_backends/run_no_seq_pack.sh
async/async_run_gsm8k.sh, fully_async/fully_async_run_gsm8k.sh
tis_correction/run_dapo_tis.sh
turn_level_rewards/run_gsm8k_multi_turn.sh
search/run_search.sh (Qwen2.5-1.5B-Instruct, with mock retrieval server — 2/2 training steps, full pipeline verified)
text_to_sql/run_skyrl_sql.sh (Qwen2.5-Coder-7B-Instruct, with OmniSQL databases — 8 training steps, multi-turn SQL generation verified)
on_policy_distillation/ (after fix — Qwen3-1.7B-Base student+teacher, custom apply_reward_kl_penalty and no_op advantage verified)
remote_inference_engine/run_remote.sh (after fix — script bug fixed; NCCL weight sync on single machine is a pre-existing limitation, not a migration bug)
train_integrations/harbor/run_codecontest.sh (Qwen3-8B, with Daytona sandbox)
train_integrations/openenv/ (import-verified after fix)

Skipped (with rationale):

Large model scripts (32B, 30B MoE, 235B) — same entrypoints already validated with smaller models
Megatron backend (13 scripts) — requires Megatron-LM installation
Flash RL (5 scripts) — pre-existing dependency bug in custom vllm wheel (not a migration issue)
External-dependency examples: LLM-as-judge, mini SWE, LiveCodeBench, MoE, GPT-OSS — all use the same main_base entrypoint already validated; blocked on dataset/API/server setup
Modal, Verifiers integrations — excluded per instructions

🤖 Generated with Claude Code

…tion - main_generate.py: migrate from @hydra.main to SkyRLTrainConfig.from_cli_overrides() - gspo/run_gspo_gsm8k.sh, sapo/run_sapo_gsm8k.sh: fix stale path + add $@ passthrough - lora/run_qwen2_5_0.5b_gsm8k_ppo_lora.sh: add missing critic_num_gpus_per_node for PPO - openenv/run_openenv.sh: fix package name openenv -> openenv-core Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ription Add missing "$@" CLI override passthrough to harbor/run_codecontest.sh for consistency with other example scripts. Update PR description with expanded test results including DAPO AIME, SAPO AIME, and Harbor CodeContest (all verified with full training steps on 8xH100). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…mple bugs - Remove duplicate `importance_sampling` policy loss registration in `main_on_policy_distill.py` — this loss type is now built-in in ppo_utils.py, causing "already registered" error at startup. - Add missing `generator.sampling_params.logprobs=null` to `run_remote.sh` — the default logprobs=1 is not supported in remote inference mode, causing NotImplementedError during validation. - Update PR description with expanded test results (32 scripts tested). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

CharlieFRuan · 2026-03-02T19:22:25Z

/gemini review

gemini-code-assist

Code Review

This pull request provides a series of well-justified fixes for 8 broken example scripts, stemming from a recent migration. The changes include updating stale paths, adding missing required configuration parameters, fixing an incorrect package name, and removing a duplicate function registration. Additionally, several scripts have been improved by adding "$@" to allow passthrough of command-line arguments, enhancing their usability and consistency. The migration of main_generate.py from Hydra to the custom SkyRLTrainConfig parser is a good modernization that aligns it with other entrypoints. The changes are correct and improve the overall quality and reliability of the example scripts.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

gemini-code-assist

Code Review

This pull request addresses 8 bugs across various example scripts in examples/train and examples/train_integrations, which arose from the skyrl-train to skyrl/train migration. The fixes involve updating incorrect file paths in shell scripts, adding missing required configuration parameters for specific training setups (like LoRA with PPO), correcting a dependency package name for OpenEnv integration, and removing a duplicate policy loss registration. Additionally, several shell scripts have been updated to pass through command-line arguments, enhancing their flexibility. A notable change is the refactoring of the main_generate.py entrypoint to use SkyRLTrainConfig.from_cli_overrides() for configuration, moving away from the legacy Hydra loader to align with the project's current standards and fix issues with nested configuration keys. My review of the changes did not find any issues.

examples/train/lora/run_qwen2_5_0.5b_gsm8k_ppo_lora.sh

examples/train_integrations/openenv/run_openenv.sh

SumanthRH

Thanks!

Left a few nits for pending comment fixes

Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>

devin-ai-integration

Devin Review found 1 new potential issue.

View 6 additional findings in Devin Review.

devin-ai-integration · 2026-03-02T20:46:53Z

examples/train/on_policy_distillation/main_on_policy_distill.py

-    return loss, {"clip_ratio": 0.0}
-
-
 class OnPolicyDistillationExp(BasePPOExp):


🟡 Removing custom importance_sampling loss silently changes loss computation semantics for on-policy distillation example

The PR removes the custom importance_sampling policy loss registration from main_on_policy_distill.py to avoid a duplicate-registration crash with the built-in one in ppo_utils.py. However, the two implementations have fundamentally different loss reduction behavior.

Behavioral difference between old custom and built-in implementations

The old custom implementation (removed in this PR) used a normalized reduction:

loss = reduce_loss(loss, loss_mask, "seq_mean_token_sum_norm", config.max_seq_len) return loss, {"clip_ratio": 0.0}

The built-in implementation at skyrl/backends/skyrl_train/utils/ppo_utils.py:966-980 uses a raw sum:

loss = (elementwise_loss * loss_mask).sum() return loss, {"importance_ratio": mean_ratio.item()}

reduce_loss with "seq_mean_token_sum_norm" computes per-sequence token-sum normalized by max_seq_len, then takes a batch mean — producing a loss that is invariant to batch size and sequence length. The built-in .sum() produces a loss that scales linearly with both, yielding very different gradient magnitudes.

Additionally, the metrics dict changed from {"clip_ratio": 0.0} to {"importance_ratio": ...}, which may affect downstream logging that expects the clip_ratio key.

Impact: Users of the on-policy distillation example will silently get a different (unnormalized) loss computation, which could lead to training instability or require learning rate re-tuning.

Prompt for agents

In examples/train/on_policy_distillation/main_on_policy_distill.py, the removal of the custom importance_sampling registration now silently delegates to the built-in importance_sampling_loss in skyrl/backends/skyrl_train/utils/ppo_utils.py (line 933-980), which uses a raw .sum() reduction instead of the original reduce_loss(..., 'seq_mean_token_sum_norm', config.max_seq_len). To preserve the original behavior, either: 1. Update the built-in importance_sampling_loss in skyrl/backends/skyrl_train/utils/ppo_utils.py (lines 964-980) to use reduce_loss instead of .sum(), and return {"clip_ratio": 0.0} in the metrics dict for consistency with other loss functions, OR 2. Keep the custom registration in main_on_policy_distill.py but use PolicyLossRegistry.unregister / PolicyLossRegistry.register to replace the built-in, OR 3. Add a comment in the on-policy distillation run scripts (run_on_policy_distill_math_qwen3_1.7b.sh and run_on_policy_distill_math_qwen3_4b.sh) noting that loss_reduction should be set to seq_mean_token_sum_norm and the learning rate may need adjustment since the built-in importance_sampling uses raw sum reduction.

Was this helpful? React with 👍 or 👎 to provide feedback.

Hi @CharlieFRuan , the importance_sampling reduction behavior is indeed changed from seq_mean_token_sum_norm to sum. Is this expected

CharlieFRuan and others added 3 commits February 26, 2026 23:51

CharlieFRuan changed the title ~~[train][examples] Fix 5 broken example scripts from skyrl-train migration~~ [train][examples] Fix 8 broken example scripts from skyrl-train migration Mar 2, 2026

Delete pr_description.md

7014419

CharlieFRuan marked this pull request as ready for review March 2, 2026 19:22

gemini-code-assist bot reviewed Mar 2, 2026

View reviewed changes

devin-ai-integration bot reviewed Mar 2, 2026

View reviewed changes

gemini-code-assist bot reviewed Mar 2, 2026

View reviewed changes

CharlieFRuan requested a review from SumanthRH March 2, 2026 19:27

SumanthRH reviewed Mar 2, 2026

View reviewed changes

examples/train/lora/run_qwen2_5_0.5b_gsm8k_ppo_lora.sh Outdated Show resolved Hide resolved

SumanthRH reviewed Mar 2, 2026

View reviewed changes

examples/train/lora/run_qwen2_5_0.5b_gsm8k_ppo_lora.sh Outdated Show resolved Hide resolved

SumanthRH reviewed Mar 2, 2026

View reviewed changes

examples/train_integrations/openenv/run_openenv.sh Outdated Show resolved Hide resolved

SumanthRH reviewed Mar 2, 2026

View reviewed changes

examples/train_integrations/openenv/run_openenv.sh Outdated Show resolved Hide resolved

SumanthRH approved these changes Mar 2, 2026

View reviewed changes

CharlieFRuan and others added 4 commits March 2, 2026 12:41

Update examples/train_integrations/openenv/run_openenv.sh

4a7da9d

Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>

Update examples/train/lora/run_qwen2_5_0.5b_gsm8k_ppo_lora.sh

b635b8a

Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>

Update examples/train/lora/run_qwen2_5_0.5b_gsm8k_ppo_lora.sh

87c014b

Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>

Update examples/train_integrations/openenv/run_openenv.sh

b377b40

Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>

CharlieFRuan merged commit f73e309 into NovaSky-AI:main Mar 2, 2026
0 of 2 checks passed

CharlieFRuan deleted the validate-examples branch March 2, 2026 20:41

devin-ai-integration bot reviewed Mar 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[train][examples] Fix 8 broken example scripts from skyrl-train migration#1230

[train][examples] Fix 8 broken example scripts from skyrl-train migration#1230
CharlieFRuan merged 8 commits intoNovaSky-AI:mainfrom
CharlieFRuan:validate-examples

CharlieFRuan commented Feb 26, 2026 •

edited by devin-ai-integration bot

Loading

Uh oh!

CharlieFRuan commented Mar 2, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

devin-ai-integration bot left a comment

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

SumanthRH left a comment

Uh oh!

Uh oh!

devin-ai-integration bot left a comment

Uh oh!

devin-ai-integration bot Mar 2, 2026

Uh oh!

gathierry Mar 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		return loss, {"clip_ratio": 0.0}


		class OnPolicyDistillationExp(BasePPOExp):

Conversation

CharlieFRuan commented Feb 26, 2026 • edited by devin-ai-integration bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

CharlieFRuan commented Mar 2, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

devin-ai-integration bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

SumanthRH left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

devin-ai-integration bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration bot Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

gathierry Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

CharlieFRuan commented Feb 26, 2026 •

edited by devin-ai-integration bot

Loading