
[train][examples] Fix 8 broken example scripts from skyrl-train migration#1230

Merged
CharlieFRuan merged 8 commits into NovaSky-AI:main from CharlieFRuan:validate-examples
Mar 2, 2026

Conversation


@CharlieFRuan CharlieFRuan commented Feb 26, 2026

Summary

Fix 8 bugs in examples/train and examples/train_integrations discovered while systematically running all example scripts after the skyrl-train → skyrl/train migration.

Bugs fixed:

  • main_generate.py: Migrate from the legacy @hydra.main YAML config loader to SkyRLTrainConfig.from_cli_overrides(), matching main_base.py. The old loader didn't understand the new nested generator.inference_engine.* config keys, causing `Key 'inference_engine' is not in struct`.
  • gspo/run_gspo_gsm8k.sh and sapo/run_sapo_gsm8k.sh: Fix stale path examples/gsm8k/run_gsm8k.sh → examples/train/gsm8k/run_gsm8k.sh. Also add "$@" passthrough so users can append CLI overrides.
  • lora/run_qwen2_5_0.5b_gsm8k_ppo_lora.sh: Add missing trainer.placement.critic_num_gpus_per_node — required for PPO (GAE) with a colocated critic; otherwise it hits the assertion `num_policy_gpus and num_critic_gpus must be the same`.
  • openenv/run_openenv.sh: Fix package name openenv → openenv-core to match the upstream PyPI metadata in the OpenEnv repo.
  • harbor/run_codecontest.sh: Add missing "$@" passthrough so users can append CLI overrides (consistent with other example scripts).
  • on_policy_distillation/main_on_policy_distill.py: Remove the duplicate @register_policy_loss("importance_sampling") registration — this loss type is now built into ppo_utils.py, so the duplicate caused `ValueError: policy loss 'importance_sampling' already registered` at startup.
  • remote_inference_engine/run_remote.sh: Add missing generator.sampling_params.logprobs=null — the default logprobs=1 is not supported in remote inference mode, causing NotImplementedError during validation.
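The main_generate.py fix is easiest to see with a toy dotted-override parser. This sketch is illustrative only: parse_overrides is a hypothetical stand-in, not the real SkyRLTrainConfig.from_cli_overrides(), but it shows why rebuilding the config from key=value overrides tolerates new nested keys like generator.inference_engine.* that a fixed YAML struct rejects.

```python
# Hypothetical stand-in for a from_cli_overrides-style parser; the real
# skyrl-train implementation is more involved (typing, schema defaults, etc.).
def parse_overrides(args):
    """Turn ["a.b=1", "a.c=x"] into {"a": {"b": "1", "c": "x"}}."""
    config = {}
    for arg in args:
        key, _, value = arg.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Nested sections are created on demand, so newly introduced
            # config keys don't need to pre-exist in a schema.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config

cfg = parse_overrides([
    "generator.inference_engine.backend=vllm",  # hypothetical key/value
    "trainer.epochs=3",
])
print(cfg["generator"]["inference_engine"]["backend"])  # vllm
```

By contrast, a struct-mode Hydra/OmegaConf config rejects any key absent from the loaded YAML schema, which is exactly the "not in struct" error above.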

Test plan

Ran 32 example scripts on 8×H100 with tiny datasets and verified at least one full training step completes for each. Full results:

Passed (29 examples/train + 3 examples/train_integrations):

  • gsm8k/run_gsm8k.sh, gsm8k/run_generation_gsm8k.sh (after fix)
  • ppo/run_ppo.sh
  • multiply/run_multiply.sh
  • sft/sft_trainer.py
  • All 11 algorithm variants: DAPO GSM8K, CISPO, Dr.GRPO, GSPO (after fix), SAPO GSM8K (after fix), REINFORCE++, RLOO, Clip-Cov, KL-Cov, Custom Advantage Estimator, Custom Policy Loss
  • DAPO AIME (Qwen3-1.7B-Base), SAPO AIME (Qwen3-4B-Base)
  • lora/run_qwen2_5_0.5b_gsm8k_grpo_lora.sh, lora/run_qwen2_5_0.5b_gsm8k_ppo_lora.sh (after fix)
  • training_backends/fsdp/run_fsdp.sh, training_backends/fsdp/run_fsdp2.sh, training_backends/run_no_seq_pack.sh
  • async/async_run_gsm8k.sh, fully_async/fully_async_run_gsm8k.sh
  • tis_correction/run_dapo_tis.sh
  • turn_level_rewards/run_gsm8k_multi_turn.sh
  • search/run_search.sh (Qwen2.5-1.5B-Instruct, with mock retrieval server — 2/2 training steps, full pipeline verified)
  • text_to_sql/run_skyrl_sql.sh (Qwen2.5-Coder-7B-Instruct, with OmniSQL databases — 8 training steps, multi-turn SQL generation verified)
  • on_policy_distillation/ (after fix — Qwen3-1.7B-Base student+teacher, custom apply_reward_kl_penalty and no_op advantage verified)
  • remote_inference_engine/run_remote.sh (after fix — script bug fixed; NCCL weight sync on single machine is a pre-existing limitation, not a migration bug)
  • train_integrations/harbor/run_codecontest.sh (Qwen3-8B, with Daytona sandbox)
  • train_integrations/openenv/ (import-verified after fix)

Skipped (with rationale):

  • Large model scripts (32B, 30B MoE, 235B) — same entrypoints already validated with smaller models
  • Megatron backend (13 scripts) — requires Megatron-LM installation
  • Flash RL (5 scripts) — pre-existing dependency bug in custom vllm wheel (not a migration issue)
  • External-dependency examples: LLM-as-judge, mini SWE, LiveCodeBench, MoE, GPT-OSS — all use the same main_base entrypoint already validated; blocked on dataset/API/server setup
  • Modal, Verifiers integrations — excluded per instructions

🤖 Generated with Claude Code



CharlieFRuan and others added 3 commits February 26, 2026 23:51
…tion

- main_generate.py: migrate from @hydra.main to SkyRLTrainConfig.from_cli_overrides()
- gspo/run_gspo_gsm8k.sh, sapo/run_sapo_gsm8k.sh: fix stale path + add $@ passthrough
- lora/run_qwen2_5_0.5b_gsm8k_ppo_lora.sh: add missing critic_num_gpus_per_node for PPO
- openenv/run_openenv.sh: fix package name openenv -> openenv-core

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ription

Add missing "$@" CLI override passthrough to harbor/run_codecontest.sh
for consistency with other example scripts. Update PR description with
expanded test results including DAPO AIME, SAPO AIME, and Harbor
CodeContest (all verified with full training steps on 8xH100).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mple bugs

- Remove duplicate `importance_sampling` policy loss registration in
  `main_on_policy_distill.py` — this loss type is now built-in in
  ppo_utils.py, causing "already registered" error at startup.
- Add missing `generator.sampling_params.logprobs=null` to
  `run_remote.sh` — the default logprobs=1 is not supported in remote
  inference mode, causing NotImplementedError during validation.
- Update PR description with expanded test results (32 scripts tested).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CharlieFRuan CharlieFRuan changed the title [train][examples] Fix 5 broken example scripts from skyrl-train migration [train][examples] Fix 8 broken example scripts from skyrl-train migration Mar 2, 2026
@CharlieFRuan CharlieFRuan marked this pull request as ready for review March 2, 2026 19:22
@CharlieFRuan
Member Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request provides a series of well-justified fixes for 8 broken example scripts, stemming from a recent migration. The changes include updating stale paths, adding missing required configuration parameters, fixing an incorrect package name, and removing a duplicate function registration. Additionally, several scripts have been improved by adding "$@" to allow passthrough of command-line arguments, enhancing their usability and consistency. The migration of main_generate.py from Hydra to the custom SkyRLTrainConfig parser is a good modernization that aligns it with other entrypoints. The changes are correct and improve the overall quality and reliability of the example scripts.


@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses 8 bugs across various example scripts in examples/train and examples/train_integrations, which arose from the skyrl-train to skyrl/train migration. The fixes involve updating incorrect file paths in shell scripts, adding missing required configuration parameters for specific training setups (like LoRA with PPO), correcting a dependency package name for OpenEnv integration, and removing a duplicate policy loss registration. Additionally, several shell scripts have been updated to pass through command-line arguments, enhancing their flexibility. A notable change is the refactoring of the main_generate.py entrypoint to use SkyRLTrainConfig.from_cli_overrides() for configuration, moving away from the legacy Hydra loader to align with the project's current standards and fix issues with nested configuration keys. My review of the changes did not find any issues.

@CharlieFRuan CharlieFRuan requested a review from SumanthRH March 2, 2026 19:27

@SumanthRH SumanthRH left a comment


Thanks!

Left a few nits for pending comment fixes.

CharlieFRuan and others added 4 commits March 2, 2026 12:41
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
@CharlieFRuan CharlieFRuan merged commit f73e309 into NovaSky-AI:main Mar 2, 2026
0 of 2 checks passed
@CharlieFRuan CharlieFRuan deleted the validate-examples branch March 2, 2026 20:41

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 6 additional findings in Devin Review.


return loss, {"clip_ratio": 0.0}


class OnPolicyDistillationExp(BasePPOExp):


🟡 Removing custom importance_sampling loss silently changes loss computation semantics for on-policy distillation example

The PR removes the custom importance_sampling policy loss registration from main_on_policy_distill.py to avoid a duplicate-registration crash with the built-in one in ppo_utils.py. However, the two implementations have fundamentally different loss reduction behavior.

Behavioral difference between old custom and built-in implementations

The old custom implementation (removed in this PR) used a normalized reduction:

loss = reduce_loss(loss, loss_mask, "seq_mean_token_sum_norm", config.max_seq_len)
return loss, {"clip_ratio": 0.0}

The built-in implementation at skyrl/backends/skyrl_train/utils/ppo_utils.py:966-980 uses a raw sum:

loss = (elementwise_loss * loss_mask).sum()
return loss, {"importance_ratio": mean_ratio.item()}

reduce_loss with "seq_mean_token_sum_norm" computes per-sequence token-sum normalized by max_seq_len, then takes a batch mean — producing a loss that is invariant to batch size and sequence length. The built-in .sum() produces a loss that scales linearly with both, yielding very different gradient magnitudes.

Additionally, the metrics dict changed from {"clip_ratio": 0.0} to {"importance_ratio": ...}, which may affect downstream logging that expects the clip_ratio key.

Impact: Users of the on-policy distillation example will silently get a different (unnormalized) loss computation, which could lead to training instability or require learning rate re-tuning.
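The scale difference is easy to check with toy numbers. The helpers below are simplified pure-Python re-implementations written for this comment (the real reduce_loss and built-in loss operate on tensors inside skyrl-train); they only model how the two reductions scale.

```python
def seq_mean_token_sum_norm(per_token_loss, loss_mask, max_seq_len):
    # Per-sequence masked token sum, normalized by max_seq_len, then batch
    # mean: invariant to batch size and (padded) sequence length.
    per_seq = [
        sum(l * m for l, m in zip(seq, mask)) / max_seq_len
        for seq, mask in zip(per_token_loss, loss_mask)
    ]
    return sum(per_seq) / len(per_seq)

def raw_sum(per_token_loss, loss_mask):
    # Built-in style reduction: masked sum over every token in the batch,
    # so the loss grows linearly with batch size and sequence length.
    return sum(
        l * m
        for seq, mask in zip(per_token_loss, loss_mask)
        for l, m in zip(seq, mask)
    )

loss = [[0.5, 0.5, 0.0], [1.0, 1.0, 1.0]]
mask = [[1, 1, 0], [1, 1, 1]]

print(seq_mean_token_sum_norm(loss, mask, max_seq_len=4))  # 0.5
print(raw_sum(loss, mask))                                 # 4.0
# Doubling the batch leaves the normalized loss at 0.5 but doubles the sum.
print(raw_sum(loss * 2, mask * 2))                         # 8.0
```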

Prompt for agents
In examples/train/on_policy_distillation/main_on_policy_distill.py, the removal of the custom importance_sampling registration now silently delegates to the built-in importance_sampling_loss in skyrl/backends/skyrl_train/utils/ppo_utils.py (line 933-980), which uses a raw .sum() reduction instead of the original reduce_loss(..., 'seq_mean_token_sum_norm', config.max_seq_len). To preserve the original behavior, either:

1. Update the built-in importance_sampling_loss in skyrl/backends/skyrl_train/utils/ppo_utils.py (lines 964-980) to use reduce_loss instead of .sum(), and return {"clip_ratio": 0.0} in the metrics dict for consistency with other loss functions, OR

2. Keep the custom registration in main_on_policy_distill.py but use PolicyLossRegistry.unregister / PolicyLossRegistry.register to replace the built-in, OR

3. Add a comment in the on-policy distillation run scripts (run_on_policy_distill_math_qwen3_1.7b.sh and run_on_policy_distill_math_qwen3_4b.sh) noting that loss_reduction should be set to seq_mean_token_sum_norm and the learning rate may need adjustment since the built-in importance_sampling uses raw sum reduction.
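Option 2 can be sketched with a toy registry. Only the unregister/register method names come from the comment above; the class body, loss functions, and signatures below are stand-ins, not the real skyrl-train PolicyLossRegistry API.

```python
class PolicyLossRegistry:
    """Toy registry with the duplicate check that caused the startup crash."""
    _fns = {}

    @classmethod
    def register(cls, name, fn):
        if name in cls._fns:
            raise ValueError(f"policy loss '{name}' already registered")
        cls._fns[name] = fn

    @classmethod
    def unregister(cls, name):
        cls._fns.pop(name, None)

    @classmethod
    def get(cls, name):
        return cls._fns[name]

# Stand-in losses: raw sum (built-in style) vs a normalized variant.
def builtin_importance_sampling(per_token_loss):
    return sum(per_token_loss)

def custom_importance_sampling(per_token_loss):
    return sum(per_token_loss) / len(per_token_loss)

PolicyLossRegistry.register("importance_sampling", builtin_importance_sampling)

# Re-registering the same name directly would raise the ValueError seen at
# startup; unregister first, then swap in the normalized variant.
PolicyLossRegistry.unregister("importance_sampling")
PolicyLossRegistry.register("importance_sampling", custom_importance_sampling)

print(PolicyLossRegistry.get("importance_sampling")([1.0, 2.0, 3.0]))  # 2.0
```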




Hi @CharlieFRuan, the importance_sampling reduction behavior is indeed changed from seq_mean_token_sum_norm to sum. Is this expected?
