[skyrl-train] Refactor TIS to use more comprehensive off policy correction config #849
Conversation
Code Review
This pull request refactors the Truncated Importance Sampling (TIS) configuration into a more comprehensive rollout_correction system, which is a great improvement for structure and extensibility. The new implementation adds flexible rollout correction mechanisms, including different TIS ratio types and rejection masks. The changes are well-documented and handle the deprecation of old parameters gracefully. I've identified a bug in a conditional check that could cause a crash, and an opportunity to refactor for better efficiency and code clarity. My detailed feedback is in the comments below.
…kyRL into rollout_correction
… unite metrics under loss_metrics, other clean up
/gemini review
Code Review
This pull request introduces a significant refactoring of the off-policy correction mechanism, replacing the simple TIS flags with a more comprehensive off_policy_correction configuration. This is a great improvement for flexibility and experimentation. The changes are well-implemented across the codebase, including documentation, examples, and tests. I've identified a few critical bugs in the implementation and some areas for improvement in the examples and utility functions to enhance clarity and correctness. Please see the detailed comments below.
skyrl-train/examples/flash_rl/run_dapo_gsm8k_flashrl_0.5b_fp8.sh (outdated, resolved)
skyrl-train/examples/flash_rl/run_dapo_gsm8k_flashrl_0.5b_int8.sh (outdated, resolved)
/gemini review
CharlieFRuan left a comment
Made an initial round of review. Will take another round!
> - ``tau_neg``: Temperature for gating function for tokens with negative (or zero) advantages.
>
> Off Policy Correction Configuration
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let's cite the blogpost here as well
Depends on how you add the separate correction doc page (see other comment). But it'd be easier for the user if we can do the following: basically, help the users understand each config (3 groups of them) one by one by pointing them to other resources.
1. Group these three together:
   - `algorithm.off_policy_correction.tis_ratio_type`
   - `algorithm.off_policy_correction.token_tis_ratio_clip_high`
   - `algorithm.off_policy_correction.sequence_tis_ratio_clip_high`
and tell them:
- how to do the basic TIS proposed in https://fengyao.notion.site/Your-Efficient-RL-Framework-Secretly-Brings-You-Off-Policy-RL-Training-237721e3f6c48094ad67dad3ac091c56
- i.e. token level, and a default clip value
- what is sequence level, the difference, and the motivation of doing that; perhaps simply refer to section 4.2 of https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda#27b211a558b78099ba48fa8849ab54c8
2. Then group these together:

       sequence_mask_metric: null # null, "product", "geometric"
       geo_mask_high: 1.01
       geo_mask_low: 0.99
       product_mask_high: 2.0
       product_mask_low: 0.5
3. Then group the outlier thresholds together:

       outlier_token_is_threshold_low: 1e-4
       outlier_token_is_threshold_high: 100
Other remarks: pointing to our implementation would also be helpful, namely `rollout_corrections.py` or whatever name you decide on in the end.
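The group-2 rejection masks above can be sketched as follows. This is a hypothetical illustration on plain Python floats; the function name is invented here, and the real implementation in the PR operates on torch tensors:

```python
import math

def sequence_mask(token_ratios, metric, low, high):
    """Keep (True) or reject (False) a sequence based on an aggregate importance ratio.

    token_ratios are per-token ratios pi_old / pi_rollout for one sequence.
    """
    if metric == "product":
        # product of per-token importance ratios, gated by product_mask_low/high
        stat = math.prod(token_ratios)
    elif metric == "geometric":
        # geometric mean: the product taken to the power 1/T, gated by geo_mask_low/high
        stat = math.prod(token_ratios) ** (1.0 / len(token_ratios))
    else:
        # metric is null: no rejection at all
        return True
    return low <= stat <= high
```

With the defaults quoted above, the geometric gate (0.99–1.01) is far tighter than the product gate (0.5–2.0), since the geometric mean normalizes away sequence length.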
> ```python
>     return y.to(out_dtype or delta.dtype)
>
> def compute_tis_ratio(
> ```
These are great! Can we put them into a separate file? Our ppo_utils.py is 1.4k LOC now.
In the long term we could break ppo_utils.py down, but for now let's create an off_policy_correction_utils.py (or some other name you see fit) with all the methods you added. We can keep the rest where they currently are and come back later if we want to clean up further.
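As a rough sketch of what such a utilities module could contain, here are float-level analogues of the quoted helpers. The names mirror the diff context, but the bodies are illustrative only, not the PR's actual torch implementation:

```python
import math

def safe_exp_delta(delta, clip=20.0):
    """exp(delta) with the exponent clamped so extreme logprob gaps cannot overflow."""
    return math.exp(max(-clip, min(clip, delta)))

def compute_tis_ratio(old_logprob, rollout_logprob, clip_high):
    """Token-level truncated importance sampling ratio, clipped from above.

    The ratio pi_old / pi_rollout is recovered from the logprob difference.
    """
    ratio = safe_exp_delta(old_logprob - rollout_logprob)
    return min(ratio, clip_high)
```

A sequence-level variant would aggregate the per-token log differences over the whole sequence before exponentiating and clipping.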
> ```python
>     return 0.5 * loss, clipfrac
>
> class LossMetrics(TypedDict, total=False):
> ```
This single-field TypedDict is a bit confusing. I know we can extend it with a lot of fields depending on the setup. Is there a better solution? Do we plan to add more fields to this? If not, should we remove this class for now?
Hmm, you're right; let me just change the convention to return a plain Python dictionary of metrics.
> ```python
> @register_policy_loss(PolicyLossType.REGULAR)
> @register_policy_loss(PolicyLossType.DUAL_CLIP)
> def ppo_policy_loss(
> ```
The return is typed as `Tuple[torch.Tensor, float]`, which isn't correct, right? It currently returns `loss_metrics`. Depending on what we do with `LossMetrics` as noted in the other comment, we could make it `dict[str, float]`.
> ```python
>     tis_imp_ratio = _safe_exp_delta(old_log_probs - rollout_logprobs, clip=20.0, out_dtype=log_probs.dtype)
>     tis_imp_ratio = torch.clamp(tis_imp_ratio, max=config.tis_imp_ratio_cap)
>     loss = loss * tis_imp_ratio
>     # apply off policy correction
> ```
These seem redundant; the same block is used in SAPO, GSPO, CISPO, and PPO. Can we write a functional helper to extract it out?
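One possible shape for such a helper, sketched on plain Python floats rather than torch tensors (the name and signature are hypothetical, not from the PR):

```python
import math

def apply_token_tis(per_token_loss, old_logprobs, rollout_logprobs, ratio_cap):
    """Scale each token's loss by its capped importance ratio pi_old / pi_rollout.

    Each policy loss (PPO, GSPO, SAPO, CISPO) could call this once instead of
    repeating the exp/clamp/multiply block inline.
    """
    corrected = []
    for loss, old_lp, roll_lp in zip(per_token_loss, old_logprobs, rollout_logprobs):
        # recover the importance ratio from the logprob gap, then cap it
        ratio = min(math.exp(old_lp - roll_lp), ratio_cap)
        corrected.append(loss * ratio)
    return corrected
```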
> ```python
>     return loss, loss_metrics
>
> @register_policy_loss(PolicyLossType.SAPO)
> ```
Might be a dumb question: why is off-policy correction only used in sapo, gspo, cispo, and ppo, not the other loss functions like compute_policy_loss_clip_cov, compute_policy_loss_kl_cov?
Hmm, it was because the covariance calculation could include masked-out samples, so just adding the sequence masking before reduce_loss didn't seem sufficient.
I think we could add it, but I would vote to just skip it for now since these are not commonly used anyway.
CharlieFRuan left a comment
Such a neat PR!! Thank you so much!
Added some nits, after addressing them please feel free to merge!
> ```python
>     return pg_loss, {"clip_ratio": 0.0}
>
> @register_policy_loss(PolicyLossType.CROSS_ENTROPY)
> ```
Should this return `loss, {"clip_ratio": 0.0}` and change the return type to `Tuple[torch.Tensor, dict[str, float]]`? We only use it in SFT, but it might make sense to keep it consistent.
This doc is just great! Thank you so much for the effort!
Some really minor nits:
- For a user who just wants to pick a correction config, they might not want to read everything. Could we give some quick pointers at the top (like a TL;DR: here are the configs you can start from)? For example, just use

      algorithm.off_policy_correction.tis_ratio_type=xxx
      algorithm.off_policy_correction.token_tis_ratio_clip_high=xxx

  if you want to do the most popular (or basic?) TIS. Use xxx if you want to follow this blog, etc.
- And could we add a reference section at the top or the bottom as well, please?
> And could we add a reference section at the top or the bottom as well, please?
already there! (just didn't screenshot)





Overview
- `trainer.algorithm.use_tis` and `trainer.algorithm.tis_imp_ratio_cap` kept for deprecation
- New `trainer.algorithm.off_policy_correction` config (see new config below)
- `LossMetrics` TypedDict containing loss metrics (previously returned just `loss, clip_ratio`)

Off Policy Correction Config
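Piecing together the keys quoted in the review comments above, the new block presumably looks roughly like the following. Key names and the mask/outlier defaults come from the reviewer's quotes; the `tis_ratio_type` values and clip defaults are illustrative assumptions:

```yaml
algorithm:
  off_policy_correction:
    tis_ratio_type: token              # assumed values: "token" or "sequence"
    token_tis_ratio_clip_high: 2.0     # illustrative default
    sequence_tis_ratio_clip_high: 2.0  # illustrative default
    sequence_mask_metric: null         # null, "product", "geometric"
    geo_mask_high: 1.01
    geo_mask_low: 0.99
    product_mask_high: 2.0
    product_mask_low: 0.5
    outlier_token_is_threshold_low: 1e-4
    outlier_token_is_threshold_high: 100
```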