[skyrl-train] Refactor TIS to use more comprehensive off policy correction config #849
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -389,6 +389,45 @@ Algorithm Configuration | |
| # dual clip parameters | ||
| clip_ratio_c: 3.0 | ||
|
|
||
| # To be deprecated in favor of off_policy_correction.tis_ratio_type = "token" | ||
| # and "token_tis_ratio_clip_high" | ||
| use_tis: false | ||
| tis_imp_ratio_cap: -1.0 | ||
|
|
||
| # references | ||
| # - https://github.com/szrlee/verl/blob/yingru/rollout_correction/docs/advance/rollout_corr_math.md | ||
| # - https://fengyao.notion.site/off-policy-rl | ||
| off_policy_correction: | ||
| # type of importance sampling ratio to use for ppo loss correction | ||
| # here importance sampling ratio refers to exp(logprobs_{policy_old} - logprobs_{rollout_policy}) | ||
| tis_ratio_type: null # null, "token", "sequence" | ||
|
|
||
| # used if tis_ratio_type = "token", 1.5-5.0 is recommended for "token" tis_ratio_type | ||
| token_tis_ratio_clip_high: 2.0 | ||
| # used if tis_ratio_type = "sequence", 2.0-10.0 is recommended for "sequence" tis_ratio_type | ||
| sequence_tis_ratio_clip_high: 5.0 | ||
|
|
||
| # method of masking out sequences with cumulative importance sampling ratios outside the cap | ||
| # "product" masks out sequences with product of importance ratios outside the cap | ||
| # "geometric" masks out sequences with geometric mean of importance ratios outside the cap | ||
| sequence_mask_metric: null # null, "product", "geometric" | ||
|
|
||
| # used if sequence_mask_metric = "geometric" | ||
| # values around 0.99-1.01 are recommended for "geometric" sequence_mask_metric - MoE models may need larger allowed ranges due to higher mismatch | ||
| geo_mask_high: 1.01 | ||
| geo_mask_low: 0.99 | ||
|
|
||
| # used if sequence_mask_metric = "product" | ||
| # values around 0.5-2.0 are recommended for "product" sequence_mask_metric | ||
| product_mask_high: 2.0 | ||
| product_mask_low: 0.5 | ||
|
|
||
| # separate from sequence_mask_metric and tis_ratio_type | ||
| # if any off_policy_correction is enabled, masks out sequences with any token having importance ratio | ||
| # far outside an acceptable range (low and high thresholds) | ||
| outlier_token_is_threshold_low: 1e-4 | ||
|
||
| outlier_token_is_threshold_high: 100 | ||
|
|
||
| # clip-cov parameters (only used when policy_loss_type: "clip_cov") | ||
| clip_cov: | ||
| clip_ratio: 0.0002 # fraction of tokens to clip based on covariance | ||
|
|
@@ -413,10 +452,6 @@ Algorithm Configuration | |
| type: null # filter (DAPO), replace (POLARIS/WebSailor), or null | ||
| max_sample_batches: 30 # sample at most this many batches before stopping, -1 to sample forever | ||
| min_replace_ratio: 0.3 # minimum proportion of good samples with which to replace bad samples (for replace strategy only) | ||
|
|
||
| # Truncated Importance Sampling as proposed in https://fengyao.notion.site/off-policy-rl | ||
| use_tis: false | ||
| tis_imp_ratio_cap: -1.0 | ||
|
|
||
| # SAPO parameters (only used when policy_loss_type: "sapo") (https://arxiv.org/pdf/2511.20347) | ||
| sapo: | ||
|
|
@@ -466,8 +501,8 @@ Algorithm Configuration | |
| - ``algorithm.dynamic_sampling.type``: Type of dynamic sampling to use. Currently, we support ``filter`` (`DAPO <https://dapo-sia.github.io/>`_), ``replace`` (`POLARIS <https://hkunlp.github.io/blog/2025/Polaris/>`_ / `WebSailor <https://arxiv.org/abs/2507.02592>`_), or ``null`` for no dynamic sampling. | ||
| - ``algorithm.dynamic_sampling.max_sample_batches``: Maximum number of batches to sample before stopping. Set to ``-1`` to sample forever. | ||
| - ``algorithm.dynamic_sampling.min_replace_ratio``: Minimum proportion of good samples with which to replace bad samples for ``replace`` strategy. | ||
| - ``algorithm.use_tis``: Whether to use Truncated Importance Sampling (TIS) as proposed in `this blog <https://fengyao.notion.site/off-policy-rl>`_. | ||
| - ``algorithm.tis_imp_ratio_cap``: Cap parameter for the importance ratio in TIS. | ||
| - ``algorithm.use_tis``: Whether to use Truncated Importance Sampling (TIS) as proposed in `this blog <https://fengyao.notion.site/off-policy-rl>`_. This flag is to be deprecated; use ``off_policy_correction.tis_ratio_type = "token"`` instead. | ||
| - ``algorithm.tis_imp_ratio_cap``: Cap parameter for the importance ratio in TIS. This flag is to be deprecated; use ``off_policy_correction.token_tis_ratio_clip_high`` instead. | ||
| - ``algorithm.clip_cov``: Clip-Cov parameters (only used when ``policy_loss_type`` is ``clip_cov``): | ||
|
|
||
| - ``clip_ratio``: Fraction of tokens to clip based on covariance values. | ||
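The deprecation notes on ``algorithm.use_tis`` and ``algorithm.tis_imp_ratio_cap`` above imply a one-to-one mapping onto the new ``off_policy_correction`` keys. A minimal sketch of that mapping follows. The helper name ``migrate_tis_flags`` and the plain-dict representation of the config are assumptions made for illustration; this is not code from the PR.

```python
from typing import Any, Dict


def migrate_tis_flags(algorithm_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: map the legacy TIS flags onto off_policy_correction.

    Works on a plain dict copy of the algorithm config; the input is not mutated.
    """
    cfg = dict(algorithm_cfg)
    opc = dict(cfg.get("off_policy_correction") or {})

    # Only translate when the legacy flag is on and the new key is unset,
    # so an explicit off_policy_correction setting always wins.
    if cfg.get("use_tis") and opc.get("tis_ratio_type") is None:
        opc["tis_ratio_type"] = "token"
        if "tis_imp_ratio_cap" in cfg:
            opc["token_tis_ratio_clip_high"] = cfg["tis_imp_ratio_cap"]

    cfg["off_policy_correction"] = opc
    return cfg


legacy = {"use_tis": True, "tis_imp_ratio_cap": 2.0}
print(migrate_tis_flags(legacy)["off_policy_correction"])
```

Letting an explicitly set ``tis_ratio_type`` take precedence is a design choice of this sketch, not something the diff specifies.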
|
|
@@ -489,6 +524,35 @@ Algorithm Configuration | |
| - ``tau_pos``: Temperature for gating function for tokens with positive advantages. | ||
| - ``tau_neg``: Temperature for gating function for tokens with negative (or zero) advantages. | ||
|
|
||
| Off Policy Correction Configuration | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
Member
Let's cite the blogpost here as well.
Member
Depends on how you add the separate correction doc page (see other comment). But it'd be easier for the user if we can do the following: help the users understand each config (3 groups of them) one by one by pointing them to other resources.
1. Group these three together and tell them:
2. Then group these together
3. Then group the outlier threshold together
Other remarks: pointing to our implementation would also be helpful. Namely the
|
||
| - ``algorithm.off_policy_correction``: Off policy correction configuration. See the full configuration below: | ||
|
|
||
| .. code-block:: yaml | ||
|
|
||
| off_policy_correction: | ||
| tis_ratio_type: null # null, "token", "sequence" | ||
| token_tis_ratio_clip_high: 2.0 | ||
| sequence_tis_ratio_clip_high: 5.0 | ||
| sequence_mask_metric: null # null, "product", "geometric" | ||
| geo_mask_high: 1.01 | ||
| geo_mask_low: 0.99 | ||
| product_mask_high: 2.0 | ||
| product_mask_low: 0.5 | ||
| outlier_token_is_threshold_low: 1e-4 | ||
| outlier_token_is_threshold_high: 100 | ||
|
|
||
| - ``algorithm.off_policy_correction.tis_ratio_type``: Type of importance sampling ratio to use for PPO loss correction, where the importance sampling ratio is ``exp(logprobs_{policy_old} - logprobs_{rollout_policy})``. Options include: ``null``, ``token``, ``sequence``. | ||
| - ``algorithm.off_policy_correction.token_tis_ratio_clip_high``: Cap parameter for "token" tis_ratio_type. | ||
| - ``algorithm.off_policy_correction.sequence_tis_ratio_clip_high``: Cap parameter for "sequence" tis_ratio_type. | ||
| - ``algorithm.off_policy_correction.sequence_mask_metric``: Method of masking out sequences with cumulative importance sampling ratios outside the cap. Options include: ``null``, ``product``, ``geometric``. | ||
| - ``algorithm.off_policy_correction.geo_mask_high``: High threshold for "geometric" sequence_mask_metric. | ||
| - ``algorithm.off_policy_correction.geo_mask_low``: Low threshold for "geometric" sequence_mask_metric. | ||
| - ``algorithm.off_policy_correction.product_mask_high``: High threshold for "product" sequence_mask_metric. | ||
| - ``algorithm.off_policy_correction.product_mask_low``: Low threshold for "product" sequence_mask_metric. | ||
| - ``algorithm.off_policy_correction.outlier_token_is_threshold_low``: Low threshold for outlier token mask - masks out sequences with any token having importance ratio far outside an acceptable range (low and high thresholds). | ||
| - ``algorithm.off_policy_correction.outlier_token_is_threshold_high``: High threshold for outlier token mask - masks out sequences with any token having importance ratio far outside an acceptable range (low and high thresholds). | ||
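To make the ``tis_ratio_type`` options above concrete, here is a minimal sketch of how truncated importance weights could be computed for one sequence. It uses plain Python lists in place of tensors, and it assumes that the "sequence" ratio is the product of the per-token ratios; the diff does not spell that out. The function name ``tis_weights`` is hypothetical and is not the PR's implementation.

```python
import math
from typing import List, Optional


def tis_weights(
    old_logprobs: List[float],
    rollout_logprobs: List[float],
    ratio_type: Optional[str],
    clip_high: float,
) -> List[float]:
    """Per-token loss weights for one sequence (illustrative sketch only)."""
    # log of exp(logprobs_{policy_old} - logprobs_{rollout_policy}) per token
    log_ratios = [o - r for o, r in zip(old_logprobs, rollout_logprobs)]

    if ratio_type is None:
        # No correction: every token keeps weight 1.
        return [1.0] * len(log_ratios)
    if ratio_type == "token":
        # Each token gets its own ratio, truncated at clip_high.
        return [min(math.exp(lr), clip_high) for lr in log_ratios]
    if ratio_type == "sequence":
        # Assumption: one ratio per sequence, the product of token ratios,
        # truncated at clip_high and shared by every token.
        seq_ratio = min(math.exp(sum(log_ratios)), clip_high)
        return [seq_ratio] * len(log_ratios)
    raise ValueError(f"unknown tis_ratio_type: {ratio_type!r}")


old = [math.log(0.5), math.log(0.5)]
rollout = [math.log(0.25), math.log(0.1)]
print(tis_weights(old, rollout, "token", clip_high=2.0))
print(tis_weights(old, rollout, "sequence", clip_high=5.0))
```

In this example the raw token ratios are about 2 and 5, so the second token is truncated under the token cap of 2.0, and the sequence ratio of about 10 is truncated under the sequence cap of 5.0.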
|
|
||
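The two ``sequence_mask_metric`` options and the outlier token thresholds described above can be sketched the same way. This is an illustration under stated assumptions, not the PR's code: the function names are hypothetical, sequences are plain lists of per-token log-ratios, and bounds are treated as inclusive.

```python
import math
from typing import List, Optional


def sequence_within_bounds(
    log_ratios: List[float], metric: Optional[str], low: float, high: float
) -> bool:
    """True if the sequence is kept under the chosen sequence_mask_metric."""
    if metric is None:
        return True  # sequence masking disabled
    if not log_ratios:
        raise ValueError("expected at least one token")
    total = sum(log_ratios)
    if metric == "product":
        stat = math.exp(total)  # product of token ratios
    elif metric == "geometric":
        stat = math.exp(total / len(log_ratios))  # geometric mean of token ratios
    else:
        raise ValueError(f"unknown sequence_mask_metric: {metric!r}")
    return low <= stat <= high


def has_outlier_token(
    log_ratios: List[float], low: float = 1e-4, high: float = 100.0
) -> bool:
    """True if any single token ratio falls outside [low, high]."""
    return any(not (low <= math.exp(lr) <= high) for lr in log_ratios)


lr = [math.log(1.2)] * 3
print(sequence_within_bounds(lr, "product", 0.5, 2.0))
print(sequence_within_bounds(lr, "geometric", 0.99, 1.01))
print(has_outlier_token([0.0, math.log(1e-5)]))
```

The geometric mean is length-normalized, which is consistent with its recommended range (0.99-1.01) being much tighter than the product range (0.5-2.0): three tokens with ratio 1.2 give a product of 1.728, inside the product bounds, but a geometric mean of 1.2, outside the geometric bounds.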
| Policy Loss Formulation | ||
| ~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
|
|
@@ -502,7 +566,7 @@ It can be helpful to understand the final loss formulation to see how the differ | |
| advantages: torch.Tensor, | ||
| config: DictConfig, # trainer.algorithm config | ||
| loss_mask: Optional[torch.Tensor] = None, | ||
| ) -> torch.Tensor: | ||
| ) -> Tuple[torch.Tensor, LossMetrics]: | ||
|
|
||
| ratio = (log_probs - old_log_probs).exp() | ||
| surr1 = ratio * advantages | ||
|
|
@@ -515,7 +579,7 @@ It can be helpful to understand the final loss formulation to see how the differ | |
| clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1) | ||
| loss = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1) | ||
| loss = reduce_loss(loss, loss_mask, config.loss_reduction) | ||
| return loss, clip_ratio | ||
| return loss, LossMetrics(clip_ratio=clip_ratio) | ||
|
|
||
|
|
||
| Generator Configuration | ||
|
|
||