[skyrl-train] Implement loss reduction via advantage normalization and fix `token_mean` reduction strategy #1296

justinvyu wants to merge 11 commits into NovaSky-AI:main

Conversation
… scale loss by dp_size for FSDP/Megatron parity Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…omparison Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…uction # Conflicts: # skyrl/backends/skyrl_train/utils/ppo_utils.py # skyrl/train/fully_async_trainer.py # skyrl/train/trainer.py # tests/backends/skyrl_train/gpu/test_grpo_sp_sanity.py
…ritic, rename token_mean_baseline to token_mean_legacy Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
… add unit tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review
This pull request introduces a significant refactoring of the loss reduction mechanism. The core change moves the reduction logic from a monolithic reduce_loss function to pre-scaling advantages based on the desired reduction strategy. This aligns with Tinker's API, making the system more modular and explicit. The changes also correctly handle gradient accumulation in distributed settings (DDP/FSDP/Megatron) by scaling the loss to counteract the default mean-reduction of gradients, effectively performing a sum. The refactoring is extensive, touching core training logic, worker implementations, and associated tests. My review identified a couple of issues related to metric reporting where losses are incorrectly scaled, which could lead to misleading monitoring data. Apart from that, the changes appear solid and well-implemented.
```python
"final_loss": unscaled_loss.detach().item() * dp_size,
"policy_loss": policy_loss.detach().item() * dp_size,
```
The reported final_loss and policy_loss metrics are being scaled by dp_size. Since these metrics are summed across micro-batches and then sum-reduced across data-parallel ranks, this will result in the total loss being over-reported by a factor of dp_size. The loss scaling is necessary for correct gradient computation, but for metric reporting, the unscaled loss should be used to reflect the true total loss.
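A hypothetical pure-Python illustration of the over-reporting (not the repo's code): reported `*_loss` metrics are sum-reduced across data-parallel ranks, so scaling each rank's reported value by `dp_size` inflates the logged total by that factor.

```python
# Per-rank unscaled losses on a hypothetical dp_size=4 setup.
dp_size = 4
per_rank_losses = [1.0, 2.0, 3.0, 4.0]

# What we want to log: the true total loss across DP ranks.
true_total = sum(per_rank_losses)

# What gets logged if each rank reports loss * dp_size before the sum-reduce.
over_reported = sum(l * dp_size for l in per_rank_losses)

print(true_total)      # 10.0
print(over_reported)   # 40.0 == dp_size * true_total
```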
Suggested change:

```diff
-"final_loss": unscaled_loss.detach().item() * dp_size,
-"policy_loss": policy_loss.detach().item() * dp_size,
+"final_loss": unscaled_loss.detach().item(),
+"policy_loss": policy_loss.detach().item(),
```
@justinvyu i think i remember i made these changes to get these metrics matching and to be invariant to dp size... but just want to check - were the metric scales roughly similar for FSDP vs megatron on your runs?
for the 1.7b runs could you paste these metrics for megatron vs fsdp?
```diff
 "final_loss": loss.item(),
-"policy_loss": policy_loss.item(),
+"policy_loss": policy_loss.item() * loss_scale,
```
The reported final_loss and policy_loss metrics are being scaled by loss_scale (which is dp_size). Since these metrics are summed across micro-batches and then sum-reduced across data-parallel ranks, this will result in the total loss being over-reported by a factor of dp_size. While loss scaling is correct for the backward pass to counteract DDP's mean reduction, the reported metrics should be based on the unscaled loss to accurately reflect the total loss.
Suggested change:

```diff
-"final_loss": loss.item(),
-"policy_loss": policy_loss.item() * loss_scale,
+"final_loss": unscaled_loss.item(),
+"policy_loss": policy_loss.item(),
```
erictang000 left a comment
this looks almost good to merge, super clean! thanks for adding the `token_mean_legacy` path
just want to check my understanding + 1 question about the metrics code that I think I probably wrote on the old PR...
… mini-batch reduction
- Report unscaled loss metrics (remove `* loss_scale` / `* dp_size`) in both FSDP and Megatron workers
- Rename `reduce_metrics` -> `reduce_metrics_across_microbatches` (sums `_loss` for gradient accumulation)
- Add `reduce_metrics_across_minibatches` in trainer_utils (averages `_loss` for logging)
- Use sum all-reduce for `_loss` keys across DP workers to reconstruct full mini-batch loss

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```python
"final_loss": unscaled_loss.detach().item(),
"policy_loss": policy_loss.detach().item(),
```
Metrics fix 1: remove the `dp_size` multiplier from the reported metrics. There's no average to correct for, because `reduce_microbatch_metrics` and `all_reduce_metrics` both sum the `*_loss` metrics.
```diff
 # pop out loss_fn_outputs since it's not a scalar metric and to avoid logging it
 all_metrics.pop("loss_fn_outputs", None)
-reduced_metrics = reduce_metrics(all_metrics)
+reduced_metrics = reduce_metrics_across_minibatches(all_metrics)
```
Metrics fix 2: take an average across minibatches instead of still summing. The loss reduction normalization happens at the minibatch level, so across different minibatches we should just average; otherwise we'd inflate the reported loss scale by a factor of ~`num_minibatches`.
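A hedged sketch of the two-level metric reduction (function names are from the PR, bodies are simplified guesses): `*_loss` metrics are summed across microbatches to match gradient accumulation, then averaged across minibatches since each minibatch loss is already fully normalized.

```python
def reduce_metrics_across_microbatches(metrics):
    # Sum *_loss keys (gradient accumulation sums micro-batch losses);
    # average everything else.
    return {
        k: sum(v) if k.endswith("_loss") else sum(v) / len(v)
        for k, v in metrics.items()
    }

def reduce_metrics_across_minibatches(metrics):
    # Each minibatch loss is already normalized, so just average.
    return {k: sum(v) / len(v) for k, v in metrics.items()}

micro = reduce_metrics_across_microbatches({"policy_loss": [0.5, 0.25], "kl": [0.5, 1.5]})
mini = reduce_metrics_across_minibatches({"policy_loss": [0.75, 0.25]})
print(micro)  # {'policy_loss': 0.75, 'kl': 1.0}
print(mini)   # {'policy_loss': 0.5}
```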
```diff
 all_metrics[k].append(v)

-return reduce_metrics(dict(all_metrics))
+return reduce_metrics_across_microbatches(dict(all_metrics))
```
critic codepath may need to be reverted since it doesn't use the advantages?
Summary
- Refactor `reduce_loss()` to always return a simple masked sum (`(loss * mask).sum()`). To achieve different reduction strategies, we pre-scale the advantages before they enter the loss function, which also aligns with how Tinker's API handles it.
- Scale the loss before `backward()` to counteract the default data-parallel mean gradient all-reduce across workers, so it effectively does a sum instead.
- Fix the `token_mean` loss reduction method to take a mean across all tokens in the minibatch rather than averaging across microbatches. The old loss reduction is still available via the `token_mean_legacy` config.

Loss reduction strategies
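A hedged, pure-Python sketch of the scheme described in the summary (simplified stand-in for the tensor code; helper names are illustrative): `reduce_loss` is always a masked sum, and `token_mean` is achieved by pre-scaling the advantages by `1/num_minibatch_tokens`.

```python
def reduce_loss(per_token_loss, mask):
    # Always a simple masked sum -- no mean/normalization baked in.
    return sum(l * m for l, m in zip(per_token_loss, mask))

def prescale_token_mean(advantages, num_minibatch_tokens):
    # Fold the 1/num_tokens normalization of "token_mean" into the advantages.
    return [a / num_minibatch_tokens for a in advantages]

advantages = [0.5, -1.0, 2.0, 0.25]
ratios = [1.0, 1.0, 1.0, 1.0]  # policy ratios, fixed at 1.0 for simplicity
mask = [1, 1, 1, 1]
n = sum(mask)

# Old-style masked mean vs. new-style masked sum of pre-scaled loss.
old = reduce_loss([-a * r for a, r in zip(advantages, ratios)], mask) / n
new = reduce_loss(
    [-a * r for a, r in zip(prescale_token_mean(advantages, n), ratios)], mask
)
assert old == new
```

The point is that the reduction choice lives entirely in the advantage pre-scaling, so the loss function itself stays a plain sum.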
Option 1: `token_mean`

Option 1b: `token_mean_legacy`

The `token_mean` behavior before this PR.

Option 2: `sequence_mean`

Option 3: `seq_mean_token_sum_norm`
Mean all-reduce -> sum all-reduce
We need the loss to be summed across microbatches and data parallel workers:
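A small sketch of the assumed mechanism (matching the PR description, not the actual worker code): DDP/FSDP mean-reduces gradients across `dp_size` ranks, so scaling each rank's loss by `dp_size` before `backward()` turns that mean into the sum we want.

```python
dp_size = 4
per_rank_grads = [1.0, 2.0, 3.0, 4.0]  # hypothetical per-rank gradients

# Default DDP behavior: mean across ranks.
ddp_mean = sum(per_rank_grads) / dp_size

# With loss scaled by dp_size on each rank, the mean becomes a sum.
scaled = sum(g * dp_size for g in per_rank_grads) / dp_size

wanted_sum = sum(per_rank_grads)
assert scaled == wanted_sum
print(ddp_mean, scaled)  # 2.5 10.0
```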
Tinker compatibility
Here was the first attempt at fixing the loss reduction across microbatches: #909
This method was to track total tokens and then do one big normalization at the `optim_step` in order to get an average per-token loss. But we decided to align with Tinker's way of just summing up the loss at the end, pushing any loss normalization to the user's advantage calculation.

The benefit is that users have full control over customizing their loss reduction strategy, rather than having it happen in our opaque `forward_backward` / `optim_step` implementation, which would require some configuration argument that diverges from Tinker's API. For example, we would need to add a config somewhere to determine how to average/sum the loss.

The current PR aligns with Tinker semantics:
Example for `loss_reduction="token_mean"`: fold the `1/num_minibatch_tokens` normalization into the advantage:

```
loss = sum( -advantage_i * ratio_i for i in range(num_minibatch_tokens) ) / num_minibatch_tokens
     = sum( -(advantage_i / num_minibatch_tokens) * ratio_i for i in range(num_minibatch_tokens) )
```

Learning curve comparisons before/after the PR
FSDP (wandb)
Megatron (wandb)
1.7B:
30B lora:
master baseline:

`token_mean_legacy` + fixed `token_mean`: