
Add entropy loss #622

Merged
SumanthRH merged 14 commits into NovaSky-AI:main from pbokc:add_entropy_loss
Nov 17, 2025

Conversation

@pbokc (Contributor) commented Nov 4, 2025

Add entropy loss to address #583

Ran FSDP training with Qwen2.5-0.5B-Instruct both with and without entropy loss: policy entropy remained roughly stable with entropy loss off, and increased rapidly with entropy loss on at a coefficient of 10.
[W&B chart: 11/5/2025, 8:18:42 PM]
[W&B chart: 11/5/2025, 8:19:56 PM]
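For context on what the change implements: an entropy bonus subtracts a scaled policy-entropy term from the PPO loss, so higher-entropy (more exploratory) policies incur a lower loss. A minimal pure-Python sketch of the idea, with hypothetical function names (only the `use_entropy_loss` and `entropy_loss_coef` config names come from this PR):

```python
import math

def categorical_entropy(probs):
    """Shannon entropy H(p) = -sum_i p_i * log(p_i) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def loss_with_entropy_bonus(policy_loss, probs, use_entropy_loss=True, entropy_loss_coef=10.0):
    # Subtracting the scaled entropy lowers the loss for high-entropy policies,
    # which pushes the optimizer toward more exploratory behavior.
    if use_entropy_loss:
        return policy_loss - entropy_loss_coef * categorical_entropy(probs)
    return policy_loss
```

With a large coefficient (the PR's experiment used 10), the entropy term dominates and entropy climbs rapidly, matching the behavior described above.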

@pbokc pbokc changed the title [WIP] Add entropy loss Add entropy loss Nov 6, 2025
@pbokc pbokc marked this pull request as ready for review November 6, 2025 23:06
@SumanthRH (Member)

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request adds entropy loss to the PPO algorithm, which is a valuable addition for encouraging exploration. The implementation is mostly correct, but there are a few areas for improvement regarding efficiency, code duplication, and dead code. I've left specific comments with suggestions to address these points.

@SumanthRH (Member)

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request adds support for entropy loss regularization in the PPO algorithm. The changes look good overall, introducing the necessary configuration options and modifying the training loop to incorporate the entropy term in the loss function. I've identified a couple of areas for improvement: one is a redundant piece of code in worker.py that can be simplified for clarity, and the other is an inconsistency in how policy_loss is calculated and logged in the Megatron worker compared to the standard worker. Addressing these points will improve the code's clarity and consistency.

Comment on lines +241 to +242
if self.cfg.trainer.algorithm.use_entropy_loss:
policy_loss = policy_loss - self.cfg.trainer.algorithm.entropy_loss_coef * entropy_scalar
@gemini-code-assist bot (severity: medium)

This in-place modification of policy_loss leads to an inconsistency in logging. The logged policy_loss metric will include the entropy term, which is inconsistent with the non-Megatron worker (worker.py) where policy_loss only contains the PPO loss. For consistent and clearer metrics across implementations, it's better to not modify policy_loss here. Instead, you could add the entropy loss term directly to the final loss variable on line 257.
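The reviewer's suggestion can be sketched as follows, with illustrative names (not the actual worker code): keep `policy_loss` untouched for logging, and fold the entropy bonus into the final combined loss instead.

```python
def combine_losses(policy_loss, entropy, use_entropy_loss, entropy_loss_coef):
    # Start from the pure PPO policy loss and add the entropy bonus
    # only to the final loss, never back into policy_loss itself.
    loss = policy_loss
    if use_entropy_loss:
        loss = loss - entropy_loss_coef * entropy
    # The unchanged policy_loss is what gets logged, keeping the metric
    # comparable between the Megatron and non-Megatron workers.
    return loss, policy_loss
```

The design point is separation of concerns: the optimizer sees the combined loss, while each logged metric reports exactly one component.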

Member:

Can you make this fix @pbokc

@pbokc (Contributor Author):

updated

@pbokc (Contributor Author) commented Nov 13, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request introduces entropy loss to the PPO training algorithm, adding configurations and updating the model wrappers and workers. The implementation correctly computes entropy and adds it to the loss function to encourage policy exploration. However, I've identified a couple of issues. In both the Megatron and general worker implementations, the entropy loss is not applied conditionally based on the use_entropy_loss flag, which could lead to incorrect behavior if misconfigured. Additionally, the Megatron worker could be optimized to avoid unnecessary gradient calculations for entropy, consistent with the general worker's implementation. I've provided specific comments and code suggestions to address these points.

@pbokc (Contributor Author) commented Nov 13, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request adds support for entropy loss to the PPO training algorithm. The changes include adding configuration options for enabling entropy loss and setting its coefficient, modifying the model wrappers to compute entropy with gradients when needed, and updating the loss calculation in the training workers to include the entropy term.

The implementation is mostly correct and follows the pattern of other loss components. However, I found a critical bug in skyrl_train/workers/worker.py where the entropy loss is multiplied by a boolean flag instead of the configured coefficient. This would cause the entropy loss to be incorrectly scaled. I've provided a specific comment and suggestion to fix this.

Once this issue is addressed, the changes look good to go.
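To illustrate the bug the bot describes (a sketch with hypothetical names, not the repo's actual code): multiplying by the boolean `use_entropy_loss` flag scales the entropy term by 1 (or 0), silently ignoring `entropy_loss_coef`.

```python
def buggy_loss(policy_loss, entropy, use_entropy_loss, entropy_loss_coef):
    # Bug: the boolean flag is used as the scale factor, so the entropy
    # term is scaled by 0 or 1 regardless of the configured coefficient.
    return policy_loss - use_entropy_loss * entropy

def fixed_loss(policy_loss, entropy, use_entropy_loss, entropy_loss_coef):
    # Fix: gate on the flag, scale by the coefficient.
    if use_entropy_loss:
        return policy_loss - entropy_loss_coef * entropy
    return policy_loss
```

This class of bug is easy to miss because Python happily treats `True` as `1` in arithmetic, so training runs without errors, just with the wrong effective coefficient.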

pbokc and others added 6 commits November 12, 2025 23:09
@SumanthRH (Member)

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request introduces entropy loss regularization to the PPO algorithm, which is a valuable addition for promoting exploration. The implementation is well-integrated into both the standard and Megatron training workflows, including configuration options and updated loss calculations. The accompanying tests are thorough, covering various combinations of KL and entropy loss.

My review focuses on a critical correctness issue related to device placement for tensors, which could lead to runtime errors, and a suggestion for improving maintainability by refactoring duplicated code. Addressing these points will make the implementation more robust and easier to maintain.

@SumanthRH (Member)

Thanks @pbokc! I made some updates to include this in the tests for the FSDP and Megatron backends, and also added the final loss (after aggregation and normalization with accumulation steps) as final_loss.

@SumanthRH SumanthRH merged commit 2e9e4be into NovaSky-AI:main Nov 17, 2025
3 checks passed
@pbokc pbokc deleted the add_entropy_loss branch November 17, 2025 22:27
@SumanthRH SumanthRH mentioned this pull request Nov 20, 2025
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026