Conversation
/gemini review
Code Review
This pull request adds entropy loss to the PPO algorithm, which is a valuable addition for encouraging exploration. The implementation is mostly correct, but there are a few areas for improvement regarding efficiency, code duplication, and dead code. I've left specific comments with suggestions to address these points.
(Two earlier review comments on skyrl-train/skyrl_train/workers/megatron/megatron_model_wrapper.py — outdated, resolved.)
/gemini review
Code Review
This pull request adds support for entropy loss regularization in the PPO algorithm. The changes look good overall, introducing the necessary configuration options and modifying the training loop to incorporate the entropy term in the loss function. I've identified a couple of areas for improvement: one is a redundant piece of code in worker.py that can be simplified for clarity, and the other is an inconsistency in how policy_loss is calculated and logged in the Megatron worker compared to the standard worker. Addressing these points will improve the code's clarity and consistency.
```python
if self.cfg.trainer.algorithm.use_entropy_loss:
    policy_loss = policy_loss - self.cfg.trainer.algorithm.entropy_loss_coef * entropy_scalar
```
This in-place modification of policy_loss leads to an inconsistency in logging. The logged policy_loss metric will include the entropy term, which is inconsistent with the non-Megatron worker (worker.py) where policy_loss only contains the PPO loss. For consistent and clearer metrics across implementations, it's better to not modify policy_loss here. Instead, you could add the entropy loss term directly to the final loss variable on line 257.
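The suggested separation can be sketched in plain Python (a minimal illustration, not the project's code — `use_entropy_loss` and `entropy_loss_coef` are the config keys from this PR, everything else is hypothetical):

```python
def combine_losses(policy_loss: float, entropy: float,
                   use_entropy_loss: bool, entropy_loss_coef: float) -> float:
    """Leave policy_loss untouched for logging; fold the entropy
    bonus into the final loss only."""
    loss = policy_loss
    if use_entropy_loss:
        # Subtracting the entropy term rewards higher-entropy policies.
        loss -= entropy_loss_coef * entropy
    return loss

final_loss = combine_losses(policy_loss=2.0, entropy=0.5,
                            use_entropy_loss=True, entropy_loss_coef=0.1)
# policy_loss stays 2.0 in the logged metrics; final_loss is 2.0 - 0.1 * 0.5 = 1.95
```

With this shape, the `policy_loss` metric means the same thing in both workers, and only the optimizer sees the entropy term.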
/gemini review
Code Review
This pull request introduces entropy loss to the PPO training algorithm, adding configurations and updating the model wrappers and workers. The implementation correctly computes entropy and adds it to the loss function to encourage policy exploration. However, I've identified a couple of issues. In both the Megatron and general worker implementations, the entropy loss is not applied conditionally based on the use_entropy_loss flag, which could lead to incorrect behavior if misconfigured. Additionally, the Megatron worker could be optimized to avoid unnecessary gradient calculations for entropy, consistent with the general worker's implementation. I've provided specific comments and code suggestions to address these points.
(Earlier review comment on skyrl-train/skyrl_train/workers/megatron/megatron_model_wrapper.py — outdated, resolved.)
/gemini review
Code Review
This pull request adds support for entropy loss to the PPO training algorithm. The changes include adding configuration options for enabling entropy loss and setting its coefficient, modifying the model wrappers to compute entropy with gradients when needed, and updating the loss calculation in the training workers to include the entropy term.
The implementation is mostly correct and follows the pattern of other loss components. However, I found a critical bug in skyrl_train/workers/worker.py where the entropy loss is multiplied by a boolean flag instead of the configured coefficient. This would cause the entropy loss to be incorrectly scaled. I've provided a specific comment and suggestion to fix this.
Once this issue is addressed, the changes look good to go.
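To make the scaling bug concrete, here is a pure-Python sketch (hypothetical values; the flag and coefficient names match the config keys discussed in this PR):

```python
use_entropy_loss = True   # boolean config flag
entropy_loss_coef = 0.01  # configured coefficient
entropy = 0.5
policy_loss = 2.0

# Bug: multiplying by the boolean flag scales entropy by 1.0, not 0.01,
# so the entropy term dominates regardless of the configured coefficient.
buggy_loss = policy_loss - use_entropy_loss * entropy
# Fix: gate on the flag, but scale by the coefficient.
fixed_loss = policy_loss - (entropy_loss_coef * entropy if use_entropy_loss else 0.0)
# buggy_loss == 1.5 (coefficient ignored); fixed_loss == 1.995
```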
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
/gemini review
Code Review
This pull request introduces entropy loss regularization to the PPO algorithm, which is a valuable addition for promoting exploration. The implementation is well-integrated into both the standard and Megatron training workflows, including configuration options and updated loss calculations. The accompanying tests are thorough, covering various combinations of KL and entropy loss.
My review focuses on a critical correctness issue related to device placement for tensors, which could lead to runtime errors, and a suggestion for improving maintainability by refactoring duplicated code. Addressing these points will make the implementation more robust and easier to maintain.
Thanks @pbokc ! I made some updates to include this in the tests for FSDP and Megatron backends, and also added the final loss (after aggregation and normalization with accumulation steps) as a logged metric.
Add entropy loss to address #583
Ran FSDP with and without entropy loss with Qwen2.5-0.5B-Instruct: policy entropy remained roughly stable with entropy loss off, and increased rapidly with entropy loss on with a coefficient of 10.

