Skip to content

[train] Add explicit finish calls for tracker when training ends#1198

Merged
SumanthRH merged 2 commits intomainfrom
explicit-finish
Feb 24, 2026
Merged

[train] Add explicit finish calls for tracker when training ends#1198
SumanthRH merged 2 commits intomainfrom
explicit-finish

Conversation

@SumanthRH
Copy link
Member

@SumanthRH SumanthRH commented Feb 23, 2026

What does this PR do?

Adds an explicit .finish call for the trackers (ex: Wandb) when training ends.

I noticed that some of our E2E CI runs seem to be "finishing early" / missing some datapoints (Internal link)

(Expected an entry at step 10 but some runs only have eval metric at step 5).

This is primarily because we don't have an explicit wandb.finish() call at the end of training. While we do have the same logic in __del__, this is not guaranteed to be called and thus we can end up with missing entries. Further the status of the run is marked as "crashed" even though the program exit code was 0.

Typically I refer to the fsdp backend based ci runs (Internal link), which seem to have the values for all 7/7 steps logged. I suspect this is because of the difference in the final cleanup steps that happen for each backend:

FSDP:

(skyrl_entrypoint pid=9555) 2026-02-23 10:16:03.192 | INFO     | skyrl.train.trainer:save_checkpoints:1236 - Finished: 'cleanup_old_checkpoints', time cost: 7.34s
(skyrl_entrypoint pid=9555) 2026-02-23 10:16:03.192 | INFO     | skyrl.train.trainer:train:346 - Saved final checkpoint.
(skyrl_entrypoint pid=9555) 2026-02-23 10:16:03.193 | INFO     | skyrl.train.trainer:train:344 - Finished: 'save_checkpoints', time cost: 15.72s
(skyrl_entrypoint pid=9555) 2026-02-23 10:16:03.193 | INFO     | skyrl.train.trainer:train:351 - Training done!

Megatron:

(skyrl_entrypoint pid=9799) 2026-02-22 09:36:44.136 | INFO     | skyrl.train.trainer:save_checkpoints:1236 - Finished: 'cleanup_old_checkpoints', time cost: 0.01s
(skyrl_entrypoint pid=9799) 2026-02-22 09:36:44.136 | INFO     | skyrl.train.trainer:train:346 - Saved final checkpoint.
(skyrl_entrypoint pid=9799) 2026-02-22 09:36:44.136 | INFO     | skyrl.train.trainer:train:344 - Finished: 'save_checkpoints', time cost: 9.00s
(skyrl_entrypoint pid=9799) 2026-02-22 09:36:44.136 | INFO     | skyrl.train.trainer:train:351 - Training **done!**

Checkpointing time with the Megatron backend is faster and that is probably why wandb's background process didn't get enough time to upload the final step's metrics after step end.

All changes are made only to the new skyrl/ package


Open with Devin

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an explicit .finish() call for trackers at the end of training sessions. This is a solid improvement over relying on __del__, which is not guaranteed to be called, ensuring that all tracking data is properly saved. The implementation refactors the cleanup logic into a new finish() method within the Tracking class and integrates this call across various trainer implementations. The changes are clean and effectively address the described issue of incomplete data logging. I have one suggestion to enhance the robustness of the new finish method.

devin-ai-integration[bot]

This comment was marked as resolved.

x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH SumanthRH merged commit 48cf603 into main Feb 24, 2026
5 checks passed
@SumanthRH SumanthRH deleted the explicit-finish branch February 26, 2026 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant