[train] Add explicit `finish` calls for tracker when training ends by SumanthRH · Pull Request #1198 · NovaSky-AI/SkyRL

SumanthRH · 2026-02-23T23:44:36Z

What does this PR do?

Adds an explicit .finish call for the trackers (ex: Wandb) when training ends.

I noticed that some of our E2E CI runs seem to be "finishing early" / missing some datapoints (Internal link)

(Expected an entry at step 10 but some runs only have eval metric at step 5).

This is primarily because we don't have an explicit wandb.finish() call at the end of training. While we do have the same logic in __del__, this is not guaranteed to be called and thus we can end up with missing entries. Further the status of the run is marked as "crashed" even though the program exit code was 0.

Typically I refer to the fsdp backend based ci runs (Internal link), which seem to have the values for all 7/7 steps logged. I suspect this is because of the difference in the final cleanup steps that happen for each backend:

FSDP:

(skyrl_entrypoint pid=9555) 2026-02-23 10:16:03.192 | INFO     | skyrl.train.trainer:save_checkpoints:1236 - Finished: 'cleanup_old_checkpoints', time cost: 7.34s
(skyrl_entrypoint pid=9555) 2026-02-23 10:16:03.192 | INFO     | skyrl.train.trainer:train:346 - Saved final checkpoint.
(skyrl_entrypoint pid=9555) 2026-02-23 10:16:03.193 | INFO     | skyrl.train.trainer:train:344 - Finished: 'save_checkpoints', time cost: 15.72s
(skyrl_entrypoint pid=9555) 2026-02-23 10:16:03.193 | INFO     | skyrl.train.trainer:train:351 - Training done!

Megatron:

(skyrl_entrypoint pid=9799) 2026-02-22 09:36:44.136 | INFO     | skyrl.train.trainer:save_checkpoints:1236 - Finished: 'cleanup_old_checkpoints', time cost: 0.01s
(skyrl_entrypoint pid=9799) 2026-02-22 09:36:44.136 | INFO     | skyrl.train.trainer:train:346 - Saved final checkpoint.
(skyrl_entrypoint pid=9799) 2026-02-22 09:36:44.136 | INFO     | skyrl.train.trainer:train:344 - Finished: 'save_checkpoints', time cost: 9.00s
(skyrl_entrypoint pid=9799) 2026-02-22 09:36:44.136 | INFO     | skyrl.train.trainer:train:351 - Training **done!**

Checkpointing time with the Megatron backend is faster and that is probably why wandb's background process didn't get enough time to upload the final step's metrics after step end.

All changes are made only to the new skyrl/ package

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

gemini-code-assist

Code Review

This pull request introduces an explicit .finish() call for trackers at the end of training sessions. This is a solid improvement over relying on __del__, which is not guaranteed to be called, ensuring that all tracking data is properly saved. The implementation refactors the cleanup logic into a new finish() method within the Tracking class and integrates this call across various trainer implementations. The changes are clean and effectively address the described issue of incomplete data logging. I have one suggestion to enhance the robustness of the new finish method.

skyrl/train/utils/tracking.py

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

explicit finish

442c176

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

gemini-code-assist bot reviewed Feb 23, 2026

View reviewed changes

skyrl/train/utils/tracking.py Show resolved Hide resolved

This comment was marked as resolved.

Sign in to view

x

cc5c5fb

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>

SumanthRH merged commit 48cf603 into main Feb 24, 2026
5 checks passed

SumanthRH deleted the explicit-finish branch February 26, 2026 23:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[train] Add explicit `finish` calls for tracker when training ends#1198

[train] Add explicit `finish` calls for tracker when training ends#1198
SumanthRH merged 2 commits intomainfrom
explicit-finish

SumanthRH commented Feb 23, 2026 •

edited by devin-ai-integration bot

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

SumanthRH commented Feb 23, 2026 • edited by devin-ai-integration bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SumanthRH commented Feb 23, 2026 •

edited by devin-ai-integration bot

Loading