[train][CI] Add regression thresholds for E2E CI runs #1199
Conversation
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Code Review
This pull request introduces regression testing for E2E CI runs by adding a Python script to check metrics from wandb and updating CI shell scripts to use it. The changes are logical and well-structured. I've provided a few suggestions to improve the implementation, mainly focusing on making the run identification more robust, improving code readability, and following best practices.
```python
runs = api.runs(f"{args.project_name}", order="-created_at", per_page=50)
# pages are fetched lazily
matched_run = None
for run in runs:
    if run.name == args.run_name:
        matched_run = run
        break
```
For better performance and cleaner code, you can use the wandb API's filtering capabilities to fetch the specific run directly, instead of fetching a list of runs and iterating through them. You can replace this block with a more direct approach like the following:

```python
# get latest run with the run name
# get all runs in the project in the order of latest to oldest
runs = api.runs(f"{args.project_name}", filters={"display_name": args.run_name}, order="-created_at")
matched_run = next(iter(runs), None)
```

```shell
# Eval and train accuracy should be greater than the threshold
# Average number of tokens generated should decrease over time
# Policy rollout train logprobs absolute difference should be small
uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py --run_name $RUN_NAME --project_name "gsm8k_ci" --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
```
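The internals of `get_summary.py` are not shown in this diff, so the following is only a sketch of how assertion strings of the form `"metric >= value"` could be parsed and checked against a run's summary dict. The function name `check_asserts` and the parsing scheme are my assumptions, not code from this PR:

```python
import operator

# Comparison tokens supported in assertion strings like "eval/all/avg_score >= 0.5".
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt, "==": operator.eq}

def check_asserts(summary: dict, asserts: list[str]) -> list[str]:
    """Return failure messages for assertions that do not hold; empty list means all passed.

    Each assertion is a string "<metric> <op> <value>", e.g.
    "generate/avg_num_tokens <= 250". (Hypothetical helper, not from the PR.)
    """
    failures = []
    for expr in asserts:
        for token, fn in OPS.items():
            if f" {token} " in expr:
                metric, _, value = expr.partition(f" {token} ")
                actual = summary[metric.strip()]
                if not fn(actual, float(value)):
                    failures.append(f"{expr} failed: actual={actual}")
                break
        else:
            failures.append(f"could not parse assertion: {expr!r}")
    return failures
```

A CI script using this shape would exit non-zero when the returned list is non-empty, which is what makes the GitHub workflow step fail.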
This line is quite long and can be hard to read. For better readability and maintainability, you can break it into multiple lines using backslashes, like this:

```shell
uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py \
  --run_name $RUN_NAME \
  --project_name "gsm8k_ci" \
  --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" \
  "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" \
  "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" \
  "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
```

```shell
#!/usr/bin/env bash
set -euo pipefail

RUN_NAME="run_$(date +%Y%m%d%H)"
```
The current `RUN_NAME` format `run_$(date +%Y%m%d%H)` might not be unique if multiple CI runs are triggered within the same hour. This could lead to the test script checking the wrong run. To ensure uniqueness, consider adding minutes and seconds to the timestamp.
```diff
- RUN_NAME="run_$(date +%Y%m%d%H)"
+ RUN_NAME="run_$(date +%Y%m%d%H%M%S)"
```
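To illustrate the collision risk: two runs started minutes apart produce identical names at hour granularity, but distinct names at second granularity. A quick Python check mirroring the shell `date` format strings (the two example timestamps are invented):

```python
from datetime import datetime

# Two hypothetical CI runs triggered 10 minutes apart within the same hour.
a = datetime(2024, 7, 1, 14, 5, 0)
b = datetime(2024, 7, 1, 14, 15, 0)

# Hour-granularity names collide; second-granularity names do not.
hour_a, hour_b = a.strftime("run_%Y%m%d%H"), b.strftime("run_%Y%m%d%H")
full_a, full_b = a.strftime("run_%Y%m%d%H%M%S"), b.strftime("run_%Y%m%d%H%M%S")

print(hour_a == hour_b)  # both runs get the same name under %Y%m%d%H
print(full_a == full_b)  # %Y%m%d%H%M%S keeps them distinct
```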
```shell
trainer.project_name=\"$PROJECT_NAME\" \
trainer.run_name=\"$RUN_NAME\"

uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py --run_name $RUN_NAME --project_name $PROJECT_NAME --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
```
This line is quite long and can be hard to read. For better readability and maintainability, you can break it into multiple lines using backslashes, like this:

```shell
uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py \
  --run_name $RUN_NAME \
  --project_name $PROJECT_NAME \
  --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" \
  "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" \
  "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" \
  "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
```

```diff
- trainer.run_name=\"run_$(date +%Y%m%d%H)\"
+ trainer.run_name=\"$RUN_NAME\" trainer.project_name=\"gsm8k_fully_async_ci\"
```

```shell
uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py --run_name $RUN_NAME --project_name "gsm8k_fully_async_ci" --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" "generate/avg_num_tokens <= $AVG_NUM_TOKENS_MAX_VALUE" "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
```
This line is quite long and can be hard to read. For better readability and maintainability, you can break it into multiple lines using backslashes, like this:

```shell
uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py \
  --run_name $RUN_NAME \
  --project_name "gsm8k_fully_async_ci" \
  --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" \
  "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" \
  "generate/avg_num_tokens <= $AVG_NUM_TOKENS_MAX_VALUE" \
  "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
```

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
```shell
trainer.project_name=\"$PROJECT_NAME\" \
trainer.run_name=\"$RUN_NAME\"

uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py --run_name $RUN_NAME --project_name $PROJECT_NAME --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
```
I've always thought that our CI runs will just end after we hit the time limit. If so, will this get_summary script run at all?
The CI workflow will fail when it hits the timeout, yes.
The get_summary script will not run in that case, and that seems fine by design, because the GitHub workflow will already have failed.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
What does this PR do?
Adds regression thresholds for E2E CI runs.
Methodology
Currently, I've chosen the following metrics:
- Eval accuracy (`eval/all/avg_score`)
- Train accuracy (`loss/avg_final_rewards`)
- Average number of tokens generated (`generate/avg_num_tokens`)
- Policy rollout/train logprobs absolute difference (`policy/rollout_train_logprobs_abs_diff_mean`)

For each metric, I found the min/max values over the previous 10 CI runs and used a threshold with a 5% allowance. I found this to be reasonable for all the metrics chosen.
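The 5% allowance above can be computed mechanically: for metrics that must stay high, take the min over recent runs and subtract 5%; for metrics that must stay low, take the max and add 5%. A minimal sketch, where the helper names and the example values are illustrative rather than the actual CI history:

```python
def lower_threshold(history: list[float], allowance: float = 0.05) -> float:
    """Lower bound for metrics that must stay high (e.g. eval/all/avg_score)."""
    return min(history) * (1 - allowance)

def upper_threshold(history: list[float], allowance: float = 0.05) -> float:
    """Upper bound for metrics that must stay low (e.g. generate/avg_num_tokens)."""
    return max(history) * (1 + allowance)

# Illustrative values standing in for the last 10 CI runs, not real data.
eval_scores = [0.52, 0.55, 0.50, 0.53, 0.54, 0.51, 0.56, 0.52, 0.53, 0.55]
print(lower_threshold(eval_scores))  # min of the history scaled down by 5%
```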