[train][CI] Add regression thresholds for E2E CI runs #1199

Merged
SumanthRH merged 11 commits into main from regression-threshold on Feb 24, 2026
Conversation

SumanthRH (Member) commented Feb 24, 2026

What does this PR do?

Adds regression thresholds for E2E CI runs.

Methodology

Currently, I've chosen the following metrics:

  1. Evaluation reward (eval/all/avg_score)
  2. Train reward (loss/avg_final_rewards)
  3. Avg num tokens generated (training) (generate/avg_num_tokens)
  4. Rollout logprobs train mean difference (policy/rollout_train_logprobs_abs_diff_mean)

For each metric, I found the min/max values over the previous 10 CI runs and set a threshold with a 5% allowance. I found this to be reasonable for all the chosen metrics.
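The threshold derivation described above can be sketched in a few lines of pure Python. This is a minimal illustration of the methodology, not code from this PR; `compute_bounds` is a hypothetical helper name, and the sample scores are made-up values:

```python
def compute_bounds(history, allowance=0.05):
    """Derive regression bounds from historical CI metric values.

    Lower bound: min over history, relaxed downward by `allowance`.
    Upper bound: max over history, relaxed upward by `allowance`.
    Note: the multiplicative allowance assumes positive-valued metrics.
    """
    lo, hi = min(history), max(history)
    return lo * (1 - allowance), hi * (1 + allowance)

# Example: eval/all/avg_score over the previous 10 CI runs (made-up values)
eval_scores = [0.71, 0.73, 0.70, 0.72, 0.74, 0.71, 0.72, 0.73, 0.70, 0.72]
min_bound, max_bound = compute_bounds(eval_scores)
# A new run would be flagged as a regression if its eval score
# drops below min_bound (here 0.70 * 0.95 = 0.665).
```

For a "lower is better" metric such as `policy/rollout_train_logprobs_abs_diff_mean`, only the upper bound would be asserted.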


@SumanthRH SumanthRH changed the title [train] Add regression thresholds for E2E CI runs [train][CI] Add regression thresholds for E2E CI runs Feb 24, 2026
devin-ai-integration[bot]: comment marked as resolved.
gemini-code-assist bot (Contributor) left a comment
Code Review

This pull request introduces regression testing for E2E CI runs by adding a Python script to check metrics from wandb and updating CI shell scripts to use it. The changes are logical and well-structured. I've provided a few suggestions to improve the implementation, mainly focusing on making the run identification more robust, improving code readability, and following best practices.

Comment on lines +31 to +37
runs = api.runs(f"{args.project_name}", order="-created_at", per_page=50)
# pages are fetched lazily
matched_run = None
for run in runs:
    if run.name == args.run_name:
        matched_run = run
        break
medium

For better performance and cleaner code, you can use the wandb API's filtering capabilities to fetch the specific run directly, instead of fetching a list of runs and iterating through them. You can replace this block with a more direct approach like the following:

# get latest run with the run name
# get all runs in the project in the order of latest to oldest
runs = api.runs(f"{args.project_name}", filters={"display_name": args.run_name}, order="-created_at")
matched_run = next(iter(runs), None)

# Eval and train accuracy should be greater than the threshold
# Average number of tokens generated should decrease over time
# Policy rollout train logprobs absolute difference should be small
uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py --run_name $RUN_NAME --project_name "gsm8k_ci" --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
medium

This line is quite long and can be hard to read. For better readability and maintainability, you can break it into multiple lines using backslashes, like this:

uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py \
    --run_name $RUN_NAME \
    --project_name "gsm8k_ci" \
    --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" \
              "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" \
              "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" \
              "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
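Each `--asserts` argument above is a simple `metric <op> value` expression. One way such expressions could be evaluated against a run's summary dict is sketched below; `check_assert` is a hypothetical helper name, not the PR's actual get_summary.py implementation:

```python
import operator

# Supported comparisons; two-character symbols listed first so that
# ">=" is matched before ">" (and "<=" before "<").
_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

def check_assert(expr, summary):
    """Evaluate an expression like 'eval/all/avg_score >= 0.6'
    against a run-summary dict. Returns True if the assertion holds."""
    for sym, fn in _OPS.items():
        if sym in expr:
            key, value = expr.split(sym, 1)
            return fn(float(summary[key.strip()]), float(value))
    raise ValueError(f"no comparison operator in {expr!r}")

summary = {"eval/all/avg_score": 0.72, "generate/avg_num_tokens": 180.0}
assert check_assert("eval/all/avg_score >= 0.6", summary)       # passes
assert not check_assert("generate/avg_num_tokens <= 150", summary)  # regression
```

A failing assertion would then exit nonzero, which is what makes the CI shell step fail.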

#!/usr/bin/env bash
set -euo pipefail

RUN_NAME="run_$(date +%Y%m%d%H)"
medium

The current RUN_NAME format run_$(date +%Y%m%d%H) might not be unique if multiple CI runs are triggered within the same hour. This could lead to the test script checking the wrong run. To ensure uniqueness, consider adding minutes and seconds to the timestamp.

Suggested change
RUN_NAME="run_$(date +%Y%m%d%H)"
RUN_NAME="run_$(date +%Y%m%d%H%M%S)"
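The collision the reviewer describes is easy to demonstrate: two invocations within the same hour produce identical hour-granularity names, while second-granularity names stay distinct. A small sketch (not from the PR):

```shell
# Hour-granularity names collide for runs started in the same hour;
# adding %M%S makes back-to-back names distinct.
name_a="run_$(date +%Y%m%d%H%M%S)"
sleep 1
name_b="run_$(date +%Y%m%d%H%M%S)"
echo "$name_a"
echo "$name_b"
```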

trainer.project_name=\"$PROJECT_NAME\" \
trainer.run_name=\"$RUN_NAME\"

uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py --run_name $RUN_NAME --project_name $PROJECT_NAME --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
medium

This line is quite long and can be hard to read. For better readability and maintainability, you can break it into multiple lines using backslashes, like this:

uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py \
    --run_name $RUN_NAME \
    --project_name $PROJECT_NAME \
    --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" \
              "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" \
              "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" \
              "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"

trainer.run_name=\"run_$(date +%Y%m%d%H)\"
trainer.run_name=\"$RUN_NAME\" trainer.project_name=\"gsm8k_fully_async_ci\"

uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py --run_name $RUN_NAME --project_name "gsm8k_fully_async_ci" --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" "generate/avg_num_tokens <= $AVG_NUM_TOKENS_MAX_VALUE" "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
medium

This line is quite long and can be hard to read. For better readability and maintainability, you can break it into multiple lines using backslashes, like this:

uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py \
    --run_name $RUN_NAME \
    --project_name "gsm8k_fully_async_ci" \
    --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" \
              "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" \
              "generate/avg_num_tokens <= $AVG_NUM_TOKENS_MAX_VALUE" \
              "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"

trainer.project_name=\"$PROJECT_NAME\" \
trainer.run_name=\"$RUN_NAME\"

uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py --run_name $RUN_NAME --project_name $PROJECT_NAME --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
Member
I've always thought that our CI runs just end after we hit the time limit. If so, will this get_summary script run at all?

Member Author
The CI workflow will fail when it hits the timeout, yes.

The get_summary script will not run in that case, and that seems fine by design, because the GitHub workflow will already fail.

devin-ai-integration[bot]: comment marked as resolved.
@SumanthRH SumanthRH merged commit 5f30619 into main Feb 24, 2026
5 checks passed
@SumanthRH SumanthRH deleted the regression-threshold branch February 26, 2026 23:48