[train][CI] Add regression thresholds for E2E CI runs #1199
Conversation
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Code Review
This pull request introduces regression testing for E2E CI runs by adding a Python script to check metrics from wandb and updating CI shell scripts to use it. The changes are logical and well-structured. I've provided a few suggestions to improve the implementation, mainly focusing on making the run identification more robust, improving code readability, and following best practices.
```python
runs = api.runs(f"{args.project_name}", order="-created_at", per_page=50)
# pages are fetched lazily
matched_run = None
for run in runs:
    if run.name == args.run_name:
        matched_run = run
        break
```
For better performance and cleaner code, you can use the wandb API's filtering capabilities to fetch the specific run directly, instead of fetching a list of runs and iterating through them. You can replace this block with a more direct approach like the following:

```python
# get latest run with the run name
# get all runs in the project in the order of latest to oldest
runs = api.runs(f"{args.project_name}", filters={"display_name": args.run_name}, order="-created_at")
matched_run = next(iter(runs), None)
```

```shell
# Eval and train accuracy should be greater than the threshold
# Average number of tokens generated should decrease over time
# Policy rollout train logprobs absolute difference should be small
uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py --run_name $RUN_NAME --project_name "gsm8k_ci" --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
```
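The internals of `get_summary.py` are not shown in this diff, so the following is only a sketch of how assertion strings of the form `"metric >= value"` could be parsed and checked against a run's summary dict. The function name `check_asserts` and the parsing scheme are my assumptions, not code from this PR:

```python
import operator

# Comparison tokens supported in assertion strings like "eval/all/avg_score >= 0.5".
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt, "==": operator.eq}

def check_asserts(summary: dict, asserts: list[str]) -> list[str]:
    """Return failure messages for assertions that do not hold; empty list means all passed.

    Each assertion is a string "<metric> <op> <value>", e.g.
    "generate/avg_num_tokens <= 250". (Hypothetical helper, not from the PR.)
    """
    failures = []
    for expr in asserts:
        for token, fn in OPS.items():
            if f" {token} " in expr:
                metric, _, value = expr.partition(f" {token} ")
                actual = summary[metric.strip()]
                if not fn(actual, float(value)):
                    failures.append(f"{expr} failed: actual={actual}")
                break
        else:
            failures.append(f"could not parse assertion: {expr!r}")
    return failures
```

A CI script using this shape would exit non-zero when the returned list is non-empty, which is what makes the GitHub workflow step fail.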
This line is quite long and can be hard to read. For better readability and maintainability, you can break it into multiple lines using backslashes, like this:

```shell
uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py \
  --run_name $RUN_NAME \
  --project_name "gsm8k_ci" \
  --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" \
  "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" \
  "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" \
  "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
```

```shell
#!/usr/bin/env bash
set -euo pipefail

RUN_NAME="run_$(date +%Y%m%d%H)"
```
The current `RUN_NAME` format `run_$(date +%Y%m%d%H)` might not be unique if multiple CI runs are triggered within the same hour. This could lead to the test script checking the wrong run. To ensure uniqueness, consider adding minutes and seconds to the timestamp.
```diff
- RUN_NAME="run_$(date +%Y%m%d%H)"
+ RUN_NAME="run_$(date +%Y%m%d%H%M%S)"
```
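To illustrate the collision risk: two runs started minutes apart produce identical names at hour granularity, but distinct names at second granularity. A quick Python check mirroring the shell `date` format strings (the two example timestamps are invented):

```python
from datetime import datetime

# Two hypothetical CI runs triggered 10 minutes apart within the same hour.
a = datetime(2024, 7, 1, 14, 5, 0)
b = datetime(2024, 7, 1, 14, 15, 0)

# Hour-granularity names collide; second-granularity names do not.
hour_a, hour_b = a.strftime("run_%Y%m%d%H"), b.strftime("run_%Y%m%d%H")
full_a, full_b = a.strftime("run_%Y%m%d%H%M%S"), b.strftime("run_%Y%m%d%H%M%S")

print(hour_a == hour_b)  # both runs get the same name under %Y%m%d%H
print(full_a == full_b)  # %Y%m%d%H%M%S keeps them distinct
```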
```shell
trainer.project_name=\"$PROJECT_NAME\" \
trainer.run_name=\"$RUN_NAME\"

uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py --run_name $RUN_NAME --project_name $PROJECT_NAME --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
```
This line is quite long and can be hard to read. For better readability and maintainability, you can break it into multiple lines using backslashes, like this:

```shell
uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py \
  --run_name $RUN_NAME \
  --project_name $PROJECT_NAME \
  --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" \
  "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" \
  "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" \
  "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
```

```diff
- trainer.run_name=\"run_$(date +%Y%m%d%H)\"
+ trainer.run_name=\"$RUN_NAME\" trainer.project_name=\"gsm8k_fully_async_ci\"
```

```shell
uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py --run_name $RUN_NAME --project_name "gsm8k_fully_async_ci" --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" "generate/avg_num_tokens <= $AVG_NUM_TOKENS_MAX_VALUE" "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
```
This line is quite long and can be hard to read. For better readability and maintainability, you can break it into multiple lines using backslashes, like this:

```shell
uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py \
  --run_name $RUN_NAME \
  --project_name "gsm8k_fully_async_ci" \
  --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" \
  "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" \
  "generate/avg_num_tokens <= $AVG_NUM_TOKENS_MAX_VALUE" \
  "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
```

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
```shell
trainer.project_name=\"$PROJECT_NAME\" \
trainer.run_name=\"$RUN_NAME\"

uv run --isolated --extra fsdp $SCRIPT_DIR/get_summary.py --run_name $RUN_NAME --project_name $PROJECT_NAME --asserts "eval/all/avg_score >= $EVAL_ACC_MIN_VALUE" "loss/avg_final_rewards >= $TRAIN_ACC_MIN_VALUE" "generate/avg_num_tokens <= $NUM_TOKENS_MAX_VALUE" "policy/rollout_train_logprobs_abs_diff_mean <= $LOGPROBS_DIFF_MAX_VALUE"
```
I've always thought that our CI runs will just end after we hit the time limit. If so, will this get_summary script run at all?
The CI workflow will fail when it hits the timeout, yes.
The get_summary script will not run in that case, and that seems fine by design, because the GitHub workflow will already have failed.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
What does this PR do?
Adds regression thresholds for E2E CI runs.
Methodology
Currently, I've chosen the following metrics:
- Eval accuracy (`eval/all/avg_score`)
- Train accuracy (`loss/avg_final_rewards`)
- Average number of tokens generated (`generate/avg_num_tokens`)
- Policy rollout/train logprobs absolute difference (`policy/rollout_train_logprobs_abs_diff_mean`)

For each metric, I found the min/max values over the previous 10 CI runs and used a threshold with a 5% allowance. I found this to be reasonable for all the metrics chosen.
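The 5% allowance above can be computed mechanically: for metrics that must stay high, take the min over recent runs and subtract 5%; for metrics that must stay low, take the max and add 5%. A minimal sketch, where the helper names and the example values are illustrative rather than the actual CI history:

```python
def lower_threshold(history: list[float], allowance: float = 0.05) -> float:
    """Lower bound for metrics that must stay high (e.g. eval/all/avg_score)."""
    return min(history) * (1 - allowance)

def upper_threshold(history: list[float], allowance: float = 0.05) -> float:
    """Upper bound for metrics that must stay low (e.g. generate/avg_num_tokens)."""
    return max(history) * (1 + allowance)

# Illustrative values standing in for the last 10 CI runs, not real data.
eval_scores = [0.52, 0.55, 0.50, 0.53, 0.54, 0.51, 0.56, 0.52, 0.53, 0.55]
print(lower_threshold(eval_scores))  # min of the history scaled down by 5%
```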