Integrate CHS to GKE and Slurm A3U and A4 Daily Tests by simrankaurb · Pull Request #5335 · GoogleCloudPlatform/cluster-toolkit

simrankaurb · 2026-03-11T06:10:48Z

This pull request integrates the Cluster Health Scanner (CHS) into the existing daily integration test framework for GKE A3 Ultra and A4. The goal is to provide automated GPU diagnostics and network performance validation as part of the post-deployment test suite.

Key Features & Changes:

- Non-Blocking Execution: The CHS validation is informative rather than disruptive. It uses an Ansible rescue block to catch failures, so the main playbook continues even if the health checks do not pass.
- New Validation Playbook (test-chs.yml): Implements a full setup and test cycle, including:
  - Python environment configuration using pyenv.
  - Cloning and installing CHS into a dedicated virtual environment.
  - Running the CHS "lite" suite.
- Metric Thresholds:
  - DCGM Diagnostics: Asserts that all GPU diagnostic metrics return a "Pass" status.
  - NCCL Performance: Validates that the average bus bandwidth meets a minimum threshold of 100.0 GB/s.
- Cloud Build Integration: Modified the Cloud Build configurations to inject the CHS_REPO repository URL via Secret Manager and pass it to the integration test playbooks.
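
The non-blocking behavior described above could look roughly like the following sketch. It is an illustration, not the PR's actual code: the relative include path and the task names are assumptions, and the delegation to localhost mentioned elsewhere in this PR is left out for brevity.

```yaml
# Hypothetical sketch of a non-blocking CHS step in base-integration-test.yml.
# Task names and the include path are assumptions, not taken from the PR diff.
- name: Run CHS validation without failing the play
  block:
    - name: Include the CHS setup and test tasks
      ansible.builtin.include_tasks: test-validation/test-chs.yml
  rescue:
    # ansible_failed_result is set by Ansible inside a rescue section.
    - name: Report the CHS failure and continue
      ansible.builtin.debug:
        msg: "CHS validation failed: {{ ansible_failed_result.msg | default('no message available') }}"
```

Because the rescue section completes successfully, Ansible treats the block as recovered and moves on to the remaining tasks in the play.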

gemini-code-assist · 2026-03-11T06:11:14Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the Cluster Health Scanner (CHS) into the existing daily integration test framework. The primary goal is to enhance the robustness of GKE cluster validations by adding automated checks for GPU diagnostics and network performance. This involves a new Ansible playbook for CHS setup and execution, along with necessary updates to Cloud Build configurations to orchestrate these new health checks, ensuring the stability and performance of the tested environments.

Highlights

- Cluster Health Scanner (CHS) Integration: Integrated the Cluster Health Scanner into the daily integration test suite to perform automated health checks on GKE clusters.
- New Ansible Playbook for CHS: Added a dedicated Ansible playbook (test-chs.yml) responsible for setting up the CHS environment, installing the tool, running its test suite, and validating key metrics like DCGM diagnostics and NCCL bandwidth.
- Cloud Build Configuration Updates: Modified several Cloud Build configurations to invoke the new CHS tests and securely pass the CHS repository URL via Secret Manager.
- Build Check Temporarily Disabled: Commented out a check_for_running_build step in the gke-a4-onspot.yaml Cloud Build configuration.
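
As a rough illustration of the metric validation mentioned in the highlights, the sketch below asserts on two values that are assumed to have been extracted from the CHS Bill of Health JSON already. The variable names `chs_dcgm_statuses` and `chs_nccl_avg_busbw` are hypothetical, since the JSON layout is not shown on this page, and the threshold is kept in a variable because the value in the code may differ from the 100.0 GB/s quoted in the PR description.

```yaml
# Hypothetical sketch: variable names are assumptions, not the PR's actual code.
# chs_dcgm_statuses: list of status strings, one per DCGM diagnostic metric.
# chs_nccl_avg_busbw: average NCCL bus bandwidth in GB/s.
- name: Validate CHS metrics
  vars:
    nccl_min_busbw_gbps: 100.0  # value from the PR description; adjust as needed
  ansible.builtin.assert:
    that:
      # No DCGM metric may report anything other than "Pass".
      - chs_dcgm_statuses | reject('equalto', 'Pass') | list | length == 0
      # The average bus bandwidth must reach the minimum threshold.
      - chs_nccl_avg_busbw | float >= nccl_min_busbw_gbps | float
    fail_msg: "CHS metrics did not meet the DCGM or NCCL requirements"
    success_msg: "CHS metrics passed"
```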

Changelog

tools/cloud-build/daily-tests/ansible_playbooks/base-integration-test.yml
- Added a new task block to run CHS tests, delegating to localhost.
- Included a rescue block to gracefully handle and ignore CHS test failures, allowing the playbook to continue execution.
tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-chs.yml
- Created a new Ansible playbook to set up a Python environment using pyenv and install necessary dependencies.
- Implemented tasks to clone the CHS repository and install it into a dedicated Python virtual environment.
- Added steps to configure a bucket for CHS results and run the CHS lite suite with specified scheduler and machine type.
- Included logic to parse the CHS Bill of Health JSON output and validate specific metrics, such as DCGM diagnostic results and NCCL average bus bandwidth.
tools/cloud-build/daily-tests/builds/gke-a3-ultragpu-onspot.yaml
- Modified the ansible-playbook command to pass the chs_repo variable.
- Updated secretEnv and availableSecrets to include CHS_REPO, fetching its value from Secret Manager.
tools/cloud-build/daily-tests/builds/gke-a3-ultragpu.yaml
- Modified the ansible-playbook command to pass the chs_repo variable.
- Updated secretEnv and availableSecrets to include CHS_REPO, fetching its value from Secret Manager.
tools/cloud-build/daily-tests/builds/gke-a4-onspot.yaml
- Commented out the check_for_running_build step.
- Modified the ansible-playbook command to pass the chs_repo variable.
- Updated secretEnv and availableSecrets to include CHS_REPO, fetching its value from Secret Manager.
tools/cloud-build/daily-tests/builds/gke-a4.yaml
- Modified the ansible-playbook command to pass the chs_repo variable.
- Updated secretEnv and availableSecrets to include CHS_REPO, fetching its value from Secret Manager.
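
The secretEnv and availableSecrets changes listed above follow the usual Cloud Build pattern for Secret Manager. The sketch below shows that pattern under stated assumptions: the step id, builder image, and secret name `chs-repo` are hypothetical, and only the `CHS_REPO` environment variable, the `chs_repo` playbook variable, and the playbook path come from this PR.

```yaml
# Hypothetical sketch of a Cloud Build config that passes CHS_REPO to Ansible.
steps:
  - id: run-integration-test  # hypothetical step id
    name: us-docker.pkg.dev/$PROJECT_ID/hypothetical-repo/test-runner  # hypothetical image
    entrypoint: /bin/bash
    secretEnv: ['CHS_REPO']
    args:
      - -c
      - |
        set -eu
        # $$ defers expansion to the shell, where the secret is available.
        ansible-playbook tools/cloud-build/daily-tests/ansible_playbooks/base-integration-test.yml \
          --extra-vars "chs_repo=$$CHS_REPO"
availableSecrets:
  secretManager:
    # $PROJECT_ID is a built-in substitution, so no project ID is hardcoded.
    - versionName: projects/$PROJECT_ID/secrets/chs-repo/versions/latest
      env: 'CHS_REPO'
```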

Activity

No specific review comments or activity have been recorded for this pull request yet.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces integration tests for the Cluster Health Scanner (CHS). The changes involve adding a new Ansible playbook to set up and run CHS, and updating several Cloud Build configurations to trigger these tests. My review focuses on improving the new test playbook for better security, reproducibility, and maintainability by addressing hardcoded values and insecure practices. I've also identified some inconsistencies in the Cloud Build configurations and a potentially problematic commented-out step that should be addressed. Several comments were enhanced with references to existing repository rules to ensure consistency and best practices.

tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-chs.yml

tools/cloud-build/daily-tests/builds/gke-a4-onspot.yaml

tools/cloud-build/daily-tests/ansible_playbooks/base-integration-test.yml

tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-chs.yml

tools/cloud-build/daily-tests/builds/gke-a3-ultragpu-onspot.yaml

tools/cloud-build/daily-tests/builds/gke-a3-ultragpu.yaml

tools/cloud-build/daily-tests/builds/gke-a4.yaml

simrankaurb · 2026-03-17T09:55:24Z

/gemini review

gemini-code-assist

Code Review

This pull request integrates the Cluster Health Scanner (CHS) into the daily integration tests for GKE and Slurm clusters, which is a valuable addition for automated diagnostics. The implementation introduces a new Ansible playbook for CHS and updates the Cloud Build configurations accordingly.

My review has identified a few issues that need attention:

- A critical merge conflict in tools/cloud-build/daily-tests/tests/ml-a3-ultragpu-onspot-slurm.yml that must be resolved.
- Several build configuration files contain a hardcoded project ID, which should be replaced with the ${PROJECT_ID} variable for better portability.
- The new Ansible playbook test-chs.yml has some areas for improvement regarding security (insecure directory permissions), maintainability (brittle git refspec), and correctness (a mismatch in the NCCL performance threshold compared to the PR description).

Please address these points to ensure the stability, security, and maintainability of the new testing infrastructure.

tools/cloud-build/daily-tests/tests/ml-a3-ultragpu-onspot-slurm.yml

gemini-code-assist · 2026-03-17T09:57:36Z

tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-chs.yml

+      ansible.builtin.file:
+        path: "{{ ansible_async_dir }}"
+        state: directory
+        mode: '0777'


The ansible_async_dir is created with 0777 permissions, which is world-writable. This poses a security risk as it could allow other users on the system to tamper with the contents of this directory, potentially leading to a symlink attack. It's recommended to use more restrictive permissions.

mode: '0700'
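
Applied to the quoted task, the suggestion would give something like the sketch below. The task name is an assumption; the module arguments are the ones shown in the diff, with only the mode changed.

```yaml
# Sketch of the task with the suggested owner-only permissions.
# The task name is hypothetical.
- name: Ensure the Ansible async directory exists
  ansible.builtin.file:
    path: "{{ ansible_async_dir }}"
    state: directory
    mode: '0700'  # owner-only; replaces the world-writable '0777'
```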

gemini-code-assist · 2026-03-17T09:57:36Z

tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-chs.yml

+        refspec: 'refs/changes/00/2700/5'
+        version: FETCH_HEAD


The CHS repository is being cloned using a refspec pointing to a specific Gerrit change (refs/changes/00/2700/5). This is very brittle for a daily integration test. If the change is merged, abandoned, or rebased, this test will fail. For long-term stability, it's better to clone from a stable branch or a specific release tag.

version: main

simrankaurb · Mar 17, 2026

Will change the version before pushing.
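
One way to follow the suggestion is to drop the Gerrit refspec and check out a named branch or tag. The sketch below assumes two hypothetical variables, `chs_src_dir` and `chs_version`; only `chs_repo` appears in this PR.

```yaml
# Hypothetical sketch: clone CHS from a stable ref instead of a Gerrit change.
# chs_src_dir and chs_version are assumed variables, not part of the PR.
- name: Clone the CHS repository at a stable ref
  ansible.builtin.git:
    repo: "{{ chs_repo }}"
    dest: "{{ chs_src_dir }}"
    version: "{{ chs_version | default('main') }}"  # branch or release tag
    depth: 1  # a shallow clone is enough for a daily test
```

Keeping the ref in a variable lets a test run pin a release tag without editing the playbook.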

tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-chs.yml

tools/cloud-build/daily-tests/builds/gke-a3-ultragpu-onspot.yaml

tools/cloud-build/daily-tests/builds/ml-a3-ultragpu-onspot-slurm.yaml

tools/cloud-build/daily-tests/builds/ml-a4-highgpu-onspot-slurm.yaml

sarthakag · 2026-03-20T05:10:51Z

tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-chs.yml

+        pyenv_root: "/opt/pyenv"
+        chs_venv_path: "/opt/chs_venv"
+
+    - name: Install pyenv dependencies


Add a TODO here to move away from pyenv?

gemini-code-assist bot reviewed Mar 11, 2026

simrankaurb force-pushed the chs-integ branch 6 times, most recently from 6717e2e to ceca704 March 11, 2026 12:39

simrankaurb changed the title ~~Chs integ~~ Integrate CHS to GKE A3 and A4 Daily Tests Mar 11, 2026

simrankaurb added the release-improvements label (added to release notes under the "Improvements" heading) Mar 11, 2026

simrankaurb assigned simrankaurb and unassigned simrankaurb Mar 11, 2026

simrankaurb requested review from bytetwin and sarthakag March 11, 2026 14:15

simrankaurb changed the title ~~Integrate CHS to GKE A3 and A4 Daily Tests~~ Integrate CHS to GKE and Slurm A3U and A4 Daily Tests Mar 16, 2026

gemini-code-assist bot reviewed Mar 17, 2026

simrankaurb added 13 commits March 17, 2026 11:58

feat: Introduce Cluster Health Scanner (CHS) tests for NCCL and DCGM validation in the GKE A3 Ultra playbook.

c2e0b4d

fix: Delegate with include task

6de31b2

fix: Changing to develop branch for chs

0a09254

fix venv

e8c430b

fix: Maybe hopefully final

7d44b77

fix: python version

34de11c

Add a4

e77a8fd

fix: owner's permission

fcf8071

fix: cloudbuild and print logs on failure

bd9cbe8

installing ramble dependencies

cf768d8

Adding path in chs

6e1f5d3

Run lite suite instead of particular tests

e2eb8b8

Cleaning up code

61a6c7a

simrankaurb added 26 commits March 17, 2026 11:59

Using path in environment

a628bd9

Arch change

4e286ca

Minor fix

06da928

Fixing Gemini review comments

779ba44

Add tempfile for CHS

5fe822f

set -eu

c21a900

Fix: bucket name

050ee79

Refactoring

95e52bb

Removing extra space

072bdd3

Lint errors

2c12f4a

Delegating to localhost

8d0e310

Variables

97a9fa8

Adding timeout and slurm a3u and a4

34bb40f

Cloning develop branch to test

e76656b

fix: Ansible async error

8ec9429

fix: async dir

73a9264

Add chs repo for a3u and a4 slurm

66f38c0

Adding slurm cancel and changing host for slurm

bc01fa2

Ensure async directory

b20ced4

fix: Configure Enroot and print internal logs on failure(to be reversed in following commit)

b20c782

fix: set default Enroot paths

5bbb7ed

Cleaning up code

d0c06d2

Setting nccl bandwidth to be consistent with CTK

9789469

Gemini comments

7bc07ea

Pre-commits fix

2262b61

fixing 0777 error

f540345

simrankaurb force-pushed the chs-integ branch from 6731c51 to f540345 March 17, 2026 12:03

simrankaurb added 2 commits March 17, 2026 13:02

Network name a4

cfa4631

Increase timeout- slurm

e5d2176

sarthakag approved these changes Mar 20, 2026
