Integrate CHS to GKE and Slurm A3U and A4 Daily Tests #5335

Draft
simrankaurb wants to merge 52 commits into GoogleCloudPlatform:develop from simrankaurb:chs-integ

@simrankaurb simrankaurb commented Mar 11, 2026

This pull request integrates the Cluster Health Scanner (CHS) into the existing daily integration test framework for GKE A3 Ultra and A4. The goal is to provide automated GPU diagnostics and network performance validation as part of the post-deployment test suite.

Key Features & Changes:

  • Non-Blocking Execution: The CHS validation is designed to be informative rather than disruptive. It utilizes an Ansible rescue block to catch failures, allowing the main playbook to continue execution even if health checks do not pass.
  • New Validation Playbook (test-chs.yml): Implements a full setup and test cycle, including:
    • Python environment configuration using pyenv.
    • Cloning and installing CHS into a dedicated virtual environment.
    • Running the CHS "lite" suite.
  • Metric Thresholds:
    • DCGM Diagnostics: Asserts that all GPU diagnostic metrics return a "Pass" status.
    • NCCL Performance: Validates that the average bus bandwidth meets a minimum threshold of 100.0 GB/s.
  • Cloud Build Integration: Modified Cloud Build configurations to inject the CHS_REPO repository URL via Secret Manager and pass it to the integration test playbooks.
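
As a rough illustration of the non-blocking pattern and the two thresholds listed above, the following is a minimal sketch and not the playbook from this PR. The include path, the variable names `chs_dcgm_results` and `chs_nccl_avg_busbw`, and the shape of the parsed results are assumptions made for the example; the 100.0 GB/s value is the one stated in the description.

```yaml
# Sketch only. The include path, variable names (chs_dcgm_results,
# chs_nccl_avg_busbw), and the result schema are assumptions.
- name: Run CHS validation without failing the play
  block:
    - name: Set up CHS and run the lite suite
      ansible.builtin.include_tasks: test-validation/test-chs.yml

    - name: Assert every DCGM diagnostic reports Pass
      ansible.builtin.assert:
        that:
          # Fails if any entry has a status other than "Pass".
          - chs_dcgm_results | rejectattr('status', 'equalto', 'Pass') | list | length == 0
        fail_msg: "At least one DCGM diagnostic did not report Pass"

    - name: Assert NCCL average bus bandwidth meets the minimum
      ansible.builtin.assert:
        that:
          - (chs_nccl_avg_busbw | float) >= 100.0
        fail_msg: "NCCL average bus bandwidth is below 100.0 GB/s"

  rescue:
    # Any failure inside the block lands here, so the play keeps going.
    - name: Report the CHS failure and let the playbook continue
      ansible.builtin.debug:
        msg: "CHS validation failed (non-blocking) in task '{{ ansible_failed_task.name }}'"
```

Because the `rescue` section only logs, a failed health check shows up in the build output without changing the result of the main integration test.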

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the Cluster Health Scanner (CHS) into the existing daily integration test framework. The primary goal is to enhance the robustness of GKE cluster validations by adding automated checks for GPU diagnostics and network performance. This involves a new Ansible playbook for CHS setup and execution, along with necessary updates to Cloud Build configurations to orchestrate these new health checks, ensuring the stability and performance of the tested environments.

Highlights

  • Cluster Health Scanner (CHS) Integration: Integrated the Cluster Health Scanner into the daily integration test suite to perform automated health checks on GKE clusters.
  • New Ansible Playbook for CHS: Added a dedicated Ansible playbook (test-chs.yml) responsible for setting up the CHS environment, installing the tool, running its test suite, and validating key metrics like DCGM diagnostics and NCCL bandwidth.
  • Cloud Build Configuration Updates: Modified several Cloud Build configurations to invoke the new CHS tests and securely pass the CHS repository URL via Secret Manager.
  • Build Check Temporarily Disabled: Commented out a check_for_running_build step in the gke-a4-onspot.yaml Cloud Build configuration.

Changelog

  • tools/cloud-build/daily-tests/ansible_playbooks/base-integration-test.yml
    • Added a new task block to run CHS tests, delegating to localhost.
    • Included a rescue block to gracefully handle and ignore CHS test failures, allowing the playbook to continue execution.
  • tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-chs.yml
    • Created a new Ansible playbook to set up a Python environment using pyenv and install necessary dependencies.
    • Implemented tasks to clone the CHS repository and install it into a dedicated Python virtual environment.
    • Added steps to configure a bucket for CHS results and run the CHS lite suite with specified scheduler and machine type.
    • Included logic to parse the CHS Bill of Health JSON output and validate specific metrics, such as DCGM diagnostic results and NCCL average bus bandwidth.
  • tools/cloud-build/daily-tests/builds/gke-a3-ultragpu-onspot.yaml
    • Modified the ansible-playbook command to pass the chs_repo variable.
    • Updated secretEnv and availableSecrets to include CHS_REPO, fetching its value from Secret Manager.
  • tools/cloud-build/daily-tests/builds/gke-a3-ultragpu.yaml
    • Modified the ansible-playbook command to pass the chs_repo variable.
    • Updated secretEnv and availableSecrets to include CHS_REPO, fetching its value from Secret Manager.
  • tools/cloud-build/daily-tests/builds/gke-a4-onspot.yaml
    • Commented out the check_for_running_build step.
    • Modified the ansible-playbook command to pass the chs_repo variable.
    • Updated secretEnv and availableSecrets to include CHS_REPO, fetching its value from Secret Manager.
  • tools/cloud-build/daily-tests/builds/gke-a4.yaml
    • Modified the ansible-playbook command to pass the chs_repo variable.
    • Updated secretEnv and availableSecrets to include CHS_REPO, fetching its value from Secret Manager.
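
For readers unfamiliar with how a Secret Manager value reaches the playbook, the changes above could look roughly like the sketch below. It is not copied from the PR: the step id, the builder image, the secret name `CHS_REPO` in Secret Manager, and the `latest` version are assumptions.

```yaml
# Sketch only. Step id, builder image, secret name, and secret version
# are placeholders, not values taken from this PR.
steps:
  - id: run-integration-test
    name: example-test-runner  # hypothetical builder image
    entrypoint: /bin/bash
    secretEnv: ["CHS_REPO"]
    args:
      - -c
      - |
        # "$$" defers expansion to the shell, where the secret is set as
        # an environment variable by Cloud Build.
        ansible-playbook \
          tools/cloud-build/daily-tests/ansible_playbooks/base-integration-test.yml \
          --extra-vars="chs_repo=$$CHS_REPO"

availableSecrets:
  secretManager:
    - versionName: projects/${PROJECT_ID}/secrets/CHS_REPO/versions/latest
      env: CHS_REPO
```

Using `${PROJECT_ID}` in `versionName` rather than a literal project ID keeps the build file portable across projects.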

Activity

  • No specific review comments or activity have been recorded for this pull request yet.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces integration tests for the Cluster Health Scanner (CHS). The changes involve adding a new Ansible playbook to set up and run CHS, and updating several Cloud Build configurations to trigger these tests. My review focuses on improving the new test playbook for better security, reproducibility, and maintainability by addressing hardcoded values and insecure practices. I've also identified some inconsistencies in the Cloud Build configurations and a potentially problematic commented-out step that should be addressed. Several comments were enhanced with references to existing repository rules to ensure consistency and best practices.

@simrankaurb simrankaurb force-pushed the chs-integ branch 6 times, most recently from 6717e2e to ceca704, on March 11, 2026 12:39
@simrankaurb simrankaurb changed the title from "Chs integ" to "Integrate CHS to GKE A3 and A4 Daily Tests" on Mar 11, 2026
@simrankaurb simrankaurb added the release-improvements label (Added to release notes under the "Improvements" heading.) on Mar 11, 2026
@simrankaurb simrankaurb changed the title from "Integrate CHS to GKE A3 and A4 Daily Tests" to "Integrate CHS to GKE and Slurm A3U and A4 Daily Tests" on Mar 16, 2026
@simrankaurb
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request integrates the Cluster Health Scanner (CHS) into the daily integration tests for GKE and Slurm clusters, which is a valuable addition for automated diagnostics. The implementation introduces a new Ansible playbook for CHS and updates the Cloud Build configurations accordingly.

My review has identified a few issues that need attention:

  • A critical merge conflict in tools/cloud-build/daily-tests/tests/ml-a3-ultragpu-onspot-slurm.yml that must be resolved.
  • Several build configuration files contain a hardcoded project ID, which should be replaced with the ${PROJECT_ID} variable for better portability.
  • The new Ansible playbook test-chs.yml has some areas for improvement regarding security (insecure directory permissions), maintainability (brittle git refspec), and correctness (a mismatch in the NCCL performance threshold compared to the PR description).

Please address these points to ensure the stability, security, and maintainability of the new testing infrastructure.

ansible.builtin.file:
  path: "{{ ansible_async_dir }}"
  state: directory
  mode: '0777'
Contributor

security-high high

The ansible_async_dir is created with 0777 permissions, which is world-writable. This poses a security risk as it could allow other users on the system to tamper with the contents of this directory, potentially leading to a symlink attack. It's recommended to use more restrictive permissions.

        mode: '0700'

Comment on lines +102 to +103
refspec: 'refs/changes/00/2700/5'
version: FETCH_HEAD
Contributor

high

The CHS repository is being cloned using a refspec pointing to a specific Gerrit change (refs/changes/00/2700/5). This is very brittle for a daily integration test. If the change is merged, abandoned, or rebased, this test will fail. For long-term stability, it's better to clone from a stable branch or a specific release tag.

        version: main
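
One way to make the checked-out ref explicit and easy to override is sketched below. This is an assumption-laden example, not the task from the PR: `chs_src_dir` and `chs_version` are hypothetical variables, and `main` is used as the default only because it is the ref suggested above.

```yaml
# Sketch only. chs_src_dir and chs_version are hypothetical variables.
- name: Clone CHS at a stable branch or release tag
  ansible.builtin.git:
    repo: "{{ chs_repo }}"
    dest: "{{ chs_src_dir }}"
    # Override chs_version with a release tag to pin the daily test.
    version: "{{ chs_version | default('main') }}"
    depth: 1
```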

Contributor Author

Will change the version before pushing

pyenv_root: "/opt/pyenv"
chs_venv_path: "/opt/chs_venv"

- name: Install pyenv dependencies
Contributor

Add a TODO here to move away from pyenv?
