Integrate CHS to GKE and Slurm A3U and A4 Daily Tests#5335
Integrate CHS to GKE and Slurm A3U and A4 Daily Tests#5335simrankaurb wants to merge 52 commits intoGoogleCloudPlatform:developfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces the Cluster Health Scanner (CHS) into the existing daily integration test framework. The primary goal is to enhance the robustness of GKE cluster validations by adding automated checks for GPU diagnostics and network performance. This involves a new Ansible playbook for CHS setup and execution, along with necessary updates to Cloud Build configurations to orchestrate these new health checks, ensuring the stability and performance of the tested environments. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces integration tests for the Cluster Health Scanner (CHS). The changes involve adding a new Ansible playbook to set up and run CHS, and updating several Cloud Build configurations to trigger these tests. My review focuses on improving the new test playbook for better security, reproducibility, and maintainability by addressing hardcoded values and insecure practices. I've also identified some inconsistencies in the Cloud Build configurations and a potentially problematic commented-out step that should be addressed. Several comments were enhanced with references to existing repository rules to ensure consistency and best practices.
tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-chs.yml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-chs.yml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/ansible_playbooks/base-integration-test.yml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-chs.yml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-chs.yml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/builds/gke-a3-ultragpu-onspot.yaml
Outdated
Show resolved
Hide resolved
6717e2e to
ceca704
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request integrates the Cluster Health Scanner (CHS) into the daily integration tests for GKE and Slurm clusters, which is a valuable addition for automated diagnostics. The implementation introduces a new Ansible playbook for CHS and updates the Cloud Build configurations accordingly.
My review has identified a few issues that need attention:
- A critical merge conflict in
tools/cloud-build/daily-tests/tests/ml-a3-ultragpu-onspot-slurm.ymlthat must be resolved. - Several build configuration files contain a hardcoded project ID, which should be replaced with the
${PROJECT_ID}variable for better portability. - The new Ansible playbook
test-chs.ymlhas some areas for improvement regarding security (insecure directory permissions), maintainability (brittle git refspec), and correctness (a mismatch in the NCCL performance threshold compared to the PR description).
Please address these points to ensure the stability, security, and maintainability of the new testing infrastructure.
tools/cloud-build/daily-tests/tests/ml-a3-ultragpu-onspot-slurm.yml
Outdated
Show resolved
Hide resolved
| ansible.builtin.file: | ||
| path: "{{ ansible_async_dir }}" | ||
| state: directory | ||
| mode: '0777' |
There was a problem hiding this comment.
The ansible_async_dir is created with 0777 permissions, which is world-writable. This poses a security risk as it could allow other users on the system to tamper with the contents of this directory, potentially leading to a symlink attack. It's recommended to use more restrictive permissions.
mode: '0700'| refspec: 'refs/changes/00/2700/5' | ||
| version: FETCH_HEAD |
There was a problem hiding this comment.
The CHS repository is being cloned using a refspec pointing to a specific Gerrit change (refs/changes/00/2700/5). This is very brittle for a daily integration test. If the change is merged, abandoned, or rebased, this test will fail. For long-term stability, it's better to clone from a stable branch or a specific release tag.
version: mainThere was a problem hiding this comment.
Will change the version before pushing
tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-chs.yml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/builds/gke-a3-ultragpu-onspot.yaml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/builds/ml-a3-ultragpu-onspot-slurm.yaml
Outdated
Show resolved
Hide resolved
tools/cloud-build/daily-tests/builds/ml-a4-highgpu-onspot-slurm.yaml
Outdated
Show resolved
Hide resolved
…validation in the GKE A3 Ultra playbook.
…ed in following commit)
| pyenv_root: "/opt/pyenv" | ||
| chs_venv_path: "/opt/chs_venv" | ||
|
|
||
| - name: Install pyenv dependencies |
There was a problem hiding this comment.
Add a todo here to move away from pyenv?
This pull request integrates CHS to the existing daily integration test framework for GKE A3 Ultra and A4. The goal is to provide automated GPU diagnostics and network performance validation as part of the post-deployment test suite.
Key Features & Changes:
test-chs.yml): Implements a full setup and test cycle, including:CHS_REPOrepository URL via Secret Manager and pass it to the integration test playbooks.