Extending Muon Optimizer Support for ZeRO Stage 3 #7919
PKUWZP merged 4 commits into deepspeedai:master
Conversation
Signed-off-by: PKUWZP <zhipeng.rainbowserie@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56451ea0fe
```python
grads_pad = [param.grad for param in params] + [torch.empty_like(params[-1].grad)] * (
    (world_sz - len(params) % world_sz) % world_sz)
gathered_momentums_pad = gathered_params_momentums + [torch.empty_like(gathered_params_momentums[-1])] * (
    (world_sz - len(gathered_params_momentums) % world_sz) % world_sz)
```
Use distinct pad tensors in Muon all_gather
When a Muon subgroup has fewer tensors than the data-parallel world size, these padding expressions reuse the same torch.empty_like(...) object for every padded slot. The subsequent dist.all_gather then receives overlapping output buffers in the final partial chunk, which can error or corrupt the gathered gradients/momentum on small models or whenever the last chunk is not full.
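The aliasing issue flagged here does not need torch to reproduce; a minimal sketch, where `FakeTensor` is a stand-in for a real tensor object:

```python
class FakeTensor:
    """Stand-in for torch.Tensor, used only to show object aliasing."""

world_sz, num_real = 4, 1
pad = (world_sz - num_real % world_sz) % world_sz  # 3 padded slots

# Buggy pattern from the diff: list multiplication repeats ONE object,
# so every padded all_gather output slot would share the same buffer.
shared = [FakeTensor()] * pad

# Fix: allocate a fresh tensor per padded slot with a comprehension.
distinct = [FakeTensor() for _ in range(pad)]

shared_ids = len({id(t) for t in shared})      # all slots alias one object
distinct_ids = len({id(t) for t in distinct})  # one object per slot
```

With real tensors, the `distinct` variant gives `dist.all_gather` non-overlapping output buffers in the final partial chunk.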
- Move save_muon_momentum_buffer_in_memory config to DeepSpeedZeroConfig in config.py instead of reading inline from the ds_config dict
- Fix index bug: change muon_momentum_buffer_partitioned_groups_flat from a list to a dict keyed by sub-group index, to avoid out-of-bounds access when non-Muon groups precede Muon groups
- Add a valid code path for a non-swappable (GPU/CPU) optimizer without save_muon_momentum_buffer_in_memory, replacing the ValueError
- Validate that all Muon parameter groups share the same momentum (beta)
- Parametrize tests for both True and False save_muon_momentum_buffer_in_memory
- Update docs to show the config under zero_optimization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: PKUWZP <zhipeng.rainbowserie@gmail.com>
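The list-to-dict fix in the second bullet can be illustrated with a toy sub-group layout; the names and layout here are hypothetical, not the PR's exact data structures:

```python
# Hypothetical sub-group layout where non-Muon groups come first.
sub_groups = ["adam", "adam", "muon", "muon"]

# With a plain list, appending buffers only for Muon sub-groups makes
# positional indexing drift relative to the sub-group index i.
# Keying the dict by sub-group index keeps lookups valid regardless
# of where the Muon groups sit.
muon_momentum_buffer_partitioned_groups_flat = {}
for i, kind in enumerate(sub_groups):
    if kind == "muon":
        muon_momentum_buffer_partitioned_groups_flat[i] = f"flat_buffer_{i}"

muon_indices = sorted(muon_momentum_buffer_partitioned_groups_flat)
```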
```python
continue

momentum_buffer = []
if self._swappable_optimizer_subgroup(i) and not self.save_muon_momentum_buffer_in_memory:
```
The condition looks correct, but consider simplifying like the following (should be equivalent):

```python
if self.save_muon_momentum_buffer_in_memory:
    ...
elif self._swappable_optimizer_subgroup(i):
    ...
else:
    ...
```
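The equivalence can be checked exhaustively over the two booleans. The branch bodies below are placeholders, and the implied elif ordering of the original is an assumption:

```python
from itertools import product

def original(save_in_memory, swappable):
    # Mirrors the diff's condition order (assumed branch structure).
    if swappable and not save_in_memory:
        return "swap path"
    elif save_in_memory:
        return "in-memory path"
    else:
        return "default path"

def simplified(save_in_memory, swappable):
    # The reviewer's suggested restructuring.
    if save_in_memory:
        return "in-memory path"
    elif swappable:
        return "swap path"
    else:
        return "default path"

# Exhaustive check over all four combinations of the two flags.
matches = all(
    original(s, w) == simplified(s, w)
    for s, w in product([False, True], repeat=2)
)
```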
Suggest turning on stage 3 for
I remember DeepSpeed allows separate learning rates for Muon and Adam (muon_lr and adam_lr); can we have a config in the UT to cover this usage?
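A sketch of such a test config, assuming the muon_lr/adam_lr key names mentioned above; the surrounding schema is an assumption, not a confirmed DeepSpeed API:

```python
# Hypothetical unit-test config exercising separate Muon/Adam
# learning rates together with the new stage-3 option. Key names
# muon_lr/adam_lr come from the comment above; everything else is
# an illustrative assumption.
ds_config = {
    "train_batch_size": 8,
    "optimizer": {
        "type": "Muon",
        "params": {
            "muon_lr": 0.02,   # rate for params Muon updates
            "adam_lr": 0.001,  # rate for the Adam fallback params
        },
    },
    "zero_optimization": {
        "stage": 3,
        "save_muon_momentum_buffer_in_memory": True,
    },
}
```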
delock left a comment:
LGTM, some test case coverage suggestions are added in the comments.
```json
"zero_optimization": {
    "stage": 3,
    "save_muon_momentum_buffer_in_memory": true
}
```
Does the Muon optimizer for stage 3 mandate reduce_scatter = false? Does "reduce_scatter": false need to be added to the example?
Authors: @pengdurice and @PKUWZP
Create a separate PR based on #7798 with the same functional diff on a clean signed-off branch to resolve DCO issues.
We aim to add Muon optimizer support to ZeRO stage 3 in this draft PR:

- The momentum buffers are managed together with self.fp32_partitioned_groups_flat; when device == NVME, we make sure that the momentum buffers can be swapped in and out along with the other components of the optimizer states.
- The momentum buffers are partitioned like self.fp32_partitioned_groups_flat to save memory footprint. So, before the Muon update, we need to perform all_gather on top of each data-parallel group rank. The Muon updates of the parameters are also divided across the data-parallel ranks, and the results are all-gathered once all updates are complete. After the all_gather, the momentum buffers are partitioned and flattened again.

Next steps:
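The gather → update → re-partition cycle described in the PR summary can be simulated in a single process with plain lists; this is a sketch standing in for dist.all_gather and the Muon math, not the PR's actual code:

```python
# Single-process simulation of the partitioned-momentum flow:
# momentum lives partitioned, is gathered before the update, each
# rank updates only its slice, and results are re-partitioned.
# All names and the plain momentum update are illustrative.

def partition(flat, world_sz):
    """Split a flat list into world_sz equal shards (assumes divisibility)."""
    n = len(flat) // world_sz
    return [flat[i * n:(i + 1) * n] for i in range(world_sz)]

def all_gather(shards):
    """Stand-in for dist.all_gather: every rank ends up with the full list."""
    return [x for shard in shards for x in shard]

world_sz = 2
momentum_shards = partition([0.0, 0.0, 0.0, 0.0], world_sz)  # partitioned state
grads = [1.0, 2.0, 3.0, 4.0]
beta = 0.9

# 1) gather the partitioned momentum before the Muon update
momentum = all_gather(momentum_shards)

# 2) each rank updates only its own slice (simulated sequentially here)
updated = []
for rank, m_slice in enumerate(partition(momentum, world_sz)):
    g_slice = partition(grads, world_sz)[rank]
    updated.append([beta * m + g for m, g in zip(m_slice, g_slice)])

# 3) all_gather the updated slices, then re-partition to save memory
momentum = all_gather(updated)
momentum_shards = partition(momentum, world_sz)
```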